How to train the ADT Classifier to Perform Classification (SPMF documentation)

This example explains how to run the ADT algorithm using the SPMF open-source data mining library.

What is ADT?

The ADT algorithm is an algorithm for classification, proposed in the following paper:

K. Wang, S. Zhou, and Y. He, Growing decision trees on support-less association rules. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2000). New York, NY, USA: ACM, pp. 265-269.

The ADT algorithm takes as input a dataset that consists of a set of records described using attributes, which are assumed to be nominal attributes (strings). The goal of classification is to guess the missing value of an attribute called the target attribute based on the values of the other attributes. For example, consider data about customers of a bank. Each record (customer) may be described using various attributes such as age, gender, city, education and steal-money?. Consider that steal-money? is an attribute that indicates whether a customer has stolen some money or not (yes or no). The goal of classification can be to guess whether a customer will steal money (yes or no) given the values of the other attributes (age, gender, city and education) of that customer.

To do classification, the ADT algorithm first creates a model using training data (records where the target attribute value is known). A model (a classifier) is a set of rules called class association rules. After the model is created, it can be used to guess the missing value of the target attribute for a new record. For instance, using data about previous customers of the bank, it is possible to learn rules that can help to guess whether new customers will steal money or not. There exist many algorithms for classification in the data mining literature. Rule-based classification models such as ADT generally make good predictions, although other types of models may perform better. However, rule-based models have the advantage of being easily interpretable by humans.

This is a Java implementation of ADT. It was originally obtained under the GPL license from the LAC library of Padillo et al. (2020). The code of ADT from LAC was cleaned, adapted and integrated into SPMF.

A) How to train the ADT model to make a prediction

How to run this example?

What is the input?

The input is a dataset that contains a set of records described according to some attributes. For instance, in this example, we use a dataset called tennisExtended.txt, which is provided in the file "tennisExtended.txt" of the SPMF distribution. This dataset defines 7 attributes (outlook, temp, humid, wind, play, day and moon) and contains 19 instances (records).

What is the model created by ADT?

The ADT algorithm can be used to create a model for performing classification. The goal is to use that model to guess the missing value of a target attribute for a new record. For example, let's say that there is a new record where the value of the attribute "play" is unknown:

???? rainy mild high strong monday small

The goal of classification is to build a model that will be able to predict the value ???? as either yes or no.

To build a model, ADT takes as input a training dataset, the name of the target attribute (here, play), and two parameters called minConf and minMerit (described below).

The model created by ADT contains a set of rules. A rule is an implication of the form X ==> Y where Y is a target attribute value (e.g. yes or no) and X is a set of values from other attributes. For instance, a rule {mild,normal} ==> {yes} would indicate that if a record contains the values {mild} and {normal}, then it is likely to have the value yes for the attribute play. The ADT algorithm can also create some negative rules such as {NOT mild, normal} ==> {no}. This rule indicates that if {mild} does not appear and {normal} appears in a record, then the value of the attribute play is likely to be no.
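To make this concrete, here is a small self-contained sketch (plain Java, not SPMF's actual data structures, which may differ) of how a rule with positive and negated items can be represented and matched against a record:

```java
import java.util.*;

public class NegativeRuleSketch {

    // A class association rule with positive items (values that must appear in
    // a record) and negated items (values that must NOT appear), predicting a
    // value for the target attribute.
    static class Rule {
        final Set<String> positive;
        final Set<String> negative;
        final String target;

        Rule(Set<String> positive, Set<String> negative, String target) {
            this.positive = positive;
            this.negative = negative;
            this.target = target;
        }

        // The rule matches a record if all positive items appear in it and
        // none of the negated items do.
        boolean matches(Set<String> record) {
            return record.containsAll(positive)
                    && Collections.disjoint(record, negative);
        }
    }

    public static void main(String[] args) {
        // The negative rule {NOT mild, normal} ==> {no} from the text
        Rule rule = new Rule(Set.of("normal"), Set.of("mild"), "no");

        // This record contains "normal" and does not contain "mild",
        // so the rule matches and would predict "no"
        Set<String> record = Set.of("rainy", "cool", "normal", "strong");
        System.out.println(rule.matches(record)); // true
    }
}
```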

To select a good set of rules to build a model, ADT relies on a few measures called the support, confidence, merit and pessimistic error rate.

The support of a rule X ==> Y is the number of records of the dataset that contain the values of X and Y together, divided by the total number of records. For example, the support of the rule {mild,normal} ==> {yes} is 4/19 because the values {mild,normal,yes} appear together in 4 records and there is a total of 19 records.

The confidence of a rule X ==> Y is the number of records of the dataset that contain the values of X and Y together, divided by the number of records that contain the values of X. For example, the confidence of the rule {mild,normal} ==> {yes} is 4/4 because the values {mild,normal,yes} appear together in 4 records and 4 records contain {mild,normal}.
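These two definitions can be checked directly on the 19 records of tennisExtended.txt. The following is a plain-Java illustration of the computation (not SPMF code):

```java
import java.util.*;

public class RuleMeasures {

    // The 19 records of tennisExtended.txt (attribute values only)
    static final String[] DATA = {
        "no sunny hot high weak tuesday full",
        "no sunny hot high strong tuesday small",
        "yes overcast hot high weak tuesday full",
        "yes rainy mild high strong tuesday small",
        "yes rainy cool normal weak monday full",
        "no rainy cool normal strong monday small",
        "yes overcast cool normal strong monday full",
        "no sunny mild high weak monday small",
        "yes sunny cool normal weak monday full",
        "yes rainy mild normal weak monday small",
        "yes sunny mild normal strong friday full",
        "no rainy cool normal strong friday small",
        "yes overcast cool normal strong friday full",
        "no sunny mild high weak friday small",
        "yes sunny cool normal weak friday full",
        "yes overcast hot high weak friday small",
        "yes rainy mild high strong friday full",
        "yes rainy mild normal weak friday small",
        "yes sunny mild normal strong tuesday full"
    };

    // Each record is represented as the set of values it contains
    static List<Set<String>> records() {
        List<Set<String>> records = new ArrayList<>();
        for (String line : DATA)
            records.add(new HashSet<>(Arrays.asList(line.split(" "))));
        return records;
    }

    // Number of records containing all the given values
    static long count(List<Set<String>> records, Set<String> values) {
        return records.stream().filter(r -> r.containsAll(values)).count();
    }

    // support(X ==> Y) = |records containing X and Y| / |records|
    static double support(List<Set<String>> records, Set<String> x, String y) {
        Set<String> xy = new HashSet<>(x);
        xy.add(y);
        return (double) count(records, xy) / records.size();
    }

    // confidence(X ==> Y) = |records containing X and Y| / |records containing X|
    static double confidence(List<Set<String>> records, Set<String> x, String y) {
        Set<String> xy = new HashSet<>(x);
        xy.add(y);
        return (double) count(records, xy) / count(records, x);
    }

    public static void main(String[] args) {
        List<Set<String>> records = records();
        Set<String> x = new HashSet<>(Arrays.asList("mild", "normal"));
        System.out.println("support = " + support(records, x, "yes"));       // 4/19
        System.out.println("confidence = " + confidence(records, x, "yes")); // 4/4 = 1.0
    }
}
```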

For a description of the Merit and Error of a rule, please see the related research papers on this topic.

By applying the ADT algorithm, a model is generated containing a set of rules where each rule has a confidence that is no less than minConf and a merit that is no less than minMerit. In this example, the parameters are set to minMerit = 0.4 and minConf = 0.4, and the following rules are obtained:

overcast hot high weak ==> yes #SUP: 2 #CONF: 1.0 #MERIT: 1.0 #ERROR: 1.0
rainy mild high strong ==> yes #SUP: 2 #CONF: 1.0 #MERIT: 1.0 #ERROR: 1.0
sunny hot high tuesday ==> no #SUP: 2 #CONF: 1.0 #MERIT: 1.0 #ERROR: 1.0
sunny mild high weak small ==> no #SUP: 2 #CONF: 1.0 #MERIT: 1.0 #ERROR: 1.0
normal ==> yes #SUP: 9 #CONF: 0.8181818181818182 #MERIT: 0.8181818181818182 #ERROR: 3.575808964271214

where each line is a rule. The keywords #SUP:, #CONF:, #MERIT: and #ERROR: respectively indicate the support, confidence, merit and error of a rule.

Using the trained model, the ADT algorithm can make a prediction for the record:

???? rainy mild high strong monday small

The prediction is: yes
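A common strategy in rule-based classifiers is to use the first rule whose antecedent is contained in the record; the exact matching strategy used by ADT may differ, but the sketch below (plain Java, with the five rules listed above) shows how such a prediction can be obtained, and indeed yields yes for this record:

```java
import java.util.*;

public class PredictionSketch {

    // A rule: antecedent values ==> predicted value of the target attribute
    static class Rule {
        final Set<String> antecedent;
        final String target;

        Rule(Set<String> antecedent, String target) {
            this.antecedent = antecedent;
            this.target = target;
        }
    }

    // Return the prediction of the first rule whose antecedent is contained
    // in the record, or null when no rule matches
    static String predict(List<Rule> rules, Set<String> record) {
        for (Rule rule : rules)
            if (record.containsAll(rule.antecedent)) return rule.target;
        return null;
    }

    public static void main(String[] args) {
        // The five rules obtained in the example above
        List<Rule> rules = List.of(
                new Rule(Set.of("overcast", "hot", "high", "weak"), "yes"),
                new Rule(Set.of("rainy", "mild", "high", "strong"), "yes"),
                new Rule(Set.of("sunny", "hot", "high", "tuesday"), "no"),
                new Rule(Set.of("sunny", "mild", "high", "weak", "small"), "no"),
                new Rule(Set.of("normal"), "yes"));

        // The record "???? rainy mild high strong monday small" is covered
        // by the rule rainy mild high strong ==> yes
        Set<String> record = Set.of("rainy", "mild", "high", "strong", "monday", "small");
        System.out.println(predict(rules, record)); // yes
    }
}
```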

Optional feature: saving a model to a file using serialization to load it again into memory later

Training is generally very fast. But if you want to save a trained model to a file so that it can be loaded back into memory later, it is possible. Saving the model is done by uncommenting the following line of code in the example:

classifier.saveTrainedClassifierToFile("classifier.ser"); // Save the model to a file

Loading a saved model is done using the following line of code in the example:

classifier = Classifier.loadTrainedClassifierToFile("classifier.ser");
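The ".ser" file is produced by standard Java object serialization, which methods like saveTrainedClassifierToFile are presumably based on. The round trip can be sketched with plain Java streams, here using a simple list of rule strings as a stand-in for a real trained classifier:

```java
import java.io.*;
import java.util.*;

public class SerializationDemo {

    // Save any serializable object to a file
    static void save(File file, Serializable object) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Load the object back from the file
    static Object load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    // Save an object to a temporary file, load it back, and check equality
    static boolean roundTrips(Serializable object) {
        try {
            File file = File.createTempFile("classifier", ".ser");
            save(file, object);
            Object loaded = load(file);
            file.delete();
            return loaded.equals(object);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a trained classifier: a serializable list of rule strings
        ArrayList<String> rules = new ArrayList<>(List.of(
                "normal ==> yes", "rainy mild high strong ==> yes"));
        System.out.println(roundTrips(rules)); // true
    }
}
```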

Optional feature: saving the model as a set of rules into a text file (for the purpose of analysis)

If you want to see the rules of the trained model, you may save them to a text file by uncommenting the following code in the example:

String rulesPath = "rulesPath.txt";
((RuleClassifier)classifier).writeRulesToFileSPMFFormatAsStrings(rulesPath, dataset);

where rulesPath.txt is the output file path and dataset is the training dataset.

Note that rules saved in this format cannot be loaded back into memory. If you want to save a model and load it back into memory, you should save it using serialization (see above) instead.

Input format (default)

A few input file formats are supported by this algorithm. The first one is a text file such as the dataset "tennisExtended.txt" used in this example:

play outlook temp humid wind day moon
no sunny hot high weak tuesday full
no sunny hot high strong tuesday small
yes overcast hot high weak tuesday full
yes rainy mild high strong tuesday small
yes rainy cool normal weak monday full
no rainy cool normal strong monday small
yes overcast cool normal strong monday full
no sunny mild high weak monday small
yes sunny cool normal weak monday full
yes rainy mild normal weak monday small
yes sunny mild normal strong friday full
no rainy cool normal strong friday small
yes overcast cool normal strong friday full
no sunny mild high weak friday small
yes sunny cool normal weak friday full
yes overcast hot high weak friday small
yes rainy mild high strong friday full
yes rainy mild normal weak friday small
yes sunny mild normal strong tuesday full

The first line indicates the names of the attributes, each separated by a space. Then each of the following lines is a record where attribute values are separated by spaces. There are 19 records.
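Reading this format amounts to splitting each line on spaces, as in the following illustrative sketch (plain Java; the real SPMF loader class may differ):

```java
import java.util.*;

public class SpaceFormatReader {

    // A parsed dataset: attribute names plus one String[] per record
    static class Dataset {
        final String[] attributes;
        final List<String[]> records;

        Dataset(String[] attributes, List<String[]> records) {
            this.attributes = attributes;
            this.records = records;
        }
    }

    // First line: attribute names separated by spaces;
    // each following line: one record, values separated by spaces
    static Dataset parse(List<String> lines) {
        String[] attributes = lines.get(0).split(" ");
        List<String[]> records = new ArrayList<>();
        for (String line : lines.subList(1, lines.size()))
            records.add(line.split(" "));
        return new Dataset(attributes, records);
    }

    public static void main(String[] args) {
        Dataset d = parse(List.of(
                "play outlook temp humid wind day moon",
                "no sunny hot high weak tuesday full",
                "yes overcast hot high weak tuesday full"));
        System.out.println(d.attributes.length + " attributes, "
                + d.records.size() + " records");
    }
}
```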

Alternative input format (ARFF)

It is also possible to use a dataset encoded in the ARFF format as an alternative to the default dataset format. The specification of the ARFF format can be found on the Weka website. Most features of the ARFF format are supported, except that all attribute values are treated as nominal values. This is due to the design of ADT, which is only defined for handling nominal values. If numerical values appear in the data, they will be treated as nominal values (strings). To load a file in the ARFF format, the following lines of code can be used in the example:

String datasetPath = fileToPath("weather-train.arff");
ARFFDataset dataset = new ARFFDataset(datasetPath, targetClassName);

Using these lines, the dataset weather-train.arff will be used, which contains the following content:

@relation weather.tennis

@attribute outlook {sunny,overcast,rainy}
@attribute temperature {hot,mild,cool}
@attribute humidity {high,normal}
@attribute windy {strong,weak}
@attribute play {yes,no}

@data
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rainy,mild,high,weak,yes
rainy,cool,normal,weak,yes
rainy,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rainy,mild,normal,weak,yes
sunny,mild,normal,strong,yes

This dataset defines 5 attributes and 11 records (note that it is slightly different from the file tennisExtended.txt in the above example).
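A minimal sketch of how such an ARFF file can be read when every attribute is treated as nominal is shown below (this is an illustration in plain Java, not the real ARFFDataset class, which supports more of the format):

```java
import java.util.*;

public class ArffReaderSketch {

    static List<String> attributes = new ArrayList<>();
    static List<String[]> records = new ArrayList<>();

    static void parse(List<String> lines) {
        attributes.clear();
        records.clear();
        boolean inData = false;
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and % comments
            if (trimmed.isEmpty() || trimmed.startsWith("%")) continue;
            String lower = trimmed.toLowerCase();
            if (lower.startsWith("@attribute")) {
                // Keep only the attribute name; the declared values are
                // ignored since everything is treated as nominal
                attributes.add(trimmed.split("\\s+")[1]);
            } else if (lower.startsWith("@data")) {
                inData = true;
            } else if (inData) {
                // Data lines: comma-separated nominal values
                records.add(trimmed.split(","));
            }
            // Other headers such as @relation are simply skipped
        }
    }

    public static void main(String[] args) {
        parse(List.of(
                "@relation weather.tennis",
                "@attribute outlook {sunny,overcast,rainy}",
                "@attribute play {yes,no}",
                "@data",
                "sunny,no",
                "overcast,yes"));
        System.out.println(attributes.size() + " attributes, "
                + records.size() + " records");
    }
}
```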

Alternative input format (CSV)

It is also possible to load a dataset encoded in the CSV format as an alternative to the default format and the ARFF format. A CSV file is a text file where attribute names and attribute values are separated by commas. To load a file encoded in the CSV format, the following lines of code can be used in the example:

String datasetPath = fileToPath("tennisExtendedCSV.txt");
CSVDataset dataset = new CSVDataset(datasetPath, targetClassName);

Using these lines, the dataset tennisExtendedCSV.txt can be read, which contains the following content:

play,outlook,temp,humid,wind,day,moon
no,sunny,hot,high,weak,tuesday,full
no,sunny,hot,high,strong,tuesday,small
yes,overcast,hot,high,weak,tuesday,full
yes,rainy,mild,high,strong,tuesday,small
yes,rainy,cool,normal,weak,monday,full
no,rainy,cool,normal,strong,monday,small
yes,overcast,cool,normal,strong,monday,full
no,sunny,mild,high,weak,monday,small
yes,sunny,cool,normal,weak,monday,full
yes,rainy,mild,normal,weak,monday,small
yes,sunny,mild,normal,strong,friday,full
no,rainy,cool,normal,strong,friday,small
yes,overcast,cool,normal,strong,friday,full
no,sunny,mild,high,weak,friday,small
yes,sunny,cool,normal,weak,friday,full
yes,overcast,hot,high,weak,friday,small
yes,rainy,mild,high,strong,friday,full
yes,rainy,mild,normal,weak,friday,small
yes,sunny,mild,normal,strong,tuesday,full

The first line indicates the names of the attributes, each separated by a comma. Then each of the following lines is a record where attribute values are separated by commas. There are 19 records.

B) How to run batch experiments to test the ADT model for classification

In the SPMF library there is some code to automatically run experiments with ADT on a dataset.

How to run this example?

What is this example about?

In this example, the dataset tennisExtended.txt from the previous example is read into memory. It is then split into two parts: a training dataset (the first 50% of the records) and a testing dataset (the last 50% of the records).
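The split itself is simple to picture; the following generic helper is a sketch of it (how SPMF rounds an odd number of records between the two halves is an assumption here):

```java
import java.util.*;

public class HoldoutSplit {

    // Split records into a training part (roughly the first trainRatio of the
    // records) and a testing part (the rest)
    static <T> List<List<T>> split(List<T> records, double trainRatio) {
        int cut = (int) Math.round(records.size() * trainRatio);
        return List.of(
                new ArrayList<>(records.subList(0, cut)),
                new ArrayList<>(records.subList(cut, records.size())));
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 19; i++) records.add(i); // 19 records, as in tennisExtended
        List<List<Integer>> parts = split(records, 0.5);
        System.out.println(parts.get(0).size() + " training records, "
                + parts.get(1).size() + " testing records");
    }
}
```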

The ADT algorithm is then applied to build a model using the training dataset for the target attribute play.

Then, the model is applied to guess the values of the attribute play for all records in the testing dataset. Statistics are then calculated in terms of various measures for evaluating a classifier, and the results are presented in the console. The results have the following format (actual values may differ slightly):

=== MODEL TRAINING RESULTS ===
#NAME: ADT
#RULECOUNT: 1 the number of rules in the model
#TIMEms: 19 the time for training the model (ms)
#MEMORYmb: 1.9 the memory used for training the model (MB)

==== CLASSIFICATION RESULTS ON TRAINING DATA =====
#NAME: ADT
#ACCURACY: 0.5556 accuracy of the model on training data
#RECALL: 0.5 recall of the model on training data
#PRECISION: 0.2778 precision of the model on training data
#KAPPA: 0 Kappa Score of the model on the training data
#FMICRO: 0.5556 The F1 measure (micro) of the model on the training data
#FMACRO: 0.3571 The F1 measure (macro) of the model on the training data
#TIMEms: 0 the time for making predictions using the training data (ms)
#MEMORYmb: 1.9 the memory usage for making predictions using the training data (MB)
#NOPREDICTION: 0.0 the percentage of records for which no prediction was made for the training data

==== CLASSIFICATION RESULTS ON TESTING DATA =====
#NAME: ADT
#ACCURACY: 0.8 accuracy of the model on the testing data
#RECALL: 0.5 recall of the model on the testing data
#PRECISION: 0.4 precision of the model on the testing data
#KAPPA: 0 Kappa Score of the model on the testing data
#FMICRO: 0.8 The F1 measure (micro) of the model on the testing data
#FMACRO: 0.44 The F1 measure (macro) of the model on the testing data
#TIMEms: 0 the time for making predictions using the testing data (ms)
#MEMORYmb: 1.9 the memory usage for making predictions using the testing data (MB)
#NOPREDICTION: 0.0 the percentage of records for which no prediction was made for the testing data

These measures are commonly used for evaluating classification models (classifiers).
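As a rough illustration of two of them (not SPMF's own evaluation code), accuracy and macro-averaged recall can be computed from the lists of actual and predicted class values as follows:

```java
import java.util.*;

public class EvalMeasures {

    // Accuracy: fraction of records whose predicted value equals the actual one
    static double accuracy(List<String> actual, List<String> predicted) {
        int correct = 0;
        for (int i = 0; i < actual.size(); i++)
            if (actual.get(i).equals(predicted.get(i))) correct++;
        return (double) correct / actual.size();
    }

    // Macro-averaged recall: the recall of each class, averaged over classes
    static double macroRecall(List<String> actual, List<String> predicted) {
        Set<String> classes = new TreeSet<>(actual);
        double sum = 0;
        for (String c : classes) {
            int truePositives = 0, actualCount = 0;
            for (int i = 0; i < actual.size(); i++) {
                if (actual.get(i).equals(c)) {
                    actualCount++;
                    if (predicted.get(i).equals(c)) truePositives++;
                }
            }
            sum += (double) truePositives / actualCount;
        }
        return sum / classes.size();
    }

    public static void main(String[] args) {
        List<String> actual    = List.of("yes", "yes", "no", "no");
        List<String> predicted = List.of("yes", "yes", "yes", "no");
        System.out.println(accuracy(actual, predicted));    // 3/4 = 0.75
        System.out.println(macroRecall(actual, predicted)); // (1.0 + 0.5) / 2 = 0.75
    }
}
```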

Optional feature: Using K-Fold Cross validation

The above example has shown how to split a dataset into two parts (training and testing) to evaluate a classifier. This approach, called holdout, is useful. However, a problem is that only part of the data is used for training (e.g. 50%) and only part of the data (e.g. 50%) is used for testing. To be able to use all the data for training and all the data for testing, there is an alternative way of testing a classifier, called k-fold cross-validation. To use k-fold cross-validation, the user must set a parameter k (a positive integer) indicating the number of folds (parts). For example, let's say that a dataset has 100 records and that k = 5. Then the dataset will be divided into 5 parts, each containing 20 records. To evaluate the classifier, five experiments will be done: in each experiment, a different fold is used as the testing data, while the remaining four folds are used as the training data.

Then, the average of the results over the five experiments is presented to the user.
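The fold construction and the resulting train/test splits can be sketched as follows (plain Java, for illustration; SPMF's own fold assignment may differ in details such as shuffling):

```java
import java.util.*;

public class KFoldSketch {

    // Assign the records to k folds of (nearly) equal size, round-robin
    static <T> List<List<T>> makeFolds(List<T> records, int k) {
        List<List<T>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) folds.add(new ArrayList<>());
        for (int i = 0; i < records.size(); i++)
            folds.get(i % k).add(records.get(i));
        return folds;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 100; i++) records.add(i); // 100 records, k = 5
        List<List<Integer>> folds = makeFolds(records, 5);

        // Experiment i: fold i is the testing data, the other folds are training data
        for (int i = 0; i < folds.size(); i++) {
            List<Integer> test = folds.get(i);
            List<Integer> train = new ArrayList<>();
            for (int j = 0; j < folds.size(); j++)
                if (j != i) train.addAll(folds.get(j));
            System.out.println("Experiment " + i + ": train=" + train.size()
                    + " test=" + test.size()); // train=80 test=20
        }
    }
}
```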

To try k-fold cross-validation instead of holdout, you may run the example "MainTestADT_batch_kfold.java" in the package ca.pfv.spmf.test of the SPMF distribution.