Run experiments to compare classifiers such as ID3, CMAR, ACCF, CBA and CBA2 (SPMF documentation)
This example explains how to compare multiple classifiers using the SPMF open-source data mining library.
What is classification?
Many algorithms exist for classification. These algorithms take as input a dataset that consists of a set of records, each described using attributes. The goal of classification is to guess the missing value of an attribute called the target attribute, based on the values of the other attributes. For example, consider data about customers of a bank. Each record (customer) may be described using various attributes such as age, gender, city, education and steal-money?. Suppose that steal-money? is an attribute that indicates whether a customer has stolen some money or not (yes or no). The goal of classification can then be to guess whether a customer will steal money (yes or no) given the values of the other attributes (age, gender, city and education) of that customer.
To do classification, a classification algorithm first creates a model using training data (records where the target attribute value is known). A model (a classifier) can take various forms such as a set of rules or a tree structure. After the model is created, it can be used to guess the missing value of the target attribute for a new record. For instance, using data about previous customers of the bank, it is possible to learn rules that can help to guess whether new customers will steal money or not. Many algorithms for classification have been proposed in the data mining literature.
In SPMF, multiple algorithms are implemented for classification. While there are several examples in the documentation of SPMF to explain each algorithm, this example explains how to perform experiments to compare multiple classifiers on the same dataset.
How to run this example?
- This example is not available in the graphical user interface of SPMF.
- If you are using the source code version of SPMF, launch the file "MainTestEVALUATOR_Holdout.java" in the package ca.pfv.spmf.test.
What is the input of classification?
The input is a dataset that contains a set of records described according to some attributes. For instance, this example uses a dataset called tennisExtended.txt, which is provided in the file "tennisExtended.txt" of the SPMF distribution. This dataset defines 7 attributes (outlook, temp, humid, wind, play, day and moon) and contains 19 instances (records).
play | outlook | temp | humid | wind | day | moon |
no | sunny | hot | high | weak | tuesday | full |
no | sunny | hot | high | strong | tuesday | small |
yes | overcast | hot | high | weak | tuesday | full |
yes | rainy | mild | high | strong | tuesday | small |
yes | rainy | cool | normal | weak | monday | full |
no | rainy | cool | normal | strong | monday | small |
yes | overcast | cool | normal | strong | monday | full |
no | sunny | mild | high | weak | monday | small |
yes | sunny | cool | normal | weak | monday | full |
yes | rainy | mild | normal | weak | monday | small |
yes | sunny | mild | normal | strong | friday | full |
no | rainy | cool | normal | strong | friday | small |
yes | overcast | cool | normal | strong | friday | full |
no | sunny | mild | high | weak | friday | small |
yes | sunny | cool | normal | weak | friday | full |
yes | overcast | hot | high | weak | friday | small |
yes | rainy | mild | high | strong | friday | full |
yes | rainy | mild | normal | weak | friday | small |
yes | sunny | mild | normal | strong | tuesday | full |
What is the output of classification?
Various algorithms can be used to create a model for performing classification. The goal is to use each model to guess the missing value of a target attribute for a new record. For example, let's say that there is a new record where the value of the attribute "play" is unknown:
???? | rainy | mild | high | strong | monday | small |
The goal of classification is to build a model that will be able to predict the value ???? as either yes or no.
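To make the idea of "building a model, then predicting" concrete, here is a minimal sketch of a toy rule learner. It is not one of the algorithms offered in SPMF: the class name OneAttributeRules and its method are hypothetical, and the learner only looks at a single attribute (outlook), keeping for each of its values the most frequent value of play. The training data are the (play, outlook) pairs of the first 9 records of the table above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy classifier for illustration; NOT an SPMF algorithm.
class OneAttributeRules {

    /**
     * Learns one rule per attribute value:
     * attribute value -> most frequent class among the training records.
     * Each record is {class, attributeValue}.
     */
    static Map<String, String> train(String[][] records) {
        // counts.get(value).get(cls) = number of records with that pair
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] r : records) {
            counts.computeIfAbsent(r[1], v -> new HashMap<>())
                  .merge(r[0], 1, Integer::sum);
        }
        Map<String, String> rules = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                if (c.getValue() > bestCount) {
                    best = c.getKey();
                    bestCount = c.getValue();
                }
            }
            rules.put(e.getKey(), best);
        }
        return rules;
    }

    public static void main(String[] args) {
        // (play, outlook) for the first 9 records of tennisExtended.txt
        String[][] training = {
            {"no", "sunny"}, {"no", "sunny"}, {"yes", "overcast"},
            {"yes", "rainy"}, {"yes", "rainy"}, {"no", "rainy"},
            {"yes", "overcast"}, {"no", "sunny"}, {"yes", "sunny"}
        };
        Map<String, String> rules = train(training);
        // Guess "play" for the new record, whose outlook is "rainy"
        System.out.println("play = " + rules.get("rainy")); // prints "play = yes"
    }
}
```

Among the 9 training records, "rainy" occurs with yes twice and with no once, so this toy model guesses yes for the new record. The classifiers compared in this example build richer models that use all attributes.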
In this example, multiple classification models are compared (CBA, CBA2, ID3, CMAR, L3, etc.). The dataset tennisExtended.txt described above is read into memory. It is then split into two parts: a training dataset (the first 50% of the records) and a testing dataset (the last 50% of the records). Each algorithm is then applied to build a model using the training dataset for the target attribute play. Then, the model is applied to guess the values of the attribute play for all records in the testing dataset. Statistics are then calculated in terms of various measures for evaluating a classifier, and the results are presented in the console:
=== MODEL TRAINING RESULTS ===
#NAME: ACAC ACCF ACN ADT CBA CBA2 MAC L3 CMAR ID3
#RULECOUNT: 22 55 166 6 5 5 6 460 12 0 the number of rules in each model
#TIMEms: 9 8 60 19 23 16 12 12 6 2 the time for training each model (ms)
#MEMORYmb: 1.9796 2 9.482 9.962 10.482 10.9621 11.9621 12.482 12.962 12.962 the memory used for training each model (MB)
==== CLASSIFICATION RESULTS ON TRAINING DATA =====
#NAME: ACAC ACCF ACN ADT CBA CBA2 MAC L3 CMAR ID3
#ACCURACY: 0.4444 1 1 1 1 1 1 1 0.8889 1 accuracy of each model on training data
#RECALL: 0.475 1 1 1 1 1 1 1 0.9 1 recall of each model on training data
#PRECISION: 0.4643 1 1 1 1 1 1 1 0.9 1 precision of each model on training data
#KAPPA: -0.0465 1 1 1 1 1 1 1 0.7805 1 Kappa Score of each model on the training data
#FMICRO: 0.4444 1 1 1 1 1 1 1 0.8889 1 The F1 measure (micro) of each model on the training data
#FMACRO: 0.4156 1 1 1 1 1 1 1 0.8889 1 The F1 measure (macro) of each model on the training data
#TIMEms: 4 0 0 0 0 0 2 0 0 0 the time for making predictions using the training data (ms)
#MEMORYmb: 1.9796 2 9.482 9.962 10.482 10.9621 11.9621 12.482 12.962 12.962 the memory usage for making predictions using the training data (MB)
#NOPREDICTION: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 the percentage of records for which no prediction was made for the training data
==== CLASSIFICATION RESULTS ON TESTING DATA =====
#NAME: ACAC ACCF ACN ADT CBA CBA2 MAC L3 CMAR ID3
#ACCURACY: 0.1 0.7 0.9 0 0.8 0.8 0.8 0.9 0.4 0.7 accuracy of the model on the testing data
#RECALL: 0.0625 0.8125 0.9375 0 0.6875 0.6875 0.6875 0.9375 0.625 0.625 recall of the model on the testing data
#PRECISION: 0.1667 0.7 0.8333 0 0.6875 0.6875 0.6875 0.8333 0.625 0.75 precision of the model on the testing data
#KAPPA: -0.4516 0.4 0.7368 0 0.375 0.375 0.375 0.7368 0.1176 0.375 Kappa Score of the model on the testing data
#FMICRO: 0.1 0.7 0.9 NaN 0.8 0.8 0.8 0.9 0.4 0.7778 The F1 measure (micro) of the model on the testing data
#FMACRO: 0.0909 0.6703 0.8667 0 0.6875 0.6875 0.6875 0.8667 0.4 0.6786 The F1 measure (macro) of the model on the testing data
#TIMEms: 1 0 0 0 0 0 0 0 0 1 the time for making predictions using the testing data (ms)
#MEMORYmb: 1.9796 2 9.482 9.962 10.482 10.9621 11.9621 12.482 12.962 12.962 the memory usage for making predictions using the testing data (MB)
#NOPREDICTION: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 the percentage of records for which no prediction was made for the testing data
Note that the results output by SPMF are separated by tabs. It is thus easy to copy and paste these results into a spreadsheet such as Excel.
These measures are commonly used for evaluating classification models (classifiers).
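As an illustration of how such measures can be computed for a two-class problem, the sketch below derives accuracy, macro-averaged recall and precision, and Cohen's kappa from a confusion matrix. This is not SPMF's code, and the class name BinaryMetrics is hypothetical. The matrix is an assumption: it describes 5 actual "yes" records and 4 actual "no" records (like the first 9 records of tennisExtended.txt) with a single "yes" record misclassified as "no". Whether SPMF averages precision and recall in exactly this way is not stated here.

```java
import java.util.Locale;

// Illustrative only; the confusion matrix in main() is assumed, not SPMF output.
class BinaryMetrics {

    // m[actual][predicted], with class order: 0 = yes, 1 = no
    static int total(int[][] m) {
        return m[0][0] + m[0][1] + m[1][0] + m[1][1];
    }

    /** Fraction of records whose predicted class equals the actual class. */
    static double accuracy(int[][] m) {
        return (double) (m[0][0] + m[1][1]) / total(m);
    }

    /** Mean over classes of: correct predictions / actual records of the class. */
    static double macroRecall(int[][] m) {
        double r0 = (double) m[0][0] / (m[0][0] + m[0][1]);
        double r1 = (double) m[1][1] / (m[1][0] + m[1][1]);
        return (r0 + r1) / 2;
    }

    /** Mean over classes of: correct predictions / records predicted as the class. */
    static double macroPrecision(int[][] m) {
        double p0 = (double) m[0][0] / (m[0][0] + m[1][0]);
        double p1 = (double) m[1][1] / (m[0][1] + m[1][1]);
        return (p0 + p1) / 2;
    }

    /** Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement). */
    static double kappa(int[][] m) {
        double n = total(m);
        double chance = 0;
        for (int c = 0; c < 2; c++) {
            double actual = m[c][0] + m[c][1];     // row sum
            double predicted = m[0][c] + m[1][c];  // column sum
            chance += (actual / n) * (predicted / n);
        }
        return (accuracy(m) - chance) / (1 - chance);
    }

    public static void main(String[] args) {
        int[][] m = { {4, 1}, {0, 4} }; // assumed confusion matrix
        System.out.println(String.format(Locale.ROOT,
            "accuracy=%.4f recall=%.4f precision=%.4f kappa=%.4f",
            accuracy(m), macroRecall(m), macroPrecision(m), kappa(m)));
        // prints: accuracy=0.8889 recall=0.9000 precision=0.9000 kappa=0.7805
    }
}
```

These four values agree with the CMAR column of the results on training data shown above, which suggests (but does not prove) that the reported recall and precision are macro-averaged over the two classes.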
Optional feature: Using K-Fold Cross validation
The above example has shown how to split a dataset into two parts (training and testing) to compare multiple classifiers. This approach, called holdout, is useful. However, a problem is that only part of the data (e.g. 50%) is used for training and only part of the data (e.g. 50%) is used for testing. To be able to use all the data for training and all the data for testing, there is an alternative way of comparing classifiers, which is called k-fold cross-validation. To use k-fold cross-validation, the user must set a parameter k (a positive integer) indicating the number of folds (parts). For example, let's say that a dataset has 100 records and that k = 5. Then the dataset will be divided into 5 parts, each containing 20 records. Then, to evaluate a classifier, five experiments will be done:
- Records 1 to 20 are used for testing, and the other 80 records are used for training
- Records 21 to 40 are used for testing, and the other 80 records are used for training
- Records 41 to 60 are used for testing, and the other 80 records are used for training
- Records 61 to 80 are used for testing, and the other 80 records are used for training
- Records 81 to 100 are used for testing, and the other 80 records are used for training
Then, the average of the results of the five experiments is presented to the user.
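The record ranges of the parts listed above can be computed as in the following minimal sketch. The class name KFoldRanges is hypothetical and this is not SPMF's implementation. In particular, giving one extra record to the first folds when the number of records is not a multiple of k is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of SPMF.
class KFoldRanges {

    /**
     * Splits records 1..n into k contiguous folds.
     * Returns k pairs {first, last}, 1-based and inclusive.
     */
    static List<int[]> folds(int n, int k) {
        if (k <= 0 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and n");
        }
        List<int[]> result = new ArrayList<>();
        int start = 1;
        for (int i = 0; i < k; i++) {
            // Assumption: the first (n % k) folds receive one extra record
            int size = n / k + (i < n % k ? 1 : 0);
            result.add(new int[] { start, start + size - 1 });
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        // 100 records and k = 5, as in the example above
        for (int[] fold : folds(100, 5)) {
            System.out.println("Records " + fold[0] + " to " + fold[1]);
        }
    }
}
```

With 100 records and k = 5, this prints the five ranges 1 to 20, 21 to 40, 41 to 60, 61 to 80 and 81 to 100. Each experiment then treats one of these ranges as one part and the remaining records as the other part.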
To try k-fold cross-validation instead of holdout, you may run the example "MainTestEVALUATOR_kfold.java" in the package ca.pfv.spmf.test of the SPMF distribution.
Input format (default)
A few input file formats are supported. The first one is a text file such as the dataset "tennisExtended.txt" used in this example:
play outlook temp humid wind day moon
no sunny hot high weak tuesday full
no sunny hot high strong tuesday small
yes overcast hot high weak tuesday full
yes rainy mild high strong tuesday small
yes rainy cool normal weak monday full
no rainy cool normal strong monday small
yes overcast cool normal strong monday full
no sunny mild high weak monday small
yes sunny cool normal weak monday full
yes rainy mild normal weak monday small
yes sunny mild normal strong friday full
no rainy cool normal strong friday small
yes overcast cool normal strong friday full
no sunny mild high weak friday small
yes sunny cool normal weak friday full
yes overcast hot high weak friday small
yes rainy mild high strong friday full
yes rainy mild normal weak friday small
yes sunny mild normal strong tuesday full
The first line indicates the names of the attributes, each separated by a space. Then each of the following lines is a record where attribute values are separated by spaces. There are 19 records.
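For readers who want to inspect such a file outside of SPMF, the following minimal sketch reads text in this layout: the first non-empty line gives the attribute names and each following line is a record. It is not SPMF's own loader. The class name SeparatedTextReader and its method are hypothetical, and only the first two records of the dataset are embedded to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for illustration; SPMF has its own dataset classes.
class SeparatedTextReader {

    /**
     * Splits every non-empty line of the text on the given separator (a regex).
     * Row 0 holds the attribute names; the other rows are records.
     */
    static List<String[]> parse(String text, String separatorRegex) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) { // \R matches any line break
            line = line.trim();
            if (!line.isEmpty()) {
                rows.add(line.split(separatorRegex));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "play outlook temp humid wind day moon\n"
                    + "no sunny hot high weak tuesday full\n"
                    + "no sunny hot high strong tuesday small\n";
        List<String[]> rows = parse(text, "\\s+");
        String[] names = rows.get(0);
        // Locate the column of the target attribute by its name
        int target = Arrays.asList(names).indexOf("play");
        for (String[] record : rows.subList(1, rows.size())) {
            System.out.println(names[target] + " = " + record[target]);
        }
    }
}
```

Passing "," instead of "\\s+" as the separator would handle a comma-separated file with the same layout.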
Alternative input format (ARFF)
It is also possible to use a dataset encoded using the ARFF format as an alternative to the default dataset format. ARFF is the dataset file format of the Weka software, and its specification can be found in the Weka documentation. Most features of the ARFF format are supported, except that all attribute values are treated as nominal values. This is due to the design of ID3, which is only defined for nominal values. If numerical values are in the data, they will be treated as nominal values (strings). To load a file using the ARFF format, the following lines of code can be used in the example:
String datasetPath = fileToPath("weather-train.arff");
ARFFDataset dataset = new ARFFDataset(datasetPath, targetClassName);
Using these lines, the dataset weather-train.arff will be used, which contains the following content:
@relation weather.tennis
@attribute outlook {sunny,overcast,rainy}
@attribute temperature {hot,mild,cool}
@attribute humidity {high,normal}
@attribute windy {strong,weak}
@attribute play {yes,no}
@data
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rainy,mild,high,weak,yes
rainy,cool,normal,weak,yes
rainy,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rainy,mild,normal,weak,yes
sunny,mild,normal,strong,yes
This dataset defines 5 attributes and 11 records (note that it is slightly different from the file tennisExtended.txt in the above example).
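To show how the structure of such a file can be read, here is a deliberately simplified sketch. It is not SPMF's ARFFDataset class: the name TinyArffReader is hypothetical, and the sketch only handles @attribute lines, the @data marker, % comment lines and comma-separated records. It ignores quoting, sparse data and missing values, and, like the behaviour described above, it keeps every value as a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified, hypothetical ARFF reader for illustration; not SPMF's ARFFDataset.
class TinyArffReader {
    final List<String> attributeNames = new ArrayList<>();
    final List<String[]> records = new ArrayList<>();

    static TinyArffReader parse(String text) {
        TinyArffReader result = new TinyArffReader();
        boolean inData = false;
        for (String line : text.split("\\R")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // blank line or ARFF comment
            }
            // ARFF keywords are case-insensitive
            String lower = line.toLowerCase(Locale.ROOT);
            if (inData) {
                // Every value is kept as a string (treated as nominal)
                result.records.add(line.split("\\s*,\\s*"));
            } else if (lower.startsWith("@attribute")) {
                // "@attribute outlook {sunny,...}": the second token is the name
                result.attributeNames.add(line.split("\\s+")[1]);
            } else if (lower.startsWith("@data")) {
                inData = true;
            }
            // "@relation ..." and other header lines are ignored in this sketch
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "@relation weather.tennis\n"
                    + "@attribute outlook {sunny,overcast,rainy}\n"
                    + "@attribute temperature {hot,mild,cool}\n"
                    + "@attribute humidity {high,normal}\n"
                    + "@attribute windy {strong,weak}\n"
                    + "@attribute play {yes,no}\n"
                    + "@data\n"
                    + "sunny,hot,high,weak,no\n"
                    + "sunny,hot,high,strong,no\n";
        TinyArffReader arff = parse(text);
        System.out.println(arff.attributeNames + ", " + arff.records.size() + " records");
    }
}
```

The example embeds the header and the first two records of weather-train.arff. The position of "play" in the list of attribute names gives the column of the target attribute in each record.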
Alternative input format (CSV)
It is also possible to load a dataset encoded in the CSV format as an alternative to the default format and the ARFF format. A CSV file is a text file where attribute names and attribute values are separated by commas. To load a file encoded according to the CSV format, the following lines of code can be used in the example:
String datasetPath = fileToPath("tennisExtendedCSV.txt");
CSVDataset dataset = new CSVDataset(datasetPath, targetClassName);
Using these lines, the dataset tennisExtendedCSV.txt can be read, which contains the following content:
play,outlook,temp,humid,wind,day,moon
no,sunny,hot,high,weak,tuesday,full
no,sunny,hot,high,strong,tuesday,small
yes,overcast,hot,high,weak,tuesday,full
yes,rainy,mild,high,strong,tuesday,small
yes,rainy,cool,normal,weak,monday,full
no,rainy,cool,normal,strong,monday,small
yes,overcast,cool,normal,strong,monday,full
no,sunny,mild,high,weak,monday,small
yes,sunny,cool,normal,weak,monday,full
yes,rainy,mild,normal,weak,monday,small
yes,sunny,mild,normal,strong,friday,full
no,rainy,cool,normal,strong,friday,small
yes,overcast,cool,normal,strong,friday,full
no,sunny,mild,high,weak,friday,small
yes,sunny,cool,normal,weak,friday,full
yes,overcast,hot,high,weak,friday,small
yes,rainy,mild,high,strong,friday,full
yes,rainy,mild,normal,weak,friday,small
yes,sunny,mild,normal,strong,tuesday,full
The first line indicates the names of the attributes, each separated by a comma. Then each of the following lines is a record where attribute values are separated by commas. There are 19 records.