Using the ARFF format in the source code version of SPMF (SPMF documentation)

This example explains how to use the ARFF format in the source code version of SPMF, an open-source data mining library.

The graphical interface and command line interface of SPMF have been able to read the ARFF file format since version 0.93 of SPMF, and this is fully transparent to the user. But what if you want to use the ARFF format when running algorithms from the source code? This example explains how to do it, and it is quite simple.

But before presenting the example, let's explain a few things about how the ARFF support is implemented in SPMF:

We will now explain how to use the ARFF format in the source code with an example. We will use the Apriori algorithm, but the steps are the same for the other algorithms. We will first show how to run the Apriori algorithm if the input file is in SPMF format. Then, to illustrate the differences, we will show how to run it if the input file is in ARFF format.

If the input is in SPMF format

To run Apriori with a file "input.txt" in SPMF format with the parameter minsup = 0.4, the following code is used:

AlgoApriori apriori = new AlgoApriori();
apriori.runAlgorithm(0.4, "input.txt", "output.txt");
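
To make the meaning of the minsup parameter concrete, here is a minimal sketch that is not part of SPMF. It assumes that each line of a file in SPMF transaction format is a list of positive integer items separated by single spaces, and that minsup = 0.4 means an itemset must appear in at least 40 % of the transactions. The class name SupportSketch, its methods and the five sample transactions are hypothetical and only serve the illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: this class is NOT part of SPMF.
class SupportSketch {

    // Parses one transaction line: integer items separated by single spaces (assumed format).
    static Set<Integer> parseTransaction(String line) {
        Set<Integer> items = new HashSet<>();
        for (String token : line.trim().split(" ")) {
            items.add(Integer.parseInt(token));
        }
        return items;
    }

    // Returns the fraction of transactions that contain every item of the itemset.
    static double relativeSupport(List<String> lines, Set<Integer> itemset) {
        int count = 0;
        for (String line : lines) {
            if (parseTransaction(line).containsAll(itemset)) {
                count++;
            }
        }
        return (double) count / lines.size();
    }

    public static void main(String[] args) {
        // Hypothetical database of five transactions.
        List<String> db = Arrays.asList("1 3 4", "2 3 5", "1 2 3 5", "2 5", "1 2 3 5");
        Set<Integer> itemset = new HashSet<>(Arrays.asList(2, 3, 5));
        double support = relativeSupport(db, itemset);
        // The itemset {2, 3, 5} occurs in 3 of the 5 transactions.
        System.out.println("support = " + support + ", frequent at minsup 0.4: " + (support >= 0.4));
    }
}
```

In this sample, the itemset {2, 3, 5} appears in three of the five transactions, so it would be considered frequent under the 0.4 threshold.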

If the input is in ARFF format

Now let's say that the input file is in the ARFF format.

// We first need to convert the input file from ARFF to SPMF format. To do that, we create a transaction database converter. Then we call its method "convertARFFandReturnMap" to convert the input file to the SPMF format. It produces a converted input file named "input_converted.txt". Moreover, the conversion method returns a map that links each integer item used in the SPMF file to the corresponding name in the ARFF file.

TransactionDatabaseConverter converter = new TransactionDatabaseConverter();
Map<Integer, String> mapping = converter.convertARFFandReturnMap("input.arff", "input_converted.txt", Formats.ARFF, Integer.MAX_VALUE);
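
To give an intuition of what such a mapping contains, here is a simplified sketch of the idea. It is not the code used by SPMF. It assumes a dense ARFF file where each data row lists one value per attribute, and it assigns a new integer to each distinct "attribute=value" pair. The class ArffMappingSketch and the naming scheme "attribute=value" are assumptions made for this illustration. A real converter must also deal with quoted values, sparse rows and the order of items in each transaction, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a simplified stand-in for the idea behind the conversion, NOT SPMF's code.
class ArffMappingSketch {

    // Filled during conversion: integer item -> "attribute=value" name (hypothetical naming scheme).
    final Map<Integer, String> idToName = new LinkedHashMap<>();
    private final Map<String, Integer> nameToId = new LinkedHashMap<>();

    // Converts the lines of a dense ARFF file to lines of space-separated integer items.
    List<String> convert(List<String> arffLines) {
        List<String> attributes = new ArrayList<>();
        List<String> transactions = new ArrayList<>();
        boolean inData = false;
        for (String raw : arffLines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // skip blank lines and ARFF comments
            }
            String lower = line.toLowerCase();
            if (lower.startsWith("@attribute")) {
                attributes.add(line.split("\\s+")[1]); // second token is the attribute name
            } else if (lower.startsWith("@data")) {
                inData = true;
            } else if (inData) {
                StringBuilder sb = new StringBuilder();
                String[] values = line.split(",");
                for (int i = 0; i < values.length; i++) {
                    String value = values[i].trim();
                    if (value.equals("?")) {
                        continue; // "?" marks a missing value in ARFF
                    }
                    String name = attributes.get(i) + "=" + value;
                    Integer id = nameToId.get(name);
                    if (id == null) {
                        id = nameToId.size() + 1; // next unused integer, starting at 1
                        nameToId.put(name, id);
                        idToName.put(id, name);
                    }
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(id);
                }
                transactions.add(sb.toString());
            }
        }
        return transactions;
    }

    public static void main(String[] args) {
        // Hypothetical ARFF content with two nominal attributes.
        List<String> arff = Arrays.asList(
                "@relation weather",
                "@attribute outlook {sunny,rainy}",
                "@attribute windy {true,false}",
                "@data",
                "sunny,true",
                "rainy,false");
        ArffMappingSketch sketch = new ArffMappingSketch();
        System.out.println(sketch.convert(arff));
        System.out.println(sketch.idToName);
    }
}
```

The important point is that the algorithm only ever sees the integers, while the map keeps enough information to recover the original names afterwards.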

// Then we run the algorithm with the converted file "input_converted.txt". This creates a file "output.txt" containing the result.

AlgoApriori apriori = new AlgoApriori();
apriori.runAlgorithm(0.4, "input_converted.txt", "output.txt");

// Finally, we need to use the mapping to convert the output file so that the result is shown using the names that are found in the ARFF file rather than the integer-based representation used internally by the Apriori algorithm. This is very simple and performed as follows. The result is a file named "final_output.txt".

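// A different variable name is used here because "converter" already refers to the TransactionDatabaseConverter created above.
ResultConverter resultConverter = new ResultConverter();
resultConverter.convert(mapping, "output.txt", "final_output.txt");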
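
The idea behind this last step can be sketched as follows. This is not the code of ResultConverter. It assumes that each line of the output file lists integer items followed by a "#SUP:" marker and a support count, and it replaces only the items that appear before the first token starting with "#". The class ResultMappingSketch, its methods and the sample mapping are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: NOT the code of SPMF's ResultConverter.
class ResultMappingSketch {

    // Returns the name mapped to an integer token, or null if the token is not a mapped integer.
    static String lookup(Map<Integer, String> mapping, String token) {
        try {
            return mapping.get(Integer.parseInt(token));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Rewrites one result line, replacing integer items by their original names.
    static String convertLine(Map<Integer, String> mapping, String line) {
        StringBuilder sb = new StringBuilder();
        boolean inItems = true;
        for (String token : line.trim().split(" ")) {
            if (token.startsWith("#")) {
                inItems = false; // from "#SUP:" onward the tokens are not items
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String name = inItems ? lookup(mapping, token) : null;
            sb.append(name != null ? name : token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical mapping, as could be returned by the conversion step.
        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(1, "outlook=sunny");
        mapping.put(2, "windy=true");
        System.out.println(convertLine(mapping, "1 2 #SUP: 2"));
    }
}
```

Note that the support count after "#SUP:" is left untouched even when it is equal to the integer of an item, which is why the sketch stops translating at the first token starting with "#".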

What is the cost of using the ARFF format in terms of performance? The only additional cost is that of converting the input and output files, which is generally much smaller than the cost of performing the data mining itself. In the future, we plan to add support for SQL databases, Excel files and other formats by using a similar conversion mechanism that does not affect the performance of the mining phase. We also plan to add support for the visualization of patterns.