Using the ARFF format in the source code version of SPMF (SPMF documentation)
This example explains how to use the ARFF format in the source code version of SPMF, an open-source data mining library.
The graphical interface and the command line interface of SPMF have been able to read the ARFF file format since version 0.93, and this is completely transparent to the user. But what if you want to use the ARFF format when running algorithms from the source code? This example explains how to do it, and it is quite simple.
Before presenting the example, here are a few points about how ARFF support is implemented in SPMF:
- The ARFF format is only supported for algorithms that take a transaction database as input (most association rule and itemset mining algorithms, such as Apriori and FPGrowth) because ARFF does not provide constructs for representing sequential data.
- The full ARFF format is supported except that (1) the character "=" is forbidden and (2) escape characters are not considered.
- ARFF generally represents items in files as strings, while the SPMF format and algorithms use an integer representation, which is much more memory and time efficient for the algorithms concerned. To support ARFF while keeping the efficient integer representation used by our algorithms, we decided to implement ARFF support in a way that does not require modifying the source code of the algorithms. The solution is to convert the input before running an algorithm, and then to convert the output afterwards.
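To make the string-to-integer idea concrete, here is a minimal sketch of how string items could be encoded as integers while recording the mapping needed to decode results later. The class ItemEncoder and its methods are hypothetical names used only for illustration. They are not part of SPMF, and the actual conversion in SPMF may work differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper (not an SPMF class): encodes string items as integers. */
class ItemEncoder {
    // item name -> integer id, in order of first appearance
    private final Map<String, Integer> idByItem = new LinkedHashMap<>();
    // integer id -> item name, kept so that results can be decoded later
    private final Map<Integer, String> itemById = new LinkedHashMap<>();

    /** Returns the id of an item, assigning the next free id (starting at 1) if it is new. */
    int encode(String item) {
        Integer id = idByItem.get(item);
        if (id == null) {
            id = idByItem.size() + 1;
            idByItem.put(item, id);
            itemById.put(id, item);
        }
        return id;
    }

    /** Encodes one transaction as a line of space-separated ids in ascending order. */
    String encodeTransaction(List<String> items) {
        TreeSet<Integer> ids = new TreeSet<>(); // sorts ids and drops duplicates
        for (String item : items) {
            ids.add(encode(item));
        }
        StringBuilder line = new StringBuilder();
        for (int id : ids) {
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(id);
        }
        return line.toString();
    }

    /** Returns the id -> name mapping collected so far. */
    Map<Integer, String> mapping() {
        return itemById;
    }

    public static void main(String[] args) {
        ItemEncoder encoder = new ItemEncoder();
        System.out.println(encoder.encodeTransaction(List.of("bread", "milk")));  // prints "1 2"
        System.out.println(encoder.encodeTransaction(List.of("butter", "milk"))); // prints "2 3"
        System.out.println(encoder.mapping()); // prints "{1=bread, 2=milk, 3=butter}"
    }
}
```

The mining algorithm only ever sees the integer lines, which is why its source code does not need to change. The id-to-name map plays the same role as the mapping returned by the conversion step in the example below.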
Having said that, we will now explain how to use the ARFF format in the source code with an example. We will use the Apriori algorithm but the steps are the same for the other algorithms. We will first show how to run the Apriori algorithm if the input file is in SPMF format. Then, we will show how to run the Apriori algorithm if the input is in ARFF format to illustrate the differences.
If the input is in SPMF format
To run Apriori with a file "input.txt" in SPMF format with the parameter minsup = 0.4, the following code is used:
AlgoApriori apriori = new AlgoApriori();
apriori.runAlgorithm(0.4, "input.txt", "output.txt");
If the input is in ARFF format
Now let's say that the input file is in the ARFF format.
// We first need to convert the input file from ARFF to SPMF format. To do that, we create a transaction database converter. Then we call its method "convertARFFandReturnMap" to convert the input file to the SPMF format. It produces a converted input file named "input_converted.txt". Moreover, the conversion method returns a map containing the mapping information between the data in ARFF format and the data in SPMF format.
TransactionDatabaseConverter converter = new TransactionDatabaseConverter();
Map<Integer, String> mapping = converter.convertARFFandReturnMap(
        "input.arff", "input_converted.txt", Formats.ARFF, Integer.MAX_VALUE);
// Then we run the algorithm with the converted file "input_converted.txt". This creates a file "output.txt" containing the result.
AlgoApriori apriori = new AlgoApriori();
apriori.runAlgorithm(0.4, "input_converted.txt", "output.txt");
// Finally, we need to use the mapping to convert the output file so that the result is shown using the names that are found in the ARFF file rather than the integer-based representation used internally by the Apriori algorithm. This is very simple and performed as follows. The result is a file named "final_output.txt".
ResultConverter resultConverter = new ResultConverter();
resultConverter.convert(mapping, "output.txt", "final_output.txt");
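To illustrate what this last step does conceptually, the following sketch decodes one output line by replacing integer ids with their names. It assumes that each output line lists integer items followed by annotations starting with "#" (for example "1 2 #SUP: 3"). The class ResultDecoderSketch is a hypothetical name introduced here. It is not the implementation of ResultConverter.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not an SPMF class): replaces integer ids by item names in a result line. */
class ResultDecoderSketch {

    /**
     * Decodes one result line. Everything from the first '#' onward (assumed to be
     * annotations such as the support) is copied unchanged. Tokens that are not
     * integers, or integers missing from the mapping, are also left unchanged.
     */
    static String decodeLine(Map<Integer, String> mapping, String line) {
        int marker = line.indexOf('#');
        String items = marker >= 0 ? line.substring(0, marker) : line;
        String suffix = marker >= 0 ? line.substring(marker) : "";

        StringBuilder decoded = new StringBuilder();
        for (String token : items.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the item part is empty
            }
            String name = token.matches("\\d+") ? mapping.get(Integer.parseInt(token)) : null;
            decoded.append(name != null ? name : token).append(' ');
        }
        return decoded.append(suffix).toString().trim();
    }

    public static void main(String[] args) {
        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(1, "bread");
        mapping.put(2, "milk");
        System.out.println(decodeLine(mapping, "1 2 #SUP: 3")); // prints "bread milk #SUP: 3"
    }
}
```

Because decoding is a single pass over the output file, its cost grows with the number of patterns found and not with the size of the input database.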
What is the cost of using the ARFF format in terms of performance? The only additional cost when using ARFF is the cost of converting the input and output files, which is generally much smaller than the cost of performing the data mining. In the future, we plan to add support for SQL databases, Excel files and other formats by using a similar conversion mechanism that does not affect the performance of the mining phase. We also plan to add support for the visualization of patterns.