Run experiments to compare the scalability of one or more algorithms on a dataset (dataset size is varied) (SPMF documentation)

This example explains how to to use the ExperimenterForScalability tool of SPMF, which allows to compare the performance of one or more algorithms on a dataset where the size is varied and all input parameters are fixed, so as to study the effect of dataset size on runtime, memory and number of patterns..

This tool runs algorithms automatically and save the results to files. The results include a comparison of runtime, memory usage and output file size. These results are in a tab-separated format, which can be easily imported into Excel or other spreadsheet software to draw charts.

There is also an option to output results as PGFPlots charts that can be inserted directly in Latex documents such as research papers. This tool is was designed to speed-up the redundant task of running experiments to compare algorithms when writing research papers.

I will first explain how to use this tool from the user interface of SPMF, and then how the main functionnalities can be used from the source code version of SPMF.

How to start this tool?

To run this example, if you are using the graphical user interface of SPMF, (1) select the algorithm "Performance_experiment_scalability" and then (2) click "Run algorithm". This will open the tool, which provides a simple user interface..

Alternatively, if you want to run this tool from the source code of SPMF (e.g. in Eclipse), without using the graphical user interface of SPMF, you can execute the same tool by running the file "MainTestExperimenterForScalability", located in the package "ca.pfv.spmf.test". However, this is less user-friendly as there is no user interface. Thus, it is recommended to use the graphical user interface of SPMF.

How to use the tool?

After you run the algorithm "Performance_experiment_scalability" in the user interface of SPMF, the following window will open, which let you indicate what kind of experiments should be run.

experiment with spmf

I will explain step-by-step how to use this tool:

If you have followed the above instructions, the window will look like this:

all parameters are filled for the experiments

Then, we can click on the "Run the experiment" button K) to start the experiments!

During the experiments, the status of experiments will be displayed in the text area L) as follows:

console output from the tool

And finally, when all experiments are done, the results of the experiments will be displayed in the text area L). The results will look like this (exact values will vary):

********************************************
***** RESULTS *****
********************************************
INPUT FILE: D:\SPMF_ALL\SPMF 6\NEW TEST FILES 2020\NEW TEST FILES\chess.txt (the input file)
PARAMETERS: [0.4] (the parameter(s) used in this experiment)


TIME (S)
% of dataset 20% 40% 60% 80% 100%
Apriori 0.7 0.47 0.44 0.4 0.39
Apriori_TID 0.47 0.38 0.38 0.4 0.38
Eclat 0.86 0.89 0.77 0.63 0.57

(The above lines provide the results for the runtime comparison. The first line is the percentages of the dataset used for each experiment. The second line is the runtimes in seconds for Apriori when it is run on these subsets of the dataset (the first 20% of the dataset, the first 40%, the first 60%, the first 80%, and the whole database).. The following lines show the runtimes for the other algorithms, namely Apriori_TID, and Eclat. For example, the second line means that Apriori took 0.7 seconds when 20% of the dataset was used, and Apriori took 0.0.47 seconds when 40% of the dataset was used, etc.)

MEMORY (MB)
% of dataset 20% 40% 60% 80% 100%
Apriori 77.64 22.96 17.44 15.44 15.44
Apriori_TID 34.31 17.44 14.48 13.44 12.96
Eclat 103.06 142.57 125.01 121 14.19

(The above lines provide the results for the memory comparison. The first line is the percentages of the dataset used for each experiment. The second line is the maximum memory usage for Apriori for these parameter values. The following lines show the maximum memory usage for the other algorithms, namely Apriori_TID and Eclat. For example, the second line means that Apriori used 77.64 megabytes of memory when it was run on 20% of the dataset, and Apriori took also 22.96 megabytes of memory when it was run on 40% of the dataset, etc.)

OUTPUT_SIZE (LINES)
% of dataset 20% 40% 60% 80% 100%
Apriori 249855 21791 6815 2359 622
Apriori_TID 249855 21791 6815 2359 622
Eclat 249855 21791 6815 2359 622

(The above lines compares the output size of the algorithms when the algorithms are run on the different subsets of the dataset. The first line is the percentages of the dataset used for each experiment. The second line is the output size (here, number of patterns) for Apriori for these parameter values. The following lines show the output sizes for the other algorithms, namely Apriori_TID and Eclat.

All the above results are also saved in a file called EXPERIMENT_RESULT.txt in the output directory.

Importing the results into Excel

The results from the file EXPERIMENT_RESULT.txt are tab-separated. This means that they can be easily imported into Excel to create charts. For this, just copy the above results into Excel. For example, here we copy the runtime results from a different experiments (just for illustrative purpose of the process

excel results

Then, we can select these results to generate charts:

chart excel

As can be seen in this picture, this chart is not perfect. In particular, the X axis is incorrect. This can be fixed by right-clicking on the chart in Excel and choosing "Select data" to set up the X axis properly. Besides, it is possible to adjust other aspects of the chart to make it look more pretty.

Importing the results into a Latex document

If you have selected the option to generate a latex file with PGFplot, a file called PGPLOT_FIGURES.tex will have been created in the output directory. This file can be compiled using Latex to generate charts, which can also be included in latex documents.

The chart for the above example may look more or less like this, depending on the results:

The Result Files

After running experiments, a set of files are created in the output directory. Here is an overview of those files.

Here is an overview of all the files created in the output directory for this example:

How to run experiments from the source code of SPMF?

Above, I have explained how to use the graphical interface of SPMF to run automated performance experiments. I will now explain how the same functionnality can be used from the source code of SPMF. This is suited for those who are familiar with Java and wish for example to run experiments without passing through the graphical user interface of SPMF.

There is an example of how to do this, in the SPMF source code. It is the file MainTestExperimenterScalability, which is located in the package ca.pfv.spmf.experiments.scalability of SPMF.

The main code from that example is shown below. In that example, we will run a performance comparison of two algorithms from SPMF called "Eclat" and "Apriori. The input file is the file "ContextPasquier99.txt", which is located in the same folder, and results of the experiments will be saved in a new folder called "EXPERIMENTS". To run the experiments, we indicate a path to the file spmf.jar on the local computer. This is because we will be calling the spmf.jar file to run the algorithms.. In this example, we assume that spmf.jar has been downloaded to the computer and saved in C://Users//Phil/Desktop/spmf.jar but other location could work as well.
We will also indicate that there are one parameters for those algorithms and that the value will be 0.4, and that the algorithm will be run on 20, 40, 60, 80, and 100% of the database. We will also indicate that each algorithm cannot run for more than 120000s (two minutes). We can also set additional options such as to generate PGFPlots figures to be used with Latex.

The important part of the code for this example is below:
;

		  // ── 1. Algorithms to compare ─────────────────────────────────
        String[] algorithmNames = new String[] {
                "Eclat",
                "Apriori"
        };

        // ── 2. Fixed algorithm parameters ────────────────────────────
        // These are passed unchanged to every algorithm on every run.
        String[] fixedParams = new String[] { "0.4"};

        // ── 3. Single input dataset ───────────────────────────────────
        // The experimenter reads this file and creates subsets.
        String inputFile = fileToPath("contextPasquier99.txt");

        // ── 4. Subset sizes as percentages ───────────────────────────
        // Each value must be an integer between 1 and 100.
        // The experimenter will run algorithms on the first 20%,
        // then the first 40%, ..., then the full 100% of the data.
        String[] percentages = new String[] {
                "20", "40", "60", "80", "100"
        };

        // ── 5. Output directory ───────────────────────────────────────
        String outputDirectory = "SCALABILITY_EXPERIMENTS";

        // ── 6. Timeout per run (milliseconds) ────────────────────────
        int timeoutInMilliseconds = 120000; // 2 minutes

        // ── 7. Options ────────────────────────────────────────────────
        boolean compareOutputSizes   = true;
        boolean showCommand          = false;
        boolean generateLatexFigures = true;

        // ── 8. Create the experimenter and run ────────────────────────
        ExperimenterForScalability experimenter =
                new ExperimenterForScalability();

        // Adjust to the actual location of spmf.jar on your machine
        experimenter.setSPMFJarFilePath(
                "C:\\Users\\Phil\\Desktop\\spmf.jar");

        // String used in result tables when a run times out
        experimenter.setTimeoutCodeS("-");

        // Run — this is the only method you need to call
        experimenter.runScalabilityExperiment(
                algorithmNames,
                fixedParams,
                inputFile,
                percentages,
                outputDirectory,
                timeoutInMilliseconds,
                compareOutputSizes,
                showCommand,
                generateLatexFigures);

If we run this example, the output in the console will be like this:

**************************************************
*****   RUNNING SCALABILITY EXPERIMENTS      *****
**************************************************
 INPUT FILE       : /D:/workspace/spmf/bin/ca/pfv/spmf/experiments/scalability/contextPasquier99.txt
 TOTAL DATA LINES : 5
 OUTPUT DIRECTORY : SCALABILITY_EXPERIMENTS
 ALGORITHMS       : [Eclat, Apriori]
 FIXED PARAMETERS : [0.4]
 PERCENTAGES      : [20, 40, 60, 80, 100]

 Subset 20% -> 1 lines -> SCALABILITY_EXPERIMENTS\subset_20pct.txt
 *****  EXPERIMENT 1
   ALGORITHM  : Eclat
   SUBSET     : 20%  (1 lines)
   COMMAND: java -jar C:\Users\Phil\Desktop\spmf.jar run Eclat SCALABILITY_EXPERIMENTS\subset_20pct.txt SCALABILITY_EXPERIMENTS\Eclat_20pct.txt 0.4 
   TIME: 0.33 s   MEMORY: 11.94 MB
   OUTPUT SIZE: 7 lines

 *****  EXPERIMENT 2
   ALGORITHM  : Apriori
   SUBSET     : 20%  (1 lines)
   COMMAND: java -jar C:\Users\Phil\Desktop\spmf.jar run Apriori SCALABILITY_EXPERIMENTS\subset_20pct.txt SCALABILITY_EXPERIMENTS\Apriori_20pct.txt 0.4 
   TIME: 0.3 s   MEMORY: 12.48 MB
   OUTPUT SIZE: 7 lines

 Subset 40% -> 2 lines -> SCALABILITY_EXPERIMENTS\subset_40pct.txt
 *****  EXPERIMENT 3
   ALGORITHM  : Eclat
   SUBSET     : 40%  (2 lines)
   COMMAND: java -jar C:\Users\Phil\Desktop\spmf.jar run Eclat SCALABILITY_EXPERIMENTS\subset_40pct.txt SCALABILITY_EXPERIMENTS\Eclat_40pct.txt 0.4 
   TIME: 0.3 s   MEMORY: 11.94 MB
   OUTPUT SIZE: 13 lines

 *****  EXPERIMENT 4
   ALGORITHM  : Apriori
   SUBSET     : 40%  (2 lines)
   COMMAND: java -jar C:\Users\Phil\Desktop\spmf.jar run Apriori SCALABILITY_EXPERIMENTS\subset_40pct.txt SCALABILITY_EXPERIMENTS\Apriori_40pct.txt 0.4 
   TIME: 0.3 s   MEMORY: 12.48 MB
   OUTPUT SIZE: 13 lines

 Subset 60% -> 3 lines -> SCALABILITY_EXPERIMENTS\subset_60pct.txt
 *****  EXPERIMENT 5
   ALGORITHM  : Eclat
   SUBSET     : 60%  (3 lines)
   COMMAND: java -jar C:\Users\Phil\Desktop\spmf.jar run Eclat SCALABILITY_EXPERIMENTS\subset_60pct.txt SCALABILITY_EXPERIMENTS\Eclat_60pct.txt 0.4 
   TIME: 0.29 s   MEMORY: 11.94 MB
   OUTPUT SIZE: 9 lines

 *****  EXPERIMENT 6
   ALGORITHM  : Apriori
   SUBSET     : 60%  (3 lines)
   COMMAND: java -jar C:\Users\Phil\Desktop\spmf.jar run Apriori SCALABILITY_EXPERIMENTS\subset_60pct.txt SCALABILITY_EXPERIMENTS\Apriori_60pct.txt 0.4 
   TIME: 0.3 s   MEMORY: 12.48 MB
   OUTPUT SIZE: 9 lines

 Subset 80% -> 4 lines -> SCALABILITY_EXPERIMENTS\subset_80pct.txt
 *****  EXPERIMENT 7
   ALGORITHM  : Eclat
   SUBSET     : 80%  (4 lines)
   COMMAND: java -jar C:\Users\Phil\Desktop\spmf.jar run Eclat SCALABILITY_EXPERIMENTS\subset_80pct.txt SCALABILITY_EXPERIMENTS\Eclat_80pct.txt 0.4 
   TIME: 0.29 s   MEMORY: 11.94 MB
   OUTPUT SIZE: 9 lines

 *****  EXPERIMENT 8
   ALGORITHM  : Apriori
   SUBSET     : 80%  (4 lines)
   COMMAND: java -jar C:\Users\Phil\Desktop\spmf.jar run Apriori SCALABILITY_EXPERIMENTS\subset_80pct.txt SCALABILITY_EXPERIMENTS\Apriori_80pct.txt 0.4 
   TIME: 0.29 s   MEMORY: 12.48 MB
   OUTPUT SIZE: 9 lines

 Subset 100% -> 5 lines -> SCALABILITY_EXPERIMENTS\subset_100pct.txt
 *****  EXPERIMENT 9
   ALGORITHM  : Eclat
   SUBSET     : 100%  (5 lines)
   COMMAND: java -jar C:\Users\Phil\Desktop\spmf.jar run Eclat SCALABILITY_EXPERIMENTS\subset_100pct.txt SCALABILITY_EXPERIMENTS\Eclat_100pct.txt 0.4 
   TIME: 0.31 s   MEMORY: 11.94 MB
   OUTPUT SIZE: 15 lines

 *****  EXPERIMENT 10
   ALGORITHM  : Apriori
   SUBSET     : 100%  (5 lines)
   COMMAND: java -jar C:\Users\Phil\Desktop\spmf.jar run Apriori SCALABILITY_EXPERIMENTS\subset_100pct.txt SCALABILITY_EXPERIMENTS\Apriori_100pct.txt 0.4 
   TIME: 0.36 s   MEMORY: 12.48 MB
   OUTPUT SIZE: 15 lines


**************************************************
*****              RESULTS                   *****
**************************************************
SCALABILITY EXPERIMENT RESULTS
INPUT FILE: /D:/workspace/spmf/bin/ca/pfv/spmf/experiments/scalability/contextPasquier99.txt
FIXED PARAMETERS: [0.4]

TIME (S)
% of dataset	20%	40%	60%	80%	100%	
Eclat	0.33	0.3	0.29	0.29	0.31	
Apriori	0.3	0.3	0.3	0.29	0.36	

MEMORY (MB)
% of dataset	20%	40%	60%	80%	100%	
Eclat	11.94	11.94	11.94	11.94	11.94	
Apriori	12.48	12.48	12.48	12.48	12.48	

OUTPUT_SIZE (LINES)
% of dataset	20%	40%	60%	80%	100%	
Eclat	7	13	9	9	15	
Apriori	7	13	9	9	15	

LaTeX figures written to: SCALABILITY_EXPERIMENTS\PGPLOT_FIGURES.tex

If we want we can copy the above results in green color to Excel for generating charts from these results.

Besides, in the EXPERIMENTS/ directory that will have been created, we can find the detailed results in files, including figures to be used in Latex.

Now, having said that an interesting question is : What if I want to test some algorithms that are not in SPMF.jar ? At this moment, the tool is only designed to compare algorithms from SPMF.jar but it is not difficult to add new algorithms to SPMF.Jar. I will try to add more explanations about this in the future.