Deriving Frequent Itemsets from Frequent Closed Itemsets using the DFI-Growth Algorithm (SPMF documentation)

This example explains how to run the DFI-Growth algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "DFI-Growth" algorithm, (2) select the input file "contextMushroom_FCI90.txt", (3) set the output file name (e.g. "output.txt") and (4) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run DFI-Growth contextMushroom_FCI90.txt output.txt in a folder containing spmf.jar and the example input file contextMushroom_FCI90.txt.
If you are using the source code version of SPMF, launch the file "MainTestDFIGrowth_saveToField.java" in the package ca.pfv.SPMF.tests.

What is DFI-Growth?

DFI-Growth is an algorithm for deriving frequent itemsets from frequent closed itemsets by pattern growth. It was proposed by Huang et al. (2019).

What is the input of the DFI-Growth algorithm?

The input of DFI-Growth is a FCI database.
A FCI database is a set of frequent closed itemsets. For example, consider the following FCI database. It contains 6 frequent closed itemsets and support number of each itemset. This database is provided as the file contextMushroom_FCI90.txt. This input file was obtained by applying the Charm algorithm (proposed by Zaki) on the Mushroom.txt dataset with 90% as the minsup threshold.

frequent closed itemsets	support
{36 90 97}	7576
{90 97}	7768
{36 90 94}	8192
{36 90}	8200
{90 94}	8216
{90}	8416

What is the output of the DFI-Growth algorithm?

DFI-Growth is an algorithm for deriving frequent itemsets from frequent closed itemsets. A frequent itemset is an itemset which appears in at least minsup transactions from the transaction database. And a frequent closed itemset is a frequent itemset that none of its immediate supersets have the same support number as itself.
For example, if DFI-Growth is run on the previous FCI database, DFI-Growth produces the following result:

frequent itemsets	support
{36 97}	7576
{36 90 97}	7576
{90 97}	7768
{97}	7768
{36 94}	8192
{36 90 94}	8192
{90 94}	8216
{94}	8216
{36}	8200
{36 90}	8200
{90}	8416

How should I interpret the results?

In the result, each frequent itemset is annotated with its support number. The support number of an frequent itemset is how many times the itemset appears in the original transaction database from which we obtain the frequent closed itemsets. For example, the itemset {36 90} has a support of 8200 which is a frequent itemset because its support number is higher or equal to the minsup threshold.

Input file format

The input file format used by DFI-Growth is defined as follows. It is a text file, where each line represents a frequent closed itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions.
For example, here is the input file for this example. The first line indicates the frequent closed itemset consisting of the item {36 90 97} and it indicates that this itemset has a support of 7576 transactions.

36 90 97 #SUP: 7576
90 97 #SUP: 7768
36 90 94 #SUP: 8192
36 90 #SUP: 8200
90 94 #SUP: 8216
90 #SUP: 8416

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, here is the input file for this example. The first line indicates the frequent itemset consisting of the item {36 97} and it indicates that this itemset has a support of 7576 transactions.

36 97 #SUP: 7576
36 90 97 #SUP: 7576
90 97 #SUP: 7768
97 #SUP: 7768
36 94 #SUP: 8192
36 90 94 #SUP: 8192
90 94 #SUP: 8216
94 #SUP: 8216
36 #SUP: 8200
36 90 #SUP: 8200
90 #SUP: 8416

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Implementation details

This is the original implementation

Where can I get more information about the DFI-Growth algorithm?

This is the conference article describing DFI-Growth:

JianTao Huang, Yi-Pei Lai, Chieh Lo, Cheng-Wei Wu: An Efficient Algorithm for Deriving Frequent Itemsets from Lossless Condensed Representation. IEA/AIE 2019: 216-229