Mining Frequent Closed Itemsets and Frequent Generator Itemsets using the Zart Algorithm (SPMF documentation)

This example explains how to run the Zart algorithm using the SPMF open-source data mining library.

How to run this example?

What is Zart?

Zart is an algorithm for discovering frequent closed itemsets and their corresponding generators in a transaction database.

Zart is an Apriori-based algorithm. Why is it useful to discover closed itemsets and their generators at the same time? One reason is that this information is necessary to generate some special kind of association rules such as the IGB basis of association rules (see the example for IGB for more information about IGB association rules).

What is the input of the Zart algorithm?

The input is a transaction database (aka binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextZart.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {1, 3}
t3 {1, 2, 3, 5}
t4 {2, 3, 5}
t5 {1, 2, 3, 5}

What is the output of the Zart algorithm?

The output of the Zart algorithm for a transaction database and a minimum support threshold minsup is the set of all frequent closed itemsets and their support, and the associated generator(s) for each closed frequent itemset.

To explain what is a frequent closed itemset and a generator, it is necessary to review a few definitions.

An itemset is a unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3} has a support of 3 because it appears in three transactions from the database (t2, t3 and t5). A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in a proper superset having the same support. A generator Y of a closed itemset X is an itemset such that (1) it has the same support as X and (2) it does not have any subset having the same support.

By running Zart with the previous transaction database and a minsup of 40% (2 transactions), we obtain the following result:

itemsets support is closed? minimal generators
{} 5 yes {}
{1} 4 yes {1}
{2} 4 no
{3} 4 yes {3}
{5} 4 no
{1, 2} 3 no
{1, 3} 3 yes {1,3}
{1, 5} 3 no
{2, 3} 3 no
{2, 5} 4 yes {2}, {5}
{3, 5} 3 no
{1, 2, 3} 2 no
{1, 2, 5} 3 yes {1, 2}, {1, 5}
{1, 3, 5} 2 no
{2, 3, 5} 3 yes {2, 3}, {3, 5}
{1, 2, 3, 5} 2 yes {1, 2, 3}, {1, 3, 5}

How should I interpret the results?

In the results, all frequent itemsets are shown. Each frequent itemset that is a closed itemset is marked as such ("yes"). For each closed itemset, its support is indicated and its list of generators. For example, the itemset {1,2,3,5} has a support of 2 because it appears in 2 transactions (t3 and t5). It is a closed itemset because it has no proper superset having the same support. Moreover is has two generators: {1, 2, 3} and {1, 3, 5}. By definition, these generators have the same support as {1, 2, 3, 5}.

Another example. The itemset {1, 3, 5} is not closed and it has a support of 2. It is not closed because it has a proper superset {1, 2, 3, 5} that has the same support.

Input file format

The input file format used by Zart is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 2 4 5
1 3
1 2 3 5
2 3 5
1 2 3 5

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, containing two sections.

The first section starts with "======= List of closed itemsets and their generators ============" on the first line of the file. Then, each closed itemset is indicated on a single line as follows. A closed itemset is represented by a line starting with "CLOSED :" followed by the itemset itself and then the support of the itemset. An itemset is represented by a list of integers, where each integer represents an item and where integers (items) are separated by single spaces. The support of a closed itemset is indicated by an integer immediately following the special keyword "#SUP:" on the same line. The support is expressed as a number of transactions. On the lines immediately following a closed itemset, the keyword "GENERATORS :" appears. Then, on the immediately following line, the generators of the itemsets are listed, one per line. A generator is represented by the keyword "=" followed by the itemset representing the generator. If a generator is the empty set, then it is represented by the keyword EMPTYSET.

The second sections starts with "======= List of frequent itemsets ============" on a single line. Then all frequent itemsets are listed on the following lines, one per line. On each line, the keyword "ITEMSET :" appears followed by the items of the itemset. Each item is represented by an integer and it is followed by a single space. After, all the items, the special keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions.

For example, we show below the output file for the previous example.

======= List of closed itemsets and their generators ============
CLOSED :
EMPTYSET #SUP: 5
GENERATOR(S) :
EMPTYSET
CLOSED :
1 #SUP: 4
GENERATOR(S) :
1
CLOSED :
3 #SUP: 4
GENERATOR(S) :
3
CLOSED :
1 3 #SUP: 3
GENERATOR(S) :
1 3
CLOSED :
2 5 #SUP: 4
GENERATOR(S) :
2
5
CLOSED :
1 2 5 #SUP: 3
GENERATOR(S) :
1 2
1 5
CLOSED :
2 3 5 #SUP: 3
GENERATOR(S) :
2 3
3 5
CLOSED :
1 2 3 5 #SUP: 2
GENERATOR(S) :
1 2 3
1 3 5
======= List of frequent itemsets ============
ITEMSET : EMPTYSET #SUP: 5
ITEMSET : 1 #SUP: 4
ITEMSET : 2 #SUP: 4
ITEMSET : 3 #SUP: 4
ITEMSET : 5 #SUP: 4
ITEMSET : 1 2 #SUP: 3
ITEMSET : 1 3 #SUP: 3
ITEMSET : 2 3 #SUP: 3
ITEMSET : 1 5 #SUP: 3
ITEMSET : 2 5 #SUP: 4
ITEMSET : 3 5 #SUP: 3
ITEMSET : 1 2 3 #SUP: 2
ITEMSET : 1 2 5 #SUP: 3
ITEMSET : 1 3 5 #SUP: 2
ITEMSET : 2 3 5 #SUP: 3
ITEMSET : 1 2 3 5 #SUP: 2

   

In this example, the first lines of the first section indicates that the empty set is a closed itemset with a support of 5 and that it is the generator of itself. The following lines indicates that the itemset {3} is closed, has a support of 4 and is the generator of itself. The following lines indicates that the itemset {1, 3} is closed, has a support of 3 and that the itemset {1} is the only generator for that itemset. The following lines of this section indicates in the same way the remaining closed itemsets and their associated generators.

In the same example, the first lines of the second sections indicates that the empty set is a frequent itemset with a support of 5 transactions, that the itemset 1 is frequent with a support of 3 transactions and that the itemset {2} is frequent with a support of 4 transactions. In the same way, the following lines indicates all the other frequent itemsets.

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Implementation details

In the source code version of SPMF, there are two versions of Zart. The version "MainTestZart.java" keeps the result into memory. The version named "MainTestZart_saveToFile.java" saves the result to a file. In the graphical user interface and command line interface only the second version is offered.

Performance

The Zart algorithm is not a very efficient algorithm because it is based on Apriori. If someone only want to discover closed itemsets and do not need the information about generators, then he should instead use DCI_Closed or Charm, which are more efficient for closed itemset mining. However, in some cases it is desirable to discover closed itemset and their corresponding generators (for example to generate IGB association rules). For these cases, Zart is an appropriate algorithm.

Where can I get more information about the Zart algorithm?

The Zart algorithm is described in this paper:

L. Szathmary, A. Napoli, S. O. Kuznetsov. ZART: A Multifunctional Itemset Mining Algorithm. Laszlo Szathmary, Amedeo Napoli, Sergei O. Kuznetsov In: CLA, 2007.

Also, for a good overview of frequent itemset mining algorithms, you may read this survey paper.