Mining the Top-K Frequent Itemsets using the Apriori(top-k) Algorithm (SPMF documentation)

This example explains how to run the Apriori(top-k) algorithm using the SPMF open-source data mining library.

How to run this example?

What is Apriori(top-k)?

Apriori is an algorithm for discovering frequent itemsets (groups of items appearing frequently) in a transaction database. It was proposed by Agrawal & Srikant (1993).

Apriori(top-k) is a variation of the Apriori algorithm for discovering the top-k most frequent itemsets. In Apriori(top-k), the user can directly request to find the k most frequent itemsets instead of using the minsup parameter of Apriori.
This is more intuitive to use. However, the performance of Apriori(top-k) is worse than Apriori. NOTE: To find the top-k frequent itemsets more efficiently, it is recommended to use FPGrowth_itemsets(top-k), which is also offered in SPMF, and is more efficient.

What is the input of the Apriori(top-k) algorithm?

The input is a transaction database (aka binary context) and a threshold named k (a positive integer value).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the Apriori(top-k) algorithm?

Apriori(top-k) is an algorithm for discovering the k itemsets (group of items) that occur the most frequently in a transaction database (frequent itemsets). To use this algorithm, you have to set a value for the parameter k, which represents the number of itemsets to find.

For example, if Apriori(top-k) is run on the previous transaction database with k = 7, Apriori(top-k) produces the following result:

itemsets support
{1} 3
{2} 4
{3} 4
{5} 4
{1, 3} 3
{2, 3} 3
{2, 5} 4
{3, 5} 3
{2, 3, 5} 3

How should I interpret the results?

In the results, each itemset is annotated with its support. The support of an itemset, also called the frequency, is how many times the itemset appears in the transaction database. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5.

For k = 9, the itemsets presented in the previous table the top-k most frequent itemsets, because there are no other itemsets that have a greater support.

The parameter k can be adjusted to find more or less itemsets. It is to be noted that it is possible that the algorithm return more than k itemsets. This can happen if numerous itemsets have exactly the same support. And it is also possible that the algorithm returns less than k itemsets in the situation where k is set to a value that is larger than the number of itemsets in the database.

Input file format

The input file format used by Apriori(top-k) is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a top-k itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, here is the output file for this example. The first line indicates the frequent itemset consisting of the item 1 and it indicates that this itemset has a support of 3 transactions.

1 #SUP: 3
2 #SUP: 4
3 #SUP: 4
5 #SUP: 4
1 3 #SUP: 3
2 3 #SUP: 3
2 5 #SUP: 4
3 5 #SUP: 3
2 3 5 #SUP: 3

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Performance

Apriori is an important algorithms for historical reasons and also because it is and easy to learn. However, faster and more memory efficient algorithms have been proposed.
For efficiency, it is recommended to use more efficient algorithms like FPGrowth(top-k) instead of Apriori(top-k) or Apriori.

Implementation details

This implementation is a modified version of the standard version of Apriori algorithm from SPMF.

Optional parameter(s)

This implementation allows to specify additional optional parameter(s) :

These parameter(s) are available in the GUI of SPMF and also in the example(s) "MainTestAprioriTOPK_..._saveToFile .java" provided in the source code of SPMF.

The parameter(s) can be also used in the command line with the Jar file. If you want to use these optional parameter(s) in the command line, it can be done as follows. Consider this example:
java -jar spmf.jar run Apriori(top-k) contextPasquier99.txt output.txt 9 true 2
This command means to apply the algorithm on the file "contextPasquier99.txt" and output the results to "output.txt". Moreover, it specifies that the user wants to find patterns for k=9, and that transaction ids should be output for each pattern found. Moreover, only frequent itemsets having no more than 2 items should be output.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPasquier99.txt". Here we have modified the file to give names to the items: 

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define transactions in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

orange tomato bread #SUP: 3
orange bread #SUP: 4
apple orange tomato bread #SUP: 2

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Where can I get more information about the Apriori algorithm?

This is the technical report published in 1994 describing Apriori algorithm.

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.

For a good overview of frequent itemset mining algorithms, you may read this survey paper.