Mining Frequent Itemsets with Multiple Support Thresholds Using the MSApriori Algorithm (SPMF documentation)

This example explains how to run the MSApriori algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "MSApriori" algorithm, (2) select the input file "contextIGB.txt", (3) set the output file name (e.g. "output.txt"), (4) set beta = 0.4 and LS = 0.2, and (5) click "Run algorithm".
If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the example input file contextIGB.txt:
java -jar spmf.jar run MSApriori contextIGB.txt output.txt 0.4 0.2
If you are using the source code version of SPMF, launch the file "MainTestMSApriori_saveToFile.java" in the package ca.pfv.SPMF.tests.

What is MSApriori?

MSApriori is an algorithm for mining frequent itemsets by using multiple minimum supports. It is a generalization of the Apriori algorithm, which uses a single minimum support threshold.

The idea behind MSApriori is that different minimum supports can be used to account for the fact that some items are less frequent than others in a dataset.

There are two implementations of MSApriori in SPMF. The first one is called "MSApriori". The second one is an alternative implementation called "MSApriori(Srinivas)", which uses a hashing-based optimization.

What is the input of this algorithm?

The input of MSApriori is a transaction database and two parameters named beta (a value between 0 and 1) and LS (a value between 0 and 1). These parameters are used to determine a minimum support for each item.

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t6) and 5 items (1, 2, 3, 4, 5). The first transaction, for instance, represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. Note that an item is not allowed to appear twice in the same transaction and that the items of a transaction are assumed to be sorted in lexicographical order.

Transaction id	Items
t1	{1, 2, 4, 5}
t2	{2, 3, 5}
t3	{1, 2, 4, 5}
t4	{1, 2, 3, 5}
t5	{1, 2, 3, 4, 5}
t6	{2, 3, 4}

What is the output of this algorithm?

The output of MSApriori is the set of all frequent itemsets contained in the database.

Unlike the original Apriori algorithm, MSApriori uses multiple minimum support thresholds instead of just one. More precisely, MSApriori uses one minimum support value for each item. Because it would be time-consuming to set a minimum support threshold for each item of a large database, the thresholds are determined automatically from two user-specified parameters named beta (written B below, 0 <= B <= 1) and LS (0 <= LS <= 1).

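The minimum support of an item k is then defined as the greater of the following two values:

LS
B x f(k), where f(k) is the number of transactions containing the item k.

Both values must be expressed on the same scale (as transaction counts or as fractions of the database) before they are compared.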

Note that if B is set to 0, every item receives LS as its minimum support. There is then a single minimum support for all items, which is equivalent to running the regular Apriori algorithm.
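
The rule above can be sketched in a few lines of Python. This is only an illustration and not the SPMF implementation: the function name item_minimum_supports is hypothetical, and f(k) is assumed here to be the fraction of transactions containing k so that it can be compared directly with LS.

```python
from fractions import Fraction

# Example database from this page (contextIGB.txt), one set per transaction.
DATABASE = [
    {1, 2, 4, 5},
    {2, 3, 5},
    {1, 2, 4, 5},
    {1, 2, 3, 5},
    {1, 2, 3, 4, 5},
    {2, 3, 4},
]


def item_minimum_supports(database, beta, ls):
    """Return {item: minimum support} using MIS(k) = max(LS, beta * f(k)).

    Assumption: f(k) is the fraction of transactions containing item k,
    so that both values are on the same 0..1 scale.
    """
    n = len(database)
    counts = {}
    for transaction in database:
        for item in transaction:
            counts[item] = counts.get(item, 0) + 1
    return {
        item: max(ls, beta * Fraction(count, n))
        for item, count in sorted(counts.items())
    }


# beta = 0.4 and LS = 0.2, written as exact fractions to avoid rounding issues.
mis = item_minimum_supports(DATABASE, Fraction(2, 5), Fraction(1, 5))
for item, threshold in mis.items():
    print(item, float(threshold))
```

Under this assumption, item 2 (present in every transaction) receives a threshold of 2/5, while items 1, 3 and 4 receive 4/15. Calling the function with beta = 0 gives every item the threshold LS.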

The support of an itemset is the number of transactions containing the itemset divided by the total number of transactions. An itemset is a frequent itemset if its support is greater than or equal to the smallest of the minimum support thresholds of its items.
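
As a small illustration of this definition, the following Python sketch checks whether an itemset is frequent given per-item thresholds. The function names and the threshold values passed to them are hypothetical and chosen only for this example.

```python
# Example database from this page (contextIGB.txt), one set per transaction.
DATABASE = [
    {1, 2, 4, 5},
    {2, 3, 5},
    {1, 2, 4, 5},
    {1, 2, 3, 5},
    {1, 2, 3, 4, 5},
    {2, 3, 4},
]


def support(itemset, database):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in database if itemset <= t) / len(database)


def is_frequent(itemset, database, mis):
    """An itemset is frequent if its support reaches the smallest
    minimum support among the thresholds of its items."""
    threshold = min(mis[item] for item in itemset)
    return support(itemset, database) >= threshold


# Hypothetical thresholds: item 3 is given a lower minimum support than item 1.
print(is_frequent({1, 3}, DATABASE, {1: 0.6, 3: 0.3}))
# With the same threshold of 0.6 for both items, as in regular Apriori.
print(is_frequent({1, 3}, DATABASE, {1: 0.6, 3: 0.6}))
```

The itemset {1, 3} appears in 2 of the 6 transactions, so it passes the check only when one of its items has a sufficiently low threshold.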

Why is MSApriori useful? It is useful because it allows discovering frequent itemsets containing rare items (if their minimum support is set low).

If we run MSApriori on the previous transaction database with beta = 0.4 and LS = 0.2, we obtain the following result:

1 supp: 4
2 supp: 6
3 supp: 4
4 supp: 4
5 supp: 5
1 2 Support: 4
1 3 Support: 2
1 4 Support: 3
1 5 Support: 4
2 3 Support: 4
2 4 Support: 4
2 5 Support: 5
3 4 Support: 2
3 5 Support: 3
4 5 Support: 3
1 2 3 Support: 2
1 2 4 Support: 3
1 2 5 Support: 4
1 3 5 Support: 2
1 4 5 Support: 3
2 3 4 Support: 2
2 3 5 Support: 3
2 4 5 Support: 3
1 2 3 5 Support: 2
1 2 4 5 Support: 3

Note that here the support is expressed as an integer value, which represents the number of transactions containing the itemset. For example, the itemset {2, 3, 5} has a support of 3 because it appears in three transactions, namely t2, t4 and t5. This integer value can be converted to a percentage by dividing it by the total number of transactions.
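
The result above can be cross-checked with a brute-force sketch in Python. It enumerates every possible itemset instead of generating candidates level by level as MSApriori does, so it is only suitable for tiny databases. The function name is hypothetical, and f(k) is assumed to be the fraction of transactions containing k.

```python
from fractions import Fraction
from itertools import combinations

# Example database from this page (contextIGB.txt), one set per transaction.
DATABASE = [
    {1, 2, 4, 5},
    {2, 3, 5},
    {1, 2, 4, 5},
    {1, 2, 3, 5},
    {1, 2, 3, 4, 5},
    {2, 3, 4},
]


def mine_brute_force(database, beta, ls):
    """Return {itemset: support count} for all frequent itemsets.

    Brute-force enumeration, not the MSApriori candidate generation.
    Assumption: f(k) is the fraction of transactions containing item k.
    """
    n = len(database)
    items = sorted(set().union(*database))
    mis = {
        k: max(ls, beta * Fraction(sum(1 for t in database if k in t), n))
        for k in items
    }
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            count = sum(1 for t in database if set(itemset) <= t)
            # Compare the relative support with the lowest item threshold.
            if Fraction(count, n) >= min(mis[k] for k in itemset):
                result[itemset] = count
    return result


frequent = mine_brute_force(DATABASE, Fraction(2, 5), Fraction(1, 5))
for itemset, count in frequent.items():
    print(*itemset, "#SUP:", count)
```

Each printed line gives the items of a frequent itemset followed by its support count, which can be compared with the list above.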

Input file format

The input file format of MSApriori is defined as follows. It is a text file in which each line represents a transaction. The items of the transaction are listed on the line. An item is represented by a positive integer, and each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For the previous example, the input file is defined as follows:

1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4

Consider the first line. It means that the first transaction is the itemset {1, 2, 4, 5}. The following lines follow the same format.
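
To make the format concrete, here is a minimal Python sketch that reads text in this format into lists of integers. The function name parse_transactions is hypothetical, and the sketch only handles what is described above (it skips blank lines and rejects duplicate items).

```python
INPUT_TEXT = """1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4
"""


def parse_transactions(text):
    """Parse one transaction per line, items separated by single spaces."""
    transactions = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        items = [int(token) for token in line.split()]
        if len(set(items)) != len(items):
            raise ValueError(f"duplicate item on line {line_number}")
        transactions.append(items)
    return transactions


transactions = parse_transactions(INPUT_TEXT)
print(len(transactions), "transactions; first one:", transactions[0])
```

To read an actual file, the same function can be applied to the content returned by open("contextIGB.txt").read().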

Output file format

The output file format of MSApriori is defined as follows. It is a text file in which each line represents a frequent itemset. On each line, the items of the itemset are listed first. Each item is represented by an integer, followed by a single space. After all the items, the keyword "#SUP:" appears, followed by an integer value indicating the support of that itemset.

1 #SUP: 4
2 #SUP: 6
3 #SUP: 4
4 #SUP: 4
5 #SUP: 5
1 2 #SUP: 4
1 3 #SUP: 2
1 4 #SUP: 3
1 5 #SUP: 4
2 3 #SUP: 4
2 4 #SUP: 4
2 5 #SUP: 5
3 4 #SUP: 2
3 5 #SUP: 3
4 5 #SUP: 3
1 2 3 #SUP: 2
1 2 4 #SUP: 3
1 2 5 #SUP: 4
1 3 5 #SUP: 2
1 4 5 #SUP: 3
2 3 4 #SUP: 2
2 3 5 #SUP: 3
2 4 5 #SUP: 3
1 2 3 5 #SUP: 2
1 2 4 5 #SUP: 3

For example, the first line indicates that the itemset {1} has a support of 4 transactions. The following lines follow the same format.
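
A line of this output format can be split into its items and its support as in the following Python sketch. The function name parse_output_line is hypothetical and not part of SPMF.

```python
def parse_output_line(line):
    """Split a line such as '1 2 #SUP: 4' into ((1, 2), 4)."""
    items_part, separator, support_part = line.partition("#SUP:")
    if not separator:
        raise ValueError("missing #SUP: keyword")
    itemset = tuple(int(token) for token in items_part.split())
    return itemset, int(support_part)


itemset, support_count = parse_output_line("1 2 4 5 #SUP: 3")
# The support count can be turned into a ratio by dividing it by the
# number of transactions (6 in the example database).
print(itemset, support_count, support_count / 6)
```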

Optional parameter(s): constraints on the size of itemsets

Sometimes, there may be just too many itemsets, and itemsets containing many items may not be interesting. Thus, it is also possible to specify an optional parameter in the user interface of SPMF:

Max pattern length (integer): This parameter sets the maximum number of items that an itemset may contain. If it is not set, there is no limit on the number of items.

If you are using the command line interface of SPMF, this optional parameter can be used by adding it at the end of the command. For example, the following command runs the above example to find only the frequent itemsets having 2 items or fewer:
java -jar spmf.jar run MSApriori contextIGB.txt output.txt 0.4 0.2 2
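
If the command is launched from a script, the argument list can be assembled as in the Python sketch below. The helper build_msapriori_command is hypothetical. It only reproduces the argument order shown on this page and does not run anything by itself.

```python
def build_msapriori_command(input_file, output_file, beta, ls, max_length=None):
    """Build the argument list for the command shown above.

    The optional maximum pattern length is appended last, as described
    in this section, and omitted when it is None.
    """
    command = [
        "java", "-jar", "spmf.jar", "run", "MSApriori",
        input_file, output_file, str(beta), str(ls),
    ]
    if max_length is not None:
        command.append(str(max_length))
    return command


command = build_msapriori_command("contextIGB.txt", "output.txt", 0.4, 0.2, max_length=2)
print(" ".join(command))
```

The resulting list could then be passed to subprocess.run(command, check=True), assuming Java is installed and that spmf.jar and the input file are in the current folder.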

Optional feature: giving names to items

Some users have requested the ability to give names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as its first line, followed by several lines that define the names of the items in your file. For example, consider the example database "contextIGB.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4

In this file, the first line indicates that it is a file where names are given to items. The second line indicates that item 1 is called "apple", the third line indicates that item 2 is called "orange", and so on for the remaining items. The following lines then define the six transactions in the SPMF format.

Then, if we apply MSApriori to this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

orange tomato bread #SUP: 3
orange milk bread #SUP: 3
apple orange tomato bread #SUP: 2
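
The effect of the @ITEM lines can be illustrated with the following Python sketch, which reads the name definitions and rewrites a numeric output line with item names. It is not the SPMF ResultConverter class, and both function names are hypothetical.

```python
NAMED_FILE = """@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
2 3 5
"""


def read_item_names(text):
    """Collect {item id: name} from lines of the form @ITEM=<id>=<name>."""
    names = {}
    for line in text.splitlines():
        if line.startswith("@ITEM="):
            _, item_id, name = line.split("=", 2)
            names[int(item_id)] = name
    return names


def rename_pattern(output_line, names):
    """Replace the item numbers of an output line by their names."""
    items_part, separator, support_part = output_line.partition("#SUP:")
    renamed = " ".join(names[int(token)] for token in items_part.split())
    return f"{renamed} {separator}{support_part}"


names = read_item_names(NAMED_FILE)
print(rename_pattern("2 3 5 #SUP: 3", names))
```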

Note that this feature can also be used from the source code of SPMF through the ResultConverter class. However, there is currently no example provided for using it from the source code.

Performance

MSApriori is one of the first algorithms for mining itemsets with multiple minimum support thresholds. It is not the most efficient algorithm for this task because it is based on Apriori and thus suffers from the same limitations. If performance is important, it is recommended to use CFPGrowth++, which is based on FPGrowth and is more efficient.

Note that there is one important difference between the input of CFPGrowth++ and the input of MSApriori in SPMF. The MSApriori implementation derives the multiple minimum supports from the LS and beta parameters. The CFPGrowth++ implementation instead reads a list of minimum support values stored in a text file.

Where can I get more information about the MSApriori algorithm?

This article describes the MSApriori algorithm:

B. Liu, W. Hsu, Y. Ma, "Mining Association Rules with Multiple Minimum Supports" Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-99), August 15-18, 1999, San Diego, CA, USA.

For a good overview of frequent itemset mining algorithms, you may read this survey paper.