Mining Compressing Itemsets using the SLIM Algorithm (SPMF documentation)

This example explains how to run the SLIM algorithm using the SPMF open-source data mining library.

How to run this example?

What is SLIM?

SLIM is an algorithm for discovering compressing itemsets in transaction databases. It was proposed by Smets et al. (2012). It is designed to find the itemsets that best describes the input database. These itemsets are called "compressing itemsets". They are supposed to capture the most important characteristics of the database.

There are some other algorithms in SPMF for discovering compressing itemsets such as KRIMP. But Slim can be viewed as superior algorithm to KRIMP, especially in terms of efficiency as KRIMP must be applied by postprocessing after another frequent itemset mining algorithm.

What is the input of the SLIM algorithm?

The input is a transaction database (aka binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the SLIM algorithm?

SLIM is an algorithm for discovering compressing itemsets (group of items) that generally occur frequently in a transaction database or capture well its characteristics. The idea is that a set of itemsets should be good if it can be used to compress the database.

For example, if SLIM is run on the previous transaction database, SLIM produces the following result:

itemsets support
{1, 2, 3, 5} 2
{1, 3, 4} 1
{2, 5} 2
{3} 1

There are four itemsets that are discovered and together they provide a good compression of the database. For each itemset, a support value is given, which is how many times it appears in the database after the previous itemsets have been removed. For example, the itemset {1,2,3,5} appears two times in the database (in trasactions t3 and t5), so its support is 2. Then, lets look at the next itemset. The itemset {1,3,4} has a support of 1 because it appears only in transaction t1. Then, lets look at the next itemset. After removing the previous itemsets, the itemset {2,5} only appears in transactions t2 and t4. Thus its support is 2. Lastly, after removing all the previous itemsets, it is found that the last itemset {3} has a support of 1 and only appears in t2.

It should be noted that this definition of support is slightly different from that of other frequent itemset mining algorithms.

How should I interpret the results?

In the results, each itemset is annotated with its support. The support of an itemset is how many times the itemset could be used to compress the ttransaction database. For example, the itemset {1,2,3,5} has a support of 2 because it can be used in transactions t3 and t5.

Input file format

The input file format for SLIM is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, here is the output file for this example. The first line indicates the frequent itemset consisting of the items 1,2, 3 and 5 and it indicates that this itemset has a support of 2 transactions.

1 2 3 5 #SUP: 2
1 3 4 #SUP: 1
2 5 #SUP: 2
3 #SUP: 1

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPasquier99.txt". Here we have modified the file to give names to the items: 

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define transactions in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

apple orange tomato bread #SUP: 2
apple tomato milk #SUP: 1
orange bread #SUP: 2
tomato #SUP: 1

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Optional parameter(s):

If the SLIM algorithms is running for a too long time, a solution is to set a maximum number of iterations by using an optional parameter for SLIM in the user interface of SPMF:

If you are using the command line interface of SPMF, it is also possible to use this optional parameter by adding it at the end of the command. For example:
java -jar spmf.jar run SLIM contextPasquier99.txt output.txt 500
means to run the SLIM algorithm for the above example but with a maximum of 500 iterations.

Reducing the number of iterations will increase the speed but may also reduce the quality of the result in that the itemsets discovered may not be the best for capturing the characteristics of the database.

Performance

The SLIM algorithm was implemented by following the indications from the original paper as closely as possible. But, the post-acceptance pruning step was not implemented.

The performance of SLIM can be adjusted by setting the optional max iterations parameter.

Where can I get more information about the SLIM algorithm?

This is the paper proposing the SLIM algorithm:

Smets et al. (2012) SLIM: Directly Mining Descriptive Patterns

For a good overview of frequent itemset mining algorithms, you may read this survey paper.