Mining Compressing Itemsets using the KRIMP Algorithm (SPMF documentation)

This example explains how to run the KRIMP algorithm using the SPMF open-source data mining library.

How to run this example?

What is KRIMP?

KRIMP is an algorithm for discovering compressing itemsets in transaction databases. It was proposed by Vreeken et al. (1993). It is designed to be applied after another frequent itemset mining algorithms such as FP-Growth. The idea is that KRIMP will then select the itemsets that describes the data "best" among those found by the other algorithm. These itemsets are called "compressing itemsets".

There are some other algorithms in SPMF for discovering compressing itemsets such as SLIM. This latter has the advantage over KRIMP that it can be applied directly without having to first apply another algorithm like FP-Growth, and SLIM was shown to give better results for downstream tasks. However, KRIMP is still an interesting algorithm and this is why it is implemented in SPMF.

What is the input of the KRIMP algorithm?

The input is a transaction database (aka binary context), and a file containing frequent itemsets that have been extracted by a frequet itemset mining algorithm in advance.

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

A frequent itemset is a set of items that appear together in many transactions. The support of an itemset is the number of transactions where it appears in the input database To apply the KRIMP algorithm, we need to provide a file containing frequent itemsets, which must have been extracted in advance by another algorithm such as FP-Growth. In this example, we will use the file patterns60.txt, which contains some frequent itemsets extracted in advance. The content of this file is presented in the table below. It lists frequent itemsets with their support values. For instance, the last line represents the itemset {1, 2, 3, 5} and indicates that its support is 2 (because the items {1, 2, 3, 5} appear together in two transactions, that is transactions t3 and t4).

itemsets support
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 2
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

What is the output of the KRIMP algorithm?

KRIMP is an algorithm for discovering compressing itemsets, that is a set of itemsets that "best" capture the characteristics of the database. The idea is that a set of itemsets should be good if it can be used to compress the database.

If we run KRIMP in this example, the result is this set of itemsets:

itemsets support
{2, 5} 4
{1, 3} 3
{3} 1
{4} 1

There are four itemsets that are discovered and together they provide a good compression of the database. For each itemset, a support value is given, which is how many times it appears in the database after the previous itemsets have been removed. For example, the itemset {2, 5} appears four times in the database (in trasactions t2, t3, t4 and t5), so its support is 4. Then, lets look at the next itemset. The itemset {1,3} has a support of 3 because it appears in transaction t1, t3 and t5. Then, lets look at the next itemsets. After removing the previous itemsets, it is found that the itemset {3} appears only in t2 and thus its support is 1. And similarly, it is found that itemset {4} has a support of 1.

It should be noted that this definition of support is slightly different from that of other frequent itemset mining algorithms.

How should I interpret the results?

In the results, each itemset is annotated with its support. The support of an itemset is how many times the itemset could be used to compress the ttransaction database. For example, the itemset {1,3} has a support of 3 because it can be used in transactions t1, t3 and t5.

Input file format

The input file format for the input database is defined as follows. It is a text file containing a transaction database. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

The input file format for the pattern file is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, here is the output file for this example. The first line indicates the frequent itemset consisting of the item 1 and it indicates that this itemset has a support of 3 transactions.

1 #SUP: 3
2 #SUP: 4
3 #SUP: 4
5 #SUP: 4
1 2 #SUP: 2
1 3 #SUP: 3
1 5 #SUP: 2
2 3 #SUP: 3
2 5 #SUP: 4
3 5 #SUP: 3
1 2 3 #SUP: 2
1 2 5 #SUP: 2
1 3 5 #SUP: 2
2 3 5 #SUP: 3
1 2 3 5 #SUP: 2

Output file format

The output file format is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, here is the output file for this example. The first line indicates the frequent itemset consisting of the items 2 and 5 and it indicates that this itemset has a support of 4 transactions.

2 5 #SUP: 4
1 3 #SUP: 3
3 #SUP: 1
4 #SUP: 1

Performance

The KRIMP algorithm was implemented by following the indications from the original paper as closely as possible. However, the post-acceptance pruning step was not implemented, as it did not seem to perform well in the initial experiments.

Where can I get more information about the KRIMP algorithm?

This is the paper describing KRIMP.

Vreeken et al. (2011) KRIMP: mining itemsets that compress

For a good overview of frequent itemset mining algorithms, you may read this survey paper.