Mining Top-K Periodic Patterns using the MTKPP algorithm (SPMF documentation)

This example explains how to run the MTKPP algorithm using the SPMF open-source data mining library.

How to run this example?

What is MTKPP?

MTKPP is an algorithm for discovering the top-K periodic frequent itemsets in a sequence of transactions (a transaction database). It was proposed by Amphawan et al. (2009). Unlike some other algorithms that require to set a minimum support threshold, MTKPP only requires the user to set the number of patterns to be discovered (K) and a maximum periodicity threshold. This makes MTKPP easier to use since the user does not need to know in advance what minimum support value is appropriate for a given dataset.
Note that MTKPP can return more than K patterns when there are ties (multiple patterns have the same support in a dataset.

What is the input of the MTKPP algorithm?

The input is a transaction database (a sequence of transactions) and two parameters set by the user:

Note that two optional parameters are also offered to specify constraints on the minimum and maximum number of items that patterns should contain (positive integers).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 7 transactions (t1, t2, ..., t7) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1 and 3. This database is provided as the file contextPFPM.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3}
t2 {5}
t3 {1, 2, 3, 4, 5}
t4 {2, 3, 4, 5}
t5 {1, 3, 4}
t6 {1, 3, 5}
t7 {2, 3, 5}

What is the output of the algorithm?

MTKPP discovers the top-K periodic frequent itemsets, which are the K itemsets with the highest support that also satisfy the maximum periodicity constraint. An itemset is a group of items. To measure if an itemset is periodic, the algorithm calculates its periods. To explain this in more detail, it is necessary to introduce a few definitions.

The set of transactions containing an itemset X is denoted as g(X). For example, consider the itemset {1, 3}. The set g({1,3}) is equal to {t1, t3, t5, t6}. In other words, the itemset {1, 3} appears in the transactions t1, t3, t5 and t6.

To assess the periodicity of an itemset X, its list of periods is calculated. A period is the time in terms of number of transactions between two occurrences of an itemset in the database. For example, the periods of the itemset {1, 3} are {1, 2, 2, 1, 1}. The first period of {1,3} is 1 because {1,3} appears in the first transaction after the start of the database. The second period of {1,3} is 2 because the itemset appears in transaction t3, which is two transactions after t1. The third period is 2 because the itemset appears in t5, two transactions after t3. The fourth period is 1 because the itemset appears in t6, one transaction after t5. Finally, the fifth period is 1 because there is one transaction after the last occurrence of {1,3} in the database.

The maximum periodicity of an itemset is the largest period among all its periods. MTKPP finds the K itemsets with the highest support whose maximum periodicity does not exceed the maxper threshold set by the user. If there are ties in support at the K-th rank, all tied patterns are returned, so the result may contain more than K patterns.

For example, if MTKPP is run on the previous transaction database with k = 5 and maxper = 3, the MTKPP algorithm finds the following 5 top-K periodic frequent itemsets:

itemset support (number of transactions where the itemset appears) maximum periodicity
{3} 6 2
{5} 5 2
{1} 4 2
{1, 3} 4 2
{3, 5} 4 3

How should I interpret the results?

Each top-K periodic itemset is annotated with its support (number of transactions where it appears) and its maximum periodicity. For example, the itemset {3} has a support of 6 because it appears in six transactions (t1, t3, t4, t5, t6, t7). Its maximum periodicity is 2 transactions, which means that it never goes more than 2 consecutive transactions without appearing. This indicates that {3} appears very periodically in the database. The algorithm returns the 5 itemsets with the highest support that satisfy the constraint that their maximum periodicity does not exceed 3 transactions.

Input file format

The input file format used by MTKPP is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

3 1
5
3 5 1 2 4
3 5 2 4
3 1 4
3 5 1
3 5 2

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a top-K periodic frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. Then, the keyword "#MAXPER:" appears and is followed by a space and an integer indicating the maximum periodicity of the itemset.

For example, here is the output file for this example. The first line indicates that the itemset {3} is a top-K periodic frequent itemset, having a support of 6 transactions and a maximum periodicity of 2 transactions.

3 #SUP: 6 #MAXPER: 2
5 #SUP: 5 #MAXPER: 2
1 #SUP: 4 #MAXPER: 2
1 3 #SUP: 4 #MAXPER: 2
5 3 #SUP: 4 #MAXPER: 3

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of giving names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPFPM.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
3 1
5
3 5 1 2 4
3 5 2 4
3 1 4
3 5 1
3 5 2

In this file, the first line indicates that it is a file where names are given to items. Then, the second line indicates that item 1 is called "apple". The third line indicates that item 2 is called "orange". Then the following lines define seven transactions in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains patterns such as:

tomato #SUP: 6 #MAXPER: 2
bread #SUP: 5 #MAXPER: 2
apple #SUP: 4 #MAXPER: 2
apple tomato #SUP: 4 #MAXPER: 2
bread tomato #SUP: 4 #MAXPER: 3

Note that this feature could also be used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Performance

MTKPP is designed to be more user-friendly than PFPM because the user does not need to set a minimum support threshold. MTKPP internally raises its support pruning threshold dynamically as more patterns are discovered, which allows it to prune the search space progressively and efficiently.

Where can I get more information about the MTKPP algorithm?

This is the article proposing the MTKPP algorithm:

Komate Amphawan, Philippe Lenca, Athasit Surarerks: “Mining Top-K Periodic-Frequent Pattern from Transactional Databases without Support Threshold” International Conference on Advances in Information Technology (2009)