Mining Top-K Periodic Patterns using the TRCT algorithm (SPMF documentation)

This example explains how to run the TRCT algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "TRCT" algorithm, (2) select the input file "contextPFPM.txt", (3) set the output file name (e.g. "output.txt"), (4) set the value of k and the maximum periodicity respectively to 5 and 3, and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run TRCT contextPFPM.txt output.txt 5 3 in a folder containing spmf.jar and the example input file contextPFPM.txt.
If you are using the source code version of SPMF, to run TRCT, launch the file "MainTestTRCT.java" in the package ca.pfv.SPMF.tests.

What is TRCT?

TRCT (Top-k Regular-frequent itemset mining based on Compressed Tidsets) is an efficient algorithm for discovering the top-K periodic frequent itemsets in a sequence of transactions (a transaction database). It was proposed by Amphawan et al. (2011) as an improvement over the MTKPP algorithm. Like MTKPP, TRCT only requires the user to set the number of patterns to be discovered (K) and a maximum periodicity threshold, without needing a minimum support threshold. This makes TRCT easier to use since the user does not need to know in advance what minimum support value is appropriate for a given dataset.

The key innovation of TRCT is its use of a compressed tidset representation that stores transaction ids (tids) in a compact form. Consecutive continuous tids are compressed by storing only the first and last tid of each run, which significantly reduces memory usage and speeds up tidset intersection operations, especially on dense datasets where items frequently appear in consecutive transactions.
Note that TRCT can return more than K patterns when there are ties (multiple patterns have the same support in a dataset).

Note that it is possible that TRCT returns more than K patterns when it is run in the cases where many itemsets have exactly the same support.

What is the input of the TRCT algorithm?

The input is a transaction database (a sequence of transactions) and two parameters set by the user:

a value K, which is the desired number of periodic patterns to be discovered (a positive integer)
a maximum periodicity threshold maxper (a positive integer)

Note that two optional parameters are also offered to specify constraints on the minimum and maximum number of items that patterns should contain (positive integers).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 7 transactions (t1, t2, ..., t7) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1 and 3. This database is provided as the file contextPFPM.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id	Items
t1	{1, 3}
t2	{5}
t3	{1, 2, 3, 4, 5}
t4	{2, 3, 4, 5}
t5	{1, 3, 4}
t6	{1, 3, 5}
t7	{2, 3, 5}

What is the output of the algorithm?

TRCT discovers the top-K periodic frequent itemsets, which are the K itemsets with the highest support that also satisfy the maximum periodicity constraint. An itemset is a group of items. To measure if an itemset is periodic, the algorithm calculates its periods. To explain this in more detail, it is necessary to introduce a few definitions.

The set of transactions containing an itemset X is denoted as g(X). For example, consider the itemset {1, 3}. The set g({1,3}) is equal to {t1, t3, t5, t6}. In other words, the itemset {1, 3} appears in the transactions t1, t3, t5 and t6.

To assess the periodicity of an itemset X, its list of periods is calculated. A period is the time in terms of number of transactions between two occurrences of an itemset in the database. For example, the periods of the itemset {1, 3} are {1, 2, 2, 1, 1}. The first period of {1,3} is 1 because {1,3} appears in the first transaction after the start of the database. The second period of {1,3} is 2 because the itemset appears in transaction t3, which is two transactions after t1. The third period is 2 because the itemset appears in t5, two transactions after t3. The fourth period is 1 because the itemset appears in t6, one transaction after t5. Finally, the fifth period is 1 because there is one transaction after the last occurrence of {1,3} in the database.

The maximum periodicity of an itemset is the largest period among all its periods. TRCT finds the K itemsets with the highest support whose maximum periodicity does not exceed the maxper threshold set by the user. If there are ties in support at the K-th rank, all tied patterns are returned, so the result may contain more than K patterns.

For example, if TRCT is run on the previous transaction database with k = 5 and maxper = 3, the TRCT algorithm finds the following 5 top-K periodic frequent itemsets:

itemset	support (number of transactions where the itemset appears)	maximum periodicity
{3}	6	2
{5}	5	2
{1}	4	2
{1, 3}	4	2
{3, 5}	4	3

How should I interpret the results?

Each top-K periodic itemset is annotated with its support (number of transactions where it appears) and its maximum periodicity. For example, the itemset {3} has a support of 6 because it appears in six transactions (t1, t3, t4, t5, t6, t7). Its maximum periodicity is 2 transactions, which means that it never goes more than 2 consecutive transactions without appearing. This indicates that {3} appears very periodically in the database. The algorithm returns the 5 itemsets with the highest support that satisfy the constraint that their maximum periodicity does not exceed 3 transactions.

Input file format

The input file format used by TRCT is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

3 1
5
3 5 1 2 4
3 5 2 4
3 1 4
3 5 1
3 5 2

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a top-K periodic frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. Then, the keyword "#MAXPER:" appears and is followed by a space and an integer indicating the maximum periodicity of the itemset.

For example, here is the output file for this example. The first line indicates that the itemset {3} is a top-K periodic frequent itemset, having a support of 6 transactions and a maximum periodicity of 2 transactions.

3 #SUP: 6 #MAXPER: 2
5 #SUP: 5 #MAXPER: 2
1 #SUP: 4 #MAXPER: 2
1 3 #SUP: 4 #MAXPER: 2
3 5 #SUP: 4 #MAXPER: 3

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of giving names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPFPM.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
3 1
5
3 5 1 2 4
3 5 2 4
3 1 4
3 5 1
3 5 2

In this file, the first line indicates that it is a file where names are given to items. Then, the second line indicates that item 1 is called "apple". The third line indicates that item 2 is called "orange". Then the following lines define seven transactions in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains patterns such as:

tomato #SUP: 6 #MAXPER: 2
bread #SUP: 5 #MAXPER: 2
apple #SUP: 4 #MAXPER: 2
apple tomato #SUP: 4 #MAXPER: 2
bread tomato #SUP: 4 #MAXPER: 3

Note that this feature could also be used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Performance

TRCT is designed to be both user-friendly (no minimum support threshold required) and efficient in terms of execution time and memory usage. The key innovation is the compressed tidset representation, which stores runs of consecutive transaction ids compactly. For example, if an itemset appears in transactions {5, 6, 7, 8, 12, 13, 14}, the normal tidset representation would store all seven transaction ids. The compressed tidset representation stores only four values: the start and end of the first run (5, -8), and the start and end of the second run (12, -14).

This compression achieves significant benefits on dense datasets where items frequently appear in consecutive transactions.

Where can I get more information about the TRCT algorithm?

This is the article proposing the TRCT algorithm:

Komate Amphawan, Philippe Lenca, Athasit Surarerks: "Efficient mining Top-k regular-frequent itemset using compressed tidsets" PAKDD 2011

The TRCT algorithm builds upon and improves the earlier MTKPP algorithm:

Komate Amphawan, Philippe Lenca, Athasit Surarerks: "Mining Top-K Periodic-Frequent Pattern from Transactional Databases without Support Threshold" International Conference on Advances in Information Technology (IAIT 2009)