Mining Local Periodic Frequent Patterns using the LPP-Growth algorithm (SPMF documentation)

This example explains how to run the LPP-Growth algorithm using the SPMF open-source data mining library.

How to run this example?

What are these algorithms?

LPP-Growth, LPPM_breadth and LPPM_depth are algorithms for discovering locally periodic patterns in a sequence of events or transactions, with or without timestamps. These algorithms were proposed by Fournier-Viger, P. Yang, P. et al. (2020). Compared to other periodic pattern mining algorithm, these algorithms are designed to find patterns that are locally periodic rather than always periodic. This is interesting for application such as analyzing customer behavior. For example, it may be found in a sequence of transactions made by a customer that the customer periodically bought wine and cheese during the summer rather than during the whole year.

The three algorithms have the same input and output. But they use different strategies and data structures to find the result. Thus, they have different runtimes and memory requirement.

What is the input ?

The input is a transaction database (a sequence of transactions or events) and four parameters that are set by the user:

A transaction database is a set of transactions. Each transaction is a set of items (symbols) that has a timestamp. For example, consider the following transaction database. It contains 8 transactions (t1, t2, ..., t10) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 3 and 5. This database is provided as the file contextLPP.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted according to some total order in each transaction such as the alphabetical order.

Transaction id

Items

Timestamp

1

1 2 3 5

6

2

1 2 3 4

7

3

1 2 5

9

4

3 5

10

5

2 4 5

12

6

2 3 5

15

7

2 3 4 5

18

8

1 3

22

9

1 2 4

23

10

2

25

 

What is the output of the algorithms?

LPPM_breadth, LPPM_depth and LPPGrowth are algorithms for discovering local periodic patterns (LPPs), which are also called local periodic itemsets. An itemset is a group of items. The algorithms find itemsets that appears periodically in a sequence of transactions and their periodic time-intervals. To measure if an itemset is periodic, the algorithm calculates its periods. To explain this in more details, it is necessary to introduce a few definitions.

The set of timestamps that an itemset X appears is denoted as ts(X). For example, consider the itemset {2, 5}. The set ts({2,5}) is equal to {6, 9, 12, 15, 18}. In other words, the itemset {2, 5} appears in the timestamps 6, 9, 12, 15 and 18. It is also said that these are five occurrences of the itemset {2,5}.

Now, to assess the periodicity of an itemset X, its list of periods is calculated. A period is the time in terms of difference of timestamps between two occurences of an itemset appears in the database (see the paper for the formal definition). For example, the periods of the itemset {2, 5} are {3,3,3,3,7}. The first period of {2,5} is 3 because {2,3} appears in timestamp 6, which is one timestamp before timestamp 9. The second period of {2,5} is 3 because {2,3} appears in timestamp 9, which is one timestamp before timestamp 12. The third period of {2,5} is 3 because {2,3} appears in timestamp 12, which is one timestamp before timestamp 15. The fourth period of {2,5} is 3 because {2,3} appears in timestamp 15, which is one timestamp before timestamp 18.  Finally, the fifth period of {2,3} is 7 because {2,3} appears in timestamp 18, which is one timestamp before last timestamp 25 in the database.

The algorithms utilize the list of periods of an itemset X to calculate its maximum periodicity and a measure called maxSoPer that measures by how far the periods deviates from the maxPer threshold set by the user .Then, it uses adynamic update mechanism of spillovers of  periods to find all periodic time-intervals having a duration that is no less than the minimum duration threshold. The maximum periodicity of an itemset is the largest period among the periods of the itemset. The maxSoPer of an itemset X is the cummulative sum of the difference between each period length and maxper (see the paper for a formal definition and some detailed examples).

The algorithms find LPPs and their periodic time-intervals having a duration that is no less than the minDur threshold.

For example, if we set maxPer = 3, minDur = 7, maxSoPer = 2. the 4 LPPs are found:

LPP

Periodic time-intervals

{2}

[6,25]

{2,5}

[6,18]

{3}

[6,18]

{5}

[6,18]

Input file format

It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line. Items are followed by the character "|" and then the timestamp of the transaction. Note that it is possible to run LPPM_breadth or LPPM_depth or LPPGrowth on a database that has no timestamps. In that case, the boolean parameter (the fourth parameter) should be set to true to indicate that the dataset has no timestamps.

In the previous example, the input file is defined as follows:
1 2 3 5|6
1 2 3 4|7
1 2 5|9
3 5|10
2 4 5|12
2 3 5|15
2 3 4 5|18
1 3|22
1 2 4|23
2|25

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found <a rel="nofollow" href="http://weka.wikispaces.com/ARFF+%28stable+version%29">here</a>. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a stable periodic frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#TIME-INTERVALS:" appears, which is followed by several intervals indicating the periodic time-intervals of the itemset.

2 #TIME-INTERVALS: [ 6 , 25 ]  
2 5 #TIME-INTERVALS: [ 6 , 18 ]  
3 #TIME-INTERVALS: [ 6 , 18 ]  
5 #TIME-INTERVALS: [ 6 , 18 ] 

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of giving names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextSPPGrowth.txt". Here we have modified the file to give a name to each item: 

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
2 3 5|1
2 3 4 5|3
2 3 4 5|3
1 2 3 4 5|5
1 3 5|6
2 3 5|7
1 3 4|9
1 3 5|10

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define eight transactions in the SPMF format.

Then, if we apply an algorithm on this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

bread #TIME-INTERVALS: [ 6 , 18 ]
orange bread #TIME-INTERVALS: [ 6 , 18 ]
tomato #TIME-INTERVALS: [ 6 , 18 ]
orange #TIME-INTERVALS: [ 6 , 25 ]

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Performance

The proposed algorithms can discover patterns in non-predefined time intervals, and each algorithm was found to perform best in some situations. (see paper for more detail on experimental result)

Where can I get more information about the algorithms?

This is the article proposing these algorithms:

Fournier-Viger, P., Yang, P., Kiran, R. U., Ventura, S., Luna, J. M.. (2020). Mining Local Periodic Patterns in a Discrete Sequence. Information Sciences, Elsevier, to appear.  [ppt]
DOI: 10.1016/j.ins.2020.09.044