Mining Sequential Patterns with Time Constraints from a Time-Extended Sequence Database (SPMF documentation)

This example explains how to run the Hirate-Yamana algorithm using the SPMF open-source data mining library.

How to run this example?

What is this algorithm?

The Hirate-Yamana, 2006 algorithm is an algorithm for discovering frequent sequential patterns respecting some time-constraints to filter uninteresting patterns.

The idea of using time constraints is interesting because it can greatly reduce the number of patterns found and it is also faster and use less memory than if all patterns are discovered.

Note that in our implementation, the Hirate & Yamana algorithm is not implemented as a standalone algorithm. The features of the Hirate-Yamana 2006 algorithm are rather integrated in the Fournier-Viger et al. (2008) algorithm that combines features from several other sequential pattern mining algorithms.

What is the input?

The input is a time-extended sequence database (as defined by Hirate-Yamana, 2006) and some constraints.

A time-extended sequence database is a set of time-extended sequences. A time-extended sequences is a list of itemsets (groups of items). Each itemset is anotated with a timestamp that is an integer value. Note that it is assumed that an item should not appear more than once in an itemset and that items in an itemset are lexically ordered.

For example, consider the following time-extended sequence database provided in the file contextSequencesTimeExtended.txt of the SPMF distribution. The database contains 4 time-extended sequences. Each sequence contains itemsets that are annotated with a timestamp. For example, consider the sequence S1. This sequence indicates that itemset {1} appeared at time 0. It was followed by the itemset {1, 2, 3} at time 1. This latter itemset was followed by the itemset {1 2} at time 2.

ID Sequences
S1 (0, 1), (1, 1 2 3}), (2, 1 3)
S2 (0, 1 ) (1, 1 2 ), (2, 1 2 3), (3, 1 2 3 )
S3 (0, 1 2), (1, 1 2 )
S4 (0, 2), (1, 1 2 3 )

The algorithms discovers time-extended sequential patterns that are common to several sequences. To do that, the user needs to provide five constraints (see the paper by Hirate & Yamana, 2006 for full details):

What is the output?

The output is a set of time-extended sequential patterns meeting the five constraints given by the user. For example, if we run the algorithm with minsup= 55 %, min_time_interval = 0, max_time_interval = 2, min_whole_interval = 0, max_whole_interval = 2, we obtain the following results:

ID Sequential Patterns Support
S1 (0, 3) 75 %
S2 (0, 2 3) 75 %
S3 (0, 2) 100 %
S4 (0, 1 2 3) 75%
S5 (0, 1 2) 100 %
S6 (0, 1 3) 75 %
S7 (0, 1) 100 %
S8 (0, 2), (1, 3) 75 %
S9 (0, 2), (1, 1 2) 75 %
S10 (0, 2), (1, 1 3) 75 %
S11 (0, 2), (1, 1) 100 %
S12 (0, 2), (1, 2) 75 %
S13 (0, 1), (1, 1 2) 75 %
S14 (0, 1), (1, 1) 75 %
S15 (0, 1), (1, 2) 75 %
S16 (0, 1 2), (1, 1) 75 %

For instance, the pattern S16 indicates that the items 1 and 2 were followed by item 1 one time unit after. This pattern has a support of 75 % because it appears in S1, S2 and S3. It is important to note that the timestamps in the sequential patterns found are relative. For example, the pattern S16 is considered to appear in S1, S2 and S3 because {1} appears one time unit after the itemset {1, 2} in all of these sequences, even though the timestamps do not need to be the same in all of these sequences.

Input file format

The input file format is defined as follows. It is a text file where each line represents a time-extended sequence from a sequence database. Each line is a list of itemsets, where each itemset has a timestamp represented by a positive integer and each item is represented by a positive integer. Each itemset is first represented by it timestamp between the "<" and "> symbol. Then, the items of the itemset appear separated by single spaces. Finally, the end of an itemset is indicated by "-1". After all the itemsets, the end of a sequence (line) is indicated by the symbol "-2". Note that it is assumed that items are sorted according to a total order in each itemset and that no item appears twice in the same itemset.

For example, the input file "contextSequencesTimeExtended.txt" contains the following four lines (four sequences).

<0> 1 -1 <1> 1 2 3 -1 <2> 1 3 -1 -2
<0> 1 -1 <1> 1 2 -1 <2> 1 2 3 -1 <3> 1 2 3 -1 -2
<0> 1 2 -1 <1> 1 2 -1 -2
<0> 2 -1 <1> 1 2 3 -1 -2

Consider the first line. It indicates that at time "0" the itemset {1} appeared, followed by the itemset {1, 2, 3} at time 1, then followed by the itemset {1, 3} at time 2. Note that timestamps do not need to be consecutive integers. But they should increase for each succesive itemset within a sequence. The second, third and fourth line follow the same format.

Output file format

The output file format is defined as follows. It is a text file. Each line is a time-extended frequent sequential pattern. Each line starts by listing the itemsets of the sequential pattern, where each itemset has a relative timestamp represented by a positive integer between the "<" and "> symbol. Then the timestamp is followed by each item in the itemset, each represented by a positive integer. The items of the itemset appear separated by single spaces and the symbol "-1" indicates the end of an itemset. Finally, after all the itemsets of a sequential pattern, the keyword "#SUP:" is followed by an integer indicating the support of the pattern as a number of sequences. For example, here is a two lines from the output file from the previous example:

<0> 1 2 -1 <1> 1 3 -1 #SUP: 2
<0> 1 2 -1 <1> 1 2 -1 #SUP: 2

Consider the first line. It represents a sequential pattern having the itemset {1, 2} with a relative timestamp of 0, followed by the itemset {1, 3} one time unit later. This pattern has a support of 2 sequences. The second line follow the same format.

Implementation details

In this implementation, we have followed the Hirate & Yamana (2006) algorithm closely. The only difference is that we did not keep the idea of interval itemization function for discretization. But we have keep the core idea which is to use the time constraints.

Note that the Hirate & Yamana algorithm is an extension of the PrefixSpan algorithm. We have implemented it based on our implementation of PrefixSpan.

Where can I get more information about this algorithm?

The Hirate & Yamana algorithm is described in this paper

Yu Hirate, Hayato Yamana (2006) Generalized Sequential Pattern Mining with Item Intervals. JCP 1(3): 51-60.

The implementation of the algorithm in SPMF is part of the Fournier-Viger (2008) algorithm, which is described in this paper:

Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.