Finding all occurrences of some sequential Pattern(s) by post-processing using the Occur algorithm (SPMF documentation)

This example explains how to run the OCCUR algorithm using the SPMF open-source data mining library.

How to run this example?

To run the implementation of OCCUR implemented by P. Fournier-Viger (PFV):

What is OCCUR?

OCCUR is a post-processing algorithm for finding all occurrences of some sequential pattern(s) in asequence database. This algorithm is designed to be applied on the patterns found by a sequential pattern mining algorithm such as CM-SPAM, CM-SPADE and PrefixSpan. The OCCUR algorithms compares the patterns with the sequences and list all occurrences of each pattern.

Why using OCCUR? Traditional sequential pattern mining algorithms find sequential patterns but do not indicate the exact occurrences where these patterns occur in sequences. By using OCCUR as a post-processing step, it is possible to the information about all these occurrences which may be useful for some applications.

To apply OCCUR, one need to have a database file (e.g. contextPrefixSpan.txt) and a sequential pattern file that has been generated by a sequential pattern mining algorithm using the optional parameter: "Show sequence ids?" set to "true".

What is the input of OCCUR?

The input of OCCUR is a (1) sequence database and a (2) set of sequential patterns previously found by sequential pattern mining algorithm.

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of distinct items. For example, the table shown below contains four sequences. The first sequence, named s0, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered.

ID Sequences
S0 (1), (1 2 3), (1 3), (4), (3 6)
S1 (1 4), (3), (2 3), (1 5)
S2 (5 6), (1 2), (4 6), (3), (2)
S3 (5), (7), (1 6), (3), (2), (3)

A sequential pattern is a sequence. A sequence SA = X1, X2, ... Xk, where X1, X2... Xk are itemsets is said to occur in another sequence SB = Y1, Y2, ... Ym, where Y1, Y2... Ym are itemsets, if and only if there exists integers 1 <= i1 < i2... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ... Xk ⊆ Yik. A file containing sequential patterns with the identifiers of sequences where they appear must be provided to the OCCUR algorithm as input. For example, the spmPatterns.txt file containing several lines. such as:

1 -1 #SUP: 4 #SID: 0 1 2 3
1 2 -1 #SUP: 2 #SID: 0 2
...

The first line is a sequential pattern (1), having a support of 4 sequences, appearing in the sequences s0, s1, s2, and s3. The second line is the sequential pattern (1 2), having a support of 2 sequences and occuring in sequence s0 and s2.

The file of sequential patterns must be generated by a sequential pattern mining algorithm using the optional parameter: "Show sequence ids?" set to "true".

What is the output of OCCUR?

OCCUR output a text file where sequential patterns are annotated with the list of their occurrences in each sequence where they appear. For example, the output contains several line such as:

1 -1 #SUP: 4 #SIDOCC: 0[0] [1] [2] 1[0] [3] 2[1] 3[2]
1 2 -1 #SUP: 2 #SIDOCC: 0[1] 2[1]

The first line indicate that the sequential pattern (1) has a support of 4 sequences, and appear in sequence s0 (in itemset 0, 1 and 2), in sequence s1 (itemset 0 and 3), in sequence s2 (itemset 1) and in sequence s3 (itemset 2). Other lines follow the same format.

Input file format

The input file format for the sequence database is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextOCCUR.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.

The input file format for the sequential pattern file is defined as follows. It is a text file. Each line is a frequent sequential pattern. Each item from a sequential pattern is a positive integer and items from the same itemset within a sequence are separated by single spaces. The value "-1" indicates the end of an itemset. On each line, the sequential pattern is first indicated. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the pattern as a number of sequences. Then, the keyword #SID: appears, followed by the identifiers of sequences where the pattern appears. For example, a few lines from the output file from the previous example are shown below:

1 -1 #SUP: 4 #SID: 0 1 2 3
1 2 -1 #SUP: 2 #SID: 0 2

The first line indicates that the frequent sequential pattern consisting of the itemset {1} has a support of 4 sequences, and appear in sequences s0, s1, s2, and s3.

Output file format

The output file format is a sequential pattern file with occurrence information, defined as follows. It is a text file. Each line is a frequent sequential pattern. Each item from a sequential pattern is a positive integer and items from the same itemset within a sequence are separated by single spaces. The value "-1" indicates the end of an itemset. On each line, the sequential pattern is first indicated. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the pattern as a number of sequences. Then, the keyword #SIDOCC: appears, followed by the identifier of a sequence where the pattern appears. Then, this is followed by a list of occurrences. Each occurrence is written between [ and ]. An occurrence is a list of itemset identifiers separated by spaces. There can be several sequence identifiers followed by their occurrences for each sequential pattern.

1 -1 #SUP: 4 #SIDOCC: 0[0] [1] [2] 1[0] [3] 2[1] 3[2]
1 2 -1 #SUP: 2 #SIDOCC: 0[1] 2[1]

The first line indicate that the sequential pattern (1) has a support of 4 sequences, and appear in sequence s0 (with three occurrences: iin itemset 0, 1 and 2), in sequence s1 (two occurrences: one in itemset 0 and an occurrence in itemset 3), in sequence s2 (an occurrence in itemset 1) and in sequence s3 (an occurrence in itemset 2). Other lines follow the same format.

Where can I get more information about OCCUR?

The OCCUR algorithm is not an algorithm published in any papers because it is a very simple algorithm. It is implemented in a quite efficient way, but could be optimized further if performance is an issue.

If you want to know more about sequential pattern mining , you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.