Mining Frequent Sequential Patterns Using the SPAM Algorithm (SPMF documentation)
This example explains how to run the SPAM algorithm using the SPMF open-source data mining library.
How to run this example?
To run the implementation of SPAM by P. Fournier-Viger (PFV):
- If you are using the graphical interface, (1) choose the "SPAM" algorithm, (2) select the input file "contextPrefixSpan.txt", (3) set the output file name (e.g. "output.txt"), (4) set minsup = 50% and maximum pattern length = 100, and (5) click "Run algorithm".
- If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the example input file contextPrefixSpan.txt:
java -jar spmf.jar run SPAM contextPrefixSpan.txt output.txt 50% 100
- If you are using the source code version of SPMF, launch the file "MainTestSPAM.java" in the package ca.pfv.SPMF.tests.
To run the implementation of SPAM by A. Gomariz Peñalver (AGP):
- If you are using the graphical interface, (1) choose the "SPAM_AGP" algorithm, (2) select the input file "contextPrefixSpan.txt", (3) set the output file name (e.g. "output.txt"), (4) set minsup = 50% and (5) click "Run algorithm".
- If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the example input file contextPrefixSpan.txt:
java -jar spmf.jar run SPAM_AGP contextPrefixSpan.txt output.txt 50%
- If you are using the source code version of SPMF, launch the file "MainTestSPAM_AGP_FatBitMap_saveToFile.java" in the package ca.pfv.SPMF.tests (other variations are also available in the source code).
What is SPAM?
SPAM is an algorithm for discovering frequent sequential patterns in a sequence database. It was proposed by Ayres et al. (2002).
What is the input of SPAM?
The input of SPAM is a sequence database and a user-specified threshold named minsup (a value in [0,1] representing a percentage). Moreover, the implementation in SPMF adds another parameter, which is the maximum sequential pattern length in terms of items.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1, 2 and 3 at the same time, which were followed by items 1 and 3, followed by item 4, and followed by items 3 and 6. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of SPAM?
SPAM discovers all frequent sequential patterns occurring in a sequence database (subsequences that occur in at least minsup sequences of the database).
To explain more formally what a sequential pattern is, it is necessary to review some definitions.
A sequential pattern is a sequence. A sequence SA = X1, X2, ... Xk, where X1, X2, ... Xk are itemsets, is said to occur in another sequence SB = Y1, Y2, ... Ym, where Y1, Y2, ... Ym are itemsets, if and only if there exist integers 1 <= i1 < i2 < ... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ... Xk ⊆ Yik.
The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.
A frequent sequential pattern is a sequential pattern having a support no less than the minsup parameter provided by the user.
For example, if we run SPAM with minsup = 50%, 53 sequential patterns will be found. The list is too long to be presented here. An example of a pattern found is "(1,2),(6)", which appears in the first and the third sequences (it therefore has a support of 50%). This pattern has a length of 3 because it contains three items. Another pattern is "(4), (3), (2)". It appears in the second and third sequences (it thus has a support of 50%). It also has a length of 3 because it contains 3 items.
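The definitions above (occurrence, support and pattern length) can be checked on the example database with a few lines of Java. This is only an illustrative sketch: the class and method names (SequenceSupportSketch, occursIn, support, length) are hypothetical, and the brute-force scan shown here is not how SPAM itself computes support (SPAM relies on bitmaps, see the paper cited at the end of this page).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; names are hypothetical and not part of SPMF. */
final class SequenceSupportSketch {

    /** Builds an itemset from its items. */
    static Set<Integer> itemset(Integer... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** True if pattern X1..Xk occurs in sequence Y1..Ym (each Xj is included in Y_ij, i1 < ... < ik). */
    static boolean occursIn(List<Set<Integer>> pattern, List<Set<Integer>> sequence) {
        int next = 0; // index of the next pattern itemset to match
        for (Set<Integer> y : sequence) {
            // Greedy leftmost matching is sufficient when there is no gap constraint.
            if (next < pattern.size() && y.containsAll(pattern.get(next))) {
                next++;
            }
        }
        return next == pattern.size();
    }

    /** Support = number of sequences containing the pattern / number of sequences. */
    static double support(List<Set<Integer>> pattern, List<List<Set<Integer>>> database) {
        long count = database.stream().filter(s -> occursIn(pattern, s)).count();
        return (double) count / database.size();
    }

    /** Length of a pattern = total number of items over all its itemsets. */
    static int length(List<Set<Integer>> pattern) {
        return pattern.stream().mapToInt(Set::size).sum();
    }

    /** The four sequences S1..S4 of the example database. */
    static List<List<Set<Integer>>> exampleDatabase() {
        return List.of(
            List.of(itemset(1), itemset(1, 2, 3), itemset(1, 3), itemset(4), itemset(3, 6)),
            List.of(itemset(1, 4), itemset(3), itemset(2, 3), itemset(1, 5)),
            List.of(itemset(5, 6), itemset(1, 2), itemset(4, 6), itemset(3), itemset(2)),
            List.of(itemset(5), itemset(7), itemset(1, 6), itemset(3), itemset(2), itemset(3)));
    }

    public static void main(String[] args) {
        List<List<Set<Integer>>> db = exampleDatabase();
        List<Set<Integer>> pattern = List.of(itemset(1, 2), itemset(6));
        // Prints "support=0.5 length=3" for the pattern (1,2),(6)
        System.out.println("support=" + support(pattern, db) + " length=" + length(pattern));
    }
}
```

With minsup = 50%, a pattern is frequent here when support(...) returns at least 0.5, which is the case for both patterns discussed above.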
Optional parameters
The SPAM implementation lets the user specify four optional parameters:
- "minimum pattern length" specifies the minimum number of items that the patterns found should contain.
- "maximum pattern length" specifies the maximum number of items that the patterns found should contain.
- "max gap" specifies whether gaps are allowed in sequential patterns. For example, if "max gap" is set to 1, no gap is allowed (i.e. consecutive itemsets of a pattern must appear consecutively in a sequence). If "max gap" is set to N, a gap of at most N-1 itemsets is allowed between two consecutive itemsets of a pattern. If the parameter is not used, "max gap" defaults to +∞.
- "show sequences ids?" (true/false) specifies whether the ids of the sequences containing a pattern should be output for each pattern found. If the parameter is set to true, each pattern in the output file is followed by the keyword #SID and a list of sequence ids (integers separated by spaces). For example, a line terminated by "#SID: 0 2" means that the pattern on this line appears in the first and the third sequences of the sequence database (sequences with ids 0 and 2).
These parameters are available in the GUI of SPMF and also in the example "MainTestSPAM.java" provided in the source code of SPMF.
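To make the "max gap" semantics concrete, here is a minimal Java sketch of an occurrence test under that constraint, as described above: two consecutive itemsets of the pattern may be matched at most maxGap positions apart in the sequence. The class and method names are hypothetical, and this brute-force check only illustrates the constraint; it is not the way SPMF implements it.

```java
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; names are hypothetical and not part of SPMF. */
final class MaxGapSketch {

    /** True if the pattern occurs in the sequence with consecutive matches at most maxGap positions apart. */
    static boolean occursWithMaxGap(List<Set<Integer>> pattern,
                                    List<Set<Integer>> sequence, int maxGap) {
        if (pattern.isEmpty()) {
            return true;
        }
        // The first itemset of the pattern may be matched anywhere in the sequence.
        for (int start = 0; start < sequence.size(); start++) {
            if (sequence.get(start).containsAll(pattern.get(0))
                    && matchFrom(pattern, 1, sequence, start, maxGap)) {
                return true;
            }
        }
        return false;
    }

    /** Tries to match pattern itemsets k, k+1, ... after position prevPos (with backtracking). */
    private static boolean matchFrom(List<Set<Integer>> pattern, int k,
                                     List<Set<Integer>> sequence, int prevPos, int maxGap) {
        if (k == pattern.size()) {
            return true;
        }
        // Last admissible position; computed as long so that a very large maxGap cannot overflow.
        long last = Math.min((long) sequence.size() - 1, (long) prevPos + maxGap);
        for (int pos = prevPos + 1; pos <= last; pos++) {
            if (sequence.get(pos).containsAll(pattern.get(k))
                    && matchFrom(pattern, k + 1, sequence, pos, maxGap)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sequence S1 of the example database: (1), (1 2 3), (1 3), (4), (3 6)
        List<Set<Integer>> s1 = List.of(Set.of(1), Set.of(1, 2, 3), Set.of(1, 3), Set.of(4), Set.of(3, 6));
        List<Set<Integer>> pattern = List.of(Set.of(1, 2), Set.of(6));
        System.out.println(occursWithMaxGap(pattern, s1, 1)); // false: (1 2 3) and (3 6) are not adjacent
        System.out.println(occursWithMaxGap(pattern, s1, 3)); // true: two itemsets lie between them
    }
}
```

Unlike the unconstrained case, a greedy leftmost match is not enough here, which is why the sketch backtracks over the candidate positions.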
These parameters can also be used from the command line with the Jar file. Consider this example:
java -jar spmf.jar run SPAM contextPrefixSpan.txt output.txt 0.5 2 6 1 true
This command applies SPAM to the file "contextPrefixSpan.txt" and writes the results to "output.txt". It specifies minsup = 0.5, a minimum pattern length of 2 items, a maximum pattern length of 6 items, a max gap of 1 (no gap between itemsets), and that the ids of the sequences where each pattern is found must be shown in the output.
Input file format
The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer and items from the same itemset within a sequence are separated by a single space. Note that it is assumed that items within the same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextPrefixSpan.txt" contains the following four lines (four sequences).
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2
The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.
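As an illustration of this format, the following Java sketch parses one line of an input file into a list of itemsets. The class and method names (SpmfInputSketch, parseLine) are hypothetical and not part of SPMF; the sketch only assumes the format described above ("-1" ends an itemset, "-2" ends the sequence).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names are hypothetical and not part of SPMF. */
final class SpmfInputSketch {

    /** Parses a line such as "1 -1 1 2 3 -1 -2" into a list of itemsets. */
    static List<List<Integer>> parseLine(String line) {
        List<List<Integer>> sequence = new ArrayList<>();
        if (line.trim().isEmpty()) {
            return sequence; // a blank line yields an empty sequence
        }
        List<Integer> current = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            int value = Integer.parseInt(token);
            if (value == -2) {
                break;                       // end of the sequence
            } else if (value == -1) {
                sequence.add(current);       // end of the current itemset
                current = new ArrayList<>();
            } else {
                current.add(value);          // a regular item
            }
        }
        return sequence;
    }

    public static void main(String[] args) {
        // First line of contextPrefixSpan.txt
        System.out.println(parseLine("1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2"));
        // Prints [[1], [1, 2, 3], [1, 3], [4], [3, 6]]
    }
}
```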
Note that, as an alternative to the default input format, it is also possible to use a text file containing a text (several sentences) if the file has the ".text" extension. If the algorithm is applied to such a file from the graphical interface or the command line interface, the text is automatically converted to the SPMF format by dividing it into sentences separated by ".", "?" and "!", where each word is considered as an item. Note that when a text file is used as input, the performance will be slightly lower than with the native SPMF file format, because the input file is converted before the algorithm is launched and the result also has to be converted. This cost should however be small.
Output file format
The output file format is defined as follows. It is a text file. Each line is a frequent sequential pattern. Each item from a sequential pattern is a positive integer and items from the same itemset within a sequence are separated by single spaces. The value "-1" indicates the end of an itemset. On each line, the sequential pattern is first indicated. Then, the keyword " #SUP: " appears followed by an integer indicating the support of the pattern as a number of sequences. For example, a few lines from the output file from the previous example are shown below:
2 3 -1 1 -1 #SUP: 2
6 -1 2 -1 #SUP: 2
6 -1 2 -1 3 -1 #SUP: 2
The first line indicates that the frequent sequential pattern consisting of the itemset {2, 3}, followed by the itemset {1} has a support of 2 sequences. The next lines follow the same format.
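The output lines can be read back with a small parser such as the Java sketch below. The class name SpmfPatternLine and its fields are hypothetical and not part of SPMF. The sketch assumes that the support is the first integer after the "#SUP:" keyword, and it ignores anything that follows it (such as sequence ids when that option is enabled).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one parsed output line; not part of SPMF. */
final class SpmfPatternLine {
    final List<List<Integer>> itemsets; // e.g. [[2, 3], [1]]
    final int support;                  // number of sequences containing the pattern

    private SpmfPatternLine(List<List<Integer>> itemsets, int support) {
        this.itemsets = itemsets;
        this.support = support;
    }

    /** Parses a line such as "2 3 -1 1 -1 #SUP: 2". */
    static SpmfPatternLine parse(String line) {
        String keyword = "#SUP:";
        int marker = line.indexOf(keyword);
        if (marker < 0) {
            throw new IllegalArgumentException("Missing #SUP: keyword in line: " + line);
        }
        // Everything before the keyword is the pattern; "-1" ends each itemset.
        List<List<Integer>> itemsets = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (String token : line.substring(0, marker).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int value = Integer.parseInt(token);
            if (value == -1) {
                itemsets.add(current);
                current = new ArrayList<>();
            } else {
                current.add(value);
            }
        }
        // Assumption: the support is the first token after the keyword; the rest is ignored.
        String afterKeyword = line.substring(marker + keyword.length()).trim();
        int support = Integer.parseInt(afterKeyword.split("\\s+")[0]);
        return new SpmfPatternLine(itemsets, support);
    }

    public static void main(String[] args) {
        SpmfPatternLine p = parse("2 3 -1 1 -1 #SUP: 2");
        System.out.println(p.itemsets + " support=" + p.support); // prints [[2, 3], [1]] support=2
    }
}
```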
Performance
SPAM is one of the fastest sequential pattern mining algorithms. The SPAM implementation in SPMF is reported to be faster than PrefixSpan (see the "performance" section of the website for a performance comparison). However, CM-SPAM is faster than SPAM.
Implementation details
In the source code, we also provide examples of how to keep the result in memory instead of saving it to a file. This can be useful if the algorithms are integrated into other Java software. Examples of how to keep the result in memory are named according to the following naming convention: "MainTest..._saveToMemory".
For the AGP implementation of SPAM, several versions are provided in the source code, showing different ways to perform the join of IdLists. The fastest implementation is the one named "Fat_Bitmap". It is the one offered in the graphical user interface.
- "MainTestSPAM_AGP_BitMap_saveToFile.java"
- "MainTestSPAM_AGP_BitMap_saveToMemory.java"
- "MainTestSPAM_AGP_EntryList_saveToFile.java"
- "MainTestSPAM_AGP_EntryList_saveToMemory.java"
- "MainTestSPAM_AGP_FatBitMap_saveToFile.java"
- "MainTestSPAM_AGP_FatBitMap_saveToMemory.java"
The AGP and PFV implementations of SPAM share some source code but also have some significant differences. See the performance section of the website for a performance comparison (will be added at the end of August 2013).
Where can I get more information about SPAM?
The SPAM algorithm was proposed in this paper:
J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. Sequential Pattern Mining Using Bitmaps. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 2002.
The implementation of the optional "maxgap" constraint is based on this paper:
Ho, J., Lukov, L., & Chawla, S. (2005). Sequential pattern mining with constraints on large protein databases. In Proceedings of the 12th International Conference on Management of Data (COMAD) (pp. 89-100).
Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.