Mining the Top-K Class Sequential Rules (rules with a fixed consequent) (SPMF documentation)
This example explains how to run the TopSeqClassRules algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "TopSeqClassRules" algorithm, (2) select the input file "contextPrefixSpan.txt", (3) set the output file name (e.g. "output.txt") (4) set k = 5, minconf = 0.8 and required items = 1,2. (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run TopSeqClassRules contextPrefixSpan.txt output.txt 5 80% 1,2 in a folder containing spmf.jar and the example input file contextPrefixSpan.txt. - If you are using the source code version of SPMF, launch the file "MainTestTopSeqClassRules.java" in the package ca.pfv.SPMF.tests.
What is TopSeqClassRules?
TopKSeqClassRules is an algorithm for discovering the top-k class sequential rules appearing in a sequence database. It is a variation of the TopSeqRules algorithm (Fournier-Viger, 2011), which has been adapted to find sequential rules with a consequent containing a single item chosen from a set of items allowed by the user. This algorithm is more simple than TopSeqRules as it is designed to solve a special case of the top-k sequential rule mining problem. However, the output is very useful for some applications as the user may be interested in only finding rules with some specific items in the consequent.
Why is it important to discover top-k class sequential rules? Because other sequential rule mining algorithms requires the user to set a minimum support (minsup) parameter that is hard to set (usually users set it by trial and error, which is time consuming). The TopSeqClassRules and TopSeqRules algorithm solve this problem by letting users directly indicate k, the number of rules to be discovered.
What is the input of TopSeqClassRules ?
TopSeqClassRules takes four parameters as input:
- a sequence database,
- a parameter k representing the number of rules to be discovered (an integer >=1),
- a parameter minconf representing the minimum confidence that the rules should have (a value in [0,1] representing a percentage).
- a set of items allowedItems that are items that are allowed to appear in the consequent of rules.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of TopSeqClassRules ?
TopSeqClassRules outputs the k most frequent sequential rules having a confidence higher or equal to minconf.
A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, and that X is unordered and Y is unordered. The support of a sequential rule X==>Y is the number of sequences that contains X∪Y divided by the number of sequences in the database. The confidence of a sequential rule is the number of sequences that contains X∪Y, divided by the number of sequences that contains X. A class sequential rule X --> Y is a special type of association rule where the itemset Y contains a single item that must be chosen from a set allowedItems specified by the user.
For example, if we run TopSeqClassRules with k = 5, minconf = 0.8, and allowedItems = 1,2, the result is the following rules:
Rule | Support | Confidence |
{3,5,6 } ==> {2 } |
75 % (3 sequences) |
100 % |
{1,3,5,6 } ==> {2 } |
100 % (4 sequences) |
100 % |
{1 } ==> {2 } | 75 % (3 sequences) | 100 % |
{1,5,6 } ==> {2 } | 50 % (2 sequences) |
100 % |
{5,6 } ==> {2 } | 50 % (2 sequences) | 100 % |
These rules are the top five rules appearing in the sequence database having a confidence higher or equals to 80 %.
For example, the rule 5,6 ==> 2 means that if 5 an 6 appears in any order they will be followed by 2 with a confidence of 100 %. Moreover, this rule has a support 50 % because it appears in two sequences (S3 and S4) out of four sequences.
It is important to note that for some values of k, the algorithm may return slightly more rules than k. This can happen if several rules have exactly the same support, and it is normal.
Input file format
The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a postive integer and items from the same itemset within a sequence are separated by single spaces. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the sample input file "contextPrefixSpan.txt" contains the following four lines (four sequences).
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2
The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.
Note that it is also possible to use a text file containing a text (several sentences) if the text file has the ".text" extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format, by dividing the text into sentences separated by ".", "?" and "!", where each word is considered as an item. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.
Output file format
The output file format is defined as follows. It is a text file. Each line is a sequential rule. Each item from a sequential rule is a postive integer. On each line, the items from the rule antecedent are first listed, separated by single spaces. Then the keyword "==>" appears, followed by the items from the rule consequent, separated by single spaces. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the rule as a number of sequences. Then, the keyword "#CONF:" appears followed by a double values in the [0, 1] interval indicating the confidence of the rule. For example, an output file is shown below:
3,5,6 ==> 2 #SUP: 2 #CONF: 1.0
1,3,5,6 ==> 2 #SUP: 2 #CONF: 1.0
1 ==> 2 #SUP: 4 #CONF: 1.0
1,5,6 ==> 2 #SUP: 2 #CONF: 1.0
5,6 ==> 2 #SUP: 2 #CONF: 1.0
Consider the second line. It indicates that the rule {3,5,6} ==> {2} has a support of 2 sequences and a confidence of 100 %. The other lines follow the same format.
Optional feature: giving names to items
Some users have requested the feature of given names to items instead
of using numbers. This feature is offered in the user
interface of SPMF and in the command line
of SPMF. To use this feature, your file must include
@CONVERTED_FROM_TEXT as first line and then several lines to define the
names of items in your file. For example, consider the example database
"contextPrefixSpan.txt". Here we have modified the file
to give names to the items:
@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
@ITEM=6=noodle
@ITEM=7=rice
@ITEM=-1=|
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2
In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". The 9th line indicates that the symbol "-1" must be replaced by "|". Then the following lines define four sequences in the SPMF format.
Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:
apple,bread,noodle ==> orange #SUP: 2 #CONF: 1.0
tomato,bread,noodle ==> orange #SUP: 2 #CONF: 1.0
bread,noodle ==> orange #SUP: 2 #CONF: 1.0
Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided.
Performance
TopSeqClassRules is a simplified version of the TopSeqRules algorithm, which has been adapted to only find the class sequential rules, as this can be useful in practice. The TopSeqRules is a very efficient algorihtm. It is based on the RuleGrowth algorithm, which is one of the most efficient algorithm for mining sequential rules.
TopSeqRules and TopSeqClassRules are more intuititive to use than RuleGrowth. However, it should be noted that the problem of top-k sequential rule mining is more computationally expensive than the problem of sequential rule mining. Therefore, it is recommended to use TopSeqClassRules for k values of up to 1000 or 2000 depending on the dataset. If more rules should be found, it may be better to use RuleGrowth or TRuleGrowth.
Besides, note that there is another variation of TopSeqRules is offered in SPMF, which is named TNS. The improvement in TNS is that it eliminate some sequential rules that are deemed "redundant" (rules that are included in other rules having the same support and confidence - see the TNS example for the formal definition). Using TNS is more costly than using TopSeqRules but it brings the benefit of eliminating some redundancy in the results.
Where can I get more information about this algorithm?
The TopSeqClassRules algorithm is an adaptation of the TopSeqRules algorithm. The latter is described in this paper:
Fournier-Viger, P. & Tseng, V. S. (2011). Mining Top-K Sequential Rules. Proceedings of the 7th Intern. Conf. on Advanced Data Mining and Applications (ADMA 2011). LNAI 7121, Springer, pp.180-194.