Mining Cost-Efficient Sequential Patterns Using CorCEPB Algorithm (SPMF documentation)
This example explains how to run the CorCEPB algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "CORCEPB" algorithm, (2) select the input file "example_CorCEPB.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup = 2. maxcost = 50, minoccupancy - 0.1 and (5) click "Run algorithm".
- If you want to execute this example from the command
line, then execute this command:
java -jar spmf.jar run CORCEPB example_CorCEPB .txt output.txt 2 50 0.1 in a folder containing spmf.jar and the example input file example_CorCEPB.txt. - If you are using the source code version of SPMF, launch the file "MainTestCorCEPB.java" in the package ca.pfv.SPMF.tests.
What is CorCEPB?
CorCEPB is an algorithm proposed by Fournier-Viger, P., Li, Jiaxuan et al. (2020) for discovering cost-effective patterns in a sequence database (a set of sequences), where each event/activity in a sequence has a cost, and each sequence has an outcome called "utility" that is a binary class (good or bad outcome).
This type of data can be found in many domains. For example, it can model how students use an e-learning system. In that case, each sequence is a list of activities or sessions performed by a student, and the cost represents the time spent for studying for each activity. Then, the utility represents the outcome of a sequence such whether the student passed or failed the course or exam (binary utility). Another example of cost-utility sequences is hospital data, where a sequence is a list of medicines taken by some patient, the cost indicates how much money or time was spent for these medical treaments, and the utility is whether a patient has cured or died.
The CorCEPB algorithm can discover patterns in such sequences that are said to be cost-efficient. This means that the pattern has a low cost and is likely to bring a high utility. Moreover, a pattern must have a positive correlation with a good outcome (positive utility). In the above examples, some cost-effective patterns may be that studying some e-learning materials A and B typically require not much time but lead to high scores, or that taking some given medicines has a low cost but a high possibility of success to cure a disease.
It is to be noted that another algorithm named CEPB is proposed in SPMF. The difference between the two algorithms is that CEPB does not calculate the correlation. Besides, another algorithm named CEPN is provided to handle the case where utility is a numeric value instead of a binary value.
What is the input of CorCEPB?
CorCEPB takes as input a sequence database where each sequence is an ordered list of events, each event has a cost value (a positive integer), and each sequence has a utility value ( a boolean value indicating for example, a good or bad outcome). Moreover, the user must set two parameters: (1) a minimum support threshold minsup (a positive integer), (2) a maximum cost threshold maxcost (a positive integer), and a (3) minimum occupancy minoccupancy threshold (a value in the [0,1] interval).
For example, consider the following sequence database, which is provided in the file example_CorCEPB.txt of the SPMF distribution:
1[2] -1 2[4] -1 3[9] -1 4[2] -1 -2 SUtility:1
2[1] -1 4[12] -1 3[10] -1 5[1] -1 -2 SUtility:0
1[5] -1 5[4] -1 2[8] -1 -2 SUtility:1
1[3] -1 2[5] -1 4[1] -1 -2 SUtility:0
2[3] -1 5[4] -1 3[2] -1 -2 SUtility:1
This database contains five lines. Each line is a sequence.
Moreover, each sequence (line) is an ordered list of events separated by -1.
An event is represented by a positive integer and it is followed by a cost value (e.g. spent time on the event) indicated between squared brackets [ ]. A cost is a positive integer.
The end of a sequence is indicated by -2. Finally, at the end of each line, the keyword "SUtility:" is followed by 0 or 1, which respectively represent a negative or positive outcome.
For example, the first line indicates that event "1" had a cost of 2, was followed by event "2" with a cost of 4, followed by event "3" with a cost of 9, followed by event "4" with a cost of 2. Moreover, this sequence has a utility of 1, which means a positive outcome. The other sequences follow the same format.
This database could for example represents sequences of learning activities made by learners, where the events 1,2,3,4 and 5 are learning activities, cost values are the time spent on a learning activity and the utility is to pass or faill an exam
What is the output of CorCEPB?
The output of CorCEPB is the set of all cost-effective patterns meeting the criteria specified by the user.
A pattern is a subsequence of events that appear in some sequences of the input database.
A pattern is said to be cost-effective if it meets the following criteria:
- it appears in at least minsup sequences of the database.
- the average cost of the sequence is not greater than the maxcost parameter set by the user
- the pattern has an occupancy not smaller than the minoccupancy threshold.
For example, if we setminsup = 2, maxcost = 50, minoccupancy = 0.1, we find 11 cost-effective patterns.
This is the output file, where each line is a cost-effective pattern:
1 -1 -2 #CORR: 0.189 #SUP: 3 #AVGCOST: 3.333 #OCCUP: 0.153
2 -1 -2 #CORR: 0.423 #SUP: 5 #AVGCOST: 4.200 #OCCUP: 0.150
5 -1 -2 #CORR: 1.000 #SUP: 3 #AVGCOST: 3.000 #OCCUP: 0.153
3 -1 -2 #CORR: -0.596 #SUP: 3 #AVGCOST: 7.000 #OCCUP: 0.139
4 -1 -2 #CORR: -0.427 #SUP: 3 #AVGCOST: 5.000 #OCCUP: 0.139
1 -1 2 -1 -2 #CORR: 0.240 #SUP: 3 #AVGCOST: 9.000 #OCCUP: 0.611
1 -1 4 -1 -2 #CORR: 0.000 #SUP: 2 #AVGCOST: 4.000 #OCCUP: 0.583
2 -1 5 -1 -2 #CORR: 1.000 #SUP: 2 #AVGCOST: 4.500 #OCCUP: 0.583
2 -1 3 -1 -2 #CORR: -0.277 #SUP: 3 #AVGCOST: 9.667 #OCCUP: 0.556
2 -1 4 -1 -2 #CORR: -0.500 #SUP: 3 #AVGCOST: 8.333 #OCCUP: 0.556
1 -1 2 -1 4 -1 -2 #CORR: -1.000 #SUP: 2 #AVGCOST: 8.500 #OCCUP: 0.875
Consider the line:
1 -1 2 -1 -2 #CORR: 0.240 #SUP: 3 #AVGCOST: 9.000 #OCCUP: 0.611
This line is the pattern where event "1" is followed by event "2". On that line, there is #SUP:2, which indicates that the support (occurrence frequency) of this pattern is 2, and AVGCOST: 9.000 indicating that the pattern's average cost is 9. Moreover, #CORR: 0.240 indicates that this pattern is positively correlated with a good outcome. Generally, a correlation greater than 0 is positive and lower than 0 is negative. See the paper describing the CorCEPB algorithm for a formal explanation of how the correlation is calculated.
The support of that pattern is 3 because this pattern appears in the first, third, and fourth sequences of the input database. Below, I have highlighted these two occurrences in blue, and the corresponding cost values in red:
1[2] -1 2[4] -1 3[9] -1 4[2] -1 -2 SUtility:1
2[1] -1 4[12] -1 3[10] -1 5[1] -1 -2 SUtility:0
1[5] -1 5[4] -1 2[8] -1 -2 SUtility:1
1[3] -1 2[5] -1 4[1] -1 -2 SUtility:0
2[3] -1 5[4] -1 3[2] -1 -2 SUtility:1
In that first sequence, the cost of that pattern is 2 + 4= 6 (the sum of the red values). In the third sequence, the cost of that pattern is 5 + 6 = 13, and in the fourth sequence, the cost is 3 + 5 = 8. Thus, the average cost of that pattern is : (6 + 13 + 8) / 3 = 27/3 = 9.
Besides, "#OCCUP: 0611" indicates that this pattern has an occupancy of 0.611. The occupancy of a pattern represents how much a pattern covers the sequences in which it appears. The occupancy of the pattern in Sequence 1 is 2/4 = 0.5 because the pattern consists of 2 of the 4 events of that sequence. The occupancy of the pattern in Sequence 3 is 2/3 = 0.666 because it consists of 2 of the 3 events in that sequence. The occupancy of the pattern in Sequence 4 is 2/3 = 0.666 because it consists of 2 of the 3 events in that sequence. The occupancy of that pattern for the whole database is the average of its occupancy in Sequence 1, 3 and 4, that is (0.5 + 0.666 + 0.666 ) / 3 = 0.611.
Since the average cost of that pattern is 9 <= maxcost, its support is 3 >= minsup, and its occupancy is 0.611 >= minoccupancy, this pattern is a cost-effective pattern, and it is output.
If the database is sequences of events from an e-learning website, we could interpret this pattern as follows. Three learners have some learning activity 1, followed by some learning activity 2, and they have spent an average of 9 hours, and this pattern has a positive correlation with passing the exam.
Note that if one does not wish to use the occupancy measure, it is possible to simply set minoccupancy = 0.
Optional parameter(s)
The CorCEPB implementation allows to specify two additional optional parameters:
- "maximum pattern length" allows to specify the maximum number of items that patterns found should contain. The value for this parameter must be a positive integer.
- "sort patterns by correlation" specify that patterns in the output must be sorted by decreasing correlation. The value for this parameter must be true or false.
These parameter(s) are available in the GUI of SPMF and also in the example(s) "MainTestCorCEPB.java" provided in the source code of SPMF.
The parameter(s) can be also used in the command line with the Jar
file. If you want to use these optional parameter(s) in the command
line, it can be done as follows. Consider this example:
java -jar spmf.jar run CorCEPB example_CorCEPB .txt
output.txt 2 50 2 true
This command means to apply the algorithm on the file
"example_CorCEPB.txt" and output the results to "output.txt".
Moreover, it specifies that the user wants to find patterns for minsup = 2 and maxcost = 50, and patterns must have a maximum length of 2 items, and to sort patterns in the output by correlation. The output is then like this:
2 -1 5 -1 -2 #CORR: 1.000 #SUP: 2 #AVGCOST: 4.500 #OCCUP: 0.583
5 -1 -2 #CORR: 1.000 #SUP: 3 #AVGCOST: 3.000 #OCCUP: 0.153
2 -1 -2 #CORR: 0.423 #SUP: 5 #AVGCOST: 4.200 #OCCUP: 0.150
1 -1 2 -1 -2 #CORR: 0.240 #SUP: 3 #AVGCOST: 9.000 #OCCUP: 0.611
1 -1 -2 #CORR: 0.189 #SUP: 3 #AVGCOST: 3.333 #OCCUP: 0.153
1 -1 4 -1 -2 #CORR: 0.000 #SUP: 2 #AVGCOST: 4.000 #OCCUP: 0.583
2 -1 3 -1 -2 #CORR: -0.277 #SUP: 3 #AVGCOST: 9.667 #OCCUP: 0.556
4 -1 -2 #CORR: -0.427 #SUP: 3 #AVGCOST: 5.000 #OCCUP: 0.139
2 -1 4 -1 -2 #CORR: -0.500 #SUP: 3 #AVGCOST: 8.333 #OCCUP: 0.556
3 -1 -2 #CORR: -0.596 #SUP: 3 #AVGCOST: 7.000 #OCCUP: 0.139
Implementation details
This is the original implementation of the algorithm.
Optional feature: giving names to items
Some users have requested the feature of giving names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "example_CorCEPB.txt". Here we have modified the file to give names to the items:
@CONVERTED_FROM_TEXT
@ITEM=1=homework
@ITEM=2=lesson
@ITEM=3=anotherlesson
@ITEM=4=groupwork
@ITEM=5=selfstudy
@ITEM=-1=selfstudy
1[2] -1 2[4] -1 3[9] -1 4[2] -1 -2 SUtility:1
2[1] -1 4[12] -1 3[10] -1 5[1] -1 -2 SUtility:0
1[5] -1 5[4] -1 2[8] -1 -2 SUtility:1
1[3] -1 2[5] -1 4[1] -1 -2 SUtility:0
2[3] -1 5[4] -1 3[2] -1 -2 SUtility:1
In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "homework". The third line indicates that the item 2 is called "lesson".... The 9th line indicates that the symbol "-1" must be replaced by "|". Then the following lines define four sequences in the SPMF format.
Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file looks like this:
lesson selfstudy selfstudy selfstudy -2 #CORR: 1.000 #SUP: 2 #AVGCOST: 4.500 #OCCUP: 0.583
selfstudy selfstudy -2 #CORR: 1.000 #SUP: 3 #AVGCOST: 3.000 #OCCUP: 0.153
lesson selfstudy -2 #CORR: 0.423 #SUP: 5 #AVGCOST: 4.200 #OCCUP: 0.150
homework selfstudy lesson selfstudy -2 #CORR: 0.240 #SUP: 3 #AVGCOST: 9.000 #OCCUP: 0.611
homework selfstudy -2 #CORR: 0.189 #SUP: 3 #AVGCOST: 3.333 #OCCUP: 0.153
homework selfstudy groupwork selfstudy -2 #CORR: 0.000 #SUP: 2 #AVGCOST: 4.000 #OCCUP: 0.583
lesson selfstudy anotherlesson selfstudy -2 #CORR: -0.277 #SUP: 3 #AVGCOST: 9.667 #OCCUP: 0.556
groupwork selfstudy -2 #CORR: -0.427 #SUP: 3 #AVGCOST: 5.000 #OCCUP: 0.139
lesson selfstudy groupwork selfstudy -2 #CORR: -0.500 #SUP: 3 #AVGCOST: 8.333 #OCCUP: 0.556
anotherlesson selfstudy -2 #CORR: -0.596 #SUP: 3 #AVGCOST: 7.000 #OCCUP: 0.139
Note that what is described above is the normal way of using item names in SPMF. But the CorCEPB algorithm also allows to directly replace the items like 1 and 2 with strings in the input file like this:
homework[4] -1 selfsttudy [2] -1 -2 SUtility:1
This will achieve a similar result
Where can I get more information about the CorCEPB algorithm?
The CorCEPB algorithm is described in this article:
Fournier-Viger, P., Li, J., Lin, J. C., Chi, T. T., Kiran, R. U. (2019). Mining Cost-Effective Patterns in Event Logs. Knowledge-Based Systems (KBS), Elsevier,
Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.