àMining Cost-Efficient Sequential Patterns Using CEPN Algorithm (SPMF documentation)

This example explains how to run the CEPN algorithm using the SPMF open-source data mining library.

How to run this example?

What is CEPN?

CEPN is an algorithm proposed by Fournier-Viger, P., Li, Jiaxuan et al. (2020) for discovering cost-effective patterns in a sequence database (a set of sequences), where each event/activity in a sequence has a cost, and each sequence has an outcome called "utility" that is a numeric value indicating that an outcome is good or bad to different degrees.

This type of data can be found in many domains. For example, it can model how students use an e-learning system. In that case, each sequence is a list of activities or sessions performed by a student, and the cost represents the time spent for studying for each activity. Then, the utility represents the outcome of a sequence such whether the student's grade at a final exam (numeric utility).

The CEPN algorithm can discover patterns in such sequences that are said to be cost-efficient. This means that the pattern has a low cost and is likely to bring a high utility. Moreover, a pattern must have a good trade-off between cost and utility, i.e. it must be able to provide utility at a low-cost . In the above example, some cost-effective patterns may be that studying some e-learning materials A and B typically require not much time but lead to high scores at the final exam.

It is to be noted that some other algorithms are provided in SPMF named CorCEPB and CEPB for handling the case where utility is binary values instead of numeric values.

What is the input of CEPN?

CEPN takes as input a sequence database where each sequence is an ordered list of events, each event has a cost value (a positive integer), and each sequence has a utility value ( a numeric value such that a high value indicates a better outcome or result). Moreover, the user must set two parameters: (1) a minimum support threshold minsup (a positive integer), (2) a maximum cost threshold maxcost (a positive integer), and a (3) minimum occupancy minoccupancy threshold (a value in the [0,1] interval). 

For example, consider the following sequence database, which is provided in the file example_CEPN.txt of the SPMF distribution:

1[20] -1 2[40] -1 3[50] -1 4[20] -1 -2 SUtility:80
2[25] -1 4[12] -1 3[30] -1 5[25] -1 -2 SUtility:60
1[25] -1 5[14] -1 2[30] -1 -2 SUtility:50
1[40] -1 2[16] -1 4[40] -1 -2 SUtility:40
2[20] -1 5[24] -1 3[20] -1 -2 SUtility:70

This database contains five lines. Each line is a sequence.

Moreover, each sequence (line) is an ordered list of events separated by -1.

An event is represented by a positive integer and it is followed by a cost value (e.g. spent time on the event) indicated between squared brackets [ ]. A cost is a positive integer.

The end of a sequence is indicated by -2. Finally, at the end of each line, the keyword "SUtility:" is followed by a positive integer which indicates how good the outcome of this sequence is (e.g. it could represents a final exam score)

For example, the first line indicates that event "1" had a cost of 20, was followed by event "2" with a cost of 40, followed by event "3" with a cost of 50, followed by event "4" with a cost of 20. Moreover, this sequence has a utility of 80, which means a quite good outcome (compared to other sequences in this database). The other sequences follow the same format.

This database could for example represents sequences of learning activities made by learners, where the events 1,2,3,4 and 5 are learning activities, cost values are the time spent on a learning activity and the utility is the grade obtained at a final exam.

What is the output of CEPN?

The output of CEPN is the set of all cost-effective patterns meeting the criteria specified by the user.

A pattern is a subsequence of events that appear in some sequences of the input database.

A pattern is said to be cost-effective if it meets the following criteria:

For example, if we set minsup = 2, maxcost = 50, minoccupancy = 0.1, we find 6 cost-effective patterns.

This is the output file, where each line is a cost-effective pattern:

1 -1 -2 #UTIL: 56.667 #SUP: 3 #TRADE: 0.500 #AVGCOST: 28.333 #OCCUP: 0.153
2 -1 -2 #UTIL: 60.000 #SUP: 5 #TRADE: 0.437 #AVGCOST: 26.200 #OCCUP: 0.150
5 -1 -2 #UTIL: 60.000 #SUP: 3 #TRADE: 0.350 #AVGCOST: 21.000 #OCCUP: 0.153
3 -1 -2 #UTIL: 70.000 #SUP: 3 #TRADE: 0.476 #AVGCOST: 33.333 #OCCUP: 0.139
4 -1 -2 #UTIL: 60.000 #SUP: 3 #TRADE: 0.400 #AVGCOST: 24.000 #OCCUP: 0.139
2 -1 5 -1 -2 #UTIL: 65.000 #SUP: 2 #TRADE: 0.723 #AVGCOST: 47.000 #OCCUP: 0.583

Consider the line:

2 -1 5 -1 -2 #UTIL: 65.000 #SUP: 2 #TRADE: 0.723 #AVGCOST: 47.000 #OCCUP: 0.583

This line is the pattern where event "2" is followed by event "5". On that line, there is #SUP:2, which indicates that the support (occurrence frequency) of this pattern is 2, and AVGCOST: 47.000 indicating that the pattern's average cost is 47. Moreover, #UTIL: 65.000 indicates that this pattern has an average utility of 65, and #TRADE: 0.723 indicates that this pattern has a trade-off ratio between cost and utility of 0.723. Generally, a low trade-off is more desirable as it means that a pattern produces utility at a lower cost.

The support of the above pattern is 2 because this pattern appears in the second and fifth sequences of the input database. Below, I have highlighted these two occurrences in blue, and the corresponding cost values in red, and utility values in green.

1[20] -1 2[40] -1 3[50] -1 4[20] -1 -2 SUtility:80
2[25] -1 4[12] -1 3[30] -1 5[25] -1 -2 SUtility:60
1[25] -1 5[14] -1 2[30] -1 -2 SUtility:50
1[40] -1 2[16] -1 4[40] -1 -2 SUtility:40
2[20] -1 5[24] -1 3[20] -1 -2 SUtility:70

In the second sequence, the cost of that pattern is 25 + 25= 50 (the sum of the red values). In the last sequence, the cost of that pattern is 20 + 24 = 44. Thus, the average cost of that pattern is : (50 + 44) / 2 = 1302 = 65.

The trade-off of that pattern is calculated as the cost divided by the utility that is : 44 / 65 = 0.723.

Besides, "#OCCUP: 0.583" indicates that this pattern has an occupancy of 0.583. The occupancy of a pattern represents how much a pattern covers the sequences in which it appears. The occupancy of the pattern in Sequence 2 is 2 / 4 = 0.5 because the pattern consists of 2 of the 4 events of that sequence. The occupancy of the pattern in Sequence 5 is 2/3 = 0.666 because it consists of 2 of the 3 events in that sequence. The occupancy of that pattern for the whole database is the average of its occupancy in Sequence 2 and Sequence 5, that is 0.666 + 0.5 / 2 = 0.583.

Since the average cost of that pattern is 65 <= maxcost, its support is 2 >= minsup, and its occupancy is 0.583 >= minoccupancy, this pattern is a cost-effective pattern, and it is output.

If the database is sequences of events from an e-learning website, we could interpret this pattern as follows. Three learners have some learning activity 2, followed by some learning activity 5, and they have spent an average of 44 hours and received and average score of 65 at a final exam. The trade-off between the time spent studying and the final score is 0.723.

Note that if one does not wish to use the occupancy measure, it is possible to simply set minoccupancy = 0.

Optional parameter(s)

The CEPN implementation allows to specify three additional optional parameters:

Note that the "sort patterns by utility" and "only lowest trade-off?"cannot be activated at the same time (one should not set these two parameters to true at the same time!)

The optional parameter(s) are available in the GUI of SPMF and also in the example(s) "MainTestCEPN.java" provided in the source code of SPMF.

The parameter(s) can be also used in the command line with the Jar file. If you want to use these optional parameter(s) in the command line, it can be done as follows. Consider this example:
java -jar spmf.jar run CEPN example_CEPN .txt output.txt 2 50 0.1 2 true false
This command means to apply the algorithm on the file "example_CEPN.txt" and output the results to "output.txt". Moreover, it specifies that the user wants to find patterns for minsup = 2, maxcost = 50, minoccupancy = 0.1, and patterns must have a maximum length of 2 items, and to sort patterns in the output by utility. The output is then like this:

3 -1 -2 #UTIL: 70.000 #SUP: 3 #TRADE: 0.476 #OCCUP: 0.139 #AVGCOST: 33.333
2 -1 5 -1 -2 #UTIL: 65.000 #SUP: 2 #TRADE: 0.723 #OCCUP: 0.583 #AVGCOST: 47.000
2 -1 -2 #UTIL: 60.000 #SUP: 5 #TRADE: 0.437 #OCCUP: 0.150 #AVGCOST: 26.200
5 -1 -2 #UTIL: 60.000 #SUP: 3 #TRADE: 0.350 #OCCUP: 0.153 #AVGCOST: 21.000
4 -1 -2 #UTIL: 60.000 #SUP: 3 #TRADE: 0.400 #OCCUP: 0.139 #AVGCOST: 24.000
1 -1 -2 #UTIL: 56.667 #SUP: 3 #TRADE: 0.500 #OCCUP: 0.153 #AVGCOST: 28.333

Another example is: java -jar spmf.jar run CEPN example_CEPN .txt output.txt 2 50 0.1 2 false true
This command means to apply the algorithm on the file "example_CEPN.txt" and output the results to "output.txt". Moreover, it specifies that the user wants to find patterns for minsup = 2, maxcost = 50, minoccupancy = 0.1, and patterns must have a maximum length of 2 items, and to sort keep only patterns having the lowest trade-off for each utility value. The output is then like this:

2 -1 5 -1 -2 #UTIL: 65.000 #SUP: 2 #TRADE: 0.723 #AVGCOST: 47.000 #OCCUP: 0.583
3 -1 -2 #UTIL: 70.000 #SUP: 3 #TRADE: 0.476 #AVGCOST: 33.333 #OCCUP: 0.139
1 -1 -2 #UTIL: 56.667 #SUP: 3 #TRADE: 0.500 #AVGCOST: 28.333 #OCCUP: 0.153
5 -1 -2 #UTIL: 60.000 #SUP: 3 #TRADE: 0.350 #AVGCOST: 21.000 #OCCUP: 0.153

Implementation details

This is the original implementation of the algorithm.

Optional feature: giving names to items

Some users have requested the feature of giving names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "example_CEPN.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=homework
@ITEM=2=lesson
@ITEM=3=anotherlesson
@ITEM=4=groupwork
@ITEM=5=selfstudy
@ITEM=-1=selfstudy
1[20] -1 2[40] -1 3[50] -1 4[20] -1 -2 SUtility:80
2[25] -1 4[12] -1 3[30] -1 5[25] -1 -2 SUtility:60
1[25] -1 5[14] -1 2[30] -1 -2 SUtility:50
1[40] -1 2[16] -1 4[40] -1 -2 SUtility:40
2[20] -1 5[24] -1 3[20] -1 -2 SUtility:70

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "homework". The third line indicates that the item 2 is called "lesson".... The 9th line indicates that the symbol "-1" must be replaced by "|". Then the following lines define four sequences in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file looks like this:

homework selfstudy -2 #UTIL: 56.667 #SUP: 3 #TRADE: 0.500 #AVGCOST: 28.333 #OCCUP: 0.153
lesson selfstudy -2 #UTIL: 60.000 #SUP: 5 #TRADE: 0.437 #AVGCOST: 26.200 #OCCUP: 0.150
selfstudy selfstudy -2 #UTIL: 60.000 #SUP: 3 #TRADE: 0.350 #AVGCOST: 21.000 #OCCUP: 0.153
anotherlesson selfstudy -2 #UTIL: 70.000 #SUP: 3 #TRADE: 0.476 #AVGCOST: 33.333 #OCCUP: 0.139
groupwork selfstudy -2 #UTIL: 60.000 #SUP: 3 #TRADE: 0.400 #AVGCOST: 24.000 #OCCUP: 0.139
lesson selfstudy selfstudy selfstudy -2 #UTIL: 65.000 #SUP: 2 #TRADE: 0.723 #AVGCOST: 47.000 #OCCUP: 0.583

Note that what is described above is the normal way of using item names in SPMF. But the CEPN algorithm also allows to directly replace the items like 1 and 2 with strings in the input file like this:

homework[4] -1 selfsttudy [2] -1 -2 SUtility:1

This will achieve a similar result

Where can I get more information about the CEPN algorithm?

The CEPN algorithm is described in this article:

Fournier-Viger, P., Li, J., Lin, J. C., Chi, T. T., Kiran, R. U. (2019). Mining Cost-Effective Patterns in Event Logs. Knowledge-Based Systems (KBS), Elsevier,

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.