Calculate Statistics for a Sequence Database with Cost and Binary Utility Information (SPMF documentation)

This example explains how to calculate statistics for a sequence database with cost and binary utility information using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool generates statistics about a sequence database with cost and binary utility information, as used by algorithms such as CEPB and CorCEPB.

What is the input?

The input is a sequence database where each sequence is an ordered list of events, each event has a cost value (a positive integer), and each sequence has a utility value (a boolean value indicating, for example, a good or bad outcome). Moreover, the user must set three parameters: (1) a minimum support threshold minsup (a positive integer), (2) a maximum cost threshold maxcost (a positive integer), and (3) a minimum occupancy threshold minoccupancy (a value in the [0,1] interval).

For example, consider the following sequence database, which is provided in the file example_CorCEPB.txt of the SPMF distribution:

1[2] -1 2[4] -1 3[9] -1 4[2] -1 -2 SUtility:1
2[1] -1 4[12] -1 3[10] -1 5[1] -1 -2 SUtility:0
1[5] -1 5[4] -1 2[8] -1 -2 SUtility:1
1[3] -1 2[5] -1 4[1] -1 -2 SUtility:0
2[3] -1 5[4] -1 3[2] -1 -2 SUtility:1

This database contains five lines. Each line is a sequence.

Moreover, each sequence (line) is an ordered list of events separated by -1.

An event is represented by a positive integer and is followed by its cost value (e.g. the time spent on the event), given between square brackets [ ]. A cost is a positive integer.

The end of a sequence is indicated by -2. Finally, at the end of each line, the keyword "SUtility:" is followed by 0 or 1, which respectively represent a negative or positive outcome.

For example, the first line indicates that event "1" had a cost of 2, was followed by event "2" with a cost of 4, followed by event "3" with a cost of 9, followed by event "4" with a cost of 2. Moreover, this sequence has a utility of 1, which means a positive outcome. The other sequences follow the same format.
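The format described above can be read with a few lines of code. The following Python sketch is only an illustration and is not part of SPMF: the function name parse_line is hypothetical, and it assumes that each itemset contains a single event, as is the case in this example file.

```python
import re

# One event token looks like "3[9]": event id 3 with cost 9.
EVENT_RE = re.compile(r"^(\d+)\[(\d+)\]$")

def parse_line(line):
    """Parse one database line into ([(event, cost), ...], utility).

    Hypothetical helper, not an SPMF function. Assumes one event per itemset.
    """
    body, _, utility_part = line.partition("SUtility:")
    utility = int(utility_part.strip())
    events = []
    for token in body.split():
        if token == "-1":      # separator between events
            continue
        if token == "-2":      # end of the sequence
            break
        match = EVENT_RE.match(token)
        if match is None:
            raise ValueError(f"unexpected token: {token!r}")
        events.append((int(match.group(1)), int(match.group(2))))
    return events, utility

events, utility = parse_line("1[2] -1 2[4] -1 3[9] -1 4[2] -1 -2 SUtility:1")
print(events)   # [(1, 2), (2, 4), (3, 9), (4, 2)]
print(utility)  # 1
```

Each pair holds an event and its cost, so the first line yields four events with a total cost of 17 and a utility of 1.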

This database could, for example, represent sequences of learning activities performed by learners, where the events 1, 2, 3, 4 and 5 are learning activities, cost values are the time spent on a learning activity, and the utility indicates whether the learner passed or failed an exam.

What is the output?

The output is a set of statistics about the database. For example, running the tool on the example database above produces the following statistics:

============ SEQUENCE COST UTILITY DATABASE STATS ==========
File size (MB): 0.00
Number of sequences : 5
Max item: 5
Average number of itemsets per sequence : 3.4 standard deviation: 0.4898979485566356 variance: 0.24
Average number of items per itemset : 1.0 standard deviation: 0.0 variance: 0.0
Average cost per item: 4.470588235294118 standard deviation: 3.2560829419054236 variance: 10.602076124567478
Average cost per sequence: 15.2 standard deviation: 5.670978751503131 variance: 32.160000000000004
Average utility per sequence: 0.6 standard deviation: 0.48989794855663565 variance: 0.24000000000000005
=========================================================
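The averages and variances above can be checked by hand. The following Python sketch is an independent recomputation, not the SPMF implementation: the names DATABASE, mean_var and stats are hypothetical, and it assumes that the tool reports population variance (dividing by n), which is consistent with the figures shown, for example a variance of 0.24 for the number of itemsets per sequence.

```python
import math
import re

# The five sequences of example_CorCEPB.txt, copied from the example above.
DATABASE = """\
1[2] -1 2[4] -1 3[9] -1 4[2] -1 -2 SUtility:1
2[1] -1 4[12] -1 3[10] -1 5[1] -1 -2 SUtility:0
1[5] -1 5[4] -1 2[8] -1 -2 SUtility:1
1[3] -1 2[5] -1 4[1] -1 -2 SUtility:0
2[3] -1 5[4] -1 3[2] -1 -2 SUtility:1
"""

def mean_var(values):
    """Return (mean, population variance) of a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance

# For each line, keep the list of event costs and the utility flag.
sequences = []
for line in DATABASE.splitlines():
    costs = [int(cost) for _, cost in re.findall(r"(\d+)\[(\d+)\]", line)]
    utility = int(line.rsplit("SUtility:", 1)[1])
    sequences.append((costs, utility))

stats = {
    "events per sequence": mean_var([len(c) for c, _ in sequences]),
    "cost per event": mean_var([x for c, _ in sequences for x in c]),
    "cost per sequence": mean_var([sum(c) for c, _ in sequences]),
    "utility per sequence": mean_var([u for _, u in sequences]),
}

for name, (mean, var) in stats.items():
    print(f"{name}: mean={mean:.4f} std={math.sqrt(var):.4f} variance={var:.4f}")
```

For instance, the 17 events have a total cost of 76, so the average cost per event is 76/17 (about 4.47) and the average cost per sequence is 76/5 = 15.2, matching the output above.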