Calculate Statistics for a Sequence Database with Cost and Numeric Utility Information (SPMF documentation)

This example explains how to calculate statistics for a sequence database with cost and numeric utility information using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool generates statistics about a sequence database with cost and numeric utility information, as used by algorithms such as CEPN.

What is the input?

The input is a sequence database where each sequence is an ordered list of events, each event has a cost value (a positive integer), and each sequence has a utility value (a numeric value where a higher value indicates a better outcome or result).

For example, consider the following sequence database, which is provided in the file example_CEPN.txt of the SPMF distribution:

1[20] -1 2[40] -1 3[50] -1 4[20] -1 -2 SUtility:80
2[25] -1 4[12] -1 3[30] -1 5[25] -1 -2 SUtility:60
1[25] -1 5[14] -1 2[30] -1 -2 SUtility:50
1[40] -1 2[16] -1 4[40] -1 -2 SUtility:40
2[20] -1 5[24] -1 3[20] -1 -2 SUtility:70

This database contains five lines. Each line is a sequence.

Moreover, each sequence (line) is an ordered list of events separated by -1.

An event is represented by a positive integer and is followed by a cost value (e.g. the time spent on the event) given between square brackets [ ]. A cost is a positive integer.

The end of a sequence is indicated by -2. Finally, at the end of each line, the keyword "SUtility:" is followed by a positive integer which indicates how good the outcome of this sequence is (e.g. it could represent a final exam score).

For example, the first line indicates that event "1" had a cost of 20, was followed by event "2" with a cost of 40, followed by event "3" with a cost of 50, followed by event "4" with a cost of 20. Moreover, this sequence has a utility of 80, the highest in this database, which indicates a good outcome. The other sequences follow the same format.
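As a rough illustration of the format described above, the following Python sketch parses one line into a list of (event, cost) pairs plus the sequence utility. It is not part of SPMF: the function name parse_sequence and the regular expression are assumptions made for this example, and the sketch assumes that every itemset contains a single event, as is the case in example_CEPN.txt.

```python
import re

# Hypothetical helper, not an SPMF API: matches tokens such as "3[50]".
EVENT_PATTERN = re.compile(r"^(\d+)\[(\d+)\]$")


def parse_sequence(line):
    """Parse one line into ([(event, cost), ...], utility)."""
    # Everything before "SUtility:" is the sequence, everything after is the utility.
    body, keyword, utility_text = line.partition("SUtility:")
    if not keyword:
        raise ValueError("missing SUtility: keyword")

    events = []
    for token in body.split():
        if token == "-1":  # separator between consecutive events
            continue
        if token == "-2":  # end of the sequence
            break
        match = EVENT_PATTERN.match(token)
        if match is None:
            raise ValueError(f"unexpected token: {token!r}")
        events.append((int(match.group(1)), int(match.group(2))))

    # The utility is read as a float so that non-integer values are accepted too.
    return events, float(utility_text)


first_line = "1[20] -1 2[40] -1 3[50] -1 4[20] -1 -2 SUtility:80"
events, utility = parse_sequence(first_line)
print(events, utility)
```

Applied to the first line of the example database, the helper returns the four events with their costs and the utility of the sequence.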

This database could, for example, represent sequences of learning activities carried out by learners, where the events 1, 2, 3, 4 and 5 are learning activities, the cost values are the time spent on each learning activity, and the utility is the grade obtained on a final exam.

What is the output?

The output is a set of statistics about the database. For example, applying the tool to the example database above produces the following statistics:

============ SEQUENCE COST UTILITY DATABASE STATS ==========
File size (MB): 0.00
Number of sequences : 5
Max item: 5
Average number of itemsets per sequence : 3.4 standard deviation: 0.4898979485566356 variance: 0.24
Average number of items per itemset : 1.0 standard deviation: 0.0 variance: 0.0
Average cost per item: 26.529411764705884 standard deviation: 10.23901217121137 variance: 104.83737024221456
Average cost per sequence: 90.2 standard deviation: 23.481056194302674 variance: 551.36
Average utility per sequence: 60.0 standard deviation: 14.142135623730951 variance: 200.00000000000003
=========================================================
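To make these figures easier to check, the sketch below recomputes the averages, standard deviations and variances for the example database in Python. It is an independent illustration, not the code used by SPMF: the names DATABASE and summarize are assumptions made for this example, and the data is hard-coded from example_CEPN.txt. The sketch uses the population variance (dividing by the number of values), which is the formula the figures above are consistent with for this example.

```python
from math import sqrt
from statistics import fmean, pvariance

# Costs and utility of each sequence, copied by hand from example_CEPN.txt.
DATABASE = [
    ([20, 40, 50, 20], 80),
    ([25, 12, 30, 25], 60),
    ([25, 14, 30], 50),
    ([40, 16, 40], 40),
    ([20, 24, 20], 70),
]


def summarize(values):
    """Return (mean, standard deviation, variance) using the population formulas."""
    values = [float(v) for v in values]
    variance = pvariance(values)
    return fmean(values), sqrt(variance), variance


# One value per sequence, except item_costs which has one value per event.
lengths = [len(costs) for costs, _ in DATABASE]
item_costs = [cost for costs, _ in DATABASE for cost in costs]
sequence_costs = [sum(costs) for costs, _ in DATABASE]
utilities = [utility for _, utility in DATABASE]

for label, values in [
    ("Average number of itemsets per sequence", lengths),
    ("Average cost per item", item_costs),
    ("Average cost per sequence", sequence_costs),
    ("Average utility per sequence", utilities),
]:
    mean, std, variance = summarize(values)
    print(f"{label}: {mean:.4f} standard deviation: {std:.4f} variance: {variance:.4f}")
```

The printed values are rounded to four decimals, so they agree with the tool's output only up to rounding.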