Calculate Statistics for a Sequence Database with Utility Information (SPMF documentation)

This example explains how to calculate statistics for a Sequence Database with Utility Information using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool generates statistics about a sequence database with utility information, as used by algorithms such as HUSRM and USpan.

What is the input?

The input is a sequence database with utility information.

Let's consider the following sequence database, which consists of 4 sequences of transactions (s1, s2, s3, s4) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DataBase_HUSRM.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


ID    Sequence                                        Sequence utility
s1    {1[1],2[4]},{3[10]},{6[9]},{7[2]},{5[1]}        27
s2    {1[1],4[12]},{3[20]},{2[4]},{5[1],7[2]}         40
s3    {1[1]},{2[4]},{6[9]},{5[1]}                     15
s4    {1[3],2[4],3[5]},{6[3],7[1]}                    16

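Each line of the database is a sequence. A sequence is an ordered list of transactions (itemsets), each enclosed in curly brackets. Each transaction contains one or more items, represented by integers, and each item is followed by its utility in square brackets. The last column gives the sequence utility, which is the sum of the utilities of all items in the sequence.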

What are real-life examples of such a database? A typical example is a database containing sequences of customer transactions. Imagine that each sequence represents the transactions made by a customer. The first customer, named "s1", bought items 1 and 2, which generated a profit of $1 and $4, respectively. Then, the customer bought item 3 for $10. Then, the customer bought item 6 for $9. Then, the customer bought item 7 for $2. Finally, the customer bought item 5 for $1.

What is the output?

The output is statistics about the database. For example, if we use the tool on the previous database given as example, we get the following statistics:

============ SEQUENCE UTILITY DATABASE STATS ==========
File size (MB): 0.00
Number of sequences : 4
Max item: 7
Average number of itemsets per sequence : 3.75 standard deviation: 1.0897247358851685 variance: 1.1875000000000002
Average number of items per itemset : 1.4 standard deviation: 0.6110100926607788 variance: 0.37333333333333346
Average utility per item: 4.666666666666667 standard deviation: 4.734205151099645 variance: 22.41269841269841
Average utility per sequence: 24.5 standard deviation: 10.111874208078342 variance: 102.25
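
The figures above can be reproduced by hand. The following Python sketch is not SPMF's implementation; it is a minimal illustration that assumes the example database is held in memory as lists of (item, utility) pairs, and that the reported variances are population variances (dividing by n), which is consistent with the numbers shown above. The names DATABASE, summarize and database_stats are hypothetical.

```python
from statistics import fmean, pstdev, pvariance

# The example database: each sequence is a list of itemsets,
# and each itemset is a list of (item, utility) pairs.
DATABASE = [
    [[(1, 1), (2, 4)], [(3, 10)], [(6, 9)], [(7, 2)], [(5, 1)]],  # s1
    [[(1, 1), (4, 12)], [(3, 20)], [(2, 4)], [(5, 1), (7, 2)]],   # s2
    [[(1, 1)], [(2, 4)], [(6, 9)], [(5, 1)]],                     # s3
    [[(1, 3), (2, 4), (3, 5)], [(6, 3), (7, 1)]],                 # s4
]


def summarize(values):
    """Return (mean, population standard deviation, population variance)."""
    return fmean(values), pstdev(values), pvariance(values)


def database_stats(database):
    """Compute the statistics listed above (assumed definitions)."""
    itemsets = [itemset for sequence in database for itemset in sequence]
    items = [pair for itemset in itemsets for pair in itemset]
    # Sequence utility is taken here as the sum of the item utilities.
    sequence_utilities = [
        sum(utility for itemset in sequence for _, utility in itemset)
        for sequence in database
    ]
    return {
        "sequences": len(database),
        "max_item": max(item for item, _ in items),
        "itemsets_per_sequence": summarize([len(s) for s in database]),
        "items_per_itemset": summarize([len(i) for i in itemsets]),
        "utility_per_item": summarize([utility for _, utility in items]),
        "utility_per_sequence": summarize(sequence_utilities),
    }


if __name__ == "__main__":
    for name, value in database_stats(DATABASE).items():
        print(name, value)
```

For this database there are 15 itemsets and 21 item occurrences in total, so the average number of items per itemset is 21 / 15 = 1.4 and the average utility per item is 98 / 21.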

What is the input file format?

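The input file is a text file in which each line represents one sequence. Each item is written as a positive integer followed by its utility in square brackets, and the items of the same itemset are separated by single spaces. The value -1 marks the end of an itemset, and the value -2 marks the end of the sequence. Each line ends with the keyword "SUtility:" followed by the sequence utility.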

For the previous example, the input file is defined as follows:

1[1] 2[4] -1 3[10] -1 6[9] -1 7[2] -1 5[1] -1 -2 SUtility:27
1[1] 4[12] -1 3[20] -1 2[4] -1 5[1] 7[2] -1 -2 SUtility:40
1[1] -1 2[4] -1 6[9] -1 5[1] -1 -2 SUtility:15
1[3] 2[4] 3[5] -1 6[3] 7[1] -1 -2 SUtility:16

For example, consider the first line. It means that the first customer bought items 1 and 2, which generated a profit of $1 and $4, respectively. Then, the customer bought item 3 for $10. Then, the customer bought item 6 for $9. Then, the customer bought item 7 for $2. Finally, the customer bought item 5 for $1. Thus, this customer has made 5 transactions. The total utility (profit) generated by that sequence of transactions is $1 + $4 + $10 + $9 + $2 + $1 = $27.
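
To make the reading of a line concrete, here is a minimal Python sketch of a parser for this format. It is an illustration only, not the code used by SPMF. The function name parse_sequence is hypothetical, and the sketch assumes that utilities are integers and that tokens are separated by whitespace, as in the example file.

```python
def parse_sequence(line):
    """Parse one line into (itemsets, declared_utility).

    itemsets is a list of itemsets, each a list of (item, utility) pairs.
    declared_utility is the integer following "SUtility:".
    """
    itemsets, current, declared = [], [], None
    for token in line.split():
        if token == "-1":                    # end of the current itemset
            itemsets.append(current)
            current = []
        elif token == "-2":                  # end of the sequence
            continue
        elif token.startswith("SUtility:"):  # declared sequence utility
            declared = int(token.split(":", 1)[1])
        else:                                # an "item[utility]" token
            item, utility = token.rstrip("]").split("[")
            current.append((int(item), int(utility)))
    return itemsets, declared


line = "1[1] 2[4] -1 3[10] -1 6[9] -1 7[2] -1 5[1] -1 -2 SUtility:27"
itemsets, declared = parse_sequence(line)

# Check that the item utilities add up to the declared sequence utility.
total = sum(utility for itemset in itemsets for _, utility in itemset)
print(len(itemsets), total, declared)  # prints: 5 27 27
```

The final check mirrors the arithmetic above: the first line contains 5 itemsets whose item utilities sum to 27, which matches the declared SUtility value.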