Calculate Statistics for a Sequence Database with Utility Information (SPMF documentation)

This example explains how to calculate statistics for a Sequence Database with Utility Information using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Calculate_stats_for_a_sequence_database_with_utility" algorithm, (2) choose the input file DataBase_HUSRM.txt, and (3) click "Run algorithm".
If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the input file DataBase_HUSRM.txt:
java -jar spmf.jar run Calculate_stats_for_a_sequence_database_with_utility DataBase_HUSRM.txt no_output_file
If you are using the source code version of SPMF, launch the file "MainTestStatsSequenceDatabaseUtility.java" in the package ca.pfv.spmf.tests.

What is this tool?

This tool generates statistics about a sequence database with utility information, as used by algorithms such as HUSRM and USpan.

What is the input?

The input is a sequence database with utility information.

Let's consider the following sequence database, consisting of 4 sequences of transactions (s1, s2, s3, s4) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DataBase_HUSRM.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

	Sequence	Sequence utility
s1	{1[1],2[4]},{3[10]},{6[9]},{7[2]},{5[1]}	27
s2	{1[1],4[12]},{3[20]},{2[4]},{5[1],7[2]}	40
s3	{1[1]},{2[4]},{6[9]},{5[1]}	15
s4	{1[3],2[4],3[5]},{6[3],7[1]}	16

Each line of the database is a sequence:

each sequence is an ordered list of transactions; in this example, each transaction is enclosed in {}
each transaction contains a set of items represented by integers
each item is annotated with a utility value (e.g. sale profit), indicated between square brackets
the sum of the utilities (e.g. profit) of all items in the sequence is also indicated (the "sequence utility" column)
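The structure described above can be sketched in memory as follows. This is only an illustration, not the data structure used inside SPMF: the class name UtilitySequenceSketch and its helper methods are hypothetical. It represents a sequence as a list of transactions, each mapping an item to its utility, and recomputes the sequence utility of s1.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the SPMF API.
public class UtilitySequenceSketch {

    /** Builds one transaction from alternating (item, utility) values. */
    static Map<Integer, Integer> transaction(int... itemUtilityPairs) {
        Map<Integer, Integer> t = new LinkedHashMap<>();
        for (int i = 0; i < itemUtilityPairs.length; i += 2) {
            t.put(itemUtilityPairs[i], itemUtilityPairs[i + 1]);
        }
        return t;
    }

    /** Sums the utilities of all items in all transactions of a sequence. */
    static int sequenceUtility(List<Map<Integer, Integer>> sequence) {
        int total = 0;
        for (Map<Integer, Integer> t : sequence) {
            for (int utility : t.values()) {
                total += utility;
            }
        }
        return total;
    }

    /** Sequence s1 from the example table: {1[1],2[4]},{3[10]},{6[9]},{7[2]},{5[1]}. */
    static List<Map<Integer, Integer>> s1() {
        List<Map<Integer, Integer>> s = new ArrayList<>();
        s.add(transaction(1, 1, 2, 4));
        s.add(transaction(3, 10));
        s.add(transaction(6, 9));
        s.add(transaction(7, 2));
        s.add(transaction(5, 1));
        return s;
    }

    public static void main(String[] args) {
        List<Map<Integer, Integer>> s1 = s1();
        System.out.println("Number of transactions: " + s1.size());
        System.out.println("Sequence utility: " + sequenceUtility(s1));
    }
}
```

For s1, the computed sum should agree with the value 27 given in the "sequence utility" column.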

What are real-life examples of such a database? A typical example is a database containing sequences of customer transactions. Imagine that each sequence represents the transactions made by a customer. The first customer, named "s1", bought items 1 and 2, and those items respectively generated a profit of 1$ and 4$. Then, the customer bought item 3 for 10$. Then, the customer bought item 6 for 9$. Then, the customer bought item 7 for 2$. Finally, the customer bought item 5 for 1$.

What is the output?

The output is a set of statistics about the database. For example, if we run the tool on the example database above, we get the following statistics:

============ SEQUENCE UTILITY DATABASE STATS ==========
File size (MB): 0.00
Number of sequences : 4
Max item: 7
Average number of itemsets per sequence : 3.75 standard deviation: 1.0897247358851685 variance: 1.1875000000000002
Average number of items per itemset : 1.4 standard deviation: 0.6110100926607788 variance: 0.37333333333333346
Average utility per item: 4.666666666666667 standard deviation: 4.734205151099645 variance: 22.41269841269841
Average utility per sequence: 24.5 standard deviation: 10.111874208078342 variance: 102.25
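To see where such figures come from, the following sketch recomputes the mean, variance and standard deviation for two of the lines above. It assumes the population variance (dividing by the number of values), which is consistent with the numbers shown; it is not SPMF's actual code, and the class name StatsSketch is hypothetical. Floating-point rounding may differ slightly from the tool's output in the last digits.

```java
// Hypothetical illustration class; not part of the SPMF API.
public class StatsSketch {

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population variance: average squared deviation from the mean. */
    static double populationVariance(double[] values) {
        double m = mean(values);
        double sumSquares = 0;
        for (double v : values) {
            sumSquares += (v - m) * (v - m);
        }
        return sumSquares / values.length;
    }

    /** Standard deviation: square root of the population variance. */
    static double standardDeviation(double[] values) {
        return Math.sqrt(populationVariance(values));
    }

    public static void main(String[] args) {
        // Sequence utilities of s1..s4 in the example database.
        double[] sequenceUtilities = {27, 40, 15, 16};
        // Number of itemsets (transactions) in s1..s4.
        double[] itemsetsPerSequence = {5, 4, 4, 2};

        System.out.println("Average utility per sequence: " + mean(sequenceUtilities)
                + " standard deviation: " + standardDeviation(sequenceUtilities)
                + " variance: " + populationVariance(sequenceUtilities));
        System.out.println("Average number of itemsets per sequence: " + mean(itemsetsPerSequence)
                + " standard deviation: " + standardDeviation(itemsetsPerSequence)
                + " variance: " + populationVariance(itemsetsPerSequence));
    }
}
```

For instance, the four sequence utilities sum to 98, so the average is 98 / 4 = 24.5, and the squared deviations sum to 409, giving a variance of 409 / 4 = 102.25.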

What is the input file format?

The input file is a text file, defined as follows.

Each line represents a sequence of transactions.
Each transaction is followed by the keyword -1, which separates it from the next transaction.
A transaction is a list of items (positive integers) separated by single spaces, where each item is annotated with the sale profit it generated, indicated between square brackets [ ]. The sale profit is a positive integer.
In a transaction, it is assumed that items are sorted according to some order (e.g. alphabetical order).
Each sequence ends with the keyword "-2", which is followed by the keyword "SUtility:" and the sum of the utility (profit) of all items in that sequence.

For example, for the previous example, the input file is defined as follows:

1[1] 2[4] -1 3[10] -1 6[9] -1 7[2] -1 5[1] -1 -2 SUtility:27
1[1] 4[12] -1 3[20] -1 2[4] -1 5[1] 7[2] -1 -2 SUtility:40
1[1] -1 2[4] -1 6[9] -1 5[1] -1 -2 SUtility:15
1[3] 2[4] 3[5] -1 6[3] 7[1] -1 -2 SUtility:16

For example, consider the first line. It means that the first customer bought items 1 and 2, and those items respectively generated a profit of 1$ and 4$. Then, the customer bought item 3 for 10$. Then, the customer bought item 6 for 9$. Then, the customer bought item 7 for 2$. Finally, the customer bought item 5 for 1$. Thus, this customer has made 5 transactions. The total utility (profit) generated by that sequence of transactions is 1$ + 4$ + 10$ + 9$ + 2$ + 1$ = 27$.
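As an illustration of the file format, here is a minimal sketch of a parser for one line of such a file. It is not the parser used by SPMF: the class UtilitySequenceLineParser and its methods are hypothetical names, and the code assumes a well-formed line with tokens separated by whitespace. It counts the transactions and compares the sum of the item utilities with the value declared after "SUtility:".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the SPMF API.
public class UtilitySequenceLineParser {

    /** Result of parsing one line of the file. */
    static class ParsedSequence {
        // Each transaction is a list of {item, utility} pairs.
        final List<List<int[]>> transactions = new ArrayList<>();
        // Value read after the "SUtility:" keyword.
        int declaredUtility;
    }

    /** Parses one line such as "1[1] 2[4] -1 3[10] -1 -2 SUtility:15". */
    static ParsedSequence parseLine(String line) {
        ParsedSequence result = new ParsedSequence();
        List<int[]> current = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.equals("-1")) {
                // End of the current transaction.
                result.transactions.add(current);
                current = new ArrayList<>();
            } else if (token.equals("-2")) {
                // End of the sequence; nothing to store.
            } else if (token.startsWith("SUtility:")) {
                result.declaredUtility =
                        Integer.parseInt(token.substring("SUtility:".length()));
            } else {
                // Token of the form item[utility].
                int open = token.indexOf('[');
                int close = token.indexOf(']');
                int item = Integer.parseInt(token.substring(0, open));
                int utility = Integer.parseInt(token.substring(open + 1, close));
                current.add(new int[] {item, utility});
            }
        }
        return result;
    }

    /** Sums the utilities of all items in the parsed sequence. */
    static int sumOfUtilities(ParsedSequence sequence) {
        int total = 0;
        for (List<int[]> transaction : sequence.transactions) {
            for (int[] itemAndUtility : transaction) {
                total += itemAndUtility[1];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String firstLine = "1[1] 2[4] -1 3[10] -1 6[9] -1 7[2] -1 5[1] -1 -2 SUtility:27";
        ParsedSequence s = parseLine(firstLine);
        System.out.println("Transactions: " + s.transactions.size());
        System.out.println("Sum of item utilities: " + sumOfUtilities(s));
        System.out.println("Declared SUtility: " + s.declaredUtility);
    }
}
```

On the first line of the example file, the sum of item utilities and the declared SUtility should agree, which is a simple consistency check that can be applied to each line of an input file.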