Calculate Statistics for a Sequence Database with Utility Information (SPMF documentation)
This example explains how to calculate statistics for a Sequence Database with Utility Information using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Calculate_stats_for_a_sequence_database_with_utility" algorithm, (2) choose the input file DataBase_HUSRM.txt, (3) click "Run algorithm".
- If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Calculate_stats_for_a_sequence_database_with_utility DataBase_HUSRM.txt no_output_file
in a folder containing spmf.jar and the input file DataBase_HUSRM.txt.
- If you are using the source code version of SPMF, launch the file "MainTestStatsSequenceDatabaseUtility.java" in the package ca.pfv.spmf.tests.
What is this tool?
This tool generates statistics about a sequence database with utility information, the kind of database used by algorithms such as HUSRM and USpan.
What is the input?
The input is a sequence database with utility information.
Let's consider the following sequence database consisting of 4 sequences of transactions (s1, s2, s3, s4) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DataBase_HUSRM.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
ID | Sequence | Sequence utility |
s1 | {1[1],2[4]},{3[10]},{6[9]},{7[2]},{5[1]} | 27 |
s2 | {1[1],4[12]},{3[20]},{2[4]},{5[1],7[2]} | 40 |
s3 | {1[1]},{2[4]},{6[9]},{5[1]} | 15 |
s4 | {1[3],2[4],3[5]},{6[3],7[1]} | 16 |
Each line of the database is a sequence:
- each sequence is an ordered list of transactions, where each transaction is enclosed by {} in this example
- each transaction contains a set of items represented by integers
- each item is annotated with a utility value (e.g. sale profit), indicated between square brackets [ ].
- the sum of the utilities (e.g. profit) of all items in the sequence is also indicated (the "sequence utility" column)
What are real-life examples of such a database? A typical example is a database containing sequences of customer transactions. Imagine that each sequence represents the transactions made by a customer. The first customer, named "s1", bought items 1 and 2, and those items respectively generated a profit of $1 and $4. Then, the customer bought item 3 for $10. Then, the customer bought item 6 for $9. Then, the customer bought item 7 for $2. Then, the customer bought item 5 for $1.
What is the output?
The output is a set of statistics about the database. For example, if we apply the tool to the example database above, we get the following statistics:
============ SEQUENCE UTILITY DATABASE STATS ==========
File size (MB): 0.00
Number of sequences : 4
Max item: 7
Average number of itemsets per sequence : 3.75 standard deviation: 1.0897247358851685 variance: 1.1875000000000002
Average number of items per itemset : 1.4 standard deviation: 0.6110100926607788 variance: 0.37333333333333346
Average utility per item: 4.666666666666667 standard deviation: 4.734205151099645 variance: 22.41269841269841
Average utility per sequence: 24.5 standard deviation: 10.111874208078342 variance: 102.25
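The figures above can be reproduced by hand. The following Python sketch is not SPMF's implementation; it is an independent illustration that hard-codes the example database and assumes the variance is the population variance (dividing by n rather than n - 1), which is the assumption that matches the reported numbers. The names DATABASE, summarize and database_stats are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance

# Example database: each sequence is a list of itemsets,
# each itemset is a list of (item, utility) pairs.
DATABASE = [
    [[(1, 1), (2, 4)], [(3, 10)], [(6, 9)], [(7, 2)], [(5, 1)]],
    [[(1, 1), (4, 12)], [(3, 20)], [(2, 4)], [(5, 1), (7, 2)]],
    [[(1, 1)], [(2, 4)], [(6, 9)], [(5, 1)]],
    [[(1, 3), (2, 4), (3, 5)], [(6, 3), (7, 1)]],
]


def summarize(values):
    """Return (mean, population standard deviation, population variance)."""
    variance = pvariance(values)
    return mean(values), sqrt(variance), variance


def database_stats(database):
    """Compute the same kinds of statistics as the report above."""
    itemsets = [itemset for seq in database for itemset in seq]
    utilities = [u for itemset in itemsets for _, u in itemset]
    seq_utilities = [sum(u for itemset in seq for _, u in itemset)
                     for seq in database]
    return {
        "sequences": len(database),
        "max_item": max(i for itemset in itemsets for i, _ in itemset),
        "itemsets_per_sequence": summarize([len(seq) for seq in database]),
        "items_per_itemset": summarize([len(s) for s in itemsets]),
        "utility_per_item": summarize(utilities),
        "utility_per_sequence": summarize(seq_utilities),
    }


for name, value in database_stats(DATABASE).items():
    print(name, value)
```

For instance, the average utility per sequence is (27 + 40 + 15 + 16) / 4 = 24.5, and the average number of itemsets per sequence is (5 + 4 + 4 + 2) / 4 = 3.75.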
What is the input file format?
The input file format is defined as follows. It is a text file.
- Each line represents a sequence of transactions.
- Each transaction is terminated by the keyword -1.
- A transaction is a list of items (positive integers) separated by single spaces, where each item is annotated with the sale profit it generated, indicated between square brackets [ ]. The sale profit is a positive integer.
- In a transaction, it is assumed that items are sorted according to some order (e.g. alphabetical order).
- Each sequence ends with the keyword "-2". This is followed by the keyword "SUtility:" and the sum of the utility (profit) of all items in that sequence.
For the previous example, the input file is defined as follows:
1[1] 2[4] -1 3[10] -1 6[9] -1 7[2] -1 5[1] -1 -2 SUtility:27
1[1] 4[12] -1 3[20] -1 2[4] -1 5[1] 7[2] -1 -2 SUtility:40
1[1] -1 2[4] -1 6[9] -1 5[1] -1 -2 SUtility:15
1[3] 2[4] 3[5] -1 6[3] 7[1] -1 -2 SUtility:16
For example, consider the first line. It means that the first customer bought items 1 and 2, and those items respectively generated a profit of $1 and $4. Then, the customer bought item 3 for $10. Then, the customer bought item 6 for $9. Then, the customer bought item 7 for $2. Then, the customer bought item 5 for $1. Thus, this customer has made 5 transactions. The total utility (profit) generated by that sequence of transactions is $1 + $4 + $10 + $9 + $2 + $1 = $27.
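To make the format concrete, here is a minimal Python sketch of a reader for one line of such a file. It is an illustration written from the format description above, not the parser used inside SPMF; the names ITEM_PATTERN, parse_sequence and sequence_utility are hypothetical. It also checks the declared SUtility value against the sum of the item utilities.

```python
import re

# An item token looks like "3[10]": item 3 with utility 10.
ITEM_PATTERN = re.compile(r"^(\d+)\[(\d+)\]$")


def parse_sequence(line):
    """Parse one line into (transactions, declared_utility).

    transactions is a list of lists of (item, utility) pairs.
    """
    body, _, declared = line.strip().partition("SUtility:")
    transactions, current = [], []
    for token in body.split():
        if token == "-1":      # end of the current transaction
            transactions.append(current)
            current = []
        elif token == "-2":    # end of the sequence
            break
        else:
            match = ITEM_PATTERN.match(token)
            if match is None:
                raise ValueError(f"unexpected token: {token!r}")
            current.append((int(match.group(1)), int(match.group(2))))
    return transactions, int(declared)


def sequence_utility(transactions):
    """Sum the utilities of all items in all transactions."""
    return sum(u for transaction in transactions for _, u in transaction)


line = "1[1] 2[4] -1 3[10] -1 6[9] -1 7[2] -1 5[1] -1 -2 SUtility:27"
transactions, declared = parse_sequence(line)
# Number of transactions, computed utility, declared utility.
print(len(transactions), sequence_utility(transactions), declared)
```

Run on the first line of the example file, the sketch finds 5 transactions and a computed utility of 27, which agrees with the declared SUtility value.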