Calculate Statistics for a Time-Extended Sequence Database (SPMF documentation)
This example explains how to calculate statistics for a time-extended sequence database using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Calculate_stats_for_a_time-extended_sequence_database" algorithm, (2) choose the input file contextSequencesTimeExtended.txt, and (3) click "Run algorithm".
- If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Calculate_stats_for_a_time-extended_sequence_database contextSequencesTimeExtended.txt no_output_file
in a folder containing spmf.jar and the input file contextSequencesTimeExtended.txt.
- If you are using the source code version of SPMF, launch the file "MainTestGenerateTimeSequenceDatabaseStats.java" in the package ca.pfv.SPMF.tests.
What is this tool?
This tool generates statistics about a time-extended sequence database, a type of data taken as input by algorithms such as the Hirate & Yamana algorithm.
What is the input?
The input is a time-extended sequence database. A time-extended sequence database is a set of time-extended sequences. A time-extended sequence is a list of itemsets (groups of items). Each itemset is annotated with a timestamp that is an integer value. Note that it is assumed that an item does not appear more than once in an itemset and that items in an itemset are lexically ordered.
For example, consider the following time-extended sequence database provided in the file contextSequencesTimeExtended.txt of the SPMF distribution. The database contains 4 time-extended sequences. Each sequence contains itemsets that are annotated with a timestamp. For example, consider the sequence S1. This sequence indicates that itemset {1} appeared at time 0. It was followed by the itemset {1, 2, 3} at time 1. This latter itemset was followed by the itemset {1, 3} at time 2.
ID | Sequences |
S1 | (0, 1), (1, 1 2 3), (2, 1 3) |
S2 | (0, 1), (1, 1 2), (2, 1 2 3), (3, 1 2 3) |
S3 | (0, 1 2), (1, 1 2) |
S4 | (0, 2), (1, 1 2 3) |
Algorithms such as the Hirate & Yamana algorithm take this kind of database as input and discover time-extended sequential patterns that are common to several sequences. To do that, the user needs to provide five constraints (see the paper by Hirate & Yamana, 2006 for full details):
- minimum support (minsup): the minimum number of sequences that should contain a sequential pattern (an integer >=0)
- minimum time interval allowed between two successive itemsets of a sequential pattern (min_time_interval) (an integer >=0)
- maximum time interval allowed between two successive itemsets of a sequential pattern (max_time_interval) (an integer >=0)
- minimum time interval allowed between the first itemset and the last itemset of a sequential pattern (min_whole_interval) (an integer >=0)
- maximum time interval allowed between the first itemset and the last itemset of a sequential pattern (max_whole_interval) (an integer >=0)
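The four time constraints above can be illustrated with a short Python sketch. It is not SPMF code: the function name, the representation of a pattern occurrence as a plain list of timestamps, and the choice of inclusive bounds are all assumptions made for illustration, based only on the descriptions in the list above.

```python
from typing import Sequence


def satisfies_time_constraints(
    timestamps: Sequence[int],
    min_time_interval: int,
    max_time_interval: int,
    min_whole_interval: int,
    max_whole_interval: int,
) -> bool:
    """Check the four time constraints for one occurrence of a pattern.

    `timestamps` holds the timestamps of the itemsets matched by the
    pattern, in order of appearance (hypothetical representation).
    Bounds are treated as inclusive, which is an assumption.
    """
    # Gaps between each pair of successive itemsets.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if any(gap < min_time_interval or gap > max_time_interval for gap in gaps):
        return False
    # Time elapsed between the first and the last itemset.
    whole = timestamps[-1] - timestamps[0] if timestamps else 0
    return min_whole_interval <= whole <= max_whole_interval


# Itemsets of S1 at times 0, 1 and 2: every gap is 1, the whole span is 2.
print(satisfies_time_constraints([0, 1, 2], 0, 1, 0, 2))  # True
# Itemsets at times 0 and 2 only: the gap of 2 exceeds max_time_interval = 1.
print(satisfies_time_constraints([0, 2], 0, 1, 0, 2))  # False
```

The minimum support constraint is not covered here, since it depends on counting pattern occurrences across all sequences rather than on timestamps alone.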
What is the output?
The output is a set of statistics about the time-extended sequence database. For example, running the tool on the example database above gives the following statistics:
============ SEQUENCE DATABASE STATS ==========
Number of sequences : 4
File C:\Users\Phil\Desktop\test_files\contextSequencesTimeExtended.txt
Number of distinct items: 3
Largest item id: 3
Average number of itemsets per sequence : 2.75 standard deviation: 0.82915619758885 variance: 0.6875
Average number of distinct item per sequence : 2.75 standard deviation: 0.4330127018922193 variance: 0.18749999999999997
Average number of occurences in a sequence for each item appearing in a sequence : 2.090909090909091 standard deviation: 0.899954085146515 variance: 0.8099173553719007
Average number of items per itemset : 2.090909090909091 standard deviation: 0.7925270806437588 variance: 0.6280991735537189
Timestamps range: 0 to 3
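To make the meaning of each statistic concrete, the following Python sketch recomputes them for the example database. It is not the SPMF implementation: the data layout and the function names are assumptions made for illustration. The one thing that can be checked against the figures above is that they correspond to the population (not sample) standard deviation and variance.

```python
from statistics import mean, pstdev, pvariance

# The example database: each itemset is a (timestamp, items) pair.
DATABASE = [
    [(0, [1]), (1, [1, 2, 3]), (2, [1, 3])],
    [(0, [1]), (1, [1, 2]), (2, [1, 2, 3]), (3, [1, 2, 3])],
    [(0, [1, 2]), (1, [1, 2])],
    [(0, [2]), (1, [1, 2, 3])],
]


def summarize(values):
    """Return (mean, population standard deviation, population variance)."""
    return mean(values), pstdev(values), pvariance(values)


def database_stats(db):
    itemsets_per_sequence = [len(seq) for seq in db]
    distinct_per_sequence = [
        len({item for _, items in seq for item in items}) for seq in db
    ]
    items_per_itemset = [len(items) for seq in db for _, items in seq]

    # For every (sequence, item) pair, how often the item occurs in the sequence.
    occurrences = []
    for seq in db:
        counts = {}
        for _, items in seq:
            for item in items:
                counts[item] = counts.get(item, 0) + 1
        occurrences.extend(counts.values())

    all_items = {item for seq in db for _, items in seq for item in items}
    timestamps = [t for seq in db for t, _ in seq]
    return {
        "sequences": len(db),
        "distinct_items": len(all_items),
        "largest_item": max(all_items),
        "itemsets_per_sequence": summarize(itemsets_per_sequence),
        "distinct_items_per_sequence": summarize(distinct_per_sequence),
        "occurrences_per_item": summarize(occurrences),
        "items_per_itemset": summarize(items_per_itemset),
        "timestamp_range": (min(timestamps), max(timestamps)),
    }


stats = database_stats(DATABASE)
for name, value in stats.items():
    print(name, value)
```

For the example database, the sequences contain 3, 4, 2 and 2 itemsets, which gives the average of 2.75 and the variance of 0.6875 reported above. The last digits of the floating-point values may differ slightly from the tool's output.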
Input file format
The input file format is defined as follows. It is a text file where each line represents a time-extended sequence from a sequence database. Each line is a list of itemsets, where each itemset has a timestamp represented by a non-negative integer and each item is represented by a positive integer. Each itemset is first represented by its timestamp between the "<" and ">" symbols. Then, the items of the itemset appear separated by single spaces. Finally, the end of an itemset is indicated by "-1". After all the itemsets, the end of a sequence (line) is indicated by the symbol "-2". Note that it is assumed that items are sorted according to a total order in each itemset and that no item appears twice in the same itemset.
For example, the input file "contextSequencesTimeExtended.txt" contains the following four lines (four sequences).
<0> 1 -1 <1> 1 2 3 -1 <2> 1 3 -1 -2
<0> 1 -1 <1> 1 2 -1 <2> 1 2 3 -1 <3> 1 2 3 -1 -2
<0> 1 2 -1 <1> 1 2 -1 -2
<0> 2 -1 <1> 1 2 3 -1 -2
Consider the first line. It indicates that at time 0 the itemset {1} appeared, followed by the itemset {1, 2, 3} at time 1, then followed by the itemset {1, 3} at time 2. Note that timestamps do not need to be consecutive integers, but they should increase for each successive itemset within a sequence. The second, third and fourth lines follow the same format.
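The format described above can be read with a few lines of Python. This is a minimal sketch rather than SPMF's own parser: the function names are hypothetical, and the check treats "timestamps should increase" as strictly increasing, which is an assumption.

```python
def parse_sequence(line):
    """Parse one line into a list of (timestamp, items) pairs."""
    sequence = []
    timestamp = None
    items = []
    for token in line.split():
        if token == "-2":  # end of the sequence
            break
        if token == "-1":  # end of the current itemset
            sequence.append((timestamp, items))
            timestamp, items = None, []
        elif token.startswith("<") and token.endswith(">"):
            timestamp = int(token[1:-1])  # timestamp written as <t>
        else:
            items.append(int(token))  # an item of the current itemset
    return sequence


def check_sequence(sequence):
    """Raise ValueError if the stated assumptions on a sequence are violated."""
    timestamps = [t for t, _ in sequence]
    if any(later <= earlier for earlier, later in zip(timestamps, timestamps[1:])):
        raise ValueError("timestamps must increase within a sequence")
    for _, items in sequence:
        if len(set(items)) != len(items):
            raise ValueError("an item appears twice in the same itemset")


first_line = "<0> 1 -1 <1> 1 2 3 -1 <2> 1 3 -1 -2"
parsed = parse_sequence(first_line)
check_sequence(parsed)
print(parsed)  # [(0, [1]), (1, [1, 2, 3]), (2, [1, 3])]
```

Applying parse_sequence to each non-empty line of contextSequencesTimeExtended.txt would yield the four sequences of the example database.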