Calculate Statistics for a multi-dimensional Sequence Database with Timestamps (SPMF documentation)

This example explains how to calculate statistics for a multi-dimensional sequence database with timestamps using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Calculate_stats_for_an_md_time_sequence_database" algorithm, (2) choose the input file ContextMDSequence.txt, and (3) click "Run algorithm".
If you want to execute this example from the command line, then execute this command in a folder containing spmf.jar and the input file ContextMDSequence.txt:
java -jar spmf.jar run Calculate_stats_for_an_md_time_sequence_database ContextMDSequence.txt no_output_file
If you are using the source code version of SPMF, launch the file "MainTestMDSequenceDatabaseStats.java" in the package ca.pfv.SPMF.tests.

What is this tool?

This tool generates statistics about a multi-dimensional sequence database with timestamps.

What is the input?

The input is a multi-dimensional sequence database with timestamps.

A time-extended multi-dimensional sequence database is a set of time-extended multi-dimensional sequences. A time-extended multi-dimensional sequence (here called an MD-Sequence) is a time-extended sequence (as defined by Hirate & Yamana) augmented with dimensional information (as defined by Pinto et al. 2001).

A time-extended multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a time-extended sequence. A time-extended sequence is an ordered list of itemsets (groups of items) with timestamps. Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered. An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).

The set of dimensional values for an MD-Sequence is called an MD-Pattern. For a multi-dimensional database, there is a fixed set of dimensions. Each dimension can take a symbolic value or the value "*", which means any value. The following MD-Database contains four MD-Sequences named S1, S2, S3 and S4. This database is provided in the file ContextMDSequence.txt.

MD-Sequences
ID	MD-Patterns			Sequences
	d1	d2	d3
S1	1	1	1	(0, 2 4), (1, 3), (2, 2), (3, 1)
S2	1	2	2	(0, 2 6), (1, 3 5), (2, 6 7)
S3	1	2	1	(0, 1 8), (1, 1), (2, 2), (3, 6)
S4	*	3	3	(0, 2 5), (1, 3 5)

For instance, the first MD-Sequence represents that items 2 and 4 appeared together at time 0, followed by item 3 at time 1, then item 2 at time 2, and finally item 1 at time 3. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-Sequence means "any value".
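To make the structure concrete, the example database can be modeled in memory as follows. This is only an illustrative Python sketch, not SPMF code: the class MDSequence, the list DATABASE and the function describe are hypothetical names, and using None to stand for "*" is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MDSequence:
    """Hypothetical container for one MD-Sequence (not an SPMF class)."""
    name: str
    md_pattern: Tuple[Optional[int], ...]               # None stands for "*"
    itemsets: Tuple[Tuple[int, Tuple[int, ...]], ...]   # (timestamp, items)


# The four MD-Sequences of the example database.
DATABASE = [
    MDSequence("S1", (1, 1, 1), ((0, (2, 4)), (1, (3,)), (2, (2,)), (3, (1,)))),
    MDSequence("S2", (1, 2, 2), ((0, (2, 6)), (1, (3, 5)), (2, (6, 7)))),
    MDSequence("S3", (1, 2, 1), ((0, (1, 8)), (1, (1,)), (2, (2,)), (3, (6,)))),
    MDSequence("S4", (None, 3, 3), ((0, (2, 5)), (1, (3, 5)))),
]


def describe(seq):
    """Return a one-line, human-readable description of an MD-Sequence."""
    dims = ", ".join(
        "d%d=%s" % (i + 1, "*" if v is None else v)
        for i, v in enumerate(seq.md_pattern)
    )
    events = "; ".join(
        "t=%d: {%s}" % (t, ", ".join(str(x) for x in items))
        for t, items in seq.itemsets
    )
    return "%s [%s] %s" % (seq.name, dims, events)


for sequence in DATABASE:
    print(describe(sequence))
```

Each sequence keeps its MD-Pattern and its timestamped itemsets side by side, which mirrors the two columns of the table above.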

What is the output?

The output is a set of statistics about the sequence database. For example, running the tool on the example database above produces the following statistics:

============== MD-SEQUENCE DATABASE STATS ==========
File C:\Users\Phil\Desktop\test_files\ContextMDSequence.txt
Number of MD-sequences : 4
Number of dimensions: 3
Dimension 0 has 2 different values.
Dimension 1 has 3 different values.
Dimension 2 has 3 different values.
Number of distinct items: 8
Largest item id: 8
Average number of itemsets per sequence : 3.25 standard deviation: 0.82915619758885 variance: 0.6875
Average number of distinct item per sequence : 4.0 standard deviation: 0.7071067811865476 variance: 0.5000000000000001
Average number of occurences in a sequence for each item appearing in a sequence : 1.25 standard deviation: 0.4330127018922193 variance: 0.18749999999999997
Average number of items per itemset : 1.5384615384615385 standard deviation: 0.4985185152621431 variance: 0.2485207100591716
Timestamps range: 0 to 3
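The figures above can be checked by hand. The following Python sketch recomputes them from the example database; it is not SPMF's implementation, and the names DB, summarize and database_stats are hypothetical. It assumes that the variance is the population variance (dividing by the number of values) and that "*" counts as a distinct dimension value, since both assumptions are consistent with the numbers printed above.

```python
from collections import Counter
from statistics import mean, pstdev, pvariance

# (dimension values, [(timestamp, items), ...]) for S1 to S4.
DB = [
    (("1", "1", "1"), [(0, [2, 4]), (1, [3]), (2, [2]), (3, [1])]),
    (("1", "2", "2"), [(0, [2, 6]), (1, [3, 5]), (2, [6, 7])]),
    (("1", "2", "1"), [(0, [1, 8]), (1, [1]), (2, [2]), (3, [6])]),
    (("*", "3", "3"), [(0, [2, 5]), (1, [3, 5])]),
]


def summarize(values):
    """Return (mean, population standard deviation, population variance)."""
    return mean(values), pstdev(values), pvariance(values)


def database_stats(db):
    """Recompute the statistics shown in the example output (sketch)."""
    distinct_per_seq, occurrences, itemset_sizes = [], [], []
    all_items, timestamps = set(), []
    for _, seq in db:
        # Number of occurrences of each item within this sequence.
        counts = Counter(item for _, items in seq for item in items)
        distinct_per_seq.append(len(counts))
        occurrences.extend(counts.values())
        itemset_sizes.extend(len(items) for _, items in seq)
        timestamps.extend(t for t, _ in seq)
        all_items.update(counts)
    n_dims = len(db[0][0])
    return {
        "sequences": len(db),
        "dimensions": n_dims,
        "values_per_dimension": [
            len({dims[i] for dims, _ in db}) for i in range(n_dims)
        ],
        "distinct_items": len(all_items),
        "largest_item": max(all_items),
        "itemsets_per_sequence": summarize([len(seq) for _, seq in db]),
        "distinct_items_per_sequence": summarize(distinct_per_seq),
        "occurrences_per_item": summarize(occurrences),
        "items_per_itemset": summarize(itemset_sizes),
        "timestamp_range": (min(timestamps), max(timestamps)),
    }


stats = database_stats(DB)
for key, value in stats.items():
    print(key, value)
```

For instance, the four sequences contain 4, 3, 4 and 2 itemsets, so the average is 13 / 4 = 3.25 and the population variance is 0.6875, as reported by the tool.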

Input file format

The input file format is defined as follows. It is a text file where each line represents a multi-dimensional time-extended sequence from a multi-dimensional time-extended sequence database. Each line consists of two parts.

The first part is a list of dimension values separated by single spaces. A dimension value is a positive integer or the symbol "*" meaning "any value". Finally, the value "-3" indicates the end of the first part. Note that each line should have the same number of dimension values.
The second part is a list of itemsets, where each itemset has a timestamp represented by a non-negative integer and each item is represented by a positive integer. Each itemset is first represented by its timestamp between the "<" and ">" symbols. Then, the items of the itemset appear separated by single spaces. Finally, the end of an itemset is indicated by "-1". After all the itemsets, the end of a sequence (line) is indicated by the symbol "-2". Note that it is assumed that items are sorted according to a total order in each itemset and that no item appears twice in the same itemset.

For example, the input file "ContextMDSequence.txt" contains the following four lines (four sequences).

1 1 1 -3 <0> 2 4 -1 <1> 3 -1 <2> 2 -1 <3> 1 -1 -2
1 2 2 -3 <0> 2 6 -1 <1> 3 5 -1 <2> 6 7 -1 -2
1 2 1 -3 <0> 1 8 -1 <1> 1 -1 <2> 2 -1 <3> 6 -1 -2
* 3 3 -3 <0> 2 5 -1 <1> 3 5 -1 -2

Consider the second line. It indicates that the second multi-dimensional time-extended sequence of this database has the dimension values 1, 2 and 2. Furthermore, the first itemset is {2, 6} with a timestamp of 0. Then, the itemset {3, 5} appears with a timestamp of 1. Finally, the itemset {6, 7} appears with a timestamp of 2. The other sequences follow the same format. Note that timestamps do not need to be consecutive integers, but they should increase for each successive itemset within the same sequence.
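The format described above can be read with a few lines of code. The following Python sketch parses one line of such a file; it is an illustration under stated assumptions, not SPMF's own parser. The function name parse_md_sequence is hypothetical, dimension values are kept as strings so that "*" is preserved, and "timestamps should increase" is interpreted here as strictly increasing.

```python
def parse_md_sequence(line):
    """Parse one line into (dimension values, [(timestamp, items), ...])."""
    tokens = line.split()
    end_of_dimensions = tokens.index("-3")   # "-3" closes the first part
    dimensions = tokens[:end_of_dimensions]  # kept as strings, "*" included

    itemsets = []
    timestamp, items = None, []
    for token in tokens[end_of_dimensions + 1:]:
        if token == "-2":                    # end of the sequence
            break
        if token == "-1":                    # end of the current itemset
            itemsets.append((timestamp, items))
            timestamp, items = None, []
        elif token.startswith("<") and token.endswith(">"):
            timestamp = int(token[1:-1])     # e.g. "<2>" gives 2
        else:
            items.append(int(token))

    # Assumption: timestamps must be strictly increasing within a sequence.
    for (previous, _), (current, _) in zip(itemsets, itemsets[1:]):
        if current <= previous:
            raise ValueError("timestamps must increase within a sequence")
    return dimensions, itemsets


second_line = "1 2 2 -3 <0> 2 6 -1 <1> 3 5 -1 <2> 6 7 -1 -2"
print(parse_md_sequence(second_line))
```

Applied to the second line of ContextMDSequence.txt, the function returns the dimension values 1, 2 and 2 together with the itemsets {2, 6}, {3, 5} and {6, 7} at timestamps 0, 1 and 2.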