Calculate Statistics for a multi-dimensional Sequence Database with Timestamps (SPMF documentation)

This example explains how to calculate statistics for a multi-dimensional sequence database with timestamps using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool generates statistics about a multi-dimensional sequence database with timestamps.

What is the input?

The input is a multi-dimensional sequence database with timestamps.

A time-extended multi-dimensional sequence database is a set of time-extended multi-dimensional sequences. A time-extended multi-dimensional sequence (here called an MD-Sequence) is a time-extended sequence (as defined by Hirate & Yamana) extended with dimensional information (as defined by Pinto et al. 2001).

A time-extended multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a time-extended sequence. A time-extended sequence is an ordered list of itemsets (groups of items) with timestamps. Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered. An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).

The set of dimensional values for an MD-Sequence is called an MD-Pattern. A multi-dimensional database has a fixed set of dimensions. Each dimension can take a symbolic value or the value "*", which means any value. The following MD-Database contains four MD-Sequences named S1, S2, S3 and S4. This database is provided in the file ContextMDSequence.txt.

MD-Sequences

ID   MD-Pattern    Sequence
     d1  d2  d3
S1   1   1   1     (0, 2 4), (1, 3), (2, 2), (3, 1)
S2   1   2   2     (0, 2 6), (1, 3 5), (2, 6 7)
S3   1   2   1     (0, 1 8), (1, 1), (2, 2), (3, 6)
S4   *   3   3     (0, 2 5), (1, 3 5)

For instance, the first MD-Sequence indicates that items 2 and 4 appeared together at time 0, followed by item 3 at time 1, then item 2 at time 2, and finally item 1 at time 3. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-Sequence means "any value".
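To make the data model concrete, here is a minimal Python sketch of one way an MD-Sequence could be represented in memory. The class MDSequence and the function is_well_formed are hypothetical names chosen for illustration; they are not part of SPMF. The sketch also assumes that "lexically ordered" means ascending order for integer items.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class MDSequence:
    """Hypothetical container for illustration only (not an SPMF class)."""
    md_pattern: Tuple[str, ...]  # one value per dimension; "*" means any value
    itemsets: Tuple[Tuple[int, Tuple[int, ...]], ...]  # (timestamp, items) pairs

def is_well_formed(seq: MDSequence) -> bool:
    """Check the two itemset assumptions: no repeated item, items in ascending order."""
    return all(list(items) == sorted(set(items)) for _, items in seq.itemsets)

# S1 and S4 from the example database
s1 = MDSequence(("1", "1", "1"), ((0, (2, 4)), (1, (3,)), (2, (2,)), (3, (1,))))
s4 = MDSequence(("*", "3", "3"), ((0, (2, 5)), (1, (3, 5))))

print(is_well_formed(s1), is_well_formed(s4))
```

Dimension values are kept as strings here so that "*" can be stored alongside the integer-coded values.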

What is the output?

The output is a set of statistics about the sequence database. For example, running the tool on the example database above produces the following statistics:

============== MD-SEQUENCE DATABASE STATS ==========
File C:\Users\Phil\Desktop\test_files\ContextMDSequence.txt
Number of MD-sequences : 4
Number of dimensions: 3
Dimension 0 has 2 different values.
Dimension 1 has 3 different values.
Dimension 2 has 3 different values.
Number of distinct items: 8
Largest item id: 8
Average number of itemsets per sequence : 3.25 standard deviation: 0.82915619758885 variance: 0.6875
Average number of distinct item per sequence : 4.0 standard deviation: 0.7071067811865476 variance: 0.5000000000000001
Average number of occurences in a sequence for each item appearing in a sequence : 1.25 standard deviation: 0.4330127018922193 variance: 0.18749999999999997
Average number of items per itemset : 1.5384615384615385 standard deviation: 0.4985185152621431 variance: 0.2485207100591716
Timestamps range: 0 to 3
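The following Python sketch shows how statistics of this kind could be computed by hand for the example database. It is not SPMF's implementation: the database is typed in manually, the function summarize is a hypothetical helper, and the use of the population variance (dividing by n) is an inference from the reported figures, not something confirmed from the SPMF source code. The sketch also counts "*" as one of the values of a dimension, which is what the reported count for dimension 0 suggests.

```python
from collections import Counter
from math import sqrt

# The four example MD-sequences, typed in by hand:
# (dimension values, [(timestamp, items), ...])
database = [
    (("1", "1", "1"), [(0, [2, 4]), (1, [3]), (2, [2]), (3, [1])]),
    (("1", "2", "2"), [(0, [2, 6]), (1, [3, 5]), (2, [6, 7])]),
    (("1", "2", "1"), [(0, [1, 8]), (1, [1]), (2, [2]), (3, [6])]),
    (("*", "3", "3"), [(0, [2, 5]), (1, [3, 5])]),
]

def summarize(values):
    """Return (mean, standard deviation, variance) using the population variance."""
    n = len(values)
    avg = sum(values) / n
    variance = sum((v - avg) ** 2 for v in values) / n
    return avg, sqrt(variance), variance

itemsets_per_sequence = [len(seq) for _, seq in database]
distinct_items_per_sequence = [
    len({item for _, items in seq for item in items}) for _, seq in database
]
# One count per (sequence, item) pair: how often the item occurs in that sequence.
occurrences_per_item = [
    count
    for _, seq in database
    for count in Counter(item for _, items in seq for item in items).values()
]
items_per_itemset = [len(items) for _, seq in database for _, items in seq]
distinct_items = {item for _, seq in database for _, items in seq for item in items}
timestamps = [t for _, seq in database for t, _ in seq]

print("Number of MD-sequences:", len(database))
for d in range(len(database[0][0])):
    value_count = len({dims[d] for dims, _ in database})  # "*" counts as a value
    print(f"Dimension {d} has {value_count} different values.")
print("Number of distinct items:", len(distinct_items))
print("Largest item id:", max(distinct_items))
for label, values in [
    ("Itemsets per sequence", itemsets_per_sequence),
    ("Distinct items per sequence", distinct_items_per_sequence),
    ("Occurrences per item in a sequence", occurrences_per_item),
    ("Items per itemset", items_per_itemset),
]:
    avg, std, var = summarize(values)
    print(f"{label}: {avg} standard deviation: {std} variance: {var}")
print("Timestamps range:", min(timestamps), "to", max(timestamps))
```

The printed labels are simplified and are not meant to match the tool's output text character for character.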

Input file format

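The input file format is defined as follows. It is a text file where each line represents a multi-dimensional time-extended sequence from a multi-dimensional time-extended sequence database. Each line consists of two parts. The first part lists the dimension values (the MD-pattern), separated by single spaces and terminated by the value -3. The second part is the time-extended sequence. Each itemset starts with its timestamp between "<" and ">", followed by its items separated by single spaces, and ends with the value -1. The value -2 indicates the end of the sequence.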

For example, the input file "ContextMDSequence.txt" contains the following four lines (four sequences).

1 1 1 -3 <0> 2 4 -1 <1> 3 -1 <2> 2 -1 <3> 1 -1 -2
1 2 2 -3 <0> 2 6 -1 <1> 3 5 -1 <2> 6 7 -1 -2
1 2 1 -3 <0> 1 8 -1 <1> 1 -1 <2> 2 -1 <3> 6 -1 -2
* 3 3 -3 <0> 2 5 -1 <1> 3 5 -1 -2

Consider the second line. It indicates that the second multi-dimensional time-extended sequence of this database has the dimension values 1, 2 and 2. Furthermore, the first itemset is {2, 6} with a timestamp of 0. Then, the itemset {3, 5} appears with a timestamp of 1. Finally, the itemset {6, 7} appears with a timestamp of 2. The other sequences follow the same format. Note that timestamps do not need to be consecutive integers, but they should increase for each successive itemset within the same sequence.
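As an illustration of the format, the following Python sketch reads one line of such a file. It is not SPMF code: parse_md_sequence and timestamps_increase are hypothetical helper names, and the sketch assumes, as the example lines suggest, that -3 ends the dimension values, a token such as <0> gives the timestamp of the itemset that follows, -1 ends an itemset and -2 ends the sequence.

```python
def parse_md_sequence(line):
    """Parse one line into (dimension_values, [(timestamp, [items]), ...])."""
    tokens = line.split()
    separator = tokens.index("-3")       # -3 ends the dimension values
    dimensions = tokens[:separator]      # kept as strings so "*" is preserved
    itemsets = []
    timestamp, items = None, []
    for token in tokens[separator + 1:]:
        if token == "-2":                # end of the sequence
            break
        if token == "-1":                # end of the current itemset
            itemsets.append((timestamp, items))
            timestamp, items = None, []
        elif token.startswith("<") and token.endswith(">"):
            timestamp = int(token[1:-1])  # timestamp of the current itemset
        else:
            items.append(int(token))
    return dimensions, itemsets

def timestamps_increase(itemsets):
    """Check that timestamps strictly increase from one itemset to the next."""
    stamps = [t for t, _ in itemsets]
    return all(a < b for a, b in zip(stamps, stamps[1:]))

# Second line of ContextMDSequence.txt
dims, itemsets = parse_md_sequence("1 2 2 -3 <0> 2 6 -1 <1> 3 5 -1 <2> 6 7 -1 -2")
print(dims)
print(itemsets)
print(timestamps_increase(itemsets))
```

The sketch does no error handling; a line without -3, or with a malformed timestamp, raises a ValueError.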