Calculate Statistics for a multi-dimensional Sequence Database (SPMF documentation)

This example explains how to calculate statistics for a multi-dimensional sequence database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool generates statistics about a multi-dimensional sequence database.

What is the input?

The input is a multi-dimensional sequence database (as defined by Pinto et al. 2001) and a threshold named minsup (a value in [0,1] representing a percentage).

A multi-dimensional database is a set of multi-dimensional sequences and a set of dimensions d1, d2... dn. A multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a sequence. A sequence is an ordered list of itemsets (groups of items). Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are lexically ordered. An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).

For example, consider the following database, provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. The database contains 4 MD-sequences.

MD-Sequences

ID   MD-Patterns   Sequences
     d1  d2  d3
S1   1   1   1     (2 4), (3), (2), (1)
S2   1   2   2     (2 6), (3 5), (6 7)
S3   1   2   1     (1 8), (1), (2), (6)
S4   *   3   3     (2 5), (3 5)

For instance, the first MD-Sequence represents that items 2 and 4 appeared together, followed by item 3, then item 2, then item 1. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-sequence means "any value".

What is the output?

The output is a set of statistics about the sequence database. For example, running the tool on the example database above produces the following statistics:

========== MD-SEQUENCE DATABASE STATS ==========
File /D:/workspace/SPMF_2019_for_release/bin/ca/pfv/spmf/test/ContextMDSequenceNoTime.txt
Number of MD-sequences : 4
Number of dimensions: 3
Dimension 0 has 2 different values.
Dimension 1 has 3 different values.
Dimension 2 has 3 different values.
Number of distinct items: 8
Largest item id: 8
Average number of itemsets per sequence : 3.25 standard deviation: 0.82915619758885 variance: 0.6875
Average number of distinct item per sequence : 4.0 standard deviation: 0.7071067811865476 variance: 0.5000000000000001
Average number of occurences in a sequence for each item appearing in a sequence : 1.25 standard deviation: 0.4330127018922193 variance: 0.18749999999999997
Average number of items per itemset : 1.5384615384615385 standard deviation: 0.4985185152621431 variance: 0.2485207100591716
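To make the averages and variances above easier to interpret, the following is a minimal Java sketch that recomputes them from the four example sequences. It is not the SPMF implementation: the class name MdSequenceStatsSketch and all of its methods are hypothetical, the database is hard-coded rather than read from the file, and the use of the population variance (dividing by n) is an assumption inferred from the reported figures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not part of SPMF.
public class MdSequenceStatsSketch {

    // The four example sequences; each innermost array is one itemset.
    static final int[][][] SEQUENCES = {
        {{2, 4}, {3}, {2}, {1}},
        {{2, 6}, {3, 5}, {6, 7}},
        {{1, 8}, {1}, {2}, {6}},
        {{2, 5}, {3, 5}}
    };

    static double mean(List<Integer> values) {
        double sum = 0;
        for (int v : values) sum += v;
        return sum / values.size();
    }

    // Population variance (divide by n); assumed, because it matches the output above.
    static double variance(List<Integer> values) {
        double m = mean(values);
        double sum = 0;
        for (int v : values) sum += (v - m) * (v - m);
        return sum / values.size();
    }

    // One value per sequence: its number of itemsets.
    static List<Integer> itemsetsPerSequence() {
        List<Integer> counts = new ArrayList<>();
        for (int[][] sequence : SEQUENCES) counts.add(sequence.length);
        return counts;
    }

    // One value per sequence: its number of distinct items.
    static List<Integer> distinctItemsPerSequence() {
        List<Integer> counts = new ArrayList<>();
        for (int[][] sequence : SEQUENCES) {
            Set<Integer> distinct = new HashSet<>();
            for (int[] itemset : sequence)
                for (int item : itemset) distinct.add(item);
            counts.add(distinct.size());
        }
        return counts;
    }

    // One value per (sequence, item) pair: how often the item occurs in that sequence.
    static List<Integer> occurrencesPerItem() {
        List<Integer> counts = new ArrayList<>();
        for (int[][] sequence : SEQUENCES) {
            Map<Integer, Integer> occurrences = new HashMap<>();
            for (int[] itemset : sequence)
                for (int item : itemset) occurrences.merge(item, 1, Integer::sum);
            counts.addAll(occurrences.values());
        }
        return counts;
    }

    // One value per itemset: its number of items.
    static List<Integer> itemsPerItemset() {
        List<Integer> counts = new ArrayList<>();
        for (int[][] sequence : SEQUENCES)
            for (int[] itemset : sequence) counts.add(itemset.length);
        return counts;
    }

    static void report(String label, List<Integer> values) {
        double v = variance(values);
        System.out.println(label + ": mean=" + mean(values)
                + " std=" + Math.sqrt(v) + " variance=" + v);
    }

    public static void main(String[] args) {
        report("Itemsets per sequence", itemsetsPerSequence());
        report("Distinct items per sequence", distinctItemsPerSequence());
        report("Occurrences per item in a sequence", occurrencesPerItem());
        report("Items per itemset", itemsPerItemset());
    }
}
```

For instance, the sequences contain 4, 3, 4 and 2 itemsets, so the mean is 13 / 4 = 3.25 and the population variance is 2.75 / 4 = 0.6875, in agreement with the figures reported above up to floating-point rounding.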

Input file format

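The input file format is defined as follows. It is a text file where each line represents a multi-dimensional sequence from a sequence database. Each line is separated into two parts: (1) an MD-pattern and (2) a sequence. In the MD-pattern, the dimension values are separated by single spaces, and the value -3 marks the end of the MD-pattern. In the sequence, the items of an itemset are separated by single spaces, the value -1 marks the end of each itemset, and the value -2 marks the end of the sequence.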

For example, the input file "ContextMDSequenceNoTime.txt" contains the following four lines (four sequences).

1 1 1 -3 2 4 -1 3 -1 2 -1 1 -1 -2
1 2 2 -3 2 6 -1 3 5 -1 6 7 -1 -2
1 2 1 -3 1 8 -1 1 -1 2 -1 6 -1 -2
* 3 3 -3 2 5 -1 3 5 -1 -2

This file contains four MD-sequences (four lines). Each MD-pattern has three dimensions. For example, consider the second line. It represents an MD-sequence where the values for the three dimensions are respectively 1, 2 and 2. The sequence in this MD-Sequence is the itemset {2, 6}, followed by the itemset {3, 5}, followed by the itemset {6, 7}.
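As an illustration of this format, here is a minimal Java sketch that splits one line into its MD-pattern and its itemsets. It is not SPMF's own reader: the class MdSequenceLineParser, the holder class MdSequence and the method parseLine are hypothetical names. The sketch assumes that -3 ends the MD-pattern, -1 ends an itemset and -2 ends the sequence, as the example lines suggest, and it keeps dimension values as strings so that "*" can be stored unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for illustration only; not part of SPMF.
public class MdSequenceLineParser {

    // Simple holder for one parsed line.
    static final class MdSequence {
        final List<String> dimensionValues;   // e.g. ["1", "2", "2"] or ["*", "3", "3"]
        final List<List<Integer>> itemsets;   // e.g. [[2, 6], [3, 5], [6, 7]]

        MdSequence(List<String> dimensionValues, List<List<Integer>> itemsets) {
            this.dimensionValues = dimensionValues;
            this.itemsets = itemsets;
        }
    }

    static MdSequence parseLine(String line) {
        List<String> dimensionValues = new ArrayList<>();
        List<List<Integer>> itemsets = new ArrayList<>();
        List<Integer> currentItemset = new ArrayList<>();
        boolean inPattern = true;

        for (String token : line.trim().split("\\s+")) {
            if (inPattern) {
                if (token.equals("-3")) {
                    inPattern = false;            // end of the MD-pattern
                } else {
                    dimensionValues.add(token);   // a dimension value, possibly "*"
                }
            } else if (token.equals("-1")) {
                itemsets.add(currentItemset);     // end of the current itemset
                currentItemset = new ArrayList<>();
            } else if (token.equals("-2")) {
                break;                            // end of the sequence
            } else {
                currentItemset.add(Integer.parseInt(token));
            }
        }
        return new MdSequence(dimensionValues, itemsets);
    }

    public static void main(String[] args) {
        // Second line of ContextMDSequenceNoTime.txt
        MdSequence s2 = parseLine("1 2 2 -3 2 6 -1 3 5 -1 6 7 -1 -2");
        System.out.println("MD-pattern: " + s2.dimensionValues);
        System.out.println("Itemsets:   " + s2.itemsets);
    }
}
```

The sketch does no validation: a line without the -3 separator would be treated entirely as an MD-pattern. A real reader would likely reject such a line.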