Calculate Statistics for a multi-dimensional Sequence Database (SPMF documentation)

This example explains how to calculate statistics for a multi-dimensional sequence database using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Calculate_stats_for_an_md_sequence_database" algorithm, (2) choose the input file ContextMDSequenceNoTime.txt (3) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Calculate_stats_for_an_md_sequence_database ContextMDSequenceNoTime.txt no_output_file in a folder containing spmf.jar and the input file ContextMDSequenceNoTime.txt.
If you are using the source code version of SPMF, launch the file "MainTestMDSequenceDatabaseStats.java" in the package ca.pfv.SPMF.tests.

What is this tool?

This tool is a tool for generating statistics about a multi-dimensional sequence database.

What is the input?

The input is a multi-dimensional sequence database (as defined by Pinto et al. 2001) and a threshold named minsup (a value in [0,1] representing a percentage).

A multi-dimensional database is a set of multi-dimensional sequences and a set of dimensions d1, d2... dn. A multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a sequence. A sequence is an ordered list of itemsets (groups of items). Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered. An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).

For example, consider the following database, provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. The database contains 4 MD-sequences.

MD-Sequences
ID	MD-Patterns			Sequences
	d1	d2	d3
S1	1	1	1	(2 4), (3), (2), (1)
S2	1	2	2	(2 6), (3 5), (6 7)
S3	1	2	1	(1 8), (1), (2), (6)
S4	*	3	3	(2 5), (3 5)

For instance, the first MD-Sequence represents that items 2 and 4 appeared together, then were followed by 3, which was followed by item 2, wich was followed by item 1. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-sequence means "any values".

What is the output?

The output is statistics about the sequence database. For example, if we use the tool on the previous sequence database given as example, we get the following statistics:

========== MD-SEQUENCE DATABASE STATS ==========
File /D:/workspace/SPMF_2019_for_release/bin/ca/pfv/spmf/test/ContextMDSequenceNoTime.txt
Number of MD-sequences : 4
Number of dimensions: 3
Dimension 0 has 2 different values.
Dimension 1 has 3 different values.
Dimension 2 has 3 different values.
Number of distinct items: 8
Largest item id: 8
Average number of itemsets per sequence : 3.25 standard deviation: 0.82915619758885 variance: 0.6875
Average number of distinct item per sequence : 4.0 standard deviation: 0.7071067811865476 variance: 0.5000000000000001
Average number of occurences in a sequence for each item appearing in a sequence : 1.25 standard deviation: 0.4330127018922193 variance: 0.18749999999999997
Average number of items per itemset : 1.5384615384615385 standard deviation: 0.4985185152621431 variance: 0.2485207100591716

Input file format

The input file format is defined as follows. It is a text file where each line represents a multi-dimensional sequence from a sequence database. Each line is separated into two parts: (1) a MD-pattern and (2) a sequence.

The first part is a list of dimension values separated by single spaces. A dimension value is a positive integer or the symbol "*" meaning "any values". Finally, the value "-3" indicates the end of the first part. Note that each line should have the same number of dimension values.
The second part of each line is a sequence. Each item in a sequence is represented by a postive integers and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line).

For example, the input file "ContextMDSequenceNoTime.txt" contains the following four lines (four sequences).

1 1 1 -3 2 4 -1 3 -1 2 -1 1 -1 -2
1 2 2 -3 2 6 -1 3 5 -1 6 7 -1 -2
1 2 1 -3 1 8 -1 1 -1 2 -1 6 -1 -2
* 3 3 -3 2 5 -1 3 5 -1 -2

This file contains four MD-sequences (four lines). Each line has 3 dimensions in each MD-Pattern. For example, consider the second line. It represents a MD-sequence where the value for the three dimensions are respectively 1, 2 and 2. Then, the sequence in this MD-Sequence is the itemset {2, 6} followed by the itemset {3, 5}, followed by the itemset {6, 7}.