Calculate Statistics for a Sequence Database (SPMF documentation)

This example explains how to calculate statistics for a sequence database using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Calculate_stats_for_a_sequence_database" algorithm, (2) choose the input file contextPrefixSpan.txt, and (3) click "Run algorithm".
If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the input file contextPrefixSpan.txt:
java -jar spmf.jar run Calculate_stats_for_a_sequence_database contextPrefixSpan.txt no_output_file
If you are using the source code version of SPMF, launch the file "MainTestGenerateSequenceDatabaseStats.java" in the package ca.pfv.SPMF.tests.

What is this tool?

This tool generates statistics about a sequence database. It can be used, for example, to check whether a database is dense or sparse before applying a data mining algorithm.

What is the input?

The input is a sequence database (as defined in Pei et al., 2004). A sequence database is a set of sequences. Each sequence is an ordered list of itemsets. An itemset is an unordered set of items (symbols). For example, consider the following database, which is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. It contains 4 sequences. The second sequence indicates that the itemset {1, 4} was followed by {3}, which was followed by {2, 3}, which was followed by {1, 5}.

ID	Sequences
S1	(1), (1 2 3), (1 3), (4), (3 6)
S2	(1 4), (3), (2 3), (1 5)
S3	(5 6), (1 2), (4 6), (3), (2)
S4	(5), (7), (1 6), (3), (2), (3)

What is the output?

The output is a set of statistics about the sequence database. For example, running the tool on the example database above gives the following statistics:

the number of distinct items in this database: 7
the largest item id: 7
the average number of itemsets per sequence: 5
the average number of distinct items per sequence: 5.5
the average number of occurrences in a sequence for each item appearing in a sequence: 1.41
the average number of items per itemset: 1.55
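
The figures above can be reproduced by hand. The following Python sketch is not SPMF code; it is an illustration under one assumption about how each statistic is defined, chosen because it yields the same values as the tool on this database. In particular, it assumes the average number of occurrences is the total number of item occurrences (31) divided by the total number of distinct items counted per sequence (22). The function name sequence_database_stats and the dictionary keys are hypothetical.

```python
# The example database: each sequence is a list of itemsets (lists of items).
database = [
    [[1], [1, 2, 3], [1, 3], [4], [3, 6]],
    [[1, 4], [3], [2, 3], [1, 5]],
    [[5, 6], [1, 2], [4, 6], [3], [2]],
    [[5], [7], [1, 6], [3], [2], [3]],
]


def sequence_database_stats(db):
    """Compute summary statistics for a non-empty sequence database.

    The definitions used here are an assumption that matches the values
    reported for the example; they are not taken from the SPMF source.
    """
    all_items = set()          # distinct items over the whole database
    itemset_count = 0          # total number of itemsets
    item_occurrences = 0       # total number of item occurrences
    distinct_per_sequence = 0  # sum over sequences of distinct items

    for sequence in db:
        seen = set()
        for itemset in sequence:
            itemset_count += 1
            item_occurrences += len(itemset)
            seen.update(itemset)
        distinct_per_sequence += len(seen)
        all_items |= seen

    n = len(db)
    return {
        "distinct_items": len(all_items),
        "largest_item_id": max(all_items),
        "avg_itemsets_per_sequence": itemset_count / n,
        "avg_distinct_items_per_sequence": distinct_per_sequence / n,
        "avg_occurrences_per_item": item_occurrences / distinct_per_sequence,
        "avg_items_per_itemset": item_occurrences / itemset_count,
    }


stats = sequence_database_stats(database)
for name, value in stats.items():
    print(name, round(value, 2))
```

Here the database has 20 itemsets and 31 item occurrences over 4 sequences, which gives 20 / 4 = 5 itemsets per sequence and 31 / 20 = 1.55 items per itemset.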

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item in a sequence is a positive integer, and items from the same itemset are separated by a single space. It is assumed that items within the same itemset are sorted according to a total order and that no item appears twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextPrefixSpan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.
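
To make the format concrete, here is a minimal Python sketch that reads lines in this format into a list of itemsets. It is an illustration only and not the parser used by SPMF; the function name parse_sequence_line is hypothetical, and the sketch assumes well-formed input (every itemset is closed by -1 and every line ends with -2).

```python
def parse_sequence_line(line):
    """Parse one line of the input format into a list of itemsets."""
    sequence = []
    itemset = []
    for token in line.split():
        value = int(token)
        if value == -2:      # end of the sequence
            break
        if value == -1:      # end of the current itemset
            sequence.append(itemset)
            itemset = []
        else:                # a regular item (positive integer)
            itemset.append(value)
    return sequence


# The four lines of contextPrefixSpan.txt, as listed above.
lines = [
    "1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2",
    "1 4 -1 3 -1 2 3 -1 1 5 -1 -2",
    "5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2",
    "5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2",
]

database = [parse_sequence_line(line) for line in lines]
print(database[0])
```

For the first line, the sketch produces the itemsets [1], [1, 2, 3], [1, 3], [4], [3, 6], in that order.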