Fix a Sequence Database (SPMF documentation)

This example explains how to fix a sequence database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

The tool "Fix_a_sequence_database" is a small program that fix some common problems that may be found in a sequence database file in SPMF format.

The tool fixes two common problems: (1) an item appears more than once in an itemset of a sequence, (2) items from an itemset in a sequence are not sorted.

To fix the first problem the tool will keep only one occurrence of each item in an itemset. So if an items appears more than once in an itemset of a sequence, it will appears only once after applying the tool.

To fix the second problem, the tool sorts each itemset in a sequence according to the lexicographical ordering (because it is required by most sequential pattern mining algorithms).

What is the input?

The input is a sequence database. A sequence database is a set of sequence. Each sequence is an ordered list of itemsets. An itemset is an unordered list of items (symbols). For example, consider the following database. This database is provided in the file "sequence_broken.txt" of the SPMF distribution. It contains 2 sequences. The second sequence represents that the set of items {7} was followed by itemset {8}, which was followed by itemset {9, 8, 7}. It is a sequence database (as defined in Pei et al., 2004).

ID Sequence
s1 {4, 4, 4, 5, 5},{4, 5, 6}
s2 {7},{8},{9,8,7}

What is the output?

The output is a sequence database where each itemset in a sequence is sorted and no item appears more than once. For example, the output using the above example is:

ID Sequence
s1 {4,5},{4, 5, 6}
s2 {7},{8},{7,8,9}

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "sequence_borken.txt" contains the following two lines (two sequences).

4 4 4 5 5 -1 4 5 6 -1 -2
7 -1 8 -1 9 8 7 -1 -2

Output file format

The output file format is the same as the input format. But the problems contained in the input file have been fixed.

4 5 -1 4 5 6 -1 -2
7 -1 8 -1 7 8 9 -1 -2