Add consecutive timestamps to a sequence database without timestamps (SPMF documentation)

This example explains how to add consecutive timestamps to a sequence database without timestamps using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Add_consecutive_timestamps_to_sequence_database" algorithm, (2) select the input file "contextPrefixspan.txt", (3) set the output file name (e.g. "output.txt"), (4) click "Run algorithm".
If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the example input file contextPrefixspan.txt:
java -jar spmf.jar run Add_consecutive_timestamps_to_sequence_database contextPrefixspan.txt output.txt
If you are using the source code version of SPMF, launch the file "MainTestAddTimeStampsToSequenceDatabase.java" in the package ca.pfv.SPMF.tests.

What is this tool?

This tool converts a sequence database without timestamps into a sequence database with timestamps. This is useful for applying an algorithm that requires timestamp information. The tool assumes that the itemsets of a sequence have consecutive timestamps, i.e. the first itemset of each sequence is assigned timestamp 0, the second itemset timestamp 1, the third itemset timestamp 2, and so on.

What is the input?

The tool takes a sequence database as input.

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1, 2 and 3 at the same time, which were followed by items 1 and 3, followed by item 4, and followed by items 3 and 6. This database is provided in the file "contextPrefixspan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.

ID	Sequences
S1	(1), (1 2 3), (1 3), (4), (3 6)
S2	(1 4), (3), (2 3), (1 5)
S3	(5 6), (1 2), (4 6), (3), (2)
S4	(5), (7), (1 6), (3), (2), (3)

What is the output?

The output is the same sequence database, except that consecutive timestamps have been added to each itemset in each sequence. The output for the example database is shown below. In each itemset, the timestamp is the first value, before the comma. For example, the first sequence indicates that item 1 appeared at time 0, followed by items 1, 2 and 3 at time 1, followed by items 1 and 3 at time 2, followed by item 4 at time 3, and followed by items 3 and 6 at time 4.

ID	Sequences
S1	(0, 1), (1, 1 2 3), (2, 1 3), (3, 4), (4, 3 6)
S2	(0, 1 4), (1, 3), (2, 2 3), (3, 1 5)
S3	(0, 5 6), (1, 1 2), (2, 4 6), (3, 3), (4, 2)
S4	(0, 5), (1, 7), (2, 1 6), (3, 3), (4, 2), (5, 3)

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item of a sequence is a positive integer, and items from the same itemset are separated by a single space. Note that it is assumed that items within the same itemset are sorted according to a total order and that no item appears twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextPrefixspan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.
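To make the role of the "-1" and "-2" markers concrete, here is a minimal Java sketch that reads one line in this format into a list of itemsets. It is not the parser used by SPMF: the class name SequenceLineParser and the method parseLine are hypothetical names chosen for illustration, and the sketch assumes a non-empty, well-formed line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the SPMF library.
public class SequenceLineParser {

    /** Parses one line such as "1 -1 1 2 3 -1 -2" into a list of itemsets. */
    public static List<List<Integer>> parseLine(String line) {
        List<List<Integer>> sequence = new ArrayList<>();
        List<Integer> itemset = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            int value = Integer.parseInt(token);
            if (value == -1) {
                // "-1" closes the current itemset
                sequence.add(itemset);
                itemset = new ArrayList<>();
            } else if (value == -2) {
                // "-2" closes the sequence (end of the line)
                break;
            } else {
                // any other value is an item of the current itemset
                itemset.add(value);
            }
        }
        return sequence;
    }

    public static void main(String[] args) {
        // First line of contextPrefixspan.txt
        System.out.println(parseLine("1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2"));
        // prints [[1], [1, 2, 3], [1, 3], [4], [3, 6]]
    }
}
```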

Output file format

The output file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each line is a list of itemsets, where each itemset has a timestamp represented by a non-negative integer (timestamps start at 0) and each item is represented by a positive integer. Each itemset starts with its timestamp, written between the "<" and ">" symbols. Then, the items of the itemset appear, separated by single spaces. Finally, the end of an itemset is indicated by "-1". After all the itemsets, the end of a sequence (line) is indicated by the symbol "-2". Note that it is assumed that items are sorted according to a total order in each itemset and that no item appears twice in the same itemset.

For example, the output file of the example contains the following four lines (four sequences).

<0> 1 -1 <1> 1 2 3 -1 <2> 1 3 -1 <3> 4 -1 <4> 3 6 -1 -2
<0> 1 4 -1 <1> 3 -1 <2> 2 3 -1 <3> 1 5 -1 -2
<0> 5 6 -1 <1> 1 2 -1 <2> 4 6 -1 <3> 3 -1 <4> 2 -1 -2
<0> 5 -1 <1> 7 -1 <2> 1 6 -1 <3> 3 -1 <4> 2 -1 <5> 3 -1 -2

Consider the first line. It indicates that item 1 appeared at time 0, followed by items 1, 2 and 3 at time 1, followed by items 1 and 3 at time 2, followed by item 4 at time 3, and followed by items 3 and 6 at time 4.
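The relationship between the two file formats can be sketched in a few lines of Java. The code below is only an illustration of the conversion described on this page, not the source code of the SPMF tool. The class name AddTimestampsSketch and the method addTimestamps are hypothetical, and the sketch assumes a well-formed input line ending with "-2".

```java
// Hypothetical illustration; this is not the implementation shipped with SPMF.
public class AddTimestampsSketch {

    /** Converts a line without timestamps into a line with consecutive timestamps. */
    public static String addTimestamps(String line) {
        StringBuilder out = new StringBuilder();
        int timestamp = 0;              // timestamps are assigned as 0, 1, 2, ...
        boolean startOfItemset = true;  // true when the next token begins a new itemset
        for (String token : line.trim().split("\\s+")) {
            if (token.equals("-2")) {
                // end of the sequence: copy the marker and stop
                out.append("-2");
                break;
            }
            if (startOfItemset) {
                // write the timestamp before the first item of the itemset
                out.append('<').append(timestamp++).append("> ");
                startOfItemset = false;
            }
            out.append(token).append(' ');
            if (token.equals("-1")) {
                // end of the itemset: the next item starts a new one
                startOfItemset = true;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Second line of contextPrefixspan.txt
        System.out.println(addTimestamps("1 4 -1 3 -1 2 3 -1 1 5 -1 -2"));
        // prints <0> 1 4 -1 <1> 3 -1 <2> 2 3 -1 <3> 1 5 -1 -2
    }
}
```

The timestamp counter is reset for every line because it is a local variable of the method, which matches the example output where each sequence starts again at time 0.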