Converting a Transaction Database to a Sequence Database (SPMF documentation)

This example explains how to convert a transaction database to a sequence database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool converts a transaction database to a sequence database. It should be used carefully since it assumes that each transaction is a sequence, and that items in each transaction are sequentially ordered, which is usually not the case in real-life transaction databases.

What is the input?

The tool takes two parameters as input:

A transaction database is a set of transactions, and each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). The first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. Note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted in lexicographical order within a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output?

The output is a sequence database in SPMF format. A sequence database is a set of sequences. Each sequence is an ordered list of itemsets, and each itemset is an unordered set of items (symbols) represented by positive integers. In the output of this tool, each item of a transaction becomes its own itemset, in the order in which the items appear in the transaction. The output for this example is the following sequence database. It contains five sequences. The first sequence indicates that item 1 is followed by item 3, which is followed by item 4.

Sequence id Itemsets
s1 {1}, {3}, {4}
s2 {2}, {3}, {5}
s3 {1}, {2}, {3}, {5}
s4 {2}, {5}
s5 {1}, {2}, {3}, {5}
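
The mapping from the transaction table to the sequence table can be sketched in a few lines of Python. This is only an illustration of the idea, not SPMF's Java implementation, and the function name transactions_to_sequences is hypothetical.

```python
def transactions_to_sequences(transactions):
    """Turn each transaction into a sequence of single-item itemsets.

    The order of the items inside a transaction is kept as the order of
    the itemsets in the resulting sequence.
    """
    return [[[item] for item in transaction] for transaction in transactions]


# The example database (contextPasquier99.txt), one list per transaction.
transactions = [
    [1, 3, 4],
    [2, 3, 5],
    [1, 2, 3, 5],
    [2, 5],
    [1, 2, 3, 5],
]

sequences = transactions_to_sequences(transactions)
print(sequences[0])  # [[1], [3], [4]]
```

Because the conversion keeps the item order of each line, the result is only meaningful if that order really reflects a sequence of events, as cautioned above.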

Input file format

The input file format is defined as follows. It is a text file in which each line is a transaction and each item is represented by a positive integer. In each line (transaction), items are separated by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item appears twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
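
As a rough sketch of how such a file could be read, the following Python code parses the lines above and checks the two constraints that are easy to verify (positive integers, no repeated item). The function names are hypothetical, skipping blank lines is an assumption, and the code is slightly more lenient than the format since it accepts any run of whitespace between items.

```python
def parse_transaction_line(line):
    """Parse one line of the input file into a list of item integers."""
    items = [int(token) for token in line.split()]
    if any(item <= 0 for item in items):
        raise ValueError(f"items must be positive integers: {line!r}")
    if len(set(items)) != len(items):
        raise ValueError(f"an item appears twice in the same line: {line!r}")
    return items


def parse_transaction_text(text):
    """Parse the whole file content, skipping blank lines (an assumption)."""
    return [parse_transaction_line(line)
            for line in text.splitlines() if line.strip()]


example_input = """1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
"""

parsed = parse_transaction_text(example_input)
print(parsed[0])  # [1, 3, 4]
```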

Output file format

The output file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item of a sequence is a positive integer, and items from the same itemset within a sequence are separated by a single space. It is assumed that items within the same itemset are sorted according to a total order and that no item appears twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the output file for this example contains five lines (five sequences).

1 -1 3 -1 4 -1 -2
2 -1 3 -1 5 -1 -2
1 -1 2 -1 3 -1 5 -1 -2
2 -1 5 -1 -2
1 -1 2 -1 3 -1 5 -1 -2

The first line represents a sequence where the item 1 is followed by item 3, which is followed by item 4.
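
To make the role of the "-1" and "-2" markers concrete, here is a minimal Python sketch that writes a transaction as one line of this format and reads such a line back. It illustrates the file format only and is not taken from the SPMF source code; both function names are hypothetical.

```python
def format_sequence_line(transaction):
    """Write a transaction as one output line.

    Each item forms its own itemset, so every item is followed by -1,
    and the line is closed by -2.
    """
    parts = []
    for item in transaction:
        parts.append(str(item))
        parts.append("-1")
    parts.append("-2")
    return " ".join(parts)


def parse_sequence_line(line):
    """Read one output line back into a list of itemsets."""
    sequence, itemset = [], []
    for token in line.split():
        value = int(token)
        if value == -2:      # end of the sequence
            break
        if value == -1:      # end of the current itemset
            sequence.append(itemset)
            itemset = []
        else:
            itemset.append(value)
    return sequence


line = format_sequence_line([1, 3, 4])
print(line)                       # 1 -1 3 -1 4 -1 -2
print(parse_sequence_line(line))  # [[1], [3], [4]]
```

The reader also handles itemsets with several items (for example "1 2 -1 3 -1 -2"), which the general format allows even though this tool only produces single-item itemsets.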