Converting a Sequence Database to a Transaction Database (SPMF documentation)

This example explains how to convert a sequence database to a transaction database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool converts a sequence database to a transaction database by removing the ordering between items. This tool is useful if you have a sequence database and you want to apply an algorithm that is designed for transaction databases. For example, you could convert a sequence database to a transaction database and then apply an association rule mining algorithm.

What is the input?

The tool takes two parameters as input:

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1, 2 and 3 at the same time, which were followed by items 1 and 3, followed by item 4, and followed by items 3 and 6. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output?

The output is a transaction database in SPMF format. A transaction database is a set of transactions. Each transaction is an unordered set of items (symbols) represented by positive integers. For the sequence database shown above, the output is the following transaction database. It contains four transactions, one for each sequence. The first transaction contains the set of items {1, 2, 3, 4, 6}.

Transaction id Items
t1 {1, 2, 3, 4, 6}
t2 {1, 2, 3, 4, 5}
t3 {1, 2, 3, 4, 5, 6}
t4 {1, 2, 3, 5, 6, 7}

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer, and items from the same itemset within a sequence are separated by a single space. Note that it is assumed that items within the same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextPrefixSpan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.
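To make the role of the "-1" and "-2" separators concrete, here is a minimal Python sketch that reads one line in this format and returns its itemsets. It is only an illustration and is not SPMF's own parser; the function name parse_sequence is hypothetical.

```python
def parse_sequence(line):
    """Parse one line of the sequence file format into a list of itemsets.

    Hypothetical helper for illustration; not part of SPMF.
    """
    itemsets = []
    current = []
    for token in line.split():
        value = int(token)
        if value == -1:
            # "-1" closes the current itemset
            itemsets.append(current)
            current = []
        elif value == -2:
            # "-2" marks the end of the sequence
            break
        else:
            current.append(value)
    return itemsets


# First line of contextPrefixSpan.txt
print(parse_sequence("1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2"))
# prints [[1], [1, 2, 3], [1, 3], [4], [3, 6]]
```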

Output file format

The output file format is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the output file is defined as follows:

1 2 3 4 6
1 2 3 4 5
1 2 3 4 5 6
1 2 3 5 6 7
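The conversion itself can be sketched in a few lines of Python: for each sequence, drop the "-1" and "-2" separators, keep each distinct item once, and write the items in ascending order. This is a sketch of the idea under the file formats described above, not the code used by SPMF; the function name sequence_to_transaction is hypothetical, and any additional tool parameter is not handled here.

```python
def sequence_to_transaction(line):
    """Convert one sequence line into one transaction line.

    Hypothetical helper for illustration; not part of SPMF.
    """
    # Discard the itemset (-1) and sequence (-2) separators; a set removes duplicates
    items = {int(token) for token in line.split() if token not in ("-1", "-2")}
    # Items are written in ascending order, separated by single spaces
    return " ".join(str(item) for item in sorted(items))


sequences = [
    "1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2",
    "1 4 -1 3 -1 2 3 -1 1 5 -1 -2",
    "5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2",
    "5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2",
]

for line in sequences:
    print(sequence_to_transaction(line))
```

Run on the four lines of "contextPrefixSpan.txt", this prints the same four transactions as the output file shown above.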