Converting a Sequence Database to SPMF Format (SPMF documentation)
This example explains how to convert a sequence database to SPMF format using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Convert_a_sequence_database_to_SPMF_format" algorithm, (2) select the input file "contextCSV.txt", (3) set the output file name (e.g. "output.txt"), (4) set input_format = CSV_INTEGER and sequence count = 5, and (5) click "Run algorithm".
- If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the example input file contextCSV.txt:
java -jar spmf.jar run Convert_a_sequence_database_to_SPMF_format contextCSV.txt output.txt CSV_INTEGER 5
- If you are using the source code version of SPMF, launch the file "MainTestConvertSequenceDatabase.java" in the package ca.pfv.SPMF.tests.
The tool for converting a sequence database to SPMF format takes three parameters as input:
- a sequence database,
- the name of the sequence database format (CSV_INTEGER, Kosarak, Snake, IBMGenerator or BMS),
- the number of sequences to be converted.
The algorithm outputs a sequence database in SPMF format.
The CSV_INTEGER format is defined as follows:
- each line is a sequence
- each sequence is a list of items represented by positive integers (>0) separated by commas.
For example, the following sequence database is in CSV_INTEGER format and contains four sequences:
1,2,3,4
5,6,7,8
5,6,7
1,2,3
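As a rough illustration of what the conversion involves, the sketch below turns CSV_INTEGER lines into SPMF lines. It assumes the usual SPMF text layout for sequences, in which each item forms its own itemset closed by -1 and each sequence is closed by -2. The class and method names (CsvIntegerSketch, toSpmfLine, convert) are hypothetical and are not part of the SPMF API.

```java
import java.util.ArrayList;
import java.util.List;

class CsvIntegerSketch {

    // Converts one CSV_INTEGER line (e.g. "1,2,3,4") into one SPMF line.
    // Assumption: every item becomes a single-item itemset ended by -1,
    // and the whole sequence is ended by -2.
    static String toSpmfLine(String csvLine) {
        StringBuilder out = new StringBuilder();
        for (String token : csvLine.split(",")) {
            int item = Integer.parseInt(token.trim());
            if (item <= 0) {
                throw new IllegalArgumentException("Items must be > 0: " + item);
            }
            out.append(item).append(" -1 ");
        }
        return out.append("-2").toString();
    }

    // Converts at most maxSequences lines, mirroring the "sequence count" parameter.
    static List<String> convert(List<String> csvLines, int maxSequences) {
        List<String> result = new ArrayList<>();
        for (String line : csvLines) {
            if (result.size() >= maxSequences) break;
            if (line.trim().isEmpty()) continue; // skip blank lines
            result.add(toSpmfLine(line));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = List.of("1,2,3,4", "5,6,7,8", "5,6,7", "1,2,3");
        convert(input, 5).forEach(System.out::println);
    }
}
```

With the four sequences above and a sequence count of 5, all four lines are converted; the first one becomes "1 -1 2 -1 3 -1 4 -1 -2".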
The Kosarak format is defined as follows:
- each line is a sequence
- each sequence is a list of items represented by integers and separated by a space.
For example, the following sequence database is in Kosarak format and contains four sequences:
1 2 3 4
5 6 7 8
5 6 7
1 2 3
The IBMGenerator format is the format used by the IBM Data Quest Generator. The format is defined as follows:
- the file is a binary file in little-endian format
- the file is a list of integers without spaces
- a positive integer represents an item
- -1 represents the end of an itemset
- -2 represents the end of a sequence
For example, the following sequence database, shown here as the list of integers that it encodes, is in IBMGenerator format and contains four sequences:
1 -1 2 -1 3 -1 4 -1 -2
5 -1 6 -1 7 -1 8 -1 -2
5 -1 6 -1 7 -1 -2
1 -1 2 -1 3 -1 -2
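The following sketch shows one way such a binary file could be decoded. It assumes that every integer is stored on 4 bytes (the description above only states that the file is little-endian), and the names IbmGeneratorSketch, decode and encode are hypothetical rather than part of SPMF.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

class IbmGeneratorSketch {

    // Reads little-endian 32-bit integers and starts a new sequence after each -2.
    // Assumption: integers are 4 bytes wide.
    static List<String> decode(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        List<String> sequences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        while (buffer.remaining() >= Integer.BYTES) {
            int value = buffer.getInt();
            if (current.length() > 0) current.append(' ');
            current.append(value);
            if (value == -2) { // -2 marks the end of a sequence
                sequences.add(current.toString());
                current.setLength(0);
            }
        }
        return sequences;
    }

    // Demo helper: writes integers as little-endian bytes.
    static byte[] encode(int... values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Integer.BYTES)
                                      .order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) buffer.putInt(v);
        return buffer.array();
    }

    public static void main(String[] args) {
        byte[] data = encode(1, -1, 2, -1, -2, 5, -1, 6, -1, -2);
        decode(data).forEach(System.out::println);
    }
}
```

Setting the byte order explicitly matters because ByteBuffer reads big-endian values by default.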
The Snake format is defined as follows:
- each line is a sequence
- each sequence is a list of letters
For example, the following sequence database is in Snake format and contains four sequences:
ABCD
ABAB
CACD
ADAC
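Because SPMF items are integers, converting this format requires mapping letters to numbers. The sketch below uses a hypothetical mapping (A becomes 1, B becomes 2, and so on) and assumes the usual SPMF layout in which -1 closes each itemset and -2 closes each sequence. The mapping actually used by SPMF may differ, and the name SnakeSketch is illustrative only.

```java
class SnakeSketch {

    // Converts a line of uppercase letters into one SPMF line.
    // Assumption: 'A' maps to 1, 'B' to 2, ..., 'Z' to 26.
    static String toSpmfLine(String letters) {
        StringBuilder out = new StringBuilder();
        for (char c : letters.trim().toCharArray()) {
            if (c < 'A' || c > 'Z') {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
            int item = c - 'A' + 1;
            out.append(item).append(" -1 ");
        }
        return out.append("-2").toString();
    }

    public static void main(String[] args) {
        for (String line : new String[] {"ABCD", "ABAB", "CACD", "ADAC"}) {
            System.out.println(toSpmfLine(line));
        }
    }
}
```

Under this mapping, "CACD" is written as "3 -1 1 -1 3 -1 4 -1 -2".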
The BMS format is defined as follows:
- each line contains a sequence id and an item, both represented as integers
For example, the following sequence database is in BMS format and contains four sequences with the ids 10, 20, 30 and 40, respectively:
10 1
10 2
10 3
10 4
20 5
20 6
20 7
20 8
30 5
30 6
30 7
40 1
40 2
40 3
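Since a BMS file spreads each sequence over several lines, a converter has to group items by sequence id. The sketch below is one possible way to do this, assuming the usual SPMF layout in which -1 closes each itemset and -2 closes each sequence. The name BmsSketch and its convert method are hypothetical, not SPMF's own code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BmsSketch {

    // Groups "sequenceId item" lines by id, in order of first appearance,
    // and writes each group as one SPMF line.
    static List<String> convert(List<String> lines) {
        Map<Integer, StringBuilder> bySequence = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) continue; // ignore blank or malformed lines
            int sequenceId = Integer.parseInt(parts[0]);
            int item = Integer.parseInt(parts[1]);
            bySequence.computeIfAbsent(sequenceId, id -> new StringBuilder())
                      .append(item).append(" -1 ");
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder sequence : bySequence.values()) {
            result.add(sequence.append("-2").toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = List.of("10 1", "10 2", "10 3", "10 4", "20 5", "20 6");
        convert(input).forEach(System.out::println);
    }
}
```

A LinkedHashMap is used so that sequences are emitted in the order in which their ids first appear in the file.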