Generating an Item Weights File from a Transaction Database (SPMF documentation)
This example explains how to generate an item weights file from a transaction database in SPMF format using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Generate_item_weights" algorithm, (2) set the input file (e.g. DB_RWFIM.txt), (3) set the output file name (e.g. output_weights.txt), (4) click "Run algorithm".
- If you want to execute this example from the command line, then execute this command in a folder containing spmf.jar:
java -jar spmf.jar run Generate_item_weights DB_RWFIM.txt output_weights.txt
- If you are using the source code version of SPMF, launch the file "MainTestItemWeightsGenerator.java" in the package ca.pfv.spmf.tools.dataset_generator.
What is this tool?
This tool reads a transaction database in SPMF format, collects all distinct items that appear in the database, and generates a weight file that assigns a random weight (a real number between 0 and 1) to each distinct item. The weights are drawn from a uniform distribution.
Weight files are used by several weighted data mining algorithms available in SPMF, such as the RWFIM algorithm for mining weighted frequent itemsets. This tool provides a quick way to generate a weight file for testing and benchmarking such algorithms on a given transaction database, without creating the weight file by hand.
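The procedure described above (collect the distinct items, then draw one uniform random weight per item) can be sketched in a few lines of Python. This is only an illustrative sketch, not the SPMF implementation: the function name generate_item_weights and the seed parameter are hypothetical, and Python's random() draws from [0, 1), so the value 1.0 itself is never produced here.

```python
import random


def generate_item_weights(transactions, seed=None):
    """Assign one uniform random weight to each distinct item.

    `transactions` is a list of lists of positive integers.
    `seed` is a hypothetical convenience parameter for reproducible runs.
    """
    rng = random.Random(seed)
    # Collect every distinct item and sort the items in ascending order.
    distinct_items = sorted({item for t in transactions for item in t})
    # rng.random() returns a float drawn uniformly from [0.0, 1.0).
    return {item: rng.random() for item in distinct_items}


transactions = [[1, 2, 3], [2, 4], [1, 5]]
weights = generate_item_weights(transactions, seed=42)
for item, weight in weights.items():
    print(item, weight)
```

Fixing the seed makes a run repeatable, which can be convenient when benchmarking. Without a seed, the weights differ on every run.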
What is the input?
The tool takes a single transaction database file in SPMF format as input. No additional parameters are required.
A transaction database is a text file where each line represents one transaction. Each transaction is a list of items (positive integers) separated by single spaces. Lines starting with #, %, or @ are treated as comments and ignored.
For example, the input file DB_RWFIM.txt is defined as follows:
1 2 3 4 5
1 3 4
2 3 5
1 2 4 5
2 3 4 5
1 3 5
2 4
1 2 3
3 4 5
1 2 4
Each line is a transaction. For example, the first line means that transaction T1 contains items 1, 2, 3, 4, and 5.
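To make the input format concrete, the following Python sketch parses the example database above into a list of transactions. It is an illustration under stated assumptions rather than SPMF code: the function name parse_transaction_database is hypothetical, and skipping empty lines is an assumption that the text above does not state.

```python
def parse_transaction_database(text):
    """Parse SPMF-style transaction text into a list of integer lists."""
    transactions = []
    for line in text.splitlines():
        line = line.strip()
        # Skip comment lines (#, %, @). Skipping empty lines is an assumption.
        if not line or line[0] in "#%@":
            continue
        transactions.append([int(token) for token in line.split()])
    return transactions


# Content of the example file DB_RWFIM.txt shown above.
example_db = """\
1 2 3 4 5
1 3 4
2 3 5
1 2 4 5
2 3 4 5
1 3 5
2 4
1 2 3
3 4 5
1 2 4
"""

transactions = parse_transaction_database(example_db)
distinct_items = sorted({item for t in transactions for item in t})
print(len(transactions), "transactions, distinct items:", distinct_items)
```

For this database the sketch finds 10 transactions and the five distinct items 1 to 5, which are the items that receive a weight in the output file.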
What is the output?
The tool outputs a weight file associating each distinct item found in the transaction database with a randomly generated weight. The weight is a real number uniformly distributed in the range [0, 1].
The weight file is a text file where each line contains one item identifier (a positive integer) followed by a single space and the item weight (a real number between 0 and 1). Items are listed in ascending order.
For example, for the input file shown above, the output file could look like the following:
1 0.4
2 0.7
3 1.0
4 0.5
5 0.45
In this example, item 1 has been assigned the weight 0.4, item 2 has been assigned the weight 0.7, and so on. Note that the exact weight values will differ each time the tool is run, since they are generated randomly.
Output file format
The output file format is defined as follows. It is a plain text file. Each line associates one item with its weight. Each line contains:
- an item identifier (a positive integer),
- followed by a single space,
- followed by the item weight (a real number between 0 and 1).
Lines starting with #, %, or @ are treated as comments and ignored by algorithms that read this format.
Items appear in ascending order of their identifier. For example:
1 0.4
2 0.7
3 1.0
4 0.5
5 0.45
This format is compatible with the weight file format expected by weighted data mining algorithms in SPMF, such as RWFIM, NWFI, NWFCI and WFIM.
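As an illustration of the format described above, the following Python sketch writes a weight mapping in this layout and reads it back. The function names format_weight_file and parse_weight_file are hypothetical and are not part of SPMF. The reader skips comment lines as described above and also skips empty lines, which is an assumption.

```python
def format_weight_file(weights):
    """Render {item: weight} as 'item weight' lines, ascending by item."""
    return "".join(f"{item} {weights[item]}\n" for item in sorted(weights))


def parse_weight_file(text):
    """Read 'item weight' lines back into a dict, ignoring comment lines."""
    weights = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip comment lines (#, %, @). Skipping empty lines is an assumption.
        if not line or line[0] in "#%@":
            continue
        item, weight = line.split()
        weights[int(item)] = float(weight)
    return weights


# The example weights from above, deliberately given out of order.
weights = {3: 1.0, 1: 0.4, 2: 0.7, 5: 0.45, 4: 0.5}
text = format_weight_file(weights)
print(text, end="")
```

Sorting when writing keeps the file in ascending item order whatever order the weights were produced in. Parsing the text again returns the same mapping.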