Generating an Item Weights File from a Transaction Database (SPMF documentation)

This example explains how to generate an item weights file from a transaction database in SPMF format using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool reads a transaction database in SPMF format, collects all distinct items that appear in it, and generates a weight file that assigns a random weight to each distinct item. Each weight is a real number between 0 and 1, drawn from a uniform distribution.

Weight files are used by several weighted data mining algorithms available in SPMF, such as the RWFIM algorithm for mining weighted frequent itemsets. This tool offers a quick way to generate a weight file for testing and benchmarking such algorithms on a given transaction database, without having to create the file by hand.

What is the input?

The tool takes a single transaction database file in SPMF format as input. No additional parameters are required.

A transaction database is a text file where each line represents one transaction. Each transaction is a list of items (positive integers) separated by single spaces. Lines starting with #, %, or @ are treated as comments and ignored.
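The input format described above can be read with a few lines of code. The following is a minimal Python sketch, not SPMF's own parser (SPMF is written in Java); the function name parse_transactions is hypothetical, and skipping blank lines is an assumption that the description above does not state.

```python
def parse_transactions(lines):
    """Parse SPMF-style transaction lines into lists of integer items."""
    transactions = []
    for line in lines:
        line = line.strip()
        # Skip comment lines (#, %, @); skipping blank lines is an assumption.
        if not line or line[0] in "#%@":
            continue
        # Items are positive integers separated by single spaces.
        transactions.append([int(token) for token in line.split()])
    return transactions


sample = ["# a comment", "1 2 3 4 5", "1 3 4", "@metadata", "2 4"]
print(parse_transactions(sample))  # prints [[1, 2, 3, 4, 5], [1, 3, 4], [2, 4]]
```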

For example, the input file DB_RWFIM.txt is defined as follows:

1 2 3 4 5
1 3 4
2 3 5
1 2 4 5
2 3 4 5
1 3 5
2 4
1 2 3
3 4 5
1 2 4

Each line is a transaction. For example, the first line means that transaction T1 contains items 1, 2, 3, 4, and 5.

What is the output?

The tool outputs a weight file associating each distinct item found in the transaction database with a randomly generated weight. The weight is a real number uniformly distributed in the range [0, 1].

The weight file is a text file where each line contains one item identifier (a positive integer) followed by a single space and the item weight (a real number between 0 and 1). Items are listed in ascending order.

For example, for the input file shown above, the output file could look like the following:

1 0.4
2 0.7
3 1.0
4 0.5
5 0.45

In this example, item 1 has been assigned the weight 0.4, item 2 has been assigned the weight 0.7, and so on. Note that the exact weight values will differ each time the tool is run, since they are generated randomly.
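The behavior described above (collect the distinct items, then give each one a uniform random weight) might be approximated by the Python sketch below. It is an illustration under stated assumptions, not the SPMF implementation: the function names and the seed parameter are hypothetical, the number formatting of the real tool is not specified here, and Python's random.random() draws from [0, 1) rather than the closed range [0, 1].

```python
import random


def generate_item_weights(transactions, seed=None):
    """Map each distinct item to a random weight, in ascending item order."""
    rng = random.Random(seed)  # seed is a hypothetical convenience for repeatable runs
    distinct_items = sorted({item for transaction in transactions for item in transaction})
    # Note: rng.random() returns values in [0.0, 1.0), so 1.0 itself never occurs.
    return {item: rng.random() for item in distinct_items}


def format_weight_file(weights):
    """Render one 'item weight' pair per line, items in ascending order."""
    return "\n".join(f"{item} {weight}" for item, weight in sorted(weights.items())) + "\n"


# The ten transactions of the input file shown above.
database = [
    [1, 2, 3, 4, 5], [1, 3, 4], [2, 3, 5], [1, 2, 4, 5], [2, 3, 4, 5],
    [1, 3, 5], [2, 4], [1, 2, 3], [3, 4, 5], [1, 2, 4],
]
weights = generate_item_weights(database)
print(format_weight_file(weights), end="")  # five lines, one per item; values vary per run
```

Passing a fixed seed makes the output repeatable, which can be convenient when benchmark results need to be compared across runs.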

Output file format

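The output file format is defined as follows. It is a plain text file in which each line associates one item with its weight. Each line contains an item identifier (a positive integer), followed by a single space, followed by the item weight (a real number between 0 and 1).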

Lines starting with #, %, or @ are treated as comments and ignored by algorithms that read this format.

Items appear in ascending order of their identifier. For example:

1 0.4
2 0.7
3 1.0
4 0.5
5 0.45

This format is compatible with the weight file format expected by weighted data mining algorithms in SPMF, such as RWFIM, NWFI, NWFCI and WFIM.
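To show how a consumer might load this format, here is a small Python sketch that reads weight lines into a dictionary. It is only an illustration: the function name parse_weight_lines is hypothetical, it is not the reader used by the SPMF algorithms, and skipping blank lines is an assumption.

```python
def parse_weight_lines(lines):
    """Read 'item weight' lines into a dict mapping item id to weight."""
    weights = {}
    for line in lines:
        line = line.strip()
        # Skip comment lines (#, %, @); skipping blank lines is an assumption.
        if not line or line[0] in "#%@":
            continue
        item_text, weight_text = line.split()
        weights[int(item_text)] = float(weight_text)
    return weights


example = ["1 0.4", "2 0.7", "3 1.0", "4 0.5", "5 0.45"]
print(parse_weight_lines(example))  # prints {1: 0.4, 2: 0.7, 3: 1.0, 4: 0.5, 5: 0.45}
```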