Remove utility information from a transaction database (SPMF documentation)

This example explains how to remove utility information from a transaction database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool is a small program that converts a transaction database with utility information into a transaction database that does not contain utility information. For example, it can be used to convert a database such as Foodmart, available on the dataset page of the SPMF website, so that the dataset can be used with frequent itemset mining algorithms such as Apriori and FPGrowth, and with association rule mining algorithms.

What is the input?

The input is a transaction database with utility information. For example, let's consider the following database consisting of 5 transactions (t1, t2, ..., t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


      Items          Transaction utility    Item utilities for this transaction
t1    3 5 1 2 4 6    30                     1 3 5 10 6 5
t2    3 5 2 4        20                     3 3 8 6
t3    3 1 4          8                      1 5 2
t4    3 5 1 7        27                     6 6 10 5
t5    3 5 2 7        11                     2 3 4 2

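Each line of the database contains the items of the transaction (first column), the transaction utility (second column), and the utility of each item in this transaction (third column).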

Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer, named "t1", bought items 3, 5, 1, 2, 4 and 6. The amounts of money spent on these items are respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.
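
The relation between the second and third columns can be checked with a few lines of Java. This is only an illustrative sketch: the class name TransactionUtilityCheck and its method are hypothetical and are not part of SPMF.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the SPMF API.
class TransactionUtilityCheck {

    // Returns the transaction utility, i.e. the sum of the item utilities.
    static int sum(int[] itemUtilities) {
        return Arrays.stream(itemUtilities).sum();
    }

    public static void main(String[] args) {
        // Item utilities of transaction t1 (items 3, 5, 1, 2, 4 and 6).
        int[] t1 = {1, 3, 5, 10, 6, 5};
        System.out.println(sum(t1)); // prints 30
    }
}
```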

What is the output?

The output is a transaction database where the utility information has been removed. For example, the output of the above example is:


      Items
t1    3 5 1 2 4 6
t2    3 5 2 4
t3    3 1 4
t4    3 5 1 7
t5    3 5 2 7

The output is written to a file (output.txt in this example).

Input file format

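The input file format is defined as follows. It is a text file. Each line represents a transaction. Each line is composed of three sections separated by the character ":". The first section lists the items of the transaction, separated by single spaces. The second section is the transaction utility. The third section lists the utility of each item in the transaction, separated by single spaces and given in the same order as the items.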

For example, for the previous example, the input file is defined as follows:

3 5 1 2 4 6:30:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1 7:27:6 6 10 5
3 5 2 7:11:2 3 4 2

Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 30 and that items 3, 5, 1, 2, 4 and 6 respectively have a utility of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.
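
To make the three sections concrete, here is a minimal Java sketch that splits one input line into its items, transaction utility and item utilities. It is not the code used by SPMF: the class UtilityTransaction and its methods are hypothetical names chosen for this illustration, and the sketch assumes that all values are integers, as in the example above.

```java
import java.util.Arrays;

// Hypothetical class for illustration only; not part of the SPMF API.
class UtilityTransaction {
    final int[] items;
    final int transactionUtility;
    final int[] itemUtilities;

    UtilityTransaction(int[] items, int transactionUtility, int[] itemUtilities) {
        this.items = items;
        this.transactionUtility = transactionUtility;
        this.itemUtilities = itemUtilities;
    }

    // Parses a line of the form "items:transaction utility:item utilities".
    static UtilityTransaction parse(String line) {
        String[] sections = line.split(":");
        if (sections.length != 3) {
            throw new IllegalArgumentException("Expected 3 sections: " + line);
        }
        int[] items = parseInts(sections[0]);
        int transactionUtility = Integer.parseInt(sections[1].trim());
        int[] itemUtilities = parseInts(sections[2]);
        if (items.length != itemUtilities.length) {
            throw new IllegalArgumentException("One utility per item expected: " + line);
        }
        return new UtilityTransaction(items, transactionUtility, itemUtilities);
    }

    // Converts a space-separated list of integers to an array.
    private static int[] parseInts(String section) {
        return Arrays.stream(section.trim().split("\\s+"))
                     .mapToInt(Integer::parseInt)
                     .toArray();
    }

    public static void main(String[] args) {
        UtilityTransaction t = parse("3 5 1 2 4 6:30:1 3 5 10 6 5");
        System.out.println("items = " + Arrays.toString(t.items));
        System.out.println("transaction utility = " + t.transactionUtility);
        System.out.println("item utilities = " + Arrays.toString(t.itemUtilities));
    }
}
```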

Output file format

The output file format is a transaction database. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. For example, for the previous example, the output file is defined as follows:

3 5 1 2 4 6
3 5 2 4
3 1 4
3 5 1 7
3 5 2 7
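
The conversion from the input format to the output format can be sketched in Java as follows. This is an independent illustration rather than the implementation shipped with SPMF: the class UtilityRemover and its methods are hypothetical, the file names are those of this example, and skipping blank lines is an assumption made for robustness.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration only; not part of the SPMF API.
class UtilityRemover {

    // Keeps only the items section (the text before the first ':') of each line.
    static List<String> removeUtility(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            int colon = line.indexOf(':');
            result.add(colon < 0 ? line.trim() : line.substring(0, colon).trim());
        }
        return result;
    }

    // Reads the input file, removes the utility information and writes the result.
    static void convert(Path input, Path output) throws IOException {
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        Files.write(output, removeUtility(lines), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        convert(Paths.get("DB_utility.txt"), Paths.get("output.txt"));
    }
}
```

Running main in a directory that contains DB_utility.txt writes the converted transactions to output.txt, one transaction per line.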