Generating synthetic utility values for a transaction database without utility values (SPMF documentation)

This example explains how to generate synthetic utility values for a transaction database without utility values using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Generate_utility_values_for_transaction_database" algorithm, (2) set the output file name (e.g. "output.txt") (3) choose 10 as max quantity and 1 as multiplicative factor (4) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Generate_utility_values_for_transaction_database contextPasquier99.txt output.txt 10 1 in a folder containing spmf.jar.
If you are using the source code version of SPMF, launch the file "MainTestTransactionDatabaseUtilityGenerator.java" in the package ca.pfv.SPMF.tests.

What is this tool?

This tool that generate synthetic utility values for a transaction database without utility value. This is useful to generate datasets that can be used for high utility-itemset mining.

Transaction database with synthetic utiliy values are often used in the data mining litterature to evaluate high utility itemset mining algorithms.

What is the input?

The tool takes as parameter an input file (a transaction database), and two parameters that are used for generating the synthetic utility values:

the maximum internal utility (quantity) of each item in a transaction, such that internal utility values will be generated in the [1, maximum_internal_utility] interval
a multiplicative factor X, such that the external utility of each item will be generated by calling Random.nextGaussian() multiplied by X.

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id	Items
t1	{1, 3, 4}
t2	{2, 3, 5}
t3	{1, 2, 3, 5}
t4	{2, 5}
t5	{1, 2, 3, 5}

What is the output?

The output is a transaction database with utility information.

It is a transaction database where each item appearing in a transaction has a purchase quantity (a.k.a internal utility). Furthermore, each item has a weight (a.k.a external utility or weight) that can be interpreted as a unit profit when buying one unit of the item.

In SPMF, the format of transaction database with utility information is represented as follows. Consider the following database:

	Items	Transaction utility	Item utilities for this transaction
t1	1 3 4	9	1 3 5
t2	2 3 5	14	3 3 8
t3	1 2 3 5	9	1 5 2 1
t4	2 5	12	6 6
t5	1 2 3 5	11	2 3 4 2

Each line of the database is:

a transaction, defined as set of items (the first column of the table),
the sum of the utilities (e.g. profit) of these items in this transaction (the second column of the table),
the utility of each item for this transaction (e.g. external utility multiplied by internal utility)(the third column of the table).

Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 1, 3 and 4. The amount of money spent for each item is respectively 1 $, 3 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 = 9 $.

Input file format

The input file format is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

Output file format

The output file format is defined as follows. It is a text file. Each lines represents a transaction with utility information. Each line is composed of three sections, as follows.

First, the items contained in the transaction are listed. An item is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same transaction.
Second, the symbol ":" appears and is followed by the transaction utility (an integer).
Third, the symbol ":" appears and is followed by the utility of each item in this transaction (an integer), separated by single spaces.

For example, an output file that can be generated for the example input file could the following:

1 3 4:9:1 3 5
2 3 5:14:3 3 8
1 2 3 5:9:1 5 2 1
2 5:12:6 6
1 2 3 5:11:2 3 4 2

	Items	Transaction utility	Item utilities for this transaction
t1	1 3 4	9	1 3 5
t2	2 3 5	14	3 3 8
t3	1 2 3 5	9	1 5 2 1
t4	2 5	12	6 6
t5	1 2 3 5	11	2 3 4 2

Consider the first line. It means that the transaction {1, 3, 4} has a total utility of 9 and that items 1, 3 and 4 respectively have a utility of 1, 3 and 5 in this transaction.