Mining Skyline Frequent High-Utility Itemsets in a transaction database with utility information using the SFU_CE Algorithm (SPMF documentation)

This example explains how to run the SFU_CE algorithm using the SPMF open-source data mining library.

How to run this example?

What is SFU_CE?

SFU_CE (Song et al., 2021) is an approximate algorithm, based on the cross-entropy method, for discovering skyline frequent high-utility itemsets in a transaction database containing utility information.

This is the original implementation of SFU_CE.

What is the input?

The algorithm takes as input a transaction database with utility information and a minimum utility threshold min_utility (a positive integer). Consider the following database, which consists of 5 transactions (t1, t2, ..., t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


   | Items       | Transaction utility | Item utilities for this transaction
t1 | 3 5 1 2 4 6 | 30                  | 1 3 5 10 6 5
t2 | 3 5 2 4     | 20                  | 3 3 8 6
t3 | 3 1 4       | 8                   | 1 5 2
t4 | 3 5 1 7     | 27                  | 6 6 10 5
t5 | 3 5 2 7     | 11                  | 2 3 4 2

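Each line of the database is composed of a set of items, the transaction utility (the sum of the utilities of these items in the transaction), and the utility of each item in the transaction, listed in the same order as the items.

Note that the transaction utility on each line is the sum of the item utilities on that line.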
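The relationship between the transaction utility and the item utilities can be checked mechanically. The following is a minimal Python sketch, not part of SPMF. The function name transaction_utility_is_consistent is hypothetical, and the database is typed in by hand from the table above.

```python
# Example database from this page, one tuple per transaction:
# (items, transaction utility, item utilities in the same order as the items).
DATABASE = [
    ([3, 5, 1, 2, 4, 6], 30, [1, 3, 5, 10, 6, 5]),  # t1
    ([3, 5, 2, 4], 20, [3, 3, 8, 6]),               # t2
    ([3, 1, 4], 8, [1, 5, 2]),                      # t3
    ([3, 5, 1, 7], 27, [6, 6, 10, 5]),              # t4
    ([3, 5, 2, 7], 11, [2, 3, 4, 2]),               # t5
]


def transaction_utility_is_consistent(items, transaction_utility, utilities):
    """Hypothetical helper: one utility per item, and they sum to the total."""
    return len(items) == len(utilities) and sum(utilities) == transaction_utility


consistent = [transaction_utility_is_consistent(*row) for row in DATABASE]
print(consistent)  # [True, True, True, True, True]
```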

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.

What is the output?

The output of SFU_CE is the set of skyline frequent high-utility itemsets. To explain what a skyline frequent high-utility itemset is, it is necessary to review some definitions.

An itemset is an unordered set of distinct items. The utility of an item in a transaction is the product of its purchase quantity in the transaction by its unit profit. In this database, that value is given directly as the item utility. For example, the utility of item 3 in transaction t2 is 3 $. The utility of an itemset in a transaction is the sum of the utilities of its items in the transaction. For example, the utility of the itemset {5 7} in transaction t4 is 6 + 5 = 11 $ and the utility of {5 7} in transaction t5 is 3 + 2 = 5 $. The utility of an itemset in a database is the sum of its utilities in all transactions where it appears. For example, the utility of {5 7} in the database is the utility of {5 7} in t4 plus the utility of {5 7} in t5, for a total of 11 + 5 = 16 $. The utility of an itemset X is denoted as u(X). Thus, u({5 7}) = 16 $.
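The definition of u(X) can be sketched in a few lines of Python. This is an illustration only, not SPMF code. The function name utility and the dictionary representation of the database (item mapped to its utility in the transaction) are assumptions made for this example. The values are the item utilities of the example database.

```python
# Each transaction maps an item to its utility in that transaction (t1..t5).
DATABASE = [
    {3: 1, 5: 3, 1: 5, 2: 10, 4: 6, 6: 5},
    {3: 3, 5: 3, 2: 8, 4: 6},
    {3: 1, 1: 5, 4: 2},
    {3: 6, 5: 6, 1: 10, 7: 5},
    {3: 2, 5: 3, 2: 4, 7: 2},
]


def utility(itemset, database):
    """Sum the itemset's utility over every transaction containing all its items."""
    return sum(
        sum(transaction[item] for item in itemset)
        for transaction in database
        if all(item in transaction for item in itemset)
    )


print(utility({3, 5}, DATABASE))     # 27
print(utility({2, 3, 5}, DATABASE))  # 37
```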

The support of an itemset is the number of transactions that contain the itemset. For example, the support of the itemset {5 7} is sup({5 7}) = 2 transactions because it appears in transactions t4 and t5.
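Support can be computed the same way. The sketch below is illustrative Python, not SPMF code. The function name support is hypothetical. Only the item lists of the example database are needed, because support ignores utilities.

```python
# Items of each transaction of the example database (t1..t5).
TRANSACTIONS = [
    [3, 5, 1, 2, 4, 6],
    [3, 5, 2, 4],
    [3, 1, 4],
    [3, 5, 1, 7],
    [3, 5, 2, 7],
]


def support(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for transaction in transactions if wanted <= set(transaction))


print(support({5, 7}, TRANSACTIONS))  # 2 (t4 and t5)
print(support({3}, TRANSACTIONS))     # 5
```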

An itemset X is said to dominate another itemset Y if and only if sup(X) >= sup(Y) and u(X) > u(Y), or sup(X) > sup(Y) and u(X) >= u(Y).

A skyline frequent high-utility itemset is an itemset that is not dominated by any other itemset in the transaction database.
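The dominance relation translates directly into a predicate on (support, utility) pairs. The following Python sketch is illustrative. The function name dominates is an assumption for this example and is not part of SPMF.

```python
def dominates(x, y):
    """Return True if x = (sup, utility) dominates y = (sup, utility)."""
    sup_x, util_x = x
    sup_y, util_y = y
    return (sup_x >= sup_y and util_x > util_y) or (sup_x > sup_y and util_x >= util_y)


# Same support, higher utility: the first pair dominates the second.
print(dominates((4, 27), (4, 15)))  # True
# Higher utility but lower support: neither pair dominates the other.
print(dominates((2, 40), (4, 27)))  # False
print(dominates((4, 27), (2, 40)))  # False
```

Note that a pair never dominates itself, since at least one of the two inequalities must be strict.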

For example, if we run SFU_CE, we may obtain the following 4 skyline frequent high-utility itemsets:

itemset      | support | utility
{3}          | 5       | 13
{2, 3, 4, 5} | 2       | 40
{2, 3, 5}    | 3       | 37
{3, 5}       | 4       | 27
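For a database this small, the skyline can be cross-checked by exhaustive enumeration. The Python sketch below is not the SFU_CE algorithm, which is approximate and relies on the cross-entropy method. It simply enumerates all 127 non-empty itemsets and keeps those that no other itemset dominates. All function names are hypothetical and the approach is only practical for a handful of items.

```python
from itertools import combinations

# Each transaction maps an item to its utility in that transaction (t1..t5).
DATABASE = [
    {3: 1, 5: 3, 1: 5, 2: 10, 4: 6, 6: 5},
    {3: 3, 5: 3, 2: 8, 4: 6},
    {3: 1, 1: 5, 4: 2},
    {3: 6, 5: 6, 1: 10, 7: 5},
    {3: 2, 5: 3, 2: 4, 7: 2},
]


def support_and_utility(itemset, database):
    """Return (support, utility) of an itemset in the database."""
    covering = [t for t in database if all(item in t for item in itemset)]
    return len(covering), sum(t[item] for t in covering for item in itemset)


def dominates(x, y):
    """Dominance test on (support, utility) pairs, as defined on this page."""
    return (x[0] >= y[0] and x[1] > y[1]) or (x[0] > y[0] and x[1] >= y[1])


def brute_force_skyline(database):
    """Enumerate every itemset that occurs and keep the non-dominated ones."""
    items = sorted({item for t in database for item in t})
    scores = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            sup, util = support_and_utility(itemset, database)
            if sup > 0:
                scores[itemset] = (sup, util)
    return {
        itemset: score
        for itemset, score in scores.items()
        if not any(dominates(other, score) for other in scores.values())
    }


skyline = brute_force_skyline(DATABASE)
for itemset, (sup, util) in sorted(skyline.items(), key=lambda kv: -kv[1][0]):
    print(list(itemset), "support:", sup, "utility:", util)
```

On the example database this enumeration returns the same four itemsets as listed above.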

If the database is a transaction database from a store, we could interpret these results as the itemsets that are not dominated by any other itemset in terms of selling frequency and utility.

Input file format

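The input file format is defined as follows. It is a text file. Each line represents a transaction. Each line is composed of three sections separated by the symbol ":", as follows.

First, the items contained in the transaction are listed, separated by single spaces. Each item is represented by an integer.
Second, the symbol ":" appears and is followed by the transaction utility (an integer).
Third, the symbol ":" appears and is followed by the utility of each item in the transaction (an integer), separated by single spaces.

For the previous example, the input file is defined as follows: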

3 5 1 2 4 6:30:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1 7:27:6 6 10 5
3 5 2 7:11:2 3 4 2

Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 30 and that items 3, 5, 1, 2, 4 and 6 respectively have a utility of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.
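As an illustration of this format, the following Python sketch splits one line into its three sections. It is not the parser used by SPMF. The function name parse_transaction is hypothetical and the sketch assumes well-formed lines.

```python
def parse_transaction(line):
    """Split 'items:transaction utility:item utilities' into typed values."""
    items_part, total_part, utilities_part = line.strip().split(":")
    items = [int(token) for token in items_part.split()]
    utilities = [int(token) for token in utilities_part.split()]
    return items, int(total_part), utilities


items, total, utilities = parse_transaction("3 5 1 2 4 6:30:1 3 5 10 6 5")
print(items)      # [3, 5, 1, 2, 4, 6]
print(total)      # 30
print(utilities)  # [1, 3, 5, 10, 6, 5]
```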

Output file format

The output file format of SFU_CE is defined as follows. It is a text file, where each line represents a skyline frequent high-utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. Then, the keyword "#SUP:" appears and is followed by the support of the itemset. Finally, the keyword "#UTILITY:" appears and is followed by the utility of the itemset. For example, this is the output file for this example:

4 2 5 3 #SUP:2 #UTILITY:40
2 5 3 #SUP:3 #UTILITY:37
5 3 #SUP:4 #UTILITY:27
3 #SUP:5 #UTILITY:13

For example, the first line indicates that the itemset {2, 3, 4, 5} has a support of 2 transactions and a utility of 40 $. The other lines follow the same format.
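A line of the output file can be read back with a few string operations. The Python sketch below is illustrative and not part of SPMF. The function name parse_output_line is hypothetical and the sketch assumes each line contains both the "#SUP:" and "#UTILITY:" keywords, as in the output shown above.

```python
def parse_output_line(line):
    """Split 'items #SUP:s #UTILITY:u' into (items, support, utility)."""
    items_part, rest = line.strip().split("#SUP:")
    support_part, utility_part = rest.split("#UTILITY:")
    items = [int(token) for token in items_part.split()]
    return items, int(support_part), int(utility_part)


print(parse_output_line("4 2 5 3 #SUP:2 #UTILITY:40"))  # ([4, 2, 5, 3], 2, 40)
```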

Performance

This is the original implementation of SFU_CE.

Where can I get more information about the algorithm?

This is the reference of the article describing the algorithm:

Song, W., Zheng, C. (2021). SFU-CE: Skyline Frequent-Utility Itemset Discovery Using the Cross-Entropy Method. (to appear) (slides about the algorithm)