Mining Closed High-Utility Itemsets from a Transaction Database using the MEHUIM-Closed Algorithm (SPMF documentation)

This example explains how to run the MEHUIM-Closed algorithm using the SPMF open-source data mining library.

How to run this example?

What is MEHUIM-Closed?

MEHUIM-Closed (Yang, Hongyang, 2024) is an algorithm for discovering closed high-utility itemsets in a transaction database containing utility information. It is an extension of the MEHUIM algorithm that adds strategies to discover only the closed high-utility itemsets, thereby producing a more concise and informative result than traditional high-utility itemset mining.

A limitation of many high-utility itemset mining algorithms is that they output too many itemsets, which can be inconvenient to analyze. As a solution, MEHUIM-Closed discovers only the high-utility itemsets that are closed. An itemset is closed if it has no superset appearing in exactly the same set of transactions. MEHUIM-Closed incorporates backward extension checking and a closed pattern jumping strategy to efficiently identify and output only closed high-utility itemsets, while maintaining the memory-efficient search structure of MEHUIM.

What is the input?

MEHUIM-Closed takes as input a transaction database with utility information and a minimum utility threshold min_utility (a positive integer). Let's consider the following database consisting of 5 transactions (t1, t2...t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


Items Transaction utility Item utilities for this transaction
t1 3 5 1 2 4 6 30 1 3 5 10 6 5
t2 3 5 2 4 20 3 3 8 6
t3 3 1 4 8 1 5 2
t4 3 5 1 7 27 6 6 10 5
t5 3 5 2 7 11 2 3 4 2

Each line of the database is:

Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.

What is the output?

The output of MEHUIM-Closed is the set of closed high utility itemsets having a utility no less than a min_utility threshold (a positive integer) set by the user.

An itemset is an unordered set of distinct items. The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 6 = 11 and the utility of {1 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, for a total of 11 + 7 = 18. A high utility itemset is an itemset such that its utility is no less than min_utility.

The support of an itemset is the number of transactions that contain the itemset. A closed itemset is an itemset X such that there does not exist a superset Y strictly containing X that appears in exactly the same set of transactions. A closed high utility itemset (CHUI) is a high-utility itemset that is also a closed itemset.

For example, if we run MEHUIM-Closed with a minimum utility of 30, we obtain 4 closed high-utility itemsets:

itemsets utility support
{1, 2, 3, 4, 5, 6} 30 1 transaction
{2, 3, 4, 5} 40 2 transactions
{2, 3, 5} 37 3 transactions
{1, 3, 5} 31 2 transactions

If the database is a transaction database from a store, we could interpret these results as all the groups of items bought together that generated a profit of 30 $ or more, and that are maximal sets of items shared by a given group of customers.

Input file format

The input file format of MEHUIM-Closed is defined as follows. It is a text file. Each line represents a transaction. Each line is composed of three sections, as follows.

For example, for the previous example, the input file is defined as follows:

3 5 1 2 4 6:30:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1 7:27:6 6 10 5
3 5 2 7:11:2 3 4 2

Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 30 and that items 3, 5, 1, 2, 4 and 6 respectively have a utility of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.

Output file format

The output file format of MEHUIM-Closed is defined as follows. It is a text file, where each line represents a closed high utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After all the items, the keyword "#UTIL:" appears and is followed by the utility of the itemset. For example, we show below the output file for this example.

6 4 2 1 5 3 #SUP: 1 #UTIL: 30
4 3 2 5 #SUP: 2 #UTIL: 40
2 5 3 #SUP: 3 #UTIL: 37
1 3 5 #SUP: 2 #UTIL: 31

For example, the third line indicates that the itemset {2, 3, 5} has a support of 3 transactions and a utility of 37 $. The other lines follow the same format.

Performance

High utility itemset mining is a more difficult problem than frequent itemset mining. Therefore, high-utility itemset mining algorithms are generally slower than frequent itemset mining algorithms.

MEHUIM-Closed addresses the problem of outputting too many itemsets by mining only the closed high-utility itemsets, which form a more compact and lossless representation of all high-utility itemsets. By combining the memory-efficient search structure of MEHUIM with backward extension checking and a closed pattern jumping strategy, MEHUIM-Closed can efficiently discover this concise result set. It is generally faster and produces fewer results than mining all high-utility itemsets, particularly on dense datasets.

Implementation details

The implementation offered in SPMF is the original implementation of MEHUIM-Closed by Hongyang Yang and Philippe Fournier-Viger.

In the source code version of SPMF, there are two examples of using MEHUIM-Closed in the package ca.pfv.spmf.tests. The first one is MainTestMEHUIMClosed_saveToFile.java, which saves the result to an output file. The second one is MainTestMEHUIMClosed_saveToMemory.java, which saves the result to memory.

Where can I get more information about the MEHUIM-Closed algorithm?

The MEHUIM-Closed algorithm is described in the following Master's thesis:

Yang, Hongyang (2024). Research on Efficient High-Utility Mining Algorithms Based on Transaction Lists (基于事务列表的高效用挖掘算法研究). Master's thesis (电子信息硕士), School of Computer Science and Software Engineering (计算机与软件学院), Shenzhen University (深圳大学), June 2024.

Besides, for a general overview of high utility itemset mining, you may read this survey paper.