Mining High Average-Utility Itemsets with Multiple Thresholds in a Transaction Database using the MEMU Algorithm (SPMF documentation)

This example explains how to run the MEMU algorithm using the SPMF open-source data mining library.

How to run this example?

What is MEMU?

MEMU is an algorithm for mining average-utility itemsets by using multiple minimum average-utility thresholds. Unlike some other algorithms such as HAUI-Miner, the MEMU algorithms allows to set a different threshold for each item, rather than a single threshold to evaluate all items. Setting multiple thresholds is useful because it allows to set lower minimum average-utility thresholds for low profit items. Therefore, it allows discovering high average-utility itemsets containing low profit items. Unlike the HAUI-MMU algorithm, the MEMU algorithm allows to set all the thresholds automatically using a math formula where the user has to define two values : GLMAU and BETA

What is the input?

The input of MEMU is a transaction database and two parameters called BETA and GLMAU (positive integers). These parameters then allow the algorithm to calculate a list of minimum average-utility thresholds indicating the minimum average-utility threshold for each item.
A transaction database is a set of transactions, where each transaction is a list of distinct items (symbols). For example, let's consider the following transaction database. It consists of 4 transactions (t1, t2, ..., t4) and 4 items (1, 2, 3, 4). For instance, transaction t1 is the set of items {1, 2, 3}. In a transaction database, each item has a corresponding purchase quantity. For examle, in the first transaction, the quantity of item 1 is 2, the quantity of item 2 is 3, and the quantity of item 3 is 6. This database is provided in the file "UtilityDB.txt" of the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction.

Transaction ID

Items

Quantities

t1

1 2 3

2 3 6

t2

1 2

3 4

t3

2 3

5 9

t4

2 4

7 10

Besides a table indicating the unit profit of each item is required. For example, consider the following table:

Item

Unit profit

1

5

2

4

3

6

4

2

It indicates that the unit profit yield by selling the items 1, 2, 3 and 4 are 5, 4, 6 and 2, respectively. This table is provided in the file "UtilityDB_profit.txt"

Note that in research papers, unit profit is often called "external utility" while purchase quantities are called "internal utility". This is the terminology used by researchers. Although in the above example, utility is used to represent quantities and profit, it could be more generally be used to represent the relative importance of items to the user. Thus, it can be also applied to other types of data.

What is the output?

The output file format is defined as follows. It is a text file, where each line represents a high average-utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After, all the items, the keyword " #AUTIL: " appears and is followed by the average-utility of the itemset, and then #mau followed by the minimum average-utility count of the itemset (see the research paper for a definition). For example, we show below the output file for this example.

1 #AUTIL: 25,00 #mau: 25,00
1 2 #AUTIL: 26,50 #mau: 25,00
2 #AUTIL: 76,00 #mau: 25,00
2 3 #AUTIL: 61,00 #mau: 25,00
3 #AUTIL: 90,00 #mau: 25,00

For example, the second line indicates that the itemset { 1, 2} has an average-utility of 25 and its minimum average-utility count is 25. The other lines follows the same format.

Input file format

MEMU takes two files as input, defined as follows.

The first file (e.g. UtilityDB.txt) is a text file containing transactions. Each lines represents a transaction. Each line is a list of positive numbers where an item is followed by a space and then its quantity. Then, the next item appears, followed by a space, followed by its quantity, and so on... This is the file UtilityDB.txt used in the example.

1 2 2 3 3 6
1 3 2 4
2 5 3 9
2 7 4 10

The second file is a text file (e.g. UtilityDB_profit.txt) which provides the unit profit of each item. Each is an item, followed by a space, followed by its unit profit. This is the file UtilityDB_profit.txt used in the example.

1, 5
2, 4
3, 6
4, 2

Consider the first line. It means that the first transaction is the unit profit of item 1 is 5. The following lines follow the same format.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a high average-utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After, all the items, the keyword " #AUTIL: " appears and is followed by the average-utility of the itemset, and then the keyword #mau followed by the minimum average count.For example, we show below the output file for this example.

1 #AUTIL: 25,00 #mau: 25,00
1 2 #AUTIL: 26,50 #mau: 25,00
2 #AUTIL: 76,00 #mau: 25,00
2 3 #AUTIL: 61,00 #mau: 25,00
3 #AUTIL: 90,00 #mau: 25,00

For example, the second line indicates that the itemset { 1, 2} has an average-utility of 25 and its minimum average-utility count is 25. The other lines follows the same format.

Implementation details

This is the original implementation of the algorithm.

Where can I get more information about the MEMU algorithm?

This is the reference of the article describing the MEMU algorithm:

Lin, C.-W., Ren, S., Fournier-Viger, P., Hong, T.-P. (2018). MEMU: More Efficient Algorithm to Mine High Average-Utility Patterns with Multiple Minimum Average-Utility Thresholds. IEEE Access, to appear.

Besides, for a general overview of high utility itemset mining, you may read this survey paper.