Mining the Cross-Level High-Utility Itemsets in a Transaction Database using the CLH-Miner Algorithm (SPMF documentation)

This example explains how to run the CLH-Miner algorithm using the SPMF open-source data mining library.

How to run this example?

What is CLH-Miner?

CLH-Miner is an algorithm for discovering the cross-level high-utility itemsets in a transaction database containing utility information and having a taxonomy.

High utility itemset mining has several applications such as discovering groups of items in transactions of a store that generate the most profit. A database containing utility information is a database where items can have quantities and a unit price. Although these algorithms are often presented in the context of market basket analysis, there exist other applications.

Contrarily to traditional high utility itemset mining algorithm, CLH-Miner allows to consider a taxonomy of items (items can be organized into a hierarchy of categories). Then, patterns can be found involving these categories.

What is the input?

CLH-Miner takes as input a transaction database with utility information, a taxonomy database, and a minimum utility threshold min_utility (a positive number). Let’s consider the following transaction database consisting 7 transactions (t1, t2, …, t7) and 5 items (1, 2, 3, 4, 5). This transaction database is provided in the text file “transaction_CLHMiner.txt” in the package ca.pfv.spmf.tests of the SPMF distribution.

 

Items

Transaction_utility

Item_utilities for this transaction

t1

1 3

6

5 1

t2

5

3

3

t3

1 2 3 4 5

25

5 10 1 6 3

t4

2 3 4 5

20

8 3 6 3

t5

1 3 4

5

5 1 2

t6

1 3 5

22

10 6 6

t7

2 3 5

9

4 2 3

Each line of the transaction database contains:

  1. a set of items (the first column of the table),
  2. the sum of the utilities (e.g. profit) of these items in this transaction (the second column of the table),
  3. the utility of each item for this transaction (e.g. profit generated by this item for this transaction) (the third column of the table).

For example, the first transaction may indicate that some customer has purchased some items 1 and 3, that the profit generated by the sale of item 1 is 5, and the profit generated by the sale of item 3 is 1. The total utility (profit) is in this case 5 + 1 = 6. Thus, 6 is shown in the second column.

Note that the value in the second column for each line is the sum of the values in the third column.

The following table is the taxonomy database taxonomy_CLHMiner.txt for this example, which consists of 5 items and 3 generalized items. This taxonomy database is provided in the text file “taxonomy_CLHMiner.txt” in the package ca.pfv.spmf.tests of the SPMF distribution.

Item

Generalized item

1

6

2

6

3

7

4

8

5

8

6

7

Each line of the taxonomy database is:

For example, the first line indicates that item 1 is a special case (a specialization) of the item 6. The second line indicates that the item 2 is also a special case of the item 6. Thus, the item 6 can be viewed as category containing the items 1 and 2. The other lines have a similar meaning.

What is the output?

The output of CLHMiner is the set of cross-level high utility itemsets having a utility no less than a min_utility threshold (a positive number) set by the user. To explain what is a cross-level high utility itemset, it is necessary to review some definitions.

An itemset is an unordered set of distinct items. The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. The utility of a generalized itemset in a transaction is the sum of the utility of its descendant items in the transaction. For example, the utility of the itemset {1 3} in transaction t3 is 5 + 1 = 6 and the utility of {1 3} in t6 = 10 + 6 = 16. Because item 1 and item 2 are the descendants of itemset 6. The utility of 6 in t3 is 5 + 10 = 15. The utility of generalized itemset {3, 6} is 1+ 15 = 16. The utility of a (generalized) itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 3} in the database is the utility of {1 3} in t1 plus the utility of {1 3} in t3 plus the utility of {1 3} in t5 plus the utility of {1 3} in t6, for a total of 6 + 6 + 6 + 16 = 34. The utility of {3 6} in the database is 6 + 16 + 11 + 6 + 16 + 6 = 61. A cross-level high utility itemset is a (generalized) itemset such that its utility is no less than min_utility. For example, if we run CLHMiner with a minimum utility of 60, we obtain 7 cross-level high utility itemsets.

Itemsets

Utility

Support

{7 8}

84

71% (5 transactions)

{3 6 8}

84

71% (5 transactions)

{6 8}

71

71% (5 transactions)

(5 7)

64

57% (4 transactions)

{3 5 6}

64

57% (4 transactions)

{7}

61

86% (6 transactions)

{3 6}

61

86% (6 transactions)

If the database is a transaction database from a store, we could interpret these results as all the sets of items that yield a profit that is no less than 60$.

Input file format

The input file format of CLHMiner is defined as follows. The transaction database and taxonomy database are both text files. In the transaction database, each line represents a transaction. Each line is composed of three sections, as follows.

  1. First, the items contained in the transaction are listed. An item is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all items within a same transaction are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same transaction.
  2. Second, the symbol “:” appears and is followed by the transaction utility (an integer).
  3. Third, the symbol “:” appears and is followed by the utility of each item in this transaction (an integer), separated by single spaces.

For example, for the previous example, the input file for the transaction is defined as follows:

1 3:6:5 1
5:3:3
1 2 3 4 5:25:5 10 1 6 3
2 3 4 5:20:8 3 6 3
1 3 4:8:5 1 2
1 3 5:22:10 6 6
2 3 5:9:4 2 3

Consider the first line. It means that the transaction {1, 3} has a total utility of 6 and that items 1 and 3 respectively have a utility of 5 and 1 in this transaction. The following lines follow the same format.

In the taxonomy database, each line represents a category relationship. Each line is composed of two sections, as follows.

  1. First, the item.
  2. Second, the symbol “,” appears and is followed by a generalized item, representing the category that the item belongs.

For example, for the previous example, the input file for the taxonomy is defined as follows:

1,6
2,6
3,7
4,8
5,8
6,7
Consider the first line. It means that the item 1 is belongs to the generalized item 6.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a cross-level high utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After all the items, the key word “#UTIL:” appears and is followed by the utility of the itemset. For example, we show below the output file for this example.

8 7 #UTIL: 84
8 3 6 #UTIL: 84
8 6 #UTIL: 71
7 #UTIL: 61
7 5 #UTIL: 64
5 3 6 #UTIL: 64
3 6 #UTIL: 61

For example, the fourth line indicates that the itemset {7} has a utility of 61. The following lines follows the same format.

Performance

CLH-Miner is the first algorithm for cross-level high utility itemset mining. This is the original implementation.

Where can I get more information about the CLH-Miner algorithm?

The CLH-Miner algorithm was proposed in this paper:

Fournier-Viger, P., Yang, Y., Lin, J. C.-W., Luna, J. M., Ventura, S. (2020). Mining Cross-Level High Utility Itemsets. Proc. 33rd Intern. Conf. on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA AIE 2020), Springer LNAI, pp. 858-871.

Besides, for a general overview of high utility itemset mining, you may read this survey paper.