Mining Quantiative High Utility Itemsets using the TKQ Algorithm (SPMF documentation)
This example explains how to run the TKQ algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "TKQ" algorithm, (2) select the input file "dbHUQI.txt", (3) set the output file name (e.g. "output.txt") (4) set k = 13, relativecoefficient = 3, method = COMBINEALL (5) set the profit file to "dbHUQI_p.txt" nd (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run TKQ dbHUQI.txt output.txt dbHUQI_p.txt 13 3 COMBINEMALL in a folder containing spmf.jar and the example input file dbHUQI.txt and dbHUQI_p.txt. - If you are using the source code version of SPMF, launch the file "MainTestTKQ .java" in the package ca.pfv.SPMF.tests.
What is TKQ?
TKQ is a fast algorithm for mining the top-k quantitative high utility itemsets proposed by Nouioua et al. (2021) . The advantage of this algorithm over algorithms such as FHUQI-Miner is that the user can directly select the number of patterns k to be found instead of using a minimum utility threshold.
Note: the algorithm can return more than k patterns if several patterns have exactly the same utility, or less than k patterns if the database is small and k is too large.
What is the input?
The input of TKQ is a transaction database with purchase quantities (internal utility) and unit profit information (external utility). Moreover, three other parameter must be set:
- a parameter k indicating the number of itemsets to be found.
- a relative coefficient ( a positive integer)
- a method for combining itemsets (either COMBINEMIN, COMBINEMAX or COMBINEALL).
For details about the meaning of these parameters, please see the TKQ research paper.
A transaction database is a set of transactions, where each transaction is a list of distinct items (symbols). For example, let's consider the following transaction database. It consists of 4 transactions (t1, t2, ..., t4) and 4 items (1, 2, 3, 4). For instance, transaction t1 is the set of items {1, 2, 3, 4}. In a transaction database, each item has a corresponding purchase quantity. For examle, in the first transaction, the quantity of item 1 is 3, the quantity of item 2 is 3, the quantity of item 2 is 6, and the quantity of item 4 is 5. This database is provided in the file "dbHUQI.txt" of the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction, and that items are assumed to be sorted according to some order (e.g. ascending order of numbers).
Transaction ID |
Items |
Quantities |
t1 |
1 2 3 4 |
3 3 2 5 |
t2 |
1 2 3 |
2 3 3 |
t3 |
1 2 4 |
7 6 4 |
t4 |
1 2 3 4 |
7 8 9 6 |
Besides a table indicating the unit profit of each item is required. For example, consider the following table:
Item |
Unit profit |
1 |
42 |
2 |
87 |
3 |
94 |
4 |
97 |
It indicates that the unit profit yield by selling the items 1, 2, 3 and 4 are 42, 87, 94 and 97, respectively. This table is provided in the file "dbHUQI_p.txt".
Note that in research papers, unit profit is often called "external utility" while purchase quantities are called "internal utility". This is the terminology used by researchers. Although in the above example, utility is used to represent quantities and profit, it could be more generally be used to represent the relative importance of items to the user in terms of various factors besides money. Thus, it can be also applied to other types of data besides shopping databases.
What is the output?
The output file format is defined as follows. It is a text file, where each line represents a quantitative high utility itemset. An itemset contains several items. Each item is represented in the form (x,y) where x is the item and y is its purchase quantity, or by the form (x,y,z) indicating that from y to z units of item x where purchased. At the end of each line, the utility of the itemset is indicated after the keyword #UTIL: For example, here is a few line of the output file for the above example:
(4,5,6) #UTIL: 1067
(3,9) (1,7) #UTIL: 1140
(2,6) (4,4) (1,7) #UTIL: 1204
(4,4,6) #UTIL: 1455
(3,9) (4,6) #UTIL: 1428
(3,9) (2,8) #UTIL: 1542
(2,8) (4,6) #UTIL: 1278
(4,4,6) (1,7) #UTIL: 1558
(3,9) (2,8) (4,6) #UTIL: 2124
(3,9) (2,8) (1,7) #UTIL: 1836
(3,9) (2,8) (4,6) (1,7) #UTIL: 2418
(3,9) (4,6) (1,7) #UTIL: 1722
(2,8) (4,6) (1,7) #UTIL: 1572
For example, the second line indicates that the item containing 9 units of item 3 and 7 units of item 1 yields a profit of 1140 $. Another example is the eigth line, which indicates that the itemset containing 4 to 6 units of item 6 and 7 units of item 1 yields a profit of 1558 $.
Input file format
TKQ takes two files as input, defined as follows.
The first file (e.g. dbHUQI.txt) is a text file containing transactions. Each lines represents a transaction. Each line is a list of positive numbers where an item is followed by a space and then its quantity. Then, the next item appears, followed by a space, followed by its quantity, and so on... This is the file dbHUQI.txt used in the example.
1,3 2,3 3,2 4,5
1,2 2,3 3,3
1,7 2,6 4,4
1,7 2,8 3,9 4,6
The second file is a text file (e.g. dbHUQI.txt) which provides the unit profit of each item. Each is an item, followed by a space, followed by its unit profit. This is the file dbHUQI_p.txt used in the above example.
1, 42
2, 87
3, 94
4, 97
Output file format
The output file format is defined as follows. It is a text file, where each line represents a quantitative high utility itemset. An itemset contains several items. Each item is represented in the form (x,y) where x is the item and y is its purchase quantity, or by the form (x,y,z) indicating that from y to z units of item x where purchased. At the end of each line, the utility of the itemset is indicated after the keyword #UTIL: For examle, here is a few line of the output file for the above example:
(4,5,6) #UTIL: 1067
(3,9) (1,7) #UTIL: 1140
(2,6) (4,4) (1,7) #UTIL: 1204
(4,4,6) #UTIL: 1455
(3,9) (4,6) #UTIL: 1428
(3,9) (2,8) #UTIL: 1542
(2,8) (4,6) #UTIL: 1278
(4,4,6) (1,7) #UTIL: 1558
(3,9) (2,8) (4,6) #UTIL: 2124
(3,9) (2,8) (1,7) #UTIL: 1836
(3,9) (2,8) (4,6) (1,7) #UTIL: 2418
(3,9) (4,6) (1,7) #UTIL: 1722
(2,8) (4,6) (1,7) #UTIL: 1572
Implementation details
This is the original implementation of the algorithm.
Where can I get more information about the TKQ algorithm?
This is the reference of the article describing the TKQ algorithm:
Nouioua, M., Fournier-Viger, P., Gan, W., Wu, Y., Lin, J. C.-W., Nouioua, F. (2021). TKQ: Top-K Quantitative High Utility Itemset Mining. Proc. 16th Intern. Conference on Advanced Data Mining and Applications (ADMA 2021) Springer LNAI, 12 pages [ppt]
Besides, for a general overview of high utility itemset mining, you may read this survey paper.