Mining Quantiative High Utility Itemsets using the VHUQI Algorithm (SPMF documentation)
This example explains how to run the VHUQI algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "VHUQI" algorithm, (2) select the input file "HUQI_DB.txt", (3) set the output file name (e.g. "output.txt") (4) set minutility = 0.2, relativecoefficient = 3, method = COMBINEALL (5) set the profit file to "HUQI_DB_profit.txt" nd (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run VHUQI HUQI_DB.txt output.txt HUQI_DB_profit.txt 0.2 3 COMBINEMALL in a folder containing spmf.jar and the example input file HUQI_DB.txt and HUQI_DB_profit.txt. - If you are using the source code version of SPMF, launch the file "MainTestVHUQI.java" in the package ca.pfv.SPMF.tests.
What is VHUQI?
VHUQI is an algorithm for mining quantitative high utility itemsets. Unlike traditional high utility itemsets, the VHUQI algorithm will not only find patterns that have a high utility, but also give information about the ranges of purchase quantities (internal utility) that lead to this high utility.
What is the input?
The input of VHUQI is a transaction database with purchase quantities (internal utility) and unit profit information (external utility). Moreover, three other parameter must be set:
- a minimum utility threshold called "minutil" ( a value in the [0,1] interval representing a percentage)
- a relative coefficient ( a positive integer)
- a method for combining itemsets (either COMBINEMIN, COMBINEMAX or COMBINEALL).
For details about the meaning of these parameters, please see the VHUQI research paper.
A transaction database is a set of transactions, where each transaction is a list of distinct items (symbols). For example, let's consider the following transaction database. It consists of 4 transactions (t1, t2, ..., t4) and 4 items (1, 2, 3, 4). For instance, transaction t1 is the set of items {1, 2, 3, 4}. In a transaction database, each item has a corresponding purchase quantity. For examle, in the first transaction, the quantity of item 1 is 3, the quantity of item 2 is 3, the quantity of item 2 is 6, and the quantity of item 4 is 5. This database is provided in the file "HUQI_DB.txt" of the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction, and that items are assumed to be sorted according to some order (e.g. ascending order of numbers).
Transaction ID | Items |
Quantities |
t1 |
1 2 3 4 |
3 3 2 5 |
t2 |
1 2 3 |
2 3 3 |
t3 |
1 2 4 |
7 6 4 |
t4 |
1 2 3 4 |
7 8 9 6 |
Besides a table indicating the unit profit of each item is required. For example, consider the following table:
Item | Unit profit |
1 |
42 |
2 |
87 |
3 |
94 |
4 |
97 |
It indicates that the unit profit yield by selling the items 1, 2, 3 and 4 are 42, 87, 94 and 97, respectively. This table is provided in the file "HUQI_DB_profit.txt".
Note that in research papers, unit profit is often called "external utility" while purchase quantities are called "internal utility". This is the terminology used by researchers. Although in the above example, utility is used to represent quantities and profit, it could be more generally be used to represent the relative importance of items to the user. Thus, it can be also applied to other types of data besides shopping databases.
What is the output?
The output file format is defined as follows. It is a text file, where each line represents a quantitative high utility itemset. An itemset contains several items. Each item is represented in the form (x,y) where x is the item and y is its purchase quantity, or by the form (x,y,z) indicating that from y to z units of item x where purchased. At the end of each line, the utility of the itemset is indicated after the keyword #UTIL: For examle, here is a few line of the output file for the above example:
(4,4,6) #UTIL: 1455
(4,5,6) #UTIL: 1067
(3,9) (2,8) #UTIL: 1542
(3,9) (4,6) #UTIL: 1428
(3,9) (1,7) #UTIL: 1140
(3,9) (2,8) (4,6) #UTIL: 2124
(3,9) (2,8) (1,7) #UTIL: 1836
(3,9) (2,8) (4,6) (1,7) #UTIL: 2418
(3,9) (4,6) (1,7) #UTIL: 1722
(2,8) (4,6) #UTIL: 1278
(2,8) (4,6) (1,7) #UTIL: 1572
(4,4,6) (1,7) #UTIL: 1558
(2,6) (4,4) (1,7) #UTIL: 1204
For example, the third line indicates that the itemset containing 9 units of item 3 and 8 units of item 2 yield a utility of 1542 $. The second line indicates that the purchase of 5 to 6 units of item 4 yield a 1067 $ profit.
Input file format
VHUQI takes two files as input, defined as follows.
The first file (e.g. HUQI_DB.txt) is a text file containing transactions. Each lines represents a transaction. Each line is a list of positive numbers where an item is followed by a space and then its quantity. Then, the next item appears, followed by a space, followed by its quantity, and so on... This is the file HUQI_DB.txt used in the example.
1,3 2,3 3,2 4,5
1,2 2,3 3,3
1,7 2,6 4,4
1,7 2,8 3,9 4,6
The second file is a text file (e.g. HUQI_DB.txt) which provides the unit profit of each item. Each is an item, followed by a space, followed by its unit profit. This is the file HUQI_DB_profit.txt used in the above example.
1, 42
2, 87
3, 94
4, 97
Output file format
The output file format is defined as follows. It is a text file, where each line represents a quantitative high utility itemset. An itemset contains several items. Each item is represented in the form (x,y) where x is the item and y is its purchase quantity, or by the form (x,y,z) indicating that from y to z units of item x where purchased. At the end of each line, the utility of the itemset is indicated after the keyword #UTIL: For examle, here is a few line of the output file for the above example:
(4,4,6) #UTIL: 1455
(4,5,6) #UTIL: 1067
(3,9) (2,8) #UTIL: 1542
(3,9) (4,6) #UTIL: 1428
(3,9) (1,7) #UTIL: 1140
(3,9) (2,8) (4,6) #UTIL: 2124
(3,9) (2,8) (1,7) #UTIL: 1836
(3,9) (2,8) (4,6) (1,7) #UTIL: 2418
(3,9) (4,6) (1,7) #UTIL: 1722
(2,8) (4,6) #UTIL: 1278
(2,8) (4,6) (1,7) #UTIL: 1572
(4,4,6) (1,7) #UTIL: 1558
(2,6) (4,4) (1,7) #UTIL: 1204
Implementation details
This is the original implementation of the algorithm.
Where can I get more information about the VHUQI algorithm?
This is the reference of the article describing the VHUQI algorithm:
C. H. Li, C. Wu and V. S. Tseng, “Efficient Vertical Mining of High Utility Quantitative Itemsets, Proc. of Int’l Conf. on Granular Computing,pp. 155-160, 2014.
Besides, for a general overview of high utility itemset mining, you may read this survey paper.