Mining High Utility Association Rules using the HGB / HGB-ALL Algorithm (SPMF documentation)
This example explains how to run the HGB and HGB_All algorithm using the SPMF open-source data mining library to discover high utility association rules..
How to run this example?
To run the example with the HGB algorithm for mining all non redundant high utility association rules:
- If you are using the graphical interface, (1) choose the "HGB" algorithm, (2) select the input file "DB_utility.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum utility to 30, the minimum confidence to 0.6 (which means 60%) and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run HGB DB_utility.txt output.txt 30 0.6 in a folder containing spmf.jar and the example input file DB_utility.txt. - If you are using the source code version of SPMF, launch the file "MainTestHGB_all.java" in the package ca.pfv.SPMF.tests.
To run the example with the HGB_All algorithm for mining all high utility association rules:
- If you are using the graphical interface, (1) choose the "HGB_All" algorithm, (2) select the input file "DB_utility.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum utility to 30, the minimum confidence to 0.6 (which means 60%) and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run HGB_All DB_utility.txt output.txt 30 0.6 in a folder containing spmf.jar and the example input file DB_utility.txt. - If you are using the source code version of SPMF, launch the file "MainTestHGB_all.java" in the package ca.pfv.SPMF.tests.
What is HGB / HGB_All ?
HGB and HGB_All are names given in SPMF to two algorithms proposed by Sahoo et al. (2015) for mining high utility association rules. Those algorithms are post-processing algorithms to extract high utility association rules using the results from the HUCI-Miner algorithm. The HGB_All algorithm is designed to find all high utility association rules, while the HGB algorithm finds only the non redundant association rules, which is a subset of all rules.
The advantage of high utility association rules over traditional association rules is that the concept of utility or importance is taken into account. The utility can be used to model factors such as the profit yield by the sales of items, and can be also applied to other application.
What is the input?
HGB / HGB_ALL takes as input a transaction database with utility information, a minimum utility threshold min_utility (a positive integer) and a minimum confidence threshold minconf (a double value representing a percentage) . Let's consider the following database consisting of 5 transactions (t1,t2...t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Items | Transaction utility | Item utilities for this transaction | |
t1 | 3 5 1 2 4 6 | 30 | 1 3 5 10 6 5 |
t2 | 3 5 2 4 | 20 | 3 3 8 6 |
t3 | 3 1 4 | 8 | 1 5 2 |
t4 | 3 5 1 7 | 27 | 6 6 10 5 |
t5 | 3 5 2 7 | 11 | 2 3 4 2 |
Each line of the database is:
- a set of items (the first column of the table),
- the sum of the utilities (e.g. profit) of these items in this transaction (the second column of the table),
- the utility of each item for this transaction (e.g. profit generated by this item for this transaction)(the third column of the table).
Note that the value in the second column for each line is the sum of the values in the third column.
What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.
What is the output?
The output of HGB and HGB_ALL are high utility association rules. To explain what is a high utility association rule, it is necessary to review some definitions.
An itemset is an unordered set of distinct items.
An association rule is an implication of the form X ==> Y where X and Y are two non empty itemsets, and where X and Y are disjoint. X is called the antecedent of the rule and Y is called the consequent.
The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 6 = 11 and the utility of {1 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, for a total of 11 + 7 = 18. A high utility itemset is an itemset such that its utility is no less than min_utility.
An association rule X ==> Y is said to be a high utility association rule if it satisfy the following conditions:
- The antecedent X is a high utility itemset.
- The itemset X U Y is a high utility itemset.
- The utility-confidence of the rule X ==> Y is no less than the minimum confidence threshold. The utility confidence is defined as the sum of the utility values of items in X in the transactions containing X U Y, divided by the utility of X (see the paper by Sahoo et al., 2005 for more details)s.
The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3, 5} has a support of 2 because it appears in three transactions from the database (t1 and t4). A closed is an itemset X such that there does not exist an itemset Y strictly included in X that has the same support. For example, itemset {1, 3, 5} is a closed itemset.
A closed high utility itemset (CHUI) is a high-utility itemset that is a closed itemset.
A high utility generator, as defined by Sahoo (2015), is a high utility itemsets that has no subset that is also a high utility itemset and has the same support
A high utility association rule is said to be a non redundant association rule if X U Y is a closed itemset and X is a generator.
For example, if we run HGB to find the non redundant high utility association rules with a minimum utility of 30 and minconf = 0.5 (which means 50%) we obtain 4 rules:
Association rule | utility | utility of antecedent | utility confidence |
{2 5} ==> {3} | 37 | 31 | 100% |
{2 4}==> {3 5} | 40 | 30 | 100 % |
{2 5} ==> [3 4] | 40 | 31 | 77.4 % |
{2 4}==>[1 3 5 6] | 30 | 30 | 53.3 % |
Now if we run the HGB_All algorithms, we find 13 rules, as follows:
Association rule | utility | utility of antecedent | utility confidence |
2 4 ==> 5 | 36 | 30 | 100% |
2 4 ==> 3 | 34 | 30 | 100 % |
2 5 ==> 4 | 36 | 31 | 77.4 % |
2 5 ==> 3 | 37 | 31 | 100% |
2 4 ==> 3 5 | 40 | 30 | 100% |
2 5 ==> 3 4 | 40 | 31 | 77.4 % |
2 3 4 ==> 5 | 40 | 34 | 100% |
2 4 5 ==> 3 | 40 | 36 | 100% |
2 3 5 ==> 4 | 40 | 37 | 75.7% |
2 4 ==> 1 3 5 6 | 30 | 30 | 53.3% |
2 3 4 ==> 1 5 6 | 30 | 34 | 50 % |
2 4 5 ==> 1 3 6 | 30 | 36 | 52.78 % |
2 3 4 5 ==> 1 6 | 30 | 40 | 50 % |
Input file format
The input file format of HUCI-Miner is defined as follows. It is a text file. Each line represents a transaction. Each line is composed of three sections, as follows.
- First, the items contained in the transaction are listed. An item is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same transaction.
- Second, the symbol ":" appears and is followed by the transaction utility (an integer).
- Third, the symbol ":" appears and is followed by the utility of each item in this transaction (an integer), separated by single spaces.
For example, for the previous example, the input file is defined as follows:
3 5 1 2 4 6:30:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1 7:27:6 6 10 5
3 5 2 7:11:2 3 4 2
Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 30 and that items 3, 5, 1, 2, 4 and 6 respectively have a utility of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.
Output file format
The output file format of HGB is defined as follows. It is a text file, that contains a list of high utility association rules. Each line contains a rule. First, items from the antecedent are listed, separated by spaces. Then, there is "==>", followed by the list of items from the consequent, separated by spaces. Then, #UTIL: is followed by the utility of the rule. Then the keyword #AUTIL: is followed by the utility of the rule antecedent. Then, the keyword #UCONF: is followed by the utility confidence. Here is the output of HGB for the above example:
2 5 ==> 3 #UTIL: 37 #AUTIL: 31 #UCONF: 1.0
2 4 ==> 3 5 #UTIL: 40 #AUTIL: 30 #UCONF: 1.0
2 5 ==> 3 4 #UTIL: 40 #AUTIL: 31 #UCONF: 0.7741935483870968
2 4 ==> 1 3 5 6 #UTIL: 30 #AUTIL: 30 #UCONF: 0.5333333333333333
And here is the output of HGB_All for the example:
2 4 ==> 5 #UTIL: 36 #AUTIL: 30 #UCONF: 1.0
2 4 ==> 3 #UTIL: 34 #AUTIL: 30 #UCONF: 1.0
2 5 ==> 4 #UTIL: 36 #AUTIL: 31 #UCONF: 0.7741935483870968
2 5 ==> 3 #UTIL: 37 #AUTIL: 31 #UCONF: 1.0
2 4 ==> 3 5 #UTIL: 40 #AUTIL: 30 #UCONF: 1.0
2 5 ==> 3 4 #UTIL: 40 #AUTIL: 31 #UCONF: 0.7741935483870968
2 3 4 ==> 5 #UTIL: 40 #AUTIL: 34 #UCONF: 1.0
2 4 5 ==> 3 #UTIL: 40 #AUTIL: 36 #UCONF: 1.0
2 3 5 ==> 4 #UTIL: 40 #AUTIL: 37 #UCONF: 0.7567567567567568
2 4 ==> 1 3 5 6 #UTIL: 30 #AUTIL: 30 #UCONF: 0.5333333333333333
2 3 4 ==> 1 5 6 #UTIL: 30 #AUTIL: 34 #UCONF: 0.5
2 4 5 ==> 1 3 6 #UTIL: 30 #AUTIL: 36 #UCONF: 0.5277777777777778
2 3 4 5 ==> 1 6 #UTIL: 30 #AUTIL: 40 #UCONF: 0.5
Performance
This is the original implementation of HGB and HGB_All, which also uses the original implementation of HUCI_Miner.
Where can I get more information about the HGB, HGB_All and HUCI_Miner algorithms?
This is the reference of the article describing the HUCI_Miner algorithm, as well as HGB and HGB_All:
Sahoo, Jayakrushna, Das, A., Goswami, A., (2015) An efficient approach for mining association rules from high utility itemsets. Expert Syst. Appl. 42, 5754-5778.
Besides, for a general overview of high utility itemset mining, you may read this survey paper.