Mining the IGB basis of Association Rules (SPMF documentation)
This example explains how to mine the IGB basis of association rules using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "IGB" algorithm , (2) select the input file "contextIGB.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup = 50 %, minconf= 61% (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run IGB contextIGB.txt output.txt 50% 61% in a folder containing spmf.jar and the example input file contextIGB.txt. - To run this example with the source code version of SPMF, launch the file "MainTestIGB_saveToMemory.java" in the package ca.pfv.SPMF.tests.
What is this algorithm?
This algorithm mines a subset of all association rules that is called IGB association rules (Informative and Generic Basis of Association Rules) from a transaction database.
To discover the IGB association rules, this algorithm performs two steps: (1) first it discovers Closed itemsets and their associated generators by applying the Zart algorithm. Then (2), association rules are generated by using closed itemsets and generators.
What is the input?
The input is a transaction database and two thresholds named minsup (a value between 0 and 1) and minconf (a value between 0 and 1) .
A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt of the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.
Transaction id | Items |
t1 | {1, 2, 4, 5} |
t2 | {2, 3, 5} |
t3 | {1, 2, 4, 5} |
t4 | {1, 2, 3, 5} |
t5 | {1, 2, 3, 4, 5} |
t6 | {2, 3, 4} |
What is the output?
The output is the IGB basis of association rules. It is a compact set of association rules that is both informative and generic. To explain what is the IGB basis of association rules, it is necessary to review some definitions. An itemset is a group of items. The support of an itemset is the number of times that it appears in the database divided by the total number of transactions in the database. For example, the itemset {1 3} has a support of 33 % because it appears in 2 out of 6 transactions from the database.
An association rule X--> Y is an association between two itemsets X and Y that are disjoint. The support of an association rule is the number of transactions that contains X and Y divided by the total number of transactions. The confidence of an association rule is the number of transactions that contains X and Y divided by the number of transactions that contains X.
A closed itemset is an itemset that is strictly included in no itemset having the same support. An itemset Y is the closure of an itemset X if Y is a closed itemset, X is a subset of Y and X and Y have the same support. A generator Y of a closed itemset X is an itemset such that (1) it has the same support as X and (2) it does not have a subset having the same support.
The IGB set of association rules is the set of association rules of the form X ==> Y - X, where X is a minimal generator of Y, Y is a closed itemset having a support higher or equal to minsup, and the confidence of the rule is higher or equal to minconf.
For example, by applying the IGB algorithm on the transaction database previously described with minsup = 0.50 and minconf= 0.61, we obtain the following set of association rules:
Rule | Support | Confidence |
1 ==> 2, 4, 5 | 0.50 | 0.75 |
4 ==> 1, 2, 5 | 0.50 | 0.75 |
3 ==> 2, 5 | 0.50 | 0.75 |
{} ==> 2, 3 | 0.66 | 0.66 |
{} ==> 1, 2, 5 | 0.66 | 0.66 |
{} ==> 2, 4 | 0.66 | 0.66 |
{} ==> 2, 5 | 0.83 | 0.83 |
{} ==> 2 | 1 | 1 |
Input file format
The input file format is a text file containing transactions. Each lines represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, the input file is defined as follows:
1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4
Consider the first line. It means that the first transaction is the itemset {1, 2, 4 and 5}. The following lines follow the same format.
Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.
Output file format
The output file format is defined as follows. It is a text file, where each line represents an association rule. On each line, the items of the rule antecedent are first listed. Each item is represented by an integer, followed by a single space. After, that the keyword "==>" appears followed by a space. Then, the items of the rule consequent are listed. Each item is represented by an integer, followed by a single space. Then, the keyword " #SUP: " appears followed by the support of the rule represented by an integer. Then, the keyword " #CONF: " appears followed by the confidence of the rule represented by a double value (a value between 0 and 1, inclusively). For example, here is a few lines from the output file for this example:
1 ==> 2 4 5 #SUP: 0,5 #CONF: 0,75
3 ==> 2 5 #SUP: 3 #CONF: 0.75
For example, the first line indicates that the association rule {1} --> {2, 4, 5} has a support of 3 transactions and a confidence of 75 %. The other lines follow the same format.
Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.
Where can I get more information about IGB association rules?
This article described IGB rules:
G. Gasmi, S. Ben Yahia, E. Mephu Nguifo, Y. Slimani: IGB: A New Informative Generic Base of Association Rules. PAKDD 2005: 81-90
For a good overview of itemset mining and association rule mining, you may read this survey paper.