Mining Closed Association Rules (SPMF documentation)
This example explains how to mine closed association using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Closed_association_rules" algorithm , (2) select the input file "contextZart.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup = 60 %, minconf= 60% (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run Closed_association_rules(using_fpclose) contextZart.txt output.txt 60% 60% in a folder containing spmf.jar and the example input file contextZart.txt. - To run this example with the source code version of SPMF, launch the file "MainTestClosedAssociationRulesWithFPClose_saveToMemory.java" in the package ca.pfv.SPMF.tests.
What is this algorithm?
It is an algorithm for mining "closed association rules", which are a concise subset of all association rules.
What is the input of this algorithm?
The input is a transaction database (aka binary context) and two thresholds named minsup (a value in [0,1] that represents a percentage) and minconf (a value in [0,1] that represents a percentage).
A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextZart.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.
Transaction id | Items |
t1 | {1, 2, 4, 5} |
t2 | {1, 3} |
t3 | {1, 2, 3, 5} |
t4 | {2, 3, 5} |
t5 | {1, 2, 3, 5} |
What is the output of this algorithm?
Given the minimum support threshold (minsup) and minimum confidence threshold (minconf) set by the user, the algorithm returns the set of closed association rules that respect these thresholds. To explain what is a closed association rule, it is necessary to review some definitions.
An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset divided by the total number of transactions. For example, the itemset {1, 2} has a support of 60% because it appears in 3 transactions out of 5 (it appears in t1, t2 and t5). A closed itemset is an itemset that is strictly included in no itemset having the same support.
An association rule X==>Y is a relationship between two itemsets (sets of items) X and Y such that the intersection of X and Y is empty. The support of a rule X==>Y is the number of transactions that contains X∪Y divided by the total number of transactions. The confidence of a rule X==>Y is the number of transactions that contains X∪Y divided by the number of transactions that contain X. A closed association rule is an association rule of the form X ==> Y such that the union of X and Y is a closed itemset.
The algorithm returns all closed association rules such that their support and confidence are respectively higher or equal to the minsup and minconf thresholds set by the user.
For instance, by applying this algorithm with minsup = 60 %, minconf= 60%, we obtains 16 closed associations rules:
1 ==> 3 #SUP: 3 #CONF: 0.75 // which means that this rule has a
support of 3 transactions and a confidence of 75 %
3 ==> 1 #SUP: 3 #CONF: 0.75 // which means that this rule has a
support of 3 transactions and a confidence of 75 %
2 ==> 5 #SUP: 4 #CONF: 1.0 // which means that this rule has a
support of 4 transactions and a confidence of 100 %
5 ==> 2 #SUP: 4 #CONF: 1.0 // ...
2 5 ==> 1 #SUP: 3 #CONF: 0.75
1 5 ==> 2 #SUP: 3 #CONF: 1.0
1 2 ==> 5 #SUP: 3 #CONF: 1.0
1 ==> 2 5 #SUP: 3 #CONF: 0.75
2 ==> 1 5 #SUP: 3 #CONF: 0.75
5 ==> 1 2 #SUP: 3 #CONF: 0.75
3 5 ==> 2 #SUP: 3 #CONF: 1.0
2 3 ==> 5 #SUP: 3 #CONF: 1.0
2 5 ==> 3 #SUP: 3 #CONF: 0.75
5 ==> 2 3 #SUP: 3 #CONF: 0.75
3 ==> 2 5 #SUP: 3 #CONF: 0.75
2 ==> 3 5 #SUP: 3 #CONF: 0.75
Input file format
The input file format is a text file containing transactions. Each lines represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, the input file is defined as follows:
1 2 4 5
1 3
1 2 3 5
2 3 5
1 2 3 5
This file contains five lines (five transactions). Consider the first line. It means that the first transaction is the itemset {1, 2, 4, 5}. The following lines follow the same format.
Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.
Output file format
The output file format is defined as follows. It is a text file, where each line represents an association rule. On each line, the items of the rule antecedent are first listed. Each item is represented by an integer, followed by a single space. After, that the keyword "==>" appears followed by a space. Then, the items of the rule consequent are listed. Each item is represented by an integer, followed by a single space. Then, the keyword " #SUP: " appears followed by the support of the rule represented by an integer. Then, the keyword " #CONF: " appears followed by the confidence of the rule represented by a double value (a value between 0 and 1, inclusively). For example, here is the output file for this example:
1 ==> 3 #SUP: 3 #CONF: 0.75
3 ==> 1 #SUP: 3 #CONF: 0.75
2 ==> 5 #SUP: 4 #CONF: 1.0
5 ==> 2 #SUP: 4 #CONF: 1.0
1 2 ==> 5 #SUP: 3 #CONF: 1.0
2 5 ==> 1 #SUP: 3 #CONF: 0.75
1 5 ==> 2 #SUP: 3 #CONF: 1.0
5 ==> 1 2 #SUP: 3 #CONF: 0.75
2 ==> 1 5 #SUP: 3 #CONF: 0.75
1 ==> 2 5 #SUP: 3 #CONF: 0.75
2 5 ==> 3 #SUP: 3 #CONF: 0.75
2 3 ==> 5 #SUP: 3 #CONF: 1.0
3 5 ==> 2 #SUP: 3 #CONF: 1.0
5 ==> 2 3 #SUP: 3 #CONF: 0.75
2 ==> 3 5 #SUP: 3 #CONF: 0.75
3 ==> 2 5 #SUP: 3 #CONF: 0.75
For example, the last line indicates that the association rule {3} --> {2, 5} has a support of 3 transactions and a confidence of 75 %. The other lines follow the same format.
Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.
Optional feature: giving names to items
Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextZart.txt". Here we have modified the file to give names to the items:
@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
1 3
1 2 3 5
2 3 5
1 2 3 5
In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define four sequences in the SPMF format.
Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:
bread ==> apple orange #SUP: 3 #CONF: 1.0
orange ==> apple bread #SUP: 3 #CONF: 0.75
apple ==> orange bread #SUP: 3 #CONF: 0.75
Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.
Implementation details and performance
There are two versions of this algorithms implemented in SPMF. The first one uses CHARM for finding the frequent closed itemsets before generating the rules. The second one uses FPClose for finding the frequent closed itemsets before generating the rules. The version based on FPClose is generally faster than the version based on CHARM.
In the release version of SPMF, the algorithm "Closed_association_rules(using_fpclose)" denotes the version using FPClose, while "Closed_association_rules" denotes the version based on CHARM.
In the source code version of SPMF, the files "MainTestClosedAssociationRulesWithFPClose_saveToMemory" and "MainTestClosedAssociationRulesWithFPClose_saveToFile" denotes respectively the version using FPClose which saves the result to memory or to a file. Moreover, the files "MainTestClosedAssociationRules_saveToMemory" and "MainTestClosedAssociationRules_saveToFile" denotes respectively the version using CHARM which saves the result to memory or to a file.
Where can I get more information about closed association rules?
The following Ph.D. thesis proposed "closed association rules".Szathmary, L. (2006). Symbolic Data Mining Methods with the Coron Platform. Szathmary, L. PhD thesis, University Henri Poincaré — Nancy 1, France.
For a good overview of itemset mining and association rule mining, you may read this survey paper.