Mining Positive and Negative Association Rules with PNAR-Eclat or PNAR-AprioriTID (SPMF documentation)
This example explains how mine negative association rules using the PNAR-Eclat or PNAR-AprioriTID algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "PNAR-Eclat" or "PNAR-AprioriTID" algorithm, (2) select the input file "contextIGB.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup = 30 %, and minconf = 60 % (5) optionally, set the optional parameters to choose which rule types to generate (6) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run PNAR-Eclat contextIGB.txt output.txt 30% 60%
or
java -jar spmf.jar run PNAR-AprioriTID contextIGB.txt output.txt 30% 60%
in a folder containing spmf.jar and the example input file contextIGB.txt
- If you are using the source code version of SPMF, launch this file "MainTestPNAR.java" in the package ca.pfv.SPMF.tests.
What is this algorithm?
It is an algorithm for discovering positive and negative association rules in a transaction database, proposed by Cornelis et al. (2006).
Four types of rules can be discovered:
- A ==> B
- NOT A ==> B
- A ==> NOT B
- NOT A ==> NOT B
This implementation allows you to select which types of rules should be generated using optional parameters (see the Optional Parameters section of this page).
Moreover, this implementation allows you to choose between two algorithms (AprioriTID and Eclat) for discovering the frequent itemsets that are then used to generate the rules.
What is the input?
The input is a transaction database (aka binary context) and three thresholds named minimum support or minsup (a value between 0 and 1) , and minimum confidence or minconf (a value between 0 and 1).
A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt of the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.
| Transaction id | Items |
| t1 | {1, 2, 4, 5} |
| t2 | {2, 3, 5} |
| t3 | {1, 2, 4, 5} |
| t4 | {1, 2, 3, 5} |
| t5 | {1, 2, 3, 4, 5} |
| t6 | {2, 3, 4} |
What is the output?
The PNAR algorithm outputs positive and negative association rules that satisfy the user-defined thresholds.
To understand how this algorithm works, it is necessary to review some basic definitions.
The PNAR algorithm can discover four types of association rules:
- A ==> B
- NOT A ==> B
- A ==> NOT B
- NOT A ==> NOT B
where A and B are itemsets such that the intersection of A and B is empty.
A positive association rule A ==> B indicates that when the items of A appear in a transaction, the items of B are also likely to appear in the same transaction.
A negative association rule A ==> NOT B indicates that when the items of A appear in a transaction, the items of B are likely to be absent from that transaction.
Similarly, a rule NOT A ==> B indicates that when the items of A are absent from a transaction, the items of B are likely to appear.
Finally, a rule NOT A ==> NOT B indicates that when the items of A are absent, the items of B are also likely to be absent.
The support of a rule is the number of transactions satisfying the rule condition. For example, the support of a rule A ==> NOT B is the number of transactions containing A but not containing B.
The confidence of a rule measures its predictive strength. For a rule A ==> NOT B, the confidence is defined as the number of transactions containing A but not B, divided by the number of transactions containing A.
The algorithm outputs the association rules of the four types that have a support not less than the use-defined threshold minsup and a confidence not less than the user-defined minconf threshold.
For example, if the algorithm is executed with:
- minsup = 0.3 (30%)
- minconf = 0.6 (60%)
then multiple rules are discovered including the following rules:
2 4 5 ==> 1 #SUP: 3 #CONF: 1.0
2 4 5 ==> NOT 3 #SUP: 2 #CONF: 0.6666666666666666
NOT 3 ==> 1 4 #SUP: 2 #CONF: 1.0
NOT 1 ==> NOT 4 5 #SUP: 2 #CONF: 1.0
The first rule indicates that when items 2, 4 and 5 appear in a transaction, item 1 is present also in that transaction in 100% of the cases. This rule has a support of 3 transactions as it was observed three times.
The second rule indicates that when items 2, 4 and 5 appear in a transaction, item 3 is absent from that transaction in 66.67% of the cases. This rule has a support of 2 transactions.
The third rule indicates that when item 3 is absent from a transaction, items 1 and 4 are present in that transaction in 100% of the cases. This rule has a support of 2 transactions.
The fourth rule indicates that when item 1 is absent from a transaction, items 4 and 5 are also absent from that transaction in 100% of the cases. This rule has a support of 2 transactions.
Input file format
The input file format is a text file containing transactions. Each lines represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, the input file is defined as follows:
1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4
Consider the first line. It means that the first transaction is the itemset {1, 2, 4 and 5}. The following lines follow the same format.
Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.
Output file format
The output file format is defined as follows. It is a text file, where each line represents an association rule. On each line, the items of the rule antecedent are first listed. Each item is represented by an integer, followed by a single space. After, that the keyword "==>" appears followed by a space. Then, the items of the rule consequent are listed. Each item is represented by an integer, followed by a single space. Then, the keyword " #SUP: " appears followed by the support of the rule represented by an integer. Then, the keyword " #CONF: " appears followed by the confidence of the rule represented by a double value (a value between 0 and 1, inclusively). For example, here is a few lines from the output file of this example:
2 4 5 ==> 1 #SUP: 3 #CONF: 1.0
2 4 5 ==> NOT 3 #SUP: 2 #CONF: 0.6666666666666666
NOT 3 ==> 1 4 #SUP: 2 #CONF: 1.0
NOT 1 ==> NOT 4 5 #SUP: 2 #CONF: 1.0
The first rule indicates that when items 2, 4 and 5 appear in a transaction, item 1 is also present in that transaction in 100% of the cases.
The second rule indicates that when items 2, 4 and 5 appear in a transaction, item 3 is absent from that transaction in 66.67% of the cases.
The third rule indicates that when item 3 is absent from a transaction, items 1 and 4 are present in that transaction in 100% of the cases.
The fourth rule indicates that when item 1 is absent from a transaction, items 4 and 5 are also absent from that transaction in 100% of the cases.
Note that if the ARFF format is used as input instead of the default input format, the output format remains the same except that items are represented by strings instead of integers
Optional parameters
The algorithm has four optional parameters to control which types of rules are generated. By default, all four rule types are generated. The optional parameters are:
- Generate R1 X ==> Y (e.g. true or false): If set to true, rules of the form A ==> B are generated (default: true).
- Generate R2 NOT(X) ==> NOT(Y) (e.g. true or false): If set to true, rules of the form NOT A ==> NOT B are generated (default: true).
- Generate R3 X ==> NOT(Y) (e.g. true or false): If set to true, rules of the form A ==> NOT B are generated (default: true).
- Generate R4 NOT(X) ==> Y (e.g. true or false): If set to true, rules of the form NOT A ==> B are generated (default: true).
For example, to run the algorithm and generate only R1 (positive rules) and R3 (A ==> NOT B rules) from the command line:
java -jar spmf.jar run PNAR-Eclat contextIGB.txt output.txt 30% 60% true false true false
Optional feature: giving names to items
Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextIGB.txt". Here we have modified the file to give names to the items:
@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4
In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define transactions in the SPMF format.
Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:
bread ==> orange #SUP: 5 #CONF: 1.0
orange ==> bread #SUP: 5 #CONF: 0.8333333333333334
apple tomato ==> bread #SUP: 2 #CONF: 1.0
tomato bread ==> apple #SUP: 2 #CONF: 0.6666666666666666
apple tomato ==> orange #SUP: 2 #CONF: 1.0
apple ==> milk bread #SUP: 3 #CONF: 0.75
Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.
Implementation details
The algorithm is implemented to be as faithful as possible to the paper, while having several optimizations for performance.
Where can I get more information about this algorithm?
The following article describes the PNAR algorithm:
Cornelis et al. (2006). Mining Positive and Negative Association Rules from Large Databases, Proc. of CIS 2006, pp. 613-618.
For a good overview of itemset mining and association rule mining, you may read this survey paper.