Mining Negative Association Rules with NAR-Miner (SPMF documentation)

This example explains how mine negative association rules with NAR-Miner using the SPMF open-source data mining library.

How to run this example?

What is this algorithm?

It is an algorithm for discovering negative association rules of the form A ==> NOT B in a transaction database proposed by Bian et al. (2018).
The first step is to discover frequent itemsets and infrequent itemsets. The second step is to generate negative rules from them.

What is the input?

The input is a transaction database (aka binary context) and three thresholds named minimum frequent support or minsup (a value between 0 and 1), minimum infrequent support or mis (a value between 0 and 1) , and minconf (a value between 0 and 1).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output?

The output is negative association rules of the form A ==> NOT B respecting the user-defined thresholds.

To explain how this algorithm works, it is necessary to review some definitions.

A negative association rule A ==> NOT B as defined by this algorithm is a relationship between two itemsets (sets of items) A and B such that the intersection of A and B is empty.

The interpretation of a rule A ==> NOT B is that if the items of A appears, it is likely that the items of B will not appear in the same transaction.

The support of a negative rule of this type is the number of transactions that contains A but does not contain B.

The confidence of a negative rule of this type is the number of transactions that contains A but does not contain B divided by the number of transactions that contain A.

The interestingness of a negative rule of this type is a secondary quality measure designed to assess how informative the rule is beyond support and confidence.
In this implementation, it is computed by comparing the confidence of the rule with the entropy of the antecedent itemset.More precisely, if the entropy of the antecedent is zero, the interestingness is simply equal to the confidence. Otherwise, the interestingness is obtained by dividing the confidence by the entropy value. Intuitively, this measure adjusts the confidence according to how “informative” or “structured” the antecedent is in the dataset. When the antecedent has low entropy, meaning its items are highly skewed or strongly present in the database, the interestingness value becomes higher. This emphasizes rules that are derived from more deterministic or less random patterns. On the other hand, when the antecedent has higher entropy, meaning the items are more uniformly distributed, the interestingness decreases, reflecting that such rules are less distinctive. In practical terms, this means that two rules with the same confidence can still receive different interestingness scores depending on the distribution of their antecedent items in the database. This allows the algorithm to rank rules not only by their predictive strength, but also by how informative and non-trivial they are. See the research paper for more details about the interestingness measure.

If we apply the algorithm, it will return all the negative rules meeting the thresholds constraints.

Note that NAR-MINER algorithms generate rule in a very specific way such a rule A ==> NOT B is only output if the union C = A U B is infrequent and all the subsets of C are frequent. Thus, the algorithm does not output all possible negative rules but only the subset meeting this constraint.

For example, by applying the algorithm with minsup = 0.5 (50%), mis = 0.3% minconf = 0.3 (30%), we obtains 2 negative associations rules:

1 ==> NOT 2 #SUP: 2 #CONF: 0.3333 #INTER: 0.4764

1 ==> NOT 5 #SUP: 2 #CONF: 0.3333 #INTER: 0.4764

The first rule indicates that if the item 1 is present, the item 2 is not present, and this rule has a support of 2 transactions, a confidence of 33% and an interestingness value of 0.4764. The second rule can be interpreted in the same way.

Input file format

The input file format is a text file containing transactions. Each lines represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, the input file is defined as follows:

For example, for the previous example, the input file is defined as follows:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a negative association rule. On each line, the items of the rule antecedent are first listed. Each item is represented by an integer, followed by a single space. After, that the keyword "==>" appears followed by a space. Then, NOT appears followed by a space, and then followed by the items of the rule consequent. Each item is represented by an integer, followed by a single space. Then, the keyword " #SUP: " appears followed by the support of the rule represented by an integer. Then, the keyword " #CONF: " appears followed by the confidence of the rule represented by a double value (a value between 0 and 1, inclusively). Then the keyword "#INTER:" appears, followed by the interestingness value (a value greater or equal to 0). For example, here is the output of this example:

1 ==> NOT 2 #SUP: 2 #CONF: 0.3333 #INTER: 0.4764
1 ==> NOT 5 #SUP: 2 #CONF: 0.3333 #INTER: 0.4764

The first rule indicates that if the item 1 is present, the item 2 is not present, and this rule has a support of 2 transactions, a confidence of 33% and an interestingness value of 0.4764. The second rule can be interpreted in the same way.

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextIGB.txt". Here we have modified the file to give names to the items: 

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define transactions in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

apple ==> NOT orange #SUP: 2 #CONF: 0.3333 #INTER: 0.4764
apple ==> NOT bread #SUP: 2 #CONF: 0.3333 #INTER: 0.4764

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Implementation details

The algorithm is implemented to be as faithful as possible to the paper, while having several optimizations for performance.

Where can I get more information about this algorithm?

The following article describes the NAR-Miner algorithm:

Bian et al. (2018) NAR-Miner: Discovering Negative Association Rules from Code for Bug Detection. Proceedings of ESEC/FSE 2018. pages 411-422.

For a good overview of itemset mining and association rule mining, you may read this survey paper.