Mining the Top-K Non-Redundant Association Rules (SPMF documentation)

This example explains how to run the TNR algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "TNR" algorithm, (2) select the input file "contextIGB.txt", (3) set the output file name (e.g. "output.txt") (4) set k = 30,minconf = 0.5, and delta = 2 (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run TNR contextIGB.txt output.txt 30 0.5 2 in a folder containing spmf.jar and the example input file contextIGB.txt.
If you are using the source code version of SPMF, launch the file "MainTestTNR.java" in the package ca.pfv.SPMF.tests.

What is TopKRules?

TNR is an algorithm for discovering the top-k non-redundant association rules appearing in a transaction database. It is an approximate algorithm in the sense that it always generates non-redundant rules. But these rules may not always be the top-k non-redundant association rules. TNR uses a parameter named delta, which is a positive integer >=0 that can be used to improve the chance that the result is exact (the higher the delta value, the more chances that the result will be exact).

Why is it important to discover top-k non-redundant association rules? Because other association rule mining algorithms requires that the user set a minimum support (minsup) parameter that is hard to set (usually users set it by trial and error, which is time consuming). Moreover, the result of association rule mining algorithms usually contains a high level of redundancy (for example, thousands of rules can be found that are variation of other rules having the same support and confidence). The TNR algorithm provide a solution to both of these problems by letting users directly indicate k, the number of rules to be discovered, and by eliminating redundancy in results.

What is the input of TNR ?

TNR takes four parameters as input:

a transaction database,
a parameter k representing the number of rules to be discovered (a positive integer >= 1),
a parameter minconf representing the minimum confidence that association rules should have (a value in [0,1] representing a percentage).
a parameter delta (a positive integer >=0) that is used to increase the chances of having an exact result (because the TNR algorithm is approximate).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id	Items
t1	{1, 2, 4, 5}
t2	{2, 3, 5}
t3	{1, 2, 4, 5}
t4	{1, 2, 3, 5}
t5	{1, 2, 3, 4, 5}
t6	{2, 3, 4}

What is the output of TNS ?

TNR outputs an approximation of the k most frequent non redundant association rules having a confidence higher or equal to minconf.

To explain what are top-k non redundant association rules, it is necessary to review some definitions. An itemset is a set of distinct items. The support of an itemset is the number of times that it appears in the database divided by the total number of transactions in the database. For example, the itemset {1 3} has a support of 33 % because it appears in 2 out of 6 transactions from the database.

An association rule X--> Y is an association between two itemsets X and Y that are disjoint. The support of an association rule is the number of transactions that contains X and Y divided by the total number of transactions. The confidence of an association rule is the number of transactions that contains X and Y divided by the number of transactions that contains X.

An association rule ra: X → Y is redundant with respect to another rule rb : X1 → Y1 if and only if:

conf(ra) = conf(rb)
sup(ra) = sup(rb)
X1 ⊆ X ∧ Y ⊆ Y1.

The top-k non redundant association rules are the k most non-redundant frequent association rules in the database having a confidence higher or equal to minconf.

For example, If we run TNR with k = 10 and minconf = 0.5 and delta = 2, the following set of rules is found

4, ==> 2, sup= 4 conf= 1.0
2, ==> 1,5, sup= 4 conf=0.66
2, ==> 5, sup= 5 conf= 0.8333333333333334
5, ==> 2, sup= 5 conf= 1.0
5, ==> 1,2, sup= 4 conf= 0.8
1, ==> 2,5, sup= 4 conf= 1.0
2, ==> 3, sup= 4 conf=0.66
2, ==> 4, sup= 4 conf=0.66
3, ==> 2, sup= 4 conf= 1.0
1,4, ==> 2,5, sup= 3 conf= 1.0

For instance, the association rule 2 ==> 1 5 means that if items 2 appears, it is likely to be associated with item 1 and item 5 with a confidence of 66 %. Moreover, this rule has a support 66 % (sup = 4) because it appears in four transaction (t1, t3, t4 and t5) out of six transactions contained in this database.

Note that for some values of k and some datasets, TNR may return more than k association rules. This can happen if several rules have exactly the same support, and it is normal. It is also possible that the algorithm returns slightly less than k association rules in some circumstances because the algorithm is approximate.

Input file format

The input file format is a text file containing transactions. Each lines represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, the input file is defined as follows:

1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4

Consider the first line. It means that the first transaction is the itemset {1, 2, 4 and 5}. The following lines follow the same format.

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents an association rule. On each line, the items of the rule antecedent are first listed. Each item is represented by a positive integer, followed by a single space. After, that the keyword "==>" appears followed by a space. Then, the items of the rule consequent are listed. Each item is represented by an integer, followed by a single space. Then, the keyword " #SUP: " appears followed by the support of the rule represented by an integer (a number of transactions). Then, the keyword " #CONF: " appears followed by the confidence of the rule represented by a double value (a value between 0 and 1, inclusively). For example, here is a few lines from the output file if we run TopKRules on contextIGB.txt with k=3 and minconf=0.8 (80 %):

2 ==> 4 #SUP: 4 #CONF:0.66
5 ==> 1 2 #SUP: 4 #CONF: 0.8
5 ==> 2 #SUP: 5 #CONF: 1.0
2 ==> 5 #SUP: 5 #CONF: 0.8333333333333334
2 ==> 1 5 #SUP: 4 #CONF:0.66
1 ==> 2 5 #SUP: 4 #CONF: 1.0
2 ==> 3 #SUP: 4 #CONF:0.66
3 ==> 2 #SUP: 4 #CONF: 1.0
4 ==> 2 #SUP: 4 #CONF: 1.0
4 5 ==> 1 2 #SUP: 3 #CONF: 1.0

For example, the first line indicates that the association rule {2} --> {4} has a support of 4 transactions and a confidence of 66.66 %. The other lines follow the same format.

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextIGB.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define transactions in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

milk ==> orange #SUP: 4 #CONF: 1.0
apple milk ==> orange bread #SUP: 3 #CONF: 1.0
milk bread ==> apple orange #SUP: 3 #CONF: 1.0

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Performance

TNR is an efficient algorithm. It is based on the TopKRules algorithm for discovering top-k association rule. The main difference between TNR and TopKRules is that TNR includes additional strategies to eliminate redundancy in results, and that TNR is an approximate algorithm, while TopKRules is not.

TNR and TopKRules are more intuitive to use than regular association rule mining algorithms. However, it should be noted that the problem of top-k association rule mining is more computationally expensive than the problem of association rule mining. Therefore, it is recommended to use TNR or TopKRules for k values of up to 5000 depending on the dataset. If more rules should be found, it could be better to find association rules with a classical association rule mining algorithm like FPGrowth, for more efficiency.

Where can I get more information about this algorithm?

The TNR algorithm is described in this paper:

Fournier-Viger, P., Tseng, V.S. (2012). Mining Top-K Non-Redundant Association Rules. Proc. 20th International Symposium on Methodologies for Intelligent Systems (ISMIS 2012), Springer, LNCS 7661, pp. 31- 40.

For a good overview of itemset mining and association rule mining, you may read this survey paper.