Mining All Association Rules with the Lift Measure ( SPMF documentation)

This example explains how to mine all association rules using the lift measure using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "FPGrowth_association_rules_with_lift" algorithm , (2) select the input file "contextIGB.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup = 50 %, minconf= 90% and minlift = 1 (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run FPGrowth_association_rules_with_lift contextIGB.txt output.txt 50% 90% 1 in a folder containing spmf.jar and the example input file contextIGB.txt.
If you are using the source code version of SPMF, launch the file "MainTestAllAssociationRules_FPGrowth_version_with_lift.java" in the package ca.pfv.SPMF.tests.

What is this algorithm?

This is a variation of the algorithm for mining all association rules from a transaction database, described in the previous example.

Traditionally, association rule mining is performed by using two interestingness measures named the support and confidence to evaluate rules. In this example, we show how to use another popular measure that is called the lift or interest.

What is the input?

The input is a transaction database (aka binary context) and three thresholds named minsup (a value between 0 and 1), minconf (a value between 0 and 1) and minlift (a value between -infinity to +infinity).

A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id	Items
t1	{1, 2, 4, 5}
t2	{2, 3, 5}
t3	{1, 2, 4, 5}
t4	{1, 2, 3, 5}
t5	{1, 2, 3, 4, 5}
t6	{2, 3, 4}

What is the output ?

The output of this algorithm is a set of all the association rules that have a support, confidence and lift respectively higher than minsup, minconf and minlift.

The lift of a rule X-->Y is calculated as lift(X-->Y) = ( (sup(X U Y)/ N) / (sup(X)/ N*sup(Y)/ N ), where

N is the number of transactions in the transaction database,
sup(X∪Y) is the number of transactions containing X and Y,
sup(X) is the number of transactions containing X
sup(Y) is the number of transactions containing Y.

The confidence of a rule X-->Y is calculated as conf(X-->Y) = sup(X U Y) / (sup(X)).

The support of a rule X -->Y is defined as sup(X-->Y) = sup(X∪Y) / N

By applying the algorithm with minsup = 0.5, minconf= 0.9 and minlift = 1 on the previous database, we obtains 18 associations rules:

  rule 0:   4  ==> 2      support :  0.66 (4/6) confidence :  1.0  lift :  1.0
  rule 1:   3  ==> 2      support :  0.66 (4/6) confidence :  1.0  lift :  1.0
  rule 2:   1  ==> 5      support :  0.66 (4/6) confidence :  1.0  lift :  1.2
  rule 3:   1  ==> 2      support :  0.66 (4/6) confidence :  1.0  lift :  1.0
  rule 4:   5  ==> 2      support :  0.833(5/6)  confidence :  1.0  lift :  1.0
  rule 5:   4 5  ==> 2    support :  0.5 (3/6)   confidence :  1.0  lift :  1.0
  rule 6:   1 4  ==> 5    support :  0.5 (3/6)   confidence :  1.0  lift :  1.2
  rule 7:   4 5  ==> 1    support :  0.5 (3/6)   confidence :  1.0  lift :  1.5
  rule 8:   1 4  ==> 2    support :  0.5 (3/6)   confidence :  1.0  lift :  1.0
  rule 9:   3 5  ==> 2    support :  0.5 (3/6)   confidence :  1.0  lift :  1.0
  rule 10:  1 5  ==> 2    support :  0.66 (4/6)  confidence :  1.0  lift :  1.0
  rule 11:  1 2  ==> 5    support :  0.66 (4/6)  confidence :  1.0  lift :  1.2
  rule 12:  1  ==> 2 5    support :  0.66 (4/6)  confidence :  1.0  lift :  1.2
  rule 13:  1 4 5  ==> 2  support :  0.5 (3/6)   confidence :  1.0  lift :  1.0
  rule 14:  1 2 4  ==> 5  support :  0.5 (3/6)   confidence :  1.0  lift :  1.2
  rule 15:  2 4 5  ==> 1  support :  0.5 (3/6)   confidence :  1.0  lift :  1.5
  rule 16:  4 5  ==> 1 2  support :  0.5 (3/6)   confidence :  1.0  lift :  1.5
  rule 17:  1 4  ==> 2 5  support :  0.5 (3/6)   confidence :  1.0  lift :  1.5

How to interpret the results?

For an association rule X ==> Y, if the lift is equal to 1, it means that X and Y are independent. If the lift is higher than 1, it means that X and Y are positively correlated. If the lift is lower than 1, it means that X and Y are negatively correlated. For example, if we consider the rule {1, 4} ==> {2, 5}, it has a lift of 1.5, which means that the occurrence of the itemset {1, 4} is positively correlated with the occurrence of {2, 5}.

Input file format

The input file format is a text file containing transactions. Each lines represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, the input file is defined as follows:

1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4

Consider the first line. It means that the first transaction is the itemset {1, 2, 4 and 5}. The following lines follow the same format.

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents an association rule. On each line, the items of the rule antecedent are first listed. Each item is represented by an integer, followed by a single space. After, that the keyword "==>" appears followed by a space. Then, the items of the rule consequent are listed. Each item is represented by an integer, followed by a single space. Then, the keyword " #SUP: " appears followed by the support of the rule represented by an integer. Then, the keyword " #CONF: " appears followed by the confidence of the rule represented by a double value (a value between 0 and 1, inclusively). Then, the keyword " #LIFT: " appears followed by the lift of the rule represented by a double value (a value between -infinity to +infinity). For example, here is a few lines from the output file for this example:

1 ==> 2 4 5 #SUP: 3 #CONF: 0,75 #LIFT: 1,5
5 ==> 1 2 4 #SUP: 3 #CONF: 0,6 #LIFT: 1,2

For example, the first line indicates that the association rule {1} --> {2, 4, 5} has a support of 3 transactions, a confidence of 75 % and a lift of 1.5 indicating a positive correlation (when the value is higher than 1). The other lines follow the same format.

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextIGB.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define transactions in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

apple milk bread ==> orange #SUP: 3 #CONF: 1.0 #LIFT: 1.0
apple orange milk ==> bread #SUP: 3 #CONF: 1.0 #LIFT: 1.2
milk bread ==> apple orange #SUP: 3 #CONF: 1.0 #LIFT: 1.5

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Optional feature: constraints on the size of association rules

Sometimes, there may be just too many association rules, and rules containing many items may not be interesting. Thus, it is also possible to specify two optional parameters in the user interface of SPMF:

Max antecedent length: This parameter allows to set a maximum number of items to appear on the left side of a rule. By default, this parameter is equal to the infinity if it is not set.
Max consequent length: This parameter allows to set a maximum number of items to appear on the right side of a rule. By default, this parameter is equal to the infinity if it is not set.

If you are using the command line interface of SPMF, it is also possible to use these optional parameters by adding them at the end of the command. For example:
java -jar spmf.jar run FPGrowth_association_rules_with_lift contextIGB.txt output.txt 50% 90% 1 2 3
means to run the above example with a maximum antecedent length of 2 items and a maximum consequent length of 3 items.

Implementation details

In the source code version of SPMF, there are two versions of this algorithm. The first version saves the result into memory ("MainTestAllAssociationRules_FPGrowth_wifthLift"). The second one saves the result to a file ("MainTestAllAssociationRules_FPGrowth_saveToFile_wifthLift").

Note that we offer also the alternative of choosing CFPGrowth++ instead of FPGrowth. This is called the "CFPGrowth++_association_rules_lift" algorithm in the graphical user interface or command line interface. CFPGrowth++ allows to use multiple minimum support thresholds instead of a single minsup threshold so the input and output are slightly different (see the example about CFPGrowth++ for more details about this algorithm).

Where can I get more information about this algorithm?

The following technical report published in 1994 describes how to generate association rules from frequent itemsets (Step 2):

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.

You can also read chapter 6 of the book "introduction to data mining" which provide a nice and easy to understand introduction to how to discover frequent itemsets and generate association rules, and also describes the advantages of using the lift measure.

The following article describes the FPGrowth algorithm for mining frequent itemsets:

Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao: Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 8(1): 53-87 (2004)

For a good overview of itemset mining and association rule mining, you may read this survey paper.