Mining Self-Sufficient Itemsets using the Opus-Miner algorithm (SPMF documentation)
This example explains how to run the Opus-Miner algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "OPUS-Miner" algorithm, (2) select the input file "contextOpusMiner.txt", (3) set the output file name (e.g. "output.txt") (4) set the parameters to 3, true, true, true, false, respectively and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run OPUS-Miner contextOpusMiner.txt output.txt 3 true true true false in a folder containing spmf.jar and the example input file contextOpusMiner.txt. - If you are using the source code version of SPMF, launch the file "MainTestOpusMiner.java" in the package ca.pfv.SPMF.tests.
What is the algorithm?
Opus-Miner is an algorithm mining statistically significant frequent itemsets, which are called self-sufficient itemsets. This algorithm is interesting because it not only discover frequent itemsets but aim to ensure that the patterns that are found are also statistically significant (dont occur frequently by chance). It is argued by Webb et al. that such itemset could be more interesting than frequent itemsets.
The Java version of OPUS-Miner included in SPMF is a conversion of the original C++ code, which was made available under the GPL license. Some small modifications have been done to make it work in Java, and to itnegrate it in SPMF.
What is the input?
The input is a transaction database (aka binary context) and a threshold named minsup (a value between 0 and 100 %).
A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is the first five transactions of the file contextOpusMiner.txt in the SPMF distribution, which contains 115 transactions. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.
Transaction id | Items |
t1 | {1, 2, 4, 5} |
t2 | {1, 3} |
t3 | {1, 2, 3, 5} |
t4 | {2, 3} |
t5 | {1, 2, 4, 5} |
What is the output?
The output of Opus-Miner is a set of itemsets that is statistically significant, called self-sufficient itemsets.An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset divided by the total number of transactions. For example, if we consider the five transactions above, the itemset {1, 2} has a support of 60% because it appears in 3 transactions out of 5 (it appears in t1, t2 and t5).
To apply the OPUS-Miner algorithm, several parameters must be set by the user:
- k: This is the main parameter. It is the number of itemsets that the user wants to find. For example, if k = 3, OPUS-Miner will return the top-3 itemsets that have the highest value according to some interestingness measure.
- check independency (filter): if this option is set to true, itemsets that are not independently productive (statistically significant) when compared to their subsets will be eliminated. In simple words, it means that an itemset should be different from its subsets (not just be frequent because of its subsets).
- search by lift: if this parameter is set to true, the lift will be used as interestingness measure to select patterns, otherwise it will be the leverage measure that will be used. See the paper by Webb et al. for a description of these measures.
- check redundancy: if set to true, itemsets that are redundant will be eliminated. In the paper of Webb, an itemset is called redundant if it has a subset that has the same support. If you are familiar with the terminology of itemset mining, this is the same thing as saying that only the generator itemsets will be kept.
- correction for multicompare: if set to true, the algorithm will make some adjustements to the alpha values used for statistical tests to account for the multiple statistical tests performed for large itemsets. If set to false, no correction will be done.
- Print closure?: This is an advanced parameter. If set to true, it will print the closure (smallest closed itemset that is a superset of) each itemset that is output. If set to false, nothing will be printed.
For example, if we run Opus-Miner on the file contextOpusMiner.txt provided in the SPMF distribution, with k = 3, check independency = true, search by lift = true, check redundancy = true, correction for multicompare = true and print closure = false, the following three self-sufficient itemsets are found:
Self-sufficient itemsets | Support | Lift | P value |
{1, 4} | 23 | 1.6 | 1.3223990575406282E-6 |
{1, 3} | 69 | 1.25 | 9.051081854861051E-12 |
{2, 5} | 92 | 1.24 | 1.0993086787649736E-24 |
The first line indicates that the itemset {1, 4} appeared in 23 transactions and has a lift of 1.6. The lift measure is interpreted as follows. A lift > 1 means a positive correlation, = 1 means no correlation, while < 1 means a negative correlation. Moreover, this iitemset was found to be statistically significant for a P-value of 1.3223990575406282E-6.
The other lines follow the same format.
Input file format
The input file format is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.
For example, for the previous example, the input file is defined as follows:
1 2 4 5
1 3
1 2 3 5
2 3 5
1 2 3 5
Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.
Output file format
The output file format is defined as follows. It is a text file, where each line represents an itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. Then, if the lift measure is used, the keyword #LIFT: appears followed by a double value indicating the lift values. Otherwise, the keyword #LEVERAGE: is used followed by a double value indicating the leverage. Then the keyword #PVALUE: appears followed by a double value indicating the P value. If the user sets print-closure = true, then the keyword #CLOSURE: appears followed by the corresponding closed itemset.
For example, we show below the output file for this example.
1 4 #SUP: 23 #LIFT: 1.6666666 #PVALUE: 1.3223990575406282E-6
1 3 #SUP: 69 #LIFT: 1.25 #PVALUE: 9.051081854861051E-12
2 5 #SUP: 92 #LIFT: 1.2499999 #PVALUE: 1.0993086787649736E-24
The output file here consists of three lines representing the three itemsets that have been found above.
Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.
Optional feature: giving names to items
Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextInverse.txt". Here we have modified the file to give names to the items:
@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
1 3
1 2 3 5
2 3
1 2 4 5
... (only the first lines are shown above.
In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines follows the SPMF format.
Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns:
apple milk #SUP: 23 #LIFT: 1.6666666 #PVALUE: 1.3223990575406282E-6
apple tomato #SUP: 69 #LIFT: 1.25 #PVALUE: 9.051081854861051E-12
orange bread #SUP: 92 #LIFT: 1.2499999 #PVALUE: 1.0993086787649736E-24
Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.
Performance
Opus-Miner is the only algorithm for mining self-sufficient itemsets offered in SPMF.
Where can I get more information about this algorithm?
The Opus-Miner algorithm is described in this p aper:
Webb, G. I., & Vreeken, J. (2014). Efficient discovery of the most interesting associations. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3), 15.
For a good overview of frequent itemset mining algorithms, you may read this survey paper.