Mining Erasable Itemsets from a Product Database with the dMERIT+ algorithm (SPMF documentation)
This example explains how to run the dMERIT+ algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "dMERIT+" algorithm, (2) select the input file "contextVME.txt", (3) set the output file name (e.g. "output.txt"), (4) set the threshold to 15%, and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run dMERIT+ contextVME.txt output.txt 15% in a folder containing spmf.jar and the example input file contextVME.txt. - If you are using the source code version of SPMF, launch the file "MainTestDMERIT.java" in the package ca.pfv.spmf.algorithms.frequentpatterns.dmerit.
What is the dMERIT+ algorithm?
dMERIT+ (difference Mining Erasable itemseTs) was proposed by Le et al. (2013) as a memory-efficient improvement over MERIT+. While MERIT+ stores the full NC_Set for every erasable itemset (which includes a weight value in every node code), dMERIT+ introduces two key optimizations: the index of weight (W) and the difference NC_sets.
These optimizations make dMERIT+ especially effective on databases with many erasable itemsets, where MERIT+ would otherwise consume large amounts of memory storing redundant weight information in every NC_Set.
What is the input?
dMERIT+ takes as input a product database and a threshold (a value between 0 and 100%). A product is defined as a set of items used to assemble the product, annotated with a profit (a positive integer) that indicates how much money the company earns by selling it. For example, let's consider the following product database, consisting of 6 products and 7 items. This product database is provided in the file "contextVME.txt" of the SPMF distribution:
| profit | items | |
| product 1 | 50$ | {2, 3, 4, 6} |
| product 2 | 20$ | {2, 5, 7} |
| product 3 | 50$ | {1, 2, 3, 5} |
| product 4 | 800$ | {1, 2, 4} |
| product 5 | 30$ | {6, 7} |
| product 6 | 50$ | {3, 4} |
What is the output?
The output is the complete set of erasable itemsets whose gain is lower than or equal to the user-specified threshold. The gain of an itemset A is defined as the sum of the profits of all products P such that P.Items ∩ A ≠ ∅. For example, the gain of itemset {5, 6} is the sum of profits of products containing 5 or 6: 50 $ + 20 $ + 50 $ + 30 $ = 150 $. The total profit of the database is 1000 $. An itemset is erasable if its gain is at most ξ × 1000.
By running dMERIT+ with a threshold of 15%, we obtain the same 8 erasable itemsets as VME, MEI, META, and MERIT+ (gain ≤ 15% × 1000 $ = 150 $):
| erasable itemsets | gain (loss of profit) |
| {3} | 150.0 |
| {5} | 70.0 |
| {6} | 80.0 |
| {7} | 50.0 |
| {5, 6} | 150.0 |
| {5, 7} | 100.0 |
| {6, 7} | 100.0 |
| {5, 6, 7} | 150.0 |
This means that if the items from one of those erasable itemsets are no longer purchased, the loss of profit will be lower than or equal to 15%.
Input file format
The input file format of dMERIT+ is identical to that of VME, MEI, META, and MERIT+. It is a text file where each line represents a transaction (product). Each line is composed of two sections:
- First, the profit of the transaction is indicated by an integer number, followed by a single space.
- Second, the items in the transaction are listed. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction.
For example, the input file for the previous example is defined as follows:
50 2 3 4 6
20 2 5 7
50 1 2 3 5
800 1 2 4
30 6 7
50 3 4
Consider the first line. It means that the transaction {2, 3, 4, 6} has a profit of 50 and contains the items 2, 3, 4 and 6. The following lines follow the same format.
Output file format
The output file format of dMERIT+ is identical to that of VME, MEI, META, and MERIT+. It is a text file where each line represents an erasable itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer followed by a single space. After all the items, the keyword "#LOSS:" appears, followed by a numeric value indicating the gain (loss of profit) for that itemset.
3 #LOSS: 150.0
5 #LOSS: 70.0
6 #LOSS: 80.0
7 #LOSS: 50.0
5 6 #LOSS: 150.0
5 7 #LOSS: 100.0
6 7 #LOSS: 100.0
5 6 7 #LOSS: 150.0
For example, the first line indicates that the itemset {3} would generate a gain (loss of profit) of 150.0. The following lines follow the same format.
Performance
dMERIT+ is significantly more memory-efficient than MERIT+ because it stores only difference NC'_Sets and uses a single index-of-weight array W instead of embedding weight values in every node code.
Where can I get more information about the dMERIT+ algorithm?
Here is the article describing the dMERIT+ algorithm:
T. Le, B. Vo, F. Coenen: An Efficient Algorithm for Mining Erasable Itemsets using the Difference of NC-Sets. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Manchester, UK, 2013, pp. 2270–2274.
For a good overview of frequent itemset mining algorithms, you may read this survey paper.