Mining Erasable Itemsets from a Product Database with the dMERIT+ algorithm (SPMF documentation)

This example explains how to run the dMERIT+ algorithm using the SPMF open-source data mining library.

How to run this example?

What is the dMERIT+ algorithm?

dMERIT+ (difference Mining Erasable itemseTs) was proposed by Le et al. (2013) as a memory-efficient improvement over MERIT+. While MERIT+ stores the full NC_Set for every erasable itemset (which includes a weight value in every node code), dMERIT+ introduces two key optimizations: the index of weight (W) and the difference NC_sets.

These optimizations make dMERIT+ especially effective on databases with many erasable itemsets, where MERIT+ would otherwise consume large amounts of memory storing redundant weight information in every NC_Set.

What is the input?

dMERIT+ takes as input a product database and a threshold (a value between 0 and 100%). A product is defined as a set of items used to assemble the product, annotated with a profit (a positive integer) that indicates how much money the company earns by selling it. For example, let's consider the following product database, consisting of 6 products and 7 items. This product database is provided in the file "contextVME.txt" of the SPMF distribution:


profit items
product 150${2, 3, 4, 6}
product 220${2, 5, 7}
product 350${1, 2, 3, 5}
product 4800${1, 2, 4}
product 530${6, 7}
product 650${3, 4}

What is the output?

The output is the complete set of erasable itemsets whose gain is lower than or equal to the user-specified threshold. The gain of an itemset A is defined as the sum of the profits of all products P such that P.Items ∩ A ≠ ∅. For example, the gain of itemset {5, 6} is the sum of profits of products containing 5 or 6: 50 $ + 20 $ + 50 $ + 30 $ = 150 $. The total profit of the database is 1000 $. An itemset is erasable if its gain is at most ξ × 1000.

By running dMERIT+ with a threshold of 15%, we obtain the same 8 erasable itemsets as VME, MEI, META, and MERIT+ (gain ≤ 15% × 1000 $ = 150 $):

erasable itemsets gain (loss of profit)
{3}150.0
{5}70.0
{6}80.0
{7}50.0
{5, 6}150.0
{5, 7}100.0
{6, 7}100.0
{5, 6, 7}150.0

This means that if the items from one of those erasable itemsets are no longer purchased, the loss of profit will be lower than or equal to 15%.

Input file format

The input file format of dMERIT+ is identical to that of VME, MEI, META, and MERIT+. It is a text file where each line represents a transaction (product). Each line is composed of two sections:

For example, the input file for the previous example is defined as follows:

50 2 3 4 6
20 2 5 7
50 1 2 3 5
800 1 2 4
30 6 7
50 3 4

Consider the first line. It means that the transaction {2, 3, 4, 6} has a profit of 50 and contains the items 2, 3, 4 and 6. The following lines follow the same format.

Output file format

The output file format of dMERIT+ is identical to that of VME, MEI, META, and MERIT+. It is a text file where each line represents an erasable itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer followed by a single space. After all the items, the keyword "#LOSS:" appears, followed by a numeric value indicating the gain (loss of profit) for that itemset.

3 #LOSS: 150.0
5 #LOSS: 70.0
6 #LOSS: 80.0
7 #LOSS: 50.0
5 6 #LOSS: 150.0
5 7 #LOSS: 100.0
6 7 #LOSS: 100.0
5 6 7 #LOSS: 150.0

For example, the first line indicates that the itemset {3} would generate a gain (loss of profit) of 150.0. The following lines follow the same format.

Performance

dMERIT+ is significantly more memory-efficient than MERIT+ because it stores only difference NC'_Sets and uses a single index-of-weight array W instead of embedding weight values in every node code.

Where can I get more information about the dMERIT+ algorithm?

Here is the article describing the dMERIT+ algorithm:

T. Le, B. Vo, F. Coenen: An Efficient Algorithm for Mining Erasable Itemsets using the Difference of NC-Sets. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Manchester, UK, 2013, pp. 2270–2274.

For a good overview of frequent itemset mining algorithms, you may read this survey paper.