Mining Erasable Itemsets from a Product Database with the MERIT+ algorithm (SPMF documentation)

This example explains how to run the MERIT+ algorithm using the SPMF open-source data mining library.

How to run this example?

What is the MERIT+ algorithm?

MERIT+ is a corrected version of the MERIT algorithm (Mining Erasable itemseTs) . The original MERIT algorithm had bugs that prevented it from finding all erasable itemsets. MERIT+ fixes both of these bugs while preserving MERIT's core innovations: the WPPC-tree (Weighted Prefix-Path Compressed tree) and the NC_Set (Node Code Set) structure.

What is the input?

MERIT+ takes as input a product database and a threshold (a value between 0 and 100%). A product is defined as a set of items used to assemble the product, annotated with a profit (a positive integer) that indicates how much money the company earns by selling it. For example, let's consider the following product database, consisting of 6 products and 7 items. This product database is provided in the file "contextVME.txt" of the SPMF distribution:


profit items
product 150${2, 3, 4, 6}
product 220${2, 5, 7}
product 350${1, 2, 3, 5}
product 4800${1, 2, 4}
product 530${6, 7}
product 650${3, 4}

What is the output?

The output is the complete set of erasable itemsets whose gain is lower than or equal to the user-specified threshold. The gain of an itemset A is defined as the sum of the profits of all products P such that P.Items ∩ A ≠ ∅. For example, the gain of itemset {5, 6} is the sum of profits of products containing 5 or 6: 50 $ + 20 $ + 50 $ + 30 $ = 150 $. The total profit of the database is 1000 $. An itemset is erasable if its gain is at most ξ × 1000.

By running MERIT+ with a threshold of 15%, we obtain the same 8 erasable itemsets as VME, MEI, and META (gain ≤ 15% × 1000 $ = 150 $):

erasable itemsets gain (loss of profit)
{3}150.0
{5}70.0
{6}80.0
{7}50.0
{5, 6}150.0
{5, 7}100.0
{6, 7}100.0
{5, 6, 7}150.0

This means that if the items from one of those erasable itemsets are no longer purchased, the loss of profit will be lower than or equal to 15%.

Input file format

The input file format of MERIT+ is identical to that of VME, MEI, and META. It is a text file where each line represents a transaction (product). Each line is composed of two sections:

For example, the input file for the previous example is defined as follows:

50 2 3 4 6
20 2 5 7
50 1 2 3 5
800 1 2 4
30 6 7
50 3 4

Consider the first line. It means that the transaction {2, 3, 4, 6} has a profit of 50 and contains the items 2, 3, 4 and 6. The following lines follow the same format.

Output file format

The output file format of MERIT+ is identical to that of VME, MEI, and META. It is a text file where each line represents an erasable itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer followed by a single space. After all the items, the keyword "#LOSS:" appears, followed by a numeric value indicating the gain (loss of profit) for that itemset.

3 #LOSS: 150.0
5 #LOSS: 70.0
6 #LOSS: 80.0
7 #LOSS: 50.0
5 6 #LOSS: 150.0
5 7 #LOSS: 100.0
6 7 #LOSS: 100.0
5 6 7 #LOSS: 150.0

For example, the first line indicates that the itemset {3} would generate a gain (loss of profit) of 150.0. The following lines follow the same format.

Performance

MERIT+ uses a WPPC-tree and NC_Set structure to avoid rescanning the database at every level. However, MERIT+ stores the full NC_Set (including all node weights) for every erasable itemset, which can consume significant memory when the number of erasable itemsets is large. For a more memory-efficient variant, see the dMERIT+ algorithm, which uses difference NC'_Sets and an index-of-weight array to reduce memory usage.

Where can I get more information about the MERIT+ algorithm?

MERIT+ is a corrected implementation of the MERIT algorithm originally described in the following paper:

Z. Deng, X. Xu: Fast Mining Erasable Itemsets using NC_Sets. Expert Systems with Applications, 2012.

The bugs in MERIT and the corrected MERIT+ approach are discussed in:

T. Le, B. Vo, F. Coenen: An Efficient Algorithm for Mining Erasable Itemsets using the Difference of NC-Sets. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Manchester, UK, 2013, pp. 2270–2274.

For a good overview of frequent itemset mining algorithms, you may read this survey paper.