Mining Erasable Itemsets from a Product Database with the META algorithm (SPMF documentation)

This example explains how to run the META algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "META" algorithm, (2) select the input file "contextVME.txt", (3) set the output file name (e.g. "output.txt"), (4) set the threshold to 15%, and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run META contextVME.txt output.txt 15% in a folder containing spmf.jar and the example input file contextVME.txt.
If you are using the source code version of SPMF, launch the file "MainTestMETA.java" in the package ca.pfv.spmf.algorithms.frequentpatterns.meta.

What is the META algorithm?

META (Mining Erasable iTemsets with the Anti-monotone property) was proposed by Deng et al. (2009) for mining erasable itemsets from a product database with profit information. Unlike VME and MEI, which use a vertical tid-list representation to compute gains without rescanning the database, META uses a horizontal representation and rescans the database once per level to compute the gain of each candidate. META explicitly exploits the anti-monotone property of erasable itemsets (Property 2 in the paper): if an itemset is not erasable, all of its supersets are also not erasable. Consequently, during candidate generation, META checks that all (k−1)-subsets of a candidate k-itemset are erasable before admitting it as a candidate (the No_Unerasable_Subset check). This subset-checking step is the main structural difference between META on one side and VME/MEI on the other.

What is the input?

META takes as input a product database and a threshold (a value between 0 and 100%). A product is defined as a set of items used to assemble the product, annotated with a profit (a positive integer) that indicates how much money the company earns by selling it. For example, let's consider the following product database, consisting of 6 products and 7 items. This product database is provided in the file "contextVME.txt" of the SPMF distribution:

	profit	items
product 1	50$	{2, 3, 4, 6}
product 2	20$	{2, 5, 7}
product 3	50$	{1, 2, 3, 5}
product 4	800$	{1, 2, 4}
product 5	30$	{6, 7}
product 6	50$	{3, 4}

What is the output?

The output is the complete set of erasable itemsets whose gain is lower than or equal to the user-specified threshold. In the META paper, the term gain is used instead of loss to denote the sum of profits of all products that share at least one item with the candidate itemset. Both terms refer to the same quantity.

Formally, the gain of an itemset A is defined as the sum of the profits of all products P such that P.Items ∩ A ≠ ∅. For example, the gain of itemset {5, 6} is the sum of profits of products containing 5 or 6: 50 $ + 20 $ + 50 $ + 30 $ = 150 $. The total profit (Sum_val) of the database is 1000 $. An itemset is erasable if its gain is at most ξ × Sum_val.

By running META with a threshold of 15%, we obtain the same 8 erasable itemsets as VME and MEI (gain ≤ 15% × 1000 $ = 150 $):

erasable itemsets	gain (loss of profit)
{3}	150.0
{5}	70.0
{6}	80.0
{7}	50.0
{5, 6}	150.0
{5, 7}	100.0
{6, 7}	100.0
{5, 6, 7}	150.0

This means that if the items from one of those erasable itemsets are no longer purchased, the loss of profit will be lower than or equal to 15%.

Input file format

The input file format of META is identical to that of VME and MEI. It is a text file where each line represents a transaction (product). Each line is composed of two sections:

First, the profit of the transaction is indicated by an integer number, followed by a single space.
Second, the items in the transaction are listed. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction.

For example, the input file for the previous example is defined as follows:

50 2 3 4 6
20 2 5 7
50 1 2 3 5
800 1 2 4
30 6 7
50 3 4

Consider the first line. It means that the transaction {2, 3, 4, 6} has a profit of 50 and contains the items 2, 3, 4 and 6. The following lines follow the same format.

Output file format

The output file format of META is similar to that of VME and MEI. It is a text file where each line represents an erasable itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer followed by a single space. After all the items, the keyword "#LOSS:" appears, followed by a numeric value indicating the gain (loss of profit) for that itemset.

3 #LOSS: 150.0
5 #LOSS: 70.0
6 #LOSS: 80.0
7 #LOSS: 50.0
5 6 #LOSS: 150.0
5 7 #LOSS: 100.0
6 7 #LOSS: 100.0
5 6 7 #LOSS: 150.0

For example, the first line indicates that the itemset {3} would generate a gain (loss of profit) of 150.0. The following lines follow the same format.

Performance

META exploits the anti-monotone property of erasable itemsets to prune the search space. The No_Unerasable_Subset check ensures that no candidate with an unerasable subset is ever evaluated, which can substantially reduce the number of database scans needed at higher levels. However, because META rescans the entire database horizontally at each level, it may be slower than VME or MEI on very large databases where the vertical representation provides a greater advantage. META is most competitive when the number of erasable itemsets is small and the subset-pruning step eliminates many candidates early.

Where can I get more information about the META algorithm?

Here is the article describing the META algorithm:

Z. H. Deng, G. Fang, Z. Wang, X. Xu: Mining Erasable Itemsets. Proceedings of the 8th IEEE International Conference on Machine Learning and Cybernetics, Baoding, Hebei, China, 2009, pp. 67–73.

For a good overview of frequent itemset mining algorithms, you may read this survey paper.