Mining Frequent Itemsets from Uncertain Data with the UH-Mine Algorithm (SPMF documentation)

This example explains how to run the UH-Mine algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the UH-Mine algorithm, (2) select the input file contextUncertain.txt, (3) set the output file name (e.g. output.txt), (4) set the minimum expected support to 0.10 and (5) click Run algorithm.
If you want to execute this example from the command line, execute this command:
java -jar spmf.jar run UH-Mine contextUncertain.txt output.txt 10% in a folder containing spmf.jar and the example input file contextUncertain.txt.
If you are using the source code version of SPMF, launch the file MainTestUHMine_saveToFile.java in the package ca.pfv.spmf.tests.

What is UH-Mine?

UH-Mine is an algorithm for mining frequent itemsets from a transaction database where the data is uncertain (each item occurrence is annotated with an existential probability). UH-Mine is an extension of the H-Mine algorithm (Pei et al., 2007) to the uncertain data setting, as described by Aggarwal et al. (KDD 2009).

Like H-Mine, UH-Mine uses a hyper-linked array structure (H-Struct) to represent the database in memory. Unlike H-Mine, each cell in the H-Struct stores not only the item identifier but also its existential probability in the corresponding transaction. Mining proceeds depth-first by projecting the H-Struct onto conditional databases for each frequent item. Expected support is computed on-the-fly by scanning the sub-transaction before the projected position to recover the prefix probability, then multiplying by the current item probability, avoiding the need to store the prefix probability per transaction explicitly.

This algorithm can be applied whenever items in a dataset are uncertain, such as in medical diagnosis, sensor readings, or market basket data collected under noisy conditions.

What is the input?

UH-Mine takes as input an uncertain transaction database and a minimum expected support threshold (a value between 0 and 1, representing a fraction of the number of transactions).

In an uncertain transaction database each item in each transaction is annotated with an existential probability between 0 and 1. For example, consider the following database of 4 transactions and 5 items provided in the file contextUncertain.txt of the SPMF distribution:

	1	2	3	4	5
t1	0.5	0.4		0.3	0.7
t2		0.5	0.4		0.4
t3	0.6	0.5		0.1	0.5
t4	0.7	0.4	0.3		0.9

What is the output?

The output of UH-Mine is the complete set of frequent itemsets whose expected support meets or exceeds the minimum expected support threshold.

The expected support of an itemset in a transaction is the product of the existential probabilities of each item from the itemset in that transaction. For example, the expected support of itemset {1, 2} in transaction t1 is 0.5 x 0.4 = 0.20. The expected support of an itemset in the database is the sum of its expected support across all transactions where all of its items appear. An itemset is frequent when its expected support is greater than or equal to the minimum expected support.

For example, running UH-Mine on the above database with a minimum expected support of 0.10 produces 19 frequent itemsets, including:

itemset	expected support
{1}	1.8
{2}	1.8
{5}	2.5
{1 2}	0.78
{1 5}	1.28
{2 5}	1.09
{1 2 5}	0.542
{2 3}	0.32
{1 3}	0.21
{4 5}	0.26
...	...

Input file format

The input file format of UH-Mine is defined as follows. It is a plain text file. Each line represents one transaction. Each item in the transaction is written as itemID(probability), where the item ID is a positive integer and the probability is a decimal value strictly greater than 0 and at most 1. Items within a transaction are separated by single spaces and must be listed in ascending order of their IDs. Lines beginning with #, % or @ are treated as comments and ignored.

For example, the input file for the database above is:

# This binary context contains uncertain data.
# Each line represents a transaction.
# For each item there is an existential probability.
1(0.5) 2(0.4) 4(0.3) 5(0.7)
2(0.5) 3(0.4) 5(0.4)
1(0.6) 2(0.5) 4(0.1) 5(0.5)
1(0.7) 2(0.4) 3(0.3) 5(0.9)

The first line represents the transaction {1, 2, 4, 5} where items 1, 2, 4 and 5 have existential probabilities 0.5, 0.4, 0.3 and 0.7 respectively.

Output file format

The output file format of UH-Mine is defined as follows. It is a plain text file where each line represents one frequent itemset. On each line the item IDs are listed first as space-separated integers, followed by the keyword #SUP: and the expected support of the itemset as a decimal number. For example:

1 #SUP: 1.8
2 #SUP: 1.7999999999999998
3 #SUP: 0.7
4 #SUP: 0.4
5 #SUP: 2.5
1 2 #SUP: 0.78
1 3 #SUP: 0.21
1 4 #SUP: 0.21
1 5 #SUP: 1.2799999999999998
1 2 5 #SUP: 0.542
1 3 5 #SUP: 0.189
1 4 5 #SUP: 0.135
2 3 #SUP: 0.32
2 4 #SUP: 0.16999999999999998
2 5 #SUP: 1.09
2 3 5 #SUP: 0.188
2 4 5 #SUP: 0.10899999999999999
3 5 #SUP: 0.43000000000000005
4 5 #SUP: 0.26

For example, the last line indicates that the itemset {4, 5} has an expected support of 0.26.

Optional parameter: constraint on the size of itemsets

It is possible to limit the length of the discovered itemsets using an optional parameter in the user interface of SPMF:

Max pattern length (integer): the maximum number of items allowed in a frequent itemset. By default this parameter is not set (no limit).

When using the command line interface, add this parameter at the end of the command. For example:
java -jar spmf.jar run UH-Mine contextUncertain.txt output.txt 10% 2
runs UH-Mine and returns only frequent itemsets containing at most 2 items.

Performance

UH-Mine is generally more efficient than UApriori on uncertain databases, especially on dense datasets, because it uses a hyper-linked array structure that avoids candidate generation and performs only a single pass over the database to build the initial H-Struct. Mining then proceeds depth-first using projected conditional databases, which tend to be much smaller than the original database. As reported by Aggarwal et al. (KDD 2009), UH-Mine is the only algorithm among UApriori, UH-Mine and UFP-growth that performs robustly in terms of both running time and memory consumption across sparse and dense datasets.

Where can I get more information about the UH-Mine algorithm?

The UH-Mine algorithm is described in the following article:

Aggarwal, C. C., et al. (2009). Frequent Pattern Mining with Uncertain Data. In Proc. ACM KDD 2009, Paris, France, pp. 29-38.

For a good overview of frequent itemset mining algorithms, you may read this survey paper.