Mining Frequent Itemsets from Uncertain Data with the UV-Eclat Algorithm (SPMF documentation)

This example explains how to run the UV-Eclat algorithm using the SPMF open-source data mining library.

How to run this example?

What is UV-Eclat?

UV-Eclat is an algorithm for mining frequent itemsets from a transaction database where the data is uncertain (contains probabilities). The UV-Eclat algorithm was proposed by Leung and Sun (2011). It is based on the Eclat algorithm but adapted for uncertain data using vertical database format and equivalence class transformation.

This algorithm can have multiple applications such as in mining medical data or sensor data where observations may be uncertain.

What is the input ?

UV-Eclat takes as input a transaction database containing probabilities and a minimum expected support threshold (a value between 0 and 1). A transaction database is a set of transactions where each transaction is a set of items. In UV-Eclat, we assume that each item in a transaction is annotated with an existential probability. For example, let's consider the following transaction database, consisting of 4 transactions (t1,t2...t4) and 5 items (1,2,3,4,5). The transaction t1 contains item 1 with a probability of 0.5, item 2 with a probability of 0.4, item 4 with a probability of 0.3 and item 5 with a probability of 0.7. This database is provided in the file "contextUncertain.txt" of the SPMF distribution:


1 2 3 4 5
t1 0.5 0.4
0.3 0.7
t2
0.5 0.4
0.4
t3 0.6 0.5
0.1 0.5
t4 0.7 0.4 0.3
0.9

What is the output?

The output of UV-Eclat is the set of frequent itemsets. Note that the definition of a frequent itemset is here different from the definition used by the regular Eclat algorithm because we have to consider the existential probabilities.

The expected support of an itemset in a transaction is defined as the product of the existential probability of each item from the itemset in this transaction. It is a value between 0 and 1. For example, the support of itemset {1, 2} in transaction t1 is 0.5 x 0.4 = 0.2. The expected support of an itemset in a transaction database is the sum of its support in all transactions where it occurs. For example, the expected support of itemset {2, 3} is the sum of its expected support in t2 and t4 : 0.5 x 0.4 + 0.4 x 0.3 = 0.32. A frequent itemset is an itemset that has an expected support higher or equal to the minimum expected support set by the user. For example, by running UV-Eclat with a minimum expected support of 0.10, we obtain 19 frequent itemsets, including:

itemsets expected support
{2 3 5} 0.188
{1 3 5} 0.189
{1 4 5} 0.135
{2 4 5} 0.109
{1 2 5} 0.542
{1 5} 1.28
{1 3} 0.21
{1 4} 0.21
{2 3} 0.32
{1 2} 0.78
... ...

Input file format

The input file format of UV-Eclat is defined as follows. It is a text file. An item is represented by a positive integer. Each item is associated with a probability indicated as a double value between parenthesis. A transaction is a line in the text file. In each line (transaction), each item is immediately followed by its probability between parenthesis and a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line. Probabilities should be greater than 0 and not more than 1.

For example, for the previous example, the input file is defined as follows:

# This binary context contains uncertain data.
# Each line represents a transaction.
# For each item there is an existential probability.
1(0.5) 2(0.4) 4(0.3) 5(0.7)
2(0.5) 3(0.4) 5(0.4)
1(0.6) 2(0.5) 4(0.1) 5(0.5)
1(0.7) 2(0.4) 3(0.3) 5(0.9)

The first line represents the itemset {1, 2, 4, 5} where items 1, 2, 4 and 5 respectively have the probabilities 0.5, 0.4, 0.3 and 0.7.

Output file format

The output file format of UV-Eclat is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer followed by a single space. After all the items, the keyword "#SUP:" appears, which is followed by a double value indicating the expected support of the itemset. For example, we show below the output file for this example.

1 #SUP: 1.8
2 #SUP: 1.7999999999999998
3 #SUP: 0.7
4 #SUP: 0.4
5 #SUP: 2.5
1 2 #SUP: 0.78
1 3 #SUP: 0.21
1 4 #SUP: 0.21
1 5 #SUP: 1.2799999999999998
1 2 5 #SUP: 0.542
1 3 5 #SUP: 0.189
1 4 5 #SUP: 0.135
2 3 #SUP: 0.32
2 4 #SUP: 0.16999999999999998
2 5 #SUP: 1.09
2 3 5 #SUP: 0.188
2 4 5 #SUP: 0.10899999999999999
3 5 #SUP: 0.43000000000000005
4 5 #SUP: 0.26

For example, the last line indicates that the itemset {4, 5} has an expected support of 0.26.

Optional parameter(s): constraints on the size of itemsets

Sometimes, there may be just too many itemsets, and itemsets containing many items may not be interesting. Thus, it is also possible to specify an optional parameter in the user interface of SPMF:

If you are using the command line interface of SPMF, it is also possible to use this optional parameter by adding it at the end of the command. For example:
java -jar spmf.jar run UV-Eclat contextUncertain.txt output.txt 10% 2
means to run the above example to find only frequent itemsets having 2 items or less.

Performance

UV-Eclat is generally more efficient than UApriori for uncertain itemset mining because it uses a vertical database format and avoids multiple database scans. It employs equivalence class transformation and tidlist intersection to efficiently discover frequent itemsets in a depth-first manner.

Where can I get more information about the UV-Eclat algorithm?

Here is the article describing the UV-Eclat algorithm:

C. K.-S. Leung, L. Sun: Equivalence Class Transformation Based Mining of Frequent Itemsets from Uncertain Data. In Proc. ACM SAC 2011, pp. 983-984

For a good overview of frequent itemset mining algorithms, you may read this survey paper.