Mining Frequent Itemsets from Uncertain Data with the UApriori Algorithm (SPMF documentation)
This example explains how to run the UApriori algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "UApriori" algorithm, (2) select the input file "contextUncertain.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum expected support to 0.10 and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run UApriori contextUncertain.txt output.txt 10% in a folder containing spmf.jar and the example input file contextUncertain.txt. - If you are using the source code version of SPMF, launch the file "MainTestUApriori.java" in the package ca.pfv.SPMF.tests.
What is UApriori?
UApriori is an algorithm for mining frequent itemsets from a transaction database where the data is uncertain (contains probabilities). The UApriori algorithm was proposed by Chui et al. (2007).
This algorithm can have multiple applications such as in mining medical data or sensor data where observations may be uncertain.
What is the input ?
UApriori takes as input a transaction database containing probabilities and a minimum expected support threshold (a value between 0 and 1). A transaction database is a set of transactions where each transaction is a set of items. In UApriori, we assume that each item in a transaction is annotated with an existential probability. For example, let's consider the following transaction database, consisting of 4 transactions (t1,t2...t5) and 5 items (1,2,3,4,5). The transaction t1 contains item 1 with a probability of 0.5, item 2 with a probability of 0.4, item 4 with a probability of 0.3 and item 5 with a probability of 0.7. This database is provided in the file "contextUncertain.txt" of the SPMF distribution:
1 | 2 | 3 | 4 | 5 | |
t1 | 0.5 | 0.4 | 0.3 | 0.7 | |
t2 | 0.5 | 0.4 | 0.4 | ||
t3 | 0.6 | 0.5 | 0.1 | 0.5 | |
t4 | 0.7 | 0.4 | 0.3 | 0.9 |
What is the output?
The output of U-Apriori is the set of frequent itemsets. Note that the definition of a frequent itemset is here different from the definition used by the regular Apriori algorithm because we have to consider the existential probabilities.
The expected support of an itemset in a transaction is defined as the product of the existential probability of each item from the itemset in this transaction. It is a value between 0 and 1. For example, the support of itemset {1, 2} in transaction t1 is 0.5 x 0.4 = 0.2. The expected support of an itemset in a transaction database is the sum of its support in all transactions where it occurs. For example, the expected support of itemset {2, 3} is the sum of its expected support in t2 and t4 : 0.5 x 0.4 + 0.4 x 0.3 = 0.32. A frequent itemset is an itemset that has an expected support higher or equal to the minimum expected support set by the user. For example, by running U-Apriori with a minimum expected support of 0.10, we obtain 19 frequent itemsets, including:
itemsets | expected support |
{2 3 5} | 0.19 |
{1 3 5} | 0.19 |
{1 4 5} | 0.14 |
{2 4 5} | 0.11 |
{1 2 5} | 0.54 |
{1 5} | 1.28 |
{1 3} | 0.21 |
{1 4} | 0.21 |
{2 3} | 0.32 |
{1 2} | 0.78 |
... | ... |
Input file format
The input file format of UApriori is defined as follows. It is a text file. An item is represented by a positive integer. Each item is associated with a probability indicated as a double value between parenthesis. A transaction is a line in the text file. In each line (transaction), each item is immediately followed by its probability between parenthesis and a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line. Probabilities should be greater than 0 and not more than 1.
For example, for the previous example, the input file is defined as follows:
# This binary context contains uncertain data.
# Each line represents a transaction.
# For each item there is an existential probability.
1(0.5) 2(0.4) 4(0.3) 5(0.7)
2(0.5) 3(0.4) 5(0.4)
1(0.6) 2(0.5) 4(0.1) 5(0.5)
1(0.7) 2(0.4) 3(0.3) 5(0.9)
The first line represents the itemsets {1, 2, 4, 5} where items 1, 2, 4 and 5 respectively have the probabilities 0.5, 0.4, 0.3 and 0.7.
Output file format
The output file format of UApriori is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it expected support between parenthesis, followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by a double value indicating the expected support of the itemset. For example, we show below the output file for this example.
2 (0.4) Support: 1.7999999999999998
3 (0.4) Support: 0.7
4 (0.3) Support: 0.4
5 (0.7) Support: 2.5
1 (0.5) Support: 1.8
1 (0.5) 2 (0.4) Support: 0.78
1 (0.5) 3 (0.4) Support: 0.21
2 (0.4) 5 (0.7) Support: 1.09
2 (0.4) 4 (0.3) Support: 0.16999999999999998
1 (0.5) 5 (0.7) Support: 1.2799999999999998
3 (0.4) 5 (0.7) Support: 0.43000000000000005
2 (0.4) 3 (0.4) Support: 0.32
1 (0.5) 4 (0.3) Support: 0.21
4 (0.3) 5 (0.7) Support: 0.26
1 (0.5) 3 (0.4) 5 (0.7) Support: 0.189
2 (0.4) 3 (0.4) 5 (0.7) Support: 0.188
1 (0.5) 2 (0.4) 5 (0.7) Support: 0.542
1 (0.5) 4 (0.3) 5 (0.7) Support: 0.135
2 (0.4) 4 (0.3) 5 (0.7) Support: 0.10899999999999999
For example, the last line indicates that the itemset {2, 4, 5} has an expected support of 0.1089999 and that items in this itemset have an existential support of 0.4, 0.3 and 0.7 with respect to this itemset, respectively.
Optional parameter(s): constraints on the size of itemsets
Sometimes, there may be just too many itemsets, and itemsets containing many items may not be interesting. Thus, it is also possible to specify an optional parameter in the user interface of SPMF:
- Max pattern length (integer) : This parameter allows to set a maximum number of items to appear on the an itemset. By default, this parameter is equal to the infinity if it is not set.
If you are using the command line interface of SPMF, it is also possible to use this optional parameter by adding it at the end of the command. For example:
java -jar spmf.jar run UApriori contextUncertain.txt
output.txt 10% 2
means to run the above example to find only frequent itemsets having 2 items or less.
Performance
UApriori is not the most efficient algorithm for uncertain itemset mining but it is simple and it is the first algorithm designed for this task.
Where can I get more information about the UApriori algorithm?
Here is an article describing the UApriori algorithm:
C. Kit Chui, B. Kao, E. Hung: Mining Frequent Itemsets from Uncertain Data. PAKDD 2007: 47-58
For a good overview of frequent itemset mining algorithms, you may read this survey paper.