Mining Top-k Frequent Itemsets from Uncertain Data with the TUFP Algorithm (SPMF documentation)
This example explains how to run the TUFP algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "TUFP" algorithm, (2) select the input file "contextUncertain.txt", (3) set the output file name (e.g. "output.txt") (4) set the value of k to 10 and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run TUFP contextUncertain.txt output.txt 10 in a folder containing spmf.jar and the example input file contextUncertain.txt. - If you are using the source code version of SPMF, launch the file "MainTestTUFP.java" in the package ca.pfv.spmf.test.
What is TUFP?
TUFP (Top-k Uncertain Frequent Patterns) is an algorithm for mining the top-k most frequent itemsets from a transaction database where the data is uncertain (contains probabilities). The TUFP algorithm was proposed by Le, et al. (2020). Unlike traditional frequent itemset mining algorithms that require the user to specify a minimum support threshold, TUFP directly discovers the k itemsets with the highest expected support without requiring this parameter. TUFP uses CUP-Lists (Conditional Uncertain Probability-Lists), a vertical data structure, combined with effective threshold raising strategies to efficiently mine top-k patterns.
This algorithm can have multiple applications such as in mining medical data or sensor data where observations may be uncertain, and where the user wants to focus on the most promising patterns.
What is the input ?
TUFP takes as input a transaction database containing probabilities and a parameter k (a positive integer). A transaction database is a set of transactions where each transaction is a set of items. In TUFP, we assume that each item in a transaction is annotated with an existential probability. For example, let's consider the following transaction database, consisting of 4 transactions (t1, t2, t3, t4) and 5 items (1, 2, 3, 4, 5). The transaction t1 contains item 1 with a probability of 0.5, item 2 with a probability of 0.4, item 4 with a probability of 0.3 and item 5 with a probability of 0.7. This database is provided in the file "contextUncertain.txt" of the SPMF distribution:
| 1 | 2 | 3 | 4 | 5 | |
| t1 | 0.5 | 0.4 | 0.3 | 0.7 | |
| t2 | 0.5 | 0.4 | 0.4 | ||
| t3 | 0.6 | 0.5 | 0.1 | 0.5 | |
| t4 | 0.7 | 0.4 | 0.3 | 0.9 |
What is the output?
The output of TUFP is a set of exactly k itemsets (or more if there are ties at the k-th position) with the highest expected support values. Note that the definition of a frequent itemset and expected support are the same as in other uncertain frequent itemset mining algorithms like UV-Eclat and UApriori.
The expected support of an itemset in a transaction is defined as the product of the existential probability of each item from the itemset in this transaction. It is a value between 0 and 1. For example, the support of itemset {1, 2} in transaction t1 is 0.5 × 0.4 = 0.2. The expected support of an itemset in a transaction database is the sum of its support in all transactions where it occurs. For example, the expected support of itemset {2, 3} is the sum of its expected support in t2 and t4 : 0.5 × 0.4 + 0.4 × 0.3 = 0.32.
By running TUFP with k = 10 on the example database, we obtain the top-10 itemsets with the highest expected support:
| Rank | Itemset | Expected Support |
| 1 | {5} | 2.5 |
| 2 | {1} | 1.8 |
| 3 | {2} | 1.8 |
| 4 | {1, 5} | 1.28 |
| 5 | {2, 5} | 1.09 |
| 6 | {1, 2} | 0.78 |
| 7 | {3} | 0.7 |
| 8 | {1, 2, 5} | 0.542 |
| 9 | {3, 5} | 0.43 |
| 10 | {4} | 0.4 |
Input file format
The input file format of TUFP is a text file. An item is represented by a positive integer. Each item is associated with a probability indicated as a double value between parenthesis. A transaction is a line in the text file. In each line (transaction), each item is immediately followed by its probability between parenthesis and a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line. Probabilities should be greater than 0 and not more than 1.
For example, for the previous example, the input file is defined
as follows:
1(0.5) 2(0.4) 4(0.3) 5(0.7)
2(0.5) 3(0.4) 5(0.4)
1(0.6) 2(0.5) 4(0.1) 5(0.5)
1(0.7) 2(0.4) 3(0.3) 5(0.9)
The first line represents the itemset {1, 2, 4, 5} where items 1, 2, 4 and 5 respectively have the probabilities 0.5, 0.4, 0.3 and 0.7.
Output file format
The output file format of TUFP is defined as follows. It is a text file, where each line represents a top-k itemset. On each line, the items of the itemset are first listed in ascending order. Each item is represented by an integer followed by a single space. After all the items, the keyword "#SUP:" appears, which is followed by a double value indicating the expected support of the itemset. The output is sorted by descending expected support.
For example, below is the output file for this example with k = 10:
5 #SUP: 2.5
1 #SUP: 1.8
2 #SUP: 1.7999999999999998
1 5 #SUP: 1.2799999999999998
2 5 #SUP: 1.09
1 2 #SUP: 0.78
3 #SUP: 0.7
1 2 5 #SUP: 0.542
3 5 #SUP: 0.43000000000000005
4 #SUP: 0.4
For example, the first line indicates that the itemset {5} has the highest expected support of 2.5, while the last line indicates that the itemset {4} is the 10th most frequent itemset with an expected support of 0.4.
Parameter(s)
TUFP has the following required parameter:
- k (integer) : This parameter specifies the number of top-k itemsets to discover. It must be a positive integer. If there are ties at the k-th position (multiple itemsets with the same expected support as the k-th itemset), all tied patterns will be included in the output.
If you are using the command line interface of SPMF, the k parameter should be provided as the last argument. For example:
java -jar spmf.jar run TUFP contextUncertain.txt
output.txt 10
means to run the above example to find the top-10 frequent itemsets.
Performance
TUFP is generally efficient for discovering the most important patterns in uncertain databases because it uses effective threshold raising strategies that dynamically increase the minimum support threshold as better patterns are found, enabling aggressive pruning of the search space.
Where can I get more information about the TUFP algorithm?
Here is the article describing the TUFP algorithm:
Tuong Le, Bay Vo, Van-Nam Huynh, Ngoc Thanh Nguyen, Sung Wook Baik. Mining top-k frequent patterns from uncertain databases. Applied Intelligence, Springer Nature, 2020. https://doi.org/10.1007/s10489-019-01622-1
For a good overview of frequent itemset mining algorithms, you may read this survey paper.