Calculate Statistics for an Uncertain Transaction Database (SPMF documentation)

This example explains how to calculate statistics about an Uncertain Transaction Database using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Calculate_stats_for_an_uncertain_transaction_database" algorithm, (2) choose the input file contextUncertain.txt, and (3) click "Run algorithm".
If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the input file contextUncertain.txt:
java -jar spmf.jar run Calculate_stats_for_an_uncertain_transaction_database contextUncertain.txt no_output_file
If you are using the source code version of SPMF, launch the file "MainTestStatsUncertainTransactionDB.java" in the package ca.pfv.SPMF.tests.

What is this tool?

This tool generates statistics about an uncertain transaction database, the type of data taken as input by algorithms such as UApriori. It can be used, for example, to find out whether a database is dense or sparse before applying a data mining algorithm.

What is the input?

This tool takes as input an uncertain transaction database containing probabilities. A transaction database is a set of transactions where each transaction is a set of items. In an uncertain transaction database, each item in a transaction is annotated with an existential probability. For example, consider the following transaction database, consisting of 4 transactions (t1, t2, t3, t4) and 5 items (called 1, 2, 3, 4, 5). The transaction t1 contains item 1 with a probability of 0.5, item 2 with a probability of 0.4, item 4 with a probability of 0.3 and item 5 with a probability of 0.7. This database is provided in the file "contextUncertain.txt" of the SPMF distribution:

	1	2	3	4	5
t1	0.5	0.4		0.3	0.7
t2		0.5	0.4		0.4
t3	0.6	0.5		0.1	0.5
t4	0.7	0.4	0.3		0.9

What is the output?

The output is a set of statistics about the uncertain transaction database. For example, running the tool on the example database above gives the following statistics:

Number of transactions : 4
File contextUncertain.txt
Number of distinct items: 5
Smallest item id: 1
Largest item id: 5
Average number of items per transaction: 3.75 standard deviation: 0.4330127018922193 variance: 0.18749999999999997
Average item support in the database: 3.0 standard deviation: 0.8944271909999159 variance: 0.7999999999999999 min value: 2 max value: 4
Average expected support probability: 0.48000000000000004 standard deviation: 0.18690461025168248 variance: 0.03493333333333333 min value: 0.1 max value: 0.9
Database density: 75.0 %

Note: the database density is calculated as the average transaction length divided by the number of distinct items. For the example database, this is 3.75 / 5 = 0.75, i.e. 75 %.
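
The statistics above can be re-derived by hand from the example database. The following Python sketch is not SPMF's implementation: it hard-codes the four example transactions as dictionaries (a representation chosen here purely for illustration), and it assumes that the reported variances are population variances, which is consistent with the values printed above. The function name `describe` and the result keys are hypothetical.

```python
from statistics import mean, pvariance

# Example database: one dict per transaction, mapping item id -> existential probability.
database = [
    {1: 0.5, 2: 0.4, 4: 0.3, 5: 0.7},  # t1
    {2: 0.5, 3: 0.4, 5: 0.4},          # t2
    {1: 0.6, 2: 0.5, 4: 0.1, 5: 0.5},  # t3
    {1: 0.7, 2: 0.4, 3: 0.3, 5: 0.9},  # t4
]


def describe(db):
    """Compute summary statistics (hypothetical helper, not part of SPMF)."""
    lengths = [len(t) for t in db]
    items = sorted({item for t in db for item in t})
    # Support of an item = number of transactions that contain it.
    supports = [sum(1 for t in db if item in t) for item in items]
    probabilities = [p for t in db for p in t.values()]
    avg_length = mean(lengths)
    return {
        "transactions": len(db),
        "distinct_items": len(items),
        "smallest_item": items[0],
        "largest_item": items[-1],
        "avg_length": avg_length,
        "length_variance": pvariance(lengths),  # population variance (assumption)
        "avg_item_support": mean(supports),
        "item_support_variance": pvariance(supports),
        "avg_probability": mean(probabilities),
        "probability_variance": pvariance(probabilities),
        # Density = average transaction length / number of distinct items.
        "density_percent": 100.0 * avg_length / len(items),
    }


stats = describe(database)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Floating-point results may differ from the SPMF output in the last few digits, since the order of the arithmetic operations is not necessarily the same.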

Input file format

The input file format is defined as follows. It is a text file. An item is represented by a positive integer. Each item is associated with a probability, written as a double value in parentheses. A transaction is a line in the text file. In each line (transaction), each item is immediately followed by its probability in parentheses, and consecutive entries are separated by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item appears twice within the same line. Probabilities should be greater than 0 and not more than 1.

For the example database above, the input file is defined as follows:

# This binary context contains uncertain data.
# Each line represents a transaction.
# For each item there is an existential probability.
1(0.5) 2(0.4) 4(0.3) 5(0.7)
2(0.5) 3(0.4) 5(0.4)
1(0.6) 2(0.5) 4(0.1) 5(0.5)
1(0.7) 2(0.4) 3(0.3) 5(0.9)

The first transaction line (the line following the three lines that start with "#") represents the itemset {1, 2, 4, 5}, where items 1, 2, 4 and 5 respectively have the probabilities 0.5, 0.4, 0.3 and 0.7.
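
To make the format concrete, here is a minimal Python sketch of a reader for it. This is not SPMF's parser. The function name `parse_uncertain_database` is hypothetical, and the sketch assumes that blank lines and lines starting with "#" (such as the first three lines of the example file) carry no transaction. It checks the constraints stated above: positive integer items, no repeated item in a line, and probabilities greater than 0 and not more than 1. It does not check that items are sorted.

```python
import re

# One "item(probability)" token, e.g. "4(0.3)".
TOKEN = re.compile(r"^(\d+)\(([^()]+)\)$")


def parse_uncertain_database(text):
    """Parse the text format into a list of {item: probability} dicts."""
    database = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Assumption: blank lines and '#' lines are not transactions.
        if not line or line.startswith("#"):
            continue
        transaction = {}
        for token in line.split():
            match = TOKEN.match(token)
            if match is None:
                raise ValueError(f"line {line_number}: malformed token {token!r}")
            item = int(match.group(1))
            probability = float(match.group(2))  # raises ValueError if not a number
            if item <= 0:
                raise ValueError(f"line {line_number}: item must be a positive integer")
            if item in transaction:
                raise ValueError(f"line {line_number}: item {item} appears twice")
            if not 0.0 < probability <= 1.0:
                raise ValueError(f"line {line_number}: probability {probability} not in (0, 1]")
            transaction[item] = probability
        database.append(transaction)
    return database


sample = """# This binary context contains uncertain data.
# Each line represents a transaction.
# For each item there is an existential probability.
1(0.5) 2(0.4) 4(0.3) 5(0.7)
2(0.5) 3(0.4) 5(0.4)
1(0.6) 2(0.5) 4(0.1) 5(0.5)
1(0.7) 2(0.4) 3(0.3) 5(0.9)
"""

database = parse_uncertain_database(sample)
print(len(database), "transactions; first:", database[0])
```

To read the file shipped with SPMF instead of the inline sample, the file's contents could be passed to the same function, for example `parse_uncertain_database(open("contextUncertain.txt").read())`.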