Calculate Statistics for a Product Transaction Database (SPMF documentation)

This example explains how to calculate statistics for a product transaction database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool generates statistics about a product transaction database, as used by algorithms such as VME. It can be used, for example, to find out whether a database is dense or sparse before applying a data mining algorithm.

What is the input?

The input is a product transaction database (aka formal context). A product is defined as the set of items used to assemble it. Moreover, each product is annotated with a profit (a positive integer) that indicates how much money the product generates for the company. For example, consider the following product database, consisting of 6 products and 7 items (this example is taken from the article of Deng & Xu, 2010). The first line indicates that product 1 generates a total profit of 50 $ for the company and that its assembly requires parts 2, 3, 4 and 6. This product database is provided in the file "contextVME.txt" of the SPMF distribution:


product    profit   items
product1   50$      {2, 3, 4, 6}
product2   20$      {2, 5, 7}
product3   50$      {1, 2, 3, 5}
product4   800$     {1, 2, 4}
product5   30$      {6, 7}
product6   50$      {3, 4}

What is the output?

The output is a set of statistics about the product transaction database. For example, if we apply the tool to the product transaction database given above, we get the following statistics:

============ TRANSACTION DATABASE STATS ==========
Number of transactions : 6
File C:\Users\Phil\Desktop\test_files\contextVME.txt
Number of distinct items: 7
Smallest item id: 1
Largest item id: 7
Average number of items per transaction: 3.0 standard deviation: 0.816496580927726 variance: 0.6666666666666666
Average profit per product (transaction): 166.66666666666666 standard deviation: 283.4705550062573 variance: 80355.55555555555
Average item support in the database: 2.5714285714285716 standard deviation: 0.7284313590846836 variance: 0.5306122448979592 min value: 2 max value: 4
Database density: 42.857142857142854 %
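
To make these figures easier to interpret, here is a minimal Python sketch that recomputes most of them from the example database. It is not the SPMF implementation: the function name `database_stats` and the dictionary keys are hypothetical, and it assumes that the variance is the population variance (dividing by the number of values) and that density is the total number of item occurrences divided by (number of transactions × number of distinct items). Both assumptions agree with the numbers reported above for this example.

```python
from collections import Counter
from statistics import mean, pvariance

# Example database from this page: one (profit, items) pair per product.
transactions = [
    (50, [2, 3, 4, 6]),
    (20, [2, 5, 7]),
    (50, [1, 2, 3, 5]),
    (800, [1, 2, 4]),
    (30, [6, 7]),
    (50, [3, 4]),
]


def database_stats(transactions):
    """Hypothetical helper: summary statistics for a product database."""
    sizes = [len(items) for _, items in transactions]
    profits = [profit for profit, _ in transactions]
    # Support of an item = number of transactions that contain it.
    support = Counter(item for _, items in transactions for item in items)
    supports = list(support.values())
    distinct = len(support)
    return {
        "transactions": len(transactions),
        "distinct_items": distinct,
        "smallest_item": min(support),
        "largest_item": max(support),
        "avg_items": mean(sizes),
        "var_items": pvariance(sizes),        # population variance (assumption)
        "avg_profit": mean(profits),
        "var_profit": pvariance(profits),
        "avg_support": mean(supports),
        "min_support": min(supports),
        "max_support": max(supports),
        # Assumed definition: filled cells / all cells of the item matrix.
        "density_percent": 100 * sum(sizes) / (len(transactions) * distinct),
    }


stats = database_stats(transactions)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Under this definition, a density close to 100 % means that most products use most items (a dense database), while a low value indicates a sparse one.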

Input file format

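The input file format is defined as follows. It is a text file. Each line represents a transaction and is composed of two sections: first the profit of the transaction (a positive integer), then the items of the transaction, each identified by an integer. All values on a line are separated by single spaces.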

For the previous example, the input file is defined as follows:

50 2 3 4 6
20 2 5 7
50 1 2 3 5
800 1 2 4
30 6 7
50 3 4

Consider the first line. It means that the transaction contains the items 2, 3, 4 and 6 and has a profit of 50. The other lines follow the same format.
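
As an illustration of this format, the following Python sketch parses such a file into (profit, items) pairs. It is only a sketch under stated assumptions: the function name `parse_product_database` is hypothetical and not part of SPMF, and the code assumes that every non-empty line holds whitespace-separated integers, with the profit first.

```python
def parse_product_database(text):
    """Parse lines of the form '<profit> <item> <item> ...'.

    Hypothetical helper, not part of SPMF. Blank lines are skipped.
    Returns a list of (profit, items) tuples.
    """
    transactions = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore empty lines
        # The first value is the profit, the remaining values are the items.
        profit, *items = (int(field) for field in fields)
        transactions.append((profit, items))
    return transactions


# The example input file from this page, inlined as a string.
sample = """50 2 3 4 6
20 2 5 7
50 1 2 3 5
800 1 2 4
30 6 7
50 3 4
"""

db = parse_product_database(sample)
for profit, items in db:
    print(profit, items)
```

To read the file shipped with SPMF instead of the inlined string, the text could be loaded with `open("contextVME.txt").read()`, assuming the file is in the current working directory.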