Calculate Statistics for a Transaction Database with Utility and Cost Information (SPMF documentation)

This example explains how to calculate statistics for a transaction database with utility and cost information using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool generates statistics about a transaction database that contains utility and cost information. For example, it reports the average transaction length and the total utility of the database. This type of database is used as input by some low-cost high utility itemset mining algorithms such as LCIM.

What is the input?

The input is a transaction database with utility and cost information, and three parameters: the minimum utility threshold min_utility (a positive integer), maxcost (an integer) and minsup (a double value). Let's consider the following database consisting of 5 transactions (t1, t2, ..., t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_cost.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


Transaction   Items          Transaction utility   Item cost values for this transaction
t1            3 5 1 2 4 6    40                    1 3 5 10 6 5
t2            3 5 2 4        20                    3 3 8 6
t3            3 1 4          18                    1 5 2
t4            3 5 1 7        37                    6 6 10 5
t5            3 5 2 7        21                    2 3 4 2

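Each line of the database lists, for one transaction, the items that it contains, the utility of the transaction, and the cost of each item in that transaction.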

Let's explain what this means with a real example. Let's say that this data is from e-learning. Each transaction (record) is a set of items that may represent the courses taken by a student. Here the items are called 1, 2, 3, 4, 5, 6 and 7, and each one denotes a different course. For example, the first transaction t1 indicates that a student took the courses 3, 5, 1, 2, 4 and 6 (in no specific order). The utility of transaction t1 is 40, which represents the grade that the student obtained at the final exam after taking these courses. The last column indicates the cost of each item, which may represent the time spent on each course. For example, the transaction t1 has the cost values 1, 3, 5, 10, 6 and 5, which means that the student spent 1 time unit on course 3, 3 time units on course 5, 5 time units on course 1, 10 time units on course 2, 6 time units on course 4, and 5 time units on course 6. The database contains five transactions (t1, t2, t3, t4 and t5), which may describe five students.

Note that e-learning is only used as an example here. Other types of data, such as medical data or shopping data, can be represented in the same format.
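The positional correspondence between items and cost values can be illustrated with a short Python sketch. This is only an illustration and not part of SPMF; the variable names are hypothetical, and the values are those of transaction t1.

```python
# Transaction t1 from the example database (illustrative variable names).
items = [3, 5, 1, 2, 4, 6]    # courses taken by the student
utility = 40                  # grade obtained at the final exam
costs = [1, 3, 5, 10, 6, 5]   # time spent on each course, in the same order as items

# The i-th cost value belongs to the i-th item of the transaction.
cost_of_item = dict(zip(items, costs))
total_cost = sum(costs)

for item, cost in cost_of_item.items():
    print(f"course {item}: {cost} time unit(s)")
print(f"total time spent: {total_cost}, grade: {utility}")
```

Pairing the two lists by position only works because each transaction has exactly one cost value per item.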

What is the output?

The output is a set of statistics about the transaction database. For example, applying the tool to the example database given above produces the following statistics:

---------- Cost-Utility Transaction Database Information----------
Number of transations : 5
Total utility : 136
Number of distinct items : 7
Maximum Id of item : 7
Average length of transaction : 4.2
Maximum length of transaction : 6
Average cost per item: 4.571428571428571 standard deviation: 2.536803924165182 variance: 6.435374149659867
Average utility per transaction: 27.2 standard deviation: 9.325234581499814 variance: 86.96000000000001
Database density: 60.0 %

Note: the database density is calculated as the average transaction length divided by the number of distinct items.
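As a cross-check, the statistics above can be recomputed with a few lines of Python. This is a minimal sketch and not the SPMF implementation: the function name compute_stats and the dictionary keys are hypothetical. It assumes that the reported variance and standard deviation are population statistics (dividing by n), which is consistent with the values shown above. The transaction utilities are taken from the input file shown in the "Input file format" section.

```python
from statistics import mean, pstdev, pvariance

# Example database: (items, transaction utility, item costs) for t1..t5.
DATABASE = [
    ([3, 5, 1, 2, 4, 6], 40, [1, 3, 5, 10, 6, 5]),
    ([3, 5, 2, 4], 20, [3, 3, 8, 6]),
    ([3, 1, 4], 18, [1, 5, 2]),
    ([3, 5, 1, 7], 37, [6, 6, 10, 5]),
    ([3, 5, 2, 7], 21, [2, 3, 4, 2]),
]


def compute_stats(db):
    """Return summary statistics for a cost-utility transaction database."""
    lengths = [len(items) for items, _, _ in db]
    utilities = [utility for _, utility, _ in db]
    costs = [c for _, _, item_costs in db for c in item_costs]
    distinct_items = {item for items, _, _ in db for item in items}
    avg_length = mean(lengths)
    return {
        "transactions": len(db),
        "total_utility": sum(utilities),
        "distinct_items": len(distinct_items),
        "max_item_id": max(distinct_items),
        "avg_length": avg_length,
        "max_length": max(lengths),
        "avg_cost": mean(costs),
        "cost_std": pstdev(costs),            # population standard deviation
        "cost_variance": pvariance(costs),    # population variance
        "avg_utility": mean(utilities),
        "utility_std": pstdev(utilities),
        "utility_variance": pvariance(utilities),
        # Density: average transaction length / number of distinct items.
        "density_percent": 100 * avg_length / len(distinct_items),
    }


stats = compute_stats(DATABASE)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The printed floating-point values may differ from the tool's output in the last digits, since the order of arithmetic operations is not necessarily the same.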

Input file format

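The input file format is defined as follows. It is a text file in which each line represents a transaction. Each line is composed of three sections separated by the character ":". The first section lists the items of the transaction, separated by single spaces. The second section is the utility of the transaction. The third section lists the cost of each item in the transaction, separated by single spaces and given in the same order as the items.

For the previous example, the input file is defined as follows: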

3 5 1 2 4 6:40:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:18:1 5 2
3 5 1 7:37:6 6 10 5
3 5 2 7:21:2 3 4 2

Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 40 and that items 3, 5, 1, 2, 4 and 6 respectively have a cost of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.
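To make the format concrete, here is a minimal Python sketch of a parser for such lines. It is not part of SPMF: the function name parse_line is hypothetical, and the sketch assumes that items, utilities and costs are all integers, as in the example file.

```python
def parse_line(line):
    """Parse one 'items:utility:costs' line into (items, utility, costs)."""
    items_part, utility_part, costs_part = line.strip().split(":")
    items = [int(token) for token in items_part.split()]
    utility = int(utility_part)
    costs = [int(token) for token in costs_part.split()]
    # Each item must have exactly one cost value, in the same position.
    if len(items) != len(costs):
        raise ValueError(f"expected {len(items)} cost values, got {len(costs)}")
    return items, utility, costs


# The five lines of the example input file.
EXAMPLE_LINES = [
    "3 5 1 2 4 6:40:1 3 5 10 6 5",
    "3 5 2 4:20:3 3 8 6",
    "3 1 4:18:1 5 2",
    "3 5 1 7:37:6 6 10 5",
    "3 5 2 7:21:2 3 4 2",
]

database = [parse_line(line) for line in EXAMPLE_LINES]
for items, utility, costs in database:
    print(items, utility, costs)
```

Checking that the number of cost values equals the number of items catches malformed lines early, before any statistic is computed.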