Calculate Statistics for a Transaction Database with Utility and Cost Information (SPMF documentation)

This example explains how to calculate statistics for a transaction database with utility and cost information using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Calculate_stats_for_a_transaction_database_with_cost_utility" algorithm, (2) choose the input file DB_cost.txt, and (3) click "Run algorithm".
If you want to execute this example from the command line, then execute this command in a folder containing spmf.jar and the input file DB_cost.txt:
java -jar spmf.jar run Calculate_stats_for_a_transaction_database_with_cost_utility DB_cost.txt no_output_file
If you are using the source code version of SPMF, launch the file "MainTestStatsCostUtilityTDB.java" in the package ca.pfv.spmf.tests.

What is this tool?

This tool generates statistics about a transaction database that contains utility and cost information. It can be used to find out, for example, the average transaction length and the total utility of a database. This type of database is used as input by low-cost high-utility itemset mining algorithms such as LCIM.

What is the input?

The input is a transaction database with utility and cost information. Let's consider the following database consisting of 5 transactions (t1, t2, ..., t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_cost.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

	Items	Transaction utility	Item cost values for this transaction
t1	3 5 1 2 4 6	40	1 3 5 10 6 5
t2	3 5 2 4	20	3 3 8 6
t3	3 1 4	18	1 5 2
t4	3 5 1 7	37	6 6 10 5
t5	3 5 2 7	21	2 3 4 2

Each line of the database consists of:

a set of items (the second column of the table) representing some nominal values,
a utility value (an integer) that is assigned to this transaction (the third column of the table),
a cost value (an integer) for each item of this transaction (the fourth column of the table).

Let's explain what this means with a concrete example. Suppose that this data comes from e-learning. Each transaction (record) is a set of items that may represent the courses taken by a student. Here the items are called 1, 2, 3, 4, 5, 6 and 7, each denoting a different course. For example, the first transaction t1 indicates that a student took the courses 3, 5, 1, 2, 4 and 6 (in no specific order). The utility of the transaction t1 is 40, which represents the grade that the student obtained at the final exam after taking these courses. The last column indicates the cost of each item, which may represent the time spent on each course. For example, the transaction t1 has the cost values 1, 3, 5, 10, 6 and 5, which means that the student spent 1 time unit on course 3, 3 time units on course 5, 5 time units on course 1, 10 time units on course 2, 6 time units on course 4, and 5 time units on course 6. The database contains five transactions (t1, t2, t3, t4, and t5), which may describe five students.
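The positional correspondence between items and cost values can be made explicit with a small sketch. The class ItemCostPairing and its method are hypothetical helpers written for this illustration; they are not part of SPMF. The sketch assumes only what is described above: the i-th cost value belongs to the i-th item of the same transaction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of SPMF.
final class ItemCostPairing {

    // Pairs the i-th item with the i-th cost value, keeping the order of the transaction.
    static Map<Integer, Integer> pairItemsWithCosts(int[] items, int[] costs) {
        if (items.length != costs.length) {
            throw new IllegalArgumentException("Each item needs exactly one cost value");
        }
        Map<Integer, Integer> costByItem = new LinkedHashMap<>();
        for (int i = 0; i < items.length; i++) {
            costByItem.put(items[i], costs[i]);
        }
        return costByItem;
    }

    public static void main(String[] args) {
        // Transaction t1 of the example database.
        int[] items = {3, 5, 1, 2, 4, 6};
        int[] costs = {1, 3, 5, 10, 6, 5};
        // Prints {3=1, 5=3, 1=5, 2=10, 4=6, 6=5}
        System.out.println(pairItemsWithCosts(items, costs));
    }
}
```

For t1, the resulting map associates course 2 with a cost of 10 time units, in agreement with the explanation above.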

Note that e-learning is only an example. Other types of data, such as medical data or shopping data, can be represented in the same format.

What is the output?

The output consists of statistics about the transaction database. For example, if we apply the tool to the example database above, we obtain the following statistics:

---------- Cost-Utility Transaction Database Information----------
Number of transations : 5
Total utility : 136
Number of distinct items : 7
Maximum Id of item : 7
Average length of transaction : 4.2
Maximum length of transaction : 6
Average cost per item: 4.571428571428571 standard deviation: 2.536803924165182 variance: 6.435374149659867
Average utility per transaction: 27.2 standard deviation: 9.325234581499814 variance: 86.96000000000001
Database density: 60.0 %

Note: the database density is calculated as the average transaction length divided by the number of distinct items. Here, 4.2 / 7 = 0.6, that is, 60 %.
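To see where these figures come from, here is a minimal sketch that recomputes some of them from the example data, using the values of the input file DB_cost.txt. The class CostUtilityStatsSketch and all of its methods are hypothetical names chosen for this illustration; this is not the code used by SPMF. The sketch assumes that the variance is the population variance (the sum of squared deviations divided by the number of values), because this choice reproduces the reported values for this database.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch for illustration; this is not the SPMF implementation.
final class CostUtilityStatsSketch {

    // The example database: items, utility and item costs of t1..t5.
    static final int[][] ITEMS = {
        {3, 5, 1, 2, 4, 6}, {3, 5, 2, 4}, {3, 1, 4}, {3, 5, 1, 7}, {3, 5, 2, 7}
    };
    static final int[] UTILITIES = {40, 20, 18, 37, 21};
    static final int[][] COSTS = {
        {1, 3, 5, 10, 6, 5}, {3, 3, 8, 6}, {1, 5, 2}, {6, 6, 10, 5}, {2, 3, 4, 2}
    };

    // Concatenates all rows into one array (e.g. all cost values of the database).
    static int[] flatten(int[][] rows) {
        int total = 0;
        for (int[] row : rows) {
            total += row.length;
        }
        int[] flat = new int[total];
        int pos = 0;
        for (int[] row : rows) {
            for (int value : row) {
                flat[pos++] = value;
            }
        }
        return flat;
    }

    static double mean(int[] values) {
        double sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    // Assumption: population variance (divide by n, not by n - 1).
    static double populationVariance(int[] values) {
        double m = mean(values);
        double sumOfSquares = 0;
        for (int value : values) {
            sumOfSquares += (value - m) * (value - m);
        }
        return sumOfSquares / values.length;
    }

    static int distinctItemCount(int[][] transactions) {
        Set<Integer> distinct = new HashSet<>();
        for (int[] transaction : transactions) {
            for (int item : transaction) {
                distinct.add(item);
            }
        }
        return distinct.size();
    }

    static double averageLength(int[][] transactions) {
        return (double) flatten(transactions).length / transactions.length;
    }

    // Density in percent: average transaction length / number of distinct items.
    static double densityPercent(int[][] transactions) {
        return 100.0 * averageLength(transactions) / distinctItemCount(transactions);
    }

    public static void main(String[] args) {
        int[] allCosts = flatten(COSTS);
        System.out.println("Average length of transaction : " + averageLength(ITEMS));
        System.out.println("Number of distinct items : " + distinctItemCount(ITEMS));
        System.out.println("Database density (%) : " + densityPercent(ITEMS));
        System.out.println("Average cost per item : " + mean(allCosts));
        System.out.println("Variance of item costs : " + populationVariance(allCosts));
        System.out.println("Average utility per transaction : " + mean(UTILITIES));
        System.out.println("Variance of utility : " + populationVariance(UTILITIES));
    }
}
```

The last digits printed by this sketch may differ slightly from the output of the tool because of floating-point rounding, but the values agree with the statistics listed above: 96 cost units over 21 item occurrences give an average cost of about 4.571, and a total utility of 136 over 5 transactions gives an average utility of 27.2.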

Input file format

The input file format is defined as follows. It is a text file. Each line represents a transaction and is composed of three sections, as follows.

First, the items contained in the transaction are listed. An item is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same transaction.
Second, the symbol ":" appears and is followed by the transaction utility (an integer).
Third, the symbol ":" appears and is followed by the cost of each item in this transaction (one integer per item), separated by single spaces.

For the example database above, the input file is defined as follows:

3 5 1 2 4 6:40:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:18:1 5 2
3 5 1 7:37:6 6 10 5
3 5 2 7:21:2 3 4 2

Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 40 and that items 3, 5, 1, 2, 4 and 6 respectively have a cost of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.
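The format described above can be read with a few lines of code. The following is a minimal sketch of a parser for one line; the class CostUtilityLineParser and the nested class Transaction are hypothetical names introduced for this illustration and are not the classes that SPMF uses internally. The sketch assumes that the three sections are separated by ":" and that each item has exactly one cost value.

```java
// Hypothetical parser for illustration; not the parser used by SPMF.
final class CostUtilityLineParser {

    // Simple holder for the three sections of a line.
    static final class Transaction {
        final int[] items;
        final int utility;
        final int[] costs;

        Transaction(int[] items, int utility, int[] costs) {
            this.items = items;
            this.utility = utility;
            this.costs = costs;
        }
    }

    // Parses a line such as "3 5 1 2 4 6:40:1 3 5 10 6 5".
    static Transaction parseLine(String line) {
        String[] sections = line.trim().split(":");
        if (sections.length != 3) {
            throw new IllegalArgumentException("Expected items:utility:costs but got: " + line);
        }
        int[] items = parseIntegers(sections[0]);
        int utility = Integer.parseInt(sections[1].trim());
        int[] costs = parseIntegers(sections[2]);
        if (items.length != costs.length) {
            throw new IllegalArgumentException("Each item needs exactly one cost value: " + line);
        }
        return new Transaction(items, utility, costs);
    }

    // Converts a list of integers separated by spaces into an array.
    static int[] parseIntegers(String text) {
        String[] tokens = text.trim().split("\\s+");
        int[] values = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Integer.parseInt(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        Transaction t1 = parseLine("3 5 1 2 4 6:40:1 3 5 10 6 5");
        // Prints: 6 items, utility 40, cost of first item 1
        System.out.println(t1.items.length + " items, utility " + t1.utility
                + ", cost of first item " + t1.costs[0]);
    }
}
```

Checking that the number of cost values equals the number of items is a design choice of this sketch: it catches malformed lines early instead of silently pairing an item with the wrong cost.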