Calculate Statistics for an Event Sequence (SPMF documentation)

This example explains how to calculate statistics for an event sequence database using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Calculate_stats_for_an_event_sequence" algorithm, (2) choose the input file contextEMMA.txt, and (3) click "Run algorithm".
If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the input file contextEMMA.txt:
java -jar spmf.jar run Calculate_stats_for_an_event_sequence contextEMMA.txt no_output_file
If you are using the source code version of SPMF, launch the file "MainTestStatsEventSequence.java" in the package ca.pfv.spmf.tests.

What is this tool?

This tool generates statistics about an event sequence, a type of data taken as input by algorithms such as TKE and EMMA.

What is the input?

The input is an event sequence. Let’s consider the following event sequence, which spans 11 time points (t1, t2, …, t11) and contains 4 event types (1, 2, 3, 4). This database is provided in the text file "contextEMMA.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

Time points	Itemset (event set)
1	1
2	1
3	1 2
6	1
7	1 2
8	3
9	2
11	4

Each line of the sequence describes one event set and consists of:

a time point (first column)
an event set (second column)

In the above example, event 1 occurred at time point 1. Then, at time point 2, event 1 occurred again. Then, at time point 3, events 1 and 2 occurred simultaneously, and so on.

What is the output?

The output consists of statistics about the event sequence. For example, applying the tool to the event sequence given above produces the following statistics:

============ EVENT SEQUENCE STATS ==========
Number of events : 10
Number of distinct event types : 4
Max item id : 4
Number of distinct timestamp: 8
Min timestamp : 1
Max timestamp: 11
Avg. number of events per timestamp: 1.25
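
To make the meaning of each statistic concrete, here is a minimal Java sketch that recomputes them for the example sequence. It is not the SPMF implementation: the class EventSequenceStatsSketch and its nested types EventSet and Stats are hypothetical names introduced only for this illustration, and the printed labels are not meant to reproduce the tool's exact output format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class EventSequenceStatsSketch {

    /** Hypothetical holder for one line of the sequence: a time point and its events. */
    public static final class EventSet {
        final long timestamp;
        final int[] events;

        public EventSet(long timestamp, int... events) {
            this.timestamp = timestamp;
            this.events = events;
        }
    }

    /** Hypothetical result holder: one field per statistic reported by the tool. */
    public static final class Stats {
        public int eventCount;
        public int distinctEventTypes;
        public int maxItemId;
        public int distinctTimestamps;
        public long minTimestamp;
        public long maxTimestamp;
        public double avgEventsPerTimestamp;
    }

    public static Stats compute(List<EventSet> sequence) {
        TreeSet<Integer> types = new TreeSet<>();    // distinct event types, sorted
        TreeSet<Long> timestamps = new TreeSet<>();  // distinct time points, sorted
        int eventCount = 0;
        for (EventSet eventSet : sequence) {
            timestamps.add(eventSet.timestamp);
            for (int event : eventSet.events) {
                types.add(event);
                eventCount++;  // every occurrence counts, including repeated types
            }
        }
        if (eventCount == 0) {
            throw new IllegalArgumentException("the event sequence contains no events");
        }
        Stats stats = new Stats();
        stats.eventCount = eventCount;
        stats.distinctEventTypes = types.size();
        stats.maxItemId = types.last();
        stats.distinctTimestamps = timestamps.size();
        stats.minTimestamp = timestamps.first();
        stats.maxTimestamp = timestamps.last();
        stats.avgEventsPerTimestamp = (double) eventCount / timestamps.size();
        return stats;
    }

    /** The example sequence from the input table (time point, then events). */
    public static List<EventSet> exampleSequence() {
        List<EventSet> sequence = new ArrayList<>();
        sequence.add(new EventSet(1, 1));
        sequence.add(new EventSet(2, 1));
        sequence.add(new EventSet(3, 1, 2));
        sequence.add(new EventSet(6, 1));
        sequence.add(new EventSet(7, 1, 2));
        sequence.add(new EventSet(8, 3));
        sequence.add(new EventSet(9, 2));
        sequence.add(new EventSet(11, 4));
        return sequence;
    }

    public static void main(String[] args) {
        Stats s = compute(exampleSequence());
        System.out.println("events=" + s.eventCount
                + " types=" + s.distinctEventTypes
                + " maxId=" + s.maxItemId
                + " timestamps=" + s.distinctTimestamps
                + " min=" + s.minTimestamp
                + " max=" + s.maxTimestamp
                + " avg=" + s.avgEventsPerTimestamp);
    }
}
```

In this sketch, the average is the total number of events divided by the number of distinct timestamps, which for the example is 10 / 8 = 1.25, in agreement with the listing above.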

Input file format

The input file format of this tool is a text file. An item (event) is represented by a positive integer. A transaction (event set) is a line in the text file. In each line (event set), items are separated by a single space. It is assumed that all items (events) within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same event set. Each line is optionally followed by the character "|" and then the timestamp of the event set (line).
In the previous example, the input file is defined as follows:

1|1
1|2
1 2|3
1|6
1 2|7
3|8
2|9
4|11
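
As an illustration of this line format, the following Java sketch splits one line into its events and its optional timestamp. It is only a sketch under stated assumptions, not SPMF's own reader: the class EventSequenceLineParserSketch, the type ParsedLine, and the method parseLine are hypothetical names, and representing a missing timestamp as null is a choice made here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class EventSequenceLineParserSketch {

    /** Hypothetical holder for one parsed line of the input file. */
    public static final class ParsedLine {
        public final List<Integer> events;
        public final Long timestamp;  // null when the line has no "|timestamp" part

        ParsedLine(List<Integer> events, Long timestamp) {
            this.events = events;
            this.timestamp = timestamp;
        }
    }

    /** Parses a line such as "1 2|3" into events [1, 2] and timestamp 3. */
    public static ParsedLine parseLine(String line) {
        String trimmed = line.trim();
        int bar = trimmed.indexOf('|');

        // Everything before "|" is the event set; everything after is the timestamp.
        String itemsPart = bar >= 0 ? trimmed.substring(0, bar) : trimmed;
        Long timestamp = bar >= 0
                ? Long.valueOf(trimmed.substring(bar + 1).trim())
                : null;

        List<Integer> events = new ArrayList<>();
        for (String token : itemsPart.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                events.add(Integer.valueOf(token));
            }
        }
        return new ParsedLine(events, timestamp);
    }

    public static void main(String[] args) {
        ParsedLine parsed = parseLine("1 2|3");
        System.out.println("events=" + parsed.events + " timestamp=" + parsed.timestamp);
    }
}
```

The sketch does not check that the events of a line are sorted or free of duplicates; it relies on the file already satisfying those assumptions, as the format description states.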