Mining the Top-K Frequent Episodes in a Complex Event Sequence using the TKE Algorithm (SPMF documentation)

This example explains how to run the TKE algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "TKE" algorithm, (2) select the input file "contextEMMA.txt", (3) set the output file name (e.g. "output.txt") (4) set the parameter k, the maximum window and the boolean parameter indicating that timestamps are not provided respectively to 6, 2, false and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run TKE contextEMMA.txt output.txt 6 2 false in a folder containing spmf.jar and the example input file contextEMMA.txt.
If you are using the source code version of SPMF, launch the file "MainTestTKE.java" in the package ca.pfv.SPMF.tests.

What is TKE?

TKE (Fournier-Viger et al., 2020) is an algorithm for discovering the top-k frequent episodes in a complex event sequence. In simple words, this algorithm finds the k subsequences that appear the most often in a long sequence of events, where some events may be simultaneous, and k is a parameter set by the user .

Various algorithms have been proposed to find frequent episodes in data. Some classical algorithms are MINEPI, MINEPI+ and EMMA (also offered in SPMF). However, all these algorithms requires that the user sets a minimum support threshold parameter, which is unintuitive to set and may produce very few or too many patterns. The TKE algorithm was designed to avoid this problem by replacing the minimum support parameter by the parameter k. Using TKE, the user can directly set the number k of patterns to be found.

It is to be noted that different frequent episode mining algorithms do not count the support (occurrence frequency) of an episode in the same way. The TKE algorithm count the support using a concept of head frequency, which was originally proposed in the EMMA paper.

Also, it is to be noted that the TKE algorithm can be applied on datasets that have timestamps or that do not have timestamps by setting the third parameter of the algorithm to false or true, respectively.

Frequent episode mining has many possible applications as data from many domains can be encoded as sequences of events.

What is the input?

The algorithm takes as input an event sequence, a parameter k (a positive integer) indicating the number of episodes to find , a maximum window length, and a boolean parameter self_increment, that must be set to true if the input dataset has no timestamps or otherwise false. Let’s consider following sequence consisting of 11 time points (t1, t2, …, t10) and 4 events type (1, 2, 3, 4). This database is provided in the text file "contextEMMA.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

Time points	Itemset (event set)
1	1
2	1
3	1 2
6	1
7	1 2
8	3
9	2
11	4

Each line of the sequence is called an event set and consists of:

a time point (first column)
an event set (second column)

For example, in the above example, at time point 1, the event 1 occurred. Then at time point 2, the event 1 occurred again. Then at time point 3, the events 1 and 2 occurred simultaneously. And so on.

What is the output?

The output of TKE is the set of the k episodes that have the highest support (frequency, where k (a positive integer) is set by the user. To explain what is the frequency of an episode, it is necessary to review some definitions

An episode is a sequence of events. It is said to appear in a time interval [ti,tj] (where ti and tj are time points) if all the events from the episode appear in the same order in the input sequence in that time interval. For example, consider the episode <(1),(2)>, which means event 1 followed by event 2. This episode appears in the time interval [t2,t3] because in that interval the event 1 appears (at time point t2) followed by event 2 (at time point t3). Moreover, it is important to note that each time interval is required to have a time duration (tj - ti) that is smaller than the maximum window parameter set by the user.

Given an episode, it is possible to find all the time intervals where it appears in the input sequence. For an interval [ti,tj], ti is said to be the starting point of the time interval. The support of an episode is the number of different starting points of intervals that contains this episode. For example, for a maximum window of 2, the episode <(1),(2)> appears in the ime intervals [t2,3] and [t6,t7]. Because that episode has two different starting points (t2 and t6), its support is 2.
If we change the maximum window parameter to 3, then <(1),(2)> appears in time intervals [t1,t3] [t2,t3], [t6,t7]and [t7,t9]. Because that episode has four different starting points t1,t2,t6,t7), its support is 4.

Another example is the episode <(1,2)> which means that events 1 and 2 appeared at the same time. This episode appears in two time intervals that are [t3,t3] and [t7,t7]. These time intervals have two different starting points (t3 and t7). Thus the support of that episode is 2.

An episode is a top-k episode if there does not exist more than than k-1 episodes having a higher support. The TKE algorithm finds the top-k episodes. It is to be noted that the set of top-k episodes may sometimes contain more than k episodes (if several episodes have the same support).

For example, if set k = 6 and maxWindow = 2. We obtain the top-6 most frequent episodes with the following support values:

Frequent episode	Support
<(1)>	5
<(2)>	3
<(1, 2)>	2
<(1), (1)>	3
<(1), (2)>	2
<(1), (1, 2)>	2

For instance, the last episode <(1), (1, 2)> ( indicating that event 1 was followed by event 1 and 2 at the same time) has a support of 2.

Input file format

The input file format of that algorithm is a text file. An item (event) is represented by a positive integer. A transaction (event set) is a line in the text file. In each line (event set), items are separated by a single space. It is assumed that all items (events) within a same transaction line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same event set. Each line is optionally followed by the character "|" and then the timestamp of the event set (line). Note that it is possible to run TKE on a database that has no timestamps. In that case, the boolean parameter (the third parameter) should be set to true to indicate that the dataset has no timestamps. If there is not timestamps and the parameter is set to true, the algorithm will assume that each line has a timestamp that is increasing by 1.
In the previous example, the input file is defined as follows:

1|1
1|2
1 2|3
1|6
1 2|7
3|8
2|9
4|1

Output file format

The output file format is defined as follows. It is a text file. Each line is a frequent episode. Each event in a frequent episode is a positive integer and items from the same event set within an episode are separated by single spaces. The value "-1" indicates the end of an event set. On each line, the episode is first indicated. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the pattern (a positive integer). For example, a few lines from the output file from the previous example are shown below:

1-1 #SUP : 5
2-1 #SUP : 3
1 2-1 #SUP : 2
1 -1 1-1 #SUP : 3
1 -1 2-1 #SUP : 2
1 -1 1 2-1 #SUP : 2

For instance, the last line indicates that event 1 followed by events 1 and 2 simultaneously, has a support of 2. Other lines follow the same format.

Performance and Implementation details

This is the original implementation of the TKE algorithm, containing all optimizations described in the paper. Some of the optimizations can be deactivated in the source code for doing performance experiments..

Where can I get more information about the TKE algorithm?

The TKE algorithm is described in this article:

Fournier-Viger, P., Wang, Y., Yang, P., Lin, J. C.-W., Yun, U. (2020). TKE: Mining Top-K Frequent Episodes. Proc. 33rd Intern. Conf. on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA AIE 2020), Springer LNAI, 12 pages