Mining Frequent Episodes in a Complex Event Sequence using the EMMA Algorithm (SPMF documentation)
This example explains how to run the EMMA algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "EMMA" algorithm, (2) select the input file "contextEMMA.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum support, maximum window and the boolean parameter indicating that timestamps are not provided respectively to 2, 2, false and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run EMMA contextEMMA.txt output.txt 2 2 false in a folder containing spmf.jar and the example input file contextEMMA.txt. - If you are using the source code version of SPMF, launch the file "MainTestEMMA.java" in the package ca.pfv.SPMF.tests.
What is EMMA?
EMMA (Kuo-Yu et al., 2008) is an algorithm for discovering frequent episodes in a complex event sequence. In simple words, frequent episode mining means to look for subsequences that appear frequently in a long sequence of events, where some events may be simultaneous.
Various algorithms have been proposed to find frequent episodes in data. EMMA is one of the most famous. It is to be noted that different algorithms do not count the support (occurrence frequency) of an episode in the same way. The EMMA algorithm count the support using a concept of head frequency (see the paper for more details).
Also, it is to be noted that the EMMA algorithm can be applied on datasets that have timestamps or that do not have timestamps by setting the third parameter of the algorithm to false or true, respectively.
Frequent episode mining has many possible applications as data from many domains can be encoded as sequences of events.
What is the input?
The algorithm takes as input an event sequence with a minimum support threshold, a maximum window length and a boolean parameter self_increment, that must be set to true if the input dataset has no timestamps or otherwise false. Let’s consider following sequence consisting of 11 time points (t1, t2, …, t10) and 4 events type (1, 2, 3, 4). This database is provided in the text file "contextEMMA.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Time points |
Itemset (event set) |
1 |
1 |
2 |
1 |
3 |
1 2 |
6 |
1 |
7 |
1 2 |
8 |
3 |
9 |
2 |
11 |
4 |
Each line of the sequence is called an event set and consists of:
- a time point (first column)
- an event set (second column)
For example, in the above example, at time point 1, the event 1 occurred. Then at time point 2, the event 1 occurred again. Then at time point 3, the events 1 and 2 occurred simultaneously. And so on.
What is the output?
The output of EMMA is the set of frequent episodes having a support no less than a minSup threshold (a positive integer) set by the user. To explain what is a frequent episode, it is necessary to review some definitions
An episode is a sequence of events. It is said to appear in a time interval [ti,tj] (where ti and tj are time points) if all the events from the episode appear in the same order in the input sequence in that time interval. For example, consider the episode <(1),(2)>, which means event 1 followed by event 2. This episode appears in the time interval [t2,t3] because in that interval the event 1 appears (at time point t2) followed by event 2 (at time point t3). Moreover, it is important to note that each time interval is required to have a time duration (tj - ti) that is smaller than the maximum window parameter set by the user.
Given an episode, it is possible to find all the time intervals where it appears in the input sequence. For an interval [ti,tj], ti is said to be the starting point of the time interval. The support of an episode is the number of different starting points of intervals that contains this episode. For example, for a maximum window of 2, the episode <(1),(2)> appears in the ime intervals [t2,3] and [t6,t7]. Because that episode has two different starting points (t2 and t6), its support is 2.
If we change the maximum window parameter to 3, then <(1),(2)> appears in time intervals [t1,t3] [t2,t3], [t6,t7]and [t7,t9]. Because that episode has four different starting points t1,t2,t6,t7), its support is 4.
Another example is the episode <(1,2)> which means that events 1 and 2 appeared at the same time. This episode appears in two time intervals that are [t3,t3] and [t7,t7]. These time intervals have two different starting points (t3 and t7). Thus the support of that episode is 2.
A frequent episode is an episode that has a support no less than minSup.
For example, if set minSup = 2 and maxWindow = 2. We obtain 6 frequent episodes with the following support values:
Frequent episode |
Support |
<(1)> |
5 |
<(2)> |
3 |
<(1, 2)> |
2 |
<(1), (1)> |
3 |
<(1), (2)> |
2 |
<(1), (1, 2)> |
2 |
For instance, the last episode <(1), (1, 2)> ( indicating that event 1 was followed by event 1 and 2 at the same time) has a support of 2.
Input file format
The input file format of that algorithm is a text file. An item (event) is represented by a positive integer. A transaction (event set) is a line in the text file. In each line (event set), items are separated by a single space. It is assumed that all items (events) within a same transaction line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same event set. Each line is optionally followed by the character "|" and then the timestamp of the event set (line). Note that it is possible to run EMMA on a database that has no timestamps. In that case, the boolean parameter (the third parameter) should be set to true to indicate that the dataset has no timestamps. If there is not timestamps and the parameter is set to true, the algorithm will assume that each line has a timestamp that is increasing by 1.
In the previous example, the input file is defined as follows:
1|1
1|2
1 2|3
1|6
1 2|7
3|8
2|9
4|1
Output file format
The output file format is defined as follows. It is a text file. Each line is a frequent episode. Each event in a frequent episode is a positive integer and items from the same event set within an episode are separated by single spaces. The value "-1" indicates the end of an event set. On each line, the episode is first indicated. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the pattern (a positive integer). For example, a few lines from the output file from the previous example are shown below:
1-1 #SUP : 5
2-1 #SUP : 3
1 2-1 #SUP : 2
1 -1 1-1 #SUP : 3
1 -1 2-1 #SUP : 2
1 -1 1 2-1 #SUP : 2
For instance, the last line indicates that event 1 followed by events 1 and 2 simultaneously, has a support of 2. Other lines follow the same format.
Performance
EMMA is one of the fastest algorithm for frequent episode mining.
Implementation details
The version implemented here contains all the optimizations described in the paper proposing EMMA.
Where can I get more information about the EMMA algorithm?
The EMMA algorithm is described in this article:
Kuo-Yu Huang,Chia-Hui Chang (2008). Efficient mining of frequent episodes from complex sequences.Inf. Syst.33(1):96-114