Mining Maximal Frequent Episodes in a Complex Event Sequence using the MaxFEM Algorithm (SPMF documentation)
This example explains how to run the MaxFEM algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "MaxFEM" algorithm, (2) select the input file "contextMaxFEM.txt", (3) set the output file name (e.g. "output.txt"), (4) set the minimum support, the maximum window and the boolean parameter indicating that timestamps are not provided to 2, 2 and false, respectively, and (5) click "Run algorithm".
- If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run MaxFEM contextMaxFEM.txt output.txt 2 2 false
in a folder containing spmf.jar and the example input file contextMaxFEM.txt.
- If you are using the source code version of SPMF, launch the file "MainTestMaxFEM.java" in the package ca.pfv.spmf.tests.
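The command-line call above can also be scripted, for instance to repeat the example with other parameter values. The following Python sketch only assembles the same argument list as the command shown above and passes it to the standard subprocess module. The helper names build_maxfem_command and run_maxfem are hypothetical (they are not part of SPMF), and the sketch assumes that java is on the PATH and that spmf.jar is in the working directory.

```python
import subprocess


def build_maxfem_command(input_file, output_file, min_sup, max_window,
                         no_timestamps, jar_path="spmf.jar"):
    """Assemble the argument list for the MaxFEM command shown above.

    Hypothetical helper: jar_path defaults to a spmf.jar in the working directory.
    """
    return [
        "java", "-jar", jar_path, "run", "MaxFEM",
        input_file, output_file,
        str(min_sup), str(max_window),
        "true" if no_timestamps else "false",
    ]


def run_maxfem(*args, **kwargs):
    """Run the command (assumes Java and spmf.jar are available)."""
    return subprocess.run(build_maxfem_command(*args, **kwargs), check=True)


command = build_maxfem_command("contextMaxFEM.txt", "output.txt", 2, 2, False)
print(" ".join(command))
# prints: java -jar spmf.jar run MaxFEM contextMaxFEM.txt output.txt 2 2 false
```

Calling run_maxfem("contextMaxFEM.txt", "output.txt", 2, 2, False) would then launch the same run as the command above.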
What is MaxFEM?
MaxFEM (Fournier-Viger et al., 2022) is an algorithm for discovering maximal frequent episodes in a complex event sequence. In simple words, frequent episode mining means to look for subsequences that appear frequently in a long sequence of events, where some events may be simultaneous. Frequent episode mining has many possible applications as data from many domains can be encoded as sequences of events. Various algorithms have been proposed to find frequent episodes in data. They generally use different ways of counting the support (number of occurrences) of episodes in a sequence.
A problem with many episode mining algorithms is that they can produce a lot of episodes as output. Hence, the MaxFEM algorithm was proposed to find only the maximal episodes. Maximal episodes are the largest frequent episodes, that is, frequent episodes that are not included in any other frequent episode.
The MaxFEM algorithm can be applied on datasets that have timestamps or that do not have timestamps by setting the third parameter of the algorithm to false or true, respectively.
What is the input?
The algorithm takes as input an event sequence with a minimum support threshold, a maximum window length and a boolean parameter self_increment, which must be set to true if the input dataset has no timestamps and to false otherwise. Let’s consider the following sequence spanning 11 time points (t1, t2, …, t11) and containing 4 event types (1, 2, 3, 4). Only the time points at which events occur are listed. This database is provided in the text file "contextMaxFEM.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Time point | Itemset (event set)
1 | 1
2 | 1
3 | 1 2
6 | 1
7 | 1 2
8 | 3
9 | 2
11 | 4
Each line of the sequence is called an event set and consists of:
- a time point (first column)
- an event set (second column)
For example, in the above example, at time point 1, the event 1 occurred. Then at time point 2, the event 1 occurred again. Then at time point 3, the events 1 and 2 occurred simultaneously. And so on.
What is the output?
The output of MaxFEM is the set of maximal frequent episodes, that is, the episodes that have a support no less than a minSup threshold (a positive integer) set by the user and that are maximal. To explain what a frequent episode is, it is necessary to review some definitions.
An episode is a sequence of events. It is said to appear in a time interval [ti,tj] (where ti and tj are time points) if all the events from the episode appear in the same order in the input sequence in that time interval. For example, consider the episode <(1),(2)>, which means event 1 followed by event 2. This episode appears in the time interval [t2,t3] because in that interval the event 1 appears (at time point t2) followed by event 2 (at time point t3). Moreover, it is important to note that each time interval is required to have a time duration (tj - ti) that is smaller than the maximum window parameter set by the user.
Given an episode, it is possible to find all the time intervals where it appears in the input sequence. For an interval [ti,tj], ti is said to be the starting point of the time interval. The support of an episode is the number of different starting points of intervals that contain this episode. For example, for a maximum window of 2, the episode <(1),(2)> appears in the time intervals [t2,t3] and [t6,t7]. Because that episode has two different starting points (t2 and t6), its support is 2.
If we change the maximum window parameter to 3, then <(1),(2)> appears in the time intervals [t1,t3], [t2,t3], [t6,t7] and [t7,t9]. Because that episode has four different starting points (t1, t2, t6 and t7), its support is 4.
Another example is the episode <(1,2)> which means that events 1 and 2 appeared at the same time. This episode appears in two time intervals that are [t3,t3] and [t7,t7]. These time intervals have two different starting points (t3 and t7). Thus the support of that episode is 2.
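The support values in the examples above can be checked with a short brute-force Python sketch of the definition. It is only an illustration, not the search procedure used by MaxFEM. It assumes, consistently with the examples, that the first event set of the episode occurs at the starting point, that each following event set occurs at a strictly later time point, and that the duration tj - ti must be smaller than the maximum window. The function name starting_points is hypothetical.

```python
def starting_points(sequence, episode, max_window):
    """Return the starting points of the intervals that contain `episode`.

    sequence: list of (time point, set of events), sorted by time point.
    episode:  list of sets of events, in the order in which they must occur.
    """
    starts = []
    for i, (t_start, events) in enumerate(sequence):
        if not episode[0] <= events:
            continue  # the first event set must occur at the starting point
        pos, matched = i, True
        for itemset in episode[1:]:
            pos += 1
            # move to the earliest later time point containing this event set
            while pos < len(sequence) and not itemset <= sequence[pos][1]:
                pos += 1
            if pos == len(sequence):
                matched = False
                break
        # the duration (tj - ti) must be smaller than the maximum window
        if matched and sequence[pos][0] - t_start < max_window:
            starts.append(t_start)
    return starts


# The example sequence (time point, event set)
sequence = [(1, {1}), (2, {1}), (3, {1, 2}), (6, {1}),
            (7, {1, 2}), (8, {3}), (9, {2}), (11, {4})]

print(starting_points(sequence, [{1}, {2}], 2))  # [2, 6]       -> support 2
print(starting_points(sequence, [{1}, {2}], 3))  # [1, 2, 6, 7] -> support 4
print(starting_points(sequence, [{1, 2}], 2))    # [3, 7]       -> support 2
```

The support of an episode is the length of the returned list.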
A frequent episode is an episode that has a support no less than minSup.
A maximal frequent episode is a frequent episode that is not included in a larger frequent episode (see the paper for a more formal definition).
For example, if we set minSup = 2 and maxWindow = 2, and apply an algorithm like EMMA to find all frequent episodes, we would obtain 6 frequent episodes with the following support values:
Frequent episode | Support
<(1)> | 5
<(2)> | 3
<(1, 2)> | 2
<(1), (1)> | 3
<(1), (2)> | 2
<(1), (1, 2)> | 2
For instance, the last episode <(1), (1, 2)> (indicating that event 1 was followed by events 1 and 2 at the same time) has a support of 2.
But it can be observed that many of these episodes are not maximal. For example, the episode <(1, 2)> is not maximal because it is included in <(1), (1, 2)>.
If we apply MaxFEM, we find that there is actually only one maximal frequent episode:
Frequent episode | Support
<(1), (1, 2)> | 2
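To make the notion of a maximal episode concrete, the following Python sketch filters the six frequent episodes listed above and keeps those that are not included in another one. The inclusion test is an assumption made for this illustration: an episode is treated as included in another if its event sets can be mapped, in order, to event sets of the other episode that contain them (the paper gives the formal definition). The sketch is a post-processing illustration, not the way MaxFEM finds maximal episodes, and the function name is_included is hypothetical.

```python
def is_included(small, large):
    """True if every event set of `small` maps, in order, to a distinct
    event set of `large` that contains it (assumed inclusion relation)."""
    pos = 0
    for itemset in small:
        while pos < len(large) and not set(itemset) <= set(large[pos]):
            pos += 1
        if pos == len(large):
            return False
        pos += 1
    return True


# Frequent episodes and supports for minSup = 2 and maxWindow = 2
frequent = {
    ((1,),): 5,
    ((2,),): 3,
    ((1, 2),): 2,
    ((1,), (1,)): 3,
    ((1,), (2,)): 2,
    ((1,), (1, 2)): 2,
}

# Keep the episodes that are not included in any other frequent episode
maximal = [
    episode for episode in frequent
    if not any(episode != other and is_included(episode, other)
               for other in frequent)
]

print(maximal)  # [((1,), (1, 2))]
```

Under this assumed inclusion relation, every other frequent episode is included in <(1), (1, 2)>, which is why it is the only one that remains.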
Input file format
The input file format of that algorithm is a text file. An item (event) is represented by a positive integer. A transaction (event set) is a line in the text file. In each line (event set), items are separated by a single space. It is assumed that all items (events) within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same event set. Each line is optionally followed by the character "|" and then the timestamp of the event set (line). Note that it is possible to run MaxFEM on a database that has no timestamps. In that case, the boolean parameter (the third parameter) should be set to true to indicate that the dataset has no timestamps. If there are no timestamps and the parameter is set to true, the algorithm will assume that each line has a timestamp that increases by 1 from one line to the next.
In the previous example, the input file is defined as follows:
1|1
1|2
1 2|3
1|6
1 2|7
3|8
2|9
4|11
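As an illustration of this format, the following Python sketch reads such lines into a list of (timestamp, items) pairs. It is not SPMF's own reader. The function name parse_event_sequence is hypothetical, and the choice of starting the automatic timestamps at 1 when no timestamps are provided is an assumption of this sketch.

```python
def parse_event_sequence(lines, self_increment=False):
    """Parse lines of the form "item item ...|timestamp".

    If self_increment is True, timestamps in the file are ignored and each
    line receives the timestamp 1, 2, 3, ... (starting at 1 is an assumption).
    """
    sequence = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        items_part, _, time_part = line.partition("|")
        items = [int(token) for token in items_part.split()]
        timestamp = len(sequence) + 1 if self_increment else int(time_part)
        sequence.append((timestamp, items))
    return sequence


# The example sequence, with the timestamps from the table of time points
example_lines = ["1|1", "1|2", "1 2|3", "1|6", "1 2|7", "3|8", "2|9", "4|11"]
for timestamp, items in parse_event_sequence(example_lines):
    print(timestamp, items)
```

For a file without timestamps, parse_event_sequence(["1", "1", "1 2"], self_increment=True) assigns the timestamps 1, 2 and 3 to the three lines.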
Output file format
The output file format is defined as follows. It is a text file. Each line is a frequent episode. Each event in a frequent episode is a positive integer, and events from the same event set within an episode are separated by single spaces. The value "-1" indicates the end of an event set. On each line, the episode is first indicated. Then, the keyword "#SUP:" appears, followed by a positive integer indicating the support of the episode. For example, the output file for the previous example contains the following line:
1 -1 1 2 -1 #SUP: 2
This line indicates that event 1 followed by events 1 and 2 simultaneously has a support of 2.
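A line in this format can be read back with a few lines of Python. The sketch below assumes the layout described above (event sets terminated by "-1", followed by the keyword "#SUP:" and the support). The function name parse_output_line is hypothetical and not part of SPMF.

```python
def parse_output_line(line):
    """Split an output line into (episode, support).

    The episode is returned as a list of event sets, each a list of integers.
    """
    pattern_part, _, support_part = line.partition("#SUP:")
    episode, current = [], []
    for token in pattern_part.split():
        if token == "-1":
            episode.append(current)  # "-1" closes the current event set
            current = []
        else:
            current.append(int(token))
    return episode, int(support_part)


episode, support = parse_output_line("1 -1 1 2 -1 #SUP: 2")
print(episode, support)  # [[1], [1, 2]] 2
```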
Performance
This is the original implementation of MaxFEM. In the paper proposing MaxFEM, the algorithm was compared with EMMA and was reported to perform well while reducing the number of patterns presented to users.
There also exists a version of MaxFEM for finding all frequent episodes. This version is called AFEM and it is also offered in SPMF.
Implementation details
The version implemented here contains all the optimizations described in the paper proposing MaxFEM.
Where can I get more information about the MaxFEM algorithm?
The MaxFEM algorithm is described in this article:
Fournier-Viger, P., Nawaz, M. S., He, Y., Wu, Y., Nouioua, F., Yun, U. (2022). MaxFEM: Mining Maximal Frequent Episodes in Complex Event Sequences. Proc. of the 15th Multi-disciplinary International Conference on Artificial Intelligence (MIWAI 2022), 12 pages, Springer LNAI, to appear.