Mining frequent sequential patterns with periodic wilcard gaps in a sequence of characters with the MAPD algorithm (SPMF documentation)

This example explains how to run the MAPD algorithm using the SPMF open-source data mining library.

How to run this example?

What is MAPD?

MAPD is an algorithm for discovering nonoverlapping sequential patterns in a sequence of characters. This algorithm was proposed by Wu et al. (2014). This algorithm is designed to find patterns that contain wildcards between characters and where the number of wildcards between any two consecutive characters is similar.

This is a Java version, converted from the original C++ implementation by Youxi Wu et al. (available at : https://github.com/wuc567/Pattern-Mining)

What is the input of MAPD?

The input is a sequence of symbols (e.g. events or characters) and five parameters that are set by the user:

A sequence for this algorithm is an ordered list of characters that are taken from a character set.

To run the MAPD algorithm, an input file must be given, which contains a sequence. For example, the file contextMAPD.txt of the SPMF distribution contains the sequence ttcctccgcg.

This sequence can represent for example a biological sequence, indicating that the symbol T, was followed by T, then followed by C, by C, by T, by C, by C, by G, by C, and by G.

In the input file format, the sequence is simply provided as a line in a text file:

ttcctccgcg

In that format, each symbol is a single character such as : a, c, g, t.... This sequence from the file contextMAPD.txt swill be used for this example.

What is the output of MAPD?

The output of MAPD is the set of frequent patterns with periodic wildcard gaps.

To try to give a more precise explanation of what is the output of MAPD, it is necessary to review some definitions.

Let Σ bet the set of possible characters that can appear in a sequence. For example, in this example Σ = {c,g,t}.

A pattern P is a sequence of characters that can appear several times in a sequence. For example, a pattern P containing m characters can be denoted as P = p0 p1 p2 p3 ... pm indicating that p0 was followed by p1, then followed by p2... and so on.

The goal of MAPD is to find patterns that appear many times, and where some characters might be skipped when matching a pattern with the input sequence.

In other words, the MAPD algorithm is designed to find patterns of the above form where the characters do not need to appear contiguously in the input sequence. MAPD allows to match a pattern with a sequence by using some wildcards, which means to skip one or more characters.

More precisely, MAPD is designed to find what is called patterns with periodic wildcard gaps. Such pattern has the form: P = p0 p1 p2 p3 ... pm and require that the gaps between any pair of consecutive characters in the pattern have the same minimum and maximum number of wildcards (characters that are skipped).

The algorithm allows to set a threshold, which is the minimum number of occurrences that a pattern should have to be output, the minimum and maximum size of the gaps in terms of number of wildcards, and the maximum length of the patterns that are output.

For example, for the sequence ttcctccgcg, if we set the minimum threshold = 1, maxlength = 20, mingap = 0, maxgap =2, the set of frequent patterns with periodic wildcard gaps is:

Frequent patterns

Support

c

0.5

g

0.3

t

0.2

c*c

0.8

c*g 0.6
t*c 0.8

For instance, the last line t*c of this table indicates that t followed by c after some gap has a support of 0.8 in this sequence. The support of a pattern is its number of non-overlapping occurrences divided by the length of the sequence, and it is required to be no less than the threshold taken as parameter. In this example, t*c has 8 non-overlapping occurrences in this sequence of length 10, and hence its support is 0.8.

Since the definition used in this paper is relatively complex, for a more precise and formal definition of the output, please see the research paper describing the MAPD algorithm (Wu et al., 2014).

Input file format

The input file format is a text file. A line contain a sequence of character.

For example, the file contextMAPD.txt contains this content:

ttcctccgcg

Output file format

The output file format is defined as follows. It is a text file. Each line is a frequent pattern, which is a list of characters. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the pattern as a number of occurrences. For example, a few lines from the output file from the previous example are shown below:

c #SUP: 0.5
t #SUP: 0.3
g #SUP: 0.2
cc #SUP: 0.8
cg #SUP: 0.6
tc #SUP: 0.8

The fourth line is the frequent pattern consisting of the symbol "c" followed by the symbol "c" and indicates that it has a support of 0.2.

Performance

This is the original implementation of MAPD. By using the NetTree data structure, MAPD can avoid redundant calculations and process long sequences to find patterns (see performance experiments in the paper, for example).

Where can I get more information about MAPD?

The MAPD algorithm is described in this article:

Youxi Wu, Yao Tong, Xingquan Zhu, Xindong Wu: MAPD: Nonoverlapping Sequence Pattern Mining With Gap Constraints. IEEE Trans. Cybern. 48(10): 2809-2822 (2018)