Perform Sequence Prediction using the PPM Sequence Prediction Model (SPMF documentation)

This example explains how to run the PPM algorithm using the SPMF open-source data mining library.

How to run this example?

To run the implementation of PPM

What is PPM?

PPM (Prediction by Partial Matching) is a sequence prediction model proposed by Cleary & Witten (1984). It is used for performing sequence predictions. A sequence prediction consists of predicting the next symbol of a sequence based on a set of training sequences. The task of sequence prediction has numerous applications in various domains. For example, it can be used to predict the next webpage that a user will visit based on previously visited webpages by the user and other users.

The PPM prediction model is quite simple, which is one reason why it is still popular. However, it can be outperformed by newer models such as CPT+ in terms of prediction accuracy.

It is important to note that the PPM implementation is a PPM model of order 1. PPM models of higher order are not supported in this implementation.

This implementation has been obtained from the ipredict project.

What is the input of PPM?

The input of PPM is a sequence database containing training sequences. These sequences are used to train the prediction model.

In the context of PPM, a sequence database is a set of sequences where each sequence is a list of items (symbols). For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 items. This sequence means that item 1 was followed by item 2, followed by item 3, followed by item 4, and followed by item 6. This database is provided in the file "contextCPT.txt" of the SPMF distribution.

ID Sequences
S1 (1), (2), (3), (4), (6)
S2 (4), (3), (2), (5)
S3 (5), (1), (4), (3), (2)
S4 (5), (7), (1), (4), (2), (3)

What is the output of PPM?

PPM performs sequence prediction. After PPM has been trained with the input sequence database, it can predict the next symbol of a new sequence.

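For example, if PPM is trained with the previous sequence database and parameters, it will predict that the next symbol following the sequence (1),(4) is the symbol (3). This is consistent with a model of order 1, where only the last symbol (4) matters: in the training sequences, item 4 is immediately followed by item 3 twice (in S2 and S3), by item 6 once (in S1), and by item 2 once (in S4).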
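The prediction above can be reproduced with a minimal frequency-count sketch of an order-1 predictor. This is not the SPMF or ipredict code: the function names train_order1 and predict_next are hypothetical, and the sketch leaves out the escape mechanism that full PPM uses to fall back to shorter contexts. It only illustrates the idea of predicting the most frequent successor of the last symbol.

```python
from collections import Counter, defaultdict

def train_order1(sequences):
    """Count how often each symbol is immediately followed by each other symbol."""
    successors = defaultdict(Counter)
    for seq in sequences:
        # Pair every symbol with the symbol that directly follows it.
        for prev, nxt in zip(seq, seq[1:]):
            successors[prev][nxt] += 1
    return successors

def predict_next(successors, prefix):
    """Return the most frequent successor of the last symbol, or None if unseen."""
    if not prefix or prefix[-1] not in successors:
        return None
    return successors[prefix[-1]].most_common(1)[0][0]

# The four training sequences S1 to S4 from the table above.
training = [
    [1, 2, 3, 4, 6],
    [4, 3, 2, 5],
    [5, 1, 4, 3, 2],
    [5, 7, 1, 4, 2, 3],
]

model = train_order1(training)
prediction = predict_next(model, [1, 4])
print(prediction)  # prints 3
```

Only the last symbol of the prefix is used, so the sequences (1),(4) and (4) receive the same prediction in this sketch.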

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item of a sequence is followed by a single space and the value "-1", which acts as a separator. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextCPT.txt" contains the following four lines (four sequences).

1 -1 2 -1 3 -1 4 -1 6 -1 -2
4 -1 3 -1 2 -1 5 -1 -2
5 -1 1 -1 4 -1 3 -1 2 -1 -2
5 -1 7 -1 1 -1 4 -1 2 -1 3 -1 -2

The first line represents a sequence where item 1 was followed by item 2, followed by item 3, followed by item 4, and followed by item 6. The next lines follow the same format.
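As an illustration of this format, the following sketch parses such lines into lists of items. It is not part of SPMF: the helper name parse_sequence_line is hypothetical, and the sketch assumes that every item is a single integer, as in the example file.

```python
def parse_sequence_line(line):
    """Parse one line such as '1 -1 2 -1 -2' into the list of items [1, 2]."""
    items = []
    for token in line.split():
        value = int(token)
        if value == -2:   # -2 marks the end of the sequence
            break
        if value != -1:   # -1 only separates items, so it is skipped
            items.append(value)
    return items

# The four lines of the example file, copied from above.
lines = [
    "1 -1 2 -1 3 -1 4 -1 6 -1 -2",
    "4 -1 3 -1 2 -1 5 -1 -2",
    "5 -1 1 -1 4 -1 3 -1 2 -1 -2",
    "5 -1 7 -1 1 -1 4 -1 2 -1 3 -1 -2",
]

database = [parse_sequence_line(line) for line in lines]
print(database[0])  # prints [1, 2, 3, 4, 6]
```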

Performance

This implementation of PPM (order 1) is a Markovian sequence prediction model: it assumes that the next symbol depends only on the previous symbol. This results in a very simple model that is memory efficient. However, it can be outperformed in terms of prediction accuracy by newer models such as CPT+.

Where can I get more information about PPM?

The PPM sequence prediction model was proposed in this paper:

J. G. Cleary, I. Witten, "Data compression using adaptive coding and partial string matching". IEEE Transactions on Communications, vol. 32, pp. 396-402, 1984.