Perform Sequence Prediction using the CPT+ Sequence Prediction Model (SPMF documentation)

This example explains how to run the CPT+ algorithm using the SPMF open-source data mining library.

How to run this example?

To run the implementation of CPT+

What is CPT+ (Compact Prediction Tree+)?

CPT+ (Compact Prediction Tree+) is a sequence prediction model. It is used for performing sequence predictions. A sequence prediction consists of predicting the next symbol of a sequence based on a set of training sequences. The task of sequence prediction has numerous applications in various domains. For example, it can be used to predict the next webpage that a user will visit based on previously visited webpages by the user and other users.

The CPT+ prediction model (2015) is an improved version of the CPT model (Gueniche et al., 2013). CPT+ was shown to provide better accuracy than several state-of-the-art prediction models such as DG, AKOM, TDAG, PPM and CPT on various datasets (see Gueniche et al., 2015 for details).

The implementation of CPT+ in SPMF is the original implementation (obtained from the ipredict project).

What is the input of CPT+?

The input of CPT+ is a sequence database containing training sequences. These sequences are used to train the prediction model.

In the context of CPT+, a sequence database is a set of sequences where each sequence is a list of items (symbols). For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 items. This sequence means that item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. This database is provided in the file "contextCPT.txt" of the SPMF distribution.

ID Sequences
S1 (1), (2), (3), (4), (6)
S2 (4), (3), (2), (5)
S3 (5), (1), (4), (3), (2)
S4 (5), (7), (1), (4), (2), (3)

CPT+ also takes as set of parameters as input.

What is the output of CPT+?

CPT+ performs sequence prediction. After CPT+ has been trained with the input sequence database, it can predict the next symbol of a new sequence.

For example, if CPT+ is trained with the previous sequence database and parameters, it will predict that the next symbol following the sequence (1),(2) is the symbol (3).

Parameter(s)

There are several parameters that can be set for CPT+. In the source code, these parameters are passed as a string to the class CPTPlusPredictor. The main parameters are:

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is separated by single space and a -1. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextCPT.txt" contains the following four lines (four sequences).

1 -1 2 -1 3 -1 4 -1 6 -1 -2
4 -1 3 -1 2 -1 5 -1 -2
5 -1 1 -1 4 -1 3 -1 2 -1 -2
5 -1 7 -1 1 -1 4 -1 2 -1 3 -1 -2

The first line represents a sequence where the item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. The next lines follow the same format.

Performance

CPT+ is a sequence prediction models that often provides the best accuracy according to a performance comparison (see Gueniche et al., 2015). Training the model is also very fast. However, performing a prediction may be slower than with some other models.

Where can I get more information about CPT+?

The CPT+ (Compact Prediction Tree+) model is described in this article:

Gueniche, T., Fournier-Viger, P., Raman, R., Tseng, V. S. (2015). CPT+: Decreasing the time/space complexity of the Compact Prediction Tree. Proc. 19th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD 2015), Springer, LNAI9078, pp. 625-636.

The original CPT algorithm was described in this paper:

Gueniche, T., Fournier-Viger, P., Tseng, V. S. (2013). Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction. Proc. 9th Intern. Conference on Advanced Data Mining and Applications (ADMA 2013) Part II, Springer LNAI 8347, pp. 177-188.

Also, you can watch a video presentation of the CPT and CPT+ models (47 minutes)