Perform Sequence Prediction using the CPT+ Sequence Prediction Model (SPMF documentation)

This example explains how to run the CPT+ algorithm using the SPMF open-source data mining library.

How to run this example?

To run the implementation of CPT+

This Example is not available in the graphical user interface of SPMF.
If you are using the source code version of SPMF, launch the file "MainTestCPTPlus.java" in the package ca.pfv.SPMF.tests

What is CPT+ (Compact Prediction Tree+)?

CPT+ (Compact Prediction Tree+) is a sequence prediction model. It is used for performing sequence predictions. A sequence prediction consists of predicting the next symbol of a sequence based on a set of training sequences. The task of sequence prediction has numerous applications in various domains. For example, it can be used to predict the next webpage that a user will visit based on previously visited webpages by the user and other users.

The CPT+ prediction model (2015) is an improved version of the CPT model (Gueniche et al., 2013). CPT+ was shown to provide better accuracy than several state-of-the-art prediction models such as DG, AKOM, TDAG, PPM and CPT on various datasets (see Gueniche et al., 2015 for details).

The implementation of CPT+ in SPMF is the original implementation (obtained from the ipredict project).

What is the input of CPT+?

The input of CPT+ is a sequence database containing training sequences. These sequences are used to train the prediction model.

In the context of CPT+, a sequence database is a set of sequences where each sequence is a list of items (symbols). For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 items. This sequence means that item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. This database is provided in the file "contextCPT.txt" of the SPMF distribution.

ID	Sequences
S1	(1), (2), (3), (4), (6)
S2	(4), (3), (2), (5)
S3	(5), (1), (4), (3), (2)
S4	(5), (7), (1), (4), (2), (3)

CPT+ also takes as set of parameters as input.

What is the output of CPT+?

CPT+ performs sequence prediction. After CPT+ has been trained with the input sequence database, it can predict the next symbol of a new sequence.

For example, if CPT+ is trained with the previous sequence database and parameters, it will predict that the next symbol following the sequence (1),(2) is the symbol (3).

Parameter(s)

There are several parameters that can be set for CPT+. In the source code, these parameters are passed as a string to the class CPTPlusPredictor. The main parameters are:

CCF:true : if true, the Frequent Subsequence Compression (FSC) strategy will be activated. Otherwise, if false, it is deactivated. This strategy can slows down the processing time but it can be used to reduce the memory usage of CPT+
CBS:true : if true, the Simple Branches Compression (SBC) strategy will be activated. Otherwise, if false, it is deactivated. This strategy can slows down the processing time but it can be used to reduce the memory usage of CPT+
CCFmin:1 : this parameter allows to specify the minimum subsequence length for the FSC strategy. It takes a positive integer as value such as 1 in this example. This parameter is only needed when the FSC strategy is activated
CCFmax:6 : this parameter allows to specify the maximum subsequence length for the FSC strategy. It takes a positive integer as value such as 6 in this example. This parameter is only needed when the FSC strategy is activated.
CCFsup:2 : this parameter allows to specify the minimum support (minsup) for the FSC strategy. It takes a positive integer as value such as 2 in this example. This parameter is only needed when the FSC strategy is activated.
splitMethod:0: This parameter indicates whether the training sequences should be kept entirely or if they should be cut to reduce memory usage when training the model. If this parameter is set to 0, the training sequences are kept entirely. If the parameter is set to a value k > 0, then only the last k items of each sequences will be used for training. This can reduce memory usage by reducing the size of the prediction model.
minPredictionRatio:1.0: This parameter takes a value in the [0,1] interval that affects the algorithms's accuracy. A high value may offer more accurate predictions but it may reduces the coverage while a lower value typically boost the coverage and reduce the overall accuracy.
noiseRatio:1.0: This is a parameter of the Prediction with improved Noise Reduction (PNR) strategy. It takes a value in the [0,1] interval that represents the ratio of items in each sequence that should be considered as noise.

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is separated by single space and a -1. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextCPT.txt" contains the following four lines (four sequences).

1 -1 2 -1 3 -1 4 -1 6 -1 -2
4 -1 3 -1 2 -1 5 -1 -2
5 -1 1 -1 4 -1 3 -1 2 -1 -2
5 -1 7 -1 1 -1 4 -1 2 -1 3 -1 -2

The first line represents a sequence where the item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. The next lines follow the same format.

Performance

CPT+ is a sequence prediction models that often provides the best accuracy according to a performance comparison (see Gueniche et al., 2015). Training the model is also very fast. However, performing a prediction may be slower than with some other models.

Where can I get more information about CPT+?

The CPT+ (Compact Prediction Tree+) model is described in this article:

Gueniche, T., Fournier-Viger, P., Raman, R., Tseng, V. S. (2015). CPT+: Decreasing the time/space complexity of the Compact Prediction Tree. Proc. 19th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD 2015), Springer, LNAI9078, pp. 625-636.

The original CPT algorithm was described in this paper:

Gueniche, T., Fournier-Viger, P., Tseng, V. S. (2013). Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction. Proc. 9th Intern. Conference on Advanced Data Mining and Applications (ADMA 2013) Part II, Springer LNAI 8347, pp. 177-188.

Also, you can watch a video presentation of the CPT and CPT+ models (47 minutes)