Perform Sequence Prediction using the DG Sequence Prediction Model (SPMF documentation)

This example explains how to run the DG algorithm using the SPMF open-source data mining library.

How to run this example?

To run the implementation of DG (Dependency Graph)

What is DG (Dependency Graph)?

DG (Dependency Graph) is a sequence prediction model proposed by Padmanabhan & Mogul (1996). It is used for performing sequence predictions. A sequence prediction consists of predicting the next symbol of a sequence based on a set of training sequences. The task of sequence prediction has numerous applications in various domains. For example, it can be used to predict the next webpage that a user will visit based on previously visited webpages by the user and other users.

The DG prediction model is quite simple. This is one reason why it is still popular. But can it be outperformed by newer models such as CPT+ in terms of prediction accuracy

This implementation has been obtained from the ipredict project.

What is the input of DG?

The input of DG is a sequence database containing training sequences. These sequences are used to train the prediction model.

In the context of DG, a sequence database is a set of sequences where each sequence is a list of items (symbols). For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 items. This sequence means that item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. This database is provided in the file "contextCPT.txt" of the SPMF distribution.

ID Sequences
S1 (1), (2), (3), (4), (6)
S2 (4), (3), (2), (5)
S3 (5), (1), (4), (3), (2)
S4 (5), (7), (1), (4), (2), (3)

Parameter(s)

The DG algorithm takes the "lookahead window" as parameter. In the source code, the look-ahead window value is passed as a string to the class DGPredictor. For example, the string "lookahead:2" means to set the look-ahead windows to 2. This means that DG will assume that a symbol only depends on the two previous symbols for performing a prediction.

What is the output of DG?

DG performs sequence prediction. After DG has been trained with the input sequence database, it can predict the next symbol of a new sequence.

For example, if DG is trained with the previous sequence database and parameters, it will predict that the next symbol following the sequence (1),(4) is the symbol (3).

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is separated by single space and a -1. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextCPT.txt" contains the following four lines (four sequences).

1 -1 2 -1 3 -1 4 -1 6 -1 -2
4 -1 3 -1 2 -1 5 -1 -2
5 -1 1 -1 4 -1 3 -1 2 -1 -2
5 -1 7 -1 1 -1 4 -1 2 -1 3 -1 -2

The first line represents a sequence where the item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. The next lines follow the same format.

Performance

DG is a simple sequence prediction model that is thus memory efficient. However, it can be outperformed in terms of prediction accuracy by newer models such as CPT+.

Where can I get more information about DG?

The DG sequence prediction model was proposed in this paper:

V. N. Padmanabhan, J. C. Mogul, "Using predictive prefetching to improve world wide web latency". ACM SIGCOMM Computer Communication Review, vol. 26, pp. 22-36, 1996.