Perform Sequence Prediction using the LZ78 Sequence Prediction Model (SPMF documentation)

This example explains how to run the LZ78 algorithm using the SPMF open-source data mining library.

How to run this example?

To run the implementation of LZ78

What is LZ78 ?

LZ78 is a sequence prediction model proposed by Ziiv & Lempel (1978), that is also a compression algorithm. This model is here used for performing sequence predictions. A sequence prediction consists of predicting the next symbol of a sequence based on a set of training sequences. The task of sequence prediction has numerous applications in various domains. For example, it can be used to predict the next webpage that a user will visit based on previously visited webpages by the user and other users.

The LZ78 prediction model is quite simple and can be outperformed by newer models such as CPT+ in terms of prediction accuracy.

This implementation has been obtained from the ipredict project.

What is the input of LZ78?

The input of LZ78 is a sequence database containing training sequences. These sequences are used to train the prediction model.

In the context of LZ78, a sequence database is a set of sequences where each sequence is a list of items (symbols). For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 items. This sequence means that item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. This database is provided in the file "contextCPT.txt" of the SPMF distribution.

ID Sequences
S1 (1), (2), (3), (4), (6)
S2 (4), (3), (2), (5)
S3 (5), (1), (4), (3), (2)
S4 (5), (7), (1), (4), (2), (3)

What is the output of LZ78?

LZ78 performs sequence prediction. After LZ78 has been trained with the input sequence database, it can predict the next symbol of a new sequence.

For example, if LZ78 is trained with the previous sequence database and parameters, it will predict that the next symbol following the sequence (1),(4) is the symbol (2).

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is separated by single space and a -1. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextCPT.txt" contains the following four lines (four sequences).

1 -1 2 -1 3 -1 4 -1 6 -1 -2
4 -1 3 -1 2 -1 5 -1 -2
5 -1 1 -1 4 -1 3 -1 2 -1 -2
5 -1 7 -1 1 -1 4 -1 2 -1 3 -1 -2

The first line represents a sequence where the item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. The next lines follow the same format.

Performance

LZ78 is a sequence prediction model that is quite simple and thus is generally memory efficient. But, it can often be outperformed in terms of prediction accuracy by newer models such as CPT+.

Where can I get more information about LZ78?

The LZ78 sequence prediction model was proposed in this paper:

Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. Information Theory, IEEE Transactions on 24(5), 530-536 (1978)