Mining Compressing Patterns in Genome Sequences using the HMG Algorithm (SPMF documentation)

This example explains how to run the HMG algorithm using the SPMF open-source data mining library.

How to run this example?

Two version of the HMG algorithm are offered in SPMF. The version called HMG-SA adopts a simulated annealing approach to find patterns, while the HMG-GA version utilizes a genetic algorithm.

To run the example for HMG-SA:

To run the example for HMG-GA:

What is HMG?

HMG is an algorithm for discovering compressing sequential patterns in genome data. It was proposed by Nawaz et al. (2025).

It is designed to find the patterns that best describes an input genome database (file).

A sequential pattern (also called k-mer) is a consecutive subsequence of nucleotides, and the patterns are called "compressing patterns " because they are intended to capture the most important characteristics of the database.

The user can choose the number of patterns to discover and this number will be returned by the algorithm..

The HMG algorithm uses a heuristic approach to search for the best compressing patterns and two methods are offered called HMG-GA (based on a genetic algorithm) and HMG-SA (based on simulated annealing).

For the HMG-GA method, the user has in addition the possibility of choosing the number of generations (a good default value is 10), a crossover variant (SINGLE OR CYCLE) and a mutation variant (SINGLE or SCRAMBLE) for generating patterns.

Besides the user can choose between two output formats: strings of characters such as ACGTA or the SPMF format.

What is the input of the HMG algorithm?

The input is a genomic sequence, represented as a contiguous string of nucleotide characters (A, C, T, G).

For example: ATAGCAGTAAAGTAACTACCAG, HMG processes this sequence to discover compressing k-mers (short nucleotide subsequences).

What is the output of the HMG algorithm?

HMG is an algorithm for iteratively discovering k-mers that provide the best compression of the genomic sequence. The core idea is to find a k-mer that appears frequently, replace all its occurrences with a unique code (e.g., a single symbol such as 'X'), and then repeat the process on the newly compressed sequence to find the next most compressing k-mer.
For example, if HMG is run on the sequence ATAGCAGTAAAGTAACTACCAG with the user setting the required number of patterns to 2, HMG can discover two patterns:

Pattern # Selected k-mer Support Updated genome sequeunce
1 TA 4 AXAGCAGXAAGXACXCCAG
2 CA 2 AXAGYGXAAGXAXCYG


The first line represent the pattern TA, and it appears 4 times in the input sequence.
The second line is the pattern CA, which occurs 2 times in the input sequence.

Because HMG is a randomized algorithm applying a heuristic search to find patterns, if it is run multiple times, the output can be different set of patterns, but they should all be good patterns.

Explanation of the Process:

While selecting the first pattern, HMG analyzes the original sequence and finds that the k-mer TA appears 4 times (at positions 2, 7, 13, and 16). It has the highest compression potential. All occurrences of TA are replaced with a unique code X. The sequence is now compressed to AXAGCAGXAAGXACXCCAG.
For the next pattern, HMG now analyzes the compressed sequence AXAGCAGXAAGXACXCCAG. In this new context, it finds that the k-mer CA appears twice. It is selected and its occurrences are replaced with a code Y, resulting in AXAGYGXAAGXAXCYG.

The final output would be the list of these discovered k-mers and their support (number of occurrences) within the sequence at the time they were found:

TA #SUP: 4
CA #SUP: 2

This output shows that TA was the most significant pattern in the original sequence, and CA was the most significant pattern remaining after the first compression step. Together, these k-mers efficiently capture the repetitive structures within the genome.

Parameters offered only for HMP-GA

If using the HMG-GA algorithm, two additional parameter(s) are

Input file format

The default input file format is a text file containing a sequence of characters. For instance, the file AeCa used in this example contains data like this:

GCCGCCCCCATGGTCCATACGGTGTGCGAATACGGCGTGGCCCTCCTTACCCCATCCAGGCCTCTCTACGCCCCACTTGTCTATAGTGCCTTTCACGACCCTGGCCACAAGGTCGATCGCGTACTCCCAGGAGACAGGCTCTAGCCTTCCTCCGAACCTTACGAGGGGCTTTGTCAGCCTCCTCTCGCCTGCTATGTTCCATGGGCTCCATATCCTCTCAGCCTGCGTTCCCCCCCTCACCGAATGGTTGCCCCAGTTTATGGGGCACTCGGGGCTGGGTATCTGTGCGACGTAAACCTCTCTCCAGCTTCCTGTGCCTCTACCCTTCTCCCAGGTCCTCCTTATCGTCTTTACAACCATAGCCTCGGAGATCCAGGGCTCTCCCTCCTCCGCGCTCAGACCCACGTGCTCCTTTGTAAAGTCCGGCTGCTTCACAAGATTCCTCCCATAAGCCTCGTCGACTATCTTATA

Here only a small portion of the file is displayed.

Besides, the algorithm also accepts file in the FASTA format (.fasta or .fa).

Output file format

The default output file format is a text file containing a pattern on each line.
For instance, in this example, the output file may be:

TA #SUP: 4
CA #SUP: 2

This indicate that the pattern TA has a support of 4 and the pattern CA has a support of 2.

Alternative output file format

There is an alternative output format that is offered, which is the SPMF format. It can be activated by setting the parameter SPMF output style to true.

@CONVERTED_FROM_TEXT
@ITEM=65=A
@ITEM=67=C
@ITEM=84=T
84 -1 65 -1 -2 #SUP: 4
67 -1 65 -1 -2 #SUP: 2

In this format, the file starts by @CONVERTED_FROM_TEXT followed by lines that give a numerical code to each nucleotice.

Such line starts by @ITEM= followed by a code, then '=', and then the nucleotide as a letter.
Then the following lines each describe a pattern.
A pattern is a list of codes representing nucleotides separated by -1 and ending by -2, and then followed by the keyword #SUP: and then the support of the pattern.

For example, the line 84 -1 65 -1 -2 #SUP: 4 means that the pattern TA has a support of 4.

Performance

This is the Java version of HMG made by the original authors of HMG.

Where can I get more information about the HMG algorithm?

This is the paper proposing the HMG algorithm:

Nawaz et al. (2025) Efficient genome sequence compression via the fusion of MDL-based heuristics. Information Fusion Volume 120, August 2025, 103083