Mining Top-k High Utility Itemsets based on the Information Gain for Supervised Learning using the THUIsl Algorithm (SPMF documentation)

This example explains how to run the THUIsl algorithm.

How to run this example?

This algorithm is not offered in the release version of SPMF.
To run this example with the source code version of SPMF, launch the file "MaintTestTHUIsl.java" in the package ca.pfv.SPMF.tests.

What is the THUIsl algorithm?

Top-k High Utility Itemsets - Supervised learning (THUIsl) takes as input a tabular dataset in CSV format that is used for supervised learning.

Then, the algorithm transforms the dataset into a transaction database format and extracts the top-k high utility itemsets based on the information gain for the purpose of supervised learning.

The THUIsl algorithm was proposed as part of the HUG-IML framework (High Utility Gain - Interpretable Machine Learning), which embeds extracted high-utility patterns into interpretable classifier models, such as logistic regression. The framework supports both binary and multi-class classification problems. Full details on the HUG-IML methodology, benchmark results, and application scenarios can be found in the IEEE Access paper: Interpretable classifier models for decision support using high utility gain patterns, IEEE Access 2024, DOI: https://doi.org/10.1109/ACCESS.2024.3455563.

THUIsl requires a configuration file and a dataset. The algorithm first parses metadata such as the target column, the skipped columns, and the indices of the numeric and categorical columns. Numeric attributes are discretized into B bins using equal-width or quantile-based discretization, while categorical attributes are directly mapped to unique item IDs. Each dataset record is then converted into a set of attribute-value items, which are aggregated into transactions used for high utility itemset mining. The process generates a mapping file linking item IDs to original attributes and values, ensuring full interpretability and reproducibility of the derived patterns.
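
The discretization and item-mapping steps described above can be illustrated with a minimal Python sketch. It is not the SPMF implementation: the function names equal_width_edges, bin_index, and assign_item_ids are hypothetical, only equal-width binning is shown, and the actual algorithm may treat bin boundaries and item numbering differently.

```python
from bisect import bisect_right


def equal_width_edges(values, num_bins):
    """Return num_bins + 1 edges that split [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins + 1)]


def bin_index(value, edges):
    """Return the zero-based bin of value; the maximum falls in the last bin."""
    idx = bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)


def assign_item_ids(labels):
    """Give each distinct attribute=value label an ID (1, 2, ...) in first-seen order."""
    ids = {}
    for label in labels:
        ids.setdefault(label, len(ids) + 1)
    return ids


# Age values of the twelve records in the sample dataset, with B = 5.
ages = [22, 38, 24, 21, 39, 30, 48, 35, 27, 35, 19, 49]
edges = equal_width_edges(ages, 5)
items = [
    f"Age=[{edges[b]}-{edges[b + 1]}]"
    for b in (bin_index(age, edges) for age in ages)
]
item_ids = assign_item_ids(items)
print(edges)
print(item_ids)
```

For these Age values the sketch produces the edges 19, 25, 31, 37, 43, and 49. Bins produced by quantile-based discretization would generally differ.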

What is the input?

To execute the algorithm, you need a CSV dataset and a configuration file. In this example, we will use the dataset contextUBTGenTitanicSample.csv and the configuration file configTitanic.txt, provided in the ca.pfv.SPMF.tests package.

The key parameters of the algorithm are:

B – Number of bins for numeric attributes; -1 = auto-determined.
L – Maximum pattern length; either an integer or the value all for unlimited length.
G – Minimum utility gain threshold; default = 1e-4.
topK – Number of top high-utility patterns to mine.
fsK – Feature selection size; defaults to topK.
modelTest – Boolean; true to transform the test dataset using the patterns mined from the training dataset.

Dataset file

The dataset must be in CSV (Comma-Separated Values) format with a header row. Each row represents one data instance (e.g., a passenger), and each column represents an attribute (e.g., age, fare, class). As an example, the content of the dataset contextUBTGenTitanicSample.csv is provided here:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,0,3,Thayer Mr. John Borland,male,24,0,0,A/5 21171,8.05,,S
17,0,2,Sloper Mr. William Thompson,male,21,0,0,SC/Paris 2123,13.0,,S
32,0,1,Harder Mrs. George W (Elizabeth Lydia Waldheim),female,39,1,0,PC 17585,89.1042,C123,C
47,0,3,Beesley Mr. Lawrence,male,30,0,0,SOTON/OQ 392076,10.5,,S
59,0,2,Anderson Mr. Harry,male,48,0,0,248744,13.0,,S
6,1,3,Allen Mr. William Henry,male,35,0,0,373450,8.05,,S
14,1,3,Moran Mr. James,male,27,0,0,330877,7.25,,Q
22,1,1,Futrelle Mrs. Jacques Heath (Lily May Peel),female,35,1,0,113803,53.1,C123,S
39,1,2,Saunders Miss. Eliza Mary,female,19,0,0,237736,26.0,,S
51,1,1,Brown Mrs. James Joseph (Margaret Tobin),female,49,1,0,113784,76.7292,C105,C

The format of this CSV file is explained as follows. The first line is the header and contains attribute names separated by commas. The subsequent lines are data records where each line represents one instance with attribute values separated by commas. Missing values are represented by empty fields, which appear as consecutive commas.

In this example, the dataset contains the following attributes:

PassengerId — unique identifier for each passenger
Survived — survival label (0 = did not survive, 1 = survived)
Pclass — passenger ticket class
Name — full name of the passenger
Sex — gender of the passenger
Age — age in years
SibSp — number of siblings/spouses aboard
Parch — number of parents/children aboard
Ticket — ticket number
Fare — ticket fare
Cabin — cabin number
Embarked — port of embarkation
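
To make the CSV conventions above concrete (header row, quoted fields, and empty fields for missing values), here is a small Python sketch that reads the first two records of the sample with the standard csv module. It only illustrates the file format; the helper name read_records is hypothetical and this is not how SPMF itself loads the file.

```python
import csv
import io

# Header and first two records of contextUBTGenTitanicSample.csv.
SAMPLE = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
'''


def read_records(text):
    """Parse CSV text into a list of dicts keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(SAMPLE)

# A missing value shows up as an empty string (two consecutive commas in the file).
missing = [name for name, value in records[0].items() if value == ""]
print(missing)
print(records[1]["Name"])
```

All values are read as strings, so numeric columns such as Age and Fare still need to be converted before discretization.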

Configuration file

The configuration file is a text file of key-value pairs that specify how to process the CSV dataset. Each line specifies one configuration parameter in the format parameter=value.

The configuration file configTitanic.txt used in this example has the following content:

inputFile=contextUBTGenTitanicSample.csv
outputFile=titanic_processed.csv
testFile=contextUBTGenTitanicSampleTest.csv
prefix=titanic
header=true
delimiter=,
targetColIndex=1
skipColsIndices=0,3,8,10
numericIntColsIndices=6,7
numericFloatColsIndices=5,9
catColsIndices=2,4,11
B=5
writeTransformParameters=false
missingValueImputation=false

The format of this configuration file is explained as follows. Each line has the form parameterName=value, with no spaces around the equals sign. Multiple values are separated by commas (e.g., skipColsIndices=0,3,8,10). Column indices are zero-based, meaning the first column has index 0. The parameters are described in the following table.

Parameter	Description
inputFile	Input training dataset (CSV)
outputFile	Transactional output file
testFile	Test dataset (optional)
prefix	Prefix for generated filenames
header	CSV has header row (true/false)
delimiter	CSV delimiter
targetColIndex	Target column index
skipColsIndices	Columns to ignore
numericIntColsIndices	Integer numeric columns
numericFloatColsIndices	Floating numeric columns
catColsIndices	Categorical columns
B	Bin count for numeric discretization
writeTransformParameters	Write transformation parameters to file
missingValueImputation	Enable missing value handling

Special case: When B = -1, the algorithm automatically determines the number of bins using quantile-based discretization.
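
As an illustration of the parameter=value format, the following Python sketch parses the content of configTitanic.txt and checks that every one of the twelve CSV columns is assigned exactly one role. The function parse_config is a hypothetical helper written for this page, not part of SPMF, and the type conversions it applies are assumptions based on the table above.

```python
CONFIG_TEXT = """\
inputFile=contextUBTGenTitanicSample.csv
outputFile=titanic_processed.csv
testFile=contextUBTGenTitanicSampleTest.csv
prefix=titanic
header=true
delimiter=,
targetColIndex=1
skipColsIndices=0,3,8,10
numericIntColsIndices=6,7
numericFloatColsIndices=5,9
catColsIndices=2,4,11
B=5
writeTransformParameters=false
missingValueImputation=false
"""


def parse_config(text):
    """Parse parameter=value lines into a dict with basic type conversion."""
    config = {}
    for line in text.splitlines():
        if not line.strip() or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.endswith("ColsIndices"):
            # Comma-separated list of zero-based column indices.
            config[key] = [int(v) for v in value.split(",") if v]
        elif key in ("targetColIndex", "B"):
            config[key] = int(value)
        elif value in ("true", "false"):
            config[key] = value == "true"
        else:
            # File names, prefix, and the delimiter are kept as plain strings.
            config[key] = value
    return config


config = parse_config(CONFIG_TEXT)
assigned = (
    [config["targetColIndex"]]
    + config["skipColsIndices"]
    + config["numericIntColsIndices"]
    + config["numericFloatColsIndices"]
    + config["catColsIndices"]
)
print(sorted(assigned))
```

In this example the target, skipped, numeric, and categorical indices together cover columns 0 to 11 with no overlap.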

What is the output?

The output of THUIsl consists of the following files:

Top-k high utility itemset (HUI) patterns: Lists the most significant patterns mined from the dataset along with their utility values.
Transformed transaction file: Each dataset record is transformed into item IDs corresponding to attribute-value pairs, reflecting the utility mining format used in SPMF.
Column mapping file: Links each numeric item ID in the transaction file back to its original attribute/value in the dataset, ensuring interpretability.
Intermediate file for utility mining: Encodes records with item IDs, transaction weights, and normalized numeric features used to compute HUI patterns.

Top-k HUI patterns

This file contains the mined high-utility patterns ranked by information gain. Each line represents one pattern with its utility score and information gain value.

The format consists of three columns: utility, pattern, and ig. The utility column shows the utility value of the pattern as a floating-point number. The pattern column displays the itemset pattern in human-readable format showing attribute equals value. The ig column shows the information gain score as a floating-point number.

An example of this file format is shown below:

utility pattern ig
.3500 'Sex=male' .13565559
5.0000 'Sex=female' .13565557
1.5879 'Fare=[53.1-76.7292]' .13230415
1.1005 'Age=[35.0-38.0]' .13230415
.4285 'Fare=[13.0-26.0]' .13230415
1.9695 'Pclass=1' .06465999
.1969 'Embarked=S' .06465997
.9354 'Fare=[76.7292-89.1042]' .06155537
.8688 'Age=[48.0-49.0]' .06155537
.8399 'Age=[39.0-48.0]' .06155537
.5792 'Age=[38.0-39.0]' .06155537
.5239 'Fare=[26.0-53.1]' .06155537
.3186 'Age=[27.0-30.0]' .06155537
.2317 'Age=[24.0-27.0]' .06155537
.1448 'Age=[22.0-24.0]' .06155537
.0657 'Fare=[10.5-13.0]' .06155537
.4569 'Embarked=C' .01879747
.0228 'Pclass=2' .01879747
5.9846 'SibSp=[0.0-1.0]' .01436260
.0265 'Pclass=3' .01436258

The interpretation of this file is as follows. Patterns are ranked by information gain with the most predictive patterns appearing first. For categorical attributes, the pattern shows the exact value such as Sex=male. For numeric attributes, the pattern shows the discretized bin range such as Age=[35.0-38.0]. Higher information gain values indicate stronger association with the target variable.
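
To give an intuition for the ig column, the sketch below computes the information gain of a single-item pattern as the reduction in the entropy of the Survived label when the records are split into those that contain the pattern and those that do not. This is the textbook definition with natural logarithms, not code taken from THUIsl; the function names are hypothetical. On the twelve sample records it gives values that agree, up to rounding, with the ig values listed above for Sex=male and Pclass=1, which suggests (but does not confirm) that the algorithm uses a comparable definition.

```python
from math import log


def entropy(labels):
    """Shannon entropy of a list of class labels, in nats."""
    n = len(labels)
    if n == 0:
        return 0.0
    probabilities = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log(p) for p in probabilities)


def information_gain(has_pattern, labels):
    """Entropy reduction from splitting records on pattern presence."""
    inside = [y for h, y in zip(has_pattern, labels) if h]
    outside = [y for h, y in zip(has_pattern, labels) if not h]
    n = len(labels)
    return (
        entropy(labels)
        - len(inside) / n * entropy(inside)
        - len(outside) / n * entropy(outside)
    )


# Survived, Sex, and Pclass columns of the sample dataset, in file order.
survived = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
sex = ["male", "female", "male", "male", "female", "male",
       "male", "male", "male", "female", "female", "female"]
pclass = [3, 1, 3, 2, 1, 3, 2, 3, 3, 1, 2, 1]

ig_male = information_gain([s == "male" for s in sex], survived)
ig_pclass1 = information_gain([p == 1 for p in pclass], survived)
print(round(ig_male, 6), round(ig_pclass1, 6))
```

Because the split is binary, Sex=male and Sex=female partition the records in the same way and therefore receive the same information gain.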

Transformed transaction file (filtered based on mined top-k patterns)

This file contains the human-readable transactional representation of each data record, showing only the attribute-value pairs that appear in the mined top-k patterns. Each line represents one transaction (one data record) as a comma-separated list of attribute-value pairs. An example of this file format is shown below:

Sex=male, Embarked=S, Age=[22.0-24.0], Pclass=3
Sex=female, Fare=[53.1-76.7292], Pclass=1, Age=[38.0-39.0], Embarked=C
Sex=male, Embarked=S, Age=[24.0-27.0], SibSp=[0.0-1.0], Pclass=3
Sex=male, Fare=[13.0-26.0], Embarked=S, Pclass=2, SibSp=[0.0-1.0]
Sex=female, Pclass=1, Age=[39.0-48.0], Embarked=C
Sex=male, Embarked=S, Fare=[10.5-13.0], SibSp=[0.0-1.0], Pclass=3
Sex=male, Fare=[13.0-26.0], Embarked=S, Age=[48.0-49.0], Pclass=2, SibSp=[0.0-1.0]
Sex=male, Age=[35.0-38.0], Embarked=S, SibSp=[0.0-1.0], Pclass=3
Sex=male, Age=[27.0-30.0], SibSp=[0.0-1.0], Pclass=3
Sex=female, Fare=[53.1-76.7292], Age=[35.0-38.0], Pclass=1, Embarked=S
Sex=female, Embarked=S, Fare=[26.0-53.1], Pclass=2, SibSp=[0.0-1.0]
Sex=female, Pclass=1, Fare=[76.7292-89.1042], Embarked=C

The interpretation of this file is as follows. Each line corresponds to one row from the original CSV dataset. The file only includes features that appear in the top-k patterns, filtering out irrelevant attributes. Numeric values are shown as bin ranges such as Age=[22.0-24.0]. This format can be used directly for interpretable machine learning models.
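
One possible way to feed this file to a classifier is to turn each transaction into a row of 0/1 indicators, one column per attribute-value pair. The Python sketch below shows this for two lines of the example. It is an illustration only: the function to_indicator_matrix is hypothetical, and the HUG-IML framework may build its model features differently.

```python
def to_indicator_matrix(lines):
    """Convert comma-separated transactions into a 0/1 matrix.

    Returns the sorted list of distinct items (the column names) and one
    row per transaction, with 1 where the item occurs in the transaction.
    """
    transactions = [
        [item.strip() for item in line.split(",") if item.strip()]
        for line in lines
    ]
    vocabulary = sorted({item for t in transactions for item in t})
    column = {item: j for j, item in enumerate(vocabulary)}
    matrix = []
    for t in transactions:
        row = [0] * len(vocabulary)
        for item in t:
            row[column[item]] = 1
        matrix.append(row)
    return vocabulary, matrix


lines = [
    "Sex=male, Embarked=S, Age=[22.0-24.0], Pclass=3",
    "Sex=female, Pclass=1, Age=[39.0-48.0], Embarked=C",
]
vocabulary, matrix = to_indicator_matrix(lines)
print(vocabulary)
print(matrix)
```

Each column of the resulting matrix keeps its human-readable name, so the coefficients of a linear model trained on it can be read directly in terms of attribute-value pairs.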

Column mapping file

This file provides the mapping between numeric item IDs (used internally) and human-readable attribute-value pairs.

The format consists of three columns separated by spaces. The first column is the bin identifier, which is an encoded bin name following the format <bin_number><column_index>. The second column is the item ID, which is a unique numeric identifier used in internal representations. The third column is the attribute mapping, which shows the human-readable attribute equals value or attribute equals range.

An example of this file format is shown below:

10000 1 SibSp=[0.0-1.0]
10002 2 Age=[19.0-24.0]
20002 3 Age=[24.0-30.0]
30002 4 Age=[30.0-35.0]
40002 5 Age=[35.0-39.0]
50002 6 Age=[39.0-49.0]
10003 7 Fare=[7.25-8.05]
20003 8 Fare=[8.05-13.0]
30003 9 Fare=[13.0-26.0]
40003 10 Fare=[26.0-71.2833]
50003 11 Fare=[71.2833-89.1042]
10004 12 Pclass=3
20004 13 Pclass=1
30004 14 Pclass=2
10005 15 Sex=male
20005 16 Sex=female
10006 17 Embarked=S
20006 18 Embarked=C
30006 19 Embarked=Q

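The first column (bin identifier) follows the format <bin_number><column_index>: the leading digit or digits encode the bin (1, 2, 3, etc.) and the trailing digits encode the index of the feature. For example, 10002 means bin 1 of feature 2, which corresponds to Age. Note that in this example the feature index is not the column index in the CSV file (Age is column 5 there). It appears to follow the order in which the columns are listed in the configuration file: integer numeric columns first, then floating-point numeric columns, then categorical columns.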

The interpretation of this file is as follows. This file ensures full traceability from numeric IDs to original data values. It is essential for interpreting patterns in their original context. The file enables transformation of test data using the same encoding scheme that was applied to training data.
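
The following Python sketch shows how the mapping file can be used to translate item IDs back into attribute-value pairs. It uses five lines of the example above. The helper names are hypothetical, and the decoding of the bin identifier as bin_number * 10000 + feature_index is an assumption inferred from the example values, not a documented rule.

```python
MAPPING_TEXT = """\
10002 2 Age=[19.0-24.0]
10003 7 Fare=[7.25-8.05]
10004 12 Pclass=3
10005 15 Sex=male
10006 17 Embarked=S
"""


def parse_mapping(text):
    """Return {item_id: (bin_identifier, label)} from mapping-file text."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        bin_identifier, item_id, label = line.split(maxsplit=2)
        mapping[int(item_id)] = (int(bin_identifier), label)
    return mapping


def split_bin_identifier(bin_identifier):
    """Assumed decoding: identifier = bin_number * 10000 + feature_index."""
    return divmod(bin_identifier, 10000)


mapping = parse_mapping(MAPPING_TEXT)

# Translate the item IDs of a transaction into readable attribute-value pairs.
decoded = [mapping[item_id][1] for item_id in (2, 7, 12, 15, 17)]
print(decoded)
print(split_bin_identifier(10002))
```

The item IDs 2, 7, 12, 15, and 17 used here are those of the first transaction shown in the intermediate file of the next section.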

Intermediate file in utility mining format for top-k HUI mining

This file represents the internal transactional format used during high-utility itemset mining. It is a text file using the standard SPMF format for transaction databases with utility values.

Each line represents a transaction. Each line is composed of three sections, as follows.

First, the items contained in the transaction are listed. An item is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same transaction.
Second, the symbol ":" appears and is followed by the transaction utility (a double value).
Third, the symbol ":" appears and is followed by the utility of each item in this transaction (a double value), separated by single spaces.

2 7 12 15 17:1.697197:0.156452 0.009773 0.018053 0.932594 0.580325
5 11 13 16 18:3.15678:0.625808 1.0 0.018053 0.932594 0.580325
1 3 8 12 15 17:2.74273:0.797316 0.344195 0.070247 0.018053 0.932594 0.580325
1 2 9 14 15 17:2.713806:0.797316 0.156452 0.229066 0.018053 0.932594 0.580325
6 13 16 18:2.469684:0.938712 0.018053 0.932594 0.580325
1 4 8 12 15 17:2.899182:0.797316 0.500647 0.070247 0.018053 0.932594 0.580325
1 6 9 14 15 17:3.496066:0.797316 0.938712 0.229066 0.018053 0.932594 0.580325
1 5 8 12 15 17:3.024343:0.797316 0.625808 0.070247 0.018053 0.932594 0.580325
1 3 7 12 15 19:2.682256:0.797316 0.344195 0.009773 0.018053 0.932594 0.580325
5 10 13 16 17:2.939065:0.625808 0.782285 0.018053 0.932594 0.580325
1 2 10 14 16 17:3.267025:0.797316 0.156452 0.782285 0.018053 0.932594 0.580325
11 13 16 18:2.530972:1.0 0.018053 0.932594 0.580325
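
The three-section format described above can be read with a few lines of Python. The sketch below parses the first transaction of the example and checks that the item utilities add up to the transaction utility. The function parse_utility_line is a hypothetical helper for illustration and is not part of SPMF.

```python
def parse_utility_line(line):
    """Split 'items:transaction_utility:item_utilities' into typed values."""
    items_part, total_part, utilities_part = line.strip().split(":")
    items = [int(token) for token in items_part.split()]
    utilities = [float(token) for token in utilities_part.split()]
    if len(items) != len(utilities):
        raise ValueError("each item must have exactly one utility value")
    return items, float(total_part), utilities


line = "2 7 12 15 17:1.697197:0.156452 0.009773 0.018053 0.932594 0.580325"
items, transaction_utility, utilities = parse_utility_line(line)

# The transaction utility should equal the sum of the item utilities.
difference = abs(sum(utilities) - transaction_utility)
print(items, transaction_utility, difference < 1e-6)
```

A small tolerance is used in the comparison because the values in the file are rounded to six decimal places.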

Where to get more information?

More information about this algorithm can be found in this paper:

S. Krishnamoorthy, "Interpretable classifier models for decision support using high utility gain patterns," IEEE Access, vol. 12, pp. 126088–126107, 2024. https://doi.org/10.1109/ACCESS.2024.3455563

The following GitHub repository also offers this algorithm and other components of the HUG-IML method: github.com/srikumar2050/ubtgen

The THUIsl code included in SPMF comes from this repository and is distributed under the GPL license.