Transform a CSV dataset into a Transaction Database with Utility Information for Classification using the UBTGen Algorithm (SPMF documentation)

This example explains how to run the UBTGen algorithm.

How to run this example?

This algorithm is not offered in the release version of SPMF.
To run this example with the source code version of SPMF, launch the file "MainTestUBTGen.java" in the package ca.pfv.SPMF.tests.

What is UBTGen?

Utility-Based Transaction Generator (UBTGen) is an algorithm proposed by S. Krishnamoorthy, intended to be applied to supervised learning from tabular datasets.

UBTGen transforms a tabular CSV dataset into a transactional representation by encoding attribute–value information as item identifiers. Pattern mining algorithms can then be applied to the transformed dataset.

The algorithm first parses metadata (e.g., target column, skipped columns, numeric and categorical indices). Numeric attributes are discretized into B bins using either equal-width partitioning or quantile-based discretization, while categorical attributes are directly mapped to unique item IDs. Each record in the dataset is then converted into a transaction consisting of the corresponding item IDs, augmented with normalized numeric feature values and a computed transaction weight. The process also generates a mapping file to maintain traceability between item IDs and original attribute–value pairs, ensuring interpretability and reproducibility of the transformed dataset.
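As a rough illustration of the discretization step, the following Java sketch assigns a numeric value to one of B equal-width bins. It is not taken from the UBTGen source code: the class name EqualWidthBinning and the method binOf are hypothetical, and the clamping of the maximum value into the last bin is an assumption. The intervals in the mapping file shown later in this example are not of equal width, so the boundaries that UBTGen computes may follow a different rule.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of SPMF: equal-width binning of one numeric column. */
final class EqualWidthBinning {

    /** Returns a 1-based bin index in [1, bins] for value, given the column's min and max. */
    static int binOf(double value, double min, double max, int bins) {
        if (bins < 1) {
            throw new IllegalArgumentException("bins must be >= 1");
        }
        if (max <= min) {
            return 1; // constant column: every value falls into a single bin
        }
        double width = (max - min) / bins;
        int index = (int) ((value - min) / width) + 1;
        // The maximum value would otherwise land in bin (bins + 1), so clamp it (assumption).
        return Math.max(1, Math.min(index, bins));
    }

    public static void main(String[] args) {
        // Age column of the sample dataset used in this example
        double[] ages = {22, 38, 24, 21, 39, 30, 48, 35, 27, 35, 19, 49};
        double min = Arrays.stream(ages).min().getAsDouble();
        double max = Arrays.stream(ages).max().getAsDouble();
        for (double age : ages) {
            System.out.println(age + " -> bin " + binOf(age, min, max, 5));
        }
    }
}
```

With B = 5 and ages ranging from 19 to 49, each bin in this sketch spans 6 years. A quantile-based variant would instead choose boundaries so that each bin receives roughly the same number of records.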

The UBTGen algorithm was proposed as the transaction generation step of the classifier modeling method HUG-IML (High Utility Gain-Interpretable Machine Learning). HUG-IML is an intrinsic classifier model that extracts a class of higher-order patterns and embeds them into an interpretable learning model such as logistic regression. The model supports both binary and multi-class classification problems. The details of the HUG-IML models are provided in the paper "Interpretable classifier models for decision support using high utility gain patterns", IEEE Access, 2024, DOI: https://doi.org/10.1109/ACCESS.2024.3455563.

What is the input?

To execute the algorithm, you need a CSV dataset and a configuration file. In this example, we will use the dataset contextUBTGenTitanicSample.csv and the configuration file configTitanic.txt, provided in the ca.pfv.SPMF.tests package.

Dataset file

As an example, the content of the dataset contextUBTGenTitanicSample.csv is provided here:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,0,3,Thayer Mr. John Borland,male,24,0,0,A/5 21171,8.05,,S
17,0,2,Sloper Mr. William Thompson,male,21,0,0,SC/Paris 2123,13.0,,S
32,0,1,Harder Mrs. George W (Elizabeth Lydia Waldheim),female,39,1,0,PC 17585,89.1042,C123,C
47,0,3,Beesley Mr. Lawrence,male,30,0,0,SOTON/OQ 392076,10.5,,S
59,0,2,Anderson Mr. Harry,male,48,0,0,248744,13.0,,S
6,1,3,Allen Mr. William Henry,male,35,0,0,373450,8.05,,S
14,1,3,Moran Mr. James,male,27,0,0,330877,7.25,,Q
22,1,1,Futrelle Mrs. Jacques Heath (Lily May Peel),female,35,1,0,113803,53.1,C123,S
39,1,2,Saunders Miss. Eliza Mary,female,19,0,0,237736,26.0,,S
51,1,1,Brown Mrs. James Joseph (Margaret Tobin),female,49,1,0,113784,76.7292,C105,C

The first line of a dataset in this format lists the attribute names, separated by commas.
The following lines are records. On each line, the attribute values are separated by commas. In this example, the dataset contains the following attributes:

PassengerId — unique identifier for each passenger
Survived — survival label (0 = did not survive, 1 = survived)
Pclass — passenger ticket class
Name — full name of the passenger
Sex — gender of the passenger
Age — age in years
SibSp — number of siblings/spouses aboard
Parch — number of parents/children aboard
Ticket — ticket number
Fare — ticket fare
Cabin — cabin number
Embarked — port of embarkation

Configuration file

The configuration file configTitanic.txt used in this example has the following content:

inputFile=contextUBTGenTitanicSample.csv
outputFile=titanic_processed.csv
testFile=contextUBTGenTitanicSampleTest.csv
prefix=titanic
header=true
delimiter=,
targetColIndex=1
skipColsIndices=0,3,8,10
numericIntColsIndices=6,7
numericFloatColsIndices=5,9
catColsIndices=2,4,11
B=5
writeTransformParameters=false
missingValueImputation=false

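The description of each line is as follows. Column indices are zero-based: in this example, targetColIndex=1 selects Survived, and skipColsIndices=0,3,8,10 excludes PassengerId, Name, Ticket and Cabin.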

Parameter	Description
inputFile	Input training dataset (CSV)
outputFile	Transactional output file
testFile	Test dataset (optional)
prefix	Prefix for generated filenames
header	CSV has header row (true/false)
delimiter	CSV delimiter
targetColIndex	Target column index
skipColsIndices	Columns to ignore
numericIntColsIndices	Integer numeric columns
numericFloatColsIndices	Floating numeric columns
catColsIndices	Categorical columns
B	Bin count for numeric discretization
writeTransformParameters	Write transformation parameters to file
missingValueImputation	Enable missing value handling

Special case: B = -1 uses quantile-based discretization automatically.
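Because the configuration file is a list of key=value lines, it can be inspected with standard Java tools. The sketch below reads such content with java.util.Properties and converts a comma-separated index list into an int array. It is only an illustration: the class UbtGenConfigSketch and its methods are hypothetical, and SPMF's own configuration parser may behave differently (for instance, Properties treats backslashes as escape characters, which matters for Windows file paths).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Properties;

/** Hypothetical reader for a key=value configuration; not the parser used by SPMF. */
final class UbtGenConfigSketch {

    /** Parses key=value lines using java.util.Properties. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Splits a comma-separated index list such as "0,3,8,10" into an int array. */
    static int[] indices(Properties props, String key) {
        String raw = props.getProperty(key, "").trim();
        if (raw.isEmpty()) {
            return new int[0]; // missing or empty key: no columns
        }
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "targetColIndex=1",
                "skipColsIndices=0,3,8,10",
                "catColsIndices=2,4,11",
                "delimiter=,",
                "B=5");
        Properties props = parse(text);
        System.out.println("skip: " + Arrays.toString(indices(props, "skipColsIndices")));
        System.out.println("categorical: " + Arrays.toString(indices(props, "catColsIndices")));
        System.out.println("B: " + Integer.parseInt(props.getProperty("B")));
    }
}
```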

What is the output?

The output consists of two files:

A transaction file where each dataset record is transformed into item IDs, a transaction weight, and optional normalized numeric features. It follows the standard SPMF format for transaction databases with utility values.
A column mapping file linking each numeric item ID to its original attribute/value, ensuring interpretability.

Transaction output file

It is a text file. Each line represents a transaction and is composed of three sections, as follows.

First, the items contained in the transaction are listed. An item is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item appears twice within the same transaction.
Second, the symbol ":" appears and is followed by the transaction utility (a double value).
Third, the symbol ":" appears and is followed by the utility of each item in this transaction (a double value), separated by single spaces.

For example, after running the algorithm on this example, the generated transaction file titanic_processed.csv contains:

2 7 12 15 17:1.697197:0.156452 0.009773 0.018053 0.932594 0.580325
5 11 13 16 18:3.15678:0.625808 1.0 0.018053 0.932594 0.580325
1 3 8 12 15 17:2.74273:0.797316 0.344195 0.070247 0.018053 0.932594 0.580325
1 2 9 14 15 17:2.713806:0.797316 0.156452 0.229066 0.018053 0.932594 0.580325
6 13 16 18:2.469684:0.938712 0.018053 0.932594 0.580325
1 4 8 12 15 17:2.899182:0.797316 0.500647 0.070247 0.018053 0.932594 0.580325
1 6 9 14 15 17:3.496066:0.797316 0.938712 0.229066 0.018053 0.932594 0.580325
1 5 8 12 15 17:3.024343:0.797316 0.625808 0.070247 0.018053 0.932594 0.580325
1 3 7 12 15 19:2.682256:0.797316 0.344195 0.009773 0.018053 0.932594 0.580325
5 10 13 16 17:2.939065:0.625808 0.782285 0.018053 0.932594 0.580325
1 2 10 14 16 17:3.267025:0.797316 0.156452 0.782285 0.018053 0.932594 0.580325
11 13 16 18:2.530972:1.0 0.018053 0.932594 0.580325
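To make the three-section format concrete, here is a minimal Java sketch that splits one line of the transaction file into its items, transaction utility, and item utilities. The class UtilityTransactionLine is hypothetical and is not SPMF's own reader. It also compares the transaction utility with the sum of the item utilities, which agree for the first line of the output above.

```java
import java.util.Arrays;

/** Hypothetical parser for one line of the transaction file; not SPMF code. */
final class UtilityTransactionLine {
    final int[] items;
    final double transactionUtility;
    final double[] itemUtilities;

    private UtilityTransactionLine(int[] items, double transactionUtility, double[] itemUtilities) {
        this.items = items;
        this.transactionUtility = transactionUtility;
        this.itemUtilities = itemUtilities;
    }

    /** Parses "items:transactionUtility:itemUtilities". */
    static UtilityTransactionLine parse(String line) {
        String[] sections = line.trim().split(":");
        if (sections.length != 3) {
            throw new IllegalArgumentException("expected 3 sections separated by ':' in: " + line);
        }
        int[] items = Arrays.stream(sections[0].trim().split(" +"))
                .mapToInt(Integer::parseInt)
                .toArray();
        double transactionUtility = Double.parseDouble(sections[1].trim());
        double[] utilities = Arrays.stream(sections[2].trim().split(" +"))
                .mapToDouble(Double::parseDouble)
                .toArray();
        if (items.length != utilities.length) {
            throw new IllegalArgumentException("item count and utility count differ in: " + line);
        }
        return new UtilityTransactionLine(items, transactionUtility, utilities);
    }

    /** Sum of the per-item utilities in the third section. */
    double sumOfItemUtilities() {
        return Arrays.stream(itemUtilities).sum();
    }

    public static void main(String[] args) {
        UtilityTransactionLine t = parse(
                "2 7 12 15 17:1.697197:0.156452 0.009773 0.018053 0.932594 0.580325");
        System.out.println("items = " + Arrays.toString(t.items));
        System.out.println("transaction utility = " + t.transactionUtility);
        boolean consistent = Math.abs(t.sumOfItemUtilities() - t.transactionUtility) < 1e-6;
        System.out.println("sum of item utilities matches: " + consistent);
    }
}
```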

Column mapping file

The second file is the column mapping file. This file maps each numeric item ID in the transaction file back to the original dataset values. The format is as follows:

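Column 1 — Bin name: Encodes the bin number in the first part and the column (attribute) index in the last part. The last digit indicates the column index (0–6), while the leading digit indicates the bin. For example, in 30002 the column index is 2 and the bin is 3, which corresponds to the third Age interval in the listing for this example.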
Column 2 — Unique item identifier: A numeric ID representing this attribute-value pair in the transaction file.
Column 3 — Attribute mapping: Shows the actual numeric range or categorical value represented by this item ID (e.g., Age=[19.0-24.0] or Sex=male).

The second file produced in this example is named titanic_processed_colNameNew.txt and contains:

10000 1 SibSp=[0.0-1.0]
10002 2 Age=[19.0-24.0]
20002 3 Age=[24.0-30.0]
30002 4 Age=[30.0-35.0]
40002 5 Age=[35.0-39.0]
50002 6 Age=[39.0-49.0]
10003 7 Fare=[7.25-8.05]
20003 8 Fare=[8.05-13.0]
30003 9 Fare=[13.0-26.0]
40003 10 Fare=[26.0-71.2833]
50003 11 Fare=[71.2833-89.1042]
10004 12 Pclass=3
20004 13 Pclass=1
30004 14 Pclass=2
10005 15 Sex=male
20005 16 Sex=female
10006 17 Embarked=S
20006 18 Embarked=C
30006 19 Embarked=Q

This mapping ensures that every item ID in the transaction file can be interpreted and traced back to its original attribute-value in the dataset.
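The following Java sketch shows one way to use the mapping file to translate the item IDs of a transaction back into readable attribute-value pairs. It relies only on the three-column layout described above. The class ItemMappingSketch and its methods are hypothetical and are not part of SPMF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that reads the column mapping file; not SPMF code. */
final class ItemMappingSketch {

    /** Builds a map from item ID (column 2) to attribute mapping (column 3). */
    static Map<Integer, String> parseMapping(List<String> lines) {
        Map<Integer, String> byItemId = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // Limit of 3 keeps any spaces inside the attribute mapping intact.
            String[] cols = line.trim().split("\\s+", 3);
            if (cols.length != 3) {
                throw new IllegalArgumentException("expected 3 columns in: " + line);
            }
            byItemId.put(Integer.parseInt(cols[1]), cols[2]);
        }
        return byItemId;
    }

    /** Translates a space-separated list of item IDs into attribute-value descriptions. */
    static List<String> describe(String items, Map<Integer, String> byItemId) {
        List<String> result = new ArrayList<>();
        for (String token : items.trim().split("\\s+")) {
            int id = Integer.parseInt(token);
            result.add(byItemId.getOrDefault(id, "unknown item " + id));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> mapping = parseMapping(Arrays.asList(
                "10002 2 Age=[19.0-24.0]",
                "10003 7 Fare=[7.25-8.05]",
                "10004 12 Pclass=3",
                "10005 15 Sex=male",
                "10006 17 Embarked=S"));
        // Items of the first transaction in titanic_processed.csv
        System.out.println(describe("2 7 12 15 17", mapping));
    }
}
```

For the first transaction of this example, the sketch prints [Age=[19.0-24.0], Fare=[7.25-8.05], Pclass=3, Sex=male, Embarked=S], which agrees with the first record of the dataset (Age 22, Fare 7.25, Pclass 3, male, embarked at S).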

Where can I get more information about the UBTGen algorithm?

More information about UBTGen can be found in this paper:

S. Krishnamoorthy, "Interpretable classifier models for decision support using high utility gain patterns," IEEE Access, vol. 12, pp. 126088–126107, 2024. https://doi.org/10.1109/ACCESS.2024.3455563

The following GitHub repository also offers this algorithm and other components of the HUG-IML method: github.com/srikumar2050/ubtgen

The UBTGen code included in SPMF comes from this repository and is distributed under the GPL license.