Transform a CSV dataset into a Transaction Database with Utility Information for Classification using the UBTGen Algorithm (SPMF documentation)

This example explains how to run the UBTGen algorithm.

How to run this example?

What is UBTGen?

Utility-Based Transaction Generator (UBTGen) is an algorithm proposed by S. Krishnamoorthy, intended to be applied to supervised learning from tabular datasets.

UBTGen transforms a tabular CSV dataset into a transactional representation by encoding attribute–value information as item identifiers. Pattern mining algorithms can then be applied to the transformed dataset.

The algorithm first parses metadata (e.g., target column, skipped columns, numeric and categorical indices). Numeric attributes are discretized into B bins using either equal-width partitioning or quantile-based discretization, while categorical attributes are directly mapped to unique item IDs. Each record in the dataset is then converted into a transaction consisting of the corresponding item IDs, augmented with normalized numeric feature values and a computed transaction weight. The process also generates a mapping file to maintain traceability between item IDs and original attribute–value pairs, ensuring interpretability and reproducibility of the transformed dataset.
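The difference between the two discretization strategies can be illustrated with a minimal sketch. The function names and the exact rules for choosing bin edges are assumptions made for illustration; they are not taken from the UBTGen code, so the edges produced here may differ from those that SPMF generates.

```python
from bisect import bisect_left

def equal_width_edges(values, bins):
    """Return bins+1 edges that split [min, max] into equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins)] + [float(hi)]

def quantile_edges(values, bins):
    """Return bins+1 edges so that each interval holds roughly the same number of values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Hypothetical rule: take the value at the nearest rank for each quantile.
    return [ordered[round(i * last / bins)] for i in range(bins + 1)]

def bin_index(value, edges):
    """Return the 1-based index of the interval containing value (right-closed intervals)."""
    # bisect_left finds the first edge >= value; clamp so the minimum falls in bin 1.
    idx = bisect_left(edges, value)
    return min(max(idx, 1), len(edges) - 1)

# Age column of the sample dataset used in this example
ages = [22, 38, 24, 21, 39, 30, 48, 35, 27, 35, 19, 49]

print(equal_width_edges(ages, 5))  # edges spaced evenly between 19 and 49
print(quantile_edges(ages, 5))     # edges placed at ranks of the sorted values
print(bin_index(22, quantile_edges(ages, 5)))
```

Equal-width bins are easy to interpret but can be nearly empty when values are skewed (as with Fare), whereas quantile-based bins adapt their width so that each bin receives a similar number of records.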

The UBTGen algorithm was proposed as the transaction generation step of the classifier modeling method HUG-IML (High Utility Gain-Interpretable Machine Learning). This method is an intrinsic classifier model that extracts a class of higher-order patterns and embeds them into an interpretable learning model such as logistic regression. The model supports both binary and multi-class classification problems. The details of the HUG-IML models are provided in the paper: Interpretable classifier models for decision support using high utility gain patterns, IEEE Access 2024, DOI: https://doi.org/10.1109/ACCESS.2024.3455563.

What is the input?

To execute the algorithm, you need a CSV dataset and a configuration file. In this example, we will use the dataset contextUBTGenTitanicSample.csv and the configuration file configTitanic.txt, provided in the ca.pfv.SPMF.tests package.

Dataset file

As an example, the content of the dataset contextUBTGenTitanicSample.csv is provided here:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,0,3,Thayer Mr. John Borland,male,24,0,0,A/5 21171,8.05,,S
17,0,2,Sloper Mr. William Thompson,male,21,0,0,SC/Paris 2123,13.0,,S
32,0,1,Harder Mrs. George W (Elizabeth Lydia Waldheim),female,39,1,0,PC 17585,89.1042,C123,C
47,0,3,Beesley Mr. Lawrence,male,30,0,0,SOTON/OQ 392076,10.5,,S
59,0,2,Anderson Mr. Harry,male,48,0,0,248744,13.0,,S
6,1,3,Allen Mr. William Henry,male,35,0,0,373450,8.05,,S
14,1,3,Moran Mr. James,male,27,0,0,330877,7.25,,Q
22,1,1,Futrelle Mrs. Jacques Heath (Lily May Peel),female,35,1,0,113803,53.1,C123,S
39,1,2,Saunders Miss. Eliza Mary,female,19,0,0,237736,26.0,,S
51,1,1,Brown Mrs. James Joseph (Margaret Tobin),female,49,1,0,113784,76.7292,C105,C

The first line of a dataset in this format lists the attribute names, separated by commas.
Each of the following lines is a record, in which the attribute values are separated by commas. In this example, the dataset contains the following attributes: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked.

Configuration file

The configuration file configTitanic.txt used in this example has the following content:

inputFile=contextUBTGenTitanicSample.csv
outputFile=titanic_processed.csv
testFile=contextUBTGenTitanicSampleTest.csv
prefix=titanic
header=true
delimiter=,
targetColIndex=1
skipColsIndices=0,3,8,10
numericIntColsIndices=6,7
numericFloatColsIndices=5,9
catColsIndices=2,4,11
B=5
writeTransformParameters=false
missingValueImputation=false

The description of each line is as follows:

Parameter                  Description
inputFile                  Input training dataset (CSV)
outputFile                 Transactional output file
testFile                   Test dataset (optional)
prefix                     Prefix for generated filenames
header                     Whether the CSV file has a header row (true/false)
delimiter                  CSV delimiter
targetColIndex             Index of the target column (indices start at 0, so 1 is Survived in this example)
skipColsIndices            Indices of the columns to ignore
numericIntColsIndices      Indices of the integer numeric columns
numericFloatColsIndices    Indices of the floating-point numeric columns
catColsIndices             Indices of the categorical columns
B                          Number of bins for numeric discretization
writeTransformParameters   Whether to write the transformation parameters to a file (true/false)
missingValueImputation     Whether to enable missing value handling (true/false)

Special case: B = -1 uses quantile-based discretization automatically.
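To check that a configuration assigns the intended role to every column, the key=value lines can be parsed and matched against the CSV header. The following is a minimal sketch; the helper parse_config and the role labels are hypothetical and are not part of SPMF. It assumes that column indices start at 0, which is consistent with targetColIndex=1 designating the Survived column in this example.

```python
def parse_config(text):
    """Parse key=value lines into a dict; keys ending in 'Indices' become lists of ints."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so that 'delimiter=,' keeps ',' as its value.
        key, value = line.split("=", 1)
        if key.endswith("Indices"):
            config[key] = [int(v) for v in value.split(",") if v]
        else:
            config[key] = value
    return config

# Subset of configTitanic.txt from this example
CONFIG_TEXT = """\
delimiter=,
targetColIndex=1
skipColsIndices=0,3,8,10
numericIntColsIndices=6,7
numericFloatColsIndices=5,9
catColsIndices=2,4,11
B=5
"""

HEADER = ("PassengerId,Survived,Pclass,Name,Sex,Age,"
          "SibSp,Parch,Ticket,Fare,Cabin,Embarked").split(",")

config = parse_config(CONFIG_TEXT)

# Map each column name to its role (assumption: indices are 0-based).
roles = {HEADER[int(config["targetColIndex"])]: "target"}
for key, role in [("skipColsIndices", "skipped"),
                  ("numericIntColsIndices", "numeric (int)"),
                  ("numericFloatColsIndices", "numeric (float)"),
                  ("catColsIndices", "categorical")]:
    for index in config[key]:
        roles[HEADER[index]] = role

for name in HEADER:
    print(f"{name}: {roles.get(name, 'unassigned')}")
```

A column reported as unassigned would indicate that its index is missing from the configuration file.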

What is the output?

The output consists of two files:

Transaction output file

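It is a text file. Each line represents a transaction and is composed of three sections separated by colons (:), as follows.

The first section lists the item IDs of the transaction, separated by spaces.
The second section is the transaction weight. In the example below, it is equal to the sum of the values in the third section.
The third section lists one numeric value per item, in the same order as the item IDs.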

For example, after running the algorithm on this example, the generated transaction file titanic_processed.csv contains:

2 7 12 15 17:1.697197:0.156452 0.009773 0.018053 0.932594 0.580325
5 11 13 16 18:3.15678:0.625808 1.0 0.018053 0.932594 0.580325
1 3 8 12 15 17:2.74273:0.797316 0.344195 0.070247 0.018053 0.932594 0.580325
1 2 9 14 15 17:2.713806:0.797316 0.156452 0.229066 0.018053 0.932594 0.580325
6 13 16 18:2.469684:0.938712 0.018053 0.932594 0.580325
1 4 8 12 15 17:2.899182:0.797316 0.500647 0.070247 0.018053 0.932594 0.580325
1 6 9 14 15 17:3.496066:0.797316 0.938712 0.229066 0.018053 0.932594 0.580325
1 5 8 12 15 17:3.024343:0.797316 0.625808 0.070247 0.018053 0.932594 0.580325
1 3 7 12 15 19:2.682256:0.797316 0.344195 0.009773 0.018053 0.932594 0.580325
5 10 13 16 17:2.939065:0.625808 0.782285 0.018053 0.932594 0.580325
1 2 10 14 16 17:3.267025:0.797316 0.156452 0.782285 0.018053 0.932594 0.580325
11 13 16 18:2.530972:1.0 0.018053 0.932594 0.580325
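A line of this file can be split into its parts with a few lines of code. The sketch below uses a hypothetical helper, parse_transaction, and relies on two regularities observed in the output above rather than on a formal specification: the sections are separated by colons, and the second section equals the sum of the values in the third section.

```python
def parse_transaction(line):
    """Split 'items:weight:values' into a list of ints, a float and a list of floats."""
    items_part, weight_part, values_part = line.strip().split(":")
    items = [int(token) for token in items_part.split()]
    weight = float(weight_part)
    values = [float(token) for token in values_part.split()]
    if len(items) != len(values):
        raise ValueError("expected one value per item")
    return items, weight, values

# First line of titanic_processed.csv in this example
line = "2 7 12 15 17:1.697197:0.156452 0.009773 0.018053 0.932594 0.580325"
items, weight, values = parse_transaction(line)

print(items)
print(weight)
# Compare with a small tolerance because the file stores rounded decimals.
print(abs(sum(values) - weight) < 1e-5)
```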

Column mapping file

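The second file is the column mapping file. This file maps each numeric item ID in the transaction file back to the original dataset values. Each line contains three fields separated by spaces: an internal code, the item ID used in the transaction file, and the corresponding attribute–value pair (an interval for a discretized numeric attribute, or a category for a categorical attribute).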

The second file produced in this example is named titanic_processed_colNameNew.txt and contains:

10000 1 SibSp=[0.0-1.0]
10002 2 Age=[19.0-24.0]
20002 3 Age=[24.0-30.0]
30002 4 Age=[30.0-35.0]
40002 5 Age=[35.0-39.0]
50002 6 Age=[39.0-49.0]
10003 7 Fare=[7.25-8.05]
20003 8 Fare=[8.05-13.0]
30003 9 Fare=[13.0-26.0]
40003 10 Fare=[26.0-71.2833]
50003 11 Fare=[71.2833-89.1042]
10004 12 Pclass=3
20004 13 Pclass=1
30004 14 Pclass=2
10005 15 Sex=male
20005 16 Sex=female
10006 17 Embarked=S
20006 18 Embarked=C
30006 19 Embarked=Q

This mapping ensures that every item ID in the transaction file can be interpreted and traced back to its original attribute-value in the dataset.
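As an illustration of this traceability, the sketch below decodes the item IDs of the first transaction of the example using a few lines copied from the mapping file. The helpers load_mapping and decode are hypothetical names, and the code assumes that the second field of each mapping line is the item ID, as in the listing above.

```python
# Lines copied from titanic_processed_colNameNew.txt in this example
MAPPING_TEXT = """\
10002 2 Age=[19.0-24.0]
10003 7 Fare=[7.25-8.05]
10004 12 Pclass=3
10005 15 Sex=male
10006 17 Embarked=S
"""

def load_mapping(text):
    """Return a dict from item ID to its attribute=value label."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields: internal code, item ID, label (the label is kept whole).
        _code, item_id, label = line.split(maxsplit=2)
        mapping[int(item_id)] = label
    return mapping

def decode(items, mapping):
    """Translate item IDs into labels, flagging IDs absent from the mapping."""
    return [mapping.get(item, f"<unknown item {item}>") for item in items]

mapping = load_mapping(MAPPING_TEXT)
# Item IDs of the first transaction in titanic_processed.csv
print(decode([2, 7, 12, 15, 17], mapping))
# prints ['Age=[19.0-24.0]', 'Fare=[7.25-8.05]', 'Pclass=3', 'Sex=male', 'Embarked=S']
```

The decoded transaction corresponds to the first record of the dataset (age 22, fare 7.25, third class, male, embarked at S).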

Where can I get more information about the UBTGen algorithm?

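More information about UBTGen can be found in this paper:

S. Krishnamoorthy. Interpretable classifier models for decision support using high utility gain patterns. IEEE Access, 2024. DOI: https://doi.org/10.1109/ACCESS.2024.3455563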

The following GitHub repository also offers this algorithm and other components of the HUG-IML method: github.com/srikumar2050/ubtgen

The UBTGen code included in SPMF comes from this repository and is distributed under the GPL license.