This example explains how to run the UBTGen algorithm.
The Utility-Based Transaction Generator (UBTGen) is an algorithm proposed by S. Krishnamoorthy, intended to be applied for supervised learning from tabular datasets.
UBTGen transforms a tabular CSV dataset into a transactional representation by encoding attribute–value information as item identifiers. Pattern mining algorithms can then be applied to the transformed dataset.
The algorithm first parses metadata (e.g., target column, skipped columns, numeric and categorical indices). Numeric attributes are discretized into B bins using either equal-width partitioning or quantile-based discretization, while categorical attributes are directly mapped to unique item IDs. Each record in the dataset is then converted into a transaction consisting of the corresponding item IDs, augmented with normalized numeric feature values and a computed transaction weight. The process also generates a mapping file to maintain traceability between item IDs and original attribute–value pairs, ensuring interpretability and reproducibility of the transformed dataset.
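The two discretization strategies and the item ID assignment described above can be illustrated with a minimal Python sketch. It is not the UBTGen implementation and is not claimed to reproduce its cut points: the function names (`equal_width_edges`, `quantile_edges`, `bin_index`, `item_id`) and the rounding rule used for the quantiles are assumptions made for illustration only.

```python
def equal_width_edges(values, bins):
    """Return bins+1 edges that split [min, max] into equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins)] + [hi]


def quantile_edges(values, bins):
    """Return bins+1 edges so that each interval holds roughly the same number of values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Assumption: pick the observed value nearest to each quantile position.
    return [ordered[round(i * last / bins)] for i in range(bins + 1)]


def bin_index(value, edges):
    """Return the 0-based bin of value; the last bin includes its upper edge."""
    for i in range(len(edges) - 2):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2


# Hypothetical ID assignment: each new (attribute, label) pair gets the next integer.
item_ids = {}


def item_id(attribute, label):
    return item_ids.setdefault((attribute, label), len(item_ids) + 1)


ages = [22, 38, 24, 21, 39, 30, 48, 35, 27, 35, 19, 49]  # Age column of the sample dataset
print(equal_width_edges(ages, 5))
print(quantile_edges(ages, 5))
print(item_id("Sex", "male"), item_id("Sex", "female"), item_id("Sex", "male"))
```

Equal-width bins are simple to interpret but can leave some bins nearly empty when values are skewed, whereas quantile-based bins balance the number of records per bin at the cost of uneven interval widths.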
The UBTGen algorithm was proposed as the transaction generation step of the classifier modeling method HUG-IML (High Utility Gain-Interpretable Machine Learning). This method is an intrinsic classifier model that extracts a class of higher order patterns and embeds them into an interpretable learning model such as logistic regression. The model supports both binary and multi-class classification problems. The specific details of the HUG-IML models are provided in the IEEE Access paper titled: Interpretable classifier models for decision support using high utility gain patterns, IEEE Access 2024, DOI: https://doi.org/10.1109/ACCESS.2024.3455563.
To execute the algorithm, you need a CSV dataset and a configuration file. In this example, we will use the dataset contextUBTGenTitanicSample.csv and the configuration file configTitanic.txt, provided in the ca.pfv.SPMF.tests package.
As an example, the content of the dataset contextUBTGenTitanicSample.csv is provided here:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,0,3,Thayer Mr. John Borland,male,24,0,0,A/5 21171,8.05,,S
17,0,2,Sloper Mr. William Thompson,male,21,0,0,SC/Paris 2123,13.0,,S
32,0,1,Harder Mrs. George W (Elizabeth Lydia Waldheim),female,39,1,0,PC 17585,89.1042,C123,C
47,0,3,Beesley Mr. Lawrence,male,30,0,0,SOTON/OQ 392076,10.5,,S
59,0,2,Anderson Mr. Harry,male,48,0,0,248744,13.0,,S
6,1,3,Allen Mr. William Henry,male,35,0,0,373450,8.05,,S
14,1,3,Moran Mr. James,male,27,0,0,330877,7.25,,Q
22,1,1,Futrelle Mrs. Jacques Heath (Lily May Peel),female,35,1,0,113803,53.1,C123,S
39,1,2,Saunders Miss. Eliza Mary,female,19,0,0,237736,26.0,,S
51,1,1,Brown Mrs. James Joseph (Margaret Tobin),female,49,1,0,113784,76.7292,C105,C
The first line of a dataset in this format indicates a list of attribute names separated by commas.
The following lines are records. On each line, the attribute values are separated by commas. In this example, the dataset contains the following attributes: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
Configuration file
The configuration file configTitanic.txt used in this example has the following content:
inputFile=contextUBTGenTitanicSample.csv
outputFile=titanic_processed.csv
testFile=contextUBTGenTitanicSampleTest.csv
prefix=titanic
header=true
delimiter=,
targetColIndex=1
skipColsIndices=0,3,8,10
numericIntColsIndices=6,7
numericFloatColsIndices=5,9
catColsIndices=2,4,11
B=5
writeTransformParameters=false
missingValueImputation=false
The description of each parameter is as follows:
| Parameter | Description |
|---|---|
| inputFile | Input training dataset (CSV) |
| outputFile | Transactional output file |
| testFile | Test dataset (optional) |
| prefix | Prefix for generated filenames |
| header | CSV has header row (true/false) |
| delimiter | CSV delimiter |
| targetColIndex | Target column index |
| skipColsIndices | Columns to ignore |
| numericIntColsIndices | Integer numeric columns |
| numericFloatColsIndices | Floating numeric columns |
| catColsIndices | Categorical columns |
| B | Bin count for numeric discretization |
| writeTransformParameters | Write transformation parameters to file |
| missingValueImputation | Enable missing value handling |
Special case: B = -1 uses quantile-based discretization automatically.
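To make the column indices concrete, the following Python sketch parses a few of the configuration lines from this example and resolves each index to its column name in the dataset header. It only illustrates the key=value format; the helper names `parse_config` and `index_list` are hypothetical and are not part of SPMF or UBTGen. Reading the indices as 0-based is an assumption that is consistent with this example (index 1 is Survived, the class attribute).

```python
import csv
import io

# A subset of configTitanic.txt, copied from this example.
CONFIG_TEXT = """\
inputFile=contextUBTGenTitanicSample.csv
header=true
delimiter=,
targetColIndex=1
skipColsIndices=0,3,8,10
numericIntColsIndices=6,7
numericFloatColsIndices=5,9
catColsIndices=2,4,11
B=5
"""

HEADER = "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked"


def parse_config(text):
    """Parse key=value lines into a dict, splitting only on the first '='."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if line and "=" in line:
            key, value = line.split("=", 1)
            config[key.strip()] = value.strip()
    return config


def index_list(config, key):
    """Turn a comma-separated index string such as '0,3,8,10' into a list of ints."""
    return [int(tok) for tok in config.get(key, "").split(",") if tok.strip()]


config = parse_config(CONFIG_TEXT)
names = next(csv.reader(io.StringIO(HEADER), delimiter=config["delimiter"]))

print("target:", names[int(config["targetColIndex"])])
for key in ("skipColsIndices", "numericIntColsIndices",
            "numericFloatColsIndices", "catColsIndices"):
    print(key, "->", [names[i] for i in index_list(config, key)])
```

Under this reading, the skipped columns are PassengerId, Name, Ticket and Cabin, the integer columns are SibSp and Parch, the floating-point columns are Age and Fare, and the categorical columns are Pclass, Sex and Embarked.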
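The output consists of two files: a transaction file and a column mapping file.
The first file is the transaction file. It is a text file in which each line represents a transaction. Each line is composed of three sections separated by colons: (1) the item IDs of the transaction, separated by spaces; (2) the transaction weight; and (3) a list of values separated by spaces, one per item, in the same order as the item IDs. In this example, the transaction weight of each line equals the sum of the values in its third section.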
For example, after running the algorithm on this example, the generated transaction file titanic_processed.csv contains:
2 7 12 15 17:1.697197:0.156452 0.009773 0.018053 0.932594 0.580325
5 11 13 16 18:3.15678:0.625808 1.0 0.018053 0.932594 0.580325
1 3 8 12 15 17:2.74273:0.797316 0.344195 0.070247 0.018053 0.932594 0.580325
1 2 9 14 15 17:2.713806:0.797316 0.156452 0.229066 0.018053 0.932594 0.580325
6 13 16 18:2.469684:0.938712 0.018053 0.932594 0.580325
1 4 8 12 15 17:2.899182:0.797316 0.500647 0.070247 0.018053 0.932594 0.580325
1 6 9 14 15 17:3.496066:0.797316 0.938712 0.229066 0.018053 0.932594 0.580325
1 5 8 12 15 17:3.024343:0.797316 0.625808 0.070247 0.018053 0.932594 0.580325
1 3 7 12 15 19:2.682256:0.797316 0.344195 0.009773 0.018053 0.932594 0.580325
5 10 13 16 17:2.939065:0.625808 0.782285 0.018053 0.932594 0.580325
1 2 10 14 16 17:3.267025:0.797316 0.156452 0.782285 0.018053 0.932594 0.580325
11 13 16 18:2.530972:1.0 0.018053 0.932594 0.580325
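A line of the transaction file can be read back with a few lines of Python. The sketch below assumes the layout visible in the output above (item IDs, a weight, and one value per item, separated by colons); `parse_transaction` is a hypothetical helper name, not an SPMF function. The final check reflects an observation about this sample output, where the weight matches the sum of the per-item values, and is not a documented guarantee.

```python
def parse_transaction(line):
    """Split 'items:weight:values' into a list of ints, a float, and a list of floats."""
    items_part, weight_part, values_part = line.strip().split(":")
    items = [int(tok) for tok in items_part.split()]
    weight = float(weight_part)
    values = [float(tok) for tok in values_part.split()]
    if len(items) != len(values):
        raise ValueError("expected one value per item")
    return items, weight, values


# First line of titanic_processed.csv from this example.
line = "2 7 12 15 17:1.697197:0.156452 0.009773 0.018053 0.932594 0.580325"
items, weight, values = parse_transaction(line)

print(items)                               # [2, 7, 12, 15, 17]
print(weight)                              # 1.697197
print(abs(sum(values) - weight) < 1e-6)    # True for this sample line
```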
The second file is the column mapping file. This file maps each numeric item ID in the transaction file back to the original dataset values. Each line contains three fields separated by spaces: a numeric code, the item ID used in the transaction file, and the corresponding attribute-value description (e.g., Age=[19.0-24.0] or Sex=male).
The second file produced in this example is named titanic_processed_colNameNew.txt and contains:
10000 1 SibSp=[0.0-1.0]
10002 2 Age=[19.0-24.0]
20002 3 Age=[24.0-30.0]
30002 4 Age=[30.0-35.0]
40002 5 Age=[35.0-39.0]
50002 6 Age=[39.0-49.0]
10003 7 Fare=[7.25-8.05]
20003 8 Fare=[8.05-13.0]
30003 9 Fare=[13.0-26.0]
40003 10 Fare=[26.0-71.2833]
50003 11 Fare=[71.2833-89.1042]
10004 12 Pclass=3
20004 13 Pclass=1
30004 14 Pclass=2
10005 15 Sex=male
20005 16 Sex=female
10006 17 Embarked=S
20006 18 Embarked=C
30006 19 Embarked=Q
This mapping ensures that every item ID in the transaction file can be interpreted and traced back to its original attribute-value in the dataset.
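As an illustration of this traceability, the following Python sketch decodes the items of the first transaction using a few lines of the mapping file. It assumes that the second field of each mapping line is the item ID and that the third field is the attribute-value label, as in the listing above; `load_mapping` and `decode` are hypothetical helper names.

```python
# A few lines of titanic_processed_colNameNew.txt, copied from this example.
MAPPING_TEXT = """\
10002 2 Age=[19.0-24.0]
10003 7 Fare=[7.25-8.05]
10004 12 Pclass=3
10005 15 Sex=male
10006 17 Embarked=S
"""


def load_mapping(text):
    """Build a dict from item ID to its attribute=value label."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split(None, 2)  # at most three fields: code, item ID, label
        if len(parts) == 3:
            _code, item, label = parts
            mapping[int(item)] = label
    return mapping


def decode(items, mapping):
    """Replace each item ID by its label, flagging IDs missing from the mapping."""
    return [mapping.get(i, f"unknown item {i}") for i in items]


mapping = load_mapping(MAPPING_TEXT)
print(decode([2, 7, 12, 15, 17], mapping))
# ['Age=[19.0-24.0]', 'Fare=[7.25-8.05]', 'Pclass=3', 'Sex=male', 'Embarked=S']
```

The decoded labels agree with the first record of the sample dataset: a male passenger aged 22, travelling in class 3 with a fare of 7.25, who embarked at S.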
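More information about UBTGen can be found in this paper: S. Krishnamoorthy, "Interpretable classifier models for decision support using high utility gain patterns," IEEE Access, 2024. DOI: https://doi.org/10.1109/ACCESS.2024.3455563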
The following GitHub repository also offers this algorithm and other components of the HUG-IML method: github.com/srikumar2050/ubtgen
The code of UBTGen is included from this repository in SPMF under the GPL license.