Mining the Top-K Frequent Itemsets using the HTK-negFIN Algorithm (SPMF documentation)
This example explains how to run the HTK-negFIN algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "HTK_NEGFIN" algorithm, (2) select the input file "contextPasquier99.txt", (3) set the output file name (e.g. "output.txt"), (4) set k to 8 and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run HTK_NEGFIN contextPasquier99.txt output.txt 8 in a folder containing spmf.jar and the example input file contextPasquier99.txt. - If you are using the source code version of SPMF, launch the file "MainTestHTKnegFIN_saveToFile.java" in the package ca.pfv.SPMF.tests.
What is HTK_NEGFIN?
HTK-negFIN (A Top-K Frequent Itemset Miner based on negNodesets) is an algorithm designed for the efficient discovery of the top-K most frequent itemsets in a transaction database, without requiring a user-defined minimum support threshold. It is based on the negNodeset structure introduced in the negFIN algorithm, and was proposed by Malliaridis and Ougiaroglou (2026).
Instead of asking the user to define a minimum support threshold, HTK_NEGFIN asks the user to specify the number k of itemsets to find. The algorithm then discovers the k most frequent itemsets in the database. This is more intuitive to use, as the user does not need to know in advance what support values are meaningful for a given dataset.
Implementation note: This implementation of HTK_NEGFIN is implemented to return k itemsets plus any tied itemsets at the K-th position (K+N behaviour).
What is the input of the HTK_NEGFIN algorithm?
The input is a transaction database (also called a binary context) and a threshold named k (a positive integer value).
A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.
| Transaction id | Items |
| t1 | {1, 3, 4} |
| t2 | {2, 3, 5} |
| t3 | {1, 2, 3, 5} |
| t4 | {2, 5} |
| t5 | {1, 2, 3, 5} |
What is the output of the HTK_NEGFIN algorithm?
HTK_NEGFIN discovers the k itemsets (groups of items) that occur most frequently in a transaction database, plus any additional itemsets that are tied at the K-th position. To use this algorithm, you must set a value for the parameter k, which represents the desired number of itemsets to be returned.
For example, if HTK_NEGFIN is run on the previous transaction database with k = 8, it produces the following result:
| Itemset | Support |
| {2} | 4 |
| {3} | 4 |
| {5} | 4 |
| {2, 5} | 4 |
| {3, 1} | 3 |
| {3, 2, 5} | 3 |
| {3, 2} | 3 |
| {3, 5} | 3 |
How should I interpret the results?
In the results, each itemset is annotated with its support. The support of an itemset, also called its frequency, is the number of transactions in the database that contain that itemset. For example, the itemset {3, 2, 5} has a support of 3 because it appears in transactions t2, t3 and t5.
For k = 8, the itemsets presented in the previous table are the top-8 most frequent itemsets found by HTK_negFIN.
An important characteristic of this implementation of HTK_negFIN is that it supports K+N ties. This means that the algorithm will return exactly K itemsets, plus any additional itemsets (N) that have the same support value as the K-th itemset. This ensures that no tied itemsets are arbitrarily excluded from the results.
Input file format
The input file format used by HTK_negFIN is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is represented by a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.
For example, for the previous example, the input file is defined as follows:
1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.
Output file format
The output file format is defined as follows. It is a text file, where each line represents a top-k itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, here is the output file for this example. The first line indicates the frequent itemset consisting of item 2 and shows that this itemset has a support of 4 transactions.
2 #SUP: 4
3 #SUP: 4
5 #SUP: 4
2 5 #SUP: 4
3 1 #SUP: 3
3 2 5 #SUP: 3
3 2 #SUP: 3
3 5 #SUP: 3
Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.
Performance
HTK_negFIN is designed to be a high-performance algorithm for top-k frequent itemset mining. It achieves efficiency through its negNodeset representation based on the negFIN algorithm, the BMC Tree structure for organizing transactions, and dynamic threshold raising via its integrated Q-Heap-based top-k management. The use of negative itemset representations can be particularly efficient for dense datasets.
Implementation details
This implementation is a Java translation of the HTK_negFIN algorithm, adapted by the authors of the HTK_negFIN algorithm for the SPMF library. Several design decisions were made to ensure correctness and efficiency within the JVM environment:
- The Q-Heap structure is implemented as a specialized priority queue that maintains the top-K itemsets in sorted order with efficient insertion operations.
- Note: Support for K+N ties has been implemented in this version. The implementation will return exactly K itemsets plus any tied itemsets N at the K-th position, ensuring complete results.
- The algorithm is based on the negFIN approach by Aryabarzan et al. (2018), which uses negative itemset node representations to efficiently compute support without repeatedly scanning the database.
- The BMC Tree (Bitmap Code Tree) structure is used to organize transactions in a compact form, enabling fast support counting through bitmap operations.
Optional parameter(s)
This implementation allows to specify additional optional parameter(s):
- "Separator" (string): This parameter allows to specify the character used to separate items in the input file. By default, a single space " " is used.
These parameter(s) are available in the GUI of SPMF and also in the example "MainTestHTKnegFIN_saveToFile.java" provided in the source code of SPMF.
The parameter(s) can also be used in the command line with the Jar file. If you want to use these optional parameter(s) in the command line, it can be done as follows. Consider this example:
java -jar spmf.jar run HTK_negFIN contextPasquier99.txt output.txt 8 " "
This command means to apply the algorithm on the file "contextPasquier99.txt" and output the results to "output.txt". Moreover, it specifies that the user wants to find patterns for k=8, using a space as the separator.
Optional feature: giving names to items
Some users have requested the feature of giving names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPasquier99.txt". Here we have modified the file to give names to the items:
@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
In this file, the first line indicates that it is a file where names are given to items. Then, the second line indicates that item 1 is called "apple". The third line indicates that item 2 is called "orange". Then the following lines define transactions in the SPMF format.
Then, if we apply the algorithm using this file through the user interface of SPMF or the command line, the output file will contain patterns such as:
orange #SUP: 4
tomato #SUP: 4
bread #SUP: 4
orange bread #SUP: 4
tomato apple #SUP: 3
tomato orange bread #SUP: 3
tomato orange #SUP: 3
tomato bread #SUP: 3
Note that this feature could also be used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.
Where can I get more information about the HTK_negFIN algorithm?
This is the research article describing the HTK_negFIN algorithm:
Konstantinos Malliaridis and Stefanos Ougiaroglou. Efficient techniques for retrieving top-K frequent itemsets. Expert Systems with Applications, https://doi.org/10.1016/j.eswa.2026.131250
The negFIN algorithm, which HTK_negFIN is based on, was introduced in:
Aryabarzan, N., Minaei-Bidgoli, B., & Teshnehlab, M. (2018). negFIN: An efficient algorithm for fast mining frequent itemsets. Expert Systems with Applications, 105, 129-143.
For a good overview of frequent itemset mining algorithms, you may read this survey paper.