Mining Frequent Weighted Itemsets using the WIT-FWI Algorithms (SPMF documentation)

This example explains how to run the WIT-FWI, WIT-FWI-MOD, and WIT-FWI-DIFF algorithms using the SPMF open-source data mining library.

How to run this example?

What are the WIT-FWI algorithms?

The WIT-FWI family of algorithms are designed for discovering frequent weighted itemsets in a transaction database where each item is associated with a weight representing its importance. These algorithms use a tree-based data structure called the WIT-tree (Weighted Itemset Tree) to efficiently mine weighted itemsets without candidate generation.

Three variants are available in SPMF:

All three algorithms implement the same weighted support definition and produce the same output (the complete set of frequent weighted itemsets).

The weighted support of an itemset X is defined as:

wsup(X) = ΣT∈g(X) w(X,T) / ΣT∈D maxi∈T w(i)

where g(X) is the set of transactions containing X, w(X,T) is the weight of X in transaction T (computed as the sum of weights of items in X that appear in T), and the denominator is the total transaction weight across the entire database D. An itemset X is a Frequent Weighted Itemset (FWI) if its weighted support is no less than a user-specified minimum weighted support threshold.

Note that this weighted support definition is also used by algorithms such as NFWI and NFWCI but it is different from the definition used by other algorithms such as WFIM.

What is the input?

The WIT-FWI algorithms take as input:

Let's consider the following database consisting of 10 transactions and 5 items (1, 2, 3, 4, 5). This database is provided in the text file "DB_RWFIM.txt" in the package ca.pfv.spmf.algorithms.frequentpatterns.witfwi of the SPMF distribution.

Transaction ID Items
T11 2 3 4 5
T21 3 4
T32 3 5
T41 2 4 5
T52 3 4 5
T61 3 5
T72 4
T81 2 3
T93 4 5
T101 2 4

The weight table is provided in the file "weights_RWFIM.txt" and assigns a weight to each item as follows:

Item Weight
10.40
20.70
31.00
40.50
50.45

For example, item 3 has the highest weight (1.00), meaning it is the most important item, while item 1 has the lowest weight (0.40).

What is the output?

The output of the WIT-FWI algorithms is the set of all frequent weighted itemsets whose weighted support is no less than the minimum weighted support threshold. All three algorithm variants produce the same output.

For example, if we run any of the WIT-FWI algorithms on the database above with a minimum weighted support of 40% (i.e., minWS = 0.40), we obtain the following 11 frequent weighted itemsets:

Itemset Weighted Support (wsup)
{1}0.5783
{1, 3}0.4106
{5}0.6044
{2, 5} 0.4012
{3, 5} 0.5222
{4}0.6739
{3, 4} 0.4099
{2, 4} 0.4681
{2}0.6953
{2, 3}0.4313
{3}0.7360

For example, consider the second row of the table. It indicates that itemset {1,3} consisting of items 1 and 3, has a weighted support of 0.4106. It is a frequent weighted itemset because this value is no less than the minimum threshold set to 0.4 in this example.

Input file format

The WIT-FWI algorithms take two input files:

Transaction database file: The transaction database file is a text file where each line represents one transaction. Each transaction is a list of items (positive integers) separated by single spaces. Lines starting with #, %, or @ are treated as comments and ignored. For example, the file DB_RWFIM.txt is defined as follows:

1 2 3 4 5
1 3 4
2 3 5
1 2 4 5
2 3 4 5
1 3 5
2 4
1 2 3
3 4 5
1 2 4

Each line is a transaction. For example, the first line means that transaction T1 contains items 1, 2, 3, 4, and 5.

Weight table file: The weight table file is a text file where each line associates one item with its weight. Each line contains an item identifier (a positive integer) followed by a single space and the item weight (a positive real number). Lines starting with #, %, or @ are treated as comments and ignored.

For example, the file weights_RWFIM.txt is defined as follows:

1 0.4
2 0.7
3 1.0
4 0.5
5 0.45

Each line associates one item with its weight. For example, item 1 has weight 0.4 and item 3 has the maximum weight of 1.0.

Output file format

The output file is a text file where each line represents one frequent weighted itemset. On each line, the items of the itemset are first listed as positive integers separated by single spaces. After all the items, the keyword "#WSUP:" appears, followed by the weighted support of the itemset (a real number formatted to 4 decimal places).

For example, the output file for this example is:

1 #WSUP: 0.5783
1 3 #WSUP: 0.4106
5 #WSUP: 0.6044
2 5 #WSUP: 0.4012
3 5 #WSUP: 0.5222
4 #WSUP: 0.6739
3 4 #WSUP: 0.4099
2 4 #WSUP: 0.4681
2 #WSUP: 0.6953
2 3 #WSUP: 0.4313
3 #WSUP: 0.7360

For example, the first line indicates that the itemset {1} has a weighted support of 0.5783. The following lines follow the same format.

Performance

The three variants (WIT-FWI, WIT-FWI-MOD, and WIT-FWI-DIFF) differ in their internal optimizations and implementation strategies, but all produce the same complete set of frequent weighted itemsets. In practice, these algorithms are efficient for moderate to large databases.

Where can I get more information about the WIT-FWI algorithms?

The WIT-FWI algorithms are based on concepts from weighted frequent itemset mining research. For more details about weighted itemset mining and related algorithms, you may refer to:

Vo, B., Coenen, F., Le, B., 2013. A new method for mining frequent weighted itemsets based on wit-trees. Expert Syst. Appl. 40 (4), 1256–1264