Mining Weighted Frequent Itemsets in a Transaction Database using the WFIM Algorithm (SPMF documentation)

This example explains how to run the WFIM algorithm using the SPMF open-source data mining library.

How to run this example?

What is WFIM?

WFIM (Yun & Leggett, SDM 2005) is an algorithm for discovering weighted frequent itemsets in a transaction database where each item is associated with a weight. Unlike traditional frequent itemset mining, which treats all items equally, WFIM accounts for the relative importance (weight) of items by computing a weighted support for each itemset.

The weighted support of an itemset X is defined in this paper as:

wsup(X) = w(X) × sup(X)

where w(X) is the average weight of the items in X and sup(X) is the support count of X (i.e., the number of transactions containing X). An itemset X is called a Weighted Frequent Itemset (WFI) if its weighted support is no less than a user-specified minimum weighted support threshold.

Note that this definition of the weighted support is not the same as the one used by some other algorithms in SPMF for weighted frequent itemset mining.

What is the input?

WFIM takes as input:

Let's consider the following database consisting of 10 transactions and 5 items (1, 2, 3, 4, 5). This database is provided in the text file "DB_RWFIM.txt" in the package ca.pfv.spmf.test of the SPMF distribution.

Transaction ID Items
T11 2 3 4 5
T21 3 4
T32 3 5
T41 2 4 5
T52 3 4 5
T61 3 5
T72 4
T81 2 3
T93 4 5
T101 2 4

The weight table is provided in the file "weights_RWFIM.txt" and assigns a weight to each item as follows:

Item Weight
10.40
20.70
31.00
40.50
50.45

For example, item 3 has the highest weight (1.00), meaning it is the most important item, while item 1 has the lowest weight (0.40).

What is the output?

The output of WFIM is the set of all weighted frequent itemsets whose weighted support (wsup = avg_weight × support_count) is no less than minSup × dbSize.

WFIM also uses two pruning conditions to efficiently avoid exploring itemsets that cannot be weighted frequent:

For example, if we run WFIM on the database above with a minimum support ratio of 40% (i.e., minSup = 0.40, which means an itemset must have wsup ≥ 0.40 × 10 = 4.0) and a minimum weight of 0.5, we obtain the following 14 weighted frequent itemsets:

Itemset Support count Avg. weight Weighted support (wsup)
{3} 7 1.0000 7.0000
{3, 2} 4 0.8500 3.4000
{3, 4} 4 0.7500 3.0000
{3, 5} 5 0.7250 3.6250
{3, 1} 4 0.7000 2.8000
{2} 7 0.7000 4.9000
{2, 4} 5 0.6000 3.0000
{2, 5} 4 0.5750 2.3000
{2, 1} 4 0.5500 2.2000
{4} 7 0.5000 3.5000
{4, 5} 4 0.4750 1.9000
{4, 1} 4 0.4500 1.8000
{5} 6 0.4500 2.7000
{1} 6 0.4000 2.4000

Note: An itemset is considered a weighted frequent itemset if its weighted support (wsup = avg_weight × support_count) is at least minSup × dbSize = 0.40 × 10 = 4.0. However, itemsets with wsup < 4.0 may still appear in the output if they satisfy the pruning conditions (they don't violate both Condition 1 AND Condition 2). For example, {1, 3} has wsup = 2.8 but appears because it doesn't satisfy Condition 1 (sup=4 ≥ minSup=4) nor Condition 2 (4 × 1.0 = 4.0 ≥ 4.0).

The actual output file produced by WFIM for this example is:

3 #SUP: 7 #WAVG: 1.0000
3 2 #SUP: 4 #WAVG: 0.8500
3 4 #SUP: 4 #WAVG: 0.7500
3 5 #SUP: 5 #WAVG: 0.7250
3 1 #SUP: 4 #WAVG: 0.7000
2 #SUP: 7 #WAVG: 0.7000
2 4 #SUP: 5 #WAVG: 0.6000
2 5 #SUP: 4 #WAVG: 0.5750
2 1 #SUP: 4 #WAVG: 0.5500
4 #SUP: 7 #WAVG: 0.5000
4 5 #SUP: 4 #WAVG: 0.4750
4 1 #SUP: 4 #WAVG: 0.4500
5 #SUP: 6 #WAVG: 0.4500
1 #SUP: 6 #WAVG: 0.4000

Input file format

WFIM takes two input files:

1. Transaction database file

The transaction database file is a text file where each line represents one transaction. Each transaction is a list of items (positive integers) separated by single spaces. Lines starting with #, %, or @ are treated as comments and ignored.For example, the file DB_RWFIM.txt is defined as follows:

1 2 3 4 5
1 3 4
2 3 5
1 2 4 5
2 3 4 5
1 3 5
2 4
1 2 3
3 4 5
1 2 4

Each line is a transaction. For example, the first line means that transaction T1 contains items 1, 2, 3, 4, and 5.

2. Weight table file

The weight table file is a text file where each line associates one item with its weight. Each line contains an item identifier (a positive integer) followed by a single space and the item weight (a real number between 0 and 1). Lines starting with #, %, or @ are treated as comments and ignored.

For example, the file weights_RWFIM.txt is defined as follows:

1 0.4
2 0.7
3 1.0
4 0.5
5 0.45

Each line associates one item with its weight. For example, item 1 has weight 0.4 and item 3 has the maximum weight of 1.0.

Output file format

The output file is a text file where each line represents one weighted frequent itemset. On each line, the items of the itemset are first listed as positive integers separated by single spaces. After all the items, the keyword "#SUP:" appears, followed by the support count of the itemset. Then the keyword "#WAVG:" appears, followed by the average weight of the items in the itemset (a real number formatted to 4 decimal places).

For example, the output file for this example is:

3 #SUP: 7 #WAVG: 1.0000
3 2 #SUP: 4 #WAVG: 0.8500
3 4 #SUP: 4 #WAVG: 0.7500
3 5 #SUP: 5 #WAVG: 0.7250
3 1 #SUP: 4 #WAVG: 0.7000
2 #SUP: 7 #WAVG: 0.7000
2 4 #SUP: 5 #WAVG: 0.6000
2 5 #SUP: 4 #WAVG: 0.5750
2 1 #SUP: 4 #WAVG: 0.5500
4 #SUP: 7 #WAVG: 0.5000
4 5 #SUP: 4 #WAVG: 0.4750
4 1 #SUP: 4 #WAVG: 0.4500
5 #SUP: 6 #WAVG: 0.4500
1 #SUP: 6 #WAVG: 0.4000

For example, the first line indicates that the itemset {3} has a support count of 7 (appears in 7 out of 10 transactions) and an average weight of 1.0000. The weighted support can be computed as wsup({3}) = 1.0000 × 7 = 7.0000. The following lines follow the same format.

Performance

Weighted frequent itemset mining is a generalization of traditional frequent itemset mining that introduces item weights to reflect the varying importance of items. WFIM achieves efficiency by:

In practice, WFIM is efficient for moderate-sized databases.

Implementation details

The version offered in SPMF is an implementation of the WFIM algorithm based on the description provided in the original paper (Yun & Leggett, SDM 2005). The implementation uses an FP-tree data structure and a recursive pattern-growth mining procedure. Weights are computed using exact integer arithmetic (scaled by 10000) to avoid floating-point inconsistencies. The output reports the support count and average weight (formatted to 4 decimal places) for each weighted frequent itemset.

Where can I get more information about the WFIM algorithm?

This is the reference of the article describing the WFIM algorithm:

Yun, U., Leggett, J.J. (2005). WFIM: Weighted Frequent Itemset Mining with a weight range and a minimum weight. Proceedings of the SIAM International Conference on Data Mining (SDM 2005), pp. 636-640.