Mining Frequent Generator Itemsets using the Pascal Algorithm (SPMF documentation)
This example explains how to run the Pascal algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Pascal" algorithm, (2) select the input file "contextZart.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup to 40% and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run Pascal contextZart.txt output.txt 40% in a folder containing spmf.jar and the example input file contextZart.txt. - If you are using the source code version of SPMF, launch the file "MainTestPascal.java" in the package ca.pfv.SPMF.tests.
What is Pascal?
Pascal is an algorithm for discovering frequent itemsets and at the same time identify which ones
are generators in a transaction database.
Pascal is an Apriori-based algorithm. It uses a
special pruning property that can avoid counting the support of some
candidate itemsets. This property is based on the fact that if an
itemset of size k is not a generator, then its support is the support
of the minimum support of its subsets of size k-1.
What is the input of the Pascal algorithm?
The input is a transaction database (aka binary context) and a threshold named minsup (a value between 0 and 100 %).
A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextZart.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.
Transaction id | Items |
t1 | {1, 2, 4, 5} |
t2 | {1, 3} |
t3 | {1, 2, 3, 5} |
t4 | {2, 3, 5} |
t5 | {1, 2, 3, 5} |
What is the output of the Pascal algorithm?
The output of the Pascal algorithm for a transaction database and a minimum support threshold minsup is the set of all frequent itemsets and their support, and a flag indicating which itemsets is a generator.
To explain what is a frequent itemset and a generator, it is necessary to review a few definitions.
An itemset is a unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3} has a support of 3 because it appears in three transactions from the database (t2, t3 and t5). A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A generator is an itemset X such that there does not exist an itemset Y strictly included in X that has the same support.
By running Pascal with the previous transaction database and a minsup of 40% (2 transactions), we obtain the following result:
itemsets | is a generator? | support |
{} | yes | 5 |
{1} | yes | 4 |
{2} | yes | 4 |
{3} | yes | 4 |
{5} | yes | 4 |
{1, 2} | yes | 3 |
{1, 3} | yes | 3 |
{1, 5} | yes | 3 |
{2, 3} | yes | 3 |
{2, 5} | yes | 4 |
{3, 5} | yes | 3 |
{1, 2, 3} | yes | 2 |
{1, 2, 5} | no | 3 |
{1, 3, 5} | yes | 2 |
{2, 3, 5} | no | 3 |
{1, 2, 3, 5} | no | 2 |
How should I interpret the results?
In the results, all frequent itemsets are shown. Each frequent itemset that is a generator is marked as such ("yes"). For each itemset, its support is indicated. For example, the itemset {1,2,3,5} has a support of 2 because it appears in 2 transactions (t3 and t5) and it is not a generator because it has a subset {1,2,3} that has the same support.
Input file format
The input file format used by Pascal is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.
For example, for the previous example, the input file is defined as follows:
1 2 4 5
1 3
1 2 3 5
2 3 5
1 2 3 5
Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.
Output file format
The output file format is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. After, all the items, the keyword "#IS_GENERATOR:" appears, which is followed by a boolean indicating if the itemset is a generator (true) or not (false). For example, here is the output file for this example. The first line indicates the frequent itemset consisting of the item 1 and it indicates that this itemset has a support of 3 transactions and is a generator.
1 #SUP: 0 #IS_GENERATOR true
2 #SUP: 0 #IS_GENERATOR true
3 #SUP: 0 #IS_GENERATOR true
5 #SUP: 0 #IS_GENERATOR true
1 2 #SUP: 2 #IS_GENERATOR true
1 3 #SUP: 3 #IS_GENERATOR true
1 5 #SUP: 2 #IS_GENERATOR true
2 3 #SUP: 3 #IS_GENERATOR true
2 5 #SUP: 4 #IS_GENERATOR true
3 5 #SUP: 3 #IS_GENERATOR true
1 2 3 #SUP: 2 #IS_GENERATOR false
1 2 5 #SUP: 2 #IS_GENERATOR false
1 3 5 #SUP: 2 #IS_GENERATOR false
2 3 5 #SUP: 3 #IS_GENERATOR false
1 2 3 5 #SUP: 2 #IS_GENERATOR false
Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.
Optional parameter(s): constraints on the size of itemsets
Sometimes, there may be just too many itemsets, and itemsets containing many items may not be interesting. Thus, it is also possible to specify an optional parameter in the user interface of SPMF:
- Max pattern length (integer) : This parameter allows to set a maximum number of items to appear on the an itemset. By default, this parameter is equal to the infinity if it is not set.
If you are using the command line interface of SPMF, it is also possible to use this optional parameter by adding it at the end of the command. For example:
java -jar spmf.jar run Pascal contextZart.txt
output.txt 40% 2
means to run the above example to find only frequent itemsets having 2 items or less.
Optional feature: giving names to items
Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextZart.txt". Here we have modified the file to give names to the items:
@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
1 3
1 2 3 5
2 3 5
1 2 3 5
In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define four sequences in the SPMF format.
Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:
apple tomato bread #SUP: 2 #IS_GENERATOR true
orange tomato bread #SUP: 3 #IS_GENERATOR false
apple orange tomato bread #SUP: 2 #IS_GENERATOR false
Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.
Performance
The Pascal algorithm should be more or less as efficient as Apriori since it is an Apriori-based algorithm. Pascal utilizes a pruning strategies that is supposed to make it faster by avoiding counting the support of some candidates. But to see really which one is better, experiments would need to be done to compare it.
Where can I get more information about the Pascal algorithm?
The Pascal algorithm is described in this paper:
Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., & Lakhal, L. (2000). Mining frequent patterns with counting inference. ACM SIGKDD Explorations Newsletter, 2(2), 66-75.
Also, for a good overview of frequent itemset mining algorithms, you may read this survey paper.