Mining Frequent Subgraphs using the TSeqMiner Algorithm (SPMF documentation)
This example explains how to run the TSeqMiner algorithm using the SPMF open-source data mining library.
How to run this example?
To run the example, you will need to first download the example datasets that can be used with TSEQMiner: datasets (ZIP, 5.0 MB) and extract the files on your hardrive. Inside, the folder "TSEQMINER_datasets" you will find two subfolders containing two datasets:
-
DB_TSEQMINER
- USFLIGHT_TSEQMINER
Then:
- If you are using the graphical interface, (1) choose the "TSEQMINER" algorithm, (2) set the input directory as the path to the sample directory "DB_TSEQMINER" folder on your hard drive such as: /D:/Workspace/DB_TSEQMINER, (3) set the output file name (e.g. "output.txt") (4) set discretizationThreshold = 1, minInitSup = 0.004, minTailSup = 60, minSig = 8, attributeCount = 43 and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run TSeqMiner output.txt /D:/Workspace/DB_TSEQMINER 1 0.004 60 8.0 43 in a folder containing spmf.jar and where you replace /D:/Workspace/DB_TSEQMINER by the path of the example input folder "DB_TSEQMINER" on your hard drive. - If you are using the source code version of SPMF, edit the file "MainTestTSeqMiner1.java" in the package ca.pfv.SPMF.tests and set the path of the data to the correct path on your hard drive such as /D:/Workspace/DB_TSEQMINER and then run the example.
What is this algorithm?
TSeqMiner is an algorithm for discovering strongly correlated sequential patterns in dynamic attributed graph, proposed by Fournier-Viger, P., Cheng, C. et al. (2019)
What is the input of the algorithm?
The input of TSeqMiner is a dynamic attributed graph and 5 parameters called discretization threshold, minInitSup, minTailSup, minSig, and attributeCount.
The dynamic attributed graph is described by 4 files, including:
- attributes_mapping.txt This file record all attributes names.
The format is:
A1
A2
A3
- vertices_mapping.txt This file describe all vertex entities and corresponding labels.
The format is:
0,v1
1,v2
2,v3
- attributes.txt This file record vertex and corresponding attribute values.
Line starting with 'T' represent a new graph. Within the range of the graph, in one line, the first item is vertex label and others are attribute values.
The format is:
T0
0 0.3 0.5 3.4
1 0.1 0.9 5.6
T1
0 0.2 0.3 3.3
1 0.1 0.5 1.3
- graph.txt This file describe the edge connnections.
Line starting with 'T' represent a new graph. Within the range of the graph, in one line, the first item is a vertex label of currently considered, and the others are vertex labels of the connected vertices of the current vertex.
The format is:
T0
0 45 87 97 35
1 49 41 56 67
T1
0 78 32
1 10
The algorithm has five parameters that must be specified are as follows:
- discretization threshold (float)
If the value is 2, it means when (next_value - cur_value) >= 2, the trend is '+' and when (next_value - cur_value) <= -2, the trend is '-'. Otherwise the trend is '0'.
- minInitSup (float)
A frequent sequence should satisfy that the frequency of the first item in sequence >= minInitSup.
- minTailSup (int)
A frequent sequence should satisfy that the number of tail point of the sequence >= minTailSup.
- minSig (float)
A signficant sequence should satisfy that the significance between any two items in sequence is >= minSig.
-attributeCount (int)
This parameter specify how many attributes are considered.
What is the output of TSeqMiner ?
Given a dynamic attributed graph and several parameters, the algorithm will output all significant trend sequences with extra information including number of tail supporting points, significance values. For the above example, the output file looks like this:
sequence support of tail itemset significance
{(IntellDtAnal0),(IntellDtAnal-)} {228,95} {12}
{(IntellDtAnal+),(IntellDtAnal0)} {236,96} {9.2}
{(IntellDtAnal+),(IntellDtAnal0),(IntellDtAnal-)} {236,96,61} {9.2,14.6}
{(JMLR0),(JMLR-)} {283,147} {10.3}
{(ICML+,JMLR+),(JMLR0)} {153,97} {9}
{(ICML+,JMLR+),(JMLR0),(JMLR-)} {153,97,103} {9,17.2}
{(SIGMOD-,ACMTransDBSys+),(PVLDB+,SIGMOD-)} {95,81} {8.4}
{(ICDE0,ACMTransDBSys+),(PVLDB+,SIGMOD-)} {102,69} {8.3}
{(SIGMOD0,ACMTransDBSys0),(PVLDB+,VLDB-)} {107,101} {8.1}
{(ICML0,MachineLearning0),(MachineLearning-)} {134,123} {8.3}
This file contains 10 significant trend sequences. Consider this line:
{(SIGMOD-,ACMTransDBSys+),(PVLDB+,SIGMOD-)} {95,81} {8.4}
It represents a pattern found in the dataset.
This pattern means thatthe trend set (SIGMOD-, ACMTransDBSys+) is likely to trigger the trend set (PVLDB+, SIGMOD-). Moreover, the numbers 95 and 81 are the numbers of supporting points corresponding to the first and second item. The number 8.4 indicates the significance between the two items.
Optional parameters available in the source code
Note that there is more parameters that can be used for TSeqMiner in the source code, in the class ParametersSetting.java. These parameters are not offered in the command line as they are for very specific cases.
Performance
The TSeqMiner algorithm is the first algorithm for this problem. This is the original implementation.
Where can I get more information about the algorithm?
This is the article describing the algorithm:
Fournier-Viger, P., Cheng, C., Cheng, Z., Lin, J. C.-W., Selmaoui-Folcher, N. (2019). Mining Significant Trend Sequences in Dynamic Attributed Graphs. Knowledge-Based Systems (KBS)