Mining High-Utility Sequential Rules from a Sequence Database with utility information using the HUSRM Algorithm (SPMF documentation)

This example explains how to run the HUINIV-Mine algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "HUSRM" algorithm, (2) select the input file "DataBase_HUSRM.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum utility to 40, (5) set the minimum confidence to 0.7, (6) set the minimum antecedent size to 4, (7) set the maximum consequent size to 4, and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run HUSRM DataBase_HUSRM.txt output.txt 40 0.7 4 4 in a folder containing spmf.jar and the example input file DataBase_HUSRM.txt.
If you are using the source code version of SPMF, launch the file "MainTest_HUSRM_saveToFile.java" in the package ca.pfv.SPMF.tests.

What is HUSRM?

HUSRM (Zida et al, 2015) is the first algorithm for discovering high-utility sequential rules in a sequence database containing utility information.

An typical example of a sequence database with utility information is a database of customer transactions containing sequences of transactions performed by customers, where each transaction is a set of items annotated with the profit generated by the sale of items. The goal of high-utility sequential rule mining is to find rules of the form A -> B, meaning that if a customer buy items A, he will then buy items B with a high confidence, and this rule generate a high profit. Although, this algorithm is designed for the scenario of sequence of transactions, the task is general and could be applied to other types of data such as sequences of webpages visited by user on a website, where the sale profit is replaced by the time spent on webpages.

This is the original implementation of HUSRM.

Note that the problem of high-utility sequential rule mining is similar to high-utility sequential pattern mining. However, a key advantage of high-utility sequential rule mining is that discovered rules provide information about the probability that if some customers buy some items A, they will then buy other items B. High-utility sequential patterns do not consider the confidence that a pattern will be followed.

What is the input?

HUSRM takes as input a sequence database with utility information, a minimum utility threshold min_utility (a positive integer), a minimum confidence threshold (a double value in the [0,1] interval, a maximum antecedent size (a positive integer) and a maximm consequent size (a positive nteger).

Let's consider the following sequence database consisting of 4 sequences of transactions (s1,s2, s3, s4) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DataBase_HUSRM.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

	Sequence	Sequence utility
s1	{1[1],2[4]},{3[10]},{6[9]},{7[2]},{5[1]}	27
s2	{1[1],4[12]},{3[20]},{2[4]},{5[1],7[2]}	40
s3	{1[1]},{2[4]},{6[9]},{5[1]}	15
s4	{1[3],2[4],3[5]},{6[3],7[1]}	16

Each line of the database is a sequence:

each sequence is an ordered list of transactions, such that transactions are enclosed by {} in this Example
each transaction contains a set of items represented by integers
each item is annotated with a utility value (e.g. sale profit), indicated between squared brackets
the sum of the utilities (e.g. profit) of all items in the sequence is also indicated (the "sequence utility" column)

What are real-life examples of such a database? A typical example is a database containing sequences of customer transactions. Imagine that each sequence represents the transactions made by a customer. The first customer named "s1" bought items 1 and 2, and those items respectively generated a profit of 1$ and 4$. Then, the customer bought item 3 for 10$. Then, the customer bought item 6 for 9 $. Then, the customer bought items 7 for 2$. Then the customer bought item 5 for 1 $.

What is the output?

The output of HUSRM is the set of high utility sequential rules meeting the criteria specified by the user

A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, and that X is unordered and Y is unordered. The support of a rule X==>Y is the number of sequences that contains the items in X before the items in Y, divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contains the items in X before the items in Y, divided by the number of sequences that contains X. For example, the rule {1,2,3} ==> {7} has a support of 3/5 because it appears in 3 out of 4 sequences in the database.

The utility (profit) of a rule is the total utility (profit) generated by the rule in the sequences where it appears. For example, the rule {1,2,3} ==> {7} appears in sequences s1,s2, and s4. In s1, the profit generated by that rule is 1$ + 4$ + 10$ + 2 $ = 17$. In s2, the profit generated by that rule is 1$ + 20$ + 4 + 2 $ = 27$. In s4, the profit generated by that rule is 3$ + 4$ + 5$ + 1$ =13$. Thus, the total utility of that rule in the database is 17$ + 27 $ + 13$ = 57 $

The HUSRM algorithm returns all high-utility sequential rules, that is each rule that meet the four following criteria:

the utility of the rule in the database is no less than a minimum utility threshold set by the user,
the confidence of the rule in the database is no less than a minimum confidence threshold set by the user,
the number of items in the antecedent (left side) of the rule contains no more than a maximum number of items specified by the user,
the number of items in the consequent (right side) of the rule contains no more than a maximum number of items specified by the user,

For example, if we run HUSRM with a minimum utility of 40 and minconf = 0.70 (70 %), and a maximum antecedent and consequent size of 4 items, we obtain 7 high-utility sequential rules:

rule	confidence	utility	support
1,4 ==> 2,3,5,7	100 %	40	1 sequence(s)
1,3,4 ==> 2,5,7	100 %	40	1 sequence(s)
1,2,3,4 ==> 5,7	100 %	40	1 sequence(s)
1,2,3 ==> 7	100 %	57	3 sequence(s)
1,3 ==> 7	100 %	45	3 sequence(s)
2,3 ==> 7	100 %	52	3 sequence(s)
3 ==> 7	100 %	40	3 sequence(s)

If the database is a transaction database from a store, we could interpret these results as rules representing the purchasing behavior of customers, such that these rules have a high confidence and generate a high profit. For example, the rule {1,3} -> {7} means that all customers buying the items 1 and 3 always buy the item 7 thereafter (since the confidence is 100%) and that this rule has generated a profit of 57 $ and appear in three sequences.

Input file format

The input file format of HUSRM is defined as follows. It is a text file.

Each line represents a sequence of transactions.
Each transaction is separated by the keyword -1.
A transaction is a list of items (positive integers) separated by single spaces and where each item is annotated with a generated sale profit indicated between square brackets [ ]. The sale profit is a positive integer.
In a transaction, it is assumed that items are sorted according to some order (eg. alphabetical order).
Each sequence ends by the keyword "-2". Then, it is followed by the keyword "SUtility:" followed by the sum of the utility (profit) of all items in that sequence.

For example, for the previous example, the input file is defined as follows:

1[1] 2[4] -1 3[10] -1 6[9] -1 7[2] -1 5[1] -1 -2 SUtility:27
1[1] 4[12] -1 3[20] -1 2[4] -1 5[1] 7[2] -1 -2 SUtility:40
1[1] -1 2[4] -1 6[9] -1 5[1] -1 -2 SUtility:15
1[3] 2[4] 3[5] -1 6[3] 7[1] -1 -2 SUtility:16

For example, consider the first line. It means that the first customer nbought items 1 and 2, and those items respectively generated a profit of 1$ and 4$. Then, the customer bought item 3 for 10$. Then, the customer bought item 6 for 9 $. Then, the customer bought items 7 for 2$. Then the customer bought item 5 for 1 $. Thus, this customer has made 5 transaction. The total utility (profit) generated by that sequence of transaction is 1$ + 4$ + 10$ + 9$ + 2$ + 1$ = 27 $.

Output file format

The output file format of HUSRM is defined as follows. It is a text file, where each line represents a high utility sequential rule. On each line, the items of the left side of the rule (antecedent) are first listed. Each item is represented by an integer, followed by a ",". After, the keyword " ==>" appears. It is followed by the items in the right side of the rule (consequent), each separated by ",". Then, the keyword "#SUP" appears followed by the support of the rule. Then, the keyword "#CONF" appears followed by the confidence of the rule. Then, the keyword "#UTIL" appears followed by the utility of the rule.

1,4 ==> 2,3,5,7 #SUP: 1.0 #CONF: 1.0 #UTIL: 40.0
1,3,4 ==> 2,5,7 #SUP: 1.0 #CONF: 1.0 #UTIL: 40.0
1,2,3,4 ==> 5,7 #SUP: 1.0 #CONF: 1.0 #UTIL: 40.0
1,2,3 ==> 7 #SUP: 3.0 #CONF: 1.0 #UTIL: 57.0
1,3 ==> 7 #SUP: 3.0 #CONF: 1.0 #UTIL: 45.0
2,3 ==> 7 #SUP: 3.0 #CONF: 1.0 #UTIL: 52.0
3 ==> 7 #SUP: 3.0 #CONF: 1.0 #UTIL: 40.0

For example, the fourth line indicates that all customers buying the items 1, 2 and 3 will then buy item 7 with a confidence of 100%, and that this rule has generated a profit of 57 $ and appear in three sequences.

Performance

High utility sequential rulemining is a more difficult problem than sequential rule mining and sequential pattern mining. Therefore, high-utility sequential rule mining algorithms are generally slower than those types of algorithms. The HUSRM algorithm is the first algorithm for high-utility sequential rule mining.

Implementation details

This is the original implementation of HUSRM.

Where can I get more information about the HUSRM algorithm?

This is the article describing the HUSRM algorithm:

Zida, S., Fournier-Viger, P., Wu, C.-W., Lin, J. C. W., Tseng, V.S., (2015). Efficient Mining of High Utility Sequential Rules. Proc. 11th Intern. Conference on Machine Learning and Data Mining (MLDM 2015). Springer, LNAI 9166, pp. 157-171.

For a good overview of itemset mining algorithms, you may read this survey paper.