Mining High-Utility Sequential Rules from a Sequence Database with utility information using the HUSRM Algorithm (SPMF documentation)

This example explains how to run the HUINIV-Mine algorithm using the SPMF open-source data mining library.

How to run this example?

What is HUSRM?

HUSRM (Zida et al, 2015) is the first algorithm for discovering high-utility sequential rules in a sequence database containing utility information.

An typical example of a sequence database with utility information is a database of customer transactions containing sequences of transactions performed by customers, where each transaction is a set of items annotated with the profit generated by the sale of items. The goal of high-utility sequential rule mining is to find rules of the form A -> B, meaning that if a customer buy items A, he will then buy items B with a high confidence, and this rule generate a high profit. Although, this algorithm is designed for the scenario of sequence of transactions, the task is general and could be applied to other types of data such as sequences of webpages visited by user on a website, where the sale profit is replaced by the time spent on webpages.

This is the original implementation of HUSRM.

Note that the problem of high-utility sequential rule mining is similar to high-utility sequential pattern mining. However, a key advantage of high-utility sequential rule mining is that discovered rules provide information about the probability that if some customers buy some items A, they will then buy other items B. High-utility sequential patterns do not consider the confidence that a pattern will be followed.

What is the input?

HUSRM takes as input a sequence database with utility information, a minimum utility threshold min_utility (a positive integer), a minimum confidence threshold (a double value in the [0,1] interval, a maximum antecedent size (a positive integer) and a maximm consequent size (a positive nteger).

Let's consider the following sequence database consisting of 4 sequences of transactions (s1,s2, s3, s4) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DataBase_HUSRM.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


Sequence Sequence utility
s1 {1[1],2[4]},{3[10]},{6[9]},{7[2]},{5[1]} 27
s2 {1[1],4[12]},{3[20]},{2[4]},{5[1],7[2]} 40
s3 {1[1]},{2[4]},{6[9]},{5[1]} 15
s4 {1[3],2[4],3[5]},{6[3],7[1]} 16

Each line of the database is a sequence:

What are real-life examples of such a database? A typical example is a database containing sequences of customer transactions. Imagine that each sequence represents the transactions made by a customer. The first customer named "s1" bought items 1 and 2, and those items respectively generated a profit of 1$ and 4$. Then, the customer bought item 3 for 10$. Then, the customer bought item 6 for 9 $. Then, the customer bought items 7 for 2$. Then the customer bought item 5 for 1 $.

What is the output?

The output of HUSRM is the set of high utility sequential rules meeting the criteria specified by the user

A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, and that X is unordered and Y is unordered. The support of a rule X==>Y is the number of sequences that contains the items in X before the items in Y, divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contains the items in X before the items in Y, divided by the number of sequences that contains X. For example, the rule {1,2,3} ==> {7} has a support of 3/5 because it appears in 3 out of 4 sequences in the database.

The utility (profit) of a rule is the total utility (profit) generated by the rule in the sequences where it appears. For example, the rule {1,2,3} ==> {7} appears in sequences s1,s2, and s4. In s1, the profit generated by that rule is 1$ + 4$ + 10$ + 2 $ = 17$. In s2, the profit generated by that rule is 1$ + 20$ + 4 + 2 $ = 27$. In s4, the profit generated by that rule is 3$ + 4$ + 5$ + 1$ =13$. Thus, the total utility of that rule in the database is 17$ + 27 $ + 13$ = 57 $

The HUSRM algorithm returns all high-utility sequential rules, that is each rule that meet the four following criteria:

For example, if we run HUSRM with a minimum utility of 40 and minconf = 0.70 (70 %), and a maximum antecedent and consequent size of 4 items, we obtain 7 high-utility sequential rules:

rule confidence utility support
1,4 ==> 2,3,5,7 100 % 40 1 sequence(s)
1,3,4 ==> 2,5,7 100 % 40 1 sequence(s)
1,2,3,4 ==> 5,7 100 % 40 1 sequence(s)
1,2,3 ==> 7 100 % 57 3 sequence(s)
1,3 ==> 7 100 % 45 3 sequence(s)
2,3 ==> 7 100 % 52 3 sequence(s)
3 ==> 7 100 % 40 3 sequence(s)

If the database is a transaction database from a store, we could interpret these results as rules representing the purchasing behavior of customers, such that these rules have a high confidence and generate a high profit. For example, the rule {1,3} -> {7} means that all customers buying the items 1 and 3 always buy the item 7 thereafter (since the confidence is 100%) and that this rule has generated a profit of 57 $ and appear in three sequences.

Input file format

The input file format of HUSRM is defined as follows. It is a text file.

For example, for the previous example, the input file is defined as follows:

1[1] 2[4] -1 3[10] -1 6[9] -1 7[2] -1 5[1] -1 -2 SUtility:27
1[1] 4[12] -1 3[20] -1 2[4] -1 5[1] 7[2] -1 -2 SUtility:40
1[1] -1 2[4] -1 6[9] -1 5[1] -1 -2 SUtility:15
1[3] 2[4] 3[5] -1 6[3] 7[1] -1 -2 SUtility:16

For example, consider the first line. It means that the first customer nbought items 1 and 2, and those items respectively generated a profit of 1$ and 4$. Then, the customer bought item 3 for 10$. Then, the customer bought item 6 for 9 $. Then, the customer bought items 7 for 2$. Then the customer bought item 5 for 1 $. Thus, this customer has made 5 transaction. The total utility (profit) generated by that sequence of transaction is 1$ + 4$ + 10$ + 9$ + 2$ + 1$ = 27 $.

Output file format

The output file format of HUSRM is defined as follows. It is a text file, where each line represents a high utility sequential rule. On each line, the items of the left side of the rule (antecedent) are first listed. Each item is represented by an integer, followed by a ",". After, the keyword " ==>" appears. It is followed by the items in the right side of the rule (consequent), each separated by ",". Then, the keyword "#SUP" appears followed by the support of the rule. Then, the keyword "#CONF" appears followed by the confidence of the rule. Then, the keyword "#UTIL" appears followed by the utility of the rule.

1,4 ==> 2,3,5,7 #SUP: 1.0 #CONF: 1.0 #UTIL: 40.0
1,3,4 ==> 2,5,7 #SUP: 1.0 #CONF: 1.0 #UTIL: 40.0
1,2,3,4 ==> 5,7 #SUP: 1.0 #CONF: 1.0 #UTIL: 40.0
1,2,3 ==> 7 #SUP: 3.0 #CONF: 1.0 #UTIL: 57.0
1,3 ==> 7 #SUP: 3.0 #CONF: 1.0 #UTIL: 45.0
2,3 ==> 7 #SUP: 3.0 #CONF: 1.0 #UTIL: 52.0
3 ==> 7 #SUP: 3.0 #CONF: 1.0 #UTIL: 40.0

For example, the fourth line indicates that all customers buying the items 1, 2 and 3 will then buy item 7 with a confidence of 100%, and that this rule has generated a profit of 57 $ and appear in three sequences.

Performance

High utility sequential rulemining is a more difficult problem than sequential rule mining and sequential pattern mining. Therefore, high-utility sequential rule mining algorithms are generally slower than those types of algorithms. The HUSRM algorithm is the first algorithm for high-utility sequential rule mining.

Implementation details

This is the original implementation of HUSRM.

Where can I get more information about the HUSRM algorithm?

This is the article describing the HUSRM algorithm:

Zida, S., Fournier-Viger, P., Wu, C.-W., Lin, J. C. W., Tseng, V.S., (2015). Efficient Mining of High Utility Sequential Rules. Proc. 11th Intern. Conference on Machine Learning and Data Mining (MLDM 2015). Springer, LNAI 9166, pp. 157-171.

For a good overview of itemset mining algorithms, you may read this survey paper.