Frequent Itemset Mining for Data Analysis

Philippe Fournier-Viger
http://www.philippe-Fournier-viger.com
 
Maximal, Closed and Generator
Itemsets
1
Source code and datasets
 available in the 
SPMF library
Introduction
In previous videos, we talked about
frequent itemset mining.
It is a data analysis task that has many
applications.
However, a problem is that the number of
frequent itemsets is often very large.
Besides, frequent itemsets often have some
form of redundancy.
Today, we will talk about this problem and a
solution, which is to discover 
concise
representations of frequent itemsets.
2
Frequent itemset mining
Input
:  
A transaction database
minsup
 
= 2
Frequent itemset mining
Input
:  
A transaction database
minsup
 
= 2
Output
:  
The frequent itemsets
{lemon}, {pasta}, {orange}, {cake}, {lemon, pasta}, {lemon,
orange}, {pasta, orange}, {pasta, cake}, {orange, cake},
{lemon, pasta, orange}, 
{pasta, orange, cake} 
5
 
{pasta}  
    
support = 4
   
 
{lemon} 
    
support = 3
   
 
{orange} 
    
support = 3
   
 
{cake} 
     
support = 2
 
{pasta, lemon}         
   
support: 3 
 
 
{pasta, orange} 
   
support: 3
 
{pasta, cake} 
    
support: 2
 
{lemon, orange}
  
 
 
support:  2
 
{orange, cake} 
    
support: 2
 
{pasta, lemon, orange} 
  
support: 2
 
{pasta, orange, cake} 
   
support: 2
Frequent itemsets
6
 
{pasta}  
    
support = 4
   
 
{lemon} 
    
support = 3
   
 
{orange} 
    
support = 3
   
 
{cake} 
     
support = 2
 
{pasta, lemon}         
   
support: 3 
 
 
{pasta, orange} 
   
support: 3
 
{pasta, cake} 
    
support: 2
 
{lemon, orange}
  
 
 
support:  2
 
{orange, cake} 
    
support: 2
 
{pasta, lemon, orange} 
  
support: 2
 
{pasta, orange, cake} 
   
support: 2
Frequent itemsets
There are 10
frequent
itemsets!
minsup =2     The frequent itemsets
minsup =2     The frequent itemsets
7
lpboc
l
p
b
o
c
lp
lb
lo
lc
pb
po
pc
bo
bc
oc
lpb
lpo
lpc
lbo
lbc
loc
pbo
pbc
poc
boc
lpbo
lpbc
lpoc
lboc
pboc
frequent
itemsets
Infrequent
itemsets
l = lemon
p = pasta
b = bread
0 = orange
c = cake
minsup =2     The frequent itemsets
minsup =2     The frequent itemsets
8
lpboc
l
p
b
o
c
lp
lb
lo
lc
pb
po
pc
bo
bc
oc
lpb
lpo
lpc
lbo
lbc
loc
pbo
pbc
poc
boc
lpbo
lpbc
lpoc
lboc
pboc
frequent
itemsets
Infrequent
itemsets
l = lemon
p = pasta
b = bread
0 = orange
c = cake
We can observe that if an itemset is
frequent then all its subsets are frequent
minsup =2     The frequent itemsets
minsup =2     The frequent itemsets
9
lpboc
l
p
b
o
c
lp
lb
lo
lc
pb
po
pc
bo
bc
oc
lpb
lpo
lpc
lbo
lbc
loc
pbo
pbc
poc
boc
lpbo
lpbc
lpoc
lboc
pboc
frequent
itemsets
Infrequent
itemsets
l = lemon
p = pasta
b = bread
0 = orange
c = cake
We call the 
largest
 frequent itemsets the
maximal frequent itemsets
minsup =2     The frequent itemsets
minsup =2     The frequent itemsets
10
lpboc
l
p
b
o
c
lp
lb
lo
lc
pb
po
pc
bo
bc
oc
lpb
lpo
lpc
lbo
lbc
loc
pbo
pbc
poc
boc
lpbo
lpbc
lpoc
lboc
pboc
frequent
itemsets
Infrequent
itemsets
l = lemon
p = pasta
b = bread
0 = orange
c = cake
Maximal frequent itemsets
 are interesting
because if we find them
, 
we can recover all
the other frequent itemsets.
11
Recovering all frequent itemsets
In this example 
(minsup = 2)
,
 there are only two maximal
itemsets:
 
{pasta, lemon, orange} 
   
support: 2
 
{pasta, orange, cake} 
   
support: 2
12
 
{pasta, lemon, orange} 
   
support: 2
 
{pasta, orange, cake} 
   
support: 2
Maximal frequent itemsets
In this example 
(minsup = 2)
,
 there are only two maximal
itemsets:
We can 
recover
 all other frequent itemsets from the
maximal itemsets by simply enumerating their subsets:
13
 
{pasta, lemon, orange} 
   
support: 2
 
{pasta, orange, cake} 
   
support: 2
Maximal frequent itemsets
In this example 
(minsup = 2)
,
 there are only two maximal
itemsets:
We can 
recover
 all other frequent itemsets from the
maximal itemsets by simply enumerating their subsets:
{lemon}, {pasta}, {orange}, {cake}, {lemon, pasta}, {lemon,
orange}, {pasta, orange}, {pasta, cake}, {orange, cake},
14
 
{pasta, lemon, orange} 
   
support: 2
 
{pasta, orange, cake} 
   
support: 2
Maximal frequent itemsets
In this example 
(minsup = 2)
,
 there are only two maximal
itemsets:
We can 
recover
 all other frequent itemsets from the
maximal itemsets by simply enumerating their subsets:
{lemon}, {pasta}, {orange}, {cake}, {lemon, pasta}, {lemon,
orange}, {pasta, orange}, {pasta, cake}, {orange, cake},
But we cannot recover their support!
Maximal itemsets are compact
If we find the maximal frequent itemsets,
we can find 
much less 
itemsets.
15
6,439,702 frequent itemsets
38,050 frequent
 
maximal
 
itemsets
Example
:  
Chess
 dataset and 
minsup =0.4
168 times
smaller
How to find the maximal itemsets?
Naïve approach
by post-processing
inefficient!
Efficient algorithms:
GenMax
: a modified version of ECLAT
FPMax
: a modified version of FP-Growth
LCMMax
:  a modified version of LCM
… and many others!
Those algorithms adopt various strategies to find
the maximal itemsets without having to find all
frequent itemsets.
16
What are the alternatives?
Is there some other ways of summarizing the
frequent itemsets?
Another popular representation of frequent
itemsets is the 
closed itemsets 
(Pasquier
et al., 1999)
17
Could we do better?
Is there some other ways of summarizing the
frequent itemsets?
Another popular representation of frequent
itemsets is the 
closed itemsets 
(Pasquier
et al., 1999)
18
19
Closed itemsets
 
{pasta}  
    
support = 4
 
closed
   
 
{lemon} 
    
support =
 3
   
 
{orange} 
    
support = 
3
   
 
{cake} 
     
support = 
2
 
{pasta, lemon}         
  
support: 3 
 
closed
 
{pasta, orange} 
   
support: 3 
 
closed
 
{pasta, cake} 
    
support: 
2
 
{lemon, orange}
  
 
 
support:  
2
 
{orange, cake} 
    
support: 
2
 
{pasta, lemon, orange} 
  
support: 2 
 
closed
 
{pasta, orange, cake} 
  
support: 2 
 
closed
minsup =2     The frequent closed itemsets
minsup =2     The frequent closed itemsets
minsup = 2
20
lpboc
l
p
b
o
c
lp
lb
lo
lc
pb
po
pc
bo
bc
oc
lpb
lpo
lpc
lbo
lbc
loc
pbo
pbc
poc
boc
lpbo
lpbc
lpoc
lboc
pboc
frequent
itemsets
infrequent
itemsets
support
frequent non closed itemsets
frequent closed itemsets
4
3
3
2
2
4
3
3
2
2
2
2
Closed itemsets are compact
 
6,439,702 frequent itemsets
38,050 frequent
 
maximal
 
itemsets
Example
:  
Chess
 dataset and 
minsup =0.4
1,361,157 frequent
 
closed
 
itemsets
22
Recovering all frequent itemsets
In this example 
(minsup = 2)
,
 there are only five closed itemsets:
{pasta}  
   
support = 4
 
closed
{pasta, lemon}         
  
support: 3 
 
closed
 
 
{pasta, orange} 
   
support: 3 
 
closed
{pasta, lemon, orange} 
  
support: 2 
 
closed
{pasta, orange, cake} 
  
support: 2 
 
closed
     
23
Recovering all frequent itemsets
In this example 
(minsup = 2)
,
 there are only five closed itemsets:
{pasta}  
   
support = 4
 
closed
{pasta, lemon}         
  
support: 3 
 
closed
 
 
{pasta, orange} 
   
support: 3 
 
closed
{pasta, lemon, orange} 
  
support: 2 
 
closed
{pasta, orange, cake} 
  
support: 2 
 
closed
     
By enumerating all their subsets, we can find all the other
frequent itemsets: 
{pasta, cake},
 
{lemon, orange},
 
{orange, cake}, 
  
 {lemon,{orange} ,{cake}
24
Recovering all frequent itemsets
In this example 
(minsup = 2)
,
 there are only five closed itemsets:
{pasta}  
   
support = 4
 
closed
{pasta, lemon}         
  
support: 3 
 
closed
 
 
{pasta, orange} 
   
support: 3 
 
closed
{pasta, lemon, orange} 
  
support: 2 
 
closed
{pasta, orange, cake} 
  
support: 2 
 
closed
     
By enumerating all their subsets, we can find all the other
frequent itemsets: 
{pasta, cake},
 
{lemon, orange},
 
{orange, cake}, 
  
 {lemon,{orange} ,{cake}
But how to find
the support of
these itemsets? 
Steps to find the support of a frequent
itemset 
X 
from closed itemsets
1.
Find the closure of 
X.
2.
Return the support of the closure of 
X
25
Steps to find the support of a frequent
itemset 
X 
from closed itemsets
1.
Find the closure of 
X.
2.
Return the support of the closure of 
X
26
{pasta}  
   
support = 4
 
closed
{pasta, lemon}         
  
support: 3 
 
closed
 
 
{pasta, orange} 
   
support: 3 
 
closed
{pasta, lemon, orange} 
  
support: 2 
 
closed
{pasta, orange, cake} 
  
support: 2 
 
closed
Example:
The closure of  
{cake} 
is  
c({cake}) ???
Steps to find the support of a frequent
itemset 
X 
from closed itemsets
1.
Find the closure of 
X.
2.
Return the support of the closure of 
X
27
{pasta}  
   
support = 4
 
closed
{pasta, lemon}         
  
support: 3 
 
closed
 
 
{pasta, orange} 
   
support: 3 
 
closed
{pasta, lemon, orange} 
  
support: 2 
 
closed
{pasta, orange, cake} 
  
support: 2 
 
closed
Example:
The closure of  
{cake} 
is  
c({cake}) = {pasta,orange,cake}
 
Thus, the support of 
{cake} 
is the same, that is 2.
Itemsets and their closures
Itemsets and their closures
minsup = 2
28
lpboc
l
p
b
o
c
lp
lb
lo
lc
pb
po
pc
bo
bc
oc
lpb
lpo
lpc
lbo
lbc
loc
pbo
pbc
poc
boc
lpbo
lpbc
lpoc
lboc
pboc
support
frequent non closed itemsets
frequent closed itemsets
4
3
3
2
2
4
3
3
2
2
2
2
Itemsets that have the same closure
They all have the same support
They all appear in the same transactions.
E
xample:
29
{cake}
{pasta, cake},
{orange, cake}, 
  
{pasta, orange, cake}
They all have a support of 2
They all appears in 
T3
 and 
T4
Itemsets that have the same closure
They all have the same support
They all appear in the same transactions.
E
xample:
30
{cake}
{pasta, cake},
{orange, cake}, 
  
{pasta, orange, cake}
They all have a support of 2
They all appears in 
T3
 and 
T4
Observation:
The closure is the intersection of
transactions T3 and T4
Itemsets and their closures
Itemsets and their closures
minsup = 2
31
lpboc
l
p
b
o
c
lp
lb
lo
lc
pb
po
pc
bo
bc
oc
lpb
lpo
lpc
lbo
lbc
loc
pbo
pbc
poc
boc
lpbo
lpbc
lpoc
lboc
pboc
4
3
3
2
2
4
3
3
2
2
2
2
T1, T2, T4
T1, T4
T1, T2,T3T4
T1,T3,T4
T3,T4
support
frequent non closed itemsets
frequent closed itemsets
Itemsets and their closures
Itemsets and their closures
minsup = 2
32
lpboc
l
p
b
o
c
lp
lb
lo
lc
pb
po
pc
bo
bc
oc
lpb
lpo
lpc
lbo
lbc
loc
pbo
pbc
poc
boc
lpbo
lpbc
lpoc
lboc
pboc
4
3
3
2
2
4
3
3
2
2
2
2
T1, T2, T4
T1, T4
T1, T2,T3T4
T1,T3,T4
T3,T4
support
frequent non closed itemsets
frequent closed itemsets
Note: 
Itemsets having the same
closure are sometimes said to be
from a same 
equivalence class
.
How to find the closed itemsets?
Naïve approach
:
by post-processing
inefficient!
Efficient algorithms:
Charm, DCI_Closed,
FPClose, Closet, LCM, etc.
Find frequent itemsets by avoiding generating
all frequent itemsets.
Several strategies and properties.
33
Charm (Zaki, 2001)
34
Based on ECLAT
Depth-first search
Four cases must be checked 
to avoid generating all
frequent itemsets
Four cases
35
X
1
 
X
2
 
 
X
1
 U X
2
 
X
2
Notation:  t(X) denotes the set of transactions containing an itemset X
Four cases
36
X
1
 
X
2
 
 
X
1
 U X
2
 
X
2
Four cases
37
X
1
 
X
2
 
 
X
1
 U X
2
 
X
1
Four cases
38
X
1
 
X
2
 
Charm
These four properties allows to
eliminate several frequent itemsets
thus, reduce the search space
However, the algorithm can still generate
several itemsets that are non closed.
We must add a 
closure checking test
.
39
Closure checking
For each 
frequent itemset 
X  
found
, we
compare it to the other 
frequent closed
itemsets 
found until now.
If 
X
 is contained in an itemset that has been
already found, we do not keep 
X.
Otherwise, 
X
 is closed. It is added to the set of
closed itemsets
.
How to implement this test?
store closed itemsets in a hash table
hash function is the list of transaction ids  t(.).
40
Pseudocode of Charm
41
Nodes
 
the set of frequent itemsets of size 1
I
  is the set of items 
T
 is the set of transactions
C
 is the set of all closed itemsets found
X x t(X) 
denotes an 
X
  and its list of transaction 
t(X)
vérifier le  test de closure
Optimizations
Implementation with diffsets (
dCharm
).
Implementation with bit vectors.
Use a 
triangular matrix 
to count the support
of itemsets of size 2 by scanning the database
once.
42
illustration: szathmary, 2006
Maximal vs Closed Itemsets
Closed frequent itemsets
Can recover all frequent itemsets
Can recover the support
Maximal frequent itemsets
Can recover all frequent itemsets
Cannot recover the support
A subset of closed itemsets, sometimes much
smaller.
Is there other compact representations of
frequent itemsets
? Yes 
43
The generator itemsets (or key itemsets)
44
The generator itemsets (or key itemsets)
45
 
{} 
     
support: 4
 
{lemon} 
    
support: 3
 
{orange} 
    
support: 3
 
{cake} 
    
support: 2
 
{lemon,orange} 
   
support: 2
In the example 
(minsup = 2)
,
 there are five 
frequent
generator itemsets
:
minsup =2     The frequent generator itemsets
minsup =2     The frequent generator itemsets
46
lpboc
l
p
b
o
c
lp
lb
lo
lc
pb
po
pc
bo
bc
oc
lpb
lpo
lpc
lbo
lbc
loc
pbo
pbc
poc
boc
lpbo
lpbc
lpoc
lboc
pboc
l = lemon
p = pasta
b = bread
0 = orange
c = cake
frequent generator itemsets
other frequent itemsets
minsup =2     The frequent generator itemsets
minsup =2     The frequent generator itemsets
47
lpboc
l
p
b
o
c
lp
lb
lo
lc
pb
po
pc
bo
bc
oc
lpb
lpo
lpc
lbo
lbc
loc
pbo
pbc
poc
boc
lpbo
lpbc
lpoc
lboc
pboc
l = lemon
p = pasta
b = bread
0 = orange
c = cake
Some observations:
The 
empty set 
can be a generator
While each equivalence class has always one closed itemset, it could
have more than one generator (not in this example)
In some cases, a closed itemset can be a generator (not in this example)
Why generators are interesting?
They are the smallest sets of items that
describe a set of transactions.
e.g. 
the smallest sets of products common to a
group of customers.
Some studies have shown shown that generator
patterns can provide better accuracy for
classification than maximal or closed itemsets.
48
Why generators are interesting?
Also, if we find 
generator
 and 
closed
 itemsets, we can
use them to generate compact representations of
association rules of the form:
       generator 
 closed itemset - generator
 
    
 
{cake} 
 
{pasta, orange} 
 
support: 2
     
confidence: 100%
 
      
49
generator: 
{cake}
closed
: {pasta,orange,cake}
See the paper about IGB assocation rules for details
How to find the generator itemsets?
Naïve approach
:
by post-processing
inefficient!
Efficient algorithms:
Pascal, DefMe, etc.
Find generator itemsets by avoiding
generating all frequent itemsets.
Key search space pruning property:
An itemset can only be a generator if all its
subsets are also generators.
50
References – maximal itemsets
FPMax
Grahne, G., & Zhu, J. (2003, May). High
performance mining of maximal frequent
itemsets. In 6th International Workshop on
High Performance Data Mining.
LCM_Max
Grahne, G., & Zhu, J. (2003, May). High
performance mining of maximal frequent
itemsets. In 6th International Workshop on
High Performance Data Mining.
51
References – closed itemsets
AprioriClose (Close)
Nicolas Pasquier, Yves Bastide, Rafik Taouil,
Lotfi Lakhal: 
Discovering Frequent Closed
Itemsets for Association Rules
. ICDT 1999:
398-416
Charm
Mohammed Javeed Zaki, Ching-Jiu
Hsiao: 
CHARM: An Efficient Algorithm for
Closed Itemset Mining
. SDM 2002.
52
References – generator itemsets
DefMe
Arnaud Soulet, François Rioult (2014). 
Efficiently Depth-
First Minimal Pattern Mining
. PAKDD (1) 2014: 28-39
Pascal
Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., &
Lakhal, L. (2000). 
Mining frequent patterns with
counting inference
ACM SIGKDD Explorations
Newsletter
2
(2), 66-75.
I
GB
G. Gasmi, S. Ben Yahia, E. Mephu Nguifo, Y. Slimani:
IGB: A New Informative Generic Base of
Association Rules. PAKDD 2005: 81-90
53
Other references
Data Mining: The Textbook by Aggarwal (2015)
Data Mining and Analysis Fundamental Concepts
and Algorithms by Zaki & Meira (2014)
Han and Kamber (2011), Data Mining: Concepts
and Techniques, 3rd edition, Morgan Kaufmann
Publishers,
Tan, Steinbach & Kumar (2006), Introduction to
Data Mining, Pearson education, ISBN-10:
0321321367.
 
54
Slide Note

Fournier-Viger, P., Wu, C.-W., Gomariz, A., Tseng, V. S. (2014). VMSP: Efficient Vertical Mining of Maximal Sequential Patterns. Proc. 27th Canadian Conference on Artificial Intelligence (AI 2014), Springer, LNAI, pp. 83-94.

It’s not really about « APPLICATIONS OF AI »

Embed
Share

Frequent itemset mining is a crucial data analysis task with various applications. This article delves into the challenges of dealing with large numbers of frequent itemsets and explores solutions to uncover concise representations, such as maximal, closed, and generator itemsets. The discussion covers transaction databases, support values, and the identification of both frequent and infrequent itemsets.

  • Data Analysis
  • Itemset Mining
  • Frequent Patterns
  • Transaction Databases
  • Support Values

Uploaded on Oct 09, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Maximal, Closed and Generator Itemsets Philippe Fournier-Viger http://www.philippe-Fournier-viger.com Source code and datasets available in the SPMF library 1

  2. Introduction In previous videos, we talked about frequent itemset mining. It is a data analysis task that has many applications. However, a problem is that the number of frequent itemsets is often very large. Besides, frequent itemsets often have some form of redundancy. Today, we will talk about this problem and a solution, which is to discover concise representations of frequent itemsets. 2

  3. Frequent itemset mining Input: A transaction database Transaction Items appearing in the transaction T1 T2 T3 T4 {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} {pasta, lemon, orange cake} minsup= 2

  4. Frequent itemset mining Input: A transaction database Transaction Items appearing in the transaction T1 T2 T3 T4 {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} {pasta, lemon, orange cake} minsup= 2 Output: The frequent itemsets {lemon}, {pasta}, {orange}, {cake}, {lemon, pasta}, {lemon, orange}, {pasta, orange}, {pasta, cake}, {orange, cake}, {lemon, pasta, orange}, {pasta, orange, cake}

  5. Frequent itemsets {pasta} {lemon} {orange} {cake} support = 4 support = 3 support = 3 support = 2 {pasta, lemon} {pasta, orange} {pasta, cake} {lemon, orange} {orange, cake} support: 3 support: 3 support: 2 support: 2 support: 2 {pasta, lemon, orange} {pasta, orange, cake} support: 2 support: 2 5

  6. Frequent itemsets {pasta} {lemon} {orange} {cake} support = 4 support = 3 support = 3 support = 2 {pasta, lemon} {pasta, orange} {pasta, cake} {lemon, orange} {orange, cake} support: 3 support: 3 support: 2 support: 2 support: 2 There are 10 frequent itemsets! {pasta, lemon, orange} {pasta, orange, cake} support: 2 support: 2 6

  7. minsup =2 The frequent itemsets l p b o c frequent itemsets lb lo lc pb lp po pc bo bc oc pbc poc boc lpb lpo lpc lbo lbc loc pbo l = lemon p = pasta b = bread 0 = orange c = cake Infrequent itemsets lpbo lboc lpbc lpoc pboc lpboc 7

  8. minsup =2 The frequent itemsets l p b o c frequent itemsets lb lo lc pb lp po pc bo bc oc pbc poc boc lpb lpo lpc lbo lbc loc pbo l = lemon p = pasta b = bread 0 = orange c = cake Infrequent itemsets lpbo lboc lpbc lpoc pboc We can observe that if an itemset is frequent then all its subsets are frequent lpboc 8

  9. minsup =2 The frequent itemsets l p b o c frequent itemsets lb lo lc pb lp po pc bo bc oc pbc poc boc lpb lpo lpc lbo lbc loc pbo l = lemon p = pasta b = bread 0 = orange c = cake Infrequent itemsets lpbo lboc lpbc lpoc pboc We call the largest frequent itemsets the maximal frequent itemsets lpboc 9

  10. minsup =2 The frequent itemsets l p b o c frequent itemsets lb lo lc pb lp po pc bo bc oc pbc poc boc lpb lpo lpc lbo lbc loc pbo l = lemon p = pasta b = bread 0 = orange c = cake Infrequent itemsets Maximal frequent itemsets are interesting because if we find them, we can recover all the other frequent itemsets. lpbo lboc lpbc lpoc pboc lpboc 10

  11. Recovering all frequent itemsets Definition: A frequent itemset? is maximal if there is no frequent itemset ? such that ? ?. In this example (minsup = 2), there are only two maximal itemsets: {pasta, lemon, orange} {pasta, orange, cake} support: 2 support: 2 11

  12. Maximal frequent itemsets Definition: A frequent itemset? is maximal if there is no frequent itemset ? such that ? ?. In this example (minsup = 2), there are only two maximal itemsets: {pasta, lemon, orange} {pasta, orange, cake} support: 2 support: 2 We can recover all other frequent itemsets from the maximal itemsets by simply enumerating their subsets: 12

  13. Maximal frequent itemsets Definition: A frequent itemset? is maximal if there is no frequent itemset ? such that ? ?. In this example (minsup = 2), there are only two maximal itemsets: {pasta, lemon, orange} {pasta, orange, cake} support: 2 support: 2 We can recover all other frequent itemsets from the maximal itemsets by simply enumerating their subsets: {lemon}, {pasta}, {orange}, {cake}, {lemon, pasta}, {lemon, orange}, {pasta, orange}, {pasta, cake}, {orange, cake}, 13

  14. Maximal frequent itemsets Definition: A frequent itemset? is maximal if there is no frequent itemset ? such that ? ?. In this example (minsup = 2), there are only two maximal itemsets: {pasta, lemon, orange} {pasta, orange, cake} support: 2 support: 2 We can recover all other frequent itemsets from the maximal itemsets by simply enumerating their subsets: {lemon}, {pasta}, {orange}, {cake}, {lemon, pasta}, {lemon, orange}, {pasta, orange}, {pasta, cake}, {orange, cake}, But we cannot recover their support! 14

  15. Maximal itemsets are compact If we find the maximal frequent itemsets, we can find much less itemsets. Example: Chess dataset and minsup =0.4 6,439,702 frequent itemsets 38,050 frequent maximal itemsets 168 times smaller 15

  16. How to find the maximal itemsets? Na ve approach by post-processing inefficient! Efficient algorithms: GenMax: a modified version of ECLAT FPMax: a modified version of FP-Growth LCMMax: a modified version of LCM and many others! Those algorithms adopt various strategies to find the maximal itemsets without having to find all frequent itemsets. 16

  17. What are the alternatives? Is there some other ways of summarizing the frequent itemsets? Another popular representation of frequent itemsets is the closed itemsets (Pasquier et al., 1999) 17

  18. Could we do better? Is there some other ways of summarizing the frequent itemsets? Another popular representation of frequent itemsets is the closed itemsets (Pasquier et al., 1999) Definition: A (frequent) itemset? is closed if there exists no (frequent) itemset ? such that ? ? and sup(?) = sup(Y). 18

  19. Closed itemsets {pasta} {lemon} {orange} {cake} {pasta, lemon} {pasta, orange} {pasta, cake} {lemon, orange} {orange, cake} {pasta, lemon, orange} {pasta, orange, cake} support = 4 closed support = 3 support = 3 support = 2 support: 3 support: 3 support: 2 support: 2 support: 2 support: 2 support: 2 closed closed closed closed 19

  20. minsup =2 The frequent closed itemsets 4 minsup = 2 3 4 3 2 l p b o c frequent itemsets 3 2 3 2 2 lb lo lc pb lp po pc bo bc oc 2 2 pbc poc boc lpb lpo lpc lbo lbc loc pbo infrequent itemsets lpbo lboc lpbc lpoc pboc support frequent non closed itemsets frequent closed itemsets lpboc 20

  21. Closed itemsets are compact Example: Chess dataset and minsup =0.4 6,439,702 frequent itemsets 1,361,157 frequent closed itemsets 38,050 frequent maximal itemsets

  22. Recovering all frequent itemsets Definition: A (frequent) itemset? is closed if there exists no (frequent) itemset ? such that ? ? and sup(?) = sup(Y). In this example (minsup = 2), there are only five closed itemsets: {pasta} {pasta, lemon} {pasta, orange} {pasta, lemon, orange} {pasta, orange, cake} support = 4 support: 3 support: 3 support: 2 support: 2 closed closed closed closed closed 22

  23. Recovering all frequent itemsets Definition: A (frequent) itemset? is closed if there exists no (frequent) itemset ? such that ? ? and sup(?) = sup(Y). In this example (minsup = 2), there are only five closed itemsets: {pasta} {pasta, lemon} {pasta, orange} {pasta, lemon, orange} {pasta, orange, cake} support = 4 support: 3 support: 3 support: 2 support: 2 closed closed closed closed closed By enumerating all their subsets, we can find all the other frequent itemsets: {pasta, cake}, {lemon, orange}, {orange, cake}, {lemon,{orange} ,{cake} 23

  24. Recovering all frequent itemsets Definition: A (frequent) itemset? is closed if there exists no (frequent) itemset ? such that ? ? and sup(?) = sup(Y). In this example (minsup = 2), there are only five closed itemsets: {pasta} {pasta, lemon} {pasta, orange} {pasta, lemon, orange} {pasta, orange, cake} support = 4 support: 3 support: 3 support: 2 support: 2 closed closed closed closed closed By enumerating all their subsets, we can find all the other frequent itemsets: {pasta, cake}, {lemon, orange}, {orange, cake}, {lemon,{orange} ,{cake} But how to find the support of these itemsets? 24

  25. Steps to find the support of a frequent itemset X from closed itemsets 1. Find the closure of X. 2. Return the support of the closure of X Definition: The closure of an itemset? is the smallest closed itemset ? such that ? ?. It is denoted as c(X) 25

  26. Steps to find the support of a frequent itemset X from closed itemsets 1. Find the closure of X. 2. Return the support of the closure of X Definition: The closure of an itemset? is the smallest closed itemset ? such that ? ?. It is denoted as c(X) Example: {pasta} {pasta, lemon} {pasta, orange} {pasta, lemon, orange} {pasta, orange, cake} support = 4 support: 3 support: 3 support: 2 support: 2 closed closed closed closed closed The closure of {cake} is c({cake}) ??? 26

  27. Steps to find the support of a frequent itemset X from closed itemsets 1. Find the closure of X. 2. Return the support of the closure of X Definition: The closure of an itemset? is the smallest closed itemset ? such that ? ?. It is denoted as c(X) Example: {pasta} {pasta, lemon} {pasta, orange} {pasta, lemon, orange} {pasta, orange, cake} support = 4 support: 3 support: 3 support: 2 support: 2 closed closed closed closed closed The closure of {cake} is c({cake}) = {pasta,orange,cake} Thus, the support of {cake} is the same, that is 2. 27

  28. Itemsets and their closures 4 minsup = 2 3 4 3 2 l p b o c 3 2 3 2 2 lb lo lc pb lp po pc bo bc oc 2 2 pbc poc boc lpb lpo lpc lbo lbc loc pbo lpbo lboc lpbc lpoc pboc support frequent non closed itemsets frequent closed itemsets lpboc 28

  29. Itemsets that have the same closure They all have the same support They all appear in the same transactions. Example: {cake} {pasta, cake}, {orange, cake}, {pasta, orange, cake} They all have a support of 2 They all appears in T3 and T4 Trans. T1 T2 T3 T4 Items {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} {pasta, lemon, orange cake} 29

  30. Itemsets that have the same closure They all have the same support They all appear in the same transactions. Example: {cake} {pasta, cake}, {orange, cake}, {pasta, orange, cake} They all have a support of 2 They all appears in T3 and T4 Trans. T1 T2 T3 T4 Items {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} {pasta, lemon, orange cake} Observation: The closure is the intersection of transactions T3 and T4 30

  31. Itemsets and their closures T1, T2,T3T4 4 minsup = 2 3 4 3 2 l p b o c T1, T2, T4 T3,T4 3 2 3 2 2 lb lo lc pb lp po pc bo bc oc T1,T3,T4 2 2 pbc poc boc lpb lpo lpc lbo lbc loc pbo T1, T4 lpbo lboc lpbc lpoc pboc support frequent non closed itemsets frequent closed itemsets lpboc 31

  32. Itemsets and their closures T1, T2,T3T4 4 minsup = 2 3 4 3 2 l p b o c T1, T2, T4 T3,T4 3 2 3 2 2 lb lo lc pb lp po pc bo bc oc T1,T3,T4 2 2 pbc poc boc lpb lpo lpc lbo lbc loc pbo T1, T4 lpbo lboc lpbc lpoc pboc Note: Itemsets having the same closure are sometimes said to be from a same equivalence class. support frequent non closed itemsets frequent closed itemsets lpboc 32

  33. How to find the closed itemsets? Na ve approach: by post-processing inefficient! Efficient algorithms: Charm, DCI_Closed, FPClose, Closet, LCM, etc. Find frequent itemsets by avoiding generating all frequent itemsets. Several strategies and properties. 33

  34. Charm (Zaki, 2001) Based on ECLAT Depth-first search Four cases must be checked to avoid generating all frequent itemsets 34

  35. Four cases Suppose that the algorithm is considering combining an itemset X1 with an itemset X2 such that X2 ?1. X1 X2 X1U X2 X2 Property 1 IF t(X1) = t(X2), THEN t(X1) = t(X2) = t(X1 X2). Thus, we can replace all occurrences of X1 by X1 X2. We do not need to consider X2. Notation: t(X) denotes the set of transactions containing an itemset X 35

  36. Four cases Suppose that the algorithm is considering combining an itemset X1 with an itemset X2 such that X2 ?1. X1 X2 X1U X2 X2 Property 2 IF t(X1) t(X2), THEN t(X1) = t(X1 X2) t(X2). Thus, we can replace all occurrences of X1 by X1 X2. But we must keep X2. 36

  37. Four cases Suppose that the algorithm is considering combining an itemset X1 with an itemset X2 such that X2 ?1. X1 X2 X1 X1U X2 Property 3 Si t(X1) t(X2), alors t(X2) = t(X1 X2) t(X1). Thus, we can replace all occurrences of X2 by X1 X2. But we must keep X1. 37

  38. Four cases Suppose that the algorithm is considering combining an itemset X1 with an itemset X2 such that X2 ?1. X1 X2 Property 4 IF t(X1) t(X2), then t(X2) t(X1 X2) t(X1). We cannot remove anything. 38

  39. Charm These four properties allows to eliminate several frequent itemsets thus, reduce the search space However, the algorithm can still generate several itemsets that are non closed. We must add a closure checking test. 39

  40. Closure checking For each frequent itemset X found, we compare it to the other frequent closed itemsets found until now. If X is contained in an itemset that has been already found, we do not keep X. Otherwise, X is closed. It is added to the set of closed itemsets. How to implement this test? store closed itemsets in a hash table hash function is the list of transaction ids t(.). 40

  41. Pseudocode of Charm j ? v rifier le test de closure Nodes the set of frequent itemsets of size 1 I is the set of items T is the set of transactions C is the set of all closed itemsets found X x t(X) denotes an X and its list of transaction t(X) 41

  42. Optimizations Implementation with diffsets (dCharm). Implementation with bit vectors. Use a triangular matrix to count the support of itemsets of size 2 by scanning the database once. 42 illustration: szathmary, 2006

  43. Maximal vs Closed Itemsets Closed frequent itemsets Can recover all frequent itemsets Can recover the support Maximal frequent itemsets Can recover all frequent itemsets Cannot recover the support A subset of closed itemsets, sometimes much smaller. Is there other compact representations of frequent itemsets? Yes 43

  44. The generator itemsets (or key itemsets) Definition: A (frequent) itemset? is a generator if there exists no (frequent) itemset ? such that ? ? and sup(?) = sup(Y). 44

  45. The generator itemsets (or key itemsets) Definition: A (frequent) itemset? is a generator if there exists no (frequent) itemset ? such that ? ? and sup(?) = sup(Y). In the example (minsup = 2), there are five frequent generator itemsets: {} {lemon} {orange} {cake} {lemon,orange} support: 4 support: 3 support: 3 support: 2 support: 2 45

  46. minsup =2 The frequent generator itemsets l p b o c lb lo lc pb lp po pc bo bc oc pbc poc boc lpb lpo lpc lbo lbc loc pbo l = lemon p = pasta b = bread 0 = orange c = cake lpbo lboc lpbc lpoc pboc frequent generator itemsets other frequent itemsets lpboc 46

  47. minsup =2 The frequent generator itemsets l p b o c lb lo lc pb lp po pc bo bc oc pbc poc boc lpb lpo lpc lbo lbc loc pbo l = lemon p = pasta b = bread 0 = orange c = cake Some observations: The empty set can be a generator While each equivalence class has always one closed itemset, it could have more than one generator (not in this example) In some cases, a closed itemset can be a generator (not in this example) lpbo lboc lpbc lpoc pboc lpboc 47

  48. Why generators are interesting? They are the smallest sets of items that describe a set of transactions. e.g. the smallest sets of products common to a group of customers. Some studies have shown shown that generator patterns can provide better accuracy for classification than maximal or closed itemsets. 48

  49. Why generators are interesting? Also, if we find generator and closed itemsets, we can use them to generate compact representations of association rules of the form: generator closed itemset - generator {cake} {pasta, orange} support: 2 confidence: 100% generator: {cake} closed: {pasta,orange,cake} See the paper about IGB assocation rules for details 49

  50. How to find the generator itemsets? Na ve approach: by post-processing inefficient! Efficient algorithms: Pascal, DefMe, etc. Find generator itemsets by avoiding generating all frequent itemsets. Key search space pruning property: An itemset can only be a generator if all its subsets are also generators. 50

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#