Business Intelligence
10th Lecture
Association rules
Iraklis Varlamis
Contents
Association rules
The problem
Given a set of transactions, find rules that predict the occurrence of an item in a new transaction which already contains several other items
Shopping baskets
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
• Examples of association rules
• {Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Itemset: A subset of Ι
Example: {Milk, Bread, Diaper}
k-itemset: an itemset of size k
Definitions
support count (σ): The number of transactions that contain an itemset
σ(X) = |{ ti | X ⊆ ti, ti ∈ T }|
Example: σ({Milk, Bread, Diaper}) = 2
Frequent Itemset
An itemset with support greater than or equal to a minsup threshold
Definitions
Rule Support (s): The ratio of transactions that contain all the items of both X and Y, i.e. the number of transactions that contain the items in X ∪ Y divided by the total number of transactions:
s(X → Y) = σ(X ∪ Y)/|T|
Rule Confidence (c): The number of transactions that contain the items in X ∪ Y divided by the number of transactions that contain the items in X:
c(X → Y) = σ(X ∪ Y)/σ(X)
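As a minimal sketch of these definitions, the following Python computes the support count, rule support and rule confidence on the five shopping baskets of the example. The function names (support_count, rule_support, rule_confidence) are illustrative choices and do not come from the lecture.

```python
# Shopping baskets from the example table (TID 1 to 5).
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def rule_support(x, y, transactions):
    """s(X -> Y) = sigma(X u Y) / |T|."""
    return support_count(set(x) | set(y), transactions) / len(transactions)

def rule_confidence(x, y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(set(x) | set(y), transactions) / support_count(x, transactions)

print(support_count({"Milk", "Bread", "Diaper"}, baskets))  # 2
print(rule_support({"Diaper"}, {"Beer"}, baskets))          # 0.6
print(rule_confidence({"Diaper"}, {"Beer"}, baskets))       # 0.75
```

For the rule {Diaper} → {Beer}, three of the five baskets contain both items and four contain Diaper, which gives s = 3/5 and c = 3/4.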
Divide to sub-problems
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
A-priori pruning strategy
Support-based pruning
When an itemset is infrequent, all the itemsets that contain it are infrequent too, so prune them
[Figure: itemset lattice over A, B, C, D, E (null; A B C D E; AB AC AD AE BC BD BE CD CE DE; ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE) with an infrequent itemset and all of its supersets pruned]
Generic algorithm
k = 1
Find all frequent 1-itemsets
Repeat until there are no new frequent itemsets:
– Create candidate (k+1)-itemsets from the frequent k-itemsets
– Prune candidate (k+1)-itemsets that contain infrequent k-itemsets
– Find the support of each remaining candidate itemset by scanning the transaction set
– Remove infrequent (k+1)-itemsets and keep the frequent ones
– k = k + 1
Item Count
Bread 4
Coke 2
Milk 4
Beer 3
Diaper 4
Eggs 1
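The generic algorithm can be sketched in Python as follows. This is a minimal, unoptimized illustration on the shopping baskets of the example, not the lecture's reference implementation. The function name apriori and the threshold of 3 transactions are assumptions made for the example.

```python
from itertools import combinations

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def apriori(transactions, minsup_count):
    """Return {frequent itemset (frozenset): support count}."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        # One scan of the transaction set per candidate (simple, not efficient).
        return sum(1 for t in transactions if itemset <= t)

    # k = 1: find all frequent 1-itemsets.
    level = {}
    for item in {i for t in transactions for i in t}:
        single = frozenset([item])
        n = count(single)
        if n >= minsup_count:
            level[single] = n

    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        prev = set(level)
        # Create candidate (k+1)-itemsets by joining frequent k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent k-itemset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k))
        }
        # Count the support of the remaining candidates, keep the frequent ones.
        level = {}
        for c in candidates:
            n = count(c)
            if n >= minsup_count:
                level[c] = n
        k += 1
    return frequent

result = apriori(baskets, minsup_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With this threshold, Coke (2) and Eggs (1) are dropped at the first level, so no candidate containing them is ever generated or counted.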
Support computation
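Example
Subset function (k mod 3): items are hashed into three groups: 1,4,7 / 2,5,8 / 3,6,9
Transaction: 1 2 3 5 6
Hash tree that contains all candidate 3-itemsets:
124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689
[Figure: the transaction is matched against the hash tree by splitting it as 1+2356, then as 12+356 and 13+56]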
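The purpose of the hash tree is to find which candidates are contained in a transaction without comparing the transaction against every candidate. The sketch below gets the same result with a simpler structure, a Python dictionary used as a flat hash table instead of a hash tree: it enumerates the 3-subsets of the transaction and looks each one up. The function name update_counts is illustrative.

```python
from itertools import combinations

# The 15 candidate 3-itemsets stored in the hash tree of the example.
candidates = [
    (1, 2, 4), (1, 2, 5), (1, 3, 6), (1, 4, 5), (1, 5, 9),
    (2, 3, 4), (3, 4, 5), (3, 5, 6), (3, 5, 7), (3, 6, 7),
    (3, 6, 8), (4, 5, 7), (4, 5, 8), (5, 6, 7), (6, 8, 9),
]
counts = {c: 0 for c in candidates}

def update_counts(transaction, counts, k=3):
    """Increment the count of every candidate contained in the transaction."""
    matched = []
    for subset in combinations(sorted(transaction), k):
        if subset in counts:  # hash lookup instead of a hash tree traversal
            counts[subset] += 1
            matched.append(subset)
    return matched

matched = update_counts((1, 2, 3, 5, 6), counts)
print(matched)  # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```

The transaction 1 2 3 5 6 has ten 3-subsets, and three of them are candidates.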
Apriori optimizations
• Issues
– The transaction set scans are expensive (I/O)
– The number of candidate itemsets can be
large
– Computation of support is expensive
• Improvements
– Reduce number of scans
– Reduce candidate itemsets to be checked
– Improve support count computation time
• When the support of itemsets is known we can
easily compute confidence
Reduce the number of itemset checks
Equivalence classes
Itemsets (with ordered items) are split into equivalence classes,
which are then examined separately
Apriori: defines classes based on width, starting with 1-itemset
classes, 2-itemset classes etc.
Prefix (Suffix): Two itemsets of the same class have the same
prefix (or suffix) of length k
[Figure: two lattices over the items A, B, C, D]
Prefix-based classes: null; A B C D; AB AC AD BC BD CD; ABCD
Suffix-based classes: null; A B C D; AB AC BC AD BD CD; ABCD
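As a small illustration of prefix and suffix classes, the following sketch groups the 2-itemsets over A, B, C, D by their prefix or suffix of length 1. Itemsets are written as strings of ordered items, and the function name classes is an illustrative choice.

```python
from collections import defaultdict
from itertools import combinations

items = ["A", "B", "C", "D"]
# Itemsets with ordered items, written as strings: AB, AC, AD, BC, BD, CD.
two_itemsets = ["".join(c) for c in combinations(items, 2)]

def classes(itemsets, k, by="prefix"):
    """Group itemsets that share the same prefix (or suffix) of length k."""
    groups = defaultdict(list)
    for s in itemsets:
        key = s[:k] if by == "prefix" else s[-k:]
        groups[key].append(s)
    return dict(groups)

print(classes(two_itemsets, 1, by="prefix"))
# {'A': ['AB', 'AC', 'AD'], 'B': ['BC', 'BD'], 'C': ['CD']}
print(classes(two_itemsets, 1, by="suffix"))
# {'B': ['AB'], 'C': ['AC', 'BC'], 'D': ['AD', 'BD', 'CD']}
```

Each class can then be examined separately from the others.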
Apriori
DFS: Depth-First-Search
BFS: Breadth-First-Search
FP-tree construction
•The FP-tree is a prefix tree
•Items in the itemsets must be ordered in order to have prefixes
–e.g. We cannot have {A, B} in a transaction and {B, C, A} in another, because we will miss the common prefix AB (or BA)
•We can order items in lexicographic order
TID Items
1 {A,B}
2 {B,C,D}
3 {A,C,D,E}
4 {A,D,E}
5 {A,B,C}
6 {A,B,C,D}
7 {B,C}
8 {A,B,C}
9 {A,B,D}
10 {B,C,E}
We start with an empty tree
null
FP-tree construction
Read TID=1:
null
  A:1
    B:1
Each node has a label: The label contains the item name (e.g. B) and a count (e.g. 1), the number of transactions that match this node
Label: <ITEM: SUPPORT COUNT>
FP-tree construction
Read TID=1:
null
  A:1
    B:1
Read TID=2:
null
  A:1
    B:1
  B:1
    C:1
      D:1
We add directed edges between nodes; the edges connect nodes with the same item
Following these edges we can compute the total support count of the item
FP-tree construction
Read TID=1, 2:
null
  A:1
    B:1
  B:1
    C:1
      D:1
We also keep an array of pointers for counting the support of 1-itemsets
Pointers’ array (Item, Pointer): one entry for each of the items A, B, C, D, E
FP-tree construction
After reading all 10 transactions:
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1
Pointers’ array (Item, Pointer): one entry for each of the items A, B, C, D, E
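The construction steps can be sketched in Python as below. This is a minimal illustration under stated assumptions: items are ordered lexicographically, as in the slides, and the pointers' array is represented by a dictionary from each item to the list of its nodes rather than by linked pointers. The names FPNode, build_fp_tree and item_support are illustrative.

```python
transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]

class FPNode:
    """A tree node labelled <ITEM: SUPPORT COUNT>."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode

def build_fp_tree(transactions):
    root = FPNode(None)  # the "null" root
    header = {}          # item -> list of nodes carrying that item
    for t in transactions:
        node = root
        for item in sorted(t):  # lexicographic item order
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1  # one more transaction passes through this node
            node = child
    return root, header

def item_support(header, item):
    """Total support count of an item: sum over all of its nodes."""
    return sum(n.count for n in header.get(item, []))

root, header = build_fp_tree(transactions)
print(root.children["A"].count, root.children["B"].count)  # 7 3
print(item_support(header, "D"))                           # 5
```

The two children of the root come out as A:7 and B:3, and D appears in five separate nodes with count 1 each, which sums to its support count of 5.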
Size of the FP-tree
Input: FP-tree
Output: Frequent itemsets and their support
Method (divide and conquer):
We split the itemsets into suffix equivalence classes:
all itemsets are split into those that end with E, D, C, etc.;
those that end with E are split into those that end with DE, CE, BE, AE, and so on.
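The suffix-based divide and conquer can be illustrated with the sketch below. It is a simplification: it works on plain lists of transactions (a conditional transaction list per suffix) rather than on the FP-tree itself, so it shows the order of exploration but not the tree-based data structure. The function name mine_by_suffix and the threshold of 2 transactions are assumptions made for the example.

```python
from collections import Counter

# The ten transactions of the FP-tree example, items in lexicographic order.
transactions = [
    ("A", "B"), ("B", "C", "D"), ("A", "C", "D", "E"), ("A", "D", "E"),
    ("A", "B", "C"), ("A", "B", "C", "D"), ("B", "C"), ("A", "B", "C"),
    ("A", "B", "D"), ("B", "C", "E"),
]

def mine_by_suffix(transactions, minsup_count, suffix=()):
    """Return {itemset (tuple of ordered items): support count}."""
    counts = Counter(item for t in transactions for item in t)
    result = {}
    # Examine the classes that end with E, then D, then C, and so on.
    for item in sorted(counts, reverse=True):
        if counts[item] < minsup_count:
            continue
        itemset = (item,) + suffix
        result[itemset] = counts[item]
        # Conditional transactions: those containing the item, restricted
        # to the items that precede it in the ordering.
        conditional = [
            tuple(i for i in t if i < item) for t in transactions if item in t
        ]
        result.update(mine_by_suffix(conditional, minsup_count, itemset))
    return result

frequent = mine_by_suffix(transactions, minsup_count=2)
ending_with_e = sorted("".join(s) for s in frequent if s[-1] == "E")
print(ending_with_e)  # ['ADE', 'AE', 'CE', 'DE', 'E']
```

Within the class of itemsets that end with E, the recursion examines those that end with DE, then CE, BE and AE, matching the split described above.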
Representative itemsets
Maximal frequent itemsets
An itemset is maximal frequent when it is frequent and none of its direct supersets is frequent
They offer a compact representation of the frequent itemsets.
[Figure: itemset lattice over A, B, C, D, E (null; A B C D E; AB AC AD AE BC BD BE CD CE DE; ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE) with the frequent itemsets marked]
TID Items
1 ABC
2 ABCD
3 BCE
4 ACDE
5 DE
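As a small worked example, the sketch below finds the maximal frequent itemsets of the five transactions in the table by brute force: it enumerates every itemset, keeps the frequent ones, and then keeps those with no frequent direct superset. The minimum support count of 2 is an assumption made for the illustration and is not stated in the slides.

```python
from itertools import combinations

transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
MINSUP_COUNT = 2  # assumed threshold
items = sorted(set().union(*transactions))

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Enumerate all non-empty itemsets (feasible only for a handful of items).
frequent = set()
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        itemset = frozenset(combo)
        if support_count(itemset) >= MINSUP_COUNT:
            frequent.add(itemset)

# Maximal: frequent, and no direct superset (one extra item) is frequent.
maximal = {
    s for s in frequent
    if not any(s | {i} in frequent for i in items if i not in s)
}
print(sorted("".join(sorted(s)) for s in maximal))  # ['ABC', 'ACD', 'CE', 'DE']
```

Under this threshold there are 14 frequent itemsets, and all of them are subsets of the four maximal ones, which is why the maximal itemsets are a compact representation.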
Sequential patterns
Sequential databases
Definitions
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
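The subsequence relation can be checked with a short sketch. A sequence is represented here as a list of sets, one set per element, so <a(bc)dc> becomes [{a}, {b, c}, {d}, {c}]. The function name is_subsequence is illustrative.

```python
def is_subsequence(pattern, sequence):
    """True if each element of pattern is a subset of a distinct element
    of sequence, with the order of the elements preserved."""
    pos = 0
    for element in pattern:
        # Greedily advance to the earliest element that contains this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True

pattern = [{"a"}, {"b", "c"}, {"d"}, {"c"}]                         # <a(bc)dc>
sequence = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]  # <a(abc)(ac)d(cf)>
print(is_subsequence(pattern, sequence))                 # True
print(is_subsequence([{"a", "b"}], [{"a"}, {"b"}]))      # False
```

The second call returns False because (ab) requires a and b to occur together in one element, while the sequence has them in two separate elements.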
GSP – Generalized Sequential Pattern
GSP Algorithm
GSP restrictions
• Similar to Apriori
• Exponential increase of candidates when
the sequence length grows
• Extensions
– FreeSpan: the sequence set is mapped to
smaller sets based on the frequent patterns of
each level
– PrefixSpan: the sequence set is mapped to
smaller sets based on the prefix of
frequent patterns