
Data Mining and Business Intelligence
10th Lecture
Association rules

Iraklis Varlamis
Contents

• Association rules – Definitions
• Algorithms
  – Apriori, FP-Growth
• Sequential rule mining
  – GSP
Association rules

• Association rule mining:
  – Detect frequent patterns, in the form of associations (co-occurrence, cause and effect), in transaction databases
  – Frequent pattern: a pattern that occurs many times in the data set
• Examples:
  – Which products are frequently bought together? (association, correlation)
  – What will a customer who buys an item purchase next? (causality)
  – Which DNA segments (genes) respond to a drug? (causality)
The problem

Given a set of transactions (shopping baskets), find rules that predict the occurrence of an item in a new transaction which already contains several other items.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

• Examples of association rules:
  – {Diaper} → {Beer}
  – {Milk, Bread} → {Eggs, Coke}
  – {Beer, Bread} → {Milk}
• Applications: product offers / bundle sales, product positioning, stock management

Note: association means co-occurrence and not causality, since there is no ordering of the items.
Transaction set representation

• Rows are transactions (baskets, sessions)
• Columns are items
• A cell value is 1 when the item appears in the transaction
• The values are non-symmetric (a 1 carries more information than a 0)
• Restriction: quantities are ignored (only presence/absence is recorded)

TID  Items                        TID  Bread  Milk  Diaper  Beer  Eggs  Coke
1    Bread, Milk                  1    1      1     0       0     0     0
2    Bread, Diaper, Beer, Eggs    2    1      0     1       1     1     0
3    Milk, Diaper, Beer, Coke     3    0      1     1       1     0     1
4    Bread, Milk, Diaper, Beer    4    1      1     1       1     0     0
5    Bread, Milk, Diaper, Coke    5    1      1     1       0     0     1
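To make the representation concrete, here is a minimal Python sketch (the variable names are my own, not from the lecture) that builds the binary rows from the example baskets; later snippets reuse this `transactions` dictionary.

```python
# Minimal sketch: the example baskets as sets, and their binary representation.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

items = sorted({item for basket in transactions.values() for item in basket})
for tid, basket in transactions.items():
    row = [1 if item in basket else 0 for item in items]  # asymmetric: 1 = present
    print(tid, row)
```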
Definitions

• I = {i1, i2, ..., ik}: a set of distinct elements (items)
  Example: {Bread, Milk, Diaper, Beer, Eggs, Coke}
• Itemset: a subset of I
  Example: {Milk, Bread, Diaper}
• k-itemset: an itemset of size k
• T = {t1, t2, ..., tN}: a set of transactions, where each ti is an itemset
• Transaction width: the number of items in a transaction ti
Definitions

• Support count (σ): the number of transactions that contain an itemset X:
  σ(X) = |{ ti | X ⊆ ti, ti ∈ T }|
  Example: σ({Milk, Bread, Diaper}) = 2
• Support (s): the ratio of transactions that contain an itemset X:
  s(X) = σ(X) / |T|
  Example: s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset: an itemset with support greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definitions

• Association rule: an expression X → Y, where X and Y are itemsets with X ⊂ I, Y ⊂ I and X ∩ Y = ∅
  Example: {Milk, Diaper} → {Beer}
• Rule support (s): the ratio of transactions that contain all the items of both X and Y, i.e. the number of transactions that contain X ∪ Y divided by the total number of transactions:
  s = σ(X ∪ Y) / |T|
• Rule confidence (c): the number of transactions that contain X ∪ Y divided by the number of transactions that contain X:
  c = σ(X ∪ Y) / σ(X)
Example: {Milk, Diaper} → {Beer}
  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
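A small sketch of both measures on this example, reusing the `transactions` dictionary defined earlier; `support_count` is a helper name I introduce here, not lecture code.

```python
def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for basket in transactions.values() if itemset <= basket)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3
print(f"s = {s:.2f}, c = {c:.2f}")
```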
Association rule mining algorithms
Divide into sub-problems

• Algorithms operate in two steps:
  1. Frequent itemset generation
     – Find all itemsets X with support s(X) >= minsup
  2. Rule generation
     – For each frequent itemset X, create rules Xi → Xj with confidence >= a minconf threshold
     – Each rule divides X into two disjoint subsets Xi and Xj (Xi ∪ Xj = X and Xi ∩ Xj = ∅)
• Both steps are computationally expensive; a sketch of the rule-generation step follows below
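A brute-force sketch of the rule-generation step, assuming the `support_count` helper and `transactions` defined above (real implementations prune this enumeration by confidence):

```python
from itertools import combinations

def rules_from_itemset(X, transactions, minconf):
    """Enumerate every split of frequent itemset X into Xi -> Xj and keep
    the rules whose confidence reaches minconf."""
    sigma_X = support_count(X, transactions)
    rules = []
    for r in range(1, len(X)):                       # antecedent size
        for antecedent in combinations(sorted(X), r):
            Xi = set(antecedent)
            conf = sigma_X / support_count(Xi, transactions)
            if conf >= minconf:
                rules.append((Xi, X - Xi, conf))
    return rules

for Xi, Xj, conf in rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, 0.6):
    print(Xi, "->", Xj, f"(c = {conf:.2f})")
```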


Itemset Lattice

null
A  B  C  D  E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Combinations: for d different items, we must examine 2^d − 1 candidate itemsets.
Apriori

• Key observation: every subset of a frequent itemset is itself frequent
  – A transaction that contains A, B, C definitely contains A, B
• Apriori prunes the itemset lattice:
  – When an itemset is infrequent, all its supersets are infrequent too
  – If we find the frequent k-itemsets in the transaction set, we can create candidate (k+1)-itemsets and examine only those
A-priori pruning strategy (support-based pruning)

When an itemset is infrequent, all the itemsets that contain it are infrequent too, so we prune them.

Example: if AB is found infrequent, we prune all its supersets in the lattice (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE).
Example

Minimum support count = 3

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Coke and Eggs are infrequent, so we do not generate candidates that contain them.

Pairs (2-itemsets):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triples (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3

All candidate itemsets without pruning:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41

Candidate itemsets after pruning:
C(6,1) + C(4,2) + 1 = 6 + 6 + 1 = 13
Generic algorithm

k = 1
Find all frequent 1-itemsets
Repeat until no new frequent itemsets are found:
  • Create candidate (k+1)-itemsets from the frequent k-itemsets
  • Prune candidate (k+1)-itemsets that contain an infrequent k-itemset
  • Find the support of each remaining candidate itemset by scanning the transaction set
  • Remove the infrequent (k+1)-itemsets and keep the frequent ones

• Every iteration scans the transaction set once
• To avoid creating the same candidate itemset many times, the items in an itemset are kept (lexicographically) sorted
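A compact sketch of this level-wise loop, reusing the `support_count` helper from before; the candidate generation below is a simple self-join rather than an optimized merge, and the names are my own.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise search: generate candidate (k+1)-itemsets from frequent
    k-itemsets, prune by the Apriori principle, count support, repeat."""
    items = {i for basket in transactions.values() for i in basket}
    Fk = {frozenset([i]) for i in items
          if support_count({i}, transactions) >= minsup_count}
    frequent, k = set(Fk), 1
    while Fk:
        candidates = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
        # Apriori pruning: drop candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Fk for s in combinations(c, k))}
        # One scan of the transaction set per level.
        Fk = {c for c in candidates
              if support_count(c, transactions) >= minsup_count}
        frequent |= Fk
        k += 1
    return frequent

for itemset in sorted(apriori(transactions, 3), key=len):
    print(set(itemset))
```

On the example baskets with minimum support count 3 this returns the nine frequent itemsets of the earlier slide; {Bread, Diaper, Beer} and {Milk, Diaper, Beer} are pruned without ever being counted.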
Candidate itemset generation

• Combine F(k−1) × F1:
  – Extend every frequent (k−1)-itemset with frequent items
  – Every frequent (k−1)-itemset, with its items in lexicographic order, is extended only with frequent items of higher order

Itemset          Count        Item    Count
{Bread, Milk}    3            Bread   4
{Beer, Bread}    2            Coke    2
{Bread, Diaper}  3            Milk    4
{Beer, Milk}     2            Beer    3
{Diaper, Milk}   3            Diaper  4
{Beer, Diaper}   3            Eggs    1
Support computation

• Brute force: for every transaction, check all candidate itemsets
  – Expensive when the number of candidates is large
  – A transaction with many items will match many candidates
• Candidate itemsets are stored in a hash tree
• The leaves of the hash tree contain itemsets and their occurrence counts, which are updated during the scan
• Subset function: finds all candidate itemsets contained in a transaction
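The hash tree is the efficient realization of the subset function; as a naive stand-in (my own simplification, not the lecture's structure), the same counting can be sketched by enumerating k-subsets and probing a dictionary of candidates:

```python
from itertools import combinations

def count_support(transactions, candidates, k):
    """Naive subset function: enumerate every k-subset of each transaction
    and increment the count of those that are candidates (frozensets).
    A hash tree serves the same purpose while visiting far fewer subsets."""
    counts = {c: 0 for c in candidates}
    for basket in transactions.values():
        for subset in combinations(sorted(basket), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts
```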
Example

[Figure: a hash tree containing all candidate 3-itemsets, built with the hash function h(item) = item mod 3 (branches 1,4,7 / 2,5,8 / 3,6,9) and leaves {1,4,5}, {1,2,4}, {4,5,7}, {1,2,5}, {4,5,8}, {1,5,9}, {1,3,6}, {2,3,4}, {5,6,7}, {3,4,5}, {3,5,6}, {3,6,7}, {3,6,8}, {3,5,7}, {6,8,9}. The subset function matches the transaction {1,2,3,5,6} by recursively splitting it (1+2356, 12+356, 13+56, ...) and visiting only the branches that can contain its subsets.]
Apriori optimizations

• Issues:
  – The transaction set scans are expensive (I/O)
  – The number of candidate itemsets can be large
  – Support computation is expensive
• Improvements:
  – Reduce the number of scans
  – Reduce the candidate itemsets to be checked
  – Improve the support count computation time
• Once the support of the itemsets is known, confidence is easy to compute
Reduce the number of itemset checks

• To compute the support of ABC, the itemsets AB, AC and BC must be frequent, and consequently A, B and C must be frequent

[Figure: the itemset lattice, highlighting ABC and its subsets]

Dynamic Itemset Counting (DIC)
• The transaction set is scanned in smaller chunks
• After scanning a chunk, itemsets are marked as definitely frequent, possibly frequent or possibly infrequent
• When an itemset is marked frequent (above minsup) after a chunk scan, it is ignored for the remaining chunks

FP-Growth
Equivalence classes

• Itemsets (with ordered items) are split into equivalence classes, which are then examined separately
• Apriori defines classes based on width, starting with 1-itemset classes, then 2-itemset classes, etc.
• Prefix (suffix) classes: two itemsets of the same class share the same prefix (or suffix) of length k

[Figure: (a) a prefix tree and (b) a suffix tree over the itemsets of items A, B, C, D]
BFS vs DFS

• Apriori traverses the lattice breadth-first (BFS)
• DFS (depth-first search) is faster than BFS at finding maximal frequent itemsets:
  – all supersets of a maximal frequent itemset are pruned
Support counting without candidate generation

• The FP-tree compresses the transaction set representation and allows support counting without candidate itemset generation
  – The tree is similar to a prefix tree (trie)
  – The construction algorithm reads each transaction once and maps it to a path in the FP-tree
  – Some paths may overlap; the more the paths overlap, the better the compression
• Once the FP-tree is constructed, the algorithm uses a divide-and-conquer approach to find the frequent itemsets

The transaction set is scanned only twice (once for item counts, once to build the tree), not once per level!
FP-tree construction

• The FP-tree is a prefix tree
• Items within transactions must be consistently ordered so that common prefixes appear
  – e.g. we cannot keep {A,B} in one transaction and {B,C,A} in another, because we would miss the common prefix AB (or BA)
• We can order the items lexicographically

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

We start with an empty tree: null
FP-tree construction

Read TID=1, {A,B}: a path null → A:1 → B:1 is created.

Each node has a label <ITEM : SUPPORT COUNT>: the item name (e.g. B) and a count (e.g. 1), the number of transactions whose path passes through the node.
FP-tree construction

Read TID=2, {B,C,D}: it shares no prefix with the existing path, so a new branch null → B:1 → C:1 → D:1 is created.

We also add directed edges that connect nodes with the same item; following these edges we can compute the item's total support count.
FP-tree construction

After TID=1 and TID=2 the tree has two branches: null → A:1 → B:1 and null → B:1 → C:1 → D:1.

We also keep an array of pointers (one entry per item: A, B, C, D, E), each pointing to the chain of nodes of that item, for counting the support of 1-itemsets.
FP-tree construction

After all ten transactions have been read, the final tree (with the pointers' array threading the same-item nodes) is:

null
├─ A:7
│  ├─ B:5
│  │  ├─ C:3
│  │  │  └─ D:1
│  │  └─ D:1
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:3
   └─ C:3
      ├─ D:1
      └─ E:1
Size of the FP-tree

• Each transaction is a path starting from the tree root
• The tree is smaller than the transaction set when there are common prefixes
• When all transactions contain the same items, the tree has only one path
• When all transactions contain different items (extremely rare), the FP-tree needs more space than the transaction set, because it stores additional information (pointers, partial support counts)
Final tree

• The final tree depends on how the items are ordered: different ordering → different prefixes
• The tree is (usually) smaller when the items are ordered by decreasing frequency
• The transaction set is scanned once to compute the 1-item support counts, and the itemsets are reordered accordingly during the second scan
• Infrequent items are ignored in the second scan

In the example, σ(A)=7, σ(B)=8, σ(C)=7, σ(D)=5, σ(E)=3, so the order is B, A, C, D, E:

TID  Items        TID  Reordered
1    {A,B}        1    {B,A}
2    {B,C,D}      2    {B,C,D}
3    {A,C,D,E}    3    {A,C,D,E}
4    {A,D,E}      4    {A,D,E}
5    {A,B,C}      5    {B,A,C}
6    {A,B,C,D}    6    {B,A,C,D}
7    {B,C}        7    {B,C}
8    {A,B,C}      8    {B,A,C}
9    {A,B,D}      9    {B,A,D}
10   {B,C,E}      10   {B,C,E}
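A sketch of the two-scan construction described above; the node layout, header table and names are my own minimal choices, not the lecture's code.

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fptree(transactions, minsup_count):
    """First scan: 1-item support counts. Second scan: drop infrequent items,
    reorder each transaction by decreasing frequency, insert it as a path."""
    freq = {}
    for basket in transactions.values():
        for item in basket:
            freq[item] = freq.get(item, 0) + 1
    root, header = FPNode(None, None), {}        # header: item -> list of nodes
    for basket in transactions.values():
        path = sorted((i for i in basket if freq[i] >= minsup_count),
                      key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            child = node.children.get(item)
            if child:
                child.count += 1                 # shared prefix: just count up
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = child
    return root, header
```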
FP-Growth algorithm

Frequent itemset mining algorithm
Input: the FP-tree
Output: the frequent itemsets and their support
Method (divide and conquer):
• Split the itemsets into suffix equivalence classes:
  – all itemsets split into those that end in E, in D, in C, etc.
  – those that end in E split into those that end in DE, CE, BE, AE, and so on

Representative itemsets
Maximal frequent itemsets

• An itemset is maximal frequent when none of its direct supersets is frequent
• Maximal frequent itemsets offer a compact representation: from the set of maximal frequent itemsets we can recreate all frequent itemsets (by generating all subsets)
• They do not, however, carry information about the support of their subsets

[Figure: the itemset lattice split by the frequent/infrequent border; the maximal frequent itemsets lie on the frequent side of the border]
Closed itemsets

• An itemset is closed when none of its direct supersets has the same support count
• For a non-closed itemset there exists a superset with the same support count

Minimum support count = 2

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE
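Both definitions translate directly into checks over the mined itemsets. The sketch below assumes `frequent` is a collection of frozensets and `support` maps each frequent itemset (frozenset) to its support count; the names are mine.

```python
def is_maximal(X, frequent):
    """Maximal frequent: no direct (one-item-larger) superset is frequent."""
    return not any(len(Y) == len(X) + 1 and X < Y for Y in frequent)

def is_closed(X, support):
    """Closed: no direct superset has the same support count."""
    return all(support[Y] < support[X]
               for Y in support if len(Y) == len(X) + 1 and X < Y)
```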
Sequential patterns

Sequential databases

• Transaction databases that contain time-ordered actions
• We move from frequent itemsets to frequent sequential patterns
• Applications:
  – The next purchase in an e-shop
  – The evolution of a patient's health
  – Clickstream analysis
  – DNA chain analysis
Definitions

• In a set of sequences we search for frequent subsequences
• Sequence: an ordered list of events, e.g. <(ef)(ab)(df)cb>
• Event: may contain multiple items, e.g. (ab)

Sequence set:
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
• For a threshold min_sup = 2, the sequence <(ab)c> is a frequent sequential pattern
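The containment test behind these definitions can be sketched as a greedy scan (the encoding of events as sets and all names are mine; the example matches the subsequence above):

```python
def contains(seq, sub):
    """True if `sub` is a subsequence of `seq`. Both are lists of events
    (sets of items): event order is preserved, gaps are allowed, and each
    event of `sub` must be a subset of the event it is matched to."""
    i = 0
    for event in seq:
        if i < len(sub) and sub[i] <= event:
            i += 1
    return i == len(sub)

seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]   # <a(abc)(ac)d(cf)>
print(contains(seq, [{"a"}, {"b", "c"}, {"d"}, {"c"}]))          # True: <a(bc)dc>
```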
Challenges

• Large number of sequential patterns
• The algorithm must:
  – Find the complete set of frequent patterns
  – Be scalable and perform well
  – Support user constraints (e.g. on the pattern length, or on the number of events that may be skipped between events of interest)
GSP – Generalized Sequential Pattern

• Apriori-based: when a sequence is infrequent, all its supersequences are infrequent too
• Candidate sequence generation and pruning, as in Apriori

Length-1 sequential patterns, min_sup = 2:

Seq. ID  Sequence           Cand  Sup
10       <(bd)cb(ac)>       <a>   3
20       <(bf)(ce)b(fg)>    <b>   5
30       <(ah)(bf)abf>      <c>   4
40       <(be)(ce)d>        <d>   3
50       <a(bd)bcb(ade)>    <e>   3
                            <f>   2
                            <g>   1
                            <h>   1

With min_sup = 2, <g> and <h> are pruned.
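Reusing the `contains` helper above, the support count over this sequence set and the length-1 pass can be sketched as follows (the dictionary simply encodes the table):

```python
def seq_support(pattern, seq_db):
    """Number of data sequences that contain the candidate pattern."""
    return sum(contains(s, pattern) for s in seq_db.values())

seq_db = {
    10: [{"b", "d"}, {"c"}, {"b"}, {"a", "c"}],
    20: [{"b", "f"}, {"c", "e"}, {"b"}, {"f", "g"}],
    30: [{"a", "h"}, {"b", "f"}, {"a"}, {"b"}, {"f"}],
    40: [{"b", "e"}, {"c", "e"}, {"d"}],
    50: [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}],
}
F1 = [i for i in "abcdefgh" if seq_support([{i}], seq_db) >= 2]
print(F1)   # ['a', 'b', 'c', 'd', 'e', 'f'] -- <g> and <h> are pruned
```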
Length-2 Candidates

51 length-2 candidates = 6*6 (ordered <xy> pairs) + 6*5/2 (unordered <(xy)> element pairs):

     <a>   <b>   <c>   <d>   <e>   <f>
<a>  <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>  <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>  <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>  <da>  <db>  <dc>  <dd>  <de>  <df>
<e>  <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>  <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

     <a>  <b>     <c>     <d>     <e>     <f>
<a>       <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>               <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                       <(cd)>  <(ce)>  <(cf)>
<d>                               <(de)>  <(df)>
<e>                                       <(ef)>
<f>

After the database scan, 19 length-2 candidates remain.
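A sketch of how the 51 candidates break down (the helper name is mine): <xy> candidates keep x and y in separate events, so order matters and x == y is allowed; <(xy)> candidates put both items in one event, so order does not matter.

```python
from itertools import combinations

def length2_candidates(F1):
    """<x y>: separate events, ordered, x == y allowed.
    <(xy)>: one event, unordered, x != y."""
    seq_cands = [[{x}, {y}] for x in F1 for y in F1]           # 6*6 = 36
    elem_cands = [[{x, y}] for x, y in combinations(F1, 2)]    # 6*5/2 = 15
    return seq_cands + elem_cands

print(len(length2_candidates(["a", "b", "c", "d", "e", "f"])))  # 51
```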
GSP Algorithm

• Take sequences of the form <x> as length-1 candidates
• Scan the database once to find F1, the set of length-1 sequential patterns
• Let k = 1; while Fk is not empty do:
  – Form Ck+1, the set of length-(k+1) candidates, from Fk
  – If Ck+1 is not empty, scan the database once to find Fk+1, the set of length-(k+1) sequential patterns
  – Let k = k + 1
GSP restrictions

• Similar to Apriori
• The number of candidates grows exponentially with the sequence length

• Extensions:
  – FreeSpan: the sequence set is projected onto smaller sets based on the frequent patterns of each level
  – PrefixSpan: the sequence set is projected onto smaller sets based on the prefixes of frequent patterns
