
ASSOCIATION RULE MINING

Mining Frequent Itemsets Using the Vertical Data Format


Horizontal Data Format vs. Vertical Data Format
Vertical Data Format
Transform data from horizontal format (TID -> itemset)
to vertical format (item -> TID_set)
The support count of an itemset is the length of its TID_set
Intersect the TID_sets of every pair of frequent single
items to obtain the TID_sets of the 2-itemsets
Starting with k = 1, the frequent k-itemsets are used to
construct candidate (k+1)-itemsets
The Apriori property is exploited
No need to scan the database to find the support
of the (k+1)-itemsets: it is the length of the
intersected TID_set
To avoid long TID_sets, keep track of only the
difference (diffset) between the TID_set of a (k+1)-
itemset and that of a corresponding k-itemset

Ex: diffset({I1, I2},{I1}) = {T500, T700}
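A minimal Python sketch of the vertical-format computation; the TID_sets below are assumptions, chosen so that the diffset matches the example above:

# Vertical format: each item maps to the set of TIDs containing it.
vertical = {
    'I1': {'T100', 'T400', 'T500', 'T700', 'T800', 'T900'},
    'I2': {'T100', 'T200', 'T300', 'T400', 'T600', 'T800', 'T900'},
}
# Support count of {I1, I2} = size of the TID_set intersection.
tids_12 = vertical['I1'] & vertical['I2']
print(len(tids_12))                  # 4
# Diffset: TIDs that contain I1 but not {I1, I2}.
print(vertical['I1'] - tids_12)      # {'T500', 'T700'}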


Maximal and Closed Frequent Itemsets & Mining Closed
Frequent Itemsets:
Simple approach: mine the complete set of frequent
itemsets, then eliminate every frequent itemset that is a
proper subset of some other frequent itemset with the
same support
Better: directly mine closed frequent itemsets with
effective pruning strategies
Mining Closed Frequent Itemsets
Item Merging - If every transaction containing a
frequent itemset X also contains an itemset Y but
not any proper superset of Y, then X ∪ Y forms a
frequent closed itemset and there is no need to
search for any itemset containing X but not Y.


Ex: The projected database for the prefix {I5: 2} is
{{I2, I1}, {I2, I1, I3}}. Every transaction in it contains
the itemset {I2, I1}, but no proper superset of {I2, I1}
occurs in every transaction. So {I2, I1} can be merged
with {I5} to give the closed itemset {I5, I2, I1: 2}.
There is no need to mine for closed itemsets that
contain I5 but not {I2, I1}. A sketch of this check follows.
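A short sketch of the item-merging check, assuming the projected database above; intersecting its transactions yields the items that can be merged into the prefix:

# Projected database for the prefix {I5} (support 2).
projected = [{'I2', 'I1'}, {'I2', 'I1', 'I3'}]
# Items present in every transaction always co-occur with the
# prefix, so they merge with it into a closed itemset.
always = set.intersection(*projected)
print({'I5'} | always, ':', len(projected))  # {'I5', 'I2', 'I1'} : 2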
Mining Closed Frequent Itemsets
Sub-Itemset Pruning - If a frequent itemset X is a
proper subset of an already found frequent closed
itemset Y and support_count(X) = support_count(Y),
then X and all of X's descendants in the set
enumeration tree cannot be frequent closed itemsets
and thus can be pruned.
Ex: DB = {<a1, a2, ..., a100>, <a1, a2, ..., a50>},
min_sup = 2
Projecting on a1 gives {a1, a2, ..., a50 : 2} by
item merging
support({a2}) = support({a1, a2, ..., a50}) = 2 and
{a2} is a proper subset, so there is no need to examine
a2 and its projections (sketch below)
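A sketch of the sub-itemset pruning test, assuming the closed itemsets found so far are kept with their support counts:

# Closed itemsets already found, mapped to support counts.
found = {frozenset(f'a{i}' for i in range(1, 51)): 2}

def can_prune(candidate, support):
    # Prune if candidate is a proper subset of a found closed
    # itemset with the same support.
    return any(candidate < closed and support == sup
               for closed, sup in found.items())

print(can_prune(frozenset({'a2'}), 2))  # True: skip a2's subtree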
Mining Closed Frequent Itemsets
Item Skipping
In depth-first mining of closed itemsets, each
prefix itemset X is associated with a header table
and a projected database.
If a local frequent item p has the same support in
several header tables at different levels, one can
safely prune p from the header tables at the higher levels.
Ex: a2 has the same support in the global header table
and in a1's projection, so it can be pruned from the
global header table

Mining Closed Frequent Itemsets


Closure Checking
Check whether a candidate is a superset / subset of an
already found closed frequent itemset with the same
support
Superset Checking
Handled by Item Merging
Subset Checking
A pattern tree maintains the set of closed itemsets
mined so far (similar to an FP-tree)
An itemset Sc is subsumed by an already found
closed itemset Sa iff
Both have the same support
The length of Sc is smaller than that of Sa
All items in Sc are contained in Sa
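A simplified subset check; a flat list stands in for the pattern tree here, which is an assumption made to keep the sketch short:

# Closed itemsets mined so far, with support counts.
mined = [(frozenset({'I2', 'I1', 'I5'}), 2)]

def is_subsumed(sc, support):
    # Sc is subsumed if some Sa has equal support, greater
    # length, and contains every item of Sc.
    return any(support == sup and len(sc) < len(sa) and sc <= sa
               for sa, sup in mined)

print(is_subsumed(frozenset({'I1', 'I5'}), 2))  # True: not closed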
Multilevel Association Rules
Rules generated from association rule mining over
concept hierarchies
Levels of the hierarchy
Rules at too high a level may be common-sense knowledge
Some rules can only be discovered at higher levels
(items at low levels may lack sufficient support)
Uniform Support
The same minimum support threshold for all levels
Reduced Support
A reduced minimum support threshold at lower levels
Using Reduced Support
Level-by-level independent
Full-breadth search
No background knowledge is used for pruning
Level-cross filtering by single item
An item at the ith level is examined iff its parent
node at the (i-1)st level is frequent

Level-cross filtering by k-itemset
A k-itemset at the ith level is examined iff its
corresponding parent k-itemset at the (i-1)st level is
frequent
Group-based Support
The user sets group-specific minimum support thresholds
(e.g., a low threshold for item groups of interest)
A sketch of level-cross filtering by single item follows.
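A minimal sketch of level-cross filtering by single item; the hierarchy, supports, and thresholds below are assumptions for illustration:

# Concept hierarchy: child item -> parent at the level above.
parent = {'2% milk': 'milk', 'skim milk': 'milk',
          'wheat bread': 'bread'}
# Supports and per-level reduced minimum supports (assumed numbers).
support = {'milk': 0.10, 'bread': 0.08, '2% milk': 0.06,
           'skim milk': 0.02, 'wheat bread': 0.045}
min_sup = {1: 0.05, 2: 0.03}  # threshold reduced at the lower level

def examined(item, level):
    # A level-i item is examined iff its level-(i-1) parent is frequent.
    return level == 1 or support[parent[item]] >= min_sup[level - 1]

for item in ('2% milk', 'skim milk', 'wheat bread'):
    if examined(item, 2) and support[item] >= min_sup[2]:
        print(item, 'is frequent at level 2')  # 2% milk, wheat bread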

Redundant Multilevel Association Rules Filtering
Some rules may be redundant due to ancestor
relationships between items
milk ⇒ wheat bread [8%, 70%]
2% milk ⇒ wheat bread [2%, 72%]
The first rule is an ancestor of the second rule
A rule is redundant if its support and confidence are
close to their expected values, based on the rule's
ancestor. E.g., if about one quarter of milk sold is 2%
milk, the expected support of the second rule is
8% x 1/4 = 2%, exactly what is observed, so the second
rule carries no extra information.
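A sketch of the redundancy test; the one-quarter market share of 2% milk is the assumption that makes the arithmetic concrete:

# Ancestor rule: milk => wheat bread [support 8%, confidence 70%].
anc_sup, anc_conf = 0.08, 0.70
child_share = 0.25                     # assumed share of milk that is 2% milk
expected_sup = anc_sup * child_share   # 0.02
actual_sup, actual_conf = 0.02, 0.72
redundant = (abs(actual_sup - expected_sup) < 0.005
             and abs(actual_conf - anc_conf) < 0.05)
print(redundant)  # True: the descendant rule adds nothing new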
Multidimensional Association Rules
Single-dimensional rules
buys(X, "milk") ⇒ buys(X, "bread")
Multidimensional rules (2 or more dimensions/predicates)
Inter-dimension assoc. rules (no repeated
predicates)
age(X, "19-25") ∧ occupation(X, "student")
⇒ buys(X, "coke")
Hybrid-dimension assoc. rules (repeated
predicates)
age(X, "19-25") ∧ buys(X, "popcorn")
⇒ buys(X, "coke")

Categorical Attributes and Quantitative Attributes


Categorical Attributes
Finite number of possible values, no ordering
among values
Quantitative Attributes
Numeric, implicit ordering among values

Mining Quantitative Associations


Static discretization based on predefined concept
hierarchies
Dynamic discretization based on data distribution
Clustering: Distance-based association

Static Discretization of Quantitative Attributes


Attributes are discretized prior to mining using concept
hierarchies; numeric values are replaced by ranges.
In a relational database, finding all frequent k-predicate
sets requires k or k+1 table scans.
A data cube is well suited for this mining (faster)
Fully materialized cubes may already exist
The Apriori property prunes the search
Quantitative Association Rules
Numeric attributes are dynamically discretized so that
the confidence of the mined rules is maximized
2-D quantitative rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general
rules using a 2-D grid: ARCS (Association Rule
Clustering System)
Steps:
Binning
Equiwidth, equidepth, or homogeneity-based;
counts are kept in a 2-D array
Finding frequent predicate sets
Clustering the association rules
Grid-based technique: rectangular regions
(a binning sketch follows)
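A sketch of the binning step with equiwidth bins and a 2-D count array; the bin widths and data points are assumptions:

# Equiwidth binning of (age, income) pairs into a 2-D grid of counts.
AGE_BIN, INCOME_BIN = 5, 10_000
grid = {}
# Customers who bought a high-resolution TV (assumed data).
data = [(34, 35_000), (35, 38_000), (34, 42_000), (35, 45_000)]
for age, income in data:
    cell = (age // AGE_BIN, income // INCOME_BIN)
    grid[cell] = grid.get(cell, 0) + 1
# Adjacent occupied cells form a rectangular region -> one general rule.
print(grid)  # {(6, 3): 1, (7, 3): 1, (6, 4): 1, (7, 4): 1}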

Clustering Association Rules: Example

age(X, 34) ∧ income(X, "30-40K") ⇒ buys(X, "high
resolution TV")
age(X, 35) ∧ income(X, "30-40K") ⇒ buys(X, "high
resolution TV")
age(X, 34) ∧ income(X, "40-50K") ⇒ buys(X, "high
resolution TV")
age(X, 35) ∧ income(X, "40-50K") ⇒ buys(X, "high
resolution TV")
These occupy adjacent grid cells and cluster into one rule:
age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high
resolution TV")
Strong vs. Interesting
The user is the final judge
Misleading rule
buys(X, "computer games") ⇒ buys(X, "videos")
[40%, 66.7%]
Misleading because videos appear in 75% of all
transactions, more than the 66.7% confidence suggests
May have to use other measures


Correlation Analysis
The correlation between itemsets A and B is calculated
Lift: A is independent of B if P(A ∪ B) = P(A) P(B);
otherwise A and B are correlated
(here P(A ∪ B) denotes the probability that a
transaction contains both A and B)

lift(A, B) = P(A ∪ B) / (P(A) P(B))

If the value is < 1, A is negatively correlated with B
If the value is > 1, A and B are positively correlated
If the value is = 1, they are independent
Equivalently, lift(A ⇒ B) = conf(A ⇒ B) / sup(B)
Correlation Analysis
lift = P(Game and Video) / (P(Game) x P(Video)) =
0.89
Negative correlation
buys(X, "computer games") ⇒ ¬buys(X, "videos")
[20%, 33.3%] is more accurate, although with lower
support and confidence
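The counts implied by these percentages (10,000 transactions; 6,000 game buyers; 7,500 video buyers; 4,000 buying both) make the lift computation easy to check:

n, games, videos, both = 10_000, 6_000, 7_500, 4_000
p_a, p_b, p_ab = games / n, videos / n, both / n
print(round(p_ab / (p_a * p_b), 2))    # 0.89 -> negative correlation
# Equivalent form: confidence of the rule divided by support of B.
print(round((both / games) / p_b, 2))  # 0.89 again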

Correlation Analysis
χ² analysis
χ² = Σ (observed - expected)² / expected
For the (game, video) cell: expected = 10,000 x 0.60 x
0.75 = 4,500, giving (4000 - 4500)² / 4500 = 55.6;
summing over all four cells of the contingency table
gives χ² = 555.6
Because the χ² value is greater than 1 and the observed
count (4,000) is less than the expected count (4,500)
for (Video, Game), the two are negatively correlated
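A direct computation of the χ² statistic from the same contingency table:

n, p_game, p_video = 10_000, 0.60, 0.75
observed = {('g', 'v'): 4_000, ('g', 'nv'): 2_000,
            ('ng', 'v'): 3_500, ('ng', 'nv'): 500}
marg = {'g': p_game, 'ng': 1 - p_game,
        'v': p_video, 'nv': 1 - p_video}
# Sum (observed - expected)^2 / expected over all four cells.
chi2 = sum((obs - n * marg[a] * marg[b]) ** 2 / (n * marg[a] * marg[b])
           for (a, b), obs in observed.items())
print(round(chi2, 1))  # 555.6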

Correlation Analysis
All-Confidence
all_conf(X) = sup(X) / max_item_sup(X)
The maximum single-item support among the items
of X is used in the denominator
Equals the minimal confidence among the rules
i_j ⇒ X - {i_j}, where i_j ∈ X

Cosine Measure
cosine(A, B) = P(A ∪ B) / (P(A) x P(B))^(1/2)
Similar to lift, but the square root means it is
influenced only by the supports of A, B, and A ∪ B,
not by the total number of transactions
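Both measures evaluated on the game/video counts used above:

import math
n, games, videos, both = 10_000, 6_000, 7_500, 4_000
# All-confidence: support of X over its largest single-item support.
print(round((both / n) / max(games / n, videos / n), 3))  # 0.533
# Cosine: like lift but with a square root, so n cancels out.
print(round((both / n) / math.sqrt((games / n) * (videos / n)), 3))  # 0.596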

Comparison of Correlation Measures

All-confidence and the cosine measure are null-invariant
(unaffected by transactions containing neither A nor B)
These can be used first, followed by lift, χ², etc.
Correlation Analysis
All-Confidence
If a pattern is all-confident (meets the threshold),
then all its sub-patterns are also all-confident
This leads to Apriori-style pruning of patterns that
don't meet the all-confidence threshold
Correlation rules
Reduce the number of rules
Meaningful rules are discovered
A combination of measures can be used
Constraint-based Data Mining
Finding all the patterns in a database autonomously is
unrealistic
The patterns could be too many but not focused!
Data mining should be an interactive process
The user directs what is to be mined using a data mining
query language (or a graphical user interface)
Constraint-based mining
User flexibility: the user provides constraints on what
is to be mined
System optimization: the system exploits such
constraints for efficient mining
Constraints in Data Mining
Knowledge type constraint:
classification, association, etc.
Data constraint - using SQL-like queries
find product pairs sold together in stores in
Vancouver in Dec. '00
Dimension/level constraint
in relevance to region, price, brand, customer
category
Rule (or pattern) constraint
small sales (price < $10) triggers big sales (sum
> $200)
Interestingness constraint
min_confidence ≥ 60%
Meta-Rule-Guided Mining
Makes the mining process more effective and efficient
Users can specify the syntactic form of the rules of
interest
Example:
P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "educational
software")
Rule form (template):
P1 ∧ P2 ∧ ... ∧ Pl ⇒ Q1 ∧ Q2 ∧ ... ∧ Qr
with p = l + r predicates
Mining needs all frequent p-predicate sets, plus the
support counts of the l-predicate sets (for confidence)
A data cube search suits this computation
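A toy filter for the metarule above; rules are represented as (antecedent predicates, consequent predicate) pairs, and the candidates are assumptions for illustration:

candidates = [(['age', 'income'], 'buys'),
              (['age'], 'buys'),
              (['occupation', 'age'], 'buys')]

def matches_metarule(antecedents, consequent, l=2, target='buys'):
    # Keep rules shaped like P1(X,Y) ^ P2(X,W) => buys(X, ...).
    return len(antecedents) == l and consequent == target

for ante, cons in candidates:
    if matches_metarule(ante, cons):
        print(' ^ '.join(ante), '=>', cons)
# age ^ income => buys
# occupation ^ age => buys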
Rule Constraints
Find the sales of which cheap items (where the sum
of prices is less than $100) may promote the sales of
which expensive items (where the minimum price is
$500) of the same group, for Chicago customers in
2004
mine associations as
lives_in(C, _, "Chicago") ∧ sales+(C, ?{I}, {S})
⇒ sales+(C, ?{J}, {T})
from sales
where S.year = 2004 and T.year = 2004 and I.group =
J.group
group by C, I.group
having sum(I.price) < 100 and min(J.price) >= 500
with support threshold = 1%
with confidence threshold = 1%
Rule Constraints
This looks for rules of the form:
lives_in(C, _, "Chicago") ∧ sales(C, ?I1, S1) ∧ ...
∧ sales(C, ?Ik, Sk)
⇒ sales(C, ?J1, T1) ∧ ... ∧ sales(C, ?Jm, Tm)
where I = {I1, ..., Ik}, S = {S1, ..., Sk},
J = {J1, ..., Jm}, T = {T1, ..., Tm}
It mines rules like
sales(C, "MS/Office", _) ⇒
sales(C, "MS/SQLServer", _) [1.5%, 68%]
Types of Constraints
Anti-monotone - if an itemset does not satisfy a
constraint, none of its supersets can satisfy the
constraint
Ex: sum(I.price) <= 100
avg(I.price) <= 100 is not anti-monotone
Monotone - if an itemset satisfies a constraint, all its
supersets also satisfy the constraint
Ex: sum(I.price) >= 100
Succinct
The sets that satisfy the constraint can be generated
directly, before support counting
Ex: min(J.price) >= 500
Convertible
Ordering the items may convert the constraint into an
anti-monotone or monotone one
Ex: avg(I.price) <= 100 becomes anti-monotone when
items are added in price-ascending order; a descending
order works for avg(I.price) >= v
Inconvertible
Ex: sum(S) θ v, where the elements of S may be
positive or negative (no item order helps)
A sketch contrasting sum and avg follows.
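A tiny illustration of why sum(I.price) <= 100 is anti-monotone while avg(I.price) <= 100 is not; the prices are assumptions:

from statistics import mean
price = {'pen': 5, 'book': 40, 'phone': 120}

def sum_ok(items):   # anti-monotone: once violated, stays violated
    return sum(price[i] for i in items) <= 100

def avg_ok(items):   # not anti-monotone: a cheap item can restore it
    return mean(price[i] for i in items) <= 100

print(sum_ok({'phone'}), sum_ok({'phone', 'pen'}))  # False False
print(avg_ok({'phone'}), avg_ok({'phone', 'pen'}))  # False True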
