
ASSOCIATION RULE MINING

Mining Frequent Itemsets Using the Vertical Data Format


Horizontal Data Format vs. Vertical Data Format
Vertical Data Format
Transform data from horizontal format (TID -> itemset)
to vertical format (item -> TID_set)
The support count of an itemset is the length of its TID_set
Intersect the TID_sets of every pair of frequent single
items to obtain the TID_sets of the 2-itemsets
Starting with k = 1, the frequent k-itemsets are used to
construct candidate (k+1)-itemsets
The Apriori property is exploited
No need to scan the database to find the support
of the (k+1)-itemsets: it is the length of the
intersected TID_set
To avoid long TID_sets, keep track of only the
difference (diffset) between the TID_set of a (k+1)-
itemset and that of a corresponding k-itemset

Ex: diffset({I1, I2},{I1}) = {T500, T700}
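A minimal Python sketch of the vertical-format computation; the TID_sets below are assumptions, chosen so that the diffset matches the example above:

# Vertical format: each item maps to the set of TIDs containing it.
vertical = {
    'I1': {'T100', 'T400', 'T500', 'T700', 'T800', 'T900'},
    'I2': {'T100', 'T200', 'T300', 'T400', 'T600', 'T800', 'T900'},
}
# Support count of {I1, I2} = size of the TID_set intersection.
tids_12 = vertical['I1'] & vertical['I2']
print(len(tids_12))                  # 4
# Diffset: TIDs that contain I1 but not {I1, I2}.
print(vertical['I1'] - tids_12)      # {'T500', 'T700'}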


Maximal and Closed Frequent Itemsets & Mining Closed
Frequent Itemsets:
Simple approach: mine the complete set of frequent
itemsets, then eliminate every frequent itemset that is a
proper subset of some other frequent itemset with the
same support
Better: directly mine closed frequent itemsets with
effective pruning strategies
Mining Closed Frequent Itemsets
Item Merging - If every transaction containing a
frequent itemset X also contains an itemset Y but
not any proper superset of Y, then X ∪ Y forms a
frequent closed itemset and there is no need to
search for any itemset containing X but not Y.


Ex: The projected database for the prefix {I5: 2} is
{{I2, I1}, {I2, I1, I3}}. Every transaction in it contains
the itemset {I2, I1}, but no proper superset of {I2, I1}
occurs in every transaction. So {I2, I1} can be merged
with {I5} to give the closed itemset {I5, I2, I1: 2}.
There is no need to mine for closed itemsets that
contain I5 but not {I2, I1}. A sketch of this check follows.
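A short sketch of the item-merging check, assuming the projected database above; intersecting its transactions yields the items that can be merged into the prefix:

# Projected database for the prefix {I5} (support 2).
projected = [{'I2', 'I1'}, {'I2', 'I1', 'I3'}]
# Items present in every transaction always co-occur with the
# prefix, so they merge with it into a closed itemset.
always = set.intersection(*projected)
print({'I5'} | always, ':', len(projected))  # {'I5', 'I2', 'I1'} : 2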
Mining Closed Frequent Itemsets
Sub-Itemset Pruning - If a frequent itemset X is a
proper subset of an already found frequent closed
itemset Y and support_count(X) = support_count(Y),
then X and all of X's descendants in the set
enumeration tree cannot be frequent closed itemsets
and thus can be pruned.
Ex: DB = {<a1, a2, ..., a100>, <a1, a2, ..., a50>},
min_sup = 2
Projecting on a1 gives {a1, a2, ..., a50 : 2} by
item merging
support({a2}) = support({a1, a2, ..., a50}) = 2 and
{a2} is a proper subset, so there is no need to examine
a2 and its projections (sketch below)
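A sketch of the sub-itemset pruning test, assuming the closed itemsets found so far are kept with their support counts:

# Closed itemsets already found, mapped to support counts.
found = {frozenset(f'a{i}' for i in range(1, 51)): 2}

def can_prune(candidate, support):
    # Prune if candidate is a proper subset of a found closed
    # itemset with the same support.
    return any(candidate < closed and support == sup
               for closed, sup in found.items())

print(can_prune(frozenset({'a2'}), 2))  # True: skip a2's subtree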
Mining Closed Frequent Itemsets
Item Skipping
In depth-first mining of closed itemsets, each
prefix itemset X is associated with a header table
and a projected database.
If a local frequent item p has the same support in
several header tables at different levels, one can
safely prune p from the header tables at the higher levels.
Ex: a2 has the same support in the global header table
and in a1's projection, so it can be pruned from the
global header table

Mining Closed Frequent Itemsets


Closure Checking
Check whether a candidate is a superset / subset of an
already found closed frequent itemset with the same
support
Superset Checking
Handled by Item Merging
Subset Checking
A pattern tree maintains the set of closed itemsets
mined so far (similar to an FP-tree)
An itemset Sc is subsumed by an already found
closed itemset Sa iff
Both have the same support
The length of Sc is smaller than that of Sa
All items in Sc are contained in Sa
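A simplified subset check; a flat list stands in for the pattern tree here, which is an assumption made to keep the sketch short:

# Closed itemsets mined so far, with support counts.
mined = [(frozenset({'I2', 'I1', 'I5'}), 2)]

def is_subsumed(sc, support):
    # Sc is subsumed if some Sa has equal support, greater
    # length, and contains every item of Sc.
    return any(support == sup and len(sc) < len(sa) and sc <= sa
               for sa, sup in mined)

print(is_subsumed(frozenset({'I1', 'I5'}), 2))  # True: not closed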
Multilevel Association Rules
Rules generated from association rule mining over
concept hierarchies
Levels of the hierarchy
Rules at too high a level may be common-sense knowledge
Some rules can only be discovered at higher levels
(items at low levels may lack sufficient support)
Uniform Support
The same minimum support threshold for all levels
Reduced Support
A reduced minimum support threshold at lower levels
Using Reduced Support
Level-by-level independent
Full-breadth search
No background knowledge is used for pruning
Level-cross filtering by single item
An item at the ith level is examined iff its parent
node at the (i-1)st level is frequent

Level-cross filtering by k-itemset
A k-itemset at the ith level is examined iff its
corresponding parent k-itemset at the (i-1)st level is
frequent
Group-based Support
The user sets group-specific minimum support thresholds
(e.g., a low threshold for item groups of interest)
A sketch of level-cross filtering by single item follows.
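A minimal sketch of level-cross filtering by single item; the hierarchy, supports, and thresholds below are assumptions for illustration:

# Concept hierarchy: child item -> parent at the level above.
parent = {'2% milk': 'milk', 'skim milk': 'milk',
          'wheat bread': 'bread'}
# Supports and per-level reduced minimum supports (assumed numbers).
support = {'milk': 0.10, 'bread': 0.08, '2% milk': 0.06,
           'skim milk': 0.02, 'wheat bread': 0.045}
min_sup = {1: 0.05, 2: 0.03}  # threshold reduced at the lower level

def examined(item, level):
    # A level-i item is examined iff its level-(i-1) parent is frequent.
    return level == 1 or support[parent[item]] >= min_sup[level - 1]

for item in ('2% milk', 'skim milk', 'wheat bread'):
    if examined(item, 2) and support[item] >= min_sup[2]:
        print(item, 'is frequent at level 2')  # 2% milk, wheat bread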

Redundant Multilevel Association Rules Filtering
Some rules may be redundant due to ancestor
relationships between items
milk ⇒ wheat bread [8%, 70%]
2% milk ⇒ wheat bread [2%, 72%]
The first rule is an ancestor of the second rule
A rule is redundant if its support and confidence are
close to their expected values, based on the rule's
ancestor. E.g., if about one quarter of milk sold is 2%
milk, the expected support of the second rule is
8% x 1/4 = 2%, exactly what is observed, so the second
rule carries no extra information.
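A sketch of the redundancy test; the one-quarter market share of 2% milk is the assumption that makes the arithmetic concrete:

# Ancestor rule: milk => wheat bread [support 8%, confidence 70%].
anc_sup, anc_conf = 0.08, 0.70
child_share = 0.25                     # assumed share of milk that is 2% milk
expected_sup = anc_sup * child_share   # 0.02
actual_sup, actual_conf = 0.02, 0.72
redundant = (abs(actual_sup - expected_sup) < 0.005
             and abs(actual_conf - anc_conf) < 0.05)
print(redundant)  # True: the descendant rule adds nothing new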
Multidimensional Association Rules
Single-dimensional rules
buys(X, "milk") ⇒ buys(X, "bread")
Multidimensional rules (2 or more dimensions/predicates)
Inter-dimension assoc. rules (no repeated
predicates)
age(X, "19-25") ∧ occupation(X, "student")
⇒ buys(X, "coke")
Hybrid-dimension assoc. rules (repeated
predicates)
age(X, "19-25") ∧ buys(X, "popcorn")
⇒ buys(X, "coke")

Categorical Attributes and Quantitative Attributes


Categorical Attributes
Finite number of possible values, no ordering
among values
Quantitative Attributes
Numeric, implicit ordering among values

Mining Quantitative Associations


Static discretization based on predefined concept
hierarchies
Dynamic discretization based on data distribution
Clustering: Distance-based association

Static Discretization of Quantitative Attributes


Attributes are discretized prior to mining using concept
hierarchies; numeric values are replaced by ranges.
In a relational database, finding all frequent k-predicate
sets requires k or k+1 table scans.
A data cube is well suited for this mining (faster)
Fully materialized cubes may already exist
The Apriori property prunes the search
Quantitative Association Rules
Numeric attributes are dynamically discretized so that
the confidence of the mined rules is maximized
2-D quantitative rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general
rules using a 2-D grid: ARCS (Association Rule
Clustering System)
Steps:
Binning
Equiwidth, equidepth, or homogeneity-based;
counts are kept in a 2-D array
Finding frequent predicate sets
Clustering the association rules
Grid-based technique: rectangular regions
(a binning sketch follows)
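A sketch of the binning step with equiwidth bins and a 2-D count array; the bin widths and data points are assumptions:

# Equiwidth binning of (age, income) pairs into a 2-D grid of counts.
AGE_BIN, INCOME_BIN = 5, 10_000
grid = {}
# Customers who bought a high-resolution TV (assumed data).
data = [(34, 35_000), (35, 38_000), (34, 42_000), (35, 45_000)]
for age, income in data:
    cell = (age // AGE_BIN, income // INCOME_BIN)
    grid[cell] = grid.get(cell, 0) + 1
# Adjacent occupied cells form a rectangular region -> one general rule.
print(grid)  # {(6, 3): 1, (7, 3): 1, (6, 4): 1, (7, 4): 1}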

Clustering Association Rules: Example

age(X, 34) ∧ income(X, "30-40K") ⇒ buys(X, "high
resolution TV")
age(X, 35) ∧ income(X, "30-40K") ⇒ buys(X, "high
resolution TV")
age(X, 34) ∧ income(X, "40-50K") ⇒ buys(X, "high
resolution TV")
age(X, 35) ∧ income(X, "40-50K") ⇒ buys(X, "high
resolution TV")
These occupy adjacent grid cells and cluster into one rule:
age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high
resolution TV")
Strong vs. Interesting
The user is the final judge
Misleading rule
buys(X, "computer games") ⇒ buys(X, "videos")
[40%, 66.7%]
Misleading because videos appear in 75% of all
transactions, more than the 66.7% confidence suggests
May have to use other measures


Correlation Analysis
The correlation between itemsets A and B is calculated
Lift: A is independent of B if P(A ∪ B) = P(A) P(B);
otherwise A and B are correlated
(here P(A ∪ B) denotes the probability that a
transaction contains both A and B)

lift(A, B) = P(A ∪ B) / (P(A) P(B))

If the value is < 1, A is negatively correlated with B
If the value is > 1, A and B are positively correlated
If the value is = 1, they are independent
Equivalently, lift(A ⇒ B) = conf(A ⇒ B) / sup(B)
Correlation Analysis
lift = P(Game and Video) / (P(Game) x P(Video)) =
0.89
Negative correlation
buys(X, "computer games") ⇒ ¬buys(X, "videos")
[20%, 33.3%] is more accurate, although with lower
support and confidence
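The counts implied by these percentages (10,000 transactions; 6,000 game buyers; 7,500 video buyers; 4,000 buying both) make the lift computation easy to check:

n, games, videos, both = 10_000, 6_000, 7_500, 4_000
p_a, p_b, p_ab = games / n, videos / n, both / n
print(round(p_ab / (p_a * p_b), 2))    # 0.89 -> negative correlation
# Equivalent form: confidence of the rule divided by support of B.
print(round((both / games) / p_b, 2))  # 0.89 again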

Correlation Analysis
χ² analysis
χ² = Σ (observed - expected)² / expected
For the (game, video) cell: expected = 10,000 x 0.60 x
0.75 = 4,500, giving (4000 - 4500)² / 4500 = 55.6;
summing over all four cells of the contingency table
gives χ² = 555.6
Because the χ² value is greater than 1 and the observed
count (4,000) is less than the expected count (4,500)
for (Video, Game), the two are negatively correlated
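A direct computation of the χ² statistic from the same contingency table:

n, p_game, p_video = 10_000, 0.60, 0.75
observed = {('g', 'v'): 4_000, ('g', 'nv'): 2_000,
            ('ng', 'v'): 3_500, ('ng', 'nv'): 500}
marg = {'g': p_game, 'ng': 1 - p_game,
        'v': p_video, 'nv': 1 - p_video}
# Sum (observed - expected)^2 / expected over all four cells.
chi2 = sum((obs - n * marg[a] * marg[b]) ** 2 / (n * marg[a] * marg[b])
           for (a, b), obs in observed.items())
print(round(chi2, 1))  # 555.6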

Correlation Analysis
All-Confidence
all_conf(X) = sup(X) / max_item_sup(X)
The maximum single-item support among the items
of X is used in the denominator
Equals the minimal confidence among the rules
i_j ⇒ X - {i_j}, where i_j ∈ X

Cosine Measure
cosine(A, B) = P(A ∪ B) / (P(A) x P(B))^(1/2)
Similar to lift, but the square root means it is
influenced only by the supports of A, B, and A ∪ B,
not by the total number of transactions
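Both measures evaluated on the game/video counts used above:

import math
n, games, videos, both = 10_000, 6_000, 7_500, 4_000
# All-confidence: support of X over its largest single-item support.
print(round((both / n) / max(games / n, videos / n), 3))  # 0.533
# Cosine: like lift but with a square root, so n cancels out.
print(round((both / n) / math.sqrt((games / n) * (videos / n)), 3))  # 0.596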

Comparison of Correlation Measures

All-confidence and the cosine measure are null-invariant
(unaffected by transactions containing neither A nor B)
These can be used first, followed by lift, χ², etc.
Correlation Analysis
All-Confidence
If a pattern is all-confident (meets the threshold),
then all its sub-patterns are also all-confident
This leads to Apriori-style pruning of patterns that
don't meet the all-confidence threshold
Correlation rules
Reduce the number of rules
Meaningful rules are discovered
A combination of measures can be used
Constraint-based Data Mining
Finding all the patterns in a database autonomously is
unrealistic
The patterns could be too many but not focused!
Data mining should be an interactive process
The user directs what is to be mined using a data mining
query language (or a graphical user interface)
Constraint-based mining
User flexibility: the user provides constraints on what
is to be mined
System optimization: the system exploits such
constraints for efficient mining
Constraints in Data Mining
Knowledge type constraint:
classification, association, etc.
Data constraint - using SQL-like queries
find product pairs sold together in stores in
Vancouver in Dec. '00
Dimension/level constraint
in relevance to region, price, brand, customer
category
Rule (or pattern) constraint
small sales (price < $10) triggers big sales (sum
> $200)
Interestingness constraint
min_confidence ≥ 60%
Meta-Rule-Guided Mining
Makes the mining process more effective and efficient
Users can specify the syntactic form of the rules of
interest
Example:
P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "educational
software")
Rule form (template):
P1 ∧ P2 ∧ ... ∧ Pl ⇒ Q1 ∧ Q2 ∧ ... ∧ Qr
with p = l + r predicates
Mining needs all frequent p-predicate sets, plus the
support counts of the l-predicate sets (for confidence)
A data cube search suits this computation
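A toy filter for the metarule above; rules are represented as (antecedent predicates, consequent predicate) pairs, and the candidates are assumptions for illustration:

candidates = [(['age', 'income'], 'buys'),
              (['age'], 'buys'),
              (['occupation', 'age'], 'buys')]

def matches_metarule(antecedents, consequent, l=2, target='buys'):
    # Keep rules shaped like P1(X,Y) ^ P2(X,W) => buys(X, ...).
    return len(antecedents) == l and consequent == target

for ante, cons in candidates:
    if matches_metarule(ante, cons):
        print(' ^ '.join(ante), '=>', cons)
# age ^ income => buys
# occupation ^ age => buys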
Rule Constraints
Find the sales of which cheap items (where the sum
of prices is less than $100) may promote the sales of
which expensive items (where the minimum price is
$500) of the same group, for Chicago customers in
2004
mine associations as
lives_in(C, _, "Chicago") ∧ sales+(C, ?{I}, {S})
⇒ sales+(C, ?{J}, {T})
from sales
where S.year = 2004 and T.year = 2004 and I.group =
J.group
group by C, I.group
having sum(I.price) < 100 and min(J.price) >= 500
with support threshold = 1%
with confidence threshold = 1%
Rule Constraints
This looks for rules of the form:
lives_in(C, _, "Chicago") ∧ sales(C, ?I1, S1) ∧ ...
∧ sales(C, ?Ik, Sk)
⇒ sales(C, ?J1, T1) ∧ ... ∧ sales(C, ?Jm, Tm)
where I = {I1, ..., Ik}, S = {S1, ..., Sk},
J = {J1, ..., Jm}, T = {T1, ..., Tm}
It mines rules like
sales(C, "MS/Office", _) ⇒
sales(C, "MS/SQLServer", _) [1.5%, 68%]
Types of Constraints
Anti-monotone - if an itemset does not satisfy a
constraint, none of its supersets can satisfy the
constraint
Ex: sum(I.price) <= 100
avg(I.price) <= 100 is not anti-monotone
Monotone - if an itemset satisfies a constraint, all its
supersets also satisfy the constraint
Ex: sum(I.price) >= 100
Succinct
The sets that satisfy the constraint can be generated
directly, before support counting
Ex: min(J.price) >= 500
Convertible
Ordering the items may convert the constraint into an
anti-monotone or monotone one
Ex: avg(I.price) <= 100 becomes anti-monotone when
items are added in price-ascending order; a descending
order works for avg(I.price) >= v
Inconvertible
Ex: sum(S) θ v, where the elements of S may be
positive or negative (no item order helps)
A sketch contrasting sum and avg follows.
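A tiny illustration of why sum(I.price) <= 100 is anti-monotone while avg(I.price) <= 100 is not; the prices are assumptions:

from statistics import mean
price = {'pen': 5, 'book': 40, 'phone': 120}

def sum_ok(items):   # anti-monotone: once violated, stays violated
    return sum(price[i] for i in items) <= 100

def avg_ok(items):   # not anti-monotone: a cheap item can restore it
    return mean(price[i] for i in items) <= 100

print(sum_ok({'phone'}), sum_ok({'phone', 'pen'}))  # False False
print(avg_ok({'phone'}), avg_ok({'phone', 'pen'}))  # False True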
