
Group 4

Advanced Pattern
Mining
Vu Manh Cam
Nguyen Quy Ky Nguyen
Luong Anh Tuan
Nguyen Kim Chinh

Outline

Pattern Pruning

Data Pruning

Pattern Fusion

Pattern Clustering

Pruning Pattern Space with Pattern Pruning Constraints

Basis: Apriori Property

All nonempty subsets of a frequent itemset must also be frequent

Pruning Pattern Space with Pattern Pruning Constraints

Categories of pattern mining constraints:

Anti-monotonic Constraints

Monotonic Constraints

Succinct Constraints

Convertible Constraints

Pruning Pattern Space with Pattern Pruning Constraints

Categories of pattern mining constraints:

Anti-monotonic Constraints

A constraint Ca is anti-monotonic if, for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca.

Example: sum(S.Price) <= value

Monotonic Constraints

A constraint Cm is monotonic if, for any pattern S satisfying Cm, every super-pattern of S also satisfies it.

Example : sum(S.Price) >= value
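The two checks above can be sketched in a few lines. This is a minimal illustration with a hypothetical price table, not part of the original slides:

```python
# Assumed item prices for illustration only.
price = {"a": 10, "b": 40, "c": 60}

def violates_anti_monotone(itemset, bound=100):
    # Anti-monotonic: sum(S.price) <= bound. Once a pattern violates it
    # (sum exceeds the bound), every super-pattern also violates it,
    # so the pattern can be pruned from the search space.
    return sum(price[i] for i in itemset) > bound

def satisfies_monotone(itemset, bound=100):
    # Monotonic: sum(S.price) >= bound. Once a pattern satisfies it,
    # every super-pattern satisfies it, so the constraint never needs
    # to be re-checked when the pattern grows.
    return sum(price[i] for i in itemset) >= bound
```

For example, {b, c} already satisfies the monotonic constraint ($100 total), so any extension of it will, too.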

Pruning Pattern Space with Pattern Pruning Constraints

Categories of pattern mining constraints:

Succinct Constraints

A constraint Cs is succinct if all and only those patterns that satisfy Cs can be precisely generated, even before support counting begins.

Example : min(S.Price) <= value
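For min(S.Price) <= value, a pattern satisfies the constraint exactly when it contains at least one sufficiently cheap item, so the satisfying patterns can be enumerated up front. A small sketch with assumed prices:

```python
from itertools import combinations

# Assumed item prices for illustration only.
price = {"a": 30, "b": 120, "c": 200}

def satisfying_patterns(bound=100):
    # min(S.price) <= bound holds iff S contains at least one item
    # priced <= bound, so the satisfying patterns are generated
    # directly, before any support counting.
    cheap = {i for i, p in price.items() if p <= bound}
    items = sorted(price)
    return [set(s) for r in range(1, len(items) + 1)
            for s in combinations(items, r) if cheap & set(s)]
```

With these prices only "a" is cheap, so every satisfying pattern must contain "a".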

Pruning Pattern Space with Pattern Pruning Constraints

Categories of pattern mining constraints:

Convertible Constraints

Constraints which are not anti-monotonic, monotonic, or succinct.

They become anti-monotonic or monotonic if the items in the pattern are arranged in a particular order.

Example:

If the items are arranged in ascending order, then avg(I.price) <= 100 is a convertible anti-monotonic constraint.

If the items are arranged in descending order, then avg(I.price) <= 100 is a convertible monotonic constraint.
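The reason the ascending order works can be seen numerically: extending a prefix of the price-sorted item list can only raise the running average, so once avg exceeds the bound it stays there. A small sketch with assumed prices:

```python
def running_averages(sorted_prices):
    # Prices are assumed already sorted in ascending order; the average
    # of each successive prefix is non-decreasing, which is what makes
    # avg(I.price) <= 100 behave anti-monotonically along this order.
    avgs, total = [], 0
    for k, p in enumerate(sorted_prices, start=1):
        total += p
        avgs.append(total / k)
    return avgs
```

For assumed prices [20, 80, 150, 260] the prefix averages are 20, 50, 83.3, 127.5: once the average passes 100, no further extension in this order can bring it back down.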

Pruning Data Space with Data Pruning Constraints

Data-space pruning

Prunes pieces of data if they will not contribute to the subsequent generation of satisfiable patterns in the mining process.

Two properties:

Data Succinctness

Data Anti-monotonicity

Data-Succinctness

A constraint is data-succinct if it can be used at the beginning of a pattern mining process

Example: all patterns must contain a digital camera

Any transaction that does not contain a digital camera can be pruned at the beginning of the mining process

Effectively reduces the data set to be examined.

Data Anti-monotonicity

A constraint is data-antimonotonic if, during the mining process, a data entry that cannot satisfy the constraint based on the current pattern can be pruned

Data Anti-monotonicity
Example

Constraint C1 (monotonic): sum(I.price) >= $100

Current itemset S: sum(S.price) = $50

Current transaction Ti: {i2.price = $5, i5.price = $10, i8.price = $20}

S extended with Ti cannot satisfy C1 ($50 + $35 = $85 < $100)

Ti can be pruned

Note that this technique cannot be applied once at the beginning of the mining process; the check must be repeated at each iteration.
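The per-iteration check above can be sketched as follows (a minimal sketch; the prices are those of the example):

```python
def can_contribute(transaction_prices, current_sum, bound=100):
    # Data anti-monotone pruning for a sum(I.price) >= bound constraint:
    # even in the best case, where the pattern absorbs every remaining
    # item of the transaction, the bound must still be reachable.
    # If not, the transaction is pruned for this branch of the search.
    return current_sum + sum(transaction_prices) >= bound
```

With the example's numbers, the transaction's $35 worth of items plus the current $50 cannot reach $100, so Ti is pruned.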

Data Anti-monotonicity
Example

Constraint C2 (anti-monotonic): sum(I.price) <= $100

If sum(S.price) > $100, S can be pruned.

Transaction Ti can be pruned, too

Prunes both pattern space and data space

More powerful than a monotonic constraint

Data Anti-monotonicity
Example

Constraint C3 (neither pattern monotonic nor anti-monotonic): avg(I.price) <= 10

C3 could be data anti-monotonic, depending on Ti

=> Ti could be pruned as well.

Note that data anti-monotonicity is confined to pattern growth-based algorithms

It cannot be used for pruning the data space if the Apriori algorithm is used.

Mining Colossal Patterns by Pattern Fusion

Data Mining: Concepts and Techniques

3/7/15


Introduction

Bioinformatics: DNA or microarray data analysis calls for mining colossal patterns (patterns of very large size)

Challenge: mining tends to get trapped by an explosive number of mid-sized patterns

For a dataset D[m, n] where n is very large but m is only on the order of 100 to 1,000, a new mining strategy is needed: Pattern-Fusion


Pattern- Fusion method

Traverses the tree in a bounded-breadth way

Keeps only a fixed number of patterns in a bounded-size candidate pool

Avoids the problem of an exponential search space

Designed to give an approximation to the colossal patterns


Core Patterns

Intuitively, for a pattern α, a subpattern β is a τ-core pattern of α if β shares a similar support set with α, i.e.,

|D_α| / |D_β| ≥ τ, where 0 < τ ≤ 1

and τ is called the core ratio

Robustness

A pattern α is (d, τ)-robust if d is the maximum number of items that can be removed from α such that the resulting pattern is still a τ-core pattern of α

Example: Core Patterns (τ = 0.5)

Transaction (# of Ts) | Core Patterns
(abe) (100)           | (abe), (ab), (be), (ae), (e)
(bcf) (100)           | (bcf), (bc), (bf)
(acf) (100)           | (acf), (ac), (af)
(abcef) (100)         | (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e), (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce), (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef)
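The core ratio for any pair in the table can be checked directly. A small sketch over the example transactions:

```python
# The example transaction database: (itemset, multiplicity) pairs.
db = [({"a", "b", "e"}, 100), ({"b", "c", "f"}, 100),
      ({"a", "c", "f"}, 100), ({"a", "b", "c", "e", "f"}, 100)]

def support(itemset):
    # |D_itemset|: number of transactions containing the itemset.
    return sum(n for t, n in db if itemset <= t)

def core_ratio(alpha, beta):
    # beta is a tau-core pattern of alpha when this ratio is >= tau.
    return support(alpha) / support(beta)
```

For instance, (ab) occurs in 200 transactions and (abcef) in 100, so (ab) is a core pattern of (abcef) at τ = 0.5 with ratio exactly 0.5.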


Idea of Pattern-Fusion Algorithm

Generate a complete set of frequent patterns up to a


small size

Randomly pick a pattern α; α has a high probability of being a core-descendant of some colossal pattern

Identify all of α's descendants in this complete set and merge them all. This generates a much larger core-descendant of α

In the same fashion, select K patterns. This set of larger core-descendants will be the candidate pool for the next iteration


Pattern-Fusion: The Algorithm

Initialization (Initial pool): Use an existing algorithm to mine all frequent patterns up to a small size, e.g., 3

Iteration (Iterative Pattern Fusion):

At each iteration, k seed patterns are randomly picked from the current pattern pool

For each seed pattern thus picked, we find all the patterns
within a bounding ball centered at the seed pattern

All these patterns found are fused together to generate a set of super-patterns. All the super-patterns thus generated form a new pool for the next iteration

Termination: when the current pool contains no more than K patterns at the beginning of an iteration
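One iteration of this loop can be sketched as follows. This is a much-simplified illustration: `in_ball` is an assumed predicate standing in for the real bounding ball, which is defined via the τ-core relation on support sets, and fusion is simplified to a plain union:

```python
import random

def fuse_once(pool, k, in_ball):
    # in_ball(seed, p) -> True if pattern p lies in the bounding ball
    # centered at the seed pattern (assumed predicate).
    seeds = random.sample(pool, min(k, len(pool)))
    next_pool = []
    for seed in seeds:
        # Gather every pool pattern inside the seed's ball ...
        ball = [p for p in pool if in_ball(seed, p)]
        # ... and fuse them into one larger super-pattern.
        fused = frozenset().union(*ball) if ball else seed
        next_pool.append(fused)
    return next_pool  # candidate pool for the next iteration
```

With seeds drawn at random, each iteration leaps from small patterns toward much larger core-descendants instead of enumerating every mid-sized pattern in between.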

Why Is Pattern-Fusion Efficient?

A bounded-breadth pattern tree traversal

Avoids the explosion in mining mid-sized patterns

Randomness helps to stay on the right path

Ability to identify short-cuts and take leaps

Fuses small patterns together in one step to generate new patterns of significant sizes

=> Efficiency

Mining Compressed Frequent-Pattern Sets by Pattern Clustering

Introduction

Frequent Pattern Mining

Minimum support: 2

Transactions: (a, b, c, d), (a, b, d, e), (b, e, f)

Frequent patterns:
(a) : 2, (b) : 3, (d) : 2, (e) : 2
(a, b) : 2, (a, d) : 2, (b, d) : 2, (b, e) : 2
(a, b, d) : 2

Compressing Frequent
Patterns

Our compressing framework

Clustering frequent patterns by pattern similarity

Pick a representative pattern for each cluster

Key Problems

Need a distance function to measure the similarity between patterns

The quality of the clustering needs to be controllable

The representative pattern should be able to describe both the expressions and the supports of the other patterns

Efficiency is always desirable


Distance Measure

Let P1 and P2 be two closed frequent patterns, and let T(P) be the set of raw data (transactions) containing P. The distance between P1 and P2 is:

D(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|

Let T(P1) = {t1, t2, t3, t4, t5} and T(P2) = {t1, t2, t3, t4, t6}; then D(P1, P2) = 1 − 4/6 = 1/3

D is a valid distance metric

D characterizes the support, but ignores the expression
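The distance is one line of code once the support sets are available (a minimal sketch using Python sets):

```python
def pattern_distance(t1, t2):
    # t1, t2: support sets T(P1), T(P2) as Python sets.
    # D = 1 - |intersection| / |union|, the Jaccard distance
    # between the two support sets.
    return 1 - len(t1 & t2) / len(t1 | t2)
```

On the example above this gives 1 − 4/6 = 1/3, and two patterns with identical support sets are at distance 0.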


Clustering Criterion

General clustering approaches (e.g., k-means):

Directly apply the distance measure

No guarantee on the quality of the clusters

The representative pattern may not exist in a cluster

δ-clustering

For each pattern P, find all patterns that can be expressed by P and whose distance to P is within δ (δ-cover)

All patterns in the cluster can be represented by P
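One simple way to turn δ-covers into clusters is a greedy pass that repeatedly picks the representative covering the most still-uncovered patterns. This is a sketch, not the paper's exact procedure; `covers(rep, p)` is an assumed predicate meaning "rep expresses p and D(rep, p) <= δ":

```python
def delta_cluster(patterns, covers):
    # Greedy set cover over the delta-covers: each chosen representative
    # claims every uncovered pattern it can express within distance delta.
    uncovered, clusters = set(patterns), []
    while uncovered:
        rep = max(patterns,
                  key=lambda r: sum(1 for p in uncovered if covers(r, p)))
        members = {p for p in uncovered if covers(rep, p)}
        clusters.append((rep, members))
        uncovered -= members
    return clusters
```

Since every pattern covers itself (distance 0), the loop always terminates, and each pattern ends up represented by exactly one chosen representative.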

