
DATA MINING

ALGORITHMS

SUSHIL KULKARNI

INTENTIONS

Define the classification problem as a mapping and illustrate it with examples.
What are the different techniques for classifying data into classes?
List the approaches to classification.
What are the common methods for defining classes? Give suitable examples.
What issues arise when classifying data?


CLASSIFICATION
PROBLEM

CLASSIFICATION PROBLEM
Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, ..., Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.
The mapping effectively divides D into equivalence classes.
Prediction is similar, but may be viewed as having an infinite number of classes.

CLASSIFICATION EXAMPLES
Teachers classify students' grades as A, B, C, D, or F.
Identify mushrooms as poisonous or
edible.
Predict when a river will flood.
Identify individuals with credit risks.
Speech recognition
Pattern recognition

CLASSIFICATION EXAMPLE:
MARKS
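If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
[Figure: decision tree that tests x against 90, 80, 70 and 60 in turn; the >= branches lead to leaves A, B, C and D, and the final < branch leads to F.]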
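The grade rules on this slide amount to a small classifier. A minimal Python sketch, assuming every mark below 60 maps to F so that no mark is left unclassified (the function name `grade` is illustrative):

```python
def grade(x: float) -> str:
    """Map a mark x to a letter grade by testing thresholds from the top down."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"  # assumption: everything below 60 is an F


marks = [95, 83, 70, 61, 42]
grades = [grade(x) for x in marks]
print(grades)
```

Each `if` plays the role of one internal node of the tree, and each `return` is a leaf.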

CLASSIFICATION EXAMPLE
Letter Recognition
View letters as constructed from 5 components:

[Figure: letters A, B, C, D, E and F, each drawn from the five components.]

CLASSIFICATION
TECHNIQUES

CLASSIFICATION
TECHNIQUES
Regression
Distance
Decision Trees
Rules
Neural Networks

CLASSIFICATION TECHNIQUES
Approach:
Create a specific model by evaluating training data (or using domain experts' knowledge).
Apply the model developed to new data.

CLASSIFICATION TECHNIQUES
Classes must be predefined
Most common techniques use DTs, NNs,
or are based on distances or statistical
methods.


DEFINE CLASSES
Distance Based

Partitioning Based


ISSUES IN CLASSIFICATION
Missing Data
1. Ignore
2. Replace with assumed value
Measuring Performance
1. Classification accuracy on test data
2. Confusion matrix
3. OC Curve

INTENTIONS

How can classification performance be measured?
Explain the Operating Characteristic curve.
Define the confusion matrix.
How is regression used to classify data?
What are the two approaches to classification using regression?
How is correlation used in the classification of data?
What is Bayes' theorem? Explain with an example.

PERFORMANCE
MEASURE

HEIGHT EXAMPLE DATA


MEASURING PERFORMANCE IN
CLASSIFICATION
Let Cj be a specific class and ti a database tuple. The tuple may or may not be assigned to Cj, and its actual membership may or may not be in that class. This gives four cases:

MEASURING PERFORMANCE IN
CLASSIFICATION

1. True Positive: ti is predicted to be in Cj and is actually in it.
2. False Positive: ti is predicted to be in Cj but is not actually in it.
3. True Negative: ti is not predicted to be in Cj and is not actually in it.
4. False Negative: ti is not predicted to be in Cj but is actually in it.
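A minimal sketch of how the four outcomes can be counted for one class Cj. The function name `outcome_counts` and the small label lists are made up for illustration; they are not the height example data.

```python
def outcome_counts(actual, predicted, cj):
    """Count TP, FP, TN and FN for class cj from paired actual/predicted labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == cj:  # tuple was predicted to be in cj
            counts["TP" if a == cj else "FP"] += 1
        else:        # tuple was not predicted to be in cj
            counts["FN" if a == cj else "TN"] += 1
    return counts


# Hypothetical labels, for illustration only.
actual = ["tall", "tall", "medium", "short", "medium", "tall"]
predicted = ["tall", "medium", "medium", "short", "tall", "tall"]
counts = outcome_counts(actual, predicted, "tall")
print(counts)
```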

CLASSIFICATION
PERFORMANCE

[Figure: the four outcome regions: True Positive, False Negative, False Positive, True Negative.]

OPERATING CHARACTERISTIC
CURVE
It shows the relationship between false positives and true positives.
The OC curve was originally used to examine false alarm rates.

OPERATING CHARACTERISTIC
CURVE


CONFUSION MATRIX
It illustrates the accuracy of the solution to a classification problem.
Definition:
Given m classes, a confusion matrix is an m by m matrix where entry (i, j) indicates the number of tuples from D that were assigned to class Cj but whose correct class is Ci.
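The definition can be sketched directly: row i is the correct class, column j is the assigned class. The function name and the labels are illustrative assumptions, not the height example data.

```python
def confusion_matrix(actual, predicted, classes):
    """Return an m-by-m matrix; entry [i][j] counts tuples whose correct
    class is classes[i] and whose assigned class is classes[j]."""
    index = {c: k for k, c in enumerate(classes)}
    m = len(classes)
    matrix = [[0] * m for _ in range(m)]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


classes = ["short", "medium", "tall"]
actual = ["tall", "tall", "medium", "short", "medium", "tall"]      # hypothetical
predicted = ["tall", "medium", "medium", "short", "tall", "tall"]   # hypothetical
matrix = confusion_matrix(actual, predicted, classes)

# Classification accuracy is the diagonal (correct assignments) over the total.
accuracy = sum(matrix[k][k] for k in range(len(classes))) / len(actual)
print(matrix, accuracy)
```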

CONFUSION MATRIX
EXAMPLE
Using the height example data, with Output1 as the correct assignment and Output2 as the actual assignment.

STATISTICAL
BASED
ALGORITHMS

REGRESSION
Assume the data fits a predefined function.
Determine the best values for the regression coefficients c0, c1, ..., cn.
Linear Regression:
y = c0 + c1 x1 + ... + cn xn
Assume an error: y = c0 + c1 x1 + ... + cn xn + e
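For the one-variable case y = c0 + c1 x, the "best" coefficients are commonly taken to be the least-squares ones. A minimal sketch under that assumption (the function name `fit_line` and the data points are illustrative):

```python
def fit_line(xs, ys):
    """Least-squares estimates of c0 and c1 for y = c0 + c1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx              # slope
    c0 = mean_y - c1 * mean_x   # intercept
    return c0, c1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up points lying exactly on y = 1 + 2x
c0, c1 = fit_line(xs, ys)
print(c0, c1)
```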

Linear Regression


LINEAR REGRESSION : Poor Fit


CLASSIFICATION USING
REGRESSION
Division: Use regression function to
divide area into regions.
Prediction: Use regression function to
predict a class membership function.
Input includes desired class.


DIVISION


PREDICTION


CORRELATION
Examine the degree to which the
values for two variables behave
similarly.
Correlation coefficient r:
1 = perfect correlation
-1 = perfect but opposite correlation
0 = no correlation
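The slide does not give a formula for r. Assuming the coefficient meant here is Pearson's r, a minimal sketch (the function name and the data are illustrative):

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r_same = correlation([1, 2, 3], [2, 4, 6])      # values move together
r_opposite = correlation([1, 2, 3], [6, 4, 2])  # values move in opposite directions
print(r_same, r_opposite)
```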

BAYES THEOREM
Posterior Probability: P(h1|xi)
Prior Probability: P(h1)
Bayes Theorem: P(h1|xi) = P(xi|h1) P(h1) / P(xi)
Assign probabilities of hypotheses given a data value.

BAYES THEOREM EXAMPLE


Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police
Assign twelve data values for all combinations of credit and income.
From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.
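A sketch of how the priors combine with the likelihood of one data value x to give posteriors. The priors are the ones stated on this slide; the likelihoods P(x|h) are hypothetical numbers chosen only for illustration, since the training table is not reproduced here.

```python
priors = {"h1": 0.60, "h2": 0.20, "h3": 0.10, "h4": 0.10}      # from the slide
likelihood = {"h1": 0.10, "h2": 0.05, "h3": 0.30, "h4": 0.20}  # hypothetical P(x | h)


def posteriors(priors, likelihood):
    """Apply Bayes' theorem: P(h | x) = P(x | h) P(h) / P(x)."""
    # P(x) is the sum of P(x | h) P(h) over all hypotheses.
    evidence = sum(priors[h] * likelihood[h] for h in priors)
    return {h: priors[h] * likelihood[h] / evidence for h in priors}


post = posteriors(priors, likelihood)
best = max(post, key=post.get)  # hypothesis with the highest posterior
print(post, best)
```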

Bayes Example(contd)
Training Data:


INTENTIONS

Explain the different distance-based algorithms.
Explain similarity measures between data using distances.
How is distance useful in the classification of data?
Explain KNN in detail.

DISTANCE BASED
ALGORITHM

SIMILARITY MEASURES
Determine similarity between two objects.
Alternatively, distance measures measure how unlike or dissimilar objects are.

SIMILARITY MEASURES
Similarity characteristics:

Sim(ti, ti) = 1: complete similarity
Sim(ti, tj) = 0: no similarity

SIMILARITY MEASURES


CLASSIFICATION USING
DISTANCE
Place items in class to which they are
closest


CLASSIFICATION USING
DISTANCE
Must determine distance between an item
and a class.


DISTANCE MEASURES
Measure dissimilarity between objects
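No formulas are given on this slide. Two widely used distance measures, Euclidean and Manhattan, are sketched here as an assumption about the kind of measure meant:

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
print(d_euclid, d_manhattan)
```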


CLASSIFICATION USING
DISTANCE
Classes represented by
1. Centroid: Central value.
2. Medoid: Representative point.
3. Individual points
Algorithm: KNN

K-NEAREST NEIGHBOUR (KNN)


The training set includes classes.
Examine the K items nearest to the item to be classified.
The new item is placed in the class with the largest number of close items.
O(q) for each tuple to be classified. (Here q is the size of the training set.)
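A minimal KNN sketch, assuming Euclidean distance and a simple majority vote. The function name and the training points are illustrative. It scans all q training tuples once per item to be classified.

```python
import heapq
import math
from collections import Counter


def knn_classify(training, item, k):
    """training is a list of (point, label) pairs; return the majority
    label among the k training points closest to item."""
    nearest = heapq.nsmallest(
        k, training, key=lambda pair: math.dist(pair[0], item)
    )
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training set.
training = [
    ((1.0, 1.0), "short"),
    ((1.2, 0.8), "short"),
    ((3.0, 3.2), "tall"),
    ((3.1, 2.9), "tall"),
    ((2.9, 3.0), "tall"),
]
label_a = knn_classify(training, (1.1, 1.0), k=3)
label_b = knn_classify(training, (3.0, 3.0), k=3)
print(label_a, label_b)
```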

KNN


KNN ALGORITHM


DECISION TREE

DECISION TREE
Tree where the root and each internal
node is labeled with a question.
The arcs represent each possible
answer to the associated question.
Each leaf node represents a
prediction of a solution to the
problem.

DECISION TREE
Popular technique for classification;
Leaf node indicates class to which the
corresponding tuple belongs.


DECISION TREE: Example


DECISION TREE
Given:
D = {t1, ..., tn} where ti = <ti1, ..., tih>
Database schema contains {A1, A2, ..., Ah}
Classes C = {C1, ..., Cm}

DECISION TREE
A Decision or Classification Tree is a tree associated with D such that
Each internal node is labeled with an attribute, Ai
Each arc is labeled with a predicate which can be applied to the attribute at the parent
Each leaf node is labeled with a class, Cj

DECISION TREES MODEL


A Decision Tree Model is a
computational model consisting of
three parts:
Decision Tree
Algorithm to create the tree
Algorithm that applies the tree to
data

DECISION TREES MODEL


Creation of the tree is the most
difficult part.
Processing is basically a search
similar to that in a binary search tree
(although DT may not be binary).
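Applying a tree to a tuple can be sketched as a walk from the root to a leaf. The `Node` class, the attribute names and the height thresholds are hypothetical, chosen only to show the search.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Node:
    label: str  # attribute name for an internal node, class name for a leaf
    arcs: List[Tuple[Callable[[Any], bool], "Node"]] = field(default_factory=list)


def classify(root: Node, t: Dict[str, Any]) -> str:
    """Follow the arc whose predicate holds until a leaf is reached."""
    node = root
    while node.arcs:  # a leaf has no outgoing arcs
        value = t[node.label]
        for predicate, child in node.arcs:
            if predicate(value):
                node = child
                break
        else:
            raise ValueError(f"no arc matches {node.label}={value!r}")
    return node.label


# Hypothetical tree: split on gender first, then on height (metres).
tree = Node("gender", [
    (lambda g: g == "F", Node("height", [
        (lambda h: h < 1.6, Node("short")),
        (lambda h: h >= 1.6, Node("tall")),
    ])),
    (lambda g: g == "M", Node("height", [
        (lambda h: h < 1.8, Node("short")),
        (lambda h: h >= 1.8, Node("tall")),
    ])),
])
result_f = classify(tree, {"gender": "F", "height": 1.7})
result_m = classify(tree, {"gender": "M", "height": 1.7})
print(result_f, result_m)
```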

Decision Tree Algorithm


DECISION TREE:
ADVANTAGES
Easy to understand.
Easy to generate rules.

DECISION TREE:
DISADVANTAGES
May suffer from overfitting.
Classifies by rectangular partitioning.
Does not easily handle nonnumeric data.
Can be quite large, so pruning is necessary.

CLASSIFICATION USING
DECISION TREE
Partitioning based: Divide search
space into rectangular regions.
Tuple placed into class based on the
region within which it falls.
DT approaches differ in how the tree is
built: DT Induction

CLASSIFICATION USING
DECISION TREE
Internal nodes associated with attribute
and arcs with values for that attribute.
Algorithms: ID3, C4.5, CART


DT INDUCTION


DT SPLIT AREA

[Figure: split areas, with Gender (M, F) on one axis and Height on the other.]

COMPARING DTs

[Figure: a balanced tree compared with a deep tree.]

DT ISSUES
Choosing Splitting Attributes
Ordering of Splitting Attributes
Splits
Tree Structure
Stopping Criteria
Training Data

Decision Tree Induction is often based on Information Theory.

INFORMATION


DT INDUCTION
When all the marbles in the bowl are
mixed up, little information is given.
When the marbles in the bowl are all
from one class and those in the other
two classes are on either side, more
information is given.
Use this approach with DT Induction !
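One common way to put a number on the marble intuition is entropy, the measure ID3 uses: it is highest when the classes are evenly mixed and zero when a group holds a single class, so a split that lowers entropy leaves less uncertainty. A minimal sketch (the marble colours are illustrative):

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the class distribution in labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


mixed = ["red", "blue", "red", "blue"]  # classes evenly mixed
pure = ["red", "red", "red", "red"]     # one class only
h_mixed = entropy(mixed)
h_pure = entropy(pure)
print(h_mixed, h_pure)
```

A tree-induction algorithm in this style would prefer the split whose resulting groups have the lowest weighted entropy.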

ARTIFICIAL NEURAL
NETWORK (ANN)
An ANN is an information processing paradigm that is inspired by the way the brain processes information.
Composed of a large number of highly interconnected processing elements called neurones.
ANNs, like people, learn by example.

ARTIFICIAL NEURAL
NETWORK (ANN)
Learning in biological systems involves
adjustments to the synaptic connections
that exist between the neurones.
This is true of ANNs as well.


HOW HUMAN BRAIN LEARNS?


HOW HUMAN BRAIN LEARNS?


In the human brain, a typical neuron collects signals from others through a host of fine structures called dendrites.
The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches.

HOW HUMAN BRAIN LEARNS?


At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurones.
Learning occurs by changing the effectiveness of the synapses so that the influence of one neuron on another changes.

A SIMPLE NEURON


NEURAL NETWORKS
Based on the observed functioning of the human brain (Artificial Neural Networks, ANN).
The first artificial neuron was produced in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts.

NEURAL NETWORKS
Our view of neural networks is very
simplistic.
We view a neural network (NN) from a
graphical viewpoint.
Used in pattern recognition, speech
recognition, computer vision, and
classification.

NEURAL NETWORKS: Example


NEURAL NETWORKS
It is a directed graph F = <V, A> with vertices V = {1, 2, ..., n} and arcs A = {<i, j> | 1 <= i, j <= n}, with the following restrictions:
V is partitioned into a set of input nodes, VI, hidden nodes, VH, and output nodes, VO.
The vertices are also partitioned into layers.

NEURAL NETWORKS
Any arc <i,j> must have node i in layer
h-1 and node j in layer h.
Arc <i,j> is labeled with a numeric value
wij.
Node i is labeled with a function fi.
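A sketch of how a tuple can be propagated through such a layered graph. It assumes every node applies a sigmoid to the weighted sum of its inputs; the slide only says each node carries some function fi, so the activation, the function names and the weights are illustrative.

```python
import math


def sigmoid(s):
    """An assumed node function; other activation functions are possible."""
    return 1.0 / (1.0 + math.exp(-s))


def propagate(inputs, layers):
    """layers is a list of weight matrices, one per non-input layer.
    Row j of a matrix holds the weights on the arcs entering node j
    of that layer from the nodes of the previous layer."""
    values = list(inputs)
    for weights in layers:
        values = [
            sigmoid(sum(w * v for w, v in zip(row, values)))
            for row in weights
        ]
    return values


# Hypothetical network: 2 input nodes, 2 hidden nodes, 1 output node.
layers = [
    [[0.5, -0.5], [1.0, 1.0]],  # arcs from input layer to hidden layer
    [[1.0, -1.0]],              # arcs from hidden layer to output layer
]
output = propagate([1.0, 1.0], layers)
print(output)
```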


NN NODE


NEURAL NETWORK MODEL


It is a computational model consisting of three parts:
Neural Network graph
Learning algorithm that indicates how learning takes place.
Recall techniques that determine how information is obtained from the network.

NN : Advantages
Learning
Can continue learning even after training
set has been applied.
Easy parallelization
Solves many problems


NN: Disadvantages
Difficult to understand
May suffer from overfitting
Structure of graph must be determined a
priori.
Input values must be numeric.
Verification difficult.


CLASSIFICATION USING
NEURAL NETWORKS
Typical NN structure for classification:
1. One output node per class
2. Output value is the class membership function value
Supervised learning

CLASSIFICATION USING
NEURAL NETWORKS
For each tuple in training set, propagate it
through NN. Adjust weights on edges to
improve future classification.
Algorithms: Propagation, Back
propagation, Gradient Descent

NN ISSUES
Number of source nodes
Number of hidden layers
Training data
Number of sinks
Interconnections

NN ISSUES
Weights
Activation Functions
Learning Technique
When to stop learning

DECISION TREE VS. NEURAL NETWORK

PROPAGATION
[Figure: a tuple enters at the input nodes and the result is read at the output nodes.]

NN PROPAGATION ALGORITHM

PROPAGATION EXAMPLE

RULES

CLASSIFICATION USING RULES


Perform classification using If-Then rules.
Classification Rule: r = <a, c>, where a is the antecedent and c is the consequent.
May be generated from other techniques (DT, NN) or generated directly.
Algorithms: Gen, RX, 1R, PRISM
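A rule r = <a, c> can be sketched as a pair of a predicate and a class. The rules below restate the marks example (with the assumption that anything below 60 is an F); because the antecedents do not overlap, the order in which the rules are tried does not matter.

```python
# Each rule is (antecedent, consequent); the antecedent is a predicate on the mark.
rules = [
    (lambda x: x >= 90, "A"),
    (lambda x: 80 <= x < 90, "B"),
    (lambda x: 70 <= x < 80, "C"),
    (lambda x: 60 <= x < 70, "D"),
    (lambda x: x < 60, "F"),  # assumption: every mark below 60 is an F
]


def apply_rules(x, rules):
    """Return the consequent of the first rule whose antecedent holds."""
    for antecedent, consequent in rules:
        if antecedent(x):
            return consequent
    return None  # no rule fired


in_order = apply_rules(85, rules)
reversed_order = apply_rules(85, list(reversed(rules)))
print(in_order, reversed_order)
```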

GENERATING RULES FROM DTs

GENERATING RULES EXAMPLE


GENERATING RULES FROM NNs


1R ALGORITHM


1R EXAMPLE


PRISM ALGORITHM


PRISM EXAMPLE


DECISION TREE VS. RULES


Tree:
Has an implied order in which splitting is performed.
Is created based on looking at all classes.
Rules:
Have no ordering of predicates.
Only need to look at one class to generate its rules.

INTENTIONS

Clustering Examples
Segment customer database based on
similar buying patterns.
Group houses in a town into
neighborhoods based on similar features.
Identify new plant species
Identify similar Web usage patterns


CLUSTERING

CLUSTERING : Example


CLUSTERING HOUSES

[Figure: the same houses clustered two ways: geographic-distance based and size based.]

Clustering vs. Classification


No prior knowledge of:
Number of clusters
Meaning of clusters
Unsupervised learning

Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability

Impact of Outliers on
Clustering


Clustering Problem
Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k.
A Cluster, Kj, contains precisely those tuples mapped to it.
Unlike the classification problem, clusters are not known a priori.

Types of Clustering
Hierarchical: nested set of clusters created.
Partitional: one set of clusters created.
Incremental: each element handled one at a time.
Simultaneous: all elements handled together.
Overlapping/Non-overlapping

Clustering Approaches
Clustering
Hierarchical: Agglomerative, Divisive
Partitional
Categorical
Large DB: Sampling, Compression

Cluster Parameters


Distance Between Clusters


Single Link: smallest distance between points
Complete Link: largest distance between
points
Average Link: average distance between
points
Centroid: distance between centroids
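The four inter-cluster distances can be sketched as follows, assuming Euclidean distance between points. The function names and the two small clusters are illustrative.

```python
import math
from itertools import product


def single_link(a, b):
    """Smallest distance between a point of a and a point of b."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Largest distance between a point of a and a point of b."""
    return max(math.dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Average distance over all pairs with one point from each cluster."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def centroid_distance(a, b):
    return math.dist(centroid(a), centroid(b))


# Hypothetical clusters.
ka = [(0, 0), (0, 2)]
kb = [(3, 0), (3, 2)]
print(single_link(ka, kb), complete_link(ka, kb),
      average_link(ka, kb), centroid_distance(ka, kb))
```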


Hierarchical Clustering
Clusters are created in levels, actually creating sets of clusters at each level.
Agglomerative
Initially each item is in its own cluster
Iteratively clusters are merged together
Bottom Up
Divisive
Initially all items are in one cluster
Large clusters are successively divided
Top Down

Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link


Dendrogram
Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
Each level shows clusters for that level.
Leaf: individual clusters
Root: one cluster
A cluster at level i is the union of its children clusters at level i+1.

Agglomerative Example
[Figure: agglomerative clustering of items A, B, C, D and E, with a dendrogram showing the merges at distance thresholds 1 through 5.]

MST Example
[Figure: minimum spanning tree over items A, B, C, D and E.]

Agglomerative Algorithm


Single Link
View all items with links (distances)
between them.
Finds maximal connected components
in this graph.
Two clusters are merged if there is at
least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.
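One level of single-link clustering can be sketched as follows: keep merging any two clusters joined by at least one link no longer than the threshold, which yields the connected components at that threshold. The function name and the one-dimensional items are illustrative; items are assumed to be distinct and hashable.

```python
def single_link_clusters(items, dist, threshold):
    """Return the clusters (as sets) obtained at one threshold distance."""
    clusters = [{x} for x in items]  # start with each item in its own cluster

    def find_pair():
        # Look for two clusters connected by at least one short-enough edge.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if any(dist(x, y) <= threshold
                       for x in clusters[a] for y in clusters[b]):
                    return a, b
        return None

    while (pair := find_pair()) is not None:
        a, b = pair
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters


items = [1, 2, 4, 8, 9]  # hypothetical one-dimensional items


def gap(x, y):
    return abs(x - y)


for level in (1, 2, 4):
    print(level, sorted(sorted(c) for c in single_link_clusters(items, gap, level)))
```

Raising the threshold level by level gives the nested sets of clusters that a dendrogram displays.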

MST Single Link Algorithm


Single Link Clustering


Partitional Clustering
Nonhierarchical
Creates clusters in one step as opposed
to several steps.
Since only one set of clusters is output,
the user normally has to input the desired
number of clusters, k.
Usually deals with static sets.

Partitional Algorithms
MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA

K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets
of clusters until the desired set is
reached.
High degree of similarity among
elements in a cluster is obtained.
Given a cluster Ki = {ti1, ti2, ..., tim}, the cluster mean is mi = (1/m)(ti1 + ... + tim)

K-Means Example
Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1=3,m2=4
K1={2,3}, K2={4,10,12,20,30,11,25},
m1=2.5,m2=16
K1={2,3,4},K2={10,12,20,30,11,25},
m1=3,m2=18
K1={2,3,4,10},K2={12,20,30,11,25},
m1=4.75,m2=19.6
K1={2,3,4,10,11,12},K2={20,30,25},
m1=7,m2=25
Stop as the clusters with these means are the
same.
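The trace above can be reproduced with a short one-dimensional K-means sketch. It starts from the same means, m1 = 3 and m2 = 4, and stops when the means no longer change. The function name is illustrative, and the sketch assumes no cluster ever becomes empty.

```python
def k_means_1d(data, means):
    """Repeat: assign each value to its nearest mean, then recompute the
    means. Stop when the means are unchanged."""
    while True:
        clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda j: abs(x - means[j]))
            clusters[nearest].append(x)
        # Assumes every cluster received at least one value.
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:
            return clusters, means
        means = new_means


data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = k_means_1d(data, [3, 4])
print(clusters, means)
```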

K-Means Algorithm


Nearest Neighbor
Items are iteratively merged into the
existing clusters that are closest.
Incremental
Threshold, t, used to determine if items
are added to existing clusters or a new
cluster is created.
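A sketch of this incremental scheme, assuming the distance from an item to a cluster is the distance to that cluster's closest member. The function name and the data are illustrative.

```python
def nearest_neighbor_clusters(items, dist, t):
    """Handle items one at a time: join the closest existing cluster if it
    is within threshold t, otherwise start a new cluster."""
    clusters = []
    for item in items:
        best, best_d = None, None
        for cluster in clusters:
            d = min(dist(item, member) for member in cluster)
            if best_d is None or d < best_d:
                best, best_d = cluster, d
        if best is not None and best_d <= t:
            best.append(item)
        else:
            clusters.append([item])
    return clusters


result = nearest_neighbor_clusters([1, 2, 5, 6, 10], lambda x, y: abs(x - y), t=2)
print(result)
```

Because items are handled in arrival order, a different ordering of the same items can give different clusters.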


Nearest Neighbor Algorithm


Clustering Large Databases


Most clustering algorithms assume a large
data structure which is memory resident.
Clustering may be performed first on a
sample of the database then applied to the
entire database.
Algorithms
BIRCH
DBSCAN
CURE

Desired Features for Large Databases
One scan (or less) of DB
Online
Suspendable, stoppable, resumable
Incremental
Work with limited main memory
Different techniques to scan (e.g.
sampling)
Process each tuple once

BIRCH
Balanced Iterative Reducing and
Clustering using Hierarchies
Incremental, hierarchical, one scan
Save clustering information in a tree
Each entry in the tree contains
information about one cluster
New nodes inserted in closest entry in
tree

Clustering Feature
CF Triple: (N, LS, SS)
N: Number of points in cluster
LS: Sum of points in the cluster
SS: Sum of squares of points in the cluster
CF Tree
Balanced search tree
Node has CF triple for each child
Leaf node represents cluster and has CF value
for each subcluster in it.
Subcluster has maximum diameter
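A useful property of the CF triple is that two subclusters can be merged by adding their triples component-wise, without revisiting the points. A minimal sketch, assuming SS is kept as a single scalar (the sum of squared coordinates over all points); the class and method names are illustrative.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CF:
    n: int                  # N: number of points in the cluster
    ls: Tuple[float, ...]   # LS: coordinate-wise sum of the points
    ss: float               # SS: sum of squared coordinates (scalar, assumed)

    @classmethod
    def of_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        """CF of the union of two disjoint subclusters."""
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(x / self.n for x in self.ls)


cf = CF.of_point((1, 2)).merge(CF.of_point((3, 4)))
print(cf, cf.centroid())
```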

CURE
Clustering Using Representatives
Use many points to represent a cluster
instead of only one
Points will be well scattered


CURE Approach


CURE Algorithm


CURE for Large Databases


ASSOCIATION
RULES

Association Rules Outline


Goal: Provide an overview of basic Association Rule mining techniques
Association Rules Problem Overview
Large itemsets
Association Rules Algorithms
Apriori
Sampling
Partitioning
Parallel Algorithms
Comparing Techniques
Incremental Algorithm
Advanced AR Techniques

Example: Market Basket Data


Items frequently purchased together: Bread ⇒ Butter
Uses:
Placement
Advertising
Sales
Coupons
Objective: increase sales and reduce costs

Association Rule Definitions


Set of items: I = {I1, I2, ..., Im}
Transactions: D = {t1, t2, ..., tn}, tj ⊆ I
Itemset: {Ii1, Ii2, ..., Iik} ⊆ I
Support of an itemset: Percentage of transactions which contain that itemset.
Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.

Association Rules Example

I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%

Association Rule Definitions


Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
Support of AR (s) X ⇒ Y: Percentage of transactions that contain X ∪ Y
Confidence of AR (α) X ⇒ Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X
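Support and confidence can be computed directly from these definitions. The five transactions below are an assumption: the example's transaction table is not reproduced in this text, so a table consistent with the stated 60% support of {Bread, PeanutButter} is used.

```python
# Assumed transaction database (illustrative; not taken from the slide).
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    return count(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X => Y."""
    return count(x | y, transactions) / count(x, transactions)


s = support({"Bread", "PeanutButter"}, transactions)
c = confidence({"Bread"}, {"PeanutButter"}, transactions)
print(s, c)
```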

Association Rules Ex (contd)


Association Rule Problem


Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn} where ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
Link Analysis
NOTE: Support of X ⇒ Y is the same as support of X ∪ Y.

Association Rule Techniques


1. Find Large Itemsets.
2. Generate rules from frequent
itemsets.


Algorithm to Generate ARs


Apriori
Large Itemset Property:
Any subset of a large itemset is large.
Contrapositive:
If an itemset is not large, none of its
supersets are large.


Large Itemset Property


Apriori Ex (contd)

s = 30%, α = 50%

Apriori Algorithm
1. C1 = Itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5.    i = i + 1;
6.    Ci = Apriori-Gen(Li-1);
7.    Count Ci to determine Li;
8. until no more large itemsets found;

Apriori-Gen
Generate candidates of size i+1 from large
itemsets of size i.
Approach used: join large itemsets of size
i if they agree on i-1
May also prune candidates who have
subsets that are not large.
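A compact sketch of Apriori with an Apriori-Gen step that joins and then prunes. The transactions are an assumed five-row table (the example's data is not reproduced in this text) and the minimum support of 30% follows the example's setting; function names are illustrative.

```python
from itertools import combinations


def apriori_gen(large, size):
    """Join large itemsets of the given size that agree on size-1 items,
    then prune candidates having a subset that is not large."""
    candidates = set()
    for a in large:
        for b in large:
            union = a | b
            if len(union) == size + 1 and all(
                frozenset(s) in large for s in combinations(union, size)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, min_support):
    n = len(transactions)

    def keep_large(candidates):
        # One scan of the transactions per level to count the candidates.
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= min_support}

    level = keep_large({frozenset([i]) for t in transactions for i in t})
    result, size = set(level), 1
    while level:
        level = keep_large(apriori_gen(level, size))
        result |= level
        size += 1
    return result


# Assumed transaction database (illustrative).
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]
large_itemsets = apriori(transactions, min_support=0.3)
print(sorted(sorted(s) for s in large_itemsets))
```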


Apriori-Gen Example


Apriori-Gen Example (contd)


Apriori Adv/Disadv
Advantages:
Uses large itemset property.
Easily parallelized
Easy to implement.

Disadvantages:
Assumes transaction database is memory
resident.
Requires up to m database scans.

Sampling
Large databases
Sample the database and apply Apriori to the sample.
Potentially Large Itemsets (PL): Large itemsets from sample
Negative Border (BD-):
Generalization of Apriori-Gen applied to itemsets of varying sizes.
Minimal set of itemsets which are not in PL, but whose subsets are all in PL.

Negative Border Example

[Figure: the itemsets in PL, and the itemsets in PL ∪ BD-(PL).]

Sampling Algorithm
1. Ds = sample of Database D;
2. PL = Large itemsets in Ds using smalls;
3. C = PL ∪ BD-(PL);
4. Count C in Database using s;
5. ML = large itemsets in BD-(PL);
6. If ML = ∅ then done
7. else C = repeated application of BD-;
8. Count C in Database;

Sampling Example
Find AR assuming s = 20%
Ds = { t1,t2}
Smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter},
{Bread,Jelly}, {Bread,PeanutButter}, {Jelly,
PeanutButter}, {Bread,Jelly,PeanutButter}}
BD-(PL)={{Beer},{Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD- generates all
remaining itemsets
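The negative border in this example can be checked with a short sketch. It assumes PL is downward closed (every subset of a member is also a member), so only the immediate subsets of each candidate need testing; the function name is illustrative.

```python
from itertools import combinations


def negative_border(pl, items):
    """Itemsets not in pl whose proper non-empty subsets are all in pl."""
    pl = set(pl)
    border = set()
    max_size = max((len(s) for s in pl), default=0) + 1
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(items), size):
            candidate = frozenset(combo)
            if candidate in pl:
                continue
            # For single items there are no non-empty proper subsets,
            # so the test below holds trivially.
            if all(frozenset(s) in pl
                   for s in combinations(combo, size - 1) if s):
                border.add(candidate)
    return border


items = {"Beer", "Bread", "Jelly", "Milk", "PeanutButter"}
# PL from the example: every non-empty subset of {Bread, Jelly, PeanutButter}.
pl = {frozenset(c)
      for r in range(1, 4)
      for c in combinations(["Bread", "Jelly", "PeanutButter"], r)}
bd = negative_border(pl, items)
print(sorted(sorted(s) for s in bd))
```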

Sampling Adv/Disadv
Advantages:
Reduces number of database scans to one
in the best case and two in worst.
Scales better.

Disadvantages:
Potentially large number of candidates in
second pass


Partitioning
Divide the database into partitions D1, D2, ..., Dp
Apply Apriori to each partition
Any large itemset must be large in at least one partition.

Partitioning Algorithm
1. Divide D into partitions D1, D2, ..., Dp;
2. For i = 1 to p do
3.    Li = Apriori(Di);
4. C = L1 ∪ ... ∪ Lp;
5. Count C on D to generate L;

Partitioning Example
S = 10%
D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread, Milk}, {Bread, PeanutButter}, {Milk, PeanutButter}, {Bread, Milk, PeanutButter}, {Beer}, {Beer, Bread}, {Beer, Milk}}

Partitioning Adv/Disadv
Advantages:
Adapts to available main memory
Easily parallelized
Maximum number of database scans is
two.

Disadvantages:
May have many candidates during second
scan.

Parallelizing AR Algorithms
Based on Apriori
Techniques differ:
What is counted at each site
How data (transactions) are distributed

Data Parallelism
Data partitioned
Count Distribution Algorithm

Task Parallelism
Data and candidates partitioned
Data Distribution Algorithm

T H A N K S !
