Академический Документы
Профессиональный Документы
Культура Документы
z Prototype-based
– Fuzzy c-means
– Mixture Model Clustering
– Self-Organizing Maps
z Density-based
– Grid-based clustering
– Subspace clustering
z Graph-based
– Chameleon
– Jarvis-Patrick
– Shared Nearest Neighbor (SNN)
z Characteristics of Clustering Algorithms
SSE = ∑∑ w (x − c ) , ∑w
k N
ij i j
2
ij =1
j =1 i =1 j =1
1 2 5
∑w
p 2
ij ij =1
j =1 i =1
j =1
1 2 5
E ( x ) = wx21( 2 − 1) 2 + wx22 (5 − 2) 2
= wx21 + 9 wx22
p: fuzzifier (p > 1)
z Objective function:
k
– Update weights:
× ×
(centroid)
(single link)
– GROUP-AVERAGE:
Merge two clusters based on their average connectivity
(a)
(b)
(c)
(d)
z Preprocessing Step:
Represent the data by a Graph
– Given a set of points, construct the k-nearest-
neighbor (k-NN) graph to capture the relationship
between a point and its k nearest neighbors
– Concept of neighborhood is captured dynamically
(even if region is sparse)
i j i j
4
z Combines:
– Graph based clustering (similarity definition based on
number of shared nearest neighbors)
– Density based clustering (DBScan-like approach)
z High dimensionality
z Size of data set
z Sparsity of attribute values
z Noise and Outliers
z Types of attributes and type of data sets
z Differences in attribute scale
z Properties of the data space
– Can you define a meaningful centroid
z Data distribution
z Shape
z Differing sizes
z Differing densities
z Poor separation
z Relationship of clusters
z Subspace clusters
z Order dependence
z Non-determinism
z Parameter selection
z Scalability
z Underlying model
z Optimization based approach
z MIN can handle outliers, but noise can join clusters; EM clustering
can tolerate noise, but can be strongly affected by outliers.
z EM can only be applied to data for which a centroid is meaningful;
MIN only requires a meaningful definition of proximity.
z EM will have trouble as dimensionality increases and the number
of its parameters (the number of entries in the covariance matrix)
increases as the square of the number of dimensions; MIN can
work well with a suitable definition of proximity.
z EM is designed for Euclidean data, although versions of EM
clustering have been developed for other types of data. MIN is
shielded from the data type by the fact that it uses a similarity
matrix.
z MIN makes not distribution assumptions; the version of EM we are
considering assumes Gaussian distributions.
01/06/2007 Introduction to Data Mining 77
Comparison of MIN and EM-Clustering