Cluster Analysis
Erik Kropat
[Figure: data-analysis pipeline. Raw data is standardized and cleaned (missing values, outliers) in a pre-processing step; the preprocessed data is then searched for patterns, clusters, and correlations, e.g. by automated classification, outlier / anomaly detection, association rule learning, and clustering.]
Clustering
… is a tool for data analysis that solves classification problems.
Problem
Given n observations, split them into K similar groups.
Question
How can we define “similarity”?
Similarity
A cluster is a set of entities which are alike,
and entities from different clusters are not alike.
Distance
A cluster is an aggregation of points such that the distance between any two points within a cluster is smaller than the distance between points in different clusters.

similarity ⇔ distance
Types of Clustering
• Hierarchical Clustering
− agglomerative
− divisive
• Partitional Clustering
Similarity and Distance
Distance Measures
A metric on a set G is a function d: G × G → R+ that satisfies the following conditions for all x, y, z ∈ G:
• d(x, y) = 0 ⇔ x = y
• d(x, y) = d(y, x)   (symmetry)
• d(x, z) ≤ d(x, y) + d(y, z)   (triangle inequality)
Examples
Minkowski Distance

d_r(x, y) = ( Σi=1..n | xi − yi |^r )^(1/r) , r ∈ [1, ∞) , x, y ∈ Rn.

o r = 1: Manhattan distance
o r = 2: Euclidean distance
Euclidean Distance

d2(x, y) = ( Σi=1..n ( xi − yi )^2 )^(1/2) , x, y ∈ Rn

Example: x = (1, 1), y = (4, 3)

d2(x, y) = ( (1 − 4)^2 + (1 − 3)^2 )^(1/2) = √13
Manhattan Distance

d1(x, y) = Σi=1..n | xi − yi | , x, y ∈ Rn

Example: x = (1, 1), y = (4, 3)

d1(x, y) = | 1 − 4 | + | 1 − 3 | = 3 + 2 = 5
Maximum Distance

d∞(x, y) = max 1≤i≤n | xi − yi | , x, y ∈ Rn

Example: x = (1, 1), y = (4, 3)

d∞(x, y) = max { | 1 − 4 |, | 1 − 3 | } = 3
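The worked examples above can be checked with a short Python sketch (the function names are ours):

```python
# Sketch of the three Minkowski-type distances defined above (pure Python).
import math

def minkowski(x, y, r):
    """d_r(x, y) = (sum |x_i - y_i|^r)^(1/r), r in [1, inf)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def manhattan(x, y):          # r = 1
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):          # r = 2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def maximum(x, y):            # limit r -> infinity (Chebyshev distance)
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1, 1), (4, 3)
print(manhattan(x, y))   # 5
print(euclidean(x, y))   # sqrt(13) ≈ 3.6056
print(maximum(x, y))     # 3
```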
Similarity Measures
The value of the similarity function is greater when two points are closer.
[Figure: cluster dendrogram, computed with Euclidean distance and complete linkage.]
Hierarchical Clustering
• agglomerative: start with each object in its own cluster (Step 0: {e1}, {e2}, {e3}, {e4} — 4 clusters) and iteratively merge clusters.
• divisive: start with all objects in one cluster and iteratively split clusters (Step 3: {e1}, {e2}, {e3}, {e4} — 4 clusters).

Divisive Clustering
Choose the cluster that optimally splits into two clusters according to a given criterion.

Agglomerative Clustering
INPUT
n objects, each described by p features:
x1 = ( x11 x12 x13 ... x1p )
⁞
xn = ( xn1 xn2 xn3 ... xnp )
How can the distance d(A, B) between two clusters A and B be measured?
• Single-Linkage Clustering
• Complete-Linkage Clustering
• Average Linkage Clustering
• Centroid Method
• ...
Single-Linkage Clustering
Nearest-Neighbor-Method
The distance between the clusters A and B is the
minimum distance between the elements of each cluster:
d(A,B) = min { d(a,b) | a ∈ A, b ∈ B }
Complete-Linkage Clustering
Furthest-Neighbor-Method
The distance between the clusters A and B is the
maximum distance between the elements of each cluster:
d(A,B) = max { d(a,b) | a ∈ A, b ∈ B }
Centroid Method
The distance between the clusters A and B is the distance between their centroids.
Agglomerative Hierarchical Clustering
[Figure: the cluster distance d(A, B) for the different linkage methods.]
Bioinformatics
[Figure: map with the cities Kiev, Odessa, Berlin, Paris.]
Exercise
Step 0: Clustering
{Kiev}, {Odessa}, {Berlin}, {Paris}
Step 1: Clustering
{Kiev, Odessa}, {Berlin}, {Paris}
Step 2: Clustering
{Kiev, Odessa}, {Berlin, Paris}
Step 3: Clustering
{Kiev, Odessa, Berlin, Paris}
Solution - Single Linkage
Hierarchy (distance values on the vertical axis):
• distance 0: 4 clusters - {Kiev}, {Odessa}, {Berlin}, {Paris}
• distance 440: 3 clusters - {Kiev, Odessa} merge
• 2 clusters - {Berlin, Paris} merge
• distance 2540: 1 cluster - {Kiev, Odessa, Berlin, Paris}
[Dendrogram over Kiev, Odessa, Berlin, Paris.]
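A single-linkage run on the four cities can be sketched in pure Python; the pairwise distances below are rough illustrative values (our assumption), not taken from the slides:

```python
# Single-linkage agglomerative clustering, sketched in pure Python.

def single_linkage(dist, items):
    """Merge the two closest clusters until one remains; return the merge log."""
    clusters = [{c} for c in items]
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with minimal single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[(min(a, b), max(a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((d, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

cities = ["Kiev", "Odessa", "Berlin", "Paris"]
# symmetric pairwise distances in km (approximate, for illustration only)
dist = {
    ("Kiev", "Odessa"): 440,    ("Berlin", "Kiev"): 1200,
    ("Berlin", "Paris"): 880,   ("Kiev", "Paris"): 2030,
    ("Berlin", "Odessa"): 1340, ("Odessa", "Paris"): 2150,
}

for d, cluster in single_linkage(dist, cities):
    print(d, sorted(cluster))
```

With these values, Kiev and Odessa merge first (440), then Berlin and Paris (880), and finally both groups join (1200).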
Divisive Clustering
Divisive Algorithms
The data set is iteratively repartitioned into clusters. Well-known examples are
• K-Means and
• Fuzzy-c-Means
K-Means
Find K cluster centroids µ1, ..., µK
that minimize the objective function

J = Σi=1..K Σx∈Ci dist( µi, x )^2

[Figure: data points in clusters C1, C2, C3 with centroids marked ×.]
K-Means - Minimal Distance Method
[Figure: each point is assigned to the nearest cluster centroid.]
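The minimal distance method can be sketched in pure Python (a minimal Lloyd-style K-Means; the function names and toy data are ours):

```python
# Minimal K-Means sketch: assign each point to its nearest centroid,
# then move each centroid to the mean of its cluster; repeat until stable.
import math, random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                   # converged
            break
        centroids = new
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On these two well-separated groups the centroids converge to the two group means.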
Fuzzy-c-Means
Fuzzy Clustering vs. Hard Clustering
[Figure: membership grades between 0 and 1 for Cluster 1, Cluster 2, Cluster 3.]
Hard Clustering
• K-Means
− The number K of clusters is given.
− Each object is assigned to exactly one cluster.

Partition (rows = clusters, columns = objects):
      e1  e2  e3  e4
C1     0   1   0   0
C2     1   0   0   0
C3     0   0   1   1
Fuzzy Clustering
• Fuzzy-c-means
− The number c of clusters is given.
− Each object has a fractional membership in all clusters.

Fuzzy partition (rows = clusters, columns = objects):
      e1   e2   e3   e4
C1    0.8  0.2  0.1  0.0
C2    0.2  0.2  0.2  0.0
C3    0.0  0.6  0.7  1.0
Σ     1    1    1    1

There is no strict sub-division of clusters.
Fuzzy-c-Means
• Membership Matrix
U = ( u_ik ) ∈ [0, 1]^(c × n)
The entry u_ik denotes the degree of membership of object k in cluster i.

Σi=1..c u_ik = 1   (k = 1,...,n)

0 < Σk=1..n u_ik < n   (i = 1,...,c)
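The two constraints can be verified directly on the example fuzzy partition above:

```python
# Check the membership-matrix constraints on the fuzzy partition example.
U = [
    [0.8, 0.2, 0.1, 0.0],   # cluster C1
    [0.2, 0.2, 0.2, 0.0],   # cluster C2
    [0.0, 0.6, 0.7, 1.0],   # cluster C3
]
c, n = len(U), len(U[0])

# every column sums to 1: each object's memberships form a partition of unity
col_sums = [sum(U[i][k] for i in range(c)) for k in range(n)]
print(all(abs(s - 1) < 1e-9 for s in col_sums))   # True

# every row sum lies strictly between 0 and n: no cluster is empty or total
row_sums = [sum(U[i]) for i in range(c)]
print(all(0 < s < n for s in row_sums))           # True
```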
Fuzzy-c-Means
• Cluster Centroids
V = ( v1, ..., vc )^T, the matrix of the cluster centroids vi.

Remark
The cluster centroids and the membership matrix are initialized randomly.
Afterwards they are iteratively optimized.
Fuzzy-c-Means
ALGORITHM
1. Select an initial fuzzy partition U = (u_ik).
2. Repeat
3. Compute the centroid of each cluster using the fuzzy partition.
4. Update the fuzzy partition U = (u_ik).
5. Until the centroids do not change.

• In Fuzzy-c-Means the objective is the weighted sum of squared errors:

SSE = Σi=1..c Σk=1..n u_ik^m · dist( vi, xk )^2
Centroid update (V):

vi = ( Σk=1..n u_ik^m · xk ) / ( Σk=1..n u_ik^m )   (i = 1,...,c)

Membership update (U):

u_ik = 1 / Σs=1..c ( dist( vi, xk )^2 / dist( vs, xk )^2 )^(1/(m−1))
Fuzzy-c-Means
Initialization
Determine (randomly)
• Matrix U of membership grades
• Matrix V of cluster centroids.
Iteration
Calculate updates of
• Matrix U of membership grades with (U)
• Matrix V of cluster centroids with (V)
until cluster centroids are stable
or the maximum number of iterations is reached.
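The iteration can be sketched in pure Python, following the update rules (V) and (U) above (a minimal sketch with fuzzifier m = 2; all names and the toy data are ours):

```python
# Minimal Fuzzy-c-Means sketch using the centroid update (V)
# and membership update (U) from the slides.
import math, random

def fcm(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    n, p = len(points), len(points[0])
    # random initial membership matrix with unit column sums
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for k in range(n):
        s = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= s
    for _ in range(iters):
        # (V): centroids as weighted means with weights u_ik^m
        V = []
        for i in range(c):
            w = [U[i][k] ** m for k in range(n)]
            V.append(tuple(sum(wk * pt[d] for wk, pt in zip(w, points)) / sum(w)
                           for d in range(p)))
        # (U): new memberships from squared-distance ratios
        newU = [[0.0] * n for _ in range(c)]
        for k in range(n):
            d2 = [max(math.dist(V[i], points[k]) ** 2, 1e-12) for i in range(c)]
            for i in range(c):
                newU[i][k] = 1.0 / sum((d2[i] / d2[s]) ** (1 / (m - 1))
                                       for s in range(c))
        if max(abs(newU[i][k] - U[i][k])
               for i in range(c) for k in range(n)) < tol:
            U = newU
            break
        U = newU
    return V, U

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
V, U = fcm(points, c=2)
```

Each object ends up with a high membership grade in the cluster of its group and a low grade in the other.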
Fuzzy-c-Means - Cluster Validity
Method: For all possible numbers of clusters calculate the cluster validity index.
Then, determine the optimal number of clusters.

• Partition Coefficient

PC(c) = (1/n) Σi=1..c Σk=1..n u_ik^2 ,  2 ≤ c ≤ n−1

• Partition Entropy

PE(c) = − (1/n) Σi=1..c Σk=1..n u_ik log2 u_ik ,  2 ≤ c ≤ n−1

• FS Index

FS(c) = Σi=1..c Σk=1..n u_ik^m dist( vi, xk )^2   (compactness of clusters)
      − Σi=1..c Σk=1..n u_ik^m dist( vi, v̄ )^2   (separation of clusters)

where v̄ denotes the overall mean vector.
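For the example fuzzy partition from the slides (c = 3 clusters, n = 4 objects), the first two indices can be computed directly:

```python
# Computing the validity indices PC(c) and PE(c) for the example
# fuzzy partition (3 clusters, 4 objects).
import math

U = [
    [0.8, 0.2, 0.1, 0.0],
    [0.2, 0.2, 0.2, 0.0],
    [0.0, 0.6, 0.7, 1.0],
]
n = len(U[0])

# partition coefficient: closer to 1 means a crisper partition
PC = sum(u * u for row in U for u in row) / n

# partition entropy: closer to 0 means a crisper partition
# (the term 0 * log2(0) is taken as 0)
PE = -sum(u * math.log2(u) for row in U for u in row if u > 0) / n

print(round(PC, 4), round(PE, 4))
```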
Further topics: fuzzy partitions and fuzzy clustering with feature weighting.
Nuhn/Kropat/Reinhardt/Pickl: Preparation of complex landslide simulation results with clustering approaches for decision support and early warning. Submitted to Hawaii International Conference on System Sciences (HICSS 45), Grand Wailea, Maui, 2012.
Thank you very much!