
Summer School

“Achievements and Applications of Contemporary Informatics,


Mathematics and Physics” (AACIMP 2011)
August 8-20, 2011, Kiev, Ukraine

Cluster Analysis

Erik Kropat

University of the Bundeswehr Munich


Institute for Theoretical Computer Science,
Mathematics and Operations Research
Neubiberg, Germany
The Knowledge Discovery
Process
[Figure: the knowledge discovery process]

Raw Data → Pre-processing (standardizing, handling missing values / outliers)
→ Preprocessed Data → Data Mining (patterns, clusters, correlations,
automated classification, outlier / anomaly detection, association rule learning, …)
→ Patterns → Pattern Evaluation → Knowledge → Strategic planning
Clustering

Clustering is a tool for data analysis that solves classification problems.

Problem
Given n observations, split them into K groups of similar objects.

Question
How can we define “similarity”?
Similarity
A cluster is a set of entities which are alike,
and entities from different clusters are not alike.
Distance
A cluster is an aggregation of points such that

the distance between any two points in the cluster


is less than
the distance between any point in the cluster and any point not in it.
Density
Clusters may be described as
connected regions of a multidimensional space
containing a relatively high density of points,
separated from other such regions by a region
containing a relatively low density of points.
Min-Max Problem
Homogeneity: Objects within the same cluster should be similar to each other.
Separation: Objects in different clusters should be dissimilar from each other.

Small distances between objects within a cluster ⇒ homogeneity;
large distances between clusters ⇒ separation.

similarity ⇔ distance
Types of Clustering

Clustering

• Hierarchical Clustering
  − agglomerative
  − divisive
• Partitional Clustering
Similarity and Distance
Distance Measures
A metric on a set G is a function d: G x G → R+ that satisfies the following
conditions:

(D1) d(x, y) = 0 ⇔ x=y (identity)

(D2) d(x, y) = d(y, x) ≥ 0 for all x, y ∈ G (symmetry & non-negativity)

(D3) d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ G (triangle inequality)

Examples
Minkowski Distance

d_r(x, y) = ( Σ_{i=1}^{n} | x_i − y_i |^r )^{1/r} ,   r ∈ [1, ∞) ,  x, y ∈ R^n

o r = 1: Manhattan distance
o r = 2: Euclidean distance
Euclidean Distance

d_2(x, y) = ( Σ_{i=1}^{n} ( x_i − y_i )^2 )^{1/2} ,   x, y ∈ R^n

Example: x = (1, 1), y = (4, 3)

d_2(x, y) = ( (1 − 4)^2 + (1 − 3)^2 )^{1/2} = √13
Manhattan Distance

d_1(x, y) = Σ_{i=1}^{n} | x_i − y_i | ,   x, y ∈ R^n

Example: x = (1, 1), y = (4, 3)

d_1(x, y) = | 1 − 4 | + | 1 − 3 | = 3 + 2 = 5
Maximum Distance

d_∞(x, y) = max_{1 ≤ i ≤ n} | x_i − y_i | ,   x, y ∈ R^n

Example: x = (1, 1), y = (4, 3)

d_∞(x, y) = max(3, 2) = 3
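These distances are straightforward to compute; a minimal Python sketch (assuming NumPy is available; the function names are illustrative) reproduces the three examples above:

import numpy as np

def minkowski(x, y, r):
    # Minkowski distance d_r(x, y) = (sum_i |x_i - y_i|^r)^(1/r), r >= 1
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

def maximum_distance(x, y):
    # limit case r -> infinity: d_inf(x, y) = max_i |x_i - y_i|
    return np.max(np.abs(x - y))

x, y = np.array([1.0, 1.0]), np.array([4.0, 3.0])
print(minkowski(x, y, 1))      # Manhattan distance: 5.0
print(minkowski(x, y, 2))      # Euclidean distance: 3.6055... = sqrt(13)
print(maximum_distance(x, y))  # Maximum distance: 3.0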
Similarity Measures
A similarity function on a set G is a function S: G x G → R that satisfies the
following conditions:

(S1) S (x, y) ≥ 0 for all x, y ∈ G (non-negativity)

(S2) S (x, y) ≤ S (x, x) for all x, y ∈ G (auto-similarity)

(S3) S (x, y) = S (x, x) ⇔ x=y for all x, y ∈ G (identity)

The value of the similarity function is greater when two points are closer.
Similarity Measures

• There are many different definitions of similarity.

• Often used

(S4) S (x, y) = S (y, x) for all x, y ∈ G (symmetry)


Hierarchical Clustering
Dendrogram

[Figure: cluster dendrogram, Euclidean distance, complete linkage]

Gross national product of EU countries – agriculture (1993)

www.isa.uni-stuttgart.de/lehre/SAHBD
Hierarchical Clustering
Hierarchical clustering creates a hierarchy of clusters of the set G.


Agglomerative clustering: Clusters are successively merged together


Divisive clustering: Clusters are recursively split
Agglomerative Clustering
In each step, merge the two clusters with the smallest distance between them.

Step 3 e1, e2 , e3, e4 1 cluster

Step 2 e1, e2, e3 e4 2 clusters

Step 1 e1, e2 e3 e4 3 clusters

Step 0 e1 e2 e3 e4 4 clusters
Divisive Clustering
Choose a cluster that is optimally split into two clusters
according to a given criterion.

Step 0 e1, e2 , e3, e4 1 cluster

Step 1 e1, e2 e3, e4 2 clusters

Step 2 e1, e2 e3 e4 3 clusters

Step 3 e1 e2 e3 e4 4 clusters
Agglomerative
Clustering
INPUT

Given n objects G = { e1,...,en }


represented by p-dimensional feature vectors x1,...,xn ∈ Rp

Data matrix: each row is an object, each column one of the features 1,...,p.

x1 = ( x11 x12 x13 ... x1p )


x2 = ( x21 x22 x23 ... x2p )

⁞ ⁞ ⁞ ⁞ ⁞
xn = ( xn1 xn2 xn3 ... xnp )
Example I

An online shop collects data from its customers.

For each of the n customers there exists a p-dimensional feature vector.
Example II

In a clinical trial, laboratory values of a large number of patients are gathered.

For each of the n patients there exists a p-dimensional feature vector.


Agglomerative Algorithms

• Begin with the disjoint clustering

  C1 = { {e1}, {e2}, ... , {en} }

• Terminate when all objects are in one cluster

  Cn = { {e1, e2, ... , en} }

• Iterate: find the most similar pair of clusters
  and merge them into a single cluster.

This yields a sequence of clusterings (Ci)_{i=1,...,n} of G with

  Ci−1 ⊂ Ci for i = 2,...,n.
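As an illustration, here is a minimal Python sketch of this generic agglomerative loop (the names agglomerative and d_cluster are illustrative; the cluster-distance function d_cluster is supplied separately, e.g. by one of the linkage methods introduced below):

import itertools

def agglomerative(n, d_cluster):
    # Generic agglomerative loop: start from singletons, repeatedly merge the
    # pair of clusters with the smallest distance, and record each clustering.
    clusters = [frozenset([i]) for i in range(n)]   # C1 = { {e1}, ..., {en} }
    history = [list(clusters)]
    while len(clusters) > 1:
        # find the most similar (closest) pair of clusters
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: d_cluster(pair[0], pair[1]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append(list(clusters))              # next clustering in the sequence
    return history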
What is the distance between two clusters?

[Figure: clusters A and B at distance d(A, B)]

⇒ Various hierarchical clustering algorithms


Agglomerative Hierarchical Clustering

There exist many metrics to measure the distance between clusters.

They lead to particular agglomerative clustering methods:

• Single-Linkage Clustering
• Complete-Linkage Clustering
• Average Linkage Clustering
• Centroid Method
• ...
Single-Linkage Clustering

Nearest-Neighbor-Method
The distance between the clusters A and B is the
minimum distance between the elements of each cluster:

d(A,B) = min { d (a, b) | a ∈ A, b ∈ B }

a d(A,B) b
Single-Linkage Clustering

• Advantage: Can detect very long and even curved clusters.


Can be used to detect outliers.

• Drawback: Chaining phenomenon

  Clusters that are very distant from each other
  may be forced together
  due to single elements being close to each other.
Complete-Linkage Clustering

Furthest-Neighbor-Method
The distance between the clusters A and B is the
maximum distance between the elements of each cluster:
d(A,B) = max { d(a,b) | a ∈ A, b ∈ B }

a d (A, B) b
Complete-Linkage Clustering

• … tends to find compact clusters of approximately equal diameters.

• … avoids the chaining phenomenon.

• … cannot be used for outlier detection.


Average-Linkage Clustering

The distance between the clusters A and B is the mean


distance between the elements of each cluster:
  d(A, B) = ( 1 / (|A| ⋅ |B|) ) ⋅ Σ_{a ∈ A, b ∈ B} d(a, b)
Centroid Method

The distance between the clusters A and B is the


(squared) Euclidean distance of the cluster centroids.

Agglomerative Hierarchical Clustering

[Figure: overview of the cluster distance d(A, B) for single-linkage,
complete-linkage, average-linkage, and the centroid method]
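The linkage methods differ only in the cluster-distance function that is plugged into the generic agglomerative loop sketched earlier. A Python sketch of the four definitions (assuming a precomputed matrix D of pairwise object distances and, for the centroid method, a NumPy data matrix X of feature vectors; the function names are illustrative):

import numpy as np

def single_linkage(D):
    return lambda A, B: min(D[a][b] for a in A for b in B)    # nearest neighbor

def complete_linkage(D):
    return lambda A, B: max(D[a][b] for a in A for b in B)    # furthest neighbor

def average_linkage(D):
    return lambda A, B: sum(D[a][b] for a in A for b in B) / (len(A) * len(B))

def centroid_method(X):
    # squared Euclidean distance between the cluster centroids
    return lambda A, B: float(np.sum((X[list(A)].mean(axis=0)
                                      - X[list(B)].mean(axis=0)) ** 2))

Each of these returns a d_cluster(A, B) function that can be passed to the loop above, e.g. agglomerative(len(D), single_linkage(D)).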
Bioinformatics

Alizadeh et al., Nature 403 (2000): pp.503–511


Exercise

[Figure: map showing Berlin, Kiev, Paris, Odessa]
Exercise

The following table shows the distances between 4 cities:

          Kiev   Odessa   Berlin   Paris
Kiev        –      440     1200     2000
Odessa     440      –      1400     2100
Berlin    1200    1400       –       900
Paris     2000    2100      900       –

Determine a hierarchical clustering with


the single linkage method.
Solution - Single Linkage

Step 0: Clustering
{Kiev}, {Odessa}, {Berlin}, {Paris}

Distances between clusters

          Kiev   Odessa   Berlin   Paris
Kiev        –      440     1200     2000
Odessa     440      –      1400     2100
Berlin    1200    1400       –       900
Paris     2000    2100      900       –
Solution - Single Linkage

Step 0: Clustering
{Kiev}, {Odessa}, {Berlin}, {Paris}

Distances between clusters – minimal distance: 440

          Kiev   Odessa   Berlin   Paris
Kiev        –      440     1200     2000
Odessa     440      –      1400     2100
Berlin    1200    1400       –       900
Paris     2000    2100      900       –

⇒ Merge clusters { Kiev } and { Odessa }


Distance value: 440
Solution - Single Linkage

Step 1: Clustering
{Kiev, Odessa}, {Berlin}, {Paris}

Distances between clusters

                Kiev, Odessa   Berlin   Paris
Kiev, Odessa         –          1200     2000
Berlin              1200          –       900
Paris               2000         900       –
Solution - Single Linkage

Step 1: Clustering
{Kiev, Odessa}, {Berlin}, {Paris}

Distances between clusters – minimal distance: 900

                Kiev, Odessa   Berlin   Paris
Kiev, Odessa         –          1200     2000
Berlin              1200          –       900
Paris               2000         900       –

⇒ Merge clusters { Berlin } and { Paris }


Distance value: 900
Solution - Single Linkage

Step 2: Clustering
{Kiev, Odessa}, {Berlin, Paris}

Distances between clusters – minimal distance: 1200

                Kiev, Odessa   Berlin, Paris
Kiev, Odessa         –             1200
Berlin, Paris       1200             –

⇒ Merge clusters { Kiev, Odessa } and { Berlin, Paris }


Distance value: 1200
Solution - Single Linkage

Step 3: Clustering
{Kiev, Odessa, Berlin, Paris}
Solution - Single Linkage

Hierarchy

Distance value   Clustering
     0           {Kiev}, {Odessa}, {Berlin}, {Paris}     4 clusters
   440           {Kiev, Odessa}, {Berlin}, {Paris}       3 clusters
   900           {Kiev, Odessa}, {Berlin, Paris}         2 clusters
  1200           {Kiev, Odessa, Berlin, Paris}           1 cluster

[Figure: dendrogram over Kiev, Odessa, Berlin, Paris with merge heights 440, 900, 1200]
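For reference, the same hierarchy can be reproduced with SciPy (assuming scipy is installed); linkage expects the condensed form of the distance matrix:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# distance matrix in the order Kiev, Odessa, Berlin, Paris
D = np.array([[   0,  440, 1200, 2000],
              [ 440,    0, 1400, 2100],
              [1200, 1400,    0,  900],
              [2000, 2100,  900,    0]], dtype=float)

Z = linkage(squareform(D), method="single")
print(Z)   # merges at distance values 440, 900, 1200, as in the solution above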
Divisive Clustering
Divisive Algorithms

• Begin with one cluster

  C1 = { {e1, e2, ... , en} }

• Terminate when all objects are in disjoint clusters

  Cn = { {e1}, {e2}, ... , {en} }

• Iterate: choose a cluster Cf that is optimally split
  into two clusters Ci and Cj
  according to a given criterion.

This yields a sequence of clusterings (Ci)_{i=1,...,n} of G with

  Ci ⊃ Ci+1 for i = 1,...,n−1.
Partitional Clustering
– Minimal Distance Methods –
Partitional Clustering

• Aims to partition n observations into K clusters.

• The number of clusters and an initial partition are given.

• The initial partition is considered “not optimal”
  and is iteratively repartitioned.

The number of clusters is given!

[Figure: initial partition and final partition for K = 2]
Partitional Clustering

Differences from hierarchical clustering

• The number of clusters is fixed.
• An object can change its cluster.

The initial partition is obtained by

• random choice, or
• applying a hierarchical clustering algorithm in advance.

Estimating the number of clusters

• specialized methods (e.g., Silhouette), or
• applying a hierarchical clustering algorithm in advance.
Partitional Clustering - Methods

In this course we will introduce the minimal distance methods . . .

• K-Means and
• Fuzzy-C-Means
K-Means
K-Means

Aims to partition n observations into K clusters
in which each observation belongs to the cluster with the nearest mean.

Find K cluster centroids µ_1, ..., µ_K that minimize the objective function

  J = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist(µ_i, x)^2

[Figure: a set G of observations partitioned into clusters C1, C2, C3,
with the cluster centroids marked x]
K-Means – Minimal Distance Method

Given: n objects, K clusters

1. Determine an initial partition.
2. Calculate the cluster centroids.
3. For each object, calculate the distances to all cluster centroids.
4. If the distance to the centroid of another cluster is smaller than
   the distance to the centroid of its current cluster,
   then assign the object to the other cluster.
5. If clusters were repartitioned: GOTO 2.
   ELSE: STOP.
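A minimal NumPy sketch of these five steps (the function name k_means and the random initial partition are illustrative choices, not prescribed by the slides):

import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    # X is an (n, p) data matrix, K the number of clusters
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))            # 1. random initial partition
    for _ in range(max_iter):
        centroids = np.array([X[labels == i].mean(axis=0)    # 2. cluster centroids
                              for i in range(K)])            #    (empty clusters not handled)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # 3.
        new_labels = dists.argmin(axis=1)            # 4. assign to the nearest centroid
        if np.array_equal(new_labels, labels):       # 5. stop when nothing changes
            break
        labels = new_labels
    return labels, centroids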
Example

[Figure: initial partition and final partition, cluster centroids marked ₓ]

Exercise

[Figure: initial partition and final partition, cluster centroids marked ₓ]


K-Means
• K-Means does not, in general, determine the globally optimal partition.

• The final partition obtained by K-Means depends on the initial partition.


Hard Clustering / Soft Clustering

• Hard Clustering: each object is a member of exactly one cluster
  (e.g., K-Means).

• Soft Clustering: each object has a fractional membership in all clusters
  (e.g., Fuzzy-c-Means).
Fuzzy-c-Means
Fuzzy Clustering vs. Hard Clustering

• When clusters are well separated,


hard clustering (K-Means) makes sense.
• In many cases, clusters are not well separated.

In hard clustering, borderline objects are assigned to


a cluster in an arbitrary manner.
Fuzzy Set Theory

• Fuzzy set theory was introduced by Lotfi Zadeh in 1965.

• An object can belong to a set with a degree of membership


between 0 and 1.

• Classical set theory is a special case of fuzzy theory


that restricts membership values to be either 0 or 1.
Fuzzy Clustering
• Is based on fuzzy logic and fuzzy set theory.
• Objects can belong to more than one cluster.
• Each object belongs to all clusters with some weight
(degree of membership)

[Figure: membership degrees between 0 and 1 for Cluster 1, Cluster 2, Cluster 3]
Hard Clustering

• K-Means
− The number K of clusters is given.
− Each object is assigned to exactly one cluster.
Partition (membership matrix):

Cluster \ Object    e1   e2   e3   e4
C1                   0    1    0    0
C2                   1    0    0    0
C3                   0    0    1    1
Fuzzy Clustering
• Fuzzy-c-means
− The number c of clusters is given.
− Each object has a fractional membership in all clusters

Cluster \ Object    e1    e2    e3    e4
C1                 0.8   0.2   0.1   0.0
C2                 0.2   0.2   0.2   0.0
C3                 0.0   0.6   0.7   1.0
Σ                  1     1     1     1

Fuzzy clustering: there is no strict sub-division into clusters.
Fuzzy-c-Means

• Membership Matrix

U = ( u_ik ) ∈ [0, 1]^{c × n}

The entry u_ik denotes the degree of membership of object k in cluster i.

             Object 1   Object 2   …   Object n
Cluster 1      u_11       u_12     …     u_1n
Cluster 2      u_21       u_22     …     u_2n
   ⁞             ⁞          ⁞              ⁞
Cluster c      u_c1       u_c2     …     u_cn


Restrictions (Membership Matrix)

1. All weights for a given object, ek, must add up to 1.

  Σ_{i=1}^{c} u_ik = 1    (k = 1,...,n)

2. Each cluster contains – with non-zero weight – at least one object,


but does not contain – with a weight of one – all the objects.

  0 < Σ_{k=1}^{n} u_ik < n    (i = 1,...,c)
Fuzzy-c-Means

• Vector of prototypes (cluster centroids)

  V = ( v_1, ..., v_c )^T ,   v_i ∈ R^p

Remark
The cluster centroids and the membership matrix are initialized randomly.
Afterwards they are iteratively optimized.
Fuzzy-c-Means

ALGORITHM

1. Select an initial fuzzy partition U = (u_ik)
   ⇒ assign values to all u_ik

2. Repeat
3.   Compute the centroid of each cluster using the fuzzy partition
4.   Update the fuzzy partition U = (u_ik)
5. Until the centroids do not change.

Another stopping criterion: the change in the u_ik is below a given threshold.


Fuzzy-c-Means

• K-Means and Fuzzy-c-Means attempt to minimize
  the sum of the squared errors (SSE).

• In K-Means:

  SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist(v_i, x)^2

• In Fuzzy-c-Means:

  SSE = Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m ⋅ dist(v_i, x_k)^2

m ∈ (1, ∞) is a parameter (fuzzifier) that determines the influence of the weights.

[Figure: object x_k with membership degrees u_1k, u_2k, u_3k
to the cluster centroids v_1, v_2, v_3]

Computing Cluster Centroids

• For each cluster i = 1,...,c the centroid is defined by

  v_i = ( Σ_{k=1}^{n} u_ik^m ⋅ x_k ) / ( Σ_{k=1}^{n} u_ik^m )    (i = 1,...,c)    (V)

• This is an extension of the definition of the centroids in K-Means.

• All points are considered, and the contribution of each point
  to the centroid is weighted by its degree of membership.
Update of the Fuzzy Partition (Membership Matrix)

• Minimization of SSE subject to the constraints leads to


the following update formula:

  u_ik = 1 / Σ_{s=1}^{c} ( dist(v_i, x_k)^2 / dist(v_s, x_k)^2 )^{1/(m−1)}    (U)
Fuzzy-c-Means

Initialization
Determine (randomly)
• Matrix U of membership grades
• Matrix V of cluster centroids.

Iteration
Calculate updates of
• Matrix U of membership grades with (U)
• Matrix V of cluster centroids with (V)
until cluster centroids are stable
or the maximum number of iterations is reached.
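A compact NumPy sketch of this loop, combining the update formulas (U) and (V); the function name fuzzy_c_means, the tolerance parameter, and the stopping test on the memberships (the alternative criterion mentioned above) are illustrative choices:

import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    # X is an (n, p) data matrix, c the number of clusters, m > 1 the fuzzifier.
    # Returns the membership matrix U (c x n) and the centroids V (c x p).
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                   # each column sums to 1
    for _ in range(max_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)       # update (V): weighted centroids
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # squared distances
        d2 = np.fmax(d2, 1e-12)                          # guard against division by zero
        p = 1.0 / (m - 1.0)
        U_new = 1.0 / (d2 ** p * (1.0 / d2 ** p).sum(axis=0))     # update (U)
        if np.abs(U_new - U).max() < tol:                # memberships stable -> stop
            return U_new, V
        U = U_new
    return U, V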
Fuzzy-c-means

• Fuzzy-c-means depends on the Euclidean metric


⇒ spherical clusters.

• Other metrics can be applied to obtain different cluster shapes.

• Fuzzy covariance matrix (Gustafson/Kessel 1979)


⇒ ellipsoidal clusters.
Cluster Validity
Indexes
Cluster Validity Indexes

Fuzzy-c-means requires the number of clusters as input.

Question: How can we determine the “optimal” number of clusters?

Idea: Determine the cluster partition for a given number of clusters.


Then, evaluate the cluster partition by a cluster validity index.

Method: For all possible numbers of clusters, calculate the cluster validity index.
Then, determine the optimal number of clusters.

Note: CVIs usually do not depend on the clustering algorithm.


Cluster Validity Indexes

• Partition Coefficient (Bezdek 1981)

  PC(c) = (1/n) Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^2 ,    2 ≤ c ≤ n−1

• Optimal number of clusters c∗:

  PC(c∗) = max_{2 ≤ c ≤ n−1} PC(c)
Cluster Validity Indexes

• Partition Entropy (Bezdek 1974)

  PE(c) = − (1/n) Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik log_2 u_ik ,    2 ≤ c ≤ n−1

• Optimal number of clusters c∗:

  PE(c∗) = min_{2 ≤ c ≤ n−1} PE(c)

• Drawback of PC and PE: only the degrees of membership are considered.
  The geometry of the data set is neglected.
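Both indexes are easy to compute from a membership matrix U, e.g. one returned by the fuzzy_c_means sketch above (a small illustrative sketch; the eps guard against log_2(0) is an implementation choice):

import numpy as np

def partition_coefficient(U):
    # PC = (1/n) * sum_i sum_k u_ik^2 ; the optimal c maximizes PC
    return float((U ** 2).sum() / U.shape[1])

def partition_entropy(U, eps=1e-12):
    # PE = -(1/n) * sum_i sum_k u_ik * log2(u_ik) ; the optimal c minimizes PE
    return float(-(U * np.log2(U + eps)).sum() / U.shape[1])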
Cluster Validity Indexes

• Fukuyama-Sugeno Index (Fukuyama/Sugeno 1989)

  FS(c) = Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m dist(v_i, x_k)^2      (compactness of clusters)
        − Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m dist(v_i, v̄)^2        (separation of clusters)

  where v̄ = (1/c) Σ_{i=1}^{c} v_i denotes the mean of the cluster centroids.

• Optimal number of clusters c∗:

  FS(c∗) = min_{2 ≤ c ≤ n−1} FS(c)
Application
Data Mining and Decision Support Systems – Landslide Events
(UniBw, Geoinformatics Group: W. Reinhardt, E. Nuhn)

→ Spatial Data Mining / Early Warning Systems for Landslide Events


→ Fuzzy clustering approaches (feature weighting)

• Measurements (pressure values, tension, deformation vectors)


• Simulations (finite-element model)
Hard Clustering

[Figure: data → partition]

Problem: uncertain data from measurements and simulations

Fuzzy Clustering

[Figure: data → fuzzy clusters / fuzzy partition]

Fuzzy Clustering with Feature Weighting

Nuhn/Kropat/Reinhardt/Pickl: Preparation of complex landslide simulation results with clustering approaches for decision support and early
warning. Submitted to the Hawaii International Conference on System Sciences (HICSS-45), Grand Wailea, Maui, 2012.
Thank you very much!
