
1

Topic 5:
Unsupervised Learning:
CLUSTERING
February-May 2005
2
SUPERVISED METHODS:
LABELED Data Base
Block diagram:
Labeled data base, divided into train and test sets ->
training the algorithm (determining the function) ->
evaluating the classifier.
Choose the algorithm: MAP, ML, K-Nearest, LD, SVC, NN, Tree, ...
Reducing the space dimension d: feature selection
(independent of the machine learning algorithm) or linear
methods such as PCA, MDA, ICA.
3
UNSUPERVISED METHODS:
Non-LABELED Data Base
Block diagram:
Non-labeled data base -> choose the algorithm and
initialize the clusters -> E-step: classify the samples ->
M-step: update the parameters or evaluate the criterion
function (iterate until convergence).
Reducing the space dimension d: feature selection
(independent of the machine learning algorithm) or linear
methods such as PCA, ICA.
4
LOOKING FOR STRUCTURE INSIDE
THE DATA
Parametric methods: they assume some p.d.f. for the clusters.
Non-parametric methods: formal clustering procedures.
5
INDEX (Parametric Methods)
1 MIXTURE DENSITIES AND
IDENTIFIABILITY
2 MAXIMUM LIKELIHOOD
ESTIMATES: EM
3 K-Means Clustering
6
1 MIXTURE DENSITIES AND
IDENTIFIABILITY
Assumptions:
1. The samples come from a known number c of classes.
2. The prior probabilities of each class are known (mixing parameters).
3. The forms of the class-conditional probability densities are known.
4. The values of the parameters are unknown.
5. The category labels are unknown: UNSUPERVISED.
\{\Pr(\omega_j)\},\quad j = 1..c \qquad \text{(known priors)}

\{f(\mathbf{x}\mid\omega_j,\boldsymbol{\theta}_j)\},\quad j = 1..c \qquad \text{(known form, unknown parameters }\boldsymbol{\theta}_j\text{)}
7
1 MIXTURE DENSITIES AND
IDENTIFIABILITY
MIXTURE DENSITY:
1. For the moment it is assumed that only the parameter vector \boldsymbol{\theta} is unknown.
2. Necessary condition for identifiability:
f(\mathbf{x}\mid\boldsymbol{\theta}) = \sum_{j=1}^{c} f(\mathbf{x}\mid\omega_j,\boldsymbol{\theta}_j)\,\Pr(\omega_j),
\qquad
\boldsymbol{\theta} = \begin{pmatrix}\boldsymbol{\theta}_1\\ \vdots\\ \boldsymbol{\theta}_c\end{pmatrix}

\boldsymbol{\theta} \neq \boldsymbol{\theta}' \;\Rightarrow\; \exists\,\mathbf{x}:\; f(\mathbf{x}\mid\boldsymbol{\theta}) \neq f(\mathbf{x}\mid\boldsymbol{\theta}')
8
1 MIXTURE DENSITIES AND
IDENTIFIABILITY
Example: identifiability problem:
BINARY (SYMMETRIC) CHANNEL
x \in \{0,1\}; \qquad \Pr(\text{bit}=0) = P_0, \quad \Pr(\text{bit}=1) = P_1, \quad P_0 + P_1 = 1

\Pr(x=1\mid\omega_0) = \theta_0, \qquad \Pr(x=0\mid\omega_0) = 1-\theta_0

\Pr(x=1\mid\omega_1) = \theta_1, \qquad \Pr(x=0\mid\omega_1) = 1-\theta_1
9
1 MIXTURE DENSITIES AND
IDENTIFIABILITY
Example of the identifiability problem:
BINARY (SYMMETRIC) CHANNEL
Parameter vector and mixture probability:

\boldsymbol{\theta} = \begin{pmatrix}\theta_0\\ \theta_1\end{pmatrix},
\qquad
\Pr(x\mid\boldsymbol{\theta}) = P_0\,\theta_0^{\,x}(1-\theta_0)^{1-x} + P_1\,\theta_1^{\,x}(1-\theta_1)^{1-x}

Since x only takes the values 0 and 1, the mixture is fully determined by
\Pr(x=1\mid\boldsymbol{\theta}) = P_0\theta_0 + P_1\theta_1: different pairs (\theta_0,\theta_1) can give exactly the same mixture, so \boldsymbol{\theta} cannot be recovered uniquely (not identifiable).
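To see this numerically, here is a small Python check (not part of the original slides; the parameter values are illustrative assumptions): two different parameter vectors with the same value of P_0 θ_0 + P_1 θ_1 produce exactly the same mixture distribution.

```python
import numpy as np

def mixture_prob(x, P0, theta0, theta1):
    """Pr(x | theta) = P0*theta0^x*(1-theta0)^(1-x) + P1*theta1^x*(1-theta1)^(1-x)."""
    P1 = 1.0 - P0
    return (P0 * theta0**x * (1 - theta0)**(1 - x)
            + P1 * theta1**x * (1 - theta1)**(1 - x))

x = np.array([0, 1])
# Two different parameter vectors with the same P0*theta0 + P1*theta1:
print(mixture_prob(x, P0=0.5, theta0=0.2, theta1=0.8))   # [0.5 0.5]
print(mixture_prob(x, P0=0.5, theta0=0.4, theta1=0.6))   # [0.5 0.5] -> indistinguishable
```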
10
2 MAXIMUM LIKELIHOOD
ESTIMATES
Likelihood of the statistically independent observed samples.
Assuming statistical independence between \boldsymbol{\theta}_i and \boldsymbol{\theta}_j, the ML solution is one of the multiple solutions of:
D = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}

f(D\mid\boldsymbol{\theta}) = \prod_{k=1}^{n} f(\mathbf{x}_k\mid\boldsymbol{\theta}),
\qquad
l = \ln f(D\mid\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln f(\mathbf{x}_k\mid\boldsymbol{\theta})

\sum_{k=1}^{n} \Pr(\omega_i\mid\mathbf{x}_k,\boldsymbol{\theta})\,\nabla_{\boldsymbol{\theta}_i} \ln f(\mathbf{x}_k\mid\omega_i,\boldsymbol{\theta}_i) = 0, \quad i = 1..c

\text{where}\quad f(\mathbf{x}_k\mid\boldsymbol{\theta}) = \sum_{j=1}^{c} f(\mathbf{x}_k\mid\omega_j,\boldsymbol{\theta}_j)\,\Pr(\omega_j)
11
2 MAXIMUM LIKELIHOOD
ESTIMATES
Derivation:

\nabla_{\boldsymbol{\theta}_i} l
= \sum_{k=1}^{n} \frac{1}{f(\mathbf{x}_k\mid\boldsymbol{\theta})}\,\nabla_{\boldsymbol{\theta}_i} f(\mathbf{x}_k\mid\boldsymbol{\theta})
= \sum_{k=1}^{n} \frac{1}{f(\mathbf{x}_k\mid\boldsymbol{\theta})}\,\nabla_{\boldsymbol{\theta}_i}\!\left[\sum_{j=1}^{c} f(\mathbf{x}_k\mid\omega_j,\boldsymbol{\theta}_j)\Pr(\omega_j)\right]

= \sum_{k=1}^{n} \frac{\Pr(\omega_i)}{f(\mathbf{x}_k\mid\boldsymbol{\theta})}\,\nabla_{\boldsymbol{\theta}_i} f(\mathbf{x}_k\mid\omega_i,\boldsymbol{\theta}_i)
= \sum_{k=1}^{n} \Pr(\omega_i\mid\mathbf{x}_k,\boldsymbol{\theta})\,\nabla_{\boldsymbol{\theta}_i} \ln f(\mathbf{x}_k\mid\omega_i,\boldsymbol{\theta}_i) = 0, \quad i = 1..c

\text{using}\quad
\Pr(\omega_i\mid\mathbf{x}_k,\boldsymbol{\theta}) = \frac{f(\mathbf{x}_k\mid\omega_i,\boldsymbol{\theta}_i)\,\Pr(\omega_i)}{f(\mathbf{x}_k\mid\boldsymbol{\theta})}
12
2 MAXIMUM LIKELIHOOD
ESTIMATES
Generalizing to the case of unknown prior probabilities
(no derivation included here):
1. Compute the prior probability estimates.
2. Compute the parameter vector estimates.
3. Compute the class posterior probabilities.
\hat{\Pr}(\omega_i) = \frac{1}{n}\sum_{k=1}^{n} \hat{\Pr}(\omega_i\mid\mathbf{x}_k,\hat{\boldsymbol{\theta}})

\sum_{k=1}^{n} \hat{\Pr}(\omega_i\mid\mathbf{x}_k,\hat{\boldsymbol{\theta}})\,\nabla_{\boldsymbol{\theta}_i} \ln f(\mathbf{x}_k\mid\omega_i,\hat{\boldsymbol{\theta}}_i) = 0, \quad i = 1..c

\hat{\Pr}(\omega_i\mid\mathbf{x}_k,\hat{\boldsymbol{\theta}})
= \frac{f(\mathbf{x}_k\mid\omega_i,\hat{\boldsymbol{\theta}}_i)\,\hat{\Pr}(\omega_i)}{\sum_{j=1}^{c} f(\mathbf{x}_k\mid\omega_j,\hat{\boldsymbol{\theta}}_j)\,\hat{\Pr}(\omega_j)}
13
2 MAXIMUM LIKELIHOOD
ESTIMATES
For Gaussian Distributions:
Parameters to estimate:
\ln f(\mathbf{x}_k\mid\omega_i,\boldsymbol{\theta}_i)
= -\ln\!\left((2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}\right)
  - \tfrac{1}{2}\,(\mathbf{x}_k-\boldsymbol{\mu}_i)^{T}\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_k-\boldsymbol{\mu}_i)

\boldsymbol{\theta}_i = (\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \quad i = 1..c
14
2 MAXIMUM LIKELIHOOD
ESTIMATES
ML is solved by applying the SOFT Expectation-Maximization (EM) algorithm:
soft assignment. Iterations stop when the p.d.f. does not vary.
1. Expectation (E-step)
2. Maximization (M-step)
M-step:

\hat{\Pr}(\omega_i) = \frac{1}{n}\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k);
\qquad
\hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k)\,\mathbf{x}_k}{\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k)};
\qquad
\hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k)\,(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_i)(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_i)^{T}}{\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k)}

E-step:

\hat{\Pr}(\omega_i\mid\mathbf{x}_k)
= \frac{|\hat{\boldsymbol{\Sigma}}_i|^{-1/2}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_i)^{T}\hat{\boldsymbol{\Sigma}}_i^{-1}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_i)\right)\hat{\Pr}(\omega_i)}
       {\sum_{j=1}^{c} |\hat{\boldsymbol{\Sigma}}_j|^{-1/2}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_j)^{T}\hat{\boldsymbol{\Sigma}}_j^{-1}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_j)\right)\hat{\Pr}(\omega_j)}
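The update rules above map directly onto code. Below is a minimal numpy sketch of soft EM for a Gaussian mixture; the synthetic data, the initialisation, the small regularisation term added to the covariances and the stopping tolerance are assumptions made for illustration, not choices taken from the slides.

```python
import numpy as np

def em_gmm(X, c, n_iter=100, tol=1e-6, seed=0):
    """Soft EM for a c-component Gaussian mixture on data X (n x d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    priors = np.full(c, 1.0 / c)                      # Pr(omega_i)
    mus = X[rng.choice(n, c, replace=False)]          # initial means: random samples
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(c)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posteriors Pr(omega_i | x_k)
        resp = np.empty((n, c))
        for i in range(c):
            diff = X - mus[i]
            expo = -0.5 * np.sum(diff @ np.linalg.inv(covs[i]) * diff, axis=1)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[i]))
            resp[:, i] = priors[i] * np.exp(expo) / norm
        ll = np.sum(np.log(resp.sum(axis=1)))         # log-likelihood for the stop test
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update priors, means and covariances (ML estimates)
        Ni = resp.sum(axis=0)
        priors = Ni / n
        mus = (resp.T @ X) / Ni[:, None]
        for i in range(c):
            diff = X - mus[i]
            covs[i] = (resp[:, i, None] * diff).T @ diff / Ni[i] + 1e-6 * np.eye(d)
        if abs(ll - prev_ll) < tol:                   # stop when the p.d.f. no longer varies
            break
        prev_ll = ll
    return priors, mus, covs, resp

# Synthetic 2-D data from two clouds (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)), rng.normal([4, 4], 1.0, (200, 2))])
priors, mus, covs, resp = em_gmm(X, c=2)
print(priors, mus, sep="\n")
```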
15
2 MAXIMUM LIKELIHOOD
ESTIMATES
Problem: the starting point. Brain images;
full data base vs. labeled data base with Pr > 0.95.
Initial parameters for i = 1, 2, 3: \hat{\boldsymbol{\mu}}_i (dimension d\times 1), \hat{\boldsymbol{\Sigma}}_i (dimension d\times d), and the priors \hat{\Pr}(\omega_i).
16
2 MAXIMUM LIKELIHOOD
ESTIMATES
E-step: for a given \mathbf{x}_k, estimate the posteriors:

\hat{\Pr}(\omega_i\mid\mathbf{x}_k)
= \frac{f(\mathbf{x}_k\mid\omega_i,\hat{\boldsymbol{\theta}}_i)\,\hat{\Pr}(\omega_i)}{\sum_{j=1}^{c} f(\mathbf{x}_k\mid\omega_j,\hat{\boldsymbol{\theta}}_j)\,\hat{\Pr}(\omega_j)}

M-step: the parameters are updated (ML estimation).
17
3. K-Means Clustering
HARD classification: a simplification of the ML (EM)
estimates for a multivariate normal (optimal for Case 1
of the multivariate Gaussian seen with MAP).
Centroid:
\boldsymbol{\Sigma}_i = \sigma_i^2\,\mathbf{I}

\hat{\Pr}(\omega_i\mid\mathbf{x}_k) =
\begin{cases}
1, & d_e(\mathbf{x}_k,\hat{\boldsymbol{\mu}}_i) < d_e(\mathbf{x}_k,\hat{\boldsymbol{\mu}}_j)\ \ \forall\, j\neq i\\
0, & \text{otherwise}
\end{cases}

\hat{\Pr}(\omega_i) = \frac{n_i}{n}; \qquad
\hat{\boldsymbol{\mu}}_i = \frac{1}{n_i}\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k)\,\mathbf{x}_k
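A minimal numpy sketch of the hard-assignment (K-means) simplification above; the synthetic data, the value of c and the convergence test are illustrative assumptions.

```python
import numpy as np

def kmeans(X, c, n_iter=100, seed=0):
    """Hard EM simplification: assign each x_k to the nearest centroid, then update."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), c, replace=False)]     # cluster initialisation
    for _ in range(n_iter):
        # E-step (hard): Pr(omega_i | x_k) = 1 for the nearest centroid, 0 otherwise
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # M-step: centroid = mean of the samples assigned to the cluster
        new_mus = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mus[i]
                            for i in range(c)])
        if np.allclose(new_mus, mus):
            break
        mus = new_mus
    return mus, labels

# Illustrative run on synthetic 2-D data.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.5, (150, 2)), rng.normal([3, 3], 0.5, (150, 2))])
mus, labels = kmeans(X, c=2)
print(mus)
```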
18
3. K-Means Clustering
K-Means Clustering
19
3. K-Means Clustering
K-Means Clustering
20
3. K-Means Clustering
K-Means Clustering
21
3. K-Means Clustering
K-Means Clustering
22
Brain Images
23
Brain Images: K-Means
Different
Starting
Points
24
Brain Images: Expectation-
Maximization
Different
Starting
Points
25
Brain Images: NN
26
3. K-Means Clustering
APPLICATION: vector quantization of an n-dimensional real-valued
vector. See Proakis, Digital Communications, Chapter 3:
Source Coding.
FUZZY K-Means: soft classification. b is a free blending
parameter (b > 1).
J_{\text{Fuzzy}} = \sum_{i=1}^{c}\sum_{j=1}^{n}\left[\hat{\Pr}(\omega_i\mid\mathbf{x}_j)\right]^{b} d_e^{\,2}(\mathbf{x}_j,\boldsymbol{\mu}_i)

\boldsymbol{\mu}_i = \frac{\sum_{k=1}^{n}\left[\hat{\Pr}(\omega_i\mid\mathbf{x}_k)\right]^{b}\mathbf{x}_k}{\sum_{k=1}^{n}\left[\hat{\Pr}(\omega_i\mid\mathbf{x}_k)\right]^{b}}
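A short numpy sketch of fuzzy K-means. The centroid update follows the equation above; the membership update used here is the standard fuzzy c-means rule, which the slide does not spell out explicitly, so treat that formula (and the synthetic data) as assumptions.

```python
import numpy as np

def fuzzy_kmeans(X, c, b=2.0, n_iter=100, seed=0):
    """Soft (fuzzy) K-means: memberships raised to the blending power b > 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                 # memberships Pr(omega_i | x_k)
    for _ in range(n_iter):
        W = U ** b
        mus = (W.T @ X) / W.sum(axis=0)[:, None]      # centroid update from the slide
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # Standard fuzzy c-means membership update (assumed, not from the slide):
        inv = d2 ** (-1.0 / (b - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.allclose(U_new, U, atol=1e-6):
            break
        U = U_new
    return mus, U

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)), rng.normal([3, 0], 0.5, (100, 2))])
mus, U = fuzzy_kmeans(X, c=2)
print(mus)
```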
27
INDEX: Formal Clustering
Procedures
1 INTRODUCTION:
FORMAL CLUSTERING PROCEDURES
2 SIMILARITY MEASURES
3 CRITERION FUNCTIONS
4 ITERATIVE OPTIMIZATION
5 CONCLUSIONS
28
1. INTRODUCTION
Clusters may form clouds of points in a d-dimensional
space.
Normal distribution: the sample mean and the sample
covariance matrix form a sufficient statistic.
Sample mean m: locates the center of gravity of the
cloud; it best represents all of the data in the sense
of minimizing the sum of squared distances from m to
the samples.
Sample covariance matrix C: describes how much the
data scatter along the various directions around m.
29
1. INTRODUCTION
The sample mean vector and the sample covariance matrix
are not a sufficient statistic in the general case:
there exist different distributions with identical mean and covariance:
\mathbf{m} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k
\qquad
\mathbf{C} = \frac{1}{N-1}\sum_{k=1}^{N}(\mathbf{x}_k-\mathbf{m})(\mathbf{x}_k-\mathbf{m})^{T}
30
1. INTRODUCTION
Formal clustering procedures: two key steps.
Data are grouped into clusters, i.e. groups of data
points that possess strong internal similarities.
A criterion function is defined and the grouping that
extremizes it is sought. To evaluate a partitioning of
the set of samples into clusters, the similarity
between samples must be measured.
31
2. SIMILARITY MEASURES
Similarity is measured using a distance between
samples.
Example: the Euclidean distance d(\mathbf{x}_i, \mathbf{x}_j).
Two samples belong to the same cluster if
d(\mathbf{x}_i, \mathbf{x}_j) < d_0.
The threshold d_0 is critical.
d_e(\mathbf{x}_i,\mathbf{x}_j) = \|\mathbf{x}_i-\mathbf{x}_j\| = \left(\sum_{n=1}^{d}\left[x_i[n]-x_j[n]\right]^{2}\right)^{1/2}
32
2. SIMILARITY MEASURES
Distance threshold affects the number and
size of clusters:
typical within-cluster distance < d_0 < typical between-cluster distance
33
2. SIMILARITY MEASURES
Euclidean distance d_{ij}:
Clusters are invariant to rotation.
Clusters are invariant to translation.
Clusters are, in general, not invariant to linear
transformations.
34
2. SIMILARITY MEASURES
Normalization prior to clustering:
Each feature is translated to have zero mean.
Each feature is scaled to have unit variance.
(These two actions are also recommended with neural
nets.)
PCA, Principal Component Analysis (the axes coincide
with the eigenvectors of the sample covariance
matrix).
AFTER NORMALIZATION AND PCA, CLUSTERS
ARE INVARIANT TO DISPLACEMENTS, SCALE
CHANGES AND ROTATIONS.
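A brief numpy sketch of the normalisation and PCA rotation described above (zero mean, unit variance, then projection onto the eigenvectors of the sample covariance matrix); the sample data is an illustrative assumption.

```python
import numpy as np

def normalize_and_pca(X):
    """Translate each feature to zero mean, scale to unit variance,
    then rotate onto the eigenvectors of the sample covariance matrix (PCA)."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # zero mean, unit variance
    C = np.cov(Xc.T)                                    # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)                # axes = eigenvectors of C
    order = np.argsort(eigvals)[::-1]                   # sort by decreasing variance
    return Xc @ eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.3, 0.0],
                                          [0.0, 1.0, 0.5],
                                          [0.0, 0.0, 0.2]])
Z, variances = normalize_and_pca(X)
print(variances)
```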
35
2. SIMILARITY MEASURES
Other metrics:
Minkowski distance:
d_q(\mathbf{x}_i,\mathbf{x}_j) = \left(\sum_{n=1}^{d}\left|x_i[n]-x_j[n]\right|^{q}\right)^{1/q}
Mahalanobis distance:
d_M^{\,2}(\mathbf{x}_i,\mathbf{x}_j) = (\mathbf{x}_i-\mathbf{x}_j)^{T}\,\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\mathbf{x}_j)
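The two metrics translate directly into a few lines of numpy; the example vectors and the covariance matrix below are illustrative assumptions.

```python
import numpy as np

def minkowski(xi, xj, q):
    """d_q(x_i, x_j) = (sum_n |x_i[n] - x_j[n]|^q)^(1/q); q = 2 gives Euclidean."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

def mahalanobis_sq(xi, xj, Sigma):
    """d_M^2(x_i, x_j) = (x_i - x_j)^T Sigma^{-1} (x_i - x_j)."""
    diff = xi - xj
    return float(diff @ np.linalg.solve(Sigma, diff))

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(minkowski(xi, xj, q=1), minkowski(xi, xj, q=2), mahalanobis_sq(xi, xj, Sigma))
```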
36
2. SIMILARITY MEASURES
Similarity functions:
They compare two vectors.
Invariant to rotation and dilation.
Not invariant to translation or to general linear
transformations.

s(\mathbf{x}_i,\mathbf{x}_j) = \frac{\mathbf{x}_i^{T}\mathbf{x}_j}{\|\mathbf{x}_i\|\,\|\mathbf{x}_j\|}
37
2. SIMILARITY MEASURES
If the clusters that are found are later used for a
classification problem, either the metric (distance)
or the similarity function is used as the
classification criterion.
38
3. CRITERION FUNCTIONS
Criterion functions for clustering:
Initial set D = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}.
Partition it into exactly c subsets D_1, D_2, \dots, D_c.
Objective: to find the partition that extremizes the
criterion function.
39
3. CRITERION FUNCTIONS
3.1 Sum-of-squared-error criterion:
\mathbf{m}_i is the best representative of the samples in D_i.
It is appropriate when the clusters form compact clouds
with a fairly uniform number of samples per cluster.

J_e = \sum_{i=1}^{c}\sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\mathbf{m}_i\|^{2}
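A minimal sketch of evaluating J_e for a given partition; the data points and labels below are illustrative assumptions.

```python
import numpy as np

def sum_squared_error(X, labels):
    """J_e = sum_i sum_{x in D_i} ||x - m_i||^2, with m_i the mean of cluster D_i."""
    Je = 0.0
    for i in np.unique(labels):
        Di = X[labels == i]
        mi = Di.mean(axis=0)                 # best representative of D_i
        Je += np.sum((Di - mi) ** 2)
    return Je

X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9]])
labels = np.array([0, 0, 1, 1])              # illustrative partition
print(sum_squared_error(X, labels))
```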
40
3. CRITERION FUNCTIONS
Related minimum variance criterion: J_e can be rewritten as

J_e = \frac{1}{2}\sum_{i=1}^{c} n_i\,\bar{s}_i,
\qquad
\bar{s}_i = \frac{1}{n_i^{2}}\sum_{\mathbf{x}\in D_i}\sum_{\mathbf{x}'\in D_i}\|\mathbf{x}-\mathbf{x}'\|^{2}

Suggestion to obtain other criterion functions: replace \bar{s}_i by, e.g.,

\bar{s}_i = \frac{1}{n_i^{2}}\sum_{\mathbf{x}\in D_i}\sum_{\mathbf{x}'\in D_i} s(\mathbf{x},\mathbf{x}'); \qquad
\bar{s}_i = \max_{\mathbf{x},\mathbf{x}'\in D_i} s(\mathbf{x},\mathbf{x}'); \qquad
\bar{s}_i = \min_{\mathbf{x},\mathbf{x}'\in D_i} s(\mathbf{x},\mathbf{x}')
41
3. CRITERION FUNCTIONS
3.2 Scatter Criteria:
Mean Vectors and Scatter matrices used in
clustering criteria
Mean vector of the i-th cluster
Total mean vector
Scatter matrix of the i-th cluster
Within-cluster scatter matrix
Between-cluster scatter matrix
Total scatter matrix
\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}

\mathbf{m} = \frac{1}{n}\sum_{\mathbf{x}\in D}\mathbf{x} = \frac{1}{n}\sum_{i=1}^{c} n_i\,\mathbf{m}_i

\mathbf{S}_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^{T}

\mathbf{S}_W = \sum_{i=1}^{c}\mathbf{S}_i

\mathbf{S}_B = \sum_{i=1}^{c} n_i\,(\mathbf{m}_i-\mathbf{m})(\mathbf{m}_i-\mathbf{m})^{T}

\mathbf{S}_T = \sum_{\mathbf{x}\in D}(\mathbf{x}-\mathbf{m})(\mathbf{x}-\mathbf{m})^{T} = \mathbf{S}_W + \mathbf{S}_B
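A short numpy sketch that computes S_W, S_B and S_T for a labeled partition and verifies S_T = S_W + S_B; the data and labels are illustrative assumptions.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-cluster (S_W), between-cluster (S_B) and total (S_T) scatter matrices."""
    m = X.mean(axis=0)                                   # total mean vector
    d = X.shape[1]
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for i in np.unique(labels):
        Di = X[labels == i]
        mi = Di.mean(axis=0)                             # cluster mean vector
        SW += (Di - mi).T @ (Di - mi)                    # sum of the S_i over clusters
        SB += len(Di) * np.outer(mi - m, mi - m)
    ST = (X - m).T @ (X - m)                             # equals S_W + S_B
    return SW, SB, ST

X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.8]])
labels = np.array([0, 0, 1, 1])
SW, SB, ST = scatter_matrices(X, labels)
print(np.allclose(ST, SW + SB), np.trace(SW), np.trace(SB))
```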
42
3. CRITERION FUNCTIONS
3.2 Scatter Criteria: TRACE CRITERION
It measures the square of the scattering radius.
Minimizing the trace of the within-cluster scatter
matrix yields the criterion function J_e.
It is equivalent to maximizing the trace of the
between-cluster scatter matrix.
\operatorname{Tr}[\mathbf{S}_W] = \sum_{i=1}^{c}\operatorname{Tr}[\mathbf{S}_i]
= \sum_{i=1}^{c}\sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\mathbf{m}_i\|^{2} = J_e

\operatorname{Tr}[\mathbf{S}_B] = \sum_{i=1}^{c} n_i\,\|\mathbf{m}_i-\mathbf{m}\|^{2}
\qquad
\operatorname{Tr}[\mathbf{S}_T] = \operatorname{Tr}[\mathbf{S}_W] + \operatorname{Tr}[\mathbf{S}_B]
43
3. CRITERION FUNCTIONS
3.2 Scatter Criteria: DETERMINANT CRITERION
It measures the square of the scattering volume.
\mathbf{S}_B is singular if c \le d: \operatorname{rank}(\mathbf{S}_B) \le c-1.
\mathbf{S}_W is singular if n-c < d.
Assuming n > d+c, the criterion does not change if the axes are scaled:

J_d = |\mathbf{S}_W| = \left|\sum_{i=1}^{c}\mathbf{S}_i\right|
44
3. CRITERION FUNCTIONS
3.2 Scatter Criteria: Invariant Criteria
The eigenvalues of \mathbf{S}_W^{-1}\mathbf{S}_B are invariant to
nonsingular linear transformations of the data.
Proposed criteria (they are equivalent for c = 2):
\operatorname{Tr}\!\left[\mathbf{S}_W^{-1}\mathbf{S}_B\right] = \sum_{i=1}^{d}\lambda_i \quad (\text{to maximize});
\qquad
\frac{|\mathbf{S}_W|}{|\mathbf{S}_T|} = \prod_{i=1}^{d}\frac{1}{1+\lambda_i};
\qquad
J_f = \operatorname{Tr}\!\left[\mathbf{S}_T^{-1}\mathbf{S}_W\right] = \sum_{i=1}^{d}\frac{1}{1+\lambda_i} \quad (\text{to minimize})
45
3. CRITERION FUNCTIONS
3.2 Scatter Criteria: Invariant Criteria
Derivation:
\lambda_1,\dots,\lambda_i,\dots,\lambda_d:\ \text{eigenvalues of } \mathbf{S}_W^{-1}\mathbf{S}_B
\qquad
\frac{1}{1+\lambda_1},\dots,\frac{1}{1+\lambda_i},\dots,\frac{1}{1+\lambda_d}:\ \text{eigenvalues of } \mathbf{S}_T^{-1}\mathbf{S}_W

\mathbf{S}_B\mathbf{v}_i = \lambda_i\,\mathbf{S}_W\mathbf{v}_i
\;\Rightarrow\;
\mathbf{S}_T\mathbf{v}_i = \mathbf{S}_B\mathbf{v}_i + \mathbf{S}_W\mathbf{v}_i = (1+\lambda_i)\,\mathbf{S}_W\mathbf{v}_i
\;\Rightarrow\;
\mathbf{v}_i = (1+\lambda_i)\,\mathbf{S}_T^{-1}\mathbf{S}_W\mathbf{v}_i
\;\Rightarrow\;
\mathbf{S}_T^{-1}\mathbf{S}_W\mathbf{v}_i = \frac{1}{1+\lambda_i}\,\mathbf{v}_i
46
3. CRITERION FUNCTIONS
3.2 Scatter Criterion: Invariant Criteria
Trace Criteria.
Determinant Criteria.
Invariant Criteria.
47
CLUSTERING PROCEDURES
CONCLUSIONS
Underlying model: it assumes that the samples form c
fairly well separated clouds of points.
\mathbf{S}_W measures the compactness of these clouds.
___________________________________________
Problem: evaluating every possible partitioning of the
samples is computationally impracticable.
48
4 ITERATIVE OPTIMIZATION
Direct partitioning: approximately c^n / c! possible partitions.
Practical solution:
Start from some reasonable initial partition and
move samples from one group to another if such a
move improves the value of the criterion function.
It guarantees local, but not global, optimization.
49
4 ITERATIVE OPTIMIZATION
Iterative improvement to minimize the sum-of-squared-error
criterion J_e.
Effective error per cluster: J_i.
A sample \hat{\mathbf{x}} is moved from cluster D_i to cluster D_j:
J_e = \sum_{i=1}^{c} J_i; \qquad J_i = \sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\mathbf{m}_i\|^{2}

\hat{\mathbf{x}}: D_i \to D_j:
\qquad
\mathbf{m}_j^{*} = \mathbf{m}_j + \frac{\hat{\mathbf{x}}-\mathbf{m}_j}{n_j+1}, \quad n_j^{*} = n_j+1;
\qquad
\mathbf{m}_i^{*} = \mathbf{m}_i - \frac{\hat{\mathbf{x}}-\mathbf{m}_i}{n_i-1}, \quad n_i^{*} = n_i-1
50
4 ITERATIVE OPTIMIZATION
Increase / decrease of the effective error per cluster
(prove as an exercise):

J_j^{*} = \sum_{\mathbf{x}\in D_j}\|\mathbf{x}-\mathbf{m}_j^{*}\|^{2} + \|\hat{\mathbf{x}}-\mathbf{m}_j^{*}\|^{2}
= J_j + \frac{n_j}{n_j+1}\,\|\hat{\mathbf{x}}-\mathbf{m}_j\|^{2}
51
4 ITERATIVE OPTIMIZATION
Increase / decrease of the effective error per cluster
(prove as an exercise):

J_i^{*} = \sum_{\mathbf{x}\in D_i,\ \mathbf{x}\neq\hat{\mathbf{x}}}\|\mathbf{x}-\mathbf{m}_i^{*}\|^{2}
= J_i - \frac{n_i}{n_i-1}\,\|\hat{\mathbf{x}}-\mathbf{m}_i\|^{2}
52
4 ITERATIVE OPTIMIZATION
Moving the sample from cluster i to cluster j
is advantageous if

\frac{n_i}{n_i-1}\,\|\hat{\mathbf{x}}-\mathbf{m}_i\|^{2} > \frac{n_j}{n_j+1}\,\|\hat{\mathbf{x}}-\mathbf{m}_j\|^{2}
53
4 ITERATIVE OPTIMIZATION
BASIC ITERATIVE MINIMUM SQUARED
ERROR CLUSTERING
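The algorithm box for this slide did not survive extraction, so the sketch below reconstructs the basic iterative minimum-squared-error procedure from the single-sample transfer test derived on the previous slides; the initial partition, the synthetic data and the pass limit are illustrative assumptions.

```python
import numpy as np

def iterative_mse_clustering(X, labels, n_passes=10):
    """Single-sample iterative improvement of J_e: move x from cluster i to j
    when n_j/(n_j+1)*||x-m_j||^2 < n_i/(n_i-1)*||x-m_i||^2."""
    c = labels.max() + 1
    mus = np.array([X[labels == i].mean(axis=0) for i in range(c)])
    ns = np.array([np.sum(labels == i) for i in range(c)])
    for _ in range(n_passes):
        changed = False
        for k, x in enumerate(X):
            i = labels[k]
            if ns[i] <= 1:
                continue                                  # do not empty a cluster
            d2 = np.sum((mus - x) ** 2, axis=1)
            rho = ns / (ns + 1.0) * d2                    # cost of adding x to each cluster
            rho[i] = ns[i] / (ns[i] - 1.0) * d2[i]        # gain of removing x from its own cluster
            j = int(np.argmin(rho))
            if j != i:                                    # transfer is advantageous
                mus[j] += (x - mus[j]) / (ns[j] + 1)      # incremental mean updates
                mus[i] -= (x - mus[i]) / (ns[i] - 1)
                ns[j] += 1
                ns[i] -= 1
                labels[k] = j
                changed = True
        if not changed:
            break
    return labels, mus

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 0.6, (100, 2)), rng.normal([3, 3], 0.6, (100, 2))])
labels = rng.integers(0, 2, size=len(X))                  # arbitrary initial partition
labels, mus = iterative_mse_clustering(X, labels)
print(mus)
```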
54
4 ITERATIVE OPTIMIZATION
55
7 CONCLUSIONS
When the underlying distribution comes from a
mixture of component densities described by a set of
unknown parameters, these parameters can be
estimated by Bayesian or ML (EM algorithm) methods.
Clustering is a more general approach.
56
7 CONCLUSIONS: OTHER TOPICS
Hierarchical methods to reveal clusters and
sub-clusters: taxonomy.
Estimation of the number of clusters.
Self-Organizing Feature Maps (SOFM): they preserve
neighborhoods while reducing dimensionality
(Kohonen maps).
57
Laboratory Classes
Practical 0: Exploration of the Brain and Gauss databases.
Practical 1: Application of MAP methods (ldc, qdc) to GAUSS.
Practical 2: Application of MAP methods (ldc, qdc) to PHONEME
and SPAM.
Practical 3: Application of PCA and MDA to GAUSS.
Practical 4: ICA as blind source separation of audio sources.
Practical 5: k-Nearest Neighbour on ZIP.
(Practical 6: Linear discriminant (LMS-MMSE and Perceptron) on
GAUSS and ZIP.)
Practical 7: (NN, Decision Trees and K-means)
MULTILAYER NEURAL NETWORKS, TREE CLASSIFIERS and
UNSUPERVISED methods applied to PET and Magnetic
Resonance BRAIN Images.