
1

Topic 5:
Unsupervised Learning:
CLUSTERING
February-May 2005
2
SUPERVISED METHODS:
LABELED Data Base
Block diagram:
Labeled data base, divided into train and test sets ->
training the algorithm (determining the function) ->
evaluating the classifier.
Choose the algorithm: MAP, ML, K-Nearest, LD, SVC, NN, Tree, ...
Reducing the space dimension d: feature selection
(independent of the machine learning algorithm) or linear
methods such as PCA, MDA, ICA.
3
UNSUPERVISED METHODS:
Non-LABELED Data Base
Block diagram:
Non-labeled data base -> choose the algorithm and
initialize the clusters -> E-step: classify the samples ->
M-step: update the parameters or evaluate the criterion
function (iterate until convergence).
Reducing the space dimension d: feature selection
(independent of the machine learning algorithm) or linear
methods such as PCA, ICA.
4
LOOKING FOR STRUCTURE INSIDE
THE DATA
Parametric methods: they assume some p.d.f. for the clusters.
Non-parametric methods: formal clustering procedures.
5
INDEX (Parametric Methods)
1 MIXTURE DENSITIES AND
IDENTIFIABILITY
2 MAXIMUM LIKELIHOOD
ESTIMATES: EM
3 K-Means Clustering
6
1 MIXTURE DENSITIES AND
IDENTIFIABILITY
Assumptions:
1. The samples come from a known number c of classes.
2. The prior probabilities of each class are known (mixing parameters).
3. The forms of the class-conditional probability densities are known.
4. The values of the parameters are unknown.
5. The category labels are unknown: UNSUPERVISED.
\{\Pr(\omega_j)\},\quad j = 1..c \qquad \text{(known priors)}

\{f(\mathbf{x}\mid\omega_j,\boldsymbol{\theta}_j)\},\quad j = 1..c \qquad \text{(known form, unknown parameters }\boldsymbol{\theta}_j\text{)}
7
1 MIXTURE DENSITIES AND
IDENTIFIABILITY
MIXTURE DENSITY:
1. For the moment it is assumed that only the parameter vector \boldsymbol{\theta} is unknown.
2. Necessary condition for identifiability:
f(\mathbf{x}\mid\boldsymbol{\theta}) = \sum_{j=1}^{c} f(\mathbf{x}\mid\omega_j,\boldsymbol{\theta}_j)\,\Pr(\omega_j),
\qquad
\boldsymbol{\theta} = \begin{pmatrix}\boldsymbol{\theta}_1\\ \vdots\\ \boldsymbol{\theta}_c\end{pmatrix}

\boldsymbol{\theta} \neq \boldsymbol{\theta}' \;\Rightarrow\; \exists\,\mathbf{x}:\; f(\mathbf{x}\mid\boldsymbol{\theta}) \neq f(\mathbf{x}\mid\boldsymbol{\theta}')
8
1 MIXTURE DENSITIES AND
IDENTIFIABILITY
Example: identifiability problem:
BINARY (SYMMETRIC) CHANNEL
x \in \{0,1\}; \qquad \Pr(\text{bit}=0) = P_0, \quad \Pr(\text{bit}=1) = P_1, \quad P_0 + P_1 = 1

\Pr(x=1\mid\omega_0) = \theta_0, \qquad \Pr(x=0\mid\omega_0) = 1-\theta_0

\Pr(x=1\mid\omega_1) = \theta_1, \qquad \Pr(x=0\mid\omega_1) = 1-\theta_1
9
1 MIXTURE DENSITIES AND
IDENTIFIABILITY
Example of the identifiability problem:
BINARY (SYMMETRIC) CHANNEL
Parameter vector and mixture probability:

\boldsymbol{\theta} = \begin{pmatrix}\theta_0\\ \theta_1\end{pmatrix},
\qquad
\Pr(x\mid\boldsymbol{\theta}) = P_0\,\theta_0^{\,x}(1-\theta_0)^{1-x} + P_1\,\theta_1^{\,x}(1-\theta_1)^{1-x}

Since x only takes the values 0 and 1, the mixture is fully determined by
\Pr(x=1\mid\boldsymbol{\theta}) = P_0\theta_0 + P_1\theta_1: different pairs (\theta_0,\theta_1) can give exactly the same mixture, so \boldsymbol{\theta} cannot be recovered uniquely (not identifiable).
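To see this numerically, here is a small Python check (not part of the original slides; the parameter values are illustrative assumptions): two different parameter vectors with the same value of P_0 θ_0 + P_1 θ_1 produce exactly the same mixture distribution.

```python
import numpy as np

def mixture_prob(x, P0, theta0, theta1):
    """Pr(x | theta) = P0*theta0^x*(1-theta0)^(1-x) + P1*theta1^x*(1-theta1)^(1-x)."""
    P1 = 1.0 - P0
    return (P0 * theta0**x * (1 - theta0)**(1 - x)
            + P1 * theta1**x * (1 - theta1)**(1 - x))

x = np.array([0, 1])
# Two different parameter vectors with the same P0*theta0 + P1*theta1:
print(mixture_prob(x, P0=0.5, theta0=0.2, theta1=0.8))   # [0.5 0.5]
print(mixture_prob(x, P0=0.5, theta0=0.4, theta1=0.6))   # [0.5 0.5] -> indistinguishable
```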
10
2 MAXIMUM LIKELIHOOD
ESTIMATES
Likelihood of the statistically independent observed samples.
Assuming statistical independence between \boldsymbol{\theta}_i and \boldsymbol{\theta}_j, the ML solution is one of the multiple solutions of:
D = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}

f(D\mid\boldsymbol{\theta}) = \prod_{k=1}^{n} f(\mathbf{x}_k\mid\boldsymbol{\theta}),
\qquad
l = \ln f(D\mid\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln f(\mathbf{x}_k\mid\boldsymbol{\theta})

\sum_{k=1}^{n} \Pr(\omega_i\mid\mathbf{x}_k,\boldsymbol{\theta})\,\nabla_{\boldsymbol{\theta}_i} \ln f(\mathbf{x}_k\mid\omega_i,\boldsymbol{\theta}_i) = 0, \quad i = 1..c

\text{where}\quad f(\mathbf{x}_k\mid\boldsymbol{\theta}) = \sum_{j=1}^{c} f(\mathbf{x}_k\mid\omega_j,\boldsymbol{\theta}_j)\,\Pr(\omega_j)
11
2 MAXIMUM LIKELIHOOD
ESTIMATES
Derivation:

\nabla_{\boldsymbol{\theta}_i} l
= \sum_{k=1}^{n} \frac{1}{f(\mathbf{x}_k\mid\boldsymbol{\theta})}\,\nabla_{\boldsymbol{\theta}_i} f(\mathbf{x}_k\mid\boldsymbol{\theta})
= \sum_{k=1}^{n} \frac{1}{f(\mathbf{x}_k\mid\boldsymbol{\theta})}\,\nabla_{\boldsymbol{\theta}_i}\!\left[\sum_{j=1}^{c} f(\mathbf{x}_k\mid\omega_j,\boldsymbol{\theta}_j)\Pr(\omega_j)\right]

= \sum_{k=1}^{n} \frac{\Pr(\omega_i)}{f(\mathbf{x}_k\mid\boldsymbol{\theta})}\,\nabla_{\boldsymbol{\theta}_i} f(\mathbf{x}_k\mid\omega_i,\boldsymbol{\theta}_i)
= \sum_{k=1}^{n} \Pr(\omega_i\mid\mathbf{x}_k,\boldsymbol{\theta})\,\nabla_{\boldsymbol{\theta}_i} \ln f(\mathbf{x}_k\mid\omega_i,\boldsymbol{\theta}_i) = 0, \quad i = 1..c

\text{using}\quad
\Pr(\omega_i\mid\mathbf{x}_k,\boldsymbol{\theta}) = \frac{f(\mathbf{x}_k\mid\omega_i,\boldsymbol{\theta}_i)\,\Pr(\omega_i)}{f(\mathbf{x}_k\mid\boldsymbol{\theta})}
12
2 MAXIMUM LIKELIHOOD
ESTIMATES
Generalizing to the case of unknown prior probabilities
(no derivation included here):
1. Compute the prior probability estimates.
2. Compute the parameter vector estimates.
3. Compute the class posterior probabilities.
\hat{\Pr}(\omega_i) = \frac{1}{n}\sum_{k=1}^{n} \hat{\Pr}(\omega_i\mid\mathbf{x}_k,\hat{\boldsymbol{\theta}})

\sum_{k=1}^{n} \hat{\Pr}(\omega_i\mid\mathbf{x}_k,\hat{\boldsymbol{\theta}})\,\nabla_{\boldsymbol{\theta}_i} \ln f(\mathbf{x}_k\mid\omega_i,\hat{\boldsymbol{\theta}}_i) = 0, \quad i = 1..c

\hat{\Pr}(\omega_i\mid\mathbf{x}_k,\hat{\boldsymbol{\theta}})
= \frac{f(\mathbf{x}_k\mid\omega_i,\hat{\boldsymbol{\theta}}_i)\,\hat{\Pr}(\omega_i)}{\sum_{j=1}^{c} f(\mathbf{x}_k\mid\omega_j,\hat{\boldsymbol{\theta}}_j)\,\hat{\Pr}(\omega_j)}
13
2 MAXIMUM LIKELIHOOD
ESTIMATES
For Gaussian Distributions:
Parameters to estimate:
\ln f(\mathbf{x}_k\mid\omega_i,\boldsymbol{\theta}_i)
= -\ln\!\left((2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}\right)
  - \tfrac{1}{2}\,(\mathbf{x}_k-\boldsymbol{\mu}_i)^{T}\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_k-\boldsymbol{\mu}_i)

\boldsymbol{\theta}_i = (\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \quad i = 1..c
14
2 MAXIMUM LIKELIHOOD
ESTIMATES
ML is solved by applying the SOFT Expectation-Maximization (EM) algorithm:
soft assignment. Iterations stop when the p.d.f. does not vary.
1. Expectation (E-step)
2. Maximization (M-step)
M-step:

\hat{\Pr}(\omega_i) = \frac{1}{n}\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k);
\qquad
\hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k)\,\mathbf{x}_k}{\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k)};
\qquad
\hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k)\,(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_i)(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_i)^{T}}{\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k)}

E-step:

\hat{\Pr}(\omega_i\mid\mathbf{x}_k)
= \frac{|\hat{\boldsymbol{\Sigma}}_i|^{-1/2}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_i)^{T}\hat{\boldsymbol{\Sigma}}_i^{-1}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_i)\right)\hat{\Pr}(\omega_i)}
       {\sum_{j=1}^{c} |\hat{\boldsymbol{\Sigma}}_j|^{-1/2}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_j)^{T}\hat{\boldsymbol{\Sigma}}_j^{-1}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_j)\right)\hat{\Pr}(\omega_j)}
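The update rules above map directly onto code. Below is a minimal numpy sketch of soft EM for a Gaussian mixture; the synthetic data, the initialisation, the small regularisation term added to the covariances and the stopping tolerance are assumptions made for illustration, not choices taken from the slides.

```python
import numpy as np

def em_gmm(X, c, n_iter=100, tol=1e-6, seed=0):
    """Soft EM for a c-component Gaussian mixture on data X (n x d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    priors = np.full(c, 1.0 / c)                      # Pr(omega_i)
    mus = X[rng.choice(n, c, replace=False)]          # initial means: random samples
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(c)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posteriors Pr(omega_i | x_k)
        resp = np.empty((n, c))
        for i in range(c):
            diff = X - mus[i]
            expo = -0.5 * np.sum(diff @ np.linalg.inv(covs[i]) * diff, axis=1)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[i]))
            resp[:, i] = priors[i] * np.exp(expo) / norm
        ll = np.sum(np.log(resp.sum(axis=1)))         # log-likelihood for the stop test
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update priors, means and covariances (ML estimates)
        Ni = resp.sum(axis=0)
        priors = Ni / n
        mus = (resp.T @ X) / Ni[:, None]
        for i in range(c):
            diff = X - mus[i]
            covs[i] = (resp[:, i, None] * diff).T @ diff / Ni[i] + 1e-6 * np.eye(d)
        if abs(ll - prev_ll) < tol:                   # stop when the p.d.f. no longer varies
            break
        prev_ll = ll
    return priors, mus, covs, resp

# Synthetic 2-D data from two clouds (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)), rng.normal([4, 4], 1.0, (200, 2))])
priors, mus, covs, resp = em_gmm(X, c=2)
print(priors, mus, sep="\n")
```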
15
2 MAXIMUM LIKELIHOOD
ESTIMATES
Problem: the starting point. Brain images;
full data base vs. labeled data base with Pr > 0.95.
Initial parameters for i = 1, 2, 3: \hat{\boldsymbol{\mu}}_i (dimension d\times 1), \hat{\boldsymbol{\Sigma}}_i (dimension d\times d), and the priors \hat{\Pr}(\omega_i).
16
2 MAXIMUM LIKELIHOOD
ESTIMATES
E-step: for a given \mathbf{x}_k, estimate the posteriors:

\hat{\Pr}(\omega_i\mid\mathbf{x}_k)
= \frac{f(\mathbf{x}_k\mid\omega_i,\hat{\boldsymbol{\theta}}_i)\,\hat{\Pr}(\omega_i)}{\sum_{j=1}^{c} f(\mathbf{x}_k\mid\omega_j,\hat{\boldsymbol{\theta}}_j)\,\hat{\Pr}(\omega_j)}

M-step: the parameters are updated (ML estimation).
17
3. K-Means Clustering
HARD classification: a simplification of the ML (EM)
estimates for a multivariate normal (optimal for Case 1
of the multivariate Gaussian seen with MAP).
Centroid:
\boldsymbol{\Sigma}_i = \sigma_i^2\,\mathbf{I}

\hat{\Pr}(\omega_i\mid\mathbf{x}_k) =
\begin{cases}
1, & d_e(\mathbf{x}_k,\hat{\boldsymbol{\mu}}_i) < d_e(\mathbf{x}_k,\hat{\boldsymbol{\mu}}_j)\ \ \forall\, j\neq i\\
0, & \text{otherwise}
\end{cases}

\hat{\Pr}(\omega_i) = \frac{n_i}{n}; \qquad
\hat{\boldsymbol{\mu}}_i = \frac{1}{n_i}\sum_{k=1}^{n}\hat{\Pr}(\omega_i\mid\mathbf{x}_k)\,\mathbf{x}_k
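A minimal numpy sketch of the hard-assignment (K-means) simplification above; the synthetic data, the value of c and the convergence test are illustrative assumptions.

```python
import numpy as np

def kmeans(X, c, n_iter=100, seed=0):
    """Hard EM simplification: assign each x_k to the nearest centroid, then update."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), c, replace=False)]     # cluster initialisation
    for _ in range(n_iter):
        # E-step (hard): Pr(omega_i | x_k) = 1 for the nearest centroid, 0 otherwise
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # M-step: centroid = mean of the samples assigned to the cluster
        new_mus = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mus[i]
                            for i in range(c)])
        if np.allclose(new_mus, mus):
            break
        mus = new_mus
    return mus, labels

# Illustrative run on synthetic 2-D data.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.5, (150, 2)), rng.normal([3, 3], 0.5, (150, 2))])
mus, labels = kmeans(X, c=2)
print(mus)
```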
18
3. K-Means Clustering
K-Means Clustering
19
3. K-Means Clustering
K-Means Clustering
20
3. K-Means Clustering
K-Means Clustering
21
3. K-Means Clustering
K-Means Clustering
22
Brain Images
23
Brain Images: K-Means
Different
Starting
Points
24
Brain Images: Expectation-
Maximization
Different
Starting
Points
25
Brain Images: NN
26
3. K-Means Clustering
APPLICATION: vector quantization of an n-dimensional real-valued
vector. See Proakis, Digital Communications, Chapter 3:
Source Coding.
FUZZY K-Means: soft classification. b is a free blending
parameter (b > 1).
J_{\text{Fuzzy}} = \sum_{i=1}^{c}\sum_{j=1}^{n}\left[\hat{\Pr}(\omega_i\mid\mathbf{x}_j)\right]^{b} d_e^{\,2}(\mathbf{x}_j,\boldsymbol{\mu}_i)

\boldsymbol{\mu}_i = \frac{\sum_{k=1}^{n}\left[\hat{\Pr}(\omega_i\mid\mathbf{x}_k)\right]^{b}\mathbf{x}_k}{\sum_{k=1}^{n}\left[\hat{\Pr}(\omega_i\mid\mathbf{x}_k)\right]^{b}}
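A short numpy sketch of fuzzy K-means. The centroid update follows the equation above; the membership update used here is the standard fuzzy c-means rule, which the slide does not spell out explicitly, so treat that formula (and the synthetic data) as assumptions.

```python
import numpy as np

def fuzzy_kmeans(X, c, b=2.0, n_iter=100, seed=0):
    """Soft (fuzzy) K-means: memberships raised to the blending power b > 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                 # memberships Pr(omega_i | x_k)
    for _ in range(n_iter):
        W = U ** b
        mus = (W.T @ X) / W.sum(axis=0)[:, None]      # centroid update from the slide
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # Standard fuzzy c-means membership update (assumed, not from the slide):
        inv = d2 ** (-1.0 / (b - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.allclose(U_new, U, atol=1e-6):
            break
        U = U_new
    return mus, U

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)), rng.normal([3, 0], 0.5, (100, 2))])
mus, U = fuzzy_kmeans(X, c=2)
print(mus)
```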
27
INDEX: Formal Clustering
Procedures
1 INTRODUCTION:
FORMAL CLUSTERING PROCEDURES
2 SIMILARITY MEASURES
3 CRITERION FUNCTIONS
4 ITERATIVE OPTIMIZATION
5 CONCLUSIONS
28
1. INTRODUCTION
Clusters may form clouds of points in a d-dimensional
space.
Normal distribution: the sample mean and the sample
covariance matrix form a sufficient statistic.
Sample mean m: locates the center of gravity of the
cloud; it best represents all of the data in the sense
of minimizing the sum of squared distances from m to
the samples.
Sample covariance matrix C: describes how much the
data scatter along the various directions around m.
29
1. INTRODUCTION
The sample mean vector and the sample covariance matrix
are not a sufficient statistic in the general case:
there exist different distributions with identical mean and covariance:
\mathbf{m} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k
\qquad
\mathbf{C} = \frac{1}{N-1}\sum_{k=1}^{N}(\mathbf{x}_k-\mathbf{m})(\mathbf{x}_k-\mathbf{m})^{T}
30
1. INTRODUCTION
Formal clustering procedures: two key steps.
Data are grouped into clusters, i.e. groups of data
points that possess strong internal similarities.
A criterion function is defined and the grouping that
extremizes it is sought. To evaluate a partitioning of
the set of samples into clusters, the similarity
between samples must be measured.
31
2. SIMILARITY MEASURES
Similarity is measured using a distance between
samples.
Example: the Euclidean distance d(\mathbf{x}_i, \mathbf{x}_j).
Two samples belong to the same cluster if
d(\mathbf{x}_i, \mathbf{x}_j) < d_0.
The threshold d_0 is critical.
d_e(\mathbf{x}_i,\mathbf{x}_j) = \|\mathbf{x}_i-\mathbf{x}_j\| = \left(\sum_{n=1}^{d}\left[x_i[n]-x_j[n]\right]^{2}\right)^{1/2}
32
2. SIMILARITY MEASURES
Distance threshold affects the number and
size of clusters:
typical within-cluster distance < d_0 < typical between-cluster distance
33
2. SIMILARITY MEASURES
Euclidean distance d_{ij}:
Clusters are invariant to rotation.
Clusters are invariant to translation.
Clusters are, in general, not invariant to linear
transformations.
34
2. SIMILARITY MEASURES
Normalization prior to clustering:
Each feature is translated to have zero mean.
Each feature is scaled to have unit variance.
(These two actions are also recommended with neural
nets.)
PCA, Principal Component Analysis (the axes coincide
with the eigenvectors of the sample covariance
matrix).
AFTER NORMALIZATION AND PCA, CLUSTERS
ARE INVARIANT TO DISPLACEMENTS, SCALE
CHANGES AND ROTATIONS.
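A brief numpy sketch of the normalisation and PCA rotation described above (zero mean, unit variance, then projection onto the eigenvectors of the sample covariance matrix); the sample data is an illustrative assumption.

```python
import numpy as np

def normalize_and_pca(X):
    """Translate each feature to zero mean, scale to unit variance,
    then rotate onto the eigenvectors of the sample covariance matrix (PCA)."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # zero mean, unit variance
    C = np.cov(Xc.T)                                    # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)                # axes = eigenvectors of C
    order = np.argsort(eigvals)[::-1]                   # sort by decreasing variance
    return Xc @ eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.3, 0.0],
                                          [0.0, 1.0, 0.5],
                                          [0.0, 0.0, 0.2]])
Z, variances = normalize_and_pca(X)
print(variances)
```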
35
2. SIMILARITY MEASURES
Other metrics:
Minkowski distance:
d_q(\mathbf{x}_i,\mathbf{x}_j) = \left(\sum_{n=1}^{d}\left|x_i[n]-x_j[n]\right|^{q}\right)^{1/q}
Mahalanobis distance:
d_M^{\,2}(\mathbf{x}_i,\mathbf{x}_j) = (\mathbf{x}_i-\mathbf{x}_j)^{T}\,\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\mathbf{x}_j)
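The two metrics translate directly into a few lines of numpy; the example vectors and the covariance matrix below are illustrative assumptions.

```python
import numpy as np

def minkowski(xi, xj, q):
    """d_q(x_i, x_j) = (sum_n |x_i[n] - x_j[n]|^q)^(1/q); q = 2 gives Euclidean."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

def mahalanobis_sq(xi, xj, Sigma):
    """d_M^2(x_i, x_j) = (x_i - x_j)^T Sigma^{-1} (x_i - x_j)."""
    diff = xi - xj
    return float(diff @ np.linalg.solve(Sigma, diff))

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(minkowski(xi, xj, q=1), minkowski(xi, xj, q=2), mahalanobis_sq(xi, xj, Sigma))
```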
36
2. SIMILARITY MEASURES
Similarity functions:
They compare two vectors.
Invariant to rotation and dilation.
Not invariant to translation or to general linear
transformations.

s(\mathbf{x}_i,\mathbf{x}_j) = \frac{\mathbf{x}_i^{T}\mathbf{x}_j}{\|\mathbf{x}_i\|\,\|\mathbf{x}_j\|}
37
2. SIMILARITY MEASURES
If the clusters that are found are later used for a
classification problem, either the metric (distance)
or the similarity function is used as the
classification criterion.
38
3. CRITERION FUNCTIONS
Criterion functions for clustering:
Initial set D = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}.
Partition it into exactly c subsets D_1, D_2, \dots, D_c.
Objective: to find the partition that extremizes the
criterion function.
39
3. CRITERION FUNCTIONS
3.1 Sum-of-squared-error criterion:
\mathbf{m}_i is the best representative of the samples in D_i.
It is appropriate when the clusters form compact clouds
with a fairly uniform number of samples per cluster.

J_e = \sum_{i=1}^{c}\sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\mathbf{m}_i\|^{2}
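A minimal sketch of evaluating J_e for a given partition; the data points and labels below are illustrative assumptions.

```python
import numpy as np

def sum_squared_error(X, labels):
    """J_e = sum_i sum_{x in D_i} ||x - m_i||^2, with m_i the mean of cluster D_i."""
    Je = 0.0
    for i in np.unique(labels):
        Di = X[labels == i]
        mi = Di.mean(axis=0)                 # best representative of D_i
        Je += np.sum((Di - mi) ** 2)
    return Je

X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9]])
labels = np.array([0, 0, 1, 1])              # illustrative partition
print(sum_squared_error(X, labels))
```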
40
3. CRITERION FUNCTIONS
Related minimum variance criterion: J_e can be rewritten as

J_e = \frac{1}{2}\sum_{i=1}^{c} n_i\,\bar{s}_i,
\qquad
\bar{s}_i = \frac{1}{n_i^{2}}\sum_{\mathbf{x}\in D_i}\sum_{\mathbf{x}'\in D_i}\|\mathbf{x}-\mathbf{x}'\|^{2}

Suggestion to obtain other criterion functions: replace \bar{s}_i by, e.g.,

\bar{s}_i = \frac{1}{n_i^{2}}\sum_{\mathbf{x}\in D_i}\sum_{\mathbf{x}'\in D_i} s(\mathbf{x},\mathbf{x}'); \qquad
\bar{s}_i = \max_{\mathbf{x},\mathbf{x}'\in D_i} s(\mathbf{x},\mathbf{x}'); \qquad
\bar{s}_i = \min_{\mathbf{x},\mathbf{x}'\in D_i} s(\mathbf{x},\mathbf{x}')
41
3. CRITERION FUNCTIONS
3.2 Scatter Criteria:
Mean Vectors and Scatter matrices used in
clustering criteria
Mean vector of the i-th cluster
Total mean vector
Scatter matrix of the i-th cluster
Within-cluster scatter matrix
Between-cluster scatter matrix
Total scatter matrix
\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}

\mathbf{m} = \frac{1}{n}\sum_{\mathbf{x}\in D}\mathbf{x} = \frac{1}{n}\sum_{i=1}^{c} n_i\,\mathbf{m}_i

\mathbf{S}_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^{T}

\mathbf{S}_W = \sum_{i=1}^{c}\mathbf{S}_i

\mathbf{S}_B = \sum_{i=1}^{c} n_i\,(\mathbf{m}_i-\mathbf{m})(\mathbf{m}_i-\mathbf{m})^{T}

\mathbf{S}_T = \sum_{\mathbf{x}\in D}(\mathbf{x}-\mathbf{m})(\mathbf{x}-\mathbf{m})^{T} = \mathbf{S}_W + \mathbf{S}_B
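A short numpy sketch that computes S_W, S_B and S_T for a labeled partition and verifies S_T = S_W + S_B; the data and labels are illustrative assumptions.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-cluster (S_W), between-cluster (S_B) and total (S_T) scatter matrices."""
    m = X.mean(axis=0)                                   # total mean vector
    d = X.shape[1]
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for i in np.unique(labels):
        Di = X[labels == i]
        mi = Di.mean(axis=0)                             # cluster mean vector
        SW += (Di - mi).T @ (Di - mi)                    # sum of the S_i over clusters
        SB += len(Di) * np.outer(mi - m, mi - m)
    ST = (X - m).T @ (X - m)                             # equals S_W + S_B
    return SW, SB, ST

X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.8]])
labels = np.array([0, 0, 1, 1])
SW, SB, ST = scatter_matrices(X, labels)
print(np.allclose(ST, SW + SB), np.trace(SW), np.trace(SB))
```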
42
3. CRITERION FUNCTIONS
3.2 Scatter Criteria: TRACE CRITERION
It measures the square of the scattering radius.
Minimizing the trace of the within-cluster scatter
matrix yields the criterion function J_e.
It is equivalent to maximizing the trace of the
between-cluster scatter matrix.
\operatorname{Tr}[\mathbf{S}_W] = \sum_{i=1}^{c}\operatorname{Tr}[\mathbf{S}_i]
= \sum_{i=1}^{c}\sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\mathbf{m}_i\|^{2} = J_e

\operatorname{Tr}[\mathbf{S}_B] = \sum_{i=1}^{c} n_i\,\|\mathbf{m}_i-\mathbf{m}\|^{2}
\qquad
\operatorname{Tr}[\mathbf{S}_T] = \operatorname{Tr}[\mathbf{S}_W] + \operatorname{Tr}[\mathbf{S}_B]
43
3. CRITERION FUNCTIONS
3.2 Scatter Criteria: DETERMINANT CRITERION
It measures the square of the scattering volume.
\mathbf{S}_B is singular if c \le d: \operatorname{rank}(\mathbf{S}_B) \le c-1.
\mathbf{S}_W is singular if n-c < d.
Assuming n > d+c, the criterion does not change if the axes are scaled:

J_d = |\mathbf{S}_W| = \left|\sum_{i=1}^{c}\mathbf{S}_i\right|
44
3. CRITERION FUNCTIONS
3.2 Scatter Criteria: Invariant Criteria
The eigenvalues of \mathbf{S}_W^{-1}\mathbf{S}_B are invariant to
nonsingular linear transformations of the data.
Proposed criteria (they are equivalent for c = 2):
\operatorname{Tr}\!\left[\mathbf{S}_W^{-1}\mathbf{S}_B\right] = \sum_{i=1}^{d}\lambda_i \quad (\text{to maximize});
\qquad
\frac{|\mathbf{S}_W|}{|\mathbf{S}_T|} = \prod_{i=1}^{d}\frac{1}{1+\lambda_i};
\qquad
J_f = \operatorname{Tr}\!\left[\mathbf{S}_T^{-1}\mathbf{S}_W\right] = \sum_{i=1}^{d}\frac{1}{1+\lambda_i} \quad (\text{to minimize})
45
3. CRITERION FUNCTIONS
3.2 Scatter Criteria: Invariant Criteria
Derivation:
\lambda_1,\dots,\lambda_i,\dots,\lambda_d:\ \text{eigenvalues of } \mathbf{S}_W^{-1}\mathbf{S}_B
\qquad
\frac{1}{1+\lambda_1},\dots,\frac{1}{1+\lambda_i},\dots,\frac{1}{1+\lambda_d}:\ \text{eigenvalues of } \mathbf{S}_T^{-1}\mathbf{S}_W

\mathbf{S}_B\mathbf{v}_i = \lambda_i\,\mathbf{S}_W\mathbf{v}_i
\;\Rightarrow\;
\mathbf{S}_T\mathbf{v}_i = \mathbf{S}_B\mathbf{v}_i + \mathbf{S}_W\mathbf{v}_i = (1+\lambda_i)\,\mathbf{S}_W\mathbf{v}_i
\;\Rightarrow\;
\mathbf{v}_i = (1+\lambda_i)\,\mathbf{S}_T^{-1}\mathbf{S}_W\mathbf{v}_i
\;\Rightarrow\;
\mathbf{S}_T^{-1}\mathbf{S}_W\mathbf{v}_i = \frac{1}{1+\lambda_i}\,\mathbf{v}_i
46
3. CRITERION FUNCTIONS
3.2 Scatter Criterion: Invariant Criteria
Trace Criteria.
Determinant Criteria.
Invariant Criteria.
47
CLUSTERING PROCEDURES
CONCLUSIONS
Underlying model: it assumes that the samples form c
fairly well separated clouds of points.
\mathbf{S}_W measures the compactness of these clouds.
___________________________________________
Problem: evaluating every possible partitioning of the
samples is computationally impracticable.
48
4 ITERATIVE OPTIMIZATION
Direct partitioning: approximately c^n / c! possible partitions.
Practical solution:
Start from some reasonable initial partition and
move samples from one group to another if such a
move improves the value of the criterion function.
It guarantees local, but not global, optimization.
49
4 ITERATIVE OPTIMIZATION
Iterative improvement to minimize the sum-of-squared-error
criterion J_e.
Effective error per cluster: J_i.
A sample \hat{\mathbf{x}} is moved from cluster D_i to cluster D_j:
J_e = \sum_{i=1}^{c} J_i; \qquad J_i = \sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\mathbf{m}_i\|^{2}

\hat{\mathbf{x}}: D_i \to D_j:
\qquad
\mathbf{m}_j^{*} = \mathbf{m}_j + \frac{\hat{\mathbf{x}}-\mathbf{m}_j}{n_j+1}, \quad n_j^{*} = n_j+1;
\qquad
\mathbf{m}_i^{*} = \mathbf{m}_i - \frac{\hat{\mathbf{x}}-\mathbf{m}_i}{n_i-1}, \quad n_i^{*} = n_i-1
50
4 ITERATIVE OPTIMIZATION
Increase / decrease of the effective error per cluster
(prove as an exercise):

J_j^{*} = \sum_{\mathbf{x}\in D_j}\|\mathbf{x}-\mathbf{m}_j^{*}\|^{2} + \|\hat{\mathbf{x}}-\mathbf{m}_j^{*}\|^{2}
= J_j + \frac{n_j}{n_j+1}\,\|\hat{\mathbf{x}}-\mathbf{m}_j\|^{2}
51
4 ITERATIVE OPTIMIZATION
Increase / decrease of the effective error per cluster
(prove as an exercise):

J_i^{*} = \sum_{\mathbf{x}\in D_i,\ \mathbf{x}\neq\hat{\mathbf{x}}}\|\mathbf{x}-\mathbf{m}_i^{*}\|^{2}
= J_i - \frac{n_i}{n_i-1}\,\|\hat{\mathbf{x}}-\mathbf{m}_i\|^{2}
52
4 ITERATIVE OPTIMIZATION
Moving the sample from cluster i to cluster j
is advantageous if

\frac{n_i}{n_i-1}\,\|\hat{\mathbf{x}}-\mathbf{m}_i\|^{2} > \frac{n_j}{n_j+1}\,\|\hat{\mathbf{x}}-\mathbf{m}_j\|^{2}
53
4 ITERATIVE OPTIMIZATION
BASIC ITERATIVE MINIMUM SQUARED
ERROR CLUSTERING
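The algorithm box for this slide did not survive extraction, so the sketch below reconstructs the basic iterative minimum-squared-error procedure from the single-sample transfer test derived on the previous slides; the initial partition, the synthetic data and the pass limit are illustrative assumptions.

```python
import numpy as np

def iterative_mse_clustering(X, labels, n_passes=10):
    """Single-sample iterative improvement of J_e: move x from cluster i to j
    when n_j/(n_j+1)*||x-m_j||^2 < n_i/(n_i-1)*||x-m_i||^2."""
    c = labels.max() + 1
    mus = np.array([X[labels == i].mean(axis=0) for i in range(c)])
    ns = np.array([np.sum(labels == i) for i in range(c)])
    for _ in range(n_passes):
        changed = False
        for k, x in enumerate(X):
            i = labels[k]
            if ns[i] <= 1:
                continue                                  # do not empty a cluster
            d2 = np.sum((mus - x) ** 2, axis=1)
            rho = ns / (ns + 1.0) * d2                    # cost of adding x to each cluster
            rho[i] = ns[i] / (ns[i] - 1.0) * d2[i]        # gain of removing x from its own cluster
            j = int(np.argmin(rho))
            if j != i:                                    # transfer is advantageous
                mus[j] += (x - mus[j]) / (ns[j] + 1)      # incremental mean updates
                mus[i] -= (x - mus[i]) / (ns[i] - 1)
                ns[j] += 1
                ns[i] -= 1
                labels[k] = j
                changed = True
        if not changed:
            break
    return labels, mus

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 0.6, (100, 2)), rng.normal([3, 3], 0.6, (100, 2))])
labels = rng.integers(0, 2, size=len(X))                  # arbitrary initial partition
labels, mus = iterative_mse_clustering(X, labels)
print(mus)
```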
54
4 ITERATIVE OPTIMIZATION
55
7 CONCLUSIONS
When the underlying distribution comes from a
mixture of component densities described by a set of
unknown parameters, these parameters can be
estimated by Bayesian or ML (EM algorithm) methods.
Clustering is a more general approach.
56
7 CONCLUSIONS: OTHER TOPICS
Hierarchical methods to reveal clusters and
sub-clusters: taxonomy.
Estimation of the number of clusters.
Self-Organizing Feature Maps (SOFM): they preserve
neighborhoods while reducing dimensionality
(Kohonen maps).
57
Laboratory Classes
Practical 0: Exploration of the Brain and Gauss databases.
Practical 1: Application of MAP methods (ldc, qdc) to GAUSS.
Practical 2: Application of MAP methods (ldc, qdc) to PHONEME
and SPAM.
Practical 3: Application of PCA and MDA to GAUSS.
Practical 4: ICA as blind source separation of audio sources.
Practical 5: k-Nearest Neighbour on ZIP.
(Practical 6: Linear discriminant (LMS-MMSE and Perceptron) on
GAUSS and ZIP.)
Practical 7: (NN, Decision Trees and K-means)
MULTILAYER NEURAL NETWORKS, TREE CLASSIFIERS and
UNSUPERVISED methods applied to PET and Magnetic
Resonance BRAIN Images.