Luis Tari
Motivation
Why clustering?
Introduction
Clustering groups data points that are close (or similar) to each other and identifies such groupings (or clusters) in an unsupervised manner.
Unsupervised: no information is provided to the algorithm on which data points belong to which clusters.
Example
[scatter plot omitted: data points marked "x" forming visible groupings]
Hypotheses:
Genes with similar expression patterns across conditions suggest coexpression of these genes.
Coexpressed genes can imply shared regulation or related biological function.
Purpose of clustering on microarray data
Suppose an uncharacterized gene clusters with genes of known function: its coexpression with them suggests a related role, so clusters can point at candidate functions.
But the inference is not guaranteed; housekeeping genes, for example, are expressed under most conditions and can co-cluster without sharing a specific function.
In practice, clustering microarray data supports:
visualization of data
hypothesis generation
Overview of clustering
Outline of discussion
Various clustering algorithms:
hierarchical
k-means
k-medoid
fuzzy c-means
Different similarity measures
Hierarchical clustering
There are two general strategies:
Agglomerative (bottom-up): start with every data point in its own cluster and repeatedly merge the two closest clusters.
Divisive (top-down): start with all data points in one cluster and recursively split clusters apart.
Hierarchical clustering
Agglomerative algorithm:
1. Assign each data point to its own cluster and compute the matrix of pairwise distances.
2. Find the two closest clusters.
3. Merge them into a single cluster.
4. Recompute the distances between the new cluster and each remaining cluster.
5. Repeat steps 2–4 until all data points belong to one cluster.
Hierarchical clustering: example
Hierarchical clustering: example using single linkage
Single linkage: the distance between two clusters is the minimum distance between any member of one and any member of the other.
[worked example omitted: pairwise distance matrix and the dendrogram produced by successive merges]
Hierarchical clustering: forming clusters
Forming clusters: cut the dendrogram at a chosen distance threshold; each subtree below the cut becomes one cluster.
Hierarchical clustering
Advantages: no need to specify the number of clusters in advance; the dendrogram exposes cluster structure at every level of granularity.
Disadvantages: computationally expensive for large data sets; merge (or split) decisions are greedy and cannot be undone later.
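As a concrete illustration, the agglomerative procedure with single linkage can be sketched in pure Python (a minimal 1-D version on a made-up toy data set; the function names and data are illustrative assumptions, not from the slides):

```python
# Minimal agglomerative clustering sketch with single linkage
# (pure Python, 1-D points for brevity; toy data is hypothetical).

def single_linkage(clusters, a, b):
    """Single linkage: minimum distance between any pair of members."""
    return min(abs(x - y) for x in clusters[a] for y in clusters[b])

def agglomerate(points, target_k):
    # Step 1: each point starts in its own cluster.
    clusters = [[p] for p in points]
    # Steps 2-5: repeatedly merge the two closest clusters.
    while len(clusters) > target_k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters, ij[0], ij[1]),
        )
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters

print(agglomerate([1, 2, 9, 10, 25], 2))  # prints [[1, 2, 9, 10], [25]]
```

Stopping at a target number of clusters plays the same role as cutting the dendrogram at a threshold: both truncate the merge sequence early.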
K-means
1. Choose k initial centroids (for example, k data points picked at random).
2. Assign each data point to the cluster whose centroid is nearest.
3. Recompute each centroid as the mean of the data points assigned to it.
4. Repeat steps 2 and 3 until a stopping criterion is met.
K-means
Stopping criteria:
no change in the members of all clusters, or
the squared error falls below some small threshold value Θ
Squared error se:
se = Σi=1..k Σp∈Ci |p − mi|²
where Ci is the i-th cluster and mi is its centroid
Stop when se(j) < Θ, where se(j) is the squared error at iteration j.
Properties of k-means
Guaranteed to converge
Guaranteed to reach a local optimum, not necessarily the global optimum
Example:
http://www.kdnuggets.com/dmcourse/data_mining_course/mod-13-clustering.ppt
K-means
Pros:
Low complexity: O(nkt), where t = #iterations
Cons:
Necessity of specifying k
Sensitive to noise and outlier data points (even a small number of outliers can substantially influence the mean value)
Clusters are sensitive to the initial assignment of centroids:
k-means is not a deterministic algorithm
clusters can be inconsistent from one run to another
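The assign/recompute loop described above can be sketched in a few lines of Python (a minimal 1-D version; the data set and initial centroids are illustrative assumptions):

```python
# Minimal k-means sketch (pure Python, 1-D points, fixed initial centroids).
# Data and starting centroids are hypothetical toy values.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Step 2: assign each point to the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its members
        # (keep the old centroid if a cluster ends up empty).
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 2.0, 9.0, 10.0], [0.0, 5.0])
print(centroids)  # prints [1.5, 9.5]
```

With these starting centroids the run is deterministic; a different initialization could land in a different local optimum, which is the sensitivity noted in the cons above.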
Fuzzy c-means
An extension of k-means.
Hierarchical clustering and k-means generate partitions: each data point belongs to exactly one cluster.
Fuzzy c-means instead gives each data point a degree of membership uij ∈ [0, 1] in every cluster j.
1. Initialize the membership matrix U(0) = [uij].
2. At step k, compute the centroids C(k) = [cj]:
cj = Σi=1..n (uij)^m · xi / Σi=1..n (uij)^m
where m is the fuzzy parameter and n is the number of data points.
3. Update the memberships U(k) to U(k+1):
uij = 1 / Σl=1..nc ( |xi − cj| / |xi − cl| )^(2/(m−1))
where nc is the number of clusters.
4. If ||U(k+1) − U(k)|| < ε, stop; otherwise return to step 2.
5. For each data point gi, assign gi to cluster clj if uij of U(k) is the largest of gi's membership values.
Fuzzy c-means
Pros: a data point can belong to several clusters with graded membership, which suits genes that participate in more than one biological process.
Cons: the number of clusters and the fuzzy parameter m must still be chosen; like k-means, the result is sensitive to initialization and noise.
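One possible sketch of the centroid and membership updates above (pure Python, 1-D points, m = 2; the data set, initial membership matrix, and iteration count are illustrative assumptions):

```python
# Fuzzy c-means sketch (pure Python, 1-D, m = 2); toy inputs are hypothetical.

def update_centroids(points, U, m=2.0):
    # c_j = sum_i u_ij^m x_i / sum_i u_ij^m
    return [
        sum(U[i][j] ** m * points[i] for i in range(len(points)))
        / sum(U[i][j] ** m for i in range(len(points)))
        for j in range(len(U[0]))
    ]

def update_memberships(points, centroids, m=2.0):
    # u_ij = 1 / sum_l (|x_i - c_j| / |x_i - c_l|)^(2/(m-1))
    U = []
    for x in points:
        d = [max(abs(x - c), 1e-12) for c in centroids]  # guard div-by-zero
        U.append([
            1.0 / sum((d[j] / d[l]) ** (2.0 / (m - 1.0)) for l in range(len(d)))
            for j in range(len(d))
        ])
    return U

points = [1.0, 2.0, 9.0, 10.0]
U = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]   # initial memberships
for _ in range(20):                                    # iterate until settled
    centroids = update_centroids(points, U)
    U = update_memberships(points, centroids)
# Step 5: hard assignment = cluster with the largest membership.
print([max(range(2), key=lambda j: u[j]) for u in U])  # prints [0, 0, 1, 1]
```

A fixed iteration count stands in for the ||U(k+1) − U(k)|| < ε test of step 4, purely to keep the sketch short.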
Similarity measures
How do we quantify how close (or similar) two data points are?
Let g1 = (x1, …, xn) and g2 = (y1, …, yn) be two expression profiles.
Distance measures:
Euclidean distance
d(g1, g2) = √( Σi=1..n (xi − yi)² )
Manhattan distance
d(g1, g2) = Σi=1..n |xi − yi|
Minkowski distance
d(g1, g2) = ( Σi=1..n |xi − yi|^m )^(1/m)
Correlation distance
Correlation:
rxy = Cov(X, Y) / √( Var(X) · Var(Y) )
Variance:
Var(X) = Σi=1..n (xi − X̄)² / (n − 1)
Covariance:
Cov(X, Y) = Σi=1..n (xi − X̄)(yi − Ȳ) / (n − 1)
Positive covariance: X and Y tend to increase and decrease together.
Negative covariance: X tends to increase when Y decreases, and vice versa.
The correlation distance is commonly defined as d(g1, g2) = 1 − rxy.
Summary of similarity measures
g1 = (1, 2, 3, 4, 5)
g2 = (100, 200, 300, 400, 500)
g3 = (5, 4, 3, 2, 1)
Which genes are similar according to the two different measures?
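A quick check of this example in Python, using the Euclidean distance and Pearson correlation as defined earlier (the helper names are my own):

```python
import math

def euclidean(g1, g2):
    # d(g1, g2) = sqrt(sum (x_i - y_i)^2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(g1, g2)))

def correlation(g1, g2):
    # r_xy = Cov(X, Y) / sqrt(Var(X) * Var(Y)), with n-1 denominators
    n = len(g1)
    mx, my = sum(g1) / n, sum(g2) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(g1, g2)) / (n - 1)
    vx = sum((x - mx) ** 2 for x in g1) / (n - 1)
    vy = sum((y - my) ** 2 for y in g2) / (n - 1)
    return cov / math.sqrt(vx * vy)

g1, g2, g3 = (1, 2, 3, 4, 5), (100, 200, 300, 400, 500), (5, 4, 3, 2, 1)
print(euclidean(g1, g2), euclidean(g1, g3))      # g1 far from g2, near g3
print(correlation(g1, g2), correlation(g1, g3))  # prints 1.0 -1.0
```

So the answer flips with the measure: by Euclidean distance g1 is similar to g3, while by correlation g1 is perfectly similar to g2 (r = 1) and perfectly anti-correlated with g3 (r = −1).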
Validity of clusters
Why assess the validity of clusters?
Measuring cluster validity:
optimality of clusters
verification of the biological meaning of clusters
Optimality of clusters
Optimal clusters should be compact (small distances within each cluster) and well separated (large distances between clusters).
Example of an intracluster measure:
Squared error se
se = Σi=1..k Σp∈Ci |p − mi|²
where Ci is the i-th cluster and mi is its centroid
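The squared error can be computed directly from the cluster members (a pure Python sketch; the 1-D toy clusters are illustrative assumptions):

```python
def squared_error(clusters):
    # se = sum over clusters of sum over members of |p - m_i|^2,
    # where m_i is the centroid (mean) of cluster C_i.
    se = 0.0
    for c in clusters:
        m = sum(c) / len(c)
        se += sum((p - m) ** 2 for p in c)
    return se

print(squared_error([[1.0, 2.0], [9.0, 10.0]]))  # prints 1.0  (tight clusters)
print(squared_error([[1.0, 9.0], [2.0, 10.0]]))  # prints 64.0 (mixed clusters)
```

The contrast illustrates the intracluster criterion: the compact grouping scores far lower than the mixed one.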