The Goals of Unsupervised Learning
The Challenge of Unsupervised Learning
Another advantage
Principal Components Analysis
Principal Components Analysis: details
The first principal component of a set of features X1, X2, ..., Xp is the normalized linear combination of the features

Z1 = φ11 X1 + φ21 X2 + ... + φp1 Xp

that has the largest variance. By normalized, we mean that Σ_{j=1}^p φj1² = 1.
We refer to the elements φ11, ..., φp1 as the loadings of the first principal component; together, the loadings make up the principal component loading vector, φ1 = (φ11 φ21 ... φp1)^T.
We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.
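As a concrete sketch (synthetic data and NumPy only; the variable names are ours, not part of the slides), the first loading vector can be obtained as the top eigenvector of the sample covariance matrix, which satisfies the sum-of-squares constraint automatically because eigenvectors come back with unit norm:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 observations on 3 correlated features, column-centered.
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])
X = X - X.mean(axis=0)

# The first loading vector phi_1 is the top eigenvector of the
# sample covariance matrix (1/n) X^T X.
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
phi1 = eigvecs[:, -1]

print(np.isclose(np.sum(phi1**2), 1.0))  # True: loadings are normalized
z1 = X @ phi1                            # scores on the first component
print(np.isclose(z1.var(), eigvals[-1])) # True: Var(Z1) = top eigenvalue
```

Note that a loading vector is only determined up to sign: both φ1 and -φ1 satisfy the same maximization problem.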
PCA: example
[Figure: scatterplot of ad spending (y axis, 0 to 35) against population (x axis, 10 to 70), with both principal component directions overlaid.]
The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component direction, and the blue dashed line indicates the second principal component direction.
Computation of Principal Components
We look for the linear combination of the sample feature values of the form

zi1 = φ11 xi1 + φ21 xi2 + ... + φp1 xip,   for i = 1, ..., n,

that has the largest sample variance, subject to the constraint that Σ_{j=1}^p φj1² = 1.
Since each of the xij has mean zero, so does zi1 (for any values of the φj1). Hence the sample variance of the zi1 can be written as (1/n) Σ_{i=1}^n zi1².
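As a numerical check of this computation (a sketch on synthetic, centered data; names are illustrative), the scores zi1 and their sample variance can be read directly off the singular value decomposition of the data matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)          # center each feature, as the slide assumes

# SVD route: rows of Vt are the loading vectors, U * S are the scores.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]                    # first loading vector
z1 = X @ phi1                   # scores z_i1, i = 1, ..., n

# Sample variance (1/n) sum_i z_i1^2 equals s_1^2 / n,
# the largest squared singular value over n.
n = len(X)
print(np.isclose((z1**2).sum() / n, S[0]**2 / n))  # True
```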
Computation: continued
Geometry of PCA
Further principal components
Further principal components: continued
Illustration
USArrests data: PCA plot
[Figure: biplot of the first two principal components of the USArrests data. Each state is plotted by its first and second principal component scores, and the loading vectors for Murder, Assault, UrbanPop, and Rape are overlaid.]
Figure details
PCA loadings
PC1 PC2
Murder 0.5358995 -0.4181809
Assault 0.5831836 -0.1879856
UrbanPop 0.2781909 0.8728062
Rape 0.5434321 0.1673186
Another Interpretation of Principal Components
PCA finds the hyperplane closest to the observations
Scaling of the variables matters
If the variables are in different units, scaling each to have
standard deviation equal to one is recommended.
If they are in the same units, you might or might not scale
the variables.
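To illustrate (a sketch on simulated data, with a hypothetical 100-to-1 difference in units), scaling changes which direction the first principal component picks out:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Two strongly correlated features measured in very different units:
# the second is (roughly) the first on a scale 100x larger.
base = rng.normal(size=n)
X = np.column_stack([base + 0.1 * rng.normal(size=n),
                     100 * base + 10 * rng.normal(size=n)])
X = X - X.mean(axis=0)

def first_pc(M):
    # First principal component loading vector via SVD.
    return np.linalg.svd(M, full_matrices=False)[2][0]

phi_raw = first_pc(X)                   # unscaled: units drive the result
phi_std = first_pc(X / X.std(axis=0))   # scaled to unit standard deviation

print(np.round(np.abs(phi_raw), 3))  # nearly all weight on feature 2
print(np.round(np.abs(phi_std), 3))  # weight split roughly evenly
```

The unscaled loading vector is dominated by whichever feature happens to have the largest variance, which is exactly the behavior the slide warns about.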
[Figure: two biplots of the USArrests data, one with the variables scaled to have unit standard deviation ("Scaled") and one without scaling ("Unscaled"). In the unscaled biplot the first loading vector places almost all of its weight on Assault, the variable with the largest variance.]
Proportion Variance Explained
To understand the strength of each component, we are
interested in knowing the proportion of variance explained
(PVE) by each one.
The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as

Σ_{j=1}^p Var(Xj) = Σ_{j=1}^p (1/n) Σ_{i=1}^n xij².
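The PVE of each component is that component's score variance divided by this total; since both involve the same 1/n factor, the ratio can be read off the squared singular values of the centered data matrix. A sketch on synthetic data (illustrative names only):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data: 100 observations on 4 features, centered.
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

# Squared singular values give, up to the common factor 1/n, both each
# component's variance and the total variance, so the 1/n cancels.
s = np.linalg.svd(X, compute_uv=False)
pve = s**2 / np.sum(s**2)        # proportion of variance explained per PC
cum_pve = np.cumsum(pve)

print(np.round(pve, 3))          # nonnegative and decreasing
print(np.round(cum_pve[-1], 3))  # 1.0: the PVEs always sum to one
```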
[Figure: left, the proportion of variance explained by each of the four principal components of the USArrests data; right, the cumulative proportion of variance explained. Both are plotted against the number of components, 1 through 4, on a 0-to-1 scale.]
How many principal components should we use?
Clustering
PCA vs Clustering
Clustering for Market Segmentation
Two clustering methods
K-means clustering
[Figure: K-means clustering results on a simulated data set for K = 2, K = 3, and K = 4.]
Details of K-means clustering: continued
The idea behind K-means clustering is that a good
clustering is one for which the within-cluster variation is as
small as possible.
The within-cluster variation for cluster Ck is a measure
WCV(Ck ) of the amount by which the observations within
a cluster differ from each other.
Hence we want to solve the problem

minimize over C1, ..., CK of  Σ_{k=1}^K WCV(Ck).   (2)
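To make (2) concrete: the most common choice of WCV uses squared Euclidean distance. A small NumPy check on synthetic points (illustrative names, not from the slides) of this quantity, and of a convenient identity relating it to distances from the centroid:

```python
import numpy as np

rng = np.random.default_rng(4)
points = rng.normal(size=(30, 2))   # one hypothetical cluster C_k

def wcv(P):
    # Within-cluster variation with squared Euclidean dissimilarity:
    # (1/|C_k|) * sum over all pairs (i, i') in C_k of ||x_i - x_i'||^2.
    diffs = P[:, None, :] - P[None, :, :]
    return (diffs**2).sum() / len(P)

# Identity: the pairwise form equals twice the sum of squared
# distances to the cluster centroid, which is what K-means reduces.
centroid = points.mean(axis=0)
twice_ss = 2 * ((points - centroid)**2).sum()
print(np.isclose(wcv(points), twice_ss))  # True
```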
K-Means Clustering Algorithm
Properties of the Algorithm
where x̄kj = (1/|Ck|) Σ_{i∈Ck} xij is the mean for feature j in cluster Ck.
However, the algorithm is not guaranteed to give the global minimum. Why not?
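One way to see this: each run converges to a local optimum that depends on the random start. A minimal sketch of Lloyd's algorithm (our own toy implementation on synthetic data, not a library routine), run from several random starts, keeping the best objective:

```python
import numpy as np

def kmeans(X, K, rng, n_iter=50):
    # Step 1: random initial cluster assignment.
    labels = rng.integers(0, K, size=len(X))
    for _ in range(n_iter):
        # Step 2(a): compute centroids (reseed any empty cluster).
        centroids = np.stack([
            X[labels == k].mean(axis=0) if np.any(labels == k)
            else X[rng.integers(len(X))]
            for k in range(K)])
        # Step 2(b): reassign each point to its nearest centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
    # Objective (2): sum over clusters of twice the squared distances
    # to the centroid (the pairwise WCV in its centroid form).
    obj = sum(2 * ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
              for k in range(K) if np.any(labels == k))
    return labels, obj

rng = np.random.default_rng(5)
X = np.concatenate([rng.normal(m, 0.3, size=(20, 2)) for m in (0.0, 3.0, 6.0)])

# Different starts can reach different local minima; keep the best.
objs = [kmeans(X, K=3, rng=rng)[1] for _ in range(10)]
print(min(objs))
```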
Example
[Figure: the progress of the K-means algorithm, with panels labeled Data; Step 1; Iteration 1, Step 2a; and so on. See the next slide for details.]
Details of Previous Figure
The progress of the K-means algorithm with K=3.
Top left: The observations are shown.
Top center: In Step 1 of the algorithm, each observation is
randomly assigned to a cluster.
Top right: In Step 2(a), the cluster centroids are computed.
These are shown as large colored disks. Initially the
centroids are almost completely overlapping because the
initial cluster assignments were chosen at random.
Bottom left: In Step 2(b), each observation is assigned to
the nearest centroid.
Bottom center: Step 2(a) is once again performed, leading
to new cluster centroids.
Bottom right: The results obtained after 10 iterations.
Example: different starting values
[Figure: K-means run from different random starting values; the numbers above the panels (320.9, 235.8, 235.8) are the corresponding values of the objective (2).]
Details of Previous Figure
Hierarchical Clustering
Hierarchical Clustering: the idea
Builds a hierarchy in a bottom-up fashion...
[Figure: schematic of hierarchical clustering merging points A, B, and C in a bottom-up fashion.]
Hierarchical Clustering Algorithm
The approach in words:
Start with each point in its own cluster.
Identify the closest two clusters and merge them.
Repeat.
Ends when all points are in a single cluster.
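The steps above can be sketched directly (a deliberately simple toy on a handful of hypothetical points, recomputing all pairwise distances at every merge; scipy.cluster.hierarchy.linkage is the usual production tool):

```python
import numpy as np

def agglomerative(X, linkage="complete"):
    # Bottom-up: start with each point in its own cluster, then
    # repeatedly merge the closest pair until one cluster remains.
    clusters = [[i] for i in range(len(X))]
    merges = []                  # (cluster_a, cluster_b, merge_height)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = [np.linalg.norm(X[i] - X[j])
                     for i in clusters[a] for j in clusters[b]]
                # Complete linkage: largest pairwise dissimilarity;
                # single linkage: smallest.
                dist = max(d) if linkage == "complete" else min(d)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merges[-1][0] + merges[-1][1])
    return merges

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
merges = agglomerative(X)
print(len(merges))             # 4: n - 1 merges for n = 5 points
print(round(merges[0][2], 3))  # 0.1: the first merge joins the closest pair
```

The recorded merge heights are exactly what a dendrogram plots on its vertical axis.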
Dendrogram
[Figure: a dendrogram for the points A, B, C, D, and E, with merge heights ranging from 0 to 4.]
An Example
[Figure: simulated observations plotted against X1 and X2, and the associated dendrogram (merge heights 0 to 10).]
Details of previous figure
Types of Linkage
Linkage   Description
Complete  Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
Single    Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.
Average   Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
Centroid  Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
Choice of Dissimilarity Measure
So far we have used Euclidean distance.
An alternative is correlation-based distance which considers
two observations to be similar if their features are highly
correlated.
This is an unusual use of correlation, which is normally
computed between variables; here it is computed between
the observation profiles for each pair of observations.
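A minimal sketch of this distance (hypothetical observation profiles; the function name is ours):

```python
import numpy as np

def corr_distance(x, y):
    # Correlation-based distance between two observations: one minus the
    # correlation of their profiles, computed across features.
    return 1.0 - np.corrcoef(x, y)[0, 1]

# Two observations with very different magnitudes but parallel profiles:
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 10 * x + 5                       # perfectly correlated with x
z = np.array([4.0, 3.0, 2.0, 1.0])   # perfectly anti-correlated with x

print(corr_distance(x, y))   # ~0.0: similar under this measure
print(corr_distance(x, z))   # ~2.0: maximally dissimilar
```

Note that x and y are far apart in Euclidean distance yet have correlation distance near zero, which is precisely the behavior the slide describes.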
[Figure: the profiles of three observations (1, 2, and 3) across 20 variables, plotted against variable index, illustrating when correlation-based and Euclidean distance disagree.]
Practical issues
Example: breast cancer microarray study
Fig. 1. Hierarchical clustering of 115 tumor tissues and 7 nonmalignant tissues using the intrinsic gene set. (A) A scaled-down representation of the entire cluster of 534 genes and 122 tissue samples based on similarities in gene expression. (B) Experimental dendrogram showing the clustering of the tumors into five subgroups.
[Page from the source article, reproduced at reduced size; the surrounding column text is clipped in this extraction.]
Fig. 5. Kaplan-Meier analysis of disease outcome in two patient cohorts. (A) Time to development of distant metastasis in the 97 sporadic cases from the van't Veer et al. cohort. Patients were stratified according to the subtypes as shown in Fig. 2B. (B) Overall survival for 72 patients with locally advanced breast cancer in the Norway cohort. The normal-like tumor subgroups were omitted from both data sets in this analysis.
(Source: PNAS, www.pnas.org, doi:10.1073/pnas.0932692100)
Conclusions