
Principal Component Analysis and Matrix Factorizations for Learning

Chris Ding
Lawrence Berkeley National Laboratory Supported by Office of Science, U.S. Dept. of Energy


Many unsupervised learning methods are closely related in a simple way


PCA
NMF
K-means clustering
Spectral Clustering
Semi-supervised classification
Indicator Matrix Quadratic Clustering
Semi-supervised clustering
Outlier detection

Part 1.A. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
Widely used in a large number of different fields
Most widely known as PCA (multivariate statistics)
SVD is the theoretical basis for PCA

Brief history
PCA
Draw a plane closest to the data points (Pearson, 1901)
Retain the most variance (Hotelling, 1933)

SVD
Low-rank approximation (Eckart & Young, 1936)
Practical application / efficient computation (Golub & Kahan, 1965)

Many generalizations


PCA and SVD


Data: n points in p dimensions:  X = (x_1, x_2, \dots, x_n)

Covariance:  C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T

Gram (kernel) matrix:  X^T X = \sum_{k=1}^{p} \lambda_k v_k v_k^T

Principal directions: u_k  (principal axes, subspace)
Principal components: v_k  (projections onto the subspace)

Underlying basis: SVD  X = \sum_{k} \sigma_k u_k v_k^T = U \Sigma V^T
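A minimal NumPy sketch of these relations (the data and variable names are illustrative assumptions, not from the tutorial): the SVD of the centered data matrix gives the same principal directions as the eigendecomposition of the covariance, with \lambda_k = \sigma_k^2.

```python
import numpy as np

# Illustrative data: n points in p dimensions, stored as columns of X
rng = np.random.default_rng(0)
p, n = 5, 200
X = rng.normal(size=(p, n))

# Center each variable so the covariance is C = X X^T (up to a 1/n factor)
Xc = X - X.mean(axis=1, keepdims=True)

# SVD X = U S V^T: columns of U are the principal directions u_k,
# columns of V (scaled by the singular values) carry the principal components
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# The eigendecomposition of the covariance gives the same directions,
# with eigenvalues lambda_k = sigma_k^2
lam, _ = np.linalg.eigh(Xc @ Xc.T)
assert np.allclose(np.sort(lam)[::-1], s ** 2)
```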

Further Developments
SVD/PCA
Principal Curves
Independent Component Analysis
Sparse SVD/PCA (many approaches)
Mixture of Probabilistic PCA
Generalization to the exponential family, max-margin
Connection to K-means clustering

Kernel (inner-product)
Kernel PCA


Methods of PCA Utilization


Principal components (uncorrelated random variables): the k-th component is the linear combination
u_k(1) X_1 + \dots + u_k(d) X_d  of the original variables, with data X = (x_1, x_2, \dots, x_n) and SVD  X = \sum_k \sigma_k u_k v_k^T = U \Sigma V^T

Dimension reduction: projection onto a low-dimensional subspace
\tilde{X} = U^T X,   U = (u_1, \dots, u_k)

Sphering the data: transform the data to N(0, I)
\tilde{X} = C^{-1/2} X = U \Lambda^{-1/2} U^T X
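A short sketch of the two operations above (synthetic data and shapes are assumptions; \Lambda^{-1/2} is realized through the singular values): projection onto the leading k directions, and sphering so the whitened data have identity second moments.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 10, 500, 3
X = rng.normal(size=(p, n)) * np.arange(1, p + 1)[:, None]   # unequal variances
Xc = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Dimension reduction: project onto the leading k principal directions
X_reduced = U[:, :k].T @ Xc                                   # shape (k, n)

# Sphering: X_tilde = C^{-1/2} X with C = X X^T = U diag(s^2) U^T,
# so C^{-1/2} = U diag(1/s) U^T
X_white = U @ np.diag(1.0 / s) @ U.T @ Xc
assert np.allclose(X_white @ X_white.T, np.eye(p), atol=1e-6)
```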

Applications of PCA/SVD
Most popular in multivariate statistics
Image processing, signal processing
Physics: principal axes, diagonalization of 2nd-order tensors (mass)
Climate: Empirical Orthogonal Functions (EOF)
Kalman filter:  s^{(t+1)} = A s^{(t)} + E,   P^{(t+1)} = A P^{(t)} A^T
Reduced-order analysis

Applications of PCA/SVD
PCA/SVD is as widely used as the Fast Fourier Transform
Both are spectral expansions
FFT is used more for partial differential equations
PCA/SVD is used more for discrete (data) analysis
PCA/SVD surpasses FFT as the computational sciences advance further

PCA/SVD
Selects combinations of variables
Dimension reduction: an image has 10^4 pixels, but its true dimension is ~20!

PCA is a Matrix Factorization (spectral/eigen decomposition)


Principal directions:  U = (u_1, u_2, \dots, u_k)
Principal components:  V = (v_1, v_2, \dots, v_k)

Covariance:  C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T = U \Lambda U^T

Kernel matrix:  X^T X = \sum_{k=1}^{p} \lambda_k v_k v_k^T = V \Lambda V^T

Underlying basis: SVD  X = \sum_{k} \sigma_k u_k v_k^T = U \Sigma V^T

From PCA to spectral clustering using generalized eigenvectors


Consider the kernel matrix:  W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle

In Kernel PCA we compute the eigenvectors:  W v = \lambda v

Generalized eigenvector:  W q = \lambda D q,   D = \mathrm{diag}(d_1, \dots, d_n),   d_i = \sum_j w_{ij}

This leads to Spectral Clustering!
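A minimal sketch of how the generalized eigenproblem is solved in practice (the similarity matrix and data are assumptions): W q = \lambda D q is equivalent to a standard eigenproblem on D^{-1/2} W D^{-1/2}, and the second eigenvector gives a normalized-cut bipartition.

```python
import numpy as np

# Hypothetical similarity (kernel) matrix W for two loose groups of 1-D points
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 0.3, 20), rng.normal(4, 0.3, 20)])
W = np.exp(-(x[:, None] - x[None, :]) ** 2)

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))

# W q = lambda D q  <=>  (D^-1/2 W D^-1/2) z = lambda z,  with q = D^-1/2 z
Wn = D_inv_sqrt @ W @ D_inv_sqrt
vals, vecs = np.linalg.eigh(Wn)

# The largest eigenvector is trivial; the second-largest gives the normalized-cut bipartition
q2 = D_inv_sqrt @ vecs[:, -2]
labels = (q2 > 0).astype(int)
print(labels)
```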

Scaled PCA and Spectral Clustering

PCA:  W = \sum_k \lambda_k v_k v_k^T

Scaled PCA:  W = D^{1/2} \tilde{W} D^{1/2} = D \Big( \sum_k q_k \lambda_k q_k^T \Big) D

where  \tilde{W} = D^{-1/2} W D^{-1/2},   \tilde{w}_{ij} = w_{ij} / (d_i d_j)^{1/2},
and  q_k = D^{-1/2} v_k  is the scaled principal component (v_k an eigenvector of \tilde{W}).

Scaled PCA on a Rectangular Matrix: Correspondence Analysis

Re-scaling:  \tilde{P} = D_r^{-1/2} P D_c^{-1/2},   \tilde{p}_{ij} = p_{ij} / (p_{i.} p_{.j})^{1/2}

Apply SVD on \tilde{P}, then subtract the trivial component:

P - r c^T / p_{..} = D_r \Big( \sum_k f_k \sigma_k g_k^T \Big) D_c

r = (p_{1.}, \dots, p_{n.}),   c = (p_{.1}, \dots, p_{.n})

f_k = D_r^{-1/2} u_k,   g_k = D_c^{-1/2} v_k

are the scaled row and column principal components (standard coordinates in CA).

(Zha et al., CIKM 2001; Ding et al., PKDD 2002)
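A sketch of this recipe on an assumed contingency table (the table values are illustrative only): rescale, take the SVD, and discard the leading trivial singular triple, which for a joint distribution is exactly (\sqrt{r}, \sqrt{c}) with \sigma = 1.

```python
import numpy as np

# Hypothetical contingency table (counts of row category vs. column category)
P = np.array([[30., 10.,  5.],
              [10., 40., 10.],
              [ 5., 10., 30.]])
P = P / P.sum()                         # joint probabilities, so p.. = 1

r = P.sum(axis=1)                       # row marginals p_i.
c = P.sum(axis=0)                       # column marginals p_.j

# Re-scaling: ptilde_ij = p_ij / sqrt(p_i. p_.j)
P_tilde = P / np.sqrt(np.outer(r, c))

U, s, Vt = np.linalg.svd(P_tilde)

# The leading singular triple (sigma = 1, vectors sqrt(r), sqrt(c)) is the trivial component;
# the remaining triples give the CA solution
f = U[:, 1:] / np.sqrt(r)[:, None]      # scaled row components  f_k = D_r^{-1/2} u_k
g = Vt[1:, :].T / np.sqrt(c)[:, None]   # scaled column components g_k = D_c^{-1/2} v_k
print(s[1:])
```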

Nonnegative Matrix Factorization


Data matrix: n points in p dimensions:  X = (x_1, x_2, \dots, x_n),
where each x_i is an image, document, web page, etc.

Decomposition (low-rank approximation):  X \approx F G^T

Nonnegative matrices:  X_{ij} \ge 0,  F_{ij} \ge 0,  G_{ij} \ge 0

F = (f_1, f_2, \dots, f_k),   G = (g_1, g_2, \dots, g_k)

Solving NMF with multiplicative updating


J = || X - F G^T ||^2,   F \ge 0,  G \ge 0

Fix F, solve for G; fix G, solve for F. Lee & Seung (2000) propose:

G_{jk} \leftarrow G_{jk} \frac{(X^T F)_{jk}}{(G F^T F)_{jk}},
F_{ik} \leftarrow F_{ik} \frac{(X G)_{ik}}{(F G^T G)_{ik}}
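A minimal sketch of these multiplicative updates (iteration count, initialization, and the test data are assumptions):

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative updates for X ~ F G^T with nonnegative F, G (Lee & Seung style)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    F = rng.random((p, k))
    G = rng.random((n, k))
    for _ in range(n_iter):
        # Fix F, update G:  G_jk <- G_jk (X^T F)_jk / (G F^T F)_jk
        G *= (X.T @ F) / (G @ F.T @ F + eps)
        # Fix G, update F:  F_ik <- F_ik (X G)_ik / (F G^T G)_ik
        F *= (X @ G) / (F @ G.T @ G + eps)
    return F, G

# Hypothetical nonnegative data: product of two nonnegative factors plus small noise
rng = np.random.default_rng(3)
X = rng.random((20, 2)) @ rng.random((2, 50)) + 0.01 * rng.random((20, 50))
F, G = nmf_multiplicative(X, k=2)
print(np.linalg.norm(X - F @ G.T) / np.linalg.norm(X))   # relative reconstruction error
```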

Matrix Factorization Summary


            Symmetric                                            Rectangular matrix
            (kernel matrix, graph)                               (contingency table, bipartite graph)

PCA:        W = V \Lambda V^T                                    X = U \Sigma V^T

Scaled PCA: W = D^{1/2} \tilde{W} D^{1/2} = D Q \Lambda Q^T D    X = D_r^{1/2} \tilde{X} D_c^{1/2} = D_r F \Sigma G^T D_c

NMF:        W \approx Q Q^T                                      X \approx F G^T

Indicator Matrix Quadratic Clustering


Unsigned cluster indicator matrix H = (h_1, ..., h_K)

Kernel K-means clustering:
max_H Tr(H^T W H),   s.t.  H^T H = I,  H \ge 0

K-means:  W = X^T X;   Kernel K-means:  W = ( \langle \phi(x_i), \phi(x_j) \rangle )

Spectral clustering (normalized cut):
max_H Tr(H^T W H),   s.t.  H^T D H = I,  H \ge 0

The difference between the two is the orthogonality condition on H.

Indicator Matrix Quadratic Clustering


Additional features:

Semi-supervised classification:
max_H Tr(H^T W H + C^T H)

Semi-supervised clustering, with must-link (A) and cannot-link (B) constraints:
max_H Tr(H^T W H + H^T A H - H^T B H)

Outlier detection:
max_H Tr(H^T W H),  allowing zero rows in H

Nonnegative Lagrangian Relaxation:

H_{ik} \leftarrow H_{ik} \left[ \frac{(W H)_{ik} + C_{ik}/2}{(H \Lambda)_{ik}} \right]^{1/2},   \Lambda = H^T W H + H^T C
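A sketch that transcribes the update rule above (initialization, iteration count, and the block-structured test kernel are assumptions, not the tutorial's own experiment):

```python
import numpy as np

def nlr_cluster(W, K, C=None, n_iter=300, eps=1e-9, seed=0):
    """Sketch of the Nonnegative Lagrangian Relaxation update as written on the slide:
       H_ik <- H_ik * sqrt( ((W H)_ik + C_ik/2) / (H Lambda)_ik ),
       Lambda = H^T W H + H^T C."""
    n = W.shape[0]
    C = np.zeros((n, K)) if C is None else C
    H = np.random.default_rng(seed).random((n, K)) + 0.1
    for _ in range(n_iter):
        Lam = H.T @ W @ H + H.T @ C
        H *= np.sqrt((W @ H + C / 2) / (H @ Lam + eps))
    return H

# Hypothetical kernel matrix with two blocks; rows of H indicate cluster membership
W = np.block([[np.full((5, 5), 1.0), np.full((5, 5), 0.1)],
              [np.full((5, 5), 0.1), np.full((5, 5), 1.0)]])
H = nlr_cluster(W, K=2)
print(H.argmax(axis=1))
```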

Tutorial Outline
PCA
  Recent developments on PCA/SVD
  Equivalence to K-means clustering
Scaled PCA
  Laplacian matrix
  Spectral clustering
  Spectral ordering
Nonnegative Matrix Factorization
  Equivalence to K-means clustering
  Holistic vs. parts-based
Indicator Matrix Quadratic Clustering
  Uses Nonnegative Lagrangian Relaxation
  Includes K-means and spectral clustering
  Semi-supervised classification
  Semi-supervised clustering
  Outlier detection

Recent Developments on PCA and SVD


Principal Curves
Independent Component Analysis
Kernel PCA
Mixture of PCA (probabilistic PCA)
Sparse PCA/SVD: semi-discrete, truncation, L1 constraint, direct sparsification
Column-Partitioned Matrix Factorizations
2D-PCA/SVD
Equivalence to K-means clustering

Part 1.B.

PCA and SVD


Data matrix:  X = (x_1, x_2, \dots, x_n)

Covariance:  C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T

Gram (kernel) matrix:  X^T X = \sum_{k=1}^{p} \lambda_k v_k v_k^T

Principal directions: u_k  (principal axes, subspace)
Principal components: v_k  (projections onto the subspace)

Underlying basis: SVD  X = \sum_{k} \sigma_k u_k v_k^T

Kernel PCA
Map  x_i \to \phi(x_i)

Kernel:  K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle

PCA component / feature extraction:  \langle v, \phi(x) \rangle = \sum_i v_i \langle \phi(x_i), \phi(x) \rangle

Indefinite kernels
Generalization to graphs with nonnegative weights

(Scholkopf, Smola, Muller, 1996)
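A minimal kernel PCA sketch (the RBF kernel, gamma value, and test data are assumptions): build the kernel matrix, center it in feature space, and read the component projections off the eigenvectors.

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Kernel PCA sketch with an assumed RBF kernel; X has one point per row.
    Returns the k leading kernel principal component projections of the training points."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                       # K_ij = <phi(x_i), phi(x_j)>
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                # center phi(x) in feature space
    vals, vecs = np.linalg.eigh(Kc)
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    # Projection onto the k-th kernel principal direction is sqrt(lambda_k) * v_k
    return vecs * np.sqrt(np.maximum(vals, 0.0))

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
print(kernel_pca(X, k=2).shape)                   # (50, 2)
```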

Mixture of PCA
Data has local structures.
Global PCA on all data is not useful

Clustering PCA (Hinton et al.):
  Use clustering to partition the data into clusters
  Perform PCA within each cluster
  No explicit generative model

Probabilistic PCA (Tipping & Bishop):
  Latent variables
  Generative model (Gaussian)
  Mixture of Gaussians gives a mixture of PCA
  Adding Markov dynamics for the latent variables (Linear Gaussian Models)

Probabilistic PCA Linear Gaussian Model


Latent variables:  S = (s_1, \dots, s_n)

x_i = W s_i + \mu + \epsilon,   \epsilon \sim N(0, \sigma^2 I)

Gaussian prior:  P(s) \sim N(s_0, \sigma_s^2 I)

Marginal:  x \sim N(W s_0 + \mu, \sigma^2 I + \sigma_s^2 W W^T)

Linear Gaussian Model (adds dynamics on the latent variables):

s_{i+1} = A s_i + \eta,   x_i = W s_i + \epsilon

(Tipping & Bishop, 1995; Roweis & Ghahramani, 1999)

Sparse PCA
Compute a factorization  X \approx U V^T  where U or V is sparse, or both are sparse.

Why sparse?
  Variable selection (sparse U)
  When n >> d
  Storage saving
  Other new reasons?

L1 and L2 constraints

Sparse PCA: Truncation and Discretization


X \approx U V^T,   U = (u_1 \dots u_k),   V = (v_1 \dots v_k)

Sparsified SVD (Zhang, Zha & Simon, 2002):
  Compute {u_k, v_k} one pair at a time, truncating entries below a threshold.
  Recursively compute all pairs using deflation:  X \leftarrow X - \sigma u v^T

Semi-discrete decomposition (Kolda & O'Leary, 1999):
  U, V only contain entries in {-1, 0, 1}
  Iterative algorithm computes U, V using deflation
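A sketch in the spirit of the truncation approach (the relative threshold, power-iteration inner loop, and test matrix are assumptions, not the published algorithm's exact details): one rank-one pair at a time, truncate small entries, deflate, repeat.

```python
import numpy as np

def sparse_svd_truncate(X, k, threshold=0.1, n_power=50, seed=0):
    """Truncation-style sparsified SVD sketch: compute (u, v) pairs one at a time,
    zero out small entries, renormalize, then deflate."""
    X = X.astype(float).copy()
    rng = np.random.default_rng(seed)
    us, vs = [], []
    for _ in range(k):
        u = rng.normal(size=X.shape[0])
        for _ in range(n_power):                 # rank-one pair by power iteration
            v = X.T @ u
            v /= np.linalg.norm(v) + 1e-12
            u = X @ v
            u /= np.linalg.norm(u) + 1e-12
        # Truncate entries below a (relative) threshold, then renormalize
        u[np.abs(u) < threshold * np.abs(u).max()] = 0.0
        v[np.abs(v) < threshold * np.abs(v).max()] = 0.0
        u /= np.linalg.norm(u) + 1e-12
        v /= np.linalg.norm(v) + 1e-12
        s = u @ X @ v
        us.append(u)
        vs.append(v)
        X -= s * np.outer(u, v)                  # deflation
    return np.column_stack(us), np.column_stack(vs)

U, V = sparse_svd_truncate(np.random.default_rng(5).normal(size=(30, 40)), k=3)
print((U == 0).mean(), (V == 0).mean())          # fraction of zeroed entries
```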

Sparse PCA: L1 constraint


LASSO (Tibshirani, 1996):
min || y - X^T \beta ||^2,   || \beta ||_1 \le t

SCoTLASS (Jolliffe & Uddin, 2003):
max  u^T (X X^T) u,   || u ||_1 \le t,   u^T u_h = 0

Least Angle Regression (Efron et al., 2004)

Sparse PCA (Zou, Hastie, Tibshirani, 2004):

min_{\alpha, \beta}  \sum_{i=1}^{n} || x_i - \alpha \beta^T x_i ||^2 + \lambda \sum_{j=1}^{k} || \beta_j ||^2 + \sum_{j=1}^{k} \lambda_{1,j} || \beta_j ||_1,   \alpha^T \alpha = I

v_j = \beta_j / || \beta_j ||

Sparse PCA: Direct Sparsification


Sparse SVD with explicit sparsification (Zhang, Zha & Simon, 2003):

min_{u,v} || X - u d v^T ||_F + \mathrm{nnz}(u) + \mathrm{nnz}(v)

  rank-one approximation; minimize a bound; deflation

Direct sparse PCA, on the covariance matrix S (d'Aspremont, El Ghaoui, Jordan, Lanckriet, 2004):

max  u^T S u = max Tr(S u u^T) = max Tr(S U)

s.t.  Tr(U) = 1,   \mathrm{nnz}(U) \le k^2,   U \succeq 0,   \mathrm{rank}(U) = 1

Sparse PCA Summary


Many different approaches:
  Truncation, discretization
  L1 constraint
  Direct sparsification
  Other approaches

Sparse matrix factorization in general:
  L1 constraints

Many open questions:
  Orthogonality
  Uniqueness of the solution, global solution

PCA: Further Generalizations


Generalization to the Exponential Family
(Collins, Dasgupta, Schapire, 2001)

Maximum Margin Matrix Factorization (Srebro, Rennie, Jaakkola, 2004):

  Collaborative filtering; the input Y is binary
  Hard margin:  Y_{ia} X_{ia} \ge 1,   (i,a) \in S
  Soft margin:  min || X || + c \sum_{(i,a) \in S} \max(0, 1 - Y_{ia} X_{ia})

  X = U V^T,   || X || = \tfrac{1}{2} ( || U ||_{Fro}^2 + || V ||_{Fro}^2 )

Column Partitioned Matrix Factorizations


Column-partitioned data matrix:

X = (x_1, \dots, x_n) = ( x_1 \dots x_{n_1} | x_{n_1+1} \dots x_{n_2} | \dots | x_{n_{k-1}+1} \dots x_n ),   n_1 + \dots + n_k = n

The partitions are generated by clustering. (Zhang & Zha, 2001)

Centroid matrix U = (u_1 \dots u_k), where u_k is the centroid of partition k
(Dhillon & Modha, 2001; Park, Jeon & Rosen, 2003).
Fix U and compute V:

min || X - U V^T ||_F^2   \Rightarrow   V = X^T U (U^T U)^{-1}

Represent each partition by an SVD and pick the leading U's to form

U = (U_1, \dots, U_l) = ( u_1^{(1)} \dots u_{k_1}^{(1)}, \dots, u_1^{(l)} \dots u_{k_l}^{(l)} ),

then fix U and compute V (Castelli, Thomasian & Li, 2003; Zeimpekis & Gallopoulos, 2004).

Several other variations.

Two-dimensional SVD
Many data objects are 2-D: images, maps

Standard method:
  convert (re-order) each image into a 1D vector
  collect all 1D vectors into a single (big) matrix
  apply SVD to the big matrix

2D-SVD is developed for 2D objects:
  extension of the standard SVD
  keeps the 2D characteristics
  improves the quality of the low-dimensional approximation
  reduces computation and storage

Linearize a 2D object into a 1D object

[Figure: the pixel grid of an image is re-ordered into a single pixel vector, e.g. (0.0, 0.5, 0.7, 1.0, ..., 0.8, 0.2, 0.0)]

SVD and 2D-SVD


SVD:  X = (x_1, x_2, \dots, x_n)

  Eigenvectors of  XX^T  and  X^T X

  X = U \Sigma V^T,   \Sigma = U^T X V

2D-SVD:  \{A\} = \{A_1, A_2, \dots, A_n\}

  Eigenvectors of
  F = \sum_i (A_i - \bar{A})(A_i - \bar{A})^T    (row-row covariance)
  G = \sum_i (A_i - \bar{A})^T (A_i - \bar{A})    (column-column covariance)

  M_i = U^T A_i V,   A_i = U M_i V^T

2D-SVD
\{A\} = \{A_1, A_2, \dots, A_n\},   assume \bar{A} = 0

row-row covariance:   F = \sum_i A_i A_i^T = \sum_k \lambda_k u_k u_k^T,    U = (u_1, u_2, \dots, u_k)
col-col covariance:   G = \sum_i A_i^T A_i = \sum_k \zeta_k v_k v_k^T,    V = (v_1, v_2, \dots, v_k)

Bilinear subspace:   M_i = U^T A_i V,   A_i = U M_i V^T,   i = 1, \dots, n

A_i \in R^{r \times c},   U \in R^{r \times k},   V \in R^{c \times k},   M_i \in R^{k \times k}
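A minimal sketch of this construction (the random "images" and sizes are assumptions): build the row-row and column-column covariances, take their leading eigenvectors, and form the bilinear approximation A_i \approx U M_i V^T.

```python
import numpy as np

def two_d_svd(A_list, k):
    """2D-SVD sketch: A_i ~ U M_i V^T, with U, V the leading eigenvectors
    of the row-row and column-column covariances F and G."""
    A = np.stack(A_list)                        # shape (n, r, c)
    Ac = A - A.mean(axis=0)
    F = np.einsum('nij,nkj->ik', Ac, Ac)        # F = sum_i (A_i - Abar)(A_i - Abar)^T
    G = np.einsum('nji,njk->ik', Ac, Ac)        # G = sum_i (A_i - Abar)^T (A_i - Abar)
    _, U = np.linalg.eigh(F)
    _, V = np.linalg.eigh(G)
    U, V = U[:, ::-1][:, :k], V[:, ::-1][:, :k] # leading k eigenvectors of each
    M = np.einsum('ri,nrc,ck->nik', U, A, V)    # M_i = U^T A_i V
    A_rec = np.einsum('ri,nik,ck->nrc', U, M, V)
    return U, V, M, A_rec

# Hypothetical stack of small "images"
rng = np.random.default_rng(6)
images = [rng.random((16, 12)) for _ in range(40)]
U, V, M, rec = two_d_svd(images, k=5)
print(np.linalg.norm(np.stack(images) - rec) / np.linalg.norm(np.stack(images)))
```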

2D-SVD Error Analysis

SVD:   min || X - U V^T ||^2 = \sum_{i=k+1} \sigma_i^2

2D-SVD:   A_i \approx L M_i R^T,   A_i \in R^{r \times c},  L \in R^{r \times k},  R \in R^{c \times k},  M_i \in R^{k \times k}

min J_1 = \sum_{i=1}^{n} || A_i - L M_i ||^2 = \sum_{j=k+1}^{r} \lambda_j

min J_2 = \sum_{i=1}^{n} || A_i - M_i R^T ||^2 = \sum_{j=k+1}^{c} \zeta_j

min J_3 = \sum_{i=1}^{n} || A_i - L M_i R^T ||^2 \approx \sum_{j=k+1}^{r} \lambda_j + \sum_{j=k+1}^{c} \zeta_j

min J_4 = \sum_{i=1}^{n} || A_i - L M_i L^T ||^2 \approx 2 \sum_{j=k+1}^{r} \lambda_j

Temperature maps (January, over 100 years)

[Figure: reconstructed temperature maps]

Reconstruction error ratio SVD/2D-SVD = 1.1;   storage ratio SVD/2D-SVD = 8

Reconstructed image

[Figure: images reconstructed with SVD and with 2D-SVD]

SVD (K=15): storage 160,560
2D-SVD (K=15): storage 93,060

2D-SVD Summary
2D-SVD is an extension of the standard SVD
Provides optimal solutions for 4 representations of 2D images/maps
Substantial improvements in storage, computation, and reconstruction quality
Captures the 2D characteristics

Part 1.C. K-means Clustering and Principal Component Analysis


(Equivalence between PCA and K-means)


K-means clustering
Also called isodata, vector quantization
Developed in the 1960s (Lloyd, MacQueen, Hartigan, etc.)
Computationally efficient (order mN)
Widely used in practice
Benchmark to evaluate other algorithms

Given n points in m dimensions:  X = (x_1, x_2, \dots, x_n)

K-means objective:

min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} || x_i - c_k ||^2
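A minimal Lloyd-style sketch of this objective (initialization and test data are assumptions; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd-style K-means on the columns of X (n points in m dims), minimizing J_K."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    centers = X[:, rng.choice(n, size=K, replace=False)]
    for _ in range(n_iter):
        # Squared distances of every point to every centroid, shape (K, n)
        d2 = ((X[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=0)
        centers = np.column_stack([X[:, labels == k].mean(axis=1) for k in range(K)])
    J = sum(((X[:, labels == k] - centers[:, [k]]) ** 2).sum() for k in range(K))
    return labels, centers, J

# Hypothetical data: two Gaussian blobs as columns of X
rng = np.random.default_rng(7)
X = np.hstack([rng.normal(0, 1, size=(2, 50)), rng.normal(5, 1, size=(2, 50))])
labels, centers, J = kmeans(X, K=2)
print(J)
```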

PCA is equivalent to K-means


The continuous optimal solution for the cluster indicators in K-means clustering is given by the principal components.

The subspace spanned by the K cluster centroids is given by the PCA subspace.

2-way K-means Clustering

Cluster membership indicator:

q(i) =  +\sqrt{n_2 / (n_1 n)}   if i \in C_1
        -\sqrt{n_1 / (n_2 n)}   if i \in C_2

J_K = n \overline{x^2} - \tfrac{1}{2} J_D,    J_D = \frac{n_1 n_2}{n} \left[ \frac{2 d(C_1, C_2)}{n_1 n_2} - \frac{d(C_1, C_1)}{n_1^2} - \frac{d(C_2, C_2)}{n_2^2} \right]

Define the distance matrix  D = (d_{ij}),  d_{ij} = || x_i - x_j ||^2.  For centered data,

J_D = - q^T D q = 2 q^T (X^T X) q = 2 q^T K q     (since \sum_i q(i) = 0)

min J_K  \Leftrightarrow  max J_D

The solution is the principal eigenvector v_1 of K.

Clusters C_1, C_2 are determined by:  C_1 = \{ i \mid v_1(i) < 0 \},   C_2 = \{ i \mid v_1(i) \ge 0 \}
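A small sketch of this 2-way rule (the two blobs are an assumption): split the points by the sign of the principal eigenvector of the centered Gram matrix.

```python
import numpy as np

# Two hypothetical Gaussian blobs, points as columns of X
rng = np.random.default_rng(8)
X = np.hstack([rng.normal(-2, 1, size=(5, 60)), rng.normal(2, 1, size=(5, 60))])
Xc = X - X.mean(axis=1, keepdims=True)

# Principal eigenvector v1 of the (centered) Gram matrix K = X^T X
K = Xc.T @ Xc
vals, vecs = np.linalg.eigh(K)
v1 = vecs[:, -1]

# 2-way split from the sign of v1, as in the PCA / K-means equivalence
labels = (v1 >= 0).astype(int)
print(labels[:60].mean(), labels[60:].mean())   # each should be near 0 or 1
```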

A simple illustration


DNA Gene Expression Profiles for Leukemia

Using v_1, the tissue samples separate into 2 clusters, with 3 errors.
Running one further K-means step reduces this to 1 error.

Multi-way K-means Clustering


Unsigned cluster membership indicators h_1, ..., h_K.  For example, with points assigned to three clusters C_1, C_2, C_3:

      C_1  C_2  C_3
       1    0    0
       1    0    0
       0    1    0
       0    0    1        = (h_1, h_2, h_3)
       0    0    1

Multi-way K-means Clustering


J_K = \sum_i x_i^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j = \sum_i x_i^2 - \sum_{k=1}^{K} h_k^T X^T X h_k

(Unsigned) cluster indicators H = (h_1, ..., h_K), normalized so that  n_k^{1/2} h_k = e_k  (the indicator vector of cluster k):

J_K = \sum_i x_i^2 - \mathrm{Tr}( H^T X^T X H )

Regularized relaxation: transform h_1, ..., h_K into q_1, ..., q_k via an orthogonal matrix T:

(q_1, ..., q_k) = (h_1, ..., h_k) T,    Q = H T

Redundancy:  \sum_k n_k^{1/2} h_k = e,  so fix  q_1 = e / n^{1/2}.

Multi-way K-means Clustering

max \mathrm{Tr}[ Q_{k-1}^T (X^T X) Q_{k-1} ],    Q_{k-1} = (q_2, ..., q_k)

The optimal solutions for q_2, ..., q_k are given by the principal components v_2, ..., v_k.

J_K is bounded below by the total variance minus the sum of the K-1 largest eigenvalues of the covariance:

n \overline{x^2} - \sum_{k=1}^{K-1} \lambda_k  <  \min J_K  <  n \overline{x^2}

Consistency: 2-way and K-way approaches


Orthogonal transform:

T = \begin{pmatrix} \sqrt{n_1/n} & \sqrt{n_2/n} \\ \sqrt{n_2/n} & -\sqrt{n_1/n} \end{pmatrix}

T transforms (h_1, h_2) into (q_1, q_2):

h_1 = (1 \dots 1, 0 \dots 0)^T / n_1^{1/2},    h_2 = (0 \dots 0, 1 \dots 1)^T / n_2^{1/2}

q_1 = (1 \dots 1)^T / n^{1/2},    q_2 = (a, \dots, a, -b, \dots, -b)^T,    a = \sqrt{n_2 / (n_1 n)},   b = \sqrt{n_1 / (n_2 n)}

This recovers the original 2-way cluster indicator.

Test of Lower bounds of K-means clustering


Relative gap:  | J_{opt} - J_{LB} | / J_{opt}

The lower bound is within 0.6-1.5% of the optimal value.

Cluster Subspace (spanned by K centroids) = PCA Subspace


Given a data point x, project x into the cluster subspace spanned by the K centroids:  P = \sum_k c_k c_k^T

The centroid is given by  c_k = \sum_i h_k(i) x_i = X h_k,  so

P = \sum_k c_k c_k^T = X \Big( \sum_k h_k h_k^T \Big) X^T = X \Big( \sum_k v_k v_k^T \Big) X^T = \sum_k \lambda_k u_k u_k^T

P_{K\text{-means}} = \sum_k \lambda_k u_k u_k^T   \sim   \sum_k u_k u_k^T = P_{PCA}

PCA automatically projects into the cluster subspace.
PCA is an unsupervised version of LDA.

Effectiveness of PCA Dimension Reduction


Kernel K-means Clustering


Kernel K-means objective (with the mapping x_i \to \phi(x_i)):

min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} || \phi(x_i) - \bar{\phi}_k ||^2
        = \sum_i || \phi(x_i) ||^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \phi(x_i)^T \phi(x_j)

Kernel K-means:   max J_K^{\phi} = \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \langle \phi(x_i), \phi(x_j) \rangle
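A minimal kernel K-means sketch of this objective (initialization and the block-structured test kernel are assumptions): assignments use feature-space distances that can be computed from the kernel matrix alone.

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Kernel K-means sketch: assignments use feature-space distances computed from K only."""
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, n_clusters))
        for k in range(n_clusters):
            idx = np.flatnonzero(labels == k)
            if idx.size == 0:
                dist[:, k] = np.inf
                continue
            # ||phi(x_i) - centroid_k||^2 = K_ii - 2/|C_k| sum_j K_ij + 1/|C_k|^2 sum_{j,l} K_jl
            dist[:, k] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        labels = dist.argmin(axis=1)
    return labels

# Hypothetical block-structured kernel: two well-separated groups
K = np.block([[np.full((10, 10), 1.0), np.full((10, 10), 0.1)],
              [np.full((10, 10), 0.1), np.full((10, 10), 1.0)]])
print(kernel_kmeans(K, 2))
```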

Kernel K-means clustering is equivalent to Kernel PCA

The continuous optimal solution for the cluster indicators is given by the Kernel PCA components.

The subspace spanned by the K cluster centroids is given by the Kernel PCA principal subspace.
