
Principal Component Analysis and Matrix Factorizations for Learning

Chris Ding
Lawrence Berkeley National Laboratory Supported by Office of Science, U.S. Dept. of Energy


Many unsupervised learning methods are closely related in a simple way


PCA
NMF
K-means clustering
Spectral Clustering
Semi-supervised classification
Indicator Matrix Quadratic Clustering
Semi-supervised clustering
Outlier detection

Part 1.A. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
Widely used in a large number of different fields
Most widely known as PCA (multivariate statistics)
SVD is the theoretical basis for PCA

Brief history
PCA
Draw a plane closest to the data points (Pearson, 1901)
Retain the most variance (Hotelling, 1933)

SVD
Low-rank approximation (Eckart & Young, 1936)
Practical application / efficient computation (Golub & Kahan, 1965)

Many generalizations


PCA and SVD


Data: n points in p dimensions:  X = (x_1, x_2, \dots, x_n)

Covariance:  C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T

Gram (kernel) matrix:  X^T X = \sum_{k=1}^{p} \lambda_k v_k v_k^T

Principal directions: u_k  (principal axes, subspace)
Principal components: v_k  (projections onto the subspace)

Underlying basis: SVD  X = \sum_{k} \sigma_k u_k v_k^T = U \Sigma V^T
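A minimal NumPy sketch of these relations (the data and variable names are illustrative assumptions, not from the tutorial): the SVD of the centered data matrix gives the same principal directions as the eigendecomposition of the covariance, with \lambda_k = \sigma_k^2.

```python
import numpy as np

# Illustrative data: n points in p dimensions, stored as columns of X
rng = np.random.default_rng(0)
p, n = 5, 200
X = rng.normal(size=(p, n))

# Center each variable so the covariance is C = X X^T (up to a 1/n factor)
Xc = X - X.mean(axis=1, keepdims=True)

# SVD X = U S V^T: columns of U are the principal directions u_k,
# columns of V (scaled by the singular values) carry the principal components
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# The eigendecomposition of the covariance gives the same directions,
# with eigenvalues lambda_k = sigma_k^2
lam, _ = np.linalg.eigh(Xc @ Xc.T)
assert np.allclose(np.sort(lam)[::-1], s ** 2)
```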

Further Developments
SVD/PCA
Principal Curves
Independent Component Analysis
Sparse SVD/PCA (many approaches)
Mixture of Probabilistic PCA
Generalization to the exponential family, max-margin
Connection to K-means clustering

Kernel (inner-product)
Kernel PCA


Methods of PCA Utilization


Principal components (uncorrelated random variables): the k-th component is the linear combination
u_k(1) X_1 + \dots + u_k(d) X_d  of the original variables, with data X = (x_1, x_2, \dots, x_n) and SVD  X = \sum_k \sigma_k u_k v_k^T = U \Sigma V^T

Dimension reduction: projection onto a low-dimensional subspace
\tilde{X} = U^T X,   U = (u_1, \dots, u_k)

Sphering the data: transform the data to N(0, I)
\tilde{X} = C^{-1/2} X = U \Lambda^{-1/2} U^T X
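A short sketch of the two operations above (synthetic data and shapes are assumptions; \Lambda^{-1/2} is realized through the singular values): projection onto the leading k directions, and sphering so the whitened data have identity second moments.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 10, 500, 3
X = rng.normal(size=(p, n)) * np.arange(1, p + 1)[:, None]   # unequal variances
Xc = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Dimension reduction: project onto the leading k principal directions
X_reduced = U[:, :k].T @ Xc                                   # shape (k, n)

# Sphering: X_tilde = C^{-1/2} X with C = X X^T = U diag(s^2) U^T,
# so C^{-1/2} = U diag(1/s) U^T
X_white = U @ np.diag(1.0 / s) @ U.T @ Xc
assert np.allclose(X_white @ X_white.T, np.eye(p), atol=1e-6)
```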

Applications of PCA/SVD
Most popular in multivariate statistics
Image processing, signal processing
Physics: principal axes, diagonalization of 2nd-order tensors (mass)
Climate: Empirical Orthogonal Functions (EOF)
Kalman filter:  s^{(t+1)} = A s^{(t)} + E,   P^{(t+1)} = A P^{(t)} A^T
Reduced-order analysis

Applications of PCA/SVD
PCA/SVD is as widely used as the Fast Fourier Transform
Both are spectral expansions
FFT is used more for partial differential equations
PCA/SVD is used more for discrete (data) analysis
PCA/SVD surpasses FFT as the computational sciences advance further

PCA/SVD
Selects combinations of variables
Dimension reduction: an image has 10^4 pixels, but its true dimension is ~20!

PCA is a Matrix Factorization (spectral/eigen decomposition)


Principal directions:  U = (u_1, u_2, \dots, u_k)
Principal components:  V = (v_1, v_2, \dots, v_k)

Covariance:  C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T = U \Lambda U^T

Kernel matrix:  X^T X = \sum_{k=1}^{p} \lambda_k v_k v_k^T = V \Lambda V^T

Underlying basis: SVD  X = \sum_{k} \sigma_k u_k v_k^T = U \Sigma V^T

From PCA to spectral clustering using generalized eigenvectors


Consider the kernel matrix:  W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle

In Kernel PCA we compute the eigenvectors:  W v = \lambda v

Generalized eigenvector:  W q = \lambda D q,   D = \mathrm{diag}(d_1, \dots, d_n),   d_i = \sum_j w_{ij}

This leads to Spectral Clustering!
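A minimal sketch of how the generalized eigenproblem is solved in practice (the similarity matrix and data are assumptions): W q = \lambda D q is equivalent to a standard eigenproblem on D^{-1/2} W D^{-1/2}, and the second eigenvector gives a normalized-cut bipartition.

```python
import numpy as np

# Hypothetical similarity (kernel) matrix W for two loose groups of 1-D points
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 0.3, 20), rng.normal(4, 0.3, 20)])
W = np.exp(-(x[:, None] - x[None, :]) ** 2)

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))

# W q = lambda D q  <=>  (D^-1/2 W D^-1/2) z = lambda z,  with q = D^-1/2 z
Wn = D_inv_sqrt @ W @ D_inv_sqrt
vals, vecs = np.linalg.eigh(Wn)

# The largest eigenvector is trivial; the second-largest gives the normalized-cut bipartition
q2 = D_inv_sqrt @ vecs[:, -2]
labels = (q2 > 0).astype(int)
print(labels)
```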

Scaled PCA and Spectral Clustering

PCA:  W = \sum_k \lambda_k v_k v_k^T

Scaled PCA:  W = D^{1/2} \tilde{W} D^{1/2} = D \Big( \sum_k q_k \lambda_k q_k^T \Big) D

where  \tilde{W} = D^{-1/2} W D^{-1/2},   \tilde{w}_{ij} = w_{ij} / (d_i d_j)^{1/2},
and  q_k = D^{-1/2} v_k  is the scaled principal component (v_k an eigenvector of \tilde{W}).

Scaled PCA on a Rectangular Matrix: Correspondence Analysis

Re-scaling:  \tilde{P} = D_r^{-1/2} P D_c^{-1/2},   \tilde{p}_{ij} = p_{ij} / (p_{i.} p_{.j})^{1/2}

Apply SVD on \tilde{P}, then subtract the trivial component:

P - r c^T / p_{..} = D_r \Big( \sum_k f_k \sigma_k g_k^T \Big) D_c

r = (p_{1.}, \dots, p_{n.}),   c = (p_{.1}, \dots, p_{.n})

f_k = D_r^{-1/2} u_k,   g_k = D_c^{-1/2} v_k

are the scaled row and column principal components (standard coordinates in CA).

(Zha et al., CIKM 2001; Ding et al., PKDD 2002)
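A sketch of this recipe on an assumed contingency table (the table values are illustrative only): rescale, take the SVD, and discard the leading trivial singular triple, which for a joint distribution is exactly (\sqrt{r}, \sqrt{c}) with \sigma = 1.

```python
import numpy as np

# Hypothetical contingency table (counts of row category vs. column category)
P = np.array([[30., 10.,  5.],
              [10., 40., 10.],
              [ 5., 10., 30.]])
P = P / P.sum()                         # joint probabilities, so p.. = 1

r = P.sum(axis=1)                       # row marginals p_i.
c = P.sum(axis=0)                       # column marginals p_.j

# Re-scaling: ptilde_ij = p_ij / sqrt(p_i. p_.j)
P_tilde = P / np.sqrt(np.outer(r, c))

U, s, Vt = np.linalg.svd(P_tilde)

# The leading singular triple (sigma = 1, vectors sqrt(r), sqrt(c)) is the trivial component;
# the remaining triples give the CA solution
f = U[:, 1:] / np.sqrt(r)[:, None]      # scaled row components  f_k = D_r^{-1/2} u_k
g = Vt[1:, :].T / np.sqrt(c)[:, None]   # scaled column components g_k = D_c^{-1/2} v_k
print(s[1:])
```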

Nonnegative Matrix Factorization


Data matrix: n points in p dimensions:  X = (x_1, x_2, \dots, x_n),
where each x_i is an image, document, web page, etc.

Decomposition (low-rank approximation):  X \approx F G^T

Nonnegative matrices:  X_{ij} \ge 0,  F_{ij} \ge 0,  G_{ij} \ge 0

F = (f_1, f_2, \dots, f_k),   G = (g_1, g_2, \dots, g_k)

Solving NMF with multiplicative updating


J = || X - F G^T ||^2,   F \ge 0,  G \ge 0

Fix F, solve for G; fix G, solve for F. Lee & Seung (2000) propose:

G_{jk} \leftarrow G_{jk} \frac{(X^T F)_{jk}}{(G F^T F)_{jk}},
F_{ik} \leftarrow F_{ik} \frac{(X G)_{ik}}{(F G^T G)_{ik}}
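A minimal sketch of these multiplicative updates (iteration count, initialization, and the test data are assumptions):

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative updates for X ~ F G^T with nonnegative F, G (Lee & Seung style)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    F = rng.random((p, k))
    G = rng.random((n, k))
    for _ in range(n_iter):
        # Fix F, update G:  G_jk <- G_jk (X^T F)_jk / (G F^T F)_jk
        G *= (X.T @ F) / (G @ F.T @ F + eps)
        # Fix G, update F:  F_ik <- F_ik (X G)_ik / (F G^T G)_ik
        F *= (X @ G) / (F @ G.T @ G + eps)
    return F, G

# Hypothetical nonnegative data: product of two nonnegative factors plus small noise
rng = np.random.default_rng(3)
X = rng.random((20, 2)) @ rng.random((2, 50)) + 0.01 * rng.random((20, 50))
F, G = nmf_multiplicative(X, k=2)
print(np.linalg.norm(X - F @ G.T) / np.linalg.norm(X))   # relative reconstruction error
```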

Matrix Factorization Summary


            Symmetric                                            Rectangular matrix
            (kernel matrix, graph)                               (contingency table, bipartite graph)

PCA:        W = V \Lambda V^T                                    X = U \Sigma V^T

Scaled PCA: W = D^{1/2} \tilde{W} D^{1/2} = D Q \Lambda Q^T D    X = D_r^{1/2} \tilde{X} D_c^{1/2} = D_r F \Sigma G^T D_c

NMF:        W \approx Q Q^T                                      X \approx F G^T

Indicator Matrix Quadratic Clustering


Unsigned cluster indicator matrix H = (h_1, ..., h_K)

Kernel K-means clustering:
max_H Tr(H^T W H),   s.t.  H^T H = I,  H \ge 0

K-means:  W = X^T X;   Kernel K-means:  W = ( \langle \phi(x_i), \phi(x_j) \rangle )

Spectral clustering (normalized cut):
max_H Tr(H^T W H),   s.t.  H^T D H = I,  H \ge 0

The difference between the two is the orthogonality condition on H.

Indicator Matrix Quadratic Clustering


Additional features:

Semi-supervised classification:
max_H Tr(H^T W H + C^T H)

Semi-supervised clustering, with must-link (A) and cannot-link (B) constraints:
max_H Tr(H^T W H + H^T A H - H^T B H)

Outlier detection:
max_H Tr(H^T W H),  allowing zero rows in H

Nonnegative Lagrangian Relaxation:

H_{ik} \leftarrow H_{ik} \left[ \frac{(W H)_{ik} + C_{ik}/2}{(H \Lambda)_{ik}} \right]^{1/2},   \Lambda = H^T W H + H^T C
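A sketch that transcribes the update rule above (initialization, iteration count, and the block-structured test kernel are assumptions, not the tutorial's own experiment):

```python
import numpy as np

def nlr_cluster(W, K, C=None, n_iter=300, eps=1e-9, seed=0):
    """Sketch of the Nonnegative Lagrangian Relaxation update as written on the slide:
       H_ik <- H_ik * sqrt( ((W H)_ik + C_ik/2) / (H Lambda)_ik ),
       Lambda = H^T W H + H^T C."""
    n = W.shape[0]
    C = np.zeros((n, K)) if C is None else C
    H = np.random.default_rng(seed).random((n, K)) + 0.1
    for _ in range(n_iter):
        Lam = H.T @ W @ H + H.T @ C
        H *= np.sqrt((W @ H + C / 2) / (H @ Lam + eps))
    return H

# Hypothetical kernel matrix with two blocks; rows of H indicate cluster membership
W = np.block([[np.full((5, 5), 1.0), np.full((5, 5), 0.1)],
              [np.full((5, 5), 0.1), np.full((5, 5), 1.0)]])
H = nlr_cluster(W, K=2)
print(H.argmax(axis=1))
```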

Tutorial Outline
PCA
  Recent developments on PCA/SVD
  Equivalence to K-means clustering
Scaled PCA
  Laplacian matrix
  Spectral clustering
  Spectral ordering
Nonnegative Matrix Factorization
  Equivalence to K-means clustering
  Holistic vs. parts-based
Indicator Matrix Quadratic Clustering
  Uses Nonnegative Lagrangian Relaxation
  Includes K-means and spectral clustering
  Semi-supervised classification
  Semi-supervised clustering
  Outlier detection

Recent Developments on PCA and SVD


Principal Curves
Independent Component Analysis
Kernel PCA
Mixture of PCA (probabilistic PCA)
Sparse PCA/SVD: semi-discrete, truncation, L1 constraint, direct sparsification
Column-Partitioned Matrix Factorizations
2D-PCA/SVD
Equivalence to K-means clustering

Part 1.B.

PCA and SVD


Data matrix:  X = (x_1, x_2, \dots, x_n)

Covariance:  C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T

Gram (kernel) matrix:  X^T X = \sum_{k=1}^{p} \lambda_k v_k v_k^T

Principal directions: u_k  (principal axes, subspace)
Principal components: v_k  (projections onto the subspace)

Underlying basis: SVD  X = \sum_{k} \sigma_k u_k v_k^T

Kernel PCA
Map  x_i \to \phi(x_i)

Kernel:  K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle

PCA component / feature extraction:  \langle v, \phi(x) \rangle = \sum_i v_i \langle \phi(x_i), \phi(x) \rangle

Indefinite kernels
Generalization to graphs with nonnegative weights

(Scholkopf, Smola, Muller, 1996)
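A minimal kernel PCA sketch (the RBF kernel, gamma value, and test data are assumptions): build the kernel matrix, center it in feature space, and read the component projections off the eigenvectors.

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Kernel PCA sketch with an assumed RBF kernel; X has one point per row.
    Returns the k leading kernel principal component projections of the training points."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                       # K_ij = <phi(x_i), phi(x_j)>
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                # center phi(x) in feature space
    vals, vecs = np.linalg.eigh(Kc)
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    # Projection onto the k-th kernel principal direction is sqrt(lambda_k) * v_k
    return vecs * np.sqrt(np.maximum(vals, 0.0))

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
print(kernel_pca(X, k=2).shape)                   # (50, 2)
```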

Mixture of PCA
Data has local structures.
Global PCA on all data is not useful

Clustering PCA (Hinton et al.):
  Use clustering to partition the data into clusters
  Perform PCA within each cluster
  No explicit generative model

Probabilistic PCA (Tipping & Bishop):
  Latent variables
  Generative model (Gaussian)
  Mixture of Gaussians gives a mixture of PCA
  Adding Markov dynamics for the latent variables (Linear Gaussian Models)

Probabilistic PCA Linear Gaussian Model


Latent variables:  S = (s_1, \dots, s_n)

x_i = W s_i + \mu + \epsilon,   \epsilon \sim N(0, \sigma^2 I)

Gaussian prior:  P(s) \sim N(s_0, \sigma_s^2 I)

Marginal:  x \sim N(W s_0 + \mu, \sigma^2 I + \sigma_s^2 W W^T)

Linear Gaussian Model (adds dynamics on the latent variables):

s_{i+1} = A s_i + \eta,   x_i = W s_i + \epsilon

(Tipping & Bishop, 1995; Roweis & Ghahramani, 1999)

Sparse PCA
Compute a factorization  X \approx U V^T  where U or V is sparse, or both are sparse.

Why sparse?
  Variable selection (sparse U)
  When n >> d
  Storage saving
  Other new reasons?

L1 and L2 constraints

Sparse PCA: Truncation and Discretization


X \approx U V^T,   U = (u_1 \dots u_k),   V = (v_1 \dots v_k)

Sparsified SVD (Zhang, Zha & Simon, 2002):
  Compute {u_k, v_k} one pair at a time, truncating entries below a threshold.
  Recursively compute all pairs using deflation:  X \leftarrow X - \sigma u v^T

Semi-discrete decomposition (Kolda & O'Leary, 1999):
  U, V only contain entries in {-1, 0, 1}
  Iterative algorithm computes U, V using deflation
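A sketch in the spirit of the truncation approach (the relative threshold, power-iteration inner loop, and test matrix are assumptions, not the published algorithm's exact details): one rank-one pair at a time, truncate small entries, deflate, repeat.

```python
import numpy as np

def sparse_svd_truncate(X, k, threshold=0.1, n_power=50, seed=0):
    """Truncation-style sparsified SVD sketch: compute (u, v) pairs one at a time,
    zero out small entries, renormalize, then deflate."""
    X = X.astype(float).copy()
    rng = np.random.default_rng(seed)
    us, vs = [], []
    for _ in range(k):
        u = rng.normal(size=X.shape[0])
        for _ in range(n_power):                 # rank-one pair by power iteration
            v = X.T @ u
            v /= np.linalg.norm(v) + 1e-12
            u = X @ v
            u /= np.linalg.norm(u) + 1e-12
        # Truncate entries below a (relative) threshold, then renormalize
        u[np.abs(u) < threshold * np.abs(u).max()] = 0.0
        v[np.abs(v) < threshold * np.abs(v).max()] = 0.0
        u /= np.linalg.norm(u) + 1e-12
        v /= np.linalg.norm(v) + 1e-12
        s = u @ X @ v
        us.append(u)
        vs.append(v)
        X -= s * np.outer(u, v)                  # deflation
    return np.column_stack(us), np.column_stack(vs)

U, V = sparse_svd_truncate(np.random.default_rng(5).normal(size=(30, 40)), k=3)
print((U == 0).mean(), (V == 0).mean())          # fraction of zeroed entries
```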

Sparse PCA: L1 constraint


LASSO (Tibshirani, 1996):
min || y - X^T \beta ||^2,   || \beta ||_1 \le t

SCoTLASS (Jolliffe & Uddin, 2003):
max  u^T (X X^T) u,   || u ||_1 \le t,   u^T u_h = 0

Least Angle Regression (Efron et al., 2004)

Sparse PCA (Zou, Hastie, Tibshirani, 2004):

min_{\alpha, \beta}  \sum_{i=1}^{n} || x_i - \alpha \beta^T x_i ||^2 + \lambda \sum_{j=1}^{k} || \beta_j ||^2 + \sum_{j=1}^{k} \lambda_{1,j} || \beta_j ||_1,   \alpha^T \alpha = I

v_j = \beta_j / || \beta_j ||

Sparse PCA: Direct Sparsification


Sparse SVD with explicit sparsification (Zhang, Zha & Simon, 2003):

min_{u,v} || X - u d v^T ||_F + \mathrm{nnz}(u) + \mathrm{nnz}(v)

  rank-one approximation; minimize a bound; deflation

Direct sparse PCA, on the covariance matrix S (d'Aspremont, El Ghaoui, Jordan, Lanckriet, 2004):

max  u^T S u = max Tr(S u u^T) = max Tr(S U)

s.t.  Tr(U) = 1,   \mathrm{nnz}(U) \le k^2,   U \succeq 0,   \mathrm{rank}(U) = 1

Sparse PCA Summary


Many different approaches:
  Truncation, discretization
  L1 constraint
  Direct sparsification
  Other approaches

Sparse matrix factorization in general:
  L1 constraints

Many open questions:
  Orthogonality
  Uniqueness of the solution, global solution

PCA: Further Generalizations


Generalization to the Exponential Family
(Collins, Dasgupta, Schapire, 2001)

Maximum Margin Matrix Factorization (Srebro, Rennie, Jaakkola, 2004):

  Collaborative filtering; the input Y is binary
  Hard margin:  Y_{ia} X_{ia} \ge 1,   (i,a) \in S
  Soft margin:  min || X || + c \sum_{(i,a) \in S} \max(0, 1 - Y_{ia} X_{ia})

  X = U V^T,   || X || = \tfrac{1}{2} ( || U ||_{Fro}^2 + || V ||_{Fro}^2 )

Column Partitioned Matrix Factorizations


Column-partitioned data matrix:

X = (x_1, \dots, x_n) = ( x_1 \dots x_{n_1} | x_{n_1+1} \dots x_{n_2} | \dots | x_{n_{k-1}+1} \dots x_n ),   n_1 + \dots + n_k = n

The partitions are generated by clustering. (Zhang & Zha, 2001)

Centroid matrix U = (u_1 \dots u_k), where u_k is the centroid of partition k
(Dhillon & Modha, 2001; Park, Jeon & Rosen, 2003).
Fix U and compute V:

min || X - U V^T ||_F^2   \Rightarrow   V = X^T U (U^T U)^{-1}

Represent each partition by an SVD and pick the leading U's to form

U = (U_1, \dots, U_l) = ( u_1^{(1)} \dots u_{k_1}^{(1)}, \dots, u_1^{(l)} \dots u_{k_l}^{(l)} ),

then fix U and compute V (Castelli, Thomasian & Li, 2003; Zeimpekis & Gallopoulos, 2004).

Several other variations.

Two-dimensional SVD
Many data objects are 2-D: images, maps

Standard method:
  convert (re-order) each image into a 1D vector
  collect all 1D vectors into a single (big) matrix
  apply SVD to the big matrix

2D-SVD is developed for 2D objects:
  extension of the standard SVD
  keeps the 2D characteristics
  improves the quality of the low-dimensional approximation
  reduces computation and storage

Linearize a 2D object into a 1D object

[Figure: the pixel grid of an image is re-ordered into a single pixel vector, e.g. (0.0, 0.5, 0.7, 1.0, ..., 0.8, 0.2, 0.0)]

SVD and 2D-SVD


SVD:  X = (x_1, x_2, \dots, x_n)

  Eigenvectors of  XX^T  and  X^T X

  X = U \Sigma V^T,   \Sigma = U^T X V

2D-SVD:  \{A\} = \{A_1, A_2, \dots, A_n\}

  Eigenvectors of
  F = \sum_i (A_i - \bar{A})(A_i - \bar{A})^T    (row-row covariance)
  G = \sum_i (A_i - \bar{A})^T (A_i - \bar{A})    (column-column covariance)

  M_i = U^T A_i V,   A_i = U M_i V^T

2D-SVD
\{A\} = \{A_1, A_2, \dots, A_n\},   assume \bar{A} = 0

row-row covariance:   F = \sum_i A_i A_i^T = \sum_k \lambda_k u_k u_k^T,    U = (u_1, u_2, \dots, u_k)
col-col covariance:   G = \sum_i A_i^T A_i = \sum_k \zeta_k v_k v_k^T,    V = (v_1, v_2, \dots, v_k)

Bilinear subspace:   M_i = U^T A_i V,   A_i = U M_i V^T,   i = 1, \dots, n

A_i \in R^{r \times c},   U \in R^{r \times k},   V \in R^{c \times k},   M_i \in R^{k \times k}
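A minimal sketch of this construction (the random "images" and sizes are assumptions): build the row-row and column-column covariances, take their leading eigenvectors, and form the bilinear approximation A_i \approx U M_i V^T.

```python
import numpy as np

def two_d_svd(A_list, k):
    """2D-SVD sketch: A_i ~ U M_i V^T, with U, V the leading eigenvectors
    of the row-row and column-column covariances F and G."""
    A = np.stack(A_list)                        # shape (n, r, c)
    Ac = A - A.mean(axis=0)
    F = np.einsum('nij,nkj->ik', Ac, Ac)        # F = sum_i (A_i - Abar)(A_i - Abar)^T
    G = np.einsum('nji,njk->ik', Ac, Ac)        # G = sum_i (A_i - Abar)^T (A_i - Abar)
    _, U = np.linalg.eigh(F)
    _, V = np.linalg.eigh(G)
    U, V = U[:, ::-1][:, :k], V[:, ::-1][:, :k] # leading k eigenvectors of each
    M = np.einsum('ri,nrc,ck->nik', U, A, V)    # M_i = U^T A_i V
    A_rec = np.einsum('ri,nik,ck->nrc', U, M, V)
    return U, V, M, A_rec

# Hypothetical stack of small "images"
rng = np.random.default_rng(6)
images = [rng.random((16, 12)) for _ in range(40)]
U, V, M, rec = two_d_svd(images, k=5)
print(np.linalg.norm(np.stack(images) - rec) / np.linalg.norm(np.stack(images)))
```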

2D-SVD Error Analysis

SVD:   min || X - U V^T ||^2 = \sum_{i=k+1} \sigma_i^2

2D-SVD:   A_i \approx L M_i R^T,   A_i \in R^{r \times c},  L \in R^{r \times k},  R \in R^{c \times k},  M_i \in R^{k \times k}

min J_1 = \sum_{i=1}^{n} || A_i - L M_i ||^2 = \sum_{j=k+1}^{r} \lambda_j

min J_2 = \sum_{i=1}^{n} || A_i - M_i R^T ||^2 = \sum_{j=k+1}^{c} \zeta_j

min J_3 = \sum_{i=1}^{n} || A_i - L M_i R^T ||^2 \approx \sum_{j=k+1}^{r} \lambda_j + \sum_{j=k+1}^{c} \zeta_j

min J_4 = \sum_{i=1}^{n} || A_i - L M_i L^T ||^2 \approx 2 \sum_{j=k+1}^{r} \lambda_j

Temperature maps (January, over 100 years)

[Figure: reconstructed temperature maps]

Reconstruction error ratio SVD/2D-SVD = 1.1;   storage ratio SVD/2D-SVD = 8

Reconstructed image

[Figure: images reconstructed with SVD and with 2D-SVD]

SVD (K=15): storage 160,560
2D-SVD (K=15): storage 93,060

2D-SVD Summary
2D-SVD is an extension of the standard SVD
Provides optimal solutions for 4 representations of 2D images/maps
Substantial improvements in storage, computation, and reconstruction quality
Captures the 2D characteristics

Part 1.C. K-means Clustering and Principal Component Analysis


(Equivalence between PCA and K-means)


K-means clustering
Also called isodata, vector quantization
Developed in the 1960s (Lloyd, MacQueen, Hartigan, etc.)
Computationally efficient (order mN)
Widely used in practice
Benchmark to evaluate other algorithms

Given n points in m dimensions:  X = (x_1, x_2, \dots, x_n)

K-means objective:

min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} || x_i - c_k ||^2
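A minimal Lloyd-style sketch of this objective (initialization and test data are assumptions; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd-style K-means on the columns of X (n points in m dims), minimizing J_K."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    centers = X[:, rng.choice(n, size=K, replace=False)]
    for _ in range(n_iter):
        # Squared distances of every point to every centroid, shape (K, n)
        d2 = ((X[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=0)
        centers = np.column_stack([X[:, labels == k].mean(axis=1) for k in range(K)])
    J = sum(((X[:, labels == k] - centers[:, [k]]) ** 2).sum() for k in range(K))
    return labels, centers, J

# Hypothetical data: two Gaussian blobs as columns of X
rng = np.random.default_rng(7)
X = np.hstack([rng.normal(0, 1, size=(2, 50)), rng.normal(5, 1, size=(2, 50))])
labels, centers, J = kmeans(X, K=2)
print(J)
```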

PCA is equivalent to K-means


The continuous optimal solution for the cluster indicators in K-means clustering is given by the principal components.

The subspace spanned by the K cluster centroids is given by the PCA subspace.

2-way K-means Clustering

Cluster membership indicator:

q(i) =  +\sqrt{n_2 / (n_1 n)}   if i \in C_1
        -\sqrt{n_1 / (n_2 n)}   if i \in C_2

J_K = n \overline{x^2} - \tfrac{1}{2} J_D,    J_D = \frac{n_1 n_2}{n} \left[ \frac{2 d(C_1, C_2)}{n_1 n_2} - \frac{d(C_1, C_1)}{n_1^2} - \frac{d(C_2, C_2)}{n_2^2} \right]

Define the distance matrix  D = (d_{ij}),  d_{ij} = || x_i - x_j ||^2.  For centered data,

J_D = - q^T D q = 2 q^T (X^T X) q = 2 q^T K q     (since \sum_i q(i) = 0)

min J_K  \Leftrightarrow  max J_D

The solution is the principal eigenvector v_1 of K.

Clusters C_1, C_2 are determined by:  C_1 = \{ i \mid v_1(i) < 0 \},   C_2 = \{ i \mid v_1(i) \ge 0 \}
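A small sketch of this 2-way rule (the two blobs are an assumption): split the points by the sign of the principal eigenvector of the centered Gram matrix.

```python
import numpy as np

# Two hypothetical Gaussian blobs, points as columns of X
rng = np.random.default_rng(8)
X = np.hstack([rng.normal(-2, 1, size=(5, 60)), rng.normal(2, 1, size=(5, 60))])
Xc = X - X.mean(axis=1, keepdims=True)

# Principal eigenvector v1 of the (centered) Gram matrix K = X^T X
K = Xc.T @ Xc
vals, vecs = np.linalg.eigh(K)
v1 = vecs[:, -1]

# 2-way split from the sign of v1, as in the PCA / K-means equivalence
labels = (v1 >= 0).astype(int)
print(labels[:60].mean(), labels[60:].mean())   # each should be near 0 or 1
```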

A simple illustration


DNA Gene Expression Profiles for Leukemia

Using v_1, the tissue samples separate into 2 clusters, with 3 errors.
Running one further K-means step reduces this to 1 error.

Multi-way K-means Clustering


Unsigned cluster membership indicators h_1, ..., h_K.  For example, with points assigned to three clusters C_1, C_2, C_3:

      C_1  C_2  C_3
       1    0    0
       1    0    0
       0    1    0
       0    0    1        = (h_1, h_2, h_3)
       0    0    1

Multi-way K-means Clustering


J_K = \sum_i x_i^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j = \sum_i x_i^2 - \sum_{k=1}^{K} h_k^T X^T X h_k

(Unsigned) cluster indicators H = (h_1, ..., h_K), normalized so that  n_k^{1/2} h_k = e_k  (the indicator vector of cluster k):

J_K = \sum_i x_i^2 - \mathrm{Tr}( H^T X^T X H )

Regularized relaxation: transform h_1, ..., h_K into q_1, ..., q_k via an orthogonal matrix T:

(q_1, ..., q_k) = (h_1, ..., h_k) T,    Q = H T

Redundancy:  \sum_k n_k^{1/2} h_k = e,  so fix  q_1 = e / n^{1/2}.

Multi-way K-means Clustering

max \mathrm{Tr}[ Q_{k-1}^T (X^T X) Q_{k-1} ],    Q_{k-1} = (q_2, ..., q_k)

The optimal solutions for q_2, ..., q_k are given by the principal components v_2, ..., v_k.

J_K is bounded below by the total variance minus the sum of the K-1 largest eigenvalues of the covariance:

n \overline{x^2} - \sum_{k=1}^{K-1} \lambda_k  <  \min J_K  <  n \overline{x^2}

Consistency: 2-way and K-way approaches


Orthogonal transform:

T = \begin{pmatrix} \sqrt{n_1/n} & \sqrt{n_2/n} \\ \sqrt{n_2/n} & -\sqrt{n_1/n} \end{pmatrix}

T transforms (h_1, h_2) into (q_1, q_2):

h_1 = (1 \dots 1, 0 \dots 0)^T / n_1^{1/2},    h_2 = (0 \dots 0, 1 \dots 1)^T / n_2^{1/2}

q_1 = (1 \dots 1)^T / n^{1/2},    q_2 = (a, \dots, a, -b, \dots, -b)^T,    a = \sqrt{n_2 / (n_1 n)},   b = \sqrt{n_1 / (n_2 n)}

This recovers the original 2-way cluster indicator.

Test of Lower bounds of K-means clustering


Relative gap:  | J_{opt} - J_{LB} | / J_{opt}

The lower bound is within 0.6-1.5% of the optimal value.

Cluster Subspace (spanned by K centroids) = PCA Subspace


Given a data point x, project x into the cluster subspace spanned by the K centroids:  P = \sum_k c_k c_k^T

The centroid is given by  c_k = \sum_i h_k(i) x_i = X h_k,  so

P = \sum_k c_k c_k^T = X \Big( \sum_k h_k h_k^T \Big) X^T = X \Big( \sum_k v_k v_k^T \Big) X^T = \sum_k \lambda_k u_k u_k^T

P_{K\text{-means}} = \sum_k \lambda_k u_k u_k^T   \sim   \sum_k u_k u_k^T = P_{PCA}

PCA automatically projects into the cluster subspace.
PCA is an unsupervised version of LDA.

Effectiveness of PCA Dimension Reduction


Kernel K-means Clustering


Kernel K-means objective (with the mapping x_i \to \phi(x_i)):

min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} || \phi(x_i) - \bar{\phi}_k ||^2
        = \sum_i || \phi(x_i) ||^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \phi(x_i)^T \phi(x_j)

Kernel K-means:   max J_K^{\phi} = \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \langle \phi(x_i), \phi(x_j) \rangle
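A minimal kernel K-means sketch of this objective (initialization and the block-structured test kernel are assumptions): assignments use feature-space distances that can be computed from the kernel matrix alone.

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Kernel K-means sketch: assignments use feature-space distances computed from K only."""
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, n_clusters))
        for k in range(n_clusters):
            idx = np.flatnonzero(labels == k)
            if idx.size == 0:
                dist[:, k] = np.inf
                continue
            # ||phi(x_i) - centroid_k||^2 = K_ii - 2/|C_k| sum_j K_ij + 1/|C_k|^2 sum_{j,l} K_jl
            dist[:, k] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        labels = dist.argmin(axis=1)
    return labels

# Hypothetical block-structured kernel: two well-separated groups
K = np.block([[np.full((10, 10), 1.0), np.full((10, 10), 0.1)],
              [np.full((10, 10), 0.1), np.full((10, 10), 1.0)]])
print(kernel_kmeans(K, 2))
```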

Kernel K-means clustering is equivalent to Kernel PCA

The continuous optimal solution for the cluster indicators is given by the Kernel PCA components.

The subspace spanned by the K cluster centroids is given by the Kernel PCA principal subspace.
