
Principal Component Analysis

Latent Variables

Computer Science and Engineering, Indian Institute of Technology Rajasthan
Slides last modified on November 5, 2012

A.k.a. the Karhunen-Loeve transformation. It has applications such as:
Dimensionality reduction
Lossy data compression
Feature extraction
Data visualization

PCA

Definition 1

PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space (the principal subspace) such that the variance of the projected data is maximized.

[Figure: two-dimensional data points $x_n$ with axes $x_1$, $x_2$; each point is projected onto the principal direction $u_1$, giving the projection $\tilde{x}_n$.]

PCA

Definition 2

PCA can be defined as the linear projection that minimizes the average projection cost, defined as the mean squared distance between the data points and their projections.

Maximum Variance Formulation

Consider a data set of observations $\{x_n\}$ where $n = 1, \ldots, N$
Each $x_n$ is a Euclidean variable with dimensionality $D$
We want to project the data onto a space having dimensionality $M < D$
We want to maximize the variance of the projected data

Consider the projection onto a one-dimensional space ($M = 1$)
We can define the direction of this space using a $D$-dimensional vector $u_1$
We take $u_1$ to be a unit vector, $u_1^T u_1 = 1$
The projection of each data point $x_n$ onto this direction is the scalar $u_1^T x_n$

The mean of the projected data is $u_1^T \bar{x}$, where $\bar{x}$ is the sample mean
$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$$

The variance of the projected data is given by
$$\frac{1}{N} \sum_{n=1}^{N} \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1$$
where $S$ is the data covariance matrix, defined by
$$S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$$

We now maximize the projected variance $u_1^T S u_1$ with respect to $u_1$
We must enforce the constraint $u_1^T u_1 = 1$; this has to be a constrained maximization to prevent $\|u_1\| \to \infty$
We introduce a Lagrange multiplier $\lambda_1$ and then perform an unconstrained maximization of
$$u_1^T S u_1 + \lambda_1 \left( 1 - u_1^T u_1 \right)$$
Setting the derivative with respect to $u_1$ equal to zero, we get
$$S u_1 = \lambda_1 u_1$$
Thus $u_1$ must be an eigenvector of $S$. Pre-multiplying both sides by $u_1^T$ and using $u_1^T u_1 = 1$, we get
$$u_1^T S u_1 = \lambda_1$$

The variance will be maximized when we set $u_1$ equal to the eigenvector having the largest eigenvalue $\lambda_1$
This eigenvector is known as the first principal component
Now consider all possible directions orthogonal to the first principal component
From these, we can identify the second principal component as the direction that maximizes the projected variance; this is the eigenvector $u_2$ corresponding to the second largest eigenvalue $\lambda_2$

If we consider the general case of an $M$-dimensional projection space, the optimal linear projection for which the variance of the projected data is maximized is given by the $M$ eigenvectors $u_1, u_2, \ldots, u_M$ of the data covariance matrix $S$ corresponding to the $M$ largest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$
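The following is a minimal NumPy sketch of the maximum-variance procedure just described: it forms the sample covariance matrix $S$ and returns the eigenvectors with the largest eigenvalues as the principal directions. The function name, the synthetic data, and the choice $M = 2$ are illustrative assumptions, not part of the slides.

```python
import numpy as np

def pca_max_variance(X, M):
    """Top-M principal directions of the rows of X, via the eigenvectors of S."""
    x_bar = X.mean(axis=0)                 # sample mean
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]             # S = (1/N) sum_n (x_n - x_bar)(x_n - x_bar)^T
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1][:M]  # indices of the M largest eigenvalues
    return eigvecs[:, order], eigvals[order]

# toy usage on synthetic correlated data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
U, lam = pca_max_variance(X, M=2)
print(lam)   # projected variances u_i^T S u_i = lambda_i, largest first
```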

Minimum-Error Formulation

In this formulation of PCA, we minimize the projection error
We introduce a complete orthonormal set of $D$-dimensional basis vectors $\{u_i\}$, $i = 1, \ldots, D$, that satisfy $u_i^T u_j = \delta_{ij}$
Each data point can be represented exactly by a linear combination of the basis vectors
$$x_n = \sum_{i=1}^{D} \alpha_{ni} u_i$$
where the coefficients $\alpha_{ni}$ will be different for different data points
This simply corresponds to a rotation of the coordinate system to a new system defined by the $\{u_i\}$

The original $D$ components $\{x_{n1}, \ldots, x_{nD}\}$ are replaced by an equivalent set $\{\alpha_{n1}, \alpha_{n2}, \ldots, \alpha_{nD}\}$
Taking the inner product with $u_j$, and making use of the orthonormality property, we obtain $\alpha_{nj} = x_n^T u_j$, and so
$$x_n = \sum_{i=1}^{D} \left( x_n^T u_i \right) u_i$$

Our goal is to approximate this data point using a representation involving a restricted number $M < D$ of variables, corresponding to a projection onto a lower-dimensional subspace.
The $M$-dimensional linear subspace can be represented by the first $M$ of the basis vectors, and so we approximate each data point $x_n$ by
$$\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$$
Here the $\{z_{ni}\}$ depend on the particular data point
The $\{b_i\}$ are constants that are the same for all data points
We are free to choose the $\{u_i\}$, the $\{z_{ni}\}$, and the $\{b_i\}$ so as to minimize the distortion introduced by the reduction in dimensionality

Distortion Measure $J$: the squared distance between the original data point $x_n$ and its approximation $\tilde{x}_n$, averaged over the data set. So we minimize
$$J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2, \qquad \tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$$

Minimize w.r.t. $\{z_{ni}\}$: substituting for $\tilde{x}_n$, setting the derivative with respect to $z_{nj}$ to zero, and making use of the orthonormality conditions, we obtain
$$z_{nj} = x_n^T u_j \quad \text{where } j = 1, \ldots, M$$

Minimize w.r.t. $\{b_i\}$: setting the derivative of $J$ with respect to $b_i$ to zero, and again making use of the orthonormality relations,
$$b_j = \bar{x}^T u_j \quad \text{where } j = M+1, \ldots, D$$

If we substitute for $z_{ni}$ and $b_i$ in
$$\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$$
and make use of
$$x_n = \sum_{i=1}^{D} \left( x_n^T u_i \right) u_i$$
we obtain
$$x_n - \tilde{x}_n = \sum_{i=M+1}^{D} \left\{ (x_n - \bar{x})^T u_i \right\} u_i$$
The displacement vector $x_n - \tilde{x}_n$ lies in the space orthogonal to the principal subspace

[Figure: the same two-dimensional illustration as before, with axes $x_1$, $x_2$, data points $x_n$, their projections $\tilde{x}_n$, and the principal direction $u_1$.]

The projected points can be moved freely within the principal subspace, and so the minimum error is given by the orthogonal projection
The distortion measure $J$ can then be written purely as a function of the $\{u_i\}$ in the form
$$J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \left( x_n^T u_i - \bar{x}^T u_i \right)^2 = \sum_{i=M+1}^{D} u_i^T S u_i$$

2-D case

Consider the case of a 2-D data space, $D = 2$, and a 1-D principal subspace, $M = 1$
We need to choose a direction $u_2$ so as to minimize $J = u_2^T S u_2$, subject to the normalization constraint $u_2^T u_2 = 1$
We perform a constrained minimization to avoid the trivial solution $u_2 = 0$
Using the Lagrange multiplier $\lambda_2$ to enforce the constraint, we consider the minimization of
$$\tilde{J} = u_2^T S u_2 + \lambda_2 \left( 1 - u_2^T u_2 \right)$$
Setting the derivative with respect to $u_2$ to zero, we obtain $S u_2 = \lambda_2 u_2$, so $u_2$ is an eigenvector of $S$ with eigenvalue $\lambda_2$

2-D case

Thus any eigenvector defines a stationary point of the distortion measure
Back-substituting the solution for $u_2$ into the distortion measure gives $J = \lambda_2$
We therefore obtain the minimum value of $J$ by choosing $u_2$ to be the eigenvector corresponding to the smaller of the two eigenvalues
Solving for $u_2$ gives the direction of the displacement vector, which is orthogonal to the principal subspace
Thus we should choose the principal subspace to be aligned with the eigenvector having the larger eigenvalue, i.e. the subspace should pass through the mean of the data points and be aligned with the directions of maximum variance

Arbitrary D

The general solution to the minimization of $J$ for arbitrary $D$ and arbitrary $M < D$ is obtained by choosing the $\{u_i\}$ to be eigenvectors of the covariance matrix,
$$S u_i = \lambda_i u_i \quad \text{where } i = 1, \ldots, D$$
The corresponding value of the distortion measure is
$$J = \sum_{i=M+1}^{D} \lambda_i$$
which is simply the sum of the eigenvalues of those eigenvectors that are orthogonal to the principal subspace
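As a quick numerical illustration of $J = \sum_{i=M+1}^{D} \lambda_i$, the hedged sketch below reconstructs synthetic data from the $M$ leading eigenvectors and compares the average squared reconstruction error with the sum of the discarded eigenvalues; the data, dimensions, and variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 400, 6, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # synthetic correlated data

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / N
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]                           # decreasing eigenvalue order

# Reconstruct each point from the M leading eigenvectors:
#   x~_n = x_bar + sum_i (x_n^T u_i - x_bar^T u_i) u_i
Z = (X - x_bar) @ U[:, :M]                               # projection coefficients
X_tilde = x_bar + Z @ U[:, :M].T

J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))          # distortion measure
print(J, lam[M:].sum())                                  # the two numbers agree
```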

Arbitrary D

We therefore obtain the minimum value of $J$ by selecting these eigenvectors to be those having the $D - M$ smallest eigenvalues
Hence the eigenvectors defining the principal subspace are those corresponding to the $M$ largest eigenvalues

PCA approximation to data vector $x_n$

$$\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$$
$$z_{nj} = x_n^T u_j \quad \text{where } j = 1, \ldots, M$$
$$b_j = \bar{x}^T u_j \quad \text{where } j = M+1, \ldots, D$$
The approximation to data point $x_n$ is
$$\tilde{x}_n = \sum_{i=1}^{M} \left( x_n^T u_i \right) u_i + \sum_{i=M+1}^{D} \left( \bar{x}^T u_i \right) u_i = \bar{x} + \sum_{i=1}^{M} \left( x_n^T u_i - \bar{x}^T u_i \right) u_i$$
where
$$\bar{x} = \sum_{i=1}^{D} \left( \bar{x}^T u_i \right) u_i$$

Applications of PCA

Consider several samples of the handwritten digit 3
We compute the mean, the covariance matrix, and the eigenvectors of the covariance matrix
The eigenvectors can be viewed as images of the same size as the data points
Note that each image sample forms a point in a multi-dimensional ($W \times H$) space

[Figure: the mean image and the first 4 eigenvectors shown as images, with the corresponding eigenvalues $\lambda_1 = 3.4 \times 10^5$, $\lambda_2 = 2.8 \times 10^5$, $\lambda_3 = 2.4 \times 10^5$, $\lambda_4 = 1.6 \times 10^5$. Blue represents positive values, white is zero, and yellow represents negative values.]

Part (a) of the accompanying figure shows a plot of the complete spectrum of eigenvalues $\lambda_i$, sorted into decreasing order
Part (b) shows the distortion measure $J$ associated with choosing a particular value of $M$, i.e. the sum of the eigenvalues from $M + 1$ up to $D$, plotted for different values of $M$

Data compression

$$\tilde{x}_n = \bar{x} + \sum_{i=1}^{M} \left( x_n^T u_i - \bar{x}^T u_i \right) u_i$$

This representation $\tilde{x}_n$ is a compression of the data set
Each data point $x_n$ has been replaced with an $M$-dimensional vector having components $x_n^T u_i - \bar{x}^T u_i$
The smaller the value of $M$, the greater the degree of compression

[Figure: reconstructions of a digit image for increasing values of $M$: the original image followed by $M = 1$, $M = 10$, $M = 50$, and $M = 250$. Here $D = 28 \times 28 = 784$.]
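A small sketch of the compression idea, with synthetic data standing in for the digit images: each point is stored as its $M$ projection coefficients, and the reconstruction error shrinks as $M$ grows.

```python
import numpy as np

rng = np.random.default_rng(12)
N, D = 300, 50
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # stand-in for the image data

x_bar = X.mean(axis=0)
lam, U = np.linalg.eigh((X - x_bar).T @ (X - x_bar) / N)
U = U[:, np.argsort(lam)[::-1]]                          # columns sorted by decreasing eigenvalue

for M in (1, 10, 25, 50):
    Z = (X - x_bar) @ U[:, :M]        # stored coefficients: N x M numbers instead of N x D
    X_tilde = x_bar + Z @ U[:, :M].T  # x~_n = x_bar + sum_i (x_n^T u_i - x_bar^T u_i) u_i
    err = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
    print(f"M = {M:2d}  stored values per point = {M:2d}  mean squared error = {err:.3f}")
```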

Data Preprocessing

A data set is often transformed in order to standardize some of its properties
Suppose that the original variables are measured in different units or have significantly different variability
We typically do a linear re-scaling of the individual variables so that each variable has zero mean and unit variance
This is known as standardizing the data
The covariance matrix for the standardized data has components
$$\rho_{ij} = \frac{1}{N} \sum_{n=1}^{N} \frac{(x_{ni} - \bar{x}_i)(x_{nj} - \bar{x}_j)}{\sigma_i \sigma_j}$$
where $\sigma_i$ is the standard deviation of $x_i$
This is known as the correlation matrix of the original data
It has the property that if two components $x_i$ and $x_j$ of the data are perfectly correlated, then $\rho_{ij} = 1$, and if they are uncorrelated, then $\rho_{ij} = 0$
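A short check of the statement above, on arbitrary synthetic data: after rescaling each variable to zero mean and unit variance, the covariance matrix of the standardized data coincides with the correlation matrix of the original variables.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 0.1, 5.0])   # very different scales

x_bar = X.mean(axis=0)
sigma = X.std(axis=0)                  # per-variable standard deviations
X_std = (X - x_bar) / sigma            # standardized: zero mean, unit variance per variable

rho = X_std.T @ X_std / X.shape[0]     # covariance of the standardized data ...
print(np.allclose(rho, np.corrcoef(X, rowvar=False)))   # ... equals the correlation matrix
```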

Data Normalization Using PCA

So where does PCA help?
Using PCA we can make a more substantial normalization of the data, so that the different variables become decorrelated
The normalized data will then have zero mean and unit covariance

We first write the eigenvector equation in the form
$$S U = U L$$
where $L$ is a $D \times D$ diagonal matrix with elements $\lambda_i$, and $U$ is a $D \times D$ orthogonal matrix with columns given by the $u_i$
Then we define, for each data point $x_n$, a transformed value given by
$$y_n = L^{-1/2} U^T (x_n - \bar{x})$$
The set $\{y_n\}$ has zero mean, and its covariance is given by the identity matrix:
$$\frac{1}{N} \sum_{n=1}^{N} y_n y_n^T = \frac{1}{N} \sum_{n=1}^{N} L^{-1/2} U^T (x_n - \bar{x})(x_n - \bar{x})^T U L^{-1/2} = L^{-1/2} U^T S U L^{-1/2} = L^{-1/2} L L^{-1/2} = I$$
This operation is known as whitening or sphereing
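A minimal sketch of the whitening transform $y_n = L^{-1/2} U^T (x_n - \bar{x})$ on made-up correlated data; the final check confirms that the whitened set has identity covariance (up to floating-point error).

```python
import numpy as np

rng = np.random.default_rng(3)
A = np.array([[2.0, 0.0, 0.0],
              [1.5, 0.5, 0.0],
              [0.0, 1.0, 3.0]])
X = rng.normal(size=(500, 3)) @ A                 # strongly correlated data

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / X.shape[0]
lam, U = np.linalg.eigh(S)                        # S U = U L

Y = (X - x_bar) @ U / np.sqrt(lam)                # y_n = L^{-1/2} U^T (x_n - x_bar), one row per point
print(np.allclose(Y.T @ Y / X.shape[0], np.eye(3)))   # identity covariance
```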

[Figure: Left: the original data. Centre: the result of standardizing the individual variables to zero mean and unit variance. Right: the result of whitening the data to give zero mean and unit covariance.]

Comparison of PCA with Fisher linear discriminant

Both methods perform a linear dimensionality reduction
PCA is unsupervised and depends only on the values $x_n$, whereas the Fisher linear discriminant also uses class-label information

[Figure: data in two dimensions belonging to two classes (red and blue), to be projected onto a single dimension. Magenta: the PCA direction. Green: Fisher's direction. (Why do they differ?)]

Data Visualization

Each data point is projected onto a 2-dimensional ($M = 2$) principal subspace
A data point $x_n$ is plotted at the Cartesian coordinates $x_n^T u_1$ and $x_n^T u_2$, where $u_1$ and $u_2$ are the eigenvectors corresponding to the largest and second largest eigenvalues
The different colours distinguish the labels on the data points
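A sketch of such a visualization, with synthetic high-dimensional data and made-up class labels standing in for the data set used on the slide; matplotlib is assumed to be available.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))   # 10-D synthetic data
labels = (X[:, 0] > 0).astype(int)                           # illustrative two-class labels

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / X.shape[0]
lam, U = np.linalg.eigh(S)
U2 = U[:, np.argsort(lam)[::-1][:2]]       # eigenvectors of the two largest eigenvalues

coords = (X - x_bar) @ U2                  # (x_n^T u_1, x_n^T u_2), up to the common mean shift
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```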

PCA for high-dimensional data

In some applications, the number of data points is smaller than the dimensionality of the data space
E.g. applying PCA to a few hundred images, where each image corresponds to a vector in a space of potentially several million dimensions
In a $D$-dimensional space, a set of $N$ points, where $N < D$, defines a linear subspace whose dimensionality is at most $N - 1$
If we perform PCA, we find that at least $D - N + 1$ of the eigenvalues are zero

Typical algorithms for finding the eigenvectors of a $D \times D$ matrix have a computational cost that scales like $O(D^3)$
Hence for such applications, a direct application of PCA is computationally infeasible
How do we resolve this problem?

Let us define $X$ to be the $(N \times D)$-dimensional centred data matrix, whose $n$-th row is given by $(x_n - \bar{x})^T$
The covariance matrix can then be written as $S = \frac{1}{N} X^T X$, and the corresponding eigenvector equation becomes
$$\frac{1}{N} X^T X u_i = \lambda_i u_i$$
Now pre-multiply both sides by $X$:
$$\frac{1}{N} X X^T (X u_i) = \lambda_i (X u_i)$$
Defining $v_i = X u_i$, we obtain
$$\frac{1}{N} X X^T v_i = \lambda_i v_i$$
This is an eigenvector equation for the $N \times N$ matrix $\frac{1}{N} X X^T$
It has the same $N - 1$ eigenvalues as the original covariance matrix, which has an additional $D - N + 1$ eigenvalues equal to zero
Thus we can solve the eigenvector problem in a space of lower dimensionality, with computational cost $O(N^3)$ instead of $O(D^3)$

In order to determine the eigenvectors of $S$, we multiply both sides by $X^T$ to give
$$\frac{1}{N} X^T X \left( X^T v_i \right) = \lambda_i \left( X^T v_i \right)$$
Here $X^T v_i$ is an eigenvector of $S$ with eigenvalue $\lambda_i$
The original (unit-norm) eigenvectors are given by
$$u_i = \frac{1}{(N \lambda_i)^{1/2}} X^T v_i$$
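A hedged sketch of this $N < D$ trick on synthetic data: the eigenproblem is solved for the small $N \times N$ matrix and the eigenvectors of $S$ are recovered via $u_i = X^T v_i / (N\lambda_i)^{1/2}$. The final line checks $S u_i = \lambda_i u_i$ directly, which is feasible here only because $D$ is kept modest.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, M = 50, 2000, 5                      # far fewer points than dimensions
X = rng.normal(size=(N, D))

x_bar = X.mean(axis=0)
Xc = X - x_bar                             # centred data matrix, rows (x_n - x_bar)^T

# Eigendecompose the small N x N matrix (1/N) X X^T instead of the D x D covariance
K_small = Xc @ Xc.T / N
lam, V = np.linalg.eigh(K_small)
order = np.argsort(lam)[::-1][:M]
lam, V = lam[order], V[:, order]

# Recover unit-norm eigenvectors of S = (1/N) X^T X via u_i = X^T v_i / sqrt(N * lambda_i)
U = Xc.T @ V / np.sqrt(N * lam)

S = Xc.T @ Xc / N                          # direct check, feasible at this D
print(np.allclose(S @ U, U * lam))         # S u_i = lambda_i u_i
```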

Kernel PCA

Consider a data set $\{x_n\}$ of observations, where $n = 1, \ldots, N$, in a space of dimensionality $D$
Assume that the data has been mean adjusted, so that $\sum_n x_n = 0$
We need to express conventional PCA in such a form that the data vectors $\{x_n\}$ appear only in the form of the scalar products $x_n^T x_m$
The principal components are defined by the eigenvectors $u_i$ of the covariance matrix,
$$S u_i = \lambda_i u_i, \qquad S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T, \qquad u_i^T u_i = 1$$

A nonlinear transformation $\phi(x_n)$ maps $x_n$ into an $M$-dimensional feature space
Assume for now that the projected data set also has zero mean, $\sum_n \phi(x_n) = 0$
The $M \times M$ sample covariance matrix in feature space is given by
$$C = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T$$
Its eigenvectors satisfy
$$C v_i = \lambda_i v_i, \qquad i = 1, \ldots, M$$
Our goal is to solve this eigenvalue problem without having to work explicitly in the feature space

Substituting for $C$ in the eigenvector equation gives
$$\frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \left\{ \phi(x_n)^T v_i \right\} = \lambda_i v_i$$
The vector $v_i$ is therefore given by a linear combination of the $\phi(x_n)$ and so can be written in the form
$$v_i = \sum_{n=1}^{N} a_{in} \phi(x_n)$$
Substituting this expansion back into the eigenvector equation,
$$\frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T \sum_{m=1}^{N} a_{im} \phi(x_m) = \lambda_i \sum_{n=1}^{N} a_{in} \phi(x_n)$$
We now make use of the kernel function $k(x_n, x_m) = \phi(x_n)^T \phi(x_m)$
Pre-multiplying both sides by $\phi(x_l)^T$ gives
$$\frac{1}{N} \sum_{n=1}^{N} k(x_l, x_n) \sum_{m=1}^{N} a_{im} k(x_n, x_m) = \lambda_i \sum_{n=1}^{N} a_{in} k(x_l, x_n)$$
This can be written in matrix notation as
$$K^2 a_i = \lambda_i N K a_i$$

$$K^2 a_i = \lambda_i N K a_i$$
Here $a_i$ is an $N$-dimensional column vector with elements $a_{in}$ for $n = 1, \ldots, N$
We can solve for $a_i$ by solving the eigenvalue problem
$$K a_i = \lambda_i N a_i$$
Thus kernel PCA involves an eigenvector expansion of the $N \times N$ matrix $K$
We require that the eigenvectors $v_i$ be normalized in feature space:
$$1 = v_i^T v_i = \sum_{n=1}^{N} \sum_{m=1}^{N} a_{in} a_{im} \phi(x_n)^T \phi(x_m) = a_i^T K a_i = \lambda_i N \, a_i^T a_i$$

The projection of a point $x$ onto eigenvector $i$ is given, in terms of the kernel function, by
$$y_i(x) = \phi(x)^T v_i = \sum_{n=1}^{N} a_{in} \phi(x)^T \phi(x_n) = \sum_{n=1}^{N} a_{in} k(x, x_n)$$
The feature-space dimensionality $M$ can be much larger than that of the input space $D$, so in general we can find a number of nonlinear principal components that exceeds $D$
However, the number of non-zero eigenvalues cannot exceed the number $N$ of data points, even if $M > N$
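Below is a sketch of kernel PCA as just described, assuming a Gaussian (RBF) kernel of the same form as in the later synthetic example (width 0.1, i.e. gamma = 10). It solves $K a_i = \lambda_i N a_i$, rescales each $a_i$ so that $a_i^T K a_i = 1$ (unit-norm $v_i$), and projects the training points. For simplicity it assumes the feature map is already centred; the centring correction appears on the following slides.

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_pca(X, n_components, kernel=rbf_kernel):
    """Coefficient vectors a_i and projections of the training data.

    Solves K a_i = lambda_i N a_i and rescales a_i so that the feature-space
    eigenvectors v_i = sum_n a_in phi(x_n) have unit norm.  The feature map is
    assumed to be centred.
    """
    K = kernel(X, X)
    eigvals, eigvecs = np.linalg.eigh(K)             # eigenvalues of K are lambda_i * N
    order = np.argsort(eigvals)[::-1][:n_components]
    A = eigvecs[:, order] / np.sqrt(eigvals[order])  # enforce a_i^T K a_i = 1
    Y = K @ A                                        # y_i(x_n) = sum_m a_im k(x_n, x_m)
    return A, Y

# usage on toy data
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
A, Y = kernel_pca(X, n_components=3)
print(Y.shape)                                       # (100, 3)
```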

So far we have assumed that the projected data set given by $\phi(x_n)$ has zero mean; we now relax this assumption
Let $\tilde{\phi}$ denote the projected data points after adjusting the mean (centring):
$$\tilde{\phi}(x_n) = \phi(x_n) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l)$$
Let the corresponding Gram matrix be
$$\tilde{K}_{nm} = \tilde{\phi}(x_n)^T \tilde{\phi}(x_m)$$
Expanding,
$$\tilde{K}_{nm} = \phi(x_n)^T \phi(x_m) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_n)^T \phi(x_l) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l)^T \phi(x_m) + \frac{1}{N^2} \sum_{j=1}^{N} \sum_{l=1}^{N} \phi(x_j)^T \phi(x_l)$$
$$= k(x_n, x_m) - \frac{1}{N} \sum_{l=1}^{N} k(x_n, x_l) - \frac{1}{N} \sum_{l=1}^{N} k(x_l, x_m) + \frac{1}{N^2} \sum_{j=1}^{N} \sum_{l=1}^{N} k(x_j, x_l)$$
In matrix notation,
$$\tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N$$
where $1_N$ denotes the $N \times N$ matrix in which every element takes the value $\frac{1}{N}$
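A small sketch of this centring step; the check against explicitly centred features for a linear kernel (a deliberately simple assumed setting) confirms the matrix identity.

```python
import numpy as np

def center_kernel(K):
    """Gram matrix of the mean-centred feature vectors:
    K~ = K - 1_N K - K 1_N + 1_N K 1_N, with 1_N the N x N matrix of 1/N."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N

# check against explicit feature centring for a linear kernel, phi(x) = x
rng = np.random.default_rng(7)
X = rng.normal(size=(20, 3)) + 5.0         # deliberately non-zero mean
K = X @ X.T                                 # linear kernel
Xc = X - X.mean(axis=0)
print(np.allclose(center_kernel(K), Xc @ Xc.T))   # True
```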

Kernel PCA

Illustration

Thus, we can evaluate $\tilde{K}$ using only the kernel function
We then use $\tilde{K}$ for the eigenvalue decomposition
Standard PCA is recovered as a special case if we use a linear kernel $k(x, x') = x^T x'$

[Figure: data in the original space $(x_1, x_2)$ is mapped by a nonlinear transformation $\phi$ into a feature space $(\phi_1, \phi_2)$. Performing PCA in the feature space gives the principal components, the first of which is $v_1$. Green lines in the feature space indicate the linear projections onto the first principal component; these correspond to nonlinear projections in the original data space.]

Kernel PCA

Example

The figure shows an example of kernel PCA on synthetic data
The following Gaussian kernel was applied to the data:
$$k(x, x') = \exp\left( -\|x - x'\|^2 / 0.1 \right)$$
The projection onto the corresponding principal components is defined by
$$\phi(x)^T v_i = \sum_{n=1}^{N} a_{in} k(x, x_n)$$
The contours are lines along which the projection onto the corresponding principal component is constant
The first two eigenvectors separate the three clusters
The next three eigenvectors split each of the clusters into halves
The following three again split the clusters into halves, along directions orthogonal to the previous splits

Disadvantage of Kernel PCA

It involves finding the eigenvectors of the $N \times N$ matrix $\tilde{K}$ rather than the $D \times D$ matrix $S$ of conventional linear PCA
This becomes a problem for large data sets, and we then need to use approximations

Probabilistic PCA

What is a linear-Gaussian framework?
We have a set of random variables
There exists a linear relationship between the variables in the system
The marginal/conditional distribution of each variable follows a Gaussian

Let us consider an explicit latent variable $z$ corresponding to the principal-component subspace
We define a Gaussian prior $p(z)$ over the latent variable,
$$p(z) = \mathcal{N}(z \mid 0, I)$$
which is a zero-mean, unit-covariance Gaussian
Given an observed value of $z$, we are interested in the distribution over the observed variable $x$
Assume that a general linear relation exists between $x$ and $z$: $x = Wz + \mu$
But since $x$ is a random variable, it will have an uncertainty associated with it

The observed variable $x$ ($D$-dimensional) is therefore defined as a linear transformation of the latent variable $z$ ($M$-dimensional) plus additive Gaussian noise:
$$x = Wz + \mu + \epsilon$$
where $\epsilon$ is a $D$-dimensional zero-mean Gaussian noise variable with covariance $\sigma^2 I$
The quantity $Wz + \mu$ serves as the mean of the distribution of $x$, and the conditional $p(x \mid z)$ is modelled as a Gaussian:
$$p(x \mid z) = \mathcal{N}\left( x \mid Wz + \mu, \sigma^2 I \right)$$
Here $W$ is a $D \times M$ matrix and $\mu$ is a $D$-dimensional vector

This framework maps the latent space to the data space
The marginal distribution $p(x)$ is
$$p(x) = \int p(x \mid z) \, p(z) \, dz$$
This corresponds to a linear-Gaussian model, so the marginal is a Gaussian,
$$p(x) = \mathcal{N}(x \mid \mu, C)$$

[Figure: the generative view. An observed point $x$ is generated by first drawing a value $\hat{z}$ for the latent variable from its prior distribution $p(z)$, and then drawing a value for $x$ from an isotropic Gaussian distribution (red circles) with mean $w\hat{z} + \mu$. The green ellipses show the density contours of the marginal distribution $p(x)$ in the $(x_1, x_2)$ plane.]

$$p(x) = \mathcal{N}(x \mid \mu, C)$$
Here $C$ is the $D \times D$ covariance matrix of $x$
We have
$$E[x] = E[Wz + \mu + \epsilon] = \mu$$
$$\mathrm{cov}[x] = E\left[ (Wz + \epsilon)(Wz + \epsilon)^T \right] = E\left[ W z z^T W^T \right] + E\left[ \epsilon \epsilon^T \right] = W W^T + \sigma^2 I$$
Thus, the covariance matrix is
$$C = W W^T + \sigma^2 I$$
Assumption: $z$ and $\epsilon$ are independent (uncorrelated) random variables

Probabilistic PCA Model

[Figure: graphical model for probabilistic PCA, with latent variable $z_n$ and observed variable $x_n$ inside a plate over the $N$ data points, and parameters $W$, $\mu$, $\sigma^2$ outside the plate.]
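A sketch of the generative model as a sampler, with illustrative (not fitted) parameter values: draw $z$ from its prior, add the noise $\epsilon$, and check that the sample mean and covariance of $x$ approach $\mu$ and $C = W W^T + \sigma^2 I$.

```python
import numpy as np

rng = np.random.default_rng(8)
D, M, N = 4, 2, 200_000
W = rng.normal(size=(D, M))                # illustrative parameters, not fitted values
mu = np.array([1.0, -2.0, 0.5, 3.0])
sigma2 = 0.1

Z = rng.normal(size=(N, M))                # z_n ~ N(0, I)
eps = rng.normal(scale=np.sqrt(sigma2), size=(N, D))
X = Z @ W.T + mu + eps                     # x_n = W z_n + mu + eps_n

C = W @ W.T + sigma2 * np.eye(D)           # predicted marginal covariance
print(np.allclose(X.mean(axis=0), mu, atol=0.05))          # sample mean close to mu
print(np.allclose(np.cov(X, rowvar=False), C, atol=0.1))   # sample covariance close to C
```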

The predictive distribution $p(x)$ is governed by the parameters $\mu$, $W$, $\sigma^2$
There is a redundancy corresponding to rotations of the latent-space coordinates
Consider a matrix $\tilde{W} = W R$, where $R$ is an orthogonal matrix. We have
$$\tilde{W} \tilde{W}^T = W R R^T W^T = W W^T$$
Thus $C = W W^T + \sigma^2 I$ is independent of $R$: there is a whole family of matrices $\tilde{W}$, all of which give rise to the same predictive distribution

To evaluate the predictive distribution we require $C^{-1}$, which involves the inversion of a $D \times D$ matrix. But we note that
$$C^{-1} = \sigma^{-2} I - \sigma^{-2} W M^{-1} W^T$$
where $M$ is the $M \times M$ matrix defined by
$$M = W^T W + \sigma^2 I$$
Inverting $M$ takes $O(M^3)$, against the $O(D^3)$ required for inverting $C$ directly
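A quick numerical check of this identity, with arbitrary dimensions and parameter values: the route through the $M \times M$ matrix gives the same $C^{-1}$ as inverting the full $D \times D$ matrix.

```python
import numpy as np

rng = np.random.default_rng(9)
D, M_dim, sigma2 = 500, 5, 0.5
W = rng.normal(size=(D, M_dim))

C = W @ W.T + sigma2 * np.eye(D)                   # D x D
M = W.T @ W + sigma2 * np.eye(M_dim)               # M x M
C_inv_fast = (np.eye(D) - W @ np.linalg.inv(M) @ W.T) / sigma2

print(np.allclose(C_inv_fast, np.linalg.inv(C)))   # matches the direct D x D inverse
```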

Maximum Likelihood PCA

The predictive distribution $p(x)$ is governed by the parameters $\mu$, $W$, $\sigma^2$
We need to solve for the parameters which maximize the likelihood of observing $X$
Given a data set $X = \{x_n\}$, the log-likelihood function is
$$\ln p(X \mid \mu, W, \sigma^2) = \sum_{n=1}^{N} \ln p(x_n \mid \mu, W, \sigma^2) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |C| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^T C^{-1} (x_n - \mu)$$
Setting the derivative with respect to $\mu$ to zero gives $\mu = \bar{x}$, where $\bar{x}$ is the data mean
Back-substituting for $\mu$,
$$\ln p(X \mid W, \mu, \sigma^2) = -\frac{N}{2} \left\{ D \ln(2\pi) + \ln |C| + \mathrm{Tr}(C^{-1} S) \right\}$$
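A hedged sketch that evaluates this log-likelihood for arbitrary $W$ and $\sigma^2$, and cross-checks it against directly summing the Gaussian log-densities of the centred data points.

```python
import numpy as np

def ppca_log_likelihood(X, W, sigma2):
    """ln p(X | mu, W, sigma^2) with mu = sample mean, via
    -(N/2) { D ln(2 pi) + ln|C| + Tr(C^{-1} S) },  C = W W^T + sigma^2 I."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / N
    C = W @ W.T + sigma2 * np.eye(D)
    _, logdetC = np.linalg.slogdet(C)
    return -0.5 * N * (D * np.log(2 * np.pi) + logdetC + np.trace(np.linalg.solve(C, S)))

# cross-check against the sum over data points
rng = np.random.default_rng(10)
X = rng.normal(size=(200, 3))
W, sigma2 = rng.normal(size=(3, 2)), 0.3
C = W @ W.T + sigma2 * np.eye(3)
Xc = X - X.mean(axis=0)
direct = -0.5 * (200 * 3 * np.log(2 * np.pi) + 200 * np.linalg.slogdet(C)[1]
                 + np.sum(Xc @ np.linalg.inv(C) * Xc))
print(np.isclose(ppca_log_likelihood(X, W, sigma2), direct))
```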

$$\ln p(X \mid W, \mu, \sigma^2) = -\frac{N}{2} \left\{ D \ln(2\pi) + \ln |C| + \mathrm{Tr}(C^{-1} S) \right\}$$
Here $S$ is the data covariance matrix,
$$S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$$
Maximizing with respect to $W$ and $\sigma^2$ is more complex, but nonetheless has an exact closed-form solution
All of the stationary points of the log-likelihood function can be written as
$$W_{\mathrm{ML}} = U_M \left( L_M - \sigma^2 I \right)^{1/2} R$$
where $U_M$ is a $D \times M$ matrix whose columns are given by any subset (of size $M$) of the eigenvectors of the data covariance matrix $S$, $L_M$ is an $M \times M$ diagonal matrix with elements given by the corresponding eigenvalues $\lambda_i$, and $R$ is an arbitrary $M \times M$ orthogonal matrix
The maximum of the likelihood function is obtained when the $M$ eigenvectors are chosen to be those whose eigenvalues are the $M$ largest
The columns $u_1, \ldots, u_M$ are then the $M$ principal eigenvectors, arranged in order of decreasing eigenvalue

The columns of $W_{\mathrm{ML}}$ define the principal subspace of standard PCA
The maximum likelihood solution for $\sigma^2$ is given by
$$\sigma^2_{\mathrm{ML}} = \frac{1}{D - M} \sum_{i=M+1}^{D} \lambda_i$$
Hence $\sigma^2_{\mathrm{ML}}$ is the average variance associated with the discarded dimensions

Given the structure of the covariance matrix $C = W W^T + \sigma^2 I$ and $W_{\mathrm{ML}} = U_M (L_M - \sigma^2 I)^{1/2} R$, consider the variance of the predictive distribution along some direction $v$ with $v^T v = 1$
The variance is given by $v^T C v$
If $v$ is orthogonal to the principal subspace, then $v^T U_M = 0$, hence $v^T W_{\mathrm{ML}} = 0$ and
$$v^T C v = \sigma^2$$
Thus the model predicts a noise variance orthogonal to the principal subspace, which is just the average of the discarded eigenvalues

$$C = W W^T + \sigma^2 I, \qquad W_{\mathrm{ML}} = U_M (L_M - \sigma^2 I)^{1/2} R$$
Now consider $v = u_i$, where $u_i$ is one of the retained eigenvectors defining the principal subspace. Then
$$v^T C v = (\lambda_i - \sigma^2) + \sigma^2 = \lambda_i$$
Thus the variance of the data along the principal axes is $\lambda_i$
We can see that the variance in the direction of an eigenvector $u_i$ is composed of two components:
The contribution $\lambda_i - \sigma^2$ from the projection of the unit-variance latent-space distribution into data space through the corresponding column of $W$
The contribution $\sigma^2$ which is added in all directions by the noise model

[Figure: the same generative-view illustration as before, showing the prior $p(z)$, the conditional $p(x \mid \hat{z})$, and the marginal density contours of $p(x)$ in the $(x_1, x_2)$ plane.]
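A sketch tying the last few results together on synthetic data: build $W_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$ from the eigendecomposition of $S$ (taking $R = I$, an allowed choice), then check that the model variance is $\lambda_i$ along each retained eigenvector and $\sigma^2_{\mathrm{ML}}$ along directions orthogonal to the principal subspace.

```python
import numpy as np

rng = np.random.default_rng(11)
N, D, M = 1000, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / N
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]                            # decreasing eigenvalue order

sigma2_ml = lam[M:].mean()                                # average of the discarded eigenvalues
W_ml = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2_ml))   # R chosen as the identity
C = W_ml @ W_ml.T + sigma2_ml * np.eye(D)

# variance of the model along a retained eigenvector is lambda_i ...
print(np.allclose([U[:, i] @ C @ U[:, i] for i in range(M)], lam[:M]))
# ... and along a direction orthogonal to the principal subspace it is sigma^2_ML
print(np.isclose(U[:, M] @ C @ U[:, M], sigma2_ml))
```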

The case M = D

When $M = D$ there is no reduction of dimensionality; $U_M = U$ and $L_M = L$, so $W_{\mathrm{ML}} = U (L - \sigma^2 I)^{1/2} R$
Making use of the orthogonality properties $U U^T = I$ and $R R^T = I$, we see that the covariance $C$ of the marginal distribution for $x$ becomes
$$C = U \left( L - \sigma^2 I \right)^{1/2} R R^T \left( L - \sigma^2 I \right)^{1/2} U^T + \sigma^2 I = U L U^T = S$$
The covariance matrix is then given by the sample covariance.

Difference

Conventional PCA maps from the observed data space to the latent space
Probabilistic PCA maps from the latent space to the observed data space, $x = Wz + \mu + \epsilon$
Probabilistic PCA allows the model to capture the dominant correlations in a data set: using a $D \times M$ matrix $W$ we construct the $D \times D$ covariance matrix $C = W W^T + \sigma^2 I$
Thus the number of free parameters can be restricted while still allowing the model to capture the dominant correlations in the data
