
Principal Component Analysis

Principal Component Analysis, or simply PCA, is a variable-reduction procedure. It is useful when we have obtained data on a number of variables (possibly a large number of variables). Each principal component is defined as a linear combination of optimally weighted observed variables. PCA is a way of identifying patterns in data and of expressing the data in such a way as to highlight their similarities and differences. It is a method for compressing a lot of data into something that captures the essence of the original data.

If the original variables are uncorrelated, the analysis does nothing. The best results are obtained when the original variables are very highly correlated (e.g. 20 or 30 highly correlated variables can be adequately represented by 2 or 3 principal components).

The goal of Principal Component Analysis is to explain the maximum amount of variance with the fewest principal components. It is used to reduce the number of variables and to avoid multicollinearity, or having too many predictors relative to the number of observations. It is a useful statistical technique that has found application in fields such as face recognition and image compression, and it is a common technique for finding patterns in data of high dimension. It is also widely used in the social sciences, market research and other fields that involve large data sets.

The main advantage of PCA is that it reduces the number of dimensions without much loss of information.

Let us define some terms before introducing the basic concepts of Principal Component Analysis.

Dimension reduction is analogous to being philosophically reductionist: it reduces the data down to its basic components, stripping away any unnecessary parts.

An eigenvector is a direction.

An eigenvalue is a number that tells you how much variance there is in the data in that direction.

Principal Component Analysis is a good name: it tells you exactly what it does. PCA finds the principal components of the data. So what are principal components? They are the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out.

Here are some triangles arranged in the shape of an oval. Imagine that the triangles are points of data. To find the direction where there is the most variance, find the straight line along which the data is most spread out when projected onto it.

If the points are projected onto a vertical straight line, the data is not very spread out, so it does not have a large variance; the vertical line is probably not the principal component.

If the points are projected onto a horizontal line instead, the data is much more spread out and has a large variance. In fact, there is no straight line you can draw that has a larger variance than the horizontal one. The horizontal line is therefore the principal component in this example.

Luckily, we can use maths to find the principal component rather than drawing lines and
unevenly shaped triangles. This is where eigenvectors and eigenvalues come in.

Eigenvectors and Eigenvalues

When we get a set of data points, like the triangles above, we can deconstruct the set into eigenvectors and eigenvalues. Eigenvectors and eigenvalues always come in pairs: every eigenvector has a corresponding eigenvalue.

Although in the last example we could point the line in any direction, it turns out that a data set does not have arbitrarily many eigenvectors and eigenvalues to find. In fact, the number of eigenvector/eigenvalue pairs that exist equals the number of dimensions the data set has.

Say we are measuring age and hours on the internet. There are 2 variables, so it is a 2-dimensional data set, and therefore there are 2 eigenvector/eigenvalue pairs. If we measure age, hours on the internet and hours on the mobile phone, there are 3 variables, a 3-dimensional data set, and so 3 eigenvector/eigenvalue pairs. The reason for this is that eigenvectors put the data into a new set of dimensions, and the number of new dimensions has to equal the original number of dimensions.
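As a rough sketch of this idea (not part of the original text), the two eigenvalue/eigenvector pairs of a 2-variable data set can be computed from its covariance matrix with NumPy. The variable names age and hours_online and the values below are made up for illustration only.

import numpy as np

# Hypothetical 2-variable data set: age and hours spent on the internet.
age          = np.array([18, 22, 25, 30, 34, 40, 45, 52, 60, 65], dtype=float)
hours_online = np.array([6.0, 5.5, 5.0, 4.5, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5])

data = np.column_stack([age, hours_online])      # shape (10, 2): a 2-D data set
cov = np.cov(data, rowvar=False)                 # 2 x 2 covariance matrix

# A 2-D data set yields exactly 2 eigenvalue/eigenvector pairs.
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: covariance matrices are symmetric

order = np.argsort(eigenvalues)[::-1]            # sort largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print("eigenvalues (variance along each direction):", eigenvalues)
print("first eigenvector (the principal component): ", eigenvectors[:, 0])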

At the moment the oval sits on an x-y axis, where x could be age and y hours on the internet. These are the two dimensions in which the data set is currently being measured. Now remember that the principal component of the oval was the line splitting it lengthways.

It turns out that the other eigenvector (remember there are only two of them, as it is a 2-D problem) is perpendicular to the principal component. As we said, the eigenvectors have to be able to span the whole x-y area; to do this most effectively, the two directions need to be orthogonal (i.e. at 90 degrees) to one another. This is why the x and y axes are orthogonal to each other in the first place. It would be very awkward if the y axis were at 45 degrees to the x axis.

The eigenvectors give us a much more useful set of axes to frame the data in, and we can now re-frame the data in these new dimensions. Note that nothing has been done to the data itself; we are just looking at it from a different angle.

These new directions are where there is the most variation, and that is where there is the most information. (Think about it the other way round: if there were no variation in the data, e.g. everything was equal to 1, there would be no information. In that scenario the eigenvalue for that dimension would equal zero, because there is no variation.)

Now suppose we measure the three variables from before: age, hours on the internet and hours on the mobile phone. There are 3 variables, so it is a 3-D data set. Three dimensions give an x, y and z graph, measuring width, depth and height (like the dimensions in the real world). Now imagine that the data forms an oval like the ones above, but that this oval lies on a plane, i.e. all the data points lie on a flat sheet within this 3-D graph (having width and depth, but no height).

When we find the 3 eigenvector/eigenvalue pairs of this data set (remember, a 3-D problem gives 3 eigenvectors), 2 of the eigenvectors will have large eigenvalues and one will have an eigenvalue of zero. The first two eigenvectors will show the width and depth of the data, but because the data has no height (it lies on a flat sheet), the third eigenvalue will be zero.

Call the first eigenvector (the one with the biggest eigenvalue, the principal component) ev1, the second eigenvector (which has a non-zero eigenvalue) ev2, and the third eigenvector, which has an eigenvalue of zero, ev3.

We can now rearrange our axes to lie along the eigenvectors, rather than along age, hours on the internet and hours on the mobile phone. However, we know that ev3, the third eigenvector, is pretty useless. Therefore, instead of representing the data in 3 dimensions, we can discard the useless direction and represent the data in only 2 dimensions, like before.

This is dimension reduction. We have reduced the problem from a 3-D problem to a 2-D problem by getting rid of a dimension. Reducing dimensions helps to simplify the data and makes it easier to visualise.
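A minimal sketch of this situation (not part of the original text): the data below is generated so that the third variable is an exact linear combination of the first two, i.e. the points lie on a plane, so the third eigenvalue comes out as (essentially) zero and the data can be projected onto the first two eigenvectors.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-variable data set that actually lies on a plane:
# the third variable is an exact linear combination of the first two,
# so the cloud has width and depth but no "height" off the plane.
x = rng.normal(size=200)
y = rng.normal(size=200)
z = 2.0 * x - 0.5 * y                      # no independent variation
data = np.column_stack([x, y, z])

cov = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("eigenvalues:", np.round(eigenvalues, 6))      # the third one is ~0

# Dimension reduction: keep only the two eigenvectors with large eigenvalues
# and project the centred data onto them (3-D problem -> 2-D problem).
centred = data - data.mean(axis=0)
reduced = centred @ eigenvectors[:, :2]
print("reduced data shape:", reduced.shape)          # (200, 2)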

PCA: Objectives

Take p variables X_1, X_2, ..., X_p and find linear combinations of these to produce indices Z_1, Z_2, ..., Z_p that are uncorrelated. That is,

Z_i = a_{i1} X_1 + a_{i2} X_2 + ... + a_{ip} X_p,

subject to the condition that

a_{i1}^2 + a_{i2}^2 + ... + a_{ip}^2 = 1.

Also, Var(Z_1) >= Var(Z_2) >= ... >= Var(Z_p), and Z_i is called the ith principal component. The lack of correlation is a useful property because it means that the indices are measuring different “dimensions” of the data.
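A short derivation sketch (not in the original text) links this setup to the eigenvalue results of the next section. The first component maximizes

Var(Z_1) = a_1' S a_1   subject to   a_1' a_1 = 1,

where a_1 = (a_{11}, ..., a_{1p})' and S is the covariance matrix of the X's. Introducing a Lagrange multiplier \lambda and setting the derivative of a_1' S a_1 - \lambda (a_1' a_1 - 1) with respect to a_1 to zero gives

S a_1 = \lambda a_1,

so a_1 must be an eigenvector of S, and Var(Z_1) = a_1' S a_1 = \lambda is the corresponding eigenvalue. Choosing the largest eigenvalue therefore maximizes the variance.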

Some Basic Results

Let S denote the sample covariance matrix of X_1, X_2, ..., X_p,

S = [c_{ij}],

where c_{ii} is the variance of X_i and c_{ij} (i ≠ j) is the covariance between X_i and X_j.

The variances of the principal components are the eigenvalues of the matrix S. There are p of these, some of which may be zero. Negative eigenvalues are not possible for a covariance matrix. Assuming that the eigenvalues are ordered as λ_1 >= λ_2 >= ... >= λ_p >= 0, then λ_i corresponds to the ith principal component Z_i = a_{i1} X_1 + a_{i2} X_2 + ... + a_{ip} X_p, and the constants a_{i1}, a_{i2}, ..., a_{ip} are the components of the corresponding eigenvector. Now,

Var(Z_1) + Var(Z_2) + ... + Var(Z_p) = λ_1 + λ_2 + ... + λ_p = c_{11} + c_{22} + ... + c_{pp},

so the principal components account for all of the variation in the original data. In order to avoid one variable having an undue influence on the principal components, it is usual to standardize the variables X_1, X_2, ..., X_p to have mean 0 and variance 1 at the start of the analysis. Then the covariance matrix takes the form

R = [r_{ij}],

where r_{ii} = 1 and r_{ij} (i ≠ j) is the correlation between X_i and X_j. In other words, the covariance matrix of the standardized variables is the correlation matrix.
Steps in a Principal Component Analysis

1. Standardize the variables X_1, X_2, ..., X_p to have mean 0 and variance 1.

2. Calculate the correlation matrix R.

3. Find the eigenvalues λ_1, λ_2, ..., λ_p and the corresponding eigenvectors a_1, a_2, ..., a_p, where a_i = (a_{i1}, a_{i2}, ..., a_{ip})'.

4. Discard any components that account for only a small proportion of the variation in the data (a code sketch of these four steps follows below).
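These steps translate almost line for line into code. The following is a minimal sketch with NumPy; the function name pca_steps and the eigenvalue-greater-than-1 retention rule are illustrative choices, not part of the original text.

import numpy as np

def pca_steps(X):
    """Follow the four steps above on a data matrix X (rows = cases, columns = variables).
    A minimal sketch, not the SPSS/STATA procedure itself."""
    # Step 1: standardize each variable to mean 0 and variance 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 2: calculate the correlation matrix R
    # (the covariance matrix of the standardized variables).
    R = np.corrcoef(X, rowvar=False)

    # Step 3: find the eigenvalues and the corresponding eigenvectors of R.
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Step 4: discard components that account for only a small proportion of
    # the variation (here: keep components with eigenvalue > 1, the Kaiser rule).
    keep = eigenvalues > 1.0
    scores = Z @ eigenvectors[:, keep]     # principal component scores
    return eigenvalues, eigenvectors, scores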

Assumptions About the Data Before Doing a Principal Component Analysis

1. The variables should be measured at the continuous level (or be ordinal variables).

2. There needs to be a linear relationship between all variables.

3. Sampling adequacy (5 to 10 cases per variable).

4. Your data should be suitable for data reduction (Bartlett's Test of Sphericity).

5. No significant outliers.

Illustration Using SPSS and STATA

The birds being considered were picked up after a severe storm. The first 21 of them recovered, while the other 28 died. A question of some interest is therefore whether the survivors and non-survivors show any differences. Tests have shown no evidence of any difference in mean values. The situation can now be considered in terms of principal components.

Table 1. Body measurements of female sparrows (X1 = total length, X2 = alar extent, X3 = length of beak and head, X4 = length of humerus, X5 = length of keel of sternum, all in mm). Birds 1 to 21 survived, while the remainder died.

Bird X1 X2 X3 X4 X5
1 156 245 31.6 18.5 20.5
2 154 240 30.4 17.9 19.6
3 153 240 31 18.4 20.6
4 153 236 30.9 17.7 20.2
5 155 243 31.5 18.6 20.3
6 163 247 32 19 20.9
7 157 238 30.9 18.4 20.2
8 155 239 32.8 18.6 21.2
9 164 248 32.7 19.1 21.1
10 158 238 31 18.8 22
11 158 240 31.3 18.6 22
12 160 244 31.1 18.6 20.5
13 161 246 32.3 19.3 21.8
14 157 245 32 19.1 20
15 157 235 31.5 18.1 19.8
16 156 237 30.9 18 20.3
17 158 244 31.4 18.5 21.6
18 153 238 30.5 18.2 20.9
19 155 236 30.3 18.5 20.1
20 163 246 32.5 18.6 21.9
21 159 236 31.5 18 21.5
22 155 240 31.4 18 20.7
23 156 240 31.5 18.2 20.6
24 160 242 32.6 18.8 21.7
25 152 232 30.3 17.2 19.8
26 160 250 31.7 18.8 22.5
27 155 237 31 18.5 20
28 157 245 32.2 19.5 21.4
29 165 245 33.1 19.8 22.7
30 153 231 30.1 17.3 19.8
31 162 239 30.3 18 23.1
32 162 243 31.6 18.8 21.3
33 159 245 31.8 18.5 21.7
34 159 247 30.9 18.1 19
35 155 243 30.9 18.5 21.3
36 162 252 31.9 19.1 22.2
37 152 230 30.4 17.3 18.6
38 159 242 30.8 18.2 20.5
39 155 238 31.2 17.9 19.3
40 163 249 33.4 19.5 22.8
41 163 242 31 18.1 20.7
42 156 237 31.7 18.2 20.3
43 159 238 31.5 18.4 20.3
44 161 245 32.1 19.1 20.8
45 155 235 30.7 17.7 19.6
46 162 247 31.9 19.1 20.4
47 153 237 30.6 18.6 20.4
48 162 245 32.5 18.5 21.1

STEP 1: In SPSS, choose Analyze > Dimension Reduction > Factor.

STEP 2: Move all the variables into the Variables box and press the PASTE button below.

STEP 3: Enter the syntax in the Syntax Editor window and click the Run icon (the green triangle).

STEP 4: STATA is used to extract the eigenvectors. Enter the command: pca X1 X2 X3 X4 X5
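For readers without SPSS or STATA, an equivalent of the pca X1 X2 X3 X4 X5 command can be sketched in Python with scikit-learn. This is an illustrative alternative, not part of the original procedure, and the file name sparrows.txt is hypothetical.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X is an (n_birds, 5) array holding the columns X1-X5 of Table 1.
X = np.loadtxt("sparrows.txt")             # hypothetical file containing the table's data

X_std = StandardScaler().fit_transform(X)  # standardize, so PCA runs on the correlation matrix
pca = PCA().fit(X_std)

print(pca.explained_variance_ratio_)       # proportion of variance per component
print(pca.components_[0])                  # coefficients of the first principal component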
SPSS Output

Descriptive Statistics

Mean Std. Deviation Analysis N

X1 157.98 3.654 49
X2 241.33 5.068 49
X3 31.459184 .7947532 49
X4 18.469388 .5642857 49
X5 20.826531 .9913744 49

The table above is output because we used the univariate option on the /print subcommand. Please note that the only way to see how many cases were actually used
in the principal components analysis is to include the univariate option on
the /print subcommand. The number of cases used in the analysis will be less than the total
number of cases in the data file if there are missing values on any of the variables used in the
principal components analysis, because, by default, SPSS does a listwise deletion of incomplete
cases. If the principal components analysis is being conducted on the correlations (as opposed to
the covariances), it is not much of a concern that the variables have very different means and/or
standard deviations (which is often the case when variables are measured on different scales).

Correlation Matrix

        X1      X2      X3      X4      X5
X1   1.000    .735    .662    .645    .605
X2    .735   1.000    .674    .769    .529
X3    .662    .674   1.000    .763    .526
X4    .645    .769    .763   1.000    .607
X5    .605    .529    .526    .607   1.000

a. Determinant = .037

The table above was included in the output because we included the keyword correlation on
the /print subcommand. This table gives the correlations between the original variables (which
are specified on the /variables subcommand). Before conducting a principal components
analysis, you want to check the correlations between the variables. If any of the correlations are
too high (say above .9), you may need to remove one of the variables from the analysis, as the
two variables seem to be measuring the same thing. Another alternative would be to combine
the variables in some way (perhaps by taking the average). If the correlations are too low, say
below .1, then one or more of the variables might load only onto one principal component (in
other words, make its own principal component). This is not helpful, as the whole point of the
analysis is to reduce the number of items (variables).
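As a rough illustrative check (not part of the original analysis), the same screening can be done in code using the correlation matrix reported above; the .9 and .1 thresholds follow the rule of thumb just described.

import numpy as np

# Correlation matrix of X1-X5 as reported in the SPSS output above.
R = np.array([
    [1.000, .735, .662, .645, .605],
    [ .735, 1.000, .674, .769, .529],
    [ .662, .674, 1.000, .763, .526],
    [ .645, .769, .763, 1.000, .607],
    [ .605, .529, .526, .607, 1.000],
])

upper = np.triu_indices_from(R, k=1)                 # look at each pair of variables once
too_high = [(i, j) for i, j in zip(*upper) if R[i, j] > 0.9]
too_low  = [(i, j) for i, j in zip(*upper) if R[i, j] < 0.1]

print("determinant of R:", round(np.linalg.det(R), 3))  # should be close to the .037 reported by SPSS
print("pairs correlated above .9:", too_high)           # none here
print("pairs correlated below .1:", too_low)            # none here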

KMO and Bartlett's Test

Kaiser-Meyer-Olkin Measure of Sampling Adequacy        .826

Bartlett's Test of Sphericity    Approx. Chi-Square    150.193
                                 df                     10
                                 Sig.                   .000

a. Kaiser-Meyer-Olkin Measure of Sampling Adequacy - This measure varies between 0 and 1, and values closer to 1 are better. A value of .6 is a suggested minimum.

b. Bartlett's Test of Sphericity - This tests the null hypothesis that the correlation matrix is an identity matrix. An identity matrix is a matrix in which all of the diagonal elements are 1 and all off-diagonal elements are 0. You want to reject this null hypothesis.

Taken together, these tests provide a minimum standard which should be passed before a
principal components analysis should be conducted.
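As a sketch of where the Bartlett chi-square value comes from (assuming the standard formula based on the determinant of the correlation matrix, which is not shown in the original text), with n = 49 cases and p = 5 variables:

import numpy as np
from scipy.stats import chi2

n, p = 49, 5                     # cases and variables in the sparrow data
det_R = 0.037                    # determinant of the correlation matrix (SPSS output)

# Bartlett's test of sphericity: chi-square statistic and degrees of freedom.
chi_square = -(n - 1 - (2 * p + 5) / 6.0) * np.log(det_R)
df = p * (p - 1) // 2
p_value = chi2.sf(chi_square, df)

print(round(chi_square, 1), df, round(p_value, 4))   # roughly 150.0, 10, 0.0 (cf. 150.193, 10, .000)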

Communalities

       Initial   Extraction
X1      1.000       .738
X2      1.000       .771
X3      1.000       .734
X4      1.000       .801
X5      1.000       .572

Extraction Method: Principal Component Analysis.

a. Communalities - This is the proportion of each variable's variance that can be explained by
the principal components (e.g., the underlying latent continua). It is also noted as h2 and can be
defined as the sum of squared factor loadings.
b. Initial - By definition, the initial value of the communality in a principal components analysis
is 1.

c. Extraction - The values in this column indicate the proportion of each variable's variance that
can be explained by the principal components. Variables with high values are well represented
in the common factor space, while variables with low values are not well represented. (In this
example, we don't have any particularly low values.) They are the reproduced variances from
the number of components that you have saved. You can find these values on the diagonal of the
reproduced correlation matrix.
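Since only one component was extracted here, each extraction communality is simply the squared loading of that variable on the first component. A quick check against the Component Matrix reported further below (this check is not in the original text):

# Loadings of X1-X5 on the single extracted component (see the Component Matrix below).
loadings = [0.859, 0.878, 0.857, 0.895, 0.756]

# With one component, communality = loading squared.
communalities = [round(l ** 2, 3) for l in loadings]
print(communalities)   # [0.738, 0.771, 0.734, 0.801, 0.572] -- matches the Extraction column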

Total Variance Explained

Component          Initial Eigenvalues                     Extraction Sums of Squared Loadings
            Total    % of Variance   Cumulative %     Total    % of Variance   Cumulative %
1           3.616        72.320         72.320        3.616        72.320         72.320
2            .532        10.630         82.950
3            .386         7.728         90.678
4            .302         6.031         96.709
5            .165         3.291        100.000

Extraction Method: Principal Component Analysis.

a. Component - There are as many components extracted during a principal components analysis as there are variables that are put into it. In our example, we used 5 variables, so we have 5 components.

b. Initial Eigenvalues - Eigenvalues are the variances of the principal components. Because we conducted our principal components analysis on the correlation matrix, the variables are standardized, which means that each variable has a variance of 1, and the total variance is equal to the number of variables used in the analysis, in this case, 5.

c. Total - This column contains the eigenvalues. The first component will always account for
the most variance (and hence have the highest eigenvalue), and the next component will account
for as much of the left over variance as it can, and so on. Hence, each successive component will
account for less and less variance.
d. % of Variance - This column contains the percent of variance accounted for by each
principal component.

e. Cumulative % - This column contains the cumulative percentage of variance accounted for by the current and all preceding principal components. For example, the first row shows a value of 72.320, which means that the first component accounts for 72.320% of the total variance. (Remember that because this is principal components analysis, all variance is considered to be true and common variance. In other words, the variables are assumed to be measured without error, so there is no error variance.)

f. Extraction Sums of Squared Loadings - The three columns of this half of the table exactly
reproduce the values given on the same row on the left side of the table. The number of rows
reproduced on the right side of the table is determined by the number of principal components
whose eigenvalues are 1 or greater. These components will be used to represent our data.

The scree plot graphs the eigenvalue against the component number. You can see these values in the first two columns of the table immediately above. From the second component on, the line is almost flat, meaning that each successive component accounts for smaller and smaller amounts of the total variance. In general, we are interested in keeping only those principal components whose eigenvalues are greater than 1. Components with an eigenvalue of less than 1 account for less variance than did an original (standardized) variable, which had a variance of 1, and so are of little use. Hence, you can see that the point of principal components analysis is to redistribute the variance in the correlation matrix (using eigenvalue decomposition) onto the first components extracted.
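The "Total Variance Explained" figures and the scree-plot values can be reproduced directly from the correlation matrix. A minimal sketch follows, assuming the rounded correlations reported above, so the results agree only approximately with the SPSS output.

import numpy as np

# Correlation matrix of X1-X5 from the SPSS output (rounded to 3 decimals).
R = np.array([
    [1.000, .735, .662, .645, .605],
    [ .735, 1.000, .674, .769, .529],
    [ .662, .674, 1.000, .763, .526],
    [ .645, .769, .763, 1.000, .607],
    [ .605, .529, .526, .607, 1.000],
])

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]          # largest first
pct_variance = 100 * eigenvalues / eigenvalues.sum()        # eigenvalues sum to p = 5
cumulative = np.cumsum(pct_variance)

for k, (ev, pct, cum) in enumerate(zip(eigenvalues, pct_variance, cumulative), start=1):
    kaiser = "keep" if ev > 1 else "drop"                   # eigenvalue-greater-than-1 rule
    print(f"component {k}: eigenvalue {ev:.3f}, {pct:.1f}% of variance, cumulative {cum:.1f}% ({kaiser})")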
Component Matrix

        Component
            1
X1        .859
X2        .878
X3        .857
X4        .895
X5        .756

Extraction Method: Principal Component Analysis.
a. 1 components extracted.
b. Component Matrix - This table contains component loadings, which are the correlations
between the variable and the component. Because these are correlations, possible values range
from -1 to +1. On the /format subcommand, we used the option blank(.30), which tells SPSS
not to print any of the correlations that are .3 or less. This makes the output easier to read by
removing the clutter of low correlations that are probably not meaningful anyway.

c. Component - The columns under this heading are the principal components that have been
extracted. As you can see by the footnote provided by SPSS (a.), one component was extracted
(the component that has an eigenvalue greater than 1).

Using STATA, the first principal component is found to be

Z_1 = 0.452 X_1 + 0.462 X_2 + 0.451 X_3 + 0.471 X_4 + 0.398 X_5,

where X_1 to X_5 are the standardized body measurements. Z_1 can be interpreted as an index of the size of the sparrows. It seems, therefore, that about 72.3% of the variation in the data is related to size differences. As Z_1 measures overall size, it seems that stabilizing selection may have acted against very large and very small birds.
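As a final sketch (not in the original text), the size index Z_1 can be computed for each bird by standardizing the five measurements with the means and standard deviations from the Descriptive Statistics table and applying the eigenvector coefficients above; the two example rows are birds 1 and 2 from Table 1.

import numpy as np

# First-eigenvector coefficients from the STATA output above.
a1 = np.array([0.452, 0.462, 0.451, 0.471, 0.398])

# Means and standard deviations of X1-X5 from the Descriptive Statistics table.
means = np.array([157.98, 241.33, 31.459184, 18.469388, 20.826531])
sds   = np.array([3.654, 5.068, 0.7947532, 0.5642857, 0.9913744])

# Birds 1 and 2 from Table 1 (X1, X2, X3, X4, X5).
birds = np.array([
    [156, 245, 31.6, 18.5, 20.5],
    [154, 240, 30.4, 17.9, 19.6],
])

z_scores = (birds - means) / sds          # standardize each measurement
size_index = z_scores @ a1                # Z1, the overall size index for each bird
print(np.round(size_index, 3))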
