If the original variables are uncorrelated, the analysis does nothing. The best results are obtained
when the original variables are very highly correlated (e.g. 20 or 30 highly correlated variables
can be adequately represented by 2 or 3 principal components).
The goal of Principal Component Analysis is to explain the maximum amount of variance with
the fewest number of principal components. It is used to reduce the number of variables and to
avoid multicollinearity, or when there are too many predictors relative to the number of
observations. It is a useful statistical technique that has found application in fields such as face
recognition and image compression, and is a common technique for finding patterns in data of
high dimension. It is also commonly used in the social sciences, market research and other
fields that involve large data sets.
The main advantage of PCA is that it reduces the number of dimensions without much loss of
information.
Before we move on to the basic concept of Principal Component Analysis, let's define some
terms. An eigenvector is a direction. An eigenvalue is a number: it tells you how much variance
there is in the data in that direction.
Principal Component Analysis is a good name: it tells you exactly what the method does. PCA
finds the principal components of the data. So what are principal components? They are the
underlying structure in the data: the directions in which there is the most variance, the
directions in which the data is most spread out.
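These directions of greatest spread can be computed directly. Below is a minimal NumPy sketch on made-up two-dimensional data; the variables and values are purely illustrative.

```python
import numpy as np

# Synthetic data: the second coordinate is strongly correlated with the
# first, so the cloud is "spread out" along a line (values are made up).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.3, size=500)
data = np.column_stack([x, y])

# The principal components are the eigenvectors of the covariance matrix;
# each eigenvalue is the variance of the data along its eigenvector.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data, rowvar=False))

# eigh returns eigenvalues in ascending order, so the last pair is the
# direction of most spread: the first principal component.
pc1 = eigenvectors[:, -1]
print(eigenvalues[-1] > eigenvalues[0])   # True: far more variance along pc1
```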
Imagine that the triangles are points of data. To find the direction of most variance, find the
straight line along which the data is most spread out when projected onto it.
Luckily, we can use maths to find the principal component rather than drawing lines and
unevenly shaped triangles. This is where eigenvectors and eigenvalues come in.
When we get a set of data points, like the triangles earlier, we can deconstruct the set into
eigenvectors and eigenvalues. Eigenvectors and eigenvalues always come in pairs: every
eigenvector has a corresponding eigenvalue.
Although in the last example the line could point in any direction, it turns out that a data set
has only a limited number of eigenvectors and eigenvalues. In fact, the number of
eigenvector/eigenvalue pairs equals the number of dimensions of the data set.
Say we're measuring age and hours on the internet. There are 2 variables, so it is a
2-dimensional data set, and therefore there are 2 eigenvector/eigenvalue pairs. If we're
measuring age, hours on the internet and hours on the mobile phone, there are 3 variables, so
it is a 3-D data set, and there are 3 pairs. The reason for this is that the eigenvectors put the
data into a new set of dimensions, and the number of new dimensions has to equal the original
number of dimensions.
There are 3 variables, so it is a 3-D data set. 3 dimensions form an x, y and z graph, measuring
width, depth and height (like the dimensions in the real world). Now imagine that the data
forms an oval like the ones above, but that this oval lies on a plane, i.e. all the data points lie
on a piece of paper within this 3-D graph (having width and depth, but no height).
When we find the 3 eigenvectors/values of the data set (remember 3D problem = 3 eigenvectors),
2 of the eigenvectors will have large eigenvalues, and one of the eigenvectors will have an
eigenvalue of zero. The first two eigenvectors will show the width and depth of the data, but
because there is no height on the data (it is on a piece of paper) the third eigenvalue will be zero.
On the picture, ev1 is the first eigenvector (the one with the biggest eigenvalue, the principal
component), ev2 is the second eigenvector (which has a non-zero eigenvalue) and ev3 is the third
eigenvector, which has an eigenvalue of zero.
We can now rearrange our axes to lie along the eigenvectors, rather than along age, hours on
the internet and hours on the mobile. However, we know that ev3, the third eigenvector, is
pretty useless. Therefore, instead of representing the data in 3 dimensions, we can get rid of
the useless direction and represent it in only 2 dimensions, like before.
This is dimension reduction. We have reduced the problem from a 3D to a 2D problem, getting
rid of a dimension. Reducing dimensions helps to simplify the data and make it easier to
visualise.
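The 3-D-to-2-D reduction just described can be reproduced in a few lines of NumPy. The three columns below are made-up stand-ins for variables such as age and hours online, with the third an exact combination of the first two, so the cloud has width and depth but no "height" off the plane.

```python
import numpy as np

# Synthetic 3-D data that actually lies on a plane.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = rng.normal(size=200)
data = np.column_stack([a, b, a + b])   # third column = linear combination

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data, rowvar=False))

# Three dimensions give three eigenvector/eigenvalue pairs, but one
# eigenvalue is (numerically) zero: that direction carries no variance.
# Dropping it represents the data in 2 dimensions instead of 3.
keep = eigenvectors[:, eigenvalues > 1e-8]
reduced = data @ keep
print(reduced.shape)   # (200, 2)
```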
PCA: Objectives
Suppose the data consist of p variables X1, X2, ..., Xp with sample covariance matrix

S = [ s11 s12 ... s1p ]
    [ s21 s22 ... s2p ]
    [ ...             ]
    [ sp1 sp2 ... spp ]

The variances of the principal components are the eigenvalues of the matrix S. There are p of
these, some of which may be zero. Negative eigenvalues are not possible for a covariance
matrix. Assuming that the eigenvalues are ordered as λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0, then λi
corresponds to the ith principal component

Zi = ai1 X1 + ai2 X2 + ... + aip Xp,

and the constants ai1, ai2, ..., aip are the components of the corresponding eigenvector. Now,

λ1 + λ2 + ... + λp = s11 + s22 + ... + spp = trace(S).
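The eigenvalues of a covariance matrix are never negative, and they sum to the trace of S, the total variance of the original variables. A quick numerical check on arbitrary synthetic data:

```python
import numpy as np

# Arbitrary synthetic data with some correlation mixed in.
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 3))
X[:, 2] += 0.5 * X[:, 0]

S = np.cov(X, rowvar=False)              # sample covariance matrix
eigenvalues = np.linalg.eigvalsh(S)      # its p = 3 eigenvalues

# Non-negative, and their sum equals the trace (the total variance).
print(np.isclose(eigenvalues.sum(), np.trace(S)))   # True
```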
It should be noted that the PCs account for all of the variation in the original data. In order to
avoid one variable having an undue influence on the principal components, it is usual to
standardize the variables X1, X2, ..., Xp to have mean 0 and variance 1 at the start of the
analysis. Then,

R = [ 1   r12 ... r1p ]
    [ r21 1   ... r2p ]
    [ ...             ]
    [ rp1 rp2 ... 1   ]

In this case, the covariance matrix for the standardized variables is the correlation matrix R,
where rij is the correlation between Xi and Xj.
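That last point is easy to verify numerically: standardizing the columns makes their covariance matrix coincide with the correlation matrix of the raw data. A small NumPy sketch on synthetic data with deliberately mismatched scales:

```python
import numpy as np

# Three made-up variables on very different scales.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) * [1.0, 10.0, 100.0]

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # mean 0, variance 1
S_of_Z = np.cov(Z, rowvar=False)                   # covariance of standardized data
R = np.corrcoef(X, rowvar=False)                   # correlation of raw data

print(np.allclose(S_of_Z, R))                      # True: they coincide
```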
Steps in Principal Component Analysis
1. Standardize the variables X1, X2, ..., Xp to have mean 0 and variance 1.
2. Calculate the covariance matrix of the standardized variables, i.e. the correlation matrix R.
3. Find the eigenvalues λ1, λ2, ..., λp of R and their corresponding eigenvectors.
4. Discard any components that only account for a small proportion of the variation in the data.
Requirements for a Principal Component Analysis
1. The variables should be measured at the continuous level, or be ordinal variables.
2. There should be a linear relationship between the variables.
3. The sample should be large enough for sampling adequacy.
4. Your data should be suitable for data reduction (Bartlett's Test of Sphericity).
5. There should be no significant outliers.
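Sketched by hand in NumPy, the whole procedure looks like this (synthetic data; the eigenvalue > 1 cut-off anticipates the Kaiser criterion used later in the SPSS output):

```python
import numpy as np

# Arbitrary synthetic data with two correlated variables.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X[:, 1] += X[:, 0]

# 1. Standardize each variable to mean 0 and variance 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Calculate the covariance matrix of the standardized data,
#    i.e. the correlation matrix of the original data.
R = np.cov(Z, rowvar=False)

# 3. Find the eigenvalues and eigenvectors, largest eigenvalue first.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Discard minor components; here we keep eigenvalues > 1.
keep = eigenvalues > 1
scores = Z @ eigenvectors[:, keep]      # the retained component scores

print(eigenvalues.sum())   # ~4.0: the p eigenvalues of R sum to p
```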
The birds being considered were picked up after a severe storm. The first 21 of them recovered
while the other 28 died. A question of some interest is therefore whether the survivors and
non-survivors show any differences. Tests have shown no evidence of any difference in mean
values. The situation can now be considered in terms of principal components.
Table 1. Body measurement of female sparrows (X1= total length, X2 = alar extent, X3= length
of beak and head, X4 = length of humerus, X5 = length of keel of sternum, all in mm.). Birds 1
to 21 survived, while the remainder died.
Bird X1 X2 X3 X4 X5
1 156 245 31.6 18.5 20.5
2 154 240 30.4 17.9 19.6
3 153 240 31 18.4 20.6
4 153 236 30.9 17.7 20.2
5 155 243 31.5 18.6 20.3
6 163 247 32 19 20.9
7 157 238 30.9 18.4 20.2
8 155 239 32.8 18.6 21.2
9 164 248 32.7 19.1 21.1
10 158 238 31 18.8 22
11 158 240 31.3 18.6 22
12 160 244 31.1 18.6 20.5
13 161 246 32.3 19.3 21.8
14 157 245 32 19.1 20
15 157 235 31.5 18.1 19.8
16 156 237 30.9 18 20.3
17 158 244 31.4 18.5 21.6
18 153 238 30.5 18.2 20.9
19 155 236 30.3 18.5 20.1
20 163 246 32.5 18.6 21.9
21 159 236 31.5 18 21.5
22 155 240 31.4 18 20.7
23 156 240 31.5 18.2 20.6
24 160 242 32.6 18.8 21.7
25 152 232 30.3 17.2 19.8
26 160 250 31.7 18.8 22.5
27 155 237 31 18.5 20
28 157 245 32.2 19.5 21.4
29 165 245 33.1 19.8 22.7
30 153 231 30.1 17.3 19.8
31 162 239 30.3 18 23.1
32 162 243 31.6 18.8 21.3
33 159 245 31.8 18.5 21.7
34 159 247 30.9 18.1 19
35 155 243 30.9 18.5 21.3
36 162 252 31.9 19.1 22.2
37 152 230 30.4 17.3 18.6
38 159 242 30.8 18.2 20.5
39 155 238 31.2 17.9 19.3
40 163 249 33.4 19.5 22.8
41 163 242 31 18.1 20.7
42 156 237 31.7 18.2 20.3
43 159 238 31.5 18.4 20.3
44 161 245 32.1 19.1 20.8
45 155 235 30.7 17.7 19.6
46 162 247 31.9 19.1 20.4
47 153 237 30.6 18.6 20.4
48 162 245 32.5 18.5 21.1
STEP 4: Stata is used to extract the eigenvectors. Enter the command: pca X1 X2 X3 X4 X5
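For readers without Stata or SPSS, here is a rough Python equivalent of the same extraction, using the 48 birds listed in Table 1 (note the SPSS output below reports N = 49, so the results here will be close to, but not exactly, the printed values):

```python
import numpy as np

# X1-X5 for the 48 sparrows listed in Table 1.
data = np.array([
    [156, 245, 31.6, 18.5, 20.5], [154, 240, 30.4, 17.9, 19.6],
    [153, 240, 31.0, 18.4, 20.6], [153, 236, 30.9, 17.7, 20.2],
    [155, 243, 31.5, 18.6, 20.3], [163, 247, 32.0, 19.0, 20.9],
    [157, 238, 30.9, 18.4, 20.2], [155, 239, 32.8, 18.6, 21.2],
    [164, 248, 32.7, 19.1, 21.1], [158, 238, 31.0, 18.8, 22.0],
    [158, 240, 31.3, 18.6, 22.0], [160, 244, 31.1, 18.6, 20.5],
    [161, 246, 32.3, 19.3, 21.8], [157, 245, 32.0, 19.1, 20.0],
    [157, 235, 31.5, 18.1, 19.8], [156, 237, 30.9, 18.0, 20.3],
    [158, 244, 31.4, 18.5, 21.6], [153, 238, 30.5, 18.2, 20.9],
    [155, 236, 30.3, 18.5, 20.1], [163, 246, 32.5, 18.6, 21.9],
    [159, 236, 31.5, 18.0, 21.5], [155, 240, 31.4, 18.0, 20.7],
    [156, 240, 31.5, 18.2, 20.6], [160, 242, 32.6, 18.8, 21.7],
    [152, 232, 30.3, 17.2, 19.8], [160, 250, 31.7, 18.8, 22.5],
    [155, 237, 31.0, 18.5, 20.0], [157, 245, 32.2, 19.5, 21.4],
    [165, 245, 33.1, 19.8, 22.7], [153, 231, 30.1, 17.3, 19.8],
    [162, 239, 30.3, 18.0, 23.1], [162, 243, 31.6, 18.8, 21.3],
    [159, 245, 31.8, 18.5, 21.7], [159, 247, 30.9, 18.1, 19.0],
    [155, 243, 30.9, 18.5, 21.3], [162, 252, 31.9, 19.1, 22.2],
    [152, 230, 30.4, 17.3, 18.6], [159, 242, 30.8, 18.2, 20.5],
    [155, 238, 31.2, 17.9, 19.3], [163, 249, 33.4, 19.5, 22.8],
    [163, 242, 31.0, 18.1, 20.7], [156, 237, 31.7, 18.2, 20.3],
    [159, 238, 31.5, 18.4, 20.3], [161, 245, 32.1, 19.1, 20.8],
    [155, 235, 30.7, 17.7, 19.6], [162, 247, 31.9, 19.1, 20.4],
    [153, 237, 30.6, 18.6, 20.4], [162, 245, 32.5, 18.5, 21.1],
])

# PCA on the correlation matrix, as in the Stata/SPSS runs.
R = np.corrcoef(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of the total variance carried by the first principal component.
print(round(100 * eigenvalues[0] / eigenvalues.sum(), 1))  # close to SPSS's 72.3%

# All five loadings on the first component share one sign: a size index.
pc1 = eigenvectors[:, 0]
print(np.all(pc1 > 0) or np.all(pc1 < 0))   # True
```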
SPSS Output
Descriptive Statistics

     Mean        Std. Deviation   Analysis N
X1   157.98      3.654            49
X2   241.33      5.068            49
X3   31.459184   .7947532         49
X4   18.469388   .5642857         49
X5   20.826531   .9913744         49
Correlation Matrix(a)

     X1   X2   X3   X4   X5
(the matrix entries are not reproduced here)

a. Determinant = .037
The table above was included in the output because we included the keyword correlation on
the /print subcommand. This table gives the correlations between the original variables (which
are specified on the /variables subcommand). Before conducting a principal components
analysis, you want to check the correlations between the variables. If any of the correlations are
too high (say above .9), you may need to remove one of the variables from the analysis, as the
two variables seem to be measuring the same thing. Another alternative would be to combine
the variables in some way (perhaps by taking the average). If the correlations are too low, say
below .1, then one or more of the variables might load only onto one principal component (in
other words, make its own principal component). This is not helpful, as the whole point of the
analysis is to reduce the number of items (variables).
Bartlett's Test of Sphericity: Sig. = .000
b. Bartlett's Test of Sphericity - This tests the null hypothesis that the correlation matrix is an
identity matrix. An identity matrix is a matrix in which all of the diagonal elements are 1 and
all off-diagonal elements are 0. You want to reject this null hypothesis.
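The statistic behind that Sig. value can be reconstructed from numbers already in the output: Bartlett's chi-square is -(n - 1 - (2p + 5)/6) ln|R|. Plugging in the determinant .037 reported above, with n = 49 and p = 5:

```python
import math

# Bartlett's test of sphericity from quantities in the SPSS output.
n, p = 49, 5            # sample size and number of variables
det_R = 0.037           # determinant of the correlation matrix

chi_square = -(n - 1 - (2 * p + 5) / 6) * math.log(det_R)
df = p * (p - 1) // 2   # 10 degrees of freedom

print(round(chi_square, 1), df)   # 150.0 10

# The 5% critical value of chi-square with 10 df is about 18.31, so the
# identity-matrix null hypothesis is rejected decisively (Sig. < .001).
```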
Taken together, these tests provide a minimum standard which should be passed before a
principal components analysis should be conducted.
Communalities
Initial Extraction
X1 1.000 .738
X2 1.000 .771
X3 1.000 .734
X4 1.000 .801
X5 1.000 .572
a. Communalities - This is the proportion of each variable's variance that can be explained by
the principal components (e.g., the underlying latent continua). It is also denoted h2 and can
be defined as the sum of squared factor loadings.
b. Initial - By definition, the initial value of the communality in a principal components analysis
is 1.
c. Extraction - The values in this column indicate the proportion of each variable's variance that
can be explained by the principal components. Variables with high values are well represented
in the common factor space, while variables with low values are not well represented. (In this
example, we don't have any particularly low values.) They are the reproduced variances from
the number of components that you have saved. You can find these values on the diagonal of the
reproduced correlation matrix.
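Since only one component is extracted in this analysis, each communality in the Extraction column is simply the square of that variable's loading on the component (the loadings appear in the Component Matrix further below):

```python
# With one extracted component, communality = (loading)^2. The loadings
# are those SPSS reports in the Component Matrix for this analysis.
loadings = {"X1": 0.859, "X2": 0.878, "X3": 0.857, "X4": 0.895, "X5": 0.756}

communalities = {v: round(a ** 2, 3) for v, a in loadings.items()}
print(communalities["X1"])   # 0.738, matching the Extraction column
```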
b. Initial Eigenvalues - Eigenvalues are the variances of the principal components. Because we
conducted our principal components analysis on the correlation matrix, the variables are
standardized, which means that each variable has a variance of 1, and the total variance is
equal to the number of variables used in the analysis, in this case, 5.
c. Total - This column contains the eigenvalues. The first component will always account for
the most variance (and hence have the highest eigenvalue), and the next component will account
for as much of the left over variance as it can, and so on. Hence, each successive component will
account for less and less variance.
d. % of Variance - This column contains the percent of variance accounted for by each
principal component.
e. Cumulative % - This column contains the cumulative percentage of variance accounted for
by the current and all preceding principal components. For example, the first row shows a
value of 72.320, meaning that the first component accounts for 72.320% of the total
variance. (Remember that because this is principal components analysis, all variance is
considered to be true and common variance. In other words, the variables are assumed to be
measured without error, so there is no error variance.)
f. Extraction Sums of Squared Loadings - The three columns of this half of the table exactly
reproduce the values given on the same row on the left side of the table. The number of rows
reproduced on the right side of the table is determined by the number of principal components
whose eigenvalues are 1 or greater. These components will be used to represent our data.
The scree plot graphs the eigenvalue against the component number. You can see these values
in the first two columns of the table immediately above. From the second component on, you
can see that the line is almost flat, meaning that each successive component accounts for
smaller and smaller amounts of the total variance. In general, we are interested in keeping only
those principal components whose eigenvalues are greater than 1. Components with an
eigenvalue of less than 1 account for less variance than did an original variable (which had a
variance of 1), and so are of little use. Hence, you can see that the point of principal
components analysis is to redistribute the variance in the correlation matrix (using the method
of eigenvalue decomposition) so that the first components extracted account for as much of it
as possible.
Component Matrix(a)

     Component
       1
X1   .859
X2   .878
X3   .857
X4   .895
X5   .756

Extraction Method: Principal Component Analysis.
a. 1 component extracted.
b. Component Matrix - This table contains component loadings, which are the correlations
between the variable and the component. Because these are correlations, possible values range
from -1 to +1. On the /format subcommand, we used the option blank(.30), which tells SPSS
not to print any of the correlations that are .3 or less. This makes the output easier to read by
removing the clutter of low correlations that are probably not meaningful anyway.
c. Component - The columns under this heading are the principal components that have been
extracted. As you can see by the footnote provided by SPSS (a.), one component was extracted
(the component that has an eigenvalue greater than 1).
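These loadings also tie the earlier tables together: summing the squared loadings down the column recovers the component's eigenvalue, and dividing by the number of variables gives its percentage of variance:

```python
# The squared loadings of the extracted component, summed over variables,
# reproduce its eigenvalue; divided by p = 5 they give the % of variance.
loadings = [0.859, 0.878, 0.857, 0.895, 0.756]

eigenvalue = sum(a ** 2 for a in loadings)
percent = 100 * eigenvalue / 5

print(round(eigenvalue, 3))   # 3.616, the first eigenvalue
print(round(percent, 2))      # 72.32, the "% of Variance" entry
```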
The first principal component, Z1, can be interpreted as an index of the size of the sparrows.
It seems therefore that about 72.3% of the variation in the data is related to size differences.
As Z1 measures overall size, it seems that stabilizing selection may have acted against very
large and very small birds.