
Factor Analysis

Factor analysis is a useful tool for investigating variable relationships for complex concepts such
as socioeconomic status, dietary patterns, or psychological scales. It allows researchers to
investigate concepts that are not easily measured directly by collapsing a large number of
variables into a few interpretable underlying factors.

Factor analysis is a general name denoting a class of procedures primarily used for data
reduction and summarization. Factor analysis is an interdependence technique in that an entire
set of interdependent relationships is examined without making the distinction between
dependent and independent variables.

Factor analysis is used in the following circumstances:

– To identify underlying dimensions, or factors, that explain the correlations
among a set of variables.

– To identify a new, smaller set of uncorrelated variables to replace the original set
of correlated variables in subsequent multivariate analysis (regression or
discriminant analysis).

– To identify a smaller set of salient variables from a larger set for use in
subsequent multivariate analysis.

Factor Analysis Model


Mathematically, each variable is expressed as a linear combination of underlying factors. The
covariation among the variables is described in terms of a small number of common factors plus
a unique factor for each variable. If the variables are standardized, the factor model may be
represented as:

Xi = Ai1 F1 + Ai2 F2 + Ai3 F3 + . . . + Aim Fm + Vi Ui

where

Xi = the ith standardized variable
Aij = standardized multiple regression coefficient of variable i on common factor j
Fj = the jth common factor
Vi = standardized regression coefficient of variable i on unique factor i
Ui = the unique factor for variable i
m = number of common factors

The unique factors are uncorrelated with each other and with the common factors. The common
factors themselves can be expressed as linear combinations of the observed variables.

Fi = Wi1 X1 + Wi2 X2 + Wi3 X3 + . . . + Wik Xk

where

Fi = estimate of the ith factor
Wij = weight, or factor score coefficient, of variable j on factor i
k = number of variables

It is possible to select weights or factor score coefficients so that the first factor explains
the largest portion of the total variance.

Then a second set of weights can be selected, so that the second factor accounts for most
of the residual variance, subject to being uncorrelated with the first factor.

This same principle could be applied to selecting additional weights for the additional
factors.
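A minimal numpy sketch of this principle on simulated data (all values below are illustrative, not from the text): eigendecomposition of the correlation matrix supplies the weights, the first factor captures the largest share of total variance, and successive factors are uncorrelated with one another.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 500 observations of 6 correlated variables (hypothetical data).
latent = rng.normal(size=(500, 2))
loadings_true = rng.uniform(0.4, 0.8, size=(6, 2))
X = latent @ loadings_true.T + 0.5 * rng.normal(size=(500, 6))

# Standardize the variables and form the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z, rowvar=False)

# Eigendecomposition: eigenvectors supply the weights, and the
# eigenvalues give the variance each factor explains.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                        # all factor scores at once

print(eigvals[0] >= eigvals[1])             # first factor explains the most variance
print(abs(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]) < 1e-8)  # factors uncorrelated
```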

Statistics Associated with Factor Analysis


• Bartlett's test of sphericity. Bartlett's test of sphericity is a test statistic used to examine
the hypothesis that the variables are uncorrelated in the population. In other words, the
population correlation matrix is an identity matrix; each variable correlates perfectly with
itself (r = 1) but has no correlation with the other variables (r = 0).

• Correlation matrix. A correlation matrix is a lower triangular matrix showing the simple
correlations, r, between all possible pairs of variables included in the analysis. The
diagonal elements, which are all 1, are usually omitted.

• Communality. Communality is the amount of variance a variable shares with all the
other variables being considered. This is also the proportion of variance explained by the
common factors.

• Eigenvalue. The eigenvalue represents the total variance explained by each factor. Any
factor with an eigenvalue ≥ 1 explains more variance than a single observed variable.
• Factor loadings. Factor loadings are simple correlations between the variables and the
factors. Here is an example of the output of a simple factor analysis looking at indicators
of wealth, with just six variables and two resulting factors.

Variable                                             Factor 1   Factor 2
Income                                                 0.65       0.11
Education                                              0.59       0.25
Occupation                                             0.48       0.19
House value                                            0.38       0.60
Number of public parks in neighborhood                 0.13       0.57
Number of violent crimes per year in neighborhood      0.23       0.55

• The variable with the strongest association to the underlying latent variable, Factor 1, is
income, with a factor loading of 0.65.
• Since factor loadings can be interpreted like standardized regression coefficients, one
could also say that the variable income has a correlation of 0.65 with Factor 1. This
would be considered a strong association for a factor analysis in most research fields.
• Two other variables, education and occupation, are also associated with Factor 1. Based
on the variables loading highly onto Factor 1, we could call it “Individual socioeconomic
status.”
• House value, number of public parks, and number of violent crimes per year, however,
have high factor loadings on the other factor, Factor 2. They seem to indicate the overall
wealth within the neighborhood, so we may want to call Factor 2 “Neighborhood
socioeconomic status.”
• Notice that the variable house value also is marginally important in Factor 1 (loading =
0.38). This makes sense, since the value of a person’s house should be associated with his
or her income.
• Factor loading plot. A factor loading plot is a plot of the original variables using the
factor loadings as coordinates.

• Factor matrix. A factor matrix contains the factor loadings of all the variables on all the
factors extracted.

• Factor scores. Factor scores are composite scores estimated for each respondent on the
derived factors.

• Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. The Kaiser-Meyer-Olkin
(KMO) measure of sampling adequacy is an index used to examine the appropriateness
of factor analysis. High values (between 0.5 and 1.0) indicate factor analysis is
appropriate. Values below 0.5 imply that factor analysis may not be appropriate.

• Percentage of variance. The percentage of the total variance attributed to each factor.

• Residuals are the differences between the observed correlations, as given in the input
correlation matrix, and the reproduced correlations, as estimated from the factor matrix.

• Scree plot. A scree plot is a plot of the eigenvalues against the number of factors in order
of extraction.
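Several of these statistics can be computed directly from a loading matrix. A short sketch using the six-variable wealth example above: communalities are row sums of squared loadings, each factor's eigenvalue is the column sum of squared loadings, and percentage of variance follows from the total variance of six standardized variables.

```python
import numpy as np

# Factor loadings from the wealth example (6 variables, 2 factors).
loadings = np.array([
    [0.65, 0.11],   # Income
    [0.59, 0.25],   # Education
    [0.48, 0.19],   # Occupation
    [0.38, 0.60],   # House value
    [0.13, 0.57],   # Public parks
    [0.23, 0.55],   # Violent crimes
])

# Communality: variance of each variable explained by the common factors
# (row sum of squared loadings).
communalities = (loadings ** 2).sum(axis=1)

# Eigenvalue of each factor: column sum of squared loadings.
eigenvalues = (loadings ** 2).sum(axis=0)

# Percentage of total variance (6 standardized variables -> total variance 6).
pct_variance = 100 * eigenvalues / loadings.shape[0]

print(np.round(communalities, 3))
print(np.round(eigenvalues, 3))
print(np.round(pct_variance, 1))
```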

Conducting Factor Analysis


(I) Formulate the Problem


• The objectives of factor analysis should be identified.

• The variables to be included in the factor analysis should be specified based on past
research, theory, and judgment of the researcher. It is important that the variables be
appropriately measured on an interval or ratio scale.

An appropriate sample size should be used. As a rough guideline, there should be at least four or
five times as many observations (sample size) as there are variables.

(II) Construct the Correlation Matrix

The analytical process is based on a matrix of correlations between the variables.

Bartlett's test of sphericity can be used to test the null hypothesis that the variables are
uncorrelated in the population: in other words, the population correlation matrix is an identity
matrix. If this hypothesis cannot be rejected, then the appropriateness of factor analysis should
be questioned.
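SciPy has no built-in sphericity test, so the sketch below implements the standard chi-square approximation by hand (the formula is the usual one for Bartlett's test of sphericity; the simulated data are illustrative, not from the text):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test that the population correlation matrix is an identity matrix.

    X: (n observations x p variables). Returns (chi-square statistic, p-value).
    """
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Standard chi-square approximation: -(n - 1 - (2p + 5)/6) * ln|R|
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, chi2.sf(stat, df)

# Strongly correlated data should reject the null hypothesis.
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
correlated = base + 0.3 * rng.normal(size=(300, 4))  # four near-duplicate variables
stat, p = bartlett_sphericity(correlated)
print(p < 0.05)   # variables are clearly correlated, so factor analysis is appropriate
```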

Another useful statistic is the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy.
Small values of the KMO statistic indicate that the correlations between pairs of variables cannot
be explained by other variables and that factor analysis may not be appropriate.
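The overall KMO index can likewise be computed by hand from the correlation matrix and its inverse (the anti-image approach, which yields the partial correlations). A hedged sketch with simulated one-factor data:

```python
import numpy as np

def kmo(X):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy.

    Partial correlations come from the inverse of the correlation matrix
    (the anti-image approach). Returns a value between 0 and 1.
    """
    R = np.corrcoef(X, rowvar=False)
    S = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(S), np.diag(S)))
    partial = -S / d                          # off-diagonal entries are partial correlations
    mask = ~np.eye(R.shape[0], dtype=bool)    # exclude the diagonal
    r2 = (R[mask] ** 2).sum()
    q2 = (partial[mask] ** 2).sum()
    return r2 / (r2 + q2)

# Five indicators driven by one latent variable (hypothetical data).
rng = np.random.default_rng(2)
latent = rng.normal(size=(400, 1))
X = latent + 0.6 * rng.normal(size=(400, 5))
kmo_value = kmo(X)
print(kmo_value > 0.5)   # high KMO: factor analysis looks appropriate here
```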

(III) Determine the Method of Factor Analysis

Principal Component Analysis (PCA)


• Is a variable reduction technique
• Is used when variables are highly correlated
• Reduces the number of observed variables to a smaller number of principal components
which account for most of the variance of the observed variables
• Is a large sample procedure
The total amount of variance in PCA is equal to the number of observed variables being
analyzed. In PCA, observed variables are standardized (mean = 0, standard deviation = 1),
so the diagonal elements of the correlation matrix are all equal to 1. The amount of
variance explained is equal to the trace of the decomposed correlation matrix (the sum of
its diagonal elements).
The number of components extracted is equal to the number of observed variables in the
analysis. The first principal component identified accounts for most of the variance in the
data. The second component identified accounts for the second largest amount of
variance in the data and is uncorrelated with the first principal component and so on.
Components accounting for maximal variance are retained while other components
accounting for a trivial amount of variance are not retained. Eigenvalues indicate the
amount of variance explained by each component. Eigenvectors are the weights used to
calculate component scores.
• In common factor analysis, the factors are estimated based only on the common
variance. Communalities are inserted in the diagonal of the correlation matrix. This
method is appropriate when the primary concern is to identify the underlying dimensions
and the common variance is of interest. This method is also known as principal axis
factoring.
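The variance accounting described for PCA can be verified numerically. In this sketch (arbitrary simulated data), the trace of the correlation matrix and the sum of its eigenvalues both equal the number of observed variables:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # arbitrary correlated data

R = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(R)[::-1]      # sorted, largest first

# On standardized variables the total variance equals the number of variables:
# the trace of R (all diagonal entries are 1) and the sum of the eigenvalues
# both equal p.
p = R.shape[0]
print(np.isclose(np.trace(R), p))          # True
print(np.isclose(eigvals.sum(), p))        # True
print(eigvals[0] >= eigvals[-1])           # components are ordered by variance
```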

Exploratory Factor Analysis (EFA)


• Is a variable reduction technique which identifies the number of latent constructs and
the underlying factor structure of a set of variables
• Hypothesizes an underlying construct, a variable not measured directly
• Estimates factors which influence responses on observed variables
• Allows you to describe and identify the number of latent constructs (factors)
• Includes unique factors, error due to unreliability in measurement
• Traditionally has been used to explore the possible underlying factor structure of a set
of measured variables without imposing any preconceived structure on the outcome
(Child, 1990).

Fig. 1 below shows 4 factors (ovals), each measured by 3 observed variables (rectangles), with
unique factors. Since measurement is not perfect, error or unreliability is estimated and specified
explicitly in the diagram. Factor loadings (parameter estimates) help interpret factors. Loadings
are the correlation between observed variables and factors, are standardized regression weights if
variables are standardized (weights used to predict variables from factor), and are path
coefficients in path analysis. Standardized linear weights represent the effect size of the factor on
variability of observed variables.
Variables are standardized in EFA, e.g., mean=0, standard deviation=1, diagonals are adjusted
for unique factors, 1-u. The amount of variance explained is the trace (sum of the diagonals) of
the decomposed adjusted correlation matrix. Eigenvalues indicate the amount of variance
explained by each factor. Eigenvectors are the weights that could be used to calculate factor
scores. In common practice, factor scores are calculated with a mean or sum of measured
variables that “load” on a factor.

In EFA, observed variables are a linear combination of the underlying factors (estimated factors
plus a unique factor). Communality is the variance of an observed variable accounted for by the
common factors; a variable with a large communality is strongly influenced by the underlying
constructs. Communality is computed by summing the squares of a variable's factor loadings.

d1^2 = 1 - communality = % variance accounted for by the unique factor
d1 = sqrt(1 - communality) = unique factor weight (parameter estimate)
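A one-variable sketch of these relationships, with hypothetical loadings:

```python
import numpy as np

# Loadings of one observed variable on two common factors (hypothetical values).
loadings = np.array([0.70, 0.40])

communality = (loadings ** 2).sum()       # variance explained by the common factors
unique_variance = 1 - communality         # d^2: variance due to the unique factor
d = np.sqrt(unique_variance)              # unique-factor weight

print(round(float(communality), 2))       # 0.65
print(round(float(unique_variance), 2))   # 0.35
```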

(IV) Determine the Number of Factors


• A Priori Determination. ​Sometimes, because of prior knowledge, the researcher knows
how many factors to expect and thus can specify the number of factors to be extracted
beforehand. ​

• Determination Based on Eigenvalues. An eigenvalue is the column sum of squared
loadings for a factor; it is also known as the latent root. In this approach, only factors with
eigenvalues greater than 1.0 are retained. An eigenvalue represents the amount of
variance associated with the factor. Hence, only factors with a variance greater than 1.0
are included. Factors with variance less than 1.0 are no better than a single variable,
since, due to standardization, each variable has a variance of 1.0. If the number of
variables is less than 20, this approach will result in a conservative number of factors.

• Determination Based on Scree Plot. A scree plot is a plot of the eigenvalues against
the number of factors in order of extraction. Experimental evidence indicates that the
point at which the scree begins denotes the true number of factors. Generally, the
number of factors determined by a scree plot will be one or a few more than that
determined by the eigenvalue criterion.

• Determination Based on Percentage of Variance. In this approach the number of
factors extracted is determined so that the cumulative percentage of variance extracted by
the factors reaches a satisfactory level. It is recommended that the factors extracted
should account for at least 60% of the variance.
• Determination Based on Split-Half Reliability. ​The sample is split in half and factor
analysis is performed on each half. Only factors with high correspondence of factor
loadings across the two subsamples are retained.

• Determination Based on Significance Tests. It is possible to determine the statistical
significance of the separate eigenvalues and retain only those factors that are statistically
significant. A drawback is that with large samples (size greater than 200), many factors
are likely to be statistically significant, although from a practical viewpoint many of these
account for only a small proportion of the total variance.
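The eigenvalue and percentage-of-variance criteria are easy to apply programmatically. A sketch with hypothetical eigenvalues from a six-variable extraction:

```python
import numpy as np

# Hypothetical eigenvalues from a factor extraction on 6 variables.
eigenvalues = np.array([2.4, 1.5, 0.8, 0.6, 0.4, 0.3])

# Eigenvalue criterion: retain only factors whose eigenvalue exceeds 1.0.
n_by_eigenvalue = int((eigenvalues > 1.0).sum())

# Percentage-of-variance criterion: retain enough factors to reach 60%.
cum_pct = 100 * np.cumsum(eigenvalues) / eigenvalues.sum()
n_by_variance = int(np.argmax(cum_pct >= 60) + 1)

print(n_by_eigenvalue)   # 2
print(n_by_variance)     # 2
```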

(V) Rotate Factors


• Although the initial or unrotated factor matrix indicates the relationship between the
factors and individual variables, it seldom results in factors that can be interpreted,
because the factors are correlated with many variables. Therefore, through rotation the
factor matrix is transformed into a simpler one that is easier to interpret.

• In rotating the factors, we would like each factor to have nonzero, or significant, loadings
or coefficients for only some of the variables. Likewise, we would like each variable to
have nonzero or significant loadings with only a few factors, if possible with only one.

• The rotation is called orthogonal rotation if the axes are maintained at right angles.

• The most commonly used method for rotation is the varimax procedure. This is an
orthogonal method of rotation that minimizes the number of variables with high loadings
on a factor, thereby enhancing the interpretability of the factors. Orthogonal rotation
results in factors that are uncorrelated.

• The rotation is called oblique rotation when the axes are not maintained at right angles,
and the factors are correlated. Sometimes, allowing for correlations among factors can
simplify the factor pattern matrix. Oblique rotation should be used when factors in the
population are likely to be strongly correlated.
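A compact sketch of varimax rotation using the standard SVD-based algorithm (this is one common formulation, not necessarily the exact routine any particular package uses), applied to the unrotated wealth-example loadings from earlier. Note that an orthogonal rotation leaves each variable's communality unchanged:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a loading matrix L (variables x factors)."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Varimax criterion update via SVD (standard algorithm).
        u, s, vt = np.linalg.svd(
            L.T @ (rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p)
        )
        R = u @ vt
        d_new = s.sum()
        if d_new <= d * (1 + tol):
            break
        d = d_new
    return L @ R

loadings = np.array([[0.65, 0.11], [0.59, 0.25], [0.48, 0.19],
                     [0.38, 0.60], [0.13, 0.57], [0.23, 0.55]])
rotated = varimax(loadings)

# Rotation preserves each variable's communality (row sums of squared loadings).
print(np.allclose((rotated ** 2).sum(axis=1), (loadings ** 2).sum(axis=1)))  # True
```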
Factor Matrix Before and
After Rotation
(VI) Interpret Factors
• A factor can then be interpreted in terms of the variables that load high on it.

• Another useful aid in interpretation is to plot the variables, using the factor loadings as
coordinates. Variables at the end of an axis are those that have high loadings on only that
factor, and hence describe the factor.

Calculate Factor Scores


The factor scores for the ith factor may be estimated as follows:

Fi = Wi1 X1 + Wi2 X2 + Wi3 X3 + . . . + Wik Xk
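In matrix form this is F = ZW. A sketch with simulated standardized data and a hypothetical factor score coefficient matrix W (such as the one SPSS prints under DISPLAY FACTOR SCORE COEFFICIENT MATRIX):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4))   # 100 respondents, 4 observed variables

# Standardize the observed variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical factor score coefficients W (variables x factors).
W = np.array([[0.40, 0.05],
              [0.35, 0.10],
              [0.05, 0.45],
              [0.10, 0.40]])

# Fi = Wi1*X1 + ... + Wik*Xk, computed for every respondent at once.
scores = Z @ W
print(scores.shape)   # (100, 2)
```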
Select Surrogate Variables
• By examining the factor matrix, one could select for each factor the variable with the
highest loading on that factor. That variable could then be used as a surrogate variable
for the associated factor.

• However, the choice is not as easy if two or more variables have similarly high loadings.
In such a case, the choice between these variables should be based on theoretical and
measurement considerations.
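Selecting surrogates amounts to taking, for each factor, the variable with the largest absolute loading. A sketch using the wealth-example loadings:

```python
import numpy as np

variables = ["Income", "Education", "Occupation",
             "House value", "Public parks", "Violent crimes"]
loadings = np.array([[0.65, 0.11], [0.59, 0.25], [0.48, 0.19],
                     [0.38, 0.60], [0.13, 0.57], [0.23, 0.55]])

# Surrogate for each factor: the variable with the highest absolute loading.
surrogates = [variables[i] for i in np.abs(loadings).argmax(axis=0)]
print(surrogates)   # ['Income', 'House value']
```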

Determine the Model Fit


• The correlations between the variables can be deduced or reproduced from the estimated
correlations between the variables and the factors.

• The differences between the observed correlations (as given in the input correlation
matrix) and the reproduced correlations (as estimated from the factor matrix) can be
examined to determine model fit. These differences are called residuals.
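A sketch of the reproduced correlations implied by the wealth-example factor matrix (an observed correlation matrix would be needed to compute the residuals themselves):

```python
import numpy as np

loadings = np.array([[0.65, 0.11], [0.59, 0.25], [0.48, 0.19],
                     [0.38, 0.60], [0.13, 0.57], [0.23, 0.55]])

# Reproduced correlations from the factor matrix: the off-diagonal entries of
# L @ L.T (the diagonal holds communalities rather than 1s).
reproduced = loadings @ loadings.T

# With an observed correlation matrix R in hand, the residuals would be
# R - reproduced, examined off the diagonal.
print(round(float(reproduced[0, 1]), 3))   # reproduced income-education correlation
```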
Results of Common Factor Analysis
SPSS Windows:

To select this procedure using SPSS for Windows click:

Analyze>Data Reduction>Factor …

SPSS Windows: Principal Components

1. Select ANALYZE from the SPSS menu bar.

2. Click DATA REDUCTION and then FACTOR.

3. Move the identified variables into the VARIABLES box.

4. Click on DESCRIPTIVES. In the pop-up window, in the STATISTICS box check
INITIAL SOLUTION. In the CORRELATION MATRIX box check KMO AND
BARTLETT'S TEST OF SPHERICITY and also check REPRODUCED. Click
CONTINUE.

5. Click on EXTRACTION. In the pop-up window, for METHOD select PRINCIPAL
COMPONENTS (default). In the ANALYZE box, check CORRELATION MATRIX.
In the EXTRACT box, check EIGENVALUES OVER 1 (default). In the DISPLAY box
check UNROTATED FACTOR SOLUTION. Click CONTINUE.

6. Click on ROTATION. In the METHOD box check VARIMAX. In the DISPLAY box
check ROTATED SOLUTION. Click CONTINUE.

7. Click on SCORES. In the pop-up window, check DISPLAY FACTOR SCORE
COEFFICIENT MATRIX. Click CONTINUE.
8. Click OK.
