LESSON 2:
DISCRIMINANT ANALYSIS
Multivariate Data Analysis Using SPSS Lesson 2
In multiple linear regression, the objective is to model one quantitative variable (the
dependent variable) as a linear combination of other variables (the independent variables).
The purpose of discriminant analysis is to obtain a model that predicts a single qualitative
variable from one or more independent variables. In most cases the dependent variable
consists of two groups or classifications: high versus normal blood pressure, loan defaulting
versus non-defaulting, use versus non-use of internet banking, and so on. The choice between
three candidates, A, B or C, in an election is an example where the dependent variable
consists of more than two groups.
The model

    Independent variables X1, X2, ..., Xp (quantitative)  →  Dependent variable Y (qualitative)

    F = β0 + β1X1 + β2X2 + ... + βpXp + ε

where F is a latent variable formed by a linear combination of the independent variables;
X1, X2, ..., Xp are the p independent variables; ε is the error term; and β0, β1, β2, ..., βp
are the discriminant coefficients.
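As a minimal sketch of the model above, the discriminant score F is just a linear combination of the predictor values. The coefficients below are hypothetical, chosen only for illustration; they are not from any output in this lesson.

```python
# Sketch of a discriminant score: F = b0 + b1*X1 + ... + bp*Xp.
# Coefficients here are hypothetical, for illustration only.
def discriminant_score(x, coeffs, constant):
    """Return F for one case, given its predictor values."""
    return constant + sum(b * v for b, v in zip(coeffs, x))

# Example: two predictors (e.g. Age = 30, Income = 40) with made-up
# coefficients 0.5 and 0.25 and constant -25.
print(discriminant_score([30, 40], [0.5, 0.25], -25.0))  # 0.0
```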
Assumptions
If you have collected a large number of independent variables and want to select a useful
subset for predicting the dependent variable, use multiple discriminant analysis with
selection methods.
Discriminant function
The number of functions computed is one less than the number of groups in the dependent
variable. That is, for two groups one function, for three groups - two functions, and so on.
When there are two functions, the first function maximizes the differences between the
groups in the dependent variable. The second function is orthogonal to the first (uncorrelated
with it) and maximizes the differences between the groups in the dependent variable,
controlling for the first function. Though mathematically different, each discriminant function
is a dimension which differentiates a case into groups in the dependent variable based on its
values on the independent variables. In discriminant analysis, the first function is the
most powerful in differentiating among the groups; subsequent functions may or may not
represent additional significant differentiation.
Discriminant Coefficient
The discriminant function coefficients are partial coefficients that reflect the unique
contribution of each variable to the classification of the groups in the dependent variable. A
discriminant score, which belongs to a latent variable, can be obtained for each case by
applying the coefficients to the values of the respective independent variables. The
standardized discriminant coefficients, like beta weights in regression, are used to assess the
relative classifying importance of the independent variables. Structure coefficients are the
correlations between a given independent variable and the discriminant scores. The higher the
value, the stronger the association between the independent variable and the discriminant
function. Looking at all the structure coefficients for a function allows the researcher to
assign a label to the dimension it measures.
Group centroid
Group centroids are the mean discriminant scores for each group in the dependent variable
for each of the discriminant functions. For two groups in the dependent variable there is a
single discriminant function, so the centroids lie in a one-dimensional space, one center for
each group. For three groups in the dependent variable there are two discriminant functions,
so the centroids lie in a two-dimensional space. By connecting the centroids, a canonical
plot can be created depicting the discriminant function space.
Eigenvalue
The eigenvalue, also called the characteristic root, is the ratio between the explained and
unexplained variation in a model. For a good model the eigenvalue must be more than one.
In discriminant analysis there is one eigenvalue for each discriminant function. The bigger
the eigenvalue, the stronger the discriminating power of the function. In an analysis with
three groups, the ratio between the two eigenvalues indicates the relative discriminating
power of one discriminant function over the other. For example, if the ratio of the two
eigenvalues is 1.6, the first discriminant function accounts for 60% more of the between-group
variance in the three groups of the dependent variable than the second discriminant function.
The relative percentage of a discriminant function is the function's eigenvalue divided by the
sum of the eigenvalues of all discriminant functions in the model. It represents the
percentage of the model's discriminating power associated with a given discriminant function.
Usually the relative percentage of the first function will be high. If the values for the
subsequent functions are small, then a single function is as good as two or more functions
for the classification.
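The arithmetic above can be checked with a short sketch. The two eigenvalues below are hypothetical, chosen only so that their ratio is 1.6, matching the text's example:

```python
# Relative percentage of each discriminant function: its eigenvalue
# divided by the sum of all eigenvalues in the model.
# These eigenvalues are hypothetical; their ratio is 1.6 as in the text.
eigenvalues = [1.6, 1.0]

total = sum(eigenvalues)
relative_pct = [100 * ev / total for ev in eigenvalues]
ratio = eigenvalues[0] / eigenvalues[1]

print(ratio)         # 1.6
print(relative_pct)  # roughly 61.5% and 38.5%
```

So a ratio of 1.6 means the first function carries about 61.5% of the model's discriminating power and the second about 38.5%.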
Canonical correlation
The canonical correlation is a measure of the association between the groups in the dependent
variable and the discriminant function. A high value implies a high level of association
between the two and vice-versa.
Wilks's lambda
In discriminant analysis, Wilks's lambda is used to test the significance of the discriminant
functions. Mathematically, it is one minus the explained variation, and its value ranges from
0 to 1. Unlike the F statistic in linear regression, the smaller the value of lambda for a
function, the more significant that function is.
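For a single discriminant function, lambda can be recovered from the eigenvalue as 1 / (1 + eigenvalue); more generally it is the product of this quantity over the functions being tested. Using the eigenvalue 1.158 reported later in this lesson, this gives roughly the .463 shown in the output:

```python
# Wilks's lambda from eigenvalues: the product over the tested
# functions of 1 / (1 + eigenvalue).
# The single eigenvalue 1.158 is taken from this lesson's SPSS output.
eigenvalues = [1.158]

wilks_lambda = 1.0
for ev in eigenvalues:
    wilks_lambda *= 1.0 / (1.0 + ev)

print(round(wilks_lambda, 3))  # 0.463
```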
Classification matrix
The classification matrix is a simple cross tabulation of the observed and predicted
memberships. For a good prediction, the values in the diagonal must be high and the values
off the diagonal must be close to 0.
Box's M
As in other multivariate analyses, Box's M tests the assumption of equality of the
variance-covariance matrices across the groups. A big Box's M, indicated by a small p-value,
signals violation of this assumption. However, when the sample size is big, Box's M is
usually large. In such situations, the natural logarithms of the determinants of the
variance-covariance matrices of the groups are compared.
Sample size
As a rule, the sample size of the smallest group should exceed the number of independent
variables. Though the general agreement is that there should be at least 5 cases for each
independent variable, it is best to model with at least 20 cases for each independent variable.
Outlier detection
Descriptive summary
Group Statistics
          Default    N    Mean     Std. Deviation    Std. Error Mean
Age       Yes        8    32.00     9.769            3.454
          No         7    44.86     7.081            2.676
Income    Yes        8    48.00    12.036            4.255
          No         7    65.14     7.198            2.721
Scatter diagram
[Scatter plot of Income (20 to 80) against Age (20 to 60), with separate markers for the
defaulting (Yes) and non-defaulting (No) groups.]
Analyze → Classify → Discriminant...
    Independents: Age, Income
    (Further options are set through the Save... and Classify... buttons.)
Output
On average, the age and income of the defaulters are lower than those of the non-defaulters.
The p-values for the tests of equality of means are both less than 0.05.
Thus, both age and income may be important discriminators between the defaulting groups.
Covariance Matrices (a)
Default            Age        Income
Yes     Age        95.429     21.714
        Income     21.714    144.857
No      Age        50.143     25.524
        Income     25.524     51.810
Total   Age       113.286     80.571
        Income     80.571    173.000
a. The total covariance matrix has 14 degrees of freedom.

The diagonal entries are variances; the off-diagonal entries are covariances.
Determinant for the defaulting group: |D| = (95.429 × 144.857) − (21.714)² = 13352.06, so
ln|D| = 9.499.
The p-value for Box's M is more than 0.05. Thus, equality of the variance-covariance
matrices can be assumed. The log determinant values are quite close to each other.
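The determinant calculation for the defaulting group can be verified in a few lines, using the 2×2 determinant formula ad − bc on the covariance values from the output:

```python
import math

# Covariance matrix of (Age, Income) for the defaulting group,
# taken from the Covariance Matrices output above.
a, b = 95.429, 21.714
c, d = 21.714, 144.857

det = a * d - b * c      # 2x2 determinant: ad - bc
log_det = math.log(det)  # natural log of the determinant

print(round(det, 2))      # 13352.06
print(round(log_det, 3))  # 9.499
```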
Eigenvalues
Function    Eigenvalue    % of Variance    Cumulative %    Canonical Correlation
1           1.158(a)      100.0            100.0           .733
a. First 1 canonical discriminant functions were used in the analysis.

Wilks' Lambda
Test of Function(s)    Wilks' Lambda    Chi-square    df    Sig.
1                      .463             9.229         2     .010
The function
Canonical Discriminant Function Coefficients (unstandardized)
             Function 1
Age          .064
Income       .069
(Constant)   -6.303

Standardized Canonical Discriminant Function Coefficients
             Function 1
Age          .554
Income       .846

Structure Matrix (correlations between Age, Income and F)
             Function 1
Age          .742
Income       .696

F = -6.303 + 0.064(Age) + 0.069(Income)
Centroids
Functions at Group Centroids
Default    Function 1
Yes        -.937
No         1.071
Unstandardized canonical discriminant functions evaluated at group means.

Classification Function Coefficients (Fisher's linear discriminant functions)
             Yes        No
Age          .303       .432
Income       .401       .540
(Constant)   -15.170    -27.960

Between -0.937 and 1.071 the midpoint is 0.067. On the F axis, cases below 0.067 fall on the
defaulters' side and cases above 0.067 on the non-defaulters' side.
Classification results
Classification Results (b,c)
                                 Predicted Group Membership
Default                          Yes      No       Total
Original           Count   Yes   7        1        8
                           No    1        6        7
                   %       Yes   87.5     12.5     100.0
                           No    14.3     85.7     100.0
Cross-validated(a) Count   Yes   7        1        8
                           No    1        6        7
                   %       Yes   87.5     12.5     100.0
                           No    14.3     85.7     100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each
   case is classified by the functions derived from all cases other than that case.
b. 86.7% of original grouped cases correctly classified.
c. 86.7% of cross-validated grouped cases correctly classified.
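The reported hit rate can be checked directly from the counts in the classification matrix (a minimal sketch, not SPSS output):

```python
# Observed-by-predicted counts from the classification table:
# rows = actual group, columns = predicted group (Yes, No).
matrix = [[7, 1],   # actual Yes: 7 correct, 1 misclassified
          [1, 6]]   # actual No:  1 misclassified, 6 correct

correct = matrix[0][0] + matrix[1][1]           # diagonal = hits
total = sum(sum(row) for row in matrix)
hit_rate = 100 * correct / total

print(round(hit_rate, 1))  # 86.7
```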
Classification:
Age = 30, Income = 40: F = -6.303 + 0.064(30) + 0.069(40) = -1.617 < 0.067 ⇒ Yes
Age = 40, Income = 40: F = -6.303 + 0.064(40) + 0.069(40) = -0.975 < 0.067 ⇒ Yes
Age = 30, Income = 60: F = -6.303 + 0.064(30) + 0.069(60) = -0.238 < 0.067 ⇒ Yes
Age = 40, Income = 60: F = -6.303 + 0.064(40) + 0.069(60) = 0.404 > 0.067 ⇒ No
Age = 50, Income = 60: F = -6.303 + 0.064(50) + 0.069(60) = 1.046 > 0.067 ⇒ No
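These five classifications can be reproduced with the estimated function and the 0.067 cut-off. Small differences in the scores against the text come from using the three-decimal coefficients rather than SPSS's full-precision values; the group assignments are unaffected:

```python
# Discriminant function and cut-off taken from this lesson's output.
def score(age, income):
    return -6.303 + 0.064 * age + 0.069 * income

CUTOFF = 0.067  # midpoint of the group centroids -0.937 and 1.071

def classify(age, income):
    # Scores below the cut-off fall on the defaulters' (Yes) side.
    return "Yes" if score(age, income) < CUTOFF else "No"

cases = [(30, 40), (40, 40), (30, 60), (40, 60), (50, 60)]
print([classify(a, i) for a, i in cases])
# ['Yes', 'Yes', 'Yes', 'No', 'No']
```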
Example 2:
Objective: To identify the significant determinants of BP status among Age, Weight, BSA
and Pulse.
Output
The function
Centroids
Classification:
The function
On the F axis the centroids are -0.601 (Normal) and 1.603 (High), and the midpoint cut-off is
0.501. Cases with F below 0.501 are classified as Normal, and cases above 0.501 as High.
Classification:
In this example, since there are three groups in the dependent variable (candidate), there are
two functions:

    F1 = β01 + β11(Age) + β21(Education) + ε1
    F2 = β02 + β12(Age) + β22(Education) + ε2
Output
Voters who are younger and have fewer years of education seem to prefer candidate A.
Voters who are older favor either candidate B or candidate C; among them, those with more
years of education prefer candidate B.
The p-values for both Age and Education are less than 0.05. Perhaps both Age and Education
are significant predictors.
Though Box's M is significant, the log determinant values are quite close.
The eigenvalue and canonical correlation for the first function are much higher than those
for the second function. It looks like the first function is sufficient to differentiate the
choice of candidates.
In the first row (1 through 2) Wilks's lambda is significant, but not in the second row (2).
This means that, over and above the first function, the second function does not contribute
much.
The group centroids for the candidates are: A(-1.164, 0.018), B(0.639, 0.102) and
C(0.428, -0.196).
In the diagram above, the range of the vertical axis is small. Hence F2 does not make much
difference; only F1, the horizontal axis, is important for differentiation.
Based on the information from the table above, the classification is not that good.
Classification:
Age = 50, Education = 15:
In this example, the stepwise method also gives the same results.