Course
SPSS
Inaksh
IBM Statistical Package for Social Sciences
Books to be referred:
1. Research Methodology by Deepak Chawla and Neena Sondhi
2. Marketing Research by Naresh Malhotra and Satyabhushan Dash
Research
Data analysis
Hypothesis
It is a presupposition about the expected direction of the results of the research.
Unlike the research problem, which generally takes a question form, the hypothesis is always in a declarative form.
A hypothesis is written in such a way that it can be proven or disproven by valid and reliable data.
Hypothesis
A few criteria that the researcher must fulfil while designing any hypothesis:
1. Simple, clear and declarative form. Hence it is advisable to make the hypothesis unidimensional, testing only one relationship between only two variables at a time.
4. Validation of the hypothesis would involve testing the statistical significance of the hypothesized relation.
Hypothesis
Two types:
1. Descriptive hypothesis - A simple statement about the behaviour of the population under study. E.g.: the attrition rate in the BPO sector is almost 33%.
2. Relational hypothesis -
Hypothesis
E.g.: The higher the likeability of the advertisement, the higher the recall rate.
However, sometimes the researcher might not have reasonable supporting data to hypothesise the expected direction of the relationship. In this case, we can leave the hypothesis non-directional (or two-tailed).
Measurement and Scaling
• The term measurement means assigning numbers or other symbols to the characteristics of certain objects. When numbers are used, the researcher must have a rule for assigning a number to an observation in a way that provides an accurate description.
• The assignment of numbers should be isomorphic, i.e. there must be a one-to-one correspondence between the numbers and the characteristics, e.g. male (1), female (2).
• Suppose you want to measure the satisfaction level towards Kingfisher Airlines and a scale of 1 to 11 is used for the purpose. This scale indicates the degree of satisfaction, with 1 = extremely dissatisfied and 11 = extremely satisfied.
• Measurement is the actual assignment of a number from 1 to 11 to each respondent, whereas scaling is the process of placing the respondent on a continuum with respect to their satisfaction.
• Extracting data from an Excel file: File 🡪 Open Database 🡪 New Query
• Coding: The process of identifying and assigning a numeral to a response given by a respondent. This is done to help the researcher interpret and classify the answers and subsequently record the data from the questionnaire on a spreadsheet.
• SPSS Data Editor has two views: Data View and Variable View
When we start SPSS, it shows this dialog box; for now we select: Type in data
Variable View
1. Name: the unique variable name. (Note: spaces are not allowed in variable names.)
2. Type: the kind of data to be recorded (e.g., strings of characters, numeric values, or special formats such as dates)
3. Width: the number of characters allowed to be entered in the column. (The default width is 8 characters and can be modified depending on the data.)
4. Decimals: the number of decimal places displayed. (For the numeric data type the default value is 2.)
5. Label: a text entry to describe the data provided by the variable, which can be much longer than the variable name and may include spaces. With questionnaires, for example, the label is usually the text of the question.
6. Values: if specific numeric values have a non-intuitive meaning, these values can be labeled (e.g., 1 = male and 2 = female)
7. Missing: this column is used for cases where no data is provided by a respondent. A missing value is chosen as an impossible value for that column. For example, for age it can be set to 1000 or -100, which are impossible entries for age. SPSS excludes that value while analyzing the data.
8. Columns: determines how wide the variable column should be in Data View. (Default value is 8.)
10. Measure: describes the level of measurement (e.g., nominal, ordinal, or scale)
11. Role:
Coding
Ranking Question
• This is a multi-response question. Each response is a new variable, so we will have 4 variables, and for each variable there are 4 possible responses (1, 2, 3, 4).
Checklist question
• Which of the following newspapers do you read? (Tick all that you read.)
1. The Times of India
2. The Hindustan Times
3. Mail Today
4. The Indian Express
5. Deccan Chronicle
6. The Asian Age
7. Mint
• When we do statistical analysis, we can have only one piece of data per cell.
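A minimal Python sketch of how such a checklist question might be coded, assuming each respondent's ticks arrive as a list of newspaper names. The variable names np1 to np7 and the sample respondent are hypothetical. Each newspaper becomes its own 0/1 variable, so every cell holds a single value.

```python
# One 0/1 variable per checklist option (hypothetical respondent data).
NEWSPAPERS = [
    "The Times of India", "The Hindustan Times", "Mail Today",
    "The Indian Express", "Deccan Chronicle", "The Asian Age", "Mint",
]

def code_checklist(ticked):
    """Return one variable per newspaper: 1 = ticked, 0 = not ticked."""
    return {f"np{i}": int(name in ticked)
            for i, name in enumerate(NEWSPAPERS, start=1)}

respondent = ["The Times of India", "Mint"]  # illustrative answer
row = code_checklist(respondent)
print(row)
```

In SPSS the same layout corresponds to seven separate variables, one per newspaper.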
Scaled question
This question will be entered as one variable. Coding will be done as 1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly Agree.
Open ended questions
Recoding
Normal distribution
Data type
Regression Analysis
Factor Analysis
Discriminant Analysis
Cluster Analysis
Descriptive analysis
• Frequency table: Frequency, Per cent, Valid Per cent, Cumulative Per cent
• Valid percentage excludes missing values. (In the given example, the valid percentage is computed out of 566.)
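The four columns of a frequency table can be sketched in Python as follows. The coded answers are hypothetical, and `None` is an assumed marker for a missing value. The cumulative column here is built from the valid percentages.

```python
from collections import Counter

def frequency_table(values):
    """Frequency, percent, valid percent and cumulative percent per category.

    None marks a missing value (an assumption of this sketch).
    """
    total = len(values)
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows, cumulative = [], 0.0
    for category in sorted(counts):
        n = counts[category]
        valid_pct = 100 * n / len(valid)   # missing values excluded
        cumulative += valid_pct
        rows.append((category, n, 100 * n / total, valid_pct, cumulative))
    return rows

answers = [1, 2, 2, None, 1, 2, 3, None, 2, 1]  # hypothetical coded responses
table = frequency_table(answers)
for row in table:
    print(row)
```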
Descriptive analysis
• Median :
• Mode :
• Variance :
• Coefficient of Variation :
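As a rough illustration of the measures listed above, the following sketch uses Python's standard `statistics` module on a small hypothetical data set. The variance is the sample variance (divisor n - 1), and the coefficient of variation is taken as the standard deviation expressed as a percentage of the mean.

```python
import statistics

data = [2, 4, 4, 6, 9]  # hypothetical observations

mean = statistics.mean(data)
median = statistics.median(data)          # middle value of the sorted data
mode = statistics.mode(data)              # most frequent value
variance = statistics.variance(data)      # sample variance, divisor n - 1
cv = statistics.stdev(data) / mean * 100  # coefficient of variation in percent

print(mean, median, mode, variance, round(cv, 2))
```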
Descriptive analysis
• Cross tabulation
• Analyze 🡪 Descriptive Statistics 🡪 Crosstabs (specify Row, Column)
• Try crosstabs with 3 variables.
Relationships
Linear relationship
• A linear relationship is a trend in the data that can be modeled by a straight line. For example, suppose
an airline wants to estimate the impact of fuel prices on flight costs. They find that for every dollar
increase in the price of a gallon of jet fuel, the cost of their LA-NYC flight increases by about $3500.
This describes a linear relationship between jet fuel cost and flight cost.
• When both variables increase or decrease concurrently and at a constant rate, a positive linear
relationship exists. The points in Plot 1 follow the line closely, suggesting that the relationship between
the variables is strong. The Pearson correlation coefficient for this relationship is +0.921.
• When one variable increases while the other variable decreases, a negative linear relationship exists.
The points in Plot 2 follow the line closely, suggesting that the relationship between the variables is
strong. The Pearson correlation coefficient for this relationship is −0.968.
• If a relationship between two variables is not linear, the rate of increase or decrease can change as one variable changes, causing a "curved pattern" in the data. This curved trend might be better modeled by a nonlinear function.
Plots (in order): strong positive linear relationship, strong negative linear relationship, weak linear relationship, non-linear relationship
Monotonic relationship
Spearman’s rank order correlation coefficient
Participant  Judge 1  Judge 2
1            10       9
2            1        3
3            5        4
4            2        1
5            8        8
6            3        2
7            4        6
8            6        5
9            7        7
10           9        10
Create two variables, namely judge1 and judge2, enter the data for the 10 participants and set the measurement level to Ordinal.
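A sketch of the calculation SPSS performs here, assuming the three columns of the table are the participant number, the rank given by judge 1 and the rank given by judge 2. It uses the textbook formula for untied ranks, rho = 1 - 6Σd² / (n(n² - 1)).

```python
judge1 = [10, 1, 5, 2, 8, 3, 4, 6, 7, 9]
judge2 = [9, 3, 4, 1, 8, 2, 6, 5, 7, 10]

def spearman_rho(x_ranks, y_ranks):
    """Spearman's rho for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x_ranks)
    d_squared = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

rho = spearman_rho(judge1, judge2)
print(round(rho, 3))  # 0.915
```

The squared rank differences sum to 14, so rho = 1 - 84/990, which indicates strong agreement between the two judges.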
Data analysis of rank order questions
Frequency table
Practice Questionnaire
• (Link)
Testing of hypothesis
• A hypothesis is an assumption or a statement that may or may not be true. The hypothesis is tested on the basis of information obtained from a sample.
• E.g. when we want to find out, based on sample data, whether a new drug is more effective than the existing drug.
Null hypothesis: A hypothesis that is proposed with the intent of rejecting it is called a null hypothesis. This requires that we hypothesize the opposite of what we want to prove. It is denoted H0. For example:
1. If we want to show that sales and advertisement expenditure are related, we formulate the null hypothesis that they are not related.
2. If we want to conclude that the new sales training programme is effective, we formulate the null hypothesis that the new training programme is NOT effective.
3. If we want to prove that the average wage of skilled workers in Mumbai is greater than that in Delhi, we formulate the null hypothesis that there is no difference in the wages of skilled workers in the two cities.
One tailed and two tailed tests
• A test is called one-sided (one-tailed) if the null hypothesis is rejected only when the value of the test statistic falls in one specified tail of the distribution.
• A test is called two-sided (two-tailed) if the null hypothesis is rejected when the value of the test statistic falls in either of the two tails of the sampling distribution.
-2.131 and 2.131 are values from the table.
Steps in testing of hypothesis
• Setting up of a hypothesis
• Setting up of a suitable significance level
• Determination of a test statistic
• Determination of critical region
• Computing the value of test statistic
• Making a decision
Testing hypothesis concerning means
• Note: You cannot do a z test with SPSS. It is not offered as an option in the drop-down menu.
• Note: Most software, including SPSS, reports both the computed value of the test statistic and the corresponding p value. The p value reported is for the two-sided test (by default, unless changed). For a one-sided test, the reported p value is divided by 2 (provided the sample result lies in the direction stated in the alternative hypothesis) and then compared with α, the level of significance, to decide whether or not to reject the null hypothesis.
T test
• We will be doing 3 types of t tests: one sample, two independent samples and paired samples.
• The t-test for testing the equality of two population means is carried out under one of two assumptions:
1. Population variances are equal
2. Population variances are not equal
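The two assumptions lead to two versions of the test statistic. The sketch below uses hypothetical data and standard textbook formulas. With equal variances it uses the pooled variance with n1 + n2 - 2 degrees of freedom. With unequal variances it uses Welch's approximation for the degrees of freedom, which is assumed here to correspond to the "equal variances not assumed" case.

```python
import math
import statistics

def two_sample_t(a, b, equal_variances=True):
    """Return (t, df) for the difference in means of two independent samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    if equal_variances:
        # Pooled variance, df = n1 + n2 - 2
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = math.sqrt(pooled * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        # Welch's approximation for unequal variances
        se = math.sqrt(v1 / n1 + v2 / n2)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return (m1 - m2) / se, df

group1 = [1, 2, 3]   # hypothetical data
group2 = [2, 3, 4]   # hypothetical data
t_pooled, df_pooled = two_sample_t(group1, group2, equal_variances=True)
t_welch, df_welch = two_sample_t(group1, group2, equal_variances=False)
print(round(t_pooled, 3), df_pooled, round(t_welch, 3), round(df_welch, 2))
```

When the two sample variances and sizes are identical, as in this toy data, both versions give the same t and the same degrees of freedom.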
One sample t-test
• The one sample t-test is a statistical procedure used to determine whether a sample of
observations could have been generated by a process with a specific mean.
• Suppose you are interested in determining whether an assembly line produces laptop
computers that weigh five pounds. To test this hypothesis, you could collect a sample of
laptop computers from the assembly line, measure their weights, and compare the sample
with a value of five using a one-sample t-test.
Assumptions (One sample T Test)
1. Your dependent variable should be measured at the interval or ratio level (i.e., continuous).
2. There should be no significant outliers. Outliers are data points within your data that do not
follow the usual pattern (e.g., in a study of 100 students' IQ scores, where the mean score
was 108 with only a small variation between students, one student had a score of 156, which
is very unusual, and may even put her in the top 1% of IQ scores globally).
3. Your dependent variable should be approximately normally distributed. We talk about the
one-sample t-test only requiring approximately normal data because it is quite "robust" to
violations of normality, meaning that the assumption can be a little violated and still provide
valid results. You can test for normality using the Shapiro-Wilk test of normality, which is
easily tested for using SPSS Statistics.
• Qa. The prices of a company's share (in Rs.) on different days in a month were found to be 66, 65, 69, 70, 69, 71, 70, 63, 64 and 68. Examine whether the mean price of the share in the month is different from 65. You may use a 10% level of significance. (Chawla and Sondhi, page 332)
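A hand calculation of this question in Python, as a check on what SPSS should report. H0 is that the mean price equals 65 and the test is two-tailed. The critical value 1.833 is taken from a t table for 9 degrees of freedom at the 10% level (5% in each tail).

```python
import math
import statistics

prices = [66, 65, 69, 70, 69, 71, 70, 63, 64, 68]
mu0 = 65                      # hypothesized mean under H0

n = len(prices)
mean = statistics.mean(prices)
se = statistics.stdev(prices) / math.sqrt(n)   # standard error of the mean
t = (mean - mu0) / se
df = n - 1

T_CRITICAL = 1.833            # two-tailed t table value, alpha = 0.10, df = 9
reject_h0 = abs(t) > T_CRITICAL
print(round(mean, 1), round(t, 3), df, reject_h0)
```

The sample mean is 67.5 and t is about 2.825, which exceeds 1.833, so the null hypothesis is rejected at the 10% level.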
• Analyze 🡪 Compare Means 🡪 One Sample T Test
Output
Interpretation
Independent sample t-test
• Two sample t-test: In this test we measure group 1, then measure group 2, and then see whether the mean for group 1 is statistically significantly different from the mean for group 2. Two groups are each measured once, and the groups are independent of each other.
• Suppose that an analyst wants to study the amount that residents of Mumbai and Delhi spend, per month, on clothing. It would not be practical to record the spending habits of every individual (or family) in both cities, so a sample of spending habits is taken from a selected group of individuals in each city. The average amount for Mumbaikars comes out to Rs. 500; the average amount for Delhiites is Rs. 1,000. The t-test asks whether the difference between the groups represents a true difference or arises by chance.
Assumptions (Independent samples T Test)
1. Your dependent variable should be measured at the interval or ratio level (i.e., continuous).
2. Your independent variable should consist of two categorical, independent groups. Example
independent variables that meet this criterion include gender (2 groups: male or female),
employment status (2 groups: employed or unemployed), smoker (2 groups: yes or no), and
so forth.
3. You should have independence of observations, which means that there is no relationship
between the observations in each group or between the groups themselves. For example,
there must be different participants in each group with no participant being in more than one
group.
4. There should be no significant outliers.
5. There needs to be homogeneity of variances. You can test this assumption in SPSS
Statistics using Levene’s test for homogeneity of variances.
6. Your dependent variable should be approximately normally distributed for each group of the
independent variable.
Homogeneity of variances
• Homo: same
• Geneity: nature, kind
• So it means of the same nature.
• So homogeneity of variance means that the variances are of the same nature or kind. This does not mean that they have to be exactly equal, but they have to be close enough.
• Levene's test: Tests whether the variances of two samples are approximately equal, so it tests our assumption of homogeneity of variances. Like any test, Levene's test starts with a null hypothesis. In this case the null hypothesis is that there is no difference between the variance of the first group and the variance of the second group. Unlike most other tests, in Levene's test we WANT the variances to be the same: we would like Levene's test to be non-significant, so that the assumption of the t test is satisfied. In SPSS, Levene's test is conducted automatically whenever you run an independent samples t test.
• Note: As long as N > 30 and the size of the first sample is approximately equal to the size of the second sample, the t test is robust to violations of homogeneity of variance.
• Analyze 🡪 Compare Means 🡪 Independent Samples T Test
Step 1
Step 2
Output
Interpretation
Paired sample t-test
• The paired sample t-test, sometimes called the dependent sample t-test, is a statistical
procedure used to determine whether the mean difference between two sets of observations
is zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs
of observations.
• Suppose you are interested in evaluating the effectiveness of a company training program.
One approach you might consider would be to measure the performance of a sample of
employees before and after completing the program, and analyze the differences using a
paired sample t-test.
Assumptions(Paired sample T Test)
Qc. A company selects eight salesmen at random and their sales figures for the previous month are recorded. They then undergo a training course devised by a business consultant, and their sales figures for the following month are compared as shown in the table. Has the training course caused an improvement in the salesmen's ability? You may use the 0.05 level of significance. (Chawla and Sondhi, page 341)
Previous month:   75  90   94  95  100  90  70  64
Following month:  77  101  93  92  105  88  76  68
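The paired test works on the differences between the two measurements of each salesman. A hand calculation in Python, with H0 that the mean difference is zero and H1 that sales in the following month are higher. The critical value 1.895 is the one-tailed t table value for 7 degrees of freedom at the 0.05 level.

```python
import math
import statistics

previous = [75, 90, 94, 95, 100, 90, 70, 64]
following = [77, 101, 93, 92, 105, 88, 76, 68]

# Difference per salesman: following month minus previous month
diffs = [f - p for f, p in zip(following, previous)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
se = statistics.stdev(diffs) / math.sqrt(n)
t = mean_diff / se
df = n - 1

T_CRITICAL = 1.895   # one-tailed t table value, alpha = 0.05, df = 7
improved = t > T_CRITICAL
print(mean_diff, round(t, 3), df, improved)
```

The mean difference is 2.75 and t is about 1.650, which is below 1.895, so the null hypothesis is not rejected.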
• Analyze 🡪 Compare Means 🡪 Paired Samples T Test
Output
Interpretation
H0: µf = µp
H1: µf > µp
• This is a one-tailed test. The results indicate that the reported (two-tailed) p value is 0.143. Since it is a one-tailed test, the applicable p value is 0.143/2 = 0.0715.
• Since 0.0715 is greater than 0.05, the null hypothesis cannot be rejected. Therefore there is not enough evidence that the sales training programme has caused an improvement in the salesmen's ability.
• Note: When you do a t-test, you are comparing the mean for one group with the mean for another group. If the t value is positive, the first group has a higher mean than the second group. If the t value is negative, the second group has a higher mean. In practice, the sign of t is not very important as long as you report the means of each group, and it is acceptable to drop the negative sign when reporting the t value. (This can be verified from the second table on the previous slide.)
ANOVA
• We discussed the test of hypothesis concerning the equality of two population means using t tests (and z tests in theory). If there are more than two populations, the test for the equality of means could be carried out by considering two populations at a time, but this would be a cumbersome procedure. An easier way is to use the analysis of variance (ANOVA) technique.
Assumptions (ANOVA)
Types of ANOVA
• There are two types of analysis of variance: one-way (or unidirectional) and two-way. One-way or two-way refers to the number of independent variables in the analysis of variance test. A one-way ANOVA evaluates the impact of a single factor on a single response variable. It is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. A one-way ANOVA has just one independent variable. For example, differences in IQ can be assessed by Country, and Country can have 2, 20, or more different categories to compare.
• A two-way ANOVA is an extension of the one-way ANOVA. With a one-way ANOVA, you have one independent variable affecting a dependent variable; with a two-way ANOVA, there are two. For example, a two-way ANOVA allows a company to compare worker productivity based on two independent variables, say salary and skill set. It tests the effect of two factors at the same time and can be used to examine the interaction between them. Expanding the example above, a two-way ANOVA can examine differences in IQ scores (the dependent variable) by Country (independent variable 1) and Gender (independent variable 2). (Not in syllabus)
• N-way ANOVA: A researcher can also use more than two independent variables; this is an n-way ANOVA (with n being the number of independent variables). For example, potential differences in IQ scores can be examined by Country, Gender, Age group, Ethnicity, etc., simultaneously. (Not in syllabus)
One way ANOVA
• Qd. Suppose we want to compare the cholesterol content of four competing diet foods on the basis of the following data (in milligrams per package), which were obtained for three randomly taken 6-ounce packages of each of the diet foods:
Analyze 🡪 Compare Means 🡪 One Way ANOVA
Interpretation
H0: µA = µB = µC = µD
• The p value (Sig.) for this problem is 0.160, which is greater than α = 0.05, the level of significance. Therefore, there is not enough evidence to reject the null hypothesis.
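The F statistic behind the ANOVA table can be sketched as follows. The data here are hypothetical, not the cholesterol figures of the question. F is the ratio of the between-groups mean square to the within-groups mean square.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

samples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]  # hypothetical data
f_stat, df_b, df_w = one_way_anova(samples)
print(f_stat, df_b, df_w)
```

For four diet foods with three packages each, the same function would give 3 and 8 degrees of freedom.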
Post-hoc analysis
• When you conduct an ANOVA, you are attempting to determine if there is a statistically significant difference among the groups. If you find that there is a difference, you will then need to examine where the group differences lie.
Post-hoc analysis
• Note: For a one-way ANOVA, you will probably find that just two tests need to be considered. If your data met the assumption of homogeneity of variances, use Tukey's honestly significant difference (HSD) post hoc test.
• SPSS has 18 different types of post hoc tests. When all your assumptions are met and the sample sizes are the same, we use Tukey. Others are used when the variances are not equal or the sample sizes are unequal (not in syllabus).
Post hoc analysis
Interpretation
• Note: If the ANOVA result is not significant, there is no need to interpret the post hoc tests.
• Note: In an analysis of variance, it may happen that the variances cannot be assumed to be equal. In this case, the F test of the ANOVA is not robust enough to be used.
Non – Parametric tests
Pearson Chi Square
Chi-Square Independence Test
• Null hypothesis: There are no relationships between the
categorical variables. If you know the value of one
variable, it does not help you predict the value of another
variable.
• Q - Suppose we have data of 100 males and 100 females
regarding their buying behavior towards perfumes.
Data 🡪 Weight Cases 🡪 Frequencies
Analyze 🡪 Descriptive Statistics 🡪 Crosstabs
Interpretation
H0: There is no significant relationship between gender and
buying behaviour towards perfumes
Assumptions of Chi-Square Independence Test
• Independent observations.
• For a 2 by 2 table, all expected frequencies should be greater than 5. For a larger table, no more than 20% of all cells may have an expected frequency below 5, and all expected frequencies should be greater than 1.
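A sketch of the Pearson chi-square calculation for a contingency table, including a check of the expected-frequency assumption for a 2 by 2 table. The counts are hypothetical, not the perfume data of the question, and the critical value 3.841 is the chi-square table value for 1 degree of freedom at the 0.05 level.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for a contingency table given as rows."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    # Expected count = row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Hypothetical counts: rows = male, female; columns = buys, does not buy
observed = [[30, 70], [50, 50]]
chi2, df, expected = chi_square_independence(observed)

assumption_ok = all(e > 5 for row in expected for e in row)
CHI2_CRITICAL = 3.841   # chi-square table value, alpha = 0.05, df = 1
reject_h0 = chi2 > CHI2_CRITICAL
print(round(chi2, 3), df, assumption_ok, reject_h0)
```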
Chi Square Goodness of Fit test
• Consider a standard packet of Milk chocolate Gems.
There are six different colors: Red, Orange, Yellow,
Green, Blue and Brown.
Actual and expected counts
Analyze 🡪 Nonparametric Tests 🡪 Legacy Dialogs 🡪 Chi-square
Note: "All categories equal" is selected because our expectation is that all Gems colours occur in equal numbers.
• Suppose we have different expected values for each colour; then we can specify them (in the same order as entered in Data View).
Interpretation
• H0: All colors occur in the same proportion
• p1 = p2 = p3 = p4 = p5 = p6
• H1: At least one of the population proportions is not equal to another.
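A sketch of the goodness-of-fit statistic for these hypotheses. The colour counts are hypothetical. When no expected counts are passed, all categories are assumed equal, mirroring the "All categories equal" option. The critical value 11.070 is the chi-square table value for 5 degrees of freedom at the 0.05 level.

```python
def chi_square_gof(observed, expected=None):
    """Return (chi2, df); expected defaults to equal counts per category."""
    n = sum(observed)
    if expected is None:
        expected = [n / len(observed)] * len(observed)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1

# Hypothetical counts for Red, Orange, Yellow, Green, Blue, Brown
counts = [15, 5, 10, 10, 10, 10]
chi2, df = chi_square_gof(counts)

CHI2_CRITICAL = 11.070   # chi-square table value, alpha = 0.05, df = 5
reject_h0 = chi2 > CHI2_CRITICAL
print(chi2, df, reject_h0)
```

Unequal expected proportions can be handled by passing a list of expected counts in the same order as the observed counts.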
Positive correlation: When two variables X and Y move in the same direction. When one variable increases, the other variable also increases, and vice versa.
Zero correlation: When the variables move with no connection to each other. If one variable increases, the other may increase or decrease.
Assumptions
1. For a Pearson correlation, each variable should be continuous. If one or both of
the variables are ordinal in measurement, then a Spearman correlation could be
conducted instead.
3. Absence of outliers refers to not having outliers in either variable. Having an outlier
can skew the results of the correlation by pulling the line of best fit formed by the
correlation too far in one direction or another.
• We have the height and weight of 5 males and 5 females. At the 5% level of significance, find whether there is a correlation between their height and weight.
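A sketch of the Pearson coefficient and its significance test. The heights and weights below are hypothetical and cover only five people, since the actual data are not reproduced here. The t statistic for H0: no correlation has n - 2 degrees of freedom; 3.182 is the two-tailed t table value for 3 degrees of freedom at the 0.05 level.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_for_r(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

height = [150, 160, 170, 180, 190]   # hypothetical, in cm
weight = [55, 50, 65, 60, 70]        # hypothetical, in kg

r = pearson_r(height, weight)
t = t_for_r(r, len(height))

T_CRITICAL = 3.182   # two-tailed t table value, alpha = 0.05, df = 3
significant = abs(t) > T_CRITICAL
print(round(r, 3), round(t, 3), significant)
```

With so few observations even r = 0.8 is not significant, which is why the p value reported by SPSS matters alongside the coefficient.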
Analyze 🡪 Correlate 🡪 Bivariate
Interpretation
• H0: There is no correlation between Height and Weight
• H1: There is a significant correlation between Height and Weight
Chart builder
Regression Analysis
• One of the problems with Karl Pearson’s formula of
correlation is that it is applicable only when the
relationship between the two variables is linear.
Assumptions
• Normality.
• Linearity: There needs to be a linear relationship between the two variables. While there are a number of ways to check whether a linear relationship exists, we suggest creating a scatterplot in SPSS Statistics, plotting the dependent variable against the independent variable, and visually inspecting it for linearity.
• Your two variables should be measured at the continuous level (i.e., they are either interval or ratio variables).
• Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.
• Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed.
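To make the line of best fit and its residuals concrete, here is a least-squares sketch for simple linear regression on hypothetical data. The residuals it returns are the quantities whose spread and distribution the homoscedasticity and normality assumptions refer to.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]     # hypothetical data, exactly y = 1 + 2x

a, b = fit_line(x, y)
# Residual = observed value minus value predicted by the fitted line
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(a, b, residuals)
```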
Discriminant Analysis
• This is used to predict group membership.
• This technique is used to classify individuals/objects into one of the alternative groups on the basis of a set of predictor variables.
When the dependent variable has two groups, we have two-group discriminant analysis; when there are more than two groups, it is a case of multiple discriminant analysis.
Objectives of discriminant analysis
• To find a linear combination of variables that discriminates between categories of the dependent variable in the best possible manner
• To find out which independent variables are relatively better at discriminating between groups
• To find the statistical significance of the discriminant function and whether any statistical difference exists among groups in terms of the predictor variables
• To develop a procedure for assigning new objects, firms, or individuals, whose profile but not group identity is known, to one of the two groups
• To evaluate the accuracy of classification, i.e. the percentage of customers that it is able to classify correctly.
Examples
Y = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk
Y = dependent variable
These b's maximize the distance between the means of the criterion (dependent) variable.
• For any new data point that we want to classify into one of the groups, a decision rule is formulated.
• For this purpose, a cut-off score is determined. In a two-group discriminant analysis this is usually the midpoint of the mean discriminant scores of the two groups, provided the sample sizes of the groups are the same.
• The accuracy of classification is determined using a classification matrix (also called a confusion matrix).
• The relative importance of the independent variables can be determined from the standardized discriminant function coefficients and the structure matrix.
• The difference between the standardized and unstandardized discriminant functions is that the unstandardized discriminant function has a constant term, whereas the standardized discriminant function has none.
What you want this function to do is maximize the distance between the categories, i.e. come up with an equation that has strong discriminatory power between groups. After using an existing set of data to calculate the discriminant function and classify cases, any new cases can then be classified. The number of discriminant functions is one less than the number of groups. There is only one function for the basic two-group discriminant analysis.
Assumptions of discriminant analysis
1. Sample size: The sample size of the smallest group needs to exceed the
number of predictor variables. The maximum number of independent
variables is n - 2, where n is the sample size. It is best to have 4 or 5 times as
many observations as independent variables.
2. Normal distribution: It is assumed that the data (for the variables) represent
a sample from a multivariate normal distribution.
3. Homogeneity of variances/covariances: DA is very sensitive to
heterogeneity of variance-covariance matrices.
4. Outliers: DA is highly sensitive to the inclusion of outliers.
Statistics associated with Discriminant Analysis
Steps of discriminant analysis
Step 1: Formulate the problem
5. The next step is to divide the sample into two parts. One part, called the estimation or analysis sample, is used for estimation of the discriminant function. The other part, called the holdout or validation sample, is reserved for validating the discriminant function.
6. When the sample is large enough, it can be split in half. One half serves as the analysis sample and the other is used for validation.
7. The roles of the halves are then interchanged and the analysis is repeated. This is called cross validation.
Step 2: Estimate the discriminant function coefficients
• Direct method: This involves estimating the discriminant function so that all the predictors are included simultaneously. In this case, each independent variable is included regardless of its discriminating power. This method is appropriate when, based on previous research or a theoretical model, the researcher wants the discrimination to be based on all the predictors.
• Stepwise discriminant analysis: The predictor variables are entered sequentially, based on their ability to discriminate among groups. This method is appropriate when the researcher wants to select a subset of predictors for inclusion in the discriminant function.
Question
Resort Visit | Annual Family Income | Attitude toward Travel | Importance attached to family vacation | Household size | Age of head of household
1 50.2 5 8 3 43
1 70.3 6 7 4 61
1 62.9 7 5 6 52
1 48.5 7 5 5 36
1 52.7 6 6 4 55
1 75 8 7 5 68
1 46.2 5 3 3 62
1 57 2 4 6 51
1 64.1 7 5 4 57
1 68.1 7 6 5 45
1 73.4 6 7 5 44
1 71.9 5 8 4 64
1 56.2 1 8 6 54
1 49.3 4 2 3 56
1 62 5 6 2 58
2 32.1 5 4 3 58
2 36.2 4 3 2 55
2 43.2 2 5 2 57
2 50.4 5 2 4 37
2 44.1 6 6 3 42
2 38.3 6 6 2 45
2 55 1 2 2 57
2 46.1 3 5 3 51
2 35 6 4 5 64
2 37.3 2 7 4 54
2 41.8 5 1 3 56
2 57 8 3 2 36
2 33.4 6 8 2 50
2 37.5 3 2 3 48
2 41.3 3 3 2 42
Analyze → Classify → Discriminant
Some intuitive feel for the results may be obtained by examining the group means and standard deviations. It appears that the two groups are more widely separated in terms of income than other variables. There seems to be more of a separation on the importance attached to family vacation than on attitude toward travel. The difference between the two groups on age of the head of the household is small, and the standard deviation of this variable is large.
The above table helps us understand the differences in the means of the different predictor variables. SPSS conducts an ANOVA for each of them. For this purpose SPSS treats the predictor variables as dependent variables and takes the VISIT variable as the grouping (independent) variable.
In the above table we can see that there is a significant difference in the means of Income, for which the p value is .000, which is less than 0.05; the same goes for vacation and household size.
There does not seem to be any significant difference in the means of the remaining two variables.
The pooled within-groups matrix indicates low correlations between the predictors, so multicollinearity is unlikely to be a problem. It is very important to analyse this matrix to detect multicollinearity (a high correlation between pairs of predictor variables). If the correlation coefficient between any pair of predictor variables is greater than 0.75, it indicates that both variables in that pair share a large amount of common variance and might reflect the same attribute. Under such circumstances, one of the two variables could be eliminated from further analysis.
Eigenvalue: For each discriminant function, the eigenvalue is the ratio of between-group to within-group sum of squares. Large eigenvalues imply superior functions.
The larger the eigenvalue, the more variance the function explains in the dependent variable.
Canonical correlation: It is the simple correlation between the discriminant scores and their corresponding group membership (visited, not visited).
The square of the canonical correlation = 0.6410, which means 64% of the variance in the discriminating model is due to changes in the predictor variables.
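With a single discriminant function, as in this two-group case, the eigenvalue, the canonical correlation and Wilks' lambda are tied together by standard identities: squared canonical correlation = eigenvalue / (1 + eigenvalue), and Wilks' lambda = 1 / (1 + eigenvalue). The sketch below starts from the squared canonical correlation of 0.6410 quoted above; the derived values are approximate because that figure is rounded.

```python
import math

canonical_r_squared = 0.6410   # value quoted in the text above

# Eigenvalue = between-group SS / within-group SS = R^2 / (1 - R^2)
eigenvalue = canonical_r_squared / (1 - canonical_r_squared)
# Wilks' lambda for a single function = 1 / (1 + eigenvalue) = 1 - R^2
wilks_lambda = 1 - canonical_r_squared
canonical_r = math.sqrt(canonical_r_squared)

print(round(eigenvalue, 3), round(wilks_lambda, 3), round(canonical_r, 3))
```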
It would not be meaningful to interpret the analysis if the estimated discriminant functions were not statistically significant.
Wilks’ Lambda takes a value between 0 and 1. The lower the value of Wilks’ Lambda, the better the function separates the groups,
so values close to 0 are preferred. Since the p value is less than 0.05, it is inferred that the discriminant
function is significant and can be used for further interpretation.
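For a single discriminant function (the two-group case here), the eigenvalue, the squared canonical correlation and Wilks' Lambda are linked by standard identities. The sketch below shows them; the function names are hypothetical, and only the value 0.6410 comes from the slides.

```python
def canonical_r_squared(eigenvalue):
    """Squared canonical correlation from the eigenvalue of one function."""
    return eigenvalue / (1.0 + eigenvalue)

def eigenvalue_from_r_squared(r_squared):
    """Inverse of the relation above."""
    return r_squared / (1.0 - r_squared)

def wilks_lambda_single_function(eigenvalue):
    """Wilks' Lambda when there is only one discriminant function."""
    return 1.0 / (1.0 + eigenvalue)

r_sq = 0.6410  # squared canonical correlation quoted on the slide
eig = eigenvalue_from_r_squared(r_sq)
print(round(eig, 4))
print(round(wilks_lambda_single_function(eig), 4))
```

With one function, Wilks' Lambda is simply one minus the squared canonical correlation, which is why a high canonical correlation and a low Lambda tell the same story.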
Inaksh
This table shows the relative contributions of the predictor variables in discriminating between the groups of the dependent variable (visit).
From the above table we can see that income is the best predictor and travel is the worst predictor.
The structure coefficients are obtained by computing the correlation between the discriminant score and each of the independent variables.
They are also called discriminant loadings.
Inaksh
The structure matrix table shows the correlations of each variable with each discriminant function.
We can see that the discriminant function has the highest correlation with Income (which is in sync with the
information from the previous table).
Inaksh
These are the unstandardized discriminant function coefficients. These can be applied to the raw values of the variables in
the holdout set for classification purposes.
They are used to construct the actual prediction equation, which can be used to classify new cases.
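A minimal sketch of how such a prediction equation is applied to one case follows. The constant and the coefficient values are hypothetical placeholders, since the coefficient table itself is not reproduced in this text; substitute the values from the SPSS output.

```python
def discriminant_score(values, coefficients, constant):
    """Score = constant + sum of (unstandardized coefficient * raw value)."""
    return constant + sum(coefficients[name] * values[name] for name in coefficients)

# Hypothetical coefficients and constant (NOT the values from the SPSS table)
coefficients = {"income": 0.1, "travel": 0.2}
constant = -5.0

# Hypothetical raw values for one new case
new_case = {"income": 50, "travel": 5}
print(discriminant_score(new_case, coefficients, constant))
```

The same function can be run over every case in a holdout set to obtain the scores that are then compared with the cutting point.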
Inaksh
Centroids are the mean discriminant scores for each group. This table is used to establish the cutting point for classifying cases. If
the two groups are of equal size, the best cutting point is half way between the values of the functions at group centroids (that is, the
average). If the groups are unequal, the optimal cutting point is the weighted average of the two values.
Here the group centroids are Visited (+1.291) and Not visited (-1.291), so the cutting point is zero.
Any respondent whose discriminant score is greater than zero would be classified as a prospective visitor,
whereas one whose score is less than zero would be classified as a non-visitor.
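The cutting-point rule can be sketched as below. The slide does not spell out the weighting for unequal groups; this sketch assumes one common convention, in which each centroid is weighted by the size of the other group. The function names and the group sizes of 15 are assumptions for illustration; only the centroids +1.291 and -1.291 come from the slide.

```python
def cutting_score(centroid_a, centroid_b, n_a, n_b):
    """Weighted cutting point; reduces to the simple average when n_a == n_b."""
    return (n_a * centroid_b + n_b * centroid_a) / (n_a + n_b)

def classify(score, cut, high_label="visited", low_label="not visited"):
    """Assign a case to a group by comparing its score with the cutting point."""
    return high_label if score > cut else low_label

# Centroids from the slide; equal group sizes of 15 are assumed
cut = cutting_score(1.291, -1.291, 15, 15)
print(cut)
print(classify(0.8, cut))
print(classify(-0.3, cut))
```

Under this convention the cutting point moves toward the centroid of the smaller group when the groups are unequal.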
Inaksh
The above table shows the results after the discriminant scores are calculated; the diagonal cells hold the correct predictions.
This table is also called the confusion matrix or classification table.
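The share of correct predictions (the hit ratio) can be read off a confusion matrix as the diagonal total divided by the number of cases. The counts below are hypothetical and are not the figures from the SPSS table.

```python
import numpy as np

def hit_ratio(confusion):
    """Proportion of cases on the diagonal (actual group == predicted group)."""
    confusion = np.asarray(confusion, dtype=float)
    return float(np.trace(confusion) / confusion.sum())

# Hypothetical counts: rows = actual group, columns = predicted group
confusion = [
    [8, 2],  # actual visited:     8 predicted visited, 2 predicted not visited
    [1, 9],  # actual not visited: 1 predicted visited, 9 predicted not visited
]
print(hit_ratio(confusion))
```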
Inaksh
After this the data view will
have the new predicted
group membership and
discriminant scores.
Inaksh
Factor analysis
• In ANOVA, regression and discriminant analysis, one of the
variables is clearly identified as the dependent variable. Factor
analysis is a procedure in which variables are not classified as
independent or dependent. Instead the whole set of
interdependent relationships among variables is examined.
Inaksh
Steps
• Formulate the problem
Inaksh
• Analyze > Dimension Reduction > Factor
Inaksh
Construct the correlation matrix
• If the correlations between all the variables are small, then factor analysis may not be appropriate.
• A correlation matrix of the variables can be computed and tested for its statistical significance. The
hypothesis to be tested may be written as:
• H0: The correlation matrix is insignificant, i.e. the correlation matrix is an identity matrix in which the diagonal
elements are one and the off-diagonal elements are zero.
• The test is carried out using Bartlett’s test of sphericity. A significant correlation matrix
ensures that a factor analysis can be carried out.
• Another condition that needs to be fulfilled before a factor analysis can be carried out concerns the value of
KMO, which lies between 0 and 1. For the application of factor analysis, the value of the KMO
statistic should be greater than 0.5.
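Both checks can be computed directly from a correlation matrix. The sketch below uses the usual textbook formulas for Bartlett's chi-square statistic and the overall KMO measure; the function names, the 3 x 3 matrix and the sample size of 30 are hypothetical. The p value is not computed here, since it would require a chi-square distribution function.

```python
import numpy as np

def bartlett_sphericity(corr, n_cases):
    """Chi-square statistic and degrees of freedom for H0: corr is an identity matrix."""
    corr = np.asarray(corr, dtype=float)
    p = corr.shape[0]
    _, logdet = np.linalg.slogdet(corr)  # log of the determinant of corr
    chi_square = -(n_cases - 1 - (2 * p + 5) / 6.0) * logdet
    df = p * (p - 1) // 2
    return float(chi_square), df

def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.asarray(corr, dtype=float)
    inv = np.linalg.inv(corr)
    # Partial correlations come from the scaled inverse of the correlation matrix
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = (corr[off_diag] ** 2).sum()
    q2 = (partial[off_diag] ** 2).sum()
    return float(r2 / (r2 + q2))

# Hypothetical correlation matrix for three variables, 30 respondents assumed
corr = [[1.0, 0.5, 0.5],
        [0.5, 1.0, 0.5],
        [0.5, 0.5, 1.0]]
chi_square, df = bartlett_sphericity(corr, n_cases=30)
print(round(chi_square, 2), df)
print(round(kmo(corr), 3))
```

A chi-square value that is large for its degrees of freedom leads to rejecting H0, and a KMO above 0.5 supports going ahead with the factor analysis.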
Inaksh
There are relatively high correlations among V1 (prevention of cavities), V3 (strong gums) and V5 (prevention of tooth decay). We
would expect these variables to correlate with the same set of factors.
Likewise, there are relatively high correlations among V2 (shiny teeth), V4 (fresh breath) and V6 (attractive teeth). These variables
may also be expected to correlate with the same factors.
Inaksh
The null hypothesis that the population correlation matrix is an identity matrix is
rejected by Bartlett’s test of sphericity.
Inaksh
• Once it has been determined that factor analysis is an
appropriate technique for analyzing the data, an
appropriate method must be selected.
Inaksh
The extraction value (communality) tells us the proportion of each variable's variance that is explained by the extracted factors.
We can see that the extraction values are mostly high, which indicates that the extracted factors represent the variables well.
Inaksh
Determine the number of factors
Different methods:
• A priori determination: based on prior knowledge, the researcher may specify how many factors are to be extracted.
• Determination based on eigenvalues: In this method, only factors with eigenvalues > 1 are
retained. As the eigenvalue represents the amount of variance associated with the factor,
only factors with variance > 1 are included. Factors with variance < 1 are no better than a
single variable because, due to standardization, each variable has a variance of 1.
• Determination based on scree plot: Typically the plot has a distinct break between the steep
slope of factors with large eigenvalues and a gradual trailing off associated with the rest of
the factors. This gradual trailing off is known as the scree.
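The eigenvalue rule can be sketched as follows, assuming the factors are extracted from the correlation matrix as principal components. The function name and the 4 x 4 correlation matrix are hypothetical.

```python
import numpy as np

def kaiser_retained(corr):
    """Eigenvalues (largest first), number with eigenvalue > 1, and % of variance each."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(corr, dtype=float))[::-1]
    retained = int((eigenvalues > 1.0).sum())
    pct_variance = 100.0 * eigenvalues / eigenvalues.sum()
    return eigenvalues, retained, pct_variance

# Hypothetical correlation matrix: variables 1-2 and 3-4 form two correlated pairs
corr = [[1.0, 0.8, 0.1, 0.1],
        [0.8, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.7],
        [0.1, 0.1, 0.7, 1.0]]
eigenvalues, retained, pct_variance = kaiser_retained(corr)
print(np.round(eigenvalues, 3))
print("factors retained:", retained)
print("cumulative % of variance:", np.round(np.cumsum(pct_variance), 2))
```

Plotting the sorted eigenvalues against their rank gives the scree plot described in the last bullet.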
Inaksh
From the above table we can see that SPSS extracted two
factors (components) from the 6 variables, namely the components with eigenvalue > 1.
These two factors explain 82.48% of the variance.
Inaksh
We can see that the 2 factors that had an eigenvalue > 1 were extracted and the other potential
factors that had an eigenvalue < 1 were not extracted.
Inaksh
The correlation coefficient between factor score and the
variables included in the study is called factor loading. The
component matrix is also called the factor matrix.
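Under principal components extraction (an assumption here, suggested by the word "component" in the output), the unrotated loadings are the eigenvectors of the correlation matrix scaled by the square roots of their eigenvalues. The function name and the small matrix below are hypothetical.

```python
import numpy as np

def principal_component_loadings(corr, n_factors):
    """Unrotated loadings (variables x factors) and the communality of each variable."""
    corr = np.asarray(corr, dtype=float)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    # Keep the n_factors largest eigenvalues and their eigenvectors
    order = np.argsort(eigenvalues)[::-1][:n_factors]
    loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])
    # Communality = sum of squared loadings of a variable across the kept factors
    communalities = (loadings ** 2).sum(axis=1)
    return loadings, communalities

# Hypothetical correlation matrix for three variables
corr = [[1.0, 0.5, 0.5],
        [0.5, 1.0, 0.5],
        [0.5, 0.5, 1.0]]
loadings, communalities = principal_component_loadings(corr, n_factors=1)
print(np.round(loadings, 3))
print(np.round(communalities, 3))
```

The sign of a whole column of loadings is arbitrary, so only the pattern of absolute values should be interpreted.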
Inaksh
We can see that V1, V3 and V5 load on factor 1.
Inaksh
Rotate factors
• An important output from the factor analysis is the factor matrix, also called the factor pattern matrix. The factor
matrix contains the coefficients used to express the standardized variables in terms of the factors. These
coefficients (factor loadings) represent the correlations between the factors and the variables.
• A coefficient with a large absolute value indicates that the factor and the variable are closely related.
• Although the initial (unrotated) factor matrix indicates the relationship between the factors and individual
variables, it seldom results in factors that can be interpreted, because the factors are correlated with many variables.
• For example, factor 1 is at least somewhat correlated with 5 of the 6 variables (absolute value of the factor loading is
greater than 0.3).
• Moreover, variables 2, 4 and 5 load at least somewhat on both factors.
Inaksh
Rotate factors
• In such a complex matrix it is difficult to interpret the
factors.
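• Fi = Wi1X1 + Wi2X2 + Wi3X3 + ... + WikXk, where Fi is the estimate of the ith factor, Wi1 to Wik are the factor score coefficients, X1 to Xk are the standardized variables and k is the number of variables.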
Inaksh
• Like factor analysis, cluster analysis examines an entire
set of interdependent relationships.
Inaksh
• Both discriminant analysis and cluster analysis are
concerned with classification. However, discriminant
analysis requires prior knowledge of the cluster or group
membership for each case. In contrast, in cluster analysis
there is no prior information about the group or cluster
membership.
Inaksh
Statistics associated with cluster analysis
• Cluster centroid: The mean values of the variables for all the cases in a particular cluster.
• Dendrogram: Also called a tree graph, it is a graphical device for displaying clustering
results. The dendrogram is read from left to right.
• Distances between clusters: These distances indicate how separated the individual pairs of
clusters are.
Inaksh
Example
• We consider a clustering of consumers based on attitudes toward
shopping. Based on past research, six attitudinal variables were
identified. Consumers were asked to express their degree of
agreement with the following statements on a 7-point scale (1 =
Strongly disagree, 7 = Strongly agree). Data were obtained from a
pretest sample of 20 respondents. (Note: in practice, clustering is done on
much larger samples of 100 or more.)
6 4 7 3 2 3
2 3 1 4 5 4
7 2 6 4 1 3
4 6 4 5 3 6
1 3 2 2 6 4
6 4 6 3 3 4
5 3 6 3 3 4
7 3 7 4 1 4
2 4 3 3 6 3
3 5 3 6 4 6
1 3 2 3 5 3
5 4 5 4 2 4
2 2 1 5 4 4
4 6 4 6 4 7
6 5 4 2 1 4
3 5 4 6 4 7
4 4 7 2 2 5
3 7 2 6 4 3
4 6 3 7 2 7
2 3 2 4 7 2
Inaksh
Select a distance measure
• The most common approach is to measure similarity in
terms of distance between pairs of objects.
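As an illustration, the distance between the first two respondents in the data above can be computed as follows. Euclidean distance is assumed here as the measure; the function names are hypothetical.

```python
import numpy as np

# First two rows of the example data (six attitude ratings each)
respondent_1 = np.array([6, 4, 7, 3, 2, 3])
respondent_2 = np.array([2, 3, 1, 4, 5, 4])

def squared_euclidean(a, b):
    """Sum of squared differences across all variables."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float((diff ** 2).sum())

def euclidean(a, b):
    """Square root of the squared Euclidean distance."""
    return squared_euclidean(a, b) ** 0.5

print(squared_euclidean(respondent_1, respondent_2))
print(euclidean(respondent_1, respondent_2))
```

The smaller the distance, the more similar the two respondents are in their attitudes.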
Inaksh
Select a clustering procedure
Inaksh
Hierarchical clustering
Inaksh
Linkage methods
• Single linkage: This method is based on the minimum distance or nearest
neighbor rule. The first two objects clustered are those that have the smallest
distance between them. The next shortest distance is identified and either
the third object is clustered with the first two or a new two-object cluster is
formed. Two clusters are merged at any stage by the single shortest link
between them. This process is continued till all objects are in one cluster.
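The nearest neighbor rule can be sketched as a plain loop. This is a simple, unoptimized illustration with Euclidean distance assumed, not the routine SPSS uses; the function name and the five one-dimensional points are hypothetical.

```python
import numpy as np

def single_linkage(points, n_clusters=1):
    """Merge clusters by the shortest link until n_clusters remain.

    Returns the final clusters (lists of case indices) and the merge history.
    """
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]  # every case starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), float(d)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical one-dimensional cases
points = [[0.0], [1.0], [5.0], [6.5], [20.0]]
clusters, merges = single_linkage(points, n_clusters=2)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
print(clusters)
```

The merge history is the information a dendrogram or agglomeration schedule displays: which clusters were joined and at what distance.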
Inaksh
Variance methods
• Ward’s method: For each cluster the means for all the
variables are computed. Then for each object, the
squared Euclidean distance to the cluster means is
calculated. These distances are summed for all the
objects.
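The quantity described above, the total within-cluster sum of squared distances, can be sketched as follows. At each stage Ward's method joins the two clusters whose merger increases this total the least. The function name, data and cluster labels are hypothetical.

```python
import numpy as np

def within_cluster_ss(data, labels):
    """Sum over all cases of the squared Euclidean distance to their cluster mean."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = data[labels == label]
        centroid = members.mean(axis=0)  # cluster means for all variables
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Hypothetical cases on two variables, with two alternative groupings
data = [[0, 0], [2, 0], [10, 10], [10, 12]]
print(within_cluster_ss(data, [0, 0, 1, 1]))  # compact clusters
print(within_cluster_ss(data, [0, 1, 0, 1]))  # poorly matched clusters
```

The first grouping gives a much smaller total than the second, which is why Ward's procedure would prefer it.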
Inaksh
Centroid methods
Inaksh
Non hierarchical clustering
Inaksh
• Analyze > Classify > Hierarchical Cluster
Inaksh
This shows the stage of Ward’s procedure at which two cases were put into the same
cluster.
Inaksh
From the dendrogram, we can see which cases are close to each other and in what order they are grouped together. We can see that cases
14, 16, 10, 4 and 19 are very similar to each other.
Inaksh
• Since we did a hierarchical analysis, we didn’t have to
specify the number of clusters, but from the dendrogram
we can see that it would be a good idea to create exactly 3
clusters.
Inaksh
• Now if we go to the data view, we can see a new variable
that tells us about the cluster membership of each case.
Inaksh