
Skill Enhancement Course
SPSS

Inaksh
IBM Statistical Package for Social Sciences

Books to be referred to:
1. Research Methodology by Deepak Chawla and Neena Sondhi
2. Marketing Research by Naresh Malhotra and Satyabhushan Dash

Research

• The crux of a scientific approach to identifying and pursuing a research path is to identify the ‘what’, i.e. the exact research question to which you are seeking an answer.

• The second important thing is that the process of arriving at the question should be logical and follow a line of reasoning that lends itself to scientific enquiry.

Data analysis

• Univariate (testing a single variable)
• Bivariate (two variables)
• Multivariate (more than two variables)

Hypothesis
It is a presupposition about the expected direction of the results of the research.

Unlike the research problem, which generally takes a question form, the hypothesis is always in a declarative form.

A hypothesis is written in such a way that it can be proven or disproven by valid and reliable data.

Hypothesis
A few criteria that the researcher must fulfil while designing any hypothesis:

1. It should be in a simple, clear and declarative form. Hence it is advisable to make the hypothesis unidimensional and to test only one relationship between only two variables at a time.

2. It must be measurable and quantifiable.

3. It should not be based on a gut feeling; it should be based on existing literature.

4. Validation of the hypothesis involves testing the statistical significance of the hypothesized relation.

Hypothesis

Two types:

1. Descriptive hypothesis - a simple statement about the behaviour of the population under study. E.g.:
The attrition rate in the BPO sector is almost 33%.
The literacy rate in the city of Indore is 100%.

2. Relational hypothesis - states the expected relationship between two variables.

Hypothesis
E.g.: The higher the likeability of the advertisement, the higher the recall rate.

(In the above example, the direction of the relationship is known.)

However, sometimes the researcher might not have reasonable supporting data to hypothesise the expected direction of the relationship. In this case, we can leave the hypothesis non-directional (or two-tailed).

E.g.:
A ban on smoking has an impact on cigarette sales.
Anxiety is related to performance.

Measurement and Scaling
• The term measurement means assigning numbers or other symbols to the characteristics of certain objects. When numbers are used, the researcher must have a rule for assigning a number to an observation in a way that provides an accurate description.

• The assignment of numbers should be isomorphic, i.e. there must be a one-to-one correspondence between the numbers and the characteristics, e.g. male (1), female (2).

• Scaling is an extension of measurement. Scaling involves creating a continuum on which measurements on objects are located.

• Suppose you want to measure the satisfaction level towards Kingfisher Airlines and a scale of 1 to 11 is used for this purpose. This scale indicates the degree of satisfaction, with 1 = extremely dissatisfied and 11 = extremely satisfied.

• Measurement is the actual assignment of a number from 1 to 11 to each respondent, whereas scaling is the process of placing the respondent on a continuum with respect to their satisfaction towards Kingfisher Airlines.

• There are 4 types of measurement scales: nominal, ordinal, interval and ratio.


Introduction

• Data entry : Data view

• Storing and retrieving files : (.sav)

• Statistics Menus : Analyze

• Generating new variables : Variable view

• Extracting data from an Excel file : File 🡪 Open Database 🡪 New Query

• Coding : The process of assigning a numeral to a response given by a respondent. This is done to make it easier for the researcher to interpret and classify the answers and then record the data from the questionnaire on a spreadsheet.

• SPSS Data Editor has two views : Data view and Variable View

When we start SPSS, it shows this dialog box; for now we select: Type in data

Variable View
1. Name : the unique variable name. (Note: spaces are not allowed in variable names.)

2. Type : the kind of data to be recorded (e.g., strings of characters, numeric values, or special formats such as dates).

3. Width : the number of characters that can be entered in the column. (The default width is 8 characters and can be modified depending on the data.)

4. Decimals : the number of decimal places displayed. (For the numeric data type the default value is 2.)

5. Label : a text entry to describe the data provided by the variable, which can be much longer than the variable name and may include spaces. With questionnaires, for example, the label is usually the text of the question.

6. Values : if specific numeric values have a non-intuitive meaning, these values can be labelled (e.g., 1 = male and 2 = female).

7. Missing : this column is used in cases where no data is provided by a respondent. An impossible value for that column is chosen as the missing-value code. E.g., for age it can be set to 1000 or -100, which are impossible entries for age. SPSS excludes values declared as missing when analyzing the data.

8. Columns : determines how wide the variable column should be in Data View mode. (Default value is 8.)

9. Align : determines whether the data should be left-justified, right-justified, or centered .

10. Measure : describes the level of measurement (e.g., nominal, ordinal, or scale)

11. Role :

Coding

• Dichotomous questions : Do you buy ready-to-eat food?
Coding will be Yes=1; No=0.

Ranking Question

• Please rank the following in order of importance from 1 to 4, 1 being the most important and 4 being the least important.
1. Speed of service
2. Ease of parking
3. Cleanliness
4. Friendliness of staff

• This will be entered in SPSS as 4 variables: Speed of service, Ease of parking, Cleanliness and Friendliness of staff. The coding is the same for all four variables (ranks 1-4).

• This is a multi-response question. Each item is a separate variable, so we will have 4 variables, and for each variable there are 4 possible responses (1, 2, 3, 4).

Checklist question

• Which of the following newspapers do you read? (Tick all that you read.)
1. The Times of India
2. The Hindustan Times
3. Mail Today
4. The Indian Express
5. Deccan Chronicle
6. The Asian Age
7. Mint

• This will be entered in SPSS as 7 variables (TOI, Hindustan Times etc.). Coding will be done as 0 = Not ticked, 1 = Ticked.

• When we do statistical analysis, we can have only one piece of data per cell.
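
The checklist coding described above can be sketched outside SPSS as follows. This is only an illustration: the two respondents and the helper name `code_checklist` are made up, and the point is simply that each newspaper becomes its own 0/1 variable so that every cell holds one piece of data.

```python
# The seven newspapers from the checklist question; each becomes one variable.
NEWSPAPERS = [
    "The Times of India", "The Hindustan Times", "Mail Today",
    "The Indian Express", "Deccan Chronicle", "The Asian Age", "Mint",
]

def code_checklist(ticked):
    """Return one 0/1 code per newspaper (0 = not ticked, 1 = ticked)."""
    return {paper: int(paper in ticked) for paper in NEWSPAPERS}

# Hypothetical answers from two respondents (not from the course data).
responses = [
    {"The Times of India", "Mint"},
    {"Mail Today"},
]

coded = [code_checklist(ticked) for ticked in responses]
for row in coded:
    print(list(row.values()))  # one row of the Data View, seven columns
```

Each printed row corresponds to one respondent's row in Data View, with seven columns of 0s and 1s.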

Scaled question

This question will be entered as one variable. Coding will be done as 1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly Agree.

Open ended questions

• The coding of open-ended questions is quite difficult as they are unpredictable. The respondents' exact answers are noted on the questionnaire. Then the researcher looks for patterns and assigns a category code.

Recoding

• It is a feature in SPSS which is used to convert

Normal distribution

Data type

Univariate: Descriptive Analysis
Bivariate: Cross Tabulation
Multivariate: Correlation Analysis, Regression Analysis, Factor Analysis, Discriminant Analysis, Cluster Analysis

Descriptive analysis
• Frequency table - Frequency, Per cent, Valid Per cent, Cumulative Per cent

• Valid percentage excludes missing values. (In the given example, the valid percentage is calculated out of 566.)

• Analyze 🡪 Descriptive Statistics 🡪 Frequencies

Descriptive analysis

• Mean : This represents the arithmetic average of a variable. It is appropriate for interval and ratio scale data.

• Median :

• Mode :

• Variance :

• Standard deviation : This measures the variability of a variable around the mean. It is appropriate for interval and ratio scale data.

• Coefficient of Variation :
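
As a rough illustration of the measures listed above, the sketch below computes them with Python's standard `statistics` module for the ten share prices used in a later exercise of this course. It assumes the sample (n − 1) formulas for variance and standard deviation, and takes the coefficient of variation as standard deviation divided by mean.

```python
import statistics

# Share prices (in Rs.) from the one-sample t-test exercise in this course.
prices = [66, 65, 69, 70, 69, 71, 70, 63, 64, 68]

mean = statistics.mean(prices)           # arithmetic average
median = statistics.median(prices)       # middle value of the sorted data
modes = statistics.multimode(prices)     # most frequent value(s)
variance = statistics.variance(prices)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(prices)       # square root of the sample variance
cv = std_dev / mean                      # coefficient of variation (assumed definition)

print(f"mean={mean}, median={median}, modes={modes}")
print(f"variance={variance:.3f}, sd={std_dev:.3f}, cv={cv:.2%}")
```

Note that this data set has two modes, which is why `multimode` is used rather than `mode`.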

Descriptive analysis

• Cross tabulation
• Analyze 🡪 Descriptive Statistics 🡪 Crosstabs (Specify Row, Column)
• Try crosstabs with 3 variables.

Relationships

• When evaluating the relationship between two variables, it is important to determine how the variables are related.

Linear relationship
• A linear relationship is a trend in the data that can be modeled by a straight line. For example, suppose
an airline wants to estimate the impact of fuel prices on flight costs. They find that for every dollar
increase in the price of a gallon of jet fuel, the cost of their LA-NYC flight increases by about $3500.
This describes a linear relationship between jet fuel cost and flight cost.
• When both variables increase or decrease concurrently and at a constant rate, a positive linear
relationship exists. The points in Plot 1 follow the line closely, suggesting that the relationship between
the variables is strong. The Pearson correlation coefficient for this relationship is +0.921.
• When one variable increases while the other variable decreases, a negative linear relationship exists.
The points in Plot 2 follow the line closely, suggesting that the relationship between the variables is
strong. The Pearson correlation coefficient for this relationship is −0.968.
• If a relationship between two variables is not linear, the rate of increase or decrease can change as one variable changes, causing a "curved pattern" in the data. This curved trend might be better modeled by a nonlinear function.

[Figure panels: Strong positive linear relationship; Strong negative linear relationship; Weak linear relationship; Non-linear relationship]

Monotonic relationship

• In a monotonic relationship, the variables tend to move in the same relative direction, but not necessarily at a constant rate.

• Linear relationships are also monotonic.

Monotonic relationship

Spearman’s rank order correlation coefficient

• Used in case of ordinal scale data.


• Q1: Two judges in a beauty contest evaluate ten participants. A rank of one was assigned to the most beautiful candidate, two to the next and so on. Compute the rank order correlation and comment on the value. The rankings are as follows. (Chawla and Sondhi, Page 311)

Participant   Ranking by Judge 1   Ranking by Judge 2
1             10                   9
2             1                    3
3             5                    4
4             2                    1
5             8                    8
6             3                    2
7             4                    6
8             6                    5
9             7                    7
10            9                    10

Create two variables, namely judge1 and judge2, enter the data for the 10 participants and set Measure to Ordinal.

Analyze🡪 Correlate🡪 Bivariate🡪 Select Spearman
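
The same coefficient can be checked by hand. The sketch below uses the textbook formula rho = 1 − 6Σd² / (n(n² − 1)), which applies when there are no tied ranks (as is the case for the two judges here); the function name is only illustrative.

```python
# Rankings from Q1, in participant order 1..10.
judge1 = [10, 1, 5, 2, 8, 3, 4, 6, 7, 9]
judge2 = [9, 3, 4, 1, 8, 2, 6, 5, 7, 10]

def spearman_from_ranks(x, y):
    """Spearman's rho for two lists of ranks without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

rho = spearman_from_ranks(judge1, judge2)
print(round(rho, 3))
```

Here the squared rank differences sum to 14, so rho = 1 − 84/990, which is a little above 0.91.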


Interpretation

• It is seen that there is a high degree of positive rank correlation, which implies that there is strong agreement between the two judges in their opinion of the contestants.

Data analysis of rank order questions

• In survey research, respondents are asked to rank order certain attributes of a product on the basis of their preference.
• We can create a frequency table of the ranks. Then, for each attribute, multiply the number of people giving each rank by an (assumed) weight for that rank and add up the products.
• This will help us understand the order of preference in a holistic way for all participants.
• See image for reference. (Chawla and Sondhi, Page 313)
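
A minimal sketch of this weighted-score idea is shown below. Both the frequency table and the weights are hypothetical (they are not the figures from the textbook image): 50 imaginary respondents rank four attributes, and rank 1 is given the largest assumed weight.

```python
# Hypothetical frequency table: number of respondents giving rank 1, 2, 3, 4.
rank_counts = {
    "Speed of service":      [20, 15, 10, 5],
    "Ease of parking":       [5, 10, 15, 20],
    "Cleanliness":           [15, 15, 10, 10],
    "Friendliness of staff": [10, 10, 15, 15],
}

# Assumed weights: rank 1 counts most, rank 4 counts least.
weights = [4, 3, 2, 1]

# Weighted score = sum over ranks of (number of people x weight).
scores = {
    attribute: sum(count * weight for count, weight in zip(counts, weights))
    for attribute, counts in rank_counts.items()
}

# Overall order of preference, highest score first.
ordered = sorted(scores, key=scores.get, reverse=True)
for attribute in ordered:
    print(attribute, scores[attribute])
```

A different choice of weights can change the scores, so the weighting scheme should be stated whenever such a table is reported.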

Frequency table

Practice Questionnaire

• (Link)

Testing of hypothesis
• A hypothesis is an assumption or a statement that may or may not be true. The hypothesis is tested on the basis of information obtained from a sample.

• E.g. when we want to find out, based on the sample data, whether a new drug is more effective than the existing drug.

Null hypothesis : A hypothesis that is proposed with the intent of being rejected is called a null hypothesis. This requires that we hypothesize the opposite of what we want to prove. It is denoted by H0. For example:

1. If we want to show that sales and advertisement expenditure are related, we formulate the null hypothesis that they are not related.

2. If we want to conclude that the new sales training programme is effective, we formulate the null hypothesis that the new training programme is NOT effective.

3. If we want to prove that the average wages of skilled workers in Mumbai are greater than those in Delhi, we formulate the null hypothesis that there is no difference in the wages of the skilled workers in the two cities.

Testing of hypothesis

• Alternative hypothesis : Rejection of the null hypothesis leads to the acceptance of the alternative hypothesis.

• The rejection of the null hypothesis indicates that the relationship between variables (e.g. sales and advertisement expenditure), the difference between means (e.g. wages of skilled workers in Mumbai and Delhi) or the difference between proportions is statistically significant.

• Acceptance of (strictly, failure to reject) the null hypothesis indicates that these differences could be due to CHANCE.

• Also, the alternative hypothesis can cover a whole range of values rather than a single point.

• The alternative hypothesis is denoted by H1.

One tailed and two tailed tests
• A test is called one-sided (one-tailed) only if the null hypothesis gets rejected when the value of the test statistic falls in one specified tail of the distribution.
• A test is called two-sided (two-tailed) only if the null hypothesis gets rejected when the value of the test statistic falls in either one or the other of the two tails of the sampling distribution.

-2.131 and 2.131 are values from the table.

Steps in testing of hypothesis

• Setting up of a hypothesis
• Setting up of a suitable significance level
• Determination of a test statistic
• Determination of critical region
• Computing the value of test statistic
• Making a decision

Testing hypothesis concerning means

• Note : You cannot do a z test with SPSS. It does not show it as an option in
the drop down menu.

• Note : Most computer software, including SPSS, provides both the computed value of the test statistic and the corresponding p value. The p value reported is for the two-sided test (by default, unless changed). If the problem calls for a one-sided test, the reported p value is divided by 2 to obtain the p value for the problem, which is then compared with α, the level of significance, to decide whether or not to reject the null hypothesis.

T test

• We will be doing 3 types of t tests: one sample, two independent samples and paired sample.

• The t-test for testing the equality of two population means is carried out under one of these two assumptions:
1. Population variances are equal
2. Population variances are not equal

One sample t-test

• The one sample t-test is a statistical procedure used to determine whether a sample of
observations could have been generated by a process with a specific mean.
• Suppose you are interested in determining whether an assembly line produces laptop
computers that weigh five pounds. To test this hypothesis, you could collect a sample of
laptop computers from the assembly line, measure their weights, and compare the sample
with a value of five using a one-sample t-test.

Assumptions (One sample T Test)
1. Your dependent variable should be measured at the interval or ratio level (i.e., continuous).
2. There should be no significant outliers. Outliers are data points within your data that do not
follow the usual pattern (e.g., in a study of 100 students' IQ scores, where the mean score
was 108 with only a small variation between students, one student had a score of 156, which
is very unusual, and may even put her in the top 1% of IQ scores globally).
3. Your dependent variable should be approximately normally distributed. We talk about the
one-sample t-test only requiring approximately normal data because it is quite "robust" to
violations of normality, meaning that the assumption can be a little violated and still provide
valid results. You can test for normality using the Shapiro-Wilk test, which is available in SPSS Statistics.

• Qa: The prices of a company's share (in Rs.) on different days in a month were found to be 66, 65, 69, 70, 69, 71, 70, 63, 64 and 68. Examine whether the mean price of the share in the month is different from 65. You may use a 10% level of significance. (Chawla and Sondhi, Page 332)

• Analyze🡪 Compare means🡪 One Sample T Test

Put the test value as 65, go to Options and set the confidence percentage.
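
The test statistic that SPSS reports for this exercise can be reproduced by hand, as in the sketch below. It uses only the standard library, so instead of a p value it compares the statistic with 1.833, the t-table value for 9 degrees of freedom at a two-tailed 10% level; that table value is typed in as a constant rather than computed.

```python
import math
import statistics

# Share prices from Qa and the hypothesised mean.
prices = [66, 65, 69, 70, 69, 71, 70, 63, 64, 68]
test_value = 65

n = len(prices)
mean = statistics.mean(prices)
std_err = statistics.stdev(prices) / math.sqrt(n)  # standard error of the mean
t_stat = (mean - test_value) / std_err
df = n - 1

# Critical value taken from a t table: df = 9, two-tailed alpha = 0.10.
t_critical = 1.833
reject_null = abs(t_stat) > t_critical

print(f"mean={mean}, t={t_stat:.3f}, df={df}, reject H0: {reject_null}")
```

Because the absolute t value exceeds the critical value, the null hypothesis that the mean price equals 65 is rejected at the 10% level.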

Output

Interpretation

Independent sample t-test

• Two sample t-test : In this test we measure group 1 and group 2 and then see whether the mean for group 1 is statistically significantly different from the mean for group 2. The two groups are each measured once, and the groups are independent of each other.
• Suppose that an analyst wants to study the amount that Mumbaikars and Delhiites spend, per month, on clothing. It would not be practical to record the spending habits of every individual (or family) in both cities, so a sample of spending habits is taken from a selected group of individuals from each city. The average amount for Mumbaikars comes out to Rs. 500; the average amount for Delhiites is Rs. 1,000. The t-test asks whether the difference between the groups represents a true difference or is due to chance.

Assumptions (Independent samples T Test)

1. Your dependent variable should be measured at the interval or ratio level (i.e., continuous).
2. Your independent variable should consist of two categorical, independent groups. Example
independent variables that meet this criterion include gender (2 groups: male or female),
employment status (2 groups: employed or unemployed), smoker (2 groups: yes or no), and
so forth.
3. You should have independence of observations, which means that there is no relationship
between the observations in each group or between the groups themselves. For example,
there must be different participants in each group with no participant being in more than one
group.
4. There should be no significant outliers.
5. There needs to be homogeneity of variances. You can test this assumption in SPSS
Statistics using Levene’s test for homogeneity of variances.
6. Your dependent variable should be approximately normally distributed for each group of the
independent variable.

Homogeneity of variances
• Homo: same
• Geneity: nature, kind
• So it means "of the same nature".
• So homogeneity of variance means that the variances are of the same nature or kind. This does not mean that they have to be exactly equal, but they have to be close enough.
• Levene's test : Tests whether the variances of two samples are approximately equal, so it tests our assumption of homogeneity of variances. Like any test, Levene's test starts with a null hypothesis. In this case the null hypothesis is that there is no difference between the variance of the first group and the variance of the second group. Unlike other tests, in Levene's test we WANT the variances to be the same. We would like Levene's test to be non-significant, so that the assumption of the T Test is satisfied. In SPSS, Levene's test is conducted automatically whenever you run an independent samples T Test.
• Note: As long as N > 30 and the sample size of the first sample is approximately equal to the sample size of the second sample, the T Test is robust to violations of homogeneity of variance.

The test is not significant, hence the assumption is satisfied and we will use the first row for the T Test.
• Qb: Two types of drugs (1 and 2) were tried on some patients for reducing weight. Eight adults were given drug 1 and seven adults were given drug 2. The decrease in weight (in pounds) is given below:

• Do the drugs differ significantly in their effect on decreasing weight? You may use a 5% level of significance. Assume that the variances of the two populations are not the same. (Chawla and Sondhi, Page 339)

• Analyse 🡪 Compare means🡪 Independent Samples T Test

Step 1

Step 2

SPSS needs to know our coding scheme: as there may be more than 2 groups, it needs to know which two groups are being compared.
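
For the unequal-variances case, SPSS reports a separate row ("Equal variances not assumed"). The sketch below shows one way that statistic can be computed (Welch's t with the Welch-Satterthwaite degrees of freedom). The weight-loss figures are hypothetical, since the data table for Qb is not reproduced in these notes; only the group sizes (8 and 7) follow the question.

```python
import math
import statistics

# HYPOTHETICAL decreases in weight (pounds); not the textbook data.
drug1 = [10, 8, 12, 14, 7, 13, 11, 9]   # 8 adults on drug 1
drug2 = [12, 10, 7, 6, 12, 11, 12]      # 7 adults on drug 2

def welch_t(sample_a, sample_b):
    """Return (t, df) for a two-sample t test without assuming equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Each term is a sample variance divided by its sample size.
    term_a = statistics.variance(sample_a) / n_a
    term_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(term_a + term_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (term_a + term_b) ** 2 / (term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1))
    return t, df

t_stat, df = welch_t(drug1, drug2)
print(f"t={t_stat:.3f}, df={df:.2f}")
```

The degrees of freedom are generally not a whole number in this version of the test, which is why the second row of the SPSS output shows a fractional df.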

Output

Interpretation

Paired sample t-test

• The paired sample t-test, sometimes called the dependent sample t-test, is a statistical
procedure used to determine whether the mean difference between two sets of observations
is zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs
of observations.
• Suppose you are interested in evaluating the effectiveness of a company training program.
One approach you might consider would be to measure the performance of a sample of
employees before and after completing the program, and analyze the differences using a
paired sample t-test.

Assumptions(Paired sample T Test)

1. Your dependent variable should be measured at the interval or ratio level (i.e., continuous).
2. Your independent variable should consist of two categorical, "related groups" or "matched pairs". "Related groups" indicates that the same subjects are present in both groups. It is possible to have the same subjects in each group because each subject has been measured on two occasions on the same dependent variable.
3. There should be no significant outliers in the differences between the two related groups.
4. The distribution of the differences in the dependent variable between the two related groups should be approximately normally distributed.

Qc: A company selects eight salesmen at random and their sales figures for the previous month are recorded. They then undergo a training course devised by a business consultant, and their sales figures for the following month are compared as shown in the table. Has the training course caused an improvement in the salesmen's ability? You may use a 0.05 level of significance. (Chawla and Sondhi, Page 341)

Previous Month    75   90   94   95   100   90   70   64
Following Month   77  101   93   92   105   88   76   68

• Analyse 🡪 Compare means🡪 Paired Samples T Test

Output

Interpretation
H0: µf = µp

H1: µf > µp

• This is a one-tailed test. The results indicate that the two-tailed p value is 0.143. Since it is a one-tailed test and the sample difference is in the direction stated in H1, the applicable p value is 0.143/2 = 0.0715.

• 0.0715 > 0.05, therefore the null hypothesis is not rejected, as there is not enough evidence to reject it.

• Therefore there is not enough evidence that the sales training programme has caused an improvement in the salesmen's ability.
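
The paired t statistic behind this interpretation can be recomputed from the table in Qc, as sketched below. The differences are taken as following minus previous so that a positive t supports H1; the one-tailed critical value 1.895 (df = 7, α = 0.05) is a t-table constant typed in by hand.

```python
import math
import statistics

# Sales figures from Qc.
previous = [75, 90, 94, 95, 100, 90, 70, 64]
following = [77, 101, 93, 92, 105, 88, 76, 68]

# One difference per salesman (following month minus previous month).
diffs = [f - p for f, p in zip(following, previous)]

n = len(diffs)
mean_diff = statistics.mean(diffs)
std_err = statistics.stdev(diffs) / math.sqrt(n)
t_stat = mean_diff / std_err
df = n - 1

# Critical value from a t table: df = 7, one-tailed alpha = 0.05.
t_critical = 1.895
reject_null = t_stat > t_critical

print(f"mean difference={mean_diff}, t={t_stat:.3f}, df={df}, reject H0: {reject_null}")
```

The statistic falls short of the critical value, which agrees with the p-value comparison above: the null hypothesis is not rejected.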

• Note : When you do a t-test, you are comparing the data
for one group to the data in another group. If the t value is
positive, it means that the first group has a higher mean
than the second group. If the t value is negative, the
second group has a higher mean. In reality, it isn't very
important whether the t-test is positive or negative as
long as you report the means of each group, and it is
acceptable to drop the negative sign when reporting the t
value. (Can be verified from the previous slide,2nd table )

ANOVA

ANOVA
• We discussed the test of hypothesis concerning the equality of
two population means using T tests(and Z tests in theory).
However if there are more than two populations, the test for
the equality of means could be carried out by considering two
populations at a time. This would be a cumbersome
procedure. One easy way out could be to use the analysis of
variance (ANOVA) technique.

• There are two kinds of variables: the dependent variable (the data of the groups) and the independent variables (factors).

• This test is also called the Fisher analysis of variance.

• ANOVA was developed by Ronald Fisher in 1918.


ANOVA
• The basic principle underlying the technique is that the total variation in the dependent
variable is broken into two parts – one which can be attributed to some specific causes and
the other that may be attributed to chance.
• The one which is attributed to the specific causes is called the variation between the
samples and the one which is attributed to chance is termed as the variation within samples.
• Therefore, in ANOVA, the total variance may be decomposed into various components
corresponding to the sources of the variation.
• For e.g. the sales of chairs could differ because of the various styles and sizes of the stores
selling them. Similarly, one could study the differences among the various types of drugs for
curing a specific disease or the differences in cholesterol content of various diet foods or
differences in yield of crops due to varieties of seeds, fertilizers or soils. For example, an
ANOVA can examine potential differences in IQ scores by Country (US vs. Canada vs. Italy
vs. Spain).
• In general, the ANOVA techniques investigate any number of factors which are supposed to
influence the dependent variable of interest. It is also possible to investigate the differences
in various categories within each of these factors with the help of a Post-hoc analysis.
• It is an extension of the z and t test.
• The null hypothesis for an ANOVA is that there is no significant difference among the
groups. The alternative hypothesis assumes that there is at least one significant difference
among the groups.

ANOVA

• As explained earlier, the total variation in the data set can be expressed as the sum of the variation that can be attributed to specific causes (in the given example, the various diet foods) plus the variation that is attributed to chance. The total variation in the data set is called the total sum of squares (TSS).

• The variation within the samples, which is attributed to chance, is referred to as the error sum of squares (SSE).

Assumptions (ANOVA)

• In ANOVA, the dependent variable in question is metric (interval or ratio scale), whereas the independent variables are categorical (nominal scale).
• It is assumed that each of the samples is drawn from a normal population.
• It is also assumed that each of these populations has an equal variance.
• Another assumption is that all factors except the one being tested are controlled (kept constant). Basically, two estimates of the population variance are made: one based on the variation between the samples and the other based on the variation within the samples.

Types of ANOVA
• There are two types of analysis of variance: one-way (or unidirectional) and two-way. One-way or two-way refers to the number of independent variables in your analysis of variance test. A one-way ANOVA evaluates the impact of a sole factor on a sole response variable. It determines whether all the samples are the same. The one-way ANOVA is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. A one-way ANOVA has just one independent variable. For example, differences in IQ can be assessed by Country, and Country can have 2, 20, or more different categories to compare.
• A two-way ANOVA is an extension of the one-way ANOVA. With a one-way, you have one independent
variable affecting a dependent variable. With a two-way ANOVA, there are two independents. For
example, a two-way ANOVA allows a company to compare worker productivity based on two
independent variables, say salary and skill set. It is utilized to observe the interaction between the two
factors. It tests the effect of two factors at the same time. A two-way ANOVA refers to an ANOVA using
two independent variables. Expanding the example above, a 2-way ANOVA can examine differences in
IQ scores (the dependent variable) by Country (independent variable 1) and Gender (independent
variable 2). Two-way ANOVA can be used to examine the interaction between the two independent
variables.(Not in syllabus )
• N-way ANOVA - A researcher can also use more than two independent variables, and this is an n-way ANOVA (with n being the number of independent variables you have). For example, potential differences in IQ scores can be examined by Country, Gender, Age group, Ethnicity, etc., simultaneously. (Not in syllabus)


One way ANOVA

• In this design, there is one dependent variable and one independent variable.

• Qd: Suppose we want to compare the cholesterol contents of four competing diet foods on the basis of the following data (in milligrams per package), which were obtained for three randomly taken 6-ounce packages of each of the diet foods:

Diet Food A   3.6   4.1   4.0
Diet Food B   3.1   3.2   3.9
Diet Food C   3.2   3.5   3.5
Diet Food D   3.5   3.8   3.8

• We want to test whether the differences among the sample means can be attributed to chance at the 5% level of significance.

Analyse 🡪 Compare means🡪 One Way ANOVA

Interpretation
H0: µA = µB = µC = µD

H1: At least two means are not equal

• The p value (sig.) for this problem is 0.160, which is greater than α = 0.05, the level of significance. Therefore, there is not enough evidence to reject the null hypothesis.

• This means that the difference in the cholesterol content of the various diet foods could be attributed to chance.
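
The F statistic for Qd can be rebuilt from the between-samples and within-samples sums of squares described earlier, as in the sketch below. It uses only the standard library, so the decision is made against 4.07, the F-table value for (3, 8) degrees of freedom at α = 0.05, entered by hand rather than computed.

```python
# Cholesterol content (mg per package) from Qd.
groups = {
    "A": [3.6, 4.1, 4.0],
    "B": [3.1, 3.2, 3.9],
    "C": [3.2, 3.5, 3.5],
    "D": [3.5, 3.8, 3.8],
}

all_values = [value for sample in groups.values() for value in sample]
grand_mean = sum(all_values) / len(all_values)

# Variation between samples (attributed to the diet foods).
ss_between = sum(
    len(sample) * (sum(sample) / len(sample) - grand_mean) ** 2
    for sample in groups.values()
)
# Variation within samples (attributed to chance).
ss_within = sum(
    (value - sum(sample) / len(sample)) ** 2
    for sample in groups.values()
    for value in sample
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

# Critical value from an F table: df = (3, 8), alpha = 0.05.
f_critical = 4.07
reject_null = f_stat > f_critical

print(f"SSB={ss_between:.2f}, SSW={ss_within:.2f}, F={f_stat:.2f}, reject H0: {reject_null}")
```

The F value is well below the critical value, matching the conclusion drawn from the p value of 0.160.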

Post-hoc analysis
• When you conduct an ANOVA, you are attempting to determine if there is a statistically significant difference among the groups. If you find that there is a difference, you will then need to examine where the group differences lie.

• At this point you could run post-hoc tests, which are t tests examining mean differences between the groups.

Post-hoc analysis
• Note : For a one-way ANOVA, you will probably find that just
two tests need to be considered. If your data met the
assumption of homogeneity of variances, use Tukey's honestly
significant difference (HSD) post hoc test.

• If your data did not meet the homogeneity of variances assumption, you should consider running the Games-Howell post hoc test.

SPSS has 18 different types of post hoc tests. When all your assumptions are met and the sample sizes are the same, we use Tukey. Others are used when the variances are not equal or the sample sizes are unequal (not in syllabus).

Post hoc analysis

Interpretation

• We cannot interpret the above Post-hoc table: the result of the ANOVA is not significant, hence the post hoc results are not valid.

• Note : If you do not have a significant value in ANOVA, no
need to interpret post hoc tests .

• Note : In an analysis of variance, it may happen that the variances cannot be assumed to be equal. In this case, the F test of the ANOVA is not robust enough to be used.

• Welch and Brown-Forsythe ANOVA are more reliable than the classic F when variances are unequal.

• Both tests are a good replacement for ANOVA. Use either in that case.

Non – Parametric tests

• Nonparametric tests are also called distribution-free tests because they don't assume that your data follow a specific distribution.

• Most non-parametric tests apply to data in an ordinal scale, and some apply to data in a nominal scale.

Pearson Chi Square

• Chi-square is an important non-parametric test and as such no rigid assumptions are necessary in respect of the type of population. We require only the degrees of freedom (implicitly, of course, the size of the sample) for using this test.

• As a non-parametric test, chi-square can be used (i) as a test of goodness of fit and (ii) as a test of independence.

Chi-Square Independence Test

• The chi-square independence test is a procedure for testing if two categorical variables are related in some population.

• Also called the test of association.

Note:

1. In chi square we must know the frequencies of the nominal variables.

2. We must also ensure that the total frequency of all the data is more than 50.

3. The frequency variable needs to be identified as a frequency variable in SPSS (Data 🡪 Weight Cases).

Chi-Square Independence Test
• Null hypothesis: There are no relationships between the categorical variables. If you know the value of one variable, it does not help you predict the value of another variable.

• Alternative hypothesis: There are relationships between the categorical variables. Knowing the value of one variable does help you predict the value of another variable.

• Q - Suppose we have data on 100 males and 100 females
regarding their buying behaviour towards perfumes.

• Buying behaviour is coded as Yes or No.

Data → Weight Cases → Frequencies

Analyze → Descriptive Statistics → Crosstabs

Interpretation
H0: There is no significant relationship between gender and
buying behaviour towards perfumes.

H1: There is a significant relationship between gender and
buying behaviour towards perfumes.

At the 5% significance level, using the p value approach, our chi
square asymptotic significance is 0.005 < 0.05, hence we
reject the null hypothesis.

Conclusion: There is a significant relationship between
gender and buying behaviour towards perfumes.

Assumptions of Chi-Square Independence Test

• Independent observations.
• For a 2 by 2 table, all expected frequencies > 5. For a
larger table, no more than 20% of all cells may have an
expected frequency < 5 and all expected frequencies > 1.
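
The calculation that Crosstabs performs can be sketched by hand. The slides do not list the cell counts for the perfume example, so the counts below are hypothetical; the closed-form p value used here holds only for a 2 by 2 table (1 degree of freedom).

```python
import math

# Hypothetical counts (rows: male, female; columns: buys = yes, no).
# These numbers are made up for illustration, not taken from the slides.
observed = [[40, 60],
            [60, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Assumption check for a 2 by 2 table: all expected frequencies > 5.
assumption_ok = all(e > 5 for row in expected for e in row)

# Pearson chi-square: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail p value; this closed form is valid for df = 1 only.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

With these made-up counts every expected frequency is 50, so the assumption above is met.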

Chi Square Goodness of Fit test

• The data in chi square is often in terms of counts or
frequencies.

• We have to compute the expected frequencies of the
occurrence of certain events.

• A goodness of fit test is a statistical test of how well the
observed data support the assumption about the
distribution of a population.

Chi Square Goodness of Fit test

• It is a useful test to compare a theoretical model to
observed data.

• This is a more general chi square test.

• Consider a standard packet of Milk chocolate Gems.
There are six different colours: Red, Orange, Yellow,
Green, Blue and Brown.

• Suppose that we are curious about the distribution of
these colours and ask, do all six colours occur in equal
proportion?

• This is the type of question that can be answered with a
goodness of fit test.

Actual and expected counts

• Suppose that we have a random sample of 600 Gems
candies distributed in this manner: 212 blue, 147
orange, 103 green, 50 red, 46 yellow and 42 brown.

• If the null hypothesis were true, then the expected count
for each of these colours would be 600 × 1/6 = 100.

Analyze → Nonparametric Tests → Legacy Dialogs → Chi-square

Note: "All categories equal" is selected because our
expected value is that all Gems colours occur equally often.

• Suppose we have different expected values for each
colour; then we can specify them (in the same order as
entered in Data View).

Interpretation
• H0: All colours occur in the same proportion
• p1 = p2 = p3 = p4 = p5 = p6
• H1: At least one of the population proportions is not equal
to another.

At the 5% significance level, using the p value approach, our chi
square asymptotic significance is 0.000 < 0.05, hence we
reject the null hypothesis.

We conclude that Gems are not evenly distributed among
the six different colours.
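
The statistic behind this conclusion can be reproduced from the counts given in the example. This is a minimal sketch, not the SPSS procedure itself; instead of a p value it compares the statistic with the standard table value of chi-square for 5 degrees of freedom at the 5% level (11.070).

```python
# Observed counts from the Gems example (600 candies in total).
observed = {"blue": 212, "orange": 147, "green": 103,
            "red": 50, "yellow": 46, "brown": 42}

n = sum(observed.values())
expected = n / len(observed)  # equal proportions under H0: 600 / 6 = 100

# Chi-square statistic: sum of (O - E)^2 / E over the six colours.
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1  # categories minus one

# Table value of chi-square for df = 5 at the 5% significance level.
CRITICAL_5PCT_DF5 = 11.070

print(f"chi-square = {chi_square:.2f}, df = {df}")
print("reject H0" if chi_square > CRITICAL_5PCT_DF5 else "fail to reject H0")
```

The statistic is far above the table value, which agrees with the reported significance of 0.000.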

Karl Pearson's Correlation coefficient

• Correlation measures the degree of association between two variables.

Three types of correlation:

Positive correlation: When two variables X & Y move in the same direction.
When one variable increases, the other variable also increases, and vice versa.

Negative correlation: When two variables X & Y move in opposite
directions. If one variable increases, the other variable decreases.

Zero correlation: When the variables move with no connection to each other. If
one variable increases, the other may increase or decrease.

Pearson's r takes a value between -1 and +1 (both values included).
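
The three cases can be checked with a small hand-written function. This is a minimal sketch with made-up numbers; `pearson_r` is a helper defined here for illustration, not an SPSS function.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, for illustration only.
x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # Y rises with X
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # Y falls as X rises
r_zero = pearson_r(x, [2, 1, 0, 1, 2])       # V-shaped: no linear trend

print(round(r_positive, 3), round(r_negative, 3), round(r_zero, 3))
```

In the third case Y clearly depends on X, yet r is zero because the relationship is not linear.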

Assumptions
1. For a Pearson correlation, each variable should be continuous. If one or both of
the variables are ordinal in measurement, then a Spearman correlation could be
conducted instead.

2. Each participant or observation should have a pair of values. So if the correlation
was between weight and height, then each observation used should have both a
weight and a height value.

3. Absence of outliers: there should be no outliers in either variable. Having an outlier
can skew the results of the correlation by pulling the line of best fit formed by the
correlation too far in one direction or another.

4. There needs to be linearity between the variables. (Linearity is the property of a
mathematical relationship or function which means that it can be graphically
represented as a straight line.)

5. Normality: our 2 variables must follow a bivariate normal distribution in our
population. This assumption is not needed for sample sizes of N = 25 or more.

• We have the height and weight of 5 males and 5 females.
At the 5% level of significance, find if there is a correlation
between their height and weight.

Analyze → Correlate → Bivariate

Interpretation
• H0: There is no correlation between Height and Weight
• H1: There is a significant correlation between Height and
Weight

At the 5% significance level, using the p value approach, our
Pearson's correlation coefficient is not significant, as its p value is
0.263 > 0.05, hence we fail to reject the null hypothesis.

This also means that the correlation value is not a significant
indicator of any relation between height and weight.

We conclude that there is no significant relationship
between the height and weight of the given males and females.

Note:
• Our sample size in the previous question is very small
(n=10). We are more likely to get significant results with
larger samples.

• A significant correlation value will show an asterisk (*) next
to it.

Chart builder

• Graphs → Chart Builder

Regression Analysis
• One of the problems with Karl Pearson's formula of
correlation is that it is applicable only when the
relationship between the two variables is linear.

• However, there can be situations when the variables are
connected by a non-linear relationship.

• Zero correlation and independence of variables are
separate things. Zero correlation does not mean that the
variables are not related. They may be non-linearly related.

• Another problem with correlation is that it does not tell
which variable is influencing which one.

• In regression analysis it is assumed that there is a variable
that is influencing another variable.

• For example: Y = f(X)

• This indicates that the values of Y depend upon the
values of X. Variable Y is called the dependent variable,
whereas X will be the independent variable.
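
When f is a straight line, Y = a + bX, the intercept a and slope b can be estimated by least squares. The sketch below uses made-up data and a helper named `fit_line`, both of which are assumptions for illustration rather than part of the slides.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for the line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: X is the independent variable, Y the dependent variable.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(x, y)
print(f"Y = {intercept:.2f} + {slope:.2f} X")

# Predicted Y for a new value of X.
predicted = intercept + slope * 6
```

Unlike correlation, the fitted equation treats the two variables differently: swapping X and Y gives a different line.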

Assumptions
• Normality.

• Linearity: There needs to be a linear relationship between the two variables. Whilst there are a number of ways to check
whether a linear relationship exists between your two variables, we suggest creating a scatterplot using SPSS Statistics
where you can plot the dependent variable against your independent variable and then visually inspect the scatterplot to
check for linearity.

• Homoscedasticity: Your data need to show homoscedasticity, which is where the variances along the line of best fit
remain similar as you move along the line.

• Absence of multicollinearity.

• Your two variables should be measured at the continuous level (i.e., they are either interval or ratio variables).

• There should be no significant outliers.

• Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed.

Discriminant Analysis
• This is used to predict group membership.
• This technique is used to classify individuals/objects into
one of the alternative groups on the basis of a set of
predictor variables.

• Discriminant analysis is a technique that is used
by the researcher to analyze the research data
when the criterion or the dependent variable is
categorical and the predictor or the independent
variable is interval in nature.

When the dependent variable has two groups, we
have two-group discriminant analysis, and when there are
more than two groups it is a case of multiple
discriminant analysis.

Objectives of discriminant analysis
• To find a linear combination of variables that discriminates
between categories of the dependent variable in the best
possible manner.
• To find out which independent variables are relatively
better at discriminating between groups.
• To find the statistical significance of the discriminant
function and whether any statistical difference exists
among groups in terms of predictor variables.
• To develop the procedure for assigning new
objects, firms or individuals, whose profile but not
group identity is known, to one of the two groups.
• To evaluate the accuracy of classification, i.e. the
percentage of customers that it is able to classify
correctly.

Examples

• In case one wants to assess people who believe that
corporate governance is the responsibility of policy
makers against those who think it needs to be self-driven
or individual-centric, one may generate a number of
statements and then conduct a pilot study and select only
those statements on which the two groups differ
significantly.

Y = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

Y = Dependent variable

b = Coefficients of the independent variables (also called discriminant coefficients)

X = Predictor or independent variable

Note: Y should be a categorical variable and X should be continuous.

These b's are chosen to maximize the distance between the group means of the criterion (dependent)
variable.

This function is similar to a regression equation or function.


What are we looking for?

• Which set of features can best determine the group
membership of the object?
• What is the classification rule to best separate the
groups?

• For any new data point that we want to classify into one of the groups, a
decision rule is formulated.
• For this purpose we determine the cut-off score, which in a two-group
discriminant analysis is usually the midpoint of the mean discriminant scores
of the two groups, provided the sample sizes of the groups are the same.
• The accuracy of classification is determined by using a classification
matrix (also called confusion matrix).
• The relative importance of the independent variables can be determined
from the standardized discriminant function coefficients and the structure
matrix.
• The difference between the standardized and unstandardized discriminant
function is that the unstandardized discriminant function has a
constant term, whereas the standardized discriminant function has no
constant term.

What you want this function to do is maximize the distance
between the categories, i.e. come up with an equation that
has strong discriminatory power between groups. After
using an existing set of data to calculate the discriminant
function and classify cases, any new cases can then be
classified. The number of discriminant functions is one less
than the number of groups. There is only one function for the
basic two-group discriminant analysis.

Assumptions of discriminant analysis

1. Sample size: The sample size of the smallest group needs to exceed the
number of predictor variables. The maximum number of independent
variables is n - 2, where n is the sample size. It is best to have 4 or 5 times as
many observations as independent variables.
2. Normal distribution: It is assumed that the data (for the variables) represent
a sample from a multivariate normal distribution.
3. Homogeneity of variances/covariances: DA is very sensitive to
heterogeneity of variance-covariance matrices.
4. Outliers: DA is highly sensitive to the inclusion of outliers.

Statistics associated with Discriminant Analysis

1. Canonical correlation: It is a measure of association
between the single discriminant function and the set of
dummy variables that define the group membership.
2. Centroid: The mean of the discriminant scores for a particular group.
3. Classification matrix: It contains the number of correctly
classified and misclassified cases. The correctly classified
cases appear on the diagonal, because the predicted and
actual groups are the same. The off-diagonal elements
represent cases that have been incorrectly classified. The
sum of the diagonal elements divided by the total number
of cases represents the hit ratio.

Statistics associated with Discriminant Analysis

4. Discriminant function coefficients: These are the
multipliers of the variables.
5. Discriminant scores: The unstandardized coefficients are
multiplied by the values of the variables. These products
are summed and added to the constant term to obtain
the discriminant scores.
6. Eigenvalue: For each discriminant function, the eigenvalue is
the ratio of between-group to within-group sum of
squares. Large eigenvalues imply superior functions.

Steps of discriminant analysis

1. Formulate the problem
2. Estimate the discriminant function coefficients
3. Determine the significance of the discriminant function
4. Interpret the results
5. Assess the validity of discriminant analysis

Step 1: Formulate the problem

1. First we need to identify the objectives, the dependent variable and the
independent variables.
2. The dependent variable must consist of two or more mutually exclusive and
collectively exhaustive categories.
3. When the dependent variable is on an interval or ratio scale, it must first be
converted into categories. E.g. Unfavourable (1,2,3), Neutral (4), Favourable
(5,6,7) on a 7-point scale.
4. The predictor variables should be selected based on a theoretical model or
previous research or by the experience of the researcher.

Formulate the problem
5. The next step is to divide the sample into two parts. One
part of the sample, called the estimation or analysis
sample, is used for estimation of the discriminant
function. The other part, called the holdout or validation
sample, is reserved for validating the discriminant
function.
6. When the sample is large enough, it can be split in
half. One half serves as the analysis sample and the other
is used for validation.
7. The roles of the halves are then interchanged and the
analysis is repeated. This is called cross validation.
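
Points 5 to 7 can be sketched as follows. The case IDs, the sample size of 40 and the helper `split_sample` are assumptions for illustration; the estimation of the discriminant function itself is left as a comment.

```python
import random

def split_sample(case_ids, analysis_size, seed=0):
    """Randomly split cases into an analysis part and a holdout part."""
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    return shuffled[:analysis_size], shuffled[analysis_size:]

# Hypothetical sample of 40 cases, split in half.
analysis, holdout = split_sample(range(1, 41), analysis_size=20)

# Cross validation: run once as split, then interchange the two halves.
runs = [(analysis, holdout), (holdout, analysis)]
for estimation_part, validation_part in runs:
    # Estimate the discriminant function on estimation_part here,
    # then check how well it classifies validation_part.
    print(len(estimation_part), "cases for estimation,",
          len(validation_part), "for validation")
```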

Step 2: Estimate the discriminant function coefficients

We have two approaches for this:

• Direct method: This involves estimating the discriminant function so that all
the predictors are included simultaneously. In this case, each independent
variable is included regardless of its discriminating power. This method is
appropriate when, based on previous research or a theoretical model, the
researcher wants the discrimination to be based on all the predictors.
• Stepwise discriminant analysis: The predictor variables are
entered sequentially, based on their ability to
discriminate among groups. This method is appropriate
when the researcher wants to select a subset of
predictors for inclusion in the discriminant function.

Question

Suppose we want to determine the salient characteristics of
families that have visited a vacation resort during the last
two years. Data were obtained from a pretest sample of 42
households. Of these, 30 households were included in the
analysis sample and the remaining 12 were part of the validation
sample. Households that visited a resort during the last
two years are coded as 1 (visited), the others as 2 (not visited). The following
data were used for the test:

Columns: Visit = Resort Visit; Income = Annual Family Income; Travel = Attitude toward Travel;
Vacation = Importance attached to family vacation; HSize = Household size; Age = Age of head of household

Visit  Income  Travel  Vacation  HSize  Age
1      50.2    5       8         3      43
1      70.3    6       7         4      61
1      62.9    7       5         6      52
1      48.5    7       5         5      36
1      52.7    6       6         4      55
1      75      8       7         5      68
1      46.2    5       3         3      62
1      57      2       4         6      51
1      64.1    7       5         4      57
1      68.1    7       6         5      45
1      73.4    6       7         5      44
1      71.9    5       8         4      64
1      56.2    1       8         6      54
1      49.3    4       2         3      56
1      62      5       6         2      58
2      32.1    5       4         3      58
2      36.2    4       3         2      55
2      43.2    2       5         2      57
2      50.4    5       2         4      37
2      44.1    6       6         3      42
2      38.3    6       6         2      45
2      55      1       2         2      57
2      46.1    3       5         3      51
2      35      6       4         5      64
2      37.3    2       7         4      54
2      41.8    5       1         3      56
2      57      8       3         2      36
2      33.4    6       8         2      50
2      37.5    3       2         3      48
2      41.3    3       3         2      42

Analyze → Classify → Discriminant

Some intuitive feel for the results may be obtained by examining the group means and standard
deviations. It appears that the two groups are more widely separated in terms of income than other
variables. There seems to be more of a separation on the importance attached to family vacation
than on attitude toward travel. The difference between the two groups on age of the head of the
household is small, and the standard deviation of this variable is large.

The above table helps us understand the differences in the means of the predictor variables. SPSS conducts an ANOVA for each of them. For this purpose SPSS
treats the predictor variables as dependent variables and takes the VISIT variable as the grouping (independent) variable.
In the above table we can see that there is a significant difference in the means of Income, for which the p value is .000, which is less than 0.05. The same
goes for vacation and household size.
There does not seem to be any significant difference in the means of the remaining two variables.

The pooled within-groups matrix indicates low correlations between the predictors. Multicollinearity is unlikely to
be a problem. It is very important to analyse this matrix to detect multicollinearity (a high
correlation between pairs of predictor variables). If the correlation coefficient between any pair
of predictor variables is greater than 0.75, it indicates that the two variables in that pair share a large
amount of common variance and might reflect the same attribute. Under such circumstances, one of the
two variables could be eliminated from further analysis.

Eigenvalue: For each discriminant function, the eigenvalue is the ratio of between-group to within-group sum of squares. Large eigenvalues imply
superior functions.

What is Sum of Squares?

Rachel is a nurse at City Hospital and is closely monitoring two patients. Their doctors have asked Rachel to check the patients' blood oxygen
concentration levels every hour to make sure they don't vary too much from hour to hour. How will Rachel be able to tell if the oxygen
concentration of these patients changes too much? One way that this could be done is by calculating the sum of squares of the data.
In statistics, the sum of squares measures how far individual measurements are from the mean. To calculate the sum of squares, subtract each
measurement from the mean, square the difference, and then add up (sum) all the resulting measurements.
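
The sum of squares, and the eigenvalue as a ratio of two sums of squares, can be sketched as follows. All readings and discriminant scores below are made up for illustration; `sum_of_squares` is a helper defined here.

```python
def sum_of_squares(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

# Hypothetical hourly oxygen readings for one patient.
readings = [95, 97, 96, 98, 94]
ss_readings = sum_of_squares(readings)  # deviations -1, 1, 0, 2, -2

# Hypothetical discriminant scores for two groups of three cases each.
group_a = [1.0, 1.5, 2.0]
group_b = [-1.0, -1.5, -2.0]
all_scores = group_a + group_b
grand_mean = sum(all_scores) / len(all_scores)

# Within-group: spread of the scores around their own group mean.
ss_within = sum_of_squares(group_a) + sum_of_squares(group_b)

# Between-group: spread of the group means around the grand mean,
# weighted by group size.
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in (group_a, group_b)
)

eigenvalue = ss_between / ss_within
print(ss_readings, ss_between, ss_within, eigenvalue)
```

A large ratio means the groups lie far apart relative to the spread inside each group.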

The larger the eigenvalue, the more variance the function explains in the dependent variable.

Canonical correlation: It is the simple correlation between the discriminant scores and their corresponding group membership (visited, not visited).
The square of the canonical correlation is 0.6410, which means that about 64% of the variance in the dependent variable (visit) is explained by the
discriminant model.

It would not be meaningful to interpret the analysis if the discriminant functions estimated were not statistically significant.
Wilks' lambda takes a value between 0 and 1, and the lower the value of Wilks' lambda, the higher the significance of the
discriminant function. Therefore 0 is the most preferred value. Since the p value is less than 0.05, it is inferred that the discriminant
function is significant and can be used for further interpretation.

This table shows the relative contributions of the predictor variables in discriminating between the groups of the dependent variable (visit).
From the above table we can see that income is the best predictor and travel is the worst predictor.
These structural coefficients are obtained by computing the correlation between the discriminant score and each of the independent variables.
These are also called discriminant loadings.

The structure matrix table shows the correlations of each variable with each discriminant function.

We can see that the discriminant function has the highest correlation with Income (which is in sync with the
information from the previous table).

These are the unstandardized discriminant function coefficients. These can be applied to the raw values of the variables in
the holdout set for classification purposes.

They are used to construct the actual prediction equation, which can be used to classify new cases.
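
Applying the prediction equation to one household can be sketched as below. The constant and coefficients are hypothetical placeholders, since the SPSS coefficient table is not reproduced in these notes; the household values are the first row of the data table.

```python
# Hypothetical unstandardized coefficients, for illustration only.
# The real values must be read from the SPSS output table.
constant = -7.0
coefficients = {
    "income": 0.08,
    "travel": 0.10,
    "vacation": 0.15,
    "hsize": 0.40,
    "age": 0.02,
}

# First household in the data table (visit = 1).
household = {"income": 50.2, "travel": 5, "vacation": 8, "hsize": 3, "age": 43}

# Discriminant score: constant plus the sum of coefficient * raw value.
score = constant + sum(coefficients[name] * household[name]
                       for name in coefficients)
print(f"discriminant score = {score:.3f}")
```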

Centroids are the mean discriminant scores for each group. This table is used to establish the cutting point for classifying cases. If
the two groups are of equal size, the best cutting point is halfway between the values of the functions at group centroids (that is, the
average). If the groups are unequal, the optimal cutting point is the weighted average of the two values.

In the above case, the cutoff point will be (1.291 + (-1.291))/2 = 0.

Now any respondent whose discriminant score is greater
than zero would be classified as a prospective visitor,
whereas one with a score less than zero would be classified as a
non-visitor.

Group centroids: Visited (+1.291), Not visited (-1.291); cutoff at zero.
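
The decision rule for the equal-sized groups in this example can be sketched as follows. The centroids come from the text above; the function name `classify` and the handling of a score exactly at the cutoff are assumptions.

```python
# Group centroids reported above.
VISITED_CENTROID = 1.291
NOT_VISITED_CENTROID = -1.291

# Equal group sizes, so the cutoff is the simple midpoint of the centroids.
cutoff = (VISITED_CENTROID + NOT_VISITED_CENTROID) / 2

def classify(score, cutoff=cutoff):
    """Assign a case to a group by comparing its score with the cutoff.

    A score exactly equal to the cutoff is treated as 'not visited' here;
    that tie rule is an arbitrary choice for this sketch.
    """
    return "visited" if score > cutoff else "not visited"

# Hypothetical discriminant scores for two new respondents.
print(classify(0.8))
print(classify(-0.3))
```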

The results are shown in the above table after the discriminant scores are calculated. This table is also called the confusion matrix
or classification table.

Hit ratio = No. of correct predictions / Total number of cases

Our hit ratio is 90%: 27 out of 30 cases are correctly predicted.
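
The hit ratio is the diagonal of the classification matrix divided by the total. In the sketch below only the totals (27 correct out of 30) come from the text; the split across the individual cells is assumed for illustration.

```python
# Rows: actual group (visited, not visited).
# Columns: predicted group (visited, not visited).
# The cell split is assumed; only 27 correct out of 30 is from the text.
confusion = [[12, 3],
             [0, 15]]

correct = sum(confusion[i][i] for i in range(len(confusion)))  # diagonal
total = sum(sum(row) for row in confusion)
hit_ratio = correct / total

print(f"{correct} of {total} correct, hit ratio = {hit_ratio:.0%}")
```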

After this, the Data View will have the new predicted
group membership and discriminant scores.

Factor analysis
• In ANOVA, regression and discriminant analysis, one of the
variables is clearly identified as the dependent variable. Factor
analysis is a procedure in which variables are not classified as
independent or dependent. Instead the whole set of
interdependent relationships among variables is examined.

• It is a multivariate statistical technique in which there is no
distinction between dependent and independent variables.

• It is a data reduction method.

• It is very useful for reducing a large number of variables (resulting in
complexity) to a few manageable factors.

Statistics associated with factor analysis
1. Bartlett's test of sphericity - This is a test statistic used to examine the hypothesis that the variables are
uncorrelated in the population.
2. Correlation matrix - This shows the simple correlations between all possible pairs of variables included in the
analysis.
3. Communality - This is the amount of variance a variable shares with all the other variables being considered.
4. Eigenvalue - Total variance explained by each factor.
5. Factor loading - Simple correlations between the variables and the factors.
6. Factor loading plot - This is a plot of the original variables using the factor loadings as the coordinates.
7. Factor matrix - It contains the factor loadings of all the variables on all the factors extracted.
8. Factor scores - Composite scores computed for each respondent on the extracted factors.
9. Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy - Index used to examine the appropriateness of factor
analysis.
10. Percentage of variance - Percentage of the total variance attributed to each factor.
11. Residuals
12. Scree plot - A plot of the eigenvalues against the number of factors in order of extraction.

Steps
• Formulate the problem

• Construct the correlation matrix

• Determine the method of factor analysis

• Rotate the factors

• Interpret the factors

• Calculate the factor scores

• Select the surrogate variables

• Determine the model fit


Formulate the problem

• The objective should be identified.

• Variables to be included should be specified based on
past research, theory or the judgement of the researcher.

• Variables need to be measured on an interval or ratio scale.

• An appropriate sample size should be used.

• As a rough guideline, there should be at least 4 or 5 times as
many observations as there are variables.

Example
• Suppose the researcher wants to determine the underlying
benefits consumers seek from the purchase of a toothpaste. A
sample of 30 respondents was interviewed using mall intercept
interviewing. The respondents were asked to indicate their degree
of agreement with the following statements using a 7-point scale
(1=strongly disagree, 7=strongly agree)

V1: It is important to buy a toothpaste that prevents cavities
V2: I like a toothpaste that gives shiny teeth
V3: A toothpaste should strengthen your gums
V4: I prefer a toothpaste that freshens breath
V5: Prevention of tooth decay is not an important benefit offered by a
toothpaste
V6: The most important consideration in buying a toothpaste is
attractive teeth.

V1 V2 V3 V4 V5 V6
7 3 6 4 2 4
1 3 2 4 5 4
6 2 7 4 1 3
4 5 4 6 2 5
1 2 2 3 6 2
6 3 6 4 2 4
5 3 6 3 4 3
6 4 7 4 1 4
3 4 2 3 6 3
2 6 2 6 7 6
6 4 7 3 2 3
2 3 1 4 5 4
7 2 6 4 1 3
4 6 4 5 3 6
1 3 2 2 6 4
6 4 6 3 3 4
5 3 6 3 3 4
7 3 7 4 1 4
2 4 3 3 6 3
3 5 3 6 4 6
1 3 2 3 5 3
5 4 5 4 2 4
2 2 1 5 4 4
4 6 4 6 4 7
6 5 4 2 1 4
3 5 4 6 4 7
4 4 7 2 2 5
3 7 2 6 4 3
4 6 3 7 2 7
2 3 2 4 7 2

• Analyze → Dimension Reduction → Factor

Construct the correlation matrix

• For factor analysis to be appropriate, the variables must be correlated.

• If the correlations between all the variables are small, then factor analysis may not be appropriate.

• A correlation matrix of the variables can be computed and tested for its statistical significance. The
hypotheses to be tested may be written as:

• H0: The correlation matrix is insignificant, i.e. the correlation matrix is an identity matrix, where the diagonal
elements are one and the off-diagonal elements are zero.

• H1: The correlation matrix is significant.

• The test is carried out by using Bartlett's test of sphericity. The significance of the correlation matrix
ensures that a factor analysis can be carried out.

• Another condition that needs to be fulfilled before a factor analysis can be carried out concerns the value of
KMO, which takes a value between 0 and 1. For the application of factor analysis, the value of the KMO
statistic should be greater than 0.5.

There are relatively high correlations among V1 (prevention of cavities), V3 (strong gums) and V5 (prevention of tooth decay). We
would expect these variables to correlate with the same set of factors.

Likewise there are relatively high correlations among V2 (shiny teeth), V4 (fresh breath) and V6 (attractive teeth). These variables
may also be expected to correlate with the same factors.
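
The correlation matrix that SPSS reports can be recomputed from the 30 responses in the data table above. This is a minimal sketch in plain Python; `pearson_r` is a helper defined here, not an SPSS function.

```python
import math

# Responses V1..V6 for the 30 respondents, copied from the data table above.
data = [
    (7, 3, 6, 4, 2, 4), (1, 3, 2, 4, 5, 4), (6, 2, 7, 4, 1, 3),
    (4, 5, 4, 6, 2, 5), (1, 2, 2, 3, 6, 2), (6, 3, 6, 4, 2, 4),
    (5, 3, 6, 3, 4, 3), (6, 4, 7, 4, 1, 4), (3, 4, 2, 3, 6, 3),
    (2, 6, 2, 6, 7, 6), (6, 4, 7, 3, 2, 3), (2, 3, 1, 4, 5, 4),
    (7, 2, 6, 4, 1, 3), (4, 6, 4, 5, 3, 6), (1, 3, 2, 2, 6, 4),
    (6, 4, 6, 3, 3, 4), (5, 3, 6, 3, 3, 4), (7, 3, 7, 4, 1, 4),
    (2, 4, 3, 3, 6, 3), (3, 5, 3, 6, 4, 6), (1, 3, 2, 3, 5, 3),
    (5, 4, 5, 4, 2, 4), (2, 2, 1, 5, 4, 4), (4, 6, 4, 6, 4, 7),
    (6, 5, 4, 2, 1, 4), (3, 5, 4, 6, 4, 7), (4, 4, 7, 2, 2, 5),
    (3, 7, 2, 6, 4, 3), (4, 6, 3, 7, 2, 7), (2, 3, 2, 4, 7, 2),
]

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

columns = list(zip(*data))  # one tuple of 30 values per variable
corr = [[pearson_r(a, b) for b in columns] for a in columns]

# Print the 6 x 6 matrix, one row per variable.
for name, row in zip(("V1", "V2", "V3", "V4", "V5", "V6"), corr):
    print(name, " ".join(f"{r:6.3f}" for r in row))
```

V5 is worded negatively, so it correlates negatively with V1 and V3 even though all three describe the same benefit.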

The null hypothesis that the population correlation matrix is an identity matrix is
rejected by Bartlett's test of sphericity.

The value of KMO is also greater than 0.5.

Thus factor analysis may be considered an appropriate technique for the analysis.

• Once it has been determined that factor analysis is an
appropriate technique for analyzing the data, an
appropriate method must be selected.

• The two basic approaches are principal components
analysis (in syllabus) and common factor analysis.

The extraction value (communality) tells us the amount of variance in each variable that is explained by the extracted factors.
We can see that the extraction values are mostly high, which is a good indicator for the research.

Determine the number of factors

Different methods:

• Prior knowledge: Based on prior knowledge, the researcher may specify how many factors are to be extracted.

• Determination based on eigenvalues: In this method, only factors with eigenvalues > 1 are
retained. As the eigenvalue represents the amount of variance associated with the factor,
only factors with variance > 1 are included. Factors with variance < 1 are no better than a
single variable because, due to standardization, each variable has a variance of 1.

• Determination based on scree plot: Typically the plot has a distinct break between the steep
slope of factors with large eigenvalues and a gradual trailing off associated with the rest of
the factors. This gradual trailing off is known as the scree.

• Determination on the basis of percentage of variance: In this approach it is recommended
that the factors extracted should account for at least 60% of the variance.
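
The eigenvalue rule and the percentage of variance rule can be sketched together. The eigenvalues below are hypothetical (they are not the SPSS output for the toothpaste data); they are chosen to add up to 6, the number of standardized variables.

```python
# Hypothetical eigenvalues for 6 standardized variables, largest first.
eigenvalues = [2.7, 2.2, 0.5, 0.3, 0.2, 0.1]

# Eigenvalue rule: retain only factors with an eigenvalue greater than 1.
retained = [ev for ev in eigenvalues if ev > 1]

# Percentage of total variance attributed to each factor.
total_variance = sum(eigenvalues)
percent = [100 * ev / total_variance for ev in eigenvalues]

# Cumulative percentage explained by the retained factors.
cumulative_retained = sum(percent[:len(retained)])

print(f"{len(retained)} factors retained, "
      f"explaining {cumulative_retained:.2f}% of the variance")
print("meets the 60% guideline" if cumulative_retained >= 60
      else "below the 60% guideline")
```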

From the above table we can see that SPSS extracted two
factors or components from the 6 variables (the components that have an eigenvalue > 1).
These two factors explain 82.48% of the variance.

We can see that the 2 factors that had an eigenvalue > 1 were extracted and the other potential
factors that had an eigenvalue < 1 were not extracted.

The correlation coefficient between a factor score and a
variable included in the study is called a factor loading. The
component matrix is also called the factor matrix.

We can see that V1, V3 and V5 load on factor 1.

Rotate factors
• An important output from factor analysis is the factor matrix, also called the factor pattern matrix. The factor
matrix contains the coefficients used to express the standardized variables in terms of the factors. These
coefficients (factor loadings) represent the correlations between the factors and the variables.

• A coefficient with a large absolute value indicates that the factor and the variable are closely related.

• Although the initial (unrotated) factor matrix indicates the relationship between the factors and individual
variables, it seldom results in factors that can be interpreted, because the factors are correlated with many variables.

• For example, factor 1 is at least somewhat correlated with 5 of the 6 variables (absolute value of the factor loading is
greater than 0.3).

• Likewise factor 2 is at least somewhat correlated with 4 out of the 6 variables.

• Moreover, variables 2, 4 and 5 load at least somewhat on both factors.

Rotate factors
• In such a complex matrix it is difficult to interpret the
factors.

• Therefore, through rotation the factor matrix is
transformed into a simpler one that is easier to interpret.

• Rotation: We would like each factor to have non-zero, or
significant, loadings for only some of the variables and
vice versa.

• Rotation does not affect the communalities and the
percentage of total variance explained.

• The variance accounted for by each individual factor
does change.

Interpret factors
• In the rotated factor matrix, factor 1 has high coefficients
for V1 (prevention of cavities) and V3 (strong gums) and a
negative coefficient for V5 (prevention of tooth decay is
not important).

• Therefore this factor may be labeled as a health benefit
factor.

• Factor 2 is highly correlated with variables V2 (shiny teeth),
V4 (fresh breath) and V6 (attractive teeth).

• Thus factor 2 may be labeled a social benefit factor.


Calculate factor scores
• Factor analysis has its own stand-alone value. However, if
the goal is to reduce the original set of variables to a
smaller set of composite variables (factors) for use in
subsequent multivariate analysis, it is useful to compute
factor scores for each respondent.

• A factor is simply a linear combination of the original
variables.

• Fi = Wi1X1 + Wi2X2 + Wi3X3 + ... + WikXk

• The weights, or factor score coefficients, are obtained
from the factor score coefficient matrix.
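
The linear combination above can be sketched for one respondent. The weights are hypothetical stand-ins for a row of the factor score coefficient matrix; SPSS applies its weights to standardized variables, whereas this sketch simply takes the X values as given (here the first row of the toothpaste data).

```python
# Hypothetical weights Wi1..Wi6 for factor 1, for illustration only.
weights = [0.35, 0.0, 0.35, 0.0, -0.35, 0.05]

# Values X1..X6 for one respondent (first row of the toothpaste data).
respondent = [7, 3, 6, 4, 2, 4]

# Fi = Wi1*X1 + Wi2*X2 + ... + Wik*Xk
factor_score = sum(w * x for w, x in zip(weights, respondent))
print(f"factor 1 score = {factor_score:.2f}")
```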

• The factor scores can be used instead of the original
variables in subsequent multivariate analysis.

Cluster Analysis

1. Metric data - Interval or ratio scale: The statistical
assessment of the distance between two objects can be
done by calculating the Euclidean distance between
them.
2. Non-metric data - The task of handling data on
non-metric scales, i.e. those placed on the nominal or ordinal
scale, is done with a measure called the 'matching coefficient'.
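
Both measures can be sketched in a few lines. The objects and attribute values are made up, and the simple matching coefficient shown (share of attributes on which two objects agree) is one common form of matching coefficient, assumed here because the slide does not give a formula.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two objects measured on metric variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def simple_matching_coefficient(a, b):
    """Share of attributes on which two objects fall in the same category."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Hypothetical metric data: two respondents rated on three 7-point items.
distance = euclidean_distance([7, 3, 6], [4, 7, 6])

# Hypothetical non-metric data: three nominal attributes per respondent.
similarity = simple_matching_coefficient(
    ["yes", "urban", "owner"],
    ["yes", "rural", "owner"],
)

print(f"distance = {distance:.1f}, matching coefficient = {similarity:.2f}")
```

A small distance or a high matching coefficient both indicate that two objects are similar and likely to fall in the same cluster.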

• Like factor analysis, cluster analysis examines an entire
set of interdependent relationships.

• Cluster analysis makes no distinction between dependent
and independent variables.

• The primary objective of cluster analysis is to classify
objects into relatively homogeneous groups based on the
set of variables considered.

• Objects in a group are relatively similar in terms of these
variables and different from objects in other groups.

• E.g. it can be used to identify potential customer segments
that could generate additional sales.
Inaksh
• Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters.

• Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.

• Cluster analysis is also called classification analysis.
• Both discriminant analysis and cluster analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each case. In contrast, in cluster analysis there is no prior information about the group or cluster membership.
Statistics associated with cluster analysis

• Agglomeration schedule: Gives information on the objects or cases combined at each stage of a hierarchical clustering process.

• Cluster centroid: The mean values of the variables for all the cases in a particular cluster.

• Cluster membership: The cluster to which each case belongs.

• Dendrogram: Also called a tree graph, it is a graphical device for displaying clustering results. The dendrogram is read from left to right.

• Distances between clusters: These distances indicate how separated the individual pairs of clusters are.

• Icicle diagram: A graphical display of clustering results, so called because it resembles a row of icicles hanging from the eaves of a house.

• Similarity/distance coefficient matrix: A matrix of the pairwise similarities or distances between objects or cases.
Steps
• Formulate the problem

• Select a distance measure

• Select a clustering procedure

• Decide on the number of clusters

• Interpret and profile clusters

• Assess the validity of clustering


Formulate the problem

• Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution.

• The set of variables selected should describe the similarity between objects in terms that are relevant to the marketing research problem.

• The variables should be selected on the basis of past research, theory or the judgement of the researcher.
Example
• We consider a clustering of consumers based on attitudes toward shopping. Based on past research, six attitudinal variables were identified. Consumers were asked to express their degree of agreement with the following statements on a 7-point scale (1 = Strongly disagree, 7 = Strongly agree). Data were obtained from a pretest sample of 20 respondents. (Note: In practice, clustering is done on much larger samples of 100 or more.)

V1: Shopping is fun.
V2: Shopping is bad for your budget.
V3: I combine shopping with eating out.
V4: I try to get the best buys when shopping.
V5: I don’t care about shopping.
V6: You can save a lot of money by comparing prices.
Case  V1  V2  V3  V4  V5  V6
 1     6   4   7   3   2   3
 2     2   3   1   4   5   4
 3     7   2   6   4   1   3
 4     4   6   4   5   3   6
 5     1   3   2   2   6   4
 6     6   4   6   3   3   4
 7     5   3   6   3   3   4
 8     7   3   7   4   1   4
 9     2   4   3   3   6   3
10     3   5   3   6   4   6
11     1   3   2   3   5   3
12     5   4   5   4   2   4
13     2   2   1   5   4   4
14     4   6   4   6   4   7
15     6   5   4   2   1   4
16     3   5   4   6   4   7
17     4   4   7   2   2   5
18     3   7   2   6   4   3
19     4   6   3   7   2   7
20     2   3   2   4   7   2
Select a distance measure
• The most common approach is to measure similarity in terms of distance between pairs of objects.

• Objects with smaller distances between them are more similar to each other than are those at larger distances.

• There are several ways to compute the distance. The most commonly used measure is the Euclidean distance (in syllabus).

• Euclidean distance is the square root of the sum of the squared differences in values for each variable.

• Other measures are the city block (Manhattan) distance and the Chebychev distance.
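
The three distance measures can be sketched in Python as follows, applied to the first two respondents of the shopping example. The city block and Chebychev definitions used here (sum of absolute differences and largest absolute difference) are the standard ones and are not spelled out in the slides; the function names are illustrative.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared differences on each variable."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """Sum of the absolute differences on each variable (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebychev(a, b):
    """Largest absolute difference on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))

# First two rows of the example data (V1..V6).
case_1 = (6, 4, 7, 3, 2, 3)
case_2 = (2, 3, 1, 4, 5, 4)

print(euclidean(case_1, case_2))   # 8.0  (sqrt of 16+1+36+1+9+1 = 64)
print(city_block(case_1, case_2))  # 16
print(chebychev(case_1, case_2))   # 6
```

The same pair of respondents is 8, 16 or 6 units apart depending on the measure, which is why the choice of measure can change the clustering.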
• It is desirable to remove outliers.

• If the variables are measured in vastly different units, the clustering solution will be influenced by the units of measurement. In such a case, the data need to be standardized.

• Use of different distance measures may lead to different clustering results.
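
One common way to standardize is to convert each variable to z-scores (subtract the mean, divide by the standard deviation). The sketch below assumes the sample standard deviation (n - 1 denominator); the income and store-visit figures are hypothetical.

```python
import statistics

def standardize(values):
    """Convert raw values to z-scores using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: n - 1 denominator
    return [(v - mean) / sd for v in values]

# Hypothetical variables measured in very different units.
income = [20000, 35000, 50000, 65000, 80000]   # rupees per month
visits = [1, 2, 3, 4, 5]                       # store visits per month

z_income = standardize(income)
z_visits = standardize(visits)

print([round(z, 3) for z in z_income])
print([round(z, 3) for z in z_visits])
```

Before standardization, any distance between respondents would be driven almost entirely by income; after it, both variables have mean 0 and standard deviation 1 and contribute on the same scale.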

Select a clustering procedure

Hierarchical clustering

• Hierarchical clustering is characterized by the development of a hierarchy or tree-like structure.

• Agglomerative clustering starts with each object in a separate cluster. Clusters are formed by grouping objects into bigger and bigger clusters. This process is continued until all the objects are members of a single cluster. The agglomerative method is the most commonly used in marketing research.

• Divisive clustering starts with all the objects grouped in a single cluster. Clusters are divided or split until each object is in a separate cluster.
• Agglomerative methods consist of linkage methods, variance methods and centroid methods.
Linkage methods
• Single linkage: This method is based on the minimum distance, or nearest neighbor, rule. The first two objects clustered are those that have the smallest distance between them. The next shortest distance is identified, and either the third object is clustered with the first two or a new two-object cluster is formed. Two clusters are merged at any stage by the single shortest link between them. This process is continued until all objects are in one cluster.

• Complete linkage: This method is based on the maximum distance, or furthest neighbor, approach. The distance between two clusters is calculated as the distance between their two furthest points.

• Average linkage: The distance between two clusters is defined as the average of the distances between all pairs of objects, where one member of the pair is from each of the clusters. This method is generally preferred over the single and complete linkage methods.
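
The three linkage rules differ only in how they summarize the pairwise distances between two clusters. A minimal Python sketch, using two hypothetical two-object clusters in two dimensions so the distances are easy to check by hand:

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance for every pair with one object from each cluster."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))   # nearest neighbors

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))   # furthest neighbors

def average_linkage(cluster_a, cluster_b):
    d = pairwise_distances(cluster_a, cluster_b)
    return sum(d) / len(d)                                  # mean over all pairs

# Hypothetical clusters: pairwise distances are 4, 5, 5 and 4.
cluster_a = [(0, 0), (0, 3)]
cluster_b = [(4, 0), (4, 3)]

print(single_linkage(cluster_a, cluster_b))    # 4.0
print(complete_linkage(cluster_a, cluster_b))  # 5.0
print(average_linkage(cluster_a, cluster_b))   # 4.5
```

At each stage of agglomerative clustering, the pair of clusters with the smallest such distance is merged; only the definition of "distance between clusters" changes from one linkage method to another.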

Variance methods

• These methods attempt to minimize the within-cluster variance.

• Ward’s method: For each cluster, the means for all the variables are computed. Then, for each object, the squared Euclidean distance to the cluster means is calculated. These distances are summed for all the objects. At each stage, the two clusters whose merger gives the smallest increase in this overall within-cluster sum of squares are combined.
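
A short Python sketch of the quantity Ward's method works with: the within-cluster sum of squared Euclidean distances to the cluster means. Ward's method is usually described as merging, at each stage, the pair of clusters whose merger increases this sum the least, so the sketch also computes that increase. The two clusters are hypothetical.

```python
def cluster_means(rows):
    """Mean of each variable over the objects in a cluster."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def within_ss(rows):
    """Sum over objects of the squared Euclidean distance to the cluster means."""
    means = cluster_means(rows)
    return sum((v - m) ** 2 for row in rows for v, m in zip(row, means))

def ward_increase(cluster_a, cluster_b):
    """Increase in the total within-cluster sum of squares if a and b merge."""
    merged = cluster_a + cluster_b
    return within_ss(merged) - within_ss(cluster_a) - within_ss(cluster_b)

# Hypothetical clusters of two objects each, measured on two variables.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(4, 0), (4, 2)]

print(within_ss(cluster_a))                 # 2.0
print(within_ss(cluster_b))                 # 2.0
print(ward_increase(cluster_a, cluster_b))  # 16.0 (merged sum is 20.0)
```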

Centroid methods

• The distance between two clusters is the distance between their centroids (the means for all the variables). Every time objects are grouped, a new centroid is computed.
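
A minimal Python sketch of the centroid method's distance, assuming Euclidean distance between centroids. The two clusters are hypothetical and chosen so the result is a whole number.

```python
import math

def centroid(rows):
    """Mean of each variable over the objects in a cluster."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def centroid_distance(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Hypothetical clusters measured on two variables.
cluster_a = [(0, 0), (0, 2)]   # centroid (0.0, 1.0)
cluster_b = [(3, 4), (3, 6)]   # centroid (3.0, 5.0)

print(centroid_distance(cluster_a, cluster_b))  # 5.0

# After the two clusters are grouped, a new centroid is computed.
print(centroid(cluster_a + cluster_b))          # (1.5, 3.0)
```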

Non hierarchical clustering

• Unlike the hierarchical methods, the non-hierarchical methods start with a predefined number of clusters. These techniques are often referred to as k-means clustering.
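
A bare-bones Python sketch of the k-means idea: fix the number of clusters in advance, assign each object to the nearest cluster center, recompute the centers, and repeat until the assignments stop changing. The points and the starting centers are hypothetical, and this is not the exact procedure SPSS runs.

```python
import math

def k_means(points, initial_centers, max_iter=100):
    """Basic k-means with caller-supplied starting centers (k = their count)."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so stop
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical objects measured on two variables, with k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centers = k_means(points, initial_centers=[(1, 1), (8, 8)])

print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

The result depends on the number of clusters and the starting centers chosen, which is the main practical difference from the hierarchical methods.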

• Analyze Classify Hierarchical cluster

Inaksh
Inaksh
Inaksh
Inaksh
This shows at what point in Ward’s procedure two cases were put into the same cluster.
From the dendrogram, we can see which cases are close to each other and in what order they are grouped together. We can see that cases 14, 16, 10, 4 and 19 are very similar to each other.
• Since we did a hierarchical analysis, we didn’t have to specify the number of clusters, but from the dendrogram we can see that it would be a good idea to create exactly 3 clusters.
• Now if we go to the Data View, we can see a new variable that gives the cluster membership of each case.

• We can now use this variable to compare the groups with each other or to relate cluster membership to other variables.
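
For readers without SPSS at hand, the workflow can be approximated in plain Python: agglomerate the 20 example cases with Ward's criterion on squared Euclidean distances, stop at 3 clusters, store a membership value per case, and profile each cluster by its centroid. This is a naive sketch written for clarity; its tie-breaking and output layout are assumptions and may not reproduce the SPSS results exactly.

```python
from itertools import combinations

# Example data from the slides: 20 respondents (rows) by V1..V6 (columns).
DATA = [
    (6, 4, 7, 3, 2, 3), (2, 3, 1, 4, 5, 4), (7, 2, 6, 4, 1, 3), (4, 6, 4, 5, 3, 6),
    (1, 3, 2, 2, 6, 4), (6, 4, 6, 3, 3, 4), (5, 3, 6, 3, 3, 4), (7, 3, 7, 4, 1, 4),
    (2, 4, 3, 3, 6, 3), (3, 5, 3, 6, 4, 6), (1, 3, 2, 3, 5, 3), (5, 4, 5, 4, 2, 4),
    (2, 2, 1, 5, 4, 4), (4, 6, 4, 6, 4, 7), (6, 5, 4, 2, 1, 4), (3, 5, 4, 6, 4, 7),
    (4, 4, 7, 2, 2, 5), (3, 7, 2, 6, 4, 3), (4, 6, 3, 7, 2, 7), (2, 3, 2, 4, 7, 2),
]

def centroid(rows):
    """Mean of each variable over the cases in a cluster."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def within_ss(rows):
    """Sum of squared Euclidean distances from each case to the centroid."""
    c = centroid(rows)
    return sum((v - m) ** 2 for row in rows for v, m in zip(row, c))

def ward_clusters(data, n_clusters):
    """Merge clusters by Ward's criterion until n_clusters remain."""
    clusters = [[i] for i in range(len(data))]  # start: one case per cluster

    def increase(pair):
        a, b = pair
        rows_a = [data[i] for i in clusters[a]]
        rows_b = [data[i] for i in clusters[b]]
        return within_ss(rows_a + rows_b) - within_ss(rows_a) - within_ss(rows_b)

    while len(clusters) > n_clusters:
        # Pick the pair whose merger adds least to the within-cluster sum of squares.
        a, b = min(combinations(range(len(clusters)), 2), key=increase)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return clusters

clusters = ward_clusters(DATA, 3)

# Cluster membership variable (1..3), one value per case, as in the Data View.
membership = [0] * len(DATA)
for label, members in enumerate(clusters, start=1):
    for i in members:
        membership[i] = label
    profile = [round(m, 2) for m in centroid([DATA[i] for i in members])]
    print("Cluster", label, "cases", [i + 1 for i in members], "centroid", profile)
```

Comparing the centroids on V1 to V6 is one way to interpret and profile the clusters.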
