
Statistics and business analytics

Fall 2018
Prof. Ebbes

‘HOW TO’ IN SPSS

Session 1

Topics

PART 1: Hypothesis basics

PART 2: 6 steps of hypothesis testing

PART 3: Some remarks on hypothesis testing

PART 1: Hypothesis basics

SPSS was not used in this part.

PART 2: 6 steps of hypothesis testing

Hypothesis test for a mean (quantitative variable)

The hypothesis test outlined in the 6 steps (session 1, slides 12—23) can be carried out rather
quickly in SPSS. However, it is still good to write down the 6 steps on scratch paper,
particularly if you haven’t done this a whole lot (see also below).

This is how you get the test out of SPSS:

 Analyze – Compare Means – One-Sample T-test


 Select the variable ‘claim_amount’ in the box ‘Test Variables’
 In the ‘Test Value’ box at the bottom, fill in the null hypothesis mean (step 1, see slide 12),
i.e. 70
 Click ‘OK’
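
If you prefer typing commands over clicking, the same test can be run from a syntax window
(File – New – Syntax, paste, then Run). A minimal sketch, assuming the variable is named
‘claim_amount’ as above:

    * One-sample t-test of H0: mu = 70.
    T-TEST
      /TESTVAL=70
      /VARIABLES=claim_amount.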

You should get the following tables:

Once you have the tables, it is good practice to go through the 6 steps of hypothesis testing:

(1) Specify the null hypothesis: H0: µ=70 and H1: µ>70
(2) Choose the significance level: e.g. α = 5%
(3) Compute the test statistic: t=1.385 (this is equivalent to the Z-test value on slide 16 –
session 1)
(4) Compute the P-value: SPSS’ P-value is 0.166 (listed in the table above under “Sig. 2-
tailed”). Note that this is a two-sided (!) hypothesis test. In other words, SPSS
automatically assumes a two-sided hypothesis test (see discussion slides 13 and 20,
session 1). In class we used a one-sided test. Hence, we need to divide SPSS’ P-value
by 2, so the P-value for our test is 0.166/2 = 0.083.
(5) Make a statistical decision: the P-value = 0.083 is larger than the significance level
α = 5% = 0.05. Hence, we do NOT reject the null hypothesis listed in (1) above.
(6) Make a business conclusion (see lecture notes slide 22)

Remark 1: if you wanted to test the two-sided hypothesis H0: µ=70 and H1: µ≠70 you can use
the P-value given by SPSS as is.

Remark 2: in class we used the Z-test formula (slide 16) and the normal distribution to compute
the P-value (slides 17—20), whereas SPSS uses the t distribution. For large samples, the
results are nearly identical. For small samples, SPSS’ results are more reliable (see discussion
slide 31 session 1).

PART 3: Some remarks on hypothesis testing

Hypothesis test for a proportion (categorical variable)

Unfortunately, we will not be able to use SPSS for the test outlined on slides 26—30 (session
1). While we could “abuse” SPSS to compute a confidence interval for a proportion (categorical
variable) in the “How to guide” of data science camp day 2 (part 5), here we cannot. Hence, we
will have to compute it by hand like we did in class.

[[ Explanation (for those of you interested): the reason is that this procedure is programmed in
SPSS for quantitative variables only. However, the formula we use to test a hypothesis about a
proportion (slide 28) is quite different from the formula to test about a mean (slide 16).
Importantly, SPSS does not plug in the value specified in the null hypothesis in step 1 (the π
on slide 27) in the denominator of the formula on slide 28. In other words, SPSS uses by default
(wrongly!)

z_test = (p − π) / √( p(1 − p) / n )

instead of (correctly!)

z_test = (p − π) / √( π(1 − π) / n )
As said before, this is because SPSS is not programmed to carry out this test for categorical
variables. The “annoying” thing is that it will still give you something in the output if you run the
analysis! So it has no built-in “safety-check” to warn you of a mistake. Hence, we cannot be
lazy and have to do the labor ourselves in this particular case. In the next class meeting we will
learn about another test that we could use instead for categorical variables. ]]

Statistics and business analytics
Fall 2018
Prof. Ebbes

‘HOW TO’ IN SPSS

Session 2

Topics

PART 1: Sample size calculations

PART 2: Sampling challenges

PART 3: Investigating distributions for categorical variables (chi-square test)

PART 1: Sample size calculations

SPSS was not used in this part.

PART 2: Sampling challenges

SPSS was not used in this part.

PART 3: Investigating distributions for categorical variables (chi-square test)

Chi-square test for a frequency table

The hypothesis test outlined in the 6 steps in session 2, slides 20—32 can be carried out rather
quickly in SPSS. However, it is still good to write down the 6 steps on scratch paper,
particularly if you haven’t done this a whole lot.

Unfortunately, I cannot share the data with you for the Bike share case shown in class.
Therefore, I will demonstrate this technique using the insurance fraud case where we also have
a variable gender (see data science camp day 2;
data_science_day_2_insurance_fraud_web.sav).

As I don’t know the “true” population values for the insurance case, let’s just take the same
proportions as used in class, that is, let’s say the proportion of male and female clients at the
insurance company in the population is 0.40 and 0.60, respectively.

To get the frequency table for gender, do as follows:

 Analyze – Descriptive Statistics – Frequencies


 Select the variable Gender into the box ‘Variable(s)’
 Click ‘OK’

Now let’s test the following null hypothesis (0 = male, 1 = female): π0 = 0.4 and π1 = 0.6.

This is how to get it out of SPSS (note: good practice – try to do this by hand!)

 Analyze – Nonparametric tests – Legacy Dialogs – Chi-square


 In the ‘Test Variable List’ select the variable ‘gender’
 In the ‘Expected Values’ area, you need to specify what the null hypothesis is. Note,
here it is important to follow the label ordering (0=male, 1=female, so enter the
corresponding value for males (0.40) first)

o Select ‘Values’
o Enter: 0.40 then click ‘Add’
o Then enter: 0.60 and click ‘Add’
 Click ‘OK’
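
For reference, a syntax sketch of the same test (the expected proportions are entered in label
order, males first):

    * Chi-square test of H0: pi0 = 0.40 (male) and pi1 = 0.60 (female).
    NPAR TESTS
      /CHISQUARE=gender
      /EXPECTED=0.40 0.60.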

Did you get:

These two tables are the results from your test (the 6 steps of hypothesis testing):

 The observed N column in the above table is the same as in your frequency table
(previous page)
 The Expected N column contains the Ei’s that we used in class in the chi-square formula
(session 2 slide 25)
 The Residual column is just the difference between observed and expected
 The second table (labeled ‘Test Statistics’) shows:
o the test-statistic – this is what you would get if you compute the chi-square by
hand, e.g. slide 24—26 session 2)
o the degrees of freedom (df) – what you need to compute the P-value (slide 28
session 2)
o The P-value – indicated by ‘Asymp. Sig.’, which SPSS computes automatically for
you (by hand, you would find the tail area to the right of 157.872 under the chi-square
distribution with 1 degree of freedom, e.g. slide 29 session 2)

 Note the warning below the second table (footnote ‘a’); here things are fine (lecture
notes slide 30)

Hence, we would reject the null hypothesis that π0 = 0.4 and π1 = 0.6, because the P-value =
0.000 is smaller than the significance level (α = 0.05).

Good practice is to write down the 6 steps completely, here they are:

Step 1: formulate the hypotheses

H0: 0 = 0.4 and 1 = 0.6

Ha: at least one of the equalities does not hold

Step 2: specify the significance level

 = 0.05

Step 3: compute the test-statistic

chi-square = 157.872

Step 4: compute the P-value

P-value = 0.000

Step 5: make a statistical conclusion

Reject the null hypothesis

Step 6: make a business conclusion

The sample proportions for men (0.492) and women (0.507) are significantly different
from 0.4 and 0.6, and the sample is not representative with respect to the variable gender (chi-
square = 157.872, P-value = 0.000). We oversampled the men and undersampled the women [[
assuming that the population values really are 0.4 and 0.6, which we don’t actually know in this
particular example, as mentioned above ]].

Statistics and business analytics
Fall 2018
Prof. Ebbes

‘HOW TO’ IN SPSS

Session 3

Topic: bivariate statistics for one categorical variable and one quantitative variable

PART 1: Bivariate statistics

PART 2: Comparing two means (Z/t-test)

PART 3: Comparing multiple means (ANOVA)

PART 1: Bivariate statistics

SPSS was not used in this part.

PART 2: Comparing two means (Z/t-test)

To produce the means plot on slide 7 (session 3), proceed as follows:

 Graphs – Legacy Dialogs – Bar


 Select ‘Simple’ and ‘Summaries for groups of cases’ (here the groups are defined by the
categories of the categorical variable ‘retire’)
 In the area ‘Bars Represent’, click ‘Other statistic’, and select the quantitative variable
(here: ‘claim_amount’) as Variable (make sure to compute the MEANs)
 In the area ‘Category Axis’ select the categorical variable (here: ‘retire’)
 Click ‘OK’

(or: try using the chart builder)
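
In syntax form, a minimal sketch of the same means plot is:

    * Bar chart of the mean of claim_amount per category of retire.
    GRAPH
      /BAR(SIMPLE)=MEAN(claim_amount) BY retire.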

Here is the means plot:

The table with descriptive statistics on slide 14 (session 3) is obtained as follows:

 Analyze – Compare Means – Means


 Select the quantitative variable (‘claim_amount’) in the dependent list
 Select the categorical variable (‘retire’) in the independent list
 Click ‘Options’ and make sure the required statistics are computed (here: mean, number
of cases, standard deviation); click ‘Continue’

 Click ‘OK’
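
The equivalent syntax sketch:

    * Mean, N, and standard deviation of claim_amount by retire.
    MEANS TABLES=claim_amount BY retire
      /CELLS=MEAN COUNT STDDEV.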

You should get the following table:

Now you can fill out the formula on that slide, to carry out the hypothesis test H0: µ1 = µ2 by
hand. However, SPSS can do this for you as follows:

 Analyze – Compare Means – Independent-Samples T test


 The ‘Test Variable’ is the quantitative variable (here: ‘claim_amount’)
 The ‘Grouping Variable’ is the categorical variable (here: ‘retire’)
 Note that when you select ‘retire’ in the ‘Grouping Variable’ box you find two question
marks behind it. This means you need to tell SPSS what group 1 and what group 2 is.
o Preliminary step: in SPSS ‘Variable View’ we can find that 0=No and 1=Yes
o Click on ‘Define Groups’. In class, I used ‘Yes’ as group 1 and ‘No’ as group 2.
Hence, type in the ‘Group 1:’ box a 1, and type in the ‘Group 2:’ box a 0 (!).
o Click ‘Continue’
 Click ‘OK’
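
A syntax sketch of the same test, with ‘Yes’ (1) as group 1 and ‘No’ (0) as group 2, as in class:

    * Independent-samples t-test of H0: mu1 = mu2.
    T-TEST GROUPS=retire(1 0)
      /VARIABLES=claim_amount.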

You will get the following two tables:


The first table (“Group Statistics”) contains the same descriptive statistics as the table above
(e.g. on slide 14). The second table (“Independent Samples Test”) contains the results of
the test.

Explaining the full table is beyond the scope of the class. I will tell you which parts of the table
correspond to what we learned in class; we won’t use the other parts.

For this class, you only need to use the row that says “Equal variances NOT assumed” (i.e.
the second row). And, we only need to use the column ‘t’ and the column ‘Sig. (2-tailed)’.
Now we have enough information for the 6 steps of hypothesis testing:

Step 1: formulate the null and alternative hypothesis – see lecture notes slide 11

Step 2: choose the significance level – lecture notes slide 12

Step 3: compute the test-statistic – t = -11.641 (same as we computed in class on slide 14)

Step 4: get the P-value – P-value = 0.000 (note: SPSS by default gives you 2-tailed!) (in
class we use PQRS to get the P-value, see slide 16)

Step 5: make a statistical decision – here the P-value is LESS than the significance level, so
we REJECT the null hypothesis

Step 6: make a business conclusion – see lecture notes slide 18

PART 3: Comparing multiple means (ANOVA)

The means plot on slide 20 (session 3) is obtained using similar steps as above using the
variable ‘edcat’ as the ‘Category Axis’, instead of ‘retire’.

I am not producing here the ANOVA results on slides 24—29 as we concluded in class that the
ANOVA assumptions (slide 27) are not valid.

However, we proceeded by computing the log transform (slide 30). This goes as follows.

You first need to compute a new variable, say ‘log_claim_amount’, that contains the natural log
of ‘claim_amount’ – this can be done with SPSS’ function ‘Transform – Compute Variable’. We
discussed this option in the appendix of the ‘how to guide’ of day 2 of data science camp (check
this ‘how to guide’ if you forgot how to use SPSS’ compute function).
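
A minimal syntax sketch of this computation:

    * New variable with the natural log of claim_amount.
    COMPUTE log_claim_amount = LN(claim_amount).
    EXECUTE.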

Once you have done that, you can obtain the ANOVA table on slide 30 (session 3) as follows:

 Analyze – Compare Means – One-Way ANOVA


 In the ‘Dependent List’ you have to select the quantitative variable (here:
‘log_claim_amount’)
 In the ‘Factor’ area you select the categorical variable (here: ‘edcat’)
 Click ‘OK’

This produces the following table:

As the P-value < 0.05, we reject the ANOVA’s null hypothesis of equal means (you should write
down the 6 steps of hypothesis testing (slides 24—26) to be complete).

When the ANOVA null hypothesis is rejected, we know that at least one of the equalities does not hold.
But we don’t know which one. This can be discovered with a mean comparison (slides 33 and
34). Obtaining a mean comparison in SPSS is easy, you just need to add one option to the
preceding steps:

 Analyze – Compare Means – One-Way ANOVA


 In the ‘Dependent List’ you have to select the quantitative variable (here:
‘log_claim_amount’)
 In the ‘Factor’ area you select the categorical variable (here: ‘edcat’)
 Click ‘Post Hoc…’
o Select the option ‘Bonferroni’ in the area ‘Equal Variances Assumed’
o Click ‘Continue’
 Click ‘OK’
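
Both ANOVA runs can also be sketched in one syntax command (leave out the /POSTHOC line
to reproduce the plain ANOVA table of slide 30):

    * One-way ANOVA of log_claim_amount by edcat with Bonferroni mean comparisons.
    ONEWAY log_claim_amount BY edcat
      /POSTHOC=BONFERRONI ALPHA(0.05).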

Along with the preceding table, you now also get the mean comparison table (last page of this
document). Interpreting this table is a bit harder, but we discussed this in class.

We need to make sure the assumptions of the ANOVA are valid and we can produce the figure
on slide 28 (31) and the table on slide 29 for the log-transformed variable.

The table on slide 29 (session 3) is produced as follows:

 Analyze – Compare Means – Means


 In the ‘Dependent List’ select the quantitative variable (‘log_claim_amount’)
 In the ‘Independent List’ select the categorical variable (‘edcat’)
 Click ‘Options’
o Make sure that the mean and the variance are listed in the area ‘Cell Statistics’
o Click ‘Continue’
 Click ‘OK’

Did you get:

The graphs on slides 28/31 are obtained as follows:

 Graphs – Legacy Dialogs – Boxplot


 Select ‘Simple’ and ‘Summaries for groups of cases’ (here the groups are defined by the
categorical variable)
 Click ‘Define’
 In the box ‘Variable’ select the quantitative variable (‘log_claim_amount’)
 In the ‘Category Axis’ box select the categorical variable (‘edcat’)
 Click ‘OK’
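
A syntax sketch that produces comparable boxplots by group:

    * Boxplots of log_claim_amount per category of edcat.
    EXAMINE VARIABLES=log_claim_amount BY edcat
      /PLOT=BOXPLOT
      /STATISTICS=NONE
      /NOTOTAL.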


Did you get:

On the next page, you will find the full table with multiple comparisons (slide 34 session 3)

Statistics and business analytics
Fall 2018
Prof. Ebbes

‘HOW TO’ IN SPSS

Session 4

Topic: bivariate statistics for two categorical variables

PART 1: Bivariate statistics

PART 2: Inference for a cross tab

PART 3: Graphics do’s and don’ts

PART 4: Towards business analytics

PART 1: Bivariate statistics

SPSS was not used in this part.

PART 2: Inference for a cross tab

The cross tab on slide 13 including the chi-square test to test the null hypothesis on slide 15
(session 4) can be produced as follows.

 Analyze—Descriptive Statistics—Crosstabs…
 Choose the categorical variable ‘fraudulent’ in the columns
 Choose the categorical variable ‘claim_type’ in the rows
 Then, to compute percentages, go to ‘Cells’
o Under Percentages, click the desired percentage (here: rows)
o Click Continue
 To compute the chi-square test and P-value (slides 17— 20 and slides 21—23 of
session 4, respectively):
o Click ‘Statistics’
o Select ‘Chi-square’ (top-left)
o Click ‘Continue’
 Click OK
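
The equivalent syntax sketch:

    * Cross tab of claim_type (rows) by fraudulent (columns),
    * with row percentages and the chi-square test.
    CROSSTABS
      /TABLES=claim_type BY fraudulent
      /CELLS=COUNT ROW
      /STATISTICS=CHISQ.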

Here are the results:

To interpret the results of the chi-square test, cycle through the 6 steps:

Step 1: formulate the null and alternative hypothesis – see lecture notes slide 15

Step 2: choose the significance level – lecture notes slide 16

Step 3: compute the test-statistic – chi-square = 29.996 (apart from minor rounding the
same as we computed in class on slide 20)

Step 4: get the P-value – P-value = 0.000 (in class we found the same number using PQRS
on slide 22)

Step 5: make a statistical decision – here the P-value is LESS than the significance level, so
we REJECT the null hypothesis

Step 6: make a business conclusion – see lecture notes slides 25-26

NOTE: the warning on slide 23 (session 4) is important to inspect! This is given in the table
above in footnote ‘a’, where it says “0 cells (0.0%) have … count is 42.37”. Basically, SPSS
is saying that the smallest expected cell frequency (Ei) is 42.37. Hence, we are good to go,
because they are all larger than 5. We can have a few Ei’s less than 5, but no more than 20% of
the cells (the Ei’s can also be computed by hand, see calculations slide 19).

To graphically display a cross tab we could use a clustered bar chart or segmented bar chart
(see ‘How to guide’ data science camp, day 1, part 4).

PART 3: Graphics do’s and don’ts

SPSS was not used in this part.

PART 4: Towards business analytics

SPSS was not used in this part.

Statistics and business analytics
Fall 2018
Prof. Ebbes

‘HOW TO’ IN SPSS

Session 5

PART 1: Bivariate stats for La Quinta (case)

PART 2: Correlations

PART 3: Simple linear regressions mechanics

PART 4: Interpreting SPSS regression output

PART 1: Bivariate stats for La Quinta (case)

The boxplot and key descriptive statistics (‘become friends with your data’) on slide 9 (session
5) can be computed using methods we learned before (see how to guide data science camp
day 2).

PART 2: Correlations

The scatter plot on slide 13 (session 5) can be obtained as follows:

 Graphs – Legacy Dialogs – Scatter/Dot


 Select ‘Simple Scatter’ and click ‘Define’
 On the Y-axis, select the variable ‘Margin’
 On the X-axis, select the variable ‘Number’
 Click ‘OK’
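
In syntax form, a minimal sketch is:

    * Scatter plot of Margin (Y-axis) against Number (X-axis).
    GRAPH
      /SCATTERPLOT(BIVAR)=Number WITH Margin.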

This is what you get:

The scatter plots on slides 14 and 15 are created similarly. You could also use SPSS chart
builder.

To compute bivariate correlation coefficients r, i.e. the table on slide 16 (session 5), proceed as
follows:

 Analyze – Correlate – Bivariate


 In the box ‘Variables’, select the variables ‘Margin’, ‘Number’, ‘OfficeSpace’, and
‘Distance’

 Click ‘OK’
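
The equivalent syntax sketch:

    * Pairwise correlations with two-sided tests of H0: rho = 0.
    CORRELATIONS
      /VARIABLES=Margin Number OfficeSpace Distance
      /PRINT=TWOTAIL NOSIG.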

This produces the table below, which includes the sample correlation coefficients (‘r’) and
the relevant information to test H0: =0 for each pair of variables.

PART 3: Simple linear regressions mechanics

No new SPSS techniques were used in this part. Please refer to the previous part (part 2) to
create the scatter plots discussed in part 3. The lines were added to facilitate class discussion. In
general, it is not necessary to include these lines in scatter plots.

PART 4: Interpreting SPSS regression output

To estimate the linear regression model 𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝜖, where Y is profit margin and X1 is the
volume of office space, i.e. the results on slide 39 (session 5), follow these steps:

 Analyze – Regression – Linear


 In the box ‘Dependent’ select the Y variable, i.e. ‘Margin’
 In the box ‘Independent(s)’ select the X variable, i.e. ‘OfficeSpace’
 Click ‘OK’
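
A minimal syntax sketch of this regression:

    * Simple linear regression of Margin on OfficeSpace.
    REGRESSION
      /DEPENDENT Margin
      /METHOD=ENTER OfficeSpace.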

You should get the following three tables:

For the interpretation of these tables refer to slides 40—44 of session 5. Using similar steps,
you should be able to produce the tables on slides 45 and 46.

Statistics and business analytics
Fall 2018
Prof. Ebbes

‘HOW TO’ IN SPSS

Session 6

PART 1: multiple linear regressions

PART 2: Regression diagnostics tasks

PART 3: Using for decisions – predictions

PART 1: multiple linear regressions

The steps to estimate the multiple linear regression model 𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽6𝑋6 + 𝜖
on slide 9 (session 6) are nearly identical to the steps we learned before for the simple linear
regression model in session 5 (see the how to guide of session 5). Specifically:

 Analyze – Regression – Linear


 In the box ‘Dependent’ select the Y variable, i.e. ‘Margin’
 In the box ‘Independent(s)’ select the six X variables, i.e. ‘Number’, ‘Nearest’,
‘OfficeSpace’, ‘Enrollment’, ‘Income’, and ‘Distance’
 Click ‘OK’

You should get the following three tables:

For the interpretation of the three tables, please refer to slides 10—19 of session 6.

PART 2: Regression diagnostics tasks

As discussed in class, checking your regression model assumptions is extremely important. It is
your task to inspect, at the minimum, the five aspects listed on slide 23 (session 6).

1. Checking variable types – this can be done by examining the scale level of your variables. It
should be indicated in SPSS (Variable View) that each variable in the regression model is
‘Scale’.
2. The goodness of fit and P-values are produced by SPSS and listed in the output tables
above.
3. Whether or not there is a linear relationship between Y and the six X’s can be examined by
scatter plots. You should create six scatter plots, between Y and X1, Y and X2, …, and Y and
X6, using the steps listed in the how to guide of session 5 (part 2).
4. Residual analysis – your residuals (a.k.a. errors) should ‘behave’, meaning they have an
approximate normal distribution and there should be a few outliers at most.
a. The residual plot on slide 29 can be obtained in the ‘Linear Regression’ window
(follow the steps in part 1 above), but before clicking ‘OK’, click on ‘Plots’ and then
select ‘Histogram’ in the ‘Standardized Residual Plots’ area (bottom left).
b. The residual table on slide 30 is obtained by selecting the option ‘Casewise
diagnostics’ under ‘Residuals’ in the menu ‘Statistics’ in the ‘Linear Regression’
window. For teaching purposes, I specified to print a case in the table if its residual is
outside TWO standard deviations, but the default is three.
5. Check for multicollinearity. You should inspect:
a. All the bivariate correlations between the independent variables. This can be done
following the steps outlined in the how to guide of session 5 (part 2). Did you find that
the largest (absolute) correlation is 0.15?
b. Compute collinearity diagnostics (VIF and tolerance). You will get them in the ‘Linear
Regression’ window under the menu ‘Statistics’ by selecting ‘Collinearity
diagnostics’. This should produce the table on slide 33.
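
For reference, a syntax sketch that requests the full model together with the diagnostics of
points 4 and 5b in one run:

    * Multiple regression with collinearity diagnostics (VIF and tolerance),
    * a histogram of standardized residuals, and casewise diagnostics
    * for cases whose residual is outside two standard deviations.
    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
      /DEPENDENT Margin
      /METHOD=ENTER Number Nearest OfficeSpace Enrollment Income Distance
      /RESIDUALS HISTOGRAM(ZRESID)
      /CASEWISE PLOT(ZRESID) OUTLIERS(2).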

PART 3: Using for decisions – predictions

To obtain predictions (slide 37 session 6), we could just program the estimated regression
equation into e.g. Excel and have Excel do the computations. Or, you could do the calculations
by hand on a calculator. However, neither Excel nor your hand calculations give you the
prediction interval, which you definitely need (slide 38)!

We can actually have SPSS do all the calculations for us, including the interval calculation, but
we need a little trick.

What you have to do is create a NEW case in your datafile that has NO observation for the
dependent variable (‘Margin’), and, as values for the independent variables, the ones listed on
slide 36.

To create this NEW case, you will have to go to ‘Data View’ in SPSS, scroll to the bottom of the
spreadsheet, and type in the 101st row, the values for the X variables listed on slide 36.

After doing that, your ‘Data View’ should look as follows:

NOTE: make sure there is a DOT (missing value) for ‘Margin’!!

Now re-run the regression model in SPSS and tell it to also obtain predictions. Here is how that
goes:

 Analyze – Regression – Linear


 In the box ‘Dependent’ select the Y variable, i.e. ‘Margin’
 In the box ‘Independent(s)’ select the six X variables (‘Number’, ‘Nearest’,
‘OfficeSpace’, ‘Enrollment’, ‘Income’, and ‘Distance’)
 Click on ‘Save’
o Under ‘Predicted Values’ (top left), select ‘Unstandardized’
o Under ‘Prediction Intervals’ (somewhere halfway that window), select ‘Individual’
and the desired level of confidence (e.g. 95%)
o Click ‘Continue’
 Click ‘OK’
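
A syntax sketch of this re-run (the ICIN keyword saves the individual 95% prediction interval
bounds):

    * Regression with predicted values and prediction intervals saved to the data file.
    REGRESSION
      /DEPENDENT Margin
      /METHOD=ENTER Number Nearest OfficeSpace Enrollment Income Distance
      /SAVE PRED ICIN.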

Now you will get the exact same output tables as before. But there is no new table with the
predictions. So where can those predictions be found?

The predictions can be found in your data spreadsheet in SPSS – i.e. in ‘Data View’. What you
can see is that SPSS created THREE new columns and populated those with numbers.

 PRE_1 – this is the predicted value 𝑌̂ for each observation (i.e. each hotel)
 LICI_1 – this is the lower bound of the prediction interval for each observation
 UICI_1 – this is the upper bound of the prediction interval for each observation

For our particular case, the prediction is 37.09 and the 95% prediction interval is [25.40, 48.79].
See screen shot below.

Statistics and business analytics
Fall 2018
Prof. Ebbes

‘HOW TO’ IN SPSS

Session 7

PART 1: categorical regressors

PART 2: Interactions with a categorical regressor

PART 3: Nonlinear relations (briefly)

In this how to guide I will focus on creating dummy variables, interaction variables and
transformations of variables. These ‘new’ variables can then be used as independent variables
in a regression model. I explained in the ‘how to guide’ of sessions 5&6 how to run regressions
in SPSS.

PART 1: categorical regressors

Creating dummy variables

The steps needed to create dummy variables are detailed on slide 20 of session 7. Here I will
suggest how to perform those steps with SPSS.

I will use the categorical variable ‘JobGrade’ (slide 18, 19) as an example.

Steps to create dummy variables:

Step 1: Count the number of categories you have and subtract 1.

Job grade has six categories, so we need to create 6 − 1 = 5 dummy variables to represent it.

In SPSS, steps 2—7 of slide 20 go as follows. [[ this is actually pretty annoying ]]

Let’s take the last category as reference category (slide 19). We do not have to create a dummy
variable for the reference category.

Let’s create the first dummy variable for the first category. We use the following recoding
schema:

Value on current variable ‘JobGrade’    Value on new variable ‘JobGrade_dum1’
1                                       1
2                                       0
3                                       0
4                                       0
5                                       0
6                                       0

How to interpret this table: all observations with a ‘1’ on ‘JobGrade’ will get a ‘1’ on the new
variable ‘JobGrade_dum1’. All observations with a 2,3,4,5, or 6 on ‘JobGrade’ will get a ‘0’ on
‘JobGrade_dum1’.

This is how to create ‘JobGrade_dum1’ in SPSS:

 Transform – Recode into Different Variables


 CLICK RESET
 In the ‘Input Variable → Output Variable’ window, select the variable ‘JobGrade’
 In the area ‘Output variable’ type the name and description for this new variable, for
instance

o Name: JobGrade_dum1
o Label: Job grade 1 (dummy)
o Click ‘Change’
 Then click ‘Old and New Values’
 This will tell SPSS how to do the recoding as shown in the table above. In other
words, tell SPSS how the old values (1,2,3,4,5,6) will translate into values of the
new variable, i.e. 1,0,0,0,0,0, respectively
 On the left hand side (Old value), type behind ‘Value’ the number 1
 On the right hand side (New value), type behind ‘Value’ the number 1
 Click ‘Add’
 On the left hand side (Old value), select ‘All other values’
 On the right hand side (New value), type behind ‘Value’ the number 0
 Click ‘Add’
 It should look like this:

 Click ‘Continue’
 You will return to this window:

 Click ‘OK’

SPSS now created a new variable ‘JobGrade_dum1’ in your spreadsheet. It shows up in
‘Data View’ at the end of the spreadsheet, as well as in ‘Variable View’. Before continuing,
go to ‘Data View’ and check for a few cases that the recoding was performed correctly (e.g.
case 1, 60, 61, and 208).

Now you will have to repeat this four (!) more times to create the dummy variables
‘JobGrade_dum2’, ‘JobGrade_dum3’, ‘JobGrade_dum4’, and ‘JobGrade_dum5’. I will
demonstrate the second one here, and leave the rest up to you.

This is how to create ‘JobGrade_dum2’ in SPSS:

 Transform – Recode into Different Variables


 CLICK RESET
 In the ‘Input Variable → Output Variable’ window, select the variable ‘JobGrade’
 In the area ‘Output variable’ type the name and description for this new variable, for
instance
o Name: JobGrade_dum2
o Label: Job grade 2 (dummy)
o Click ‘Change’
 Then click ‘Old and New Values’
 This will tell SPSS how to do the recoding as shown in the table on slide 19
(session 7). In other words, tell SPSS how the old values (1,2,3,4,5,6) will
translate into values of the new variable, i.e. 0,1,0,0,0,0, respectively
 On the left hand side (Old value), type behind ‘Value’ the number 2
 On the right hand side (New value), type behind ‘Value’ the number 1
 Click ‘Add’
 On the left hand side (Old value), select ‘All other values’
 On the right hand side (New value), type behind ‘Value’ the number 0
 Click ‘Add’
 It should look like this:

 Click ‘Continue’
 You will return to this window:

 Click ‘OK’

SPSS now created a new variable ‘JobGrade_dum2’ in your spreadsheet. It shows up in
‘Data View’ at the end of the spreadsheet, as well as in ‘Variable View’. Before continuing,
go to ‘Data View’ and check for a few cases that the recoding was performed correctly (e.g.
case 1, 60, 61, and 208).

You will now need to create the dummy variables ‘JobGrade_dum3’, ‘JobGrade_dum4’, and
‘JobGrade_dum5’ following similar steps; a syntax shortcut is sketched below. Do not forget to
click RESET, otherwise it may get messy…
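
The syntax shortcut mentioned above – a sketch of all five recodes in one go:

    * Job grade dummies, with category 6 as the reference category.
    RECODE JobGrade (1=1) (ELSE=0) INTO JobGrade_dum1.
    RECODE JobGrade (2=1) (ELSE=0) INTO JobGrade_dum2.
    RECODE JobGrade (3=1) (ELSE=0) INTO JobGrade_dum3.
    RECODE JobGrade (4=1) (ELSE=0) INTO JobGrade_dum4.
    RECODE JobGrade (5=1) (ELSE=0) INTO JobGrade_dum5.
    EXECUTE.

Remember to set the labels of the new variables in ‘Variable View’ afterwards.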

Similarly, you should create a dummy variable for the variable gender. This is considerably
simpler, as we need to create only one dummy variable to represent the categorical
variable gender. Because the females are the focus of the study, we take males as reference
category. The old and new variables are as follows:

Value on current variable ‘Gender’    Value on new variable ‘Gender_dum’
1 (females)                           1
2 (males)                             0

 Transform – Recode into Different Variables


 CLICK RESET
 In the ‘Input Variable → Output Variable’ window, select the variable ‘Gender’
 In the area ‘Output variable’ type the name and description for this new variable, for
instance
o Name: Gender_dum
o Label: Female (dummy)
o Click ‘Change’
 Then click ‘Old and New Values’
 This will tell SPSS how to do the recoding as shown in the table above. In other
words, tell SPSS how the old values (1,2) will translate into values of the new
variable, i.e. 1,0, respectively
 On the left hand side (Old value), type behind ‘Value’ the number 1

 On the right hand side (New value), type behind ‘Value’ the number 1
 Click ‘Add’
 On the left hand side (Old value), select ‘All other values’
 On the right hand side (New value), type behind ‘Value’ the number 0
 Click ‘Add’ and then ‘Continue’
 Click ‘OK’

SPSS now created a new variable ‘Gender_dum’ in your spreadsheet. It shows up in ‘Data
View’ at the end of the spreadsheet, as well as in ‘Variable View’. Before continuing, go to
‘Data View’ and check for a few cases that the recoding was performed correctly (e.g. cases
1—5).

You have now created all the necessary dummy variables to run the regression model on slide
21 (session 7). Follow the instructions in the how to guide of session 6 (part 1&2) to run the
regression. Note that:

 Your dependent variable is ‘Salary’


 Your independent variables are: YrsExper, Gender_dum, JobGrade_dum1,
JobGrade_dum2, JobGrade_dum3, JobGrade_dum4, JobGrade_dum5
 Do NOT include the original categorical variables (‘Gender’ and ‘JobGrade’) in your
regression model!!

PART 2: Interactions with a categorical regressor

Creating interaction variables

Creating interaction variables is considerably easier than creating dummy variables. The
interaction variable discussed on slide 27 (session 7) is the product of YrsExp and the dummy
variable for gender. To compute this new variable, proceed as follows:

 Transform – Compute Variable


 In the space ‘Target Variable’ type the new variable name (note: SPSS won’t take
spaces, special characters etc.)
o E.g. YrsExperGender
 In the ‘Numeric Expression’ box, type the formula how to compute this interaction
variable
o It is the product of ‘YrsExper’ and ‘Gender_dum’ (note: NOT ‘Gender’)
 Select ‘YrsExper’ in the box
 Type or click *
 Select ‘Gender_dum’ in the box
 It should look like: YrsExper * Gender_dum
 Click ‘OK’
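
Equivalently, a one-line syntax sketch:

    * Interaction of experience with the gender dummy.
    COMPUTE YrsExperGender = YrsExper * Gender_dum.
    EXECUTE.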

As before, SPSS added the new variable ‘YrsExperGender’ to the end of your data file.

 In Data View, inspect for a few cases that the computations were carried out correctly
(e.g. cases 1—5)

 In Variable View, update the properties of the variable (e.g. type a label and update the
measure (‘scale’))

You can now run the regression on slide 29 (session 7):

 Your dependent variable is ‘Salary’


 Your independent variables are: YrsExper, Gender_dum, and ‘YrsExperGender’

PART 3: Nonlinear relations (briefly)

Computing transformations

I did not make the data available to you for the nonlinear regressions. However, steps to create
nonlinear functions of your variables are identical to the previous steps in part 2 to create
interaction terms (i.e. use Transform – Compute Variable).

For instance, to compute the logarithm of a variable, check out slides 74—77 of Data science
camp day 2.

As an additional example, we could investigate if experience and salary are nonlinearly related.
The scatter plot between Salary and YrsExper may suggest this. That is, salary could increase
quadratically with experience instead of linearly. You could create the new variable YrsExperSq
by using the formula YrsExper * YrsExper and then estimate the model:

Salary = 0 + 1YrsExper + 2YrsExperSq + 3Gender_dum

(i.e. Y=0 + 1X1 + 2X12 + 3X3)

The estimated model is:

Salary = 39.043 + 0.179*YrsExper + 0.026*YrsExperSq – 6.749*Gender_dum

(coefficients that are underlined are significant with P-value < 0.05)

As discussed in class, the coefficients for years of experience are hard to interpret in a parabolic
model. But, for this simple equation we could create a similar graph as e.g. slide 16 (session 7).
[Figure: fitted Salary vs. YrsExper curves by gender (blue = males; orange = females)]

This model suggests a slight nonlinear relation between Salary
and Years of experience, where more experience tends to accelerate salary. We should
definitely inspect the scatter plot between ‘Salary’ and ‘YrsExper’ to back up this analysis – you
will notice that the scatter plot also suggests a somewhat non-linear relationship, but we do not
have a lot of data in the higher experience range. Interestingly though, the effect of gender
decreased a little compared to a model without experience-squared (e.g. slide 16 session 7).

However, we typically only use nonlinear models if economic theory dictates us to do so (e.g.
the constant elasticity model in Economics; slide 40 session 7), if we expect increasing or
decreasing returns to scale (e.g. ad spending on sales), or if linear regression model checks
(session 6 slide 23) suggest something is wrong with the linear model.

Otherwise, we prefer simple models to more complex models, and linear models are quite
straightforward to interpret and easy to use, as we saw in sessions 5, 6, and 7 of this class.
Besides, in many applications they perform quite reasonably in terms of predictions.

Statistics and business analytics
Fall 2018
Prof. Ebbes

‘HOW TO’ IN SPSS

Session 8

PART 1: Logit regression main idea

PART 2: Logit regression SPSS output

PART 3: Logit regression interpretation

PART 4: Predicting probabilities and other diagnostic tasks

In this guide I will discuss how to run a logistic regression model in SPSS. The dependent
variable is categorical with two categories. The independent variables are quantitative. If an
independent variable is categorical, make sure to create dummies first and use the dummies as
independent variables, just as for the linear regression model (session 7).

PART 1: Logit regression main idea

SPSS was not used in this part.

PART 2: Logit regression SPSS output

PART 3: Logit regression interpretation

How to run a logistic regression in SPSS

The steps are relatively simple. The challenge with logistic regressions is the interpretation.

To estimate the logistic regression model on slide 14 (results on slide 15) proceed as follows.
You would first need to make sure the dependent variable (here: ‘Prom’) is coded as a dummy
variable (1/0). Here that is the case (see SPSS Variable View).

The regressor ‘YrsExper’ is quantitative so can be used as is.

The regressor ‘Gender’ is categorical, and needs to be coded as dummy variable first (see ‘How
to in SPSS guide’ session 7, part 1).

To run the logistic regression:

 Analyze – Regression – Binary Logistic


 As dependent variable select: ‘Prom’
 As independent variables (Covariates) select: ‘YrsExper’ and ‘Gender_dum’
 Click ‘OK’
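
A minimal syntax sketch of this model:

    * Binary logistic regression of Prom on experience and the gender dummy.
    LOGISTIC REGRESSION VARIABLES Prom
      /METHOD=ENTER YrsExper Gender_dum.

(Adding the line /SAVE=PRED before the final period saves the predicted probabilities, which is
what part 4 below does through the menus.)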

The results of the model are listed under ‘Block 1’. For the interpretation of the tables, please
refer to the lecture notes of session 8 (slides 17—30).

PART 4: Predicting probabilities and other diagnostic tasks

To predict the probabilities (slides 33—35), you can work with a hand calculator following the
steps on slide 34. However, SPSS can do these calculations for you, in the same way as we did
for the linear regression approach (how to guide session 6, part 3). Basically, you enter the
values of your predictors for the NEW case that you want to predict, which has NO
observation for the dependent variable (here: ‘Prom’).

Hence, go ahead and enter the two new cases (1) female employee with 5 years of experience
(slide 33), and (2) male employee with 5 years of experience (slide 35) at the bottom of the data
file in ‘Data View’. Do NOT fill out a value for the dependent variable ‘Prom’. See following
screen shot:

Rerun the logit estimation, following the above steps. However, before clicking ‘OK’, click on the
menu item ‘Save’ and select ‘Probabilities’ from the ‘Predicted Values’ area (see next screen
shot).

Then click ‘Continue’ and click ‘OK’ in the ‘Logistic Regression’ window.

Your predictions are now added to the datafile in the newly created variable ‘PRE_1’. Scroll all
the way to the bottom in ‘Data View’ and you will find the predictions that we computed in class
by hand.

SPSS can also compute the confidence intervals of the predictions for you (which is important
to ask for!). However, this cannot be done from this menu option, so I consider this, for logistic
regression, beyond the scope of the class. We would need to use another menu option in SPSS
(Analyze—Generalized Linear Models, which has a binary logistic option), which is considerably
harder to use than the Binary Logistic menu option discussed above for those of you who do not
have a lot of experience with regressions. For those of you who want to know more, please come
and talk; I’ll be happy to discuss it with you during the next SPSS lab.

Statistics and business analytics
Fall 2018
Prof. Ebbes

‘HOW TO’ IN SPSS

Session 9

PART 1: Basics of Factor Analysis (FA)

PART 2: Running a FA

PART 3: Interpreting a FA solution

PART 4: FA goodness of fit

PART 5: Putting FA to work

In this guide, I will discuss how to run factor analysis using SPSS. The dataset that we used in
session 9 is on Blackboard. What you will notice is that there are quite a few “subjective”
decisions for the analyst to make (e.g. the number of factors to extract) in performing a factor
analysis, as we discussed in class. Feel free to play around with different choices (I will indicate
where you could do this). Come and chat if you have any questions or concerns regarding these
analyses.

PART 1: Basics of Factor Analysis (FA)

Please refer to previous ‘How to in SPSS guides’ to obtain the regression model results for the
two regression models discussed in part 1.

PART 2: Running a FA

PART 3: Interpreting a FA solution

PART 4: FA goodness of fit

The complete factor analysis output (see also supplemental handouts pp8—12) can be
obtained rather easily. Here is how it goes:

 Analyze – Dimension Reduction – Factor


 Select all variables that you want to run the factor analysis on in the box ‘Variables’.
Here: V2—V31 (we shouldn’t include V1 as that is not a psychographic variable)
 To get the scree plot, click ‘Extraction’ and select ‘Scree plot’
 To get the rotated solution, click ‘Rotation’ and select ‘Varimax’ (usually works best)
 Click ‘OK’
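
A syntax sketch of this analysis (assuming V2 through V31 are consecutive in the file, so that
‘V2 TO V31’ selects them all):

    * Factor analysis with the eigenvalue criterion, scree plot, and varimax rotation.
    FACTOR
      /VARIABLES V2 TO V31
      /PLOT EIGEN
      /CRITERIA MINEIGEN(1)
      /EXTRACTION PC
      /ROTATION VARIMAX.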

That should give you the outputs given on pages 8—12 of the supplemental handout. As you
can see, this was pretty easy to do. The interpretation is more challenging (and very important!).
Please refer to the lecture notes for that (parts 3 and 4).

PART 5: Putting FA to work

We can have SPSS compute the factor scores (slides 35 and 36), for subsequent use in a
regression model or any other multivariate statistical technique (e.g. cluster analysis in the next
session). To have SPSS compute the factor scores, go back into the same menu as before, and
in addition to the preceding:

 Click on ‘Scores’
 Select ‘Save as variables’ and use ‘Regression’ (these are just different ways to
compute the factor scores)
 Click ‘Continue’ and then ‘OK’

Now if you go to ‘Data View’ or ‘Variable View’ in SPSS, you will see that SPSS added 9 new
variables at the end of the spreadsheet (FAC1_1, FAC2_1,…,FAC9_1). You could change the
label (in ‘Variable View’) to the description of the factors that you came up with (slide 30).

These 9 variables are new quantitative variables. Hence, the first thing you should (always!) do
with new data/variables is to ‘become friends’ with them! For instance, you could compute basic
descriptive statistics for these 9 new quantitative variables. Also, it would be useful to
graphically analyze each of them (e.g. boxplot or histogram). What did you find? Anything
surprising?

Once that is done, you can run a regression model with these 9 factors as independent
variables and V1 as dependent variable to obtain the regression model given on slide 37.

Extracting a different number of factors in PART 2: 8 instead of 9 factors

The preceding steps will give you the 9 factor solution discussed in class. The choice for 9
factors was based on: (1) the eigenvalue criterion (larger than 1); (2) elbows in the scree plot
and (3) the cumulative variance explained (see part 2).

In our interpretation of the 9th factor we found that only one original variable loads on it (slide
29). Perhaps we should also examine an 8 factor solution.

To run an 8 factor solution in SPSS, proceed as follows (steps very similar to those above):

 Analyze – Dimension Reduction – Factor


 CLICK RESET (this erases all previous settings; good practice, just to be on the safe side)
 Select all variables that you want to run the factor analysis on in the box ‘Variables’.
Here: V2—V31 (we shouldn’t include V1 as that is not a psychographic variable)
 To get the scree plot, click ‘Extraction’ and select ‘Scree plot’
 In the same menu ‘Extraction’:
o select under ‘Extract’ the option ‘Fixed number of factors’
o then enter ‘Factors to extract: 8’
o click ‘Continue’
 To get the rotated solution, click ‘Rotation’ and select ‘Varimax’ (usually works best)
 Click ‘OK’
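
The corresponding syntax sketch only changes the criterion line:

    * Same analysis, but forcing an 8 factor solution.
    FACTOR
      /VARIABLES V2 TO V31
      /PLOT EIGEN
      /CRITERIA FACTORS(8)
      /EXTRACTION PC
      /ROTATION VARIMAX.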

Practice questions:

(1) How does the 8 factor solution compare to the 9 factor solution? Does the interpretation of
the factors differ?

(2) How ‘good’ is the 8 factor solution (part 4)?

(3) How (if at all) do the insights for marketing change (part 5)? What aspects drive interest in the
Dodge Viper? What would you recommend to the marketing team at Dodge based on the eight
factor solution?

Statistics and business analytics
Fall 2018
Prof. Ebbes

‘HOW TO’ IN SPSS

Session 10

PART 1: Basics of clustering and K-means

PART 2: Choosing the number of clusters

PART 3: Case – MBA music market

PART 4: Clustering wrap-up

APPENDIX: K-means in SPSS

The app for cluster analysis:

http://rstudio-test.hec.fr/kmeans/

PART 1: Basics of clustering and K-means

The descriptive statistics table (for quantitative variables) with the average music preferences
can be obtained using SPSS’ menu Analyze-Descriptive Statistics.

The data for the hypothetical example is not available. See below how to run K-means for the
class data.

PART 2: Choosing the number of clusters

We discussed how to choose the number of clusters for K-means. This can be done through
cross-validation approaches. I will demonstrate below for the class data how this can be done
using the app.

Please carefully review this part of the lecture so that you know how to interpret the output.

PART 3: Case – MBA music market

We provide an app that runs in ‘the cloud’ that will run a K-means cluster analysis for you, as
discussed in class (session 10). The app can be accessed from your browser.

Before you can run the app, you first need to convert the SPSS file into a CSV (Comma
Separated Values) file, which is discussed in topic 3A. We then discuss in topic 3B how to use
the app.

Topic 3A: Prepare in SPSS the data file for the app to run the cluster analysis

The app currently only takes CSV (Comma Separated Values) files as input, and the CSV file
needs to be in a special format:

 the first column of the data file has to be a respondent ID column


 the next 𝐿 columns are the variables you want to use for clustering (e.g. the 16 genres
variables in the music case discussed in class)
 the first row of the CSV file needs to contain the variable names
 the subsequent rows are the responses given by the respondents

You can prepare this file in SPSS. I am taking the SPSS file containing the music preferences
(Blackboard; session 10) as starting point. This file contains the following 19 variables:

The data file that needs to be uploaded to the app should have the RESPID variable as first
variable, and then the 16 genre variables that will be used as the basis to form the clusters.
Hence, we need to delete the variables ALIAS and INTAKE.¹

 In SPSS ‘Variable View’ select the two variables ALIAS and INTAKE
 Press the delete key on your keyboard

Next, you need to export the SPSS datafile as a CSV file. When you export the SPSS file as
CSV file, you can automatically include the variable names (listed in the column ‘Name’) in the
first row of the CSV file (see below).

You export the data into a CSV file using SPSS as follows:

 File – Save As
 Select for “Save as Type” the type Comma delimited (*.csv)
 Select “Write variable names to file” [[ this option puts the variable names into the first
row of the CSV file and has to be selected ]]
 Choose a location and file name and click ‘Save’
 See following screenshot for an example
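
The export can also be sketched in syntax (the file path below is just a placeholder; substitute
your own location):

    * Write the active dataset to a CSV file, variable names in the first row.
    SAVE TRANSLATE OUTFILE='C:\mydata\music_preferences.csv'
      /TYPE=CSV
      /FIELDNAMES
      /REPLACE.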

¹ If your study happens to have additional variables in the SPSS file, you should delete all of those. You
should only retain (1) a respondent ID variable and (2) the variables that you will use for clustering.

Inspect the CSV file that you just created!

You need to inspect the CSV file that you just created, otherwise the app may produce an error.

Open the CSV file in a text editor (e.g. on PC: notepad or notepad++, on Mac: TextEdit).

In the text editor, inspect the following aspects (follow these steps in order):

(1) first row contains the variables names

(2) next rows are your observations

(3) first column is respondent ID

(4) remaining columns are the variables you want to cluster on

(5) [[ IMPORTANT ]] if there are digits after the decimal point, make sure that your decimal
separator is actually a POINT (.) and not SOMETHING ELSE (e.g. a comma).

Depending on local settings on your laptop (e.g. French laptops), your machine may use a
comma (,) as a decimal separator.

If your decimal separator is not a point, then do a standard ‘search-replace’ by replacing the
incorrect decimal separator (e.g. ,) by a dot (.)

(6) [[ IMPORTANT ]] check that your observations are SEPARATED by COMMA’s and not by
SOMETHING ELSE.

Depending on local settings on your laptop (e.g. French laptops), your machine may use a
different symbol to separate columns (e.g. a ;).

If your observations are not separated by a comma (,), then do a standard ‘search-replace’ by
replacing the incorrect symbol (e.g. ;) by a comma (,)

For example, consider the following screenshot of a correctly formatted CSV file. The
annotations point out that: the first row contains the variable names, separated by commas;
each subsequent row is a row in your data file; values are separated by commas; the first
column is the respondent ID, and the remaining columns are the variables to cluster on; and
decimals should be separated by a dot, for instance 6.00 or 8.43 (not applicable here).

Topic 3B: Using the app for cluster analysis

The app can be accessed by pointing your browser to the following address:

http://rstudio-test.hec.fr/kmeans/

The only menu option visible on the left side when you first enter the app is the load file option.
Follow the instructions in step 1 below to upload a data file to the app.

Step 1 – upload data and verify data upload

Once the app is loaded in your browser, upload the CSV file that you just created (in the menu
on the left, click ‘Browse…’).

After the upload of the data has been completed, a summary table of the basic data descriptives
appears in the ‘Basic data descriptives’ tab. It is important to inspect that table, and to make
sure the numbers in the table are the same as those you would get in SPSS running basic
descriptive statistics (in SPSS: Analyze-Descriptive Statistics-Descriptives) on the original music
preferences SPSS file (Blackboard; session 10).

Note that the variables that you want to use for clustering have to be quantitative. Furthermore,
they should roughly be measured on the same scale. If there are large scale differences
between your variables, you should consider standardizing your variables first in SPSS², re-run
the K-means analysis on the standardized variables, and compare the results to the
unstandardized case. Lastly, if you have a few dichotomous (0/1) categorical variables along
with many quantitative variables, then you could include the dichotomous categorical variables
in your K-means analysis as well.

² Analyze – Descriptive Statistics – Descriptives; select the box ‘Save standardized values as
variables’ (bottom-left).

Furthermore, once you have uploaded the data file, a second menu ‘Cluster solution input’
appears on the left side of the app, with four options:

(1) you can select the variables to cluster on. By default, the app selects all variables,
except the first column (observation ID). If you do not want to include all variables for
clustering, you can select those variable(s) and delete them from the list.

(2) you can specify the number of clusters you want in your cluster solution. The app will
automatically run for two clusters (𝐾 = 2), however, choosing the number of clusters is
not easy (e.g. lecture notes part 2), and this default setting may not be best for your
application. If you have not yet examined your data to get a feel for the number of
clusters, you should first run a cross-validation analysis (step 2 below), before changing
this slider.

(3) you can select the option ‘Create basic graphs of the cluster solution’. This is usually
done after one has run the cross validation, and this option creates the graphs for
visualizing the cluster solution that we discussed in class (e.g. lecture notes part 4,
handout p10).

(4) you can ask the app to perform cross validation, to get a feel for the number of
clusters in your data. This is an important first step after loading your data into the app, if
you do not have prior knowledge about the number of clusters in your data (lecture notes
part 2).

Step 2 [[ optional ]] – get a feel for the number of clusters (tab: Crossvalidation number
of clusters)

In the menu on the left, select the option ‘Run cross validation to help decide on number of
clusters’ and go to the tab ‘Crossvalidation number of clusters’. Once selected, you can change
the slider to input the maximum number of clusters you want to examine (the default is 10 but
for the class example, I set the slider to 15).

The crossvalidation analysis will run automatically. Depending on your sample size, the number
of variables, and the maximum number of clusters you want to test, this could take some time.

The app produces the output that was discussed in class (lecture notes part 2, handout p7),
which can be downloaded.

Step 3 – run K-means; inspect and download the cluster solution (tab: Output summary
K-means)

As mentioned above, the app automatically runs the 𝐾 = 2 solution (‘Number of clusters in
cluster solution’). You can drag the slider to get the cluster solution for a different choice of 𝐾.
The cluster solution output is presented in the tab ‘Output summary K-means’. You will find the
‘Cluster sizes’, the ‘Cluster centers’ and the ‘Predicted cluster memberships and distances’. All
can be downloaded, if needed.

The results in class for the MBA music preference data were obtained for 𝐾 = 4 (lecture notes
part 3; handout pp8—12). When you select 𝐾 = 4, you should be able to reproduce the output
tables discussed in class.

Note that your results may differ up to a permutation of the cluster labels (e.g. your cluster 1
could be cluster 4 in class etc.). Furthermore, as the app runs K-means with randomized starts,
there may be some (hopefully) small differences in your final cluster solution and the solution
discussed in class (see lecture slides part 3, handout pp8—9).

Step 4 [[ optional ]] – obtain basic visualization of the cluster solution (tab: Visualize
cluster solution)

You can select the option ‘Create basic graphs of the cluster solution’ in the menu on the left to
visualize the cluster solution by variable pairs. The graphs are presented in the tab ‘Visualize
cluster solution’. All graphs can be downloaded, if needed. This option may be useful in
communicating insights from the cluster solution to the client. The graphs allow for a visual
inspection of how well the clusters are separated and how meaningful the cluster solution
potentially is for decision making.

For the class example, see lecture notes part 3 or handout p10.

The top two graphs in the tab ‘Visualize cluster solution’ are density plots for each variable, by
cluster (colors). A density plot may be seen as a “smooth” version of a histogram. For instance,
if you specify four clusters, you get four density plots in one graph. This graph will help to
visually inspect how different the clusters are for that variable. The less the density plots
overlap, the more the variable differs across clusters.

The bottom graph is the scatter plot of the two selected variables. The colors indicate to which
cluster each observation belongs. This graph also helps in visually investigating and presenting
the cluster solution for this variable pair.

Please note that the metrics at the bottom of the page are only useful to guide selecting variable
pairs for visualization (the higher the number, the more different the clusters are for those two
variables, in a relative sense). These numbers cannot be used to select the number of clusters
𝐾, nor can they be used to judge how well the clusters are separated per se.

NOTE: the application times out after one hour. That means, if you leave it idle for one hour it
will disconnect you from the cloud server. You will lose all your results and will have to rerun the
app.

It is good practice to immediately download all the result files and graphics once you are
satisfied with the cluster solution.

APPENDIX: K-means in SPSS

SPSS has a K-means option (Analyze—Classify—K-Means Cluster). As discussed in class, I
would not recommend using it (lecture notes slide 29; appendix 2). There are two reasons for
that.

First, SPSS does not randomize starting values. In a high dimensional space (here we describe
each observation with 16 variables, so we would have a 16 dimensional space) starting values
for the K-means algorithm are important and drive the final cluster solution. SPSS appears to
have a rather poor man’s approach to this. As it turns out, and in particular for small(er) sample
sizes, the final outcome of K-means clustering using SPSS’ starting values procedure will
heavily depend on the order of the data (i.e. the order of the rows in Data View in SPSS). This is
a very undesirable property as in many applications the order of the data has no meaning, and
our final cluster solution should not depend on it.

Second, SPSS does not have a built-in procedure to get a feel for the number of clusters and
assess the robustness of the cluster solution, such as the cross-validation approach.

While there are workarounds for each of these limitations in SPSS, there is plenty of other
user-friendly software available, such as the app discussed in these ‘How to guides’, that would
run a cluster analysis for you. Here, starting values are appropriately randomized and often a
form of cross-validation, or other model selection tools, can be obtained to get a feel for the
number of clusters.
