Statistics and business analytics
Fall 2018
Prof. Ebbes
Session 1
Topics
PART 1: Hypothesis basics
The hypothesis test outlined in the 6 steps (session 1, slides 12—23) can be carried out rather
quickly in SPSS. However, it is still good to write down the 6 steps on scratch paper,
particularly if you haven't done this a whole lot (see also below).
Once you have the tables, it is good practice to go through the 6 steps of hypothesis testing:
(1) Specify the null and alternative hypotheses: H0: µ = 70 and H1: µ > 70
(2) Choose the significance level: e.g. α = 5%
(3) Compute the test statistic: t = 1.385 (this is equivalent to the Z-test value on slide 16,
session 1)
(4) Compute the P-value: SPSS' P-value is 0.166 (listed in the table above under "Sig.
(2-tailed)"). Note that this is a two-sided (!) hypothesis test. In other words, SPSS
automatically assumes a two-sided hypothesis test (see discussion slides 13 and 20,
session 1). In class we used a one-sided test. Hence, we need to divide SPSS' P-value
by 2, so the P-value for our test is 0.083.
(5) Make a statistical decision: the P-value = 0.083 is larger than the significance level
α = 5% = 0.05. Hence, we do NOT reject the null hypothesis listed in (1) above.
(6) Make a business conclusion (see lecture notes slide 22)
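The halving in step (4) is easy to get wrong, so here is a minimal Python sketch of the conversion. The t = 1.385 and Sig. = 0.166 values come from this guide; the function itself is my own illustration, not an SPSS feature:

```python
def one_sided_p(t_stat, sig_two_tailed):
    """Convert SPSS' two-sided 'Sig. (2-tailed)' into a one-sided P-value
    for H1: mu > mu0. Halving is only valid when the sample mean lies on
    the H1 side (here: t > 0)."""
    if t_stat > 0:
        return sig_two_tailed / 2
    return 1 - sig_two_tailed / 2

p = one_sided_p(1.385, 0.166)   # 0.083, as in step (4) above
```

If the sample mean had fallen on the H0 side (negative t here), the one-sided P-value would be 1 minus half the two-sided value, which is why the direction check matters.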
Remark 1: if you wanted to test the two-sided hypothesis H0: µ = 70 and H1: µ ≠ 70, you can use
the P-value given by SPSS as is.
Remark 2: in class we used the Z-test formula (slide 16) and the normal distribution to compute
the P-value (slides 17—20), whereas SPSS uses the t distribution. For large samples, the
results are nearly identical. For small samples, SPSS' results are more reliable (see discussion
slide 31, session 1).
Unfortunately, we will not be able to use SPSS for the test outlined on slides 26—30 (session
1). While we could "abuse" SPSS to compute a confidence interval for a proportion (categorical
variable), as in the "How to guide" of data science camp day 2 (part 5), here we cannot. Hence,
we will have to compute the test by hand, like we did in class.
[[ Explanation (for those of you interested): the reason is that this procedure is programmed in
SPSS for quantitative variables only. However, the formula we use to test a hypothesis about a
proportion (slide 28) is quite different from the formula to test about a mean (slide 16).
Importantly, SPSS does not plug in the value specified in the null hypothesis in step 1 (the π0
on slide 27) into the denominator of the formula on slide 28. In other words, SPSS uses by
default (wrongly!)

    z_test = (p - π0) / sqrt( p(1 - p) / n )

instead of (correctly!)

    z_test = (p - π0) / sqrt( π0(1 - π0) / n )
As said before, this is because SPSS is not programmed to carry out this test for categorical
variables. The "annoying" thing is that it will still give you something in the output if you run the
analysis! So it has no built-in "safety check" to warn you of a mistake. Hence, we cannot be
lazy and have to do the labor ourselves in this particular case. In the next class meeting we will
learn about another test that we could use instead for categorical variables. ]]
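Since SPSS cannot run this test, the by-hand computation is worth sketching. Here is a hedged Python version; only the formula comes from slide 28, and the counts below are made up for illustration:

```python
import math

def z_test_proportion(p_hat, pi0, n):
    """z-test for H0: pi = pi0. Note that pi0 (the null value), not the
    sample proportion p_hat, goes into the denominator (slide 28)."""
    return (p_hat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)

def two_sided_p(z):
    """Two-sided P-value from the standard normal distribution."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical example (not the class data): 130 successes out of
# n = 200, testing H0: pi = 0.60.
z = z_test_proportion(130 / 200, 0.60, 200)
p = two_sided_p(z)
```

For a one-sided alternative you would halve the two-sided P-value, exactly as discussed for the t-test above.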
Statistics and business analytics
Fall 2018
Prof. Ebbes
Session 2
Topics
PART 1: Sample size calculations
The hypothesis test outlined in the 6 steps (session 2, slides 20—32) can be carried out rather
quickly in SPSS. However, it is still good to write down the 6 steps on scratch paper,
particularly if you haven't done this a whole lot.
I can unfortunately not share the data with you for the Bike share case shown in class.
Therefore, I will demonstrate this technique using the insurance fraud case where we also have
a variable gender (see data science camp day 2;
data_science_day_2_insurance_fraud_web.sav).
As I don’t know the “true” population values for the insurance case, let’s just take the same
proportions as used in class, that is, let’s say the proportion of male and female clients at the
insurance company in the population is 0.40 and 0.60, respectively.
Now let's test the following null hypothesis (0 = male, 1 = female): H0: π0 = 0.4 and π1 = 0.6.
This is how to get it out of SPSS (note: good practice – try to do this by hand!)
o Select ‘Values’
o Enter: 0.40 then click ‘Add’
o Then enter: 0.60 and click ‘Add’
Click ‘OK’
These two tables are the results from your test (the 6 steps of hypothesis testing):
The observed N column in the above table is the same as in your frequency table
(previous page)
The Expected N column contains the Ei's that we used in class to compute the chi-square
formula (session 2 slide 25)
The Residual column is just the difference between observed and expected
The second table (labeled ‘Test Statistics’) shows:
o the test statistic – this is what you would get if you compute the chi-square by
hand (e.g. slides 24—26, session 2)
o the degrees of freedom (df) – what you need to compute the P-value (slide 28
session 2)
o The P-value – indicated by 'Asymp. Sig.', which SPSS computes automatically for
you (by hand, you would look up the area to the right of 157.872 under the chi-square
distribution with 1 degree of freedom, e.g. slide 29 session 2)
Note the warning below the second table (footnote 'a'); here things are fine (lecture
notes slide 30)
Hence, we would reject the null hypothesis that π0 = 0.4 and π1 = 0.6, because the P-value =
0.000, which is smaller than the significance level (α = 0.05).
Good practice is to write down the 6 steps completely, here they are:
α = 0.05
chi-square = 157.872
P-value = 0.000
The sample proportions for men (0.492) and women (0.507) are significantly different
from 0.4 and 0.6, and the sample is not representative with respect to the variable gender (chi-
square = 157.872, P-value = 0.000). We oversampled the men and undersampled the women [[
assuming that we actually knew that this was the truth, which we don't in this particular example,
as mentioned above ]].
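The arithmetic behind the Expected N column and the chi-square statistic can be sketched in a few lines of Python. The counts below are made up (the insurance data are not reproduced here); only the formula matches what we did in class:

```python
def chi_square_gof(observed, null_props):
    """Chi-square goodness-of-fit: sum of (Oi - Ei)^2 / Ei, where
    Ei = n * pi_i is the 'Expected N' column in the SPSS table."""
    n = sum(observed)
    expected = [n * p for p in null_props]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected

# Made-up counts: 4920 men and 5080 women, testing
# H0: pi_male = 0.4, pi_female = 0.6.
chi2, expected = chi_square_gof([4920, 5080], [0.4, 0.6])
```

The degrees of freedom are the number of categories minus 1 (here: 2 − 1 = 1), which is what you need to look up the P-value.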
Statistics and business analytics
Fall 2018
Prof. Ebbes
Sessions 3
Topic: bivariate statistics for one categorical variable and one quantitative variable
PART 1: Bivariate statistics
Click ‘OK’
Now you can fill out the formula on that slide, to carry out the hypothesis test H0: µ1 = µ2 by
hand. However, SPSS can do this for you as follows:
The first table (“Group Statistics”) contains the same descriptive statistics as the table above
(e.g. on slide 14). The second table (“Independent Samples Test”) contains the results of
the test.
Explaining the full table is beyond the scope of the class. I will tell you which parts of the table
correspond to what we learned in class; the other parts we won't use.
For this class, you only need to use the row that says “Equal variances NOT assumed” (i.e.
the second row). And, we only need to use the column ‘t’ and the column ‘Sig. (2-tailed)’.
Now we have enough information for the 6 steps of hypothesis testing:
Step 1: formulate the null and alternative hypothesis – see lecture notes slide 11
Step 3: compute the test-statistic – t = -11.641 (same as we computed in class on slide 14)
Step 4: get the P-value – P-value = 0.000 (note: SPSS by default gives you 2-tailed!) (in
class we use PQRS to get the P-value, see slide 16)
Step 5: make a statistical decision – here the P-value is LESS than the significance level, so
we REJECT the null hypothesis
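The "Equal variances NOT assumed" row reports Welch's version of the two-sample t-statistic, which you can also compute from the group summary statistics. A sketch with made-up means, SDs, and sample sizes (not the class data):

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t-statistic for H0: mu1 = mu2 -- what the 'Equal variances
    not assumed' row of SPSS' Independent Samples Test reports."""
    return (mean1 - mean2) / math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Hypothetical group summaries (all numbers are made up):
t = welch_t(52.0, 18.0, 120, 61.5, 22.0, 140)
```

Plugging in the Group Statistics table's means, standard deviations, and Ns should reproduce SPSS' 't' column up to rounding.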
The means plot on slide 20 (session 3) is obtained using similar steps as above using the
variable ‘edcat’ as the ‘Category Axis’, instead of ‘retire’.
I am not producing here the ANOVA results on slides 24—29 as we concluded in class that the
ANOVA assumptions (slide 27) are not valid.
However, we proceeded by computing the log transform (slide 30). This goes as follows.
You first need to compute a new variable, say ‘log_claim_amount’, that computes the natural log
of ‘claim_amount’ – this can be done with SPSS’ function ‘Transform – Compute Variable’. We
discussed this option in the appendix of the ‘how to guide’ of day 2 of data science camp (check
this ‘how to guide’ if you forgot how to use SPSS’ compute function).
Once you have done that, you can obtain the ANOVA table on slide 30 (session 3) as follows:
As the P-value < 0.05, we reject the ANOVA’s null hypothesis of equal means (you should write
down the 6 steps of hypothesis testing (slides 24—26) to be complete).
When the ANOVA null hypothesis is rejected, we know that at least one of the equalities does
not hold, but we don't know which one. This can be discovered with a mean comparison
(slides 33 and 34). Obtaining a mean comparison in SPSS is easy; you just need to add one
option to the preceding steps:
Along with the preceding table, you now also get the mean comparison table (last page of this
document). Interpreting this table is a bit harder, but we discussed this in class.
We need to make sure the assumptions of the ANOVA are valid and we can produce the figure
on slide 28 (31) and the table on slide 29 for the log-transformed variable.
The table on slide 29 (session 3) is produced as follows:
Did you get:
On the next page, you will find the full table with multiple comparisons (slide 34, session 3)
Statistics and business analytics
Fall 2018
Prof. Ebbes
Session 4
PART 1: Bivariate statistics
The cross tab on slide 13 including the chi-square test to test the null hypothesis on slide 15
(session 4) can be produced as follows.
Analyze—Descriptive Statistics—Crosstabs…
Choose the categorical variable ‘fraudulent’ in the columns
Choose the categorical variable ‘claim_type’ in the rows
Then, to compute percentages, go to ‘Cells’
o Under Percentages, click the desired percentage (here: rows)
o Click Continue
To compute the chi-square test and P-value (slides 17— 20 and slides 21—23 of
session 4, respectively):
o Click ‘Statistics’
o Select ‘Chi-square’ (top-left)
o Click ‘Continue’
Click OK
To interpret the results of the chi-square test, cycle through the 6 steps:
Step 1: formulate the null and alternative hypothesis – see lecture notes slide 15
Step 3: compute the test-statistic – chi-square = 29.996 (apart from minor rounding the
same as we computed in class on slide 20)
Step 4: get the P-value – P-value = 0.000 (in class we found the same number using PQRS
on slide 22)
Step 5: make a statistical decision – here the P-value is LESS than the significance level, so
we REJECT the null hypothesis
NOTE: the warning on slide 23 (session 4) is important to inspect! This is given in the table
above in footnote 'a', where it says "0 cells (0.0%) have … count is 42.37". Basically, SPSS
is saying that the smallest expected cell frequency (Ei) is 42.37. Hence, we are good to go,
because they are all larger than 5. We can have a few Ei less than 5, but no more than 20% (the
Ei's can also be computed by hand, see calculations slide 19).
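The by-hand computation of the Ei's and the chi-square statistic can be sketched as follows. The 2x2 table below is made up (it is not the fraud data); only the formulas follow the class slides:

```python
def expected_counts(table):
    """Expected cell frequencies Eij = (row total x column total) / N,
    as computed by hand on slide 19."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]

def chi_square_independence(table):
    """Chi-square statistic for independence in a cross tab."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for orow, erow in zip(table, exp)
               for o, e in zip(orow, erow))

# Made-up 2x2 cross tab:
table = [[80, 120], [40, 160]]
chi2 = chi_square_independence(table)
# SPSS' footnote rule: worry if more than 20% of the Ei's are below 5
small = [e for row in expected_counts(table) for e in row if e < 5]
```

An empty `small` list corresponds to SPSS' "0 cells (0.0%) have expected count less than 5" footnote.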
To graphically display a cross tab we could use a clustered bar chart or segmented bar chart
(see ‘How to guide’ data science camp, day 1, part 4).
Statistics and business analytics
Fall 2018
Prof. Ebbes
Session 5
PART 1: Bivariate stats for La Quinta (case)
The boxplot and key descriptive statistics (‘become friends with your data’) on slide 9 (session
5) can be computed using methods we learned before (see how to guide data science camp
day 2).
PART 2: Correlations
The scatter plots on slides 14 and 15 are created similarly. You could also use SPSS chart
builder.
To compute bivariate correlation coefficients r, i.e. the table on slide 16 (session 5), proceed as
follows:
Click ‘OK’
This produces the table below, which includes the sample correlation coefficients ('r') and
the relevant information to test H0: ρ = 0 for each pair of variables.
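Behind that table, each r can be computed by hand. A minimal sketch with made-up data (not the La Quinta case):

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r between two quantitative
    variables (the entries of SPSS' Correlations table)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Tiny made-up example:
r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Note that r is symmetric in x and y, which is why the Correlations table is mirrored around its diagonal.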
No new SPSS techniques were used in this part. Please refer to the previous part (part 2) to
create the scatter plots discussed in part 3. The lines were added to facilitate class discussion;
in general, it is not necessary to include these lines in scatter plots.
For the interpretation of these tables refer to slides 40—44 of session 5. Using similar steps,
you should be able to produce the tables on slides 45 and 46.
Statistics and business analytics
Fall 2018
Prof. Ebbes
Session 6
PART 1: Multiple linear regression
For the interpretation of the three tables, please refer to slides 10—19 of session 6.
1. Checking variable types – this can be done by examining the scale level of your variables. It
should be indicated in SPSS (Variable View) that each variable in the regression model is
‘Scale’.
2. The goodness of fit and P-values are produced by SPSS and listed in the output tables
above.
3. Whether or not there is a linear relationship between Y and the six X's can be examined by
scatter plots. You should create six scatter plots, between Y and X1, Y and X2, …, and Y and
X6, using the steps listed in the how to guide of session 5 (part 2).
4. Residual analysis – your residuals (a.k.a. errors) should 'behave', meaning they have an
approximately normal distribution and there should be a few outliers at most.
a. The residual plot on slide 29 can be obtained in the ‘Linear Regression’ window
(follow the steps in part 1 above), but before clicking ‘OK’, click on ‘Plots’ and then
select ‘Histogram’ in the ‘Standardized Residual Plots’ area (bottom left).
b. The residual table on slide 30 is obtained by selecting the option 'Casewise
diagnostics' under 'Residuals' in the menu 'Statistics' in the 'Linear Regression'
window. For teaching purposes, I specified that a case be printed in the table if its
residual is outside TWO standard deviations; the default is to leave it at three.
5. Check for multicollinearity. You should inspect:
a. All the bivariate correlations between the independent variables. This can be done
following the steps outlined in the how to guide of session 5 (part 1). Did you find that
the largest (absolute) correlation is 0.15?
b. Compute collinearity diagnostics (VIF and tolerance). You will get them in the ‘Linear
Regression’ window under the menu ‘Statistics’ by selecting ‘Collinearity
diagnostics’. This should produce the table on slide 33.
To obtain predictions (slide 37, session 6), we could simply program the estimated regression
equation in, e.g., Excel and have Excel do the computations. Or, you could do the calculations
by hand on a calculator. However, neither Excel nor your hand calculations give you the
prediction interval, which you definitely need (slide 38)!
We can actually have SPSS do all the calculations for us, including the interval calculation, but
we need a little trick.
What you have to do is create a NEW case in your data file that has NO observation for the
dependent variable ('Margin') and, as values for the independent variables, the ones listed on
slide 36.
To create this NEW case, you will have to go to 'Data View' in SPSS, scroll to the bottom of the
spreadsheet, and type, in the 101st row, the values for the X variables listed on slide 36.
Now re-run the regression model in SPSS and tell it to also obtain predictions. Here is how that
goes:
Now you will get the exact same output tables as before. But there is no new table with the
predictions. So where can those predictions be found?
The predictions can be found in your data spreadsheet in SPSS – i.e. in ‘Data View’. What you
can see is that SPSS created THREE new columns and populated those with numbers.
PRE_1 – this is the predicted value Ŷ for each observation (i.e. each hotel)
LICI_1 – this is the lower bound of the prediction interval for each observation
UICI_1 – this is the upper bound of the prediction interval for each observation
For our particular case, the prediction is 37.09 and the 95% prediction interval is [25.40, 48.79].
See screen shot below.
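The point prediction in PRE_1 is just the estimated equation evaluated at the new case, which you can verify by hand. A sketch with a hypothetical equation (the coefficients and new case below are made up, not the Margin model):

```python
def predict(intercept, coefs, x_new):
    """Point prediction Y-hat = b0 + b1*x1 + ... + bk*xk -- the value
    SPSS stores in PRE_1 for the new case. The prediction interval
    (LICI_1, UICI_1) additionally needs a standard error, which is the
    part SPSS computes for us."""
    return intercept + sum(b * x for b, x in zip(coefs, x_new))

# Hypothetical estimated equation and new case:
y_hat = predict(10.0, [0.5, -2.0, 1.5], [20.0, 3.0, 4.0])
```

This is exactly the calculation you could also do in Excel or on a calculator; the trick with the 101st row is only needed to get the interval bounds.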
Statistics and business analytics
Fall 2018
Prof. Ebbes
Session 7
In this how to guide I will focus on creating dummy variables, interaction variables and
transformations of variables. These ‘new’ variables can then be used as independent variables
in a regression model. I explained in the ‘how to guide’ of sessions 5&6 how to run regressions
in SPSS.
The steps needed to create dummy variables are detailed on slide 20 of session 7. Here I will
suggest how to perform those steps with SPSS.
I will use the categorical variable ‘JobGrade’ (slide 18, 19) as an example.
Job grade has six categories, so we need to create 6 - 1 = 5 dummy variables to represent it.
Let’s take the last category as reference category (slide 19). We do not have to create a dummy
variable for the reference category.
Let's create the first dummy variable, for the first category. We use the following recoding
scheme:
How to interpret this table: all observations with a ‘1’ on ‘JobGrade’ will get a ‘1’ on the new
variable ‘JobGrade_dum1’. All observations with a 2,3,4,5, or 6 on ‘JobGrade’ will get a ‘0’ on
‘JobGrade_dum1’.
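The recoding logic can be sketched in Python to make it concrete. The function and the example values are my own illustration of the scheme above, not an SPSS feature:

```python
def make_dummies(values, categories, reference):
    """Recode a categorical variable into 0/1 dummies, omitting the
    reference category (here: JobGrade 1..6 with category 6 as the
    reference gives 5 dummies)."""
    return {c: [1 if v == c else 0 for v in values]
            for c in categories if c != reference}

# Hypothetical JobGrade values for six employees:
dums = make_dummies([1, 3, 6, 2, 1, 5], [1, 2, 3, 4, 5, 6], reference=6)
```

Each observation gets a 1 on exactly one dummy, except observations in the reference category (JobGrade 6), which get 0 on all five.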
o Name: JobGrade_dum1
o Label: Job grade 1 (dummy)
o Click ‘Change’
Then click ‘Old and New Values’
This will tell SPSS how to do the recoding as shown in the table above. In other
words, tell SPSS how the old values (1,2,3,4,5,6) will translate into values of the
new variable, i.e. 1,0,0,0,0,0, respectively
On the left hand side (Old value), type behind ‘Value’ the number 1
On the right hand side (New value), type behind ‘Value’ the number 1
Click ‘Add’
On the left hand side (Old value), select ‘All other values’
On the right hand side (New value), type behind ‘Value’ the number 0
Click ‘Add’
It should look like this:
Click ‘Continue’
You will return to this window:
Click ‘OK’
Now you will have to repeat this four (!) more times to create the dummy variables
‘JobGrade_dum2’, ‘JobGrade_dum3’, ‘JobGrade_dum4’, and ‘JobGrade_dum5’. I will
demonstrate the second one here, and leave the rest up to you.
Click ‘Continue’
You will return to this window:
Click ‘OK’
You will now need to create the dummy variables 'JobGrade_dum3', 'JobGrade_dum4', and
'JobGrade_dum5' following similar steps. Do not forget to click RESET (highlighted above in
red), otherwise it may get messy…
Similarly, you should create a dummy variable for the variable gender. This is considerably
simpler, as we need to create only one dummy variable to represent the categorical
variable gender. Because the females are the focus of the study, we take males as the
reference category. The old and new values are as follows:
On the right hand side (New value), type behind ‘Value’ the number 1
Click ‘Add’
On the left hand side (Old value), select ‘All other values’
On the right hand side (New value), type behind ‘Value’ the number 0
Click ‘Add’ and then ‘Continue’
Click ‘OK’
SPSS now created a new variable ‘Gender_dum’ in your spreadsheet. It shows up in ‘Data
View’ at the end of the spreadsheet, as well as in ‘Variable View’. Before continuing, go to
‘Data View’ and check for a few cases that the recoding was performed correctly (e.g. cases
1—5).
You have now created all the necessary dummy variables to run the regression model on slide
21 (session 7). Follow the instructions in the how to guide of session 6 (part 1&2) to run the
regression. Note that:
Creating interaction variables is considerably easier than creating dummy variables. The
interaction variable discussed on slide 27 (session 7) is the product of YrsExp and the dummy
variable for gender. To compute this new variable, proceed as follows:
As before, SPSS added the new variable 'YrsExperGender' at the end of your data file.
In Data View, inspect for a few cases that the computations were carried out correctly
(e.g. cases 1—5)
In Variable View, update the properties of the variable (e.g. type a label and update the
measure (‘scale’))
Computing transformations
I did not make the data available to you for the nonlinear regressions. However, the steps to
create nonlinear functions of your variables are identical to the previous steps in part 2 for
creating interaction terms (i.e. use Transform – Compute Variable).
For instance, to compute the logarithm of a variable, check out slides 74—77 of Data science
camp day 2.
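All of these Compute Variable operations are simple row-by-row formulas. A sketch of their logic in Python, with made-up values (the variable names mirror those in this guide, but the numbers are my own):

```python
import math

# Made-up values for three employees:
yrs_exper  = [2.0, 5.0, 10.0]
gender_dum = [1, 0, 1]          # 1 = female, 0 = male (reference)
salary     = [30000.0, 45000.0, 80000.0]

# Equivalents of SPSS' Transform - Compute Variable:
yrs_exper_sq     = [x * x for x in yrs_exper]           # YrsExper*YrsExper
yrs_exper_gender = [x * g
                    for x, g in zip(yrs_exper, gender_dum)]  # interaction
log_salary       = [math.log(s) for s in salary]        # natural log
```

As the guide advises, always spot-check a few cases in Data View after computing a new variable; the list comprehensions above make explicit what that check verifies.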
As an additional example, we could investigate if experience and salary are nonlinearly related.
The scatter plot between Salary and YrsExper may suggest this. That is, salary could increase
quadratically with experience instead of linearly. You could create the new variable YrsExperSq
by using the formula YrsExper*YrsExper and then estimate the model:
(coefficients that are underlined are significant with P-value < 0.05)
As discussed in class, the coefficients for years of experience are hard to interpret in a parabolic
model. But, for this simple equation we could create a similar graph as e.g. slide 16 (session 7).
[Figure: fitted curves of Salary against YrsExper (blue = males; orange = females)]
This model suggests a slight nonlinear relation between Salary
and Years of experience, where more experience tends to accelerate salary. We should
definitely inspect the scatter plot between ‘Salary’ and ‘YrsExper’ to back up this analysis – you
will notice that the scatter plot also suggests a somewhat non-linear relationship, but we do not
have a lot of data in the higher experience range. Interestingly though, the effect of gender
decreased a little compared to a model without experience-squared (e.g. slide 16 session 7).
However, we typically only use nonlinear models if economic theory dictates us to do so (e.g.
the constant elasticity model in Economics; slide 40 session 7), if we expect increasing or
decreasing returns to scale (e.g. ad spending on sales), or if linear regression model checks
(session 6 slide 23) suggest something is wrong with the linear model.
Otherwise, we prefer simple models to more complex models, and linear models are quite
straightforward to interpret and easy to use, as we saw in sessions 5, 6, and 7 of this class.
Besides, in many applications they perform quite reasonably in terms of predictions.
Statistics and business analytics
Fall 2018
Prof. Ebbes
Session 8
In this guide I will discuss how to run a logistic regression model in SPSS. The dependent
variable is categorical with two categories. The independent variables are quantitative. If an
independent variable is categorical, make sure to create dummies first and use the dummies as
independent variables, just as for the linear regression model (session 7).
The steps are relatively simple. The challenge with logistic regressions is the interpretation.
To estimate the logistic regression model on slide 14 (results on slide 15) proceed as follows.
You would first need to make sure the dependent variable (here: ‘Prom’) is coded as a dummy
variable (1/0). Here that is the case (see SPSS Variable View).
The regressor ‘Gender’ is categorical, and needs to be coded as dummy variable first (see ‘How
to in SPSS guide’ session 7, part 1).
The results of the model are listed under ‘Block 1’. For the interpretation of the tables, please
refer to the lecture notes of session 8 (slides 17—30).
To predict the probabilities (slides 33—35), you can work with a hand calculator following the
steps on slide 34. However, SPSS can do these calculations for you, in the same way as we did
for the linear regression approach (how to guide session 6, topic 3). Basically, you enter the
values of your predictors for the NEW case that you want to predict, which has NO
observation for the dependent variable (here: 'Prom').
Hence, go ahead and enter the two new cases (1) female employee with 5 years of experience
(slide 33), and (2) male employee with 5 years of experience (slide 35) at the bottom of the data
file in ‘Data View’. Do NOT fill out a value for the dependent variable ‘Prom’. See following
screen shot:
Rerun the logit estimation, following the above steps. However, before clicking ‘OK’, click on the
menu item ‘Save’ and select ‘Probabilities’ from the ‘Predicted Values’ area (see next screen
shot).
Then click ‘Continue’ and click ‘OK’ in the ‘Logistic Regression’ window.
Your predictions are now added to the datafile in the newly created variable ‘PRE_1’. Scroll all
the way to the bottom in ‘Data View’ and you will find the predictions that we computed in class
by hand.
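The hand calculation on slide 34 can be sketched as follows. The coefficients below are made up for illustration (they are NOT those from slide 15); only the logistic formula is the one from class:

```python
import math

def predict_prob(intercept, coefs, x_new):
    """Predicted probability from a logistic regression:
    p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))) -- the value that
    SPSS saves in PRE_1."""
    eta = intercept + sum(b * x for b, x in zip(coefs, x_new))
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients: intercept, then YrsExper and the gender
# dummy (1 = female); two new cases with 5 years of experience.
p_female = predict_prob(-2.0, [0.25, 0.8], [5.0, 1.0])
p_male   = predict_prob(-2.0, [0.25, 0.8], [5.0, 0.0])
```

With the actual coefficients from the SPSS output, this reproduces the PRE_1 values for the two new cases at the bottom of the data file.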
SPSS can also compute the confidence intervals of the predictions for you (which is important to
ask for!). However, this cannot be done from this menu option, so I consider it beyond the scope
of the class for logistic regression. We would need to use another menu option in SPSS
(Analyze—Generalized Linear Models, which has a binary logistic option), which is considerably
harder to use than the Binary Logistic menu option discussed above, particularly for those of
you who do not have a lot of experience with regressions. If you want to know more, please
come and talk; I'll be happy to discuss it with you during the next SPSS lab.
Statistics and business analytics
Fall 2018
Prof. Ebbes
Session 9
In this guide, I will discuss how to run a factor analysis using SPSS. The dataset that we used in
session 9 is on Blackboard. What you will notice is that the analyst has quite a few "subjective"
decisions to make (e.g. the number of factors to extract) when performing a factor
analysis, as we discussed in class. Feel free to play around with different choices (I will indicate
where you could do this). Come and chat if you have any questions or concerns regarding these
analyses.
Please refer to previous ‘How to in SPSS guides’ to obtain the regression model results for the
two regression models discussed in part 1.
PART 2: Running a FA
The complete factor analysis output (see also supplemental handouts pp. 8—12) can be
obtained rather easily. Here is how it goes:
That should give you the outputs given on pages 8—12 of the supplemental handout. As you
can see, this was pretty easy to do. The interpretation is more challenging (and very important!).
Please refer to the lecture notes for that (parts 3 and 4).
We can have SPSS compute the factor scores (slides 35 and 36) for subsequent use in a
regression model or any other multivariate statistical technique (e.g. cluster analysis in the next
session). To have SPSS compute the factor scores, go back into the same menu as before, and
in addition to the preceding:
Click on ‘Scores’
Select ‘Save as variables’ and use ‘Regression’ (these are just different ways to
compute the factor scores)
Click ‘Continue’ and then ‘OK’
Now if you go to 'Data View' or 'Variable View' in SPSS, you will see that SPSS added 9 new
variables at the end of the spreadsheet (FAC1_1, FAC2_1, …, FAC9_1). You could change the
labels (in 'Variable View') to the descriptions of the factors that you came up with (slide 30).
These 9 variables are new quantitative variables. Hence, the first thing you should (always!) do
with new data/variables is to ‘become friends’ with them! For instance, you could compute basic
descriptive statistics for these 9 new quantitative variables. Also, it would be useful to
graphically analyze each of them (e.g. with a boxplot or histogram). What did you find? Anything
surprising?
Once that is done, you can run a regression model with these 9 factors as independent
variables and V1 as dependent variable to obtain the regression model given on slide 37.
The preceding steps will give you the 9 factor solution discussed in class. The choice for 9
factors was based on: (1) the eigenvalue criterion (larger than 1); (2) elbows in the scree plot
and (3) the cumulative variance explained (see part 2).
In our interpretation of the 9th factor we found that only one original variable loads on it (slide
29). Perhaps we should also examine an 8 factor solution.
To run an 8 factor solution in SPSS proceed as follows (steps very similar as above; changes
indicated in red):
Practice questions:
(1) How does the 8 factor solution compare to the 9 factor solution? Does the interpretation of
the factors differ?
(3) How (if at all) do the insights for marketing change (part 5)? What aspects drive interest in
the Dodge Viper? What would you recommend to the marketing team at Dodge based on the
eight factor solution?
Statistics and business analytics
Fall 2018
Prof. Ebbes
Session 10
http://rstudio-test.hec.fr/kmeans/
PART 1: Basics of clustering and K-means
The descriptive statistics table (for quantitative variables) with the average music preferences
can be obtained using SPSS’ menu Analyze-Descriptive Statistics.
The data for the hypothetical example is not available. See below how to run K-means for the
class data.
We discussed how to choose the number of clusters for K-means. This can be done through
cross-validation approaches. I will demonstrate below for the class data how this can be done
using the app.
Please carefully review this part of the lecture so that you know how to interpret the output.
We provide an app that runs in ‘the cloud’ that will run a K-means cluster analysis for you, as
discussed in class (session 10). The app can be accessed from your browser.
Before you can run the app, you first need to convert the SPSS file into a CSV (Comma
Separated Values) file, which is discussed in topic A. We then discuss in topic B how to use the
app.
Topic 3A: Prepare in SPSS the data file for the app to run the cluster analysis
The app currently only takes CSV (Comma Separated Values) files as input, and the CSV file
needs to be in a special format:
You can prepare this file in SPSS. I am taking the SPSS file containing the music preferences
(Blackboard; session 10) as starting point. This file contains the following 19 variables:
The data file that needs to be uploaded to the app should have the RESPID variable as its first
variable, followed by the 16 genre variables that will be used as the basis to form the clusters.
Hence, we need to delete the variables ALIAS and INTAKE.¹
In SPSS 'Variable View', select the two variables ALIAS and INTAKE
Press the delete key on your keyboard
Next, you need to export the SPSS datafile as a CSV file. When you export the SPSS file as
CSV file, you can automatically include the variable names (listed in the column ‘Name’) in the
first row of the CSV file (see below).
You export the data into a CSV file using SPSS as follows:
File – Save As
Select for “Save as Type” the type Comma delimited (*.csv)
Select “Write variable names to file” [[ this option puts the variable names into the first
row of the CSV file and has to be selected ]]
Choose a location and file name and click ‘Save’
See following screenshot for an example
¹ If your study happens to have additional variables in the SPSS file, you should delete all those.
You should only retain (1) a respondent ID variable, and (2) the variables that you will use for clustering.
Inspect the CSV file that you just created!
You need to inspect the CSV file that you just created, otherwise the app may produce an error.
Open the CSV file in a text editor (e.g. on PC: notepad or notepad++, on Mac: TextEdit).
In the text editor, inspect the following aspects (follow these steps in order):
(5) [[ IMPORTANT ]] if there are numbers behind the decimal, make sure that your decimal
separator is actually a POINT (.) and not SOMETHING ELSE (e.g. a comma).
Depending on local settings on your laptop (e.g. French laptops), your machine may use a
comma (,) as a decimal separator.
If your decimal separator is not a point, then do a standard 'search-replace', replacing the
incorrect decimal separator (e.g. a comma) with a point (.)
(6) [[ IMPORTANT ]] check that your values are SEPARATED by COMMAS and not by
SOMETHING ELSE.
Depending on local settings on your laptop (e.g. French laptops), your machine may use a
different symbol to separate columns (e.g. a ;).
If your values are not separated by a comma (,), then do a standard 'search-replace',
replacing the incorrect symbol (e.g. a semicolon) with a comma (,)
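If you prefer, this search-replace can also be scripted. A minimal Python sketch (the sample row is made up); note that the order of the two replacements matters, and that this simple approach assumes the file contains no other commas or semicolons, e.g. inside text values:

```python
def fix_locale_csv(text):
    """Turn a French-locale export (semicolon separators, decimal
    commas) into a standard CSV. Fix decimal commas BEFORE turning
    semicolons into commas."""
    text = text.replace(",", ".")   # decimal separator: comma -> point
    return text.replace(";", ",")   # column separator: semicolon -> comma

# One hypothetical header + data row:
fixed = fix_locale_csv("RESPID;ROCK;JAZZ\n1;3,5;4,0")
```

Doing the replacements in the opposite order would turn the semicolons into commas first and then destroy them along with the decimal commas, which is exactly the mistake the ordering avoids.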
Topic 3B: Using the app for cluster analysis
The app can be accessed by pointing your browser to the following address:
http://rstudio-test.hec.fr/kmeans/
The only menu option visible on the left side when you first enter the app is the load file option.
Follow the instructions in step 1 below to upload a data file to the app.
Once the app is loaded in your browser, upload the CSV file that you just created (in the menu
on the left, click ‘Browse…’).
After the upload of the data has been completed, a summary table of the basic data descriptives
appears in the ‘Basic data descriptives’ tab. It is important to inspect that table and to make
sure that the numbers in it are the same as those you would get by running basic descriptive
statistics in SPSS (Analyze-Descriptive Statistics-Descriptives) on the original music
preferences SPSS file (Blackboard; session 10).
Note that the variables that you want to use for clustering have to be quantitative. Furthermore,
they should be measured on roughly the same scale. If there are large scale differences
between your variables, you should consider standardizing your variables first in SPSS2, re-running
the K-means analysis on the standardized variables, and comparing the results to the
unstandardized case. Lastly, if you have a few dichotomous (0/1) categorical variables along
with many quantitative variables, then you could include the dichotomous categorical variables
in your K-means analysis as well.
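If you want to see what standardization does before running it in SPSS, here is a minimal Python sketch of z-scoring a single variable (subtract the mean, divide by the standard deviation); the function name is illustrative only:

```python
from statistics import mean, pstdev

def standardize(column):
    """Return the z-scores of a list of numbers. Afterwards the
    variable has mean 0 and (population) standard deviation 1, so it
    contributes on a comparable scale to K-means' Euclidean distances."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]
```

Note that this sketch divides by the population standard deviation; SPSS’ saved z-scores divide by the sample standard deviation, so the values can differ slightly in small samples.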
Furthermore, once you have uploaded the data file, a second menu ‘Cluster solution input’
appears on the left side of the app, with four options:
(1) you can select the variables to cluster on. By default, the app selects all variables,
except the first column (observation ID). If you do not want to include all variables for
clustering, you can select those variable(s) and delete them from the list.
(2) you can specify the number of clusters you want in your cluster solution. The app
automatically runs for two clusters (𝐾 = 2); however, choosing the number of clusters is
not easy (e.g. lecture notes part 2), and this default setting may not be best for your
application. If you have not yet examined your data to get a feel for the number of
clusters, you should first run a cross-validation analysis (step 2 below) before changing
this slider.
(3) you can select the option ‘Create basic graphs of the cluster solution’. This is usually
done after one has run the cross-validation, and this option creates the graphs for
visualizing the cluster solution that we discussed in class (e.g. lecture notes part 4,
handout p10).
(4) you can ask the app to perform cross validation, to get a feel for the number of
clusters in your data. This is an important first step after loading your data into the app, if
you do not have prior knowledge about the number of clusters in your data (lecture notes
part 2).
Step 2 [[ optional ]] – get a feel for the number of clusters (tab: Crossvalidation number
of clusters)
In the menu on the left, select the option ‘Run cross validation to help decide on number of
clusters’ and go to the tab ‘Crossvalidation number of clusters’. Once selected, you can change
the slider to input the maximum number of clusters you want to examine (the default is 10 but
for the class example, I set the slider to 15).
The cross-validation analysis will run automatically. Depending on your sample size, the number
of variables, and the maximum number of clusters you want to test, this could take some time.
The app produces the output that was discussed in class (lecture notes part 2, handout p7),
which can be downloaded.
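The app’s exact cross-validation procedure is the one discussed in the lecture notes. Purely to illustrate the general idea, the Python sketch below fits K-means centers on a random training half of the data and scores each candidate K by the within-cluster sum of squares of the held-out half; all function names are hypothetical, not the app’s code:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, k, rng, n_iter=100):
    """One run of Lloyd's K-means algorithm; returns the centers."""
    centers = rng.sample(points, k)  # random initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: dist2(p, centers[j]))].append(p)
        # Update step: move each center to the mean of its cluster.
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:  # converged
            break
        centers = new
    return centers

def heldout_wss(points, k_max, seed=0):
    """Fit centers on a random training half; score each K by the
    within-cluster sum of squares (WSS) of the held-out half."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    scores = {}
    for k in range(1, k_max + 1):
        centers = lloyd(train, k, rng)
        scores[k] = sum(min(dist2(p, c) for c in centers) for p in test)
    return scores
```

Plotting these held-out scores against K, one looks for the point where the score stops improving: a K that is too large starts fitting noise in the training half and no longer helps on the held-out half.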
Step 3 – run K-means; inspect and download the cluster solution (tab: Output summary
K-means)
As mentioned above, the app automatically runs the 𝐾 = 2 solution (‘Number of clusters in
cluster solution’). You can drag the slider to get the cluster solution for a different choice of 𝐾.
The cluster solution output is presented in the tab ‘Output summary K-means’. You will find the
‘Cluster sizes’, the ‘Cluster centers’ and the ‘Predicted cluster memberships and distances’. All
can be downloaded, if needed.
The results in class for the MBA music preference data were obtained for 𝐾 = 4 (lecture notes
part 3; handout pp8—12). When you select 𝐾 = 4, you should be able to reproduce the output
tables discussed in class.
Note that your results may differ up to a permutation of the cluster labels (e.g. your cluster 1
could be cluster 4 in class, etc.). Furthermore, as the app runs K-means with randomized starts,
there may be some (hopefully small) differences between your final cluster solution and the
solution discussed in class (see lecture slides part 3, handout pp8—9).
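Because K-means labels are arbitrary, comparing your solution to the class solution means comparing partitions, not label values. A small Python sketch (hypothetical function name) that measures agreement between two labelings up to a relabelling of the clusters:

```python
from itertools import permutations

def agreement_up_to_relabelling(labels_a, labels_b, k):
    """Fraction of observations on which two cluster labelings agree,
    after choosing the best mapping of labels_a's labels onto
    labels_b's. Trying all k! permutations is cheap for small k."""
    best = 0
    for perm in permutations(range(k)):
        mapped = [perm[lab] for lab in labels_a]  # relabel solution A
        best = max(best, sum(m == b for m, b in zip(mapped, labels_b)))
    return best / len(labels_a)
```

A value of 1.0 means the two solutions are identical up to cluster numbering; values slightly below 1.0 are the kind of difference the randomized starts can produce.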
Step 4 [[ optional ]] – obtain basic visualization of the cluster solution (tab: Visualize
cluster solution)
You can select the option ‘Create basic graphs of the cluster solution’ in the menu on the left to
visualize the cluster solution by variable pairs. The graphs are presented in the tab ‘Visualize
cluster solution’. All graphs can be downloaded, if needed. This option may be useful in
communicating insights from the cluster solution to the client. The graphs allow for a visual
inspection of how well the clusters are separated and how meaningful the cluster solution
potentially is for decision making.
For the class example, see lecture notes part 3 or handout p10.
The top two graphs in the tab ‘Visualize cluster solution’ are density plots for each variable, by
cluster (colors). A density plot may be seen as a “smooth” version of a histogram. For instance,
if you specify four clusters, you get four density plots in one graph. This graph will help to
visually inspect how different the clusters are for that variable. The less the density plots
overlap, the more the variable differs across clusters.
The bottom graph is the scatter plot of the two selected variables. The colors indicate to which
cluster each observation belongs. This graph also helps in visually investigating and presenting
the cluster solution for this variable pair.
Please note that the metrics at the bottom of the page are only useful to guide selecting variable
pairs for visualization (the higher the number, the more different the clusters are for those two
variables, in a relative sense). These numbers cannot be used to select the number of clusters
𝐾, nor can they be used to judge how well the clusters are separated per se.
NOTE: the application times out after one hour. That is, if you leave it idle for one hour, it will
disconnect you from the cloud server. You will lose all your results and will have to rerun the
app.
It is good practice to immediately download all the result files and graphics once you are
satisfied with the cluster solution.
APPENDIX: K-means in SPSS
First, SPSS does not randomize starting values. In a high-dimensional space (here we describe
each observation with 16 variables, so we would have a 16-dimensional space), starting values
for the K-means algorithm are important and drive the final cluster solution. SPSS takes a
rather crude approach to this. As it turns out, and in particular for small(er) sample
sizes, the final outcome of K-means clustering using SPSS’ starting-values procedure will
depend heavily on the order of the data (i.e. the order of the rows in Data View in SPSS). This is
a very undesirable property: in many applications the order of the data has no meaning, and
our final cluster solution should not depend on it.
Second, SPSS does not have a built-in procedure to get a feel for the number of clusters and
assess the robustness of the cluster solution, such as the cross-validation approach.
While there are workarounds for each of these limitations in SPSS, there is plenty of other
user-friendly software available, such as the app discussed in these ‘How to guides’, that will
run a cluster analysis for you. In such software, starting values are appropriately randomized,
and often a form of cross-validation, or other model-selection tools, is available to get a feel
for the number of clusters.