Learning Book 11 Feb

Contents
1 Some Preliminaries
1.1 Notation and conventions . . . . . . . . . . . . . . .
1.1.1 Background Information . . . . . . . . . . . .
1.2 Some Useful Mathematical Facts . . . . . . . . . . .
1.3 Acknowledgements . . . . . . . . . . . . . . . . . . .
1.4 The Curse of Dimension . . . . . . . . . . . . . . . .
1.4.1 The Curse: Data isnt Where You Think it is
1.4.2 Minor Banes of Dimension . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
2 Learning to Classify
2.1 Classification, Error, and Loss . . . . . . . . . . . . . . . .
2.1.1 Loss and the Cost of Misclassification . . . . . . .
2.1.2 Building a Classifier from Probabilities . . . . . . .
2.1.3 Building a Classifier using Decision Boundaries . .
2.1.4 What will happen on Test Data? . . . . . . . . . .
2.1.5 The Class Confusion Matrix . . . . . . . . . . . . .
2.1.6 Statistical Learning Theory and Generalization . .
2.2 Classifying with Naive Bayes . . . . . . . . . . . . . . . .
2.3 The Support Vector Machine . . . . . . . . . . . . . . . .
2.3.1 Choosing a Classifier with the Hinge Loss . . . . .
2.3.2 Finding a Minimum: General Points . . . . . . . .
2.3.3 Finding a Minimum: Stochastic Gradient Descent
2.3.4 Example: Training a Support Vector Machine with
2.3.5 Multi-Class Classifiers . . . . . . . . . . . . . . . .
2.4 Classifying with Random Forests . . . . . . . . . . . . . .
2.4.1 Building a Decision Tree . . . . . . . . . . . . . . .
2.4.2 Entropy and Information Gain . . . . . . . . . . .
2.4.3 Entropy and Splits . . . . . . . . . . . . . . . . . .
2.4.4 Choosing a Split with Information Gain . . . . . .
2.4.5 Forests . . . . . . . . . . . . . . . . . . . . . . . . .
2.4.6 Building and Evaluating a Decision Forest . . . . .
2.4.7 Classifying Data Items with a Decision Forest . . .
2.5 Classifying with Nearest Neighbors . . . . . . . . . . . . .
2.6 You should . . . . . . . . . . . . . . . . . . . . . . . . . .
2.6.1 be able to: . . . . . . . . . . . . . . . . . . . . . . .
2.6.2 remember: . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
Stochastic
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
2
2
3
4
4
4
4
6
7
7
8
8
9
9
11
12
13
17
18
20
21
Gradient Descent 23
25
26
26
29
31
32
34
34
35
37
40
40
40
3 Extracting Important Relationships in High Dimensions

46
3.1 Some Plots of High Dimensional Data . . . . . . . . . . . . . . . . . 46
3.1.1 Understanding Blobs with Scatterplot Matrices - CLEANUP 46
3.1.2 Parallel Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.3 Scatterplot Matrices . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Summaries of High Dimensional Data . . . . . . . . . . . . . . . . . 56
1
3.3
3.4
3.5
3.6
3.7
3.8
3.2.1 The Mean . . . . . . . . . . . . . . . . . . . . . . . . . . .

3.2.2 Using Covariance to encode Variance and Correlation . .
Blob Analysis of High-Dimensional Data . . . . . . . . . . . . . .
3.3.1 Transforming High Dimensional Data . . . . . . . . . . .
3.3.2 Transforming Blobs . . . . . . . . . . . . . . . . . . . . .
3.3.3 Whitening Data . . . . . . . . . . . . . . . . . . . . . . .
Principal Components Analysis . . . . . . . . . . . . . . . . . . .
3.4.1 The Blob Coordinate System and Smoothing . . . . . . .
3.4.2 The Low-Dimensional Representation of a Blob . . . . . .
3.4.3 Smoothing Data with a Low-Dimensional Representation
3.4.4 The Error of the Low-Dimensional Representation . . . .
3.4.5 Example: Representing Spectral Reflectances . . . . . . .
3.4.6 Example: Representing Faces with Principal Components
High Dimensions, SVD and NIPALS . . . . . . . . . . . . . . . .
3.5.1 Principal Components by SVD . . . . . . . . . . . . . . .
3.5.2 Just a few Principal Components with NIPALS . . . . . .
3.5.3 Projection and Discriminative Problems . . . . . . . . . .
3.5.4 Just a few Discriminative Directions with PLS1 . . . . . .
Multi-Dimensional Scaling . . . . . . . . . . . . . . . . . . . . . .
3.6.1 Principal Coordinate Analysis . . . . . . . . . . . . . . . .
3.6.2 Example: Mapping with Multidimensional Scaling . . . .
Example: Understanding Height and Weight . . . . . . . . . . .
What you should remember - NEED SOMETHING . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
56
56
60
60
61
64
66
67
69
71
73
75
76
78
78
79
81
82
83
83
86
87
91
4 Clustering: Models of High Dimensional Data

4.1 Agglomerative and Divisive Clustering . . . . . . . . .
4.1.1 Clustering and Distance . . . . . . . . . . . . .
4.2 The K-Means Algorithm and Variants . . . . . . . . .
4.2.1 How to choose K . . . . . . . . . . . . . . . . .
4.2.2 Soft Assignment . . . . . . . . . . . . . . . . .
4.2.3 General Comments on K-Means . . . . . . . .
4.3 Describing Repetition with Vector Quantization . . . .
4.3.1 Vector Quantization . . . . . . . . . . . . . . .
4.3.2 Example: Groceries in Portugal . . . . . . . . .
4.3.3 Efficient Clustering and Hierarchical K Means .
4.3.4 Example: Activity from Accelerometer Data .
4.4 You should . . . . . . . . . . . . . . . . . . . . . . . .
4.4.1 remember: . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
93
93
95
100
102
104
105
107
108
109
111
112
116
116
5 Clustering using Probability Models

5.1 The Multivariate Normal Distribution . . . . . . . .
5.1.1 Affine Transformations and Gaussians . . . .
5.1.2 Plotting a 2D Gaussian: Covariance Ellipses .
5.2 Mixture Models and Clustering . . . . . . . . . . . .
5.2.1 A Finite Mixture of Blobs . . . . . . . . . . .
5.2.2 Topics and Topic Models . . . . . . . . . . .
5.3 The EM Algorithm . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
118
118
119
119
120
121
122
124
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
125
128
129
129
131
131
6 Regression
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . .
6.2 Linear Regression and Least Squares . . . . . . . . . .
6.2.1 Linear Regression . . . . . . . . . . . . . . . . .
6.2.2 Residuals and R-squared . . . . . . . . . . . . .
6.2.3 Transforming Variables . . . . . . . . . . . . .
6.3 Finding Problem Data Points . . . . . . . . . . . . . .
6.3.1 The Hat Matrix and Leverage . . . . . . . . . .
6.3.2 Cooks Distance . . . . . . . . . . . . . . . . .
6.3.3 Standardized Residuals . . . . . . . . . . . . .
6.4 Many Explanatory Variables . . . . . . . . . . . . . .
6.4.1 Functions of One Explanatory Variable . . . .
6.4.2 Regularizing Linear Regressions . . . . . . . . .
6.4.3 Example: Weight against Body Measurements
6.5 You should . . . . . . . . . . . . . . . . . . . . . . . .
6.5.1 remember: . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
133
133
134
134
139
143
145
148
148
149
150
151
153
155
157
157
7 Regression: Some harder topics

7.1 Model Selection: Which Model is Best? . . . . . . . . . . . . .
7.1.1 Bias and Variance . . . . . . . . . . . . . . . . . . . . .
7.1.2 Penalties: AIC and BIC . . . . . . . . . . . . . . . . . .
7.1.3 Cross-Validation . . . . . . . . . . . . . . . . . . . . . .
7.1.4 Forward and Backward Stagewise Regression . . . . . .
7.1.5 Dropping Variables with L1 Regularization . . . . . . .
7.1.6 Using Regression to Compare Trends . . . . . . . . . . .
7.1.7 Significance: What Variables are Important? . . . . . .
7.2 Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . .
7.2.1 M-Estimators and Iteratively Reweighted Least Squares
7.2.2 RANSAC: Searching for Good Points . . . . . . . . . .
7.3 Modelling with Bumps . . . . . . . . . . . . . . . . . . . . . . .
7.3.1 Scattered Data: Smoothing and Interpolation . . . . . .
7.3.2 Density Estimation . . . . . . . . . . . . . . . . . . . . .
7.3.3 Kernel Smoothing . . . . . . . . . . . . . . . . . . . . .
7.4 Exploiting Your Neighbors for Regression . . . . . . . . . . . .
7.4.1 Local Polynomial Regression . . . . . . . . . . . . . . .
7.4.2 Using your Neighbors to Predict More than a Number .
7.5 Bayesian Regression . . . . . . . . . . . . . . . . . . . . . . . .
7.5.1 Tricks with Normal Distributions . . . . . . . . . . . . .
7.6 You should . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.6.1 remember: . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
161
161
161
162
164
164
165
168
170
172
172
175
177
177
182
183
185
188
188
192
192
194
194
5.4
5.3.1 Example: Mixture of Normals: The E-step

5.3.2 Example: Mixture of Normals: The M-step
5.3.3 Example: Topic Model: The E-Step . . . .
5.3.4 Example: Topic Model: The M-step . . . .
You should . . . . . . . . . . . . . . . . . . . . . .
5.4.1 remember: . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
8 Classification II
8.1 Logistic Regression . . . . . . . . . .
8.2 Neural Nets . . . . . . . . . . . . . .
8.3 Convolution and orientation features
8.4 Convolutional neural networks . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
198
198
198
198
198
9 Boosting
199
9.1 GradientBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
9.2 ADAboost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
10 Some Important Models
200
10.1 HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
10.2 CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
10.3 Fitting and inference with MCMC? . . . . . . . . . . . . . . . . . . . 200
11 Background: First Tools for Looking at Data
11.1 Datasets . . . . . . . . . . . . . . . . . . . . . . .
11.2 Whats Happening? - Plotting Data . . . . . . .
11.2.1 Bar Charts . . . . . . . . . . . . . . . . .
11.2.2 Histograms . . . . . . . . . . . . . . . . .
11.2.3 How to Make Histograms . . . . . . . . .
11.2.4 Conditional Histograms . . . . . . . . . .
11.3 Summarizing 1D Data . . . . . . . . . . . . . . .
11.3.1 The Mean . . . . . . . . . . . . . . . . . .
11.3.2 Standard Deviation and Variance . . . . .
11.3.3 Variance . . . . . . . . . . . . . . . . . . .
11.3.4 The Median . . . . . . . . . . . . . . . . .
11.3.5 Interquartile Range . . . . . . . . . . . . .
11.3.6 Using Summaries Sensibly . . . . . . . . .
11.4 Plots and Summaries . . . . . . . . . . . . . . . .
11.4.1 Some Properties of Histograms . . . . . .
11.4.2 Standard Coordinates and Normal Data .
11.4.3 Boxplots . . . . . . . . . . . . . . . . . . .
11.5 Whose is bigger? Investigating Australian Pizzas
11.6 You should . . . . . . . . . . . . . . . . . . . . .
11.6.1 be able to: . . . . . . . . . . . . . . . . . .
11.6.2 remember: . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
201
201
203
204
204
205
207
207
207
210
214
215
217
218
219
219
222
225
226
231
231
231
12 Background:Looking at Relationships
12.1 Plotting 2D Data . . . . . . . . . . . . . . . . . .
12.1.1 Categorical Data, Counts, and Charts . .
12.1.2 Series . . . . . . . . . . . . . . . . . . . .
12.1.3 Scatter Plots for Spatial Data . . . . . . .
12.1.4 Exposing Relationships with Scatter Plots
12.2 Correlation . . . . . . . . . . . . . . . . . . . . .
12.2.1 The Correlation Coefficient . . . . . . . .
12.2.2 Using Correlation to Predict . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
232
232
232
236
238
241
244
246
251
12.2.3 Confusion caused by correlation .

12.3 Sterile Males in Wild Horse Herds . . .
12.4 You should . . . . . . . . . . . . . . . .
12.4.1 be able to: . . . . . . . . . . . . .
12.4.2 remember: . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
255
256
259
259
259
13 Background: Useful Probability Distributions

13.1 Discrete Distributions . . . . . . . . . . . . . .
13.1.1 The Discrete Uniform Distribution . . .
13.1.2 Bernoulli Random Variables . . . . . . .
13.1.3 The Geometric Distribution . . . . . . .
13.1.4 The Binomial Probability Distribution .
13.1.5 Multinomial probabilities . . . . . . . .
13.1.6 The Poisson Distribution . . . . . . . .
13.2 Continuous Distributions . . . . . . . . . . . .
13.2.1 The Continuous Uniform Distribution .
13.2.2 The Beta Distribution . . . . . . . . . .
13.2.3 The Gamma Distribution . . . . . . . .
13.2.4 The Exponential Distribution . . . . . .
13.3 The Normal Distribution . . . . . . . . . . . . .
13.3.1 The Standard Normal Distribution . . .
13.3.2 The Normal Distribution . . . . . . . .
13.3.3 Properties of the Normal Distribution .
13.4 Approximating Binomials with Large N . . . .
13.4.1 Large N . . . . . . . . . . . . . . . . . .
13.4.2 Getting Normal . . . . . . . . . . . . . .
13.4.3 So What? . . . . . . . . . . . . . . . . .
13.5 You should . . . . . . . . . . . . . . . . . . . .
13.5.1 remember: . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
260
261
261
261
262
263
265
266
268
268
268
270
270
271
271
272
273
275
276
278
280
281
281
14 Background:Inference: Making Point Estimates

14.1 Estimating Model Parameters with Maximum Likelihood
14.1.1 The Maximum Likelihood Principle . . . . . . . .
14.1.2 Cautions about Maximum Likelihood . . . . . . .
14.2 Incorporating Priors with Bayesian Inference . . . . . . .
14.2.1 Constructing the Posterior . . . . . . . . . . . . .
14.2.2 Normal Prior and Normal Likelihood . . . . . . . .
14.2.3 MAP Inference . . . . . . . . . . . . . . . . . . . .
14.2.4 Filtering . . . . . . . . . . . . . . . . . . . . . . . .
14.2.5 Cautions about Bayesian Inference . . . . . . . . .
14.3 Samples, Urns and Populations . . . . . . . . . . . . . . .
14.3.1 Estimating the Population Mean from a Sample .
14.3.2 The Variance of the Sample Mean . . . . . . . . .
14.3.3 The Probability Distribution of the Sample Mean .
14.3.4 When The Urn Model Works . . . . . . . . . . . .
14.4 You should . . . . . . . . . . . . . . . . . . . . . . . . . .
14.4.1 be able to: . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
282
283
284
292
293
294
298
301
302
305
305
306
307
311
311
313
313
14.4.2 remember: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
C H A P T E R
Some Preliminaries
1.1 NOTATION AND CONVENTIONS
A dataset as a collection of d-tuples (a d-tuple is an ordered list of d elements).
Tuples differ from vectors, because we can always add and subtract vectors, but
we cannot necessarily add or subtract tuples. There are always N items in any
dataset. There are always d elements in each tuple in a dataset. The number of
elements will be the same for every tuple in any given tuple. Sometimes we may
not know the value of some elements in some tuples.
We use the same notation for a tuple and for a vector. Most of our data will
be vectors. We write a vector in bold, so x could represent a vector or a tuple (the
context will make it obvious which is intended).
The entire data set is {x}. When we need to refer to the ith data item, we
write xi . Assume we have N data items, and we wish to make a new dataset out of
them; we write the dataset made out of these items as {xi } (the i is to suggest you
are taking a set of items and making a dataset out of them). If we need to refer
(j)
to the jth component of a vector xi , we will write xi (notice this isnt in bold,
because it is a component not a vector, and the j is in parentheses because it isnt
a power). Vectors are always column vectors.
When I write {kx}, I mean the dataset created by taking each element of the
dataset {x} and multiplying by k; and when I write {x + c}, I mean the dataset
created by taking each element of the dataset {x} and adding c.
Terms:
mean ({x}) is the mean of the dataset {x} (definition 11.1, page 207).
std ({x}) is the standard deviation of the dataset {x} (definition 11.2, page 210).
var ({x}) is the variance of the dataset {x} (definition 11.3, page 214).
median ({x}) is the standard deviation of the dataset {x} (definition 11.4,
page 215).
percentile({x}, k) is the k% percentile of the dataset {x} (definition 11.5,
page 217).
iqr{x} is the interquartile range of the dataset {x} (definition 11.7, page 218).
{
x} is the dataset {x}, transformed to standard coordinates (definition 11.8,
page 222).
Standard normal data is defined in definition 11.9, (page 223).
Normal data is defined in definition 11.10, (page 223).
7
Section 1.1
Notation and conventions
corr ({(x, y)}) is the correlation between two components x and y of a dataset
(definition 12.1, page 246).
is the empty set.
is the set of all possible outcomes of an experiment.
Sets are written as A.
Ac is the complement of the set A (i.e. A).

E is an event (page 309).
P ({E}) is the probability of event E (page 309).
P ({E}|{F }) is the probability of event E, conditioned on event F (page 309).
p(x) is the probability that random variable X will take the value x; also
written P ({X = x}) (page 309).
p(x, y) is the probability that random variable X will take the value x and
random variable Y will take the value y; also written P ({X = x} {Y = y})
(page 309).
argmax
f (x) means the value of x that maximises f (x).
x
argmin
f (x) means the value of x that minimises f (x).
x
maxi (f (xi )) means the largest value that f takes on the different elements of
the dataset {xi }.
is an estimated value of a parameter .
1.1.1 Background Information

Cards: A standard deck of playing cards contains 52 cards. These cards are divided
into four suits. The suits are: spades and clubs (which are black); and hearts and
diamonds (which are red). Each suit contains 13 cards: Ace, 2, 3, 4, 5, 6, 7, 8, 9,
10, Jack (sometimes called Knave), Queen and King. It is common to call Jack,
Queen and King court cards.
Dice: If you look hard enough, you can obtain dice with many different numbers of sides (though Ive never seen a three sided die). We adopt the convention
that the sides of an N sided die are labeled with the numbers 1 . . . N , and that no
number is used twice. Most dice are like this.
Fairness: Each face of a fair coin or die has the same probability of landing
upmost in a flip or roll.
Roulette: A roulette wheel has a collection of slots. There are 36 slots numbered with the digits 1 . . . 36, and then one, two or even three slots numbered with
zero. There are no other slots. A ball is thrown at the wheel when it is spinning,
and it bounces around and eventually falls into a slot. If the wheel is properly
balanced, the ball has the same probability of falling into each slot. The number of
the slot the ball falls into is said to come up. There are a variety of bets available.
Section 1.2
Some Useful Mathematical Facts
1.2 SOME USEFUL MATHEMATICAL FACTS

The gamma function (x) is defined by a series of steps. First, we have that for n
an integer,
(n) = (n 1)!
and then for z a complex number with positive real part (which includes positive
real numbers), we have
Z
et
(z) =
tz
dt.
t
0
By doing this, we get a function on positive real numbers that is a smooth interpolate of the factorial function. We wont do any real work with this function, so
wont expand on this definition. In practice, well either look up a value in tables
or require a software environment to produce it.
1.3 ACKNOWLEDGEMENTS
Typos spotted by: Han Chen (numerous!), Henry Lin (numerous!), Paris Smaragdis
(numerous!), Johnny Chang, Eric Huber, Brian Lunt, Yusuf Sobh, Scott Walters,
Your Name Here TAs for this course have helped improve the notes. Thanks
to Zicheng Liao, Michael Sittig, Nikita Spirin, Saurabh Singh, Daphne Tsatsoulis,
Henry Lin, Karthik Ramaswamy.
1.4 THE CURSE OF DIMENSION
High dimensional models display uninituitive behavior (or, rather, it can take years
to make your intuition see the true behavior of high-dimensional models as natural).
In these models, most data lies in places you dont expect. We will do several simple
calculations with an easy high-dimensional distribution to build some intuition.
1.4.1 The Curse: Data isnt Where You Think it is
Assume our data lies within a cube, with edge length two, centered on the origin.
This means that each component of xi lies in the range [1, 1]. One simple model
for such data is to assume that each dimension has uniform probability density in
this range. In turn, this means that P (x) = 21d . The mean of this model is at the
origin, which we write as 0.
The first surprising fact about high dimensional data is that most of the data
can lie quite far away from the mean. For example, we can divide our dataset into
two pieces. A() consists of all data items where every component of the data has
a value in the range [(1 ), (1 )]. B() consists of all the rest of the data. If
you think of the data set as forming a cubical orange, then B() is the rind (which
has thickness ) and A() is the fruit.
Your intuition will tell you that there is more fruit than rind. This is true,
for three dimensional oranges, but not true in high dimensions. The fact that the
orange is cubical just simplifies the calculations, but has nothing to do with the
real problem.
We can compute P ({x A()}) and P ({x A()}). These probabilities tell
us the probability a data item lies in the fruit (resp. rind). P ({x A()}) is easy
Section 1.4
The Curse of Dimension
10
to compute as
P ({x A()}) = (2(1 )))
1
2d
= (1 )d
and
P ({x B()}) = 1 P ({x A()}) = 1 (1 )d .
But notice that, as d ,
P ({x A()}) 0.
This means that, for large d, we expect most of the data to be in B(). Equivalently,
for large d, we expect that at least one component of each data item is close to
either 1 or 1.
This suggests (correctly) that much data is quite far from the origin. It is
easy to compute the average of the squared distance of data from the origin. We
want
!
Z
X
T
x2i P (x)dx
E x x =
box
i
but we can rearrange, so that

X 2 X
E xi =
E xT x =
i
box
x2i P (x)dx.
Now each component of x is independent, so that P (x) = P (x1 )P (x2 ) . . . P (xd ).

Now we substitute, to get
Z
X1Z 1
T X 2 X 1 2
d
xi P (xi )dxi =
E xi =
E x x =
x2i dxi = ,
2
3
1
1
i
i
i
so as d gets bigger, most data points will be further and further from the origin.
Worse, as d gets bigger, data points tend to get further and further from one
another. We can see this by computing the average of the squared distance of data
points from one another. Write u for one data point and v; we can compute
Z
Z
X

(ui vi )2 dudv = E uT u + E vT v E uT v
E d(u, v)2 =
box box i

T
but since u and v are independent, we have E uT v = E[u] E[v] = 0. This yields

d
E d(u, v)2 = 2 .
3
This means that, for large d, we expect our data points to be quite far apart.
Section 1.4
The Curse of Dimension
11
1.4.2 Minor Banes of Dimension

High dimensional data presents a variety of important practical nuisances which
follow from the curse of dimension. It is hard to estimate covariance matrices, and
it is hard to build histograms.
Covariance matrices are hard to work with for two reasons. The number of
entries in the matrix grows as the square of the dimension, so the matrix can get
big and so difficult to store. More important, the amount of data we need to get an
accurate estimate of all the entries in the matrix grows fast. As we are estimating
more numbers, we need more data to be confident that our estimates are reasonable.
There are a variety of straightforward work-arounds for this effect. In some cases,
we have so much data there is no need to worry. In other cases, we assume that
the covariance matrix has a particular form, and just estimate those parameters.
There are two strategies that are usual. In one, we assume that the covariance
matrix is diagonal, and estimate only the diagonal entries. In the other, we assume
that the covariance matrix is a scaled version of the identity, and just estimate this
scale. You should see these strategies as acts of desperation, to be used only when
computing the full covariance matrix seems to produce more problems than using
these approaches.
It is difficult to build histogram representations for high dimensional data.
The strategy of dividing the domain into boxes, then counting data into them, fails
miserably because there are too many boxes. In the case of our cube, imagine we
wish to divide each dimension in half (i.e. between [1, 0] and between [0, 1]). Then
we must have 2d boxes. This presents two problems. First, we will have difficulty
representing this number of boxes. Second, unless we are exceptionally lucky, most
boxes must be empty because we will not have 2d data items.
However, one representation is extremely effective. We can represent data as
a collection of clusters coherent blobs of similar datapoints that could, under
appropriate circumstances, be regarded as the same. We could then represent the
dataset by, for example, the center of each cluster and the number of data items
in each cluster. Since each cluster is a blob, we could also report the covariance of
each cluster, if we can compute it.
C H A P T E R
Learning to Classify
A classifier is a procedure that accepts a set of features and produces a class
label for them. There could be two, or many, classes. Classifiers are immensely
useful, and find wide application, because many problems are naturally classification
problems. For example, if you wish to determine whether to place an advert on a
web-page or not, you would use a classifier (i.e. look at the page, and say yes or
no according to some rule). As another example, if you have a program that you
found for free on the web, you would use a classifier to decide whether it was safe
to run it (i.e. look at the program, and say yes or no according to some rule). As
yet another example, credit card companies must decide whether a transaction is
good or fraudulent.
All these examples are two class classifiers, but in many cases it is natural
to have more classes. You can think of sorting laundry as applying a multi-class
classifier. You can think of doctors as complex multi-class classifiers. In this (crude)
model, the doctor accepts a set of features, which might be your complaints, answers
to questions, and so on, and then produces a response which we can describe as a
class. The grading procedure for any class is a multi-class classifier: it accepts a
set of features performance in tests, homeworks, and so on and produces a
class label (the letter grade).
Classifiers are built by taking a set of labeled examples and using them to
come up with a procedure that assigns a label to any new example. In the general
problem, we have a training dataset (xi , yi ); each of the feature vectors xi consists
of measurements of the properties of different types of object, and the yi are labels
giving the type of the object that generated the example. We will then use the
training dataset to find a procedure that will predict an accurate label (y) for any
new object (x).
Definition: 2.1 Classifier

A classifier is a procedure that accepts a feature vector and produces a
label.
2.1 CLASSIFICATION, ERROR, AND LOSS

You should think of a classifier as a procedure we pass in a feature vector, and
get a class label in return. We want to use the training data to find the procedure
that is best when used on the test data. This problem has two tricky features.
12
Section 2.1
Classification, Error, and Loss
13
First, we need to be clear on what a good procedure is. Second, we really want the
procedure to be good on test data, which we havent seen and wont see; we only
get to see the training data. These two considerations shape much of what we do.
2.1.1 Loss and the Cost of Misclassification
The choice of procedure must depend on the cost of making a mistake. This cost
can be represented with a loss function, which specifies the cost of making each
type of mistake. I will write L(j k) for the loss incurred when classifying an
example of class j as having class k.
A two-class classifier can make two kinds of mistake. Because two-class classifiers are so common, there is a special name for each kind of mistake. A false
positive occurs when a negative example is classified positive (which we can write
L( +) and avoid having to remember which index refers to which class); a
false negative occurs when a positive example is classified negative (similarly
L(+ )). By convention, the loss of getting the right answer is zero, and the
loss for any wrong answer is non-negative.
The choice of procedure should depend quite strongly on the cost of each
mistake. For example, pretend there is only one disease; then doctors would be
classifiers, deciding whether a patient had it or not. If this disease is dangerous, but
is safely and easily treated, false negatives are expensive errors, but false positives
are cheap. In this case, procedures that tend to make more false positives than false
negatives are better. Similarly, if the disease is not dangerous, but the treatment is
difficult and unpleasant, then false positives are expensive errors and false negatives
are cheap, and so we prefer false negatives to false positives.
You might argue that the best choice of classifier makes no mistake. But for
most practical cases, the best choice of classifier is guaranteed to make mistakes.
As an example, consider an alien who tries to classify humans into male and female,
using only height as a feature. However the aliens classifier uses that feature, it
will make mistakes. This is because the classifier must choose, for each value of
height, whether to label the humans with that height male or female. But for the
vast majority of heights, there are some males and some females with that height,
and so the aliens classifier must make some mistakes whatever gender it chooses
for that height.
For many practical problems, it is difficult to know what loss function to use.
There is seldom an obvious choice. One common choice is to assume that all errors
are equally bad. This gives the 0-1 loss every error has loss 1, and all right
answers have loss zero.
2.1.2 Building a Classifier from Probabilities
Assume that we have a reliable model of p(y|x). This case occurs less often than you
might think for practical data, because building such a model is often very difficult.
However, when we do have a model and a loss function, it is easy to determine the
best classifier. We should choose the rule that gives minimum expected loss.
We start with a two-class classifier. At x, the expected loss of saying
is L(+ )p(+|x) (remember, L( ) = 0); similarly, the expected loss
of saying + is L( +)p(|x). At most points, one of L( +)p(|x) and
Section 2.1
14
L(+ )p(+|x) is larger than the other, and so the choice is clear. The remaining
set of points (where L( +)p(|x) = L(+ )p(+|x)) is small (formally, it
has zero measure) for most models and problems, and so it doesnt matter what we
choose at these points. This means that the rule
if L(+ )p(+|x) > L( +)p(|x)

+
if L(+ )p(+|x) < L( +)p(|x)

say
random choice
otherwise
is the best available. Because it doesnt matter what we do when L(+ )p(+|x) =
L( +)p(|x), it is fine to use

+ if L(+ )p(+|x) > L( +)p(|x)
say
otherwise
The same reasoning applies in the multi-class case. We choose the class where the
expected loss from that choice is smallest. In the case of 0-1 loss, this boils down
to:
choose k such that p(k|x) is largest.
2.1.3 Building a Classifier using Decision Boundaries
Building a classifier out of posterior probabilities is less common than you might
think, for two reasons. First, its often very difficult to get a good posterior probability model. Second, most of the model doesnt matter to the choice of classifier.
What is important is knowing which class has the lowest expected loss, not the
exact values of the expected losses, so we should be able to get away without an
exact posterior model.
Look at the rules in section 2.1.2. Each of them carves up the domain of x into
pieces, and then attaches a class the one with the lowest expected loss to each
piece. There isnt necessarily one piece per class, (though theres always one class
per piece). The important factor here is the boundaries between the pieces, which
are known as decision boundaries. A powerful strategy for building classifiers
is to choose some way of building decision boundaries, then adjust it to perform
well on the data one has. This involves modelling considerably less detail than
modelling the whole posterior.
For example, in the two-class case, we will spend some time discussing the
decision boundary given by

if xT a + b < 0
choose
+
otherwise
often written as signxT a + b (section 14.5). In this case we choose a and b to obtain
low loss.
2.1.4 What will happen on Test Data?
What we really want from a classifier is to have small loss on test data. But this
is difficult to measure or achieve directly. For example, think about the case of
classifying credit-card transactions as good or bad. We could certainly obtain
Section 2.1
15
a set of examples that have been labelled for training, because the card owner often
complains some time after a fraudulent use of their card. But what is important
here is to see a new transaction and label it without holding it up for a few months
to see what the card owner says. The classifier may never know if the label is right
or not.
Generally, we will assume that the training data is like the test data, and
so we will try to make the classifier perform well on the training data. Classifiers
that have small training error might not have small test error. One example of
this problem is the (silly) classifier that takes any data point and, if it is the same
as a point in the training set, emits the class of that point and otherwise chooses
randomly between the classes. This classifier has been learned from data, and has
a zero error rate on the training dataset; it is likely to be unhelpful on any other
dataset, however.
Test error is usually worse than training error, because of an effect that is
sometimes called overfitting, so called because the classification procedure fits
the training data better than it fits the test data. Other names include selection
bias, because the training data has been selected and so isnt exactly like the
test data, and generalizing badly, because the classifier fails to generalize. The
effect occurs because the classifier has been trained to perform well on the training
dataset, and the training dataset is not the same as the test dataset. First, it is
quite likely smaller. Second, it might be biased through a variety of accidents. This
means that small training error may have to do with quirks of the training dataset
that dont occur in other sets of examples. One consequence of overfitting is that
classifiers should always be evaluated on data that was not used in training.
Remember this:
Classifiers should always be evaluated on data that
was not used in training.
Now assume that we are using the 0-1 loss, so that the loss of using a classifier
is the same as the error rate, that is, the percentage of classification attempts on
a test set that result in the wrong answer. We could also use the accuracy, which
is the percentage of classification attempts that result in the right answer. We
cannot estimate the error rate of the classifier using training data, because the
classifier has been trained to do well on that data, which will mean our error rate
estimate will be too low. An alternative is to separate out some training data to
form a validation set (confusingly, this is often called a test set), then train the
classifier on the rest of the data, and evaluate on the validation set. This has the
difficulty that the classifier will not be the best estimate possible, because we have
left out some training data when we trained it. This issue can become a significant
nuisance when we are trying to tell which of a set of classifiers to usedid the
classifier perform poorly on validation data because it is not suited to the problem
representation or because it was trained on too little data?
We can resolve this problem with cross-validation, which involves repeatedly: splitting data into training and validation sets uniformly and at random,
Section 2.1
16
training a classifier on the training set, evaluating it on the validation set, and
then averaging the error over all splits. This allows an estimate of the likely future performance of a classifier, at the expense of substantial computation. You
should notice that cross-validation, in some sense, looks at the sensitivity of the
classifier to a change in the training set. The most usual form of this algorithm
involves omitting single items from the dataset and is known as leave-one-out
cross-validation.
You should usually compare the error rate of a classifier to two important
references. The first is the error rate if you assign classes to examples uniformly at
random, which for a two class classifier is 50%. A two class classifier should never
have an error rate higher than 50%. If you have one that does, all you need to do
is swap its class assignment, and the resulting error rate would be lower than 50%.
The second is the error rate if you assign all data to the most common class. If one
class is uncommon and the other is common, this error rate can be hard to beat.
Data where some classes occur very seldom requires careful, and quite specialized,
handling.
2.1.5 The Class Confusion Matrix
Evaluating a multi-class classifier is more complex than evaluating a binary classifier. The error rate if you assign classes to examples uniformly at random can
be rather high. If each class has about the same frequency, then this error rate is
(1 100/number of classes)%. A multi-class classifier can make many more kinds
of mistake than a binary classifier can. It is useful to know the total error rate of
the classifier (percentage of classification attempts that produce the wrong answer)
or the accuracy, (the percentage of classification attempts that produce the right
answer). If the error rate is low enough, or the accuracy is high enough, theres not
much to worry about. But if its not, you can look at the class confusion matrix
to see whats going on.
True
True
True
True
True
0
1
2
3
4
Predict
0
151
32
10
6
2
Predict
1
7
5
9
13
3
Predict
2
2
9
7
9
2
Predict
3
3
9
9
5
6
Predict
4
1
0
1
2
0
Class
error
7.9%
91%
81%
86%
100%
TABLE 2.1: The class confusion matrix for a multiclass classifier. Further details
about the dataset and this example appear in worked example 2.3.
Table 2.1 gives an example. This is a class confusion matrix from a classifier
built on a dataset where one tries to predict the degree of heart disease from a collection of physiological and physical measurements. There are five classes (0 . . . 4).
The i, jth cell of the table shows the number of data points of true class i that
were classified to have class j. As I find it hard to recall whether rows or columns
represent true or predicted classes, I have marked this on the table. For each row,
there is a class error rate, which is the percentage of data points of that class that
Section 2.1
17
were misclassified. The first thing to look at in a table like this is the diagonal; if
the largest values appear there, then the classifier is working well. This clearly isnt
what is happening for table 2.1. Instead, you can see that the method is very good
at telling whether a data point is in class 0 or not (the class error rate is rather
small), but cannot distinguish between the other classes. This is a strong hint that
the data cant be used to draw the distinctions that we want. It might be a lot
better to work with a different set of classes.
2.1.6 Statistical Learning Theory and Generalization
What is required in a classifier is an ability to predictwe should like to be confident
that the classifier chosen on a particular data set has a low risk on future data items.
The family of decision boundaries from which a classifier is chosen is an important
component of the problem. Some decision boundaries are more flexible than others
(in a sense we dont intend to make precise). This has nothing to do with the
number of parameters in the decision boundary. For example, if we were to use
a point to separate points on the line, there are very small sets of points that
are not linearly separable (the smallest set has three points in it). This means
that relatively few sets of points on the line are linearly separable, so that if our
dataset is sufficiently large and linearly separable, the resulting classifier is likely
to behave well in future. However, using the sign of sin x to separate points on
the line produces a completely different qualitative phenomenon; for any labeling
of distinct points on the line into two classes, we can choose a value of to achieve
this labeling. This flexibility means that the classifier is wholly unreliableit can
be made to fit any set of examples, meaning the fact that it fits the examples is
uninformative.
There is a body of theory that treats this question, which rests on two important points.
A large enough dataset yields a good representation of the source
of the data: this means that if the dataset used to train the classifier is very
large, there is a reasonable prospect that the performance on the training set
will represent the future performance. However, for this to be helpful, we
need the data set to be large with respect to the flexibility of the family
The flexibility of a family of decision boundaries can be formalized: yielding the Vapnik-Chervonenkis dimension (or V-C dimension) of the family. This dimension is independent of the number of parameters of the family. Families with finite V-C dimension can yield classifiers
whose future performance can be bounded using the number of training elements; families with infinite V-C dimension (like the sin x example above)
cannot be used to produce reliable classifiers.
The essence of the theory is as follows: if one chooses a decision boundary from an
inflexible family, and the resulting classifier performs well on a large data set, there
is strong reason to believe that it will perform well on future items drawn from the
same source. This statement can be expressed precisely in terms of bounds on total
risk to be expected for particular classifiers as a function of the size of the data
set used to train the classifier. These bounds hold in probability. These bounds
Section 2.2
Classifying with Naive Bayes
18
tend not to be used in practice, because they appear to be extremely pessimistic.

Space doesnt admit an exposition of this theorywhich is somewhat technical
but interested readers can look it up in (?, ?, ?).
2.2 CLASSIFYING WITH NAIVE BAYES
One reason it is difficult to build a posterior probability model is the dependencies
between features. However, if we assume that features are conditionally independent conditioned on the class of the data item, we can get a simple expression for
the posterior. This assumption is hardly ever true in practice. Remarkably, this
doesnt matter very much, and the classifier we build from the assumption often
works extremely well. It is the classifier of choice for very high dimensional data.
Recall bayes rule. If we have p(x|y) (often called either a likelihood or class
conditional probability), and p(y) (often called a prior) then we can form
p(y|x) =
p(x|y)p(y)
p(x)
(the posterior). We write xj for the jth component of x. Our assumption is

Y
p(xi |y)
p(x|y) =
i
(again, this isnt usually the case; it just turns out to be fruitful to assume that it
is true). This assumption means that
p(y|x)
=
=
p(x|y)p(y)
p(x)
Q
i p(xi |y)p(y)
p(x)
Y
p(xi |y)p(y).
i
Now because we need only to know the posterior values up to scale at x to make
a decision (check the rules above if youre unsure), we dont need to estimate p(x).
In the case of 0-1 loss, this yields the rule
Q
choose y such that i p(xi |y)p(y) is largest.
Naive bayes is particularly good when there are a large number of features, but there
are some things to be careful about. You cant actually multiply a large number
of probabilities and expect to get an answer that a floating point system thinks is
different from zero. Instead, you should add the log probabilities. A model with
many different features is likely to have many strongly negative log probabilities,
so you should not just add up all the log probabilities then exponentiate, or else
you will find that each class has a posterior probability of zero. Instead, subtract
the largest log from all the others, then exponentiate; you will obtain a vector
proportional to the class probabilities, where the largest element has the value 1.
Section 2.2
19
We still need models for p(xi |y) for each xi . It turns out that simple parametric models work really well here. For example, one could fit a normal distribution
to each xi in turn, for each possible value of y, using the training data. The logic
of the measurements might suggest other distributions, too. If one of the xi s was
a count, we might fit a Poisson distribution. If it was a 0-1 variable, we might fit a
Bernoulli distribution. If it was a numeric variable that took one of several values,
then we might use either a multinomial model.
Many effects cause missing values: measuring equipment might fail; a record
could be damaged; it might be too hard to get information in some cases; survey
respondents might not want to answer a question; and so on. As a result, missing values are quite common in practical datasets. A nice feature of naive bayes
classifiers is that they can handle missing values for particular features rather well.
Dealing with missing data during learning is easy. For example, assume for
some i, we wish to fit p(xi |y) with a normal distribution. We need to estimate
the mean and standard deviation of that normal distribution (which we do with
maximum likelihood, as one should). If not every example has a known value of xi ,
this really doesnt matter; we simply omit the unknown number from the estimate.
Write xi,j for the value of xi for the jth example. To estimate the mean, we form
P
jcases with known values xi,j
number of cases with known values
and so on.
Dealing with missing data during classification
is easy, too. We need to look
P
for the y that produces the largest value of i log p(xi |y). We cant evaluate p(xi |y)
if the value of that feature is missing - but it is missing for each class. We can just
leave that term out of the sum, and proceed. This procedure is fine if data is
missing as a result of noise (meaning that the missing terms are independent of
class). If the missing terms depend on the class, there is much more we could do
for example, we might build a model of the class-conditional density of missing
terms.
Notice that if some values of a discrete feature xi dont appear for some class,
you could end up with a model of p(xi |y) that had zeros for some values. This
almost inevitably leads to serious trouble, because it means your model states you
cannot ever observe that value for a data item of that class. This isnt a safe
property: it is hardly ever the case that not observing something means you cannot
observe it. A simple, but useful, fix is to add one to all small counts.
The usual way to find a model of p(y) is to count the number of training
examples in each class, then divide by the number of classes. If there are some
classes with very little data, then the classifier is likely to work poorly, because you
will have trouble getting reasonable estimates of the parameters for the p(xi |y).
Section 2.2
20
Classifying breast tissue samples
Worked example 2.1
The breast tissue dataset at https://archive.ics.uci.edu/ml/datasets/

Breast+Tissue contains measurements of a variety of properties of six different classes of breast tissue. Build and evaluate a naive bayes classifier to
distinguish between the classes automatically from the measurements.
Solution: The main difficulty here is finding appropriate packages, understanding their documentation, and checking theyre right, unless you want to
write the source yourself (which really isnt all that hard). I used the R package
caret to do train-test splits, cross-validation, etc. on the naive bayes classifier
in the R package klaR. I separated out a test set randomly (approx 20% of the
cases for each class, chosen at random), then trained with cross-validation on
the remainder. The class-confusion matrix on the test set was:
Prediction
adi
car
con
fad
gla
mas
adi
car
con
fad
gla
mas
2
0
2
0
0
0
0
3
0
0
0
1
0
0
2
0
0
0
0
0
0
0
0
3
0
0
0
1
2
0
0
1
0
0
1
1
which is fairly good. The accuracy is 52%. In the training data, the classes
are nearly balanced and there are six classes, meaning that chance is about
16%. The is 4.34. These numbers, and the class-confusion matrix, will vary
with test-train split. I have not averaged over splits, which would be the next
thing.
Section 2.2
21
Classifying mouse protein expression
Worked example 2.2
Build a naive bayes classifier to classify the mouse protein dataset from the
UC Irvine machine learning repository. The dataset is at http://archive.ics.uci.
edu/ml/datasets/Mice+Protein+Expression.
Solution: Theres only one significant difficulty here; many of the data items
are incomplete. I dropped all incomplete data items, which is about half of the
dataset. One can do somewhat more sophisticated things, but we dont have the
tools yet. I used the R package caret to do train-test splits, cross-validation,
etc. on the naive bayes classifier in the R package klaR. I separated out a test
set, then trained with cross-validation on the remainder. The class-confusion
matrix on the test set was:
Predn
c-CS-m
c-CS-s
c-SC-m
c-SC-s
t-CS-m
t-CS-s
t-SC-m
t-SC-s
c-CS-m
9
0
0
0
0
0
0
0
c-CS-s
0
15
0
0
0
0
0
0
c-SC-m
0
0
12
0
0
0
0
0
c-SC-s
0
0
0
15
0
0
0
0
t-CS-m
0
0
0
0
18
0
0
0
t-CS-s
0
0
0
0
0
15
0
0
t-SC-m
0
0
0
0
0
0
12
0
t-SC-s
0
0
0
0
0
0
0
14
which is as accurate as you can get. Again, I have not averaged over splits,
which would be the next thing.
Naive bayes with normal class-conditional distributions takes an interesting
and suggestive form. Assume we have two classes. Recall our decision rule is

+ if L(+ )p(+|x) > L( +)p(|x)
say
otherwise
Now as p gets larger, so does log p (logarithm is a monotonically increasing function), and the rule isnt affected by adding the same constant to both sides, so we
can rewrite as:

+ if log L(+ ) + log p(x|+) + log p(+) > log L( +) log p(x|) + log p()
say
otherwise
+
Write +
j , j respectively for the mean and standard deviation for the classconditional density for the jth component of x for class + (and so on); the comP (xj + )2 P
parison becomes log L(+ ) j 2(+j)2 j log j+ + log p(+) > log L(
j
P (xj
P
)2
j
+) j 2( )2 j log j + log p() Now we can expand and collect terms
j
really aggressively to get
cj x2j dj xj e > 0
j
Section 2.3
The Support Vector Machine
22
(where cj , dj , e are functions of the means and standard deviations and losses and
priors). Rather than forming these by estimating the means, etc., we could directly
search for good values of cj , dj and e.
2.3 THE SUPPORT VECTOR MACHINE
Assume we have a set of N example points xi that belong to two classes, which we
indicate by 1 and 1. These points come with their class labels, which we write as
yi ; thus, our dataset can be written as
{(x1 , y1 ), . . . , (xN , yN )} .
We wish to predict the sign of y for any point x. We will use a linear classifier, so
that for a new data item x, we will predict
sign ((a x + b))
and the particular classifier we use is given by our choice of a and b.
You should think of a and b as representing a hyperplane, given by the points
where a x + b = 0. This hyperplane separates the positive data from the negative
data, and is known as the decision boundary. Notice that the magnitude of
a x + b grows as the point x moves further away from the hyperplane.
Example: 2.1 A linear model with a single feature
Assume we use a linear model with one feature. Then the model has
(p)
the form yi = sign(axi + b). For any particular example which has
the feature value x , this means we will test whether x is larger than,
or smaller than, b/a.
Example: 2.2 A linear model with two features

Assume we use a linear model with two features. Then the model
(p)
has the form yi = sign(aT xi + b). The sign changes along the line
T
a x + b = 0. You should check that this is, indeed, a line. On one
side of this line, the model makes positive predictions; on the other,
negative. Which side is which can be swapped by multiplying a and b
by 1.
This family of classifiers may look bad to you, and it is easy to come up with
examples that it misclassifies badly. In fact, the family is extremely strong. First,
it is easy to estimate the best choice of rule for very large datasets. Second, linear
Section 2.3
23
classifiers have a long history of working very well in practice on real data. Third,
linear classifiers are fast to evaluate.
In fact, examples that are classified badly by the linear rule usually are classified badly because there are two few features. Remember the case of the alien
who classified humans into male and female by looking at their heights; if that alien
had looked at their chromosomes as well, the error rate would be extremely small.
In practical examples, experience shows that the error rate of a poorly performing
linear classifier can usually be improved by adding features to the vector x.
Recall that using naive bayes with
for the class conditional
P a normal model
2
distributions boiled down to testing
c
x
d
x
j j e > 0 for some values of
j j j
cj , dj , and e. This may not look to you like a linear classifier, but it is. Imagine
that, for an example ui , you form the feature vector
x = u2i,1 , ui,1 , u2i,2 , ui,2 , . . . , ui,d
T
Then we can interpret testing aT x + b > 0 as testing a1 u2i,1 (a2 )ui,1 + a3 u2i,2
(a4 )ui,2 + ... (b) > 0, and pattern matching to the expression for naive bayes
suggests that the two cases are equivalent (i.e. for any choice of a, b, there is a
corresponding naive bayes case and vice versa; exercises).
2.3.1 Choosing a Classifier with the Hinge Loss
We will choose a and b by choosing values that minimize a cost function. We will
adopt a cost function of the form:
Training error cost + penalty term.
For the moment, we will ignore the penalty term and focus on the training error
cost. Write
i = aT xi + b
for the value that the linear function takes on example i. Write C(i , yi ) for a
function that compares i with yi . The training error cost will be of the form
(1/N )
N
X
C(i , yi ).
i=1
A good choice of C should have some important properties. If i and yi have

different signs, then C should be large, because the classifier will make the wrong
prediction for this training example. Furthermore, if i and yi have different signs
and i has large magnitude, then the classifier will very likely make the wrong
prediction for test examples that are close to xi . This is because the magnitude of
(a x + b) grows as x gets further from the decision boundary. So C should get
larger as the magnitude of i gets larger in this case.
If i and yi have the same signs, but i has small magnitude, then the classifier
will classify xi correctly, but might not classify points that are nearby correctly.
This is because a small magnitude of i means that xi is close to the decision
boundary. So C should not be zero in this case. Finally, if i and yi have the same
Section 2.3
24
5
4
Hinge loss for a single example

with y=1
Loss
3
2
1
0
4
FIGURE 2.1: The hinge loss, plotted for the case yi = 1. The horizontal variable is
the i = a xi + b of the text. Notice that giving a strong negative response to this
positive example causes a loss that grows linearly as the magnitude of the response
grows. Notice also that giving an insufficiently positive response also causes a loss.
Giving a strongly positive response is free.
signs and i has large magnitude, then C can be zero because xi is on the right
side of the decision boundary and so are all the points near to xi .
The choice
C(yi , i ) = max(0, 1 yi i )
has these properties. If yi i > 1 (so the classifier predicts the sign correctly and
xi is far from the boundary) there is no cost. But in any other case, there is a
cost. The cost rises if xi moves toward the decision boundary from the correct side,
and grows linearly as xi moves further away from the boundary on the wrong side
(Figure 2.1). This means that minimizing the loss will encourage the classifier to (a)
make strong positive (or negative) predictions for positive (or negative) examples
and (b) for examples it gets wrong, make the most positive (negative) prediction
that it can. This choice is known as the hinge loss.
Now we think about future examples. We dont know what their feature
values will be, and we dont know their labels. But we do know that an example
Section 2.3
25
with feature vector x will be classified with the rule sign (()a x + b). If we classify
this example wrongly, we should like | a x + b | to be small. Achieving this would
mean that at least some nearby examples will have the right sign. The way to
achieve this is to ensure that || a || is small. By this argument, we would like to
achieve a small value of the hinge loss using a small value of || a ||. Thus, we add a
penalty term to the loss so that pairs (a, b) that have small values of the hinge loss
and large values of || a || are expensive. We minimize
#
"
N
X
T
a a
max(0, 1 yi (a xi + b))
+
S(a, b; ) = (1/N )
2
i=1
(hinge loss)
(penalty)
where is some weight that balances the importance of a small hinge loss against
the importance of a small || a ||. There are now two problems to solve. First, assume
we know ; we will need to find a and b that minimize S(a, b; ). Second, we will
need to estimate .
2.3.2 Finding a Minimum: General Points
I will first summarize general recipes for finding a minimum. Write u = [a, b] for the
vector obtained by stacking the vector a together with b. We have a function g(u),
and we wish to obtain a value of u that achieves the minimum for that function.
Sometimes we can solve this problem in closed form by constructing the gradient
and finding a value of u the makes the gradient zero. This happens mainly for
specially chosen problems that occur in textbooks. For practical problems, we tend
to need a numerical method.
Typical methods take a point u(i) , update it to u(i+1) , then check to see
whether the result is a minimum. This process is started from a start point. The
choice of start point may or may not matter for general problems, but for our
problem it wont matter. The update is usually obtained by computing a direction
p(i) such that for small values of h, g(u(i) + hp(i) ) is smaller than g(u(i) ). Such a
direction is known as a descent direction. We must then determine how far to
go along the descent direction, a process known as line search.
One method to choose a descent direction is gradient descent, which uses
the negative gradient of the function. Recall our notation that
u1
u2
u=
...
ud
and that
g
u1
g
u2
g =
...
g
ud
Section 2.3
26
We can write a Taylor series expansion for the function g(u(i) + hp(i) ). We have
that
g(u(i) + hp(i) ) = g(u(i) ) + h(g)T p(i) + O(h2 )
This means that we can expect that if
p(i) = g(u(i) ),
we expect that, at least for small values of h, g(u(i) +hp(i) ) will be less than g(u(i) ).
This works (as long as g is differentiable, and quite often when it isnt) because g
must go down for at least small steps in this direction.
2.3.3 Finding a Minimum: Stochastic Gradient Descent
Assume we wish to minimize some function g(u) = g0 (u) + (1/N )
function of u. Gradient descent would require us to form
!
N
X
gi (u)
g(u) = g0 (u) + (1/N )
PN
i=1 gi (u),
as a
i=1
and then take a small step in this direction. But if N is large, this is unattractive,
as we might have to sum a lot of terms. This happens a lot in building classifiers,
where you might quite reasonably expect to deal with millions of examples. For
some cases, there might be trillions of examples. Touching each example at each
step really is impractical.
Instead, assume that, at each step, we choose a number k in the range 1 . . . N
uniformly and at random, and form
pk = (g0 (u) + gk (u))
and then take a small step along pk . Our new point becomes
(i)
a(i+1) = a(i) + pk ,
where is called the steplength (even though it very often isnt the length of the
step we take!). It is easy to show that
E[pk ] = g(u)
(where the expectation is over the random choice of k). This implies that if we take
many small steps along pk , they should average out to a step backwards along the
gradient. This approach is known as stochastic gradient descent (because were
not going along the gradient, but along a random vector which is the gradient only
in expectation). It isnt obvious that stochastic gradient descent is a good idea.
Although each step is easy to take, we may need to take more steps. The question
is then whether we gain in the increased speed of the step what we lose by having
to take more steps. Not much is known theoretically, but in practice the approach
is hugely successful for training classifiers.
Choosing a steplength takes some work. We cant search for the step that
gives us the best value of g, because we dont want to evaluate the function g (doing
Section 2.3
27
so involves looking at each of the gi terms). Instead, we use a steplength that is

large at the start so that the method can explore large changes in the values of
the classifier parameters and small steps later so that it settles down. One
useful strategy is to divide training into epochs. Each epoch is a block of a fixed
number of iterations. Each iteration is one of the steps given above, with fixed
steplength. However, the steplength changes from epoch to epoch. In particular,
in the rth epoch, the steplength is
a
(r) =
r+b
where a and b are constants chosen by experiment with small subsets of the dataset.
One cannot really test whether stochastic gradient descent has converged to
the right answer. A better approach is to plot the error as a function of epoch on
a validation set. This should vary randomly, but generally go down as the epochs
proceed. I have summarized this discussion in box 2.1. You should be aware that
the recipe there admits many useful variations, though.
Procedure: 2.1 Stochastic Gradient Descent

We have a dataset containing N pairs (xi , yi ). Each xi is a ddimensional feature vector, and each yi is a label, either 1 or 1.
Choose a set of possible values of the regularization weight . We
wish to train a model
PN that minimizes a cost function of the form
g(u) = 2 uT u+ ( N1 ) i=1 gi (u). Separate the data into three sets: test,
training and validation. For each value of the regularization weight,
train a model, and evaluate the model on validation data. Keep the
model that produces the lowest error rate on the validation data, and
report its performance on the test data.
Train a model by choosing a fixed number of epochs Ne , and the number of steps per epoch Ns . Choose a random start point, u0 = [a, b].
For each epoch, first compute the steplength. In the eth epoch, the
1
for constants a and b chosen by smallsteplength is typically = ae+b
scale experiments (you try training a model with different values and
see what happens). For the eth epoch, choose a subset of the training
set for validation for that epoch. Now repeat until the model has been
updated Ns times:
Take k steps. Each step is taken by selecting a single data item
uniformly and at random. Assume we select the ith data item.
We then compute p = gi (u) u, and update the model by
computing
un+1 = un + p
Evaluate the current model by computing the accuracy on the
validation set for that epoch. Plot the accuracy as a function of
step number.
Section 2.3
28
2.3.4 Example: Training a Support Vector Machine with Stochastic Gradient Descent
I have summarized stochastic gradient descent in algorithm 2.1, but here is an
example in more detail. We need to choose a and b to minimize
C(a, b) = (1/N )
N
X
i=1
max(0, 1 yi (a xi + b)) +
T
a a.
2
This is a support vector machine, because it uses hinge loss. For a support vector
machine, stochastic gradient descent is particularly easy. We have estimates a(n)
and b(n) of the classifier parameters, and we want to improve the estimates. We
pick the kth example at random. We must now compute

T
max(0, 1 yk (a xk + b)) + a a .
2
Assume that yk (a xk + b) > 1. In this case, the classifier predicts a score with
the right sign, and a magnitude that is greater than one. Then the first term is
zero, and the gradient of the second term is easy. Now if yk (a xk + b) < 1, we can
ignore the max, and the first term is 1 yk (a xk + b); the gradient is again easy.
But what if yk (a xk + b) = 1? there are two distinct values we could choose for
the gradient, because the max term isnt differentiable. It turns out not to matter
which term we choose (Figure ??), so we can write the gradient as

a
if yk (a xk + b) 1
pk =

a yk x
otherwise
yk
We choose a steplength , and update our estimates using this gradient. This yields:

a
if yk (a xk + b) 1
(n+1)
(n)
a
=a
a yk x otherwise
and
b(n+1) = b(n)
0
yk
if yk (a xk + b) 1
.
otherwise
To construct figures, I downloaded the dataset at http://archive.ics.uci.edu/

ml/datasets/Adult. This dataset apparently contains 48, 842 data items, but I
worked with only the first 32, 000. Each consists of a set of numeric and categorical
features describing a person, together with whether their annual income is larger
than or smaller than 50K$. I ignored the categorical features to prepare these
figures. This isnt wise if you want a good classifier, but its fine for an example.
I used these features to predict whether income is over or under 50K$. I split the
data into 5, 000 test examples, and 27,000 training examples. Its important to
do so at random. There are 6 numerical features. I subtracted the mean (which
doesnt usually make much difference) and rescaled each so that the variance was
1 (which is often very important). I used two different training regimes.
Section 2.3
1e7
1e5
1e3
1e1
1
Size of w
5
4
3
2
0.8
0.6
0.4
0.2
1
0
0
29
Held out error
50
Epoch
100
0
0
20
40
60
Epoch
80
100
FIGURE 2.2: On the left, the magnitude of the weight vector a at the end of each
epoch for the first training regime described in the text. On the right, the accuracy on held out data at the end of each epoch. Notice how different choices of
regularization parameter lead to different magnitudes of a; how the method isnt
particularly sensitive to choice of regularization parameter (they change by factors
of 100); how the accuracy settles down fairly quickly; and how overlarge values of
the regularization parameter do lead to a loss of accuracy.
In the first training regime, there were 100 epochs. In each epoch, I applied
426 steps. For each step, I selected one data item uniformly at random (sampling
with replacement), then stepped down the gradient. This means the method sees
a total of 42, 600 data items. This means that there is a high probability it has
touched each data item once (27, 000 isnt enough, because we are sampling with
replacement, so some items get seen more than once). I chose 5 different values
for the regularization parameter and trained with a steplength of 1/(0.01 e + 50),
where e is the epoch. At the end of each epoch, I computed aT a and the accuracy
(fraction of examples correctly classified) of the current classifier on the held out
test examples. Figure 2.2 shows the results. You should notice that the accuracy
changes slightly each epoch; that for larger regularizer values aT a is smaller; and
that the accuracy settles down to about 0.8 very quickly.
In the second training regime, there were 100 epochs. In each epoch, I applied
50 steps. For each step, I selected one data item uniformly at random (sampling
with replacement), then stepped down the gradient. This means the method sees
a total of 5,000 data items, and about 3, 216 unique data items it hasnt seen
the whole training set. I chose 5 different values for the regularization parameter
and trained with a steplength of 1/(0.01 e + 50), where e is the epoch. At the end
of each epoch, I computed aT a and the accuracy (fraction of examples correctly
classified) of the current classifier on the held out test examples. Figure 2.3 shows
the results. You should notice that the accuracy changes slightly each epoch; that
for larger regularizer values aT a is smaller; and that the accuracy settles down
to about 0.8 very quickly; and that there isnt much difference between the two
training regimes. All of these points are relatively typical of stochastic gradient
Section 2.3
1e7
1e5
1e3
1e1
1
Size of w
5
4
3
2
30
1
0.8
Held out error
0.6
0.4
0.2
1
0
0
50
Epoch
100
0
0
20
40
60
Epoch
80
100
FIGURE 2.3: On the left, the magnitude of the weight vector a at the end of each
epoch for the second training regime described in the text. On the right, the accuracy on held out data at the end of each epoch. Notice how different choices of
regularization parameter lead to different magnitudes of a; how the method isnt
particularly sensitive to choice of regularization parameter (they change by factors
of 100); how the accuracy settles down fairly quickly; and how overlarge values of
the regularization parameter do lead to a loss of accuracy.
descent with very large datasets.
Remember this:
Linear SVMs are a go-to classifier. When you have
a binary classification problem, the first step should be to try a linear SVM.
There is an immense quantity of good software available.
2.3.5 Multi-Class Classifiers

I have shown how one trains a linear SVM to make a binary prediction (i.e. predict
one of two outcomes). But what if there are three, or more, labels? In principle,
you could write a binary code for each label, then use a different SVM to predict
each bit of the code. It turns out that this doesnt work terribly well, because an
error by one of the SVMs is usually catastrophic.
There are two methods that are widely used. In the all-vs-all approach, we
train a binary classifier for each pair of classes. To classify an example, we present it
to each of these classifiers. Each classifier decides which of two classes the example
belongs to, then records a vote for that class. The example gets the class label with
the most votes. This approach is simple, but scales very badly with the number of
classes (you have to build O(N 2 ) different SVMs for N classes).
In the one-vs-all approach, we build a binary classifier for each class. This
classifier must distinguish its class from all the other classes. We then take the class
Section 2.4
Classifying with Random Forests
31
with the largest classifier score. One can think up quite good reasons this approach
shouldnt work. For one thing, the classifier isnt told that you intend to use the
score to tell similarity between classes. In practice, the approach works rather well
and is quite widely used. This approach scales a bit better with the number of
classes (O(N )).
Remember this:
It is straightforward to build a multi-class classifier
out of binary classifiers. Any decent SVM package will do this for you.
2.4 CLASSIFYING WITH RANDOM FORESTS

I described a classifier as a rule that takes a feature, and produces a class. One way
to build such a rule is with a sequence of simple tests, where each test is allowed
to use the results of all previous tests. This class of rule can be drawn as a tree
(Figure ??), where each node represents a test, and the edges represent the possible
outcomes of the test. To classify a test item with such a tree, you present it to
the first node; the outcome of the test determines which node it goes to next; and
so on, until the example arrives at a leaf. When it does arrive at a leaf, we label
the test item with the most common label in the leaf. This object is known as a
decision tree. Notice one attractive feature of this decision tree: it deals with
multiple class labels quite easily, because you just label the test item with the most
common label in the leaf that it arrives at when you pass it down the tree.
Figure 2.5 shows a simple 2D dataset with four classes, next to a decision
tree that will correctly classify at least the training data. Actually classifying data
with a tree like this is straightforward. We take the data item, and pass it down
the tree. Notice it cant go both left and right, because of the way the tests work.
This means each data item arrives at a single leaf. We take the most common
label at the leaf, and give that to the test item. In turn, this means we can build
a geometric structure on the feature space that corresponds to the decision tree.
I have illustrated that structure in figure 2.5, where the first decision splits the
feature space in half (which is why the term split is used so often), and then the
next decisions split each of those halves into two.
The important question is how to get the tree from data. It turns out that
the best approach for building a tree incorporates a great deal of randomness. As
a result, we will get a different tree each time we train a tree on a dataset. None of
the individual trees will be particularly good (they are often referred to as weak
learners). The natural thing to do is to produce many such trees (a decision
forest), and allow each to vote; the class that gets the most votes, wins. This
strategy is extremely effective.
2.4.1 Building a Decision Tree
There are many algorithms for building decision trees. We will use an approach
chosen for simplicity and effectiveness; be aware there are others. We will always
Section 2.4
32
moves
bites
big
furry
cat
toddler
dog
chair leg
cardboard
sofa
box
FIGURE 2.4: This the household robots guide to obstacles is a typical decision
tree. I have labelled only one of the outgoing branches, because the other is the
negation. So if the obstacle moves, bites, but isnt furry, then its a toddler. In
general, an item is passed down the tree until it hits a leaf. It is then labelled with
the leaf s label.
use a binary tree, because its easier to describe and because thats usual (it doesnt
change anything important, though). Each node has a decision function, which
takes data items and returns either 1 or -1.
We train the tree by thinking about its effect on the training data. We pass
the whole pool of training data into the root. Any node splits its incoming data
into two pools, left (all the data that the decision function labels 1) and right (ditto,
-1). Finally, each leaf contains a pool of data, which it cant split because it is a
leaf.
Training the tree uses a straightforward algorithm. First, we choose a class of
decision functions to use at each node. It turns out that a very effective algorithm
is to choose a single feature at random, then test whether its value is larger than, or
smaller than a threshold. For this approach to work, one needs to be quite careful
about the choice of threshold, which is what we describe in the next section. Some
minor adjustments, described below, are required if the feature chosen isnt ordinal.
Surprisingly, being clever about the choice of feature doesnt seem add a great deal
of value. We wont spend more time on other kinds of decision function, though
there are lots.
Now assume we use a decision function as described, and we know how to
choose a threshold. We start with the root node, then recursively either split the
pool of data at that node, passing the left pool left and the right pool right, or stop
splitting and return. Splitting involves choosing a decision function from the class
to give the best split for a leaf. The main questions are how to choose the best
split (next section), and when to stop.
Stopping is relatively straightforward. Quite simple strategies for stopping
are very good. It is hard to choose a decision function with very little data, so we
must stop splitting when there is too little data at a node. We can tell this is the
Section 2.4
33
5
y>.32
x>-0.58
+
x>1.06
5
5
FIGURE 2.5: A straightforward decision tree, illustrated in two ways. On the left,
I have given the rules at each split; on the right, I have shown the data points in
two dimensions, and the structure that the tree produces in the feature space.
case by testing the amount of data against a threshold, chosen by experiment. If all
the data at a node belongs to a single class, there is no point in splitting. Finally,
constructing a tree that is too deep tends to result in generalization problems, so
we usually allow no more than a fixed depth D of splits. Choosing the best splitting
threshold is more complicated.
Figure 2.6 shows two possible splits of a pool of training data. One is quite
obviously a lot better than the other. In the good case, the split separates the pool
into positives and negatives. In the bad case, each side of the split has the same
number of positives and negatives. We cannot usually produce splits as good as
the good case here. What we are looking for is a split that will make the proper
label more certain.
Figure 2.7 shows a more subtle case to illustrate this. The splits in this figure
are obtained by testing the horizontal feature against a threshold. In one case,
the left and the right pools contain about the same fraction of positive (x) and
negative (o) examples. In the other, the left pool is all positive, and the right pool
is mostly negative. This is the better choice of threshold. If we were to label any
item on the left side positive and any item on the right side negative, the error rate
would be fairly small. If you count, the best error rate for the informative split is
20% on the training data, and for the uninformative split it is 40% on the training
data.
But we need some way to score the splits, so we can tell which threshold is
best. Notice that, in the uninformative case, knowing that a data item is on the
left (or the right) does not tell me much more about the data than I already knew.
We have that p(1|left pool, uninformative) = 2/3 3/5 = p(1|parent pool) and
p(1|right pool, uninformative) = 1/2 3/5 = p(1|parent pool). For the informative pool, knowing a data item is on the left classifies it completely, and knowing
that it is on the right allows us to classify it an error rate of 1/3. The informative
34
x
x
o
o
Informative split
x
x x
x x
o
o
o
o
o
o
o
o
o
o
o
Section 2.4
Less informative split
FIGURE 2.6: Two possible splits of a pool of training data. Positive data is repre-
sented with an x, negative data with a o. Notice that if we split this pool with
the informative line, all the points on the left are os, and all the points on the
right are xs. This is an excellent choice of split once we have arrived in a leaf,
everything has the same label. Compare this with the less informative split. We
started with a node that was half x and half o, and now have two nodes each of
which is half x and half o this isnt an improvement, because we do not know
more about the label as a result of the split.
split means that my uncertainty about what class the data item belongs to is significantly reduced if I know whether it goes left or right. To choose a good threshold,
we need to keep track of how informative the split is.
2.4.2 Entropy and Information Gain
It turns out to be straightforward to keep track of information, in simple cases. We
will start with an example. Assume I have 4 classes. There are 8 examples in class
1, 4 in class 2, 2 in class 3, and 2 in class 4. How much information on average will
you need to send me to tell me the class of a given example? Clearly, this depends
on how you communicate the information. You could send me the complete works
of Edward Gibbon to communicate class 1; the Encyclopaedia for class 2; and so
on. But this would be redundant. The question is how little can you send me.
Keeping track of the amount of information is easier if we encode it with bits (i.e.
you can send me sequences of 0s and 1s).
Imagine the following scheme. If an example is in class 1, you send me a 1.
If it is in class 2, you send me 01; if it is in class 3, you send me 001; and in class
4, you send me 101. Then the expected number of bits you will send me is
p(class = 1)1 + p(2)2 + p(3)3 + p(4)3 =
1
1
1
1
1+ 2+ 3+ 3
2
4
8
8
which is 1.75 bits. This number doesnt have to be an integer, because its an
expectation.
Notice that for the ith class, you have sent me log2 p(i) bits. We can write
o
o
o
x
x
x x
x
x
x x
x x
o
o
o
o
o
x
x
x
x
x x
x
x x
o
o
x
x
x x
o
o
x
x
x
x x
x x
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
x
x
x
x x
x
x x
o
o
o
o
o
o
o
o
o
x
x
x
x x
x
x x
35
Section 2.4
Informative split
Less informative split
FIGURE 2.7: Two possible splits of a pool of training data. Positive data is repre-
sented with an x, negative data with a o. Notice that if we split this pool with
the informative line, all the points on the left are xs, and two-thirds of the points
on the right are os. This means that knowing which side of the split a point lies
would give us a good basis for estimating the label. In the less informative case,
about two-thirds of the points on the left are xs and about half on the right are xs
knowing which side of the split a point lies is much less useful in deciding what
the label is.
the expected number of bits you need to send me as
X
p(i) log2 p(i).
This expression handles other simple cases correctly, too. You should notice that it
isnt really important how many objects appear in each class. Instead, the fraction
of all examples that appear in the class is what matters. This fraction is the prior
probability that an item will belong to the class. You should try what happens if
you have two classes, with an even number of examples in each; 256 classes, with
an even number of examples in each; and 5 classes, with p(1) = 1/2, p(2) = 1/4,
p(3) = 1/8, p(4) = 1/16 and p(5) = 1/16. If you try other examples, you may find
it hard to construct a scheme where you can send as few bits on average as this
expression predicts. It turns out that, in general, the smallest number of bits you
will need to send me is given by the expression
X
p(i) log2 p(i)
under all conditions, though it may be hard or impossible to determine what representation is required to achieve this number.
The entropy of a probability distribution is a number that scores how many
bits, on average, would need to be known to identify an item sampled from that
probability distribution. For a discrete probability distribution, the entropy is
computed as
X
p(i) log2 p(i)
Section 2.4
36
where i ranges over all the numbers where p(i) is not zero. For example, if we
have two classes and p(1) = 0.99, then the entropy is 0.0808, meaning you need
very little information to tell which class an object belongs to. This makes sense,
because there is a very high probability it belongs to class 1; you need very little
information to tell you when it is in class 2. If you are worried by the prospect of
having to send 0.0808 bits, remember this is an average, so you can interpret the
number as meaning that, if you want to tell which class each of 104 independent
objects belong to, you could do so in principle with only 808 bits.
Generally, the entropy is larger if the class of an item is more uncertain.
Imagine we have two classes and p(1) = 0.5, then the entropy is 1, and this is the
largest possible value for a probability distribution on two classes. You can always
tell which of two classes an object belongs to with just one bit (though you might
be able to tell with even less than one bit).
2.4.3 Entropy and Splits
Now we return to the splits. Write P for the set of all data at the node. Write Pl
for the left pool, and Pr for the right pool. The entropy of a pool C that scores
how many bits would be required to represent the class of an item in that pool, on
average. Write n(i; C) for the number of items of class i in the pool, and N (C) for
the number of items in the pool. Then the entropy is H(C) of the pool C is
X n(i; C)
n(i; C)
log2
.
N (C)
N (C
i
It is straightforward that H(P) bits are required to classify an item in the parent
pool P. For an item in the left pool, we need H(Pl ) bits; for an item in the right
pool, we need H(Pr ) bits. If we split the parent pool, we expect to encounter items
in the left pool with probability
N (Pl )
N (P)
and items in the right pool with probability
N (Pr )
.
N (P)
This means that, on average, we must supply
N (Pl )
N (Pr )
H(Pl ) +
H(Pr )
N (P)
N (P)
bits to classify data items if we split the parent pool. Now a good split is one that
results in left and right pools that are informative. In turn, we should need fewer
bits to classify once we have split than we need before the split. You can see the
difference

N (Pr )
N (Pl )
H(Pl ) +
H(Pr )
I(Pl , Pr ; P) = H(P)
N (P)
N (P)
as the information gain caused by the split. This is the average number of bits
that you dont have to supply if you know which side of the split an example lies.
Better splits have larger information gain.
x
x
x
x
o
o
o
o
o
o
o
o
o
o
o
o
37
Section 2.4
FIGURE 2.8: We search for a good splitting threshold by looking at values of the
chosen component that yield different splits. On the left, I show a small dataset
and its projection onto the chosen splitting component (the horizontal axis). For the
8 data points here, there are only 7 threshold values that produce interesting splits,
and these are shown as ts on the axis. On the right, I show a larger dataset; in
this case, I have projected only a subset of the data, which results in a small set of
thresholds to search.
2.4.4 Choosing a Split with Information Gain

Recall that our decision function is to choose a feature at random, then test its
value against a threshold. Any data point where the value is larger goes to the left
pool; where the value is smaller goes to the right. This may sound much too simple
to work, but it is actually effective and popular. Assume that we are at a node,
which we will label k. We have the pool of training examples that have reached
that node. The ith example has a feature vector xi , and each of these feature
vectors is a d dimensional vector.
We choose an integer j in the range 1 . . . d uniformly and at random. We will
(j)
split on this feature, and we store j in the node. Recall we write xi for the value
of the jth component of the ith feature vector. We will choose a threshold tk ,
(j)
and split by testing the sign of xi tk . Choosing the value of tk is easy. Assume
there are Nk examples in the pool. Then there are Nk 1 possible values of tk
that lead to different splits. To see this, sort the Nk examples by x(j) , then choose
values of tk halfway between example values (Figure 2.8). For each of these values,
we compute the information gain of the split. We then keep the threshold with the
best information gain.
We can elaborate this procedure in a useful way, by choosing m features at
random, finding the best split for each, then keeping the feature and threshold
value that is best. It is important that m is a lot smaller than the total number of
features a usual root of thumb is that m is about the square root of the total
number of features. It is usual to choose a single m, and choose that for all the
splits.
Now assume we happen to have chosen to work with a feature that isnt
ordinal, and so cant be tested against a threshold. A natural, and effective, strategy
Section 2.4
38
is as follows. We can split such a feature into two pools by flipping an unbiased
coin for each value if the coin comes up H, any data point with that value goes
left, and if it comes up T , any data point with that value goes right. We chose this
split at random, so it might not be any good. We can come up with a good split by
repeating this procedure F times, computing the information gain for each split,
then keeping the one that has the best information gain. We choose F in advance,
and it usually depends on the number of values the categorical variable can take.
We now have a relatively straightforward blueprint for an algorithm, which I
have put in a box. Its a blueprint, because there are a variety of ways in which it
can be revised and changed.
Procedure: 2.2 Building a decision tree

We have a dataset containing N pairs (xi , yi ). Each xi is a ddimensional feature vector, and each yi is a label. Call this dataset
a pool. Now recursively apply the following procedure:
If the pool is too small, or if all items in the pool have the same
label, or if the depth of the recursion has reached a limit, stop.
Otherwise, search the features for a good split that divides the
pool into two, then apply this procedure to each child.
We search for a good split by the following procedure:
Choose a subset of the feature components at random. Typically,
one uses a subset whose size is about the square root of the feature
dimension.
For each component of this subset, search for the best splitting
threshold. Do so by selecting a set of possible values for the
threshold, then for each value splitting the dataset (every data
item with a value of the component below the threshold goes left,
others go right), and computing the information gain for the split.
Keep the threshold that has the largest information gain.
A good set of possible values for the threshold will contain values that
separate the data reasonably. If the pool of data is small, you can
project the data onto the feature component (i.e. look at the values of
that component alone), then choose the N 1 distinct values that lie
between two data points. If it is big, you can randomly select a subset
of the data, then project that subset on the feature component and
choose from the values between data points.
Section 2.4
39
2.4.5 Forests
A single decision tree tends to yield poor classifications. One reason is because the
tree is not chosen to give the best classification of its training data. We used a
random selection of splitting variables at each node, so the tree cant be the best
possible. Obtaining the best possible tree presents significant technical difficulties.
It turns out that the tree that gives the best possible results on the training data
can perform rather poorly on test data. The training data is a small subset of
possible examples, and so must differ from the test data. The best possible tree on
the training data might have a large number of small leaves, built using carefully
chosen splits. But the choices that are best for training data might not be best for
test data.
Rather than build the best possible tree, we have built a tree efficiently, but
with number of random choices. If we were to rebuild the tree, we would obtain
a different result. This suggests the following extremely effective strategy: build
many trees, and classify by merging their results.
2.4.6 Building and Evaluating a Decision Forest
There are two important strategies for building and evaluating decision forests. I
am not aware of evidence strongly favoring one over the other, but different software
packages use different strategies, and you should be aware of the options. In one
strategy, we separate labelled data into a training and a test set. We then build
multiple decision trees, training each using the whole training set. Finally, we
evaluate the forest on the test set. In this approach, the forest has not seen some
fraction of the available labelled data, because we used it to test. However, each
tree has seen every training data item.
Procedure: 2.3 Building a decision forest

We have a dataset containing N pairs (xi , yi ). Each xi is a ddimensional feature vector, and each yi is a label. Separate the dataset
into a test set and a training set. Train multiple distinct decision trees
on the training set, recalling that the use of a random set of components
to find a good split means you will obtain a distinct tree each time.
In the other strategy, sometimes called bagging, each time we train a tree we
randomly subsample the labelled data with replacement, to yield a training set the
same size as the original set of labelled data. Notice that there will be duplicates
in this training set, which is like a bootstrap replicate. This training set is often
called a bag. We keep a record of the examples that do not appear in the bag (the
out of bag examples). Now to evaluate the forest, we evaluate each tree on its
out of bag examples, and average these error terms. In this approach, the entire
forest has seen all labelled data, and we also get an estimate of error, but no tree
has seen all the training data.
Section 2.4
40
Procedure: 2.4 Building a decision forest using bagging

We have a dataset containing N pairs (xi , yi ). Each xi is a ddimensional feature vector, and each yi is a label. Now build k bootstrap replicates of the training data set. Train one decision tree on each
replicate.
2.4.7 Classifying Data Items with a Decision Forest

Once we have a forest, we must classify test data items. There are two major
strategies. The simplest is to classify the item with each tree in the forest, then
take the class with the most votes. This is effective, but discounts some evidence
that might be important. For example, imagine one of the trees in the forest has a
leaf with many data items with the same class label; another tree has a leaf with
exactly one data item in it. One might not want each leaf to have the same vote.
Procedure: 2.5 Classification with a decision forest

Given a test example x, pass it down each tree of the forest. Now choose
one of the following strategies.
Each time the example arrives at a leaf, record one vote for the
label that occurs most often at the leaf. Now choose the label
with the most votes.
Each time the example arrives at a leaf, record Nl votes for each of
the labels that occur at the leaf, where Nl is the number of times
the label appears in the training data at the leaf. Now choose the
label with the most votes.
An alternative strategy that takes this observation into account is to pass the
test data item down each tree. When it arrives at a leaf, we record one vote for each
of the training data items in that leaf. The vote goes to the class of the training
data item. Finally, we take the class with the most votes. This approach allows
big, accurate leaves to dominate the voting process. Both strategies are in use, and
I am not aware of compelling evidence that one is always better than the other.
This may be because the randomness in the training process makes big, accurate
leaves uncommon in practice.
Section 2.4
Worked example 2.3
41
Classifying heart disease data
Build a random forest classifier to classify the heart dataset from the UC
Irvine machine learning repository. The dataset is at http://archive.ics.uci.edu/
ml/datasets/Heart+Disease. There are several versions. You should look at the
processed Cleveland data, which is in the file processed.cleveland.data.txt.
Solution: I used the R random forest package. This uses a bagging strategy.
There is sample code in listing ??. This package makes it quite simple to fit
a random forest, as you can see. In this dataset, variable 14 (V14) takes the
value 0, 1, 2, 3 or 4 depending on the severity of the narrowing of the arteries.
Other variables are physiological and physical measurements pertaining to the
patient (read the details on the website). I tried to predict all five levels of
variable 14, using the random forest as a multivariate classifier. This works
rather poorly, as the out-of-bag class confusion matrix below shows. The total
out-of-bag error rate was 45%.
Predict Predict Predict Predict Predict Class
0
1
2
3
4
error
True 0
151
7
2
3
1
7.9%
True 1
32
5
9
9
0
91%
True 2
10
9
7
9
1
81%
True 3
6
13
9
5
2
86%
True 4
2
3
2
6
0
100%
This is the example of a class confusion matrix from table 2.1. Fairly clearly,
one can predict narrowing or no narrowing from the features, but not the
degree of narrowing (at least, not with a random forest). So it is natural to
quantize variable 14 to two levels, 0 (meaning no narrowing), and 1 (meaning
any narrowing, so the original value could have been 1, 2, or 3). I then built
a random forest to predict this from the other variables. The total out-of-bag
error rate was 19%, and I obtained the following out-of-bag class confusion
matrix
Predict Predict Class
0
1
error
True 0
138
26
16%
True 1
31
108
22%
Notice that the false positive rate (16%, from 26/164) is rather better than the
false negative rate (22%). Looking at these class confusion matrices, you might
wonder whether it is better to predict 0, . . . , 4, then quantize. But this is not a
particularly good idea. While the false positive rate is 7.9%, the false negative
rate is much higher (36%, from 50/139). In this application, a false negative is
likely more of a problem than a false positive, so the tradeoff is unattractive.
Section 2.5
Classifying with Nearest Neighbors
42
Listing 2.1: R code used for the random forests of worked example 2.3
setwd ( / u s e r s / d a f / C u r r e n t/ c o u r s e s / P r o b c o u r s e / T r e e s /RCode ) ;
i n s t a l l . packages ( randomForest )
l i b r a r y ( randomForest )
h e a r t<read . csv ( p r o c e s s e d . c l e v e l a n d . data . t x t , h e a d e r=FALSE)
h e a r t $ l e v e l s<as . f a c t o r ( h e a r t $V14 )
h e a r t f o r e s t . a l l v a l s<
randomForest ( formula=l e v e l s V1+V2+V3+V4+V5+V6
+V7+V8+V9+V10+V11+V12+V13 ,
data=h e a r t , ty p e= c l a s s i f i c a t i o n , mtry=5)
# t h i s f i t s to a l l l e v e l s
# I g o t t h e CCM by t y p i n g
heartforest . allvals
h e a r t $ y e s n o<cut ( h e a r t $V14 , c( I n f , 0 . 1 , I n f ) )
h e a r t f o r e s t<
randomForest ( formula=y e s n o V1+V2+V3+V4+V5+V6
+V7+V8+V9+V10+V11+V12+V13 ,
data=h e a r t , ty p e= c l a s s i f i c a t i o n , mtry=5)
# t h i s f i t s to the quantized case
# I g o t t h e CCM by t y p i n g
heartforest
Remember this: Random forests are straightforward to build, and very

effective. They can predict any kind of label. Good software implementations are easily available.
2.5 CLASSIFYING WITH NEAREST NEIGHBORS

Generally we expect that, if two points xi and xj are close, then their labels will be
the same most of the time. If we were unlucky, the two points could lie on either
side of a decision boundary, but most pairs of nearby points will have the same
labels.
This observation suggests the extremely useful and general strategy of exploiting a data items neighbors. If you want to classify a data item, find the closest
example, and report the class of that example. Alternatively, you could find the
closest k examples, and vote. Imagine we have a data point x that we wish to
classify (a query point). Our strategy will be to find the closest training example,
and report its class.
How well can we expect this strategy to work? Remember that any classifier
will slice up the space of examples into cells (which might be quite complicated)
where every point in a cell has the same label. The boundaries between cells are
decision boundaries when a point passes over the decision boundary, the label
changes. Now assume we have a large number of labelled training examples, and
we know the best possible set of decision boundaries. If there are many training
examples, there should be at least one training example that is close to the query
Section 2.5
43
point. If there are enough training examples, then the closest point should be inside
the same cell as the query point.
You may be worried that, if the query point is close to a decision boundary,
the closest point might be on the other side of that boundary. But if it were,
we could improve things by simply having more training points. All this suggests
that, with enough training points, our classifier should work about as well as the
best possible classifier. This intuition turns out to be correct, though the number
of training points required is wholly impractical, particularly for high-dimensional
feature vectors.
One important generalization is to find the k nearest neighbors, then choose
a label from those. A (k, l) nearest neighbor classifier finds the k example points
closest to the point being considered, and classifies this point with the class that has
the highest number of votes, as long as this class has more than l votes (otherwise,
the point is classified as unknown). A (k, 0)-nearest neighbor classifier is usually
known as a k-nearest neighbor classifier, and a (1, 0)-nearest neighbor classifier
is usually known as a nearest neighbor classifier. In practice, one seldom uses
more than three nearest neighbors. Finding the k nearest points for a particular
query can be difficult, and Section ?? reviews this point.
There are three practical difficulties in building nearest neighbor classifiers.
You need a lot of labelled examples. You need to be able to find the nearest
neighbors for your query point. And you need to use a sensible choice of distance.
For features that are obviously of the same type, such as lengths, the usual metric
may be good enough. But what if one feature is a length, one is a color, and one is
an angle? One possibility is to whiten the features (section 3.1). This may be hard
if the dimension is so large that the covariance matrix is hard to estimate. It is
almost always a good idea to scale each feature independently so that the variance
of each feature is the same, or at least consistent; this prevents features with very
large scales dominating those with very small scales. Notice that nearest neighbors
(fairly obviously) doesnt like categorical data. If you cant give a clear account
of how far apart two things are, you shouldnt be doing nearest neighbors. It is
possible to fudge this point a little, by (say) putting together a distance between
the levels of each factor, but its probably unwise.
Nearest neighbors is wonderfully flexible about the labels the classifier predicts. Nothing changes when you go from a two-class classifier to a multi-class
classifier.
Section 2.5
Worked example 2.4
44
Classifying using nearest neighbors
Build a nearest neighbor classifier to classify the digit data originally constructed by Yann Lecun. You can find it at several places. The original dataset
is at http://yann.lecun.com/exdb/mnist/. The version I used was used for a
Kaggle competition (so I didnt have to decompress Lecuns original format).
I found it at http://www.kaggle.com/c/digit-recognizer.
Solution: As youd expect, R has nearest neighbor code that seems quite
good (I havent had any real problems with it, at least). There isnt really all
that much to say about the code. I used the R FNN package. This uses a
bagging strategy. There is sample code in listing ??. I trained on 1000 of the
42000 examples, so you could see how in the code. I tested on the next 200
examples. For this (rather small) case, I found the following class confusion
matrix
P
0
1
2
3
4
5
6
7
8
9
0 12 0
0
0
0
0
0
0
0
0
1 0 20 4
1
0
1
0
2
2
1
2 0
0 20 1
0
0
0
0
0
0
3 0
0
0 12 0
0
0
0
4
0
4 0
0
0
0 18 0
0
0
1
1
5 0
0
0
0
0 19 0
0
1
0
6 1
0
0
0
0
0 18 0
0
0
7 0
0
1
0
0
0
0 19 0
2
8 0
0
1
0
0
0
0
0 16 0
9 0
0
0
2
3
1
0
1
1 14
There are no class error rates here, because I was in a rush and couldnt recall
the magic line of R to get them. However, you can see the classifier works
rather well for this case.
Remember this:
Nearest neighbor classifiers are often very effective.
They can predict any kind of label. You do need to be careful to have enough
data, and to have a meaningful distance function.
Section 2.6
You should
45
2.6 YOU SHOULD

2.6.1 be able to:
construct a naive bayes classifier for continuous and discrete features
train a linear SVM with stochastic gradient descent and evaluate the resulting
classifier
train a random forest using a package and evaluate the resulting classifier
train and evaluate a nearest neighbors classifier
2.6.2 remember:
New term: classifier . . . . . . . . . . . . . . . . . . . . . . . .
New term: feature vectors . . . . . . . . . . . . . . . . . . . . .
Definition: Classifier . . . . . . . . . . . . . . . . . . . . . . . .
New term: decision boundaries . . . . . . . . . . . . . . . . . .
New term: overfitting . . . . . . . . . . . . . . . . . . . . . . .
New term: selection bias . . . . . . . . . . . . . . . . . . . . . .
New term: generalizing badly . . . . . . . . . . . . . . . . . . .
Do not evaluate a classifier on training data. . . . . . . . . . . .
New term: validation set . . . . . . . . . . . . . . . . . . . . . .
New term: test set . . . . . . . . . . . . . . . . . . . . . . . . .
New term: class confusion matrix . . . . . . . . . . . . . . . . .
New term: Vapnik-Chervonenkis dimension . . . . . . . . . . .
New term: V-C dimension . . . . . . . . . . . . . . . . . . . . .
New term: likelihood . . . . . . . . . . . . . . . . . . . . . . . .
New term: class conditional probability . . . . . . . . . . . . .
New term: prior . . . . . . . . . . . . . . . . . . . . . . . . . .
New term: posterior . . . . . . . . . . . . . . . . . . . . . . . .
New term: decision boundary . . . . . . . . . . . . . . . . . . .
New term: descent direction . . . . . . . . . . . . . . . . . . . .
New term: line search . . . . . . . . . . . . . . . . . . . . . . .
New term: gradient descent . . . . . . . . . . . . . . . . . . . .
New term: steplength . . . . . . . . . . . . . . . . . . . . . . .
New term: stochastic gradient descent . . . . . . . . . . . . . .
Linear SVMs are a go-to classifier. . . . . . . . . . . . . . . . .
Any SVM package should build a multi-class classifier for you.
New term: decision tree . . . . . . . . . . . . . . . . . . . . . .
New term: decision forest . . . . . . . . . . . . . . . . . . . . .
New term: decision function . . . . . . . . . . . . . . . . . . . .
New term: entropy . . . . . . . . . . . . . . . . . . . . . . . . .
New term: entropy . . . . . . . . . . . . . . . . . . . . . . . . .
New term: information gain . . . . . . . . . . . . . . . . . . . .
New term: bagging . . . . . . . . . . . . . . . . . . . . . . . . .
New term: bag . . . . . . . . . . . . . . . . . . . . . . . . . . .
Random forests are good and easy. . . . . . . . . . . . . . . . .
Nearest neighbors are good and easy. . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
7
7
7
9
10
10
10
10
10
10
11
12
12
13
13
13
13
17
20
20
20
21
21
25
26
26
26
27
30
31
31
34
34
37
39
Section 2.6
You should
46
PROBLEMS
PROGRAMMING EXERCISES
2.1. The UC Irvine machine learning data repository hosts a famous collection of
data on whether a patient has diabetes (the Pima Indians dataset), originally
owned by the National Institute of Diabetes and Digestive and Kidney Diseases
and donated by Vincent Sigillito. This can be found at http://archive.ics.uci.
edu/ml/datasets/Pima+Indians+Diabetes. This data has a set of attributes of
patients, and a categorical variable telling whether the patient is diabetic or
not. For several attributes in this data set, a value of 0 may indicate a missing
value of the variable.
(a) Build a simple naive Bayes classifier to classify this data set. You should
hold out 20% of the data for evaluation, and use the other 80% for training.
You should use a normal distribution to model each of the class-conditional
distributions. You should write this classifier yourself (its quite straightforward), but you may find the function createDataPartition in the R
package caret helpful to get the random partition.
(b) Now adjust your code so that, for attribute 3 (Diastolic blood pressure),
attribute 4 (Triceps skin fold thickness), attribute 6 (Body mass index),
and attribute 8 (Age), it regards a value of 0 as a missing value when
estimating the class-conditional distributions, and the posterior. R uses
a special number NA to flag a missing value. Most functions handle this
number in special, but sensible, ways; but youll need to do a bit of looking
at manuals to check. Does this affect the accuracy of your classifier?
(c) Now use the caret and klaR packages to build a naive bayes classifier
for this data, assuming that no attribute has a missing value. The caret
package does cross-validation (look at train) and can be used to hold out
data. The klaR package can estimate class-conditional densities using a
density estimation procedure that I will describe much later in the course.
Use the cross-validation mechanisms in caret to estimate the accuracy of
your classifier. I have not been able to persuade the combination of caret
and klaR to handle missing values the way Id like them to, but that may
be ignorance (look at the na.action argument).
(d) Now install SVMLight, which you can find at http://svmlight.joachims.
org, via the interface in klaR (look for svmlight in the manual) to train
and evaluate an SVM to classify this data. You dont need to understand
much about SVMs to do this well do that in following exercises. You
should hold out 20% of the data for evaluation, and use the other 80% for
training. You should NOT substitute NA values for zeros for attributes 3,
4, 6, and 8.
2.2. The UC Irvine machine learning data repository hosts a collection of data
on student performance in Portugal, donated by Paulo Cortez, University of
Minho, in Portugal. You can find this data at https://archive.ics.uci.edu/ml/
datasets/Student+Performance. It is described in P. Cortez and A. Silva. Using
Data Mining to Predict Secondary School Student Performance. In A. Brito
and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS,
ISBN 978-9077381-39-7.
There are two datasets (for grades in mathematics and for grades in Portugese). There are 30 attributes each for 649 students, and 3 values that can
Section 2.6
You should
47
be predicted (G1, G2 and G3). Of these, ignore G1 and G2.

(a) Use the mathematics dataset. Take the G3 attribute, and quantize this
into two classes, G3 > 12 and G3 12. Build and evaluate a naive
bayes classifier that predicts G3 from all attributes except G1 and G2.
You should build this classifier from scratch (i.e. DONT use the packages described in the code snippets). For binary attributes, you should
use a binomial model. For the attributes described as numeric, which
take a small set of values, you should use a multinomial model. For the
attributes described as nominal, which take a small set of values, you
should again use a multinomial model. Ignore the absence attribute.
Estimate accuracy by cross-validation. You should use at least 10 folds,
excluding 15% of the data at random to serve as test data, and average
the accuracy over those folds. Report the mean and standard deviation
of the accuracy over the folds.
(b) Now revise your classifier of the previous part so that, for the attributes
described as numeric, which take a small set of values, you use a multinomial model. For the attributes described as nominal, which take a
small set of values, you should still use a multinomial model. Ignore the
absence attribute. Estimate accuracy by cross-validation. You should
use at least 10 folds, excluding 15% of the data at random to serve as test
data, and average the accuracy over those folds. Report the mean and
standard deviation of the accuracy over the folds.
(c) Which classifier do you believe is more accurate and why?
2.3. The UC Irvine machine learning data repository hosts a collection of data on
heart disease. The data was collected and supplied by Andras Janosi, M.D., of
the Hungarian Institute of Cardiology, Budapest; William Steinbrunn, M.D.,
of the University Hospital, Zurich, Switzerland; Matthias Pfisterer, M.D., of
the University Hospital, Basel, Switzerland; and Robert Detrano, M.D., Ph.D.,
of the V.A. Medical Center, Long Beach and Cleveland Clinic Foundation. You
can find this data at https://archive.ics.uci.edu/ml/datasets/Heart+Disease.
Use the processed Cleveland dataset, where there are a total of 303 instances
with 14 attributes each. The irrelevant attributes described in the text have
been removed in these. The 14th attribute is the disease diagnosis. There are
records with missing attributes, and you should drop these.
(a) Take the disease attribute, and quantize this into two classes, num = 0
and num > 0. Build and evaluate a naive bayes classifier that predicts
the class from all other attributes Estimate accuracy by cross-validation.
You should use at least 10 folds, excluding 15% of the data at random to
serve as test data, and average the accuracy over those folds. Report the
mean and standard deviation of the accuracy over the folds.
(b) Now revise your classifier to predict each of the possible values of the
disease attribute (0-4 as I recall). Estimate accuracy by cross-validation.
You should use at least 10 folds, excluding 15% of the data at random to
serve as test data, and average the accuracy over those folds. Report the
mean and standard deviation of the accuracy over the folds.
2.4. The UC Irvine machine learning data repository hosts a collection of data
on breast cancer diagnostics, donated by Olvi Mangasarian, Nick Street, and
William H. Wolberg. You can find this data at http://archive.ics.uci.edu/ml/
datasets/Breast+Cancer+Wisconsin+(Diagnostic). For each record, there is an
id number, 10 continuous variables, and a class (benign or malignant). There
are 569 examples. Separate this dataset randomly into 100 validation, 100
Section 2.6
You should
48
test, and 369 training examples.

Write a program to train a support vector machine on this data using stochastic
gradient descent. You should not use a package to train the classifier (you
dont really need one), but your own code. You should ignore the id number,
and use the continuous variables as a feature vector. You should search for
an appropriate value of the regularization constant, trying at least the values
= [1e 3, 1e 2, 1e 1, 1]. Use the validation set for this search
You should use at least 50 epochs of at least 100 steps each. In each epoch,
you should separate out 50 training examples at random for evaluation. You
should compute the accuracy of the current classifier on the set held out for
the epoch every 10 steps. You should produce:
(a) A plot of the accuracy every 10 steps, for each value of the regularization
constant.
(b) Your estimate of the best value of the regularization constant, together
with a brief description of why you believe that is a good value.
(c) Your estimate of the accuracy of the best classifier on held out data
adult income, donated by Ronny Kohavi and Barry Becker. You can find this
data at https://archive.ics.uci.edu/ml/datasets/Adult For each record, there is
a set of continuous attributes, and a class (=50K or 50K). There are 48842
examples. You should use only the continous attributes (see the description on
the web page) and drop examples where there are missing values of the continuous attributes. Separate the resulting dataset randomly into 10% validation,
10% test, and 80% training examples.
Write a program to train a support vector machine on this data using stochastic
gradient descent. You should not use a package to train the classifier (you
dont really need one), but your own code. You should ignore the id number,
and use the continuous variables as a feature vector. You should search for
an appropriate value of the regularization constant, trying at least the values
= [1e 3, 1e 2, 1e 1, 1]. Use the validation set for this search
You should use at least 50 epochs of at least 300 steps each. In each epoch,
you should separate out 50 training examples at random for evaluation. You
should compute the accuracy of the current classifier on the set held out for
the epoch every 30 steps. You should produce:
(a) A plot of the accuracy every 30 steps, for each value of the regularization
constant.
(b) Your estimate of the best value of the regularization constant, together
with a brief description of why you believe that is a good value.
(c) Your estimate of the accuracy of the best classifier on held out data
the whether p53 expression is active or inactive.
You can find out what this means, and more information about the dataset,
by reading: Danziger, S.A., Baronio, R., Ho, L., Hall, L., Salmon, K., Hatfield, G.W., Kaiser, P., and Lathrop, R.H. (2009) Predicting Positive p53
Cancer Rescue Regions Using Most Informative Positive (MIP) Active Learning, PLOS Computational Biology, 5(9); Danziger, S.A., Zeng, J., Wang, Y.,
Brachmann, R.K. and Lathrop, R.H. (2007) Choosing where to look next in
a mutation sequence space: Active Learning of informative p53 cancer rescue mutants, Bioinformatics, 23(13), 104-114; and Danziger, S.A., Swamidass,
S.J., Zeng, J., Dearth, L.R., Lu, Q., Chen, J.H., Cheng, J., Hoang, V.P., Saigo,
H., Luo, R., Baldi, P., Brachmann, R.K. and Lathrop, R.H. (2006) Functional
Section 2.6
You should
49
census of mutation sequence spaces: the example of p53 cancer rescue mutants, IEEE/ACM transactions on computational biology and bioinformatics
/ IEEE, ACM, 3, 114-125.
You can find this data at https://archive.ics.uci.edu/ml/datasets/p53+Mutants.
There are a total of 16772 instances, with 5409 attributes per instance. Attribute 5409 is the class attribute, which is either active or inactive. There are
several versions of this dataset. You should use the version K8.data.
(a) Train an SVM to classify this data, using stochastic gradient descent. You
will need to drop data items with missing values. You should estimate
a regularization constant using cross-validation, trying at least 3 values.
Your training method should touch at least 50% of the training set data.
You should produce an estimate of the accuracy of this classifier on held
out data consisting of 10% of the dataset, chosen at random.
(b) Now train a naive bayes classifier to classify this data. You should produce
an estimate of the accuracy of this classifier on held out data consisting
of 10% of the dataset, chosen at random.
(c) Compare your classifiers. Which one is better? why?
whether a mushroom is edible, donated by Jeff Schlimmer and to be found at
http://archive.ics.uci.edu/ml/datasets/Mushroom. This data has a set of categorical attributes of the mushroom, together with two labels (poisonous or
edible). Use the R random forest package (as in the example in the chapter)
to build a random forest to classify a mushroom as edible or poisonous based
on its attributes.
(a) Produce a class-confusion matrix for this problem. If you eat a mushroom
based on your classifiers prediction it is edible, what is the probability of
being poisoned?
Section 2.6
You should
50
CODE SNIPPETS
Listing 2.2: R code used for the naive bayes example of worked example 2.1
setwd ( / u s e r s / d a f / C u r r e n t/ c o u r s e s / P r o b c o u r s e / C l a s s i f i c a t i o n /RCode/ B r e a s t T i s s u e )
wdat<read . csv ( c l e a n e d b r e a s t . c s v )
l i b r a r y ( klaR )
library ( c a r e t )
b i g x<wdat [ , c ( 1 : 2 ) ]
b i g y<wdat [ , 2 ]
wtd<c r e a t e D a t a P a r t i t i o n ( y=bigy , p = . 8 , l i s t=FALSE)
t r a x<b i g x [ wtd , ]
t r a y<b i g y [ wtd ]
model<t r a i n ( tr a x , tr a y , nb , t r C o n t r o l=t r a i n C o n t r o l ( method= cv , number =10))
t e c l a s s e s<predict ( model , newdata=b i g x [wtd , ] )
c o n f u s i o n M a t r i x ( data=t e c l a s s e s , b i g y [wtd ] )
Listing 2.3: R code used for the naive bayes example of worked example 2.2
setwd ( / u s e r s / d a f / C u r r e n t/ c o u r s e s / P r o b c o u r s e / C l a s s i f i c a t i o n /RCode/ MouseProtein )
wdat<read . csv ( Data Cortex N u c l e a r . c s v )
#i n s t a l l . p a c k a g e s ( klaR )
#i n s t a l l . p a c k a g e s ( c a r e t )
l i b r a r y ( klaR )
library ( c a r e t )
c c i<complete . c a s e s ( wdat )
b i g x<wdat [ c c i ,c ( 8 2 ) ]
b i g y<wdat [ c c i , 8 2 ]
wtd<c r e a t e D a t a P a r t i t i o n ( y=bigy , p = . 8 , l i s t=FALSE)
t r a x<b i g x [ wtd , ]
t r a y<b i g y [ wtd ]
model<t r a i n ( tr a x , tr a y , nb , t r C o n t r o l=t r a i n C o n t r o l ( method= cv , number =10))
t e c l a s s e s<predict ( model , newdata=b i g x [wtd , ] )
c o n f u s i o n M a t r i x ( data=t e c l a s s e s , b i g y [wtd ] )
C H A P T E R
Extracting Important Relationships

in High Dimensions
Chapter 12 described methods to explore the relationship between two elements in a dataset. We could extract a pair of elements and construct various plots.
For vector data, we could also compute the correlation between different pairs of
elements. But if each data item is d-dimensional, there could be a lot of pairs to
deal with.
We will think of our dataset as a collection of d dimensional vectors. It turns
out that there are easy generalizations of our summaries. However, is hard to
plot d-dimensional vectors. We need to find some way to make them fit on a 2dimensional plot. Some simple methods can offer insights, but to really get what
is going on we need methods that can at all pairs of relationships in a dataset in
one go.
These methods visualize the dataset as a blob in a d-dimensional space.
Many such blobs are flattened in some directions, because components of the data
are strongly correlated. Finding the directions in which the blobs are flat yields
methods to compute lower dimensional representations of the dataset.
3.1 SOME PLOTS OF HIGH DIMENSIONAL DATA
3.1.1 Understanding Blobs with Scatterplot Matrices - CLEANUP
Plotting high dimensional data is tricky.
3.1.2 Parallel Plots
Parallel plots can sometimes reveal information, particularly when the dimension
of the dataset is low. To construct a parallel plot, you compute a normalized
representation of each component of each data item. The component is normalized
by translating and scaling so that the minimum value over the dataset is zero, and
the maximum value over the dataset is one. Now write the ith normalised data
item as (n1 , n2 , . . . , nd ). For this data item, you plot a broken line joining (1, n1 )
to (2, n2 ) to (3, n3 , etc. These plots are superimposed on one another. In the case
of the bodyfat dataset, this yields the plot of figure 3.1.
Some structures in the parallel plot are revealing. Outliers often stick out (in
figure 3.1, its pretty clear that theres a data point with a very low height value,
and also one with a very large weight value). Outliers affect the scaling, and so
make other structures difficult to spot. I have removed them for figure 3.2. In this
figure, you can see that two negatively correlated components next to one another
produce a butterfly like shape (bodyfat and density). In this plot, you can also see
that there are still several data points that are very different from others (two data
51
Section 3.1
Some Plots of High Dimensional Data
52
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
BODYFAT
DENSITY AGE WEIGHT HEIGHT
ADIPOSITYNECK
CHESTABDOMEN HIP
THIGH
KNEE ANKLE BICEPSFOREARMWRIST
FIGURE 3.1: A parallel plot of the bodyfat dataset, including all data points. I have
named the components on the horizontal axis. It is easy to see that large values
of bodyfat correspond to small values of density, and vice versa. Notice that one
datapoint has height very different from all others; similarly, one datapoint has
weight very different from all others.
items have ankle values that are very different from the others, for example).
3.1.3 Scatterplot Matrices
One strategy that is very useful when there arent too many dimensions is to use a
scatterplot matrix. To build one, you lay out scatterplots for each pair of variables
in a matrix. On the diagonal, you name the variable that is the vertical axis for
each plot in the row, and the horizontal axis in the column. This sounds more
complicated than it is; look at the example of figure 3.3, which shows a scatterplot
matrix for four of the variables in the height weight dataset of http://www2.stetson.
edu/jrasp/data.htm; look for bodyfat.xls at that URL). This is originally a 16dimensional dataset, but a 16 by 16 scatterplot matrix is squashed and hard to
interpret.
What is nice about this kind of plot is that its quite easy to spot correlations
between pairs of variables, though you do need to take into account the coordinates
have not been normalized. For figure 3.3, you can see that weight and adiposity
Section 3.1
53
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
BODYFAT
DENSITY AGE WEIGHT HEIGHT
ADIPOSITYNECK
CHESTABDOMEN HIP
THIGH
KNEE ANKLE BICEPSFOREARMWRIST
FIGURE 3.2: A plot with those data items removed, so that those components are
renormalized. Two datapoints have rather distinct ankle measurements. Generally,
you can see that large knees go with large ankles and large biceps (the v structure).
appear to show quite strong correlations, but weight and age are pretty weakly
correlated. Height and age seem to have a low correlation. It is also easy to
visualize unusual data points. Usually one has an interactive process to do so
you can move a brush over the plot to change the color of data points under the
brush. To show what might happen, figure 3.4 shows a scatter plot matrix with
some points shown as circles. Notice how they lie inside the blob of data in some
views, and outside in others. This is an effect of projection.
UC Irvine keeps a large repository of datasets that are important in machine
learning. You can find the repository at http://archive.ics.uci.edu/ml/index.html.
Figures 3.5 and 3.6 show visualizations of a famous dataset to do with the botanical
classification of irises.
Figures ??, ?? and 3.9 show visualizations of another dataset to do with forest
fires in Portugal, also from the UC Irvine repository (look at http://archive.ics.uci.
edu/ml/datasets/Forest+Fires). In this dataset, there are a variety of measurements
of location, time, temperature, etc. together with the area burned by a wildfire.
It would be nice to know what leads to large fires, and a visualization is the place
to start. Many fires are tiny (or perhaps there was no area measurement?) and so
Section 3.1
100
100
100
50
50
50
50
400
400
300
300
50
100
54
Age
200
400
400
300
Weight
200
100
200
0
50
100
200
0
50
100
100
80
60
80
80
60
60
40
40
50
100
50
100
50
100
Height
40
20
20
50
200
400
20
60
60
60
40
40
40
20
20
20
Adiposity
50
100
200
400
FIGURE 3.3: This is a scatterplot matrix for four of the variables in the height weight
dataset of http:// www2.stetson.edu/ jrasp/ data.htm. Each plot is a scatterplot of

a pair of variables. The name of the variable for the horizontal axis is obtained by
running your eye down the column; for the vertical axis, along the row. Although
this plot is redundant (half of the plots are just flipped versions of the other half ),
that redundancy makes it easier to follow points by eye. You can look at a column,
move down to a row, move across to a column, etc. Notice how you can spot
correlations between variables and outliers (the arrows).
many values of the area are zero. I found it helpful to take the log of area, and
then to divide the values of the logarithm into seven categories. I ignored the first
four variables, because I didnt think theyd be too important. Exercise: was I
right? I made two scatterplot matrices, because an eight by eight matrix is too big
to view. Generally, this visualization suggests that it would be hard to predict the
size of a fire from these variables.
Section 3.1
100
100
100
50
50
50
50
400
400
300
300
50
100
55
Age
200
400
400
300
Weight
200
100
200
0
50
100
200
0
50
100
100
80
60
80
80
60
60
40
40
50
100
50
100
50
100
Height
40
20
20
50
200
400
20
60
60
60
40
40
40
20
20
20
Adiposity
50
100
200
400
You should compare this figure with figure 3.3. I have marked two
data points with circles in this figure; notice that in some panels these are far from
the rest of the data, in others close by. A brush in an interactive application can
be used to mark data like this to allow a user to explore a dataset.
FIGURE 3.4:
Section 3.1
setosa
versicolor
56
virginica
2.5
1.5 2.0 2.5
2.0
1.5
Petal
Width
0.0 0.5 1.0
0.5
0.0
7
4
6
5
Petal
Length
1.0
4
3
2
1
4.5
3.5 4.0 4.5
4.0
3.5
Sepal
Width
3.0
2.5
2.0 2.5 3.0
2.0
8
7
Sepal
Length
6
5
Scatter Plot Matrix
This is a scatterplot matrix for the famous Iris data, originally due
to ***. There are four variables, measured for each of three species of iris. I have
plotted each species with a different marker. You can see from the plot that the
species cluster quite tightly, and are different from one another. R code for this plot
is on the website.
FIGURE 3.5:
Section 3.1
setosa
versicolor
57
virginica
Petal.Length
Petal.Width
Sepal.Length
This is a 3D scatterplot for the famous Iris data, originally due to

***. I have chosen three variables from the four, and have plotted each species with
a different marker. You can see from the plot that the species cluster quite tightly,
and are different from one another. R code for this plot is on the website.
FIGURE 3.6:
Section 3.1
T1
T2
T3
T4
T5
T6
58
T7
50
30 40 50
40
30
ISI
20
0 10 20
800
10
0
400 600 800
600
400
DC
400
200
200 400
0
300
250
150
250
200
150
DMC 150
100
0 50 100
60
50
0
80
80
60
FFMC
40
20
40
20
Scatter Plot Matrix
This is a scatterplot matrix for the fire dataset from the UC Irvine
repository. The smallest area fire is T1, and the largest is T7; each is plotted with
a different marker. These plots show severity of the fire, plotted against variables
5-8 of the dataset. You should notice that there isnt much separation between the
markers. It might be very hard to predict the severity of a fire from these variables.
R code for this plot is on the website.
FIGURE 3.7:
Section 3.1
T1
T2
T3
T4
T5
T6
59
T7
6
5
4
3
3 4 5 6
rain
0 1 2 3
4
3
2
1
0
wind
4
0
4
2
0
100
60
80 100
RH
60
80
60
40
20
30
25
20
40
60
20
20 25 30
temp
5 10 15
15
10
5
Scatter Plot Matrix
This is a scatterplot matrix for the fire dataset from the UC Irvine
FIGURE 3.8:
Section 3.1
T1
T2
T3
T4
T5
T6
60
T7
V9
V11
V10
This is a 3D scatterplot for the fire dataset from the UC Irvine

FIGURE 3.9:
Section 3.2
Summaries of High Dimensional Data
61
3.2 SUMMARIES OF HIGH DIMENSIONAL DATA

In this chapter, we assume that our data items are vectors. This means that we can
add and subtract values and multiply values by a scalar without any distress. This
is an important assumption, but it doesnt necessarily mean that data is continuous
(for example, you can meaningfully add the number of children in one family to the
number of children in another family). It does rule out a lot of discrete data. For
example, you cant add sports to grades and expect a sensible answer.
Notation: Our data items are vectors, and we write a vector as x. The
data items are d-dimensional, and there are N of them. The entire data set is {x}.
When we need to refer to the ith data item, we write xi . We write {xi } for a new
dataset made up of N items, where the ith item is xi . If we need to refer to the
(j)
jth component of a vector xi , we will write xi (notice this isnt in bold, because
it is a component not a vector, and the j is in parentheses because it isnt a power).
Vectors are always column vectors.
3.2.1 The Mean
For one-dimensional data, we wrote
mean ({x}) =
i xi
.
N
This expression is meaningful for vectors, too, because we can add vectors and
divide by scalars. We write
P
xi
mean ({x}) = i
N
and call this the mean of the data. Notice that each component of mean ({x}) is the
mean of that component of the data. There is not an easy analogue of the median,
however (how do you order high dimensional data?) and this is a nuisance. Notice
that, just as for the one-dimensional mean, we have
mean ({x mean ({x})}) = 0
(i.e. if you subtract the mean from a data set, the resulting data set has zero mean).
3.2.2 Using Covariance to encode Variance and Correlation
Variance, standard deviation and correlation can each be seen as an instance of a
more general operation on data. Assume that we have two one dimensional data
sets {x} and {y}. Then we can define the covariance of {x} and {y}.
Section 3.2
62
Definition: 3.1 Covariance

Assume we have two sets of N data items, {x} and {y}. We compute
the covariance by
P
(xi mean ({x}))(yi mean ({y}))
cov ({x} , {y}) = i
N
Covariance measures the tendency of corresponding elements of {x} and of {y} to

be larger than (resp. smaller than) the mean. Just like mean, standard deviation
and variance, covariance can refer either to a property of a dataset (as in the
definition here) or a particular expectation (as in chapter ??). The correspondence
is defined by the order of elements in the data set, so that x1 corresponds to y1 ,
x2 corresponds to y2 , and so on. If {x} tends to be larger (resp. smaller) than its
mean for data points where {y} is also larger (resp. smaller) than its mean, then
the covariance should be positive. If {x} tends to be larger (resp. smaller) than its
mean for data points where {y} is smaller (resp. larger) than its mean, then the
covariance should be negative.
From this description, it should be clear we have seen examples of covariance
already. Notice that
2
std (x) = var ({x}) = cov ({x} , {x})

which you can prove by substituting the expressions. Recall that variance measures
the tendency of a dataset to be different from the mean, so the covariance of a
dataset with itself is a measure of its tendency not to be constant.
More important, notice that
cov ({x} , {y})
p
.
corr ({(x, y)}) = p
cov ({x} , {x}) cov ({y} , {y})
This is occasionally a useful way to think about correlation. It says that the correlation measures the tendency of {x} and {y} to be larger (resp. smaller) than their
means for the same data points, compared to how much they change on their own.
Working with covariance (rather than correlation) allows us to unify some
ideas. In particular, for data items which are d dimensional vectors, it is straightforward to compute a single matrix that captures all covariances between all pairs
of components this is the covariance matrix.
Section 3.2
63
Definition: 3.2 Covariance Matrix

The covariance matrix is:
P
(xi mean ({x}))(xi mean ({x}))T
Covmat ({x}) = i
N
Notice that it is quite usual to write a covariance matrix as , and we
will follow this convention.
Properties of the Covariance Matrix Covariance matrices are often

written as , whatever the dataset (you get to figure out precisely which dataset is
intended, from context). Generally, when we want to refer to the j, kth entry of
a matrix A, we will write Ajk , so jk is the covariance between the jth and kth
components of the data.
The j, kth entry of the covariance matrix is the
covariance

of the jth and
the kth components of x, which we write cov x(j) , x(k) .
The j, jth entry of the covariance matrix is the variance of the jth component of x.
The covariance matrix is symmetric.

The covariance matrix is always positive semi-definite; it is positive definite,
unless there is some vector a such that aT (xi mean ({xi }) = 0 for all i.
Proposition:
Covmat ({x})jk = cov
o
o n
n
x(j) , x(k)
Proof: Recall
Covmat ({x}) =
i (xi
mean ({x}))(xi mean ({x}))T

N
and the j, kth entry in this matrix will be

(j) (k)
x
)(xi mean x(k) )T
N
(j) (k)
which is cov x
, x
.
(j)
i (xi
mean
Section 3.2
Proposition:
Covmat ({xi })jj = jj = var
n
x(j)
o
Proof:
Covmat ({x})jj
Proposition:
o
o n
n
x(j) , x(j)
o
n
= var x(j)
= cov
Covmat ({x}) = Covmat ({x})T
Proof: We have
Covmat ({x})jk
o
o n
n
x(j) , x(k)
n
o n
o
= cov x(k) , x(j)
= cov
= Covmat ({x})kj
64
Section 3.3
Blob Analysis of High-Dimensional Data
65
Proposition: Write = Covmat ({x}). If there is no vector a such that

aT (xi mean ({x})) = 0 for all i, then for any vector u, such that || u || > 0,
uT u > 0.
If there is such a vector a, then
uT u 0.
Proof: We have
uT u
=
=

1 X T
u (xi mean ({x})) (xi mean ({x}))T u
N i
2
1 X T
u (xi mean ({x})) .
N i
Now this is a sum of squares. If there is some a such that aT (xi

mean ({x})) = 0 for every i, then the covariance matrix must be positive
semidefinite (because the sum of squares could be zero in this case).
Otherwise, it is positive definite, because the sum of squares will always
be positive.
3.3 BLOB ANALYSIS OF HIGH-DIMENSIONAL DATA

When we plotted histograms, we saw that mean and variance were a very helpful
description of data that had a unimodal histogram. If the histogram had more than
one mode, one needed to be somewhat careful to interpret the mean and variance;
in the pizza example, we plotted diameters for different manufacturers to try and
see the data as a collection of unimodal histograms.
Generally, mean and covariance are a good description of data that lies in a
blob (Figure 3.10). You might not believe that this is a technical term, but its
quite widely used. This is because mean and covariance supply a natural coordinate
system in which to interpret the blob. Mean and covariance are less useful as
descriptions of data that forms multiple blobs (Figure 3.10). In chapter 14.5, we
discuss methods to model data that forms multiple blobs, or other shapes that we
will interpret as a set of blobs. But many datasets really are single blobs, and we
concentrate on such data here. The way to understand a blob is to think about the
coordinate transformations that place a blob into a particularly convenient form.
3.3.1 Transforming High Dimensional Data
Assume we apply an affine transformation to our data set {x}, to obtain a new
dataset {u}, where ui = Axi + b. Here A is any matrix (it doesnt have to be
Section 3.3
66
FIGURE 3.10: On the left, a blob in two dimensions. This is a set of data points
that lie somewhat clustered around a single center, given by the mean. I have
plotted the mean of these data points with a +. On the right, a data set that is
best thought of as a collection of five blobs. I have plotted the mean of each with a
+. We could compute the mean and covariance of this data, but it would be less
revealing than the mean and covariance of a single blob. In chapter 14.5, I will
describe automatic methods to describe this dataset as a series of blobs.
square, or symmetric, or anything else; it just has to have second dimension d). It
is easy to compute the mean and covariance of {u}. We have
mean ({u}) =
=
mean ({Ax + b})

Amean ({x}) + b,
so you get the new mean by multiplying the original mean by A and adding b.
The new covariance matrix is easy to compute as well. We have:
Covmat ({u}) = Covmat ({Ax + b})
P
T
i (ui mean ({u}))(ui mean ({u}))
=
N
P
T
i (Axi + b Amean ({x}) b)(Axi + b Amean ({x}) b)
=
N
P
A i (xi mean ({x}))(xi mean ({x}))T AT
=
N
= ACovmat ({x})AT .
3.3.2 Transforming Blobs
The trick to interpreting high dimensional data is to use the mean and covariance
to understand the blob. Figure 3.11 shows a two-dimensional data set. Notice that
there is obviously some correlation between the x and y coordinates (its a diagonal
blob), and that neither x nor y has zero mean. We can easily compute the mean
and subtract it from the data points, and this translates the blob so that the origin
is at the center (Figure 3.11). In coordinates, this means we compute the new
Section 3.3
67
Translate center to origin
FIGURE 3.11: On the left, a blob in two dimensions. This is a set of data points
that lie somewhat clustered around a single center, given by the mean. I have plotted
the mean of these data points with a hollow square (its easier to see when there is
a lot of data). To translate the blob to the origin, we just subtract the mean from
each datapoint, yielding the blob on the right.
dataset {u} from the old dataset {x} by the rule ui = xi mean ({x}). This new
dataset has been translated so that the mean is zero.
Once this blob is translated (Figure 3.12, left), we can rotate it as well. It
is natural to try to rotate the blob so that there is no correlation between distinct
pairs of dimensions. We can do so by diagonalizing the covariance matrix. In
particular, let U be the matrix formed by stacking the eigenvectors of Covmat ({x})
into a matrix (i.e. U = [v1 , . . . , vd ], where vj are eigenvectors of the covariance
matrix). We now form the dataset {n}, using the rule
ni = U T ui = U T (xi mean ({x})).
The mean of this new dataset is clearly 0. The covariance of this dataset is

Covmat ({n}) = Covmat U T x
= U T Covmat ({x})U
= ,
where is a diagonal matrix of eigenvalues of Covmat ({x}). Remember that, in

describing diagonalization, we adopted the convention that the eigenvectors of the
matrix being diagonalized were ordered so that the eigenvalues are sorted in descending order along the diagonal of . We now have two very useful facts about
{n}: (a) every pair of distinct components has covariance zero, and so has correlation zero; (b) the first component has the highest variance, the second component
has the second highest variance, and so on. We can rotate and translate any blob
into a coordinate system that has these properties. In this coordinate system, we
Section 3.3
68
Rotate to diagonalize
covariance
FIGURE 3.12: On the left, the translated blob of figure 3.11. This blob lies somewhat
diagonally, because the vertical and horizontal components are correlated. On the
right, that blob of data rotated so that there is no correlation between these components. We can now describe the blob by the vertical and horizontal variances alone,
as long as we do so in the new coordinate system. In this coordinate system, the
vertical variance is significantly larger than the horizontal variance the blob is
short and wide.
can describe the blob simply by giving the variances of each component the
covariances are zero.
Translating a blob of data doesnt change the scatterplot matrix in any interesting way (the axes change, but the picture doesnt). Rotating a blob produces
really interesting results, however. Figure 3.14 shows the dataset of figure 3.3,
translated to the origin and rotated to diagonalize it. Now we do not have names
for each component of the data (theyre linear combinations of the original components), but each pair is now not correlated. This blob has some interesting shape
features. Figure 3.14 shows the gross shape of the blob best. Each panel of this
figure has the same scale in each direction. You can see the blob extends about 80
units in direction 1, but only about 15 units in direction 2, and much less in the
other two directions. You should think of this blob as being rather cigar-shaped;
its long in one direction, but there isnt much in the others. The cigar metaphor
isnt perfect because there arent any 4 dimensional cigars, but its helpful. You
can think of each panel of this figure as showing views down each of the four axes
of the cigar.
Now look at figure ??. This shows the same rotation of the same blob of
data, but now the scales on the axis have changed to get the best look at the
detailed shape of the blob. First, you can see that blob is a little curved (look at
the projection onto direction 2 and direction 4). There might be some effect here
worth studying. Second, you can see that some points seem to lie away from the
main blob. I have plotted each data point with a dot, and the interesting points
Section 3.3
69
Scale this direction
FIGURE 3.13: On the left, the translated and rotated blob of figure 3.12. This blob is
stretched one direction has more variance than another. Because all covariances
are zero, it is easy to scale the blob so that all variances are one (the blob on the
right). You can think of this as a standard blob. All blobs can be reduced to a
standard blob, by relatively straightforward linear algebra.
with a number. These points are clearly special in some way.

We could now scale the data in this new coordinate system so that all the
variances are either one (if there is any variation in that direction) or zero (directions
where the data doesnt vary these occur only if some directions are functions of
others). Figure 3.13 shows the final scaling. The result is a standard blob. Our
approach applies to any dimension I gave 2D figures because theyre much easier
to understand. There is a crucial point here: we can reduce any blob of data, in any
dimension, to a standard blob of that dimension. All blobs are the same, except
for some stretching, some rotation, and some translation. This is why blobs are so
well-liked.
3.3.3 Whitening Data
It is sometimes useful to actually reduce a dataset to a standard blob. Doing so is
known as whitening the data (for reasons I find obscure). This can be a sensible
thing to do when we dont have a clear sense of the relative scales of the components
of each data vector. For example, if we have a dataset where one component ranges
from 1e5 to 2e5, and the other component ranges from -1e-5 to 1e-5, we are likely
to face numerical problems in many computations (adding small numbers to big
numbers is often unwise). Often, this kind of thing follows from a poor choice of
units, rather than any kind of useful property of the data. In such a case, it could
be quite helpful to whiten the data. Another reason to whiten the data might
be that we know relatively little about the meaning of each component. In this
case, the original choice of coordinate system was somewhat arbitrary anyhow, and
Section 3.3
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
Direction 1
100
Direction 3
100
100
100
Direction 2
Direction 4
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
70
100
100
100
100
100
100
100
100
100
FIGURE 3.14: A panel plot of the bodyfat dataset of figure 3.3, now rotated so that the
covariance between all pairs of distinct dimensions is zero. Now we do not know
names for the directions theyre linear combinations of the original variables.
Each scatterplot is on the same set of axes, so you can see that the dataset extends
more in some directions than in others.
transforming data to a uniform blob could be helpful.
Section 3.3
5
5
1
2
3
5
100
5
1
100
10
1
5
5
50
2
3
50
Direction 4
5
20
20
10
4
3 1
0
10
0
2
20
100
100
10
3
1
5
10
20
50
0
10
4
2
Direction 2
100
2
Direction 1
100
50
100
5
1
3
0
20
32
50
5
100
50
100
20
5
1
2
4
25 3
1
50
20
100
50
50
100
20
5
50
50
Direction 3
50
3
71
2
1
34
0
20
100
5
FIGURE 3.15: A panel plot of the bodyfat dataset of figure 3.3, now rotated so that
the covariance between all pairs of distinct dimensions is zero. Now we do not know
names for the directions theyre linear combinations of the original variables. I
have scaled the axes so you can see details; notice that the blob is a little curved,
and there are several data points that seem to lie some way away from the blob,
which I have numbered.
Useful Facts: 3.1 Whitening a dataset

For a dataset {x}, compute:
U, the matrix of eigenvectors of Covmat ({x});
and mean ({x}).
Now compute {n} using the rule
ni = U T (xi mean ({x})).
Then mean ({n}) = 0 and Covmat ({n}) is diagonal.
Now write for the diagonal matrix of eigenvalues of Covmat ({x}) (so
that Covmat ({x})U = U). Assume that each of the diagonal entries
of is greater than zero (otherwise there is a redundant dimension in
the data). Write i for the ith diagonal entry of , andwrite (1/2)
for the diagonal matrix whose ith diagonal entry is 1/ i . Compute
{z} using the rule
zi = (1/2) U(xi mean ({x})).
We have that mean ({z}) = 0 and Covmat ({z}) = I. The dataset {z}
is often known as whitened data.
Section 3.4
Principal Components Analysis
72
Blob coordinates
Translation
FIGURE 3.16: A 2D blob, with its natural blob coordinate system. The origin of this
coordinate system is at the mean of the data. The coordinate axes are (a) at right
angles to one another and (b) are directions that have no covariance.
It isnt always a good idea to whiten data. In some circumstances, each

separate component is meaningful, and in a meaningful set of units. For example,
one of the components might be a length using a natural scale and the other might
be a time on a natural scale. When this happens, we might be reluctant to transform
the data, either because we dont want to add lengths to times or because we want
to preserve the scales.
3.4 PRINCIPAL COMPONENTS ANALYSIS
Mostly, when one deals with high dimensional data, it isnt clear which individual components are important. As we have seen with the height weight dataset
(for example, in the case of density and weight) some components can be quite
strongly correlated. Equivalently, as in Figure 14.5, the blob is not aligned with
the coordinate axes.
3.4.1 The Blob Coordinate System and Smoothing
We can use the fact that we could rotate, translate and scale the blob to define
a coordinate system within the blob. The origin of that coordinate system is the
mean of the data, and the coordinate axes are given by the eigenvectors of the
covariance matrix. These are orthonormal, so they form a set of unit basis vectors
at right angles to one another (i.e. a coordinate system). You should think of these
as blob coordinates; Figure 3.16 illustrates a set of blob coordinates.
The blob coordinate system is important because, once we know the blob
coordinate system, we can identify important scales of variation in the data. For
example, if you look at Figure 3.16, you can see that this blob is extended much
further along one direction than along the other. We can use this information to
identify the most significant forms of variation in very high dimensional data. In
Section 3.4
10
5
10
0
12
5
5
0
10
5
73
10
5
10
0
12
5
5
10
5
FIGURE 3.17: On the left, a blob of 3D data that has very low variance in two
directions in the blob coordinates. As a result, all the data points are very close to
a 1D blob. Experience shows that this is a common phenomenon. Although there
might be many components in the data items, all data points are very close to a
much lower dimensional object in the high dimensional space. When this is the
case, we could obtain a lower dimensional representation of the data by working in
blob coordinates, or we could smooth the data (as on the right), by projecting each
data point onto the lower dimensional space.
some directions in the blob coordinate system, the blob will be spread out ie
have large variance but in others, it might not be.
Equivalently, imagine we choose to represent each data item in blob coordinates. Then the mean over the dataset will be zero. Each pair of distinct coordinates will be uncorrelated. Some coordinates corresponding to directions where
the blob is spread out will have a large range of values. Other coordinates
directions in which the blob is small will have a small range of values. We could
choose to replace these coordinates with zeros, with little significant loss in accuracy. The advantage of doing so is that we would have lower dimensional data to
deal with.
However, it isnt particularly natural to work in blob coordinates. Each component of a data item may have a distinct meaning and scale (i.e. feet, pounds,
and so on), but this is not preserved in any easy way in blob coordinates. Instead,
we should like to (a) compute a lower dimensional representation in blob coordinates then (b) transform that representation into the original coordinate system of
the data item. Doing so is a form of smoothing suppressing small, irrelevant
variations by exploiting multiple data items.
For example, look at Figure 3.17. Imagine we transform the blob on the left
to blob coordinates. The covariance matrix in these coordinates is a 3 3 diagonal
matrix. One of the values on the diagonal is large, because the blob is extended on
Section 3.4
74
one direction; but the other two are small. This means that, in blob coordinates, the
data varies significantly in one direction, but very little in the other two directions.
Now imagine we project the data points onto the high-variation direction;
equivalently, we set the other two directions to zero for each data point. Each
of the new data points is very close to the corresponding old data point, because
by setting the small directions to zero we havent moved the point very much.
In blob coordinates, the covariance matrix of this new dataset has changed very
little. It is again a 3 3 diagonal matrix, but now two of the diagonal values are
zero, because there isnt any variance in those directions. The third value is large,
because the blob is extended in that direction. We take the new dataset, and rotate
and translate it into the original coordinate system. Each point must lie close to the
corresponding point in the original dataset. However, the new dataset lies along a
straight line (because it lay on a straight line in the blob coordinates). This process
gets us the blob on the right in Figure 3.17. This blob is a smoothed version of the
original blob.
Smoothing works because when two data items are strongly correlated, the
value of one is a good guide to the value of the other. This principle works for
more than two data items. Section 14.5 describes an example where the data
items have dimension 101, but all values are extremely tightly correlated. In a
case like this, there may be very few dimensions in blob coordinates that have
any significant variation (3-6 for this case, depending on some details of what one
believes is a small number, and so on). The components are so strongly correlated
in this case that the 101-dimensional blob really looks like a slightly thickened 3
(or slightly more) dimensional blob that has been inserted into a 101-dimensional
space (Figure 3.17). If we project the 101-dimensional data onto that structure
in the original, 101-dimensional space, we may get much better estimates of the
components of each data item than the original measuring device can supply. This
occurs because each component is now estimated using correlations between all the
measurements.
3.4.2 The Low-Dimensional Representation of a Blob
We wish to construct an r dimensional representation of a blob, where we have
chosen r in advance. First, we compute {v} by translating the blob so its mean is
at the origin, so that vi = xi mean ({x}). Now write V = [v1 , v2 , . . . , vN ]. The
covariance matrix of {v} is then
Covmat ({v}) =
1
VV T = Covmat ({x}).
N
Now write for the diagonal matrix of eigenvalues of Covmat ({x}) and U for the
matrix of eigenvectors, so that Covmat ({x})U = U. We assume that the elements
of are sorted in decreasing order along the diagonal. The covariance matrix for
the dataset transformed into blob coordinates will be . Notice that
=
=
=
U T Covmat ({x})U
U T VV T U
(U T V)(U T V)T .
Section 3.4
75
This means we can interpret (U T V) as a new dataset {b}. This is our data, rotated
into blob coordinates.
Now write r for the d d matrix

Ir 0
0 0
which projects a d dimensional vector onto its first r components, and replaces the
others with zeros. Then we have that
r = r Tr
is the covariance matrix for the reduced dimensional data in blob coordinates.
Notice that r keeps the r largest eigenvalues on the diagonal of , and replaces
all others with zero.
We have
r
= r Tr
= r U T Covmat ({x})UTr
= (r U T V)(V T UTr )
= PP T
where P = (r U T V). This represents our data, rotated into blob coordinates,
and then projected down to r dimensions, with remaining terms replaced by zeros.
Write {br } for this new dataset.
Occasionally, we need to refer to this representation, and we give it a special
name. Write
pcaproj (xi , r, {x}) = r U T (xi mean ({x}))
where the notation seeks to explicitly keep track of the fact that the low dimensional
representation of a particular data item depends on the whole dataset (because you
have to be able to compute the mean, and the eigenvectors of the covariance).
Notice that pcaproj (xi , r, {x}) is a representation of the dataset with important
properties:
The representation is r-dimensional (i.e. the last d r components are zero).
Each pair of distinct components of {pcaproj (xi , r, {x})} has zero covariance.
The first component of {pcaproj (xi , r, {x})} has largest variance; the second
component has second largest variance; and so on.
3.4.3 Smoothing Data with a Low-Dimensional Representation
We would now like to construct a low dimensional representation of the blob, in
the original coordinates. We do so by rotating the low-dimensional representation
back to the original coordinate system, then adding back the mean to translate the
origin back to where it started. We can write this as

pcasmooth (xi , r, {x}) = UTr r U T (xi mean ({x})) + mean ({x})
= UTr pcaproj (xi , r, {x}) + mean ({x})
Section 3.4
76
Original blob
Translate to
origin by
subtracting the
mean
Rotate to line
up with axes
Project to get this 1D representation
FIGURE 3.18: Computing a low dimensional representation for principal components
analysis.
we have a new representation of the ith data item in the original space (Figure 14.5). Now consider the dataset obtained by smoothing each of our data items.
We write this dataset as {pcasmooth (xi , r, {x})}.
You should think of {pcasmooth (xi , r, {x})} as a smoothed version of the
original data. One way to think of this process is that we have chosen a lowdimensional basis that represents the main variance in the data rather well. It is
quite usual to think of a data item as being given by a the mean plus a weighted
sum of these basis elements. In this view, the first weight has larger variance than
the second, and so on. By construction, this dataset lies in an r dimensional affine
subspace of the original space. We constructed this r-dimensional space to preserve
the largest variance directions of the data. Each column of this matrix is known as
a principal component. In particular, we have constructed this dataset so that
Section 3.4
77
1D Representation
Rotated to line up
with blob coordinates
Add back mean

to get smoothed
data
FIGURE 3.19: Smoothing data with principal components analysis.
mean ({pcasmooth (xi , r, {x})}) = mean ({x});

Covmat ({pcasmooth (xi , r, {x})}) has rank r;
Covmat ({pcasmooth (xi , r, {x})}) is the best approximation of Covmat ({x})
with rank r.
Figure 3.19 gives a visualization of the smoothing process. By comparing
figures 3.14 and 3.20, you can see that a real dataset can lose several dimensions
without much significant going wrong. As we shall see in the examples, some
datasets can lose many dimensions without anything bad happening.
Section 3.4
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
Direction 1
Direction 4
100
Direction 3
100
100
100
Direction 2
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
78
100
100
100
100
100
100
100
100
100
100
100
100
100
FIGURE 3.20: A panel plot of the bodyfat dataset of figure 3.3, with the dimension
reduced to two using principal components analysis. Compare this figure to figure
3.14, which is on the same set of axes. You can see that the blob has been squashed
in direction 3 and direction 4. But not much has really happened, because there
wasnt very much variation in those directions in the first place.
Procedure: 3.1 Principal Components Analysis

Assume we have a general data set xi , consisting of N d-dimensional
vectors. Now write = Covmat ({x}) for the covariance matrix.
Form U, , such that U = U (these are the eigenvectors and eigenvalues of ). Ensure that the entries of are sorted in decreasing order.
Choose r, the number of dimensions you wish to represent. Typically,
we do this by plotting the eigenvalues and looking for a knee (Figure ??). It is quite usual to do this by hand.
Constructing a low-dimensional representation: Form Ur , a
matrix consisting of the first r columns of U.
Now compute
{pcaproj (xi , r, {x})} = {(r U T (xi mean ({x})))}. This is a set of
data vectors which are r dimensional, and where each component is
independent of each other component (i.e. the covariances of distinct
components are zero).
Smoothing the data:
Form {pcasmooth (xi , r, {x})}
=
{(Ur pcaproj (xi , r, {x}) + mean ({x}))}.
These are d dimensional
vectors that lie in a r-dimensional subspace of d-dimensional space.
The missing dimensions have the lowest variance, and are independent.
Section 3.4
0.25
3
Value
0.2
0.15
2
1
0.1
0.05
300
400
500
600
700
Wavelength (nm)
First PC of spectral reflectance
800
0
0
50
100
Number of eigenvalue
Second PC of spectral reflectance
0.05
0.1
0.15
400
600
Wavelength (nm)
800
0.1
0.1
0
0.1
0.2
300
150
Third PC of spectral reflectance
0.2
Reflectance value
Reflectance value
Reflectance value
0.3
Reflectance value
79
Sorted eigenvalues, 1995 spectra
Mean spectral reflectance
0.2
200
400
500
600
700
Wavelength (nm)
800
0
0.1
0.2
0.3
300
400
500
600
700
Wavelength (nm)
800
FIGURE 3.21: On the top left, the mean spectral reflectance of a dataset of 1995
spectral reflectances, collected by Kobus Barnard (at http:// www.cs.sfu.ca/ colour/
data/ ). On the top right, eigenvalues of the covariance matrix of spectral reflectance data, from a dataset of 1995 spectral reflectances, collected by Kobus
Barnard (at http:// www.cs.sfu.ca/ colour/ data/ ). Notice how the first few eigenvalues are large, but most are very small; this suggests that a good representation
using few principal components is available. The bottom row shows the first three
principal components. A linear combination of these, with appropriate weights,
added to the mean of figure ??, gives a good representation of the dataset.
3.4.4 The Error of the Low-Dimensional Representation

We took a dataset, {x}, and constructed a d-dimensional dataset {b} in blob coordinates. We did so by translating, then rotating, the data, so no information was
lost; we could reconstruct our original dataset by rotating, then translating {b}.
But in blob coordinates we projected each data item down to the first r components
to get an r-dimensional dataset {br }. We then reconstructed a smoothed dataset
by rotating, then translating, {br }. Information has been lost here, but how much?
The answer is easy to get if you recall that rotations and translations do not
change lengths. This means that
2
|| xi pcasmooth (xi , r, {x}) || = || bi br,i || .

This expression is easy to evaluate, because bi and br,i agree in their first r com-
Section 3.4
Approx with 0, 3, 5, 7 PCs
80
Error with 0, 3, 5, 7 PCs
0.8
0.4
0.3
0.6
0.2
0.4
0.1
0
0.2
0.1
400
500
600
700
0.2
400
500
600
700
FIGURE 3.22: On the left, a spectral reflectance curve (dashed) and approximations
using the mean, the mean and 3 principal components, the mean and 5 principal
components, and the mean and 7 principal components. Notice the mean is a relatively poor approximation, but as the number of principal components goes up, the
error falls rather quickly. On the right is the error for these approximations. Figure plotted from a dataset of 1995 spectral reflectances, collected by Kobus Barnard
(at http:// www.cs.sfu.ca/ colour/ data/ ).
ponents. The remaining d r components of br,i are zero. So we can write
2
|| xi pcasmooth (xi , r, {x}) || =
d
X
(u)
(bi )2 .
u=r+1
Now a natural measure of error is the average over the dataset of this term. We
have that
d
d
n
o
X
1 X
(u)
var b(u)
(bi )2 =
N u=r+1
u=r+1
which is easy to evaluate, because we know these variances they are the values
of the d r eigenvalues that we decided to ignore. So the mean error can be written
as
1T ( r )1.
Now we could choose r by identifying how much error we can tolerate. More usual
is to plot the eigenvalues of the covariance matrix, and look for a knee, like that
in Figure 14.5. You can see that the sum of remaining eigenvalues is small.
3.4.5 Example: Representing Spectral Reflectances
Diffuse surfaces reflect light uniformly in all directions. Examples of diffuse surfaces
include matte paint, many styles of cloth, many rough materials (bark, cement,
stone, etc.). One way to tell a diffuse surface is that it does not look brighter
(or darker) when you look at it along different directions. Diffuse surfaces can
Section 3.4
20
15
15
Value
Value
81
Eigenvalues, total of 213 images
Eigenvalues, total of 213 images

20
10
10
5
5
0
0
1000
2000
3000
4000
0
0
5
10
15
20
FIGURE 3.23: On the left,the eigenvalues of the covariance of the Japanese facial
expression dataset; there are 4096, so its hard to see the curve (which is packed
to the left). On the right, a zoomed version of the curve, showing how quickly the
values of the eigenvalues get small.
be colored, because the surface reflects different fractions of the light falling on it
at different wavelengths. This effect can be represented by measuring the spectral
reflectance of a surface, which is the fraction of light the surface reflects as a function
of wavelength. This is usually measured in the visual range of wavelengths (about
380nm to about 770 nm). Typical measurements are every few nm, depending on
the measurement device. I obtained data for 1995 different surfaces from http://
www.cs.sfu.ca/colour/data/ (there are a variety of great datasets here, from Kobus
Barnard).
Each spectrum has 101 measurements, which are spaced 4nm apart. This
represents surface properties to far greater precision than is really useful. Physical
properties of surfaces suggest that the reflectance cant change too fast from wavelength to wavelength. It turns out that very few principal components are sufficient
to describe almost any spectral reflectance function. Figure 3.21 shows the mean
spectral reflectance of this dataset, and Figure 3.21 shows the eigenvalues of the
covariance matrix.
This is tremendously useful in practice. One should think of a spectral reflectance as a function, usually written (). What the principal components analysis tells us is that we can represent this function rather accurately on a (really
small) finite dimensional basis. This basis is shown in figure 3.21. This means that
there is a mean function r() and k functions m () such that, for any (),
() = r() +
k
X
ci i () + e()
i=1
where e() is the error of the representation, which we know is small (because it
consists of all the other principal components, which have tiny variance). In the
case of spectral reflectances, using a value of k around 3-5 works fine for most
applications (Figure 3.22). This is useful, because when we want to predict what
Section 3.4
82
Mean image from Japanese Facial Expression dataset
First sixteen principal components of the Japanese Facial Expression dat
FIGURE 3.24: The mean and first 16 principal components of the Japanese facial
expression dataset.
a particular object will look like under a particular light, we dont need to use a
detailed spectral reflectance model; instead, its enough to know the ci for that
object. This comes in useful in a variety of rendering applications in computer
graphics. It is also the key step in an important computer vision problem, called
color constancy. In this problem, we see a picture of a world of colored objects under unknown colored lights, and must determine what color the objects
are. Modern color constancy systems are quite accurate, even though the problem
sounds underconstrained. This is because they are able to exploit the fact that
relatively few ci are enough to accurately describe a surface reflectance.
3.4.6 Example: Representing Faces with Principal Components
An image is usually represented as an array of values. We will consider intensity
images, so there is a single intensity value in each cell. You can turn the image
into a vector by rearranging it, for example stacking the columns onto one another
Section 3.4
83
Sample Face Image

mean
10
20
50
100
FIGURE 3.25: Approximating a face image by the mean and some principal compo-
nents; notice how good the approximation becomes with relatively few components.
(use reshape in Matlab). This means you can take the principal components of a
set of images. Doing so was something of a fashionable pastime in computer vision
for a while, though there are some reasons that this is not a great representation of
pictures. However, the representation yields pictures that can give great intuition
into a dataset.
Figure ?? shows the mean of a set of face images encoding facial expressions of
Japanese women (available at http://www.kasrl.org/jaffe.html; there are tons of face
datasets at http://www.face-rec.org/databases/). I reduced the images to 64x64,
which gives a 4096 dimensional vector. The eigenvalues of the covariance of this
dataset are shown in figure 3.23; there are 4096 of them, so its hard to see a
trend, but the zoomed figure suggests that the first couple of hundred contain
most of the variance. Once we have constructed the principal components, they
can be rearranged into images; these images are shown in figure 3.24. Principal
components give quite good approximations to real images (figure 3.25).
The principal components sketch out the main kinds of variation in facial
expression. Notice how the mean face in Figure 3.24 looks like a relaxed face, but
with fuzzy boundaries. This is because the faces cant be precisely aligned, because
each face has a slightly different shape. The way to interpret the components is to
remember one adjusts the mean towards a data point by adding (or subtracting)
some scale times the component. So the first few principal components have to
do with the shape of the haircut; by the fourth, we are dealing with taller/shorter
faces; then several components have to do with the height of the eyebrows, the
shape of the chin, and the position of the mouth; and so on. These are all images of
women who are not wearing spectacles. In face pictures taken from a wider set of
models, moustaches, beards and spectacles all typically appear in the first couple
of dozen principal components.
Section 3.5
High Dimensions, SVD and NIPALS
84
3.5 HIGH DIMENSIONS, SVD AND NIPALS

If you remember the curse of dimension, you should have noticed something of a
problem in my account of PCA. When I described the curse, I said one consequence
was that forming a covariance matrix for high dimensional data is hard or impossible. Then I described PCA as a method to understand the important dimensions
in high dimensional datasets. But PCA appears to rely on covariance, so I should
not be able to form the principal components in the first place. In fact, we can
form principal components without computing a covariance matrix.
3.5.1 Principal Components by SVD
I will now assume the dataset has zero mean, to simplify notation. This is easily
achieved. You subtract the mean from each data item at the start, and add the
mean back once youve finished smoothing. As usual, we have N data items, each
a d dimensional column vector. We will now arrange these into a matrix,
xT1
X = xT2
. . . xTN
where each row of the matrix is a data vector. Now notice that the covariance
matrix for this dataset can be formed by constructing X T X , so that
Covmat ({X}) = X T X
and if we form the SVD (see the math notes at the end if you dont remember this)
of X , we have X = UV T . But we have X T X = VT V T so that
V T X T X V = T
and T is diagonal. So we can recover the principal components of the dataset
without actually forming the covariance matrix - we just form the SVD of X .
3.5.2 Just a few Principal Components with NIPALS
For really big datasets, even taking the SVD is hard. Usually, we dont really want
to recover all the principal components, because we want to recover a reasonably
accurate low dimensional representation of the data. We continue to work with a
data matrix X , whose rows are data items. Now assume we wish to recover the first
principal component. This means we are seeking a vector u and a set of N numbers
wi such that wi u is a good approximation to xi . In particular, we would like the
dataset made of wi u to encode as much of the variance of the original dataset as
possible. Now we can stack the wi into a column vector w. The Frobenius norm
is a term for the matrix norm obtained by summing squared entries of the matrix.
We write
X
a2ij .
|| A ||F =
i,j
In the exercises, you will show that the choice of w and u minimizes the cost
|| X wuT ||F
Section 3.5
which we can write as

C(w, u) =
X
ij
85
(xij wi uj ) .
Now we need to find the relevant w and u. Notice there is not a unique choice,
because the pair (sw, (1/s)u) works as well as the pair (w, u). We will choose u
such that || u || = 1. There is still not a unique choice, because you can flip the
signs in u and w, but this doesnt matter. The gradient of the cost function is a
set of partial derivatives with respect to components of w and u. The partial with
respect to wk is
X
C
(xkj wk uj ) uj
=
wk
j
which can be written in matrix vector form as
w C = (X wuT )u.
Similarly, the partial with respect to ul is
X
C
(xil wi ul ) wi
=
ul
i
which can be written in matrix vector form as
u C = (X T uwT )w.
At the solution, these partial derivatives are zero. This suggests an algorithm.
First, assume we have an estimate of u, say u(n) . Then we could choose the w that
makes the partial wrt w zero, so
1
w(n+ 2 ) =
X u(n)
(u(n) )T u(n)
Now we can update the estimate of u by choosing a value that makes the partial
1
wrt u zero, using our estimate w(n+ 2 ) , to get
1
u(n+ 2 ) =
X T w(n+ 2 )
1
(w(n+ 2 ) )T w
We need to rescale to ensure that our estimate of u has unit length. Write s =
1
1
((w(n+ 2 ) )T w) 2 We get
1
u(n+ 2 )
(n+1)
u
=
s
and
1
w(n+1) = sw(n+ 2 ) .
This iteration can be started by choosing some row of X as u(0) . You can test for
convergence by checking || u(n+1) u(n) ||. If this is small enough, then the algorithm
has converged.
Section 3.5
86
To obtain a second principal component, you form X (1) = X wuT and apply
the algorithm to that. You can get many principal components like this, but its not
a good way to get all of them (eventually numerical issues mean the estimates are
poor). The algorithm is widely known as NIPALS (for Non-linear Iterative Partial
Least Squares).
NIPALS is quite forgiving of missing values, though missing values make
it
hard
to use matrix notation. Recall I wrote the cost function as C(w, u) =
P
(x
wi uj )2 . We change the sum so that it ranges over only the known values,
ij
ij
to get
X
C(w, u) =
(xij wi uj )2
ijknown values
then write
and
C
=
wk
C
=
ul
jknown
(xkj wk uj )uj
iknown
(xil wi ul )wi .
values for k
values for l
These partial derivatives must be zero at the solution, so we can estimate

P
(n+ 21 )
jknown values for k xkj uj
= P
wk
(n+ 12 ) (n+ 21 )
uj
jknown values for k uj
and
(n+ 1 )
uk 2
=P
iknown
iknown
values for l xil wl

(n+ 12 )
values for l wl
(n+ 21 )
wl
We then normalize as before.
Procedure: 3.2 Obtaining some principal components with NIPALS
3.5.3 Projection and Discriminative Problems

You should remind yourself how Lagrange multipliers work before reading this
section.
Principal components analysis identifies the directions in high dimensional
space that best describe the dataset. This may not always be what we are looking
for. For a simple example, look at figure ??. In this case, we have two classes of data
item. Each varies a lot in the x direction and little in the y direction. If we project
onto the first principal component, the outcome represents variance in the data
well, but suppresses the difference between the two classes. For many applications
for example, classification we are interested in directions that emphasize the
Section 3.5
87
differences between classes. One very important construction for such directions
can be used for classification and for regression.
Assume we have a dataset of N items, each with two parts xi and yi . We
assume that at least xi has high dimension. We also assume that mean ({x}) = 0
and mean ({y}) = 0. This simplifies notation, and is easy to achieve (subtract the
mean). This is a mild generalization of what occurred in classification, where we
had a feature vector xi and a label yi for each data item; but now instead of having
just a label, we have a vector. This situation arises in practice quite often. For
example, xi might be a vector that describes an image and yi might be a vector
that describes a caption for that image. If we have a classification problem with C
classes, we might choose yi to be a vector with zero in all components except the
one corresponding to the class (this is sometimes known as a one-hot vector). We
wish to choose projections of x and y to a shared low dimensional space, so that
the projections in this space are strongly correlated to one another.
For the moment, assume the low dimensional space is one dimensional. Then
there is some unit vector a so that projecting xi to that space is given by aT xi ;
similarly, there is some unit vector b so that projecting yi to the space is given by
bT yi . Now we stack the vectors into data matrices whose rows are data items, as
above, so the ith row of X is xTi and the ith row of Y is yiT . Then px = matxXa
is a vector containing all the projections of the x part, and py = Yb is a vector
containing all the projections of the y part. We want these projections to be like
one another to the extent possible.
One criterion we can use is to maximize pTx py by choice of a and b. We must
maximise
aT max X T Yb subject to aT a = 1 and bT b = 1.
Write a and b for Lagrange multipliers. To solve this problem, we must find a
and b such that
X T Yb = a a and Y T X a = b b.
We can substitute using the second equation to get
(X T YY T X )a = a b a
which means that a is an eigenvector of (X T YY T X ). It turns out we seek the
eigenvector corresponding to the largest eigenvalue; equivalently, we must solve
aT (X T YY T X )a subject to aT a = 1
There is a straightforward way to obtain a second, third, etc. dimension. We
take the dataset, and subtract the a component from each xi and the b component
from each yi ; we now have a new dataset, and seek another a, b using our procedure.
These new directions must (a) maximise the covariance we are interested in and (b)
are orthogonal to the original directions. As Figure ?? suggests, these projections
of the data separate blobs of data with different labels.
3.5.4 Just a few Discriminative Directions with PLS1
Notice that we have, again, formed a (kind of) covariance matrix here, meaning
that this method might not apply if the dimension of the data is big. But we
Section 3.6
88
Iris data plotted on two discriminative directions
1
0
Second direction
0.5
0.0
1.0
0.5
Second principal component
1.0
Iris data plotted on two principal components
Multi-Dimensional Scaling
First principal component
10
11
First direction
FIGURE 3.26:
could obtain a from an SVD. An SVD yields UV T = Y T X , where U and V are

orthonormal and is diagonal. In turn, we have
aT (X T YY T X )a = aT V2 V T a
so that a must be the column of V corresponding to the largest singular value. Now
the argument I used for extracting principal components works for this problem,
too. Recall I was looking for the unit a that maximized aT X T X a; I showed I could
obtain this from an SVD of X ; then I argued that I could recover a by finding a,
2
w such that || X wa ||F was minimized.
When we are looking for a discriminative direction, we would look for a,
2
w such that || Y T X awT ||F is minimized. The procedure above applies. This
algorithm is usually called PLS1 (for partial least squares one).
3.6 MULTI-DIMENSIONAL SCALING
One way to get insight into a dataset is to plot it. But choosing what to plot for
a high dimensional dataset could be difficult. Assume we must plot the dataset
in two dimensions (by far the most common choice). We wish to build a scatter
plot in two dimensions but where should we plot each data point? One natural
requirement is that the points be laid out in two dimensions in a way that reflects
how they sit in many dimensions. In particular, we would like points that are far
apart in the high dimensional space to be far apart in the plot, and points that are
close in the high dimensional space to be close in the plot.
3.6.1 Principal Coordinate Analysis
We will plot the high dimensional point xi at vi , which is a two-dimensional vector.
Now the squared distance between points i and j in the high dimensional space is
(2)
Dij (x) = (xi xj )T (xi xj )
Section 3.6
89
Heart data plotted on two discriminative directions
20
Second direction
20
20
0
20
40
40
80
60
60
Second principal component
40
40
60
60
Heart data plotted on two principal components
100
100
200
300
550
500
First principal component
450
400
350
300
250
First direction
FIGURE 3.27:
(where the superscript is to remind you that this is a squared distance). We could
build an N N matrix of squared distances, which we write D(2) (x). The i, jth
(2)
entry in this matrix is Dij (x), and the x argument means that the distances are
between points in the high-dimensional space. Now we could choose the vi to make
2
X (2)
(2)
Di j(x) Dij (v)
ij
as small as possible. Doing so should mean that points that are far apart in the
high dimensional space are far apart in the plot, and that points that are close in
the high dimensional space are close in the plot.
In its current form, the expression is difficult to deal with, but we can refine
it. Because translation does not change the distances between points, it cannot
change either of the D(2) matrices. So it is enoughP
to solve the case when the mean
of the points xi is zero. We can assume that N1
i xi = 0. Now write 1 for the
n-dimensional vector containing all ones, and I for the identity matrix. Notice that
(2)
Dij = (xi xj )T (xi xj ) = xi xi 2xi xj + xj xj .

Now write

1 T
A = I 11 .
N
Using this expression, you can show that the matrix M, defined below,
1
M(x) = AD(2) (x)AT
2
has i, jth entry xi xj (exercises). I now argue that, to make D(2) (v) is close to
D(2) (x), it is enough to make M(v) close to M(x). Proving this will take us out
of our way unnecessarily, so I omit a proof.
Section 3.6
90
We can choose a set of vi that makes D(2) (v) close to D(2) (x) quite easily,
using the method of the previous section. Take the dataset of N d-dimensional
column vectors xi , and form a matrix X by stacking the vectors, so
X = [x1 , x2 , . . . , xN ] .
In this notation, we have
M(x) = X T X .
This matrix is symmetric, and it is positive semidefinite. It cant be positive definite, because the data is zero mean, so M(x)1 = 0. The M(v) we seek must (a)
be as close as possible to M(x) and (b) have rank 2. It must have rank 2 because
there must be some V which is 2 N so that M(v) = V T V. The columns of this
V are our vi .
We can use the method of section ?? to construct M(v) and V. As usual,
we write U for the matrix of eigenvectors of M(x), for the diagonal matrix of
eigenvalues sorted in descending order, 2 for the 2 2 upper left hand block of
(1/2)
, and 2
for the matrix of positive square roots of the eigenvalues. Then our
methods yield
(1/2) (1/2)
M(v) = U2 2 2 U2T
and
(1/2)
V = 2
U2T
and we can plot these vi (example in section 14.5). This method for constructing
a plot is known as principal coordinate analysis.
This plot might not be perfect, because reducing the dimension of the data
points should cause some distortions. In many cases, the distortions are tolerable.
In other cases, we might need to use a more sophisticated scoring system that
penalizes some kinds of distortion more strongly than others. There are many ways
to do this; the general problem is known as multidimensional scaling.
Procedure: 3.3 Principal Coordinate Analysis

Assume we have a matrix D(2) consisting of the squared differences
between each pair of N points. We do not need to know the points. We
wish to compute a set of points in r dimensions, such that the distances
(2)
between these
points are
as similar as possible to the distances in D .
Form A = I N1 11T . Form W = 21 AD(2) AT . Form U, , such that
WU = U (these are the eigenvectors and eigenvalues of W). Ensure
that the entries of are sorted in decreasing order. Choose r, the
number of dimensions you wish to represent. Form r , the top left
(1/2)
r r block of . Form r
, whose entries are the positive square
roots of r . Form Ur , the matrix consisting of the first r columns of U.
(1/2)
Then V = 2 U2T = [v1 , . . . , vN ] is the set of points to plot.
Section 3.6
91
400
Polokwane
200
Nelspruit
Mahikeng
Johannesburg
200
Kimberley
Bloemfontein
Pietermaritzburg
400
600
Bhisho
800
Cape Town
1000
800
600
400
200
200
400
FIGURE 3.28: On the left, a public domain map of South Africa, obtained from
http:// commons.wikimedia.org/ wiki/ File:Map of South Africa.svg , and edited to remove surrounding countries. On the right, the locations of the cities inferred by
multidimensional scaling, rotated, translated and scaled to allow a comparison to
the map by eye. The map doesnt have all the provincial capitals on it, but its easy
to see that MDS has placed the ones that are there in the right places (use a piece
of ruled tracing paper to check).
0.2
0.2
0.1
0.15
0.1
0.05
0.1
0
0.2
0.4
0.05
0.1
0.2
0.15
0
0.2
0.4
0.2
0.1
0.1
0.2
0.2
0.3 0.4
0.2
0.2
0.4
0.2
0.1
0.1
0.2
0.3
FIGURE 3.29: Two views of the spectral data of section 3.4.5, plotted as a scatter
plot by applying principal coordinate analysis to obtain a 3D set of points. Notice

that the data spreads out in 3D, but seems to lie on some structure; it certainly isnt
a single blob. This suggests that further investigation would be fruitful.
3.6.2 Example: Mapping with Multidimensional Scaling
Multidimensional scaling gets positions (the V of section 3.6.1) from distances (the
D(2) (x) of section 3.6.1). This means we can use the method to build maps from
distances alone. I collected distance information from the web (I used http://www.
distancefromto.net, but a google search on city distances yields a wide range of
possible sources), then apply multidimensional scaling. Table ?? shows distances
between the South African provincial capitals, in kilometers, rounded to the nearest kilometer. I then used principal coordinate analysis to find positions for each
Section 3.6
92
capital, and rotated, translated and scaled the resulting plot to check it against a
real map (Figure 3.28).
One natural use of principal coordinate analysis is to see if one can spot
any structure in a dataset. Does the dataset form a blob, or is it clumpy? This
isnt a perfect test, but its a good way to look and see if anything interesting is
happening. In figure 3.29, I show a 3D plot of the spectral data, reduced to three
dimensions using principal coordinate analysis. The plot is quite interesting. You
should notice that the data points are spread out in 3D, but actually seem to lie on
a complicated curved surface they very clearly dont form a uniform blob. To
me, the structure looks somewhat like a butterfly. I dont know why this occurs, but
it certainly suggests that something worth investigating is going on. Perhaps the
choice of samples that were measured is funny; perhaps the measuring instrument
doesnt make certain kinds of measurement; or perhaps there are physical processes
that prevent the data from spreading out over the space.
0.15
0.1
USSR
China
Japan
Yugoslavia
0.05
USA
Israel
France
Egypt
0.05
India
Cuba
0.1
Congo
Brazil
0.15
0.1
0.05
0.05
0.1
0.15
0.2
FIGURE 3.30: A map of country similarity, prepared from the data of figure ??. The
map is often interpreted as showing a variation in development or wealth (poorest
at bottom left to richest at top right); and freedom (most repressed at top left and
freeest at bottom right). I havent plotted these axes, because the interpretation
wouldnt be consistent with current intuition (the similarity data is forty years old,
and quite a lot has happened in that time).
Our algorithm has one really interesting property. In some cases, we do not
actually know the datapoints as vectors. Instead, we just know distances between
the datapoints. This happens often in the social sciences, but there are important
cases in computer science as well. As a rather contrived example, one could survey
people about breakfast foods (say, eggs, bacon, cereal, oatmeal, pancakes, toast,
muffins, kippers and sausages for a total of 9 items). We ask each person to rate the
similarity of each pair of distinct items on some scale. We advise people that similar
items are ones where, if they were offered both, they would have no particular
preference; but, for dissimilar items, they would have a strong preference for one
over the other. The scale might be very similar, quite similar, similar, quite
dissimilar, and very dissimilar (scales like this are often called Likert scales).
We collect these similarities from many people for each pair of distinct items, and
Section 3.7
Example: Understanding Height and Weight
93
50
40
20
0
0
20
40
50
50
200
400
100 600
50
100
0
100 500
400
300
200
FIGURE 3.31: Two views of a multidimensional scaling to three dimensions of the

height-weight dataset. Notice how the data seems to lie in a flat structure in 3D,
with one outlying data point. This means that the distances between data points can
be (largely) explained by a 2D representation.
then average the similarity over all respondents. We compute distances from the
similarities in a way that makes very similar items close and very dissimilar items
distant. Now we have a table of distances between items, and can compute a V
and produce a scatter plot. This plot is quite revealing, because items that most
people think are easily substituted appear close together, and items that are hard
to substitute are far apart. The neat trick here is that we did not start with a X ,
but with just a set of distances; but we were able to associate a vector with eggs,
and produce a meaningful plot.
Table ?? shows data from one such example. Students were interviewed (in
1971! things may have changed since then) about their perceptions of the similarity
of countries. The averaged perceived similarity is shown in table ??. Large numbers
reflect high similarity, so we cant use these numbers directly. It is reasonable to
turn these numbers into distances by (a) using 0 as the distance between a country
and itself and (b) using esij as the distance between countries i and j (where sij is
the similarity between them). Once we have distances, we can apply the procedure
of section 3.6.1 to get points, then plot a scatter plot (Figure 3.30).
3.7 EXAMPLE: UNDERSTANDING HEIGHT AND WEIGHT
Recall the height-weight data set of section ?? (from http://www2.stetson.edu/
jrasp/data.htm; look for bodyfat.xls at that URL). This is, in fact, a 16-dimensional
dataset. The entries are (in this order): bodyfat; density; age; weight; height; adiposity; neck; chest; abdomen; hip; thigh; knee; ankle; biceps; forearm; wrist. We
know already that many of these entries are correlated, but its hard to grasp a 16
dimensional dataset in one go. The first step is to investigate with a multidimensional scaling.
Figure ?? shows a multidimensional scaling of this dataset down to three
dimensions. The dataset seems to lie on a (fairly) flat structure in 3D, meaning
that inter-point distances are relatively well explained by a 2D representation. Two
points seem to be special, and lie far away from the flat structure. The structure
Section 3.7
94
HeightWeight 2D MDS
50
50
100
500
400
300
200
FIGURE 3.32: A multidimensional scaling to two dimensions of the height-weight
dataset. One data point is clearly special, and another looks pretty special. The
data seems to form a blob, with one axis quite a lot more important than another.
isnt perfectly flat, so there will be small errors in a 2D representation; but its clear
that a lot of dimensions are redundant. Figure 3.32 shows a 2D representation of
these points. They form a blob that is stretched along one axis, and there is no sign
of multiple blobs. Theres still at least one special point, which we shall ignore but
might be worth investigating further. The distortions involved in squashing this
dataset down to 2D seem to have made the second special point less obvious than
it was in figure ??.
HeightWeight mean
weight
180
160
140
120
chest hip
abdomen
100
80
height
thigh
60
age
40
20
0
bodyfat
knee
neck
adiposity
biceps
forearm
ankle
wrist
density
2
10
12
14
16
18
FIGURE 3.33: The mean of the bodyfat.xls dataset. Each component is likely in a
different unit (though I dont know the units), making it difficult to plot the data
without being misleading. Ive adopted one solution here, by plotting each component
as a vertical bar, and labelling the bar. You shouldnt try to compare the values to
one another. Instead, think of this plot as a compact version of a table.
The next step is to try a principal component analysis. Figure 3.33 shows
the mean of the dataset. The components of the dataset have different units, and
Section 3.7
95
shouldnt really be compared. But it is difficult to interpret a table of 16 numbers, so

I have plotted the mean by showing a vertical bar for each component. Figure 3.34
shows the eigenvalues of the covariance for this dataset. Notice how one dimension is
very important, and after the third principal component, the contributions become
small. Of course, I could have said fourth, or fifth, or whatever the precise
choice depends on how small a number you think is small.
Heightweight covariance eigenvalues
0.9
1200
HeightWeight first principal component

weight
0.8
1000
0.7
0.6
800
0.5
600
0.4
0.3
400
0.2
200
age
0
bodyfat
0.1
10
12
14
16
density
2
4
abdomen
chest
hip
thigh
adiposity
knee biceps
neck
ankle forearm
height
wrist
10
12
14
16
18
FIGURE 3.34: On the left, the eigenvalues of the covariance matrix for the bodyfat
data set. Notice how fast the eigenvalues fall off; this means that most principal
components have very small variance, so that data can be represented well with a
small number of principal components. On the right, the first principal component
for this dataset, plotted using the same convention as for figure 3.33.
Figure 3.34 also shows the first principal component. The eigenvalues justify
thinking of each data item as (roughly) the mean plus some weight times this
principal component. From this plot you can see that data items with a larger
value of weight will also have larger values of most other measurements, except age
and density. You can also see how much larger; if the weight goes up by 8.5 units,
then the abdomen will go up by 3 units, and so on. This explains the main variation
in the dataset.
In the rotated coordinate system, the components are not correlated, and they
have different variances (which are the eigenvalues of the covariance matrix). You
can get some sense of the data by adding these variances; in this case, we get 1404.
This means that, in the
translated and rotated coordinate system, the average data
point is about 37 = 1404 units away from the center (the origin). Translations
and rotations do not change distances, so the average data point is about 37 units
from the center in the original dataset, too. If we represent a datapoint by using
the mean and the first three principal components, there will be some error. We
can estimate the average error from the component variances. In this case, the sum
of the first three eigenvalues is 1357, so the mean square
error in representing a
p
datapoint by the first three principal components is (1404 1357), or 6.8. The
relative error is 6.8/37 = 0.18. Another way to represent this information, which is
more widely used, is to say that the first three principal components explain all but
Section 3.8
What you should remember - NEED SOMETHING
HeightWeight second principal component
HeightWeight third principal component

bodyfat
age
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0
0.2
0
bodyfat
abdomen
chest
adiposity
neck
density
2
height
weight
4
6
hip
8
10
0
wrist
14
16
abdomen
chest
adiposity
density
hip thigh
biceps
forearm
knee
ankle
wrist
neck
0.2
knee
biceps
forearm
ankle
thigh
12
96
18
0.4
0
height
ageweight
4
6
10
12
14
16
18
FIGURE 3.35: On the left, the second principal component, and on the right the
third principal component of the height-weight dataset.
(1404 1357)/1404 = 0.034, or 3.4% of the variance; notice that this is the square
of the relative error, which will be a much smaller number.
All this means that explaining a data point as the mean and the first three
principal components produces relatively small errors. Figure 3.36 shows the second
and third principal component of the data. These two principal components suggest
some further conclusions. As age gets larger, height and weight get slightly smaller,
but the weight is redistributed; abdomen gets larger, whereas thigh gets smaller.
A smaller effect (the third principal component) links bodyfat and abdomen. As
bodyfat goes up, so does abdomen.
3.8 WHAT YOU SHOULD REMEMBER - NEED SOMETHING
PROBLEMS
Summaries
3.1. You have a dataset {x} of N vectors, xi , each of which is d-dimensional. We
will consider a linear function of this dataset. Write a for a constant vector;
then the value of this linear function evaluated on the ith data item is aT xi .
Write fi = aT xi . We can make a new dataset {f } out of the values of this
linear function.
(a) Show that mean ({f }) = aT mean ({x}) (easy).
(b) Show that var ({f }) = aT Covmat ({x})a (harder, but just push it through
the definition).
(c) Assume the dataset has the special property that there exists some a so
that aT Covmat ({x})a. Show that this means that the dataset lies on a
hyperplane.
3.2. On Figure 3.36, mark the mean of the dataset, the first principal component,
and the second principal component.
3.3. You have a dataset {x} of N vectors, xi , each of which is d-dimensional.
Assume that Covmat ({x}) has one non-zero eigenvalue. Assume that x1 and
Section 3.8
97
10
5
0
5
10
10
10
FIGURE 3.36: Figure for the question

x2 do not have the same value.
(a) Show that you can choose a set of ti so that you can represent every data
item xi exactly
xi = x1 + ti (x2 x1 ).
(b) Now consider the dataset of these t values. What is the relationship
between (a) std (t) and (b) the non-zero eigenvalue of Covmat ({x})? Why?
3.4. Obtain the iris dataset from the UC Irvine machine learning data repository at
http://https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data.
(a) Plot a scatterplot matrix of this dataset, showing each species with a
different marker. The fragment of R code in listing 3.1 should take you
most of the way.
(b) Now obtain the first two principal components of the data. Plot the
data on those two principal components alone, again showing each species
with a different marker. Has this plot introduced significant distortions?
Explain
3.5. Take the wine dataset from the UC Irvine machine learning data repository at
http://archive.ics.uci.edu/ml/datasets/seeds.
(a) Plot the eigenvalues of the covariance matrix in sorted order. How many
principal components should be used to represent this dataset? Why?
(b) Construct a stem plot of each of the first 3 principal components (i.e. the
eigenvectors of the covariance matrix with largest eigenvalues). What do
you see?
(c) Compute the first two principal components of this dataset, and project
it onto those components. Now produce a scatter plot of this two dimensional dataset, where data items of class 1 are plotted as a 1, class 2 as
a 2, and so on.
3.6. Take the wheat kernel dataset from the UC Irvine machine learning data repository at http://archive.ics.uci.edu/ml/datasets/seeds. Compute the first two
principal components of this dataset, and project it onto those components.
(a) Produce a scatterplot of this projection. Do you see any interesting phenomena?
Section 3.8
98
Listing 3.1: R code for iris example.

# r code f o r s c a t t e r p l o t o f i r i s dat a
i r i s d a t<read . csv ( i r i s . dat , h e a d e r=FALSE ) ;
library ( l a t t i c e )
n u m i r i s=i r i s d a t [ , c ( 1 , 2 , 3 , 4 ) ]
postscript ( i r i s s c a t t e r p l o t . eps )
# so t h a t I g e t a p o s t s c r i p t f i l e
s p e c i e s n a m e s<c ( s e t o s a , v e r s i c o l o r , v i r g i n i c a )
pchr<c ( 1 , 2 , 3 )
c o l r<c ( r e d , g r e e n , b l u e , y e l l o w , o r a n g e )
s s<expand . grid ( s p e c i e s = 1 : 3 )
p a r s e t<with ( s s , simpleTheme ( pch=pchr [ s p e c i e s ] ,
c o l=c o l r [ s p e c i e s ] ) )
splom ( i r i s d a t [ , c ( 1 : 4 ) ] , g r o u p s=i r i s d a t $V5 ,
par . s e t t i n g s=p a r s e t ,
varnames=c ( S e p a l \ nLength , S e p a l \nWidth ,
P e t a l \ nLength , P e t a l \nWidth ) ,
key=l i s t ( text=l i s t ( s p e c i e s n a m e s ) ,
points=l i s t ( pch=pchr ) , columns =3))
dev . o f f ( )
(b) Plot the eigenvalues of the covariance matrix in sorted order. How many
principal components should be used to represent this dataset? why?
C H A P T E R
Clustering: Models of High

Dimensional Data
High-dimensional data comes with problems. Data points tend not to be
where you think; they can scattered quite far apart, and can be quite far from
the mean. Except in special cases, the only really reliable probability model is the
Gaussian (or Gaussian blob, or blob).
There is an important rule of thumb for coping with high dimensional data:
Use simple models. A blob is a good simple model. Modelling data as a blob
involves computing its mean and its covariance. Sometimes, as we shall see, the
covariance can be hard to compute. Even so, a blob model is really useful. It is
natural to try and extend this model to cover datasets that dont obviously consist
of a single blob.
One very good, very simple, model for high dimensional data is to assume
that it consists of multiple blobs. To build models like this, we must determine
(a) what the blob parameters are and (b) which datapoints belong to which blob.
Generally, we will collect together data points that are close and form blobs out of
them. This process is known as clustering.
Clustering is a somewhat puzzling activity. It is extremely useful to cluster
data, and it seems to be quite important to do it reasonably well. But it surprisingly
hard to give crisp criteria for a good (resp. bad) clustering of a dataset. Usually,
clustering is part of building a model, and the main way to know that the clustering
algorithm is bad is that the model is bad.
4.1 AGGLOMERATIVE AND DIVISIVE CLUSTERING
There are two natural algorithms for clustering. In divisive clustering, the entire
data set is regarded as a cluster, and then clusters are recursively split to yield a
good clustering (Algorithm 4.2). In agglomerative clustering, each data item is
regarded as a cluster, and clusters are recursively merged to yield a good clustering
(Algorithm 4.1).
Make each point a separate cluster

Until the clustering is satisfactory
Merge the two clusters with the
smallest inter-cluster distance
end
Algorithm 4.1: Agglomerative Clustering or Clustering by Merging.
99
Section 4.1
Agglomerative and Divisive Clustering
100
Construct a single cluster containing all points

Until the clustering is satisfactory
Split the cluster that yields the two
components with the largest inter-cluster distance
end
Algorithm 4.2: Divisive Clustering, or Clustering by Splitting.
There are two major issues in thinking about clustering:
What is a good inter-cluster distance? Agglomerative clustering uses an intercluster distance to fuse nearby clusters; divisive clustering uses it to split insufficiently coherent clusters. Even if a natural distance between data points
is available (which might not be the case for vision problems), there is no
canonical inter-cluster distance. Generally, one chooses a distance that seems
appropriate for the data set. For example, one might choose the distance
between the closest elements as the inter-cluster distance, which tends to
yield extended clusters (statisticians call this method single-link clustering). Another natural choice is the maximum distance between an element of
the first cluster and one of the second, which tends to yield rounded clusters
(statisticians call this method complete-link clustering). Finally, one could
use an average of distances between elements in the cluster, which also tends
to yield rounded clusters (statisticians call this method group average
clustering).
How many clusters are there? This is an intrinsically difficult task if there
is no model for the process that generated the clusters. The algorithms we
have described generate a hierarchy of clusters. Usually, this hierarchy is
displayed to a user in the form of a dendrograma representation of the
structure of the hierarchy of clusters that displays inter-cluster distances
and an appropriate choice of clusters is made from the dendrogram (see the
example in Figure 4.1).
The main difficulty in using a divisive model is knowing where to split. This is
sometimes made easier for particular kinds of data. For example, we could segment
an image by clustering pixel values. In this case, you can sometimes find good splits
by constructing a histogram of intensities, or of color values.
Another important thing to notice about clustering from the example of figure 4.1 is that there is no right answer. There are a variety of different clusterings
of the same data. For example, depending on what scales in that figure mean, it
might be right to zoom out and regard all of the data as a single cluster, or to zoom
in and regard each data point as a cluster. Each of these representations may be
useful.
Section 4.1
101
5
4
1
2
distance
2 3 4 5
FIGURE 4.1: Left, a data set; right, a dendrogram obtained by agglomerative clus-
tering using single-link clustering. If one selects a particular value of distance, then
a horizontal line at that distance splits the dendrogram into clusters. This representation makes it possible to guess how many clusters there are and to get some
insight into how good the clusters are.
4.1.1 Clustering and Distance
In the algorithms above, and in what follows, we assume that the features are scaled
so that distances (measured in the usual way) between data points are a good
representation of their similarity. This is quite an important point. For example,
imagine we are clustering data representing brick walls. The features might contain
several distances: the spacing between the bricks, the length of the wall, the height
of the wall, and so on. If these distances are given in the same set of units, we could
have real trouble. For example, assume that the units are centimeters. Then the
spacing between bricks is of the order of one or two centimeters, but the heights
of the walls will be in the hundreds of centimeters. In turn, this means that the
distance between two datapoints is likely to be completely dominated by the height
and length data. This could be what we want, but it might also not be a good
thing.
There are some ways to manage this issue. One is to know what the features
measure, and know how they should be scaled. Usually, this happens because you
have a deep understanding of your data. If you dont (which happens!), then it is
often a good idea to try and normalize the scale of the data set. There are two good
strategies. The simplest is to translate the data so that it has zero mean (this is
just for neatness - translation doesnt change distances), then scale each direction
so that it has unit variance. More sophisticated is to translate the data so that
it has zero mean, then transform it so that each direction is independent and has
unit variance. Doing so is sometimes referred to as decorrelation or whitening
Section 4.1
102
100
80
60
40
20
60
232
64
57
225
62
13
21
7
130
46
209
93
210
184
51
116
124
99
197
218
69
89
90
102
91
131
92
215
117
213
115
201
103
132
111
142
110
139
135
170
173
174
160
114
198
195
137
185
105
196
127
141
136
190
122
113
126
112
123
143
220
68
70
138
202
73
200
118
77
227
230
229
233
217
125
223
98
100
38
101
94
104
214
133
107
219
228
108
56
179
129
15
193
186
199
167
211
53
224
75
106
76
119
67
204
95
97
189
49
203
37
188
158
11
145
4
163
221
9120
206
72
8
208
191
171
48
128
176
183
177
146
19
154
151
125
144
26
30
32
161
24
231
23
27
47
52
109
55
121
71
236
78
80
84
85
81
16
147
63
14
6
155
18
166
20
168
162
29
234
66
235
134
172
45
156
82
33
237
83
148
2
157
31
164
165
181
34
194
240
50
149
44
153
150
88
54
159
17
246
28
239
87
251
22
207
386
245
58
65
238
242
61
247
249
252
243
244
40
212
59
140
43
182
10
192
178
35
180
187
222
175
12
226
79
241
74
248
250
205
152
96
169
216
536
41
42
39
FIGURE 4.2: A dendrogram obtained from the body fat dataset, using single link
clustering. Recall that the data points are on the horizontal axis, and that the
vertical axis is distance; there is a horizontal line linking two clusters that get
merged, established at the height at which theyre merged. I have plotted the entire
dendrogram, despite the fact its a bit crowded at the bottom, because it shows that
most data points are relatively close (i.e. there are lots of horizontal branches at
about the same height).
(because you make the data more like white noise); I described how to do this in
section ??.
Section 4.1
103
250
200
150
100
50
50
100
30
20
10
10
20
30
40
FIGURE 4.3: A clustering of the body fat dataset, using agglomerative clustering,
single link distance, and requiring a maximum of 30 clusters. I have plotted each
cluster with a distinct marker (though some markers differ only by color; you might
need to look at the PDF version to see this figure at its best). Notice that one
cluster contains much of the data, and that there are a set of small isolated clusters.
The original data is 16 dimensional, which presents plotting problems; I show a
scatter plot on the first two principal components (though I computed distances for
clustering in the original 16 dimensional space).
Worked example 4.1
Agglomerative clustering in Matlab
Cluster the height-weight dataset of http://www2.stetson.edu/jrasp/data.htm

(look for bodyfat.xls) using an agglomerative clusterer, and describe the results.
Solution: Matlab provides some tools that are useful for agglomerative clustering. These functions use a scheme where one first builds the whole tree of
merges, then analyzes that tree to decide which clustering to report. linkage
will determine which pairs of clusters should be merged at which step (there
are arguments that allow you to choose what type of inter-cluster distance it
should use); dendrogram will plot you a dendrogram; and cluster will extract the clusters from the linkage, using a variety of options for choosing the
clusters. I used these functions to prepare the dendrogram of figure 4.2 for
the height-weight dataset of section ?? (from http://www2.stetson.edu/jrasp/
data.htm; look for bodyfat.xls). I deliberately forced Matlab to plot the whole
dendrogram, which accounts for the crowded look of the figure (you can allow
it to merge small leaves, etc.). I used a single-link strategy. In particular,
notice that many data points are about the same distance from one another,
which suggests a single big cluster with a smaller set of nearby clusters. The
clustering of figure 4.3 supports this view. I plotted the data points on the
first two principal components, using different colors and shapes of marker to
indicate different clusters. There are a total of 30 clusters here, though most
are small.
Section 4.1
104
2
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
173
207
169
155
192
203
188
186
145
159
162
195
205
193
209
206
146
156
174
177
179
160
151
184
187
150
175
191
176
178
194
190
153
158
167
182
163
183
196
181
154
170
157
144
201
210
197
148
164
168
165
171
172
149
199
161
198
141
143
185
152
147
166
180
200
202
189
142
204
208
59
1
35
50
56
25
18
47
23
5
263
29
8
22
57
39
645
48
49
21
54
14
15
33
51
753
12
4
46
34
68
16
41
42
30
55
67
69
27
70
32
20
64
58
2
28
43
66
13
19
10
9
36
37
38
11
44
63
24
60
31
65
52
71
108
77
137
75
122
96
101
123
134
140
72
76
73
81
132
74
118
107
92
105
93
104
97
112
119
106
85
116
98
100
113
111
124
131
86
99
88
110
87
102
128
117
126
127
109
120
103
79
95
82
94
130
133
80
135
84
129
125
91
136
114
138
139
61
62
115
78
90
121
83
89
17
40
FIGURE 4.4: A dendrogram obtained from the seed dataset, using single link cluster-
ing. Recall that the data points are on the horizontal axis, and that the vertical axis
is distance; there is a horizontal line linking two clusters that get merged, established
at the height at which theyre merged. I have plotted the entire dendrogram, despite
the fact its a bit crowded at the bottom, because you can now see how clearly the
data set clusters into a small set of clusters there are a small number of vertical
runs.
Worked example 4.2
Agglomerative clustering in Matlab, 2
Cluster the seed dataset from the UC Irvine Machine Learning Dataset Repository (you can find it at http://archive.ics.uci.edu/ml/datasets/seeds).
Solution: Each item consists of seven measurements of a wheat kernel; there
are three types of wheat represented in this dataset. As you can see in figures 4.4
and 4.5, this data clusters rather well.
Section 4.1
105
8
4
FIGURE 4.5: A clustering of the seed dataset, using agglomerative clustering, single
link distance, and requiring a maximum of 30 clusters. I have plotted each cluster
with a distinct marker (though some markers differ only by color; you might need to
look at the PDF version to see this figure at its best). Notice that there are a set of
fairly natural isolated clusters. The original data is 8 dimensional, which presents
plotting problems; I show a scatter plot on the first two principal components (though
I computed distances for clustering in the original 8 dimensional space).
Worked example 4.3
Agglomerative clustering in R
Cluster the questions in the student evaluation dataset from the UC Irvine
Machine Learning Dataset Repository (you can find it at https://archive.ics.
uci.edu/ml/datasets/Turkiye+Student+Evaluation; this dataset was donated by
G. Gunduz and E. Fokoue), and display it on the first two principal components.
You should use 5 clusters, but investigate the results of choosing others.
Solution: R provides tools for agglomerative clustering, too. I used the block
of code shown in listing 4.1 to produce figure 4.6. The clustering shown in
that figure is to 5 clusters. To get some idea of what the clusters are like,
you can compute the per-cluster means, using a line in that listing. I found
means where: all students in the cluster gave moderately low, low, moderately
high and high answers to all questions, respectively; and one where all students
answered 1 to all questions. There are 5820 students in this collection. The
clusters suggest that answers tend to be quite strongly correlated a student
who gives a low answer to a question will likely give low answers to others, too.
Choosing other numbers of clusters wasnt particularly revealing, though there
were more levels to the answers.
Section 4.2
The K-Means Algorithm and Variants
106
CLUSPLOT( wdm )
(0.996,1.8]
(3.4,4.2]
(4.2,5]
(2.6,3.4]
Component 2
(1.8,2.6]
Component 1
These two components explain 86.76 % of the point variability.
A clustering of the student evaluation dataset, using agglomerative

clustering and 5 clusters. I produced this clustering with the code fragment in
listing 4.1. One difficulty with this dataset is there are 28 questions (Q1-Q28),
meaning the data is quite difficult to plot. You should be able to see the clusters on
the figure, but notice that they are smeared on top of one another; this is because I
had to project off 26 dimensions to produce the plot. The plotting function I used
(clusplot) plots a covariance ellipse for each cluster.
FIGURE 4.6:
4.2 THE K-MEANS ALGORITHM AND VARIANTS

Assume we have a dataset that, we believe, forms many clusters that look like
blobs. If we knew where the center of each of the clusters was, it would be easy to
tell which cluster each data item belonged to it would belong to the cluster with
the closest center. Similarly, if we knew which cluster each data item belonged to,
it would be easy to tell where the cluster centers were theyd be the mean of the
data items in the cluster. This is the point closest to every point in the cluster.
We can turn these observations into an algorithm. Assume that we know how
many clusters there are in the data, and write k for this number. The jth data
item to be clustered is described by a feature vector xj . We write ci for the center
of the ith cluster. We assume that most data items are close to the center of their
cluster. This suggests that we cluster the data by minimizing the the cost function
X
X
(clusters, data) =
(xj ci )T (xj ci ) .
iclusters
jith
cluster
Notice that if we know the center for each cluster, it is easy to determine which
cluster is the best choice for each point. Similarly, if the allocation of points to
clusters is known, it is easy to compute the best center for each cluster. However,
there are far too many possible allocations of points to clusters to search this space
for a minimum. Instead, we define an algorithm that iterates through two activities:
Section 4.2
107
Listing 4.1: R code for student example.

setwd ( / u s e r s / d a f / C u r r e n t/ c o u r s e s / P r o b c o u r s e / C l u s t e r i n g/RCode )
wdat<read . csv ( t u r k i y e s tu d e n t e v a l u a t i o n R S p e c i f i c . c s v )
wdm<wdat [ , c ( 6 : 3 3 ) ] # c hoos e t h e q u e s t i o n s
d < d i s t (wdm, method = e u c l i d e a n ) # d i s t a n c e m at r i x
f i t < h c l u s t ( d , method= ward ) # t h e c l u s t e r i n g
cm<cov (wdm)
ev<eigen (cm)
p r i n c<ev $ v e c t o r s
proj<t ( rbind ( t ( p r i n c [ , 1 ] ) , t ( p r i n c [ , 2 ] ) ) )
wdzm<s c a l e (wdm, c e n t e r=TRUE, s c a l e=FALSE) #s u b t r a c t t h e mean
wdzm<data . matrix (wdzm)
wdpcs<wdzm%%proj
kv<c ( 5 )
g r o u p s<c u t r e e ( f i t , k=kv )
wdbig<wdm
wdbig$ group<g r o u p s
cg<cut ( groups , kv )
p<g g p l o t ( as . data . frame ( wdpcs ))+ geom p o i n t ( a e s ( y=V1 , x=V2 , shape=cg ) )
setEPS ( )
postscript ( c l u s t e r s . eps )
p
dev . o f f ( )
cm1<colMeans ( subset ( wdbig , group ==1)) # mean o f t h e 1 s t c l u s t e r
Assume the cluster centers are known and, allocate each point to the closest
cluster center.
Assume the allocation is known, and choose a new set of cluster centers. Each
center is the mean of the points allocated to that cluster.
We then choose a start point by randomly choosing cluster centers, and then iterate
these stages alternately. This process eventually converges to a local minimum of
the objective function (the value either goes down or is fixed at each step, and
it is bounded below). It is not guaranteed to converge to the global minimum of
the objective function, however. It is also not guaranteed to produce k clusters,
unless we modify the allocation phase to ensure that each cluster has some nonzero
number of points. This algorithm is usually referred to as k-means (summarized
in Algorithm 4.3).
Usually, we are clustering high dimensional data, so that visualizing clusters
can present a challenge. If the dimension isnt too high, then we can use panel
plots. An alternative is to project the data onto two principal components, and
plot the clusters there. A natural dataset to use to explore k-means is the iris
data, where we know that the data should form three clusters (because there are
three species). Recall this dataset from section ??. I reproduce figure 3.5 from
that section as figure 4.11, for comparison. Figures 4.8, ?? and ?? show different
k-means clusterings of that data.
Section 4.2
setosa
versicolor
108
virginica
2.5
1.5 2.0 2.5
2.0
1.5
Petal
Width 1.0
0.0 0.5 1.0

4
h
Petal.Lengt
Petal
Length
0.5
0.0
4
3
2
1
4.5
3.5 4.0 4.5
4.0
3.5
Sepal
Width 3.0
2.5
2.0 2.5 3.0
Petal.Width
2.0
8
7
Sepal.Length
Sepal
Length
6
5
Scatter Plot Matrix
FIGURE 4.7: Left: a 3D scatterplot for the famous Iris data, originally due to ***.
I have chosen three variables from the four, and have plotted each species with a
different marker. You can see from the plot that the species cluster quite tightly,
and are different from one another. Right: a scatterplot matrix for the famous Iris
data, originally due to ***. There are four variables, measured for each of three
species of iris. I have plotted each species with a different marker. You can see from
the plot that the species cluster quite tightly, and are different from one another.
Worked example 4.4
K-means clustering in R
Cluster the iris dataset into two clusters using k-means, then plot the results
on the first two principal components
Solution: I used the code fragment in listing 4.2, which produced figure ??
4.2.1 How to choose K

The iris data is just a simple example. We know that the data forms clean clusters,
and we know there should be three of them. Usually, we dont know how many
clusters there should be, and we need to choose this by experiment. One strategy
is to cluster for a variety of different values of k, then look at the value of the cost
function for each. If there are more centers, each data point can find a center that
is close to it, so we expect the value to go down as k goes up. This means that
Section 4.2
109
Choose k data points to act as cluster centers

Until the cluster centers change very little
Allocate each data point to cluster whose center is nearest.
Now ensure that every cluster has at least
one data point; one way to do this is by
supplying empty clusters with a point chosen at random from
points far from their cluster center.
Replace the cluster centers with the mean of the elements
in their clusters.
end
Algorithm 4.3: Clustering by K-Means.
Listing 4.2: R code for iris example.
setwd ( / u s e r s / d a f / C u r r e n t/ c o u r s e s / P r o b c o u r s e / C l u s t e r i n g/RCode )
#l i b r a r y ( l a t t i c e )
# work w i t h i r i s d a t a s e t t h i s i s famous , and i n c l u d e d i n R
# t h e r e ar e t h r e e s p e c i e s
head ( i r i s )
#
library ( c l u s t e r )
n u m i r i s= i r i s [ , c ( 1 , 2 , 3 , 4 ) ] #t h e n u m e r i c al v a l u e s
#s c a l e d i r i s<s c a l e ( n u m i r i s ) # s c a l e t o u n i t v a r i a n c e
n c l u s<2
s f i t<kmeans ( n u m i r i s , n c l u s )
c o l r<c ( r e d , g r e e n , b l u e , y e l l o w , o r a n g e )
c l u s p l o t ( n u m i r i s , s f i t $ c l u s t e r , c o l o r=TRUE, shade=TRUE,
l a b e l s =0 , l i n e s =0)
looking for the k that gives the smallest value of the cost function is not helpful,
because that k is always the same as the number of data points (and the value is
then zero). However, it can be very helpful to plot the value as a function of k, then
look at the knee of the curve. Figure 4.11 shows this plot for the iris data. Notice
that k = 3 the true answer doesnt look particularly special, but k = 2,
k = 3, or k = 4 all seem like reasonable choices. It is possible to come up with
a procedure that makes a more precise recommendation by penalizing clusterings
that use a large k, because they may represent inefficient encodings of the data.
However, this is often not worth the bother.
In some special cases (like the iris example), we might know the right answer
to check our clustering against. In such cases, one can evaluate the clustering by
looking at the number of different labels in a cluster (sometimes called the purity),
and the number of clusters. A good solution will have few clusters, all of which
have high purity.
Mostly, we dont have a right answer to check against. An alternative strategy,
which might seem crude to you, for choosing k is extremely important in practice.
Usually, one clusters data to use the clusters in an application (one of the most
Section 4.2
1.5
110
CLUSPLOT( scalediris )
0.0 0.5 1.0 1.5
1.0
0.5
0.0
Petal
Width
0.0
0.5
0.5
1.0
1.5
1.5
1.5
0.0 0.5 1.0 1.5
0.5
2
1
Sepal
Width
0.0
0.5
1.0
1.5
1.5
3
Petal
Length
Component 2
0.0
0.5
1.0
0
1
2
2 1 0
Sepal
Length
0
1
0
2
Component 1
Scatter Plot Matrix
FIGURE 4.8: On the left, a panel plot of the iris data clustered using k-means with
k = 2. By comparison with figure 4.11, notice how the versicolor and verginica
clusters appear to have been merged. On the right, this data set projected onto the
first two principal components, with one blob drawn over each cluster.
important, vector quantization, is described in section 4.3). There are usually

natural ways to evaluate this application. For example, vector quantization is often
used as an early step in texture recognition or in image matching; here one can
evaluate the error rate of the recognizer, or the accuracy of the image matcher.
One then chooses the k that gets the best evaluation score on validation data. In
this view, the issue is not how good the clustering is; its how well the system that
uses the clustering works.
4.2.2 Soft Assignment
One difficulty with k-means is that each point must belong to exactly one cluster.
But, given we dont know how many clusters there are, this seems wrong. If a point
is close to more than one cluster, why should it be forced to choose? This reasoning
suggests we assign points to cluster centers with weights.
We allow each point to carry a total weight of 1. In the conventional k-means
algorithm, it must choose a single cluster, and assign its weight to that cluster
alone. In soft-assignment k-means, the point can allocate some weight to each
cluster center, as long as (a) the weights are all non-negative and (b) the weights
sum to one. Write wi,j for the weight connecting point i to cluster center j. We
interpret these weights as the degree
P to which the point participates in a particular
cluster. We require wi,j 0 and j wi,j = 1.
We would like wi,j to be large when xi is close to cj , and small otherwise.
Write di,j for the distance || xi cj || between these two. Write
si,j = e
d2
i,j
22
Section 4.2
1.5
111
0.0 0.5 1.0 1.5
1.0
0.5
0.0
Petal
Width
0.0
0.5
0.5
1.0
1.5
1.5
1.5
0.0 0.5 1.0 1.5
0.5
0.0
0.5
1.0
1.5
2
1
Sepal
Width
1.5
3
Petal
Length
Component 2
0.0
0.5
1.0
1
2
2 1 0
Sepal
Length
0
1
0
2
Component 1
Scatter Plot Matrix
k = 3. By comparison with figure 4.11, notice how the clusters appear to follow
the species labels. On the right, this data set projected onto the first two principal
components, with one blob drawn over each cluster.
where > 0 is a choice of scaling parameter. This is often called the affinity
between the point i and the center j. Now a natural choice of weights is
si,j
wi,j = Pk
l=1 si,l
All these weights are non-negative, they sum to one, and the weight is large if the
point is much closer to one center than to any other. The scaling parameter sets
the meaning of much closer we measure distance in units of .
Once we have weights, re-estimating the cluster centers is easy. We use a
weights to compute a weighted average of the points. In particular, we re-estimate
the jth cluster center by
P
w x
Pi i,j i .
i wi,j
Notice that k-means is a special case of this algorithm where limits to zero. In
this case, each point has a weight of one for some cluster, and zero for all others,
and the weighted mean becomes an ordinary mean. I have collected the description
into Algorithm 4.4 for convenience.
4.2.3 General Comments on K-Means
If you experiment with k-means, you will notice one irritating habit of the algorithm.
It almost always produces either some rather spread out clusters, or some single
element clusters. Most clusters are usually rather tight and blobby clusters, but
there is usually one or more bad cluster. This is fairly easily explained. Because
every data point must belong to some cluster, data points that are far from all
Section 4.2
1.5
112
0.0 0.5 1.0 1.5
1.0
0.5
0.0
Petal
Width
1
0.0
0.5
0.5
1.0
1.5
1.5
1.5
0.0 0.5 1.0 1.5
1.0
0.5
2
1
Sepal
Width
1
0.0
0.5
1.0
1.5
4
0
1.5
5
Petal
Length
Component 2
0.0
0.5
Sepal
Length
2 1 0
0
0
1
0
2
Scatter Plot Matrix
Component 1
k = 5. By comparison with figure 4.11, notice how setosa seems to have been broken
in two groups, and versicolor and verginica into a total of three . On the right,
this data set projected onto the first two principal components, with one blob drawn
over each cluster.
others (a) belong to some cluster and (b) very likely drag the cluster center into
a poor location. This applies even if you use soft assignment, because every point
must have total weight one. If the point is far from all others, then it will be
assigned to the closest with a weight very close to one, and so may drag it into a
poor location, or it will be in a cluster on its own.
There are ways to deal with this. If k is very big, the problem is often not
significant, because then you simply have many single element clusters that you
can ignore. It isnt always a good idea to have too large a k, because then some
larger clusters might break up. An alternative is to have a junk cluster. Any point
that is too far from the closest true cluster center is assigned to the junk cluster,
and the center of the junk cluster is not estimated. Notice that points should not
be assigned to the junk cluster permanently; they should be able to move in and
out of the junk cluster as the cluster centers move.
In some cases, we want to cluster objects that cant be averaged. For example,
you can compute distances between two trees but you cant meaningfully average
them. In some cases, you might have a table of distances between objects, but
not know vectors representing the objects. For example, one could collect data on
the similarities between countries (as in Section 3.6.2, particularly Figure 3.30),
then try and cluster using this data (similarities can be turned into distances by,
for example, taking the negative logarithm). A variant of k-means, known as kmedoids, applies to this case.
In k-medoids, the cluster centers are data items rather than averages, but the
rest of the algorithm has a familiar form. We assume the number of medoids is
known, and initialize these randomly. We then iterate two procedures. In the first,
Section 4.3
virginica
2.5
1.5 2.0 2.5
Petal
Width
0.0 0.5 1.0
Petal
Length
4.5
3.5 4.0 4.5
3.5
Sepal
Width
3.0
4
3
2
1
400
Within groups sum of squares
4.0
0.5
0.0
1.0
500
1.5
600
2.0
300
versicolor
113
200
setosa
Describing Repetition with Vector Quantization
2.5
2.0 2.5 3.0
2.0
100
8
8
Sepal
Length
6
5
Scatter Plot Matrix
10
12
14
Number of Clusters
FIGURE 4.11: On the left, the scatterplot matrix for the Iris data, for reference. On
the right, a plot of the value of the cost function for each of several different values
of k. Notice how there is a sharp drop in cost going from k = 1 to k = 2, and again
at k = 4; after that, the cost falls off slowly. This suggests using k = 2, k = 3, or
k = 4, depending on the precise application.
we allocate data points to medoids. In the second, we choose the best medoid for
each cluster by finding the medoid that minimizes the sum of distances of points in
the cluster to that medoid (blank search is fine).
4.3 DESCRIBING REPETITION WITH VECTOR QUANTIZATION
Repetition is an important feature of many interesting signals. For example, images contain textures, which are orderly patterns that look like large numbers of
small structures that are repeated. Examples include the spots of animals such as
leopards or cheetahs; the stripes of animals such as tigers or zebras; the patterns on
bark, wood, and skin. Similarly, speech signals contain phonemes characteristic,
stylised sounds that people assemble together to produce speech (for example, the
ka sound followed by the tuh sound leading to cat). Another example comes
from accelerometers. If a subject wears an accelerometer while moving around, the
signals record the accelerations during their movements. So, for example, brushing
ones teeth involves a lot of repeated twisting movements at the wrist, and walking
involves swinging the hand back and forth.
Repetition occurs in subtle forms. The essence is that a small number of
local patterns can be used to represent a large number of examples. You see this
effect in pictures of scenes. If you collect many pictures of, say, a beach scene, you
will expect most to contain some waves, some sky, and some sand. The individual
patches of wave, sky or sand can be surprisingly similar, and different images are
made by selecting some patches from a vocabulary of patches, then placing them
down to form an image. Similarly, pictures of living rooms contain chair patches,
TV patches, and carpet patches. Many different living rooms can be made from
Section 4.3
114
Choose k data points to act as cluster centers

Until the cluster centers change very little
First, we estimate the weights
For i indexing data points
For j indexing cluster centers
Compute si,j = e
||xi cj||
22
For i indexing data points

For j indexing cluster centers
P
Compute wi,j = si,j / l=1 ksi,l
Now, we re-estimate the centers
For j indexing cluster
P centers
wi,j xi
Compute cj = Pi w
i
end
i,j
Algorithm 4.4: Soft Clustering by K-Means.

small vocabularies of patches; but you wont often see wave patches in living rooms,
or carpet patches in beach scenes.
An important part of representing signals that repeat is building a vocabulary
of patterns that repeat, then describing the signal in terms of those patterns. For
many problems, problems, knowing what vocabulary elements appear and how
often is much more important than knowing where they appear. For example,
if you want to tell the difference between zebras and leopards, you need to know
whether stripes or spots are more common, but you dont particularly need to know
where they appear. As another example, if you want to tell the difference between
brushing teeth and walking using accelerometer signals, knowing that there are lots
of (or few) twisting movements is important, but knowing how the movements are
linked together in time may not be.
4.3.1 Vector Quantization
It is natural to try and find patterns by looking for small pieces of signal of fixed
size that appear often. In an image, a piece of signal might be a 10x10 patch; in a
sound file, which is likely represented as a vector, it might be a subvector of fixed
size. But finding patterns that appear often is hard to do, because the signal is
continuous each pattern will be slightly different, so we cannot simply count how
many times a particular pattern occurs.
Here is a strategy. First, we take a training set of signals, and cut each signal
into vectors of fixed dimension (say d). It doesnt seem to matter too much if these
Section 4.3
115
vectors overlap or not. We then build a set of clusters out of these vectors; this set
of clusters is often thought of as a dictionary. We can now now describe any new
vector with the cluster center closest to that vector. This means that a vector in
a continuous space is described with a number in the range [1, . . . , k] (where you
get to choose k), and two vectors that are close should be described by the same
number. This strategy is known as vector quantization.
We can now build features that represent important repeated structure in signals. We take a signal, and cut it up into vectors of length d. These might overlap,
or be disjoint; we follow whatever strategy we used in building the dictionary. We
then take each vector, and compute the number that describes it (i.e. the number of
the closest cluster center, as above). We then compute a histogram of the numbers
we obtained for all the vectors in the signal. This histogram describes the signal.
Notice several nice features to this construction. First, it can be applied to
anything that can be thought of in terms of vectors, so it will work for speech
signals, sound signals, accelerometer signals, images, and so on. You might need to
adjust some indices. For example, you cut the image into patches, then rearrange
the patch to form a vector. As another example, accelerometer signals are three
dimensional vectors that depend on time, so you cut out windows of a fixed number
of time samples (say t), then rearrange to get a 3t dimensional vector.
Another nice feature is the construction can accept signals of different length,
and produce a description of fixed length. One accelerometer signal might cover 100
time intervals; another might cover 200; but the description is always a histogram
with k buckets, so its always a vector of length k.
Yet another nice feature is that we dont need to be all that careful how we
cut the signal into fixed length vectors. This is because it is hard to hide repetition.
This point is easier to make with a figure than in text, so look at figure ??.
4.3.2 Example: Groceries in Portugal
At http://archive.ics.uci.edu/ml/datasets/Wholesale+customers, you will find a dataset
giving sums of money spent annually on different commodities by customers in Portugal. The commodities are divided into a set of categories (fresh; milk; grocery;
frozen; detergents and paper; and delicatessen) relevant for the study. These customers are divided by channel (two channels) and by region (three regions). You
can think of the data as being divided into six groups (one for each channel-region
pair). There are 440 records, and so there are many customers per group. Figure 4.12 shows a panel plot of the customer data; the data has been clustered, and
I gave each of 20 clusters its own marker. Relatively little structure is apparent in
this scatter plot. You cant, for example, see evidence of six groups that are cleanly
separated.
Its unlikely that all the customers in a group are the same. Instead, we
expect that there might be different types of customer. For example, customers
who prepare food at home might spend more money on fresh or on grocery, and
those who mainly buy prepared food might spend more money on delicatessan;
similarly, coffee drinkers with cats or children might spend more on milk than the
lactose-intolerant, and so on. Because some of these effects are driven by things
like wealth and the tendency of people to like to have neighbors who are similar to
Section 4.3
116
Delicatessen
DetPaper
Frozen
Grocery
Milk
Fresh
Scatter Plot Matrix
FIGURE 4.12: A panel plot of the wholesale customer data of http:// archive.ics.uci.
edu/ ml/ datasets/ Wholesale+customers, which records sums of money spent annually on different commodities by customers in Portugal. This data is recorded for six
different groups (two channels each within three regions). I have plotted each group
with a different marker, but you cant really see much structure here, for reasons
explained in the text.
them, you could expect that different groups contain different numbers of each type
of customer. There might be more deli-spenders in wealthier regions; more milkspenders and detergent-spenders in regions where it is customary to have many
children; and so on.
An effect like this is hard to see on a panel plot (Figure 4.12). The plot
for this dataset is hard to read, because the dimension is fairly high for a panel
plot and the data is squashed together in the bottom left corner. There is another
effect. If customers are grouped in the way I suggested above, then each group
might look the same in a panel plot. A group of some milk-spenders and more
detergent-spenders will have many data points with high milk expenditure values
(and low other values) and also many data points with high detergent expenditure
values (and low other values). In a panel plot, this will look like two blobs; but
another group with more milk-spenders and some detergent-spenders will also look
like two blobs, in about the same place. It will be hard to spot the difference. A
histogram of the types within each group will make this difference obvious.
I used k-means clustering to cluster the customer data to 20 different clusters
(Figure 4.14). I chose 20 rather arbitrarily, but with the plot of error against k
in mind. Then I described the each group of data by the histogram of customer
types that appeared in that group (Figure ??). Notice how the distinction between
the groups is now apparent the groups do appear to contain quite different
distributions of customer type. It looks as though the channels (rows in this figure)
are more different than the regions (columns in this figure). To be more confident
in this analysis, we would need to be sure that different types of customer really
Section 4.3
117
1.5e+11
Delicatessen
1.0e+11
Frozen
Grocery
5.0e+10
Within groups sum of squares
DetPaper
Milk
Fresh
0
10
15
20
25
30
35
Number of Clusters
Scatter Plot Matrix
FIGURE 4.13: On the left, the sum of squared error for clusterings of the customer
data with k-means for k running from 2 to 35. This suggests using a k somewhere in
the range 10-30; I chose 20. On the right, I have clustered this data to 20 cluster
centers with k-means. The clusters do seem to be squashed together, but the plot on
the left suggests that clusters do capture some important information. Using too few
clusters will clearly lead to problems. Notice that I did not scale the data, because
each of the measurements is in a comparable unit. For example, it wouldnt make
sense to scale expenditures on fresh and expenditures on grocery with a different
scale.
are different. We could do this by repeating the analysis for fewer clusters, or by
looking at the similarity of customer types.
4.3.3 Efficient Clustering and Hierarchical K Means
One important difficulty occurs in applications. We might need to have an enormous
dataset (millions of image patches are a real possibility), and so a very large k. In
this case, k means clustering becomes difficult because identifying which cluster
center is closest to a particular data point scales linearly with k (and we have to
do this for every data point at every iteration). There are two useful strategies for
dealing with this problem.
The first is to notice that, if we can be reasonably confident that each cluster
contains many data points, some of the data is redundant. We could randomly
subsample the data, cluster that, then keep the cluster centers. This works, but
doesnt scale particularly well.
A more effective strategy is to build a hierarchy of k-means clusters. We
randomly subsample the data (typically, quite aggressively), then cluster this with
a small value of k. Each data item is then allocated to the closest cluster center, and
the data in each cluster is clustered again with k-means. We now have something
that looks like a two-level tree of clusters. Of course, this process can be repeated to
produce a multi-level tree of clusters. It is easy to use this tree to vector quantize a
Section 4.3
0.2
0.1
0.0
0.3
count/sum(count)
0.3
count/sum(count)
count/sum(count)
0.3
0.2
0.1
0.0
1
10
15
20
10
15
20
0.1
0.0
15
20
10
15
20
0.3
0.2
0.1
0.0
10
case21
case13
count/sum(count)
count/sum(count)
count/sum(count)
0.2
0.1
case12
0.3
0.2
0.0
1
case11
0.3
118
0.2
0.1
0.0
1
10
15
case22
20
10
15
20
case23
FIGURE 4.14: The histogram of different types of customer, by group, for the cus-
tomer data. Notice how the distinction between the groups is now apparent the
groups do appear to contain quite different distributions of customer type. It looks
as though the channels (rows in this figure) are more different than the regions
(columns in this figure).
query data item. We vector quantize at the first level. Doing so chooses a branch of
the tree, and we pass the data item to this branch. It is either a leaf, in which case
we report the number of the leaf, or it is a set of clusters, in which case we vector
quantize, and pass the data item down. This procedure is efficient both when one
clusters and at run time.
4.3.4 Example: Activity from Accelerometer Data
A complex example dataset appears at https://archive.ics.uci.edu/ml/datasets/Dataset+for+ADL+Recognition+w

This dataset consists of examples of the signal from a wrist mounted accelerometer,
produced as different subjects engaged in different activities of daily life. Activities
include: brushing teeth, climbing stairs, combing hair, descending stairs, and so on.
Each is performed by sixteen volunteers. The accelerometer samples the data at
32Hz (i.e. this data samples and reports the acceleration 32 times per second). The
accelerations are in the x, y and z-directions. Figure 4.15 shows the x-component
of various examples of toothbrushing.
There is an important problem with using data like this. Different subjects
take quite different amounts of time to perform these activities. For example, some
subjects might be more thorough tooth-brushers than other subjects. As another
example, people with longer legs walk at somewhat different frequencies than people
with shorter legs. This means that the same activity performed by different subjects
will produce data vectors that are of different lengths. Its not a good idea to deal
Section 4.3
Eat meat example 2
30
1000
2000 3000
Time
4000
5000
35
30
25
0
Brushing teeth example 1
25
1000
2000 3000
Time
4000
5000
20
0
20
500
1000 1500
Time
2000
2500
40
20
0
0
200
400
600
Time
800
2000
4000
Time
6000
40
35
30
25
0
8000
X Acceleration
X Acceleration
X Acceleration
30
60
40
0
0
35
60
45
40
X Acceleration
35
40
1000
80
60
60
40
20
500
1000 1500
Time
2000
2000
Time
4000
6000
80
0
0
119
Eat meat example 4
45
X Acceleration
40
25
0
Eat meat example 3
45
45
X Acceleration
X Acceleration
50
X Acceleration
Eat meat example 1
2500
40
20
0
0
500
1000 1500
Time
2000
2500
4.15:
Some
examples
from
the
accelerometer
dataset
at
https:// archive.ics.uci.edu/ ml/ datasets/
Dataset+for+ADL+Recognition+with+Wrist-worn+Accelerometer.
I have labelled each signal by the activity. These show acceleration in the X direction (Y
and Z are in the dataset, too). There are four examples for brushing teeth and
four for eat meat. You should notice that the examples dont have the same length
in time (some are slower and some faster eaters, etc.), but that there seem to be
characteristic features that are shared within a category (brushing teeth seems to
involve faster movements than eating meet).
FIGURE
with this by warping time and resampling the signal. For example, doing so will
make a thorough toothbrusher look as though they are moving their hands very
fast (or a careless toothbrusher look ludicrously slow: think speeding up or slowing
down a movie). So we need a representation that can cope with signals that are a
bit longer or shorter than other signals.
Another important property of these signals is that all examples of a particular
activity should contain repeated patterns. For example, brushing teeth should show
fast accelerations up and down; walking should show a strong signal at somewhere
around 2 Hz; and so on. These two points should suggest vector quantization to
you. Representing the signal in terms of stylized, repeated structures is probably a
good idea because the signals probably contain these structures. And if we represent
the signal in terms of the relative frequency with which these structures occur, the
representation will have a fixed length, even if the signal doesnt. To do so, we need
to consider (a) over what time scale we will see these repeated structures and (b)
how to ensure we segment the signal into pieces so that we see these structures.
Generally, repetition in activity signals is so obvious that we dont need to be
smart about segment boundaries. I broke these signals into 32 sample segments,
one following the other. Each segment represents one second of activity. This
is long enough for the body to do something interesting, but not so long that our
representation will suffer if we put the segment boundaries in the wrong place. This
resulted in about 40, 000 segments. I then used hierarchical k-means to cluster these
segments. I used two levels, with 40 cluster centers at the first level, and 12 at the
second. Figure 4.16 shows some cluster centers at the second level.
Section 4.3
120
Accelerometer cluster centers
X Acceleration
60
40
20
0
0
10
20
Time
30
40
FIGURE 4.16: Some cluster centers from the accelerometer dataset. Each cluster
center represents a one-second burst of activity. There are a total of 480 in my

model, which I built using hierarchical k-means. Notice there are a couple of centers that appear to represent movement at about 5Hz; another few that represent
movement at about 2Hz; some that look like 0.5Hz movement; and some that seem
to represent much lower frequency movement. These cluster centers are samples
(rather than chosen to have this property).
Brush teeth
Brush teeth
Brush teeth
Brush teeth
0.2
0.2
0.2
0.2
0.15
0.15
0.15
0.15
0.1
0.1
0.1
0.1
0.05
0.05
0.05
0.05
100
200
300
Comb hair
400
100
200
300
Comb hair
400
100
200
300
Comb hair
400
0.2
0.2
0.2
0.2
0.15
0.15
0.15
0.15
0.1
0.1
0.1
0.1
0.05
0.05
0.05
0.05
100
200
300
Climb stairs
400
100
200
300
Climb stairs
400
100
200
300
Climb stairs
400
0.2
0.2
0.2
0.2
0.15
0.15
0.15
0.15
0.1
0.1
0.1
0.1
0.05
0.05
0.05
0.05
100
200
300
400
100
200
300
400
100
200
300
400
100
200
300
Comb hair
400
100
200
300
Climb stairs
400
100
200
300
400
FIGURE 4.17: Histograms of cluster centers for the accelerometer dataset, for different activities. You should notice that (a) these histograms look somewhat similar for
different actors performing the same activity and (b) these histograms look somewhat different for different activities.
Section 4.3
121
I then computed histogram representations for different example signals (Figure 4.17). You should notice that when the activity label is different, the histogram
looks different, too.
Another useful way to check this representation is to compare the average
within class chi-squared distance with the average between class chi-squared distance. I computed the histogram for each example. Then, for each pair of examples,
I computed the chi-squared distance between the pair. Finally, for each pair of activity labels, I computed the average distance between pairs of examples where one
example has one of the activity labels and the other example has the other activity
label. In the ideal case, all the examples with the same label would be very close
to one another, and all examples with different labels would be rather different.
Table 4.1 shows what happens with the real data. You should notice that for some
pairs of activity label, the mean distance between examples is smaller than one
would hope for (perhaps some pairs of examples are quite close?). But generally,
examples of activities with different labels tend to be further apart than examples
of activities with the same label.
0.9
2.0
1.6
1.9
2.0
1.5
2.0
1.8
2.0
1.4
2.0
2.0
1.9
2.0
1.5
2.0
2.0
1.9
2.0
1.8
0.9
1.9
2.0
1.9
2.0
1.7
1.7
0.3
2.0
1.9
1.9
2.0
1.9
1.9
1.9
1.8
1.9
1.9
1.9
2.0
1.9
1.9
1.9
1.8
1.7
1.9
2.0
1.9
2.0
1.8
1.8
1.5
1.9
1.9
1.6
2.0
1.9
1.9
2.0
1.9
1.9
1.9
1.9
1.9
1.9
1.8
2.0
1.9
1.9
2.0
1.9
1.9
1.9
1.9
1.9
1.9
1.9
1.8
2.0
2.0
1.9
2.0
1.8
1.9
1.9
1.9
1.9
1.9
1.9
2.0
1.5
2.0
1.7
2.0
1.8
2.0
2.0
2.0
1.9
1.9
2.0
1.9
1.9
2.0
1.5
4.1:
Each column of the table represents an activity
for
the
activity
dataset
https:// archive.ics.uci.edu/ ml/ datasets/
Dataset+for+ADL+Recognition+with+Wrist-worn+Accelerometer, as does each
row. In each of the upper diagonal cells, I have placed the average chi-squared
distance between histograms of examples from that pair of classes (I dropped the
lower diagonal for clarity). Notice that in general the diagonal terms (average
within class distance) are rather smaller than the off diagonal terms. This quite
strongly suggests we can use these histograms to classify examples successfully.
TABLE
Yet another way to check the representation is to try classification with nearest
neighbors, using the chi-squared distance to compute distances. I split the dataset
into 80 test pairs and 360 training pairs; using 1-nearest neighbors, I was able to
get a held-out error rate of 0.79. This suggests that the representation is fairly
good at exposing what is important.
Section 4.4
You should
122
4.4 YOU SHOULD

4.4.1 remember:
New
New
New
New
New
term:
term:
term:
term:
term:
clustering . . . . . .
decorrelation . . . .
whitening . . . . . .
k-means . . . . . . .
vector quantization
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. 93
. 95
. 95
. 101
. 109
Section 4.4
You should
123
4.1. Obtain the activities of daily life dataset from the UC Irvine machine learning
website (https://archive.ics.uci.edu/ml/datasets/Dataset+for+ADL+Recognition+with+Wrist-worn+Accelerom
data provided by Barbara Bruno, Fulvio Mastrogiovanni and Antonio Sgorbissa).
(a) Build a classifier that classifies sequences into one of the 14 activities provided. To make features, you should vector quantize, then use a histogram
of cluster centers (as described in the subsection; this gives a pretty explicit set of steps to follow). You will find it helpful to use hierarchical
k-means to vector quantize. You may use whatever multi-class classifier
you wish, though Id start with Rs decision forest, because its easy to
use and effective. You should report (a) the total error rate and (b) the
class confusion matrix of your classifier.
(b) Now see if you can improve your classifier by (a) modifying the number
of cluster centers in your hierarchical k-means and (b) modifying the size
of the fixed length samples that you use.
C H A P T E R
Clustering using Probability Models

5.1 THE MULTIVARIATE NORMAL DISTRIBUTION
All the nasty facts about high dimensional data, above, suggest that we need to use
quite simple probability models. By far the most important model is the multivariate normal distribution, which is quite often known as the multivariate gaussian
distribution. There are two sets of parameters in this model, the mean and the
covariance . For a d-dimensional model, the mean is a d-dimensional column
vector and the covariance is a d d dimensional matrix. The covariance is a symmetric matrix. For our definitions to be meaningful, the covariance matrix must be
positive definite.
The form of the distribution p(x|, ) is

1
1
T 1
p(x|, ) = p
exp (x ) (x ) ,
2
(2)d det()
where is a positive definite matrix. Notice that if is not positive definite, then
we cannot have a probability distribution,
because there are some directions d such

that exp 21 (td )T 1 (td ) does not fall off to zero as t limits to infinity.
In turn, this means we cant compute the integral, and so cant normalize.
The following facts explain the names of the parameters:
Useful Facts: 5.1 Parameters of a Multivariate Normal Distribution

Assuming a multivariate normal distribution, we have
E[x] = , meaning that the mean of the distribution is .

E (x )(x )T = , meaning that the entries in represent
covariances.
Assume I know have a dataset of items xi , where i runs from 1 to N , and we

wish to model this data with a multivariate normal distribution. The maximum
likelihood estimate of the mean,
, is
P
xi
= i
N
124
Section 5.1
The Multivariate Normal Distribution
125
(which is quite easy to show). The maximum likelihood estimate of the covariance,
is
,
P
)(xi
)T
= i (xi
N
(which is rather a nuisance to show, because you need to know how to differentiate
a determinant). You should be aware that this estimate is not guaranteed to be
positive definite, even though the covariance matrix of a gaussian must be positive
definite. We deal with this problem by checking the estimate. If its smallest
eigenvalue is too close to zero, then we add some small positive constant times the
identity to get a positive definite matrix.
5.1.1 Affine Transformations and Gaussians
Gaussians behave very well under affine transformations. In fact, weve already
worked out all the math. Assume I have a dataset {x}. The mean of the maximum
likelihood gaussian model is mean ({x}), and the covariance is Covmat ({x}). We
assume that this is positive definite, or adjust it as above.
I can now transform the data with an affine transformation, to get yi =
Axi + b. We assume that A is a square matrix with full rank, so that this transformation is 1-1. The mean of the maximum likelihood gaussian model for the
transformed dataset is mean ({y}) = Amean ({x}) + b. Similarly, the covariance is
Covmat ({y}) = ACovmat ({x})AT .
A very important point follows in an obvious way. I can apply an affine
transformation to any multivariate gaussian to obtain one with (a) zero mean and
(b) independent components. In turn, this means that, in the right coordinate
system, any gaussian is a product of zero mean, unit standard deviation, onedimensional normal distributions. This fact is quite useful. For example, it means
that simulating multivariate normal distributions is quite straightforward you
could simulate a standard normal distribution for each component, then apply an
affine transformation.
5.1.2 Plotting a 2D Gaussian: Covariance Ellipses
There are some useful tricks for plotting a 2D Gaussian, which are worth knowing
both because theyre useful, and they help to understand Gaussians. Assume we
are working in 2D; we have a Gaussian with mean (which is a 2D vector), and
covariance (which is a 2x2 matrix). We could plot the collection of points x that
has some fixed value of p(x|, ). This set of points is given by:

1
(x )T 1 (x ) = c2
2
where c is some constant. I will choose c2 = 12 , because the choice doesnt matter,
and this choice simplifies some algebra. You might recall that a set of points x that
satisfies a quadratic like this is a conic section. Because (and so 1 ) is positive
definite, the curve is an ellipse. There is a useful relationship between the geometry
of this ellipse and the Gaussian.
This ellipse like all ellipses has a major axis and a minor axis. These
are at right angles, and meet at the center of the ellipse. We can determine the
Section 5.2
Mixture Models and Clustering
126
properties of the ellipse in terms of the Gaussian quite easily. The geometry of the
ellipse isnt affected by rotation or translation, so we will translate the ellipse so
that = 0 (i.e. the mean is at the origin) and rotate it so that 1 is diagonal.
Writing x = [x, y] we get that the set of points on the ellipse satisfies
1
1
1 1 2
( x + 2 y2) =
2 k12
k2
2
where k12 and k12 are the diagonal elements of 1 . We will assume that the ellipse
1
2
has been rotated so that k1 < k2 . The points (k1 , 0) and (k1 , 0) lie on the ellipse,
as do the points (0, k2 ) and (0, k2 ). The major axis of the ellipse, in this coordinate
system, is the x-axis, and the minor axis is the y-axis. In this coordinate system,
x and y are independent. If you do a little algebra, you will see that the standard
deviation of x is abs (k1 ) and the standard deviation of y is abs (k2 ). So the ellipse
is longer in the direction of largest standard deviation and shorter in the direction
of smallest standard deviation.
Now rotating the ellipse is means we will pre- and post-multiply the covariance
matrix with some rotation matrix. Translating it will move the origin to the mean.
As a result, the ellipse has its center at the mean, its major axis is in the direction
of the eigenvector of the covariance with largest eigenvalue, and its minor axis is
in the direction of the eigenvector with smallest eigenvalue. A plot of this ellipse,
which can be coaxed out of most programming environments with relatively little
effort, gives us a great deal of information about the underlying Gaussian. These
ellipses are known as covariance ellipses.
5.2 MIXTURE MODELS AND CLUSTERING
It is natural to think of clustering in the following way. The data was created by
a collection of distinct models (one per cluster). For each data item, something
(nature?) chose which model was to produce a point, and then the model produced
a point. We see the results: crucially, wed like to know what the models were,
but we dont know which model produced which point. If we knew the models, it
would be easy to decide which model produced which point. Similarly, if we knew
which point went to which model, we could determine what the models were.
One encounters this situation or problems that can be mapped to this situation again and again. It is very deeply embedded in clustering problems. It is
pretty clear that a natural algorithm is to iterate between estimating which model
gets which point, and the model parameters. We have seen this approach before,
in the case of k-means.
A particularly interesting case occurs when the models are probabilistic. There
is a standard, and very important, algorithm for estimation here, called EM (or
expectation maximization, if you want the long version). I will develop this
algorithm in two simple cases, and we will see it in a more general form later.
Notation: This topic lends itself to a glorious festival of indices, limits of
sums and products, etc. I will do one example in quite gory detail; the other
follows the same form, and for that well proceed more expeditiously. Writing the
limits of sums or products explicitlyP
is usually
Q even more confusing than adopting
a compact notation. When I write i or i , I mean a sum (or product) over all
Section 5.2
127
Q
P
values of i. When I write i,j or i,j , I mean a sum (or product) over all values
of i except for the jth item. I will write vectors, as usual, as x; the ith such vector
in a collection is xi , and the kth component of the ith vector in a collection is xik .
In what follows, I will construct a vector i corresponding to the ith data item xi
(it will tell us what cluster that item belongs to). I will write to mean all the i
(one
When I write
P for each data item). The jth component of this vector is ij . P
,
I
mean
a
sum
over
all
values
that
can
take.
When
I
write
u
, I mean a
u
P
sum over all values that each can take. When I write ,v , I mean a sum over
all values that all can take, omitting all cases for the vth vector v .
5.2.1 A Finite Mixture of Blobs
A blob of data points is quite easily modelled with a single normal distribution.
Obtaining the parameters is straightforward (estimate the mean and covariance
matrix with the usual expressions). Now imagine I have t blobs of data, and I know
t. A normal distribution is likely a poor model, but I could think of the data as being
produced by t normal distributions. I will assume that each normal distribution has
a fixed, known covariance matrix , but the mean of each is unknown. Because the
covariance matrix is fixed, and known, we can compute a factorization = AAT .
The factors must have full rank, because the covariance matrix must be positive
definite. This means that we can apply A1 to all the data, so that each blob
covariance matrix (and so each normal distribution) is the identity.
Write j for the mean of the jth normal distribution. We can model a
distribution that consists of t distinct blobs by forming aPweighted sum of the
blobs, where the jth blob gets weight j . We ensure that j j = 1, so that we
can think of the overall model as a probability distribution. We can then model
the data as samples from the probability distribution
"
#

X
1
1
T
exp (x j ) (x j ) .
j p
p(x|1 , . . . , k , 1 , . . . , k ) =
2
(2)d
j
The way to think about this probability distribution is that a point is generated by
first choosing one of the normal distributions (the ith is chosen with probability
j ), then generating a point from that distribution. This is a pretty natural model
of clustered data. Each mean is the center of a blob. Blobs with many points in
them have a high value of j , and blobs with few points have a low value of j .
We must now use the data points to estimate the values of j and j (again, I am
assuming that the blobs and the normal distribution modelling each have the
identity as a covariance matrix). A distribution of this form is known as a mixture
of normal distributions.
Writing out the likelihood will reveal a problem: we have a product of many
sums. The usual trick of taking the log will not work, because then you have a sum
of logs of sums, which is hard to differentiate and hard to work with. A much more
productive approach is to think about a set of hidden variables which tell us which
blob each data item comes from. For the ith data item, we construct a vector i . I
will write to mean all the i (one for each data item). The jth component of this
vector is ij , where ij = 1 if xi comes from blob (equivalently, normal distribution)
j and zero otherwise. Notice there is exactly one 1 in i , because each data item
Section 5.2
128
comes from one blob. Assume we know the values of these terms. I will write
= (1 , . . . , k , 1 , . . . , k ) for the unknown parameters. Then we can write
p(xi |i , ) =
Y
u
"
#iu

1
T
p
exp (x u ) (x u )
2
(2)d
1
(because ij = 1 means that xi comes from blob j, so the terms in the product are
a collection of 1s and the probability we want). We also have
p(ij = 1|) = j
allowing us to write
p(i |) =
[u ]
iu
(because this is the probability that we select blob j to produce a data item; again,
the terms in the product are a collection of 1s and the probability we want). This
means that
("
# )iu

Y
1
1
p
p(xi , i |) =
exp (x u )T (x u ) u
2
(2)d
u
and we can write a log-likelihood. The data are the observed values of x and
(remember, we pretend we know these; Ill fix this in a moment), and the parameters
are the unknown values of 1 , . . . , k and 1 , . . . , k . We have
L(1 , . . . , k , 1 , . . . , k ; x, )
= L(; x, )

X 1
T
(xi j ) (xi j ) + log j ij + K
=
2
ij
where K is a constant that absorbs the normalizing constants for the normal distributions. You should check this expression gives the right answer. I have used
the ij as a switch for one term, ij = 1 and the term in curly brackets is on,
and for all others that term is multiplied by zero. The problem with all this is that
we dont know . I will deal with this when we have another example.
5.2.2 Topics and Topic Models
A real attraction of probabilistic clustering methods is that we can cluster data
where there isnt a clear distance function. One example occurs in document processing. For many kinds of document, we obtain a good representation by (a)
choosing a list of different words then (b) representing the document by a vector of
word counts, where we simply ignore every word outside the list. This is a viable
representation for many applications because quite often, most of the words people
actually use come from a relatively short list (typically 100s to 1000s, depending
on the particular application). The vector has one component for each word in the
list, and that component contains the number of times that particular word is used.
The problem is to cluster the documents.
Section 5.2
129
It isnt a particularly good idea to cluster on the distance between word vectors. This is because quite small changes in word use might lead to large differences
between count vectors. For example, some authors might write car when others
write auto. In turn, two documents might have a large (resp. small) count for
car and a small (resp. large) count for auto. Just looking at the counts would
significantly overstate the difference between the vectors. However, the counts are
informative: a document that uses the word car often, and the word lipstick
seldom, is likely quite different from a document that uses lipstick often and car
seldom.
We get a useful notion of the differences between documents by pretending
that the count vector for each document comes from one of a small set of underlying
topics. Each topic generates words as independent, identically distributed samples
from a multinomial distribution, with one probability per word in the vocabulary.
You should think of each topic as being like a cluster center. If two documents come
from the same topic, they should have similar word distributions. Topics are one
way to deal with changes in word use. For example, one topic might have quite
high probability of generating the word car and a high probability of generating
the word auto; another might have low probability of generating those words,
but a high probability of generating lipstick.
We cluster documents together if they come from the same topic. Imagine
we know which document comes from which topic. Then we could estimate the
word probabilities using the documents in each topic. Now imagine we know the
word probabilities for each topic. Then we could tell (at least in principle) which
topic a document comes from by looking at the probability each topic generates
the document, and choosing the topic with the highest probability. This should
strike you as being a circular argument. It has a form you should recognize from
k-means, though the details of the distance have changed.
To construct a probabilistic model, we will assume that a document is generated in two steps. We will have t topics. First, we choose a topic, choosing the
jth topic with probability j . Then we will obtain a set of words by repeatedly
drawing IID samples from that topic, and record the count of each word in a count
vector. Assume we have N vectors of word counts, and write xi for the ith such
vector. Each topic is a multinomial probability distribution. Write pj for the vector of word probabilities for the jth topic. We assume that words are generated
independently, conditioned on the topic. Write xik for the kth component of xi ,
and so on. Then the probability of observing the counts in xi when the document
was generated by topic j is
Y
p(xi |pj ) =
pxjuiu .
u
We can now write the probability of observing a document. Again, we write =

(p1 , . . . , pt , 1 , . . . , t ) for the vector of unknown parameters. We have
X
p(x|) =
p(x|topic is l)p(topic is l|)
l
X Y
l
pxluu
l .
Section 5.3
The EM Algorithm
130
This model is widely called a topic model; be aware that there are many kinds
of topic model, and this is a simple one. The expression should look unpromising,
in a familiar way. If you write out a likelihood, you will see a product of sums;
and if you write out a log-likelihood, you will see a sum of logs of sums. Neither
is enticing. We could use the same trick we used for a mixture of normals. Write
ij = 1 if xi comes from topic j, and ij = 0 otherwise. Then we have
"
#
Y
p(xi |ij = 1, ) =
pxjuiu
u
(because ij = 1 means that xi comes from topic j). This means we can write
p(xi |i , ) =
Y
l
("
pxluiu
#)il
(because ij = 1 means that xi comes from topic j, so the terms in the product are
a collection of 1s and the probability we want). We also have
p(ij = 1|) = j
(because this is the probability that we select topic j to produce a data item),
allowing us to write
Y
p(i |) =
[u ]iu
u
(again, the terms in the product are a collection of 1s and the probability we want).
This means that
#il
"
Y Y
xiu
p(xi , i |) =
(plu ) l
u
and we can write a log-likelihood. The data are the observed values of x and
(remember, we pretend we know these for the moment), and the parameters are
the unknown values collected in . We have
(
"
# )
X X X
L(; x, ) =
xiu log plu + log l il
i
Again, you should check this expression gives the right answer. Again, I have used
the ij as a switch for one term, ij = 1 and the term in curly brackets is on,
and for all others that term is multiplied by zero. The problem with all this, as
before, is that we dont know ij . But there is a recipe.
5.3 THE EM ALGORITHM
There is a straightforward, natural, and very powerful recipe. In essence, we will
average out the things we dont know. But this average will depend on our estimate
of the parameters, so we will average, then re-estimate parameters, then re-average,
and so on. If you lose track of whats going on here, think of the example of k-means
Section 5.3
The EM Algorithm
131
with soft weights (section 14.5; this is what the equations for the case of a mixture
of normals will boil down to). In this analogy, the tell us which cluster center a
data item came from. Because we dont know the values of the , we assume we
have a set of cluster centers; these allow us to make a soft estimate of the ; then
we use this estimate to re-estimate the centers; and so on.
This is an instance of a general recipe. Recall we wrote for a vector of
parameters. In the mixture of normals case, contained the means and the mixing
weights; in the topic model case, it contained the topic distributions and the mixing
weights. Assume we have an estimate of the value of this vector, say (n) . We could
then compute p(|(n) , x). In the mixture of normals case, this is a guide to which
example goes to which cluster. In the topic case, it is a guide to which example
goes to which topic.
We could use this to compute the expected value of the likelihood with respect
to . We compute
X
Q(; (n) ) =
L(; x, )p(|(n) , x)
(where the sum is over all values of ). Notice that Q(; (n) ) is a function of
(because L was), but now does not have any unknown terms in it. This Q(; (n) )
encodes what we know about .
For example, assume that p(|(n) , x) has a single, narrow peak in it, at (say)
0
= . In the mixture of normals case, this would mean that there is one allocation
of points to clusters that is significantly better than all others, given (n) . For this
example, Q(; (n) ) will be approximately L(; x, 0 ).
Now assume that p(|(n) , x) is about uniform. In the mixture of normals
case, this would mean that any particular allocation of points to clusters is about
as good as any other. For this example, Q(; (n) ) will average L over all possible
values with about the same weight for each.
We obtain the next estimate of by computing
(n+1) =
argmax
Q(; (n) )
and iterate this procedure until it converges (which it does, though I shall not prove
that). The algorithm I have described is extremely general and powerful, and is
known as expectation maximization or (more usually) EM. The step where
we compute Q(; (n) ) is called the E step; the step wehre we compute the new
estimate of is known as the M step.
5.3.1 Example: Mixture of Normals: The E-step
Now let us do the actual calculations for a mixture of normal distributions. The E
step requires a little work. We have
X
Q(; (n) ) =
L(; x, )p(|(n) , x)
If you look at this expression, it should strike you as deeply worrying. There are
a very large number of different possible values of . In this case, there are N t
Section 5.3
The EM Algorithm
132
cases (there is one i for each data item, and each of these can have a one in each
of t locations). It isnt obvious how we could compute this average.
But notice
p(, x|(n) )
p(|(n) , x) =
p(x|(n) )
and let us deal with numerator and denominator separately. For the numerator,
notice that the xi and the i are independent, identically distributed samples, so
that
Y
p(i , xi |(n) ).
p(, x|(n) ) =
i
The denominator is slightly more work. We have

X
p(x|(n) ) =
p(, x|(n) )
#
"
X Y
(n)
p(i , xi | )
=
i
Y X
p(i , xi |(n) ) .
=
i
You should check the last step; one natural thing to do is check with N = 2 and
t = 2. This means that we can write
p(|(n) , x)
=
=
p(, x|(n) )
p(x|(n) )
Q
(n)
)
i p(i , xi |
i
h
Q P
(n)
i
i p(i , xi | )
Y
i
Y
i
p( , x |(n) )
P i i
(n)
i p(i , xi | )
p(i |xi , (n) )
Now we need to look at the log-likelihood. We have

X 1
(xi j )T (xi j ) + log j ij + K.
L(; x, ) =
2
ij
The K term is of no interest it will result in a constant but we will try to
keep track of it. To simplify the equations we need to write, I will construct a t
dimensional vector ci for the ith data point. The jth component of this vector
will be

1
(xi j )T (xi j ) + log j
2
Section 5.3
so we can write
L(; x, ) =
The EM Algorithm
133
cTi i + K.
Now all this means that

X
Q(; (n) ) =
L(; x, )p(|(n) , x)
!
X X
T
ci i + K p(|(n) , x)
=
i
!
X X
Y
T
ci i + K
=
p(u |(n) , x)
u
i
!
Y
Y
X
T
(n)
T
(n)
c1 1
p(u | , x) + . . . cN N
=
p(u | , x) .
u
u
P
We can simplify further. We have that i p(i |xi , (n) ) = 1, because this is a
probability distribution. Notice that, for any index v,
!

Y
X
X
X Y
cTv v p(v |(n) , x)

cTv v
p(u |(n) , x)
=
p(u |(n) , x)
u
v
, v u,v

X
cTv v p(v |(n) , x)
=
v
So we can write
Q(; (n) ) = L(; x, )p(|(n) , x)
X X
cTi i p(i |(n) , x)

=
i
i

X 1
X
(xi j )T (xi j ) + log j wij + K

=
2
j
i
where
wij
1p(ij = 1|(n) , x) + 0p(ij = 0|(n) , x)
p(ij = 1|(n) , x).
Section 5.3
The EM Algorithm
134
Now
p(ij = 1|(n) , x)
=
=
=
=
p(x, ij = 1|(n) )
p(x|(n) )
p(x, ij = 1|(n) )
P
(n) )
l p(x, il = 1|
Q
p(xi , ij = 1|(n) ) u,i p(xu , u |)
Q
P
(n) )
l p(x, il = 1|
u,i p(xu , u |)
p(x , = 1|(n) )
P i ij
(n) )
l p(x, il = 1|
Q
If the last couple of steps puzzle you, remember we obtained p(x, |) = i p(xi , i |).
Also, look closely at the denominator; it expresses the fact that the data must have
come from somewhere. So the main question is to obtain p(x, ij = 1|(n) ). But
p(x, ij = 1|(n) )
Substituting yields
p(ij = 1|
= p(xi , ij |(n) )
= p(xi |ij = 1, (n) )p(ij = 1|(n) )

"
#

1
1
T
= = p
exp (xi j ) (xi j ) j .
2
(2)d
(n)

exp 21 (xi j )T (xi j ) j
.
, x) = P
1
T
k exp 2 (xi k ) (xi k ) k
5.3.2 Example: Mixture of Normals: The M-step
The M-step is more straightforward. Recall

X 1
(xi j )T (xi j ) + log j wij + K
Q(; (n) ) =
2
ij
and we have to maximise this with respect to and , and the terms wij are known.
This maximization is easy. We compute
P
xi wij
(n+1)
= Pi
j
i wij
and
i wij
.
N
You should check these expressions by differentiating and setting
P to zero. When you
do so, remember that, because is a probability distribution, j j = 1 (otherwise
youll get the wrong answer).
(n+1)
Section 5.3
The EM Algorithm
135
5.3.3 Example: Topic Model: The E-Step

We need to work out two steps. The E step requires a little calculation. We have
X
Q(; (n) ) =
L(; x, )p(|(n) , x)
X
ij
X
ij
("
("
xiu log pju + log j
xi,k log pj,k + log j
ij p(|(n) , x)
wij
Here the last two steps follow from the same considerations as in the mixture of
normals. The xi and i are IID samples, and so the expectation simplifies as in
that case. If youre uncertain, rewrite the steps of section 5.3.1. The form of this Q
function is the same as that (a sum of cTi i terms, but using a different expression
for ci ). In this case, as above,
wij
1p(ij = 1|(n) , x) + 0p(ij = 0|(n) , x)
p(ij = 1|(n) , x).
Again, we have
p(ij = 1|(n) , x) =
=
p(xi , ij = 1|(n) )
p(xi |(n) )
p(x , = 1|(n) )
P i ij
(n) )
l p(xi , il = 1|
and so the main question is to obtain p(xi , ij = 1|(n) ). But

p(xi , ij = 1|(n) ) =
=
p(xi |ij = 1, (n) )p(ij = 1|(n) )

"
#
Y
xk
=
pj,k j .
k
Substituting yields
hQ
i
k
pxj,k
j
i
p(ij = 1|(n) , x) = P hQ
xk
p
l
k l,k l
k
5.3.4 Example: Topic Model: The M-step
The M-step is more straightforward. Recall
)
("
#
X
X
Q(; (n) ) =
xi,k log pj,k + log j wij
ij
Section 5.3
The EM Algorithm
136
and we have to maximise this with respect to and , and the terms wij are known.
This maximization is easy. We compute
P
xi wij
(n+1)
= Pi
pj
i wij
and
i wij
.
N
You should check these expressions by differentiating and setting to zero.
(n+1)
j
Section 5.4
You should
137
5.4 YOU SHOULD

5.4.1 remember:
Useful facts: Parameters of a Multivariate Normal Distribution
New term: covariance ellipses . . . . . . . . . . . . . . . . . . .
New term: EM . . . . . . . . . . . . . . . . . . . . . . . . . . .
New term: expectation maximization . . . . . . . . . . . . . . .
New term: mixture of normal distributions . . . . . . . . . . .
New term: topic model . . . . . . . . . . . . . . . . . . . . . . .
New term: expectation maximization . . . . . . . . . . . . . . .
New term: EM . . . . . . . . . . . . . . . . . . . . . . . . . . .
New term: E step . . . . . . . . . . . . . . . . . . . . . . . . . .
New term: M step . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
118
120
120
120
121
124
125
125
125
125
Section 5.4
You should
138
5.1. Obtain the activities of daily life dataset from the UC Irvine machine learning
website (https://archive.ics.uci.edu/ml/datasets/Dataset+for+ADL+Recognition+with+Wrist-worn+Accelerom
data provided by Barbara Bruno, Fulvio Mastrogiovanni and Antonio Sgorbissa).
(a) Build a classifier that classifies sequences into one of the 14 activities provided. To make features, you should vector quantize, then use a histogram
of cluster centers (as described in the subsection; this gives a pretty explicit set of steps to follow). You will find it helpful to use hierarchical
k-means to vector quantize. You may use whatever multi-class classifier
you wish, though Id start with Rs decision forest, because its easy to
use and effective. You should report (a) the total error rate and (b) the
class confusion matrix of your classifier.
(b) Now see if you can improve your classifier by (a) modifying the number
of cluster centers in your hierarchical k-means and (b) modifying the size
of the fixed length samples that you use.
C H A P T E R
Regression
Classification tries to predict a class from a data item. Regression tries to
predict a value. For example, we know the zip code of a house, the square footage
of its lot, the number of rooms and the square footage of the house, and we wish to
predict its likely sale price. As another example, we know the cost and condition of
a trading card for sale, and we wish to predict a likely profit in buying it and then
reselling it. As yet another example, we have a picture with some missing pixels
perhaps there was text covering them, and we want to replace it and we want
to fill in the missing values. As a final example, you can think of classification as
a special case of regression, where we want to predict either +1 or 1; this isnt
usually the best way to proceed, however. Predicting values is very useful, and so
there are many examples like this.
6.1 OVERVIEW
Some formalities are helpful here. In the simplest case, we have a dataset consisting
of a set of N pairs (xi , yi ). We think of yi as the value of some function evaluated
at xi , but with some random component. This means there might be two data
items where the xi are the same, and the yi are different. We refer to the xi as
explanatory variables and the yi is a dependent variable. We want to use the
examples we have the training examples to build a model of the dependence
between y and x. This model will be used to predict values of y for new values of
x, which are usually called test examples.
We do not guarantee that different values of x produce different values of y.
Data just isnt like this (see the crickets example Figure 6.1). Traditionally, regression produces some representation of a probability distribution for y conditioned on
x, so that we would get (say) some representation of a distribution on the houses
likely sale value. The best prediction would then be the expected value of that
distribution. Usually the representation is in the form of variance estimates. One
common approach is to assume that the distribution on every prediction is normal,
with a constant variance that is estimated from data. Another common approach
is to estimate a variance at every prediction point.
It should be clear that none of this will work if there is not some relationship
between the training examples and the test examples. If I collect training data
on the height and weight of children, Im unlikely to get good predictions of the
weight of adults from their height. We can be more precise with a probabilistic
framework. We think of xi as IID samples from some (usually unknown) probability
distribution P (X). Then the test examples should also be IID samples from P (X).
A probabilistic formalism can help be precise about the yi , too. Assume another
random variable Y has joint distribution with X given by P (Y, X). We think of
each yi as a sample from P (Y | {X = xi }). Then our modelling problem would be:
given the training data, build a model that takes a test example x and yields a
139
Section 6.2
Linear Regression and Least Squares
140
model of P (Y | {X = xi }).
To do anything useful with this formalism requires some aggressive simplifying assumptions. There are very few circumstances that require a comprehensive
representation of P (Y | {X = xi }). Usually, we are interested in E[Y | {X = xi }]
(the mean of P (Y | {X = xi })) and in var ({P (Y | {X = xi })}).
To recover this representation, we assume that, for any pair of examples (x, y),
the value of y is obtained by applying some (unknown) function f to x, then adding
some random variable with zero mean. We can write y(x) = f (x) + , though its
worth remembering that there can be many different values of y associated with a
single x. Now we must make some estimate of f which yields E[Y | {X = xi }]
and estimate the variance of . The variance of might be constant, or might
vary with x.
There is a very widespread collection of regression methods. We will see a
subset here. It can be extremely useful to have variances or confidence intervals
associated with predictions (for example, in the house price case it could offer a
guideline as to the spread of likely bids), but it isnt always essential. In this chapter,
I will discuss both methods that are engineered to produce predictive distributions
and methods that ignore these distributions. Cross-validation ideas can be used to
obtain estimates of predictive variance from almost anything.
Definition: 6.1 Regression

Regression accepts a feature vector and produces a prediction, which is
usually a number, but can sometimes have other forms. It is possible,
but not usually particularly helpful, to see classification as a form of
regression.
6.2 LINEAR REGRESSION AND LEAST SQUARES

Assume we have a dataset consisting of a set of N pairs (xi , yi ). We think of yi as
the value of some function evaluated at xi , with some random component added.
This means there might be two data items where the xi are the same, and the yi are
different. We refer to the xi as explanatory variables and the yi is a dependent
variable. We want to use the examples we have the training examples
to build a model of the dependence between y and x. This model will be used to
predict values of y for new values of x, which are usually called test examples. It
can also be used to understand the relationships between the x. The model needs
to have some probabilistic component; we do not expect that y is a function of x,
and there is likely some error in evaluating y anyhow.
6.2.1 Linear Regression
We cannot expect that our model makes perfect predictions. Furthermore, y may
not be a function of x it is quite possible that the same value of x could lead
Section 6.2
141
to different ys. One way that this could occur is that y is a measurement (and so
subject to some measurement noise). Another is that there is some randomness in
y. For example, we expect that two houses with the same set of features (the x)
might still sell for different prices (the ys).
A good, simple model is to assume that the dependent variable (i.e. y) is
obtained by evaluating a linear function of the explanatory variables (i.e. x), then
adding a zero-mean normal random variable. We can write this model as
y = xT +
where is a zero mean normal random variable with unknown variance. In this
expression, is a vector of weights, which we must estimate. When we use this
model to predict a value of y for a particular set of explanatory variables x , we
cannot predict the value that will take. Our best available prediction is the mean
value (which is zero). Notice that if x = 0, the model predicts y = 0. This may
seem like a problem to you you might be concerned that we can fit only lines
through the origin but remember that x contains explanatory variables, and we
can choose what appears in x. The two examples show how a sensible choice of x
allows us to fit a line with an arbitrary y-intercept.
Definition: 6.2 Linear regression

A linear regression takes the feature vector x and predicts xT , for
some vector of coefficients . The coefficients are adjusted, using data,
to produce the best predictions.
Example: 6.1 A linear model fitted to a single explanatory variable

Assume we fit a linear model to a single explanatory variable. Then
the model has the form y = x + , where is a zero mean random
variable. For any value x of the explanatory variable, our best estimate
of y is x . In particular, if x = 0, the model predicts y = 0, which
is unfortunate. We can draw the model by drawing a line through the
origin with slope in the x, y plane. The y-intercept of this line must
be zero.
Section 6.2
Chirp frequency vs temperature in crickets
800
90
1000
Weight vs length in perch from Lake Laengelmavesi
142
R^2=0.68
80
Temperature
600
0
70
200
75
400
Weight (gr)
85
R^2=0.87
10
20
30
Length (cm)
40
14
15
16
17
18
19
20
Frequency
FIGURE 6.1: On the left, a regression of weight against length for perch from a
Finnish lake (you can find this dataset, and the back story at http:// www.amstat.
org/ publications/ jse/ jse data archive.htm; look for fishcatch on that page). Notice that the linear regression fits the data fairly well, meaning that you should be
able to predict the weight of a perch from its length fairly well. On the right, a
regression of air temperature against chirp frequency for crickets. The data is fairly
close to the line, meaning that you should be able to tell the temperature from the
pitch of crickets chirp fairly well. This data is from http:// mste.illinois.edu/ patel/
amar430/ keyprob1.html. The R2 you see on each figure is a measure of the goodness
of fit of the regression (section 6.2.2).
Example: 6.2 A linear model with a non-zero y-intercept

Assume we have a single explanatory variable, which we write u. We
T
can then create a vector x = [u, 1] from the explanatory variable. We
now fit a linear model to this vector. Then the model has the form
y = xT + , where is a zero mean random variable. For any value
T
x = [u , 1] of the explanatory variable, our best estimate of y is
T
(x ) , which can be written as y = 1 u + 2 . If x = 0, the model
predicts y = 2 . We can draw the model by drawing a line through the
origin with slope 1 and y-intercept 2 in the x, y plane.
We must determine . Because we have that P (y|x, ) is normal, with mean

xT , we can write out the log-likelihood of the data. Write 2 for the variance of
Section 6.2
Heart rate vs temperature in humans
80
85
90
100
Longevity vs Thorax in Female Fruitflies
143
R^2=0.06
75
70
Heart rate (bpm)
60
20
60
65
40
Lifespan
80
R^2=0.41
0.65
0.70
0.75
0.80
0.85
Thorax Length (mm)
0.90
0.95
97
98
99
100
Temperature (F)
FIGURE 6.2: Regressions do not necessarily yield good predictions or good model fits.
On the left, a regression of the lifespan of female fruitflies against the length of
their torso as adults (apparently, this doesnt change as a fruitfly ages; you can
find this dataset, and the back story at http:// www.amstat.org/ publications/ jse/
jse data archive.htm; look for fruitfly on that page). The figure suggests you can
make some prediction of how long your fruitfly will last by measuring its torso, but
not a particularly accurate one. On the right, a regression of heart rate against
body temperature for adults. You can find the data at http:// www.amstat.org/
publications/ jse/ jse data archive.htm as well; look for temperature on that page.
Notice that predicting heart rate from body temperature isnt going to work that well,
either.
, which we dont know, but will not need to worry about right now. We have that
X
log P (yi |xi , )
log L() =
i
1 X
(yi xTi )2 + term not depending on
2 2 i
Maximizing the log-likelihood of the data is equivalent to minimizing the

negative log-likelihood of the data. Furthermore, the term 21 2Pdoes not affect the
location of the minimum. We must have that minimizes i (yi xTi )2 It is
helpful to use a penalty that is an average of squared errors, because (hopefully)
this doesnt grow much when we add data. We therefore use
!
X
1
(yi xTi )2 .
N
i
We can write all this more conveniently using vectors and matrices. Write y for
Section 6.2
the vector
144
y1
y2
...
yn
and X for the matrix
xT1
xT2 .
. . . xTn
Then we want to minimize

1
N
y X )T (y X
which means that we must have
X T X X T y = 0.
For reasonable choices of features, we could expect that X T X which should
strike you as being a lot like a covariance matrix has full rank. If it does, which
is the usual case, this equation is easy to solve. If it does not, there is more to do,
which we will do in section 6.4.2.
Remember this: The vector of coefficients is usually estimated using

a least-squares procedure.
Worked example 6.1
Simple Linear Regression with R
Regress the hormone data against time for all the devices in the Efron example.
Solution: This example is mainly used to demonstrate how to regress in R.
There is sample code in listing 6.1. The summary in the listing produces a
great deal of information (try it). Most of it wont mean anything to you yet.
You can get a figure by doing plot(foo.lm), but these figures will not mean
anything yet, either. In the code, Ive shown how to plot the data and a line
on top of it.
Section 6.2
145
Listing 6.1: R code used for the linear regression example of worked example 6.1
e f d<read . table ( e f r o n t a b l e . t x t , h e a d e r=TRUE)
# t h e t a b l e has t h e form
#N1
Ah
Bh
Ch N2 At Bt Ct
# now we need t o c o n s t r u c t a new d a t a s e t
hor<s t a c k ( e f d , s e l e c t = 2 : 4 )
tim<s t a c k ( e f d , s e l e c t = 6 : 8 )
f o o<data . frame ( time=tim [ , c ( v a l u e s ) ] ,
hormone=hor [ , c ( v a l u e s ) ] )
f o o . lm<lm( hormonetime , data=f o o )
plot ( f o o )
abline ( f o o . lm)
6.2.2 Residuals and R-squared

Assume we have produced a regression by solving
X T X X T y = 0.
We cannot expect that X is the same as y. Instead, there is likely to be some
error. The residual is the vector
e = y X
which gives the difference between the true value and the models prediction at
each point. The mean square error is
m=
eT e
N
and this gives the average of the squared error of prediction on the training examples.
Section 6.2
146
Procedure: 6.1 Linear Regression with a Normal Model

We have a dataset containing N pairs (xi , yi ). Each xi is a ddimensional explanatory vector, and each yi is a single dependent variable. We assume that each data point conforms to the model
yi = xTi + i
where i is a normal random variable with mean 0 and unknown variance. Write y for the vector
y1
y2
...
yn
and X for the matrix
xT1
xT2 .
. . . xTn
We estimate (the value of ) by solving the linear system

X T X X T y = 0.
The residuals are
For a data point x, our model predicts xT .
e = y X .
We have that eT 1 = 0. The mean square error is given by
m=
eT e
.
N
Notice that the mean squared error is not a great measure of how good the
regression is. This is because the value depends on the units in which the dependent
variable is measured. So, for example, if you measure y in meters you will get a
different mean squared error than if you measure y in kilometers.
There is an important quantitative measure of how good a regression is which
doesnt depend on units. Unless the dependent variable is a constant (which would
make prediction easy), it has some variance. If our model is of any use, it should
explain some aspects of the value of the dependent variable. This means that
the variance of the residual should be smaller than the variance of the dependent
variable. If the model made perfect predictions, then the variance of the residual
should be zero.
Section 6.2
147
We can formalize all this in a relatively straightforward way. We will ensure

that X always has a column of ones in it, so that the regression can have a non-zero
y-intercept. We now fit a model
y = X + e
(where e is the vector of residual values) by choosing such that eT e is minimized.

Then we get some useful technical results.
Useful Facts: 6.1 Regression
We write y = X + e, where e is the residual. Assume X has a column
of ones, and is chosen to minimize eT e. Then we have
1. eT X = 0, i.e. that e is orthogonal to any column of X . This
is because, if e is not orthogonal to some column of e, we can
increase or decrease the term corresponding to that column to
make the error smaller. Another way to see this is to notice that
beta is chosen to minimize N1 eT e, which is N1 (y X )T (y X ).
Now because this is a minimum, the gradient with respect to is
zero, so (y X )T (X ) = eT X = 0.
2. eT 1 = 0 (recall that X has a column of all ones, and apply the
previous result).
3. 1T (y X ) = 0 (same as previous result).
4. eT X = 0 (first result means that this is true).
Now y is a one dimensional dataset arranged into a vector, so we can compute

mean ({y}) and var[y]. Similarly, X is a one dimensional dataset arranged into a
vector (its elements are xTi ), as is e, so we know the meaning of mean and variance
for each. We have a particularly important result:
var[y] = var[X ] + var[e].
This is quite easy to show, with a little more notation. Write y = (1/N )(1T y)1 for
the vector whose entries are all mean ({y}); similarly for e and for X . We have
var[y] = (1/N )(y y)T (y y)
and so on for var[ei ], etc. Notice from the facts that y = X . Now

T

var[y] = (1/N ) X X + [e e]
X X + [e e]

T

T
T
X X + 2 [e e] X X + [e e] [e e]
= (1/N ) X X

T

T
= (1/N ) X X
X X + [e e] [e e]
=
because e = 0 and eT X = 0 and eT 1 = 0

var[X ] + var[e].
Section 6.2
148
This is extremely important, because us allows us to think about a regression as

explaining variance in y. As we are better at explaining y, var[e] goes down. In
turn, a natural measure of the goodness of a regression is what percentage of the
variance of y it explains. This is known as R2 (the r-squared measure). We have

var xTi
2
R =
var[yi ]
which gives some sense of how well the regression explains the training data. Notice
that the value of R2 is not affected by the units of y (exercises)
Good predictions result in high values of R2 , and a perfect model will have
R2 = 1 (which doesnt usually happen). For example, the regression of figure 7.4
has an R2 value of 0.87. Figures 6.1 and 6.2 show the R2 values for the regressions
plotted there; notice how better models yield larger values of R2 . Notice that if
you look at the summary that R provides for a linear regression, it will offer you
two estimates of the value for R2 . These estimates are obtained in ways that try to
account for (a) the amount of data in the regression, and (b) the number of variables
in the regression. For our purposes, the differences between these numbers and the
R2 I defined are not significant. For the figures, I computed R2 as I described in the
text above, but if you substitute one of Rs numbers nothing terrible will happen.
Linear regression is useful, but it isnt magic. Some regressions make poor
predictions (recall the regressions of figure 6.2). As another example, regressing the
first digit of your telephone number against the length of your foot wont work.
We have two straightforward tests to tell whether a regression is working. You
can look at a plot for a dataset with one explanatory variable and one dependent
variable. You plot the data on a scatter plot, then plot the model as a line on that
scatterplot. Just looking at the picture can be informative (compare Figure 6.1
and Figure 6.2). You can also check, by eye, if the residual isnt random. We
assumed that y xT was a zero-mean normal random variable with fixed variance.
Our model means that the value of the residual vector should not depend on the
corresponding y-value. If it does, this is good evidence that there is a problem.
Looking at a scatter plot of e against y will often reveal trouble in a regression
(Figure 6.6). In the case of Figure 6.6, the trouble is caused by a few outlying
data points severely affecting the regression. We will discuss how to identify and
deal with such points in Section ??. Once they have been removed, the regression
improves markedly (Figure 6.6).
Remember this: The quality of predictions made by a regression can be

evaluated by looking at the fraction of the variance in the dependent variable
that is explained by the regression. This number is called R2 , and lies between zero and one; regressions with larger values make better predictions.
You can check if the regression predicts a constant. This is usually a bad
sign. You can check this by looking at the predictions for each of the training data
Section 6.2
149
items. If the variance of these predictions is small compared to the variance of

the independent variable, the regression isnt working well. If you have only one
explanatory variable, then you can plot the regression line. If the line is horizontal,
or close, then the value of the explanatory variable makes very little contribution
to the prediction. This suggests that there is no particular relationship between
the explanatory variable and the independent variable.
Remember this: Linear regressions can make bad predictions. You can
check for trouble by: evaluating R2 ; looking at a plot; looking to see if the
regression makes a constant prediction; or checking whether the residual is
random. Other strategies exist, but are beyond the scope of this book.
Frequency of word usage in Shakespeare, loglog
8
6
4
Log number of appearances
10000
8000
6000
4000
0
2000
Number of appearances
12000
14000
Frequency of word usage in Shakespeare
20
40
60
Rank
80
100
Log rank
FIGURE 6.3: On the left, word count plotted against rank for the 100 most common
words in Shakespeare, using a dataset that comes with R (called bard, and quite
likely originating in an unpublished report by J. Gani and I. Saunders). I show a
regression line too. This is a poor fit by eye, and the R2 is poor, too (R2 = 0.1). On
the right, log word count plotted against log rank for the 100 most common words
in Shakespeare, using a dataset that comes with R (called bard, and quite likely
originating in an unpublished report by J. Gani and I. Saunders). The regression
line is very close to the data.
6.2.3 Transforming Variables

Sometimes the data isnt in a form that leads to a good linear regression. In this
case, transforming explanatory variables, the dependent variable, or both can lead
to big improvements. Figure 6.3 shows one example, based on the idea of word
Section 6.2
150
frequencies. Some words are used very often in text; most are used seldom. The
dataset for this figure consists of counts of the number of time a word occurred
for the 100 most common words in Shakespeares printed works. It was originally
collected from a concordance, and has been used to attack a variety of interesting
questions, including an attempt to assess how many words Shakespeare knew. This
is hard, because he likely knew many words that he didnt use in his works, so
one cant just count. If you look at the plot of Figure 6.3, you can see that a
linear regression of count (the number of times a word is used) against rank (how
common a word is, 1-100) is not really useful. The most common words are used
very often, and the number of times a word is used falls off very sharply as one
looks at less common words. You can see this effect in the scatter plot of residual
against dependent variable in Figure 6.3 the residual depends rather strongly
on the dependent variable. This is an extreme example that illustrates how poor
linear regressions can be.
However, if we regress log-count against log-rank, we get a very good fit
indeed. This suggests that Shakespeares word usage (at least for the 100 most
common words) is consistent with Zipf s law. This gives the relation between
frequency f and rank r for a word as
f
1s
r
where s is a constant characterizing the distribution. Our linear regression suggests

that s is approximately 1.67 for this data.
In some cases, the natural logic of the problem will suggest variable transformations that improve regression performance. For example, one could argue that
humans have approximately the same density, and so that weight should scale as
the cube of height; in turn, this suggests that one regress weight against the cube
root of height. Generally, shorter people tend not to be scaled versions of taller
people, so the cube root might be too aggressive, and so one thinks of the square
root.
Remember this:
The performance of a regression can be improved by
transforming variables. Transformations can follow from looking at plots,
or thinking about the logic of the problem
The Box-Cox transformation is a method that can search for a transformation of the dependent variable that improves the regression. The method uses a
one-parameter family of transformations, with parameter , then searches for the
best value of this parameter using maximum likelihood. A clever choice of transformation means that this search is relatively straightforward. We define the Box-Cox
transformation of the dependent variable to be
(
yi 1
(bc)
if 6= 0 .
yi =
log yi if = 0
Section 6.3
Finding Problem Data Points
80
Temperature
600
0
70
200
75
400
Weight (gr)
85
800
90
1000
151
10
20
30
40
Length (cm)
14
15
16
17
18
19
20
Frequency
FIGURE 6.4: The Box-Cox transformation suggests a value of = 0.303 for the
regression of weight against height for the perch data of Figure 6.1. You can
find this dataset, and the back story at http:// www.amstat.org/ publications/ jse/
jse data archive.htm; look for fishcatch on that page). On the left, a plot of the
resulting curve overlaid on the data. For the cricket temperature data of that figure (from http:// mste.illinois.edu/ patel/ amar430/ keyprob1.html), the transformation suggests a value of = 4.75. On the right, a plot of the resulting curve
overlaid on the data.
It turns out to be straightforward to estimate a good value of using maximum
likelihood. One searches for a value of that makes residuals look most like a
normal distribution. Statistical software will do it for you; the exercises sketch
out the method. This transformation can produce significant improvements in a
regression. For example, the transformation suggests a value of = 0.303 for
the fish example of Figure 6.1. It isnt natural to plot weight0.303 against height,
because we dont really want to predict weight0.303 . Instead, we plot the predictions
of weight that come from this model, which will lie on a curve with the form
1
(ax + b) 0.303 , rather than on a straight line. Similarly, the transformation suggests
a value of = 0.475 for the cricket data. Figure 6.4 shows the result of these
transforms.
6.3 FINDING PROBLEM DATA POINTS
Outlying data points can significantly weaken the usefulness of a regression. For
some regression problems, we can identify data points that might be a problem, and
then resolve how to deal with them. One possibility is that they are true outliers
someone recorded a data item wrong, or they represent an effect that just doesnt
occur all that often. Another is that they are important data, and our linear model
may not be good enough. If the data points really are outliers, we can ignore them;
if they arent, we may be able to improve the regression by transforming features
or by finding a new explanatory variable.
152
nyv
20
40
40
20
0
40
20
yv
20
40
Section 6.3
40
20
0
xv
20
40
40
20
20
40
nxv
FIGURE 6.5: On the left, a synthetic dataset with one independent and one explanatory variable, with the regression line plotted. Notice the line is close to the data
points, and its predictions seem likely to be reliable. On the right, the result of
adding a single outlying datapoint to that dataset. The regression line has changed
significantly, because the regression line tries to minimize the sum of squared vertical distances between the data points and the line. Because the outlying datapoint is
far from the line, the squared vertical distance to this point is enormous. The line
has moved to reduce this distance, at the cost of making the other points further
from the line.
When we construct a regression, we are solving for the that minimizes

P
xTi )2 , equivalently for the that produces the smallest value of i e2i .
This means that residuals with large value can have a very strong influence on
the outcome we are squaring that large value, resulting in an enormous value.
Generally, many residuals of medium size will have a smaller cost that one large
residual and the rest tiny. As figure 6.5 illustrates, this means that a data point
that lies far from the others can swing the regression line significantly.
This creates a problem, because data points that are clearly wrong (sometimes
called outliers) can also have the highest influence on the outcome of the regression.
Figure 6.6 shows this effect for a simple case. When we have only one explanatory
variable, theres an easy method to spot problem data points. We produce a scatter
plot and a regression line, and the difficulty is usually obvious. In particularly tricky
cases, printing the plot and using a see-through ruler to draw a line by eye can help
(if you use an opaque ruler, you may not see some errors).
P
i (yi
Section 6.3
153
Residuals against fitted values,

weight against height,
all points
50
Residuals
250
100
50
150
200
Weight
100
300
150
350
Weight against height,

all points
40
50
60
70
80
100
150
200
Height
Fitted values
Weight against height,

4 outliers removed
Residuals against fitted values,

4 outliers removed
250
50
Residuals
250
100
50
150
200
Weight
100
300
150
350
30
30
40
50
Height
60
70
80
100
150
200
250
Fitted values
FIGURE 6.6: On the top left, weight regressed against height for the bodyfat dataset.
The line doesnt describe the data particularly well, because it has been strongly
affected by a few data points (filled-in markers). On the top right, a scatter plot
of the residual against the value predicted by the regression. This doesnt look like
noise, which is a sign of trouble. On the bottom left, weight regressed against
height for the bodyfat dataset. I have now removed the four suspicious looking data
points with filled-in markers; these seemed the most likely to be outliers. On the top
right, a scatter plot of the residual against the value predicted by the regression.
Notice that the residual looks like noise. The residual seems to be uncorrelated to the
predicted value; the mean of the residual seems to be zero; and the variance of the
residual doesnt depend on the predicted value. All these are good signs, consistent
with our model, and suggest the regression will yield good predictions.
Remember this:
Outliers can affect linear regressions significantly.
Usually, if you can plot the regression, you can look for outliers by eyeballing
the plot. Other methods exist, but are beyond the scope of this text.
Section 6.3
154
6.3.1 The Hat Matrix and Leverage

Write for the estimated value of , and yp = X for the predicted y values. Then
we have
1 T
= X T X
(X y)
so that
yp = (X X T X
1
X T )y.
What this means is that the values the model predicts at training points are a linear
1 T
function of the true values at the training points. The matrix (X X T X
X ) is
sometimes called the hat matrix. The hat matrix is written H, and I shall write
the i, jth component of the hat matrix hij .
The hat matrix has a variety of important properties. I wont prove any here,
but the proofs are in the exercises. It is a symmetric matrix. The eigenvalues can
be only 1 or 0. And the row sums have the important property that
X
h2ij 1.
j
This is important, because it can be used to find data points that have values that
are hard to predict. The leverage of the ith training point is the ith diagonal
element, hii , of the hat matrix
P H. Now we can write the prediction at the ith
training point yp,i = hii yi + j6=i hij yj . But if hii has large absolute value, then
all the other entries in that row of the hat matrix must have small absolute value.
This means that, if a data point has high leverage, the models value at that point
is predicted almost entirely by the observed value at that point. Alternatively, its
hard to use the other training data to predict a value at that point.
Here is another way to see this importance of hii . Imagine we change the
value of yi by adding ; then yp,i becomes yp,i + hii . In turn, a large value of hii
means that the predictions at the ith point are very sensitive to the value of yi .
6.3.2 Cooks Distance
There are two tools that are simple and effective. One method deletes the ith
point, computes the regression for the reduced data set, then compares the true
value of every other point to the predictions made by the dataset with the ith point
deleted. The score for the comparison is called Cooks distance. If a point has a
large value of Cooks distance, then it has a strong influence on the regression and
might well be an outlier. Typically, one computes Cooks distance for each point,
and takes a closer look at any point with a large value. This procedure is described
in more detail in procedure 6.2
Section 6.3
155
Procedure: 6.2 Computing Cooks distance

We have a dataset containing N pairs (xi , yi ). Each xi is a ddimensional explanatory vector, and each yi is a single dependent variable. Write for the coefficients of a linear regression (see procedure 6.1), and i for the coefficients of the linear regression computed
by omitting the ith data point, and m for the mean square error. The
Cooks distance of the ith data point is
P T
T 2
j (xj xj i )
.
dm
6.3.3 Standardized Residuals

The hat matrix has another use. It can be used to tell how large a residual is. The
residuals that we measure depend on the units in which y was expressed, meaning
we have no idea what a large residual is. For example, if we were to express y in
kilograms, then we might want to think of 0.1 as a small residual. Using exactly
the same dataset, but now with y expressed in grams, that residual value becomes
100 is it really large because we changed units?
Now recall that we assumed, in section 6.2.1, that y xT was a zero mean
normal random variable, but we didnt know its variance. It can be shown that,
under our assumption, the ith residual value, ei , is a sample of a normal random
variable whose variance is
T
(e e)
(1 hii ).
N
This means we can tell whether a residual is large by standardizing it that is,
dividing by its standard deviation. Write si for the standard residual at the ith
training point. Then we have that
si = r
(eT e)
N
ei

.
(1 hii )
When the regression is behaving, this standard residual should look like a sample
of a standard normal random variable. In turn, this means that if all is going well,
about 66% of the residuals should have values in the range [1, 1], and so on. Large
values of the standard residuals are a sign of trouble.
R produces a nice diagnostic plot that can be used to look for problem data
points (code and details in the appendix). The plot is a scatter plot of the standardized residuals against leverage, with level curves of Cooks distance superimposed.
Figure 6.8 shows an example. Some bad points that are likely to present problems
are identified with a number (you can control how many, and the number, with
arguments to plot; appendix). Problem points will have high leverage and/or high
Section 6.4
Many Explanatory Variables
Standardized residuals against fitted values,

4 outliers removed
156
Frequency
10
20
0
2
Residuals
30
40
Standardized residuals of height vs weight
100
150
200
Fitted values
250
str
FIGURE 6.7: On the left, standardized residuals plotted against predicted value for
weight regressed against height for the bodyfat dataset. I removed the four suspicious
looking data points, identified in Figure 6.6 with filled-in markers ; these seemed the
most likely to be outliers. You should compare this plot with the residuals in figure
6.6, which are not standardized. On the right, a histogram of the residual values.
Notice this looks rather like a histogram of a standard normal random variable,
though there are slightly more large positive residuals than one would like. This
suggests the regression is working tolerably.
Cooks distance and/or high residual. The figure shows this plot for three different versions of the dataset (original; two problem points removed; and two further
problem points removed).
6.4 MANY EXPLANATORY VARIABLES
In earlier sections, I implied you could put anything into the explanatory variables.
This is correct, and makes it easy to do the math for the general case. However,
I have plotted only cases where there was one explanatory variable (together with
a constant, which hardly counts). In some cases (section 6.4.1), we can add explanatory variables and still have an easy plot. Other cases are much harder to
plot successfully, and one needs better ways to visualize the regression than just
plotting.
Assume there is more than one explanatory variable and its tough to plot
the regression nicely. The value of R2 is still a useful guide to the goodness of the
regression. A useful plot, which can offer a lot of insight, is to plot the value of
the residual against the value predicted by the model. We have already used this
plot to track down suspicious data points (Figures 6.6 and 6.6). Generally, we look
for a tendency of the residual not to look like noise. It turns out that the residual
at each example can have different variances in a predictable way, so to produce a
helpful plot, we need to standardize the residuals (section 6.3.1).
Section 6.4
157
Residuals vs Leverage
39
4
2
1
0.5
Standardized residuals
42
0.5
Cooks distance
0.0
0.1
0.2
0.3
0.4
0.5
Leverage
lm(WEIGHT ~ HEIGHT)
3
41
0.1
36
216
1
0
2
1
0.05
0.05
0.1
145
0.05
0.05
0.1
0.1
Cooks distance
0.00
0.01
Cooks distance
0.02
0.03
Leverage
lm(WEIGHT ~ HEIGHT)
0.00
0.01
0.02
0.03
Leverage
lm(WEIGHT ~ HEIGHT)
FIGURE 6.8: A diagnostic plot, produced by R, of a linear regression of weight against

height for the bodyfat dataset. Top: the whole dataset; bottom left: with the two
most extreme points in the top figure removed; bottom right: with two further
points (highest residual) removed. Details in text.
6.4.1 Functions of One Explanatory Variable

Imagine we have only one measurement to form explanatory variables. For example,
in the perch data of Figure 6.1, we have only the length of the fish. If we evaluate
functions of that measurement, and insert them into the vector of explanatory
variables, the resulting regression is still easy to plot. It may also offer better
predictions. The fitted line of Figure 6.1 looks quite good, but the data points
look as though they might be willing to follow a curve. We can get a curve quite
easily. Our current model gives the weight as a linear function of the length with
a noise term (which we wrote yi = 1 xi + 0 + i ). But we could expand this
Section 6.4
158
model to incorporate other functions of the length. In fact, its quite suprising that
the weight of a fish should be predicted by its length. If the fish doubled in each
direction, say, its weight should go up by a factor of eight. The success of our
regression suggests that fish do not just scale in each direction as they grow. But
we might try the model yi = 2 x2i + 1 xi + 0 + i . This is easy to do. The ith
row of the matrix X currently looks like [xi , 1]. We build a new matrix X (b) , where
the ith row is [x2i , xi , 1], and proceed as before. This gets us a new model. The
nice thing about this model is that it is easy to plot our predicted weight is still
a function of the length, its just not a linear function of the length. Several such
models are plotted in Figure 6.9.
1200
1000
800
0
1, x
400
1, x, x^2
1, x, x^2, x^3
"Data"
Weight
Weight
"Data"
1, ..., x^10
Data
Data
1000
400
10
20
30
Length
40
10
20
30
40
Length
FIGURE 6.9: On the left, several different models predicting fish weight from length.
The line uses the explanatory variables 1 and xi ; and the curves use other monomials in xi as well, as shown by the legend. This allows the models to predict curves
that lie closer to the data. It is important to understand that, while you can make
a curve go closer to the data by inserting monomials, that doesnt mean you necessarily have a better model. On the right, I have used monomials up to x10
i . This
curve lies very much closer to the data points than any on the other side, at the
cost of some very odd looking wiggles inbetween data points. I cant think of any
reason that these structures would come from true properties of fish, and it would
be hard to trust predictions from this model.
You should notice that it can be quite easy to add a lot of functions like this
(in the case of the fish, I tried x3i as well). However, its hard to decide whether the
regression has actually gotten better. The least-squares error on the training data
will never go up when you add new explanatory variables, so the R2 will never get
worse. This is easy to see, because you could always use a coefficient of zero with the
new variables and get back the previous regression. However, the models that you
choose are likely to produce worse and worse predictions as you add explanatory
variables. Knowing when to stop can be tough (Section 7.1), though its sometimes
obvious that the model is untrustworthy (Figure 6.9).
Section 6.4
159
Remember this: If you have only one measurement, you can construct
a high dimensional x by using functions of that measurement. This produces
a regression that has many explanatory variables, but is still easy to plot.
Knowing when to stop is hard. An understanding of the problem is helpful.
6.4.2 Regularizing Linear Regressions

One occasionally important difficulty is that the explanatory variables might be
significantly correlated. If they are, then it will generally be easy to predict one
explanatory variable from another, and so there is a vector u so that X u is small
(exercises). In turn, that uT X T X u must be small, so that X T X may have some
small eigenvalues.
These small eigenvalues lead to bad predictions. If X T X has a small eigenvalue, then there is some vector v such that X T X v is small, or, equivalently, the
matrix can turn large vectors into small ones; but that means that (X T X )1 will
turn some small vectors into big ones. In turn, this means that could be large,
and the largest components in might be very inaccurately estimated. If we are
trying to predict new y values, we expect that large components in turn into
large errors in prediction (exercises).
An important and useful way to suppress these errors is to try to find a
that isnt large, and gives a low error. We can do this by regularizing, using the
same trick we saw in the case of classification. Instead of choosing the value of
that minimizes N1 (y X )T (y X ) we minimize

1
(y X )T (y X ) + T
N
Error + Regularizer
Here > 0 is a constant that weights the two requirements (small error; small )
relative to one another. Notice also that dividing the total error by the number of
data points means that our choice of shouldnt be affected by changes in the size
of the data set.
We choose in the same way we used for classification; split the training set
into a training piece and a validation piece, train for different values of , and test
the resulting regressions on the validation piece. The error is a random variable,
random because of the random split. It is a fair model of the error that would occur
on a randomly chosen test example (assuming that the training set is like the
test set, in a way that I do not wish to make precise yet). We could use multiple
splits, and average over the splits. Doing so yields both an average error for a value
of and an estimate the standard deviation of error. Figure 6.10 shows the result
of doing so for the bodyfat dataset, both with and without outliers. Notice that
now there is now no that yields the smallest validation error, because the value
Section 6.4
160
of error depends on the random split cross-validation. A reasonable choice of lies

between the one that yields the smallest error encountered (one vertical line in the
plot) and the largest value whose mean error is within one standard deviation of
the minimum (the other vertical line in the plot). All the computation can be done
by the glmnet package in R (see exercises for details).
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
15
all data
600
400
200
400
600
MeanSquared Error
800
800
six outliers removed
200
MeanSquared Error
15
1000
15
1000
15
6
log(Lambda)
10
10
log(Lambda)
FIGURE 6.10: Plots of mean-squared error as a function of log regularization parameter (i.e. log ) for a regression of weight against all variables for the bodyfat
dataset. These plots show mean-squared error averaged over cross-validation folds
with a vertical one standard deviation bar. On the left, the plot for the dataset with
the six outliers identified in Figure 14.5 removed. On the right, the plot for the
whole dataset. Notice how the outliers increase the variability of the error, and the
best error.
Notice the similarity to regularizing a classification problem. We started with

a cost function that evaluated the errors caused by a choice of , then added a
term that penalized for being large. This term is the squared length of , as
a vector. It is sometimes known as the L2 norm of the vector. In section 14.5, I
describe the consequences of using other norms.
This helps deal with the small eigenvalue, because to solve for we must solve
the equation

1
1
(X T X + I) =
XTy
N
N
(obtained by differentiating with
respect to and setting to zero) and the smallest

eigenvalue of the matrix ( N1 (X T X + I) will be at least (exercises). Penalizing
a regression with the size of in this way is sometimes known as ridge regression.
Section 6.4
161
Remember this:
The performance of a regression can be improved by
regularizing, particularly if some explanatory variables are correlated. The
procedure is similar to that used for classification.
6.4.3 Example: Weight against Body Measurements

Standardized residuals can be used to look for trouble in regressions when there are
too many variables to just plot the curve. Recall the bodyfat dataset of Figure 6.6.
That figure shows a regression of weight against height. I have already identified
four points that are very likely to be outliers (the relationship between weight and
height seems sharply different for these points than for the others), and removing
these points significantly helps the regression (Figures 6.6 and 6.6). Figure 6.7
shows the standardised residuals for this regression, which look a lot like samples
from a standard normal random variable and dont show strong signs of any major
problem.
10
5
0
15
10
5
0
5
15
10
10
15

weight against all,
4 outliers removed
15

weight against all,
all points
100
150
200
250
Fitted values
300
350
100
150
200
250
300
350
Fitted values
FIGURE 6.11: On the left, standardized residuals plotted against predicted value
for weight regressed against all variables for the bodyfat dataset. Four data points
appear suspicious, and I have marked these with a filled in marker. On the right,
standardized residuals plotted against predicted value for weight regressed against
all variables for the bodyfat dataset, but with the four suspicious looking data points
removed. Notice two other points stick out markedly.
Regressing weight against all variables is more interesting, because you cant
plot the function easily. In this case, you need to rely on the residual plots. Figure 6.11 shows the standardized residual plotted against predicted value for a regression of weight against all other variables. There is clearly a problem here; the
residual seems to depend quite strongly on the predicted value. Removing the four
Section 6.4
162
outliers we have already identified leads to a much improved plot, also shown in
Figure 6.11. This is banana-shaped, which is suspicious. There are two points that
seem to come from some other model (one above the center of the banana, one
below). This is likely because, for these points, some variables other than height
have an odd relationship with weight. Removing these points gives the residual
plot shown in Figure 6.12.
The banana shape of this plot is a suggestion that some non-linearity somewhere would improve the regression. One option is a non-linear transformation of
the independent variables. Finding the right one might require some work, so its
natural to try a Box-Cox transformation first. This gives
the best value of the
parameter as 0.5 (i.e. the dependent variable should be weight, which makes the
residuals look much better (Figure 6.12).
sqrt(weight) against all,
6 outliers removed
5
0
5
15
10
10
15

weight against all,
6 outliers removed
100
150
200
250
Fitted values
300
350
18
20
22
24
26
28
30
Fitted values
FIGURE 6.12: On the left, standardized residuals plotted against predicted value for
weight regressed against all variables for the bodyfat dataset. I removed the four
suspicious data points of Figure 6.11, and the two others identified in that figure.
Notice a suspicious banana shape the residuals are distinctly larger for small and
for large predicted values. This suggests a non-linear transformation of something
might be helpful.
I used a Box-Cox transformation, which suggested a value of 0.5
(i.e. regress 2( weight 1)) against all variables. On the right, the standardized
residuals for this regression. Notice that the banana has gone, though there is
a suspicious tendency for the residuals to be smaller rather than larger. Notice
also the plots are on different axes. Its fair to compare these plots by eye; but its
not fair to compare details, because the residual of a predicted square root means
something different than the residual of a predicted value.
Section 6.5
You should
163
6.5 YOU SHOULD

6.5.1 remember:
New term: Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 133

New term: explanatory variables . . . . . . . . . . . . . . . . . . . . 133
New term: dependent variable . . . . . . . . . . . . . . . . . . . . . . 133
New term: training examples . . . . . . . . . . . . . . . . . . . . . . 133
New term: test examples . . . . . . . . . . . . . . . . . . . . . . . . . 133
Definition: Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 134
New term: explanatory variables . . . . . . . . . . . . . . . . . . . . 134
New term: dependent variable . . . . . . . . . . . . . . . . . . . . . . 134
New term: training examples . . . . . . . . . . . . . . . . . . . . . . 134
New term: test examples . . . . . . . . . . . . . . . . . . . . . . . . . 134
Definition: Linear regression . . . . . . . . . . . . . . . . . . . . . . . 135
Estimating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
New term: residual . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
New term: mean square error . . . . . . . . . . . . . . . . . . . . . . 139
New term: mean square error . . . . . . . . . . . . . . . . . . . . . . 140
Useful facts: Regression . . . . . . . . . . . . . . . . . . . . . . . . . 141
R2 evaluates the quality of predictions made by a regression . . . . . 142
Linear regressions can fail. . . . . . . . . . . . . . . . . . . . . . . . . 143
New term: Zipfs law . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Transforming variables is useful . . . . . . . . . . . . . . . . . . . . . 144
New term: Box-Cox transformation . . . . . . . . . . . . . . . . . . . 144
New term: outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Outliers can affect linear regressions significantly. . . . . . . . . . . . 147
New term: hat matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 148
New term: leverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
New term: Cooks distance . . . . . . . . . . . . . . . . . . . . . . . 148
New term: standardizing . . . . . . . . . . . . . . . . . . . . . . . . . 149
Constructing a vector of explanatory variables by evaluating functions of a measurement is useful, bu
New term: L2 norm . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
New term: ridge regression . . . . . . . . . . . . . . . . . . . . . . . 154
You can regularize a regression . . . . . . . . . . . . . . . . . . . . . 155
Section 6.5
You should
164
APPENDIX: DATA
Batch A
Amount of Time in
Hormone
Service
25.8
99
20.5
152
14.3
293
23.2
155
20.6
196
31.1
53
20.9
184
20.9
171
30.4
52
Batch B
Amount of Time in
Hormone
Service
16.3
376
11.6
385
11.8
402
32.5
29
32.0
76
18.0
296
24.1
151
26.5
177
25.8
209
Batch C
Amount of Time in
Hormone
Service
28.8
119
22.0
188
29.7
115
28.9
88
32.8
58
32.5
49
25.4
150
31.7
107
28.5
125
TABLE 6.1: A table showing the amount of hormone remaining and the time in
service for devices from lot A, lot B and lot C. The numbering is arbitrary (i.e.
theres no relationship between device 3 in lot A and device 3 in lot B). We expect
that the amount of hormone goes down as the device spends more time in service,
so cannot compare batches just by comparing numbers.
Section 6.5
You should
165
PROBLEMS
Blood pressure against age
Systolic blood pressure
250
200
150
100
0
20
40
60
Age in years
80
FIGURE 6.13: A regression of blood pressure against age, for 30 data points.
6.1. Figure 7.18 shows a linear regression of systolic blood pressure against age.
There are 30 data points.
(a) Write ei = yi xT
i for the residual. What is the mean ({e}) for this
regression?
(b) For this regression, var ({y}) = 509 and the R2 is 0.4324. What is var ({e})
for this regression?
(c) How well does the regression explain the data?
(d) What could you do to produce better predictions of blood pressure (without actually measuring blood pressure)?
6.2. In this exercise, I will show that the prediction process of chapter 12(see
page 251) is a linear regression with two independent variables. Assume we
have N data items which are 2-vectors (x1 , y1 ), . . . , (xN , yN ), where N > 1.
These could be obtained, for example, by extracting components from larger
vectors. As usual, we will write x
i for xi in normalized coordinates, and so on.
The correlation coefficient is r (this is an important, traditional notation).
(a) Show that r = mean ({(x mean ({x}))(y mean ({y}))})/(std (x)std (y)).
std(y)
(b) Now write s = std(x) . Now assume that we have an xo , for which we wish
to predict a y value. Show that the value of the prediction obtained using
the method of page 252 is
sr(x0 mean ({x})) + mean ({y}).
(c) Show that sr = mean ({(xy)}) mean ({x})mean ({y}).
(d) Now write
y1
x1
1
1
y
x
and Y = 2 .
X = 2
... ...
...
xn
1
yn
where X T X = X T Y.
The coefficients of the linear regression will be ,
Show that
T
X X =N
mean x2
mean ({x})

mean ({x})
1
Section 6.5
(e) Now show that var ({x}) = mean
(x mean ({x}))2
You should
= mean
166
2
x
mean ({x})2 .
(f ) Now show that std (x)std (y)corr ({(x, y)}) = mean ({(x mean ({x}))(y mean ({y}))}).
C H A P T E R
Regression: Some harder topics

7.1 MODEL SELECTION: WHICH MODEL IS BEST?
It is usually quite easy to have many explanatory variables in a regression problem.
Even if you have only one measurement, you could always compute a variety of nonlinear functions of that measurement. As we have seen, inserting variables into a
model will reduce the fitting cost, but that doesnt mean that better predictions will
result (section 6.4.1). We need to choose which explanatory variables we will use.
A linear model with few explanatory variables may make poor predictions because
the model itself is incapable of representing the independent variable accurately (an
effect known as bias). A linear model with many explanatory variables may make
poor predictions because we cant estimate the coefficients well (an effect known as
variance). Choosing which explanatory variables we will use (and so which model
we will use) requires that we balance these effects, described in greater detail in
section 7.1.1. In the following sections, we describe a variety of different methods
of doing so.
There are other applications of regression, which arent key to machine learning but are worth being aware of. One important use for regression is to discount
trends in data. Theres an example in section 7.1.6. Another is to identify explanatory variables that are important, by seeing which coefficients in a model are
small. Unfortunately, it can be difficult to decide what it means for a coefficient
to be small (section 7.1.7).
7.1.1 Bias and Variance
We now look at the process of finding a model in a fairly abstract way, because
doing so highlights two important types of problem in modelling. We have
Y = f (X) +

where is noise and f is an unknown function. We have E[] = 0, and E 2 =
var ({}) = 2 ; furthermore, is independent of X. We have some procedure that
takes a selection of training data, consisting of pairs (xi , yi ), and selects a model f.
We will use this model to predict values for future x.
We need to understand the error that will occur when we use f to predict for
some data item that isnt in the training set. This is
i
h
E (Y f(X))2
where the expectation is taken over training sets and possible choices of x. This

expectation can be written in an extremely useful form. Recall var ({U }) = E U 2
167
Section 7.1
Model Selection: Which Model is Best?
168
E[U ]2 . This means we have

h i
h i
i
h

= E Y 2 2E Y f + E f2
E (Y f(X))2
n o
h i2
h i
2
= var ({Y }) + E[Y ] + var f + E f 2E Y f .
Now Y = f (X) + , E[] = 0, and is independent of X so we have E[Y ] = E[f ]

and var ({Y }) = var ({}) = 2 . This yields
i
n o
h i2
h
h i
2
= var ({Y }) + E[f ] + var f + E f 2E[f ]E f
E (Y f(X))2
n o h
i2
= 2 + var f + E (f f)
.
Each of these terms has an important interpretation. There is an irreducible
n o
error, 2 ; even the true model must produce this error. The term var f
=
h
h i i
E (f E f )2 is known as variance. This term results from the fact that the
h i
model we chose (f) is different from the mean model (E f ; remember, this ex-
pectation is over choices of training data). This is because our training data is a
subset ofh all data,
iand our model is chosen to be good on the training data. The
term E (f f ) is referred to as bias. This term reflects the fact that even the
h i
best choice of model (E f ) may not be the same as the true source of data (E[f ]
which is the same as f , because f is deterministic).

It is hard to talk precisely about the complexity of a model, but one reasonable proxy is the number of parameters we have to estimate to determine the model.
Generally, when a model comes from a small or simple family, we expect that
(a) we can estimate the best model in the family reasonably accurately (so the
variance will be low) but (b) the model may have real difficulty reproducing the
data (meaning the bias is large). Similarly, if the model comes from a large or
complex family, the variance is likely to be high (because it will be hard to estimate the best model in the family accurately) but the bias will be low (because the
model can more accurately reproduce the data). All modelling involves managing
this tradeoff between bias and variance.
You can see this tradeoff in the perch example of section 6.4.1 and Figure 6.9.
Recall that, as I added monomials to the regression of weight against length, the
fitting error went down; but the model that uses length10 as an explanatory variable makes very odd predictions away from the training data. When I use low
degree monomials, the dominant source of error is bias; and when I use high degree
monomials, the dominant source of error is variance. A common mistake is to feel
that the major difficulty is bias, and so to use extremely complex models. Usually
the result is poor estimates of model parameters, and huge variance. Experienced
modellers fear variance more than they fear bias.
7.1.2 Penalties: AIC and BIC

We would like to choose one of a set of models. We cannot do so using just the
training error, because more complex models will tend to have lower training error,
Section 7.1
169
and so the model with the lowest training error will tend to be the most complex
model. Training error is a poor guide to test error, because lower training error is
evidence of lower bias on the models part; but with lower bias, we expect to see
greater variance, and the training error doesnt take that into account.
One strategy is to penalize the model for complexity. We add some penalty,
reflecting the complexity of the model, to the training error. We then expect to see
the general behavior of figure ??. The training error goes down, and the penalty
goes up as the model gets more complex, so we expect to see a point where the sum
is at a minimum.
There are a variety of ways of constructing penalties. AIC (short for An
Information Criterion) is a method due originally to Akaike, in ****. Rather than
using the training error, AIC uses the maximum value of the log-likelihood of the
model. Write L for this value. Write k for the number of parameters estimated to
fit the model. Then the AIC is
2k 2L
and a better model has a smaller value of AIC (remember this by remembering
that a larger log-likelihood corresponds to a better model). Estimating AIC is
straightforward for regression models if you assume that the noise is a zero mean
normal random variable. You estimate the mean-squared error, which gives the
variance of the noise, and so the log-likelihood of the model. You do have to keep
track of two points. First, k is the total number of parameters estimated to fit the
model. For example, in a linear regression model, where you model y as xT + ,
you need to estimate d parameters to estimate and the variance of (to get
the log-likelihood). So in this case k = d + 1. Second, log-likelihood is usually
only known up to a constant, so that different software implementations often use
different constants. This is wildly confusing when you dont know about it (why
would AIC and extractAIC produce different numbers on the same model?) but
of no real significance youre looking for the smallest value of the number, and
the actual value doesnt mean anything. Just be careful to compare only numbers
computed with the same routine.
An alternative is BIC (Bayes Information Criterion), given by
2k log N 2L
(where N is the size of the training data set). You will often see this written as
2L 2k log N ; I have given the form above so that one always wants the smaller
value as with AIC. There is a considerable literature comparing AIC and BIC. AIC
has a mild reputation for overestimating the number of parameters required, but is
often argued to have firmer theoretical foundations.
Section 7.1
Worked example 7.1
170
AIC and BIC
Write
MN for the model that predicts weight from length for the perch dataset
Pj=n
as j=0 j lengthj . Choose an appropriate value of N [1, 10] using AIC and
BIC.
Solution: I used the R functions AIC and BIC, and got the table below.
1
2
3
4
5
6
7
8
9
10
AIC 677 617 617 613 615 617 617 612 613 614
BIC 683 625 627 625 629 633 635 633 635 638
The best model by AIC has (rather startlingly!) N = 8. One should not take
small differences in AIC too seriously, so models with N = 4 and N = 9 are
fairly plausible, too. BIC suggests N = 2.
7.1.3 Cross-Validation
AIC and BIC are estimates of error on future data. An alternative is to measure this
error on held out data, using a cross-validation strategy (as in section 2.1.4). One
splits the training data into F folds, where each data item lies in exactly one fold.
The case F = N is sometimes called leave-one-out cross-validation. One then
sets aside one fold in turn, fitting the model to the remaining data, and evaluating
the model error on the left-out fold. This error is then averaged. Numerous variants
are available, particularly when lots of computation and lots of data are available.
For example: one might not average over all folds; one might use fewer or more
folds; and so on.
Worked example 7.2
Cross-validation
Write
MN for the model that predicts weight from length for the perch dataset
Pj=n
as j=0 j lengthj . Choose an appropriate value of N [1, 10] using leave-oneout cross validation.
Solution: I used the R functions CVlm, which takes a bit of getting used to.
There is sample code in the exercises section. I found:
1
2
3
4
5
6
7
8
1.94e04 4.03e03 7.18e03 4.46e03 5.97e03 5.64e04 1.23e06 4.03e06
where the best model is N = 2.
7.1.4 Forward and Backward Stagewise Regression
Assume we have a set of explanatory variables and we wish to build a model,
choosing some of those variables for our model. Our explanatory variables could
be many distinct measurements, or they could be different non-linear functions of
the same measurement, or a combination of both. We can evaluate models relative
9
3.86e06
10
1.87e08
Section 7.1
171
to one another fairly easily (AIC, BIC or cross-validation, your choice). However,
choosing which set of explanatory variables to use can be quite difficult, because
there are so many sets. Imagine you start with a set of F possible explanatory
variables (including the original measurement, and a constant). You dont know
how many to use, so you might have to try every different group, of each size, and
there are too many. There are two useful alternatives.
In forward stagewise regression, you start with an empty working set
of explanatory variables. You then iterate the following process, which is fairly
obviously a greedy algorithm. For each of the explanatory variables not in working
set, you construct a new model using the working set and that explanatory variable,
and compute the model evaluation score. If the best of these models has a better
score than the model based on the working set, you insert the appropriate variable
into the working set and iterate. If no variable improves the working set, you decide
you have the best model and stop. Backward stagewise regression is pretty
similar, but you start with a working set containing all the variables, and remove
variables one-by-one and greedily. As usual, greedy algorithms are very helpful but
not capable of exact optimization. Each of these strategies can produce rather good
models, but neither is guaranteed to produce the best model.
7.1.5 Dropping Variables with L1 Regularization
We have a large set of explanatory variables, and we would like to choose a small
set that explains most of the variance in the independent variable. We could do
this by encouraging to have many zero entries. In section 6.4.2, we saw we could
regularize a regression by adding a term to the cost function that P
discouraged large
values of . Instead of solving for the value of that minimized i (yi xTi )2 =
(y X )T (y X ) (which I shall call the error cost), we minimized
X
(yi xTi )2 + T = (y X )T (y X ) + T
i
(which I shall call the L2 regularized error). Here > 0 was a constant chosen
by cross-validation. Larger values of encourage entries of to be small, but do
not force them to be zero. The reason is worth understanding.
Write k for the kth component of , and write k for all the other components. Now we can write the L2 regularized error as a function of k :
(a + )k2 2b(k )k + c(k )
where a is a function of the data and b and c are functions of the data and of .
Now notice that
b(k )
.
k =
(a + )
Notice that doesnt appear in the numerator. This means that, to force k to
zero by increasing , we may have to make arbitrarily large. This is because the
improvement in the penalty obtained by going from a small k to k = 0 is tiny
the penalty is proportional to k2 .
Section 7.1
172
To force some components of to zero, we need a penalty that grows linearly

around zero rather than quadratically. This means we should use the L1 norm of
, given by
X
|| ||1 =
| k |.
k
To choose , we must now solve
(y X )T (y X ) + || ||1
for an appropriate choice of . An equivalent problem is to solve a constrained
minimization problem, where one minimizes
(y X )T (y X ) subject to || ||1 t
where t is some value chosen to get a good result, typically by cross-validation.
There is a relationship between the choice of t and the choice of (with some
thought, a smaller t will correspond to a bigger ) but it isnt worth investigating
in any detail.
Actually solving this system is quite involved, because the cost function is not
differentiable. You should not attempt to use stochastic gradient descent, because
this will not compel zeros to appear in (exercises). There are several methods,
which are beyond our scope. As the value of increases, the number of zeros in
will increase too. We can choose in the same way we used for classification;
split the training set into a training piece and a validation piece, train for different
values of , and test the resulting regressions on the validation piece. However, one
consequence of modern methods is that we can generate a very good approximation
to the path ()
for all values of 0 about as easily as we can choose for a
particular value of .
One way to understand the models that result is to look at the behavior
of cross-validated error as changes. The error is a random variable, random
because of the random split. It is a fair model of the error that would occur on
a randomly chosen test example (assuming that the training set is like the test
set, in a way that I do not wish to make precise yet). We could use multiple splits,
and average over the splits. Doing so yields both an average error for each value
of and an estimate of the standard deviation of error. Figure 7.1 shows the
result of doing so for two datasets. Again, there is no that yields the smallest
validation error, because the value of error depends on the random split crossvalidation. A reasonable choice of lies between the one that yields the smallest
error encountered (one vertical line in the plot) and the largest value whose mean
error is within one standard deviation of the minimum (the other vertical line in
the plot). It is informative to keep track of the number of zeros in as a function
of , and this is shown in Figure 7.1.
Another way to understand the models is to look at how changes as
changes. We expect that, as gets smaller, more and more coefficients become
non-zero. Figure ?? shows plots of coefficient values as a function of log for a
regression of weight against all variables for the bodyfat dataset, penalised using
When
the L1 norm. For different values of , one gets different solutions for .
Section 7.1
14
14
14
13
13
13
10
10
9 7 5 5 4 4 2 1 0
14
14
14
13
13
13
10
10
9 7 5 5 4 4 2 1 0
600
0
200
400
MeanSquared Error
800
800
600
400
200
MeanSquared Error
14
173
1000
1000
14
0
log(Lambda)
log(Lambda)
FIGURE 7.1: Plots of mean-squared error as a function of log regularization parameter (i.e. log ) for a regression of weight against all variables for the bodyfat
dataset. These plots show mean-squared error averaged over cross-validation folds
with a vertical one standard deviation bar. On the left, the plot for the dataset with
the six outliers identified in Figure 14.5 removed. On the right, the plot for the
whole dataset. Notice how the outliers increase the variability of the error, and the
best error. The top row of numbers gives the number of non-zero components in .
Notice how as increases, this number falls. The penalty ensures that explanatory
variables with small coefficients are dropped as gets bigger.
is very large, the penalty dominates, and so the norm of must be small. In
falls
turn, most components of are zero. As gets smaller, the norm of beta
and some components of become non-zero. At first glance, the variable whose
coefficient grows very large seems important. Look more carefully; this is the last
component introduced into the model. But Figure 7.1 implies that the right model
has 7 components. This means that the right model has log 1.3, the vertical
line shown in the detailed figure. In the best model, that coefficient is in fact zero.
The L1 norm can sometimes produce an impressively small model from a
large number of variables. In the UC Irvine Machine Learning repository, there is
a dataset to do with the geographical origin of music (https://archive.ics.uci.edu/
ml/datasets/Geographical+Original+of+Music). The dataset was prepared by Fang
Zhou, and donors were Fang Zhou, Claire Q, and Ross D. King. Further details
appear on that webpage, and in the paper: Predicting the Geographical Origin
of Music by Fang Zhou, Claire Q and Ross. D. King, which appeared at ICDM
in 2014. There are two versions of the dataset. One has 116 explanatory variables
(which are various features representing music), and 2 independent variables (the
latitude and longitude of the location where the music was collected). Figure 7.3
shows the results of a regression of latitude against the independent variables using
L1 regularization. Notice that the model that achieves the lowest cross-validated
prediction error uses only 38 of the 116 variables.
Section 7.1
14
13
11
14
13
11
174
Coefficients
20
10
Coefficients
30
40
10
14
Log Lambda
Log Lambda
FIGURE 7.2: Plots of coefficient values as a function of log for a regression of
weight against all variables for the bodyfat dataset, penalised using the L1 norm. In
each case, the six outliers identified in Figure 14.5 were removed. On the left, the
plot of the whole path for each coefficient (each curve is one coefficient). On the
right, a detailed version of the plot. The vertical line shows the value of log the
produces the model with smallest cross-validated error (look at Figure 7.1). Notice
that the variable that appears to be important, because it would have a large weight
with = 0, does not appear in this model.
Regularizing a regression with the L1 norm is sometimes known as a lasso. A
nuisance feature of the lasso is that, if several explanatory variables are correlated,
it will tend to choose one for the model and omit the others (example in exercises).
This can lead to models that have worse predictive error than models chosen using
the L2 penalty. One nice feature of good minimization algorithms for the lasso is
that it is easy to use both an L1 penalty and an L2 penalty together. One can form
!

X
(1 )
1
2
T
2
+
(yi xi )
|| ||2 + || ||1
N
2
i
Error +
Regularizer
where one usually chooses 0 1 by hand. Doing so can both discourage large
values in and encourage zeros. Penalizing a regression with a mixed norm like this
is sometimes known as elastic net. It can be shown that regressions penalized with
elastic net tend to produce models with many zero coefficients, while not omitting
correlated explanatory variables. All the computation can be done by the glmnet
package in R (see exercises for details).
7.1.6 Using Regression to Compare Trends
Regression isnt only used to predict values. Another reason to build a regression
model is to compare trends in data. Doing so can make it clear what is really hap-
Section 7.1
88
77
57
15
Coefficients
320
10
280
300
MeanSquared Error
340
10
360
88 88 88 85 83 76 74 72 61 48 28 20 15 13 6 3 0
175
log(Lambda)
Log Lambda
Mean-squared error as a function of log regularization parameter (i.e. log ) for a regression of latitude against features describing music (details in text), using the dataset at https:// archive.ics.uci.edu/ ml/ datasets/
Geographical+Original+of+Music and penalized with the L1 norm. The plot on the
left shows mean-squared error averaged over cross-validation folds with a vertical
one standard deviation bar. The top row of numbers gives the number of non-zero
Notice how as increases, this number falls. The penalty ensures
components in .
that explanatory variables with small coefficients are dropped as gets bigger. On
the right, a plot of the coefficient values as a function of log for the same regression. The vertical line shows the value of log the produces the model with smallest
cross-validated error. Only 38 of 116 explanatory variables are used by this model.
FIGURE 7.3:
pening. Here is an example from Efron (Computer-Intensive methods in statistical

regression, B. Efron, SIAM Review, 1988). The table in the appendix shows some
data from medical devices, which sit in the body and release a hormone. The data
shows the amount of hormone currently in a device after it has spent some time in
service, and the time the device spent in service. The data describes devices from
three production lots (A, B, and C). Each device, from each lot, is supposed to have
the same behavior. The important question is: Are the lots the same? The amount
of hormone changes over time, so we cant just compare the amounts currently in
each device. Instead, we need to determine the relationship between time in service
and hormone, and see if this relationship is different between batches. We can do
so by regressing hormone against time.
Figure 7.4 shows how a regression can help. In this case, we have modelled
the amount of hormone in the device as
a (time in service) + b
for a, b chosen to get the best fit (much more on this point later!). This means
we can plot each data point on a scatter plot, together with the best fitting line.
Section 7.1

Regression residual against time
40
35
B CC B
C
A
A
30
C CCC
A
25
20
15
50
B
C
BA
C
A A AA
100
150
Time in service
200
Residual
Amount of hormone
Hormone against time in service
176
2
0 B
C
C B
C
A
A
C
A
B
B
C
CC
C
B
A
C
AA
A
50
100
150
Time in service
200
FIGURE 7.4: On the left, a scatter plot of hormone against time for devices from
tables 7.1 and 7.1. Notice that there is a pretty clear relationship between time and
amount of hormone (the longer the device has been in service the less hormone there
is). The issue now is to understand that relationship so that we can tell whether lots
A, B and C are the same or different. The best fit line to all the data is shown as
well, fitted using the methods of section 6.2. On the right, a scatter plot of residual
the distance between each data point and the best fit line against time for the
devices from tables 7.1 and 7.1. Now you should notice a clear difference; some
devices from lots B and C have positive and some negative residuals, but all lot
A devices have negative residuals. This means that, when we account for loss of
hormone over time, lot A devices still have less hormone in them. This is pretty
good evidence that there is a problem with this lot.
This plot allows us to ask whether any particular batch behaves differently from
the overall model in any interesting way.
However, it is hard to evaluate the distances between data points and the best
fitting line by eye. A sensible alternative is to subtract the amount of hormone
predicted by the model from the amount that was measured. Doing so yields a
residual the difference between a measurement and a prediction. We can then
plot those residuals (Figure 7.4). In this case, the plot suggests that lot A is special
all devices from this lot contain less hormone than our model predicts.
7.1.7 Significance: What Variables are Important?
Imagine you regress some measure of risk of death against blood pressure, whether
someone smokes or not, and the length of their thumb. Because high blood pressure
and smoking tend to increase risk of death, you would expect to see large coefficients for these explanatory variables. Since changes in the thumb length have no
effect, you would expect to see small coefficients for these explanatory variables.
This suggests a regression can be used to determine what effects are important in
building a model.
One difficulty is the result of sampling variance. Imagine that we have an
explanatory variable that has absolutely no relationship to the dependent variable.
If we had an arbitrarily large amount of data, and could exactly identify the correct
Section 7.1
177
model, wed find that, in the correct model, the coefficient of that variable was zero.
But we dont have an arbitrarily large amount of data. Instead, we have a sample
of data. Hopefully, our sample is random, so that the reasoning of section 14.5 can
be applied. Using that reasoning, our estimate of the coefficient is the value of a
random variable whose expected value is zero, but whose variance isnt. As a result,
we are very unlikely to see a zero. This reasoning applies to each coefficient of the
model. To be able to tell which ones are small, we would need to know the standard
deviation of each, so we can tell whether the value we observe is a small number of
standard deviations away from zero. This line of reasoning is very like hypothesis
testing. It turns out that the sampling variance of regression coefficients can be
estimated in a straightforward way. In turn, we have an estimate of the extent
to which their difference from zero could be a result of random sampling. R will
produce this information routinely; use summary on the output of lm.
A second difficulty has to do with practical significance, and is rather harder.
We could have explanatory variables that are genuinely linked to the independent
variable, but might not matter very much. This is a common phenomenon, particularly in medical statistics. It requires considerable care to disentangle some of these
issues. Here is an example. Bowel cancer is an unpleasant disease, which could kill
you. Being screened for bowel cancer is at best embarrassing and unpleasant, and
involves some startling risks. There is considerable doubt, from reasonable sources,
about whether screening has value and if so, how much (as a start point, you could
look at Ransohoff DF. How Much Does Colonoscopy Reduce Colon Cancer Mortality?. Ann Intern Med. 2009). There is some evidence linking eating red or
processed meat to incidence of bowel cancer. A good practical question is: should
one abstain from eating red or processed meat based on increased bowel cancer
risk?
Coming to an answer is tough; the coefficient in any regression is clearly
not zero, but its pretty small as these numbers indicate. The UK population in
2012 was 63.7 million (this is a summary figure from Google, using World Bank
data; theres no reason to believe that its significantly wrong). I obtained the
following figures from the UK cancer research institute website, at http://www.
cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/bowel-cancer.
There were 41, 900 new cases of bowel cancer in the UK in 2012. Of these cases,
43% occurred in people aged 75 or over. 57% of people diagnosed with bowel cancer
survive for ten years or more after diagnosis. Of diagnosed cases, an estimated 21%
are linked to eating red or processed meat, and the best current estimate is that
the risk of incidence is between 17% and 30% higher per 100g of red meat eaten
per day (i.e. if you eat 100g of red meat per day, your risk increases by some number between 17% and 30%; 200g a day gets you twice that number; and rather
roughly so on). These numbers are enough to confirm that there is a non-zero
coefficient linking the amount of red or processed meat in your diet with your risk
of bowel cancer (though youd have a tough time estimating the exact value of that
coefficient from the information here). If you eat more red meat, your risk really
will go up. But that coefficient is clearly pretty small, because the incidence is
about 1 in 1500 per year. Does it matter? you get to choose, and your choice has
consequences.
Section 7.2
Robust Regression
178
7.2 ROBUST REGRESSION

We have seen that outlying data points can result in a poor model. This is caused by
the squared error cost function: squaring a large error yields an enormous number.
One way to resolve this problem is to identify and remove outliers before fitting a
model. This can be difficult, because it can be hard to specify precisely when a
point is an outlier.
One alternative solution is to come up with a cost function that is less susceptible to problems with outliers (Section 7.2.1). Another is to search for good
points. A small set of good points will identify the model we wish to fit; other good
points will agree, and the points that disagree are bad points. This is the basis of
an extremely important approach, described in Section 7.2.2.
7.2.1 M-Estimators and Iteratively Reweighted Least Squares
One way to reduce the effect of outliers on a least-squares solution would be to
weight each term in the cost function. Write W for a diagonal matrix of weights,
one per training data point. Now, rather than minimize (Y X )T (Y X ), we
could minimize
(Y X )T W(Y X ).
Doing so means we must solve the linear system
X T WY = X T WX .
We need some method to estimate an appropriate set of weights. This would use a
large weight for errors at points that are trustworthy, and a low weight for errors
at suspicious points.
We can obtain such weights using an M-estimator, estimates parameters by
replacing the negative log-likelihood with a term that is better behaved. In our
examples, the negative log-likelihood has always been squared error. Write for
the parameters of the model being fitted, and ri (xi , ) is the residual error of the
model on the ith data point. For us, ri will always be yi xTi . So rather than
minimizing
X
(ri (xi , ))2
i
as a function of , we will minimize an expression of the form

X
(ri (xi , ); ),
i
for some appropriately chosen function . Clearly, our negative log-likelihood is

one such estimator (use (u; ) = u2 ). The trick to M-estimators is to make (u; )
look like u2 for smaller values of u, but ensure that it grows more slowly than u2
for larger values of u.
The Huber loss is one important M-estimator. We use
( 2
u
|u| <
2
(u; ) =
2
| u | 2
Section 7.2
Robust Regression
179
which is the same as u2 for u , and then switches to | u | for larger

(or smaller) (Figure ??). The Huber loss is convex (meaning that there will
be a unique minimum for our models) and differentiable, but its derivative is not
continuous.
Another choice that I have seen widely used (though I dont know a name for
it) is
u2
.
(u; ) = 2
+ u2
2
For small values of u, this looks like u2 ; for large values of u, it looks like 1. The
parameter controls the point at which the function flattens out, and we have
plotted a variety of examples in Figure ??.
Generally, M-estimators are discussed in terms of their influence function.
This is
.
u
Our miniIts importance becomes evidence when we consider algorithms to fit .
mization criterion is
!
X
X
T
(yi xi ; )
=
(xi )
u
i
i
=
Now write wi () for
0.
yi xTi
(where the partial derivative is evaluated at the value of the residual). We can
write the minimization criterion as
X
[wi ()] (yi xTi )(xi ) = 0.
i
But this is equivalent to

X T WY = X T WX .
The difficulty here is estimating the wi (). But the following strategy, known as
iteratively reweighted least squares is very effective.
We assume we have an estimate of the correct parameters (n) , and consider
updating it to (n+1) . We compute
(n)
wi
= wi ((n) ) =
yi
u
xTi (n)
(where the partial derivative is estimated at = (n) ). We then estimate (n+1)

by solving
X T W (n) Y = X T W (n) X (n+1) .
The key to this algorithm is finding good start points for the iteration. One strategy
is randomized search. We select a small subset of points uniformly at random, and
Section 7.2
Robust Regression
180
350
Robust regressions of weight against height,

bodyfat dataset
250
150
200
Weight
300
RLM, Huber, k2=1e2

LM
LM, outliers excluded
30
40
50
60
70
Height
FIGURE 7.5: Comparing three different linear regression strategies on the bodyfat
data, regressing weight against height. Notice that using a robust regression gives
an answer very like that obtained by rejecting outliers by hand. The answer may
well be better because it isnt certain that each of the four points rejected is an
outlier, and the robust method may benefit from some of the information in these
points. I tried a range of scales for the Huber loss (the k2 parameter), but found
no difference in the line resulting over scales varying by a factor of 1e4, which is
why I plot only one scale.
fit some to these points, then use the result as a start point. If we do this often
enough, one of the start points will be an estimate that is not contaminated by
outliers.
The estimators require a sensible estimate of , which is often referred to as
scale. Typically, the scale estimate is supplied at each iteration of the solution
method. One reasonable estimate is the MAD or median absolute deviation,
given by
(n)
(n) = 1.4826 mediani | ri (xi ; (n1) ) |.
Another a popular estimate of scale is obtained with Hubers proposal 2 (that
is what everyone calls it!). Choose some constant k1 > 0, and define (u) =
2
min (| u |, k1 ) . Now solve the following equation for :
X
i
(n)
ri (xi ; (n1)
) = N k2
where k2 is another constant, usually chosen so that the estimator gives the right answer
for a normal distribution (exercises). This equation needs to be solved with an iterative
method; the MAD estimate is the usual start point. R provides hubers, which will compute
this estimate of scale (and figures out k2 for itself). The choice of k1 depends somewhat on
how contaminated you expect your data to be. As k1 , this estimate becomes more
like the standard deviation of the data.
Section 7.2
181
Weight regressed against all for bodyfat,

histogram of residuals,
all points
60
Frequency
20
20
20
40
40
Residual
60
80
80
100
100
120
120
Weight regressed against all for bodyfat,

residual against fitted value,
all points
Robust Regression
100
150
200
250
300
Fitted value
350
50
100
Residual
FIGURE 7.6: A robust linear regression of weight against all variables for the bodyfat
dataset, using the Huber loss and all data points. On the left, residual plotted
against fitted value (the residual is not standardized). Notice that there are some
points with very large residual, but most have much smaller residual; this wouldnt
happen with a squared error. On the right, a histogram of the residual. If one
ignores the extreme residual values, this looks normal. The robust process has been
able to discount the effect of the outliers, without us needing to identify and reject
outliers by hand.
7.2.2 RANSAC: Searching for Good Points

An alternative to modifying the cost function is to search the collection of data
points for good points. This is quite easily done by an iterative process: First, we
choose a small subset of points and fit to that subset, and then we see how many
other points fit to the resulting object. We continue this process until we have a
high probability of finding the structure we are looking for.
For example, assume that we are fitting a line to a dataset that consists of
about 50% outliers. We can fit a line to only two points. If we draw pairs of points
uniformly and at random, then about a quarter of these pairs will consist entirely
of good data points. We can identify these good pairs by noticing that a large
collection of other points lie close to the line fitted to such a pair. Of course, a
better estimate of the line could then be obtained by fitting a line to the points
that lie close to our current line.
? formalized this approach into an algorithmsearch for a random sample
that leads to a fit on which many of the data points agree. The algorithm is
usually called RANSAC, for RANdom SAmple Consensus, and is displayed in Algorithm 7.1. To make this algorithm practical, we need to choose three parameters.
The Number of Samples Required

Our samples consist of sets of points drawn uniformly and at random from
Section 7.2
Robust Regression
182
Determine:
nthe smallest number of points required (e.g., for lines, n = 2,
for circles, n = 3)
kthe number of iterations required
tthe threshold used to identify a point that fits well
dthe number of nearby points required
to assert a model fits well
Until k iterations have occurred
Draw a sample of n points from the data
uniformly and at random
Fit to that set of n points
For each data point outside the sample
Test the distance from the point to the structure
against t; if the distance from the point to the structure
is less than t, the point is close
end
If there are d or more points close to the structure
then there is a good fit. Refit the structure using all
these points. Add the result to a collection of good fits.
end
Use the best fit from this collection, using the
fitting error as a criterion
Algorithm 7.1: RANSAC: Fitting Structures Using Random Sample Consensus.
the dataset. Each sample contains the minimum number of points required to fit
the abstraction of interest. For example, if we wish to fit lines, we draw pairs of
points; if we wish to fit circles, we draw triples of points, and so on. We assume
that we need to draw n data points, and that w is the fraction of these points that
are good (we need only a reasonable estimate of this number). Now the expected
value of the number of draws k required to get one point is given by
E[k] = 1P (one good sample in one draw) +
2P (one good sample in two draws) + . . .
= wn + 2(1 wn )wn + 3(1 wn )2 wn + . . .
= wn
(where the last step takes a little manipulation of algebraic series). We would like
to be fairly confident that we have seen a good sample, so we wish to draw more
than wn samples; a natural thing to do is to add a few standard deviations to this
number. The standard deviation of k can be obtained as
1 wn
.
SD(k) =
wn
An alternative approach to this problem is to look at a number of samples that
Section 7.3
Modelling with Bumps
183
guarantees a low probability z of seeing only bad samples. In this case, we have
(1 wn )k = z,
which means that
log(z)
.
log(1 wn )
It is common to have to deal with data where w is unknown. However, each fitting
attempt contains information about w. In particular, if n data points are required,
then we can assume that the probability of a successful fit is wn . If we observe
a long sequence of fitting attempts, we can estimate w from this sequence. This
suggests that we start with a relatively low estimate of w, generate a sequence
of attempted fits, and then improve our estimate of w. If we have more fitting
attempts than the new estimate of w predicts, the process can stop. The problem
of updating the estimate of w reduces to estimating the probability that a coin
comes up heads or tails given a sequence of fits.
k=
Telling Whether a Point Is Close

We need to determine whether a point lies close to a line fitted to a sample.
We do this by determining the distance between the point and the fitted line, and
testing that distance against a threshold d; if the distance is below the threshold,
the point lies close. In general, specifying this parameter is part of the modeling
process. Obtaining a value for this parameter is relatively simple. We generally need
only an order of magnitude estimate, and the same value applies to many different
experiments. The parameter is often determined by trying a few values and seeing
what happens; another approach is to look at a few characteristic datasets, fitting
a line by eye, and estimating the average size of the deviations.
The Number of Points That Must Agree
Assume that we have fitted a line to some random sample of two data points,
and we need to know whether that line is good. We do this by counting the number
of points that lie within some distance of the line (the distance was determined in
the previous section). In particular, assume that we know the probability that an
outlier lies in this collection of points; write this probability as y. We would like to
choose some number of points t such that the probability that all points near the
line are outliers, y t , is small (say, less than 0.05). Notice that y (1 w) (because
some outliers should be far from the line), so we could choose t such that (1 w)t
is small.
7.3 MODELLING WITH BUMPS
For many data sets, linear models just dont fit very well. However, we expect that
points near a given x will have similar y values. We can exploit this observation in
a variety of ways.
7.3.1 Scattered Data: Smoothing and Interpolation
*** Imagine we have a set of points xi on the plane, with a measured height value yi
for each point. We would like to reconstruct a surface from this data. There are two
Section 7.3
184
Heat map of score 1 for prawn catches

-11.9
L
a
t
i
t
u
d
e
Catch score 1
3
2
1
0
-11.1
142.8
-1
-11
-11.5
-12
144
143.5
143
Longitude
143.9
142.5
FIGURE 7.7: A dataset recording scores of prawn trawls around the Great Barrier
Reef, from http:// www.statsci.org/ data/ oz/ reef.html. There are two scores; this is
score 1. On the left I have plotted the data as a 3D scatter plot. This form of plot
isnt usually very successful, though it helps to make it slightly easier to read if one
supplies vertical lines from each value to zero, and a zero surface. On the right, a
heat map of this data, made by constructing a fine grid, then computing the average
score for each grid box (relatively few boxes get one score, and even fewer get two).
The brightest point corresponds to the highest score; mid-grey is zero (most values),
and dark points are negative. Notice that the scale is symmetric; the reason there is
no very dark point is that the smallest value is considerably larger than the negative
of the largest value. The x and y dimensions are longitude and latitude, and I have
ignored the curvature of the earth, which is pretty small at this scale.
important subcases: interpolation, where we want a surface that passes through
each value; and smoothing, where our surface should be close to the values, but
need not pass through them. This case is easily generalised to a larger number of
dimensions. Particularly common is to have points in 3D, or in space and time.
Although this problem is very like regression, there is an important difference:
we are interested only in the predicted value at each point, rather than in the
conditional distribution. Typical methods for dealing with this problem are very
like regression methods, but typically the probabilistic infrastructure required to
predict variances, standard errors, and the like are not developed. Interpolation and
smoothing problems in one dimension have the remarkable property of being very
different to those in two and more dimensions (if youve been through a graphics
course, youll know that, for example, interpolating splines in 2D are very different
from those in 1D). We will concentrate on the multidimensional case.
Now think about the function y that we wish to interpolate, and assume that
Section 7.3
185
Heat map of score 1 for prawn catches,

interpolated with Gaussian kernel
-11.9
L
a
t
i
t
u
d
e
Catch score 1
5
0
-5
-11
-11.1
142.8
Longitude
143.9
142.5
-11.5
143
-12 144
143.5
FIGURE 7.8: The prawn data of figure 7.7, interpolated with radial basis functions
(in this case, a Gaussian kernel) with scale chosen by cross-validation. On the left,
a surface shown with the 3D scatter plot. On the right, a heat map. I ignored the
curvature of the earth, small at this scale, when computing distances between points.
This figure is a good example of why interpolation is usually not what one wants to
do (text).
x is reasonably scaled, meaning that distances between points are a good guide to
their similarity. There are several ways to achieve this. We could whiten the points
(section 14.5), or we could use our knowledge of the underlying problem to scale
the different features relative to one another. Once we have done this properly, we
expect the similarity between y(u) and y(v) to depend on the distance between
these points (i.e. || u v ||) rather than on the direction of the vector joining them
(i.e. u v). Furthermore, we expect that the dependency should decline with
increasing || x xi ||. In most problems, we dont know how quickly the weights
should decline with increasing distance, and it is usual to have a scaling parameter
to handle this. The scaling parameter will need to be selected.
All this suggests constructing an interpolate using kernel function. A kernel
function
K(u) is a non-negative function such that (a) K(u) = K(u) and (b)
R
K(u)du
= 1. Widely used kernel functions are:
The Gaussian kernel, K(u) =

pact support.
1
2
The Epanechnikov kernel, K(u) =

tiable at u = 1 and at u = 1.
The Logistic kernel, K(u) =
exp u2 . Notice this doesnt have com3

4 (1
u2 )I[|u|1] . This isnt differen-
1
exp u+exp u+2 .
This doesnt have compact
Section 7.3
186
Heat map of score 1 for prawn catches,

smoothed, Gaussian kernel, scale by cross-validation
-11.9
L
a
t
i
t
u
d
e
Catch score 1
4
2
0
-11.1
-2
-11
142.8
Longitude
143.9
-11.5
-12 144
143.5
143
142.5
FIGURE 7.9: The prawn data of figure 7.7, smoothed with radial basis functions (in
this case, a Gaussian kernel) with scale chosen by cross-validation. On the left,
a surface shown with the 3D scatter plot. On the right, a heat map. I ignored
the curvature of the earth, small at this scale, when computing distances between
points. I used 60 basepoints, constructed by choosing 60 of 155 training points at
random, then adding a small random offset.
support, either.
The Quartic kernel, K(u) =
15
16 (1
u2 )2 I[|u|1] .
You should notice that each is a bump function its large at u = 0, and falls
away as | u | increases. It follows from the two properties above that, for h > 0, if
K(u) is a kernel function, then K(u; h) = h1 K( hu ) is also a kernel function. This
means we can vary the width of the bump at the origin in a natural way by choice
of h; this is usually known as the scale of the function.
We choose a kernel function K, then build a function

R
X
|| x bj ||
aj K
y(x) =
h
j=1
where bj is a set of R points which dont have to be training points. These are
sometimes referred to as base points. You can think of this process as placing a
weighted bump at each base point. Consider the values that this function takes at
the training points xi . We have
y(xi ) =
R
X
j=1
aj K
|| xi bj ||
h
Section 7.3
187
P
2
and we would like to minimize
i (yi y(xi )) . We can rewrite this with the
aid of some linear
algebra. Write G for the Gram matrix, whose i, jth entry

||xi bj||
; write Y for the vector whose ith component is yi ; and a for the
is K
h
vector whose jth component is aj . Then we want to minimize
(Y Ga)T (Y Ga).
There are a variety of cases. For interpolation, we choose the base points to be the
same as the training points. A theorem of Micchelli guarantees that for a kernel
function that is (a) a function of distance and (b) monotonically decreasing with
distance, G will have full rank. Then we must solve
Y = Ga
(Interpolation)
for a and we will obtain a function that passes through each training point. For
smoothing, we may choose any set of R < N basepoints, though its a good idea to
choose points that are close to the training points. Cluster centers are one useful
choice. We then solve the least-squares problem by solving
G T Y = G T Ga
(Smoothing)
for a. In either case, we choose the scale h by cross-validation. We do this by:

selecting a set of scales; holding out some training points, and interpolating (resp.
smoothing) the values of the others; then computing the error of the predictions
for these held-out points. The error is usually the square of the residual, and its
usually a good idea to average over many points when doing this.
Both interpolation and smoothing can present significant numerical challenges. If the dataset is large, G will be large. For values of h that are large, G
(resp. G T G) is usually very poorly conditioned, so solving the interpolation (resp.
smoothing) system accurately is hard. This problem can usually be alleviated by
adding a small constant () times an identity matrix (I) in the appropriate spoat.
So for interpolation, we solve
Y = (G + I)a
(Interpolation)
for a and we will obtain a function that passes through each training point. Similarly, for smoothing, we solve
G T Y = (G T G + I)a
(Smoothing)
for a. Usually, we choose lambda to be small enough to make the linear algebra
work (1e 9 works for me), and ignore it.
As Figure ?? suggests, interpolation isnt really as useful as you might think.
Most measurements arent exactly right, or exactly repeatable. If you look closely
at the figure, youll see one location where there are two scores; this is entirely to
be expected for the score of a prawn trawl at a particular spot in the ocean. This
creates problems for interpolation; G must be rank deficient, because two rows will
be the same. Another difficulty is that the scores look wiggly moving a short
distance can cause the score to change quite markedly. This is likely the effect of
Section 7.3
188
luck in trawling, rather than any real effect. The interpolating method chooses a
very short scale, because this causes the least error in cross-validation, caused by
predicting zero at the held out point (which is more accurate than any prediction
available at any longer scale). The result is an entirely implausible model.
Now look at Figure 7.9. The smoothed surface is a reasonable guide to the
scores; in the section of ocean where scores tend to be large and positive, so is the
smoothed surface; where they tend to be negative, the smoothed surface is negative,
too.
7.3.2 Density Estimation
One specialized application of kernel functions related to smoothing is density
estimation. Here we have a set of N data points xi which we believe to be IID
samples from some p(X), and we wish to estimate the probability density function
p(X). In the case that we have a parametric model for p(X), we could do so by
estimating the parameters with (say) maximum likelihood. But we may not have
such a model, or we may have data that does not comfortably conform to any
model.
A natural, and important, model is to place probability 1/N at each data
point, and zero elsewhere. This is sometimes called an empirical distribution.
However, this model implies that we can only ever see the values we have already
seen, which is often implausible or incovenient. We should like to smooth this
model. If the xi have low enough dimension, we can construct a density estimate
with kernels in a straightforward way.
R Recall that a kernel function is non-negative, and has the property that
K(u)du = 1. This means that if
y(x) =
R
X
j=1
aj K
|| x bj ||
h
we have
Z
y(x)dx =
R
X
aj .
j=1
Now imagine we choose a basepoint at each data point, and we choose aj = 1/N
for all j. The resulting function is non-negative, and integrates to one, so can be
seen as a probability density function. We are placing a bump function of scale h,
weighted by 1/N on top of each data point. If there are many data points close
together, these bump functions will tend to reinforce one another and the resulting
function will be large in such regions. The function will also be small in gaps
between data points that are large compared to h. The resulting model captures
the need to (a) represent the data and (b) allow values that have not been observed
to have non-zero probability.
Choosing h using cross-validation is straightforward. We choose the h that
maximises the log-likelihood of omitted test data, averaged over folds. There is
one nuisance effect here to be careful of. If you use a kernel that has finite support,
you could find that an omitted test item has zero probability; this leads to trouble
Section 7.3
189
Density estimate for reported locations of prawn catches,

-11.9
L
a
t
i
t
u
d
e
-11
-11.2
-11.4
-11.6
-11.1
-11.8
-12
142.5
142.8
143
143.5
Longitude
143.9
144
FIGURE 7.10: The prawn data of figure 7.7, now shown on the left as a scatter
plot of locations from which scores were reported. The os correspond to negative
scores, and the +s to positive scores. On the right, a density estimate of the
probability distribution from which the fishing locations were drawn, where lighter
pixels correspond to larger density values. I ignored the curvature of the earth, small
at this scale, when computing distances between points.
with logarithms, etc. You could avoid this by obtaining an initial scale estimate
with a kernel that has infinite support, then refining this with a kernel with finite
support.
*** example *** careful about how kernel functions scale with dimension
7.3.3 Kernel Smoothing
We expect that, if x is near a training example xi , then y(x) will be similar to
yi . This suggests constructing an estimate of y(x) as a weighted sum of the values
at nearby examples. Write W (x, xi ) for the weight applied to yi when estimating
y(x). Then our estimate of y(x) is
N
X
yi W (x, xi ).
i=1
We need good choices of W (x, xi ). There are some simple, natural constraints we
can impose. We should like y to be a convex combination of the observed values.
This means we
P want W (x, xi ) to be non-negative (so we can think of them as
weights) and i W (x, xi ) = 1
Section 7.3
190
Density estimate for reported locations of prawn catches

with negative value of S1,
Density estimate for reported locations of prawn catches

with positive value of S1,
-11.9
-11.9
L
a
t
i
t
u
d
e
L
a
t
i
t
u
d
e
-11.1
-11.1
142.8
Longitude
143.9
142.8
Longitude
143.9
FIGURE 7.11: Further plots of the prawn data of figure 7.7. On the left, a density estimate of the probability distribution from which the fishing locations which
achieved positive values of the first score were drawn, where lighter pixels correspond to larger density values. On the right, a density estimate of the probability
distribution from which the fishing locations which achieved negative values of the
first score were drawn, where lighter pixels correspond to larger density values. I
ignored the curvature of the earth, small at this scale, when computing distances
between points.
Assume we have a kernel function K(u). Then a natural choice of weights is

i||
K( ||xx
hi )
W (x, xi ; hi ) = Pk
||xxj||
j=1 K( hi )
where I have expanded the notation for the weight function to keep track of the
scaling parameter, hi , which we will need to select. Notice that, at each data point,
we are using a kernel of different width to set weights. This should seem natural to
you. If there are few points near a given data point, then we should have weights
that vary slowly, so that we can for a weighted average of those points. But if
there are many points nearby, we can have weights that vary fast. The weights
are non-negative because we are using a non-negative kernel function. The weights
sum to 1, because we divide at each point by the sum of all kernels evaluated at
that point. For reference, this gives the expression
N
i||
X
K( ||xx
)
hi
y(x) =
yi Pk
||xxj||
i=1
j=1 K( hj )
Section 7.4
Exploiting Your Neighbors for Regression
191
Posterior estimate of p(S1>0|x) as a function of position x

for prawn data
-11.9
L
a
t
i
t
u
d
e
-11.1
142.8
Longitude
143.9
FIGURE 7.12: An estimate of the posterior probability of obtaining a positive value
of the first catch score for prawns, as a function of position, from the prawn data
of figure 7.7, using the density estimate of figure ??. L ighter pixels correspond to
larger density values. I ignored the curvature of the earth, small at this scale, when
computing distances between points.
Changing the hi will change the radius of the bumps, and will cause more (or
fewer) points to have more (or less) influence on the shape of y(x). Selecting the
hi is easy in principle. We search for a set of values that minimizes cross-validation
error. In practice, this takes an extremely extensive search involving a great deal
of computation, particularly if there are lots of points or the dimension is high.
For the examples of this section, I used the R package np (see appendix for code
samples).
*** robustness isnt really an issue here - copes rather well with outliers ***
not completely correct - locfit will use an alternative *** for that, you need to
discuss biweight
*** distinction between (a) the standard error of prediction and (b) the sd of
predictive distribution
*** you could get away with poor scaling of x if you scale each dimension
separately
*** Curse of dimension and multiple here
*** appendix to each chapter - R CODE for each figure?
7.4 EXPLOITING YOUR NEIGHBORS FOR REGRESSION
TODO: work in local polynomial stuff
Nearest neighbors can clearly predict a number for a query example you find
Section 7.4

Kernel smoothed regression
90
70
75
200
80
85
Temp
600
400
Weight
800
95
1000

Kernel smoothed regression
192
10
20
30
Length
40
14
15
16
17
18
19
20
Freq
FIGURE 7.13: Non parametric regressions for the datasets of Figures 6.1 and 6.4.
The curve shows the expected value of the independent variable for each value of
the explanatory variable. The vertical bars show the standard error of the predicted
density for the independent variable, for each value of the explanatory variable.
Notice that, as one would expect, the standard deviation is smaller closer to data
points, and larger further away. On the left, the perch data. On the right, the
cricket data.
the closest training example, and report its number. This would be one way to
use nearest neighbors for regression, but it isnt terribly effective. One important
difficulty is that the regression prediction is piecewise constant (Figure 7.16). If
there is an immense amount of data, this may not present major problems, because
the steps in the prediction will be small and close together. But its not generally
an effective use of data.
A more effective strategy is to find several nearby training examples, and use
them to produce an estimate. This approach can produce very good regression
estimates, because every prediction is made by training examples that are near to
the query example. However, producing a regression estimate is expensive, because
for every query one must find the nearby training examples.
Write x for the query point, and assume that we have already collected the
N nearest neighbors, which we write xi . Write yi for the value of the dependent
variable for the ith of these points. Notice that some of these neighbors could be
quite far from the query point. We dont want distant points to make as much
contribution to the model as nearby points. This suggests forming a weighted
average of the predictions of each point. Write wi for the weight at the ith point.
Then the estimate is
P
wi yi
ypred = Pi
.
i wi
A variety of weightings are reasonable choices. Write di = || (x xi ) || for
the distance between the query point and the ith nearest neighbor. Then inverse
Section 7.4
15
Residuals against predicted values

Weight against all, gaussian kernel,
all points
193
Residuals against predicted values

Weight against all, gaussian kernel,
all points
Residual
Residual
10
100
150
200
250
300
350
400
100
150
Predicted value
200
250
300
Predicted value
FIGURE 7.14:
distance weighting uses wi = 1/di . Alternatively, we could use an exponential

function to strongly weight down more distant points, using
2
di
wi = exp
.
2 2
We will need to choose a scale , which can be done by cross-validation. Hold out
some examples, make predictions at the held out examples using a variety of different scales, and choose the scale that gives the best held-out error. Alternatively,
if there are enough nearest neighbors, we could form a distance weighted linear
regression, then predict the value at the query point from that regression.
Each of these strategies presents some difficulties when x has high dimension.
In that case, it is usual that the nearest neighbor is a lot closer than the second
nearest neighbor. If this happens, then each of these weighted averages will boil
down to evaluating the dependent variable at the nearest neighbor (because all the
others will have very small weight in the average).
Remember this:
Nearest neighbors can be used for regression. In the
simplest approach, you find the nearest neighbor to your feature vector, and
take that neighbors number as your prediction. More complex approaches
smooth predictions over multiple neighbors.
**** link this to non parametric and kernel methods
Section 7.4

Weight vs height in bodyfat data
Kernel smoothed regression,
all points
200
250
Weight vs height in bodyfat data

Kernel smoothed regression,
all points
194
180
Weight
160
150
170
Weight
200
190
RMSE=25.5
100
RMSE=25.5
30
40
50
60
70
66
68
Height
70
72
74
Height
FIGURE 7.15:
7.4.1 Local Polynomial Regression

Imagine wish to predict a value at a point x0 . Write y0 for the value we predict.
A reasonable choice would be to choose a number that minimizes a least squares
difference to the training values, weighted in some way. We would like weights to
be large for training points that are close to the test point, and small for points
that are far away. We could achieve this using a kernel function, and choose the yo
to be the value of 0 that minimizes
X
i
(yi 0 )2 K(
|| x0 xi ||
).
hi
If you differentiate and set to zero, etc., you will find you have the familiar expression
for kernel regression
!
N
i||
X
K( ||x0hx
)
i
yi Pk
y0 =
.
||x0 xj||
)
i=1
j=1 K(
hi
This suggests something really fruitful. You can read the equation
X
i
(yi 0 )2 K(
|| x0 xi ||
)
hi
as choosing a local function at each xi such that the weighted errors are minimized.
The local function is constant (the value of 0 ), but doesnt have to be. Instead, it
could be polynomial. A polynomial function about the point xi
7.4.2 Using your Neighbors to Predict More than a Number
Linear regression takes some features and predicts a number. But in practice, one
often wants to predict something more complex than a number. For example, I
Section 7.4
Nearest Neighbor Regression
195
Nearest Neighbor Regression, 40 pts, k=5

1
Inverse Dist
Exp si=0.1
Exp si=0.5
Exp si=1
0.8
0.8
0.6
0.4
0.4
Dependent variable
Dependent variable
0.6
0.2
0
0.2
0.4
0.2
0
0.2
0.4
0.6
0.6
0.8
0.8
1
6
0
2
Explanatory variable
1
Inverse Dist
Exp si=0.1
Exp si=0.5
Exp si=1
0.8
0.6
Inverse Dist
Exp si=0.1
Exp si=0.5
Exp si=1
0.8
0.6
0.4
0.4
Dependent variable
Dependent variable
0
2
0.2
0
0.2
0.2
0
0.2
0.4
0.4
0.6
0.6
0.8
1
0.8
6
0
2
0
2
FIGURE 7.16: Different forms of nearest neighbors regression, predicting y from a
one-dimensional x, using a total of 40 training points. Top left: reporting the

nearest neighbor leads to a piecewise constant function. Top right: improvements
are available by forming a weighted average of the five nearest neighbors, using inverse distance weighting or exponential weighting with three different scales. Notice
if the scale is small, then the regression looks a lot like nearest neighbors, and if it
is too large, all the weights in the average are nearly the same (which leads to a
piecewise constant structure in the regression). Bottom left and bottom right
show that using more neighbors leads to a smoother regression.
might want to predict a parse tree (which has combinatorial structure) from a
sentence (the explanatory variables). As another example, I might want to predict
a map of the shadows in an image (which has spatial structure) against an image
(the explanatory variables). As yet another example, I might want to predict which
direction to move the controls on a radio-controlled helicopter (which have to be
moved together) against a path plan and the current state of the helicopter (the
explanatory variables).
Looking at neighbors is a very good way to solve such problems. The general
strategy is relatively simple. We find a large collection of pairs of training data.
Write xi for the explanatory variables for the ith example, and yi for the dependent
variable in the ith example. This dependent variable could be anything it
doesnt need to be a single number. It might be a tree, or a shadow map, or a
word, or anything at all. I wrote it as a vector because I needed to choose some
Section 7.4
196
notation.
In the simplest, and most general, approach, we obtain a prediction for a new
set of explanatory variables x by (a) finding the nearest neighbor and then (b)
producing the dependent variable for that neighbor. We might vary the strategy
slightly by using an approximate nearest neighbor. If the dependent variables
have enough structure that it is possible to summarize a collection of different
dependent variables, then we might recover the k nearest neighbors and summarize
their dependent variables. How we summarize rather depends on the dependent
variables. For example, it is a bit difficult to imagine the average of a set of
trees, but quite straightforward to average images. If the dependent variable was
a word, we might not be able to average words, but we can vote and choose the
most popular word. If the dependent variable is a vector, we can compute either
distance weighted averages or a distance weighted linear regression.
Matched
Images
Patch to be replaced
Final
composited
image
Initial image
FIGURE 7.17: We can fill large holes in images by matching the image to a collection,
choosing one element of the collection, then cutting out an appropriate block of pixels
and putting them into the hole in the query image. In this case, the hole has been
made by an artist, who wishes to remove the roofline from the view. Notice how
there are a range of objects (boats, water) that have been inserted into the hole.
These objects are a reasonable choice, because the overall structures of the query
and matched image are largely similar getting an image that matches most of
the query image supplies enough context to ensure that the rest makes sense.
Example: Filling Large Holes with Whole Images Many different

kinds of user want to remove things from images or from video. Art directors
might like to remove unattractive telephone wires; restorers might want to remove
scratches or marks; theres a long history of government officials removing people
Section 7.4
197
with embarrassing politics from publicity pictures (see the fascinating examples
in ?); and home users might wish to remove a relative they dislike from a family
picture. All these users must then find something to put in place of the pixels that
were removed.
If one has a large hole in a large image, we may not be able to just extend
a texture to fill the hole. Instead, entire objects might need to appear in the hole
(Figure 7.17). There is a straightforward, and extremely effective, way to achieve
this. We match the image to a large collection of images, to find the nearest
neighbors (the details of the distance function are below). This yields a set of
example images where all the pixels we didnt want to replace are close to those of
the query image. From these, we choose one, and fill in the pixels from that image.
There are several ways to choose. If we wish to do so automatically, we could
use the example with the smallest distance to the image. Very often, an artist is
involved, and then we could prepare a series of alternatives using, perhaps, the
k closest examples then show them to the artist, who will choose one. This
method, which is very simple to describe, is extremely effective in practice.
It is straightforward to get a useful distance between images. We have an
image with some missing pixels, and we wish to find nearby images. We will
assume that all images are the same size. If this isnt in fact the case, we could
either crop or resize the example images. A good measure of similarity between two
images A and B can be measured by forming the sum of squared differences (or
SSD) of corresponding pixel values. You should think of an image as an array of
pixels. If the images are grey-level images, then each pixel contains a single number,
encoding the grey-level. If the images are color images, then each pixel (usually!)
contains three values, one encoding the red-level, one encoding the green-level, and
one encoding the blue-level. The SSD is computed as
X
(Aij Bij )2
(i,j)
where i and j range over all pixels. If the images are grey-level images, then by
(Aij Bij )2 , I mean the squared difference between grey levels; if they are color
images, then this means the sum of squared differences between red, green and blue
values. This distance is small when the images are similar, and large when they are
different (it is essentially the length of the difference vector).
Now we dont know some of the pixels in the query image. Write K for the
set of pixels around a point whose values are known, and K for the size of this set.
We can now use
1 X
(Aij Bij )2 .
K
(i,j)K
Filling in the pixels requires some care. One does not usually get the best
results by just copying the missing pixels from the matched image into the hole.
Instead, it is better to look for a good seam. We search for a curve enclosing the
missing pixels which (a) is reasonably close to the boundary of the missing pixels
and (b) gives a good boundary between the two images. A good boundary is one
where the query image (on one side) is similar to the matched image (on the
Section 7.5
Bayesian Regression
198
other side). A good sense of similarity requires that pixels match well, and that
image gradients crossing the boundary tend to match too.
Remember this:
Nearest neighbors can be used to predict more than
numbers. Examples include parse trees, blocks of pixels, and so on.
7.5 BAYESIAN REGRESSION

To predict a y value for x with our current linear regression procedure, we take a
training data set, produce an estimate of , then use that estimate to predict
This is simple, and often effective, but it ignores one important
the value xT .
phenomenon. Our estimate of might not be right, and errors in will result in
errors in the predicted y. We would like to account for these errors. One way to
do so is to work with a posterior probability distribution on .
7.5.1 Tricks with Normal Distributions
At this point, it is helpful to know some useful tricks with normal distributions. It
is common to write y|x N (, ) to mean that p(y|x) is a normal distribution, with
mean and covariance . Normal distributions are nice, because many calculations
are in fact quite simple.
T
Projecting a normal distribution: Assume that xT = (u, v) is a normal
random vector, with mean = (u , v ) and covariance . Write

A BT
= 1
B C
(where the blocks correspond to the u, v decomposition of x). We would like to
construct
Z
p(u) = p(u, v)dv.
Assume also that v is dv dimensional. I will do this example in detail, as a model

for the next ones. First, we will change variables, writing m = (u u ) and
n = (v v ). We then seek
Z
p(m) =
p(m, n)dn
D
dv
where D is all values of n, so R (you should check nothing interesting has happened with the change of variables). Now we have
2 log p(m, n) = (x )T 1 (x ) + K
= (u u )T A(u u ) + 2(u u )T B T (v v ) + (v v )T C(v v )

= mT Am + 2mT B T n + nT Cn
= mT (A B T C 1 B)m + (n C 1 Bm)T C(n C 1 Bm).
Section 7.5
Bayesian Regression
199
(where you should check the last step, which I got by completing the square; you
could expand out the expression). Now write p(m, n) = f (m)g(n C 1 Bm) where
2 log g(u) = uT Cu
R
I claim g(n C 1 Bm)dn does not depend on m. Choose some particular value of
m, and compute the integral for that value, recalling that the domain of the integral
is over all values of n the domain does not change with m. Different choices
of m shift the peak of the integrand, but do not
R integral.
R change the value of the
Bm
)dn
=
Equivalently, for some fixed m = m0 , note that D g(nC 1
0
D g(w)dw,
R
where w = n C 1 Bm0 . This is the same for all m0 , so g(n C 1 Bm)dn does
not depend on m. As a result
2 log p(u) = (u u )T (A B T C 1 B)(u u ) + K
so p(u) is normal with mean u . Obtaining its covariance requires a very little
more work. Write

uu Tvu
=
vu vv
and multiply this by 1 , yielding

I
Auu + B T vu ATvu + B T vv
==
0
Buu + Cvu
BTvu + Cvv
0
I
from which you can derive (A B T C 1 B) = uu .

Assume x N (0 , 0 ). Assume also that y|x N (Ax, 1 ). Notice that another
way to write this is y = Ax + , where is a normal random vector with mean 0
and covariance 1 ). We would like to recover p(x|y). We can do so by chasing
Section 7.6
You should
200
7.6 YOU SHOULD

7.6.1 remember:
New term: irreducible error . . . . . . . . . . .
New term: variance . . . . . . . . . . . . . . . .
New term: bias . . . . . . . . . . . . . . . . . .
New term: AIC . . . . . . . . . . . . . . . . . .
New term: BIC . . . . . . . . . . . . . . . . . .
New term: forward stagewise regression . . . .
New term: Backward stagewise regression . . .
New term: error cost . . . . . . . . . . . . . . .
New term: L2 regularized error . . . . . . . . .
New term: lasso . . . . . . . . . . . . . . . . .
New term: elastic net . . . . . . . . . . . . . .
New term: residual . . . . . . . . . . . . . . . .
New term: Huber loss . . . . . . . . . . . . . .
New term: iteratively reweighted least squares
New term: MAD . . . . . . . . . . . . . . . . .
New term: median absolute deviation . . . . .
New term: Hubers proposal 2 . . . . . . . . . .
New term: interpolation . . . . . . . . . . . . .
New term: smoothing . . . . . . . . . . . . . .
New term: kernel function . . . . . . . . . . . .
New term: base points . . . . . . . . . . . . . .
New term: Gram matrix . . . . . . . . . . . . .
New term: density estimation . . . . . . . . . .
New term: empirical distribution . . . . . . . .
Nearest neighbors can be used for regression. .
New term: sum of squared differences . . . . .
New term: seam . . . . . . . . . . . . . . . . .
Nearest neighbors can predict structures. . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
162
162
162
163
163
165
165
165
165
168
168
170
172
173
174
174
174
178
178
179
180
181
182
182
187
191
191
192
Section 7.6
You should
201
APPENDIX: DATA
Batch A
Amount of Time in
Hormone
Service
25.8
99
20.5
152
14.3
293
23.2
155
20.6
196
31.1
53
20.9
184
20.9
171
30.4
52
Batch B
Amount of Time in
Hormone
Service
16.3
376
11.6
385
11.8
402
32.5
29
32.0
76
18.0
296
24.1
151
26.5
177
25.8
209
Batch C
Amount of Time in
Hormone
Service
28.8
119
22.0
188
29.7
115
28.9
88
32.8
58
32.5
49
25.4
150
31.7
107
28.5
125
TABLE 7.1: A table showing the amount of hormone remaining and the time in
service for devices from lot A, lot B and lot C. The numbering is arbitrary (i.e.
theres no relationship between device 3 in lot A and device 3 in lot B). We expect
that the amount of hormone goes down as the device spends more time in service,
so cannot compare batches just by comparing numbers.
Section 7.6
You should
202
PROBLEMS
Blood pressure against age
Systolic blood pressure
250
200
150
100
0
20
40
60
Age in years
80
FIGURE 7.18: A regression of blood pressure against age, for 30 data points.
7.1. Figure 7.18 shows a linear regression of systolic blood pressure against age.
There are 30 data points.
(a) Write ei = yi xT
i for the residual. What is the mean ({e}) for this
regression?
(b) For this regression, var ({y}) = 509 and the R2 is 0.4324. What is var ({e})
for this regression?
(c) How well does the regression explain the data?
(d) What could you do to produce better predictions of blood pressure (without actually measuring blood pressure)?
7.2. In this exercise, I will show that the prediction process of chapter 12(see
page 251) is a linear regression with two independent variables. Assume we
have N data items which are 2-vectors (x1 , y1 ), . . . , (xN , yN ), where N > 1.
These could be obtained, for example, by extracting components from larger
vectors. As usual, we will write x
i for xi in normalized coordinates, and so on.
The correlation coefficient is r (this is an important, traditional notation).
(a) Show that r = mean ({(x mean ({x}))(y mean ({y}))})/(std (x)std (y)).
std(y)
(b) Now write s = std(x) . Now assume that we have an xo , for which we wish
to predict a y value. Show that the value of the prediction obtained using
the method of page 252 is
sr(x0 mean ({x})) + mean ({y}).
(c) Show that sr = mean ({(xy)}) mean ({x})mean ({y}).
(d) Now write
y1
x1
1
1
y
x
and Y = 2 .
X = 2
... ...
...
xn
1
yn
where X T X = X T Y.
The coefficients of the linear regression will be ,
Show that
T
X X =N
mean x2
mean ({x})

mean ({x})
1
Section 7.6
(e) Now show that var ({x}) = mean
(x mean ({x}))2
You should
= mean
203
2
x
mean ({x})2 .
(f ) Now show that std (x)std (y)corr ({(x, y)}) = mean ({(x mean ({x}))(y mean ({y}))}).
C H A P T E R
Classification II
8.1 LOGISTIC REGRESSION
8.2 NEURAL NETS
8.3 CONVOLUTION AND ORIENTATION FEATURES
8.4 CONVOLUTIONAL NEURAL NETWORKS
204
C H A P T E R
Boosting
9.1 GRADIENTBOOST
*** for classification *** for regression *** in each case with trees?
9.2 ADABOOST
**?
205
C H A P T E R
10
Some Important Models

10.1 HMMS
10.2 CRFS
10.3 FITTING AND INFERENCE WITH MCMC?
206
C H A P T E R
11
Background: First Tools for Looking

at Data
The single most important question for a working scientist perhaps the
single most useful question anyone can ask is: whats going on here? Answering
this question requires creative use of different ways to make pictures of datasets,
to summarize them, and to expose whatever structure might be there. This is an
activity that is sometimes known as Descriptive Statistics. There isnt any fixed
recipe for understanding a dataset, but there is a rich variety of tools we can use
to get insights.
11.1 DATASETS
A dataset is a collection of descriptions of different instances of the same phenomenon. These descriptions could take a variety of forms, but it is important that
they are descriptions of the same thing. For example, my grandfather collected
the daily rainfall in his garden for many years; we could collect the height of each
person in a room; or the number of children in each family on a block; or whether
10 classmates would prefer to be rich or famous. There could be more than
one description recorded for each item. For example, when he recorded the contents of the rain gauge each morning, my grandfather could have recorded (say)
the temperature and barometric pressure. As another example, one might record
the height, weight, blood pressure and body temperature of every patient visiting
a doctors office.
The descriptions in a dataset can take a variety of forms. A description could
be categorical, meaning that each data item can take a small set of prescribed
values. For example, we might record whether each of 100 passers-by preferred to
be Rich or Famous. As another example, we could record whether the passersby are Male or Female. Categorical data could be ordinal, meaning that we
can tell whether one data item is larger than another. For example, a dataset giving
the number of children in a family for some set of families is categorical, because it
uses only non-negative integers, but it is also ordinal, because we can tell whether
one family is larger than another.
Some ordinal categorical data appears not to be numerical, but can be assigned a number in a reasonably sensible fashion. For example, many readers will
recall being asked by a doctor to rate their pain on a scale of 1 to 10 a question
that is usually relatively easy to answer, but is quite strange when you think about
it carefully. As another example, we could ask a set of users to rate the usability
of an interface in a range from very bad to very good, and then record that
using -2 for very bad, -1 for bad, 0 for neutral, 1 for good, and 2 for very
good.
Many interesting datasets involve continuous variables (like, for example,
207
Section 11.1
Datasets
208
height or weight or body temperature) when you could reasonably expect to encounter any value in a particular range. For example, we might have the heights of
all people in a particular room; or the rainfall at a particular place for each day of
the year; or the number of children in each family on a list.
You should think of a dataset as a collection of d-tuples (a d-tuple is an
ordered list of d elements). Tuples differ from vectors, because we can always add
and subtract vectors, but we cannot necessarily add or subtract tuples. We will
always write N for the number of tuples in the dataset, and d for the number of
elements in each tuple. The number of elements will be the same for every tuple,
though sometimes we may not know the value of some elements in some tuples
(which means we must figure out how to predict their values, which we will do
much later).
Index
1
2
3
4
5
6
7
8
9
10
net worth
100, 360
109, 770
96, 860
97, 860
108, 930
124, 330
101, 300
112, 710
106, 740
120, 170
Index
1
2
3
4
5
6
7
8
9
10
Taste score
12.3
20.9
39
47.9
5.6
25.9
37.3
21.9
18.1
21
Index
11
12
13
14
15
16
17
18
19
20
Taste score
34.9
57.2
0.7
25.9
54.9
40.9
15.9
6.4
18
38.9
TABLE 11.1: On the left, net worths of people you meet in a bar, in US $; I made
this data up, using some information from the US Census. The index column,
which tells you which data item is being referred to, is usually not displayed in a
table because you can usually assume that the first line is the first item, and so
on. On the right, the taste score (Im not making this up; higher is better) for 20
different cheeses. This data is real (i.e. not made up), and it comes from http://
lib.stat.cmu.edu/ DASL/ Datafiles/ Cheese.html.
Each element of a tuple has its own type. Some elements might be categorical.
For example, one dataset we shall see several times has entries for Gender; Grade;
Age; Race; Urban/Rural; School; Goals; Grades; Sports; Looks; and Money for
478 children, so d = 11 and N = 478. In this dataset, each entry is categorical
data. Clearly, these tuples are not vectors because one cannot add or subtract (say)
Gender, or add Age to Grades.
Most of our data will be vectors. We use the same notation for a tuple and
for a vector. We write a vector in bold, so x could represent a vector or a tuple
(the context will make it obvious which is intended).
The entire data set is {x}. When we need to refer to the ith data item, we
write xi . Assume we have N data items, and we wish to make a new dataset out
of them; we write the dataset made out of these items as {xi } (the i is to suggest
you are taking a set of items and making a dataset out of them).
In this chapter, we will work mainly with continuous data. We will see a
Section 11.2
Whats Happening? - Plotting Data
209
variety of methods for plotting and summarizing 1-tuples. We can build these
plots from a dataset of d-tuples by extracting the rth element of each d-tuple.
All through the book, we will see many datasets downloaded from various web
sources, because people are so generous about publishing interesting datasets on
the web. In the next chapter, we will look at 2-dimensional data, and we look at
high dimensional data in chapter 3.
11.2 WHATS HAPPENING? - PLOTTING DATA
The very simplest way to present or visualize a dataset is to produce a table. Tables
can be helpful, but arent much use for large datasets, because it is difficult to get
any sense of what the data means from a table. As a continuous example, table 11.1
gives a table of the net worth of a set of people you might meet in a bar (I made
this data up). You can scan the table and have a rough sense of what is going on;
net worths are quite close to $ 100, 000, and there arent any very big or very small
numbers. This sort of information might be useful, for example, in choosing a bar.
People would like to measure, record, and reason about an extraordinary
variety of phenomena. Apparently, one can score the goodness of the flavor of
cheese with a number (bigger is better); table 11.1 gives a score for each of thirty
cheeses (I did not make up this data, but downloaded it from http://lib.stat.cmu.
edu/DASL/Datafiles/Cheese.html). You should notice that a few cheeses have very
high scores, and most have moderate scores. Its difficult to draw more significant
conclusions from the table, though.
Gender
boy
boy
girl
girl
girl
girl
girl
girl
girl
girl
Goal
Sports
Popular
Popular
Popular
Popular
Popular
Popular
Grades
Sports
Sports
Gender
girl
girl
boy
boy
boy
girl
girl
girl
girl
girl
Goal
Sports
Grades
Popular
Popular
Popular
Grades
Sports
Popular
Grades
Sports
TABLE 11.2: Chase and Dunner (?) collected data on what students thought made
other students popular. As part of this effort, they collected information on (a) the
gender and (b) the goal of students. This table gives the gender (boy or girl)
and the goal (to make good grades Grades; to be popular Popular; or to
be good at sports Sports). The table gives this information for the first 20
of 478 students; the rest can be found at http:// lib.stat.cmu.edu/ DASL/ Datafiles/
PopularKids.html. This data is clearly categorical, and not ordinal.
Table 11.2 shows a table for a set of categorical data. Psychologists collected
data from students in grades 4-6 in three school districts to understand what factors students thought made other students popular. This fascinating data set can
be found at http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html, and was pre-
Section 11.2
210
pared by Chase and Dunner (?). Among other things, for each student they asked
whether the students goal was to make good grades (Grades, for short); to be
popular (Popular); or to be good at sports (Sports). They have this information for 478 students, so a table would be very hard to read. Table 11.2 shows the
gender and the goal for the first 20 students in this group. Its rather harder to
draw any serious conclusion from this data, because the full table would be so big.
We need a more effective tool than eyeballing the table.
Number of children of each gender
Number of children choosing each goal
300
250
250
200
200
150
150
100
100
50
50
0
boy
girl
Sports
Grades
Popular
FIGURE 11.1: On the left, a bar chart of the number of children of each gender in
the Chase and Dunner study (). Notice that there are about the same number of
boys and girls (the bars are about the same height). On the right, a bar chart of
the number of children selecting each of three goals. You can tell, at a glance, that
different goals are more or less popular by looking at the height of the bars.
11.2.1 Bar Charts

A bar chart is a set of bars, one per category, where the height of each bar is
proportional to the number of items in that category. A glance at a bar chart often
exposes important structure in data, for example, which categories are common, and
which are rare. Bar charts are particularly useful for categorical data. Figure 11.1
shows such bar charts for the genders and the goals in the student dataset of Chase
and Dunner (). You can see at a glance that there are about as many boys as girls,
and that there are more students who think grades are important than students
who think sports or popularity is important. You couldnt draw either conclusion
from Table 11.2, because I showed only the first 20 items; but a 478 item table is
very difficult to read.
11.2.2 Histograms
Data is continuous when a data item could take any value in some range or set of
ranges. In turn, this means that we can reasonably expect a continuous dataset
contains few or no pairs of items that have exactly the same value. Drawing a bar
chart in the obvious way one bar per value produces a mess of unit height
bars, and seldom leads to a good plot. Instead, we would like to have fewer bars,
each representing more data items. We need a procedure to decide which data
Section 11.2
4
3
2
1
0
0.95
1.05 1.1 1.15 1.2

Net worth, in $100, 000s
1.25
211
Histogram of cheese goodness score for 30 cheeses

14
Number of data items
Number of data items
Histogram of net worth for 10 individuals

5
12
10
8
6
4
2
0
0
10 20 30 40 50 60 70
Cheese goodness, in cheese goodness units
FIGURE 11.2: On the left, a histogram of net worths from the dataset described in
the text and shown in table 11.1. On the right, a histogram of cheese goodness
scores from the dataset described in the text and shown in table 11.1.
items count in which bar.

A simple generalization of a bar chart is a histogram. We divide the range
of the data into intervals, which do not need to be equal in length. We think of
each interval as having an associated pigeonhole, and choose one pigeonhole for
each data item. We then build a set of boxes, one per interval. Each box sits on its
interval on the horizontal axis, and its height is determined by the number of data
items in the corresponding pigeonhole. In the simplest histogram, the intervals that
form the bases of the boxes are equally sized. In this case, the height of the box is
given by the number of data items in the box.
Figure 11.2 shows a histogram of the data in table 11.1. There are five bars
by my choice; I could have plotted ten bars and the height of each bar gives the
number of data items that fall into its interval. For example, there is one net worth
in the range between $102, 500 and $107, 500. Notice that one bar is invisible,
because there is no data in that range. This picture suggests conclusions consistent
with the ones we had from eyeballing the table the net worths tend to be quite
similar, and around $100, 000.
Figure 11.2 also shows a histogram of the data in table 11.1. There are six
bars (0-10, 10-20, and so on), and the height of each bar gives the number of data
items that fall into its interval so that, for example, there are 9 cheeses in this
dataset whose score is greater than or equal to 10 and less than 20. You can also use
the bars to estimate other properties. So, for example, there are 14 cheeses whose
score is less than 20, and 3 cheeses with a score of 50 or greater. This picture is
much more helpful than the table; you can see at a glance that quite a lot of cheeses
have relatively low scores, and few have high scores.
11.2.3 How to Make Histograms
Usually, one makes a histogram by finding the appropriate command or routine in
your programming environment. I use Matlab, and chapter ?? sketches some useful
Matlab commands. However, it is useful to understand the procedures used.
Section 11.2
212
Histogram of body temperatures in Fahrenheit

14
12
10
8
6
4
2
0
96
98
100
102
Gender 1 body temperatures in Fahrenheit Gender 2 body temperatures in Fahrenheit
14
14
12
12
10
10
0
96
98
100
102
0
96
98
100
102
FIGURE 11.3: On top, a histogram of body temperatures, from the dataset pub-
lished at http:// www2.stetson.edu/ jrasp/ data.htm. These seem to be clustered

fairly tightly around one value. The bottom row shows histograms for each gender (I dont know which is which). It looks as though one gender runs slightly cooler
than the other.
Histograms with Even Intervals: The easiest histogram to build uses
equally sized intervals. Write xi for the ith number in the dataset, xmin for the
smallest value, and xmax for the largest value. We divide the range between the
smallest and largest values into n intervals of even width (xmax xmin )/n. In this
case, the height of each box is given by the number of items in that interval. We
could represent the histogram with an n-dimensional vector of counts. Each entry
represents the count of the number of data items that lie in that interval. Notice
we need to be careful to ensure that each point in the range of values is claimed by
exactly one interval. For example, we could have intervals of [0 1) and [1 2), or
we could have intervals of (0 1] and (1 2]. We could not have intervals of [0 1]
and [1 2], because then a data item with the value 1 would appear in two boxes.
Similarly, we could not have intervals of (0 1) and (1 2), because then a data
item with the value 1 would not appear in any box.
Histograms with Uneven Intervals: For a histogram with even intervals,
it is natural that the height of each box is the number of data items in that box.
Section 11.3
Summarizing 1D Data
213
But a histogram with even intervals can have empty boxes (see figure 11.2). In
this case, it can be more informative to have some larger intervals to ensure that
each interval has some data items in it. But how high should we plot the box?
Imagine taking two consecutive intervals in a histogram with even intervals, and
fusing them. It is natural that the height of the fused box should be the average
height of the two boxes. This observation gives us a rule.
Write dx for the width of the intervals; n1 for the height of the box over the
first interval (which is the number of elements in the first box); and n2 for the
height of the box over the second interval. The height of the fused box will be
(n1 + n2 )/2. Now the area of the first box is n1 dx; of the second box is n2 dx; and
of the fused box is (n1 + n2 )dx. For each of these boxes, the area of the box is
proportional to the number of elements in the box. This gives the correct rule: plot
boxes such that the area of the box is proportional to the number of elements in
the box.
11.2.4 Conditional Histograms
Most people believe that normal body temperature is 98.4o in Fahrenheit. If you
take other peoples temperatures often (for example, you might have children), you
know that some individuals tend to run a little warmer or a little cooler than this
number. I found data giving the body temperature of a set of individuals at http://
www2.stetson.edu/jrasp/data.htm. As you can see from the histogram (figure 11.3),
the body temperatures cluster around a small set of numbers. But what causes the
variation?
One possibility is gender. We can investigate this possibility by comparing a histogram of temperatures for males with histogram of temperatures for females. Such histograms are sometimes called conditional histograms or classconditional histograms, because each histogram is conditioned on something (in
this case, the histogram uses only data that comes from gender).
The dataset gives genders (as 1 or 2 - I dont know which is male and which
female). Figure 11.3 gives the class conditional histograms. It does seem like
individuals of one gender run a little cooler than individuals of the other, although
we dont yet have mechanisms to test this possibility in detail (chapter 14.5).
11.3 SUMMARIZING 1D DATA
For the rest of this chapter, we will assume that data items take values that are
continuous real numbers. Furthermore, we will assume that values can be added,
subtracted, and multiplied by constants in a meaningful way. Human heights are
one example of such data; you can add two heights, and interpret the result as a
height (perhaps one person is standing on the head of the other). You can subtract
one height from another, and the result is meaningful. You can multiply a height
by a constant say, 1/2 and interpret the result (A is half as high as B).
11.3.1 The Mean
One simple and effective summary of a set of data is its mean. This is sometimes
known as the average of the data.
Section 11.3
Summarizing 1D Data
214
Definition: 11.1 Mean

Assume we have a dataset {x} of N data items, x1 , . . . , xN . Their mean
is
i=N
1 X
xi .
mean ({x}) =
N i=1
For example, assume youre in a bar, in a group of ten people who like to talk
about money. Theyre average people, and their net worth is given in table 11.1
(you can choose who you want to be in this story). The mean of this data is $107,
903.
Properties of the Mean
should remember:
The mean has several important properties you
Scaling data scales the mean: or mean ({kxi }) = kmean ({xi }).
Translating data translates the mean: or mean ({xi + c}) = mean ({xi }) + c.
The sum of signed differences from the mean is zero. This means that
N
X
i=1
(xi mean ({xi })) = 0.
Choose the number such that the sum of squared distances of data points
to is minimized. That number is the mean. In notation
arg min X
(xi )2 = mean ({xi })
These properties are easy to prove (and so easy to remember). I have broken
these out into a box of useful facts below, to emphasize them. All but one proof is
relegated to the exercises.
Section 11.3
Summarizing 1D Data
215
arg min P
2
i (xi ) = mean ({x})
Proof: Choose the number such that the sum of squared distances of data
points to is minimized. That number is the mean. In notation:
Proposition:
arg min X
(xi )2 = mean ({x})
We can show this by actually minimizing the expression. We must have that the
derivative of the expression we are minimizing is zero at the value of we are
seeking. So we have
N
d X
(xi )2
d i=1
N
X
i=1
= 2
2(xi )
N
X
i=1
= 0
(xi )
so that 2N mean ({x}) 2N = 0, which means that = mean ({x}).

Property 11.1: The Average Squared Distance to the Mean is Minimized
Now this result means that the mean is the single number that is closest to
all the data items. The mean tells you where the overall blob of data lies. For this
reason, it is often referred to as a location parameter. If you choose to summarize
the dataset with a number that is as close as possible to each data item, the mean
is the number to choose. The mean is also a guide to what new values will look
like, if you have no other information. For example, in the case of the bar, a new
person walks in, and I must guess that persons net worth. Then the mean is the
best guess, because it is closest to all the data items we have already seen. In the
case of the bar, if a new person walked into this bar, and you had to guess that
persons net worth, you should choose $107, 903.
Section 11.3
Summarizing 1D Data
216
Useful Facts: 11.1 Mean

mean ({kxi }) = kmean ({xi }).
mean ({xi + c}) = mean ({xi }) + c.
P
[ N
i=1 (xi mean ({xi })) = 0.
arg min X
(xi )2 = mean ({xi })
The mean is a location parameter; it tells you where the data lies along
a number line.
11.3.2 Standard Deviation and Variance

We would also like to know the extent to which data items are close to the mean.
This information is given by the standard deviation, which is the root mean
square of the offsets of data from the mean.
Definition: 11.2 Standard deviation

Assume we have a dataset {x} of N data items, x1 , . . . , xN . The standard deviation of this dataset is is:
v
u i=N
u1 X
p
std ({xi }) = t
(xi mean ({x}))2 = mean ({(xi mean ({x}))2 }).
N i=1
You should think of the standard deviation as a scale. It measures the size of
the average deviation from the mean for a dataset, or how wide the spread of data
is. For this reason, it is often referred to as a scale parameter. When the standard
deviation of a dataset is large, there are many items with values much larger than,
or much smaller than, the mean. When the standard deviation is small, most data
items have values close to the mean. This means it is helpful to talk about how
many standard devations away from the mean a particular data item is. Saying
that data item xj is within k standard deviations from the mean means that
abs (xj mean ({x})) kstd ({xi }).
Section 11.3
Summarizing 1D Data
217
Similarly, saying that data item xj is more than k standard deviations from the
mean means that
abs (xi mean ({x})) > kstd ({x}).
As I will show below, there must be some data at least one standard deviation
away from the mean, and there can be very few data items that are many standard
deviations away from the mean.
Properties of the Standard Deviation Standard deviation has very important properties:
Translating data does not change the standard deviation, i.e. std ({xi + c}) =
std ({xi }).
Scaling data scales the standard deviation, i.e. std ({kxi }) = kstd ({xi }).
For any dataset, there can be only a few items that are many standard deviations away from the mean. In particular, assume we have N data items, xi ,
whose standard deviation is . Then there are at most k12 data points lying
k or more standard deviations away from the mean.
For any dataset, there must be at least one data item that is at least one
standard deviation away from the mean.
The first two properties are easy to prove, and are relegated to the exercises. I
prove the others below. Again, for emphasis, I have broken these properties out in
a box below.
Section 11.3
Summarizing 1D Data
218
Proposition:
Assume we have a dataset {x} of N data items, x1 , . . . , xN .
Assume the standard deviation of this dataset is std ({x}) = . Then there are at
most k12 data points lying k or more standard deviations away from the mean.
Proof: Assume the mean is zero. There is no loss of generality here, because
translating data translates the mean, but doesnt change the standard deviation.
Now we must construct a dataset with the largest possible fraction r of data
points lying k or more standard deviations from the mean. To achieve this, our
data should have N (1 r) data points each with the value 0, because these
contribute 0 to the standard deviation. It should have N r data points with the
value k; if they are further from zero than this, each will contribute more to
the standard deviation, so the fraction of such points will be fewer. Because
rP
2
i xi
std ({x}) = =
N
we have that, for this rather specially constructed dataset,
r
N rk 2 2
=
N
so that
1
.
k2
We constructed the dataset so that r would be as large as possible, so
r=
1
k2
for any kind of data at all.

Property 11.2: For any dataset, it is hard for data items to get many standard
deviations away from the mean.
This bound (box 11.2) is true for any kind of data. This bound implies that,
for example, at most 100% of any dataset could be one standard deviation away
from the mean, 25% of any dataset is 2 standard deviations away from the mean
and at most 11% of any dataset could be 3 standard deviations away from the
mean. But the configuration of data that achieves this bound is very unusual. This
means the bound tends to wildly overstate how much data is far from the mean
for most practical datasets. Most data has more random structure, meaning that
we expect to see very much less data far from the mean than the bound predicts.
For example, much data can reasonably be modelled as coming from a normal
distribution (a topic well go into later). For such data, we expect that about
68% of the data is within one standard deviation of the mean, 95% is within two
standard deviations of the mean, and 99.7% is within three standard deviations
of the mean, and the percentage of data that is within ten standard deviations of
the mean is essentially indistinguishable from 100%. This kind of behavior is quite
common; the crucial point about the standard deviation is that you wont see much
Section 11.3
Summarizing 1D Data
219
data that lies many standard deviations from the mean, because you cant.
Proposition:
(std ({x}))2 max(xi mean ({x}))2 .
i
Proof: You can see this by looking at the expression for standard deviation.
We have
v
u i=N
u1 X
std ({x}) = t
(xi mean ({x}))2 .
N i=1
Now, this means that
N (std ({x}))2 =
i=N
X
i=1
But
i=N
X
i=1
so
(xi mean ({x}))2 .
(xi mean ({x}))2 N max(xi mean ({x}))2

i
(std ({x}))2 max(xi mean ({x}))2 .

i
Property 11.3: For any dataset, there must be at least one data item that is
at least one standard deviation away from the mean.
Boxes 11.2 and 11.3 mean that the standard deviation is quite informative.
Very little data is many standard deviations away from the mean; similarly, at least
some of the data should be one or more standard deviations away from the mean.
So the standard deviation tells us how data points are scattered about the mean.
Useful Facts: 11.2 Standard deviation

std ({x + c}) = std ({x}).
std ({kx}) = kstd ({x}).
For N data items, xi , whose standard deviation is , there are
at most k12 data points lying k or more standard deviations away
from the mean.
(std ({x}))2 maxi (xi mean ({x}))2 .
The standard deviation is often referred to as a scale parameter; it tells
you how broadly the data spreads about the mean.
Section 11.3
Summarizing 1D Data
220
There is an ambiguity that comes up often here because two (very slightly)
different numbers are called the standard deviation of a dataset. One the one
we use in this chapter is an estimate of the scale of the data, as we describe it.
The other differs from our expression very slightly; one computes
rP
2
i (xi mean ({x}))
N 1
(notice the N 1 for our N ). If N is large, this number is basically the same as the
number we compute, but for smaller N there is a difference that can be significant.
Irritatingly, this number is also called the standard deviation; even more irritatingly,
we will have to deal with it, but not yet. I mention it now because you may look
up terms I have used, find this definition, and wonder whether I know what Im
talking about. In this case, I do (although I would say that).
The confusion arises because sometimes the datasets we see are actually samples of larger datasets. For example, in some circumstances you could think of the
net worth dataset as a sample of all the net worths in the USA. In such cases, we
are often interested in the standard deviation of the dataset that was sampled. The
second number is a slightly better way to estimate this standard deviation than the
definition we have been working with. Dont worry - the N in our expressions is
the right thing to use for what were doing.
11.3.3 Variance
It turns out that thinking in terms of the square of the standard deviation, which
is known as the variance, will allow us to generalize our summaries to apply to
higher dimensional data.
Definition: 11.3 Variance

Assume we have a dataset {x} of N data items, x1 , . . . , xN . where
N > 1. Their variance is:
!
i=N

1 X
2
= mean (xi mean ({x}))2 .
(xi mean ({x}))
var ({x}) =
N i=1
One good way to think of the variance is as the mean-square error you would
incur if you replaced each data item with the mean. Another is that it is the square
of the standard deviation.
Properties of the Variance
The properties of the variance follow from
Section 11.3
Summarizing 1D Data
221
the fact that it is the square of the standard deviation. I have broken these out in
a box, for emphasis.
Useful Facts: 11.3 Variance

var ({x + c}) = var ({x}).
var ({kx}) = k 2 var ({x}).
While one could restate the other two properties of the standard deviation in
terms of the variance, it isnt really natural to do so. The standard deviation is in
the same units as the original data, and should be thought of as a scale. Because
the variance is the square of the standard deviation, it isnt a natural scale (unless
you take its square root!).
11.3.4 The Median
One problem with the mean is that it can be affected strongly by extreme values.
Go back to the bar example, of section 11.3.1. Now Warren Buffett (or Bill Gates,
or your favorite billionaire) walks in. What happened to the average net worth?
Assume your billionaire has net worth $ 1, 000, 000, 000. Then the mean net
worth suddenly has become
10 $107, 903 + $1, 000, 000, 000
= $91, 007, 184
11
But this mean isnt a very helpful summary of the people in the bar. It is probably more useful to think of the net worth data as ten people together with one
billionaire. The billionaire is known as an outlier.
One way to get outliers is that a small number of data items are very different, due to minor effects you dont want to model. Another is that the data
was misrecorded, or mistranscribed. Another possibility is that there is just too
much variation in the data to summarize it well. For example, a small number
of extremely wealthy people could change the average net worth of US residents
dramatically, as the example shows. An alternative to using a mean is to use a
median.
Section 11.3
Summarizing 1D Data
222
Definition: 11.4 Median

The median of a set of data points is obtained by sorting the data
points, and finding the point halfway along the list. If the list is of
even length, its usual to average the two numbers on either side of the
middle. We write
median ({xi })
for the operator that returns the median.
For example,
median ({3, 5, 7}) = 5,
median ({3, 4, 5, 6, 7}) = 5,
and
median ({3, 4, 5, 6}) = 4.5.
For much, but not all, data, you can expect that roughly half the data is smaller
than the median, and roughly half is larger than the median. Sometimes this
property fails. For example,
median ({1, 2, 2, 2, 2, 2, 2, 2, 3}) = 2.
With this definition, the median of our list of net worths is $107, 835. If we insert
the billionaire, the median becomes $108, 930. Notice by how little the number has
changed it remains an effective summary of the data.
Properties of the median You can think of the median of a dataset as
giving the middle or center value. It is another way of estimating where the
dataset lies on a number line (and so is another location parameter). This means
it is rather like the mean, which also gives a (slightly differently defined) middle
or center value. The mean has the important properties that if you translate the
dataset, the mean translates, and if you scale the dataset, the mean scales. The
median has these properties, too, which I have broken out in a box. Each is easily
proved, and proofs are relegated to the exercises.
Useful Facts: 11.4 Median

median ({x + c}) = median ({x}) + c.
median ({kx}) = kmedian ({x}).
Section 11.3
Summarizing 1D Data
223
11.3.5 Interquartile Range

Outliers are a nuisance in all sorts of ways. Plotting the histogram of the net worth
data with the billionaire included will be tricky. Either you leave the billionaire out
of the plot, or all the histogram bars are tiny. Visualizing this plot shows outliers
can affect standard deviations severely, too. For our net worth data, the standard
deviation without the billionaire is $9265, but if we put the billionaire in there, it
is $3.014 108 . When the billionaire is in the dataset, the mean is about 91M $
and the standard deviation is about 300M $ so all but one of the data items lie
about a third of a standard deviation away from the mean on the small side. The
other data item (the billionaire) is about three standard deviations away from the
mean on the large side. In this case, the standard deviation has done its work of
informing us that there are huge changes in the data, but isnt really helpful as a
description of the data.
The problem is this: describing the net worth data with billionaire as a having
a mean of $9.101 107 with a standard deviation of $3.014 108 really isnt terribly
helpful. Instead, the data really should be seen as a clump of values that are
near $100, 000 and moderately close to one another, and one massive number (the
billionaire outlier).
One thing we could do is simply remove the billionaire and compute mean
and standard deviation. This isnt always easy to do, because its often less obvious
which points are outliers. An alternative is to follow the strategy we did when we
used the median. Find a summary that describes scale, but is less affected by
outliers than the standard deviation. This is the interquartile range; to define
it, we need to define percentiles and quartiles, which are useful anyway.
Definition: 11.5 Percentile

The kth percentile is the value such that k% of the data is less than or
equal to that value. We write percentile({x}, k) for the kth percentile
of dataset {x}.
Definition: 11.6 Quartiles

The first quartile of the data is the value such that 25% of the data is less
than or equal to that value (i.e. percentile({x}, 25)). The second quartile of the data is the value such that 50% of the data is less than or equal
to that value, which is usually the median (i.e. percentile({x}, 50)). The
third quartile of the data is the value such that 75% of the data is less
than or equal to that value (i.e. percentile({x}, 75)).
Section 11.3
Summarizing 1D Data
224
Definition: 11.7 Interquartile Range

The interquartile range of a dataset {x} is iqr{x} = percentile({x}, 75)
percentile({x}, 25).
Like the standard deviation, the interquartile range gives an estimate of how
widely the data is spread out. But it is quite well-behaved in the presence of
outliers. For our net worth data without the billionaire, the interquartile range is
$12350; with the billionaire, it is $17710.
Properties of the interquartile range You can think of the interquartile
range of a dataset as giving an estimate of the scale of the difference from the mean.
This means it is rather like the standard deviation, which also gives a (slightly
differently defined) scale. The standard deviation has the important properties
that if you translate the dataset, the standard deviation translates, and if you
scale the dataset, the standard deviation scales. The interquartile range has these
properties, too, which I have broken out into a box. Each is easily proved, and
proofs are relegated to the exercises.
Useful Facts: 11.5 Interquartile range

iqr{x + c} = iqr{x}.
iqr{kx} = k 2 iqr{x}.
For most datasets, interquartile ranges tend to be somewhat larger than standard deviations. This isnt really a problem. Each is a method for estimating the
scale of the data the range of values above and below the mean that you are likely
to see. It is neither here nor there if one method yields slightly larger estimates
than another, as long as you dont compare estimates across methods.
11.3.6 Using Summaries Sensibly
One should be careful how one summarizes data. For example, the statement
that the average US family has 2.6 children invites mockery (the example is from
Andrew Vickers book What is a p-value anyway?), because you cant have fractions
of a child no family has 2.6 children. A more accurate way to say things might
Section 11.4
Plots and Summaries
225
be the average of the number of children in a US family is 2.6, but this is clumsy.
What is going wrong here is the 2.6 is a mean, but the number of children in a
family is a categorical variable. Reporting the mean of a categorical variable is
often a bad idea, because you may never encounter this value (the 2.6 children).
For a categorical variable, giving the median value and perhaps the interquartile
range often makes much more sense than reporting the mean.
For continuous variables, reporting the mean is reasonable because you could
expect to encounter a data item with this value, even if you havent seen one in
the particular data set you have. It is sensible to look at both mean and median;
if theyre significantly different, then there is probably something going on that is
worth understanding. Youd want to plot the data using the methods of the next
section before you decided what to report.
You should also be careful about how precisely numbers are reported (equivalently, the number of significant figures). Numerical and statistical software will
produce very large numbers of digits freely, but not all are always useful. This is a
particular nuisance in the case of the mean, because you might add many numbers,
then divide by a large number; in this case, you will get many digits, but some
might not be meaningful. For example, Vickers (ibid) describes a paper reporting
the mean length of pregnancy as 32.833 weeks. That fifth digit suggests we know
the mean length of pregnancy to about 0.001 weeks, or roughly 10 minutes. Neither
medical interviewing nor peoples memory for past events is that detailed. Furthermore, when you interview them about embarrassing topics, people quite often lie.
There is no prospect of knowing this number with this precision.
People regularly report silly numbers of digits because it is easy to miss the
harm caused by doing so. But the harm is there: you are implying to other people,
and to yourself, that you know something more accurately than you do. At some
point, someone will suffer for it.
11.4 PLOTS AND SUMMARIES
Knowing the mean, standard deviation, median and interquartile range of a dataset
gives us some information about what its histogram might look like. In fact, the
summaries give us a language in which to describe a variety of characteristic properties of histograms that are worth knowing about (Section 11.4.1). Quite remarkably, many different datasets have histograms that have about the same shape
(Section 11.4.2). For such data, we know roughly what percentage of data items
are how far from the mean.
Complex datasets can be difficult to interpret with histograms alone, because
it is hard to compare many histograms by eye. Section 11.4.3 describes a clever
plot of various summaries of datasets that makes it easier to compare many cases.
11.4.1 Some Properties of Histograms
The tails of a histogram are the relatively uncommon values that are significantly
larger (resp. smaller) than the value at the peak (which is sometimes called the
mode). A histogram is unimodal if there is only one peak; if there are more than
one, it is multimodal, with the special term bimodal sometimes being used for
the case where there are two peaks (Figure 11.4). The histograms we have seen
Section 11.4
Plots and Summaries
226
mode
Bimodal
modes
Multimodal
modes
Population 1
Population 2
Population 1
Population 2
Population 3
FIGURE 11.4: Many histograms are unimodal, like the example on the top; there is
one peak, or mode. Some are bimodal (two peaks; bottom left) or even multimodal
(two or more peaks; bottom right). One common reason (but not the only reason)
is that there are actually two populations being conflated in the histograms. For
example, measuring adult heights might result in a bimodal histogram, if male and
female heights were slightly different. As another example, measuring the weight
of dogs might result in a multimodal histogram if you did not distinguish between
breeds (eg chihauhau, terrier, german shepherd, pyranean mountain dog, etc.).
have been relatively symmetric, where the left and right tails are about as long as
one another. Another way to think about this is that values a lot larger than the
mean are about as common as values a lot smaller than the mean. Not all data is
symmetric. In some datasets, one or another tail is longer (figure 11.5). This effect
is called skew.
Skew appears often in real data. SOCR (the Statistics Online Computational
Resource) publishes a number of datasets. Here we discuss a dataset of citations
to faculty publications. For each of five UCLA faculty members, SOCR collected
the number of times each of the papers they had authored had been cited by
other authors (data at http://wiki.stat.ucla.edu/socr/index.php/SOCR Data Dinov
072108 H Index Pubs). Generally, a small number of papers get many citations, and
many papers get few citations. We see this pattern in the histograms of citation
numbers (figure 11.6). These are very different from (say) the body temperature
pictures. In the citation histograms, there are many data items that have very few
citations, and few that have many citations. This means that the right tail of the
histogram is longer, so the histogram is skewed to the right.
One way to check for skewness is to look at the histogram; another is to
Section 11.4
Plots and Summaries
227
Symmetric Histogram
mode, median, mean, all on top of
one another
right
tail
left
tail
Left Skew
Right Skew
mean
mode
median
left
tail
right
tail
left
tail
mean
mode
median
right
tail
FIGURE 11.5: On the top, an example of a symmetric histogram, showing its tails
(relatively uncommon values that are significantly larger or smaller than the peak
or mode). Lower left, a sketch of a left-skewed histogram. Here there are few
large values, but some very small values that occur with significant frequency. We
say the left tail is long, and that the histogram is left skewed. You may find
this confusing, because the main bump is to the right one way to remember this
is that the left tail has been stretched. Lower right, a sketch of a right-skewed
histogram. Here there are few small values, but some very large values that occur
with significant frequency. We say the right tail is long, and that the histogram
is right skewed.
compare mean and median (though this is not foolproof). For the first citation
histogram, the mean is 24.7 and the median is 7.5; for the second, the mean is 24.4,
and the median is 11. In each case, the mean is a lot bigger than the median. Recall
the definition of the median (form a ranked list of the data points, and find the
point halfway along the list). For much data, the result is larger than about half
of the data set and smaller than about half the dataset. So if the median is quite
small compared to the mean, then there are many small data items and a small
number of data items that are large the right tail is longer, so the histogram is
skewed to the right.
Left-skewed data also occurs; figure 11.6 shows a histogram of the birth
weights of 44 babies born in Brisbane, in 1997 (from http://www.amstat.org/publications/
jse/jse data archive.htm). This data appears to be somewhat left-skewed, as birth
weights can be a lot smaller than the mean, but tend not to be much larger than
the mean.
Skewed data is often, but not always, the result of constraints. For example,
good obstetrical practice tries to ensure that very large birth weights are rare (birth
Section 11.4
Plots and Summaries
228
Histogram of citations for faculty member A Birth weights for 44 babies born in Brisbane
400
15
300
10
200
5
100
0
0
100
200
300
400
0
500 1000
2000
3000
4000
5000
FIGURE 11.6:
On the left, a histogram of citations for a faculty
member,
from
data
at
http:// wiki.stat.ucla.edu/ socr/ index.php/
Very few publications have many
SOCR Data Dinov 072108 H Index Pubs.
citations, and many publications have few. This means the histogram is strongly
right-skewed. On the right, a histogram of birth weights for 44 babies borne in
Brisbane in 1997. This histogram looks slightly left-skewed.
is typically induced before the baby gets too heavy), but it may be quite hard to
avoid some small birth weights. This could could skew birth weights to the left
(because large babies will get born, but will not be as heavy as they could be if
obstetricians had not interfered). Similarly, income data can be skewed to the right
by the fact that income is always positive. Test mark data is often skewed
whether to right or left depends on the circumstances by the fact that there is
a largest possible mark and a smallest possible mark.
11.4.2 Standard Coordinates and Normal Data
It is useful to look at lots of histograms, because it is often possible to get some
useful insights about data. However, in their current form, histograms are hard to
compare. This is because each is in a different set of units. A histogram for length
data will consist of boxes whose horizontal units are, say, metres; a histogram
for mass data will consist of boxes whose horizontal units are in, say, kilograms.
Furthermore, these histograms typically span different ranges.
We can make histograms comparable by (a) estimating the location of the
plot on the horizontal axis and (b) estimating the scale of the plot. The location
is given by the mean, and the scale by the standard deviation. We could then
normalize the data by subtracting the location (mean) and dividing by the standard
deviation (scale). The resulting values are unitless, and have zero mean. They are
often known as standard coordinates.
Section 11.4
Plots and Summaries
229
Definition: 11.8 Standard coordinates

Assume we have a dataset {x} of N data items, x1 , . . . , xN . We represent these data items in standard coordinates by computing
x
i =
(xi mean ({x}))

.
std ({x})
We write {
x} for a dataset that happens to be in standard coordinates.
Standard coordinates have some important properties. Assume we have N

data items. Write xi for the ith data item, and x
i for the ith data item in standard
coordinates (I sometimes refer to these as normalized data items). Then we have
mean ({
x}) = 0.
We also have that
std ({
x}) = 1.
An extremely important fact about data is that, for many kinds of data,
histograms of these standard coordinates look the same. Many completely different
datasets produce a histogram that, in standard coordinates, has a very specific
appearance. It is symmetric, unimodal; it looks like a narrow bump. If there were
enough data points and the histogram boxes were small enough, the curve would
look like the curve in figure 11.7. This phenomenon is so important that data of
this form has a special name.
Definition: 11.9 Standard normal data

Data is standard normal data if, when we have a great deal of data,
the histogram is a close approximation to the standard normal curve.
This curve is given by
2
1
y(x) = e(x /2)
2
(which is shown in figure 11.7).
Section 11.4
The Standard Normal Curve
Plots and Summaries
230
0.5
Volumes of oysters, standard coordinates

5
0.4
0.3
0.2
0.1
0
0
4 3 2 1 0 1 2 3 4 6
4
2
0
2
4
6
Human heights, standard coordinates
Human weights, standard coordinates
60
60
50
50
40
40
30
30
20
20
10
10
0
6
0
6 6
FIGURE 11.7: Data is standard normal data when its histogram takes a stylized,
bell-shaped form, plotted above. One usually requires a lot of data and very small
histogram boxes for this form to be reproduced closely. Nonetheless, the histogram
for normal data is unimodal (has a single bump) and is symmetric; the tails fall
off fairly fast, and there are few data items that are many standard deviations from
the mean. Many quite different data sets have histograms that are similar to the
normal curve; I show three such datasets here.
Definition: 11.10 Normal data

Data is normal data if, when we subtract the mean and divide by
the standard deviation (i.e. compute standard coordinates), it becomes
standard normal data.
It is not always easy to tell whether data is normal or not, and there are
a variety of tests one can use, which we discuss later. However, there are many
examples of normal data. Figure 11.7 shows a diverse variety of data sets, plotted
Section 11.4
Plots and Summaries
231
as histograms in standard coordinates. These include: the volumes of 30 oysters

(from http://www.amstat.org/publications/jse/jse data archive.htm; look for 30oysters.dat.txt); human heights (from http://www2.stetson.edu/jrasp/data.htm; look
for bodyfat.xls, with two outliers removed); and human weights (from http://www2.
stetson.edu/jrasp/data.htm; look for bodyfat.xls, with two outliers removed).
Properties of normal data For the moment, assume we know that a
dataset is normal. Then we expect it to have the following properties:
If we normalize it, its histogram will be close to the standard normal curve.
This means, among other things, that the data is not significantly skewed.
About 68% of the data lie within one standard deviation of the mean. We
will prove this later.
About 95% of the data lie within two standard deviations of the mean. We
About 99% of the data lie within three standard deviations of the mean. We
In turn, these properties imply that data that contains outliers (points many standard deviations away from the mean) is not normal. This is usually a very safe
assumption. It is quite common to model a dataset by excluding a small number
of outliers, then modelling the remaining data as normal. For example, if I exclude
two outliers from the height and weight data from http://www2.stetson.edu/jrasp/
data.htm, the data looks pretty close to normal.
11.4.3 Boxplots
It is usually hard to compare multiple histograms by eye. One problem with comparing histograms is the amount of space they take up on a plot, because each
histogram involves multiple vertical bars. This means it is hard to plot multiple
overlapping histograms cleanly. If you plot each one on a separate figure, you have
to handle a large number of separate figures; either you print them too small to see
enough detail, or you have to keep flipping over pages.
A boxplot is a way to plot data that simplifies comparison. A boxplot displays a dataset as a vertical picture. There is a vertical box whose height corresponds to the interquartile range of the data (the width is just to make the figure
easy to interpret). Then there is a horizontal line for the median; and the behavior
of the rest of the data is indicated with whiskers and/or outlier markers. This
means that each dataset makes is represented by a vertical structure, making it
easy to show multiple datasets on one plot and interpret the plot (Figure 11.8).
To build a boxplot, we first plot a box that runs from the first to the third
quartile. We then show the median with a horizontal line. We then decide which
data items should be outliers. A variety of rules are possible; for the plots I show, I
used the rule that data items that are larger than q3 + 1.5(q3 q1 ) or smaller than
q1 1.5(q3 q1 ), are outliers. This criterion looks for data items that are more
than one and a half interquartile ranges above the third quartile, or more than one
and a half interquartile ranges below the first quartile.
Section 11.5
Whose is bigger? Investigating Australian Pizzas
232
Outlier
31
30
Whisker
29
27
Interquartile range
28
q3
Box
Median
26
Dominos
EagleBoys
FIGURE 11.8: A boxplot showing the box, the median, the whiskers and two outliers.
Notice that we can compare the two datasets rather easily; the next section explains
the comparison.
Once we have identified outliers, we plot these with a special symbol (crosses
in the plots I show). We then plot whiskers, which show the range of non-outlier
data. We draw a whisker from q1 to the smallest data item that is not an outlier,
and from q3 to the largest data item that is not an outlier. While all this sounds
complicated, any reasonable programming environment will have a function that
will do it for you. Figure 11.8 shows an example boxplot. Notice that the rich
graphical structure means it is quite straightforward to compare two histograms.
11.5 WHOSE IS BIGGER? INVESTIGATING AUSTRALIAN PIZZAS
At http://www.amstat.org/publications/jse/jse data archive.htm), you will find a dataset
giving the diameter of pizzas, measured in Australia (search for the word pizza).
This website also gives the backstory for this dataset. Apparently, EagleBoys pizza
claims that their pizzas are always bigger than Dominos pizzas, and published a set
of measurements to support this claim (the measurements were available at http://
www.eagleboys.com.au/realsizepizza as of Feb 2012, but seem not to be there anymore).
Whose pizzas are bigger? and why? A histogram of all the pizza sizes appears
Section 11.5
233
Histogram of pizza diameters, in inches

40
30
20
10
0
24
26
28
30
32
FIGURE 11.9: A histogram of pizza diameters from the dataset described in the text.
Notice that there seem to be two populations.
Dominos pizza diameters, in inches
EagleBoys pizza diameters, in inches
30
30
25
25
20
20
15
15
10
10
26
28
30
32
26
28
30
32
FIGURE 11.10: On the left, the class-conditional histogram of Dominos pizza diameters from the pizza data set; on the right, the class-conditional histogram of
EagleBoys pizza diameters. Notice that EagleBoys pizzas seem to follow the pattern we expect the diameters are clustered tightly around a mean, and there is a
small standard deviation but Dominos pizzas do not seem to be like that. There
is more to understand about this data.
in figure 11.9. We would not expect every pizza produced by a restaurant to have
exactly the same diameter, but the diameters are probably pretty close to one
another, and pretty close to some standard value. This would suggest that wed
expect to see a histogram which looks like a single, rather narrow, bump about a
mean. This is not what we see in figure 11.9 instead, there are two bumps, which
suggests two populations of pizzas. This isnt particularly surprising, because we
know that some pizzas come from EagleBoys and some from Dominos.
If you look more closely at the data in the dataset, you will notice that each
data item is tagged with the company it comes from. We can now easily plot con-
Section 11.5
234
Box plots of pizzas by maker

31
30
29
28
27
26
Dominos
EagleBoys
FIGURE 11.11: Boxplots of the pizza data, comparing EagleBoys and Dominos pizza.
There are several curiosities here: why is the range for Dominos so large (25.5-29)?
EagleBoys has a smaller range, but has several substantial outliers; why? One would
expect pizza manufacturers to try and control diameter fairly closely, because pizzas
that are too small present risks (annoying customers; publicity; hostile advertising)
and pizzas that are too large should affect profits.
ditional histograms, conditioning on the company that the pizza came from. These
appear in figure 11.10. Notice that EagleBoys pizzas seem to follow the pattern
we expect the diameters are clustered tightly around one value but Dominos
pizzas do not seem to be like that. This is reflected in a boxplot (figure 11.11),
which shows the range of Dominos pizza sizes is surprisingly large, and that EagleBoys pizza sizes have several large outliers. There is more to understand about
this data. The dataset contains labels for the type of crust and the type of topping
perhaps these properties affect the size of the pizza?
EagleBoys produces DeepPan, MidCrust and ThinCrust pizzas, and Dominos
produces DeepPan, ClassicCrust and ThinNCrispy pizzas. This may have something to do with the observed patterns, but comparing six histograms by eye is
unattractive. A boxplot is the right way to compare these cases (figure 11.12). The
boxplot gives some more insight into the data. Dominos thin crust appear to have a
narrow range of diameters (with several outliers), where the median pizza is rather
larger than either the deep pan or the classic crust pizza. EagleBoys pizzas all have
a range of diameters that is (a) rather similar across the types and (b) rather a lot
like the Dominos thin crust. There are outliers, but few for each type.
Another possibility is that the variation in size is explained by the topping.
We can compare types and toppings by producing a set of conditional boxplots (i.e.
the diameters for each type and each topping). This leads to rather a lot of boxes
(figure 11.13), but theyre still easy to compare by eye. The main difficulty is that
the labels on the plot have to be shortened. I made labels using the first letter
from the manufacturer (D or E); the first letter from the crust type (previous
paragraph); and the first and last letter of the topping. Toppings for Dominos are:
Hawaiian; Supreme; BBQMeatlovers. For EagleBoys, toppings are: Hawaiian; SuperSupremo; and BBQMeatlovers. This gives the labels: DCBs; (Dominos; Clas-
Section 11.5
235
Box plots of pizzas by maker and type

31
30
29
28
27
26
DThinNCrispy
DDeepPan
DClassicCrust
EMidCrust
EDeepPan
EThinCrust
FIGURE 11.12: Boxplots for the pizza data, broken out by type (thin crust, etc.).
sicCrust; BBQMeatlovers); DCHn; DCSe; DDBs; DDHn; DDSe; DTBs;

DTHn; DTSe; EDBs; EDHn; EDSo; EMBs; EMHn; EMSo; ETBs;
ETHn; ETSo. Figure 11.13 suggests that the topping isnt what is important,
but the crust (group the boxplots by eye).
What could be going on here? One possible explanation is that Eagleboys
have tighter control over the size of the final pizza. One way this could happen is
that all EagleBoys pizzas start the same size and shrink the same amount in baking,
whereas all Dominos pizzas start a standard diameter, but different Dominos crusts
shrink differently in baking. Another way is that Dominos makes different size
crusts for different types, but that the cooks sometimes get confused. Yet another
possibility is that Dominos controls portions by the mass of dough (so thin crust
diameters tend to be larger), but Eagleboys controls by the diameter of the crust.
You should notice that this is more than just a fun story. If you were a manager
at a pizza firm, youd need to make choices about how to control costs. Labor costs,
rent, and portion control (i.e. how much pizza, topping, etc. a customer gets for
their money) are the main thing to worry about. If the same kind of pizza has a wide
range of diameters, you have a problem, because some customers are getting too
much (which affects your profit) or too little (which means they might call someone
Section 11.5
236
Box plots of pizzas by maker, type, and topping

31
Dominos
EagleBoys
ThinCrust
30
ClassicCrust
DeepPan
29
ThinCrust
28
DeepPan
27
MidCrust
26
DCBsDCHnDCSeDDBsDDHnDDSeDTBsDTHnDTSeEDBsEDHnEDSoEMBsEMHnEMSoETBsETHnETSo
FIGURE 11.13: The pizzas are now broken up by topping as well as crust type (look at
the source for the meaning of the names). I have separated Dominos from Eagleboys
with a vertical line, and grouped each crust type with a box. It looks as though the
issue is not the type of topping, but the crust. Eagleboys seems to have tighter
control over the size of the final pizza.
else next time). But making more regular pizzas might require more skilled (and so
more expensive) labor. The fact that Dominos and EagleBoys seem to be following
different strategies successfully suggests that more than one strategy might work.
But you cant choose if you dont know whats happening. As I said at the start,
whats going on here? is perhaps the single most useful question anyone can ask.
Section 11.6
You should
237
11.6 YOU SHOULD

11.6.1 be able to:
Plot a bar chart for a dataset.
Plot a histogram for a dataset.
Tell whether the histogram is skewed or not, and in which direction.
Plot a box plot for one or several datasets.
Interpret a box plot.
11.6.2 remember:
New term: categorical . . . . . . . . . .
New term: ordinal . . . . . . . . . . . .
New term: continuous . . . . . . . . . .
New term: bar chart . . . . . . . . . . .
New term: histogram . . . . . . . . . . .
New term: conditional histograms . . .
New term: class-conditional histograms
New term: mean . . . . . . . . . . . . .
Definition: Mean . . . . . . . . . . . . .
Useful facts: Mean . . . . . . . . . . . .
New term: standard deviation . . . . . .
Definition: Standard deviation . . . . .
Useful facts: Standard deviation . . . .
New term: variance . . . . . . . . . . . .
Definition: Variance . . . . . . . . . . .
Useful facts: Variance . . . . . . . . . .
New term: outlier . . . . . . . . . . . .
New term: median . . . . . . . . . . . .
Definition: Median . . . . . . . . . . . .
Useful facts: Median . . . . . . . . . . .
New term: interquartile range . . . . . .
Definition: Percentile . . . . . . . . . . .
Definition: Quartiles . . . . . . . . . . .
Definition: Interquartile Range . . . . .
Useful facts: Interquartile range . . . . .
New term: tails . . . . . . . . . . . . . .
New term: mode . . . . . . . . . . . . .
New term: unimodal . . . . . . . . . . .
New term: multimodal . . . . . . . . . .
New term: bimodal . . . . . . . . . . . .
New term: skew . . . . . . . . . . . . .
New term: standard coordinates . . . .
Definition: Standard coordinates . . . .
New term: standard normal data . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
201
201
201
204
205
207
207
207
208
210
210
210
213
214
214
215
215
215
216
216
217
217
218
218
218
219
219
219
219
219
220
222
223
223
Section 11.6
New term:
Definition:
New term:
Definition:
New term:
standard normal curve

Standard normal data .
normal data . . . . . .
Normal data . . . . . .
boxplot . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
You should
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
238
.
.
.
.
.
223
223
224
224
225
C H A P T E R
12
Background:Looking at
Relationships
We think of a dataset as a collection of d-tuples (a d-tuple is an ordered list of
d elements). For example, the Chase and Dunner dataset had entries for Gender;
Grade; Age; Race; Urban/Rural; School; Goals; Grades; Sports; Looks; and Money
(so it consisted of 11-tuples). The previous chapter explored methods to visualize
and summarize a set of values obtained by extracting a single element from each
tuple. For example, I could visualize the heights or the weights of a population (as
in Figure 11.7). But I could say nothing about the relationship between the height
and weight. In this chapter, we will look at methods to visualize and summarize
the relationships between pairs of elements of a dataset.
12.1 PLOTTING 2D DATA
We take a dataset, choose two different entries, and extract the corresponding
elements from each tuple. The result is a dataset consisting of 2-tuples, and we
think of this as a two dimensional dataset. The first step is to plot this dataset in a
way that reveals relationships. The topic of how best to plot data fills many books,
and we can only scratch the surface here. Categorical data can be particularly
tricky, because there are a variety of choices we can make, and the usefulness of
each tends to depend on the dataset and to some extent on ones cleverness in
graphic design (section 12.1.1).
For some continuous data, we can plot the one entry as a function of the other
(so, for example, our tuples might consist of the date and the number of robberies;
or the year and the price of lynx pelts; and so on, section 12.1.2).
Mostly, we use a simple device, called a scatter plot. Using and thinking about
scatter plots will reveal a great deal about the relationships between our data items
(section 12.1.3).
12.1.1 Categorical Data, Counts, and Charts
Categorical data is a bit special. Assume we have a dataset with several categorical descriptions of each data item. One way to plot this data is to think of
it as belonging to a richer set of categories. Assume the dataset has categorical
descriptions, which are not ordinal. Then we can construct a new set of categories
by looking at each of the cases for each of the descriptions. For example, in the
Chase and Dunner data of table 11.2, our new categories would be: boy-sports;
girl-sports; boy-popular; girl-popular; boy-grades; and girl-grades. A
large set of categories like this can result in a poor bar chart, though, because there
may be too many bars to group the bars successfully. Figure 12.1 shows such a bar
chart. Notice that it is hard to group categories by eye to compare; for example,
239
Section 12.1
Plotting 2D Data
240
Number of each gender choosing each goal

boyPopular
girlPopular
150
100
boyGrades
50
girlGrades
0
bP bG bS gS gG gP
boySports
girlSports
FIGURE 12.1: I sorted the children in the Chase and Dunner study into six categories
(two genders by three goals), and counted the number of children that fell into each
cell. I then produced the bar chart on the left, which shows the number of children
of each gender, selecting each goal. On the right, a pie chart of this information.
I have organized the pie chart so it is easy to compare boys and girls by eye start
at the top; going down on the left side are boy goals, and on the right side are girl
goals. Comparing the size of the corresponding wedges allows you to tell what goals
boys (resp. girls) identify with more or less often.
you can see that slightly more girls think grades are important than boys do, but
to do so you need to compare two bars that are separated by two other bars. An
alternative is a pie chart, where a circle is divided into sections whose angle is
proportional to the size of the data item. You can think of the circle as a pie, and
each section as a slice of pie. Figure 12.1 shows a pie chart, where each section is
proportional to the number of students in its category. In this case, Ive used my
judgement to lay the categories out in a way that makes comparisons easy. Im not
aware of any tight algorithm for doing this, though.
Pie charts have problems, because it is hard to judge small differences in area
accurately by eye. For example, from the pie chart in figure 12.1, its hard to tell
that the boy-sports category is slightly bigger than the boy-popular category
(try it; check using the bar chart). For either kind of chart, it is quite important
to think about what you plot. For example, the plot of figure 12.1 shows the total
number of respondents, and if you refer to figure 11.1, you will notice that there
are slightly more girls in the study. Is the percentage of boys who think grades are
important smaller (or larger) than the percentage of girls who think so? you cant
tell from these plots, and youd have to plot the percentages instead.
An alternative is to use a stacked bar chart. You can (say) regard the data
as of two types, Boys and Girls. Within those types, there are subtypes (Popularity, Grades and Sport). The height of the bar is given by the number of
elements in the type, and the bar is divided into sections corresponding to the number of elements of that subtype. Alternatively, if you want the plot to show relative
frequencies, the bars could all be the same height, but the shading corresponds to
the fraction of elements of that subtype. This is all much harder to say than to see
or to do (Figure 12.2).
Section 12.1
Goals by gender
Plotting 2D Data
241
250
Gender by goals
Sports
Popular
Grades
50
50
100
100
150
150
200
200
girl
boy
Grades
Popular
Sports
boy
girl
Gender
Goals by gender, relative frequencies
Gender by goals, relative frequencies
1.0
1.0
Goals
Sports
Popular
Grades
0.0
0.0
0.2
0.2
0.4
0.4
0.6
0.6
0.8
0.8
girl
boy
Grades
Popular
Sports
Goals
boy
girl
Gender
FIGURE 12.2: These bar charts use stacked bars. In the top row, the overall height
of the bar is given by the number of elements of that type but each different subtype is
identified by shading, so you can tell by eye, for example, how many of the Grades
in the study were Boys. This layout makes it hard to tell what fraction of, say,
Boys aspire to Popularity. In the bottom row, all bars have the same height,
but the shading of the bar identifies the fraction of that type that has a corresponding
subtype. This means you can tell by eye what fraction of Girls aspire to Sports.
An alternative to a pie chart that is very useful for two dimensional data is
a heat map. This is a method of displaying a matrix as an image. Each entry of
the matrix is mapped to a color, and the matrix is represented as an image. For
the Chase and Dunner study, I constructed a matrix where each row corresponds
to a choice of sports, grades, or popular, and each column corresponds to a
choice of boy or girl. Each entry contains the count of data items of that type.
Zero values are represented as white; the largest values as red; and as the value
increases, we use an increasingly saturated pink. This plot is shown in figure 12.3
If the categorical data is ordinal, the ordering offers some hints for making
a good plot. For example, imagine we are building a user interface. We build an
initial version, and collect some users, asking each to rate the interface on scales for
Section 12.1
Plotting 2D Data
242
120
Sports
100
Grades
80
60
Popular
bo
gi
rl
40
FIGURE 12.3: A heat map of the Chase and Dunner data. The color of each cell
corresponds to the count of the number of elements of that type. The colorbar at
the side gives the correspondence between color and count. You can see at a glance
that the number of boys and girls who prefer grades is about the same; that about
the same number of boys prefer sports and popularity, with sports showing a mild
lead; and that more girls prefer popularity to sports.
-2
-1
0
1
2
-2
24
6
2
0
0
-1
5
12
4
0
0
0
0
3
13
3
0
1
0
0
6
13
1
2
1
0
0
2
5
TABLE 12.1: I simulated data representing user evaluations of a user interface.
Each cell in the table on the left contains the count of users rating ease of use
(horizontal, on a scale of -2 -very bad- to 2 -very good) vs. enjoyability (vertical,
same scale). Users who found the interface hard to use did not like using it either.
While this data is categorical, its also ordinal, so that the order of the cells is
determined. It wouldnt make sense, for example, to reorder the columns of the
table or the rows of the table.
ease of use (-2, -1, 0, 1, 2, running from bad to good) and enjoyability (again,
-2, -1, 0, 1, 2, running from bad to good). It is natural to build a 5x5 table, where
each cell represents a pair of ease of use and enjoyability values. We then count
the number of users in each cell, and build graphical representations of this table.
One natural representation is a 3D bar chart, where each bar sits on its cell in the
2D table, and the height of the bars is given by the number of elements in the cell.
Table 12.1 shows a table and figure 12.4 shows a 3D bar chart for some simulated
data. The main difficulty with a 3D bar chart is that some bars are hidden behind
others. This is a regular nuisance. You can improve things by using an interactive
tool to rotate the chart to get a nice view, but this doesnt always work. Heatmaps
dont suffer from this problem (Figure 12.4), another reason they are a good choice.
Section 12.1
Plotting 2D Data
243
Counts of user responses for a user interface

2
20
Enjoyability
30
20
10
0
2
15
0
10
1
5
Enjoyability
2 1
2
0 1
Ease of use
2
2
0
1
Ease of use
FIGURE 12.4: On the left, a 3D bar chart of the data. The height of each bar is
given by the number of users in each cell. This figure immediately reveals that users
who found the interface hard to use did not like using it either. However, some of
the bars at the back are hidden, so some structure might be hard to infer. On the
right, a heat map of this data. Again, this figure immediately reveals that users
who found the interface hard to use did not like using it either. Its more apparent
that everyone disliked the interface, though, and its clear that there is no important
hidden structure.
Remember this:
There are a variety of tools for plotting categorical
data. Its difficult to give strict rules for which to use when, but usually
one tries to avoid pie charts (angles are hard to judge by eye) and 3D bar
charts (where occlusion can hide important effects).
12.1.2 Series
Sometimes one component of a dataset gives a natural ordering to the data. For
example, we might have a dataset giving the maximum rainfall for each day of
the year. We could record this either by using a two-dimensional representation,
where one dimension is the number of the day and the other is the temperature,
or with a convention where the ith data item is the rainfall on the ith day. For
example, at http://lib.stat.cmu.edu/DASL/Datafiles/timeseriesdat.html, you can find
four datasets indexed in this way. It is natural to plot data like this as a function
of time. From this dataset, I extracted data giving the number of burglaries each
month in a Chicago suburb, Hyde Park. I have plotted part this data in Figure 12.5
(I left out the data to do with treatment effects). It is natural to plot a graph of
the burglaries as a function of time (in this case, the number of the month). The
plot shows each data point explicitly. I also told the plotting software to draw
lines joining data points, because burglaries do not all happen on a specific day.
Section 12.1
Burglaries each month in Hyde Park

number of burglaries
100
Plotting 2D Data
244
Lynx pelts traded at Hudson Bay and price

1500
number of pelts/100
price in pence
80
1000
60
40
500
20
0
0
10
20
30
month
40
0
1840
1860
1880
year
1900
1920
FIGURE 12.5: Left, the number of burglaries in Hyde Park, by month. Right, a
plot of the number of lynx pelts traded at Hudson Bay and of the price paid per pelt,
as a function of the year. Notice the scale, and the legend box (the number of pelts
is scaled by 100).
The lines suggest, reasonably enough, the rate at which burglaries are happening
between data points.
FIGURE 12.6: Snows scatter plot of cholera deaths on the left. Each cholera death
is plotted as a small bar on the house in which the bar occurred (for example, the
black arrow points to one stack of these bars, indicating many deaths, in the detail
on the right). Notice the fairly clear pattern of many deaths close to the Broad
street pump (grey arrow in the detail), and fewer deaths further away (where it was
harder to get water from the pump).
As another example, at http://lib.stat.cmu.edu/datasets/Andrews/ you can
Section 12.1
Plotting 2D Data
245
FIGURE 12.7: Left, a scatter plot of arsenic levels in US groundwater, prepared

by the US Geological Survey (you can find the data at http:// water.usgs.gov/ GIS/
metadata/ usgswrd/ XML/ arsenic map.xml. Here the shape and color of each marker
shows the amount of arsenic, and the spatial distribution of the markers shows where
the wells were sampled. Right, the usage of Nitrogen (a component of fertilizer)
by US county in 1991, prepared by the US Geological Survey (you can find the data
at http:// water.usgs.gov/ GIS/ metadata/ usgswrd/ XML/ nit91.xml). In this variant
of a scatter plot (which usually takes specialized software to prepare) one fills each
region with a color indicating the data in that region.
find a dataset that records the number of lynx pelts traded to the Hudsons Bay
company and the price paid for each pelt. This version of the dataset appeared first
in table 3.2 of Data: a Collection of Problems from many Fields for the Student
and Research Worker by D.F. Andrews and A.M. Herzberg, published by Springer
in 1985. I have plotted it in figure 12.5. The dataset is famous, because it shows
a periodic behavior in the number of pelts (which is a good proxy for the number
of lynx), which is interpreted as a result of predator-prey interactions. Lynx eat
rabbits. When there are many rabbits, lynx kittens thrive, and soon there will
be many lynx; but then they eat most of the rabbits, and starve, at which point
the rabbit population rockets. You should also notice that after about 1900, prices
seem to have gone up rather quickly. I dont know why this is. There is also some
suggestion, as there should be, that prices are low when there are many pelts, and
high when there are few.
12.1.3 Scatter Plots for Spatial Data
It isnt always natural to plot data as a function. For example, in a dataset containing the temperature and blood pressure of a set of patients, there is no reason
to believe that temperature is a function of blood pressure, or the other way round.
Two people could have the same temperature, and different blood pressures, or
vice-versa. As another example, we could be interested in what causes people to
die of cholera. We have data indicating where each person died in a particular
outbreak. It isnt helpful to try and plot such data as a function.
The scatter plot is a powerful way to deal with this situation. In the first
Section 12.1
Heart rate
80
70
60
96
98
100
Body temperature
246
6
Normalized heart rate
2 2
212
2 2122122 1
1
222121211212
1
1 12221 2222112112122 22
211 111121 2 212 1
1 1111212112 212
1
11 2111112 1 1
12
2111 2211211 2 2
2 1 21 12 2
1 121 22 1
2
222
2
2 1 2
90
Plotting 2D Data
102
4
2
0
2
2 11
1
12
111
21
11 12 2212
22
22
1 22
2
2
1
1
1
2
1
2
2
2
2
2
2
1
1211 2 212212 1 2
2 11
1 112
221211211212
11 21 1
21 1 2
1112 2
12
2 11
1 212
2 1222121
1221 1
11
2 2
1
4
6
6
4
2
0
2
4
Normalized body temperature
FIGURE 12.8: A scatter plot of body temperature against heart rate, from the dataset
at http:// www2.stetson.edu/ jrasp/ data.htm; normtemp.xls. I have separated the
two genders by plotting a different symbol for each (though I dont know which
gender is indicated by which letter); if you view this in color, the differences in color
makes for a greater separation of the scatter. This picture suggests, but doesnt
conclusively establish, that there isnt much dependence between temperature and
heart rate, and any dependence between temperature and heart rate isnt affected by
gender.
instance, assume that our data points actually describe points on the a real map.
Then, to make a scatter plot, we make a mark on the map at a place indicated by
each data point. What the mark looks like, and how we place it, depends on the
particular dataset, what we are looking for, how much we are willing to work with
complex tools, and our sense of graphic design.
Figure 12.6 is an extremely famous scatter plot, due to John Snow. Snow
one of the founders of epidemiology used a scatter plot to reason about a cholera
outbreak centered on the Broad Street pump in London in 1854. At that time,
the mechanism that causes cholera was not known. Snow plotted cholera deaths as
little bars (more bars, more deaths) on the location of the house where the death
occurred. More bars means more deaths, fewer bars means fewer deaths. There
are more bars per block close to the pump, and few far away. This plot offers quite
strong evidence of an association between the pump and death from cholera. Snow
used this scatter plot as evidence that cholera was associated with water, and that
the Broad Street pump was the source of the tainted water.
Figure 12.7 shows a scatter plot of arsenic levels in groundwater for the United
States, prepared by the US Geological Survey. The data set was collected by Focazio
and others in 2000; by Welch and others in 2000; and then updated by Ryker 2001.
It can be found at http://water.usgs.gov/GIS/metadata/usgswrd/XML/arsenic map.
xml. One variant of a scatter plot that is particularly useful for geographic data
occurs when one fills regions on a map with different colors, following the data in
that region. Figure 12.7 shows the nitrogen usage by US county in 1991; again,
this figure was prepared by the US Geological Survey.
Section 12.1
Weights
350
300
250
200
150
100
20
247
300
Weights, outliers removed
400
Plotting 2D Data
40
Heights
60
250
200
150
100
60
80
65
70
75
Heights, outliers removed
80
300
400
250
300
Weights
Weights, outliers removed
FIGURE 12.9: A scatter plots of weight against height, from the dataset at http://
www2.stetson.edu/ jrasp/ data.htm. Left: Notice how two outliers dominate the
picture, and to show the outliers, the rest of the data has had to be bunched up.
Right shows the data with the outliers removed. The structure is now somewhat
clearer.
200
150
100
60
200
100
65
70
75
Heights, outliers removed
80
0
60
65
70
Heights
75
80
FIGURE 12.10: Scatter plots of weight against height, from the dataset at http://
www2.stetson.edu/ jrasp/ data.htm. Left: data with two outliers removed, as in
figure 12.9. Right: this data, rescaled slightly. Notice how the data looks less
spread out. But there is no difference between the datasets. Instead, your eye is
easily confused by a change of scale.
Remember this:
Scatter plots are a most effective tool for geographic
data and 2D data in general. A scatter plot should be your first step with a
new 2D dataset.
Section 12.1
Plotting 2D Data
248
price of pelts, in pennies
1500
1000
500
0
0
2
4
6
number of pelts traded
8
x 10
FIGURE 12.11: A scatter plot of the price of lynx pelts against the number of pelts.
I have plotted data for 1901 to the end of the series as circles, and the rest of the
data as *s. It is quite hard to draw any conclusion from this data, because the scale
is confusing. Furthermore, the data from 1900 on behaves quite differently from the
other data.
12.1.4 Exposing Relationships with Scatter Plots
Scatter plots are natural for geographic data, but a scatter plot is a useful, simple
tool for ferreting out associations in other kinds of data as well. Now we need
some notation. Assume we have a dataset {x} of N data items, x1 , . . . , xN . Each
data item is a d dimensional vector (so its components are numbers). We wish to
investigate the relationship between two components of the dataset. For example,
we might be interested in the 7th and the 13th component of the dataset. We
will produce a two-dimensional plot, one dimension for each component. It does
not really matter which component is plotted on the x-coordinate and which on
the y-coordinate (though it will be some pages before this is clear). But it is very
difficult to write sensibly without talking about the x and y coordinates.
We will make a two-dimensional dataset out of the components that interest
us. We must choose which component goes first in the resulting 2-vector. We will
plot this component on the x-coordinate (and we refer to it as the x-coordinate),
and to the other component as the y-coordinate. This is just to make it easier to
describe what is going on; theres no important idea here. It really will not matter
which is x and which is y. The two components make a dataset {xi } = {(xi , yi )}.
To produce a scatter plot of this data, we plot a small shape at the location of each
data item.
Such scatter plots are very revealing. For example, figure 12.8 shows a scatter
plot of body temperature against heart rate for humans. In this dataset, the gender
of the subject was recorded (as 1 or 2 I dont know which is which), and
so I have plotted a 1 at each data point with gender 1, and so on. Looking
at the data suggests there isnt much difference between the blob of 1 labels and
the blob of 2 labels, which suggests that females and males are about the same
in this respect.
The scale used for a scatter plot matters. For example, plotting lengths in
Section 12.1
Plotting 2D Data
249
Weights, outliers removed, normalized
meters gives a very different scatter from plotting lengths in millimeters. Figure 12.9 shows two scatter plots of weight against height. Each plot is from the
same dataset, but one is scaled so as to show two outliers. Keeping these outliers
means that the rest of the data looks quite concentrated, just because the axes
are in large units. In the other plot, the axis scale has changed (so you cant see
the outliers), but the data looks more scattered. This may or may not be a misrepresentation. Figure 12.10 compares the data with outliers removed, with the
same plot on a somewhat different set of axes. One plot looks as though increasing
height corresponds to increasing weight; the other looks as though it doesnt. This
is purely due to deceptive scaling each plot shows the same dataset.
Dubious data can also contribute to scaling problems. Recall that, in figure 12.5, price data before and after 1900 appeared to behave differently. Figure 12.11 shows a scatter plot of the lynx data, where I have plotted number of
pelts against price. I plotted the post-1900 data as circles, and the rest as asterisks. Notice how the circles seem to form a quite different figure, which supports the
suggestion that something interesting happened around 1900. We can reasonably
choose to analyze data after 1900 separately from before 1900. A choice like this
should be made with care. If you exclude every data point that might disagree with
your hypothesis, you may miss the fact that you are wrong. Leaving out data is
an essential component of many kinds of fraud. You should always reveal whether
you have excluded data, and why, to allow the reader to judge the evidence.
When you look at Figure 12.11, you should notice the scatter plot does not
seem to support the idea that prices go up when supply goes down. This is puzzling
because its generally a pretty reliable idea. In fact, the plot is just hard to interpret
because it is poorly scaled. Scale is an important nuisance, and its easy to get
misled by scale effects.
4
2
0
2
4
4
2
0
2
4
Heights, outliers removed, normalized
FIGURE 12.12: A normalized scatter plot of weight against height, from the dataset
at http:// www2.stetson.edu/ jrasp/ data.htm. Now you can see that someone who
is a standard deviation taller than the mean will tend to be somewhat heavier than
the mean too.
The way to avoid the problem is to plot in standard coordinates. We can
normalize without worrying about the dimension of the data we normalize each
Section 12.1
Heart rate
80
70
60
96
250
6
2 2
212
2 2122122 1
1
222121211212
1
1 12221 2222112112122 22
211 111121 2 212 1
1 1111212112 212
1
11 2111112 1 1
12
2111 2211211 2 2
2 1 21 12 2
1 121 22 1
2
222
2
2 1 2
90
Plotting 2D Data
98
100
Body temperature
102
4
2
0
2
2 11
1
12
111
21
11 12 2212
22
22
1 22
2
2
1
1
1
2
1
2
2
2
2
2
2
1
1211 2 212212 1 2
2 11
1 112
221211211212
11 21 1
21 1 2
1112 2
12
2 11
1 212
2 1222121
1221 1
11
2 2
1
4
6
6
4
2
0
2
4
FIGURE 12.13: Left: A scatter plot of body temperature against heart rate, from the
dataset at http:// www2.stetson.edu/ jrasp/ data.htm; normtemp.xls. I have separated the two genders by plotting a different symbol for each (though I dont know
which gender is indicated by which letter); if you view this in color, the differences
in color makes for a greater separation of the scatter. This picture suggests, but
doesnt conclusively establish, that there isnt much dependence between temperature
and heart rate, and any dependence between temperature and heart rate isnt affected
by gender. The scatter plot of the normalized data, in standard coordinates, on the
right supports this view.
4
3
normalized price
price of pelts, in pennies
1500
1000
500
2
1
0
1
0
0
2
4
6
number of pelts traded
8
4
x 10
2
2
1
0
1
2
normalized number of pelts
FIGURE 12.14: Left: A scatter plot of the price of lynx pelts against the number of
pelts (this is a repeat of figure 12.11 for reference). I have plotted data for 1901
to the end of the series as circles, and the rest of the data as *s. It is quite hard
to draw any conclusion from this data, because the scale is confusing. Right: A
scatter plot of the price of pelts against the number of pelts for lynx pelts. I excluded
data for 1901 to the end of the series, and then normalized both price and number
of pelts. Notice that there is now a distinct trend; when there are fewer pelts, they
are more expensive, and when there are more, they are cheaper.
Section 12.2
Correlation
251
dimension independently by subtracting the mean of that dimension and dividing

by the standard deviation of that dimension. This means we can normalize the
x and y coordinates of the two-dimensional data separately. We continue to use
the convention of writing the normalized x coordinate as x
and the normalized y
coordinate as y. So, for example, we can write xj = (xj mean ({x}))/std ({x})) for
the x
value of the jth data item in normalized coordinates. Normalizing shows us
the dataset on a standard scale. Once we have done this, it is quite straightforward
to read off simple relationships between variables from a scatter plot.
Remember this:
The plot scale can mask effects in scatter plots, and
its usually a good idea to plot in standard coordinates.
4
2
0
2
4
4
2
0
2
4
12.2 CORRELATION
4
2
0
2
4
4
2
0
2
4
FIGURE 12.15: On the left, a normalized scatter plot of weight (y-coordinate) against
height (x-coordinate). On the right, a scatter plot of height (y-coordinate) against
weight (x-coordinate). Ive put these plots next to one another so you dont have to
mentally rotate (which is what you should usually do).
The simplest, and most important, relationship to look for in a scatter plot is
this: when x
increases, does y tend to increase, decrease, or stay the same? This is
straightforward to spot in a normalized scatter plot, because each case produces a
very clear shape on the scatter plot. Any relationship is called correlation (we will
see later how to measure this), and the three cases are: positive correlation, which
means that larger x
values tend to appear with larger y values; zero correlation,
which means no relationship; and negative correlation, which means that larger x
values tend to appear with smaller y values. I have shown these cases together
in one figure using a real data example (Figure 12.16), so you can compare the
appearance of the plots.
4
2
0
2
2 211
121 1
1
221111
11 12 2212
2222
222 22 2
212
222121
2
1
1
11
1212
2 11
1 122
2111211 11
2
2
2
1
1
2
1
1
1
12
21 2
2
2 11
1112 2
1
1
2
1
1
2
2
2 1222111
1 21
2 22 1
1
4
6
6
4
2
0
2
4
Positive Correlation
Correlation
252
Negative Correlation
4
3
normalized price
No Correlation
6
Section 12.2
0
2
2
1
0
1
4
4
2
0
2
4
2
2
1
0
1
2
normalized number of pelts
FIGURE 12.16: The three kinds of scatter plot are less clean for real data than for
our idealized examples. Here I used the body temperature vs heart rate data for the
zero correlation; the height-weight data for positive correlation; and the lynx data
for negative correlation. The pictures arent idealized real data tends to be messy
but you can still see the basic structures.
Positive correlation occurs when larger x

values tend to appear with larger
y values. This means that data points with with small (i.e. negative with large
magnitude) x
values must have small y values, otherwise the mean of x
(resp.
y) would be too big. In turn, this means that the scatter plot should look like a
smear of data from the bottom left of the graph to the top right. The smear might
be broad or narrow, depending on some details well discuss below. Figure 12.12
shows normalized scatter plots of weight against height, and of body temperature
against heart rate. In the weight-height plot, you can clearly see that individuals
who are higher tend to weigh more. The important word here is tend taller
people could be lighter, but mostly they tend not to be. Notice, also, that I did
NOT say that they weighed more because they were taller, but only that they tend
to be heavier.
Negative correlation occurs when larger x
values tend to appear with
smaller y values. This means that data points with with small x values must
have large y values, otherwise the mean of x
(resp. y) would be too big. In turn,
this means that the scatter plot should look like a smear of data from the top left
of the graph to the bottom right. The smear might be broad or narrow, depending
on some details well discuss below. Figure 12.14 shows a normalized scatter plot
of the lynx pelt-price data, where I have excluded the data from 1901 on. I did so
because there seemed to be some other effect operating to drive prices up, which
was inconsistent with the rest of the series. This plot suggests that when there were
more pelts, prices were lower, as one would expect.
Zero correlation occurs when there is no relationship. This produces a
characteristic shape in a scatter plot, but it takes a moment to understand why. If
there really is no relationship, then knowing x will tell you nothing about y. All
we know is that mean ({
y}) = 0, and var ({
y}) = 1. This is enough information to
predict what the plot will look like. We know that mean ({
x}) = 0 and var ({
x}) = 1;
so there will be many data points with x
value close to zero, and few with a much
larger or much smaller x
value. The same applies to y. Now consider the data
points in a strip of x
values. If this strip is far away from the origin, there will
Section 12.2
Correlation
253
be few data points in the strip, because there arent many big x
values. If there
is no relationship, we dont expect to see large or small y values in this strip,
because there are few data points in the strip and because large or small y values
are uncommon we see them only if there are many data points, and then seldom.
So for a strip with x
close to zero, we might see some y values that are far from
zero because we will see many y values. For a strip with x that is far from zero,
we expect to see few y values that are far from zero, because we see few points in
this strip. This reasoning means the data should form a round blob, centered at
the origin. In the temperature-heart rate plot of figure 12.13, it looks as though
nothing of much significance is happening. The average heart rate seems to be
about the same for people who run warm or who run cool. There is probably not
much relationship here.
The correlation is not affected by which variable is plotted on the x-axis and
which is plotted on the y-axis. Figure 12.15 compares a plot of height against
weight to one of weight against height. Usually, one just does this by rotating the
page, or by imagining the new picture. The left plot tells you that data points
with higher height value tend to have higher weight value; the right plot tells you
that data points with higher weight value tend to have higher height value i.e.
the plots tell you the same thing. It doesnt really matter which one you look at.
Again, the important word is tend the plot doesnt tell you anything about
why, it just tells you that when one variable is larger the other tends to be, too.
12.2.1 The Correlation Coefficient
Consider a normalized data set of N two-dimensional vectors. We can write the
ith data point in standard coordinates (
xi , yi ). We already know many important
summaries of this data, because it is in standard coordinates. We have mean ({
x}) =
0; mean ({
y }) = 0; std ({
x}) = 1; and std ({
y }) = 1. Each
of
these
summaries
is
2
2
2
itself the
mean
of
some
monomial.
So
std
({
x
})
=
mean
x
=
1;
std
({
y
})
=

mean y2 (the other two are easy). We can rewrite this information
in
terms
2
of means ofmonomials,
giving mean ({
x}) = 0; mean ({
y }) = 0; mean x
= 1;

and mean y2 = 1. There is one monomial missing here, which is x
y.
The term mean ({
xy}) captures correlation between x and y. The term is
known as the correlation coefficient or correlation.
Section 12.2
Correlation
254
Definition: 12.1 Correlation coefficient

Assume we have N
data items which are 2-vectors
(x1 , y1 ), . . . , (xN , yN ), where N > 1. These could be obtained,
for example, by extracting components from larger vectors. We
compute the correlation coefficient by first normalizing the x and y
, yi = (yi mean({y}))
. The
coordinates to obtain x
i = (xi mean({x}))
std(x)
std(y)
correlation coefficient is the mean value of x
y, and can be computed
as:
P
xi yi
corr ({(x, y)}) = i
N
Correlation is a measure of our ability to predict one value from another.

The correlation coefficient takes values between 1 and 1 (well prove this below).
If the correlation coefficient is close to 1, then we are likely to predict very well.
Small correlation coefficients (under about 0.5, say, but this rather depends on
what you are trying to achieve) tend not to be all that interesting, because (as
we shall see) they result in rather poor predictions. Figure 12.17 gives a set of
scatter plots of different real data sets with different correlation coefficients. These
all come from data set of age-height-weight, which you can find at http://www2.
stetson.edu/jrasp/data.htm (look for bodyfat.xls). In each case, two outliers have
been removed. Age and height are hardly correlated, as you can see from the figure.
Younger people do tend to be slightly taller, and so the correlation coefficient is
-0.25. You should interpret this as a small correlation. However, the variable
called adiposity (which isnt defined, but is presumably some measure of the
amount of fatty tissue) is quite strongly correlated with weight, with a correlation
coefficient is 0.86. Average tissue density is quite strongly negatively correlated with
adiposity, because muscle is much denser than fat, so these variables are negatively
correlated we expect high density to appear with low adiposity, and vice versa.
The correlation coefficient is -0.86. Finally, density is very strongly correlated with
body weight. The correlation coefficient is -0.98.
Its not always convenient or a good idea to produce scatter plots in standard
coordinates (among other things, doing so hides the units of the data, which can
be a nuisance). Fortunately, scaling or translating data does not change the value
of the correlation coefficient (though it can change the sign if one scale is negative).
This means that its worth being able to spot correlation in a scatter plot that isnt
in standard coordinates (even though correlation is always defined in standard coordinates). Figure 12.18 shows different correlated datasets plotted in their original
units. These data sets are the same as those used in figure 12.17
Properties of the Correlation Coefficient
You should memorize the following properties of the correlation coefficient:
Section 12.2
Age and height, correlation=0.25

Weights, normalized
Height, normalized
2
0
2
2
0
2
2
0
2
Density, normalized
0
2
2
0
2
4
Adiposity, normalized
Density and Body Fat, correlation=0.98
4
Bodyfat, normalized
Adiposity, normalized
4
4
0
2
4
Age, normalized
Density and Adiposity, correlation=0.73
6
4
4
255
Adiposity and weight, correlation=0.86
4
4
Correlation
2
0
2
4
4
2
0
2
Density, normalized
FIGURE 12.17: Scatter plots for various pairs of variables for the age-height-weight
dataset from http:// www2.stetson.edu/ jrasp/ data.htm; bodyfat.xls. In each case,
two outliers have been removed, and the plots are in standard coordinates (compare
to figure 12.18, which shows these data sets plotted in their original units). The
legend names the variables.
The correlation coefficient is symmetric (it doesnt depend on the order of its
arguments), so
corr ({(x, y)}) = corr ({(y, x)})
The value of the correlation coefficient is not changed by translating the data.
Scaling the data can change the sign, but not the absolute value. For constants
a 6= 0, b, c 6= 0, d we have
corr ({(ax + b, cx + d)}) = sign(ab)corr ({(x, y)})
If y tends to be large (resp. small) for large (resp. small) values of x
, then
the correlation coefficient will be positive.
If y tends to be small (resp. large) for large (resp. small) values of x
, then
the correlation coefficient will be negative.
Section 12.2
Weights, NOT normalized
Height, NOT normalized
75
70
65
35
30
25
20
1.3
Bodyfat, NOT normalized
Adiposity, NOT normalized
40
60
80
100
Age, NOT normalized
Density and Adiposity, correlation=0.73
40
300
250
200
150
100
15
60
20
1
1.1
1.2
Density, NOT normalized
256
Adiposity and weight, correlation=0.86
Age and height, correlation=0.25

80
15
0.9
Correlation
20
25
30
35
40
Adiposity, NOT normalized
Density and Body Fat, correlation=0.98
50
40
30
20
10
0
0.9
1
1.1
1.2
Density, NOT normalized
1.3
FIGURE 12.18: Scatter plots for various pairs of variables for the age-height-weight
dataset from http:// www2.stetson.edu/ jrasp/ data.htm; bodyfat.xls. In each case,
two outliers have been removed, and the plots are NOT in standard coordinates
(compare to figure 12.17, which shows these data sets plotted in normalized coordinates). The legend names the variables.
If y doesnt depend on x
, then the correlation coefficient is zero (or close to
zero).
The largest possible value is 1, which happens when x
= y.
The smallest possible value is -1, which happens when x
=
y.
The first property is easy, and we relegate that to the exercises. One way to
see that the correlation coefficient isnt changed by translation or scale is to notice
that it is defined in standard coordinates, and scaling or translating data doesnt
change those. Another way to see this is to scale and translate data, then write out
the equations; notice that taking standard coordinates removes the effects of the
scale and translation. In each case, notice that if the scale is negative, the sign of
the correlation coefficient changes.
The property that, if y tends to be large (resp. small) for large (resp. small)
values of x
, then the correlation coefficient will be positive, doesnt really admit
Section 12.2
Correlation
257
a formal statement. But its relatively straightforward to see whats going on.
Because mean ({
x}) = 0, small values of mean ({
x}) must be negative and large
P
x
i yi
i
; and for this sum to be
values must be positive. But corr ({(x, y)}) =
N
positive, it should contain mostly positive terms. It can contain few or no hugely
positive (or hugely negative) terms, because std (
x) = std (
y) = 1 so there arent
many large (or small) numbers. For the sum to contain mostly positive terms, then
the sign of x
i should be the same as the sign yi for most data items. Small changes
to this argument work to show that if if y tends to be small (resp. large) for large
(resp. small) values of x
, then the correlation coefficient will be negative.
Showing that no relationship means zero correlation requires slightly more
work. Divide the scatter plot of the dataset up into thin vertical strips. There
are S strips. Each strip is narrow, so the x
value does not change much for the
data points in a particular strip. For the sth strip, write N (s) for the number of
data points in the strip, x
(s) for the x value at the center of the strip, and y(s)
for the mean of the y values within that strip. Now the strips are narrow, so we
can approximate all data points within a strip as having the same value of x
. This
yields

1 X
y (s)
mean ({
xy})
x
(s) N (s)
S
sstrips
(where you could replace with = if the strips were narrow enough). Now assume
that y(s) does not change from strip to strip, meaning that there is no relationship between x
and y in this dataset (so the picture is like the left hand side in
figure 12.16). Then each value of y(s) is the same we write y and we can
rearrange to get
1 X
mean ({
xy}) y
x
(s).
S
sstrips
Now notice that
0 = mean ({
y })
1
S
y (s)
N (s)
sstrips
(where again you could replace with = if the strips were narrow enough). This
means that if every strip has the same value of y(s), then that value must be zero.
In turn, if there is no relationship between x
and y, we must have mean ({
xy}) = 0.
Section 12.2
Correlation
258
Proposition:
1 corr ({(x, y)}) 1
Proof: Writing x, y for the normalized coefficients, we have
P
x
i yi
corr ({(x, y)}) = i
N
and you can think of the value as the inner product of two vectors. We write
1
1
x1 , x
2 , . . . x
N ] and y = [
y1 , y2 , . . . yN ]
x = [
N
N
2
and we have corr ({(x, y)}) = xT y. Notice xT x = std (x) = 1, and similarly
for y. But the inner product of two vectors is at its maximum when the two
vectors are the same, and this maximum is 1. This argument is also sufficient to
show that smallest possible value of the correlation is 1, and this occurs when
x
i =
yi for all i.
Property 12.1: The largest possible value of the correlation is 1, and this occurs
when x
i = yi for all i. The smallest possible value of the correlation is 1, and
this occurs when x
i =
yi for all i.
12.2.2 Using Correlation to Predict
Assume we have N data items which are 2-vectors (x1 , y1 ), . . . , (xN , yN ), where
N > 1. These could be obtained, for example, by extracting components from
larger vectors. As usual, we will write xi for xi in normalized coordinates, and so
on. Now assume that we know the correlation coefficient is r (this is an important,
traditional notation). What does this mean?
One (very useful) interpretation is in terms of prediction. Assume we have a
data point (x0 , ?) where we know the x-coordinate, but not the y-coordinate. We
can use the correlation coefficient to predict the y-coordinate. First, we transform
to standard coordinates. Now we must obtain the best y0 value to predict, using
the x
0 value we have.
We want to construct a prediction function which gives a prediction for any
value of x
. This predictor should behave as well as possible on our existing data.
For each of the (
xi , yi ) pairs in our data set, the predictor should take x
i and
produce a result as close to yi as possible. We can choose the predictor by looking
at the errors it makes at each data point.
We write yip for the value of yi predicted at x
i . The simplest form of predictor
is linear. If we predict using a linear function, then we have, for some unknown
a, b, that yip = a
xi + b. Now think about ui = yi yip , which is the error in our
prediction. We would like to have mean ({u}) = 0 (otherwise, we could reduce the
Section 12.2
Correlation
259
error of the prediction just by subtracting a constant).

mean ({u}) =
=
=
=
=
mean ({
y yp })
mean ({
y }) mean ({a
xi + b})
mean ({
y }) amean ({
x}) + b
0 a0 + b
0.
This means that we must have b = 0.

To estimate a, we need to think about var ({u}). We should like var ({u}) to
be as small as possible, so that the errors are as close to zero as possible (remember,
small variance means small standard deviation which means the data is close to the
mean). We have
var ({u}) =
=
=
=
=
var ({
y yp })

because mean ({u}) = 0
mean (
y a
x)2

2
mean (
y ) 2a
xy + a2 (
x)2
2
2
2amean ({
xy}) + a2 mean (
x)
mean (
y)
1 2ar + a2 ,
which we want to minimize by choice of a. At the minimum, we must have

dvar ({ui })
= 0 = 2r + 2a
da
so that a = r and the correct prediction is
y0p = r
x0
You can use a version of this argument to establish that if we have (?, y0 ), then
the best prediction for x
0 (which is in standard coordinates) is r
y0 . It is important
to notice that the coefficient of yi is NOT 1/r; you should work this example, which
appears in the exercises. We now have a prediction procedure, outlined below.
Section 12.2
Correlation
260
Procedure: 12.1 Predicting a value using correlation

Assume we have N
data items which are 2-vectors
(x1 , y1 ), . . . , (xN , yN ), where N > 1. These could be obtained,
for example, by extracting components from larger vectors. Assume
we have an x value x0 for which we want to give the best prediction of
a y value, based on this data. The following procedure will produce a
prediction:
Transform the data set into standard coordinates, to get
xi
yi
x
0
1
(xi mean ({x}))
std (x)
1
(yi mean ({y}))
std (y)
1
(x0 mean ({x})).
std (x)
Compute the correlation

r = corr ({(x, y)}) = mean ({
xy}).
Predict y0 = r
x0 .
Transform this prediction into the original coordinate system, to
get
y0 = std (y)r
x0 + mean ({y})
Now assume we have a y value y0 , for which we want to give the best
prediction of an x value, based on this data. The following procedure
will produce a prediction:
Transform the data set into standard coordinates.
Compute the correlation.
Predict x0 = r
y0 .
Transform this prediction into the original coordinate system, to
get
x0 = std (x)r
y0 + mean ({x})
There is another way of thinking about this prediction procedure, which is

often helpful. Assume we need to predict a value for x0 . In normalized coordinates,
our prediction is yp = r
x0 ; if we revert back to the original coordinate system, the
Section 12.2
Correlation
261
prediction becomes
(y p mean ({y}))
=r
std (y)
(x0 mean ({x}))

std (x)
This gives a really useful rule of thumb, which I have broken out in the box below.
Procedure: 12.2 Predicting a value using correlation: Rule of thumb 1

If x0 is k standard deviations from the mean of x, then the predicted
value of y will be rk standard deviations away from the mean of y, and
the sign of r tells whether y increases or decreases.
An even more compact version of the rule of thumb is in the following box.
Procedure: 12.3 Predicting a value using correlation: Rule of thumb 2

The predicted value of y goes up by r standard deviations when the
value of x goes up by one standard deviation.
We can compute the average root mean square error that this prediction
procedure will make. The square of this error must be

= mean y 2 2rmean ({xy}) + r2 mean x2
mean u2
=
1 2r2 + r2
1 r2
so the root mean square error will be 1 r2 . This is yet another interpretation of
correlation; if x and y have correlation close to one, then predictions could have very
small root mean square error, and so might be very accurate. In this case, knowing
one variable is about as good as knowing the other. If they have correlation close
to zero, then the root mean square error in a prediction might be as large as the
root mean square error in y which means the prediction is nearly a pure guess.
The prediction argument means that we can spot correlations for data in
other kinds of plots one doesnt have to make a scatter plot. For example, if
we were to observe a childs height from birth to their 10th year (you can often
find these observations in ballpen strokes, on kitchen walls), we could plot height
as a function of year. If we also had their weight (less easily found), we could plot
weight as a function of year, too. The prediction argument above say that, if you
can predict the weight from the height (or vice versa) then theyre correlated. One
way to spot this is to look and see if one curve goes up when the other does (or
Section 12.2
Correlation
262
FIGURE 12.19: This figure, from Vickers (ibid, p184) shows a plot of the stork
population as a function of time, and the human birth rate as a function of time, for
some years in Germany. The correlation is fairly clear; but this does not mean that
reducing the number of storks means there are fewer able to bring babies. Instead,
this is the impact of the first world war a hidden or latent variable.
goes down when the other goes up). You can see this effect in figure 12.5, where
(before 19h00), prices go down when the number of pelts goes up, and vice versa.
These two variables are negatively correlated.
12.2.3 Confusion caused by correlation
There is one very rich source of potential (often hilarious) mistakes in correlation.
When two variables are correlated, they change together. If the correlation is
positive, that means that, in typical data, if one is large then the other is large,
and if one is small the other is small. In turn, this means that one can make
a reasonable prediction of one from the other. However, correlation DOES NOT
mean that changing one variable causes the other to change (sometimes known as
causation).
Two variables in a dataset could be correlated for a variety of reasons. One
important reason is pure accident. If you look at enough pairs of variables, you
may well find a pair that appears to be correlated just because you have a small
set of observations. Imagine, for example, you have a dataset consisting of only
two vectors there is a pretty good chance that there is some correlation between
the coefficients. Such accidents can occur in large datasets, particularly if the
dimensions are high.
Another reason variables could be correlated is that there is some causal
relationship for example, pressing the accelerator tends to make the car go
faster, and so there will be some correlation between accelerator position and car
acceleration. As another example, adding fertilizer does tend to make a plant grow
bigger. Imagine you record the amount of fertilizer you add to each pot, and the
size of the resulting potplant. There should be some correlation.
Yet another reason variables could be correlated is that there is some other
background variable often called a latent variable linked causally to each of
Section 12.3
Sterile Males in Wild Horse Herds
263
the observed variables. For example, in children (as Freedman, Pisani and Purves
note in their excellent Statistics), shoe size is correlated with reading skills. This
DOES NOT mean that making your feet grow will make you read faster, or that
you can make your feet shrink by forgetting how to read. The real issue here is
the age of the child. Young children tend to have small feet, and tend to have
weaker reading skills (because theyve had less practice). Older children tend to
have larger feet, and tend to have stronger reading skills (because theyve had more
practice). You can make a reasonable prediction of reading skills from foot size,
because theyre correlated, even though there is no direct connection.
This kind of effect can mask correlations, too. Imagine you want to study the
effect of fertilizer on potplants. You collect a set of pots, put one plant in each,
and add different amounts of fertilizer. After some time, you record the size of each
plant. You expect to see correlation between fertilizer amount and plant size. But
you might not if you had used a different species of plant in each pot. Different
species of plant can react quite differently to the same fertilizer (some plants just
die if over-fertilized), so the species could act as a latent variable. With an unlucky
choice of the different species, you might even conclude that there was a negative
correlation between fertilizer and plant size. This example illustrates why you need
to take great care in setting up experiments and interpreting their results.
This sort of thing happens often, and its an effect you should look for. Another nice example comes from Vickers (ibid). The graph, shown in Figure 12.19,
shows a plot of (a) a dataset of the stork population in Europe over a period of
years and (b) a dataset of the birth rate over those years. This isnt a scatter plot;
instead, the data has been plotted on a graph. You can see by eye that these two
datasets are quite strongly correlated . Even more disturbing, the stork population dropped somewhat before the birth rate dropped. Is this evidence that storks
brought babies in Europe during those years? No (the usual arrangement seems
to have applied). For a more sensible explanation, look at the dates. The war
disturbed both stork and human breeding arrangements. Storks were disturbed
immediately by bombs, etc., and the human birth rate dropped because men died
at the front.
12.3 STERILE MALES IN WILD HORSE HERDS
Large herds of wild horses are (apparently) a nuisance, but keeping down numbers
by simply shooting surplus animals would provoke outrage. One strategy that
has been adopted is to sterilize males in the herd; if a herd contains sufficient
sterile males, fewer foals should result. But catching stallions, sterilizing them, and
reinserting them into a herd is a performance does this strategy work?
We can get some insight by plotting data. At http://lib.stat.cmu.edu/DASL/
Datafiles/WildHorses.html, you can find a dataset covering herd management in wild
horses. I have plotted part of this dataset in figure 12.20. In this dataset, there
are counts of all horses, sterile males, and foals made on each of a small number
of days in 1986, 1987, and 1988 for each of two herds. I extracted data for one
herd. I have plotted this data as a function of the count of days since the first data
point, because this makes it clear that some measurements were taken at about
the same time, but there are big gaps in the measurements. In this plot, the data
Section 12.3
264
Number of horses vs. day
Number of horses
80
Adults
Sterile Males
Foals
60
40
20
0
0
200
400
Day
600
800
FIGURE 12.20: A plot of the number of adult horses, sterile males, and foals in horse
herds over a period of three years. The plot suggests that introducing sterile males
might cause the number of foals to go down. Data from http:// lib.stat.cmu.edu/
DASL/ Datafiles/ WildHorses.html.
points are shown with a marker. Joining them leads to a confusing plot because
the data points vary quite strongly. However, notice that the size of the herd drifts
down slowly (you could hold a ruler against the plot to see the trend), as does the
number of foals, when there is a (roughly) constant number of sterile males.
Does sterilizing males result in fewer foals? This is likely hard to answer for
this dataset, but we could ask whether herds with more sterile males have fewer
foals. A scatter plot is a natural tool to attack this question. However, the scatter
plots of figure 12.21 suggest, rather surprisingly, that when there are more sterile
males there are more adults (and vice versa), and when there are more sterile
males there are more foals (and vice versa). This is borne out by a correlation
analysis. The correlation coefficient between foals and sterile males is 0.74, and
the correlation coefficient between adults and sterile males is 0.68. You should find
this very surprising how do the horses know how many sterile males there are
in the herd? You might think that this is an effect of scaling the plot, but there is
a scatter plot in normalized coordinates in figure 12.21 that is entirely consistent
with the conclusions suggested by the unnormalized plot. What is going on here?
The answer is revealed by the scatter plots of figure 12.22. Here, rather than
plotting a * at each data point, I have plotted the day number of the observation.
This is in days from the first observation. You can see that the whole herd is
shrinking observations where there are many adults (resp. sterile adults, foals)
occur with small day numbers, and observations where there are few have large day
numbers. Because the whole herd is shrinking, it is true that when there are more
adults and more sterile males, there are also more foals. Alternatively, you can see
the plots of figure 12.20 as a scatter plot of herd size (resp. number of foals, number
of sterile males) against day number. Then it becomes clear that the whole herd is
shrinking, as is the size of each group. To drive this point home, we can look at the
correlation coefficient between adults and days (-0.24), between sterile adults and
days (-0.37), and between foals and days (-0.61). We can use the rule of thumb in
Section 12.3
Foals vs. sterile adults
Sterile adults vs. adults

15
Number of sterile adults
Number of foals
20
15
10
5
5
10
15
Number of sterile adults
Foals vs. sterile adults (standard coordinates)
3
2
1
0
1
0
1
2
3
N sterile adults (standard coordinates)
10
0
20
N foals (standard coordinates)
0
0
1
2
265
30
40
50
60
Number of adults
70
Sterile adults vs. adults (standard coordinates)

3
2
1
0
1
2
2
1
0
1
N adults (standard coordinates)
FIGURE 12.21: Scatter plots of the number of sterile males in a horse herd against
the number of adults, and the number of foals against the number of sterile males,
from data of http:// lib.stat.cmu.edu/ DASL/ Datafiles/ WildHorses.html. Top: unnormalized; bottom: standard coordinates.
box 12.3 to interpret this. This means that every 282 days, the herd loses about
three adults; about one sterile adult; and about three foals. For the herd to have
a stable size, it needs to gain by birth as many foals as it loses both to growing up
and to death. If the herd is losing three foals every 282 days, then if they all grow
up to replace the missing adults, the herd will be shrinking slightly (because it is
losing four adults in this time); but if it loses foals to natural accidents, etc., then
it is shrinking rather fast.
The message of this example is important. To understand a simple dataset,
you might need to plot it several ways. You should make a plot, look at it and ask
what it says, and then try to use another type of plot to confirm or refute what
you think might be going on.
N foals (standard coordinates)
Foals vs. adults (standard coordinates)

6
40
67
1
0
360
66
336 0
361
772
39
738
696
710
404
742
700 374
335
375
2
2
1
0
1
2
Section 12.4
You should
266
Sterile adults vs. adults (standard coordinates)

6
67
40
0
404
772
700
742
361
374
336
335 375
66
696
39
710
738
360
2
1
0
1
2
FIGURE 12.22: Scatter plots of the number of foals vs. the number of adults and
the number of adults vs. the number of sterile adults for the wild horse herd,
from http:// lib.stat.cmu.edu/ DASL/ Datafiles/ WildHorses.html. Rather than plot
data points as dots, I have plotted the day on which the observation was made.
Notice how the herd starts large, and then shrinks.
12.4 YOU SHOULD

12.4.1 be able to:
Plot a bar chart, a heat map, and a pie chart for a categorical dataset.
Plot a dataset as a graph, making sensible choices about markers, lines and
the like.
Plot a scatter plot for a dataset.
Plot a normalized scatter plot for a dataset.
Interpret the scatter plot to tell the sign of the correlation between two variables, and estimate the size of the correlation coefficient.
Compute a correlation coefficient.
Interpret a correlation coefficient.
Use correlation to make predictions.
12.4.2 remember:
New term: pie chart . . . . . . . . . . . . . . . . . . . . . . . .
New term: stacked bar chart . . . . . . . . . . . . . . . . . . .
New term: heat map . . . . . . . . . . . . . . . . . . . . . . . .
New term: 3D bar chart . . . . . . . . . . . . . . . . . . . . . .
There are a variety of tools for plotting categorical data. . . . .
New term: scatter plot . . . . . . . . . . . . . . . . . . . . . . .
A scatter plot should be your first step with a new 2D dataset.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
233
233
234
235
236
238
240
Section 12.4
Its usually a good idea to plot in

New term: correlation . . . . . .
New term: correlation coefficient
New term: correlation . . . . . .
Definition: Correlation coefficient
New term: latent variable . . . .
standard coordinates.
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
You should
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
267
.
.
.
.
.
.
244
244
246
246
247
255
C H A P T E R
13
Background: Useful Probability

Distributions
We will use probability as a tool to resolve practical questions about data.
I describe some forms these questions take below, for concreteness. Generally,
resolving these questions requires some form of model. This model gives an abstract
representation of the problem that is useful for problem solving, and (typically)
comes with recipes for attacking the major types of problem.
We could ask what process produced the data? For example, I observe a
set of independent coin flips. I would now like to know the probability of observing
a head when the coin is flipped. This might seem a bit empty as a problem, but an
analogous problem is: are male children or female children more frequent? Notice
that the answers to these questions are typically not exact. Instead, they are
estimates. We will see a variety of methods to estimate the probability of observing
a head from a sequence of independent coin flips, but none is guaranteed to return
the right answer (because you cant). Instead, we have to live with information
about how accurate the estimate is.
We could ask what sort of data can we expect in the future? For
example, we could ask: is gender assigned independently? equivalently, can you
predict the gender of the next child born to a couple more accurately if you look at
the genders of the previous children? In reliability engineering, one asks: how long
will it be until this product fails? One version of this question that occupies many
people is: how long until I die? By the way, the best answer seems to be subtract
your age from a number that seems to be close to 85. Again, these are questions
which dont lend themselves to the right answer, as opposed to the best possible
estimate. You might get hit by a truck tomorrow.
We could ask what labels should we attach to unlabelled data? For
example, we might see a large number of credit card transactions, some known to be
legitimate and others known to be fraudulent. We now see a new transaction: is it
legitimate? You can see that versions of this question appear in many applications.
As another example, you see many programs downloaded from the web, some known
to be legitimate and others known to be malware. You now see a new program:
is it safe to run? It may be possible in some circumstances to know the right
answer to this question, but we are usually stuck with the best answer.
We could ask is an effect easily explained by chance variations, or is
it real? For example, you believe that the average weight of a mouse is 15 grams.
You could test this by catching 100 mice (a sample) and weighing them; but, when
you do this, the answer wont be 15 grams even if the average weight of a mouse
is 15 grams. This is because you have a random selection of mice. But we will be
able to build a model of the random variations in the sample average. We can use
268
Section 13.1
Discrete Distributions
269
this model to tell whether the difference between sample average and 15 grams is
easily explained by random variations in the sample, or is significant. This allows
us to tell how forcefully the evidence contradicts your original belief.
Building a model requires understanding the specific problem you want to
solve, then choosing from a vocabulary of many different models that might apply
to the problem. Long experience shows that even if the model you choose does not
match the problem exactly, it can still be useful. In this chapter, I describe the
properties of some probability distributions that are used again and again in model
building.
13.1 DISCRETE DISTRIBUTIONS
13.1.1 The Discrete Uniform Distribution
If every value of a discrete random variable has the same probability, then the
probability distribution is the discrete uniform distribution. We have seen this
distribution before, numerous times. For example, I define a random variable by
the number that shows face-up on the throw of a fair die. This has a uniform
distribution. As another example, write the numbers 1-52 on the face of each card
of a standard deck of playing cards. The number on the face of the first card drawn
from a well-shuffled deck is a random variable with a uniform distribution.
One can construct expressions for the mean and variance of a discrete uniform
distribution, but theyre not usually much use (too many terms, not often used).
Keep in mind that if two random variables have a uniform distribution, their sum
and difference will not (recall example ??).
13.1.2 Bernoulli Random Variables
A Bernoulli random variable models a biased coin with probability p of coming up
heads in any one flip.
Definition: 13.1 Bernoulli random variable

A Bernoulli random variable takes the value 1 with probability p and
0 with probability 1 p. This is a model for a coin toss, among other
things
Section 13.1
270
Useful Facts: 13.1 Bernoulli random variable

1. A Bernoulli random variable has mean p.
2. A Bernoulli random variable has variance p(1 p).
Proofs are easy, and in the exercises.
13.1.3 The Geometric Distribution

We have a biased coin. The probability it will land heads up, P ({H}) is given by
p. We flip this coin until the first head appears. The number of flips required is a
discrete random variable which takes integer values greater than or equal to one,
which we shall call X. To get n flips, we must have n 1 tails followed by 1 head.
This event has probability (1 p)(n1) p. We can now write out the probability
distribution that n flips are required.
Definition: 13.2 Geometric distribution

We have an experiment with a binary outcome (i.e. heads or tails; 0
or 1; and so on), with P (H) = p and P (T ) = 1 p. We repeat this
experiment until the first head occurs. The probability distribution for
n, the number of repetitions, is the geometric distribution. It has the
form
P ({X = n}) = (1 p)(n1) p.
for 0 p 1 and n 1; for other n it is zero. p is called the parameter
of the distribution.
Notice that the geometric distribution is non-negative everywhere. It is straightforward to show that it sums to one, and so is a probability distribution (exercises).
Useful Facts: 13.2 Geometric distribution

1. The mean of the geometric distribution is p1 .
2. The variance of the geometric distribution is
1p
p2 .
The proof of these facts requires some work with series, and is relegated
to the exercises.
Section 13.1
271
It should be clear that this model isnt really about coins, but about repeated
trials. The trial could be anything that has some probability of failing. Each
trial is independent, and the rule for repeating is that you keep trying until the
first success. Textbooks often set exercises involving missiles and aircraft; Ill omit
these on the grounds of taste.
13.1.4 The Binomial Probability Distribution
Assume we have a biased coin with probability p of coming up heads in any one
flip. The binomial probability distribution gives the probability that it comes up
heads h times in N flips.
Worked example ?? yields one way of deriving this distribution. In that
example, I showed that there are
N !/(h!(N h)!)
outcomes of N coin flips that have h heads. These outcomes are disjoint, and
each has probability ph (1 p)(N h) . As a result, we must have the probability
distribution below.
Definition: 13.3 Binomial distribution

In N independent repetitions of an experiment with a binary outcome
(ie heads or tails; 0 or 1; and so on) with P (H) = p and P (T ) = 1 p,
the probability of observing a total of h Hs and N h T s is

N
Pb (h; N, p) =
ph (1 p)(N h)
h
as long as 0 h N ; in any other case, the probability is zero.
Section 13.1
272
Useful Fact: 13.3 Binomial distribution

Write Pb (i; N, p) for the binomial distribution that one observes i Hs
in N trials.
N
X
i=0
Pb (i; N, p) = (p + (1 p))N = (1)N = 1
by pattern matching to the binomial theorem. As a result,

N
X
Pb (i; N, p) = 1
i=0
The binomial distribution satisfies a recurrence relation. We must have that

Pb (h; N, p) = pPb (h 1; N 1, p) + (1 p)Pb (h; N 1, p).
This is because can get h heads in N flips either by having h 1 heads in N 1
flips, then flipping another, or by having h heads in N flips then flipping a tail.
You can verify by induction that the binomial distribution satisfies this recurrence
relation.
Useful Facts: 13.4 Binomial distribution

1. The mean of Pb (i; N, p) is N p.
2. The variance of Pb (i; N, p) is N p(1 p)
The proofs are informative, and so are not banished to the exercises.
Section 13.1
273
Proof: 13.1 The binomial distribution

Notice that the number of heads in N coin tosses is can be obtained
by adding the number of heads in each toss. Write Yi for the Bernoulli
random variable representing the ith toss. If the coin comes up heads,
Yi = 1, otherwise Yi = 0. The Yi are independent. Now
N
X
Yi
E[X] = E
j=1
N
X
E[Yi ]
j=1
=
=
N E[Y1 ]
because the Yi are independent
N p.
The variance is easy, too. Each coin toss is independent, so the variance
of the sum of coin tosses is the sum of the variances. This gives
N
X
var[X] = var
Yi
j=1
=
=
N var[Y1 ]
N p(1 p)
13.1.5 Multinomial probabilities

The binomial distribution describes what happens when a coin is flipped multiple
times. But we could toss a die multiple times too. Assume this die has k sides, and
we toss it N times. The distribution of outcomes is known as the multinomial
distribution.
We can guess the form of the multinomial distribution in rather a straightforward way. The die has k sides. We toss the die N times. This gives us a sequence
of N numbers. Each toss of the die is independent. Assume that side 1 appears n1
times, side 2 appears n2 times, ... side k appears nk times. Any single sequence
with this property will appear with probability pn1 1 pn2 2 ...pnk k , because the tosses are
independent. However, there are
N!
n1 !n2 !...nk !
such sequences. Using this reasoning, we arrive at the distribution below
Section 13.1
274
Definition: 13.4 Multinomial distribution

I perform N independent repetitions of an experiment with k possible
outcomes. The ith such outcome has probability pi . I see outcome 1 n1
times, outcome 2 n2 times, etc. Notice that n1 + n2 + n3 + . . .+ nk = N .
The probability of observing this set of outcomes is
Pm (n1 , . . . , nk ; N, p1 , . . . , pk ) =
Worked example 13.1
N!
pn1 pn2 . . . pnk k .
n1 !n2 !...nk ! 1 2
Dice
I throw five fair dice. What is the probability of getting two 2s and three 3s?
Solution:
5! 1 2 1 3
2!3! ( 6 ) ( 6 )
13.1.6 The Poisson Distribution

Assume we are interested in counts that occur in an interval of time (e.g. within
a particular hour). Because they are counts, they are non-negative and integer
valued. We know these counts have two important properties. First, they occur
with some fixed average rate. Second, an observation occurs independent of the
interval since the last observation. Then the Poisson distribution is an appropriate
model.
There are numerous such cases. For example, the marketing phone calls you
receive during the day time are likely to be well modelled by a Poisson distribution.
They come at some average rate perhaps 5 a day as I write, during the last
phases of an election year and the probability of getting one clearly doesnt
depend on the time since the last one arrived. Classic examples include the number
of Prussian soldiers killed by horse-kicks each year; the number of calls arriving at
a call center each minute; the number of insurance claims occurring in a given time
interval (outside of a special event like a hurricane, etc.).
Section 13.1
275
Definition: 13.5 Poisson distribution

A non-negative, integer valued random variable X has a Poisson distribution when its probability distribution takes the form
P ({X = k}) =
k e
,
k!
where > 0 is a parameter often known as the intensity of the distribution.
Notice that the Poisson distribution is a probability distribution, because it

is non-negative and because
X
i
= e
i!
i=0
so that
X
k e
k=0
k!
=1
Useful Facts: 13.5 Poisson distribution

1. The mean of a Poisson distribution with intensity is .
2. The variance of a Poisson distribution with intensity is (no,
thats not an accidentally repeated line or typo).
The proof of these facts requires some work with series, and is relegated
to the exercises.
I described the Poisson distribution as a natural model for counts of randomly

distributed points along a time axis. But it doesnt really matter that this is a time
axis it could be a space axis instead. For example, you could take a length of
road, divide it into even intervals, then count the number of road-killed animals is
in each interval. If the location of each animal is independent of the location of
any other animal, then you could expect a Poisson model to apply to the count
data. Assume that the Poisson model that best describes the data has parameter
. One property of such models is that if you doubled the length of the intervals,
then the resulting dataset would be described by a Poisson model with parameter
2; similarly, if you halved the length of the intervals, the best model would have
parameter /2. This corresponds to our intuition about such data; roughly, the
Section 13.2
Continuous Distributions
276
number of road-killed animals in two miles of road should be twice the number in
one mile of road. This property means that no pieces of the road are special
each behaves the same as the other.
We can build a really useful model of spatial randomness by observing this
fact and generalizing very slightly. A Poisson point process with intensity is a
set of random points with the property that the number of points in an interval of
length s is a Poisson random variable with parameter s. Notice how this captures
our intuition that if points are very randomly distributed, there should be twice
as many of them in an interval that is twice as long.
This model is easily, and very usefully, extended to points on the plane, on
surfaces, and in 3D. In each case, the process is defined on a domain D (which has
to meet some very minor conditions that are of no interest to us). The number
of points in any subset s of D is a Poisson random variable, with intensity m(s),
where m(s) is the area (resp. volume) of s. These models are useful, because they
capture the property that (a) the points are random and (b) the probability you
find a point doesnt depend on where you are. You could reasonably believe models
like this apply to, say, dead flies on windscreens; the places where you find acorns
at the foot of an oak tree; the distribution of cowpats in a field; the distribution of
cherries in a fruitcake; and so on.
13.2 CONTINUOUS DISTRIBUTIONS
13.2.1 The Continuous Uniform Distribution
Some continuous random variables have a natural upper bound and a natural lower
bound but otherwise we know nothing about them. For example, imagine we are
given a coin of unknown properties by someone who is known to be a skillful maker
of unfair coins. The manufacturer makes no representations as to the behavior of
the coin. The probability that this coin will come up heads is a random variable,
about which we know nothing except that it has a lower bound of zero and an
upper bound of one.
If we know nothing about a random variable apart from the fact that it has
a lower and an upper bound, then a uniform distribution is a natural model.
Write l for the lower bound and u for the upper bound. The probability density
function for the uniform distribution is
x<l
0
1/(u l) l x u
p(x) =
0
x>u
A continuous random variable whose probability distribution is the uniform distribution is often called a uniform random variable.
13.2.2 The Beta Distribution

Its hard to explain now why the Beta (or ) distribution is useful, but it will come
in useful later (section 14.5). The Beta distribution is a probability distribution for
a continuous random variable x in the range 0 x 1. There are two parameters,
> 0 and > 0. Recall the definition of the function from section 1.2. We have
Section 13.2
PDF of Beta(x) for various alpha, beta

12
2
0.2
0.4
0.6
0.8
alpha: 1, beta: 10
alpha: 10, beta: 1
alpha: 3, beta: 15
alpha: 20, beta: 100
10
0
0
277
PDF of Beta(x) for various alpha, beta

12
alpha: 1, beta: 1
alpha: 10, beta: 10
alpha: 50, beta: 50
10
0
0
0.2
0.4
0.6
0.8
FIGURE 13.1: Probability density functions for the Beta distribution with a variety
of different choices of and .

that
P (x|, ) =
( + ) (1)
x
(1 x)(1) .
()()
From this expression, you can see that:

P (x|1, 1) is a uniform distribution on the unit interval.
P (x|, ) has a single maximum at x = ( 1)/( + 2) for > 1, > 1)
(differentiate and set to zero).
Generally, as and get larger, this peak gets narrower.
For = 1, > 1 the largest value of P (x|, ) is at x = 0.
For > 1, = 1 the largest value of P (x|, ) is at x = 1.
Figure 13.1 shows plots of the probability density function of the Beta distribution
for a variety of different values of and .
Useful Facts: 13.6 Beta distribution

For a Beta distribution with parameters ,
1. The mean is
+ .
2. The variance is
(+)2 (++1) .
Section 13.2
278
PDF of Gamma(x) for various alpha, beta

PDF of Gamma(x) for various alpha, beta
2
2
alpha: 1, beta: 1
alpha: 1, beta: 5
alpha: 5, beta: 5
alpha: 5, beta: 1
1.5
alpha: 15, beta: 15 1.5
alpha: 15, beta: 5
1
0.5
0.5
0
0
10
0
0
10
FIGURE 13.2: Probability density functions for the Gamma distribution with a variety
of different choices of and .

13.2.3 The Gamma Distribution
The Gamma (or ) distribution will also come in useful later on (section 14.5). The
Gamma distribution is a probability distribution for a non-negative continuous
random variable x 0. There are two parameters, > 0 and > 0. The
probability density function is
P (x|, ) =
(1) x
x
e
.
()
Figure 13.2 shows plots of the probability density function of the Gamma distribution for a variety of different values of and .
Useful Facts: 13.7 Gamma distribution

For a Gamma distribution with parameters ,
1. The mean is
.
2. The variance is
2 .
13.2.4 The Exponential Distribution

Assume we have an infinite interval of time or space, with points distributed on it.
Assume these points form a Poisson point process, as above. For example, we might
consider the times at which email arrives; or the times at which phone calls arrive
at a large telephone exchange; or the locations of roadkill on a road. The distance
(or span of time) between two consecutive points is a random variable X. This
Section 13.3
The Normal Distribution
279
random variable takes an exponential distribution. There is a single parameter,

. We have that

expx for x 0
Pexp (x|) =
.
0
otherwise
This distribution is often useful in modelling the failure of objects. We assume
that failures form a Poisson process in time; then the time to the next failure is
exponentially distributed.
Useful Facts: 13.8 Exponential distribution

For an exponential distribution with parameter
1. The mean is
2. The variance is
1
.
1
.
2
Notice the relationship between this parameter and the parameter of the Poisson distribution. If (say) the phone calls are distributed with Poisson distribution
with intensity (per hour), then your expected number of calls per hour is . The
time between calls will be exponentially distributed with parameter , and the
expected time to the next call is 1/ (in hours).
13.3 THE NORMAL DISTRIBUTION
13.3.1 The Standard Normal Distribution
Definition: 13.6 Standard Normal distribution

The probability density function

2

x
1
exp
p(x) =
.
2
2
is known as the standard normal distribution
The first step is to plot this probability density function (Figure 13.3). You
should notice it is quite familiar from work on histograms, etc. in Chapter 14.5. It
has the shape of the histogram of standard normal data, or at least the shape that
the histogram of standard normal data aspires to.
Section 13.3
280
The Standard Normal Curve

0.5
0.4
0.3
0.2
0.1
0
4 3 2 1
FIGURE 13.3: A plot of the probability density function of the standard normal dis-
tribution. Notice how probability is concentrated around zero, and how there is
relatively little probability density for numbers with large absolute values.
Useful Facts: 13.9 standard normal distribution

1. The mean of the standard normal distribution is 0.
2. The variance of the standard normal distribution is 1.
These results are easily established by looking up (or doing!) the relevant integrals; they are relegated to the exercises.
A continuous random variable is a standard normal random variable if

its probability density function is a standard normal distribution.
13.3.2 The Normal Distribution
Any probability density function that is a standard normal distribution in standard
coordinates is a normal distribution. Now write for the mean of a random
variable and for its standard deviation; we are saying that, if
x
has a standard normal distribution, then p(x) is a normal distribution. We can work
out the form of the probability density function of a general normal distribution in
two steps: first, we notice that for any normal distribution, we must have

(x )2
p(x) exp
.
2 2
Section 13.3
But, for this to be a probability density function, we must have

This yields the constant of proportionality, and we get
281
p(x)dx = 1.
Definition: 13.7 Normal distribution


1
(x )2
p(x) =
exp
.
2 2
2
is a normal distribution.
Useful Facts: 13.10 normal distribution


1
(x )2
exp
p(x) =
.
2 2
2
has
1. mean
2. and variance .
These results are easily established by looking up (or doing!) the relevant integrals; they are relegated to the exercises.
A continuous random variable is a normal random variable if its probability

density function is a normal distribution. Notice that it is quite usual to call
normal distributions gaussian distributions.
13.3.3 Properties of the Normal Distribution
Normal distributions are important, because one often runs into data that is well
described by a normal distribution. It turns out that anything that behaves like a
binomial distribution with a lot of trials for example, the number of heads in
many coin tosses; as another example, the percentage of times you get the outcome
of interest in a simulation in many runs should produce a normal distribution
(Section 13.4). For this reason, pretty much any experiment where you perform a
simulation, then count to estimate a probability or an expectation, should give you
an answer that has a normal distribution.
Section 13.3
282
It is a remarkable and deep fact, known as the central limit theorem, that
adding many independent random variables produces a normal distribution pretty
much whatever the distributions of those random variables. Ive not shown this
in detail because its a nuisance to prove. However, if you add together many
random variables, each of pretty much any distribution, then the answer has a
distribution close to the normal distribution. It turns out that many of the processes
we observe add up subsidiary random variables. This means that you will see normal
distributions very often in practice.
A normal random variable tends to take values that are quite close to the
mean, measured in standard deviation units. We can demonstrate this important
fact by computing the probability that a standard normal random variable lies
between u and v. We form
2
Z v
u
1
exp
du.
2
2
u
It turns out that this integral can be evaluated relatively easily using a special
function. The error function is defined by
Z x

2
exp t2 dt
erf(x) =
0
so that
Z x
2

u
1
x
1
exp
du.
erf ( ) =
2
2
2
2
0
Notice that erf(x) is an odd function (i.e. erf(x) = erf(x)). From this (and
tables for the error function, or Matlab) we get that, for a standard normal random
variable
2
Z 1
1
x
exp
dx 0.68
2
2 1
and
and
2
x
exp
dx 0.95
2
2
2
x
exp
dx 0.99.
2
2
These are very strong statements. They measure how often a standard normal
random variable has values that are in the range 1, 1, 2, 2, and 3, 3 respectively.
But these measurements apply to normal random variables if we recognize that they
now measure how often the normal random variable is some number of standard
deviations away from the mean. In particular, it is worth remembering that:
Section 13.4
Approximating Binomials with Large N
283
Useful Facts: 13.11 Normal Random Variables

About 68% of the time, a normal random variable takes a value
within one standard deviation of the mean.
13.4 APPROXIMATING BINOMIALS WITH LARGE N

The Binomial distribution appears to be a straightforward thing. We assume we
flip a coin N times, where N is a very large number. The coin has probability p of
coming up heads, and so probability q = 1 p of coming up tails. The number of
heads h follows the binomial distribution, so
P (h) =
N!
ph q (N h)
h!(N h)!
The
mean of this distribution is N p, the variance is N pq, and the standard deviation
is N pq.
Evaluating this probability distribution for large N is very difficult, because
factorials grow fast. We will construct an approximation to the binomial distribution for large N that allows us to evaluate the probability that h lies in some range.
This approximation will show that the probability that h is within one standard
deviation of the mean is approximately 68%.
This is important, because it shows that our model of probability as frequency
is consistent. Consider the probability that the number of heads you
see lies within
one standard deviation of the mean. The size of that interval is 2 N pq. As N gets
bigger, the size of that interval, relative to the total number of flips, gets smaller.
If I flip a coin N times, in principle I could see a number of heads h that ranges
from 0 to N . However, we will establish that about 68% of the time, h will lie in
the interval within one standard deviation of the mean. The size of this interval,
relative to the total number of flips is
r
N pq
pq
=2
.
2
N
N
As a result, as N ,
h
p
N
because h will tend to land in an interval around pN that gets narrower as N gets
larger.
Section 13.4
P(k heads) in 4 flips
284

0.25
0.2
Probability
Probability
0.3
0.2
0.1
0
0
0.15
0.1
0.05
2
3
Number of heads
0
0
4
6
8
Number of heads
10
20
40
60
Number of heads
80
0.08
Probability
Probability
0.1
0.05
0.06
0.04
0.02
0
0
10
20
30
Number of heads
40
0
0
FIGURE 13.4: Plots of the binomial distribution for p = q = 0.5 for different values
of N . You should notice that the set of values of h (the number of heads) that have
substantial probability is quite narrow compared to the range of possible values. This
set gets narrower as the number
of flips increases. This is because the mean is pN
and the standard deviation is N pq so the fraction of values that is within one
standard deviation of the mean is O(1/ N ).
The main difficulty with Figure 13.4 (and with the argument above) is that
the mean and standard deviation of the binomial distribution tends to infinity as
the number of coin flips tends to infinity. This can confuse issues. For example, the
plots of Figure 13.4 show narrowing probability distributions but is this because
the scale is compacted, or is there a real effect? It turns out there is a real effect,
and a good way to see it is to consider the normalized number of heads.
13.4.1 Large N
Recall that to normalize a dataset, you subtract the mean and divide the result
by the standard deviation. We can do the same for a random variable. We now
consider
h Np
x=
.
N pq
Section 13.4
285
The probability distribution

of x can be obtained from the probability distribution
for h, because h = N p + x N pq, so

N!
p(N p+x N pq) q (N qx N pq) .

P (x) =
(N p + x N pq)!(N q x N pq)!
I have plotted this probability distribution for various values of N in Figure 13.5.
P(k heads) in 4 flips, normalized

0.25
0.2
Probability
Probability
0.3
0.2
0.1
0
20
0.15
0.1
0.05
0
20
10
0
10
20
Number of heads
10
0
10
20
Number of heads
0.08
Probability
Probability
0.1
0.05
0.06
0.04
0.02
0
20
10
0
10
Number of heads
20
0
20
10
0
10
Number of heads
20
FIGURE 13.5: Plots of the distribution for the normalized variable x, with P (x) given
in the text, obtained from the binomial distribution with p = q = 0.5 for different
values of N . These distributions are normalized (mean 0, variance 1. They look
increasingly like a standard normal distribution EXCEPT that the value at their
mode gets smaller as N gets bigger (there are more possible outcomes). In the text,
we will establish that the standard normal distribution is a limit, in a useful sense.
But it is hard to work with this distribution for very large N . The factorials
become very
difficult to evaluate. Second, it is a discrete distribution on N points,
spaced 1/ N pq apart. As N becomes very large, the number of points that have
non-zero probability becomes very large, and x can be very large, or very small. For
example, there is some probability, though there
may be very little indeed, on the
point where h = N , or, equivalently, x = N (p + N pq). For sufficiently large N ,
Section 13.4
286
we think of this probability distribution as a probability density function. We can

do so, for example, by spreading the probability for xi (the ith value of x) evenly
over the interval between xi and xi+1 . We then have a probability density function
that looks like a histogram, with bars that become narrower as N increases. But
what is the limit?
13.4.2 Getting Normal
To proceed, we need Stirlings approximation, which says that, for large N ,

N ! 2 N
N
e
N
This yields
P (h)
Np
h
h
Nq
N h
(N h) s
N
2h(N h)
Recall we used the normalized variable

h Np
.
x=
N pq
We will encounter the term N pq often, and we use = N pq as a shorthand.

We can compute h and N h from x by the equalities
N h = N q x.
h = N p + x
So the probability distribution written in this new variable x is

(N p+x)
(N qx) s
N
Np
Nq
P (x)
(N p + x)
(N q x)
2(N p + x)(N q x)
There are three terms to deal with here. It is easiest to work with log P . Now
1
log(1 + x) = x x2 + O(x3 )
2
so we have
log
and
log
Np
(N p + x)
Nq
(N q x)

x
= log 1 +
Np
x
1 x 2

+ ( )(
)
Np
2 Np

x
1 x 2
+ ( )(
) .
Nq
2 Nq
Section 13.4
From this, we have that

"
(N p+x)
(N qx) #
Nq
Np
log
N p + x
N q x
287
"
2 #

x
x
1
[N p + x]
+
+
Np
2
Np
"
2 #

x
x
1
[N q x]
+
Nq
2
Nq

1
x2 + O((x)3 )
=
2
(recall = N pq if youre having trouble with the last step). Now we look at the
square-root term. We have
s
1
N
= (log [N p + x] + log [N q x] log N + log 2)
log
2(N p + x)(N q x)
2

x
log N p + O
N p

1
x
=
Nq
2 + log N q O
log N + log 2

x
but, since N is very large compared to x, we can ignore the O( N
p ) terms. Then
this term is not a function of x. So we have
log P (x)
x2
+ constant.
2
Now because N is very large, our probability distribution P limits to a probability

density function p, with
2
x
.
p(x) exp
2
We can get the constant of proportionality from integrating, to

2
1
x
p(x) =
exp
.
2
2
This constant of proportionality deals with the effect in figure 13.5, where the mode
of the distribution gets smaller as N gets bigger. It does so because there are more
points with non-zero probability to be accounted for. But we are interested in the
limit where N tends to infinity. This must be a probability density function, so it
must integrate to one.
Review this blizzard of terms. We started with a binomial distribution, but
standardized the variables so that the mean was zero and the standard deviation
was one. We then assumed there was a very large number of coin tosses, so large
that that the distribution started to look like a continuous function. The function
we get is the standard normal distribution.
Section 13.4
288
13.4.3 So What?
I have proven an extremely useful fact, which I shall now put in a box.
Useful Fact: 13.12 Binomial distribution for large N

Assume h follows the binomial distribution with parameters p and q.
hN p
Write x =
. Then, for sufficiently large N , the probability distriN pq
bution P (x) can be approximated by the probability density function

2

x
1
exp
2
2
in the sense that
P ({x [a, b]})
b
a
exp
u2
2
du
This justifies our model of probability as frequency. I interpreted an event

having probability p to mean that, if I had a large number N of independent
repetitions of the experiment, the number that produced the event would be close
to N p, and would get closer as N got larger. We know that, for example, 68% of
the time a standard normal random variable takes a value between 1 and 1. In
this case, the standard normal random variable is
h (N p)
N pq
so that 68% of the time, h must take a value in the range [N p N pq, N p+ N pq].
Equivalently, the relative frequency h/N must take a value in the range
pq
pq
[p , p + ]
N
N
but as N this range gets smaller and smaller, and h/N limits to p. So our
view of probability as a frequency is consistent.
To obtain h, we added N independent Bernoulli random variables. So you can
interpret the box as saying that the sum of many independent Bernoulli random
variables has a probability distribution that limits to the normal distribution as
the number added together gets larger. Remember that I have stated, though not
precisely, but not proved the deep and useful fact that the sum of pretty much
any independent random variables has a distribution that gets closer to a normal
distribution as the number added together gets larger.
Section 13.5
You should
289
13.5 YOU SHOULD

13.5.1 remember:
Definition: Bernoulli random variable . . . .
Useful facts: Bernoulli random variable . . .
Definition: Geometric distribution . . . . . .
Useful facts: Geometric distribution . . . . .
Definition: Binomial distribution . . . . . . .
Useful fact: Binomial distribution . . . . . . .
Useful facts: Binomial distribution . . . . . .
Definition: Multinomial distribution . . . . .
New term: intensity . . . . . . . . . . . . . .
Definition: Poisson distribution . . . . . . . .
Useful facts: Poisson distribution . . . . . . .
New term: Poisson point process . . . . . . .
New term: uniform distribution . . . . . . . .
New term: uniform random variable . . . . .
Useful facts: Beta distribution . . . . . . . .
Useful facts: Gamma distribution . . . . . . .
New term: exponential distribution . . . . . .
Useful facts: Exponential distribution . . . .
New term: standard normal distribution . . .
Definition: Standard Normal distribution . .
Useful facts: standard normal distribution . .
New term: standard normal random variable
New term: normal distribution . . . . . . . .
Definition: Normal distribution . . . . . . . .
Useful facts: normal distribution . . . . . . .
New term: normal random variable . . . . . .
New term: normal distribution . . . . . . . .
New term: gaussian distributions . . . . . . .
New term: central limit theorem . . . . . . .
New term: error function . . . . . . . . . . .
Useful facts: Normal Random Variables . . .
Useful fact: Binomial distribution for large N
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
261
262
262
263
263
264
264
266
267
267
267
268
268
268
269
270
271
271
271
271
272
272
272
273
273
273
273
273
274
274
275
280
C H A P T E R
14
Background:Inference: Making Point

Estimates
Inference is the process of drawing conclusions from data. One form of
inference is to estimate a number, or set of numbers, that describes a dataset. The
result is known as a point estimate. An alternative is to estimate an interval
within which a number lies, with some degree of certainty. Such estimates are
known as interval estimates. Finally, one might wish to assess the extent to
which a body of evidence supports rejecting an hypothesis known as hypothesis
testing. In this chapter, we deal with point estimates. In the following chapter, we
deal with interval estimates and hypothesis testing, which require an understanding
of point estimates. There are two, somewhat distinct, situations in which we could
make point estimates.
In the first, we have a dataset {x}, and a probability model we believe applies
to that dataset. But we need to select appropriate values of the parameters to
ensure that the model describes the data. For example, we might have a set of
N coin flips which we believe to be independent and identically distributed. Of
these, k flips came up H. We know that a binomial distribution with p(H) = p
is a good model but what value of p should we use? Your intuition is likely to
suggest using k/N , but wed like a more robust procedure than guessing. We need
an inference procedure to obtain the unknown parameter from the data. Notice
that this will be an estimate, rather than the true value. As we shall see, there
is more than one possible procedure to apply, depending to some extent on the
problem. In some cases (section 14.1), we estimate parameter values based solely
on data; in others (section 14.2), we are able to use prior information about the
parameters to affect the estimate.
In the second situation, we want to know some property of a population. For
example, we may wish to know the mean weight of a person, or the mean response
of a mouse to a drug. It would be a stretch to describe this population with one of
the probability models that we have seen. In principle, the number we want is not
even necessarily random; in principle, we could measure everyone on the planet and
average the weights. In practice, this doesnt make sense, for quite straightforward
reasons. You cant really weigh every person, or dose every mouse, on the planet.
Instead, to estimate this property, we obtain a sample (some people; some mice;
etc.) of the population, and estimate the property from the sample. There is now
an important problem. Different samples lead to different estimates of the property.
We will arrange to have the sample drawn randomly from the population, so the
sample we see represents the value of a set of random variables. If you followed the
proof of the weak law of large numbers, you should suspect that the mean of this
sample could be a good estimate of the population mean. This turns out to be the
290
Section 14.1
Estimating Model Parameters with Maximum Likelihood
291
case (section 14.5). However, there is some random error in the estimate, and we
can tell (on average) how large the error caused by random sampling could be.
14.1 ESTIMATING MODEL PARAMETERS WITH MAXIMUM LIKELIHOOD
Assume we have a dataset D = {x}, and a probability model we believe applies to
that dataset. Generally, application logic suggests the type of model (i.e. normal
probability density; Poisson probability; geometric probability; and so on). But
usually, we do not know the parameters of the model for example, the mean and
standard deviation of a normal distribution; the intensity of a poisson distribution;
and so on. Notice that this situation is unlike what we have seen to date. In
chapter 13, we assumed that we knew , and could then use the model to assign
a probability to a set of data items D. Here we know the value of D, but dont
know . Our model will be better or worse depending on how well we choose the
parameters. We need a strategy to estimate the parameters of a model from a
sample dataset. Notice how each of the following examples fits this pattern.
Example: 14.1 Inferring p from repeated flips binomial

We could flip the coin N times, and count the number of heads k. We
know that an appropriate probability model for a set of independent
coin flips is the binomial model P (k; N, p). But we do not know p,
which is the parameter we need a strategy to extract a value of p
from the data.
Example: 14.2 Inferring p from repeated flips geometric

We could flip the coin repeatedly until we see a head. We know that,
in this case, the number of flips has the geometric distribution with
parameter p. In this case, the data is a sequence of T s with a final H
from the coin flips. There are N flips (or terms) and the last flip is a
head. We know that an appropriate probability model is the geometric
distribution Pg (N ; p). But we do not know p, which is the parameter
we need a strategy to extract a value of p from the data.
Section 14.1
292
Example: 14.3 Inferring the intensity of spam poisson

It is reasonable to assume that the number of spam emails one gets in
an hour has a Poisson distribution. But what is the intensity parameter
? We could count the number of spam emails that arrive in each of a
set of distinct hours, giving a dataset of counts D. We need a strategy
to wrestle an estimate of from this dataset.
Example: 14.4
data
Inferring the mean and standard deviation of normal
Imagine we know for some reason that our data is well described by
a normal distribution. We could ask what is the mean and standard
deviation of the normal distribution that best represents the data?
Write for the parameters of our model. If we knew , then the probability
of observing the data D would be P (D|). We know D, and we dont know , so
the value of P (D|) is a function of .
Definition: 14.1 Likelihood

The function P (D|), which is a function of , is known as the likelihood of the data D, and is often written L() (or L(; D) if you want
to remember that data is involved).
14.1.1 The Maximum Likelihood Principle

We need a reasonable procedure to choose a value of to report. One and I
stress this is not the only one is the maximum likelihood principle.
Definition: 14.2 The maximum likelihood principle

Choose such that L() = P (D|) is maximised, as a function of .
Section 14.1
293
In words, this means: Choose the parameter such that the probability of
observing the data you actually see, is maximised. This should strike you as being
a reasonable choice. You should also be aware that this is not the only possible
choice (well see another one in section 14.2).
For the examples we work with, the data will be independent and identically distributed or IID. This means that each data item is an idependently
obtained sample from the same probability distribution (see section ??). In turn,
this means that the likelihood is a product of terms, one for each data item, which
we can write as
Y
L() = P (D|) =
P (di |).
idataset
It is traditional to write for any set of parameters that are unknown. There
are two, distinct, important concepts we must work with. One is the unknown
parameter(s), which we will write . The other is the estimate of the value of that
This estimate is the best we can do it may
parameter, which we will write .
not be the true value of the parameter.
Section 14.1
Worked example 14.1

mial model
294
Inferring p(H) for a coin from flips using a bino-
In N independent coin flips, you observe k heads. Use the maximum likelihood
principle to infer p(H).
Solution: The coin has = p(H), which is the unknown parameter. We know
that an appropriate probability model is the binomial model P (k; N, ). We
have that

N
L() = P (D|) = Pb (k; N, ) =
k (1 )(N k)
k
which is a function of the unknown probability that a coin comes up
heads; k and N are known. We must find the value of that maximizes this
expression. Now the maximum occurs when
L()
= 0.
We have
L()
=
N
k

kk1 (1 )(N k) k (N k)(1 )(nk1)
and this is zero when

kk1 (1 )(N k) = k (N k)(1 )(N k1)
so the maximum occurs when
k(1 ) = (N k).
This means the maximum likelihood estimate is
k
=
N
which is what we guessed would happen, but now we know why that guess
makes sense.
Section 14.1
Worked example 14.2

model
295
Inferring p(H) from coin flips using a geometric
You flip a coin N times, stopping when you see a head. Use the maximum
likelihood principle to infer p(H) for the coin.
Solution: The coin has = p(H), which is the unknown parameter. We know
that an appropriate probability model is the geometric model Pg (N ; ). We
have that
L() = P (D|) = Pg (N ; ) = (1 )(N 1)
which is a function of the unknown probability that a coin comes up heads;
N is known. We must find the value of that maximizes this expression. Now
the maximum occurs when
L()
= 0 = ((1 )(N 1) (N 1)(1 )(N 2) )
So the maximum likelihood estimate is

1
= .
N
We didnt guess this.
Section 14.1
Worked example 14.3

a multinomial distribution
296
Inferring die probabilities from multiple rolls and
You throw a die N times, and see n1 ones, . . . and n6 sixes. Write p1 , . . . , p6
for the probabilities that the die comes up one, . . ., six. Use the maximum
likelihood principle to estimate p1 , . . . , p6 .
Solution: The data are n, n1 , . . . , n6 . The parameters are = (p1 , . . . , p6 ).
P (D|) comes from the multinomial distribution. In particular,
L() = P (D|) =
n!
pn1 pn2 . . . pn6 6
n1 ! . . . n6 ! 1 2
which is a function of = (p1 , . . . , p6 ). Now we want to maximize this function

by choice of . Notice that we could do this by simply making all pi very large
but this omits a fact, which is that p1 + p2 + p3 + p4 + p5 + p6 = 1. So we
substitute using p6 = 1 p1 p2 p3 p4 p5 (there are other, neater, ways
of dealing with this issue, but they take more background knowledge). At the
maximum, we must have that for all i,
L()
=0
pi
which means that, for pi , we must have
(ni 1)
ni pi
(1p1 p2 p3 p4 p5 )n6 pni i n6 (1p1 p2 p3 p4 p5 )(n6 1) = 0
so that, for each pi , we have

ni (1 p1 p2 p3 p4 p5 ) n6 pi = 0
or
pi
ni
=
.
1 p1 p2 p3 p4 p5
n6
You can check that this equation is solved by

=
1
(n1 , n2 , n3 , n4 , n5 , n6 )
(n1 + n2 + n3 + n4 + n5 + n6 )
The logarithm is a monotonic function (i.e. if x > 0, y > 0, x > y, then

log(x) > log(y)). This means that the values of that maximise the log-likelihood
are the same as the values that maximise the likelihood. This observation is very
useful, because it allows us to transform a product into a sum. The derivative of a
product involves numerous terms; the derivative of a sum is easy to take. We have
X
Y
log P (di |)
P (di |) =
log P (D|) = log
idataset
idataset
Section 14.1
297
and in some cases, log P (di |) takes a convenient, easy form.

Definition: 14.3 The log-likelihood of a dataset under a model
is a function of the unknown parameters, and you will often see it
written as
X
log L() =
log P (di |).
idataset
Poisson distributions
Worked example 14.4
You observe N intervals, each of the same, fixed length (in time, or space).
You know that, in these intervals, events occur with a Poisson distribution (for
example, you might be observing Prussian officers being kicked by horses, or
telemarketer calls...). You know also that the intensity of the Poisson distribution is the same for each observation. The number of events you observe in the
ith interval is ni . What is the intensity, ?
Solution: The likelihood is
L() =
iintervals
P ({ni events} |) =
iintervals
It will be easier to work with logs. The log-likelihood is

X
(ni log log ni !)
log L() =
i
so that we must solve

log L() X ni
( 1) = 0
=
i
which yields a maximum likelihood estimate of
P
ni
= i
N
ni e
.
ni !
Section 14.1
298
The intensity of swearing
Worked example 14.5
A famously sweary politician gives a talk. You listen to the talk, and for each of
30 intervals 1 minute long, you record the number of swearwords. You record
this as a histogram (i.e. you count the number of intervals with zero swear
words, with one, etc.). For the first 10 intervals, you see
no. of swear words no. of intervals
0
5
1
2
2
2
3
1
4
0
and for the following 20 intervals, you see
no. of swear words no. of intervals
0
9
1
5
2
3
3
2
4
1
Assume that the politicians use of swearwords is Poisson. What is the intensity
using the first 10 intervals? the second 20 intervals? all the intervals? why are
they different?
Solution: Use the expression from worked example 14.4 to find
10
=
=
20
=
=
30
=
=
total number of swearwords

number of intervals
9
10
number of intervals
21
20
number of intervals
30
.
30
These are different because the maximum likelihood estimate is an estimate

we cant expect to recover the exact value from a dataset. Notice, however,
that the estimates are quite close.
Section 14.1
Worked example 14.6
299
Normal distributions
Assume we have x1 , . . . , xN , and we wish to model these data with a normal

distribution. Use the maximum likelihood principle to estimate the mean of
that normal distribution.
Solution: The likelihood of a set of data values under the normal distribution
with unknown mean and standard deviation is
L()
=
=
=
P (x1 , . . . xN |, )
P (x1 |, )P (x2 |, ) . . . P (xN |, )

N
Y
(xi )2
1
exp
2 2
2
i=1
and this expression is a moderate nuisance to work with. The log of the likelihood is
!
N
X
(xi )2
+
term not depending on .
log L() =
2 2
i=1
We can find the maximum by differentiating wrt and setting to zero, which
yields
log L()
N
X
2(xi )
i=1
=
=
2 2
1
2
N
X
i=1
xi N
so the maximum likelihood estimate is

PN
xi
= i=1
N
which probably isnt all that surprising. Notice we did not have to pay attention
to in this derivation we did not assume it was known, it just doesnt do
anything.
Section 14.1
Worked example 14.7
300
Normal distributions -II
Assume we have x1 , . . . , xN which are data that can be modelled with a normal
distribution. Use the maximum likelihood principle to estimate the standard
deviation of that normal distribution.
Solution: Now we have to write out the log of the likelihood in more detail.
Write for the mean of the normal distribution and for the unknown standard
deviation of the normal distribution. We get
!
N
X
(xi )2
N log + Term not depending on
log L() =
22
i=1
We can find the maximum by differentiating wrt and setting to zero, which
yields
N
log L()
2 X (xi )2
N
= 3
=0
i=1
2
so the maximum likelihood estimate is

s
PN
2
i=1 (xi )
=
N
which probably isnt all that surprising, either.
You should notice that one could maximize the likelihood of a normal distribution with respect to mean and standard deviation in one go (i.e. I could have
done worked examples 14.6 and 14.7 in one worked example, instead of two). I did
this example in two parts because I felt it was more accessible that way; if you
object, youre likely to be able to fill in the details yourself very easily.
The maximum likelihood principle has a variety of neat properties we cannot
expound. One worth knowing about is consistency; for our purposes, this means
that the maximum likelihood estimate of parameters can be made arbitrarily close
to the right answer by having a sufficiently large dataset. Now assume that our data
doesnt actually come from the underlying model. This is the usual case, because
we cant usually be sure that, say, the data truly is normal or truly comes from a
Poisson distribution. Instead we choose a model that we think will be useful. When
the data doesnt come from the model, maximum likelihood produces an estimate
of that corresponds to the model that is (in quite a strong sense, which we cant
explore here) the closest to the source of the data. Maximum likelihood is very
widely used because of these neat properties. But there are some difficulties.
14.1.2 Cautions about Maximum Likelihood
One important problem is that it might be hard to find the maximum of the likelihood exactly. There are strong numerical methods for maximizing functions, and
Section 14.2
Incorporating Priors with Bayesian Inference
301
these are very helpful, but even today there are likelihood functions where it is very
hard to find the maximum.
The second is that small amounts of data can present nasty problems. For
example, in the binomial case, if we have only one flip we will estimate p as either
1 or 0. We should find this report unconvincing. In the geometric case, with a
fair coin, there is a probability 0.5 that we will perform the estimate and then
report that the coin has p = 1. This should also worry you. As another example,
if we throw a die only a few times, we could reasonably expect that, for some
i, ni = 0. This doesnt necessarily mean that pi = 0, though thats what the
maximum likelihood inference procedure will tell us.
This creates a very important technical problem how can I estimate the
probability of events that havent occurred? This might seem like a slightly silly
question to you, but it isnt. Solving this problem has really significant practical
consequences. As one example, consider a biologist trying to count the number of
butterfly species on an island. The biologist catches and classifies a lot of butterflies,
then leaves. But are there more butterfly species on the island? To get some sense
that we can reason successfully about this problem, compare two cases. In the
first, the biologist catches many individuals of each of the species observed. In this
case, you should suspect that catching more butterflies is unlikely to yield more
species. In the second case, there are many species where the biologist sees only
one individual of that species. In this case, you should suspect that catching more
butterflies might very well yield new species.
As another example, a really important part of natural language processing
involves estimating the probability of groups of three words. These groups are
usually known as trigrams. People typically know an awful lot of words (tens
to hundreds of thousands, depending on what you mean by a word). This means
that there are a tremendous number of trigrams, and you can expect that any real
dataset lacks almost all of them, because the dataset isnt big enough for there to
be even just one of each trigram. Some are missing because they dont occur in
real life, but others are not there simply because they are unusual (eg Atom Heart
Mother actually occurs in real life, but you may not have seen it try a web
search if the phrase doesnt ring a bell). Modern speech recognition systems need
to know how probable every trigram is. If the speech system thinks a trigram has
zero probability and the trigram actually occurs, the system will make a mistake.
We cant solve this problem just by giving each trigram a very small non-zero
probability, because there are too many trigrams it is important to distinguish
between rare ones, and ones that dont ever occur. But what probability should
I use for a rare trigram? Maximum likelihood would say use zero, but this would
generate problems. Formalizing all this gets difficult quite quickly,
14.2 INCORPORATING PRIORS WITH BAYESIAN INFERENCE
Another important issue with maximum likelihood is that there is no mechanism
to incorporate prior beliefs. For example, imagine you get a new die from a reliable
store, roll it six times and see a one once. You would be happy to believe that
p(6) = 1/6 for this die. Now imagine you borrow a die from a friend with a long
history of making weighted dice. Your friend tells you this die is weighted so that
Section 14.2
302
p(1) = 1/2. You roll the die six times and see a one once; in this case, you might
worry that p(6) isnt 1/6, and you just happened to get a slightly unusual set
of rolls. Youd worry because you have good reason to believe the die isnt fair,
and youd want more evidence to believe p(6) = 1/6. Maximum likelihood cant
distinguish between these two cases.
The difference lies in prior information information we possess before we
look at the data. We would like to take this information into account when we
estimate the model. One way to do so is to place a prior probability distribution
p() on the parameters . Then, rather than working with the likelihood p(D|), we
could apply Bayes rule, and form the posterior p(|D). This posterior represents
the probability that takes various values, given the data D.
Definition: 14.4 Bayesian inference
Extracting information from the posterior p(|D) is usually called
Bayesian inference
Definition: 14.5 MAP estimate

A natural estimate of is the value that maximizes the posterior p(|D).
This estimate is known as a maximum a posteriori estimate or
MAP estimate.
14.2.1 Constructing the Posterior

Bayes rule tells us that
p(|D) =
P (D|)P ()
P (D)
but (as we shall see) it can be hard to work out P (D). For some problems, we
might not need to know it.
Section 14.2
Posterior of P(H), given 7H and 3T
Posterior of P(H), given 3H and 7T

3
Posterior value
Posterior value
0
0
303
0.2
0.4
0.6
P(H)
0.8
0
0
0.2
0.4
0.6
P(H)
0.8
FIGURE 14.1: The curves show a function proportional to the posterior on , for
the two cases of example ??. Notice that this information is rather richer than the
single value we would get from maximum likelihood inference.
Worked example 14.8
Flipping a coin
We have a coin with probability of coming up heads when flipped. We start

knowing nothing about . We then flip the coin 10 times, and see 7 heads (and
3 tails). Plot a function proportional to p(| {7 heads and 3 tails}). What
happens if there are 3 heads and 7 tails?
Solution: We know nothing about p, except that 0 1, so we choose a
uniform prior on p. We have that p({7 heads and 3 tails} |) is binomial. The
joint distribution is p({7 heads and 3 tails} |) p() but p() is uniform, so
doesnt depend on . So the posterior is proportional to: 7 (1 )3 which is
graphed in figure 14.1. The figure also shows 3 (1 )7 which is proportional
to the posterior for 3 heads and 7 tails. In each case, the evidence does not
rule out the possibility that = 0.5, but tends to discourage the conclusion.
Maximum likelihood would give = 0.7 or = 0.3, respectively.
In Example 14.8, it is interesting to follow how the posterior on p changes as
evidence come in, which is easy to do because the posterior is proportional to a
binomial distribution. Figure 14.2 shows a set of these posteriors for different sets
of evidence.
For other problems, we will need to marginalize out , by computing
Z
P (D) = P (D|)P ()d.
It is usually impossible to do this in closed form, so we would have to use a numerical

integral or a trick. The next section expounds one useful trick.
Conjugate priors
Section 14.2
Posterior on p(H), given 3H and 0T.
posterior on p(H)
posterior on p(H)
3
2
1
0.2
0
0
0.4
0.6
0.8
1
p(H)
10
3
2
1
0
0
0.2
0.4
p(H)
0.6
0.8
0.2
0.4
p(H)
0.6
0.8
5
posterior on p(H)
posterior on p(H)
0
0
304
6
4
2
0
0
0.2
0.4
p(H)
0.6
0.8
FIGURE 14.2: The probability that an unknown coin will come up heads when flipped
is p(H). For these figures, I simulated coin flips from a coin with p = 0.75. I
then plotted the posterior for various data. Notice how, as we see more flips, we
get more confident about p. The vertical axis changes substantially between plots in
this figure.
In some cases, P () and P (D|), when multiplied together, take a familiar
form. This happens when P (D|) and P () each belong to parametric families
where there is a special relationship between the families. When a prior has this
property, it is called a conjugate prior. There are some cases worth knowing, given
in the worked examples.
Section 14.2
Worked example 14.9
305
Flipping a coin - II
We have a coin with probability of coming up heads when flipped. We model

the prior on with a Beta distribution, with parameters > 0, > 0. We
then flip the coin N times, and see h heads. What is P (|N, h, , )?
Solution: We have that P (N, h|) is binomial, and that P (|N, h, , )
P (N, h|)P (|, ). This means that

( + ) (1)
N
P (|N, h, , )
h (1 )(N h)
(1 )(1) .
h
()()
and we can write
P (|N, h, , ) (+h1) (1 )(+N h1) .
Notice this has the form of a Beta distribution, so it is easy to recover the
constant of proportionality. We have
P (|N, h, , ) =
( + + N )
(+h1) (1 )(+N h1) .
( + h)( + N h)
Section 14.2
Worked example 14.10
306
More sweary politicians
Example 14.5 gives some data from a sweary politician. Assume we have only
the first 10 intervals of observations, and we wish to estimate the intensity using
a Poisson model. Write for this parameter. Use a Gamma distribution as a
prior, and write out the posterior.
Solution: We have that
p(D|)

0 e
0!
2 e
2!
9 e10
12
and
p(|, ) =
5
2
1 e
1!
3 e
3!
2
1
(1)
e
()
This means that

p(|D) (1+9) e(+10) .
Notice this has the form of another Gamma distribution, so we can write
p(|D) =
( + 10)(+9) (1+9) (+10)
e
( + 9)
14.2.2 Normal Prior and Normal Likelihood

When both P (D|) and P () are normal, some important simplifications occur.
First, the prior is conjugate to the likelihood. Second, the posterior is also normal.
And third, the mean and standard deviation of the posterior take a simple form.
We start with an example. Assume we drop a measuring device down a
borehole. It is designed to stop falling and catch onto the side of the hole after
it has fallen 0 meters. On board is a device to measure its depth. This device
reports a known constant times the correct depth plus a zero mean normal random
variable, which we call noise. The device reports depth every second.
The first question to ask is what depth do we believe the device is at before
we receive any measurement? We designed the device to stop at 0 meters, so we
are not completely ignorant about where it is. However, it may not have worked
absolutely correctly. We choose to model the depth at which it stops as 0 meters
plus a zero mean normal random variable. The second term could be caused by error
in the braking system, etc. We could estimate the standard deviation of the second
term (which we write 0 ) either by dropping devices down holes, then measuring
Section 14.2
307
with tape measures, or by analysis of likely errors in our braking system. The depth
of the object is the unknown parameter of the model; we write this depth . Now
the model says that is a normal random variable with mean 0 and standard
deviation 0 .
Notice that this model probably isnt exactly right for example, there must
be some probability in the model that the object falls beyond the bottom of the
hole, which it cant do but it captures some important properties of our system.
The device should stop at or close to 0 meters most of the time, and its unlikely
to be too far away.
Now assume we receive a single measurement what do we now know about
the devices depth? The first thing to notice is that there is something to do here.
Ignoring the prior and taking the measurement might not be wise. For example,
imagine that the noise in the wireless system is large, so that the measurement is
often corrupted our original guess about the devices location might be better
than the measurement. Write x1 for the measurement. Notice that the scale of
the measurement may not be the same as the scale of the depth, so the mean of
the measurement is c1 , where c1 is a change of scale (for example, from inches to
meters). We have that p(x1 |) is normal with mean c1 and standard deviation
n1 . We would like to know p(|x1 ).
We have that
log p(, x1 ) = log p(x1 |) + log p()
( 0 )2
(x1 c1 )2
=
2
2n1
202
+ terms not depending on or x.
We have two estimates of the position, , and we wish to come up with a representation of what we know about . One is x1 , which is a measurement we know
its value. The expected value of x1 is c1 , so we could infer from x1 . But we
have another estimate of the position, which is 0 . The posterior, p(|x1 ), is a
probability distribution on the variable ; it depends on the known values x1 , 0 ,
0 and n1 . We need to determine its form. We can do so by some rearrangement
of the expression for log p(, x1 ).
Notice first that this expression is of degree 2 in (i.e. it has terms 2 ,
and things that dont depend on ). This means that p(|x1 ) must be a normal
distribution, because we can rearrange its log into the form of the log of a normal
distribution. This yields a fact of crucial importance.
Useful Fact: 14.1 Normal distributions are conjugate

A normal prior and a normal likelihood yield a normal posterior.
Section 14.2
308
Write 1 for the mean of this distribution, and 1 for its standard deviation.
The log of the distribution must be
( 1 )2
+ terms not depending on .
212
The terms not depending on are not interesting, because if we know 1 those
terms must add up to

1
log
21
so that the probability density function sums to one. Our goal is to rearrange terms
into the form above. Notice that

1
( 1 )2
1
2
+ 2 2 + term not depending on
2p2
212
2p
We have
log p(|x1 ) =
=
(c1 x1 )2
( 0 )2
+ terms not depending on

2 2
202
n1

0
x1
1
2 2 2 + 2 c1 2 + 2
2n1
20
0
2 2 n1
+c2 2
n1
terms not depending on
which means that

12 =
2
n1
02
+ c21 02
2
n1
and
1

2
x1
0
n1
02
= 2 c1 2 + 2
2 + c2 2
2n1
20 n1
1 0

2
2
2
c1 x1 0 + 0 n1
n1
02
=
2 2
2 + c2 2
n1
n1
0
1 0
2
2
c1 x1 0 + 0 n1
=
.
2 + c2 2
n1
1 0
These equations make sense. Imagine that 0 is very small, and n1 is very
big; then our new expected value of which is 1 is about 0 . Equivalently,
because our prior was very accurate, and the measurement was unreliable, our
expected value is about the prior value. Similarly, if the measurement is reliable
(i.e. n1 is small) and the prior has high variance (i.e. 0 is large), then our
expected value of is about x1 /c1 i.e. the measurement, rescaled. I have put
these equations, in a more general form, in a box below.
Section 14.2
309
Useful Fact: 14.2 Normal posteriors

Assume we wish to estimate a parameter . The prior distribution for
is normal, with known mean and known standard deviation . We
receive a single data item x. The likelihood of this data item is normal
with mean c and standard deviation m , where c and m are known.
Then the posterior, p(|x, c, m , , ), is normal, with mean
2
cx2 + m
2 + c2 2
m
and standard deviation
2 2
m
.
2 + c2 2
m
14.2.3 MAP Inference

Look at example 14.1, where we estimated the probability a coin would come up
heads with maximum likelihood. We could not change our estimate just by knowing
the coin was fair, but we could come up with a number for = p(H) (rather than,
say, a posterior distribution). A natural way to produce a point estimate for that
incorporates prior information is to choose such that
argmax
argmax P (, D)
=
P (|D) =
P (D)
This is the MAP estimate. If we wish to perform MAP inference, P (D) doesnt
matter (it changes the value, but not the location, of the maximum). This means
we can work with P (, D), often called the joint distribution.
Section 14.2
310
Flipping a coin - II
We have a coin with probability of coming up heads when flipped. We model

the prior on with a Beta distribution, with parameters > 0, > 0. We
then flip the coin N times, and see h heads. What is the MAP estimate of ?
Solution: We have that
P (|N, h, , ) =
( + + N )
(+h1) (1 )(+N h1) .
( + h)( + N h)
You can get the MAP estimate by differentiating and setting to 0, yielding
=
1+h
.
+2+N
This has rather a nice interpretation. You can see and as extra counts of
heads (resp. tails) that are added to the observed counts. So, for example, if
you were fairly sure that the coin should be fair, you might make and large
and equal. When = 1 and = 1, we have a uniform prior as in the previous
examples.
More sweary politicians
We observe our swearing politician for N intervals, seeing ni swear words in

the ith interval. We model the swearing with a Poisson model. We wish to
estimate the intensity, which we write . We use a Gamma distribution for the
prior on . What is the MAP estimate of ?
Solution: Write T =
PN
i=1 .
p(|D) =
and the MAP estimate is
We have that
( + N )(+T ) (1+T ) (+T )
e
( + T )
( 1 + T )
=
( + N )
(which you can get by differentiating with respect to , then setting to zero).
Notice that if is close to zero, you can interpret as extra counts; if is large,
even if the counts are large.
then it strongly discourages large values of ,
14.2.4 Filtering
We can make online estimates of the maximum likelihood value of mean and standard deviation for a normal distribution. Assume, rather than seeing N elements
Section 14.2
311
of a dataset in one go, you get to see each one once, and you cannot store them.
Assume that this dataset is modelled as normal data. Write
k for the maximum
likelihood estimate of the mean based on data items 1 . . . k (and
k for the maximum
likelihood estimate of the standard deviation, etc.). Notice that
k+1 =
and that
k+1 =
(k
k ) + xk+1
(k + 1)
(k
k2 ) + (xk+1
k+1 )2
(k + 1)
This means that you can incorporate new data into your estimate as it arrives
without keeping all the data. This process of updating a representation of a dataset
as new data arrives is known as filtering.
This is particularly useful in the case of normal posteriors. Recall that if we
have a normal prior and a normal likelihood, the posterior is normal. Now consider
a stream of incoming measurements. Our initial representation of the parameters
we are trying to estimate is the prior, which is normal. We allow ourselves to see
one measurement, which has normal likelihood; then the posterior is normal. You
can think of this posterior as a prior for the parameter estimate based on the next
measurement. But we know what to do with a normal prior, a normal likelihood,
and a measurement; so we can incorporate the measurement and go again. This
means we can exploit our expression for the posterior mean and standard deviation
in the case of normal likelihood and normal prior and a single measurement to deal
with multiple measurements very easily.
Assume a second measurement, x2 arrives. We know that p(x2 |, c2 , n2 ) is
normal with mean c2 and standard deviation n2 . In the example, we have a new
measurement of depth perhaps in a new, known, scale with new noise (which
might have larger, or smaller, standard deviation than the old noise) added. Then
we can use p(|x1 , c1 , n1 ) as a prior to get a posterior p(|x1 , x2 , c1 , c2 , n1 , n2 ).
Each is normal, by useful fact 14.1. Not only that, but we can easily obtain the
expressions for the mean 2 and the standard deviation 2 recursively as functions
of 1 and 1 .
Applying useful fact 14.2, we have
2 =
2
c2 x2 12 + 1 n2
2
2
2
n2 + c2 1
and
22 =
2
n2
12
.
+ c22 12
2
n2
But what works for 2 and 1 will work for k + 1 and k. We know the posterior after
k measurements will be normal, with mean k and standard deviation k . The
k + 1th measurement xk+1 arrives, and we have p(xk+1 |, ck+1 , n(k+1) ) is normal.
Then the posterior is normal, and we can write the mean k+1 and the standard
deviation k+1 recursively as functions of k and k . The result is worth putting
in a box.
Section 14.2
312
Useful Fact: 14.3 Online updating of normal posteriors

Assume we wish to estimate a parameter . The prior distribution for
is normal, with known mean and known standard deviation . All
data is normal conditioned on . We have already received k data items.
The posterior p(|x1 , . . . , xk , c1 , . . . , ck , n1 , . . . , nk , , ) is normal,
with mean k and standard deviation k . We receive a new data item
xk+1 . The likelihood of this data item is normal with mean c and standard deviation n(k+1) , where ck+1 and n(k+1) are known. Then the
posterior, p(|x1 , . . . , xk+1 , c1 , . . . , ck , ck+1 , n1 , . . . , n(k+1) , , ), is
normal, with mean
k+1 =
2
ck+1 xk+1 k2 + k n(k+1)
and
2
k+1
=
2
+ c2k+1 k2
n(k+1)
2
n(k+1)
k2
2
+ c2k+1 k2
n(k+1)
Again, notice the very useful fact that, if everything is normal, we can update
our posterior representation when new data arrives using a very simple recursive
form.
Section 14.3
Samples, Urns and Populations
313
Normal data
Assume you see N datapoints xi which are modelled by a normal distribution

with unknown mean and with known standard deviation . You model the
prior on using a normal distribution with mean 0 and standard deviation
0 . What is the MAP estimate of the mean?
Solution: Recall that the maximum value of a normal distribution occurs at
its mean. Now problem is covered by useful fact 14.2, but in this case we have
ci = 1 for each data point, and i = . We can write
N =
2
2
xN N
1 + N 1
2
2
+ N 1
and
2
N
=
2
2 N
1
.
2
2 + N
1
and evaluate the recursion down to 0 , 0 .
14.2.5 Cautions about Bayesian Inference

Just like maximum likelihood inference, bayesian inference is not a recipe that can
be applied without thought. It turns out that, when there is a lot of data, the
prior has little inference on the outcome of the inference, and the MAP solution
looks a lot like the maximum likelihood solution. So the difference between the two
approaches is most interesting when there is little data, where the prior matters.
The difficulty is that it might be hard to know what to use as a good prior. In
the examples, I emphasized mathematical convenience, choosing priors that lead
to clean posteriors. There is no reason to believe that nature uses conjugate priors
(even though conjugacy is a neat property). How should one choose a prior for a
real problem?
This isnt an easy point. If there is little data, then the choice could really
affect the inference. Sometimes were lucky, and the logic of the problem dictates
a choice of prior. Mostly, we have to choose and live with the consequences of the
choice. Often, doing so is succesful in applications.
The fact we cant necessarily justify a choice of prior seems to be one of lifes
inconveniences, but it represents a significant philosophical problem. Its been at
the core of a long series of protracted, often quite intense, arguments about the
philosophical basis of statistics. I havent followed these arguments closely enough
to summarize them; they seem to have largely died down without any particular
consensus being reached.
14.3 SAMPLES, URNS AND POPULATIONS
Very often the data we see is a small part of the data we could have seen, if wed
been able to collect enough data. We need to know how the measurements we make
Section 14.3
314
on the dataset relate to the measurements we could have made, if we had all the
data. This situation occurs very often. For example, imagine we wish to know the
average weight of a rat. This isnt random; you could weigh every rat on the planet,
and then average the answers. But doing so would absurd (among other things,
youd have to weigh them all at the same time, which would be tricky). Instead,
we weigh a small set of rats, chosen rather carefully. If we have chosen sufficiently
carefully, then the answer from the small set is quite a good representation of the
answer from the whole set.
The data we could have observed, if we could have seen everything, is the
population. The data we actually have is the sample. We would like to know
the mean of the population, but can see only the sample; surprisingly, we can say
a great deal from the sample alone, assuming that it is chosen appropriately.
14.3.1 Estimating the Population Mean from a Sample
Assume we have a population {x}, for i = 1, . . . , Np . Notice the subscript here
this is the number of items in the population. The population could be unreasonably
big: for example, it could consist of all the people in the world. We want to know
the mean of this dataset, but we do not get to see the whole dataset. Instead, we
see the sample.
How the sample is obtained is key to describing the population. We will focus
on only one model (there are lots of others). In our model, the sample is obtained by
choosing a fixed number of data items. Write k for the number of data items in the
sample. We expect k is a lot smaller than Np . Each item is chosen independently,
and fairly. This means that each time we choose, we choose one from the entire
set of Np data items, and each has the same probability of being chosen. This is
sometimes referred to as sampling with replacement.
One natural way to think about sampling with replacement is to imagine the
data items as being written on tickets, which are placed in an urn (old-fashioned
word for a jar, now used mainly by statisticians and morticians). You obtain the
sample by repeating the following experiment k times: shake the urn; take a ticket
from the urn and write down the data on the ticket; put it back in the urn. Notice
that, in this case, each sample is drawn from the same urn. This is important, and
makes the analysis easier. If we had not put the ticket back, the urn would change
between samples.
We summarize the whole dataset with its mean, which we write popmean ({x}).
This is known as the population mean. The notation is just to drive home the
facts that its the mean of the whole population, and that we dont, and cant,
know it. The whole point of this exercise is to estimate this mean.
We would like to estimate the mean of the whole dataset from the items that
we actually see. Imagine we draw k tickets from the urn as above, and average the
values. The result is a random variable, because different draws of k tickets will
give us different values. Write X (k) for this random variable, which is referred to
as the sample mean. Because expectations are linear, we must have that
i
i
h
i
h
i 1 h
h
E X (1) + . . . + E X (1) = E X (1)
E X (k) =
k
(where X (1) is the random variable whose value is obtained by drawing one ticket
Section 14.3
from the urn). Now

i
h
=
E X (1)
=
xi p(i)
xi
315
i1,...Np
i1,...Np
i1,...Np
1
Np
because we draw fairly from the urn
xi
Np
popmean ({x})
which is the mean value of the items in the urn. This means that
i
h
E X (k) = popmean ({xi }).
Under our sampling model, the expected value of the sample mean is the population
mean.
Useful Facts: 14.4 Sample means and population means

The sample mean is a random variable. It is random, because different
samples from the population will have different values of the sample
mean. The expected value of this random variable is the population
mean.
We will not get the same value of X (k) each time we perform the experiment,
because we see different data items in each sample. So X (k) has variance, and this
variance is important. If it is large, then each estimate is quite different. If it is
small, then the estimates cluster. Knowing the variance of X (k) would tell us how
accurate our estimate of the population mean is.
14.3.2 The Variance of the Sample Mean
We write popsd ({x}) for the standard deviation of the whole population of {x}.
Again, we write it like this to keep track of the facts that (a) its for the whole
population and (b) we dont and usually cant know it.
We can compute the variance of X (k) (the sample mean) easily. We have
i2
i
i
h
h
i
h
h
2
var X (k) = E (X (k) )2 E X (k) = E (X (k) )2 (popmean ({x}))

so we need to know E (X (k) )2 . We can compute this by writing
X (k) =
1
(X1 + X2 + . . . Xk )
k
Section 14.3
316
where X1 is the value of the first ticket drawn from the urn, etc. We then have
2
X (k) =

2 2
1
X 1 + X 2 2 + . . . X 2 k + X1 X2 + . . .
.
X1 Xk + X2 X1 + . . . X2 Xk + . . . Xk1 Xk
k
Expectations are linear, so we have that

i 1 2 EX 2 + EX 2 + . . . EX 2 + E[X X ]+
h
1 2
k
2
1
.
E (X (k) )2 =
. . . E[X1 Xk ] + E[X2 X1 ] + . . . E[X2 Xk ] + . . . E[Xk1 Xk ]
k
The order in which the tickets are drawn from the urn doesnt matter, because

2
each
time
we
draw
a
ticket
we
draw
from
the
same
urn.
This
means
that
E
X
=
1
2
2
E X 2 = . . . E X k . You can think of this term as the expected value of the
random variable generated by: drawing a single number
of the urn; squaring
out
that number; and reporting the square. Notice that E X 2 1 = E (X (1) )2 (look at
the definition of X (1) ).
Because the order doesnt matter, we also have that E[X1 X2 ] = E[X1 X3 ] =
. . . E[Xk1 Xk ]. You can think of this term as the expected value of the random
variable generated by: drawing a number out of the urn; writing it down; returning
it to the urn; then drawing a second number from the urn; and reporting the
product of these two numbers. So we can write
i
i

h
2
1 h
E X (k) = ( )2 kE (X (1) )2 + k(k 1)E[X1 X2 ]
k
and these two terms are quite easy to evaluate.
Urn variances
Show that
i PNp x2
h
2
2
E (X (1) )2 = i=1 i = popsd ({x}) + popmean ({x})
Np
Solution: First, we have (X (1) )2 is the number obtained by taking a ticket
out of the urn and squaring its data item. Now
so
i2
i
i
h
h
h
2
2
popsd ({x}) = E (X (1) )2 E X (1) = E (X (1) )2 popmean ({x})
i
h
2
2
E (X (1) )2 = popsd ({x}) + popmean ({x})
Section 14.3
317
Urn variances
Show that
E[X1 X2 ] = popmean ({x})
Solution: This looks hard, but isnt. Recall from the facts in chapter ??
(useful facts ??, page ??) that if X and Y are independent random variables,
E[XY ] = E[X]E[Y ]. But X1 and X2 are independent they are different
random draws from the same urn. So
E[X1 X2 ] = E[X1 ]E[X2 ]
but E[X1 ] = E[X2 ] (they are draws from the same urn) and E[X] =
popmean ({x}). So
E[X1 X2 ] = popmean ({x})2 .
Now
i
h
E (X (k) )2
=
=
=
=

kE (X (1) )2 + k(k 1)E[X1 X2 ]
k2
(1) 2
E (X ) + (k 1)E[X1 X2 ]
k
2
2
2
(popsd ({x}) + popmean ({x}) ) + (k 1)popmean ({x})
k
popsd ({x})2
+ popmean ({x})2
k
so we have
i
h
var X (k)
i2
i
h
h
= E (X (k) )2 E X (k)
2
=
=
popsd ({x})
2
2
+ popmean ({x}) popmean ({x})
k
2
popsd ({x})
.
k
This is a very useful result which is well worth remembering together with our facts
on the sample mean, so well put them in a box together.
Section 14.3
318
Useful Fact: 14.5 The sample mean

The sample mean is a random variable. Write X (k) for the mean of k
samples. We have that:
h
i
E X (k)
= popmean ({x})
i
h
=
var X (k)
o
n
=
std X (k)
popsd ({x})
k
popsd ({x})
The consequence is this: If you draw k samples, the standard deviation of

your estimate of the mean is
popsd ({x})
k
which means that (a) the more samples you draw, the better your estimate becomes
and (b) the estimate improves rather slowly for example, to halve the standard
deviation in your estimate, you need to draw four times as many samples. The
standard deviation of the estimate of the mean is often known as the standard
error of the mean. This allows us to draw a helpful distinction: the population
has a standard deviation, and our estimate of its mean (or other things but we
wont go into this) has a standard error.
Notice we cannot state the standard error of our estimate exactly, because we
do not know popsd ({x}). But we could make a good estimate of popsd ({x}), by
computing the standard deviation of the examples that we have. It is now
P helpful
to have some notation for the particular sample we have. I will write isample
for a sum over the sample items, and we will use
P
isample xi
mean ({x}) =
k
for the mean of the sample that is, the mean of the data we actually see; this is
consistent with our old notation, but theres a little reindexing to keep track of the
fact we dont see all of the population. Similarly, I will write
sP
2
isample (xi mean ({xi }))
std ({x}) =
k
for the sample standard deviation. Again, this is the standard deviation of the data
we actually see; and again, this is consistent with our old notation, again with a
Section 14.3
319
little reindexing to keep track of the fact we dont see all of the population. We
could estimate
popsd ({x}) std ({x})
and as long as we have enough examples, this estimate is good. If the number of
samples k is small, it is better to use
sP
2
isample (xi mean ({x}))
.
popsd ({x})
k1
In fact, much more is known about the distribution of X (k) .
14.3.3 The Probability Distribution of the Sample Mean
The sample mean is a random variable. We know an expression for its mean, and
we can estimate its variance. In fact, we can determine its probability distribution,
though I wont do this rigorously. In section 13.4.3, I mentioned that adding a
number of independent random variables almost always got you a normal random
variable, a fact sometimes known as the central limit theorem. I didnt prove it,
and Im not going to now. But when we form X (k) , were adding random variables.
This means that X (k) is a normal random variable, for sufficiently big k (for some
reason, k > 30 is usually seen as right).
This is important, because it has the following consequence. Draw a large
number of different samples of k elements from the population. Each is a dataset of
k items. Compute mean ({x}) for each, and regard the resulting numbers e1 , . . . , er
as data items. Convert the ei to standard coordinates si , where
si =
(ei mean ({ei }))

std (ei )
(i.e. by subtracting the mean of the ei , and dividing by their standard deviation).
Now construct a construct a histogram of the s. If r is sufficiently large, the
histogram will be close to the standard normal curve.
14.3.4 When The Urn Model Works
In our model, there was a population of Np data items xi , and we saw k of them,
chosen at random. In particular, each choice was fair (in the sense that each data
item had the same probability of being chosen) and independent. These assumptions are very important for our analysis to apply. If our data does not have these
properties, bad things can happen.
For example, assume we wish to estimate the percentage of the population
that has beards. This is a mean (the data items take the value 1 for a person with
a beard, and 0 without a beard). If we select people according to our model, then
ask them whether they have a beard, then our estimate of the percentage of beards
should behave as above.
The first thing that should strike you is that it isnt at all easy to select people
according to this model. For example, we might select phone numbers at random,
then call and ask the first person to answer the phone whether they have a beard;
Section 14.3
320
but many children wont answer the phone because they are too small. The next
important problem is that errors in selecting people can lead to massive errors in
your estimate. For example, imagine you decide to survey all of the people at a
kindergarten on a particular day; or all of the people in a womens clothing store;
or everyone attending a beard growing competition (they do exist). In each case,
you will get an answer that is a very poor estimate of the right answer, and the
standard error might look very small. Of course, it is easy to tell that these cases
are a bad choice.
It may not be easy to tell what a good choice is. You should notice the similarity between estimating the percentage of the population that wears a beard,
and estimating the percentage that will vote for a particular candidate. There is
a famous example of a survey that mispredicted the result of the Dewey-Truman
presidential election in 1948; poll-takers phoned random phone numbers, and asked
for an opinion. But at that time, telephones tended to be owned by a small percentage of rather comfortable households, who tended to prefer one candidate, and
so the polls mispredicted the result rather badly.
Sometimes, we dont really have a choice of samples. For example, we might
be presented with a small dataset of (say) human body temperatures. If we can be
satisfied that the people were selected rather randomly, we might be able to use this
dataset to predict expected body temperature. But if we knew that the subjects
had their temperatures measured because they presented themselves at the doctor
with a suspected fever, then we most likely cannot use it to predict expected body
temperature.
One important and valuable case where this model works is in simulation. If
you can guarantee that your simulations are independent (which isnt always easy),
this model applies to estimates obtained from a simulation. Notice that it is usually
straightforward to build a simulation so that the ith simulation reports an xi where
popmean ({x}) gives you the thing you want to measure. For example, imagine you
wish to measure the probability of winning a game; then the simulation should
report one when the game is won, and zero when it is lost. As another example,
imagine you wish to measure the expected number of turns before a game is won;
then your simulation should report the number of turns elapsed before the game
was won.
Section 14.4
You should
321
14.4 YOU SHOULD

14.4.1 be able to:
Write out the likelihood for a set of independent data items produced by
models from chapter 13 (at least Normal, Binomial, Multinomial, Poisson,
Beta, Gamma, Exponential).
Write out the log likelihood for a set of independent data items produced by
models from chapter 13 (at least Normal, Binomial, Multinomial, Poisson,
Beta, Gamma, Exponential).
Find maximum likelihood solutions for parameters of these models from a set
of independent data items.
Describe situations where maximum likelihood estimates might not be reliable.
Describe the difference between maximum likelihood estimation and Bayesian
inference.
Write an expression for the posterior or log-posterior of model parameters
given a set of independent data items.
Compute the MAP estimate for the cases shown in the worked examples.
Compute on-line estimates of the maximum likelihood estimate of the mean
and standard deviation of a normal model.
Compute on-line estimates of the MAP estimate of the mean and standard
deviation in the case of a normal prior and a normal likelihood.
Estimate the population mean from a sample mean.
Estimate the standard error of the estimate of a population mean.
14.4.2 remember:
New term:
New term:
New term:
New term:
New term:
Definition:
New term:
Definition:
New term:
New term:
Definition:
New term:
New term:
New term:
Inference . . . . . . . . . . . . . . . . . . . . .
point estimate . . . . . . . . . . . . . . . . . .
interval estimates . . . . . . . . . . . . . . . .
hypothesis testing . . . . . . . . . . . . . . . .
likelihood . . . . . . . . . . . . . . . . . . . . .
Likelihood . . . . . . . . . . . . . . . . . . . .
maximum likelihood principle . . . . . . . . .
The maximum likelihood principle . . . . . . .
independent and identically distributed . . . .
IID . . . . . . . . . . . . . . . . . . . . . . . .
The log-likelihood of a dataset under a model
consistency . . . . . . . . . . . . . . . . . . . .
prior probability distribution . . . . . . . . . .
posterior . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
282
282
282
282
284
284
284
285
285
285
289
292
294
294
Section 14.4
New term: Bayesian inference . . . . . . . . . . . .

Definition: Bayesian inference . . . . . . . . . . . .
New term: maximum a posteriori estimate . . . . .
New term: MAP estimate . . . . . . . . . . . . . .
Definition: MAP estimate . . . . . . . . . . . . . .
New term: conjugate . . . . . . . . . . . . . . . . .
Useful fact: Normal distributions are conjugate . .
Useful fact: Normal posteriors . . . . . . . . . . . .
New term: joint . . . . . . . . . . . . . . . . . . . .
New term: filtering . . . . . . . . . . . . . . . . . .
Useful fact: Online updating of normal posteriors .
New term: population . . . . . . . . . . . . . . . .
New term: sample . . . . . . . . . . . . . . . . . .
New term: population mean . . . . . . . . . . . . .
New term: sample mean . . . . . . . . . . . . . . .
Useful facts: Sample means and population means
Useful fact: The sample mean . . . . . . . . . . . .
New term: standard error . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
You should
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
322
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
294
294
294
294
294
296
299
301
301
303
304
306
306
306
306
307
310
310

Learning Book 11 Feb

Загружено:

Сведения о документе

Исходное описание:

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Learning Book 11 Feb

Загружено:

Авторское право:

Доступные форматы

Contents

3 Extracting Important Relationships in High Dimensions

3.2.1 The Mean . . . . . . . . . . . . . . . . . . . . . . . . . . .

4 Clustering: Models of High Dimensional Data

5 Clustering using Probability Models

7 Regression: Some harder topics

5.3.1 Example: Mixture of Normals: The E-step

12.2.3 Confusion caused by correlation .

13 Background: Useful Probability Distributions

14 Background:Inference: Making Point Estimates

14.4.2 remember: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313

Notation and conventions

Ac is the complement of the set A (i.e. A).

1.1.1 Background Information

Some Useful Mathematical Facts

1.2 SOME USEFUL MATHEMATICAL FACTS

The Curse of Dimension

Now each component of x is independent, so that P (x) = P (x1 )P (x2 ) . . . P (xd ).

The Curse of Dimension

1.4.2 Minor Banes of Dimension

Definition: 2.1 Classifier

2.1 CLASSIFICATION, ERROR, AND LOSS

Classification, Error, and Loss

Classification, Error, and Loss

if L(+ )p(+|x) > L( +)p(|x)

if L(+ )p(+|x) < L( +)p(|x)

Classification, Error, and Loss

Classification, Error, and Loss

Classification, Error, and Loss

Classifying with Naive Bayes

tend not to be used in practice, because they appear to be extremely pessimistic.

(the posterior). We write xj for the jth component of x. Our assumption is

Classifying with Naive Bayes

Classifying with Naive Bayes

Classifying breast tissue samples

Worked example 2.1

The breast tissue dataset at https://archive.ics.uci.edu/ml/datasets/

Classifying with Naive Bayes

Classifying mouse protein expression

Worked example 2.2

really aggressively to get

The Support Vector Machine

Example: 2.2 A linear model with two features

The Support Vector Machine

A good choice of C should have some important properties. If i and yi have

The Support Vector Machine

Hinge loss for a single example

The Support Vector Machine

The Support Vector Machine

The Support Vector Machine

so involves looking at each of the gi terms). Instead, we use a steplength that is

Procedure: 2.1 Stochastic Gradient Descent

The Support Vector Machine

To construct figures, I downloaded the dataset at http://archive.ics.uci.edu/

Held out error

The Support Vector Machine

The Support Vector Machine

descent with very large datasets.

2.3.5 Multi-Class Classifiers

Classifying with Random Forests

2.4 CLASSIFYING WITH RANDOM FORESTS

Classifying with Random Forests

Classifying with Random Forests