
STAT6101

Introductory Statistical Methods and Computing

Department of Statistical Science, 1-19 Torrington Place, UCL

2015-2016
Contents

1 Data Summary and Presentation
1.1 Populations and Samples
1.1.1 Reasons for Sampling
1.1.2 Random Sample from a Finite Population
1.1.3 How to obtain a random sample
1.2 Types of variable
1.3 Tables and Graphs
1.3.1 Presentation of Qualitative Data (ordered and unordered)
1.3.2 Presentation of Discrete Data
1.3.3 Presentation of Small Amounts of Continuous Data
1.3.4 Symmetric and Skew Data
1.3.5 Presentation of Large Amounts of Continuous Data
1.3.6 Comments on Dotplots, Stemplots, Histograms and Boxplots
1.4 Summary Statistics
1.4.1 Measures of Location (or Level)
1.4.2 Measures of spread
1.5 Five-figure summaries and boxplots
1.6 Choice of Summary Statistics
1.7 Change of origin and scale
1.8 Log transformations

2 Describing bivariate data
2.1 From Univariate to Bivariate Data
2.1.1 Formulae
2.1.2 An illustration
2.2 Principle of least squares
2.2.1 Fitting a constant
2.2.2 Fitting a straight line
2.3 Properties of the correlation coefficient
2.4 Rank correlation

3 The Normal Distribution
3.1 Population distribution and parameters
3.2 Probability
3.3 Calculation of probabilities for a normal distribution
3.4 Percentage points

4 Sampling distributions
4.1 Theory
4.2 Simulation of samples from a Normal population
4.3 Simulation of samples from an Exponential population

5 Confidence Intervals and Tests based on the t distribution
5.1 Estimation of a population mean
5.1.1 Method 1: Quoting the sample mean and standard error
5.1.2 Method 2: Confidence intervals
5.2 One sample t tests
5.3 Interpretation of P-values
5.4 Procedure for hypothesis tests
5.5 Relation between confidence intervals and hypothesis tests
5.6 Two sample t tests and confidence intervals
5.6.1 Matched pairs t test
5.6.2 Two sample t test
5.7 Confidence intervals for the difference between two population means

6 Two Sample Non Parametric Tests
6.1 Wilcoxon signed rank test
6.1.1 Procedure for calculating the test statistic and P-value
6.2 Mann-Whitney two sample test
6.2.1 Procedure for calculating the test statistic and P-value

7 Probability and Binomial and Poisson Distributions
7.1 Idea of probability
7.2 Rules of probability
7.2.1 Conditional and joint probabilities: multiplication rules
7.2.2 Mutually exclusive events: addition rules
7.2.3 Independence
7.3 Random variables
7.3.1 Mean and Variance of a random variable
7.4 Binomial Distribution
7.5 Poisson distribution
7.6 Approximations to the binomial distribution for large n
7.7 Normal approximation to the Poisson distribution for large µ

8 Inference for Binomial and Poisson Parameters

9 Frequency Data and Chi-Square Tests
9.1 Fitting a Probability Model
9.2 Fitting a binomial distribution
9.3 Fitting a Poisson distribution
9.4 Contingency tables
9.5 Comparing two or more binomial proportions

10 The Normal Linear Regression Model
10.1 Theory
10.2 Examples
1 Data Summary and Presentation

1.1 Populations and Samples

A population is a defined group of individuals or items to whom the conclusions of a study or experiment apply. A finite population is one where the number of individuals or items in the population can be counted, such as the population of people in a city, or the population of all registered companies, or the population of all licensed cars in the UK. For an infinite population, a researcher may notionally repeat an ‘experiment’ or observational process several times, under the same conditions; all of the possible observations that he or she could make under these conditions then form the population of observations. Examples might be the population of possible claims received by an insurance company or the population of possible industrial accidents. Here the population is not necessarily real, in the sense of the earlier examples, and it is often convenient to regard it as an infinite number of hypothetical values.

A sample is a subset of the population. A variable is a quantity that may take any one of a specified
set of values, for a given individual. Examples are age (of persons), income (of households), socio-
economic class (of workers). Data are the set of values of one or more variables recorded on one or
more individuals or items.

1.1.1 Reasons for Sampling

1. It may be too expensive or time consuming to measure every item.

2. It may be more accurate to measure a few items carefully than to try to measure every item.

3. It is essential to sample if by examining items you destroy them; e.g., if you are interested in the
life length of light bulbs then you must burn them until they fail, so you cannot test every bulb.

The disadvantage of sampling is that information is inevitably lost by not measuring every item.

1.1.2 Random Sample from a Finite Population

This is a sample where every member of the population has an equal chance of being in the sample.
Strictly speaking: a random sample of size n is one chosen such that every sample of size n has the
same chance of being the chosen sample.

It is possible, though, that by chance the random sample turns out to be unrepresentative of the
population. One way to reduce this possibility is to increase the sample size, but this may not always
be practical.

1.1.3 How to obtain a random sample

Give every member of the population a number or label.

1. If the population size is small, put each number onto a card and shuffle the cards thoroughly. Choose the required number of cards and investigate the members of the population corresponding to these numbers.

2. If the population size is large use random number tables. These are pages of numbers which have been generated in such a way that each digit on the page is equally likely to be 0, 1, …, 9. Suppose the population size is N and the number N has x digits. Proceed as follows:

(a) Decide where to start on the page of random numbers.
(b) Decide in which direction to move along the page.
(c) Read off the random numbers consecutively in groups of x, ignoring any numbers which are greater than N and any which have already occurred.
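In practice, the random digits usually come from a computer rather than printed tables. A minimal sketch in Python (the function name and the population size here are illustrative, not part of the notes):

```python
import random

def simple_random_sample(labels, n, seed=None):
    # Chosen so that every subset of size n is equally likely to be the sample.
    rng = random.Random(seed)
    return rng.sample(labels, n)

# Example: label the members of a population of size N = 500 and draw 10.
members = list(range(1, 501))
print(simple_random_sample(members, 10, seed=1))
```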

1.2 Types of variable

Qualitative (non numerical):


Categorical — no actual measurement is made, just a qualitative judgment e.g., sex, hair
colour. The observations are said to fall into categories.
Ordinal — there is a natural ordering of the categories such as degree of severity of a dis-
ease (mild, moderate, severe), occupational group (professional, skilled manual, unskilled
manual).
Quantitative (numerical):
Discrete — can only take one of a discrete set of values (e.g., the number of children in a family,
the number of bankruptcies in a year, the number of industrial accidents in a month).
Continuous — can in principle take any value in a continuous range, such as a person’s height
or the time for some event to happen. In practice, all real observations are discrete be-
cause they are recorded with finite accuracy (e.g., time to the nearest minute, income to
the nearest pound). But if they are recorded sufficiently accurately they are regarded as
continuous.

1.3 Tables and Graphs

1.3.1 Presentation of Qualitative Data (ordered and unordered)

1. By a frequency table:
Example 1.1 The number of fatal accidents to children under 15 in the UK during
1987 (source: Action on Accidents, produced by the National Association of Health
Authorities and the Royal Society for the Prevention of Accidents).
Type of accident Number of children Percentage
Pedestrians (road) (P) 260 30.88
Burns, fires (mainly home) (B) 119 14.13
Vehicle occupants (road) (V) 96 11.40
Cyclists (road) (R) 73 8.67
Drownings (home and elsewhere) (D) 63 7.48
Choking on food (C) 50 5.94
Falls (home and elsewhere) (F) 40 4.75
Suffocation (home) (S) 34 4.04
Others (O) 107 12.71
Total 842 100.00

The categories are the types of accident. The number of children dying from each type of accident
is the frequency of that category. The relative frequency or proportion of children dying from
each type of accident is the frequency divided by the total number of deaths. Multiplying the
relative frequencies by 100 gives the percentages (i.e., the relative frequencies per 100 cases).

2. Pictorially: for example by a barplot:

[Figure: barplot of the number of children (horizontal axis, 0 to 250) for each type of accident: pedestrians; burns or fires; vehicle occupants; cyclists; drownings; choking on food; falls; suffocation; others.]

If the categories are unordered (as here) it is useful to order them by size in the figure — indeed the figure and table could be combined.

1.3.2 Presentation of Discrete Data

1. By a frequency table, giving the number of occurrences of each value of the variable:

Example 1.2 The following data give the number of plants of the sedge Carex flacca
found in 100 throws of a quadrat over a wet meadow.
1 0 1 4 1 0 2 4 3 0 2 0 0 4 2 0 0 2 1 0 2 1 0 3 1
6 0 0 1 2 2 0 2 1 2 0 0 0 0 1 1 0 1 2 1 2 2 3 4 0
0 4 0 3 0 1 5 1 2 4 0 1 1 0 4 0 0 3 8 2 1 3 1 3 0
0 1 0 2 0 5 1 2 0 3 1 2 1 3 0 1 0 0 1 0 0 0 0 2 3

Number of plants      0    1    2    3    4    5    6    7    8   Total
Frequency            37   24   18   10    7    2    1    0    1     100
Relative frequency 0.37 0.24 0.18 0.10 0.07 0.02 0.01 0.00 0.01    1.00

2. Pictorially, for example by a relative frequency graph, e.g., for Example 1.2:

[Figure: relative frequency graph for Example 1.2, plotting relative frequency (vertical axis) against number of plants, 0 to 8 (horizontal axis).]
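Such a table is quick to compute by machine; a small Python sketch, applied to the first row of the Example 1.2 data for brevity:

```python
from collections import Counter

plants = [1, 0, 1, 4, 1, 0, 2, 4, 3, 0, 2, 0, 0,
          4, 2, 0, 0, 2, 1, 0, 2, 1, 0, 3, 1]   # first 25 quadrat counts
freq = Counter(plants)
n = len(plants)
for value in sorted(freq):
    # value, frequency, relative frequency
    print(value, freq[value], freq[value] / n)
```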

1.3.3 Presentation of Small Amounts of Continuous Data

Example 1.3 The systolic blood pressures of 21 women participating in a keep fit class
were as follows.

152 105 123 131 99 115 149 137 126 124 128
143 150 112 135 130 123 118 122 136 141

1. By a dotplot (also known as a dot diagram)

*
* * * * * *** * * ** *** * * ** *
--+---------+---------+---------+---------+---------+----
100 110 120 130 140 150
Systolic blood pressure (mmHg)

2. By a stemplot (also known as a stem and leaf display)

(a) First divide the numbers up into a right hand digit and the rest of the number. The left
hand digits are called the stem values and the right hand digit the leaf value. Aim at having
between 5 and 15 stem values.
(b) Write all the possible stem values from smallest to largest in a column.
(c) Take each number in turn and write its leaf value against its stem value.
(d) Re-write the stemplot putting the leaves on each stem in ascending order. Write the leaves
out carefully in vertical columns so that the shape of the data can be seen.

Stem  Leaves, in data order (leaf unit = 1)
  9   9
 10   5
 11   528
 12   364832
 13   17506
 14   931
 15   20

Stem  Leaves, in ascending order (leaf unit = 1)
  9   9
 10   5
 11   258
 12   233468
 13   01567
 14   139
 15   02
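The same display can be produced programmatically; a minimal Python sketch of steps (a) to (d), applied to the Example 1.3 blood pressures:

```python
from collections import defaultdict

bp = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
      143, 150, 112, 135, 130, 123, 118, 122, 136, 141]

stems = defaultdict(list)
for x in sorted(bp):                   # sorting puts the leaves in ascending order
    stems[x // 10].append(x % 10)      # stem = leading digits, leaf = last digit
for stem in range(min(stems), max(stems) + 1):
    print(f"{stem:3d}", "".join(str(leaf) for leaf in stems[stem]))
```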

Example 1.4 The following data give the per capita income, in $US, for a sample of 16
countries. (Source IMF)

India 162 Honduras 403 Costa Rica 1491 Israel 3790


Madagascar 183 Botswana 524 Argentina 1988 Japan 6593
The Gambia 208 Ecuador 854 Greece 2822 Australia 6843
Egypt 358 Jamaica 1466 Italy 3439 Kuwait 11554

1. Dotplot:

*
** *
**** * * * * * ** *
+----+----+----+----+----+----+----+----+----+----+----+----+-----
0 2000 4000 6000 8000 10000 12000
Income in US Dollars

2. Stemplot:
Following the procedure in Example 1.3 would give 1140 stem values which is far too many. In
order to get over this we round every number down to the nearest 10 or if the number of stem
values is still too large to the nearest 100. By rounding down you retain the actual figures in
the original number.

100 100 200 300 400 500 800 1400


1400 1900 2800 3400 3700 6500 6800 11500

Ignore the zeroes at the end of each number and proceed as in Example 1.3.

Stem  Leaves (leaf unit = 100)
  0   1123458
  1   449
  2   8
  3   47
  4
  5
  6   58
  7
  8
  9
 10
 11   5

Stem  Leaves (leaf unit = 100)
  0   1123458
  1   449
  2   8
  3   47
  4
  5
  6   58
HI: 11500

If there are a few observations that are a long way from the main body of the data (i.e., there
are several stems with no leaves) then these observations can be listed separately as either high
(HI) or low (LO) values in a stemplot as above.

1.3.4 Symmetric and Skew Data

Data are symmetric if when you imagine a line through the central value in a table or diagram, the
two halves are mirror images. If the data are not symmetric they are said to be skewed. If the larger
values are more spread out than the smaller values, then the data are positively skewed, or right
skew. If the smaller values are more spread out than the larger values, then the data are negatively
skewed, or left skew. The data in Examples 1.2 and 1.4 are right skew while the data in Example 1.3
are approximately symmetric.

1.3.5 Presentation of Large Amounts of Continuous Data

The individual measurements of the hand span (in inches) of 140 men were as follows.

68.2 64.8 64.2 73.9 69.5 70.8 68.4 72.7 72.6 67.5
67.0 72.7 71.6 72.3 70.0 71.0 71.0 67.4 68.3 66.8
73.1 71.9 73.4 67.6 73.0 69.9 71.8 64.3 71.5 70.4
70.3 73.9 70.8 70.2 65.0 75.4 72.3 71.1 65.5 70.6
70.9 68.3 71.5 66.6 70.0 72.2 67.6 71.2 70.5 66.5
76.3 66.1 76.0 75.1 68.2 68.6 69.4 69.1 70.7 70.5
65.5 69.9 68.0 72.2 69.8 65.5 73.2 64.7 67.5 68.2
72.4 68.5 65.1 65.6 74.8 68.0 70.3 73.2 74.2 74.7
65.8 72.5 70.1 72.2 73.8 66.3 70.3 74.0 69.4 69.7
70.7 67.5 68.4 67.0 68.3 67.6 63.9 66.5 67.1 66.9

65.1 72.1 71.3 67.1 65.4 68.0 70.3 66.7 70.8 74.0
66.5 71.6 73.9 70.8 66.5 69.8 73.9 66.7 67.8 67.9
67.5 65.6 70.3 70.7 67.3 65.8 66.0 72.2 70.8 72.1
64.4 65.7 72.4 68.2 73.2 68.0 68.4 61.5 66.9 61.3

1. Grouped frequency table

(a) Calculate the range of the data i.e., the largest value minus the smallest value.
(b) Divide the range up into groups. Aim at having between 5 and 15 groups.
(c) Calculate the frequency of each group.
(d) Optional: calculate the relative frequency of each group
(i.e. divide the frequency in each group by the total frequency)

Hand span Frequency Relative frequency


[61.0 , 62.5) 2 0.014
[62.5 , 64.0) 1 0.007
[64.0 , 65.5) 9 0.064
[65.5 , 67.0) 21 0.150
[67.0 , 68.5) 29 0.207
[68.5 , 70.0) 11 0.079
[70.0 , 71.5) 27 0.193
[71.5 , 73.0) 20 0.143
[73.0 , 74.5) 14 0.100
[74.5 , 76.0) 4 0.029
[76.0 , 77.5) 2 0.014
Total 140 1.000

2. Frequency histogram
This is a graph of the information in a grouped frequency table. For each group, a rectangle is
drawn with base equal to the group width and area (i.e. surface) equal to the frequency for that
group. A frequency histogram has two axes

• horizontal axis with the variable of interest


• vertical axis with the frequency density, i.e. the frequency per unit (what that unit is,
depends on the context)

Two possible histograms of the span data are shown below.

[Figure: two frequency histograms of the hand span data; the left uses frequency density per 1.5 inches, the right per 1 inch.]

The horizontal axis and the shape of the graph are identical in the two histograms but the unit of the frequency density is different and as such the numbers on the vertical axes are different.

The frequency densities are calculated using the formula

frequency = base × height,

where the latter is the formula for the area of a rectangle. In general the frequency and base are known (as they are shown in the grouped frequency table) and the only unknown in the equation is the height, i.e. the frequency density.

Let us look at these calculations in more detail, for example, for the first category in the hand span example. The information in the grouped frequency table shows that the frequency for men with a hand span in the interval [61.0, 62.5) inches equals 2. If a frequency histogram is drawn, the width of the category (the base) equals 62.5 inches − 61.0 inches = 1.5 inches. Substituting these values in the formula gives

frequency = base × height  ⇔  2 = 1.5 inches × frequency density  ⇔  2 = 1.5 inches × (number/unit)

So the only unknown in the equation is the frequency density. Looking at the formula, it is clear that the expression on the left hand side is a number (with no unit). The expression on the right hand side should be identical, so the unit of the frequency density needs to be expressed in inches. The choice of how many inches depends on personal preference, but it is important to choose a unit which does not make the graph unclear (for example, the frequency per 3267 inches). The specific unit is chosen before actually drawing the graph, so let's take 1.5 inches in this example. Choosing this unit gives the equation:

2 = 1.5 inches × (number / 1.5 inches)  ⇔  number = 2

If a unit of 1 inch had been chosen, the number would have been

2 = 1.5 inches × (number / 1 inch)  ⇔  number = 1.33

The following table gives the calculations for all the categories of the hand span example for two different units of the frequency density (per 1.5 inches, and per 1 inch).

Span            Frequency   Freq. density (per 1.5 inch)   Freq. density (per 1 inch)
[61.0, 62.5)        2               2                              1.33
[62.5, 64.0)        1               1                              0.67
[64.0, 65.5)        9               9                              6
...                ...             ...                            ...
[76.0, 77.5)        2               2                              1.33

Note that the numbers on the vertical axis of the left histogram are identical to the original
frequencies. This will only occur when all groups are of equal width, and the unit of the density
is chosen to be equal to the group width.
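The calculation is mechanical once the bin edges, the frequencies and the unit are fixed; a small Python sketch reproducing the densities above:

```python
edges = [61.0, 62.5, 64.0, 65.5, 67.0, 68.5, 70.0, 71.5, 73.0, 74.5, 76.0, 77.5]
freq = [2, 1, 9, 21, 29, 11, 27, 20, 14, 4, 2]

unit = 1.0   # density per 1 inch; set unit = 1.5 to reproduce the left graph
for lo, hi, f in zip(edges, edges[1:], freq):
    base = hi - lo
    height = f / base * unit    # area = base * height must equal the frequency
    print(f"[{lo}, {hi})  frequency {f:2d}  density {height:.2f} per {unit} inch")
```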

A common error is to label the vertical axis as “frequency” instead of “frequency density” (although the numbers happen to be identical, the meaning is different). In general, though, the intervals in a grouped frequency table are not necessarily equal and it is the areas of the rectangles, not their heights, that represent the frequencies. For example, let's look at the hand span example with the last three categories merged. The new frequency table would look as follows

Span Frequency Relative frequency
[61.0 , 62.5) 2 0.014
[62.5 , 64.0) 1 0.007
[64.0 , 65.5) 9 0.064
[65.5 , 67.0) 21 0.150
[67.0 , 68.5) 29 0.207
[68.5 , 70.0) 11 0.079
[70.0 , 71.5) 27 0.193
[71.5 , 73.0) 20 0.143
[73.0 , 77.5) 20 0.143
Total 140 1.000

When recalculating the frequency densities for the histograms, the same principles remain valid: the areas (base × height) still represent the frequencies. Redoing the calculations for the new example shows that all frequency densities remain the same except for the merged category:

Span            Frequency   Freq. density (per 1.5 inch)   Freq. density (per 1 inch)
[61.0, 62.5)        2               2                              1.33
...                ...             ...                            ...
[73.0, 77.5)       20               6.67                           4.44

The two corresponding histograms now look as follows.

[Figure: two frequency histograms of the merged table, with frequency density per 1.5 inch and per 1 inch respectively.]

Besides the units for the frequency densities in the two aforementioned histograms, many other units are possible. Always choose a unit which does not make the graph unclear, and to make sure that a histogram is as clear as possible, always label the axes accurately.

Besides a frequency histogram, it is also possible to draw relative frequency histograms. A relative frequency histogram is exactly the same shape as a frequency histogram, but the vertical axis contains the relative frequency density and is re-scaled so that areas represent proportions. For the hand span example (with the last three categories merged) the relative frequency histograms could look as follows.

[Figure: two relative frequency histograms of the merged table.]

1.3.6 Comments on Dotplots, Stemplots, Histograms and Boxplots

1. A dotplot is a simple and often very effective display of a small sample of continuous data; it can show location, dispersion, asymmetry (though subtle shape features are hard to see in small samples) and extreme values. Dotplots are particularly good for comparing several small samples with respect to a common scale.

2. A stemplot tabulates the actual values of the original data (possibly after rounding). It therefore
lends itself to simple calculation of quantiles. And it also gives a pictorial view of the data (like a
histogram on its side) and shows location, dispersion and, to some extent, shape. Stemplots can
be useful for small and moderate sample sizes (e.g., up to 50 or 60) but not for large samples.

3. A histogram just plots the frequencies — i.e., the information in a frequency table, so the
individual values are lost. This form is good for showing shape (in addition to location and
dispersion) for reasonably large samples. It is also possible to compare several histograms on a
common scale.

4. A fourth type of graph useful for quantitative data, is the boxplot (see below). This is also
good for comparing a number of samples, each of a reasonable size (e.g., over 25 each).

1.4 Summary Statistics

In addition to graphical displays it is often useful to have numerical summary statistics that attempt
to condense the important features of the data into a few numbers.

1.4.1 Measures of Location (or Level)

1. Mean. This is sometimes referred to as the arithmetic mean, to distinguish it from other types of mean, such as geometric mean and harmonic mean. It is defined as

mean = (sum of all observations) / (total number of observations)
This is often written in the mathematical notation x̄ which you will find in textbooks and on
calculators. Using Example 1.3 to help explain the notation:
n is the number of observations in the sample = 21
x1 is the systolic blood pressure of the first woman in the sample = 152
x2 is the systolic blood pressure of the second woman in the sample = 105, etc.

Σx is the sum of all the x values; this is short for the more precise expression Σᵢ₌₁ⁿ xᵢ.
x̄ is the mean of the sample.
Thus

x̄ = Σx/n = (152 + 105 + · · · + 141)/21 = 128.52.

The mean has some properties that it is useful to understand:

(a) Imagine trying to balance a dotplot of the data on the end of a pencil (where each dot has equal weight). The point on the scale where the figure balances exactly is the mean. This helps us understand why if the data are symmetric, the mean is in the middle; and it tells us intuitively where the mean must be if the data are not symmetric.

(b) Suppose that you subtract the mean from each data value. Then the resulting differences (sometimes called residuals) must add to zero. That is

(x₁ − x̄) + (x₂ − x̄) + · · · + (xₙ − x̄) = Σᵢ₌₁ⁿ xᵢ − nx̄ = nx̄ − nx̄ = 0.

2. Median. The median of a set of numbers is the value below which (or equivalently above which) half of them lie. It is also known as the 50-percentile point. To find the median of n observations, first put the observations in increasing order. The median is then given by:

the ((n + 1)/2)th observation if n is odd,
the mean of the (n/2)th and (n/2 + 1)th observations if n is even.
For the data in Example 1.3, the observations have been written down in increasing order in the
stemplot. As n = 21, the median is the 11th observation, i.e., median = 128.
For the data in Example 1.4, x̄ = 2667.4, and median = (1466 + 1491)/2 = 1478.5.
Note that for the approximately symmetric data of Example 1.3 the mean and median are similar
but for the right skew data of Example 1.4 the mean is much larger than the median.
3. Quartiles (and other quantiles). In the same way as for the median, we may calculate the value below which some specified fraction of the observations lie. The lower quartile qL is the value below which one quarter of the observations lie and the upper quartile qU is the value below which three quarters of the observations lie. The lower and upper quartiles are also known as the 25 and 75 percentiles. Different text books may use slightly different definitions of sample quartiles. Here is a standard one: as when finding the median, first put all the n observations in increasing order. Then:

If n/4 is not a whole number then calculate a, the next whole number larger than n/4, and b, the next whole number larger than 3n/4. The lower quartile is the ath observation and the upper quartile is the bth observation.

If n/4 is a whole number then the lower quartile is the mean of the (n/4)th and (n/4 + 1)th observations and the upper quartile is the mean of the (3n/4)th and (3n/4 + 1)th.

In Example 1.3, n = 21, so n/4 = 5.25, which is not a whole number. So a = 6 and the lower quartile qL is the 6th value in the ordered data. From the stemplot, qL = 122. Also, 3n/4 = 15.75, so b = 16 and the upper quartile is the 16th value, qU = 137. Note that we could also find qU by counting a = 6 values down from the largest.

In Example 1.4, n = 16, so n/4 = 4, which is a whole number. So the lower quartile is the average of the 4th and 5th observations: qL = (358 + 403)/2 = 380.5, and the upper quartile is the average of the 12th and 13th (or equivalently the 4th and 5th down from the largest): qU = (3439 + 3790)/2 = 3614.5.
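These counting rules translate directly into code; a minimal Python sketch (the quartiles function implements the rule stated above, which may differ slightly from the defaults built into statistical software):

```python
import math
import statistics

def quartiles(data):
    # Lower and upper quartiles by the counting rule in the notes.
    x = sorted(data)
    n = len(x)
    if n % 4:                                  # n/4 is not a whole number
        a, b = math.ceil(n / 4), math.ceil(3 * n / 4)
        return x[a - 1], x[b - 1]
    a, b = n // 4, 3 * n // 4                  # n/4 whole: average two neighbours
    return (x[a - 1] + x[a]) / 2, (x[b - 1] + x[b]) / 2

# Example 1.3 blood pressures.
bp = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
      143, 150, 112, 135, 130, 123, 118, 122, 136, 141]
print(statistics.mean(bp))     # 128.52...
print(statistics.median(bp))   # 128
print(quartiles(bp))           # (122, 137)
```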

1.4.2 Measures of spread

1. Range. The range is the largest observation minus the smallest observation.
In Example 1.3, the range is 152 − 99 = 53. In Example 1.4, the range is 11554 − 162 = 11392.

2. Interquartile Range. The range has the disadvantage that it may be greatly affected by extreme values that are a large distance away from the main body of the data — as in Example 1.4 — so that it may not give an informative measure of the spread of most of the data. A more stable measure is the interquartile range, which is the range of the middle half of the data. Thus

interquartile range = upper quartile − lower quartile = qU − qL.

For the data in Example 1.3 the interquartile range is 137 − 122 = 15. For the data in Example 1.4 the interquartile range is 3614.5 − 380.5 = 3234.0.

3. Variance and Standard Deviation.

The sample variance is the sum of squares of the residuals (the differences between each observation and the sample mean) divided by n − 1:

variance = [(x₁ − x̄)² + (x₂ − x̄)² + · · · + (xₙ − x̄)²] / (n − 1) = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)

The units of the variance are the square of the units of the original data, so its numerical value is not particularly useful as a measure of spread. (We will see later that the variance is a useful mathematical parameter; in particular, if you add two independent quantities, then the variance of the sum equals the sum of the variances.)

The corresponding measure of spread that is in the original units is the standard deviation, defined by

standard deviation = √variance.

The sample standard deviation is usually denoted by s and the variance by s².
If you calculate the standard deviation using the statistics mode on a calculator, then s is found by pressing the σₙ₋₁ key. If you calculate the standard deviation using a calculator without a statistics mode, the calculating form of the variance is

variance = s² = (1/(n − 1)) [ Σx² − (Σx)²/n ]

For the data in Example 1.3, n = 21, Σx = 2699, Σx² = 152² + 105² + · · · + 141² = 350983. Hence the variance is

s² = (1/20) (350983 − 2699²/21) = 204.8619

and the standard deviation is

s = √204.8619 = 14.3130.
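A short Python sketch of the definition, checked against the Example 1.3 figures:

```python
import math

def sample_variance(data):
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)   # divisor n - 1

bp = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
      143, 150, 112, 135, 130, 123, 118, 122, 136, 141]
s2 = sample_variance(bp)
print(s2, math.sqrt(s2))   # approx. 204.8619 and 14.3130, as above
```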

1.5 Five-figure summaries and boxplots

A set of data is often conveniently summarised by the five statistics: minimum, lower quartile, median,
upper quartile and maximum. For moderate or larger samples, these can give concise information
about location, spread and shape; and several samples can be compared in this way. It is common
also to present these numbers graphically in a boxplot.

A scale is drawn (in the same way as for a dotplot) and a box is drawn between the two quartiles.
So half of the data are in the box. The median is also marked in the box. Lines (sometimes called
“whiskers”) are drawn from the ends of the box to the minimum and maximum, to show the whole
range of the data. Extreme points (e.g., those equal to or more than 1.5 interquartile ranges below qL
or above qU ) are often plotted separately.
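Reusing the quartiles function and the bp data from the sketches in Section 1.4, the five statistics are one short function in Python:

```python
import statistics

def five_figure_summary(data):
    x = sorted(data)
    qL, qU = quartiles(x)        # quartiles() as sketched in Section 1.4.1
    return x[0], qL, statistics.median(x), qU, x[-1]

print(five_figure_summary(bp))   # (99, 122, 128, 137, 152) for Example 1.3
```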

Here are boxplots for the data in Examples 1.3 and 1.4. You can compare these with the dotplots
shown earlier.

[Figure: boxplots of the Example 1.3 blood pressures (mmHg, 100 to 150) and the Example 1.4 incomes (dollars, 0 to 10000).]

In Example 1.3 the five-figure summary is (99, 122, 128, 137, 152). Half of the observations are
between 122 and 137; the distribution is nearly symmetrical (the median is near the middle of the
box) and one extreme low value has been identified, which is not much lower than the next lowest.

In Example 1.4 the five-figure summary is (162, 380.5, 1478.5, 3614.5, 11554). The distribution is very
positively skewed (the lower whisker is much shorter than the upper one and the median is towards
the left end of the box) and an extreme high value — considerably higher than the next highest — is
shown.

1.6 Choice of Summary Statistics

The mean and standard deviation together describe location and spread for quantitative data.
They are most useful when it makes sense to add or average the measurements arithmetically (e.g., for
lengths, times, amounts of money, etc.) They are less useful if the data are skewed because they give
no indication of shape — indeed for highly skewed data the mean may be rather untypical of most
of the data. They may also be greatly affected by outliers. They also form the basis for confidence
intervals and significance tests when the data are samples from normal populations.

The five-figure summary describes not only location and spread but also shape. These statistics describe data essentially by giving the range of each quarter of the data ordered by size, so they are useful when this kind of description is wanted. In particular, they do not combine the data arithmetically. The median and quartiles are not affected by extreme values or outliers, so they may be useful for this reason.

In Example 1.4, the five-figure summary is much more informative than the mean and standard
deviation. The latter give a poor description because of the strong skewness. Also the observations
are amounts of income per head for several countries of very different sizes. It is not clear that it
makes much sense to average these numbers arithmetically.

In Example 1.3, the observations are blood pressures of several similar individuals. They are fairly
symmetrically distributed and well described by location and spread only. While it may not be
physically meaningful to average these, it is not unreasonable to do so, and the mean and standard
deviation are a useful summary here.

1.7 Change of origin and scale

It is often convenient or necessary to change the units of measurement. This may involve changing the origin (e.g., local time to Greenwich Mean Time) or changing the scale (dollars to pounds, miles to kilometres), or both (degrees Celsius to degrees Fahrenheit). For example, if x is temperature in °C and u is temperature in °F, then u = 1.8x + 32 and x = (u − 32)/1.8. There are simple rules for how summary statistics change under such linear transformations.

1. If you subtract a constant a from each of a set of numbers, then the mean of the new set is the mean of the old set minus a, and the standard deviation is unchanged. Thus, if uᵢ = xᵢ − a, for i = 1, 2, …, n, then ū = x̄ − a and su = sx.
In general, adding or subtracting a constant (changing the origin) will shift measures of location (mean, median, quartiles, etc.) by that amount, but will not affect measures of dispersion (sd, interquartile range, etc).

2. If you multiply each number by the same positive constant, then both the mean and standard deviation are multiplied by that constant. If uᵢ = bxᵢ, where b > 0, then ū = bx̄ and su = bsx; the variance is multiplied by the squared constant, su² = b²sx².

3. More generally, if uᵢ = bxᵢ + a, where b > 0, then ū = bx̄ + a, su² = b²sx² and su = bsx.

4. In particular, if

uᵢ = (xᵢ − x̄)/sx

then ū = 0 and su = 1. That is, if we subtract the sample mean from each data value and then divide each by the sample standard deviation, the resulting numbers u₁, u₂, …, uₙ will have mean equal to 0 and standard deviation equal to 1. These numbers are called the standardised data. This is the special case of 3 above where a = −x̄/sx and b = 1/sx.

These rules are very useful for doing quick and accurate calculations.
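Rule 4 can be verified numerically; a short Python sketch using the Example 1.3 data and the sample_variance function from the earlier sketch:

```python
bp = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
      143, 150, 112, 135, 130, 123, 118, 122, 136, 141]
xbar = sum(bp) / len(bp)
s = sample_variance(bp) ** 0.5          # sample sd, from the earlier sketch
u = [(x - xbar) / s for x in bp]        # standardised data
print(sum(u) / len(u))                  # 0, up to rounding error
print(sample_variance(u) ** 0.5)        # 1
```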

1.8 Log transformations

Many variables and relationships are usefully described using log scales. Logs to base 10 are useful to indicate orders of magnitude, but mathematical formulae generally use natural logs, i.e., logs to base e. Here are some logarithmic scales: the first shows equal divisions of y = log₁₀ x and the second shows equal divisions of y = logₑ x:

[Figure: two logarithmic scales, one marking x from 0.01 to 100 at equal divisions of y = log₁₀ x, the other marking x from 0.1 to 10 at equal divisions of y = logₑ x.]

Natural logs also have a very useful numerical property:

the difference between two numbers, as a fraction of their mean value, approximately equals the difference between their natural logs.

For example (110 − 90)/100 = 0.20 and logₑ(110) − logₑ(90) = 0.20067. This works well for fractional differences up to about 0.5.

Suppose we take (natural) logs of a set of numbers. What happens to their mean and standard deviation? The mean of the logs does not equal the log of the mean of the numbers. In general mean(log x) is less than log(mean of x), though these two are quite close if the standard deviation is small. The approximate formula is

mean(log x) ≈ log(mean(x)) − (1/2) (sd(x)/mean(x))².

Furthermore, the standard deviation of the logs approximately equals the relative standard deviation of the original numbers:

sd(log x) ≈ sd(x)/mean(x).

Thus, if the standard deviation of log x equals 0.20, then the standard deviation of x is approximately 20% of the mean of x. Again, this works well for relative standard deviations up to about 0.5.
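Both approximations are easy to check by simulation; a Python sketch (the simulated population is an arbitrary choice made for illustration):

```python
import math
import random
import statistics

random.seed(2)
x = [random.gauss(100, 15) for _ in range(100_000)]   # relative sd about 0.15
logx = [math.log(v) for v in x]

rel_sd = statistics.stdev(x) / statistics.mean(x)
print(statistics.stdev(logx), rel_sd)                 # both approx. 0.15
print(statistics.mean(logx),
      math.log(statistics.mean(x)) - 0.5 * rel_sd ** 2)   # approx. equal
```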

2 Describing bivariate data

2.1 From Univariate to Bivariate Data

2.1.1 Formulae

Consider a sample of values of two variables x and y for n individuals. Denote these data by (xᵢ, yᵢ) for individual i = 1, 2, …, n. The sample means are denoted by

x̄ = Σxᵢ/n = (x₁ + x₂ + · · · + xₙ)/n  and  ȳ = Σyᵢ/n,

the sums of squares about the mean are denoted by

Cxx = Σᵢ₌₁ⁿ (xᵢ − x̄)² = Σxᵢ² − (Σxᵢ)²/n  and  Cyy = Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σyᵢ² − (Σyᵢ)²/n,

and the standard deviations of x and y are

sx = √(Cxx/(n − 1))  and  sy = √(Cyy/(n − 1)).

The sum of products about the mean is denoted by

Cxy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n,

the sample covariance is defined by

sxy = Cxy/(n − 1) = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ),

and the sample correlation coefficient is defined by

r = rxy = Cxy/√(Cxx Cyy) = sxy/(sx sy) = (1/(n − 1)) Σᵢ₌₁ⁿ ((xᵢ − x̄)/sx)((yᵢ − ȳ)/sy).

The latter is a measure of the strength and direction of a linear relationship between x and y. The least squares regression line is given by the equation

y = a + bx,

where this line has slope b and intercept a given by

b = Cxy/Cxx  and  a = ȳ − bx̄.

The residual sum of squares is

RSS = Σᵢ₌₁ⁿ (yᵢ − a − bxᵢ)² = Cyy − Cxy²/Cxx = Cyy(1 − rxy²)

and the residual standard deviation is

sres = √(RSS/(n − 2)) = √((n − 1) sy² (1 − rxy²)/(n − 2)).

Interpretations: For a particular x, the quantity a+bx (i.e., the point on the line) can be interpreted
as the estimated mean value of y for all individuals with this x value; and the residual standard
deviation can be interpreted as the standard deviation of y values for individuals with the same x
value. The meaning of what an ‘individual’ is will depend on the context. The intercept a is the
estimated mean value of y when x = 0 and can therefore be regarded as the estimated mean y for all
individuals that have x = 0. (Sometimes this interpretation may not make physical sense.) The slope
b is the amount by which the mean value of y is estimated to change when x increases by one unit. It
can therefore be regarded as the estimated change in the mean y when x increases by one unit.

These interpretations make the tacit assumptions that (a) a straight line is a good description of the
relationship and (b) that the scatter of y values about the line is roughly the same for each x.

Calculators: If you have a statistical calculator with two-variable data entry, it will have keys that
give you x̄, ȳ, sx , sy , a, b, and r. It may also have a key labelled ŷ that will give you a + bx when
you enter a value of x. Amazingly, most such calculators do not have a key to give you the residual
standard deviation. To calculate sres , use one of the formulae above.
Also, such calculators generally have keys to give you the values of n, Σx, Σy, Σx², Σy² and Σxy, from which you may also calculate Cxx, Cyy and Cxy using the above formulae.

2.1.2 An illustration

Example 2.0 In order to illustrate the various calculations with simple numbers, here are
some fictitious data for a small sample of 12 households, where x is weekly income and y
is weekly expenditure. Both variables are in dollars per week.
x: 100 100 200 300 300 400 400 400 500 500 500 600
y: 50 100 95 225 280 270 340 380 400 455 480 535

Here is a scatter plot of these data:

[Figure: scatter plot of expenditure (dollars per week, vertical axis 0 to 600) against income (dollars per week, horizontal axis 0 to 600).]

You can use your calculator to check that the various statistics are n = 12, x̄ = 358.33, ȳ = 300.83,
sx = 162.14, sy = 159.86 and rxy = 0.97. The means and standard deviations are all in dollars
(per week). The average expenditure is a bit less than the average income and for both variables
the standard deviation is large in proportion to the mean (i.e., sx /x̄ = 0.45 and sy /ȳ = 0.53). The
correlation coefficient is of course dimensionless and is very close to 1, suggesting that there is a strong
positive association between expenditure and income, as can be seen from the figure.

You can also check that the intercept, slope and residual standard deviation for the least squares regression line are a = −41.7, b = 0.96 and sres = 41.06. So the weekly expenditures of households with the same weekly income x vary about a mean of −41.7 + 0.96x dollars with a standard deviation of about 41 dollars. For example, if the income is 350 dollars, the average expenditure is estimated to be −41.7 + 0.96 × 350 = 294.3 dollars, and the standard deviation of expenditures is 41 dollars. Note that expenditures for households with the same income vary with a much smaller standard deviation than do expenditures of households with different incomes.
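All of these quantities follow mechanically from the formulae in Section 2.1.1; a Python sketch applied to the Example 2.0 data:

```python
import math

def least_squares(x, y):
    # Correlation, least squares line and residual sd, per Section 2.1.1.
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    Cxx = sum((xi - xbar) ** 2 for xi in x)
    Cyy = sum((yi - ybar) ** 2 for yi in y)
    Cxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = Cxy / Cxx
    a = ybar - b * xbar
    r = Cxy / math.sqrt(Cxx * Cyy)
    s_res = math.sqrt((Cyy - Cxy ** 2 / Cxx) / (n - 2))
    return a, b, r, s_res

income = [100, 100, 200, 300, 300, 400, 400, 400, 500, 500, 500, 600]
spend = [50, 100, 95, 225, 280, 270, 340, 380, 400, 455, 480, 535]
print(least_squares(income, spend))   # approx. (-41.7, 0.96, 0.97, 41.06)
```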

Here is another plot of the data, with the regression line drawn on:

[Figure: the same scatter plot of expenditure against income, with the least squares regression line drawn on.]

You can see that the line provides a good description of how the average expenditure increases with
income. The points are scattered approximately equally above and below the line, with about the
same amount of scatter at each x value. In this example, both x and y are measured in the same units
and the slope b = 0.96 has a simple interpretation: for every dollar increase in income, the households
are estimated to spend on average an extra 96 cents. The intercept a = −41.7 formally represents the
estimated mean expenditure of households with zero income, but this is not meaningful in the present
context as there is no reason to suppose that the straight line description is sensible when income is
small.

Example 2.1 The following table gives the area in square km (A) and the number of plant
species (S) for 14 of the Galapagos islands, along with values of x = log A and y = log S. Note that these are logs to base e. We want to describe how S depends on A.

Island          Area A (sq. km)   No. of species S   x = log A   y = log S
Daphne Major          0.34              18            -1.07881    2.89037
Espanola             58.27              97             4.06509    4.57471
Fernandina          634.49              93             6.45282    4.53260
Genovesa             17.35              40             2.85359    3.68888
Isabela            4669.32             347             8.44877    5.84933
Marchena            129.49              51             4.86360    3.93183
Pinzon               17.95             108             2.88759    4.68213
Rabida                4.89              70             1.58719    4.24850
San Salvador        572.33             237             6.34972    5.46806
Santa Cruz          903.82             444             6.80663    6.09582
Santa Fe             24.08              62             3.18138    4.12713
Santa Maria         170.92             285             5.14120    5.65249
Seymour               1.84              44             0.60977    3.78419
Tortuga               1.24              16             0.21511    2.77259

[Figure: two scatter plots. Left: number of species S against area A in sq. km. Right: y = log(species) against x = log(area), i.e. S against A on log scales.]

Here are scatter plots of S against A and y against x. Look at these carefully to see the effect of the log transformations. Note that the first plot is dominated by the largest island (Isabela), which looks like an outlier, but this is not particularly extreme on the log scale. It looks as if y is roughly linearly related to x and that the scatter of y about the line is roughly the same for all x.

You can check using your calculator that x̄ = 3.7417, ȳ = 4.4499, Cxx = 100.9813, Cyy = 13.9076, Cxy =
32.2638, sx = 2.7870, sy = 1.0343, rxy = 0.8609 and that the least squares line is

y = 3.25 + 0.32x

with residual standard deviation sres = 0.55. Thus, for islands of the same area A, the log of the
number of species (y) will vary with standard deviation 0.55 about a mean value of 3.25 + 0.32 log A.
Transforming this back gives the relation between S and A:

S = e^3.25 A^0.32 = 25.79 A^0.32.

Thus the number of plant species increases roughly as the cube root of the area of the island, and the
relative standard deviation of this number, for islands of the same area, is about 55%.

2.2 Principle of least squares

2.2.1 Fitting a constant

Suppose we have a set of n numbers y₁, y₂, …, yₙ and we wish to approximate them by a single number a, say. What number a is closest to the y's in the sense of least squares? In other words, what a is such that (y₁ − a)² + (y₂ − a)² + · · · + (yₙ − a)² is as small as possible?

The answer is a = ȳ, the mean of the y's. Here is a proof: for any a and i we may write

yᵢ − a = (yᵢ − ȳ) + (ȳ − a).

Now square both sides:

(yᵢ − a)² = (yᵢ − ȳ)² + 2(yᵢ − ȳ)(ȳ − a) + (ȳ − a)².

Now add up each side over i = 1, 2, …, n. On the right hand side we may add up each of the three terms separately. Furthermore, because Σ(yᵢ − ȳ) = 0, the middle term adds to zero, so

Σᵢ₌₁ⁿ (yᵢ − a)² = Σᵢ₌₁ⁿ (yᵢ − ȳ)² + n(ȳ − a)².

Each term on the right side is positive, or possibly zero, and we can make the right hand side (and hence the left hand side) as small as possible by choosing a = ȳ. Furthermore, the minimum value of Σ(yᵢ − a)² is therefore Σ(yᵢ − ȳ)² = Cyy.

You can also derive this result by using calculus.

2.2.2 Fitting a straight line

Imagine a scatter plot of points (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ). Suppose we want to approximate the relationship describing how y depends on x by a straight line y = a + bx. What line is closest to the points in the sense of least squares? That is, what values of a and b are such that

Σᵢ₌₁ⁿ (yᵢ − a − bxᵢ)²

is as small as possible? The answer is

b = Cxy/Cxx  and  a = ȳ − bx̄.

You can prove this either using calculus or by extending the above argument using algebra. The equation of the least squares line is therefore

y = ȳ + b(x − x̄).

This line goes through the point (x̄, ȳ) and has slope b = Cxy/Cxx.

2.3 Properties of the correlation coefficient

The correlation coefficient rxy, also known as Pearson's correlation coefficient, is defined as

rxy = Cxy/√(Cxx Cyy) = (1/(n − 1)) Σᵢ₌₁ⁿ ((xᵢ − x̄)/sx)((yᵢ − ȳ)/sy).

It is a measure of the strength and direction of a linear relation between two quantitative variables. You can see from the second formula above that it does not depend on the units of measurement. If you change the origin and scale of x or y (or both) rxy does not change. It has the following mathematical properties:

• −1 ≤ rxy ≤ 1,

• if rxy > 0, y tends to increase as x increases,

• if rxy < 0, y tends to decrease as x increases,

• if rxy = 1 or rxy = −1 the points in the xy-scatter plot lie exactly on a straight line,

• the closer rxy is to 1 or −1, the closer the points are to a straight line.

These properties are illustrated by the following scatter plots:

[Figure: five scatter plots illustrating r = −1, r = −0.6, r = 0, r = 0.9 and r = 1.]

There are many types of relationships between two variables. Like other summary statistics, the correlation coefficient by itself cannot summarise a relationship adequately. One should always look at a scatter plot of the data. To illustrate this, the figures below show some very different types of relationship where the points all have the same value of rxy = 0.7:

[Figure: six scatter plots, (a) to (f), each with rxy = 0.7. Plot (a) contains a remote point; plot (b) contains an outlier.]

In (a) there is a remote point whose x value is very different from that of the other data values. The
observations show no relationship, but the remote point makes rxy = 0.7. In (b) there is an outlier
that does not fit the pattern of the rest of the data. The observations are very highly correlated except
for the outlier which brings the correlation down to 0.7. Case (c) is a typical scatter plot for a slightly
weak relationship between the variables.

In (d) there is a very strong relationship, but it is not linear. In (e) there are two distinct groups
of observations with no apparent relationship between x and y within either group; but the average x
and y values both differ for the two groups. In (f) there are two distinct groups of observations with
high correlation within each group.

Example 2.2 The following table gives the percentage of 18 London Boroughs devoted to
open space (x) and the percentage of all accidents which involve children in the Boroughs
(y).
Borough x y Borough x y Borough x y
Bermondsey 5.0 46.3 Woolwich 7.0 38.2 Stoke Newington 6.5 30.8
Deptford 2.2 43.4 Stepney 2.5 38.2 Hammersmith 12.2 28.3
Islington 1.5 42.9 Poplar 4.5 37.0 Wandsworth 14.6 23.8
Fulham 4.2 42.2 Southwark 3.1 35.3 Marylebone 23.6 17.8
Shoreditch 1.4 40.0 Camberwell 5.2 33.6 Hampstead 14.8 17.1
Finsbury 2.0 38.8 Paddington 7.2 33.6 Westminster 27.5 10.8

You can check that rxy = −0.92, which indicates a strong negative association between x and y. But you cannot conclude from the very high negative correlation that providing more open space in a borough will cause the number of accidents involving children to fall. Boroughs with a high proportion of parks may have fewer children living there (e.g., Westminster where there are a large number of office blocks), so there will be fewer accidents involving children. Also the boroughs with a high rate of accidents involving children tend to be the poorer boroughs. As most accidents occur in the home, it could be cramped housing conditions which are causing the high rate of accidents involving children, not the lack of open space — see Example 1.1 where approximately half the children died from choking, burns or poisoning, which more open space will do little to prevent. Reasons for associations like this one are usually quite complex. A scatter plot of these data is given below.

2.4 Rank correlation

Another measure of association is Spearman's rank correlation coefficient, usually denoted by rS. It is the same as Pearson's coefficient applied to the ranks of x and y. It therefore has corresponding properties. For example, if rS = 1, then the ranks of x and y lie exactly on a line: in other words, the values of x and y are in the same order. In general, rS is more robust, in that it is not affected by extreme values: it just describes how the orderings of the two variables are related.

A simple formula for calculating rS is

rS = 1 − 6 Σᵢ dᵢ² / (n(n² − 1)),

where dᵢ is the difference between the rank of xᵢ and the rank of yᵢ.

In Example 2.2, the ranks of x and y and their di↵erences are:


Borough rx ry d Borough rx ry d Borough rx ry d
Bermondsey 9 18 -9 Woolwich 12 11.5 0.5 Stoke Newington 11 6 5
Deptford 4 17 -13 Stepney 5 11.5 -6.5 Hammersmith 14 5 9
Islington 2 16 -14 Poplar 8 10 -2 Wandsworth 15 4 11
Fulham 7 15 -8 Southwark 6 9 -3 Marylebone 17 3 14
Shoreditch 1 14 -13 Camberwell 10 7.5 2.5 Hampstead 16 2 14
Finsbury 3 13 -10 Paddington 13 7.5 5.5 Westminster 18 1 17

Here Σᵢ dᵢ² = (−9)² + (−13)² + · · · + (17)² = 1779 and n = 18, so n² − 1 = 323. Hence

rS = 1 − (6 × 1779)/(18 × 323) = −0.84.
Again this indicates a quite strong negative relationship.
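The whole calculation, including the average ranks used for ties (Woolwich and Stepney share rank 11.5), is easy to code; a Python sketch:

```python
def ranks(values):
    # Ranks from 1, with tied values sharing the average of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Example 2.2: % open space (x) and % accidents involving children (y).
x = [5.0, 2.2, 1.5, 4.2, 1.4, 2.0, 7.0, 2.5, 4.5,
     3.1, 5.2, 7.2, 6.5, 12.2, 14.6, 23.6, 14.8, 27.5]
y = [46.3, 43.4, 42.9, 42.2, 40.0, 38.8, 38.2, 38.2, 37.0,
     35.3, 33.6, 33.6, 30.8, 28.3, 23.8, 17.8, 17.1, 10.8]
print(spearman(x, y))   # -0.84
```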

Here are scatter plots of y against x and of rank(y) against rank(x).

[Figure: scatter plots for Example 2.2. Left: % of accidents involving children, y, against % open space, x. Right: rank of y against rank of x.]

You can see from these why the rank correlation rS is not as strong as the ordinary correlation rxy in
this example.

In Example 2.1, the correlation between the log number of species and log area is rxy = 0.86, while
the correlation between number of species and area is only rSA = 0.60, indicating a weaker linear
relation (look at the scatter plots again). On the other hand, the log transformation does not change
the ordering of the numbers, so the rank correlation between S and A is the same as for y and x.
You can check that it is rS = 0.84 which is very close to rxy .

3 The Normal Distribution
Example 3.1 The span measurements (in inches) of 1200 men were as follows.

Mid point  Frequency  Relative frequency      Mid point  Frequency  Relative frequency
58.45 1 0.001 70.45 155 0.129
59.45 2 0.002 71.45 129 0.108
60.45 1 0.001 72.45 103 0.086
61.45 4 0.003 73.45 79 0.066
62.45 6 0.005 74.45 49 0.041
63.45 14 0.012 75.45 30 0.025
64.45 34 0.028 76.45 13 0.011
65.45 66 0.055 77.45 10 0.008
66.45 88 0.073 78.45 7 0.006
67.45 109 0.091 79.45 3 0.003
68.45 136 0.113 80.45 1 0.001
69.45 159 0.132 81.45 1 0.001

Here is a relative frequency histogram for these data:

[Figure: relative frequency histogram of span (inches), 60 to 80, with relative frequencies up to about 0.13 on the vertical axis.]

With the 140 observations of span in Example 1.5 the outline to the relative frequency histogram is
jagged. In Example 3.1 there are 1200 observations, the group widths are narrower and the outline
of the relative frequency histogram is smoother. As we obtain more and more observations we can
approximate the outline of the histogram by a smooth curve called a relative frequency curve,
usually denoted by f (x). In probability theory, f (x) is also called a probability density function.

A variable whose frequency curve has the following mathematical form

f(x) = (1/(σ√(2π))) e^(−((x − µ)²)/(2σ²)),   −∞ < x < ∞

is said to have a normal distribution. This is a symmetric bell-shaped curve. The parameter µ represents the population mean and σ is the population standard deviation.

Examples of frequency curves:

[Figure: three frequency curves: normal; positively skew; symmetric but not normal.]

The relative frequency histogram in Example 3.1 suggests that the span of males may be approximated
by a normal distribution. Many variables are approximately normally distributed. Sometimes skew
data can be transformed (e.g., by taking the square root or logarithm of every observation) in order to
make the resulting values approximately symmetrically distributed and so enable the use of statistical
techniques that assume normally distributed variables.

3.1 Population distribution and parameters

For a sample from a population we may calculate sample statistics such as the sample mean and
sample standard deviation. The corresponding quantities for the population itself are referred to as
population parameters. In order to distinguish between sample statistics and population parameters,
we use Roman letters for sample values and Greek letters for population parameters. In particular, the
sample mean and standard deviation are denoted by x̄ and s and the population mean and standard
deviation by µ and σ. It is important to understand this distinction.

A relative frequency curve f (x) describes the distribution of a variable for a population. Mathemat-
ically, f (x) is defined so that the total area under the curve equals 1.

3.2 Probability

Of the 1200 men whose spans are depicted by the relative frequency histogram in Example 3.1, the
proportion with span less than 65 inches is represented by the area of those bars to the left of 64 in
that histogram. Similarly, among the men in the population, the proportion with span less than 65
inches is the area under the frequency curve to the left of 65, i.e., the shaded area:

65 70 Span

The population relative frequency is called the probability, i.e. the probability that a man has a span less
than 65 inches is the proportion of the population with a span less than 65 inches. In this example
the variable being measured is each man's span. If we denote this variable by X then a convenient
way to write down this probability is P(X < 65). Assume that the variable X (a man's span) has
a population distribution that is normal with mean µ and standard deviation σ. Subtracting µ from
the variable and then dividing by σ gives a new variable whose distribution is normal with mean 0 and
standard deviation 1. This new variable is said to have a standard normal distribution.

Algebraically, the new (standardised) variable is Z = (X − µ)/σ and Z has a standard normal distribution.
Thus:

P(X < c) = P((X − µ)/σ < (c − µ)/σ) = P(Z < (c − µ)/σ) .

Thus the proportion with X < c equals the proportion with Z < (c − µ)/σ, i.e., the two shaded areas below
are the same.

[Figure: frequency curve of X, shaded to the left of c (mean µ), alongside the frequency curve of Z, shaded to the left of (c − µ)/σ (mean 0).]

3.3 Calculation of probabilities for a normal distribution

The shaded area under the standard normal frequency curve can be obtained from Statistical Tables.
Table 2 gives P (Z < z) for values of z from 0 to 3.30. Other probabilities can be calculated from
these.

Example 3.2

1. P (Z < 1.59) = 0.9441


2. P(0.6 < Z < 2.0) = P(Z < 2.0) − P(Z < 0.6)
   = 0.9772 − 0.7257
   = 0.2515

3. P(Z > 1.8) = 1 − P(Z < 1.8)
   = 1 − 0.9641
   = 0.0359

4. P(Z < −0.75) = P(Z > 0.75)
   = 1 − P(Z < 0.75)
   = 1 − 0.7734
   = 0.2266

5. P(−2.31 < Z < 1.65) = P(Z < 1.65) − P(Z < −2.31)
   = P(Z < 1.65) − (1 − P(Z < 2.31))
   = 0.9505 − 1 + 0.9896
   = 0.9401

6. The distribution of hand span X in a population of men is normal with mean 70
   inches and standard deviation 3 inches. What proportion of men have a span less
   than 65 inches?

   P(X < 65) = P(Z < (65 − 70)/3) = P(Z < −1.67)
   = 1 − P(Z < 1.67)
   = 1 − 0.9525
   = 0.0475 .
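These table look-ups can also be checked by computer. As an illustration (not part of the original tables), here is a minimal Python sketch, assuming the scipy package is available; norm.cdf gives P(Z < z):

    # Minimal sketch (assumes Python with scipy): checking Example 3.2.
    from scipy.stats import norm

    print(norm.cdf(1.59))                    # 1. P(Z < 1.59) = 0.9441
    print(norm.cdf(2.0) - norm.cdf(0.6))     # 2. P(0.6 < Z < 2.0) = 0.2515
    print(1 - norm.cdf(1.8))                 # 3. P(Z > 1.8) = 0.0359
    print(norm.cdf(-0.75))                   # 4. P(Z < -0.75) = 0.2266
    print(norm.cdf(1.65) - norm.cdf(-2.31))  # 5. = 0.9401
    print(norm.cdf(65, loc=70, scale=3))     # 6. P(X < 65) = 0.0478
    # Item 6 prints 0.0478 rather than 0.0475 because the hand calculation
    # rounds z to -1.67 before using the table.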

3.4 Percentage points

A percentage point is the value of the variable that cuts off a given tail area. For example, 1.645 is the upper 5
percentage point of Z and −1.645 is the lower 5 percentage point. Tables of percentage points for the
standard normal distribution, giving the values of z corresponding to various shaded areas, are also
available.

[Figure: standard normal curve with area 0.95 to the left of z = 1.645 and area 0.05 to the right.]

Example 3.2 part 6, continued. What span would be exceeded by only 5% of the
population of men? Here we need to find the value of k such that P(X > k) = 0.05, or
equivalently, P(X < k) = 0.95; or equivalently, the value of k such that P(Z < (k − 70)/3) =
0.95.

Hence, (k − 70)/3 = 1.645, so k = 70 + (3 × 1.645) = 74.9 inches.
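Percentage points can equally be found by inverting the normal cdf. A sketch, again assuming scipy (norm.ppf is the inverse of norm.cdf):

    # Sketch: the upper 5% point of the span distribution in Example 3.2.
    from scipy.stats import norm

    print(norm.ppf(0.95))                   # 1.6449, the upper 5% point of Z
    print(norm.ppf(0.95, loc=70, scale=3))  # 74.93 inches, as found above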

Example 3.3 Find the proportions of values from a normal distribution that lie between
µ − kσ and µ + kσ for k = 1, 2, 3. Now

P(µ − kσ < X < µ + kσ) = P((µ − kσ − µ)/σ < Z < (µ + kσ − µ)/σ)
= P(−k < Z < k)
= P(Z < k) − P(Z < −k)
= P(Z < k) − (1 − P(Z < k))
= 2P(Z < k) − 1

Hence

P(µ − σ < X < µ + σ) = 2P(Z < 1) − 1 = 2 × 0.8413 − 1 = 0.6826

P(µ − 2σ < X < µ + 2σ) = 2P(Z < 2) − 1 = 2 × 0.9772 − 1 = 0.9544
P(µ − 3σ < X < µ + 3σ) = 2P(Z < 3) − 1 = 2 × 0.9987 − 1 = 0.9974

This leads to a useful way of interpreting the standard deviation of a variable that is approximately
normally distributed. For such a variable:

approximately 68% of observations will lie between µ − σ and µ + σ,

approximately 95% of observations will lie between µ − 2σ and µ + 2σ, and

approximately 99.7% of observations will lie between µ − 3σ and µ + 3σ.

4 Sampling distributions

What happens when we take random samples from a population and, for each sample, we calculate a
statistic such as the sample mean?

4.1 Theory

1. Consider a Normal population that has mean µ and standard deviation σ. Imagine taking
a random sample of n observations and calculating the sample mean x̄. Imagine doing this
many times. What will the values of x̄ look like? It is found that

• they have mean equal to the population mean µ,

• they have standard deviation equal to σ/√n, and

• they are Normally distributed.

So this distribution will be centered at µ. It will be less spread out for a sample of size n = 100
than for n = 10, say. The larger n is, the smaller is the standard deviation of the distribution of
means, and the closer x̄ is likely to be to µ.

2. Suppose we use the sample mean x̄ to estimate the population mean µ. Then the quantity
σ/√n is called the standard error of x̄. It measures how different x̄ might be from µ, i.e. it
measures the precision (or lack of precision) of the estimate of µ. If the standard error is small,
then x̄ is likely to be close to µ. If it is large, then x̄ might be very different from µ.

3. Suppose we know σ and we find n and x̄ from a sample. Then we can use this result to tell us
something about the population mean µ. The standardised mean

Z = (X̄ − µ)/(σ/√n)

has a standard normal distribution, so it is likely to be between −2 and +2, and very likely
to be between −3 and +3. The probability that Z is between −1.96 and +1.96 is 0.95. So if we
form the interval

X̄ − 1.96σ/√n  to  X̄ + 1.96σ/√n

this will probably include µ (with probability 0.95). That is, if we take many random samples
and form this interval for each, then 95% of them will include µ. This interval is called a 95%
confidence interval for µ.

4. There is another way we use the above result. Suppose there is a hypothesis that the population
mean has some particular value, for example suppose the hypothesis is that µ = 25. Then we
can see if the data “agree” with this hypothesis by calculating the value of z given by
z = (x̄ − 25)/(σ/√n) .

If the hypothesis is correct, z will be from a standard normal distribution. If not, z is more
likely to be further from 0.
We can use the standard normal distribution to calculate the probability of getting a value
further from 0 than z, assuming the hypothesis is correct. This is called the P-value. If the
P-value is very small, this is evidence that the hypothesis is not correct. This procedure is called
a hypothesis test or a significance test.

5. Sometimes we want to calculate a confidence interval to see how closely we can estimate the
population mean, and sometimes we want to do a significance test to see if the data agree with
a hypothesised value of µ. We will look at examples in the next section.

6. Now suppose we do not know the population standard deviation σ. Usually we estimate σ
using the sample standard deviation sx, and instead of Z we consider

T = (X̄ − µ)/(sx/√n) .

Now T does not have a standard normal distribution but a Student's t-distribution with
n − 1 degrees of freedom. We usually abbreviate "degrees of freedom" to "df" or the Greek
letter ν. A t-distribution is still symmetrical, but has longer tails than the Normal, and its shape
depends on n. When n is very large it is very close to a standard normal distribution.
So to find a 95% confidence interval for µ we do the same as before except that we use the
t-distribution. This gives the interval

x̄ − t_{n−1, 0.025} sx/√n  to  x̄ + t_{n−1, 0.025} sx/√n

where the number t_{n−1, 0.025} is the upper 2.5 percentage point of the t-distribution with n − 1
degrees of freedom. We can find t_{n−1, 0.025} from statistical tables (e.g., Table 3) or from computer
programs. It is always bigger than 1.96 and as n gets larger t_{n−1, 0.025} gets nearer to 1.96.
And if we want to test a hypothesis, for example that µ = 25, we calculate

t = (x̄ − 25)/(sx/√n)

and determine the P-value by finding the probability of getting a value further from 0 than t,
using the t-distribution with n − 1 degrees of freedom.

7. Now consider a population that is not Normal. Let µ and σ be the population mean and
standard deviation. If we take random samples of size n and calculate the sample mean x̄ for
each, then the distribution of sample means

• still has mean µ,

• still has standard deviation σ/√n, and

• if n is large, is approximately Normal.

This last result — the fact that the distribution of sample means is approximately Normal,
regardless of the population — is known as the central limit theorem. A consequence of this
is that if we have a large sample from a non-Normal population, we can calculate approximate
confidence intervals for µ, as if the population were Normal.

4.2 Simulation of samples from a Normal population

The table below gives 75 random samples, each of size 4, from a Normal population with mean 100
and standard deviation 15 (think of them as IQ scores). Also given are the sample means, standard
deviations, values of t = 2(x̄ − 100)/sx and 95% confidence intervals x̄ ± 3.182 sx/2 (since n = 4,
sx/√n = sx/2 and t_{3, 0.025} = 3.182).

sample    observations    mean    s.d.    t    confidence interval
1 100 87 122 111 105.00 14.99 0.67 81.15 128.85
2 99 110 122 101 108.00 10.49 1.53 91.31 124.69
3 85 89 98 94 91.50 5.69 -2.99 82.45 100.55
4 98 85 104 76 90.75 12.63 -1.46 70.65 110.85
5 95 120 113 81 102.25 17.65 0.25 74.17 130.33
6 99 120 103 109 107.75 9.14 1.70 93.20 122.30
7 72 134 73 95 93.50 29.01 -0.45 47.34 139.66
8 105 81 61 102 87.25 20.50 -1.24 54.63 119.87
9 100 103 97 117 104.25 8.85 0.96 90.18 118.32
10 118 137 98 130 120.75 17.08 2.43 93.58 147.92
11 100 76 99 92 91.75 11.09 -1.49 74.11 109.39
12 85 120 105 110 105.00 14.72 0.68 81.58 128.42
13 114 91 69 95 92.25 18.46 -0.84 62.87 121.63
14 79 93 83 92 86.75 6.85 -3.87 75.85 97.65 *
15 93 114 129 106 110.50 15.07 1.39 86.53 134.47
16 88 92 104 85 92.25 8.34 -1.86 78.98 105.52
17 114 120 123 112 117.25 5.12 6.73 109.10 125.40 *
18 83 87 105 86 90.25 9.98 -1.95 74.37 106.13
19 102 74 120 67 90.75 24.68 -0.75 51.49 130.01
20 103 79 93 116 97.75 15.65 -0.29 72.85 122.65
21 136 112 97 101 111.50 17.52 1.31 83.62 139.38
22 101 96 128 100 106.25 14.66 0.85 82.93 129.57
23 100 88 110 90 97.00 10.13 -0.59 80.88 113.12
24 111 111 79 107 102.00 15.45 0.26 77.42 126.58
25 83 116 96 107 100.50 14.25 0.07 77.83 123.17
26 67 66 105 87 81.25 18.55 -2.02 51.73 110.77
27 118 101 97 113 107.25 9.88 1.47 91.53 122.97
28 123 95 84 96 99.50 16.58 -0.06 73.12 125.88
29 92 81 105 84 90.50 10.72 -1.77 73.44 107.56
30 106 85 117 66 93.50 22.63 -0.57 57.49 129.51
31 95 81 109 111 99.00 13.95 -0.14 76.80 121.20
32 108 92 92 90 95.50 8.39 -1.07 82.16 108.84
33 140 82 51 86 89.75 36.97 -0.55 30.93 148.57
34 116 93 125 103 109.25 14.10 1.31 86.81 131.69
35 101 97 120 108 106.50 10.08 1.29 90.46 122.54
36 86 106 91 105 97.00 10.03 -0.60 81.04 112.96
37 74 98 102 108 95.50 14.91 -0.60 71.78 119.22
38 105 109 115 107 109.00 4.32 4.17 102.13 115.87 *
39 92 96 124 81 98.25 18.30 -0.19 69.13 127.37
40 122 103 112 112 112.25 7.76 3.16 99.90 124.60
41 107 100 98 91 99.00 6.58 -0.30 88.53 109.47
42 106 71 88 122 96.75 22.08 -0.29 61.62 131.88
43 108 87 108 68 92.75 19.24 -0.75 62.14 123.36
44 101 83 91 89 91.00 7.48 -2.41 79.09 102.91
45 105 76 89 87 89.25 11.95 -1.80 70.23 108.27
46 80 91 103 113 96.75 14.34 -0.45 73.94 119.56
47 64 96 113 91 91.00 20.31 -0.89 58.68 123.32
48 105 84 103 88 95.00 10.55 -0.95 78.21 111.79
49 137 97 101 107 110.50 18.14 1.16 81.64 139.36
50 145 86 94 87 103.00 28.23 0.21 58.09 147.91
51 77 106 87 118 97.00 18.46 -0.33 67.63 126.37
52 119 95 76 93 95.75 17.69 -0.48 67.61 123.89
53 123 101 112 124 115.00 10.80 2.78 97.82 132.18
54 112 89 80 112 98.25 16.30 -0.21 72.32 124.18
55 91 75 114 96 94.00 16.06 -0.75 68.44 119.56
56 113 90 116 124 110.75 14.59 1.47 87.53 133.97
57 65 93 90 112 90.00 19.30 -1.04 59.29 120.71
58 119 107 92 129 111.75 15.95 1.47 86.38 137.12
59 83 107 92 97 94.75 10.01 -1.05 78.82 110.68
60 108 100 111 106 106.25 4.65 2.69 98.86 113.64
61 104 99 121 115 109.75 10.05 1.94 93.77 125.73

62 104 125 104 106 109.75 10.21 1.91 93.51 125.99
63 115 111 90 102 104.50 11.09 0.81 86.85 122.15
64 102 103 115 115 108.75 7.23 2.42 97.25 120.25
65 105 116 89 94 101.00 12.03 0.17 81.86 120.14
66 101 97 111 79 97.00 13.37 -0.45 75.73 118.27
67 122 99 123 67 102.75 26.29 0.21 60.93 144.57
68 63 94 92 101 87.50 16.78 -1.49 60.80 114.20
69 114 119 89 80 100.50 18.95 0.05 70.35 130.65
70 108 97 84 114 100.75 13.20 0.11 79.75 121.75
71 83 96 64 75 79.50 13.48 -3.04 58.06 100.94
72 108 101 106 85 100.00 10.42 0.00 83.41 116.59
73 109 79 96 105 97.25 13.33 -0.41 76.05 118.45
74 134 99 106 106 111.25 15.52 1.45 86.56 135.94
75 99 111 74 82 91.50 16.66 -1.02 64.99 118.01

Here are histograms of the 300 individual scores x, the 75 sample means x̄, the 75 standardised means
z = 2(x̄ − 100)/15 and the 75 t values:

[Figure: four histograms — the individual scores x, the standardised means z, the sample means, and the t values.]

Superimposed on the histograms are the theoretical frequency curves: for x, Normal with mean 100
and standard deviation 15; for x̄, Normal with mean 100 and standard deviation 7.5; for z, standard
normal; and for t a t-distribution with 3 degrees of freedom.

Note that both x and x̄ are centered at the population mean of 100, but the standard deviation of the
distribution of sample means is half that of the original population. Look carefully at the comparison
between z and t. Both are symmetric about 0, but the latter, which is based on a t-distribution, has
longer tails, implying that extreme values are more likely than under the normal distribution.

Look also at the confidence intervals in the table. Just three of these (marked with a *) fail to include
the population mean of 100. The theory says that in the long run 95% of such intervals will include
100, and in our 75 samples 72/75 = 0.96 of intervals do, which is about right.
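A simulation like the one tabulated above is easy to repeat. The following sketch (assuming Python with numpy and scipy; the seed is arbitrary) draws 75 samples of size 4 and counts how many 95% confidence intervals contain the true mean of 100:

    # Sketch: coverage of 95% confidence intervals, as in the table above.
    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(1)     # arbitrary seed
    n, reps, covered = 4, 75, 0
    tcrit = t.ppf(0.975, df=n - 1)     # 3.182, the upper 2.5% point for 3 df
    for _ in range(reps):
        x = rng.normal(loc=100, scale=15, size=n)
        half = tcrit * x.std(ddof=1) / np.sqrt(n)
        if x.mean() - half < 100 < x.mean() + half:
            covered += 1
    print(covered / reps)              # close to 0.95 in the long run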

4.3 Simulation of samples from an Exponential population

A variable whose frequency curve has the following mathematical form

f(x) = λe^(−λx) ,   0 < x < ∞

is said to have an exponential distribution. This distribution is often used to describe the times
between "random" events such as arrivals of telephone calls, accidents, earthquakes, etc. In this context
the parameter λ is called the rate — the average number of events per unit time. The population
mean µ (i.e., the average time between successive events) is equal to 1/λ. The population standard
deviation σ is also equal to 1/λ. Thus if the times between arrivals of telephone calls had an exponential
distribution with rate three per hour, then the average time between calls would be 1/3 of an hour,
i.e., 20 minutes. Also the standard deviation of these times would be 20 minutes. The exponential
frequency curve is very different from a normal curve. It has a mode at 0 and decays exponentially
as x increases. A variable with an exponential distribution is necessarily positive, whereas one with a
Normal distribution has a non-zero probability of being negative.

Here are 75 samples, each of size 9, from an Exponential population with mean 1. The population
standard deviation is also 1. The 75 sample means are also given below.
sample mean
0.49 2.41 1.61 1.37 0.20 0.60 0.09 0.79 3.41 1.22
0.73 0.05 1.82 1.62 0.08 2.76 0.19 0.17 3.50 1.21
0.42 0.35 3.19 2.89 0.88 1.69 0.38 0.21 0.79 1.20
0.52 0.36 3.95 0.44 0.62 0.37 0.15 2.37 1.08 1.09
1.09 0.45 2.56 1.15 0.73 7.62 0.39 0.92 0.41 1.70
0.01 0.04 0.45 0.57 1.54 0.05 1.24 0.77 2.25 0.77
1.15 0.86 0.86 0.46 0.58 1.70 1.20 0.29 0.03 0.79
0.09 0.16 0.61 2.04 0.46 5.02 0.95 0.12 0.78 1.14
2.20 0.84 0.56 0.73 0.15 0.39 0.99 1.90 0.29 0.90
2.29 5.35 0.16 0.40 1.15 1.15 2.45 0.85 1.11 1.66
0.91 0.03 0.64 0.07 0.27 0.94 0.40 0.20 1.51 0.55
2.55 0.47 0.45 0.14 0.25 1.52 0.49 0.40 3.19 1.05
0.14 0.78 0.90 1.59 0.24 0.22 0.70 0.47 0.79 0.65
0.03 0.95 3.03 0.04 2.17 0.49 1.21 0.11 1.53 1.06
0.01 3.88 0.44 1.13 0.26 0.50 0.56 1.06 1.14 1.00
1.09 1.43 1.49 0.20 1.80 0.28 0.15 1.76 0.16 0.93
1.88 4.44 0.44 0.11 0.42 0.52 2.37 0.48 0.46 1.23
7.10 0.48 1.56 3.59 0.65 0.10 1.65 1.80 0.20 1.90
1.01 0.02 1.01 1.85 0.56 0.40 0.13 0.10 1.23 0.70
1.56 2.41 1.76 1.87 0.85 0.15 1.27 1.74 2.69 1.59
0.80 2.00 2.25 1.39 0.90 0.53 0.10 0.27 1.03 1.03
3.65 2.54 0.41 0.24 0.44 0.45 0.63 0.41 0.59 1.04
2.15 2.20 2.02 0.83 1.27 1.03 0.63 1.74 0.24 1.35
0.69 2.14 1.81 1.41 1.83 2.04 0.96 0.43 0.12 1.27
2.11 0.48 2.09 0.62 1.60 0.26 1.20 0.63 2.30 1.25
0.81 0.70 2.37 0.62 0.50 0.94 0.39 1.61 0.48 0.94
1.79 0.02 0.82 0.71 2.22 1.64 2.77 2.19 2.96 1.68
0.64 1.23 0.88 0.53 0.11 4.25 1.70 1.93 1.48 1.42
0.09 0.25 0.15 0.60 0.18 1.56 2.35 1.23 1.13 0.84
4.38 0.12 0.00 4.33 0.15 2.81 0.05 1.65 2.42 1.77
1.37 1.15 2.26 0.06 0.39 0.24 1.78 0.38 1.41 1.00
1.76 0.79 1.69 0.03 0.07 1.36 2.43 1.42 0.91 1.16
0.66 1.00 0.28 0.61 0.37 0.26 0.28 0.37 0.01 0.43
0.61 0.13 0.19 1.13 0.45 0.69 0.11 0.19 0.21 0.41
1.84 0.98 1.84 1.05 1.47 0.03 0.40 0.66 0.28 0.95
0.76 0.40 0.32 0.35 0.33 0.04 0.05 0.98 1.23 0.49
0.05 0.10 0.22 0.37 0.70 0.38 0.15 0.78 1.00 0.42
0.95 1.05 0.29 1.80 0.09 0.29 0.01 0.55 0.28 0.59

3.10 0.24 0.60 1.09 1.94 0.88 0.86 1.79 1.93 1.38
1.21 0.09 0.34 0.79 0.30 1.42 0.30 1.68 0.29 0.71
0.04 0.78 2.05 0.18 1.53 0.37 1.55 1.08 1.47 1.01
1.21 1.38 0.15 1.16 0.94 0.21 0.91 0.57 1.57 0.90
0.26 0.15 3.68 0.56 0.28 0.80 1.19 0.20 3.28 1.16
1.11 1.05 0.27 1.36 0.12 0.18 2.32 2.46 0.80 1.07
0.84 0.28 0.94 1.04 2.76 0.07 1.78 1.06 1.73 1.17
0.10 0.49 0.09 0.62 3.70 0.80 0.13 0.38 1.84 0.91
0.49 0.40 1.09 2.76 0.19 0.41 0.43 1.27 2.94 1.11
0.71 0.07 1.45 1.39 0.92 0.19 0.17 0.43 0.91 0.69
0.17 0.28 0.73 0.24 0.85 2.40 1.14 1.49 0.16 0.83
1.18 0.80 0.33 0.37 0.23 0.48 1.53 0.35 1.14 0.71
0.60 0.83 0.42 1.11 0.87 0.30 0.30 0.95 0.42 0.65
0.50 0.79 2.06 0.34 0.85 3.39 0.73 0.21 1.78 1.18
1.94 0.18 5.54 0.55 0.03 0.47 0.75 0.56 0.14 1.13
0.27 0.39 1.04 0.18 0.15 0.57 0.78 1.34 0.67 0.60
0.38 0.13 1.77 1.82 1.60 1.11 2.79 0.84 2.09 1.39
2.54 0.65 0.56 0.51 1.10 0.39 0.54 5.40 1.37 1.45
0.35 1.47 0.31 0.00 0.48 3.51 1.00 0.04 5.65 1.42
0.30 0.44 1.22 0.23 0.30 4.66 1.81 0.38 3.42 1.42
0.32 2.44 1.24 1.56 2.61 0.86 0.24 0.39 1.59 1.25
0.27 0.43 0.63 0.09 1.20 2.69 1.88 1.40 0.61 1.02
0.78 0.02 1.49 0.47 1.03 1.42 0.79 0.68 0.33 0.78
0.16 0.18 4.08 0.35 0.77 0.56 0.01 0.08 1.11 0.81
0.72 0.14 0.53 2.83 1.83 2.70 3.17 0.54 1.46 1.55
0.24 1.13 0.05 0.92 0.54 0.69 0.01 0.72 1.23 0.61
0.14 0.05 0.28 0.89 1.48 1.80 1.04 2.29 0.32 0.92
1.53 0.27 2.47 0.25 1.77 2.88 0.25 2.54 0.51 1.39
0.68 0.55 0.13 1.16 1.10 2.51 1.86 0.24 0.05 0.92
3.50 0.18 0.27 0.16 0.89 5.73 0.10 0.22 0.45 1.28
0.94 1.65 0.52 2.68 1.04 0.77 1.08 0.95 1.08 1.19
0.82 0.07 2.26 0.91 2.17 0.09 2.30 0.09 0.76 1.05
4.59 0.60 0.38 0.83 0.41 3.46 1.61 0.39 0.10 1.38
3.00 0.34 0.45 0.27 2.45 0.72 0.84 0.72 0.32 1.01
1.22 0.08 0.10 3.47 0.08 0.70 2.14 3.74 0.20 1.30
0.45 1.78 0.12 0.32 3.39 1.27 0.57 1.09 0.72 1.08
0.02 0.44 0.68 0.43 0.02 0.32 0.16 1.38 0.48 0.44

Here is a histogram of the 675 original observations, along with the exponential frequency curve; and a
histogram of the 75 sample means, along with a normal curve with mean µ = 1 and standard deviation
σ/√9 = µ/√9 = 1/3.
[Figure: histogram of the 675 times x with the exponential curve superimposed (left), and histogram of the 75 sample means with the normal curve superimposed (right).]

This is a vivid demonstration of the central limit theorem. The population is very non-normal, but
the distribution of sample means — even for samples of size 9 — is not so di↵erent from a normal
distribution. For a larger sample size, it would be even closer to normal.
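The demonstration can be replicated in a few lines. A sketch, assuming Python with numpy (the seed is arbitrary):

    # Sketch: central limit theorem for an Exponential population with mean 1.
    import numpy as np

    rng = np.random.default_rng(2)                      # arbitrary seed
    samples = rng.exponential(scale=1.0, size=(75, 9))  # 75 samples of size 9
    means = samples.mean(axis=1)
    # The sample means should be centred near mu = 1 with spread near 1/3,
    # and their histogram is already roughly normal in shape.
    print(means.mean(), means.std(ddof=1))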

5 Confidence Intervals and Tests based on the t distribution

In chapter 3 we considered populations that have a normal distribution with a known mean and
standard deviation. Often we know that a population has a normal distribution (or approximately a
normal distribution) but we do not know its mean and standard deviation. Usually it is the population
mean that is of most interest. Typically we have a sample from the population and we wish to either

• obtain an estimate of the population mean, or


• compare the mean of the population with some other value.

We use the sample data to make inferences about the population. The sample should be representative
of the population from which it is drawn and care should be taken to ensure that this is so, for example
by taking a simple random sample (see §1) or possibly a stratified simple random sample.

The size of the sample should be large enough to allow sensible conclusions to be drawn from the data,
but it also should be of a size that can easily be handled by the investigators.

5.1 Estimation of a population mean

5.1.1 Method 1: Quoting the sample mean and standard error

We take a sample of n observations from the population and calculate the sample mean x̄. We think
of x̄ as an estimate of µ. We express this by writing
µ̂ = x̄ ,
where the ‘hat’ means “an estimate of”.

Now if we took another sample of n values, a different x̄ would be obtained and hence a different
estimate of µ. So just quoting x̄ does not give any indication of the possible error involved. However
we do know that the standard deviation of the sampling distribution of x̄ is σ/√n. If this standard
deviation is small then most of the values of x̄ are close to µ and we are confident that the particular
x̄ from our sample will be close to µ. If this standard deviation is large then some x̄'s will be a long
way from µ and our particular x̄ might be one of these. Thus knowing σ/√n will give us some idea
about how close our x̄ is likely to be to µ. The quantity σ/√n is called the standard error of our
estimate of µ, and is denoted by se(µ̂). Thus in this case

se(µ̂) = σ/√n .

However, in most situations σ is not known and we estimate it using the sample standard deviation
s. The estimated standard error of µ̂ is thus s/√n.

Note: many text books use the term "standard error" to refer to the estimated standard error, and
write

se(µ̂) = s/√n .

This is not entirely satisfactory, but is acceptable for the present course provided that we understand
that for small samples s/√n may be rather different from the correct value σ/√n. In particular, the
use of the t distribution (below) will allow for this difference.

In general, the standard error of an estimate of a parameter is the standard deviation of the sampling
distribution of estimates of that parameter (when you imagine taking repeated samples of size n).

Example 5.1 For the blood pressure data in Example 1.3, n = 21, x̄ = 128.52 and
s = 14.31. Let µ be the population mean systolic blood pressure for women attending
keep fit classes. An estimate of µ is µ̂ = 128.52 mm Hg with estimated standard error
se(µ̂) = 14.31/√21 = 3.12 mm Hg.

5.1.2 Method 2: Confidence intervals

In the above method of estimation we quoted an estimate of µ and its standard error, to express the
“uncertainty” of our estimate. An alternative method is to find an interval in which we expect µ
to lie, that is, we calculate an interval of the form (x̄ − b, x̄ + b) which is likely to contain µ. It is
conventional, for most purposes, to calculate a 95% confidence interval; that is, to choose b so that
if repeated samples were taken, 95% of the intervals (x̄ − b, x̄ + b) would include µ.

Let t_{n−1, 0.025} be the percentage point of a t distribution with n − 1 df for an upper tail area of p = 0.025.
These values are given in Table 3 under the 0.025 column with ν = n − 1. Then

P(−t_{n−1, 0.025} < (X̄ − µ)/(s/√n) < t_{n−1, 0.025}) = 0.95

which can be rearranged to

P(X̄ − t_{n−1, 0.025} s/√n < µ < X̄ + t_{n−1, 0.025} s/√n) = 0.95 .

This statement says that if we take many samples of size n from the population and calculate the
interval x̄ − t_{n−1, 0.025} s/√n to x̄ + t_{n−1, 0.025} s/√n for each sample, then 95% of the intervals would
contain the population mean µ. We actually observe just one sample of size n and calculate just one
interval, so (in this rather indirect sense) we are 95% confident that our particular interval is one of
those that contains µ.

Notes

• The confidence interval can also be written as

(µ̂ − t_{n−1,p} se(µ̂) , µ̂ + t_{n−1,p} se(µ̂))

which is often abbreviated to µ̂ ± t_{n−1,p} se(µ̂). For a 95% confidence interval, p = 0.025 and it
can be seen from Table 3 that t_{n−1,p} is about 2 when n is large and is a bit greater than 2 for
smaller n. So the 95% confidence interval contains values within about 2 (or a bit more than
2) standard errors of µ̂.

• If we wish to be even more confident that our interval contains µ, we could use a higher confidence
level than 95%, e.g. 99%. In this case we use the above formula with upper percentage point
p = 0.005, so that the value of t_{n−1,p} is that in Table 3 under the 0.005 column with ν = n − 1.
The resulting interval is wider and more likely to contain µ. In practice it is a very common
convention to use a confidence level of 95%.

Example 5.1 continued. From the dotplot in §1 the sample data are fairly symmetric
and it looks reasonable to assume that they come from a normal distribution. We will
calculate 95% and 99% confidence intervals for µ, the population mean systolic blood
pressure.

We have n = 21 so ν = 20. For a 95% confidence interval we use t_{20, 0.025} = 2.086 from
Table 3 to get

128.52 ± 2.086 × 3.12 = 128.52 ± 6.508 = (122.0, 135.0) mm Hg.

We are 95% confident that the population mean systolic blood pressure is between 122.0
and 135.0 mm Hg. For a 99% confidence interval we would get t_{20, 0.005} = 2.845 and hence

128.52 ± 2.845 × 3.12 = 128.52 ± 8.876 = (119.6, 137.4) mm Hg.

We are 99% confident that the population mean systolic blood pressure is between 119.6
and 137.4 mm Hg.
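The same intervals can be computed from the summary statistics alone. A minimal sketch, assuming Python with scipy (t.ppf gives the percentage points used above):

    # Sketch: 95% and 99% t confidence intervals for Example 5.1.
    from math import sqrt
    from scipy.stats import t

    n, xbar, s = 21, 128.52, 14.31
    se = s / sqrt(n)                                  # 3.12 mm Hg
    for level in (0.95, 0.99):
        tcrit = t.ppf(1 - (1 - level) / 2, df=n - 1)  # 2.086 and 2.845
        print(level, (xbar - tcrit * se, xbar + tcrit * se))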

5.2 One sample t tests

In §5.1 we were interested in estimating a population mean µ from a sample of n observations. We
used two types of uncertain inference: (a) a point estimate µ̂ and its standard error se(µ̂), and
(b) a confidence interval for µ.

In this section we consider another type of inference: a hypothesis test. We have a hypothesized
value of µ, denoted by µ0 , say, and we ask the question: do the data agree with the hypothesis?

Example 5.2 Extensive data collected during the first half of this century showed that in
those years, Japanese children born in America grew faster than did Japanese children born
in Japan. The population mean height of 11-year-old Japan-born Japanese boys is known
to be 139.7 cm. In order to investigate whether improved economic and environmental
conditions in postwar Japan had narrowed this gap, a large sample of Japanese children
born in Hawaii was obtained, and the children were categorised with respect to age. There
were 13 eleven-year-old boys in the sample. Their heights (in cm) were as follows:
138, 146, 148, 151, 140, 149, 143, 155, 147, 146, 160, 145, 134.

We will test the hypothesis that the population mean height of Hawaii-born Japanese boys
is 139.7 cm (i.e., the same as that of Japan-born Japanese boys).

The hypothesis that we test is called a null hypothesis and is denoted by H0 . In general a null
hypothesis represents “no change” — here it asserts that the population mean height of eleven-year-
old Japanese boys born in Hawaii is the same as that for eleven-year-old Japanese boys born in Japan.
In mathematical notation the null hypothesis is

H0 : µ = µ0

where in this case, µ0 = 139.7 cm.

The logical alternative to H0 is called the alternative hypothesis and is denoted by H1 . Here our
alternative hypothesis is

H1 : µ ≠ µ0 .

Note that both H0 and H1 are statements about the population mean (not about the sample mean).
Usually H0 is a precise statement while H1 is vague.

We are going to use a one sample t-test. We make the assumption that our data are a random
sample from a normal distribution and we calculate the t-statistic:

t = (x̄ − µ0)/(s/√n) .

If the null hypothesis is true, then t will be a random value from the t-distribution with n − 1 df. Thus
if H0 is true, t should be reasonably close to zero, while if H1 is true, t is likely to be further away
from zero — either greater than zero (if µ > µ0 ) or less than zero (if µ < µ0 ).

Example 5.2 continued Here is a dotplot of the heights of the 13 Hawaii-born eleven-
year-old Japanese boys:

*
* * * * * * * * * * * *
---+---------+---------+---------+---------+---------+---------------
135 140 145 150 155 160 Height in cm

The dotplot is fairly symmetric, and experience suggests that heights in a homogeneous
group of individuals are approximately normally distributed. Note, though, that the
heights are only measured to the nearest cm, so that our data are really discrete. Neverthe-
less it should be reasonable to treat them as a random sample from a normal distribution.
We may calculate n = 13, x̄ = 146.31, s = 6.88, and hence

t = (146.31 − 139.7)/(6.88/√13) = 3.46 .

Now, as a random value from the t-distribution with 12 df, the value of 3.46 is rather
extreme: the chance of getting a value greater than this is less than 0.005 (see Table 3) and
is in fact about 0.003. The chance of getting a value more extreme (i.e., greater than 3.46
or less than −3.46) is therefore only 2 × 0.003 = 0.006. We therefore say there is evidence
that H0 is not true — and that the population mean height of eleven-year-old Japanese
boys born in Hawaii is greater than 139.7 cm.

In general, if our value of t is a typical value from the t-distribution with n − 1 df we say that our data
are consistent with the null hypothesis. If t is not such a typical value, but is too extreme, we regard
this as evidence that H0 is not true. Specifically, we calculate a quantity called the P-value, which is

the probability of getting a value of T as extreme as, or more extreme than, our observed
value t, assuming that H0 is true,

where T has a t-distribution with n − 1 df.

In Example 5.2, the P-value is

P = P(T > 3.46 or T < −3.46) = 2 P(T > 3.46) ≈ 0.006 .
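In practice the whole test is a one-liner in statistical software. A sketch using scipy's one sample t test (an illustration — any statistics package offers the equivalent):

    # Sketch: one sample t test for Example 5.2.
    from scipy.stats import ttest_1samp

    heights = [138, 146, 148, 151, 140, 149, 143, 155, 147, 146, 160, 145, 134]
    tstat, pvalue = ttest_1samp(heights, popmean=139.7)
    print(tstat, pvalue)  # t = 3.46 and a two-sided P-value of about 0.005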

5.3 Interpretation of P-values

A P-value is the probability of getting a value of the test statistic that is as extreme as (or more
extreme than) the observed value if H0 is true. A small value of P is therefore regarded as evidence
that H0 is not true: the smaller P is, the stronger the evidence is against H0 . As a guide, the common
convention is:

If P < 0.01         there is strong evidence against H0
If 0.01 < P < 0.05  there is fairly strong evidence against H0
If P > 0.05         there is little or no evidence against H0 , or
                    the data are consistent with H0

Thus, in Example 5.2 we found P ≈ 0.006, so there is strong evidence against H0 , i.e., strong evidence
that the mean height of postwar 11-year-old Hawaii-born Japanese boys differs from that of Japan-born
Japanese boys. In fact, the sample mean for the Hawaii-born boys is greater than 139.7 cm, so
the evidence is that the population mean for Hawaii-born boys is greater than for Japan-born boys.

Notes:

• A P-value measures evidence against the null hypothesis. A large P-value (such as 0.8, say) does
not necessarily imply that H0 is true, because data can be consistent with H0 and at the same
time be consistent with other hypotheses.

• A small P-value does not mean that H0 can’t be true, because it is possible (though unlikely)
that extreme data may occur by chance, even when H0 is true.

• The above guidelines are not hard and fast rules. For example a P-value of 0.06 means much
the same as one of 0.04, even though one of these is less than 0.05 and the other is not.

• What do we mean by "more extreme"? In general, we decide this by thinking about the alter-
native hypothesis H1 . In Example 5.2, H1 says that µ is either greater than 139.7 or less than
139.7. In the first case, we would expect t to be greater than 0 and in the second case, to be less
than 0. Thus, a more extreme value of t than 3.46 would be one that is further from zero than
3.46.

• Sometimes it may not be possible for µ to be less than the hypothesised value µ0 . Then our
alternative hypothesis would be H1 : µ > µ0 ; “more extreme” values of t would just be values
greater than the observed value; and the P-value would be P (T > t) rather than twice this. This
is called a one-sided test. The usual case, where the P-value equals 2P (T > t) is a two-sided
test. One-sided tests are appropriate only rarely.

• A hypothesis test is a rather limited form of inference. Very often we will also wish to make an
estimate, e.g., using a confidence interval.

For Example 5.2 a 95% confidence interval for µ is

146.3 ± 2.179 × 6.88/√13 = 146.3 ± 4.158 = (142.1, 150.5)
Thus the population mean for the Hawaii-born boys is estimated to be between 142.1 cm and 150.5 cm.
Note that this interval does not include 139.7 cm, the population mean for the Japan-born boys. In
general, if the P-value is greater than 0.05, the 95% confidence interval will include the hypothesised
mean µ0 .

Example 5.3 The mean systolic blood pressure for white males aged 35-44 is 127.2 mm Hg.
The systolic blood pressures (in mm Hg) for a sample of 45 diabetic males aged 35-44 were
as follows.
135 138 149 132 136 136 127 132 128 126
117 136 136 142 135 133 130 131 140 130
140 127 127 124 123 121 131 129 136 125
142 127 127 123 128 131 127 138 137 124
125 133 129 128 133

The researchers were interested in determining whether the mean systolic blood pressure
of 35-44-year-old diabetic males differed from that of 35-44-year-old males in the general
population.

Let µ be the population mean systolic blood pressure for diabetic males aged 35-44. The null and
alternative hypotheses are

H0 : µ = 127.2  and  H1 : µ ≠ 127.2 .

Again we assume that we have a random sample from a normal population. The sample statistics are
n = 45, x̄ = 131.20, s = 6.3661 and hence

t = (131.20 − 127.2)/(6.3661/√45) = 4.21 .

The P-value is thus P(T > 4.21) + P(T < −4.21) = 2 P(T > 4.21), where T has a t-distribution
with 44 df. From Table 3 we can see that P(T > 4.21) < 0.0005 and so P < 2 × 0.0005 = 0.001.
Thus there is very strong evidence against H0 , suggesting that 35-44-year-old diabetics have a higher
systolic blood pressure, on average, than 35-44-year-old men in the general population.

A 95% confidence interval for µ is

131.2 ± 2.015 × 6.3661/√45 = 131.2 ± 1.912 = (129.3, 133.1)

so the mean systolic blood pressure for 35-44-year-old diabetic men is estimated to be between
129.3 mm Hg and 133.1 mm Hg.

5.4 Procedure for hypothesis tests

A good procedure to be followed in performing hypothesis tests is:

1. Set up the null and alternative hypotheses, defining any notation you use.

2. State any assumptions you are making and, if possible, check whether they are reasonable.

3. Calculate the test statistic (t in this section).

4. Obtain the P-value.

5. Interpret the P-value.

6. Write a one sentence conclusion.

5.5 Relation between confidence intervals and hypothesis tests

Suppose we test a null hypothesis H0 : µ = µ0 and find that the P-value is greater than 0.05. Then
the 95% confidence interval for µ will include the hypothesised value µ0 . If P < 0.05 then the 95%
confidence interval will not include µ0 . In other words:

the 95% confidence interval for µ consists of all hypothesised values µ0 for which the P-value
is greater than 0.05.

Thus, if you calculate a 95% confidence interval that does not include a µ0 of interest, then you can
infer that the P-value will be less than 0.05. Likewise, if a 99% confidence interval does not include µ0 ,
the P-value will be less than 0.01. Note, though, that there is a logical distinction between estimation
and hypothesis testing.

5.6 Two sample t tests and confidence intervals

5.6.1 Matched pairs t test

Example 5.4 The following data give the pH reading for the surface soil and subsoil of 13
areas of acid soil. Test whether the average pH differs between the surface soil and subsoil.

Topsoil pH Subsoil pH Difference Topsoil pH Subsoil pH Difference


6.57 8.34 -1.77 5.49 7.90 -2.41
6.77 6.13 0.64 5.56 5.20 0.36
6.53 6.32 0.21 5.32 5.32 0.00
6.71 8.30 -1.59 5.92 6.21 -0.29
6.72 8.44 -1.72 6.55 5.66 0.89
6.01 6.80 -0.79 6.93 5.66 1.27
4.99 5.42 -0.43

In this example we have, for each area, a pair of observations that are not independent. However, we
assume that the two observations on an area are independent of the two observations on any other
area. We test for the difference, on average, between the surface soil pH and the subsoil pH by first
calculating the difference between the two readings for each area. These differences are assumed to
be a random sample from a normal distribution.
Here is a dot plot of the differences:

* *** * ** * * * * * *
---+---------+---------+---------+---------+--------------
-3 -2 -1 0 1 Difference in pH

The dotplot is fairly symmetric and the assumption looks reasonable.

Let µ be the population mean of the difference between the surface soil pH and the subsoil pH. We
test the null hypothesis

H0 : µ = 0

against the alternative H1 : µ ≠ 0. Note that the mean of the differences between surface and subsoil
pH equals the difference between the means, so we may interpret µ as either of these.

The test is exactly the same as for the one sample t test except that the observations are now the
differences. The calculated statistics are:

n = 13,  x̄ = −0.4331,  s = 1.1500,  t = −0.4331/(1.1500/√13) = −1.36 .

The P-value is P(T < −1.36) + P(T > 1.36) = 2 P(T > 1.36) where T has a t-distribution with 12 df.
From Table 3, P ≈ 2 × 0.10 = 0.20. The P-value is greater than 0.05 and the data are consistent with
H0 . We conclude that there is no evidence that the average pH for subsoil differs from that of surface
soil.
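For comparison, here is a sketch of the same matched pairs test in Python (scipy assumed); ttest_rel works on the paired readings and is equivalent to a one sample t test on the differences:

    # Sketch: matched pairs t test for the soil pH data of Example 5.4.
    from scipy.stats import ttest_rel

    topsoil = [6.57, 6.77, 6.53, 6.71, 6.72, 6.01, 4.99,
               5.49, 5.56, 5.32, 5.92, 6.55, 6.93]
    subsoil = [8.34, 6.13, 6.32, 8.30, 8.44, 6.80, 5.42,
               7.90, 5.20, 5.32, 6.21, 5.66, 5.66]
    tstat, pvalue = ttest_rel(topsoil, subsoil)
    print(tstat, pvalue)  # t = -1.36, P = 0.20, as found above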

5.6.2 Two sample t test

Example 5.5 As part of a study on job satisfaction reported in the Journal of Library
Administration (1984) samples of 13 male and 11 female employees of a University Library
were asked to complete a job satisfaction questionnaire. The results were as follows, with
the higher the score the greater the job satisfaction.

Male 67 65 65 84 92 95 82 76 78 80 60 74 77
Female 78 67 72 67 65 48 81 63 91 71 78

Do the data suggest that male and female university librarians differ in their mean score
on the job satisfaction questionnaire?

In this example we have two unrelated samples from two different populations. Let µx and µy be the
population mean job satisfaction scores for male and female university library employees. We test

H0 : µx = µy

against the alternative H1 : µx ≠ µy . We can also express the null hypothesis as H0 : µx − µy = 0.
Thus intuitively we can imagine testing whether the difference in population means is zero, and then
go on and estimate this difference.

We assume that the two samples come from normal populations that have the same standard deviation
σ (where σ is unknown), though their means may be different. We also assume that all the observations
are independent.
Here is a dotplot of the two samples:

*
* * * * *** * * * * * males

*
* * * * ** * * * females

---+---------+---------+---------+---------+---------+-
50 60 70 80 90 100 job satisfaction score

These look reasonably like two samples from normal populations with the same standard deviation.

In general, let nx , x̄, sx and ny , ȳ, sy be the sample sizes, means and standard deviations for the two
samples. We use the difference in sample means x̄ − ȳ to estimate µx − µy , that is:

µ̂x − µ̂y = x̄ − ȳ .

Under the above assumptions it can be shown that the standard error of this estimate is

se(µ̂x − µ̂y) = σ√(1/nx + 1/ny) .

An estimate of σ², the common variance of the two populations, is given by

s²p = ((nx − 1)s²x + (ny − 1)s²y) / ((nx − 1) + (ny − 1)) .

This is a weighted average of the two sample variances, with weights equal to their degrees of freedom.
The square root of this quantity, sp , is called the pooled standard deviation and is used to estimate
σ, the common standard deviation of the two populations. Thus the estimated standard error of µ̂x − µ̂y
is

se(µ̂x − µ̂y) = sp √(1/nx + 1/ny)

and the t-statistic is

t = (x̄ − ȳ) / (sp √(1/nx + 1/ny)) .

It can be shown that if H0 is true, this will be a random value from the t-distribution with ν =
(nx − 1) + (ny − 1) df. The further t is from 0, the stronger the evidence is against H0 , and we can use
the t-distribution to calculate the P-value.

Example 5.5 continued The summary statistics are:

Male nx = 13 x̄ = 76.538 sx = 10.477


Female ny = 11 ȳ = 71.000 sy = 11.225

The pooled standard deviation is

sp = √((12 × 10.477² + 10 × 11.225²) / (12 + 10)) = 10.838

and the estimated standard error of µ̂x − µ̂y is

10.838 × √(1/13 + 1/11) = 4.434 .

The t-statistic is thus

t = (76.538 − 71.000) / 4.434 = 5.538 / 4.434 = 1.25

and from Table 3 with ν = 12 + 10 = 22 we see that P ≈ 2 × 0.11 = 0.22. Thus the data
are consistent with H0 , and we conclude that there is no evidence that the average job
satisfaction score differs between male and female university librarians.
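A sketch of the same two sample test in Python (scipy assumed); ttest_ind pools the two sample variances by default, matching the calculation above:

    # Sketch: pooled two sample t test for Example 5.5.
    from scipy.stats import ttest_ind

    male = [67, 65, 65, 84, 92, 95, 82, 76, 78, 80, 60, 74, 77]
    female = [78, 67, 72, 67, 65, 48, 81, 63, 91, 71, 78]
    tstat, pvalue = ttest_ind(male, female)  # equal_var=True is the default
    print(tstat, pvalue)  # t = 1.25, P = 0.22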

5.7 Confidence intervals for the difference between two population means

Using the same general method as before, a 95% confidence interval for µx − µy is given by

x̄ − ȳ − t_{nx+ny−2, 0.025} sp √(1/nx + 1/ny)  to  x̄ − ȳ + t_{nx+ny−2, 0.025} sp √(1/nx + 1/ny) ,

where t_{nx+ny−2, 0.025} is the upper 0.025 percentage point of the t-distribution with (nx − 1) + (ny − 1)
df.

Example 5.5 continued From Table 3, for ν = 22 we find that t_{22, 0.025} = 2.074.
So a 95% confidence interval for µx − µy is

5.538 ± 2.074 × 4.434 = 5.538 ± 9.196 = (−3.7, 14.7) .

Thus the difference in population mean job satisfaction score is estimated to be between
−3.7 and 14.7. Note that this interval includes 0, which agrees with our conclusion from
the hypothesis test that the data are consistent with µx − µy = 0.

6 Two Sample Non Parametric Tests

Sometimes it may not be reasonable to regard the data as random samples from normal populations.
For example, the data may be too skew, discrete or ordinal. The non-parametric tests in this chapter make rather
different types of assumptions to test hypotheses of interest, but they do not easily lend themselves
to estimating population parameters. They are best explained by considering examples.

The tests will use ranks of the observations, rather than the observations themselves. In a sample of
n distinct values, the smallest has rank 1, the next smallest has rank 2, and so on up to the largest,
which has rank n. Here is an example, for a sample of n = 5 numbers:

xi         4.2   0.6   2.1   6.3   3.4
rank(xi)    4     1     2     5     3

Note that the ranks will always consist of the numbers 1, 2, . . . , n in some order. If in the above
example x4 was 106.3 rather than 6.3, the ranks would be unchanged. Only the ordering of the
numbers affects their ranks.

6.1 Wilcoxon signed rank test

This is a non-parametric test for paired data.

Example 6.1 Eight volunteers are asked to transfer peas from one dish to another using
a straw before and after they have consumed two pints of beer. The numbers of peas
transferred in two minutes were as follows.
Volunteer 1 2 3 4 5 6 7 8
Before 32 54 22 43 40 37 16 48
After 26 38 29 15 42 29 17 34
Difference 6 16 -7 28 -2 8 -1 14

Do the results suggest that alcohol consumption changes the level of performance?

As with the matched pairs t-test we calculate the differences between the two test results for each
volunteer. This is essentially because we are not interested in how many peas each person can transfer
— some people will be quicker at doing this than others — but in how this number may differ between
the two conditions. Our null hypothesis is that the general level of performance is the same under
either condition (before or after consuming alcohol). We interpret this as saying that the two numbers
for each person are two random values from the same distribution (though the distribution may differ
between people).

6.1.1 Procedure for calculating the test statistic and P-value

1. If any of the differences are 0 ignore them and reduce the sample size by the number of zero
differences.

2. Ignore the signs of the differences and replace each difference by its rank (i.e., calculate the ranks
of the absolute values of the differences). If two or more differences have the same absolute value give
each of them the average of the ranks for those differences.

3. Give each rank the sign of the difference corresponding to it.

4. Calculate the test statistic w, defined as follows. Let t+ and t− be the sums of the positive and
negative ranks respectively. Then w is the smaller of t+ and t−.

5. Find the approximate P-value, or a range in which it lies, from Table 4 (or otherwise).

For the data in Example 6.1 we have

Differences (B-A) 6 16 -7 28 -2 8 -1 14
Ranks of absolute differences 3 7 4 8 2 5 1 6
Sign of difference + + - + - + - +

Now t+ = 3 + 7 + 8 + 5 + 6 = 29 and t− = 4 + 2 + 1 = 7, so w = t− = 7. If H0 were true it would be
as if each of the ranks 1 to 8 was given a + or − sign with equal probability. Note that the smallest
that t− can possibly be is 0 (when all signs are +) and the largest that t− can be is ½n(n + 1) = 36
(when all signs are −). Similarly for t+. So if H0 is true we would expect t+ and t− both to be about
half way between these values, i.e., 18. The closer these are to 0 or 36 (i.e., the closer w is to 0) the
stronger the evidence against H0 .
From Table 4, the row of percentage points for n = 8 is

p 0.05 0.025 0.01 0.005


Wp 5 3 1 0

Thus if we observed w = 4 the tail probability would be between 0.05 and 0.025, so the P-value would
be between 0.10 and 0.05. (As usual, we double the tail probability to get the P-value.) In our case we
observed w = 7, which is greater than 5, so the tail probability is greater than 0.05 and so P > 0.10.

Thus the data are consistent with H0 and we conclude that there is no evidence that the performance
level changes after drinking the beer.
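The same test is available in software. A sketch using scipy (an illustration); for a two-sided test the statistic scipy reports is the smaller of t+ and t−, i.e. w:

    # Sketch: Wilcoxon signed rank test for Example 6.1.
    from scipy.stats import wilcoxon

    before = [32, 54, 22, 43, 40, 37, 16, 48]
    after  = [26, 38, 29, 15, 42, 29, 17, 34]
    w, pvalue = wilcoxon(before, after)
    print(w, pvalue)  # w = 7; exact two-sided P is about 0.15, so P > 0.10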

Here is an example with ties and a zero di↵erence:

Example 6.2 The cinema attendance (in millions) for the 13 regions of Great Britain in
1994 and 1995 were as follows. (Source: Regional Trends 1996)
Region 1994 1995 Difference Rank of abs Sign
difference
Yorkshire 10.4 9.7 0.7 6.5 +
North East 6.3 5.8 0.5 5 +
Midlands 18.4 16.8 1.6 11 +
Anglia 6.6 6.2 0.4 3.5 +
London 33.7 31.3 2.4 12 +
Southern 9.7 8.7 1.0 8.5 +
South West 2.0 1.7 0.3 1.5 +
Lancashire 16.6 15.1 1.5 10 +
HTV 5.8 6.8 -1.0 8.5 -
Border 0.6 0.6 0.0 *
Central Scotland 8.2 7.5 0.7 6.5 +
Northern Scotland 1.9 1.5 0.4 3.5 +
Northern Ireland 3.8 3.5 0.3 1.5 +

Test whether there is any evidence that cinema attendance in Great Britain differed be-
tween the two years.

It is interesting to consider what the relevant “populations” are in this example. Nevertheless, let us
apply Wilcoxon’s signed ranks test.

There is one zero difference, which we omit, so the sample size becomes n = 13 − 1 = 12. Also we
average the relevant ranks where there are ties: so, for example, the two smallest differences both
equal 0.3, so these each get rank 1.5 (instead of 1 and 2). There is only one negative difference (all
attendances went down except for HTV) so w = t− = 8.5. The relevant row of Table 4 (n = 12) is

p 0.05 0.025 0.01 0.005


Wp 17 13 9 7

w lies between 7 and 9, so the tail probability is between 0.005 and 0.01. Hence P is between 0.01 and
0.02, which represents fairly strong evidence against H0 . We conclude that cinema attendance went
down in 1995.

6.2 Mann-Whitney two sample test

This is a non parametric test to compare two independent samples.

Example 6.3 A group of 7 people all over 50 years of age and another group of 6 people
all under 30 years of age had their conduction velocity measured. This was done by
measuring both the time taken for the signal resulting from a standardized knock on the
Achilles tendon to travel up the relevant nerve to the spinal cord and then back down
again on another nerve to make the muscle twitch, in milliseconds, and the distance the
nerve impulse traveled. The conduction velocity is the distance traveled divided by the
time taken and was measured in meters per second. The results were as follows.
Older group 37.7 40.0 42.8 38.2 37.4 33.4 44.7
Younger group 45.9 53.9 40.0 43.7 41.3 44.6

Is there evidence that reactions (as measured by conduction velocity) tend to differ between
the age groups?

In this example we have unrelated samples from two different populations. We will test the null
hypothesis that the two populations are identical.

6.2.1 Procedure for calculating the test statistic and P-value

1. Pool all observations into one sample and arrange them in order of size.

2. Write down the ranks of the observations. If two or more observations have the same value give
each of them the average of the ranks for those observations.

3. Write down which of the two samples each observation comes from.

4. Calculate the test statistic u as follows. Let nx and ny be the two sample sizes, let rx and ry
be the sums of the ranks for each of the samples, and let

ux = rx − ½nx(nx + 1)  and  uy = ry − ½ny(ny + 1) .

Then u is the smaller of ux and uy .

5. Percentage points of the distribution of u when H0 is true are given in Table 5, from which the
approximate P-value (or a range in which it lies) can be deduced.

For the data in Example 6.3, this gives:

ordered data: 33.4 37.4 37.7 38.2 40.0 40.0 41.3 42.8 43.7 44.6 44.7 45.9 53.9
rank: 1 2 3 4 5.5 5.5 7 8 9 10 11 12 13
sample: x x x x x y y x y y x y y

rx = 1 + 2 + 3 + 4 + 5.5 + 8 + 11 = 34.5 ,  ux = 34.5 − ½ × 7 × 8 = 6.5
ry = 5.5 + 7 + 9 + 10 + 12 + 13 = 56.5 ,  uy = 56.5 − ½ × 6 × 7 = 35.5 .

Hence u = 6.5, the smaller of 6.5 and 35.5.

Note that the smallest that ux could possibly be is 0 (when all x values are less than the smallest y
value) and the largest that ux could be is 42 (when all x values are greater than the largest y value).
So if H0 is true we would expect ux to be in the middle of this range, i.e., 21. (In general, the expected
value of ux is ½nx ny .) Similarly for uy . Thus the closer u is to 0, the more evidence there is against
H0 .

We can find a range in which the P-value lies from Table 5, which gives percentage points of u under
H0 , for various n1 and n2 , where n1 is the smaller of nx and ny , and n2 is the larger of nx and ny . In
our case, n1 = 6 and n2 = 7 and the relevant row of Table 5 is

p 0.05 0.025 0.01 0.005


up 8 6 4 3

Our observed u, 6.5, is between 6 and 8 so the one-sided tail probability is between 0.025 and 0.05.
Hence, for a two-sided test, 0.05 < P < 0.10, which is only weak evidence against H0 . We conclude
that there is not sufficient evidence that reactions differ for the two populations, though this is probably
because we only have very small samples. Note that here, as in general, it is fallacious to conclude
that H0 is true.
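A sketch of the Mann-Whitney test in scipy (an illustration). The statistic reported is ux; because of the tied 40.0 values the P-value here comes from a normal approximation, so it can disagree slightly with the exact range read from Table 5:

    # Sketch: Mann-Whitney test for the conduction velocity data of Example 6.3.
    from scipy.stats import mannwhitneyu

    older = [37.7, 40.0, 42.8, 38.2, 37.4, 33.4, 44.7]
    younger = [45.9, 53.9, 40.0, 43.7, 41.3, 44.6]
    u, pvalue = mannwhitneyu(older, younger, alternative="two-sided")
    print(u, pvalue)  # u = 6.5, matching ux above; approximate P close to 0.05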

7 Probability and Binomial and Poisson Distributions

7.1 Idea of probability

In Chapter 3 we referred to probabilities of events such as a < X ≤ b, where the random variable X
is continuous and has a normal distribution. Thus, if X represents height (in inches) in a population
of men, then P(66 < X ≤ 72) denotes the proportion of men in the population whose heights are
between 66 and 72 inches. If a man is to be chosen at random from the population, the probability
that his height will be between 66 and 72 inches is P(66 < X ≤ 72).

Now we are going to consider probabilities for general events and discrete variables (i.e., those that
can only take a discrete set of values). An event is a set of values, or outcomes, in which we are
interested. The probability of an event A is denoted P (A) and is a number on a scale from 0 to 1
where

P(A) = 0 means that A is impossible,

P(A) = 1 means that A is certain,

P(A) = 1/2 means that A is equally likely to happen as not to happen.

All probabilities are either 0, 1 or a number between 0 and 1.

Example 7.1 If we imagine rolling a fair die 'at random' we would expect each of the
six faces to have the same chance of falling uppermost. Indeed, if we rolled the die a
large number of times we would observe that the proportion of times a 6 falls uppermost
converges to 1/6 as we increase the number of tosses. Thus the probability of obtaining a 6
is 1/6. Here the event of interest is "a 6 falls uppermost".

We can imagine a "population" consisting of the possible outcomes 1, 2, 3, 4, 5, 6 and an experiment
of choosing one of these outcomes at random. If this experiment is repeated many times, then in the
long run the outcome 6 will occur in one sixth of these experiments. In this sense the probability
that a six falls uppermost corresponds to the long run proportion of times this would happen if the
experiment were repeated many times.

An example of an impossible event is "a 7 falls uppermost" as there is no 7 face. This event has
probability 0. An example of a certain event is "the score on the uppermost face is ≤ 10", as every
possible score is less than or equal to 10 (in fact ≤ 6). This event has probability 1.

7.2 Rules of probability

Probabilities obey the rules of proportions. Imagine a population of individuals among whom a
proportion p have some attribute A, while the remaining proportion 1 − p do not. Imagine choosing
an individual at random from this population. The individual chosen might or might not have the
attribute A: the probability that he or she does is

P(A) = p .

Thus there is a direct correspondence between the probability of the event A and the proportion in the
population who have the attribute A.

Example 7.2 Consider a population of 220 people classified by sex and height:

Tall Short Total


Male 50 50 100
Female 30 90 120
Total 80 140 220
Then

P(Tall)   = 80/220  = 4/11
P(Short)  = 140/220 = 7/11 = 1 − P(Tall)
P(Male)   = 100/220 = 5/11
P(Female) = 120/220 = 6/11 = 1 − P(Male).

In general, for any event A,

P(not A) = 1 − P(A) .

7.2.1 Conditional and joint probabilities: multiplication rules

The conditional probability of A given B is denoted P(A | B). It is the proportion of individuals
who have the attribute A among those who have B. Thus P(Tall | Male) is the proportion of men who
are tall, whereas P(Tall) is the proportion of people (men or women) who are tall. In Example 7.2:

P(Tall | Male)   = 50/100 = 1/2
P(Tall | Female) = 30/120 = 1/4
P(Male | Tall)   = 50/80  = 5/8
P(Male | Short)  = 50/140 = 5/14 .

Note that P(Tall | Male) is different from P(Male | Tall). The former is the proportion of men who are
tall while the latter is the proportion of tall people who are men.

In general P(A | B) is different from P(B | A) though these are often confused in practice. In applica-
tions of probability in courts of law, the confusion is so common that it has a name: the prosecutor's
fallacy.

We may also consider joint probabilities of two or more events:

P(Tall and Male)    = 50/220 = 5/22
P(Short and Female) = 90/220 = 9/22

Note that P(Tall and Male) = P(Tall | Male) × P(Male), or 5/22 = 1/2 × 5/11.

In general, for any events A and B,

P (A and B) = P (A | B)P (B) = P (B | A)P (A) .

This last equality gives us a means of calculating P (A | B) when we know P (B | A), P (A) and P (B).
In this form, it is known as Bayes Theorem: P (A | B) = P (B | A)P (A)/P (B).
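As a quick numerical check of Bayes Theorem against the figures in Example 7.2, here is a sketch using plain Python fractions:

    # Sketch: Bayes Theorem checked on the Tall/Male table of Example 7.2.
    from fractions import Fraction as F

    p_tall_given_male = F(50, 100)
    p_male = F(100, 220)
    p_tall = F(80, 220)
    p_male_given_tall = p_tall_given_male * p_male / p_tall
    print(p_male_given_tall)  # 5/8, matching the direct calculation above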
It is sometimes useful to imagine the joint probabilities in a table, as if they were numbers out of a
total population of 1:

          Tall    Short    Total
Male      5/22     5/22     5/11
Female    3/22     9/22     6/11
Total     4/11     7/11        1

7.2.2 Mutually exclusive events: addition rules

In Example 7.1 the experiment consisted of rolling a die. There were six possible outcomes: 1, 2, 3, 4,
5, or 6. These outcomes are mutually exclusive, meaning that if one occurs any other cannot. Two
events that are not mutually exclusive are “5 or 6” and “an even number”, because if the outcome of
the roll is a 6 then both of these events have happened.

Consider the event "5 or 6" which occurs if the experiment results in either a 5 or a 6. This has
probability

P(X = 5 or 6) = 2/6 = 1/6 + 1/6 = P(X = 5) + P(X = 6) .
In general, if A and B are two mutually exclusive events then

P (A or B) = P (A) + P (B) .

In Example 7.2 the events "Tall and Male" and "Tall and Female" are mutually exclusive, since an
individual cannot simultaneously be both of these. Now, by the above addition rule

P(Tall) = P(Tall and Male) + P(Tall and Female), or from the table 80/220 = 50/220 + 30/220 .

Furthermore, expressing the joint probabilities in terms of conditional probabilities:

P(Tall) = P(Tall | Male)P(Male) + P(Tall | Female)P(Female), or 4/11 = 1/2 × 5/11 + 1/4 × 6/11 .

In general for events A and B,

P (A) = P (A | B)P (B) + P (A | not B)P (not B) .

This is called the generalised addition law and is often used in conjunction with Bayes theorem.

7.2.3 Independence

Example 7.3 Consider a population of individuals classified by sex and by the ability
to curl one’s tongue. (This ability is determined by a single gene and is not sex-linked.)
Suppose the population proportions are:

Curl Straight Total


Male .10 .30 .40
Female .15 .45 .60
Total .25 .75 1

Then P(Curl and Male) = 0.10 = 0.25 × 0.40 = P(Curl) × P(Male). Similarly, each joint
probability in the table is the product of the relevant row and column totals. In this case,
sex and the ability to curl one's tongue are independent.

In general two events A and B are independent if

P (A and B) = P (A) P (B) .

It follows from this that P(A | B) = P(A | not B) = P(A). Thus in Example 7.3

P(Curl | Male) = .10/.40 = 1/4 ,  P(Curl | Female) = .15/.60 = 1/4 ,  P(Curl) = .25/1 = 1/4 .

Intuitively, if A and B are independent, then the probability that A happens does not depend on
whether B has happened. Note that independence is a property of the probabilities of events, not just
of the (logical) events themselves.

Example 7.4 Suppose that the probability that a new born baby is a boy is 1/2 (i.e., a
baby is equally likely to be a boy or girl) independently of all other births. Consider two
births in a maternity hospital on a particular day. Let

B1 be the event the first baby born is a boy


G1 be the event the first baby born is a girl
B2 be the event the second baby born is a boy
G2 be the event the second baby born is a girl
B1 B2 be the event both babies born are boys, etc.

For the first birth, P (B1 ) = 12 and P (G1 ) = 12 . Similarly for the second birth P (B2 ) = P (G2 ) = 12 .
Furthermore, since all births are independent, the two babies born are equally likely to be any one
of the four possibilities B1 B2 , B1 G2 , G1 B2 , or G1 G2 . So each of these events has probability 1/4. Or,
using the independence rule:
P (B1 B2 ) = P (B1 ) × P (B2 ) = 1/2 × 1/2 = 1/4 .

7.3 Random variables

In Example 7.1, let X denote the (random) value of the uppermost face after rolling the die. X is
called a random variable. We write
P (X = 6) = 1/6
to denote the probability that the experiment results in the outcome 6; that is, the probability of the
event “X = 6”. Similarly
P (X = 1) = 1/6 , P (X = 2) = 1/6 , P (X = 3) = 1/6 , P (X = 4) = 1/6 , P (X = 5) = 1/6 .

In this example, X has six possible values, which are equally likely since their probabilities of
occurrence are the same.

Also, different values of X represent mutually exclusive events, so the probabilities of all possible
values must add to 1. For example, consider the event of not getting a 6. Then
P (X ≠ 6) = P (X = 1 or 2 or 3 or 4 or 5) = 5/6 = 1 − 1/6 = 1 − P (X = 6) .

That is, to find the probability that X ≠ 6, we can either add up the probabilities of all values of X
that do not equal 6, or we can subtract the probability that X = 6 from 1.

7.3.1 Mean and Variance of a random variable

The mean (or expectation) of a random variable corresponds to the population mean µ for the
relevant ‘population’. In Example 7.1, our population consists of the numbers 1, 2, 3, 4, 5, and 6 in
equal proportions. The mean of the random variable X is therefore the mean of these six numbers,
or µ = 1 × 1/6 + 2 × 1/6 + 3 × 1/6 + 4 × 1/6 + 5 × 1/6 + 6 × 1/6 = 3.5.

In general the mean of a random variable may be found by multiplying each value by its probability
and adding all the products:

    µ = Σᵢ i P (X = i) ,

where the sum is over all possible values i of X.

The variance σ² of a random variable is defined similarly:

    σ² = Σᵢ (i − µ)² P (X = i) .

So in Example 7.1, X has variance

    σ² = (1 − 3.5)² × 1/6 + (2 − 3.5)² × 1/6 + · · · + (6 − 3.5)² × 1/6 = 35/12 .

If a large number of independent realisations of the random variable are obtained (e.g., if the die is
rolled many times) the resulting values of X will have a mean close to µ and standard deviation close to σ.
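
These definitions translate directly into code. The following short Python sketch (illustrative only) computes the mean and variance of the die-roll variable of Example 7.1:

    # Mean and variance of the die-roll random variable of Example 7.1.
    values = [1, 2, 3, 4, 5, 6]
    probs = [1 / 6] * 6

    mu = sum(i * p for i, p in zip(values, probs))               # 3.5
    var = sum((i - mu) ** 2 * p for i, p in zip(values, probs))  # 35/12 = 2.9167
    print(mu, var)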

7.4 Binomial Distribution

An experiment that consists of n independent “trials” such that each trial can only result in two
possible outcomes (usually called “success” and “failure”) and such that the probability of “success”
is the same for each trial is known as a binomial experiment.

Example 7.5 One hundred seeds are sown and it is observed whether or not each seed
germinates. The sowing of seed constitutes a trial and “success” on a particular trial
occurs if that seed germinates. Here there are n = 100 trials. If whether or not one
seed germinates does not affect whether or not other seeds germinate (i.e., if the trials are
independent), and if each seed has the same chance of germinating, then this is a binomial
experiment. But if some of the seeds were watered and others were not then the probability
of “success” would not be the same for each trial and so the binomial situation would not
hold.

We are often interested in how many of the n trials result in a “success”. We define the random
variable X as the number of “successes” in n trials. If we have a binomial experiment, then X has a
binomial distribution with index n and success probability p, where p is the probability that
any one trial results in a “success”.

Let Si be the event “success on the ith trial” and Fi be the event “failure on the ith trial”, for
i = 1, 2, . . . , n. Let P (Si ) = p, so P (Fi ) = 1 − p.

For n = 1 (i.e., there is just one trial), X is either 0 or 1 and

P (X = 0) = 1 − p ,   P (X = 1) = p .

For n = 2, X may be 0, 1, or 2. Now

P (F1 F2 ) = P (F1 ) × P (F2 ) = (1 − p) × (1 − p)
P (F1 S2 ) = P (F1 ) × P (S2 ) = (1 − p) × p
P (S1 F2 ) = P (S1 ) × P (F2 ) = p × (1 − p)
P (S1 S2 ) = P (S1 ) × P (S2 ) = p × p

In the first line above, X = 0, in the second and third lines X = 1 and in the fourth line, X = 2.
Hence
P (X = 0) = (1 − p)² ,   P (X = 1) = 2p(1 − p) ,   P (X = 2) = p² .

For n = 3, X may be 0, 1, 2 or 3, and

P (X = 0) = P (F1 F2 F3 ) = (1 − p)³
P (X = 1) = P (S1 F2 F3 ) + P (F1 S2 F3 ) + P (F1 F2 S3 ) = 3p(1 − p)²
P (X = 2) = P (S1 S2 F3 ) + P (S1 F2 S3 ) + P (F1 S2 S3 ) = 3p²(1 − p)
P (X = 3) = P (S1 S2 S3 ) = p³

For n = 4, X may be 0, 1, 2, 3 or 4 and a similar argument gives probabilities:

(1 − p)⁴ ,   4p(1 − p)³ ,   6p²(1 − p)² ,   4p³(1 − p) ,   p⁴ .

The coefficients in these binomial probabilities can be obtained from Pascal’s triangle. This is a tri-
angle of numbers in which each number is calculated by adding together the two numbers immediately
above it. Values up to n = 8 are:

n=0 1
n=1 1 1
n=2 1 2 1
n=3 1 3 3 1
n=4 1 4 6 4 1
n=5 1 5 10 10 5 1
n=6 1 6 15 20 15 6 1
n=7 1 7 21 35 35 21 7 1
n=8 1 8 28 56 70 56 28 8 1
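
The triangle is easy to generate by the "add the two numbers above" rule. A small illustrative Python sketch:

    def pascal_row(n):
        # Build row n by repeatedly adding adjacent pairs of the previous row.
        row = [1]
        for _ in range(n):
            row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
        return row

    print(pascal_row(7))  # [1, 7, 21, 35, 35, 21, 7, 1]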

Thus if n = 7, P (X = 3) = 35p³(1 − p)⁴. The number 35 is often written as the binomial coefficient

    C(7, 3)   (read "7 choose 3"),

where it denotes the number of subsets of 3 objects that could be drawn from a set of 7 objects.
Because any choice of 3 objects from 7 must leave behind 4 objects, and vice versa, it must be true
that

    C(7, 4) = C(7, 3)

as can be seen from Pascal's triangle above. A formula for calculating this coefficient is:

    C(7, 3) = (7 × 6 × 5) / (1 × 2 × 3) = 7! / (3! × 4!) = 35 ,

where, for example, 3! = 1 × 2 × 3 = 6.

In general, the number n!, called factorial n, is 1 × 2 × · · · × n, with the convention that 0! = 1. And
the number of choices of r objects from n is

    C(n, r) = [n × (n − 1) × · · · × (n − r + 1)] / [1 × 2 × · · · × r] = n! / (r! (n − r)!) = C(n, n − r) .

A general formula for binomial probabilities is therefore

    P (X = r) = C(n, r) p^r (1 − p)^(n−r)   for r = 0, 1, 2, . . . , n.

Formulae for the mean and variance of the binomial distribution are:

    µ = np   and   σ² = np(1 − p) .

For example, for n = 1, µ = 0 × (1 − p) + 1 × p = p and σ² = (0 − p)²(1 − p) + (1 − p)²p = p(1 − p).

We can calculate binomial probabilities using a calculator or by using Table 6, which gives values of
P (X ≤ r) for values of n from 3 to 19, r from 0 to n and p from 0.01 to 0.50. For values of p greater
than 0.5, note that the number of "failures", n − X, also has a binomial distribution, with success
probability 1 − p instead of p. Here are some examples:

Example 7.6 Suppose n = 7 and p = 0.30. Then:

1. P (X ≤ 2) = 0.6471
2. P (X < 4) = P (X ≤ 3) = 0.8740
3. P (X = 2) = P (X ≤ 2) − P (X ≤ 1) = 0.6471 − 0.3294 = 0.3177
4. P (X > 4) = 1 − P (X ≤ 4) = 1 − 0.9712 = 0.0288
5. P (X ≥ 3) = 1 − P (X ≤ 2) = 1 − 0.6471 = 0.3529
6. Using a calculator, P (X = 2) = 21 × (0.3)² × (0.7)⁵ = 0.3176523 ≈ 0.3177
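
If a statistical package is available, these table look-ups can be reproduced directly. A sketch using Python's scipy (an assumption on our part; any package with binomial pmf/cdf functions would do):

    from scipy.stats import binom

    n, p = 7, 0.30
    print(binom.cdf(2, n, p))      # 1. P(X <= 2) = 0.6471
    print(binom.cdf(3, n, p))      # 2. P(X < 4) = P(X <= 3) = 0.8740
    print(binom.pmf(2, n, p))      # 3. P(X = 2) = 0.3177
    print(1 - binom.cdf(4, n, p))  # 4. P(X > 4) = 0.0288
    print(binom.sf(2, n, p))       # 5. P(X >= 3) = 1 - P(X <= 2) = 0.3529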

Example 7.7 The probability that a person will respond to a mailed advertisement is 0.1.
What is the probability that at most two people out of a group of ten will respond?

Let X denote the number of people who respond. Assuming that we have a binomial experiment, we
have n = 10 and p = 0.1. We require P (X ≤ 2) = 0.9298.

Example 7.8 In the game of ‘chuck-a-luck’ you pay 1p to play and then you throw 3 dice.
If you throw one six you get 1p back; if you throw two sixes you get 2p back and if you
throw 3 sixes you get 3p back. What is the probability you at least get your money back?

Let X be the number of sixes thrown. Here we have a binomial experiment with n = 3 and p = 1/6.
We require P (X ≥ 1) = 1 − P (X = 0) = 1 − (5/6)³ ≈ 1 − 0.5787 = 0.4213.

Alternatively, we may interpolate from Table 6: when p = 0.15, P (X = 0) = 0.6141 and when
p = 0.20, P (X = 0) = 0.5120, so when p = 0.167, P (X = 0) ≈ 0.58. Thus P (X ≥ 1) ≈ 0.42.

7.5 Poisson distribution

Suppose “events” occur at a rate of µ per unit (for example the unit may be time or length). Let
X denote the (random) number of events that occur in a particular unit. Then X has a Poisson
distribution with mean µ if

    P (X = r) = µ^r e^(−µ) / r!   for r = 0, 1, 2, . . . .

Poisson probabilities may be obtained using a calculator or from Table 7, which gives values of P (X ≤ r)
for values of µ from 0 to 20. This table is used in the same way as Table 6 is used for binomial
probabilities.

Example 7.9 During a certain period of the day the average number of telephone calls
per minute coming into a switchboard is 4. What is the probability that

(a) in one minute during this period the switchboard receives at most 3 calls;
(b) in two minutes during this period the switchboard receives more than 8 calls?

For (a), let X be the number of calls in the relevant one minute period. Suppose that X has a Poisson
distribution with mean µ = 4. We require P (X ≤ 3) = 0.4335 from Table 7.

For (b) let X be the number of calls in the relevant two minute period. Suppose that X has a Poisson
distribution with mean µ = 8. We require P (X > 8) = 1 − P (X ≤ 8) = 1 − 0.5925 = 0.4075 from
Table 7.
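
Again, the table look-ups can be reproduced with a package. An illustrative scipy sketch (scipy is assumed to be available):

    from scipy.stats import poisson

    print(poisson.cdf(3, 4))  # (a) P(X <= 3) with mu = 4: 0.4335
    print(poisson.sf(8, 8))   # (b) P(X > 8) with mu = 8: 0.4075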

The variance of the Poisson distribution is equal to the mean µ. Hence the standard deviation is
σ = √µ. For example, in Example 7.9 if we counted the number of calls arriving in a minute, for each
of a large number of different minutes, these numbers would vary about a mean of 4 with a standard
deviation of 2.

The Poisson distribution is a limiting case of the binomial distribution in the following sense: suppose
X has a binomial distribution with index n and success probability p = µ/n. Then as n becomes large
(and therefore p becomes small), the distribution of X tends to Poisson with mean µ. In particular,
the binomial mean is np = µ for all n and the binomial variance is σ² = np(1 − p) = µ(1 − µ/n), which
approaches µ as n increases.

7.6 Approximations to the binomial distribution for large n

Let X have a binomial distribution with index n and success probability p.

1. If n is large and p is not too close to 0 or 1, then

    P (X ≤ r) ≈ P (Y ≤ r + 0.5)

where Y has a normal distribution with mean µ = np and standard deviation σ = √(np(1 − p)).
When p is near 0.5, this approximation works well even for quite small n, e.g., n = 20.

2. If n is large and p is close to 0, then the binomial distribution can be approximated by the
Poisson distribution with µ = np.

Example 7.10 The proportion of bull calves born to domestic cattle is 0.512. What is
the probability that out of 100 calves born less than 50 are bulls?

Let X denote the number of bull calves born out of the 100 calves. Then X has a binomial distribution
with n = 100 and p = 0.512. We require P (X < 50). Using the normal approximation, we have

P (X < 50) = P (X ≤ 49) ≈ P (Y ≤ 49.5)

where Y has a normal distribution with mean µ = 100 × 0.512 = 51.2 and standard deviation
σ = √(100 × 0.512 × (1 − 0.512)) = 4.999. So

    P (X < 50) ≈ P (Z ≤ (49.5 − 51.2)/4.999) = P (Z ≤ −0.3401) = 1 − 0.6331 = 0.3669 .
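
As a check on the approximation, the exact binomial probability can be computed directly. An illustrative scipy sketch:

    from scipy.stats import binom, norm

    n, p = 100, 0.512
    mu = n * p                        # 51.2
    sigma = (n * p * (1 - p)) ** 0.5  # 4.999

    exact = binom.cdf(49, n, p)                   # exact P(X < 50) = P(X <= 49)
    approx = norm.cdf(49.5, loc=mu, scale=sigma)  # continuity-corrected approximation
    print(exact, approx)                          # both about 0.367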

Example 7.11 It is suggested that 0.006% of insured males die in road accidents each
year. What is the probability that in a given year, an insurance company must pay off 3
out of the 10000 policies against such accidents that they have?

Let X denote the number of claims for death from road accidents in a given year. Then X has
a binomial distribution with n = 10000 and p = 0.00006. Hence X has approximately a Poisson
distribution with µ = 10000 ⇥ 0.00006 = 0.6. Thus

P (X = 3) = P (X ≤ 3) − P (X ≤ 2) = 0.9966 − 0.9769 = 0.0197 .

7.7 Normal approximation to the Poisson distribution for large µ

Let X have a Poisson distribution with mean µ. Then if µ is reasonably large,

    P (X ≤ r) ≈ P (Y ≤ r + 0.5)

where Y has a normal distribution with mean µ and standard deviation σ = √µ. This approximation
can be used to calculate Poisson probabilities for values of µ outside the range of Table 7.

For example, suppose X has a Poisson distribution with mean µ = 20 and we want to find P (X ≤ 24).
From Table 7 we get P (X ≤ 24) = 0.8432, while the above approximation gives

    P (X ≤ 24) ≈ P (Y ≤ 24.5) = P (Z ≤ (24.5 − 20)/√20) = P (Z ≤ 1.0062) = 0.8428

which is not very different. The normal approximation is of course better for larger µ.

8 Inference for Binomial and Poisson Parameters

Example 8.1 A lady says she can tell by taste whether tea has been made with tea bags
or bulk tea. She sips from 15 pairs of cups, one with each kind of tea, and makes the
correct identification 9 times. Is there reason to think that the lady really can tell the
di↵erence?

Let X denote the number of correct identifications out of 15 and let p be the probability the lady
correctly identifies the two teas in a pair. We test the null hypothesis

H0 : p = 0.5

which is what p would be if she made a random choice. If the lady has some ability to choose correctly
we would expect p to be greater than 0.5. However, it is also possible that p might be less than 0.5
— she might make the wrong choice more often than by chance! We therefore use a two-sided test as
usual.

The P-value is the probability of getting a result as extreme as (or more extreme than) that observed,
if H0 is true. The one-sided P-value is P (X ≥ 9), where X has a binomial distribution with n = 15
and p = 0.5. Thus the two-sided P-value is

    P = 2 P (X ≥ 9) = 2(1 − P (X ≤ 8)) = 2(1 − 0.6964) = 0.61 .

So the data are consistent with H0 . There is no evidence that the lady can distinguish the two types
of tea by tasting.
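
The two-sided P-value can be checked by computing the binomial tail directly. An illustrative scipy sketch:

    from scipy.stats import binom

    n, observed = 15, 9
    p_one_sided = binom.sf(observed - 1, n, 0.5)  # P(X >= 9) = 1 - P(X <= 8)
    p_two_sided = 2 * p_one_sided
    print(p_two_sided)                            # about 0.61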

Example 8.2 A group of prospectors for a certain mineral has been operating for some
time in an extensive area covering several hundred square kilometres, where deposits of the
mineral are found randomly located over the area at an average density of 1000 deposits
per square kilometre. The group considers moving to another less accessible area but this
will only be worthwhile if the density of deposits is higher. They decide to carry out a pilot
survey of an area of 10,000 square metres in the new location. If they find 17 deposits in
their pilot area, is it likely to be worthwhile to develop the new area?

Let X denote the number of deposits found in the pilot area. Let µ be the average number of deposits
per 10,000 square metres in the new area. In the old area the density of deposits is 1000 per square
km, which is 10⁻³ per square metre, and therefore 10⁴ × 10⁻³ = 10 per 10,000 square metres. So we
test
H0 : µ = 10 .

If H0 is true, X will have a Poisson distribution with mean µ = 10, because deposits are randomly
located. So the one-sided P-value is

P (X ≥ 17) = 1 − P (X ≤ 16) = 1 − 0.9730 = 0.0270 .

Hence the two-sided P-value is P = 2 × 0.0270 = 0.054. (We use the rule of doubling the one-sided
P-value.) This is greater than, but very close to the conventional 0.05, and suggests that there is some
(but not strong) evidence that the density of deposits is higher in the new location. The evidence is
not clear cut, but it might be useful to calculate a confidence interval for µ.

Example 8.3 From Mendelian inheritance theory it is expected that certain crosses of
pea will give yellow and green peas in the ratio 3:1. In a particular experiment 180 yellow
and 48 green peas were obtained. Does this experiment support the theory?

Let X denote the number of green peas obtained out of 228. Let p be the probability of a pea being
green. According to the theory, p = 1/(3 + 1) = 0.25, so we test

H0 : p = 0.25 .

If H0 is true, X has a binomial distribution with n = 228 and p = 0.25. The expected value of X is
np = 228 × 0.25 = 57 and the standard deviation of the distribution of X is σ = √(228 × 0.25 × 0.75) =
6.538. We find the one-sided P-value using the normal approximation:

    P (X ≤ 48) ≈ P (Y ≤ 48.5) = P (Z ≤ (48.5 − 57)/6.538) = P (Z ≤ −1.30) = 1 − 0.9032 = 0.0968 .

Hence the P-value is P = 2 × 0.0968 = 0.1936, so the data are consistent with H0 . We conclude that
there is no evidence to suggest that the theory is wrong.

Example 8.4 Medical researchers studied the effect of tight neckties on the flow of blood
to the head and the possible decrease in the brain’s ability to respond to visual information.
Results of a random sample of 250 businessmen found that 167 were wearing their tie too
tight. Find a 95% confidence interval for the proportion of the population of businessmen
who wear their tie too tight.

Let p be the population proportion of businessmen who wear their tie too tight. Let X denote the
number of businessmen out of 250 who wear their tie too tight. Then X has a binomial distribution
with n = 250 and success probability p. The sample proportion 167/250 is an estimate of p:

p̂ = 167/250 = 0.668 .

The standard error of this estimate is

    se(p̂) = √( p(1 − p)/n ) .

We do not know p but we can obtain an approximate standard error by using p̂ in this formula:

    se(p̂) ≈ √( p̂(1 − p̂)/n ) = √( (167/250) × (83/250) × (1/250) ) = 0.0298 .

Since n is large we can use the normal approximation to the sampling distribution of p̂. Thus an
approximate 95% confidence interval for p is

p̂ ± 1.96 se(p̂) = 0.668 ± 1.96 × 0.0298 = (0.610, 0.726) .

Similarly an approximate 99% confidence interval for p is

p̂ ± 2.5758 se(p̂) = 0.668 ± 2.5758 × 0.0298 = (0.591, 0.745) .

Note: 1.96 is the upper 2.5 percentage point of the standard normal distribution, and 2.5758 is the
upper 0.5 percentage point.
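
The whole calculation is short enough to script. An illustrative Python sketch (scipy is assumed, and is used only for the normal percentage points):

    from scipy.stats import norm

    x, n = 167, 250
    p_hat = x / n                          # 0.668
    se = (p_hat * (1 - p_hat) / n) ** 0.5  # 0.0298

    for level in (0.95, 0.99):
        z = norm.ppf(1 - (1 - level) / 2)  # 1.96 and 2.5758
        print(level, (p_hat - z * se, p_hat + z * se))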

9 Frequency Data and Chi-Square Tests

9.1 Fitting a Probability Model

Example 9.1 The data below give the observed frequency of different kinds of pea seeds
in crosses from plants with round yellow seeds and plants with wrinkled green seeds. The
Mendelian theory of inheritance suggests that 1/16 of the seeds will be wrinkled and green
(WG), 3/16 will be round and green (RG), 3/16 will be wrinkled and yellow (WY) and
9/16 will be round and yellow (RY).

Type of seed: RY WY RG WG Total


Observed number: 93 27 32 8 160

Do the data agree with the theory?

Let p1 , p2 , p3 , p4 be the proportions of RY, WY, RG, WG seeds in the population. We test the null
hypothesis
H0 : p1 = 9/16, p2 = 3/16, p3 = 3/16, p4 = 1/16 .
If H0 is true then in a sample of 160 observations we would expect 90, 30, 30, 10 of RY, WY, RG,
WG seeds respectively.

We can try to assess how well the data agree with the theory by comparing the observed and expected
frequencies in a table. How can we judge their agreement? To judge how well an expected frequency
E agrees with an observed frequency O we need a measure that depends on the difference O − E, and
also on the size of E. For example, we would consider O = 20 and E = 10 to be in poor agreement,
but O = 210 and E = 200 to be in rather good agreement, even though O − E = 10 in both cases.
According to statistical theory, a good measure is the standardised residual defined by

    (O − E)/√E .
A useful informal rule of thumb is that O and E are in reasonably good agreement if (O − E)/√E
is between −2 and +2. For example, when O = 20 and E = 10 we get (20 − 10)/√10 = 3.16, which
is poor agreement. But when O = 210 and E = 200 we get (210 − 200)/√200 = 0.71, which is good
agreement (between −2 and +2). Note that this rule applies to frequencies or counts (that is,
numbers on the counting scale 0, 1, 2, . . .) but not to measurements that have units, such as lengths,
times, etc.

We also need a measure of how well the set of observed frequencies O1 , O2 , . . . , Ok agrees with the
expected frequencies E1 , E2 , . . . , Ek overall. The conventional measure is the chi-square statistic
defined by

    χ²stat = Σᵢ (Oi − Ei)² / Ei ,

summing over the k categories (i = 1, . . . , k). In words, the chi-square statistic is the sum of the
squares of the standardised residuals (O − E)/√E. If χ²stat = 0 the observed and expected frequencies
agree exactly. The larger χ²stat is, the worse is the agreement. Furthermore, according to statistical
theory, if the null hypothesis (H0 ) is true, then χ²stat should be a random value from a χ²-distribution
with degrees of freedom ν equal to the number of categories minus one. We can then find the P-value
as the upper tail probability P (Y > χ²stat) where Y has a χ²-distribution with ν degrees of freedom.
Table 8 gives percentage points for χ²-distributions. As usual, a small P-value (corresponding to a
large χ²stat) gives evidence against H0 , and would therefore suggest that the expected frequencies do
not agree sufficiently well with the observed frequencies.

Returning to Example 9.1, here are the observed and expected frequencies and standardised residuals:

Type of seed             RY      WY      RG      WG      Total
Observed number Oi       93      27      32      8       160
Expected number Ei       90      30      30      10      160
(Oi − Ei)/√Ei            0.32    −0.55   0.37    −0.63

The standardised residuals are all comfortably between −2 and +2 so it looks as if the observed and
expected numbers are in good agreement. To do a formal test, we calculate the chi-square statistic:
χ²stat = (0.32)² + (−0.55)² + (0.37)² + (−0.63)² = 0.933. The degrees of freedom are ν = 4 − 1 = 3. The
relevant extract from Table 8 is
tail probability p       0.90    0.75
χ²p for ν = 3            0.58    1.21

We have 0.58 < 0.933 < 1.21 so our P -value is between 0.75 and 0.90. The P -value is not small (much
greater than 0.05) so the data are consistent with H0 . We conclude that the data do support the
theory.
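
The same test can be run with a package. A sketch using scipy's chisquare function (an assumed tool; it computes exactly the statistic defined above, with ν = k − 1 by default):

    from scipy.stats import chisquare

    observed = [93, 27, 32, 8]
    expected = [90, 30, 30, 10]  # 160 x (9/16, 3/16, 3/16, 1/16)

    stat, p_value = chisquare(observed, f_exp=expected)
    print(stat, p_value)  # stat = 0.933, P-value about 0.82 (between 0.75 and 0.90)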

9.2 Fitting a binomial distribution


Example 9.2 A factory has 4 machines and the number of machines breaking down each
week is observed for 100 weeks with the following results.

Number of breakdowns 0 1 2 3 4 Total


Observed number of weeks 63 28 6 2 1 100

We are interested in whether breakdowns are ‘random’ in the sense that each machine has
the same chance of breaking down in any week, independently of what may happen to
other machines and in other weeks.

Let X denote the number of machines breaking down in a week. Then if breakdowns are random in the
above sense, X will have a binomial distribution with index n = 4 and unknown success probability p,
where p is the probability that any given machine breaks down in any week. (Observing a machine for
a week is a ‘trial’ and a machine breaking down is a ‘success’. The 100 weeks represent 100 realisations
of the random variable X.) We will therefore fit a binomial distribution to these data and test the
goodness of fit. The probabilities of 0, 1, 2, 3 or 4 breakdowns in any week are
    P (X = r) = C(4, r) p^r (1 − p)^(4−r)   for r = 0, 1, 2, 3, 4.

We want to calculate expected frequencies given by 100 ⇥ P (X = r). We do not know the value of p,
so we estimate it from the data as:
    p̂ = total number of successes / total number of trials
       = (0 × 63 + 1 × 28 + 2 × 6 + 3 × 2 + 4 × 1) / (100 × 4) = 50/400 = 0.125 .
Note that this is equivalent to calculating the sample mean number of breakdowns per week (x̄ =
50/100 = 0.5) and 'equating' this to the theoretical mean of the binomial distribution, 4p, so that
p̂ = 0.5/4 = 0.125.
Here are the observed and expected frequencies for Example 9.2:

Number of breakdowns 0 1 2 3 4 Total
Observed number of weeks 63 28 6 2 1 100
Probability P (X = r) 0.5862 0.3350 0.0718 0.0068 0.0002 1
Expected number of weeks 58.62 33.50 7.18 0.68 0.02 100

In Example 9.1 all of the expected values were reasonably large. But if one or more of the expected
values is too small (less than about 5) then the above rules for interpreting the standardised residuals
and χ²stat can fail. It is standard practice then to pool categories so that all expected frequencies are
greater than or equal to 5. This gives the table:

Number of breakdowns        0       1       ≥ 2
Observed frequency Oi       63      28      9
Expected frequency Ei       58.62   33.50   7.88
(Oi − Ei)/√Ei               0.57    −0.95   0.40

Again, the standardised residuals are comfortably between −2 and +2, suggesting that the Oi and Ei
agree well.

The chi-square statistic is χ²stat = (0.57)² + (−0.95)² + (0.40)² = 1.39. Also, because p was estimated
from the data, we must reduce the degrees of freedom by one, so ν = 3 − 1 − 1 = 1. From Table 8
with ν = 1, you can check that the P-value is between 0.20 and 0.24. Again, this is not small and the
binomial distribution fits the data well. We conclude that the data are consistent with the hypothesis
that each machine has, independently, the same chance of breaking down in any week.
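
The whole fit-pool-test procedure can be scripted. An illustrative scipy sketch (the ddof argument removes the extra degree of freedom for the estimated p):

    from scipy.stats import binom, chisquare

    observed = [63, 28, 6, 2, 1]  # weeks with 0, 1, 2, 3, 4 breakdowns
    n_weeks, n_machines = 100, 4

    # Estimate p as total successes / total trials.
    p_hat = sum(r * o for r, o in enumerate(observed)) / (n_weeks * n_machines)

    expected = [n_weeks * binom.pmf(r, n_machines, p_hat) for r in range(5)]

    # Pool the last three categories so all expected counts are >= 5.
    obs_pooled = [observed[0], observed[1], sum(observed[2:])]
    exp_pooled = [expected[0], expected[1], sum(expected[2:])]

    # ddof=1 reduces the degrees of freedom by one because p was estimated.
    stat, p_value = chisquare(obs_pooled, f_exp=exp_pooled, ddof=1)
    print(p_hat, stat, p_value)  # 0.125, 1.39, P-value about 0.24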

9.3 Fitting a Poisson distribution


Example 9.3 Twenty five leaves were selected at random from each of six McIntosh apple
trees in a single orchard. The following table shows the distribution of European red mites
on these 150 apple leaves.

Number of mites on a leaf 0 1 2 3 4 5 6 7 Total


Observed number of leaves 70 38 17 10 9 3 2 1 150

We are interested in whether mites are distributed ‘randomly’ on leaves, in the sense that
any leaf has the same chance of receiving a mite, independently of whether it has other
mites.

If mites are distributed randomly in the above sense, then the numbers of mites per leaf would follow
a Poisson distribution. So we will fit a Poisson distribution and test the goodness of fit.

The mean rate, µ, of mites per leaf is not known. We may use the sample mean as an estimate of µ:
    µ̂ = x̄ = (0 × 70 + 1 × 38 + 2 × 17 + 3 × 10 + 4 × 9 + 5 × 3 + 6 × 2 + 7 × 1) / 150 = 1.1467 .
150
Using this estimate, the expected number of leaves with r mites is therefore

    150 × (1.1467)^r e^(−1.1467) / r!   for r = 0, 1, 2, . . . .
Here are the results of the calculations, where the last three columns have pooled results from leaves
with three or more mites:

Number of mites   Observed   Probability   Expected    O − E     Std res
0                 70         0.3177        47.65       22.35     3.24
1                 38         0.3643        54.64       −16.64    −2.25
2                 17         0.2089        31.33       −14.33    −2.56
3                 10         0.0798        11.97
4                 9          0.0229        3.43
5                 3          0.0052        0.79
6                 2          0.0010        0.15
7                 1          0.0002        0.02
> 7               0          0.0000        0.00
≥ 3 (pooled)      25                       16.36       8.64      2.14
Total             150        1.0000        149.98

Note that although no leaves have been observed with 8 or more mites on them, it is possible that such
leaves exist and this must be allowed for in the calculations. Thus the last group is "> 7" and the last
probability is 1 − (the sum of all the previous probabilities). In this case it is zero to 4 decimal places,
but sometimes it may be greater.

All four standardised residuals are large in absolute value (outside the range −2 to +2) so the fit does
not look good. The chi-square statistic is χ²stat = (3.24)² + (−2.25)² + (−2.56)² + (2.14)² = 26.67
with degrees of freedom ν = 4 − 1 − 1 = 2. Again we subtract an extra degree of freedom because µ
was estimated from the data. From Table 8 the P-value is less than 0.001, which provides significant
evidence against H0 . (The null hypothesis here is that mites are distributed randomly on the leaves.)

The data therefore suggest that the mites are not distributed at random on the leaves. This could be
due to the fact that the mites exist in colonies so that a leaf is more likely to be attacked by several
mites than a single mite, or that the mites have only attacked a few trees and the rest are free from
infestation. More information is needed to discover the reason.

9.4 Contingency tables

Example 9.4 A researcher, investigating public attitudes to the level of welfare benefits
in Britain, carried out a pilot survey in the town where she lived. A simple random sample
of 200 individuals was drawn from the electoral register, and each member of the sample
was asked to complete a questionnaire. One question asked what the respondent felt about
the current level of child benefit. Respondents were also assigned to an occupational group
according to the occupation of the principal provider of financial support within their
household. The following table gives the numbers of people giving each response, for each
occupation group:

Response about level of benefit


Too high About right Too low Don’t know Total
Occupation
Non-manual 18 29 10 15 72
Manual 13 40 26 14 93
None currently 3 13 11 8 35
Total 34 82 47 37 200

We are interested in whether a person's occupation affects their feeling about child benefit;
i.e., is the pattern of responses the same for each occupation group, and if not, how does
it differ?

We will formally test the null hypothesis that there is no association between occupation and re-
sponse. But before doing so it is helpful to look at summary statistics in the form of row proportions.

Sometimes it may be the column proportions that are of interest, but in this example we want to look
at the row proportions — the proportion of each response within each occupation group. We divide
each entry in the contingency table by the relevant row total; two decimal places is good enough to
see any pattern:
Table of row proportions
Too high About right Too low Don’t know Total
Occupation
Non-manual .25 .40 .14 .21 1.00
Manual .14 .43 .28 .15 1.00
None currently .09 .37 .31 .23 1.00
Total .17 .41 .24 .19 1.00

Some of the numbers of responses are quite small so we should not read too much into these proportions.
But the general pattern seems to be that a higher proportion of the non-manual group think that child
benefit is too high, compared with the other two groups, about the same proportions (about 40%)
of the three groups think the benefit is about right, and a lower proportion of the non-manual group
thinks it is too low.

We can formally test whether there is an association between response and occupation group using
a χ²-test as follows. We will calculate a table of expected frequencies and a table of standardised
residuals and hence a χ²stat and a P-value.

We can express the null hypothesis H0 in different ways:

H0 : There is no association between response and occupation group,

H0 : Response and occupation group are independent, or

H0 : The population proportions of responses are the same for each occupation group.

It is often the third of these that is most easily interpreted. If the null hypothesis were true then we
would expect the same row proportions for each occupation group. These would be approximately
34/200 too high, 82/200 about right, 47/200 too low and 37/200 don’t know (i.e., estimated from the
column totals).

There are 72 people with non-manual occupations so, if H0 is true, the expected numbers of these
people in the four response groups are

(34/200) × 72 = 12.24,   (82/200) × 72 = 29.52,   (47/200) × 72 = 16.92,   (37/200) × 72 = 13.32 .

Similarly if H0 is true, the expected numbers of the different responses in the manual occupation
group are

(34/200) × 93 = 15.81,   (82/200) × 93 = 38.13,   (47/200) × 93 = 21.85,   (37/200) × 93 = 17.21 .

And for the no current occupation group, the expected frequencies are

(34/200) × 35 = 5.95,   (82/200) × 35 = 14.35,   (47/200) × 35 = 8.22,   (37/200) × 35 = 6.48 .

Here is the table of expected frequencies:

Table of expected frequencies
Too high About right Too low Don’t know Total
Occupation
Non-manual 12.24 29.52 16.92 13.32 72
Manual 15.81 38.13 21.85 17.21 93
None currently 5.95 14.35 8.22 6.48 35
Total 34 82 47 37 200

Note that it has the same row and column totals as the table of observed frequencies. The general
formula is

    expected frequency = (row total × column total) / grand total .

To compare the observed and expected frequencies, it is useful to calculate a table of standardised
residuals (O − E)/√E:

Table of standardised residuals

                  Too high   About right   Too low   Don't know
Occupation
Non-manual        1.65       −0.10         −1.68     0.46
Manual            −0.71      0.30          0.89      −0.77
None currently    −1.21      −0.36         0.97      0.60

In spite of the pattern we saw above, these standardised residuals are all between −2 and +2, so the
observed and expected frequencies seem to agree reasonably well. The chi-square statistic is

    χ²stat = (1.65)² + (−0.10)² + · · · + (0.60)² = 10.62 .

What are the degrees of freedom? Statistical theory tells us that for a contingency table with r rows
and c columns the degrees of freedom is ν = (r − 1) × (c − 1). (This is because the table of expected
frequencies is forced to have the same row and column totals as the table of observed frequencies.)
So in the present example, with 3 rows and 4 columns, ν = 2 × 3 = 6. From Table 8 the P-value is
therefore approximately 0.10. (P (Y > 10.62) ≈ 0.10, where Y has a χ²-distribution with 6 degrees of
freedom.)

On the basis of these data there is therefore insufficient evidence to conclude that there is any rela-
tionship between a person’s occupational group and his or her attitude towards the current level of
child benefit.
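
For contingency tables, scipy's chi2_contingency function carries out all of the steps above (expected frequencies, χ²stat, degrees of freedom and P-value) in one call. An illustrative sketch:

    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[18, 29, 10, 15],
                      [13, 40, 26, 14],
                      [ 3, 13, 11,  8]])

    stat, p_value, dof, expected = chi2_contingency(table)
    print(stat, dof, p_value)                      # about 10.6 on 6 df, P about 0.10
    print((table - expected) / np.sqrt(expected))  # table of standardised residuals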

Example 9.5 The following data give the sex (S) and the usual means of travel to work
(T) for a random sample of 140 employees of a large company (E). The data are coded for
sex as 1 = male and 2 = female and for usual means of travel to work as 1 = car or motor
cycle driver, 2 = car or motor cycle passenger, 3 = public transport and 4 = walk or pedal
cycle.
E S T E S T E S T E S T E S T E S T E S T
1 1 1 21 2 4 41 1 4 61 1 4 81 1 1 101 1 1 121 1 1
2 1 1 22 1 1 42 2 1 62 1 1 82 1 1 102 2 3 122 1 1
3 1 1 23 2 1 43 2 2 63 1 4 83 2 2 103 1 1 123 1 3
4 1 3 24 1 1 44 1 1 64 1 1 84 1 1 104 1 1 124 1 3
5 1 1 25 2 2 45 2 3 65 2 4 85 1 1 105 2 4 125 1 4
6 1 1 26 2 2 46 1 1 66 1 1 86 2 4 106 1 2 126 2 1
7 1 1 27 1 4 47 1 1 67 1 4 87 2 3 107 1 1 127 2 3
8 2 4 28 1 1 48 2 1 68 1 4 88 1 1 108 1 3 128 1 1
9 2 2 29 1 1 49 1 4 69 1 3 89 1 4 109 2 1 129 1 1
10 1 1 30 2 4 50 2 1 70 1 1 90 1 4 110 1 1 130 2 4

11 2 4 31 2 1 51 2 4 71 2 3 91 1 1 111 2 1 131 1 4
12 1 1 32 2 1 52 1 4 72 1 1 92 1 4 112 1 2 132 1 1
13 1 1 33 1 1 53 1 4 73 1 1 93 2 2 113 1 2 133 1 1
14 2 4 34 1 1 54 2 1 74 1 3 94 1 4 114 1 1 134 2 1
15 1 1 35 2 4 55 1 1 75 1 3 95 2 4 115 2 1 135 1 3
16 1 3 36 2 2 56 2 3 76 1 2 96 2 3 116 1 1 136 1 1
17 2 4 37 2 4 57 1 1 77 1 1 97 2 4 117 1 4 137 1 2
18 1 4 38 1 1 58 2 1 78 1 1 98 2 1 118 1 1 138 1 4
19 1 1 39 2 2 59 2 1 79 1 1 99 1 1 119 1 1 139 2 4
20 2 4 40 2 1 60 2 2 80 2 2 100 1 1 120 1 4 140 1 4

Is there any association between a person’s gender and his or her mode of transport?

Here is a contingency table compiled from the raw data:

Male (1) Female (2) Total


Car or motor cycle driver (1) 56 16 72
Car or motor cycle passenger (2) 5 10 15
Public transport (3) 9 7 16
Walk or pedal cycle (4) 20 17 37
Total 90 50 140

Let us first carry out the χ²-test. The null hypothesis is that there is no association between sex and
mode of transport. You can check that the expected frequencies and standardised residuals are:

Expected frequencies Standardised residuals


Male Female Male Female
Transport 1 46.29 25.71 1.43 -1.92
2 9.64 5.36 -1.50 2.01
3 10.29 5.71 -0.40 0.54
4 23.79 13.21 -0.78 1.04

So χ²stat = (1.43)² + (−1.92)² + (−1.50)² + (2.01)² + (−0.40)² + (0.54)² + (−0.78)² + (1.04)² = 14.10,
with degrees of freedom ν = 3 × 1 = 3. From Table 8, the P-value is between 0.001 and 0.005,
which represents quite strong evidence against H0 . Note also that there are two fairly substantial
standardised residuals of −1.92 and 2.01. We may conclude that there is evidence of a relationship
between a person’s sex and his or her means of transport to work.

Now let us look at some proportions to see what the relationship is. The row proportions are

Male Female Total


Car or motor cycle driver (1) .78 .22 1.00
Car or motor cycle passenger (2) .33 .67 1.00
Public transport (3) .56 .44 1.00
Walk or pedal cycle (4) .54 .46 1.00
Total .64 .36 1.00

These tell us the relative numbers of males and females for each mode of transport. There is a majority
of males in the sample (64% male and 36% female). The main pattern is that, among those who drive a
car of motorcycle to work, a higher proportion are male (0.78), while among those who are passengers
a lower proportion are male.

Here are the column proportions:
Male Female Total
Car or motor cycle driver (1) .62 .32 .51
Car or motor cycle passenger (2) .06 .20 .11
Public transport (3) .10 .14 .11
Walk or pedal cycle (4) .22 .34 .26
Total 1.00 1.00 1.00

These tell us, for each sex, the relative frequencies of each mode of transport. From the marginal
proportions, half of the sample work force drive to work, while a quarter walk or cycle; 11% take public
transport and another 11% are passengers. Looking at males and females separately, a rather higher
proportion of males drive (.62 compared with .32) and more females are passengers (.20 compared with
.06). These proportions seem more interesting than the row proportions in this case. The chi-square
test has confirmed that there is good evidence that the patterns differ for males and females.

9.5 Comparing two or more binomial proportions

Example 9.6 Four adjacent areas of heathland were burned in turn in successive years.
100 quadrat samples were then taken at random from each area and the presence or absence
of the grass Agrostis tenuis was noted with the following results:

Area (number of years           Number of quadrats with Agrostis tenuis
since burning in brackets)      Present    Absent    Total
A (1)                           26         74        100
B (2)                           40         60        100
C (3)                           39         61        100
D (4)                           47         53        100
Total                           152        248       400

We wish to test whether the frequency of Agrostis tenuis varies from area to area, i.e., whether
the proportion of quadrats containing the grass differs over the four areas.

Let pA , pB , pC and pD be the population proportions of quadrats containing Agrostis tenuis in the 4
areas. We test the null hypothesis

H0 : pA = pB = pC = pD

against the alternative hypothesis that at least two of these proportions are different.

If H0 is true an estimate of p, the common proportion of quadrats containing Agrostis tenuis, is
p̂ = 152/400 = 0.38, and we would expect Agrostis tenuis to be present in 100 × 152/400 = 38 and absent
in 100 × 248/400 = 62 quadrats from area A. Similarly for the other three areas. Thus the method is
the same as for contingency tables. Here are the expected frequencies and standardised residuals:

Expected frequencies                  Standardised residuals


Area Present Absent Area Present Absent
A 38 62 A -1.95 1.52
B 38 62 B 0.32 -0.25
C 38 62 C 0.16 -0.13
D 38 62 D 1.46 -1.14

The chi-square statistic is χ²stat = (−1.95)² + · · · + (−1.14)² = 9.76 with degrees of freedom ν = 3 × 1 = 3.
From Table 8 the P-value is between 0.01 and 0.025, so there is some (moderate) evidence against
H0 . Note that all of the standardised residuals are between −2 and +2 but one is close to −2. We
conclude that there is some evidence that some of the four proportions differ.

We may also calculate estimates and confidence intervals for proportions of interest and also for
differences between proportions. For example:
For area D: p̂D = 0.47 with standard error se(p̂D) = √(0.47 × 0.53/100) = 0.0499. So a 95% confidence
interval for pD is 0.47 ± 1.96 × 0.0499 = 0.47 ± 0.0978 = (0.372, 0.568).

To estimate the difference in proportions for areas D and A: p̂D − p̂A = 0.47 − 0.26 = 0.21 with
standard error

    se(p̂D − p̂A) = √( p̂D(1 − p̂D)/nD + p̂A(1 − p̂A)/nA ) = √( 0.47 × 0.53/100 + 0.26 × 0.74/100 ) = 0.0664 .

A 95% confidence interval for the difference pD − pA is therefore 0.21 ± 1.96 × 0.0664 = 0.21 ± 0.1302 =
(0.08, 0.34).
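
This confidence-interval arithmetic is easily scripted. A minimal Python sketch:

    # 95% confidence interval for the difference in proportions pD - pA.
    pD, nD = 47 / 100, 100
    pA, nA = 26 / 100, 100

    diff = pD - pA                                         # 0.21
    se = (pD * (1 - pD) / nD + pA * (1 - pA) / nA) ** 0.5  # 0.0664
    print(diff - 1.96 * se, diff + 1.96 * se)              # (0.08, 0.34)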

10 The Normal Linear Regression Model

10.1 Theory

Consider a sample of values of two variables x and y for n individuals. Denote these data by xi , yi for
individual i = 1, 2, . . . , n . Section 2.1 gives formulae for the sample means x̄, ȳ, standard deviations
sx , sy , correlation coefficient rxy , the slope b and intercept a of the least squares regression line y =
a + bx, and for the residual standard deviation about this line sres .

Here we consider a statistical model, in which we imagine that yi is drawn from a Normal population
with mean α + βxi and with standard deviation σ, where α, β and σ are unknown parameters. This is
a model for how the observations yi are related to xi . In such models y is called a response variable
and x is an explanatory variable. Although both y and x are variables, all the regression results are
interpreted as being conditional on the values of x and the only "variable" considered is the variable
y.

Thus, each yi is from a different Normal population: these populations have the same standard
deviation σ but different means; and the means all lie on a straight line y = α + βx (this is the
conditional expectation of y given x). Some text books write this model as an equation

    yi = α + βxi + ei

where e1 , e2 , . . . , en are unobserved random "errors". That is, for a particular xi we can imagine
generating a yi by taking the value α + βxi and adding a random number drawn from a Normal
distribution with mean 0 and standard deviation σ.

Under this model, the slope b and intercept a of the least squares regression line are good estimates
of β and α, and sres is a good estimate of σ. Furthermore, standard errors and confidence intervals
for various parameters can be calculated. The formulae are as follows:

Estimates of slope β, intercept α and error standard deviation σ:

    β̂ = b = Cxy / Cxx ,   α̂ = a = ȳ − b x̄ ,   σ̂ = sres = √( RSS/(n − 2) ) = √( (n − 1)sy²(1 − rxy²) / (n − 2) ) ,

where σ is estimated with ν = n − 2 degrees of freedom.

Standard errors of the estimates of slope β and intercept α are given by:

    se(β̂) = σ̂ / √Cxx ,   se(α̂) = σ̂ √( 1/n + x̄²/Cxx ) ,

where Cxx = Σᵢ (xi − x̄)² = (n − 1)sx², summing over i = 1, . . . , n.

The mean response at a given x is denoted by µx , where µx = α + βx. Thus for a particular value
of x, µx is the average y for individuals with this x. The estimate of µx and its standard error are

    µ̂x = α̂ + β̂x ,   se(µ̂x) = σ̂ √( 1/n + (x − x̄)²/Cxx ) .

Note that α corresponds to the mean response at x = 0.

We can construct confidence intervals for β and for µx by using the t-distribution in exactly the
same way as for the mean of a normal population, except that we use n − 2 degrees of freedom instead
of n − 1. Thus a 95% confidence interval for β is

    β̂ ± t(n−2; 0.025) se(β̂)

and a 95% confidence interval for µx is

    µ̂x ± t(n−2; 0.025) se(µ̂x)

where t(n−2; 0.025) is the upper 2.5 percentage point of the t-distribution with n − 2 degrees of freedom.

10.2 Examples

Example 10.1 Here are some data for 26 babies born in University College Hospital in a
particular week. The babies are all boys of the same race. The data are their birth weights
in gm (y) and gestational ages in weeks x, to the nearest week.
x 42 41 39 40 40 40 39 39 41 42 41 43 42
y 3180 2780 3630 3900 3310 2896 2780 3800 3900 4020 4180 3460 4400

x 41 38 37 38 43 35 37 35 38 40 42 39 34
y 3800 2990 3160 2720 3560 2640 2400 2320 2910 3200 3800 3560 2538

We want to see if the relation between birth weight and gestational age is well described
by the normal linear regression model, and to estimate parameters of interest.

As always, we start by looking at a scatter plot of y against x. Here we regard birth weight as the
response variable and age as the explanatory variable.

[Scatter plot of birth weight (gm) against gestational age (weeks), with the fitted least squares line.]

The model says that weights of babies with gestational age x have a Normal distribution with standard
deviation σ, say, and mean α + βx. This does not look unreasonable, given the small number of babies
in our sample.

You may verify the following calculations: x̄ = 39.46, ȳ = 3301.3, sx = 2.45, sy = 578.3, rxy = 0.7054,
Cxx = 150.4615, Cxy = 25020.31, Cyy = 8361816, a = −3260.8, b = 166.3 and sres = 418.4.

Applying the formulae in §2.1, you may check that the least squares estimates of α and β are α̂ =
−3260.8 gm and β̂ = 166.3 grams per week, and the estimate of σ is σ̂ = 418.4 gm. The line
y = −3260.8 + 166.3x is drawn on the scatter plot and seems to be a good description of how the
average birth weight depends on age. It is not always easy to judge this from the scatter plot, and
it is customary to plot the residuals yi − a − bxi against xi , which makes it easier to see systematic
departures from the model.

[Plot of the residuals (gm) against gestational age (weeks).]

There does not appear to be a detectable systematic pattern: about half of the residuals are positive.
Perhaps there is a suggestion of more scatter at higher x values, which suggests that the assumption
that σ is the same for all x may be questionable, but the sample size is really too small to take this
seriously.

The parameter σ is the standard deviation of birth weights for baby boys with the same gestational
age. This is estimated to be 418 gm, which is quite large — but not as large as sy = 578 gm, the
estimated standard deviation of birth weights for babies with different ages.

The parameter α does not have a physical meaning; the model does not make sense when x = 0, and
anyway we would not expect a straight line relationship to hold for abnormally low gestational ages.

The parameter β is the change in average birth weight when gestational age increases by one week.
The point estimate is β̂ = 166 gm. This change is quite small compared with the standard deviation of
418 gm, which is why there is a lot of overlap between points at neighbouring x-values. The standard
error of β̂ is

    se(β̂) = σ̂ / √Cxx = 418.4 / √150.4615 = 34.11 ,

and the upper 0.025 quantile of the t-distribution with ν = n − 2 = 24 is t(24; 0.025) = 2.064. Hence a
95% confidence interval for β is

    β̂ ± t(24; 0.025) se(β̂) = 166.3 ± 2.064 × 34.11 = (95.9, 236.7) .

Thus the increase in average weight when age increases by one week is estimated to be between 96
and 237 gm (with conventional 95% confidence).

Suppose we want to estimate the average weight of baby boys born at 36 weeks. Our parameter is
µx = α + 36β. The point estimate is µ̂x = −3260.8 + 36 × 166.3 = 2726 gm. This is the point on the
fitted line at x = 36. This estimate has standard error

    se(µ̂x) = σ̂ √( 1/n + (x − x̄)²/Cxx ) = 418.4 × √( 1/26 + (36 − 39.46)²/150.4615 ) = 143.74 ,

and a 95% confidence interval for µx is

    µ̂x ± t(24; 0.025) se(µ̂x) = 2726 ± 2.064 × 143.74 = (2429, 3023) .

Thus the average weight of baby boys born at 36 weeks is estimated to be between 2429 and 3023 gm.
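
All of the calculations in this example can be reproduced from the raw data. An illustrative Python sketch (numpy and scipy are assumptions; they are used only for the arithmetic and the t percentage point):

    import numpy as np
    from scipy.stats import t

    x = np.array([42, 41, 39, 40, 40, 40, 39, 39, 41, 42, 41, 43, 42,
                  41, 38, 37, 38, 43, 35, 37, 35, 38, 40, 42, 39, 34])
    y = np.array([3180, 2780, 3630, 3900, 3310, 2896, 2780, 3800, 3900,
                  4020, 4180, 3460, 4400, 3800, 2990, 3160, 2720, 3560,
                  2640, 2400, 2320, 2910, 3200, 3800, 3560, 2538])
    n = len(x)

    Cxx = np.sum((x - x.mean()) ** 2)
    Cxy = np.sum((x - x.mean()) * (y - y.mean()))

    b = Cxy / Cxx                                            # slope, about 166.3
    a = y.mean() - b * x.mean()                              # intercept, about -3260.8
    s_res = np.sqrt(np.sum((y - a - b * x) ** 2) / (n - 2))  # about 418.4

    se_b = s_res / np.sqrt(Cxx)                  # about 34.11
    t_crit = t.ppf(0.975, n - 2)                 # t(24; 0.025) = 2.064
    print(b - t_crit * se_b, b + t_crit * se_b)  # 95% CI for beta: (95.9, 236.7)

    # Estimated mean weight at x = 36 weeks and its 95% confidence interval.
    x0 = 36
    mu_hat = a + b * x0                                          # about 2726
    se_mu = s_res * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Cxx)  # about 143.7
    print(mu_hat - t_crit * se_mu, mu_hat + t_crit * se_mu)      # (2429, 3023)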
