
# MA12003 Statistics and Probability

Dr Niall Dodds
Ms Ewa Bieniecka
Semester 2, 2017/18

Contents

1 Data Analysis
1.1 Introduction
1.2 Populations and Sampling
1.3 Types of Data
1.4 Data Presentation
1.4.1 Dot Plots, Stem and Leaf Displays
1.4.2 Histograms
1.5 Data Summaries
1.5.1 Measures of Location
1.5.2 Measures of Spread

2 Probability
2.1 Introduction
2.2 Conditional Probability
2.3 Independent Events
2.4 Discrete and Continuous Random Variables

3 Discrete Random Variables
3.1 Introduction
3.2 The Discrete Uniform Distribution
3.3 The Binomial Distribution
3.4 Geometric Distribution
3.5 The Poisson Distribution
3.6 Joint Distribution

4 Continuous Random Variables
Fundamentals of Integration
4.1 Introduction
4.1.1 Cumulative Distribution Functions
4.1.2 Expected Value, Variance, Median
4.1.3 Combining Random Variables
4.2 The Normal Distribution
4.2.1 Finding Areas under Other Normal Curves
4.2.2 Finding Values of z when Probabilities are Given
4.2.3 Combining Normally Distributed Variables

5 Sampling
5.1 Hypothesis Testing
5.1.1 1-Tail and 2-Tail Tests
5.1.2 Central Limit Theorem
5.2 Confidence Intervals
5.3 Linear Regression

Appendices
C Normal Distribution Tables

Chapter 1

Data Analysis

Week 1 Monday

## 1.1 Introduction

Why study statistics? In a broad sense, statistics can provide some input to answering questions such
as:

• What happens?

• How can we make things better?

Typically we deal with situations in which there is some degree of uncertainty or unpredictability.

Examples:

1. Collect data (with care).

2. Analyse data.

3. Present data.

4. Interpret data.

The hope is that eventually we are able to give unambiguous answers to the questions of interest.
In order to understand the meaning of statistical tests and their results, it is important to grasp the
underlying mathematics. This is a main focus of the present module.

Example 1.1.1
Two dental cleansers, A and B, are tested on teeth specimens. The weight loss in mg over the
experiment is:

A 10.2 11.0 9.6 9.8 9.9 10.5 11.2 9.5 10.1 11.8
B 9.6 8.5 9.0 9.8 10.7 9.0 9.5 9.9

Can we say that one cleanser is definitely more abrasive (gives higher weight loss)? We can use
a box plot to illustrate the experimental results but the answer is not clear cut just by looking
at the plot.

[Figure: dental cleanser experimental data; weight loss in mg (vertical axis, 8 to 12) for cleansers A and B.]

—————————————————

Example 1.1.2
A cautionary example. Things are not always as simple as they seem. One class of such
examples is known as Simpson’s Paradox. Examine the following data regarding surgery for gall
stone removal. Two methods are used, OS (standard surgery) and PN (keyhole surgery).

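| | OS small | OS large | OS all | PN small | PN large | PN all |
|---|---|---|---|---|---|---|
| Successful operations | 81 | 192 | 273 | 234 | 55 | 289 |
| Total operations | 87 | 263 | 350 | 270 | 80 | 350 |

| Success rates (%) | small | large | all |
|---|---|---|---|
| OS | 93 | 73 | 78 |
| PN | 87 | 69 | 83 |
| all | 88 | 72 | 80 |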

The results are highly counterintuitive: while the OS method has a higher success rate for both
small and large stones, it has a lower overall success rate.
Two effects combine to cause this:

1. The two groups (large/small) are very different in size for the two methods, and
2. Stone size has a large effect on the success rate.

This example illustrates a reason why we must proceed with great care in order to arrive at
accurate conclusions from statistical data.
—————————————————
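The arithmetic behind Example 1.1.2 can be checked with a short Python sketch. The counts are copied from the table in the example; the dictionary layout and the function names `rate` and `overall` are illustrative choices, not part of the module.

```python
# (successes, total) for each method and stone size, from Example 1.1.2.
counts = {
    "OS": {"small": (81, 87), "large": (192, 263)},
    "PN": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, total):
    """Success rate as a percentage."""
    return 100 * successes / total


def overall(method):
    """Success rate for one method with small and large stones pooled."""
    successes = sum(s for s, _ in counts[method].values())
    total = sum(t for _, t in counts[method].values())
    return rate(successes, total)


for method, groups in counts.items():
    for size, (s, t) in groups.items():
        print(f"{method} {size}: {rate(s, t):.0f}%")
    print(f"{method} all: {overall(method):.0f}%")
```

OS wins within each stone size, yet most OS operations were on large stones (the harder case), so pooling the groups reverses the comparison.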

## 1.2 Populations and Sampling

An important distinction must be made between populations and samples.

Definition
A population is the whole collection of individuals in which we are interested (e.g. people, objects,
flowers). A sample is a subset of the population from which we collect information/data.

We frequently want to know information about an entire population (e.g. voting intentions of everyone
living in Scotland; age of all people in the world). When such information is available (e.g. in a census)
the only problem is then in presenting and interpreting the data in an appropriate way. However,
studying the full population is usually impractical (e.g. due to cost or time limitations) and so to
learn about the population we instead seek to get data from a sample of the population. We must
then try to make sure that the sample gives a good representation of the population in which we are
interested.

Examples

| Study/Question | Population | Sample |
|---|---|---|
| General election voting intention | Voting intentions of all people on the electoral roll in the UK | Voting intentions of a sample of 1000 people chosen at random from the UK population |
| Mass of stars in the Universe | All stars in the visible Universe | Sample of 10,000 observed stars |
| A fair dice? | Results of an infinite number of rolls of the dice in question | Results from a sample of 100 rolls of the dice |
| How long a make of lightbulb lasts | Lifetime of all lightbulbs ever made | Lifetime of 50 test bulbs |

• How large a sample do you need?

• What will data from the sample tell you about the population (which is what you are really
interested in)?

• How accurate will the information be?

How a sample is selected from a population is important and many different objective sampling
methods exist (e.g. simple random, systematic, stratified, quota, cluster, panel). The methods allow
us to develop analytical tools to make inferences about the underlying population. Broad general rules
include:

• A sample should be representative of the population.

• Each member of the population should have the same (or a known) chance of selection (not
simply volunteers).

• Selection should have no element of subjective choice: use a ‘random’ mechanism, e.g. spin a
coin, roll a fair dice, draw cards from a deck, use a computer random number generator, etc.
(Note: here random does not mean the same as haphazard.)

• No outcomes should be favoured or disadvantaged (e.g. if you want to know about spending
habits of adults, carrying out a survey in a shopping centre would not give a good sample).
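As a minimal sketch of the "computer random number generator" option, the following Python draws a simple random sample in which every member has the same chance of selection. The population of 5000 ID numbers, the sample size of 100 and the seed are all hypothetical values chosen for illustration.

```python
import random

# Hypothetical sampling frame: ID numbers standing in for 5000 individuals.
population = list(range(1, 5001))

# A fixed seed is used only so that the sketch gives the same sample each run.
rng = random.Random(2018)

# Simple random sample without replacement: no individual can be picked twice.
sample = rng.sample(population, k=100)

print(sorted(sample)[:10])
```

Because the selection is left entirely to the generator, there is no element of subjective choice, which is the point of the third rule above.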

## 1.3 Types of Data

We will encounter various different types of data in this module. Different types of data are best dealt
with and displayed in different ways, as we will discuss during the following section.

• Qualitative or categorical data is non-numerical data, about some quality or attribute, e.g. gender,
colour of eyes, blood type, shape of box. The data may be

– Ordinal data: about quantities that have a natural ordering, a rank e.g. place in a race,
job title etc. (Note that rank tells us nothing about distance between ranked items even
though a number might be involved.)
– Nominal data: about quantities with no natural ordering e.g. favourite food, type of car.

• Quantitative or numerical data is associated with measurements and counts. The data may be

– Discrete, i.e. separate and distinct (e.g. number of goals in a hockey match). This is
frequency or count data on how many individuals or items fall into a given category (e.g.
number of students with grades A, B and C in a particular module).
– Continuous i.e. able to take on any value, often within some range (e.g. height of a person).
This is Metric data obtained from measurement, e.g. time to complete a race, weight of
sheep, speed of light, etc.

## 1.4 Data Presentation

The purpose of displaying data in different ways is to try to provide some insight into certain
characteristics of the data. The most appropriate way to present or summarise data depends on the type of
data that have been collected.

• Qualitative/Categorical data: nominal and ordinal. Here summaries and displays are chiefly by
means of bar charts and pie charts.

• Quantitative/Numerical data: counts or measurements. The best way to display this type of
data depends on the size of the data set:

– Small to medium sets: dot plot; stem and leaf display.

– Medium to large sets: frequency table; histogram (bar chart); cumulative frequency (or %)
plot.

## 1.4.1 Dot Plots, Stem and Leaf Displays

We illustrate the method with an example. Consider the following data set listing the number of
second hand cars less than 5 years old that are for sale in n = 45 selected car showrooms.

20 47 55 39 32 36 85 17 62 44 64
105 57 76 48 18 31 71 50 33 29 73
48 27 24 117 86 17 32 64 12 50 6
29 20 51 51 161 73 13 25 45 37 68
26

Example 1.4.1

To construct a dot plot, we draw an axis which extends from the minimum value up to the
maximum value. Now work through the data, putting dots in the appropriate places above the
axis to represent data values. If a value is repeated, build up the dots.

[Figure: dot plot, "second hand cars in each showroom"; horizontal axis: number of cars, 0 to 150.]

Note that the symbol you choose is arbitrary (you could also use filled or open circles, or
another symbol of your choice).
The dental cleansers example shows another use of dot plots, illustrating their use for more than
one data set.

For a stem and leaf plot the data is split into two parts: the first digit (or digits), which is
called the stem, and the last digit (or digits), which is called the leaf.
For our car data we choose the tens as the stem (e.g. 3 for data in the 30s) and the final digit as
the leaf (e.g. 7 for the number 37).
To construct a stem and leaf plot:

• Examine the data to identify the stems and leaves. Write the stems in a column.
• Go through the data one value at a time, writing the leaf for each data value beside its
corresponding stem.
• Rearrange the leaves for each stem into ascending order.

A stem and leaf diagram for our car data is:

0 6
1 23778
2 00456799
3 1223679
4 45788
5 001157
6 2448
7 1336
8 56
9
10 5
11 7
12
13
14
15
16 1
—————————————————
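The three construction steps can be sketched in Python as follows. The data are the 45 car counts given above; the function name `stem_and_leaf` and the choice to print empty stems are illustrative assumptions.

```python
from collections import defaultdict

cars = [
    20, 47, 55, 39, 32, 36, 85, 17, 62, 44, 64,
    105, 57, 76, 48, 18, 31, 71, 50, 33, 29, 73,
    48, 27, 24, 117, 86, 17, 32, 64, 12, 50, 6,
    29, 20, 51, 51, 161, 73, 13, 25, 45, 37, 68,
    26,
]


def stem_and_leaf(values):
    """Map each stem (tens) to its leaves (units) in ascending order."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting first keeps each leaf list ordered
        leaves[v // 10].append(v % 10)
    # Keep empty stems too, so gaps in the data stay visible in the display.
    return {stem: leaves.get(stem, []) for stem in range(min(leaves), max(leaves) + 1)}


plot = stem_and_leaf(cars)
for stem, digits in plot.items():
    print(f"{stem:>2} | {''.join(str(d) for d in digits)}")
```

Sorting before splitting replaces the final "rearrange the leaves" step of the hand method.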

There is a subtlety associated with constructing the stem and leaf diagram that regards the choice
of number of stems to take. (A similar issue is associated with intervals in a histogram, as discussed
below.)

## What do these plots show?

The purpose of making such plots is to show at a glance how the data values are spread out or
distributed, e.g.

• Highly concentrated or thinly spread (e.g. the cars in the showroom are concentrated between
about 10 and 50 but thinly spread above about 90).

• The pattern or shape of spread including:

– Number of peaks: unimodal (one peak), bimodal (two peaks), etc.
– Symmetric distribution, or asymmetric distribution with, for example, a long upper tail (or
positive skew).
– Typical or ‘average’ values (this will be discussed in more detail later in this section).

## 1.4.2 Histograms

Histograms are similar in concept to stem and leaf displays, but are more useful for large data sets
(or quantitative/numerical data). A histogram groups the data as for a stem and leaf diagram but
records only the frequency in each group rather than all the leaf digits. That is, it takes a sequence of
bins and counts the number of observations in each bin. A bar is drawn on each bin where the height
relates to the proportion of observations in each bin. The important point to note is:

The area of each block represents the frequency of the outcome.

So, if the bins are all of equal width then the height can be a simple frequency. Otherwise it must be
a frequency density. Sometimes the densities are scaled so that the overall area of all the bars is equal
to 1. A histogram may be drawn for both continuous and discrete data.
Note that bar charts are similar to histograms but appropriate for qualitative/categorical data (e.g.
number of students in Dundee coming from particular countries). For a bar chart the width of each
bar has no meaning.

Example 1.4.2

We illustrate histogram construction using an example.

Consider the following data showing the weight at birth of 50 infants with the condition SIRDS
(severe idiopathic respiratory distress syndrome):

## Birth weights (kg)

1.050 1.175 1.230 1.310 1.500
1.600 1.720 1.750 1.770 2.275
2.500 1.030 1.100 1.185 1.225
1.260 1.295 1.300 1.550 1.820
1.890 1.940 2.200 2.270 2.440
2.560 2.760 1.130 1.575 1.680
1.760 1.930 2.015 2.090 2.600
2.700 2.950 3.160 3.400 3.640
2.830 1.410 1.715 1.720 2.040
2.200 2.400 2.550 2.570 3.005

A common intermediate step is to construct a grouped frequency table. Start by grouping data
in, for example, 0.2kg intervals (or 'cells' or 'bins'). For continuous data, such as this, one has to
decide on an endpoint convention, i.e. what to do about data that lies exactly on the boundary
between two intervals.
For this table we have chosen to go up (i.e. to put an infant with weight 2.200kg in the interval
2.2-2.4kg). For discrete data this endpoint problem does not exist.
Our grouped frequency table is:
Our grouped frequency table is:

Birth weight (kg) Frequency

1.0-1.2 6
1.2-1.4 6
1.4-1.6 4
1.6-1.8 8
1.8-2.0 4
2.0-2.2 3
2.2-2.4 4
2.4-2.6 6
2.6-2.8 3
2.8-3.0 2
3.0-3.2 2
3.2-3.4 0
3.4-3.6 1
3.6-3.8 1

We then draw bars, one for each interval identified, with height corresponding to the frequency
density. For the grouping of the table above our histogram is:

[Figure: Histogram, infant birth weights; frequency (0 to 8) against weight (kg), bins of width 0.2 from 1.0 to 3.8.]

If we make a different frequency grouping then we must adjust the histogram accordingly. For
example:

Birth weight (kg) Frequency
1.0-1.2 6
1.2-1.4 6
1.4-1.6 4
1.6-1.8 8
1.8-2.0 4
2.0-2.2 3
2.2-2.4 4
2.4-2.6 6
2.6-3.2 7
3.2-3.8 2

[Figure: Histogram, infant birth weights, alternative grouping; density (0.0 to 0.8) against weight (kg), bin edges at 1.0, 1.2, ..., 2.6, 3.2, 3.8.]

Note that this second case uses a density scale on the y-axis. How are the heights of the bars
calculated here? The area of each bar represents the proportion of occurrences in its interval. For
weights 1.0-1.2kg there are 6 infants recorded in the total count of 50, a proportion of 6/50 = 0.12.
The width of the strip is 0.2, so we need a bar of height z where 0.2z = 0.12 ⇒ z = 0.6, and
similarly for the other strips.
—————————————————
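The grouped frequency table and the density heights of Example 1.4.2 can be reproduced with the Python sketch below. Working in whole grams is an implementation choice (an assumption of this sketch, not something the notes prescribe) that keeps boundary values such as 2.200kg from being misplaced by floating-point rounding; integer division then sends a boundary value into the upper interval, matching the "go up" convention.

```python
weights = [
    1.050, 1.175, 1.230, 1.310, 1.500, 1.600, 1.720, 1.750, 1.770, 2.275,
    2.500, 1.030, 1.100, 1.185, 1.225, 1.260, 1.295, 1.300, 1.550, 1.820,
    1.890, 1.940, 2.200, 2.270, 2.440, 2.560, 2.760, 1.130, 1.575, 1.680,
    1.760, 1.930, 2.015, 2.090, 2.600, 2.700, 2.950, 3.160, 3.400, 3.640,
    2.830, 1.410, 1.715, 1.720, 2.040, 2.200, 2.400, 2.550, 2.570, 3.005,
]

START_G = 1000   # left edge of the first interval, in grams (1.0 kg)
WIDTH_G = 200    # interval width in grams (0.2 kg)
N_BINS = 14      # intervals 1.0-1.2 up to 3.6-3.8

counts = [0] * N_BINS
for w in weights:
    grams = round(w * 1000)                    # exact integer number of grams
    counts[(grams - START_G) // WIDTH_G] += 1  # boundary values go up

n = len(weights)
# Height on a density scale: proportion divided by interval width (in kg).
densities = [c / n / (WIDTH_G / 1000) for c in counts]

# Merged interval 2.6-3.2 kg from the alternative grouping (width 0.6 kg).
wide_count = sum(counts[8:11])
wide_density = wide_count / n / 0.6

print(counts)
print(round(densities[0], 3), wide_count, round(wide_density, 3))
```

The first bar has height 0.6, as computed by hand above; the merged interval holds 7 infants, so its height is (7/50)/0.6, lower than a frequency scale would suggest because the interval is three times as wide.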

## What do histograms show?

Histograms are most often constructed using a package such as R or Excel which can do most of the
work for you. It is possible to customise the plots in such packages, as briefly discussed above. One
important thing to consider is the size of intervals.

Choice of intervals
How many intervals should we choose?
• Not too many: too many intervals give many bars, each with small width. Some intervals can be
empty and overall the histogram looks noisy/choppy.

• Not too few: too few intervals give wide bars, each averaging out lots of data, so the shape of
the distribution is lost.

Even with a choice of intervals that avoids these two extremes, any particular choice of interval size
and endpoint method will lead to slightly different histogram shapes. In general, small data sets are
more sensitive to these choices so one should bear the size in mind when looking at plots.

Example 1.4.3

As a further example, consider the following data giving the response time in minutes for n = 220
letters:

21.0 17.0 43.0 2.2 12.0 2.0 10.0 17.0 34.0 1.0 27.0 15.0
20.0 5.0 6.0 65.0 22.0 45.0 31.0 7.0 40.7 5.0 17.0 10.0
37.0 67.0 6.0 3.0 63.0 37.0 25.0 60.0 44.0 12.0 13.0 5.5
11.0 25.0 3.0 5.0 25.0 7.0 18.0 22.0 2.0 6.0 4.0 1.0
130.0 40.5 30.5 12.0 20.5 15.0 2.0 1.0 30.0 25.0 22.0 40.0
10.0 5.0 6.0 19.0 15.0 10.0 21.0 65.0 5.2 11.0 6.0 31.0
7.0 45.0 11.0 13.0 30.0 10.0 5.0 40.0 13.0 17.0 1.0 7.0
4.0 9.0 92.0 7.0 19.0 25.0 21.0 63.0 116.0 7.0 64.0 30.0
20.0 12.0 1.0 14.0 7.0 6.75 32.0 30.0 32.0 22.0 10.0 2.0
40.0 5.0 37.0 11.0 21.0 35.0 20.0 30.0 45.0 2.25 29.0 5.5
12.5 39.0 8.0 95.0 1.0 1.0 14.0 8.0 26.0 45.0 14.0 17.0
16.0 36.0 21.0 169.0 7.0 15.0 11.0 11.0 42.0 1.0 17.0 70.0
5.0 20.0 1.0 6.0 4.0 28.0 7.0 25.0 20.0 21.0 12.4 9.0
22.0 5.0 4.0 0.3 0.1 2.8 4.6 5.9 2.5 2.4 1.0 17.0
52.0 27.0 30.0 10.25 1.0 1.0 3.0 2.0 2.0 12.0 107.0 2.0
3.0 17.0 3.0 2.0 11.0 41.0 5.5 32.0 17.0 17.0 11.0 17.0
75.0 15.0 10.5 25.0 28.0 49.0 8.0 4.5 6.4 70.5 0.25 6.0
12.0 5.0 38.0 6.0 5.0 2.0 11.0 22.0 82.0 35.0 37.0 140.0
41.0 38.0 30.0 1.0

[Figure: histogram of the response times; frequency (0 to 80) against time (minutes), 0 to 200.]

Here we have a very asymmetric distribution with a long upper tail. The peak in the distribution
is in the first interval, 0 − 10 minutes.
From the plot it is hard to say what a typical value for the response time is. This leads us to
consider data summaries in the next section. A data summary for this letter time data is given
by:

| n | Mean | Standard Deviation | Minimum | QL | Median | QU | Maximum |
|---|---|---|---|---|---|---|---|
| 220 | 21.94 | 25.11 | 0.1 | 5.7 | 14.0 | 30.0 | 169 |

We will discuss these terms in the following section.

—————————————————

## 1.5 Data Summaries

Diagrams such as histograms are useful summaries of data sets. Often it is also possible or helpful to
give more simplified summaries of the data, talking about its location (including centre value) and its
spread. Various methods to summarise these features have been developed and we discuss these here.

For location, we mostly focus on giving an average or typical level or value, two measures being the
median M and the mean x̄. We can also use lower and upper quartiles (overall dividing the data into
4 chunks).

To classify the amount of spread (or variation) in data we can use the interquartile range (the distance
between the lower and upper quartiles) or the standard deviation, s.

## 1.5.1 Measures of Location

The median, M

By definition, the median splits the ordered data set into two groups of equal size, so

• 50% of values ≥ M

• 50% of values ≤ M

(i.e. M is the 50% point). If n denotes the number of entries in the data set then the median M is
given by:

• n odd: the ((n + 1)/2)th (i.e. middle) ordered data value.

• n even: halfway between the (n/2)th and the (n/2 + 1)th (ordered) data values.

Example 1.5.1

For the dental cleanser data of Example 1.1.1:

• Data set A: n = 10, M = 10.15

• Data set B: n = 8, M = 9.55
—————————————————
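The odd/even rule for the median translates directly into code. The sketch below is a plain-Python illustration (the function name `median` is our own; Python's standard `statistics.median` follows the same rule) applied to the two dental cleanser data sets.

```python
def median(values):
    """Median by the rule above: middle value, or halfway between the middle two."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                   # the ((n + 1)/2)th value (list is 0-indexed)
    return (xs[mid - 1] + xs[mid]) / 2   # halfway between the (n/2)th and (n/2 + 1)th


cleanser_a = [10.2, 11.0, 9.6, 9.8, 9.9, 10.5, 11.2, 9.5, 10.1, 11.8]
cleanser_b = [9.6, 8.5, 9.0, 9.8, 10.7, 9.0, 9.5, 9.9]

print(round(median(cleanser_a), 2), round(median(cleanser_b), 2))
```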

Example 1.5.2

Car showroom data. We have n = 45. The 23rd value in order (either ascending or descending)
will correspond to the Median, M . In fact, M = 45 (read off stem and leaf).
—————————————————

## Quartiles and the five number summary

Quartiles are useful for assessing symmetry. The quartiles are the lower quartile, QL , the median, M ,
and the upper quartile, QU . They split the (ordered) data into four groups of equal size:

• 25% of data values ≤ QL and 75% of data values ≥ QL

• 25% of data values ≥ QU and 75% of data values ≤ QU

i.e. QL is the 25% point; QU is the 75% point.

One can also make generalisations of quartiles to quantiles if specified levels are given (e.g. the 1st
7-quantile is the point where 1/7 of the data is below that value and 6/7 above; the 4th 7-quantile
has 4/7 below and 3/7 above).

For discrete distributions, there is no universal agreement on selecting the quartile values. In this
course, for simplicity, we will pick one of the conventions. What other reasonable conventions can you
think of? Which conventions can you find in Excel?

To find quartiles:

• If n is divisible by 4:
QL : halfway between the (n/4)th and the (n/4 + 1)th values in ascending order,
QU : halfway between the (n/4)th and the (n/4 + 1)th values in descending order.

• If n is not divisible by 4:
Let k be the integer part of n/4 (e.g. if n/4 = 19.5 then k = 19, or if n/4 = 16.75 then k = 16).
Then define:
QL : the (k + 1)th value in ascending order,
QU : the (k + 1)th value in descending order.

Example 1.5.3
1. First we consider a case where n is divisible by 4.
If the data set was {1, 6, 5, 10, 19, 54, 78, 105, 1, 3, 55, 78, 45, 56, 33, 1} then we could sort it as
{1, 1, 1, 3, 5, 6, 10, 19, 33, 45, 54, 55, 56, 78, 78, 105}. There are n = 16 pieces of data so the lower
quartile is halfway between the 4th and 5th entry, i.e. has value QL = 4. The upper quartile is
halfway between the 12th and 13th entry, i.e. QU = 55.5.

2. Now consider a case where n is not divisible by 4.
If the data set was {44, 65, 66, 68, 73, 75, 78, 79, 81, 84} then, since the data is already sorted, we
note that there are 10 entries. Since 10/4 = 2.5 has integer part 2 we take the 3rd entry as the
lower quartile, i.e. QL = 66 and the 3rd from last (the 8th) entry as QU = 79.
—————————————————
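The quartile convention adopted in this course can be written as a short Python function. This is a sketch of that one convention only (the function name `quartiles` is our own); other software, including Excel, may use different conventions and give different values.

```python
def quartiles(values):
    """Return (QL, QU) using the convention described in these notes."""
    xs = sorted(values)
    n = len(xs)
    k = n // 4                    # integer part of n/4
    if n % 4 == 0:
        # Halfway between the (n/4)th and (n/4 + 1)th values, counted from
        # the bottom for QL and from the top for QU (xs is 0-indexed).
        q_l = (xs[k - 1] + xs[k]) / 2
        q_u = (xs[n - k] + xs[n - k - 1]) / 2
    else:
        q_l = xs[k]               # (k + 1)th value in ascending order
        q_u = xs[n - k - 1]       # (k + 1)th value in descending order
    return q_l, q_u


set_1 = [1, 6, 5, 10, 19, 54, 78, 105, 1, 3, 55, 78, 45, 56, 33, 1]
set_2 = [44, 65, 66, 68, 73, 75, 78, 79, 81, 84]

print(quartiles(set_1))
print(quartiles(set_2))
```

The two calls reproduce the values found by hand in Example 1.5.3.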

The five number summary is made up of the quartiles along with the two extreme values, EL , EU ,
which can be used to give a brief, but informative numerical summary of the data.

A box plot is a graphical device for displaying the 5 number summary. This is a box about the lower
and upper quartiles with a line marking the median. Then whiskers (straight lines) extend to the
minimum and maximum values. The plot provides an indication of symmetry in the data.

Definition 1.5.1 (Outlier) Sometimes the whiskers are shortened so that they have a maximum
length of 1.5 times the box length (i.e. 1.5 times the interquartile range). The remaining data are then
marked with points to distinguish them as outliers. [So, the interquartile range, as discussed below, is
QU − QL . You find 1.5 times this value, call that z, say. Then you mark outliers as points that are
greater than QU + z or less than QL − z. Each whisker ends at the most extreme point that is not an outlier.]

Example 1.5.4
Car showroom data: n = 45 ⇒ k = 11
The 12th value in ascending order gives QL = 27.
The 12th value in descending order gives QU = 64.
Also, EL = 6 and EU = 161, giving as our five number summary {6, 27, 45, 64, 161}
Box plot:

[Figure: box plots of the car showroom data (two versions); horizontal axis: number of cars, 0 to 150.]

—————————————————
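As a worked check of the outlier rule in Definition 1.5.1, we can apply it to the car showroom data using the quartiles found in Example 1.5.4 and the ordered values in the stem and leaf diagram:

$$\text{IQR} = Q_U - Q_L = 64 - 27 = 37, \qquad z = 1.5 \times 37 = 55.5,$$

$$Q_L - z = 27 - 55.5 = -28.5, \qquad Q_U + z = 64 + 55.5 = 119.5.$$

No value is below −28.5 and only 161 is above 119.5, so under this rule 161 would be marked as an outlier. The lower whisker would still end at the minimum, 6, while the upper whisker would end at 117, the largest value that is not an outlier.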

Week 1 Tuesday

The mean, x̄

The mean is defined as

$$\bar{x} = \frac{\text{sum of values}}{\text{number of values}} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i .$$

This is the quantity that is often referred to as the ‘average’ in everyday life.

The mean, x̄, is the balance point for the set of data values:

(Imagine data points as point masses and imagine that the axis can rotate. Then x̄ is located at the
point where there is exact balance: i.e. no rotation.)

Example 1.5.5

Recall the data on dental cleansers:

A 10.2 11.0 9.6 9.8 9.9 10.5 11.2 9.5 10.1 11.8 Total = 103.6
B 9.6 8.5 9.0 9.8 10.7 9.0 9.5 9.9 Total = 76.0

To find the mean, x̄, we have to divide the total by the number of data points (in each case):

| | n | $\sum_i x_i$ | x̄ | M |
|---|---|---|---|---|
| A | 10 | 103.6 | 10.36 | 10.15 |
| B | 8 | 76.0 | 9.5 | 9.55 |

Here the mean can be compared with the median, M . They are not the same as each other
(although similar in this example). We discuss the difference between the two in the next section.
—————————————————

For data in a frequency table, with $f_j$ values in the cell with midpoint value $y_j$, $j = 1, \ldots, k$, the mean
is given by

$$\frac{1}{n}\sum_{j=1}^{k} f_j y_j \qquad (1.1)$$

i.e. multiply the number of entries in each class/interval by the midpoint value of the class, add up
these products and divide by n. (This will generally give a different answer from the mean of the
original data, since some of the original information is lost by creating the frequency table.)

Mean vs. Median:

The mean and median are different measures of the centre of data and so, in general, will give different
values. Sometimes the values can be very different.

• When data are symmetric the mean and the median are roughly the same, M ≈ x̄. The
approximation gets better the more symmetric the data is.

• When there is a long upper tail, which we call positive skewness, then M < x̄.

• When there is a long lower tail, which we call negative skewness, then M > x̄.

A property to remember about the mean is that it is not resistant to outliers (i.e. values which do
not seem to fit in with the others), whereas the median is. We illustrate this with an example.

Example 1.5.6
Consider the following weights in pounds of the crew in the 1992 Cambridge vs. Oxford boat
race. Each boat contains 8 rowers and 1 cox.
188.5 183 194.5 185 214 203.5 186 178.5 109 186 184.5 204 184.5 195.5 202.5 174 183 109.5
The entries 109 and 109.5 are the cox weights.

[Figure: histogram of the crew weights; frequency (0 to 8) against weight in lb, 100 to 220.]

The data has two outliers, skewing the data to the left. With these outliers included, the mean
weight is x̄ = 181.4 and the median is M = 185.5. The mean is lowered by the outliers.
If the weights of the rowers alone are considered then x̄ = 190.4 while M = 186.0. The median is still
very similar: it is barely affected by the outliers.
—————————————————
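The figures in Example 1.5.6 can be reproduced with Python's standard `statistics` module. The 150 lb cut-off used to separate the coxes from the rowers is an arbitrary threshold chosen for this sketch; any value between 109.5 and 174 would do.

```python
from statistics import mean, median

crew = [188.5, 183, 194.5, 185, 214, 203.5, 186, 178.5, 109,
        186, 184.5, 204, 184.5, 195.5, 202.5, 174, 183, 109.5]

# Drop the two coxes (109 and 109.5 lb); the 150 lb threshold is illustrative.
rowers = [w for w in crew if w > 150]

print(round(mean(crew), 1), median(crew))      # whole crew
print(round(mean(rowers), 1), median(rowers))  # rowers only
```

Removing the two outliers moves the mean by about 9 lb but the median by only 0.5 lb.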

If the measure of location is to give the value of a typical individual, the median would be preferred.
However, if an estimate of the population total is needed, you would use N x̄, where N is the population
size.

Example 1.5.7
Salary distribution e.g. in a large company

• If we want to know a typical employee salary, then M gives a better value.

• But to get an idea of the total wage bill we would rather know x̄, so that the total wage
bill would be N x̄.
—————————————————

## 1.5.2 Measures of Spread

As well as information on the average location of data, it is often useful to know about the way the
data spreads out about its average. Several measures of this spread have been developed which give
an indication of the degree of variation within a data set.

The range: EU − EL

Problems: depends on extremes (i.e. very volatile); depends on the sample size n (for a given
situation, the more cases you have, the more likely you are to get an outlier); ignores the centre of
the data and how the values are spread between the extremes.
Hence only used in special circumstances, e.g. small n.

## Interquartile range: IQR = QU − QL

A simple concept, used when the median M is used as a measure of location. It gives the distance
over which the central 50% of the data values is spread.

## The (sample) standard deviation, s

Used when the mean x̄ is used as a measure of location. It gives a measure of the average distance
of data values from x̄. We can use knowledge of the mean to construct a measure of spread from the
mean called the standard deviation.

(If you are finding the standard deviation of a data set this is usually the sample standard deviation
– as distinct from the standard deviation of the population (usually denoted σ). Indeed for a data
set where you calculate the mean, you are (usually) calculating the sample mean. The mean of the
overall population may be different. We will discuss these differences later in the course.)

Let di = xi − x̄ be the deviation of the observation xi from the mean x̄ (for i = 1, 2, . . . n), i.e. the
difference of each observation from the mean value. This difference may be positive or negative
depending on whether the observation is greater or less than the mean. The magnitude (absolute
value) of the difference di is denoted |di | and is the distance of the observation xi from the mean x̄.

It can be shown that the sum of all the deviations $d_i$ is zero:

$$\sum_i (x_i - \bar{x}) = \sum_i d_i = 0$$

(see tutorial sheet; this is really the definition of the mean). Because these deviations sum to zero we
can't use their sum as a measure of spread. However, it's clear that the sizes of all the $d_i$ terms tell
us something about the spread.

Consider instead $d_i^2 = (x_i - \bar{x})^2$, the squared deviations (or, equivalently, the squared distances). Their
sum is non-zero unless every data value equals the mean, and so we can use it as a measure of spread.
The average of these squared deviations, taken with divisor $n - 1$, is known as the (sample) variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

The square root of the variance is known as the standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

which is used more often as it has units of x, whereas $s^2$ has units of $x^2$. (Compare it with root mean
square.)

*Non-examinable material*
Two questions naturally arise when looking at this definition:

1. Why square $d_i$ then take the square root? An alternative measure used is the (mean) average
deviation: $\sum_{i=1}^{n} |d_i| / (n-1)$. It is used much less often and has less theory developed for its use,
but may be better in some circumstances (particularly when extreme outlying values are present
in the data).

2. Why n − 1 in the denominator and not n? The natural choice for the denominator would seem
to be the total number of pieces of data, n.

• In fact, if we want to know the spread of data for a whole population then we use the
population standard deviation,

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

In practice, however, we are usually dealing with a small sample of data from the total
population (and trying to infer things about the whole population from that sample: inferential
statistics). It has been shown that when dealing with a small- or medium-sized sample, the
above formula usually under-estimates the spread of data of the whole population. So the
correction of replacing n by n − 1 is applied and can be shown to give a better estimate in
most cases. There is a very complicated formula which can be used, but in practice this
• Deviations around the mean sum to zero. Hence we have total of (n−1) pieces of information
or degrees of freedom.

*End of non-examinable material*

Calculations
Some useful facts to know when calculating s:

• An alternative formula exists for the numerator in $s^2$:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2 \qquad (1.2)$$

• For data grouped in a frequency table:

$$s^2 = \frac{1}{n-1}\sum_{j=1}^{k} f_j (y_j - \bar{y})^2 \qquad (1.3)$$

i.e. act as though all entries in cell j take value $y_j$ (the midpoint value). Alternatively

$$\sum_{j=1}^{k} f_j (y_j - \bar{y})^2 = \left(\sum_{j=1}^{k} f_j y_j^2\right) - n\bar{y}^2 \qquad (1.4)$$

Example 1.5.8

Calculate the variance and standard deviation for the dental cleanser example, cleanser B.

For this data set there are n = 8 observations and the mean is x̄ = 9.5.

Data (loss in mg) 9.6 8.5 9.0 9.8 10.7 9.0 9.5 9.9

Deviations di = (xi − x̄) 0.1 -1.0 -0.5 0.3 1.2 -0.5 0.0 0.4

Sum of squared differences:

$$\sum_{i=1}^{8} d_i^2 = \sum_{i=1}^{8} (x_i - \bar{x})^2 = 0.01 + 1 + 0.25 + 0.09 + 1.44 + 0.25 + 0.0 + 0.16 = 3.20$$

Variance:

$$s^2 = \frac{\sum_{i=1}^{8} (x_i - \bar{x})^2}{n-1} = \frac{3.20}{8-1} = 0.4571$$

Standard deviation:

$$s = \sqrt{0.4571} = 0.68$$

Note that this is the sample standard deviation.

—————————————————
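The calculation in Example 1.5.8 can be checked in Python, computing the sum of squared deviations both directly and through the shortcut formula (1.2). The variable names are our own; `statistics.variance` and `statistics.stdev` are the standard-library sample versions (divisor n − 1) and are used here only as a cross-check.

```python
from math import sqrt
from statistics import mean, stdev, variance

b = [9.6, 8.5, 9.0, 9.8, 10.7, 9.0, 9.5, 9.9]   # cleanser B, loss in mg
n = len(b)
x_bar = mean(b)

# Sum of squared deviations, computed two ways.
ss_direct = sum((x - x_bar) ** 2 for x in b)
ss_shortcut = sum(x * x for x in b) - n * x_bar ** 2   # formula (1.2)

s2 = ss_direct / (n - 1)   # sample variance
s = sqrt(s2)               # sample standard deviation

print(round(ss_direct, 2), round(s2, 4), round(s, 2))
```

Both routes give the same sum of squares up to floating-point rounding; the shortcut is convenient by hand but subtracts two large, nearly equal numbers, so the direct form is usually the safer one on a computer.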

What does the (sample) standard deviation s tell us about the spread of the data – how far numbers
in a data set are away from their average?

Typically, most entries will be somewhere around one standard deviation from the mean. Not many
entries will be more than two or three standard deviations away from the mean.

If the data are reasonably symmetric a rule of thumb is

• 60% – 75% of data values lie within one standard deviation of the mean x̄ ± s.

• > 95% of data values lie within two standard deviations of the mean x̄ ± 2s.

• almost all data values lie within three standard deviations of the mean x̄ ± 3s.

For data that is normally distributed we can be more precise – this will be discussed later in the
module.

We also consider an example for calculating the standard deviation from data in a grouped frequency
table:

Example 1.5.9

Using the data from the birth weight frequency table below, calculate s.

Birth weight (kg)   Frequency
1.0-1.4             12
1.4-1.8             12
1.8-2.2             7
2.2-2.6             10
2.6-3.2             7
3.2-3.8             2

We have n = 50 observations and k = 6 cells/intervals.

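The mean (using the cell midpoints $y_j$) is

$$\bar{y} = \frac{\sum_{j=1}^{6} f_j y_j}{n} = \frac{1.2 \times 12 + 1.6 \times 12 + 2.0 \times 7 + 2.4 \times 10 + 2.9 \times 7 + 3.5 \times 2}{50} = \frac{98.9}{50} = 1.978.$$

We calculate

$$\sum_{j=1}^{6} f_j y_j^2 = 12 \times (1.2)^2 + 12 \times (1.6)^2 + 7 \times (2)^2 + 10 \times (2.4)^2 + 7 \times (2.9)^2 + 2 \times (3.5)^2 = 216.97$$

and so the variance is

$$s^2 = \frac{1}{n-1}\left[\left(\sum_{j=1}^{k} f_j y_j^2\right) - n\bar{y}^2\right] = \frac{1}{49}\left(216.97 - 50 \times 1.978^2\right) = 0.4356$$

and standard deviation

$$s = \sqrt{0.4356} = 0.66 \text{ kg}.$$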

This is an approximation using the frequency table grouping. Since we also have the full data
on birth weights we can calculate the (sample) mean and standard deviation.
Appropriate calculations would give us x̄ = 1.97kg and s = 0.66kg. These are well approximated
by the frequency grouping.
—————————————————
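The grouped calculation of Example 1.5.9 can be reproduced with a short Python sketch. The function name `grouped_stats` and the tuple layout are assumptions made for illustration; the method is formulas (1.3) and (1.4), treating every entry in a cell as if it sat at the cell midpoint.

```python
import math

# (lower bound, upper bound, frequency) for each cell of the birth weight table.
cells = [(1.0, 1.4, 12), (1.4, 1.8, 12), (1.8, 2.2, 7),
         (2.2, 2.6, 10), (2.6, 3.2, 7), (3.2, 3.8, 2)]

def grouped_stats(cells):
    """Return (n, mean, sample variance) using cell midpoints as data values."""
    mids = [(lo + hi) / 2 for lo, hi, _ in cells]
    freqs = [f for _, _, f in cells]
    n = sum(freqs)
    mean = sum(f * y for f, y in zip(freqs, mids)) / n
    sum_fy2 = sum(f * y * y for f, y in zip(freqs, mids))
    # Formula (1.4) for the numerator, then divide by n - 1 as in (1.3).
    var = (sum_fy2 - n * mean ** 2) / (n - 1)
    return n, mean, var

n, mean, var = grouped_stats(cells)
print(n, round(mean, 3), round(var, 4), round(math.sqrt(var), 2))
```

Because the midpoints only approximate the raw data, the result is an approximation to the sample statistics of the full data set.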

Chapter 2

Probability

Week 2 Monday

2.1 Introduction
Probability is given by a number between 0 and 1. It gives a measure of how likely it is for an event
to occur.

Example 2.1.1
Roll a die.

• Probability of 1 is $\frac{1}{6}$.
• Probability of even number is $\frac{1}{2}$.
• Probability of 7 is 0.
• Probability of ‘1,2,3,4,5 or 6’ is 1.
—————————————————

Notation 2.1.1 We use letters A, B, C, . . . for events and write P(A), P(B), P(C), . . . for the probabilities of these events occurring. We use $P(\bar{A}), P(\bar{B}), P(\bar{C}), \ldots$ for the probabilities of events A, B, C, . . . not occurring. The events $A$ and $\bar{A}$ are examples of complementary events.

The events $A$ and $\bar{A}$ are called complementary since exactly one of either $A$ or $\bar{A}$ must occur. Hence:

$$P(A) + P(\bar{A}) = 1 \implies P(\bar{A}) = 1 - P(A)$$

Notation 2.1.2 Let A and B be any events. Then we denote the event of A and B both happening as:

A and B = A ∩ B

Example 2.1.2
The students who are studying Law or Engineering at a particular university are classified by
gender to give the following data:

Female Male TOTAL
Law 468 520 988
Engineering 674 1152 1826
TOTAL 1142 1672 2814

If a student is selected at random from all the Law or Engineering students, what is the proba-
bility that the student selected is
(i) an engineering student?
(ii) female?
(iii) a female engineering student?
(iv) not a female engineering student?
(v) either female or a male engineering student?
(vi) either female or an engineering student?
————————————————————
Let E denote the event ‘engineering student selected’. Let F denote the event ‘female student
selected’.
(i) $P(E) = \frac{1826}{2814} \approx 0.649$
(ii) $P(F) = \frac{1142}{2814} \approx 0.406$
(iii) $P(E \cap F) = [P(E \text{ and } F) =] \frac{674}{2814} \approx 0.240$
(iv) $P(\overline{E \cap F}) = 1 - \frac{674}{2814} \approx 0.760$
To be continued.
—————————————————

Example 2.1.3

Time (minutes)   Frequency     Time (minutes)   Frequency
5 − 10           1             40 − 45          100
10 − 15          3             45 − 50          38
15 − 20          6             50 − 55          16
20 − 25          9             55 − 60          5
25 − 30          42            60 − 65          2
30 − 35          107           65 − 70          1
35 − 40          170

What is the probability that a randomly selected truck has a turnaround time between 15 and
60 minutes?
—————————————
Let A denote the event that the time t satisfies 15 ≤ t ≤ 60 min. Then:

$$P(\bar{A}) = \frac{1+3+2+1}{500} \approx 0.014$$
$$P(A) \approx 1 - 0.014 = 0.986$$

—————————————————

## Mutually Exclusive Events and Venn Diagrams

Example 2.1.2 (continued)
If F occurs then $(\bar{F} \cap E)$ cannot occur. Events F and $(\bar{F} \cap E)$ are mutually exclusive. But if F occurs
then E can also occur, i.e. F and E are not mutually exclusive.
———————————-

Venn Diagram
Parts of a rectangle correspond to events. All possible events and combinations of events are shown.
The two examples below are equivalent:

[Figure: two equivalent Venn diagrams]

Here A and B are not mutually exclusive:

[Figure: Venn diagram of events A and B]

Venn diagrams can be useful for finding formulas connecting different probabilities. Denote the probability
of the event ‘A or B occurring’ as P(A ∪ B). If A and B are mutually exclusive, then the probability
P(A ∪ B) is given by P(A) + P(B). Note that in this case P(A ∩ B) = 0. If A and B are not mutually
exclusive, then the overlap A ∩ B is counted twice in P(A) + P(B), and so P(A ∪ B) is given by P(A) + P(B) − P(A ∩ B).
In general, we have:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Example 2.1.2 (continued)

(v) $P(F \cup (\bar{F} \cap E)) = P(F) + P(\bar{F} \cap E) = \frac{1142+1152}{2814} \approx 0.815$
(vi) $P(F \cup E) = P(E) + P(F) - P(F \cap E) = \frac{1142+1826-674}{2814} \approx 0.815$

De Morgan’s Laws
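The event complementary to ‘A or B’ (A ∪ B) is ‘neither A nor B’. This is the same as ‘not A and
not B’, which can be denoted as $\bar{A} \cap \bar{B}$.
Hence:
$$\overline{A \cup B} = \bar{A} \cap \bar{B}$$

$$1 - P(A \cup B) = P(\bar{A} \cap \bar{B})$$

Now write $C = \bar{A}$ and $D = \bar{B}$, so that $\bar{C} = A$ and $\bar{D} = B$. Then:

$$\overline{C \cap D} = \overline{\bar{A} \cap \bar{B}} = \overline{\overline{A \cup B}} = A \cup B = \bar{C} \cup \bar{D}$$

Thus, as events A and B were arbitrary, for all events C and D we have:

$$\overline{C \cap D} = \bar{C} \cup \bar{D}$$

$$1 - P(C \cap D) = P(\bar{C} \cup \bar{D})$$

These are called De Morgan’s Laws.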

Example 2.1.4

Faults in the production of bolts are classified as either faulty heads or faulty threads.
15% of all bolts produced have faulty heads.
20% of all bolts produced have faulty threads.
10% of all bolts produced have both types of fault.
If a bolt is selected at random, what is the probability that it has
(i) at least 1 fault?
(ii) no fault?
(iii) a good thread but a faulty head?
——————————————–
Let H denote ‘faulty head’. Let T denote ‘faulty thread’. Then:

P(H) = 0.15
P(T ) = 0.2
P(H ∩ T ) = 0.1

(i) $P(H \cup T) = P(H) + P(T) - P(H \cap T) = 0.15 + 0.2 - 0.1 = 0.25$

(ii) $P(\bar{H} \cap \bar{T}) = 1 - P(H \cup T) = 1 - 0.25 = 0.75$
(iii) Consider the following Venn diagram:

[Figure: Venn diagram of events H and T]

$$P(H \cap \bar{T}) = P(H) - P(H \cap T)$$

Therefore:
$$P(H \cap \bar{T}) = P(H) - P(H \cap T) = 0.15 - 0.1 = 0.05$$
—————————————————
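The three answers in Example 2.1.4 follow mechanically from the addition rule and De Morgan's laws. A minimal Python sketch (variable names are illustrative) that also checks that the four disjoint regions of the Venn diagram sum to 1:

```python
# Given probabilities from Example 2.1.4.
p_h = 0.15         # faulty head
p_t = 0.20         # faulty thread
p_h_and_t = 0.10   # both faults

# Addition rule: P(H or T) = P(H) + P(T) - P(H and T).
p_h_or_t = p_h + p_t - p_h_and_t
# De Morgan: "neither fault" is the complement of "at least one fault".
p_neither = 1 - p_h_or_t
# Faulty head with a good thread: remove the overlap from H.
p_h_only = p_h - p_h_and_t
# Faulty thread with a good head, needed to complete the diagram.
p_t_only = p_t - p_h_and_t

# The four disjoint regions of the Venn diagram must cover everything.
total = p_h_and_t + p_h_only + p_t_only + p_neither
print(round(p_h_or_t, 2), round(p_neither, 2), round(p_h_only, 2), round(total, 2))
```

Checking that the regions sum to 1 is a quick way to catch arithmetic slips in problems of this type.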

Week 2 Tuesday

Counting (Permutations)
In this section we will use factorial expressions, e.g. ‘five factorial’ 5! = 5 × 4 × 3 × 2 × 1, ‘six factorial’
6! = 6 × 5 × 4 × 3 × 2 × 1, etc.

Example 2.1.5

How many ways are there to arrange four letters A, B, C and D e.g. ABDC, ACDB, etc.?
—————————————–
For the first position, there are 4 different letters to choose from. For each letter chosen
for position 1, there are 3 letters available for position 2. This gives 4 × 3 possibilities
for the first two positions. Then there are 2 possibilities for position 3, and only one for position
4. This gives a total of 4! = 4 × 3 × 2 × 1 = 24 different arrangements.
—————————————————

Example 2.1.6

If we have n different letters, there are n! ways to arrange them.

—————————————————

Example 2.1.7

How many distinct arrangements of letters A, A, A, B, B are there?

—————————
There are 5! ways of arranging the letters if we made a distinction between all of them. This
would however include the 3! ways of arranging the A's and the 2! ways of arranging the B's, which in practice give
us the same arrangements. Hence, the actual number of distinct strings is given by:

$$\frac{5!}{3! \times 2!} = \binom{5}{3} = \binom{5}{2} = {}^5C_2$$
—————————————————

We call an expression $\binom{n}{k}$ a binomial coefficient or simply n choose k. It describes the number of
combinations (number of unordered sets) of size k, chosen from a set of size n of distinct objects. We
can think of the last example as choosing 3 positions out of 5 for the letters A.
In general:
$$\binom{n}{k} = \frac{n!}{k! \times (n-k)!}$$

Example 2.1.8

How many ways are there to choose 3 letters from a set of 5 distinct letters?
——————————–

We can first think of all possible arrangements of five letters, e.g. ABCDE, ACEBD (5! in total), where
the first three letters are the ones we want to choose. As the order of the letters will not matter
in our final combination, we should not count permutations of the chosen letters (3!) or of the non-chosen
letters (2!) separately. Therefore we need to divide the initial number by 3! × 2!. Hence:

$$\frac{5!}{3! \times 2!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1 \times 2 \times 1} = 10$$
—————————————————
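The counting arguments in Examples 2.1.5, 2.1.7 and 2.1.8 can be confirmed by brute-force enumeration. The sketch below uses Python's standard `itertools` and `math` modules; the variable names are illustrative only.

```python
import math
from itertools import combinations, permutations

# Example 2.1.5: arrangements of four distinct letters.
n_abcd = len(set(permutations("ABCD")))

# Example 2.1.7: distinct arrangements of A, A, A, B, B.
# permutations() treats equal letters as different, so duplicates are removed with set().
n_aaabb = len(set(permutations("AAABB")))

# Example 2.1.8: unordered choices of 3 letters from 5 distinct letters.
n_choose = len(list(combinations("ABCDE", 3)))

print(n_abcd, math.factorial(4))
print(n_aaabb, math.comb(5, 3), math.comb(5, 2))
print(n_choose, math.factorial(5) // (math.factorial(3) * math.factorial(2)))
```

Enumeration is only practical for small cases, but it makes clear why dividing by 3! × 2! removes exactly the repeated arrangements.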

Further Examples

Example 2.1.9

According to a recent University of Dundee Prospectus, there were 11,951 students registered
at the university in the preceding year. A breakdown of these students is given below where
they are categorised by both study level (undergraduate (u/g) or postgraduate (p/g)) and study
status (full time (FT), part time (PT) or distance learning (DL)).

FT PT DL total
u/g 7415 971 400 8786
p/g 655 1136 1374 3165
total 8070 2107 1774 11951

If you select one of these students at random, what are the chances you get
(i) a p/g student, (ii) a PT student, (iii) a PT p/g student, (iv) a PT student or a
p/g student?

(v) Suppose now you select a p/g student at random; what now are the chances that you select
a PT student?
————————————-
(i) $P(p/g) = \frac{3165}{11951} \approx 0.26$
(ii) $P(PT) = \frac{2107}{11951} \approx 0.18$
(iii) $P(p/g \cap PT) = \frac{1136}{11951} \approx 0.095$
(iv) $P(p/g \cup PT) = \frac{3165+971}{11951} \approx 0.35$
(v) $P(PT \mid p/g) = \frac{1136}{3165} \approx 0.36$
—————————————————

The notation used in (v) is called the ‘conditional probability of PT given p/g’. The expression $P(PT \mid p/g)$
gives us the probability that a PT student is selected from the group of p/g students. We have:

$$P(PT \mid p/g) = \frac{P(PT \cap p/g)}{P(p/g)}$$

Example 2.1.10
You roll three dice. What are the chances that at least two of the scores are the same or are
consecutive (e.g. (1,1,1), (1,2,1), (4,2,5), (1,5,4), (5,6,2) or (2,3,2))?
—————————-
Let us call our desired event A. The complementary event $\bar{A}$ can be described as: all numbers
are different and no two numbers are consecutive. All possible combinations (unordered) are:
{1, 3, 5}, {1, 3, 6}, {1, 4, 6}, {2, 4, 6}. The numbers from each of these sets can be arranged in order in
3! = 6 different ways. This means that there are 6 × 4 = 24 permutations in $\bar{A}$. The total number
of outcomes of 3 dice rolls is 6 × 6 × 6 = 216. Therefore:

$$P(\bar{A}) = \frac{24}{216} \implies P(A) = 1 - P(\bar{A}) = 1 - \frac{24}{216} \approx 0.89$$
—————————————————
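Since there are only 216 outcomes, the answer to Example 2.1.10 can be verified by listing them all. The helper name `close_pair` below is an illustrative choice; the sketch counts the rolls in which at least two scores are equal or consecutive.

```python
from itertools import product

def close_pair(roll):
    """True if at least two of the three scores are equal or consecutive."""
    a, b, c = roll
    return any(abs(x - y) <= 1 for x, y in ((a, b), (a, c), (b, c)))

# All ordered outcomes of three dice.
rolls = list(product(range(1, 7), repeat=3))
favourable = sum(close_pair(r) for r in rolls)
p_a = favourable / len(rolls)

print(len(rolls), favourable, len(rolls) - favourable, round(p_a, 2))
```

The count of unfavourable rolls matches the 24 permutations found by the complement argument.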

Example 2.1.11
Cars sold in the UK in 2011 and 2012 were classified by colour as shown in the following table.
(Numbers of cars in each category are given in thousands)

2011 2012
Red 212 183
Blue 443 467
Yellow 19 20
Green 135 122
Black 270 467
White 77 81
Silver 481 589
Other 288 101
TOTAL 1925 2030

(a) What is the probability that a car randomly selected from all the 2011 and 2012 cars is
silver?
(b) What is the probability that a car randomly selected from all the 2012 cars is silver?
(c) If a car is randomly selected from all the 2011 and 2012 cars, what is the probability that it
is a 2012 car?
(d) If a car that is randomly selected from all the 2011 and 2012 cars is silver, what is the
probability that it is a 2012 car?
————————————–
S: a silver car is chosen
2011: a car from the 2011 cars is chosen
2012: a car from the 2012 cars is chosen

(a) $P(S \mid 2011 \cup 2012) = \frac{481+589}{1925+2030} \approx 0.27$
(b) $P(S \mid 2012) = \frac{589}{2030} \approx 0.29$
(c) $P(2012 \mid 2011 \cup 2012) = \frac{2030}{1925+2030} \approx 0.51$
(d) $P(2012 \mid S) = \frac{589}{481+589} \approx 0.55$
—————————————————

Example 2.1.12

In a university class of mathematics students, it is found that 38% studied physics at school and
45% studied German at school and that 26% studied neither.
(a) What percentage of the students studied at least one of physics and German?
(b) What is the probability that a randomly selected student studied German, but not physics?
—————————-
P : students who studied physics
G: students who studied German

We have:
$P(P) = 0.38$
$P(G) = 0.45$
$P(\bar{P} \cap \bar{G}) = 0.26$

(a) $P(P \cup G) = 1 - P(\overline{P \cup G}) = 1 - P(\bar{P} \cap \bar{G}) = 1 - 0.26 = 0.74$

Answer: The percentage of students who studied at least one of physics and German is 74%.

(b) $P(G \cap \bar{P}) = P(G) - P(G \cap P) = P(G \cup P) - P(P) = 0.74 - 0.38 = 0.36$ [see Venn diagram].

Alternatively use the following table:

               $P$      $\bar{P}$    TOTAL
$G$                     0.36         0.45
$\bar{G}$               0.26
TOTAL          0.38     0.62         1

Notice that using a table like the one above greatly simplifies calculations.
Answer: The probability that a randomly selected student studied German but not physics is
0.36.
—————————————————

Week 3 Monday

2.2 Conditional Probability
Example 2.1.9 (v) is an example of conditional probability. The condition is that the student is a
p/g student. Let A represent ‘PT student’ and let B represent ‘p/g student’. The question asked for the probability
that A is true, given that B is true. We write this as P(A | B), and read it as ‘A given B’. In the
solution we used the general result:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Consider the following example:

Example 2.2.1
A poker player is observed to place high bets 25% of the time, but only has a good hand and
bets high 8% of the time. Given that the player has placed a high bet, what is the probability
that they have a good hand?
——————————
Let us denote the events as follows:
A: good hand
B: high bet
Then:
P(B) = 0.25
P(A ∩ B) = 0.08
$P(A \mid B) = ?$
We calculate:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{0.08}{0.25} = 0.32$$

Answer: The probability that the player has a good hand given that they placed a high bet is
0.32.
—————————————————

Example 2.2.2
In a survey of recent stock market share prices it has been noted that 45% of shares which fell in
price last quarter rose this quarter. Also, 32% of all shares fell last quarter. What percentage
of shares fell last quarter and then rose this quarter?
——————————-
Let us denote the events as follows:
A: fell in price last quarter
B: rose in price this quarter

Then:
P(B | A) = 0.45
P(A) = 0.32
P(A ∩ B) =?
We calculate:

$$P(A \cap B) = P(B \mid A) \times P(A) = 0.45 \times 0.32 \approx 0.14$$

Answer: Approximately 14% of shares fell last quarter and then rose this quarter.
—————————————————

Example 2.2.3

A test for a chemical correctly detects that the chemical is present only 98% of the time. The
test never gives a false positive result. A series of tests gives the result that the chemical
is present in 65% of tissue samples. Estimate the proportion of tissue samples that contain the
chemical.
——————————
Let us denote the events as follows:
A: chemical present
B: test positive
Then:
$P(B \mid A) = 0.98$
$P(B \mid \bar{A}) = 0$
$P(B) = 0.65$
$P(A) = ?$
We calculate:

$$0 = P(B \mid \bar{A}) = \frac{P(B \cap \bar{A})}{P(\bar{A})} \implies P(B \cap \bar{A}) = 0$$
$$0.65 = P(B) = P(B \cap A) + P(B \cap \bar{A}) = P(B \cap A)$$
$$P(B \mid A) = \frac{P(B \cap A)}{P(A)} \implies P(A) = \frac{P(B \cap A)}{P(B \mid A)} = \frac{0.65}{0.98} \approx 0.66$$

Answer: Approximately 66% of tissue samples contain the chemical.

—————————————————

Bayes’ Theorem
Note that:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \implies P(A \cap B) = P(A \mid B) \times P(B) \qquad (2.1)$$
Also:

$$P(B \mid A) = \frac{P(B \cap A)}{P(A)} \qquad (2.2)$$
Substituting (2.1) into (2.2) gives Bayes’ Theorem:

$$P(B \mid A) = \frac{P(A \mid B) \times P(B)}{P(A)}$$

Example 2.2.4

It rains on 3 out of 10 days. Forecasters predict rain for the following day half of the time. Given
that forecasters are correct for 85% of days when it does rain, and are correct for 65% of days
when it does not, calculate the probability that:
(a) it will rain given that the forecaster has said it will,
(b) it will rain given that the forecaster has said it will not.
———————————–
Let us denote the events as follows:
R: rain
F : forecast rain
Then:
$P(R) = 0.3$
$P(F) = 0.5$
$P(F \mid R) = 0.85$
$P(\bar{F} \mid \bar{R}) = 0.65$

a) $P(R \mid F) = \frac{P(F \mid R) \times P(R)}{P(F)} = \frac{0.85 \times 0.3}{0.5} = 0.51$

b) $P(R \mid \bar{F}) = \frac{P(\bar{F} \mid R) \times P(R)}{P(\bar{F})} = \frac{(1 - P(F \mid R)) \times P(R)}{1 - P(F)} = \frac{0.15 \times 0.3}{0.5} = 0.09$

Answer: a) There is a 0.51 probability that it will rain if the forecaster said it would, and b)
there is a 0.09 probability that it will rain if the forecaster said it would not.
—————————————————
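Example 2.2.4 can be checked with a few lines of Python. As an extra consistency check that is not part of the notes' solution, the sketch recomputes P(F) by splitting on whether it rains (the law of total probability) and compares it with the quoted value of 0.5. Variable names are illustrative.

```python
p_r = 0.3             # P(R): it rains
p_f_given_r = 0.85    # P(F | R): rain was forecast on a rainy day
p_nf_given_nr = 0.65  # P(not F | not R): no rain forecast on a dry day

# Extra check (not in the notes): P(F) = P(F|R)P(R) + P(F|not R)P(not R).
p_f = p_f_given_r * p_r + (1 - p_nf_given_nr) * (1 - p_r)

# Bayes' theorem for parts (a) and (b).
p_r_given_f = p_f_given_r * p_r / p_f
p_r_given_nf = (1 - p_f_given_r) * p_r / (1 - p_f)

print(round(p_f, 2), round(p_r_given_f, 2), round(p_r_given_nf, 2))
```

The recomputed P(F) agrees with the statement that forecasters predict rain half of the time, so the data in the example are mutually consistent.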

Example 2.2.5

A test for a medical condition that 1 in 20 people have correctly diagnoses the condition in 4/5
of those with it. The probability that an individual who gets a positive result actually has the
condition is 0.3. What is the probability that the test gives a positive result?
——————————–
Let us denote the events as follows:
C: condition present
F : test result positive
Then:
$P(C) = \frac{1}{20}$
$P(F \mid C) = \frac{4}{5}$
$P(C \mid F) = 0.3$
$P(F) = ?$
We calculate:

$$P(F) = \frac{P(F \mid C) \times P(C)}{P(C \mid F)} = \frac{0.8 \times 0.05}{0.3} \approx 0.13$$

Answer: The probability that the test gives a positive result is approximately 0.13.
—————————————————

Week 3 Tuesday

## 2.3 Independent Events

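Events A and B are independent if there is the same chance of A occurring, whether or not B occurs,
i.e.
$$P(A \mid B) = P(A).$$
Recall that
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)},$$
so for independent events A and B we have $P(A \cap B) = P(A) \times P(B)$.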

Example 2.3.1

Toss two coins A and B. $P(A_{head}) = 0.5$ and $P(B_{head}) = 0.5$.

Then $P(A_{head} \cap B_{head}) = 0.5 \times 0.5 = 0.25 = P(A_{head}) \times P(B_{head})$.
Hence the events $A_{head}$ and $B_{head}$ are independent.

—————————————————

Example 2.3.2
There is a queue of cars at a set of traffic lights. Given that 10% of all cars are red, what is the
probability that
(a) The first car is red?
(b) The first and second cars are both red?
(c) Exactly one of the first 2 cars is red?
(d) Neither of the first 2 cars are red?
(e) At least one of the first 10 cars is red?
—————————–
(a) $P(R_1) = 0.1$
(b) $P(R_1 \cap R_2) = 0.1 \times 0.1 = 0.01$
(c) $P((R_1 \cap \bar{R}_2) \cup (\bar{R}_1 \cap R_2))$ = [as these events are mutually exclusive]
$= P(R_1 \cap \bar{R}_2) + P(\bar{R}_1 \cap R_2)$ = [as these events are independent]
$= P(R_1) \times P(\bar{R}_2) + P(\bar{R}_1) \times P(R_2) = 0.1 \times 0.9 + 0.9 \times 0.1 = 0.18$
(d) $P(\bar{R}_1 \cap \bar{R}_2) = 0.9 \times 0.9 = 0.81$
(e) $P(\text{at least one red car out of 10}) = 1 - P(\text{no red cars out of 10}) = 1 - 0.9^{10} \approx 1 - 0.349 = 0.651$
—————————————————

## 2.4 Discrete and Continuous Random Variables

A random variable which takes only a finite or countable number of values is referred to as discrete. A
random variable which can take any value in an interval of R is called continuous.

Example 2.4.1
DISCRETE:

• Roll a die. Possible values: {1, 2, 3, 4, 5, 6}.

• Cars passing through the traffic light. How many cars pass before first red car? Possible
values: {1, 2, 3, . . .}.

CONTINUOUS:

• Growing bacteria. How long before the colony reaches 1cm in diameter? Possible values:
t ≥ 0.
—————————————————

Example 2.4.2

The probability that an electronic component fails within t hours is given by
$$P(t) = 1 - \left(1 + \frac{t}{100}\right) \exp\left(\frac{-t}{100}\right), \quad t > 0.$$

(a) Calculate the probability that the lifetime of a randomly selected component is less than
100 hours.
(b) Calculate the probability that the lifetime of a randomly selected component is greater than
200 hours.
(c) Three components are selected at random. Calculate the probability that at least 1 compo-
nent lasts more than 100 hours.
———————————–
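(a) $P(t < 100) = 1 - \left(1 + \frac{100}{100}\right) \times \exp\left(\frac{-100}{100}\right) = 1 - \frac{2}{e} \approx 1 - 0.7358 = 0.2642$
(b) $P(200 < t) = 1 - P(t < 200) = \left(1 + \frac{200}{100}\right) \times \exp\left(\frac{-200}{100}\right) = \frac{3}{e^2} \approx 0.406$
(c) $P(\text{at least 1 out of 3 lasts} > 100) = 1 - P(\text{all 3 components last} < 100) = 1 - (P(t < 100))^3 \approx 0.9815$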
—————————————————
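The three parts of Example 2.4.2 can be evaluated directly from the failure formula. In the Python sketch below, `fail_within` is an illustrative name for the function P(t) given in the example; part (c) assumes, as the worked solution does, that the three components fail independently.

```python
import math

def fail_within(t):
    """P(component fails within t hours), valid for t > 0."""
    return 1 - (1 + t / 100) * math.exp(-t / 100)

p_a = fail_within(100)           # (a) lifetime less than 100 hours
p_b = 1 - fail_within(200)       # (b) lifetime greater than 200 hours
p_c = 1 - fail_within(100) ** 3  # (c) at least one of three lasts more than 100 hours

print(round(p_a, 4), round(p_b, 3), round(p_c, 4))
```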

Example 2.4.3
Patients referred to an orthopaedic clinic suffering from chronic lower back pain may be suffering
from condition A or condition B. Condition A is cured by painkillers and rest whereas condition
B is cured by painkillers and physiotherapy. A and B are quite difficult to distinguish from
X-rays, and consequently only 70% of patients who have A are correctly diagnosed as having A,
whilst 80% of patients who have B are correctly diagnosed as having B. Of those referred to the
clinic, 40% actually have A and the remaining 60% actually have B.
(a) What proportion of all the patients are given the correct treatment?
(b) What proportion of those being given physiotherapy should have been recommended to rest?
(c) Are the events of having condition A and being diagnosed with condition A independent?
—————————————-
Let us denote the events as follows:
$A$: have condition A
$D$: diagnosed with A
$\bar{A}$: have condition B
$\bar{D}$: diagnosed with B
Then:
$P(D \mid A) = 0.7$
$P(\bar{D} \mid \bar{A}) = 0.8$
$P(A) = 0.4$
$P(\bar{A}) = 0.6$
(a)

$$P((A \cap D) \cup (\bar{A} \cap \bar{D})) = P(A \cap D) + P(\bar{A} \cap \bar{D})$$
$$= P(D \mid A) \times P(A) + P(\bar{D} \mid \bar{A}) \times P(\bar{A})$$
$$= 0.7 \times 0.4 + 0.8 \times 0.6 = 0.28 + 0.48 = 0.76$$

Answer: 76 in 100 patients were given the correct treatment.

(b) Note that we now know that $P(A \cap D) = 0.28$ and $P(\bar{A} \cap \bar{D}) = 0.48$.

$$P(A \mid \bar{D}) = \frac{P(A \cap \bar{D})}{P(\bar{D})} = \frac{P(A) - P(A \cap D)}{P(\bar{D} \cap A) + P(\bar{D} \cap \bar{A})}$$
$$= \frac{P(A) - P(A \cap D)}{(P(A) - P(A \cap D)) + P(\bar{A} \cap \bar{D})} = \frac{0.4 - 0.28}{(0.4 - 0.28) + 0.48} = \frac{0.12}{0.6} = 0.2$$

Answer: Two in ten patients who were given physiotherapy should have been recommended to rest.
(c) $P(D) = 1 - P(\bar{D}) = 1 - 0.6 = 0.4$
$P(D) \times P(A) = 0.4 \times 0.4 = 0.16$
$P(A \cap D) = 0.28 \neq 0.16 = P(D) \times P(A)$
Answer: The events of having condition A and being diagnosed with condition A are not
independent.
—————————————————

Week 4 Monday

Chapter 3

## Discrete Random Variables

3.1 Introduction
We are interested in the probability of a variable X taking any given value x. The set of probabilities
for all possible values is called the probability distribution for X.

Example 3.1.1

Roll 2 dice. Let X be the sum of the two values. What is the probability distribution for X?
—————————
Table of possible events:

Die no. 2
1 2 3 4 5 6
1 (1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
2 (2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
Die no. 1 3 (3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
4 (4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
5 (5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
6 (6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6)

Table for values of X:

Die no. 2
X 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
Die no. 1 3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12

Probability distribution for X is:

x          1    2     3     4     5     6     7     8     9     10    11    12
P(X = x)   0    1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
—————————————————

Summing over all possible values x which X can take gives:

$$\sum_{x} p(x) = 1$$

Expected Value of X

The mean value of X is given by:

$$\underbrace{\mu = \mu_X = E(X)}_{\text{Alternative notations}} = \sum_{x} x\, p(x)$$

Example 3.1.2

Roll 2 dice. Let X be the sum of the two values. We find the mean value for X:

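$$\mu = 2 \times \frac{1}{36} + 3 \times \frac{2}{36} + 4 \times \frac{3}{36} + 5 \times \frac{4}{36} + 6 \times \frac{5}{36} + 7 \times \frac{6}{36} + 8 \times \frac{5}{36} + 9 \times \frac{4}{36} + 10 \times \frac{3}{36} + 11 \times \frac{2}{36} + 12 \times \frac{1}{36}$$

$$= \frac{(2 + 12) \times 1 + (3 + 11) \times 2 + (4 + 10) \times 3 + (5 + 9) \times 4 + (6 + 8) \times 5 + 7 \times 6}{36}$$

$$= \frac{14 \times (1 + 2 + 3 + 4 + 5 + 3)}{36} = \frac{14 \times 18}{36} = 7$$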
—————————————————

Consider data in the frequency table:

X        frequency
$x_1$    $f_1$
$x_2$    $f_2$
⋮        ⋮
$x_n$    $f_n$

We calculate the mean value for X as follows:

$$\mu = \sum_{i=1}^{n} x_i\, p(x_i) = \sum_{i=1}^{n} x_i \times \frac{f_i}{\sum_{i=1}^{n} f_i} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$

## Expected Value of a Function of X

Example 3.1.3

Roll 2 dice. Let X be the sum of the two values. What is the expected value of $X^2$?
————————-
Define a new variable $Y = X^2$. The probability distribution for Y is:

x          2     3     4     5     6     7     8     9     10    11    12
y          4     9     16    25    36    49    64    81    100   121   144
P(Y = y)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

$$E(Y) = E(X^2) = \sum_{y} y\, p(y) = 4 \times \frac{1}{36} + 9 \times \frac{2}{36} + 16 \times \frac{3}{36} + 25 \times \frac{4}{36} + 36 \times \frac{5}{36} + 49 \times \frac{6}{36} + 64 \times \frac{5}{36} + 81 \times \frac{4}{36} + 100 \times \frac{3}{36} + 121 \times \frac{2}{36} + 144 \times \frac{1}{36}$$

$$= \frac{4 + 18 + \ldots + 242 + 144}{36} \approx 54.83$$

Note that:

$E(X) = 7$
$[E(X)]^2 = 49$
$E(X^2) \approx 54.83$

—————————————————

In general
$$[E(X)]^2 \neq E(X^2).$$

If $Y = f(X)$ then:

$$E(Y) = E(f(X)) = \sum_{y} y\, p(y) = \sum_{x} f(x)\, p(x)$$

Variance:

$$\mathrm{Var}(X) = E((X - \mu)^2) = \sum_{x} (x - \mu)^2 p(x)$$

Standard Deviation:
$$\sigma = \sqrt{\mathrm{Var}(X)}$$

The standard deviation gives a quadratic mean distance of the value of X from its mean µ.

Example 3.1.4
Roll 2 dice. Let X be the sum of the two values. What is the variance and standard deviation
of X?
————————–
Recall that µX = 7.

x            2     3     4     5     6     7     8     9     10    11    12
(x − µ)      −5    −4    −3    −2    −1    0     1     2     3     4     5
(x − µ)²     25    16    9     4     1     0     1     4     9     16    25
P(X = x)     1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Hence the variance is given by:

$$\mathrm{Var}(X) = E((X - \mu)^2) = \sum_{x} (x - \mu)^2 p(x) = 25 \times \frac{1}{36} + 16 \times \frac{2}{36} + 9 \times \frac{3}{36} + 4 \times \frac{4}{36} + 1 \times \frac{5}{36} + 0 \times \frac{6}{36} + 1 \times \frac{5}{36} + 4 \times \frac{4}{36} + 9 \times \frac{3}{36} + 16 \times \frac{2}{36} + 25 \times \frac{1}{36}$$

$$= \frac{25 + 32 + 27 + 16 + 5 + 0 + 5 + 16 + 27 + 32 + 25}{36} \approx 5.83.$$

Also, the standard deviation is given by:

$$\sigma = \sqrt{\mathrm{Var}(X)} \approx \sqrt{5.83} \approx 2.42.$$
—————————————————

## Alternative Formula for Variance

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2$$

Example 3.1.5
Roll 2 dice. Let X be the sum of the two values. What is the variance of X?
——————–
Using previous results:

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 \approx 54.83 - (7)^2 = 5.83$$

Note that this gives the same answer as the previous method.

—————————————————
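The results of Examples 3.1.1 to 3.1.5 for the sum of two dice can be reproduced exactly with rational arithmetic. The following Python sketch (names such as `pmf` are illustrative) builds the probability distribution by enumeration and then evaluates the mean, E(X²), and the variance by both formulas.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely outcomes give each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

mean = sum(x * p for x, p in pmf.items())                    # E(X)
e_x2 = sum(x * x * p for x, p in pmf.items())                # E(X^2)
var_def = sum((x - mean) ** 2 * p for x, p in pmf.items())   # E((X - mu)^2)
var_alt = e_x2 - mean ** 2                                   # E(X^2) - [E(X)]^2

print(mean, e_x2, var_def, var_alt, float(var_def) ** 0.5)
```

Using `Fraction` avoids rounding, so the two variance formulas agree exactly (35/6 ≈ 5.83) rather than only approximately.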

*Non-examinable material*
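Proof of $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$.

$$\begin{aligned}
\mathrm{Var}(X) &= E((X - \mu)^2) \\
&= \sum_{x} (x - \mu)^2 p(x) \\
&= \sum_{x} (x^2 - 2x\mu + \mu^2) p(x) \\
&= \sum_{x} x^2 p(x) - \sum_{x} 2x\mu\, p(x) + \sum_{x} \mu^2 p(x) \\
&= \sum_{x} x^2 p(x) - 2\mu \sum_{x} x\, p(x) + \mu^2 \sum_{x} p(x) \\
&= E(X^2) - 2\mu E(X) + \mu^2 \\
&= E(X^2) - 2E(X)E(X) + E(X)^2 \\
&= E(X^2) - E(X)^2
\end{aligned}$$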

## *End of non-examinable material*

Week 4 Tuesday

Linear Function of X
Let X be a random variable and let a and b be constants. Consider $Y = g(X) = a + bX$. Then:

$$E(Y) = a + bE(X)$$

$$\mathrm{Var}(Y) = b^2\, \mathrm{Var}(X)$$

Linear functions g are the only type of function for which $E(g(X)) = g(E(X))$ holds in general.
For the standard deviation of Y we have:

$$\sigma_Y = \sqrt{\mathrm{Var}(Y)} = \sqrt{b^2\, \mathrm{Var}(X)} = \sqrt{b^2} \sqrt{\mathrm{Var}(X)} = |b| \cdot \sigma_X$$

Summary:

$$E(a + bX) = a + bE(X)$$
$$\mathrm{Var}(a + bX) = b^2\, \mathrm{Var}(X)$$
$$\sigma_Y = |b| \cdot \sigma_X$$

*Non-examinable material*

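Proof of $E(a + bX) = a + bE(X)$.

$$\begin{aligned}
E(a + bX) &= \sum_{x} (a + bx) p(x) \\
&= a \sum_{x} p(x) + b \sum_{x} x\, p(x) \\
&= a + bE(X)
\end{aligned}$$

$$\begin{aligned}
\mathrm{Var}(a + bX) &= E(((a + bX) - \mu_{(a+bX)})^2) = E(((a + bX) - (a + b\mu_X))^2) \\
&= E((bX - b\mu_X)^2) \\
&= E((b(X - \mu_X))^2) \\
&= E(b^2 (X - \mu_X)^2) \\
&= b^2 E((X - \mu_X)^2) \\
&= b^2\, \mathrm{Var}(X)
\end{aligned}$$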

## 3.2 The Discrete Uniform Distribution

When all n possible values of a variable X have the same probability, we say that X has a uniform distribution.

$$P(X = k) = \frac{1}{n} \quad \text{for } k \in \{1, 2, 3, \ldots, n\}$$

Example 3.2.1

Roll a die and let X be the outcome of the roll. Then, X has a uniform distribution.

$$P(1) = \frac{1}{6}, \quad P(2) = \frac{1}{6}, \quad \ldots, \quad P(6) = \frac{1}{6}$$
—————————————————

## Mean and Variance of Uniform Distribution

Suppose that X takes values x ∈ {1, 2, 3, . . . , n} with uniform distribution. Then:

$$E(X) = \frac{n+1}{2}$$

$$\mathrm{Var}(X) = \frac{n^2 - 1}{12}$$

*Non-examinable material*

## Proof of formulas for mean and variance for uniform distribution

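$$\begin{aligned}
\mu &= \sum_{k=1}^{n} k\, p(k) \\
&= \sum_{k=1}^{n} k \cdot \frac{1}{n} \\
&= \frac{1}{n} \left( \sum_{k=1}^{n} k \right) \\
&= \frac{1}{n} \cdot \frac{1}{2} n(n+1) = \frac{n+1}{2}
\end{aligned}$$

$$\begin{aligned}
\sigma^2 &= E(X^2) - [E(X)]^2 \\
&= \sum_{k=1}^{n} k^2 p(k) - \left( \frac{n+1}{2} \right)^2 \\
&= \sum_{k=1}^{n} k^2 \cdot \frac{1}{n} - \left( \frac{n+1}{2} \right)^2 \\
&= \frac{1}{n} \left( \sum_{k=1}^{n} k^2 \right) - \left( \frac{n+1}{2} \right)^2 \\
&= \frac{1}{n} \left( \frac{n(n+1)(2n+1)}{6} \right) - \left( \frac{n+1}{2} \right)^2 \\
&= \frac{n^2 - 1}{12}
\end{aligned}$$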
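The closed forms E(X) = (n + 1)/2 and Var(X) = (n² − 1)/12 can be spot-checked against a direct summation. The function name `uniform_mean_var` in this Python sketch is an illustrative choice; exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction

def uniform_mean_var(n):
    """Mean and variance of the uniform distribution on {1, ..., n} by direct summation."""
    p = Fraction(1, n)
    mean = sum(k * p for k in range(1, n + 1))
    var = sum(k * k * p for k in range(1, n + 1)) - mean ** 2
    return mean, var

# Compare with the closed-form expressions for a fair die (n = 6).
mean6, var6 = uniform_mean_var(6)
print(mean6, Fraction(6 + 1, 2))
print(var6, Fraction(6 ** 2 - 1, 12))
```

A numerical check for a handful of values of n is not a proof, but it is a useful guard against algebra slips when simplifying the sums.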

## 3.3 The Binomial Distribution

Example 3.3.1

If 20% of all cars are silver and 3 random cars form a queue at a set of traffic lights, what is the
probability that
(a) none of the cars are silver?
(b) exactly 1 of the cars is silver?
(c) exactly 2 of the cars are silver?
(d) all 3 of the cars are silver?

——————————————–

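In the table below, S denotes a silver car and x a car that is not silver.

Total number of silver cars   Car 1   Car 2   Car 3   Probability
0                             x       x       x       $\binom{3}{0} \times 0.8^3 = 0.512$
1                             S       x       x
                              x       S       x       $\binom{3}{1} \times 0.8^2 \times 0.2 = 0.384$
                              x       x       S
2                             S       S       x
                              S       x       S       $\binom{3}{2} \times 0.8 \times 0.2^2 = 0.096$
                              x       S       S
3                             S       S       S       $\binom{3}{3} \times 0.2^3 = 0.008$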
—————————————————

Suppose that there are 2 possible outcomes (e.g. success/failure) of a single event with probability p
(success) and (1 − p) (failure). Suppose also that the repeated events are independent. Then if there
are n events, the probability of exactly x of these events having ‘successful’ outcomes is given by the
binomial distribution:
$$P(X = x) = \binom{n}{x} \cdot p^x \cdot (1 - p)^{n-x}$$
Notation:

$$X \sim \mathrm{Bi}(n, p)$$

For Example 3.3.1:

$$X \sim \mathrm{Bi}(3, 0.2)$$

Example 3.3.2
If 20% of all cars are silver and 5 random cars form a queue at a set of traffic lights, what is the
probability distribution of variable X denoting number of silver cars in the queue?
—————————————

 
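$$P(X = 0) = \binom{5}{0} \cdot 0.2^0 \cdot 0.8^5 = 0.8^5 \approx 0.33$$
$$P(X = 1) = \binom{5}{1} \cdot 0.2^1 \cdot 0.8^4 \approx 0.41$$
$$P(X = 2) = \binom{5}{2} \cdot 0.2^2 \cdot 0.8^3 \approx 0.20$$
$$P(X = 3) = \binom{5}{3} \cdot 0.2^3 \cdot 0.8^2 \approx 0.05$$
$$P(X = 4) = \binom{5}{4} \cdot 0.2^4 \cdot 0.8^1 \approx 0.006$$
$$P(X = 5) = \binom{5}{5} \cdot 0.2^5 \cdot 0.8^0 \approx 0.0003$$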

—————————————————
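The binomial probabilities in Examples 3.3.1 and 3.3.2 can be generated from the formula directly. A minimal Python sketch, in which `binomial_pmf` is an illustrative helper name built on the standard library's `math.comb`:

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bi(n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Example 3.3.1: three cars, each silver with probability 0.2.
pmf3 = [binomial_pmf(x, 3, 0.2) for x in range(4)]
# Example 3.3.2: five cars, each silver with probability 0.2.
pmf5 = [binomial_pmf(x, 5, 0.2) for x in range(6)]

print([round(q, 3) for q in pmf3])
print([round(q, 4) for q in pmf5])
print(round(sum(pmf5), 10))  # the probabilities of all possible values sum to 1
```

Computing the full list and checking that it sums to 1 is a simple sanity check when tables are not available for the required n and p.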

An alternative to calculating these probabilities using the formula is to use binomial distribution tables
(see Appendix A). Note that tables only exist for some values of n and some values of p.

Example 3.3.3
Assume that the probability of a randomly selected person having a birthday in January is 1/12.
In a class of 20 students, what is the probability that
(a) exactly 3 of the students were born in January?
(b) exactly 2 of the students were born in January?
(c) no more than 3 of the students were born in January?
(d) at least 4 of the students were born in January?
—————————-
Let X denote the number of students born in January. Then $X \sim \mathrm{Bi}(20, \frac{1}{12})$. Note that $p = \frac{1}{12} \approx 0.08$. Using tables:
(a) P(X = 3) ≈ 0.1414
(b) P(X = 2) = 0.2711
(c) P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) ≈
≈ 0.1887 + 0.3282 + 0.2711 + 0.1414 = 0.9294
(d) P(X ≥ 4) = 1 − P(X ≤ 3) ≈ 1 − 0.9294 = 0.0706
—————————————————

## Mean and Variance of Binomial Distribution

Suppose that X ∼ Bi(n, p). Then:

E(X) = np

V ar(X) = np(1 − p)

This means that if we perform an event n number of times, and the probability of success in each
such event is p, then we would on average get np number of successes. While intuitive, note that this
statement requires a proof, which we omit here.

## 3.4 Geometric Distribution

Suppose that there are only 2 possible outcomes of an event with probabilities p (success) and 1 − p
(failure), and the event is repeated independently. The probability of the first success occurring at the x-th event is given
by the geometric distribution Ge(p). Let X denote the number of the event at which the first success occurs.
Then:

$$P(X = x) = (1 - p)^{x-1} p$$

$$P(X > x) = (1 - p)^x$$

Example 3.4.1

Toss a coin. The probability of heads is $\frac{1}{2}$. Let X be the number of the toss which gives the first heads.
What are the probabilities of X being 1, 2, 3?
————————
We observe:
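$$X \sim \mathrm{Ge}\left(\tfrac{1}{2}\right)$$
Therefore:
$$P(X = 1) = \left(1 - \tfrac{1}{2}\right)^0 \left(\tfrac{1}{2}\right) = \tfrac{1}{2}$$

$$P(X = 2) = \left(1 - \tfrac{1}{2}\right)^1 \left(\tfrac{1}{2}\right) = \tfrac{1}{4}$$

$$P(X = 3) = \left(1 - \tfrac{1}{2}\right)^2 \left(\tfrac{1}{2}\right) = \tfrac{1}{8}$$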
—————————————————

Example 3.4.2

Consider a queue of cars. The probability that any given car is silver is 0.2. Let X be the
number of the first silver car in the queue. What are the probabilities: P(X = 1), P(X = 2),
P(X > 3), P(X ≥ 7)?
—————————-

$$P(X = 1) = 0.2$$

$$P(X = 2) = 0.8 \times 0.2 = 0.16$$

$$P(X > 3) = 0.8^3 = 0.512$$

$$P(X \geq 7) = P(X > 6) = 0.8^6 \approx 0.262$$
—————————————————
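The geometric probabilities asked for in Example 3.4.2 can be evaluated as follows. The helper names `geometric_pmf` and `geometric_tail` in this Python sketch are illustrative; they implement the two formulas P(X = x) = (1 − p)^(x−1) p and P(X > x) = (1 − p)^x, and the tail formula is cross-checked by summing the first few probabilities.

```python
def geometric_pmf(x, p):
    """P(X = x): first success occurs on trial x."""
    return (1 - p) ** (x - 1) * p

def geometric_tail(x, p):
    """P(X > x): the first x trials are all failures."""
    return (1 - p) ** x

p = 0.2  # probability that any given car is silver
p_eq_1 = geometric_pmf(1, p)
p_eq_2 = geometric_pmf(2, p)
p_gt_3 = geometric_tail(3, p)
p_ge_7 = geometric_tail(6, p)   # X >= 7 is the same event as X > 6

# Cross-check: P(X > 3) = 1 - P(X = 1) - P(X = 2) - P(X = 3).
tail_by_sum = 1 - sum(geometric_pmf(k, p) for k in range(1, 4))

print(round(p_eq_1, 3), round(p_eq_2, 3), round(p_gt_3, 3), round(p_ge_7, 3))
```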

## Mean and Variance of Geometric Distribution

Suppose that X ∼ Ge(p). Then:

$$E(X) = \frac{1}{p}$$

$$\mathrm{Var}(X) = \frac{1 - p}{p^2}$$

Note that this formula for expected value means that if the probability of success is $\frac{1}{3}$, then it would
take us on average 3 trials to get a success, which is very intuitive. Again, we omit the proof.

Week 5 Monday

## 3.5 The Poisson Distribution

Events occur at points in time/space. The average rate of events occurring is λ events per unit of
time/space. Let X be the number of events in an interval of fixed length T, and let µ = λT be the mean number of events in such an interval. Then:

$$P(X = x) = \frac{\mu^x \cdot e^{-\mu}}{x!}.$$

$$\sigma^2 = \mu.$$

We use the following notation for the Poisson distribution:

X ∼ Po(µ) .

We use the formula or the probability tables (which can be found in Appendix B) to find the proba-
bilities.

Example 3.5.1
If X ∼ Po(1.9), then:

P(X = 2) = 0.2700
P(X = 3) = 0.1710
P(X ≥ 3) = 1 − P(X ≤ 2) = 1 − (P(X = 0) + P(X = 1) + P(X = 2))
= 1 − (0.1496 + 0.2842 + 0.2700) = 0.2962

—————————————————

Example 3.5.2

Assume that the number of cars travelling along a minor road has a Poisson distribution with
mean 4 vehicles per hour. Determine the probability that
(a) exactly 4 vehicles pass in a given hour,
(b) at least 2 vehicles pass in a given 30 minute interval,
(c) no cars pass in at least one out of the next 4 hours.
————————————-
(a) Let X(T ) be the number of vehicles passing in a time interval of length T hours. Our
λ = 4, where λ gives us average number of cars passing every hour. We are interested in
the average corresponding to our chosen time interval, which coincidentally is an hour. Hence,
µ = λ × T = 4 × 1 = 4.
Then, X(1) ∼ Po(4).
Therefore, we are looking for the probability:

P(X = 4) = 0.1954.

(b) This time, our time interval is T = 0.5, and hence µ = 4 × 0.5 = 2 and X(0.5) ∼ Po(2).
Therefore, we are looking for the probability:

$$P(X \geq 2) = 1 - P(X = 0) - P(X = 1) = 1 - 0.1353 - 0.2707 = 0.5940.$$

(c) We treat each of the four hours as a separate event. Moreover, we treat these hours as
independent of each other. Hence, we first observe that:

P(no cars in at least 1 out of 4 hours) = 1 − P(at least 1 car passes in each of the 4 hours).

We want to calculate the probability that for any given hour there is at least one car passing.
Hence we compute µ = 4 × 1 = 4 and denote X ∼ Po(4).
We calculate:

$$P(X \geq 1) = 1 - P(X = 0) = 1 - 0.0183 = 0.9817.$$

Thus we find:

P(at least 1 car passes in each of the 4 hours) = (P(at least one car passes in a given hour))⁴
= 0.9817⁴ ≈ 0.9288.

Using this we find:

P(no cars in at least 1 out of 4 hours) = 1 − P(at least 1 car passes in each of the 4 hours)
= 1 − 0.9288 = 0.0712.

—————————————————
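The three parts of Example 3.5.2 follow the same pattern: rescale the rate to the interval (µ = λT), then apply the Poisson formula. A small sketch under that assumption is given below; the variable names are illustrative. Part (c) is computed without intermediate rounding, so its fourth decimal place may differ from the hand calculation with rounded table values.

```python
import math

def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) for X ~ Po(mu)."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

rate_per_hour = 4.0  # lambda: average number of vehicles per hour

# (a) exactly 4 vehicles in one hour: mu = 4 * 1
p_a = poisson_pmf(4, rate_per_hour * 1.0)

# (b) at least 2 vehicles in 30 minutes: mu = 4 * 0.5
mu_b = rate_per_hour * 0.5
p_b = 1 - poisson_pmf(0, mu_b) - poisson_pmf(1, mu_b)

# (c) no cars in at least one of 4 independent hours
p_car_in_hour = 1 - poisson_pmf(0, rate_per_hour)
p_c = 1 - p_car_in_hour ** 4

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```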

The Poisson distribution gives a good approximation to the binomial distribution if n is large and p is small. In that case we take:

µ = np
Bi(n, p) ≈ Po(np)

Example 3.5.3

Let X ∼ Bi(50, 0.05). Using the formula for the binomial distribution:

$$P(X = 0) = \binom{50}{0} (1 - 0.05)^{50} \approx 0.0769$$

$$P(X = 1) = \binom{50}{1} \cdot 0.05 \cdot (1 - 0.05)^{49} \approx 0.2025$$

Now we approximate this binomial distribution by a Poisson distribution as follows:
µ = np = 50 × 0.05 = 2.5. Consider X′ ∼ Po(2.5). Using the tables for the Poisson distribution:

P(X′ = 0) = 0.0821
P(X′ = 1) = 0.2052.

We observe that the values are not precisely the same but they are close. In the language of
limits, if you fix the product np and take the limit of a given binomial probability for n tending
to infinity, you will obtain an expression for a probability given by the formula for the Poisson
distribution with mean µ = np. Note that this also explains why the variance for Poisson
distribution is given by µ – compare it with the variance for the binomial distribution.
—————————————————
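The comparison in Example 3.5.3 can be repeated for several values of x at once. The sketch below tabulates the exact binomial probabilities next to the Poisson approximation with µ = np; the function names are illustrative only.

```python
import math

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Bi(n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) for X ~ Po(mu)."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

n, p = 50, 0.05
mu = n * p  # mean of the approximating Poisson distribution

# Each row: (x, exact binomial probability, Poisson approximation)
rows = [(x, binomial_pmf(x, n, p), poisson_pmf(x, mu)) for x in range(4)]
for x, exact, approx in rows:
    print(f"x={x}  binomial={exact:.4f}  poisson={approx:.4f}")
```

Increasing n while keeping np = 2.5 fixed (for example n = 500, p = 0.005) brings the two columns closer together, which illustrates the limit described above.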

Week 5 Tuesday

## 3.6 Joint Distribution

Consider a type of object that has two associated variables X and Y. We might be interested in:

a. functions of X and Y,

b. whether or not there is any connection between the values of X and Y.

A linear combination of two random variables X and Y is an expression of the form αX + βY, where
α and β are real constants, i.e. α, β ∈ R. A linear combination of X and Y satisfies:

$$E(\alpha X + \beta Y) = \alpha E(X) + \beta E(Y),$$

$$\mathrm{Var}(\alpha X + \beta Y) = \alpha^2 \mathrm{Var}(X) + \beta^2 \mathrm{Var}(Y) + 2\alpha\beta\,\mathrm{Cov}(X, Y),$$

where the covariance of X and Y is

$$\mathrm{Cov}(X, Y) = E(XY) - E(X) \cdot E(Y).$$

Note that the formula for E(XY) is given by:

$$E(XY) = \sum_{x} \sum_{y} x \cdot y \cdot P(X = x \text{ and } Y = y).$$

We will use the latter formula in Example 3.6.1.

A related quantity is the (Pearson Product-Moment) correlation coefficient:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y},$$

where σX is the standard deviation of X and σY is the standard deviation of Y.

Interpretation:

1. Sign:

a) ρX,Y = 0 means that there is no linear relationship between X and Y ; we say that there is
no correlation; note that we cannot conclude if X and Y are independent or not;
b) ρX,Y > 0 means that generally large X corresponds to large Y , and small X corresponds to
small Y ; we say that the correlation is positive;
c) ρX,Y < 0 means that generally large X corresponds to small Y and vice versa; we say that
the correlation is negative.

2. Magnitude:

a) The Pearson correlation coefficient always takes values in the interval between −1 and 1:

−1 ≤ ρX,Y ≤ 1.

b) We can verbally describe the strength of the correlation using the size of |ρX,Y|:
i) 0.00-0.19 very weak,
ii) 0.20-0.39 weak,
iii) 0.40-0.59 moderate,
iv) 0.60-0.79 strong,
v) 0.80-1.0 very strong.

If X and Y are independent, then:

• ρX,Y = 0,

• Var(αX + βY) = α² Var(X) + β² Var(Y).

Example 3.6.1

The final grade Z (given as a percentage) for a module is calculated using the continuous as-
sessment mark X (given as a percentage) and the exam mark Y (given as a percentage) in the
following way:
Z = 0.4X + 0.6Y.

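A class of students has the following distribution of marks (each entry is the number of students with that pair of marks):

| Y \ X | 40 | 50 | 60 | 70 | 80 | 90 |
|---|---|---|---|---|---|---|
| 30 | 1 | 1 | 1 | 0 | 0 | 0 |
| 40 | 0 | 3 | 2 | 0 | 1 | 0 |
| 50 | 0 | 1 | 5 | 10 | 2 | 2 |
| 60 | 0 | 2 | 6 | 4 | 1 | 2 |
| 70 | 0 | 0 | 0 | 1 | 4 | 1 |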

We are interested in the expected values of X, Y and Z, as well as certain conditional probabilities.

We also want to find out whether there is a correlation between X and Y, namely whether a high continuous assessment mark generally corresponds to a high exam mark.
————————————-
First, we observe that the numbers in the table sum to 50. Hence, in order to obtain probabilities, we need to divide each frequency by 50. The resulting table is called the joint probability mass function.

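| Y \ X | 40 | 50 | 60 | 70 | 80 | 90 | Marginal distribution of Y |
|---|---|---|---|---|---|---|---|
| 30 | 0.02 | 0.02 | 0.02 | 0 | 0 | 0 | 0.06 |
| 40 | 0 | 0.06 | 0.04 | 0 | 0.02 | 0 | 0.12 |
| 50 | 0 | 0.02 | 0.1 | 0.2 | 0.04 | 0.04 | 0.4 |
| 60 | 0 | 0.04 | 0.12 | 0.08 | 0.02 | 0.04 | 0.3 |
| 70 | 0 | 0 | 0 | 0.02 | 0.08 | 0.02 | 0.12 |
| Marginal distribution of X | 0.02 | 0.14 | 0.28 | 0.3 | 0.16 | 0.1 | 1 |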

An entry in column x and row y corresponds to the probability P(X = x and Y = y). The
marginal distributions of X and Y allow us to calculate the following:

$$E(X) = \sum_x x \cdot p(x) = 40 \cdot 0.02 + 50 \cdot 0.14 + 60 \cdot 0.28 + 70 \cdot 0.3 + 80 \cdot 0.16 + 90 \cdot 0.1 = 67.4$$

E(Y) = . . . = 53
E(Z) = E(0.4X + 0.6Y) = 0.4 · E(X) + 0.6 · E(Y) = 58.76

The last of these follows from the properties of linear combinations.

We can also calculate conditional distributions, for example the distribution of X when Y = 50.
It is given by row 3 of the joint probability mass table (the row Y = 50), rescaled to sum to 1. As the sum
of the values in this row is 0.4 = 2/5, this corresponds to dividing all probabilities by 2/5, which is
the same as multiplying them by 5/2 = 2.5:

| x | 40 | 50 | 60 | 70 | 80 | 90 |
|---|---|---|---|---|---|---|
| P(X = x \| Y = 50) | 0 | 0.05 | 0.25 | 0.5 | 0.1 | 0.1 |

Then we can compute:

$$E(X \mid Y = 50) = \sum_x x \cdot P(X = x \mid Y = 50) = 40 \cdot 0 + 50 \cdot 0.05 + 60 \cdot 0.25 + 70 \cdot 0.5 + 80 \cdot 0.1 + 90 \cdot 0.1 = 69.5$$

Alternatively, noting that the numbers in the original frequency table need to be rescaled by
1/50 · 5/2 = 1/20, we can calculate E(X | Y = 50) directly from it:

$$E(X \mid Y = 50) = \frac{1}{20} \cdot (40 \cdot 0 + 50 \cdot 1 + \ldots + 90 \cdot 2) = 69.5$$

In order to calculate the correlation coefficient, let us first compute the covariance of X and Y.
We start from E(XY):

$$E(XY) = \sum_x \sum_y x \cdot y \cdot P(X = x \text{ and } Y = y) = (40 \cdot 30 \cdot 0.02) + (50 \cdot 30 \cdot 0.02) + (60 \cdot 30 \cdot 0.02) + 0 + 0 + 0 + 0 + (50 \cdot 40 \cdot 0.06) + \ldots = 3632$$

$$\mathrm{Cov}(X, Y) = E(XY) - E(X) \cdot E(Y) = 3632 - (67.4) \cdot (53) = 59.8$$

$$\sigma_X = \sqrt{E(X^2) - [E(X)]^2} = \sqrt{(40^2 \cdot 0.02 + 50^2 \cdot 0.14 + \ldots) - (67.4)^2} = \sqrt{4694 - 4542.76} \approx 12.30$$

$$\sigma_Y = \ldots = \sqrt{2914 - 53^2} \approx 10.25$$

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} = \frac{59.8}{12.30 \cdot 10.25} \approx 0.47$$

As ρX,Y > 0 we conclude that there is a positive correlation between the continuous assessment
mark and the exam mark, namely the better the continuous assessment mark, the better the exam mark tends to be.
As |ρX,Y| ≈ 0.47, the correlation can be described as moderate.
—————————————————
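Sums over a joint table are easy to get wrong by hand, so it can be useful to recompute the summaries of Example 3.6.1 straight from the frequency table. The sketch below is one way to do this; the helper `expect` and the variable names are illustrative, not notation from the notes.

```python
import math

x_values = [40, 50, 60, 70, 80, 90]   # continuous assessment marks X
y_values = [30, 40, 50, 60, 70]       # exam marks Y
counts = [                            # rows follow y_values, columns follow x_values
    [1, 1, 1, 0, 0, 0],
    [0, 3, 2, 0, 1, 0],
    [0, 1, 5, 10, 2, 2],
    [0, 2, 6, 4, 1, 2],
    [0, 0, 0, 1, 4, 1],
]
total = sum(map(sum, counts))

# Joint probability mass function P(X = x and Y = y)
joint = {
    (x, y): counts[i][j] / total
    for i, y in enumerate(y_values)
    for j, x in enumerate(x_values)
}

def expect(g):
    """E[g(X, Y)] as a double sum over the joint table."""
    return sum(g(x, y) * prob for (x, y), prob in joint.items())

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
e_z = 0.4 * e_x + 0.6 * e_y
e_xy = expect(lambda x, y: x * y)
cov = e_xy - e_x * e_y
sd_x = math.sqrt(expect(lambda x, y: x * x) - e_x ** 2)
sd_y = math.sqrt(expect(lambda x, y: y * y) - e_y ** 2)
rho = cov / (sd_x * sd_y)

# Conditional expectation E(X | Y = 50): rescale the row Y = 50 to sum to 1
p_y50 = sum(prob for (x, y), prob in joint.items() if y == 50)
e_x_given_y50 = sum(x * prob for (x, y), prob in joint.items() if y == 50) / p_y50

print(e_x, e_y, e_z, e_x_given_y50, e_xy, cov, rho)
```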

Week 6 Tuesday

Chapter 4

## Continuous Random Variables

Fundamentals of Integration
Powers of x
Positive powers:

$$x^n = \underbrace{x \cdot x \cdots x \cdot x}_{n \text{ times}}$$

Negative powers:

$$x^{-n} = \frac{1}{x^n}$$

Example 4.0.1

$$\frac{1}{x^2} = x^{-2}$$

$$\frac{5}{x^3} = 5x^{-3}$$

$$\frac{1}{x} = x^{-1}$$
—————————————————

Fractional powers:

$$(\sqrt[n]{x})^m = x^{\frac{m}{n}}$$

Example 4.0.2

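$$\sqrt{x} = x^{\frac{1}{2}}$$

$$\sqrt[5]{x} = x^{\frac{1}{5}}$$

$$\frac{1}{\sqrt{x}} = \frac{1}{x^{\frac{1}{2}}} = x^{-\frac{1}{2}}$$

$$x^{\frac{2}{3}} = \sqrt[3]{x^2} = (\sqrt[3]{x})^2$$

$$x^{\frac{3}{2}} = \sqrt{x^3} = (\sqrt{x})^3$$

$$x^{-\frac{1}{3}} = \frac{1}{x^{\frac{1}{3}}} = \frac{1}{\sqrt[3]{x}}$$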
—————————————————

## Calculating Areas under Graphs of Functions

1. Consider a function which is a power of x. A general expression for it is x^s. Note that s can be
negative, and it can also be a fraction. The number s cannot be equal to −1, though (i.e. we are not
considering 1/x). We want to calculate the area under the graph of the function x^s.

2. a) For each term x^s, add 1 to the power (to get the power s + 1) and divide the whole expression
by the new power (divide by s + 1). So the entire procedure results in:

$$\frac{x^{s+1}}{s+1}.$$

b) For a constant term c, multiply by x (i.e. if we want to calculate the area under the graph of c,
we first obtain cx).

3. Substitute the values of x at each endpoint of the region whose area we want to compute into
the result from (2), and subtract the resulting values (upper endpoint minus lower endpoint).

Example 4.0.3

x=1 x=1
x2 12 02
Z   
1
y = 2x Area = (2x)dx = 2 · =2· − =2· =1
x=0 2 x=0 2 2 2
—————————————————————————-

58
Z x=3
y=5 Area = (5)dx = [5x]x=3
x=1 = [5 · 3 − 5 · 1] = 15 · 5 = 10
x=1

—————————————————————————-

x=2 x=2
x4 x3
Z 
3 2 3 2
y =x +x +2 Area = (x + x + 2)dx = + + 2x
x=1 4 3
 4  x=1
23
  4
2 1 13 8 1 1
= + +2·2 − + + 2 · 1 = [4 + + 4] − [ + + 2] = . . .
4 3 4 3 3 4 3

—————————————————————————-

59
" 3
#x=4

Z x=4
1 1 x2
y= x+1=x +1
2 Area = (x + 1)dx = 2 ·
2
3 +x =
x=1 2 x=1
x=4  x=4
2 √ 3

2 3
= ·x +x
2 = · ( x) + x =
3 x=1 3
 x=1
2 √ 3 2 √ 3
  
= · ( 4) + 4 − · ( 1) + 1 =
3 3
   
16 2
= +4 − + 1 = ...
3 3

—————————————————————————-

(
1
2 x, 0 ≤ x ≤ 2,
y=
−x4 + 4x − 3, 2 < x,

## Z x=3 Z x=2 Z x=3

Area = (y)dx = (y)dx + (y)dx =
x=0 x=0 x=2
Z x=2 Z x=3
1
= ( x)dx + (−x4 + 4x − 3)dx = . . .
x=0 2 x=2

60
—————————————————————————-

1
y= = x−2
x2

x=∞ x=∞
x=∞ x=∞
x−1
Z Z  
−2 1
Area = (y)dx = (x )dx = = − =
x=1 x=1 −1 x=1 x x=1
   
1 1
= limx→∞ − − − =0+1=1
x 1

What is the area of the region between x=2 and x=1? What about between x=3 and x=2?
—————————————————————————–
Sometimes the function does not look like a sum of powers of x at the first glance. Here are
examples which cannot be integrated directly, but can be expanded in order to integrate as
usual.
Further examples:

1. $y = (x - 3)^2 = x^2 - 6x + 9$

2. $y = \dfrac{1 + 2x^3}{3x^3} = \dfrac{1}{3x^3} + \dfrac{2x^3}{3x^3} = \dfrac{1}{3} x^{-3} + \dfrac{2}{3}$

—————————————————
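The three-step procedure above (raise the power, divide by the new power, subtract the endpoint values) can be written as a short program. This is only a sketch: it assumes each function is given as a list of (coefficient, power) pairs with integer powers other than −1, and the names `antiderivative`, `evaluate` and `area` are illustrative. Exact fractions are used so that no rounding occurs.

```python
from fractions import Fraction

def antiderivative(terms):
    """Power rule: each term c * x**s becomes c / (s + 1) * x**(s + 1)."""
    result = []
    for c, s in terms:
        if s == -1:
            raise ValueError("the power rule does not cover x**-1")
        result.append((Fraction(c) / (s + 1), s + 1))
    return result

def evaluate(terms, x):
    """Value of a sum of terms c * x**s at the point x."""
    return sum(c * Fraction(x) ** s for c, s in terms)

def area(terms, a, b):
    """Area under the graph between x = a and x = b."""
    primitive = antiderivative(terms)
    return evaluate(primitive, b) - evaluate(primitive, a)

area_line = area([(2, 1)], 0, 1)                     # y = 2x on [0, 1]
area_constant = area([(5, 0)], 1, 3)                 # y = 5 on [1, 3]
area_square = area([(1, 2), (-6, 1), (9, 0)], 0, 3)  # y = (x - 3)^2 expanded, on [0, 3]

print(area_line, area_constant, area_square)
```

A constant term c is passed as (c, 0), which reproduces step 2b: the rule turns it into cx.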

Week 7 Monday

## 4.1 Introduction

A continuous random variable (CRV) is a random variable that can take any real value within a
specified range. The distribution of probabilities for a CRV can be expressed using a graph where
areas under the graph represent probabilities.

If the graph is given by y = f(x) for some function f(x), then, for example,

$$P(2 < X < 3) = \int_{2}^{3} f(x) \, dx .$$

Since the probabilities of all possible outcomes always sum to 1, the total area under the
graph is equal to 1. A function f(x) which gives probabilities for X in this way is called a probability
density function (p.d.f.).

Example 4.1.1

For a particular bus route, buses run every 10 minutes. Let T be the random variable for the waiting
time until the next bus. T can take any value between 0 and 10. The waiting time can be modelled
using a uniform distribution between T = 0 and T = 10. The p.d.f. is of the form:

$$f(t) = \begin{cases} k, & 0 \le t \le 10 \\ 0, & \text{otherwise} \end{cases}$$

We need the total area to be equal to 1:

10k = 1 =⇒ k = 0.1

$$f(t) = \begin{cases} 0.1, & 0 \le t \le 10 \\ 0, & \text{otherwise} \end{cases}$$

Then, for example, to find the probability of a waiting time between T = 2 and T = 5, calculate
the area of the rectangle under the graph between t = 2 and t = 5:

P(2 < T < 5) = 0.1 × (5 − 2) = 0.3
—————————————————

Example 4.1.2
The time X, measured in seconds, between consecutive vehicles passing a fixed point on a
motorway is modelled as a continuous random variable with p.d.f.:

$$f(x) = \begin{cases} \dfrac{40 - x}{800}, & 0 \le x \le 40 \\ 0, & \text{otherwise} \end{cases}$$

The probability that the next vehicle passes after 10 but before 25 seconds is:

$$P(10 < X < 25) = \int_{10}^{25} \frac{40 - x}{800} \, dx = \frac{1}{800} \left[ 40x - \frac{x^2}{2} \right]_{10}^{25} = \frac{1}{800} \left[ \left( 40 \times 25 - \frac{625}{2} \right) - \left( 400 - \frac{100}{2} \right) \right] = 0.42$$
—————————————————
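When checking a p.d.f. it helps to confirm numerically that the total area is 1 before computing any probability. The sketch below does this for Example 4.1.2 with a simple midpoint rule; `integrate` is an illustrative helper, not a library routine, and the number of steps is an arbitrary choice.

```python
def pdf(x: float) -> float:
    """p.d.f. of Example 4.1.2: (40 - x) / 800 on [0, 40], zero elsewhere."""
    return (40 - x) / 800 if 0 <= x <= 40 else 0.0

def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the area under f between a and b with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

total_area = integrate(pdf, 0, 40)   # should be 1 for a valid p.d.f.
p_10_25 = integrate(pdf, 10, 25)     # P(10 < X < 25)

print(round(total_area, 6), round(p_10_25, 6))
```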

For continuous random variables, the probability of the random variable taking any particular value
a ∈ R is 0:

P(X = a) = 0

Also:

P(a < X < b) = P(a ≤ X ≤ b)

This does not mean that X = a is impossible. Rather, the probabilities in the
distribution are spread out so thinly (over infinitely many values) that probabilities can only be seen
on intervals.
We can interpret f(x) as relating to the probability of X taking a value close to x.

## 4.1.1 Cumulative Distribution Functions

• The cumulative distribution function (c.d.f.) of a random variable X is the function F(x) defined
such that:

F(x) = P(X ≤ x).

• The value of F(x) must always lie between 0 and 1 since F gives probabilities.

• For a continuous random variable with p.d.f. f:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t) \, dt$$

• The c.d.f. is a continuous function, with

F(x) → 1 as x → ∞
F(x) → 0 as x → −∞

• F(x) = 1 for x ≥ b if b is the upper limit of the possible values for X.

• P(c ≤ X ≤ d) = P(X ≤ d) − P(X ≤ c) = F(d) − F(c)

Example 4.1.3

(Bus waiting time Example 4.1.1 continued.)

The p.d.f. is given by:

$$f(t) = \begin{cases} 0.1, & 0 \le t \le 10 \\ 0, & \text{otherwise} \end{cases}$$

To find the c.d.f.:

For x ≤ 0:

$$F(x) = \int_{-\infty}^{x} f(t) \, dt = \int_{-\infty}^{x} 0 \, dt = 0$$

For 0 ≤ x ≤ 10:

$$F(x) = \int_{-\infty}^{x} f(t) \, dt = \int_{-\infty}^{0} 0 \, dt + \int_{0}^{x} f(t) \, dt = F(0) + \int_{0}^{x} 0.1 \, dt = 0 + [0.1t]_{0}^{x} = 0.1x - 0.1 \cdot 0 = 0.1x$$

For x ≥ 10:

$$F(x) = \int_{-\infty}^{x} f(t) \, dt = \int_{-\infty}^{0} 0 \, dt + \int_{0}^{10} f(t) \, dt + \int_{10}^{x} f(t) \, dt = F(0) + (F(10) - F(0)) + \int_{10}^{x} 0 \, dt = F(10) + 0 = 0.1 \cdot 10 + 0 = 1$$

Hence:

$$F(x) = \int_{-\infty}^{x} f(t) \, dt = \begin{cases} 0, & x \le 0 \\ 0.1x, & 0 \le x \le 10 \\ 1, & x \ge 10 \end{cases}$$

—————————————————

Week 7 Tuesday

Example 4.1.4

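(Vehicles on the motorway Example 4.1.2 continued.) Recall the p.d.f. for this situation:

$$\text{p.d.f.:} \quad f(x) = \begin{cases} \dfrac{40 - x}{800}, & 0 \le x \le 40 \\ 0, & \text{otherwise} \end{cases}$$

$$\text{c.d.f.:} \quad F(x) = \begin{cases} 0, & x \le 0 \\ F(0) + \int_{0}^{x} \frac{40 - t}{800} \, dt, & 0 \le x \le 40 \\ F(40) + \int_{40}^{x} 0 \, dt, & x \ge 40 \end{cases}$$

$$F(0) + \int_{0}^{x} \frac{40 - t}{800} \, dt = 0 + \frac{1}{800} \left[ 40t - \frac{t^2}{2} \right]_{0}^{x} = \frac{1}{800} \left( 40x - \frac{x^2}{2} \right) - 0$$

$$F(40) + \int_{40}^{x} 0 \, dt = \frac{1}{800} \left( 40^2 - \frac{40^2}{2} \right) + 0 = \frac{40^2}{800} \left( 1 - \frac{1}{2} \right) = 2 \cdot \frac{1}{2} = 1$$

$$\text{c.d.f.:} \quad F(x) = \begin{cases} 0, & x \le 0 \\ \dfrac{1}{800} \left( 40x - \dfrac{x^2}{2} \right), & 0 \le x \le 40 \\ 1, & x \ge 40 \end{cases}$$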

—————————————————

Example 4.1.5

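$$\text{p.d.f.:} \quad f(x) = \begin{cases} \frac{3}{4} (3 - x)(x - 1), & 1 \le x \le 3 \\ 0, & \text{otherwise} \end{cases}$$

For 1 ≤ x ≤ 3:

$$\int_{1}^{x} \frac{3}{4} (3 - t)(t - 1) \, dt = \frac{3}{4} \int_{1}^{x} (-t^2 + 4t - 3) \, dt = \frac{3}{4} \left[ \frac{-t^3}{3} + 2t^2 - 3t \right]_{1}^{x}$$

$$= \frac{3}{4} \left( \frac{-x^3}{3} + 2x^2 - 3x \right) - \frac{3}{4} \left( \frac{-1^3}{3} + 2 \cdot 1^2 - 3 \cdot 1 \right) = \frac{-x^3}{4} + \frac{3x^2}{2} - \frac{9x}{4} + 1$$

Thus:

$$\text{c.d.f.:} \quad F(x) = \begin{cases} 0, & x \le 1 \\ \dfrac{-x^3}{4} + \dfrac{3x^2}{2} - \dfrac{9x}{4} + 1, & 1 \le x \le 3 \\ 1, & x \ge 3 \end{cases}$$

The c.d.f. allows us to calculate probabilities, for example:

$$P(X \le 2) = F(2) = -2 + 6 - \frac{9}{2} + 1 = 0.5$$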
—————————————————

## 4.1.2 Expected Value, Variance, Median

Let X be a continuous random variable with p.d.f. f(x). Then the expected value of X is defined to
be:

$$E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx .$$

The variance of X is Var(X) = E(X²) − [E(X)]², where:

$$E(X^2) = \int_{-\infty}^{\infty} x^2 \cdot f(x) \, dx .$$

The median M satisfies P(X ≤ M) = 1/2, so:

$$F(M) = \int_{-\infty}^{M} f(x) \, dx = \frac{1}{2} .$$

Note that this is consistent with the original description of the median for samples. How would
we define the lower and upper quartiles of the random variable X?

Example 4.1.6
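A continuous random variable X has p.d.f. given by

$$f(x) = \begin{cases} 2x, & 0 \le x \le 1 \\ 0, & \text{otherwise.} \end{cases}$$

Find the c.d.f. Find the expected value E(X), variance Var(X), standard deviation σX, and the
median M of X.
———————————————-

For 0 ≤ x ≤ 1:

$$\int_{0}^{x} 2t \, dt = [t^2]_{0}^{x} = x^2 - 0 = x^2$$

$$\text{c.d.f.:} \quad F(x) = \begin{cases} 0, & x \le 0 \\ x^2, & 0 \le x \le 1 \\ 1, & x \ge 1 \end{cases}$$

$$E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx = \int_{0}^{1} x \cdot (2x) \, dx = \int_{0}^{1} 2x^2 \, dx = \left[ \frac{2x^3}{3} \right]_{0}^{1} = \frac{2}{3}$$

$$E(X^2) = \int_{-\infty}^{\infty} x^2 \cdot f(x) \, dx = \int_{0}^{1} x^2 \cdot (2x) \, dx = \int_{0}^{1} 2x^3 \, dx = \left[ \frac{2x^4}{4} \right]_{0}^{1} = \frac{1}{2}$$

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \frac{1}{2} - \left( \frac{2}{3} \right)^2 = \frac{1}{18}$$

$$\sigma_X = \sqrt{\mathrm{Var}(X)} = \frac{1}{3\sqrt{2}}$$

$$F(M) = \frac{1}{2} \implies M^2 = \frac{1}{2} \implies M = \sqrt{0.5} \approx 0.71$$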
—————————————————
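The same quantities can be approximated numerically, which is a useful check when the integrals are harder than in Example 4.1.6. The sketch below uses a midpoint rule for the integrals and bisection on the c.d.f. for the median; the helper names and step counts are illustrative choices.

```python
import math

def pdf(x: float) -> float:
    """p.d.f. of Example 4.1.6: 2x on [0, 1], zero elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0

def integrate(f, a: float, b: float, steps: int = 2_000) -> float:
    """Midpoint-rule approximation of the area under f between a and b."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

mean = integrate(lambda x: x * pdf(x), 0, 1)               # E(X)
second_moment = integrate(lambda x: x * x * pdf(x), 0, 1)  # E(X^2)
variance = second_moment - mean ** 2                       # Var(X) = E(X^2) - [E(X)]^2
std_dev = math.sqrt(variance)

def cdf(x: float) -> float:
    """F(x) = P(X <= x) for 0 <= x <= 1."""
    return integrate(pdf, 0, x)

# Bisection: find M with F(M) = 1/2
low, high = 0.0, 1.0
for _ in range(50):
    middle = (low + high) / 2
    if cdf(middle) < 0.5:
        low = middle
    else:
        high = middle
median = (low + high) / 2

print(round(mean, 4), round(variance, 4), round(std_dev, 4), round(median, 4))
```

Replacing 0.5 in the bisection by 0.25 or 0.75 gives the lower and upper quartiles in the same way.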

## 4.1.3 Combining Random Variables

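Consider n random variables X1, X2, . . . , Xn and a real constant a. Then:

$$E(X_1 + X_2 + \ldots + X_n) = E(X_1) + E(X_2) + \ldots + E(X_n), \qquad E(aX_i) = a E(X_i), \qquad \mathrm{Var}(aX_i) = a^2 \mathrm{Var}(X_i).$$

What does it mean for the standard deviation?

If the random variables X1, X2, . . . , Xn are independent we also have, for i ≠ j:

$$E(X_i X_j) = E(X_i) E(X_j)$$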

Example 4.1.7

In a large company the wage per hour of employees is modelled as a continuous random variable
X with probability distribution given by

$$f(x) = \begin{cases} \dfrac{2500}{x^5}, & x \ge 5, \\ 0, & x < 5, \end{cases}$$

where X is measured in the currency unit.

(a) Find the expected hourly wage.
(b) The annual wage is given by the random variable Y. Assuming all employees are paid for
1920 hours of work a year, find the expected annual wage.
(c) The same company gives an annual bonus that is modelled as a continuous random variable
Z with probability distribution given by

$$g(z) = \begin{cases} \dfrac{1}{1000}, & 1000 \le z \le 2000 \\ 0, & \text{otherwise} \end{cases}$$

Find the expected annual bonus and the expected total pay in a year for an employee at the
company.
——————————————–
(a)

$$E(X) = \int_{5}^{\infty} x \cdot \frac{2500}{x^5} \, dx = \int_{5}^{\infty} 2500 x^{-4} \, dx = 2500 \left[ \frac{x^{-3}}{-3} \right]_{5}^{\infty} = \frac{-2500}{3} \left[ \frac{1}{x^3} \right]_{5}^{\infty} = \frac{-2500}{3} \left( 0 - \frac{1}{5^3} \right) = \frac{20}{3}$$

(b)

$$Y = 1920X \implies E(Y) = E(1920X) = 1920 E(X) = 1920 \cdot \frac{20}{3} = 12800$$

(c)

$$E(Z) = \int_{1000}^{2000} z \cdot \frac{1}{1000} \, dz = \frac{1}{1000} \left[ \frac{z^2}{2} \right]_{1000}^{2000} = \frac{1}{1000} \left( \frac{2000^2}{2} - \frac{1000^2}{2} \right) = \frac{4000}{2} - \frac{1000}{2} = 1500$$

E(Y + Z) = E(Y) + E(Z) = 12800 + 1500 = 14300

—————————————————

Week 8 Monday

## 4.2 The Normal Distribution

The normal distribution gives a symmetrical bell shaped curve that can be fitted to data.

The normal distribution then gives a statistical model (pdf). The normal distribution is not relevant
for a number of sets of data.

The normal distribution only depends on the mean µ and standard deviation σ for the data. The
mean µ gives the position of the maximum of the curve and the standard deviation σ gives a measure
of how spread out the data is.

The formula for the normal curve is too complicated to integrate directly. The tables give values for
the cdf of Standard Normal Distribution. Standard Normal Distribution is a normal distribution for
which µ = 0 and σ = 1. Recall that cdf means that the tables give areas, and so the corresponding
probabilities, up to given z.

Example 4.2.1

Let Z ∼ N(µ = 0, σ² = 1) follow a standard normal distribution. Consider the following
example with z = −0.73:

Find −0.7 in the left-hand column of the table. Find 3 in the top row of the table. The area is the entry
along from −0.7 and below 3, i.e.:

Area = P(Z < −0.73) = 0.2327

—————————————————

Example 4.2.2

A random variable Z has a normal distribution with mean 0 and standard deviation 1. What is
the probability that
(a) Z takes a value less than 1/2,
(b) Z takes a value greater than 2.63,
(c) Z takes a value with |Z| > 1?
————————————————————————————————
(a)
P(Z < 0.5) = 0.6915

(b)
P(Z > 2.63) = P(Z < −2.63) = 0.00427

(c)
P(|Z| > 1) = P(Z > 1) + P(Z < −1) = 2 · 0.1587 = 0.3174

—————————————————

## 4.2.1 Finding Areas under Other Normal Curves

Let X have a normal distribution with mean µX and standard deviation σX. We write this as:

$$X \sim N(\mu_X, \sigma_X^2) .$$

Then:

$$Z = \frac{X - \mu_X}{\sigma_X}$$

has a standard normal distribution with mean µZ = 0 and standard deviation σZ = 1. Note that by
convention we use the letter Z to denote a random variable which has the standard normal distribution.

Example 4.2.3

The time to complete a project is, from previous experience, estimated to be normally distributed
with a mean of 45 weeks and a standard deviation of 5 weeks. Find the probability that the
project will:
(a) be completed within 43 weeks,
(b) be completed within 49 weeks,
(c) take longer than 52 weeks,
(d) take between 41 and 47 weeks.
————————————————————————————————
Let X ∼ N(µ = 45, σ² = 5²) model the number of weeks necessary to complete a project.
(a)

$$P(X < 43) = P\left( \frac{X - \mu}{\sigma} < \frac{43 - 45}{5} \right) = P(Z < -0.4) = 0.3446$$

(b)

$$P(X < 49) = P\left( \frac{X - \mu}{\sigma} < \frac{49 - 45}{5} \right) = P(Z < 0.8) = 0.7881$$

(c)

$$P(52 < X) = P\left( \frac{52 - 45}{5} < \frac{X - \mu}{\sigma} \right) = P(1.4 < Z) = 1 - P(Z < 1.4) = 1 - 0.9192 = 0.0808$$

Alternatively, by the symmetry of the normal distribution:

$$P(52 < X) = P\left( \frac{52 - 45}{5} < \frac{X - \mu}{\sigma} \right) = P(1.4 < Z) = P(Z < -1.4) = 0.0808$$

(d)

$$P(41 < X < 47) = P\left( \frac{41 - 45}{5} < \frac{X - \mu}{\sigma} < \frac{47 - 45}{5} \right) = P(-0.8 < Z < 0.4) = P(Z < 0.4) - P(Z < -0.8) = 0.6554 - 0.2119 = 0.4435$$

—————————————————
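Table look-ups such as those in Example 4.2.3 can be cross-checked with Python's standard library, whose `statistics.NormalDist` class (Python 3.8 and later) provides a `cdf` method. The sketch below is an illustration only; small differences in the fourth decimal place come from rounding z before using the tables.

```python
from statistics import NormalDist

# X ~ N(45, 5^2): time in weeks to complete the project
project = NormalDist(mu=45, sigma=5)

p_a = project.cdf(43)                      # (a) P(X < 43)
p_b = project.cdf(49)                      # (b) P(X < 49)
p_c = 1 - project.cdf(52)                  # (c) P(X > 52)
p_d = project.cdf(47) - project.cdf(41)    # (d) P(41 < X < 47)

# Standardising by hand gives the same answer as working with X directly
z = (43 - 45) / 5
p_a_standardised = NormalDist().cdf(z)     # P(Z < -0.4)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4), round(p_d, 4))
```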

Example 4.2.4
The thickness of manufactured metal plates (intended to be 20mm) is normally distributed with
µ = 20mm and σ = 0.04mm.
(a) What proportion of plates can be expected to be at most 20.10mm thick?
(b) What proportion of plates can be expected to be more than 19.95mm thick?
(c) What proportion of plates can be expected to be within 0.05mm of the target thickness of
20mm?
(d) What value would one have to set as tolerance limits 20 ± c so that the percentage of plates
outside the tolerance limit is 5%?
—————————————————————————————————————————-
Let X ∼ N(20, 0.04²) model the thickness of the manufactured plates.
(a)

$$P(X < 20.10) = P\left( Z < \frac{20.10 - 20}{0.04} \right) = P(Z < 2.5) = 0.99379$$

(b)

$$P(19.95 < X) = P\left( \frac{19.95 - 20}{0.04} < Z \right) = P(-1.25 < Z) = P(Z < 1.25) = 0.8944$$

(c)

$$P(19.95 < X < 20.05) = P\left( \frac{19.95 - 20}{0.04} < Z < \frac{20.05 - 20}{0.04} \right) = P(-1.25 < Z < 1.25) = P(Z < 1.25) - P(Z < -1.25) = 0.8944 - 0.1056 = 0.7888$$

(d) To be continued in Example 4.2.6.
—————————————————

Week 8 Tuesday

## 4.2.2 Finding Values of z when Probabilities are Given

Either use percentage points of normal distribution table, or use the main table in reverse.

Example 4.2.5

Let Z be a random variable with standard normal distribution. Find z ∗ such that:
(a)
P(Z < z ∗ ) = 0.7

(b)
P(Z < z ∗ ) = 0.15

(c)
P(Z < z ∗ ) = 0.56

————————————————————————————————
(a) In the percentage point table, for 0.7 in the left-hand column, the right-hand column gives the value of z∗. Thus:

z∗ = 0.5244

(b) Let z1 be such that:

P(Z < z1) = 0.85.

Hence:
z1 = 1.0364

from the percentage point table. Then from symmetry:

z∗ = −z1 = −1.0364

(c)
We observe that 0.56 is not in the left-hand column of the percentage point table. Therefore we use the main
tables. The closest number to 0.56 is 0.5596. Therefore:

z∗ = 0.15

(Later on we will see that we could be even more exact, namely z∗ = 0.151.)

—————————————————

Example 4.2.6

(This is continuation of Example 4.2.4.) The thickness of manufactured metal plates (intended
to be 20mm) is normally distributed with µ = 20mm and σ = 0.04mm.
(d) What value would one have to set as tolerance limits 20 ± c so that the percentage of plates
outside the tolerance limit is 5%?
—————————————————————————————————-
(d) We need to find c such that P(20 − c < X < 20 + c) = 0.95.

$$P(20 - c < X < 20 + c) = P\left( \frac{-c}{0.04} < Z < \frac{c}{0.04} \right) = P(-25c < Z < 25c) = P(Z < 25c) - P(Z < -25c)$$

$$= P(Z < 25c) - (1 - P(Z < 25c)) = 2 \cdot P(Z < 25c) - 1$$

Hence:
2 · P(Z < 25c) − 1 = 0.95 =⇒ P(Z < 25c) = 0.975.

From the percentage point table, 25c = 1.96, so c = 1.96 × 0.04 = 0.0784.

Therefore, the tolerance limits are 20 ± 0.0784mm.

—————————————————
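Using the tables "in reverse" corresponds to the inverse of the c.d.f. In Python's standard library this is `NormalDist.inv_cdf` (Python 3.8 and later). The following sketch reproduces Examples 4.2.5 and 4.2.6 under that assumption; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # Z ~ N(0, 1)

# Example 4.2.5: find z* with P(Z < z*) equal to a given probability
z_a = standard_normal.inv_cdf(0.70)
z_b = standard_normal.inv_cdf(0.15)
z_c = standard_normal.inv_cdf(0.56)

# Example 4.2.6: tolerance limits 20 +/- c containing 95% of plates,
# i.e. P(Z < c / sigma) = 0.975 with sigma = 0.04
sigma = 0.04
c = sigma * standard_normal.inv_cdf(0.975)

print(round(z_a, 4), round(z_b, 4), round(z_c, 3), round(c, 4))
```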

Values of z with 3 Decimal Places
For the third decimal place, use the "add proportional parts" columns of the table.

Example 4.2.7

(a)
P(Z < 1.123) = 0.8686 + 0.0006 = 0.8692

(b)
P(Z < 0.659) = 0.7422 + 0.0029 = 0.7451

(c)
P(Z < −0.326) = 0.3745 − 0.0022 = 0.3723
—————————————————

## 4.2.3 Combining Normally Distributed Variables

If X is normally distributed with mean µX and variance σX², then we write:

$$X \sim N(\mu_X, \sigma_X^2) .$$

If X ∼ N(µX, σX²) and Y ∼ N(µY, σY²) are independent random variables, then:

$$X + Y \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2) ,$$

and

$$X - Y \sim N(\mu_X - \mu_Y, \sigma_X^2 + \sigma_Y^2) .$$

Notice that we always add variances, even if we need to subtract the means. Notice that this theorem
not only tells us what the mean and variance of X + Y and X − Y are, but also that they are normally
distributed.

Example 4.2.8

A machined rod has to pass through a drilled hole in a component. The machining process
produces rods with a diameter that is normally distributed with a mean of 6mm and a standard
deviation of 0.1mm. The drilled process gives holes with mean diameter 6.1mm and SD 0.05mm.
For what proportion of rod-component pairs will the rod be too big to pass through the hole?
—————————————————————————————————————————-
Let the rod's diameter be modelled by:
X ∼ N(6, 0.1²)

Let the hole's diameter be modelled by:

Y ∼ N(6.1, 0.05²)

We need to find the probability that, for a randomly selected rod and a randomly selected hole,
the hole's diameter is smaller than the rod's diameter, namely P(Y < X) = P(0 < X − Y). Since X and Y are independent:

X − Y ∼ N(6 − 6.1, 0.1² + 0.05²) = N(−0.1, 0.0125) ≈ N(−0.1, 0.112²).

Therefore:

$$P(0 < X - Y) = P\left( \frac{0 - (-0.1)}{0.112} < Z \right) = P(0.893 < Z) = P(Z < -0.893) = 0.1867 - 0.0008 = 0.1859.$$

Hence, 0.1859 is the proportion of the rod-component pairs for which the rod is too big to pass.
—————————————————

Example 4.2.9

Consider 1 litre and 2 litre bottles of milk. Let X be the random variable for the volume of a 1l
bottle (in ml), with the volume of each bottle being independent and X ∼ N(1006, 3²). Similarly
for 2l bottles, let the random variable Y denote the volume of milk in ml, being independent
with Y ∼ N(2008, 4²).
(a) Assume you buy two 1 litre bottles. What is the probability of getting more than 2 litres in
total?
(b) Now assume you buy one 2l and two 1l bottles. What is the probability there is more milk
in the 2l bottle than in the 1l bottles?
(c) Now assume you buy one 2l and one 1l bottle. What is the probability that there is more
milk in the 2l bottle than twice the contents of the 1l bottle?
—————————————————————————————————————————-
(a) Let X1 ∼ N(1006, 3²) denote the first picked bottle and X2 ∼ N(1006, 3²) denote the second
picked bottle. Now we want to know:

P(2000 < X1 + X2)

The random variable X1 + X2 has mean µX + µX = 2012 and variance σX² + σX² = 3² + 3² = 18,
and hence standard deviation √18:

$$X_1 + X_2 \sim N(2012, (\sqrt{18})^2)$$

Therefore:

$$P(2000 < X_1 + X_2) = P\left( \frac{2000 - 2012}{\sqrt{18}} < Z \right) = P(-2.828 < Z) = P(Z < 2.828) = 0.99760 + 0.00006 = 0.99766$$

(b) Let Y ∼ N(2008, 4²) represent a randomly picked 2l bottle, and let X1 + X2 ∼ N(2012, (√18)²)
be the sum of two randomly picked 1l bottles, as in (a). Then we are interested in:

P(X1 + X2 < Y) = P(0 < Y − (X1 + X2)).

The random variable Y − (X1 + X2) is normally distributed with mean µY − 2 · µX = 2008 − 2012 =
−4 and variance σY² + 2 · σX² = 4² + 18 = 34. Hence:

$$Y - (X_1 + X_2) \sim N(-4, (\sqrt{34})^2)$$

Thus:

$$P(0 < Y - (X_1 + X_2)) = P\left( \frac{0 - (-4)}{\sqrt{34}} < Z \right) = P(0.686 < Z) = P(Z < -0.686) = 0.2483 - 0.0019 = 0.2464$$

(c) This question is different from (b), as instead of two independently selected 1l bottles we now
need to consider double the volume of one bottle. Let X ∼ N(1006, 3²) represent the randomly
picked 1l bottle and Y ∼ N(2008, 4²) represent a randomly picked 2l bottle. We are looking to
find:
P(2X < Y) = P(2X − Y < 0).

First consider the random variable 2X. It is a multiple of a random variable, and hence E(2X) =
2E(X) and Var(2X) = 2² Var(X). A multiple of a normally distributed variable is also a normally
distributed variable. Hence:

$$2X \sim N(2 \cdot \mu_X, 2^2 \cdot \sigma_X^2) = N(2 \cdot 1006, 2^2 \cdot 3^2) = N(2012, 6^2).$$

Notice that this is different from the distribution of X1 + X2. In particular, this one has a bigger
variance. Now:

2X − Y ∼ N(2012 − 2008, 4² + 6²) = N(4, 52).

Hence:

$$P(2X - Y < 0) = P\left( Z < \frac{0 - 4}{\sqrt{52}} \right) = P(Z < -0.555) = 0.2912 - 0.0017 = 0.2895.$$
—————————————————
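The rule "add the variances, combine the means with their signs" extends to any linear combination of independent normal variables, where each variance is multiplied by the square of its coefficient. The sketch below applies it to Example 4.2.9. The helper `combine` is illustrative and assumes the variables are independent; passing the same distribution twice stands for two separately chosen bottles, not for one bottle counted twice.

```python
import math
from statistics import NormalDist

def combine(terms):
    """Distribution of sum(a_i * X_i) for independent normal X_i.

    terms: list of (coefficient a_i, NormalDist of X_i).
    The mean is sum(a_i * mu_i); the variance is sum(a_i^2 * sigma_i^2).
    """
    mean = sum(a * dist.mean for a, dist in terms)
    variance = sum(a * a * dist.variance for a, dist in terms)
    return NormalDist(mean, math.sqrt(variance))

one_litre = NormalDist(1006, 3)   # volume of a 1l bottle in ml
two_litre = NormalDist(2008, 4)   # volume of a 2l bottle in ml

# (a) two separate 1l bottles hold more than 2000 ml in total
total_two = combine([(1, one_litre), (1, one_litre)])
p_a = 1 - total_two.cdf(2000)

# (b) one 2l bottle holds more than two separate 1l bottles: Y - X1 - X2 > 0
diff_b = combine([(1, two_litre), (-1, one_litre), (-1, one_litre)])
p_b = 1 - diff_b.cdf(0)

# (c) one 2l bottle holds more than twice one 1l bottle: 2X - Y < 0
diff_c = combine([(2, one_litre), (-1, two_litre)])
p_c = diff_c.cdf(0)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```

Comparing `total_two.variance` with the variance of 2X shows the difference between parts (b) and (c): 18 for two separate bottles against 36 for one bottle doubled.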

Week 9 Monday

Chapter 5

Sampling

Large parent population: mean µ, variance σ² (GREEK LETTERS).

A sample is a smaller subset of the parent population: mean x̄, variance s² (ROMAN LETTERS).
x̄ and s² are often used to estimate µ and σ².

## 5.1 Hypothesis Testing

A sample can be used to test whether or not a suspected fact is true.

Example 5.1.1
Let X be a random variable representing the result of throwing a die. We want to find out whether
the die is biased, in particular whether P(X = 6) > 1/6.
The NULL HYPOTHESIS is the expected outcome, denoted by H0. (Here H0 : P(X = 6) = 1/6.)
The ALTERNATIVE HYPOTHESIS is H1. (Here H1 : P(X = 6) > 1/6.)
—————————————————

A sample is unlikely to match the null hypothesis exactly. However, given a null hypothesis, if the
probability of the sample taking a certain range of values is below a certain small level, then we
reject H0.
This small level of probability is called the SIGNIFICANCE LEVEL, usually given as a percentage.
The significance level is the probability of rejecting the null hypothesis, given that it is true.
Method:

1. Choose a significance level.
2. Take a sample.
3. State the null hypothesis H0 and the alternative hypothesis H1.
4. Assuming H0, calculate the probability of a result at least as extreme as the one observed in the sample.
5. If P(property|H0) < significance level, then reject H0.

Example 5.1.2
Biased die with P(X = 6) > 1/6.

1. Choose 1% significance level (probability 0.01).
2. Roll the die 20 times. Results:

5 1 6 5 2 1 2 1 4 4 6 6 6 1 6 4 6 4 2 6

3. H0 : P(X = 6) = 1/6.
H1 : P(X = 6) > 1/6.
4. For the sample we get 7 sixes out of 20 rolls. Given H0, let Y ∼ Bi(20, 1/6) model the
predicted behaviour (number of sixes out of 20 throws) and consider the probability given
by the binomial distribution P(Y ≥ 7) = 0.0375.
Note that hypothesis testing considers results (for the sample) that are at least as extreme
as the observed results.
5. 0.0375 > 0.01. Therefore accept H0 and conclude that the die is not biased.
—————————————————

Example 5.1.3
An average of 50% of people¹ taking the UK driving test pass it. A complaint is made that a
particular examiner, Mr Smith, is too lenient on candidates. Unknown to Mr Smith his work
is observed and it is found that he passes 16 out of 20 tested candidates. Are the complaints
justified at the 5% significance level?
———————————————————-
Let p=probability that Mr Smith passes a candidate.
H0 : p=0.5
H1 : p > 0.5
Consider binomial distribution Bi(n = 20, p = 0.5).
P(At least 16 out of 20 passed) = 0.0059
0.0059 < 0.05
Therefore reject H0 . We have a reason to consider that Mr Smith is too lenient.
————————————————————–
In this example 16 out of 20 led us to reject H0. What is the least number of passes out of 20
that would lead us to reject H0?
————————————————————–
Use X ∼ Bi(n = 20, p = 0.5).

P(X ≥ 15) = P(X ≥ 16) + P(X = 15) = 0.0207 < 0.05

P(X ≥ 14) = P(X ≥ 15) + P(X = 14) = 0.0577 > 0.05

15 out of 20 passes is the least number leading to rejection of H0 at the 5% significance level.

¹ Actually, in 2012-13 it was 47.4%.

—————————————————
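The one-tail calculations in Examples 5.1.2 and 5.1.3 amount to summing binomial probabilities in the upper tail. A minimal sketch follows; the function names are illustrative. Values computed directly from the formula may differ in the last decimal place from values read off rounded tables.

```python
import math

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Bi(n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k): probability of a result at least as extreme as k."""
    return sum(binomial_pmf(x, n, p) for x in range(k, n + 1))

def least_rejecting_count(n: int, p: float, alpha: float):
    """Smallest k with P(X >= k) < alpha, or None if there is none."""
    for k in range(n + 1):
        if upper_tail(k, n, p) < alpha:
            return k
    return None

# Example 5.1.3: 16 passes out of 20 under H0: p = 0.5, tested at 5%
p_value_examiner = upper_tail(16, 20, 0.5)
reject_examiner = p_value_examiner < 0.05
threshold = least_rejecting_count(20, 0.5, 0.05)

# Example 5.1.2: 7 sixes out of 20 under H0: p = 1/6, tested at 1%
p_value_die = upper_tail(7, 20, 1 / 6)
reject_die = p_value_die < 0.01

print(round(p_value_examiner, 4), threshold, reject_die)
```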

## 5.1.1 1–Tail and 2–Tail Tests

The previous two examples were 1–tail tests where we were only interested in one direction of bias.
(Too many passes, not too few passes). The following is an example of a 2–tail test.

Example 5.1.4
Testing if a coin is biased.
Let p be the probability of heads.
H0 : p = 0.5
H1 : p ≠ 0.5
Experiment: toss the coin 20 times. Obtain 4 heads and 16 tails.
We want to find the probability that an event at least this extreme happens under H0. Using
X ∼ Bi(n = 20, p = 0.5):

P(X ≤ 4) + P(X ≥ 16) = 0.0059 + 0.0059 = 0.0118

0.0118 > 0.01, therefore at the 1% significance level we conclude that the coin is not biased.
0.0118 < 0.05, therefore at the 5% significance level we conclude that the coin is biased.
—————————————————

Week 9 Tuesday

Example 5.1.5
On a long-forgotten island, now rediscovered, 15 birds were found. The explorer wonders whether
they might be the same bird as described in legend where 1/4 are female and 3/4 male. How
many females are required in the 15 birds if you are to conclude they are not the same species
at the 10% level?
————————————
Let p be the probability of a bird being female.
H0 : p = 0.25. This hypothesis states that it is the same species.
H1 : p ≠ 0.25. This hypothesis states that it is not the same species.
We observe that in such a situation a 2–tail test is appropriate, as our alternative hypothesis
does not state whether a higher or a lower ratio is expected. Therefore, we will be looking for so-called
critical values, namely values a and b such that:

$$P(X \le a) \le \frac{0.10}{2} = 0.05$$

$$P(X \ge b) \le \frac{0.10}{2} = 0.05$$

If in our sample we get a value less than or equal to a, or at least b, it leads us to reject
the null hypothesis. Note that the significance level of 10% splits for the 2–tail test, so that each
tail gets 5%.
Consider X ∼ Bi(15, 0.25) modelling the number of females in a sample of 15, under the
assumption of H0.

P(X ≤ 0) = P(X = 0) = 0.0134 < 0.05
P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.0802 > 0.05
P(X ≥ 7) = 0.0566 > 0.05
P(X ≥ 8) = 0.0173 < 0.05

Hence, our lower critical value is a = 0 and our upper critical value is b = 8.
Therefore, if X = 0 or X ≥ 8, we reject H0 and conclude that it cannot be the same
species.
—————————————————
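Critical values for a 2–tail binomial test can be found by accumulating probabilities from each end until half of the significance level would be exceeded. The sketch below implements that search; `two_tail_critical_values` is an illustrative name, and the function returns None for a tail in which no value qualifies.

```python
import math

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Bi(n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def two_tail_critical_values(n: int, p: float, alpha: float):
    """Return (a, b): the largest a with P(X <= a) <= alpha / 2 and
    the smallest b with P(X >= b) <= alpha / 2."""
    pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
    half = alpha / 2

    lower = None
    cumulative = 0.0
    for x in range(n + 1):             # accumulate from the left tail
        cumulative += pmf[x]
        if cumulative <= half:
            lower = x
        else:
            break

    upper = None
    tail = 0.0
    for x in range(n, -1, -1):         # accumulate from the right tail
        tail += pmf[x]
        if tail <= half:
            upper = x
        else:
            break

    return lower, upper

# Example 5.1.5: X ~ Bi(15, 0.25), 10% significance level split as 5% per tail
a, b = two_tail_critical_values(15, 0.25, 0.10)
print(a, b)
```

For a symmetric case such as Bi(20, 0.5) the two critical values are mirror images of each other, whereas for Bi(15, 0.25) the tails have quite different lengths.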

Compare the application of the method for symmetrical and asymmetrical cases of binomial distribu-
tion.

## 5.1.2 Central Limit Theorem

Theorem 5.1.1 (Central Limit Theorem (CLT)) If we have a parent population with mean µ and
variance σ² and we take a sample of size n from this population, then the distribution of the
sample mean X̄ is approximately N(µ, σ²/n) if n is sufficiently large (usually n ≥ 30).
This number n can be smaller if the parent population already follows a normal distribution.

Example 5.1.6

Suppose that bags of sugar are packed by a machine to have a mean weight fixed by the operator
but a fixed standard deviation of 9.5 g. The distribution of weights is known to be Normal. The
mean weight of bags for today’s batch should be 500 g but the operator thinks that its level has
been fixed as too low. She takes a random sample of 16 bags and finds the total weight of the
bags is 7874 g. Carry out a suitable hypothesis test at the 1% significance level.
————————————
µ : mean weight of a bag produced by the machine
H0 : µ = 500
H1 : µ < 500
Here a 1–tail test is appropriate, as the operator is only concerned that the level may have been
set too low.

By the central limit theorem, under the assumption of H0, the mean of the sample follows
a normal distribution: X̄ ∼ N(µ, σ²/n) = N(500, 9.5²/16). Our observed sample mean is given by
x̄ = 7874/16 = 492.125. Hence we want to find:

$$P(\bar{X} \le 492.125) = P\left( Z \le \frac{492.125 - 500}{9.5 / 4} \right) = P(Z \le -3.31) = 0.000466 < 0.01$$

Hence we reject H0 at the 1% significance level and conclude that the machine has been set too low.
—————————————————
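
The test in Example 5.1.6 can be sketched in Python as follows. This is an illustration under the assumption that σ is known; the function name `z_test` and its `tail` argument are hypothetical, while `statistics.NormalDist` is part of the standard library.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, tail="lower"):
    """Return (z, p_value) for a test of H0: mu = mu0 with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal, mean 0 and sd 1
    if tail == "lower":
        p_value = std_normal.cdf(z)
    elif tail == "upper":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'lower', 'upper' or 'two'")
    return z, p_value


# Example 5.1.6: total weight 7874 g over 16 bags, sigma = 9.5 g, H1: mu < 500.
z, p_value = z_test(7874 / 16, 500, 9.5, 16, tail="lower")
print(f"z = {z:.2f}, p = {p_value:.6f}, reject H0 at 1%: {p_value < 0.01}")
```

Because the sketch does not round z before evaluating the normal distribution, its p-value can differ slightly in the last digits from a value read off a table.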

Example 5.1.7

A random variable X is Normally distributed with known standard deviation 15.0. The mean
is thought to be µ = 100. A sample of n = 25 data points is taken and the calculated mean is
104.7. Test the hypothesis that the mean is different from 100, at the 5% significance level.
———————————–
H0 : µ = 100
H1 : µ ≠ 100
The sample mean follows a normal distribution by the CLT:

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(100, \frac{15^2}{25}\right).$$

In this case the 2–tail test is appropriate. Let us calculate the probability of a sample mean at
least as extreme as 104.7.

$$2 \cdot P(\bar{X} \ge 104.7) = 2 \cdot P(Z \ge 1.57) = P(Z \le -1.57) + P(Z \ge 1.57) = 0.1164 > 0.05$$

Therefore, we accept the null hypothesis: there is insufficient evidence at the 5% level that the
mean µ is different from 100.
—————————————————

If we do not know the population standard deviation σ, we use the sample to estimate it. We need a
large n (size of the sample) to do this, typically n ≥ 50.

Example 5.1.8

A particular type of AA battery advertises a lifetime of 80 minutes at a 1000mA load. A customer
thinks this is an overestimate and decides to test a random sample of 74 batteries, recording
the lifetime x minutes of each battery. The lifetimes are known to be Normally distributed. He
decides to carry out a hypothesis test at a 5% significance level.
The results of the test are:

$$n = 74, \qquad \sum x = 5877.4, \qquad \sum x^2 = 489{,}442.8.$$

——————————————————-
H0 : µ = 80
H1 : µ < 80
Because the alternative hypothesis states that the mean is strictly smaller than it should be, a
1–tail test is appropriate.
$$\bar{x} = \frac{\sum x}{n} = \frac{5877.4}{74} = 79.42$$

$$s = \sqrt{\frac{\sum x^2 - n \cdot (\bar{x})^2}{n - 1}} = \sqrt{\frac{489{,}442.8 - 74 \cdot (79.42)^2}{73}} = 17.61$$
We will use the sample standard deviation s to estimate the population standard deviation σ.

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(80, \frac{17.61^2}{74}\right)$$

Therefore:

$$P(\bar{X} \le 79.42) = P\left(Z \le \frac{79.42 - 80}{17.61/\sqrt{74}}\right) = P(Z \le -0.283) = 0.39 > 0.05.$$

Hence we accept the null hypothesis: there is no significant evidence that the mean battery life
is less than 80 minutes.
—————————————————
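
When only the summary sums are available, as in Example 5.1.8, the sample mean, the sample standard deviation and the test probability can be computed from them directly. The following is a minimal sketch with illustrative variable names; it uses the unrounded sample mean throughout, so the last digits may differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from Example 5.1.8.
n = 74
sum_x = 5877.4
sum_x2 = 489442.8
mu0 = 80  # mean lifetime claimed under H0

x_bar = sum_x / n
# Sample standard deviation from the sums, with divisor n - 1.
s = sqrt((sum_x2 - n * x_bar**2) / (n - 1))

# Use s as an estimate of sigma (large sample) and test H1: mu < mu0.
z = (x_bar - mu0) / (s / sqrt(n))
p_value = NormalDist().cdf(z)

print(f"x_bar = {x_bar:.2f}, s = {s:.2f}, z = {z:.3f}, p = {p_value:.2f}")
print("reject H0 at 5%:", p_value < 0.05)
```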

Week 10 Tuesday

## 5.2 Confidence Intervals

Given a sample of size n with mean x̄, a p% confidence interval is the interval for which there is a p%
chance that µ for the parent population lies in that interval.

$$\left(\bar{x} - \frac{k\sigma}{\sqrt{n}},\; \bar{x} + \frac{k\sigma}{\sqrt{n}}\right)$$

where:

• σ is the standard deviation of the parent population,
• x̄ is the sample mean,
• k is found from the percentage points table of the normal distribution.

Example 5.2.1
Confidence level 90%: 5% of the distribution lies in each tail, so k is the 95% point of the
standard normal distribution.

From the % points table:

0.95 → k = 1.6449
—————————————————

The most commonly used confidence levels and values of k are:

| Confidence level | k |
|---|---|
| 90% | 1.6449 |
| 95% | 1.96 |
| 99% | 2.58 |
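
These values of k can also be obtained from the inverse of the standard normal distribution function instead of a table. A small sketch, assuming the confidence level is given as a fraction; the function name `k_value` is illustrative.

```python
from statistics import NormalDist


def k_value(confidence_level):
    """Two-sided k for a confidence level given as a fraction, e.g. 0.95."""
    tail = (1 - confidence_level) / 2  # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: k = {k_value(level):.4f}")
```

To four decimal places this gives 1.6449, 1.9600 and 2.5758; the table above quotes the last two rounded to two decimal places.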

Example 5.2.2
In a large town the distribution of incomes per family has a known standard deviation of
£17000. A random sample of 400 families was taken and the sample mean income found to
be £21500. Calculate a 95% confidence interval for the mean income per family in the town.
———————————————————————-

σ = 17000
n = 400
x̄ = 21500
95% confidence level corresponds to k = 1.96.

95% confidence interval is given by:

$$\left(\bar{x} - \frac{k\sigma}{\sqrt{n}},\; \bar{x} + \frac{k\sigma}{\sqrt{n}}\right) = \left(21500 - \frac{(1.96)(17000)}{\sqrt{400}},\; 21500 + \frac{(1.96)(17000)}{\sqrt{400}}\right) = (19834,\ 23166)$$

—————————————————
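
The interval of Example 5.2.2 can be reproduced with a short function. This is a sketch for the case where σ is known; the name `confidence_interval` and its parameters are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(x_bar, sigma, n, level):
    """Interval x_bar -/+ k*sigma/sqrt(n), with k from the standard normal."""
    k = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = k * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Example 5.2.2: sigma = 17000, n = 400, sample mean 21500, 95% level.
low, high = confidence_interval(21500, 17000, 400, 0.95)
print(f"({low:.0f}, {high:.0f})")
```

The function uses the unrounded value of k, so for other confidence levels the end points may differ in the last digit from a calculation based on a rounded table value such as 2.58.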

Example 5.2.3

A test has been designed to produce Normally distributed scores. The scores have a scale
from 0 to 100 and a known standard deviation of 25. A random sample of 10 people took the
test and had average score x̄ = 47.7. Find a 99% confidence interval and a 90% confidence
interval for the mean score of all people taking this test.
———————————————————————-

σ = 25
n = 10
x̄ = 47.7
99% confidence level corresponds to k1 = 2.58.
90% confidence level corresponds to k2 = 1.6449.

99% confidence interval is given by:

$$\left(\bar{x} - \frac{k_1\sigma}{\sqrt{n}},\; \bar{x} + \frac{k_1\sigma}{\sqrt{n}}\right) = \left(47.7 - \frac{(2.58)(25)}{\sqrt{10}},\; 47.7 + \frac{(2.58)(25)}{\sqrt{10}}\right) = (27.30,\ 68.10)$$

90% confidence interval is given by:

$$\left(\bar{x} - \frac{k_2\sigma}{\sqrt{n}},\; \bar{x} + \frac{k_2\sigma}{\sqrt{n}}\right) = \left(47.7 - \frac{(1.6449)(25)}{\sqrt{10}},\; 47.7 + \frac{(1.6449)(25)}{\sqrt{10}}\right) = (34.70,\ 60.70)$$
—————————————————

Example 5.2.4

In a US Bureau of Labor survey, workers employed in manufacturing industries in the US
earned an average of \$546 per week in September 1996. Assume that this mean is based
on a random sample of 1000 workers and the standard deviation for that sample was \$75.
Find a 99% confidence interval for the mean.
———————————————————————-

n = 1000
x̄ = 546
s = 75
99% confidence level corresponds to k = 2.58.

Assume that σ = s:

99% confidence interval is given by:

$$\left(\bar{x} - \frac{k\sigma}{\sqrt{n}},\; \bar{x} + \frac{k\sigma}{\sqrt{n}}\right) = \left(546 - \frac{(2.58)(75)}{\sqrt{1000}},\; 546 + \frac{(2.58)(75)}{\sqrt{1000}}\right) = (539.9,\ 552.1)$$
—————————————————

If σ is unknown, we estimate it by s (that is, we assume σ = s) and use Student's t-distribution
to find k. In the t-distribution table:

Column γ: the γ% confidence level
Row ν = n − 1

The table then gives k.

Example 5.2.5

A test is designed to produce scores that are Normally distributed and have a value between
0 and 100. A group of 15 students take the test and attain scores given by

64 89 44 76 63 81 58 69 55 93 53 33 68 60 63.

Use the data to construct a 95% confidence interval for the mean score of all people taking
this test.
———————————————————————-

n = 15

$$\bar{x} = \frac{64 + 89 + \dots + 60 + 63}{15} = 64.6$$

$$s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}} = 15.93$$

ν = n − 1 = 14
γ = 95% confidence level with ν = 14 corresponds to k = 2.1448.

We will use σ = s in the formula for the confidence interval.

95% confidence interval is given by:

$$\left(\bar{x} - \frac{k\sigma}{\sqrt{n}},\; \bar{x} + \frac{k\sigma}{\sqrt{n}}\right) = \left(64.6 - \frac{(2.1448)(15.93)}{\sqrt{15}},\; 64.6 + \frac{(2.1448)(15.93)}{\sqrt{15}}\right) = (55.78,\ 73.42)$$
—————————————————
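
The steps of Example 5.2.5 can be sketched in Python from the raw scores. The Python standard library has no t-distribution quantile function, so the sketch simply takes k = 2.1448 from the t-table as the example does; the constant name `K_95_NU_14` is illustrative.

```python
from math import sqrt
from statistics import mean, stdev

scores = [64, 89, 44, 76, 63, 81, 58, 69, 55, 93, 53, 33, 68, 60, 63]
K_95_NU_14 = 2.1448  # t-table value for a 95% level with nu = 14

n = len(scores)
x_bar = mean(scores)
s = stdev(scores)  # sample standard deviation, divisor n - 1

half_width = K_95_NU_14 * s / sqrt(n)
low, high = x_bar - half_width, x_bar + half_width
print(f"x_bar = {x_bar:.1f}, s = {s:.2f}, interval = ({low:.2f}, {high:.2f})")
```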

Example 5.2.6

To find out the cardiac demands of heavy snow shovelling, 10 healthy men of the same age
participated in snow-removal tests. (They shovelled snow for 10 minutes at a time, with
10–15 minute rest periods. Their heart rate was measured at 2 minute intervals.) Their
mean heart rate in beats per minute was x̄ = 175 and the standard deviation was s = 15.
Find a 90% confidence interval for the population mean (the mean heart rate for healthy
men of this age when shovelling snow).
Source: Franklin, B.A. et al. (1995), Cardiac demands of heavy snow shovelling, J. American
Medical Association 273.
———————————————————————-

n = 10
x̄ = 175
s = 15
ν = n − 1 = 9
γ = 90% confidence level with ν = 9 corresponds to k = 1.8331.

We will use σ = s in the formula for the confidence interval.

90% confidence interval is given by:

$$\left(\bar{x} - \frac{ks}{\sqrt{n}},\; \bar{x} + \frac{ks}{\sqrt{n}}\right) = \left(175 - \frac{(1.8331)(15)}{\sqrt{10}},\; 175 + \frac{(1.8331)(15)}{\sqrt{10}}\right) = (166.30,\ 183.70)$$
—————————————————

Week 11 Monday

## 5.3 Linear Regression

Linear regression involves finding the best mathematical model for a relationship between 2 variables
when the relationship is assumed to be linear, and when all that is known is a selection of data points
(xi , yi ) where i ∈ {1, 2, . . . , n} for the two variables x and y.

This type of model is generally used in situations where there is an element of ‘randomness’ in the
data. Therefore, for a given value of x,

$$y = \hat{\beta}_0 + \hat{\beta}_1 x$$

will not give a certain value of y corresponding to x. Instead, it gives an expected value of y.
For a given set of data, the line which we choose is the one that minimises the expression:

$$\sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2.$$

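This is called the least squares line.

Formulas:

$$\begin{aligned}
S_{xx} &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \\
S_{xy} &= \sum x_i y_i - \frac{\left(\sum x_i\right) \cdot \left(\sum y_i\right)}{n} \\
\hat{\beta}_1 &= \frac{S_{xy}}{S_{xx}} \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x}
\end{aligned}$$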

Example 5.3.1

Suppose an experiment involving five subjects is conducted to determine the relationship between
the percentage of a certain drug in the bloodstream and the length of time it takes to react to
a stimulus. The results are shown in the table below.

| Subject | Amount of Drug x (%) | Reaction Time y (seconds) |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 2 |
| 4 | 4 | 2 |
| 5 | 5 | 4 |

(The number of measurements and the measurements themselves are unrealistically simple in
order to avoid excessive arithmetic and to concentrate instead on the process in this introductory
example.)
Determine the straight line

$$y = \hat{\beta}_0 + \hat{\beta}_1 x$$

that best fits the data in the least squares sense.

————————————————————————

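$$\begin{aligned}
\sum x_i &= 1 + 2 + 3 + 4 + 5 = 15 \\
\sum y_i &= 1 + 1 + 2 + 2 + 4 = 10 \\
\bar{x} &= \frac{\sum x_i}{n} = \frac{15}{5} = 3 \\
\bar{y} &= \frac{\sum y_i}{n} = \frac{10}{5} = 2 \\
\sum x_i^2 &= 1^2 + 2^2 + 3^2 + 4^2 + 5^2 = 55 \\
\sum x_i y_i &= 1 \cdot 1 + 2 \cdot 1 + 3 \cdot 2 + 4 \cdot 2 + 5 \cdot 4 = 37 \\
S_{xx} &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 55 - \frac{15^2}{5} = 10 \\
S_{xy} &= \sum x_i y_i - \frac{\left(\sum x_i\right) \cdot \left(\sum y_i\right)}{n} = 37 - \frac{15 \cdot 10}{5} = 7 \\
\hat{\beta}_1 &= \frac{S_{xy}}{S_{xx}} = \frac{7}{10} = 0.7 \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} = 2 - 3 \cdot 0.7 = -0.1
\end{aligned}$$

Hence, the line of least squares is given by:

$$y = -0.1 + 0.7x.$$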

—————————————————
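
The calculation in Example 5.3.1 can be sketched in Python directly from the formulas for Sxx, Sxy and the two coefficients. The function name `least_squares` is illustrative; it assumes the x-values are not all equal, so that Sxx is non-zero.

```python
def least_squares(xs, ys):
    """Return (beta0, beta1) of the least squares line y = beta0 + beta1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x**2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    beta1 = s_xy / s_xx
    beta0 = sum_y / n - beta1 * sum_x / n  # y_bar - beta1 * x_bar
    return beta0, beta1


# Data of Example 5.3.1: drug percentage x and reaction time y.
drug = [1, 2, 3, 4, 5]
reaction = [1, 1, 2, 2, 4]
beta0, beta1 = least_squares(drug, reaction)
print(f"y = {beta0:.1f} + {beta1:.1f}x")
```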

Example 5.3.2
Due primarily to the price controls of the cartel of crude oil suppliers (OPEC), the price
of crude oil rose dramatically from the mid 1970s to the early 1980s. As a result, motorists
were confronted with a similar upward spiral of petrol prices. The data in the table are
typical prices for a gallon of regular petrol and a barrel of crude oil for the indicated years.

| Year | Petrol y (cents/gallon) | Crude oil x (USD/barrel) |
|---|---|---|
| 1973 | 39 | 3.89 |
| 1975 | 57 | 7.67 |
| 1976 | 59 | 8.19 |
| 1977 | 62 | 8.57 |
| 1978 | 63 | 9.00 |
| 1979 | 86 | 12.64 |
| 1980 | 119 | 21.59 |
| 1981 | 133 | 31.77 |
| 1982 | 122 | 28.52 |
| 1983 | 116 | 26.19 |
| 1984 | 113 | 25.88 |
| 1985 | 112 | 24.09 |
| 1986 | 86 | 12.51 |
| 1987 | 90 | 15.41 |

Given that

$$\sum x_i = 235.92, \quad \sum y_i = 1257, \quad \sum x_i^2 = 5074.0898, \quad \sum y_i^2 = 124459, \quad \sum x_i y_i = 24654.87,$$

(a) Use the data to calculate the least squares line that describes the relationship between the
price of a gallon of petrol and the price of a barrel of crude oil.
(b) If the price of crude oil fell to \$8 a barrel, to what level would you expect the price of petrol
to fall?
——————————————————-
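(a)

$$\begin{aligned}
\bar{x} &= \frac{\sum x_i}{n} = \frac{235.92}{14} = 16.85 \\
\bar{y} &= \frac{\sum y_i}{n} = \frac{1257}{14} = 89.79 \\
S_{xx} &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 5074.0898 - \frac{235.92^2}{14} = 1098.5 \\
S_{xy} &= \sum x_i y_i - \frac{\left(\sum x_i\right) \cdot \left(\sum y_i\right)}{n} = 24654.87 - \frac{235.92 \cdot 1257}{14} = 3472.6 \\
\hat{\beta}_1 &= \frac{S_{xy}}{S_{xx}} = \frac{3472.6}{1098.5} = 3.161 \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} = 89.79 - 16.85 \cdot 3.161 = 36.52
\end{aligned}$$

Hence, the line of least squares is given by:

$$y = 36.52 + 3.161x.$$

(b)

$$y = 36.52 + 3.161 \cdot 8 = 61.81$$

Hence, we would expect the petrol to cost 61.81 cents per gallon.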
—————————————————
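
When only the summary sums are given, as in Example 5.3.2, the same formulas can be applied to the sums directly. The following is a minimal sketch with illustrative variable names. It keeps full precision in the intermediate values, so its results may differ in the last digit from the hand calculation, which rounds at each step.

```python
# Summary sums from Example 5.3.2 (x: crude oil price, y: petrol price).
n = 14
sum_x = 235.92
sum_y = 1257
sum_x2 = 5074.0898
sum_xy = 24654.87

s_xx = sum_x2 - sum_x**2 / n
s_xy = sum_xy - sum_x * sum_y / n

beta1 = s_xy / s_xx
beta0 = sum_y / n - beta1 * sum_x / n  # y_bar - beta1 * x_bar

# Part (b): expected petrol price when crude oil costs 8 USD per barrel.
crude_price = 8
predicted_petrol = beta0 + beta1 * crude_price

print(f"Sxx = {s_xx:.1f}, Sxy = {s_xy:.1f}")
print(f"y = {beta0:.2f} + {beta1:.3f}x")
print(f"predicted petrol price: {predicted_petrol:.2f} cents per gallon")
```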

## Estimating the Likely Error in a Model

The straight line model only provides information about E(Y | X). We now estimate the variance
in the y-values.
Note that SSE stands for ‘sum of squares error’.

$$\begin{aligned}
S_{yy} &= \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} \\
SSE &= S_{yy} - \hat{\beta}_1 \cdot S_{xy} \\
s^2 &= \frac{SSE}{n - 2}
\end{aligned}$$

The quantity s² gives an estimate of σ², where σ² is the variance of the difference in actual y-values
from predicted y-values.

## Measure of the Usefulness of a Model

The coefficient of determination r² is given by:

$$r^2 = 1 - \frac{SSE}{S_{yy}}.$$

It gives the proportion of the total sample variability that is explained by the linear model.

Example 5.3.3
(Example 5.3.1 continued.)

$$\begin{aligned}
\sum y_i^2 &= 1 + 1 + 4 + 4 + 16 = 26 \\
S_{yy} &= \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} = 26 - \frac{10^2}{5} = 6 \\
SSE &= S_{yy} - \hat{\beta}_1 \cdot S_{xy} = 6 - 0.7 \cdot 7 = 1.1 \\
s^2 &= \frac{SSE}{n - 2} = \frac{1.1}{3} = 0.37 \\
s &= \sqrt{0.37} = 0.61
\end{aligned}$$

Thus, s = 0.61 is the typical error that we can expect in the predicted values of the model.

$$r^2 = 1 - \frac{SSE}{S_{yy}} = 1 - \frac{1.1}{6} = 0.82$$

This means that 82% of the variation in the y-values of the data is accounted for by the model.
—————————————————
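
The quantities Syy, SSE, s² and r² for the data of Example 5.3.1 can be computed with a few lines of Python. This is a sketch that follows the formulas of this section term by term; the variable names are illustrative.

```python
from math import sqrt

# Data of Example 5.3.1.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
n = len(xs)

s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n

beta1 = s_xy / s_xx
sse = s_yy - beta1 * s_xy  # sum of squares error
s2 = sse / (n - 2)         # estimate of the error variance
s = sqrt(s2)               # typical size of a prediction error
r2 = 1 - sse / s_yy        # coefficient of determination

print(f"Syy = {s_yy:.1f}, SSE = {sse:.1f}")
print(f"s^2 = {s2:.4f}, s = {s:.4f}, r^2 = {r2:.4f}")
```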

Example 5.3.4
(Example 5.3.2 continued.)

$$\begin{aligned}
S_{yy} &= \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} = 124459 - \frac{1257^2}{14} = 11598 \\
SSE &= S_{yy} - \hat{\beta}_1 \cdot S_{xy} = 11598 - (3.161)(3472.6) = 621.1 \\
s^2 &= \frac{SSE}{n - 2} = \frac{621.1}{14 - 2} = 51.76 \\
s &= \sqrt{51.76} = 7.2
\end{aligned}$$

Thus, s = 7.2 is the typical error that we can expect in the predicted values of the model.

$$r^2 = 1 - \frac{SSE}{S_{yy}} = 1 - \frac{621.1}{11598} = 0.946$$

This means that 94.6% of the variation in the y-values of the data is accounted for by the model.
—————————————————

Appendices

Appendix A

Appendix B

Appendix C