MEI Statistics 2 Bivariate data

Topic assessment
1.

For a random sample of 20 towns, a smoking index (x) and a cancer index (y) are
constructed. These data are illustrated in the diagram below. The associated
summary statistics are also given below.

$n = 20$
$\sum x = 2152$, $\sum x^2 = 235724$
$\sum y = 2327$, $\sum y^2 = 278565$
$\sum xy = 252811$

A community health officer claims that these data show that there is a positive
correlation between smoking and cancer.
(i) Show that the product moment correlation coefficient for the data is 0.425,
correct to 3 significant figures. Carry out a suitable hypothesis test at the 5%
significance level to check the officer's claim, stating your hypotheses and
conclusion carefully. Comment on the validity of the test in relation to the
scatter diagram.
[10]
(ii) A spokesman for a tobacco firm claims that the data do not show that there is
a connection between smoking and cancer. Discuss briefly whether or not his
claim can be justified statistically.
[2]
(iii) Explain the meaning of the term 'significance level', relating your answer
to the test carried out in part (i).
[2]
2.

Ten gymnasts take part in a competition which has two events, floor exercises
and parallel bars. The scores for each event are given in the table.
Competitor        A    B    C    D    E    F    G    H    I    J
Floor exercises   9.5  8.8  6.7  6.0  5.9  5.7  5.2  4.2  4.1  3.9
Parallel bars     9.2  9.5  5.8  3.7  5.4  6.3  5.9  4.5  5.6  4.7
[3]

13/11/13 MEI

MEI S2 Bivariate data Assessment solutions

A sports journalist wishes to investigate whether there is a positive association
between performances on floor exercises and parallel bars.
(ii) By ranking the data, calculate an appropriate correlation coefficient and carry
out a suitable hypothesis test at the 5% level of significance. State your
hypotheses and conclusions carefully. Comment on the validity of the test.
[10]
The product moment correlation coefficient for the two sets of data is 0.837
(correct to 3 significant figures).
(iii) With reference to the scatter diagram, comment on the difference in the
values of the two correlation coefficients. Discuss which of the two is more
appropriate on this occasion.
[3]
3.

A medical screening programme predicts the remaining number of years (y) an
individual aged x years has to live. The data for four individuals are shown in the
table.
x    35    45    55    65
y    45.0  39.0  30.0  18.0

(i) Represent the data by a scatter diagram, drawn on graph paper.

[2]

(ii) Calculate the equation of the regression line of y on x, and plot it on your
scatter diagram.
[5]
(iii) Use your regression equation to predict the remaining number of years of life
of a person aged
(A) 65 years,
(B) 90 years,
commenting on your second prediction.
[3]
(iv) Calculate the sum of the squares of the residuals. What is the relevance of
this value with respect to the regression line?
[5]
4.

A meteorologist wishes to test whether there is any correlation between
temperature, x °C, and wind speed, y mph. For a selection of places in the United
Kingdom on a particular day, he collects data which are illustrated in the diagram
and summarised as follows.
$n = 25$
$\sum x = 307$, $\sum x^2 = 3853$
$\sum y = 250$, $\sum y^2 = 3008$
$\sum xy = 3143$


(i) Calculate the product moment correlation coefficient for the data. Carry out a
suitable hypothesis test at the 5% significance level, using the null hypothesis
$H_0: \rho = 0$. Define $\rho$ and state your alternative hypothesis and conclusions
carefully.

[9]

(ii) What must be assumed about the underlying distribution for the test to be
valid? Discuss, with reference to the scatter diagram above, whether this
assumption is reasonable in this case.
[3]
(iii) Another data point, x = 6, y = 29, was omitted from the data set. Explain
briefly the effect its inclusion would have on the product moment correlation
coefficient. Comment on the validity of the test if this point were included.
[3]
Total 60

Solutions to topic assessment
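1. (i)
$\bar{x} = \dfrac{2152}{20} = 107.6$
$S_{xx} = \sum x^2 - n\bar{x}^2 = 235724 - 20 \times 107.6^2 = 4168.8$
$\bar{y} = \dfrac{2327}{20} = 116.35$
$S_{yy} = \sum y^2 - n\bar{y}^2 = 278565 - 20 \times 116.35^2 = 7818.55$
$S_{xy} = \sum xy - n\bar{x}\bar{y} = 252811 - 20 \times 107.6 \times 116.35 = 2425.8$
$r = \dfrac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \dfrac{2425.8}{\sqrt{4168.8 \times 7818.55}} = 0.425$ (3 s.f.)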
$H_0: \rho = 0$
$H_1: \rho > 0$
where $\rho$ is the population correlation coefficient.
At the 5% significance level with n = 20, the critical value is 0.3783.
Since 0.425 > 0.3783, reject H0: there is sufficient
evidence at the 5% significance level to suggest that
there is a positive correlation between the smoking
index and the cancer index.
EITHER: Since the shape of the scatter diagram is
roughly elliptical, the data appear to come from a
bivariate Normal population, so the test is
appropriate.
OR: Since the shape of the scatter diagram does not
appear to be elliptical, the data do not come from a
bivariate Normal population, so the test is not
appropriate.
[10]
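The arithmetic above can be checked with a short Python sketch. The function name `pmcc_from_sums` and its parameter names are illustrative choices, not part of the original solution. The critical value is simply the tabulated figure quoted above.

```python
from math import sqrt

def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product moment correlation coefficient from summary statistics."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    s_xx = sum_x2 - n * mean_x ** 2
    s_yy = sum_y2 - n * mean_y ** 2
    s_xy = sum_xy - n * mean_x * mean_y
    return s_xy / sqrt(s_xx * s_yy)

# Summary statistics from Question 1.
r = pmcc_from_sums(20, 2152, 2327, 235724, 278565, 252811)

# One-tailed 5% critical value for n = 20, taken from the solution above.
critical_value = 0.3783

print(round(r, 3), r > critical_value)
```

The second printed value is the test decision: `True` corresponds to rejecting H0 at the 5% level.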
(ii) If a different significance level were chosen, then the
test could result in the null hypothesis being
accepted. For example, at the 2.5% significance level
the critical value for n = 20 is 0.4438. Since
0.425 < 0.4438, the null hypothesis would be accepted, so
the spokesman's claim would be justified statistically if
the test were carried out at this level.
[2]
(iii)
The significance level is the probability of
rejecting the null hypothesis when it is in fact true.
If the population correlation coefficient is in fact zero,
then 5% of random bivariate samples of size 20 will
produce a value of r that exceeds the critical value of
0.3783.
[2]
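The interpretation of the significance level can be illustrated by simulation. The sketch below assumes that x and y are independent standard Normal variables, which is one convenient way to obtain a bivariate Normal population with zero correlation. The function names, the number of trials and the seed are all illustrative.

```python
import random
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)

def rejection_rate(n=20, critical_value=0.3783, trials=20000, seed=1):
    """Proportion of samples with r above the critical value when rho = 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]  # independent of xs, so rho = 0
        if pmcc(xs, ys) > critical_value:
            rejections += 1
    return rejections / trials

rate = rejection_rate()
print(rate)  # should be close to 0.05
```

The printed proportion estimates the probability of wrongly rejecting a true H0, so it should settle near 0.05 as the number of trials grows.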

2. (i)

[3]
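(ii)
Competitor             A   B   C   D   E   F   G   H   I   J
Floor exercises rank   1   2   3   4   5   6   7   8   9  10
Parallel bars rank     2   1   5  10   7   3   4   9   6   8
d                     -1   1  -2  -6  -2   3   3  -1   3   2
d^2                    1   1   4  36   4   9   9   1   9   4

$r_s = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 78}{10 \times 99} = 0.527$ (3 s.f.)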

H0: there is no association between the performances
on floor exercises and parallel bars.
H1: there is a positive association between the
performances on floor exercises and parallel bars.
At the 5% significance level for n = 10, the critical value is
0.5636.
0.527 < 0.5636
so accept H0: there is insufficient evidence to suggest
a positive association between performance on the
floor exercises and performance on the parallel bars.
The test is valid provided that the data are a random
sample from the underlying population.

[10]
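The ranking and the value of the rank correlation coefficient can be reproduced with the Python sketch below. The helper `ranks` and the variable names are illustrative. It assumes there are no tied scores, which holds for this data set.

```python
# Scores from the table in Question 2.
floor_scores = {"A": 9.5, "B": 8.8, "C": 6.7, "D": 6.0, "E": 5.9,
                "F": 5.7, "G": 5.2, "H": 4.2, "I": 4.1, "J": 3.9}
bars_scores = {"A": 9.2, "B": 9.5, "C": 5.8, "D": 3.7, "E": 5.4,
               "F": 6.3, "G": 5.9, "H": 4.5, "I": 5.6, "J": 4.7}

def ranks(scores):
    """Rank 1 is the highest score. Assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

floor_rank = ranks(floor_scores)
bars_rank = ranks(bars_scores)

n = len(floor_scores)
sum_d2 = sum((floor_rank[c] - bars_rank[c]) ** 2 for c in floor_scores)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# One-tailed 5% critical value for n = 10, taken from the solution above.
critical_value = 0.5636

print(sum_d2, round(r_s, 3), r_s > critical_value)
```

The final printed value is `False` because the coefficient falls below the critical value, matching the decision to accept H0.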
(iii)
The product moment correlation coefficient is
influenced by the two outlying points.
The Spearman's rank test is more appropriate, since
there is no evidence that the background population
is bivariate Normal.
[3]
3. (i)

[2]

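(ii)
$\sum x = 200$, $\bar{x} = \dfrac{200}{4} = 50$
$\sum y = 132$, $\bar{y} = \dfrac{132}{4} = 33$
$S_{xx} = \sum x^2 - n\bar{x}^2 = 10500 - 10000 = 500$
$S_{xy} = \sum xy - n\bar{x}\bar{y} = 6150 - 4 \times 50 \times 33 = -450$

For regression line $y = a + bx$:
$b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{-450}{500} = -0.9$
$a = \bar{y} - b\bar{x} = 33 + 0.9 \times 50 = 78$
Regression line is $y = 78 - 0.9x$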

[5]
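As a check on the working, the same least squares calculation can be sketched in Python. The function name `regression_line` is an illustrative choice. It follows the $S_{xy}/S_{xx}$ formulas used above.

```python
# Data from the table in Question 3.
xs = [35, 45, 55, 65]
ys = [45.0, 39.0, 30.0, 18.0]

def regression_line(xs, ys):
    """Return (a, b) for the regression line y = a + b x of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * mean_x ** 2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

a, b = regression_line(xs, ys)
print(round(a, 3), round(b, 3))

# Predictions at the ages asked about in part (iii).
prediction_65 = a + b * 65
prediction_90 = a + b * 90
print(round(prediction_65, 3), round(prediction_90, 3))
```

The prediction at 90 is negative because 90 lies well outside the range of ages in the data, so the line is being extrapolated.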
(iii)
(A) $x = 65$: $y = 78 - 0.9 \times 65 = 19.5$
(B) $x = 90$: $y = 78 - 0.9 \times 90 = -3$
The second prediction is meaningless as the person is
predicted to have negative years to live.
[3]

(iv)
For (35, 45): $\hat{y} = 78 - 0.9 \times 35 = 46.5$, residual $= 45 - 46.5 = -1.5$
For (45, 39): $\hat{y} = 78 - 0.9 \times 45 = 37.5$, residual $= 39 - 37.5 = 1.5$
For (55, 30): $\hat{y} = 78 - 0.9 \times 55 = 28.5$, residual $= 30 - 28.5 = 1.5$
For (65, 18): $\hat{y} = 78 - 0.9 \times 65 = 19.5$, residual $= 18 - 19.5 = -1.5$
Sum of squares of residuals $= 4 \times 1.5^2 = 9$

Out of all possible straight lines passing through
$(\bar{x}, \bar{y})$, the regression line is the one which minimises the
sum of the squares of the residuals. So 9 is the
smallest possible value of the sum of the squares of
the residuals.
[5]
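The minimising property can be illustrated numerically. The sketch below computes the sum of squared residuals for the regression line and for a few nearby lines. The alternative lines are arbitrary choices made only for comparison, and the function name is illustrative.

```python
# Data from the table in Question 3.
xs = [35, 45, 55, 65]
ys = [45.0, 39.0, 30.0, 18.0]

def sum_squared_residuals(a, b, xs, ys):
    """Sum of (observed y - predicted y)^2 for the line y = a + b x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Regression line found in part (ii).
best = sum_squared_residuals(78.0, -0.9, xs, ys)

# A few arbitrary alternative lines, for comparison only.
alternatives = [(78.0, -0.95), (78.0, -0.85), (79.0, -0.9), (77.0, -0.9)]
others = [sum_squared_residuals(a, b, xs, ys) for a, b in alternatives]

print(round(best, 6))
print([round(value, 6) for value in others])
```

Every alternative gives a larger total than the regression line, which is consistent with the least squares property, although a handful of comparisons is not a proof.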
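4. (i)
$\bar{x} = \dfrac{307}{25} = 12.28$
$S_{xx} = \sum x^2 - n\bar{x}^2 = 3853 - 25 \times 12.28^2 = 83.04$
$\bar{y} = \dfrac{250}{25} = 10$
$S_{yy} = \sum y^2 - n\bar{y}^2 = 3008 - 25 \times 10^2 = 508$
$S_{xy} = \sum xy - n\bar{x}\bar{y} = 3143 - 25 \times 10 \times 12.28 = 73$
$r = \dfrac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \dfrac{73}{\sqrt{83.04 \times 508}} = 0.355$ (3 s.f.)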

$H_0: \rho = 0$
$H_1: \rho \neq 0$
where $\rho$ is the population correlation coefficient.
The critical value for a two-tailed test at the 5% significance
level for n = 25 is 0.3961.
0.355 < 0.3961, so accept H0: there is not sufficient
evidence at the 5% significance level to suggest that
there is a correlation between temperature and wind
speed.
[9]
(ii) The underlying distribution must be bivariate Normal.
The elliptical shape of the scatter diagram suggests
that this is the case.
[3]

(iii)
The additional data point will reduce the product
moment correlation coefficient, since it is affected by
extreme values. This makes the test less valid as the
shape of the scatter diagram would be less elliptical.
[3]
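The size of the effect can be estimated by adding the omitted point (6, 29) to the summary statistics from Question 4 and recomputing r. This is a sketch for illustration only. The function name `pmcc_from_sums` is hypothetical, and the updated totals are obtained simply by adding the new point's contributions.

```python
from math import sqrt

def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product moment correlation coefficient from summary statistics."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    s_xx = sum_x2 - n * mean_x ** 2
    s_yy = sum_y2 - n * mean_y ** 2
    s_xy = sum_xy - n * mean_x * mean_y
    return s_xy / sqrt(s_xx * s_yy)

# Original 25 points (summary statistics from Question 4).
r_before = pmcc_from_sums(25, 307, 250, 3853, 3008, 3143)

# Add the omitted point (6, 29) to each total.
x_new, y_new = 6, 29
r_after = pmcc_from_sums(
    26,
    307 + x_new,
    250 + y_new,
    3853 + x_new ** 2,
    3008 + y_new ** 2,
    3143 + x_new * y_new,
)

print(round(r_before, 3), round(r_after, 3))
```

With the extra point included, r falls from about 0.355 to a small negative value of roughly -0.13, which shows how strongly a single extreme point can affect the coefficient.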
