Topic assessment

1.

For a random sample of 20 towns, a smoking index (x) and a cancer index (y) are

constructed. These data are illustrated in the diagram below. The associated

summary statistics are also given below.

$n = 20$

$\sum x = 2152$, $\sum x^2 = 235724$

$\sum y = 2327$, $\sum y^2 = 278565$

$\sum xy = 252811$

A community health officer claims that these data show that there is a positive

correlation between smoking and cancer.

(i) Show that the product moment correlation coefficient for the data is 0.425,

correct to 3 significant figures. Carry out a suitable hypothesis test at the 5%

significance level to check the officer's claim, stating your hypotheses and

conclusion carefully. Comment on the validity of the test in relation to the

scatter diagram.

[10]

(ii) A spokesman for a tobacco firm claims that the data do not show that there is

a connection between smoking and cancer. Discuss briefly whether or not his

claim can be justified statistically.

[2]

(iii) Explain the meaning of the term "significance level", relating your answer to

the test carried out in part (i).

[2]

2.

Ten gymnasts take part in a competition which has two events, floor exercises

and parallel bars. The scores for each event are given in the table.

Competitor   Floor exercises   Parallel bars
A            9.5               9.2
B            8.8               9.5
C            6.7               5.8
D            6.0               3.7
E            5.9               5.4
F            5.7               6.3
G            5.2               5.9
H            4.2               4.5
I            4.1               5.6
J            3.9               4.7

[3]

13/11/13 MEI

A sports journalist wishes to investigate whether there is a positive association

between performances on floor exercises and parallel bars.

(ii) By ranking the data, calculate an appropriate correlation coefficient and carry

out a suitable hypothesis test at the 5% level of significance. State your

hypotheses and conclusions carefully. Comment on the validity of the test.

[10]

The product moment correlation coefficient for the two sets of data is 0.837

(correct to 3 significant figures).

(iii) With reference to the scatter diagram, comment on the difference in the

values of the two correlation coefficients. Discuss which of the two is more

appropriate on this occasion.

[3]

3.

individual aged x years has to live. The data for four individuals are shown in the

table.

x     y
35    45.0
45    39.0
55    30.0
65    18.0

[2]

(ii) Calculate the equation of the regression line of y on x, and plot it on your

scatter diagram.

[5]

(iii) Use your regression equation to predict the remaining number of years of life

of a person aged

(A) 65 years of age,

(B) 90 years of age,

commenting on your second prediction.

[3]

(iv) Calculate the sum of the squares of the residuals. What is the relevance of

this value with respect to the regression line?

[5]

4.

temperature, x °C, and wind speed, y mph. For a selection of places in the United

Kingdom on a particular day, he collects data which are illustrated in the diagram

and summarised as follows.

$n = 25$

$\sum x = 307$, $\sum x^2 = 3853$

$\sum y = 250$, $\sum y^2 = 3008$

$\sum xy = 3143$

(i) Calculate the product moment correlation coefficient for the data. Carry out a

suitable hypothesis test at the 5% significance level, using the null hypothesis

$H_0: \rho = 0$. Define and state your alternative hypothesis and conclusions

carefully.

[9]

(ii) What must be assumed about the underlying distribution for the test to be

valid? Discuss, with reference to the scatter diagram above, whether this

assumption is reasonable in this case.

[3]

(iii) Another data point, x = 6, y = 29, was omitted from the data set. Explain

briefly the effect its inclusion would have on the product moment correlation

coefficient. Comment on the validity of the test if this point were included.

[3]

Total 60

Solutions to topic assessment

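1. (i)

$\bar{x} = \frac{2152}{20} = 107.6$

$\bar{y} = \frac{2327}{20} = 116.35$

$S_{xx} = \sum x^2 - n\bar{x}^2 = 235724 - 20 \times 107.6^2 = 4168.8$

$S_{yy} = \sum y^2 - n\bar{y}^2 = 278565 - 20 \times 116.35^2 = 7818.55$

$S_{xy} = \sum xy - n\bar{x}\bar{y} = 252811 - 20 \times 107.6 \times 116.35 = 2425.8$

$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{2425.8}{\sqrt{4168.8 \times 7818.55}} = 0.425$ (3 s.f.)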

$H_0: \rho = 0$

$H_1: \rho > 0$

where $\rho$ is the population correlation coefficient.

At 5% significance level with n = 20, critical value =

0.3783

Since 0.425 > 0.3783, reject H0: there is sufficient

evidence at the 5% significance level to suggest that

there is a positive correlation between the smoking

index and the cancer index.

EITHER: Since the shape of the scatter diagram is

roughly elliptical, the data appear to come from a

bivariate Normal population, so the test is

appropriate.

OR: Since the shape of the scatter diagram does not

appear to be elliptical, the data do not come from a

bivariate Normal population, so the test is not

appropriate.

[10]
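The arithmetic in part (i) can be checked with a short script. This is a minimal sketch, not part of the original mark scheme: the function name `pmcc_from_sums` is a hypothetical helper, and the critical value 0.3783 is simply copied from the solution above rather than computed.

```python
from math import sqrt

def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product moment correlation coefficient from summary statistics."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    s_xx = sum_x2 - n * x_bar ** 2
    s_yy = sum_y2 - n * y_bar ** 2
    s_xy = sum_xy - n * x_bar * y_bar
    return s_xy / sqrt(s_xx * s_yy)

# Summary statistics given in Question 1
r = pmcc_from_sums(n=20, sum_x=2152, sum_y=2327,
                   sum_x2=235724, sum_y2=278565, sum_xy=252811)

# One-tailed critical value at the 5% level for n = 20 (taken from the solution)
CRITICAL_VALUE = 0.3783
reject_h0 = r > CRITICAL_VALUE

print(f"r = {r:.3f}, reject H0: {reject_h0}")
```

Because the test is one-tailed, only values of r above the critical value lead to rejection of the null hypothesis.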

(ii) If a different significance level were chosen, then the

test could result in the null hypothesis being

accepted. For example, at the 2.5% significance level

the critical value for n = 20 is 0.4438, so the

spokesman's claim would be justified statistically if

the test were carried out at this level.

[2]

(iii)

The significance level is the probability of

rejecting the null hypothesis when it is in fact true.

If the population correlation coefficient is in fact zero,

then 5% of random bivariate samples of size 20 will

produce a value of r that exceeds the critical value of

0.3783.

[2]
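The interpretation in part (iii) can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not part of the original solution: it draws x and y independently from standard Normal distributions (so the population correlation is zero), and the function names, number of trials and seed are arbitrary choices.

```python
import random
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)

def rejection_rate(n=20, critical_value=0.3783, trials=10_000, seed=1):
    """Proportion of samples from an uncorrelated population with r above the critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]  # independent of xs, so rho = 0
        if pmcc(xs, ys) > critical_value:
            rejections += 1
    return rejections / trials

rate = rejection_rate()
print(f"Proportion of samples with r > 0.3783: {rate:.3f}")
```

The printed proportion should be close to 0.05, which is the probability of wrongly rejecting a true null hypothesis at the 5% level.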

2. (i)

[3]

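(ii)

Competitor   Floor exercises rank   Parallel bars rank   d     d^2
A            1                      2                    -1    1
B            2                      1                    1     1
C            3                      5                    -2    4
D            4                      10                   -6    36
E            5                      7                    -2    4
F            6                      3                    3     9
G            7                      4                    3     9
H            8                      9                    -1    1
I            9                      6                    3     9
J            10                     8                    2     4

$\sum d_i^2 = 78$

$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 78}{10 \times 99} = 0.527$ (3 s.f.)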

H0: there is no association between the performances on floor exercises and

parallel bars.

H1: there is a positive association between the

performances on floor

exercises and parallel bars.

At 5% significance level for n =10, critical value =

0.5636

0.527 < 0.5636

so accept H0: there is insufficient evidence to suggest

a positive association between performance on the

floor exercises and performance on the parallel bars.

The test is valid provided that the data are a random

sample from the underlying population.

[10]
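The ranking and the value of the rank correlation coefficient in part (ii) can be reproduced from the scores in the question. This is a minimal sketch: the helper `ranks` is hypothetical, it assumes there are no tied scores (true for these data), and the critical value 0.5636 is copied from the solution rather than computed.

```python
floor = {"A": 9.5, "B": 8.8, "C": 6.7, "D": 6.0, "E": 5.9,
         "F": 5.7, "G": 5.2, "H": 4.2, "I": 4.1, "J": 3.9}
bars = {"A": 9.2, "B": 9.5, "C": 5.8, "D": 3.7, "E": 5.4,
        "F": 6.3, "G": 5.9, "H": 4.5, "I": 5.6, "J": 4.7}

def ranks(scores):
    """Rank 1 goes to the highest score. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

floor_rank = ranks(floor)
bars_rank = ranks(bars)

n = len(floor)
sum_d2 = sum((floor_rank[c] - bars_rank[c]) ** 2 for c in floor)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# One-tailed critical value at the 5% level for n = 10 (taken from the solution)
CRITICAL_VALUE = 0.5636
reject_h0 = r_s > CRITICAL_VALUE

print(f"sum of d^2 = {sum_d2}, r_s = {r_s:.3f}, reject H0: {reject_h0}")
```

Ranking from lowest to highest instead would give the same value of $r_s$, provided both events are ranked in the same direction.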

(iii)

The product moment correlation coefficient is

influenced by the two outlying points.

Spearman's rank test is more appropriate, since

there is no evidence that the background population

is bivariate Normal.

[3]

3. (i)

[2]

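(ii)

$\sum x = 200$, so $\bar{x} = \frac{200}{4} = 50$

$\sum y = 132$, so $\bar{y} = \frac{132}{4} = 33$

$S_{xx} = \sum x^2 - n\bar{x}^2 = 10500 - 10000 = 500$

$S_{xy} = \sum xy - n\bar{x}\bar{y} = 6150 - 6600 = -450$

For the regression line $y = a + bx$:

$b = \frac{S_{xy}}{S_{xx}} = \frac{-450}{500} = -0.9$

$a = \bar{y} - b\bar{x} = 33 + 0.9 \times 50 = 78$

Regression line is $y = 78 - 0.9x$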

[5]

(iii)

(A) $x = 65 \Rightarrow y = 78 - 0.9 \times 65 = 19.5$

(B) $x = 90 \Rightarrow y = 78 - 0.9 \times 90 = -3$

The second prediction is meaningless as the person is

predicted to have negative years to live.

[3]
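The regression line in part (ii) and the predictions in part (iii) can be checked from the four data points. This is a sketch only: `least_squares_line` and `predict` are hypothetical helper names, not anything defined in the original solution.

```python
ages = [35, 45, 55, 65]
years_left = [45.0, 39.0, 30.0, 18.0]

def least_squares_line(xs, ys):
    """Return (a, b) for the regression line y = a + b*x of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

a, b = least_squares_line(ages, years_left)

def predict(age):
    """Predicted remaining years of life from the fitted line."""
    return a + b * age

print(f"a = {a:.1f}, b = {b:.1f}")
print(f"age 65: {predict(65):.1f} years, age 90: {predict(90):.1f} years")
```

The prediction for age 90 lies well outside the range of the data (35 to 65), which is why the extrapolated value is unreliable as well as negative.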

(iv)

For (35, 45): $\hat{y} = 78 - 0.9 \times 35 = 46.5$, residual $= 45 - 46.5 = -1.5$

For (45, 39): $\hat{y} = 78 - 0.9 \times 45 = 37.5$, residual $= 39 - 37.5 = 1.5$

For (55, 30): $\hat{y} = 78 - 0.9 \times 55 = 28.5$, residual $= 30 - 28.5 = 1.5$

For (65, 18): $\hat{y} = 78 - 0.9 \times 65 = 19.5$, residual $= 18 - 19.5 = -1.5$

Sum of squares of residuals $= 4 \times 1.5^2 = 9$

Out of all possible straight lines passing through $(\bar{x}, \bar{y})$, the regression line is the one which minimises the sum of the squares of the residuals. So 9 is the smallest possible value of the sum of the squares of the residuals.

[5]
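The minimising property described in part (iv) can be illustrated numerically. The sketch below computes the sum of squared residuals for the line $y = 78 - 0.9x$ and compares it with a handful of nearby lines. The sizes of the nudges are arbitrary choices for illustration, and the nudged lines are not restricted to those passing through $(\bar{x}, \bar{y})$.

```python
points = [(35, 45.0), (45, 39.0), (55, 30.0), (65, 18.0)]

def sum_squared_residuals(a, b, data):
    """Sum of squared vertical distances from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in data)

best = sum_squared_residuals(78, -0.9, points)

# Nudge the intercept and gradient; each alternative line should fit worse
alternatives = [(78 + da, -0.9 + db)
                for da in (-1, 0, 1)
                for db in (-0.05, 0, 0.05)
                if (da, db) != (0, 0)]
all_worse = all(sum_squared_residuals(a, b, points) > best
                for a, b in alternatives)

print(f"Sum of squared residuals for y = 78 - 0.9x: {best:.2f}")
print(f"Every nudged line has a larger sum: {all_worse}")
```

A check of a few nearby lines is only an illustration, not a proof; the general result follows from the derivation of the least squares formulae.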

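4. (i)

$\bar{x} = \frac{307}{25} = 12.28$

$S_{xx} = \sum x^2 - n\bar{x}^2 = 3853 - 25 \times 12.28^2 = 83.04$

$\bar{y} = \frac{250}{25} = 10$

$S_{yy} = \sum y^2 - n\bar{y}^2 = 3008 - 25 \times 10^2 = 508$

$S_{xy} = \sum xy - n\bar{x}\bar{y} = 3143 - 25 \times 12.28 \times 10 = 73$

$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{73}{\sqrt{83.04 \times 508}} = 0.355$ (3 s.f.)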

$H_0: \rho = 0$

$H_1: \rho \neq 0$

where $\rho$ is the population correlation coefficient.

Critical value for two-tailed test at 5% significance

level for n = 25 is 0.3961.

0.355 < 0.3961 so accept H0: there is not sufficient

evidence at the 5% significance level to suggest that

there is a correlation between temperature and wind

speed.

[9]

(ii) The underlying distribution must be bivariate Normal.

The elliptical shape of the scatter diagram suggests

that this is the case.

[3]

(iii)

The additional data point will reduce the product

moment correlation coefficient, since it is affected by

extreme values. This makes the test less valid as the

shape of the scatter diagram would be less elliptical.

[3]
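The effect described in part (iii) can be quantified by updating the summary statistics with the extra point. This is a sketch under the assumption that the omitted point is exactly as printed in the question (x = 6, y = 29); `pmcc_from_sums` is a hypothetical helper name.

```python
from math import sqrt

def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """PMCC from summary statistics, using S_xx = sum(x^2) - (sum x)^2 / n and so on."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)

# Summary statistics given in Question 4
original = {"n": 25, "sum_x": 307, "sum_y": 250,
            "sum_x2": 3853, "sum_y2": 3008, "sum_xy": 3143}

# Omitted point, assumed to be as printed in part (iii)
x_new, y_new = 6, 29
extended = {
    "n": original["n"] + 1,
    "sum_x": original["sum_x"] + x_new,
    "sum_y": original["sum_y"] + y_new,
    "sum_x2": original["sum_x2"] + x_new ** 2,
    "sum_y2": original["sum_y2"] + y_new ** 2,
    "sum_xy": original["sum_xy"] + x_new * y_new,
}

r_before = pmcc_from_sums(**original)
r_after = pmcc_from_sums(**extended)

print(f"r without the point: {r_before:.3f}")
print(f"r with the point:    {r_after:.3f}")
```

Under that assumption the coefficient falls, consistent with the statement that the product moment correlation coefficient is sensitive to extreme values.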

