
Mathematics 536 Statistics

MATHEMATICS 536

STATISTICS
Independent Study Unit


Table of Contents

General Introduction

1.1 Measures of Dispersion
    Range
    Semi-interquartile range
    Mean deviation
    Standard deviation
    Z-score
1.2 Practise
1.3 Review Exercises
1.4 Going further
2.1 Correlation
    Distribution tables
    Scattergrams
    Coefficient of correlation
    Line of regression
3.1 The “Real” Truth
C Correction Guide
    Section 1
    Section 2
    Section 3


General Introduction
Statistics is a branch of mathematics which is full of subtleties and
which allows us to draw conclusions from a mass of data. In this unit, for
example, statistics will help you to better understand certain realities,
as in the life of Tommy and Clive, two cousins whom you will meet here.
They will even resort to statistical studies to clear up the “selection
mystery” for CEGEP candidates.

This will involve the study of the Z-score, which is used by educational
institutions to classify students at the time of their applications, and of
measures of dispersion.

You will also find certain explanations regarding the “apparent
inexactitude” of mathematics: for example, when a researcher makes a
discovery following numerous trials (practical experiments), certain data
obtained may be more or less exact. However, using these results, he can
deduce a (theoretical) rule indicating how one variable may vary as a
function of the other. That is the beauty of statistics!

Also, by using a graphing calculator, you will be able to save a considerable
amount of time and concentrate on understanding these fabulous topics
which are being taught in high school for the first time in Quebec!

The statistics unit is divided into 3 sections.

The titles below sum up the important ideas which will be addressed in
each of these sections:

1. Measures of dispersion

2. Correlation

3. The “Real” Truth


1.1 Measures of Dispersion


The first signs of spring are beginning to appear, and Tommy has just
received some news which has cheered him immensely. He has been
accepted at CEGEP in the course he applied for. He rushes straight off to
pass the news to his cousin Clive in the hope that he too has the same
good news. The two cousins spend a lot of time imagining themselves
registered in the same courses, and even employed in the same career.

Unhappily, it hasn’t happened! Clive has been refused in the first round of
applications and must wait for the second round. The boys let out their
frustrations, but then decide they need to know why this has happened, and
decide to ask some pertinent questions about the criteria by which the
colleges go about making their decisions on admission. How do colleges
classify students in order to select successful candidates? Could there be
any element of favouritism?

This module offers an answer to these questions, extending and deepening
the knowledge of statistics which you have been developing throughout
high school.

To achieve the objectives of this module, you must compare different
measures of dispersion (range, semi-interquartile range, mean deviation
and standard deviation). Further, you will learn to calculate the Z-score.

So, are you ready to solve the mystery?


Range

Here are two distributions, where the elements represent the ages of the
candidates in a children’s contest:

A ⇒ 1, 2, 3, 4, 5, 6, 7, 8, 9
B ⇒ 3, 4, 4, 4, ?, 6, 6, 6, 7

What number must replace the ? in order that both distributions
should have the same mean? ________________________________
______________________________________________________

You should have found that the ? is 5, because the sum of the elements in
the first list is 45, and 45 divided by 9 elements is 5 years. For the mean
of the second list of ages to equal 5 as well, since it also contains 9
elements, the second list must also sum to 45.
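If you would like to check this reasoning on a computer, here is a minimal Python sketch. Python is not part of this unit, and the variable names (a, b_known, missing) are our own choice for illustration.

```python
# Ages in distribution A, and the known ages in distribution B (the "?" left out).
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b_known = [3, 4, 4, 4, 6, 6, 6, 7]

# Both lists have 9 elements, so equal means require equal sums.
missing = sum(a) - sum(b_known)

mean_a = sum(a) / len(a)
mean_b = (sum(b_known) + missing) / (len(b_known) + 1)

print("missing value:", missing)
print("mean of A:", mean_a, "mean of B:", mean_b)
```

The sketch only uses the fact stated above: with the same number of elements, two lists have the same mean exactly when they have the same sum.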
How are the two lists similar? _____________________________
______________________________________________________

How are they different? __________________________________


______________________________________________________

You may have found that, in the second list, the data are more closely
grouped around the mean and that, even if both lists have the same mean,
they are far from being composed of similar elements. Thus, it is not
enough to rely on a measure of central tendency (mean, median, mode,
studied in 436); we must also consider the measures of dispersion.
A measure of dispersion indicates whether the elements of a sample being
studied are closely grouped or widely spread out.

To compare distributions of data referring to the same subject, we often
use the range.
The range of a distribution of data is the difference between the
extreme values of a quantitative characteristic.

In statistics, a characteristic is said to be quantitative
if it can be evaluated by numbers.


The extreme values (maximum and minimum) of the two
distributions are: _______ and ______ for the first and _______
and _______ for the second.
Now calculate the range of each distribution.___________________
______________________________________________________

You should have found: 9 - 1 = 8 and 7 - 3 = 4. Thus, in the first case, the
difference between the age of the youngest and of the oldest is 8 years,
and, in the second case, the difference is 4 years. These are the two
numbers which represent the ranges of the distributions.
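The same calculation can be sketched in a few lines of Python. This is only an illustration; the helper name data_range is our own and does not come from this unit.

```python
def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [3, 4, 4, 4, 5, 6, 6, 6, 7]

range_a = data_range(a)
range_b = data_range(b)

print("range of A:", range_a, "years")
print("range of B:", range_b, "years")
```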

We can notice that in these two distributions:


• the mean is the same (5 years)
• the range is different (8 years and 4 years)
• in the distribution with range 8 years, the data is more spread out
• in the distribution with range 4 years, the data is more closely
grouped together

However, in certain situations, the range is not useful.

This is the case when there are a few extreme values, or when
the distribution is not uniform. For example, Jenny, a student in
Secondary V, wants to make a study of people’s mass. She asks some
friends from her class and also her little sister. She collects the
following data: 10, 42, 48, 49, 51, 55, 57, 57 and 58 kg.

What is the range of this distribution? _______________________

Is this number a good indicator of the dispersion of the data?______


______________________________________________________

Why?______________________________________________________

Certainly you found that the range is 58 - 10 = 48 kg, but we could obtain
a better value for the range of this data, 16 kg, by removing the values
that do not fit.


Is there a value in this distribution which does not seem to fit with
the others? ___________________________________________
_____________________________________________________
It would appear that the mass of the little sister, that is 10 kg, is
not representative of the set of masses.

Recalculate the range of the distribution if the mass of the little
sister is omitted. _______________________________________
______________________________________________________
This is why 16 kg is a better measure of dispersion.
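To see the effect of a single extreme value, here is a small Python sketch using Jenny's data. Deciding that 10 kg is the value to set aside is a judgment taken from the text, not something the code discovers; the variable names are our own.

```python
masses = [10, 42, 48, 49, 51, 55, 57, 57, 58]  # kg, little sister included

# Range with every value included.
range_all = max(masses) - min(masses)

# Assumption from the text: the little sister (10 kg) is set aside
# because she is not representative of the class.
classmates = [m for m in masses if m != 10]
range_classmates = max(classmates) - min(classmates)

print("range with the little sister:", range_all, "kg")
print("range without her:", range_classmates, "kg")
```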

Indeed, it is reasonable to state that the masses of the students in
Jenny’s class vary at most by 16 kg from one individual to another. This is
why statisticians remove certain “extreme values” (high or low) which
bias the results. This helps us understand the expression “anything can be
proved by statistics”. To overcome this factor, other measures of
dispersion can be used.

One can easily draw out the pertinent information regarding the dispersion
of data using a box-and-whisker plot which you saw last year.

Box-and-whisker plot
In a box-and-whisker plot, the quartiles separate the distribution
into 4 parts, each containing 25% of the data. Q2 represents the
median, and Q1 and Q3 represent the medians of the lower and upper
halves.

Figure 1.2 Box-and-whisker plot (labels: Q1, Q2, Q3 and the range)

Tommy drew the box-and-whisker plot of his class’s chemistry results.

Figure 1.3 Chemistry results (box-and-whisker plot; axis from 40 to 100, Mark (%))


Find approximate values of the range of the distribution, as well as
Q1, Q2, and Q3 from the box-and-whisker plot in Figure 1.3. ________
________________________________________________

You should have found: minimum ≈ 44%; maximum ≈ 100%; Q1 ≈ 52%; Q2 ≈
70%; Q3 ≈ 92%. Thus, the range is 100 - 44 = 56%. It is difficult to assign
any meaning to this 56% because, as we will see shortly, the interest in
studying measures of dispersion is in comparing distributions. First of all,
however, we should know three other measures of dispersion, namely the
semi-interquartile range, the mean deviation and the standard deviation.

Semi-interquartile Range
In the case where the range does not give a good indication of the
distribution, we use other measures. An easy measure to find is the
semi-interquartile range.
The semi-interquartile range is given by the expression
$\frac{Q_3 - Q_1}{2}$, where Q1 and Q3 are the first and third quartiles.
The semi-interquartile range of the students in Tommy’s chemistry
class is:_____________________________________________

Certainly, it is $\frac{92\% - 52\%}{2} = 20\%$. This 20% is a measure of
dispersion and not of position like the mode, mean or median. It must be
understood that the measure of dispersion makes sense when it is used in
making a comparison. So, for example, if Tommy knows that the other
chemistry class has a semi-interquartile range of 30%, this indicates that
the results in his class are less spread out and are more closely grouped
around a “middle” mark (about 70%, the median, for example).
On the other hand, the semi-interquartile range is not a sufficiently
reliable measure in certain cases since it only considers half of the
values in the distribution. However, we may encounter it being used in
advertising, for example, where there is the need to give a rapid
indication of the dispersion of the data in a majority (50%) of cases.
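For readers who want to experiment, here is a minimal Python sketch of the semi-interquartile range. The function names are our own. It takes Q1 and Q3 as the medians of the lower and upper halves, and it assumes that, when the number of data points is odd, the median itself is left out of both halves; other conventions exist and your calculator may use a different one.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the medians of the lower and upper halves.

    Assumption: with an odd number of values, the middle value is
    excluded from both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

def semi_interquartile_range(q1, q3):
    """Half the distance between the first and third quartiles."""
    return (q3 - q1) / 2

# Values read approximately from the chemistry box-and-whisker plot.
chemistry = semi_interquartile_range(52, 92)
print("chemistry class:", chemistry, "%")

# The ages 1 to 9 from the beginning of the section.
q1, q2, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8, 9])
print("Q1, Q2, Q3:", q1, q2, q3)
print("semi-interquartile range:", semi_interquartile_range(q1, q3))
```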


Mean Deviation
A third measure of dispersion is the mean deviation, which is defined
below.

The mean deviation is the mean of the absolute deviations from the mean.
Mean deviation = $\frac{\sum |x_i - \bar{X}|}{n}$, where $x_i$ represents
the data points, $\bar{X}$ the mean, $n$ the number of data points and
$\sum$ is the symbol for the summation (total).

This measure is appropriate for a limited distribution. In the table below,
Clive, the cousin who lives in St. Hyacinthe, has listed the number
of hours of TV that he watched each week after Christmas.

Help Clive to complete the following table in order to find the mean
deviation.

Figure 1.4 Hours of TV watched per week

Weeks after Christmas              1    2    3    4    5    6    7    Sum    Mean
Hours of TV watched               18   14    2    0    0    5   10
Deviation $|x_i - \bar{X}|$

To complete the table, you must:

1. Find the sum of the number of hours
2. Find the average number of hours of TV watched per week
3. Find each deviation from the mean (in absolute value)
4. Calculate the sum of the deviations from the mean
5. Calculate the mean of the deviations from the mean.

This gives a mean of 7 hours per week of watching TV with a mean
deviation of 6 hours.
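The five steps can also be followed in a short Python sketch. This is only one possible way of writing them; the variable names are our own.

```python
hours = [18, 14, 2, 0, 0, 5, 10]  # Clive's weekly TV hours, weeks 1 to 7

# Step 1: sum of the hours.
total = sum(hours)

# Step 2: mean number of hours per week.
mean = total / len(hours)

# Step 3: absolute deviation of each week from the mean.
deviations = [abs(h - mean) for h in hours]

# Step 4: sum of the deviations.
deviation_total = sum(deviations)

# Step 5: mean of the deviations, i.e. the mean deviation.
mean_deviation = deviation_total / len(hours)

print("sum:", total, "mean:", mean)
print("deviations:", deviations)
print("mean deviation:", mean_deviation)
```

The absolute value in step 3 matters: without it, the positive and negative deviations would cancel and their sum would always be zero.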


Did you know that...


At the beginning of 1998, 3 million Quebec residents were
plunged into total darkness, the result of an electricity cut caused
by the worst ice storm to have hit Quebec to date. During this
storm, which lasted 4 days, more than 80 mm of freezing rain fell. The
damage caused to the electricity network was catastrophic. Hydro-
Quebec took almost 32 days to restore current to everyone.
__________________________________________________________

Looking at these results, we note that:


• there are some extreme values
• the data is very dispersed, but the mean deviation takes all values
into account.

Although the mean deviation takes all values into account, absolute values
are difficult to manipulate, so statisticians use almost exclusively the
standard deviation, which uses squares.

Standard Deviation
The standard deviation σ (sigma) is the most widely used of the
measures of dispersion. The graphing calculator calculates the standard
deviation very rapidly, as well as the mean, median etc. To calculate the
standard deviation, the calculator uses the following formula:
$\sigma = \sqrt{\frac{\sum (x_i - \bar{X})^2}{n}}$, where $x_i$ represents
the data points, $\bar{X}$ the mean, $n$ the number of data points and
$\sum$ the summation (or the total).
N.B. In the case of a sample, or if n < 30:
- replace σ by S
- replace n by n-1
- in case your batteries die, it is worth knowing the standard
deviation formula.
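In case your batteries do die, here is a minimal Python sketch of both versions of the formula (division by n for σ, division by n - 1 for S). The function names are our own, and Clive's TV hours are reused only as convenient test data.

```python
from math import sqrt

def population_sd(values):
    """Standard deviation sigma: squared deviations divided by n."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / n)

def sample_sd(values):
    """Standard deviation S: squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

hours = [18, 14, 2, 0, 0, 5, 10]  # Clive's weekly TV hours
sigma = population_sd(hours)
s = sample_sd(hours)

print(f"sigma = {sigma:.2f} hours")
print(f"S     = {s:.2f} hours")
```

Because S divides by a smaller number, it is always a little larger than σ for the same data; the difference shrinks as n grows.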


Here is a list of 10 Quebec municipalities and their respective populations
(i.e. number of inhabitants).

Figure 1.5 Population of 10 Quebec towns

Town           Population      Town              Population
Quebec         167 517         Chicoutimi        62 670
Longueuil      129 874         Jonquière         57 933
Gatineau        92 284         Trois-Rivières    49 426
Sherbrooke      76 429         Drummondville     43 171
Hull            65 764         Granby            42 804

How would you find the value of the mean deviation for this
distribution? ___________________________________________
______________________________________________________
______________________________________________________

The use of a graphing calculator is certainly strongly recommended. It
increases the speed of execution and avoids interminable calculations. The
time saved can then be used for the understanding and analysis of the
results, the most important stage.

To enter a list (a distribution) into a graphing calculator, you must:

- (helpful) Clear the memory by pressing STAT CLEARLIST and choosing
  the list: 2nd L1 ENTER
- Press STAT EDIT, move to the list you want to use and enter the data.

Enter the data representing the population of the ten towns into a graphing
calculator.


How do we go about obtaining the required information (standard
deviation)?_____________________________________________
______________________________________________________

To show statistical calculations on the graphing calculator, you must
enter the data, then press STAT CALC 1-VAR STATS and enter the
appropriate list name, for example 2nd L1.

The results appear in the following order:


X the mean;
∑X the sum of the data;
SX the standard deviation in the case of a sample or when n < 30;
σX the standard deviation for all other cases;
MED the median;
n the number of data points in the distribution.

In the case we are currently considering, which type of standard
deviation is appropriate, SX or σX?__________________________
Why?__________________________________________________
______________________________________________________

We must take SX because there are only 10 towns (<30). Thus, analyzing
the results, we obtain a mean population for the 10 towns of 78 787
inhabitants, and a standard deviation of 40 770 inhabitants. We cannot
interpret these results in themselves, since we cannot compare them with
results from another distribution of similar data on the same subject.
There is, however, a measure of position which can help us to analyze this
sort of data: the standard score or Z-score.
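If you have no graphing calculator at hand, the same values can be obtained with Python's standard statistics module, as sketched below. Here stdev divides by n - 1 (the calculator's SX) and pstdev divides by n (the calculator's σX); the variable names are our own.

```python
from statistics import mean, pstdev, stdev

# Populations of the 10 towns in Figure 1.5.
populations = [167517, 129874, 92284, 76429, 65764,
               62670, 57933, 49426, 43171, 42804]

x_bar = mean(populations)       # the mean
s_x = stdev(populations)        # SX: divides by n - 1
sigma_x = pstdev(populations)   # sigma X: divides by n

print(f"mean    = {x_bar:.1f} inhabitants")
print(f"SX      = {s_x:.0f} inhabitants")
print(f"sigma X = {sigma_x:.0f} inhabitants")
```

With only 10 towns, SX is the value to keep, and it agrees with the standard deviation of about 40 770 inhabitants quoted above.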


Z-Score
Do you remember Clive and Tommy? Of course! Some days after receiving
their respective CEGEP replies, Tommy called Clive and said to him, to
give him some comfort, “Don’t worry about it, someone has explained to
me that it is perhaps your Z-score which caused you not to be accepted the
first time. You will certainly be accepted the second time.” Clive understood
nothing of what his cousin had explained to him about Z-scores, so his
encouragement did not have the anticipated effect. Here you will find out
what it is...

Carl Friedrich Gauss devoted part of his life to working on numbers. Based
on his work, others were able to prove, by means of complicated and
clever calculations, that, given a sufficiently large distribution of data
points, the data tend to spread themselves around a central value
(the mean). The graph obtained is called the Gauss curve, or the normal
curve. In this graph, 95.45% of the data are grouped within two
standard deviations on either side of the mean, as shown in
Figure 1.6.
Figure 1.6 Gauss curve (centred on the mean, with 95.5% of the data between -2σ and +2σ)
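The figure of 95.45% can be checked numerically. The sketch below relies on a standard result that is not derived in this unit: for a normal curve, the share of the data lying within k standard deviations of the mean equals erf(k / √2), where erf is the error function available in Python's math module. The function name share_within is our own.

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```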


The two graphs below show that the distributions are fundamentally
equivalent. They differ only in their means (X ) and by their standard
deviations (σ ). To better compare the distributions, we use a measure
which eliminates the differences in the standard deviations and in the
means. This new measure is called the Z-score (standard score).
Figure 1.7 Variation in the mean without changing the standard deviation (σ1 = σ2 = σ3)

Figure 1.8 Variation in the standard deviation without changing the mean (σ1 ≠ σ2 ≠ σ3)

The Z-score is defined as follows:


The Z-score or standard score is calculated using the following
formula: $Z = \frac{x_i - \bar{X}}{\sigma}$, where $x_i$ represents the
data points, $\bar{X}$ the mean and $\sigma$ the standard deviation of a
distribution.


What is the subtraction in this formula?______________________
______________________________________________________
And the division?________________________________________
______________________________________________________
The subtraction gives the deviation which separates a specific data point
from the mean. The division creates a ratio between this deviation and the
standard deviation (how many times is the denominator contained in the
numerator?). Thus, by these two simple operations, we are able to obtain,
for each data point, a measure of its position relative to the mean,
expressed in standard deviations. For example, we could say of a data
point that it lies at “one and a half standard deviations below the mean”
(-1.5 σ).
Figure 1.9 Example of the position -1.5 σ

_________________________________________________

Did you know that...


• The name given to the Gauss curve, also known as the normal curve,
is just an honorific, since Gauss himself did not develop this theory.
However, he worked extensively on the theory of numbers, from
which statistics originated.
• The equation of the Gauss curve is: $y = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$.
You can amuse yourself trying to draw this graph on your graphing
calculator! To help you define an appropriate window, use the other
mode of representation (table of values) and look carefully at the
minimum and maximum values.
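A table of values can also be produced with a few lines of Python. The sketch below assumes that the curve meant here is the standard normal curve, y = e^(-x²/2) / √(2π); the function name gauss is our own.

```python
from math import exp, pi, sqrt

def gauss(x):
    """Height of the standard normal curve at x (assumed form of the Gauss curve)."""
    return exp(-x ** 2 / 2) / sqrt(2 * pi)

# Table of values for x from -3 to 3, to help choose a graphing window.
table = [(x, gauss(x)) for x in range(-3, 4)]
for x, y in table:
    print(f"x = {x:2d}   y = {y:.4f}")
```

The largest value in the table occurs at x = 0 and the values fall away symmetrically on both sides, which suggests a window a little wider than -3 to 3 horizontally and from 0 to about 0.5 vertically.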


So, if you lay out the results of all the students in Tommy’s and Clive’s
classes, you will be able to calculate their respective Z-scores and
attempt to solve the mystery which is still hovering over us!
Here are the results (average marks) of the students when they
sent their applications for admission to the CEGEPs.
Figure 1.10 Average marks of the students in the two groups

Group            Average mark
Tommy’s class    76 70 80 81 68 60 78 84 75 72 70 77 75 83 60
                 88 77 74 74 65 83 78 73 80 70 68 85 77 74
Clive’s class    85 70 53 66 77 85 81 70 73 95 58 68 70 73 91
                 85 56 67 68 69 80 88 56 95 77 77 72 80 90

What values are needed to calculate the Z-score?
The_____________ and the ____________________.

You need to know the mean and the standard deviation.

Enter both sets of values in the graphing calculator, find the values
of the means and standard deviations of the two distributions and
calculate the Z-scores of the two cousins. The numbers in bold
characters represent their respective marks.
__________________________________________________________
__________________________________________________________
__________________________________________________________

You should have found that Tommy’s class had a mean of 75% and a
standard deviation of 6.9%. Clive’s class had a mean of 75% and a standard
deviation of 11.5% (figures 1.11 and 1.12). Since Tommy had 88%, his
Z-score is $\frac{88\% - 75\%}{6.9\%} = 1.88$. You should note that the
subtraction of two percentages gives a percentage and that the division
of the two percentages gives a ratio (or the number of times the standard
deviation is contained in the deviation from the mean). In the case of
Clive, who had obtained a slightly lower mark, 85%, his Z-score is
$\frac{85\% - 75\%}{11.5\%} = 0.87$. Look
at the results of the two cousins placed on the same graph (figure 1.13).
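These values can be reproduced with a short Python sketch using the marks in Figure 1.10. It uses the sample standard deviation (division by n - 1), since each class has fewer than 30 students; the function name z_score and the variable names are our own.

```python
from statistics import mean, stdev

tommy_class = [76, 70, 80, 81, 68, 60, 78, 84, 75, 72, 70, 77, 75, 83, 60,
               88, 77, 74, 74, 65, 83, 78, 73, 80, 70, 68, 85, 77, 74]
clive_class = [85, 70, 53, 66, 77, 85, 81, 70, 73, 95, 58, 68, 70, 73, 91,
               85, 56, 67, 68, 69, 80, 88, 56, 95, 77, 77, 72, 80, 90]

def z_score(mark, marks):
    """Deviation of a mark from the class mean, in sample standard deviations."""
    return (mark - mean(marks)) / stdev(marks)

z_tommy = z_score(88, tommy_class)  # Tommy's mark is 88%
z_clive = z_score(85, clive_class)  # Clive's mark is 85%

print(f"Tommy's class: mean = {mean(tommy_class)}, S = {stdev(tommy_class):.1f}")
print(f"Clive's class: mean = {mean(clive_class)}, S = {stdev(clive_class):.1f}")
print(f"Z (Tommy) = {z_tommy:.2f}")
print(f"Z (Clive) = {z_clive:.2f}")
```

Both classes have the same mean, so the whole difference between the two Z-scores comes from the marks themselves and from the standard deviations of the two classes.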


This value removes the effect of the mean and the standard deviation of
the two groups. This permits a comparison between the different data
sets.

Figure 1.11 Results of Tommy’s class placed on a normal curve (mean X1 = 75%; Tommy’s position at +1.88 σ1)

Figure 1.12 Results of Clive’s class placed on a normal curve (mean X2 = 75%, σ2 = 11.5%; Clive’s position at +0.87 σ2)


Figure 1.13 Graph of the two groups combined (centred on Z = 0; Clive’s position at +0.87, Tommy’s position at +1.88)

Explain the difference in their Z-scores knowing that the means of
their two respective groups are identical._____________________
______________________________________________________
______________________________________________________

Is the Z-score a measure of dispersion or of position?________________


__________________________________________________________

Do you believe that this could have an influence on Clive’s acceptance at
the same CEGEP as his cousin?__________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
The standard deviation shows that the results of the students in Tommy’s
class are not so widely dispersed as those of the students in his cousin
Clive’s class. Further, Clive ended up with a slightly lower mark. This did
not enable his mark to distinguish him from the majority of the students
in his class. He is less than a single standard deviation (0.87) above the
mean, although this, of course, is good. For Tommy, the situation is
different, since, with a Z-score of almost two (1.88), he pulls himself
well clear of the large “middle group” of students in his class. Thus, the
Z-score is a measure of position. This factor could have had a
determining influence on the selection of the students. But, before
jumping too quickly to conclusions, read the Did you know that... which
follows.


Did you know that...

The Z-score is used only by universities, to classify
students coming from college. Our two friends, Tommy and Clive, would
therefore not have been affected by the marks of their fellow students.
On the other hand, some CEGEPs take other factors into account when they
make their selections; these depend on the courses of study, the
prerequisites, etc. These factors must have acted against Clive much more
than the dispersion of marks in his class.


1.2 Practise
1. A family doctor collected in a table the heights (in m) of some of his
patients in order to do some statistical analysis. The table is below:

1.85 0.95 1.04 1.15 0.80 1.18 1.32 1.45 1.24 1.03 1.28
1.75 1.42 1.53 1.22 1.24 1.27 1.18 1.53 1.29 1.41 0.99
1.33 1.21 1.28 1.52 1.65 0.42 1.80 1.10 1.25 1.35 1.42
1.26 1.32 1.18 1.32 1.22 1.05 1.23 1.42 0.75 1.15 1.32

For this distribution, find:

a) the range__________________________________________

b) the semi-interquartile range___________________________

c) the mean deviation__________________________________

d) the standard deviation________________________________

e) Amy’s Z-score, if her height is 1.80 m___________________

f) Interpret the results_________________________________


_________________________________________________
_________________________________________________
_________________________________________________
_________________________________________________


2. Statistics Canada collects national data in order to obtain statistics about a wide
range of subjects. This is very useful for determining social, economic,
environmental and other policies. Here are some data collected by certain
countries in 1993.
Figure 1.14
Population, density, birth and death rates, 1993

                Population    Population    Birth rate    Death rate
Country         (× 1 000)     (per km²)     (per 1 000 inhabitants)
                1             2             3             4
Canada           28 753         2.9         13.5           7.2
Germany          81 190       227.5         10.0          11.0
Australia        17 657         2.3         14.7           6.9
Austria           7 884        94.0         12.0          10.5
Belgium          10 045       329.3         12.5          10.6
Denmark           5 189       120.4         12.9          12.1
Spain            39 083        77.4         10.0           8.5
U.S.A.          257 908        27.5         15.7           8.8
Finland           5 066        15.0         12.8          10.1
France           57 667       105.0         13.0           9.1
Greece           10 368        78.5          9.8           9.4
Ireland           3 524        50.1         15.0           8.8
Iceland             258         2.5         17.5           7.0
Italy            56 120       186.3         10.1           9.6
Japan           124 670       330.3          9.5           7.0
Luxembourg          387       148.9         13.0           9.6
Norway            4 312        13.3         13.9          10.9
New Zealand       3 480        13.0         16.8           7.7
Holland          15 300       375.0         12.8           9.0
Portugal          9 888       107.0         11.5          10.7
U.K.             58 191       237.7         13.5          10.9
Sweden            8 745        19.4         13.5          11.1
Switzerland       6 938       168.0         12.1           9.1
Turkey           59 490        76.2         23.3           6.7


Consider the following four distributions:

1 the population in thousands of inhabitants
2 the population in inhabitants per km²
3 the birth rate per 1 000 inhabitants
4 the death rate per 1 000 inhabitants

a) For each distribution, calculate:

                              1         2         3         4
Range
Semi-interquartile range
Mean deviation
Standard deviation
Canada’s Z-score

b) Interpret Canada’s results with respect to the other countries by
using the Z-score. Try to explain the results.___________________
______________________________________________________
______________________________________________________
______________________________________________________
______________________________________________________
______________________________________________________
______________________________________________________
______________________________________________________
______________________________________________________
______________________________________________________
______________________________________________________
______________________________________________________
______________________________________________________


1.3 Review Exercises


1. Describe the four measures of dispersion.

Figure 1.15
Name      Formula      Method of calculation      Interpretation      Limitations

2. Describe the Z-score.

Name      Formula      Method of calculation      Interpretation      Limitations


1.4 Going further


Do you want to know more? OK - here goes. There are two somewhat complicated
formulae to calculate the standard deviation of a distribution. They are:

Formula 1: $s = \sqrt{\frac{\sum (x_i - \bar{X})^2}{n - 1}}$

Formula 2: $s = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}}$

Check whether they would give the same result with the following distribution: 7, 8, 3,
1, 1, 4.

Figure 1.17

$x_i$           $x_i - \bar{X}$             $(x_i - \bar{X})^2$             $x_i^2$
7
8
3
1
1
4
$\sum x_i =$    $\sum (x_i - \bar{X}) =$    $\sum (x_i - \bar{X})^2 =$    $\sum x_i^2 =$

Using formula 1: s=

Using formula 2: s=

Which formula do you prefer? ___________________________________________

For someone without a graphing calculator, why is it preferable to use formula 2?____
__________________________________________________________________
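Once you have completed the table by hand, one way to check your work is the Python sketch below, which evaluates both formulas on the distribution 7, 8, 3, 1, 1, 4. The variable names are our own.

```python
from math import sqrt

data = [7, 8, 3, 1, 1, 4]
n = len(data)
mean = sum(data) / n

# Formula 1: uses the deviations from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in data)
s1 = sqrt(sum_sq_dev / (n - 1))

# Formula 2: uses only the sum of the values and the sum of their squares.
sum_x = sum(data)
sum_x2 = sum(x ** 2 for x in data)
s2 = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(f"formula 1: s = {s1:.4f}")
print(f"formula 2: s = {s2:.4f}")
```

Formula 2 needs no column of deviations: the two running totals (the sum of the values and the sum of their squares) can be accumulated in a single pass through the data, which is convenient without a graphing calculator.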
