

MEASURES OF CENTRAL TENDENCY, DISPERSION AND CORRELATION



MEASURES OF CENTRAL TENDENCY (AVERAGES)
Introduction
We saw how data can be summarised and presented in tabular, chart and
graphical formats. Sometimes you need more information than diagrammatic
representations of data can provide. In such circumstances you can apply some
sort of numerical analysis: for example, you might calculate a measure of
centrality and a measure of dispersion.

An average is a representative figure that is used to give some impression of the size of
all the items in the population. There are three main types of average.
Arithmetic mean
Mode
Median
We will be looking at each of these averages in turn, their calculation, advantages and
disadvantages.

The arithmetic mean
Arithmetic mean of ungrouped data
The arithmetic mean is the best known type of average and is widely understood. It is
used for further statistical analysis.


Example: The arithmetic mean
The demand for a product on each of 20 days was as follows (in units).
3 12 7 17 3 14 9 6 11 10 1 4
19 7 15 6 9 12 12 8
The arithmetic mean of daily demand is 185/20 = 9.25 units.
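
As a quick check, the calculation for this example can be sketched in Python. The variable
names are illustrative and are not part of the original notes.

```python
# Daily demand (units) for the 20 days in the example above.
demand = [3, 12, 7, 17, 3, 14, 9, 6, 11, 10, 1, 4,
          19, 7, 15, 6, 9, 12, 12, 8]

total = sum(demand)        # sum of all observations
n = len(demand)            # number of observations
mean_demand = total / n    # arithmetic mean = sum / count

print(total, n, mean_demand)  # 185 20 9.25
```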

Arithmetic mean of data in a frequency distribution
It is more likely in an assessment that you will be asked to calculate the arithmetic mean
of a frequency distribution.
Example:
Consider the following table and complete the (fx) column and calculate the value of the
arithmetic mean.

ARITHMETIC MEAN OF GROUPED DATA IN CLASS INTERVALS
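The arithmetic mean of grouped data is determined as:

\[ \bar{x} = \frac{\sum fx}{n} \]

where x is the value taken to represent each class, f is the frequency of that class and
n is the number of values recorded, or the number of items measured.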
The mid-point of class intervals
To calculate the arithmetic mean of grouped data we therefore need to decide on a value
which best represents all of the values in a particular class interval. This value is known
as the mid-point.
The mid-point of each class interval is conventionally taken as its representative value,
on the assumption that the frequencies occur evenly over the class interval range.
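
The mid-point approach can be sketched as follows. The class intervals and frequencies are
hypothetical (the table from the notes is not reproduced here), and the function name
grouped_mean is an assumption made for illustration.

```python
# Hypothetical class intervals (lower, upper) and their frequencies.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [4, 6, 8, 2]

def grouped_mean(classes, frequencies):
    """Estimate the mean as sum(f * x) / sum(f), where x is the class mid-point."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    n = sum(frequencies)
    sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    return sum_fx / n

estimate = grouped_mean(classes, frequencies)
print(estimate)  # 19.0
```

The result is only an estimate, because every item in a class is treated as if it sat at
the mid-point of that class.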




Example: calculate the value of the arithmetic mean from the following frequency
distribution table.

The arithmetic mean of combined data
Suppose that the mean age of a group of five people is 27 and the mean age of another
group of eight people is 32. How would we find the mean age of the whole group of 13
people?

The sum of the ages in the first group is 5 × 27 = 135
The sum of the ages in the second group is 8 × 32 = 256
The sum of all 13 ages is 135 + 256 = 391
The mean age of the whole group is therefore 391/13 = 30.08 years (to two decimal places).


The advantages and disadvantages of the arithmetic mean
Advantages of the arithmetic mean
It is easy to calculate
It is widely understood
It is representative of the whole set of data
It is supported by mathematical theory and is suited to further statistical analysis
Disadvantages of the arithmetic mean
Its value may not correspond to any actual value. For example, the 'average' family
might have 2.3 children, but no family has exactly 2.3 children.
An arithmetic mean might be distorted by extremely high or low values. For example,
the mean of 3, 4, 4 and 6 is 4.25, but the mean of 3, 4, 4, 6 and 15 is 6.4. The high
value, 15, distorts the average and in some circumstances the mean would be a
misleading and inappropriate figure.
Question


Weighted Mean
A firm owns six factories at which the basic weekly wages are given in column 2 of
Table 4.4. Find the mean basic wage earned by employees of the firm.


This result, which takes account of the number of employees, is a much more realistic
measure of location for the distribution of the basic wage than the straight mean we found
first. The second result is called the weighted mean of the basic wage, where the weights
are the numbers of employees at each factory.
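
The difference between the straight mean and the weighted mean can be sketched as below.
Table 4.4 is not reproduced in these notes, so the wages and employee numbers used here are
hypothetical and serve only to show the mechanics.

```python
# Hypothetical basic weekly wages and employee counts for six factories.
wages = [400, 420, 450, 480, 500, 550]
employees = [50, 20, 10, 10, 5, 5]

# Straight (unweighted) mean: every factory counts equally.
simple_mean = sum(wages) / len(wages)

# Weighted mean: each wage is weighted by the number of employees earning it.
weighted_mean = sum(w * e for w, e in zip(wages, employees)) / sum(employees)

print(round(simple_mean, 2), weighted_mean)  # 466.67 429.5
```

With these figures most employees work at the lower-paying factories, so the weighted mean
is lower than the straight mean.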

Geometric Mean
The geometric mean is seldom used outside of specialist applications. It is appropriate
when dealing with a set of data such as that which shows exponential growth (that is
where the rate of growth depends on the value of the variable itself), for example
population levels arising from indigenous birth rates, or that which follows a geometric
progression, such as changes in an index number over time, for example the Retail Price
Index.

It is sometimes difficult to decide when the geometric mean is a better choice than the
arithmetic mean. We will return to the use of geometric means in the next chapter. The
geometric mean (GM) is evaluated by taking the nth root of the product of all n
observations, that is:

\[ GM = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} \]

Example:
In the year 2000 the population of a town is 300,000. In 2010 a new census reveals it has
risen to 410,000. Estimate the population in 2005. If we assume that there was no net
immigration or emigration then the birth rate will depend on the size of the population
(exponential growth) so the geometric mean is appropriate.

\[ GM = \sqrt{300{,}000 \times 410{,}000} \approx 350{,}714 \]

(Note that this is appreciably less than the arithmetic mean which is 355,000.)
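
A minimal sketch of the calculation, taking the geometric mean of the two census figures as
the estimate for the point midway between the two censuses. The function name
geometric_mean is an assumption made for illustration.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive observations."""
    n = len(values)
    return math.prod(values) ** (1 / n)

pop_2000, pop_2010 = 300_000, 410_000

mid_estimate = geometric_mean([pop_2000, pop_2010])  # geometric mean
arithmetic = (pop_2000 + pop_2010) / 2               # arithmetic mean, for comparison

print(round(mid_estimate), arithmetic)  # 350714 355000.0
```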
Harmonic Mean
Another measure of central tendency which is only occasionally used is the harmonic
mean. It is most frequently employed for averaging speeds where the distances for each
section of the journey are equal.
If the n speeds are x_1, x_2, ..., x_n then the harmonic mean (HM) is:

\[ HM = \frac{n}{\sum \frac{1}{x}} \]

Example:
An aeroplane travels a distance of 900 miles. If it covers the first third and the last third
of the trip at a speed of 250 mph and the middle third at a speed of 300 mph, find the
average speed.
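
The aeroplane example can be worked through as follows. This is a sketch that assumes the
usual definition of the harmonic mean (n divided by the sum of the reciprocals); it is
cross-checked against total distance divided by total time.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of the observations."""
    return len(values) / sum(1 / x for x in values)

# Three equal sections of 300 miles each, flown at these speeds (mph).
speeds = [250, 300, 250]
average_speed = harmonic_mean(speeds)

# Cross-check: average speed = total distance / total time.
total_time = sum(300 / s for s in speeds)   # hours
check = 900 / total_time

print(round(average_speed, 1), round(check, 1))  # 264.7 264.7
```

The simple arithmetic mean of the three speeds would be about 266.7 mph, which overstates
the true average because more time is spent flying at the slower speed.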

THE MODE
The mode or modal value is an average which means 'the most frequently occurring
value'.
Mode of a Simple Frequency Distribution
Consider the following frequency distribution:
Table 8.12 Accident distribution data

In this case the most frequently occurring value is 1 (it occurred 39 times) and so the
mode of this distribution is 1.
Example 1:
The following is an ordered list of the number of complaints received by a telephone
supervisor per day over a period of a fortnight:
3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 8, 9, 10, 12.
The value which occurs most frequently is 6, therefore:
mode = 6
The mode of a grouped frequency distribution
There are various methods of estimating the modal value (including a graphical one). A
satisfactory result is obtained easily by using the following formula:

\[ \text{Mode} = L + \left( \frac{f_m - f_{m-1}}{2f_m - f_{m-1} - f_{m+1}} \right) \times i \]

where: L = lower boundary of the modal class
i = width of the modal class interval
f_m = the frequency of the modal class
f_{m-1} = the frequency of the pre-modal class
f_{m+1} = the frequency of the post-modal class
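
A minimal sketch of this estimate, assuming the usual interpolation formula
Mode = L + (f_m - f_{m-1}) / (2 f_m - f_{m-1} - f_{m+1}) × i. The modal class 175 to 180 cm
with frequency 20 comes from the example that follows; the neighbouring frequencies (15 and
12) are hypothetical, because the table is not reproduced in these notes.

```python
def grouped_mode(lower, width, f_m, f_prev, f_next):
    """Interpolate the mode inside the modal class.

    lower  : lower boundary of the modal class (L)
    width  : width of the modal class interval (i)
    f_m    : frequency of the modal class
    f_prev : frequency of the pre-modal class
    f_next : frequency of the post-modal class
    """
    return lower + (f_m - f_prev) / (2 * f_m - f_prev - f_next) * width

# f_prev and f_next below are hypothetical values chosen for illustration.
mode_estimate = grouped_mode(lower=175, width=5, f_m=20, f_prev=15, f_next=12)
print(round(mode_estimate, 2))  # 176.92
```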
Example:
Find the modal value of the height of employees from the data shown in the Table.

Solution:
The largest frequency is 20, in the fourth class, so this is the modal class, and the
value of the mode lies between 175 and 180 cm. So, using the above formula:

Advantages and Disadvantages of the Mode
(a) Advantages
(i) It is not distorted by extreme values of the observations.
(ii) It is easy to calculate.
(b) Disadvantages
(i) It cannot be used to calculate any further statistic.
(ii) It may have more than one value (although this feature helps to show the
shape of the distribution).

MEDIAN
The median is the value of the middle member of an array. The middle item of an odd
number of items is calculated as the ((n + 1)/2)th item.

If a set of n observations is arranged in order of size then, if n is odd, the median is the
value of the middle observation; if n is even, the median is the value of the arithmetic
mean of the two middle observations.
Note that the same value is obtained whether the set is arranged in ascending or
descending order of size, though the ascending order is most commonly used. This
arrangement in order of size is often called ranking.
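The rules for calculating the median are:
(a) If n is odd and M is the value of the median then:
M = the value of the ((n + 1)/2)th observation
(b) If n is even then:
M = the arithmetic mean of the values of the (n/2)th and ((n/2) + 1)th observations
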
Example: The median
(a) The median of the following nine values:
8 6 9 12 15 6 3 20 11
is found by taking the middle item (the fifth one) in the array:
3 6 6 8 9 11 12 15 20
The median is 9.
(b) Consider the following array.
1 2 2 2 3 5 6 7 8 11
The median is 4 because, with an even number of items, we have to take the arithmetic
mean of the two middle ones (in this example, (3 + 5)/2 = 4).
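
Both cases can be checked with a short Python sketch. The function below is illustrative;
Python's standard library also offers statistics.median, which follows the same rule.

```python
def median(values):
    """Middle value of the ranked data; mean of the two middle values if n is even."""
    ordered = sorted(values)          # ranking: arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the ((n + 1)/2)th item
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the middle pair

print(median([8, 6, 9, 12, 15, 6, 3, 20, 11]))    # 9
print(median([1, 2, 2, 2, 3, 5, 6, 7, 8, 11]))    # 4.0
```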

Finding the median of an ungrouped frequency distribution
The median of an ungrouped frequency distribution is found in a similar way. Consider
the following distribution.

The median would be the (35 + 1)/2 = 18th item. The 18th item has a value of 16, as we
can see from the cumulative frequencies in the right hand column of the above table.

Finding the median of a grouped frequency distribution
If the sample size is large and the data continuous, it is possible to find the median of
grouped data by estimating the position of the (n/2)th value. In this case the following
formula may be used:

\[ \text{Median} = L + \left( \frac{\frac{n}{2} - F}{f} \right) \times i \]

where: L is the lower class boundary of the median class
F is the cumulative frequency up to but not including the median class
f is the frequency of the median class
i is the width of the class interval.
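
A minimal sketch of this estimate, assuming the interpolation formula
Median = L + ((n/2 - F) / f) × i and classes of equal width. The frequency distribution
used here is hypothetical.

```python
def grouped_median(lower_bounds, width, frequencies):
    """Estimate the median of a grouped distribution with equal class widths."""
    n = sum(frequencies)
    target = n / 2                 # position of the (n/2)th value
    cumulative = 0                 # F: cumulative frequency before the current class
    for lower, f in zip(lower_bounds, frequencies):
        if f > 0 and cumulative + f >= target:
            # The current class is the median class: interpolate within it.
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must contain at least one positive count")

# Hypothetical classes 0-10, 10-20, 20-30, 30-40.
estimate = grouped_median([0, 10, 20, 30], 10, [5, 15, 20, 10])
print(estimate)  # 22.5
```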

Example:
From the following frequency distribution establish the median.

Advantages and Disadvantages of the Median
(a) Advantages
(i) Its value is not distorted by extreme values, open-ended classes or classes of
irregular width.
(ii) All the observations are used to order the data even though only the middle one
or two observations are used in the calculation.
(iii) It can be illustrated graphically in a very simple way.
(b) Disadvantages
(i) In a grouped frequency distribution the value of the median within the median
class can only be an estimate, whether it is calculated or read from a graph.
(ii) Although the median is easy to calculate it is difficult to manipulate arithmetically.
It is of little use in calculating other statistical measures.

QUANTILES
Definitions
If a set of data is arranged in ascending order of size, quantiles are the values of the
observations which divide the number of observations into a given number of equal parts.
They cannot really be called measures of central tendency, but they are measures of
location in that they give the position of specified observations on the x-axis.
The most commonly used quantiles are:
(a) Quartiles
These are denoted by the symbols Q1, Q2 and Q3 and they divide the observations into
four equal parts:
Q1 has 25% below it and 75% above it.
Q2 has 50% below it and 50% above, i.e. it is the median and is more usually
denoted by M.
Q3 has 75% below it and 25% above.
(b) Deciles
These values divide the observations into 10 equal parts and are denoted by D1, D2, ...
D9, e.g. D1 has 10% below it and 90% above, and D2 has 20% below it and 80% above.
(c) Percentiles
These values divide the observations into 100 equal parts and are denoted by P1, P2,
P3, ... P99, e.g. P1 has 1% below it and 99% above.
Note that D5 and P50 are both equal to the median (M).

Calculation of Quantiles
Example:
Table 4.7 shows the grouped distribution of the overdraft sizes of 400 bank customers.
Find the quartiles, the 4th decile and the 95th percentile of this distribution.








Size of overdraft of bank customers

Using appropriately amended versions of the formula for the median given previously,
the arithmetic calculations are as follows:
The formula for the first quartile (Q1) may be written as:

\[ Q_1 = L + \left( \frac{\frac{n}{4} - F}{f} \right) \times i \]

where: L is the lower class boundary of the class which contains Q1
F is the cumulative frequency up to but not including the class which contains Q1
f is the frequency of the class which contains Q1 and
i is the width of the class interval.
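
The same interpolation works for any quantile if n/4 is replaced by the required fraction of
n. The sketch below assumes equal class widths; the overdraft distribution is hypothetical
(Table 4.7 is not reproduced here), although it does total 400 customers as in the example.

```python
def grouped_quantile(lower_bounds, width, frequencies, p):
    """Estimate the quantile with a fraction p of the observations below it."""
    n = sum(frequencies)
    target = p * n                 # e.g. n/4 for Q1, 0.95 * n for P95
    cumulative = 0                 # F: cumulative frequency before the current class
    for lower, f in zip(lower_bounds, frequencies):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("p must lie between 0 and 1")

# Hypothetical classes 0-100, 100-200, ..., 400-500 for 400 customers.
bounds = [0, 100, 200, 300, 400]
freqs = [60, 110, 120, 70, 40]

q1 = grouped_quantile(bounds, 100, freqs, 0.25)
q2 = grouped_quantile(bounds, 100, freqs, 0.50)   # the median
q3 = grouped_quantile(bounds, 100, freqs, 0.75)
d4 = grouped_quantile(bounds, 100, freqs, 0.40)   # 4th decile
p95 = grouped_quantile(bounds, 100, freqs, 0.95)  # 95th percentile

print(round(q1, 1), round(q2, 1), round(q3, 1), round(d4, 1), round(p95, 1))
```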
This gives:




MEASURES OF DISPERSION
Measures of dispersion give some idea of the spread of a variable about its average. The
main measures are as follows.
The range
The semi-interquartile range
The standard deviation
The variance
The coefficient of variation
1 The range
The range of a distribution is the difference between the largest and the smallest values in
the set of data.

If the data is given in the form of a grouped frequency distribution, the range is the
difference between the highest upper class boundary and the lowest lower class
boundary.
Example:
Calculate the mean and the range of the following set of data.
4 8 7 3 5 16 24 5
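
The answer to this example can be checked with a few lines of Python (variable names are
illustrative):

```python
data = [4, 8, 7, 3, 5, 16, 24, 5]

mean = sum(data) / len(data)          # arithmetic mean
data_range = max(data) - min(data)    # highest minus lowest observation

print(mean, data_range)  # 9.0 21
```

Note that the range depends only on the two extreme values, 24 and 3.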
Advantages and Disadvantages
(a) Advantages
(i) It is easy to understand.
(ii) It is simple to calculate.
(iii) It is a good measure for comparison as it spans the whole distribution.
(b) Disadvantages
(i) It uses only two of the observations and so can be distorted by extreme values.
(ii) It does not indicate any concentrations of the observations.
(iii) It cannot be used in calculating other functions of the observations.

2 QUARTILE DEVIATION (THE SEMI-INTERQUARTILE RANGE)
The semi-interquartile range (also called the quartile deviation) is a measure of spread
equal to half the difference between the upper and lower quartiles:

\[ \text{Semi-interquartile range} = \frac{Q_3 - Q_1}{2} \]

The inter-quartile range
The inter-quartile range is the difference between the values of the upper and lower
quartiles (Q3 − Q1) and hence shows the range of values of the middle half of the
population.
Example:
Construct an ogive of the following frequency distribution and hence establish the semi-
interquartile range.

Advantages and Disadvantages
(a) Advantages
(i) The calculations are simple and quite quick to do.
(ii) It covers the central 50% of the observations and so is not distorted by extreme
values.
(iii) It can be illustrated graphically.
(b) Disadvantages
(i) The lower and upper 25% of the observations are not used in the calculation so it
may not be representative of all the data.
(ii) Although it is related to the median, there is no direct arithmetic connection
between the two.
(iii) It cannot be used to calculate any other functions of the data.

3 THE MEAN DEVIATION
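The mean deviation is a measure of the average amount by which the values in a
distribution differ from the arithmetic mean.

\[ \text{Mean deviation} = \frac{\sum f|x - \bar{x}|}{n} \]

where x̄ is the arithmetic mean, f is the frequency of each value x and n is the total
frequency.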

Explaining the mean deviation formula

Example: The mean deviation
The hours of overtime worked in a particular quarter by the 60 employees of ABC Co are
as follows.

Required
Calculate the mean deviation of the frequency distribution shown above.
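
The overtime table is not reproduced in these notes, so the sketch below uses a hypothetical
distribution of 60 employees to show the method, assuming the usual formula: the sum of
f × |x − mean| divided by n, where x is the class mid-point.

```python
# Hypothetical class mid-points (hours of overtime) and frequencies.
midpoints = [5, 15, 25]
frequencies = [10, 30, 20]

n = sum(frequencies)                                          # 60 employees
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Absolute deviations are used so that positive and negative
# differences from the mean do not cancel each other out.
mean_deviation = sum(f * abs(x - mean)
                     for f, x in zip(frequencies, midpoints)) / n

print(round(mean, 2), round(mean_deviation, 2))  # 16.67 5.56
```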

Summary of the mean deviation
(a) It is a measure of dispersion which shows by how much, on average, each item in the
distribution differs in value from the arithmetic mean of the distribution.
(b) Unlike quartiles, it uses all values in the distribution to measure the dispersion, but it
is not greatly affected by a few extreme values because an average is taken.
(c) It is not, however, suitable for further statistical analysis.

4 STANDARD DEVIATION AND VARIANCE

THE STANDARD DEVIATION
The standard deviation measures the spread of data around the mean. In general, the
larger the standard deviation value in relation to the mean, the more dispersed the data.

The standard deviation, which is the square root of the variance, is the most important
measure of spread used in statistics. Make sure you understand how to calculate the
standard deviation of a set of data.

There are a number of formulae which you may use to calculate the standard deviation;
use whichever one you feel comfortable with.

EXAMPLE 1
Voditel International own a large fleet of company cars. The mileages, in thousands of
miles, of a sample of 17 of their cars over the last financial year were:
11 31 27 26 27 35 23 19 28 25
15 36 29 27 26 22 20
Calculate the mean and standard deviation of these mileage figures.
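
A sketch of the calculation for the mileage data. The notes do not say whether the
population form (divide by n) or the sample form (divide by n − 1) is intended, so both are
shown; treating the population form as the default is an assumption based on the definition
of the variance as an average of squared deviations.

```python
import math
import statistics

# Mileages (thousands of miles) of the 17 cars in Example 1.
mileages = [11, 31, 27, 26, 27, 35, 23, 19, 28, 25,
            15, 36, 29, 27, 26, 22, 20]

n = len(mileages)
mean = sum(mileages) / n

# Population form: average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in mileages) / n
std_dev = math.sqrt(variance)

# Sample form (divides by n - 1), from the standard library.
sample_std_dev = statistics.stdev(mileages)

print(round(mean, 2), round(std_dev, 2), round(sample_std_dev, 2))
```

The sample form always gives a slightly larger figure than the population form for the same
data.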
EXAMPLE 2
The kilocalories per portion in a sample of 32 different breakfast cereals were recorded
and collated into the following grouped frequency distribution:


(a) Obtain an approximate value for the median of the distribution.
(b) Calculate approximate values for the mean and standard deviation of the distribution.

The variance
The variance, σ², is the average of the squared deviations from the mean for each value in
a distribution. σ is the Greek letter sigma (in lower case). The variance is therefore called
'sigma squared'.
The main properties of the standard deviation
The standard deviation's main properties are as follows.
(a) It is based on all the values in the distribution and so is more comprehensive than
dispersion measures based on quartiles, such as the quartile deviation.
(b) It is suitable for further statistical analysis.
(c) It is more difficult to understand than some other measures of dispersion.
The importance of the standard deviation lies in its suitability for further statistical
analysis.

The coefficient of variation
The spreads of two distributions can be compared using the coefficient of variation.

\[ \text{Coefficient of variation} = \frac{\text{standard deviation}}{\text{mean}} \]

The bigger the coefficient of variation, the wider the spread. For example, suppose that
two sets of data, A and B, have the following means and standard deviations.

Although B has a higher standard deviation in absolute terms (51 compared to 50) its
relative spread is less than A's since the coefficient of variation is smaller.
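
The comparison can be sketched as follows, assuming the coefficient of variation is the
standard deviation divided by the mean. The standard deviations (50 and 51) come from the
text; the means are hypothetical, because the table is not reproduced in these notes.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative spread: standard deviation as a proportion of the mean."""
    return std_dev / mean

# Means of 120 and 125 are hypothetical values chosen for illustration.
cv_a = coefficient_of_variation(std_dev=50, mean=120)
cv_b = coefficient_of_variation(std_dev=51, mean=125)

print(round(cv_a, 3), round(cv_b, 3))  # 0.417 0.408
```

With these assumed means, B has the larger standard deviation but the smaller coefficient of
variation.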

Advantages and Disadvantages of the Standard Deviation
(a) Advantages
(i) It uses all the observations.
(ii) It is closely related to the most commonly used measure of location, i.e. the
mean.
(iii) It is easy to manipulate arithmetically.
(b) Disadvantages
(i) It is rather complicated to define and calculate.
(ii) Its value can be distorted by extreme values.

SKEWNESS
1 Skewed distributions
As well as being able to calculate the average and spread of a frequency distribution, you
should be aware of the skewness of a distribution.
Skewness is the asymmetry of a frequency distribution curve. When the items are not
symmetrically dispersed on each side of the mean, we say that the distribution is skewed
or asymmetric.

Symmetrical frequency distributions
A symmetrical frequency distribution (for example, a normal distribution) can be drawn as follows.


Properties of a symmetrical distribution
Its mean, mode and median all have the same value, M
Its two halves are mirror images of each other

If a distribution is symmetrical, the mean, mode and the median all occur at the same
point, i.e. right in the middle. But in a skew distribution, the mean and the median lie
somewhere along the side with the "tail", although the mode is still at the point where the
curve is highest. The more skew the distribution, the greater the distance from the mode
to the mean and the median.

Positively skewed distributions
A positively skewed distribution's graph will lean towards the left hand side, with a tail
stretching out to the right, and can be drawn as follows.

Properties of a positively skewed distribution
Its mean, mode and median all have different values
The mode will have a lower value than the median
Its mean will have a higher value than the median (and than most of the
distribution)
It does not have two halves which are mirror images of each other

Negatively skewed distributions
A negatively skewed distribution's graph will lean towards the right hand side, with a tail
stretching out to the left, and can be drawn as follows.

Properties of a negatively skewed distribution
Its mean, median and mode all have different values
The mode will be higher than the median
The mean will have a lower value than the median (and than most of the
distribution)
Since the mean is affected by extreme values, it may not be representative of the items in
a very skewed distribution.

Measures of Skewness
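The more skew the distribution, the more spread out are the three measures of location
(mean, median and mode), and so we can use the amount of this spread to measure the amount
of skewness. The most usual way of doing this is to calculate Pearson's coefficient of
skewness:

\[ \text{Pearson's coefficient of skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \]

The value of the coefficient of skewness is between +3 and -3.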
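
A minimal sketch, assuming the median-based form of Pearson's coefficient,
3 × (mean − median) / standard deviation. The input figures are hypothetical.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's coefficient of skewness (median-based form)."""
    return 3 * (mean - median) / std_dev

# Hypothetical summary statistics for illustration.
positive = pearson_skewness(mean=50, median=48, std_dev=6)    # mean above median
negative = pearson_skewness(mean=46, median=48, std_dev=6)    # mean below median
symmetric = pearson_skewness(mean=48, median=48, std_dev=6)   # mean equals median

print(positive, negative, symmetric)  # 1.0 -1.0 0.0
```

A positive result indicates a tail to the right, a negative result a tail to the left, and
zero a symmetrical distribution.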
Example: Skewness
In a quality control test, the weights of standard packages were measured to give the
following grouped frequency table.

Required
(a) Calculate the mean, standard deviation and median of the weights of the packages.
(b) Calculate Pearson's coefficient of skewness and explain whether or not the
distribution is symmetrical.






CORRELATION
When the value of one variable is related to the value of another, they are said to be
correlated.

Two variables are said to be correlated if a change in the value of one variable is
accompanied by a change in the value of another variable. This is what is meant by
correlation.

Examples of variables which might be correlated
A person's height and weight
The distance of a journey and the time it takes to make it
Scatter diagrams
One way of showing the correlation between two related variables is on a scatter
diagram, plotting a number of pairs of data on the graph. For example, a scatter diagram
showing monthly selling costs against the volume of sales for a 12-month period might
be as follows.

The independent variable (the cause) is plotted on the horizontal (x) axis and the
dependent variable (the effect) is plotted on the vertical (y) axis.
This scattergraph suggests that there is some correlation between selling costs and sales
volume, so that as sales volume rises, selling costs tend to rise as well.

Degrees of correlation
Two variables might be perfectly correlated, partly correlated or uncorrelated. Correlation
can be positive or negative.
Positive and negative correlation
Correlation, whether perfect or partial, can be positive or negative.
Positive correlation means that low values of one variable are associated with
low values of the other, and high values of one variable are associated with high
values of the other.
Negative correlation means that low values of one variable are associated with
high values of the other, and high values of one variable with low values of the
other.

The correlation coefficient and the coefficient of determination
The degree of correlation between two variables is measured by Pearson's correlation
coefficient, r. The nearer r is to +1 or −1, the stronger the relationship.

The correlation coefficient
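Pearson's correlation coefficient, r (also known as the product moment correlation
coefficient), is used to measure how strong the connection is between two variables,
known as the degree of correlation.

\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]

where x and y are the paired values of the two variables and n is the number of pairs.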

The correlation coefficient range
The correlation coefficient, r, must always fall between −1 and +1. If you get a value
outside this range you have made a mistake.
r = +1 means that the variables are perfectly positively correlated
r = −1 means that the variables are perfectly negatively correlated
r = 0 means that the variables are uncorrelated
Example: The correlation coefficient
The cost of output at a factory is thought to depend on the number of units produced.
Data have been collected for the number of units produced each month in the last six
months, and the associated costs, as follows.

Required
Assess whether there is any correlation between output and cost.
Solution



There is perfect positive correlation between the volume of output at the factory and costs
which means that there is a perfect linear relationship between output and costs.
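
A sketch of the product moment calculation. The output and cost figures from the example are
not reproduced in these notes, so the data below are hypothetical and were chosen to lie
exactly on a straight line (cost = 3 + 2 × output), which gives perfect positive correlation.

```python
import math

def pearson_r(x, y):
    """Product moment correlation coefficient for paired values x and y."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical monthly output (units) and cost for six months.
output = [1, 2, 3, 4, 5, 6]
cost = [5, 7, 9, 11, 13, 15]

r = pearson_r(output, cost)
r_squared = r ** 2          # coefficient of determination

print(r, r_squared)  # 1.0 1.0
```

Reversing the order of the cost figures gives r = -1, which is perfect negative correlation.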

Required
Calculate Pearson's correlation coefficient for the data and explain the result.

THE COEFFICIENT OF DETERMINATION, r²

The coefficient of determination, r², measures the proportion of the total variation in the
value of one variable that can be explained by variations in the value of the other
variable. Unless the correlation coefficient r is exactly or very nearly +1, −1 or 0, its
meaning or significance is a little unclear. For example, if the correlation coefficient for
two variables is +0.8, this would tell us that the variables are positively correlated, but the
correlation is not perfect. It would not really tell us much else. A more meaningful
analysis is available from the square of the correlation coefficient, r, which is called the
coefficient of determination, r².


Interpreting r²

In the question above, r = 0.992, therefore r² = 0.984. This means that over 98% of
variations in sales can be explained by the passage of time, leaving 0.016 (less than 2%)
of variations to be explained by other factors.
Similarly, if the correlation coefficient between a company's output volume and
maintenance costs was 0.9, r² would be 0.81, meaning that 81% of variations in
maintenance costs could be explained by variations in output volume, leaving only 19%
of variations to be explained by other factors (such as the age of the equipment).


SPEARMAN'S RANK CORRELATION COEFFICIENT
Coefficient of rank correlation
In the examples considered above, the data were given in terms of the values of the
relevant variables, such as the number of hours. Sometimes however, they are given in
terms of order or rank rather than actual values.

Spearman's rank correlation coefficient, R, is used when data is given in terms of order or
rank, rather than actual values.

\[ R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \]

Where n = number of pairs of data
d = the difference between the rankings in each set of data.

The coefficient of rank correlation can be interpreted in exactly the same way as the
ordinary correlation coefficient. Its value can range from −1 to +1.




Example: The rank correlation coefficient
The examination placings of seven students were as follows.


Required
Judge whether the placings of the students in statistics correlate with their placings in
economics.
Solution
Correlation must be measured by Spearman's coefficient because we are given the
placings of students, and not their actual marks.

where d is the difference between the rank in statistics and the rank in economics for each
student.

The correlation is positive (0.536), but it is not strong.
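
A sketch of the rank correlation calculation, assuming the usual formula
R = 1 − 6Σd² / (n(n² − 1)), which applies when there are no tied ranks. The table of
placings is not reproduced in these notes, so the placings below are hypothetical; they were
chosen so that the result matches the 0.536 quoted above.

```python
def spearman_rank(ranks_a, ranks_b):
    """Spearman's rank correlation coefficient for two sets of untied ranks."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))   # sum of d squared
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical placings of seven students in each subject.
statistics_rank = [2, 1, 4, 6, 5, 3, 7]
economics_rank = [1, 3, 7, 5, 6, 2, 4]

r_s = spearman_rank(statistics_rank, economics_rank)
print(round(r_s, 3))  # 0.536
```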
