MEASURES OF CENTRAL TENDENCY, DISPERSION AND CORRELATION
MEASURES OF CENTRAL TENDENCY (AVERAGES)
Introduction
We saw how data can be summarised and presented in tabular, chart and graphical formats. Sometimes you might need more information than that provided by diagrammatic representations of data. In such circumstances you may need to apply some sort of numerical analysis; for example, you might wish to calculate a measure of centrality and a measure of dispersion.
An average is a representative figure that is used to give some impression of the size of all the items in the population. There are three main types of average:
(a) Arithmetic mean
(b) Mode
(c) Median
We will look at each of these averages in turn: their calculation, advantages and disadvantages.
The arithmetic mean
Arithmetic mean of ungrouped data
The arithmetic mean is the best known type of average and is widely understood. It is calculated as the sum of the values of the items divided by the number of items, and it is used for further statistical analysis.
Example: The arithmetic mean
The demand for a product on each of 20 days was as follows (in units).
3 12 7 17 3 14 9 6 11 10 1 4 19 7 15 6 9 12 12 8
The arithmetic mean of daily demand is 185/20 = 9.25 units.
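The calculation for this example can be checked with a short Python sketch. The variable names are illustrative only and are not part of the original notes.

```python
# Daily demand (units) over 20 days, taken from the example above.
demand = [3, 12, 7, 17, 3, 14, 9, 6, 11, 10,
          1, 4, 19, 7, 15, 6, 9, 12, 12, 8]

total = sum(demand)        # sum of all observations
n = len(demand)            # number of observations
mean_demand = total / n    # arithmetic mean = sum of values / number of values

print(total, n, mean_demand)
```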
Arithmetic mean of data in a frequency distribution
It is more likely in an assessment that you will be asked to calculate the arithmetic mean of a frequency distribution.
Example: Consider the following table, complete the (fx) column and calculate the value of the arithmetic mean.
ARITHMETIC MEAN OF GROUPED DATA IN CLASS INTERVALS
The arithmetic mean of grouped data is determined as:
$$\bar{x} = \frac{\sum fx}{n}$$
where n is the number of values recorded, or the number of items measured.
The mid-point of class intervals
To calculate the arithmetic mean of grouped data we therefore need to decide on a value which best represents all of the values in a particular class interval. This value is known as the mid-point. The mid-point of each class interval is conventionally taken, on the assumption that the frequencies occur evenly over the class interval range.
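As an illustration of the mid-point method, the sketch below uses a small hypothetical grouped distribution; the class boundaries and frequencies are invented for the purpose and do not come from these notes.

```python
# HYPOTHETICAL grouped data: (lower bound, upper bound, frequency).
classes = [(0, 10, 4), (10, 20, 6), (20, 30, 8), (30, 40, 2)]

# n is the total number of items, i.e. the sum of the frequencies.
n = sum(f for _, _, f in classes)

# Each class is represented by its mid-point x = (lower + upper) / 2.
sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)

grouped_mean = sum_fx / n
print(n, sum_fx, grouped_mean)
```

Because every value in a class is replaced by the class mid-point, the result is an estimate of the mean rather than the exact figure.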
Example: Calculate the value of the arithmetic mean from the following frequency distribution table.
The arithmetic mean of combined data
Suppose that the mean age of a group of five people is 27 and the mean age of another group of eight people is 32. How would we find the mean age of the whole group of 13 people?
The sum of the ages in the first group is 5 × 27 = 135
The sum of the ages in the second group is 8 × 32 = 256
The sum of all 13 ages is 135 + 256 = 391
The mean age of the whole group is therefore 391/13 = 30.08 years (approximately).
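The same steps can be sketched in Python. The tuple layout (group size, group mean) is simply a convenient assumption for the illustration.

```python
# Each group is recorded as (number of people, mean age).
groups = [(5, 27), (8, 32)]

# Recover each group's total age, then pool the totals.
total_age = sum(size * mean for size, mean in groups)
total_people = sum(size for size, _ in groups)

combined_mean = total_age / total_people
print(total_age, total_people, round(combined_mean, 2))
```

Note that simply averaging 27 and 32 would give 29.5, which ignores the different group sizes.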
The advantages and disadvantages of the arithmetic mean
(a) Advantages of the arithmetic mean
(i) It is easy to calculate.
(ii) It is widely understood.
(iii) It is representative of the whole set of data.
(iv) It is supported by mathematical theory and is suited to further statistical analysis.
(b) Disadvantages of the arithmetic mean
(i) Its value may not correspond to any actual value. For example, the 'average' family might have 2.3 children, but no family has exactly 2.3 children.
(ii) An arithmetic mean might be distorted by extremely high or low values. For example, the mean of 3, 4, 4 and 6 is 4.25, but the mean of 3, 4, 4, 6 and 15 is 6.4. The high value, 15, distorts the average and in some circumstances the mean would be a misleading and inappropriate figure.
Weighted Mean
A firm owns six factories at which the basic weekly wages are given in column 2 of Table 4.4. Find the mean basic wage earned by employees of the firm.
This result, which takes account of the number of employees, is a much more realistic measure of location for the distribution of the basic wage than the straight mean we found first. The second result is called the weighted mean of the basic wage, where the weights are the numbers of employees at each factory.
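The difference between the straight mean and the weighted mean can be sketched as follows. Table 4.4 is not reproduced here, so the wages and employee numbers below are hypothetical figures chosen only to show the mechanics.

```python
# HYPOTHETICAL data (not Table 4.4): basic weekly wage and number of
# employees at each of three factories.
wages = [100, 120, 150]
employees = [10, 50, 40]

# Straight mean: every factory counts equally, regardless of size.
straight_mean = sum(wages) / len(wages)

# Weighted mean: each wage is weighted by the number of employees earning it.
weighted_mean = sum(w * e for w, e in zip(wages, employees)) / sum(employees)

print(round(straight_mean, 2), weighted_mean)
```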
Geometric Mean
The geometric mean is seldom used outside of specialist applications. It is appropriate when dealing with a set of data that shows exponential growth (that is, where the rate of growth depends on the value of the variable itself), for example population levels arising from indigenous birth rates, or that follows a geometric progression, such as changes in an index number over time, for example the Retail Price Index.
It is sometimes quite difficult to decide when the geometric mean is a better choice than the arithmetic mean. We will return to the use of geometric means in the next chapter. The geometric mean (GM) is evaluated by taking the nth root of the product of all n observations, that is:
$$GM = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n}$$
Example: In the year 2000 the population of a town is 300,000. In 2010 a new census reveals it has risen to 410,000. Estimate the population in 2005. If we assume that there was no net immigration or emigration then the birth rate will depend on the size of the population (exponential growth), so the geometric mean is appropriate:
$$GM = \sqrt{300{,}000 \times 410{,}000} \approx 350{,}714$$
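The geometric mean of the two census figures can be checked with the sketch below, which also computes their arithmetic mean for comparison. It treats the result as an estimate for the midpoint of the period, under the exponential-growth assumption stated in the example; the variable names are illustrative.

```python
import math

p_2000 = 300_000   # population at the first census
p_2010 = 410_000   # population at the second census

# Geometric mean of two values = square root of their product.
geometric_mean = math.sqrt(p_2000 * p_2010)

# Arithmetic mean of the same two values, for comparison.
arithmetic_mean = (p_2000 + p_2010) / 2

print(round(geometric_mean), arithmetic_mean)
```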
(Note that this is appreciably less than the arithmetic mean, which is 355,000.)
Harmonic Mean
Another measure of central tendency which is only occasionally used is the harmonic mean (HM). It is most frequently employed for averaging speeds where the distances for each section of the journey are equal. If the n speeds are x then:
$$HM = \frac{n}{\sum \frac{1}{x}}$$
Example: An aeroplane travels a distance of 900 miles. If it covers the first third and the last third of the trip at a speed of 250 mph and the middle third at a speed of 300 mph, find the average speed.
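One way to work this example is sketched below. It computes the average speed twice: directly as total distance over total time, and with `statistics.harmonic_mean` from the Python standard library. The two should agree because the three legs are of equal length.

```python
from statistics import harmonic_mean

speeds = [250, 300, 250]   # mph for the first, middle and last third
leg = 900 / 3              # each third of the trip is 300 miles

# Direct approach: total distance divided by total time.
total_time = sum(leg / s for s in speeds)   # hours
avg_by_time = 900 / total_time

# Harmonic mean of the speeds: n / sum(1/x).
avg_by_hm = harmonic_mean(speeds)

print(round(avg_by_time, 1), round(avg_by_hm, 1))
```

Both approaches give roughly 264.7 mph, slightly below the arithmetic mean of the three speeds (about 266.7 mph).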
THE MODE
The mode or modal value is an average which means 'the most frequently occurring value'.
Mode of a Simple Frequency Distribution
Consider the following frequency distribution:
Table 8.12 Accident distribution data
In this case the most frequently occurring value is 1 (it occurred 39 times) and so the mode of this distribution is 1.
Example 1: The following is an ordered list of the number of complaints received by a telephone supervisor per day over a period of a fortnight: 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 8, 9, 10, 12. The value which occurs most frequently is 6, therefore: mode = 6
The mode of a grouped frequency distribution
There are various methods of estimating the modal value (including a graphical one). A satisfactory result is obtained easily by using the following formula:
$$\text{Mode} = L + \frac{f_m - f_{m-1}}{2f_m - f_{m-1} - f_{m+1}} \times i$$
where:
L = lower boundary of the modal class
i = width of the modal class interval
$f_m$ = the frequency of the modal class
$f_{m-1}$ = the frequency of the pre-modal class
$f_{m+1}$ = the frequency of the post-modal class
Example: Find the modal value of the height of employees from the data shown in the Table.
Solution: The largest frequency is 20, in the fourth class, so this is the modal class, and the value of the mode lies between 175 and 180 cm. So, using the above formula:
Advantages and Disadvantages of the Mode
(a) Advantages
(i) It is not distorted by extreme values of the observations.
(ii) It is easy to calculate.
(b) Disadvantages
(i) It cannot be used to calculate any further statistic.
(ii) It may have more than one value (although this feature helps to show the shape of the distribution).
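Both ways of finding a mode can be sketched in Python. The first part uses the complaints data from Example 1. The second part applies the usual grouped-mode interpolation formula; the modal class (175 to 180 cm, frequency 20) is from the height example, but the neighbouring frequencies of 15 and 10 are hypothetical, since the table is not reproduced here.

```python
from statistics import mode

# Complaints per day over a fortnight (Example 1).
complaints = [3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 8, 9, 10, 12]
simple_mode = mode(complaints)

def grouped_mode(L, i, f_m, f_prev, f_next):
    """Estimate the mode of a grouped distribution.

    L: lower boundary of the modal class, i: class width,
    f_m / f_prev / f_next: modal, pre-modal and post-modal frequencies.
    """
    return L + (f_m - f_prev) / (2 * f_m - f_prev - f_next) * i

# f_prev=15 and f_next=10 are ASSUMED values for illustration only.
estimate = grouped_mode(L=175, i=5, f_m=20, f_prev=15, f_next=10)

print(simple_mode, round(estimate, 2))
```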
MEDIAN
The median is the value of the middle member of an array. The middle item of an odd number of items is calculated as the (n + 1)/2 th item.
If a set of n observations is arranged in order of size then, if n is odd, the median is the value of the middle observation; if n is even, the median is the value of the arithmetic mean of the two middle observations. Note that the same value is obtained whether the set is arranged in ascending or descending order of size, though the ascending order is most commonly used. This arrangement in order of size is often called ranking. The rules for calculating the median are:
(a) If n is odd and M is the value of the median then: M = the value of the (n + 1)/2 th observation.
(b) If n is even, the middle observations are the (n/2)th and the (n/2 + 1)th, and then: M = the arithmetic mean of the values of these two observations.
Example: The median
(a) The median of the following nine values:
8 6 9 12 15 6 3 20 11
is found by taking the middle item (the fifth one) in the array:
3 6 6 8 9 11 12 15 20
The median is 9.
(b) Consider the following array.
1 2 2 2 3 5 6 7 8 11
The median is 4 because, with an even number of items, we have to take the arithmetic mean of the two middle ones (in this example, (3 + 5)/2 = 4).
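The two cases in this example can be reproduced with a small sketch. The helper `median_of` is a hypothetical name written out to mirror the odd and even rules; the standard library's `statistics.median` is used as a cross-check.

```python
from statistics import median

def median_of(values):
    """Median by ranking: middle item if n is odd, mean of the two middle items if even."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]                     # the (n + 1)/2 th item
    return (ranked[mid - 1] + ranked[mid]) / 2 # mean of the n/2 th and (n/2 + 1)th items

odd_case = [8, 6, 9, 12, 15, 6, 3, 20, 11]
even_case = [1, 2, 2, 2, 3, 5, 6, 7, 8, 11]

print(median_of(odd_case), median_of(even_case))
print(median(odd_case), median(even_case))
```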
Finding the median of an ungrouped frequency distribution
The median of an ungrouped frequency distribution is found in a similar way. Consider the following distribution.
The median would be the (35 + 1)/2 = 18th item. The 18th item has a value of 16, as we can see from the cumulative frequencies in the right hand column of the above table.
Finding the median of a grouped frequency distribution
If the sample size is large and the data continuous, it is possible to find the median of grouped data by estimating the position of the (n/2)th value. In this case the following formula may be used:
$$\text{Median} = L + \left(\frac{\frac{n}{2} - F}{f}\right) \times i$$
where:
L is the lower class boundary of the median class
F is the cumulative frequency up to but not including the median class
f is the frequency of the median class
i is the width of the class interval.
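The interpolation for the median of grouped data can be sketched as below. The function name and the sample distribution are hypothetical; the logic locates the class containing the (n/2)th value and then interpolates within it using L, F, f and i as defined above.

```python
def grouped_median(classes):
    """Estimate the median of a grouped distribution.

    classes: list of (lower boundary, upper boundary, frequency),
    given in ascending order.
    """
    n = sum(f for _, _, f in classes)
    half = n / 2
    cumulative = 0   # F: cumulative frequency before the current class
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= half:
            # L + ((n/2 - F) / f) * i, with i = upper - lower
            return lower + (half - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("no observations supplied")

# HYPOTHETICAL distribution, for illustration only.
example = [(0, 10, 5), (10, 20, 12), (20, 30, 8), (30, 40, 5)]
estimate = grouped_median(example)
print(round(estimate, 2))
```

Here n = 30, so the 15th value is sought; it falls in the 10 to 20 class, giving 10 + (15 - 5)/12 × 10.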
Example: From the following frequency distribution establish the median.
Advantages and Disadvantages of the Median
(a) Advantages
(i) Its value is not distorted by extreme values, open-ended classes or classes of irregular width.
(ii) All the observations are used to order the data even though only the middle one or two observations are used in the calculation.
(iii) It can be illustrated graphically in a very simple way.
(b) Disadvantages
(i) In a grouped frequency distribution the value of the median within the median class can only be an estimate, whether it is calculated or read from a graph.
(ii) Although the median is easy to calculate it is difficult to manipulate arithmetically. It is of little use in calculating other statistical measures.
QUANTILES
Definitions
If a set of data is arranged in ascending order of size, quantiles are the values of the observations which divide the number of observations into a given number of equal parts. They cannot really be called measures of central tendency, but they are measures of location in that they give the position of specified observations on the x-axis. The most commonly used quantiles are:
(a) Quartiles
These are denoted by the symbols Q1, Q2 and Q3 and they divide the observations into four equal parts:
Q1 has 25% below it and 75% above it.
Q2 has 50% below it and 50% above, i.e. it is the median and is more usually denoted by M.
Q3 has 75% below it and 25% above.
(b) Deciles
These values divide the observations into 10 equal parts and are denoted by D1, D2, ... D9, e.g. D1 has 10% below it and 90% above, and D2 has 20% below it and 80% above.
(c) Percentiles
These values divide the observations into 100 equal parts and are denoted by P1, P2, P3, ... P99, e.g. P1 has 1% below it and 99% above.
Note that D5 and P50 are both equal to the median (M).
Calculation of Quantiles
Example: Table 4.7 shows the grouped distribution of the overdraft sizes of 400 bank customers. Find the quartiles, the 4th decile and the 95th percentile of this distribution.
Size of overdraft of bank customers
Using appropriately amended versions of the formula for the median given previously, the arithmetic calculations are as follows. The formula for the first quartile (Q1) may be written as:
$$Q_1 = L + \left(\frac{\frac{n}{4} - F}{f}\right) \times i$$
where:
L is the lower class boundary of the class which contains Q1
F is the cumulative frequency up to but not including the class which contains Q1
f is the frequency of the class which contains Q1
and i is the width of the class interval.
This gives:
MEASURES OF DISPERSION
Measures of dispersion give some idea of the spread of a variable about its average. The main measures are as follows.
(a) The range
(b) The semi-interquartile range
(c) The standard deviation
(d) The variance
(e) The coefficient of variation
1 The range
The range of a distribution is the difference between the largest and the smallest values in the set of data.
If the data is given in the form of a grouped frequency distribution, the range is the difference between the highest upper class boundary and the lowest lower class boundary.
Example: Calculate the mean and the range of the following set of data.
4 8 7 3 5 16 24 5
The mean is 72/8 = 9 and the range is 24 - 3 = 21.
Advantages and Disadvantages
(a) Advantages
(i) It is easy to understand.
(ii) It is simple to calculate.
(iii) It is a good measure for comparison as it spans the whole distribution.
(b) Disadvantages
(i) It uses only two of the observations and so can be distorted by extreme values.
(ii) It does not indicate any concentrations of the observations.
(iii) It cannot be used in calculating other functions of the observations.
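Returning to the example data above (4 8 7 3 5 16 24 5), the mean and range can be computed with a brief sketch; the variable names are illustrative.

```python
data = [4, 8, 7, 3, 5, 16, 24, 5]

mean_value = sum(data) / len(data)    # arithmetic mean
data_range = max(data) - min(data)    # largest value minus smallest value

print(mean_value, data_range)
```

The range of 21 is driven entirely by the two extreme observations, 3 and 24, which illustrates the first disadvantage listed above.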
2 QUARTILE DEVIATION (THE SEMI-INTERQUARTILE RANGE)
The lower and upper quartiles can be used to calculate a measure of spread called the semi-interquartile range, which is half the difference between the upper and lower quartiles:
$$\text{Semi-interquartile range} = \frac{Q_3 - Q_1}{2}$$
The inter-quartile range
The inter-quartile range is the difference between the values of the upper and lower quartiles (Q3 - Q1) and hence shows the range of values of the middle half of the population.
Example: Construct an ogive of the following frequency distribution and hence establish the semi-interquartile range.
Advantages and Disadvantages
(a) Advantages
(i) The calculations are simple and quite quick to do.
(ii) It covers the central 50% of the observations and so is not distorted by extreme values.
(iii) It can be illustrated graphically.
(b) Disadvantages
(i) The lower and upper 25% of the observations are not used in the calculation so it may not be representative of all the data.
(ii) Although it is related to the median, there is no direct arithmetic connection between the two.
(iii) It cannot be used to calculate any other functions of the data.
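3 THE MEAN DEVIATION
The mean deviation is a measure of the average amount by which the values in a distribution differ from the arithmetic mean.
$$\text{Mean deviation} = \frac{\sum f\,|x - \bar{x}|}{n}$$
Explaining the mean deviation formula
$|x - \bar{x}|$ is the difference between each value x and the arithmetic mean $\bar{x}$, ignoring its sign; f is the frequency of each value; and n is the total number of items (the sum of the frequencies).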
Example: The mean deviation The hours of overtime worked in a particular quarter by the 60 employees of ABC Co are as follows.
Required
Calculate the mean deviation of the frequency distribution shown above.
Summary of the mean deviation (a) It is a measure of dispersion which shows by how much, on average, each item in the distribution differs in value from the arithmetic mean of the distribution. (b) Unlike quartiles, it uses all values in the distribution to measure the dispersion, but it is not greatly affected by a few extreme values because an average is taken. (c) It is not, however, suitable for further statistical analysis.
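A minimal sketch of the mean deviation for a frequency distribution follows. The overtime table for ABC Co is not reproduced in these notes, so the values and frequencies below are hypothetical.

```python
# HYPOTHETICAL frequency distribution: value x occurs f times.
values = [2, 4, 6, 8]
freqs = [1, 3, 4, 2]

n = sum(freqs)
mean = sum(f * x for x, f in zip(values, freqs)) / n

# Average absolute distance from the mean, weighted by frequency.
mean_deviation = sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n

print(mean, round(mean_deviation, 2))
```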
4 STANDARD DEVIATION AND VARIANCE
THE STANDARD DEVIATION
The standard deviation measures the spread of data around the mean. In general, the larger the standard deviation value in relation to the mean, the more dispersed the data.
The standard deviation, which is the square root of the variance, is the most important measure of spread used in statistics. Make sure you understand how to calculate the standard deviation of a set of data.
There are a number of formulae which you may use to calculate the standard deviation; use whichever one you feel comfortable with.
EXAMPLE 1
Voditel International own a large fleet of company cars. The mileages, in thousands of miles, of a sample of 17 of their cars over the last financial year were:
11 31 27 26 27 35 23 19 28 25 15 36 29 27 26 22 20
Calculate the mean and standard deviation of these mileage figures.
EXAMPLE 2
The kilocalories per portion in a sample of 32 different breakfast cereals were recorded and collated into the following grouped frequency distribution:
(a) Obtain an approximate value for the median of the distribution. (b) Calculate approximate values for the mean and standard deviation of the distribution.
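For Example 1 (the Voditel mileages), the mean and standard deviation can be checked with Python's standard `statistics` module. The notes do not say here whether the divisor should be n or n - 1, so this sketch shows both: `pstdev` divides by n and `stdev` divides by n - 1.

```python
from statistics import mean, pstdev, stdev

# Mileages (thousands of miles) of the 17 sampled cars in Example 1.
mileages = [11, 31, 27, 26, 27, 35, 23, 19, 28,
            25, 15, 36, 29, 27, 26, 22, 20]

avg = mean(mileages)
sd_population = pstdev(mileages)   # square root of sum((x - mean)^2) / n
sd_sample = stdev(mileages)        # square root of sum((x - mean)^2) / (n - 1)

print(round(avg, 2), round(sd_population, 2), round(sd_sample, 2))
```

The mean is about 25.12 thousand miles; the standard deviation is about 6.26 with divisor n and about 6.45 with divisor n - 1.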
The variance
The variance, σ², is the average of the squared deviations from the mean of each value in a distribution. σ is the Greek letter sigma (in lower case). The variance is therefore called 'sigma squared'.
The main properties of the standard deviation
The standard deviation's main properties are as follows.
(a) It is based on all the values in the distribution and so is more comprehensive than dispersion measures based on quartiles, such as the quartile deviation.
(b) It is suitable for further statistical analysis.
(c) It is more difficult to understand than some other measures of dispersion.
The importance of the standard deviation lies in its suitability for further statistical analysis.
The coefficient of variation
The spreads of two distributions can be compared using the coefficient of variation.
$$\text{Coefficient of variation} = \frac{\text{standard deviation}}{\text{mean}}$$
The bigger the coefficient of variation, the wider the spread. For example, suppose that two sets of data, A and B, have the following means and standard deviations.
Although B has a higher standard deviation in absolute terms (51 compared to 50) its relative spread is less than A's since the coefficient of variation is smaller.
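The comparison can be sketched numerically. The standard deviations of 50 and 51 are those quoted above, but the table of means is not reproduced here, so the means of 120 and 125 used below are assumed values chosen only to illustrate the calculation.

```python
# Standard deviations from the text; the means are ASSUMED for illustration.
mean_a, sd_a = 120, 50
mean_b, sd_b = 125, 51

# Coefficient of variation = standard deviation / mean.
cv_a = sd_a / mean_a
cv_b = sd_b / mean_b

print(round(cv_a, 3), round(cv_b, 3))
```

With these assumed means, B has the larger standard deviation but the smaller coefficient of variation, so its relative spread is less than A's.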
Advantages and Disadvantages of the Standard Deviation
(a) Advantages
(i) It uses all the observations.
(ii) It is closely related to the most commonly used measure of location, i.e. the mean.
(iii) It is easy to manipulate arithmetically.
(b) Disadvantages
(i) It is rather complicated to define and calculate.
(ii) Its value can be distorted by extreme values.
SKEWNESS
1 Skewed distributions
As well as being able to calculate the average and spread of a frequency distribution, you should be aware of the skewness of a distribution. Skewness is the asymmetry of a frequency distribution curve. When the items are not symmetrically dispersed on each side of the mean, we say that the distribution is skewed or asymmetric.
Symmetrical frequency distributions
A symmetrical frequency distribution (such as a normal distribution) can be drawn as follows.
Properties of a symmetrical distribution
(a) Its mean, mode and median all have the same value, M
(b) Its two halves are mirror images of each other
If a distribution is symmetrical, the mean, mode and the median all occur at the same point, i.e. right in the middle. But in a skew distribution, the mean and the median lie somewhere along the side with the "tail", although the mode is still at the point where the curve is highest. The more skew the distribution, the greater the distance from the mode to the mean and the median.
Positively skewed distributions
A positively skewed distribution's graph will lean towards the left hand side, with a tail stretching out to the right, and can be drawn as follows.
Properties of a positively skewed distribution
(a) Its mean, mode and median all have different values
(b) The mode will have a lower value than the median
(c) Its mean will have a higher value than the median (and than most of the distribution)
(d) It does not have two halves which are mirror images of each other
Negatively skewed distributions
A negatively skewed distribution's graph will lean towards the right hand side, with a tail stretching out to the left, and can be drawn as follows.
Properties of a negatively skewed distribution
(a) Its mean, median and mode all have different values
(b) The mode will be higher than the median
(c) The mean will have a lower value than the median (and than most of the distribution)
Since the mean is affected by extreme values, it may not be representative of the items in a very skewed distribution.
Measures of Skewness
The more skew the distribution, the more spread out are these three measures of location, and so we can use the amount of this spread to measure the amount of skewness. The most usual way of doing this is to calculate:
$$\text{Pearson's coefficient of skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$
The value of the coefficient of skewness is between +3 and -3.
Example: Skewness
In a quality control test, the weights of standard packages were measured to give the following grouped frequency table.
Required
(a) Calculate the mean, standard deviation and median of the weights of the packages.
(b) Calculate Pearson's coefficient of skewness and explain whether or not the distribution is symmetrical.
CORRELATION
When the value of one variable is related to the value of another, they are said to be correlated.
Two variables are said to be correlated if a change in the value of one variable is accompanied by a change in the value of another variable. This is what is meant by correlation.
Examples of variables which might be correlated
(a) A person's height and weight
(b) The distance of a journey and the time it takes to make it
Scatter diagrams
One way of showing the correlation between two related variables is on a scatter diagram, plotting a number of pairs of data on the graph. For example, a scatter diagram showing monthly selling costs against the volume of sales for a 12-month period might be as follows.
The independent variable (the cause) is plotted on the horizontal (x) axis and the dependent variable (the effect) is plotted on the vertical (y) axis. This scattergraph suggests that there is some correlation between selling costs and sales volume, so that as sales volume rises, selling costs tend to rise as well.
Degrees of correlation
Two variables might be perfectly correlated, partly correlated or uncorrelated. Correlation can be positive or negative.
Positive and negative correlation
Correlation, whether perfect or partial, can be positive or negative. Positive correlation means that low values of one variable are associated with low values of the other, and high values of one variable are associated with high values of the other. Negative correlation means that low values of one variable are associated with high values of the other, and high values of one variable with low values of the other.
The correlation coefficient and the coefficient of determination
The degree of correlation between two variables is measured by Pearson's correlation coefficient, r. The nearer r is to +1 or -1, the stronger the relationship.
The correlation coefficient
Pearson's correlation coefficient, r (also known as the product moment correlation coefficient), is used to measure how strong the connection is between two variables, known as the degree of correlation.
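A sketch of the calculation is given below. It uses the standard product moment formula, r = (nΣxy - ΣxΣy) / √([nΣx² - (Σx)²][nΣy² - (Σy)²]); the function name and the sample data are hypothetical and are not taken from these notes.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# HYPOTHETICAL data sets, for illustration only.
partial = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])   # partly correlated
perfect = pearson_r([1, 2, 3], [3, 5, 7])               # y = 2x + 1 exactly

print(round(partial, 4), perfect)
```

The function assumes that neither variable is constant; if all x values (or all y values) are equal, the denominator is zero and r is undefined.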
The correlation coefficient range
The correlation coefficient, r, must always fall between -1 and +1. If you get a value outside this range you have made a mistake.
r = +1 means that the variables are perfectly positively correlated
r = -1 means that the variables are perfectly negatively correlated
r = 0 means that the variables are uncorrelated
Example: The correlation coefficient
The cost of output at a factory is thought to depend on the number of units produced. Data have been collected for the number of units produced each month in the last six months, and the associated costs, as follows.
Required
Assess whether there is any correlation between output and cost.
Solution
There is perfect positive correlation between the volume of output at the factory and costs which means that there is a perfect linear relationship between output and costs.
Required
Calculate Pearson's correlation coefficient for the data and explain the result.
THE COEFFICIENT OF DETERMINATION, r²
The coefficient of determination, r², measures the proportion of the total variation in the value of one variable that can be explained by variations in the value of the other variable. Unless the correlation coefficient r is exactly or very nearly +1, -1 or 0, its meaning or significance is a little unclear. For example, if the correlation coefficient for two variables is +0.8, this would tell us that the variables are positively correlated, but the correlation is not perfect. It would not really tell us much else. A more meaningful analysis is available from the square of the correlation coefficient, r, which is called the coefficient of determination, r².
Interpreting r²
In the question above, r = 0.992, therefore r² = 0.984. This means that over 98% of variations in sales can be explained by the passage of time, leaving 0.016 (less than 2%) of variations to be explained by other factors. Similarly, if the correlation coefficient between a company's output volume and maintenance costs was 0.9, r² would be 0.81, meaning that 81% of variations in maintenance costs could be explained by variations in output volume, leaving only 19% of variations to be explained by other factors (such as the age of the equipment).
SPEARMAN'S RANK CORRELATION COEFFICIENT
Coefficient of rank correlation
In the examples considered above, the data were given in terms of the values of the relevant variables, such as the number of hours. Sometimes, however, they are given in terms of order or rank rather than actual values.
Spearman's rank correlation coefficient, R, is used when data is given in terms of order or rank, rather than actual values.
$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where
n = number of pairs of data
d = the difference between the rankings in each set of data.
The coefficient of rank correlation can be interpreted in exactly the same way as the ordinary correlation coefficient. Its value can range from -1 to +1.
Example: The rank correlation coefficient
The examination placings of seven students were as follows.
Required
Judge whether the placings of the students in statistics correlate with their placings in economics.
Solution
Correlation must be measured by Spearman's coefficient because we are given the placings of students, and not their actual marks.
where d is the difference between the rank in statistics and the rank in economics for each student.
The correlation is positive (0.536) but it is not strong.
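The rank correlation calculation can be sketched as follows, using the standard formula R = 1 - 6Σd² / (n(n² - 1)). The table of placings is not reproduced in these notes, so the ranks below are assumed values, chosen so that Σd² = 26, which is consistent with the coefficient of 0.536 quoted above for seven students.

```python
def spearman_rank(ranks_a, ranks_b):
    """Spearman's rank correlation coefficient for untied ranks."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# ASSUMED placings for seven students (illustrative, not from the source table).
statistics_rank = [2, 1, 4, 6, 5, 3, 7]
economics_rank = [1, 3, 7, 5, 6, 2, 4]

sum_d2 = sum((s - e) ** 2 for s, e in zip(statistics_rank, economics_rank))
R = spearman_rank(statistics_rank, economics_rank)

print(sum_d2, round(R, 3))
```

This simple form of the formula assumes there are no tied ranks.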