Engineering Mathematics 4 Descriptive Statistics and Probability Course Outline

Engineering Mathematics 4
David Ramsey
Room: B2-026 e-mail: david.ramsey@ul.ie website: www.ul.ie/ramsey
January 20, 2010
Course Outline
1. Data Collection and Descriptive Statistics.
2. Probability Theory.
3. Statistical Inference.
Course Texts
1. Engineering Mathematics 4 - Ref. No. 5301 (includes exercise lists, statistical tables and past papers; will be available from the print room).
2. Montgomery - Applied Statistics and Probability for Engineers.
3. Stuart - An introduction to statistical analysis for business and industry: a problem solving approach.
4. Montgomery, Runger, Fari - Engineering Statistics.
Similar books are available in section 519 of the library.
1 Data Collection and Descriptive Statistics
Populations of objects and individuals show variation with respect to various traits (e.g. height, political preferences, the working life of a light bulb). It is impractical to observe all the members of the population. In order to describe the distribution of a trait in the population, we select a sample. On the basis of the sample we gain information on the population as a whole.
1.1 Types of Variables

1. Qualitative variables: these are normally categorical (non-numerical) variables. We distinguish between two types of qualitative variables:
a) Nominal: these variables are not naturally ordered in any way (e.g. i. department - mechanical engineering, mathematics, economics; ii. industrial sector).
b) Ordinal: there is a natural order for such categorisations, e.g. with respect to smoking, people may be categorised as 1: non-smokers, 2: light smokers and 3: heavy smokers. It can be seen that the higher the category number, the more an individual smokes. Exam grades are ordinal variables.
2. Quantitative Variables
These are variables which naturally take numerical values (e.g. age, height, number of children). Such variables can be measured or counted. As before, we distinguish between two types of quantitative variables.
a) Discrete variables: these are variables that take values from a set that can be listed (most commonly integer values, i.e. they are variables that can be counted). For example, number of children, the results of die rolls.
b) Continuous variables
These are variables that can take values in a given range to any number of decimal places (such variables are measured, normally according to some unit, e.g. height, age, weight). It should be noted that such variables are only measured to a given accuracy (e.g. height is measured to the nearest centimetre, age is normally given to the nearest year). If a discrete random variable takes many values (e.g. the population of a town), then for practical purposes it is treated as a continuous variable.
1.2 Collection of data

Since it is impractical to survey all the individuals in a population, we need to base our analysis on a sample.
A population is the entire collection of individuals or objects we wish to describe.
A sample is the subset of the population chosen for data collection.
A sampling frame is the list used to choose the sample.
A unit is a member of the population.
A variable is any trait that varies from unit to unit.
Collection of data
For example, suppose we wish to investigate the political preferences of Irish voters (our variable of interest). The population is made up of people eligible to vote in Irish elections. A unit is any eligible voter. The lists of eligible voters for each constituency may be used as the sampling frame.
Collection of data
Alternatively, a list of addresses in various constituencies may be used as the sampling frame. However, there is no one-to-one correspondence between addresses and eligible voters. The sample is the set of people asked for their political preferences. The variable of interest is the party an individual wishes to vote for (other socio-demographic variables may be collected, such as age, occupation, education). The sample size is the number of individuals in a sample and is denoted by n.
1.2.1 Parameters and Statistics
A parameter is an unknown number describing a population. For example, it may be that 9% of the population of eligible voters wish to vote for the Green Party (we do not, however, observe this population proportion). A statistic is a number describing a sample. For example, 8% of a sample may wish to vote for the Green Party. This is the sample proportion. Statistics may be used to describe a population, but they only estimate the real parameters of the population.
Parameters and Statistics - Precision of Statistics
Naturally, the statistics from a sample will show some variation around the corresponding parameters, e.g. 9% of the population wish to vote for the Green Party, but only 8% in the sample. The greater the sample size, the more precise the results: suppose we take a large number of samples of size n; the larger n is, the less the sample proportion varies from sample to sample, i.e. the smaller the variance of the sample proportion.
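The effect of the sample size on precision can be illustrated by simulation. The following is only a sketch: the population proportion of 0.09 is taken from the Green Party example, while the number of repeated samples (200), the two sample sizes and the seed are arbitrary assumptions.

```python
import random
from statistics import pvariance

def sample_proportions(p, n, n_samples, rng):
    """Simulate n_samples polls of size n; return the sample proportion of each."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(n_samples)]

rng = random.Random(0)   # fixed seed so that the run is repeatable (assumption)
p = 0.09                 # assumed population proportion
spread = {}
for n in (100, 1000):
    proportions = sample_proportions(p, n, 200, rng)
    # variability of the sample proportion across the repeated samples
    spread[n] = pvariance(proportions)
    print(n, spread[n])
```

The variance printed for n = 1000 should be clearly smaller than the one for n = 100.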
Parameters and Statistics - Bias

However, there may be intrinsic bias from two possible sources:
a) Sampling bias - this occurs when a sample is chosen in such a way that some members of the population are more likely to be chosen than others, e.g. suppose that the Labour party is most popular in Dublin. If we used samples of voters from Dublin to estimate support in the whole of Ireland, we would systematically overestimate the support of the Labour party.
b) Non-sampling bias - this results from mistakes in data entry and/or how interviewees react to being questioned. For example, Fianna Fail supporters may be more likely to hide their preference than other individuals. In this case, we would systematically underestimate the support of Fianna Fail.
Non-Sampling Bias
Other sources of bias may be:
1. Lack of anonymity.
2. The wording of a question.
3. The desire to give an answer that would please the interviewer (e.g. surveys may systematically overestimate the willingness of individuals to pay extra for environmentally friendly goods).
Precision and Bias
It should be noted that bias is a characteristic of the way in which data are collected, not of a single sample. Increasing the sample size will improve the precision of an estimate, but will not affect the bias.
1.2.2 Random Sampling
Sampling is said to be random if each individual has an equal probability of being chosen to be in the sample and this probability is not affected by who else is chosen to be in the sample. An estimate of a parameter is unbiased if there is no systematic tendency to under- or over-estimate the parameter. Random sampling does not ensure that the estimates of parameters are unbiased. However, any remaining bias does not result from the sampling procedure (see above).
Another Example of Sampling Bias - Estimation of the Population Mean
e.g. Suppose the population of interest is the Irish population as a whole and the variable of interest height. Suppose I base my estimate of the mean height of the population on the mean height of a sample of students. Since students tend to be on average taller than the population as a whole, I will systematically overestimate the mean height in the population. That is to say, if I consider many samples of students of say size 100, a large majority of such samples would give me an overestimate of the mean height of the population as a whole.
1.3 Descriptive Statistics - 1.3.1 Qualitative (Categorical) Data
We may describe qualitative data using
a) Frequency tables.
b) Bar charts.
c) Pie charts.
n denotes the total number of observations (the sample size).
Frequency tables
Frequency tables display how many observations fall into each category (the frequency column), as well as the relative frequency of each category (the proportion of observations falling into each category). Let $n_i$ denote the number of observations in category i. The relative frequency of category i is $f_i$, where
$f_i = \frac{n_i}{n}$.
Multiplying by 100, we obtain the relative frequency as a percentage. If there are missing data, we may also give the relative frequencies in terms of the actual number of observations, $n'$, i.e.
$f_i' = \frac{n_i}{n'}$.
Frequency tables
For example, 200 students were asked which of the following bands they preferred: Franz Ferdinand, Radiohead or Coldplay. The answers may be presented in the following frequency table:

Band        Frequency   Relative Frequency (%)
Coldplay    62          62 × 100/200 = 31
Franz F.    66          66 × 100/200 = 33
Radiohead   72          72 × 100/200 = 36
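As a minimal sketch, the relative frequencies in such a table can be computed as follows. The function name `relative_frequencies` is illustrative only and is not part of the course material.

```python
def relative_frequencies(counts):
    """Map each category to its relative frequency n_i / n, as a percentage."""
    n = sum(counts.values())  # total number of observations
    return {category: 100 * n_i / n for category, n_i in counts.items()}

bands = {"Coldplay": 62, "Franz F.": 66, "Radiohead": 72}
band_percentages = relative_frequencies(bands)
print(band_percentages)
```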
Bar chart
In a bar chart the height of a bar represents the relative frequency of a given category (or the number of observations in that category).
Pie chart
The size of a slice in a pie chart represents the relative frequency of a category. Hence, the angle made by the slice representing category i is given (in degrees) by $\theta_i$, where
$\theta_i = 360 f_i = \frac{360 n_i}{n}$
(i.e. we multiply the relative frequency by the number of degrees in a full revolution). If the relative frequency of observations in group i is given in percentage terms, denoted $p_i$, then 1% of the observations in the sample corresponds to an angle of 3.6 degrees. Thus, $\theta_i = 3.6 p_i$.
Example 1.1
Suppose 1000 randomly chosen voters are asked which party they are going to vote for in the forthcoming election. The results are as given below:
Fianna Fail: 360
Fine Gael: 270
Green Party: 100
Labour: 90
Progressive Democrats: 70
No Answer: 110
Example 1.1
The frequency table is as follows:

Party         Frequency   Rel. Freq. (%)   Of non-missing data (%)
Fianna Fail   360         36               360 × 100/890 = 40.45
Fine Gael     270         27               270 × 100/890 = 30.34
Green Party   100         10               100 × 100/890 = 11.24
Labour        90          9                90 × 100/890 = 10.11
Prog. Dems.   70          7                70 × 100/890 = 7.87
No answer     110         11               -
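The two percentage columns of this table can be reproduced with a short sketch such as the following. The variable names are illustrative assumptions; the label "No answer" marks the missing data.

```python
votes = {
    "Fianna Fail": 360, "Fine Gael": 270, "Green Party": 100,
    "Labour": 90, "Prog. Dems.": 70, "No answer": 110,
}

n = sum(votes.values())  # whole sample, including missing answers
answered = {party: c for party, c in votes.items() if party != "No answer"}
n_answered = sum(answered.values())  # number of non-missing observations

# relative frequencies (%) with respect to the whole sample
percent_of_sample = {party: 100 * c / n for party, c in votes.items()}
# relative frequencies (%) with respect to the non-missing data only
percent_of_answered = {party: round(100 * c / n_answered, 2)
                       for party, c in answered.items()}

print(n_answered)
print(percent_of_answered)
```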
Example 1.1
The bar chart illustrating just the non-missing data is given by
[Figure: bar chart of the relative frequencies of the non-missing data.]
Example 1.1
The pie chart for the sample as a whole (i.e. including those who gave no answer) is derived as follows: The angle made by the slice representing the support of Fianna Fail is given by 36 × 3.6 = 129.6°. The angle made by the slice representing the support of Fine Gael is given by 27 × 3.6 = 97.2°, etc.
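A minimal sketch of the angle calculation for every slice is given below. The function name `pie_angles` is an illustrative assumption; each angle is 360 times the relative frequency of the category.

```python
def pie_angles(counts):
    """Angle (in degrees) of each slice: 360 * n_i / n."""
    n = sum(counts.values())
    return {category: 360 * n_i / n for category, n_i in counts.items()}

votes = {
    "Fianna Fail": 360, "Fine Gael": 270, "Green Party": 100,
    "Labour": 90, "Prog. Dems.": 70, "No answer": 110,
}
angles = pie_angles(votes)
for party, angle in angles.items():
    print(f"{party}: {angle:.1f} degrees")
```

The angles add up to a full revolution of 360 degrees, which is a useful check when drawing the chart by hand.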
Example 1.1
The pie chart is as follows:
[Figure: pie chart of the sample as a whole, including those who gave no answer.]
1.3.2 Graphical Presentation of Quantitative Data
Discrete data can be presented in the form of frequency tables and/or bar charts (as above). The distribution of continuous data can be presented using
a) A histogram.
b) Its empirical distribution function (also referred to as an OGIVE).
Histograms for continuous variables
In order to draw a histogram for a continuous variable, we need to categorise the data into intervals of equal length. The end points of these intervals should be round numbers. The number of categories used should be approximately $\sqrt{n}$ (normally between 5 and 20 categories are used). For example, if we have 30 observations, then we should use about $\sqrt{30} \approx 5.5$ categories. Hence, 5 and 6 are sensible choices for the number of categories. Let k be the number of categories.
Histograms
In order to choose the length of each interval, L, we use
$L \approx \frac{x_{\max} - x_{\min}}{k} = \frac{r}{k}$,
where $x_{\max}$ is the smallest round number larger than all the observations and $x_{\min}$ is the largest round number smaller than all the observations. The difference between these numbers is an estimate of the range of the data, denoted r. If necessary, L is rounded upwards, so that the intervals are of nice length and the whole range of the data is covered.
Histograms
The intervals used are $[x_{\min}, x_{\min} + L]$, $(x_{\min} + L, x_{\min} + 2L]$, ..., $(x_{\max} - L, x_{\max}]$. In general, the lower end-point of an interval is assumed not to belong to that interval (to avoid a number belonging to two classes).
Histograms
A histogram is very similar to a bar chart. The height of the block corresponding to an interval is the relative frequency of observations in that block. Thus, the height of a block is the number of observations in that interval divided by the total number of observations.
The Ogive
Suppose we have a reasonably large amount of data. In order to draw the empirical distribution function (ogive), we first split the data into categories as when drawing a histogram. Let $x_0, x_1, x_2, \ldots, x_k$ be the endpoints of the intervals formed (k denotes the number of intervals). That is to say, the i-th interval is $(x_{i-1}, x_i]$.
The Ogive
The ogive is a graph of the cumulative relative frequency. The cumulative relative frequency at an endpoint $x_i$ is the proportion of observations less than or equal to $x_i$. Note that no observations are smaller than the lower endpoint of the first interval ($x_0$), i.e. at $x_0$ the cumulative relative frequency is 0. The cumulative relative frequency at the upper endpoint of an interval can be calculated by adding the relative frequency of observations in that interval to the cumulative relative frequency at the lower endpoint of that interval (the upper endpoint of the previous interval).
The Ogive
Hence, we can calculate the cumulative relative frequency at the endpoints of each interval. We then draw a scatter plot for the k + 1 values of the cumulative relative frequency at the endpoints of each interval. The X-coordinate of each of these k + 1 points is an endpoint of one of the intervals and the Y-coordinate is the cumulative relative frequency at that endpoint. To draw the OGIVE, we connect each of these points to the next using a straight line. Note that the height of the OGIVE at the final endpoint is by definition 1.
Example 1.2
We observe the height of 20 individuals (in cm). The data are given below 172, 165, 188, 162, 178, 183, 171, 158, 174, 184, 167, 175, 192, 170, 179, 187, 163, 156, 178, 182. Draw a histogram and OGIVE representing these data.
Example 1.2
We first consider the histogram. First we choose the number of classes and the corresponding intervals. $\sqrt{20} \approx 4.5$, thus we should choose 4 or 5 intervals.
Example 1.2
The tallest individual is 192cm tall and the shortest 156cm. 200cm is the smallest round number larger than all the observations and 150cm is the largest round number smaller than all the observations. Thus, we take the range to be 50. To calculate the length of the intervals, we use
$L = \frac{r}{k}$.
Taking k to be 4, L = 12.5. Taking k = 5, L = 10 (a nicer length). Hence, it seems reasonable to use 5 intervals of length 10, starting at 150.
Example 1.2
If we assume that the upper endpoint of an interval belongs to that interval, then we have the intervals [150,160], (160, 170], (170,180], (180,190], (190,200]. Now we count how many observations fall into each interval and hence the relative frequency of observations in each interval.
Example 1.2
Height (x)        No. of Observations   Rel. Frequency
150 ≤ x ≤ 160     2                     2/20 = 0.1
160 < x ≤ 170     5                     5/20 = 0.25
170 < x ≤ 180     7                     7/20 = 0.35
180 < x ≤ 190     5                     5/20 = 0.25
190 < x ≤ 200     1                     1/20 = 0.05
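The counting step can be sketched in code as follows, assuming (as in this example) that the upper endpoint of an interval belongs to that interval and that every observation lies between the first and last endpoints. The function name `interval_counts` is illustrative only.

```python
import math

heights = [172, 165, 188, 162, 178, 183, 171, 158, 174, 184,
           167, 175, 192, 170, 179, 187, 163, 156, 178, 182]

def interval_counts(data, start, length, k):
    """Count observations in [start, start+L], (start+L, start+2L], ..., k intervals."""
    counts = [0] * k
    for x in data:
        # position of the first upper endpoint that is >= x (0-based);
        # an observation equal to `start` goes into the first interval
        i = max(0, math.ceil((x - start) / length) - 1)
        counts[i] += 1
    return counts

print(math.sqrt(len(heights)))            # guide for the number of intervals
counts = interval_counts(heights, start=150, length=10, k=5)
rel_freq = [c / len(heights) for c in counts]
print(counts)
print(rel_freq)
```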
Example 1.2
The histogram is given below:
[Figure: histogram of the height data using the five intervals above.]
Interpretation of the histogram of a continuous variable
A histogram is an estimator of the density function of a variable (see the chapter on the distribution of random variables in Section 2). The distribution of height seems to be reasonably symmetrical around 175cm.
Example 1.2 - The Ogive

From the frequency table, the cumulative relative frequencies at the endpoints of the intervals are given by:

Endpoint   Cum. Rel. Freq.
150        0.0
160        0.1
170        0.1 + 0.25 = 0.35
180        0.35 + 0.35 = 0.7
190        0.7 + 0.25 = 0.95
200        0.95 + 0.05 = 1

The ogive is shown below.
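The cumulative relative frequencies can be obtained by accumulating the interval counts, as in the following sketch (the variable names are illustrative). The counts are accumulated as integers before dividing by n, which avoids small rounding errors.

```python
from itertools import accumulate

endpoints = [150, 160, 170, 180, 190, 200]
counts = [2, 5, 7, 5, 1]   # observations in each of the five intervals
n = sum(counts)

# cumulative relative frequency at each endpoint; it is 0 at the first endpoint
cum_rel_freq = [0.0] + [c / n for c in accumulate(counts)]

for x, crf in zip(endpoints, cum_rel_freq):
    print(x, crf)
```

The pairs printed are the k + 1 points that are joined by straight lines to draw the ogive.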
The Ogive
[Figure: ogive of the height data. x-axis: height (cm), from 150 to 200; y-axis: cumulative relative frequency, from 0 to 1.]
The Ogive
The graph of the ogive can be used to estimate percentiles of a distribution. By definition, $\alpha$% of observations are less than the $\alpha$-percentile. For example, the 60-percentile of height may be estimated by drawing a horizontal line from 0.6 on the y-axis until it hits the cumulative distribution function. The 60-percentile is the value on the x-axis directly below this point of intersection. This is illustrated below.
Estimation of percentiles
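[Figure: the ogive of the height data, with a horizontal line drawn from 0.6 on the y-axis to the curve and a vertical line from the point of intersection down to the x-axis.]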
The Ogive
The 60-percentile of height in this example is approximately 178cm. The value of the cumulative relative frequency at the upper endpoint of the final interval is by definition 1.
1.3.3 Symmetry and Skewness of Distributions
From a histogram we may infer whether the distribution of a random variable is symmetric or not. The histogram of height shows that the distribution is reasonably symmetric (even if the distribution of height in the population were symmetric, we would normally observe some small deviation from symmetry in the histogram as we observe only a sample).
Right-Skewed distributions
A distribution is said to be right-skewed if there are observations a long way to the right of the centre of the distribution, but not a long way to the left. For example, the distribution of the age of students is right-skewed. Most students will be around 20 years of age. None will be much younger, but there are some mature students. The distribution of wages will also be right-skewed.
A right-skewed distribution
Left-skewed distributions
A distribution is said to be left-skewed if there are observations a long way to the left of the centre of the distribution, but not a long way to the right. For example, the weight of participants in the Oxford-Cambridge boat race will have a left-skewed distribution. This is due to the fact that the majority of participants will be heavy rowers, while a minority will be very light coxes.
A left-skewed distribution
1.4 Numerical Methods of Describing Quantitative Data
We consider two types of measure:
1. Measures of centrality - give information regarding the location of the centre of the distribution (the mean, median).
2. Measures of variability (dispersion) - give information regarding the level of variation (the range, variance, standard deviation, interquartile range).
1.4.1 Measures of centrality
1. The Sample Mean, $\bar{x}$. Suppose we have a sample of n observations. The mean is given by the sum of the observations divided by the number of observations:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$,
where $x_i$ is the value of the i-th observation.
The Population Mean
$\mu$ denotes the population mean. If there are N units in the population, then
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$,
where $x_i$ is the value of the trait for individual i in the population. $\mu$ is normally unknown. The sample mean $\bar{x}$ (a statistic) is an estimator of the population mean $\mu$ (a parameter).
2. The sample median Q2
In order to calculate the sample median, we first order the observations from the smallest to the largest. The order statistic $x_{(i)}$ is the i-th smallest observation in a sample (i.e. $x_{(1)}$ is the smallest observation and $x_{(n)}$ is the largest observation). The notation for the median comes from the fact that the median is the second quartile (see quartiles in the section on measures of dispersion).
The sample median Q2
If n is odd, then the median is the observation which appears in the centre of the ordered list of observations. Hence,
$Q_2 = x_{(0.5[n+1])}$.
If n is even, then the median is the average of the two observations which appear in the centre of the ordered list of observations. Hence,
$Q_2 = 0.5[x_{(0.5n)} + x_{(0.5n+1)}]$.
One half of the observations are smaller than the median and one half are greater.
The sample median
One advantage of the median as a measure of centrality is that it is less sensitive to extreme observations (which may be errors) than the mean. When the distribution is skewed, it is preferable to use the median as a measure of centrality, e.g. the median wage rather than the average wage should be used as a measure of what the average man on the street earns. The distribution of wages is right-skewed and the small proportion of people who earn very high wages will have a significant effect on the mean. For right-skewed distributions the mean is greater than the median. For left-skewed distributions the mean is less than the median.
The sample median
When we have a large number of observations, the 50-percentile can be used to approximate the median. This can be read from the OGIVE. Using the OGIVE of height considered earlier, the median height is approximately 175cm. A more accurate method of approximating the median in such cases is considered in the section on grouped data.
1.4.2 Measures of Dispersion - 1. The Range
The range is defined to be the largest observation minus the smallest observation. Since the range is only based on 2 observations, it conveys little information and is sensitive to extreme values (errors).
2. The sample variance s 2
The sample variance is a measure of the average square distance from the mean. The formula for the sample variance is given by
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$.
$s^2 \geq 0$ and $s^2 = 0$ if and only if all the observations are equal to each other.
3. The sample standard deviation s
The sample standard deviation is given by the square root of the variance. It (and hence the sample variance) can be calculated on a scientific calculator by using the $\sigma_{n-1}$ or $s_{n-1}$ function as appropriate. In simple terms, the standard deviation is a measure of the average distance of an observation from the mean. It cannot be greater than the maximum deviation from the mean.
4. The interquartile range

The i-th quartile, $Q_i$, is taken to be the value such that i quarters of the observations are less than $Q_i$. Thus, $Q_2$ is the sample median. If $\frac{n+1}{4}$ is an integer, then the lower quartile $Q_1$ is given by
$Q_1 = x_{(\frac{n+1}{4})}$.
Otherwise, if a is the integer part of $\frac{n+1}{4}$ [this is obtained by simply removing everything after the decimal point], then
$Q_1 = 0.5[x_{(a)} + x_{(a+1)}]$.
The interquartile range
If $\frac{3n+3}{4}$ is an integer, then the upper quartile $Q_3$ is given by
$Q_3 = x_{(\frac{3n+3}{4})}$.
Otherwise, if b is the integer part of $\frac{3n+3}{4}$, then
$Q_3 = 0.5[x_{(b)} + x_{(b+1)}]$.
When there is a large amount of data, the quartiles $Q_1$, $Q_2$, $Q_3$ can be calculated from the ogive as the 25-th, 50-th and 75-th percentiles, respectively. From the OGIVE, the lower and upper quartiles for the height data given previously are approximately 166cm and 182cm, respectively. The interquartile range (IQR) is the difference between the upper and lower quartiles,
$IQR = Q_3 - Q_1$,
i.e. for the height data approximately 16cm.
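Reading a percentile from the ogive amounts to linear interpolation between two neighbouring endpoints. The sketch below makes this explicit for the quartiles of the height data. The function name `ogive_percentile` is an illustrative assumption, and the counts are the cumulative numbers of observations at the endpoints 150, 160, ..., 200.

```python
def ogive_percentile(endpoints, cum_counts, alpha):
    """Estimate the alpha-percentile (alpha in %) by linear interpolation on the ogive."""
    n = cum_counts[-1]
    target = alpha / 100 * n   # cumulative count at which the percentile is reached
    for i in range(1, len(endpoints)):
        if cum_counts[i] >= target:
            lo, hi = endpoints[i - 1], endpoints[i]
            c_lo, c_hi = cum_counts[i - 1], cum_counts[i]
            if c_hi == c_lo:   # empty interval: nothing to interpolate
                return lo
            return lo + (target - c_lo) / (c_hi - c_lo) * (hi - lo)
    return endpoints[-1]

endpoints = [150, 160, 170, 180, 190, 200]
cum_counts = [0, 2, 7, 14, 19, 20]

q1 = ogive_percentile(endpoints, cum_counts, 25)
q3 = ogive_percentile(endpoints, cum_counts, 75)
iqr = q3 - q1
print(q1, q3, iqr)
```

A value read from the graph by eye will only agree approximately with the interpolated value.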
The quartiles can be used to display the distribution of a trait in the form of a box plot. The central line represents the median, the ends of the box represent the lower and upper quartiles. Points outside the whiskers represent outliers. Box plots can be used to investigate whether a distribution is skewed (see following diagrams).
A symmetric distribution
The ends of the boxes and the whiskers are symmetrically distributed about the median.
A right skewed distribution
There are several outliers much greater than the median, but none much smaller than the median. The upper endpoint of the whisker is much further from the median than the lower endpoint.
A left skewed distribution
There are several outliers much smaller than the median, but none much greater than the median. The upper endpoint of the whisker is much closer to the median than the lower endpoint.
Choice of the measure of dispersion

The units of all the measures used so far (except for the variance) are the same units as those used for the measurement of observations. The units of variance are the square of the units of measurement. For example, if we observe velocity in metres per second, the variance is measured in metres squared per second squared. For this reason the standard deviation is generally preferred to the variance as a measure of dispersion. If a distribution is skewed then the interquartile range is a more reliable measure of the dispersion of a random variable than the standard deviation.
Comparison of the dispersion of two variables
Sometimes we wish to compare the dispersion of two variables. In cases where different units are used to measure the two variables or the means of two variables are very different, it may be useful to use a measure of dispersion which does not depend on the units in which it is measured. The coefficient of variation, C.V., does not depend on the units of measurement. It is the standard deviation divided by the sample mean:
$C.V. = \frac{s}{\bar{x}}$.
Example 1.3 - The sample mean
Calculate the measures of centrality and dispersion defined above for the following data: 6, 9, 12, 9, 8, 10. There are 6 items of data, hence
$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{6 + 9 + 12 + 9 + 8 + 10}{6} = 9$.
Example 1.3 - The sample median
In order to calculate the median, we first order the data. If an observation occurs k times, then it must appear k times in the list of ordered data. The ordered list of data is 6, 8, 9, 9, 10, 12. Since there is an even number of data (n = 6), the median is the average of the two observations in the middle of this ordered list. Hence,
$Q_2 = 0.5[x_{(n/2)} + x_{(1 + n/2)}] = 0.5[x_{(3)} + x_{(4)}] = \frac{9 + 9}{2} = 9$.
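The rule for odd and even n can be sketched as follows. The function name `sample_median` is illustrative; note that Python lists are indexed from 0, so the order statistic x_(i) is `ordered[i - 1]`.

```python
def sample_median(data):
    """Median: the middle observation (n odd) or the mean of the two middle ones (n even)."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                    # x_((n+1)/2)
    return 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])    # 0.5[x_(n/2) + x_(n/2+1)]

data = [6, 9, 12, 9, 8, 10]
median = sample_median(data)
print(median)
```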
Example 1.3 - The range
The range is the difference between the largest and the smallest observations: Range = 12 - 6 = 6.
Example 1.3 - The variance and standard deviation
The variance is given by
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(6-9)^2 + (9-9)^2 + (12-9)^2 + (9-9)^2 + (8-9)^2 + (10-9)^2}{5} = 4$.
The standard deviation is given by $s = \sqrt{s^2} = 2$.
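The same calculation can be written out in code as a check. This is a minimal sketch and the function name `sample_variance` is illustrative.

```python
import math

def sample_variance(data):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

data = [6, 9, 12, 9, 8, 10]
s2 = sample_variance(data)
s = math.sqrt(s2)   # sample standard deviation
print(s2, s)
```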
Example 1.3 - The interquartile range

In order to calculate the interquartile range, we first calculate the lower and upper quartiles. n = 6, hence $\frac{n+1}{4} = 1.75$. The integer part of this number is 1. Hence, the lower quartile is
$Q_1 = 0.5[x_{(1)} + x_{(2)}] = 0.5(6 + 8) = 7$.
Similarly, $\frac{3n+3}{4} = 5.25$. The integer part of this number is 5. Hence, the upper quartile is
$Q_3 = 0.5[x_{(5)} + x_{(6)}] = 0.5(10 + 12) = 11$.
Hence, IQR = 11 - 7 = 4.
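The quartile rule used here can be sketched as follows. The function name `quartiles` is illustrative, and the sketch assumes at least 3 observations so that both positions fall inside the ordered list. Other textbooks and software packages use slightly different quartile conventions and may give different values.

```python
def quartiles(data):
    """Lower and upper quartiles using the positions (n+1)/4 and (3n+3)/4."""
    x = sorted(data)
    n = len(x)
    if n < 3:
        raise ValueError("this sketch assumes at least 3 observations")

    def at_position(numerator):
        # position = numerator / 4; a is its integer part
        a, remainder = divmod(numerator, 4)
        if remainder == 0:
            return x[a - 1]                  # x_(a): the position is an integer
        return 0.5 * (x[a - 1] + x[a])       # 0.5[x_(a) + x_(a+1)]

    return at_position(n + 1), at_position(3 * n + 3)

q1, q3 = quartiles([6, 9, 12, 9, 8, 10])
iqr = q3 - q1
print(q1, q3, iqr)
```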
Example 1.3 - The coecient of variation
$C.V. = \frac{s}{\bar{x}} = \frac{2}{9} \approx 0.22$.
Suppose a variable is by definition positive, e.g. height, weight. A coefficient of variation above 1 is accepted to be very large (such variation may occur in the case of wages when wage inequality is high). With regard to the physical traits of people, values for the coefficient of variation of around 0.1 to 0.3 are common (in humans the coefficient of variation of height is around 0.1, the coefficient of variation for weight is somewhat bigger).
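As a quick check, the coefficient of variation for these data can be computed with Python's standard `statistics` module, whose `stdev` function uses the n - 1 divisor, as above.

```python
from statistics import mean, stdev

data = [6, 9, 12, 9, 8, 10]
cv = stdev(data) / mean(data)   # standard deviation divided by the sample mean
print(round(cv, 2))
```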
1.5 Measures of Location and Dispersion for Grouped Data - a) Discrete Random Variables
A die was rolled 100 times and the following data were obtained:

Result                1    2    3    4    5    6
No. of observations   15   18   20   14   15   18
Grouped discrete data

Suppose the possible results are $\{x_1, x_2, \ldots, x_k\}$ and the result $x_i$ occurs $f_i$ times. The total number of observations is
$n = \sum_{i=1}^{k} f_i$.
The sum of the observations is given by
$\sum_{i=1}^{k} x_i f_i$.
It follows that the sample mean is given by
$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$.
The variance of the observations is given by
$s^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2$.

The following table is useful in calculating the sample mean:

$x_i$    $f_i$    $f_i x_i$
1        15       15
2        18       36
3        20       60
4        14       56
5        15       75
6        18       108
Total    100      350

Hence, the sample mean is $\bar{x} = \frac{350}{100} = 3.5$.

Once the mean has been calculated, we can add two columns for $(x_i - \bar{x})^2$ and $f_i (x_i - \bar{x})^2$:

$x_i$    $f_i$    $f_i x_i$    $(x_i - \bar{x})^2$    $f_i (x_i - \bar{x})^2$
1        15       15           2.5^2                  15 × 2.5^2 = 93.75
2        18       36           1.5^2                  18 × 1.5^2 = 40.5
3        20       60           0.5^2                  20 × 0.5^2 = 5
4        14       56           0.5^2                  14 × 0.5^2 = 3.5
5        15       75           1.5^2                  15 × 1.5^2 = 33.75
6        18       108          2.5^2                  18 × 2.5^2 = 112.5
Total    100      350                                 289
The sample variance is given by
$\frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{289}{99} = 2.92$.
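The table-based calculation translates directly into code. The sketch below uses the die data; the function name `grouped_mean_variance` is illustrative only.

```python
def grouped_mean_variance(values, freqs):
    """Sample mean and variance from a frequency table (value x_i occurs f_i times)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / (n - 1)
    return mean, variance

results = [1, 2, 3, 4, 5, 6]
observed = [15, 18, 20, 14, 15, 18]
die_mean, die_variance = grouped_mean_variance(results, observed)
print(die_mean, round(die_variance, 2))
```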
Calculation of the sample median for grouped discrete data
In this case we know the exact values of the observations and hence we can order the data. In this way we can calculate the median. Since there are 100 observations, the median is
$Q_2 = 0.5[x_{(50)} + x_{(51)}]$.
The 15 smallest observations are equal to 1, i.e. $x_{(1)} = x_{(2)} = \ldots = x_{(15)} = 1$. The next 18 smallest observations are equal to 2, i.e. $x_{(16)} = x_{(17)} = \ldots = x_{(33)} = 2$. The next 20 smallest observations are all equal to 3, i.e. $x_{(34)} = x_{(35)} = \ldots = x_{(53)} = 3$.
It follows that $x_{(50)} = x_{(51)} = 3$. Hence, $Q_2 = 0.5[x_{(50)} + x_{(51)}] = 3$.
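The counting argument above can be sketched in code by walking through the cumulative frequencies until the required order statistic is reached. The function names are illustrative, and the values are assumed to be listed in increasing order.

```python
def order_statistic(values, freqs, i):
    """Return x_(i), the i-th smallest observation (1-based), from a frequency table."""
    cumulative = 0
    for x, f in zip(values, freqs):
        cumulative += f            # observations up to and including value x
        if i <= cumulative:
            return x
    raise IndexError("i exceeds the number of observations")

def grouped_discrete_median(values, freqs):
    n = sum(freqs)
    if n % 2 == 1:
        return order_statistic(values, freqs, (n + 1) // 2)
    return 0.5 * (order_statistic(values, freqs, n // 2)
                  + order_statistic(values, freqs, n // 2 + 1))

results = [1, 2, 3, 4, 5, 6]
observed = [15, 18, 20, 14, 15, 18]
die_median = grouped_discrete_median(results, observed)
print(die_median)
```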
1.5 Measures of Location and Dispersion for Grouped Data - b) Continuous Random Variables
In such cases we have data grouped into intervals. Let $x_i$ be the centre of the i-th interval and $f_i$ the number of observations in the i-th interval. The approach to calculating the sample mean and variance is the same as in the case of discrete data. In order to carry out the calculations, we assume that each observation is in the middle of the appropriate interval.
Example 1.4
Consider the grouped data from Example 1.2:

Height (x)        $x_i$    $f_i$    $f_i x_i$
150 ≤ x ≤ 160     155      2        310
160 < x ≤ 170     165      5        825
170 < x ≤ 180     175      7        1225
180 < x ≤ 190     185      5        925
190 < x ≤ 200     195      1        195
Total                      20       3480

Thus, the sample mean is $\bar{x} = \frac{3480}{20} = 174$.
Example 1.4
Now we can add the remaining 2 columns of the table:

$x_i$    $f_i$    $f_i x_i$    $(x_i - \bar{x})^2$    $f_i (x_i - \bar{x})^2$
155      2        310          19^2                   2 × 19^2 = 722
165      5        825          9^2                    5 × 9^2 = 405
175      7        1225         1                      7
185      5        925          11^2                   5 × 11^2 = 605
195      1        195          21^2                   21^2 = 441
Total    20       3480                                2180

The variance is $\frac{2180}{19} = 114.74$. The standard deviation is $\sqrt{114.74} = 10.71$.
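The calculation for grouped continuous data can be sketched as follows. The centre of each interval is computed from its endpoints and then treated as the value of every observation in that interval, as assumed above. The variable names are illustrative.

```python
import math

endpoints = [150, 160, 170, 180, 190, 200]
counts = [2, 5, 7, 5, 1]   # f_i for each interval

# centre x_i of each interval
midpoints = [(lo + hi) / 2 for lo, hi in zip(endpoints, endpoints[1:])]

n = sum(counts)
mean = sum(f * x for x, f in zip(midpoints, counts)) / n
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, counts))
variance = sum_sq / (n - 1)
std_dev = math.sqrt(variance)
print(mean, sum_sq, round(variance, 2), round(std_dev, 2))
```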
Estimating the median for grouped continuous data
Now we consider a more accurate method of estimating the median than the graphical method presented earlier using the OGIVE. The first step is to find the interval in which the median lies. This is the interval in which the cumulative relative frequency (crf) reaches 0.5. Since crf(170) = 0.35 < 0.5 and crf(180) = 0.7 > 0.5, the median lies in the interval (170, 180).

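The median can now be estimated using geometry. Consider the graph of the OGIVE on the interval (170, 180). We know the height of the OGIVE at 170 and 180 and know that at the median the height must be 0.5. We can construct the following pair of similar triangles.
[Figure: the OGIVE on the interval (170, 180), rising from a c.r.f. of 0.35 at height 170 to 0.7 at height 180. The large triangle has height h2 = 0.35; the small triangle has height h1 = 0.15 and ends at the median Q2 = 170 + y, where the c.r.f. is 0.5.]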
The height of the large triangle is $h_2 = 0.35$ (the proportion of observations in that interval). The length of the large triangle is the length of the interval, d = 10. The cumulative relative frequency (c.r.f.) at the median is 0.5. The height of the small triangle is 0.5 minus the c.r.f. at the lower endpoint. Thus, $h_1 = 0.15$. The median is equal to y + 170, where y is the length of the small triangle (170 is the lower endpoint of the interval).
Since the triangles are similar, the ratio of the length to the height is constant, i.e.
$\frac{d}{h_2} = \frac{y}{h_1}$.
Hence,
$y = \frac{h_1 d}{h_2} = \frac{0.15 \times 10}{0.35} \approx 4.3$.
It follows that the median is approximately $Q_2 = 170 + 4.3 = 174.3$.
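The similar-triangles argument is a linear interpolation within the interval containing the median. A minimal sketch is given below; the function name `grouped_median` is illustrative, and counts are used in place of relative frequencies (the ratio of h1 to h2 is unchanged).

```python
def grouped_median(endpoints, counts):
    """Estimate the median of grouped continuous data by linear interpolation."""
    n = sum(counts)
    half = n / 2
    cumulative = 0   # number of observations below the current interval
    for lo, hi, f in zip(endpoints, endpoints[1:], counts):
        if f > 0 and cumulative + f >= half:
            # (half - cumulative) / f corresponds to h1 / h2; (hi - lo) is d
            return lo + (half - cumulative) / f * (hi - lo)
        cumulative += f
    raise ValueError("no observations")

endpoints = [150, 160, 170, 180, 190, 200]
counts = [2, 5, 7, 5, 1]
height_median = grouped_median(endpoints, counts)
print(round(height_median, 1))
```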
