for
Engineering Students
Preface
In general, engineers develop new products, improve existing designs, build and test
prototypes, and troubleshoot ongoing manufacturing processes, among other tasks. In each of these
functions, engineers collect and analyze data as an integral part of their job. Thus, statistical
methods are an inseparable part of how engineers solve engineering problems.
This text is an introductory statistics textbook designed for undergraduate students
taking engineering programs at Universiti Teknologi Malaysia, Skudai. It provides
sufficient material to cover the SSE2193 Engineering Statistics course throughout a
15-week semester. This text does not pretend to provide either a complete statistical
toolkit or a review of all statistical methods in all aspects of engineering applications.
It does, however, provide students with an easy start-up kit of key statistical methods,
with various examples and tasks that students can solve either during class or in their
own time.
We sincerely hope this text will be useful for students in acquiring the skills of handling
observed data, drawing valid inferences and eventually making sound judgements and
well-founded decisions.
Authors
September 2015
_________________________________________________________________________
Self-Review Quiz
Test your prior knowledge and understanding of basic statistics by answering the
following questions.
b) ii, iii, iv
c) i, iii, iv
d) ii, iv
7.
b)
c) x
d)
8.
Which of the following data can most possibly be represented by a discrete random
variable?
a)
b)
c)
d)
10. A random variable X follows a normal distribution with mean 16 and standard deviation
2. The probability of X being less than 15 can be calculated by finding
a)
15 16
P Z
22
b)
15 16
P Z
c)
d)
e)
16 15
P Z
15 16
P Z
15 16
P Z
Contents
Preface
Self-Review Quiz
1 Fundamental Topics
1.1 Descriptive Statistics and Inferential Statistics
1.1.1 Terms and definitions
1.1.2 Measures of central tendency
1.1.3 Measures of dispersion
1.1.4 The use of calculators
1.1.5 Types of Plots
1.2 Probability
1.2.1 Basic notation and definition
1.2.2 Classical definition of probability
1.2.3 Mutually exclusive event
1.2.4 Additive rule of probability
1.2.5 Conditional probability
1.2.6 Multiplication rule of probability
1.2.7 Independence
1.3 Random Variables
1.3.1 Discrete random variable
1.3.2 Continuous random variable
1.3.3 Cumulative distribution function
1.3.4 Mathematical expectation
1.3.5 Variance and standard deviation
1.4 Some Probability Distributions
1.4.1 Binomial distribution
1.4.2 Poisson distribution
1.4.3 Negative binomial distribution
Chapter 1
Fundamental Topics
Learning Objectives:
At the end of this chapter, students should be able to
(a)
(b)
(c)
(d)
(e)
(g)
This chapter presents a brief refresher of the basic statistics that students are expected to have
learnt at pre-undergraduate level. Although this chapter does not represent a whole course of
basic statistics material, it provides the necessary background for the succeeding
chapters in this book.
1.1
Descriptive Statistics and Inferential Statistics
Statistics deals with the collection, analysis, presentation and interpretation of data,
and with making decisions based on the observed data. Engineers play a fundamental role
in many aspects of the decision-making process, such as designing and developing new products,
maintaining and controlling manufacturing processes, and improving existing systems and
processes. Statistical methods are important tools in these activities: they provide
engineers with both descriptive and analytical methods for handling the variability in the
observed data.
Statistics can be divided into two major areas, namely descriptive statistics and
inferential statistics. Descriptive statistics deals with the collection and presentation of data.
It involves collecting raw data, then classifying, interpreting and presenting the data as
meaningful information for users. Inferential statistics, on the other hand, involves procedures
used to draw inferences about a population from a sample. Here, probability models are used
to quantify the risks involved in making any statistical inference.
assigned a code in the form of a number, where the numbers are simply labels, such as race,
for example: Malay = 1, Chinese = 2 and Indian = 3) and ordinal (which can be ranked, i.e.
put in order, or have a rating scale attached, for example: first, second and third place in a
competition).
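$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \quad \text{(population mean)}$$
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{(sample mean)}$$
(b) Median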
Median is the middle value that divides the higher half of the data from the lower half of the
data when the observations are arranged in ascending or descending order. If the number of
observations is odd, the median is the middle value, and if the number of observations is
even, the median is the average of the two middle values.
(c) Mode
Mode is the observation with the highest frequency. If several observations share the
same highest frequency, then there is more than one mode in the set of data. However, a
mode may not exist if all observations occur with the same frequency. Therefore, unlike the mean
and median, the mode is not unique.
1.1.3
Measures of dispersion
Measures of dispersion or variation are numerical values that indicate the variability of a set
of data. When the dispersion is large, the data are widely scattered. The simplest measure of
variation is range but the most used measures are variance and standard deviation.
(a) Range
Range of a data set is the difference between the largest and the smallest observations.
Range = Largest observation - Smallest observation
(b) Variance
The variance of a set of data is a measure of the spread or dispersion within the
data. The population variance is denoted by $\sigma^2$ and the sample variance by $s^2$.
The population variance is given by
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
where N is the population size, $x_i$ is the i-th observation in the population and $\mu$ is
the population mean.
The sample variance, on the other hand, is given by
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
where n is the sample size, $x_i$ is the i-th observation and $\bar{x}$ is the sample mean.
If the variance is defined, we can conclude that it is never negative because the squares
are either positive or zero. The unit for variance is the square of the unit of observation.
(c) Standard deviation
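The standard deviation is the positive square root of the variance. Therefore the standard
deviations for a population and a sample are
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \quad \text{and} \quad s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
respectively.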
The following example has been done using Casio fx 570MS. You should consult your
calculator instruction manual if yours does not appear to follow the following patterns.
Example 1
___________________________________________________________________________
In a crash test, cars were tested to determine what impact speed was required to cause
bumper damage. The following data show the speeds (in km/h) for a sample of 10 cars. Find
the mean, median, mode, range, variance and standard deviation for the cars by applying the
formulas manually. Check whether you get the same answers for the mean and standard
deviation using your calculator.
98, 101, 114, 90, 103, 93, 98, 105, 119, 89
Solution
Mean $= \bar{x} = \frac{1}{10} \sum_{i=1}^{10} x_i$
= 1010/10
= 101.
To find the median, we have to arrange the observations in ascending or descending
order:
89 90 93 98 98 101 103 105 114 119
Since the number of observations is even, the median is the average of the two middle
values:
Median = (98 + 101)/2
= 99.5
Mode = 98 since it has the highest frequency, i.e. it appears most frequently in the data set.
Range = 119 - 89
= 30
As the data are taken from a sample, we calculate the sample variance:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{9} \sum_{i=1}^{10} (x_i - 101)^2 = \frac{860}{9} = 95.56$$
The sample standard deviation is then
$$s = \sqrt{95.56} = 9.775$$
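As a cross-check of the manual calculation, the same summary statistics can be computed with Python's standard `statistics` module. This is only a sketch for verification; the variable names are illustrative and not part of the text.

```python
import statistics

# Impact speeds (km/h) of the 10 sampled cars from the crash-test example.
speeds = [98, 101, 114, 90, 103, 93, 98, 105, 119, 89]

mean = statistics.mean(speeds)
median = statistics.median(speeds)      # average of the two middle values
mode = statistics.mode(speeds)          # most frequent observation
data_range = max(speeds) - min(speeds)
sample_variance = statistics.variance(speeds)  # divides by n - 1
sample_sd = statistics.stdev(speeds)           # square root of the above

print(mean, median, mode, data_range)
print(round(sample_variance, 2), round(sample_sd, 3))
```

Note that `statistics.variance` and `statistics.stdev` use the sample (n - 1) divisor; `pvariance` and `pstdev` are the population versions.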
___________________________________________________________________________
1.1.5
Types of Plots
Data can be summarized not only numerically, using a measure of central tendency and a
measure of dispersion, but also graphically, which can give us an immediate idea of some
characteristics of the data such as its distribution and skewness.
A suitable graphical summary for quantitative data is a histogram or a boxplot, whereas for
qualitative data one can use a pie chart, bar chart or Pareto chart. In
addition, one can use a scatter plot to summarize graphically the relationship between two
quantitative variables.
1.2
Probability
In common usage, the word probability means the chance that a particular event will occur. In
statistics, probability is a numerical measure of the likelihood of an event. Before we go
further, it helps to understand a few terms connected to probability.
1.2.1
Basic notation and definition
(a) Outcome
An outcome is a result of an experiment or trial
elements. Usually we denote sample space as S. For example, a trial of tossing a die will lead
to S = {1, 2, 3, 4, 5, 6}.
(c) Event
An event is a subset of a sample space. Let an event A be defined as getting an odd number
from tossing a die. Then A = {1, 3, 5}, which is a subset of the sample space S = {1, 2, 3,
4, 5, 6}.
1.2.2
Classical definition of probability
Classical probability uses the sample space to determine the numerical probability that an
event will occur. It is also called theoretical probability. Let S be a sample space and E
be an event which is a subset of the sample space S. The probability of event E occurring
is
$$P(E) = \frac{\text{number of elements in } E}{\text{number of elements in } S} = \frac{n(E)}{n(S)}$$
This is only true if all outcomes are equally likely (have the same chance) to occur.
There are some basic rules about probability:
(i) Any probability assigned must be a nonnegative real number. The probability takes a
value from 0 to 1. Since it reflects the chance of an event occurring, a probability of 0 indicates
that the event will never occur. On the other hand, a probability of 1 means the event
will certainly occur. Therefore,
$$0 \le P(E) \le 1$$
(ii) The probability of a sample space is always unity, i.e. $P(S) = 1$.
(iii) The probability that an event does not occur is one minus the probability that the event
does occur. Therefore, if $E'$ is the complement of E, then
$$P(E') = 1 - P(E)$$
(iv) $P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$, i.e.,
$$P(E_1 \cup E_2 \cup \dots \cup E_n) = P(E_1) + P(E_2) + \dots + P(E_n)$$
where $E_1, E_2, \dots, E_n$ are mutually exclusive events.
Example 2
___________________________________________________________________________
In an experiment, a box containing 5 green bulbs, 6 blue bulbs and 4 white bulbs is used. A
bulb is chosen at random. What is the probability that (i) a white bulb, (ii) a non-white bulb is
chosen?
Solution
The number of bulbs in the box is 15, so $n(S) = 15$.
Suppose event A is "the bulb obtained is white". The number of white bulbs in the box is 4,
so $n(A) = 4$.
Hence,
$$P(\text{getting a white bulb}) = P(A) = \frac{4}{15}$$
and
$$P(\text{not getting a white bulb}) = P(A') = 1 - P(A) = 1 - \frac{4}{15} = \frac{11}{15}$$
To explain the above rule: when A and B are not mutually exclusive, there is an overlap,
or intersection, between A and B. That is why, when we add P(A) and P(B), the probability of
the intersection, $P(A \cap B)$, is counted twice. To compensate for that double counting, the
intersection needs to be subtracted once, $-P(A \cap B)$.
When A and B are mutually exclusive, $P(A \cap B) = 0$, and the additive rule becomes
$$P(A \cup B) = P(A) + P(B)$$
Example 3
_________________________________________________________________________
In a group of 30 engineering students, 4 out of the 7 women and 8 out of the 23 men wear
spectacles. What is the probability that a person chosen at random from the group is a woman
or someone who wears spectacles?
Solution
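Let W be the event "the person chosen is a woman" and S be the event "the person chosen wears spectacles".
We have
$$P(W) = \frac{7}{30}, \quad P(S) = \frac{12}{30} \quad \text{and} \quad P(W \text{ and } S) = P(W \cap S) = \frac{4}{30}.$$
Thus,
$$P(W \text{ or } S) = P(W \cup S) = P(W) + P(S) - P(W \cap S) = \frac{7}{30} + \frac{12}{30} - \frac{4}{30} = 0.5$$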
___________________________________________________________________________
a conditional probability. The symbol $P(A \mid B)$ denotes the probability that event A will
occur given that event B has occurred. The formula is given by
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
where $P(A \cap B)$ is the probability that event A and event B both occur and P(B) is the
probability that event B occurs.
These probabilities are also referred to as Bayesian probabilities, named after the probability
theorist Thomas Bayes (1702-61).
Bayes' theorem gives us a general conditional probability formula. If $A_1, A_2, \dots, A_n$ are
mutually exclusive events whose union is the whole sample space and $P(B) \neq 0$, then for each k
$$P(A_k \mid B) = \frac{P(A_k)\,P(B \mid A_k)}{\sum_{i=1}^{n} P(A_i)\,P(B \mid A_i)}$$
__________________________________________________________________________
Example 4
A quality control officer inspects an assembled product from machine A by randomly
selecting one of its components from the assembly line. The probability that a defective
component is found is 35%. If a defective component is found, the probability that machine
A breaks down an hour after the officer's inspection is 0.64. On the other hand, if a
non-defective component is found, the probability that machine A breaks down an hour after the
officer's inspection is just 0.28.
(a) Find the probability that machine A breaks down an hour after the officer's inspection.
(b) If machine A breaks down an hour after inspection, what is the probability that a defective
component was found earlier?
___________________________________________________________________________
Solution:
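P(Defective) = P(D) = 0.35, so P(D') = 0.65
P(Breaks down|Defective) = P(B|D) = 0.64
P(Breaks down|Non-defective) = P(B|D') = 0.28
(a) P(B) = P(B|D)P(D) + P(B|D')P(D')
= (0.64)(0.35) + (0.28)(0.65)
= 0.406
(b) $$P(D \mid B) = \frac{P(B \mid D)\,P(D)}{P(B)} = \frac{(0.64)(0.35)}{0.406} = 0.552$$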
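The two steps of this example (total probability for the breakdown, then Bayes' theorem) can be sketched in a few lines of Python. The variable names are illustrative only; the numbers are those given in the example.

```python
# Probabilities given in the machine-breakdown example.
p_defective = 0.35
p_breakdown_given_defective = 0.64
p_breakdown_given_ok = 0.28

# (a) Law of total probability: sum over the defective / non-defective cases.
p_breakdown = (p_breakdown_given_defective * p_defective
               + p_breakdown_given_ok * (1 - p_defective))

# (b) Bayes' theorem: reverse the conditioning.
p_defective_given_breakdown = (p_breakdown_given_defective * p_defective
                               / p_breakdown)

print(round(p_breakdown, 3), round(p_defective_given_breakdown, 3))
```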
___________________________________________________________________________
1.2.6
Multiplication rule of probability
The multiplication rule can be used to determine the probability that two events, A and B,
both occur. The multiplication rule follows from the definition of conditional probability. The
result is often written as follows, using set notation:
$$P(A \cap B) = P(A \mid B)\,P(B)$$
or
$$P(A \cap B) = P(B \mid A)\,P(A)$$
where
P(A) is the probability that event A occurs,
P(B) is the probability that event B occurs,
$P(A \cap B)$ is the probability that event A and event B both occur,
P(A|B) is the probability that event A occurs given that event B has already occurred,
and P(B|A) is the probability that event B occurs given that event A has already occurred.
We can easily understand the multiplication rules from a tree diagram. Some information
about the tree diagram: (i) the branches represent any possible outcomes from a trial, (ii) the
sum of the probabilities from a source is equal to 1.
_______________________________________________________________________
Example 5
All raw components of a certain product must pass two production processes to become a
finished product. The probability that a raw component passes the first production process is
0.72. The probability that the component passes the second production process after it passes
the first production process is 0.8. What is the probability that a raw component becomes a
finished product?
Solution
Let A be the event "a component passes the first production process" and B be the event "a component passes the
second production process". Then,
$$P(A \cap B) = P(A)\,P(B \mid A) = (0.72)(0.8) = 0.576$$
1.2.7
Independence
__________________________
Example 6
Two marbles are drawn (without replacement) from a bag containing 4 red and 6 blue
marbles.
(a) What is the probability that both of them are blue?
(b) What is the probability of getting one red and one blue marble?
Solution
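Let R represent a red marble and B represent a blue marble.
(a) $$P(B \text{ and } B) = \frac{6}{10} \cdot \frac{5}{9} = \frac{1}{3}$$
(b) $$P(R \text{ and } B) = P(R \cap B) + P(B \cap R) = \frac{4}{10} \cdot \frac{6}{9} + \frac{6}{10} \cdot \frac{4}{9} = \frac{8}{15}$$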
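The marble example applies the multiplication rule to draws without replacement. A small sketch using exact fractions makes the conditional probabilities explicit; the variable names are illustrative, not from the text.

```python
from fractions import Fraction

red, blue = 4, 6
total = red + blue

# (a) Blue first, then blue again from the 9 marbles that remain.
p_both_blue = Fraction(blue, total) * Fraction(blue - 1, total - 1)

# (b) One of each colour: red then blue, or blue then red.
p_one_each = (Fraction(red, total) * Fraction(blue, total - 1)
              + Fraction(blue, total) * Fraction(red, total - 1))

print(p_both_blue, p_one_each)  # 1/3 8/15
```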
EXERCISE
A motor company has 18 used cars, 11 of which are accident-free. For an accident-free
car, the probability that the alarm system is not functioning is 0.3; for a car that is not
accident-free, the probability that the alarm system is not functioning is 0.6.
(a) Mr. Ahmad wants to buy 2 used cars for his children. Find the probability that both cars are
accident-free.
(b) Mr. Osman also wants to buy 2 used cars. What is the probability that exactly one of them
is not accident-free?
(c) What is the probability that Miss Ani buys a car that is accident-free and whose alarm
system is working?
(d) Ali wants to buy a used car. What is the probability that its alarm system is not
functioning?
(e) The alarm system of a used car bought by Madam Sheely is not functioning. What is
the probability that the car is accident-free?
___________________________________________________________________________
1.3
Random Variables
A random variable, usually written as X , is a variable whose possible values are numerical
outcomes of a random phenomenon. There are two types of random variables, discrete and
continuous.
1.3.1
Discrete random variable
A discrete random variable is one which may take on only a countable number of distinct
values such as the number of children in a family, the number of goals scored in football
games and the number of defective bulbs in a box.
The probability distribution of a discrete random variable (sometimes called the probability mass
function) is a list of probabilities associated with each of its possible values. The probabilities
$p_i$ must satisfy
(a) $0 \le p_i \le 1$, and
(b) $\sum_i p_i = 1$.
___________________________________________________________________________
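Example 7
Show that $P(X = x) = \frac{x^2}{30}$ is the probability mass function for X, for
x = 0, 1, 2, 3, 4.
Solution:
We need to show that $0 \le p_i \le 1$ and $\sum_i p_i = 1$.
Now $P(0) = 0$, $P(1) = \frac{1}{30}$, $P(2) = \frac{4}{30}$, $P(3) = \frac{9}{30}$ and $P(4) = \frac{16}{30}$, and
$$\sum_i p_i = 0 + \frac{1}{30} + \frac{4}{30} + \frac{9}{30} + \frac{16}{30} = 1$$
Hence, it is shown that X is a discrete random variable and P(X = x) is the probability mass
function for X.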
___________________________________________________________________________
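Example 8
Find the value of c if $P(X = x) = \frac{cx}{2}$ for x = 0, 1, 2, 3 is the probability mass
function of a discrete random variable X.
Solution
If X is a discrete random variable, then
$$\sum_x P(X = x) = 1$$
$$0 + \frac{c}{2} + \frac{2c}{2} + \frac{3c}{2} = 1$$
$$\frac{6c}{2} = 1$$
$$c = \frac{1}{3}$$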
__________________________________________________________________________
1.3.2 Continuous random variable
A continuous random variable can take any value over an interval of real numbers;
examples include weight, time and height. The probability of a random variable X being in an interval
[a, b] is defined as the area under a curve represented by a function f(x), that is
$$P(a \le X \le b) = \int_a^b f(x)\,dx = F(b) - F(a)$$
The function f(x) is called a probability density function and it satisfies the following
conditions:
(a) The curve of f(x) has no negative values ($f(x) \ge 0$ for all x)
(b) The total area under the curve is equal to 1
The function F (.) is a cumulative distribution function which will be discussed in the next
section.
___________________________________________________________________________
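Example 9
Show that $f(x) = \frac{x^2}{21}$; $1 \le x \le 4$ is a pdf and find $P(2 \le X \le 3)$.
Solution
We must show $\int_1^4 f(x)\,dx = 1$.
$$\int_1^4 \frac{x^2}{21}\,dx = \left[\frac{x^3}{63}\right]_1^4 = \frac{64}{63} - \frac{1}{63} = \frac{63}{63} = 1 \quad \text{(shown)}$$
$$P(2 \le X \le 3) = \int_2^3 \frac{x^2}{21}\,dx = \left[\frac{x^3}{63}\right]_2^3 = \frac{19}{63}$$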
___________________________________________________________________________
1.3.3 Cumulative distribution function
The cumulative distribution function, denoted by F(.), is
$$F(x) = P(X \le x)$$
For a discrete random variable, the cumulative distribution function is the sum of the
probabilities, that is
$$F(x) = P(X \le x) = \sum_{t \le x} P(X = t).$$
For a continuous random variable, the cumulative distribution function is found by
integrating f(t) from $-\infty$ to x, that is
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
__________________________________________________________________________
Example 10
Find the cumulative distribution function of X if X is a discrete random variable having the
following probability distribution:
X = x       1     2     3     4
P(X = x)    0.12  0.54  0.09  0.25
Solution
X = x       1     2     3     4
P(X = x)    0.12  0.54  0.09  0.25
F(x)        0.12  0.66  0.75  1.0
or
$$F(x) = \begin{cases} 0, & x < 1; \\ 0.12, & 1 \le x < 2; \\ 0.66, & 2 \le x < 3; \\ 0.75, & 3 \le x < 4; \\ 1, & x \ge 4. \end{cases}$$
___________________________________________________________________________
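Example 11
Find the cumulative distribution function if the probability density function for X is
$$f(x) = \begin{cases} 0.1, & 10 \le x \le 20 \\ 0, & \text{elsewhere} \end{cases}$$
Solution
For $x < 10$, $F(x) = 0$.
For $10 \le x \le 20$,
$$F(x) = \int_{10}^{x} f(t)\,dt = \int_{10}^{x} 0.1\,dt = 0.1x - 1$$
For $x > 20$, $F(x) = 1$.
Therefore
$$F(x) = \begin{cases} 0, & x < 10 \\ 0.1x - 1, & 10 \le x \le 20 \\ 1, & x > 20 \end{cases}$$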
___________________________________________________________________________
1.3.4 Mathematical expectation
The expected value of a random variable indicates its average or central value.
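(a) The expected value of a discrete random variable X is defined by
$$E(X) = \sum_{i=1}^{n} x_i\,P(x_i)$$
(b) The expected value of a continuous random variable X with probability density function f(x) is defined by
$$E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx$$
1.3.5 Variance and standard deviation
The variance of a random variable X is
$$Var(X) = E(X^2) - [E(X)]^2$$
where E(X) and E(X^2) both exist and E(X) is the expected value of X.
The standard deviation of X is $\sqrt{Var(X)}$.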
___________________________________________________________________________
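Example 12
X is a random variable taking values in {1, 2, 5, 10}, with the probability function P(X = x) defined by
P(X = 1) = 0.4, P(X = 2) = 0.3 and P(X = 10) = 0.2
(a) Find P(X = 5).
(b) Evaluate the mean E(X) and the variance Var(X).
Solution
(a) Since the probabilities must sum to 1, P(X = 5) = 1 - (0.4 + 0.3 + 0.2) = 0.1.
(b) Mean, $E(X) = \sum_{i=1}^{4} x_i\,P(x_i)$ = 1(0.4) + 2(0.3) + 5(0.1) + 10(0.2) = 3.5
$E(X^2) = \sum_{i=1}^{4} x_i^2\,P(x_i)$ = 1(0.4) + 4(0.3) + 25(0.1) + 100(0.2) = 24.1
$Var(X) = E(X^2) - [E(X)]^2$
= 24.1 - 12.25
= 11.85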
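The mean and variance of a discrete random variable follow directly from its probability table. The sketch below reworks this example, assuming that part (a) asks for P(X = 5), the one probability not listed; the dictionary `pmf` is an illustrative name.

```python
# Probabilities listed in the example; P(X = 5) is filled in below.
pmf = {1: 0.4, 2: 0.3, 10: 0.2}

# Assumption: the missing probability is whatever makes the total equal 1.
pmf[5] = 1 - sum(pmf.values())

mean = sum(x * p for x, p in pmf.items())                 # E(X)
second_moment = sum(x ** 2 * p for x, p in pmf.items())   # E(X^2)
variance = second_moment - mean ** 2                      # E(X^2) - [E(X)]^2

print(round(pmf[5], 2), round(mean, 2), round(variance, 2))
```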
___________________________________________________________________________
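Example 13
The probability density function of a random variable X is f(x), defined as follows:
$$f(x) = \begin{cases} 0.1, & 2 \le x \le 6 \\ 0.2, & 8 \le x \le 11 \\ 0, & \text{elsewhere} \end{cases}$$
Mean, $$E(X) = \int_2^6 0.1x\,dx + \int_8^{11} 0.2x\,dx = \left[\frac{0.1x^2}{2}\right]_2^6 + \left[\frac{0.2x^2}{2}\right]_8^{11} = 1.6 + 5.7 = 7.3$$
$$E(X^2) = \int_2^6 0.1x^2\,dx + \int_8^{11} 0.2x^2\,dx = \left[\frac{0.1x^3}{3}\right]_2^6 + \left[\frac{0.2x^3}{3}\right]_8^{11} = 6.93 + 54.6 = 61.53$$
$$Var(X) = E(X^2) - [E(X)]^2 = 61.53 - 53.29 = 8.24$$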
__________________________________________________________________________________
1.4
Some Probability Distributions
In this section, we introduce some popular distributions for discrete and continuous
random variables. Popular distributions for discrete random variables include the binomial,
Poisson, negative binomial, hypergeometric and geometric distributions. Popular
distributions for continuous random variables include the normal, exponential, Erlang,
gamma, Weibull and lognormal distributions.
1.4.1
Binomial distribution
The binomial distribution is a discrete probability distribution. It is used when there are exactly
two mutually exclusive outcomes of a trial and these outcomes are appropriately labelled as
"success" and "failure". The binomial distribution is used to obtain the probability of
observing x successes in n trials, where the probability of success on a
single trial is denoted by $\pi$ (note that some references use p). The binomial distribution
assumes that $\pi$ is fixed for all trials.
In general, if a random variable X follows the binomial distribution with parameters n and $\pi$,
we write
$$X \sim B(n, \pi)$$
and its probability mass function is
$$P(X = x) = {}^{n}C_{x}\,\pi^{x}(1 - \pi)^{n - x}, \quad x = 0, 1, 2, \dots, n$$
where ${}^{n}C_{x} = \frac{n!}{x!\,(n - x)!}$. The mean and variance of the binomial
distribution are $n\pi$ and $n\pi(1 - \pi)$ respectively.
We can evaluate binomial probabilities using either a scientific
calculator or a statistical table. Some statistical tables provide the cumulative binomial
probabilities, $P(X \le k)$.
___________________________________________________________________________
Example
14
= 0.1631
___________________________________________________________________________
Example 15
A pewter manufacturer produces souvenir mugs. Suppose that one of the machines breaks
down and 8% of the mugs are found to be defective and cannot be sold. If 23 mugs are
selected at random, find the probability that
(a) exactly 3 mugs are defective.
(b) between 8 and 10 mugs are defective.
(c) at least 1 mug cannot be sold.
Solution
Let X represent the number of defective mugs, then $X \sim B(23, 0.08)$.
(a) $$P(X = 3) = {}^{23}C_{3}\,(0.08)^{3}(1 - 0.08)^{23 - 3} = 0.1711$$
(b) Find the answer yourself and compare it with your neighbour's answer.
(c) $$P(X \ge 1) = 1 - P(X = 0) = 1 - {}^{23}C_{0}\,(0.08)^{0}(1 - 0.08)^{23} = 1 - 0.1469 = 0.8531$$
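For readers who prefer to check binomial probabilities without a table, parts (a) and (c) of the mug example can be reproduced with `math.comb`. The function name `binomial_pmf` is an illustrative helper, not something defined in the text; part (b) is left as an exercise, as in the solution above.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 23, 0.08  # 23 mugs sampled, 8% defective

p_exactly_3 = binomial_pmf(3, n, p)          # part (a)
p_at_least_1 = 1 - binomial_pmf(0, n, p)     # part (c), via the complement

print(round(p_exactly_3, 4), round(p_at_least_1, 4))
```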
___________________________________________________________________________
1.4.2 Poisson distribution
The Poisson distribution is another discrete probability distribution. When we know the mean
number of events that occur in a certain time interval or continuum of space, the Poisson
distribution is a suitable distribution for finding the probability of exactly x occurrences in that
interval. Generally, a discrete random variable X is said to follow a Poisson distribution with
parameter $\lambda$, written as
$$X \sim Po(\lambda)$$
if its probability mass function is
$$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!} \quad \text{for } x = 0, 1, 2, \dots$$
where $\lambda$ is the mean number of events in the given time interval or continuum of space. The
intervals must be statistically independent. The Poisson distribution has expected value
$E(X) = \lambda$ and variance $Var(X) = \lambda$.
We can evaluate Poisson probabilities using either a scientific
calculator or a statistical table. Some statistical tables provide the cumulative Poisson
probabilities, $P(X \le k)$.
If $X_1 \sim Po(\lambda_1), X_2 \sim Po(\lambda_2), \dots, X_n \sim Po(\lambda_n)$ are independent, then
$$X_1 + X_2 + \dots + X_n \sim Po(\lambda_1 + \lambda_2 + \dots + \lambda_n)$$
Example 16
If $X \sim Po(2.4)$, find
(a) $P(X \le 6)$
(b) $P(X \ge 3)$
(c) $P(X < 8)$
(d) $P(X > 1)$
(e) $P(X = 4)$
Solution
(a) $P(X \le 6) = 0.9884$
(b) $P(X \ge 3) = 1 - P(X \le 2)$
= 1 - 0.5697
= 0.4303.
(c) $P(X < 8) = P(X \le 7)$
= 0.9967.
(d) $P(X > 1) = 1 - P(X \le 1)$
= 1 - 0.3084
= 0.6916.
(e) $P(X = 4) = P(X \le 4) - P(X \le 3)$
= 0.9041 - 0.7787
= 0.1254.
___________________________________________________________________________
Example 17
On average, Good Construction can build 8 units of playground during a 2-month period.
Find the probability that
(a) Good Construction can build only 3 units of playground during a 2-month period.
(b) Good Construction can build at most 10 units of playground during a 2-month period.
(c) Good Construction can build more than 20 units of playground during a 4-month period.
Solution
Let X be the number of playgrounds Good Construction can build during a 2-month period,
then $X \sim Po(8)$.
(a) $$P(X = 3) = \frac{e^{-8}\,8^{3}}{3!} = 0.0286$$
(b) $P(X \le 10) = 0.8159$
Let Y be the number of playgrounds Good Construction can build during a 4-month period,
then $Y \sim Po(16)$.
(c) $$P(Y > 20) = 1 - P(Y \le 20) = 1 - 0.8682 = 0.1318$$
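The table look-ups in the playground example can be reproduced from the Poisson formula itself. The helpers `poisson_pmf` and `poisson_cdf` below are illustrative names for a sketch using only the standard library.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam)."""
    return exp(-lam) * lam ** x / factorial(x)

def poisson_cdf(k, lam):
    """Cumulative probability P(X <= k), summing the pmf from 0 to k."""
    return sum(poisson_pmf(x, lam) for x in range(k + 1))

p_a = poisson_pmf(3, 8)          # (a) exactly 3 units in 2 months
p_b = poisson_cdf(10, 8)         # (b) at most 10 units in 2 months
p_c = 1 - poisson_cdf(20, 16)    # (c) more than 20 units in 4 months (mean 16)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```

The mean doubles to 16 in part (c) because the observation period doubles from 2 months to 4 months.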
___________________________________________________________________________
1.4.3
Negative binomial distribution
A negative binomial experiment is a statistical experiment that has the following properties:
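If p is the probability of success on an individual trial, the negative binomial probability
that the r-th success occurs on the x-th trial is
$$P(X = x) = {}^{x-1}C_{r-1}\,p^{r}(1 - p)^{x - r}$$
where
$${}^{x-1}C_{r-1} = \frac{(x - 1)!}{(r - 1)!\,(x - r)!}.$$
The mean and variance for a negative binomial random variable are
$$E(X) = \frac{r}{p}$$
and
$$Var(X) = \frac{r(1 - p)}{p^{2}}$$
respectively.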
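A short sketch of the negative binomial probability may help when attempting the tasks that follow. It assumes the usual parameterisation in which X counts the trial on which the r-th success occurs; the function name and the values r = 3, p = 0.5 are hypothetical, chosen only for illustration.

```python
from math import comb

def negative_binomial_pmf(x, r, p):
    """P(the r-th success occurs on trial x), for x = r, r + 1, ..."""
    return comb(x - 1, r - 1) * p ** r * (1 - p) ** (x - r)

r, p = 3, 0.5  # hypothetical values for illustration

# Probability that the 3rd success arrives exactly on the 5th trial.
p_third_on_fifth = negative_binomial_pmf(5, r, p)

# Truncated sums: the tail beyond 400 trials is negligible for these values.
total = sum(negative_binomial_pmf(x, r, p) for x in range(r, 400))
mean = sum(x * negative_binomial_pmf(x, r, p) for x in range(r, 400))

print(p_third_on_fifth, round(total, 6), round(mean, 6))
```

The truncated mean agrees with r/p, which is a quick sanity check on the formula.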
_________________________________________________________________________
Task 1
Suppose that a call to Sinar FM gets connected with a probability of 0.05. Assuming calls are
independent,
(a) what is the probability that the 6th call made is the second call that gets connected?
[0.0102]
(b) what is the probability that more than four calls have to be made before getting
connected?
[0.8145]
___________________________________________________________________________
Task 2
Assume that a sample of 15 components is tested every hour. Suppose X denotes the
number of components in the sample of 15 that require modification. Components are
assumed to be independent with respect to modification. If the percentage of components that
require modification remains at 1.5%, what is the probability that hour 8 is the third sample at
which X exceeds 1?
$[1.6894 \times 10^{-4}]$
___________________________________________________________________________
1.4.4
Geometric Distribution
The geometric distribution is a special case of the negative binomial distribution. It deals with
the number of trials required for a single success. Thus, the geometric distribution is a negative
binomial distribution where the number of successes (r) is equal to 1.
Definition: Suppose a negative binomial experiment consists of x trials and results
in one success. If the probability of success on an individual trial is p, then the geometric
probability is:
$$P(x; p) = p\,(1 - p)^{x - 1}$$
for $x = 1, 2, 3, \dots$ and $0 < p \le 1$.
The mean and variance for a geometric random variable are
$$E(X) = \frac{1}{p}$$
and
$$Var(X) = \frac{1 - p}{p^{2}}$$
respectively.
__________________________________________________________________________
Task 3
The probability that a computer running a certain operating system crashes on any given day
is 0.05. Find the probability that the computer crashes for the first time on the 10th day after
the operating system is installed. Find the expected number of days the computer runs before
it crashes for the first time.
[0.0315; 20]
1.4.5
Hypergeometric distribution
$$P(x; N, n, K) = \frac{{}^{K}C_{x}\;{}^{N-K}C_{n-x}}{{}^{N}C_{n}}$$
The mean and variance are
$$E(X) = np$$
and
$$Var(X) = np(1 - p)\left(\frac{N - n}{N - 1}\right)$$
respectively, where $p = K/N$.
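The hypergeometric probability and its mean and variance can be checked numerically. In the sketch below, N is the population size, K the number of "successes" in the population and n the sample size drawn without replacement; the function name and the values N = 20, K = 5, n = 4 are hypothetical and chosen only for illustration.

```python
from math import comb

def hypergeometric_pmf(x, N, n, K):
    """P(x successes in a sample of n drawn without replacement)."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

N, K, n = 20, 5, 4   # hypothetical population, successes, sample size
p = K / N

support = range(0, min(n, K) + 1)
total = sum(hypergeometric_pmf(x, N, n, K) for x in support)
mean = sum(x * hypergeometric_pmf(x, N, n, K) for x in support)
variance = sum((x - mean) ** 2 * hypergeometric_pmf(x, N, n, K)
               for x in support)

# Compare with np and np(1 - p)(N - n)/(N - 1).
print(round(mean, 6), n * p)
print(round(variance, 6), round(n * p * (1 - p) * (N - n) / (N - 1), 6))
```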
__________________________________________________________________________
Task 4
A company employs 500 men under the age of 58. Suppose that 25% carry a marker on a
male chromosome that indicates an increased risk of high blood pressure.
a. If 20 men in the company are tested for the marker in this chromosome, what is the
probability that exactly half of them have the marker?
[0.0089]
b. If 15 men in the company are tested for the marker in this chromosome, what is the
probability that more than 1 has the marker?
[0.9229]
___________________________________________________________________________
1.4.6 Normal distribution
The normal distribution is the most important continuous distribution in statistics because
normality arises naturally in many physical, biological and social measurement situations. It
is also called the Gaussian distribution, after Gauss, who found the probability
density function (pdf) for the normal distribution. The pdf of a normal random variable X is
symmetric, bell-shaped and asymptotically approaches 0 as x goes to $-\infty$ or $\infty$.
A continuous random variable X with probability density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty$$
is said to follow a normal distribution with mean $\mu$ and variance $\sigma^2$, written as
$$X \sim N(\mu, \sigma^2)$$
Since the integration for finding probabilities from this probability density function is
non-trivial, we transform X into a standard normal variable Z, which has
mean 0 and variance 1. The transformation is done using the following formula:
$$Z = \frac{X - \mu}{\sigma}$$
We can evaluate standard normal probabilities using either a
scientific calculator or a statistical table. A statistical table typically provides two types of
tables for the standard normal distribution:
(i) a table that shows the probabilities for a standard normal distribution in the form
of $P(0 \le Z \le z)$, that is, the area under the standard normal curve between 0
and positive z values.
(ii) a table that shows the z values for which $P(Z > z) = \alpha$, where $\alpha$ is the upper tail area of
the standard normal distribution and $\alpha \le 0.5$.
Some properties of the normal distribution, for independent $X \sim N(\mu_x, \sigma_x^2)$ and $Y \sim N(\mu_y, \sigma_y^2)$ and a constant k:
(a) $kX \sim N(k\mu_x, k^2\sigma_x^2)$
(b) $X + Y \sim N(\mu_x + \mu_y, \sigma_x^2 + \sigma_y^2)$
(c) $X - Y \sim N(\mu_x - \mu_y, \sigma_x^2 + \sigma_y^2)$
___________________________________________________________________________
Example 18
The lifetime of a ROAD tyre is normally distributed with mean 24000 km and standard
deviation 4000 km.
(a) Find the probability that the lifetime of a ROAD tyre exceeds 27000 km.
(b) Find the probability that the lifetime of a ROAD tyre is between 22500 km and
26500 km.
(c) If the 10% of ROAD tyres with the shortest lifetimes are classed as having a low lifetime,
find the maximum distance such a tyre can achieve.
Solution
Let X represent the lifetime of a ROAD tyre, then $X \sim N(24000, 4000^2)$.
(a) $$P(X > 27000) = P\left(Z > \frac{27000 - 24000}{4000}\right)$$
= P(Z > 0.75)
= 0.5 - 0.2734
= 0.2266
(b) $$P(22500 < X < 26500) = P\left(\frac{22500 - 24000}{4000} < Z < \frac{26500 - 24000}{4000}\right)$$
= P(-0.375 < Z < 0.625)
= 0.2357 + 0.148
= 0.3837
(c) Let x be the maximum distance specified; then the question implies $P(X < x) = 0.1$,
which is equivalent to $P(Z < -z_{0.1}) = 0.1$. From the table,
$z_{0.1} = 1.2816$
Thus,
$$-1.2816 = \frac{x - 24000}{4000}$$
Hence,
x = 24000 - 1.2816(4000)
x = 18873.6 km.
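The tyre-lifetime calculations can be cross-checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), which removes the need to standardise by hand. The variable names are illustrative only.

```python
from statistics import NormalDist

# Lifetime of a ROAD tyre: mean 24000 km, standard deviation 4000 km.
lifetime = NormalDist(mu=24000, sigma=4000)

p_exceeds_27000 = 1 - lifetime.cdf(27000)                     # part (a)
p_between = lifetime.cdf(26500) - lifetime.cdf(22500)         # part (b)
lowest_10_percent_cutoff = lifetime.inv_cdf(0.10)             # part (c)

print(round(p_exceeds_27000, 4))
print(round(p_between, 4))
print(round(lowest_10_percent_cutoff, 1))
```

The value computed for part (b) is slightly smaller than the table-based 0.3837, because the table look-up rounds the z values 0.375 and 0.625 to two decimal places before reading off the areas.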
___________________________________________________________________________
Example 19
The Cooper test time for a football player from Team A is normally distributed with mean 660
seconds and standard deviation 45 seconds. The Cooper test time for a football player from Team B
is normally distributed with mean 690 seconds and standard deviation 25 seconds. A player is
selected at random.
(a) What is the probability that a player from Team A completes the test in less than 700
seconds?
(b) What is the probability that the time set by a Team A player is better (shorter) than the time
set by a Team B player?
Solution
Let X represent the time set by a Team A player, $X \sim N(660, 45^2)$, and let Y represent the time set
by a Team B player, $Y \sim N(690, 25^2)$.
(a) $$P(X < 700) = P\left(Z < \frac{700 - 660}{45}\right)$$
= P(Z < 0.89)
= 0.5 + 0.3133
= 0.8133
(b) $$P(X < Y) = P(X - Y < 0) = P\left(Z < \frac{0 - (660 - 690)}{\sqrt{45^2 + 25^2}}\right)$$
= P(Z < 0.58)
= 0.5 + 0.2190
= 0.7190
___________________________________________________________________________
Task 5
A manufacturer produces bathroom tiles. The tiles are sold in boxes containing 25 tiles each.
The probability that a piece of tile from a box is defective is 0.1. A box is selected at random.
[ 0.0178 ]
[ 0.0001 ]
[ 0.0095 ]
(b) An interior decorating company purchases 10 boxes of tiles from the manufacturer. What
is the probability that at least two of the boxes contain perfect tiles?
[ 0.1581 ]
__________________________________________________________________________
Task 7
In the 2006 World Cup tournament, the weight of the balls used is normally distributed with
mean weight 435 grams and standard deviation 10 grams. A ball is selected at random.
(a) What is the probability that the weight is between 400 grams and 450 grams?
[0.933]
(b) What is the probability that the weight is more than 460 grams?
[0.0062]
(c) If 10% of the balls are considered heavy, what is the minimum weight of a ball
in that category?
[447.816 grams]
___________________________________________________________________________
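1.4.7 Exponential distribution
A continuous random variable $X$ follows an exponential distribution with parameter $\lambda$ if its probability density function is
$$f(x) = \lambda e^{-\lambda x}$$
for $0 \le x < \infty$ and $\lambda > 0$. The parameter $\lambda$ is also called a rate parameter, whereas $1/\lambda$ is a scale
parameter. The mean and variance of $X$ are
$$E(X) = \frac{1}{\lambda} \quad \text{and} \quad \mathrm{Var}(X) = \frac{1}{\lambda^2}$$
respectively. The cumulative distribution function is
$$F(x) = P(X \le x) = \begin{cases} 1 - e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
The figure below shows exponential probability density functions for different values of $\lambda$.
It can be seen from the figure that all of the pdfs are monotonically decreasing.
[Figure: exponential pdfs for $\lambda = 0.5$, $\lambda = 1$ and $\lambda = 1.5$]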
___________________________________________________________________________
Example 20
The time between phone calls received by a telephonist is exponentially distributed with a
mean of 10 minutes.
a. What is the probability that there are no calls in one hour?
[Ans: 0.0025 ]
b. What is the probability that there are not more than four calls within one hour? [ 0.2851]
c. Determine x such that the probability that there are no calls within x minutes is 0.02.
[39.12 minutes]
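The bracketed answers to the telephonist example can be checked numerically. The sketch below is only an illustration in Python (the text itself does not prescribe any software). It assumes that exponential gaps with mean 10 minutes imply a Poisson count with mean 6 calls per hour; the helper names `prob_no_call` and `poisson_cdf` are made up for this example.

```python
import math

mean_gap_min = 10.0                 # mean time between calls, in minutes
rate_per_min = 1.0 / mean_gap_min   # rate parameter lambda, calls per minute


def prob_no_call(minutes):
    """P(no call in the given number of minutes) = exp(-lambda * t)."""
    return math.exp(-rate_per_min * minutes)


def poisson_cdf(k, mu):
    """P(N <= k) for a Poisson count N with mean mu."""
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))


# a. no calls in one hour (60 minutes)
p_a = prob_no_call(60)

# b. at most four calls in one hour; the hourly count is Poisson with mean 6
p_b = poisson_cdf(4, rate_per_min * 60)

# c. time x (in minutes) with P(no call within x) = 0.02, i.e. x = -ln(0.02) / lambda
x_c = -math.log(0.02) / rate_per_min

print(round(p_a, 4), round(p_b, 4), round(x_c, 2))
```

Part c works in minutes because the rate is expressed per minute, which matches the unit of the bracketed answer.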
__________________________________________________________________________
An important property of the exponential distribution is that it is memoryless, which means
that if a random variable $X$ is exponentially distributed, its conditional probability satisfies
$$P(X > x_1 + x_2 \mid X > x_1) = P(X > x_2) \quad \text{for all } x_1, x_2 \ge 0.$$
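The memoryless property can be verified numerically from the survival function $P(X > x) = e^{-\lambda x}$. The following Python sketch uses illustrative values (a rate of 4 per minute, $x_1 = 2$ and $x_2 = 1$) that are assumptions chosen for the demonstration, not part of the definition.

```python
import math


def survival(x, rate):
    """P(X > x) for an exponential random variable with the given rate."""
    return math.exp(-rate * x)


rate = 4.0          # illustrative rate parameter (events per minute)
x1, x2 = 2.0, 1.0   # time already elapsed, and additional waiting time

# P(X > x1 + x2 | X > x1) = P(X > x1 + x2) / P(X > x1)
conditional = survival(x1 + x2, rate) / survival(x1, rate)

# P(X > x2): the same wait with no knowledge of the elapsed time
unconditional = survival(x2, rate)

print(conditional, unconditional)
```

The two printed values agree up to floating-point rounding, so the time already waited does not change the distribution of the remaining wait.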
________________________________________________________________________
Task 9
The number of hits on a website follows a Poisson process with a rate of four per minute.
a. What is the probability that more than two minutes go by without a hit?
[ $3.35 \times 10^{-4}$ ]
b. If two minutes have gone by without a hit, what is the probability that a hit will occur in
the next minute?
[ 0.9817]
___________________________________________________________________________
1.4.8
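Other distributions for continuous random variables include the Erlang, Gamma, Weibull and
log-normal distributions. Unlike the normal distribution, these distributions assume that the
variables are strictly non-negative. The probability density functions, means and variances of these
distributions are listed below:
1. Erlang
$$f(x) = \frac{\lambda^r x^{r-1} e^{-\lambda x}}{(r-1)!} \quad \text{for } x > 0 \text{ and } r = 1, 2, \ldots$$
$$E(X) = \frac{r}{\lambda}, \qquad \mathrm{Var}(X) = \frac{r}{\lambda^2}$$
Note: If $r = 1$, then the Erlang distribution is simply an exponential distribution.
2. Gamma
$$f(x) = \frac{\lambda^r x^{r-1} e^{-\lambda x}}{\Gamma(r)} \quad \text{for } x > 0 \text{ and } r > 0$$
$$E(X) = \frac{r}{\lambda}, \qquad \mathrm{Var}(X) = \frac{r}{\lambda^2}$$
3. Weibull
$$f(x) = \frac{\beta}{\delta}\left(\frac{x}{\delta}\right)^{\beta - 1} \exp\left[-\left(\frac{x}{\delta}\right)^{\beta}\right] \quad \text{for } x > 0,\ \beta > 0 \text{ and } \delta > 0$$
Note: $\beta$ and $\delta$ are the shape and scale parameters, respectively.
$$E(X) = \delta\,\Gamma\left(1 + \frac{1}{\beta}\right), \qquad \mathrm{Var}(X) = \delta^2\,\Gamma\left(1 + \frac{2}{\beta}\right) - \delta^2\left[\Gamma\left(1 + \frac{1}{\beta}\right)\right]^2$$
4. Lognormal
$$f(x) = \frac{1}{x\,\omega\sqrt{2\pi}} \exp\left[-\frac{(\ln x - \theta)^2}{2\omega^2}\right] \quad \text{for } x > 0$$
$$E(X) = e^{\theta + \omega^2/2}, \qquad \mathrm{Var}(X) = e^{2\theta + \omega^2}\left(e^{\omega^2} - 1\right)$$
The shape of the above distributions for varying values of their parameters can be
investigated via computer software such as Matlab. Further information and examples for
these distributions can be found in Montgomery & Runger (2006).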
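As a minimal sketch of such an investigation, the density functions above can be coded directly. Python is used here purely as an illustrative alternative to Matlab. The function names, the parameter values and the simple midpoint-rule integrator are assumptions made for this example.

```python
import math


def gamma_pdf(x, r, lam):
    """Gamma density; for integer r this is the Erlang density."""
    return lam**r * x**(r - 1) * math.exp(-lam * x) / math.gamma(r)


def weibull_pdf(x, beta, delta):
    """Weibull density with shape beta and scale delta."""
    return (beta / delta) * (x / delta)**(beta - 1) * math.exp(-(x / delta)**beta)


def lognormal_pdf(x, theta, omega):
    """Lognormal density with parameters theta and omega."""
    z = (math.log(x) - theta) / omega
    return math.exp(-0.5 * z * z) / (x * omega * math.sqrt(2 * math.pi))


def integrate(pdf, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of pdf over (0, upper)."""
    h = upper / steps
    return sum(pdf((i + 0.5) * h) for i in range(steps)) * h


# Each density should integrate to approximately 1 over a wide enough range.
area_gamma = integrate(lambda x: gamma_pdf(x, 3, 2.0), 40.0)
area_weibull = integrate(lambda x: weibull_pdf(x, 2.0, 1.5), 40.0)
area_lognormal = integrate(lambda x: lognormal_pdf(x, 0.0, 0.5), 40.0)

# With r = 1 the Erlang/Gamma density reduces to the exponential density.
erlang_r1 = gamma_pdf(0.7, 1, 2.0)
exponential = 2.0 * math.exp(-2.0 * 0.7)

print(area_gamma, area_weibull, area_lognormal, erlang_r1, exponential)
```

Evaluating any of these functions over a grid of x values for several parameter settings gives the points needed to plot and compare the shapes.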
__________________________________________________________________________
Exercise 1
1. Identify whether the following items are constants or variables. If an item is a variable, determine
whether it is quantitative or qualitative, and discrete or continuous.
(a) The number of days in March.
(b) IC numbers of Malaysian citizens.
(c) The time taken to write an essay.
(d) The type of cars used by employees of a company.
(e) Temperature for each day in a month.
(f) Minimum age to take a driving licence.
(g) The lengths of a specific type of brick.
(h) The compressive strengths of 100 aluminium-lithium alloy specimens.
(i) The number of students registering for Engineering Statistics in the last five
academic years.
(j) The breakdown time of an insulating fluid between electrodes.
(k) The grades achieved by engineering students in UTM.
2. A motor company has 18 used cars and 11 of them are accident-free. For an accident-free
car, the probability that the alarm system is not functioning is 0.3, and for a car that is not accident-free,
the probability that the alarm system is not functioning is 0.6.
(a) Mr. Ahmad wants to buy 2 used cars for his children. Find the probability that both cars are
accident-free.
(b) Mr. Osman also wants to buy 2 used cars. What is the probability that exactly one of them
is not accident-free?
(c) What is the probability that Miss Ani buys a car that is accident-free and whose alarm
system is working?
(d) Ali wants to buy a used car. What is the probability that its alarm system is not
functioning?
(e) The alarm system of a used car bought by Madam Sheely is not functioning. What is
the probability that the car is accident-free?
$$f(x) = \begin{cases} k, & 0 \le x < 1 \\ \dfrac{x}{4}, & 1 \le x \le 2 \\ 0, & \text{elsewhere} \end{cases}$$
(a) Show that $k = \dfrac{5}{8}$.
15, 27, 23, 18, 50, 33, 36, 48, 25, 19, 29, 22, 42, 15
Use a scientific calculator to determine the mean and variance of the above data. Now
assume that the data are sample data selected at random. Find the new mean and variance
of the data. Comment on your answers.
8. Twenty-five computer chips were tested, and the probability that any chip is
contaminated is 0.15. Find the probability that
A supplier delivers ten boxes, each containing 25 chips, to a customer. What is the probability
that the customer will receive at least two boxes containing at most two contaminated chips
each?
Chapter 2
Sampling Distributions
Learning Objectives:
At the end of this chapter, students should be able to
(a) understand the concepts of sample mean and proportion.
(b) understand and use the central limit theorem.
(c) compute and interpret the sample mean and proportion.
(d) explain the important role of normal distributions as sampling distributions.
(e) calculate the probabilities associated with sample mean and sample proportion.
2.1 Introduction
variance is referred to as a statistic. There are many statistics that we can use, which include
the mean, median, mode, standard deviation and so on. One reason we sample is so that we
can get an estimate for an unknown parameter of the population we sample from.
When a sample of size n is chosen from a population and a statistic (mean, standard
deviation, etc.) is measured, the resulting probability distribution of that statistic is the sampling distribution. For
example, if the statistic is the sample mean, $\bar{x}$, of samples of size eight, then the sampling
distribution is the probability distribution of the sample mean, $\bar{X}$. It lists the various values
A very important and useful concept in statistics is the Central Limit Theorem (CLT).
The CLT says that if a large enough sample is drawn from a population, then the
distribution of the sample mean is approximately normal, regardless of the type of
distribution of the population from which the sample was drawn.
The Central Limit Theorem states that
1. the mean of the sampling distribution of means is the same as the population mean,
2. the variance of the sampling distribution of means is the same as the population variance
divided by the size of the sample, and
3. if the population from which the sample is taken is normally distributed, then the sampling
distribution of means will also be normal. If the population is not normally distributed, then
the sampling distribution of means will be approximately normally distributed as the sample
size gets larger, usually when $n \ge 30$.
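The first two statements of the theorem can be illustrated with a small simulation. The Python sketch below rests on assumptions chosen for the demonstration: an exponential population with rate 1 (so $\mu = 1$ and $\sigma^2 = 1$), samples of size 30, 5000 repetitions and a fixed seed. None of these values come from the text.

```python
import random
import statistics

rng = random.Random(2015)   # fixed seed so the run is repeatable
n = 30                      # size of each sample
replications = 5000         # number of samples drawn

# Population: exponential with rate 1, a clearly non-normal distribution
# with mean 1 and variance 1.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(replications)
]

mean_of_means = statistics.fmean(sample_means)
variance_of_means = statistics.variance(sample_means)

print("mean of sample means    :", round(mean_of_means, 3))
print("variance of sample means:", round(variance_of_means, 4))
print("sigma^2 / n             :", round(1.0 / n, 4))
```

The mean of the simulated sample means should lie close to the population mean, and their variance close to $\sigma^2/n$. A histogram of `sample_means` would show the roughly bell-shaped curve described in the third statement.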
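2.3
The sample mean, $\bar{X}$, is the best estimator of the population mean, $\mu$. Suppose we have a
set of independent random variables $X_1, X_2, \ldots, X_n$ where $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. The sample mean is
$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
and the sample variance is
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2.$$
The probability distribution of the sample mean, $\bar{X}$, is called the sampling distribution
of $\bar{X}$.
The mean and variance of $\bar{X}$ are denoted as $\mu_{\bar{X}}$ and $\sigma^2_{\bar{X}}$:
$$\mu_{\bar{X}} = E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}(n\mu) = \mu$$
$$\sigma^2_{\bar{X}} = \mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{1}{n^2}\left(n\sigma^2\right) = \frac{\sigma^2}{n}.$$
The sampling distribution of the sample mean is expressed as $\bar{X} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)$. The
standardized variable
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
follows a standard normal distribution. If the population is normally distributed, the sampling
distribution of the mean is normal regardless of the sample size. If the population distribution is unknown or not
normal, then by the central limit theorem the sampling distribution of the sample mean is
approximately normal when $n \ge 30$.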
___________________________________________________________________________
Example 1
A certain type of thread is manufactured with a mean tensile strength of 77.3 kg and a
standard deviation of 6.4 kg. Assuming that the tensile strength follows a normal distribution,
find the probability that the mean tensile strength of a random sample of 40 such threads is
more than 75 kg.
Solution
Now $n = 40$ and $\bar{X} \sim N\left(77.3, \dfrac{6.4^2}{40}\right)$, therefore
$P(\bar{X} > 75) = P\left(Z > \dfrac{75 - 77.3}{\sqrt{6.4^2/40}}\right)$
$= P(Z > -2.27)$
$= 0.5 + 0.4884$
$= 0.9884$
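The thread example can be checked without a normal table. This is a minimal sketch in Python using `statistics.NormalDist` from the standard library (Python 3.8 or later); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 77.3, 6.4, 40

# Sampling distribution of the mean: normal with mean mu and
# standard deviation sigma / sqrt(n).
sampling_dist = NormalDist(mu, sigma / sqrt(n))

# P(sample mean > 75) = 1 - P(sample mean <= 75)
p_greater_75 = 1 - sampling_dist.cdf(75)

print(round(p_greater_75, 4))
```

The value agrees with the table-based answer 0.9884 to about three decimal places. The small difference arises because the table method rounds z to −2.27 before the lookup.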
_________________________________________________________________________
Example
The number of customers arriving per hour at a certain automobile service facility is assumed
to follow a Poisson distribution with mean 12. If a random sample of 36 hours is taken,
what is the probability that the mean number of customers in an hour is less than 10?
Solution
Given $X \sim Po(12)$. Therefore, by the CLT,
$\bar{X} \sim N\left(12, \dfrac{12}{36}\right)$
$P(\bar{X} < 10) = P\left(Z < \dfrac{10 - 12}{\sqrt{12/36}}\right)$
$= P(Z < -3.46)$
$= 0.5 - 0.4997$
$= 0.0003$
___________________________________________________________________________
Task 1
The average life of a washing machine is 12 years with a standard deviation of 2 years.
Assuming that the lives of these machines follow approximately a normal distribution, find
(a) the probability that the mean life of a random sample of 12 machines is greater than 10
years.
[ 0.9997 ]
(b) the probability that the mean life of a random sample of 9 machines falls between 9.4 and
12.2 years.
[ 0.6179 ]
___________________________________________________________________________
Task 2
A random sample of size 35 is taken from a population which has a binomial distribution with
the number of trials 50 and the proportion of success 0.30. What is the probability that the
sample mean is at least 13.5?
[ 0.9969 ]
__________________________________________________________________________
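2.4 Sampling Distribution of $\bar{X}_1 - \bar{X}_2$
Suppose we have two independent populations, both normally distributed. Let the first
population have mean $\mu_1$ and variance $\sigma_1^2$, and the second population have mean $\mu_2$ and
variance $\sigma_2^2$.
If $\bar{X}_1$ and $\bar{X}_2$ are the sample means of two independent random samples of sizes $n_1$ and
$n_2$, then
$$\bar{X}_1 \sim N\left(\mu_1, \frac{\sigma_1^2}{n_1}\right) \quad \text{and} \quad \bar{X}_2 \sim N\left(\mu_2, \frac{\sigma_2^2}{n_2}\right).$$
The sampling distribution of $\bar{X}_1 - \bar{X}_2$ has mean
$$\mu_{\bar{X}_1 - \bar{X}_2} = E(\bar{X}_1 - \bar{X}_2) = E(\bar{X}_1) - E(\bar{X}_2) = \mu_1 - \mu_2$$
and variance
$$\sigma^2_{\bar{X}_1 - \bar{X}_2} = \mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \mathrm{Var}(\bar{X}_1) + (-1)^2\,\mathrm{Var}(\bar{X}_2) = \mathrm{Var}(\bar{X}_1) + \mathrm{Var}(\bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2},$$
thus,
$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$$
with
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}.$$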
Example 3
A random sample of size 18 is selected from a normal population with a mean of 85 and a
standard deviation of 8. A second random sample of size 10 is taken from another normal
population with a mean of 80 and a standard deviation of 5. Let $\bar{X}_1$ and $\bar{X}_2$ be the two sample
means. Find the probability
(a) that $\bar{X}_1$ is greater than $\bar{X}_2$.
(b) that the difference between the sample means is less than 6.
(c) that the difference between the means is more than 4.
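Solution
We know that $\bar{X}_1 \sim N\left(85, \dfrac{8^2}{18}\right)$ and $\bar{X}_2 \sim N\left(80, \dfrac{5^2}{10}\right)$, therefore
$\bar{X}_1 - \bar{X}_2 \sim N\left(85 - 80, \dfrac{8^2}{18} + \dfrac{5^2}{10}\right)$
$\bar{X}_1 - \bar{X}_2 \sim N(5, 6.0556)$
(a) The probability that $\bar{X}_1$ is greater than $\bar{X}_2$ is
$P(\bar{X}_1 > \bar{X}_2) = P(\bar{X}_1 - \bar{X}_2 > 0)$
$= P\left(Z > \dfrac{0 - 5}{\sqrt{6.0556}}\right)$
$= P(Z > -2.03)$
$= 0.5 + P(0 < Z < 2.03)$
$= 0.5 + 0.4788$
$= 0.9788$
(b) The probability that the difference between the sample means is less than 6 is
$P(|\bar{X}_1 - \bar{X}_2| < 6) = P(-6 < \bar{X}_1 - \bar{X}_2 < 6)$
$= P\left(\dfrac{-6 - 5}{\sqrt{6.0556}} < Z < \dfrac{6 - 5}{\sqrt{6.0556}}\right)$
$= P(-4.47 < Z < 0.41)$
(c) The probability that the difference between the means is more than 4 is
$P(|\bar{X}_1 - \bar{X}_2| > 4) = P(\bar{X}_1 - \bar{X}_2 > 4) + P(\bar{X}_1 - \bar{X}_2 < -4)$
$= P\left(Z > \dfrac{4 - 5}{\sqrt{6.0556}}\right) + P\left(Z < \dfrac{-4 - 5}{\sqrt{6.0556}}\right)$
$= P(Z > -0.4) + P(Z < -3.66)$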
Example 4
A random sample of size 49 is taken from a binomial distribution with n = 60 and p = 0.4.
Another random sample of size 32 is taken from another binomial distribution with n = 60
and p = 0.4. Find the probability that the difference between the two sample means is less
than 1.
Solution
Given $X_1 \sim B(60, 0.4)$ and $X_2 \sim B(60, 0.4)$.
Therefore, by the CLT, $\bar{X}_1 \sim N\left(24, \dfrac{14.4}{49}\right)$ and $\bar{X}_2 \sim N\left(24, \dfrac{14.4}{32}\right)$.
Hence, $\bar{X}_1 - \bar{X}_2 \sim N\left(24 - 24, \dfrac{14.4}{49} + \dfrac{14.4}{32}\right)$,
$\bar{X}_1 - \bar{X}_2 \sim N(0, 0.7438)$
$P(|\bar{X}_1 - \bar{X}_2| < 1) = P(-1 < \bar{X}_1 - \bar{X}_2 < 1)$
$= P\left(\dfrac{-1 - 0}{\sqrt{0.7438}} < Z < \dfrac{1 - 0}{\sqrt{0.7438}}\right)$
$= P(-1.16 < Z < 1.16)$
$= 0.7540$
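The worked examples on the difference between two sample means all follow the same pattern, which can be captured in a small helper. This is a sketch in Python; `diff_of_means` is a hypothetical name introduced here, and the inputs are the population means, population variances and sample sizes from the two examples above (samples of sizes 18 and 10 from normal populations, and samples of sizes 49 and 32 from B(60, 0.4) populations).

```python
from math import sqrt
from statistics import NormalDist


def diff_of_means(mu1, var1, n1, mu2, var2, n2):
    """Sampling distribution of the difference between two sample means."""
    return NormalDist(mu1 - mu2, sqrt(var1 / n1 + var2 / n2))


# Normal populations: N(85, 8^2) with n = 18 and N(80, 5^2) with n = 10.
d_normal = diff_of_means(85, 8**2, 18, 80, 5**2, 10)
p_first_larger = 1 - d_normal.cdf(0)          # P(first mean > second mean)

# Binomial populations B(60, 0.4): mean 24, variance 60(0.4)(0.6) = 14.4.
d_binomial = diff_of_means(24, 14.4, 49, 24, 14.4, 32)
p_within_one = d_binomial.cdf(1) - d_binomial.cdf(-1)   # P(|difference| < 1)

print(round(p_first_larger, 4), round(p_within_one, 4))
```

Both results match the table-based answers (0.9788 and 0.7540) to within the rounding of z to two decimal places.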
___________________________________________________________________________
Task 3
A random sample of size 30 is taken from a population which follows a Poisson
distribution with mean 54. Another random sample of size 32 is taken from a Poisson
distribution with mean 58. What is the probability that the difference between the means is
less than 2?
[ 0.1461 ]
___________________________________________________________________________
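2.5
The concept of proportion is the same as the concept of probability of success in a binomial
experiment. The probability of success in a binomial experiment represents the proportion of
the sample or population that possesses a given characteristic.
The population proportion, denoted by $\pi$, is obtained by taking the ratio of the number of
elements in a population with a specific characteristic to the total number of elements in the
population. The sample proportion, denoted by $p$, gives a similar ratio for a sample:
$$\pi = \frac{X}{N} \quad \text{and} \quad p = \frac{x}{n}$$
where $N$ and $n$ are the total numbers of elements in the population and in the sample,
and $\pi$ is a proportion of successes and not the constant 3.14159... . Each sample will give a different value
of $p$; therefore the sample proportion is a random variable, symbolized as $P$.
To determine the reliability of the estimator, $P$, we need to know its sampling distribution.
When samples of size $n$ are drawn from this population, each sample contains a certain number
of observations with the characteristic of interest. The Central Limit Theorem (CLT) tells
us that the relative frequency distribution of the sample mean for any population is
approximately normal for sufficiently large samples ($n \ge 30$).
Sampling Distribution of P
1. Mean of the Sample Proportion
The mean of the sample proportion, $P$, is denoted by $\mu_P$ and is equal to the population
proportion, $\pi$.
$$\mu_P = E(P) = E\left(\frac{X}{n}\right) = \frac{1}{n}E(X) = \frac{1}{n}(n\pi) = \pi$$
where $X \sim B(n, \pi)$ is the number of successes in the sample.
2. Variance of the Sample Proportion
$$\sigma_P^2 = \mathrm{Var}(P) = \mathrm{Var}\left(\frac{X}{n}\right) = \frac{1}{n^2}\mathrm{Var}(X) = \frac{1}{n^2}\,n\pi(1 - \pi) = \frac{\pi(1 - \pi)}{n}$$
Therefore
$$P \sim N\left(\pi, \frac{\pi(1 - \pi)}{n}\right)$$
with
$$Z = \frac{P - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}}.$$
Continuity correction (c.c.) for the sample proportion:
(a) $P(P = p) \xrightarrow{\text{c.c.}} P\left(p - \dfrac{1}{2n} < P < p + \dfrac{1}{2n}\right)$
(b) $P(P > p) \xrightarrow{\text{c.c.}} P\left(P > p + \dfrac{1}{2n}\right)$
(c) $P(P \ge p) \xrightarrow{\text{c.c.}} P\left(P > p - \dfrac{1}{2n}\right)$
(d) $P(P < p) \xrightarrow{\text{c.c.}} P\left(P < p - \dfrac{1}{2n}\right)$
(e) $P(P \le p) \xrightarrow{\text{c.c.}} P\left(P < p + \dfrac{1}{2n}\right)$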
Example 5
A manufacturer claims that 75% of metal rods have a diameter within the specification.
A random sample of 50 metal rods is chosen. Find the probability that
(a) at least 70% of the metal rods have a diameter within the specification.
(b) between 78% and 82% of the metal rods have a diameter within the specification.
(c) more than 90% of the metal rods have a diameter within the specification.
Solution
$\pi = 0.75$
$\dfrac{\pi(1 - \pi)}{n} = \dfrac{0.75(1 - 0.75)}{50} = 0.00375$
$P \sim N(0.75, 0.00375)$
(a) The probability that at least 70% of the metal rods have a diameter within the specification is
$P(P \ge 0.70) \xrightarrow{\text{c.c.}} P\left(P > 0.70 - \dfrac{1}{2(50)}\right)$
$= P(P > 0.69)$
$= P\left(Z > \dfrac{0.69 - 0.75}{\sqrt{0.00375}}\right)$
(b) The probability that between 78% and 82% of the metal rods have a diameter within the
specification is
$P(0.78 < P < 0.82) \xrightarrow{\text{c.c.}} P\left(0.78 + \dfrac{1}{2(50)} < P < 0.82 - \dfrac{1}{2(50)}\right)$
$= P(0.79 < P < 0.81)$
$= P\left(\dfrac{0.79 - 0.75}{\sqrt{0.00375}} < Z < \dfrac{0.81 - 0.75}{\sqrt{0.00375}}\right)$
$= P(0.65 < Z < 0.98)$
$= P(0 < Z < 0.98) - P(0 < Z < 0.65)$
$= 0.3365 - 0.2422$
$= 0.0943$
(c) The probability that more than 90% of the metal rods have a diameter within the specification is
$P(P > 0.90) \xrightarrow{\text{c.c.}} P\left(P > 0.90 + \dfrac{1}{2(50)}\right)$
$= P(P > 0.91)$
$= P\left(Z > \dfrac{0.91 - 0.75}{\sqrt{0.00375}}\right)$
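The metal rod example can be reproduced numerically, including the continuity correction of $1/(2n)$. The sketch below uses Python's `statistics.NormalDist`; the variable names (`pi_`, `cc`, `P`) are chosen for this illustration only.

```python
from math import sqrt
from statistics import NormalDist

pi_, n = 0.75, 50        # claimed population proportion and sample size
cc = 1 / (2 * n)         # continuity correction, 1/(2n) = 0.01

# Sampling distribution of the sample proportion P
P = NormalDist(pi_, sqrt(pi_ * (1 - pi_) / n))

# (a) at least 70%:        P(P >= 0.70) -> P(P > 0.70 - cc)
p_a = 1 - P.cdf(0.70 - cc)

# (b) between 78% and 82%: P(0.78 < P < 0.82) -> P(0.78 + cc < P < 0.82 - cc)
p_b = P.cdf(0.82 - cc) - P.cdf(0.78 + cc)

# (c) more than 90%:       P(P > 0.90) -> P(P > 0.90 + cc)
p_c = 1 - P.cdf(0.90 + cc)

print(p_a, p_b, p_c)
```

The value for part (b) differs from the table-based 0.0943 only in the third decimal place, because the table method rounds each z value to two decimals before the lookup.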
__________________________________________________________________________
Task 5
30% of the pipes in a chemical plant showed signs of serious corrosion. A survey was done and a
random sample of 100 pipes in the chemical plant was selected. Find the probability that
(a) more than 35% of the pipes showed signs of serious corrosion.
[ 0.1151 ]
(b) from 20% to 30% of the pipes showed signs of serious corrosion.
[ 0.5328 ]
Task 6
From a survey, we found that 90% of automobiles will not be rejected because of machine
failure. A random sample of 50 automobiles was selected. What is the probability that
(a) not less than 92% of the automobiles will not be rejected because of machine failure?
[ 0.4052 ]
(b) between 88% and 92% of the automobiles will not be rejected because of machine
failure?
[ 0.1896 ]
__________________________________________________________________________
Task 7
$\dfrac{3}{100}$ of the rubber cushions will be rejected. A manufacturer is not
satisfied with the results and conducts a survey. Among a sample of 100 rubber
cushions, find the probability that the
(a) proportion of rubber cushions rejected exceeds 0.04.
(b) proportion of rubber cushions rejected is not more than 0.05.
___________________________________________________________________________
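2.6
Suppose we have two binomial populations with proportions of successes $\pi_1$ and $\pi_2$, and
random samples of sizes $n_1$ and $n_2$ are taken from population 1 and population 2,
respectively. Then $P_1$ and $P_2$ are the proportions from those samples. By the CLT, provided
both $n_1$ and $n_2$ are large ($n_1 \ge 30$ and $n_2 \ge 30$), the sampling distributions of $P_1$ and $P_2$ are
$$P_1 \sim N\left(\pi_1, \frac{\pi_1(1 - \pi_1)}{n_1}\right), \qquad P_2 \sim N\left(\pi_2, \frac{\pi_2(1 - \pi_2)}{n_2}\right)$$
$$\mu_{P_1 - P_2} = E(P_1 - P_2) = E(P_1) - E(P_2) = \pi_1 - \pi_2$$
$$\sigma^2_{P_1 - P_2} = \mathrm{Var}(P_1) + \mathrm{Var}(P_2) = \frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}$$
The sampling distribution of the difference between two proportions, $P_1 - P_2$, has mean
$\pi_1 - \pi_2$
and variance
$\dfrac{\pi_1(1 - \pi_1)}{n_1} + \dfrac{\pi_2(1 - \pi_2)}{n_2}$, so that
$$P_1 - P_2 \sim N\left(\pi_1 - \pi_2, \frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}\right)$$
with
$$Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\pi_1(1 - \pi_1)}{n_1} + \dfrac{\pi_2(1 - \pi_2)}{n_2}}}.$$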
___________________________________________________________________________
Example 6
Two companies, M Chip and N Chip, produce micro computer chips and supply them to
company ACERA. 25% of the micro computer chips produced by Company M Chip and 20%
of the micro computer chips produced by Company N Chip are defective. 100 samples are
randomly chosen from each company. Find the probability that
(a) the sample proportion of defective micro computer chips produced by Company
M Chip is greater than the sample proportion of defective micro computer chips
produced by Company N Chip.
(b) the sample proportions of defective micro computer chips differ by at least 6%.
(c) the difference between the sample proportion of defective micro computer chips
produced by Company M Chip and the sample proportion of defective micro computer chips
produced by Company N Chip is at most 4%.
Solution
$P_M \sim N\left(0.25, \dfrac{0.25(1 - 0.25)}{100}\right) = N(0.25, 0.001875)$
$P_N \sim N\left(0.20, \dfrac{0.20(1 - 0.20)}{100}\right) = N(0.20, 0.0016)$
(a) The probability that the sample proportion of defective micro computer chips produced by
Company M Chip is greater than the sample proportion of defective micro computer chips
produced by Company N Chip is
$P(P_M > P_N) = P(P_M - P_N > 0)$
$= P\left(Z > \dfrac{0 - 0.05}{\sqrt{0.003475}}\right)$
$= P(Z > -0.85)$
$= 0.5 + P(0 < Z < 0.85)$
$= 0.5 + 0.3023$
$= 0.8023$
(b) The probability that the sample proportions of defective micro computer chips differ by at
least 6% is
$P(|P_M - P_N| \ge 0.06) = P(P_M - P_N \le -0.06) + P(P_M - P_N \ge 0.06)$
$= P\left(Z \le \dfrac{-0.06 - 0.05}{\sqrt{0.003475}}\right) + P\left(Z \ge \dfrac{0.06 - 0.05}{\sqrt{0.003475}}\right)$
$= 0.0307 + 0.4325$
$= 0.4632$
(c) The probability that the difference between the sample proportion of defective micro
computer chips produced by Company M Chip and the sample proportion of defective micro
computer chips produced by Company N Chip is at most 4% is
$P(P_M - P_N \le 0.04) = P\left(Z < \dfrac{0.04 - 0.05}{\sqrt{0.003475}}\right)$
$= P(Z < -0.17)$
$= 0.5 - P(0 < Z < 0.17)$
$= 0.5 - 0.0675$
$= 0.4325$
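The three parts of the micro computer chip example can be checked with a few lines of code. This Python sketch assumes the same normal approximation as the worked solution and applies no continuity correction, as in the solution. `diff_of_proportions` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def diff_of_proportions(pi1, n1, pi2, n2):
    """Sampling distribution of P1 - P2 for two independent samples."""
    variance = pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2
    return NormalDist(pi1 - pi2, sqrt(variance))


# M Chip: 25% defective, N Chip: 20% defective, 100 chips sampled from each.
D = diff_of_proportions(0.25, 100, 0.20, 100)

p_a = 1 - D.cdf(0)                       # (a) P(PM - PN > 0)
p_b = D.cdf(-0.06) + (1 - D.cdf(0.06))   # (b) P(|PM - PN| >= 0.06)
p_c = D.cdf(0.04)                        # (c) P(PM - PN <= 0.04)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```

The results agree with the table-based answers 0.8023, 0.4632 and 0.4325 to about three decimal places.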
________________________________________________________________________________
Example 7
A manufacturer claims that some of the electrical parts produced by two machines are
defective. He said that 90 out of 1500 electrical parts produced by
machine 1 and 84 out of 1200 electrical parts produced by machine
2 are defective. If random samples of 50 electrical parts produced by machine 1 and 60 electrical parts
produced by machine 2 are chosen, what is the probability that
(a) the proportion of defective electrical parts produced by machine 1 is smaller than the
proportion of defective electrical parts produced by machine 2?
(b) the proportion of defective electrical parts produced by machine 1 is greater than the
proportion of defective electrical parts produced by machine 2?
(c) the proportions of defective electrical parts differ by less than 0.02?
Solution
$\pi_1 = \dfrac{90}{1500} = 0.06$
$\pi_2 = \dfrac{84}{1200} = 0.07$
$P_1 \sim N\left(0.06, \dfrac{0.06(1 - 0.06)}{50}\right) = N(0.06, 0.001128)$
$P_2 \sim N\left(0.07, \dfrac{0.07(1 - 0.07)}{60}\right) = N(0.07, 0.001085)$
(a) The probability that the proportion of defective electrical parts produced by machine 1 is
smaller than the proportion of defective electrical parts produced by machine 2 is
$P(P_1 < P_2) = P(P_2 - P_1 > 0) = P\left(Z > \dfrac{0 - 0.01}{\sqrt{0.002213}}\right)$
$= P(Z > -0.21)$
$= 0.5 + P(0 < Z < 0.21)$
$= 0.5 + 0.0832$
$= 0.5832$
(b) The probability that the proportion of defective electrical parts produced by machine 1 is
greater than the proportion of defective electrical parts produced by machine 2 is
$P(P_1 > P_2) = P(P_2 - P_1 < 0)$
$= P\left(Z < \dfrac{0 - 0.01}{\sqrt{0.002213}}\right)$
$= P(Z < -0.21)$
$= 0.5 - P(0 < Z < 0.21)$
$= 0.5 - 0.0832$
$= 0.4168$
(c) The probability that the proportions of defective electrical parts differ by less than 0.02 is
$P(|P_2 - P_1| < 0.02) = P(-0.02 < P_2 - P_1 < 0.02)$
$= P\left(\dfrac{-0.02 - 0.01}{\sqrt{0.002213}} < Z < \dfrac{0.02 - 0.01}{\sqrt{0.002213}}\right)$
A Production Manager claims that his two machines will fail due to continuous operation and
will produce defective products. An investigation was done and it was found that the claim
was true: 50 of 500 products from machine A and 45 of 500 products from machine B are
defective. 100 products from each machine were selected randomly. Find the probability that
(a) the sample proportion of defective products from machine A is smaller than the sample
proportion of defective products from machine B.
[ 0.4052 ]
(b) the sample proportions of defective products differ by less than 1.8%.
[ 0.6730 ]
(c) the difference between the sample proportion of defective products from machine A
and the sample proportion of defective products from machine B is at least 1%.
[ 0.5000 ]
__________________________________________________________________________
Task 9
A company purchased parts from two suppliers and has been having serious problems with
scrap and rework with both suppliers. From previous record, 16% was found to be
nonconforming parts supplied by Supplier A while 14% was found to be nonconforming parts
supplied by Supplier B. A quality engineer decides to investigate and took 100 randomly
selected samples for an investigation from each supplier. What is the probability that
(a) the proportion of nonconforming parts supplied by Supplier A is greater than the
proportion of nonconforming parts supplied by Supplier B?
[ 0.6554 ]
(b) the proportion of nonconforming parts supplied from Supplier A is more than the
proportion of nonconforming parts supplied from Supplier B by at least 0.01?
[ 0.5793 ]
(c) the difference between the proportion of nonconforming parts supplied by Supplier A and
the proportion of nonconforming parts supplied by Supplier B is more than 0.05? [ 0.2776 ]
___________________________________________________________________________
2.7 t Distribution
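Theorem 1 Let $Z$ be a standard normal variable and $V$ a chi-squared random variable with
$v$ degrees of freedom. If $Z$ and $V$ are independent, then the distribution of the random variable $T$, where
$$T = \frac{Z}{\sqrt{V/v}},$$
is given by the density function
$$h(t) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\Gamma\left(\frac{v}{2}\right)\sqrt{\pi v}}\left(1 + \frac{t^2}{v}\right)^{-\frac{v+1}{2}}, \quad -\infty < t < \infty.$$
This is known as the t distribution with $v$ degrees of freedom.
Corollary 1 Let $X_1, X_2, \ldots, X_n$ be independent random variables that are all normal with
mean $\mu$ and standard deviation $\sigma$. Let
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{and} \quad S^2 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1}.$$
Then the random variable
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
has a t distribution with $v = n - 1$ degrees of
freedom and can be written as $T \sim t_{n-1}$.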
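The statistic in Corollary 1 is straightforward to compute from data. The Python sketch below uses a made-up sample of five measurements and a hypothesised mean of 10; both are assumptions for illustration only and do not come from the text.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample, assumed to come from a normal population.
sample = [9.8, 10.2, 10.4, 9.9, 10.1]
mu = 10.0                  # hypothesised population mean

n = len(sample)
x_bar = mean(sample)       # sample mean
s = stdev(sample)          # sample standard deviation, divisor n - 1

t_stat = (x_bar - mu) / (s / sqrt(n))
dof = n - 1                # degrees of freedom of the t distribution

print(round(t_stat, 4), dof)
```

Note that `statistics.stdev` uses the divisor $n - 1$, matching the definition of $S^2$ in the corollary, whereas `statistics.pstdev` would use the divisor $n$.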
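(a) $\alpha = 0.001$, $v = 15$. [ $t_{0.001,15} = 3.733$ ]
(b) $\alpha = 0.005$, $v = 20$. [ $t_{0.005,20} = 2.845$ ]
(c) $\alpha = 0.010$, $v = 10$. [ $t_{0.010,10} = 2.764$ ]
(d) $\alpha = 0.025$, $v = 30$. [ $t_{0.025,30} = 2.042$ ]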
_______________________________________________________________________
2.8 $\chi^2$ Distribution
The continuous random variable $X$ has a chi-squared distribution, with $v$ degrees of freedom,
if its density function is given by
$$f(x) = \frac{1}{2^{v/2}\,\Gamma\left(\frac{v}{2}\right)}\, x^{\frac{v}{2} - 1} \exp\left(-\frac{x}{2}\right), \quad x > 0$$
where $v$ is a positive integer, and can be written as $X \sim \chi^2_v$.
All chi-square distributions are skewed to the right. The symbol $\chi^2_{\alpha,v}$ denotes the number
along the horizontal axis that cuts off to its right an area of $\alpha$ under the chi-square distribution
with $v$ degrees of freedom.
Table 8 from Lee (2004) gives the values of $\chi^2_{\alpha,v}$ with $P\left(\chi^2 > \chi^2_{\alpha,v}\right) = \alpha$.
Task 11
(a) $\alpha = 0.01$, $v = 10$. [ $\chi^2_{0.01,10} = 23.209$ ]
(b) $\alpha = 0.05$, $v = 15$. [ $\chi^2_{0.05,15} = 24.996$ ]
(c) $\alpha = 0.99$, $v = 12$.
(d) $\alpha = 0.995$, $v = 16$. [ $\chi^2_{0.995,16} = 5.142$ ]
___________________________________________________________________________
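2.9 F Distribution
Theorem 2 Let $U$ and $V$ be two independent random variables having chi-squared distributions
with $v_1$ and $v_2$ degrees of freedom, respectively. Then the distribution of the random
variable
$$F = \frac{U/v_1}{V/v_2}$$
is given by the density function
$$h(f) = \frac{\Gamma\left(\frac{v_1 + v_2}{2}\right)\left(\frac{v_1}{v_2}\right)^{v_1/2}}{\Gamma\left(\frac{v_1}{2}\right)\Gamma\left(\frac{v_2}{2}\right)} \cdot \frac{f^{\frac{v_1}{2} - 1}}{\left(1 + \frac{v_1 f}{v_2}\right)^{\frac{v_1 + v_2}{2}}}, \quad 0 < f < \infty.$$
Lower-tail values follow from
$$f_{1-\alpha, v_1, v_2} = \frac{1}{f_{\alpha, v_2, v_1}}.$$
___________________________________________________________________________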
Task 12
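(a) $\alpha = 0.001$, $v_1 = 5$, $v_2 = 10$. [ $f_{0.001,5,10} = 10.48$ ]
(b) $\alpha = 0.010$, $v_1 = 10$, $v_2 = 10$.
(c) $\alpha = 0.975$, $v_1 = 15$, $v_2 = 9$. [ $f_{0.975,15,9} = 0.3205$ ]
(d) $\alpha = 0.950$, $v_1 = 12$, $v_2 = 20$. [ $f_{0.950,12,20} = 0.3937$ ]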
Exercise 2
1. A random sample of size 32 is drawn from a normal distribution with mean 30 and
standard deviation 9. What is the probability that the
(a) sample mean is at most 26?
(b) sample mean is smaller than 33?
2. A random sample of size 41 is taken from a population which is Poisson distributed with
mean 26. What is the probability that the
(a) sample mean is less than 27?
(b) sample mean is at least 29?
3. A random sample of size 16 is selected from a normal distribution with a mean of 92 and a
standard deviation of 11. Another random sample of size 12 is selected with mean 88 and
standard deviation 16. Find the probability that
(a) the difference between the means is more than 8?
(b) is less than by 18?
4. PVC pipe is manufactured with a mean length of 30.5 inches and a standard deviation of 2.8
inches. Find the probability that a random sample of n = 15 pipes will have a sample mean
length greater than 29 inches.
5. The probability that a machine produces defective parts is 0.02. A random sample of 15
parts was taken.
(a) What is the probability that the sample mean is more than 0.5 if a random sample of size 4
was taken?
(b) What is the probability that the sample mean is less than 0.8 if a random sample of size 9
was taken?
6. The mean amount of air blown by a JSM air conditioner is 5.5 m in a minute with a
standard deviation of 1.2 m. For a DGM air conditioner, the mean amount of air blown is 4.9 m
in a minute with a standard deviation of 1.1 m. Twelve air conditioners of each type are
selected to run a test.
(a) What is the probability that the mean amount of air blown by the JSM air conditioners is greater than that of the DGM air conditioners?
(b) What is the probability that the difference between the mean amounts of air blown by the two types of air conditioner
is less than 1?
7. The average weight of a can of soda before the machine is serviced is 260 ml with a standard
deviation of 11 ml. The average weight of a can of soda after the machine is serviced is 250 ml
with a standard deviation of 8 ml. 40 cans of soda filled before the machine was serviced were chosen at
random and 38 cans of soda filled after the machine was serviced were also chosen at random. Find the
probability that the mean weight of a can of soda before the machine is serviced exceeds
the mean weight after the machine is serviced by at least 5.
8. The number of times the Max photostat machine and the JP photostat machine break down
follows a Poisson distribution. An average of 8 breakdowns was recorded for the Max
photostat machine during a randomly selected day. For the JP photostat machine, an average of 5
breakdowns was recorded during a randomly selected day.
(a) If a random sample of 15 days is taken, what is the probability that the mean number
of breakdowns recorded in a day for the Max photostat machine is more than 10?
(b) If a random sample of 20 days is taken,
i. what is the probability that the mean numbers of breakdowns recorded in a day differ by less
than 4?
ii. what is the probability that the difference between the mean numbers of breakdowns
recorded in a day is at least 5?
9. 15% of the paperclips do not follow the company's specifications. A QA inspector took 1000
samples randomly for inspection. What is the probability that
(a) less than 15% of the paperclips do not follow the company's specifications?
(b) at most 12% of the paperclips do not follow the company's specifications?
(c) more than 17% of the paperclips do not follow the company's specifications?
10. A claim was made that 98% of the A4 papers produced by a company are of good quality. A
survey was done and a random sample of 1000 A4 papers was selected. Find the probability
that
(a) more than 97% of the A4 papers produced by the company are of good quality.
(b) between 97% and 99% of the A4 papers produced by the company are of good quality.
(c) up to 99% of the A4 papers produced by the company are of good quality.
11. A manufacturer claims that $\frac{3}{4}$ of the electrical components were found to be nondefective. 250 electrical components were selected randomly. What is the probability that
(a) at least $\frac{4}{5}$ of the electrical components were found to be nondefective?
(b) $\frac{37}{50}$ to $\frac{39}{50}$ of the electrical components were found to be nondefective?
(c) $\frac{7}{10}$ of the electrical components were found to be nondefective?
12. A safety engineer claims that of all industrial accidents are caused by the carelessness of
the employees. A survey is carried out and 250 industrial accidents are randomly selected.
What is the probability that
(a) at most $\frac{1}{4}$ of all industrial accidents are caused by the carelessness of the employees?
(c) $\frac{1}{5}$ of all industrial accidents are caused by the carelessness of the employees?
$\frac{9}{50}$ to $\frac{11}{50}$ of all industrial accidents are caused by the carelessness of the employees?
13. From previous records, 1.2% of the machines in a manufacturing factory will be serviced at
least 3 times in a month. A survey was done involving 100 machines. Find the probability that
the proportion of machines serviced at least 3 times in a month is
(a) more than 0.013.
(b) less than 0.09.
(c) not more than 0.10.
14. From previous experience, 35% of the microchips are defective. An engineer was asked
to investigate and solve this problem. He randomly took 500 samples of the microchips. Find
the probability that the proportion of defective microchips is
(a) less than 0.36.
(b) not more than 0.32.
(c) between 0.33 and 0.38, inclusive.
15. A company produces component parts for two types of engines, DOHC and SOHC. They
claimed that 96% of the component parts for DOHC and 95% of the component parts for
SOHC meet specifications. 100 random samples were selected from each type of component part.
What is the probability that
(a) the proportion of the component parts for DOHC is less than the proportion
16. A claim was made that 10 out of 1000 laptops and 5 out of 500 desktops produced by a
company have been rejected. A survey was done and a random sample of 50 laptops and 40
desktops was selected. Find the probability that
(a) the sample proportion of rejected laptops is more than the sample proportion of rejected
desktops.
(b) the difference between the sample proportion of rejected laptops and the sample proportion of
rejected desktops is at least 0.01.
(c) the sample proportion of rejected laptops is smaller than the sample proportion of rejected
desktops by at most 0.005.
17. A manufacturer of CD and DVD players uses a set of comprehensive tests to assess the
electrical function of its products. All disk players must pass all tests prior to being sold. It was
found that $\frac{4}{200}$ of the CD players and $\frac{3}{200}$ of the DVD players failed the tests. A quality engineer
was asked to investigate the problems. 150 random samples were taken from each type of player.
What is the probability that
$\frac{1}{100}$ failed the tests?
(b) the proportion of CD players that failed the tests is greater than the proportion of DVD players that failed the
tests?
(c) the proportion of CD players that failed the tests is less than the proportion of DVD players that failed the tests by
at most $\frac{2}{100}$?
18. A manufacturer claims that his products produced by two different machines meet the
customers' specifications. An investigation was carried out and it was found that some of the
products failed to meet the specifications and were rejected. Of 450 items
from machine A, 27 failed to meet the specifications, and of 500 items from machine B, 25 failed to meet the
specifications; these items were rejected. 60 items from each machine were selected randomly.
Chapter 3
Estimation
Learning Objectives:
At the end of this chapter, students should be able to
(a) distinguish between estimator and estimate for a given problem.
(b) describe the difference between inferential statistics and descriptive statistics.
(c) identify the best estimator for mean, proportion and standard deviation, and construct the
confidence interval for mean, proportion and variance for a single population and for two
populations correctly based on a given problem.
(d) interpret the confidence interval correctly.
3.1
Introduction
In the previous chapter we learnt about the sampling distributions of random variables. This
knowledge will equip us to work with the core of inferential statistics. Do you know what
inferential statistics is?
This chapter will first introduce the definition of inferential statistics, followed by the
definitions of important terms that will be used intensively in this chapter, namely estimator,
estimate, point and interval estimate, and confidence interval. Next, we will discover the
procedure of estimating the true parameter of a population.
Lastly, we will construct the confidence intervals for mean, proportion and variance for cases
of one population and two populations with the correct interpretation.
Let us recap the definition of inferential statistics. It deals with the use of probabilities and
data from a sample to infer about, or to make generalisations about, the underlying population.
That is, it uses information from the sample to make decisions and draw conclusions about
population characteristics. For example, by studying the average amount of top-up spent per
month by a group of students in UTM, we can infer the average amount of top-up spent by all
university students in our country. Can you guess what the sample and population in this
example are? You can always think of a sample as a subset of a population. Does it help? Don't
give up, you have tried your best! In statistics, we call all university students in our country
the population, and the subset of this population, which is the group of students from UTM, is
called the sample. In the next section we will start with the definitions of important terms in
this chapter.
3.2
Terminology
3.3
Point Estimate
We start with our previous example on the monthly amount of top-up by university students.
The mean value of monthly top-up computed for the sample is called a sample mean, denoted
by $\bar{x}$. This is a point estimate of the corresponding population mean, i.e. the mean monthly
top-up for university students in Malaysia. Let us say we select 1000 UTM students randomly
and the mean monthly top-up is RM40. This RM40 is a point estimate for the true mean of
monthly top-up for all university students in Malaysia. The statistician can then state that the
mean monthly top-up for Malaysian university students is RM40. This is what we call point
estimation.
For the above example the population mean is estimated using the sample mean $\bar{x}$
calculated as follows
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_{1000}}{1000}$$
where $x_1$ is the amount of monthly top-up by UTM student 1, $x_2$ is the amount of monthly
top-up by UTM student 2, and so on. Similarly, the sample variance $s^2$ is a point
estimate for $\sigma^2$.
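The point estimates described above can be computed directly from a sample. The following is a minimal sketch using Python's standard `statistics` module; the ten top-up amounts are made-up values for illustration only and do not come from the text.

```python
import statistics

# Hypothetical monthly top-up amounts (RM) for a small sample of students.
# These values are invented purely to illustrate the calculation.
topups = [35, 42, 38, 50, 40, 36, 44, 39, 41, 35]

x_bar = statistics.mean(topups)     # point estimate of the population mean
s2 = statistics.variance(topups)    # sample variance, divisor n - 1
s = statistics.stdev(topups)        # sample standard deviation

print(f"x_bar = {x_bar:.2f}, s^2 = {s2:.2f}, s = {s:.2f}")
```

Note that `statistics.variance` uses the divisor n - 1, which matches the sample variance used as the point estimate of the population variance.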
In engineering we often need to estimate the following:
• The mean of a single population, $\mu$; for example the mean breakdown voltage of
diodes.
• The variance of a single population, $\sigma^2$ (or standard deviation, $\sigma$); for example the
standard deviation of the inside diameter of certain plastic pipes.
• The proportion of items in a population that belong to a certain class of interest; for
example the proportion of defective items for a particular production process.
• The ratio of two population variances, $\sigma_1^2/\sigma_2^2$; for example the ratio between variances of
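______________________________________________
Parameter          Statistic          Point estimate
______________________________________________
$\mu$              $\bar{X}$          $\bar{x}$
$\sigma^2$         $S^2$              $s^2$
$\pi$              $P = X/n$          $p = x/n$
______________________________________________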
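The best estimator (the most efficient estimator) $\hat{\Theta}$ of a parameter $\theta$ must
1. be unbiased, that is $E(\hat{\Theta}) = \theta$, and
2. have the smallest variance among the unbiased estimators.
For further explanation of these properties, please refer to Montgomery, Runger and Hubele
(2004) pages 131-133.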
___________________________________________________________________________
3.4
Interval Estimate
Next, by extending our top-up example, instead of saying that the mean top-up for university
students in Malaysia is RM40, we may want to state it within a certain range. Subtracting a
number from RM40 and adding the same number to RM40 gives us this range. To illustrate, let
the number be RM5. Hence we obtain the range from RM35 to RM45. Then we can
state that the range from RM35 to RM45 is likely to contain the mean top-up for all
Malaysian university students.
In general, the interval estimate of the unknown parameter $\theta$ can be written as $(l, u)$ where $l$
is the lower limit and $u$ is the upper limit. So the corresponding interval estimate for the
above example is RM(35, 45). Since different samples will produce different values of the sample
mean, which result in different values of $l$ and $u$, these limits are values of the
random variables $L$ (the lower limit) and $U$ (the upper limit). The probability associated with
this interval estimate can be expressed as follows
$$P(L \le \theta \le U) = 1 - \alpha,$$
where $0 < \alpha < 1$. That is, we have a probability of $1 - \alpha$ of choosing a sample that will
produce an interval containing the true value of $\theta$. The resulting interval estimate is called a
$100(1-\alpha)\%$ confidence interval (CI) for the true parameter $\theta$.
Generally, a $100(1-\alpha)\%$ confidence interval (CI) for the true parameter $\theta$ means
$$P(L \le \theta \le U) = 1 - \alpha,$$
which can be interpreted as follows: if we collect infinitely many random samples and
compute a $100(1-\alpha)\%$ CI for the true parameter $\theta$ for each sample, $100(1-\alpha)\%$ of these
intervals will contain the true value of $\theta$.
However, in practice we only draw one random sample. The interpretation that we will use is
that the observed interval $(l, u)$ contains the true value of $\theta$ with $100(1-\alpha)\%$ confidence level.
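The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: the "true" population values (`mu`, `sigma`), the sample size and the number of trials are assumptions chosen for the demonstration, and the population standard deviation is treated as known.

```python
import random
from statistics import NormalDist, mean

random.seed(1)                       # fixed seed so the run is repeatable
mu, sigma, n = 40.0, 10.0, 25        # assumed true mean, std deviation, sample size
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
half_width = z * sigma / n ** 0.5

trials = 2000
hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = mean(sample)
    # Does this sample's interval contain the true mean?
    if x_bar - half_width <= mu <= x_bar + half_width:
        hits += 1

coverage = hits / trials
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The printed fraction should be close to 0.95, although any single interval either contains the true mean or it does not.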
3.5
CI on the Mean
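In constructing a CI on the population mean, $\mu$, there are three cases that we
need to consider;
(a) population variance $\sigma^2$ is known,
(b) population variance $\sigma^2$ is unknown but the sample size is large ($n \ge 30$) and
(c) population variance $\sigma^2$ is unknown and the sample size is small ($n < 30$).
These considerations need to be taken into account because we need to know the sampling
distribution for the sample mean $\bar{X}$. The use of this sampling distribution will be
demonstrated as follows. Take the first case as an example. We know that the sampling
distribution for $\bar{X}$ is normal with mean $\mu$ and variance $\sigma^2/n$. Thus, the statistic
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
is distributed as a standard normal. In computing a $100(1-\alpha)\%$ CI for the
population mean, $\mu$, we use
$$P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha.$$
Rearranging the inequality for $\mu$ gives the lower and upper limits
$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \quad\text{and}\quad \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
respectively.
(a) A $100(1-\alpha)\%$ CI for the population mean, $\mu$, with known population variance, $\sigma^2$, can
be written as
$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
or
$$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right).$$
(b) A $100(1-\alpha)\%$ CI for the population mean, $\mu$, with unknown population variance, $\sigma^2$,
and $n \ge 30$ can also be written as
$$\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}$$
as we can use the central limit theorem in this case, where $s$ is the sample standard
deviation. For large $n$ the value $t_{\alpha/2,\,n-1}$ is close to $z_{\alpha/2}$, so the interval in (c)
gives nearly the same limits.
(c) A $100(1-\alpha)\%$ CI for the population mean, $\mu$, with unknown population variance, $\sigma^2$,
and $n < 30$ is
$$\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$
with the assumption that the sample comes from a normal distribution.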
Example 1
... gives the sample mean of $\bar{x} = 2.49978 \times 10^7$ pascal. Construct a 95% CI on the mean compressive strength.
Solution
This example is clearly case (a) where the population standard deviation is known and equals
$\sigma = 2.18039 \times 10^5$ pascal. The CI that we want to compute is the 95% CI for the mean
compressive strength, $\mu$. From the sample, $\bar{x} = 2.49978 \times 10^7$ pascal and sample size,
$n = 16$. Here $\alpha = 0.05$,
so $z_{\alpha/2} = z_{0.025} = 1.96$ and
$$2.49978 \times 10^7 - 1.96\,\frac{2.18039 \times 10^5}{\sqrt{16}} \le \mu \le 2.49978 \times 10^7 + 1.96\,\frac{2.18039 \times 10^5}{\sqrt{16}}$$
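The arithmetic in Example 1 can be checked numerically. The following is a minimal sketch for case (a), known $\sigma$; the function name `z_interval` is a hypothetical helper introduced here, and the critical value comes from the standard library's `NormalDist` rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, conf=0.95):
    """CI for the mean when the population standard deviation is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    half = z * sigma / sqrt(n)
    return x_bar - half, x_bar + half

# Values taken from Example 1
lower, upper = z_interval(2.49978e7, 2.18039e5, 16, conf=0.95)
print(f"{lower:.5e} <= mu <= {upper:.5e}")
```

The limits come out at roughly $2.4891 \times 10^7$ and $2.5105 \times 10^7$ pascal.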
Task 1
A random sample of 16 compact cars tested for fuel consumption gave a mean of 12.5 km per
litre with a standard deviation of 0.83 km per litre. Assuming that the fuel consumption in km
per litre of all compact cars has a normal distribution, construct a 99% confidence interval
for the population mean of fuel consumption for compact cars.
[ 11.8885, 13.1115 ]
Task 2
Borneo Steel Corporation produces iron rings that are supplied to ARAAB Co Ltd. These
rings are supposed to have a diameter of 60 cm. The machine that makes these rings does not
produce each ring with a diameter of exactly 60 cm. The diameter of each of the rings varies
slightly. It is known that when the machine is working properly, the rings made on this
machine have a mean diameter of 60 cm. The quality control department takes a random
sample of 35 such rings every week, calculates the mean of the diameters for these rings, and
makes a 99% confidence interval for the population mean. If either the lower limit of this
confidence interval is less than 59.938 cm or the upper limit of this confidence interval is
greater than 60.063 cm, the machine is stopped and adjusted. A recent such sample of 35
rings produced a mean diameter of 60.038 cm with a standard deviation of 0.15 cm. Based on
this sample can you conclude that the machine needs an adjustment?
[(59.9727, 60.1033); yes]
___________________________________________________________________________
3.6
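Recall that the statistic
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
is distributed as a standard normal,
assuming we know both population variances. Again we compute a $100(1-\alpha)\%$ CI for the
difference between the two population means, $\mu_1 - \mu_2$, so that
$$P\left(\bar{X}_1 - \bar{X}_2 - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{X}_1 - \bar{X}_2 + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) = 1 - \alpha.$$
There are three cases of $100(1-\alpha)\%$ CI for the difference between two population means
$\mu_1 - \mu_2$;
(a) population variances $\sigma_1^2$ and $\sigma_2^2$ are known:
$$\bar{x}_1 - \bar{x}_2 - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
(b) population variances are unknown and the sample sizes are large:
i. with $\sigma_1^2 = \sigma_2^2$
$$\bar{x}_1 - \bar{x}_2 - z_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + z_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
where
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
is a pooled standard deviation.
ii. with $\sigma_1^2 \ne \sigma_2^2$
$$\bar{x}_1 - \bar{x}_2 - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
(c) population variances are unknown and the sample sizes are small, with both samples drawn
from normal distributions:
i. with $\sigma_1^2 = \sigma_2^2$
$$\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,\,v}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,\,v}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
where $v = n_1 + n_2 - 2$ and
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}.$$
ii. with $\sigma_1^2 \ne \sigma_2^2$
$$\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,\,v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,\,v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
where
$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}.$$
_______________________________________________________________________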
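The pooled standard deviation and the degrees of freedom for the unequal-variance case are the two quantities that are easiest to get wrong by hand. Below is a minimal sketch of both formulas; the function names and the input values are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation s_p for the equal-variance case."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def unequal_variance_df(s1, n1, s2, n2):
    """Degrees of freedom v for the unequal-variance case."""
    a, b = s1**2 / n1, s2**2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

# Illustrative values only: s1 = 2.0 with n1 = 10, s2 = 3.0 with n2 = 12
sp = pooled_sd(2.0, 10, 3.0, 12)
v = unequal_variance_df(2.0, 10, 3.0, 12)
print(f"s_p = {sp:.4f}, v = {v:.2f}")
```

The value of $v$ is generally not an integer, so it has to be rounded before looking up a t table; which rounding convention to use is a choice the reader should confirm against the course's convention.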
Example 2
Suppose random samples of 49 Silver Tyres and 36 Dun Tyres were selected. The sample
mean mileage that a tyre lasts for Silver Tyres is 119000 km with a standard deviation of
7700 km, and the sample mean mileage for Dun Tyres is 118000 km with a standard
deviation of 6000 km. Compute a 90% CI on the difference of the two population means.
Solution
With $z_{0.05} = 1.6449$,
$$(119000 - 118000) - 1.6449\sqrt{\frac{7700^2}{49} + \frac{6000^2}{36}} \le \mu_1 - \mu_2 \le (119000 - 118000) + 1.6449\sqrt{\frac{7700^2}{49} + \frac{6000^2}{36}}$$
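The limits in Example 2 can be evaluated with a few lines of code. This is a sketch under the large-sample, unequal-variance formula; the variable names are introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Sample summaries from Example 2
x1, s1, n1 = 119000, 7700, 49    # Silver Tyres
x2, s2, n2 = 118000, 6000, 36    # Dun Tyres

z = NormalDist().inv_cdf(0.95)           # z_{0.05} for a 90% CI
se = sqrt(s1**2 / n1 + s2**2 / n2)       # standard error of the difference
lower = (x1 - x2) - z * se
upper = (x1 - x2) + z * se
print(f"{lower:.1f} <= mu1 - mu2 <= {upper:.1f}")
```

The interval is approximately (-1445.3, 3445.3) km. Since it contains zero, the data do not rule out equal mean mileage for the two brands at this confidence level.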
___________________________________________________________________________
Task 3
Using Example 2, but assuming that the population variances are equal, construct a 95%
CI on the difference of the mean mileages that the tyres last.
[-2026.0942, 4026.0942]
___________________________________________________________________________
Task 4
A car magazine is comparing the total repair costs incurred during the first three years on two
mid-sized cars, the Pherry and the XPY. Random samples of 16 Pherrys and 9 XPYs are
taken. All 25 cars are three years old and have similar mileages. The mean of repair costs for
the 16 Pherry cars is RM5000 for the first three years with a standard deviation of RM800.
For the 9 XPY cars, this mean is RM7700 with a standard deviation of RM1000. Assume that
the repair costs follow a normal distribution with the same population variance. Construct a
90% confidence interval for the difference between the two population means.
[-3324.7295, -2075.270]
___________________________________________________________________________
Task 5
A process engineer is comparing two different etching solutions for removing silicon from
the backs of wafers. The etch rates follow normal distributions and have equal population
variances of $0.35^2$. Below are the observed etch rates from 10 wafers for each solution.
____________________________
Solution 1        Solution 2
____________________________
9.7    10.5       10.1   9.9
9.3    10.2       10.5   10.1
9.1    9.9        10.6   10.2
9.5    10.3       10.3   10.3
10.0   10.1       10.3   10.1
____________________________
Find a 90% CI for the difference in mean etch rates. [ -0.6375, -0.1225 ]
Task 6
Using Task 5, construct a 95% CI for the difference in mean etch rates if we do not know the
population variances and assume that the two populations have unequal variances.
[ -0.7198, -0.0402 ]
___________________________________________________________________________
3.7
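For a large sample, the statistic
$$Z = \frac{P - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}}$$
is approximately standard normal, where $P$ is the sample proportion and $\pi$ is the population
proportion, so that
$$P\left(-z_{\alpha/2} \le \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \le z_{\alpha/2}\right) = 1 - \alpha.$$
Rearranging,
$$P\left(P - z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} \le \pi \le P + z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right) = 1 - \alpha.$$
Replacing the unknown $\pi$ in the standard error with $P$, we obtain an approximate
$100(1-\alpha)\%$ CI on $\pi$ as
$$P - z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}} \le \pi \le P + z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}}.$$
__________________________________________________________________________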
__________________________________________________________________________
Example 3
$$0.005 - 1.6449\sqrt{\frac{0.005 \times 0.995}{200}} \le \pi \le 0.005 + 1.6449\sqrt{\frac{0.005 \times 0.995}{200}}$$
$$-0.0032 \le \pi \le 0.0132$$
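The numbers in Example 3 (sample proportion 0.005, $n = 200$, 90% confidence) can be reproduced as follows. This is a minimal sketch; `proportion_interval` is a hypothetical helper name, and the large-sample normal approximation is assumed.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(p_hat, n, conf):
    """Large-sample CI for a population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

lower, upper = proportion_interval(0.005, 200, conf=0.90)
print(f"{lower:.4f} <= pi <= {upper:.4f}")
```

The lower limit is negative here, which cannot be a proportion. In practice the lower limit would be read as zero, and a negative limit is a sign that the normal approximation is being stretched by a very small sample proportion.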
___________________________________________________________________________
Task 7
A random sample of 200 diskettes were inspected and 17 defective diskettes were found. Find
a 95% CI on the true proportion of defective diskettes.
[ 0.0463, 0.1237 ]
___________________________________________________________________________
Task 8
A random sample of 400 components was tested and 6.25 percent of the sample components
failed to satisfy production specifications. Find a 90% CI on the true proportion of components
that fail to satisfy the specifications.
[ 0.0426, 0.0824 ]
__________________________________________________________________________
3.8
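To construct the CI for $\pi_1 - \pi_2$ recall that the sampling distribution for $P_1 - P_2$ is normal
with mean $\pi_1 - \pi_2$ and variance
$$\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}.$$
So the statistic
$$Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2}}}$$
is a standard normal random variable. Using the same approach as in the previous section, we
obtain a $100(1-\alpha)\%$ CI for the difference between two proportions as
$$P_1 - P_2 - z_{\alpha/2}\sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}} \le \pi_1 - \pi_2 \le P_1 - P_2 + z_{\alpha/2}\sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}$$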
___________________________________________________________________________
Example 4
In a factory, plastic parts are formed using two different injection-molding machines. Two
random samples, each of size 200, are chosen and 5 defective parts are found in the sample
from machine A whereas 6 defective parts are found in the sample from machine B. Construct
a 99% CI on the difference in proportions of defective parts.
Solution
$P_1 = \frac{5}{200} = 0.025$; $P_2 = \frac{6}{200} = 0.03$; $z_{0.005} = 2.5758$
$$(0.025 - 0.03) - 2.5758\sqrt{\frac{0.025 \times 0.975}{200} + \frac{0.03 \times 0.97}{200}} \le \pi_1 - \pi_2 \le (0.025 - 0.03) + 2.5758\sqrt{\frac{0.025 \times 0.975}{200} + \frac{0.03 \times 0.97}{200}}$$
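To finish the arithmetic in Example 4, the expression can be evaluated directly. The sketch below uses the counts from the example; the variable names are introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Counts from Example 4
n1, defects1 = 200, 5     # machine A
n2, defects2 = 200, 6     # machine B
p1, p2 = defects1 / n1, defects2 / n2

z = NormalDist().inv_cdf(0.995)     # z_{0.005} for a 99% CI
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lower = (p1 - p2) - z * se
upper = (p1 - p2) + z * se
print(f"{lower:.4f} <= pi1 - pi2 <= {upper:.4f}")
```

The interval is roughly (-0.0471, 0.0371). It contains zero, so the samples do not show a difference in defective proportions between the two machines at the 99% confidence level.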
Task 9
A survey conducted by the independent Engineering Education Research Unit found that among
teenagers aged 17 to 19, 20% of school girls and 25% of school boys wanted to study in an
engineering discipline. Suppose that these percentages are based on random samples of 501
school girls and 500 school boys. Determine a 90% CI for the difference between the
proportions of all school girls and all school boys who would like to study in an engineering
discipline.
[-0.0933, -0.00666]
___________________________________________________________________________
3.9
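A $100(1-\alpha)\%$ CI on the variance $\sigma^2$ of a normal population is
$$\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}.$$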
__________________________________________________________________________________
Example 5
A study on an operating system for a portable computer has been carried out thoroughly to
estimate the variance of response time. A random sample of 10 portable computers is chosen
and gives a standard deviation of 8 milliseconds. Assuming that the response time
follows a normal distribution, construct a 95% CI on the true variance of response time.
Solution
$\alpha = 0.05$, $\chi^2_{0.025,\,9} = 19.023$, $\chi^2_{0.975,\,9} = 2.7$
$$\frac{(10-1)8^2}{\chi^2_{0.025,\,10-1}} \le \sigma^2 \le \frac{(10-1)8^2}{\chi^2_{0.975,\,10-1}}$$
$$\frac{576}{19.023} \le \sigma^2 \le \frac{576}{2.7}$$
$$30.279 \le \sigma^2 \le 213.333$$
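Example 5 can be checked with a short function. Python's standard library has no chi-square quantile function, so this sketch takes the two table values as arguments, exactly as they are read from a chi-square table; the function and parameter names are hypothetical.

```python
def variance_interval(s, n, chi2_upper, chi2_lower):
    """CI on the variance of a normal population.

    chi2_upper: table value with area alpha/2 to its right
    chi2_lower: table value with area 1 - alpha/2 to its right
    """
    ss = (n - 1) * s**2
    return ss / chi2_upper, ss / chi2_lower

# Values from Example 5: s = 8 ms, n = 10, 9 degrees of freedom
lower, upper = variance_interval(s=8, n=10, chi2_upper=19.023, chi2_lower=2.700)
print(f"{lower:.3f} <= sigma^2 <= {upper:.3f}")
```

Taking square roots of both limits gives the corresponding interval for the standard deviation.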
________________________________________________________________________________
Task 10
A random sample of 13 bolts is selected and the inside diameter is measured. The sample
standard deviation of the bolt inside diameter is 0.018 mm. Construct a 90% CI for the
standard deviation.
[0.0136, 0.0273]
__________________________________________________________________________
3.10
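Let $S_1^2$ and $S_2^2$ be the sample variances of random samples of sizes $n_1$ and $n_2$ drawn
from two normal populations. The statistic
$$F = \frac{S_2^2/\sigma_2^2}{S_1^2/\sigma_1^2}$$
has an F distribution with $n_2 - 1$ and $n_1 - 1$ degrees of freedom, so that
$$P\left(f_{1-\alpha/2,\,n_2-1,\,n_1-1} \le F \le f_{\alpha/2,\,n_2-1,\,n_1-1}\right) = 1 - \alpha,$$
that is
$$P\left(f_{1-\alpha/2,\,n_2-1,\,n_1-1} \le \frac{S_2^2/\sigma_2^2}{S_1^2/\sigma_1^2} \le f_{\alpha/2,\,n_2-1,\,n_1-1}\right) = 1 - \alpha.$$
Rearranging the above, we obtain a $100(1-\alpha)\%$ CI on the ratio of two variances of two
normal distributions,
$$\frac{s_1^2}{s_2^2}\, f_{1-\alpha/2,\,n_2-1,\,n_1-1} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\,n_2-1,\,n_1-1}.$$
Since
$$f_{1-\alpha/2,\,n_2-1,\,n_1-1} = \frac{1}{f_{\alpha/2,\,n_1-1,\,n_2-1}},$$
the CI can also be written as
$$\frac{s_1^2}{s_2^2}\,\frac{1}{f_{\alpha/2,\,n_1-1,\,n_2-1}} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\,n_2-1,\,n_1-1}.$$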
___________________________________________________________________________
Example 6
A quality engineer is studying the diameter of stainless steel rods manufactured on two
different machines. Two random samples of 16 and 13 rods respectively are selected, which
give variances of the diameter of 0.30 cm$^2$ and 0.40 cm$^2$ respectively. Assuming that the
data were drawn from normal distributions, construct a 95% CI on the ratio of variances of
the diameters.
Solution
$s_1^2 = 0.30$ cm$^2$, $s_2^2 = 0.40$ cm$^2$, $f_{0.025,\,16-1,\,13-1} = 3.18$, $f_{0.025,\,13-1,\,16-1} = 2.96$
$$\frac{s_1^2}{s_2^2}\,\frac{1}{f_{\alpha/2,\,n_1-1,\,n_2-1}} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\,n_2-1,\,n_1-1}$$
$$\frac{0.3}{0.4}\cdot\frac{1}{3.18} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{0.3}{0.4}(2.96)$$
$$0.2358 \le \frac{\sigma_1^2}{\sigma_2^2} \le 2.22$$
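The calculation in Example 6 is short enough to script. As with the chi-square case, the two F table values are passed in as arguments because the standard library does not provide F quantiles; the function and parameter names are hypothetical.

```python
def variance_ratio_interval(s1_sq, s2_sq, f_n1_n2, f_n2_n1):
    """CI on sigma1^2 / sigma2^2 for two normal populations.

    f_n1_n2: f_{alpha/2} with (n1 - 1, n2 - 1) degrees of freedom
    f_n2_n1: f_{alpha/2} with (n2 - 1, n1 - 1) degrees of freedom
    """
    ratio = s1_sq / s2_sq
    return ratio / f_n1_n2, ratio * f_n2_n1

# Values from Example 6
lower, upper = variance_ratio_interval(0.30, 0.40, f_n1_n2=3.18, f_n2_n1=2.96)
print(f"{lower:.4f} <= ratio <= {upper:.2f}")
```

Because the interval contains 1, the samples are consistent with the two machines having equal variances.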
_____________________________________________________________________________
Task 11
An engineer is studying an axial load of aluminium cans. It is measured by using a plate
where an increasing pressure is applied on top of the can until it collapses. This maximum
weight that the sides of the can can support is the axial load. Two random samples of sizes 10
and 7 aluminium cans are selected and the standard deviations are 10.1 kg and 11.8 kg
respectively. Find a 90% CI on the ratio of variances of the loads.
[0.1787,2.4689]
___________________________________________________________________________
Exercise 3
1. When you construct a 90% confidence interval for $\mu$, what are you 90% confident about?
2. What happens to the width of a CI if we increase the sample size?
3. Can we consider the construction of a confidence interval to be part of inferential statistics?
Why?
4. For a data set obtained from a sample, $n = 49$, $\bar{x} = 102.5$, and $s = 10.7$
(a) What is the point estimate for $\mu$?
(b) Compute a 98% CI for $\mu$.
5. A 90% CI for $\mu$ can be interpreted as follows: if we take 1000 random samples of the same
size and compute the confidence interval for each, then 900 of them
a. will contain $\mu$
c. will contain $\bar{x}$
6. Carbonated drink bottles are filled by an automated filling machine. Assume that the fill
volume is normally distributed and from previous production processes the variance of fill
volume is 0.005 liter. A random sample of size 16 was drawn from this process, which gives
a mean fill volume of 0.51 liter. Construct a 99% CI on the mean fill of all carbonated drink
bottles produced by this factory.
7. A random sample of 12 wafers was drawn from a slider fabrication process, which gives
the following photoresist thickness in micrometer: 10 11 9 8 10 10 11 8 9 10 11 12. Assume
that the thickness is normally distributed. Construct a 95% CI for the mean thickness of all
wafers produced by this factory.
8. The following is the result for diameter of 10 bearings selected randomly from a
production process.
0.5061 0.5083 0.5058 0.5075
0.5049
0.5037
13.925 13.909
14.057
14.068
14.006
13.893
14.005
(a) Construct a 90% CI for the mean thickness of epitaxial layers assuming that the thickness
of epitaxial layers follows a normal distribution with variance of 0.0050 $\mu$m$^2$.
(b) Construct a 90% CI for the mean thickness of all epitaxial layers assuming that the thickness
of epitaxial layers follows a normal distribution.
(c) Comment on the interval estimates based on their practicality.
10. Using data in question 9 and the following data on thickness of the epitaxial layers
at high deposition time and at 59% arsenic flow rate;
14.295 14.095 15.505
15.806
15.106
14.839,
construct a 90% CI on the difference between the mean thicknesses of epitaxial layers assuming
that the thickness of epitaxial layers follows a normal distribution with equal variances. Interpret
your CI. Can you conclude that the true mean difference is zero?
11. A quality inspector inspected a random sample of 300 memory chips from a production
line and found that 9 are defective. Construct a 99% confidence interval for the proportion of
defective chips.
90
25
10
2
1
(a) What is the point estimate of the proportion of defectives due to holes too small?
(b) Construct a 90% CI for the proportion of defectives for the production process due to
holes too small.
(c) What is the point estimate for the proportion of defectives due to poor connection?
(d) Construct a 90% CI for the proportion of defectives for the production process due to poor
connection.
(e) If oversize and undersize chips can be classified as incorrect chip size, what is the point
estimate of the proportion of defects due to incorrect chip size?
Hence find a 95% interval estimate for the proportion of defective items due
to incorrect chip size.
14. An optical firm is concerned about the variability of the refractive index of a typical glass
that it will grind into lenses. The refractive index follows an approximately normal
distribution. A random sample of 15 glasses is drawn from a large shipment, which gives a
variance of $1.5 \times 10^{-4}$ for the refractive index. Construct a 95% CI for the standard deviation
of the refractive index of all glasses.
Chapter 4
Tests of Hypotheses
Learning Objectives:
At the end of this chapter, students should be able to:
(a) structure science and/or engineering decision-making problems concerning one
or two samples as hypothesis tests.
(b) test hypotheses concerning a population mean.
(c) test hypotheses concerning a population variance or standard deviation.
(d) test hypotheses concerning a population proportion.
(e) test hypotheses concerning the difference in two population means.
(f) test hypotheses concerning the ratio of two population variances or standard deviations.
4.1
Statistical Hypotheses
Many science and engineering problems require us to decide whether to accept or reject
a statement about some parameter. That statement is called a hypothesis. A statistical
hypothesis can arise from various fields of interest such as engineering, science, education,
etc. A systematic procedure to decide whether to accept or reject a hypothesis is
called hypothesis testing.
We cannot prove that a hypothesis is absolutely true or false. If the data sample supports the
hypothesis, then we do not reject it. If the data sample does not support the hypothesis, we
reject it.
The hypothesis being tested is referred to as the null hypothesis and denoted by $H_0$. The null
hypothesis is set up primarily to see whether it can be rejected or not. Also, we must
formulate an alternative hypothesis in order to know when to reject a null hypothesis. The
alternative hypothesis, denoted by $H_1$, is the hypothesis which we accept when the null
hypothesis is rejected. Some authors use the notation $H_a$ or $H_A$ for the alternative
hypothesis.
Definition 2 A null hypothesis, $H_0$, is an assertion about one or more population
parameters. We hold this assertion as true until there is sufficient statistical evidence to
conclude otherwise. The alternative hypothesis, $H_1$, is the assertion of all situations not
covered by the null hypothesis.
Together, the null and the alternative hypotheses constitute a complete set of hypotheses that
covers all possible values of the parameter or parameters under investigation. The value of
the population parameter specified in the null hypothesis is usually determined in one of the
following three ways:
1. from a model or theory regarding the process under investigation, then the objective of
hypothesis testing is usually to verify the model or theory.
2. from knowledge of the process or previous tests or experiments, then the objective of
hypothesis testing is to determine whether the parameter value has changed.
3. from external consideration, such as design or engineering specification, or from
contractual obligations, then the objective of hypothesis testing is conformance testing.
The hypothesis test is carried out using information obtained by random sampling.
For example, suppose that we are interested in the output voltage of a power supply used in a
mobile phone; output voltage is a random variable that can be described by a probability
distribution. Suppose that our interest focuses on the mean output voltage $\mu$
(a parameter of this distribution). Specifically, we are interested in deciding whether
or not the mean output voltage is 6.00 V. We may express this formally as
$$H_0: \mu = 6.00 \text{ V}$$
$$H_1: \mu \ne 6.00 \text{ V} \qquad (4.1)$$
The statement $H_0: \mu = 6.00$ V in Equation (4.1) is called the null hypothesis$^1$, and the
statement $H_1: \mu \ne 6.00$ V is called the alternative hypothesis. Since values of $\mu$ under the
alternative hypothesis could be either greater or less than 6.00 V, it is called a two-sided
alternative hypothesis. When we formulate the hypotheses as
$$H_0: \mu = 6.00 \text{ V}$$
$$H_1: \mu < 6.00 \text{ V}$$
or
$$H_0: \mu = 6.00 \text{ V}$$
$$H_1: \mu > 6.00 \text{ V}$$
then values of $\mu$ under the alternative hypothesis could be less than 6.00 V or greater than
6.00 V, respectively, and it is called a one-sided alternative hypothesis$^2$.
Definition 3 A test statistic is a sample statistic computed from the data obtained by random
sampling. The value of the test statistic is used in determining whether or not the null
hypothesis should be rejected.
We decide whether or not to reject the null hypothesis by following a rule called the decision
rule.
Definition 4 The decision rule of a statistical hypothesis test is a rule that specifies
the conditions under which the null hypothesis may be rejected.
_______________________________________________________________
1 Note that when choosing the null hypothesis one should bear in mind that it should nearly
always be precise, or be easily reduced to a precise hypothesis. For example when testing
$H_0: \mu \le 6$ V versus $H_1: \mu > 6$ V, the null hypothesis does not specify the value of $\mu$
exactly and so is not precise. But in practice we would proceed as if we were testing
$H_0: \mu = 6$ V versus $H_1: \mu > 6$ V.
2 Note that hypotheses are always statements about the parameters of one or more
populations under investigation, not statements about the sample. So it is wrong to write
$H_0: \bar{x} = 6$ V versus $H_1: \bar{x} \ne 6$ V or $H_1: \bar{x} > 6$ V.
Table 4.1 shows all four possible outcomes of a test of hypothesis. The conclusion
columns refer to the action that the decision maker takes based on the results of the sampling
experiment. He or she will either conclude that the alternative hypothesis $H_1$ is true or that
the null hypothesis $H_0$ is true. The state of nature rows refer to the fact that either the
alternative hypothesis $H_1$ is true or the null hypothesis $H_0$ is true. We assume the true state
of nature is unknown when he or she conducts the test.
                         Statistical Conclusion
State of Nature      $H_1$ is true         $H_0$ is true
$H_0$ is true        Type I error          Correct conclusion
$H_1$ is true        Correct conclusion    Type II error
Also, he or she will be making a wrong conclusion if he or she accepts the null hypothesis
(equivalently, rejects the alternative hypothesis) when in fact $H_1$ is true. This type
of wrong conclusion is called a Type II error.
Definition 6 Failing to reject the null hypothesis $H_0$ (equivalently, failing to accept the
alternative hypothesis $H_1$) when $H_0$ is false in the state of nature is defined as a Type II
error.
Probabilities can be associated with the Type I and Type II errors because the
conclusion is based on random variables. The probability of making a Type I error is
denoted by $\alpha$ (the Greek letter alpha), that is
$$\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true}) \qquad (4.2)$$
The probability of making a Type II error is denoted by $\beta$ (the Greek letter beta), that is
$$\beta = P(\text{Type II error}) = P(\text{fail to reject } H_0 \mid H_0 \text{ is false}) \qquad (4.3)$$
We should avoid the decision to accept $H_0$; instead, if the sample evidence does not support
rejection, we should state that the sample evidence is insufficient to reject $H_0$. Type I error is
considered more important than Type II error because we want to guard against rejecting a
null hypothesis that is in fact true more than against the other type of error.
A procedure leading to a decision about a particular hypothesis is called a test of a
hypothesis. The general procedure used for testing a hypothesis is as follows:
1. Identify the parameter of interest.
2. Formulate a null hypothesis and an alternative hypothesis.
3. Choose a significance level, $\alpha$.
4. Determine the distribution and state the rejection region of the test statistic.
5. Specify an appropriate test statistic and calculate the value of the test statistic from a
random sample of data.
6. Decide whether to reject $H_0$ or fail to reject $H_0$ by comparing the calculated value of the
test statistic with the values in the critical region.
Steps 1 to 4 should be completed prior to calculation of the test statistic from a random
sample of data. This sequence of steps will be illustrated in subsequent sections.
___________________________________________________________________________
4.2
We now consider the case of hypothesis testing on the mean of a population under the
assumption of normality. The tests are also valid in cases where only approximate normality
exists. If the population is not normal, the tests rely on the conditions of the central limit
theorem being satisfied.
Suppose we wish to test the hypothesis that a random sample $X_1, X_2, \ldots, X_n$ of size $n$
comes from a population with mean $\mu_0$,
where $\mu_0$ is a specified constant and we have assumed that the population variance $\sigma^2$ is
known. Now consider testing the hypothesis
$$H_0: \mu = \mu_0$$
$$H_1: \mu \ne \mu_0 \qquad (4.4)$$
The test statistic is
$$Z_{test} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \qquad (4.5)$$
If the null hypothesis is true, $Z_{test}$ has a standard normal distribution, $N(0, 1)$. When we
know the distribution of the test statistic we can locate the critical region to control the Type I
error probability at the desired level. In this case we would use the $z_{\alpha/2}$ and $-z_{\alpha/2}$
percentage points as the boundaries of the critical region, and reject $H_0: \mu = \mu_0$ if
$$z_{test} > z_{\alpha/2} \qquad (4.6)$$
or
$$z_{test} < -z_{\alpha/2} \qquad (4.7)$$
and fail to reject $H_0$ if
$$-z_{\alpha/2} \le z_{test} \le z_{\alpha/2} \qquad (4.8)$$
Equations (4.6) and (4.7) define the critical region or rejection region for the test. The Type I
error probability for this test procedure is $\alpha$.
The procedures for testing the mean when the variance is known are summarized in
Table 4.2.
Table 4.2: Testing the mean when variance is known
Test statistic: $Z_{test} = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
Case   Null hypothesis      Alternative hypothesis   Rejection region
1      $H_0: \mu = \mu_0$   $H_1: \mu \ne \mu_0$     $z_{test} > z_{\alpha/2}$ or $z_{test} < -z_{\alpha/2}$
2      $H_0: \mu = \mu_0$   $H_1: \mu < \mu_0$       $z_{test} < -z_{\alpha}$
3      $H_0: \mu = \mu_0$   $H_1: \mu > \mu_0$       $z_{test} > z_{\alpha}$
___________________________________________________________________________
Example 1
Mobile phones are powered by battery. The output voltage of a power supply used in a
mobile phone is an important product characteristic. Specifications require that the mean
output voltage must be 6.00 V. We know that the standard deviation of output voltage is
$\sigma = 0.5$ V. We decide to specify a Type I error probability or significance level of
$\alpha = 0.05$. A random sample of $n = 20$ is collected and gives a sample mean output voltage of
$\bar{x} = 6.80$ V.
Solution
We will follow the procedure outlined in Section 4.1 for testing a hypothesis:
1. The parameter of interest is the population mean, $\mu$, the mean output voltage.
2. The null and alternative hypotheses are $H_0: \mu = 6.00$ V versus $H_1: \mu \ne 6.00$ V.
3. $\alpha = 0.05$
4. Reject $H_0$ if $z_{test} > z_{\alpha/2} = 1.96$ or $z_{test} < -z_{\alpha/2} = -1.96$.
5. The test statistic is
$$Z_{test} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$
and its calculated value is
$$z_{test} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{6.80 - 6.00}{0.5/\sqrt{20}} = 7.16$$
6. Since the value $z_{test} = 7.16$ exceeds 1.96, we reject $H_0: \mu = 6.00$ at the 0.05 level
of significance. We can statistically conclude that the mean output voltage differs from 6 V,
based on a sample of 20 measurements.
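The six steps of Example 1 map naturally onto a small function. The following is a minimal sketch of the two-sided test with known $\sigma$; `z_test_two_sided` is a hypothetical helper name, and the critical value is computed with `NormalDist` instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(x_bar, mu0, sigma, n, alpha):
    """Two-sided test of H0: mu = mu0 when sigma is known."""
    z_test = (x_bar - mu0) / (sigma / sqrt(n))    # calculated test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    reject = abs(z_test) > z_crit                 # in the rejection region?
    return z_test, z_crit, reject

# Values from Example 1
z_value, z_crit, reject = z_test_two_sided(6.80, 6.00, 0.5, 20, alpha=0.05)
print(f"z_test = {z_value:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```

Comparing `abs(z_test)` with the critical value covers both tails at once, which is equivalent to checking the two rejection conditions separately.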
Suppose that we specify the hypotheses as
$$H_0: \mu = \mu_0$$
$$H_1: \mu < \mu_0 \qquad (4.9)$$
where the alternative hypothesis is one-sided. In defining the critical region for this test, we
observe that a positive value of the test statistic $Z_{test}$ would never lead us to conclude that
$H_0: \mu = \mu_0$ is false. Therefore, we would place the critical region in the lower tail of the
standard normal distribution and reject $H_0$ if the calculated value $z_{test}$ is too small. We
would reject $H_0$ if
$$z_{test} < -z_{\alpha}$$
Similarly, to test
$$H_0: \mu = \mu_0$$
$$H_1: \mu > \mu_0 \qquad (4.10)$$
we observe that a negative value of the test statistic $Z_{test}$ would never lead us to conclude
that $H_0: \mu = \mu_0$ is false. Therefore, we would place the critical region in the upper tail of the
standard normal distribution and reject $H_0$ if the calculated value of $z_{test}$ is too large. We
would reject $H_0$ if
$$z_{test} > z_{\alpha}$$
_________________________________________________________________________
Task 1
A manufacturer claims that the battery life of model Z1 exceeds 90.0 hours. The life in hours of a
battery is known to be approximately normally distributed, with standard deviation $\sigma = 8.5$
hours. A random sample of 18 batteries has a mean life of $\bar{x} = 95.5$ hours. Is there
evidence to support the claim? Use $\alpha = 0.01$.
[$z_{test} = 2.7452$; reject $H_0$]
__________________________________________________________________________________
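When the population variance $\sigma^2$ is unknown and the sample size is large ($n \ge 30$), the test
statistic still follows approximately the standard normal distribution,
$$Z_{test} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$
with $\sigma$ replaced by the sample standard deviation $S$. However, when the sample size is small
($n < 30$), we use the test statistic
$$T_{test} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$
If $H_0$ is true, $T_{test}$ has a t distribution with $n - 1$ degrees of freedom and we can locate
the critical region to control the Type I error probability at the desired level. In this case we
would use the $t_{\alpha/2,\,n-1}$ and $-t_{\alpha/2,\,n-1}$ percentage points as the boundaries of the critical
regions to reject $H_0: \mu = \mu_0$ if
$$t_{test} > t_{\alpha/2,\,n-1} \qquad (4.11)$$
or
$$t_{test} < -t_{\alpha/2,\,n-1} \qquad (4.12)$$
and fail to reject $H_0$ if
$$-t_{\alpha/2,\,n-1} \le t_{test} \le t_{\alpha/2,\,n-1} \qquad (4.13)$$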
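Equations (4.11) and (4.12) define the critical region or rejection region for the test.
The Type I error probability for this test procedure is $\alpha$.
The procedures for testing the mean when the variance is unknown are summarized in
Table 4.3.
Table 4.3: Testing the mean when variance is unknown and n < 30
Test statistic: $T_{test} = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$, degrees of freedom $n - 1$
Case   Null hypothesis      Alternative hypothesis   Rejection region
1      $H_0: \mu = \mu_0$   $H_1: \mu \ne \mu_0$     $t_{test} > t_{\alpha/2,\,n-1}$ or $t_{test} < -t_{\alpha/2,\,n-1}$
2      $H_0: \mu = \mu_0$   $H_1: \mu < \mu_0$       $t_{test} < -t_{\alpha,\,n-1}$
3      $H_0: \mu = \mu_0$   $H_1: \mu > \mu_0$       $t_{test} > t_{\alpha,\,n-1}$
Table 4.2 and Table 4.3 are very similar except that $T_{test}$ is used as the test statistic
instead of $Z_{test}$. Also, we use the t distribution to define the critical region instead of using
the standard normal distribution.
_____________________________________________________________________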
_____________________________________________________________________
Example 2
Referring to Example 1, suppose that the true variance is unknown. Ten determinations of the output voltage of a power supply yielded the following values:
6.05   6.06   6.03   5.95   6.00   5.98   6.04   5.98   6.02   6.03
Can we say that the average output voltage is equal to 6.00 V? Assume that the data are approximately normal.
Solution
The solution using the outline in Section 4.1 is as follows:
1. The parameter of interest is the population mean, $\mu$, the mean output voltage.
2. The null and alternative hypotheses are
$H_0: \mu = 6.00$ V versus $H_1: \mu \ne 6.00$ V
3. $\alpha = 0.05$
4. Reject $H_0$ if $t_{test} > t_{\alpha/2,\,n-1} = t_{0.025,\,9} = 2.262$ or $t_{test} < -t_{0.025,\,9} = -2.262$, where the test statistic is
$T_{test} = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$
5. From the sample, $\bar{x} = 6.014$ V and $s = 0.0353$ V, so
$t_{test} = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{6.014 - 6.00}{0.0353/\sqrt{10}} = 1.254$
6. Since the value $t_{test} = 1.254$ is between $-2.262$ and $2.262$, we are unable to reject $H_0: \mu = 6.00$, and there is no strong evidence that the mean output voltage differs from 6.00 V at the 0.05 level of significance. Based on a sample of 10 measurements, the data are consistent with a mean output voltage of 6.00 V.
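The arithmetic in this example can be reproduced with the standard library. This is a minimal sketch: the variable names are illustrative, and the critical value 2.262 is the $t_{0.025,\,9}$ figure quoted in the example. Because the sample standard deviation is not rounded here, the statistic comes out near 1.25 rather than exactly 1.254.

```python
import math
import statistics

# Ten output-voltage determinations from Example 2
voltages = [6.05, 6.06, 6.03, 5.95, 6.00, 5.98, 6.04, 5.98, 6.02, 6.03]
mu0 = 6.00  # hypothesized mean

n = len(voltages)
x_bar = statistics.mean(voltages)
s = statistics.stdev(voltages)  # sample standard deviation (divisor n - 1)

# One-sample t statistic
t_test = (x_bar - mu0) / (s / math.sqrt(n))

t_crit = 2.262  # t_{0.025, 9} as quoted in the example

# Two-sided test: reject H0 when |t_test| exceeds the critical value
reject_h0 = abs(t_test) > t_crit

print(round(x_bar, 3), round(s, 4), round(t_test, 2), reject_h0)
```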
___________________________________________________________________________
Task 2
Suppose you are a buyer of large supplies of mobile phone batteries. You want to test the manufacturer's claim that his mobile phone batteries last more than 900 hours. You test 40 batteries and find that the sample mean is 922 hours and the sample standard deviation is 68 hours. Should you accept the claim? Use $\alpha = 0.05$.
$z_{test} = 2.0462$; reject $H_0$
___________________________________________________________________________
Task 3
A manufacturer of transistors claims that its transistors will last an average of 2100 hours. To maintain this average, 20 transistors are tested each month. What conclusions should be drawn from a sample that has a mean of 2140 hours and a sample standard deviation of 87 hours? Assume that the distribution of the lifetime of the transistors is normal. Use $\alpha = 0.01$.
t test
_______________________________________________________________________________
4.3
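Hypothesis tests on the population variance or standard deviation are as important as tests on the population mean. For example, we may wish to test whether a random sample is drawn from a normal population with a specific known variance, say $\sigma_0^2$, or equivalently, that the standard deviation is equal to $\sigma_0$. To test
$H_0: \sigma^2 = \sigma_0^2$
$H_1: \sigma^2 \ne \sigma_0^2$   (4.14)
If the null hypothesis $H_0: \sigma^2 = \sigma_0^2$ is true, the test statistic used is given by the random variable
$\chi^2 = \dfrac{(n - 1) S^2}{\sigma_0^2}$   (4.15)
which has a chi-square, $\chi^2$, distribution with $n - 1$ degrees of freedom. We will use the test statistic
$\chi^2_{test} = \dfrac{(n - 1) s^2}{\sigma_0^2}$   (4.16)
with $n - 1$ degrees of freedom. Table 4.4 summarizes the critical regions needed for each of the possible alternative hypotheses.
Table 4.4: Critical regions for testing the variance
Test statistic: $\chi^2 = \dfrac{(n - 1) S^2}{\sigma_0^2}$, with $\nu = n - 1$ degrees of freedom
Case   Null hypothesis                 Alternative hypothesis            Rejection region
1      $H_0: \sigma^2 = \sigma_0^2$    $H_1: \sigma^2 \ne \sigma_0^2$    $\chi^2_{test} > \chi^2_{\alpha/2,\,n-1}$ or $\chi^2_{test} < \chi^2_{1-\alpha/2,\,n-1}$
2      $H_0: \sigma^2 = \sigma_0^2$    $H_1: \sigma^2 > \sigma_0^2$      $\chi^2_{test} > \chi^2_{\alpha,\,n-1}$
3      $H_0: \sigma^2 = \sigma_0^2$    $H_1: \sigma^2 < \sigma_0^2$      $\chi^2_{test} < \chi^2_{1-\alpha,\,n-1}$
___________________________________________________________________________
Example 3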
A drilling machine is used to drill metal plates used in batteries. A random sample of 25 plates results in a sample variance of hole diameter of $s^2 = 1.82$ mm$^2$. If the variance of hole diameter exceeds 1.00 mm$^2$, the drilling machine must be serviced. Is there evidence that the machine needs to be serviced? Use $\alpha = 0.01$, and assume that hole diameter has a normal distribution.
Solution
The solution using the outline in Section 4.1 is as follows:
1. The parameter of interest is the population variance, $\sigma^2$, the variance of hole diameter.
2. The null hypothesis and alternative hypothesis are
$H_0: \sigma^2 = 1.00$ mm$^2$ versus $H_1: \sigma^2 > 1.00$ mm$^2$
3. $\alpha = 0.01$
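The test statistic for this example follows directly from equation (4.16). This is a minimal sketch, assuming the upper-tail rule (reject $H_0$ when $\chi^2_{test}$ exceeds $\chi^2_{\alpha,\,n-1}$); the critical value 42.980 for 24 degrees of freedom is an assumption taken from a standard chi-square table, not a figure quoted in this text.

```python
# Example 3 figures: n = 25 plates, sample variance 1.82 mm^2,
# hypothesized variance 1.00 mm^2
n = 25
s_squared = 1.82
sigma0_squared = 1.00

# Chi-square statistic for a single variance, equation (4.16)
chi2_test = (n - 1) * s_squared / sigma0_squared

chi2_crit = 42.980  # chi^2_{0.01, 24} from a standard table (assumed value)

# Upper-tail test: reject H0 when the statistic exceeds the critical value
reject_h0 = chi2_test > chi2_crit

print(round(chi2_test, 2), reject_h0)
```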
Since $\hat{p} = \dfrac{10}{200} = 0.05$,
$z_{test} = \dfrac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0)/n}} = \dfrac{0.05 - 0.03}{\sqrt{0.03 (1 - 0.03)/200}} = 1.6581$
6. Since the value $z_{test} = 1.6581$ is between $-1.96$ and $1.96$, we are unable to reject $H_0: \pi = 0.03$, and there is no strong evidence that the percentage of defective components is not equal to 3% at the 0.05 level of significance. The data are consistent with the percentage of defective components being 3%.
___________________________________________________________________________
For small
probabilities.
___________________________________________________________________________
Task 6
An electrical company claimed that at least 90% of the parts which they supplied on a government contract conformed to specifications. A sample of 280 parts was tested, and 35 did not meet specifications. Can we accept the company's claim at a 0.05 level of significance?
z test
___________________________________________________________________________
Task 7
The manufacturer of electronic devices informed his buyer about the proportion of defective devices in its shipments. He claims that the proportion of all devices that are defective is less than 6%. A random sample of 100 electronic devices indicates that 5 are defective. Using $\alpha = 0.05$, test whether the buyer will accept the manufacturer's claim or not.
z test
______________________________
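Variance known
Let $X_{11}, X_{12}, \ldots, X_{1n_1}$ be a random sample of $n_1$ observations from the first population, and similarly for the second population. The test is based on the statistic
$Z = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
with $-z_{\alpha/2}$ and $z_{\alpha/2}$ as the boundaries of the critical region. This result and two other cases are included in Table 4.6.
Table 4.6: Testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are known
Test statistic: $Z = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
Case   Null hypothesis             Alternative hypothesis        Rejection region
1      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 \ne 0$    $z_{test} > z_{\alpha/2}$ or $z_{test} < -z_{\alpha/2}$
2      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 > 0$      $z_{test} > z_{\alpha}$
3      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 < 0$      $z_{test} < -z_{\alpha}$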
___________________________________________________________________________
Task 8
A manufacturer is comparing the settings of two machines, M1 and M2, which should produce rods of the same length. Both have, over a long period, given rods whose lengths were normally distributed with variance 37 cm$^2$. Although the two machines are supposed to give the same length of rod, he suspects that this is not so. Examine this suspicion, if the total length of 15 rods from M1 is 513 cm, and the total length of 20 rods from M2 is 575 cm. Use $\alpha = 0.05$.
$z_{test} = 2.6231$; reject $H_0$
________________________________________________________________________________
4.5.2
Variance unknown
If the sample sizes $n_1$ and $n_2$ are large (commonly, greater than or equal to 30), the normal distribution procedures in Section 4.5.1 can be used, with $\sigma_1^2$ and $\sigma_2^2$ replaced by $S_1^2$ and $S_2^2$, respectively.
However, when the sample sizes $n_1$ and $n_2$ are small (commonly, $n < 30$) and the populations are normally distributed, our hypothesis testing is based on the t distribution. Two different assumptions must be treated. Firstly, we assume that the variances of the two normal distributions are unknown but equal, $\sigma_1^2 = \sigma_2^2 = \sigma^2$. Secondly, we assume that the variances of the two normal distributions are unknown and not equal, $\sigma_1^2 \ne \sigma_2^2$.
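(i) when $\sigma_1^2 = \sigma_2^2 = \sigma^2$
The variance of $\bar{X}_1 - \bar{X}_2$ is
$\mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2} = \dfrac{\sigma^2}{n_1} + \dfrac{\sigma^2}{n_2} = \sigma^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)$
so that
$Z = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sigma \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
Since $\sigma$ is unknown, we replace it with $S_p$, the pooled estimator of $\sigma$. The pooled estimator of $\sigma^2$, denoted by $S_p^2$, is defined by
$S_p^2 = \dfrac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$
The test statistic is
$T = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{S_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}$
Table 4.7: Testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and equal
Test statistic: $T = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$, with $\nu = n_1 + n_2 - 2$ degrees of freedom
Case   Null hypothesis             Alternative hypothesis        Rejection region
1      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 \ne 0$    $t_{test} > t_{\alpha/2,\,\nu}$ or $t_{test} < -t_{\alpha/2,\,\nu}$
2      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 > 0$      $t_{test} > t_{\alpha,\,\nu}$
3      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 < 0$      $t_{test} < -t_{\alpha,\,\nu}$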
Example 5
A researcher wants to prove that brand X size AAA batteries last an average of at least 30 minutes longer than brand Y. Two normally distributed independent random samples of 10 batteries of each brand are selected, and the batteries are run continuously until they are no longer functional. The sample mean life for brand X is found to be $\bar{x}_1 = 328$ minutes, and the sample standard deviation is $s_1 = 46$ minutes. The results for the brand Y batteries are $\bar{x}_2 = 472$ minutes and $s_2 = 52$ minutes. Is there evidence that brand X batteries last at least 30 minutes longer than brand Y batteries of the same size? Use $\alpha = 0.05$ and assume the two population variances are equal.
Solution
1. The parameters of interest are $\mu_1$ and $\mu_2$, the mean life of the batteries.
2. $H_0: \mu_1 - \mu_2 = 30$ versus $H_1: \mu_1 - \mu_2 > 30$
3. $\alpha = 0.05$.
4. Reject $H_0$ if $t_{test} > t_{\alpha,\,n_1 + n_2 - 2} = t_{0.05,\,18} = 1.734$.
5. The pooled variance is
$s_p^2 = \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} = \dfrac{(10 - 1) 46^2 + (10 - 1) 52^2}{10 + 10 - 2} = 2410$
$s_p = \sqrt{2410} = 49.0918$
$t_{test} = \dfrac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \dfrac{328 - 472 - 30}{49.0918 \sqrt{\dfrac{1}{10} + \dfrac{1}{10}}} = -7.9255$
6. Since $t_{test} = -7.9255$ is less than 1.734, we fail to reject $H_0$. There is no evidence that brand X batteries last at least 30 minutes longer than brand Y batteries of the same size.
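The pooled calculation above can be sketched as a small function. The name `pooled_t` and its `delta0` parameter (the hypothesized difference) are hypothetical choices for illustration; the formulas are the pooled estimator and test statistic defined in this section.

```python
import math


def pooled_t(x1, s1, n1, x2, s2, n2, delta0=0.0):
    """Return (t, pooled variance) for two samples with equal unknown variances."""
    # Pooled estimator of the common variance
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    # Standard error of the difference in sample means
    se = math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)
    return (x1 - x2 - delta0) / se, sp2


# Example 5 figures: brand X (328, 46, 10), brand Y (472, 52, 10), difference 30
t_test, sp2 = pooled_t(328, 46, 10, 472, 52, 10, delta0=30)
print(sp2, round(t_test, 4))

# Task 10 figures: rope R1 (200, 5, 15), rope R2 (188, 6, 10), difference 0
t_ropes, _ = pooled_t(200, 5, 15, 188, 6, 10)
print(round(t_ropes, 4))
```

The sign of the statistic matters for one-sided tests, so the function keeps it rather than returning an absolute value.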
___________________________________________________________________________
Task 9
A problem solving test was given to two groups of 35 and 40 engineers, respectively. In the first group the mean score was 82 with a standard deviation of 5, while in the second group the mean score was 77 with a standard deviation of 10. Is there a significant difference between the performances of the two groups at the 5% level of significance? Assume the two population variances are equal.
$z_{test} = 2.6780$; reject $H_0$
___________________________________________________________________________
Task 10
An experiment is done to test the strength of two types of rock climbing ropes, namely R1 and R2. A sample of 15 pieces of rope R1 has a mean strength of 200 kg and a standard deviation of 5 kg. A sample of 10 pieces of rope R2 has a mean strength of 188 kg and a standard deviation of 6 kg. Assume the two population variances are equal. Test whether the mean strength of R1 is greater than that of R2 at the 1% level of significance.
$t_{test} = 5.4299$; reject $H_0$
_________________________________________________________________________________
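(ii) when $\sigma_1^2 \ne \sigma_2^2$
When we cannot assume that the unknown variances $\sigma_1^2$ and $\sigma_2^2$ are equal, there is no exact test statistic for testing $H_0: \mu_1 - \mu_2 = 0$. However, if $H_0: \mu_1 - \mu_2 = 0$ is true, the statistic
$T = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$
is distributed approximately as t with degrees of freedom given by
$\nu = \dfrac{\left( \dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2} \right)^2}{\dfrac{(S_1^2/n_1)^2}{n_1 - 1} + \dfrac{(S_2^2/n_2)^2}{n_2 - 1}}$   (4.17)
The test procedure when the variances are unknown and unequal is summarized in Table 4.8.
Table 4.8: Testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and unequal
Test statistic: $T = \dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$, with $\nu$ degrees of freedom from equation (4.17)
Case   Null hypothesis             Alternative hypothesis        Rejection region
1      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 \ne 0$    $t_{test} > t_{\alpha/2,\,\nu}$ or $t_{test} < -t_{\alpha/2,\,\nu}$
2      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 > 0$      $t_{test} > t_{\alpha,\,\nu}$
3      $H_0: \mu_1 - \mu_2 = 0$    $H_1: \mu_1 - \mu_2 < 0$      $t_{test} < -t_{\alpha,\,\nu}$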
Example 6
A scientist wants to determine how two catalysts will affect the mean yield of a chemical process. Two normally distributed independent random samples of $n_1 = 12$ for catalyst C1 and $n_2 = 10$ for catalyst C2 are selected. The sample mean yield for catalyst C1 is found to be $\bar{x}_1 = 152.25$ and the sample standard deviation is $s_1 = 3.44$. The results for catalyst C2 are $\bar{x}_2 = 150.85$ and $s_2 = 3.72$. Is there any difference between the mean yields? Use $\alpha = 0.01$ and assume the two population variances are unequal.
Solution
1. The parameters of interest are $\mu_1$ and $\mu_2$, the mean process yields.
2. $H_0: \mu_1 - \mu_2 = 0$ (or $H_0: \mu_1 = \mu_2$) versus $H_1: \mu_1 - \mu_2 \ne 0$ (or $H_1: \mu_1 \ne \mu_2$).
3. $\alpha = 0.01$.
4. We have $s_1 = 3.44$, $s_2 = 3.72$, $n_1 = 12$, $n_2 = 10$. The degrees of freedom of $t_{test}$ are found from equation (4.17) as
$\nu = \dfrac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} = \dfrac{\left( \dfrac{3.44^2}{12} + \dfrac{3.72^2}{10} \right)^2}{\dfrac{(3.44^2/12)^2}{12 - 1} + \dfrac{(3.72^2/10)^2}{10 - 1}} = 18.6489 \approx 19$
Therefore, we reject $H_0$ if $t_{test} > t_{\alpha/2,\,\nu} = t_{0.005,\,19} = 2.861$ or $t_{test} < -t_{\alpha/2,\,\nu} = -2.861$.
5. The test statistic is
$t_{test} = \dfrac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{152.25 - 150.85 - 0}{\sqrt{\dfrac{3.44^2}{12} + \dfrac{3.72^2}{10}}} = 0.9094$
6. Since $t_{test} = 0.9094$ is between $-2.861$ and $2.861$, we fail to reject $H_0$. We conclude that there is no evidence of a difference between the mean yields.
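Both the test statistic and the approximate degrees of freedom of equation (4.17) are easy to evaluate in code. A minimal sketch follows; the function name `unequal_variance_t` is hypothetical, and the degrees of freedom are returned unrounded so that the rounding step stays visible to the reader.

```python
import math


def unequal_variance_t(x1, s1, n1, x2, s2, n2, delta0=0.0):
    """Return (t, df) for two samples whose variances are not assumed equal."""
    v1 = s1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2  # estimated variance of the second sample mean
    t = (x1 - x2 - delta0) / math.sqrt(v1 + v2)
    # Approximate degrees of freedom, equation (4.17)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Example 6 figures: catalyst C1 (152.25, 3.44, 12), catalyst C2 (150.85, 3.72, 10)
t_test, df = unequal_variance_t(152.25, 3.44, 12, 150.85, 3.72, 10)
print(round(t_test, 4), round(df, 2))
```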
___________________________________________________________________________
4.6
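Suppose that two independent random samples of sizes $n_1$ and $n_2$ are taken from two large populations, and that $X_1$ and $X_2$ represent the observed numbers of successes in $n_1$ and $n_2$ trials, respectively. Then $\hat{P}_1 = X_1/n_1$ and $\hat{P}_2 = X_2/n_2$ are the observed proportions of successes. For $\hat{P}_1$, we require $n_1 \pi_1$ and $n_1 (1 - \pi_1)$ to be greater than or equal to 5 to make use of the normal approximation to the binomial distribution. Similarly, this applies to $\hat{P}_2$.
To test the hypotheses
$H_0: \pi_1 = \pi_2$
$H_1: \pi_1 \ne \pi_2$   (4.18)
we use the statistic
$Z = \dfrac{\hat{P}_1 - \hat{P}_2 - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\hat{P}_1 (1 - \hat{P}_1)}{n_1} + \dfrac{\hat{P}_2 (1 - \hat{P}_2)}{n_2}}}$
When $H_0$ is true, we can substitute $\pi_1 = \pi_2$ in the preceding formula for $Z$ to give the form
$Z = \dfrac{\hat{P}_1 - \hat{P}_2}{\sqrt{\hat{P} (1 - \hat{P}) \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}$
where
$\hat{P} = \dfrac{X_1 + X_2}{n_1 + n_2}$
is a pooled estimate of the common proportion. The statistic $Z$ is distributed approximately $N(0, 1)$.
Table 4.9: Testing $\pi_1 - \pi_2$
Test statistic: $Z = \dfrac{\hat{P}_1 - \hat{P}_2}{\sqrt{\hat{P} (1 - \hat{P}) \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}$
Case   Null hypothesis             Alternative hypothesis        Rejection region
1      $H_0: \pi_1 - \pi_2 = 0$    $H_1: \pi_1 - \pi_2 \ne 0$    $z_{test} > z_{\alpha/2}$ or $z_{test} < -z_{\alpha/2}$
2      $H_0: \pi_1 - \pi_2 = 0$    $H_1: \pi_1 - \pi_2 > 0$      $z_{test} > z_{\alpha}$
3      $H_0: \pi_1 - \pi_2 = 0$    $H_1: \pi_1 - \pi_2 < 0$      $z_{test} < -z_{\alpha}$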
__________________________________________________________________________________
Example 7
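A usual medication was given to a random sample of 180 patients from district A who have high fever. A new medication was given to a random sample of 200 patients from district B who also have high fever. If 144 and 180 patients recover from the fever, respectively, does the new medication cure the fever better? Use $\alpha = 0.05$.
Solution
1. The parameters of interest are $\pi_1$ and $\pi_2$, the proportions of patients who recover with the usual medication and the new medication, respectively.
2. $H_0: \pi_1 = \pi_2$ versus $H_1: \pi_1 < \pi_2$.
3. $\alpha = 0.05$
4. We reject $H_0$ if $z_{test} < -z_{\alpha} = -z_{0.05} = -1.6449$. Refer to Table 6 of Lee (2004).
5. We have
$\hat{p}_1 = \dfrac{144}{180} = 0.80$, $\hat{p}_2 = \dfrac{180}{200} = 0.90$
$\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2} = \dfrac{144 + 180}{180 + 200} = 0.8526$
$z_{test} = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p} (1 - \hat{p}) \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} = \dfrac{0.80 - 0.90}{\sqrt{(0.8526)(0.1474) \left( \dfrac{1}{180} + \dfrac{1}{200} \right)}} = -2.7456$
6. Since $z_{test} = -2.7456$ is less than $-1.6449$, we reject $H_0$. There is evidence that the new medication cures the fever better than the usual medication.
___________________________________________________________________________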
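The pooled two-proportion statistic can be sketched as follows. The function name `two_proportion_z` is hypothetical. Because the pooled proportion is not rounded to four decimals here, the result is about $-2.746$, agreeing with the hand calculation to three decimals.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Return the pooled z statistic for H0: pi_1 = pi_2."""
    p1 = x1 / n1
    p2 = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # pooled estimate under H0
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Example 7 figures: 144 of 180 recover (usual), 180 of 200 recover (new)
z_test = two_proportion_z(144, 180, 180, 200)

z_crit = 1.6449  # z_0.05 as quoted in the example

# Lower-tail test (H1: pi_1 < pi_2): reject H0 when z_test < -z_alpha
reject_h0 = z_test < -z_crit

print(round(z_test, 3), reject_h0)
```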
Task 11
A random sample of 150 students of UTM found that 102 were in favor of a new grading system, while another sample of 180 students of UKM found that 108 were in favor of the new system. Do the results indicate a significant difference in the proportions of UTM and UKM students who favor the new grading system? Use $\alpha = 0.01$.
z test
___________________________________________________________________________
Task 12
A geneticist is interested in the proportions of males and females in a population that have a certain minor blood disorder. He did a survey by taking a random sample of 100 males and 100 females. 31 of the males are found to be afflicted, whereas only 24 of the females appear to have the disorder. Can we conclude that the proportion of men in the population afflicted with this blood disorder is significantly greater than the proportion of women afflicted? Use level of significance $\alpha = 0.01$.
z test
___________________________________________________________________________
4.7
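The statistic
$F = \dfrac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$   (4.19)
has an F distribution with $n_1 - 1$ numerator degrees of freedom and $n_2 - 1$ denominator degrees of freedom. Under $H_0: \sigma_1^2 = \sigma_2^2$, the statistic reduces to
$F = \dfrac{S_1^2}{S_2^2}$
Table 4.10 summarizes the critical regions needed for each of the possible alternative hypotheses.
Table 4.10: Testing of ratio of two variances
Test statistic: $F = \dfrac{S_1^2}{S_2^2}$, with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ degrees of freedom
Case   Null hypothesis                   Alternative hypothesis              Rejection region
1      $H_0: \sigma_1^2 = \sigma_2^2$    $H_1: \sigma_1^2 \ne \sigma_2^2$    $f_{test} > f_{\alpha/2,\,\nu_1,\,\nu_2}$ or $f_{test} < f_{1-\alpha/2,\,\nu_1,\,\nu_2}$
2      $H_0: \sigma_1^2 = \sigma_2^2$    $H_1: \sigma_1^2 > \sigma_2^2$      $f_{test} > f_{\alpha,\,\nu_1,\,\nu_2}$
3      $H_0: \sigma_1^2 = \sigma_2^2$    $H_1: \sigma_1^2 < \sigma_2^2$      $f_{test} < f_{1-\alpha,\,\nu_1,\,\nu_2}$
Table 9 in Lee (2004) contains only upper-tail percentage points of the F distribution. If we need the lower-tail percentage points $f_{1-\alpha,\,\nu_1,\,\nu_2}$, they can be found from
$f_{1-\alpha,\,\nu_1,\,\nu_2} = \dfrac{1}{f_{\alpha,\,\nu_2,\,\nu_1}}$   (4.20)
For example,
$f_{0.999,\,6,\,12} = \dfrac{1}{f_{0.001,\,12,\,6}} = \dfrac{1}{17.99} = 0.0556$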
___________________________________________________________________________
Example 8
Company A and company B can supply a chemical material. The mean concentration for both companies is the same, but we suspect that the variability in concentration may differ between the two companies. The variance of concentration in a random sample of $n_1 = 8$ from company A is $s_1^2 = 12.4$ (grams per liter)$^2$, while for company B, a random sample of $n_2 = 10$ yields $s_2^2 = 13.8$ (grams per liter)$^2$. Is there sufficient evidence to conclude that the two population variances differ? We assume that concentration is a normal random variable for both companies. Use $\alpha = 0.02$.
Solution
The solution using the outline in Section 4.1 is as follows:
1. The parameters of interest are the variances of chemical concentration, $\sigma_1^2$ and $\sigma_2^2$.
2. The null hypothesis and alternative hypothesis are
$H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \ne \sigma_2^2$
3. $\alpha = 0.02$
4. Reject $H_0$ if
$f_{test} < f_{1-\alpha/2,\,\nu_1,\,\nu_2} = f_{1-0.02/2,\,8-1,\,10-1} = f_{0.99,\,7,\,9} = \dfrac{1}{f_{0.01,\,9,\,7}} = \dfrac{1}{6.72} = 0.1488$
or if
$f_{test} > f_{\alpha/2,\,\nu_1,\,\nu_2} = f_{0.02/2,\,8-1,\,10-1} = f_{0.01,\,7,\,9} = 5.61$
5. The test statistic is
$f_{test} = \dfrac{s_1^2}{s_2^2} = \dfrac{12.4}{13.8} = 0.8986$
6. Since the value $f_{test} = 0.8986$ is between 0.1488 and 5.61, we are unable to reject $H_0: \sigma_1^2 = \sigma_2^2$ at the 0.02 level of significance. Therefore, there is no strong evidence to conclude that the two population variances differ.
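The two-sided F test, including the reciprocal rule of equation (4.20) for the lower critical value, can be sketched as below. The table values 5.61 and 6.72 are the ones quoted in the example; the variable names are illustrative only.

```python
# Example 8 figures: sample variances and sample sizes for companies A and B
s1_squared, n1 = 12.4, 8
s2_squared, n2 = 13.8, 10

# Test statistic: ratio of the sample variances
f_test = s1_squared / s2_squared

# Upper critical value f_{0.01, 7, 9}, as quoted in the example
f_upper = 5.61

# Lower critical value via equation (4.20): f_{0.99, 7, 9} = 1 / f_{0.01, 9, 7}
f_lower = 1 / 6.72

# Two-sided test: reject H0 when f_test falls outside [f_lower, f_upper]
reject_h0 = f_test > f_upper or f_test < f_lower

print(round(f_test, 4), round(f_lower, 4), reject_h0)
```

Note that the degrees of freedom swap places in the reciprocal: the lower point for (7, 9) uses the upper point for (9, 7).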
f test
___________________________________________________________________________
Task 14
The following data represents the times taken by two machines in producing an electrical
part:
Machine
Time (in milliseconds)
_______________________________________________
1
108
86
98
109
92
81
165
97
134
87
114
_______________________________________________
Assuming that the distributions of the times are approximately normal, can we conclude that there is a significant difference in the variability of the times taken to produce an electrical part by machine 1 and machine 2 at $\alpha = 0.05$?
___________________________________________________________________________
EXERCISE 4
1. Test the hypothesis that the random sample
30.4   31.2   30.8   29.9   30.4   30.7   29.9   30.1
came from a normal population with mean 30.5. The standard deviation of the measurements is known to be 0.1. Use $\alpha = 0.05$.
2. A sample of size 60 yielded the values $\bar{x} = 46.7$ and $s^2 = 41.5$. Test the hypothesis that $\mu = 45$ against the alternative that it is greater. Use $\alpha = 0.05$.
3. Repeat question (1) without assuming that the standard deviation is known to be 0.1. In other words, estimate the population variance from the sample measurements. Use $\alpha = 0.05$.
4. A manufacturer claims that the standard mean volume per bottle of shampoo is 250 milliliters. Ten bottles are taken at random from a batch and the volume per bottle is measured. The ten measurements have a sample mean of 243 milliliters and a standard deviation of 7 milliliters. Assume approximate normality of the data. Is this sample mean significantly below the claimed value? Use $\alpha = 0.01$.
5. The standard deviation of the breaking strengths of certain cables produced by a company is given as 240 kg. After a change was introduced in the process of manufacturing these cables, the breaking strengths of a sample of 8 cables showed a standard deviation of 300 kg. Investigate the significance of the apparent increase in variability. Use $\alpha = 0.01$.
6. A semiconductor company claimed that at least 99% of the electronic components which they export are without defect. A sample of 150 electronic components was tested, and 12 were found to be defective. Can we accept the company's claim at a 0.01 level of significance?
7. An opinion survey in district D1 found that 68% of people considered electricity tariffs to be too high. A random sample of 35 people in district D2 were asked the same question, and 21 thought electricity tariffs were too high. Is this proportion significantly different from that of district D1? Use $\alpha = 0.05$.
do county voters favoring the proposal is lower than the proposal of town voters? Use $\alpha = 0.05$.
11. A sample of males and a sample of females were polled on an issue. 120 of 250 males and 126 of 300 females voted yes on the issue. Can we conclude that more males than females favor the issue? Use $\alpha = 0.02$.
12. Repeat exercise 11 but using $\alpha = 0.10$.
13. Two types of soil, namely S1 and S2, at a certain district were tested for their gamma radiation dose. A random sample of 6 measurements of S1 showed a mean of 7.52 with a standard deviation of 0.024. A random sample of 5 measurements of S2 showed a mean of 7.49 with a standard deviation of 0.032. Assume both population variances are equal.
(a) Determine whether the two types of soil have different gamma radiation doses. Use $\alpha = 0.05$.
(b) Determine whether the two types of soil differ in the variability of gamma radiation doses. Use $\alpha = 0.01$.
Chapter 5
Chi-Square Tests
Learning Objectives:
At the end of this chapter, students should be able to
(a) apply the goodness-of-fit test.
(b) summarize data in contingency table.
(c) apply the independence test.
(d) apply the homogeneity test.
5.1
Introduction
We have seen in previous chapters that some random variables follow certain distributions such as the binomial, Poisson and normal distributions. We either make an assumption about the distribution, or we know that the random variables follow specific distributions.
In the next section of this chapter we introduce a method to test such an assumption, known as the goodness-of-fit test, which requires the data to be presented in a frequency distribution. In this chapter, we will also discuss two methods of data analysis in which a data set is presented in a contingency table. The two analyses are the independence test and the homogeneity test, discussed in Sections 5.3 and 5.4 respectively.
5.2
Consider the result obtained from an experiment of tossing a die 300 times, as shown in Table 5.1 below:
Table 5.1: Frequency distribution
_____________________________________________
Outcome      1    2    3    4    5    6
_____________________________________________
Frequency    45   52   60   58   44   41
_____________________________________________
There are six possible outcomes for each trial, i.e. obtaining the number 1, 2, 3, 4, 5 or 6. These outcomes are also referred to as categories. The question we would like to answer is whether the die is fair. The results of the experiment are the evidence for concluding whether the die is fair or otherwise. We know that a fair die has the following characteristic:
$P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \dfrac{1}{6}$
If $X$ is a random variable representing the outcome obtained for each trial, then $X$ follows the uniform distribution with $P(X = x) = \dfrac{1}{6}$ for $x = 1, 2, 3, 4, 5, 6$. The objective is to test the hypothesis that the die is fair, which can be stated as below:
$H_0: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \dfrac{1}{6}$
$H_1: P(X = i) \ne P(X = j)$ for some $i, j = 1, 2, 3, 4, 5, 6$; $i \ne j$
The statement in $H_0$ is equivalent to the die being fair and the statement in $H_1$ is equivalent to the die not being fair. If the die is fair, the expected frequency for the outcome $x_i$ or category $i$ is
$E_i = n \times P(X = i)$ for $i = 1, 2, 3, 4, 5, 6$
where
$E_1 = E_2 = E_3 = E_4 = E_5 = E_6 = 300 \times \dfrac{1}{6} = 50$
The observed frequencies are
$O_1 = 45,\ O_2 = 52,\ O_3 = 60,\ O_4 = 58,\ O_5 = 44,\ O_6 = 41$
which differ from the frequencies expected if the die is fair.
The logic is that if the die is fair, the difference between the observed and the expected frequencies, $O_i - E_i$, should be small. The difference between the observed and the expected frequencies forms the statistic to test the hypothesis regarding the probability distribution of the random variable. The statistic is stated in the following theorem.
Theorem 4 The statistic
$\chi^2 = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i}$
follows approximately the chi-square distribution with $\nu = k - p - 1$ degrees of freedom, where $k$ is the number of categories and $p$ is the number of parameters estimated from the data.
Note: This theorem is applicable if the least expected value $E_i$ is at least 5, i.e. $E_i \ge 5$ for all $i$.
The null hypothesis is rejected if
$\chi^2 = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i} > \chi^2_{\alpha,\,k-p-1}$
at significance level $\alpha$.
Now we show the procedure to calculate the statistic $\chi^2$. Since the statistic $\chi^2$ is calculated from the observed sample, we use the same convention as in the previous chapter, denoting $\chi^2_{test}$ as the calculated statistic $\chi^2$.
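________________________________________________________
$O_i$          $E_i = n \times P(i)$                      $\dfrac{(O_i - E_i)^2}{E_i}$
________________________________________________________
$O_1 = 45$     $E_1 = 300 \times \dfrac{1}{6} = 50$       $\dfrac{(45 - 50)^2}{50} = 0.50$
$O_2 = 52$     $E_2 = 300 \times \dfrac{1}{6} = 50$       $\dfrac{(52 - 50)^2}{50} = 0.08$
$O_3 = 60$     $E_3 = 300 \times \dfrac{1}{6} = 50$       $\dfrac{(60 - 50)^2}{50} = 2.00$
$O_4 = 58$     $E_4 = 300 \times \dfrac{1}{6} = 50$       $\dfrac{(58 - 50)^2}{50} = 1.28$
$O_5 = 44$     $E_5 = 300 \times \dfrac{1}{6} = 50$       $\dfrac{(44 - 50)^2}{50} = 0.72$
$O_6 = 41$     $E_6 = 300 \times \dfrac{1}{6} = 50$       $\dfrac{(41 - 50)^2}{50} = 1.62$
________________________________________________________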
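So
$\chi^2_{test} = \sum_{i=1}^{6} \dfrac{(O_i - E_i)^2}{E_i} = 0.50 + 0.08 + 2.00 + 1.28 + 0.72 + 1.62 = 6.2$
and we accept $H_0$ if $\chi^2_{test} < \chi^2_{0.05,\,6-1} = 11.070$. Note that $\nu = k - 1$ since unknown parameters are absent.
Since $\chi^2_{test} = 6.2 < 11.070$, we accept $H_0$ and conclude that there is no evidence that the die is not fair.
In general, suppose the observed frequencies for $k$ categories are $O_1, O_2, \ldots, O_k$ and $n = O_1 + O_2 + \cdots + O_k$.
Category     1       2       ...   k
Frequency    $O_1$   $O_2$   ...   $O_k$
The belief is that the probability of category $i$ occurring, $P(i)$, is stated in the null hypothesis $H_0$ as
$H_0: P(i) = \pi_i$ for $i = 1, 2, \ldots, k$.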
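The goodness-of-fit calculation for the die experiment can be sketched as a small reusable function. The name `chi_square_gof` is hypothetical; it simply evaluates the sum of $(O_i - E_i)^2 / E_i$ for given observed counts and hypothesized category probabilities.

```python
def chi_square_gof(observed, probabilities):
    """Return (statistic, expected) for a chi-square goodness-of-fit test."""
    n = sum(observed)
    # Expected frequency of each category under H0
    expected = [n * p for p in probabilities]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


# Die experiment from Table 5.1: 300 tosses, fair die under H0
observed = [45, 52, 60, 58, 44, 41]
chi2_test, expected = chi_square_gof(observed, [1 / 6] * 6)

chi2_crit = 11.070  # chi^2_{0.05, 5} as quoted in the text

reject_h0 = chi2_test > chi2_crit
print(round(chi2_test, 2), reject_h0)
```

The same function applies to any set of hypothesized probabilities $\pi_i$, provided every expected frequency is at least 5.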
Example 1
The authority claims that the proportions of road accidents occurring in this country according to the categories User Attitude (A), Mechanical Fault (M), Insufficient Sign Board (I) and Fate (F) are 60%, 20%, 15% and 5% respectively. A study by an independent body shows the following data:
Category     A     M     I     F     Total
Frequency    130   35    30    5     200
_______________________________________________________________
$O_i$          $E_i = n \times P(i)$              $\dfrac{(O_i - E_i)^2}{E_i}$
_______________________________________________________________
$O_A = 130$    $E_A = 0.6 \times 200 = 120$       $\dfrac{(130 - 120)^2}{120} = 0.833$
$O_M = 35$     $E_M = 0.2 \times 200 = 40$        $\dfrac{(35 - 40)^2}{40} = 0.625$
$O_I = 30$     $E_I = 0.15 \times 200 = 30$       $\dfrac{(30 - 30)^2}{30} = 0.00$
$O_F = 5$      $E_F = 0.05 \times 200 = 10$       $\dfrac{(5 - 10)^2}{10} = 2.500$
_______________________________________________________________
$\chi^2_{test} = 0.833 + 0.625 + 0.00 + 2.500 = 3.958$
At $\alpha = 0.05$, reject $H_0$ if $\chi^2_{test} > \chi^2_{0.05,\,3} = 7.815$. Thus we accept $H_0$ and conclude that we have no evidence to reject the authority's claim.
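___________________________________________________________________________
Example 2
No. of students    0     1     2     3     4     $\ge 5$
No. of days        12    32    45    50    35    26
If $X$ is a random variable representing the number of students playing truant per day, test the hypothesis that $X$ follows the Poisson distribution with mean 3 per day at $\alpha = 0.01$.
Solution
$n = 12 + 32 + 45 + 50 + 35 + 26 = 200$, $k = 6$
For $X \sim P_0(3)$,
$P(X = 0) = 0.0498$, $P(X = 1) = 0.1493$, $P(X = 2) = 0.2241$,
$P(X = 3) = 0.2240$, $P(X = 4) = 0.1681$, $P(X \ge 5) = 0.1847$
_______________________________________________________________
$O_i$         $E_i = n \times P(X = i)$    $\dfrac{(O_i - E_i)^2}{E_i}$
_______________________________________________________________
$O_0 = 12$    9.96                         $\dfrac{(12 - 9.96)^2}{9.96} = 0.42$
$O_1 = 32$    29.86                        $\dfrac{(32 - 29.86)^2}{29.86} = 0.15$
$O_2 = 45$    44.82                        $\dfrac{(45 - 44.82)^2}{44.82} = 0.00$
$O_3 = 50$    44.80                        $\dfrac{(50 - 44.80)^2}{44.80} = 0.60$
$O_4 = 35$    33.62                        $\dfrac{(35 - 33.62)^2}{33.62} = 0.06$
$O_5 = 26$    36.94                        $\dfrac{(26 - 36.94)^2}{36.94} = 3.24$
_______________________________________________________________
$\chi^2_{test} = 0.42 + 0.15 + 0.00 + 0.60 + 0.06 + 3.24 = 4.47$
At $\alpha = 0.01$, reject $H_0$ if $\chi^2_{test} > \chi^2_{0.01,\,5} = 15.086$. Thus, $H_0$ is accepted and we conclude that there is no evidence that the number of students playing truant per day does not follow the Poisson distribution with mean 3 per day.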
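The Poisson probabilities and the resulting statistic can be computed directly rather than read from tables. This is a minimal sketch, assuming the last category collects all counts of 5 or more, which is how the tabulated probability 0.1847 arises; `poisson_pmf` is a hypothetical helper name. With unrounded probabilities the statistic is about 4.47.

```python
import math


def poisson_pmf(k, lam):
    """Return P(X = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Observed numbers of days for X = 0, 1, 2, 3, 4 and X >= 5
observed = [12, 32, 45, 50, 35, 26]
lam = 3  # hypothesized mean per day

# Probabilities for 0..4, then the remaining tail probability for X >= 5
probabilities = [poisson_pmf(k, lam) for k in range(5)]
probabilities.append(1 - sum(probabilities))

n = sum(observed)
expected = [n * p for p in probabilities]
chi2_test = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

chi2_crit = 15.086  # chi^2_{0.01, 5} as quoted in the example
reject_h0 = chi2_test > chi2_crit

print(round(chi2_test, 2), reject_h0)
```

If the mean were estimated from the data instead of being specified, one degree of freedom would be lost and a different critical value would apply.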
___________________________________________________________________________
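Example 3
It is believed that the IQ scores of all adults follow the normal distribution with mean 110 and standard deviation 10. The scores of an IQ test given to 250 adults are summarized below, where $X$ represents the IQ score.
IQ Score               Frequency
$X < 90$               2
$90 \le X < 100$       30
$100 \le X < 110$      85
$110 \le X < 120$      90
$120 \le X < 130$      40
$X \ge 130$            3
Total                  250
$H_0: X \sim N(110, 10^2)$
Assuming $H_0$ is correct, $Z = \dfrac{X - 110}{10}$.
_______________________________________________
IQ Score               Probability
_______________________________________________
$X < 90$               $P(Z < -2) = 0.0228$
$90 \le X < 100$       $P(-2 \le Z < -1) = 0.1359$
$100 \le X < 110$      $P(-1 \le Z < 0) = 0.3413$
$110 \le X < 120$      $P(0 \le Z < 1) = 0.3413$
$120 \le X < 130$      $P(1 \le Z < 2) = 0.1359$
$X \ge 130$            $P(Z \ge 2) = 0.0228$
_______________________________________________
$O_i$         $E_i = n \times P(X = i)$    $\dfrac{(O_i - E_i)^2}{E_i}$
_______________________________________________
$O_1 = 2$     5.70                         $\dfrac{(2 - 5.70)^2}{5.70} = 2.40$
$O_2 = 30$    33.98                        $\dfrac{(30 - 33.98)^2}{33.98} = 0.47$
$O_3 = 85$    85.33                        $\dfrac{(85 - 85.33)^2}{85.33} = 0.00$
$O_4 = 90$    85.33                        $\dfrac{(90 - 85.33)^2}{85.33} = 0.26$
$O_5 = 40$    33.98                        $\dfrac{(40 - 33.98)^2}{33.98} = 1.07$
$O_6 = 3$     5.70                         $\dfrac{(3 - 5.70)^2}{5.70} = 1.28$
_______________________________________________
$\chi^2_{test} = 2.40 + 0.47 + 0.00 + 0.26 + 1.07 + 1.28 = 5.48$
At $\alpha = 0.05$, reject $H_0$ if $\chi^2_{test} > \chi^2_{0.05,\,5} = 11.070$. Thus, we fail to reject $H_0$ and conclude that there is no evidence that the IQ scores do not follow the normal distribution with mean 110 and standard deviation 10.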
___________________________________________________________________________
Task 1
It is believed that the number of scratches on a compact disk produced by a process follows the Poisson distribution with mean 2.5 scratches per disk. The following data show the number of disks with the corresponding number of scratches on them:
Number of scratches    0    1     2     3     4     $\ge 5$
Number of disks        5    22    30    20    15    8
Test the belief at significance level 0.01.
$k = 6$, then $\nu = 5$; $\chi^2_{test} = 3.1523 < 15.086$; fail to reject $H_0$
Task 2
Repeat the question in Task 1 above, but without knowing the true mean value. What differences may you encounter?
$k = 6$, $p = 1$, then $\nu = 4$; $\chi^2_{test} = 3.1869 < 13.277$; fail to reject $H_0$
___________________________________________________________________________
5.3
Independence Test
____________________________________
Student    Bespectacled    Result
_______________________________________
A          Yes             Excellent
B          No              Excellent
C          Yes             Good
D          Yes             Excellent
E          No              Good
F          No              Good
G          Yes             Excellent
_______________________________________
                 Bespectacled
Result           Yes    No
Good             1      2
Excellent        3      1
Usually, the question we have in mind when dealing with data in a contingency table is whether the two variables are independent. Independence means that the two variables do not influence each other. Thus, in the example above, we want to test whether being bespectacled or not influences the students' Maths results. This test is called the independence test, and it capitalizes on the definition of independent events in probability:
Two events $A$ and $B$ are independent if and only if
$P(A \cap B) = P(A) P(B)$.
To understand this test further, we introduce the two-dimensional contingency table in its general form.
In general, a two-dimensional contingency table is of the form below:
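                                   Column Variable
Row Variable       Category $B_1$    Category $B_2$    ...    Category $B_c$
Category $A_1$     $O_{11}$          $O_{12}$          ...    $O_{1c}$
Category $A_2$     $O_{21}$          $O_{22}$          ...    $O_{2c}$
...
Category $A_r$     $O_{r1}$          $O_{r2}$          ...    $O_{rc}$
The above contingency table is an $r \times c$ contingency table, where $r$ denotes the number of categories of the row variable, $c$ denotes the number of categories of the column variable, and $O_{ij}$ is the observed frequency in cell $(i, j)$, i.e. the observed frequency for the $i$th category of the row variable and the $j$th category of the column variable. The row totals are denoted by $n_{i\cdot}$ and the column totals by $n_{\cdot j}$:
                                   Column Variable
Row Variable       Category $B_1$              Category $B_2$    ...    Category $B_c$              Total
Category $A_1$     $O_{11}$ $(A_1 \cap B_1)$   $O_{12}$          ...    $O_{1c}$ $(A_1 \cap B_c)$   $n_{1\cdot}$
Category $A_2$     $O_{21}$ $(A_2 \cap B_1)$   $O_{22}$          ...    $O_{2c}$ $(A_2 \cap B_c)$   $n_{2\cdot}$
...
Category $A_r$     $O_{r1}$ $(A_r \cap B_1)$   $O_{r2}$          ...    $O_{rc}$ $(A_r \cap B_c)$   $n_{r\cdot}$
Total              $n_{\cdot 1}$               $n_{\cdot 2}$     ...    $n_{\cdot c}$               $n$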
Most often, we do not know the true values of $P(A_i)$ or $P(B_j)$, but we know from Chapter 3 on estimation that the best estimator for a population proportion or probability is the sample proportion. Thus
$\hat{P}(A_i) = \dfrac{n_{i\cdot}}{n}$ and $\hat{P}(B_j) = \dfrac{n_{\cdot j}}{n}$
and, if $A_i$ and $B_j$ are independent,
$\hat{P}(A_i \cap B_j) = \hat{P}(A_i) \hat{P}(B_j) = \dfrac{n_{i\cdot}}{n} \cdot \dfrac{n_{\cdot j}}{n}$
With this estimated joint probability, we can find the expected frequency in each cell, $E_{ij}$, if $A_i$ and $B_j$ are independent. The expected frequency in cell $(i, j)$ is
$E_{ij} = n \hat{P}(A_i \cap B_j) = n \hat{P}(A_i) \hat{P}(B_j) = n \cdot \dfrac{n_{i\cdot}}{n} \cdot \dfrac{n_{\cdot j}}{n} = \dfrac{n_{i\cdot}\, n_{\cdot j}}{n}$
Now, if $A_i$ and $B_j$ are truly independent, we anticipate that $O_{ij}$ and $E_{ij}$ do not differ, or that any difference between them is not significant. The statistic $O_{ij} - E_{ij}$ forms the basis for the independence test, which is stated in Theorem 2.
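Theorem 2
The statistic
$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
follows the chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom, i.e.
$\sum_{i=1}^{r} \sum_{j=1}^{c} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}$.
This test is a one-tailed test on the right, where $H_0$ is rejected if the calculated $\chi^2$ value is greater than $\chi^2_{\alpha,\,(r-1)(c-1)}$ at significance level $\alpha$. As before in this chapter, the calculated $\chi^2$ value is denoted by $\chi^2_{test}$. Thus, we reject $H_0$ if
$\chi^2_{test} > \chi^2_{\alpha,\,(r-1)(c-1)}$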
Example 4
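Insomnia is a disorder in which a person finds it hard to sleep at night. A study is conducted to determine whether the two attributes, smoking habit and insomnia, are dependent. The following data set was obtained:
                   Insomnia
Habit              Yes    No
Non-smokers        10     70
Ex-smokers         8      32
Smokers            22     38
Here $r = 3$, $c = 2$,
$n_{1\cdot} = 10 + 70 = 80$, $n_{2\cdot} = 8 + 32 = 40$, $n_{3\cdot} = 22 + 38 = 60$,
$n_{\cdot 1} = 10 + 8 + 22 = 40$, $n_{\cdot 2} = 70 + 32 + 38 = 140$,
$n = 10 + 70 + 8 + 32 + 22 + 38 = 180$.
$O_{ij}$           $E_{ij} = \dfrac{n_{i\cdot}\, n_{\cdot j}}{n}$           $\dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
$O_{11} = 10$      $E_{11} = \dfrac{80 \times 40}{180} = 17.78$     $\dfrac{(10 - 17.78)^2}{17.78} = 3.40$
$O_{12} = 70$      $E_{12} = \dfrac{80 \times 140}{180} = 62.22$    $\dfrac{(70 - 62.22)^2}{62.22} = 0.97$
$O_{21} = 8$       $E_{21} = \dfrac{40 \times 40}{180} = 8.89$      $\dfrac{(8 - 8.89)^2}{8.89} = 0.09$
$O_{22} = 32$      $E_{22} = \dfrac{40 \times 140}{180} = 31.11$    $\dfrac{(32 - 31.11)^2}{31.11} = 0.03$
$O_{31} = 22$      $E_{31} = \dfrac{60 \times 40}{180} = 13.33$     $\dfrac{(22 - 13.33)^2}{13.33} = 5.64$
$O_{32} = 38$      $E_{32} = \dfrac{60 \times 140}{180} = 46.67$    $\dfrac{(38 - 46.67)^2}{46.67} = 1.61$
$\chi^2_{test} = 3.40 + 0.97 + 0.09 + 0.03 + 5.64 + 1.61 = 11.74$.
The critical value at the 5% significance level is $\chi^2_{0.05,\,(3-1)(2-1)} = \chi^2_{0.05,\,2} = 5.991$ and the rule is to reject $H_0$ if $\chi^2_{test} > 5.991$. Since $\chi^2_{test} = 11.74 > 5.991$, we reject $H_0$ and conclude that smoking habit and insomnia are dependent.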
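The expected frequencies $E_{ij} = n_{i\cdot} n_{\cdot j} / n$ and the resulting statistic can be computed for any $r \times c$ table with a short function. This is a minimal sketch; the name `chi_square_independence` is hypothetical. Because the expected frequencies are not rounded to two decimals, the statistic for the insomnia data comes out near 11.7.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under independence: row total * column total / n
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Example 4: rows are non-smokers, ex-smokers, smokers; columns are Yes, No
insomnia = [[10, 70], [8, 32], [22, 38]]
chi2_test, df = chi_square_independence(insomnia)

chi2_crit = 5.991  # chi^2_{0.05, 2} as quoted in the example
reject_h0 = chi2_test > chi2_crit

print(round(chi2_test, 2), df, reject_h0)
```

The same computation serves the homogeneity test of Section 5.4, since only the sampling scheme and the wording of the hypotheses differ.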
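___________________________________________________________________________
Task 3
Activities
Inactive    40    80    60
Active      30    90    60
Use a 5% significance level to conduct the study.
$\nu = 2$; $\chi^2_{test} = 2.0168 < 5.991$; fail to reject $H_0$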
_____________________________________________________________________
Task 4
A study is conducted to determine whether management efficiency and the specialization sector are independent. The following data set was obtained:
                 Management Efficiency
Sector           Low    Fair    Good
Education        20     20      35
Health           15     25      40
Banking          15     30      80
Use a 1% significance level to conduct the study.
$\nu = 4$; $\chi^2_{test} = 9.7807 < 13.277$; fail to reject $H_0$
___________________________________________________________________________
5.4
Homogeneity Test
In the independence test each subject has the possibility of belonging to any of the
rc
cells. For further clarification, consider the following contingency table which shows the
frequency of students according to gender and their hand phone brands.
Hand phone brand
Male
Nokia
Samsung
Others
Total
80
60
30
170
Female
60
70
20
150
Total
140
130
50
320
If all 320 students are chosen at random regardless of their gender and hand phone brand, each student will be classified in one of the six joint categories and the test of independence is a valid test. In other words, each of the 320 students will belong to one and only one of the six cells of the contingency table. However, we may want to fix the number of male and female students in this study. For example, we may want to have 170 male students and 150 female students.
Thus a male student will belong to one of the joint categories (Male, Nokia), (Male, Samsung) or (Male, Others) of the male category, and not to any of the six joint categories. In other words, a male student will belong to one of the three cells of the male category. Similarly, a female student will belong to one of the three cells of the female category. Fixing the number of male and female students constrains the assignment of each subject to the relevant gender category. When we have such a constraint, we are actually comparing the distribution of hand phone brand preferences between the two genders. In this case, we fix the row totals $n_{i.}$. This means we are testing whether the preferences for Nokia, Samsung or other brands of hand phones are the same for male and female students.
Alternatively, we may prefer to fix the column totals $n_{.j}$, i.e. we select 140 Nokia users, 130 Samsung users and 50 users of other brands. Each user will be classified in the relevant cell, constrained by his/her brand preference. Thus, we are actually comparing the distribution of gender between the hand phone brands.
The relevant test is called the homogeneity test, where we test the similarity of two or more populations with regard to the distribution of a certain characteristic. For the fixed number of male and female students, the hypotheses are
$H_0$: The proportions of students preferring the three hand phone brands are the same for male and female students.
$H_1$: The proportions of students preferring the three hand phone brands are not the same for male and female students.
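To see what "the same proportions" means numerically, the sketch below treats the hand phone table as if the row totals (170 male, 150 female) had been fixed in advance, which is an assumption made only for illustration. It prints the brand proportions within each gender and the chi-square statistic obtained when every row is compared with the pooled column proportions. The function name `homogeneity_statistic` is hypothetical.

```python
def homogeneity_statistic(counts):
    """Chi-square statistic comparing each row's distribution with the pooled one."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    # Under H0 every row shares the pooled column proportions.
    pooled = [c / n for c in col_totals]
    statistic = 0.0
    for row, total in zip(counts, row_totals):
        for observed, p in zip(row, pooled):
            expected = total * p
            statistic += (observed - expected) ** 2 / expected
    return statistic


phones = {"Male": [80, 60, 30], "Female": [60, 70, 20]}   # Nokia, Samsung, Others
proportions = {g: [x / sum(row) for x in row] for g, row in phones.items()}
statistic = homogeneity_statistic(list(phones.values()))
df = (len(phones) - 1) * (3 - 1)

for gender, p in proportions.items():
    print(gender, [round(v, 3) for v in p])
print(round(statistic, 4), df)
```

The computation is identical to that of the independence test; only the sampling scheme and the wording of the hypotheses differ. The resulting statistic is compared with the table value of the chi-square distribution with 2 degrees of freedom at the chosen significance level.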
Task 5
200 female owners and 200 male owners of Proton cars are selected at random and
the colour of their cars are noted. The following data shows the results:
                 Car Colour
Gender     Black   Dull   Bright
Male         40     110      50
Female       20      80     100
Use a 1% significance level to test whether the proportions of colour preferences are the same for male and female.
[Ans: $\nu = 2$; $\chi^2_{test} = 28.07 > 9.210$; reject $H_0$]
Exercise 5
1. A random sample of 200 printed boards has been collected and the following number
of defects was observed:
Number of defects      0    1    2    3    4    5    6    7 and more
Observed Frequency    10   40   54   45   32    8    6    5
Can we conclude that the number of defects follows the Poisson distribution with mean 2.6 at significance level $\alpha = 0.05$?
Number of defective components    0    1    2    3    4    5    6 and more
Observed Frequency                5   10   18   19   16   12   20
Can we conclude that the number of defective electrical components follows the Poisson distribution at significance level $\alpha = 0.01$?
3. A manufacturing engineer is testing a power supply used in a notebook computer. The
complete table of observed frequencies is as follows:
Class interval               Observed frequencies $O_i$
$x < 4.948$                  12
$4.948 \le x < 4.986$        14
$4.986 \le x < 5.014$        12
$5.014 \le x < 5.040$        13
$5.040 \le x < 5.066$        12
$5.066 \le x < 5.094$        11
$5.094 \le x < 5.132$        12
$x \ge 5.132$                14
Test the hypothesis that the output voltage is adequately described by a normal distribution with mean 5.04 V and standard deviation 0.08 V at a significance level $\alpha = 0.05$.
4. A machine is supposed to mix 40% peanuts, 30% hazelnuts, 20% cashews, and 10%
pecans. A can containing 500 of these mixed nuts was found to have 269 peanuts, 112
hazelnuts, 74 cashews, and 45 pecans. At the 0.05 level of significance, test the
hypothesis that the machine is mixing the nuts according to the required percentages.
5. It is believed that the ratio of Bumiputera, Orang Asli and other students in the intake of the Faculty of Engineering is 14:3:3. A sample of 500 students chosen at random shows the following data:
                      Bumiputera   Orang Asli   Others
Number of Students       345           78          77
Status
Rejected
Non Rejected
Test the hypothesis that the status and classification are independent at significance level $\alpha = 0.05$.
7. A study was conducted to determine whether the type of painkiller administered to patients influences the level of pain felt by the patients. The following data set was obtained:
                  Painkiller
Level of Pain      A      B
No                20     10
A little          30     35
Strong            10     15
Test whether the level of pain and the type of painkiller are independent at significance level $\alpha = 0.01$.
8. A total of 1000 PVC pipes are sampled and categorized with respect to both length
and diameter specification. The results are presented in the following table:
                                     Length
Diameter              Too Short   Meet Specification   Too Long
Too Thick                 20              65               35
Meet Specification       115             550              145
Too Wide                  15              45               10
Test at 1% significance level whether the length and the diameter of the PVC pipes
are independent.
Shift       Defective   Non defective
Day            100           150
Evening        200           200
Night          200           150
Chapter 6
Analysis of Variance
Learning Objectives:
At the end of this chapter, students should be able to
a) Identify treatment, response and levels of treatment.
b) Analyse data using one-way ANOVA.
c) Perform one-way ANOVA using Microsoft Excel.
6.1 Introduction
In Chapter 4, we compared two population means, or in other words two levels of a factor, to decide whether there was any difference between the population means from which the samples came. However, researchers often want to examine differences among three or more population means. For example, researchers might want to compare five different temperatures in developing a polymer to be used in removing toxic wastes from water. The procedure that can be used for testing the equality of the means at these temperatures is one-way analysis of variance, or one-way ANOVA. The five different temperatures are known as five levels of the factor, or five treatments. A factor (or treatment) is a property, or characteristic, that allows us to distinguish the different populations from one another. The number of levels of the factor is commonly denoted by k.
The term treatment is used because early applications of analysis of variance involved agricultural experiments in which different plots of farmland were treated with different fertilizers, seed types, insecticides and so on.
To understand how analysis of variance works and why it is called analysis of variance, consider the example above. We obtain a random sample from the population and, for each temperature, we measure the percentage of impurities removed by the treatment. We will get different measurements for each temperature, which shows that there is variability within each group, that is, within each level of the factor.
In one-way analysis of variance, we partition the variability into two components: within group variability and between group variability. We then examine the ratio of the two, called the F ratio, which is obtained by dividing the between group variability by the within group variability. It is in this sense that ANOVA is an analysis of variance: the variance between groups is compared to the variance within groups.
After conducting a one-way analysis of variance, we might conclude that there is sufficient evidence to reject a claim of equal population means, but we cannot conclude from ANOVA which particular mean is different from the others.
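The model deals with specific factor levels and involves testing the null hypothesis against the alternative hypothesis, stated below:
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
$H_1: \mu_i \neq \mu_j$ for at least one pair $i \neq j$
The model is
$$y_{ij} = \mu_i + \varepsilon_{ij},$$
where $y_{ij}$ is the jth observation from the ith factor level, $\mu_i$ is the ith mean and $\varepsilon_{ij}$ is the random error.
An alternative and preferred form of this equation is obtained by substituting
$$\mu_i = \mu + \tau_i$$
with the restriction
$$\sum_{i=1}^{k} \tau_i = 0.$$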
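In carrying out ANOVA, it is important to know the following notations:

Factor level    Observations                               Total       Mean
1               $y_{11}$  $y_{12}$  ...  $y_{1n_1}$         $y_{1.}$    $\bar{y}_{1.}$
2               $y_{21}$  $y_{22}$  ...  $y_{2n_2}$         $y_{2.}$    $\bar{y}_{2.}$
...
i               $y_{i1}$  $y_{i2}$  ...  $y_{in_i}$         $y_{i.}$    $\bar{y}_{i.}$
...
k               $y_{k1}$  $y_{k2}$  ...  $y_{kn_k}$         $y_{k.}$    $\bar{y}_{k.}$
                                                            $y_{..}$    $\bar{y}_{..}$

(i) $y_{i.} = \sum_{j=1}^{n_i} y_{ij}$ is the total of the responses over a level.
(ii) $\bar{y}_{i.} = y_{i.} / n_i$ is the level mean.
(iii) $y_{..} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}$ is the grand total of the responses.
(iv) $\bar{y}_{..} = y_{..} / N$ is the overall mean of the data, where $N = \sum_{i=1}^{k} n_i$.

The total sum of squares is
$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y}_{..} \right)^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{..}^2}{N}.$$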
The equation for the sum of squares for the levels, which measures the variability due to the levels or factors, is
$$SSTrt = \sum_{i=1}^{k} n_i \left( \bar{y}_{i.} - \bar{y}_{..} \right)^2 = \sum_{i=1}^{k} \frac{y_{i.}^2}{n_i} - \frac{y_{..}^2}{N}.$$
With SST and SSTrt known, SSE can be calculated by the formula
$$SSE = SST - SSTrt.$$
The SSE term measures the variability of the data due to random error.
There are degrees of freedom terms associated with each of the sums of squares. The
degrees of freedom for factor, error and total are given by k-1, N-k and N-1, respectively.
Mean square values are calculated by dividing the sum of squares terms for the level and error by their respective degrees of freedom. These values represent the variance of the level and error components of the data. Mean square values for levels and errors are
$$MSTrt = \frac{SSTrt}{k - 1}, \qquad MSE = \frac{SSE}{N - k}.$$
The test statistic is
$$F_0 = \frac{MSTrt}{MSE}.$$
If $F_0 > f_{\alpha,\, k-1,\, N-k}$, we reject the null hypothesis and conclude that some of the variability of the data is due to differences in the factor levels.
6.4 Output
The general format for output for this type of analysis is an ANOVA table, which contains
basic information about the analysis:
Source of Variation         Sum of Squares   Degrees of Freedom   Mean Square   f calculated
Factor (between levels)     SSTrt            $k - 1$              MSTrt         MSTrt / MSE
Error (within levels)       SSE              $N - k$              MSE
Total                       SST              $N - 1$
Example 1
Three different types of alcohol can be used in a particular chemical process. The resulting
yield (in %) from several batches using the different types of alcohol are given below:
Alcohol (in %)
  1     2     3
 93    95    76
 95    97    77
 74    87    84
Test whether or not the three populations appear to have equal means using $\alpha = 0.01$.
Solution
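Alcohol (in %)
           1                  2                  3
          93                 95                 76
          95                 97                 77
          74                 87                 84
Total     $y_{1.} = 262$     $y_{2.} = 279$     $y_{3.} = 237$     $y_{..} = 778$

$N = 9$, $k = 3$
Hypothesis:
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1: \mu_i \neq \mu_j$ for at least one pair $i \neq j$

$$SST = \sum_{i=1}^{3} \sum_{j=1}^{3} y_{ij}^2 - \frac{y_{..}^2}{N} = \left( 93^2 + 95^2 + 74^2 + \cdots + 76^2 + 77^2 + 84^2 \right) - \frac{778^2}{9} = 67{,}914 - 67{,}253.7778 = 660.2222$$

$$SSTrt = \sum_{i=1}^{3} \frac{y_{i.}^2}{n_i} - \frac{y_{..}^2}{N} = \frac{1}{3} \left( 262^2 + 279^2 + 237^2 \right) - \frac{778^2}{9} = 67{,}551.3333 - 67{,}253.7778 = 297.5555$$

$$SSE = SST - SSTrt = 660.2222 - 297.5555 = 362.6667$$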
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square                   F calculated
Factor                297.5555         3 - 1 = 2            297.5555 / 2 = 148.7778       148.7778 / 60.4445 = 2.4614
Error                 362.6667         9 - 3 = 6            362.6667 / 6 = 60.4445
Total                 660.2222         9 - 1 = 8
At $\alpha = 0.01$, from the statistical table for the F distribution, we have $f_{0.01,2,6} = 10.92$.
Since $f_{calc} = 2.4614 < f_{0.01,2,6} = 10.92$, we are unable to reject the null hypothesis and conclude that there is no significant difference among the three types of alcohol at a significance level of $\alpha = 0.01$.
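The sums of squares in this example can be reproduced with a few lines of code. The sketch below is an illustration only: the function name `one_way_anova` is hypothetical, and the data are the nine yields whose group totals are 262, 279 and 237, as used in the solution. It follows the computing formulas $SST = \sum\sum y_{ij}^2 - y_{..}^2/N$ and $SSTrt = \sum y_{i.}^2/n_i - y_{..}^2/N$.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA quantities for a list of samples (one per level)."""
    all_values = [y for group in groups for y in group]
    n_total = len(all_values)                      # N
    k = len(groups)                                # number of levels
    correction = sum(all_values) ** 2 / n_total    # y..^2 / N
    sst = sum(y * y for y in all_values) - correction
    sstrt = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    sse = sst - sstrt
    ms_trt = sstrt / (k - 1)
    mse = sse / (n_total - k)
    return {
        "SST": sst,
        "SSTrt": sstrt,
        "SSE": sse,
        "F": ms_trt / mse,
        "df": (k - 1, n_total - k),
    }


yields = [[93, 95, 74], [95, 97, 87], [76, 77, 84]]   # alcohol types 1, 2, 3
result = one_way_anova(yields)
print({key: (round(v, 4) if isinstance(v, float) else v) for key, v in result.items()})
```

The F ratio returned by the function is then compared with the table value for the returned degrees of freedom, exactly as in the hand calculation.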
Task 1
An experiment was done to compare the amount of heat loss for three types of thermal panes.
The inside temperature was kept at a constant 68°F, the outside temperature was kept at a constant 20°F, and heat loss was recorded for three different panes of each type:
Pane Type
1
2
3
20 14 11 14 12 13
29 13 19 16 12 15
Use ANOVA to test for differences in heat loss due to pane type at $\alpha = 0.05$. What can you conclude from this test?
[$f_{calc} = 2.3608 < f_{0.05,2,9} = 4.26$; fail to reject $H_0$; no differences.]
Task 2
An experiment was conducted to compare four formulations for a lens coating with regard to
its adhesive property. Four samples of each formulation were used, and the resulting
adhesions are given below:
1
15
10
21
23
Formulation
2
3
29
33
60
59
91
49
20
21
4
26
34
28
46
evidence to indicate a
0.05
Task 3
To determine the effect of three phosphor types on the output of computer monitors, each
phosphor type was used in three monitors, and the coded results are given below:
Type
2
4
2
3
3
1
3
sufficient 7
2 evidence to conclude that there is a
3 among the three monitors? Test by
difference in the mean phosphor 5
7
6
using = 0.025
5
4
5
5
6
[$f_{calc} = 7.4495 > f_{0.025,2,12} = 5.10$; reject $H_0$; a difference exists.]
Do
the
data
provide
1. Enter $A$2:$C$7 in the Input Range: box (or you can enter that value automatically by clicking in the box and then selecting the range of cells A2 through C7).
2. Click the Columns button to indicate that our data is grouped by columns.
3. Click the Labels in first row box to indicate that we are using labels.
4. Enter the value of alpha in the Alpha: box.
5. Under Output Options click the button for Output range: and enter $A$9 in the Output range: box (or click in the box and then click on the cell A9 to cause it to appear in the box).
6. Click OK.
An example of Excel output summary from a one-way analysis of variance can be seen in
Figure 6.1 below. Notice that the means for the three groups (as well as the count, sum, and
variance for each group) can be seen in the summary table.
P-value = $P(F > 5.178082192)$
This P-value is then compared to a chosen level of significance, $\alpha$. The rules are: reject $H_0$ if P-value < $\alpha$; otherwise, fail to reject $H_0$.
From the output above, P-value = 0.023917. Suppose we choose $\alpha = 0.05$. Since P-value < 0.05, we reject $H_0$ and conclude that there exists a significant difference in the means at the 0.05 level of significance. However, if we choose $\alpha = 0.01$, then P-value > 0.01. Hence, we fail to reject $H_0$ and conclude that there is no significant difference in the means at the 0.01 level of significance.
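The decision rule can be sketched in code. When the numerator degrees of freedom equal 2 (three groups), the upper tail of the F distribution has the closed form $P(F > f) = (1 + 2f/\nu_2)^{-\nu_2/2}$, so no statistical library is needed for this special case. In the sketch below the function names are hypothetical, and the error degrees of freedom $\nu_2 = 12$ are an assumption inferred from the input range A2:C7 (a label row plus five observations in each of three columns).

```python
def f_upper_tail_two_numerator_df(f_value, error_df):
    """P(F > f_value) for an F distribution with 2 and error_df degrees of freedom.

    This closed form is valid only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f_value / error_df) ** (-error_df / 2.0)


def decide(p_value, alpha):
    """Apply the P-value rule: reject H0 when the P-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# error_df = 12 is an assumption (15 observations in 3 groups), not stated in the text.
p_value = f_upper_tail_two_numerator_df(5.178082192, error_df=12)
for alpha in (0.05, 0.01):
    print(alpha, round(p_value, 6), decide(p_value, alpha))
```

For other numerators of degrees of freedom the closed form does not apply, and the P-value is read from the Excel output or from statistical software.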
Task 4
Conduct a one-way ANOVA for Tasks 1, 2, and 3 by using Excel. Identify the P-value for
each task and interpret the value.
Exercise 6
1. It was known that a toxic material was dumped in a river leading into a large saltwater commercial fishing area. Civil engineers studied the way the water carried the
toxic material by measuring the amount of the material (in parts per million) found in
oysters harvested at three different locations, ranging from the estuary out into the bay
where the majority of commercial fishing was carried out. The resulting data are
given below:
parts
per
in
oysters
quality
control
= 0.05.
2. A
experiment
Site 1
15
26
20
20
29
28
21
26
to
Location
Site 2
19
15
10
26
11
20
13
15
18
Site 3
22
26
24
26
15
17
24
engineer
conducted
an
investigate
the
of
effect
1
40.3
25.4
28.2
41.6
28.8
38.7
29.4
37.7
Experience
2
3
34.2
26.3
25.4
29.2
30.2
24.6
28.9
29.1
39.2
34.8
29.5
32.3
29.0
36.0
25.6
25.6
4
26.6
21.2
23.2
27.0
27.1
27.3
34.2
33.3
significant differences
experience for average
Use = 0.05
suggest that a training
be productive?
3. The OPEC oil embargo made it evident that fuel economy in automobiles needed to
be improved. Newer lightweight materials were sought for use in automobile engines.
Comparisons on the density (in g / cm3 ) were made among test material samples of
steel, aluminium, and phenolic thermoset composites containing glass fibres, resulting
in the following data:
              Materials
Steel    Aluminium    Phenolics
7.60       2.90         1.79
7.81       2.67         1.72
7.72       2.80         1.67
7.68       2.85         1.80
7.79       2.60         1.50
7.76       2.76         1.63
Using an analysis of variance, state the correct hypothesis for testing equality of means in density for the three materials and conduct the ANOVA test. State your conclusion. Use the $\alpha = 0.01$ level of significance.
5.62
7.70
2.52
6.77
6.12
8.31
5.44
6.65
6.62
8.80
4.94
6.01
6.21
8.24
2.99
6.26
7.80
7.87
4.39
7.09
5.36
7.44
4.44
6.06
from machine to machine. The following data are the tensile strength measurements in
kilograms per square centimeter x 101
Machine
1
2
3
4
17.5
19.2
15.8
18.6
16.4
16.8
20.9
18.9
20.3
18.5
17.2
20.5
14.6
21.4
16.4
19.5
21.5
16.9
18.1
20.1
Perform the analysis of variance at the 0.025 level of significance and indicate
whether or not the mean tensile strengths differ significantly for the four machines.
6. In a biological experiment, 4 concentrations of a certain chemical are used to enhance
the growth in centimeters of a certain type of plant over time. The growths of plants
are measured. The following output is from Excel.
Chapter 7
Simple Linear Regression and Correlation
Learning Objectives:
At the end of this chapter, students should be able to
7.1 Introduction
In previous chapters, we focused only on the behaviour of population and sample characteristics, such as the mean, proportion and variance. Having learned about those characteristics, we can now move on to exploring the relationship between variables. Many problems arising in science and engineering involve exploring the relationship between two or more variables. In this chapter, we consider two statistical techniques that are very useful as a foundation for describing the relationship between these variables: first, regression analysis, and second, the correlation coefficient.
i. Simple linear regression, if there is only one response variable and one predictor variable.
ii. Multiple regression, if there is only one response variable and many predictor variables.
iii. Multivariate regression, if there are many response variables and one or more predictor variables.
There are many other types of regression analysis. In this chapter, we only deal with the first classification. Linear regression, in general, models the relationship between two or more random variables using a linear equation. In other words, it is a method of estimating the conditional expected value of one response variable given the values of some predictor variable or variables. Simply put, linear regression assumes the best estimate of the response variable is a linear function of some parameters (though not necessarily linear in the predictors).
Response variables are also called dependent variables, explained variables, predicted variables, or regressands. In the case of a single response variable, it is usually denoted by Y. Predictor variables, on the other hand, are also called independent variables, explanatory variables, control variables, or regressors, and are usually denoted as $X_1, X_2, \ldots, X_p$.
only on correlation coefficient that measures the relationship that is linear, particularly for
quantitative data. This will be discussed in detail in Section 7.6.
Task 1
1. Choose your pair. Next, discuss the difference between regression and correlation.
2. Choose a different pair. Next, list down
a) two possible response variables, and
b) two possible predictor variables.
from your engineering discipline.
Solution
The response variable Y represents the number of customers at the petrol station, whereas the
predictor variable X represents the increase in petrol price.
(7.1)
Notice here that the deterministic component in the regression model above is in fact a simple
linear, or a straight line, model.
Readers are cautioned that this intercept, $\alpha$, is not the same as the level of significance in hypothesis testing, which is also denoted by $\alpha$. In addition, some references use $\beta_0$ instead of $\alpha$ in the regression model.
3
(7.2)
Computing the expected value of Y given a certain value of X, say $X = x$, will result in Equation (7.2) becoming the following equation:
$$E(Y \mid X = x) = \mu_{Y \mid X = x} = \alpha + \beta x \qquad (7.3)$$
We can see from Equation (7.3) that the best estimate of the response variable given a certain value of a predictor variable is simply a linear function of two unknown parameters, $\alpha$ and $\beta$. After estimating the two unknown parameters, the target fitted simple linear regression equation can be obtained and expressed as
$$\hat{Y} = \hat{\alpha} + \hat{\beta} x \qquad (7.4)$$
Task 2
1. Determine the response and predictor variables in the following cases:
a. An investigation is carried out to study if the amount of certain chemical that
will dissolve in a given volume of water depends on the level of temperature.
b. A study is done to determine if Oxide of Nitrogen emission rate is influenced
by the load of an engine.
c. An engineer tries to predict the tensile strength of a specimen of cold drawn
copper from the Brinell hardness reading.
2. Without looking at your notes, re-write a simple regression model and state
assumptions related to the model. Next, check if you get the idea correct.
3. Similar to the above, re-write a fitted regression equation and check if you are on the
right track.
7.1 below, we can detect a positive slope for the linear model between Y and X in plot (a) and
a negative slope for the linear model in plot (b).
Task 3
1. Plot a scatter diagram that implies a very strong positive relationship between two
variables.
2. Plot a scatter diagram that implies a moderately weak negative relationship between
two variables.
coefficients, and by minimizing the sum of squared residuals. The resulting fitted line
provides the best possible description of the relationship between the response and the
predictor variables.
(7.5)
These residuals are a very useful tool in providing information about the adequacy of the
fitted model.
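$$\varepsilon_i = y_i - \alpha - \beta x_i \qquad (7.6)$$
The sum of squared deviations of the observations from the true regression line is then given by
$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \alpha - \beta x_i \right)^2 \qquad (7.7)$$
By the method of least squares, we estimate the unknown parameters $\alpha$ and $\beta$ by minimizing the sum of squared errors, or residuals, with respect to these parameters, that is, by setting the partial derivatives of L with respect to $\alpha$ and $\beta$ equal to zero. The least squares estimates of $\alpha$ and $\beta$, that is $\hat{\alpha}$ and $\hat{\beta}$ respectively, must satisfy the following conditions.
$$\left. \frac{\partial L}{\partial \alpha} \right|_{\hat{\alpha}, \hat{\beta}} = -2 \sum_{i=1}^{n} \left( y_i - \hat{\alpha} - \hat{\beta} x_i \right) = 0,$$
$$\left. \frac{\partial L}{\partial \beta} \right|_{\hat{\alpha}, \hat{\beta}} = -2 \sum_{i=1}^{n} \left( y_i - \hat{\alpha} - \hat{\beta} x_i \right) x_i = 0 \qquad (7.8)$$
Simplifying these two equations yields
$$\sum_{i=1}^{n} y_i = n \hat{\alpha} + \hat{\beta} \sum_{i=1}^{n} x_i,$$
$$\sum_{i=1}^{n} x_i y_i = \hat{\alpha} \sum_{i=1}^{n} x_i + \hat{\beta} \sum_{i=1}^{n} x_i^2 \qquad (7.9)$$
Equations (7.9) are commonly called the least squares normal equations. Their solution is
$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \qquad (7.10)$$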
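$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} \qquad (7.11)$$
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$, whereby the sum of products, $S_{xy}$, and the total sum of squares for X, $S_{xx}$, are given below.
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)$$
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2$$
Another term that will be much in use later in this chapter is the total sum of squares of the response variable Y, denoted by $S_{yy}$ and given as follows.
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2$$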
These sums of squares and sum of product are commonly available in any standard statistical
formula sheet.
Y x
Mode
Mode
Shift CLR 1
Note that Step 3 is vitally important when storing a new data set so that the old data set will
be removed and will not be mixed with the new data set to ensure an accurate analysis.
Once the sample data are stored in the calculator, we can retrieve the available output
by pressing appropriate operators as shown in Table 7.2.
Table 7.2: Output available from CASIO fx-570MS calculator
Operators
Output
Shift
S-SUM
Shift
S-SUM
Shift
S-SUM
Shift
S-SUM
>
Shift
S-SUM
>
Shift
S-SUM
>
xy
Shift
S-SUM
Shift
S-SUM
>
Shift
S-SUM
> >
Shift
S-SUM
> >
Shift
S-SUM
> >
Notice that r in Table 7.2 is the product moment correlation coefficient which will be covered
in Section 7.6 of this chapter.
Example 3
Obtain the equation of the least squares regression line of y on x for the following data:
x    20  25  30  35  40  45  50  55  60  65
y    98  87  92  79  68  57  59  43  60  38
Solution
The least squares regression line of y on x is $\hat{y} = \hat{\alpha} + \hat{\beta} x$.
Follow the five steps in Table 7.1. At Step 3, before we store the new data set, we must
always make sure that the old data set is already cleared. This is indicated by $n = 0$ on the
calculator screen before the new data set is stored.
After storing the above data set, we should get the following output:
Operators
Output
Shift
S-SUM
Shift
S-SUM
x 425
Shift
S-SUM
n 10
Shift
S-SUM
>
Shift
S-SUM
>
y 681
Shift
S-SUM
>
xy
Shift
S-SUM
Shift
S-SUM
>
50125
x 42.5
y
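By formula,
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}$$
Substituting the formula with the values obtained from calculator will lead to
$$\hat{\beta} = \frac{26330 - \frac{1}{10}(425)(681)}{20125 - \frac{425^2}{10}} = -1.2667 \text{ (to 4 d.p.)}$$
and
$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} = 68.1 - (-1.2667)(42.5) = 121.9348$$
Hence, the least squares regression line is
$$\hat{y} = 121.9348 - 1.2667x$$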
Output
Shift
S-SUM
> >
Shift
S-SUM
> >
1.2667
Intuitively, we should get the same values for $\hat{\alpha}$ and $\hat{\beta}$ whether we calculate the estimates by using the formula or obtain them directly from the calculator. Nonetheless, we may notice that in this example the value of $\hat{\alpha}$ calculated by using the formula and its value obtained directly from the calculator are slightly different. Such a small discrepancy can occur when values are rounded off at an earlier stage of the calculation.
The notation d.p. is a short form for decimal places. We normally round the final answer to four decimal places.
For simplicity at the expense of accuracy, the least squares linear regression of y on x in this example is thus
$$\hat{y} = 121.93 - 1.27x$$
We will refer to this equation in the later examples and tasks. Noticeably, the estimated regression line in this example has a positive intercept and a negative slope. Note that $\hat{\alpha}$ and $\hat{\beta}$ can vary in $(-\infty, \infty)$.
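The least squares estimates for this data set can also be computed directly from the formulas for $S_{xy}$ and $S_{xx}$. The sketch below is an illustration in plain Python with a hypothetical function name, `least_squares_line`. Because it keeps full precision throughout, the intercept it prints agrees with the hand calculation only to about two decimal places, which is the rounding effect described above.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    slope = s_xy / s_xx
    intercept = sum(y) / n - slope * sum(x) / n   # y-bar minus slope times x-bar
    return intercept, slope


x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]
intercept, slope = least_squares_line(x, y)
print(round(intercept, 4), round(slope, 4))
```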
Example 4
Referring to Example 3, predict the value of y when $x = 58$.
Solution
When $x = 58$, the predicted value of y using the regression equation is
$$\hat{y} = 121.93 - 1.27(58) = 48.27$$
Task 4
1. An article in the Journal of Sound and Vibration (Vol. 151, 1991, pp. 383-394)
described a study which investigated the relationship between noise exposure and
hypertension. The noise exposure is measured by the sound pressure level (SPL) in
decibels, whereas hypertension is measured by the blood pressure rise (BPR) in
millimetres of mercury (mmHg). A representative data set reported is as follows:
SPL, x
BPR, y
60 63 65 70 70 70 80 90 80 80
1
0
1
2 5
1
4 6
2 3
SPL, x
BPR, y
85
5
89
4
90
6
90 90
8 4
Ans : 48.27
x    140  165  210  215  245  265  305  325  355  395
y     29   23   26   36   47   59   68   72   73   85
Ans: $\hat{\alpha} = -16.6914$, $\hat{\beta} = 0.2614$
(c) Write a fitted simple linear regression model for the above data.
Ans: $\hat{y} = -16.6914 + 0.2614x$
(d) Next, estimate the number of defective items produced by the machine if the speed is 380.
Ans: $\hat{y} \approx 83$
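a)
b) $\mathrm{Var}(\hat{\beta}) = \dfrac{\sigma^2}{S_{xx}}$, where $\sigma^2$ is estimated by $\hat{\sigma}^2 = \dfrac{1}{n-2} \left( S_{yy} - \hat{\beta} S_{xy} \right)$.
Note that the proofs of these properties are not covered in this chapter. These properties are useful in computing the test statistic value in a hypothesis testing procedure.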
n 2 2
2
(7.12)
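$$T_{test} = \frac{\hat{\beta} - 0}{\sqrt{\mathrm{Var}(\hat{\beta})}} = \frac{\hat{\beta}}{\sqrt{\hat{\sigma}^2 / S_{xx}}} \qquad (7.13)$$
which follows the t distribution with $n - 2$ df under $H_0: \beta = 0$. The determination of critical regions, and hence critical values, will depend on the alternative hypothesis, $H_1$, and the level of significance, $\alpha$, as listed in Table 7.3.
Note that $t_{\alpha, n-2}$ is a critical value for testing at significance level $\alpha$ and $n - 2$ degrees of freedom.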
Table 7.3 Tests of hypothesis for the slope, $\beta$, of linear regression model
Type of hypothesis testing                        Hypothesis                                Rejection criteria
Two-sided test (Test for linearity)               $H_0: \beta = 0$, $H_1: \beta \neq 0$     Reject $H_0$ if $|t_{test}| > t_{\alpha/2, n-2}$
Right-tailed test (Test for a positive slope)     $H_0: \beta = 0$, $H_1: \beta > 0$        Reject $H_0$ if $t_{test} > t_{\alpha, n-2}$
Left-tailed test (Test for a negative slope)      $H_0: \beta = 0$, $H_1: \beta < 0$        Reject $H_0$ if $t_{test} < -t_{\alpha, n-2}$
Example 5
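$\sum x = 425$, $\sum x^2 = 20125$, $\sum y = 681$, $\sum y^2 = 50125$, $\sum xy = 26330$, $\hat{\beta} = -1.27$, $n = 10$.
Therefore,
$$S_{xx} = \sum x^2 - \frac{\left( \sum x \right)^2}{n} = 20125 - \frac{425^2}{10} = 2062.5$$
$$S_{yy} = \sum y^2 - \frac{\left( \sum y \right)^2}{n} = 50125 - \frac{681^2}{10} = 3748.9$$
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 26330 - \frac{(425)(681)}{10} = -2612.5$$
Thus,
$$\mathrm{Var}(\hat{\beta}) = \frac{\hat{\sigma}^2}{S_{xx}} = \frac{S_{yy} - \hat{\beta} S_{xy}}{(n - 2) S_{xx}} = \frac{3748.9 - (-1.27)(-2612.5)}{(10 - 2)(2062.5)} = 0.02612 \text{ (to 4 s.f.)}$$
$\alpha / 2 = 0.05 / 2 = 0.025$
and
$df = n - 2 = 10 - 2 = 8$
From Table 7 in Lee (2004), the critical value is $t_{0.025,8} = 2.306$. Thus, the decision rule is that we will reject $H_0$ if $|t_{test}| > t_{0.025,8}\ (= 2.306)$.
Step 3: Calculate the value of test statistic.
The value of test statistic is calculated as follows:
$$t_{test} = \frac{\hat{\beta} - 0}{\sqrt{\mathrm{Var}(\hat{\beta})}} = \frac{-1.27 - 0}{\sqrt{0.02612}} \approx -7.858$$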
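$$\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 = \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2 + \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
Symbolically, we have
a) $SST = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 = S_{yy}$ is the total corrected sum of squares of y.
b) $SSR = \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2 = \hat{\beta} S_{xy}$ is the regression sum of squares.
c) $SSE = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 = SST - SSR$ is the error sum of squares.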
If we divide the SSR and SSE by their respective degrees of freedom, we will obtain the mean squared regression denoted by MSR (= SSR/1) and the mean squared error denoted by MSE (= SSE/(n - 2)) respectively. It can be shown that the test statistic
$$F_{test} = \frac{MSR}{MSE}$$
follows the F distribution with 1 and $n - 2$ degrees of freedom under the null hypothesis $H_0: \beta = 0$.
We can arrange the test procedure using this approach in an ANOVA table, as shown in Table 7.4.
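Source of Variation   Sum of Squares                 Degrees of Freedom   Mean Square   $f_{test}$
Regression            $SSR = \hat{\beta} S_{xy}$     1                    MSR           MSR / MSE
Error                 $SSE = SST - SSR$              $n - 2$              MSE
Total                 $SST = S_{yy}$                 $n - 1$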
Example 6
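Reconsider Example 3 and test $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ using the ANOVA approach.
Solution
Step 1: Calculate $\hat{\beta}$, $S_{yy}$, $S_{xy}$
From the solution in Example 5, we have $\hat{\beta} = -1.27$, $S_{yy} = 3748.9$ and $S_{xy} = -2612.5$.
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   $f_{test}$
Regression            3317.875         1                    3317.875      3317.875 / 53.8781 = 61.58
Error                 431.025          10 - 2 = 8           53.8781
Total                 3748.9           9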
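The ANOVA decomposition for regression, and its link to the t-test for the slope, can be checked numerically. The following sketch uses the data of Example 3 and a hypothetical function name, `regression_anova`. It works with the unrounded slope, so its sums of squares and test statistics differ somewhat from hand calculations based on the rounded value $-1.27$; the point of the sketch is that SST = SSR + SSE and that the F statistic equals the square of the t statistic.

```python
import math


def regression_anova(x, y):
    """ANOVA decomposition and slope t statistic for simple linear regression."""
    n = len(x)
    s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    slope = s_xy / s_xx                 # unrounded least squares slope
    ssr = slope * s_xy                  # regression sum of squares
    sse = s_yy - ssr                    # error sum of squares
    mse = sse / (n - 2)
    f_test = ssr / mse                  # MSR / MSE with 1 and n - 2 df
    t_test = slope / math.sqrt(mse / s_xx)
    return {"SST": s_yy, "SSR": ssr, "SSE": sse, "F": f_test, "t": t_test}


x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]
table = regression_anova(x, y)
print({name: round(value, 4) for name, value in table.items()})
```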
Task 5
1. Without looking at any reading material, list down briefly steps involved in testing the
significance of regression. Check your list with your friend who sits next to you and
compare your answers.
2. Why t-test is preferred to z-test in testing the slope of a linear regression model?
Discuss with your neighbours.
3. Consider the data from Question 1 in Task 4, by using t-test approach, test the
hypothesis that the regression of blood pressure rise (BPR) on the sound pressure
level (SPL) is linear at the 0.05 level of significance.
[ Ans : ttest 7.3145 t0.025,18 2.101 , reject H 0 , linearity significantly exists.]
7.6 Correlation
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \qquad (7.14)$$
It measures the extent to which the points on a scatter diagram cluster about a straight line.
For example, if we construct a scatter diagram for a sample data having n pairs of
measurements
$\{ (x_i, y_i) : i = 1, 2, \ldots, n \}$
7.6.2 Properties of r
Some properties of r include:
a) $-1 \le r \le 1$.
(a) $r = 0.60$
(b) $r = 0.85$
(c) $r = 1$
Meanwhile, the scatter diagrams below show examples of negative linear correlation between X and Y, in increasing order of strength:
(a) $r = -0.60$
(b) $r = -0.85$
(c) $r = -1$
Noticeably, the wider the scatter of the points around a straight line the weaker the
correlation will be and hence the closer r is to 0, either from negative or positive directions.
The two diagrams below display examples of the absence of linear relationship
between X and Y. For Figure (b) below, although r = 0 implying no linear relationship, the
two variables do actually have a relationship which is nonlinear (in this case a quadratic
relationship).
Example 7
Compute the product moment correlation coefficient to measure the relationship between the X and Y variables based on the sample data from Example 3. Comment on your answer.
Solution
The correlation coefficient computed based on the sample data is the sample correlation coefficient, r, given as
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-2612.5}{\sqrt{(2062.5)(3748.9)}} = -0.9395$$
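The same coefficient can be obtained from the raw data. The sketch below, with a hypothetical function name `pearson_r`, applies Equation (7.14) to the data of Example 3; the negative sign reflects the downward trend of y as x increases.

```python
import math


def pearson_r(x, y):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return s_xy / math.sqrt(s_xx * s_yy)


x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]
r = pearson_r(x, y)
print(round(r, 4))
```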
S-SUM
> >
Output
3
Task 6
1. Refer to the sample data from Question 1 in Task 4 and measure the strength of the relationship between blood pressure rise (BPR) and the sound pressure level (SPL).
[Ans: 0.8650; strong positive correlation]
2. Refer to the sample data from Question 2 in Task 4 and obtain the Pearson product moment correlation coefficient for the sample data. Comment on your result.
[Ans: 0.9611; very strong positive correlation]
Figure 7.3 Data storage in Excel worksheet for regression analysis
b) Next, click Tools from the menu bar and then choose Data Analysis from the pull-down menu, followed by Regression from the pop-up menu.
7. The following table lists the measurements of the air velocity and evaporation coefficient of burning fuel droplets in an impulse engine:
Air Velocity (cm/sec)    Evaporation Coefficient ( /sec)
 20                       1.8
 60                       3.5
100                       3.7
140                       5.6
180                       7.5
220                       7.8
260                       9.8
300                      11.6
340                      13.7
380                      16.5
420                      18.6
460                      19.5
(a) Fit a straight line to these data by using the method of least squares.
(b) Estimate the evaporation coefficient of a droplet when the air velocity is 190 cm/sec.
(c) Test whether the evaporation coefficient of burning fuel droplets in an impulse engine is positively related to the air velocity at the 0.10 significance level.
(d) Find the Pearson correlation coefficient. Give your comment.
8.
(a)
(b)
(c)
(d)
9.
A manufacturing company bought a new cutting tool from company A and wanted to
investigate the useful life (in hours) related to the speed at which the tool is operated.
The Excel output follows for useful life of the tool (in hours) and speed (meters per
minutes).
10.
The following output from Excel gives information on the engine powers x (in
(a) Find the least squares estimates of the regression line for the engine power against the maximum speed.
(b) What does the estimate of imply?
(c) What is the predicted maximum speed if the engine power is 72 kilowatt?
(d) Is there any evidence that the data strongly suggest a linear association between the engine power and the maximum speed at the 0.01 significance level?
(e) Find the correlation between the engine power and the maximum speed. Explain your answer.
Chapter 8
Nonparametric Statistics
Learning Objectives:
At the end of this chapter, students should be able to:
a)
b)
c)
d)
e)
f)
8.1 Introduction
There are four types of data, namely nominal, ordinal, interval scale and ratio scale data. An example of nominal data is gender, where male may be represented as 1 and female as 2. The numbers are used for identification of the categories in the gender variable. Data that can be ordered from the lowest to the highest value are ordinal data; an example is feeling towards school, which can be categorized and ordered as very unhappy, unhappy, somewhat happy, happy and very happy. To understand interval scale data, we start with an example: temperature. A reading of 0°C does not mean there is no temperature, and 50°C is not twice as hot as 25°C. In contrast, for ratio scale data such as length, 0 m means there is no length and 50 m is twice the length of 25 m. Length, weight and density are some examples of ratio scale data. Statistical methods that we have discussed before, such as the t-test, ANOVA and regression, deal with interval scale or ratio scale data, and the data being analyzed are assumed to come from a population with a specific probability distribution. For example, in the t-test, the population from which a random sample is selected is assumed to be normally distributed with mean $\mu$ and variance $\sigma^2$. In general, these techniques are classed as parametric statistics. This chapter discusses an alternative to parametric statistics, namely non-parametric statistics (NPS). Parametric statistics is capable of analyzing interval scale and ratio scale data. The mean and variance of these data can be calculated, interpreted and used in the analysis. This is not so for nominal and ordinal data.
For example, consider the nominal data gender with categories male and female.
Surely the mean of gender has no meaning. NPS is the method to use when dealing with such
data.
In general, a statistical technique is categorized as NPS if it has at least one of the
following characteristics:
1. The method is used on nominal data.
2. The method is used on ordinal data.
3. The method is used on interval scale or ratio scale data but there is no assumption
regarding the probability distribution of the population where the sample is selected.
8.2 Sign Test
We have seen the test of population proportion that uses the sampling distribution
$\hat{P} \sim N\left(p, \frac{p(1-p)}{n}\right)$
for large sample size n. The sign test is a test of the population
proportion for testing $p = 0.5$ in a small sample situation (usually for $n \le 20$).
To understand how the sign test works, let us look at this example.
A study is conducted to see the preference of hand-phone users towards two brands
of hand-phones, A and B, by asking the views of 12 users. Specifically, this study is done to
see if the preferences are the same towards the two brands.
If there is no difference in preference, then we can anticipate that the proportion of
users who prefer brand A is the same as, or about equal to, the proportion of users who prefer
brand B. Since there are only two brands being tested, the proportion of users preferring brand A
is 0.5, and similarly for brand B, if there is no difference in brand preference.
If the proportion of users preferring brand A is greater than that of brand B, we can
anticipate that the number of users preferring brand A will be a lot higher than the number of
users preferring brand B. On the other hand, if the proportion of users preferring brand B is
greater than that of brand A, we can anticipate that the number of users who prefer brand A will
be a lot lower than the number of users preferring brand B.
This forms our hypotheses
$H_0: p = 0.5$
$H_1: p \ne 0.5$
where $p$ is the proportion of the population of users preferring brand A.
Now, we have 12 subjects who named their preferences. Let X be a random
variable representing the number of users who prefer brand A, and assume $H_0$ is
true. Then X follows the Binomial distribution with n = 12 and p = 0.5, or simply
$X \sim \mathrm{Bin}(12, 0.5)$
For notational purposes, let those who prefer brand A be represented by the sign +
and those who prefer brand B be represented by the sign -. Hence the name sign test. So
the random variable X is redefined to represent the number of + signs, and $X \sim \mathrm{Bin}(12, 0.5)$. Our
alternative hypothesis $H_1: p \ne 0.5$ indicates that we have a two-tailed test with two rejection
regions. Suppose this test is done at significance level $\alpha = 0.05$. This means we would reject
$H_0$ if $X \le a$ or $X \ge b$, i.e. we would reject $H_0$ if the number of + is at most a or at least b.
The issue now is to find the values of a and b.
By the nature of a two-tailed test we know that $P(X \le a) + P(X \ge b) \approx 0.05$. Now for
$X \sim \mathrm{Bin}(n, p)$ with n = 12 and p = 0.5,
$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$
for x = 0, 1, 2, ..., 12. The probability for each value of x is shown in the table below:
X = x    P(X = x)
0        0.0002
1        0.0029
2        0.0161
3        0.0537
4        0.1208
5        0.1934
6        0.2256
7        0.1934
8        0.1208
9        0.0537
10       0.0161
11       0.0029
12       0.0002
With a = 2 and b = 10,
$P(X \le 2) + P(X \ge 10) = P(X=0) + P(X=1) + P(X=2) + P(X=10) + P(X=11) + P(X=12) = 0.0386$
which is less than our chosen $\alpha = 0.05$. With a = 3 and b = 9,
$P(X \le 3) + P(X \ge 9) = 0.146$
which is a lot more than our chosen $\alpha = 0.05$.
Since the value 0.0386 is closer to 0.05 than 0.146, it is reasonable to make our
decision rule: reject $H_0$ if the number of + is at most 2 or the number of + is at least 10.
However, with this rule, our significance level is not exactly 0.05 but 0.0386.
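The tail probabilities behind this decision rule can be checked numerically. The following is a minimal sketch in Python using only the standard library; the helper names `binom_pmf` and `two_tail_prob` are illustrative and do not come from the text.

```python
from math import comb

def binom_pmf(x, n, p=0.5):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def two_tail_prob(a, b, n, p=0.5):
    """P(X <= a) + P(X >= b) for X ~ Bin(n, p)."""
    lower = sum(binom_pmf(x, n, p) for x in range(0, a + 1))
    upper = sum(binom_pmf(x, n, p) for x in range(b, n + 1))
    return lower + upper

n = 12  # number of users in the brand-preference example
for a, b in [(2, 10), (3, 9)]:
    size = two_tail_prob(a, b, n)
    print(f"reject if X <= {a} or X >= {b}: actual significance level = {size:.4f}")
```

Running the sketch for the two candidate rules gives about 0.0386 and 0.146, the two values compared in the text.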
$H_1: p \ne 0.5$
2. Choice 2
$H_0: p = 0.5$
$H_1: p > 0.5$
3. Choice 3
$H_0: p = 0.5$
$H_1: p < 0.5$
Choice 1: This is a two-tailed test with the rejection regions $X \le a$ or $X \ge b$. The
value of a is such that $P(X \le a) \approx \alpha/2$.
Choice 2: This is a one-tailed test on the right with the rejection region $X \ge a$.
The value of a is such that $P(X \ge a) \approx \alpha$. The graph is shown in Figure 8.3.
Example 1
10 engineering students went on a diet program in an attempt to lose weight with the
following results:
Name      Weight before   Weight after
Abu       69              58
Ah Lek    82              73
Sami      76              70
Kassim    89              71
Chong     93              82
Raja      79              66
Busu      72              75
Wong      68              71
Ali       83              67
Tan       103             73
Is the diet program an effective means of losing weight? Do the test at significance level
$\alpha = 0.10$.
Solution
Let the sign + indicate Weight before - Weight after > 0, and the sign - indicate Weight
before - Weight after < 0.
Thus
Name      Weight before   Weight after   Sign
Abu       69              58             +
Ah Lek    82              73             +
Sami      76              70             +
Kassim    89              71             +
Chong     93              82             +
Raja      79              66             +
Busu      72              75             -
Wong      68              71             -
Ali       83              67             +
Tan       103             73             +
$H_0: p = 0.5$
$H_1: p > 0.5$
Let X represent the number of + signs. Assuming $H_0$ is correct, $X \sim \mathrm{Bin}(10, 0.5)$.
The observed number of + signs is 8 and the probability of getting at least 8 + signs is
$P(X \ge 8) = 1 - P(X \le 7) = 1 - 0.9453 = 0.0547$
which is less than $\alpha = 0.10$. Thus, we reject $H_0$ and conclude that there is sufficient evidence that the
diet program is an effective programme to reduce weight.
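The arithmetic of Example 1 can be reproduced with a short script. This is a sketch only: the function name `sign_test_upper` is hypothetical, and dropping zero differences is an assumption (there are none in this data set).

```python
from math import comb

before = [69, 82, 76, 89, 93, 79, 72, 68, 83, 103]
after = [58, 73, 70, 71, 82, 66, 75, 71, 67, 73]

def sign_test_upper(before, after):
    """Right-tailed sign test: returns (number of +, n, P(X >= number of +))."""
    # Zero differences carry no sign and are dropped (an assumption of this sketch).
    diffs = [b - a for b, a in zip(before, after) if b != a]
    n = len(diffs)
    plus = sum(d > 0 for d in diffs)
    # Under H0, X ~ Bin(n, 0.5), so each outcome count is comb(n, k) / 2**n.
    p_value = sum(comb(n, k) for k in range(plus, n + 1)) / 2**n
    return plus, n, p_value

plus, n, p_value = sign_test_upper(before, after)
print(plus, n, round(p_value, 4))
```

The script reports 8 plus signs out of 10 and an upper-tail probability of about 0.0547, in agreement with the value obtained from the binomial table.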
Example 2
16 students were asked about their views on their college's new regulation of not
allowing students to drive on campus. 13 of them oppose the ruling while 3 of them agree
with it. Is there evidence to support the hypothesis that a minority of students support the
new ruling at significance level $\alpha = 0.05$?
Solution
Let X represent the number of students supporting the ruling.
$H_0: p = 0.5$
$H_1: p < 0.5$
Assuming $H_0$ is correct, $X \sim \mathrm{Bin}(16, 0.5)$. The observed X is 3. Using the distribution
above,
$P(X \le 3) = 0.0106$
which is less than $\alpha = 0.05$. Thus reject $H_0$ and conclude that there is sufficient evidence that
Use the sign test at the 0.05 level to test the hypothesis that the new additive has the
same drying time as the regular additive.
[Ans: $P(X \le 1) = 0.0625 > 0.025$ or $P(X \ge 1) = 0.9922 > 0.025$; fail to reject $H_0$ and
conclude that the new and regular additives have the same drying time.]
In cases where the number of subjects is large (n > 20), the normal approximation can be used
as a decision rule: if X is a random variable representing the number of + signs, then approximately
$X \sim N(0.5n,\ 0.25n)$
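Standardizing this approximation gives a statistic that can be compared with standard Normal critical values. The numbers in the second line (n = 25, X = 18) are hypothetical and serve only to illustrate the arithmetic; no continuity correction is applied in this sketch.

```latex
Z = \frac{X - 0.5n}{\sqrt{0.25n}} \sim N(0,1) \quad \text{(approximately)}
\qquad \text{e.g. } n = 25,\ X = 18: \quad Z = \frac{18 - 12.5}{\sqrt{6.25}} = 2.2
```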
W W W
It must be a good team to win 12 consecutive games; their wins are neither by
chance nor random. Based on these results, we can easily predict the outcome of the
next game.
Consider another football team B with the following results in 12 games.
W L W L W L W L W L W L
Based on these results we can anticipate the result of the next game. The team's performance
is predictable and the results are not random.
Consider another football team C with the following results in 12 games.
W W L L W L L L W L W W
For team A, we see that the outcome is not random and the number of runs is the minimum, 1.
For team B, we see that the outcome is not random and the number of runs is the maximum,
12. So, too many runs or too few runs indicate that the outcome is not random.
Let
R = the number of runs
$n_1$ = number of W
$n_2$ = number of L
$n = n_1 + n_2$
It is a tedious job to construct the probability distribution of R for higher values of $n_1$ and
$n_2$. With the probability distribution we are capable of building the rule for accepting and
rejecting $H_0$. As we have said earlier, a small value of R or a large value of R indicates that the
outcome is not random; thus the test of randomness is a two-tailed test. This test of
randomness is called the runs test.
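Counting runs by hand is error-prone for long sequences. The sketch below counts runs for the three football teams discussed above; the function name `count_runs` is illustrative, and the sequences for teams A and B are written out from their descriptions (12 straight wins, and strictly alternating results).

```python
from itertools import groupby

def count_runs(sequence):
    """Number of runs, i.e. maximal blocks of identical consecutive symbols."""
    return sum(1 for _ in groupby(sequence))

team_a = "W" * 12          # 12 consecutive wins
team_b = "WL" * 6          # alternating win and loss
team_c = "WWLLWLLLWLWW"    # results of team C

for name, results in [("A", team_a), ("B", team_b), ("C", team_c)]:
    r = count_runs(results)
    n1, n2 = results.count("W"), results.count("L")
    print(f"Team {name}: R = {r}, n1 = {n1}, n2 = {n2}")
```

Teams A and B give the extreme values R = 1 and R = 12, while team C gives R = 7 with six wins and six losses.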
Since the runs test is a two-tailed test, we would reject $H_0$ if the observed number of
runs $R \le a$ or $R \ge b$. The values a and b are chosen in such a way that
$P(R \le a) \le \alpha/2$ and $P(R \ge b) \le \alpha/2$.
Example 3
A machine cuts plywood with mean length 100 cm and standard deviation 1 cm. 15
pieces of plywood produced consecutively by this machine have the following lengths (in cm),
listed in order of production reading across each row:
99.5    99.7    100.2
99.5    100.1   100.5
99.0    99.8    100.2
99.8    100.3   100.3
100.6   100.1   99.9
Can we conclude that the length of plywood cut by this machine is random above and below
the mean length 100 cm at significance level $\alpha = 0.05$?
Solution
Let + indicate a length of plywood which is over 100 cm and - indicate a length
which is below 100 cm. The outcome is thus
- - + - + + - - + - + + + + -
with n = 15, $n_1 = 8$, $n_2 = 7$, where $n_1$ is the number of + and $n_2$ is the number of -.
$H_0$: The length is random
The number of observed runs is R = 9. Using the statistical table, we would reject $H_0$ if
Task 3
Task 4
In an industrial production line, items are inspected daily for defective items. The
following is a sequence of defective items, D, and non-defective items, N, produced by this
production line:
D
Use the runs test to determine whether the defective items are occurring at random. Let
$\alpha = 0.05$.
[Ans: $4 < R = 10 < 14$; we fail to reject $H_0$ and conclude that the defective items are occurring at random.]
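For large samples, the number of runs R is approximately Normal with mean
$\mu_R = \frac{2 n_1 n_2}{n_1 + n_2} + 1$
and variance
$\sigma_R^2 = \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}$
i.e.
$R \sim N\left(\frac{2 n_1 n_2}{n_1 + n_2} + 1,\ \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}\right)$
and
$Z = \frac{R - \left(\frac{2 n_1 n_2}{n_1 + n_2} + 1\right)}{\sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}} \sim N(0, 1)$
In this case we can use the standard Normal distribution to find the critical values of z for the
given significance level $\alpha$.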
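The large-sample formulas for the runs test translate directly into code. The sketch below is illustrative only: the function name `runs_test_z` and the input values (20 symbols of each kind, 14 observed runs) are hypothetical and are not taken from any example in this chapter.

```python
from math import sqrt

def runs_test_z(r, n1, n2):
    """Mean, variance and z statistic of the number of runs R (normal approximation)."""
    mean = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (r - mean) / sqrt(var)
    return mean, var, z

# Hypothetical inputs chosen only to show the arithmetic.
mean, var, z = runs_test_z(r=14, n1=20, n2=20)
print(f"mean = {mean:.4f}, variance = {var:.4f}, z = {z:.4f}")
```

The resulting z is then compared with the two-tailed standard Normal critical values for the chosen $\alpha$, for instance ±1.96 at $\alpha = 0.05$.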
8.4
8.4.1 Introduction
Often enough we are dealing with data in the form of ranks as in the case of ordinal data. For
instance, a study may involve the feelings of students towards this subject which can be
categorized as Very Unhappy, Unhappy, Somewhat Happy, Happy and Very Happy.
The feelings can be ordered or ranked where rank 1 represents the lowest feeling Very
Unhappy, rank 2 the second lowest feeling Unhappy and so forth. This section describes
some statistical methods in dealing with such data.
8.4.2 Mann-Whitney Test
The Mann-Whitney test, sometimes referred to as the Wilcoxon rank-sum test, is used to test
whether the location measures (such as means) of two different populations are identical. Two
independent random samples are required, one from each population. Let $x_1, x_2, ..., x_n$ and
$y_1, y_2, ..., y_m$ be two random samples of sizes n and m, where $n \le m$, from populations X and
Y respectively. We wish to test the hypothesis that the two distributions X and Y are the
same. The hypotheses are
$H_0: P(X) = P(Y)$
$H_1: P(X) \ne P(Y)$
Assign the ranks 1 to n + m to the combined samples, where the smallest value from both samples is
assigned rank 1, the second smallest value is assigned rank 2, and so on. The highest value is
assigned rank n + m. Let $R(X_i)$ and $R(Y_j)$ denote the ranks assigned to $X_i$ and $Y_j$ for all i
and j. For convenience let N = m + n. The sum of the ranks assigned to the sample from population X can be
used as a test statistic,
$T = \sum_{i=1}^{n} R(X_i)$
Rank   1   2   3   4   5   6
We see that
$T_1 = \sum_{i=1}^{3} R(X_i) = 1 + 2 + 3 = 6$
and
$T_2 = \sum_{j=1}^{3} R(Y_j) = 4 + 5 + 6 = 15$
On one hand, when the sample sizes for both samples are the same we would expect
$T_1 = \sum R(X_i) \approx T_2 = \sum R(Y_j)$
if both populations X and Y are the same. However, if they are significantly different we
would expect $T_1 = \sum R(X_i)$ to differ significantly from $T_2 = \sum R(Y_j)$, where we would expect
On the other hand, when the sample sizes differ, a rather small $T_1$ or a large $T_1$ gives
some indication that the populations differ. Comparison of $T_1$ with $T_2$ is not appropriate
with differing sample sizes due to unequal chances of summing the integer ranks. Thus, the
inferential aspect must consider either $T_1$ alone or $T_2$ alone.
Table A7 of W. J. Conover (1971) provides the critical values for rejection of $H_0$
for various values of n and m. The table provides $w_p$ such that $P(T < w_p) \le p$. For example, consider
n = 5 and m = 7. The value 15 corresponding to p = 0.001 means $P(T < 15) \le 0.001$ and the
value 22 corresponding to p = 0.05 means $P(T < 22) \le 0.05$. Thus we would reject
$H_0$ if the observed $T = \sum R(X_i) < 22$ at $\alpha = 0.05$, with 22 as the left critical
value. The right-hand-side critical value is obtained by $n(N + 1) - w_p$. So the right-hand-side
critical value is 5(5 + 7 + 1) - 22 = 43, i.e. $P(T > 43) \le 0.05$. Thus we would reject
$H_0$ if T < 22 or T > 43 at $\alpha = 0.10$, which corresponds to p = 0.05 for a
two-sided test. However, when n and m are large,
$T \sim N\left(\frac{n(N+1)}{2},\ \frac{nm(N+1)}{12}\right)$
Example 4
The data below show the marks obtained by electrical engineering students in an
examination:
Gender    Marks
Male      60
Male      62
Male      78
Male      83
Female    40
Female    65
Female    70
Female    88
Female    92
Can we conclude that the achievements of male and female students are identical at significance
level $\alpha = 0.1$?
Solution
H 0 : Male and Female achievements are the same.
Gender    Random Variable   Marks   Rank
Male      X                 60      2
Male      X                 62      3
Male      X                 78      6
Male      X                 83      7
Female    Y                 40      1
Female    Y                 65      4
Female    Y                 70      5
Female    Y                 88      8
Female    Y                 92      9
n = 4, m = 5.
$T_1 = \sum_{i=1}^{4} R(X_i) = 2 + 3 + 6 + 7 = 18$
and
$T_2 = \sum_{j=1}^{5} R(Y_j) = 1 + 4 + 5 + 8 + 9 = 27$
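The rank sums of Example 4 can be verified with a few lines of code. This is a minimal sketch: the function name `rank_sums` is hypothetical, and tied observations are given the average of the ranks they occupy, which is a common convention assumed here rather than one stated in the text.

```python
def rank_sums(x, y):
    """Rank the pooled samples (average ranks for ties) and return (T1, T2)."""
    pooled = sorted(x + y)
    rank_of = {}
    for v in set(pooled):
        first = pooled.index(v) + 1          # 1-based position of first occurrence
        count = pooled.count(v)
        rank_of[v] = first + (count - 1) / 2  # average rank of a tied block
    t1 = sum(rank_of[v] for v in x)
    t2 = sum(rank_of[v] for v in y)
    return t1, t2

male = [60, 62, 78, 83]
female = [40, 65, 70, 88, 92]
t1, t2 = rank_sums(male, female)
N = len(male) + len(female)
print(t1, t2, N * (N + 1) / 2)
```

A useful check is that the two rank sums always add up to N(N + 1)/2, here 9(10)/2 = 45 = 18 + 27.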
Task 5
The petrol consumption (in km/liter petrol) for several Proton Wira 1.5 model cars for two
brands of petrol is shown below:
Petrobus   11.9, 12.5, 10.5, 10.4, 10.8
Procat     8.9, 10.0, 9.5, 11.2, 13.0, 10.7
Can we conclude that both brands of petrol give equal mileage at significance level
$\alpha = 0.05$?
[Ans: $19 < T_1 = 35 < 41$; fail to reject $H_0$ and conclude that both brands of petrol give the same
mileage.]
Task 6
The following data represent the number of hours that two different digital cameras
operate before a recharge is required.
Camera A   5.2   5.8   5.4   6.2   6.5   6.3   5.8   6.2   5.4
Camera B   5.8   6.1   6.2   6.2   6.6   6.8   5.9   5.8   6.3
Use the Mann-Whitney test with $\alpha = 0.1$ to determine if camera A operates longer
than camera B on a full battery charge.
[Ans: $T_1 = 70.5 < 100$; fail to reject $H_0$ and conclude that there is no significant evidence from the
data, at $\alpha = 0.1$, that Camera A operates longer than Camera B on a full battery charge.]
Example 5
Consider the following data which record the weight (in kg) of 8 students before and
after going through a diet program intended to reduce their weight.
Subject   Before (Y)   After (X)
A         70           62
B         75           70
C         68           58
D         60           61
E         73           61
F         80           60
G         65           54
H         63           66
$d_i = y_i - x_i$. Then we rank the $d_i$ ignoring the negative sign (if any). This means we rank the
modulus of $d_i$, $|d_i|$. Let these ranks be denoted by R. Next, we give each rank the
sign of the corresponding d. Let these signed ranks be denoted by $R(d_i)$. So we would have
Subject   Before (Y)   After (X)   di = yi - xi   R
A         70           62          8              4
B         75           70          5              3
C         68           58          10             5
D         60           61          -1             1
E         73           61          12             7
F         80           60          20             8
G         65           54          11             6
H         63           66          -3             2
$T^+ = \sum_{d_i > 0} R(d_i)$, $T^- = \sum_{d_i < 0} |R(d_i)|$
Since $R(d_i)$ is assumed to be symmetric, the mean of $R(d_i)$ is 0 and the
$H_1$: median of $R(d_i) \ne 0$
We can have the usual one-tailed test as
$H_0$: median of $R(d) = 0$
$H_1$: median of $R(d) > 0$
or
$H_0$: median of $R(d) = 0$
$H_1$: median of $R(d) < 0$
and the two-tailed test
$H_0$: median of $R(d) = 0$
$H_1$: median of $R(d) \ne 0$
$R(d)$. Table (Hisyam Lee's table) lists the critical points for accepting $H_0$ for various values
of $\alpha$.
Going back to the before-after weight example, we see that $T^+ = 33$ and $T^- = 3$. At
significance level $\alpha = 0.05$, Table (Hisyam Lee's table) gives the critical point with n = 8 as 3.
This means that we would reject $H_0$ if $T^+ \le 3$ or $T^- \le 3$. Since the lower of the two values
is $T^- = 3$, which is exactly the same as the critical value 3, we reject $H_0$ and accept $H_1$.
Thus we conclude that there is evidence that the weights before and after going through
the diet program are not equal.
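The signed-rank sums for the before-after weight example can be reproduced as follows. This is a sketch under stated assumptions: the function name `signed_rank_sums` is hypothetical, zero differences are dropped, and tied absolute differences receive average ranks (neither situation arises in this data set).

```python
def signed_rank_sums(before, after):
    """Return (T+, T-): rank sums of the positive and negative differences."""
    # Zero differences are dropped (an assumption of this sketch).
    diffs = [b - a for b, a in zip(before, after) if b != a]
    abs_sorted = sorted(abs(d) for d in diffs)

    def avg_rank(v):
        first = abs_sorted.index(v) + 1               # 1-based first position
        return first + (abs_sorted.count(v) - 1) / 2  # average rank for ties

    t_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)
    t_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)
    return t_plus, t_minus

before = [70, 75, 68, 60, 73, 80, 65, 63]
after = [62, 70, 58, 61, 61, 60, 54, 66]
t_plus, t_minus = signed_rank_sums(before, after)
print(t_plus, t_minus)
```

The two sums always add up to n(n + 1)/2 when there are no zero differences, here 8(9)/2 = 36 = 33 + 3.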
Table below summarizes the various test procedures for both one-tailed and two tailed
test:
Task 7
Before 74
65
78
81
55
61
80
After
62
83
100 68
59
105 66
87
65
Using the 2.5% significance level, can we conclude that attending the course increases the
hand-insert ability speed of the operators?
[Ans: Since T = 25.5 < 33, we fail to reject $H_0$ and conclude that the course does not increase the
operators' hand-insert ability speed.]
Task 8
The following data give the number of industrial accidents in ten manufacturing
plants for one-month periods before and after an intensive promotion on safety:
Plant   Before   After
1       3        2
2       4        3
3       3        1
4       6        3
5       8        4
6       4        1
7       5        4
8       6        5
9       7        6
10      8        4
Do the data support the claim that the campaign was successful in reducing accidents?
Use $\alpha = 0.05$.
[Ans: Since T = 55 > 44, we reject $H_0$ and conclude that the campaign was successful in reducing
accidents at $\alpha = 0.05$.]
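In a Wilcoxon signed-rank test for two dependent samples, when the sample size is
large (n > 15), the statistic T (either $T^+$ or $T^-$) is approximately Normal with mean
$\mu_T = \frac{n(n+1)}{4}$
and variance
$\sigma_T^2 = \frac{n(n+1)(2n+1)}{24}$
written as
$T \sim N\left(\frac{n(n+1)}{4},\ \frac{n(n+1)(2n+1)}{24}\right)$
Thus,
$Z = \frac{T - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}} \sim N(0, 1)$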
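As an illustration of this large-sample standardization, the sketch below computes the z statistic for a signed-rank sum. The function name `wilcoxon_z` and the inputs (n = 20 pairs, T = 52) are hypothetical values chosen only to show the arithmetic.

```python
from math import sqrt

def wilcoxon_z(t, n):
    """Standardize a Wilcoxon signed-rank sum T using the normal approximation."""
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    return (t - mean) / sqrt(var)

# Hypothetical inputs: 20 non-zero differences with rank sum T = 52.
z = wilcoxon_z(t=52, n=20)
print(f"z = {z:.4f}")
```

With these inputs the mean is 105 and the variance is 717.5, so z is roughly -1.98, which would then be compared with the standard Normal critical value for the chosen test.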
8.5 Measure of Association
8.5.1
We have seen that the correlation coefficient r measures the linear relationship between two
continuous variables X and Y.
A measure of correlation for ranked data, based on the definition of the Pearson correlation
coefficient where there are no ties or few ties, is called the Spearman rank correlation
coefficient. It is denoted by $r_s$ and is given by
$r_s = 1 - \frac{6T}{n(n^2 - 1)}$
where
$T = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} \left[R(X_i) - R(Y_i)\right]^2$
and
- $R(X_i)$ is the rank assigned to $x_i$.
- $R(Y_i)$ is the rank assigned to $y_i$.
- $d_i$ is the difference between the ranks assigned to $x_i$ and $y_i$.
- n is the number of pairs of data.
Usually the value of $r_s$ is close to the value obtained by finding r based on numerical
measurements. The interpretation of $r_s$ is similar to the interpretation of r, in which a value of
+1 or -1 indicates perfect association between X and Y. The plus sign indicates identical
rankings and the minus sign indicates reverse rankings. When $r_s$ is zero or close to zero,
we would conclude that the variables are uncorrelated.
Some advantages in using rs rather than r are:
1.
Example 6
The data below show the effect of the mole ratio of sebacic acid on the intrinsic viscosity of
copolyesters.
Mole ratio   1.0    0.9    0.8    0.7    0.6    0.5    0.4    0.3
Viscosity    0.45   0.20   0.34   0.58   0.70   0.57   0.55   0.44
Find the Spearman rank correlation coefficient to measure the relationship between the mole ratio of
sebacic acid and the viscosity of copolyesters.
Solution
Let X and Y represent the mole ratio of sebacic acid and the viscosity of copolyesters,
respectively. First we assign ranks to each set of measurements: the rank of 1 is assigned to the
lowest number in each set, the rank of 2 to the second lowest number in each set, and so
forth, until the rank of 8 is assigned to the largest number. The table below shows the
individual rankings of the measurements and the differences in ranks for the 8 pairs of
observations.
Mole ratio   Viscosity   R(xi)   R(yi)   di    di^2
1.0          0.45        8       4       4     16
0.9          0.20        7       1       6     36
0.8          0.34        6       2       4     16
0.7          0.58        5       7       -2    4
0.6          0.70        4       8       -4    16
0.5          0.57        3       6       -3    9
0.4          0.55        2       5       -3    9
0.3          0.44        1       3       -2    4
T = 110
Thus,
$r_s = 1 - \frac{6T}{n(n^2 - 1)} = 1 - \frac{6(110)}{8(64 - 1)} = -0.3095$
which shows a weak negative correlation between the mole ratio of sebacic acid and the
viscosity of copolyesters.
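The ranking and the computation of $r_s$ in Example 6 can be sketched in code as follows. The function names `ranks` and `spearman_rs` are illustrative; ties, which do not occur in this data set, would receive average ranks under the assumption made here.

```python
def ranks(values):
    """Rank from smallest (1) to largest, using average ranks for ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

def spearman_rs(x, y):
    """Return (T, r_s) where T is the sum of squared rank differences."""
    n = len(x)
    t = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return t, 1 - 6 * t / (n * (n ** 2 - 1))

mole_ratio = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
viscosity = [0.45, 0.20, 0.34, 0.58, 0.70, 0.57, 0.55, 0.44]
t, rs = spearman_rs(mole_ratio, viscosity)
print(t, round(rs, 4))
```

For these data the sketch gives T = 110 and $r_s \approx -0.3095$, matching the hand calculation.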
Example 7
The following data were collected and ranked during an experiment to determine the change in
thrust efficiency, y, as the divergence angle of a rocket nozzle, x, changes:
Rank X   1   2   3   4   5   6   7   8   9    10
Rank Y   2   3   1   5   7   9   4   6   10   8
Find the Spearman rank correlation coefficient to measure the relationship between the
divergence angle of a rocket nozzle and the change in thrust efficiency.
Solution
R(xi)   R(yi)   di = R(xi) - R(yi)   di^2
1       2       -1                   1
2       3       -1                   1
3       1       2                    4
4       5       -1                   1
5       7       -2                   4
6       9       -3                   9
7       4       3                    9
8       6       2                    4
9       10      -1                   1
10      8       2                    4
T = 38
$r_s = 1 - \frac{6T}{n(n^2 - 1)} = 1 - \frac{6(38)}{10(100 - 1)} = 0.7697$
indicating a high positive correlation between the divergence angle of a rocket nozzle and the
change in thrust efficiency.
Task 9
The grams of solids removed from a material (y) is thought to be related to the drying time
(x). Ten observations obtained from an experimental study follow.
Drying time      2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0
Solids removed   4.3   1.5   1.8   4.9   4.2   4.8   5.8   6.2   7.0   7.9
Calculate the Spearman rank correlation coefficient to measure the relationship between the
grams of solids removed from a material and the drying time.
[$r_s = 0.8788$]
Task 10
Two persons rank their preferences on 8 brands of automobile due to the rise in the price of
petrol. The ranks are in the following order:
Brands     1   2   3   4   5   6   7   8
Person A   5   8   4   3   6   2   7   1
Person B   7   5   4   2   8   1   6   3
Calculate the Spearman rank correlation coefficient to measure the relationship between the
[$r_s = 0.7143$]
Exercise 8
1. Briefly explain the meaning of categorical data and give two examples.
2. When does a statistical method become a non-parametric statistic?
3. At a college there are two cafeterias A and B where the students usually have their
meals. A random sample of 12 students is taken; 5 of them prefer cafeteria A and
the rest indicate a preference for cafeteria B. At the 5% significance level, can we
conclude that the students at this college have equal preference for the two cafeterias?
4. Eight students went on a diet in an attempt to lose weight, with the following results:
Name                 Abu   Ali   Chen   Rama   Subra   Lim   Tan   Amin
Weight Before (kg)   78    86    69     83     78      74    80    90
Weight After (kg)    66    87    64     80     73      65    75    87
Use the sign test to test whether the diet is an effective means of losing weight at
significance level $\alpha = 0.05$. Now use the Wilcoxon signed-rank test to test the same
hypothesis at the same significance level.
5. In a library, there are two popular reading sections A and B where students normally
do their favourite readings. A random sample of 14 students is taken and their
preferences are shown below:
A B A A B A A A A B A B
At the 10% significance level, can we conclude that the students have equal preference
for the two library reading sections?
6. Through the years the achievement award given to staff in a department has the
following order according to gender:
M M M M F F
M M F M F
where M represents Male and F represents Female. Is the award given according to
gender a random event at significance level $\alpha = 0.05$?
7. In a study to determine whether accidents occur at random or not, the following data
were gathered for 15 consecutive days:
+ + - + + + - + + - + - - - +
where + indicates the number of accidents for that day is above average and -
indicates the number of accidents for that day is below average. Test the hypothesis at
significance level $\alpha = 0.05$.
8. The following data give the cholesterol levels for seven adults before and after they
completed a special dietary plan:
Before   210   180   195   220   231   199   224
After
Use the sign test at the 5% significance level to test whether the level of cholesterol is
the same before and after completing the special dietary plan. Use the Wilcoxon
signed-rank test at the 5% significance level to test whether the level of cholesterol is
the same before and after completing the special dietary plan. Draw your conclusion.
9. The following table gives the recorded grades for 10 engineering students on carry
marks and final examination in an Engineering Statistics course:
Student   Carry Marks   Final Examination
Ali       48            47
Bidin     46            45
Chua      38            42
Didi      43            40
Emily     36            38
Farouk    49            49
Gina      44            44
Hasan     42            46
Intan     34            37
Joe       40            34
10.
Brand     A    B   C   D   E   F   G   H   I    J   K   L
Panel 1   10   6   1   7   3   8   2   5   9    4   7   9
Panel 2   9    3   4   5   6   7   8   2   10   1   8   6
11.
x   1.6   9.4   15.5   20.0   22.0   35.5   43.0   40.5   33.0
y   240   181   193    155    172    110    113    75     94
Calculate the Spearman rank correlation coefficient to measure the relationship between
the fretting wear of mild steel and oil viscosity.
Answers
Answers to Self-Review Quiz
Question   Part A   Part B
1          b        FALSE
2          a        FALSE
3          d        TRUE
4          b        FALSE
5          b        TRUE
6          b        FALSE
7          c        FALSE
8          c        FALSE
9          b        TRUE
10         d        FALSE
Answers to Exercise 1
1.
(a) Constant
(b) Constant
(c) Variable, quantitative, continuous
(d) Variable, qualitative, nominal
(e) Variable, quantitative, interval-scaled
(f ) Constant
(g) Variable, quantitative, continuous
(a) 0.3595
(b) 0.5033
(c) 0.4278
(d) 0.4167
(e) 0.4396
(a) Straightforward
5
x ; 0 x 1
8
1 x2
;1 x 2
2 8
1 ; elsewhere
(b) f x
(c) 0.4688
(a) 0.5328
(b) 0.3372
(c) 0.0675
8.
(a) 0.9305
(b) 0.8385
(c) 0.2924
(d) 1
0.7642
9.
(a) 0
b) 5.1984 10 4
(c) RM2082.245
Answers to Exercise 2
1. a. 0.0060
b. 0.9706
2. a. 0.8962
b. 0.0001
3. a. 0.7757
b. 0.6129
4. 0.9808
5. a. 0.9993
b. 1.0000
6. a. 0.8997
b. 0.8020
7. 0.6772
8. a. 1.0000
b. i. 0.9998
ii. 0.0002
9. a. 0.4840
b. 0.0344
c. 0.0045
10. a. 0.9842
b. 0.9684
c. 0.9911
11. a. 0.0401
b. 0.5490
c. 0.9599
12. a. 0.9803
b. 0.4681
c. 0.6156
13. a. 0.2912
b. 1.0000
c. 1.0000
14. a. 0.6628
b. 0.0869
c. 0.7230
15. a. 0.3669
b. 0.8725
c. 0.5000
16. a. 0.5000
b. 0.3192
c. 0.5948
17. a. 0.4682
b. 0.6293
c. 0.9505
18. a. 0.4052
b. 0.7265
c. 0.5000
Answers to Exercise 3
1. The observed interval contains the true value of .
2. Shorter
3. Yes, because we are making use of the sample information to infer the population
parameter.
4. a. 102.5
b. (98.944, 106.056)
5. a.
6. (0.4645, 0.5555) liter
7. (9.1, 10.7) micrometer
8. a. (0.505441, 0.507519) cm
b. (0.504637, 0.508323) cm
as it is impractical to know the variance of normal population without knowing its mean.
10. (1.061, 0.460); the observed interval contains the true value of mean difference with
90% level of confidence, No.
11. (0.0107, 0.0493)
12. (0.0048, 0.0202)
13. a. 0.09 b. (0.0751, 0.1049)
c. 0.01
d.(0.0048,0.0152)
e.0.003;(0.0004,0.00639)
Answers to Exercise 4
1. z test = 2.3717; reject H 0 .
2. z test = 2.044; reject H 0 .
3. t test = 0.5167; fail to reject H 0 .
4. t test = 2.821; reject H 0 .
5. Fail to reject H 0 .
6. z test = 6.1546; reject H 0 .
c. RM(85.96, 64.96)
e. 43.459 ; RM(27.13, 29.78) f. (0.5236, 13.353)
b. Fail to reject H 0 .
Answers to Exercise 5
1. k = 7, then $\nu = 6$; $\chi^2_{test} = 5.6807 < 12.592$; fail to reject $H_0$.
2. k = 6, p = 1 (one estimated parameter, with estimate 3.47), then $\nu = 4$; $\chi^2_{test} = 3.682 < 13.277$; fail to reject $H_0$.
3. k = 8, then $\nu = 7$; $\chi^2_{test} = 0.6333 < 14.067$; fail to reject $H_0$.
4. k = 4, then $\nu = 3$; $\chi^2_{test} = 40.692 > 7.815$; reject $H_0$.
5. k = 3, then $\nu = 2$; $\chi^2_{test} = 0.2448 < 9.21$; fail to reject $H_0$.
6. Independence test: $\nu = 1$; $\chi^2_{test} = 33.33 > 3.841$ (without Yates correction); reject $H_0$;
defective components are NOT the same, i.e. they are significantly not homogeneous at
$\alpha = 0.05$.
10. Homogeneity test: $\nu = 2$; $\chi^2_{test} = 36.6753 > 5.991$; reject $H_0$; the proportions of output
components for shift 1 are significantly not the same for all 3 machines.
Answers to Exercise 6
1. $f_{calc} = 4.9471 > f_{0.05, 2, 21} = 3.47$
Answers to Exercise 7
1. a.
0.6623 , 1.1256
2. a.
143.731 , 15.202
3. a.
0.2757 , 0.0255
4. a. 5.3066
c. Reject H 0
b. 3.98
b. 37.317
c. Reject H 0
d. 0.9939
d. - 0.9859
b. Reject H 0 c.0.9387
d. 0.9502
c. Accept H 0
b. 3.85
d. Accept H 0
5. a.
5.6 , 0.07
6. a.
2.8144 , 2.8622
b. 306.2076
c. Reject H 0
d. 0.8742
7. a.
0.0016 , 0.0415
b. 7.8866
c. Reject H 0
d. 0.9901
Answers to Exercise 8
3. $H_0: p = 0.5$ vs $H_1: p \ne 0.5$; $P(X \le 5) = 0.3872 > 0.025$ or $P(X \ge 5) = 0.8062 > 0.025$;
fail to reject $H_0$; the students at this college have equal preference for the two cafeterias.
4.
P X 4 0.5 0.025;
9. $r_s = 0.8182$; a strong positive correlation between carry marks and final exam scores.
10. $r_s = 0.6573$; a moderately strong positive correlation between results given by panel 1
and panel 2.
11. $r_s = -0.85$; a strong negative correlation between the fretting wear of mild steel and oil
viscosity.
References
Lee, M. H. (2004). Statistical Tables and Formulae for Science and Engineering.
Skudai: UTM.
Montgomery, D. C. & Runger, G. C. (2006). Applied Statistics and Probability
for Engineers, 4th Ed. USA: John Wiley & Sons.
Montgomery, D. C., Runger, G. C. & Hubele, N. F. (2003). Engineering Statistics. USA: John Wiley & Sons.