
# Introductory Statistics for Engineering Students

Zarina Mohd Khalid
Noraslinda Mohamed Ismail
Arifah Bahar
Norazlina Ismail

Department of Mathematics, Faculty of Science
Universiti Teknologi Malaysia

Preface

In general, engineers develop new products, improve existing designs, build and test
prototypes, troubleshoot ongoing manufacturing processes, and more. In each of these
functions, engineers collect and analyze data as an integral part of their job. Thus, statistical
methods are an inseparable part of how engineers solve engineering problems.
This text is an introductory statistics textbook designed for undergraduate students
taking engineering programs at Universiti Teknologi Malaysia, Skudai. It provides
sufficient material to cover the SSE2193 Engineering Statistics course over a 15-week
semester. This text does not pretend to provide either a complete statistical
toolkit or a review of all statistical methods in all aspects of engineering applications.
It does, however, provide students with an easy start-up kit of key statistical methods,
with various examples and tasks that students can solve either during class or in their
own time.
We sincerely hope this text will be useful for students in acquiring the skills of handling
observed data, drawing valid inferences and eventually making sound judgements and
well-founded decisions.
Authors
September 2015

_________________________________________________________________________

Self-Review Quiz
Test your prior knowledge and understanding of basic statistics by answering the
following questions.

## Part A (Objective Questions): Choose one correct answer only.

1. The probability of an event is always
a) less than 0
b) in the range 0 to 1.0
c) greater than 0
2. Two equally likely events
a) have the same probability of occurrence
b) cannot occur together
c) have no effect on the occurrence of each other
3. Let S be the sample space, defined as S = {1, 2, 3, 4, 5, 6, 7}. Let A, B, C
be subsets of the sample space, defined as
Given
i) P A B 0.5
P B C ' 0.25

a) i, ii, iii
4.

b) ii, iii, iv

c) i, iii, iv

d) ii, iv

a)
Yes b) No

5. Two mutually exclusive events
a) have the same probability of occurrence
b) cannot occur together
c) have no effect on the occurrence of each other

7. Which of the following notations is NOT a parameter?

a)

b)

c) x

d)

8. Which of the following data can most possibly be represented by a discrete random
variable?
a) The weight of engineering students registered at UTM.
b) The height of mountains across south-east Asia.
c) The number of errors typed on a piece of paper.
d) The amount of time spent by engineers working offshore.

9. Which of the following distributions is a continuous distribution?
a) Binomial with n = 25 and p = 0.6
b) Normal with $\mu = 30$ and $\sigma^2 = 16$
c) Poisson with $\lambda = 7$

10. A random variable X follows a normal distribution with mean 16 and standard deviation
2. The probability of X being less than 15 can be calculated by finding
a)

15 16

P Z

22

b)

15 16

P Z

c)
d)
e)

16 15

P Z

15 16

P Z

15 16

P Z

Contents
Preface
Self-Review Quiz
1 Fundamental Topics
1.1 Descriptive Statistics and Inferential Statistics
1.1.1 Terms and definitions
1.1.2 Measures of central tendency
1.1.3 Measures of dispersion
1.1.4 The use of calculators
1.1.5 Types of Plots
1.2 Probability
1.2.1 Basic notation and definition
1.2.2 Classical definition of probability
1.2.3 Mutually exclusive event
1.2.4 Additive rule of probability
1.2.5 Conditional probability
1.2.6 Multiplication rule of probability
1.2.7 Independence
1.3 Random Variables
1.3.1 Discrete random variable
1.3.2 Continuous random variable
1.3.3 Cumulative distribution function
1.3.4 Mathematical expectation
1.3.5 Variance and standard deviation
1.4 Some Probability Distributions
1.4.1 Binomial distribution
1.4.2 Poisson distribution
1.4.3 Negative binomial distribution
1.4.4 Geometric distribution
1.4.5 Hypergeometric distribution
1.4.6 Normal distribution
1.4.7 Exponential distribution
1.4.8 Other continuous distributions
Exercise 1
2 Sampling Distributions
2.1 Introduction
2.2 Central Limit Theorem
2.3 Sampling Distribution for a Single Mean, $\bar{X}$
2.4 Sampling Distribution of $\bar{X}_1 - \bar{X}_2$
2.5 Sampling Distribution for the Proportion, P
2.6 Sampling Distribution of the Difference Between two Proportions
2.7 t Distribution
2.8 $\chi^2$ Distribution
2.9 F Distribution
Exercise 2
3 Estimation
3.1 Introduction
3.2 Terminology
3.3 Point Estimate
3.4 Interval Estimate
3.5 CI on the Mean
3.6 CI for the Difference between Two Population Means
3.7 CI for the Population Proportion
3.8 CI for the Difference between Two Population Proportions
3.9 CI on the Normal Population Variance
3.10 CI for the Ratio of Two Normal Population Variances
Exercise 3
4 Tests of Hypotheses
4.1 Statistical Hypotheses
4.2 Test of Hypothesis for the Mean
4.3 Test of Hypothesis for the Variance
4.4 Test of Hypothesis for the Proportion
4.5 Test of Hypothesis for the Difference between the
4.5.1 Variances known
4.5.2 Variances unknown
4.6 Test of Hypothesis for the Difference between the Proportions
4.7 Test of Hypothesis for the Ratio of the Variances
Exercise 4
5 Chi-Square Tests
5.1 Introduction
5.2 Goodness-of-fit Test
5.3 Independence Test
5.4 Homogeneity Test
Exercise 5
6 Analysis of Variance
6.1 Introduction
6.2 One-Way ANOVA
6.3 Partitioning of Total Variability Into Components
6.4 Output
6.5 Computer Application - Using Excel
Exercise 6
7 Simple Linear Regression and Correlation
7.1 Introduction
7.1.1 Regression analysis
7.1.2 Correlation coefficient
7.2 Simple Linear Regression
7.2.1 Simple linear regression model
7.2.2 Model assumptions
7.2.3 Fitted simple linear regression equation
7.3 Scatter Diagram
7.3.1 Data plotting
7.3.2 Draw by eye
7.4 A Method of Least Squares
7.4.1 Errors and residuals
7.4.2 The sum of squared residuals
7.4.3 Normal equations
7.4.4 The least squares estimators
7.4.5 The fitted regression line and prediction
7.4.6 Finding the least squares estimates using a scientific calculator
7.5 Tests for Linearity of Regression
7.5.1 Testing procedures
7.5.2 Using a t-test approach
7.5.3 Using a one-way analysis of variance approach
7.6 Correlation
7.6.1 Product moment correlation coefficient, r
7.6.2 Properties of r
7.6.3 Interpretation of r values
7.7 Simple Linear Regression and Correlation using Excel
7.7.1 Excel procedures
7.7.2 Excel output and interpretation
Exercise 7
8 Nonparametric Statistics
8.1 Introduction
8.2 Sign Test
8.3 Run Test
8.4 Some Methods Based on Ranks
8.4.1 Introduction
8.4.2 Mann-Whitney Test
8.4.3 Wilcoxon Signed-Rank test for Two Dependent
8.5 Measure of Association
8.5.1 Spearman Rank Correlation Coefficient
Exercise 8
References

Chapter 1

Fundamental Topics
Learning Objectives:
At the end of this chapter, students should be able to

(a) define statistics and relate it to engineering applications.
(b) distinguish between descriptive and inferential statistics.
(c) identify types of data.
(d) summarize data numerically and graphically.
(e) calculate the probability of an event using suitable properties.
(f) find the expected value and variance for discrete and continuous random variables.
(g) identify probability models and their distributional characteristics.

This chapter presents a brief refresher of the basic statistics that students are expected to have
learnt at pre-undergraduate level. Although this chapter does not cover a whole course of
basic statistics, it provides the necessary background for the succeeding
chapters in this book.

## 1.1 Descriptive Statistics and Inferential Statistics

Statistics deals with the collection, analysis, presentation, and interpretation of data,
and with making decisions based on the observed data. Engineers play a fundamental role
in many aspects of decision making, such as designing and developing new products,
maintaining and controlling manufacturing processes, and improving existing systems and
processes. Statistical methods are important tools in these activities: they provide
engineers with both descriptive and analytical methods for handling the variability in the
observed data.

Statistics can be divided into two major areas, namely descriptive statistics and
inferential statistics. Descriptive statistics deals with the collection and presentation of data.
It involves collecting raw data, then classifying, interpreting and presenting the data as
meaningful information for users. Inferential statistics, on the other hand, involves procedures
used to draw inferences about a population from a sample. Here, probability models are used
to quantify the risks involved in making any statistical inference.

## 1.1.1 Terms and definitions

(a) Population
A population is the complete set of items under study. The items could be anything, such as
persons or objects. The number of individual items in the population is the population size.
(b) Sample
A sample is a subset of a population. Elements in a sample are drawn from the population. By
using information from the sample, we can make inferences about the population.
(c) Random variable
Random relates to events that have no specific pattern and that occur by chance as part of a
process. Thus random implies that, in a process of selection, any individual object or element
has an equal chance of being selected. Variable represents an unknown quantity that varies.
Random variables are either measurable or non-measurable entities. Measurable or countable
random variables are quantitative random variables, which are either discrete or continuous.
In contrast, non-measurable random variables are qualitative random variables.
(d) Parameter
A parameter is a characteristic or measure that we obtain from a population.
(e) Data
A data set is a collection of facts or observations from which conclusions may be drawn. It can
be in numerical (quantitative) or non-numerical (qualitative) form.
Quantitative data can be split into two types: discrete (having distinct and separate values, for
example: 1, 2, 3, ...) and continuous (which can take any value in an interval, including
fractional or decimal numbers). These data can be further classified into interval scale and
ratio scale data.
Qualitative data, on the other hand, can be divided into two groups: nominal (which can be
assigned a code in the form of a number, where the numbers are simply labels, such as races,
for example: Malay = 1, Chinese = 2 and Indian = 3) and ordinal (which can be ranked, i.e.
put in order, or have a rating scale attached, for example: first, second, and third place in a
competition).

## 1.1.2 Measures of central tendency

A central tendency of a set of data is a numerical value that indicates the middle of the data
set. The most common measures of central tendency are mean, median and mode.
(a) Mean
The mean, or arithmetic mean, of a list of observations is the sum of all observations divided by
the number of observations. The population mean is

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

where $N$ is the population size. If we take a sample of size $n$, the sample mean is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

(b) Median
The median is the middle value that divides the higher half of the data from the lower half
when the observations are arranged in ascending or descending order. If the number of
observations is odd, the median is the middle value, and if the number of observations is
even, the median is the average of the two middle values.

(c) Mode
The mode is the observation with the highest frequency. If several observations share the
same highest frequency, then there is more than one mode in the set of data. A
mode may not exist if all observations occur with the same frequency. Therefore, unlike the mean
and median, the mode is not necessarily unique.

## 1.1.3 Measures of dispersion

Measures of dispersion or variation are numerical values that indicate the variability of a set
of data. When the dispersion is large, the data are widely scattered. The simplest measure of
variation is the range, but the most used measures are the variance and standard deviation.
(a) Range
The range of a data set is the difference between the largest and the smallest observations.
Range = Largest observation - Smallest observation

(b) Variance
The variance of a set of data is a measure of the spread or dispersion within the set of
data. The population variance is denoted by $\sigma^2$ and the sample variance by $s^2$.
The population variance is given by

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

where $N$ is the population size, $x_i$ is the i-th observation in the population and $\mu$ is
the population mean.
The sample variance, on the other hand, is given by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

where $n$ is the sample size, $x_i$ is the i-th observation and $\bar{x}$ is the sample mean.

The variance is never negative because each squared deviation is either positive or zero.
The unit of the variance is the square of the unit of observation.
(c) Standard deviation
The standard deviation is the positive square root of the variance. Therefore the standard
deviations for a population and a sample are

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2} \quad \text{and} \quad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

respectively.

## 1.1.4 The use of calculators

Simple summary statistics such as the mean and standard deviation of a sample of
univariate data can be calculated by hand. However, this is often tedious and error-prone,
especially when dealing with a large set of sample data. To avoid this, it is useful to use a
scientific calculator to obtain the quantities $\bar{x}$, $s$, $n$, $\sum x$ and $\sum x^2$,
as well as to calculate the mean and the standard deviation of a set of numbers.

The following example was done using a Casio fx-570MS. You should consult your
calculator's instruction manual if yours does not follow the same pattern.

Set the calculator as follows:
(1) Clear screen
Press Shift, Press CLR, Choose 1 (for clear screen, Scl),
Press =, Press AC.
(2) Choosing SD mode
Press MODE, MODE, Choose 1 (for standard deviation, SD),
Press =. (Note: SD should appear on the display screen.)
(3) Entering data, e.g. 1, 2, 3, 4
Press 1, Press M+.
Press 2, Press M+.
Press 3, Press M+.
Press 4, Press M+.
(4) Finding summary statistics
Shift 2, choose 1, gives the sample mean $\bar{x} = 2.5$.
Shift 2, choose 3, gives the sample standard deviation $s = 1.29$.
Shift 1, choose 1, gives $\sum x^2 = 30$.
Shift 1, choose 3, gives $n = 4$.

Example 1
___________________________________________________________________________
In a crash test, cars were tested to determine what impact speed was required to cause
bumper damage. The following data show the speed (in km/h) for a sample of 10 cars. Find
the mean, median, mode, range, variance and standard deviation using the formulas
manually. Check whether you get the same answers for the mean and standard
deviation using a calculator.

98, 101, 114, 90, 103, 93, 98, 105, 119, 89

Solution

$$\text{Mean} = \frac{98 + 101 + 114 + 90 + 103 + 93 + 98 + 105 + 119 + 89}{10} = \frac{1010}{10} = 101$$

To find the median, we rearrange the observations in ascending order:
89 90 93 98 98 101 103 105 114 119
Since the number of observations is even, the median is the average of the two middle
values:

$$\text{Median} = \frac{98 + 101}{2} = 99.5$$

Mode = 98, since it has the highest frequency, i.e. it appears most frequently in the data set.
Range = 119 - 89 = 30
As the data are taken from a sample, we calculate the sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{10} (x_i - \bar{x})^2 = \frac{1}{9}\sum_{i=1}^{10} (x_i - 101)^2 = \frac{860}{9} = 95.56$$

Thus, the standard deviation is

$$s = \sqrt{95.56} = 9.775$$
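
As an alternative to the calculator check, the same summary statistics can be reproduced in software. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original text.

```python
import statistics

# Impact speeds (km/h) of the 10 sampled cars from the crash-test example
speeds = [98, 101, 114, 90, 103, 93, 98, 105, 119, 89]

mean = statistics.mean(speeds)
median = statistics.median(speeds)        # average of the two middle values for even n
mode = statistics.mode(speeds)            # most frequent observation
data_range = max(speeds) - min(speeds)
sample_variance = statistics.variance(speeds)  # divides by n - 1
sample_sd = statistics.stdev(speeds)           # square root of the sample variance

print(f"mean={mean}, median={median}, mode={mode}, range={data_range}")
print(f"s^2={sample_variance:.2f}, s={sample_sd:.3f}")
```

Note that `statistics.variance` and `statistics.stdev` use the divisor n - 1, matching the sample formulas above; `statistics.pvariance` and `statistics.pstdev` would give the population versions.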

___________________________________________________________________________
## 1.1.5 Types of Plots

Data can be summarized not only numerically, using a measure of central tendency and a
measure of dispersion, but also graphically, which gives an immediate idea of some
characteristics of the data such as its distribution and skewness.
A suitable graphical summary for quantitative data is a histogram or a boxplot, whereas
for qualitative data one can use a pie chart, bar chart or Pareto chart. In
addition, one can use a scatter plot to summarize graphically the relationship between two
quantitative variables.

## 1.2 Probability

In common usage, the word probability means the chance that a particular event will occur. In
statistics, probability is a numerical measure of the likelihood of the event. Before we go
further, it helps to understand a few terms that are connected to probability.

## 1.2.1 Basic notation and definition

(a) Outcome
An outcome is a result of an experiment or trial.
(b) Sample space
A sample space is a set that contains all possible outcomes of an experiment as its
elements. Usually we denote the sample space by S. For example, tossing a die leads
to S = {1, 2, 3, 4, 5, 6}.
(c) Event
An event is a subset of a sample space. Let an event A be defined as getting an odd number
from tossing a die. Then A = {1, 3, 5}, which is a subset of the sample space S = {1, 2, 3,
4, 5, 6}.
## 1.2.2 Classical definition of probability

Classical probability uses the sample space to determine the numerical probability that an
event will occur. It is also called theoretical probability. Let S be a sample space and E
be an event which is a subset of S. The probability of event E occurring is

$$P(E) = \frac{\text{number of elements in } E}{\text{number of elements in } S} = \frac{n(E)}{n(S)}$$

This is only true if all outcomes are equally likely (have the same chance) to occur.
There are some basic rules about probability:
(i) Any probability assigned must be a nonnegative real number. The probability takes a
value from 0 to 1. Since it reflects the chance of an event occurring, a probability of 0 indicates
that the event will never occur. On the other hand, a probability of 1 means the event
is certain to occur. Therefore,

$$0 \le P(E) \le 1$$

(ii) The probability of the sample space is always unity, i.e. $P(S) = 1$. The probability that
an event does not occur is one minus the probability that the event does occur. Therefore, if
$E'$ is the complement of $E$, then

$$P(E') = 1 - P(E)$$

(iii) $P(E_1 \cup E_2) = P(E_1) + P(E_2)$, where $E_1$ and $E_2$ are mutually exclusive.

(iv) More generally,

$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$$

where $E_1, E_2, \ldots, E_n$ are mutually exclusive.

Example 2
___________________________________________________________________________
In an experiment, a box containing 5 green bulbs, 6 blue bulbs and 4 white bulbs is used. A
bulb is chosen at random. What is the probability that (i) a white bulb, (ii) a non-white bulb is
chosen?
Solution
The number of bulbs in the box is 15, so $n(S) = 15$.
Suppose event A is "the bulb obtained is white". The number of white bulbs in the box is 4,
so $n(A) = 4$.
Hence,

$$P(\text{getting a white bulb}) = P(A) = \frac{4}{15}$$

and

$$P(\text{not getting a white bulb}) = P(A') = 1 - P(A) = 1 - \frac{4}{15} = \frac{11}{15}$$

## 1.2.3 Mutually exclusive event

When two events, say A and B, cannot occur together at the same time, we call these events
mutually exclusive or disjoint events. The probability of them both occurring at the same
time is 0:

$$P(A \cap B) = 0.$$

## 1.2.4 Additive rule of probability

The additive rule of probability can be used to determine the probability that event A or event
B occurs, or both occur, written $A \cup B$. The general additive rule is

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

To explain the above rule: when A and B are not mutually exclusive, there is an overlap
or intersection between A and B. When we add P(A) and P(B), the probability of
the intersection, $P(A \cap B)$, is therefore counted twice. To compensate for that double
counting, the intersection needs to be subtracted once.
When A and B are mutually exclusive, $P(A \cap B) = 0$, and the additive rule becomes

$$P(A \cup B) = P(A) + P(B)$$

Example 3
_________________________________________________________________________
In a group of 30 engineering students, 4 out of the 7 women and 8 out of the 23 men wear
spectacles. What is the probability that a person chosen at random from the group is a woman
or someone who wears spectacles?
Solution
Let W be "the person chosen is a woman" and S be "the person chosen wears spectacles".
We have

$$P(W) = \frac{7}{30}, \quad P(S) = \frac{12}{30} \quad \text{and} \quad P(W \text{ and } S) = P(W \cap S) = \frac{4}{30}.$$

Thus,

$$P(W \text{ or } S) = P(W \cup S) = P(W) + P(S) - P(W \cap S) = \frac{7}{30} + \frac{12}{30} - \frac{4}{30} = 0.5$$
___________________________________________________________________________
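
The additive rule in the spectacles example can be checked with exact fractions. This is a small illustrative sketch; the variable names are assumptions made for the example, and the cross-check simply counts the people who are women or wear spectacles.

```python
from fractions import Fraction

total = 30
women = 7
wearers = 4 + 8          # women and men who wear spectacles
women_wearers = 4        # women who wear spectacles (the overlap)

p_w = Fraction(women, total)
p_s = Fraction(wearers, total)
p_w_and_s = Fraction(women_wearers, total)

# General additive rule: P(W or S) = P(W) + P(S) - P(W and S)
p_w_or_s = p_w + p_s - p_w_and_s

# Cross-check by direct counting: all 7 women plus the 8 men with spectacles
direct = Fraction(7 + 8, total)

print(p_w_or_s, float(p_w_or_s))
```

Both routes give the same probability, which illustrates why the overlap is subtracted exactly once.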

## 1.2.5 Conditional probability

The probability of an event occurring given that another event has already occurred is called
a conditional probability. The symbol $P(A \mid B)$ denotes the probability that event A will
occur given that event B has occurred. The formula is given by

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

where $P(A \cap B)$ is the probability that event A and event B both occur and $P(B) \neq 0$ is the
probability that event B occurs.
Conditional probabilities are central to Bayes' theorem, named after the probability
theorist Thomas Bayes (1702–61).
Bayes' theorem gives a general conditional probability formula. If $A_1, A_2, \ldots, A_n$ are
mutually exclusive events that together cover the sample space and $P(B) \neq 0$, then for each
$k = 1, 2, \ldots, n$

$$P(A_k \mid B) = \frac{P(A_k)\,P(B \mid A_k)}{\sum_{i=1}^{n} P(A_i)\,P(B \mid A_i)}$$
__________________________________________________________________________
Example 4

A quality control officer inspects an assembled product from machine A by randomly
selecting one of its components from the assembly line. The probability that a defective
component is found is 35%. If a defective component was found, the probability that machine
A breaks down an hour after the officer's inspection is 0.64. On the other hand, if a
non-defective component was found, the probability that machine A breaks down an hour after the
officer's inspection is just 0.28.

(a) Find the probability that machine A breaks down an hour after the officer's inspection.
(b) If machine A breaks down an hour after inspection, what is the probability that a defective
component was found earlier?
___________________________________________________________________________
Solution:
$P(\text{Defective}) = P(D) = 0.35$
$P(\text{Breaks down} \mid \text{Defective}) = P(B \mid D) = 0.64$
$P(\text{Breaks down} \mid \text{Not defective}) = P(B \mid D') = 0.28$

(a) $P(\text{Breaks down}) = P(B) = P(D)P(B \mid D) + P(D')P(B \mid D') = 0.35(0.64) + 0.65(0.28) = 0.406$.

(b) $P(\text{Defective} \mid \text{Breaks down}) = P(D \mid B) = \dfrac{P(D \cap B)}{P(B)} = \dfrac{0.64 \times 0.35}{0.406} = 0.552$.
___________________________________________________________________________
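
The two steps of Example 4 (total probability, then Bayes' theorem) can be sketched in a few lines. The variable names below are illustrative only.

```python
# Given quantities from Example 4
p_d = 0.35               # P(D): a defective component is found
p_b_given_d = 0.64       # P(B | D): breakdown given a defective component
p_b_given_not_d = 0.28   # P(B | D'): breakdown given a non-defective component

# (a) Total probability: P(B) = P(D)P(B|D) + P(D')P(B|D')
p_b = p_d * p_b_given_d + (1 - p_d) * p_b_given_not_d

# (b) Bayes' theorem: P(D | B) = P(D)P(B|D) / P(B)
p_d_given_b = p_d * p_b_given_d / p_b

print(f"P(B) = {p_b:.3f}, P(D|B) = {p_d_given_b:.3f}")
```

The denominator in step (b) is exactly the sum computed in step (a), which is the form of the denominator in Bayes' theorem with two mutually exclusive events D and D'.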

## 1.2.6 Multiplication rule of probability

The multiplication rule can be used to determine the probability that two events, A and B,
both occur. The multiplication rule follows from the definition of conditional probability. The
result is often written as follows, using set notation:

$$P(A \cap B) = P(A \mid B)\,P(B)$$

or

$$P(A \cap B) = P(B \mid A)\,P(A)$$

where
$P(A)$ is the probability that event A occurs,
$P(B)$ is the probability that event B occurs,
$P(A \cap B)$ is the probability that event A and event B both occur,
$P(A \mid B)$ is the probability that event A occurs given that event B has already occurred,
and $P(B \mid A)$ is the probability that event B occurs given that event A has already occurred.
The multiplication rule is easy to understand from a tree diagram. In a tree diagram,
(i) the branches represent the possible outcomes of a trial, and (ii) the
sum of the probabilities on the branches from a common source is equal to 1.

_______________________________________________________________________
Example 5

All raw components of a certain product must pass two production processes to become a
finished product. The probability that a raw component passes the first production process is
0.72. The probability that the component passes the second production process after it passes
the first production process is 0.8. What is the probability that a raw component becomes a
finished product?
Solution
Let A be "a component passes the first production process" and B be "a component passes the
second production process". Then,

P(component becomes a finished product)
= P(component passes both production processes)
= $P(A \cap B)$
= $P(B \mid A)\,P(A)$
= 0.8(0.72)
= 0.576
___________________________________________________________________________
## 1.2.7 Independence

Two events A and B are independent events if and only if

$$P(A \cap B) = P(A)\,P(B).$$
__________________________________________________________________________
Example 6

Two marbles are drawn (without replacement) from a bag containing 4 red and 6 blue
marbles.
(a) What is the probability that both of them are blue?
(b) What is the probability of getting one red and one blue marble?

Solution
Let R represent a red marble and B represent a blue marble.

(a) $P(B \text{ and } B) = \dfrac{6}{10} \times \dfrac{5}{9} = \dfrac{1}{3}$

(b) $P(R \text{ and } B) = P(R \cap B) + P(B \cap R) = \dfrac{4}{10} \times \dfrac{6}{9} + \dfrac{6}{10} \times \dfrac{4}{9} = \dfrac{8}{15}$,
where the first term is red then blue and the second is blue then red.
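
The marble probabilities can also be confirmed by brute-force enumeration of every ordered pair of draws. This is only a sketch for checking the arithmetic; the list `bag` and the other names are made up for the illustration.

```python
from fractions import Fraction
from itertools import permutations

# 4 red and 6 blue marbles; each marble is identified by its position in the list
bag = ["R"] * 4 + ["B"] * 6

# All ordered draws of two different marbles (drawing without replacement)
draws = list(permutations(range(len(bag)), 2))

both_blue = sum(1 for i, j in draws if bag[i] == "B" and bag[j] == "B")
one_each = sum(1 for i, j in draws if bag[i] != bag[j])

p_both_blue = Fraction(both_blue, len(draws))
p_one_each = Fraction(one_each, len(draws))

print(p_both_blue, p_one_each)
```

Because every ordered pair is equally likely, the classical definition of probability applies directly, and the counts reproduce the multiplication-rule answers.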

EXERCISE
A motor company has 18 used cars and 11 of them are accident-free. For an accident-free
car, the probability that the alarm system is not functioning is 0.3, and for a car that is not
accident-free, the probability that the alarm system is not functioning is 0.6.
(a) Mr. Ahmad wants to buy 2 used cars for his children. Find the probability that both cars are
accident-free.
(b) Mr. Osman also wants to buy 2 used cars. What is the probability that only one of them
is not accident-free?
(c) What is the probability that Miss Ani buys a car that is accident-free and whose alarm
system is working?
(d) Ali wants to buy a used car. What is the probability that its alarm system is not
functioning?
(e) The alarm system of a used car bought by Madam Sheely is not functioning. What is
the probability that the car is accident-free?

___________________________________________________________________________
## 1.3 Random Variables

A random variable, usually written as X, is a variable whose possible values are numerical
outcomes of a random phenomenon. There are two types of random variables, discrete and
continuous.

## 1.3.1 Discrete random variable

A discrete random variable is one which may take on only a countable number of distinct
values, such as the number of children in a family, the number of goals scored in football
games and the number of defective bulbs in a box.
The probability distribution of a discrete random variable (sometimes called the probability mass
function) is a list of probabilities associated with each of its possible values. The probabilities
$P(X = x_i) = p_i$ must satisfy the following conditions:

(a) $0 \le p_i \le 1$, and

(b) $\sum_i p_i = 1$.
___________________________________________________________________________
Example 7

Show that $P(X = x) = \dfrac{x^2}{30}$ is the probability mass function for X, for
$x = 0, 1, 2, 3, 4$.

Solution:
We need to show that $0 \le p_i \le 1$ and $\sum_i p_i = 1$.
Now

$$P(0) = 0, \quad P(1) = \frac{1}{30}, \quad P(2) = \frac{4}{30}, \quad P(3) = \frac{9}{30} \quad \text{and} \quad P(4) = \frac{16}{30}.$$

Thus, $0 \le P(X = x) \le 1$ for all $x = 0, 1, 2, 3, 4$.
Now,

$$\sum_x P(X = x) = 0 + \frac{1}{30} + \frac{4}{30} + \frac{9}{30} + \frac{16}{30} = 1$$

Hence, it is shown that X is a discrete random variable and P(X = x) is the probability mass
function for X.
___________________________________________________________________________
Example 8

Find c if X is a discrete random variable with probability mass function

$$P(X = x) = \frac{cx}{2}$$

for $x = 0, 1, 2, 3$.
Solution
If X is a discrete random variable, then

$$\sum_x P(X = x) = 1$$

$$0 + \frac{c}{2} + \frac{2c}{2} + \frac{3c}{2} = 1$$

$$\frac{6c}{2} = 1$$

$$c = \frac{1}{3}$$
__________________________________________________________________________
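
The two conditions for a probability mass function are easy to test mechanically. The sketch below checks Example 7 and solves Example 8 with exact fractions; the names `pmf`, `is_valid`, `weights_sum` and `c` are illustrative only.

```python
from fractions import Fraction

# Example 7: P(X = x) = x^2 / 30 for x = 0, 1, 2, 3, 4
pmf = {x: Fraction(x * x, 30) for x in range(5)}

# Condition (a): every probability lies in [0, 1]; condition (b): they sum to 1
is_valid = all(0 <= p <= 1 for p in pmf.values()) and sum(pmf.values()) == 1

# Example 8: P(X = x) = c * x / 2 for x = 0, 1, 2, 3.
# The probabilities must sum to 1, so c = 1 / (sum of x/2 over the support).
weights_sum = sum(Fraction(x, 2) for x in range(4))
c = 1 / weights_sum

print(is_valid, c)
```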
## 1.3.2 Continuous random variable

A continuous random variable can take all possible values over an interval of real numbers,
such as weight, time, and height. The probability of a random variable X being in an interval
[a, b] is defined as the area under a curve which is represented by a function f(x), that is

$$P(a \le X \le b) = \int_a^b f(x)\,dx = F(b) - F(a)$$

The function f(x) is called a probability density function and it satisfies the following
conditions:

(a) The curve of f(x) has no negative values ($f(x) \ge 0$ for all x).
(b) The total area under the curve is equal to 1.
The function $F(\cdot)$ is a cumulative distribution function, which will be discussed in the next
section.
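___________________________________________________________________________
Example 9

Show that $f(x) = \dfrac{x^2}{21}$, $1 \le x \le 4$, is a pdf and find $P(2 \le X \le 3)$.

Solution
Since $f(x) \ge 0$ on this interval, we must show that $\int_1^4 f(x)\,dx = 1$:

$$\int_1^4 \frac{x^2}{21}\,dx = \left[\frac{x^3}{63}\right]_1^4 = \frac{64}{63} - \frac{1}{63} = 1 \quad \text{(shown)}$$

$$P(2 \le X \le 3) = \int_2^3 \frac{x^2}{21}\,dx = \left[\frac{x^3}{63}\right]_2^3 = \frac{27 - 8}{63} = \frac{19}{63}$$
___________________________________________________________________________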
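
When an antiderivative is awkward to find, the area conditions for a density can be checked numerically. The sketch below applies a simple midpoint rule to the density $f(x) = x^2/21$ on $[1, 4]$; the helper `integrate` is a hypothetical name written for this illustration, not a library function.

```python
def f(x):
    """Density x^2 / 21 on [1, 4], zero elsewhere."""
    return x * x / 21 if 1 <= x <= 4 else 0.0


def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return sum(func(a + (k + 0.5) * width) for k in range(steps)) * width


total_area = integrate(f, 1, 4)   # should be close to 1
p_2_to_3 = integrate(f, 2, 3)     # should be close to 19/63

print(f"total area = {total_area:.6f}, P(2 <= X <= 3) = {p_2_to_3:.6f}")
```

The numerical values agree with the exact answers 1 and 19/63 (about 0.3016) to several decimal places.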
## 1.3.3 Cumulative distribution function

The cumulative distribution function, denoted by $F(\cdot)$, is

$$F(x) = P(X \le x)$$

For a discrete random variable, the cumulative distribution function is the sum of the
probabilities, that is

$$F(x) = P(X \le x) = \sum_{t \le x} P(X = t).$$

For a continuous random variable, the cumulative distribution function is found by
integrating f(t) from $-\infty$ to x, that is

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
__________________________________________________________________________
Example 10

Find the cumulative distribution function of X if X is a discrete random variable having the
following probability distribution:

| X = x | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| P(X = x) | 0.12 | 0.54 | 0.09 | 0.25 |

Solution

| X = x | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| P(X = x) | 0.12 | 0.54 | 0.09 | 0.25 |
| F(x) | 0.12 | 0.66 | 0.75 | 1.0 |

or

$$F(x) = \begin{cases} 0, & x < 1; \\ 0.12, & 1 \le x < 2; \\ 0.66, & 2 \le x < 3; \\ 0.75, & 3 \le x < 4; \\ 1, & x \ge 4. \end{cases}$$

___________________________________________________________________________
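
For a discrete random variable, the cumulative distribution function is just a running total of the probabilities. The following sketch rebuilds the table of Example 10 with `itertools.accumulate`; the function name `cdf` is an illustrative choice.

```python
from itertools import accumulate

values = [1, 2, 3, 4]
probs = [0.12, 0.54, 0.09, 0.25]

# Running totals give F(x) at each value in the support
cumulative = list(accumulate(probs))


def cdf(x):
    """F(x) = P(X <= x): sum the probabilities of all values not exceeding x."""
    return sum(p for v, p in zip(values, probs) if v <= x)


for v, big_f in zip(values, cumulative):
    print(f"F({v}) = {big_f:.2f}")
```

Evaluating `cdf` between support points, for instance at 2.5, returns the value at the previous support point, which is the step behaviour expressed by the piecewise form of F(x).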
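Example 11

Find the cumulative distribution function if the probability density function for X is

$$f(x) = \begin{cases} 0.1, & 10 \le x \le 20 \\ 0, & \text{elsewhere} \end{cases}$$

Solution
For $x < 10$, $F(x) = 0$.
For $10 \le x \le 20$,

$$F(x) = \int_{10}^{x} f(t)\,dt = \int_{10}^{x} 0.1\,dt = 0.1x - 1$$

For $x > 20$, $F(x) = 1$.
Therefore

$$F(x) = \begin{cases} 0, & x < 10 \\ 0.1x - 1, & 10 \le x \le 20 \\ 1, & x > 20 \end{cases}$$

___________________________________________________________________________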
## 1.3.4 Mathematical expectation

The expected value of a random variable indicates its average or central value.
(a) The expected value of a discrete random variable X is defined by

$$E(X) = \sum_{i=1}^{n} x_i P(x_i)$$

(b) The expected value of a continuous random variable X with probability density function f(x) is defined by

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$

Some properties of the expected value, E(X):
(a) The expected value of a constant is equal to the constant itself, that is $E(k) = k$.
(b) $E(kX) = kE(X)$, where k is a constant.
___________________________________________________________________________
## 1.3.5 Variance and standard deviation

The variance and standard deviation are non-negative real values which give an idea of how
widely spread the values of the random variable are likely to be. When the variance is large,
the observations are more scattered around the mean. The variance of a random variable
X is defined as

$$\mathrm{Var}(X) = \sigma^2 = E(X^2) - [E(X)]^2$$

where $E(X)$ and $E(X^2)$ both exist and $E(X)$ is the expected value of X.

Some properties of the variance, Var(X), include
(a) The variance of a constant is equal to zero, $\mathrm{Var}(k) = 0$.
(b) $\mathrm{Var}(kX) = k^2\,\mathrm{Var}(X)$.

Note: Standard deviation, $\sigma = \sqrt{\mathrm{Var}(X)}$.

___________________________________________________________________________
Example 12

X is a random variable taking the values 1, 2, 5 and 10, with the probability function P(X = x) defined by
P(X = 1) = 0.4, P(X = 2) = 0.3 and P(X = 10) = 0.2.
(a) Find P(X = 5).
(b) Evaluate the mean E(X) and the variance Var(X).
Solution
(a) P(X = 1) + P(X = 2) + P(X = 5) + P(X = 10) = 1
0.4 + 0.3 + P(X = 5) + 0.2 = 1
P(X = 5) = 0.1

(b) Mean:

$$E(X) = \sum_{i=1}^{4} x_i P(x_i) = 1(0.4) + 2(0.3) + 5(0.1) + 10(0.2) = 3.5$$

$$E(X^2) = \sum_{i=1}^{4} x_i^2 P(x_i) = 1^2(0.4) + 2^2(0.3) + 5^2(0.1) + 10^2(0.2) = 24.1$$

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = 24.1 - 12.25 = 11.85$$
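
The same calculation can be written as a short script, which is convenient when the support is larger. This sketch stores the probability function of Example 12 in a dictionary; the names `pmf`, `mean`, `mean_sq` and `variance` are illustrative.

```python
# Known probabilities from Example 12; P(X = 5) is found from the total of 1
pmf = {1: 0.4, 2: 0.3, 10: 0.2}
pmf[5] = 1 - sum(pmf.values())

mean = sum(x * p for x, p in pmf.items())          # E(X)
mean_sq = sum(x**2 * p for x, p in pmf.items())    # E(X^2)
variance = mean_sq - mean**2                       # Var(X) = E(X^2) - [E(X)]^2

print(f"P(X=5) = {pmf[5]:.1f}, E(X) = {mean:.1f}, Var(X) = {variance:.2f}")
```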

___________________________________________________________________________
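Example 13

The probability density function of a random variable X is f(x), defined as follows:

$$f(x) = \begin{cases} 0.1, & 2 \le x \le 6 \\ 0.2, & 8 \le x \le 11 \\ 0, & \text{elsewhere} \end{cases}$$

Find the mean and variance of X.

Solution

$$E(X) = \int_2^6 0.1x\,dx + \int_8^{11} 0.2x\,dx = \left[\frac{0.1x^2}{2}\right]_2^6 + \left[\frac{0.2x^2}{2}\right]_8^{11} = (1.8 - 0.2) + (12.1 - 6.4) = 7.3$$

$$E(X^2) = \int_2^6 0.1x^2\,dx + \int_8^{11} 0.2x^2\,dx = \left[\frac{0.1x^3}{3}\right]_2^6 + \left[\frac{0.2x^3}{3}\right]_8^{11} = 6.93 + 54.6 = 61.53$$

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = 61.53 - 53.29 = 8.24$$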
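
For a piecewise-constant density such as this one, the integrals reduce to closed forms on each piece, so the result can be verified exactly. The sketch below is one way to do that with fractions; the `pieces` representation is an assumption made for the illustration.

```python
from fractions import Fraction

# Each piece of the density is (height, lower limit, upper limit)
pieces = [(Fraction(1, 10), 2, 6), (Fraction(1, 5), 8, 11)]

# Integral of h over [a, b] is h(b - a); of h*x is h(b^2 - a^2)/2; of h*x^2 is h(b^3 - a^3)/3
total_area = sum(h * (b - a) for h, a, b in pieces)
mean = sum(h * Fraction(b**2 - a**2, 2) for h, a, b in pieces)
mean_sq = sum(h * Fraction(b**3 - a**3, 3) for h, a, b in pieces)
variance = mean_sq - mean**2

print(float(total_area), float(mean), round(float(mean_sq), 2), round(float(variance), 2))
```

The total area of 1 confirms that f(x) is a valid density, and the exact variance, 2473/300, rounds to the 8.24 obtained in the worked solution.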

__________________________________________________________________________________

## 1.4 Some Probability Distributions

In this section, we introduce some popular distributions for discrete and continuous
random variables. Popular distributions for discrete random variables include the binomial,
Poisson, negative binomial, hypergeometric and geometric distributions. On the other hand,
special distributions for continuous random variables include the normal, exponential, Erlang,
gamma, Weibull and lognormal distributions.
## 1.4.1 Binomial distribution

The binomial distribution is a discrete probability distribution. It is used when there are exactly
two mutually exclusive outcomes of a trial and these outcomes are appropriately labeled as
"success" and "failure". The binomial distribution is used to obtain the probability of
observing x successes in n trials, where the probability of success on a
single trial is denoted by $\pi$ (note that some references use p). The binomial distribution
assumes that $\pi$ is fixed for all trials.

In general, if a random variable X follows the binomial distribution with parameters n and $\pi$,
we write $X \sim B(n, \pi)$, and

$$P(X = x) = {}^{n}C_{x}\,\pi^{x}(1 - \pi)^{n - x}, \quad x = 0, 1, 2, \ldots, n$$

where ${}^{n}C_{x} = \dbinom{n}{x} = \dfrac{n!}{x!\,(n - x)!}$. The mean, $\mu$, and variance, $\sigma^2$, of the binomial
distribution are $\mu = n\pi$ and $\sigma^2 = n\pi(1 - \pi)$ respectively.

We can evaluate the probabilities associated with a binomial distribution using either a scientific
calculator or a statistical table. Certain statistical tables provide the cumulative binomial
probabilities, $P(X \le k)$.
___________________________________________________________________________
Example 14

If $X \sim B(5, 0.3)$, find
(a) $P(X \le 4)$
(b) $P(X = 2)$
(c) $P(X < 3)$
(d) $P(X > 1)$
(e) $P(X \ge 3)$

Solution
(a) $P(X \le 4) = 0.9976$
(b) $P(X = 2) = P(X \le 2) - P(X \le 1) = 0.8369 - 0.5282 = 0.3087$
(c) $P(X < 3) = P(X \le 2) = 0.8369$
(d) $P(X > 1) = 1 - P(X \le 1) = 1 - 0.5282 = 0.4718$
(e) $P(X \ge 3) = 1 - P(X \le 2) = 1 - 0.8369 = 0.1631$
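The table look-ups in Example 14 can also be reproduced directly from the binomial formula. The sketch below uses only the Python standard library; the function names `binom_pmf` and `binom_cdf` are illustrative, not taken from the text.

```python
from math import comb


def binom_pmf(x, n, pi):
    """P(X = x) for X ~ B(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


def binom_cdf(k, n, pi):
    """Cumulative probability P(X <= k) for X ~ B(n, pi)."""
    return sum(binom_pmf(x, n, pi) for x in range(k + 1))


n, pi = 5, 0.3
print(round(binom_cdf(4, n, pi), 4))      # (a) P(X <= 4) = 0.9976
print(round(binom_pmf(2, n, pi), 4))      # (b) P(X = 2)  = 0.3087
print(round(binom_cdf(2, n, pi), 4))      # (c) P(X < 3)  = 0.8369
print(round(1 - binom_cdf(1, n, pi), 4))  # (d) P(X > 1)  = 0.4718
print(round(1 - binom_cdf(2, n, pi), 4))  # (e) P(X >= 3) = 0.1631
```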
___________________________________________________________________________
Example 15

A pewter manufacturer produces souvenir mugs. Suppose that one of the machines breaks
down and 8% of the mugs are found to be defective and cannot be sold. If 23 mugs are
selected at random, find the probability that
(a) 3 mugs are defective.
(b) between 8 and 10 mugs are defective.
(c) at least 1 mug cannot be sold.

Solution
Let X represent the number of defective mugs, then $X \sim B(23, 0.08)$.
(a) $P(X = 3) = {}^{23}C_3\,(0.08)^3 (1 - 0.08)^{23 - 3} = 0.1711$

(c) $P(X \ge 1) = 1 - P(X = 0) = 1 - {}^{23}C_0\,(0.08)^0 (1 - 0.08)^{23} = 1 - 0.1469 = 0.8531$

___________________________________________________________________________
1.4.2 Poisson distribution
The Poisson distribution is another discrete probability distribution. When we know the mean
number of events that occur in a certain time interval or continuum of space, then the Poisson
distribution is a suitable distribution to find the probability of exactly x occurrences in that
interval. Generally, a discrete random variable X is said to follow a Poisson distribution with
parameter $\lambda$, written as
$$X \sim Po(\lambda)$$
if it has the following probability distribution function
$$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!} \qquad \text{for } x = 0, 1, 2, \ldots$$
where $\lambda$ is the mean number of events in the given time interval or continuum of space. The
intervals must be statistically independent. The Poisson distribution has expected value
$E(X) = \lambda$ and variance $\mathrm{Var}(X) = \lambda$.

We can evaluate the probability associated with a Poisson distribution either using a scientific
calculator or a statistical table. Certain statistical tables provide the cumulative Poisson
probabilities, $P(X \le k)$.
If $X_1 \sim Po(\lambda_1), X_2 \sim Po(\lambda_2), \ldots, X_n \sim Po(\lambda_n)$ are independent, then
$$X_1 + X_2 + \cdots + X_n \sim Po(\lambda_1 + \lambda_2 + \cdots + \lambda_n)$$

Example 16

If $X \sim Po(2.4)$, find
(a) $P(X \le 6)$
(b) $P(X \ge 3)$
(c) $P(X < 8)$
(d) $P(X > 1)$
(e) $P(X = 4)$

Solution
(a) $P(X \le 6) = 0.9884$
(b) $P(X \ge 3) = 1 - P(X \le 2) = 1 - 0.5697 = 0.4303$
(c) $P(X < 8) = P(X \le 7) = 0.9967$
(d) $P(X > 1) = 1 - P(X \le 1) = 1 - 0.3084 = 0.6916$
(e) $P(X = 4) = P(X \le 4) - P(X \le 3) = 0.9041 - 0.7787 = 0.1254$
___________________________________________________________________________
Example 17

On average, Good Construction can build 8 units of playground during a 2-month period.
Find the probability that
(a) Good Construction can only build 3 units of playground during a 2-month period.
(b) Good Construction can build at most 10 units of playground during a 2-month period.
(c) Good Construction can build more than 20 units of playground during a 4-month period.

Solution
Let X be the number of playgrounds Good Construction can build during a 2-month period,
then $X \sim Po(8)$.
(a) $P(X = 3) = \dfrac{e^{-8}\, 8^3}{3!} = 0.0286$
(b) $P(X \le 10) = 0.8159$
Let Y be the number of playgrounds Good Construction can build during a 4-month period,
then $Y \sim Po(16)$.
(c) $P(Y > 20) = 1 - P(Y \le 20) = 1 - 0.8682 = 0.1318$
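As a cross-check on the table values, the Poisson probabilities in Example 17 can be computed from the probability function itself. This is a minimal sketch; `poisson_pmf` and `poisson_cdf` are illustrative helper names.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam)."""
    return exp(-lam) * lam**x / factorial(x)


def poisson_cdf(k, lam):
    """Cumulative probability P(X <= k) for X ~ Po(lam)."""
    return sum(poisson_pmf(x, lam) for x in range(k + 1))


# Example 17: 8 playgrounds per 2 months, hence 16 per 4 months.
print(round(poisson_pmf(3, 8), 4))       # (a) P(X = 3)
print(round(poisson_cdf(10, 8), 4))      # (b) P(X <= 10)
print(round(1 - poisson_cdf(20, 16), 4)) # (c) P(Y > 20)
```

Doubling the interval doubles the Poisson mean, which is why part (c) uses 16 rather than 8.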

___________________________________________________________________________
1.4.3 Negative binomial distribution

A negative binomial experiment is a statistical experiment that has the following properties:

• The experiment consists of x repeated trials.
• Each trial can result in just two possible outcomes. We call one of these outcomes
a success and the other, a failure.
• The probability of success, denoted by p, is the same on every trial.
• The trials are independent; that is, the outcome on one trial does not affect the
outcome on other trials.

A negative binomial random variable is the number X of repeated trials needed to
produce r successes in a negative binomial experiment. The probability distribution of a
negative binomial random variable is called a negative binomial distribution,
which is also known as the Pascal distribution.
The negative binomial probability refers to the probability that a negative binomial
experiment results in r − 1 successes after trial x − 1 and r successes after trial x.
Definition: Suppose a negative binomial experiment consists of x trials and results in r
successes. If the probability of success on an individual trial is p, then the negative binomial
probability is:
$$b^*(x; r, p) = {}^{x-1}C_{r-1}\; p^r (1 - p)^{x - r}$$
for x = r, r + 1, r + 2, . . ., r = 1, 2, 3, . . ., and 0 < p < 1.

Note that
$${}^{x-1}C_{r-1} = \binom{x-1}{r-1} = \frac{(x - 1)!}{(r - 1)!\,(x - r)!}.$$

The mean and variance for a negative binomial random variable are
$$E(X) = \frac{r}{p} \qquad \text{and} \qquad \mathrm{Var}(X) = \frac{r(1 - p)}{p^2}$$
respectively.
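The negative binomial probability can be evaluated with a few lines of code. The sketch below is illustrative only: `neg_binom_pmf` is a made-up helper name, and the numbers plugged in are the call-connection probability 0.05 with r = 2 successes on trial x = 6.

```python
from math import comb


def neg_binom_pmf(x, r, p):
    """P(the r-th success occurs on trial x), for x = r, r + 1, ..."""
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)


def neg_binom_mean_var(r, p):
    """Mean r/p and variance r(1 - p)/p^2."""
    return r / p, r * (1 - p) / p**2


# Second success on the 6th trial when p = 0.05.
print(round(neg_binom_pmf(6, 2, 0.05), 4))  # 0.0102
print(neg_binom_mean_var(2, 0.05))
```

Setting r = 1 in `neg_binom_pmf` gives the probability that the first success occurs on trial x.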
_________________________________________________________________________
1

Suppose that a call to Sinar FM gets connected with a probability of 0.05. Assume calls are
independent,
(a) what is the probability that the 6-th call made is the second call that gets connected?
[ 0.0102]

(b) what is the probability that more than four calls have to be made before getting
connected?
[0.8145]
___________________________________________________________________________

2

Assume that a sample of 15 components are tested every hour. Suppose X denotes the
number of components in the sample of 15 that require modification. Components are
assumed to be independent with respect to modification. If the percentage of components that
require modification remains at 1.5%, what is the probability that hour 8 is the third sample at
which X exceeds 1?
[1.6894 × 10⁻⁴]

___________________________________________________________________________
1.4.4 Geometric distribution

The geometric distribution is a special case of the negative binomial distribution. It deals with
the number of trials required for a single success. Thus, the geometric distribution is a negative
binomial distribution where the number of successes (r) is equal to 1.
Definition: Suppose a negative binomial experiment consists of x trials and results
in one success. If the probability of success on an individual trial is p, then the geometric
probability is:
$$P(x; p) = p\,(1 - p)^{x - 1} \qquad \text{for } x = 1, 2, 3, \ldots, \text{ and } 0 < p < 1.$$

The mean and variance for a geometric random variable follow from the negative binomial results with r = 1:
$$E(X) = \frac{1}{p} \qquad \text{and} \qquad \mathrm{Var}(X) = \frac{1 - p}{p^2}$$
respectively.
__________________________________________________________________________

The probability that a computer running a certain operating system crashes on any given day
is 0.05. Find the probability that the computer crashes for the first time on the 10th day after
the operating system is installed. Find the expected number of days the computer runs before
it crashes for the first time.
[0.0315; 20]

1.4.5 Hypergeometric distribution

A hypergeometric experiment has the following properties:

• A sample of size n is randomly selected without replacement from a population of N
items.
• In the population, k items can be classified as successes, and N − k items can be
classified as failures.

The following notation is used:

N: The number of items in the population.
k: The number of items in the population that are classified as successes.
n: The number of items in the sample.
x: The number of items in the sample that are classified as successes.
${}^{k}C_x$: The number of combinations of k items, taken x at a time.
P(x; N, n, k): hypergeometric probability, that is, the probability that an n-trial
hypergeometric experiment results in exactly x successes, when the population
consists of N items, k of which are classified as successes.

A hypergeometric random variable is the number of successes that result from a
hypergeometric experiment. The probability distribution of a hypergeometric random variable
is called a hypergeometric distribution.
Definition: Suppose a population consists of N items, k of which are successes, and a random
sample drawn from that population consists of n items, x of which are successes. Then the
hypergeometric probability is:
$$P(x; N, n, k) = \frac{{}^{k}C_x \;\; {}^{N-k}C_{n-x}}{{}^{N}C_n}$$

The mean and variance for X are:
$$E(X) = np \qquad \text{and} \qquad \mathrm{Var}(X) = np(1 - p)\left(\frac{N - n}{N - 1}\right)$$
respectively, where $p = k/N$ and $\dfrac{N - n}{N - 1}$ is the finite population correction factor.
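A small numerical sketch may help fix the notation. The population below (N = 10 items, k = 4 successes, sample of n = 3) is a hypothetical illustration, not an example from the text, and the function names are likewise illustrative.

```python
from math import comb


def hypergeom_pmf(x, N, n, k):
    """P(x; N, n, k): exactly x successes in a sample of n drawn without replacement."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)


def hypergeom_mean_var(N, n, k):
    """Mean np and variance np(1 - p)(N - n)/(N - 1), with p = k/N."""
    p = k / N
    fpc = (N - n) / (N - 1)  # finite population correction factor
    return n * p, n * p * (1 - p) * fpc


N, n, k = 10, 3, 4  # hypothetical population and sample sizes
print(hypergeom_pmf(1, N, n, k))  # 4 * 15 / 120 = 0.5
print(hypergeom_mean_var(N, n, k))
```

As N grows with k/N held fixed, the correction factor approaches 1 and the variance approaches the binomial value np(1 − p).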

__________________________________________________________________________

A company employs 500 men under the age of 58. Suppose that 25% carry a marker on a
male chromosome that indicates an increased risk for high blood pressure.
a. If 20 men in the company are tested for the marker in this chromosome, what is the
probability that exactly half of them have the marker.
[0.0089 ]
b. If 15 men in the company are tested for the marker in this chromosome, what is the
probability that more than 1 has the marker?
[0.9229 ]
___________________________________________________________________________
1.4.6 Normal distribution
The normal distribution is the most important continuous distribution in statistics because
normality arises naturally in many physical, biological, and social measurement situations. It
is also called the Gaussian distribution, after Gauss, who found the probability
density function (pdf) for the normal distribution. The pdf of a normal random variable X is
symmetric, bell-shaped and asymptotically approaches 0 as x goes to $-\infty$ or $\infty$.
A continuous random variable X with probability density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right], \qquad -\infty < x < \infty,$$
is normally distributed with mean $\mu$ and variance $\sigma^2$, that is
$$X \sim N(\mu, \sigma^2)$$

Since the integration for finding the probability using its probability density function is nontrivial, we transform X into a standard normal variable Z which has
a mean of 0 and a variance of 1. The transformation can be done by using the following formula:
$$Z = \frac{X - \mu}{\sigma}$$
where Z is a random variable for a standard normal distribution, that is
$$Z \sim N(0, 1)$$

We can evaluate the probability associated with a standard normal distribution either using a
scientific calculator or a statistical table. A statistical table typically provides two types of
tables associated with a standard normal distribution:
(i) a table that shows the probabilities for a standard normal distribution in the form
of $P(0 \le Z \le z)$, that is, the area under the standard normal curve between 0
and positive z values.
(ii) a table that shows the z values when $P(Z > z) = \alpha$, where $\alpha$ is the upper tail area of
the standard normal distribution, and $\alpha \le 0.5$.

Some properties of normal distribution

If $X \sim N(\mu_x, \sigma_x^2)$, $Y \sim N(\mu_y, \sigma_y^2)$ and X and Y are independent, then
(a) $kX \sim N(k\mu_x,\; k^2\sigma_x^2)$
(b) $X + Y \sim N(\mu_x + \mu_y,\; \sigma_x^2 + \sigma_y^2)$
(c) $X - Y \sim N(\mu_x - \mu_y,\; \sigma_x^2 + \sigma_y^2)$

___________________________________________________________________________
Example 18

The lifetime of ROAD tyre is normally distributed with mean 24000 km and standard
deviation 4000 km.

(a) Find the probability that the lifetime of ROAD tyre exceeds 27000 km.
(b) Find the probability that the lifetime of ROAD tyre is between 22500 km and
26500 km.
(c) If 10% of ROAD tyres have low lifetime, find the maximum distance it can
achieve.

Solution
Let X represent the lifetime of ROAD tyre, then $X \sim N(24000, 4000^2)$.
(a)
$$P(X > 27000) = P\left(Z > \frac{27000 - 24000}{4000}\right) = P(Z > 0.75) = 0.5 - 0.2734 = 0.2266$$
(b)
$$P(22500 < X < 26500) = P\left(\frac{22500 - 24000}{4000} < Z < \frac{26500 - 24000}{4000}\right) = P(-0.375 < Z < 0.625)$$
$$= 0.2357 + 0.148 = 0.3837$$
(c) Let x be the maximum distance specified, then the question implies $P(X < x) = 0.1$,
which is equivalent to $P(Z < -z_{0.1}) = 0.1$. From table, $z_{0.1} = 1.2816$.
Thus,
$$-1.2816 = \frac{x - 24000}{4000}$$
Hence,
x = 24000 − 1.2816(4000)
x = 18873.6 km.
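For readers with access to Python, the three parts of Example 18 can be checked with the standard-library class `statistics.NormalDist`. This is a sketch rather than the method used in the text; the variable names are illustrative. Part (b) differs slightly from the table-based answer because the table requires z to be rounded to two decimal places.

```python
from statistics import NormalDist

# Lifetime of a tyre: mean 24000 km, standard deviation 4000 km.
life = NormalDist(mu=24000, sigma=4000)

p_a = 1 - life.cdf(27000)                  # (a) P(X > 27000)
p_b = life.cdf(26500) - life.cdf(22500)    # (b) P(22500 < X < 26500)
x_c = life.inv_cdf(0.10)                   # (c) x with P(X < x) = 0.1

print(round(p_a, 4))
print(round(p_b, 4))  # close to, but not exactly, the table value 0.3837
print(round(x_c, 1))
```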

___________________________________________________________________________
Example 19

A Cooper test time for a football player from Team A is normally distributed with mean 660
seconds and standard deviation 45 seconds. The Cooper test time for a football player from Team B
is normally distributed with mean 690 seconds and standard deviation 25 seconds. A player is
selected at random.

(a) What is the probability a player from Team A can complete the test in less than 700
seconds?
(b) What is the probability the time set by a Team A player is better than the time
set by a Team B player?

Solution
Let X represent a time set by a Team A player, $X \sim N(660, 45^2)$, and let Y represent a time set
by a Team B player, $Y \sim N(690, 25^2)$.
(a)
$$P(X < 700) = P\left(Z < \frac{700 - 660}{45}\right) = P(Z < 0.89) = 0.5 + 0.3133 = 0.8133$$
(b)
$$P(X < Y) = P(X - Y < 0) = P\left(Z < \frac{0 - (660 - 690)}{\sqrt{45^2 + 25^2}}\right) = P(Z < 0.58) = 0.5 + 0.2190 = 0.7190$$

___________________________________________________________________________

A manufacturer produces bathroom tiles. The tiles are sold in boxes containing 25 tiles each.
The probability that a piece of tile from a box is defective is 0.1. A box is selected at random.

(a) What is the probability that
(i) no tiles are defective? [0.0718]
(ii) more than 10 tiles are defective? [0.0001]
(iii) at least 7 tiles are defective? [0.0095]

(b) An interior decorating company purchases 10 boxes of tiles from the manufacturer. What
is the probability that at least two of the boxes contain perfect tiles?
[0.1581]
__________________________________________________________________________

In the 2006 World Cup tournament, the weight of the balls used is normally distributed with
mean weight 435 grams and standard deviation 10 grams. A ball is selected at random.
(a) What is the probability the weight is between 400 grams and 450 grams? [0.933]
(b) What is the probability the weight is more than 460 grams? [0.0062]
(c) If 10% of the balls are considered heavy, what is the minimum weight of the ball
in that category?
[447.816 grams]
___________________________________________________________________________
1.4.7 Exponential distribution

The exponential distribution is also a family of continuous probability distributions. It describes
the time between successive events in a Poisson process, i.e. a process in which events occur
continuously and independently at a constant average rate.
Definition: Suppose a random variable X denotes the distance between successive events of a
Poisson process with mean $\lambda$, then X is an exponential random variable with parameter $\lambda$
which has the following probability density function:
$$f(x) = \lambda e^{-\lambda x}$$
for $0 \le x < \infty$ and $\lambda > 0$. The parameter $\lambda$ is also called a rate parameter, whereas $1/\lambda$ is a scale
parameter. The mean and variance for X are
$$E(X) = \frac{1}{\lambda} \qquad \text{and} \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2}$$
respectively. The random variable X following an exponential distribution with parameter $\lambda$
can be written as $X \sim Exp(\lambda)$. The cumulative distribution function for the exponential
random variable is
$$F(x) = P(X \le x) = \begin{cases} 1 - e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
The figure below demonstrates exponential probability density functions with different $\lambda$ values.
It can be seen from the figure that all pdfs are monotonically decreasing.

[Figure: exponential probability density functions for λ = 0.5, λ = 1 and λ = 1.5]
___________________________________________________________________________
Example 20

Solution
If $X \sim Exp(2)$, then $\mu_x = E(X) = \dfrac{1}{2} = 0.5$ and $\sigma_x^2 = \mathrm{Var}(X) = \dfrac{1}{2^2} = 0.25$.
Furthermore, $P(X > 1) = 1 - P(X \le 1) = 1 - \left[1 - \exp(-2(1))\right] = 0.1353$
_________________________________________________________________________

The time between phone calls received by a telephonist is exponentially distributed with a
mean of 10 minutes.
a. What is the probability that there are no calls in one hour? [0.0025]
b. What is the probability that there are not more than four calls within one hour? [0.2851]
c. Determine x such that the probability that there are no calls within x minutes is 0.02.
[39.12 minutes]

__________________________________________________________________________
An important property of the exponential distribution is that it is memoryless, which means
that if a random variable X is exponentially distributed, its conditional probability is given by
$$P(X > x_1 + x_2 \mid X > x_1) = P(X > x_2) \qquad \text{for all } x_1, x_2 \ge 0.$$
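The memoryless property can be verified numerically from the survival function $P(X > x) = e^{-\lambda x}$. The sketch below uses $\lambda = 2$ as in Example 20; the values of `x1` and `x2` are arbitrary illustrative choices.

```python
from math import exp, isclose


def exp_survival(x, lam):
    """P(X > x) for X ~ Exp(lam), valid for x >= 0."""
    return exp(-lam * x)


lam = 2.0
print(round(exp_survival(1, lam), 4))  # P(X > 1) = 0.1353, as in Example 20

x1, x2 = 0.7, 1.0  # arbitrary non-negative values chosen for illustration
# Conditional probability P(X > x1 + x2 | X > x1) = P(X > x1 + x2) / P(X > x1)
conditional = exp_survival(x1 + x2, lam) / exp_survival(x1, lam)
print(isclose(conditional, exp_survival(x2, lam)))  # True
```

The ratio simplifies algebraically to $e^{-\lambda x_2}$, so the time already elapsed has no effect on the remaining waiting time.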

________________________________________________________________________

The number of hits on a website follows a Poisson process with a rate of four per minute.
a. What is the probability that more than two minutes go by without a hit?
[3.35 × 10⁻⁴]
b. If two minutes have gone by without a hit, what is the probability that a hit will occur in
the next minute?
[ 0.9817]
___________________________________________________________________________
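1.4.8 Other continuous distributions

Other distributions for continuous random variables include the Erlang, gamma, Weibull and
lognormal distributions. Unlike the normal distribution, these distributions assume that the
variables are strictly non-negative. The probability density functions for these
distributions, together with their means and variances, are listed below:

1. Erlang
$$f(x) = \frac{\lambda^r x^{r-1} e^{-\lambda x}}{(r - 1)!} \qquad \text{for } x > 0 \text{ and } r = 1, 2, \ldots$$
$$E(X) = \frac{r}{\lambda}, \qquad \mathrm{Var}(X) = \frac{r}{\lambda^2}$$
Note: If r = 1, then Erlang is simply an exponential distribution.

2. Gamma
$$f(x) = \frac{\lambda^r x^{r-1} e^{-\lambda x}}{\Gamma(r)} \qquad \text{for } x > 0 \text{ and } r > 0$$
$$E(X) = \frac{r}{\lambda}, \qquad \mathrm{Var}(X) = \frac{r}{\lambda^2}$$
Note: If r is an integer, then Gamma is simply an Erlang distribution.

3. Weibull
$$f(x) = \frac{\beta}{\delta}\left(\frac{x}{\delta}\right)^{\beta - 1} \exp\left[-\left(\frac{x}{\delta}\right)^{\beta}\right] \qquad \text{for } x > 0,\ \beta > 0 \text{ and } \delta > 0$$
$$E(X) = \delta\,\Gamma\!\left(1 + \frac{1}{\beta}\right), \qquad \mathrm{Var}(X) = \delta^2\,\Gamma\!\left(1 + \frac{2}{\beta}\right) - \delta^2\left[\Gamma\!\left(1 + \frac{1}{\beta}\right)\right]^2$$
Note: $\beta$ and $\delta$ are the shape and the scale parameters respectively. If $\beta = 1$ then Weibull is simply an
exponential distribution with $\lambda = 1/\delta$.

4. Lognormal
$$f(x) = \frac{1}{x\,\omega\sqrt{2\pi}} \exp\left[-\frac{(\ln x - \theta)^2}{2\omega^2}\right] \qquad \text{for } 0 < x < \infty$$
and $X = \exp(W)$ where $W \sim N(\theta, \omega^2)$.
$$E(X) = e^{\theta + \omega^2/2}, \qquad \mathrm{Var}(X) = e^{2\theta + \omega^2}\left(e^{\omega^2} - 1\right)$$

The shape of the above distributions for varying values of their parameters can be
investigated via computer software such as Matlab. Further information and examples for
these distributions can be found in Montgomery & Runger (2006).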

__________________________________________________________________________
Exercise 1
1. Identify whether the following items are constants or variables. If it is a variable, determine
whether it is quantitative or qualitative, discrete or continuous
(a) The number of days in March.
(b) IC numbers for Malaysian citizen.
(c) The time taken to write an essay.
(d) The type of cars used by employees of a company.
(e) Temperature for each day in a month.
(f) Minimum age to take a driving licence
(g) The lengths of a specific type of bricks.
(h) The compressive strengths of 100 aluminium-lithium alloy specimens.
(i) The number of students registering Engineering Statistics in the last five
(j) The breakdown time of an insulating fluid between electrodes.
(k) The grades achieved by engineering students in UTM.

2. A motor company has 18 used cars and 11 of them are accident-free. For the accident-free
car, the probability alarm system is not functioning is 0.3 and if the car was not accident-free,
the probability alarm system is not functioning is 0.6.

(a) Mr. Ahmad wants to buy 2 used cars for his children. Find the probability both cars are
accident-free.
(b) Mr. Osman also wants to buy 2 used cars. What is the probability that only one of them
is not accident-free.
(c) What is the probability that Miss Ani buys a car that is accident-free and its alarm
system is working?
(d) Ali wants to buy a used car. What is the probability that its alarm system is not
functioning?
(e) The alarm system for a used car bought by Madam Sheely is not functioning. What is
the probability that it is accident-free?

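3. A continuous random variable X has the probability density function
$$f(x) = \begin{cases} k, & 0 \le x \le 1 \\ \dfrac{x}{4}, & 1 < x \le 2 \\ 0, & \text{elsewhere} \end{cases}$$
(a) Show that $k = \dfrac{5}{8}$.
(b) Find the cumulative distribution function F(x) for X.
(c) Find $P(1/2 \le X \le 3/2)$.
(d) Find the expected value and variance for X.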

4. An electronic product contains 20 integrated circuits. The integrated circuits are

independent of each other. The probability that any integrated circuit is defective is 0.03. The
product operates only when all integrated circuits work properly. What is the probability that
the product operates?
5. On average, IT Shop can sell 10 notebooks in 2 days. What is the probability
they can sell
(a) 13 notebooks in 2 days?
(b) at least 17 notebooks in 3 days?
(c) not more than 19 notebooks in 4 days?

6. The weight of a 24-inch flat screen LCD TV on the market is normally distributed with mean 15 kg
and standard deviation 2 kg. The weight of a standard TV having the same screen width is
also normally distributed but with mean 31 kg and standard deviation 5 kg. What is the
probability that
(a) the weight of a flat screen LCD TV is between 13 kg and 16 kg?
(b) the weight of 2 standard TVs is greater than 65 kg?
(c) the weight of 2 LCD TVs is greater than the weight of a single standard TV?

7. Suppose 16 observations are as listed below:

14  15  23  50  36  25  29  42
35  27  18  33  48  19  22  15

Use a scientific calculator to determine the mean and variance for the above data. Now
assume that the data are sample data selected at random. Find the new mean and variance.
8. 25 pieces of computer chips were tested and the proportion of any chip being
contaminated is 0.15. Find the probability that

(a) at most 6 chips are contaminated.

(b) at least 20 chips are not contaminated.
(c) between 4 and 8 chips are contaminated.
(d) more than 2 chips are not contaminated.

A supplier delivers ten boxes, each containing 25 chips, to a customer. What is the probability
that the customer will receive at least two boxes containing at most two contaminated chips
each?

9. The yield in RM from a day's production is assumed to be normally distributed with a
mean of RM2000 and a variance of 2500 RM squared. What is the probability that
(a) the production yield on any particular day exceeds RM2500.
(b) the production yield is less than RM1900 on each of the next two days,
assuming the yields on different days are independent random variables.
(c) 5% of a day's production yield is considered profitable revenue to the company. What is the daily minimum yield to be considered profitable?

Chapter 2
Sampling Distributions
Learning Objectives:
At the end of this chapter, students should be able to
(a) understand the concepts of sample mean and proportion.
(b) understand and use the central limit theorem.
(c) compute and interpret the sample mean and proportion.
(d) explain the important role of normal distributions as sampling distributions.
(e) calculate the probabilities associated with sample mean and sample proportion.

2.1

Introduction

It is often impossible to measure the mean or standard deviation of an entire population
unless the population is small or a nationwide census is available. The population mean, $\mu$,
and standard deviation, $\sigma$, are examples of population parameters. Given the impracticality of
measuring population parameters, we instead measure sample statistics, $\bar{X}$ or S, by taking
independent samples from the same population.
When we measure the entire population and calculate the mean or variance, we refer to this
quantity as a parameter of the population. If we measure from a sample, then the mean or
variance is referred to as a statistic. There are many statistics that we can use, which include
the mean, median, mode, standard deviation and so on. One reason we sample is so that we
can get an estimate for an unknown parameter of the population we sample from.

When we choose a sample of size n from a population and measure a statistic (mean, standard
deviation, etc.), the sampling distribution is the resulting probability distribution of that statistic. For
example, if the statistic is the sample mean, $\bar{x}$, of samples of size eight, then the sampling
distribution is the probability distribution of the sample mean, $\bar{X}$. It lists the various values
that $\bar{X}$ can assume and the probability of each value.

2.2 Central Limit Theorem

A very important and useful concept in statistics is the Central Limit Theorem (CLT).
The CLT says that if a large enough sample is drawn from a population, then the
distribution of the sample mean is approximately normal, regardless of the type of
distribution for the population the sample was drawn from.
The Central Limit Theorem states that
1. the mean of the sampling distribution of means is the same as the population mean,
2. the variance of the sampling distribution of means is the same as the population variance
divided by the size of the sample, and
3. if the population from which the sample is taken is normally distributed, then the sampling
distribution of means will also be normal. If the population is not normally distributed, then
the sampling distribution of means will be approximately normally distributed as the sample
size gets larger, usually when $n \ge 30$.
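The first two statements of the theorem can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than part of the text: it draws repeated samples of size n = 30 from an exponential population with mean 1 and variance 1 (a clearly non-normal population chosen for convenience), and the names `sample_means`, `n` and `reps` are hypothetical.

```python
import random
from statistics import mean, variance

random.seed(0)  # fixed seed so that the run is repeatable


def sample_means(n, reps, rate=1.0):
    """Means of `reps` independent samples of size n from an Exp(rate) population."""
    return [mean(random.expovariate(rate) for _ in range(n)) for _ in range(reps)]


n, reps = 30, 2000
means = sample_means(n, reps)

# The population has mean 1 and variance 1, so the CLT suggests the sample
# means should centre near 1 with variance near 1/n.
print(round(mean(means), 3), round(variance(means), 4), round(1 / n, 4))
```

Plotting a histogram of `means` would show the approximate bell shape described in statement 3.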
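2.3 Sampling Distribution for a Single Mean, $\bar{X}$

The sample mean, $\bar{X}$, is the best estimator of the population mean, $\mu$. Suppose we have a
set of independent random variables $X_1, X_2, \ldots, X_n$ where $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. The sample mean is
$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
and the sample variance is
$$S^2 = \frac{1}{n - 1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

The probability distribution of the sample means, $\bar{X}$, is called the sampling distribution
of $\bar{X}$.
The expected mean and variance of $\bar{X}$ are denoted as $\mu_{\bar{X}}$ and $\sigma^2_{\bar{X}}$:
$$\mu_{\bar{X}} = E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}(n\mu) = \mu$$
$$\sigma^2_{\bar{X}} = \mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}$$

The sampling distribution for the sample mean is expressed as $\bar{X} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)$. The
standardized variable
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
follows a standard normal distribution. If the population is normally distributed, the sampling distribution of the mean is normally
distributed regardless of the sample size. If the population distribution is unknown or not
normal, then by the central limit theorem, the sampling distribution for the sample mean is
approximately normal when $n \ge 30$.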
___________________________________________________________________________
Example 1

A certain type of thread is manufactured with a mean tensile strength of 77.3 kg and a
standard deviation of 6.4 kg. Assuming that the tensile strength follows a normal distribution,
find the probability that the mean tensile strength of a random sample of 40 such threads is
more than 75 kg.
Solution
$X \sim N(\mu, \sigma^2)$ where $\mu = 77.3$ and $\sigma = 6.4$.
Now n = 40, therefore $\bar{X} \sim N\left(77.3, \dfrac{6.4^2}{40}\right)$ and
$$P(\bar{X} > 75) = P\left(Z > \frac{75 - 77.3}{\sqrt{6.4^2/40}}\right) = P(Z > -2.27) = 0.5 + 0.4884 = 0.9884$$
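The calculation in Example 1 can be sketched with the standard-library `statistics.NormalDist`, using $\sigma/\sqrt{n}$ as the standard deviation of the sample mean. The helper name `sample_mean_dist` is illustrative, not from the text.

```python
from math import sqrt
from statistics import NormalDist


def sample_mean_dist(mu, sigma, n):
    """Sampling distribution of the mean: N(mu, sigma^2 / n)."""
    return NormalDist(mu, sigma / sqrt(n))


# Thread tensile strength: mu = 77.3 kg, sigma = 6.4 kg, sample size 40.
xbar = sample_mean_dist(77.3, 6.4, 40)
p = 1 - xbar.cdf(75)  # P(sample mean > 75)
print(round(p, 3))    # agrees with the table-based answer to three decimals
```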

_________________________________________________________________________
Example 2

The number of customers arriving per hour at a certain automobile service facility is assumed
to follow a Poisson distribution with mean 12. If a random sample of 36 hours were taken,
what is the probability that the mean number of customers in an hour is less than 10?
Solution
Given $X \sim Po(12)$.
Therefore, by CLT, $\bar{X} \sim N\left(12, \dfrac{12}{36}\right)$ and
$$P(\bar{X} < 10) = P\left(Z < \frac{10 - 12}{\sqrt{12/36}}\right) = P(Z < -3.46) = 0.5 - 0.4997 = 0.0003$$

___________________________________________________________________________

The average life of a washing machine is 12 years with a standard deviation of 2 years.
Assuming that the lives of these machines follow approximately a normal distribution, nd

(a) the probability that the mean life of a random sample of 12 machines is greater than 10
years.
[ 0.9997 ]

(b) the probability that the mean life of a random sample of 9 machines falls between 9.4 and
12.2 years.
[ 0.6179 ]
__________________________________________________________________________

A random sample of size 35 is taken from a population which has a binomial distribution with
the number of trials 50 and the proportion of success 0.30. What is the probability that the
sample mean is at least 13.5?
[ 0.9969 ]
__________________________________________________________________________
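2.4 Sampling Distribution of $\bar{X}_1 - \bar{X}_2$

Suppose we have two independent populations, both normally distributed. Let the first
population have mean $\mu_1$ and variance $\sigma_1^2$ and the second population have mean $\mu_2$ and
variance $\sigma_2^2$.

If $\bar{X}_1$ and $\bar{X}_2$ are the sample means of two independent random samples of sizes $n_1$ and
$n_2$, then
$$\bar{X}_1 \sim N\left(\mu_1, \frac{\sigma_1^2}{n_1}\right) \qquad \text{and} \qquad \bar{X}_2 \sim N\left(\mu_2, \frac{\sigma_2^2}{n_2}\right)$$

The sampling distribution of $\bar{X}_1 - \bar{X}_2$ is also normally distributed but with mean
$$\mu_{\bar{X}_1 - \bar{X}_2} = E(\bar{X}_1) - E(\bar{X}_2) = \mu_1 - \mu_2$$
and variance
$$\sigma^2_{\bar{X}_1 - \bar{X}_2} = \mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \mathrm{Var}(\bar{X}_1) + (-1)^2\,\mathrm{Var}(\bar{X}_2) = \mathrm{Var}(\bar{X}_1) + \mathrm{Var}(\bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
thus,
$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\; \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$$
with
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
having a standard normal distribution.

If the two populations are not normally distributed and both samples have sizes of at least 30, then by the
central limit theorem, the distribution of $\bar{X}_1 - \bar{X}_2$ is approximately normal.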

___________________________________________________________________________
Example 3

A random sample of size 18 is selected from a normal population with a mean of 85 and a
standard deviation of 8. A second random sample of size 10 is taken from another normal
population with mean 80 and a standard deviation 5. Let $\bar{X}_1$ and $\bar{X}_2$ be the two sample
means. Find the probability
(a) that $\bar{X}_1$ is greater than $\bar{X}_2$.
(b) that the difference between the sample means is less than 6.
(c) that the difference between the means is more than 4.
Solution
We know that $\bar{X}_1 \sim N\left(85, \dfrac{8^2}{18}\right)$ and $\bar{X}_2 \sim N\left(80, \dfrac{5^2}{10}\right)$, therefore
$$\bar{X}_1 - \bar{X}_2 \sim N\left(85 - 80,\; \frac{8^2}{18} + \frac{5^2}{10}\right)$$
$$\bar{X}_1 - \bar{X}_2 \sim N(5,\; 6.0556)$$
(a) The probability that $\bar{X}_1$ is greater than $\bar{X}_2$ is
$$P(\bar{X}_1 > \bar{X}_2) = P(\bar{X}_1 - \bar{X}_2 > 0) = P\left(Z > \frac{0 - 5}{\sqrt{6.0556}}\right) = P(Z > -2.03)$$
$$= 0.5 + P(0 < Z < 2.03) = 0.5 + 0.4788 = 0.9788$$

(c) The probability that the difference between the sample means is more than 4 is
$$P\left(|\bar{X}_1 - \bar{X}_2| > 4\right) = P(\bar{X}_1 - \bar{X}_2 > 4) + P(\bar{X}_1 - \bar{X}_2 < -4)$$
$$= P\left(Z > \frac{4 - 5}{\sqrt{6.0556}}\right) + P\left(Z < \frac{-4 - 5}{\sqrt{6.0556}}\right) = P(Z > -0.4) + P(Z < -3.66)$$
$$= (0.5 + 0.1554) + (0.5 - 0.4999) = 0.6554 + 0.0001 = 0.6555$$

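Example 4

A random sample of size 49 is taken from a binomial distribution with n = 60 and p = 0.4.
Another random sample of size 32 is taken from another binomial distribution with n = 60
and p = 0.4. Find the probability that the difference between the two sample means is less
than 1.
Solution
Given $X_1 \sim B(60, 0.4)$ and $X_2 \sim B(60, 0.4)$, each population has mean 60(0.4) = 24 and variance 60(0.4)(0.6) = 14.4.
Therefore, by CLT,
$$\bar{X}_1 \sim N\left(24, \frac{14.4}{49}\right) \qquad \text{and} \qquad \bar{X}_2 \sim N\left(24, \frac{14.4}{32}\right)$$
Hence,
$$\bar{X}_1 - \bar{X}_2 \sim N\left(24 - 24,\; \frac{14.4}{49} + \frac{14.4}{32}\right)$$
$$\bar{X}_1 - \bar{X}_2 \sim N(0,\; 0.7438)$$
$$P\left(|\bar{X}_1 - \bar{X}_2| < 1\right) = P(-1 < \bar{X}_1 - \bar{X}_2 < 1) = P\left(\frac{-1 - 0}{\sqrt{0.7438}} < Z < \frac{1 - 0}{\sqrt{0.7438}}\right)$$
$$= P(-1.16 < Z < 1.16) = 0.7540$$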
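The calculation in Example 4 can be sketched in code by building the normal distribution of the difference of two sample means. The helper name `diff_of_means_dist` is illustrative; the small gap between the printed value and 0.7540 comes from rounding z to 1.16 for the table.

```python
from math import sqrt
from statistics import NormalDist


def diff_of_means_dist(mu1, var1, n1, mu2, var2, n2):
    """Distribution of Xbar1 - Xbar2: N(mu1 - mu2, var1/n1 + var2/n2)."""
    return NormalDist(mu1 - mu2, sqrt(var1 / n1 + var2 / n2))


# Both populations are B(60, 0.4): mean 24 and variance 14.4.
d = diff_of_means_dist(24, 14.4, 49, 24, 14.4, 32)
p = d.cdf(1) - d.cdf(-1)  # P(-1 < Xbar1 - Xbar2 < 1)
print(round(d.variance, 4), round(p, 3))
```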

___________________________________________________________________________

A consumer of an electronics company is comparing the brightness of two different types of
picture tubes for use in their television sets. Type A tube has mean brightness of 100 and
standard deviation of 16, while type B tube has mean brightness of 110 and standard
deviation of 14. A random sample of 25 tubes from each type is selected. What is the
probability that the difference in brightness between the two sample means is at least 5.5?
[ 0.8555 ]
___________________________________________________________________________

A random sample of size 30 is taken from a population which is distributed from a Poisson
distribution with mean 54. Another random sample of size 32 is taken from a Poisson
distribution with mean 58. What is the probability that the difference between the means is
less than 2?
[ 0.1461 ]
___________________________________________________________________________
2.5

## Sampling Distribution for the Proportion P

The concept of proportion is the same as the concept of probability of success in a binomial
experiment. The probability of success in a binomial experiment represents the proportion of
the sample or population that possesses a given characteristic.
The population proportion, denoted by , is obtained by taking the ratio of the number of
elements in a population with a specic characteristic to the total number of elements in the
population. The sample proportion, denoted by p, gives a similar ratio for a sample.

## Population and Sample Proportions

The population and sample proportions, denoted by $\pi$ and $p$, respectively, are calculated as

$$\pi = \frac{X}{N} \quad \text{and} \quad p = \frac{x}{n}$$

where
$N$ = total number of elements in the population
$n$ = total number of elements in the sample
$X$ = number of elements in the population that possess a specific characteristic
$x$ = number of elements in the sample that possess a specific characteristic

Here $\pi$ is a proportion of successes and not the constant 3.14159... . Each sample will give a different value
of $p$; therefore the sample proportion is a random variable, symbolized as $P$.
To determine the reliability of the estimator $P$, we need to know its sampling distribution.
When samples of size $n$ are drawn from this population, each sample contains a certain number
of observations with the characteristic of interest. The Central Limit Theorem (CLT) tells
us that the relative frequency distribution of the sample mean for any population is
approximately normal for sufficiently large samples ($n \ge 30$).

Sampling Distribution of P
1. Mean of the Sample Proportion
The mean of the sample proportion $P$ is denoted by $\mu_P$ and is equal to the population
proportion $\pi$:

$$\mu_P = E(P) = E\left(\frac{X}{n}\right) = \frac{1}{n}E(X) = \frac{1}{n}(n\pi) = \pi$$

The sample proportion $P$ is called an unbiased estimator of the population proportion $\pi$.

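2. Variance of the Sample Proportion
The variance of the sample proportion is denoted by $\sigma_P^2$ and given by the formula

$$\sigma_P^2 = Var(P) = Var\left(\frac{X}{n}\right) = \frac{1}{n^2}Var(X) = \frac{1}{n^2}\,n\pi(1-\pi) = \frac{\pi(1-\pi)}{n}$$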

Standard Deviation of the Sample Proportion
The standard deviation of the sample proportion is denoted by

$$\sigma_P = \sqrt{\frac{\pi(1-\pi)}{n}}$$
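The mean and variance results for the sample proportion can be verified exactly for a binomial count. The sketch below assumes X ~ Binomial(n, π) and uses illustrative values n = 50 and π = 0.75; the function name `sample_proportion_moments` is hypothetical.

```python
from math import comb

def sample_proportion_moments(n, pi):
    """Exact mean and variance of P = X/n when X ~ Binomial(n, pi)."""
    # Binomial probabilities for x = 0, 1, ..., n
    pmf = [comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(n + 1)]
    mean = sum((x / n) * w for x, w in enumerate(pmf))
    var = sum((x / n - mean) ** 2 * w for x, w in enumerate(pmf))
    return mean, var

mean_p, var_p = sample_proportion_moments(50, 0.75)
print(mean_p, var_p)  # compare with pi and pi * (1 - pi) / n
```

Enumerating the binomial probabilities avoids simulation noise, so the agreement with π and π(1 − π)/n holds up to floating-point error.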

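3. For large samples, the sampling distribution of $P$ is approximately normal.

Therefore the sampling distribution of $P$ has mean $\pi$ and variance $\frac{\pi(1-\pi)}{n}$, written as

$$P \sim N\left(\pi, \frac{\pi(1-\pi)}{n}\right)$$

with

$$Z = \frac{P - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}}$$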

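## Continuity Correction Factor

A continuity correction is needed when a continuous curve is used to
approximate a discrete probability distribution. The quantity $\frac{1}{2n}$ is added or subtracted as a
continuity correction factor (c.c.) according to the form of the probability statement, as follows:

(a) $P(P = p) \xrightarrow{c.c.} P\left(p - \frac{1}{2n} \le P \le p + \frac{1}{2n}\right)$

(b) $P(P \le p) \xrightarrow{c.c.} P\left(P \le p + \frac{1}{2n}\right)$

(c) $P(P < p) \xrightarrow{c.c.} P\left(P \le p - \frac{1}{2n}\right)$

(d) $P(P \ge p) \xrightarrow{c.c.} P\left(P \ge p - \frac{1}{2n}\right)$

(e) $P(P > p) \xrightarrow{c.c.} P\left(P \ge p + \frac{1}{2n}\right)$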

Example 5
A manufacturer claims that 75% of its metal rods have a diameter within the specification.
If a random sample of 50 metal rods is chosen, find the probability that
(a) at least 70% of the metal rods have a diameter within the specification.
(b) between 78% and 82% of the metal rods have a diameter within the specification.
(c) more than 90% of the metal rods have a diameter within the specification.
Solution

$$\pi = 0.75, \qquad \frac{\pi(1-\pi)}{n} = \frac{0.75(1 - 0.75)}{50} = 0.00375$$

$$P \sim N(0.75, 0.00375)$$

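(a) The probability that at least 70% of the metal rods have a diameter within the specification is

$$P(P \ge 0.70) \xrightarrow{c.c.} P\left(P \ge 0.70 - \frac{1}{2(50)}\right) = P(P \ge 0.69) = P\left(Z \ge \frac{0.69 - 0.75}{\sqrt{0.00375}}\right)$$

$$= P(Z \ge -0.98) = 0.5 + P(0 \le Z \le 0.98) = 0.5 + 0.3365 = 0.8365$$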

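(b) The probability that between 78% and 82% of the metal rods have a diameter within the
specification is

$$P(0.78 < P < 0.82) \xrightarrow{c.c.} P\left(0.78 + \frac{1}{2(50)} \le P \le 0.82 - \frac{1}{2(50)}\right) = P(0.79 \le P \le 0.81)$$

$$= P\left(\frac{0.79 - 0.75}{\sqrt{0.00375}} \le Z \le \frac{0.81 - 0.75}{\sqrt{0.00375}}\right) = P(0.65 \le Z \le 0.98)$$

$$= P(0 \le Z \le 0.98) - P(0 \le Z \le 0.65) = 0.3365 - 0.2422 = 0.0943$$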

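(c) The probability that more than 90% of the metal rods have a diameter within the specification is

$$P(P > 0.90) \xrightarrow{c.c.} P\left(P \ge 0.90 + \frac{1}{2(50)}\right) = P(P \ge 0.91) = P\left(Z \ge \frac{0.91 - 0.75}{\sqrt{0.00375}}\right)$$

$$= P(Z \ge 2.61) = 0.5 - P(0 \le Z \le 2.61) = 0.5 - 0.4955 = 0.0045$$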
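The three parts of Example 5 can be reproduced in a few lines. This is a sketch under the normal approximation with the continuity correction 1/(2n); the helper `proportion_dist` and the variable names are illustrative assumptions, not part of the text.

```python
from statistics import NormalDist

def proportion_dist(pi, n):
    """Approximate sampling distribution of P: N(pi, pi(1 - pi)/n)."""
    return NormalDist(pi, (pi * (1 - pi) / n) ** 0.5)

n, pi = 50, 0.75
dist = proportion_dist(pi, n)
cc = 1 / (2 * n)  # continuity correction factor

p_a = 1 - dist.cdf(0.70 - cc)                      # P(P >= 0.70)
p_b = dist.cdf(0.82 - cc) - dist.cdf(0.78 + cc)    # P(0.78 < P < 0.82)
p_c = 1 - dist.cdf(0.90 + cc)                      # P(P > 0.90)
print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```

Small differences from the worked answers are expected, since the table lookup rounds each z value to two decimal places.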

__________________________________________________________________________

In a chemical plant, 30% of the pipes showed signs of serious corrosion. A survey was done and a
random sample of 100 pipes in the plant was selected. Find the probability that

(a) more than 35% of the pipes showed signs of serious corrosion.
[ 0.1151 ]
(b) from 20% to 30% of the pipes showed signs of serious corrosion.
[ 0.5328 ]

From a survey, we found that 90% of automobiles will not be rejected because of machine
failure. A random sample of 50 automobiles was selected. What is the probability that
(a) not less than 92% of the automobiles will not be rejected because of machine failure?
[ 0.4052 ]
(b) between 88% and 92% of the automobiles will not be rejected because of machine
failure?

[ 0.1896 ]

__________________________________________________________________________

From previous records, $\frac{3}{100}$ of the rubber cushions will be rejected. A manufacturer was not
satisfied with the results and carried out a survey. Among a sample of 100 rubber
cushions, find the probability that the
(a) proportion of rubber cushions that will be rejected exceeds 0.04.
(b) proportion of rubber cushions that will be rejected is not more than 0.05.
___________________________________________________________________________

## 2.6 Sampling Distribution of the Difference Between Two Proportions

Suppose we have two binomial populations with proportions of successes $\pi_1$ and $\pi_2$, and
random samples of size $n_1$ and $n_2$ are taken from population 1 and population 2,
respectively. Then $P_1$ and $P_2$ are the proportions from those samples. By the CLT, provided
both $n_1$ and $n_2$ are large ($n_1 \ge 30$ and $n_2 \ge 30$), the sampling distributions of $P_1$ and $P_2$ are

$$P_1 \sim N\left(\pi_1, \frac{\pi_1(1-\pi_1)}{n_1}\right) \quad \text{and} \quad P_2 \sim N\left(\pi_2, \frac{\pi_2(1-\pi_2)}{n_2}\right)$$

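Therefore the sampling distribution of the difference between two proportions, $P_1 - P_2$,
can be obtained. Assuming the two samples are independent, the mean is

$$\mu_{P_1 - P_2} = E(P_1 - P_2) = E(P_1) - E(P_2) = \pi_1 - \pi_2,$$

the variance is

$$\sigma^2_{P_1 - P_2} = Var(P_1) + Var(P_2) = \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}$$

and the standard deviation is

$$\sigma_{P_1 - P_2} = \sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}}$$

By the Central Limit Theorem, the sampling distribution of the difference between two proportions, $P_1 - P_2$, is approximately normal with mean
$\pi_1 - \pi_2$ and variance $\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}$, and can be written as

$$P_1 - P_2 \sim N\left(\pi_1 - \pi_2, \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}\right)$$

with

$$Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2}}}$$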

___________________________________________________________________________
Example 6

Two companies, M Chip and N Chip, produce micro computer chips and supply them to
company ACERA. 25% of the micro computer chips produced by Company M Chip and 20%
of the micro computer chips produced by Company N Chip are defective. If 100 chips are
randomly chosen from each company, find the probability that
(a) the sample proportion of defective micro computer chips produced by Company
M Chip is greater than the sample proportion of defective micro computer chips
produced by Company N Chip.
(b) the sample proportions of defective micro computer chips differ by at least 6%.
(c) the difference between the sample proportion of defective micro computer chips
produced by Company M Chip and the sample proportion of defective micro computer chips
produced by Company N Chip is at most 4%.

Solution

$$P_M \sim N\left(0.25, \frac{0.25(1 - 0.25)}{100}\right) = N(0.25, 0.001875)$$

$$P_N \sim N\left(0.20, \frac{0.20(1 - 0.20)}{100}\right) = N(0.20, 0.0016)$$

$$P_M - P_N \sim N(0.25 - 0.20, 0.001875 + 0.0016) = N(0.05, 0.003475)$$

(a) The probability that the sample proportion of defective micro computer chips produced by
Company M Chip is greater than the sample proportion of defective micro computer chips
produced by Company N Chip is

$$P(P_M > P_N) = P(P_M - P_N > 0) = P\left(Z > \frac{0 - 0.05}{\sqrt{0.003475}}\right)$$

= P (Z > −0.85)
= 0.5 + P (0 < Z < 0.85)
= 0.5 + 0.3023
= 0.8023
(b) The probability that the sample proportions of defective micro computer chips differ by at
least 6% is

$$P(|P_M - P_N| \ge 0.06) = P(P_M - P_N \le -0.06) + P(P_M - P_N \ge 0.06)$$

$$= P\left(Z \le \frac{-0.06 - 0.05}{\sqrt{0.003475}}\right) + P\left(Z \ge \frac{0.06 - 0.05}{\sqrt{0.003475}}\right)$$

= P (Z < −1.87) + P (Z > 0.17)
= [0.5 − P (0 < Z < 1.87)] + [0.5 − P (0 < Z < 0.17)]
= [0.5 − 0.4693] + [0.5 − 0.0675]
= 0.0307 + 0.4325
= 0.4632

(c) The probability that the difference between the sample proportion of defective micro
computer chips produced by Company M Chip and the sample proportion of defective micro
computer chips produced by Company N Chip is at most 4% is

$$P(P_M - P_N \le 0.04) = P\left(Z \le \frac{0.04 - 0.05}{\sqrt{0.003475}}\right)$$

= P (Z < −0.17)
= 0.5 − P (0 < Z < 0.17)
= 0.5 − 0.0675
= 0.4325
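The Example 6 probabilities can also be obtained directly from the normal model for the difference. The sketch below assumes independent samples and no continuity correction, as in the worked solution; `diff_proportion_dist` is an illustrative helper name.

```python
from statistics import NormalDist

def diff_proportion_dist(pi1, n1, pi2, n2):
    """Approximate distribution of P1 - P2 for two independent samples."""
    var = pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2
    return NormalDist(pi1 - pi2, var ** 0.5)

d = diff_proportion_dist(0.25, 100, 0.20, 100)

p_a = 1 - d.cdf(0.0)                    # P(PM - PN > 0)
p_b = d.cdf(-0.06) + (1 - d.cdf(0.06))  # P(|PM - PN| >= 0.06)
p_c = d.cdf(0.04)                       # P(PM - PN <= 0.04)
print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```

The same function applies when the two sample sizes differ; only the arguments `n1` and `n2` change.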
________________________________________________________________________________
Example 7

A manufacturer claims that some of the electrical parts produced by two machines are
defective. He said that 90 out of 1500 electrical parts produced by machine 1 are defective
and 84 out of 1200 electrical parts produced by machine 2 are defective.
If random samples of 50 electrical parts produced by machine 1 and 60 electrical parts
produced by machine 2 are chosen, what is the probability that
(a) the proportion of defective electrical parts produced by machine 1 is smaller than the
proportion of defective electrical parts produced by machine 2?
(b) the proportion of defective electrical parts produced by machine 1 is greater than the
proportion of defective electrical parts produced by machine 2?
(c) the proportions of defective electrical parts differ by less than 0.02?
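Solution

$$\pi_1 = \frac{90}{1500} = 0.06, \qquad \pi_2 = \frac{84}{1200} = 0.07$$

$$P_1 \sim N\left(0.06, \frac{0.06(1 - 0.06)}{50}\right) = N(0.06, 0.001128)$$

$$P_2 \sim N\left(0.07, \frac{0.07(1 - 0.07)}{60}\right) = N(0.07, 0.001085)$$

$$P_2 - P_1 \sim N(0.07 - 0.06, 0.001128 + 0.001085) = N(0.01, 0.002213)$$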

(a) The probability that the proportion of defective electrical parts produced by machine 1 is
smaller than the proportion of defective electrical parts produced by machine 2 is

$$P(P_1 < P_2) = P(P_2 - P_1 > 0) = P\left(Z > \frac{0 - 0.01}{\sqrt{0.002213}}\right)$$

= P (Z > −0.21)
= 0.5 + P (0 < Z < 0.21)
= 0.5 + 0.0832
= 0.5832

(b) The probability that the proportion of defective electrical parts produced by machine 1 is
greater than the proportion of defective electrical parts produced by machine 2 is

$$P(P_1 > P_2) = P(P_2 - P_1 < 0) = P\left(Z < \frac{0 - 0.01}{\sqrt{0.002213}}\right)$$

= P (Z < −0.21)
= 0.5 − P (0 < Z < 0.21)
= 0.5 − 0.0832
= 0.4168
(c) The probability that the proportions of defective electrical parts differ by less than 0.02 is

$$P(|P_2 - P_1| < 0.02) = P(-0.02 < P_2 - P_1 < 0.02) = P\left(\frac{-0.02 - 0.01}{\sqrt{0.002213}} < Z < \frac{0.02 - 0.01}{\sqrt{0.002213}}\right)$$

= P (−0.64 < Z < 0.21)
= P (0 < Z < 0.64) + P (0 < Z < 0.21)
= 0.2389 + 0.0832
= 0.3221
___________________________________________________________________________

A Production Manager claims that his two machines will fail due to continuous operation and
will produce defective products. An investigation was done and it was found that the claim
was true: 50 of 500 products from machine A and 45 of 500 products from machine B are
defective. 100 products from each machine were selected randomly. Find the probability that
(a) the sample proportion of defective products from machine A is smaller than the sample
proportion of defective products from machine B.

[ 0.4052 ]

(b) the sample proportions of defective products differ by less than 1.8%.

[ 0.6730 ]

(c) the difference between the sample proportion of defective products from machine A
and the sample proportion of defective products from machine B is at least 1%.
[ 0.5000 ]
__________________________________________________________________________

A company purchased parts from two suppliers and has been having serious problems with
scrap and rework with both suppliers. From previous records, 16% of the parts supplied by
Supplier A were found to be nonconforming, while 14% of the parts
supplied by Supplier B were found to be nonconforming. A quality engineer decides to investigate and takes 100 randomly
selected parts from each supplier. What is the probability that

(a) the proportion of nonconforming parts supplied by Supplier A is greater than the
proportion of nonconforming parts supplied by Supplier B?

[ 0.6554 ]

(b) the proportion of nonconforming parts supplied by Supplier A is more than the
proportion of nonconforming parts supplied by Supplier B by at least 0.01?

[ 0.5793 ]

(c) the difference between the proportion of nonconforming parts supplied by Supplier A and
the proportion of nonconforming parts supplied by Supplier B is more than 0.05? [ 0.2776 ]
___________________________________________________________________________
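2.7 t Distribution
Theorem 1 Let $Z$ be a standard normal variable and $V$ a chi-squared random variable with
$v$ degrees of freedom. If $Z$ and $V$ are independent, then the distribution of the random variable $T$, where

$$T = \frac{Z}{\sqrt{V/v}},$$

is given by the density function

$$h(t) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\Gamma\left(\frac{v}{2}\right)\sqrt{\pi v}} \left(1 + \frac{t^2}{v}\right)^{-\frac{v+1}{2}}, \qquad -\infty < t < \infty.$$

This is known as the t-distribution with $v$ degrees of freedom.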

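Corollary 1 Let $X_1, X_2, \ldots, X_n$ be independent random variables that are all normal with
mean $\mu$ and standard deviation $\sigma$. Let

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{and} \quad S^2 = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2}{n - 1}.$$

Then the random variable $T = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ has a t distribution with $v = n - 1$ degrees of
freedom and can be written as $T \sim t_{n-1}$.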

Statistics Table 9 on page 28 gives the value of $t_{\alpha,v}$ with $P(T > t_{\alpha,v}) = \alpha$.

___________________________________________________________________________

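By using the Statistics Table, find $t_{\alpha,v}$ for the cases below:

(a) $\alpha = 0.001$, $v = 15$. [ $t_{0.001,15} = 3.733$ ]

(b) $\alpha = 0.005$, $v = 20$. [ $t_{0.005,20} = 2.845$ ]

(c) $\alpha = 0.010$, $v = 10$. [ $t_{0.010,10} = 2.764$ ]

(d) $\alpha = 0.025$, $v = 30$. [ $t_{0.025,30} = 2.042$ ]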
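Upper-tail critical values of the t distribution can be approximated without a table by integrating the density h(t) of Theorem 1 numerically. The following is a rough sketch using only the standard library (Simpson's rule plus bisection); the function names are illustrative, and it assumes α < 0.5 so that the critical value is positive.

```python
import math

def t_pdf(t, v):
    """Density of the t distribution with v degrees of freedom."""
    const = math.gamma((v + 1) / 2) / (math.gamma(v / 2) * math.sqrt(math.pi * v))
    return const * (1 + t * t / v) ** (-(v + 1) / 2)

def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def t_upper_tail(t, v):
    """P(T > t) for t >= 0, using symmetry of the density about 0."""
    return 0.5 - simpson(lambda u: t_pdf(u, v), 0.0, t)

def t_critical(alpha, v):
    """Find t such that P(T > t) = alpha by bisection (assumes alpha < 0.5)."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, v) > alpha:
            lo = mid  # tail area still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.001, 15), 3))
```

In practice a statistical table or a statistics library would be used; the sketch only illustrates what the tabulated values mean.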

_______________________________________________________________________
2.8 $\chi^2$ Distribution

The continuous random variable $X$ has a chi-squared distribution, with $v$ degrees of freedom,
if its density function is given by

$$f(x) = \frac{1}{2^{v/2}\,\Gamma\left(\frac{v}{2}\right)}\, x^{\frac{v}{2} - 1} \exp\left(-\frac{x}{2}\right), \qquad x > 0,$$

where $v$ is a positive integer, and can be written as $X \sim \chi^2_v$.

All chi-square distributions are skewed to the right. The symbol $\chi^2_{\alpha,v}$ denotes the number
along the horizontal axis that cuts off to its right an area of $\alpha$ under the chi-square distribution
with $v$ degrees of freedom.

Table 8 from Lee (2004) gives the values of $\chi^2_{\alpha,v}$ with $P(\chi^2 > \chi^2_{\alpha,v}) = \alpha$.

___________________________________________________________________________

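Find $\chi^2_{\alpha,v}$ for the cases below:

(a) $\alpha = 0.01$, $v = 10$. [ $\chi^2_{0.01,10} = 23.209$ ]

(b) $\alpha = 0.05$, $v = 15$. [ $\chi^2_{0.05,15} = 24.996$ ]

(c) $\alpha = 0.99$, $v = 12$. [ $\chi^2_{0.99,12} = 3.571$ ]

(d) $\alpha = 0.995$, $v = 16$. [ $\chi^2_{0.995,16} = 5.142$ ]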

___________________________________________________________________________
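2.9 F Distribution

Theorem 2 Let $U$ and $V$ be two independent random variables having chi-squared distributions
with $v_1$ and $v_2$ degrees of freedom, respectively. Then the distribution of the random
variable

$$F = \frac{U/v_1}{V/v_2}$$

is given by the density function

$$h(f) = \frac{\Gamma\left(\frac{v_1 + v_2}{2}\right)\left(\frac{v_1}{v_2}\right)^{v_1/2}}{\Gamma\left(\frac{v_1}{2}\right)\Gamma\left(\frac{v_2}{2}\right)} \cdot \frac{f^{\frac{v_1}{2} - 1}}{\left(1 + \frac{v_1 f}{v_2}\right)^{\frac{v_1 + v_2}{2}}}, \qquad 0 < f < \infty.$$

This is known as the F-distribution with $v_1$ and $v_2$ degrees of freedom.

Theorem 3 Writing $f_{\alpha,v_1,v_2}$ for $f_\alpha$ with $v_1$ and $v_2$ degrees of freedom, we obtain

$$f_{1-\alpha,v_1,v_2} = \frac{1}{f_{\alpha,v_2,v_1}}$$

with $P(F > f_{\alpha,v_1,v_2}) = \alpha$.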
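As a short illustration of how the reciprocal relation for F critical values is used, consider $\alpha = 0.025$ with $v_1 = 15$ and $v_2 = 9$. The tabulated value $f_{0.025,9,15} = 3.12$ is quoted here from standard F tables as an assumption, not from this text.

$$f_{0.975,15,9} = \frac{1}{f_{0.025,9,15}} = \frac{1}{3.12} \approx 0.3205$$

Note that the degrees of freedom are swapped on the right-hand side, which is why lower-tail values need not be tabulated separately.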

___________________________________________________________________________

By using a statistical table, find $f_{\alpha,v_1,v_2}$ for the cases below:

(a) $\alpha = 0.001$, $v_1 = 5$, $v_2 = 10$. [ $f_{0.001,5,10} = 10.48$ ]

(b) $\alpha = 0.010$, $v_1 = 10$, $v_2 = 10$. [ $f_{0.010,10,10} = 4.85$ ]

(c) $\alpha = 0.975$, $v_1 = 15$, $v_2 = 9$. [ $f_{0.975,15,9} = 0.3205$ ]

(d) $\alpha = 0.950$, $v_1 = 12$, $v_2 = 20$. [ $f_{0.950,12,20} = 0.3937$ ]

Exercise 2
1. A random sample of size 32 is drawn from a normal distribution with mean 30 and
standard deviation 9. What is the probability that the
(a) sample mean is at most 26?
(b) sample mean is smaller than 33?

2. A random sample of size 41 is taken from a population which is Poisson distributed with
mean 26. What is the probability that the
(a) sample mean is less than 27?
(b) sample mean is at least 29?

3. A random sample of size 16 is selected from a normal distribution with a mean of 92 and a
standard deviation of 11. Another random sample of size 12 is selected with mean 88 and
standard deviation 16. Find the probability that
(a) the difference between the sample means is more than 8?
(b) is less than by 18?

4. PVC pipe is manufactured with a mean length of 30.5 inch and a standard deviation of 2.8
inches. Find the probability that a random sample of n = 15 pipes will have a sample mean
length greater than 29 inches.

5. The probability that a machine produces defective parts is 0.02. A random sample of 15
parts was taken.
(a) What is the probability that the sample mean is more than 0.5 if a random sample of size 4
was taken?
(b) What is the probability that the sample mean is less than 0.8 if a random sample of size 9
was taken?

6. The mean amount of air blown by a JSM air conditioner is 5.5 m in a minute with a
standard deviation of 1.2 m. For a DGM air conditioner, the mean amount of air blown is 4.9 m
in a minute with a standard deviation of 1.1 m. 12 air conditioners of each type are
selected to run a test.
(a) What is the probability that the mean amount of air blown by the JSM air conditioners is greater than that of the DGM air conditioners?
(b) What is the probability that the difference between the mean amounts of air blown by the two types of air conditioner
is less than 1?
7. The average weight of a can of soda before the machine is serviced is 260 ml with a standard
deviation of 11 ml. The average weight of a can of soda after the machine is serviced is 250 ml
with a standard deviation of 8 ml. 40 cans of soda filled before the machine was serviced were chosen at
random and 38 cans of soda filled after the machine was serviced were also chosen at random. Find the
probability that the mean weight of a can of soda before the machine is serviced exceeds
the mean weight after the machine is serviced by at least 5 ml.

8. The number of times the Max photostat machine and the JP photostat machine break down
follows a Poisson distribution. An average of 8 breakdowns was recorded for the Max
photostat machine during a randomly selected day. For the JP photostat machine, an average of 5
breakdowns was recorded during a randomly selected day.

(a) If a random sample of 15 days is taken, what is the probability that the mean number
of breakdowns recorded in a day for the Max photostat machine is more than 10?
(b) If a random sample of 20 days is taken,
i. what is the probability that the mean numbers of breakdowns recorded in a day differ by less
than 4?
ii. what is the probability that the difference between the mean numbers of breakdowns
recorded in a day is at least 5?

9. 15% of the paperclips do not follow the company's specifications. If a QA inspector takes a random
sample of 1000 paperclips for inspection, what is the probability that
(a) less than 15% of the paperclips do not follow the company's specifications?
(b) at most 12% of the paperclips do not follow the company's specifications?
(c) more than 17% of the paperclips do not follow the company's specifications?

10. A claim was made that 98% of the A4 papers produced by a company are of good quality. A
survey was done and a random sample of 1000 A4 papers was selected. Find the probability
that
(a) more than 97% of the A4 papers produced by the company are of good quality.
(b) between 97% and 99% of the A4 papers produced by the company are of good quality.
(c) up to 99% of the A4 papers produced by the company are of good quality.

11. A manufacturer claims that $\frac{3}{4}$ of the electrical components were found to be nondefective. 250 electrical components were selected randomly. What is the probability that
(a) at least $\frac{4}{5}$ of the electrical components were found to be nondefective?
(b) $\frac{37}{50}$ to $\frac{39}{50}$ of the electrical components were found to be nondefective?
(c) more than $\frac{7}{10}$ of the electrical components were found to be nondefective?

12. A safety engineer claims that of all industrial accidents are caused by the carelessness of

the employees. A survey is carried and randomly 250 of all industrial accidents were selected.
What is the probability that

(a) at most $\frac{1}{4}$ of all industrial accidents are caused by the carelessness of the employees?
(b) more than $\frac{1}{5}$ of all industrial accidents are caused by the carelessness of the employees?
(c) $\frac{9}{50}$ to $\frac{11}{50}$ of all industrial accidents are caused by the carelessness of the employees?

13. From previous records, 1.2% of the machines in a manufacturing factory will be serviced at
least 3 times in a month. A survey was done involving 100 machines. Find the probability that the
(a) proportion of machines that will be serviced at least 3 times in a
month is more than 0.013.
(b) proportion of machines that will be serviced at least 3 times in a
month is less than 0.09.
(c) proportion of machines that will be serviced at least 3 times in a
month is not more than 0.10.
14. From previous experience, 35% of the microchips are defective. An engineer was asked
to investigate and solve this problem. He took a random sample of 500 microchips. Find
the probability that the
(a) proportion of defective microchips is less than 0.36.
(b) proportion of defective microchips is not more than 0.32.
(c) proportion of defective microchips is between 0.33 and 0.38, inclusive.
15. A company produces component parts for two types of engines, DOHC and SOHC. They
claimed that 96% of the component parts for DOHC and 95% of the component parts for
SOHC meet specifications. A random sample of 100 was selected from each type of component part.
What is the probability that
(a) the proportion of the component parts for DOHC that meet specifications is less than the proportion
of the component parts for SOHC that meet specifications?
(b) the proportions that meet specifications differ by more than 0.5%?
(c) the proportion of the component parts for DOHC that meet specifications exceeds the proportion of
the component parts for SOHC that meet specifications by at least 1%?

16. A claim was made that 10 out of 1000 laptops and 5 out of 500 desktops produced by a
company have been rejected. A survey was done and a random sample of 50 laptops and 40
desktops was selected. Find the probability that
(a) the sample proportion of rejected laptops is more than the sample proportion of rejected
desktops.
(b) the difference between the sample proportion of rejected laptops and the sample proportion of
rejected desktops is at least 0.01.
(c) the sample proportion of rejected laptops is smaller than the sample proportion of rejected
desktops by at most 0.005.
17. A manufacturer of CD and DVD players uses a set of comprehensive tests to assess the
electrical function of its products. All disk players must pass all tests prior to being sold. It was
found that $\frac{4}{200}$ of the CD players and $\frac{3}{200}$ of the DVD players failed the tests. A quality engineer
was asked to investigate the problem. A random sample of 150 was taken from each type of player.
What is the probability that
(a) the proportions that failed the tests differ by less than $\frac{1}{100}$?
(b) the proportion of CD players that failed the tests is greater than the proportion of DVD players that failed the
tests?
(c) the proportion of CD players that failed the tests is less than the proportion of DVD players that failed the tests by
at most $\frac{2}{100}$?

18. A manufacturer claims that his products produced by two different machines meet the
customers' specifications. An investigation was carried out and it was found that some of the
products failed to meet the specifications and had been rejected. Of 450 items from machine A, 27
failed to meet the specifications, and of 500 items from machine B, 25 failed to meet the
specifications; all of these were rejected. 60 items from each machine were selected randomly.
What is the probability that
(a) the proportion of rejected items from machine A is greater than the proportion of rejected items
from machine B?
(b) the proportions of rejected items differ by more than 1.5%?
(c) the proportion of rejected items from machine B is less than the proportion of rejected items from
machine A by at least 1%?

Chapter 3
Estimation
Learning Objectives:
At the end of this chapter, students should be able to
(a) distinguish between estimator and estimate for a given problem.
(b) describe the difference between inferential statistics and descriptive statistics.
(c) identify the best estimator for mean, proportion and standard deviation.
(d) construct the confidence interval for mean, proportion and variance for a single population and for two
populations correctly based on a given problem.
(e) interpret the confidence interval correctly.

3.1 Introduction

In the previous chapter we learnt about the sampling distributions of random variables. This
knowledge will equip us to work with the core of inferential statistics. Do you know what
inferential statistics is?
This chapter will introduce you first to the definition of inferential statistics, followed by the
definitions of important terms that will be used intensively in this chapter, namely estimator,
estimate, point and interval estimate, and confidence interval. Next, we will discover the
procedure for estimating the true parameter of a population.

Lastly, we will construct the confidence intervals for mean, proportion and variance for cases
of one population and two populations, with the correct interpretation.
Let us recap the definition of inferential statistics. It deals with the use of probabilities and
data from a sample to infer about the underlying population or to make generalisations about the
underlying population. That is, it uses information about the sample to make decisions and
draw conclusions about population characteristics. For example, by studying the average amount of
top-up spent per month by a group of students in UTM, we can infer
the average amount of top-up spent by all university students in our country. Can you
guess what the sample and population in this example are? You can always think of a
sample as a subset of a population. Does it help? Don't give up; you have tried your best! In
statistics, we call all university students in our country a population and the subset of this
population which is a group of students from UTM is called a sample. In the next section we

3.2 Terminology

1. Estimator is defined as a sample statistic used to estimate the value of a population
parameter.
2. Estimate is the value assigned to a population parameter based on the value of a sample
statistic.
3. Point estimate is the value of a sample statistic that is used to estimate a population
parameter, whereas interval estimation means a procedure to construct an interval around a
point estimate with the hope that this interval contains the corresponding population
parameter.
4. Confidence interval, which we will learn about throughout this chapter, is defined as an interval that
is constructed around a point estimate and is associated with a level of confidence based on
the procedure used in constructing it. The confidence level is the proportion of times that the
confidence interval will contain the true parameter, assuming that the estimation procedure is
repeated a large number of times.
Next we will learn through examples how to determine the best estimator and hence
construct the appropriate confidence interval according to the sample data that we
have.

3.3 Point Estimate

We start with our previous example on the monthly amount of top-up by university students.
The mean value of monthly top-up computed for the sample is called a sample mean, denoted
by $\bar{x}$. This is a point estimate of the corresponding population mean $\mu$, i.e. the mean monthly
top-up for university students in Malaysia. Suppose we select 1000 UTM students randomly
and the mean monthly top-up is RM40. This RM40 is a point estimate for the true mean of
monthly top-up for all university students in Malaysia. The statistician can then state that the
mean monthly top-up for Malaysian university students is RM40. This is what we call a point
estimation.

For the above example the population mean $\mu$ is estimated using the sample mean $\bar{x}$,
calculated as follows:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_{1000}}{1000}$$

where $x_1$ is the amount of monthly top-up by UTM student 1, $x_2$ is the amount of monthly
top-up by UTM student 2 and so on. So, the estimator here is $\bar{X}$.

Similarly, we can also estimate an unknown population variance, $\sigma^2$, using a point estimator
$S^2$, and the numerical value assigned to it, for example $s^2 = 1.6$, is called the point
estimate for $\sigma^2$.
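The distinction between estimator and estimate can be made concrete with a small computation. The sketch below uses a hypothetical sample of five monthly top-up amounts (in RM), invented for illustration only, and computes the point estimates of the population mean and variance.

```python
from statistics import mean, variance

# Hypothetical monthly top-up amounts (RM) for five sampled students
top_ups = [38, 42, 40, 41, 39]

x_bar = mean(top_ups)      # point estimate of the population mean
s2 = variance(top_ups)     # sample variance with divisor n - 1, point estimate of sigma^2
print(x_bar, s2)
```

Here `statistics.variance` divides by n − 1, which matches the definition of $S^2$ used in this text.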
In engineering we often need to estimate the following:

The mean of a single population, $\mu$; for example the mean breakdown voltage of
diodes.

The variance of a single population, $\sigma^2$ (or standard deviation, $\sigma$); for example the
standard deviation of the inside diameter of certain plastic pipes.

The proportion $\pi$ of items in a population that belong to a certain class of interest; for
example the proportion of defective items for a particular production process.

The difference in means of two populations, $\mu_1 - \mu_2$; for example the difference in
mean breakdown voltage of two diodes.

The difference in proportions, $\pi_1 - \pi_2$, of two populations; for example the
difference in proportions of nonconforming coils of brand A and B.

The ratio between two variances, $\frac{\sigma_1^2}{\sigma_2^2}$; for example the ratio between variances of
breaking strength of fibre A and fibre B.

The following table summarises the point estimates of these parameters together with their
statistics.
Table 3.1: Point Estimates and Statistics

| Unknown Parameter | Statistic | Point estimate |
|---|---|---|
| $\mu$ | $\bar{X} = \dfrac{\sum X_i}{n}$ | $\bar{x}$ |
| $\sigma^2$ | $S^2 = \dfrac{\sum (X_i - \bar{X})^2}{n - 1}$ | $s^2$ |
| $\pi$ | $P = \dfrac{X}{n}$ | $p$ |
The best estimator (the most efficient estimator) must have the following statistical properties:
1. be unbiased, that is $E(\hat{\theta}) = \theta$, and
2. have minimum variance, that is the variability of the estimator is as small as
possible.

For further explanation of these properties, please refer to Montgomery, Runger and Hubele
(2004), pages 131-133.
___________________________________________________________________________
3.4 Interval Estimate

Next, by extending our top-up example, instead of saying that the mean top-up for university
students in Malaysia is RM40, we may want to state it within a certain range. That is,
subtracting a number from RM40 and adding the same number to RM40 will give us this
range. To illustrate, let the number to be subtracted from and added to RM40 be RM5.
Hence we obtain the range from RM35 to RM45. Then we can
state that the range from RM35 to RM45 is likely to contain the mean top-up for all
Malaysian university students.
In general, the interval estimate of the unknown parameter $\theta$ can be written as $(l, u)$ where $l$
is the lower limit and $u$ is the upper limit. So the corresponding interval estimate for the
above example is RM(35, 45). Since different samples will produce different values of the sample
mean, which result in different values of $l$ and $u$, these values are actually the values of the
random variables for the lower limit $L$ and the upper limit $U$. The probability associated with
this interval estimate can be expressed as follows:

$$P(L \le \theta \le U) = 1 - \alpha,$$

where $0 < \alpha < 1$. That is, we have a probability of $1 - \alpha$ of choosing a sample that will
produce an interval containing the true value of $\theta$. The resulting interval estimate is called a
$100(1-\alpha)\%$ confidence interval (CI) for the true parameter $\theta$.
Generally, a $100(1-\alpha)\%$ confidence interval (CI) for the true parameter $\theta$ means

$$P(L \le \theta \le U) = 1 - \alpha,$$

which can be interpreted as follows: if we collect infinitely many random samples and
compute a $100(1-\alpha)\%$ CI for the true parameter $\theta$ for each sample, $100(1-\alpha)\%$ of these
intervals will contain the true value of $\theta$.
However, in practice we only draw one random sample. The interpretation that we will use is
that the observed interval $(l, u)$ contains the true value of $\theta$ with a $100(1-\alpha)\%$ confidence level.
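The repeated-sampling interpretation can be illustrated by simulation. The sketch below draws many samples from a normal population with a known standard deviation, builds a z-based interval around each sample mean, and counts how often the interval contains the true mean. The population values (mean 40, standard deviation 10), the sample size and the function name `coverage` are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

def coverage(mu, sigma, n, alpha, reps, seed=0):
    """Fraction of simulated 100(1 - alpha)% intervals that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / reps

rate = coverage(mu=40.0, sigma=10.0, n=25, alpha=0.05, reps=2000)
print(rate)  # should be close to 0.95
```

The observed fraction fluctuates around 0.95 from run to run; it approaches 0.95 as the number of repetitions grows.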
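3.5 CI on the Mean

In constructing a confidence interval for the population mean $\mu$, there are three cases that we
need to consider:
(a) population variance $\sigma^2$ is known,
(b) population variance $\sigma^2$ is unknown but the sample size is large ($n \ge 30$) and
(c) population variance $\sigma^2$ is unknown and the sample size is small ($n < 30$).
These considerations need to be taken into account because we need to know the sampling
distribution for the sample mean $\bar{X}$. The use of this sampling distribution will be
demonstrated as follows. Take the first case as an example. We know that the sampling
distribution for the sample mean $\bar{X}$ is normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$. Thus, the
statistic $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is distributed as a standard normal. In computing a $100(1-\alpha)\%$ CI for the
population mean, we use

$$P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha,$$

so that the observed lower and upper limits are $\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ and $\bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$, respectively.

There are three cases of $100(1-\alpha)\%$ CI for the population mean $\mu$:

(a) A $100(1-\alpha)\%$ CI for the population mean $\mu$ with known population variance $\sigma^2$ can
be written as

$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

or

$$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$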

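(b) A $100(1-\alpha)\%$ CI for the population mean $\mu$ with unknown population variance $\sigma^2$ and a large sample size ($n \ge 30$)
can also be written as

$$\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}$$

as we can use the central limit theorem in this case, where $s$ is the sample standard
deviation.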
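(c) A $100(1-\alpha)\%$ CI for the population mean $\mu$ when the population variance $\sigma^2$ is unknown
and the sample size is small ($n < 30$) can be written as

$$\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$

with the assumption that the sample comes from a normal distribution.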

Example 1

Compressive strength of a concrete is normally distributed with standard deviation $2.18039 \times 10^{5}$ pascal. A sample of 16 specimens has been randomly selected which gives the sample mean of $\bar{x} = 2.49978 \times 10^{7}$ pascal. Construct a 95% CI on the mean compressive strength.
Solution
This example is clearly case (a), where the population standard deviation is known and equals $2.18039 \times 10^{5}$ pascal. The CI that we want to compute is the 95% CI for the mean compressive strength, $\mu$. From the sample, $\bar{x} = 2.49978 \times 10^{7}$ pascal and the sample size is $n = 16$.

$100(1-\alpha)\% = 95\%$ gives $\alpha = 0.05$, so $z_{\alpha/2} = z_{0.025} = 1.96$.

Hence, the 95% CI for the mean compressive strength is

$$2.49978 \times 10^{7} - 1.96\,\frac{2.18039 \times 10^{5}}{\sqrt{16}} \le \mu \le 2.49978 \times 10^{7} + 1.96\,\frac{2.18039 \times 10^{5}}{\sqrt{16}}$$

$$2.49978 \times 10^{7} - 1.06839 \times 10^{5} \le \mu \le 2.49978 \times 10^{7} + 1.06839 \times 10^{5}$$

$$2.4891 \times 10^{7} \le \mu \le 2.5105 \times 10^{7}$$

or the confidence interval can also be written as $(2.4891,\ 2.5105) \times 10^{7}$ pascal.
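The arithmetic in Example 1 can be checked with a few lines of Python. This is a minimal sketch and not part of the original text: the function name `z_interval_mean` is hypothetical, and the critical value is computed with the standard library's `statistics.NormalDist` instead of being read from a table, so the last digits may differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_mean(xbar, sigma, n, conf_level):
    """CI for a population mean when the population std. dev. is known (case a)."""
    alpha = 1.0 - conf_level
    # Upper alpha/2 percentage point of the standard normal, z_{alpha/2}.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin


# Figures from Example 1 (compressive strength, in pascal).
lower, upper = z_interval_mean(xbar=2.49978e7, sigma=2.18039e5, n=16, conf_level=0.95)
print(f"{lower:.4e} <= mu <= {upper:.4e}")
```

For cases (b) and (c) the same structure applies with $s$ in place of $\sigma$ and, for case (c), a $t_{\alpha/2,\,n-1}$ table value in place of $z_{\alpha/2}$.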

________________________________________________________________________

A random sample of 16 compact cars tested for fuel consumption gave a mean of 12.5 km per
litre with a standard deviation of 0.83 km per litre. Assuming that the fuel consumption in km
per litre of all compact cars has a normal distribution, construct a 99% confidence interval
for the population mean of fuel consumption for compact cars.
[ 11.8885, 13.1115 ]

Borneo Steel Corporation produces iron rings that are supplied to ARAAB Co Ltd. These
rings are supposed to have a diameter of 60 cm. The machine that makes these rings does not
produce each ring with a diameter of exactly 60 cm. The diameter of each of the rings varies
slightly. It is known that when the machine is working properly, the rings made on this
machine have a mean diameter of 60 cm. The quality control department takes a random
sample of 35 such rings every week, calculates the mean of the diameters for these rings, and
makes a 99% confidence interval for the population mean. If either the lower limit of this
confidence interval is less than 59.938 cm or the upper limit of this confidence interval is
greater than 60.063 cm, the machine is stopped and adjusted. A recent such sample of 35
rings produced a mean diameter of 60.038 cm with a standard deviation of 0.15 cm. Based on
this sample, can you conclude that the machine needs an adjustment?
[(59.9727, 60.1033); yes]
___________________________________________________________________________
3.6

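In constructing a CI on the difference between two population means, $\mu_1 - \mu_2$, we extend our knowledge from the previous section by choosing our statistic $Z$ as

$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

assuming we know both population variances. Again we compute a $100(1-\alpha)\%$ CI for the difference between the two population means, $\mu_1 - \mu_2$, so that

$$P\left(\bar{X}_1 - \bar{X}_2 - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{X}_1 - \bar{X}_2 + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) = 1-\alpha.$$

There are three cases of $100(1-\alpha)\%$ CI for the difference between two population means, $\mu_1 - \mu_2$: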

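(a) $100(1-\alpha)\%$ CI on the difference of two population means, $\mu_1 - \mu_2$, with known population variances

$$\bar{x}_1 - \bar{x}_2 - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

(b) $100(1-\alpha)\%$ CI on the difference of two population means, $\mu_1 - \mu_2$, with unknown population variances and $n_1, n_2 \ge 30$

i. with $\sigma_1^2 = \sigma_2^2$

$$\bar{x}_1 - \bar{x}_2 - z_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + z_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

is a pooled standard deviation.

ii. with $\sigma_1^2 \ne \sigma_2^2$

$$\bar{x}_1 - \bar{x}_2 - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$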

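(c) $100(1-\alpha)\%$ CI on the difference of two population means, $\mu_1 - \mu_2$, with unknown population variances, $n_1, n_2 < 30$, and the normality assumption holding

i. with $\sigma_1^2 = \sigma_2^2$

$$\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,\,v}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,\,v}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where $v = n_1 + n_2 - 2$ and

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}.$$

ii. with $\sigma_1^2 \ne \sigma_2^2$

$$\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,\,v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,\,v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

where

$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}.$$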

_______________________________________________________________________
Example 2

Suppose random samples of 49 Silver Tyres and 36 Dun Tyres were selected. The sample mean mileage the tyre lasts for Silver Tyres is 119000 km and the standard deviation is 7700 km, and the sample mean mileage for Dun Tyres is 118000 km and the standard deviation is 6000 km. Compute a 90% CI on the difference of the two population means.

Solution

This is case b-ii. $\alpha = 0.1$, $z_{\alpha/2} = z_{0.05} = 1.6449$.

The 90% CI on the difference of the two population means is

$$(119000 - 118000) - 1.6449\sqrt{\frac{7700^2}{49} + \frac{6000^2}{36}} \le \mu_1 - \mu_2 \le (119000 - 118000) + 1.6449\sqrt{\frac{7700^2}{49} + \frac{6000^2}{36}}$$

$$1000 - 2445.32 \le \mu_1 - \mu_2 \le 1000 + 2445.32$$

$$-1445.32 \le \mu_1 - \mu_2 \le 3445.32$$
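The computation in Example 2 can be reproduced as follows. This is a minimal sketch under stated assumptions: the function name `large_sample_diff_interval` is hypothetical, and $z_{\alpha/2}$ is computed with `statistics.NormalDist` instead of being taken from a table, so the limits agree with the hand calculation only to about one decimal place.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_diff_interval(xbar1, s1, n1, xbar2, s2, n2, conf_level):
    """CI for mu1 - mu2, unknown and unequal variances, large samples (case b-ii)."""
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Estimated standard error of the difference of the two sample means.
    std_error = sqrt(s1**2 / n1 + s2**2 / n2)
    diff = xbar1 - xbar2
    return diff - z * std_error, diff + z * std_error


# Figures from Example 2 (tyre mileage, in km).
lower, upper = large_sample_diff_interval(119000, 7700, 49, 118000, 6000, 36, 0.90)
print(f"{lower:.2f} <= mu1 - mu2 <= {upper:.2f}")
```

Since the interval contains zero, the data do not rule out equal mean mileages at this confidence level.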

___________________________________________________________________________

Using Example 2, but assuming that the population variances are equal, construct a 95%
CI on the difference of the mean mileages the tyres last.

[-2026.0942, 4026.0942]

___________________________________________________________________________

A car magazine is comparing the total repair costs incurred during the first three years on two
mid-sized cars, the Pherry and the XPY. Random samples of 16 Pherrys and 9 XPYs are
taken. All 25 cars are three years old and have similar mileages. The mean of repair costs for
the 16 Pherry cars is RM5000 for the first three years with a standard deviation of RM800.
For the 9 XPY cars, this mean is RM7700 with a standard deviation of RM1000. Assume that
the repair costs follow a normal distribution with the same population variance. Construct a
90% confidence interval for the difference between the two population means.
[-3324.7295, -2075.270]
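For small samples, the pooled standard deviation and the degrees of freedom are the two quantities that are easiest to get wrong by hand. The sketch below is illustrative only: the function names are hypothetical, and the critical value $t_{0.05,\,23} = 1.714$ is assumed to be read from a $t$ table and passed in, because Python's standard library has no $t$ distribution.

```python
from math import sqrt


def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation s_p for two samples assumed to share one variance."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))


def unequal_variance_df(s1, n1, s2, n2):
    """Approximate degrees of freedom v used in case (c)-ii."""
    a, b = s1**2 / n1, s2**2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))


def pooled_t_interval(xbar1, s1, n1, xbar2, s2, n2, t_crit):
    """CI for mu1 - mu2 in case (c)-i; t_crit is t_{alpha/2, n1+n2-2} from a table."""
    margin = t_crit * pooled_sd(s1, n1, s2, n2) * sqrt(1 / n1 + 1 / n2)
    diff = xbar1 - xbar2
    return diff - margin, diff + margin


# Repair-cost figures from the task above (in RM); 90% CI, 23 degrees of freedom.
lower, upper = pooled_t_interval(5000, 800, 16, 7700, 1000, 9, t_crit=1.714)
print(f"{lower:.4f} <= mu1 - mu2 <= {upper:.4f}")
print("v for unequal variances:", unequal_variance_df(800, 16, 1000, 9))
```

In case (c)-ii the value of $v$ is generally not an integer; a common convention, assumed here and not stated in the text, is to round it down before consulting the table.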
___________________________________________________________________________

A process engineer is comparing two different etching solutions for removing silicon from
the backs of wafers. The etch rates follow a normal distribution and have equal population
variances of $0.35^2$. Below are the observed etch rates from 10 wafers for each solution.
____________________________
Solution 1      Solution 2
____________________________
 9.7   10.5     10.1    9.9
 9.3   10.2     10.5   10.1
 9.1    9.9     10.6   10.2
 9.5   10.3     10.3   10.3
10.0   10.1     10.3   10.1
____________________________

Find a 90% CI for the difference in mean etch rates. [ -0.6375, -0.1225 ]

6

Using Task 5, construct a 95% CI for the difference in mean etch rates if we do not know the
population variances and assume that the two populations have unequal variances.
[ -0.7198, -0.0402 ]
___________________________________________________________________________
3.7

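For a large random sample of size $n$, the sample proportion $\hat{P}$ is approximately normal about the true proportion $\pi$, so that, with probability $1-\alpha$,

$$P\left(\hat{P} - z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} \le \pi \le \hat{P} + z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right) = 1-\alpha.$$

The limits

$$\hat{P} - z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} \le \pi \le \hat{P} + z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}$$

still contain the unknown $\pi$ under the square root. We replace it with $\hat{P}$, which gives the following $100(1-\alpha)\%$ CI on $\pi$:

$$\hat{P} - z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}} \le \pi \le \hat{P} + z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}}.$$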

__________________________________________________________________________

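Example 3

A manufacturer of printed circuit boards (PCB) is interested in estimating the fraction of defective units produced. A random sample of 200 boards contains 1 defective. Find a 90% CI for the true proportion of defectives.
Solution
The sample proportion is $\hat{P} = 1/200 = 0.005$ and $z_{\alpha/2} = z_{0.05} = 1.6449$, so

$$0.005 - 1.6449\sqrt{\frac{0.005(0.995)}{200}} \le \pi \le 0.005 + 1.6449\sqrt{\frac{0.005(0.995)}{200}}$$

$$0.005 - 0.0082 \le \pi \le 0.005 + 0.0082$$

$$-0.0032 \le \pi \le 0.0132$$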
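The interval in Example 3 can be checked numerically. This is a minimal sketch, not the authors' code: the function name `proportion_interval` is hypothetical and the critical value comes from `statistics.NormalDist` instead of a table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(defectives, n, conf_level):
    """Large-sample CI for a population proportion."""
    p_hat = defectives / n  # sample proportion
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Figures from Example 3: 1 defective board in a sample of 200, 90% CI.
lower, upper = proportion_interval(defectives=1, n=200, conf_level=0.90)
print(f"{lower:.4f} <= proportion <= {upper:.4f}")
```

A proportion cannot be negative, so the negative lower limit is a sign that the normal approximation is strained when only one defective is observed; in practice the lower limit would be reported as 0.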

___________________________________________________________________________

A random sample of 200 diskettes were inspected and 17 defective diskettes were found. Find
a 95% CI on the true proportion of defective diskettes.

[ 0.0463, 0.1237 ]

___________________________________________________________________________

A random sample of 400 components were tested and 6.25 percent of the sample components
fail to satisfy production specifications. Find a 90% CI on the true proportion of components
that fail to satisfy the specifications.

[ 0.0426, 0.0824 ]

__________________________________________________________________________
3.8

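## CI for the Difference between Two Population Proportions

To construct the CI for $\pi_1 - \pi_2$, recall that the sampling distribution for $\hat{P}_1 - \hat{P}_2$ is normal with mean $\pi_1 - \pi_2$ and variance

$$\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}.$$

So the statistic

$$Z = \frac{\hat{P}_1 - \hat{P}_2 - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\pi_1(1-\pi_1)}{n_1} + \dfrac{\pi_2(1-\pi_2)}{n_2}}}$$

is a standard normal random variable. Using the same approach as in the previous section, we obtain a $100(1-\alpha)\%$ CI for the difference between two proportions as

$$\hat{P}_1 - \hat{P}_2 - z_{\alpha/2}\sqrt{\frac{\hat{P}_1(1-\hat{P}_1)}{n_1} + \frac{\hat{P}_2(1-\hat{P}_2)}{n_2}} \le \pi_1 - \pi_2 \le \hat{P}_1 - \hat{P}_2 + z_{\alpha/2}\sqrt{\frac{\hat{P}_1(1-\hat{P}_1)}{n_1} + \frac{\hat{P}_2(1-\hat{P}_2)}{n_2}}$$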

___________________________________________________________________________
Example 4

In a factory, plastic parts are formed using two different injection-molding machines. Two random samples, each of size 200, are chosen and 5 defective parts are found in the sample from machine A whereas 6 defective parts are found in the sample from machine B. Construct a 99% CI on the difference in proportions of defective parts.
Solution

$\hat{P}_1 = 5/200 = 0.025$; $\hat{P}_2 = 6/200 = 0.03$; $z_{\alpha/2} = z_{0.005} = 2.5758$.

So, the 99% CI on the difference in proportions of defectives is

$$(0.025 - 0.03) - 2.5758\sqrt{\frac{0.025(0.975)}{200} + \frac{0.03(0.97)}{200}} \le \pi_1 - \pi_2 \le (0.025 - 0.03) + 2.5758\sqrt{\frac{0.025(0.975)}{200} + \frac{0.03(0.97)}{200}}$$

$$-0.005 - 0.0421 \le \pi_1 - \pi_2 \le -0.005 + 0.0421$$

$$-0.0471 \le \pi_1 - \pi_2 \le 0.0371$$
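Example 4 follows the same pattern as the single-proportion case, with the two estimated variances added. The sketch below is illustrative: the function name `two_proportion_interval` is hypothetical and the critical value is computed with `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_interval(x1, n1, x2, n2, conf_level):
    """Large-sample CI for the difference between two population proportions."""
    p1, p2 = x1 / n1, x2 / n2
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Estimated standard error of the difference of the sample proportions.
    std_error = sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    diff = p1 - p2
    return diff - z * std_error, diff + z * std_error


# Figures from Example 4: 5 of 200 (machine A) and 6 of 200 (machine B), 99% CI.
lower, upper = two_proportion_interval(5, 200, 6, 200, 0.99)
print(f"{lower:.4f} <= difference <= {upper:.4f}")
```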
_________________________________________________________________________

A survey conducted by an independent Engineering Education Research Unit found that among
teenagers aged 17 to 19, 20% of school girls and 25% of school boys wanted to study in an
engineering discipline. Suppose that these percentages are based on random samples of 501
school girls and 500 school boys. Determine a 90% CI for the difference between the
proportions of all school girls and all school boys who would like to study in an engineering
discipline.
[-0.0933, -0.00666]
___________________________________________________________________________
3.9

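The statistic

$$\frac{(n-1)S^2}{\sigma^2}$$

is distributed as $\chi^2$ with $n-1$ degrees of freedom. So now we would like to have an interval in such a way that

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}.$$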

__________________________________________________________________________________

Example 5

A study on an operating system for a portable computer has been carried out thoroughly to estimate the variance of response time. A random sample of 10 portable computers is chosen and gives a standard deviation of 8 milliseconds. Assuming that the response time follows a normal distribution, construct a 95% CI on the true variance of response time.

Solution
$\alpha = 0.05$, $\chi^2_{0.025,\,9} = 19.023$, $\chi^2_{0.975,\,9} = 2.7$

A 95% CI on the variance of response time, $\sigma^2$, is

$$\frac{(10-1)8^2}{\chi^2_{0.025,\,10-1}} \le \sigma^2 \le \frac{(10-1)8^2}{\chi^2_{0.975,\,10-1}}$$

$$\frac{576}{19.023} \le \sigma^2 \le \frac{576}{2.7}$$

$$30.279 \le \sigma^2 \le 213.333$$
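The calculation in Example 5 can be sketched as below. The function name `variance_interval` is hypothetical, and the two chi-square percentage points are assumed to be read from a table and passed in, since Python's standard library does not provide the chi-square distribution.

```python
from math import sqrt


def variance_interval(s, n, chi2_upper, chi2_lower):
    """CI for a normal variance.

    chi2_upper is the chi-square point with area alpha/2 to its right,
    chi2_lower the point with area 1 - alpha/2 to its right, both with n - 1 d.f.
    """
    sum_sq = (n - 1) * s**2
    return sum_sq / chi2_upper, sum_sq / chi2_lower


# Figures from Example 5: n = 10, s = 8 ms, table values 19.023 and 2.7.
lower, upper = variance_interval(s=8, n=10, chi2_upper=19.023, chi2_lower=2.7)
print(f"{lower:.3f} <= variance <= {upper:.3f}")
# Taking square roots of both limits gives a CI for the standard deviation.
print(f"{sqrt(lower):.3f} <= std. dev. <= {sqrt(upper):.3f}")
```

Note that the larger table value produces the lower limit, because it appears in the denominator.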

________________________________________________________________________________

A random sample of 13 bolts is selected and the inside diameter is measured. The sample

standard deviation of the bolt inside diameter is 0.018 mm. Construct a 90% CI for the
standard deviation.
[0.0136, 0.0273]
__________________________________________________________________________
3.10

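The statistic

$$F = \frac{S_2^2/\sigma_2^2}{S_1^2/\sigma_1^2}$$

is $F$ with $n_2 - 1$ and $n_1 - 1$ degrees of freedom. So now we would like to have an interval in such a way that

$$P\left(f_{1-\alpha/2,\,n_2-1,\,n_1-1} \le F \le f_{\alpha/2,\,n_2-1,\,n_1-1}\right) = 1-\alpha,$$

that is,

$$P\left(f_{1-\alpha/2,\,n_2-1,\,n_1-1} \le \frac{S_2^2/\sigma_2^2}{S_1^2/\sigma_1^2} \le f_{\alpha/2,\,n_2-1,\,n_1-1}\right) = 1-\alpha.$$

Rearranging the above, we obtain a $100(1-\alpha)\%$ CI on the ratio of two variances of two normal distributions,

$$\frac{s_1^2}{s_2^2}\, f_{1-\alpha/2,\,n_2-1,\,n_1-1} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\,n_2-1,\,n_1-1}.$$

Since

$$f_{1-\alpha/2,\,n_2-1,\,n_1-1} = \frac{1}{f_{\alpha/2,\,n_1-1,\,n_2-1}},$$

we can rewrite the above CI as follows,

$$\frac{s_1^2}{s_2^2}\,\frac{1}{f_{\alpha/2,\,n_1-1,\,n_2-1}} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\,n_2-1,\,n_1-1}.$$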

___________________________________________________________________________

Example 6

A quality engineer is studying the diameter of stainless steel rods manufactured on two different machines. Two random samples of 16 and 13 rods respectively are selected, which give variances of the diameter values of 0.30 cm² and 0.40 cm² respectively. Assuming that the data were drawn from normal distributions, construct a 95% CI on the ratio of variances of the diameters.

Solution
$s_1^2 = 0.30$ cm², $s_2^2 = 0.40$ cm², $f_{0.025,\,16-1,\,13-1} = 3.18$, $f_{0.025,\,13-1,\,16-1} = 2.96$

$$\frac{s_1^2}{s_2^2}\,\frac{1}{f_{\alpha/2,\,n_1-1,\,n_2-1}} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\, f_{\alpha/2,\,n_2-1,\,n_1-1}$$

$$\frac{0.3}{0.4}\left(\frac{1}{3.18}\right) \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{0.3}{0.4}(2.96)$$

$$0.2358 \le \frac{\sigma_1^2}{\sigma_2^2} \le 2.22$$
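The two limits in Example 6 use two different table values, which is easy to mix up. The sketch below makes the roles explicit. The function and parameter names are hypothetical, and both $F$ percentage points are assumed to be read from a table, since the standard library has no $F$ distribution.

```python
def variance_ratio_interval(s1_sq, s2_sq, f_n1_n2, f_n2_n1):
    """CI for sigma1^2 / sigma2^2 of two normal populations.

    f_n1_n2 is f_{alpha/2, n1-1, n2-1} (used for the lower limit);
    f_n2_n1 is f_{alpha/2, n2-1, n1-1} (used for the upper limit).
    """
    ratio = s1_sq / s2_sq
    return ratio / f_n1_n2, ratio * f_n2_n1


# Figures from Example 6: s1^2 = 0.30, s2^2 = 0.40, table values 3.18 and 2.96.
lower, upper = variance_ratio_interval(0.30, 0.40, f_n1_n2=3.18, f_n2_n1=2.96)
print(f"{lower:.4f} <= ratio <= {upper:.2f}")
```

Because the interval contains 1, the data are compatible with equal variances at this confidence level.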

_____________________________________________________________________________

An engineer is studying an axial load of aluminium cans. It is measured by using a plate
where an increasing pressure is applied on top of the can until it collapses. This maximum
weight that the sides of the can can support is the axial load. Two random samples of sizes 10
and 7 aluminium cans are selected and the standard deviations are 10.1 kg and 11.8 kg
respectively. Find a 90% CI on the ratio of variances of the loads.
[0.1787,2.4689]
___________________________________________________________________________

Exercise 3
1. When you construct a 90% confidence interval for $\mu$, what are you 90% confident about?
2. What happens to the width of the CI if we increase the sample size?
3. Can we consider the construction of a confidence interval to be part of inferential statistics?
Why?
4. For a data set obtained from a sample, $n = 49$, $\bar{x} = 102.5$, and $s = 10.7$.
(a) What is the point estimate for $\mu$?
(b) Compute a 98% CI for $\mu$.

5. A 90% CI for $\mu$ can be interpreted as follows: if we take 1000 random samples of the same
size and compute the confidence interval for each, then about 900 of them
a. will contain $\mu$
b. will not contain $\mu$
c. will contain $\bar{x}$

6. Carbonated drink bottles are filled by an automated filling machine. Assume that the fill
volume is normally distributed and from previous production process the variance of fill
volume is 0.005 liter. A random sample of size 16 was drawn from this process which gives
the mean fill volume of 0.51 liter. Construct a 99% CI on the mean fill of all carbonated drink
bottles produced by this factory.
7. A random sample of 12 wafers were drawn from a slider fabrication process which gives
the following photoresist thickness in micrometer: 10 11 9 8 10 10 11 8 9 10 11 12. Assume
that the thickness is normally distributed. Construct a 95% CI for the mean thickness of all
wafers produced by this factory.
8. The following is the result for diameter of 10 bearings selected randomly from a
production process.
0.5061 0.5083 0.5058 0.5075 0.5049 0.5037

Assume that the diameter of a bearing follows a normal distribution.

(a) Construct a 90% CI for the mean of diameter of bearings.
(b) Construct a 95% CI for the mean of diameter of bearings.
(c) Comment on your interval estimates pertaining to their maximum error, which is
defined as $t_{\alpha/2,\,n-1}\, s/\sqrt{n}$.
9. In the integrated circuit manufacturing industry, a basic process is to grow an epitaxial layer on
polished silicon wafers. The wafers are mounted on a susceptor and positioned inside a
specified jar. Chemical vapours are introduced through nozzles positioned near the top of the jar.
The susceptor is rotated and heat at constant temperature is applied. The
following are the thicknesses of the epitaxial layers (in $\mu$m) at low deposition time and at 59%
arsenic flow rate.

13.925 13.909 14.057 14.068 14.006 13.893 14.005

(a) Construct a 90% CI for the mean thickness of epitaxial layers assuming that the thickness
of epitaxial layer follows a normal distribution with variance of 0.0050 $\mu$m².
(b) Construct a 90% CI for the mean thickness of all epitaxial layers assuming that the thickness
of epitaxial layer follows a normal distribution.
(c) Comment on the interval estimates based on their practicality.
10. Using the data in question 9 and the following data on thickness of the epitaxial layers
at high deposition time and at 59% arsenic flow rate,

14.295 14.095 15.505 15.806 15.106 14.839

construct a 90% CI on the difference between the mean thicknesses of epitaxial layers, assuming
that the thickness of epitaxial layers follows a normal distribution with equal variances. Interpret
your CI. Can you conclude that the true mean difference is zero?
11. A quality inspector inspected a random sample of 300 memory chips from a production
line and found that 9 are defective. Construct a 99% confidence interval for the proportion of
defective chips.

12. A manufacturer of mobile phone batteries is interested in estimating the proportion of
defects in his products. A random sample of 800 batteries contains 10 defectives.
Construct a 95% confidence interval for the proportion of defectives.
13. A manufacturer of computer chips inspected a random sample of 1000 chips. The
following are the numbers of defects according to type.

holes too small      90
holes too large      25
poor connections     10
chip oversize         2
chip undersize        1

(a) What is the point estimate of the proportion of defectives due to holes too small?
(b) Construct a 90% CI for the proportion of defectives for the production process due to
holes too small.
(c) What is the point estimate for the proportion of defectives due to poor connections?
(d) Construct a 90% CI for the proportion of defectives for the production process due to poor
connections.
(e) If oversize and undersize chips can be classified as incorrect chip size, what is the point
estimate of the proportion of defects due to incorrect chip size?
Hence find a 95% interval estimate for the proportion of defective items due
to incorrect chip size.
14. An optical firm is concerned about the variability of the refractive index of a typical glass
that it will grind into lenses. The refractive index follows an approximately normal
distribution. A random sample of 15 glasses is drawn from a large shipment, which gives a
variance of $1.5 \times 10^{-4}$ for the refractive index. Construct a 95% CI for the standard deviation of
refractive index of all glasses.

15. A mechanical engineer in a car manufacturing company is investigating two types of
bumper guards. A random sample of 6 guards from each type were mounted on a compact
car. Each car was then run into a concrete wall at 8 km per hour.
The following are the costs of repairs (in RM):
Bumper guard 1 : 305 420 363 485 300 360
Bumper guard 2 : 405 345 336 450 400 360
(a) Construct a 90% CI for the mean cost of repairs using bumper guard 1. State 3 conditions
in constructing the CI.

(b) Assuming that all conditions in part (a) are satisfied, construct a 90% CI for the mean cost of
repairs using bumper guard 2. What can you observe from these CIs?
(c) Assuming that the variances of cost of repairs are equal, construct a 95% CI on the mean
difference of cost of repairs.
(d) What is the point estimate of the variance of cost of repair for bumper guard 1? Construct
a 95% CI for variance of cost of repair for bumper guard 1.
(e) What is the point estimate of the standard deviation of cost of repair for bumper guard 2?
Construct a 95% CI for the standard deviation of cost of repair for bumper guard 2.
(f) Find a 90% CI for the ratio of two variances for cost of repairs.
__________________________________________________________________________

Chapter 4

Tests of Hypotheses
Learning Objectives:
At the end of this chapter, students should be able to:
(a) structure science and/or engineering decision-making problems concerning one
or two samples as hypothesis tests.
(b) test hypotheses concerning a population mean.
(c) test hypotheses concerning a population variance or standard deviation.
(d) test hypotheses concerning a population proportion.
(e) test hypotheses concerning the difference in two population means.
(f) test hypotheses concerning the ratio of two population variances or standard
deviations.

4.1

Statistical Hypotheses

Many science and engineering problems require us to decide whether to accept or reject
a statement about some parameter. That statement is called a hypothesis. A statistical
hypothesis can arise from various fields of interest such as engineering, science, education, etc. A systematic procedure to decide whether to accept or reject a hypothesis is
called hypothesis testing.

Definition 1 A statistical hypothesis is a statement about the parameter, or parameters,
of one or more populations.

We cannot prove that a hypothesis is absolutely true or false. If the data sample supports the
hypothesis, then we do not reject it. If the data sample does not support the hypothesis, we
reject it.
The hypothesis being tested is referred to as the null hypothesis and denoted by $H_0$. The null
hypothesis is set up primarily to see whether it can be rejected or not. Also, we must
formulate an alternative hypothesis in order to know when to reject a null hypothesis. The
alternative hypothesis, denoted by $H_1$, is the hypothesis which we accept when the null
hypothesis is rejected. Some authors use the notation $H_a$ or $H_A$ for the alternative
hypothesis.
Definition 2 A null hypothesis, $H_0$, is an assertion about one or more population
parameters. We hold this assertion as true until there is sufficient statistical evidence to
conclude otherwise. The alternative hypothesis, $H_1$, is the assertion of all situations not
covered by the null hypothesis.
Together, the null and the alternative hypotheses constitute a complete set of hypotheses that
covers all possible values of the parameter or parameters under investigation. The value of
the population parameter specified in the null hypothesis is usually determined in one of the
following three ways:

1. from a model or theory regarding the process under investigation; then the objective of
hypothesis testing is usually to verify the model or theory.
2. from knowledge of the process or previous tests or experiments; then the objective of
hypothesis testing is to determine whether the parameter value has changed.
3. from external consideration, such as design or engineering specification, or from
contractual obligations; then the objective of hypothesis testing is conformance testing.
The hypothesis test is carried out using information obtained by random sampling.

For example, suppose that we are interested in the output voltage of a power supply used in a
mobile phone; output voltage is a random variable that can be described by a probability
distribution. Suppose that our interest focuses on the mean output voltage $\mu$
(a parameter of this distribution). Specifically, we are interested in deciding whether
or not the mean output voltage is 6.00 V. We may express this formally as

$$H_0: \mu = 6.00 \text{ V} \qquad H_1: \mu \ne 6.00 \text{ V} \qquad (4.1)$$

The statement $H_0: \mu = 6.00$ V in Equation (4.1) is called the null hypothesis¹, and the
statement $H_1: \mu \ne 6.00$ V is called the alternative hypothesis. Since values of the
alternative hypothesis could be either greater or less than 6.00 V, it is called a two-sided
alternative hypothesis. When we formulate the hypotheses as

$$H_0: \mu = 6.00 \text{ V} \qquad H_1: \mu < 6.00 \text{ V}$$

or

$$H_0: \mu = 6.00 \text{ V} \qquad H_1: \mu > 6.00 \text{ V}$$

the values of the alternative hypothesis are less than 6.00 V or greater than 6.00 V,
respectively, and it is called a one-sided alternative hypothesis².
Definition 3 A test statistic is a sample statistic computed from the data obtained by random
sampling. The value of the test statistic is used in determining whether or not the null
hypothesis should be rejected.

We decide whether or not to reject the null hypothesis by following a rule called the decision
rule.
Definition 4 The decision rule of a statistical hypothesis test is a rule that specifies
the conditions under which the null hypothesis may be rejected.

_______________________________________________________________

1 Note that when choosing the null hypothesis one should bear in mind that it should nearly
always be precise, or be easily reduced to a precise hypothesis. For example, when testing
$H_0: \mu \ge 6$ V versus $H_1: \mu < 6$ V, the null hypothesis does not specify the value of $\mu$
exactly and so is not precise. But in practice we would proceed as if we were testing
$H_0: \mu = 6$ V versus $H_1: \mu < 6$ V.
2 Note that hypotheses are always statements about the parameters of one or more
populations under investigation, not statements about the sample. So it is wrong to write
$H_0: \bar{x} = 6$ V versus $H_1: \bar{x} > 6$ V or $H_1: \bar{x} < 6$ V.

Table 4.1 shows all four possible outcomes of a test of hypothesis. The conclusion
columns refer to the action that the person conducting the test will take based on the results of
the sampling experiment: he or she will conclude either that the alternative hypothesis $H_1$ is
true or that the null hypothesis $H_0$ is true. The state of nature rows refer to the fact that either
the null hypothesis $H_0$ or the alternative hypothesis $H_1$ is actually true. We assume that the
true state of nature is unknown to the person conducting the test.

Table 4.1: Four possible outcomes of a test of hypothesis

                        Statistical Conclusion
State of Nature         $H_1$ is true            $H_0$ is true
$H_0$ is true           Type I error             Correct conclusion
$H_1$ is true           Correct conclusion       Type II error

A wrong conclusion is made when the alternative hypothesis is accepted
(equivalently, the null hypothesis is rejected) when in fact $H_0$ is true. This type of wrong
conclusion is called a Type I error.

Definition 5 Rejecting the null hypothesis $H_0$ (equivalently, accepting the alternative hypothesis
$H_1$) when $H_0$ is true in the state of nature is defined as a Type I error.

A wrong conclusion is also made when the null hypothesis is accepted
(equivalently, the alternative hypothesis is rejected) when in fact $H_1$ is true. This type
of wrong conclusion is called a Type II error.
Definition 6 Failing to reject the null hypothesis $H_0$ (equivalently, failing to accept the
alternative hypothesis $H_1$) when $H_0$ is false in the state of nature is defined as a Type II
error.

Probabilities can be associated with the Type I and Type II errors because the
conclusion is based on random variables. The probability of making a Type I error is
denoted by $\alpha$ (the Greek letter alpha), that is

$$\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \text{ when } H_0 \text{ is true}). \qquad (4.2)$$

The probability of making a Type II error is denoted by $\beta$ (the Greek letter beta), that is

$$\beta = P(\text{Type II error}) = P(\text{accept } H_0 \text{ when } H_0 \text{ is false}). \qquad (4.3)$$

A decision will be made only when we know the probability of making the error that
corresponds to that conclusion. When $\alpha$ is specified, we are able to reject $H_0$ (accept
$H_1$) if the test statistic is in the rejection region. However, when $\beta$ is not specified, we
should avoid the decision to accept $H_0$; instead, if the sample evidence does not support
rejection, we should state that the sample evidence is insufficient to reject $H_0$. A Type I error is
considered more important than a Type II error: we want to guard against rejecting $H_0$ when it
is in fact true more than against the other type of error.
A procedure leading to a decision about a particular hypothesis is called a test of a
hypothesis. The general procedure used for testing a hypothesis is as follows:
1. Identify the parameter of interest.
2. Formulate a null hypothesis and an alternative hypothesis.
3. Choose a significance level, $\alpha$.

4. Determine the distribution and state the rejection region of the test statistic.
5. Specify an appropriate test statistic and calculate the value of the test statistic from a
random sample of data.
6. Decide whether to reject H 0 or fail to reject H 0 by comparing the calculated value of the
test statistic with the values in the critical region.

Steps 1 to 4 should be completed prior to calculation of the test statistic from a random
sample of data. This sequence of steps will be illustrated in subsequent sections.

___________________________________________________________________________

4.2

## Test of Hypothesis for the Mean

We now consider the case of hypothesis testing on the mean of a population under the
assumption of normality. The tests are also valid in cases where only approximate normality
exists. If it is not normal then the conditions of the central limit theorem apply.
To test the hypothesis that a random sample $X_1, X_2, \ldots, X_n$ of size $n$ comes from
a population with mean $\mu = \mu_0$ we use the statistic $\bar{X}$,
where $\mu_0$ is a specified constant and we have assumed that the population variance $\sigma^2$ is
known. Now consider testing the hypothesis

$$H_0: \mu = \mu_0 \qquad H_1: \mu \ne \mu_0 \qquad (4.4)$$

We will use the test statistic

$$Z_{test} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \qquad (4.5)$$

If the null hypothesis is true, $Z_{test}$ has a standard normal distribution, $N(0, 1)$. When we
know the distribution of the test statistic we can locate the critical region to control the Type I
error probability at the desired level. In this case we would use the
$-z_{\alpha/2}$ and $z_{\alpha/2}$ percentage points as the boundaries of the critical region and reject
$H_0: \mu = \mu_0$ if

$$z_{test} < -z_{\alpha/2} \qquad (4.6)$$

or

$$z_{test} > z_{\alpha/2} \qquad (4.7)$$

and we should fail to reject $H_0$ if

$$-z_{\alpha/2} \le z_{test} \le z_{\alpha/2} \qquad (4.8)$$

Equations (4.6) and (4.7) define the critical region or rejection region for the test. The Type I
error probability for this test procedure is $\alpha$.

The procedures for testing the mean when the variance is known are summarized in
Table 4.2.

Table 4.2: Testing the mean when variance is known

Test statistic: $Z_{test} = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

Case   Null hypothesis        Alternative hypothesis   Rejection region
1      $H_0: \mu = \mu_0$     $H_1: \mu \ne \mu_0$     $z_{test} < -z_{\alpha/2}$ or $z_{test} > z_{\alpha/2}$
2      $H_0: \mu = \mu_0$     $H_1: \mu < \mu_0$       $z_{test} < -z_{\alpha}$
3      $H_0: \mu = \mu_0$     $H_1: \mu > \mu_0$       $z_{test} > z_{\alpha}$

___________________________________________________________________________

Example 1
Mobile phones are powered by battery. The output voltage of a power supply used in a
mobile phone is an important product characteristic. Specifications require that the mean
output voltage must be 6.00 V. We know that the standard deviation of output voltage is $\sigma = 0.5$ V.
We decide to specify a Type I error probability or significance level of $\alpha = 0.05$. A
random sample of $n = 20$ is collected and gives a sample mean output voltage of
$\bar{x} = 6.80$ V. What conclusions should we draw?

Solution
The solution follows the procedure outlined in Section 4.1 for testing a hypothesis:
1. The parameter of interest is the population mean, $\mu$, the mean output voltage.
2. The null hypothesis and alternative hypothesis are
$H_0: \mu = 6.00$ V versus $H_1: \mu \ne 6.00$ V
3. $\alpha = 0.05$
4. Reject $H_0$ if $z_{test} < -z_{\alpha/2} = -1.96$ or $z_{test} > z_{\alpha/2} = 1.96$.
5. The test statistic is

$$Z_{test} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

Since $\bar{x} = 6.80$ V and $\sigma = 0.5$ V, the value of the test statistic is

$$z_{test} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{6.80 - 6.00}{0.5/\sqrt{20}} = 7.16$$

6. Since the value $z_{test} = 7.16$ does exceed 1.96, we reject $H_0: \mu = 6.00$ at the 0.05 level
of significance. We can statistically conclude that the mean output voltage differs from 6 V,
based on a sample of 20 measurements.
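Steps 4 to 6 of Example 1 can be sketched in Python as follows. The function name `two_sided_z_test` is hypothetical, and the critical value is computed with `statistics.NormalDist` instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(xbar, mu0, sigma, n, alpha):
    """Test H0: mu = mu0 against H1: mu != mu0 when sigma is known."""
    z_test = (xbar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    # Reject H0 when z_test falls in either tail of the rejection region.
    reject = z_test < -z_crit or z_test > z_crit
    return z_test, z_crit, reject


# Figures from Example 1: xbar = 6.80 V, mu0 = 6.00 V, sigma = 0.5 V, n = 20.
z_test, z_crit, reject = two_sided_z_test(6.80, 6.00, 0.5, 20, alpha=0.05)
print(f"z_test = {z_test:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```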
Suppose that we specify the hypotheses as

$$H_0: \mu = \mu_0 \qquad H_1: \mu < \mu_0 \qquad (4.9)$$

where the alternative hypothesis is one-sided. In defining the critical region for this test, we
observe that a positive value of the test statistic $Z_{test}$ would never lead us to conclude that
$H_0: \mu = \mu_0$ is false. Therefore, we would place the critical region in the lower tail of the
standard normal distribution and reject $H_0$ if the calculated value $z_{test}$ is too small. We
would reject $H_0$ if

$$z_{test} < -z_{\alpha}$$

Similarly, to test

$$H_0: \mu = \mu_0 \qquad H_1: \mu > \mu_0 \qquad (4.10)$$

we observe that a negative value of the test statistic $Z_{test}$ would never lead us to conclude
that $H_0: \mu = \mu_0$ is false. Therefore, we would place the critical region in the upper tail of the
standard normal distribution and reject $H_0$ if the calculated value of $z_{test}$ is too large. We
would reject $H_0$ if

$$z_{test} > z_{\alpha}$$
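The two one-sided rules differ only in which tail holds the rejection region. The sketch below is illustrative: the function name `one_sided_z_test` and its `alternative` argument are assumptions, and the figures are those of the battery-life task that follows.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_z_test(xbar, mu0, sigma, n, alpha, alternative):
    """One-sided z test; alternative is 'less' (H1: mu < mu0) or 'greater' (H1: mu > mu0)."""
    z_test = (xbar - mu0) / (sigma / sqrt(n))
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # z_alpha, not z_{alpha/2}
    if alternative == "less":
        reject = z_test < -z_alpha  # lower-tail rejection region
    elif alternative == "greater":
        reject = z_test > z_alpha  # upper-tail rejection region
    else:
        raise ValueError("alternative must be 'less' or 'greater'")
    return z_test, reject


# Battery-life task: H1: mu > 90.0, sigma = 8.5, n = 18, xbar = 95.5, alpha = 0.01.
z_test, reject = one_sided_z_test(95.5, 90.0, 8.5, 18, alpha=0.01, alternative="greater")
print(f"z_test = {z_test:.4f}, reject H0: {reject}")
```

All of $\alpha$ is placed in a single tail, so the critical value is $z_{\alpha}$ and is smaller in magnitude than the two-sided $z_{\alpha/2}$.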

_________________________________________________________________________

A manufacturer claims that the battery life of model Z1 exceeds 90.0 hours. The life in hours of a
battery is known to be approximately normally distributed, with standard deviation $\sigma = 8.5$
hours. A random sample of 18 batteries has a mean life of $\bar{x} = 95.5$ hours. Is there
evidence to support the claim? Use $\alpha = 0.01$.

[$z_{test} = 2.7452$; reject $H_0$]

__________________________________________________________________________________

To test a hypothesis on $\mu$ when $\sigma^2$ is unknown, we replace $\sigma^2$ with the sample variance
$S^2$. If $n$ is large (normally $n \ge 30$) we can proceed to use the test procedure based on the
normal distribution

$$Z_{test} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

where we just replace $\sigma$ with $S$. However, when $n$ is small ($n < 30$), the statistic

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has a $t$ distribution with $n - 1$ degrees of freedom.

Now consider testing the hypotheses in Equation (4.4). We will use the test statistic

$$T_{test} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

If $H_0$ is true, $T_{test}$ has a $t$ distribution with $n - 1$ degrees of freedom and we can locate
the critical region to control the Type I error probability at the desired level. In this case we
would use $-t_{\alpha/2,\,n-1}$ and $t_{\alpha/2,\,n-1}$ as the boundaries of the critical
regions to reject $H_0: \mu = \mu_0$ if

$$t_{test} < -t_{\alpha/2,\,n-1} \qquad (4.11)$$

or

$$t_{test} > t_{\alpha/2,\,n-1} \qquad (4.12)$$

and we should fail to reject $H_0$ if

$$-t_{\alpha/2,\,n-1} \le t_{test} \le t_{\alpha/2,\,n-1} \qquad (4.13)$$

Equations (4.11) and (4.12) define the critical region or rejection region for the test.
The Type I error probability for this test procedure is $\alpha$.

The procedures for testing the mean when the variance is unknown are summarized in
Table 4.3.

Table 4.3: Testing the mean when variance is unknown and n < 30

Test statistic: $T_{test} = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$, with $n - 1$ degrees of freedom

Case   Null hypothesis        Alternative hypothesis   Rejection region
1      $H_0: \mu = \mu_0$     $H_1: \mu \ne \mu_0$     $t_{test} < -t_{\alpha/2,\,n-1}$ or $t_{test} > t_{\alpha/2,\,n-1}$
2      $H_0: \mu = \mu_0$     $H_1: \mu < \mu_0$       $t_{test} < -t_{\alpha,\,n-1}$
3      $H_0: \mu = \mu_0$     $H_1: \mu > \mu_0$       $t_{test} > t_{\alpha,\,n-1}$

Table 4.2 and Table 4.3 are very similar except that $T_{test}$ is used as the test statistic
instead of $Z_{test}$. Also, we use the $t$ distribution to define the critical region instead of using
the standard normal distribution.
_____________________________________________________________________
Example 2
Referring to Example 1, suppose that the true variance is unknown. Ten determinations of the
output voltage of a power supply yielded the following values:

6.05  6.06  6.03  5.95  6.00  5.98  6.04  5.98  6.02  6.03

Can we say that the average output voltage is equal to 6.00 V? Assume that the data
are approximately normal.
Solution
The solution using the outline in Section 4.1 is as follows:
1. The parameter of interest is the population mean, $\mu$, the mean output voltage.
2. The null and alternative hypotheses are $H_0: \mu = 6.00$ V versus $H_1: \mu \ne 6.00$ V.
3. $\alpha = 0.05$.
4. Reject $H_0$ if $t_{test} > t_{\alpha/2,\,n-1} = t_{0.025,\,9} = 2.262$ or $t_{test} < -t_{0.025,\,9} = -2.262$.
5. The test statistic is $T_{test} = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$. Since $\bar{x} = 6.014$ V and $s = 0.0353$ V, the calculated value of the test statistic is

$$t_{test} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{6.014 - 6.00}{0.0353/\sqrt{10}} = 1.254.$$

6. Since the value $t_{test} = 1.254$ is between $-2.262$ and $2.262$, we are unable to reject
$H_0: \mu = 6.00$, and there is no strong evidence to indicate that the mean output voltage is not equal to
6.00 V at the 0.05 level of significance. Based on a sample of 10 measurements, the data are consistent with a mean output
voltage of 6.00 V.
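
The calculation in Example 2 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `one_sample_t` is an illustrative choice, and the critical value 2.262 is copied from the $t$ table rather than computed.

```python
import math
import statistics

# Output voltages from Example 2 (volts).
voltages = [6.05, 6.06, 6.03, 5.95, 6.00, 5.98, 6.04, 5.98, 6.02, 6.03]
mu_0 = 6.00      # hypothesised mean under H0
t_crit = 2.262   # t_{0.025, 9}, read from the t table (not computed here)


def one_sample_t(data, mu_0):
    """Return (sample mean, sample std dev, t statistic) for H0: mu = mu_0."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    t = (x_bar - mu_0) / (s / math.sqrt(n))
    return x_bar, s, t


x_bar, s, t_test = one_sample_t(voltages, mu_0)
reject = abs(t_test) > t_crit  # two-sided rejection rule (4.11)-(4.12)
print(f"x_bar = {x_bar:.3f}, s = {s:.4f}, t = {t_test:.3f}, reject H0: {reject}")
```

With unrounded $s$ the statistic comes out slightly below the 1.254 quoted in the example, which uses $s$ rounded to 0.0353; the decision is unchanged.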
___________________________________________________________________________
Suppose you are a buyer of large supplies of mobile phone batteries. You want to test the
manufacturer's claim that his mobile phone batteries last more than 900 hours. You test 40
batteries and find that the sample mean is 922 hours and the sample standard deviation is 68
hours. Should you accept the claim? Use $\alpha = 0.05$.

$z_{test} = 2.0462$; reject $H_0$

___________________________________________________________________________

A manufacturer of transistors claims that its transistors will last an average of 2100 hours. To
maintain this average, 20 transistors are tested each month. What conclusions should be
drawn from a sample that has a mean of 2140 hours and a sample standard deviation of 87 hours?
Assume that the distribution of the lifetime of the transistors is normal. Use $\alpha = 0.01$.

$t_{test} = 2.0562$; fail to reject $H_0$

_______________________________________________________________________________

## 4.3 Test of Hypothesis for the Variance

Hypothesis tests on the population variance or standard deviation are as important as
tests on the population mean. For example, we may wish to test whether a random sample is
drawn from a normal population with a specific known variance, say $\sigma_0^2$, or equivalently, that
the standard deviation is equal to $\sigma_0$. To test

$$H_0: \sigma^2 = \sigma_0^2 \quad \text{versus} \quad H_1: \sigma^2 \ne \sigma_0^2 \qquad (4.14)$$

we use the random variable

$$\chi^2 = \frac{(n-1)S^2}{\sigma_0^2}, \qquad (4.15)$$

which, if the null hypothesis $H_0: \sigma^2 = \sigma_0^2$ is true, has a chi-square ($\chi^2$) distribution with $n - 1$ degrees of freedom. We will use the test
statistic

$$\chi^2_{test} = \frac{(n-1)s^2}{\sigma_0^2}. \qquad (4.16)$$

The null hypothesis would be rejected if

$$\chi^2_{test} > \chi^2_{\alpha/2,\,n-1} \quad \text{or} \quad \chi^2_{test} < \chi^2_{1-\alpha/2,\,n-1},$$

where $\chi^2_{\alpha/2,\,n-1}$ and $\chi^2_{1-\alpha/2,\,n-1}$ are the upper and lower $100\alpha/2$ percentage points of the chi-square distribution with
$n - 1$ degrees of freedom. Table 4.4 summarizes the critical regions needed for each of
the possible alternative hypotheses.

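Table 4.4: Testing the variance, $\sigma^2$

Test statistic: $\chi^2_{test} = \dfrac{(n-1)S^2}{\sigma_0^2}$, with $n - 1$ degrees of freedom

| Case | Null hypothesis | Alternative hypothesis | Rejection region |
|------|-----------------|------------------------|------------------|
| 1 | $H_0: \sigma^2 = \sigma_0^2$ | $H_1: \sigma^2 \ne \sigma_0^2$ | $\chi^2_{test} > \chi^2_{\alpha/2,\,n-1}$ or $\chi^2_{test} < \chi^2_{1-\alpha/2,\,n-1}$ |
| 2 | $H_0: \sigma^2 = \sigma_0^2$ | $H_1: \sigma^2 > \sigma_0^2$ | $\chi^2_{test} > \chi^2_{\alpha,\,n-1}$ |
| 3 | $H_0: \sigma^2 = \sigma_0^2$ | $H_1: \sigma^2 < \sigma_0^2$ | $\chi^2_{test} < \chi^2_{1-\alpha,\,n-1}$ |

Example 3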

A drilling machine is used to drill metal plates used in batteries. A random sample of 25
plates results in a sample variance of hole diameter of $s^2 = 1.82$ mm$^2$. If the variance of hole
diameter exceeds 1.00 mm$^2$, the drilling machine must be serviced. Is there evidence that
the machine needs to be serviced? Use $\alpha = 0.01$, and assume that hole diameter has a normal
distribution.
Solution
The solution using the outline in Section 4.1 is as follows:
1. The parameter of interest is the population variance, $\sigma^2$, the variance of hole diameter.
2. The null hypothesis and alternative hypothesis are
$H_0: \sigma^2 = 1.00$ mm$^2$ versus $H_1: \sigma^2 > 1.00$ mm$^2$.
3. $\alpha = 0.01$.
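
For the drilling machine data, the statistic in Equation (4.16) can be evaluated as in the sketch below. This is a hedged illustration: the upper-tail critical value $\chi^2_{0.01,\,24} = 42.980$ is taken from a standard chi-square table and is an assumption here, since it is not quoted in the text, and the function name is hypothetical.

```python
# Drilling machine data from Example 3.
n = 25              # number of plates sampled
s_sq = 1.82         # sample variance of hole diameter (mm^2)
sigma0_sq = 1.00    # variance under H0 (mm^2)
chi2_crit = 42.980  # chi-square_{0.01, 24}; assumed from a standard table


def chi2_variance_statistic(n, s_sq, sigma0_sq):
    """Test statistic (n - 1) s^2 / sigma_0^2 from Equation (4.16)."""
    return (n - 1) * s_sq / sigma0_sq


chi2_test = chi2_variance_statistic(n, s_sq, sigma0_sq)
reject = chi2_test > chi2_crit  # upper-tailed test of H1: sigma^2 > sigma_0^2
print(f"chi2_test = {chi2_test:.2f}, reject H0: {reject}")
```

Under these assumptions the statistic only just exceeds the critical value, so the conclusion is sensitive to the chosen significance level.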

4. Reject $H_0$ if $z_{test} > z_{\alpha/2} = z_{0.025} = 1.96$ or $z_{test} < -z_{\alpha/2} = -z_{0.025} = -1.96$.
Refer to Table 6 of Lee (2004).
5. The test statistic is

$$Z_{test} = \frac{\hat{P} - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/n}}.$$

Since $\hat{p} = 10/200 = 0.05$,

$$z_{test} = \frac{\hat{p} - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/n}} = \frac{0.05 - 0.03}{\sqrt{(0.03)(0.97)/200}} = 1.6581.$$

6. Since the value $z_{test} = 1.6581$ is between $-1.96$ and $1.96$, we are unable to reject
$H_0: \pi = 0.03$, and there is no strong evidence to indicate that the percentage of defective components is
not equal to 3% at the 0.05 level of significance. The data are consistent with the
percentage of defective components being 3%.
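
The arithmetic in step 5 can be reproduced with a short sketch. It uses the figures that appear in the solution (10 defectives in 200 components, hypothesised proportion 0.03); the function name `one_proportion_z` is an illustrative assumption.

```python
import math

x = 10        # defective components observed
n = 200       # sample size
pi_0 = 0.03   # hypothesised proportion under H0
z_crit = 1.96 # z_{0.025}, from the normal table


def one_proportion_z(x, n, pi_0):
    """Large-sample z statistic for H0: pi = pi_0."""
    p_hat = x / n
    standard_error = math.sqrt(pi_0 * (1 - pi_0) / n)
    return (p_hat - pi_0) / standard_error


z_test = one_proportion_z(x, n, pi_0)
reject = abs(z_test) > z_crit  # two-sided rejection rule
print(f"z_test = {z_test:.4f}, reject H0: {reject}")
```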
___________________________________________________________________________
For small $n$, tests concerning true proportions can be based directly on tables of binomial
probabilities.
___________________________________________________________________________

An electrical company claimed that at least 90% of the parts which they supplied on a
government contract conformed to specifications. A sample of 280 parts was tested, and 35
did not meet specifications. Can we accept the company's claim at a 0.05 level of
significance?

$z_{test} = 1.2649$; fail to reject $H_0$

___________________________________________________________________________
The manufacturer of electronic devices informed his buyer about the proportion of defective
devices in its shipments. He claims that the proportion of all devices that are defective is less
than 6%. A random sample of 100 electronic devices indicates that 5 are defective. Using
$\alpha = 0.05$, test whether the buyer will accept the manufacturer's claim or not.

$z_{test} = -0.4211$; fail to reject $H_0$

___________________________________________________________________________

## 4.5 Test of Hypothesis for the Difference between the Means

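4.5.1 Variance known

Let $X_{11}, X_{12}, \ldots, X_{1n_1}$ be a random sample of $n_1$ observations from the first population and
$X_{21}, X_{22}, \ldots, X_{2n_2}$ a random sample of $n_2$ observations from the second population, where the two independent populations have
parameters $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$, respectively, where $\sigma_1^2$ and $\sigma_2^2$ are
known. The test statistic used to test $H_0: \mu_1 - \mu_2 = \delta_0$ against $H_1: \mu_1 - \mu_2 \ne \delta_0$ is the standard
normal random variable

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}},$$

where $\delta_0$ is a specified number and $\mu_1 - \mu_2$ is set to $\delta_0$ under $H_0$. When $\delta_0 = 0$ then $H_0: \mu_1 - \mu_2 = 0$ or $H_0: \mu_1 = \mu_2$.

Because $Z$ has the standard normal distribution when $H_0$ is true, we would take $-z_{\alpha/2}$ and
$z_{\alpha/2}$ as the boundaries of the critical region. This result and two other cases are included in
Table 4.6.

Table 4.6: Testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are known

Test statistic: $Z = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$

| Case | Null hypothesis | Alternative hypothesis | Rejection region |
|------|-----------------|------------------------|------------------|
| 1 | $H_0: \mu_1 - \mu_2 = \delta_0$ | $H_1: \mu_1 - \mu_2 \ne \delta_0$ | $z_{test} > z_{\alpha/2}$ or $z_{test} < -z_{\alpha/2}$ |
| 2 | $H_0: \mu_1 - \mu_2 = \delta_0$ | $H_1: \mu_1 - \mu_2 > \delta_0$ | $z_{test} > z_{\alpha}$ |
| 3 | $H_0: \mu_1 - \mu_2 = \delta_0$ | $H_1: \mu_1 - \mu_2 < \delta_0$ | $z_{test} < -z_{\alpha}$ |

___________________________________________________________________________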

___________________________________________________________________________
A manufacturer is comparing the settings of two machines, M1 and M2, which should
produce rods of the same length. Both have, over a long period, given rods whose lengths
were normally distributed with variance 37 cm$^2$. Although the two machines are supposed to
give the same length of rod, he suspects that this is not so. Examine this suspicion, if the
total length of 15 rods from M1 is 513 cm, and the total length of 20 rods from M2 is 575 cm.
Use $\alpha = 0.05$.

$z_{test} = 2.6231$; reject $H_0$

________________________________________________________________________________

4.5.2 Variance unknown

If the sample sizes $n_1$ and $n_2$ are large (commonly, greater than or equal to 30), the normal
distribution procedures in Section 4.5.1 can be used by replacing $\sigma_1^2$ and $\sigma_2^2$ with $S_1^2$
and $S_2^2$, respectively.
However, when the sample sizes $n_1$ and $n_2$ are small (commonly, $n < 30$) and the populations
are normally distributed, our hypothesis testing will be based on the $t$ distribution. Two
different assumptions must be treated. Firstly, we assume that the variances of the two normal
distributions are unknown but equal, $\sigma_1^2 = \sigma_2^2 = \sigma^2$. Secondly, we assume that the variances
of the two normal distributions are unknown and not equal, $\sigma_1^2 \ne \sigma_2^2$.
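(i) when $\sigma_1^2 = \sigma_2^2 = \sigma^2$

The variance of $\bar{X}_1 - \bar{X}_2$ is

$$\mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} = \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2} = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right).$$

Now we have the test statistic

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}.$$

Since $\sigma$ is unknown, we replace it with $S_p$, the pooled estimator of $\sigma$. The pooled estimator
of $\sigma^2$, denoted by $S_p^2$, is defined by

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}.$$

The test statistic is

$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$

which follows the $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom.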

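The procedures for testing $\mu_1 - \mu_2$ when the variances $\sigma_1^2$ and $\sigma_2^2$ are unknown but equal are
summarized in Table 4.7.

Table 4.7: Testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are unknown but equal

Test statistic: $T = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$, with $\nu = n_1 + n_2 - 2$ degrees of freedom

| Case | Null hypothesis | Alternative hypothesis | Rejection region |
|------|-----------------|------------------------|------------------|
| 1 | $H_0: \mu_1 - \mu_2 = \delta_0$ | $H_1: \mu_1 - \mu_2 \ne \delta_0$ | $t_{test} > t_{\alpha/2,\,\nu}$ or $t_{test} < -t_{\alpha/2,\,\nu}$ |
| 2 | $H_0: \mu_1 - \mu_2 = \delta_0$ | $H_1: \mu_1 - \mu_2 > \delta_0$ | $t_{test} > t_{\alpha,\,\nu}$ |
| 3 | $H_0: \mu_1 - \mu_2 = \delta_0$ | $H_1: \mu_1 - \mu_2 < \delta_0$ | $t_{test} < -t_{\alpha,\,\nu}$ |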

Example 5
A researcher wants to prove that brand X size AAA batteries last an average of at least 30
minutes longer than brand Y. Two normally distributed independent random samples of 10
of each brand are selected, and the batteries are run continuously until they are no longer
functional. The sample mean life for brand X is found to be $\bar{x}_1 = 328$ minutes, and the
sample standard deviation is $s_1 = 46$ minutes. The results for the brand Y batteries are
$\bar{x}_2 = 472$ minutes and $s_2 = 52$ minutes. Is there evidence that brand X batteries last at least
30 minutes longer than brand Y batteries of the same size? Use $\alpha = 0.05$ and assume the two
population variances are equal.

Solution
1. The parameters of interest are $\mu_1$ and $\mu_2$, the mean life of the batteries.
2. $H_0: \mu_1 - \mu_2 = 30$ versus $H_1: \mu_1 - \mu_2 > 30$.
3. $\alpha = 0.05$.
4. Reject $H_0$ if $t_{test} > t_{\alpha,\,n_1+n_2-2} = t_{0.05,\,18} = 1.734$.
5. The pooled variance is

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(10 - 1)46^2 + (10 - 1)52^2}{10 + 10 - 2} = 2410, \qquad s_p = \sqrt{2410} = 49.0918,$$

so the test statistic is

$$t_{test} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{(328 - 472) - 30}{49.0918\sqrt{\dfrac{1}{10} + \dfrac{1}{10}}} = -7.9255.$$

6. Since $t_{test} = -7.9255 < 1.734$, we fail to reject $H_0$. We do not have evidence that
brand X batteries last at least 30 minutes longer than brand Y batteries of the same size.
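
The pooled-variance calculation of Example 5 can be sketched in code as follows. The function name `pooled_t` is illustrative, and the critical value 1.734 is taken from the $t$ table as in the example.

```python
import math


def pooled_t(x1_bar, s1, n1, x2_bar, s2, n2, delta_0=0.0):
    """Pooled two-sample t statistic; returns (s_p, t, degrees of freedom)."""
    df = n1 + n2 - 2
    sp_sq = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df  # pooled variance
    sp = math.sqrt(sp_sq)
    t = ((x1_bar - x2_bar) - delta_0) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return sp, t, df


# Brand X (sample 1) and brand Y (sample 2) from Example 5.
sp, t_test, df = pooled_t(328, 46, 10, 472, 52, 10, delta_0=30)
t_crit = 1.734            # t_{0.05, 18}, from the t table
reject = t_test > t_crit  # upper-tailed test of H1: mu1 - mu2 > 30
print(f"s_p = {sp:.4f}, t = {t_test:.4f}, df = {df}, reject H0: {reject}")
```

The statistic is large and negative, so it falls far from the upper-tail rejection region and $H_0$ is not rejected.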
___________________________________________________________________________

A problem solving test was given to two groups of 35 and 40 engineers, respectively. In the
first group the mean score was 82 with a standard deviation of 5, while in the second group
the mean score was 77 with a standard deviation of 10. Is there a significant difference
between the performances of the two groups at the 5% level of significance? Assume the two
population variances are equal.

$z_{test} = 2.6780$; reject $H_0$

___________________________________________________________________________
An experiment is done to test the strength of two types of rock climbing ropes, namely R1
and R2. A sample of 15 pieces of rope R1 has a mean strength of 200 kg and a standard
deviation of 5 kg. A sample of 10 pieces of rope R2 has a mean strength of 188 kg and a
standard deviation of 6 kg. Assume the two population variances are equal. Test whether the mean
strength of R1 is greater than that of R2 at the 1% level of significance.

$t_{test} = 5.4299$; reject $H_0$

_________________________________________________________________________________

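(ii) when $\sigma_1^2 \ne \sigma_2^2$

When we cannot assume that the unknown variances $\sigma_1^2$ and $\sigma_2^2$ are equal, there is no exact
test statistic for testing $H_0: \mu_1 - \mu_2 = \delta_0$. However, if $H_0: \mu_1 - \mu_2 = \delta_0$ is true, the
statistic

$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$

is distributed approximately as $t$ with degrees of freedom given by

$$\nu = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{(S_1^2/n_1)^2}{n_1 - 1} + \dfrac{(S_2^2/n_2)^2}{n_2 - 1}}. \qquad (4.17)$$

The procedures for testing $\mu_1 - \mu_2$ when the variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and unequal are
summarized in Table 4.8.

Table 4.8: Testing $\mu_1 - \mu_2$ when variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and unequal

Test statistic: $T = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$, with $\nu$ degrees of freedom given by Equation (4.17)

| Case | Null hypothesis | Alternative hypothesis | Rejection region |
|------|-----------------|------------------------|------------------|
| 1 | $H_0: \mu_1 - \mu_2 = \delta_0$ | $H_1: \mu_1 - \mu_2 \ne \delta_0$ | $t_{test} > t_{\alpha/2,\,\nu}$ or $t_{test} < -t_{\alpha/2,\,\nu}$ |
| 2 | $H_0: \mu_1 - \mu_2 = \delta_0$ | $H_1: \mu_1 - \mu_2 > \delta_0$ | $t_{test} > t_{\alpha,\,\nu}$ |
| 3 | $H_0: \mu_1 - \mu_2 = \delta_0$ | $H_1: \mu_1 - \mu_2 < \delta_0$ | $t_{test} < -t_{\alpha,\,\nu}$ |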

Example 6
A scientist wants to determine how two catalysts will affect the mean yield of a chemical
process. Two normally distributed independent random samples of $n_1 = 12$ for catalyst C1
and $n_2 = 10$ for catalyst C2 are selected. The sample mean yield for catalyst C1 is found to be
$\bar{x}_1 = 152.25$ and the sample standard deviation is $s_1 = 3.44$. The results for catalyst C2
are $\bar{x}_2 = 150.85$ and $s_2 = 3.72$. Is there any difference between the mean yields? Use
$\alpha = 0.01$ and assume the two population variances are unequal.

Solution
1. The parameters of interest are $\mu_1$ and $\mu_2$, the mean process yields.
2. $H_0: \mu_1 - \mu_2 = 0$ (or $H_0: \mu_1 = \mu_2$) versus $H_1: \mu_1 - \mu_2 \ne 0$ (or $H_1: \mu_1 \ne \mu_2$).
3. $\alpha = 0.01$.
4. We have $s_1 = 3.44$, $s_2 = 3.72$, $n_1 = 12$, $n_2 = 10$. The degrees of freedom of $t_{test}$ are
found from Equation (4.17) as

$$\nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} = \frac{\left(\dfrac{3.44^2}{12} + \dfrac{3.72^2}{10}\right)^2}{\dfrac{(3.44^2/12)^2}{12 - 1} + \dfrac{(3.72^2/10)^2}{10 - 1}} = 18.6489 \approx 19.$$

Therefore, we reject $H_0$ if $t_{test} > t_{\alpha/2,\,\nu} = t_{0.005,\,19} = 2.861$ or $t_{test} < -t_{\alpha/2,\,\nu} = -2.861$.
5. We have $\bar{x}_1 = 152.25$ and $\bar{x}_2 = 150.85$. Therefore, the test statistic is

$$t_{test} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(152.25 - 150.85) - 0}{\sqrt{\dfrac{3.44^2}{12} + \dfrac{3.72^2}{10}}} = 0.9094.$$

6. Since $t_{test} = 0.9094$ lies between $-2.861$ and $2.861$, we fail to reject $H_0$. We conclude that
there is no evidence of a difference between the mean yields.
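
Equation (4.17) is tedious to evaluate by hand, so a small sketch may help. The function name `welch_t` is an illustrative assumption (this unequal-variance procedure is often called Welch's test); the critical value 2.861 is taken from the $t$ table as in Example 6.

```python
import math


def welch_t(x1_bar, s1, n1, x2_bar, s2, n2, delta_0=0.0):
    """Unequal-variance t statistic and approximate df from Equation (4.17)."""
    v1 = s1**2 / n1  # estimated variance of the first sample mean
    v2 = s2**2 / n2  # estimated variance of the second sample mean
    t = ((x1_bar - x2_bar) - delta_0) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# Catalyst C1 (sample 1) and catalyst C2 (sample 2) from Example 6.
t_test, df = welch_t(152.25, 3.44, 12, 150.85, 3.72, 10)
t_crit = 2.861                 # t_{0.005, 19}, from the t table
reject = abs(t_test) > t_crit  # two-sided rejection rule
print(f"t = {t_test:.4f}, df = {df:.4f}, reject H0: {reject}")
```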
___________________________________________________________________________

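## 4.6 Test of Hypothesis for the Difference between the Proportions

Suppose that two independent random samples of sizes $n_1$ and $n_2$ are taken from two large
populations and that $X_1$ and $X_2$ represent the observed numbers of successes in
$n_1$ and $n_2$ trials, respectively. Then $\hat{P}_1 = X_1/n_1$ and
$\hat{P}_2 = X_2/n_2$ are the observed proportions of successes, which estimate the population proportions $\pi_1$ and $\pi_2$,
respectively. Furthermore, we know that the sampling distribution of $\hat{P}_1$ is approximately
normal with mean $\pi_1$ and variance $\pi_1(1 - \pi_1)/n_1$, if $n_1$ is relatively large and $\pi_1$ is not too
close to either 0 or 1. As a rule of thumb, both $n_1\pi_1$ and $n_1(1 - \pi_1)$ must be greater than or
equal to 5 to make use of the normal approximation to the binomial distribution. The same
applies to $\hat{P}_2$.
To test the hypotheses

$$H_0: \pi_1 = \pi_2 \quad \text{versus} \quad H_1: \pi_1 \ne \pi_2 \qquad (4.18)$$

of two binomial populations we use the statistic

$$Z = \frac{(\hat{P}_1 - \hat{P}_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\hat{P}_1(1 - \hat{P}_1)}{n_1} + \dfrac{\hat{P}_2(1 - \hat{P}_2)}{n_2}}}.$$

When $H_0$ is true, we can substitute $\pi_1 = \pi_2 = \pi$ in the preceding formula for $Z$ and estimate the common proportion $\pi$ from both samples together, to give the
form

$$Z = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\hat{P}(1 - \hat{P})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},$$

where

$$\hat{P} = \frac{X_1 + X_2}{n_1 + n_2}$$

is a pooled estimate of $\pi$. The statistic $Z$ is distributed approximately $N(0, 1)$.

The procedures for testing $\pi_1 - \pi_2$ are summarized in Table 4.9.

Table 4.9: Testing $\pi_1 - \pi_2$

Test statistic: $Z = \dfrac{\hat{P}_1 - \hat{P}_2}{\sqrt{\hat{P}(1 - \hat{P})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$

| Case | Null hypothesis | Alternative hypothesis | Rejection region |
|------|-----------------|------------------------|------------------|
| 1 | $H_0: \pi_1 - \pi_2 = 0$ | $H_1: \pi_1 - \pi_2 \ne 0$ | $z_{test} > z_{\alpha/2}$ or $z_{test} < -z_{\alpha/2}$ |
| 2 | $H_0: \pi_1 - \pi_2 = 0$ | $H_1: \pi_1 - \pi_2 > 0$ | $z_{test} > z_{\alpha}$ |
| 3 | $H_0: \pi_1 - \pi_2 = 0$ | $H_1: \pi_1 - \pi_2 < 0$ | $z_{test} < -z_{\alpha}$ |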

__________________________________________________________________________________

Example 7
A usual medication was given to a random sample of 180 patients from district A who have
high fever. A new medication was given to a random sample of 200 patients from district B
who also have high fever. If 144 and 180 patients recover from the fever, respectively, does the
new medication cure the fever better? Use $\alpha = 0.05$.
Solution
1. The parameters of interest are $\pi_1$ and $\pi_2$, the proportions of patients who recover with the
usual medication and the new medication, respectively.
2. $H_0: \pi_1 = \pi_2$ versus $H_1: \pi_1 < \pi_2$.
3. $\alpha = 0.05$.
4. We reject $H_0$ if $z_{test} < -z_{\alpha} = -z_{0.05} = -1.6449$. Refer to Table 6 of Lee (2004).
5. We have

$$\hat{p}_1 = \frac{144}{180} = 0.80, \qquad \hat{p}_2 = \frac{180}{200} = 0.90, \qquad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{144 + 180}{180 + 200} = 0.8526,$$

$$z_{test} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.80 - 0.90}{\sqrt{(0.8526)(0.1474)\left(\dfrac{1}{180} + \dfrac{1}{200}\right)}} = -2.7456.$$

6. Since $z_{test} = -2.7456$ is less than $-1.6449$, we reject $H_0: \pi_1 = \pi_2$ at $\alpha = 0.05$. Therefore,
there is strong evidence to indicate that the new medication cures the fever better.
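
As a check on Example 7, the pooled two-proportion statistic can be sketched as below. The function name `two_proportion_z` is an assumption made for illustration, and the critical value is the tabulated $z_{0.05} = 1.6449$.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: pi_1 = pi_2; returns (p1, p2, pooled p, z)."""
    p1 = x1 / n1
    p2 = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # pooled estimate of the common proportion
    standard_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return p1, p2, p_pooled, (p1 - p2) / standard_error


# Usual medication (sample 1) versus new medication (sample 2).
p1, p2, p_pooled, z_test = two_proportion_z(144, 180, 180, 200)
z_crit = 1.6449             # z_{0.05}, from the normal table
reject = z_test < -z_crit   # lower-tailed test of H1: pi_1 < pi_2
print(f"pooled p = {p_pooled:.4f}, z = {z_test:.4f}, reject H0: {reject}")
```

Without intermediate rounding of the pooled proportion, the statistic differs from the printed $-2.7456$ only in the fourth decimal place.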
___________________________________________________________________________

A random sample of 150 students of UTM found that 102 were in favor of a new grading
system, while another sample of 180 students of UKM found that 108 were in favor of the
new system. Do the results indicate a significant difference in the proportion of UTM and
UKM students who favor the new grading system? Use $\alpha = 0.01$.

$z_{test} = 1.5043$; fail to reject $H_0$

___________________________________________________________________________
A geneticist is interested in the proportion of males and females in a population that have a
certain minor blood disorder. He did a survey by taking a random sample of 100 males and
100 females. 31 of the males are found to be afflicted, whereas only 24 of the females appear
to have the disorder. Can we conclude that the proportion of men in the population afflicted
with this blood disorder is significantly greater than the proportion of women afflicted? Use
level of significance $\alpha = 0.01$.

$z_{test} = 1.1085$; fail to reject $H_0$

___________________________________________________________________________

4.7

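Let $X_{11}, X_{12}, \ldots, X_{1n_1}$ be a random sample from a normal population with mean $\mu_1$ and
variance $\sigma_1^2$, and let $X_{21}, X_{22}, \ldots, X_{2n_2}$ be a random sample from a normal population with
mean $\mu_2$ and variance $\sigma_2^2$. Assume that both populations are independent. Let $S_1^2$ and $S_2^2$
be the sample variances. Then the ratio

$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \qquad (4.19)$$

has an $F$ distribution with $n_1 - 1$ numerator degrees of freedom and $n_2 - 1$ denominator
degrees of freedom. Under $H_0: \sigma_1^2 = \sigma_2^2$, the ratio reduces to

$$F = \frac{S_1^2}{S_2^2}.$$

Table 4.10 summarizes the critical regions needed for each of the possible alternative
hypotheses.

Table 4.10: Testing of ratio of two variances

Test statistic: $F = \dfrac{S_1^2}{S_2^2}$, with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ degrees of freedom

| Case | Null hypothesis | Alternative hypothesis | Rejection region |
|------|-----------------|------------------------|------------------|
| 1 | $H_0: \sigma_1^2 = \sigma_2^2$ | $H_1: \sigma_1^2 \ne \sigma_2^2$ | $F_{test} > f_{\alpha/2,\,\nu_1,\,\nu_2}$ or $F_{test} < f_{1-\alpha/2,\,\nu_1,\,\nu_2}$ |
| 2 | $H_0: \sigma_1^2 = \sigma_2^2$ | $H_1: \sigma_1^2 > \sigma_2^2$ | $F_{test} > f_{\alpha,\,\nu_1,\,\nu_2}$ |
| 3 | $H_0: \sigma_1^2 = \sigma_2^2$ | $H_1: \sigma_1^2 < \sigma_2^2$ | $F_{test} < f_{1-\alpha,\,\nu_1,\,\nu_2}$ |

Table 9 in Lee (2004) contains only upper-tail percentage points of the $F$ distribution. If we
need the lower-tail percentage points $f_{1-\alpha,\,\nu_1,\,\nu_2}$, we can obtain them from

$$f_{1-\alpha,\,\nu_1,\,\nu_2} = \frac{1}{f_{\alpha,\,\nu_2,\,\nu_1}}. \qquad (4.20)$$

For example, the lower-tail percentage point $f_{0.999,\,6,\,12}$ is

$$f_{0.999,\,6,\,12} = \frac{1}{f_{0.001,\,12,\,6}} = \frac{1}{17.99} = 0.0556.$$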

___________________________________________________________________________

Example 8
Company A and company B can supply chemical material. The mean concentration for both
companies is the same, but we suspect that the variability in concentration may differ
between the two companies. The variance of concentration in a random sample of $n_1 = 8$ from
company A is $s_1^2 = 12.4$ (grams per liter)$^2$, while for company B, a random sample of
$n_2 = 10$ yields $s_2^2 = 13.8$ (grams per liter)$^2$. Is there sufficient evidence to conclude that the
two population variances differ? We assume that concentration is a normal random variable
for both companies. Use $\alpha = 0.02$.
Solution
The solution using the outline in Section 4.1 is as follows:
1. The parameters of interest are the variances of chemical concentration, $\sigma_1^2$ and $\sigma_2^2$.
2. The null hypothesis and alternative hypothesis are
$H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \ne \sigma_2^2$.
3. $\alpha = 0.02$.
4. Reject $H_0$ if

$$f_{test} < f_{1-\alpha/2,\,\nu_1,\,\nu_2} = f_{1-0.02/2,\,8-1,\,10-1} = f_{0.99,\,7,\,9} = \frac{1}{f_{0.01,\,9,\,7}} = \frac{1}{6.72} = 0.1488$$

or if

$$f_{test} > f_{\alpha/2,\,\nu_1,\,\nu_2} = f_{0.02/2,\,8-1,\,10-1} = f_{0.01,\,7,\,9} = 5.61.$$

Refer to Table 9 of Lee (2004).
5. The test statistic is

$$f_{test} = \frac{s_1^2}{s_2^2} = \frac{12.4}{13.8} = 0.8986.$$

6. Since the value $f_{test} = 0.8986$ is between 0.1488 and 5.61, we are unable to reject
$H_0: \sigma_1^2 = \sigma_2^2$ at the 0.02 level of significance. Therefore, there is no strong evidence to
indicate that the two population variances differ.
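
The decision rule of Example 8, including the lower-tail conversion of Equation (4.20), can be sketched as follows. The helper names are hypothetical, and both tabulated upper-tail points (5.61 and 6.72) are copied from the example rather than computed.

```python
def lower_tail_f(upper_point_swapped_df):
    """Equation (4.20): f_{1-a, v1, v2} = 1 / f_{a, v2, v1}."""
    return 1.0 / upper_point_swapped_df


def variance_ratio(s1_sq, s2_sq):
    """Test statistic F = s1^2 / s2^2 under H0: sigma1^2 = sigma2^2."""
    return s1_sq / s2_sq


f_upper = 5.61               # f_{0.01, 7, 9}, from the F table
f_lower = lower_tail_f(6.72) # 1 / f_{0.01, 9, 7}
f_test = variance_ratio(12.4, 13.8)

# Two-sided test: reject if the ratio falls in either tail.
reject = f_test > f_upper or f_test < f_lower
print(f"f_lower = {f_lower:.4f}, f_test = {f_test:.4f}, reject H0: {reject}")
```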

__________________________________________________________________________
Two types of equipment for measuring the amount of carbon monoxide in the atmosphere
are being compared in an air-pollution experiment. It is desired to determine whether the two
types of equipment yield measurements having the same variability. A random sample of 10
from equipment E1 has a sample standard deviation of 0.10. A random sample of 16 from
equipment E2 has a sample standard deviation of 0.09. Assume the populations of
measurements to be approximately normally distributed. Test the hypothesis that $\sigma_{E1}^2 = \sigma_{E2}^2$
against the alternative that $\sigma_{E1}^2 \ne \sigma_{E2}^2$. Use $\alpha = 0.05$.

$f_{test} = 1.2346$; fail to reject $H_0$

___________________________________________________________________________
The following data represents the times taken by two machines in producing an electrical
part:
Machine
Time (in milliseconds)
_______________________________________________
1

108

86

98

109

92

81

165

97

134

87

114

_______________________________________________
Assuming that the distributions of the times are approximately normal, can we conclude that
there is a significant difference in variability of the times in producing an electrical part by
machine 1 and machine 2 at $\alpha = 0.05$?
___________________________________________________________________________

EXERCISE 4
1. Test the hypothesis that the random sample
30.4 31.2 30.8 29.9 30.4 30.7 29.9 30.1
came from a normal population with mean 30.5. The standard deviation of the measurements
is known to be 0.1. Use $\alpha = 0.05$.

2. A sample of size 60 yielded the values $\bar{x} = 46.7$ and $s^2 = 41.5$. Test the hypothesis that
$\mu = 45$ against the alternative that it is greater. Use $\alpha = 0.05$.

3. Repeat question (1) without assuming that the standard deviation is known to be 0.1. In
other words, estimate the population variance from the sample measurements. Use $\alpha = 0.05$.

4. A manufacturer claims that the standard mean volume per bottle of shampoo is 250
milliliters. Ten random samples are taken from a batch and the volume per bottle is measured.
The ten measurements have a sample mean of 243 milliliters and a standard deviation of 7
milliliters. Assume approximate normality of the data. Is this sample mean significantly below the
claimed value? Use $\alpha = 0.01$.

5. The standard deviation of the breaking strengths of certain cables produced by a company
is given as 240 kg. After a change was introduced in the process of manufacturing these
cables, the breaking strengths of a sample of 8 cables showed a standard deviation of 300 kg.
Investigate the significance of the apparent increase in variability. Use $\alpha = 0.01$.

6. A semiconductor company claimed that at least 99% of the electronic components which
they export are without defect. A sample of 150 electronic components was tested, and 12 were
defective. Can we accept the company's claim at a 0.01 level of significance?

7. An opinion survey in district D1 found that 68% of people considered electricity tariffs to
be too high. A random sample of 35 people in district D2 were asked the same question, and 21
thought electricity tariffs to be too high. Is this proportion significantly different from that of
district D1? Use $\alpha = 0.05$.

8. ... observations has $\bar{x}_1 = 35$ and $s_1 = 7$. Is there a significant difference between the two
sample means at the 0.01 level of significance? Assume that the two populations have equal
variances.

9. Random samples of 200 screws manufactured by machine A and 100 screws manufactured
by machine B showed 19 and 5 defective screws, respectively. Test the hypothesis that
(a) Machine B is performing better than machine A.
(b) The two machines are showing different qualities of performance. Use $\alpha = 0.05$.

10. A vote is to be taken to determine whether a new housing area should be constructed. The
housing area is near a county site and also a short distance from a town. To determine if
there is a significant difference in the proportion of county voters and town voters favoring
the proposal, a poll is taken. In random samples, 93 of 150 county voters favor the proposal
and 387 of 450 town voters also favor the proposal. Can we conclude that the proportion
of county voters favoring the proposal is lower than the proportion of town voters? Use
$\alpha = 0.05$.

11. A sample of males and a sample of females were polled on an issue. 120 of 250 males and
126 of 300 females voted yes on the issue. Can we conclude that more males than females favor
the issue? Use $\alpha = 0.02$.

12. Repeat exercise 11 but using $\alpha = 0.10$.

13. Two types of soil namely S1 and S2 at certain district solutions were tested for their
gamma radiation dose. A random sample of 6 measurements of S1 showed a mean of 7.52
with a standard deviation of 0.024. A random sample of 5 measurements of S2 showed a
mean of 7.49 with a standard deviation of 0.032. Assume both population variances are equal.
(a) Determine whether the two types of soil have different gamma radiation doses. Use
$\alpha = 0.05$.
(b) Determine whether the two types of soil differ in the variability of
gamma radiation doses. Use $\alpha = 0.01$.

Chapter 5
Chi-Square Tests

Learning Objectives:
At the end of this chapter, students should be able to
(a) apply the goodness-of-fit test.
(b) summarize data in a contingency table.
(c) apply the independence test.
(d) apply the homogeneity test.

## 5.1 Introduction

We have seen in previous chapters that some random variables follow certain distributions
such as the binomial, Poisson and normal distributions. We either make an assumption about the
distribution, or we know that the random variables follow specific distributions.
In the next section of this chapter we introduce a method to test such an assumption, known as the
goodness-of-fit test, which requires the data to be presented as a frequency distribution. In this
chapter, we will also discuss two methods of data analysis in which a data set is presented in
a contingency table. The two analyses are the independence test and the homogeneity test,
discussed in Sections 5.3 and 5.4 respectively.

## 5.2 Goodness-of-fit Test

Consider the result obtained from an experiment of tossing a die 300 times, as shown in Table
5.1 below:

Table 5.1: Frequency distribution

| Outcome   | 1  | 2  | 3  | 4  | 5  | 6  |
|-----------|----|----|----|----|----|----|
| Frequency | 45 | 52 | 60 | 58 | 44 | 41 |
There are six possible outcomes for each trial, i.e. obtaining the number 1, 2, 3, 4, 5 or 6. These
outcomes are also referred to as categories. The question we would like to answer is whether
the die is a fair die. The results of the experiment are the evidence for concluding whether
the die is fair or otherwise. We know that a fair die has the following characteristic:

$$P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}.$$

If $X$ is a random variable representing the outcome obtained for each trial, then $X$ follows the
uniform distribution with $P(X = x) = \frac{1}{6}$ for $x = 1, 2, 3, 4, 5, 6$. The objective is to test the
hypothesis that the die is a fair die, which can be stated as below:

$$H_0: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}$$

$$H_1: P(X = i) \ne P(X = j) \text{ for at least one pair } i, j = 1, 2, 3, 4, 5, 6;\ i \ne j$$

The statement in $H_0$ is equivalent to the die being a fair die and the statement in $H_1$ is
equivalent to the die not being a fair die. If the die is fair, the expected frequency
for the outcome $x_i$ or category $i$ is

$$E_i = n \times P(X = i) \quad \text{for } i = 1, 2, 3, 4, 5, 6,$$

where

$$E_1 = E_2 = E_3 = E_4 = E_5 = E_6 = 300 \times \frac{1}{6} = 50.$$

However, the observed frequencies obtained from the experiment are

$$O_1 = 45, \quad O_2 = 52, \quad O_3 = 60, \quad O_4 = 58, \quad O_5 = 44, \quad O_6 = 41,$$

which differ from the expected frequencies for a fair die.
The logic is that if the die is fair, the difference between the observed and the
expected frequencies, $O_i - E_i$, is either zero or a small number. The difference between the
observed and the expected frequencies forms the statistic to test the hypothesis regarding the
probability distribution of the random variable. The statistic is stated in the following theorem.

Theorem 4 The statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

follows the chi-square distribution with $(k - p - 1)$ degrees of freedom,
where $k$ is the number of categories and $p$ is the number of unknown parameters that need to be
estimated from the data. If there is no unknown parameter, then $p = 0$ and the degrees of freedom are
$k - 1$.

Note: This theorem is applicable if the least expected value $E_i$ is at least 5, i.e. $E_i \ge 5$
for all $i$.

This test is a one-tailed test where $H_0$ is rejected if the calculated statistic satisfies

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} > \chi^2_{\alpha,\,k-p-1}$$

at significance level $\alpha$.

Now we show the procedure to calculate the statistic $\chi^2$. Since the statistic $\chi^2$ is calculated
from the observed sample, we use the same convention as in the previous chapter and denote
the calculated statistic by $\chi^2_{test}$.

| $O_i$ | $E_i = n \times P(i)$ | $(O_i - E_i)^2 / E_i$ |
|-------|-----------------------|------------------------|
| $O_1 = 45$ | $E_1 = 300 \times \frac{1}{6} = 50$ | $(45 - 50)^2/50 = 0.50$ |
| $O_2 = 52$ | $E_2 = 300 \times \frac{1}{6} = 50$ | $(52 - 50)^2/50 = 0.08$ |
| $O_3 = 60$ | $E_3 = 300 \times \frac{1}{6} = 50$ | $(60 - 50)^2/50 = 2.00$ |
| $O_4 = 58$ | $E_4 = 300 \times \frac{1}{6} = 50$ | $(58 - 50)^2/50 = 1.28$ |
| $O_5 = 44$ | $E_5 = 300 \times \frac{1}{6} = 50$ | $(44 - 50)^2/50 = 0.72$ |
| $O_6 = 41$ | $E_6 = 300 \times \frac{1}{6} = 50$ | $(41 - 50)^2/50 = 1.62$ |

So

$$\chi^2_{test} = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} = 0.50 + 0.08 + 2.00 + 1.28 + 0.72 + 1.62 = 6.20,$$

and we accept $H_0$ if $\chi^2_{test} < \chi^2_{0.05,\,6-1} = 11.070$. Note that $\nu = k - 1$ since unknown parameters are absent.

Since $\chi^2_{test} = 6.20 < 11.070$, we accept $H_0$ and conclude that there is no evidence that the
die is not a fair die.
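
The goodness-of-fit calculation for the die experiment is easy to reproduce. The sketch below assumes nothing beyond the observed frequencies in Table 5.1 and the tabulated critical value 11.070; the function name `chi_square_statistic` is illustrative.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [45, 52, 60, 58, 44, 41]      # frequencies of outcomes 1..6
n = sum(observed)                         # 300 tosses
expected = [n * (1 / 6)] * len(observed)  # fair die: E_i = n * 1/6 = 50

chi2_test = chi_square_statistic(observed, expected)
chi2_crit = 11.070                # chi-square_{0.05, 5}, from the table
reject = chi2_test > chi2_crit    # one-tailed (upper) rejection rule
print(f"chi2_test = {chi2_test:.2f}, reject H0: {reject}")
```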

The test we have seen above is called the goodness-of-fit test. In general, we would
observe the following table, where $O_i$ represents the observed frequency for category $i$, for
$i = 1, 2, \ldots, k$, and $n = O_1 + O_2 + \cdots + O_k$.

| Category  | 1     | 2     | ... | k     |
|-----------|-------|-------|-----|-------|
| Frequency | $O_1$ | $O_2$ | ... | $O_k$ |

The belief about the probability of category $i$ occurring, $P(i)$, is stated in the null
hypothesis $H_0$ as

$$H_0: P(i) = \pi_i \quad \text{for } i = 1, 2, \ldots, k.$$

Assuming $H_0$ is correct, the expected frequency for each category $i$, $E_i$, is calculated by
$E_i = n \times P(i)$, and with the help of the theorem above, we can test the hypothesis stated in $H_0$.
___________________________________________________________________________

Example 1

The authority claims that the proportions of road accidents occurring in this country
according to the categories User Attitude (A), Mechanical Fault (M), Insufficient Sign Board
(I) and Fate (F) are 60%, 20%, 15% and 5% respectively. A study by an independent body
shows the following data:

| Category  | A   | M  | I  | F | Total |
|-----------|-----|----|----|---|-------|
| Frequency | 130 | 35 | 30 | 5 | 200   |

Can we accept the claim at significance level $\alpha = 0.05$?

Solution
$n = 200$
$H_0$: $P(A) = 0.6$, $P(M) = 0.2$, $P(I) = 0.15$, $P(F) = 0.05$
$H_1$: At least one $P(i)$ differs, for $i$ = A, M, I and F.

| $O_i$ | $E_i = n \times P(i)$ | $(O_i - E_i)^2 / E_i$ |
|-------|-----------------------|------------------------|
| $O_A = 130$ | $E_A = 0.6 \times 200 = 120$ | $(130 - 120)^2/120 = 0.833$ |
| $O_M = 35$ | $E_M = 0.2 \times 200 = 40$ | $(35 - 40)^2/40 = 0.625$ |
| $O_I = 30$ | $E_I = 0.15 \times 200 = 30$ | $(30 - 30)^2/30 = 0.000$ |
| $O_F = 5$ | $E_F = 0.05 \times 200 = 10$ | $(5 - 10)^2/10 = 2.500$ |

Since $E_i \ge 5$ for $i$ = A, M, I and F, then $k = 4$. Furthermore, $p = 0$, therefore
$\nu = 4 - 1 = 3$.

$$\chi^2_{test} = 0.833 + 0.625 + 0.000 + 2.500 = 3.958.$$

At $\alpha = 0.05$, reject $H_0$ if $\chi^2_{test} > \chi^2_{0.05,\,3} = 7.815$. Since $3.958 < 7.815$, we accept $H_0$ and conclude that we
have no evidence to reject the claim.

___________________________________________________________________________
Example 2
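The number of students playing truant in a school over 200 school days is shown below:

| No. of truants | 0  | 1  | 2  | 3  | 4  | $\ge 5$ |
|----------------|----|----|----|----|----|---------|
| No. of days    | 12 | 32 | 45 | 50 | 35 | 26      |

If $X$ is a random variable representing the number of students playing truant per day, test
the hypothesis that $X$ follows the Poisson distribution with mean 3 per day at $\alpha = 0.01$.
Solution
$n = 12 + 32 + 45 + 50 + 35 + 26 = 200$, $k = 6$

For $X \sim P_0(3)$:

$P(X = 0) = 0.0498$, $P(X = 1) = 0.1493$, $P(X = 2) = 0.2241$,
$P(X = 3) = 0.2240$, $P(X = 4) = 0.1681$, $P(X \ge 5) = 0.1847$

| $O_i$ | $E_i = n \times P(X = i)$ | $(O_i - E_i)^2 / E_i$ |
|-------|---------------------------|------------------------|
| $O_0 = 12$ | $E_0 = 200 \times 0.0498 = 9.96$ | $(12 - 9.96)^2/9.96 = 0.42$ |
| $O_1 = 32$ | $E_1 = 200 \times 0.1493 = 29.86$ | $(32 - 29.86)^2/29.86 = 0.15$ |
| $O_2 = 45$ | $E_2 = 200 \times 0.2241 = 44.82$ | $(45 - 44.82)^2/44.82 = 0.00$ |
| $O_3 = 50$ | $E_3 = 200 \times 0.2240 = 44.80$ | $(50 - 44.80)^2/44.80 = 0.60$ |
| $O_4 = 35$ | $E_4 = 200 \times 0.1681 = 33.62$ | $(35 - 33.62)^2/33.62 = 0.06$ |
| $O_5 = 26$ | $E_5 = 200 \times 0.1847 = 36.94$ | $(26 - 36.94)^2/36.94 = 3.24$ |

Since $E_i \ge 5$ for $i = 0, 1, 2, 3, 4, 5$, then $k = 6$. Now, $\nu = 6 - 1 = 5$ since $p = 0$.

$$\chi^2_{test} = 0.42 + 0.15 + 0.00 + 0.60 + 0.06 + 3.24 = 4.47$$

At $\alpha = 0.01$, reject $H_0$ if $\chi^2_{test} > \chi^2_{0.01,\,5} = 15.086$. Thus, $H_0$ is accepted
and we conclude that there is no evidence that the number of students playing truant
per day does not follow the Poisson distribution with mean 3 per day.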
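
The Poisson probabilities used in Example 2 need not be read from tables. The following sketch computes them directly, treating the last category as "5 or more" so that the probabilities sum to one; the helper name `poisson_pmf` is an illustrative assumption.

```python
import math


def poisson_pmf(x, mean):
    """P(X = x) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean**x / math.factorial(x)


observed = [12, 32, 45, 50, 35, 26]  # days with 0, 1, 2, 3, 4 and >= 5 truants
n = sum(observed)
mean = 3.0                           # mean specified by H0, so p = 0

# Probabilities for 0..4, with the last category taken as the complement.
probs = [poisson_pmf(x, mean) for x in range(5)]
probs.append(1.0 - sum(probs))       # P(X >= 5)

expected = [n * p for p in probs]
chi2_test = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
chi2_crit = 15.086                   # chi-square_{0.01, 5}, from the table
reject = chi2_test > chi2_crit
print(f"P(X >= 5) = {probs[-1]:.4f}, chi2_test = {chi2_test:.2f}, reject H0: {reject}")
```

If the mean had to be estimated from the data instead, one parameter would be estimated ($p = 1$) and the degrees of freedom would drop to $k - p - 1 = 4$.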
___________________________________________________________________________

Example 3
It is believed that the IQ scores of all adults follow the Normal distribution with mean 110 and standard deviation 10. The scores of an IQ test given to 250 adults are summarized below, where $X$ represents the IQ score.

| IQ Score | Frequency |
|---|---|
| $X < 90$ | 2 |
| $90 \le X < 100$ | 30 |
| $100 \le X < 110$ | 85 |
| $110 \le X < 120$ | 90 |
| $120 \le X < 130$ | 40 |
| $X \ge 130$ | 3 |
| Total | 250 |

Test the above belief at $\alpha = 0.05$.

Solution
Let $X$ represent the IQ scores.

$H_0$: $X \sim N(110, 10^2)$
$H_1$: $X$ does not follow $N(110, 10^2)$

Assuming $H_0$ is correct, $Z = \dfrac{X - 110}{10}$.

| $i$ | IQ Score | $P$ | $O_i$ | $E_i = nP(X_i)$ | $(O_i - E_i)^2 / E_i$ |
|---|---|---|---|---|---|
| 1 | $X < 90$ | $P(Z < -2) = 0.0228$ | 2 | 5.70 | $(2 - 5.70)^2 / 5.70 = 2.40$ |
| 2 | $90 \le X < 100$ | $P(-2 \le Z < -1) = 0.1359$ | 30 | 33.98 | $(30 - 33.98)^2 / 33.98 = 0.47$ |
| 3 | $100 \le X < 110$ | $P(-1 \le Z < 0) = 0.3413$ | 85 | 85.33 | $(85 - 85.33)^2 / 85.33 = 0.00$ |
| 4 | $110 \le X < 120$ | $P(0 \le Z < 1) = 0.3413$ | 90 | 85.33 | $(90 - 85.33)^2 / 85.33 = 0.26$ |
| 5 | $120 \le X < 130$ | $P(1 \le Z < 2) = 0.1359$ | 40 | 33.98 | $(40 - 33.98)^2 / 33.98 = 1.07$ |
| 6 | $X \ge 130$ | $P(Z \ge 2) = 0.0228$ | 3 | 5.70 | $(3 - 5.70)^2 / 5.70 = 1.28$ |

Since $E_i \ge 5$ for $i = 1, 2, 3, 4, 5, 6$, then $k = 6$. Now, $\nu = 6 - 1 = 5$ since $p = 0$.

$$\chi^2_{test} = 2.40 + 0.47 + 0.00 + 0.26 + 1.07 + 1.28 = 5.48$$

At $\alpha = 0.05$, reject $H_0$ if $\chi^2_{test} > \chi^2_{0.05,5} = 11.070$. Since $5.48 < 11.070$, we fail to reject $H_0$ and conclude that there is no evidence that the IQ scores do not follow the normal distribution with mean 110 and standard deviation 10.
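For a continuous distribution, the class probabilities come from differences of the cumulative distribution function at the class boundaries. The sketch below illustrates this for the IQ example using only the standard library; `normal_cdf` is our own helper built on `math.erf`, and since it does not round the probabilities to four decimals its statistic can differ slightly from the hand calculation.

```python
import math


def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), using the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


boundaries = [90, 100, 110, 120, 130]  # class limits of the IQ table
observed = [2, 30, 85, 90, 40, 3]
mu, sigma = 110, 10
n = sum(observed)

# Cumulative probabilities at -infinity, each boundary, and +infinity
cdf = [0.0] + [normal_cdf(b, mu, sigma) for b in boundaries] + [1.0]
probs = [upper - lower for lower, upper in zip(cdf, cdf[1:])]

expected = [n * p for p in probs]
chi2_test = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print([round(p, 4) for p in probs])
print([round(e, 2) for e in expected], round(chi2_test, 2))
```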
___________________________________________________________________________
Task 1
It is believed that the number of scratches on a compact disk produced by a process follows the Poisson distribution with mean 2.5 scratches per disk. The following data show the number of disks with the corresponding number of scratches on them:

| Number of scratches | 0 | 1 | 2 | 3 | 4 | $\ge 5$ |
|---|---|---|---|---|---|---|
| Number of disks | 5 | 22 | 30 | 20 | 15 | 8 |

Test the belief at significance level $\alpha = 0.01$.

[$k = 6$, then $\nu = 5$; $\chi^2_{test} = 3.1523 < 15.086$; fail to reject $H_0$]

Task 2
Repeat the question in Task 1 above, but without knowing the true mean value. What differences may you encounter?

[$k = 6$, $p = 1$, then $\nu = 4$; $\chi^2_{test} = 3.1869 < 13.277$; fail to reject $H_0$]

___________________________________________________________________________

## 5.3 Independence Test

| Student | Bespectacled | Result |
|---|---|---|
| A | Yes | Excellent |
| B | No | Excellent |
| C | Yes | Good |
| D | Yes | Excellent |
| E | No | Good |
| F | No | Good |
| G | Yes | Excellent |

| Bespectacled | Maths Results: Good | Maths Results: Excellent |
|---|---|---|
| Yes | 1 | 3 |
| No | 2 | 1 |

We have a two-dimensional $2 \times 2$ contingency table, read as "two by two contingency table". The first number 2 means there are two rows for the row variable "Bespectacled" with categories Yes and No. The second number 2 means there are two columns for the column variable "Maths Results" with two categories, Good and Excellent. The row and column variables are both nominal data. Each of the four boxes in the contingency table is called a cell. The number in each cell is the frequency of students having both the corresponding row and column categories, or simply the observed frequency.

Usually, the question we have in mind when dealing with data in a contingency table is whether the two variables are independent. Independence means the two variables do not influence each other. Thus, in the example above, we want to test whether being bespectacled influences the students' Maths results. This test is called the independence test, and it capitalizes on the definition of independent events in probability: two events $A$ and $B$ are independent if and only if
$$P(A \cap B) = P(A)P(B).$$
To understand this test further, we introduce the two-dimensional contingency table in its general form.
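In general, a two-dimensional contingency table is of the form below.

| Row Variable | Category $B_1$ | Category $B_2$ | ... | Category $B_c$ |
|---|---|---|---|---|
| Category $A_1$ | $O_{11}$ | $O_{12}$ | ... | $O_{1c}$ |
| Category $A_2$ | $O_{21}$ | $O_{22}$ | ... | $O_{2c}$ |
| ... | ... | ... | ... | ... |
| Category $A_r$ | $O_{r1}$ | $O_{r2}$ | ... | $O_{rc}$ |

The above contingency table is an $r \times c$ contingency table, where $r$ denotes the number of categories of the row variable, $c$ denotes the number of categories of the column variable and $O_{ij}$ is the observed frequency in cell $(i, j)$, i.e. the observed frequency for the $i$th category of the row variable and the $j$th category of the column variable. Writing $n_{i\cdot}$ for the total of row $i$, $n_{\cdot j}$ for the total of column $j$ and $n$ for the overall total, and showing the joint event that each cell represents, the table becomes

| Row Variable | Category $B_1$ | Category $B_2$ | ... | Category $B_c$ | Total |
|---|---|---|---|---|---|
| Category $A_1$ | $O_{11}$ $(A_1 \cap B_1)$ | $O_{12}$ | ... | $O_{1c}$ $(A_1 \cap B_c)$ | $n_{1\cdot}$ |
| Category $A_2$ | $O_{21}$ $(A_2 \cap B_1)$ | $O_{22}$ | ... | $O_{2c}$ $(A_2 \cap B_c)$ | $n_{2\cdot}$ |
| ... | ... | ... | ... | ... | ... |
| Category $A_r$ | $O_{r1}$ $(A_r \cap B_1)$ | $O_{r2}$ | ... | $O_{rc}$ $(A_r \cap B_c)$ | $n_{r\cdot}$ |
| Total | $n_{\cdot 1}$ | $n_{\cdot 2}$ | ... | $n_{\cdot c}$ | $n$ |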

If the events $A_i$ and $B_j$ are independent then
$$P(A_i \cap B_j) = P(A_i)P(B_j).$$

Most often, we do not know the true values of $P(A_i)$ or $P(B_j)$, but we know from the estimation chapter (Chapter 3) that the best estimator for a population proportion or probability is the sample proportion. Thus
$$\hat{P}(A_i) = \frac{n_{i\cdot}}{n} \quad \text{and} \quad \hat{P}(B_j) = \frac{n_{\cdot j}}{n}.$$

Therefore the estimated probability for the joint categories is
$$\hat{P}(A_i \cap B_j) = \hat{P}(A_i)\hat{P}(B_j) = \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n}.$$

With this estimated joint probability, we can find the expected frequency in each cell, $E_{ij}$, if $A_i$ and $B_j$ are independent. The expected frequency in cell $(i, j)$ is
$$E_{ij} = n\,\hat{P}(A_i \cap B_j) = n\,\hat{P}(A_i)\hat{P}(B_j) = n \cdot \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}.$$

Now, if $A_i$ and $B_j$ are truly independent, we anticipate that $O_{ij}$ and $E_{ij}$ do not differ, or that any difference between them is not significant. The statistic $O_{ij} - E_{ij}$ forms the basis for the independence test, which is stated in Theorem 2.
Theorem 2
The statistic
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
follows the chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom, where
$O_{ij}$ = the observed frequency in cell $(i, j)$, and
$E_{ij}$ = the expected frequency in cell $(i, j)$.
The theorem can be written simply as
$$\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}.$$

This theorem is useful in testing the following hypotheses:
$H_0$: Row and column variables are independent.
$H_1$: Row and column variables are not independent.

This test is a one-tailed test on the right, where $H_0$ is rejected if the calculated $\chi^2$ value is greater than $\chi^2_{\alpha,(r-1)(c-1)}$ at significance level $\alpha$. Again, using the convention introduced earlier, the calculated $\chi^2$ value is denoted by $\chi^2_{test}$. Thus, we reject $H_0$ if
$$\chi^2_{test} > \chi^2_{\alpha,(r-1)(c-1)}.$$

Example 4
Insomnia is a disease where a person finds it hard to sleep at night. A study is conducted to
determine whether the two attributes, smoking habit and insomnia disease are dependent. The
following data set was obtained:
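| Habit | Insomnia: Yes | Insomnia: No |
|---|---|---|
| Non-smokers | 10 | 70 |
| Ex-smokers | 8 | 32 |
| Smokers | 22 | 38 |

Use a 5% significance level to conduct the study.

Solution
$H_0$: Smoking habit and insomnia are independent.
$H_1$: Smoking habit and insomnia are not independent.
$r = 3$, $c = 2$,
$n_{1\cdot} = 10 + 70 = 80$, $n_{2\cdot} = 8 + 32 = 40$, $n_{3\cdot} = 22 + 38 = 60$,
$n_{\cdot 1} = 10 + 8 + 22 = 40$, $n_{\cdot 2} = 70 + 32 + 38 = 140$,
$n = 10 + 70 + 8 + 32 + 22 + 38 = 180$.

| Cell | $O_{ij}$ | $E_{ij} = n_{i\cdot} n_{\cdot j} / n$ | $(O_{ij} - E_{ij})^2 / E_{ij}$ |
|---|---|---|---|
| (1, 1) | 10 | $80 \times 40 / 180 = 17.78$ | $(10 - 17.78)^2 / 17.78 = 3.40$ |
| (1, 2) | 70 | $80 \times 140 / 180 = 62.22$ | $(70 - 62.22)^2 / 62.22 = 0.97$ |
| (2, 1) | 8 | $40 \times 40 / 180 = 8.89$ | $(8 - 8.89)^2 / 8.89 = 0.09$ |
| (2, 2) | 32 | $40 \times 140 / 180 = 31.11$ | $(32 - 31.11)^2 / 31.11 = 0.03$ |
| (3, 1) | 22 | $60 \times 40 / 180 = 13.33$ | $(22 - 13.33)^2 / 13.33 = 5.64$ |
| (3, 2) | 38 | $60 \times 140 / 180 = 46.67$ | $(38 - 46.67)^2 / 46.67 = 1.61$ |

$$\chi^2_{test} = 3.40 + 0.97 + 0.09 + 0.03 + 5.64 + 1.61 = 11.74.$$

The critical value at the 5% significance level is $\chi^2_{0.05,(3-1)(2-1)} = \chi^2_{0.05,2} = 5.991$, and the rule is to reject $H_0$ if $\chi^2_{test} > 5.991$.

Since $11.74 > 5.991$, we reject $H_0$ and conclude that there is significant evidence at the 5% significance level that smoking habit and insomnia are not independent.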
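The expected frequencies $E_{ij} = n_{i\cdot} n_{\cdot j} / n$ and the test statistic of Theorem 2 are easy to compute for a table of any size. The following is a minimal sketch, not a library routine; the function name `chi_square_independence` is our own, and the critical value 5.991 is taken from the chi-square table.

```python
def chi_square_independence(table):
    """Return (expected frequencies, chi-square statistic, degrees of freedom)
    for an r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # E_ij = (row total i) * (column total j) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, dof


# Smoking habit (rows) versus insomnia Yes/No (columns)
insomnia = [[10, 70], [8, 32], [22, 38]]
expected, chi2_test, dof = chi_square_independence(insomnia)

critical_value = 5.991  # table value for alpha = 0.05 and 2 degrees of freedom
print([[round(e, 2) for e in row] for row in expected])
print(dof, "reject H0" if chi2_test > critical_value else "fail to reject H0")
```

Because the function works from the row and column totals only, the same call can be used for the $2 \times 3$ and $3 \times 3$ tables in the tasks that follow.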

A study is conducted to determine whether students' academic performance is independent of their active involvement in co-curricular activities. The following data set was obtained:

| Co-curricular Activities | Performance: Low | Performance: Fair | Performance: Good |
|---|---|---|---|
| Inactive | 40 | 80 | 60 |
| Active | 30 | 90 | 60 |

Use a 5% significance level to conduct the study.

[$\nu = 2$; $\chi^2_{test} = 2.0168 < 5.991$; fail to reject $H_0$]

_____________________________________________________________________
A study is conducted to determine whether the management efficiency and the specialization sector are independent. The following data set was obtained:

| Sector | Management Efficiency: Low | Management Efficiency: Fair | Management Efficiency: Good |
|---|---|---|---|
| Education | 20 | 20 | 35 |
| Health | 15 | 25 | 40 |
| Banking | 15 | 30 | 80 |

Use a 1% significance level to conduct the study.

[$\nu = 4$; $\chi^2_{test} = 9.7807 < 13.277$; fail to reject $H_0$]

___________________________________________________________________________

## 5.4 Homogeneity Test

In the independence test each subject has the possibility of belonging to any of the $r \times c$ cells. For further clarification, consider the following contingency table, which shows the frequency of students according to gender and their hand phone brands.

| Gender | Nokia | Samsung | Others | Total |
|---|---|---|---|---|
| Male | 80 | 60 | 30 | 170 |
| Female | 60 | 70 | 20 | 150 |
| Total | 140 | 130 | 50 | 320 |

If all 320 students are chosen at random regardless of their gender and hand phone brand, each student will be classified in one of the six joint categories and the test of independence is a valid test. In other words, each of the 320 students will belong to one and only one of the six cells of the contingency table. However, we may want to fix the number of male and female students in this study. For example, we may want to have 170 male students and 150 female students.
Thus a male student will belong to one of the joint categories (Male $\cap$ Nokia), (Male $\cap$ Samsung) or (Male $\cap$ Others). He can only be classified in the distribution of hand phones for the male category and not in any of the six joint categories. In other words, a male student will belong to one of the three cells of the male category. Similarly, a female student will belong to one of the three cells of the female category. Fixing the number of male and female students constrains the assignment of each subject to the relevant gender category. When we have such a constraint, we are actually comparing the distribution of hand phone brand preferences between the two genders. In this case, we fix the row totals $n_{i\cdot}$. This means we are comparing whether the preferences over Nokia, Samsung or other brands of hand phones are the same for male and female students.
Alternatively, we may prefer to fix the column totals $n_{\cdot j}$, i.e. we select 140 Nokia users, 130 Samsung users and 50 users of other brands. Each user will be classified in the relevant cell, constrained by his or her brand preference. Thus, we are actually comparing the distribution of gender between the hand phone brands.
The relevant test is called the homogeneity test, where we test the similarity of two or more populations with regard to the distribution of a certain characteristic. For the fixed number of male and female students, the hypotheses are
$H_0$: The proportions of students preferring the three hand phone brands are the same for male and female students.
$H_1$: The proportions of students preferring the three hand phone brands are not the same for male and female students.
For the fixed number of brand users, the hypotheses are
$H_0$: The proportions of male and female student users are the same for Nokia, Samsung and other brands of hand phone.
$H_1$: The proportions of male and female student users are not the same for Nokia, Samsung and other brands of hand phone.
The procedure to conduct the homogeneity test is the same as the test of independence discussed earlier.
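To illustrate that the homogeneity test uses the same computation, the sketch below first prints the brand proportions within each gender (the quantities being compared under $H_0$) and then the usual chi-square statistic for the hand phone table. The variable names are our own, and no significance level is fixed here: the statistic would be compared with the table value $\chi^2_{\alpha,2}$ for whichever $\alpha$ is chosen.

```python
phones = {"Male": [80, 60, 30], "Female": [60, 70, 20]}  # Nokia, Samsung, Others

row_totals = {gender: sum(counts) for gender, counts in phones.items()}
col_totals = [sum(col) for col in zip(*phones.values())]
n = sum(col_totals)

# Sample proportions of each brand within each (fixed-size) gender group
proportions = {
    gender: [x / row_totals[gender] for x in counts]
    for gender, counts in phones.items()
}

# Same statistic as in the independence test: E_ij = n_i. * n_.j / n
chi2_test = sum(
    (o - row_totals[gender] * c / n) ** 2 / (row_totals[gender] * c / n)
    for gender, counts in phones.items()
    for o, c in zip(counts, col_totals)
)
dof = (len(phones) - 1) * (len(col_totals) - 1)

for gender, props in proportions.items():
    print(gender, [round(p, 3) for p in props])
print(dof, round(chi2_test, 4))
```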

200 female owners and 200 male owners of Proton cars are selected at random and the colours of their cars are noted. The following data show the results:

| Gender | Car Colour: Black | Car Colour: Dull | Car Colour: Bright |
|---|---|---|---|
| Male | 40 | 110 | 50 |
| Female | 20 | 80 | 100 |

Use a 1% significance level to test whether the proportions of colour preferences are the same for male and female owners.

[$\nu = 2$; $\chi^2_{test} = 28.07 > 9.210$; reject $H_0$]

Exercise 5
1. A random sample of 200 printed boards has been collected and the following number of defects was observed:

| Number of defects | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 and more |
|---|---|---|---|---|---|---|---|---|
| Observed frequency | 10 | 40 | 54 | 45 | 32 | 8 | 6 | 5 |

Can we conclude that the number of defects follows the Poisson distribution with mean 2.6 at significance level $\alpha = 0.05$?

2. A random sample of 100 electrical components produced in a factory has been selected and the following number of defective components was recorded:

| Number of defects | 0 | 1 | 2 | 3 | 4 | 5 | 6 and more |
|---|---|---|---|---|---|---|---|
| Frequency | 5 | 10 | 18 | 19 | 16 | 12 | 20 |

Can we conclude that the number of defective electrical components follows the Poisson distribution at significance level $\alpha = 0.01$?

3. A manufacturing engineer is testing a power supply used in a notebook computer. The complete table of observed frequencies is as follows:

| Class interval | Observed frequency $O_i$ |
|---|---|
| $x < 4.948$ | 12 |
| $4.948 \le x < 4.986$ | 14 |
| $4.986 \le x < 5.014$ | 12 |
| $5.014 \le x < 5.040$ | 13 |
| $5.040 \le x < 5.066$ | 12 |
| $5.066 \le x < 5.094$ | 11 |
| $5.094 \le x < 5.132$ | 12 |
| $x \ge 5.132$ | 14 |

Test the hypothesis that the output voltage is adequately described by a normal distribution with mean 5.04 V and standard deviation 0.08 V at significance level $\alpha = 0.05$.
4. A machine is supposed to mix 40% peanuts, 30% hazelnuts, 20% cashews, and 10%
pecans. A can containing 500 of these mixed nuts was found to have 269 peanuts, 112
hazelnuts, 74 cashews, and 45 pecans. At the 0.05 level of significance, test the
hypothesis that the machine is mixing the nuts according to the required percentages.
5. It is believed that the ratio of Bumiputera, Orang Asli, and other students' intake in the Faculty of Engineering is 14:3:3. A sample of 500 students chosen at random shows the following data:

| | Bumiputera | Orang Asli | Others |
|---|---|---|---|
| Number of Students | 345 | 78 | 77 |

Do we have a reason to accept the above ratio at significance level $\alpha = 0.01$?

6. A random sample of semiconductor devices is taken to observe the relationship
between classification and status for each device. The results are as follows:
Classification
Defective
Non Defective
80
20
40
60

Status
Rejected
Non Rejected

Test the hypothesis that the status and classification are independent at significance level $\alpha = 0.05$.
7. A study was conducted to determine whether the type of painkiller administered to patients influences the level of pain felt by the patients, and the following data set was obtained:

| Painkiller | Level of Pain: No | Level of Pain: A little | Level of Pain: Strong |
|---|---|---|---|
| A | 20 | 30 | 10 |
| B | 10 | 35 | 15 |

Test whether the level of pain and the type of painkiller are independent at significance level $\alpha = 0.01$.
8. A total of 1000 PVC pipes are sampled and categorized with respect to both length and diameter specification. The results are presented in the following table:

| Length | Diameter: Too Thick | Diameter: Meet Specification | Diameter: Too Wide |
|---|---|---|---|
| Too Short | 20 | 115 | 15 |
| Meet Specification | 65 | 550 | 45 |
| Too Long | 35 | 145 | 10 |

Test at the 1% significance level whether the length and the diameter of the PVC pipes are independent.

9. A set of data was collected to determine whether the proportions of defective components produced by workers were the same for the day, evening, and night shifts. The following data were collected:

| | Shift: Day | Shift: Evening | Shift: Night |
|---|---|---|---|
| Defective | 100 | 200 | 200 |
| Non defective | 150 | 200 | 150 |

Use a 0.05 level of significance to determine if the proportions of defective components are the same for all three shifts.

10. A QC inspector took a set of sample data to determine whether the proportions of output components for two shifts produced by machines A, B and C were the same. The following data were collected:

| | Machine A | Machine B | Machine C |
|---|---|---|---|
| Shift 1 | 100 | 120 | 180 |
| Shift 2 | 120 | 180 | 100 |

Use a 0.05 level of significance to determine if the proportions of output components for shift 1 are the same for all three machines.

Chapter 6
Analysis of Variance

Learning Objectives:
At the end of this chapter, students should be able to
a) Identify treatment, response and levels of treatment.
b) Analyse data using one-way ANOVA.
c) Perform one-way ANOVA techniques via the Microsoft Excel.

6.1 Introduction
In Chapter 4, we compared two population means, or in other words two levels of a factor, to decide whether there was any difference between the population means from which the samples came. However, researchers often want to examine differences among three or more population means. For example, researchers might want to compare five different temperatures in developing a polymer to be used in removing toxic wastes from water. The procedure that can be used for testing the equality of the mean responses at these temperatures is one-way analysis of variance, or one-way ANOVA. The five different temperatures are known as five levels of the factor, or five treatments. A factor (or treatment) is a property, or characteristic, that allows us to distinguish the different populations from one another. The number of factor levels is commonly denoted by k.
The term treatment is used because early applications of analysis of variance involved
agricultural experiments in which different plots of farmland were treated with different
fertilizers, seed types, insecticides and so on.
To understand how analysis of variance works and why it is called analysis of variance, consider the example above. We obtain a random sample from the population and, for each temperature, measure the percentage of impurities removed by the treatment. The measurements obtained at a given temperature will differ from one another; this is the variability within a group, where each group is a level of the factor.
In one-way analysis of variance, we partition the variability into two components: within-group variability and between-group variability. We then examine the ratio of the two, called the F ratio, obtained by dividing the between-group variability by the within-group variability. It is in this sense that ANOVA is an analysis of variance: the variance between groups is compared to the variance within groups.
After conducting a one-way analysis of variance, we might conclude that there is
sufficient evidence to reject a claim of equal population means, but we cannot conclude from
ANOVA that any particular mean is different from the others.
The model deals with specific factor levels and involves testing the null hypothesis against the alternative hypothesis, stated below:

$H_0: \mu_1 = \mu_2 = \dots = \mu_k$
$H_1: \mu_i \ne \mu_j$ for at least one pair $(i, j)$

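## 6.2 One-Way ANOVA

The one-way analysis of variance specifically allows us to compare several groups of observations, all of which are independent but possibly with a different mean for each group. A test of great importance is whether or not all the means are equal. Assume that we are interested in comparing the means of k populations. In a one-way ANOVA, it is assumed that each of the populations is normally distributed with the same variance, $\sigma^2$.

Each observation may be written as:
$$y_{ij} = \mu_i + \varepsilon_{ij}$$
where $y_{ij}$ is the $j$th observation from the $i$th factor level, $\mu_i$ is the $i$th mean and $\varepsilon_{ij}$ is the random error.
An alternative and preferred form of this equation is obtained by substituting
$$\mu_i = \mu + \tau_i$$
with the restriction
$$\sum_{i=1}^{k} \tau_i = 0,$$
so that
$$y_{ij} = \mu + \tau_i + \varepsilon_{ij},$$
and $\tau_i$ is called the effect of the $i$th factor level.

The data can be laid out as follows:

| Factor level | Observations | Total | Mean |
|---|---|---|---|
| 1 | $y_{11}, y_{12}, \dots, y_{1n_1}$ | $y_{1.}$ | $\bar{y}_{1.}$ |
| 2 | $y_{21}, y_{22}, \dots, y_{2n_2}$ | $y_{2.}$ | $\bar{y}_{2.}$ |
| ... | ... | ... | ... |
| $i$ | $y_{i1}, y_{i2}, \dots, y_{in_i}$ | $y_{i.}$ | $\bar{y}_{i.}$ |
| ... | ... | ... | ... |
| $k$ | $y_{k1}, y_{k2}, \dots, y_{kn_k}$ | $y_{k.}$ | $\bar{y}_{k.}$ |
| | | $y_{..}$ | $\bar{y}_{..}$ |

In carrying out ANOVA, it is important to know the following notations:

(i) $y_{i.} = \sum_{j=1}^{n_i} y_{ij}$ is the sum of the responses over a level.
(ii) $\bar{y}_{i.} = \dfrac{y_{i.}}{n_i}$ is the level mean.
(iii) $y_{..} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}$ is the grand sum of all responses.
(iv) $\bar{y}_{..} = \dfrac{y_{..}}{N}$ is the overall mean of the data, where $N = \sum_{i=1}^{k} n_i$ is the total number of observations.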

## 6.3 Partitioning of Total Variability into Components

ANOVA is a procedure in which the total variation in a measured response is partitioned into components that can be attributed to recognizable sources of variation. These individual components are useful in testing pertinent hypotheses. The total variability of the data, designated by the double summation
$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2,$$
can be partitioned as
$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{k} n_i (\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2.$$

The general form is
SST = SSTrt + SSE
where
SST is the total sum of squares,
SSTrt is the sum of squares due to the levels, and
SSE is the sum of squares due to the errors.
The equation for the total sum of squares, which is a measure of the overall variability of the data, is
$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{..}^2}{N}.$$

The equation for the sum of squares for the levels, which measures the variability due to the levels or factors, is
$$SSTrt = \sum_{i=1}^{k} n_i (\bar{y}_{i.} - \bar{y}_{..})^2 = \sum_{i=1}^{k} \frac{y_{i.}^2}{n_i} - \frac{y_{..}^2}{N}.$$

With SST and SSTrt known, SSE can be calculated by the formula
SSE = SST − SSTrt
The SSE term measures the variability of the data due to random error.
There are degrees of freedom associated with each of the sums of squares. The degrees of freedom for factor, error and total are given by $k - 1$, $N - k$ and $N - 1$, respectively.
Mean square values are calculated by dividing the sums of squares for the level and error by their respective degrees of freedom. These values represent the variance of the level and error components of the data. Mean square values for levels and errors are
$$MSTrt = \frac{SSTrt}{k - 1}, \qquad MSE = \frac{SSE}{N - k}.$$
The test statistic
$$F_0 = \frac{MSTrt}{MSE}$$
is $f$-distributed with $k - 1$ and $N - k$ degrees of freedom. Therefore, if
$$f_{calculated} > f_{\alpha, k-1, N-k}$$
we reject the null hypothesis and conclude that some of the variability of the data is due to differences in the factor levels.

6.4 Output
The general format for output for this type of analysis is an ANOVA table, which contains

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | $f_{calculated}$ |
|---|---|---|---|---|
| Factor (between levels) | SSTrt | $k - 1$ | MSTrt | MSTrt / MSE |
| Error (within levels) | SSE | $N - k$ | MSE | |
| Total | SST | $N - 1$ | | |

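Example 1
Three different types of alcohol can be used in a particular chemical process. The resulting yields (in %) from several batches using the different types of alcohol are given below:

| Alcohol 1 | Alcohol 2 | Alcohol 3 |
|---|---|---|
| 93 | 95 | 76 |
| 95 | 97 | 77 |
| 74 | 87 | 84 |

Test whether or not the three populations appear to have equal means using $\alpha = 0.01$.

Solution

| | Alcohol 1 | Alcohol 2 | Alcohol 3 | |
|---|---|---|---|---|
| | 93 | 95 | 76 | |
| | 95 | 97 | 77 | |
| | 74 | 87 | 84 | |
| Total | $y_{1.} = 262$ | $y_{2.} = 279$ | $y_{3.} = 237$ | $y_{..} = 778$ |

$N = 9$, $k = 3$
Hypothesis:
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1: \mu_i \ne \mu_j$ for at least one pair $(i, j)$

$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{..}^2}{N} = (93^2 + 95^2 + 74^2 + \dots + 76^2 + 77^2 + 84^2) - \frac{778^2}{9} = 660.2222$$

$$SSTrt = \sum_{i=1}^{k} \frac{y_{i.}^2}{n_i} - \frac{y_{..}^2}{N} = \frac{262^2}{3} + \frac{279^2}{3} + \frac{237^2}{3} - \frac{778^2}{9} = \frac{1}{3}(262^2 + 279^2 + 237^2) - \frac{778^2}{9} = 67{,}551.3333 - 67{,}253.7778 = 297.5555$$

$$SSE = SST - SSTrt = 660.2222 - 297.5555 = 362.6667$$

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | $F_{calculated}$ |
|---|---|---|---|---|
| Factor | 297.5555 | $3 - 1 = 2$ | $297.5555 / 2 = 148.7778$ | $148.7778 / 60.4445 = 2.4614$ |
| Error | 362.6667 | $9 - 3 = 6$ | $362.6667 / 6 = 60.4445$ | |
| Total | 660.2222 | $9 - 1 = 8$ | | |

At $\alpha = 0.01$, from the statistical table for the $f$ distribution, we have $f_{0.01,2,6} = 10.92$.

Since $f_{calc} = 2.4614 < f_{0.01,2,6} = 10.92$, we are unable to reject the null hypothesis and conclude that there is no difference in the mean yields of the three types of alcohol at a significance level of $\alpha = 0.01$.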
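The sums of squares in this example follow directly from the computational formulas of Section 6.3, so they can be verified with a short script. This is a minimal sketch; the function name `one_way_anova` and the dictionary keys are our own, and the observations used are the ones that appear in the SST computation above (93, 95, 74 for the first alcohol).

```python
def one_way_anova(groups):
    """Return the one-way ANOVA quantities for a list of groups of observations."""
    all_values = [y for group in groups for y in group]
    N, k = len(all_values), len(groups)
    correction = sum(all_values) ** 2 / N              # y..^2 / N
    sst = sum(y ** 2 for y in all_values) - correction
    sstrt = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    sse = sst - sstrt
    mstrt = sstrt / (k - 1)
    mse = sse / (N - k)
    return {"SST": sst, "SSTrt": sstrt, "SSE": sse,
            "MSTrt": mstrt, "MSE": mse, "F": mstrt / mse,
            "df": (k - 1, N - k)}


# Yields (in %) for the three types of alcohol
yields = [[93, 95, 74], [95, 97, 87], [76, 77, 84]]
result = one_way_anova(yields)

for name in ("SST", "SSTrt", "SSE", "MSTrt", "MSE", "F"):
    print(name, round(result[name], 4))
print("degrees of freedom:", result["df"])
```

The function also accepts groups of unequal size, since each level total is divided by its own $n_i$.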

An experiment was done to compare the amount of heat loss for three types of thermal panes. The inside temperature was kept at a constant 68°F, the outside temperature was kept at a constant 20°F, and heat loss was recorded for three different panes of each type:
Pane Type
1
2
3
Use ANOVA to test for
differences in heat loss due to
20
14
11
14
12
13
pane type at = 0:05. What can
you conclude from this test?
29
13
19
16
12
15
[$f_{calc} = 2.3608 < f_{0.05,2,9} = 4.26$; fail to reject $H_0$; no differences.]
An experiment was conducted to compare four formulations for a lens coating with regard to its adhesive property. Four samples of each formulation were used, and the resulting data are given below:

| Formulation 1 | Formulation 2 | Formulation 3 | Formulation 4 |
|---|---|---|---|
| 15 | 29 | 33 | 26 |
| 10 | 60 | 59 | 34 |
| 21 | 91 | 49 | 28 |
| 23 | 20 | 21 | 46 |

Do the data provide sufficient evidence to indicate a difference in the mean formulation at $\alpha = 0.05$?

[$f_{calc} = 2.1188 < f_{0.05,3,12} = 3.49$; fail to reject $H_0$; no differences.]

To determine the effect of three phosphor types on the output of computer monitors, each
phosphor type was used in three monitors, and the coded results are given below:
Type
2
4
2
3
3

1
3
sufficient 7
2 evidence to conclude that there is a
3 among the three monitors? Test by
difference in the mean phosphor 5
7
6
using = 0.025
5
4
5
5
6
[$f_{calc} = 7.4495 > f_{0.025,2,12} = 5.10$; reject $H_0$; a difference exists.]
Do

the

data

provide

## 6.5 Computer Application Using Excel

The Excel spreadsheet program has a tool to calculate one-way Analysis of Variance, which
simplifies our computational task considerably. The first step is to enter the data into an Excel
Worksheet. Each factor should be in a separate column. Each column should have a heading
representing the different factors.
In Excel 2007 Worksheet, select Data in the main menu, followed by Data Analysis. If
you use Excel 2003, you may go to Tools first, and select Data Analysis. If Data Analysis is
not available you must install the Data Analysis Tools as follows:
2. Click on the box next to Analysis ToolPak to select it.
3. Click OK. You have now installed the ToolPak.
From any version, when you click Data Analysis, a pop-up menu will appear. You scroll
down the Data Analysis menu and select Anova:Single Factor. Complete the Anova:
Single Factor window as follows:

1. Enter \$A\$2:\$C\$7 in the Input Range: box (or you can enter that value automatically by clicking in the box and then selecting the range of cells A2 through C7).
2. Click the Columns button to indicate that our data is grouped by columns.
3. Click the Labels in first row box to indicate that we are using labels.
4. Enter the value of alpha in the Alpha: box.
5. Under Output Options click the button for Output range: and enter \$A\$9 in the Output range: box (or click in the box and then click on the cell A9 to cause it to appear in the box).
6. Click OK.
An example of Excel output summary from a one-way analysis of variance can be seen in
Figure 6.1 below. Notice that the means for the three groups (as well as the count, sum, and
variance for each group) can be seen in the summary table.

## Figure 6.1: Microsoft Excel Output for ANOVA

One way to interpret the output is to look at the P-value, defined as
$$P\text{-value} = P(F > f_{calc}).$$

In the above output,
$$P\text{-value} = P(F > 5.178082192).$$

This P-value is then compared to a chosen level of significance, $\alpha$. The rules are:

If P-value $> \alpha$, $H_0$ will not be rejected.
However, if P-value $\le \alpha$, then it suggests that the sample data provide sufficient evidence to reject $H_0$.

From the output above, P-value $= 0.023917$. Suppose we choose $\alpha = 0.05$. Since the P-value $< 0.05$, we conclude that there exists a significant difference in the means at the 0.05 level of significance. However, if we choose $\alpha = 0.01$, the P-value $> 0.01$. Hence, we fail to reject $H_0$ and conclude that there is no significant difference in the means at the 0.01 level of significance.
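For readers without Excel at hand, the P-value rule can be sketched in Python. The example relies on two assumptions that are not stated in the text: that the output above comes from three groups of five observations (inferred from the input range A2:C7 with labels in the first row), giving 2 and 12 degrees of freedom, and that we may use the closed form $P(F > f) = (1 + 2f/\nu_2)^{-\nu_2/2}$, which holds only when the numerator has 2 degrees of freedom (that is, $k = 3$). The function name is our own.

```python
def f_p_value_two_numerator_df(f_calc, df_error):
    """P(F > f_calc) for an F distribution with 2 and df_error degrees of freedom.

    Valid only for k = 3 groups (numerator degrees of freedom equal to 2).
    """
    return (1 + 2 * f_calc / df_error) ** (-df_error / 2)


# Assumed degrees of freedom: k - 1 = 2 and N - k = 15 - 3 = 12
p_value = f_p_value_two_numerator_df(5.178082192, 12)

for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    print(alpha, round(p_value, 6), decision)
```

For other numbers of groups a statistical package or the F table is needed, since this closed form no longer applies.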
Conduct a one-way ANOVA for Tasks 1, 2, and 3 by using Excel. Identify the P-value for each task and interpret the value.

[T1: P-value = 0.149887; T2: P-value = 0.15118; T3: P-value = 0.007883]

Exercise 6
1. It was known that a toxic material was dumped in a river leading into a large saltwater commercial fishing area. Civil engineers studied the way the water carried the toxic material by measuring the amount of the material (in parts per million) found in oysters harvested at three different locations, ranging from the estuary out into the bay where the majority of commercial fishing was carried out. The resulting data are given below:

| Site 1 | Site 2 | Site 3 |
|---|---|---|
| 15 | 19 | 22 |
| 26 | 15 | 26 |
| 20 | 10 | 24 |
| 20 | 26 | 26 |
| 29 | 11 | 15 |
| 28 | 20 | 17 |
| 21 | 13 | 24 |
| 26 | 15 | |
| | 18 | |

Is there a significant difference in the average parts per million of toxic material found in oysters harvested at three sites? Use $\alpha = 0.05$.

2. A quality control engineer conducted an experiment to investigate the effect of experience on an assembly line in terms of the average time required to complete an assembly task. If experience is found to be a factor, a training program is planned for new employees. The engineer randomly selected eight employees from groups who had completed 1, 2, 3, and 4 years of work experience, respectively. The resulting data are given below:

| Experience: 1 year | 2 years | 3 years | 4 years |
|---|---|---|---|
| 40.3 | 34.2 | 26.3 | 26.6 |
| 25.4 | 25.4 | 29.2 | 21.2 |
| 28.2 | 30.2 | 24.6 | 23.2 |
| 41.6 | 28.9 | 29.1 | 27.0 |
| 28.8 | 39.2 | 34.8 | 27.1 |
| 38.7 | 29.5 | 32.3 | 27.3 |
| 29.4 | 29.0 | 36.0 | 34.2 |
| 37.7 | 25.6 | 25.6 | 33.3 |

a) Test for any significant differences among years of experience for average assembly time. Use $\alpha = 0.05$.
b) Do the data suggest that a training program might be productive?

3. The OPEC oil embargo made it evident that fuel economy in automobiles needed to
be improved. Newer lightweight materials were sought for use in automobile engines.
Comparisons on the density (in g / cm3 ) were made among test material samples of
steel, aluminium, and phenolic thermoset composites containing glass fibres, resulting
in the following data:

| Steel | Aluminium | Phenolics |
|---|---|---|
| 7.60 | 2.90 | 1.79 |
| 7.81 | 2.67 | 1.72 |
| 7.72 | 2.80 | 1.67 |
| 7.68 | 2.85 | 1.80 |
| 7.79 | 2.60 | 1.50 |
| 7.76 | 2.76 | 1.63 |

Using an analysis of variance, state the correct hypotheses for testing equality of means in density for the three materials and conduct the ANOVA test. State your conclusion. Use the $\alpha = 0.01$ level of significance.

4. Consider the following set of dissolved oxygen concentration data obtained in 4 different seasons.

| Season 1 | Season 2 | Season 3 | Season 4 |
|---|---|---|---|
| 5.62 | 7.70 | 2.52 | 6.77 |
| 6.12 | 8.31 | 5.44 | 6.65 |
| 6.62 | 8.80 | 4.94 | 6.01 |
| 6.21 | 8.24 | 2.99 | 6.26 |
| 7.80 | 7.87 | 4.39 | 7.09 |
| 5.36 | 7.44 | 4.44 | 6.06 |

Use a one-way ANOVA to determine if season has a significant impact on oxygen variability at the 0.05 level of significance.
5. Four different machines are used in manufacturing rubber seals. The machines are
being compared with respect to tensile strength of the product. A random sample of
seals from each machine is used to determine whether the mean tensile strength varies

from machine to machine. The following data are the tensile strength measurements in kilograms per square centimeter $\times 10^{-1}$:

Machine

1
2
3
4

17.5
19.2
15.8
18.6

16.4
16.8
20.9
18.9

20.3
18.5
17.2
20.5

14.6
21.4
16.4
19.5

21.5
16.9
18.1

20.1

Perform the analysis of variance at the 0.025 level of significance and indicate
whether or not the mean tensile strengths differ significantly for the four machines.
6. In a biological experiment, 4 concentrations of a certain chemical are used to enhance
the growth in centimeters of a certain type of plant over time. The growths of plants
are measured. The following output is from Excel.

a) How many plants are used for each concentration?
b) Can we conclude at the $\alpha = 0.05$ level of significance that different concentrations affect the growth of the plant?
7. A company is considering four brands of lightbulbs to choose from. Before the
company decides which lightbulbs to buy, they want to investigate if the mean
lifetimes of the four types of lightbulbs are the same. The company's research
department randomly selected a few bulbs of each brands and tested them. The
following results are based on the number of hours (in thousands) that each of the
bulbs lasted before being burned out. At 5% significance level, test the null hypothesis
that the mean lifetime of bulbs for each of these four brands is the same.

Chapter 7
Simple Linear Regression and Correlation

Learning Objectives:
At the end of this chapter, students should be able to

(a) Differentiate between response and predictor variables.

(b) Define the terms regression and correlation and highlight the differences between
the two terms.
(c) Write down a linear regression model correctly.
(d) Estimate unknown parameters in a linear regression model by using the method of
least squares.
(e) Use a scientific calculator and computer technology such as Microsoft Excel to get the
estimates of the unknown parameters in a linear regression model.
(f) Make a prediction based on a fitted regression model.
(g) Run a hypothesis test and make inferences on the existence of linearity in a linear
regression model.
(h) Compute a correlation coefficient and differentiate between different types of
relationship between two variables.

7.1 Introduction
In previous chapters, we have focused only on learning the behaviour of population and sample characteristics, such as the mean, proportion and variance. Having learned about those characteristics, we can now move on to exploring the relationship between variables. Many problems arising in science and engineering involve exploring the relationship between two or more variables. In this chapter, we consider two statistical techniques that are very useful as a foundation for describing the relationship between such variables: first, regression analysis, and second, the correlation coefficient.

## 7.1.1 Regression analysis

Regression analysis generally models the relationship between one or more response¹
variables and one or more predictor² variables. Three common classifications of regression
analysis are listed below:
i. Simple linear regression, if there is only one response variable and one predictor
variable.
ii. Multiple regression, if there is only one response variable and many predictor
variables.
iii. Multivariate regression, if there are many response variables and one or more
predictor variables.

There are many other types of regression analysis. In this chapter, we deal only with
the first classification. Linear regression, in general, models the relationship between two or
more random variables using a linear equation. In other words, it is a method of estimating
the conditional expected value of one response variable given the values of some predictor
variable or variables. Simply put, linear regression assumes that the best estimate of the
response variable is a linear function of some parameters (though not necessarily linear in
the predictors).

## 7.1.2 Correlation coefficient

The correlation coefficient, on the other hand, gives us a single value, rather than a model,
that measures the relationship between variables. In this chapter, we concentrate only on
the correlation coefficient that measures a linear relationship, particularly for
quantitative data. This will be discussed in detail in Section 7.6.

¹ Response variables are also called dependent variables, explained variables, predicted
variables, or regressands. In the case of a single response variable, it is usually denoted by Y.
² Predictor variables, on the other hand, are also called independent variables, explanatory
variables, control variables, or regressors, and are usually denoted by $X_1, X_2, \ldots, X_p$.

1. Choose a partner. Then discuss the difference between regression and correlation.
2. Choose a different partner. Then list
a) two possible response variables, and
b) two possible predictor variables.

## 7.2 Simple Linear Regression

As mentioned earlier, the main focus of this chapter is simple linear regression analysis.
It involves a single predictor, commonly denoted by X, and a single response variable,
commonly denoted by Y.

Definition: A simple linear regression model gives a straight-line relationship between a
single response (or dependent) variable, Y, and a single predictor (or independent) variable,
X.
Example 1
An engineering student is investigating if his carry marks for all subjects depend on the
number of revision hours he has spent on the subjects.
Solution
In this example, the response, or dependent, variable Y represents the engineering student's
carry marks for all subjects, whereas the predictor, or independent, variable X represents the
number of revision hours the student has spent on each subject.
Example 2
An analyst is investigating if the increase in petrol price has an effect on the number of
customers at a petrol station.

Solution
The response variable Y represents the number of customers at the petrol station, whereas the
predictor variable X represents the increase in petrol price.

## 7.2.1 Simple linear regression model

Once we have identified the response and predictor variables, we may select a random
sample consisting of n pairs of observations. Given this set of paired data,
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, each of the paired observations can be expressed
as a statistical, or stochastic, model which consists of a deterministic component and a
random component, as follows:

$$y_i = \alpha + \beta x_i + \varepsilon_i \tag{7.1}$$

where $\alpha$ is an unknown regression coefficient representing the intercept³,
$\beta$ is another unknown regression coefficient representing the slope,
and $\varepsilon_i$ is the random error for the i-th pair.

Notice here that the deterministic component in the regression model above is in fact a simple
linear, or straight-line, model.
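
The two components of Equation (7.1) can be seen side by side in a small simulation. The sketch below is only an illustration: the parameter values (intercept 2, slope 0.5, error standard deviation 1), the ten x values and the random seed are arbitrary assumptions that do not come from the text, and the errors are drawn from a normal distribution as assumed in Section 7.2.2.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA, SIGMA = 2.0, 0.5, 1.0

rng = random.Random(0)      # fixed seed so that the run is repeatable
x = list(range(1, 11))      # ten arbitrary predictor values

deterministic = [ALPHA + BETA * xi for xi in x]      # alpha + beta * x_i
errors = [rng.gauss(0.0, SIGMA) for _ in x]          # epsilon_i ~ N(0, sigma^2)
y = [d + e for d, e in zip(deterministic, errors)]   # y_i = alpha + beta * x_i + epsilon_i

for xi, di, ei, yi in zip(x, deterministic, errors, y):
    print(f"x={xi:2d}  deterministic={di:5.2f}  error={ei:6.2f}  y={yi:5.2f}")
```

Each simulated response is the point on the straight line plus its own random error, which is why the observations scatter around the line instead of lying exactly on it.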

## 7.2.2 Model assumptions

The assumptions underlying the simple linear regression model include the following:
1. The errors, $\varepsilon_i$, are normally distributed.
2. The mean of the random errors is zero.
3. The variance of the random errors is an unknown constant, $\sigma^2$.
4. The errors are uncorrelated, that is, $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

³ Readers are cautioned that this intercept, $\alpha$, is not the same as the level of
significance in hypothesis testing, which is also denoted by $\alpha$. In addition, some
references use $\beta_0$ instead of $\alpha$ in the regression model.

## 7.2.3 Fitted simple linear regression equation

Re-expressing Equation (7.1) in terms of variables, instead of values, we get the following
equation:

$$Y = \alpha + \beta x + \varepsilon \tag{7.2}$$

Computing the expected value of Y given a certain value of X, say $X = x$, turns
Equation (7.2) into the following equation:

$$E(Y \mid X = x) = \mu_{Y \mid X = x} = \alpha + \beta x \tag{7.3}$$

We can see from Equation (7.3) that the best estimate of the response variable given a certain
value of the predictor variable is simply a linear function of two unknown parameters,
$\alpha$ and $\beta$. After estimating the two unknown parameters, the target fitted simple
linear regression equation can be obtained and expressed as

$$\hat{Y} = \hat{\alpha} + \hat{\beta} x \tag{7.4}$$

1. Determine the response and predictor variables in the following cases:
a. An investigation is carried out to study if the amount of a certain chemical that
will dissolve in a given volume of water depends on the temperature.
b. A study is done to determine if the oxide of nitrogen emission rate is influenced
by the load of an engine.
c. An engineer tries to predict the tensile strength of a specimen of cold drawn
copper from the Brinell hardness reading.
2. Without looking at your notes, rewrite a simple linear regression model and state the
assumptions related to the model. Then check whether you got them right.
3. Similarly, rewrite a fitted regression equation and check whether you are on the
right track.

## 7.3 Scatter Diagram

A scatter diagram can be used to plot the n randomly selected paired observations. This
diagram is a helpful tool in detecting a relationship between two variables.

## 7.3.1 Data plotting

The scatter diagram is a two-dimensional Cartesian plot, with the x-axis representing the
predictor variable values and the y-axis representing the response variable values. Figure 7.1
shows two examples of scatter diagrams. From the scatter plots in Figure 7.1, we can detect
a positive slope for the linear model between Y and X in plot (a) and a negative slope for
the linear model in plot (b).

## 7.3.2 Draw by eye

We can draw, by eye, many straight lines through the points on the scatter diagram. These
straight lines, however, are subject to an individual's judgment and will consequently give
different estimated values of $\alpha$ and $\beta$. To arrive at a common estimated, or fitted,
regression equation with common $\hat{\alpha}$ and $\hat{\beta}$, we can use the method of
least squares to estimate the unknown parameters, which is discussed in the next section.

1. Plot a scatter diagram that implies a very strong positive relationship between two
variables.
2. Plot a scatter diagram that implies a moderately weak negative relationship between
two variables.

## 7.4 A Method of Least Squares

The method of least squares is a classical method attributed to the German scientist Carl
Friedrich Gauss (1777-1855). It estimates the unknown simple linear regression
coefficients, $\alpha$ and $\beta$, by minimizing the sum of squared residuals. The resulting
fitted line is the straight line that best describes the relationship between the response
and the predictor variables in the least squares sense.

## 7.4.1 Errors and residuals

Residuals are the sample counterparts of the random errors: they are computed from the
fitted line rather than from the unknown true line. These residuals can be seen as the vertical
deviations of the observed values from the estimated regression line, as shown in Figure 7.2,
and are denoted by $e_i$ for the i-th observation, $i = 1, 2, \ldots, n$, that is

$$e_i = y_i - \hat{y}_i \tag{7.5}$$

Figure 7.2: The least-squares regression line and residuals

These residuals are a very useful tool in providing information about the adequacy of the
fitted model.
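
As a concrete illustration of Equation (7.5), the sketch below computes the fitted values and residuals for the data of Example 3 later in this chapter, using the least squares formulas of Section 7.4.4. The variable names are illustrative choices and not part of the text.

```python
# Data of Example 3 from this chapter (x = predictor, y = response).
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]
n = len(x)

# Least squares estimates (formulas of Section 7.4.4).
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

fitted = [alpha_hat + beta_hat * a for a in x]       # y_hat_i
residuals = [b - f for b, f in zip(y, fitted)]       # e_i = y_i - y_hat_i

for a, b, f, e in zip(x, y, fitted, residuals):
    print(f"x={a:2d}  y={b:2d}  y_hat={f:6.2f}  e={e:6.2f}")

print(f"sum of residuals = {sum(residuals):.6f}")
```

For a least squares line with an intercept, the residuals sum to zero apart from floating-point rounding, so a sum far from zero usually signals a calculation mistake.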

## 7.4.2 The sum of squared residuals

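Recalling Equation (7.1), the population random error term can be re-expressed as

$$\varepsilon_i = y_i - \alpha - \beta x_i \tag{7.6}$$

The sum of squared deviations of the observations from the true regression line is then given
by

$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \tag{7.7}$$

By the method of least squares, we estimate the unknown parameters $\alpha$ and $\beta$ by
minimizing this sum of squared errors with respect to the parameters, that is, by setting the
partial derivatives of L with respect to $\alpha$ and $\beta$ to zero.

The least squares estimates of $\alpha$ and $\beta$, that is $\hat{\alpha}$ and $\hat{\beta}$
respectively, must satisfy the following conditions:

$$\left.\frac{\partial L}{\partial \alpha}\right|_{\hat{\alpha},\hat{\beta}} = -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0, \qquad \left.\frac{\partial L}{\partial \beta}\right|_{\hat{\alpha},\hat{\beta}} = -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)\, x_i = 0 \tag{7.8}$$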

## 7.4.3 Normal equations

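Simplifying the two equations in Equation (7.8) results in the following two further equations:

$$\sum_{i=1}^{n} y_i = n\hat{\alpha} + \hat{\beta} \sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} x_i y_i = \hat{\alpha} \sum_{i=1}^{n} x_i + \hat{\beta} \sum_{i=1}^{n} x_i^2 \tag{7.9}$$

Equations (7.9) are commonly called the least squares normal equations.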
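
The normal equations form a system of two linear equations in two unknowns, so they can be solved numerically once the sums are known. The sketch below does this with Cramer's rule for the data of Example 3 later in this chapter. Cramer's rule is a choice made here for illustration and is not prescribed by the text.

```python
# Data of Example 3 from this chapter.
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(a * a for a in x)
sum_xy = sum(a * b for a, b in zip(x, y))

# Normal equations as a 2x2 linear system in (alpha_hat, beta_hat):
#   n     * alpha_hat + sum_x  * beta_hat = sum_y
#   sum_x * alpha_hat + sum_xx * beta_hat = sum_xy
det = n * sum_xx - sum_x ** 2                     # determinant of the system
alpha_hat = (sum_y * sum_xx - sum_x * sum_xy) / det
beta_hat = (n * sum_xy - sum_x * sum_y) / det

print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
```

The determinant is zero only when all x values are identical, in which case no unique straight line can be fitted.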

## 7.4.4 The least squares estimators

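Solving the least squares normal equations simultaneously yields the least squares estimators
$\hat{\alpha}$ and $\hat{\beta}$, as given below:

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \tag{7.10}$$

$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} \tag{7.11}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, whereby the sum of
products, $S_{xy}$, and the total sum of squares for X, $S_{xx}$, are given below.

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right) \left(\sum_{i=1}^{n} y_i\right)$$

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2$$

Another term that will be used frequently later in this chapter is the total sum of squares of
the response variable Y, denoted by $S_{yy}$ and given as follows.

$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} y_i\right)^2$$

These sums of squares and the sum of products are commonly available in any standard
statistical formula sheet.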
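
For readers who want to see the intermediate algebra, the following is a short sketch of how Equations (7.10) and (7.11) follow from the normal equations (7.9). All sums run over $i = 1, \ldots, n$.

```latex
\begin{align*}
% Divide the first normal equation by n
\bar{y} &= \hat{\alpha} + \hat{\beta}\,\bar{x}
  \quad\Longrightarrow\quad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \\
% Substitute alpha-hat into the second normal equation
\sum x_i y_i &= \left(\bar{y} - \hat{\beta}\,\bar{x}\right) \sum x_i + \hat{\beta} \sum x_i^2 \\
% Collect the beta-hat terms, using \bar{x} = (1/n)\sum x_i and \bar{y} = (1/n)\sum y_i
\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i
  &= \hat{\beta} \left[ \sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2 \right] \\
S_{xy} &= \hat{\beta}\, S_{xx}
  \quad\Longrightarrow\quad \hat{\beta} = \frac{S_{xy}}{S_{xx}}
\end{align*}
```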

## 7.4.5 The fitted regression line and prediction

Once $\alpha$ and $\beta$ are estimated, the fitted or estimated regression model can be
expressed as a simple deterministic straight-line equation, given in Equation (7.4) and
re-expressed below:

$$\hat{y} = \hat{\alpha} + \hat{\beta} x$$

which gives a regression of Y on X, where $\hat{y}$ is the estimated or predicted value of Y
for a given value $X = x$. In other words, the predicted value of the response, or dependent,
variable for a given value of the independent variable x can be obtained by substituting the
given value of x into the above equation. In short, the fitted line can be used to predict Y
for any value of X, as long as that value lies within the range of the observed x values.

## 7.4.6 Finding the least squares estimates using a scientific calculator

Most scientific calculators provide tools for obtaining the estimated regression coefficients
and hence the fitted regression line. The following steps require readers to use this kind of
calculator: CASIO fx-570MS.

Table 7.1: Calculation steps using CASIO fx-570MS calculator

Steps:
1. Choose the Regression mode: Mode Mode
3. Clear the old data set: Shift CLR 1
5. Continue Step 4 for all $i = 1, 2, \ldots, n$.

Note that Step 3 is vitally important when storing a new data set so that the old data set will
be removed and will not be mixed with the new data set to ensure an accurate analysis.
Once the sample data are stored in the calculator, we can retrieve the available output
by pressing appropriate operators as shown in Table 7.2.
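Table 7.2: Output available from CASIO fx-570MS calculator

| Operators | Output |
|---|---|
| Shift S-SUM | $\sum x^2$ |
| Shift S-SUM | $\sum x$ |
| Shift S-SUM | $n$ |
| Shift S-SUM > | $\sum y^2$ |
| Shift S-SUM > | $\sum y$ |
| Shift S-SUM > | $\sum xy$ |
| Shift S-SUM | $\bar{x}$ |
| Shift S-SUM > | $\bar{y}$ |
| Shift S-SUM > > | $\hat{\alpha}$ |
| Shift S-SUM > > | $\hat{\beta}$ |
| Shift S-SUM > > | $r$ |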

Notice that r in Table 7.2 is the product moment correlation coefficient which will be covered
in Section 7.6 of this chapter.
Example 3
Obtain the equation of the least squares regression line of y on x for the following data:

| x | 20 | 25 | 30 | 35 | 40 | 45 | 50 | 55 | 60 | 65 |
|---|---|---|---|---|---|---|---|---|---|---|
| y | 98 | 87 | 92 | 79 | 68 | 57 | 59 | 43 | 60 | 38 |

Solution
The least squares regression line of y on x is $\hat{y} = \hat{\alpha} + \hat{\beta} x$.
Follow the five steps in Table 7.1. At Step 3, before we store the new data set, we must
always make sure that the old data set has been cleared. This is indicated by $n = 0$ on the
calculator screen before the new data set is stored.
After storing the above data set, we should get the following output:

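| Operators | Output |
|---|---|
| Shift S-SUM | $\sum x^2 = 20125$ |
| Shift S-SUM | $\sum x = 425$ |
| Shift S-SUM | $n = 10$ |
| Shift S-SUM > | $\sum y^2 = 50125$ |
| Shift S-SUM > | $\sum y = 681$ |
| Shift S-SUM > | $\sum xy = 26330$ |
| Shift S-SUM | $\bar{x} = 42.5$ |
| Shift S-SUM > | $\bar{y} = 68.1$ |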

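By formula,

$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2}$$

Substituting the values obtained from the calculator into the formula leads to

$$\hat{\beta} = \frac{26330 - \frac{1}{10}(425)(681)}{20125 - \frac{425^2}{10}} = \frac{-2612.5}{2062.5} = -1.2667 \text{ (to 4 d.p.)}$$

and

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} = 68.1 - (-1.2667)(42.5) = 121.9348 \text{ (to 4 d.p.)}$$

Hence, the least squares regression line is

$$\hat{y} = 121.9348 - 1.2667x$$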

Directly from the calculator, we get:

| Operators | Output |
|---|---|
| Shift S-SUM > > | $\hat{\alpha} = 121.9333$ (to 4 d.p.) |
| Shift S-SUM > > | $\hat{\beta} = -1.2667$ |

Intuitively, we should get the same values for $\hat{\alpha}$ and $\hat{\beta}$ whether we
calculate them by formula or obtain them directly from the calculator. Nonetheless, we may
notice that in this example the value of $\hat{\alpha}$ calculated by using the formula and
its value obtained directly from the calculator are slightly different. This small discrepancy
arises from rounding values at an earlier stage of the calculation.

This notation (d.p.) is a short form for decimal places. We normally round the final

For simplicity at the expense of accuracy, the least squares linear regression of y on x in
this example is thus

$$\hat{y} = 121.93 - 1.27x$$

We will refer to this equation in later examples and tasks. Notice that the estimated
regression line in this example has a positive intercept and a negative slope. Note that
$\alpha$ and $\beta$ can take any value in $(-\infty, \infty)$.

Example 4
Referring to Example 3, predict the value of y when $x = 58$.
Solution
When $x = 58$, the predicted value of y using the regression equation is

$$\hat{y} = 121.93 - 1.27(58) = 48.27 \text{ (to 2 d.p.)}$$
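
The fitting and prediction steps of Examples 3 and 4 can be sketched in a few lines of code. The function names `fit_line` and `predict` are hypothetical helpers introduced here for illustration. The range check reflects the advice in Section 7.4.5 that predictions should only be made for x values within the observed range.

```python
# Data of Example 3 from this chapter.
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]


def fit_line(x, y):
    """Return (alpha_hat, beta_hat) from Equations (7.10) and (7.11)."""
    n = len(x)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    beta_hat = s_xy / s_xx
    alpha_hat = sum(y) / n - beta_hat * sum(x) / n
    return alpha_hat, beta_hat


def predict(x_new, alpha_hat, beta_hat, x_min, x_max):
    """Predict y at x_new, refusing to extrapolate outside [x_min, x_max]."""
    if not x_min <= x_new <= x_max:
        raise ValueError(f"x = {x_new} is outside the observed range [{x_min}, {x_max}]")
    return alpha_hat + beta_hat * x_new


alpha_hat, beta_hat = fit_line(x, y)
y_58 = predict(58, alpha_hat, beta_hat, min(x), max(x))
print(f"predicted y at x = 58: {y_58:.2f}")
```

Because this sketch keeps the unrounded estimates, it gives about 48.47 instead of the 48.27 obtained in Example 4 from the coefficients rounded to 2 d.p.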

1. An article in the Journal of Sound and Vibration (Vol. 151, 1991, pp. 383-394)
described a study which investigated the relationship between noise exposure and
hypertension. The noise exposure is measured by the sound pressure level (SPL) in
decibels, whereas hypertension is measured by the blood pressure rise (BPR) in
millimetres of mercury (mmHg). A representative data set reported is as follows:
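| SPL, x | 60 | 63 | 65 | 70 | 70 | 70 | 80 | 90 | 80 | 80 |
|---|---|---|---|---|---|---|---|---|---|---|
| BPR, y | 1 | 0 | 1 | 2 | 5 | 1 | 4 | 6 | 2 | 3 |

| SPL, x | 85 | 89 | 90 | 90 | 90 | 90 | 94 | 100 | 100 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| BPR, y | 5 | 4 | 6 | 8 | 4 | 5 | 7 | 9 | 7 | 6 |

a) Fit a linear regression model of Y on X using the method of least squares.
Ans: $\hat{y} = -10.1315 + 0.1743x$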
b) What can you infer from the estimated value of the slope?
Ans : A unit increase in x leads to a 0.1743 unit increase in y
c) Predict the value of Y when $x = 58$.

Ans: $\hat{y} = -10.1315 + 0.1743(58) = -0.02$ (to 2 d.p.)

2. The number of defective components, Y, produced by a machine is known to be
linearly related to the speed setting, X, of the machine. The data below were
collected from a recent quality control record.

| x | 140 | 165 | 210 | 215 | 245 | 265 | 305 | 325 | 355 | 395 |
|---|---|---|---|---|---|---|---|---|---|---|
| y | 29 | 23 | 26 | 36 | 47 | 59 | 68 | 72 | 73 | 85 |

(a) Obtain $\bar{x}$, $\bar{y}$, $S_{xx}$, $S_{yy}$ and $S_{xy}$.
Ans: 262, 51.8, 61660, 4561.6, 16119
(b) Hence, calculate $\hat{\alpha}$ and $\hat{\beta}$ using the formulae. Compare the calculated
estimates with those given directly by a scientific calculator.
Ans: -16.6914, 0.2614
(c) Write a fitted simple linear regression model for the above data.
Ans: $\hat{y} = -16.6914 + 0.2614x$
(d) Next, estimate the number of defective items produced by the machine if the
speed is 380.
Ans: $\hat{y} \approx 83$

## 7.5 Tests for Linearity of Regression

Testing statistical hypotheses about the model parameters is an important part of assessing
the adequacy and significance of a linear regression model. In this chapter, we limit our
discussion to hypothesis testing about the slope of the regression model; hypothesis testing
about the intercept is not covered. Readers may refer to Montgomery et al. (2003), p. 274,
and other references for further details. Prior to testing the hypotheses, we need
to make the following assumptions:
a) The random errors, $\varepsilon_i$, have mean 0 and (unknown) variance $\sigma^2$.
b) The random errors, $\varepsilon_i$, are normally distributed.
c) The random errors corresponding to different observations are independent and
uncorrelated.
Furthermore, we also need to first observe the properties of $\hat{\beta}$, which may be
viewed as a random variable. From the regression model in Equation (7.1), we can describe
the properties of $\hat{\beta}$ as follows:

a) $\hat{\beta}$ is an unbiased estimator for $\beta$, that is, $E(\hat{\beta}) = \beta$.
b) $\mathrm{Var}(\hat{\beta}) = \sigma^2 / S_{xx}$, which is estimated by $\hat{\sigma}^2 / S_{xx}$, where

$$\hat{\sigma}^2 = \frac{1}{n-2} \left( S_{yy} - \hat{\beta} S_{xy} \right)$$

Note that the proofs of these properties are not covered in this chapter. These properties are
useful in computing the test statistic value in a hypothesis testing procedure.

## 7.5.1 Testing procedures

Hypothesis testing procedures include writing the hypotheses, stating the decision rule,
computing the selected test statistic and finally making a decision on the null hypothesis
about a particular parameter value, as discussed below:

## Step 1: Writing the hypothesis

When testing hypotheses about the slope, $\beta$, we actually test the linearity of the simple
linear regression model. Appropriate hypotheses are:
$H_0: \beta = 0$
$H_1: \beta \neq 0$
These hypotheses relate to the significance of regression. If we fail to reject $H_0$, we do
not have sufficient evidence of a linear relationship between X and Y. This may imply either
of these two situations:
a) X is of little value in explaining the variation in Y and therefore the best estimator
of Y for any value of X is simply $\hat{Y} = \bar{Y}$, or
b) the true relationship between X and Y is not linear.
However, if we reject $H_0$, this implies that X is of importance in explaining the
variability in Y.

## Step 2: Stating the decision rule

Once we have stated the hypotheses, we may choose either the t-test or the one-way ANOVA
using the F-test approach to carry out the test. This choice is available only for a two-sided
test. We use the t-test rather than the z-test because the number of paired observations is
small (n < 30) and the variance is unknown. Note that the F-test statistic is simply the
square of the t-test statistic.
For the t-test, the two-sided test for the significance of regression has two critical
regions: the lower tail, to the left of $-t_{\alpha/2,\,n-2}$, and the upper tail, to the right
of $t_{\alpha/2,\,n-2}$. For the F-test, the same test has a single critical region in the upper
tail of the F distribution. The decision depends on the location of the computed test
statistic: we reject $H_0$ if the computed test statistic lies in a critical region.
It is worth noting that the t-test can be applied not only to a two-sided test but also to
a one-sided test. The F-test, however, applies only to a two-sided test. In short, we have
two options for carrying out a two-sided test but only one option for a one-sided test.

## Step 3: Computing the test statistic

A test statistic is computed by assuming that the value stated under $H_0$ is true. This is
why the equality sign under $H_0$ is important. The same applies to a one-sided test.

## Step 4: Making decision and conclusion

After computing the chosen test statistic, this value is then compared with the critical value
stated in Step 2. A decision is made according to the location of the test statistic value. If the
test statistic value lies in a critical region, we reject H 0 and say that we have strong or
sufficient evidence from our sample information that H 0 is false. Otherwise, we are unable to
reject H 0 implying that the available information is insufficient to go against H 0 .

## 7.5.2 Using a t-test approach

We can test the linearity of a simple linear regression model by using a t-test. Why the
t-test? We have assumed that the errors, $\varepsilon_i$, are independently and identically
distributed (iid) with a Normal distribution having mean 0 and variance $\sigma^2$. It follows
directly that the observations $Y_i$ are independent and normally distributed with mean
$\alpha + \beta x_i$ and variance $\sigma^2$. Now, $\hat{\beta}$ is a linear combination
of independent normal variables, and hence is $N(\beta, \sigma^2 / S_{xx})$ using the
properties listed earlier in this section. In addition, the quantity

$$\frac{(n-2)\,\hat{\sigma}^2}{\sigma^2} \tag{7.12}$$

has a $\chi^2$ distribution with $n - 2$ degrees of freedom, and is independent of
$\hat{\beta}$. As a result, the appropriate test statistic

$$T_{test} = \frac{\hat{\beta} - 0}{\sqrt{\mathrm{Var}(\hat{\beta})}} = \frac{\hat{\beta}}{\sqrt{\hat{\sigma}^2 / S_{xx}}} \tag{7.13}$$

follows the t distribution with $n - 2$ degrees of freedom (df) under $H_0: \beta = 0$. The
determination of critical regions, and hence critical values, depends on the alternative
hypothesis, $H_1$, and the level of significance, $\alpha$, as listed in Table 7.3.
Note that $t_{\alpha,\,n-2}$ is the critical value for a test at significance level $\alpha$
with $n - 2$ degrees of freedom.

Table 7.3: Tests of hypothesis for the slope, $\beta$, of the linear regression model

| Type of hypothesis testing | Hypotheses | Rejection criteria |
|---|---|---|
| Two-sided test (test for linearity) | $H_0: \beta = 0$, $H_1: \beta \neq 0$ | Reject $H_0$ if $t_{test} < -t_{\alpha/2,\,n-2}$ or $t_{test} > t_{\alpha/2,\,n-2}$ [i.e. if $\lvert t_{test} \rvert > t_{\alpha/2,\,n-2}$] |
| Right-tailed test (test for a positive slope) | $H_0: \beta = 0$, $H_1: \beta > 0$ | Reject $H_0$ if $t_{test} > t_{\alpha,\,n-2}$ |
| Left-tailed test (test for a negative slope) | $H_0: \beta = 0$, $H_1: \beta < 0$ | Reject $H_0$ if $t_{test} < -t_{\alpha,\,n-2}$ |

Example 5
Consider Example 3. Test $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ using the t-test at the
level of significance $\alpha = 0.05$.
Solution
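From the solution to Example 3, we have

$$\sum x = 425, \quad \sum x^2 = 20125, \quad \sum y = 681, \quad \sum y^2 = 50125, \quad \sum xy = 26330, \quad \hat{\beta} = -1.27, \quad n = 10.$$

Therefore,

$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 20125 - \frac{425^2}{10} = 2062.5$$

$$S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 50125 - \frac{681^2}{10} = 3748.9$$

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 26330 - \frac{(425)(681)}{10} = -2612.5$$

Thus,

$$\mathrm{Var}(\hat{\beta}) = \frac{\hat{\sigma}^2}{S_{xx}} = \frac{S_{yy} - \hat{\beta} S_{xy}}{(n-2)\, S_{xx}} = \frac{3748.9 - (-1.27)(-2612.5)}{(10-2)(2062.5)} = 0.02612 \text{ (to 4 s.f.)}$$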

Step 1: State the null and alternative hypotheses.
We are to test the significance of regression given by the following hypotheses:
$H_0: \beta = 0$
$H_1: \beta \neq 0$
Step 2: Determine the rejection region and state a decision rule.
The significance level is $\alpha = 0.05$. The $\neq$ sign under $H_1$ indicates that the test
is two-sided. Therefore, the area in each of the right and left tails of the t distribution is

$$\alpha / 2 = 0.05 / 2 = 0.025$$

and

$$df = n - 2 = 10 - 2 = 8$$

From Table 7 in Lee (2004), the critical value is $t_{0.025,8} = 2.306$. Thus, the decision rule
is that we will reject $H_0$ if $\lvert t_{test} \rvert > t_{0.025,8}\ (= 2.306)$.
Step 3: Calculate the value of the test statistic.
The value of the test statistic is calculated as follows:

$$t_{test} = \frac{\hat{\beta} - 0}{\sqrt{\mathrm{Var}(\hat{\beta})}} = \frac{-1.27 - 0}{\sqrt{0.02612}} = -7.8581 \text{ (to 4 d.p.)}$$

Step 4: Make a conclusion.
The value of the test statistic is $t_{test} = -7.8581$; therefore, $\lvert t_{test} \rvert > 2.306$
and thus $t_{test}$ falls in the critical region. Hence, we reject the null hypothesis and
conclude that the data provide sufficient evidence that the slope is significantly different
from zero at the 0.05 level of significance.
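
The calculation in Example 5 can be checked with a short script. The sketch below mirrors the text by rounding the slope to 2 d.p. before testing, and it hard-codes the critical value 2.306 quoted from the statistical table instead of computing it. The variable names are illustrative.

```python
import math

# Data of Example 3 from this chapter.
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]
n = len(x)

s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

# The text rounds the slope to 2 d.p. before testing; this sketch does the same.
beta_hat = round(s_xy / s_xx, 2)

sigma2_hat = (s_yy - beta_hat * s_xy) / (n - 2)   # estimate of the error variance
var_beta_hat = sigma2_hat / s_xx                  # estimated Var(beta_hat)
t_test = (beta_hat - 0) / math.sqrt(var_beta_hat)

T_CRIT = 2.306   # t_{0.025, 8}, as quoted in the text from the t table
reject_h0 = abs(t_test) > T_CRIT

print(f"t_test = {t_test:.2f}, reject H0: {reject_h0}")
```

The script does not round the variance to 4 s.f. as the hand calculation does, so its test statistic can differ from the hand-computed value in the later decimal places. The decision is the same.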

## 7.5.3 Using a one-way analysis of variance approach

The analysis of variance (ANOVA) method is an alternative approach to testing the
significance of regression. Using this approach, the total variability in the response
variable is partitioned into two meaningful components as follows:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Symbolically, we have

$$SST = SSR + SSE$$

where SS denotes sum of squares and
a) $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = S_{yy}$ is the total corrected sum of squares of y.
b) $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta} S_{xy}$ is the regression sum of
squares, which measures the amount of variability in $y_i$ accounted for by the regression line.
c) $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SST - SSR$ is the error sum of squares, which
measures the residual variation left unexplained by the regression line.

The corresponding degrees of freedom (df) associated with each SS are as follows:
a) $df_T = n - 1$, where n is the number of paired observations $(x_i, y_i)$, $i = 1, 2, \ldots, n$;
b) $df_{reg} = 2 - 1 = 1$, since the model has two unknown parameters; and
c) $df_E = df_T - df_{reg} = (n - 1) - 1 = n - 2$.

If we divide SSR and SSE by their respective degrees of freedom, we obtain the mean square
regression, denoted by MSR (= SSR/1), and the mean square error, denoted by
MSE (= SSE/(n - 2)), respectively. It can be shown that the test statistic

$$F_{test} = \frac{MSR}{MSE}$$

follows the F distribution with 1 and $n - 2$ degrees of freedom under the null hypothesis
$H_0: \beta = 0$.
We can arrange the test procedure using this approach in an ANOVA table, as shown
in Table 7.4.

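| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | $F_{test}$ |
|---|---|---|---|---|
| Regression | $SSR = \hat{\beta} S_{xy}$ | 1 | MSR | MSR / MSE |
| Error | $SSE = SST - SSR$ | $n - 2$ | MSE | |
| Total | $SST = S_{yy}$ | $n - 1$ | | |

In this case, the test hypotheses are
$H_0: \beta = 0$
$H_1: \beta \neq 0$
We will reject $H_0$ if $f_{test} > f_{\alpha,1,n-2}$ at level of significance $\alpha$, where
$f_{\alpha,1,n-2}$ is the critical value tabulated in Table 9 of Lee (2004).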

Example 6
Reconsider Example 3. Test $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ using the ANOVA approach.
Solution
Step 1: Calculate $\hat{\beta}$, $S_{yy}$, $S_{xy}$
From the solution in Example 5, we have

$$\hat{\beta} = -1.27, \quad S_{yy} = 3748.9, \quad S_{xy} = -2612.5$$

Step 2: Compute all the sums of squares
By formula,
$SST = S_{yy} = 3748.9$
$SSR = \hat{\beta} S_{xy} = (-1.27)(-2612.5) = 3317.875$
$SSE = SST - SSR = 3748.9 - 3317.875 = 431.025$

Step 3: Complete the ANOVA table
By substitution, the complete ANOVA table is as follows:

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | $f_{test}$ |
|---|---|---|---|---|
| Regression | 3317.875 | 1 | 3317.875 | 61.5811 |
| Error | 431.025 | 10 - 2 = 8 | 53.8781 | |
| Total | 3748.9 | 9 | | |

Step 4: The test of hypotheses
The hypotheses: $H_0: \beta = 0$ versus $H_1: \beta \neq 0$.
The rejection criterion: We will reject $H_0$ if $f_{test} > f_{0.05,1,8}$ [= 5.32 from Table 9 of Lee (2004)].
Decision and conclusion: From the ANOVA table, $f_{test} = 61.5811$, which is very far into the
critical region, i.e. $f_{test} > 5.32$. Therefore, we reject $H_0$ and conclude that the data
provide sufficient evidence to support the existence of a linear relationship between X and Y.
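
The ANOVA table of Example 6 can be reproduced with the sketch below. As in the text, the slope is rounded to 2 d.p. before the sums of squares are computed, and the critical value 5.32 is hard-coded from the statistical table. The variable names are illustrative.

```python
# Data of Example 3 from this chapter.
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]
n = len(x)

s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

beta_hat = round(s_xy / s_xx, 2)   # rounded slope, as used in Example 6

sst = s_yy                # total sum of squares
ssr = beta_hat * s_xy     # regression sum of squares
sse = sst - ssr           # error sum of squares

msr = ssr / 1             # mean square regression (1 degree of freedom)
mse = sse / (n - 2)       # mean square error (n - 2 degrees of freedom)
f_test = msr / mse

F_CRIT = 5.32   # f_{0.05, 1, 8}, as quoted in the text from the F table
reject_h0 = f_test > F_CRIT

print(f"SSR = {ssr:.3f}, SSE = {sse:.3f}, MSE = {mse:.4f}, f_test = {f_test:.4f}")
print(f"reject H0: {reject_h0}")
```

With the unrounded slope the test statistic would be somewhat smaller, about 60.2, but the conclusion is unchanged.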

1. Without looking at any reading material, list down briefly steps involved in testing the
significance of regression. Check your list with your friend who sits next to you and
2. Why is the t-test preferred to the z-test in testing the slope of a linear regression model?
3. Consider the data from Question 1 in Task 4. By using the t-test approach, test the
hypothesis that the regression of blood pressure rise (BPR) on the sound pressure
level (SPL) is linear at the 0.05 level of significance.
[Ans: $t_{test} = 7.3145 > t_{0.025,18} = 2.101$, reject $H_0$, linearity significantly exists.]

4. Refer to the data from Question 2 in Task 4. Test the hypothesis $H_0: \beta = 0$ against
$H_1: \beta \neq 0$ at the level of significance $\alpha = 0.01$. Write your conclusion
clearly.

5. Repeat Question 4 but change the alternative hypothesis to $H_1: \beta > 0$. Use an
appropriate test approach. Will your data provide enough evidence to reject $H_0$?
[Ans: $t_{test} = 9.8448 > t_{0.05,8} = 1.86$, reject $H_0$, positive linearity significantly exists.]
6. Repeat Question 3 but use a one-way ANOVA approach. Compare your current
decision with the previous one.
[Ans: $f_{test} = 53.5015 > f_{0.05,1,18} = 4.41$, reject $H_0$, linearity significantly exists.]
7. Repeat Question 4 but use a one-way ANOVA approach. What is your finding?
[Ans: $f_{test} = 96.9210 > f_{0.05,1,8} = 5.32$, reject $H_0$, linearity significantly exists.]

7.6 Correlation

In the study of linear regression, we consider predicting a value of a response variable, Y,
from knowledge of the independent, or controlled, variable X. In this section, however, we
will consider the problem of measuring the relationship between two variables, X and Y. As
such, we have a correlation analysis which attempts to measure
1. the strength, and
2. the direction
of a relationship between two variables by means of a single number called a correlation
coefficient.

## 7.6.1 Product moment correlation coefficient, r

In particular, a linear correlation coefficient is a measure of the strength and direction of a
linear relationship between two random variables, X and Y, denoted by $\rho$ for population
data and r for sample data. Here, r is known as Pearson's product moment correlation
coefficient, or simply the sample correlation coefficient, defined as

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \tag{7.14}$$

It measures the extent to which the points on a scatter diagram cluster about a straight line.
For example, if we construct a scatter diagram for sample data having n pairs of
measurements $\{(x_i, y_i) : i = 1, 2, \ldots, n\}$

7.6.2 Properties of r
Some properties of r include:
a) $r \in [-1, 1]$.
b) When r is close to 1, it implies that there is a strong positive linear relationship
between X and Y. Furthermore, when r = 1, we have a perfect positive linear
relationship.
c) On the other hand, if r is close to -1, it implies that there is a strong negative linear
relationship between X and Y. Likewise, if we have r = -1, it means that we have a
perfect negative linear relationship.
d) When r is close to zero, from either the positive or the negative direction, it implies that
there is a weak or no linear relationship between X and Y.

## 7.6.3 Interpretation of r values

The scatter diagrams below show three different positive linear relationships between X and Y,
in increasing order of strength:

(a) r = 0.60    (b) r = 0.85    (c) r = 1

Meanwhile, the scatter diagrams below show examples of negative linear correlation between
X and Y, in increasing order of strength:

(a) r = -0.60    (b) r = -0.85    (c) r = -1

Notice that the wider the scatter of the points around a straight line, the weaker the
correlation and hence the closer r is to 0, from either the negative or the positive direction.
The two diagrams below display examples of the absence of a linear relationship
between X and Y. For diagram (b), although r = 0 implies no linear relationship, the
two variables do have a relationship, which is nonlinear (in this case, a quadratic
relationship).

(b) r = 0 (Nonlinear correlation)
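
The point that r = 0 does not rule out a relationship can be checked numerically. In the sketch below, y is exactly the square of x, so the two variables are perfectly related, yet the product moment correlation coefficient is zero. The data and the helper name `pearson_r` are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    s_yy = sum(b * b for b in y) - sum(y) ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: x is symmetric about zero and y = x^2 exactly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [a ** 2 for a in x]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

Here the positive and negative products of deviations cancel exactly, which is why a scatter diagram should always be inspected alongside the value of r.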

Example 7
Compute the product moment correlation coefficient to measure the relationship between the X
and Y variables based on the sample data from Example 3. Comment on your answer.
Solution
The correlation coefficient computed from the sample data is the sample
correlation coefficient, r, given by

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-2612.5}{\sqrt{(2062.5)(3748.9)}} = -0.9395$$

Comment: There is a very strong negative correlation between X and Y.

To obtain the value directly from the calculator, we may use the following operators:

| Operators | Output |
|---|---|
| Shift S-SUM > > 3 | $r = -0.9395$ (to 4 d.p.) |
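
The value obtained in Example 7 can be confirmed with a short script that applies Equation (7.14) to the data of Example 3. This is a sketch with illustrative variable names.

```python
import math

# Data of Example 3 from this chapter.
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [98, 87, 92, 79, 68, 57, 59, 43, 60, 38]
n = len(x)

s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n

r = s_xy / math.sqrt(s_xx * s_yy)    # Equation (7.14)
direction = "negative" if r < 0 else "positive"

print(f"r = {r:.4f} ({direction} linear relationship)")
```

The sign of r always agrees with the sign of the fitted slope, because both take their sign from $S_{xy}$.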

1. Refer to the sample data from Question 1 in Task 4 and measure the strength of the
relationship between blood pressure rise (BPR) and the sound pressure level
(SPL).
[Ans: 0.8650; strong positive correlation]

2. Refer to the sample data from Question 2 in Task 4 and obtain the Pearson product
moment correlation coefficient for the sample data. Comment on your result.
[Ans: 0.9611; very strong positive correlation]

## 7.7 Excel procedures

The steps listed below are the procedures for using Excel. In this case, we consider the
sample data from Question 1 in Task 4.
a) First, store the data in an Excel worksheet as shown in Figure 7.3.

Figure 7.3: Data storage in Excel worksheet for regression analysis

b) Next, click Tools on the menu bar and then choose Data Analysis from the pull-down
menu, followed by Regression from the pop-up menu.
7. The following table lists the measurements of the air velocity and evaporation
coefficient of burning fuel droplets in an impulse engine:

| Air Velocity (cm/sec) | Evaporation Coefficient ( /sec) |
|---|---|
| 20 | 1.8 |
| 60 | 3.5 |
| 100 | 3.7 |
| 140 | 5.6 |
| 180 | 7.5 |
| 220 | 7.8 |
| 260 | 9.8 |
| 300 | 11.6 |
| 340 | 13.7 |
| 380 | 16.5 |
| 420 | 18.6 |
| 460 | 19.5 |

(a) Fit a straight line to these data by using the method of least squares.
(b) Estimate the evaporation coefficient of a droplet when the air velocity is 190 cm/sec.
(c) Test whether the evaporation coefficient of burning fuel droplets in an impulse engine is
positively related to the air velocity at the 0.10 significance level.
(d) Find the Pearson correlation coefficient. Comment on the result.

8. A research department in a university wants to find out if the starting monthly salaries
(in RM100) of recent university graduates in engineering are related to their CGPA.
The Excel output is as follows. Assume that the data are normally distributed.

(a) Find the estimated regression line to fit the above data.
(b) Predict the starting monthly salary if the CGPA is 3.6.
(c) Do the data support the existence of a linear relationship between starting salaries
and CGPA? Test using $\alpha = 0.05$.
(d) Find the Pearson correlation coefficient. What can you infer from the value?

9. A manufacturing company bought a new cutting tool from company A and wanted to
investigate how its useful life (in hours) is related to the speed at which the tool is
operated. The Excel output follows for the useful life of the tool (in hours) and the speed
(in metres per minute).

(a) Build a linear model between useful life and speed.
(b) Predict the useful life if the speed is 55 m/min.
(c) Test the validity of the model built in part (a). Use $\alpha = 0.01$.
(d) Find the correlation. Interpret the value.

10. The following output from Excel gives information on the engine power x (in kilowatt) and the maximum speed y (km/hour) for 12 racing cars.

(a) Find the least squares estimates of the regression line for the engine power against the
maximum speed.
(b) What does the estimate of imply?
(c) What is the predicted maximum speed if the engine power is 72 kilowatt?
(d) Is there any evidence that the data strongly suggest a linear association between the
engine power and the maximum speed at the 0.01 significance level?
(e) Find the correlation between the engine power and the maximum speed. Explain your

Chapter 8

Nonparametric Statistics
Learning Objectives:
At the end of this chapter, students should be able to:
a) recognize the situations for nonparametric application.
b) understand and apply the sign test.
c) understand and apply the run test.
d) understand and apply the Mann-Whitney test.
e) understand and apply the Wilcoxon signed-rank test.
f) compute the Spearman's rank correlation coefficient.

8.1 Introduction
There are four types of data, namely nominal, ordinal, interval scale and ratio scale data. An example of nominal data is gender, where male may be represented as 1 and female as 2. The numbers are used only to identify the categories of the gender variable. Data that can be ordered from the lowest to the highest value are ordinal data; an example is feeling towards school, which can be categorized and ordered as very unhappy, unhappy, somewhat happy, happy and very happy. To understand interval scale data, we start with an example: temperature. A reading of 0 °C does not mean there is no temperature, and 50 °C is not twice as hot as 25 °C. In contrast, for ratio scale data, 0 m of length means there is no length and 50 m is twice the length of 25 m. Length, weight and density are some examples of ratio scale data.

Statistical methods that we have discussed before, such as the t-test, ANOVA and regression, deal with interval scale or ratio scale data, and the data being analyzed are assumed to come from a population with a specific probability distribution. For example, in the t-test, the population from which a random sample is selected is assumed to be normally distributed with mean $\mu$ and variance $\sigma^2$. In general, these techniques are classed as parametric statistics. This chapter discusses an alternative to parametric statistics, namely non-parametric statistics (NPS). Parametric statistics is capable of analyzing interval scale and ratio scale data: the mean and variance for these data can be calculated, interpreted and used in the analysis. This is not so for nominal and ordinal data.
For example, consider the nominal data gender with categories male and female. Surely the mean of gender has no meaning. NPS is the method to use when dealing with such data.
In general, a statistical technique is categorized as NPS if it has at least one of the
following characteristics:
1. The method is used on nominal data.
2. The method is used on ordinal data.
3. The method is used on interval scale or ratio scale data but there is no assumption
regarding the probability distribution of the population where the sample is selected.

8.2 Sign Test

We have seen the test of population proportion that uses the sampling distribution

$$P \sim N\left(\pi,\ \frac{\pi(1-\pi)}{n}\right)$$

for large sample size n. The sign test is a test of the population proportion for testing $\pi = 0.5$ in a small sample situation (usually for $n \le 20$).
To understand how the sign test works, let us look at this example.
A study is conducted to see the preference of hand-phone users towards two brands of hand-phones, A and B, by asking the views of 12 users. Specifically, this study is done to see if the preferences are the same towards the two brands.
If there is no difference in preference, then we can anticipate that the proportion of users who prefer brand A is the same as, or about equal to, the proportion of users who prefer brand B. Since there are only two brands being tested, the proportion of users preferring brand A is 0.5, and similarly for brand B, if there is no difference in brand preference.
If the proportion of users preferring brand A is greater than that of brand B, we can anticipate that the number of users preferring brand A will be a lot higher than the number of users preferring brand B. On the other hand, if the proportion of users preferring brand B is greater than that of brand A, we can anticipate that the number of users who prefer brand A will be a lot lower than the number of users preferring brand B.
This forms our hypotheses

$$H_0: \pi = 0.5$$

$$H_1: \pi \ne 0.5$$

where $\pi$ is the proportion of the population of users preferring brand A.
Now, we have 12 subjects who named their preferences. Let X be a random variable representing the number of users who prefer brand A, and assume $H_0$ is true. Then X follows the Binomial distribution with n = 12 and $\pi = 0.5$, or simply

$$X \sim \text{Bin}(12, 0.5)$$

For notational purposes, let those who prefer brand A be represented by the sign + and those who prefer brand B by the sign -; hence the name sign test. So the random variable X is redefined to represent the number of + signs, and $X \sim \text{Bin}(12, 0.5)$. Our alternative hypothesis $H_1: \pi \ne 0.5$ indicates that we have a two-tailed test with two rejection regions. Suppose this test is done at significance level $\alpha = 0.05$. This means we would reject $H_0$ if $X \le a$ or $X \ge b$, i.e. we would reject $H_0$ if the number of + signs is at most a or at least b. The issue now is to find the values of a and b.
By the nature of a two-tailed test we know that $P(X \le a) + P(X \ge b) \approx 0.05$. Now for $X \sim \text{Bin}(12, 0.5)$, the probability distribution of X is

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$$

with n = 12 and p = 0.5, for x = 0, 1, 2, ..., 12. The probability for each value of x is shown in the table below:

| X = x | P(X = x) |
|---|---|
| 0 | 0.0002 |
| 1 | 0.0029 |
| 2 | 0.0161 |
| 3 | 0.0537 |
| 4 | 0.1208 |
| 5 | 0.1934 |
| 6 | 0.2256 |
| 7 | 0.1934 |
| 8 | 0.1208 |
| 9 | 0.0537 |
| 10 | 0.0161 |
| 11 | 0.0029 |
| 12 | 0.0002 |

If we decide to reject $H_0$ when $X \le 2$ or $X \ge 10$, we can see that the significance level is

$$P(X \le 2) + P(X \ge 10) = P(X=0) + P(X=1) + P(X=2) + P(X=10) + P(X=11) + P(X=12)$$

$$= 0.0002 + 0.0029 + 0.0161 + 0.0161 + 0.0029 + 0.0002 = 0.0384$$

which is less than our chosen $\alpha = 0.05$.
If we decide to reject $H_0$ when $X \le 3$ or $X \ge 9$, we can see that the significance level is

$$P(X \le 3) + P(X \ge 9) = 0.146$$

which is a lot more than our chosen $\alpha = 0.05$.
Since the value 0.0384 is closer to 0.05 than 0.146, it is reasonable to make our decision rule: reject $H_0$ if the number of + signs is at most 2 or at least 10.
However, with this rule, our significance level is not exactly 0.05 but 0.0384 (0.0386 if the unrounded probabilities are summed).
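The table and the two candidate rejection regions above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `binom_pmf` and `two_tailed_size` are illustrative and not part of the text.

```python
from math import comb

def binom_pmf(x, n, p=0.5):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def two_tailed_size(a, b, n, p=0.5):
    """Actual significance level P(X <= a) + P(X >= b) of the rule
    'reject H0 if X <= a or X >= b'."""
    lower = sum(binom_pmf(x, n, p) for x in range(0, a + 1))
    upper = sum(binom_pmf(x, n, p) for x in range(b, n + 1))
    return lower + upper

n = 12
for x in range(n + 1):
    print(x, round(binom_pmf(x, n), 4))   # reproduces the probability table

print(round(two_tailed_size(2, 10, n), 4))  # about 0.0386 (exact, unrounded sum)
print(round(two_tailed_size(3, 9, n), 4))   # about 0.146
```

Because the binomial distribution is discrete, only a few significance levels are attainable, which is why the rule with a = 2 and b = 10 is used even though its level is below 0.05.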

Now, back to our sample of 12 persons: 11 of them prefer brand A. Since 11 is at least 10, we reject $H_0$ and conclude that the data provide evidence of a difference in brand preference at significance level $\alpha = 0.05$.
The sign test uses the binomial distribution as the decision rule. In general, we have three choices for our hypotheses:
1. Choice 1

$H_0: \pi = 0.5$

$H_1: \pi \ne 0.5$

2. Choice 2

$H_0: \pi = 0.5$

$H_1: \pi > 0.5$

3. Choice 3

$H_0: \pi = 0.5$

$H_1: \pi < 0.5$

Choice 1: This is a two-tailed test with the rejection regions $X \le a$ or $X \ge b$. The value of a is such that $P(X \le a) \approx \alpha/2$ and the value of b is such that $P(X \ge b) \approx \alpha/2$. The graph is shown in Figure 8.2.

Figure 8.2: A two-tailed sign test.

Choice 2: This is a one-tailed test on the right with the rejection region $X \ge a$.
The value of a is such that $P(X \ge a) \approx \alpha$. The graph is shown in Figure 8.3.

Figure 8.3: A right one-tailed sign test.

Choice 3: This is a one-tailed test on the left with the rejection region $X \le a$. The
value of a is such that $P(X \le a) \approx \alpha$. The graph is shown in Figure 8.4.

Figure 8.4: A left one-tailed sign test.

Example 1

Ten engineering students went on a diet program in an attempt to lose weight, with the following results:

| Name | Weight before | Weight after |
|---|---|---|
| Abu | 69 | 58 |
| Ah Lek | 82 | 73 |
| Sami | 76 | 70 |
| Kassim | 89 | 71 |
| Chong | 93 | 82 |
| Raja | 79 | 66 |
| Busu | 72 | 75 |
| Wong | 68 | 71 |
| Ali | 83 | 67 |
| Tan | 103 | 73 |

Is the diet program an effective means of losing weight? Do the test at significance level $\alpha = 0.10$.
Solution
Let the sign + indicate Weight before - Weight after > 0, and the sign - indicate Weight before - Weight after < 0.
Thus

| Name | Weight before | Weight after | Sign |
|---|---|---|---|
| Abu | 69 | 58 | + |
| Ah Lek | 82 | 73 | + |
| Sami | 76 | 70 | + |
| Kassim | 89 | 71 | + |
| Chong | 93 | 82 | + |
| Raja | 79 | 66 | + |
| Busu | 72 | 75 | - |
| Wong | 68 | 71 | - |
| Ali | 83 | 67 | + |
| Tan | 103 | 73 | + |

The + sign indicates the diet program is effective in reducing weight.

$H_0: \pi = 0.5$

$H_1: \pi > 0.5$

where $\pi$ is the proportion of + signs in the population.
Let X represent the number of + signs. Assuming $H_0$ is correct, $X \sim \text{Bin}(10, 0.5)$.
The observed number of + signs is 8 and the probability of getting at least 8 + signs is
$P(X \ge 8) = 1 - P(X \le 7) = 1 - 0.9453 = 0.0547$
which is less than $\alpha = 0.10$. Thus, we reject $H_0$ and conclude that there is sufficient evidence that the
diet program is an effective programme to reduce weight.
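The calculation in Example 1 can be reproduced with a short script. This is a sketch only: the function name `sign_test_upper_p` is hypothetical, and dropping tied pairs before counting signs is an assumption (no ties occur in this example).

```python
from math import comb

def sign_test_upper_p(before, after):
    """Right-tailed sign test on paired data.

    Returns (number of + signs, number of non-tied pairs, P(X >= observed))
    where X ~ Bin(n, 0.5) under H0. Tied pairs are dropped (an assumption).
    """
    diffs = [b - a for b, a in zip(before, after) if b != a]
    n = len(diffs)
    plus = sum(1 for d in diffs if d > 0)
    p_value = sum(comb(n, k) for k in range(plus, n + 1)) / 2**n
    return plus, n, p_value

before = [69, 82, 76, 89, 93, 79, 72, 68, 83, 103]
after = [58, 73, 70, 71, 82, 66, 75, 71, 67, 73]

plus, n, p_value = sign_test_upper_p(before, after)
print(plus, n, round(p_value, 4))  # 8 10 0.0547
```

Since 0.0547 is below 0.10 the null hypothesis is rejected, in agreement with the solution; at a 0.05 level the same data would not lead to rejection.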

Example 2
Sixteen students were asked for their views on their college's new regulation of not
allowing students to drive on campus. Thirteen of them oppose the ruling while 3 of them agree
with it. Is there evidence to support the hypothesis that a minority of students support the
new ruling at significance level $\alpha = 0.05$?
Solution
Let X represent the number of students supporting the ruling.

$H_0: \pi = 0.5$

$H_1: \pi < 0.5$

Assuming $H_0$ is correct, $X \sim \text{Bin}(16, 0.5)$. The observed X is 3. Using this distribution,
$P(X \le 3) = 0.0106$
which is less than $\alpha = 0.05$. Thus we reject $H_0$ and conclude that there is sufficient evidence that a
minority of students support the ruling.

Example 2
A paint supplier claims that a new additive will reduce the drying time of its acrylic
paint. To test this claim, 8 panels of wood are painted, with one side of each panel painted with paint
containing the new additive and the other side with paint containing the regular additive. The
drying times, in hours, were recorded as follows:

| Panel | New additive | Regular additive |
|---|---|---|
| 1 | 6.4 | 6.6 |
| 2 | 5.8 | 5.8 |
| 3 | 7.4 | 7.8 |
| 4 | 5.5 | 5.7 |
| 5 | 6.3 | 6.0 |
| 6 | 7.8 | 8.4 |
| 7 | 8.6 | 8.8 |
| 8 | 8.2 | 8.4 |

Use the sign test at the 0.05 level to test the hypothesis that the new additive has the
same drying time as the regular additive.
[Ans: $P(X \le 1) = 0.0625 > 0.025$ or $P(X \ge 1) = 0.9922 > 0.025$; fail to reject $H_0$ and
conclude that the new and regular additives have the same drying time.]

In cases where the number of subjects is large (n > 20), the normal approximation can be used
as a decision rule: if X is a random variable representing the number of + signs, then under $H_0$, approximately,

$$X \sim N(0.5n,\ 0.25n)$$
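As an illustration of the large-sample rule, the sketch below standardizes an observed count of + signs using mean 0.5n and variance 0.25n. The counts used (21 plus signs out of 30) are hypothetical and not taken from the text, and no continuity correction is applied, which is an assumption of this sketch.

```python
from math import sqrt, erf

def sign_test_z(x, n):
    """Standardize the number of + signs under H0: mean 0.5n, variance 0.25n."""
    return (x - 0.5 * n) / sqrt(0.25 * n)

def std_normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical data: 21 plus signs among n = 30 non-tied pairs.
z = sign_test_z(21, 30)
right_tail_p = 1 - std_normal_cdf(z)
print(round(z, 4), round(right_tail_p, 4))
```

The resulting z value is compared with the standard Normal critical value for the chosen significance level, one-tailed or two-tailed as appropriate.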

8.3 Run Test

Consider a football team A with the following results in 12 games:
W W W W W W W W W W W W

It must be a good team to win 12 consecutive games; these wins are neither by chance nor random. Based on these results, we can easily predict the outcome of the
next game.
Consider another football team B with the following results in 12 games:
W L W L W L W L W L W L

Based on these results, we can also anticipate the result of the next game. The team's performance
is predictable and the results are not random.
Consider another football team C with the following results in 12 games.
W

W L

Are these results a random event, i.e. do these results occur by chance?

Definition: A run is a sequence of one or more consecutive occurrences of the same outcome
in a sequence of occurrences in which there are only two possible outcomes.
For team A, there is only one run, with 12 Ws and 0 Ls.
W W W W W W W W W W W W
For team B, there are 12 runs, with 6 Ws and 6 Ls.
W L W L W L W L W L W L

W
W

L
L

L
W L

W
W

Our objective is to test the following hypotheses:

$H_0$: The outcome of the game is random

$H_1$: The outcome of the game is not random

For team A, we see that the outcome is not random and the number of runs is the minimum, 1.
For team B, we see that the outcome is not random and the number of runs is the maximum,
12. So, too many runs or too few runs indicate that the outcome is not random.
Let
R = the number of runs

$n_1$ = number of W
$n_2$ = number of L

$n = n_1 + n_2$
It is a tedious job to construct the probability distribution of R for higher values of $n_1$ and
$n_2$. With the probability distribution, we are able to build the rule for accepting and
rejecting $H_0$. As we have said earlier, a small value of R or a large value of R indicates that the
outcome is not random; thus the test of randomness is a two-tailed test. This test of
randomness is called the run test.
Since the run test is a two-tailed test, we would reject $H_0$ if the observed number of
runs satisfies $R \le a$ or $R \ge b$. The values a and b are chosen in such a way that $P(R \le a) \le \alpha/2$ and $P(R \ge b) \le \alpha/2$.

W

$R \le 2$ or $R \ge 6$. Since R = 5, we accept $H_0$ and conclude that the outcome is
random.
It is quite a tedious job to construct the probability distribution of the number of runs
R each time we perform a run test. Table 13 page 43 in Lee (2004) provides the critical values
to accept or reject $H_0$ at various significance levels.

Example 3

A machine cuts plywood with mean length 100 cm and standard deviation 1 cm. Fifteen
pieces of plywood produced consecutively by this machine have the following lengths (in cm), read row by row:

99.5 99.8 100.1 100.1 100.2
99.5 100.6 99.8 100.2 100.3
99.0 99.7 100.3 100.5 99.9

Can we conclude that the length of plywood cut by this machine is random above and below
the mean length 100 cm at significance level $\alpha = 0.05$?
Solution
Let + indicate a length of plywood which is over 100 cm and - indicate a length
which is below 100 cm. The outcome is thus
- - + + + - + - + + - - + + -
with n = 15, $n_1 = 8$ and $n_2 = 7$, where $n_1$ is the number of + and $n_2$ is the number of -.

$H_0$: The length is random

$H_1$: The length is not random

The number of observed runs is R = 9. Using the statistical table, we would reject $H_0$ if
$R \le 4$ or $R \ge 13$ and we accept $H_0$ if $5 \le R \le 12$. Since the observed R = 9, we accept $H_0$
and conclude that there is no evidence that the length of plywood cut by the machine
is not random.
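The run count and the critical values quoted in Example 3 can be verified with a short sketch. Two assumptions are made here: the 15 lengths are taken in the order that reproduces $n_1 = 8$, $n_2 = 7$ and R = 9 stated in the solution, and the exact distribution of R is computed from the standard combinatorial formula for runs rather than read from the statistical table. The function names are illustrative.

```python
from math import comb

def count_runs(seq):
    """Number of runs: one more than the number of changes between neighbours."""
    return 1 + sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)

def runs_pmf(r, n1, n2):
    """Exact P(R = r) under randomness, for r >= 2 and n1, n2 >= 1."""
    total = comb(n1 + n2, n1)
    k, odd = divmod(r, 2)
    if odd:   # r = 2k + 1
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    else:     # r = 2k
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    return ways / total

# Assumed production order of the 15 lengths (cm).
lengths = [99.5, 99.8, 100.1, 100.1, 100.2,
           99.5, 100.6, 99.8, 100.2, 100.3,
           99.0, 99.7, 100.3, 100.5, 99.9]
signs = ['+' if x > 100 else '-' for x in lengths]
n1, n2 = signs.count('+'), signs.count('-')
R = count_runs(signs)
print(''.join(signs), n1, n2, R)

lower_tail = sum(runs_pmf(r, n1, n2) for r in range(2, 5))          # P(R <= 4)
upper_tail = sum(runs_pmf(r, n1, n2) for r in range(13, n1 + n2 + 1))  # P(R >= 13)
print(round(lower_tail, 4), round(upper_tail, 4))
```

Both tail probabilities are below 0.025, while extending either region by one more value of R pushes its probability above 0.025, which is consistent with the tabulated rule of rejecting when $R \le 4$ or $R \ge 13$.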

The share price index for 18 consecutive days is as follows:

+ + + + + ++ + +
where + indicates a price increase from the previous day and - indicates a price
decrease from the previous day. Is the price increase or decrease a random event at
significance level $\alpha = 0.05$?
[Ans: $5 < R = 11 < 15$, fail to reject $H_0$ and conclude that the price increase or decrease is a
random event.]

In an industrial production line, items are inspected daily for defective items. The
following is a sequence of defective items, D, and non-defective items, N, produced by this
production line:
D

Use the runs test to determine whether the defective items are occurring at random. Let
$\alpha = 0.05$.
[Ans: $4 < R = 10 < 14$, we fail to reject $H_0$ and conclude that the defective items are occurring at random.]

If either $n_1 > 15$ or $n_2 > 15$, the sample is considered large for the purpose of
applying the run test. The Normal approximation comes in handy with the following
statement.
For large values of $n_1$ and $n_2$, the distribution of R (the number of runs in the sample)
is approximately Normal with mean

$$\mu_R = \frac{2n_1 n_2}{n_1 + n_2} + 1$$

and variance

$$\sigma_R^2 = \frac{2n_1 n_2 (2n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)},$$

i.e.

$$R \sim N\left(\frac{2n_1 n_2}{n_1 + n_2} + 1,\ \frac{2n_1 n_2 (2n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}\right)$$

and

$$Z = \frac{R - \left(\dfrac{2n_1 n_2}{n_1 + n_2} + 1\right)}{\sqrt{\dfrac{2n_1 n_2 (2n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}} \sim N(0, 1)$$

In this case we can use the standard Normal distribution to find the critical values of z for the
given significance level $\alpha$.
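The large-sample statistic can be computed directly from the mean and variance of R given above. The sketch below uses hypothetical counts (22 runs with $n_1 = 20$ and $n_2 = 18$) purely for illustration; the function name `runs_z` is not from the text.

```python
from math import sqrt

def runs_z(r, n1, n2):
    """Mean, variance and z statistic for the number of runs R (large samples)."""
    mean = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (r - mean) / sqrt(var)
    return mean, var, z

# Hypothetical sequence summary: 22 runs, 20 of one outcome and 18 of the other.
mean, var, z = runs_z(r=22, n1=20, n2=18)
print(round(mean, 4), round(var, 4), round(z, 4))
```

For a two-tailed test at level $\alpha$, randomness is rejected when the absolute value of z exceeds the standard Normal critical value $z_{\alpha/2}$.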

8.4 Some Methods Based on Ranks

8.4.1 Introduction
Often enough we are dealing with data in the form of ranks as in the case of ordinal data. For
instance, a study may involve the feelings of students towards this subject which can be
categorized as Very Unhappy, Unhappy, Somewhat Happy, Happy and Very Happy.
The feelings can be ordered or ranked where rank 1 represents the lowest feeling Very
Unhappy, rank 2 the second lowest feeling Unhappy and so forth. This section describes
some statistical methods in dealing with such data.
8.4.2 Mann-Whitney Test
The Mann-Whitney test, sometimes referred to as the Wilcoxon rank-sum test, is used to test
whether the location measures (such as means) of two different populations are identical. One
independent random sample is required from each population. Let $x_1, x_2, ..., x_n$ and
$y_1, y_2, ..., y_m$ be two random samples of sizes n and m, where $n \le m$, from populations X and
Y respectively. We wish to test the hypothesis that the two distributions X and Y are the
same. The hypotheses are

$$H_0: P(X) = P(Y)$$

$$H_1: P(X) \ne P(Y)$$

Assign the ranks 1 to n + m to the combined samples, where the smallest value from both samples is
assigned rank 1, the second smallest value is assigned rank 2, and so on. The highest value is
assigned rank n + m. Let $R(X_i)$ and $R(Y_j)$ denote the ranks assigned to $X_i$ and $Y_j$ for all i
and j. For convenience let N = m + n. The sum of the ranks assigned to the sample from population X can be
used as a test statistic,

$$T = \sum_{i=1}^{n} R(X_i)$$

Consider these data:

| Sample | Rank |
|---|---|
| X | 1 |
| X | 2 |
| X | 3 |
| Y | 4 |
| Y | 5 |
| Y | 6 |

We see that

$$T_1 = \sum_{i=1}^{3} R(X_i) = 1 + 2 + 3 = 6$$

and

$$T_2 = \sum_{j=1}^{3} R(Y_j) = 4 + 5 + 6 = 15$$

On one hand, when the sample sizes for both samples are the same, we would expect

$$T_1 = \sum R(X_i) \approx T_2 = \sum R(Y_j)$$

if both populations X and Y are the same. However, if they are significantly different, we
would expect $T_1$ to differ significantly from $T_2$, so that
$T_1$ would be very small or very large.
On the other hand, when the sample sizes differ, a rather small $T_1$ or large $T_1$ gives
some indication that the populations differ. Comparison of $T_1$ with $T_2$ is not appropriate
with differing sample sizes, because the two sums contain different numbers of ranks. Thus, the
inferential aspect must only consider either $T_1$ alone or $T_2$ alone.
Table A7 of W. J. Conover (1971) provides the critical values for rejection of $H_0$
for various values of n and m. The table provides the quantiles $w_p$ such that $P(T < w_p) \le p$. For example, consider
n = 5 and m = 7. The value 15 corresponding to p = 0.001 means $P(T < 15) \le 0.001$ and the
value 22 corresponding to p = 0.05 means $P(T < 22) \le 0.05$. Thus, for a left-tailed test, we would reject
$H_0$ if the observed $T = \sum R(X_i) < 22$ at $\alpha = 0.05$, with 22 as the left critical
value. The right-hand-side critical value is obtained by $n(N + 1) - w_p$. So the right-hand-side
critical value is 5(5 + 7 + 1) - 22 = 43, i.e. $P(T > 43) \le 0.05$. Thus, for a two-sided test, we would reject
$H_0$ if T < 22 or T > 43 at $\alpha = 0.10$, which corresponds to p = 0.05 in each tail.
However, when n and m are large,

$$T \sim N\left(\frac{n(N + 1)}{2},\ \frac{nm(N + 1)}{12}\right)$$

Example 4
The data below show the marks obtained by electrical engineering students in an
examination:

| Gender | Marks |
|---|---|
| Male | 60 |
| Male | 62 |
| Male | 78 |
| Male | 83 |
| Female | 40 |
| Female | 65 |
| Female | 70 |
| Female | 88 |
| Female | 92 |

Can we conclude that the achievements of male and female students are identical at significance
level $\alpha = 0.1$?
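Solution

$H_0$: Male and Female achievements are the same.

$H_1$: Male and Female achievements are not the same.

Let the random variable X represent the gender Male and Y represent the gender Female.

| Gender | Random variable | Marks | Rank |
|---|---|---|---|
| Male | X | 60 | 2 |
| Male | X | 62 | 3 |
| Male | X | 78 | 6 |
| Male | X | 83 | 7 |
| Female | Y | 40 | 1 |
| Female | Y | 65 | 4 |
| Female | Y | 70 | 5 |
| Female | Y | 88 | 8 |
| Female | Y | 92 | 9 |

n = 4, m = 5.

$$T_1 = \sum_{i=1}^{4} R(X_i) = 2 + 3 + 6 + 7 = 18$$

and

$$T_2 = \sum_{j=1}^{5} R(Y_j) = 1 + 4 + 5 + 8 + 9 = 27$$

For n = 4 and m = 5, the left critical value is 13 and the right critical value is $n(N + 1) - w_p = 4(4 + 5 + 1) - 13 = 27$. Since $13 < T_1 = 18 < 27$, we fail to reject $H_0$ and conclude that the achievements of
Male and Female are not significantly different.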
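The ranking step in Example 4 can be sketched in a few lines. The helper names below are illustrative, and giving tied values the average of the ranks they occupy is an assumption of this sketch (there are no ties in this example).

```python
def average_ranks(values):
    """Rank values from 1 to N; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sums(x, y):
    """Rank the combined sample and return (T1, T2), the rank sums of x and y."""
    ranks = average_ranks(list(x) + list(y))
    return sum(ranks[:len(x)]), sum(ranks[len(x):])

male = [60, 62, 78, 83]
female = [40, 65, 70, 88, 92]
t1, t2 = rank_sums(male, female)
n, m = len(male), len(female)
print(t1, t2)  # 18.0 27.0
```

A useful check is that the two rank sums always add up to N(N + 1)/2, which is 45 for N = 9; with $T_1 = 18$ this gives $T_2 = 27$.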

The petrol consumption (in km/liter petrol) for several Proton Wira 1.5 models for two
brands of petrol is shown below:

| Brand | Petrol consumption (km/liter) |
|---|---|
| Petrobus | 11.9, 12.5, 10.5, 10.4, 10.8 |
| Procat | 8.9, 10.0, 9.5, 11.2, 13.0, 10.7 |

Can we conclude both brands of petrol give equal mileage at significance level
$\alpha = 0.05$?
[Ans: $19 < T_1 = 35 < 41$, fail to reject $H_0$ and conclude that both brands of petrol give the same
mileage.]

The following data represent the number of hours that two different digital cameras
operate before a recharge is required.

| Camera A | Camera B |
|---|---|
| 5.2 | 5.8 |
| 5.4 | 6.1 |
| 6.2 | 6.2 |
| 6.5 | 6.2 |
| 6.3 | 6.6 |
| 5.8 | 6.8 |
| 6.2 | 5.9 |
| 5.4 | 5.8 |
| 5.8 | 6.3 |

Use the Mann-Whitney test with $\alpha = 0.1$ to determine if camera A operates longer
than camera B on a full battery charge.
[Ans: $T_1 = 70.5 < 100$, fail to reject $H_0$ and conclude that there is no significant evidence from the
data, at $\alpha = 0.1$, that Camera A operates longer than Camera B on a full battery charge.]

8.4.3 Wilcoxon Signed-Rank Test for Two Dependent Samples

The Wilcoxon signed-rank test for two dependent samples or paired samples is used to test
whether two populations from which these samples are drawn are identical. For example, we
might want to test whether the weight of persons before and after going through a diet
program is the same or not. Each person will have two weight measurements; before and after
going through the diet program. So we have one sample for the weight before going through
the diet program and one sample for the weight after going through the diet program. Since
the two measurements come from the same person, the samples are dependent which is also
known as paired samples. To understand this technique, we start with an example.

Example 5
Consider the following data, which record the weight (in kg) of 8 students before and
after going through a diet program intended to reduce their weight.

| Subject | Before (Y) | After (X) |
|---|---|---|
| A | 70 | 62 |
| B | 75 | 70 |
| C | 68 | 58 |
| D | 60 | 61 |
| E | 73 | 61 |
| F | 80 | 60 |
| G | 65 | 54 |
| H | 63 | 66 |

First we need to calculate the difference of weight before - after, i.e.
$d_i = y_i - x_i$. Then we rank the $d_i$ ignoring the negative sign (if any). This means we rank the
modulus of $d_i$, $|d_i|$. Let these ranks be denoted by R. Next, we give each rank the sign of the
corresponding $d_i$. Let these signed ranks be denoted by $R(d_i)$. So we would have

| Subject | Before (Y) | After (X) | $d_i = y_i - x_i$ | R | $R(d_i)$ |
|---|---|---|---|---|---|
| A | 70 | 62 | 8 | 4 | 4 |
| B | 75 | 70 | 5 | 3 | 3 |
| C | 68 | 58 | 10 | 5 | 5 |
| D | 60 | 61 | -1 | 1 | -1 |
| E | 73 | 61 | 12 | 7 | 7 |
| F | 80 | 60 | 20 | 8 | 8 |
| G | 65 | 54 | 11 | 6 | 6 |
| H | 63 | 66 | -3 | 2 | -2 |

We make the following assumptions when using this technique:
1. The distribution of $R(d_i)$ is symmetric.
2. The $R(d_i)$ are mutually independent.
3. The $R(d_i)$ have the same median.

The hypotheses for this test are as follows:

$H_0$: The weight before and after is the same

$H_1$: The weight before and after is not the same

Let $R^+(d_i)$ denote the $R(d_i)$ which are positive and $R^-(d_i)$ denote the $R(d_i)$ which are negative.
The logic is: if both the populations of weight before and after are the same, then we
can anticipate

$$T^+ = \sum R^+(d_i) \approx T^- = \sum |R^-(d_i)|$$

Since $R(d_i)$ is assumed to be symmetric, under $H_0$ the mean of $R(d_i)$ is 0 and the
median of $R(d_i)$ is 0. Thus the hypotheses stated above can be interpreted as

$H_0$: median of $R(d_i) = 0$

$H_1$: median of $R(d_i) \ne 0$

We can have the usual one-tailed tests

$H_0$: median of $R(d) = 0$

$H_1$: median of $R(d) > 0$

or

$H_0$: median of $R(d) = 0$

$H_1$: median of $R(d) < 0$

and the two-tailed test

$H_0$: median of $R(d) = 0$

$H_1$: median of $R(d) \ne 0$

This means we would reject $H_0$ if $T^+ \le a$ or $T^- \le a$ for the two-tailed test. This
rejection rule makes it simpler for us, as we only need to consider the lower of $T^+$ and
$T^-$ in our sample. For larger n it is a tedious job to construct the probability distribution of
$R(d)$. Table (Hisyam Lee's table) lists the critical points for accepting $H_0$ for various values
of $\alpha$.
Going back to the before-after weight example, we see that $T^+ = 33$ and $T^- = 3$. At
significance level $\alpha = 0.05$, Table (Hisyam's table) gives the critical point with n = 8 as 3.
This means that we would reject $H_0$ if $T^+ \le 3$ or $T^- \le 3$. Since the lower of the two values
is $T^- = 3$, which is exactly the same as the critical value 3, we reject $H_0$ and accept $H_1$.
Thus we conclude that there is evidence that the weight before and after going through
the diet program is not equal.
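The signed-rank sums for the weight example can be reproduced with the sketch below. The function names are illustrative; dropping zero differences and averaging the ranks of tied absolute differences are assumptions of this sketch (neither situation occurs in Example 5).

```python
def average_ranks(values):
    """Rank values from 1 to N; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def signed_rank_sums(before, after):
    """Return (T_plus, T_minus) for the differences d = before - after."""
    diffs = [b - a for b, a in zip(before, after) if b != a]  # drop zeros
    ranks = average_ranks([abs(d) for d in diffs])
    t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return t_plus, t_minus

before = [70, 75, 68, 60, 73, 80, 65, 63]
after = [62, 70, 58, 61, 61, 60, 54, 66]
t_plus, t_minus = signed_rank_sums(before, after)
print(t_plus, t_minus)  # 33.0 3.0
```

The smaller of the two sums, here 3, is the value compared with the tabulated critical point.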
Table below summarizes the various test procedures for both one-tailed and two tailed
test:

A semi-conductor manufacturer claims that its production operators have increased
their hand-insert ability speed after attending a course. The following table gives the hand-insert ability speed of 8 operators before and after they attended the course:

Before 74 65 78 81 55 61 80
After 62 83 100 68 59 105 66 87 65

Using the 2.5% significance level, can we conclude that attending the course increases the
hand-insert ability speed of the operators?
[Ans: Since T = 25.5 < 33, we fail to reject $H_0$ and conclude that the course does not increase the
operators' hand-insert ability speed.]

The following data give the number of industrial accidents in ten manufacturing
plants for one-month periods before and after an intensive promotion on safety:

| Plant | Before | After |
|---|---|---|
| 1 | 3 | 2 |
| 2 | 4 | 3 |
| 3 | 3 | 1 |
| 4 | 6 | 3 |
| 5 | 8 | 4 |
| 6 | 4 | 1 |
| 7 | 5 | 4 |
| 8 | 6 | 5 |
| 9 | 7 | 6 |
| 10 | 8 | 4 |

Do the data support the claim that the campaign was successful in reducing accidents?
Use $\alpha = 0.05$.

[Ans: Since T = 55 > 44, we reject $H_0$ and conclude that the campaign was successful in reducing
accidents at $\alpha = 0.05$.]

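In a Wilcoxon signed-rank test for two dependent samples, when the sample size is
large (n > 15), the statistics $T^+$ and $T^-$ are approximately Normal with mean

$$\mu_T = \frac{n(n + 1)}{4}$$

and variance

$$\sigma_T^2 = \frac{n(n + 1)(2n + 1)}{24},$$

written as

$$T \sim N\left(\frac{n(n + 1)}{4},\ \frac{n(n + 1)(2n + 1)}{24}\right)$$

Thus,

$$Z = \frac{T - \dfrac{n(n + 1)}{4}}{\sqrt{\dfrac{n(n + 1)(2n + 1)}{24}}} \sim N(0, 1)$$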
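A minimal sketch of this large-sample statistic follows. The inputs (T = 60 with n = 20 pairs) are hypothetical values chosen only to illustrate the calculation, and the function name is not from the text.

```python
from math import sqrt

def signed_rank_z(t, n):
    """z statistic for a signed-rank sum T based on n non-zero differences."""
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    return (t - mean) / sqrt(var)

# Hypothetical values: T = 60 observed from n = 20 pairs.
z = signed_rank_z(t=60, n=20)
print(round(z, 4))
```

Here the mean is 105 and the variance is 717.5, so a rank sum of 60 lies below the mean and gives a negative z value.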

8.5 Measure of Association

8.5.1 Spearman Rank Correlation Coefficient

We have seen that the correlation coefficient r measures the linear relationship between two
continuous variables X and Y.
A measure of correlation for ranked data, based on the definition of the Pearson correlation
coefficient and applicable where there are no ties or few ties, is called the Spearman rank correlation
coefficient. It is denoted by $r_s$ and is given by

$$r_s = 1 - \frac{6T}{n(n^2 - 1)}$$

where

$$T = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} \left[R(X_i) - R(Y_i)\right]^2$$

and
- $R(X_i)$ is the rank assigned to $x_i$.
- $R(Y_i)$ is the rank assigned to $y_i$.
- $d_i$ is the difference between the ranks assigned to $x_i$ and $y_i$.
- n is the number of pairs of data.
Usually the value of $r_s$ is close to the value obtained by finding r based on numerical
measurements. The interpretation of $r_s$ is similar to the interpretation of r, in which a value of
+1 or -1 indicates perfect association between X and Y. The plus sign indicates identical
rankings and the minus sign indicates reverse rankings. When $r_s$ is zero or close to zero,
we would conclude that the variables are uncorrelated.
Some advantages in using $r_s$ rather than r are:
1. The underlying relationship between X and Y is not assumed to be linear. Thus,
when the data possess a distinct curvilinear relationship, the rank correlation
coefficient will likely be more reliable than the conventional measure r.
2. The normality assumption concerning the distributions of X and Y is not necessary.
3. $r_s$ can be used when meaningful numerical measurements are not possible, such as when dealing with
ordinal data, but rankings can nevertheless be established.

Example 6
The data below show the effect of the mole ratio of sebacic acid on the intrinsic viscosity of
copolyesters.

| Mole ratio | Viscosity |
|---|---|
| 1.0 | 0.45 |
| 0.9 | 0.20 |
| 0.8 | 0.34 |
| 0.7 | 0.58 |
| 0.6 | 0.70 |
| 0.5 | 0.57 |
| 0.4 | 0.55 |
| 0.3 | 0.44 |

Find the Spearman rank correlation coefficient to measure the relationship of mole ratio of
sebacic acid and the viscosity of copolyesters.
Solution
Let X and Y represent the mole ratio of sebacic acid and the viscosity of copolyesters,
respectively. First we assign ranks to each set of measurements: rank 1 is assigned to the
lowest number in each set, rank 2 to the second lowest number in each set, and so
forth, until rank 8 is assigned to the largest number. The table below shows the
individual rankings of the measurements and the differences in ranks for the 8 pairs of
observations.

| Mole ratio | Viscosity | $R(x_i)$ | $R(y_i)$ | $d_i = R(x_i) - R(y_i)$ | $d_i^2$ |
|---|---|---|---|---|---|
| 1.0 | 0.45 | 8 | 4 | 4 | 16 |
| 0.9 | 0.20 | 7 | 1 | 6 | 36 |
| 0.8 | 0.34 | 6 | 2 | 4 | 16 |
| 0.7 | 0.58 | 5 | 7 | -2 | 4 |
| 0.6 | 0.70 | 4 | 8 | -4 | 16 |
| 0.5 | 0.57 | 3 | 6 | -3 | 9 |
| 0.4 | 0.55 | 2 | 5 | -3 | 9 |
| 0.3 | 0.44 | 1 | 3 | -2 | 4 |
| | | | | | T = 110 |

Thus,

$$r_s = 1 - \frac{6T}{n(n^2 - 1)} = 1 - \frac{6(110)}{8(64 - 1)} = -0.3095$$

which shows a weak negative correlation between the mole ratio of sebacic acid and the
viscosity of copolyesters.
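The ranking and the formula for $r_s$ in Example 6 can be checked with the following sketch. The function names are illustrative, and the simple ranking helper assumes there are no tied values, as is the case for these data.

```python
def ranks_no_ties(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def spearman_rs(x, y):
    """Return (T, r_s) with T the sum of squared rank differences."""
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    t = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return t, 1 - 6 * t / (n * (n * n - 1))

mole_ratio = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
viscosity = [0.45, 0.20, 0.34, 0.58, 0.70, 0.57, 0.55, 0.44]

t, rs = spearman_rs(mole_ratio, viscosity)
print(t, round(rs, 4))  # 110 -0.3095
```

With tied observations the ranks would need to be averaged and the shortcut formula becomes only approximate, which is why the text restricts it to data with no ties or few ties.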

Example 7
The following data were collected and ranked during an experiment to determine the change in
thrust efficiency, y, as the divergence angle of a rocket nozzle, x, changes:

| Rank X | Rank Y |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 1 |
| 4 | 5 |
| 5 | 7 |
| 6 | 9 |
| 7 | 4 |
| 8 | 6 |
| 9 | 10 |
| 10 | 8 |

Find the Spearman rank correlation coefficient to measure the relationship between the
divergence angle of a rocket nozzle and the change in thrust efficiency.
Solution

| $R(x_i)$ | $R(y_i)$ | $d_i = R(x_i) - R(y_i)$ | $d_i^2$ |
|---|---|---|---|
| 1 | 2 | -1 | 1 |
| 2 | 3 | -1 | 1 |
| 3 | 1 | 2 | 4 |
| 4 | 5 | -1 | 1 |
| 5 | 7 | -2 | 4 |
| 6 | 9 | -3 | 9 |
| 7 | 4 | 3 | 9 |
| 8 | 6 | 2 | 4 |
| 9 | 10 | -1 | 1 |
| 10 | 8 | -2 | 4 |
| | | | T = 38 |

Substituting into the formula for $r_s$, we find that

$$r_s = 1 - \frac{6T}{n(n^2 - 1)} = 1 - \frac{6(38)}{10(100 - 1)} = 0.7697$$

indicating a high positive correlation between the divergence angle of a rocket nozzle and the
change in thrust efficiency.

The grams of solids removed from a material (y) is thought to be related to the drying time
(x). Ten observations obtained from an experimental study follow.

| Drying time | Solids removed |
|---|---|
| 2.5 | 4.3 |
| 3.0 | 1.5 |
| 3.5 | 1.8 |
| 4.0 | 4.9 |
| 4.5 | 4.2 |
| 5.0 | 4.8 |
| 5.5 | 5.8 |
| 6.0 | 6.2 |
| 6.5 | 7.0 |
| 7.0 | 7.9 |

Calculate the Spearman rank correlation coefficient to measure the relationship between the
grams of solids removed from a material and the drying time.
[$r_s = 0.8788$]

Two persons rank their preferences on 8 brands of automobile due to the rise in the price of
petrol. The ranks are in the following order:

| Brand | Person A | Person B |
|---|---|---|
| 1 | 5 | 7 |
| 2 | 8 | 5 |
| 3 | 4 | 4 |
| 4 | 3 | 2 |
| 5 | 6 | 8 |
| 6 | 2 | 1 |
| 7 | 7 | 6 |
| 8 | 1 | 3 |

Calculate the Spearman rank correlation coefficient to measure the relationship between the
preferences of these two persons.
[$r_s = 0.7143$]

Exercise 8
1. Briefly explain the meaning of categorical data and give two examples.
2. When does a statistical method become a non-parametric statistic?
3. At a college there are two cafeterias, A and B, where the students usually have their
meals. A random sample of 12 students is taken and 5 of them prefer cafeteria A and
the rest indicate a preference for cafeteria B. At the 5% significance level, can we
conclude that the students at this college have equal preference for the two cafeterias?
4. Eight students went on a diet in an attempt to lose weight, with the following results:

| Name | Weight before (kg) | Weight after (kg) |
|---|---|---|
| Abu | 78 | 66 |
| Ali | 86 | 87 |
| Chen | 69 | 64 |
| Rama | 83 | 80 |
| Subra | 78 | 73 |
| Lim | 74 | 65 |
| Tan | 80 | 75 |
| Amin | 90 | 87 |

Use the sign test to test whether the diet is an effective means of losing weight at
significance level $\alpha = 0.05$. Now use the Wilcoxon signed-rank test to test the same
hypothesis at the same significance level.
5. In a library, there are two popular reading sections, A and B, where students normally
do their favourite reading. A random sample of 14 students is taken and their
preferences are shown below:

A B A A B A A A A B A B

At the 10% significance level, can we conclude that the students have equal preference
for the two library reading sections?

6. Through the years, the achievement award given to staff in a department has the
following order according to gender:

M M M M F F

M M F M F

where M represents Male and F represents Female. Is the award given according to
gender a random event at significance level $\alpha = 0.05$?
7. In a study to determine whether accidents occur at random or not, the following data
were gathered for 15 consecutive days:

+ + - + + + - + + - + - - - +

where + indicates the number of accidents for that day is above average and -
indicates the number of accidents for that day is below average. Test the hypothesis at
significance level $\alpha = 0.05$.
8. The following data give the cholesterol levels for seven adults before and after they
completed a special dietary plan:

| Before | After |
|---|---|
| 210 | 193 |
| 180 | 186 |
| 195 | 186 |
| 220 | 223 |
| 231 | 220 |
| 199 | 183 |
| 224 | 233 |

Use the sign test at the 5% significance level to test whether the level of cholesterol is
the same before and after completing the special dietary plan. Use the Wilcoxon
signed-rank test at the 5% significance level to test whether the level of cholesterol is
the same before and after completing the special dietary plan. Draw your conclusion.

9. The following table gives the recorded grades for 10 engineering students on carry
marks and final examination in an Engineering Statistics course:

| Student | Carry Marks | Final Examination |
|---|---|---|
| Ali | 48 | 47 |
| Bidin | 46 | 45 |
| Chua | 38 | 42 |
| Didi | 43 | 40 |
| Emily | 36 | 38 |
| Farouk | 49 | 49 |
| Gina | 44 | 44 |
| Hasan | 42 | 46 |
| Intan | 34 | 37 |
| Joe | 40 | 34 |

Calculate the Spearman rank correlation coefficient to measure the relationship
between carry marks and final examination.
10. Two panels test 12 brands of computer chips for overall quality. The ranks assigned
by the panels are as follows:

| Brand | Panel 1 | Panel 2 |
|---|---|---|
| A | 10 | 9 |
| B | 6 | 3 |
| C | 1 | 4 |
| D | 7 | 5 |
| E | 3 | 6 |
| F | 8 | 7 |
| G | 2 | 8 |
| H | 5 | 2 |
| I | 9 | 10 |
| J | 4 | 1 |
| K | 7 | 8 |
| L | 9 | 6 |

Calculate the Spearman rank correlation coefficient to measure the relationship
between the results given by panel 1 and panel 2.

11. An engineer wants to investigate the relationship between the fretting wear of mild
steel and oil viscosity. Representative data follow, with x = oil viscosity and
y = wear volume.

| x | 1.6 | 9.4 | 15.5 | 20.0 | 22.0 | 35.5 | 43.0 | 40.5 | 33.0 |
|---|---|---|---|---|---|---|---|---|---|
| y | 240 | 181 | 193 | 155 | 172 | 110 | 113 | 75 | 94 |

Calculate the Spearman rank correlation coefficient to measure the relationship between
the fretting wear of mild steel and oil viscosity.

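| Questions | Part A | Part B |
|---|---|---|
| 1 | b | FALSE |
| 2 | a | FALSE |
| 3 | d | TRUE |
| 4 | b | FALSE |
| 5 | b | TRUE |
| 6 | b | FALSE |
| 7 | c | FALSE |
| 8 | c | FALSE |
| 9 | b | TRUE |
| 10 | d | FALSE |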

1.

(a) Constant
(b) Constant
(c) Variable, quantitative, continuous
(d) Variable, qualitative, nominal
(e) Variable, quantitative, interval-scaled
(f) Constant
(g) Variable, quantitative, continuous
(h) Variable, quantitative, continuous
(i) Variable, quantitative, discrete
(j) Variable, quantitative, continuous
(k) Variable, qualitative, ordinal
2.

(a) 0.3595
(b) 0.5033
(c) 0.4278
(d) 0.4167
(e) 0.4396

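3. (a) Straightforward

(b) $$f(x) = \begin{cases} \dfrac{5x}{8}, & 0 \le x < 1 \\[4pt] \dfrac{1}{2} + \dfrac{x^2}{8}, & 1 \le x < 2 \\[4pt] 1, & \text{elsewhere} \end{cases}$$

(c) 0.4688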

4. 0.5438
5. (a) 0.0729
(b) 0.3359
(c) 0.4703
6.

(a) 0.5328
(b) 0.3372
(c) 0.0675

6. $\mu = 28.1875$, $\sigma^2 = 125.2773$; $s^2 = 133.6292$, $\bar{x} = 28.1875$

A possible comment: the means are the same for population and sample data, but
larger dispersion is observed if the data were sample data.
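
The comment above follows from the relation $s^2 = \frac{n}{n-1}\,\sigma^2$ when both variances are computed from the same $n$ values. A minimal Python sketch illustrates it, using a small made-up data set (the values are hypothetical and are not the exercise's data). The reported ratio 133.6292 / 125.2773 ≈ 1.0667 would be consistent with $n = 16$, although that is only an inference from the two figures.

```python
import statistics

# Hypothetical observations, used only to illustrate the relation.
data = [12, 15, 21, 30, 42]
n = len(data)

mean = statistics.mean(data)            # same value whether data are a population or a sample
pop_var = statistics.pvariance(data)    # divides by n      (population variance)
samp_var = statistics.variance(data)    # divides by n - 1  (sample variance)

print("mean =", mean)
print("population variance =", pop_var)
print("sample variance     =", samp_var)
print("ratio =", samp_var / pop_var, " n/(n-1) =", n / (n - 1))
```

The mean is unchanged, while the sample variance is always larger by the factor $n/(n-1)$.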

8.

(a) 0.9305
(b) 0.8385
(c) 0.2924
(d) 1
0.7642

9.

(a) 0
b) 5.1984 10 4
(c) RM2082.245

1. a. 0.0060

b. 0.9706

2. a. 0.8962

b. 0.0001

3. a. 0.7757

b. 0.6129

4. 0.9808
5. a. 0.9993

b. 1.0000

6. a. 0.8997

b. 0.8020

7. 0.6772
8. a. 1.0000

b. i. 0.9998

ii. 0.0002

9. a. 0.4840

b. 0.0344

c. 0.0045

10. a. 0.9842

b. 0.9684

c. 0.9911

11. a. 0.0401

b. 0.5490

c. 0.9599

12. a. 0.9803

b. 0.4681

c. 0.6156

13. a. 0.2912

b. 1.0000

c. 1.0000

14. a. 0.6628

b. 0.0869

c. 0.7230

15. a. 0.3669

b. 0.8725

c. 0.5000

16. a. 0.5000

b. 0.3192

c. 0.5948

17. a. 0.4682

b. 0.6293

c. 0.9505

18. a. 0.4052

b. 0.7265

c. 0.5000

1. The observed interval contains the true value of .
2. Shorter
3. Yes, because we are making use of the sample information to infer the population
parameter.
4. a. 102.5

b. (98.944, 106.056)

5. a.
6. (0.4645, 0.5555) liter
7. (9.1, 10.7) micrometer
8. a. (0.505441, 0.507519) cm

b. (0.504637, 0.508323) cm

c. The CI in part (b) is more practical, as it is impractical to know the variance of a normal population without knowing its mean.
10. (1.061, 0.460); the observed interval contains the true value of the mean difference with
a 90% level of confidence. No.
11. (0.0107, 0.0493)
12. (0.0048, 0.0202)
13. a. 0.09 b. (0.0751, 0.1049)

c. 0.01

d.(0.0048,0.0152)

e.0.003;(0.0004,0.00639)

14. (0.804, 3.731) 10 4

15. a. (314.033, 430.301); the sample was drawn from a normal population, $\sigma$ is unknown, and n
is small. b. (346.917, 418.417)
c. RM(85.96, 64.96)
d. 70.6692; (1945.958, 30048.782) RM$^2$
e. 43.459; RM(27.13, 29.78) f. (0.5236, 13.353)

1. $z_{\text{test}} = 2.3717$; reject $H_0$.
2. $z_{\text{test}} = 2.044$; reject $H_0$.
3. $t_{\text{test}} = 0.5167$; fail to reject $H_0$.
4. $t_{\text{test}} = 2.821$; reject $H_0$.
5. Fail to reject $H_0$.
6. $z_{\text{test}} = 6.1546$; reject $H_0$.
7. $z_{\text{test}} = 1.014$; fail to reject $H_0$.
8. $z_{\text{test}} = 4.0216$; reject $H_0$.
9. a. Fail to reject $H_0$. b. Fail to reject $H_0$.
10. $z_{\text{test}} = 6.3640$; reject $H_0$.
11. $z_{\text{test}} = 1.4084$; fail to reject $H_0$.
12. $z_{\text{test}} = 1.4084$; reject $H_0$.
13. a. Fail to reject $H_0$. b. Fail to reject $H_0$.

1. k = 7, then $\nu = 6$; $\chi^2_{\text{test}} = 5.6807 < 12.592$; fail to reject $H_0$.
2. k = 6, p = 1, where the parameter estimate is 3.47, then $\nu = 4$; $\chi^2_{\text{test}} = 3.682 < 13.277$; fail to reject $H_0$.
3. k = 8, then $\nu = 7$; $\chi^2_{\text{test}} = 0.6333 < 14.067$; fail to reject $H_0$.
4. k = 4, then $\nu = 3$; $\chi^2_{\text{test}} = 40.692 > 7.815$; reject $H_0$.
5. k = 3, then $\nu = 2$; $\chi^2_{\text{test}} = 0.2448 < 9.21$; fail to reject $H_0$.
6. Independence test: $\nu = 1$; $\chi^2_{\text{test}} = 33.33 > 3.841$ (without Yates correction); reject $H_0$; status and classification are significantly dependent at $\alpha = 0.05$.
7. Independence test: $\nu = 2$; $\chi^2_{\text{test}} = 4.7179 < 9.21$; fail to reject $H_0$; level of pain and type of painkiller are independent.
8. Independence test: $\nu = 4$; $\chi^2_{\text{test}} = 13.3808 > 13.277$; reject $H_0$; length and diameter are significantly dependent at $\alpha = 0.01$.
9. Homogeneity test: $\nu = 2$; $\chi^2_{\text{test}} = 17.1428 > 5.991$; reject $H_0$. The proportions of defective components are not the same, i.e. they are significantly not homogeneous at $\alpha = 0.05$.
10. Homogeneity test: $\nu = 2$; $\chi^2_{\text{test}} = 36.6753 > 5.991$; reject $H_0$; the proportions of output components for shift 1 are significantly not the same for all 3 machines.
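
Every answer in this list applies the same decision rule: compute $\chi^2 = \sum (O - E)^2 / E$ and reject $H_0$ when it exceeds the tabulated critical value. The Python sketch below shows that rule for a goodness-of-fit test with k = 7 equally likely categories. The observed counts are hypothetical (they are not the data of any exercise), and the critical value 12.592 is the tabulated value quoted in answer 1 for 6 degrees of freedom.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical observed counts for k = 7 categories.
observed = [18, 22, 20, 25, 17, 19, 19]

# Under H0 (all categories equally likely) each expected count is total / k.
expected = [sum(observed) / len(observed)] * len(observed)

critical_value = 12.592   # tabulated value for 6 degrees of freedom at the 0.05 level
chi2 = chi_square_statistic(observed, expected)
reject = chi2 > critical_value

print(f"chi-square = {chi2:.4f}, critical value = {critical_value}")
print("reject H0" if reject else "fail to reject H0")
```

For independence and homogeneity tests the statistic is computed the same way. Only the expected counts (row total × column total / grand total) and the degrees of freedom change.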

1. $f_{\text{calc}} = 4.9471 > f_{0.05,\,2,\,21} = 3.47$
2. (a) $f_{\text{calc}} = 2.603 < f_{0.05,\,2,\,28} = 2.95$; no significant difference.
(b) No
3. $f_{\text{calc}} = 5982.001 > f_{0.01,\,2,\,15} = 6.36$; means are significantly different.
4. $f_{\text{calc}} = 29.7986 > f_{0.05,\,3,\,20} = 3.10$; season has a significant impact on oxygen variability.
5. $f_{\text{calc}} = 2.1656 < f_{0.02,\,3,\,16} = 4.08$; the mean tensile strengths do not differ significantly.
6. (a) 6, 5, 4 and 6 respectively.
(b) P-value = 0.1827 > 0.05; different concentrations do not affect the plant growth.
7. P-value = 0.00143 < 0.05; the mean lifetimes are significantly different.

1. a.

0.6623 , 1.1256

2. a.

143.731 , 15.202

3. a.

0.2757 , 0.0255

4. a. 5.3066

c. Reject H 0

b. 3.98
b. 37.317

c. Reject H 0

d. 0.9939
d. - 0.9859

b. Reject H 0 c.0.9387

d. 0.9502

c. Accept H 0
b. 3.85

d. Accept H 0

5. a.

5.6 , 0.07

6. a.

2.8144 , 2.8622

b. 306.2076

c. Reject H 0

d. 0.8742

7. a.

0.0016 , 0.0415

b. 7.8866

c. Reject H 0

d. 0.9901

8. a. Starting Monthly Salary = 8.4269 + 7.7427 CGPA
b. RM36.3 hundred, or RM3630
c. Yes, because Significance F < 0.05, or the P-value for the CGPA coefficient < 0.05
d. r = Multiple R = 0.6871; moderately strong positive linear correlation

9. a. Life = 8.32975 - 0.085775 Speed
b. 3.6121 hours
c. Yes, because Significance F < 0.01, or the P-value for the Speed coefficient < 0.01
d. r = -0.9339 (Multiple R = 0.9339; the sign follows the negative slope); very strong negative linear correlation between Useful Life and
Speed.

10. a. Max. Speed = 1.7987 + 2.3794 Power
b. The slope implies that a unit increase in power would lead to an increase of about 2.3794 units in
Maximum Speed.
c. 169.5146 km/h
d. Yes, because Significance F < 0.01, or the P-value for the Power coefficient < 0.01
e. r = Multiple R = 0.7426; moderately strong positive linear correlation between
Maximum Speed and Power.

3. $H_0: p = 0.5$ vs $H_1: p \neq 0.5$; $P(X \le 5) = 0.3872 > 0.025$ or $P(X \ge 5) = 0.8062 > 0.025$;
fail to reject $H_0$; the students at this college have equal preference for the two cafeterias.
4.

## (Sign test) H 0 : 0.5 vs H 0 : 0.5; P X 5 0.0352 0.05; reject H 0 ; the diet

program is effective.

## (Wilcoxon signed-ranked test) H 0 : PWb PWa vs H 0 : PWb PWa ;

T 1 1 5 or T 32 31; reject H 0 ; the diet program is effective.

5. $H_0: p = 0.5$ vs $H_1: p \neq 0.5$; $P(X \le 9) = 0.9102 > 0.05$ or $P(X \ge 9) = 0.212 > 0.05$;
fail to reject $H_0$; the students have equal preference for the two library reading sections.
6. $3 < R = 6 < 11$; the award given according to gender is a random event.
7. $4 < R = 9 < 13$; accidents occur at random.
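
The number of runs R used in answers 6 and 7 is the number of maximal blocks of identical consecutive symbols. A short Python sketch can check the two counts. It assumes the sequences are typed in as strings in the order they appear in exercises 6 and 7, and the helper name `count_runs` is introduced here for illustration only.

```python
from itertools import groupby

def count_runs(sequence):
    """Count maximal blocks of identical consecutive symbols."""
    return sum(1 for _symbol, _block in groupby(sequence))

# Sequences as listed in exercises 6 and 7 (string encoding is an assumption of this sketch).
awards = "MMMMFFMMFMF"
accidents = "++-+++-++-+---+"

for label, seq in (("awards", awards), ("accidents", accidents)):
    counts = {symbol: seq.count(symbol) for symbol in sorted(set(seq))}
    print(label, counts, "runs =", count_runs(seq))
```

The symbol counts printed alongside R are the two sample sizes needed to look up the lower and upper critical values in the runs-test table.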

8. (Sign test) $H_0: p = 0.5$ vs $H_1: p \neq 0.5$; $P(X \le 4) = 0.7734 > 0.025$ or
$P(X \ge 4) = 0.5 > 0.025$;
(Wilcoxon signed-rank test) $H_0: P(C_b) = P(C_a)$ vs $H_1: P(C_b) \neq P(C_a)$;
$T^- = 6.5 > 2$ or $2 < T^+ = 21.5 < 26$; fail to reject $H_0$; the level of cholesterol is the
same before and after completing the special dietary plan.
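
The figures in answer 8 can be reproduced with the standard library alone. The sketch below is a rough check, not the textbook's own procedure. It assumes the seven before/after readings are those listed in exercise 8, takes differences as before minus after, and gives tied absolute differences the average of their ranks.

```python
from math import comb

# Cholesterol readings as listed in exercise 8.
before = [210, 180, 195, 220, 231, 199, 224]
after = [193, 186, 186, 223, 220, 183, 233]

# Differences (before - after); zero differences would be dropped.
diffs = [b - a for b, a in zip(before, after) if b != a]
n = len(diffs)
n_plus = sum(d > 0 for d in diffs)

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Sign test: both tail probabilities at the observed number of plus signs.
p_lower = binom_cdf(n_plus, n)          # P(X <= n_plus)
p_upper = 1 - binom_cdf(n_plus - 1, n)  # P(X >= n_plus)

def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

# Wilcoxon signed-rank sums of the ranks of |d| for positive and negative differences.
ranks = average_ranks([abs(d) for d in diffs])
t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

print(f"n = {n}, plus signs = {n_plus}")
print(f"P(X <= {n_plus}) = {p_lower:.4f}, P(X >= {n_plus}) = {p_upper:.4f}")
print(f"T+ = {t_plus}, T- = {t_minus}")
```

The critical values (2 and 26 for n = 7) still come from the Wilcoxon signed-rank table; the code only computes the test quantities.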

9. $r_s = 0.8182$; a strong positive correlation between carry marks and final exam scores.
10. $r_s = 0.6573$; a moderately strong positive correlation between results given by panel 1
and panel 2.
11. $r_s = -0.85$; a strong negative correlation between the fretting wear of mild steel and oil
viscosity.
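
The Spearman coefficients above follow from $r_s = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the two ranks of observation $i$. The Python sketch below applies the formula to the marks listed in exercise 9. It assumes there are no tied values within either list, which holds for these marks; the helper names are introduced for illustration only.

```python
def ranks_no_ties(values):
    """Ascending ranks 1..n; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def spearman(x, y):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2
                    for rx, ry in zip(ranks_no_ties(x), ranks_no_ties(y)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))

# Marks from exercise 9, in the order Ali, Bidin, ..., Joe.
carry = [48, 46, 38, 43, 36, 49, 44, 42, 34, 40]
final = [47, 45, 42, 40, 38, 49, 44, 46, 37, 34]

r_s = spearman(carry, final)
print(f"r_s = {r_s:.4f}")
```

For exercise 11 the same function can be called with the x and y values, since neither list contains ties. Exercise 10 already supplies ranks, so the formula can be applied to them directly.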

References

Lee, M. H. (2004). Statistical Tables and Formulae for Science and Engineering.
Skudai: UTM.
Montgomery, D. C. & Runger, G. C. (2006). Applied Statistics and Probability
for Engineers, 4th Ed. USA: John Wiley & Sons.
Montgomery, D. C., Runger, G. C. & Hubele, N. F. (2003). Engineering Statistics. USA: John Wiley & Sons.