
Lecture 9 fybanuro@ug.edu.gh 1
ANALYSIS OF
QUANTITATIVE DATA
DR. FRANCIS YAW BANURO
UGBS - LEGON

Lecture 9 fybanuro@ug.edu.gh 2
1. Introduction
After collecting the data, the researcher will usually want to see what the data contain.
To do this, he or she will have to organise and manipulate the data so that they reveal things of interest.
This lecture is concerned with discussing
techniques for organising and
manipulating quantitative data.
2
Lecture 9 fybanuro@ug.edu.gh 3
1. Introduction
All data can be summarised in two main
ways.
pictorial representation
summary statistics.
We shall first consider pictorial
representation of data.


3
Lecture 9 fybanuro@ug.edu.gh 4
2.Pictorial representation
2.1. Frequency tables
We discuss the following characteristics of frequency
tables.
i. class intervals (non-overlapping): the number of intervals should be between 5 and 15;
ii. limits (lower and upper);
iii. tally;
iv. frequency distribution: the completed frequency table (must have a
title, units of measurement and source of data);
v. relative frequency:

$$\text{Relative frequency of class } i = \frac{f_i}{\sum_{i=1}^{C} f_i} \times 100$$

where C is the number of class intervals.
Lecture 9 fybanuro@ug.edu.gh 5
Observation Tally Frequency Observation Tally Frequency
98 / 1 134 / 1
102 // 2 136 / 1
104 / 1 138 / 1
108 / 1 140 //// 4
112 // 2 142 / 1
114 // 2 146 / 1
116 //// 5 150 // 2
118 / 1 162 / 1
120 / 1 176 / 1
122 // 2 178 / 1
126 // 2 190 / 1
130 / 1 208 / 1
Table 1: Frequency table for smokers' systolic blood pressure data

Lecture 9 fybanuro@ug.edu.gh 6
Class interval Tally Frequency
90 - 109 //// 5
110 - 129 //// //// //// 15
130 - 149 //// //// 10
150 - 169 /// 3
170 - 189 // 2
190 - 209 // 2
Table 2: Grouped frequency table for the smokers data
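To make the construction concrete, the sketch below shows one way such a grouped frequency table and its relative frequencies could be produced in Python. The class limits follow Table 2; the readings list is only a hypothetical placeholder, not the actual smokers' data.

```python
# Minimal sketch: grouped frequency table with relative frequencies.
# The raw readings here are hypothetical -- substitute the real data.
readings = [98, 102, 102, 104, 116, 118, 140, 150, 162, 208]

bins = [(90, 109), (110, 129), (130, 149), (150, 169), (170, 189), (190, 209)]
freq = {b: 0 for b in bins}
for x in readings:
    for lo, hi in bins:
        if lo <= x <= hi:          # non-overlapping class intervals
            freq[(lo, hi)] += 1
            break

total = len(readings)
print(f"{'Class interval':<16}{'Frequency':>10}{'Relative %':>12}")
for (lo, hi), f in freq.items():
    print(f"{lo}-{hi:<11}{f:>10}{100 * f / total:>11.1f}%")
```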
Lecture 9 fybanuro@ug.edu.gh 7
2.Pictorial representation
2.1. Frequency tables (continuation)
This indicates the % of total cases that fall in a given
class interval.
It enables comparison between two sets of data that
have a different number of observations;
vi. class boundaries (lower limit- 0.5, upper limit +
0.5);
vii. cumulative relative frequency (percentage): that
% of individuals having a measurement less
than or equal to the upper boundary of the
class interval. This figure is useful for obtaining
the median and percentile scores (see ogive).



7
Lecture 9 fybanuro@ug.edu.gh 8
2.Pictorial representation
2.2. Graphs
Graphs help the user obtain an intuitive feeling for the
data at a glance. Graphs must have a descriptive title,
labeled axes and units of observations.
a) Frequency Distribution: it is a table that lists the number
of observations of some variable that fall in various
categories. The categories should be enough so that we
can see a meaningful distribution. A rule of thumb is to
divide the range of values into 8 to 15 equally spaced
categories, plus a possible open-ended category at
either end of the range.

8
Lecture 9 fybanuro@ug.edu.gh 9
2.Pictorial representation
2.2. Graphs (continuation)
b) Scatterplots: A scatterplot contains a point for
each observation, based on the values of two
selected variables.
The resulting plot indicates the relationship, if
any, between these two variables.
StatPro does this for us: Charts - scatterplot:
continue with the screens that follow (see
..\EMBA\data\Expenses.xls).

9
Lecture 9 fybanuro@ug.edu.gh 10
2.Pictorial representation
2.2. Graphs (continuation)
c) Histogram: is a pictorial representation of the
frequency table.
horizontal axis: shows the class boundaries not limits;
vertical axis: shows the frequency ( or relative
frequency) of observations;
10
Lecture 9 fybanuro@ug.edu.gh 11
2.Pictorial representation
2.2. Graphs (continuation)
d) Frequency polygon: uses the same axes as in
the histogram except that we use points
marked at the intersection of the class midpoint
and its frequency or relative frequency.
These points are then joined by straight lines.
It comes down to the abscissa one midpoint below the first class interval and one midpoint above the last class interval.

11
Lecture 9 fybanuro@ug.edu.gh 12
2.Pictorial representation
2.2. Graphs (continuation)
Effectiveness
when superimposed, enables the comparison of two
data sets. Note that the same can not be said about
the histogram;
it is used only with numerical data (not with nominal or ordinal data);
has three known shapes:
symmetrical (bell shaped)
bimodal
asymmetrical or skewed (positively skewed: skewed to the right; negatively skewed: skewed to the left)
tutorials\Frequencypolygon.xls


12
Lecture 9 fybanuro@ug.edu.gh 13
2.Pictorial representation
2.2. Graphs (continuation)
e) Cumulative frequency polygon (ogive): the abscissa is
the same as for the histogram or the frequency polygon.
The ordinate indicates the cumulative frequency or
cumulative relative frequency.
We use points at the intersection of the class's cumulative
frequency and its upper boundary.
These points are then joined by straight lines.
It is useful for
comparing two sets of data (from the abscissa to the ordinate);
obtaining the median and the percentiles (from the ordinate to the
abscissa).tutorials\cumulativefrequency.xls


13
Lecture 9 fybanuro@ug.edu.gh 14
2.Pictorial representation
2.2. Graphs (continuation)
f) Stem-and-leaf: the stem represents the class interval
and the leaf is a string of values within the interval.
It portrays a histogram laid on its side.
We can infer the shape of the distribution (symmetrical, bimodal
or asymmetrical) from this plot.
g) Bar chart: this is the usual form of displaying nominal or
ordinal data.
The various categories are represented in the horizontal axis
and the height of each bar is equal to the frequency of items for
that category.
All bars are separated and are of equal width.
The ordinate must start from 0 and where this is not possible,
we need to use broken bars.
14
Lecture 9 fybanuro@ug.edu.gh 15
3. Summary statistics
We shall consider three types of summary statistics
namely
measures of central tendency,
measures of dispersion and
coefficient of variation
3.1. Measures of central tendency
The most used are
the mean,
median
mode.
We shall first consider these measures for ungrouped
data and then for grouped data.
tutorials\data.xls


15
Lecture 9 fybanuro@ug.edu.gh 16
3. Summary statistics
3.1. Measures of central tendency
A. Ungrouped data
1. The (arithmetic) mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

2. The median: the observation which divides the distribution into two equal halves. For an even number of observations, the median is the average of the two middlemost values.
3. The mode: the observation that occurs most frequently.
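As a quick illustration, the three measures can be computed with Python's standard library; the values below are hypothetical, not the blood-pressure data.

```python
# Sketch: mean, median and mode for ungrouped data (hypothetical values).
from statistics import mean, median, mode

x = [47, 53, 49, 50, 46, 55, 54, 58, 61, 52, 52]

print("mean  :", mean(x))    # sum(x) / n
print("median:", median(x))  # middle value (average of the two middle values if n is even)
print("mode  :", mode(x))    # most frequent value; statistics.multimode handles ties
```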
Lecture 9 fybanuro@ug.edu.gh 17
3. Summary statistics
B. Grouped data

$$1.\quad \bar{x} = \frac{\sum_i f_i m_i}{\sum_i f_i}$$

$$2.\quad \text{Median} = LCL + \frac{(MP - CF)}{F} \times CW$$
Lecture 9 fybanuro@ug.edu.gh 18
3. Summary statistics
where:
m_i = midpoint of the i-th class (interval);
f_i = frequency of the i-th class (interval);
LCL= lower class limit of the interval in which the median
item occurs;
MP= median position;
CF=cumulative frequency just before the median;
CW= class width of the median class;
F= frequency of observations in the median item class.
tutorials\groupdata.xls


Lecture 9 fybanuro@ug.edu.gh 19
3. Summary statistics
$$\bar{x} = \frac{\sum_{i=1}^{C} f_i m_i}{\sum_{i=1}^{C} f_i} = \frac{4921.5}{37} = 133.0135 \text{ (smokers)}$$

Non-smokers: 129.6587
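The grouped mean can be checked directly from Table 2; the sketch below assumes midpoints taken from the class boundaries (e.g. 90-109 has boundaries 89.5-109.5 and midpoint 99.5).

```python
# Sketch: grouped mean from the Table 2 frequency distribution.
midpoints   = [99.5, 119.5, 139.5, 159.5, 179.5, 199.5]
frequencies = [5, 15, 10, 3, 2, 2]

total_f  = sum(frequencies)                                    # 37
total_fm = sum(f * m for f, m in zip(frequencies, midpoints))  # 4921.5
print(total_fm / total_f)  # ~133.01, matching the slide
```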
Lecture 9 fybanuro@ug.edu.gh 20
3. Summary statistics
3.2.Measures of dispersion
It is important for us to know whether the
observations tend to be quite similar or
whether they vary considerably.
We shall assess this by using the range and
the standard deviation
A. Ungrouped data
1. Range: it is the difference between the two
extreme data values.

20
Lecture 9 fybanuro@ug.edu.gh 21
3. Summary statistics
2. Variance: shows the variability around the mean
of the observed data.
It is computed as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

But since this is expressed in squared units, the standard deviation is usually preferred.
Lecture 9 fybanuro@ug.edu.gh 22
3. Summary statistics
3. Standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

For grouped data, the standard deviation is

$$s = \sqrt{\frac{\sum_{i=1}^{C} f_i m_i^2}{\sum_{i=1}^{C} f_i} - \left(\frac{\sum_{i=1}^{C} f_i m_i}{\sum_{i=1}^{C} f_i}\right)^2}$$
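A minimal sketch of the ungrouped formula, using hypothetical values (statistics.stdev would give the same answer):

```python
# Sketch: sample variance and standard deviation for ungrouped data,
# using the n-1 divisor shown above (hypothetical values).
import math

x = [47, 53, 49, 50, 46, 55, 54, 58, 61, 52]
n = len(x)
xbar = sum(x) / n
variance = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
sd = math.sqrt(variance)
print(f"variance = {variance:.3f}, standard deviation = {sd:.3f}")
```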
Lecture 9 fybanuro@ug.edu.gh 23
3. Summary statistics
For the smokers data (with Σf_i m_i² = 678669.25, Σf_i = 37, x̄ = 133.0135):

$$SD = \sqrt{18342.4122 - 17962.5912} = 19.489$$

Non-smokers: 20.4312
Lecture 9 fybanuro@ug.edu.gh 24
3. Summary statistics
3.3.Coefficient of variation (CV)
The coefficient of variation depicts the size of the standard
deviation relative to its mean.
It is free of measurement units of the original data facilitating
its use to compare the relative variation of even unrelated
quantities (blood glucose and serum cholesterol).
It is also used where consistency is important.



For different sets of data, the smaller the CV, the better (i.e. an indication of less variability).

$$CV = \frac{\text{Standard Deviation}}{\text{Mean}}$$
24
3. Summary statistics
For the smokers data:

$$CV = \frac{19.489}{133.0135} = 0.1465$$

For the non-smokers data:

$$CV = \frac{20.4312}{129.6587} = 0.1576$$
Lecture 9 fybanuro@ug.edu.gh 26
3. Summary statistics
3.4. Measures of Association
A measure of association is a single number that
expresses the strength, and often the direction, of a
linear relationship between two numerical variables.
There are many measures, but the most popular are the covariance and the Pearson correlation coefficient.
If we let X_i and Y_i be the paired values for observation i, and let n be the number of observations, then the covariance between X and Y, denoted Cov(X, Y), is given by

26
Lecture 9 fybanuro@ug.edu.gh 27
3. Summary statistics

$$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{n - 1}$$

Use COVAR in Excel.

The Pearson correlation coefficient is given by

$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

Use CORREL in Excel.
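For readers working outside Excel/StatPro, the same two quantities can be computed as in the sketch below; the X and Y values are hypothetical.

```python
# Sketch: sample covariance and Pearson correlation for paired data,
# following the n-1 formula above (hypothetical X, Y values).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)  # sample covariance
r = cov / (x.std(ddof=1) * y.std(ddof=1))                     # Pearson correlation

print(cov, r)
print(np.corrcoef(x, y)[0, 1])  # same correlation via NumPy
```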
Lecture 9 fybanuro@ug.edu.gh 28
3. Summary statistics
The covariance is an average of products of
deviations from means.
It has the limitation that it is affected by the units
in which X and Y are measured.
For this reason, we use the Pearson correlation
coefficient (which is unitless) as a remedy.
When we have more than two variables in a data
set, we can always create a table of covariances
and/or correlations using StatPro


28
WEEK 9 & 10 fybanuro@ug.edu.gh 29
STATISTICAL INFERENCE
WEEK 9 & 10 fybanuro@ug.edu.gh 30
Population parameters
For a population with N members, let X_i (i = 1, ..., N) be the numerical value of the i-th member of the population; X_i is neither 1 nor 0.
Then the population mean (average), μ, is

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

The population variance is

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2}{N} - \mu^2$$
WEEK 9 & 10 fybanuro@ug.edu.gh 31
Population parameters
Where each X_i takes values 1 or 0, the population mean is

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = p$$

where p is the proportion of individuals in the population having the particular characteristic.
The population variance is

$$\sigma^2 = \frac{\sum_{i=1}^{N} X_i^2}{N} - p^2 = p - p^2 = p(1 - p)$$
WEEK 9 & 10 fybanuro@ug.edu.gh 32
Sources of estimation error
Two basic sources of error:
1. sampling error (ε): the result of an unlucky sample (e.g. n = 20 marks from ADMN 605). The resulting error is the difference between the reported average for the 20 marks and the average of all 183 marks:

$$\varepsilon = \frac{2s}{\sqrt{n}}$$

2. Nonsampling error: can occur for a variety of reasons
a) nonresponse bias (do the nonrespondents differ in some important respect from the respondents?)
b) nontruthful responses to sensitive questions ("do you regularly use cocaine?"). Solution: the randomised response technique (one sensitive question and one innocuous question: flip a coin and answer one of them).
WEEK 9 & 10 fybanuro@ug.edu.gh 33
Nonsampling error
c) measurement error; responses do not reflect
what the investigator has in mind.
i. poorly worded questions;
ii. respondents don't fully understand the questions;
iii. respondents don't have the information.
d. voluntary response bias: when the subset of respondents differs in some important respect from all potential respondents (cf. hours of study per night by students: respondents are those with good grades, not those with poor grades).
WEEK 9 & 10 fybanuro@ug.edu.gh 34
Expectation and variance of the
sample mean
For a random variable X, we know that E(X_i) = μ and Var(X_i) = σ².

$$E(\bar{X}) = E\!\left(\frac{X_1 + X_2 + \dots + X_n}{n}\right) = \frac{E(X_1) + E(X_2) + \dots + E(X_n)}{n} = \frac{n\mu}{n} = \mu$$

Hence X̄ is an unbiased estimate of μ.
WEEK 9 & 10 fybanuro@ug.edu.gh 35
The variance of Xbar
$$Var(\bar{X}) = Var\!\left(\frac{X_1 + X_2 + \dots + X_n}{n}\right) = \frac{Var(X_1) + Var(X_2) + \dots + Var(X_n)}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
WEEK 9 & 10 fybanuro@ug.edu.gh 36


CONFIDENCE INTERVALS
Introduction
For populations that are approximately normal, X-bar is an unbiased estimator of μ.
What about the specific sample mean X-bar? (It may be a bit high or a bit low!)
Hence we need an interval estimate, or confidence interval, for μ.
An x% confidence interval makes us x%
certain that the true population mean is within
this interval.
WEEK 9 & 10 fybanuro@ug.edu.gh 37
Importance of confidence interval
estimation in the real world
Taxable income: [1m, 2.2m) with mean = 1.6m.
Government has two objectives:
maximisation of revenue;
minimisation of risk of default if overassessed.
Better to use the lower limit.
If the government is indifferent about overcharging or undercharging, it can use 1.6m.
In that case the government would undercharge in about half the cases and overcharge in about half the cases (unfair!!!).
How can the government increase revenue but still use a
lower limit?
Importance of confidence interval
estimation in the real world
Increase the sample size
This will shrink the width of the confidence
interval
This will shift the lower limit upwards
WEEK 9 & 10 fybanuro@ug.edu.gh 39
Confidence interval for a single
mean
From the CLT, for large n:

$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$$

Two equivalent forms:
1) For large n we use the population standard deviation. If it is not known, we use the sample standard deviation:

i. $$\varepsilon = Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \quad \text{or} \quad \varepsilon = Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
WEEK 9 & 10 fybanuro@ug.edu.gh 40
Confidence interval for a single
mean
ii. Then the CI for the mean, μ, is

$$\bar{X} \pm Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \quad \text{or} \quad \bar{X} \pm Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$

2) For small n (n < 30) the t-distribution is used:

$$\bar{X} \pm t_{\alpha/2}(n-1) \cdot \frac{s}{\sqrt{n}}$$

There is a trade-off between the level of confidence and the width of the CI. The higher the level of confidence, the wider the CI, and vice versa.
For the Z-multiple: 90% (1.65); 95% (1.96); 99% (2.58).
For the t-multiple: 90% (1.70); 95% (2.05); 99% (2.76) for n = 30.
WEEK 9 & 10 fybanuro@ug.edu.gh 41


Degree of confidence
Less confidence and a narrow interval;
More confidence and a wider interval;
What about large n?
risk of nonsampling error;
cost involved;
time factor;
narrow confidence interval.
WEEK 9 & 10 fybanuro@ug.edu.gh 42
Example 1
A fast-food restaurant recently added a new
sandwich to its customers. To estimate the
popularity of this sandwich, a random sample of
40 customers who ordered the sandwich were
surveyed. Each of these customers was asked
to rate the sandwich on a scale of 1 to 10, 10
being the best. The results are
tutorials\Sandwich1.xls . The manager wants to
estimate the mean satisfaction rating over the
entire population of customers by using a 95%
CI.
WEEK 9 & 10 fybanuro@ug.edu.gh 43
Solution
We shall use StatPro:
1. load the module;
2. place cursor anywhere in the data set;
3. from StatPro, select Statistical Inference;
4. select One-Sample Analysis;
5. next select Satisfaction as variable to
analyse;
6. check that you want a CI for the mean;
7. accept all other defaults.
Solution
The mean rating is 6.250;
The 95% CI for the population mean rating
extends from 5.739 to 6.761;
The manager is 95% confident that the
true mean rating over all customers who
might try the sandwich is within this CI.
WEEK 9 & 10 fybanuro@ug.edu.gh 45
Example 2
A sample survey to estimate the fast-food market
(number of fast-food meals eaten) in Accra
yielded the following summary results.


x̄ = 0.82, s = 0.48, n = 180

Calculate a 95% CI for the mean of the whole population of Accra.

Answer: [0.75, 0.89]
WEEK 9 & 10 fybanuro@ug.edu.gh 46


Example 3
From a large class, a random sample of 4 grades was drawn: 64, 66, 89, 77. Calculate a 95% CI for the whole class mean (μ).

x̄ = 74, s = 11.52, n = 4

Answer: [55.68, 92.32]
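Both answers can be reproduced from the summary statistics alone; the sketch below uses SciPy for the t-multiple in Example 3 and the z-multiple 1.96 for the large sample in Example 2.

```python
# Sketch: reproducing the two confidence intervals from summary statistics.
from scipy import stats

# Example 3: small sample, t-multiple
n, xbar, s = 4, 74.0, 11.52
t_mult = stats.t.ppf(0.975, df=n - 1)        # ~3.18
half = t_mult * s / n ** 0.5
print(xbar - half, xbar + half)              # ~[55.7, 92.3]

# Example 2: large sample, z-multiple
print(0.82 - 1.96 * 0.48 / 180 ** 0.5,
      0.82 + 1.96 * 0.48 / 180 ** 0.5)       # ~[0.75, 0.89]
```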

WEEK 9 & 10 fybanuro@ug.edu.gh 47


Confidence Intervals
So far we can say we are x% confident that a particular population mean lies between μ_L and μ_U.
But can we tell the proportion of the
population that
consume higher than 0.82?
scored higher than 74?
No!!!!
WEEK 9 & 10 fybanuro@ug.edu.gh 48
CI for a proportion
To answer the question what proportion of the population
has a characteristic higher than the mean?.
Useful in auditing.
Same format as for the mean:
point estimate ± multiple × standard error.
Let Y be any property that members of a population either have or don't have (e.g. owning a car).
Sample n members randomly from this population.
Let p̂ be the sample proportion of members with property Y.
WEEK 9 & 10 fybanuro@ug.edu.gh 49
CI for a proportion
Let p be the population proportion.
Then p̂ is an estimate of p.
The variance in this case was shown to be p̂(1 − p̂).
Hence the standard error of p̂ can be shown to be

$$SE(\hat{p}) = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}$$

We assume n is large and hence apply the z-multiple:

$$\hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}$$
WEEK 9 & 10 fybanuro@ug.edu.gh 50


CI for a proportion
Validity of the assumption of large n requires:

$$n\,\hat{p}_L \ge 5, \quad n\,(1 - \hat{p}_L) \ge 5, \quad n\,\hat{p}_U \ge 5, \quad n\,(1 - \hat{p}_U) \ge 5$$

where p̂_L and p̂_U are the lower and upper confidence limits.
WEEK 9 & 10 fybanuro@ug.edu.gh 51
Example 4
Revisit example 1.
This time, the manager would like to use
the same sample to estimate the
proportion of customers who rate the
sandwich at least 6. Her thinking is that
these are the customers who are likely to
purchase the sandwich on subsequent
visits tutorials\Sandwich2.xls
WEEK 9 & 10 fybanuro@ug.edu.gh 52
Solution
We can check the assumption of large n:

$$n\,\hat{p}_L = 40 \times 0.475 = 19, \quad n\,(1 - \hat{p}_L) = 40 \times 0.525 = 21$$
$$n\,\hat{p}_U = 40 \times 0.775 = 31, \quad n\,(1 - \hat{p}_U) = 40 \times 0.225 = 9$$

They are all well above 5, so the validity of this CI is established.
The manager can be 95% confident that the percentage of all customers who would rate the sandwich 6 or higher is somewhere between 47.5% and 77.5%.
Wide CI! How can she narrow this?
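A short check of this interval, assuming the sample proportion behind it is 25/40 = 0.625 (inferred from the reported limits; the count is not stated explicitly on the slide):

```python
# Sketch: 95% CI for a proportion (Example 4).
p_hat, n = 25 / 40, 40                       # 25 of 40 is an inferred count
se = (p_hat * (1 - p_hat) / n) ** 0.5
print(p_hat - 1.96 * se, p_hat + 1.96 * se)  # ~[0.475, 0.775]
```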
WEEK 9 & 10 fybanuro@ug.edu.gh 53
CI for difference in means
We usually wish to compare the means
of two populations.
Two ways to design the samples to
achieve this:
1) independent samples: male and female first
semester marks in ADMN 602.
2) dependent samples (paired samples): each
member of the sample has two scores
(before and after).
WEEK 9 & 10 fybanuro@ug.edu.gh 54
CI for difference in means:
Independent samples
Assume the means and standard deviations of the populations are (μ_1, μ_2) and (σ_1, σ_2) respectively.
We take random samples of sizes n_1 and n_2 from the populations to estimate μ_1 − μ_2; a point estimate is

$$\bar{X}_1 - \bar{X}_2$$

We can use the point estimate to construct the CI for μ_1 − μ_2 using the samples' standard deviations and the t-distribution.
WEEK 9 & 10 fybanuro@ug.edu.gh 55
CI for difference in means:
Independent Samples
The CI for μ_1 − μ_2 is

$$\left(\bar{X}_1 - \bar{X}_2\right) \pm t_{\alpha/2}(n_1 + n_2 - 2)\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

When we assume equal population variances, the CI becomes

$$\left(\bar{X}_1 - \bar{X}_2\right) \pm t_{\alpha/2}(n_1 + n_2 - 2)\, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \qquad \text{where } S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}}$$
WEEK 9 & 10 fybanuro@ug.edu.gh 56
Example 5
The SureStep Company manufactures high-quality
treadmills for use in exercise clubs. SureStep currently
purchases its motors for these treadmills from supplier A.
However, it is considering a change to supplier B, which
offers a slightly lower cost. The only question is whether
supplier B's motors are as reliable as supplier A's. To
check this, SureStep installs motors from supplier A on
30 of its treadmills and motors from supplier B on
another 30 of its treadmills. It then runs these treadmills
under typical conditions and for each treadmill, records
the number of hours until the motor fails. The data
appears in tutorials\Motors.xls. What can SureStep
conclude?
WEEK 9 & 10 fybanuro@ug.edu.gh 57
Solution
M_A = 748.8 and M_B = 655.667
SD_A = 283.881 and SD_B = 259.986
The CI for the difference between means
is [-47.549, 233.815]
quite wide and includes zero;
Supplier A better?
Supplier B could also be better, because of the negative part of the confidence interval.
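The interval can be reproduced from the summary statistics; the sketch below uses the unequal-variance (Welch) degrees of freedom, which is an assumption about how StatPro computed it.

```python
# Sketch: 95% CI for the difference in mean motor lifetimes (Example 5).
from scipy import stats

m_a, s_a, n_a = 748.8, 283.881, 30
m_b, s_b, n_b = 655.667, 259.986, 30

se = (s_a**2 / n_a + s_b**2 / n_b) ** 0.5
# Welch-Satterthwaite degrees of freedom (assumption: unequal variances)
df = se**4 / ((s_a**2 / n_a) ** 2 / (n_a - 1) + (s_b**2 / n_b) ** 2 / (n_b - 1))
t_mult = stats.t.ppf(0.975, df)
diff = m_a - m_b
print(diff - t_mult * se, diff + t_mult * se)  # ~[-47.5, 233.8]
```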
WEEK 9 & 10 fybanuro@ug.edu.gh 58
CI for difference in means:
dependent samples
Two marks or scores are obtained for each
member of the sample.
We are interested in the CI between the means of
these two marks or scores.
Procedure
compute difference between each individuals marks or
scores
compute
n
D
n
1 i
i
D

=
2i 1i i
X X D =
WEEK 9 & 10 fybanuro@ug.edu.gh 59
CI for difference in means:
dependent samples
the CI for the difference is

$$\bar{D} \pm t_{\alpha/2}(n - 1) \cdot \frac{S_D}{\sqrt{n}}, \qquad \text{where } S_D^2 = \frac{\sum_{i=1}^{n} (D_i - \bar{D})^2}{n - 1}$$


Example
A real estate agent has collected a random sample of 75
houses that were recently sold in a suburban community.
She is particularly interested in comparing the appraised
value and recent selling price of the houses in this
particular market. The values of these two variables for
each of the 75 randomly chosen houses are provided in
the file ..\houses.xls.
Using the sample data,
generate a 95% CI for the mean difference between the
appraised values and selling prices of the houses sold in this
suburban community.
interpret the constructed interval estimate for the real estate
agent.
WEEK 9 & 10 fybanuro@ug.edu.gh 61
Solution
1. Add the difference column to the data
Then with cursor anywhere in the data:
StatPro
Statistical Inference
One-sample analysis
Select difference for analysis.
[-2.489, 1.737]
WEEK 9 & 10 fybanuro@ug.edu.gh 62
Solution
2. Considering the paired data
StatPro
Statistical Inference
Paired-sample analysis
Select value and price for analysis.
[-2.489, 1.737]
WEEK 9 & 10 fybanuro@ug.edu.gh 63
Solution
Interpretation:
interval includes zero;
evidence is two-way;
also there could be no evidence in any
direction.
WEEK 9 & 10 fybanuro@ug.edu.gh 64
CI for difference between
proportions
Same analysis as in difference between means for
two-sample analysis.
Let p_1 and p_2 denote the two unknown population proportions.
Let p̂_1 and p̂_2 denote the two sample proportions, with sample sizes n_1 and n_2.
Then the point estimate is p̂_1 − p̂_2.
The 95% CI for this point estimate is given by
WEEK 9 & 10 fybanuro@ug.edu.gh 65
CI for difference between
proportions

$$\left(\hat{p}_1 - \hat{p}_2\right) \pm 1.96\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

Example:
An appliance store is about to run a big sale. It selects 300 of its customers and randomly divides them into two sets of 150 customers each. It then mails a notice of the sale to all 300 customers but includes a coupon for an extra 5% off the sale price to the second set of customers only. As the sale progresses, the store keeps track of which of these customers purchase appliances. The resulting data appear in tutorials\Coupons.xls. What can the store keeper conclude about the effectiveness of the coupons?
WEEK 9 & 10 fybanuro@ug.edu.gh 66


Solution
We compute the two proportions as p̂_1 = 0.3667 and p̂_2 = 0.2333.
Then the difference between proportions is p̂_1 − p̂_2 = 0.3667 − 0.2333 = 0.1334.
The z-multiple is 1.96.
The standard error of the difference between sample proportions is 0.05235.
The sampling error is 1.96 × 0.05235 = 0.10261.
WEEK 9 & 10 fybanuro@ug.edu.gh 67
Solution
The CI for the difference between proportions is
[0.0307, 0.2359]
Because the confidence limits are both positive,
we can conclude that the effect of coupons is
almost surely to increase (about 13%: 36.7-23.3)
the proportion of buyers.
Thus for every 100 customers, the coupons will
probably induce an extra 3 to 23 customers to
purchase an appliance who otherwise would not
have made a purchase.
In terms of profit the store will make less profit
by including coupons than by not including them.
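As a check, the interval can be recomputed from the implied counts (55 of 150 coupon customers and 35 of 150 others purchased; these counts are inferred from the reported proportions):

```python
# Sketch: 95% CI for the difference between coupon and no-coupon purchase rates.
p1, n1 = 55 / 150, 150   # inferred count for the coupon group
p2, n2 = 35 / 150, 150   # inferred count for the no-coupon group
se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
diff = p1 - p2
print(diff - 1.96 * se, diff + 1.96 * se)  # ~[0.031, 0.236]
```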
WEEK 12 fybanuro@ug.edu.gh 68
HYPOTHESIS TESTING
WEEK 12 fybanuro@ug.edu.gh 69
Introduction
Hypothesis testing is one of the most
frequently used tools in academic
research.
Inferences to a population can be made
via:
confidence interval for a point estimate;
hypothesis testing;
A. null hypothesis (H
0
)
B. alternative hypothesis (H
A
): research hypothesis
(what the researcher wants to prove).
WEEK 12 fybanuro@ug.edu.gh 70
Introduction
Debate among academic researchers over
the most useful procedure.
Where there is nothing to prove, use
confidence interval estimation (business
context).
Use hypothesis testing for statistical
analysis.
Form of alternative hypothesis can be
either one-tailed or two-tailed.
WEEK 12 fybanuro@ug.edu.gh 71
Introduction
Indicators of one-tailed test:
more than, greater than, less than,
better than, worse than, at least, at
most.
Indicators of two-tailed test:
not equal to, different from, changed for
better or worse.
WEEK 12 fybanuro@ug.edu.gh 72
Error types
Regardless of whether the analyst decides to accept or reject H0, it might be the wrong decision.
type I error: incorrectly rejecting H0.
type II error: incorrectly accepting H0.
Type I error is the more serious of the two.
We must exercise caution in terms of rejecting H0.
WEEK 12 fybanuro@ug.edu.gh 73
Significance of the test.
How strong must the evidence in favour of HA be for us to reject H0?
Two approaches are used to answer this
question.
a) critical value;
b) p-value;
WEEK 12 fybanuro@ug.edu.gh 74
Significance of the test.
Using the critical value:
1) the analyst prescribes the probability of a type I error (α = 0.05).
2) the analyst then computes the test statistic using either the standard normal distribution, the t-distribution, the F-distribution or the Chi-square distribution.
3) the critical value is then read from the appropriate distribution table for either a one-tailed or a two-tailed test at α.
4) the result is then marked off on the graph, indicating the rejection region.
5) if the computed test statistic is greater than (in absolute terms) the critical value (statistically significant), H0 is rejected; otherwise we accept H0.
WEEK 12 fybanuro@ug.edu.gh 75
Significance of the test.
Using the p-value:
1) the p-value is the probability that the sample value would be as large as the value actually observed, if H0 is true.
2) the smaller the p-value, the more evidence there is in favour of HA.
3) sample evidence is statistically significant if the p-value is less than α.
WEEK 12 fybanuro@ug.edu.gh 76
Using p-value: one-tailed with σ known

State H0 and HA and specify α = 0.05.
1. Compute X̄.
2. Compute

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} = q$$

3. From the standard normal tables read off P(Z ≥ q); this is the p-value of H0.
Interpretation:
1. If H0 were true, there would be only this probability of observing an X̄ as large as that observed.
2. If X̄ is observed to be close to H0, the p-value would be large.
3. Reject H0 if the p-value is less than α.
WEEK 12 fybanuro@ug.edu.gh 77
Example
A standard manufacturing process has produced millions of TV tubes with a mean life μ = 1200 hours and a standard deviation σ = 300 hours. A new process, recommended by the engineering department as better, produces a sample of 100 tubes with X̄ = 1265. Test the engineering department's claim.
WEEK 12 fybanuro@ug.edu.gh 78
Solution
1. H0: μ ≤ 1200; HA: μ > 1200; α = 0.05.

2. $$\frac{\sigma}{\sqrt{n}} = \frac{300}{\sqrt{100}} = \frac{300}{10} = 30$$

3. $$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{1265 - 1200}{30} = 2.17$$
WEEK 12 fybanuro@ug.edu.gh 79
Solution
Via the critical value approach:
1) this is one-tailed, so Z_0.05 = 1.64;
2) since the computed Z is greater than Z-critical, we reject H0 and conclude that the new process is really better than the old process.
WEEK 12 fybanuro@ug.edu.gh 80
Solution
Via p-value:
1) P(Z ≥ 2.17) = 0.015.
2) This means that if H0 were true, there would be only a 1.5% probability of observing an X̄ as large as 1265.
3) Since 0.015 < 0.05, we reject H0 and conclude that the new process is better than the old process.
WEEK 12 fybanuro@ug.edu.gh 81
Using p-value: one-tailed with σ unknown

In this case we use the t-distribution:

$$t = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$$

The following steps may then be executed.
1. Compute the sample mean.
2. Compute the standard deviation and hence the standard error of the mean.
3. Compute the t-statistic and the degrees of freedom.
4. Read the p-value from the t-distribution.
5. Reject H0 (p-value < 0.05) or don't (p-value > 0.05).
WEEK 12 fybanuro@ug.edu.gh 82
Using p-value: one-tailed with σ unknown
Example:
A large required chemistry course at University
of Ghana has been using the same textbook for
a number of years. Over the years, the students
have been asked to rate this textbook on 10-
point scale, and the average rating has been
stable at about 5.2. This year, the faculty
decided to experiment with a new textbook. After
the course, 50 randomly selected
WEEK 12 fybanuro@ug.edu.gh 83
Using p-value: one-tailed with σ unknown
students were asked to rate this new
textbook, also on a scale of 1 to 10. The
results appear as tutorials\students.xls
Can we conclude that the students prefer
the new textbook to the old one?
WEEK 12 fybanuro@ug.edu.gh 84
Solution
1. H0: μ ≤ 5.2; HA: μ > 5.2
2. The sample mean is 5.680.
3. The sample standard deviation is 1.953, so the standard error of the mean is

$$\frac{1.953}{\sqrt{50}} = \frac{1.953}{7.071} = 0.276$$

4. $$t = \frac{5.68 - 5.2}{0.276} = 1.739; \qquad df = 50 - 1 = 49$$
WEEK 12 fybanuro@ug.edu.gh 85
Solution

Using the critical value: t_0.05(49) = 1.68.
Since t-critical < t-computed, we can reject H0.
Using the p-value approach: 0.025 < P(t ≥ 1.739) < 0.05.
Again, we can reject H0.
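The same test can be run from the summary statistics alone (the raw ratings in students.xls are not reproduced here):

```python
# Sketch: one-sample, one-tailed t-test from summary statistics.
from scipy import stats

n, xbar, s, mu0 = 50, 5.68, 1.953, 5.2
t = (xbar - mu0) / (s / n ** 0.5)
p_one_tailed = stats.t.sf(t, df=n - 1)
print(t, p_one_tailed)  # t ~ 1.74, p ~ 0.04 < 0.05, so reject H0
```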

WEEK 12 fybanuro@ug.edu.gh 86
Difference between two population
means
For equal population variances:

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sigma_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \quad \text{or} \quad t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \qquad \text{where } s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
Difference between two population
means
For unequal population variances:

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \quad \text{or} \quad t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
WEEK 12 fybanuro@ug.edu.gh 88
Difference between two population
means
Revisit the SureStep Company problem, believing that Supplier A's motors are more reliable than Supplier B's.

H0: μ_A ≤ μ_B; HA: μ_A > μ_B; α = 0.05
WEEK 12 fybanuro@ug.edu.gh 89
Difference between two population
means
$$\bar{X}_A = 748.8; \quad \bar{X}_B = 655.667; \quad n_A = n_B = 30; \quad S_A = 283.881; \quad S_B = 259.986$$

$$\bar{X}_A - \bar{X}_B = 748.8 - 655.667 = 93.133$$

$$\sqrt{\frac{283.881^2}{30} + \frac{259.986^2}{30}} = 70.281$$
WEEK 12 fybanuro@ug.edu.gh 90
Difference between two population
means




$$t = \frac{93.133}{70.281} = 1.325$$

Critical value = 1.67
P-value: between 0.05 and 0.10
In both cases, we cannot reject H0.
WEEK 12 fybanuro@ug.edu.gh 91
Hypothesis Test for Equal
population Variances







$$F\text{-value} = \frac{\max(s_1^2, s_2^2)}{\min(s_1^2, s_2^2)} \quad \text{with } (n_1 - 1) \text{ and } (n_2 - 1) \text{ df}$$
WEEK 12 fybanuro@ug.edu.gh 92
Proportions
Single proportion:

$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}}$$

where p and π are the sample and population proportions respectively.

Two proportions:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{p_c(1 - p_c)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where p̂_1 and p̂_2 are the sample proportions and p_c is the pooled proportion for the two samples combined.
WEEK 12 fybanuro@ug.edu.gh 93
Example
The Walpole Appliance Company has a
customer service department that handles
customer questions and complaints. This
departments processes are set up to respond
quickly and accurately to customers who phone
in their concerns. However, there is a sizable
minority of customers who prefer to write letters.
Traditionally, the customer service department
has not been very efficient in responding to
these customers.

WEEK 12 fybanuro@ug.edu.gh 94
Example (continued)
Letter writers first receive a mail-gram asking them
to call customer service (which is exactly what
letter writers wanted to avoid), and when they do
call, the customer service representative who
answers the phone typically has no knowledge of the customer's problem. As a result, the department manager estimates that 15% of letter writers have not obtained a satisfactory response within 30 days of the time their letters were first received. The manager's goal is to reduce this
value by at least half, that is to 7.5% or less.
WEEK 12 fybanuro@ug.edu.gh 95
Example (contd)
To do so, she changes the process for responding to
letter writers. Under the new process these customers
now receive a prompt and courteous form letter that
responds to their problem. Each form letter states that if
the customer still has problems, he or she can tell the
department. The manager also files the original letters so
that if customers call back, the representative who
answers will be able to find their letters quickly and
respond intelligently. With this new process in place, the
manager has tracked 400 letter writers and has found
that only 23 of them are classified as unsatisfied after a
30-day period. Does it appear that the manager has
achieved her goal?
WEEK 12 fybanuro@ug.edu.gh 96
Solution
The manager's desired proportion is π₀ = 0.075.
The observed proportion is p = 23/400 = 0.0575.
Our two hypotheses are H0: π ≥ 0.075; HA: π < 0.075.
The test statistic is

$$Z = \frac{0.0575 - 0.075}{\sqrt{\dfrac{0.075 \times 0.925}{400}}} = -1.329$$
WEEK 12 fybanuro@ug.edu.gh 97
Solution
Using the critical value: Z_0.05 = −1.65.
Since Z-critical < Z-computed, we cannot reject H0.
Using the p-value approach: P(Z ≤ −1.329) = 0.092.
Again, we cannot reject H0.

WEEK 12 fybanuro@ug.edu.gh 98
Differences between two
proportions
Example:
The Dirty and Clean Company, a large
manufacturer of automobile parts, has several
plants in Ghana. For years, Dirty and Clean
employees have complained that their suggestions
for improvements in the manufacturing processes
are ignored by upper management. In the spirit of
employee empowerment, Dirty and Clean
management at the Takoradi plant decided to initiate a
number of policies to respond to employee suggestions.
WEEK 12 fybanuro@ug.edu.gh 99
Differences between two
proportions
For example, a mailbox was placed in a central
location, and employees were encouraged to drop
suggestions into this box . No such initiatives were
taken at the other Dirty and Clean plants. As expected,
there was a great deal of employee enthusiasm
at the Takoradi plant shortly after the new policies
were implemented, but the question was whether life
would revert to normal and the enthusiasm would dampen
with time.
WEEK 12 fybanuro@ug.edu.gh 100
Differences between two
proportions
To check this, 100 randomly selected
employees at the Takoradi plant and 300
employees from other plants were asked to fill out
a questionnaire 6 months after the implementation
of the new policies at the Takoradi plant.
Employees were instructed to respond to each
item on the questionnaire by checking either a
yes box or a no box. Two specific items on the
questionnaire were:
WEEK 12 fybanuro@ug.edu.gh 101
Differences between two
proportions
Management at this plant is generally
responsive to employee suggestions for
improvements in the manufacturing
processes.
Management at this plant is more
responsive to employee suggestions now
than it used to be.
The results of the questionnaire for these
two items appear in the table below.

WEEK 12 fybanuro@ug.edu.gh 102
Differences between two
proportions


Employee empowerment results

Item 1: Management responds        Item 2: Things have improved
        Takoradi   Other                   Takoradi   Other
Yes         39       93           Yes         68      159
No          61      207           No          32      141
Totals     100      300           Totals     100      300
WEEK 12 fybanuro@ug.edu.gh 103
Differences between two
proportions
1. Does it appear that the policies at the
Takoradi plant are appreciated?
2. Should Dirty and Clean implement these
policies in its other plants?
WEEK 12 fybanuro@ug.edu.gh 104
Using p-value: two tailed
Compute the Z or t test statistic;
Read the one-tailed probability from the appropriate table;
The required p-value is twice this one-tailed probability;
Reject H0 if this p-value ≤ 0.05, else accept it.
WEEK 12 fybanuro@ug.edu.gh 105
Example
A large required chemistry course at
University of Ghana has been using the
same textbook for a number of years.
Over the years, the students have been
asked to rate this textbook on 10-point
scale, and the average rating has been
stable at about 5.2. This year, the faculty
decided to experiment with a new
textbook. After the course, 50 randomly
selected
WEEK 12 fybanuro@ug.edu.gh 106
Example
students were asked to rate this new
textbook, also on a scale of 1 to 10. The
results appear as students.xls. Can we
conclude that the students like this new
textbook any more or less than the
previous textbook?

WEEK 12 fybanuro@ug.edu.gh 107
Solution
1. H0: μ = 5.2; HA: μ ≠ 5.2
2. The sample mean is 5.680.
3. The sample standard deviation is 1.953, so the standard error of the mean is

$$\frac{1.953}{\sqrt{50}} = \frac{1.953}{7.071} = 0.276$$

4. $$t = \frac{5.68 - 5.2}{0.276} = 1.739; \qquad df = 50 - 1 = 49$$
WEEK 12 fybanuro@ug.edu.gh 108
Solution

Using the critical value: t_0.025(49) = 2.02.
Since t-critical > t-computed, we cannot reject H0.
Using the p-value approach: p-value = 2 × P(t ≥ 1.739), which lies between 0.05 and 0.10.
Again, we cannot reject H0.

WEEK 12 fybanuro@ug.edu.gh 109
Analysis of variance (ANOVA)
Statistical procedure for comparing the
differences between more than two
population means.
Why ANOVA?
large variability within samples makes it
difficult to infer whether there are really any
differences between population means;
small variability within samples makes it easy
to infer differences between population
means.
WEEK 12 fybanuro@ug.edu.gh 110
ANOVA: one-way
Used in two typical situations:
1. where we have several distinct populations
(starting salary for Business, Engineering
and Computer Science graduates);
2. randomised experiments (allergies and
randomly assign persons to a different type
of allergy medicine currently being
developed).
WEEK 12 fybanuro@ug.edu.gh 111
ANOVA
Compare variances within samples to
variances between the sample means.
Only if the between variance is large relative to
the within variance can we conclude with any
certainty that there are differences between
population means.
Two basic assumptions:
1. population variances are all equal to some common variance, σ²;
2. populations are normally distributed.
WEEK 12 fybanuro@ug.edu.gh 112
ANOVA: one-way
Let
I = number of distinct populations;
μ_i = mean of the i-th distinct population;
Ȳ_i = mean of the i-th sample;
S_i² = variance of the i-th sample;
n_i = size of the i-th sample;
n = Σ_{i=1}^{I} n_i = combined number of observations;
Ȳ = grand mean of the n observations, Ȳ = (Σ_{i=1}^{I} Ȳ_i) / I.
WEEK 12 fybanuro@ug.edu.gh 113
ANOVA: one-way
H0: μ_1 = μ_2 = ... = μ_I
HA: At least one mean is different from the others.

Then the between variation (sum of squares between, SSB) is

$$SSB = \sum_{i=1}^{I} n_i\left(\bar{Y}_i - \bar{Y}\right)^2 \quad \text{with } df_B = I - 1$$

The within variation (sum of squares within, SSW) is

$$SSW = \sum_{i=1}^{I} (n_i - 1)S_i^2 \quad \text{with } df_W = n - I$$
WEEK 12 fybanuro@ug.edu.gh 114
ANOVA: one-way
The MSB and MSW are then computed as

$$MSB = \frac{SSB}{df_B} \quad \text{and} \quad MSW = \frac{SSW}{df_W}$$

Finally, we compute the F-test statistic

$$F_{\text{ratio}} = \frac{MSB}{MSW}$$

We then use the critical value approach or the p-value approach to either reject or accept H0.
WEEK 12 fybanuro@ug.edu.gh 115
Example
Suppose that the performance of 3 machines
are to be compared. Because these machines
are operated by people, and because of other
inexplicable reasons, output per hour is subject
to chance fluctuation. In the hope of averaging
out and thus reducing the effect of chance
fluctuation, a random sample of 5 different hours
is obtained from each machine and set out in the
table below.
WEEK 12 fybanuro@ug.edu.gh 116
Example





Test the hypothesis that there are differences
between the output of the 3 machines.
Machine 1 Machine 2 Machine 3
47 55 54
53 54 50
49 58 51
50 61 51
46 52 49
WEEK 12 fybanuro@ug.edu.gh 117
Solution
H0: μ_1 = μ_2 = μ_3
HA: At least one of the means is different from the others.

$$\bar{Y}_1 = 49; \quad \bar{Y}_2 = 56; \quad \bar{Y}_3 = 51; \quad \bar{Y} = 52$$
$$S_1^2 = 7.5; \quad S_2^2 = 12.5; \quad S_3^2 = 3.5$$

$$SSB = \sum_{i=1}^{3} n_i\left(\bar{Y}_i - \bar{Y}\right)^2 = 5(49 - 52)^2 + 5(56 - 52)^2 + 5(51 - 52)^2 = 45 + 80 + 5 = 130$$
WEEK 12 fybanuro@ug.edu.gh 118
Solution
$$df_B = 3 - 1 = 2$$
$$SSW = \sum_{i=1}^{3} (n_i - 1)S_i^2 = 4(7.5) + 4(12.5) + 4(3.5) = 30 + 50 + 14 = 94$$
$$df_W = 15 - 3 = 12$$
$$MSB = \frac{SSB}{df_B} = \frac{130}{2} = 65 \qquad MSW = \frac{SSW}{df_W} = \frac{94}{12} = 7.83$$
WEEK 12 fybanuro@ug.edu.gh 119
Solution





$$F_{\text{ratio}} = \frac{MSB}{MSW} = \frac{65}{7.833} = 8.298$$

Via critical value: F_{0.05}(2, 12) = 3.89. Since 3.89 < 8.298, we reject H0 and conclude that there is a difference between the three machines.

Via p-value: P(F > 8.298) = 0.005. Since 0.005 < 0.05, we reject H0 and conclude that there is a difference between the three machines.
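The whole calculation can be verified in one call with SciPy, using the machine data from the table above:

```python
# Sketch: one-way ANOVA for the three machines.
from scipy import stats

machine1 = [47, 53, 49, 50, 46]
machine2 = [55, 54, 58, 61, 52]
machine3 = [54, 50, 51, 51, 49]

F, p = stats.f_oneway(machine1, machine2, machine3)
print(F, p)  # F ~ 8.30, p ~ 0.005 -> reject H0
```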

WEEK 12 fybanuro@ug.edu.gh 120


Solution: Constructing Confidence
Intervals
We use

$$\left(\bar{Y}_i - \bar{Y}_j\right) \pm \text{multiplier} \times SE\!\left(\bar{Y}_i - \bar{Y}_j\right)$$

where multiplier = 2.0, SE(Ȳ_i − Ȳ_j) = S_p √(1/n_i + 1/n_j), and S_p = √MSW.

Here S_p = √7.833 = 2.799 and √(1/5 + 1/5) = 0.632.
WEEK 12 fybanuro@ug.edu.gh 121
Solution: Constructing Confidence
Intervals
For machines 1 and 2:

$$(49 - 56) \pm 2 \times 2.799 \times 0.632 = -7 \pm 3.540 = (-10.54, -3.46)$$

The interval does not include 0, so machines 1 and 2 differ.
WEEK 12 fybanuro@ug.edu.gh 122
Solution: Constructing Confidence
Intervals
For machines 1 and 3:

$$(49 - 51) \pm 2 \times 2.799 \times 0.632 = -2 \pm 3.540 = (-5.54, 1.54)$$

The interval does include 0, so machines 1 and 3 do not differ.
WEEK 12 fybanuro@ug.edu.gh 123
Solution: Constructing Confidence
Intervals
For machines 2 and 3:

$$(56 - 51) \pm 2 \times 2.799 \times 0.632 = 5 \pm 3.540 = (1.46, 8.54)$$

The interval does not include 0, so machines 2 and 3 differ.
WEEK 12 fybanuro@ug.edu.gh 124
Chi-square Tests
Two tests:
goodness-of-fit and independence tests.
Goodness-of-fit test.
many statistical procedures are based on the
assumption that the population data are normally
distributed.
this is a test for normality.
let there be C categories;
let O_i (i = 1, ..., C) be the observed number of observations in the i-th category;
let E_i be the expected number of observations in the i-th category;
WEEK 12 fybanuro@ug.edu.gh 125
Goodness-of-fit test
let π_i be the probability of the i-th category under H0;
then E_i = n × π_i.
If H0 (of normality) is true, the test statistic

$$\chi^2 = \sum_{i=1}^{C} \frac{(O_i - E_i)^2}{E_i}$$

has approximately a χ² distribution with C − 1 degrees of freedom.
The test is then carried out using the p-value or the critical value for χ².
WEEK 12 fybanuro@ug.edu.gh 126
Example
We wish to test the hypothesis that births in Ghana
occur equally often throughout the year. Suppose
the only data available is a random sample of 88
births, grouped into seasons of differing length and
presented below.

Season Length Observed
Frequency
April to June 91 27
July to August 62 20
September to October 61 8
November to March 151 33
TOTAL 365 88
WEEK 12 fybanuro@ug.edu.gh 127
Solution
There are 4 cells here.
We first compute the π_i under H0:

$$\pi_1 = \frac{91}{365} = 0.249; \quad \pi_2 = 0.170; \quad \pi_3 = 0.167; \quad \pi_4 = 0.414$$

The expected frequencies are:

$$E_1 = 88 \times 0.249 = 21.912; \quad E_2 = 14.96; \quad E_3 = 14.696; \quad E_4 = 36.432$$

We next compute the (O − E) values:

$$O_1 - E_1 = 27 - 21.912 = 5.088; \quad O_2 - E_2 = 5.04; \quad O_3 - E_3 = -6.696; \quad O_4 - E_4 = -3.432$$

The components of χ² are given on the next slide.
WEEK 12 fybanuro@ug.edu.gh 128
Solution




$$\chi^2 = \frac{(5.088)^2}{21.912} + \frac{(5.04)^2}{14.96} + \frac{(-6.696)^2}{14.696} + \frac{(-3.432)^2}{36.432} = 1.1814 + 1.7013 + 3.051 + 0.3233 = 6.257$$

Use the p-value (0.10) or the critical value (7.81) to complete the test.
Thus, we cannot reject H0.
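A quick check with SciPy (the expected counts are recomputed from the season lengths, so the χ² differs from the slide only by rounding):

```python
# Sketch: goodness-of-fit test for the births data.
from scipy import stats

observed = [27, 20, 8, 33]
season_days = [91, 62, 61, 151]
expected = [88 * d / 365 for d in season_days]   # ~21.9, 15.0, 14.7, 36.4

chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)  # chi2 ~ 6.26, p ~ 0.10 -> cannot reject H0
```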

WEEK 12 fybanuro@ug.edu.gh 129
Independence test
Population is categorised in two different ways
(smoking habits and drinking habits)
We then wish to find whether the two attributes
are independent in a probabilistic sense.
The chi-square test enables us to test this
empirically.
Data usually counts in various combinations of
categories (contingency table)
WEEK 12 fybanuro@ug.edu.gh 130
Independence test
Suppose there are I rows and J columns.
Let p_i and p_j be their respective marginal probabilities.
Then p_ij = p_i × p_j and E_ij = n × p_ij.
The test statistic is

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \quad \text{with } df = (I - 1)(J - 1)$$

Use the p-value or the critical value to complete the test.
WEEK 12 fybanuro@ug.edu.gh 131


Example
In a demographic study of women who were listed in Who's Who, the following table was compiled for
1436 women who were married at least once.





Is there a relationship between marital status and
educational level?
Education Married once Married twice TOTAL
College 550 61 611
No college 681 144 825
TOTAL 1231 205 1436
WEEK 12 fybanuro@ug.edu.gh 132
Solution
I = J = 2.

Marginal probabilities for the rows:

$$p_{r1} = \frac{611}{1436} = 0.4255; \quad p_{r2} = \frac{825}{1436} = 0.5745$$

Marginal probabilities for the columns:

$$p_{c1} = \frac{1231}{1436} = 0.8572; \quad p_{c2} = \frac{205}{1436} = 0.1428$$
WEEK 12 fybanuro@ug.edu.gh 133
Solution
Computing the p_ij:

$$p_{11} = p_{r1} \times p_{c1} = 0.4255 \times 0.8572 = 0.3647 \qquad p_{12} = p_{r1} \times p_{c2} = 0.4255 \times 0.1428 = 0.0608$$
$$p_{21} = p_{r2} \times p_{c1} = 0.5745 \times 0.8572 = 0.4925 \qquad p_{22} = p_{r2} \times p_{c2} = 0.5745 \times 0.1428 = 0.0820$$

Computing the E_ij:

$$E_{11} = 1436 \times 0.3647 = 523.71 \qquad E_{12} = 1436 \times 0.0608 = 87.31$$
$$E_{21} = 1436 \times 0.4925 = 707.23 \qquad E_{22} = 1436 \times 0.0820 = 117.75$$
WEEK 12 fybanuro@ug.edu.gh 134
Solution
Compute

$$\chi^2 = \frac{(550 - 523.71)^2}{523.71} + \frac{(61 - 87.31)^2}{87.31} + \frac{(681 - 707.23)^2}{707.23} + \frac{(144 - 117.75)^2}{117.75} = 1.3197 + 7.9283 + 0.9728 + 5.8519 = 16.0717$$

Use the p-value (< 0.001) or the critical value (3.84) to complete the test.
We reject H0.
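The same test can be run from the raw contingency table; Yates' continuity correction is switched off to match the formula above, and small rounding differences from the slide's 16.07 are expected.

```python
# Sketch: chi-square test of independence for marital status vs education.
from scipy import stats

table = [[550, 61],
         [681, 144]]

chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(chi2, p, dof)  # chi2 ~ 16.0, p < 0.001 -> reject H0
```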
HYPOTHESIS TESTING
NON-PARAMETRIC TESTS
Introduction
These tests do not assume any
distribution for the data.
They are called distribution-free tests.
They use ranks and are very efficient and
hence very popular.
We shall discuss
1. Wilcoxon rank test for two samples
2. Friedman's rank test for multiple comparisons
Wilcoxon rank test for two samples
It tests H0 that there is no difference in the distribution of the two populations.
The procedure is as follows:
rank the observations for both samples;
compute Wilcoxon's Rank Sum (W): the sum of the ranks of the smaller sample;
find the p-value for H0 from Table VIII.
Example
Suppose that independent random samples of annual income were
taken from two different regions in Ghana in 1980 and then ordered as
in the table below.










Test H
0
that the two underlying populations are identical (no difference
in annual income)
Northern Greater Accra
6,000 11,000
10,000 13,000
15,000 14,000
29,000 17,000
20,000
31,000
Solution
Ranks:




H0: There is no difference between the incomes of the North and Greater Accra.
HA: The North is poorer than Greater Accra.
n_1 = 4; n_2 = 6; W = 18
From Table VIII, the p-value for H0 is 0.238. Thus we cannot reject H0, since 0.238 > 0.05.
Northern Greater Accra
1 3
2 4
6 5
9 7
8
10
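The rank-sum test can be reproduced with SciPy's Mann-Whitney routine, which is equivalent to the Wilcoxon rank-sum test; for these small samples it returns the exact one-sided p-value.

```python
# Sketch: Wilcoxon rank-sum (Mann-Whitney) test for the income data.
from scipy import stats

northern = [6_000, 10_000, 15_000, 29_000]
accra    = [11_000, 13_000, 14_000, 17_000, 20_000, 31_000]

u, p = stats.mannwhitneyu(northern, accra, alternative="less")
print(u, p)  # p ~ 0.238 > 0.05 -> cannot reject H0
```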
Friedmans rank test for multiple
comparisons.
The test procedure is as follows:
rank the observations within each of the I blocks (rows), taking care of ties;
compute the average rank under each treatment j (columns), R̄_.j (j = 1, 2, ..., J);
compute the average of the R̄_.j and call it R̄_.. ;
under H0 the test statistic

$$Q = \frac{12\,I}{J(J + 1)} \sum_{j=1}^{J} \left(\bar{R}_{.j} - \bar{R}_{..}\right)^2$$

can be approximated by a χ² distribution with (J − 1) degrees of freedom.
Friedman's rank test for multiple
comparisons.
Use the critical value or p-value approach to complete
the test.
Example:
Consider an experimental study of drugs to relieve
itching. Five drugs were compared to placebo and no
drug with 10 volunteer male subjects. Each volunteer
underwent one treatment per day, and the time order
was randomised. The subjects were given a drug (or
placebo) intravenously, and then itching was induced on
their forearms with cowage, an effective itch stimulus.
The subjects recorded the duration of the itching. The
following table (..\..\itching.xls) gives the durations of
the itching (in seconds). Is there any drug effect?
Solution
See itching.xls.







$$Q = \frac{12 \times 10}{7 \times 8}\left[(5.1 - 4)^2 + (4.9 - 4)^2 + (2.3 - 4)^2 + (3.5 - 4)^2 + (3.05 - 4)^2 + (4.9 - 4)^2 + (4.25 - 4)^2\right] = \frac{120}{56} \times 6.935 = 14.86$$

p-value = 0.022
Critical value = 12.6

Both the p-value and the critical value suggest that we reject H0 that there is no difference between the drugs.
REGRESSION ANALYSIS
1. Introduction
Regression analysis is the study of
relationships between variables.
It applies to so many situations.
Three issues to discuss
1. to infer its characteristics (slope and
intercept)
2. to know which explanatory variables
belong to the equation.
3. to use the model for prediction

1. Introduction
There are four categorisations of
regression analysis:
1. Overall purpose of the analysis
to understand how the world operates;
to make predictions
2. Data type being analysed
cross-sectional
time series data (autocorrelation)

1. Introduction
3. The number of explanatory variables in
the analysis
simple regression
multiple regression
4. Linear versus non-linear models
We shall focus on linear regression since
it can also be used to estimate non linear
relationships (after mathematical
transformations)
2. Exploratory Techniques
Scatter plots
check for existence of relationship between
variables
also reveals outliers ( observations that lie
outside the typical pattern of points)
requires thorough investigation
may or may not be deleted
run the regression analysis with them or without
them and compare the results.
2. Exploratory Techniques
Scatter plots
also reveals unequal variances
the variability of the response variable increases as
the explanatory variable increases ( a fan shape)
violates one of the assumptions in linear regression
analysis.
also reveals no relationship between a pair of variables, in which case the analysis stops right there.
3. Correlations
Correlations are numerical summary
measures that indicate the strength of
relationships between pairs of variables.
A correlation can only measure the
strength of a linear relationship.
Correlations lie between -1 and +1,
inclusive
see ..\EMBA\data\correlation.xls

3. Correlations
The correlation between two variables X and Y is given as

$$R_{XY} = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{n - 1} \times \frac{1}{S_X \, S_Y}$$

The first term is the covariance between X and Y (a measure of association between X and Y).
Its magnitude depends on the units in which the variables are measured.
For this reason, it is often difficult to interpret the magnitude of a covariance, and we concentrate instead on correlations.
4. Simple Linear Regression
Though scatterplots and correlations indicate
linear relationships and the strengths of these
relationships, they do not quantify the
relationships.
From correlation.xls, we see that Sales are
related to promotional expenditures. But what
exactly is this relationship?
This subsection is devoted to quantifying
relationships between two variables (fitting a
straight line through the scatterplot of the
response and explanatory variables).
see ..\EMBA\data\leastsquares.xls
4. 1. Least Square Estimation
We wish to choose the line that makes the
vertical distances from the points to the line as
small as possible.
Vertical distance from the horizontal axis to any
point can be decomposed into two:
vertical distance from the horizontal axis to the line
(fitted value)
vertical distance from the line to the point (residual:
positive or negative)
see ..\EMBA\data\leastsquares.xls
4. 1. Least Square Estimation
Fundamental equation for regression is
Observed Value = Fitted Value + Residual
Best fitting line through the points in the
scatterplot is the one with the smallest
sum of squared residuals.
This is called the least squares line.
Why use squared residuals? (cancel out!!)
4. 1. Least Square Estimation
Recall that the equation for any straight line can be written as

$$Y_i = \beta_0 + \beta_1 X_i + e_i$$

where β_0, β_1, and e_i are the Y intercept, the slope of the line, and the error terms respectively.
The formulas for the least squares line are

4. 1. Least Square Estimation




Our fitted regression line is then presented as


Then a typical residual e
i
is


$$1.\quad \hat{\beta}_1 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2} = r_{XY} \cdot \frac{S_Y}{S_X}$$

$$2.\quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Our fitted regression line is then presented as

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

Then a typical residual e_i is

$$e_i = Y_i - \hat{Y}_i$$
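A minimal sketch of the least-squares computation on hypothetical data; np.polyfit is shown only as a cross-check, not as the method used in the lecture files.

```python
# Sketch: fitting a least-squares line to hypothetical (X, Y) data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

fitted = b0 + b1 * x
residuals = y - fitted
print(b0, b1)
print(np.polyfit(x, y, 1))  # returns [slope, intercept]; same line
```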
4. 1. Least Square Estimation
Magnitude of residuals a good indication of how
useful the regression line is for predicting the
response values from the explanatory variable
values.
A single numerical measure is the standard error of estimate (S_e):

$$S_e = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - (k + 1)}}$$
4. 1. Least Square Estimation
The smaller S
e
is, the more accurate predictions
tend to be.
The percentage of variation of the response variable explained by the regression is known as the coefficient of determination (R²):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}, \qquad 0 \le R^2 \le 1$$
4. 1. Least Square Estimation
R² measures the goodness of a linear fit.
It is also the square of the correlation between the observed Y values and the fitted Ŷ values; this correlation is called multiple R.
Thus multiple R = √R².
For simple linear regression, multiple R is the same as the absolute value of the correlation between Y and X.
The better the linear fit is, the closer R² is to 1.
See ..\EMBA\data\plot1.xls and ..\EMBA\data\plot2.xls.
5. Multiple Regression
When we include several explanatory variables
in the regression equation, we move into the
realm of multiple regression.
Some characteristics of multiple regression are
we are now fitting a plane but not a line.
we still employ the least squares method.
we eliminate bias of some of the confounding
variables (fertilizer & rainfall in yield predicting)
we can reduce the residual variance and hence
improve confidence intervals and other tests.
5. Multiple Regression:
Assumptions
1. The e_i's are unobserved, independent and identically distributed error terms with E(e_i) = 0 and var(e_i) = σ_e².
2. The distribution of e is independent of the joint distribution of the X's.
3. The β's are constants, referred to as the marginal effects of the X's on Y.
4. The e's are normally distributed.
5. Multiple Regression: The Model
Any observed Y may be expressed as its expected value plus the random error term. Thus:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + e_i$$

We normally don't know the true equation but must fit an estimate of the form

$$\hat{Y} = a_0 + a_1 X_{1i} + a_2 X_{2i} + \dots + a_k X_{ki}$$
5. Multiple Regression: The Model
The least squares procedure determines the a's that minimise the sum of squares given by

$$V^2 = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

More generally, a = (X′X)⁻¹X′y, and for a to exist, (X′X)⁻¹ must exist.
This means that the columns of X must be linearly independent (full rank).
5. Multiple Regression
Coefficient of Multiple Determination (R²):

$$R^2 = \frac{SSR}{SST}, \qquad \text{where} \quad SSR = \sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2, \quad SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \quad SST = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
5. Multiple Regression
R² increases with the addition of more explanatory variables.
We penalise for this by using the adjusted R²:

$$\text{Adjusted } R^2 = 1 - \frac{n - 1}{n - (k + 1)}\left(1 - R^2\right)$$
5. Multiple Regression
Inferences for the model
The overall goodness of fit of the model is tested using an F-test. In this case,

H0: β_1 = β_2 = ... = β_k = 0
HA: At least one of the β_i is not equal to 0

The F-statistic is given by

$$F = \frac{SSR / k}{SSE / (n - k - 1)} = \frac{MSR}{MSE}$$
5. Multiple Regression
When can we drop a regressor
confidence interval of its coefficient includes 0
when its p-value is greater than 0.05.
However, even if either of the above is true, when we have strong prior grounds for believing that X is related positively or negatively to Y, X generally should not be dropped from the regression equation if its coefficient has the right sign.
See ..\EMBA\data\multiregression.xls
5. Multiple Regression
Multicollinearity
Multicollinearity occurs when there is a
nearly linear relationship among a set of
explanatory variables
Consequences
Increased variances of affected explanatory
variables
Magnitude of these coefficients may also change
Sign of the regression coefficients may also
change
5. Multiple Regression
Detection
Using correlation matrix
Using the squared multiple correlation
measure R
j
2.
When R
j
2
is approximately 1, that is an indication of
multicollinearity.

5. Multiple Regression
Remedy
1. Model re-specification
2. If there are too few observations for too
many variables, add more observations or
eliminate some variables on theoretical
basis.
3. Combine linear variables into a single
variable
4. Use other methods of estimation.
5. Multiple Regression
Example

See ..\EMBA\data\Height.xls

Data was generated with
0
= 32 and
1
=
3.2 for right or left foot length.
5. Multiple Regression
Heteroscedasticity
A condition whereby the error variances are
not constant over all cases.
Transforming both response and explanatory
variables.
Autocorrelation
Error terms are correlated an indication of the
omission of one or more explanatory
variables.
5. Multiple Regression
Detection
Plot residuals against time or cases and look for
systematic patterns
Using statistical tests based on the first-order
autoregressive model.


$$Y_t = \beta_0 + \beta_1 X_{1t} + \dots + \beta_k X_{kt} + e_t, \qquad \text{where } e_t = \rho\, e_{t-1} + u_t \text{ with } |\rho| < 1$$

$$\hat{\rho} = \frac{\sum_{t=2}^{n} e_t\, e_{t-1}}{\sum_{t=2}^{n} e_{t-1}^2}$$
5. Multiple Regression
The ordinary least squares fit is done and the Durbin-Watson test is conducted as follows:

H0: ρ = 0
HA: ρ ≠ 0

with the Durbin-Watson statistic D defined by

$$D = \frac{\sum_{t=2}^{n}\left(e_t - e_{t-1}\right)^2}{\sum_{t=1}^{n} e_t^2}$$
5. Multiple Regression
The Durbin-Watson d_L and d_U are then read from their table and the decision is made as follows.

When ρ̂ > 0:
If D > d_U we accept H0.
If D < d_L we accept HA.
If d_L ≤ D ≤ d_U we cannot make a decision.

When ρ̂ < 0:
If D < 4 − d_U we accept H0.
If D > 4 − d_L we accept HA.
If 4 − d_U ≤ D ≤ 4 − d_L we cannot make a decision.
Example
Consider the manufacturing overhead data for
which a multiple regression model is fitted.
We wish to test for autocorrelation.

See ..\EMBA\data\autocorrelation.xls

Solution:

H0: ρ = 0; HA: ρ ≠ 0; α = 0.05




Example: Solution





$$\sum_{t=2}^{36} e_t\, e_{t-1} = 174050481.31 \qquad \sum_{t=2}^{36} e_{t-1}^2 = 524500854.05$$

Then

$$\hat{\rho} = \frac{\sum_{t=2}^{36} e_t\, e_{t-1}}{\sum_{t=2}^{36} e_{t-1}^2} = \frac{174050481.31}{524500854.05} = 0.3318$$
Solution


$$D = \frac{\sum_{t=2}^{36}\left(e_t - e_{t-1}\right)^2}{\sum_{t=1}^{36} e_t^2} = \frac{731587427.98}{557166199.0969} = 1.3131$$

We now read from the Durbin-Watson table for n = 36 and p − 1 = 2 to obtain:
d_L = 1.15 and d_U = 1.38
Since ρ̂ > 0, we use the first part of the test.
We notice that d_L < D < d_U, so we cannot conclude whether or not there is autocorrelation.
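A sketch of how D could be computed from any vector of OLS residuals; the residuals below are hypothetical, since the actual ones live in autocorrelation.xls.

```python
# Sketch: Durbin-Watson statistic from a residual vector (hypothetical values).
import numpy as np

e = np.array([1.2, 0.8, -0.5, -1.1, 0.3, 0.9, -0.4, -0.7, 0.6, 0.1])
D = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(D)  # compare with the tabulated d_L and d_U for the given n and k
# statsmodels.stats.stattools.durbin_watson(e) gives the same value.
```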

Reading Texts
1. Albright, S. C., Winston, W. L., and Zappe, C. (2003), Data Analysis and Decision Making with Microsoft Excel, 2nd Edition, Thomson Brooks/Cole.
2. Healey, J. F. (1993), Statistics: A Tool for Social Research, 3rd Edition, Wadsworth Publishing Company.
3. Wayne, W. D. and Terrell, J. C. (1989), Business Statistics for Management and Economics, 5th Edition, Houghton Mifflin Company, Boston.
4. Wonnacott, T. H. and Wonnacott, R. J. (1990), Introductory Statistics, 5th Edition, John Wiley & Sons.
5. Louise, S. (2001), Quantitative Methods for Business, Management and Finance, Palgrave.
