
Faculty of Applied Sciences

Department of Mathematics and Physics


Statistical Methods 2B Lecture Notes
Lecturer: Mr. T. Farrar
Contents
1 Review of Random Variables and Probability Distributions
2 Correlation Analysis of Paired Data Sets
3 Simple Linear Regression Analysis
4 Multiple Linear Regression
5 Logistic Regression
6 Poisson Regression
1 Review of Random Variables and Probability
Distributions
What you will be expected to already know
1. Descriptive Statistics
2. Basic Probability concepts
3. Graphical methods of displaying data (line graph, scatter plot, histogram)
4. Random Variables and Probability Distributions (Discrete and continuous)
5. Special probability distributions (binomial, Poisson, normal)
6. Hypothesis Testing (t-tests, F tests, χ² tests, nonparametric tests, p-values)
7. Basic calculus
8. Matrices
Discrete Random Variables
Definition: A random variable is a variable which takes on its values by chance
Definition: The sample space S (a.k.a. support) is the set of possible values that a random variable may take
A random variable is discrete if it can take only a finite or countably infinite number of distinct values. Usually a discrete random variable takes on only integer values.
E.g. Number of defective television sets in a shipment of 100 sets
S = {0, 1, 2, . . . , 100}
E.g. Number of visits to a website in one year
S = {0, 1, 2, . . .}
We use an uppercase letter such as Y to denote a random variable, and a
lowercase letter such as y to denote a particular value that the random variable
may assume
Discrete Probability Distributions
We may denote the probability that Y takes on the value y by Pr (Y = y)
This probability is subject to the following restrictions:
1. 0 ≤ Pr(Y = y) ≤ 1 for all y (all probabilities must be between 0 and 1)
2. Σ_{y∈S} Pr(Y = y) = 1 (the sum of probabilities over the whole sample space must be 1)
E.g. Rolling a six-sided die: let Y be the number that comes up
Pr(Y = y) = 1/6, y = 1, 2, 3, 4, 5, 6
It is easy to see that both restrictions hold.
The probability distribution of the lengths of patent lives for new drugs is given below. The patent life refers to the number of years a company has to make a profit from the drug after it is approved before competitors may produce the same drug.
Years, y 3 4 5 6 7 8 9 10 11 12 13
Pr (Y = y) .03 .05 .07 .10 .14 .20 .18 .12 .07 .03 .01
The function that maps all values in the sample space to their probabilities is
called a probability mass function
It may be expressed in a table (as above) or as a mathematical formula
We can use a graph to represent the probability mass function:
Suppose the law dictates that the sentence (in years) for a particular crime
must be between 5 and 10 years in prison. By looking at past cases a lawyer
is able to construct the following probability distribution for the number of
years to which a person convicted of the crime is sentenced:
f(y) = 0.4471/√y, y = 5, 6, 7, 8, 9, 10
Hence the probability that a person convicted of this crime receives a 6 year sentence is
f_Y(6) = 0.4471/√6 = 0.1825
As an exercise, graph this probability mass function and verify that it satisfies the two restrictions on probability mass functions.
Expected Value of a Discrete Random Variable
We can define the expected value of a random variable as follows:
E(Y) = Σ_{y∈S} y·f(y)
If f(y) accurately characterises the population described by the random variable Y, then E(Y) = μ, the population mean
In our prison sentencing example:
E(Y) = Σ_{y=5}^{10} y · (0.4471/√y) = Σ_{y=5}^{10} 0.4471·√y
= 0.4471(√5 + √6 + √7 + √8 + √9 + √10)
= 7.298
Thus, we would expect the average sentence to be 7.3 years.
It can also be shown that for any real-valued function g(Y ), the expected value
of g(Y ) is given by:
E(g(Y)) = Σ_{y∈S} g(y)·f(y)
Variance of a Discrete Random Variable
We can define the variance of a random variable as follows:
σ² = Var(Y) = E[(Y − μ)²] = E(Y²) − μ² (why?)
= Σ_{y∈S} y²·f(y) − E(Y)²
In our prison sentencing example:
Var(Y) = Σ_{y=5}^{10} y² · (0.4471/√y) − E(Y)²
= Σ_{y=5}^{10} 0.4471·y^(3/2) − 7.298²
= 0.4471(5^(3/2) + 6^(3/2) + 7^(3/2) + 8^(3/2) + 9^(3/2) + 10^(3/2)) − 7.298²
= 56.177 − 53.261
= 2.916
Thus, the variance of Y is 2.916 and the standard deviation is √2.916 = 1.71
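As a quick numerical check of this example (a minimal sketch, assuming Python with numpy is available), the listing below evaluates the pmf over its support and computes the mean, variance and standard deviation directly from the definitions above.

import numpy as np

# pmf of the sentencing example: f(y) = 0.4471/sqrt(y), y = 5,...,10
y = np.arange(5, 11)
f = 0.4471 / np.sqrt(y)

print(f.sum())                      # should be approximately 1
mean = np.sum(y * f)                # E(Y) = sum of y*f(y)
var = np.sum(y**2 * f) - mean**2    # Var(Y) = E(Y^2) - E(Y)^2
print(mean, var, np.sqrt(var))      # approx 7.298, 2.916, 1.71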
Properties of Expected Value
Let Y be a discrete random variable with probability mass function f(y) and
let a be a constant. Then E(aY ) = aE(Y ).
Proof:
E(aY) = Σ_{y∈S} a·y·f(y)
= a Σ_{y∈S} y·f(y)
= a·E(Y)
As an exercise, prove that if b is a constant, then E(b) = b.
As a further exercise, if Y₁ and Y₂ are two random variables, prove that E(Y₁ + Y₂) = E(Y₁) + E(Y₂).
Properties of Variance
Let Y be a discrete random variable with probability mass function f(y) and
let a be a constant. Then Var(aY) = a²·Var(Y).
Proof:
Var(aY) = E(a²Y²) − E(aY)²
= a²E(Y²) − a²E(Y)²
= a²[E(Y²) − E(Y)²]
= a²·Var(Y)
As an exercise, prove that if b is a constant, then Var (b) = 0.
Special Discrete Probability Distributions
Binomial Distribution
The binomial distribution relates to a binomial experiment which has the
following five properties:
1. The experiment consists of a fixed number of trials, n
2. Each trial results in one of two outcomes, called success and failure (denoted 1 and 0)
3. The probability of success in each trial is equal to p and the probability of failure is 1 − p (sometimes called q)
4. All the trials are independent of one another
5. The random variable of interest is Y , the total number of successes ob-
served in the n trials
The probability mass function for the binomial distribution is as follows:
f(y) = C(n, y)·p^y·(1 − p)^(n−y), y = 0, 1, 2, . . . , n and 0 ≤ p ≤ 1
where C(n, y) denotes the binomial coefficient "n choose y"
We can derive this function using the multiplicative probability rule for independent events and the concept of combinations
We have y successes and n − y failures, and there are n!/(y!(n − y)!) = C(n, y) ways to arrange them in order
Here is a graph of the binomial probability mass function where n = 15 and
p = 0.4:
As an exercise, draw the binomial probability mass function where n = 9 and
p = 0.8.
Mean and Variance of Binomial Distribution
The mean of a binomially distributed random variable is E(Y ) = np.
The variance of a binomially distributed random variable is Var(Y) = np(1 − p).
Binomial Example
There is an English saying, "Don't count your chickens before they hatch"
A farmer is breeding chickens. He has 15 hens that each lay one egg per day.
The eggs are then placed in incubators
He has observed that there is an 80% hatchability rate, that is, an 80% prob-
ability that an egg will hatch into a live chick
1. How many live chicks should the farmer expect per day?
E(Y) = np = 15 × 0.8 = 12
2. What is the probability that at least 13 eggs from a given day will hatch?
Pr(Y ≥ 13) = Pr(Y = 13) + Pr(Y = 14) + Pr(Y = 15)
= C(15, 13)·0.8¹³·(1 − 0.8)² + C(15, 14)·0.8¹⁴·(1 − 0.8)¹ + C(15, 15)·0.8¹⁵·(1 − 0.8)⁰
= 0.2309 + 0.1319 + 0.0352 = 0.398
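The same probabilities can be checked with scipy's binomial distribution; this is a sketch assuming scipy is installed, not part of the original worked example.

from scipy.stats import binom

n, p = 15, 0.8
print(binom.mean(n, p))                     # E(Y) = np = 12
print(binom.pmf([13, 14, 15], n, p).sum())  # Pr(Y >= 13), approx 0.398
print(binom.sf(12, n, p))                   # same probability, via 1 - Pr(Y <= 12)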
Negative Binomial Probability Distribution
While a binomial random variable measures the number of successes in n trials of a binomial experiment where n is fixed, a negative binomial random variable measures the number of trials y required for k successes to occur.
We could think of this as the event A ∩ B where A is the event that the first y − 1 trials contain k − 1 successes and B is the event that the yth trial results in a success.
f(y) = Pr(A ∩ B) = Pr(A)·Pr(B) (since A and B are independent)
Pr(A) = C(y − 1, k − 1)·p^(k−1)·q^(y−k), y ≥ k (by the binomial distribution)
Pr(B) = p
Thus f(y) = C(y − 1, k − 1)·p^k·q^(y−k), y = k, k + 1, k + 2, . . .
Negative Binomial Distribution
Here is a graph of the negative binomial probability mass function where k = 3 and p = 0.6 (going as far as y = 17):
As an exercise, draw the negative binomial probability mass function where
k = 2 and p = 0.5, up to y = 10.
Mean and Variance of Negative Binomial Distribution
The mean of a negative binomial random variable is E(Y) = k/p
The variance of a negative binomial random variable is Var(Y) = k(1 − p)/p²
Negative Binomial Distribution Example
Each time a fisherman casts his line into the water there is a probability of 1/8 that he will catch a fish.
Today he has decided that he will continue casting his line until he catches 5 fish
1. What is the expected number of casts required to catch 5 fish?
E(Y) = k/p = 5/0.125 = 40
2. What is the standard deviation of the number of casts required to catch 5 fish?
Var(Y) = 5(1 − 0.125)/0.125² = 280
σ = √Var(Y) = √280 = 16.73
4. What is the probability that he will need exactly 50 casts?
Pr(Y = 50) = C(50 − 1, 5 − 1)·0.125⁵·(1 − 0.125)⁵⁰⁻⁵ = 0.0159
5. What is the probability that he will need more than 8 casts?
Pr(Y > 8) = 1 − Σ_{y=5}^{8} C(y − 1, 4)·0.125⁵·(1 − 0.125)^(y−5)
= 1 − [C(4, 4)·0.125⁵·0.875⁰ + C(5, 4)·0.125⁵·0.875¹ + C(6, 4)·0.125⁵·0.875² + C(7, 4)·0.125⁵·0.875³]
= 1 − (0.0000 + 0.0001 + 0.0004 + 0.0007)
= 1 − 0.0012 ≈ 0.999
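scipy also provides a negative binomial distribution, but note that it counts the number of failures before the k-th success rather than the total number of trials Y; the sketch below (an illustration assuming scipy, not part of the notes) converts between the two by subtracting k.

from scipy.stats import nbinom

k, p = 5, 0.125
print(k / p)                        # E(Y) = 40 casts
print(nbinom.pmf(50 - k, k, p))     # Pr(Y = 50), approx 0.0159
print(nbinom.sf(8 - k, k, p))       # Pr(Y > 8), approx 0.999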
Poisson Distribution
The Poisson Distribution can be thought of as a limiting case of the binomial
distribution
Suppose we are interested in the number of car accidents Y that occur at a
busy intersection during one week
We could divide the week into n intervals of time, with each interval being so
small that at most one accident could occur in that interval
We define p as the probability that an accident occurs in a particular sub-interval and 1 − p as the probability that no accident occurs
We could then think of this as a binomial experiment
It can then be shown that:
lim_{n→∞} C(n, y)·p^y·(1 − p)^(n−y) = (np)^y·e^(−np)/y!
If we let λ = np then we have the probability mass function of the Poisson distribution:
f(y) = λ^y·e^(−λ)/y!, y = 0, 1, 2, . . .
Here is a graph of the Poisson probability mass function where λ = 3.3 (going as far as y = 12):
As an exercise, draw the Poisson probability mass function where λ = 1, up to y = 6.
Mean and Variance of the Poisson Distribution
The Poisson Distribution is used to model the counting of rare events that occur with a certain average rate per unit of time or space
For the Poisson Distribution, E(Y) = λ and Var(Y) = λ
The expected value and variance are equal!
Poisson Distribution Example
The number of complaints that a busy laundry facility receives per day is a
random variable Y having a Poisson distribution with λ = 3.3
1. What is the probability that the facility will receive less than two complaints on a particular day?
Pr(Y < 2) = Pr(Y = 0) + Pr(Y = 1)
= λ⁰e^(−λ)/0! + λ¹e^(−λ)/1!
= 0.0369 + 0.1217 = 0.1586
2. What is the average number of complaints the facility receives per week? (The facility is open five days per week)
If the number of complaints per day has a Poisson distribution with parameter λ, then the number of complaints in five days has a Poisson distribution with parameter 5λ. Thus, if we let W be the number of complaints per week, then:
E(W) = 5λ = 16.5
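Again as a sketch (assuming scipy), the laundry example can be verified directly:

from scipy.stats import poisson

lam = 3.3
print(poisson.pmf(0, lam) + poisson.pmf(1, lam))  # Pr(Y < 2), approx 0.1586
print(poisson.cdf(1, lam))                        # the same probability
print(5 * lam)                                    # E(W), complaints per 5-day week = 16.5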
Continuous Random Variables
A random variable is continuous if it can take on any value in an interval (e.g., between 0 and 5). In other words, continuous random variables take on real-numbered values
There is no such thing as a probability mass function for a continuous random variable. Instead, we have a probability density function which allows us to find probabilities over an interval
If Y is a continuous random variable, and f(y) is the probability density function, then:
Pr(a ≤ Y ≤ b) = ∫_a^b f(y) dy
What we are actually doing is finding the area under the curve between a and b.
Properties of a Probability Density Function
1. f(y) ≥ 0 for all y, −∞ < y < ∞
2. ∫_{−∞}^{∞} f(y) dy = 1
Probability Density Function Example
Suppose Y is the proportion of people who pay their income tax on time. The
probability density function is:
f(y) = 3y² for 0 ≤ y ≤ 1, and 0 elsewhere
Verify that f(y) is a probability density function
First we note that 3y² ≥ 0 for all 0 ≤ y ≤ 1, so the first condition is satisfied.
Second:
∫_{−∞}^{∞} f(y) dy = ∫₀¹ f(y) dy (since the function is 0 elsewhere)
= ∫₀¹ 3y² dy
= y³ |₀¹ = 1³ − 0³ = 1
Thus the second condition is also satisfied.
Find the probability that between 60% and 90% of people pay their income
tax on time.
Pr(0.6 ≤ Y ≤ 0.9) = ∫_{0.6}^{0.9} 3y² dy
= y³ |_{0.6}^{0.9} = 0.9³ − 0.6³ = 0.513
Thus, according to this model, there is a 51.3% probability that between 60% and 90% of people pay their income tax on time.
Note that it does not matter whether we use < or ≤ with continuous random variables
Expected Value and Variance of a Continuous Random Variable
The expected value of a continuous random variable Y is defined as follows:
μ = E(Y) = ∫_{−∞}^{∞} y·f(y) dy
Similarly the variance is defined thus:
σ² = Var(Y) = E(Y²) − μ² = ∫_{−∞}^{∞} y²·f(y) dy − μ²
These have the same properties as in the discrete case.
Find the expected value of the proportion of people who pay their income tax on time.
μ = E(Y) = ∫₀¹ y·3y² dy
= ∫₀¹ 3y³ dy
= (3/4)y⁴ |₀¹ = 3/4 = 0.75
Find the standard deviation of the proportion of people who pay their income tax on time.
σ² = Var(Y) = ∫₀¹ y²·3y² dy − μ²
= ∫₀¹ 3y⁴ dy − 0.75²
= (3/5)y⁵ |₀¹ − 0.75²
= 3/5 − 0.5625 = 0.6 − 0.5625 = 0.0375
Hence σ = √0.0375 = 0.194
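These integrals can also be evaluated numerically; the sketch below (assuming scipy is available) re-computes the total probability, mean and variance of f(y) = 3y² on [0, 1].

from scipy.integrate import quad

f = lambda y: 3 * y**2

total, _ = quad(f, 0, 1)                    # integral of f over [0, 1]: should be 1
mean, _ = quad(lambda y: y * f(y), 0, 1)    # E(Y) = 0.75
ey2, _ = quad(lambda y: y**2 * f(y), 0, 1)  # E(Y^2) = 0.6
var = ey2 - mean**2                         # Var(Y) = 0.0375
print(total, mean, var, var**0.5)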
Special Continuous Probability Distributions
Uniform Distribution
Suppose that Y can take on any value between θ₁ and θ₂ with equal probability. Then Y follows the continuous uniform distribution and its probability density function is as follows:
f(y) = 1/(θ₂ − θ₁) for θ₁ ≤ y ≤ θ₂, and 0 elsewhere
We can use integrals to compute probabilities, but in this case we don't need to because we are actually just finding the area of a rectangle! It can be shown that E(Y) = (θ₁ + θ₂)/2 and Var(Y) = (θ₂ − θ₁)²/12
Uniform Distribution Example
An insurance company provides roadside assistance to its clients. To save costs
they want to dispatch the nearest possible tow truck.
Along a particular highway which is 100 km long, breakdowns occur at uni-
formly distributed locations.
Towing Company A is the nearest for the first 70 km of the highway and
Towing Company B is the nearest for the final 30 km of the highway.
1. What is the expected location of the next breakdown?
E(Y) = (θ₁ + θ₂)/2 = (0 + 100)/2 = 50
We expect the next breakdown to occur at the 50 km mark
3. What is the probability that the next breakdown will be attended by company
B?
Here f(y) = 1/100 for 0 ≤ y ≤ 100, and 0 elsewhere
We need to find the area under f(y) between 70 and 100
We could calculate ∫_{70}^{100} f(y) dy
Or we can simply calculate the area of this rectangle:
The area of a rectangle is length × width. Thus:
Pr(70 ≤ Y ≤ 100) = 30 × (1/100) = 0.30
Normal Distribution
A random variable Y is said to have a normal distribution with parameters −∞ < μ < ∞ and σ > 0 if its probability density function is:
f(y) = (1/(σ√(2π)))·e^(−(y−μ)²/(2σ²)), −∞ < y < ∞
It can be shown that E(Y) = μ and Var(Y) = σ².
Here is a graph of the Normal probability density function where μ = 10 and σ = 2 (plotted from y = 2 to y = 18):
f(y) in this case cannot be integrated analytically, so finding the area under the curve must be done using complicated numerical methods.
Good news: this has been done for you in the Z table for the Normal distribution with μ = 0 and σ = 1, known as the Standard Normal Distribution
Even more good news: any Normally distributed random variable Y with mean μ and standard deviation σ can be transformed to a Standard Normal random variable Z using this simple transformation:
Z = (Y − μ)/σ
This graph shows how the transformation works:
Using the Z Table to Calculate Probabilities
The Z Table provides us with Pr (Z < z) for any z value that we choose up to
2 decimal places
Suppose we want to know Pr(Z < 0.42) = ∫_{−∞}^{0.42} (1/√(2π))·e^(−z²/2) dz
We look up 0.42 in the table as follows:
Thus Pr(Z < 0.42) = 0.6628
What we have just found is this:
The table only gives us Pr(Z < z) for positive z values
If we want to find Pr(Z > z) we can use the complement rule:
Pr(Z > z) = 1 − Pr(Z < z)
If we want to find Pr(Z < −z) for a negative z value, we can use the fact that the Standard Normal Distribution is symmetric:
Pr(Z < −z) = 1 − Pr(Z < z)
What if we wanted to find the probability that Z falls between two values?
Pr(z₁ < Z < z₂) = Pr(Z < z₂) − Pr(Z < z₁)
Which looks like this:
Normal Distribution Example
The top-selling Red and Voss tyre is rated 70 000 km. In fact, the distance
the tyres can run is a normally distributed r.v. with a mean of 82 000 km and
a s.d. of 6 400 km.
1. What is the probability that the tyre wears out before 70 000 km?
Let X be the r.v. of distance the tyres can run. X is normally distributed, so Z = (X − μ)/σ is a standard normal r.v. Therefore:
Pr(X < 70000) = Pr((X − 82000)/6400 < (70000 − 82000)/6400)
= Pr(Z < −12000/6400)
= Pr(Z < −1.88)
= 1 − Pr(Z < 1.88) = 1 − 0.9699 = 0.03 = 3%
3. What is the probability that the tyre lasts between 90 000 and 100 000 km?
Pr(90000 < X < 100000) = Pr((90000 − 82000)/6400 < (X − 82000)/6400 < (100000 − 82000)/6400)
= Pr(8000/6400 < Z < 18000/6400)
= Pr(1.25 < Z < 2.81)
= Pr(Z < 2.81) − Pr(Z < 1.25)
= 0.9975 − 0.8944 = 0.1031 ≈ 10%
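In practice the Z table lookups are done with software; here is a minimal sketch (assuming scipy) for the tyre example.

from scipy.stats import norm

mu, sigma = 82000, 6400
print(norm.cdf((70000 - mu) / sigma))       # Pr(X < 70000), approx 0.03
print(norm.cdf(2.81) - norm.cdf(1.25))      # Pr(90000 < X < 100000), approx 0.1031
# or without standardising by hand:
print(norm.cdf(100000, mu, sigma) - norm.cdf(90000, mu, sigma))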
2 Correlation Analysis of Paired Data Sets
Scatter Plots
An ice cream vendor collects data on the amount of ice cream she sells per
day, along with the temperature of the day:
Ice Cream Sales vs. Temperatures
Temperature (°C)   Ice Cream Sales (R)
14.2 215
16.4 325
11.9 185
15.2 332
18.5 406
22.1 522
19.4 412
25.1 614
23.4 544
18.1 421
22.6 445
17.2 408
Scatter Plot
We can represent the relationship between these two variables using a scatter
plot:
Why do we put Ice Cream Sales on the vertical axis?
How could we quantify the relationship between these two variables?
Covariance of Two Random Variables
The covariance of two random variables X and Y is defined as:
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
This measures the extent to which X and Y vary together
The correlation of two random variables X and Y is defined as:
ρ_XY = Cov(X, Y)/(σ_X·σ_Y)
This is also called Pearson's Correlation Coefficient
Pearson's Correlation Coefficient
The sample correlation coefficient is a statistic used to estimate ρ based on paired observations of X and Y
This statistic, called r, estimates the strength of the linear relationship between two paired samples x₁, x₂, . . . , xₙ and y₁, y₂, . . . , yₙ
r = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √[ Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² ]
We can also write r as follows:
r = s_xy / √(s²_xx · s²_yy)
where:
s_xy = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) (sample covariance of x and y)
s²_xx = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)² (sample variance of x)
s²_yy = (1/(n − 1)) Σᵢ₌₁ⁿ (yᵢ − ȳ)² (sample variance of y)
Pearson's Correlation Coefficient
A shortcut formula for r is:
r = (Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ) / √[ (Σᵢ₌₁ⁿ xᵢ² − n·x̄²)(Σᵢ₌₁ⁿ yᵢ² − n·ȳ²) ]
r takes on values between −1 and 1
A value close to 1 indicates a strong positive linear relationship (as x increases, y tends to increase)
A value close to −1 indicates a strong negative linear relationship (as x increases, y tends to decrease)
A value close to 0 indicates no linear relationship between x and y
Pearson's Correlation Coefficient Example
Calculate r for the Ice Cream Sales vs. Temperature example
Ice Cream Sales vs. Temperature
x       y      x²        y²        xy
14.2    215    201.64     46225     3053
16.4    325    268.96    105625     5330
11.9    185    141.61     34225     2201.5
15.2    332    231.04    110224     5046.4
18.5    406    342.25    164836     7551
22.1    522    488.41    272484    11536.2
19.4    412    376.36    169744     7992.8
25.1    614    630.01    376996    15411.4
23.4    544    547.56    295936    12729.6
18.1    421    327.61    177241     7620.1
22.6    445    510.76    198025    10057.0
17.2    408    295.84    166464     7017.6
Σxᵢ = 224.1   Σyᵢ = 4829   Σxᵢ² = 4362.05   Σyᵢ² = 2118025   Σxᵢyᵢ = 95506.6
x̄ = 18.675   ȳ = 402.4167
Pearson's Correlation Coefficient Example
r = (Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ) / √[ (Σᵢ₌₁ⁿ xᵢ² − n·x̄²)(Σᵢ₌₁ⁿ yᵢ² − n·ȳ²) ]
= (95506.6 − 12 × 18.675 × 402.4167) / √([4362.05 − 12 × 18.675²][2118025 − 12 × 402.4167²])
= 5325.025/√30928562 = 0.9575
Thus in this case there is a strong positive relationship between
temperature and ice cream sales
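The hand calculation above can be verified with scipy (a sketch; the two lists simply re-enter the table values).

from scipy.stats import pearsonr

temp = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

r, p_value = pearsonr(temp, sales)
print(r)         # approx 0.9575
print(p_value)   # two-sided p-value for H0: rho = 0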
Hypothesis Testing on Pearson's Correlation Coefficient
As long as we can assume the samples x and y are from normal distributions, then under H₀:
t = r·√(n − 2)/√(1 − r²)
follows a t distribution with n − 2 degrees of freedom
Thus, we can use Pearson's Correlation Coefficient to test for independence of two normally distributed samples:
1. State the hypotheses: in the two-sided case, the null hypothesis is H₀: ρ = 0 and the alternative H_A: ρ ≠ 0
2. State the significance level, α
3. State the test statistic
4. Calculate the critical value and give the critical region
5. Calculate the observed value of the test statistic and make a decision
6. State your conclusion in words.
Pearson's Correlation Coefficient: Hypothesis Test Example
For our ice cream sales vs. temperature example, we want to know if the correlation is significantly different from 0.
1. H₀: ρ = 0 vs. H_A: ρ ≠ 0
2. α = 0.05
3. t = r·√(n − 2)/√(1 − r²)
4. t_{α/2,n−2} = t_{0.025,10} = 2.228; thus we reject H₀ if |t_observed| > 2.228
5. t_observed = 10.50 > 2.228, thus we reject H₀
6. We conclude at the 5% significance level that the correlation is significantly different from 0
The Fisher Transformation
What if we want to test whether ρ = ρ₀ for any value −1 < ρ₀ < 1?
What if we want a confidence interval for ρ?
The Fisher Transformation allows us to do both (approximately)
z_r = ½·ln[(1 + r)/(1 − r)]
Under H₀: ρ = ρ₀, this quantity has an approximate Normal distribution with mean ½·ln[(1 + ρ₀)/(1 − ρ₀)] and variance 1/(n − 3)
From this we get the following test statistic, which has a standard normal distribution under the null hypothesis:
Z = ( ½·ln[(1 + r)/(1 − r)] − ½·ln[(1 + ρ₀)/(1 − ρ₀)] ) / √(1/(n − 3))
Pearson's Correlation Coefficient: General Hypothesis Test Example
Suppose we want to find out whether the correlation is less than 0.99 in our ice cream sales vs. temperature example.
1. H₀: ρ = 0.99 vs. H_A: ρ < 0.99
2. α = 0.05
3. Z = ( ½·ln[(1 + r)/(1 − r)] − ½·ln[(1 + ρ₀)/(1 − ρ₀)] ) / √(1/(n − 3))
4. −z_α = −z_0.05 = −1.645; thus we reject H₀ if z_observed < −1.645
5. z_observed = −2.19 < −1.645, thus we reject H₀
6. We conclude at the 5% significance level that the correlation is significantly less than 0.99
Pearson's Correlation Coefficient: Confidence Intervals
Just as r is a sample estimate of the unknown parameter ρ, so z_r is a sample estimate of the unknown parameter z_ρ
Based on the distribution of the Fisher Transformation z_r we can give the following approximate (1 − α) × 100% confidence interval for z_ρ:
z_ρ,L = z_r − z_{α/2}·√(1/(n − 3))
z_ρ,U = z_r + z_{α/2}·√(1/(n − 3))
But what we are really interested in is a confidence interval for ρ
If we rearrange the Fisher Transformation to get r in terms of z_r we find that:
r = (e^(2z_r) − 1)/(e^(2z_r) + 1)
We can substitute the confidence limits above into this formula to get the lower and upper confidence limits for ρ:
ρ_lower = [ ((1 + r)/(1 − r))·exp{−2z_{α/2}√(1/(n − 3))} − 1 ] / [ ((1 + r)/(1 − r))·exp{−2z_{α/2}√(1/(n − 3))} + 1 ]
ρ_upper = [ ((1 + r)/(1 − r))·exp{2z_{α/2}√(1/(n − 3))} − 1 ] / [ ((1 + r)/(1 − r))·exp{2z_{α/2}√(1/(n − 3))} + 1 ]
In our ice cream vs. temperature example, a 95% Confidence Interval for the true Pearson Correlation Coefficient is:
(ρ_lower = 0.8515, ρ_upper = 0.9883)
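This interval can be reproduced with a short helper function; the listing below is a sketch of the Fisher-transformation formulas (assuming numpy and scipy), using r = 0.9575 and n = 12 from the example.

import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, alpha=0.05):
    z_r = 0.5 * np.log((1 + r) / (1 - r))                 # Fisher transformation of r
    half = norm.ppf(1 - alpha / 2) * np.sqrt(1 / (n - 3))
    lo, hi = z_r - half, z_r + half                       # confidence limits for z_rho
    back = lambda z: (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)  # back-transform to rho
    return back(lo), back(hi)

print(fisher_ci(0.9575, 12))   # approx (0.8515, 0.9883)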
Spearman's Rank Correlation Coefficient
What if one or both of X and Y are not normally distributed?
Suppose we have the Statistics FISA marks and number of hours of TV
watched per week for n = 8 students:
FISA Marks vs. Hours of TV per week
Hours of TV per week (xᵢ)   FISA Mark (yᵢ)
3 73
11 50
7 87
38 31
13 62
20 61
22 46
34 59
Spearman's Rank Correlation Coefficient
In this case we can instead use Spearman's Rank Correlation Coefficient ρ_s, which is based on the ranks of the xᵢ and yᵢ rather than the values themselves
It is a general measure of association rather than a measure of linear dependence
R(xᵢ) are the ranks of the x values; thus the lowest value has a rank of 1, the second lowest a rank of 2, etc.
R(yᵢ) is computed the same way for the y values
The sample estimator of ρ_s is:
r_s = [ n·Σᵢ₌₁ⁿ R(xᵢ)R(yᵢ) − Σᵢ₌₁ⁿ R(xᵢ)·Σᵢ₌₁ⁿ R(yᵢ) ] / √{ [n·Σᵢ₌₁ⁿ R(xᵢ)² − (Σᵢ₌₁ⁿ R(xᵢ))²]·[n·Σᵢ₌₁ⁿ R(yᵢ)² − (Σᵢ₌₁ⁿ R(yᵢ))²] }
If there are no ties in x or y, this reduces to a simpler formula:
r_s = 1 − 6·Σᵢ₌₁ⁿ dᵢ² / [n(n² − 1)], where dᵢ = R(xᵢ) − R(yᵢ)
FISA Marks vs. TV hours per week
Hours of TV per week (xᵢ)   FISA Mark (yᵢ)   R(xᵢ)   R(yᵢ)   dᵢ   dᵢ²
3     73   1   7   −6   36
11    50   3   3    0    0
7     87   2   8   −6   36
38    31   8   1    7   49
13    62   4   6   −2    4
20    61   5   5    0    0
22    46   6   2    4   16
34    59   7   4    3    9
Σdᵢ² = 150
Spearman's Rank Correlation Coefficient Example
In our FISA marks vs. TV hours example:
We can now compute the sample Spearman correlation coefficient:
r_s = 1 − (6 × 150)/(8(8² − 1)) = −0.786
This suggests that there is a negative association between hours spent watching TV and FISA mark
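scipy computes the same quantity; this is a sketch that simply re-enters the table values.

from scipy.stats import spearmanr

tv = [3, 11, 7, 38, 13, 20, 22, 34]
mark = [73, 50, 87, 31, 62, 61, 46, 59]

rs, p_value = spearmanr(tv, mark)
print(rs)   # approx -0.786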
Spearman's Rank Correlation Coefficient: Hypothesis Testing
We may want to test the null hypothesis H₀: ρ_s = 0 against some alternative to see if there is a significant association between x and y
If n is large (and there are no ties) then the statistic t = r_s·√(n − 2)/√(1 − r_s²) has approximately a t distribution with n − 2 degrees of freedom
If n is small we use r_s as our test statistic and use a table of critical values (see appendix)
For our student marks vs. TV hours example, suppose we want to check if the association between these two variables is significant at the 5% significance level
Spearman's Rank Correlation Coefficient: Hypothesis Testing Example
1. H₀: ρ_s = 0 vs. H_A: ρ_s ≠ 0
2. α = 0.05
3. Test statistic is r_s
4. Critical value is r_s(α/2, 8) = 0.738, so we reject H₀ if |r_s observed| > 0.738
5. |r_s observed| = |−0.786| = 0.786 > 0.738, so we reject H₀
6. We conclude there is a (negative) association between hours spent watching TV per week and FISA mark
Spearman's Rank Correlation Coefficient: General Hypothesis Tests and Confidence Intervals
The Fisher Transformation that was done on the Pearson Correlation Coefficient also applies to the Spearman Rank Correlation Coefficient
Thus we can use the very same formulas based on the standard normal distribution to carry out general hypothesis tests such as H₀: ρ_s = 0.6 vs. H_A: ρ_s ≠ 0.6, as well as to construct confidence intervals for ρ_s
Of course we need to use r_s instead of r in these formulas, but everything else stays the same
Limitations of Correlation Analysis
Two of the limitations of correlation analysis are:
1. It does not allow us to compare more than two variables at a time
2. It does not allow us to make predictions
We now turn to linear regression analysis which enables us to do both of these
3 Simple Linear Regression Analysis
Equation of a Line
The equation of a line is often expressed as y = mx + c
m is the slope of the line, the change in y for a one unit change in x
c is the intercept of the line, the value of y when x = 0 (and the point
where the line crosses the vertical axis)
Often when we compare observations from two variables, we see what appears
to be an approximately linear relationship
We must decide logically which is the independent variable (x) and which is
the dependent variable (y)
For example, the scatter plot of ice cream sales vs. temperatures (which
is dependent on the other?)
Line Fitting
If we have only two points, we can fit a line that goes right through them both
E.g. if we have the points (x₁ = 2, y₁ = 4) and (x₂ = 6, y₂ = 6)
m = (y₂ − y₁)/(x₂ − x₁) = (6 − 4)/(6 − 2) = 1/2
m = (y − y₁)/(x − x₁)
1/2 = (y − 4)/(x − 2)
2(y − 4) = x − 2
2y − 8 = x − 2
2y = x + 6
y = (1/2)x + 3
Line Fitting
However, as soon as we have three or more points, we usually can't fit them perfectly with a straight line
Consider the following scatter plot:
There is no line that describes this relationship perfectly
So how do we model a relationship that is only "kind of" linear?
The Simple Linear Regression Model
We could assume that the yᵢ observations depend on the xᵢ observations in a linear way but also contain some unexplained variation
We model this unexplained variation or error as a random variable εᵢ
This means Y is a random variable since it depends on a random variable
Thus we have Y = β₀ + β₁x + ε
Or, for individual observations, yᵢ = β₀ + β₁xᵢ + εᵢ for i = 1, 2, . . . , n
We have simply changed the name of m to β₁ and c to β₀, switched their order, and added the error term
Model Assumptions
The most important assumptions of a simple linear regression model are as follows:
The x values are fixed, not random (thus we write x in lower case and Y, a random variable, in upper case)
All error terms have a zero mean, i.e. E(εᵢ) = 0 for all i
All error terms have the same fixed variance, i.e. Var(εᵢ) = σ² for all i
All observations are independent of each other
The error terms follow the normal distribution
The Problem
Even if our model and its assumptions are correct, we have a problem: we don't know the values of β₀, β₁ or εᵢ
In order to know them we would have to have data from the whole population of x and y, which is usually impossible
We can only estimate β₀, β₁ and εᵢ as best as we can
But how?
Line Fitting
If we asked three people to draw the line that best fits the points, we might get three different results:
How would we know which line is the best?
As statisticians we want to use a statistic to quantify this! But how?
The Least Squares Method
Suppose we have observations (xᵢ, yᵢ) for i = 1, 2, . . . , n, and we fit a line with equation ŷᵢ = β̂₀ + β̂₁xᵢ
We have simply changed the name of m to β̂₁ and c to β̂₀, and switched their order
The hat (ˆ) on y, β₀ and β₁ reminds us that these are estimates of the relationship
We can determine how far each individual yᵢ value is from the line using the formula eᵢ = yᵢ − ŷᵢ = yᵢ − (β̂₀ + β̂₁xᵢ)
The eᵢ values are called residuals
The residuals eᵢ are our best estimate of the unknown errors εᵢ
They also provide us with a clue of how to find the estimated line that best fits the data
Overall, we want the errors to be as small as possible
However, we can't just minimize the sum of errors because the positive errors (points above the line) and negative errors (points below the line) will cancel each other out!
Instead we minimize the sum of squared errors εᵢ², because these will all be positive
SS_Error = Σᵢ₌₁ⁿ εᵢ²
This quantifies the overall distance between the points and the line
Similar to how the variance gives an indication of the distance between data points and their mean
We will choose the values of β̂₀ and β̂₁ that minimize the sum of squared errors
How do we do this? Calculus!
The Sum of Squared Errors is a function of β̂₀ and β̂₁
SS_Error = S(β̂₀, β̂₁) = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)²
So our method is as follows:
1. Take partial derivatives of the SS_Error function with respect to β̂₀ and β̂₁
2. Set the derivatives equal to zero
3. Solve this system of equations for β̂₀ and β̂₁ to get the values which minimize the function
Deriving the Least Squares Estimators
∂S(β̂₀, β̂₁)/∂β̂₀ = −2 Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ) = 0 (1)
∂S(β̂₀, β̂₁)/∂β̂₁ = −2 Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)·xᵢ = 0 (2)
This is the system of equations we must solve in terms of β̂₀ and β̂₁
We simplify them as follows:
−2 Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ) = 0
Σᵢ₌₁ⁿ yᵢ − Σᵢ₌₁ⁿ β̂₀ − Σᵢ₌₁ⁿ β̂₁xᵢ = 0
Σᵢ₌₁ⁿ yᵢ − β̂₀ Σᵢ₌₁ⁿ 1 − β̂₁ Σᵢ₌₁ⁿ xᵢ = 0
n·ȳ − n·β̂₀ − n·β̂₁·x̄ = 0
β̂₀ = ȳ − β̂₁·x̄

−2 Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)·xᵢ = 0
Σᵢ₌₁ⁿ yᵢxᵢ − Σᵢ₌₁ⁿ β̂₀xᵢ − Σᵢ₌₁ⁿ β̂₁xᵢ² = 0
Σᵢ₌₁ⁿ yᵢxᵢ − β̂₀ Σᵢ₌₁ⁿ xᵢ − β̂₁ Σᵢ₌₁ⁿ xᵢ² = 0
Σᵢ₌₁ⁿ xᵢyᵢ − (ȳ − β̂₁x̄) Σᵢ₌₁ⁿ xᵢ − β̂₁ Σᵢ₌₁ⁿ xᵢ² = 0
Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ + n·β̂₁·x̄² − β̂₁ Σᵢ₌₁ⁿ xᵢ² = 0
β̂₁ (Σᵢ₌₁ⁿ xᵢ² − n·x̄²) = Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ
β̂₁ = (Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ) / (Σᵢ₌₁ⁿ xᵢ² − n·x̄²)
Least Squares Estimation Formula
Thus the least squares estimates of β₀ and β₁ can be calculated using the following formulas:
β̂₁ = (Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ) / (Σᵢ₌₁ⁿ xᵢ² − n·x̄²)
β̂₀ = ȳ − β̂₁·x̄
It turns out that β̂₁ and β̂₀ are Minimum Variance Unbiased Estimators (MVUE) of β₁ and β₀
This means that:
1. E(β̂₀) = β₀ and E(β̂₁) = β₁ (unbiased)
2. β̂₀ and β̂₁ can be proven to have the smallest variance (greatest precision) of any linear unbiased estimators of β₀ and β₁
Proof that β̂₁ is an Unbiased Estimator of β₁
We first need to derive E(Yᵢ) and E(Ȳ)
We will also use our assumptions that the x values are fixed and that E(εᵢ) = 0
E(Yᵢ) = E(β₀ + β₁xᵢ + εᵢ)
= E(β₀) + E(β₁xᵢ) + E(εᵢ)
= β₀ + β₁xᵢ + 0 (since the first two are constants)
= β₀ + β₁xᵢ
E(Ȳ) = E( (1/n) Σᵢ₌₁ⁿ yᵢ )
= (1/n) Σᵢ₌₁ⁿ E(yᵢ)
= (1/n) Σᵢ₌₁ⁿ (β₀ + β₁xᵢ)
= (1/n)(n·β₀ + β₁·n·x̄)
= β₀ + β₁x̄
E(β̂₁) = E[ (Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ) / (Σᵢ₌₁ⁿ xᵢ² − n·x̄²) ]
= [1/(Σᵢ₌₁ⁿ xᵢ² − n·x̄²)] · E[ Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ ] (since x is fixed, the denominator is constant)
= [1/(Σᵢ₌₁ⁿ xᵢ² − n·x̄²)] · [ Σᵢ₌₁ⁿ xᵢE(yᵢ) − n·x̄·E(ȳ) ]
= [1/(Σᵢ₌₁ⁿ xᵢ² − n·x̄²)] · [ Σᵢ₌₁ⁿ xᵢ(β₀ + β₁xᵢ) − n·x̄(β₀ + β₁x̄) ] (see results proved above)
= [1/(Σᵢ₌₁ⁿ xᵢ² − n·x̄²)] · [ β₀·n·x̄ + β₁ Σᵢ₌₁ⁿ xᵢ² − n·x̄·β₀ − n·x̄²·β₁ ]
= β₁ (Σᵢ₌₁ⁿ xᵢ² − n·x̄²) / (Σᵢ₌₁ⁿ xᵢ² − n·x̄²)
= β₁
Proof that β̂₀ is an Unbiased Estimator of β₀
As an exercise, try to prove that E(β̂₀) = β₀
The proof is much shorter than the proof for β̂₁
Prediction with Simple Linear Regression
Once we have calculated the least squares estimates β̂₁ and β̂₀, we can write out the fitted regression equation:
ŷ = β̂₀ + β̂₁x
We can now use this equation to predict the most likely value of y for a particular value of x
This is one of the most useful things about this model!
However we must be careful to only make predictions for values of x in the domain of our data
We cannot extrapolate since the relationship may not be linear outside of the domain of the data
The Riskiness of Extrapolation
Suppose we fit a line to a set of data points with xᵢ values ranging from 0 to 6
Now we use our fitted line to predict the value of y for x = 10
The Riskiness of Extrapolation
What if modeling the relationship between y and x as a straight line is only appropriate between x = 0 and x = 6?
Can you see how far off the prediction would appear to be if we had data for larger x values like this?
Simple Linear Regression Example
Various doses of a toxic substance were given to groups of 25 rats and the results were observed (see table below)
Rat Deaths vs. Doses
Dose in mg (x) Number of Deaths (y)
4 1
6 3
8 6
10 8
12 14
14 16
16 20
1. Find the fitted simple linear regression equation for this data
2. Use the model to predict the number of deaths in a group of 25 rats who
receive a 7 mg dose of the toxin
Rat Deaths vs. Doses
xᵢ    yᵢ    xᵢ²    xᵢyᵢ
4      1     16      4
6      3     36     18
8      6     64     48
10     8    100     80
12    14    144    168
14    16    196    224
16    20    256    320
Σxᵢ = 70   Σyᵢ = 68   Σxᵢ² = 812   Σxᵢyᵢ = 862
x̄ = 10   ȳ = 9.714
β̂₁ = (Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ) / (Σᵢ₌₁ⁿ xᵢ² − n·x̄²) = (862 − 7 × 10 × 9.714)/(812 − 7 × 10²) = 182/112 = 1.625
β̂₀ = ȳ − β̂₁·x̄ = 9.714 − 1.625 × 10 = −6.536
Note that it is important not to round numbers off until you have the final regression equation, otherwise your answer may be inaccurate
Thus the fitted regression equation is ŷ = −6.54 + 1.63x
Predicting the number of deaths for a dose of 7 mg:
ŷ = −6.54 + 1.63x = −6.54 + 1.63 × 7 = 4.9
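The least squares formulas above translate directly into a few lines of code; this sketch (assuming numpy is available) re-fits the rat data.

import numpy as np

x = np.array([4, 6, 8, 10, 12, 14, 16], dtype=float)
y = np.array([1, 3, 6, 8, 14, 16, 20], dtype=float)
n = len(x)

b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)          # approx -6.536 and 1.625
print(b0 + b1 * 7)     # predicted deaths at a 7 mg dose, approx 4.84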
Simple Linear Regression Exercise
Calculate the equation of the line of best fit for the temperature (x) vs. ice cream sales (y) example
Use the equation to predict the ice cream sales on a day on which the temperature is 20°C
Inferences from a Simple Linear Regression
The two unknown parameters involved in a simple linear regression model are β₀ and β₁
σ², the variance of the error terms, is also unknown
We may be interested in knowing whether it is reasonable to conclude that one of these unknowns is equal to (or not equal to) a particular value
Most often we are interested in whether β₁ = 0, since this determines whether x and y have a positive relationship, a negative relationship or no relationship
Like in correlation analysis!
To use hypothesis testing to make inferences about these unknowns we need
an appropriate test statistic
Inferences on β₁
Inferences about β₁ will be based on how far the estimated value β̂₁ is from the null hypothesis value
As always, we also take into account the standard error of the estimate and its probability distribution
We already proved that E(β̂₁) = β₁
Let:
SS_x = Σᵢ₌₁ⁿ xᵢ² − n·x̄²
SS_y = Σᵢ₌₁ⁿ yᵢ² − n·ȳ²
SS_xy = Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ
Notice that, expressed in these terms, β̂₁ = SS_xy/SS_x
Subject to our model assumptions, it can be proven that Var(β̂₁) = σ²/SS_x
However, because we do not know the value of σ² we must use the best estimate, which turns out to be
σ̂² = (1/(n − 2)) Σᵢ₌₁ⁿ eᵢ² = SS_Residual/(n − 2) = MS_Residual
Thus V̂ar(β̂₁) = σ̂²/SS_x
It can be proven that (β̂₁ − E(β̂₁))/√V̂ar(β̂₁) has a t distribution with n − 2 degrees of freedom
Thus t = (β̂₁ − β₁)/√(σ̂²/SS_x) has a t distribution with n − 2 degrees of freedom
Since SS_Residual = SS_y − β̂₁SS_xy, we can express this as:
t = (β̂₁ − β₁) / √[ (SS_y − β̂₁SS_xy) / ((n − 2)·SS_x) ]
If we replace β₁ with β₁* this becomes our test statistic for testing H₀: β₁ = β₁*
Hypothesis Testing Review
For such a t test, our decision rules would be as follows:
H₀: β₁ = β₁* vs. H_A: β₁ ≠ β₁*   Reject H₀ if |t_observed| > t_{α/2,n−2}
H₀: β₁ = β₁* vs. H_A: β₁ < β₁*   Reject H₀ if t_observed < −t_{α,n−2}
H₀: β₁ = β₁* vs. H_A: β₁ > β₁*   Reject H₀ if t_observed > t_{α,n−2}
The p-value Approach
Instead of using critical values to decide whether to reject H₀, one can also use p-values
A p-value is defined as the probability of obtaining a result at least as extreme as the observed data, given that H₀ is true.
For such a t test, our decision rules would be as follows:
H₀: β₁ = β₁* vs. H_A: β₁ ≠ β₁*   Reject H₀ if 2·Pr(t > |t_observed| given that β₁ = β₁*) < α
H₀: β₁ = β₁* vs. H_A: β₁ < β₁*   Reject H₀ if Pr(t < t_observed given that β₁ = β₁*) < α
H₀: β₁ = β₁* vs. H_A: β₁ > β₁*   Reject H₀ if Pr(t > t_observed given that β₁ = β₁*) < α
Note that p-values cannot usually be computed by hand. As an example, the third p-value involves computing ∫_{t_observed}^{∞} f(u) du, where f is the probability density function of the t distribution
However, p-values can be easily calculated with a computer, and are the quickest way to reach a decision about a hypothesis test when using statistical software packages
Confidence Interval for β₁
Using the t statistic above, we can derive a (1 − α)100% confidence interval for β₁ as follows:
Pr( β̂₁ − t_{α/2,n−2}·√[(SS_y − β̂₁SS_xy)/((n − 2)·SS_x)] < β₁ < β̂₁ + t_{α/2,n−2}·√[(SS_y − β̂₁SS_xy)/((n − 2)·SS_x)] ) = 1 − α
Thus the C.I. for β₁ is:
( β̂₁ − t_{α/2,n−2}·√[(SS_y − β̂₁SS_xy)/((n − 2)·SS_x)] ,  β̂₁ + t_{α/2,n−2}·√[(SS_y − β̂₁SS_xy)/((n − 2)·SS_x)] )
Inference on β₁ Example
Suppose we want to test H₀: β₁ = 0 vs. H_A: β₁ ≠ 0 for the rat death vs. dosage example, at the α = 0.05 significance level
Our test statistic is t ~ t(n − 2) as defined above
Our critical region is |t_observed| > t_{α/2,n−2} = t_{0.025,5} = 2.570
We have already calculated that SS_xy = 182 and SS_x = 112
We further can calculate that SS_y = 301.4286
t = (β̂₁ − β₁) / √[(SS_y − β̂₁SS_xy)/((n − 2)·SS_x)]
= (1.625 − 0) / √[(301.4286 − 1.625 × 182)/((7 − 2) × 112)]
= 1.625/√0.01014 = 1.625/0.1007 = 16.14
|t_observed| > 2.570, thus we reject H₀ and conclude that β₁ ≠ 0; the slope of the regression model is statistically significant
A 95% Confidence Interval for β₁ is given by:
( β̂₁ − t_{α/2,n−2}·√[(SS_y − β̂₁SS_xy)/((n − 2)·SS_x)] ,  β̂₁ + t_{α/2,n−2}·√[(SS_y − β̂₁SS_xy)/((n − 2)·SS_x)] )
(1.625 − 2.570 × 0.1007, 1.625 + 2.570 × 0.1007)
(1.37, 1.88)
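The same test and confidence interval can be obtained from a regression package; the sketch below assumes the statsmodels library is available and uses it on the rat data.

import numpy as np
import statsmodels.api as sm

x = np.array([4, 6, 8, 10, 12, 14, 16], dtype=float)
y = np.array([1, 3, 6, 8, 14, 16, 20], dtype=float)

X = sm.add_constant(x)               # adds the intercept column
fit = sm.OLS(y, X).fit()

print(fit.params)                    # beta0-hat, beta1-hat
print(fit.tvalues)                   # t statistics for H0: beta_j = 0
print(fit.conf_int(alpha=0.05))      # 95% confidence intervals for beta0 and beta1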
Inference on β₀
In a similar way it can be proven that:
E(β̂₀) = β₀
Var(β̂₀) = σ²·(1/n + x̄²/SS_x)
If we estimate σ² with σ̂² then t = (β̂₀ − β₀)/√[σ̂²·(1/n + x̄²/SS_x)] has a t distribution with n − 2 degrees of freedom
We can also express t as:
t = (β̂₀ − β₀) / √[ ((SS_y − β̂₁SS_xy)/(n − 2)) · (1/n + x̄²/SS_x) ]
Confidence Interval for β₀
A (1 − α)100% Confidence Interval for β₀ is given by:
( β̂₀ − t_{α/2,n−2}·√[((SS_y − β̂₁SS_xy)/(n − 2))·(1/n + x̄²/SS_x)] ,  β̂₀ + t_{α/2,n−2}·√[((SS_y − β̂₁SS_xy)/(n − 2))·(1/n + x̄²/SS_x)] )
Inference on β₀ Example
With our dosage vs. rat deaths example, suppose we are interested in whether β₀ < 1
1. H₀: β₀ = 1 vs. H_A: β₀ < 1
2. α = 0.05
3. t ~ t(n − 2) where t is as defined above
4. Reject H₀ if t_observed < −t_{α,n−2} = −t_{0.05,5} = −2.015
t_observed = (β̂₀ − β₀) / √[((SS_y − β̂₁SS_xy)/(n − 2))·(1/n + x̄²/SS_x)]
= (−6.536 − 1) / √[((301.4286 − 1.625(182))/(7 − 2))·(1/7 + 10²/112)]
= −7.536/√1.1763 = −6.948
5. t_observed = −6.948 < −2.015, thus we reject H₀
6. We conclude that β₀ < 1 at the 5% significance level
A 95% Confidence Interval for β₀ is as follows:
( β̂₀ − t_{α/2,n−2}·√[((SS_y − β̂₁SS_xy)/(n − 2))·(1/n + x̄²/SS_x)] ,  β̂₀ + t_{α/2,n−2}·√[((SS_y − β̂₁SS_xy)/(n − 2))·(1/n + x̄²/SS_x)] )
(−6.536 − 2.570·√1.1763, −6.536 + 2.570·√1.1763)
(−6.536 − 2.787, −6.536 + 2.787)
(−9.323, −3.749)
Inference on σ²
It is also possible to perform hypothesis tests and confidence intervals concerning σ², using the χ² distribution
However we will not cover these in this module.
Predicting the Mean Response
One of the advantages of the linear regression model is that we can use x to predict Y
Suppose we want to estimate the mean value of Y when x = x*, E(Y | x = x*)
We know that E(Y | x = x*) = β₀ + β₁x*
Our best estimate of E(Y | x = x*) is ŷ* = β̂₀ + β̂₁x*
The variance of this estimator is Var(ŷ*) = σ²·(1/n + (x* − x̄)²/SS_x)
Since σ² is unknown, we can use the following estimate:
V̂ar(ŷ*) = σ̂²·(1/n + (x* − x̄)²/SS_x) = ((SS_y − β̂₁SS_xy)/(n − 2)) · (1/n + (x* − x̄)²/SS_x)
It can also be shown that t = (ŷ* − E(Y | x = x*))/√V̂ar(ŷ*) ~ t(n − 2)
Confidence Interval for Mean Response
Thus a (1 − α)100% Confidence Interval for E(Y | x = x*) is given by:
β̂₀ + β̂₁x* ± t_{α/2,n−2}·√[ ((SS_y − β̂₁SS_xy)/(n − 2)) · (1/n + (x* − x̄)²/SS_x) ]
If we want the interval to be as narrow as possible (a more accurate prediction), then n should be large, SS_x should be large, and x̄ should be near x*.
That is, we should gather data on a wide range of x values
Predicting a New Response
Suppose we want to predict the response value y* for a new observation x = x*
Our best estimate would be ŷ* = β̂₀ + β̂₁x*
E(ŷ*) = β₀ + β₁x*
Var(y* − ŷ*) = σ²·(1 + 1/n + (x* − x̄)²/SS_x)
Thus:
V̂ar(y* − ŷ*) = σ̂²·(1 + 1/n + (x* − x̄)²/SS_x) = ((SS_y − β̂₁SS_xy)/(n − 2)) · (1 + 1/n + (x* − x̄)²/SS_x)
It can be shown that t = (y* − ŷ*)/√V̂ar(y* − ŷ*) ~ t(n − 2)
Prediction Interval for an Individual Response
A (1 − α)100% Prediction Interval for y* is given by:
β̂₀ + β̂₁x* ± t_{α/2,n−2}·√[ ((SS_y − β̂₁SS_xy)/(n − 2)) · (1 + 1/n + (x* − x̄)²/SS_x) ]
It is called a prediction interval rather than a confidence interval because Yᵢ is a random variable, not an unknown parameter
Notice that the prediction interval for Yᵢ is always wider than the confidence interval for E(Y | x = x*)
It is more difficult to predict the value of an individual observation than the mean of many observations
Example
Consider our Temperature vs. Ice Cream Sales example
We want a confidence interval for the average ice cream sales when the temperature is 20°C and a prediction interval for the ice cream sales on a particular day when the temperature is 20°C
1. Confidence Interval for E(Y | x = 20)
β̂₀ + β̂₁x* ± t_{α/2,n−2}·√[ ((SS_y − β̂₁SS_xy)/(n − 2)) · (1/n + (x* − x̄)²/SS_x) ]
−159.474 + 30.088(20) ± t_{0.025,10}·√[ ((174754.9 − 30.088(5325.025))/(12 − 2)) · (1/12 + (20 − 18.675)²/176.9825) ]
442.286 ± 2.228·√135.549
442.286 ± 25.94
= (416.35, 468.23)
2. Prediction Interval for Yᵢ when x = 20
β̂₀ + β̂₁x* ± t_{α/2,n−2}·√[ ((SS_y − β̂₁SS_xy)/(n − 2)) · (1 + 1/n + (x* − x̄)²/SS_x) ]
−159.474 + 30.088(20) ± t_{0.025,10}·√[ ((174754.9 − 30.088(5325.025))/(12 − 2)) · (1 + 1/12 + (20 − 18.675)²/176.9825) ]
442.286 ± 2.228·√1589.10
442.286 ± 88.82
= (353.47, 531.11)
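Both intervals follow the same template, so they are easy to script; the sketch below (assuming numpy and scipy) plugs the ice cream summary statistics from the notes into the formulas above.

import numpy as np
from scipy.stats import t

n = 12
x_bar, SS_x, SS_y, SS_xy = 18.675, 176.9825, 174754.9, 5325.025
b1 = SS_xy / SS_x
b0 = 402.4167 - b1 * x_bar
s2 = (SS_y - b1 * SS_xy) / (n - 2)      # estimate of sigma^2
x_star = 20
y_hat = b0 + b1 * x_star
t_crit = t.ppf(0.975, n - 2)            # approx 2.228

ci = t_crit * np.sqrt(s2 * (1/n + (x_star - x_bar)**2 / SS_x))       # mean response
pi = t_crit * np.sqrt(s2 * (1 + 1/n + (x_star - x_bar)**2 / SS_x))   # new response
print(y_hat - ci, y_hat + ci)           # approx (416, 468)
print(y_hat - pi, y_hat + pi)           # approx (353, 531)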
Assessing the Fit of a Regression Line
While testing the hypothesis H₀: β₁ = 0 can give us a yes or no answer on whether the model is appropriate, we would like a statistic that can quantify how good the model is
One method is to calculate what proportion of the total variation in y is explained by our model
The total variation in y is SS_y = Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ yᵢ² − n·ȳ²
The variation not explained by the model is SS_Residual = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Thus the variation explained by the model is the difference SS_y − SS_Residual
Our goodness of fit statistic, called the Coefficient of Determination, is the ratio of the variation explained by the model to the total variation:
r² = (SS_y − SS_Residual)/SS_y = 1 − SS_Residual/SS_y
We call this statistic r² because it turns out that it is the square of Pearson's sample correlation coefficient r
Proof:
r² = 1 − SS_Residual/SS_y
= 1 − (SS_y − β̂₁SS_xy)/SS_y
= 1 − (1 − β̂₁SS_xy/SS_y)
= β̂₁SS_xy/SS_y
= (SS_xy/SS_x)·(SS_xy/SS_y)
= SS_xy²/(SS_x·SS_y)
= (r)²
Goodness of Fit Example
In our dosage vs. rat deaths example:
r² = SS_xy²/(SS_x·SS_y) = 182²/(112 × 301.4286) = 0.981
Thus in this case we can say that 98.1% of the variation in rat deaths can be explained by the dosage given
4 Multiple Linear Regression
Multiple Linear Regression Model Specification
Until now we have used models with only one independent variable x
What if we want to investigate the relationship between a single dependent variable Y and two independent variables x₁ and x₂?
The multiple linear regression model allows us to do this
The multiple linear regression model allows us to do this
Motivational Example
An experiment was conducted to determine the effect of pressure and temperature on the yield of a chemical. Two levels of pressure (in kPa) and three levels of temperature (in °C) were used and the results were as follows:
Yield (yᵢ)   Pressure (xᵢ₁)   Temperature (xᵢ₂)
21 350 40
23 350 90
26 350 150
22 550 40
23 550 90
28 550 150
3D Scatter Plot
If we want to represent the relationship graphically we would need a three-dimensional scatter plot
Instead of a line of best fit, we now need a plane of best fit
Multiple Linear Regression Model
The multiple linear regression model allows us to investigate the relationship between a single dependent variable Y and two independent variables x₁ and x₂
The model is specified as follows:
Y = β₀ + β₁x₁ + β₂x₂ + ε
Or, in terms of observations, as follows:
yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + εᵢ
This is the equation of a plane, not a line
β₀ is still the intercept (the point where the plane crosses the vertical axis, x₁ = x₂ = 0)
β₁ is the slope of the plane in the x₁ direction
β₂ is the slope of the plane in the x₂ direction
β₁ and β₂ are sometimes referred to as partial slope coefficients
This model relies on the same assumptions as the simple linear regression model, with one addition:
x₁ and x₂ must not be collinear (highly correlated with one another)
The fitted regression equation in this case is:
Ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂
Multiple Linear Regression Model: Deriving Least Squares Parameter
Estimates
We can again use the Method of Least Squares to estimate the parameters β₀, β₁ and β₂
We still have our sum of squared error function, which is now a function of three variables:
SS_Error = S(β̂₀, β̂₁, β̂₂) = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁x₁ᵢ − β̂₂x₂ᵢ)²
We can still use the same steps:
1. Take partial derivatives of the SS_Error function with respect to β̂₀, β̂₁ and β̂₂
2. Set the derivatives equal to zero
3. Solve this system of equations for β̂₀, β̂₁ and β̂₂ to get the values which minimize the function
∂S(β̂₀, β̂₁, β̂₂)/∂β̂₀ = −2 Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁x₁ᵢ − β̂₂x₂ᵢ) = 0
∂S(β̂₀, β̂₁, β̂₂)/∂β̂₁ = −2 Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁x₁ᵢ − β̂₂x₂ᵢ)·x₁ᵢ = 0
∂S(β̂₀, β̂₁, β̂₂)/∂β̂₂ = −2 Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁x₁ᵢ − β̂₂x₂ᵢ)·x₂ᵢ = 0
Solving this system of equations for β̂₀, β̂₁ and β̂₂ is possible, but it will take a long time and the formulas will be complicated.
An alternative is to use matrix notation, which is more compact
Multiple Linear Regression Model: Matrix Notation
We can specify the regression model in matrix notation as follows:
y = Xβ + ε, where
y is an n × 1 matrix:
y = [y₁ y₂ ⋯ yₙ]′
X′ is a 3 × n matrix:
X′ = [ 1    1    ⋯  1
       x₁₁  x₁₂  ⋯  x₁ₙ
       x₂₁  x₂₂  ⋯  x₂ₙ ]
β is a 3 × 1 matrix:
β = [β₀ β₁ β₂]′
ε is an n × 1 matrix:
ε = [ε₁ ε₂ ⋯ εₙ]′
Quick Review of Matrices
For any matrices A and B, where A′ is the transpose of A:
(A′)′ = A
(A + B)′ = A′ + B′
(AB)′ = B′A′
Additionally, the inverse of a square matrix A (which is like the matrix equivalent of division) is the matrix A⁻¹ such that AA⁻¹ = I, where I is the identity matrix, e.g.
I = [ 1 0 0
      0 1 0
      0 0 1 ]
To find the inverse of a matrix we can use the following method (similar to Gauss-Jordan elimination):
Suppose
A = [ 1 2 3
      0 4 5
      1 0 6 ]
Then we reduce the augmented matrix [A | I] until the left-hand block becomes I:
[ 1 2 3 | 1 0 0 ]
[ 0 4 5 | 0 1 0 ]
[ 1 0 6 | 0 0 1 ]
→
[ 1  2 3 |  1 0 0 ]
[ 0  4 5 |  0 1 0 ]
[ 0 −2 3 | −1 0 1 ]
→
[ 1 2  3 |  1 0 0 ]
[ 0 4  5 |  0 1 0 ]
[ 0 0 11 | −2 1 2 ]
→
[ 2 0  1 |  2 −1 0 ]
[ 0 4  5 |  0  1 0 ]
[ 0 0 11 | −2  1 2 ]
→
[ 22 0  0 | 24 −12 −2 ]
[  0 4  5 |  0   1  0 ]
[  0 0 11 | −2   1  2 ]
→
[ 22  0  0 | 24 −12  −2 ]
[  0 44  0 | 10   6 −10 ]
[  0  0 11 | −2   1   2 ]
→
[ 1 0 0 | 12/11  −6/11  −1/11 ]
[ 0 1 0 |  5/22   3/22  −5/22 ]
[ 0 0 1 | −2/11   1/11   2/11 ]
Thus A⁻¹ = [ 12/11  −6/11  −1/11
              5/22   3/22  −5/22
             −2/11   1/11   2/11 ]
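We can also check this result in code; the sketch below (assuming numpy is available) multiplies the inverse by det(A) = 22 to recover the integer numerators of the fractions above.

import numpy as np

A = np.array([[1, 2, 3],
              [0, 4, 5],
              [1, 0, 6]], dtype=float)

A_inv = np.linalg.inv(A)
print(A_inv * 22)                          # 24, -12, -2 / 5, 3, -5 / -4, 2, 4
print(np.allclose(A @ A_inv, np.eye(3)))   # True: A times its inverse is I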
Deriving Least Squares Estimates in Matrix Notation
Our sum of squared error function in matrix notation is:
S(β) = Σᵢ₌₁ⁿ εᵢ² = ε′ε = (y − Xβ)′(y − Xβ)
= (y′ − (Xβ)′)(y − Xβ)
= (y′ − β′X′)(y − Xβ)
= y′y − y′Xβ − β′X′y + β′X′Xβ
Now, in β′X′y we are multiplying a 1 × 3 matrix by a 3 × n matrix by an n × 1 matrix, so the result will be a 1 × 1 matrix, i.e. a scalar number
Similarly, in y′Xβ we are multiplying a 1 × n matrix by an n × 3 matrix by a 3 × 1 matrix, so the result will again be a 1 × 1 matrix, i.e. a scalar
Notice also that β′X′y = (y′Xβ)′
The transpose of a scalar is itself
Thus, since these matrices are both scalars, they are equal, and we can simplify our equation to:
S(β) = y′y − 2β′X′y + β′X′Xβ
We now differentiate this function using vector calculus and set it equal to 0:
∂S/∂β = −2X′y + 2X′Xβ = 0
X′Xβ = X′y
β̂ = (X′X)⁻¹X′y
Thus in matrix form, the least squares estimators of β are given by β̂ = (X′X)⁻¹X′y
This matrix exists as long as the inverse of X′X exists, which it does as long as our assumption of no linear dependence between x₁ and x₂ holds true
The estimators have the same Minimum Variance Unbiased Estimator property as β̂₀ and β̂₁ do in the simple linear regression case
In matrix form, the fitted regression equation is ŷ = Xβ̂
In matrix form, the residuals are e = y − ŷ
Multiple Linear Regression Example
We have the following data from ten species of mammal:
Species Name   Gestation Period in days (y)   Body Weight in kg (x₁)   Avg. Litter Size (x₂)
Rat 23 0.05 7.3
Tree Squirrel 38 0.33 3
Dog 63 8.5 4
Porcupine 112 11 1.2
Pig 115 190 8
Bush Baby 135 0.7 1
Goat 150 49 2.4
Hippo 240 1400 1
Fur seal 254 250 1
Human 270 65 1
Here, our individual matrices are as follows:
y = [23 38 63 112 115 135 150 240 254 270]′
X is a 10 × 3 matrix whose first column is a column of ones:
X = [ 1 0.05 7.3
      1 0.33 3
      1 8.5  4
      1 11   1.2
      1 190  8
      1 0.7  1
      1 49   2.4
      1 1400 1
      1 250  1
      1 65   1 ]
We first check if our y values appear to be normally distributed:
Looks okay
Our X′X matrix is as follows:
[ 10       1974.580     29.9
  1974.58  2065419.851  3401.855
  29.9     3401.855     153.49 ]
To find the inverse of this matrix we would use Gauss-Jordan Elimination as above
However in the age of technology it's much quicker to use computer software such as MATLAB
We find that
(X′X)⁻¹ = [  0.3021          −1.9913 × 10⁻⁴   −5.4428 × 10⁻²
            −1.9913 × 10⁻⁴    6.3378 × 10⁻⁷    2.4744 × 10⁻⁵
            −5.4428 × 10⁻²    2.4744 × 10⁻⁵    1.6569 × 10⁻² ]
We multiply this matrix by X′ and then by y to get our parameter estimates
β̂ = [ 178.7
      0.07569
      −17.93 ]
Thus our fitted regression equation is Ŷ = 178.68 + 0.07569x₁ − 17.93x₂
We interpret this as follows:
The intercept means that (according to the model) a mammal with body
weight of 0 kg which has an average litter size of 0 babies would have a
gestation period of 179 days
(Note that the intercept does not always make practical sense!)
For every kg of body weight, gestation period increases by 0.07569 days
For every baby in the average litter, gestation period decreases by 17.93 days
Remember, we cannot assume the relationships are causal
It can be dangerous to extrapolate outside the region of x₁ and x₂ values in the data, even if it is within the range of individual values
The intercept may be an example of this!
See the graph below
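The whole calculation above can be reproduced in a few lines; this sketch (assuming numpy is available) re-enters the mammal data and applies the formula β̂ = (X′X)⁻¹X′y.

import numpy as np

body = [0.05, 0.33, 8.5, 11, 190, 0.7, 49, 1400, 250, 65]
litter = [7.3, 3, 4, 1.2, 8, 1, 2.4, 1, 1, 1]
y = np.array([23, 38, 63, 112, 115, 135, 150, 240, 254, 270], dtype=float)

X = np.column_stack([np.ones(10), body, litter])     # 10 x 3 design matrix
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)                                      # approx [178.7, 0.0757, -17.93]
# np.linalg.lstsq(X, y, rcond=None)[0] gives the same estimates more stably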
Multiple Linear Regression with k Independent Variables
Using our matrix notation we can generalise the multiple linear regression model from 2 independent variables to k independent variables
The model is specified as follows:
Y = β₀ + β₁x₁ + β₂x₂ + ⋯ + β_k·x_k + ε
Or, in terms of observations, as follows:
yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ⋯ + β_k·x_kᵢ + εᵢ
Note that p = k + 1 is the total number of parameters in the model (k independent variables plus one intercept)
Hence y = Xβ + ε, where:
y is an n × 1 matrix, X is an n × p matrix, β is a p × 1 matrix, and ε is an n × 1 matrix
This model relies on the same assumptions as the simple linear regression model, along with the assumption of no multicollinearity:
None of the independent variables are collinear (highly correlated with one another)
Multiple Linear Regression Example
Data was collected from 195 American universities on the following variables:
Graduation Rate (the proportion of students in Bachelor's degree programmes who graduate after four years)
Admission Rate (the proportion of applicants to the university who are
accepted)
Student-to-Faculty Ratio (the number of students per lecturer)
Average Debt (the average student debt level at graduation, in US dol-
lars)
A few observations from the data are displayed below:
Grad Rate (y)   Admission Rate (x₁)   S/F Ratio (x₂)   Avg Debt (x₃)
0.65   0.35   14   11156
0.81   0.39   16   13536
0.8    0.35   12   19762
0.46   0.65   13   12906
0.5    0.58   21   14449
0.47   0.65   11   16645
0.18   0.59   14   17221
0.52   0.6    13   14791
0.39   0.79   15   14382
⋮
In this case we have k = 3 independent variables and p = 4 parameters to estimate
The model equation is as follows:
yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + β₃xᵢ₃ + εᵢ
Using computer software we determine that (with rows and columns indexed j = 0, 1, 2, 3):
(X′X)⁻¹ = [ 0.1059          0.01782         3.0672 × 10⁻³   3.2823 × 10⁻⁶
            0.01782         0.1906          5.7407 × 10⁻³   5.7146 × 10⁻⁷
            3.0672 × 10⁻³   5.7407 × 10⁻³   4.6400 × 10⁻⁴   2.1002 × 10⁻⁹
            3.2823 × 10⁻⁶   5.7146 × 10⁻⁷   2.1002 × 10⁻⁹   2.3045 × 10⁻¹⁰ ]
We further determine that:
β̂ = (X′X)⁻¹X′y = [ 1.1095
                   −0.3798
                   −0.02789
                    5.1687 × 10⁻⁷ ]
Thus our sample regression function is:
ŷ = 1.1095 − 0.3798x₁ − 0.02789x₂ + 5.1687 × 10⁻⁷·x₃
Interpretation:
For every 0.01 unit increase in admission rate, there is an expected 0.003798 unit decrease in graduation rate (we can't really talk about the usual 1 unit increase in x₁ since it is a proportion and ranges only from 0 to 1)
For every one unit increase in student-to-lecturer ratio, there is an expected 0.02789 unit decrease in graduation rate
For every $1 increase in average student debt, there is an expected 5.1687 × 10⁻⁷ unit increase in graduation rate
Inferences from a Multiple Linear Regression
Just like in simple linear regression, we often want to do hypothesis testing for
multiple linear regression
There are three main types of hypothesis tests to consider:
1. Inferences on Individual Parameters
2. Inferences on the Full Model (all parameters)
3. Inferences on Subsets of Parameters
Inferences on Individual Parameters
The logic is the same as in simple linear regression but we now use a matrix approach
It can be proven that E(β̂) = β
It can also be proven that the covariance matrix of β̂ is:
Cov(β̂) = σ²(X′X)⁻¹
This means that for each individual element β̂ⱼ of β̂:
E(β̂ⱼ) = βⱼ
Var(β̂ⱼ) = σ²·Cⱼⱼ
where Cⱼⱼ is the diagonal element of (X′X)⁻¹ corresponding to β̂ⱼ
This is the multivariate equivalent of our result in simple linear regression that Var(β̂₁) = σ²/SS_x
Now, we face the same problem as before in that we don't usually know the value of σ²
Remember, before we estimated σ² with σ̂² = (1/(n − 2)) Σᵢ₌₁ⁿ eᵢ² = SS_Residual/(n − 2)
In the multivariate case, we have to divide by n − p instead of n − 2 (we subtract the number of parameters to be estimated, which was 2 in that case)
Our sum of squared residuals can be expressed as follows:
SS_Residual = Σᵢ₌₁ⁿ eᵢ² = e′e
= (y − ŷ)′(y − ŷ)
= (y − Xβ̂)′(y − Xβ̂)
= y′y − y′Xβ̂ − β̂′X′y + β̂′X′Xβ̂
= y′y − 2β̂′X′y + β̂′X′Xβ̂
= y′y − β̂′X′y, since X′Xβ̂ = X′y
Therefore, σ̂² = SS_Residual/(n − p) = (1/(n − p))·(y′y − β̂′X′y)
The test statistic for testing the null hypothesis H₀: βⱼ = βⱼ* is thus:
t = (β̂ⱼ − βⱼ*)/(σ̂·√Cⱼⱼ) = (β̂ⱼ − βⱼ*) / √[ (y′y − β̂′X′y)·Cⱼⱼ / (n − p) ]
Under the null hypothesis, t follows a t distribution with n − p degrees of freedom
Our decision rules will be the same as for inferences on β₁ in the simple linear regression model (depending whether we have a two-tailed, lower tail or upper tail test)
Note that this formula can be used for any βⱼ, including β₀
If we set βⱼ* = 0 then we are testing for the significance of an individual coefficient, that is, whether there is a linear relationship between Y and xⱼ
Inferences on Individual Parameters: Example
Suppose we want to test whether the admission rate has a significant, negative impact on the graduation rate
1. H₀: β₁ = 0 vs. H_A: β₁ < 0
2. α = 0.05
3. t = β̂₁ / √[ (y′y − β̂′X′y)·C₁₁ / (n − p) ] ~ t(n − p)
4. Critical region: t_observed < −t_{α,n−p} = −t_{0.05,195−4} = −t_{0.05,191} ≈ −1.66
5. t_observed = −0.3798 / √[4.7691 × 0.1906/(195 − 4)] = −5.50
t_observed < −1.66, therefore we reject H₀
6. We conclude that admission rate has a significant, negative effect on graduation rate
Suppose we want to test whether the average student debt has a significant impact on the graduation rate
1. H₀: β₃ = 0 vs. H_A: β₃ ≠ 0
2. α = 0.05
3. t = β̂₃ / √[ (y′y − β̂′X′y)·C₃₃ / (n − p) ] ~ t(n − p)
4. Critical region: |t_observed| > t_{α/2,n−p} = t_{0.025,191} ≈ 1.984
5. t_observed = 5.1687 × 10⁻⁷ / √[4.7691 × 2.3045 × 10⁻¹⁰/(195 − 4)] = 0.215
|t_observed| < 1.984, thus we do not reject H₀
6. We conclude that average student debt has no significant effect on graduation rate
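These individual t tests are mechanical once X and y are in matrix form; the function below is a generic sketch of the formulas above (assuming numpy and scipy), where X is the n × p design matrix with a leading column of ones.

import numpy as np
from scipy.stats import t

def coefficient_t_tests(X, y):
    n, p = X.shape
    C = np.linalg.inv(X.T @ X)
    beta_hat = C @ X.T @ y
    sigma2_hat = (y @ y - beta_hat @ X.T @ y) / (n - p)   # SS_Residual / (n - p)
    se = np.sqrt(sigma2_hat * np.diag(C))                 # standard errors sqrt(sigma2 * C_jj)
    t_stats = beta_hat / se                               # tests of H0: beta_j = 0
    p_values = 2 * t.sf(np.abs(t_stats), n - p)           # two-sided p-values
    return beta_hat, se, t_stats, p_values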
Inference on the Whole Regression Model

One way to test the usefulness of a particular multiple linear regression model with $k$ independent variables is to test the following:
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_A: \beta_j \neq 0$ for at least one $j$

If we reject $H_0$, this implies that at least one of the independent variables $x_1, x_2, \ldots, x_k$ contributes significantly to the model

To develop this test, remember the following from our $r^2$ calculations:
$SS_y = \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n y_i^2 - n\bar{y}^2 = \mathbf{y}'\mathbf{y} - n\bar{y}^2$
$SS_{\text{Residual}} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'X'\mathbf{y}$
Hence $SS_{\text{Model}} = SS_y - SS_{\text{Residual}} = \hat{\boldsymbol{\beta}}'X'\mathbf{y} - n\bar{y}^2$

It can be shown that under $H_0$, $SS_{\text{Model}}/\sigma^2 \sim \chi^2(p-1)$ and $SS_{\text{Residual}}/\sigma^2 \sim \chi^2(n-p)$

From this we can develop a test statistic which compares the variation explained by the model to the variation not explained by the model:
$F = \dfrac{SS_{\text{Model}}/(p-1)}{SS_{\text{Residual}}/(n-p)}$

Under $H_0$, $F \sim F(p-1, n-p)$, and so we use the $F$ distribution table to determine whether or not to reject the null hypothesis

In this case we always have a one-sided, upper tail test. Our decision rule is:
Reject $H_0$ if $F_{\text{observed}} > F_{\alpha, p-1, n-p}$
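As a rough illustration (not from the notes), the overall F statistic can be assembled from the same sums of squares; the sketch below assumes X already contains a column of 1s for the intercept.

```python
import numpy as np
from scipy import stats

def overall_f_test(X, y, alpha=0.05):
    """F test of H0: beta_1 = ... = beta_k = 0 (intercept excluded from H0)."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    ss_y = y @ y - n * y.mean() ** 2           # total (corrected) sum of squares
    ss_res = y @ y - beta_hat @ X.T @ y        # residual sum of squares
    ss_model = ss_y - ss_res
    F = (ss_model / (p - 1)) / (ss_res / (n - p))
    crit = stats.f.ppf(1 - alpha, p - 1, n - p)   # F_{alpha, p-1, n-p}
    p_value = stats.f.sf(F, p - 1, n - p)
    return F, crit, p_value
```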
Inference on the Whole Regression Model: Example

For our graduation rate example:

1. $H_0: \beta_1 = \beta_2 = \beta_3 = 0$ vs. $H_A: \beta_j \neq 0$ for at least one $j = 1, 2, 3$
2. $\alpha = 0.05$
3. Test statistic: $F = \dfrac{SS_{\text{Model}}/(p-1)}{SS_{\text{Residual}}/(n-p)} \sim F(p-1, n-p)$
4. Critical Region: $F_{\text{observed}} > F_{\alpha, p-1, n-p} = F_{0.05, 3, 191} \approx 2.65$
5. $F_{\text{observed}} = \dfrac{\left(\hat{\boldsymbol{\beta}}'X'\mathbf{y} - n\bar{y}^2\right)/(p-1)}{\left(\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'X'\mathbf{y}\right)/(n-p)} = \dfrac{6.102/(4-1)}{4.769/(195-4)} = 81.47 > 2.65$, so we reject $H_0$
6. We conclude that at least one of the independent variables contributes significantly to the model.
Inference on a Subset of the Parameters

It is also possible to carry out a test of significance on a subset of the parameters, but we will not cover this

Confidence Intervals for Individual Coefficients

By rearranging our test statistic for an individual coefficient parameter, we can obtain the following $(1-\alpha) \times 100\%$ Confidence Interval for $\beta_j$, for any $j = 0, 1, 2, \ldots, k$:
$\Pr\left(\hat{\beta}_j - t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 C_{jj}} \leq \beta_j \leq \hat{\beta}_j + t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 C_{jj}}\right) = 1 - \alpha$,
where $\hat{\sigma}^2 = SS_{\text{Residual}}/(n-p) = \left(\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'X'\mathbf{y}\right)/(n-p)$
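A hedged sketch of this interval computation, reusing the quantities beta_hat, C = (X'X)^{-1} and sigma2_hat from the earlier sketch; scipy supplies the t critical value.

```python
import numpy as np
from scipy import stats

def beta_confidence_interval(beta_hat, C, sigma2_hat, df, j, alpha=0.05):
    """(1 - alpha)100% CI for beta_j, with df = n - p residual degrees of freedom."""
    t_crit = stats.t.ppf(1 - alpha / 2, df)            # t_{alpha/2, n-p}
    half_width = t_crit * np.sqrt(sigma2_hat * C[j, j])
    return beta_hat[j] - half_width, beta_hat[j] + half_width
```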
Confidence Intervals for Individual Coefficients: Example

Let us construct a confidence interval for $\beta_3$ in the graduation rate example

First let's calculate $\hat{\sigma}^2$

If $\mathbf{y}'\mathbf{y} = 68.9714$ and $\hat{\boldsymbol{\beta}}'X'\mathbf{y} = 64.20232$, then $SS_{\text{Residual}} = 4.769$

Thus $\hat{\sigma}^2 = SS_{\text{Residual}}/(n-p) = 4.769/(195-4) = 0.02497$

We know that $\hat{\beta}_3 = 5.1687 \times 10^{-7}$ and $C_{33} = 2.3045 \times 10^{-10}$

Thus our confidence interval is given by:
$\hat{\beta}_j \pm t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2 C_{jj}}$
$= 5.1687 \times 10^{-7} \pm t_{0.025, 195-4}\sqrt{0.02497(2.3045 \times 10^{-10})}$
$= 5.1687 \times 10^{-7} \pm 1.984\sqrt{0.02497(2.3045 \times 10^{-10})}$
$= 5.1687 \times 10^{-7} \pm 4.759 \times 10^{-6}$
$= \left(-4.24 \times 10^{-6},\ 5.28 \times 10^{-6}\right)$

Thus we can say with 95% confidence that the change in graduation rate for a $1 increase in average student debt is between $-4.24 \times 10^{-6}$ and $5.28 \times 10^{-6}$

Notice that the confidence interval contains the value 0, which agrees with the conclusion of our hypothesis test earlier
Confidence Region for All Coefficients

One can also construct a joint confidence region for all the parameters

For a simple linear regression model, the joint confidence region for $(\beta_0, \beta_1)$ takes the shape of a two-dimensional ellipse

This is outside the scope of this course, however
Confidence Interval for the Mean Response

As we did in simple linear regression, we can construct a confidence interval for the mean response at a particular point, say $\mathbf{x}_0' = [1, x_{01}, x_{02}, \ldots, x_{0k}]$

The mean response at this point is $E(Y|\mathbf{x} = \mathbf{x}_0) = \mathbf{x}_0'\boldsymbol{\beta}$

The estimated mean response at this point is $\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$

A $(1-\alpha) \times 100\%$ Confidence Interval for $E(Y|\mathbf{x} = \mathbf{x}_0)$ is given by:
$\Pr\left(\hat{y}_0 - t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2\, \mathbf{x}_0'(X'X)^{-1}\mathbf{x}_0} \leq E(Y|\mathbf{x} = \mathbf{x}_0) \leq \hat{y}_0 + t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2\, \mathbf{x}_0'(X'X)^{-1}\mathbf{x}_0}\right) = 1 - \alpha$
Confidence Interval for the Mean Response: Example

Let's find a confidence interval for the average graduation rate of universities which have an admission rate of 50% = 0.5, a student-to-faculty ratio of 20:1 = 20, and an average student debt of $20000

In this case, $\mathbf{x}_0' = [1, 0.5, 20, 20000]$, a $1 \times 4$ row vector

Our point estimate is:
$\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}} = [1, 0.5, 20, 20000]\,[1.1095,\ -0.3798,\ -0.02789,\ 5.1687 \times 10^{-7}]'$
$= 1.1095 - 0.3798(0.5) - 0.02789(20) + 5.1687 \times 10^{-7}(20000) = 0.3721$

Thus we would predict that such universities would have an average graduation rate of 37.21%

The only thing left to calculate in our confidence interval formula is $\mathbf{x}_0'(X'X)^{-1}\mathbf{x}_0$; using matrix multiplication we see this is equal to 0.03492

Thus our 95% confidence interval for $E(Y|\mathbf{x} = \mathbf{x}_0)$ is:
$\hat{y}_0 \pm t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2\, \mathbf{x}_0'(X'X)^{-1}\mathbf{x}_0}$
$= 0.3721 \pm 1.984\sqrt{0.02497(0.03492)}$
$= 0.3721 \pm 0.0586$
$= (0.3135, 0.4307)$
Prediction Interval for a New Response

Also, as in simple linear regression, we can predict the value of the response $Y_0$ for a new observation $\mathbf{x}_0$ and obtain an interval estimate for it

The predicted value is $\hat{Y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$ (actually the same as $\hat{y}_0$ above)

A $(1-\alpha) \times 100\%$ Prediction Interval for $Y_0$ is:
$\Pr\left(\hat{y}_0 - t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2\left(1 + \mathbf{x}_0'(X'X)^{-1}\mathbf{x}_0\right)} \leq Y_0 \leq \hat{y}_0 + t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2\left(1 + \mathbf{x}_0'(X'X)^{-1}\mathbf{x}_0\right)}\right) = 1 - \alpha$

As in the simple linear regression case, we can see from the "1 +" that this prediction interval is wider than the confidence interval for the mean response
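Both intervals at a new point can be sketched with the same ingredients as before; this is illustrative only (the worked examples use SAS output) and assumes beta_hat, C = (X'X)^{-1} and sigma2_hat are available.

```python
import numpy as np
from scipy import stats

def mean_and_prediction_intervals(x0, beta_hat, C, sigma2_hat, df, alpha=0.05):
    """Returns (CI for the mean response, PI for a new response) at the point x0."""
    x0 = np.asarray(x0, dtype=float)
    y0_hat = x0 @ beta_hat                      # point estimate x0' beta_hat
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    h0 = x0 @ C @ x0                            # x0' (X'X)^{-1} x0
    ci_half = t_crit * np.sqrt(sigma2_hat * h0)          # mean response
    pi_half = t_crit * np.sqrt(sigma2_hat * (1 + h0))    # new observation
    return (y0_hat - ci_half, y0_hat + ci_half), (y0_hat - pi_half, y0_hat + pi_half)
```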
Prediction Interval for a New Response: Example

Let us obtain a prediction interval at a particular university which has an admission rate of 50% = 0.5, a student-to-faculty ratio of 20:1 = 20, and an average student debt of $20000

Our point estimate is $\hat{Y}_0$, which is actually the same as $\hat{y}_0$; it equals 0.3721

Our 95% prediction interval is as follows:
$\hat{y}_0 \pm t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2\left(1 + \mathbf{x}_0'(X'X)^{-1}\mathbf{x}_0\right)}$
$= 0.3721 \pm 1.984\sqrt{0.02497(1 + 0.03492)}$
$= 0.3721 \pm 0.3189$
$= (0.0532, 0.691)$

We can see that this is a very wide (and not very useful) prediction interval
Assessing Goodness of Fit of a Multiple Linear Regression Model

We can define $r^2$ just as we did for the simple linear regression model:
$r^2 = 1 - \dfrac{SS_{\text{Residual}}}{SS_y} = 1 - \dfrac{\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'X'\mathbf{y}}{\mathbf{y}'\mathbf{y} - n\bar{y}^2}$

In this case it is referred to as the Multiple Coefficient of Determination

One of the disadvantages of this statistic is that it will always increase as more independent variables are added to the model

This will suggest that the fit is getting better even if the new variables are not significant

This problem led to the development of an alternative goodness of fit statistic for multiple linear regression called Adjusted $r^2$

Adjusted $r^2$

Adjusted $r^2$, written as $\bar{r}^2$, imposes a penalty for adding more terms to the model

It will thus decrease when we add an independent variable that does not contribute much explanatory power

$\bar{r}^2 = 1 - \dfrac{SS_{\text{Residual}}/(n-p)}{SS_y/(n-1)} = 1 - \left(\dfrac{n-1}{n-p}\right)\left(1 - r^2\right)$
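A small sketch of both statistics, assuming X includes the intercept column; this is not part of the original notes.

```python
import numpy as np

def r_squared(y, X, beta_hat):
    """Multiple coefficient of determination and its adjusted version."""
    n, p = X.shape
    ss_y = y @ y - n * y.mean() ** 2            # SS_y
    ss_res = y @ y - beta_hat @ X.T @ y         # SS_Residual
    r2 = 1 - ss_res / ss_y
    r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)
    return r2, r2_adj
```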
$r^2$ and $\bar{r}^2$ for the Multiple Linear Regression Model: Example

In our university graduation rates example, we calculate $r^2$ as follows:
$r^2 = 1 - \dfrac{\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'X'\mathbf{y}}{\mathbf{y}'\mathbf{y} - n\bar{y}^2} = 1 - \dfrac{68.9714 - 64.20232}{68.9714 - 58.09986} = 1 - 0.4387 = 0.5613$

This suggests that 56% of the variation in graduation rates can be explained by the three factors in the model

Now we calculate $\bar{r}^2$ as follows:
$\bar{r}^2 = 1 - \left(\dfrac{n-1}{n-p}\right)\left(1 - r^2\right) = 1 - \left(\dfrac{195-1}{195-4}\right)(1 - 0.5613) = 1 - 0.4456 = 0.5544$

In this case, there is not much difference between the two, because the sample size $n$ is very large compared to the number of parameters $p$
Model Selection Algorithms

Various algorithms (procedures) have been proposed for selecting which variables to include in a model

This is particularly important when there are many possible independent variables to choose from

We do not want to miss out on variables that contribute significantly to the model, but we also don't want to include unnecessary variables which make our estimates less precise

The three most common algorithms that are used are:

1. Backward Elimination
2. Forward Selection
3. Stepwise Selection

Backward Elimination

Backward Elimination starts with a full model consisting of all possible independent variables, and cuts it down until the "best" model is achieved (a code sketch of this procedure is given after the three algorithm descriptions below)

The algorithm proceeds as follows:

1. Begin with a model including all possible independent variables
2. Estimate the model and take note of the $t_{\text{observed}}$ statistic values for the individual coefficients (not including $\hat{\beta}_0$)
3. Choose the coefficient with the smallest $|t_{\text{observed}}|$; call it $\hat{\beta}_j$
4. Carry out the test of hypothesis $H_0: \beta_j = 0$ vs. $H_A: \beta_j \neq 0$ at the $\alpha$ significance level
5. If the null hypothesis is rejected, we accept this as our final model
6. If the null hypothesis is not rejected, we remove the variable $x_j$ from the model and repeat from step (2)
Forward Selection

Forward Selection works in the opposite direction: it begins with an empty model and adds variables until the "best" model is achieved

The algorithm proceeds as follows:

1. Run simple linear regressions between $y$ and each possible $x$ variable
2. Identify the independent variable with the highest $|t_{\text{observed}}|$ value in its simple linear regression with $y$
3. Carry out the test of hypothesis $H_0: \beta_j = 0$ vs. $H_A: \beta_j \neq 0$ at the $\alpha$ significance level in this simple linear regression model
4. If we reject $H_0$, we add $x_j$ to the multiple linear regression model and proceed to the independent variable with the next highest $|t_{\text{observed}}|$ in its simple linear regression with $y$, and repeat from step (3)
5. If the null hypothesis is not rejected, we conclude $x_j$ is not significant to the model, so we do not add it. We also realise that none of the other independent variables with smaller $|t_{\text{observed}}|$ will be significant; thus the model is final and we are done
Stepwise Selection

Stepwise Selection combines elements of both Backward Elimination and Forward Selection

The algorithm proceeds as follows:

1. Run simple linear regressions between $y$ and each possible $x$ variable
2. Identify the independent variable with the highest $|t_{\text{observed}}|$ value in its simple linear regression with $y$
3. Carry out the test of hypothesis $H_0: \beta_j = 0$ vs. $H_A: \beta_j \neq 0$ at the $\alpha$ significance level in this simple linear regression model
4. If we reject $H_0$, we add $x_j$ to the multiple linear regression model

So far the algorithm is exactly like Forward Selection; but now it changes:

5. Carry out a $t$ test from the multiple linear regression model for the significance of each $\beta_j$ in the model so far
6. If the null hypothesis is not rejected for any $\beta_j$, we delete that $x_j$ from the model
7. Proceed to the independent variable with the next highest $|t_{\text{observed}}|$ in its simple linear regression with $y$, and repeat from step (3)
8. Once we reach a point where all the variables in the model are significant, and none of the variables outside the model are significant, this is our final model
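The following is one possible Python rendering of Backward Elimination using statsmodels p-values; it is a sketch for intuition, not the SAS procedure used in the tutorial. The rule in step (4) above is applied via the p-value: the coefficient with the smallest |t| has the largest p-value.

```python
import statsmodels.api as sm

def backward_elimination(y, X, alpha=0.05):
    """y: response; X: pandas DataFrame of candidate regressors (no intercept column)."""
    kept = list(X.columns)
    while kept:
        model = sm.OLS(y, sm.add_constant(X[kept])).fit()
        pvals = model.pvalues.drop("const")       # ignore the intercept test
        worst = pvals.idxmax()                    # smallest |t| = largest p-value
        if pvals[worst] > alpha:
            kept.remove(worst)                    # not significant: drop it and refit
        else:
            return model, kept                    # every remaining term is significant
    return None, kept                             # nothing survived elimination
```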
Model Selection Algorithms: Example

It is easier to see an example in the tutorial using SAS, since these algorithms are very tedious to carry out by hand

In the case of our Graduation Rate example, all three algorithms lead to the same result: we keep $x_1$ and $x_2$ in the model and drop $x_3$

Note: there are other model selection algorithms, but we will not cover them
Residual Analysis

Revisiting Model Assumptions

Remember that the assumptions of the multiple linear regression model include the following:

All error terms have a zero mean, i.e. $E(\varepsilon_i) = 0$ for all $i$
All error terms have the same fixed variance, i.e. $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all $i$
All observations are independent of each other
The error terms follow the normal distribution
None of the $x$ variables are highly correlated with one another

Whenever we are applying a multiple linear regression model it is important to check these assumptions

Model Adequacy

The first four of these assumptions can be assessed using residual analysis: that is, looking at the residuals of the model

There are two basic ways to do this:

Graphical Analysis
Hypothesis Tests

In this module we will only look at graphical analysis (the hypothesis testing approach will be taught in Econometrics in third year)
Graphical Residual Analysis

Remember that the residuals are defined as $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$, that is, $e_i = y_i - \hat{y}_i$

To calculate the residuals we first determine the least squares regression fit and then obtain the predicted value for each $\mathbf{x}_i$ in the sample; then we subtract these predicted values from the observed $y_i$ values in the sample

Once we have the residuals we can plot the residuals (vertical axis) against the predicted values (horizontal axis)

One can gain a lot of information about the model by looking at this plot

Plot of Residuals vs. Predicted Values

The main things to look for in the plot are patterns or unusual points

Ideally, the points should be evenly distributed above and below zero and should appear completely random

(Figure: a residual plot in which the points appear random)

(Figure: a residual plot in which the variance of the residuals appears to increase as $\hat{y}$ increases)

Normal Quantile-Quantile Plot

A normal quantile-quantile plot is a useful tool for checking whether the residuals are normally distributed

If so, the points should fall approximately in a straight line

(Figures: one QQ plot that looks approximately normal and one that does not)

Histogram of Residuals

Another way to check normality is to plot a histogram of the residuals and see if it is bell shaped

(Figure: histogram of residuals)

Summary of Graphical Analysis of Residuals

Graphical analysis of residuals is a useful diagnostic tool for determining model adequacy

However, it has limitations: often the results can be inconclusive

This is especially true for small sample sizes
Outlier Diagnostics

We can also use the residuals to look for outliers: values which the model predicts extremely badly

While we could simply look at the residuals themselves, it is better to scale them in some way

Analogy to $z$ scores from STA100A: we don't only want to know how far an observation is from its mean; we want to know how many standard deviations away it is

A basic way to scale the residuals would be to divide them by their standard deviation:
$d_i = \dfrac{e_i}{\hat{\sigma}}$

This is called the standardized residual

Since these residuals should be approximately normally distributed with mean 0 and variance 1, they should almost always lie in the range $-3 \leq d_i \leq 3$

Thus we could define an outlier as any observation whose standardized residual is $> 3$ or $< -3$

Outlier Diagnostics: Internally Studentized Residuals

It can be shown that in general, even if $\mathrm{Var}(\varepsilon_i) = \sigma^2$, the variance of the residuals is not constant

Rather, $\mathrm{Var}(e_i) = \sigma^2(1 - h_{ii})$, where $h_{ii}$ is the $i$th diagonal element of the so-called Hat Matrix, $H = X(X'X)^{-1}X'$

As a result, a better way of scaling the residuals is:
$r_i = \dfrac{e_i}{\sqrt{\hat{\sigma}^2(1 - h_{ii})}}$

(This $r_i$ is not to be confused with the sample Pearson correlation coefficient $r$)

This statistic is known as the internally studentized residual
Outlier Diagnostics: Externally Studentized Residuals

The only weakness of the internally studentized residual is that the variance estimate $\hat{\sigma}^2$ used in calculating $r_i$ is influenced by the $i$th observation

It may be thrown off by an outlier; thus $r_i$ is not ideal for outlier detection

Instead, for each observation, we could estimate the variance using a data set of $n-1$ observations with the $i$th observation removed, and use this estimate $S^2_{(i)}$ in the scaling formula

It can be shown that:
$S^2_{(i)} = \dfrac{(n-p)\hat{\sigma}^2 - e_i^2/(1 - h_{ii})}{n - p - 1}$

If we replace $\hat{\sigma}^2$ with $S^2_{(i)}$ in the internally studentized residual formula we get:
$t_i = \dfrac{e_i}{\sqrt{S^2_{(i)}(1 - h_{ii})}}$

This is known as the externally studentized residual and is the best way of scaling residuals

Hypothesis Test for Outliers

A further advantage is that, under the model assumptions, $t_i \sim t(n-p-1)$

One could carry out a hypothesis test on each observation to check if it is an outlier:

1. $H_0$: The $i$th observation is not an outlier vs. $H_A$: The $i$th observation is an outlier
2. $\alpha = 0.05$
3. Test statistic is $|t_i|$
4. Rejection rule: Reject $H_0$ if $|t_i| > t_{\alpha/(2n),\, n-p-1}$
5. Compute $t_{i,\text{observed}}$ and reach a decision
6. State conclusion

The reason why we have $\alpha/(2n)$ instead of $\alpha/2$ is that we are running the hypothesis test $n$ times, so we are basically dividing up the overall type I error probability among the $n$ individual tests (this is known as the Bonferroni approach)
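A sketch of these residual scalings and the Bonferroni test, computed directly from the matrix formulas above with numpy and scipy (the course does this in SAS); it is illustrative rather than a reference implementation.

```python
import numpy as np
from scipy import stats

def outlier_diagnostics(y, X, alpha=0.05):
    """Standardized, internally and externally studentized residuals, plus outlier flags."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
    h = np.diag(H)                              # leverages h_ii
    e = y - H @ y                               # residuals e = (I - H)y
    sigma2_hat = e @ e / (n - p)
    d = e / np.sqrt(sigma2_hat)                 # standardized residuals
    r = e / np.sqrt(sigma2_hat * (1 - h))       # internally studentized residuals
    s2_i = ((n - p) * sigma2_hat - e**2 / (1 - h)) / (n - p - 1)   # leave-one-out variance
    t_ext = e / np.sqrt(s2_i * (1 - h))         # externally studentized residuals
    crit = stats.t.ppf(1 - alpha / (2 * n), n - p - 1)             # Bonferroni critical value
    outliers = np.where(np.abs(t_ext) > crit)[0]
    return d, r, t_ext, outliers
```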
Outlier Diagnostics: Example

Suppose we have the following set of data ($n = 8$):

y_i    x_i    ŷ_i       e_i       d_i        r_i        t_i
19     8      18.325     0.675     0.2008     0.2178      0.1997
17     7      16.275     0.725     0.2157     0.2450      0.2248
23     10     22.425     0.575     0.1711     0.1856      0.1699
22     9      20.375     1.625     0.4835     0.5169      0.4827
33     14     30.625     2.375     0.7067     1.4133      1.5696
18     7      16.275     1.725     0.5133     0.5830      0.5480
16     7      16.275    -0.275    -0.0818    -0.0929     -0.0849
15     10     22.425    -7.425    -2.2092    -2.3962    -10.5468

When we estimate the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ using the least squares method, we get: $\hat{\beta}_0 = 1.925$, $\hat{\beta}_1 = 2.05$

We can substitute each of our $x_i$ for $x$ in the fitted equation $\hat{y} = 1.925 + 2.05x$ to obtain the predicted values $\hat{y}_i$, which are in the third column of the table above

We can then calculate the residuals: $e_i = y_i - \hat{y}_i$ (see fourth column of the table)

To calculate the standardized residuals we first need to calculate $\hat{\sigma}^2$:
$\hat{\sigma}^2 = \dfrac{1}{n-2}\sum_{i=1}^n e_i^2 = \dfrac{1}{6}\left(0.675^2 + 0.725^2 + \cdots + (-7.425)^2\right) = 11.296$

Now we have $d_i = \dfrac{e_i}{\hat{\sigma}}$ (see the calculated values in the fifth column)

Next we can calculate the internally studentized residuals. We first need to calculate the Hat matrix $H = X(X'X)^{-1}X'$

In this case, $X = \begin{bmatrix} 1 & 8 \\ 1 & 7 \\ 1 & 10 \\ 1 & 9 \\ 1 & 14 \\ 1 & 7 \\ 1 & 7 \\ 1 & 10 \end{bmatrix}$
Taking the diagonal elements of $H$ and using them in the formula $r_i = \dfrac{e_i}{\sqrt{\hat{\sigma}^2(1 - h_{ii})}}$, we get the values in the sixth column of the table above

Next we calculate the externally studentized residuals. We first need to calculate
$S^2_{(i)} = \dfrac{(n-p)\hat{\sigma}^2 - e_i^2/(1 - h_{ii})}{n - p - 1}$

Then we plug these into the following formula to get the values in the seventh column:
$t_i = \dfrac{e_i}{\sqrt{S^2_{(i)}(1 - h_{ii})}}$

It is now apparent for the first time that the 8th observation is an outlier

Hypothesis Test for Outliers: Example

We conduct the hypothesis test described above for each of the 8 observations, at the $\alpha = 0.05$ level

In every case, our rejection rule is: reject $H_0$ if $|t_i| > t_{\alpha/(2n),\, n-p-1} = t_{0.003125, 5}$

We don't have a column for 0.003125 in our $t$ table, so we can take the average of the entries in the 0.005 and 0.001 columns to get an approximation: $(4.030 + 5.876)/2 = 4.953$

We reject $H_0$ for all observations for which $|t_i| > 4.953$; in this case we reject only for the 8th observation

Thus we conclude that the 8th observation is an outlier and none of the others are
Influence Diagnostics

Sometimes, a small subset of observations (even one observation) exerts a disproportionate influence on the fitted regression model

In other words, the parameter estimates $\hat{\boldsymbol{\beta}}$ depend more on these few observations than on the majority of the data

We would like to be able to locate these influential observations and possibly eliminate them

Leverage

The elements $h_{ij}$ of the Hat Matrix describe the amount of influence exerted by $y_j$ on $\hat{y}_i$

Thus a basic measure of the influence of an observation, known as the leverage, is given by $h_{ii}$

The properties of the Hat Matrix $H$ include that the sum of all $n$ diagonal elements is equal to $p$, that is:
$\sum_{i=1}^n h_{ii} = p$

Therefore, the average $h_{ii}$ value would be $\dfrac{p}{n}$

As a rule of thumb, any observation $i$ such that $h_{ii} > \dfrac{2p}{n}$ would be called a high-leverage observation
Cook's Distance

The leverage only takes into account the location of an observation's $x$ values

A more sophisticated measure of influence would take into account the location of both the $x$ and $y$ values of an observation

Cook's Distance is one such measure

Let $\hat{\boldsymbol{\beta}}$ be the usual least squares parameter estimates from all $n$ observations, and let $\hat{\boldsymbol{\beta}}_{(i)}$ be the least squares parameter estimates where the $i$th observation has been deleted from the data

Then Cook's Distance is defined as:
$D_i = \dfrac{\left(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}}\right)' X'X \left(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}}\right)}{p\, MS_{\text{Residual}}}$

The Cook's Distance formula can also be expressed in terms of the internally studentized residuals:
$D_i = \dfrac{r_i^2}{p} \cdot \dfrac{h_{ii}}{1 - h_{ii}}$

In general, if $D_i > 1$ we say that the $i$th observation is influential
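Continuing the earlier sketch, leverage and Cook's Distance can be computed from the internally studentized residuals $r_i$ and the hat diagonals $h_{ii}$; the 2p/n and D_i > 1 rules of thumb are applied as stated above.

```python
import numpy as np

def influence_diagnostics(r, h, p):
    """r: internally studentized residuals; h: hat diagonals; p: number of parameters."""
    D = (r**2 / p) * (h / (1 - h))                   # Cook's distance
    high_leverage = np.where(h > 2 * p / len(h))[0]  # h_ii > 2p/n rule of thumb
    influential = np.where(D > 1)[0]                 # D_i > 1 rule of thumb
    return D, high_leverage, influential
```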
Influence Diagnostics: Example

With the outlier data set used above, the $h_{ii}$ values are:
$h_{ii} = [0.15, 0.225, 0.15, 0.125, 0.75, 0.225, 0.225, 0.15]$

In this case $\dfrac{2p}{n} = \dfrac{2(2)}{8} = 0.5$. Since $h_{55} = 0.75 > 0.5$, we can say that the 5th observation is a high-leverage observation

We can calculate the Cook's Distance using the formula $D_i = \dfrac{r_i^2}{p} \cdot \dfrac{h_{ii}}{1 - h_{ii}}$

In this case,
$D_i = [0.0042, 0.0087, 0.0030, 0.0191, 2.9961, 0.0493, 0.0013, 0.5066]$

Since $D_5 > 1$ we can again say that the 5th observation is influential
Multicollinearity

Multicollinearity occurs when two or more of the $x$ variables have a strong linear relationship with each other

This makes the $\hat{\boldsymbol{\beta}}$ estimates less precise

In fact, if two or more $x$ variables have a perfect linear relationship, we cannot use the method of least squares at all; technically this is because the $X'X$ matrix is not invertible

In most cases the multicollinearity will not be perfect; but if it is strong, it can still ruin the model

How do we know if there is multicollinearity?

Detecting Multicollinearity

The simplest way to detect multicollinearity is to calculate the Pearson correlation coefficient between each pair of independent variables $x_s$ and $x_t$

A rule of thumb says that if any of these correlation coefficients is higher than 0.7 in absolute value, there is serious multicollinearity

SAS can also provide us with variance inflation factor (VIF) estimates, which tell us by what factor the variance of a coefficient estimate increases because of multicollinearity involving that independent variable

A rule of thumb says that if the VIF > 5 for any independent variable, there is serious multicollinearity involving that variable

The simplest way of resolving multicollinearity is to remove one of the offending $x$ variables
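An illustrative numpy computation of both checks follows; the VIF for $x_j$ is computed as $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing $x_j$ on the other regressors, which is the standard definition of the variance inflation factor.

```python
import numpy as np

def multicollinearity_checks(X):
    """X: n x k matrix containing the regressors only (no intercept column)."""
    corr = np.corrcoef(X, rowvar=False)          # pairwise Pearson correlations
    n, k = X.shape
    vif = []
    for j in range(k):
        # Regress x_j on the remaining regressors (with an intercept)
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ beta
        ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2_j = 1 - resid @ resid / ss_tot
        vif.append(1 / (1 - r2_j))               # VIF_j = 1 / (1 - R_j^2)
    return corr, np.array(vif)
```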
Multicollinearity: Example

The table below gives the cost of adding a new communications node to a network, along with three independent variables thought to explain this cost: the number of ports available for access ($x_1$), the bandwidth ($x_2$), and the port speed ($x_3$)

y_i      x_1i   x_2i   x_3i
52388    68     58     653
51761    52     179    499
50221    44     123    422
36095    32     38     307
27500    16     29     154
57088    56     141    538
54475    56     141    538
33969    28     48     269
31309    24     29     230
23444    24     10     230
24269    12     56     115
53479    52     131    499
33543    20     38     192
33056    24     29     230

When we estimate the model $Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i$ using Ordinary Least Squares, we get the fitted equation:
$\hat{y} = 17487 - 14168x_1 + 81.39x_2 + 1523.7x_3$

(Continue from SAS project)
Changes in Functional Form

What if there is a non-linear relationship between $Y$ and $x$? E.g. quadratic, cubic, logarithmic, etc.

We can still use linear regression just as before, but with the independent variables transformed appropriately

Changes in Functional Form: Example 1

Example with quadratic term

Changes in Functional Form: Example 2

Example with $\ln$ term (log base $e$)

Interpretation: $\beta_1$ is the expected change in $y$ for a one unit increase in $\ln x$

This can also be expressed in terms of a change in $x$: $\beta_1$ is the expected change in $y$ when $x$ is multiplied by $e = 2.718$, that is, when $x$ increases by 171.8%

More generally, the expected change in $y$ for a $\delta\%$ increase in $x$ would be $\beta_1 \ln\left(\dfrac{100 + \delta}{100}\right)$

Thus the expected change in $y$ for a 10% increase in $x$ would be $0.095\beta_1$

For small $\delta$, $\ln\left(\dfrac{100 + \delta}{100}\right) \approx \dfrac{\delta}{100}$, and so we can say approximately that $\dfrac{\beta_1}{100}$ is the expected change in $y$ for a 1% increase in $x$
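A quick numeric check of this interpretation (the slope value below is hypothetical, not from any example in these notes):

```python
import numpy as np

b1 = 2.0                                   # hypothetical coefficient on ln(x)
for delta in (1, 10, 171.8):
    change = b1 * np.log((100 + delta) / 100)
    print(f"{delta:6.1f}% increase in x -> change in y of {change:.4f}")
# A 1% increase gives roughly b1/100; multiplying x by e (a 171.8% increase)
# changes y by (approximately) the full b1.
```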
Transformations of the Dependent Variable

Transformations of $y$ are used to make the data fit a normal distribution better and to resolve the problem of non-constant variance

Common transformations include:
$y^* = \ln(y)$
$y^* = \sqrt{y}$

The Box-Cox Transformation is a method used to choose the best transformation for $y$
Box-Cox Transformation

The Box-Cox Transformation consists of estimating a new parameter $\lambda$ (this $\lambda$ has nothing to do with the Poisson distribution)

The value of $\lambda$ is the best power to use in transforming $y$; for instance:
If $\lambda = 2$, we use the transformation $y^* = y^2$
If $\lambda = \frac{1}{2}$, we use the transformation $y^* = y^{1/2} = \sqrt{y}$
In the special case $\lambda = 0$ we use the transformation $y^* = \ln(y)$

SAS can estimate the $\lambda$ parameter for us
Box-Cox Transformation: Example
C
Interaction Terms
a
Dummy Variables
Do two-category only; save rest for econometrics
5 Logistic Regression

Different Kinds of Dependent Variables

Throughout our study of linear regression models, we have assumed that the dependent variable is a normally distributed random variable

However, in practice we may want to build models for data that are not normally distributed

For the rest of the module we will be looking at some of these models

Categorical Dependent Variable

We already studied models with dummy (categorical) independent variables

But what if the dependent variable is categorical?

If the dependent variable has two possible values (like a Bernoulli random variable), then it is called binary

A Bernoulli random variable is a binomial random variable where the number of trials is $n = 1$

For example, the dependent variable could be:
$Y_i = 1$ if the $i$th product is defective, $0$ if the $i$th product is OK

Or:
$Y_i = 1$ if the $i$th patient recovers, $0$ if the $i$th patient dies

We can construct models for this kind of dependent variable

They will be quite different from linear regression models, but still have some key similarities, since both types of models are classified as Generalized Linear Models
Generalized Linear Models

Generalized Linear Models are a class of models, some of the properties of which are:

1. We have $n$ independent response observations $y_1, y_2, \ldots, y_n$ with theoretical means $\mu_1, \mu_2, \ldots, \mu_n$
2. The observation $y_i$ is a random variable with a probability distribution from the exponential family (which basically means its probability mass function or probability density function has an $e$ in it)
3. The mean response vector is related to a linear predictor $\eta = \mathbf{x}'\boldsymbol{\beta} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$
4. The relationship between $\eta_i$ and $\mu_i$ is expressed by a link function $g$, so that $\eta_i = g(\mu_i)$, $i = 1, 2, \ldots, n$

By taking the inverse of this function we can also write $\mu_i = E(y_i) = g^{-1}(\eta_i) = g^{-1}(\mathbf{x}_i'\boldsymbol{\beta})$

In the case of linear regression:
The link function is $g(\mu_i) = \mu_i$, so $E(Y_i) = \mu_i = \eta_i = \mathbf{x}_i'\boldsymbol{\beta}$
The dependent variable follows a normal distribution
In summary, $Y_i \sim N(\mathbf{x}_i'\boldsymbol{\beta}, \sigma^2)$ (this is a way of writing the model without $\varepsilon_i$)
Logistic Regression Model

If each $Y_i$ follows a Bernoulli distribution (binomial with $n = 1$), with probability of success $\Pr(Y_i = 1) = p_i$ and probability of failure $1 - p_i$, then $\mu_i = E(Y_i) = p_i$

If we again used the identity link function $g(\mu_i) = \mu_i$, then our model would be $p_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$

It is easy to see that this is a bad idea, because the predicted values of the model would not necessarily be between 0 and 1

A better model uses the link function $g(p_i) = \ln\left(\dfrac{p_i}{1 - p_i}\right)$

The quantity $\dfrac{p_i}{1 - p_i}$ is called an odds: it is the ratio of the probability of success to the probability of failure

Thus the link function gives the log odds, also known as the logit or logistic function

This means the model can be expressed as follows:
$\ln\left(\dfrac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$

By taking the inverse of the function we can also express the model like this:
$E(Y_i) = p_i = \dfrac{1}{1 + e^{-\mathbf{x}_i'\boldsymbol{\beta}}}$, where $\mathbf{x}_i' = [1, x_{1i}, x_{2i}, \ldots, x_{ki}]$

Notice that there is no error term $\varepsilon_i$ in this model

Remember that the $p_i$ are probabilities and thus range between 0 and 1

(Figure: graph of the logit function $g(p_i)$, which is undefined at $p_i = 0$ and $p_i = 1$)
Parameter Estimation in Logistic Regression

Just like in linear regression, our first task is to estimate the parameter vector $\boldsymbol{\beta} = [\beta_0, \beta_1, \beta_2, \ldots, \beta_k]'$

However, we can no longer use the Method of Least Squares (Why?)

Instead we use the Method of Maximum Likelihood

We will not explain the details of this method

Unfortunately this method requires an iterative procedure and cannot easily be calculated by hand

However, computer software such as SAS can compute the estimates $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ quite easily
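For illustration only, here is a sketch of the maximum likelihood fit in Python via statsmodels (the notes use SAS). The data are simulated to resemble the structure of the ICU example that follows, so the estimates are not the lecture's numbers.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.integers(0, 2, n)             # hypothetical emergency admission indicator
x2 = rng.normal(130, 20, n)            # hypothetical systolic blood pressure
eta = -1.3 + 2.0 * x1 - 0.014 * x2     # assumed "true" log-odds for the simulation
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.Logit(y, X).fit(disp=0)       # maximum likelihood estimation
print(fit.params)                      # estimates on the log-odds scale
print(np.exp(fit.params[1:]))          # odds ratios for x1 and x2
```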
Interpreting Parameters in Logistic Regression

More important for our purposes is to be able to interpret what the parameter estimates tell us

The parameter estimates themselves are interpreted as log-odds ratios, while $e^{\hat{\beta}_1}$, for instance, would be interpreted as an odds ratio

It is best to illustrate what these terms mean using an example

Logistic Regression Example

Consider a data set of 200 people admitted to the intensive care unit at a hospital

The dependent variable is whether they died:
$y_i = 1$ if the person died, $0$ if the person survived

The first independent variable is the type of admission to ICU:
$x_{i1} = 1$ if they were admitted via emergency services, $0$ if they were self-admitted

The second independent variable $x_{i2}$ is the person's systolic blood pressure in mm Hg

The estimated model is:
$\ln\left(\dfrac{\hat{p}_i}{1 - \hat{p}_i}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2}$

which can also be written as:
$\widehat{\Pr}(Y_i = 1) = \hat{p}_i = \dfrac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2})}}$
We estimate the parameters in SAS and our fitted equation is:
$\ln\left(\dfrac{\hat{p}_i}{1 - \hat{p}_i}\right) = -1.33 + 2.022 x_{i1} - 0.014 x_{i2}$

Or:
$\widehat{\Pr}(Y_i = 1) = \hat{p}_i = \dfrac{1}{1 + e^{-(-1.33 + 2.022 x_{i1} - 0.014 x_{i2})}}$

Now to interpret the parameters: as in linear regression, $\hat{\beta}_0$ represents the case when all independent variables take a value of 0

In this case, if $x_{i1} = 0$ (meaning the person was self-admitted) and their systolic blood pressure was 0 ($x_{i2} = 0$), then $\ln\left(\dfrac{\hat{p}_i}{1 - \hat{p}_i}\right) = -1.33$

Or: $\dfrac{\hat{p}_i}{1 - \hat{p}_i} = e^{-1.33} = 0.264$

Thus the estimated odds of dying for a self-admitted person with a systolic blood pressure of 0 is 0.264

From the second equation, the estimated probability of dying for a self-admitted person with a systolic blood pressure of 0 is $\dfrac{1}{1 + e^{1.33}} = 0.209$

$\hat{\beta}_1$ and $\hat{\beta}_2$ are interpreted as log odds ratios:

$\hat{\beta}_1 = 2.022$ tells us the log of the ratio of the odds of death for a person admitted via emergency to the odds of death for a self-admitted person

This can be shown as follows:
$\ln\left(\dfrac{\widehat{\Pr}(Y_i = 1 | x_{i1} = 1)}{\widehat{\Pr}(Y_i = 0 | x_{i1} = 1)}\right) = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 x_{i2}$
$\ln\left(\dfrac{\widehat{\Pr}(Y_i = 1 | x_{i1} = 0)}{\widehat{\Pr}(Y_i = 0 | x_{i1} = 0)}\right) = \hat{\beta}_0 + \hat{\beta}_2 x_{i2}$; thus
$\ln\left(\dfrac{\widehat{\Pr}(Y_i = 1 | x_{i1} = 1)\,/\,\widehat{\Pr}(Y_i = 0 | x_{i1} = 1)}{\widehat{\Pr}(Y_i = 1 | x_{i1} = 0)\,/\,\widehat{\Pr}(Y_i = 0 | x_{i1} = 0)}\right) = \ln\left(\dfrac{\widehat{\Pr}(Y_i = 1 | x_{i1} = 1)}{\widehat{\Pr}(Y_i = 0 | x_{i1} = 1)}\right) - \ln\left(\dfrac{\widehat{\Pr}(Y_i = 1 | x_{i1} = 0)}{\widehat{\Pr}(Y_i = 0 | x_{i1} = 0)}\right)$
$= \left(\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 x_{i2}\right) - \left(\hat{\beta}_0 + \hat{\beta}_2 x_{i2}\right) = \hat{\beta}_1$
Similarly, $e^{\hat{\beta}_1} = e^{2.022} = 7.55$ is the odds ratio, telling us how many times higher the odds of death are for a person admitted via emergency than for a self-admitted person

That is, a person admitted via emergency has odds of death that are 7.55 times higher than the odds of death for a self-admitted person

Since $x_{i2}$ is a continuous rather than categorical independent variable, the interpretation of the log odds ratio $\hat{\beta}_2$ is slightly different:

$\hat{\beta}_2 = -0.014$ tells us that for every one-unit increase in $x_{i2}$ (blood pressure), the log-odds of death is estimated to change by $\hat{\beta}_2 = -0.014$, that is, to decrease by 0.014 (since the parameter estimate is negative)

It is simpler to interpret $e^{\hat{\beta}_2} = e^{-0.014} = 0.986$: this tells us that for every one-unit increase in blood pressure, the odds of death are estimated to change by a factor of 0.986; that is (since this factor is less than 1), the odds of death are estimated to decrease by about 1.4%

More generally, for every increase of $\delta$ units in $x_{i2}$, the odds of death are estimated to change by a factor of $e^{\delta\hat{\beta}_2}$

Thus in this example, if blood pressure increases by 20 units, the odds of death would change by an estimated factor of $e^{-0.014 \times 20} = 0.756$; that is, they would decrease by about 24%
Making Predictions using Logistic Regression

Given a particular observation $\mathbf{x}_i$, whether inside or outside our sample, our predicted log odds for the event $Y_i = 1 | \mathbf{x}_i$ is:
$\ln\left(\dfrac{\hat{p}_i}{1 - \hat{p}_i}\right) = \mathbf{x}_i'\hat{\boldsymbol{\beta}}$

Our predicted probability $\widehat{\Pr}(Y_i = 1 | \mathbf{x}_i)$ is thus:
$\hat{p}_i = \dfrac{1}{1 + e^{-\mathbf{x}_i'\hat{\boldsymbol{\beta}}}}$

We can use this to predict the outcome of a new observation

A basic way to do this would be: if $\hat{p}_i > 0.5$, predict that $Y_i$ will be 1; if $\hat{p}_i < 0.5$, predict that $Y_i$ will be 0
Wald Inference on Parameters in Logistic Regression

Just like in linear regression, we are interested in making inferences on the parameters

Usually we want to test the null hypothesis $H_0: \beta_j = 0$ against a two-tailed or one-tailed alternative

This is because, if $\beta_j > 0$ then the odds of the event $Y_i = 1$ increase as $x_{ji}$ increases; if $\beta_j < 0$ then the odds of the event $Y_i = 1$ decrease as $x_{ji}$ increases; but if $\beta_j = 0$ then changes in $x_{ji}$ have no effect on the odds of the event $Y_i = 1$
Testing the Null Hypothesis $H_0: \beta_j = 0$

Under $H_0$, the statistic
$\chi^2 = \dfrac{\hat{\beta}_j^2}{\widehat{\mathrm{Var}}\left(\hat{\beta}_j\right)} \sim \chi^2_1$

$\widehat{\mathrm{Var}}\left(\hat{\beta}_j\right)$ is the $j$th diagonal element of the matrix $\left(X'\hat{V}X\right)^{-1}$, where:

$V$ is a diagonal matrix whose diagonal elements are $\mathrm{Var}(Y_1), \mathrm{Var}(Y_2), \ldots, \mathrm{Var}(Y_n)$

Each $\mathrm{Var}(Y_i) = p_i(1 - p_i)$, since the $Y_i$ follow the Bernoulli distribution (binomial with one trial)

Our estimate of $V$ is $\hat{V}$, where we replace each $p_i$ with $\hat{p}_i = \dfrac{1}{1 + e^{-\mathbf{x}_i'\hat{\boldsymbol{\beta}}}}$:

$\hat{V} = \mathrm{diag}\left(\hat{p}_1(1 - \hat{p}_1),\ \hat{p}_2(1 - \hat{p}_2),\ \ldots,\ \hat{p}_n(1 - \hat{p}_n)\right)$

Thus the test statistic may be expressed as:
$\chi^2 = \dfrac{\hat{\beta}_j^2}{\left[\left(X'\hat{V}X\right)^{-1}\right]_{jj}}$
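The Wald statistics can also be computed directly from this matrix formula; the sketch below assumes the design matrix X and the fitted coefficient vector beta_hat are already available (e.g. from the earlier fitting sketch).

```python
import numpy as np
from scipy import stats

def wald_tests_logistic(X, beta_hat):
    """Wald chi-square statistics and p-values for H0: beta_j = 0, each on 1 df."""
    p_hat = 1 / (1 + np.exp(-(X @ beta_hat)))     # fitted probabilities
    V_hat = np.diag(p_hat * (1 - p_hat))          # estimated Var(Y_i) on the diagonal
    cov_beta = np.linalg.inv(X.T @ V_hat @ X)     # (X' V_hat X)^{-1}
    chi2 = beta_hat**2 / np.diag(cov_beta)
    p_values = stats.chi2.sf(chi2, df=1)          # upper-tail chi-square(1) p-values
    return chi2, p_values
```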
Testing the Null Hypothesis $H_0: \beta_j = 0$: Example

In our ICU example, suppose we want to test the null hypothesis $H_0: \beta_2 = 0$ against a two-tailed alternative

Our hypothesis test would proceed as follows:

1. $H_0: \beta_2 = 0$ vs. $H_A: \beta_2 \neq 0$
2. $\alpha = 0.05$
3. Test Statistic: $\chi^2$ (as expressed above)
4. Rejection Rule: Reject $H_0$ if $\chi^2_{\text{observed}} > \chi^2_{\alpha, 1}$ (an upper-tail rejection region, since the two-sidedness of the alternative is already built into the squared statistic)
5. Decision: reject $H_0$ or do not reject
6. Conclusion: if we reject $H_0$ then $x_j$ does have a significant effect on the odds of the event $Y_i = 1$; if we do not reject, it does not have a significant effect
Confidence Intervals for Parameter Estimates and Odds Ratios

We can obtain a $(1-\alpha) \times 100\%$ Confidence Interval for $\beta_j$ as follows:
$\hat{\beta}_{jL} = \hat{\beta}_j - z_{\alpha/2}\sqrt{\left[\left(X'\hat{V}X\right)^{-1}\right]_{jj}}$
$\hat{\beta}_{jU} = \hat{\beta}_j + z_{\alpha/2}\sqrt{\left[\left(X'\hat{V}X\right)^{-1}\right]_{jj}}$

For $j \neq 0$ we can also obtain a $(1-\alpha) \times 100\%$ Confidence Interval for the odds ratio $e^{\beta_j}$:
$\left(e^{\hat{\beta}_{jL}},\ e^{\hat{\beta}_{jU}}\right)$
Confidence Interval for the Mean Response in Logistic Regression

As in multiple linear regression, we are interested in obtaining a $(1-\alpha) \times 100\%$ Confidence Interval for the mean response, which in this case is $p_i$

We first obtain a Confidence Interval for the log odds $\ln\left(\dfrac{p_i}{1 - p_i}\right)$:
$LO_L = \mathbf{x}_i'\hat{\boldsymbol{\beta}} - z_{\alpha/2}\sqrt{\mathbf{x}_i'\left(X'\hat{V}X\right)^{-1}\mathbf{x}_i}$
$LO_U = \mathbf{x}_i'\hat{\boldsymbol{\beta}} + z_{\alpha/2}\sqrt{\mathbf{x}_i'\left(X'\hat{V}X\right)^{-1}\mathbf{x}_i}$

We can transform this to obtain a $(1-\alpha) \times 100\%$ Confidence Interval for $p_i$:
$\hat{p}_{iL} = \dfrac{1}{1 + e^{-LO_L}}$
$\hat{p}_{iU} = \dfrac{1}{1 + e^{-LO_U}}$
Prediction Interval for an Individual Response in Logistic Regression

We can also obtain a $(1-\alpha) \times 100\%$ Prediction Interval for a new observation $y_i$:
$\hat{p}_i \pm z_{\alpha/2}\sqrt{\hat{p}_i(1 - \hat{p}_i)\left(1 + \mathbf{x}_i'\left(X'\hat{V}X\right)^{-1}\mathbf{x}_i\right)}$
Goodness of Fit in Logistic Regression

There is no such thing as $r^2$ for a logistic regression model

A version of Pearson's $\chi^2$ test can be used to assess the goodness of fit of the model

The test statistic is
$\chi^2 = \sum_{i=1}^n \dfrac{(y_i - \hat{p}_i)^2}{\hat{p}_i(1 - \hat{p}_i)} \sim \chi^2_{n-p}$

We use this to test the null hypothesis that the model is a good fit to the data

If $\chi^2$ is large, we reject $H_0$ and conclude that the model is not a good fit

We can also use the deviance as a measure of goodness of fit or for model selection:
$D(\hat{\boldsymbol{\beta}}) = 2\sum_{i=1}^n \left[y_i \ln\left(\dfrac{y_i}{\hat{p}_i}\right) + (1 - y_i)\ln\left(\dfrac{1 - y_i}{1 - \hat{p}_i}\right)\right]$

We would like the deviance to be as low as possible, so when choosing between competing models we would prefer the model with the smallest deviance

There are other methods for assessing goodness of fit in a logistic regression model, which we will not cover here
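A sketch of both fit measures for a fitted logistic model follows; it assumes the vector of fitted probabilities p_hat (strictly between 0 and 1) and the number of parameters p are available.

```python
import numpy as np
from scipy import stats

def logistic_fit_measures(y, p_hat, p):
    """Pearson chi-square (with p-value on n - p df) and the binary deviance."""
    n = len(y)
    pearson = np.sum((y - p_hat) ** 2 / (p_hat * (1 - p_hat)))
    p_value = stats.chi2.sf(pearson, df=n - p)
    # For binary y the deviance terms reduce to ln(1/p_hat) when y = 1
    # and ln(1/(1 - p_hat)) when y = 0 (the 0*ln(0) terms are taken as 0)
    dev_terms = np.where(y == 1, np.log(1 / p_hat), np.log(1 / (1 - p_hat)))
    deviance = 2 * np.sum(dev_terms)
    return pearson, p_value, deviance
```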
Goodness of Fit in Logistic Regression: Example
a
Multinomial Logistic Regression

Logistic regression can also be used when the dependent variable is categorical with more than two categories

For example:
$Y_i = 5$ if a person banks with Capitec, $4$ if a person banks with ABSA, $3$ if a person banks with Standard Bank, $2$ if a person banks with Nedbank, $1$ if a person banks with FNB, $0$ if a person has no bank account

The model is now based on the multinomial distribution rather than the binomial distribution

Estimation and inference for the model are similar, but interpretation of the parameters is more complicated

We will not cover this method in this module, but you may want to investigate it for your Project 2 or Work Integrated Learning next year
6 Poisson Regression
1. Model Specification
2. Model Assumptions (Overdispersion)
3. Neg. Binomial Regression as an alternative
4. Estimation
5. Interpreting Parameters
6. Inference
Poisson Regression Model

If we have a dependent variable that follows a Poisson distribution (e.g. count data of some kind), we can use Poisson regression to fit a model which relates it to some independent variables

We are assuming that each observation $Y_i$ follows a Poisson distribution with rate parameter $\lambda_i$; thus $\mu_i = E(Y_i) = \lambda_i$

It is also true then that $\mathrm{Var}(Y_i) = \lambda_i$

Following our Generalized Linear Models approach, the best link function to use in this case is $g(\lambda_i) = \ln(\lambda_i)$

An advantage of this link function is that it ensures that $\hat{\lambda}_i > 0$ (see below), which is a must for the Poisson distribution

This means the model can be expressed as follows:
$\ln(\lambda_i) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$

By taking the inverse of the function we can also express the model like this:
$E(Y_i) = \lambda_i = e^{\mathbf{x}_i'\boldsymbol{\beta}}$, where $\mathbf{x}_i' = [1, x_{1i}, x_{2i}, \ldots, x_{ki}]$

Notice that there is again no error term $\varepsilon_i$ in this model, because the model equation involves the expected value of the dependent variable rather than the dependent variable itself
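As an illustration only, a Poisson regression can be fitted by maximum likelihood in Python with statsmodels (the course uses SAS); the data below are simulated, and exponentiating the slope estimates gives the multiplicative effects on $E(Y_i)$ discussed in the interpretation section that follows.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(0, 1, n)
x2 = rng.uniform(0, 2, n)
lam = np.exp(0.5 + 0.8 * x1 - 0.3 * x2)   # assumed "true" rate: ln(lambda) linear in x
y = rng.poisson(lam)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)                          # estimates on the log scale
print(np.exp(fit.params[1:]))              # factor change in E(Y) per unit increase in each x
```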
Parameter Estimation in Poisson Regression

Just like in linear and logistic regression, our first task is to estimate the parameter vector $\boldsymbol{\beta} = [\beta_0, \beta_1, \beta_2, \ldots, \beta_k]'$

As in logistic regression, we can no longer use the Method of Least Squares, since the assumptions (such as normality) are not satisfied

Instead we use the Method of Maximum Likelihood

As before, we will not explain the details of this method, as it cannot easily be calculated by hand

However, computer software such as SAS can compute the estimates $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ quite easily
Interpreting Parameters in Poisson Regression

The interpretation of parameters in a Poisson regression model is not the same as in a logistic regression model, since the link function is different

To interpret, for instance, the coefficient $\beta_1$, take the model equation with $x_1$ set to some fixed value $x_1^*$ and subtract it from the model equation with $x_1$ set to $x_1^* + 1$:
$\ln(\lambda_i \,|\, x_{1i} = x_1^* + 1) - \ln(\lambda_i \,|\, x_{1i} = x_1^*)$
$= \left(\beta_0 + \beta_1(x_1^* + 1) + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}\right) - \left(\beta_0 + \beta_1 x_1^* + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}\right) = \beta_1$

Thus $\beta_1$ is the increase in $\ln(\lambda_i)$ resulting from a one unit increase in $x_{1i}$

There is an easier interpretation for $e^{\beta_1}$:
$E(Y_i \,|\, x_{1i} = x_1^* + 1) = e^{\beta_0 + \beta_1(x_1^* + 1) + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}}$
$E(Y_i \,|\, x_{1i} = x_1^*) = e^{\beta_0 + \beta_1 x_1^* + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}}$
$\dfrac{E(Y_i \,|\, x_{1i} = x_1^* + 1)}{E(Y_i \,|\, x_{1i} = x_1^*)} = e^{\beta_1}$

Thus if $x_{1i}$ increases by one unit, $\lambda_i$ increases by a factor of $e^{\beta_1}$

We could also say that $100\left(e^{\beta_1} - 1\right)$ is the percentage change in $\lambda_i$ resulting from a one unit increase in $x_{1i}$
Poisson Regression: Example
a
Making Predictions using Poisson Regression

Given a particular observation $\mathbf{x}_i$, whether inside or outside our sample, our predicted mean $E(Y_i \,|\, \mathbf{x}_i)$ is:
$\hat{\lambda}_i = e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}}$

This is also our best prediction for an individual observation
Wald Inference on Parameters in Poisson Regression

Just like in linear and logistic regression, we are interested in making inferences on the parameters

Usually we want to test the null hypothesis $H_0: \beta_j = 0$ against a two-tailed or one-tailed alternative

This is because, if $\beta_j > 0$ then the mean value of $Y_i$ increases as $x_{ji}$ increases; if $\beta_j < 0$ then the mean value of $Y_i$ decreases as $x_{ji}$ increases; and if $\beta_j = 0$ then changes in $x_{ji}$ have no effect on the mean of $Y_i$
Testing the Null Hypothesis $H_0: \beta_j = 0$

Under $H_0$, the statistic
$\chi^2 = \dfrac{\hat{\beta}_j^2}{\widehat{\mathrm{Var}}\left(\hat{\beta}_j\right)} \sim \chi^2_1$

$\widehat{\mathrm{Var}}\left(\hat{\beta}_j\right)$ is the $j$th diagonal element of the matrix $\left(X'\hat{V}X\right)^{-1}$, where:

$V$ is a diagonal matrix whose diagonal elements are $\mathrm{Var}(Y_1), \mathrm{Var}(Y_2), \ldots, \mathrm{Var}(Y_n)$

Each $\mathrm{Var}(Y_i) = \lambda_i = e^{\mathbf{x}_i'\boldsymbol{\beta}}$, since the $Y_i$ follow the Poisson distribution

Our estimate of $V$ is $\hat{V}$, where we replace each $\lambda_i$ with $\hat{\lambda}_i = e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}}$:

$\hat{V} = \mathrm{diag}\left(\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_n\right)$

Thus the test statistic may be expressed as:
$\chi^2 = \dfrac{\hat{\beta}_j^2}{\left[\left(X'\hat{V}X\right)^{-1}\right]_{jj}}$

Other than the change in the matrix $\hat{V}$, Wald inference is the same for Poisson regression as for logistic regression
Wald Inference in Poisson Regression: Example
a
Confidence Intervals for Parameter Estimates

We can obtain a $(1-\alpha) \times 100\%$ Confidence Interval for $\beta_j$ as follows:
$\hat{\beta}_{jL} = \hat{\beta}_j - z_{\alpha/2}\sqrt{\left[\left(X'\hat{V}X\right)^{-1}\right]_{jj}}$
$\hat{\beta}_{jU} = \hat{\beta}_j + z_{\alpha/2}\sqrt{\left[\left(X'\hat{V}X\right)^{-1}\right]_{jj}}$

We can also obtain a $(1-\alpha) \times 100\%$ Confidence Interval for $e^{\beta_j}$:
$\left(e^{\hat{\beta}_{jL}},\ e^{\hat{\beta}_{jU}}\right)$
Confidence Interval for the Mean Response in Poisson Regression

As in multiple linear regression, we are interested in obtaining a $(1-\alpha) \times 100\%$ Confidence Interval for the mean response, which in this case is $\lambda_i$

We first obtain a Confidence Interval for $\ln \lambda_i$:
$LO_L = \mathbf{x}_i'\hat{\boldsymbol{\beta}} - z_{\alpha/2}\sqrt{\mathbf{x}_i'\left(X'\hat{V}X\right)^{-1}\mathbf{x}_i}$
$LO_U = \mathbf{x}_i'\hat{\boldsymbol{\beta}} + z_{\alpha/2}\sqrt{\mathbf{x}_i'\left(X'\hat{V}X\right)^{-1}\mathbf{x}_i}$

We can transform this (using the inverse of the log link) to obtain a $(1-\alpha) \times 100\%$ Confidence Interval for $\lambda_i$:
$\hat{\lambda}_{iL} = e^{LO_L}$
$\hat{\lambda}_{iU} = e^{LO_U}$
Prediction Interval for an Individual Response in Poisson Regression

We can also obtain a $(1-\alpha) \times 100\%$ Prediction Interval for a new observation $y_i$:
$\hat{\lambda}_i \pm z_{\alpha/2}\sqrt{\hat{\lambda}_i\left(1 + \mathbf{x}_i'\left(X'\hat{V}X\right)^{-1}\mathbf{x}_i\right)}$
Goodness of Fit in Poisson Regression

As with logistic regression, we can use a Pearson chi-squared test to assess goodness of fit in the Poisson regression model

In this case the test statistic is:
$\chi^2 = \sum_{i=1}^n \dfrac{(y_i - \hat{\lambda}_i)^2}{\hat{\lambda}_i} \sim \chi^2_{n-p}$

We can also use the deviance as a measure of goodness of fit or for model selection:
$D(\hat{\boldsymbol{\beta}}) = 2\sum_{i=1}^n y_i \ln\left(\dfrac{y_i}{\hat{\lambda}_i}\right)$

We would like the deviance to be as low as possible, so when choosing between competing models we would prefer the model with the smallest deviance

Overdispersion in Poisson Regression

Overdispersion is a violation of the model assumption that the dependent variable follows a Poisson distribution

In particular, overdispersion means $\mathrm{Var}(Y_i) > \lambda_i$, whereas it should be that $\mathrm{Var}(Y_i) = \lambda_i$

If we have count data with overdispersion, an alternative model to use is the Negative Binomial Regression Model

Through some clever transformations it can be shown that if $Y$ is a negative binomially distributed random variable and $E(Y) = \mu$, then $\mathrm{Var}(Y) = \mu + k\mu^2$, where $k > 0$ is a transformed parameter called the dispersion parameter

What this means is that the variance of a negative binomial distributed random variable is larger than the variance of a Poisson distributed random variable by the amount $k\mu^2$

Thus we can use the negative binomial distribution to model count data which is overdispersed

The model equation is exactly the same, and so is the interpretation of the parameters

We can run a negative binomial regression model in SAS and test the null hypothesis $H_0: k = 0$ to check for overdispersion

If we reject $H_0$, this means there is overdispersion; the negative binomial model will be better

If we do not reject $H_0$, this means there is no overdispersion; the Poisson model is fine
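One informal way to look at overdispersion in Python is sketched below; it assumes the statsmodels interfaces named in the comments behave as described (the notes carry out the formal test of the dispersion parameter in SAS).

```python
import statsmodels.api as sm

def check_overdispersion(y, X):
    """Informal overdispersion check plus a negative binomial dispersion estimate."""
    pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    # Pearson chi-square divided by its degrees of freedom: values well above 1
    # suggest Var(Y_i) > lambda_i, i.e. overdispersion
    dispersion = pois.pearson_chi2 / pois.df_resid
    # The negative binomial count model also estimates the dispersion parameter
    # (its last fitted parameter, labelled 'alpha' in statsmodels)
    nb = sm.NegativeBinomial(y, X).fit(disp=0)
    return dispersion, nb.params[-1]
```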
Negative Binomial Regression: Example
a
a
a
a