Chapter 10 Simple Linear Regression and Correlation
Linear Regression
Methods for studying the relationship of two or more
quantitative variables
Examples:
Predict salary from education and years of experience
Predict sales from the amount of advertising expenditures
Predict vocabulary size from the age and amount of education of parents
Variables:
Response/outcome/dependent variable
Predictor/explanatory/independent variable
Relationships between the response and predictor variables
Functional or mathematical relation:
$Y = f(X)$ (deterministic)
Structural or statistical relation:
$Y = f(X) + \text{error}$ (stochastic/probabilistic)
Goals:
1) What is a reasonable model?
(a) $f(X)$
(b) errors
2) When $f(X)$ has unknown parameters, estimate the parameters
3) Predict $Y$ at new $X$
Simple Linear Regression (SLR)
Basic model:
$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \dots, n$
$Y_i$: the response/dependent variable
$X_i$: the predictor/explanatory/independent variable
$y_i$: the observed value of $Y_i$
$x_i$: the value of $X_i$, treated as a fixed quantity (or conditioned upon)
$\epsilon_i$: the random error, typically assumed $E(\epsilon_i) = 0$ and $\operatorname{Var}(\epsilon_i) = \sigma^2$, and usually assumed normally distributed
Key assumptions (to be checked later):
Linear relationship
Independent (uncorrelated) errors
Constant variance errors
Normally distributed errors
The SLR model can also be written as
$Y_i \mid X_i = x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$
The mean of $Y$ given $X = x$ (known as the conditional mean) is a linear function of $x$ given by $\beta_0 + \beta_1 x$
$\beta_0$ is the conditional mean when $x = 0$
If we replace $x$ by $x - x_0$, then $\beta_0$ is interpreted as the conditional mean when $x = x_0$
$\beta_1$ is the slope, i.e. the change in the mean of $Y$ per unit change in $x$
$\sigma^2$ is the variance of the responses about the mean
The relationship is described by the true regression line
$E(Y \mid x) = \beta_0 + \beta_1 x$
The model is called linear not because it is linear in $x$, but rather because it is linear in the parameters $\beta_0$ and $\beta_1$
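The interpretation of the parameters can be illustrated with a small simulation. The sketch below is only an illustration: the parameter values (beta0 = 2, beta1 = 0.5, sigma = 1), the sample size, and all object names are assumptions chosen for this example and do not come from the crime rate data.

```r
# Simulate from Y_i = beta0 + beta1 * x_i + eps_i (all values are hypothetical)
set.seed(1)
beta0 <- 2; beta1 <- 0.5; sigma <- 1
n <- 200
x <- runif(n, 0, 10)                      # predictor values, treated as fixed
cond_mean <- beta0 + beta1 * x            # E(Y | x): the true regression line
y <- cond_mean + rnorm(n, mean = 0, sd = sigma)  # add normal errors

# The conditional mean changes by beta1 per unit change in x
slope_check <- (beta0 + beta1 * 4) - (beta0 + beta1 * 3)

# A model with x^2 as the predictor is still a linear model,
# because it is linear in the parameters
fit_quad <- lm(y ~ I(x^2))
```

Here `fit_quad` has two coefficients, an intercept and a slope on x^2, so it can be fitted with the same least squares machinery even though it is not linear in x.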
Example: Crime Rate
A criminologist studying the relationship between level of education
and crime rate in medium-sized U.S. counties collected the following data
for a random sample of 84 counties; X is the percentage of individuals in the
county having at least a high-school diploma and Y is the crime rate (crimes
reported per 100,000 residents) last year.
[Figure: Scatter Plot for the Crime Rate Data. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents)]
Fitting the SLR model - least squares (LS) estimation
Choose $\hat{\beta}_0, \hat{\beta}_1$ to minimize the sum of squared deviations (vertical distances) of all data points to the fitted line:
$Q(\beta_0, \beta_1) = \sum_{i=1}^n [y_i - (\beta_0 + \beta_1 x_i)]^2$
$Q_{\min} = Q(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^n [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2 = \sum_{i=1}^n e_i^2$
Taking first partial derivatives and setting them equal to zero yields the normal equations:
$\beta_0 n + \beta_1 \sum_{i=1}^n x_i = \sum_{i=1}^n y_i$
$\beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i$
which are equivalent to
$\sum_{i=1}^n e_i = 0$ and $\sum_{i=1}^n x_i e_i = 0$
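The differentiation step behind the normal equations can be written out as follows. This is a sketch in which $Q$ denotes the sum of squared deviations being minimized; the symbol is an assumed label for that criterion.

```latex
\begin{align*}
Q(\beta_0,\beta_1) &= \sum_{i=1}^{n} \bigl[y_i - (\beta_0 + \beta_1 x_i)\bigr]^2 \\
\frac{\partial Q}{\partial \beta_0}
  &= -2\sum_{i=1}^{n} \bigl[y_i - (\beta_0 + \beta_1 x_i)\bigr] = 0
  \;\Longrightarrow\; n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \\
\frac{\partial Q}{\partial \beta_1}
  &= -2\sum_{i=1}^{n} x_i \bigl[y_i - (\beta_0 + \beta_1 x_i)\bigr] = 0
  \;\Longrightarrow\; \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
\end{align*}
```

Evaluated at the solution, the bracketed terms are the residuals, which is why the two equations say that the residuals sum to zero and are orthogonal to the $x_i$.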
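Least squares estimators:
$\hat{\beta}_0 = \dfrac{\left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n y_i\right) - \left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n x_i y_i\right)}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$
$\hat{\beta}_1 = \dfrac{n \sum_{i=1}^n x_i y_i - \left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n y_i\right)}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$
$S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_i y_i - \frac{1}{n}\left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n y_i\right)$
$S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - \frac{1}{n}\left(\sum_{i=1}^n x_i\right)^2$
$S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n y_i^2 - \frac{1}{n}\left(\sum_{i=1}^n y_i\right)^2$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$ are the best linear unbiased estimates of $\beta_0$ and $\beta_1$
The fitted values: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
Residuals: $e_i = \hat{\epsilon}_i = y_i - \hat{y}_i$
Least squares (LS) line: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
The LS line passes through $(\bar{x}, \bar{y})$, since $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$; this point is the centroid of the scatter plot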
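The formulas for the LS estimators can be checked numerically. The following is a minimal sketch on a made-up five-point data set (the x and y values are hypothetical and chosen only so that the arithmetic is easy to follow); it computes the estimates by hand and compares them with `lm()`.

```r
# Toy data (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Corrected sums of cross-products and squares
Sxy <- sum((x - mean(x)) * (y - mean(y)))   # 19.9
Sxx <- sum((x - mean(x))^2)                 # 10

# Least squares estimates
b1 <- Sxy / Sxx                 # slope: 1.99
b0 <- mean(y) - b1 * mean(x)    # intercept: 6.02 - 1.99 * 3 = 0.05

# Fitted values and residuals
y_hat <- b0 + b1 * x
e <- y - y_hat

# Same fit using R's built-in least squares routine
fit <- lm(y ~ x)
```

Up to floating-point error, `coef(fit)` agrees with `c(b0, b1)`, the residuals satisfy `sum(e) = 0` and `sum(x * e) = 0` (the normal equations), and the fitted line evaluated at `mean(x)` returns `mean(y)`.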
[Figure: Scatter Plot for the Crime Rate Data with the fitted LS line $\hat{y} = 20517.6 - 170.58x$ and the centroid $(\bar{x}, \bar{y})$ marked. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents)]
[Figure: Residuals plotted against observation index]
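Goodness of fit of the LS line
Residuals: $e_i = y_i - \hat{y}_i$
Error sum of squares (SSE): $SSE = \sum_{i=1}^n e_i^2$, the minimized sum of squared deviations
Compare with the SSE for the simplest model: $Y_i = \beta_0 + \epsilon_i$
For this model $\hat{\beta}_0 = \bar{y}$, and the minimized sum of squares is $\sum_{i=1}^n (y_i - \bar{y})^2 = S_{yy}$, referred to as the (corrected) total sum of squares (SST), which measures the variability of the $y_i$ around their mean
Then SST can be decomposed as
$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2$
SST = SSR + SSE
SSR: the regression sum of squares, which measures the variation in y that is accounted for by regression on x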
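The decomposition holds because the cross-product term vanishes. A short sketch of the argument, using only the two normal-equation identities $\sum e_i = 0$ and $\sum x_i e_i = 0$ with $e_i = y_i - \hat{y}_i$:

```latex
\begin{align*}
\sum_{i=1}^{n} (y_i - \bar{y})^2
  &= \sum_{i=1}^{n} \bigl[(\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)\bigr]^2 \\
  &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} e_i^2
     + 2\sum_{i=1}^{n} (\hat{y}_i - \bar{y})\, e_i, \\
\sum_{i=1}^{n} (\hat{y}_i - \bar{y})\, e_i
  &= (\hat{\beta}_0 - \bar{y}) \sum_{i=1}^{n} e_i
     + \hat{\beta}_1 \sum_{i=1}^{n} x_i e_i = 0 .
\end{align*}
```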
The coefficient of determination:
$r^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}, \quad 0 \le r^2 \le 1$
which represents the proportion of variation in y that is accounted for by regression on x.
Relationship to the sample correlation coefficient r:
$SSR = \hat{\beta}_1^2 S_{xx}$
$r^2 = \dfrac{SSR}{SST} = \dfrac{\hat{\beta}_1^2 S_{xx}}{S_{yy}} = \dfrac{S_{xy}^2}{S_{xx} S_{yy}}$
$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \dfrac{s_{xy}}{s_x s_y} = \hat{\beta}_1 \dfrac{s_x}{s_y}$
The sign of r is the same as the sign of $\hat{\beta}_1$.
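These identities can be verified numerically. The sketch below reuses a small made-up data set (hypothetical values, not the crime rate data) and compares the sums of squares with R's own `cor()` and `summary()` results.

```r
# Toy data (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)

# Sums of squares
SST <- sum((y - mean(y))^2)             # total
SSE <- sum(residuals(fit)^2)            # error
SSR <- sum((fitted(fit) - mean(y))^2)   # regression

# Coefficient of determination and sample correlation
r2 <- SSR / SST
r  <- cor(x, y)
b1 <- unname(coef(fit)["x"])

# r expressed through the slope and the sample standard deviations
r_from_slope <- b1 * sd(x) / sd(y)
```

Under these assumptions `r2` equals both `r^2` and `summary(fit)$r.squared`, and `r_from_slope` equals `r`, so the sign of the correlation follows the sign of the slope.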
Estimation of $\sigma^2$
A common unbiased estimator of $\sigma^2$ is given by
$\hat{\sigma}^2 = s^2 = \dfrac{\sum_{i=1}^n e_i^2}{n-2} = \dfrac{SSE}{n-2} = MSE$
MSE: Mean square error
The d.f. for $s^2$ is $n-2$ since 2 unknown parameters $\beta_0$ and $\beta_1$ are estimated from the data of size $n$.
Crime rate example continued:
Obtain the point estimates of the following:
(1) The difference in the mean crime rate for two counties whose high-school graduation rates differ by one percentage point;
(2) The mean crime rate last year in counties with high school graduation percentage X=80;
(3) The standard deviation $\sigma$ of the random error.
#read in the data set
>crime=read.table("crimerate.txt",header=FALSE)
>names(crime)=c("rate","percentage")
#scatter plot
>plot(crime$percentage,crime$rate,main="Scatter Plot for the Crime Rate Data",
xlab="Percentage of having at least high school diplomas",
ylab="Crime Rate (per 100K residents)",type="p",pch=16)
#fitting a SLR model using least squares
>g1=lm(rate~percentage,data=crime)
#adding the fitted LS line to the scatter plot
>abline(g1,col="red",lwd=2)
[Figure: Scatter Plot for the Crime Rate Data with the fitted LS line in red. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents)]
#LS estimation results
>summary(g1)
Call:
lm(formula = rate ~ percentage, data = crime)

Residuals:
    Min      1Q  Median      3Q     Max
-5278.3 -1757.5  -210.5  1575.3  6803.3

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared: 0.1703, Adjusted R-squared: 0.1602
F-statistic: 16.83 on 1 and 82 DF, p-value: 9.571e-05
>summary(g1)$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.5999 3277.64269 6.259865 1.672906e-08
percentage -170.5752 41.57433 -4.102897 9.571396e-05
>predict(g1,data.frame(percentage=80),se.fit=TRUE)
$fit
1
6871.585
$se.fit
[1] 263.6425
$df
[1] 82
$residual.scale
[1] 2356.292
>deviance(g1) #SSE
[1] 455273165
>df.residual(g1) #df for SSE
[1] 82
>sqrt(deviance(g1)/df.residual(g1)) #estimate for sigma
[1] 2356.292
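The three point estimates requested for the crime rate example can be read off this output. The worked arithmetic below is a sketch that uses only the printed values, so results agree with the R output up to rounding.

```latex
\begin{align*}
\text{(1)}\quad & \hat{\beta}_1 = -170.58
  \quad \text{(estimated change in mean crime rate per one percentage point)} \\
\text{(2)}\quad & \hat{\beta}_0 + \hat{\beta}_1 (80)
  = 20517.60 - 170.5752 \times 80 \approx 6871.58 \\
\text{(3)}\quad & \hat{\sigma} = s = \sqrt{\frac{SSE}{n-2}}
  = \sqrt{\frac{455273165}{82}} \approx 2356.29
\end{align*}
```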
>residuals(g1)
1 2 3 4 5 6 7 8
591.96401 1648.56552 1660.99033 1518.99033 568.44147 -159.63749 -2357.48712 -828.00967
9 10 11 12 13 14 15 16
97.96401 1401.56552 -1233.46080 285.56552 2426.26477 -1594.28410 -1493.43448 -2615.16004

81 82 83 84
-1363.25778 2533.01666 621.14071 28.11439
>summary(g1)$residuals
#do the same as residuals(g1)
>sum(residuals(g1)^2) #SSE
[1] 455273165
>plot(residuals(g1),pch=16,main="Scatter Plot of Residuals",
ylab="Residuals",xlab="")
>abline(h=0,lty=2)
[Figure: Scatter Plot of Residuals. x-axis: observation index; y-axis: Residuals]
>fitted.values(g1)
1 2 3 4 5 6 7 8 9
7895.036 6530.434 6701.010 6701.010 5677.559 9259.637 8918.487 6701.010 7895.036
10 11 12 13 14 15 16 17 18
6530.434 7724.461 6530.434 7212.735 6189.284 6530.434 7042.160 7212.735 8065.611

82 83 84
5506.983 6359.859 7553.886
>plot(crime$percentage,fitted.values(g1),pch=16,xlab="Percentage",ylab="Fitted Values")
>abline(g1,lty=2)
>plot(fitted.values(g1),residuals(g1),main="",ylab="Residuals",xlab=expression(hat(y)),pch=16)
>plot(crime$percentage,residuals(g1),main="",ylab="Residuals",xlab="Percentage",pch=16)
[Figure: Fitted Values against Percentage, with the fitted line]
[Figure: Residuals against fitted values $\hat{y}$]
[Figure: Residuals against Percentage]
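Statistical Inference for Simple Linear Regression
Inference on $\beta_0$ and $\beta_1$
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^n (x_i - \bar{x}) y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}$
$E(\hat{\beta}_1) = \beta_1, \quad \operatorname{Var}(\hat{\beta}_1) = \dfrac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{\sigma^2}{S_{xx}}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \quad E(\hat{\beta}_0) = \beta_0$
$\operatorname{Var}(\hat{\beta}_0) = \sigma^2 \left[\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right] = \dfrac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{\sigma^2 \sum_{i=1}^n x_i^2}{n S_{xx}}$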
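$\dfrac{\hat{\beta}_0 - \beta_0}{SD(\hat{\beta}_0)} \sim N(0,1)$ and $\dfrac{\hat{\beta}_1 - \beta_1}{SD(\hat{\beta}_1)} \sim N(0,1)$
$\dfrac{(n-2) S^2}{\sigma^2} = \dfrac{SSE}{\sigma^2} \sim \chi^2_{n-2}$
$S^2$ is distributed independently of $\hat{\beta}_0$ and $\hat{\beta}_1$
$SE(\hat{\beta}_0) = s \sqrt{\dfrac{\sum_{i=1}^n x_i^2}{n S_{xx}}}$ and $SE(\hat{\beta}_1) = \dfrac{s}{\sqrt{S_{xx}}}$
$\dfrac{\hat{\beta}_0 - \beta_0}{SE(\hat{\beta}_0)} \sim t_{n-2}$ and $\dfrac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$
$100(1-\alpha)\%$ CIs on $\beta_0$ and $\beta_1$ are given by
$\hat{\beta}_0 \pm t_{n-2,\alpha/2}\, SE(\hat{\beta}_0)$
$\hat{\beta}_1 \pm t_{n-2,\alpha/2}\, SE(\hat{\beta}_1)$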
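The standard errors and confidence intervals can be computed directly from these formulas and compared with R's built-in results. The following is a minimal sketch on simulated data; the true parameter values, sample size, and seed are arbitrary assumptions made for the illustration.

```r
# Simulated data (hypothetical parameter values, for illustration only)
set.seed(10)
n <- 30
x <- runif(n, 0, 10)
y <- 5 + 2 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

# Ingredients of the standard error formulas
Sxx <- sum((x - mean(x))^2)
s   <- sqrt(sum(residuals(fit)^2) / (n - 2))   # estimate of sigma

se_b0 <- s * sqrt(sum(x^2) / (n * Sxx))        # SE of the intercept
se_b1 <- s / sqrt(Sxx)                         # SE of the slope

# 95% confidence intervals: estimate +/- t_{n-2, alpha/2} * SE
alpha <- 0.05
tcrit <- qt(1 - alpha / 2, df = n - 2)
b <- unname(coef(fit))
ci_by_hand <- rbind(c(b[1] - tcrit * se_b0, b[1] + tcrit * se_b0),
                    c(b[2] - tcrit * se_b1, b[2] + tcrit * se_b1))
```

The rows of `ci_by_hand` should match `confint(fit)`, and `se_b0`, `se_b1` should match the "Std. Error" column of `summary(fit)$coefficients`.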
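Hypothesis tests:
$H_0: \beta_1 = \beta_1^0$ vs. $H_1: \beta_1 \ne \beta_1^0$
Use the t-test:
$t = \dfrac{\hat{\beta}_1 - \beta_1^0}{SE(\hat{\beta}_1)} \sim t_{n-2}$ when $H_0$ is true
Reject $H_0$ at level $\alpha$ if
$|t| = \dfrac{|\hat{\beta}_1 - \beta_1^0|}{SE(\hat{\beta}_1)} > t_{n-2,\alpha/2}$
or, equivalently, if p-value $= 2P(t_{n-2} \ge |t|) < \alpha$
Particularly, for testing if there is a linear relationship,
$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$
Reject $H_0$ at level $\alpha$ if
$|t| = \dfrac{|\hat{\beta}_1|}{SE(\hat{\beta}_1)} > t_{n-2,\alpha/2}$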
Crime Rate Example continued:
(1) Test linear relationship at $\alpha = 0.05$
>summary(g1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared: 0.1703, Adjusted R-squared: 0.1602
F-statistic: 16.83 on 1 and 82 DF, p-value: 9.571e-05
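Reading the test off this output, as a worked sketch: the estimate and standard error are taken from the "percentage" row, while the critical value $t_{82,\,0.025} \approx 1.989$ is a standard t-table value that is not printed in the output.

```latex
\begin{align*}
t &= \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{-170.58}{41.57} \approx -4.103, \\
|t| &= 4.103 > t_{82,\,0.025} \approx 1.989,
\qquad \text{p-value} = 9.57 \times 10^{-5} < 0.05 .
\end{align*}
```

So $H_0: \beta_1 = 0$ is rejected at the 5% level: there is a statistically significant (negative) linear relationship between graduation percentage and crime rate.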
(2) Calculate a 95% CI on the change in the mean crime rate for
every one percentage point increase in high-school graduation
rate.
>confint(g1)
2.5 % 97.5 %
(Intercept) 13997.3245 27037.87538
percentage -253.2798 -87.87061
>#we can specify a particular parameter
>#as well as change confidence level
>confint(g1,"percentage",level=0.9)
                 5 %      95 %
percentage -239.7403 -101.4101
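The 95% interval for the slope can be reproduced from the coefficient table. In this sketch the estimate and standard error come from the R output, and $t_{82,\,0.025} \approx 1.9893$ is a standard t-table value.

```latex
\begin{align*}
\hat{\beta}_1 \pm t_{82,\,0.025}\, SE(\hat{\beta}_1)
  &= -170.5752 \pm 1.9893 \times 41.57433 \\
  &\approx -170.58 \pm 82.70
   = (-253.28,\; -87.87) .
\end{align*}
```

The interval lies entirely below zero, which agrees with rejecting $H_0: \beta_1 = 0$ at the 5% level.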
Analysis of Variance (ANOVA) for SLR
ANOVA is a statistical technique to decompose the total variability in the $y_i$'s into separate variance components associated with specific sources
Decomposition of the variability and degrees of freedom (d.f.)
$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2$
SST = SSR + SSE
d.f.: $n-1 = 1 + (n-2)$
A mean square is defined by a sum of squares divided by its d.f.
Mean square regression: $MSR = SSR/1$
Mean square error: $MSE = SSE/(n-2)$
Since
$\dfrac{MSR}{MSE} = \dfrac{SSR}{s^2} = \dfrac{\hat{\beta}_1^2 S_{xx}}{s^2} = \left(\dfrac{\hat{\beta}_1}{s/\sqrt{S_{xx}}}\right)^2 = \left(\dfrac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\right)^2 = t^2$
$F = \dfrac{MSR}{MSE} = t^2 \sim F_{1,n-2}$
we can test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$ at level $\alpha$ by rejecting $H_0$ if $F > f_{1,n-2,\alpha}$ (equivalent to $|t| > t_{n-2,\alpha/2}$)
Analysis of variance (ANOVA) table
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F statistic
(Source)              (SS)             (d.f.)               (MS)
Regression            SSR              1                    MSR = SSR/1         F = MSR/MSE
Error                 SSE              n - 2                MSE = SSE/(n - 2)
Total                 SST              n - 1
Crime Rate Example continued:
- Test the significance of the linear relationship between the crime rate and the high-school graduation rate at $\alpha = 0.05$
>anova(g1)
Analysis of Variance Table

Response: rate
           Df    Sum Sq  Mean Sq F value    Pr(>F)
percentage  1  93462942 93462942  16.834 9.571e-05 ***
Residuals  82 455273165  5552112
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
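In the crime rate output the F value 16.834 is the square of the slope's t value, $(-4.103)^2 \approx 16.83$, and the two p-values coincide. The sketch below checks the same equivalence on simulated data; the data-generating values and the seed are arbitrary assumptions for the illustration.

```r
# Simulated data (hypothetical parameter values, for illustration only)
set.seed(2)
x <- seq(1, 20)
y <- 3 + 0.8 * x + rnorm(20, sd = 2)
fit <- lm(y ~ x)

# ANOVA table quantities
tab <- anova(fit)
MSR <- tab[["Mean Sq"]][1]
MSE <- tab[["Mean Sq"]][2]
F_stat <- tab[["F value"]][1]
p_F <- tab[["Pr(>F)"]][1]

# t statistic and p-value for the slope
coefs <- summary(fit)$coefficients
t_stat <- coefs["x", "t value"]
p_t <- coefs["x", "Pr(>|t|)"]

# Critical values at alpha = 0.05 are related in the same way
f_crit <- qf(0.95, df1 = 1, df2 = 18)
t_crit <- qt(0.975, df = 18)
```

Here `F_stat` equals `MSR / MSE` and `t_stat^2`, `p_F` equals `p_t`, and `f_crit` equals `t_crit^2`, so the F-test and the two-sided t-test always reach the same decision in SLR.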
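Prediction of Future Observations
To predict the value of a future response $Y^*$ at a specified value $x^*$
Use a confidence interval to estimate the fixed unknown mean of $Y^*$, denoted by $\mu^* = E(Y^*) = \beta_0 + \beta_1 x^*$
$\hat{\mu}^* = \hat{\beta}_0 + \hat{\beta}_1 x^* = \bar{y} + \hat{\beta}_1 (x^* - \bar{x})$
$\operatorname{Var}(\hat{\mu}^*) = \sigma^2 \left[\dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{S_{xx}}\right]$
$\hat{\mu}^* \pm t_{n-2,\alpha/2}\, s \sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{S_{xx}}}$
Use a prediction interval to predict the value of the r.v. $Y^*$
$Y^* - \hat{\mu}^* \sim N\left(0,\ \sigma^2 \left[1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{S_{xx}}\right]\right)$
$\hat{\mu}^* \pm t_{n-2,\alpha/2}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{S_{xx}}}$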
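Both intervals can be computed from the formulas and compared with `predict()`. The following is a minimal sketch on simulated data; the parameter values, the seed, and the chosen point `x_star` are assumptions made for the illustration.

```r
# Simulated data (hypothetical parameter values, for illustration only)
set.seed(3)
n <- 25
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

x_star <- 4                                   # point at which to predict
Sxx <- sum((x - mean(x))^2)
s <- sqrt(deviance(fit) / df.residual(fit))   # estimate of sigma
mu_hat <- unname(coef(fit)[1] + coef(fit)[2] * x_star)
tcrit <- qt(0.975, df = n - 2)

# Half-widths of the 95% CI for the mean and the 95% PI for a new response
half_ci <- tcrit * s * sqrt(1 / n + (x_star - mean(x))^2 / Sxx)
half_pi <- tcrit * s * sqrt(1 + 1 / n + (x_star - mean(x))^2 / Sxx)

# Same intervals from predict(); each call returns a matrix
# with columns fit, lwr, upr
new_x <- data.frame(x = x_star)
ci <- predict(fit, new_x, interval = "confidence")
pred_int <- predict(fit, new_x, interval = "prediction")
```

The prediction interval is always wider than the confidence interval at the same point because of the extra "1 +" term, which accounts for the variability of the new observation itself.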
Crime rate example continued:
(a) Calculate 95% CI for the average crime rate in counties with
80% high-school graduation rate;
(b) Calculate 95% PI for the crime rate of a future selected county
with 80% high-school graduation rate.
>predict(g1,data.frame(percentage=80), interval="confidence")
$fit
fit lwr upr
1 6871.585 6347.116 7396.054
>predict(g1,data.frame(percentage=80), interval="prediction")
$fit
fit lwr upr
1 6871.585 2154.92 11588.25
>grid=seq(60,90,1)
>conf=predict(g1,data.frame(percentage=grid),interval="confidence")
>pred=predict(g1,data.frame(percentage=grid),interval="prediction")
#fitted line (red) with prediction limits (green)
>matplot(grid,pred,lty=c(1,2,2),col=c("red","green","green"),type="l",lwd=2,
main="CI vs PI",xlab="Percentage of having at least high school diplomas",
ylab="Crime Rate (per 100K residents)")
#adding the confidence limits (blue)
>matplot(grid,conf[,2:3],lty=c(2,2),
col=c("blue","blue"),type="l",add=TRUE,lwd=2)
[Figure: CI vs PI. Fitted line with confidence and prediction limits, with $\bar{x}$ marked on the x-axis. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents)]
Both CI and PI have shortest widths when $x^* = \bar{x}$;
Predicting beyond the range of observed data (extrapolation) is risky and should generally be avoided