Linear Regression Tutorial Solutions

Tutorial 2 Solutions
Gerald Cheang
School of ITMS
University of South Australia
1. First remove the 3 observations identified as outliers. Then run the
lm() linear model function again, plot the residuals against the fitted
values, and check a normal QQ plot of the residuals.
> data(cars)
> cars2 <- cars[-c(23, 35, 49),]
> fit <- lm(cars2$dist ~ cars2$speed)
> fit
Call:
lm(formula = cars2$dist ~ cars2$speed)
Coefficients:
(Intercept)  cars2$speed
    -15.137        3.608
> summary(fit)
Call:
lm(formula = cars2$dist ~ cars2$speed)
Residuals:
    Min      1Q  Median      3Q     Max
-25.032  -7.686  -1.032   6.576  26.185
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.1371     5.3053  -2.853  0.00652 **
cars2$speed   3.6085     0.3302  10.928    3e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.84 on 45 degrees of freedom

Multiple R-squared: 0.7263, Adjusted R-squared: 0.7202
F-statistic: 119.4 on 1 and 45 DF, p-value: 3.003e-14
The p-values for the intercept and the slope are both below 0.01, so both
coefficients are significant. We can add the fitted line to a scatter plot
of the data (see your own R plot window).
> plot(cars2)
> lines(cars2$speed, fit$fitted.values)
> qqnorm(fit$residuals)
The plot looks a bit better now, and the QQ plot also looks reasonable.
Next we plot the residuals against the fitted values and draw reference
lines at 0 and at ±2 standard deviations of the residuals.
> plot(fit$fitted,fit$residual)
> abline(0,0)
> abline(23.16,0)
> abline(-23.16,0)
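The reference lines above use a typed-in value. As a minimal sketch, assuming the same cars2 subset, the band can instead be computed from the fitted model so that it always matches the current fit. The names fit2 and band, and the use of the data = argument, are choices made here for illustration and are not part of the original solution.

```r
# Sketch: draw the +/- 2 sd reference lines from the fit itself.
cars2 <- cars[-c(23, 35, 49), ]          # same rows removed as above
fit2  <- lm(dist ~ speed, data = cars2)  # same model, written with data =

band <- 2 * sd(resid(fit2))              # half-width of the reference band

plot(fitted(fit2), resid(fit2),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = c(-band, 0, band), lty = c(2, 1, 2))  # dashed band, solid zero line
```

Using abline(h = ...) draws horizontal lines directly, which is equivalent to abline(a, 0) with intercept a and slope 0.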
It looks like there might be another 3 or 4 outliers. You can remove
them and refit to see whether the residuals are then normally distributed
with no further outliers. However, be aware that if we keep eliminating
observations, we run the risk of not having sufficient data.
Perhaps we could instead start again with the original data and fit
a model without the intercept term, since a car that is not moving has a
braking distance of zero, or try fitting
$D = k S^{\alpha}$,
where $D$ is distance, $S$ is speed and $\alpha$ is a power coefficient that
is most likely slightly greater than 1.
Taking logs gives $\log D = \log k + \alpha \log S = c + \alpha \log S$, where $c = \log k$.
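To make the link between the log-scale fit and the power law concrete, here is a minimal sketch that fits the model on the log scale and back-transforms the intercept to recover k. The names fit_log, c_hat, alpha_hat and k_hat, and the example speed of 15, are illustrative assumptions. Note that exponentiating a log-scale prediction gives a typical (median-type) value rather than the mean on the original scale.

```r
# Sketch: fit D = k * S^alpha on the log scale, then back-transform.
fit_log <- lm(log(dist) ~ log(speed), data = cars)

c_hat     <- unname(coef(fit_log)[1])  # estimate of c = log k
alpha_hat <- unname(coef(fit_log)[2])  # estimate of alpha
k_hat     <- exp(c_hat)                # back-transform the intercept to k

# Predicted braking distance on the original scale for a hypothetical speed
speed_new <- 15
k_hat * speed_new^alpha_hat
```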
> plot(log(cars$dist), log(cars$speed))
> fit <- lm(log(cars$dist) ~ log(cars$speed))
> fit
Call:
lm(formula = log(cars$dist) ~ log(cars$speed))
Coefficients:
    (Intercept)  log(cars$speed)
        -0.7297           1.6024
> summary(fit)
Call:
lm(formula = log(cars$dist) ~ log(cars$speed))
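Residuals:
     Min       1Q   Median       3Q      Max
-1.00215 -0.24578 -0.02898  0.20717  0.88289
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      -0.7297     0.3758  -1.941   0.0581 .
log(cars$speed)   1.6024     0.1395  11.484 2.26e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4053 on 48 degrees of freedom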
> plot(fit$fitted, fit$resid)

> 2*sd(fit$resid)
[1] 0.8022207
> abline(-0.8022207,0)
> abline(0.8022207,0)
> fit$resid
1 2 3 4 5 6
-0.79856996 0.81086795 -1.00214620 0.70260189 0.17017863 -0.48855950
7 8 9 10 11 12
-0.06960160 0.29812318 0.56638717 -0.27948421 0.21950696 -0.61306648
13 14 15 16 17 18
-0.25639154 -0.07406998 0.08008070 -0.12228701 0.14597698 0.14597698
19 20 21 22 23 24
0.44825785 -0.24103697 0.08438543 0.59521105 0.88289313 -0.61395481
25 26 27 28 29 30
-0.35159054 0.37929697 -0.24736713 -0.02422358 -0.34451150 -0.12136794
31 32 33 34 35 36
0.10177561 -0.16416792 0.12351415 0.42889580 0.52897926 -0.40495544
37 38 39 40 41 42
-0.15983298 0.23103333 -0.60493040 -0.19946529 -0.11942258 -0.04531461
43 44 45 46 47 48
0.08821678 -0.03373575 -0.30563556 -0.11432152 0.15897182 0.16978273
49 50
0.42467498 0.01442169
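Rather than scanning the printed residuals by eye, the observations outside the ±2 standard deviation band can be picked out directly. This is a minimal sketch under the assumption that the same log-log model is refitted with the data = argument; the names fit_log, r and outside are illustrative only.

```r
# Sketch: list the observations whose residuals fall outside +/- 2 sd.
fit_log <- lm(log(dist) ~ log(speed), data = cars)

r       <- resid(fit_log)
outside <- which(abs(r) > 2 * sd(r))  # row numbers of flagged observations
outside
```

The flagged row numbers can then be passed to negative indexing, for example cars[-outside, ], to build the reduced data set.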
> cars3 <- cars[ -c(2,3,23),]
> plot(log(cars3$dist), log(cars3$speed))
> fit <- lm(log(cars3$dist) ~ log(cars3$speed))
> summary(fit)
Call:
lm(formula = log(cars3$dist) ~ log(cars3$speed))
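Residuals:
     Min       1Q   Median       3Q      Max
-0.70597 -0.22556 -0.03112  0.19540  0.76215
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.9042     0.3798  -2.381   0.0216 *
log(cars3$speed)   1.6615     0.1392  11.939 1.52e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3467 on 45 degrees of freedom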
> plot(fit$fitted, fit$resid)
> 2*sd(fit$resid)
[1] 0.6859106
> abline(0.6859106,0)
> abline(-0.6859106,0)
> cars4 <- cars3[-c(1,2),]
> fit <- lm(log(cars4$dist) ~ log(cars4$speed))
> summary(fit)
Call:
lm(formula = log(cars4$dist) ~ log(cars4$speed))
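Residuals:
     Min       1Q   Median       3Q      Max
-0.60863 -0.23150 -0.01027  0.18632  0.60475
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.6520     0.4488  -1.453    0.154
log(cars4$speed)   1.5693     0.1623   9.672 2.35e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3162 on 43 degrees of freedom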
> qqnorm(fit$resid)
> 2*sd(fit$resid)
[1] 0.6251462
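So in the end, the appropriate model is
$\log D = 1.5693 \log S$,
because the intercept term is not significant. Strictly, the slope 1.5693
was estimated with the intercept still in the model, so it should be
re-estimated after the intercept is dropped. Written with error terms, the
model is
$\log D_i = 1.5693 \log S_i + z_i$,
where $z_i$ is the residual error for the $i$th observation.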
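Because the intercept is dropped from the final model, the slope is best re-estimated without it. The following is a minimal sketch, assuming the same 45 observations as cars4; the names fit_int and fit_noint are illustrative, and the "- 1" term is the standard R formula syntax for removing the intercept.

```r
# Sketch: compare the slope with and without the intercept term.
cars3 <- cars[-c(2, 3, 23), ]
cars4 <- cars3[-c(1, 2), ]

fit_int   <- lm(log(dist) ~ log(speed),     data = cars4)  # with intercept
fit_noint <- lm(log(dist) ~ log(speed) - 1, data = cars4)  # through the origin

coef(fit_int)    # intercept and slope
coef(fit_noint)  # slope only, re-estimated without the intercept
```

The two slopes will generally differ, so the value reported for the no-intercept model should come from fit_noint.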
2. We try to fit log Volume versus log Girth.
> attach(trees)
> treesfit <- lm(log(Volume) ~ log(Girth))
> summary(treesfit)
Call:
lm(formula = log(Volume) ~ log(Girth))
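Residuals:
      Min        1Q    Median        3Q       Max
-0.205999 -0.068702  0.001011  0.072585  0.247963
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.35332    0.23066  -10.20 4.18e-11 ***
log(Girth)   2.19997    0.08983   24.49  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.115 on 29 degrees of freedom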
F-statistic: 599.7 on 1 and 29 DF, p-value: < 2.2e-16
> 2*sd(treesfit$resid)
[1] 0.2260512
> qqnorm(treesfit$resid)
> plot(treesfit$fitted,treesfit$resid)
> abline(0.226,0)
> abline(-0.226,0)
The QQ plot of the residuals suggests that there might be up to 2 outliers
at the top right-hand corner. The plot of residuals against fitted values,
with reference lines drawn, shows that there is only 1 outlier. We remove
that observation, which is observation number 17, and do the regression
again.
> trees2 <- trees[-17, ]

> trees2fit <- lm(log(trees2$Volume) ~ log(trees2$Girth))
> summary(trees2fit)
Call:
lm(formula = log(trees2$Volume) ~ log(trees2$Girth))
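Residuals:
      Min        1Q    Median        3Q       Max
-0.197735 -0.062881 -0.001503  0.078311  0.213452
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        -2.3616     0.2144  -11.01  1.1e-11 ***
log(trees2$Girth)   2.2000     0.0835   26.35  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1069 on 28 degrees of freedom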
F-statistic: 694.1 on 1 and 28 DF, p-value: < 2.2e-16
> 2*sd(trees2fit$resid)
[1] 0.2099947
> qqnorm(trees2fit$resid)
> plot(trees2fit$fitted,trees2fit$resid)
> abline(0.21,0)
> abline(-0.21,0)
There is still 1 outlier, which appears to sit just on the top reference
line. We could remove it and repeat the analysis, but beware that we would
then have fewer than 30 observations. Alternatively, we can take the
results as they are, since the outlier is so marginal.
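Judging whether a QQ plot is "reasonably straight" is easier with a reference line. As a minimal sketch, assuming the same reduced trees data, qqline() can be added after qqnorm(); the use of the data = argument and resid() here is a stylistic choice, not the original solution's.

```r
# Sketch: normal QQ plot of the residuals with a reference line.
trees2    <- trees[-17, ]
trees2fit <- lm(log(Volume) ~ log(Girth), data = trees2)

qqnorm(resid(trees2fit))
qqline(resid(trees2fit))  # by default drawn through the first and third quartiles
```

Points that lie well away from the line at either end are the candidates for outliers discussed above.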
3. First plot area against perimeter.
> plot(rock$peri, rock$area)
It does look like there could be a linear relationship.
> fit <- lm(rock$area ~ rock$peri)
> summary(fit)
Call:
lm(formula = rock$area ~ rock$peri)
Residuals:
    Min      1Q  Median      3Q     Max
-2511.9 -1282.7   -80.1   790.8  4375.4
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3052.0181   476.8561   6.400 7.26e-08 ***
rock$peri      1.5419     0.1572   9.808 7.51e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1543 on 46 degrees of freedom

> plot(fit$fitted, fit$resid)
> 2*sd(fit$resid)
[1] 3052.909
> abline(3052.909,0)
> fit$resid
Both the intercept and the slope are significant. There is one outlier,
the 48th observation. Redo the whole analysis with the outlier removed.
> rock2 <- rock[-48, ]
> plot(rock2$peri, rock2$area)

> fit <- lm(rock2$area ~ rock2$peri)
> summary(fit)
Call:
lm(formula = rock2$area ~ rock2$peri)
Residuals:
     Min       1Q   Median       3Q      Max
-2283.61 -1195.62   -94.24   911.06  2752.52
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2806.3167   443.1323   6.333 9.99e-08 ***
rock2$peri     1.5983     0.1449  11.029 2.22e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1412 on 45 degrees of freedom

Both the intercept and the slope are significant. The $R^2$ value is also higher.
> plot(fit$fitted, fit$resid)
> 2*sd(fit$resid)
[1] 2792.234
> abline(2792.234,0)
> qqnorm(fit$resid)
Now all residuals are within ±2 standard deviations and the normal QQ
plot looks reasonably straight. So we take the model
$A_i = 2806.3167 + 1.5983 P_i + z_i$,
where the $z_i$ are iid $N(0, \sigma^2)$, with $\sigma$ estimated by the sample
standard deviation of the residuals, 1396.117.
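Once the model is settled, it can be used for prediction. This is a minimal sketch: predict() with newdata only works when the model is fitted with a formula on column names plus the data = argument, so the fit is rewritten that way here. The names fit_rock, new_obs and sigma_hat, and the perimeter values 2000 and 3000, are hypothetical choices for illustration.

```r
# Sketch: predict area for new perimeter values from the reduced rock data.
rock2    <- rock[-48, ]
fit_rock <- lm(area ~ peri, data = rock2)

new_obs  <- data.frame(peri = c(2000, 3000))  # hypothetical perimeters
pred     <- predict(fit_rock, newdata = new_obs)
pred

sigma_hat <- sd(resid(fit_rock))  # the residual sd used for the +/- 2 sd band
sigma_hat
```

A formula written as rock2$area ~ rock2$peri would ignore newdata and return the fitted values for the original rows instead.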
4. We first try fitting fuel consumption (mpg) as a linear function of all
the other variables of interest (displacement, horsepower, rear axle ratio
and weight). Then we eliminate those that are not significant.
> plot(mtcars)
> fit <- lm(mtcars$mpg ~ mtcars$disp + mtcars$hp + mtcars$drat + mtcars$wt)
> summary(fit)
Call:
lm(formula = mtcars$mpg ~ mtcars$disp + mtcars$hp + mtcars$drat +
mtcars$wt)
Residuals:
    Min      1Q  Median      3Q     Max
-3.5077 -1.9052 -0.5057  0.9821  5.6883
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.148738   6.293588   4.631  8.2e-05 ***
mtcars$disp  0.003815   0.010805   0.353  0.72675
mtcars$hp   -0.034784   0.011597  -2.999  0.00576 **
mtcars$drat  1.768049   1.319779   1.340  0.19153
mtcars$wt   -3.479668   1.078371  -3.227  0.00327 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.602 on 27 degrees of freedom
The intercept is significant, and weight and horsepower are significant
explanatory variables for fuel consumption; displacement and rear axle
ratio are not. So we refit the model with only horsepower and weight.
> fit <- lm(mtcars$mpg ~ mtcars$hp + mtcars$wt)

> summary(fit)
Call:
lm(formula = mtcars$mpg ~ mtcars$hp + mtcars$wt)
Residuals:
   Min     1Q Median     3Q    Max
-3.941 -1.600 -0.182  1.050  5.854
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
mtcars$hp   -0.03177    0.00903  -3.519  0.00145 **
mtcars$wt   -3.87783    0.63273  -6.129 1.12e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.593 on 29 degrees of freedom
The adjusted $R^2$ value is also slightly higher than that of the previous model.
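The adjusted R-squared values are not shown in the printed output, but they can be pulled out of the summary objects and compared side by side. This is a minimal sketch; the names fit_full and fit_reduced are illustrative, and the models are written with the data = argument rather than the mtcars$ prefix.

```r
# Sketch: compare adjusted R-squared for the full and reduced models.
fit_full    <- lm(mpg ~ disp + hp + drat + wt, data = mtcars)
fit_reduced <- lm(mpg ~ hp + wt,               data = mtcars)

adj <- c(full    = summary(fit_full)$adj.r.squared,
         reduced = summary(fit_reduced)$adj.r.squared)
adj
```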

> plot(fit$fitted, fit$resid)
> 2*sd(fit$resid)
[1] 5.016717
> abline(5.016717,0)
> fit$resid
(Check which residuals are greater than 2 sd or less than -2 sd.)
> mtcars2 <- mtcars[-c(17,18,20),]

> fit <- lm(mtcars2$mpg ~ mtcars2$hp + mtcars2$wt)
> summary(fit)
Call:
lm(formula = mtcars2$mpg ~ mtcars2$hp + mtcars2$wt)
Residuals:
    Min      1Q  Median      3Q     Max
-3.3258 -1.1674 -0.0214  1.2123  3.1963
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.257772   1.276604  28.402  < 2e-16 ***
mtcars2$hp  -0.026503   0.006335  -4.183 0.000289 ***
mtcars2$wt  -4.004809   0.476091  -8.412 6.81e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.799 on 26 degrees of freedom
> 2*sd(fit$resid)
[1] 3.466868
> abline(3.466868,0)
> abline(-3.466868,0)
> qqnorm(fit$resid)
After eliminating those 3 outliers, all residuals are within ±2 standard
deviations. So the model is
$y_{\mathrm{Mpg},i} = 36.257772 - 0.026503\, x_{\mathrm{Hp},i} - 4.004809\, x_{\mathrm{Wt},i} + z_i$.
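As a worked example of using the final equation, the sketch below plugs a hypothetical car into the fitted coefficients. The coefficient values are copied from the summary output above; the variable names and the chosen car (110 horsepower, weight 2.62, where wt in mtcars is measured in units of 1000 lb) are assumptions for illustration.

```r
# Sketch: evaluate the fitted equation for one hypothetical car.
b0   <- 36.257772   # intercept
b_hp <- -0.026503   # coefficient of horsepower
b_wt <- -4.004809   # coefficient of weight (per 1000 lb)

hp <- 110           # hypothetical horsepower
wt <- 2.62          # hypothetical weight, in 1000 lb

mpg_hat <- b0 + b_hp * hp + b_wt * wt
mpg_hat             # about 22.85 miles per gallon
```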
