Gerald Cheang
School of ITMS
University of South Australia
> data(cars)
> cars2 <- cars[-c(23, 35, 49),]
> fit <- lm(cars2$dist ~ cars2$speed)
> fit
Call:
lm(formula = cars2$dist ~ cars2$speed)
Coefficients:
(Intercept)  cars2$speed
    -15.137        3.608
> summary(fit)
Call:
lm(formula = cars2$dist ~ cars2$speed)
Residuals:
    Min      1Q  Median      3Q     Max
-25.032  -7.686  -1.032   6.576  26.185
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.1371     5.3053  -2.853  0.00652 **
cars2$speed   3.6085     0.3302  10.928    3e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values show that both the intercept and the gradient coefficients are
significant. We can plot the data with the fitted line overlaid (see your
own R plot window).
> plot(cars2)
> lines(cars2$speed, fit$fitted.values)
> qqnorm(fit$residuals)
The plot looks a bit better now. The QQ-plot also looks reasonable.
Now we plot the residuals against the fitted values and add reference
lines at 0 and at ±2 standard deviations of the residuals (±23.16).
> plot(fit$fitted,fit$residual)
> abline(0,0)
> abline(23.16,0)
> abline(-23.16,0)
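The reference lines above are typed in by hand. As a minimal sketch, assuming the same three rows are dropped from the built-in cars data, the band can be computed from the residuals instead. The names band and outside are introduced here for illustration only.

```r
# Same subset and model as in the transcript, written with a data argument.
cars2 <- cars[-c(23, 35, 49), ]
fit <- lm(dist ~ speed, data = cars2)

# Half-width of the reference band: two standard deviations of the residuals.
band <- 2 * sd(residuals(fit))

# Residuals against fitted values, with lines at 0 and at +/- band.
plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = c(-band, 0, band), lty = c(2, 1, 2))

# Row names of any observations whose residual falls outside the band.
outside <- names(which(abs(residuals(fit)) > band))
outside
```

Computing the band this way keeps the plot consistent if further observations are removed and the model is refitted.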
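We consider a power-law model $D = k S^{\alpha}$, where $D$ is distance, $S$ is
speed and $\alpha$ is a power coefficient, most likely slightly greater than 1.
Taking logarithms gives $\log D = \log k + \alpha \log S = c + \alpha \log S$,
with $c = \log k$.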
> plot(log(cars$dist), log(cars$speed))
> fit <- lm(log(cars$dist) ~ log(cars$speed))
> fit
Call:
lm(formula = log(cars$dist) ~ log(cars$speed))
Coefficients:
    (Intercept)  log(cars$speed)
        -0.7297           1.6024
> summary(fit)
Call:
lm(formula = log(cars$dist) ~ log(cars$speed))
Residuals:
     Min       1Q   Median       3Q      Max
-1.00215 -0.24578 -0.02898  0.20717  0.88289
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      -0.7297     0.3758  -1.941   0.0581 .
log(cars$speed)   1.6024     0.1395  11.484 2.26e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
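The fitted coefficients can be mapped back to the power-law form, since the slope estimates the exponent and the intercept estimates log k. The following is a minimal sketch on the full cars data. The names fit_log, alpha, k and predict_dist are illustrative and do not appear in the original transcript.

```r
# Fit the log-log model on the full cars data, as in the transcript above.
fit_log <- lm(log(dist) ~ log(speed), data = cars)

# The slope is the exponent alpha; k is recovered from the intercept c = log k.
alpha <- coef(fit_log)[["log(speed)"]]
k <- exp(coef(fit_log)[["(Intercept)"]])

# Hypothetical helper: fitted distance on the original scale, D = k * S^alpha.
predict_dist <- function(s) k * s^alpha

# Overlay the fitted power curve on the raw data.
plot(dist ~ speed, data = cars)
curve(predict_dist(x), add = TRUE)
```

Plotting on the original scale is a quick visual check that the curvature implied by the exponent matches the data.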
7 8 9 10 11 12
-0.06960160 0.29812318 0.56638717 -0.27948421 0.21950696 -0.61306648
13 14 15 16 17 18
-0.25639154 -0.07406998 0.08008070 -0.12228701 0.14597698 0.14597698
19 20 21 22 23 24
0.44825785 -0.24103697 0.08438543 0.59521105 0.88289313 -0.61395481
25 26 27 28 29 30
-0.35159054 0.37929697 -0.24736713 -0.02422358 -0.34451150 -0.12136794
31 32 33 34 35 36
0.10177561 -0.16416792 0.12351415 0.42889580 0.52897926 -0.40495544
37 38 39 40 41 42
-0.15983298 0.23103333 -0.60493040 -0.19946529 -0.11942258 -0.04531461
43 44 45 46 47 48
0.08821678 -0.03373575 -0.30563556 -0.11432152 0.15897182 0.16978273
49 50
0.42467498 0.01442169
> cars3 <- cars[ -c(2,3,23),]
> plot(log(cars3$dist), log(cars3$speed))
> fit <- lm(log(cars3$dist) ~ log(cars3$speed))
> summary(fit)
Call:
lm(formula = log(cars3$dist) ~ log(cars3$speed))
Residuals:
     Min       1Q   Median       3Q      Max
-0.70597 -0.22556 -0.03112  0.19540  0.76215
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.9042     0.3798  -2.381   0.0216 *
log(cars3$speed)   1.6615     0.1392  11.939 1.52e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> cars4 <- cars3[-c(1,2),]
> fit <- lm(log(cars4$dist) ~ log(cars4$speed))
> summary(fit)
Call:
lm(formula = log(cars4$dist) ~ log(cars4$speed))
Residuals:
     Min       1Q   Median       3Q      Max
-0.60863 -0.23150 -0.01027  0.18632  0.60475
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.6520     0.4488  -1.453    0.154
log(cars4$speed)   1.5693     0.1623   9.672 2.35e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
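The intercept term is not significant (p = 0.154). Dropping it, the model
written with error terms is
$\log D_i = 1.5693 \log S_i + z_i$,
where $z_i$ is the residual error for the $i$th observation. Strictly, the
slope should be re-estimated once the intercept is dropped; 1.5693 is the
slope from the fit that still includes the intercept.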
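One way to re-estimate the slope without an intercept is to add `- 1` to the model formula. The sketch below assumes cars4 is built exactly as in the transcript. The names fit0 and slope0 are introduced here for illustration, and no output is claimed for this refit.

```r
# Rebuild the subsets used in the transcript.
cars3 <- cars[-c(2, 3, 23), ]
cars4 <- cars3[-c(1, 2), ]

# Regression through the origin on the log-log scale: "- 1" removes the intercept.
fit0 <- lm(log(dist) ~ log(speed) - 1, data = cars4)
summary(fit0)

# The single remaining coefficient is the re-estimated exponent.
slope0 <- coef(fit0)[["log(speed)"]]
slope0
```

Note that R computes the R-squared of a no-intercept model differently, so it should not be compared directly with the R-squared of the fit that includes an intercept.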
2. We try to fit log Volume versus log Girth.
> attach(trees)
> treesfit <- lm(log(Volume) ~ log(Girth))
> summary(treesfit)
Call:
lm(formula = log(Volume) ~ log(Girth))
Residuals:
      Min        1Q    Median        3Q       Max
-0.205999 -0.068702  0.001011  0.072585  0.247963
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.35332    0.23066  -10.20 4.18e-11 ***
log(Girth)   2.19997    0.08983   24.49  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
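The same fit can be written without attach(), which avoids leaving the columns of trees on the search path. As a minimal sketch, the data argument is used instead and a 95% confidence interval is added for the slope, which is the exponent in the implied power law between Volume and Girth. The name ci is illustrative.

```r
# Same model as above, with the data frame passed explicitly.
treesfit <- lm(log(Volume) ~ log(Girth), data = trees)
coef(treesfit)

# 95% confidence interval for the slope (the exponent of Girth).
ci <- confint(treesfit, "log(Girth)", level = 0.95)
ci
```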
Call:
lm(formula = log(trees2$Volume) ~ log(trees2$Girth))
Residuals:
      Min        1Q    Median        3Q       Max
-0.197735 -0.062881 -0.001503  0.078311  0.213452
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        -2.3616     0.2144  -11.01  1.1e-11 ***
log(trees2$Girth)   2.2000     0.0835   26.35  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 2*sd(trees2fit$resid)
[1] 0.2099947
> qqnorm(trees2fit$resid)
> plot(trees2fit$fitted,trees2fit$resid)
> abline(0.21,0)
> abline(-0.21,0)
There is still one outlier, which sits just on the upper reference line.
We could remove it and repeat the analysis, bearing in mind that this
leaves fewer than 30 observations. Alternatively, we can accept the
results as they are, since the outlier is only marginal.
3. We first plot area against perimeter.
> plot(rock$peri, rock$area)
The plot suggests that there could be a linear relationship.
> fit <- lm(rock$area ~ rock$peri)
> summary(fit)
Call:
lm(formula = rock$area ~ rock$peri)
Residuals:
    Min      1Q  Median      3Q     Max
-2511.9 -1282.7   -80.1   790.8  4375.4
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3052.0181   476.8561   6.400 7.26e-08 ***
rock$peri      1.5419     0.1572   9.808 7.51e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the intercept and gradient terms are significant. There is one outlier,
the 48th observation, so we redo the whole analysis with that observation
removed.
Call:
lm(formula = rock2$area ~ rock2$peri)
Residuals:
     Min       1Q   Median       3Q      Max
-2283.61 -1195.62   -94.24   911.06  2752.52
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2806.3167   443.1323   6.333 9.99e-08 ***
rock2$peri     1.5983     0.1449  11.029 2.22e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the intercept and gradient terms are significant, and the $R^2$ value is
also higher. All residuals are now within ±2 standard deviations and the
QQ-normal plot looks reasonably straight. So we take the model
$A_i = 2806.3167 + 1.5983 P_i + z_i$,
where $A_i$ is area, $P_i$ is perimeter and the $z_i$ are iid
$N(0, 1396.117^2)$, that is, normal with mean 0 and standard deviation 1396.117.
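The transcript uses rock2 without showing how it was built. The sketch below assumes it is simply rock with the 48th row removed, as the text describes, and shows how the error standard deviation and the ±2 standard deviation check could be obtained. The names fit2, sigma_hat, band and all_inside are illustrative. Whether the value 1396.117 quoted above is the residual standard error or the plain standard deviation of the residuals is not shown in the source.

```r
# Assumption: rock2 is rock without its 48th observation.
rock2 <- rock[-48, ]
fit2 <- lm(area ~ peri, data = rock2)

# Residual standard error (estimate of the error standard deviation).
sigma_hat <- sigma(fit2)

# Check whether every residual lies within two standard deviations of zero.
band <- 2 * sd(residuals(fit2))
all_inside <- all(abs(residuals(fit2)) <= band)

# Normal QQ-plot of the residuals with a reference line.
qqnorm(residuals(fit2))
qqline(residuals(fit2))
```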
4. We first try fitting fuel consumption (mpg) as a linear function of all the
other variables: displacement (disp), horsepower (hp), weight (wt) and rear
axle ratio (drat). Then we eliminate those that are not significant.
> plot(mtcars)
> fit <- lm(mtcars$mpg ~ mtcars$disp + mtcars$hp + mtcars$drat + mtcars$wt)
> summary(fit)
Call:
lm(formula = mtcars$mpg ~ mtcars$disp + mtcars$hp + mtcars$drat +
mtcars$wt)
Residuals:
    Min      1Q  Median      3Q     Max
-3.5077 -1.9052 -0.5057  0.9821  5.6883
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.148738   6.293588   4.631  8.2e-05 ***
mtcars$disp  0.003815   0.010805   0.353  0.72675
mtcars$hp   -0.034784   0.011597  -2.999  0.00576 **
mtcars$drat  1.768049   1.319779   1.340  0.19153
mtcars$wt   -3.479668   1.078371  -3.227  0.00327 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
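The intercept is significant, and weight and horsepower are significant
explanatory variables for fuel consumption, whereas displacement (p = 0.727)
and rear axle ratio (p = 0.192) are not. So we refit the model with
horsepower and weight only.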
Call:
lm(formula = mtcars$mpg ~ mtcars$hp + mtcars$wt)
Residuals:
   Min     1Q Median     3Q    Max
-3.941 -1.600 -0.182  1.050  5.854
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
mtcars$hp   -0.03177    0.00903  -3.519  0.00145 **
mtcars$wt   -3.87783    0.63273  -6.129 1.12e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The adjusted $R^2$ value is also slightly higher than that of the previous model.
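The comparison of the two models can be made explicit. The sketch below fits both on the built-in mtcars data and extracts the adjusted R-squared from each summary. The names full, reduced, adj_full and adj_reduced are illustrative. The partial F-test via anova() is an addition beyond the original analysis, included as one common way to check that dropping the two terms loses little.

```r
# Full model with all four explanatory variables, and the reduced model.
full <- lm(mpg ~ disp + hp + drat + wt, data = mtcars)
reduced <- lm(mpg ~ hp + wt, data = mtcars)

# Adjusted R-squared for each model.
adj_full <- summary(full)$adj.r.squared
adj_reduced <- summary(reduced)$adj.r.squared
c(full = adj_full, reduced = adj_reduced)

# Partial F-test: do disp and drat jointly improve on the reduced model?
anova(reduced, full)
```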
Call:
lm(formula = mtcars2$mpg ~ mtcars2$hp + mtcars2$wt)
Residuals:
    Min      1Q  Median      3Q     Max
-3.3258 -1.1674 -0.0214  1.2123  3.1963
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.257772   1.276604  28.402  < 2e-16 ***
mtcars2$hp  -0.026503   0.006335  -4.183 0.000289 ***
mtcars2$wt  -4.004809   0.476091  -8.412 6.81e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1