Call:
lm(formula = Speed ~ Height)
Coefficients:
(Intercept)       Height
    39.5093       0.1715
You get much more information when you save the result of the call as an “object” (which I named fit below).
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.509313   1.537364   25.70   <2e-16 ***
Height       0.171485   0.008743   19.61   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation of slope: For every 100-foot increase in the height of a coaster, the top speed increases by
about 17 mph (100 × 0.1715), on average.
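The steps above (fit the model, save it as an object, summarize it, rescale the slope) can be sketched as follows. The coasters data is not reproduced here, so this sketch uses a simulated stand-in; the names coasters.sim and fit.sim and all simulated values are assumptions for illustration, not the real dataset.

```r
# Simulated stand-in for the coasters data (hypothetical values, not the real dataset).
set.seed(1)
coasters.sim <- data.frame(Height = runif(80, min = 50, max = 420))
coasters.sim$Speed <- 40 + 0.17 * coasters.sim$Height + rnorm(80, sd = 6)

# Save the fitted model as an object, then summarize it.
fit.sim <- lm(Speed ~ Height, data = coasters.sim)
fit.summary <- summary(fit.sim)
print(fit.summary)

# Rescale the slope from "mph per foot" to "mph per 100 feet".
slope.per.100ft <- 100 * coef(fit.sim)["Height"]
print(slope.per.100ft)
```

Printing fit.sim alone gives only the call and the two coefficients; summary() adds the standard errors, t values and p-values.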
You don’t really need the data=coasters in the lm() call because the
coasters dataset is attached, but it’s good practice. In advanced work, try to
avoid attaching datasets; it will only clutter your workspace. Instead, use
the data= argument to specify the dataset in which to look for the variables.
To superimpose the fitted regression line on the scatterplot (see plot to the right), use the
abline() command:
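A minimal sketch of such a call, assuming a scatterplot of Speed versus Height is drawn first. It runs on a simulated stand-in for the coasters data (coasters.sim and fit.sim are hypothetical names), and the color choice is arbitrary.

```r
# Simulated stand-in for the coasters data (hypothetical values, not the real dataset).
set.seed(1)
coasters.sim <- data.frame(Height = runif(80, min = 50, max = 420))
coasters.sim$Speed <- 40 + 0.17 * coasters.sim$Height + rnorm(80, sd = 6)
fit.sim <- lm(Speed ~ Height, data = coasters.sim)

# Scatterplot first, then the fitted line on top of it.
plot(Speed ~ Height, data = coasters.sim,
     xlab = "Height (feet)", ylab = "Top speed (mph)")
abline(fit.sim, col = "red")  # abline() reads intercept and slope from the lm object
```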
Also, note that the object fit contains a lot of information. Type
> names(fit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
to get a list of all the components of the object. You can access those components by the name of the
object followed by a dollar sign followed by the name of the component: object$component. So, for
instance, to retrieve the estimated coefficients for intercept and slope, you can type
> fit$coefficients
(Intercept) Height
39.5093129 0.1714846
but
> coefficients(fit)
also works.
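To check that the two ways of accessing a component agree, one can compare them directly. The sketch below uses a simulated stand-in for the coasters data (coasters.sim and fit.sim are hypothetical names).

```r
# Simulated stand-in for the coasters data (hypothetical values, not the real dataset).
set.seed(1)
coasters.sim <- data.frame(Height = runif(80, min = 50, max = 420))
coasters.sim$Speed <- 40 + 0.17 * coasters.sim$Height + rnorm(80, sd = 6)
fit.sim <- lm(Speed ~ Height, data = coasters.sim)

# Access by component name versus the extractor function.
by.dollar <- fit.sim$coefficients
by.extractor <- coefficients(fit.sim)  # coef(fit.sim) is the short form
same.coefficients <- identical(by.dollar, by.extractor)
print(same.coefficients)

# The same pattern works for other components, e.g. the residuals.
same.residuals <- isTRUE(all.equal(fit.sim$residuals, residuals(fit.sim)))
print(same.residuals)
```

The extractor functions are generally the safer choice, since they do not depend on how the object stores its components internally.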
To get the fitted values for each of the height values (the x-variable) in your data set, type
> fit$fitted
       1        2        3        4        5        6        7
49.11245 72.94881 61.28785 68.66169 66.43239 57.51519 52.88511
       8        9       10       11       12       13       14
60.43043 51.51323 55.62886 65.91794 55.28589 48.94096 60.94489
...
      78       79       80
60.08746 69.51911 74.66365
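but fitted(fit) also works. (The full component name is fitted.values; fit$fitted works because R
partially matches component names after the dollar sign.) These are the 80 values for the estimated
speed of a coaster based on the linear regression model.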
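Each fitted value is just the intercept plus the slope times the observed height. A short sketch that recomputes the fitted values by hand, on a simulated stand-in for the coasters data (coasters.sim and fit.sim are hypothetical names):

```r
# Simulated stand-in for the coasters data (hypothetical values, not the real dataset).
set.seed(1)
coasters.sim <- data.frame(Height = runif(80, min = 50, max = 420))
coasters.sim$Speed <- 40 + 0.17 * coasters.sim$Height + rnorm(80, sd = 6)
fit.sim <- lm(Speed ~ Height, data = coasters.sim)

# Fitted value = intercept + slope * height, for every observation.
b <- coef(fit.sim)
fitted.by.hand <- unname(b["(Intercept)"] + b["Height"] * coasters.sim$Height)

# Compare with what R stores in the fitted model object.
matches.fitted <- isTRUE(all.equal(unname(fitted(fit.sim)), fitted.by.hand))
print(matches.fitted)

# fit.sim$fitted is partially matched to fit.sim$fitted.values.
print(identical(fit.sim$fitted, fit.sim$fitted.values))
```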
Confidence interval and hypothesis test (testing if the slope is significantly different from zero, i.e., if
there is a linear relationship at all) for the slope (for the last week of classes)
To test the null hypothesis H0: beta=0 against the alternative hypothesis HA: beta≠0, we simply refer to
the t-statistic displayed for Height. How is it computed? It’s simply the estimated slope (0.1715) divided
by its standard error (0.0087); with the unrounded values from the output, 0.171485/0.008743 gives t=19.61.
Is this an extreme value if the null hypothesis were true? We have to judge the value relative to a
t-distribution with df=78 (the sample size – 2, here 80 – 2).
Certainly, the t value is extreme, with a p-value of less than 2×10^(-16) (see output above). Hence, we have
sufficient evidence to reject the null hypothesis and conclude the true slope is significantly different
from zero.
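The t-statistic and its two-sided p-value can be reproduced by hand from the coefficient table. The sketch below does this on a simulated stand-in for the coasters data (coasters.sim and fit.sim are hypothetical names), so the numbers will differ from the output shown above.

```r
# Simulated stand-in for the coasters data (hypothetical values, not the real dataset).
set.seed(1)
coasters.sim <- data.frame(Height = runif(80, min = 50, max = 420))
coasters.sim$Speed <- 40 + 0.17 * coasters.sim$Height + rnorm(80, sd = 6)
fit.sim <- lm(Speed ~ Height, data = coasters.sim)

# Pull estimate and standard error of the slope from the coefficient table.
coef.table <- coef(summary(fit.sim))
est <- coef.table["Height", "Estimate"]
se <- coef.table["Height", "Std. Error"]

# t-statistic: estimate divided by its standard error.
t.by.hand <- est / se

# Degrees of freedom: sample size minus 2 (intercept and slope).
df.resid <- df.residual(fit.sim)

# Two-sided p-value from the t-distribution.
p.by.hand <- 2 * pt(-abs(t.by.hand), df = df.resid)

print(c(t = t.by.hand, df = df.resid, p = p.by.hand))
```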
As always, a confidence interval is more informative. To find a confidence interval for the slope (i.e., the
effect of Height), simply type
> confint(fit)
                 2.5 %     97.5 %
(Intercept) 36.4486558 42.5699699
Height       0.1540783  0.1888909
and read off the CI for Height: [0.154, 0.189]. This means that we are 95% confident that a
one-unit (one-foot) increase in height is associated with an increase of at least 0.154 mph and at most
0.189 mph in the top speed of a roller coaster, on average. Since a one-foot increase is pretty meaningless
given the height values, we look at a 100-foot increase: We are 95% confident that for a 100-foot increase
in the height of a coaster, the top speed increases by at least 15.4 mph and at most 18.9 mph, on average.
(Clearly, zero is not contained in the interval, indicating a statistically significant effect of height on top
speed.)
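The interval reported by confint() is the estimate plus or minus a t-quantile times the standard error. A sketch that rebuilds it by hand and rescales it to a 100-foot increase, using a simulated stand-in for the coasters data (coasters.sim and fit.sim are hypothetical names):

```r
# Simulated stand-in for the coasters data (hypothetical values, not the real dataset).
set.seed(1)
coasters.sim <- data.frame(Height = runif(80, min = 50, max = 420))
coasters.sim$Speed <- 40 + 0.17 * coasters.sim$Height + rnorm(80, sd = 6)
fit.sim <- lm(Speed ~ Height, data = coasters.sim)

# Estimate and standard error of the slope.
coef.table <- coef(summary(fit.sim))
est <- coef.table["Height", "Estimate"]
se <- coef.table["Height", "Std. Error"]

# 95% CI by hand: estimate +/- t quantile (df = n - 2) * standard error.
t.crit <- qt(0.975, df = df.residual(fit.sim))
ci.by.hand <- est + c(-1, 1) * t.crit * se

# The same interval from confint(), and its version per 100 feet.
ci.confint <- confint(fit.sim)["Height", ]
ci.per.100ft <- 100 * ci.confint

print(ci.by.hand)
print(ci.confint)
print(ci.per.100ft)
```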
but it is much better to use studentized (or standardized) residuals in any residual plot. You get
studentized residuals through
The obligatory residual plot, plotting studentized residuals
versus the x variable is obtained through
You can also plot the studentized residuals versus the fitted
values fitted(fit), which will result in the exact same
picture, but a different scaling on the x-axis.
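A sketch of one way to compute and plot such residuals. It is an assumption that rstudent() (externally studentized residuals) is the function meant here; rstandard() gives the standardized (internally studentized) version. The data is a simulated stand-in for the coasters data, and coasters.sim, fit.sim and resid.stud.sim are hypothetical names.

```r
# Simulated stand-in for the coasters data (hypothetical values, not the real dataset).
set.seed(1)
coasters.sim <- data.frame(Height = runif(80, min = 50, max = 420))
coasters.sim$Speed <- 40 + 0.17 * coasters.sim$Height + rnorm(80, sd = 6)
fit.sim <- lm(Speed ~ Height, data = coasters.sim)

# Studentized residuals (assumed choice) and standardized residuals.
resid.stud.sim <- rstudent(fit.sim)
resid.stand.sim <- rstandard(fit.sim)

# Standardized residuals by hand: raw residual / (sigma * sqrt(1 - leverage)).
sigma.hat <- summary(fit.sim)$sigma
stand.by.hand <- residuals(fit.sim) / (sigma.hat * sqrt(1 - hatvalues(fit.sim)))

# Residual plot versus the x variable, with a reference line at zero.
plot(resid.stud.sim ~ coasters.sim$Height,
     xlab = "Height (feet)", ylab = "Studentized residual")
abline(h = 0, lty = 2)

# Same picture versus the fitted values; only the x-axis scale changes.
plot(resid.stud.sim ~ fitted(fit.sim),
     xlab = "Fitted speed (mph)", ylab = "Studentized residual")
abline(h = 0, lty = 2)
```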
If you want to identify which residuals are troublesome, then
use the identify() function and click on the points of interest in the plot. For example,
> identify(resid.stud~Height)
[1] 41
> coasters[41,]
       Name         Park State Country Duration Speed Height Drop
41 Oblivion Alton Towers Alton England       NA    68     65  180
This observation has a large positive residual, meaning we underestimated its top speed based on its
height. If you check on the internet (http://en.wikipedia.org/wiki/Oblivion_%28roller_coaster%29), you
will find that this coaster (Oblivion) is basically just a free fall, explaining why it reaches a higher top
speed than coasters of similar height.
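Since identify() needs an interactive plot window, a non-interactive alternative is to flag large residuals by index. The sketch below assumes studentized residuals from rstudent() and uses the common rule of thumb of an absolute value above 2 as the cutoff; both are assumptions, as are the simulated data and the names coasters.sim, fit.sim and resid.stud.sim.

```r
# Simulated stand-in for the coasters data (hypothetical values, not the real dataset).
set.seed(1)
coasters.sim <- data.frame(Height = runif(80, min = 50, max = 420))
coasters.sim$Speed <- 40 + 0.17 * coasters.sim$Height + rnorm(80, sd = 6)
fit.sim <- lm(Speed ~ Height, data = coasters.sim)
resid.stud.sim <- rstudent(fit.sim)

# Row with the largest residual in absolute value.
idx.max <- unname(which.max(abs(resid.stud.sim)))
print(coasters.sim[idx.max, ])

# All rows whose studentized residual exceeds the (assumed) cutoff of 2.
idx.large <- unname(which(abs(resid.stud.sim) > 2))
print(coasters.sim[idx.large, ])
```

A positive residual at a flagged row means the model underestimated that coaster's speed; a negative one means it overestimated it.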
> hist(resid.stud)
> qqnorm(resid.stud)
> qqline(resid.stud, col="red")
One can also add prediction intervals for predicting the top speed (Y):
> pred1=predict(fit,newdata=list(Height=height.new), interval="prediction")
> head(pred1)
fit lwr upr
1 48.94096 37.01603 60.86590
2 49.57321 37.65784 61.48857
3 50.20545 38.29931 62.11159
4 50.83769 38.94044 62.73494
5 51.46993 39.58123 63.35863
6 52.10217 40.22168 63.98266
> lines(pred1[,2]~height.new, col="blue", lty=3) #lower prediction limit
> lines(pred1[,3]~height.new, col="blue", lty=3) #upper prediction limit
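The commands above rely on a vector height.new whose definition is not shown. A self-contained sketch, assuming height.new is an evenly spaced grid over the observed heights (an assumption, as are the simulated stand-in data and the names coasters.sim, fit.sim and pred.sim):

```r
# Simulated stand-in for the coasters data (hypothetical values, not the real dataset).
set.seed(1)
coasters.sim <- data.frame(Height = runif(80, min = 50, max = 420))
coasters.sim$Speed <- 40 + 0.17 * coasters.sim$Height + rnorm(80, sd = 6)
fit.sim <- lm(Speed ~ Height, data = coasters.sim)

# Grid of new heights spanning the observed range (assumed definition).
height.new <- seq(min(coasters.sim$Height), max(coasters.sim$Height),
                  length.out = 100)

# Point predictions with 95% prediction limits for a new coaster's top speed.
pred.sim <- predict(fit.sim, newdata = data.frame(Height = height.new),
                    interval = "prediction")
print(head(pred.sim))

# Scatterplot, fitted line, and the two prediction limits.
plot(Speed ~ Height, data = coasters.sim)
abline(fit.sim)
lines(height.new, pred.sim[, "lwr"], col = "blue", lty = 3)  # lower prediction limit
lines(height.new, pred.sim[, "upr"], col = "blue", lty = 3)  # upper prediction limit
```

Referring to the columns by name ("lwr", "upr") rather than by position makes it harder to mix up the two limits.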