Oskar Hollinsworth
13 February 2018
Example sheet 2
Question 5
[Figure: plot with MRI_Count on one axis]
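The correlation between the two slope estimates has the same magnitude as the correlation between the variables but the opposite sign (this holds exactly for a model with an intercept and just these two predictors).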
summary(BrainSizeLM2, correlation = TRUE)$correlation
cor(Height, MRI_Count)
## [1] 0.5883772
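The relationship between the two correlations can be checked on simulated data. This is only a sketch: the variables `x1`, `x2` and `y` are made up for illustration, and the model is assumed to contain an intercept and exactly two predictors. Under that assumption the correlation between the slope estimates is the negative of the sample correlation between the predictors.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)            # second predictor, correlated with the first
y  <- 1 + 2 * x1 - x2 + rnorm(n)     # hypothetical response

fit <- lm(y ~ x1 + x2)

# Correlation between the estimated slopes, taken from the summary object
coef_cor <- summary(fit, correlation = TRUE)$correlation["x1", "x2"]

# Sample correlation between the two predictors
var_cor <- cor(x1, x2)

c(coefficients = coef_cor, variables = var_cor)
```

The two printed numbers should agree in absolute value and differ in sign.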
Question 6
We see that Knock Hill has a very long time despite a very small distance and climb, which suggests a recording error. We subtract an hour from the time for this point.
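The correction itself is not shown, so the following is a minimal sketch of one way to do it. It assumes that `hills` is the data set from the MASS package (times in minutes, races as row names) and that `hls`, the name used in the later model fits, is the corrected copy.

```r
library(MASS)   # assumption: `hills` is MASS::hills, with times in minutes

hls <- hills    # work on a copy so the original data stay untouched
hls["Knock Hill", "time"] <- hls["Knock Hill", "time"] - 60   # subtract one hour

hls["Knock Hill", ]
```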
pairs(hills)
[Figure: scatterplot matrix of dist, climb and time]
The data are heavily bunched near the origin, so taking logarithms could help to linearise the relationship. When
doing so, we should include an intercept term, which gives the model freedom in the fitted value when the independent
variable equals 1 (so that its logarithm is 0). For example, if $y = y_0 x^p$, then $\log y = \log y_0 + p \log x$, so the intercept is $\log y_0$.
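As an illustration of the role of the intercept, here is a small simulated power law. The values of `y0` and `p`, the noise level and the grid of `x` values are all made up for this sketch. The fitted intercept should be close to log(y0) and the fitted slope close to p.

```r
set.seed(1)
y0 <- 3      # hypothetical multiplicative constant
p  <- 1.5    # hypothetical exponent

x <- seq(1, 50, length.out = 200)
y <- y0 * x^p * exp(rnorm(length(x), sd = 0.05))   # multiplicative noise

fit <- lm(log(y) ~ log(x))

est_p  <- unname(coef(fit)["log(x)"])              # slope estimates p
est_y0 <- unname(exp(coef(fit)["(Intercept)"]))    # exp(intercept) estimates y0

c(p = est_p, y0 = est_y0)
```

Dropping the intercept (`lm(log(y) ~ log(x) - 1)`) would force y0 = 1, which is why the intercept is kept.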
hlslm1 <- lm(time ~ dist + climb, data=hls)
summary(hlslm1)
##
## Call:
## lm(formula = time ~ dist + climb, data = hls)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.632 -4.934 1.007 4.541 27.903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.94198 2.58005 -5.016 1.90e-05 ***
## dist 6.34556 0.36047 17.604 < 2e-16 ***
## climb 0.01175 0.00123 9.555 6.83e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.8 on 32 degrees of freedom
## Multiple R-squared: 0.9712, Adjusted R-squared: 0.9694
## F-statistic: 540.2 on 2 and 32 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(hlslm1)
[Figure: diagnostic plots for hlslm1, including Residuals vs Fitted and Normal Q-Q panels; Bens of Jura is among the labelled points]
hlslm2 <- lm(log(time) ~ log(dist) + log(climb), data=hls)
summary(hlslm2)
##
## Call:
## lm(formula = log(time) ~ log(dist) + log(climb), data = hls)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52624 -0.06273 0.00452 0.06846 0.31384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.29359 0.27312 1.075 0.29
## log(dist) 0.91141 0.06534 13.949 3.76e-15 ***
## log(climb) 0.24889 0.04761 5.228 1.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1607 on 32 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9491
## F-statistic: 317.8 on 2 and 32 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(hlslm2)
[Figure: diagnostic plots for hlslm2, including Residuals vs Fitted and Normal Q-Q panels and Cook's distance contours; Cairn Table and Bens of Jura are among the labelled points]
Overall, I prefer the logarithmic model, as its residuals vs. fitted plot looks much better. The following is a
95% prediction interval for the record time of a hypothetical race. Since the model is fitted to log(time), the
interval is exponentiated to return it to the original units of time.
hyporace <- data.frame(dist=5.3, climb=1100)
exp(predict(hlslm2, hyporace, level=0.95, interval="prediction"))
Question 9
Performing either a one-sided or a two-sided t-test on the externally studentised residual for the human data
point gives a highly significant p-value. This suggests that the human brain-to-body size ratio is indeed an
outlier amongst animals. In this case, I think that a one-sided test is more appropriate, as we are specifically
testing whether the human brain is unusually large.
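lmMamm <- lm(log(brain) ~ log(body), data=mammals)
n <- nrow(mammals)
p <- length(coef(lmMamm))  # number of regression parameters, including the intercept (2 here)
eta <- rstudent(lmMamm)["Human"]
pval1 <- pt(eta, n-p-1, lower.tail=FALSE)       # one-sided: P(T > eta)
pval2 <- pf(eta^2, 1, n-p-1, lower.tail=FALSE)  # two-sided, since T^2 ~ F(1, n-p-1)
cat("One-sided: ", pval1, " Two-sided: ", pval2)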
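One way to sanity-check the test statistic and its degrees of freedom is the mean-shift formulation: the externally studentised residual of an observation equals the t-statistic of an indicator variable for that observation added to the model. The sketch below assumes `mammals` is the data set from the MASS package, with "Human" as a row name. The names `is_human` and `fit_shift` are introduced only for this illustration.

```r
library(MASS)   # assumption: `mammals` is MASS::mammals, with a row named "Human"

fit <- lm(log(brain) ~ log(body), data = mammals)
eta <- unname(rstudent(fit)["Human"])

# Mean-shift model: add an indicator for the Human row and read off its t-statistic
is_human  <- as.numeric(rownames(mammals) == "Human")
fit_shift <- lm(log(brain) ~ log(body) + is_human, data = mammals)
t_shift   <- unname(coef(summary(fit_shift))["is_human", "t value"])

# Residual degrees of freedom of the mean-shift model:
# n minus three parameters (intercept, slope, indicator)
df_resid <- df.residual(fit_shift)

p_one <- pt(eta, df_resid, lower.tail = FALSE)            # one-sided
p_two <- 2 * pt(abs(eta), df_resid, lower.tail = FALSE)   # two-sided

c(eta = eta, t_shift = t_shift, df = df_resid, one_sided = p_one, two_sided = p_two)
```

The two statistics should agree, and the two-sided p-value should match the one obtained from the F distribution applied to the squared residual.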
[Figure: scatterplot of PercFirsts against log(WineBudget), with points labelled by college]
If I perform this test on a college because it appears to be an outlier, then that college is no longer a
randomly chosen data point: it has been selected for its extreme position on the graph. It is therefore much more
likely to have an extreme studentised residual, so the nominal p-value will understate the true probability of
seeing such a residual by chance.
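The size of this selection effect can be gauged with a small simulation. This is a sketch under made-up assumptions: the sample size `n`, the coefficients and the Gaussian noise are hypothetical and are not taken from the college data. Each simulated data set contains no genuine outliers, the most extreme studentised residual is picked out, and the naive two-sided test is applied to it. A Bonferroni adjustment, which is not used in the original analysis, is included for comparison.

```r
set.seed(42)
n     <- 24      # hypothetical number of data points
n_sim <- 2000    # number of simulated data sets

naive_p <- numeric(n_sim)
for (s in seq_len(n_sim)) {
  x <- rnorm(n)
  y <- 1 + 0.5 * x + rnorm(n)          # null model: no genuine outliers
  r <- rstudent(lm(y ~ x))
  # Naive two-sided p-value for the most extreme point, ignoring how it was chosen
  naive_p[s] <- 2 * pt(max(abs(r)), df = n - 3, lower.tail = FALSE)
}

naive_rate <- mean(naive_p < 0.05)                 # false-positive rate of the naive test
bonf_rate  <- mean(pmin(1, n * naive_p) < 0.05)    # after a Bonferroni adjustment

c(naive = naive_rate, bonferroni = bonf_rate)
```

The naive rate should come out far above the nominal 5%, while the adjusted rate should stay close to it.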