
MPG differences between manual and automatic transmission

Erick Farias

Executive summary
This report addresses two questions: "Is an automatic or manual transmission better for MPG?" and, if there is a
difference, "How much is it?"
The assessment started with an exploratory analysis, followed by the fitting of a linear model.
The variables for the initial model were selected by the criterion of least multicollinearity. This initial model was then
tested against larger models (adding other variables), and the best model was selected by evaluating the ANOVA (nested
model testing), the predictive R-squared and the residuals.
After testing, the most adequate model was found to be MPG ~ Cylinders + Transmission Type +
Horse Power + Weight, with a predictive R-squared of 80%.
Interpreting the coefficients, we conclude that, holding all other variables fixed, there is no statistically significant
difference in mpg between automatic and manual transmissions (see the section Coefficient Interpretation for a deeper explanation).

Analysis step by step


Exploratory analysis
We started by making a boxplot of mpg, broken down by transmission type (am).
## Loading data
data <- as.data.frame(mtcars)
## Exploratory analysis:
## 1. Boxplot to see if there are differences between manual and automatic transmission
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
## am is stored as a number (0 = automatic, 1 = manual), so convert it to a factor for the x axis
g <- ggplot(data, aes(x = factor(am), y = mpg))
g <- g + geom_boxplot(aes(fill = factor(am)))
g <- g + xlab("Transmission Type (0 = automatic, 1 = manual)")
g <- g + ylab("Miles per Gallon")
g

In this plot, manual cars appear to run more miles per gallon. We'll proceed to a regression model to check whether
underlying variables explain the change in mpg that we see when looking at transmission alone. By adjusting for them, we
can isolate the effect of transmission on mpg as much as possible.
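
As a rough numeric companion to the boxplot, the unadjusted difference in mean mpg between the two groups can be computed directly. This is only a sketch: the names group_means and raw_diff are illustrative and not part of the original analysis, and the result ignores every other variable.

```r
# Unadjusted mean mpg per transmission type (0 = automatic, 1 = manual)
data <- as.data.frame(mtcars)
group_means <- tapply(data$mpg, data$am, mean)

# Raw difference manual minus automatic, before adjusting for anything else
raw_diff <- unname(group_means["1"] - group_means["0"])

print(round(group_means, 2))
print(round(raw_diff, 2))
```

This raw gap is the quantity the regression then tries to explain through other variables.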

Linear Model selection


We'll first select the variables with a good correlation to mpg and little correlation with each other. We'll assess this by
making a correlation plot of the variables. The variables not selected a priori will be added in further models, to be compared
in the nested model testing through ANOVA.
## 2. Correlation plot to look for multicollinearity
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.1.3
corr<-cor(data)
corrplot(corr, method="circle", type="upper")

Variables picked for the initial model: 1. Transmission (am), because it is the variable that answers the question of interest;
2. Horse Power (hp), because it has a strong linear relationship with mpg and a weak one with transmission type, so the
variance the two explain overlaps little.
Initially we also expect the variables qsec (1/4 mile time) and gear (number of forward gears) not to be significant for
the model, as they have little correlation with mpg.
Our expectation is that these two picked variables explain the largest share of the variation, and that adding others will add
little explanatory power because their explained variance overlaps. However, we'll test the others anyway, adding them one
at a time to successive models and then comparing all of them in a nested model test through ANOVA.
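
The selection criterion described above can also be read off the correlation matrix numerically rather than from the plot. The sketch below is an illustration only; the name mpg_corr is an assumption and does not appear in the original code.

```r
data <- as.data.frame(mtcars)
corr <- cor(data)

# Correlation of every candidate predictor with mpg, strongest first
mpg_corr <- corr["mpg", colnames(corr) != "mpg"]
mpg_corr <- mpg_corr[order(abs(mpg_corr), decreasing = TRUE)]
print(round(mpg_corr, 2))

# Correlation between the two predictors picked for the initial model
print(round(corr["am", "hp"], 2))
```

A predictor is attractive when its entry in mpg_corr is large in absolute value while its correlation with the predictors already in the model is small.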

Nested Model Testing (ANOVA)


##Model selection strategy:
##Nested Model Testing through ANOVA
fit1<-lm(data = data, mpg ~ factor(am))
fit2<-update(fit1, mpg ~ factor(am) + hp)
fit3<-update(fit2, mpg~factor(am) + hp + factor(cyl))
fit4<-update(fit3, mpg ~factor(am) + hp + factor(cyl) + wt)
fit5<-update(fit4, mpg~ factor(am) + hp + factor(cyl) + wt + disp)
fit6<-update(fit4, mpg~ factor(am) + hp + factor(cyl) + wt + disp + qsec)
fit7<-update(fit4, mpg~ factor(am) + hp + factor(cyl) + wt + disp + qsec + drat)
fit8<-update(fit4, mpg~ factor(am) + hp + factor(cyl) + wt + disp + drat + qsec + vs)
fit9<-update(fit4, mpg~ factor(am) + hp + factor(cyl) + wt + disp + drat + qsec + vs + gear)
fit10<-update(fit4, mpg~ factor(am) + hp + factor(cyl) + wt + disp + drat + qsec + vs + gear + carb)
anova(fit1,fit2,fit3,fit4,fit5,fit6,fit7,fit8,fit9,fit10)
## Analysis of Variance Table
##
## Model  1: mpg ~ factor(am)
## Model  2: mpg ~ factor(am) + hp
## Model  3: mpg ~ factor(am) + hp + factor(cyl)
## Model  4: mpg ~ factor(am) + hp + factor(cyl) + wt
## Model  5: mpg ~ factor(am) + hp + factor(cyl) + wt + disp
## Model  6: mpg ~ factor(am) + hp + factor(cyl) + wt + disp + qsec
## Model  7: mpg ~ factor(am) + hp + factor(cyl) + wt + disp + qsec + drat
## Model  8: mpg ~ factor(am) + hp + factor(cyl) + wt + disp + drat + qsec +
##     vs
## Model  9: mpg ~ factor(am) + hp + factor(cyl) + wt + disp + drat + qsec +
##     vs + gear
## Model 10: mpg ~ factor(am) + hp + factor(cyl) + wt + disp + drat + qsec +
##     vs + gear + carb
##    Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1      30 720.90
## 2      29 245.44  1    475.46 71.3239 4.997e-08 ***
## 3      27 197.20  2     48.24  3.6183   0.04558 *
## 4      26 151.03  1     46.17  6.9265   0.01599 *
## 5      25 150.41  1      0.62  0.0925   0.76413
## 6      24 142.33  1      8.08  1.2118   0.28404
## 7      23 141.21  1      1.12  0.1687   0.68562
## 8      22 139.02  1      2.18  0.3275   0.57354
## 9      21 135.27  1      3.75  0.5629   0.46183
## 10     20 133.32  1      1.95  0.2921   0.59485
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For the first three added variables (Horse Power [hp], Number of Cylinders [cyl] and Weight [wt]) we had a p-value <= 0.05,
showing that these variables are significant for the model when weighing variance explained against complexity added. Thus we'll
consider them for further testing, selecting the model fit4.
For models 5 to 10 the p-values are > 0.05, showing that we should not add these variables, since they are insignificant by the
same criterion.
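
A single step of the nested comparison can be isolated as a two-model test. The sketch below refits the third and fourth models and extracts the p-value for adding wt; the names cmp and p_wt are illustrative. Note that when many models are passed to anova() at once, each F statistic uses the residual mean square of the largest model as its denominator, so this pairwise F and p-value will differ somewhat from the corresponding row of the ten-model table.

```r
data <- as.data.frame(mtcars)

# Model without weight, and the same model with weight added
fit3 <- lm(mpg ~ factor(am) + hp + factor(cyl), data = data)
fit4 <- update(fit3, . ~ . + wt)

# F test of the nested pair: does wt reduce the residual sum of squares
# by more than chance alone would explain?
cmp <- anova(fit3, fit4)
p_wt <- cmp[2, "Pr(>F)"]

print(cmp)
print(p_wt)
```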

Analyzing the R-squared
We chose to analyze both the adjusted R-squared and the predictive R-squared as measures of a good fit.
The adjusted R-squared increases only if the new term improves the model more than would be expected by chance, and it can
also decrease with poor quality predictors.
The predictive R-squared is a form of cross-validation and it can also decrease. Cross-validation determines how well your
model generalizes to other data sets by partitioning your data.
## The predictive R-squared
pred_r_squared <- function(linear.model) {
lm.anova <- anova(linear.model)
tss <- sum(lm.anova$"Sum Sq")
# predictive R^2
pred.r.squared <- 1 - PRESS(linear.model)/(tss)
return(pred.r.squared)
}
PRESS <- function(linear.model) {
pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
PRESS <- sum(pr^2)
return(PRESS)
}
summary(fit4)$r.squared ##Multiple R squared
## [1] 0.8658799
summary(fit4)$adj.r.squared ## Adjusted R squared
## [1] 0.8400875
pred_r_squared(fit4) ## Predictive R squared
## [1] 0.8015456

The predictive R-squared is a little smaller than the multiple and adjusted R-squared. One way to think of this is that about
6.4 percentage points (86.6% minus 80.2%) of the apparent fit comes from having many terms and from random correlations,
which we would have attributed to our model if we were using only the multiple R-squared.
When the model is good and has few terms, the differences are small, which is the case here.
So we have further evidence to stay with the model fit4: judging by the predictive R-squared, about 80% of the variance in
mpg is explained by this model.
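
The PRESS function above relies on the identity that a leave-one-out prediction error equals the ordinary residual divided by (1 - leverage). As a sanity check, the sketch below recomputes PRESS by explicitly refitting the model 32 times, each time without one car. The names press_shortcut, press_loop and pred_r2 are illustrative and not part of the original code.

```r
data <- as.data.frame(mtcars)
fit4 <- lm(mpg ~ factor(am) + hp + factor(cyl) + wt, data = data)

# Shortcut: leave-one-out residual = e_i / (1 - h_ii)
press_shortcut <- sum((residuals(fit4) / (1 - lm.influence(fit4)$hat))^2)

# Explicit version: refit without row i, then predict row i
loo_error <- sapply(seq_len(nrow(data)), function(i) {
  fit_i <- lm(mpg ~ factor(am) + hp + factor(cyl) + wt, data = data[-i, ])
  data$mpg[i] - predict(fit_i, newdata = data[i, , drop = FALSE])
})
press_loop <- sum(unname(loo_error)^2)

# Predictive R-squared from the explicit loop
tss <- sum((data$mpg - mean(data$mpg))^2)
pred_r2 <- 1 - press_loop / tss

print(c(shortcut = press_shortcut, loop = press_loop))
print(pred_r2)
```

Both routes should agree up to floating-point error, which is why the shortcut is preferred: it needs only one model fit.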

Residual analysis
Now, presupposing that this is a good model, let's take a look at the residuals:
1. If we have an intercept, the mean of the residuals should be ~0
2. Correlations between residuals and predictors must be ~0
3. We must see no pattern in the behavior of the residuals

##Residuals analysis
##1.
mean(resid(fit4))
## [1] 1.12757e-16
##2.
cor(resid(fit4), data$hp)
## [1] -1.560353e-16
cor(resid(fit4), data$cyl)
## [1] 2.350835e-17
cor(resid(fit4), data$am)
## [1] 7.500521e-18
cor(resid(fit4), data$wt)
## [1] -1.027771e-16
##3.
par(mfrow=c(2,2))
plot(fit4)

We see from 1. that the mean of the residuals is ~0, and from 2. that the residuals have no meaningful correlation with the predictors.
From 3. we see that there's no relevant pattern in the behavior of the residuals and that they're approximately normally distributed.
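
Checks 1 and 2 can be done in one step, since least squares makes the residuals orthogonal to every column of the design matrix, including the intercept column. The sketch below verifies this and, as an optional complement to the Q-Q plot, runs a Shapiro-Wilk test on the residuals. The Shapiro-Wilk test is an addition for illustration and was not part of the original analysis; the names X, ortho and sw are likewise assumptions.

```r
data <- as.data.frame(mtcars)
fit4 <- lm(mpg ~ factor(am) + hp + factor(cyl) + wt, data = data)

# Design matrix: intercept, am dummy, hp, cyl dummies, wt
X <- model.matrix(fit4)

# Inner product of each design column with the residuals; all should be ~0
ortho <- crossprod(X, resid(fit4))
print(max(abs(ortho)))

# Formal normality check on the residuals (a large p-value gives no
# evidence against normality; it does not prove normality)
sw <- shapiro.test(resid(fit4))
print(sw$p.value)
```

Because the orthogonality holds by construction, checks 1 and 2 confirm that the fit was computed correctly rather than that the model is adequate; the diagnostic plots carry most of the evidence.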

Coefficient Interpretation
###Coefficient interpretation
summary(fit4)
##
## Call:
## lm(formula = mpg ~ factor(am) + hp + factor(cyl) + wt, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9387 -1.2560 -0.4013  1.1253  5.0513
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  33.70832    2.60489  12.940 7.73e-13 ***
## factor(am)1   1.80921    1.39630   1.296  0.20646
## hp           -0.03211    0.01369  -2.345  0.02693 *
## factor(cyl)6 -3.03134    1.40728  -2.154  0.04068 *
## factor(cyl)8 -2.16368    2.28425  -0.947  0.35225
## wt           -2.49683    0.88559  -2.819  0.00908 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

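# 95% confidence interval for the transmission estimate
# (manual relative to automatic)
sumCoef <- summary(fit4)$coefficients
# use the residual degrees of freedom of fit4 (26), the model the estimate comes from
round(sumCoef[2,1] + c(-1,1) * qt(.975, df = fit4$df.residual) * sumCoef[2,2], 2)
## [1] -1.06  4.68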

Selecting the model 'fit4' we see that:

Holding all other variables fixed, the estimate says a manual transmission comes with an expected increase of 1.8 miles
per gallon compared to an automatic transmission. However, the 95% confidence interval runs from roughly -1.1 to 4.7 and
so includes zero. Together with the p-value of the estimate (0.206, above 0.05), this shows there is no statistical evidence
of a difference in MPG attributable to the type of transmission alone. This suggests that the
difference seen in the exploratory analysis was due to underlying variables that are associated with the type of transmission.
Further research could check on this.
We also see that the coefficient for 8 cylinders is not statistically significant (p = 0.352), i.e. it shows no significant
difference from 4 cylinders.
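
The hand-computed interval for the transmission coefficient can be cross-checked against R's built-in confint(). The sketch below refits the model so it runs on its own; the names est, manual_ci and builtin_ci are illustrative.

```r
data <- as.data.frame(mtcars)
fit4 <- lm(mpg ~ factor(am) + hp + factor(cyl) + wt, data = data)

# Estimate and standard error of the manual-transmission coefficient
est <- coef(summary(fit4))["factor(am)1", ]

# Manual 95% interval: estimate +/- t quantile * standard error,
# with the residual degrees of freedom of the same model
manual_ci <- unname(est["Estimate"] +
  c(-1, 1) * qt(0.975, df = fit4$df.residual) * est["Std. Error"])

# Built-in equivalent
builtin_ci <- confint(fit4, "factor(am)1", level = 0.95)

print(manual_ci)
print(builtin_ci)
```

Both intervals should coincide and include zero, in line with the p-value of the coefficient being above 0.05.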