
Prof. Baumer MTH 220: Lecture notes April 4th, 2014


Agenda
1. Exam 2 Recap
2. Fitting the Least Squares Line
3. Conditions for Regression
4. R²
Exam Review The average score on the exam was 82.7. The minimum score to earn a grade of at least B- is 80. Let's call this the Exam 2 threshold. Once again we are offering a bootstrap incentive:
If you scored at least an 80 on this exam, but below 80 on the first exam, then your first exam score is now an 80.
If you scored below 80 on this exam, you have the chance to raise your score on this exam (Exam 2) to 80, if you score above the Exam 3 threshold on the third exam.
Fitting the Least Squares Line The following plot illustrates the relationship between the
poverty rate and the high school graduation rate among all 50 states and the District of Columbia.
# read the poverty data (tab-separated) and load the mosaic package
poverty <- read.csv("http://math.smith.edu/~bbaumer/mth241/poverty.txt", sep = "\t")
require(mosaic)
# scatterplot of poverty rate against high school graduation rate
xyplot(Poverty ~ Graduates, data = poverty, xlab = "Graduation Rate", ylab = "Poverty Rate",
    pch = 19)
[Figure: scatterplot of Poverty Rate vs. Graduation Rate]
Use the following summary statistics to calculate the least squares regression line.
favstats(Poverty, data = poverty)
## min Q1 median Q3 max mean sd n missing
## 5.6 9.25 10.6 13.4 18 11.35 3.099 51 0
favstats(Graduates, data = poverty)
## min Q1 median Q3 max mean sd n missing
## 77.2 83.3 86.9 88.7 92.1 86.01 3.726 51 0
cor(Poverty ~ Graduates, data = poverty)
## [1] -0.7469
Slope:
Intercept:
Interpretation:
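For reference (a reminder added here; the worked numbers appear in the Solution to Fitting the Line section below), the least squares estimates follow directly from the summary statistics:

b1 = r * (s_y / s_x) = (-0.7469) * (3.099 / 3.726) ≈ -0.62
b0 = ȳ - b1 * x̄ = 11.35 - (-0.62)(86.01) ≈ 64.8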
Conditions for the Least Squares Line Check the conditions for a least squares regression
line.
# fit the simple linear regression model
mod <- lm(Poverty ~ Graduates, data = poverty)
# residuals vs. fitted values: check linearity and constant variability
xyplot(residuals(mod) ~ fitted.values(mod), type = c("p", "r", "smooth"), pch = 19)
# distribution of the residuals: check the nearly normal condition
histogram(~residuals(mod), fit = "normal")
qqmath(~residuals(mod))
ladd(panel.qqmathline(residuals(mod)))
# base R diagnostic plots for the same model
plot(mod)
Linearity:
Constant Variability:
Nearly Normal Residuals:
Practice Problems
1. (EOCE 5.17) The Association of Turkish Travel Agencies reports the number of foreign tourists visiting Turkey and tourist spending by year. The scatterplot below shows the relationship between these two variables along with the least squares fit.
[Figures: scatterplot of Tourist Spending (millions of $) vs. Number of Tourists (thousands) with the least squares fit; residual plot of residuals(mod.t) vs. Number of Tourists (thousands); density histogram of residuals(mod.t)]
(a) Describe the relationship between number of tourists and spending.
(b) What are the explanatory and response variables?
(c) Why might we want to fit a regression line to these data?
(d) Do the data meet the conditions required for fitting a least squares line? In addition to the scatterplot, use the residual plot and histogram to answer this question.
2. Working together on laptops, load in the RailTrail data set in the mosaic package.
(a) Describe the relationship between volume of ridership and the high temperature (hightemp).
(b) What are the explanatory and response variables?
(c) Fit a regression line using lm(). What is the equation for your line?
(d) Examine the residual plots. Do the conditions for regression appear reasonable?
Measuring the Strength of Fit Just as we were able to quantify the strength of the linear relationship between two variables with the correlation coefficient, r, we can quantify the percentage of variation in the response variable (y) that is explained by the explanatory variables. This quantity is called the coefficient of determination and is denoted R².
Like any proportion, R² is always between 0 and 1.
For simple linear regression (one explanatory variable), R² = r².
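Concretely, the code below computes R² from the decomposition of variability (the formula itself is not written out above):

R² = 1 - SSE/SST, where SST = Σ(yᵢ - ȳ)² is the total sum of squares and SSE = Σ(yᵢ - ŷᵢ)² is the sum of squared residuals.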
xyplot(Poverty ~ Graduates, data = poverty, xlab = "Graduation Rate", ylab = "Poverty Rate",
type = c("p", "r"), pch = 19, lwd = 3)
n <- nrow(poverty)
# total sum of squares: variability of the response around its mean
SST <- var(~Poverty, data = poverty) * (n - 1)
# sum of squared errors: variability left over in the residuals
SSE <- var(residuals(mod)) * (n - 1)
1 - SSE/SST
## [1] 0.5578
rsquared(mod)
## [1] 0.5578
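As a quick check (added here, not part of the original notes), squaring the correlation computed earlier gives the same value for this simple linear regression:
cor(Poverty ~ Graduates, data = poverty)^2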
Solution to Fitting the Line Do it by hand.
# summary stats to calculate slope and intercept
fv.p = favstats(Poverty, data = poverty)
fv.g = favstats(Graduates, data = poverty)
x.bar <- fv.g[1, "mean"]
s.x <- fv.g[1, "sd"]
y.bar <- fv.p[1, "mean"]
s.y <- fv.p[1, "sd"]
xyplot(Poverty ~ Graduates, data = poverty, ylab = "% in poverty", xlab = "% HS grad",
pch = 19)
ladd(panel.abline(v = x.bar, lty = 2))
ladd(panel.abline(h = y.bar, lty = 2))
# slope and intercept from the summary statistics
R <- cor(Graduates ~ Poverty, data = poverty)
b1 <- (s.y/s.x) * R
b0 <- y.bar - b1 * x.bar
mod <- lm(Poverty ~ Graduates, data = poverty)
coef(mod)
## (Intercept) Graduates
## 64.7810 -0.6212
ladd(panel.abline(mod, lwd = 2, col = "goldenrod"))
Figure 1: "And by the third trimester there'll be hundreds of babies inside you." (http://xkcd.com/605/)
[Figure: scatterplot of % in poverty vs. % HS grad with the fitted least squares line]
Solution to RailTrail Is there a non-linear fit?
xyplot(volume ~ hightemp, data = RailTrail, type = c("p", "r", "smooth"), pch = 19)
[Figure: scatterplot of volume vs. hightemp with least squares and smooth fits]
mod.rt = lm(volume ~ hightemp, data = RailTrail)
summary(mod.rt)
##
## Call:
## lm(formula = volume ~ hightemp, data = RailTrail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -254.56 -57.80 8.74 57.35 314.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.079 59.395 -0.29 0.77
## hightemp 5.702 0.848 6.72 1.7e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 104 on 88 degrees of freedom
## Multiple R-squared: 0.339,  Adjusted R-squared: 0.332
## F-statistic: 45.2 on 1 and 88 DF, p-value: 1.71e-09
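Reading off the coefficient estimates above, the fitted line for part (c) of the exercise is approximately volume-hat = -17.1 + 5.70 · hightemp: each additional degree of high temperature is associated with an increase in volume of about 5.7, on average.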
# plot(mod.rt)
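The commented plot(mod.rt) call would show the base R diagnostics. A minimal sketch (added here, not part of the original solution) of the same residual checks used earlier for the poverty model:

# residuals vs. fitted values: look for curvature (non-linearity) and fanning
xyplot(residuals(mod.rt) ~ fitted.values(mod.rt), type = c("p", "r", "smooth"), pch = 19)
# distribution of the residuals: check the nearly normal condition
histogram(~residuals(mod.rt), fit = "normal")
qqmath(~residuals(mod.rt))
ladd(panel.qqmathline(residuals(mod.rt)))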
