Prof. Baumer MTH 220: Lecture notes April 4th, 2014

Agenda

1. Exam 2 Recap
2. Fitting the Least Squares Line
3. Conditions for Regression
4. R^2

Exam Review

The average score on the exam was 82.7. The minimum score to earn a grade of at least B- is 80. Let's call this the Exam 2 threshold. Once again we are offering a bootstrap incentive:

- If you scored at least an 80 on this exam, but below 80 on the first exam, then your first exam score is now an 80.
- If you scored below 80 on this exam, you have the chance to raise your score on this exam (Exam 2) to 80, if you score above the Exam 3 threshold on the third exam.

Fitting the Least Squares Line

The following plot illustrates the relationship between the poverty rate and the high school graduation rate among all 50 states and the District of Columbia.

poverty <- read.csv("http://math.smith.edu/~bbaumer/mth241/poverty.txt", sep = "\t")
require(mosaic)
xyplot(Poverty ~ Graduates, data = poverty, xlab = "Graduation Rate",
       ylab = "Poverty Rate", pch = 19)

[Scatterplot: Poverty Rate vs. Graduation Rate]

Use the following summary statistics to calculate the least squares regression line.

favstats(Poverty, data = poverty)
##  min   Q1 median   Q3 max  mean    sd  n missing
##  5.6 9.25   10.6 13.4  18 11.35 3.099 51       0
favstats(Graduates, data = poverty)
##  min   Q1 median   Q3  max  mean    sd  n missing
## 77.2 83.3   86.9 88.7 92.1 86.01 3.726 51       0
cor(Poverty ~ Graduates, data = poverty)
## [1] -0.7469

Slope:

Intercept:

Interpretation:

Conditions for the Least Squares Line

Check the conditions for a least squares regression line.

mod = lm(Poverty ~ Graduates, data = poverty)
xyplot(residuals(mod) ~ fitted.values(mod), data = poverty,
       type = c("p", "r", "smooth"), pch = 19)
histogram(~residuals(mod), fit = "normal")
qqmath(~residuals(mod))
ladd(panel.qqmathline(residuals(mod)))
plot(mod)

Linearity:

Constant Variability:

Nearly Normal Residuals:

Practice Problems

1. (EOCE 5.17) The Association of Turkish Travel Agencies reports the number of foreign tourists visiting Turkey and tourist spending by year. The scatterplot below shows the relationship between these two variables along with the least squares fit.

[Scatterplot: Tourist Spending (millions of $) vs. Number of Tourists (thousands), with least squares fit]
[Residual plot: residuals(mod.t) vs. Number of Tourists (thousands)]

[Histogram: density of residuals(mod.t)]

(a) Describe the relationship between number of tourists and spending.
(b) What are the explanatory and response variables?
(c) Why might we want to fit a regression line to these data?
(d) Do the data meet the conditions required for fitting a least squares line? In addition to the scatterplot, use the residual plot and histogram to answer this question.

2. Working together on laptops, load in the RailTrail data set in the mosaic package.

(a) Describe the relationship between volume of ridership and the high temperature (hightemp).
(b) What are the explanatory and response variables?
(c) Fit a regression line using lm(). What is the equation for your line?
(d) Examine the residual plots. Do the conditions for regression appear reasonable?

Measuring the Strength of Fit

Just as we were able to quantify the strength of the linear relationship between two variables with the correlation coefficient, r, we can quantify the percentage of variation in the response variable (y) that is explained by the explanatory variables. This quantity is called the coefficient of determination and is denoted R^2.

- Like any percentage, R^2 is always between 0 and 1.
- For simple linear regression (one explanatory variable), R^2 = r^2. Here, R^2 = (-0.7469)^2 ≈ 0.558, so the graduation rate explains about 56% of the variation in poverty rates.

xyplot(Poverty ~ Graduates, data = poverty, xlab = "Graduation Rate",
       ylab = "Poverty Rate", type = c("p", "r"), pch = 19, lwd = 3)
n = nrow(poverty)
SST = var(~Poverty, data = poverty) * (n - 1)
SSE = var(residuals(mod)) * (n - 1)
1 - SSE/SST
## [1] 0.5578
rsquared(mod)
## [1] 0.5578

Solution to Fitting the Line

Do it by hand.
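The by-hand calculation plugs the summary statistics reported earlier (means, standard deviations, and the correlation) into the least squares formulas:

```latex
% Slope and intercept from summary statistics
b_1 = r \cdot \frac{s_y}{s_x}
    = (-0.7469) \cdot \frac{3.099}{3.726} \approx -0.621
\qquad
b_0 = \bar{y} - b_1 \bar{x}
    = 11.35 - (-0.621)(86.01) \approx 64.78
```

These agree with the coefficients reported by coef(mod).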
# summary stats to calculate slope and intercept
fv.p = favstats(Poverty, data = poverty)
fv.g = favstats(Graduates, data = poverty)
x.bar <- fv.g[1, "mean"]
s.x <- fv.g[1, "sd"]
y.bar <- fv.p[1, "mean"]
s.y <- fv.p[1, "sd"]
xyplot(Poverty ~ Graduates, data = poverty, ylab = "% in poverty",
       xlab = "% HS grad", pch = 19)
ladd(panel.abline(v = x.bar, lty = 2))
ladd(panel.abline(h = y.bar, lty = 2))
R <- cor(Graduates ~ Poverty, data = poverty)
b1 <- (s.y/s.x) * R
b0 <- y.bar - b1 * x.bar
mod <- lm(Poverty ~ Graduates, data = poverty)
coef(mod)
## (Intercept)   Graduates
##     64.7810     -0.6212
ladd(panel.abline(mod, lwd = 2, col = "goldenrod"))

Figure 1: And by the third trimester there'll be hundreds of babies inside you. (http://xkcd.com/605/)

[Scatterplot: % in poverty vs. % HS grad, with mean lines and fitted line]
Solution to RailTrail

Is there a non-linear fit?

xyplot(volume ~ hightemp, data = RailTrail, type = c("p", "r", "smooth"), pch = 19)

[Scatterplot: volume vs. hightemp, with linear and smooth fits]

mod.rt = lm(volume ~ hightemp, data = RailTrail)
summary(mod.rt)
##
## Call:
## lm(formula = volume ~ hightemp, data = RailTrail)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -254.56  -57.80    8.74   57.35  314.03
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -17.079     59.395   -0.29     0.77
## hightemp       5.702      0.848    6.72  1.7e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 104 on 88 degrees of freedom
## Multiple R-squared: 0.339, Adjusted R-squared: 0.332
## F-statistic: 45.2 on 1 and 88 DF, p-value: 1.71e-09

# plot(mod.rt)
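For part (d) of the RailTrail problem, the regression conditions can be checked with the same diagnostic plots used above for the poverty model. A sketch (assuming the mosaic package and the RailTrail data are loaded as above):

```r
# Diagnostic plots for the RailTrail model (sketch)
require(mosaic)
mod.rt <- lm(volume ~ hightemp, data = RailTrail)

# Linearity and constant variability: residuals vs. fitted values
xyplot(residuals(mod.rt) ~ fitted.values(mod.rt),
       type = c("p", "r", "smooth"), pch = 19)

# Nearly normal residuals: histogram and normal quantile plot
histogram(~residuals(mod.rt), fit = "normal")
qqmath(~residuals(mod.rt))
ladd(panel.qqmathline(residuals(mod.rt)))
```

Look for no pattern in the residual plot, roughly constant spread, and points close to the line in the quantile plot.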