
Chapter 9: Correlation & Simple Linear Regression


Waist & Fat
• Example 9.3.1 (P. 413)
• Expecting a “Linear Relationship”
• Research Questions:

• Waist is easy to measure, and can be measured accurately
• Fat is not easy to measure, and the accuracy of the measurement is hard to judge
• It helps if we can establish a relationship between Waist & Fat
• Does a linear relationship exist between Waist & Fat?
• How strong is this linear relationship?
• How well can we predict Fat from Waist?
Correlation
• Linear relationship between two continuous variables
• “Scatter plot”
• Correlation Coefficient R(X, Y)
• -1 ≤ R ≤ 1: Strength & direction of linear relationship between X & Y
• 0 < R < 1: positive linear correlation; Y tends to increase as X increases
• -1 < R < 0: negative linear correlation; Y tends to decrease as X increases
• R = 1 or R = -1: perfect linear correlation
• R = 0: no linear correlation
• Coefficient of Determination R²
• % of variability in Y (or X) that can be explained by X (or Y)

• Estimation & Testing


• Assumption: X & Y from normal populations
• H0: R = 0 vs. H1: R ≠ 0
• SAS proc corr
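
As an illustration of the quantities a correlation procedure reports, here is a minimal Python sketch (Python is an assumption here; the slides use SAS proc corr). It computes R, R² and the t statistic commonly used to test H0: R = 0, which is compared with a t distribution on n - 2 degrees of freedom. The waist and fat values are made up for illustration and are not the data of Example 9.3.1; the function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def corr_t_statistic(r, n):
    """t statistic for H0: no linear correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Made-up illustrative values (NOT the data of Example 9.3.1)
waist = [74, 79, 84, 89, 94, 99, 104, 109, 114, 119]   # cm
fat = [12, 15, 14, 19, 21, 20, 25, 27, 26, 31]         # % body fat

r = pearson_r(waist, fat)
t = corr_t_statistic(r, len(waist))
print(f"R = {r:.3f}, R^2 = {r * r:.3f}, t = {t:.2f} on {len(waist) - 2} df")
```

A large |t| relative to the t critical value is evidence against H0; the p-value itself is left to a statistics package.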
Simple Linear Regression
• Linear relationship between two continuous variables
• Predicting value of Y based on X
• X: covariate (Waist)
• Y: response (Fat)

• Relationship, not causality!
• A man with a larger waist circumference also tends to have higher body fat, but that does not mean a larger waist circumference causes higher body fat!
Linear Regression: Model
• Y = β0 + β1·X + ε
• β0: Intercept
• β1: Slope; indicator of linear relationship
• ε: random error
• Regression Line: μY|X = β0 + β1·X
• Y values from different x are independent
• Do Waist & Fat satisfy these assumptions?


Linear Reg.: Research Topics

• Explore data: assumptions satisfied?


• Use scatter plot

• Estimate model: what is the quantitative relationship?

• Evaluate model: Relationship strong? Prediction good?

• SAS proc reg.


• Run Dropbox/Regression/Chp9 SAS
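
The "estimate model" step can be sketched by hand. The Python snippet below is a minimal illustration, not the SAS program referenced above: it computes the least-squares intercept and slope from the usual closed-form expressions (slope = Sxy / Sxx, intercept = ȳ - slope·x̄). The function name `fit_line` and the data values are assumptions made up for this example.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx            # estimated slope
    b0 = y_bar - b1 * x_bar     # estimated intercept
    return b0, b1

# Made-up illustrative values (NOT the data of Example 9.3.1)
waist = [74, 79, 84, 89, 94, 99, 104, 109, 114, 119]   # cm
fat = [12, 15, 14, 19, 21, 20, 25, 27, 26, 31]         # % body fat

b0, b1 = fit_line(waist, fat)
print(f"fitted line: Fat = {b0:.2f} + {b1:.3f} * Waist")
```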
Linear Reg.: Linear Relationship
• Model Interpretation
• For every increase of Δx in the predictor X, the mean response Y changes by β1·Δx
• Always in terms of the change!

• Existence of Relationship: H0: β1 = 0


• {H0 Not Rejected}: “Based on our data, no evidence supports a linear relationship between Y & X. Other relationships might exist.”
• {H0 Rejected}: “Our data supports a linear relationship between Y & X”

• Does the data support a linear relationship between Waist & Fat?
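
A minimal sketch of how the test of H0: β1 = 0 is usually carried out: the estimated slope is divided by its standard error, and the ratio is compared with a t distribution on n - 2 degrees of freedom. The Python code, the function name and the data below are illustrative assumptions, not the course's SAS output.

```python
import math

def slope_t_statistic(x, y):
    """Return (t, df) for testing H0: slope = 0 in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares and residual variance estimate
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    se_b1 = math.sqrt(s2 / s_xx)   # standard error of the slope
    return b1 / se_b1, n - 2

# Made-up illustrative values (NOT the data of Example 9.3.1)
waist = [74, 79, 84, 89, 94, 99, 104, 109, 114, 119]
fat = [12, 15, 14, 19, 21, 20, 25, 27, 26, 31]

t, df = slope_t_statistic(waist, fat)
print(f"t = {t:.2f} on {df} df")
```

With a single predictor, this t statistic coincides with the one for testing zero correlation between X and Y.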
Linear Reg.: Model Strength
• ANOVA in linear regression model
• SST: total variation in Y; SST = SSR + SSE
• SSR: variation explained by linear regression
• SSE: unexplained/error/residual variation
• Least-squares estimates minimize SSE
• Coefficient of Determination R² = SSR / SST
• R² in SAS output “Analysis of Variance”
• 0 ≤ R² ≤ 1; larger R² means stronger model
• R² = [corr(X, Y)]² if the model has only one X

• What % of variance does Fat ~ Waist model explain?
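
The decomposition of the total variation can be checked numerically. The following Python sketch (an illustration with made-up numbers and a hypothetical function name, not the SAS output) computes SST, SSR and SSE for a fitted line and the resulting share of explained variation, SSR / SST.

```python
def anova_decomposition(x, y):
    """Return (SST, SSR, SSE) for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # explained
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual
    return sst, ssr, sse

# Made-up illustrative values (NOT the data of Example 9.3.1)
waist = [74, 79, 84, 89, 94, 99, 104, 109, 114, 119]
fat = [12, 15, 14, 19, 21, 20, 25, 27, 26, 31]

sst, ssr, sse = anova_decomposition(waist, fat)
r_squared = ssr / sst
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}, R^2 = {r_squared:.3f}")
```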


Linear Reg: Estimation & Prediction
• “Narrower” interval: 100(1-α)% confidence (estimation) interval of μY|X
• “Wider” interval: 100(1-α)% prediction interval of Y|X

• People often use the words “confidence” or “estimation” for parameters
• … and use “prediction” for future observations / subjects
• Given x=x0, I have 95% confidence that the confidence (estimation) interval will cover the mean of y|x=x0
• I have 95% confidence that the next y corresponding to x0 will fall in the prediction interval
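
The two intervals can be compared side by side with the standard textbook formulas: both are centred at the fitted value at x0, and both use the same t critical value on n - 2 degrees of freedom. The Python sketch below is an illustration under assumptions: the data are made up, the function name is hypothetical, and the critical value 2.306 (two-sided 95%, 8 df) is taken from a t table rather than computed.

```python
import math

def intervals_at(x, y, x0, t_crit):
    """Return (confidence_interval, prediction_interval) at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))               # residual standard deviation
    y_hat = b0 + b1 * x0                       # fitted value at x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
    se_mean = s * math.sqrt(leverage)          # for the mean of Y at x0
    se_pred = s * math.sqrt(1 + leverage)      # for a single new Y at x0
    ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
    pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
    return ci, pi

# Made-up illustrative values (NOT the data of Example 9.3.1)
waist = [74, 79, 84, 89, 94, 99, 104, 109, 114, 119]
fat = [12, 15, 14, 19, 21, 20, 25, 27, 26, 31]

ci, pi = intervals_at(waist, fat, x0=100, t_crit=2.306)
print(f"95% confidence interval for mean Fat: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% prediction interval for new Fat:  ({pi[0]:.2f}, {pi[1]:.2f})")
```

The extra 1 under the square root in `se_pred` is the variance of a single new observation around its mean, which is why the prediction interval is always the wider of the two.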
Summary
• A regression model describes the conditional distribution of Y|X=x, or
certain characteristics of it, as a function of the explanatory variables x
• We estimate such models on the basis of samples of pairs of random
variables (Y,X)
• It is convenient to assume that a regression model consists of signal and
noise, i.e. a deterministic part and an error term
Extra: Dummy Coding
• Use (k-1) dummy variables for a k-level categorical predictor

• Study the effect of Gender


• Define one dummy variable Gender: =0 (male); =1 (female)

• Iris sepal length (Short, Medium & Long)


• Wrong: Iris: =0 (Short); =1 (Medium); =2 (Long). This imposes an “equally spaced” assumption
• Correct: define two dummy variables IrisM & IrisL, i.e. Short is the “reference”

        Short   Medium   Long
IrisM     0       1        0
IrisL     0       0        1
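
A minimal Python sketch of the (k-1) dummy-variable coding described above, with Short as the reference level. The function name `dummy_code` is hypothetical, and the dummies are keyed by level name rather than by the slide's variable names IrisM and IrisL.

```python
def dummy_code(value, levels, reference):
    """Code one categorical value as k-1 dummy variables (0/1).

    The reference level gets no dummy of its own: it is the case
    where every dummy equals 0.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels if level != reference}

levels = ["Short", "Medium", "Long"]
for value in levels:
    print(value, dummy_code(value, levels, reference="Short"))
```

Each dummy's regression coefficient is then read as the difference from the reference level, with no assumption that the levels are equally spaced.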
Homework
• Preview questions of Two-way Table

• In linear regression, why do you think the Prediction Interval is wider than the Confidence (Estimation) Interval?
• Answers can be found in Chapter 9

• Reading the story of Bigfoot & UFO sighting


Part I: Page 378, Exercise 15

ANOVA
Source of Variation   SS            df   MS            F             P-value       F crit
Rows                  8.20068125    15   0.546712083   27.54621796   1.82574E-13   2.014803691
Columns               0.053654167    2   0.026827083   1.351688955   0.274112996   3.315829501
Error                 0.5954125     30   0.019847083
Total                 8.849747917   47