
# Chapter 9: Correlation & Simple Linear Regression

Waist & Fat
• Example 9.3.1 (P. 413)
• Expecting a “Linear Relationship”
• Research Questions:
• Waist is easy to measure, and can be measured accurately
• Fat is not easy to measure, and its accuracy is hard to assess
• It helps if we can establish a relationship between Waist & Fat
• Does a linear relationship exist between Waist & Fat?
• How strong is this linear relationship?
• How well can we predict Fat from Waist?
Correlation
• Linear relationship between two continuous variables
• “Scatter plot”
• Correlation Coefficient R(X, Y)
• -1 ≤ R ≤ 1: strength & direction of the linear relationship between X & Y
• 0 < R < 1: positive linear correlation; Y tends to increase as X increases
• -1 < R < 0: negative linear correlation; Y tends to decrease as X increases
• R = 1 or R = -1: perfect linear correlation
• R = 0: no linear correlation
• Coefficient of Determination R²
• % of variability in Y (or X) that can be explained by X (or Y)
• Estimation & Testing
• Assumption: X & Y from normal populations
• H0: R = 0 vs. H1: R ≠ 0
• SAS proc corr
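
As a rough illustration of the quantities on this slide, the Python sketch below computes the sample correlation coefficient R, the coefficient of determination R², and the t statistic commonly used to test H0: R = 0 (with n - 2 degrees of freedom). The waist and fat values are invented toy numbers, not the data of Example 9.3.1, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic for H0: R = 0; compare against a t table with n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Invented toy data (assumption): NOT the textbook's Waist & Fat sample.
waist = [70, 80, 90, 100, 110]
fat = [20, 40, 50, 40, 50]

r = pearson_r(waist, fat)
r_squared = r ** 2
t_stat = correlation_t_statistic(r, len(waist))

print(f"R = {r:.4f}, R^2 = {r_squared:.4f}, t = {t_stat:.4f}")
```

The p-value would come from comparing the t statistic with a t distribution, a step left out here because it needs a statistics library or a table.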
Simple Linear Regression
• Linear relationship between two continuous variables
• Predicting the value of Y based on X
• X: covariate (Waist)
• Y: response (Fat)
• Relationship, not causality!
• A man with a larger waist circumference also tends to have higher body fat, but that does not mean a larger waist circumference causes higher body fat!
Linear Regression: Model
• Model: Y = β0 + β1·X + ε
• β0: Intercept
• β1: Slope; indicator of linear relationship
• ε: random error
• Regression Line: μY|X = β0 + β1·X
• Do Waist & Fat satisfy these assumptions?

Linear Reg.: Research Topics

• Explore data: assumptions satisfied?
• Use scatter plot
• SAS proc reg
• Run Dropbox/Regression/Chp9 SAS
Linear Reg.: Linear Relationship
• Model Interpretation
• For every Δx increase in the predictor X, the mean response Y changes by β1·Δx
• Always in terms of the change!
• Existence of Relationship: H0: β1 = 0
• {H0 Not Rejected}: “Based on our data, no evidence supports a linear relationship between Y & X. Other relationships might exist.”
• {H0 Rejected}: “Our data support a linear relationship between Y & X”

• Does the data support a linear relationship between Waist & Fat?
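
A minimal sketch of how the slope interpretation and the test of H0: β1 = 0 could be worked by hand, assuming ordinary least squares. The data are invented toy numbers (not Example 9.3.1), and `fit_line` and `slope_t_statistic` are hypothetical helper names.

```python
import math


def fit_line(x, y):
    """Least-squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def slope_t_statistic(x, y):
    """t statistic for H0: slope = 0; compare with a t table on n - 2 df."""
    n = len(x)
    intercept, slope = fit_line(x, y)
    mean_x = sum(x) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    mse = sse / (n - 2)              # estimated error variance
    se_slope = math.sqrt(mse / sxx)  # standard error of the slope
    return slope / se_slope


# Invented toy data (assumption): NOT the textbook's Waist & Fat sample.
waist = [70, 80, 90, 100, 110]
fat = [20, 40, 50, 40, 50]

b0, b1 = fit_line(waist, fat)
t_slope = slope_t_statistic(waist, fat)

# Interpretation "in terms of the change": a 5-unit increase in waist
# shifts the fitted mean of fat by b1 * 5.
delta_x = 5
predicted_change = b1 * delta_x

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, t = {t_slope:.4f}")
print(f"Change in fitted fat for +{delta_x} waist: {predicted_change:.3f}")
```

With a single predictor, this slope t statistic coincides with the t statistic for testing that the correlation is zero.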
Linear Reg.: Model Strength
• ANOVA in linear regression model
• SST: total variation in Y; = SSR + SSE
• SSR: variation explained by linear regression
• SSE: unexplained/error/residual variation
• Least-squares estimates minimize SSE
• Coefficient of Determination R² = SSR / SST
• R² in SAS output “Analysis of Variance”
• 0 ≤ R² ≤ 1; larger R² means stronger model
• R² = [corr(X, Y)]², if model has only one X
• What % of variance does Fat ~ Waist model explain?
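
The decomposition SST = SSR + SSE and the ratio SSR / SST can be checked numerically. The sketch below assumes a least-squares fit with one predictor and uses invented toy data (not Example 9.3.1); `anova_decomposition` is a hypothetical name.

```python
def anova_decomposition(x, y):
    """Return (SST, SSR, SSE) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * a for a in x]

    sst = sum((b - mean_y) ** 2 for b in y)             # total variation in y
    ssr = sum((f - mean_y) ** 2 for f in fitted)        # explained by the line
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))  # residual variation
    return sst, ssr, sse


# Invented toy data (assumption): NOT the textbook's Waist & Fat sample.
waist = [70, 80, 90, 100, 110]
fat = [20, 40, 50, 40, 50]

sst, ssr, sse = anova_decomposition(waist, fat)
r_squared = ssr / sst

print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}")
print(f"R^2 = {r_squared:.2f} (share of variation in fat explained by waist)")
```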

Linear Reg: Estimation & Prediction
• “Narrower” interval: 100(1-α)% confidence (estimation) interval of μY|X
• “Wider” interval: 100(1-α)% prediction interval of Y|X
• People often use the words “confidence” or “estimation” for parameters
• … and use “prediction” for future observations / subjects

• Given x = x0, I have 95% confidence that the confidence (estimation) interval will cover the mean of y|x = x0
• I have 95% confidence that the next y corresponding to x0 will fall in the prediction interval
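
To see why one interval is narrower, the sketch below computes both half-widths at a given x0 using the usual textbook standard errors for simple linear regression. The prediction interval carries an extra "1 +" under the square root for the noise of a single new observation. The data are invented toy numbers, `interval_half_widths` is a hypothetical name, and the critical value 3.182 is the two-sided 95% t value for 3 degrees of freedom taken from a t table.

```python
import math


def interval_half_widths(x, y, x0, t_crit):
    """Return (fitted value, CI half-width, PI half-width) at x0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # residual standard deviation

    # Uncertainty about the mean of Y at x0 (estimation of the line only).
    mean_term = 1 / n + (x0 - mean_x) ** 2 / sxx
    ci_half = t_crit * s * math.sqrt(mean_term)
    # A new observation adds its own error variance: the extra "1 +".
    pi_half = t_crit * s * math.sqrt(1 + mean_term)
    return intercept + slope * x0, ci_half, pi_half


# Invented toy data (assumption): NOT the textbook's Waist & Fat sample.
waist = [70, 80, 90, 100, 110]
fat = [20, 40, 50, 40, 50]

T_CRIT_95_DF3 = 3.182  # from a t table: two-sided 95%, n - 2 = 3 df

fitted, ci_half, pi_half = interval_half_widths(waist, fat, 90, T_CRIT_95_DF3)

print(f"Fitted mean at x0 = 90: {fitted:.2f}")
print(f"95% confidence interval: +/- {ci_half:.2f}")
print(f"95% prediction interval: +/- {pi_half:.2f}")
```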
Summary
• A regression model describes the conditional distribution of Y|X=x, or
certain characteristics of it, as a function of the explanatory variables x
• We estimate such models on the basis of samples of pairs of random
variables (Y,X)
• It is convenient to assume that a regression model consists of signal and
noise, i.e. a deterministic part and an error term
Extra: Dummy Coding
• Use (k-1) dummy variables for a k-level categorical predictor

• Study the effect of Gender
• Define one dummy variable Gender: =0 (male); =1 (female)
• Iris sepal length (Short, Medium & Long)
• Wrong: Iris: =0 (Short); =1 (Medium); =2 (Long). “Equally spaced” assumption
• Correct: define two dummy variables IrisM & IrisL, i.e. Short is the “reference”

| | Short | Medium | Long |
|---|---|---|---|
| IrisM | 0 | 1 | 0 |
| IrisL | 0 | 0 | 1 |
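
The (k-1) rule can be sketched as a small encoder: every level except the reference gets its own 0/1 indicator, and the reference level is the row of all zeros. `dummy_code` is a hypothetical helper written for illustration, not part of any library.

```python
def dummy_code(value, levels, reference):
    """Encode one categorical value as (k - 1) 0/1 indicators.

    Returns a dict keyed by every non-reference level.
    """
    if reference not in levels:
        raise ValueError(f"reference {reference!r} is not one of {levels}")
    if value not in levels:
        raise ValueError(f"value {value!r} is not one of {levels}")
    return {level: int(value == level) for level in levels if level != reference}


IRIS_LEVELS = ["Short", "Medium", "Long"]

# "Medium" plays the role of IrisM and "Long" the role of IrisL.
for sepal in IRIS_LEVELS:
    print(sepal, dummy_code(sepal, IRIS_LEVELS, reference="Short"))
```

Coding the levels as 0, 1, 2 in a single variable would instead force the Short-to-Medium and Medium-to-Long effects to be equal.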
Homework
• Preview questions of Two-way Table
• In linear regression, why do you think the Prediction Interval is wider than the Confidence (Estimation) Interval?
• Answers can be found in Chapter 9
• Read the story of Bigfoot & UFO sightings

Part I: Page 378, Exercise 15

ANOVA

| Source of Variation | SS | df | MS | F | P-value | F crit |
|---|---|---|---|---|---|---|
| Rows | 8.20068125 | 15 | 0.546712083 | 27.54621796 | 1.82574E-13 | 2.014803691 |
| Columns | 0.053654167 | 2 | 0.026827083 | 1.351688955 | 0.274112996 | 3.315829501 |
| Error | 0.5954125 | 30 | 0.019847083 | | | |
| Total | 8.849747917 | 47 | | | | |
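
The arithmetic behind the ANOVA table can be verified from its SS and df columns alone: each mean square is SS / df, and each F ratio divides a mean square by the error mean square. The sketch below uses only the SS and df values reported in the table; the variable names are illustrative.

```python
# SS and df values as reported in the ANOVA table.
ss = {"Rows": 8.20068125, "Columns": 0.053654167, "Error": 0.5954125}
df = {"Rows": 15, "Columns": 2, "Error": 30}

# Mean square = sum of squares / degrees of freedom.
ms = {source: ss[source] / df[source] for source in ss}

# F ratio = mean square of the effect / mean square of the error.
f_rows = ms["Rows"] / ms["Error"]
f_columns = ms["Columns"] / ms["Error"]

ss_total = sum(ss.values())
df_total = sum(df.values())

print(f"MS: {ms}")
print(f"F (Rows) = {f_rows:.5f}, F (Columns) = {f_columns:.5f}")
print(f"Total SS = {ss_total:.9f}, total df = {df_total}")
```

Comparing each F with its critical value from the table, the Rows effect exceeds its F crit while the Columns effect does not, which agrees with the reported P-values.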