Simple Linear Regression

Lecturer: Dinh Thai Hoang

Correlation vs. Regression

A scatter plot can be used to show the relationship between two variables

Correlation analysis is used to measure the strength of the association (linear relationship) between two variables

Correlation is only concerned with strength of the relationship

No causal effect is implied with correlation

Correlation Analysis (continued)

Population Correlation Coefficient ρ (rho) is Used to Measure the Strength of the Linear Relationship between the Variables

ρ = σ_XY / (σ_X · σ_Y)
Correlation Analysis (continued)

Sample Correlation Coefficient r is an Estimate of ρ and is Used to Measure the Strength of the Linear Relationship in the Sample Observations

r = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / √[ Σᵢ₌₁ⁿ (Xᵢ − X̄)² · Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² ] = SS_XY / √(SS_X · SS_Y)

where

SS_XY = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) = Σᵢ₌₁ⁿ XᵢYᵢ − n X̄ Ȳ

SS_X = Σᵢ₌₁ⁿ (Xᵢ − X̄)² = Σᵢ₌₁ⁿ Xᵢ² − n (X̄)²

SS_Y = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = Σᵢ₌₁ⁿ Yᵢ² − n (Ȳ)²
Different Values of the Correlation Coefficient
Features of ρ and r

Unit Free

Range between −1 and 1

The Closer to -1, the Stronger the Negative Linear Relationship

The Closer to 1, the Stronger the Positive Linear Relationship

The Closer to 0, the Weaker the Linear Relationship

t Test for Correlation

Hypotheses

H₀: ρ = 0 (no correlation)
H₁: ρ ≠ 0 (correlation exists)

Test Statistic

t = r / √( (1 − r²) / (n − 2) )

Decision rule: reject H₀ if t < −t_{α/2} or t > t_{α/2}, with n − 2 degrees of freedom
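For concreteness, here is a minimal Python sketch of the correlation coefficient and its t test. The data are the ten-house sample used in the regression example later in these slides; the variable names are illustrative only.

```python
# Sample correlation coefficient r and its t statistic, from the
# formulas above. Pure Python; no libraries beyond math required.
import math

x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # square feet
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # price, $1000s
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of squares from the definitions above
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)

r = ss_xy / math.sqrt(ss_x * ss_y)        # sample correlation coefficient
t = r / math.sqrt((1 - r**2) / (n - 2))   # t statistic with n - 2 d.f.
print(f"r = {r:.4f}, t = {t:.4f}")        # expect r ≈ 0.762, t ≈ 3.33
```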

Regression Analysis

Regression analysis is used to:

Predict the value of a dependent variable based on the value of at least one independent variable

Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to predict or explain

Independent variable: the variable used to predict or explain the dependent variable

Simple Linear Regression Model

Only one independent variable, X

Relationship between X and Y is described by a linear function

Changes in Y are assumed to be related to changes in X

Types of Relationships

Linear relationships
[Scatter plots: Y vs. X showing positive-slope and negative-slope linear patterns]

Curvilinear relationships
[Scatter plots: Y vs. X showing curved (nonlinear) patterns]

Types of Relationships (continued)

Strong relationships
[Scatter plots: Y vs. X with points tightly clustered about a line]

Weak relationships
[Scatter plots: Y vs. X with points loosely scattered about a line]

Types of Relationships (continued)

No relationship
[Scatter plots: Y vs. X with no discernible pattern]
Simple Linear Regression Model

Yᵢ = β₀ + β₁Xᵢ + εᵢ

where
Yᵢ = dependent variable
Xᵢ = independent variable
β₀ = population Y intercept
β₁ = population slope coefficient
εᵢ = random error term

β₀ + β₁Xᵢ is the linear component; εᵢ is the random error component

Simple Linear Regression Model (continued)

[Graph: Y vs. X with the population line Y = β₀ + β₁X; intercept = β₀, slope = β₁. For a given Xᵢ, the observed value of Yᵢ differs from the predicted value on the line by the random error εᵢ]
Simple Linear Regression Equation

The sample regression line provides an estimate of the population regression line as well as a predicted value of Y

Yᵢ = b₀ + b₁Xᵢ + eᵢ

where b₀ is the sample Y intercept, b₁ is the sample slope coefficient, and eᵢ is the residual

Simple regression equation (fitted regression line, predicted value):

Ŷᵢ = b₀ + b₁Xᵢ

The Least Squares Method

b₀ and b₁ are obtained by finding the values that minimize the sum of the squared differences between Yᵢ and Ŷᵢ:

min Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² = min Σᵢ₌₁ⁿ eᵢ²

b₀ provides an estimate of β₀
b₁ provides an estimate of β₁

The least squares estimates solve the normal equations

Σᵢ₌₁ⁿ Yᵢ = n b₀ + b₁ Σᵢ₌₁ⁿ Xᵢ
Σᵢ₌₁ⁿ XᵢYᵢ = b₀ Σᵢ₌₁ⁿ Xᵢ + b₁ Σᵢ₌₁ⁿ Xᵢ²

giving

b₁ = SS_XY / SS_X
b₀ = Ȳ − b₁ X̄

where SS_XY = Σᵢ₌₁ⁿ XᵢYᵢ − n X̄ Ȳ and SS_X = Σᵢ₌₁ⁿ Xᵢ² − n (X̄)²
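These formulas translate directly into a few lines of code. The sketch below is a hypothetical helper (the name least_squares is ours, not from the lecture) implementing b₁ = SS_XY / SS_X and b₀ = Ȳ − b₁X̄:

```python
# Minimal least-squares helper implementing the formulas above.
def least_squares(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    b1 = ss_xy / ss_x          # sample slope
    b0 = y_bar - b1 * x_bar    # sample Y intercept
    return b0, b1
```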
Interpretation of the Slope and the Intercept

b 0 is the estimated average value of Y when the value of X is zero

b 1 is the estimated change in the average value of Y as a result of a one-unit change in X

Simple Linear Regression Example

A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)

A random sample of 10 houses is selected

Dependent variable (Y) = house price in $1000s

Independent variable (X) = square feet

Simple Linear Regression Example: Data

House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
Simple Linear Regression Example: Scatter Plot

House price model: scatter plot
[Scatter plot: House Price ($1000s), 0 to 450, vs. Square Feet, 0 to 3000]
Simple Linear Regression Example: Using Excel
Simple Linear Regression Example: Excel Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
            df   SS          MS          F        Significance F
Regression   1   18934.9348  18934.9348  11.0848  0.01039
Residual     8   13665.5652   1708.1957
Total        9   32600.5000

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580

The regression equation is:

house price = 98.24833 + 0.10977 (square feet)
Simple Linear Regression Example: Graphical Representation

House price model: scatter plot and prediction line
[Scatter plot with fitted line; intercept b₀ = 98.248, slope b₁ = 0.10977]

house price = 98.24833 + 0.10977 (square feet)

Simple Linear Regression Example: Interpretation of b₀

house price = 98.24833 + 0.10977 (square feet)

b₀ is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)

Because a house cannot have a square footage of 0, b₀ has no practical application here

Simple Linear Regression Example: Interpreting b₁

house price = 98.24833 + 0.10977 (square feet)

b₁ estimates the change in the average value of Y as a result of a one-unit increase in X

Here, b₁ = 0.10977 tells us that the mean value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size

Simple Linear Regression Example: Making Predictions

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098(2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85($1000s) = $317,850
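As a check, a short Python sketch that refits the model from the sample data and reproduces this prediction (pure Python, no libraries):

```python
# Refit the house-price model from the data and predict at 2000 sq. ft.
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
print(f"house price = {b0:.5f} + {b1:.5f} (square feet)")  # 98.24833 + 0.10977 X

print(f"price at 2000 sq. ft.: {b0 + b1 * 2000:.2f} ($1000s)")
# ≈ 317.78 with full-precision coefficients; the slide's 317.85 comes from
# the rounded equation 98.25 + 0.1098(2000).
```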

Simple Linear Regression Example: Making Predictions (continued)

When using a regression model for prediction, predict only within the relevant range of data

[Scatter plot with fitted line: the relevant range for interpolation is the span of observed square footage; do not try to extrapolate beyond the range of observed X values]

Measures of Variation

Total variation is made up of two parts:

SST = SSR + SSE

Total Sum of Squares:       SST = Σ (Yᵢ − Ȳ)²
Regression Sum of Squares:  SSR = Σ (Ŷᵢ − Ȳ)²
Error Sum of Squares:       SSE = Σ (Yᵢ − Ŷᵢ)²

where:
Ȳ  = mean value of the dependent variable
Yᵢ = observed value of the dependent variable
Ŷᵢ = predicted value of Y for the given Xᵢ value

Measures of Variation (continued)

SST = total sum of squares (total variation)
  Measures the variation of the Yᵢ values around their mean Ȳ

SSR = regression sum of squares (explained variation)
  Variation attributable to the relationship between X and Y

SSE = error sum of squares (unexplained variation)
  Variation in Y attributable to factors other than X

Measures of Variation (continued)

[Diagram: at a point (Xᵢ, Yᵢ), the deviations decompose as SST = Σ(Yᵢ − Ȳ)², SSR = Σ(Ŷᵢ − Ȳ)², SSE = Σ(Yᵢ − Ŷᵢ)²]
Coefficient of Determination, r²

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable

The coefficient of determination is also called r-squared and is denoted as r²

r² = SSR / SST = regression sum of squares / total sum of squares

0 ≤ r² ≤ 1
r² = 1

[Scatter plots: all points fall exactly on the fitted line]

Perfect linear relationship between X and Y: 100% of the variation in Y is explained by variation in X

0 < r² < 1

[Scatter plots: points scattered about the fitted line]

Weaker linear relationships between X and Y: some but not all of the variation in Y is explained by variation in X

r² = 0

[Scatter plot: a flat line; Y values unrelated to X]

No linear relationship between X and Y: the value of Y does not depend on X (none of the variation in Y is explained by variation in X)

Simple Linear Regression Example: Coefficient of Determination, r² in Excel

r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
            df   SS          MS          F        Significance F
Regression   1   18934.9348  18934.9348  11.0848  0.01039
Residual     8   13665.5652   1708.1957
Total        9   32600.5000

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580

58.08% of the variation in house prices is explained by variation in square feet
Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is estimated by

S_YX = √( SSE / (n − 2) ) = √( Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² / (n − 2) )

where:
SSE = error sum of squares
n = sample size
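A small Python sketch of these variation measures for the house-price example, using the fitted coefficients from the Excel output above:

```python
# SST, SSR, SSE, r^2, and S_YX for the house data.
import math

x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(x)
y_bar = sum(y) / n
b0, b1 = 98.24833, 0.10977            # coefficients from the Excel output
y_hat = [b0 + b1 * xi for xi in x]    # predicted values

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

r2 = ssr / sst
s_yx = math.sqrt(sse / (n - 2))       # standard error of the estimate
print(f"SST={sst:.1f} SSR={ssr:.1f} SSE={sse:.1f}")  # ≈ 32600.5, 18934.9, 13665.6
print(f"r^2={r2:.5f}  S_YX={s_yx:.5f}")              # ≈ 0.58082, 41.33032
```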

Simple Linear Regression Example: Standard Error of Estimate in Excel

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

S_YX = 41.33032

ANOVA
            df   SS          MS          F        Significance F
Regression   1   18934.9348  18934.9348  11.0848  0.01039
Residual     8   13665.5652   1708.1957
Total        9   32600.5000

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
Comparing Standard Errors

S_YX is a measure of the variation of observed Y values from the regression line

[Two scatter plots: points close to the line (small S_YX) vs. points widely spread (large S_YX)]

The magnitude of S_YX should always be judged relative to the size of the Y values in the sample data; i.e., S_YX = $41.33K is moderately small relative to house prices in the $200K to $400K range
Assumptions of Regression: L.I.N.E.

Linearity

The relationship between X and Y is linear

Independence of Errors

Error values are statistically independent

Normality of Error

Error values are normally distributed for any given value of X

Equal Variance (also called homoscedasticity)

The probability distribution of the errors has constant variance

Residual Analysis

eᵢ = Yᵢ − Ŷᵢ

The residual for observation i, eᵢ, is the difference between its observed and predicted value

Check the assumptions of regression by examining the residuals:
Examine for linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine for constant variance for all levels of X (homoscedasticity)

Graphical Analysis of Residuals

Can plot residuals vs. X
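For example, a residual plot for the house-price model can be drawn with a few lines of Python (this sketch assumes matplotlib is installed):

```python
# Plot residuals against X and look for curvature, trends, or changing spread.
import matplotlib.pyplot as plt

x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
b0, b1 = 98.24833, 0.10977   # fitted house-price model from above

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")   # residuals should scatter randomly about 0
plt.xlabel("Square Feet")
plt.ylabel("Residuals")
plt.title("House Price Model Residual Plot")
plt.show()
```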

Residual Analysis for Linearity

[Plots: Y vs. x scatter with corresponding residuals-vs.-x plots. A curved pattern in the residuals indicates Not Linear; a random scatter about zero indicates Linear]

Residual Analysis for Independence

Not Independent residuals X X residuals Independent residuals X
Not Independent residuals X X residuals Independent residuals X
Not Independent
Not Independent
residuals
residuals

X

X residuals
X
residuals

Independent
Independent

residuals

 
X
X
X
X
X
X
X
 
 
X
X
X
X
X
X

X

 
Checking for Normality

Examine the Stem-and-Leaf Display of the Residuals

Examine the Boxplot of the Residuals

Examine the Histogram of the Residuals

Construct a Normal Probability Plot of the Residuals

Residual Analysis for Normality

When using a normal probability plot, normal errors will display approximately as a straight line

[Normal probability plot: Percent (0 to 100) vs. Residual (−3 to 3), roughly linear]

Residual Analysis for Equal Variance

[Plots: residuals vs. x. A fan-shaped spread indicates non-constant variance; an even band around zero indicates constant variance]
Simple Linear Regression Example: Excel Residual Output

RESIDUAL OUTPUT

Observation  Predicted House Price  Residuals
1            251.92316              -6.923162
2            273.87671              38.12329
3            284.85348              -5.853484
4            304.06284              3.937162
5            218.99284              -19.99284
6            268.38832              -49.38832
7            356.20251              48.79749
8            367.17929              -43.17929
9            254.66740              64.33264
10           284.85348              -29.85348

[House Price Model Residual Plot: Residuals (−60 to 80) vs. Square Feet (0 to 3000)]

Does not appear to violate any regression assumptions

F Test for Significance

F test statistic:

F_STAT = MSR / MSE

where MSR = SSR / k and MSE = SSE / (n − k − 1)

where F STAT follows an F distribution with k numerator and (n – k - 1) denominator degrees of freedom

(k = the number of independent variables in the regression model)

F Test for Significance: Excel Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
            df   SS          MS          F        Significance F
Regression   1   18934.9348  18934.9348  11.0848  0.01039
Residual     8   13665.5652   1708.1957
Total        9   32600.5000

F_STAT = MSR / MSE = 18934.9348 / 1708.1957 = 11.0848, with 1 and 8 degrees of freedom

p-value for the F test: 0.01039
F Test for Significance (continued)

H₀: β₁ = 0
H₁: β₁ ≠ 0
α = .05, df₁ = 1, df₂ = 8

Critical value: F_.05 = 5.32

Test statistic: F_STAT = MSR / MSE = 11.08

Decision: since F_STAT = 11.08 > 5.32, reject H₀ at α = 0.05

Conclusion: there is sufficient evidence that house size affects selling price
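The same F test can be checked numerically; the sketch below assumes SciPy is available for the F distribution and uses the ANOVA values from the Excel output:

```python
# F test for overall significance of the house-price model.
from scipy import stats

ssr, sse = 18934.9348, 13665.5652   # from the ANOVA table above
n, k = 10, 1

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse                          # ≈ 11.0848

f_crit = stats.f.ppf(0.95, k, n - k - 1)    # ≈ 5.32
p_value = stats.f.sf(f_stat, k, n - k - 1)  # ≈ 0.01039
print(f"F={f_stat:.4f}, critical={f_crit:.2f}, p={p_value:.5f}")
```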

Inferences About the Slope

The standard error of the regression slope coefficient (b 1 ) is estimated by

S_b1 = S_YX / √(SS_X) = S_YX / √( Σ (Xᵢ − X̄)² )

where:
S_b1 = estimate of the standard error of the slope
S_YX = √( SSE / (n − 2) ) = standard error of the estimate

Inferences About the Slope: t Test

t test for a population slope: is there a linear relationship between X and Y?

Null and alternative hypotheses:
H₀: β₁ = 0 (no linear relationship)
H₁: β₁ ≠ 0 (linear relationship does exist)

Test statistic:

t_STAT = (b₁ − β₁) / S_b1,   d.f. = n − 2

where:
b₁ = regression slope coefficient
β₁ = hypothesized slope (zero under H₀)
S_b1 = standard error of the slope

Inferences About the Slope: t Test Example

House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700

Estimated regression equation:

house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098. Is there a relationship between the square footage of the house and its sales price?

Inferences About the Slope: t Test Example (continued)

H₀: β₁ = 0
H₁: β₁ ≠ 0

From Excel output:
              Coefficients  Standard Error  t Stat   P-value
Intercept     98.24833      58.03348        1.69296  0.12892
Square Feet   0.10977       0.03297         3.32938  0.01039

From Minitab output:
Predictor     Coef      SE Coef   T     P
Constant      98.25     58.03     1.69  0.129
Square Feet   0.10977   0.03297   3.33  0.010

t_STAT = b₁ / S_b1 = 0.10977 / 0.03297 = 3.32938

Inferences About the Slope: t Test Example (continued)

H₀: β₁ = 0
H₁: β₁ ≠ 0

Test statistic: t_STAT = 3.329
d.f. = 10 − 2 = 8, α/2 = .025
Critical values: ±t_{α/2} = ±2.3060

Decision: since t_STAT = 3.329 > 2.3060, reject H₀

There is sufficient evidence that square footage affects house price

Inferences About the Slope: t Test Example (continued)

H₀: β₁ = 0
H₁: β₁ ≠ 0

From Excel output:
              Coefficients  Standard Error  t Stat   P-value
Intercept     98.24833      58.03348        1.69296  0.12892
Square Feet   0.10977       0.03297         3.32938  0.01039

From Minitab output:
Predictor     Coef      SE Coef   T     P
Constant      98.25     58.03     1.69  0.129
Square Feet   0.10977   0.03297   3.33  0.010

The p-value for the slope is 0.01039.

Decision: Reject H 0 , since p-value < α

There is sufficient evidence that square footage affects house price.
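A short Python sketch of this slope t test, computing S_b1 from its definition and the p-value with SciPy:

```python
# Slope t test for the house-price model: S_b1 = S_YX / sqrt(SS_X).
import math
from scipy import stats

x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
n = len(x)
x_bar = sum(x) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)

s_yx = 41.33032                      # standard error of the estimate (above)
s_b1 = s_yx / math.sqrt(ss_x)        # ≈ 0.03297
t_stat = 0.10977 / s_b1              # b1 / S_b1 ≈ 3.329
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)   # two-tailed, ≈ 0.0104
print(f"S_b1={s_b1:.5f}, t={t_stat:.3f}, p={p_value:.4f}")
```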

Confidence Interval Estimate for the Slope

Confidence interval estimate of the slope:

b₁ ± t_{α/2} S_b1,   d.f. = n − 2
Excel printout for house prices:

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580

At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858)
Confidence Interval Estimate for the Slope (continued)

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580

Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.74 and $185.80 per square foot of house size

This 95% confidence interval does not include 0. Conclusion: there is a significant relationship between house price and square feet at the .05 level of significance
t Test for a Correlation Coefficient

Hypotheses:
H₀: ρ = 0 (no correlation between X and Y)
H₁: ρ ≠ 0 (correlation exists)

Test statistic:

t_STAT = r / √( (1 − r²) / (n − 2) )   (with n − 2 degrees of freedom)

where
r = +√(r²) if b₁ > 0
r = −√(r²) if b₁ < 0
t Test for a Correlation Coefficient (continued)

Is there evidence of a linear relationship between square feet and house price at the .05 level of significance?

H₀: ρ = 0 (no correlation)
H₁: ρ ≠ 0 (correlation exists)
α = .05, df = 10 − 2 = 8

t_STAT = r / √( (1 − r²) / (n − 2) ) = .762 / √( (1 − .762²) / (10 − 2) ) = 3.329

Critical values: ±t_{α/2} = ±2.3060 (d.f. = 8)

Decision: since t_STAT = 3.329 > 2.3060, reject H₀

Conclusion: there is evidence of a linear association at the 5% level of significance
Estimating Mean Values and Predicting Individual Values

Goal: form intervals around Ŷ to express uncertainty about the value of Y for a given Xᵢ

[Graph: the fitted line Ŷ = b₀ + b₁Xᵢ with a confidence interval for the mean of Y given Xᵢ, and a wider prediction interval for an individual Y given Xᵢ]

Confidence Interval for the Average Y, Given X

Confidence interval estimate for the mean value of Y given a particular Xᵢ

Confidence interval for μ_{Y|X=Xᵢ}:

Ŷ ± t_{α/2} S_YX √hᵢ

where hᵢ = 1/n + (Xᵢ − X̄)² / SS_X

The size of the interval varies according to the distance of Xᵢ from the mean X̄

Prediction Interval for an Individual Y, Given X

Confidence interval estimate for an individual value of Y given a particular Xᵢ

Confidence interval for Y_{X=Xᵢ}:

Ŷ ± t_{α/2} S_YX √(1 + hᵢ)

The extra 1 under the square root adds to the interval width to reflect the added uncertainty for an individual case
Estimation of Mean Values: Example

Confidence interval estimate for μ_{Y|X=Xᵢ}: find the 95% confidence interval for the mean price of 2,000-square-foot houses

Predicted price: Ŷᵢ = 317.85 ($1000s)

Ŷ ± t_{0.025} S_YX √( 1/n + (Xᵢ − X̄)² / Σ(Xᵢ − X̄)² ) = 317.85 ± 37.12

The confidence interval endpoints are 280.66 and 354.90, or from $280,660 to $354,900
Estimation of Individual Values: Example

Prediction interval estimate for Y_{X=Xᵢ}: find the 95% prediction interval for an individual house with 2,000 square feet

Predicted price: Ŷᵢ = 317.85 ($1000s)

Ŷ ± t_{0.025} S_YX √( 1 + 1/n + (Xᵢ − X̄)² / Σ(Xᵢ − X̄)² ) = 317.85 ± 102.28

The prediction interval endpoints are 215.50 and 420.07, or from $215,500 to $420,070
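Both intervals follow from the same ingredients; here is a Python sketch for Xᵢ = 2000 (assuming SciPy for the t critical value):

```python
# 95% confidence interval (mean response) and prediction interval
# (individual house) at X_i = 2000 sq. ft.
import math
from scipy import stats

x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
n = len(x)
x_bar = sum(x) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)

s_yx, y_hat, x_i = 41.33032, 317.85, 2000
h_i = 1 / n + (x_i - x_bar) ** 2 / ss_x
t_crit = stats.t.ppf(0.975, n - 2)             # ≈ 2.3060

ci_half = t_crit * s_yx * math.sqrt(h_i)       # mean response, ≈ ±37.12
pi_half = t_crit * s_yx * math.sqrt(1 + h_i)   # individual value, ≈ ±102.28
print(f"95% CI for mean:       {y_hat - ci_half:.2f} to {y_hat + ci_half:.2f}")
print(f"95% PI for individual: {y_hat - pi_half:.2f} to {y_hat + pi_half:.2f}")
```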
Multiple Regression
The Multiple Regression Model

Idea: examine the linear relationship between 1 dependent variable (Y) and 2 or more independent variables (Xᵢ)

Multiple regression model with k independent variables:

Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ⋯ + β_k X_kᵢ + εᵢ

where β₀ is the Y-intercept, β₁ … β_k are the population slopes, and εᵢ is the random error

Multiple Regression Equation

The coefficients of the multiple regression model are estimated using sample data

Multiple regression equation with k independent variables:

Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + ⋯ + b_k X_kᵢ

where b₀ is the estimated intercept, b₁ … b_k are the estimated slope coefficients, and Ŷᵢ is the estimated (or predicted) value of Y

Example: 2 Independent Variables

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand

Dependent variable: pie sales (units per week)
Independent variables: price (in $), advertising ($100s)

Data are collected for 15 weeks

Pie Sales Example

Week  Pie Sales  Price ($)  Advertising ($100s)
1     350        5.50       3.3
2     460        7.50       3.3
3     350        8.00       3.0
4     430        8.00       4.5
5     350        6.80       3.0
6     380        7.50       4.0
7     430        4.50       3.0
8     470        6.40       3.7
9     450        7.00       3.5
10    490        5.00       4.0
11    340        7.20       3.5
12    300        7.90       3.2
13    440        5.90       4.0
14    450        5.00       3.5
15    300        7.00       2.7

Multiple regression equation:

Sales = b₀ + b₁(Price) + b₂(Advertising)
Excel Multiple Regression Output

Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

ANOVA
            df   SS         MS         F        Significance F
Regression   2   29460.027  14730.013  6.53861  0.01201
Residual    12   27033.306   2252.776
Total       14   56493.333

             Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%
Intercept    306.52619     114.25389        2.68285  0.01993   57.58835  555.46404
Price        -24.97509      10.83213       -2.30565  0.03979  -48.57626   -1.37392
Advertising   74.13096      25.96732        2.85478  0.01449   17.55303  130.70888

Sales = 306.526 − 24.975(Price) + 74.131(Advertising)
The Multiple Regression Equation

Sales = 306.526 − 24.975(Price) + 74.131(Advertising)

where Sales is in number of pies per week, Price is in $, and Advertising is in $100s

b₁ = −24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising

b₂ = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price

Predictions

Predict sales for a week in which the selling price is $5.50 and advertising is $350:

Sales = 306.526 − 24.975(Price) + 74.131(Advertising)
      = 306.526 − 24.975(5.50) + 74.131(3.5)
      = 428.62

Predicted sales is 428.62 pies

Note that Advertising is in $100s, so $350 means that X₂ = 3.5
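A sketch reproducing this regression with NumPy's least-squares solver; the data are the 15 weeks tabulated earlier:

```python
# Fit Sales = b0 + b1*Price + b2*Advertising, then predict at $5.50 / $350.
import numpy as np

price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00,
         7.20, 7.90, 5.90, 5.00, 7.00]
advert = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0,
          3.5, 3.2, 4.0, 3.5, 2.7]
sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490,
         340, 300, 440, 450, 300]

X = np.column_stack([np.ones(len(sales)), price, advert])  # intercept column
b, *_ = np.linalg.lstsq(X, np.array(sales), rcond=None)
print(b)                      # ≈ [306.526, -24.975, 74.131]

y_hat = b @ [1, 5.50, 3.5]    # advertising in $100s, so $350 -> 3.5
print(f"predicted sales: {y_hat:.2f} pies")   # ≈ 428.62
```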
Coefficient of Multiple Determination

Reports the proportion of total variation in Y explained by all X variables taken together

r² = SSR / SST = regression sum of squares / total sum of squares

Multiple Coefficient of Determination in Excel:

Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

ANOVA
            df   SS         MS         F        Significance F
Regression   2   29460.027  14730.013  6.53861  0.01201
Residual    12   27033.306   2252.776
Total       14   56493.333

r² = SSR / SST = 29460.0 / 56493.3 = .52148

52.1% of the variation in pie sales is explained by the variation in price and advertising
Adjusted r²

r² never decreases when a new X variable is added to the model. This can be a disadvantage when comparing models.

What is the net effect of adding a new variable? We lose a degree of freedom when a new X variable is added. Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?

Adjusted r² (continued)

Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

r²_adj = 1 − (1 − r²) (n − 1) / (n − k − 1)

(where n = sample size, k = number of independent variables)

Penalizes excessive use of unimportant independent variables

Smaller than r²

Useful in comparing among models
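The formula is a one-line computation; for the pie-sales model it reproduces the Excel value shown on the next slide:

```python
# Adjusted r^2 for the pie-sales model: n = 15 weeks, k = 2 variables.
n, k, r2 = 15, 2, 0.52148
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"{r2_adj:.5f}")   # ≈ 0.44172, matching the Excel output
```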

Adjusted r² in Excel

Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

r²_adj = .44172

44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables
Is the Model Significant?

F Test for Overall Significance of the Model

Shows if there is a linear relationship between all of the X variables considered together and Y

Use the F-test statistic

Hypotheses:
H₀: β₁ = β₂ = … = β_k = 0 (no linear relationship)
H₁: at least one βᵢ ≠ 0 (at least one independent variable affects Y)
F Test for Overall Significance

Test statistic:

F_STAT = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))

where F STAT has numerator d.f. = k and denominator d.f. = (n – k - 1)

F Test for Overall Significance in Excel (continued)

Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

ANOVA
            df   SS         MS         F        Significance F
Regression   2   29460.027  14730.013  6.53861  0.01201
Residual    12   27033.306   2252.776
Total       14   56493.333

             Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%
Intercept    306.52619     114.25389        2.68285  0.01993   57.58835  555.46404
Price        -24.97509      10.83213       -2.30565  0.03979  -48.57626   -1.37392
Advertising   74.13096      25.96732        2.85478  0.01449   17.55303  130.70888

F_STAT = MSR / MSE = 14730.0 / 2252.8 = 6.5386, with 2 and 12 degrees of freedom

p-value for the F test: 0.01201
F Test for Overall Significance (continued)

H₀: β₁ = β₂ = 0
H₁: β₁ and β₂ not both zero
α = .05, df₁ = 2, df₂ = 12

Critical value: F_0.05 = 3.885

Test statistic: F_STAT = MSR / MSE = 6.5386

Decision: since the F_STAT test statistic is in the rejection region (p-value < .05), reject H₀

Conclusion: there is evidence that at least one independent variable affects Y

Are Individual Variables Significant?

Use t tests of individual variable slopes

Shows if there is a linear relationship between the variable Xⱼ and Y, holding constant the effects of the other X variables

Hypotheses:
H₀: βⱼ = 0 (no linear relationship)
H₁: βⱼ ≠ 0 (linear relationship does exist between Xⱼ and Y)

Test statistic:

t_STAT = bⱼ / S_bⱼ   (df = n − k − 1)
Are Individual Variables Significant? Excel Output

Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

ANOVA
            df   SS         MS         F        Significance F
Regression   2   29460.027  14730.013  6.53861  0.01201
Residual    12   27033.306   2252.776
Total       14   56493.333

             Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%
Intercept    306.52619     114.25389        2.68285  0.01993   57.58835  555.46404
Price        -24.97509      10.83213       -2.30565  0.03979  -48.57626   -1.37392
Advertising   74.13096      25.96732        2.85478  0.01449   17.55303  130.70888

t Stat for Price is t_STAT = −2.306, with p-value .0398
t Stat for Advertising is t_STAT = 2.855, with p-value .0145

Inferences About the Slope: t Test Example

H₀: βⱼ = 0
H₁: βⱼ ≠ 0

d.f. = 15 − 2 − 1 = 12, α = .05, critical values ±t_{α/2} = ±2.1788

From the Excel output:
For Price, t_STAT = −2.306, with p-value .0398
For Advertising, t_STAT = 2.855, with p-value .0145

The test statistic for each variable falls in the rejection region (p-values < .05)

Decision: reject H₀ for each variable

Conclusion: there is evidence that both Price and Advertising affect pie sales at α = .05

Confidence Interval Estimate for the Slope

Confidence interval for the population slope βⱼ:

bⱼ ± t_{α/2} S_bⱼ   where t has (n − k − 1) d.f.

             Coefficients  Standard Error
Intercept    306.52619     114.25389
Price        -24.97509      10.83213
Advertising   74.13096      25.96732

Here, t has (15 − 2 − 1) = 12 d.f.

Example: form a 95% confidence interval for the effect of changes in price (X₁) on pie sales:

−24.975 ± (2.1788)(10.832)

So the interval is (−48.576, −1.374). This interval does not contain zero, so price has a significant effect on sales.
Confidence Interval Estimate for the Slope (continued)

Confidence interval for the population slope βⱼ. Excel output also reports these interval endpoints:

             Coefficients  Standard Error  …  Lower 95%  Upper 95%
Intercept    306.52619     114.25389          57.58835   555.46404
Price        -24.97509      10.83213         -48.57626    -1.37392
Advertising   74.13096      25.96732          17.55303   130.70888

Weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each increase of $1 in the selling price, holding the effect of advertising constant
Testing Portions of the Multiple Regression Model

Contribution of a single independent variable Xⱼ:

SSR(Xⱼ | all variables except Xⱼ) = SSR(all variables) − SSR(all variables except Xⱼ)

Measures the contribution of Xⱼ in explaining the total variation in Y (SST)

Testing Portions of the Multiple Regression Model (continued)

Contribution of a single independent variable Xⱼ, assuming all other variables are already included (consider here a 2-variable model):

SSR(X₁ | X₂) = SSR(all variables) − SSR(X₂)

SSR(all variables) comes from the ANOVA section of the regression for Ŷ = b₀ + b₁X₁ + b₂X₂; SSR(X₂) comes from the ANOVA section of the regression for Ŷ = b₀ + b₂X₂

Measures the contribution of X₁ in explaining SST

The Partial F-Test Statistic

Consider the hypothesis test:
H₀: variable Xⱼ does not significantly improve the model after all other variables are included
H₁: variable Xⱼ significantly improves the model after all other variables are included

Test using the F-test statistic (with 1 and n − k − 1 d.f.):

F_STAT = SSR(Xⱼ | all variables except j) / MSE
Testing Portions of Model: Example

Example: frozen dessert pies

Test at the α = .05 level to determine whether the price variable significantly improves the model, given that advertising is included

Testing Portions of Model: Example (continued)

H₀: X₁ (price) does not improve the model with X₂ (advertising) included
H₁: X₁ does improve the model

α = .05, df = 1 and 12
F_0.05 = 4.75

For X₁ and X₂:
ANOVA
            df   SS           MS
Regression   2   29460.02687  14730.01343
Residual    12   27033.30647   2252.775539
Total       14   56493.33333

For X₂ only:
ANOVA
            df   SS
Regression   1   17484.22249
Residual    13   39009.11085
Total       14   56493.33333

Testing Portions of Model: Example (continued)

F_STAT = SSR(X₁ | X₂) / MSE(all) = (29,460.03 − 17,484.22) / 2252.78 = 5.316

Conclusion: since F_STAT = 5.316 > F_0.05 = 4.75, reject H₀; adding X₁ does improve the model
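A numeric sketch of this partial F test, using the two ANOVA tables above (SciPy assumed for the critical value):

```python
# Partial F test: does Price (X1) improve the model once Advertising (X2)
# is included? Values taken from the full-model and X2-only ANOVA tables.
from scipy import stats

ssr_full = 29460.03      # SSR(X1, X2)
ssr_x2 = 17484.22        # SSR(X2 only)
mse_full = 2252.78       # MSE of the full model
n, k = 15, 2

f_stat = (ssr_full - ssr_x2) / mse_full       # ≈ 5.316
f_crit = stats.f.ppf(0.95, 1, n - k - 1)      # ≈ 4.75
print(f"F={f_stat:.3f}, critical={f_crit:.3f}, reject H0: {f_stat > f_crit}")
```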

Coefficient of Partial Determination for a k-Variable Model

r²_{Yj.(all variables except j)} = SSR(Xⱼ | all variables except j) / ( SST − SSR(all variables) + SSR(Xⱼ | all variables except j) )

Measures the proportion of variation in the dependent variable that is explained by Xⱼ while controlling for (holding constant) the other independent variables

Using Dummy Variables

A dummy variable is a categorical independent variable with two levels:
yes or no, on or off, male or female
coded as 0 or 1

Assumes the slopes associated with numerical independent variables do not change with the value of the categorical variable

If there are more than two levels, the number of dummy variables needed is (number of levels − 1)

Dummy-Variable Example (with 2 Levels)

Ŷ = b₀ + b₁X₁ + b₂X₂

Let:
Y  = pie sales
X₁ = price
X₂ = holiday (X₂ = 1 if a holiday occurred during the week; X₂ = 0 if there was no holiday that week)

Dummy-Variable Example (with 2 Levels) (continued)

Holiday:    Ŷ = b₀ + b₁X₁ + b₂(1) = (b₀ + b₂) + b₁X₁
No holiday: Ŷ = b₀ + b₁X₁ + b₂(0) = b₀ + b₁X₁

[Graph: two parallel lines of Y (sales) vs. X₁ (price), with different intercepts (b₀ + b₂ and b₀) and the same slope b₁]

If H₀: β₂ = 0 is rejected, then “Holiday” has a significant effect on pie sales

Interpreting the Dummy Variable Coefficient (with 2 Levels)

Example: Sales = 300 − 30(Price) + 15(Holiday)

Sales:   number of pies sold per week
Price:   pie price in $
Holiday: 1 if a holiday occurred during the week, 0 if no holiday occurred

b₂ = 15: on average, sales were 15 pies greater in weeks with a holiday than in weeks without a holiday, given the same price

Dummy-Variable Models (more than 2 Levels)

The number of dummy variables is one less than the number of levels

Example:
Y = house price; X₁ = square feet

If style of the house is also thought to matter:
Style = ranch, split level, colonial

Three levels, so two dummy variables are needed
Dummy-Variable Models (more than 2 Levels) (continued)

Example: let “colonial” be the default category, and let X₂ and X₃ be used for the other two categories:

Y  = house price
X₁ = square feet
X₂ = 1 if ranch, 0 otherwise
X₃ = 1 if split level, 0 otherwise

The multiple regression equation is:

Ŷ = b₀ + b₁X₁ + b₂X₂ + b₃X₃
Interpreting the Dummy Variable Coefficients (with 3 Levels)

Consider the regression equation:

Ŷ = 20.43 + 0.045X₁ + 23.53X₂ + 18.84X₃

For a colonial: X₂ = X₃ = 0
Ŷ = 20.43 + 0.045X₁

For a ranch: X₂ = 1, X₃ = 0
Ŷ = 20.43 + 0.045X₁ + 23.53
With the same square feet, a ranch will have an estimated average price of 23.53 thousand dollars more than a colonial.

For a split level: X₂ = 0, X₃ = 1
Ŷ = 20.43 + 0.045X₁ + 18.84
With the same square feet, a split level will have an estimated average price of 18.84 thousand dollars more than a colonial.
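A small Python sketch of this coding scheme; the predict_price helper is hypothetical, with coefficients taken from the equation above:

```python
# Three-level dummy coding: colonial is the default (X2 = X3 = 0).
def predict_price(sqft, style):
    """Price in $1000s from Y-hat = 20.43 + 0.045*X1 + 23.53*X2 + 18.84*X3."""
    x2 = 1 if style == "ranch" else 0         # ranch indicator
    x3 = 1 if style == "split level" else 0   # split-level indicator
    return 20.43 + 0.045 * sqft + 23.53 * x2 + 18.84 * x3

for style in ["colonial", "ranch", "split level"]:
    print(style, predict_price(2000, style))
# With the same square feet, a ranch is 23.53 and a split level 18.84
# (thousand dollars) above a colonial.
```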