
# Chapter 11

## Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc.


## Chapter Goals

After completing this chapter, you should be able to:

Obtain and interpret the simple linear regression equation for a set of data

[…] model

[…] analysis

Explain measures of variation and determine whether the independent variable is significant

## Chapter Goals (continued)

After completing this chapter, you should be able to:

Calculate and interpret confidence intervals for the regression coefficients

[…] autocorrelation

Form confidence and prediction intervals around an estimated Y value for a given X

Recognize some potential problems if regression analysis is used incorrectly

A scatter plot (or scatter diagram) can be used to show the relationship between two variables

Correlation analysis is used to measure the strength of the association (linear relationship) between two variables

[…] relationship

Correlation was first presented in Chapter 3

## 11.1 Introduction to Regression Analysis

Regression analysis is used to:

Predict the value of a dependent variable based on the value of at least one independent variable

Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain

Independent variable: the variable used to explain the dependent variable

## Simple Linear Regression Model

The relationship between X and Y is described by a linear function

Changes in Y are assumed to be caused by changes in X

## Types of Relationships

[Figure: scatter plots of Y against X illustrating linear and curvilinear relationships, strong and weak relationships, and no relationship]

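## Simple Linear Regression Model

The population regression model:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where:

$Y_i$ = dependent variable
$\beta_0$ = population Y intercept
$\beta_1$ = population slope coefficient
$X_i$ = independent variable
$\varepsilon_i$ = random error term

$\beta_0 + \beta_1 X_i$ is the linear component and $\varepsilon_i$ is the random error component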

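## Simple Linear Regression Model (continued)

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

[Figure: the population regression line with intercept $\beta_0$ and slope $\beta_1$. At a given $X_i$, the random error $\varepsilon_i$ is the vertical distance between the observed value of Y and the predicted value of Y on the line.]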

## Simple Linear Regression Equation

The simple linear regression equation provides an estimate of the population regression line:

$$\hat{Y}_i = a + bX_i$$

where:

$\hat{Y}_i$ = estimated (or predicted) Y value for observation i
$a$ = estimate of the regression intercept
$b$ = estimate of the regression slope
$X_i$ = value of X for observation i

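a and b are obtained by finding the values of the intercept and slope (the estimates of $\beta_0$ and $\beta_1$) that minimize the sum of the squared differences (residuals) between $Y$ and $\hat{Y}$, that is, the error sum of squares SSE:

$$\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \min \sum_{i=1}^{n} \bigl(Y_i - (a + bX_i)\bigr)^2$$

To minimize, differentiate with respect to a and b, and set each result to 0. This generates two simultaneous equations (called the normal equations) in two unknowns. Solving for a and b, we get

$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$a = \bar{y} - b\bar{x}$$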

A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)

A random sample of 10 houses is selected

Dependent variable (Y) = house price in \$1000s
Independent variable (X) = square feet

## Sample Data for House Price Model

| House Price in \$1000s (Y) | Square Feet (X) |
|---|---|
| 245 | 1400 |
| 312 | 1600 |
| 279 | 1700 |
| 308 | 1875 |
| 199 | 1100 |
| 219 | 1550 |
| 405 | 2350 |
| 324 | 2450 |
| 319 | 1425 |
| 255 | 1700 |
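The formulas for a and b can be checked directly against the sample data. The following is a minimal sketch in plain Python, not the Excel procedure used in the chapter; the function name `least_squares` and the variable names are illustrative assumptions.

```python
# Sample data from the house price table:
# price in $1000s (Y) and size in square feet (X).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]


def least_squares(x, y):
    """Return (a, b) from the solved normal equations (hypothetical helper)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # b = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2]
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # a = y_bar - b * x_bar
    a = sum_y / n - b * (sum_x / n)
    return a, b


a, b = least_squares(sqft, prices)
print(f"intercept a = {a:.5f}")
print(f"slope     b = {b:.5f}")
```

Rounded to five decimals, the two values should agree with the Intercept and Square Feet coefficients reported in the Excel output for this example.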

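## Excel Output

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.76211 |
| R Square | 0.58082 |
| Adjusted R Square | 0.52842 |
| Standard Error | 41.33032 |
| Observations | 10 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 18934.9348 | 18934.9348 | 11.0848 | 0.01039 |
| Residual | 8 | 13665.5652 | 1708.1957 | | |
| Total | 9 | 32600.5000 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 | -35.57720 | 232.07386 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 | 0.03374 | 0.18580 |

The regression equation is:

$$\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$$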

## Graphical Presentation

[Figure: scatter plot of the house price data with the fitted regression line; slope = 0.10977, intercept = 98.248]

## Interpretation of the Intercept, a

$$\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$$

a is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)

Here, no houses had 0 square feet, so a = 98.24833 just indicates that, for houses within the range of sizes observed, \$98,248.33 is the portion of the house price not explained by square feet

## Interpretation of the Slope Coefficient, b

$$\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$$

b measures the estimated change in the average value of Y as a result of a one-unit change in X

Here, b = 0.10977 tells us that the average value of a house increases by 0.10977(\$1000) = \$109.77 for each additional square foot of size

## Predictions using Regression Analysis

Predict the price for a house with 2000 square feet:

$$\widehat{\text{house price}} = 98.25 + 0.1098\,(\text{sq. ft.}) = 98.25 + 0.1098(2000) = 317.85$$

The predicted price for a house with 2000 square feet is 317.85 (\$1,000s) = \$317,850

When using a regression model for prediction, only predict within the relevant range of data (interpolation)

Do not try to extrapolate beyond the range of observed Xs

[Figure: scatter plot of the house price data with the regression line, marking the relevant range for interpolation]
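The prediction and the relevant-range rule can be combined in a few lines. This is only a sketch: the function name `predict_price` and the choice to raise an error outside the observed range are assumptions made for illustration, with the range limits taken as the smallest and largest square footage in the sample.

```python
# Rounded coefficients quoted in the text:
# house price = 98.25 + 0.1098 (sq. ft.), price in $1000s.
INTERCEPT = 98.25
SLOPE = 0.1098

# Smallest and largest house sizes in the 10-house sample.
X_MIN, X_MAX = 1100, 2450


def predict_price(square_feet):
    """Return the predicted price in $1000s; refuse to extrapolate."""
    if not X_MIN <= square_feet <= X_MAX:
        raise ValueError(
            f"{square_feet} sq. ft. is outside the observed range "
            f"[{X_MIN}, {X_MAX}]"
        )
    return INTERCEPT + SLOPE * square_feet


predicted = predict_price(2000)
print(f"Predicted price: {predicted:.2f} ($1000s)")
```

Calling `predict_price(3000)` raises a `ValueError`, since 3000 square feet lies beyond the largest house in the sample.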

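## Measures of Variation

Total variation is made up of two parts:

$$SST = SSR + SSE$$

Total sum of squares: $SST = \sum (Y_i - \bar{Y})^2$
Regression sum of squares: $SSR = \sum (\hat{Y}_i - \bar{Y})^2$
Error sum of squares: $SSE = \sum (Y_i - \hat{Y}_i)^2$

where:

$\bar{Y}$ = average value of the dependent variable
$Y_i$ = observed values of the dependent variable
$\hat{Y}_i$ = predicted value of Y for the given $X_i$ value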

## Measures of Variation (continued)

SST = total sum of squares: measures the variation of the $Y_i$ values around their mean $\bar{Y}$

SSR = regression sum of squares: explained variation attributable to the relationship between X and Y

SSE = error sum of squares: variation attributable to factors other than the relationship between X and Y
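The three sums of squares can be computed for the house price data to confirm that the total variation splits into an explained and an unexplained part. This is a minimal sketch in plain Python; the variable names are illustrative, and the slope is computed in the equivalent deviation form $b = S_{XY}/S_{XX}$ rather than the raw-sum form.

```python
# House price sample: price in $1000s (Y), size in square feet (X).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(prices)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n

# Least-squares slope and intercept (deviation form of the formulas).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))
s_xx = sum((x - x_bar) ** 2 for x in sqft)
b = s_xy / s_xx
a = y_bar - b * x_bar

predicted = [a + b * x for x in sqft]

sst = sum((y - y_bar) ** 2 for y in prices)                   # total
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)        # explained
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(prices, predicted))  # unexplained

print(f"SST = {sst:.4f}")
print(f"SSR = {ssr:.4f}")
print(f"SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
```

Up to rounding, the three values should match the Total, Regression and Residual entries in the SS column of the ANOVA table.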

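## Measures of Variation (continued)

[Figure: for an observation at $X_i$, the total deviation $Y_i - \bar{Y}$ splits into the unexplained part $Y_i - \hat{Y}_i$, which contributes to $SSE = \sum (Y_i - \hat{Y}_i)^2$, and the explained part $\hat{Y}_i - \bar{Y}$, which contributes to $SSR = \sum (\hat{Y}_i - \bar{Y})^2$]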

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable

The coefficient of determination is also called R-squared and is denoted as $R^2$

$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

Note: $0 \le R^2 \le 1$

## Excel Output

$$r^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082$$

58.08% of the variation in house prices is explained by variation in square feet

In the Excel output, this value is reported as R Square = 0.58082; SSR = 18934.9348 and SST = 32600.5000 are the Regression and Total entries in the SS column of the ANOVA table

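The correlation coefficient, denoted r, is a measure of linear association between two variables X and Y. It lies between -1 and +1

$$r = b\sqrt{\frac{S_{XX}}{S_{YY}}} = \frac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}}$$

where $S_{XX} = \sum (X_i - \bar{X})^2$, $S_{YY} = \sum (Y_i - \bar{Y})^2$ and $S_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y})$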
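Both forms of the formula for r can be evaluated on the house price data. The sketch below assumes the usual definitions of $S_{XX}$, $S_{YY}$ and $S_{XY}$ as sums of squared deviations and cross-products; the variable names are illustrative.

```python
import math

# House price sample: price in $1000s (Y), size in square feet (X).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(prices)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n

s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_yy = sum((y - y_bar) ** 2 for y in prices)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))

b = s_xy / s_xx  # least-squares slope

r_from_slope = b * math.sqrt(s_xx / s_yy)   # r = b * sqrt(S_XX / S_YY)
r = s_xy / math.sqrt(s_xx * s_yy)           # r = S_XY / sqrt(S_XX * S_YY)

print(f"r (from slope) = {r_from_slope:.5f}")
print(f"r (direct)     = {r:.5f}")
print(f"r squared      = {r ** 2:.5f}")
```

In simple linear regression, r squared equals the coefficient of determination, so the last line should reproduce R Square from the Excel output, and r itself should reproduce Multiple R.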

The standard deviation of the variation of observations around the regression line is estimated by

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$$

where:

SSE = error sum of squares
n = sample size

## Excel Output

$$S_{YX} = 41.33032$$

In the Excel output, this value is reported as Standard Error = 41.33032 under Regression Statistics

## Comparing Standard Errors

$S_{YX}$ is a measure of the variation of observed Y values from the regression line

[Figure: two scatter plots of Y against X around a fitted line, one with small $S_{YX}$ and one with large $S_{YX}$]

The magnitude of $S_{YX}$ should always be judged relative to the size of the Y values in the sample data

For example, $S_{YX}$ = \$41.33K is moderately small relative to house prices in the \$200K - \$300K range

## Residual Analysis

$$e_i = Y_i - \hat{Y}_i$$

The residual for observation i, $e_i$, is the difference between its observed and predicted value

Check the assumptions of regression by examining the residuals:

Examine for linearity assumption
Examine for constant variance for all levels of X (homoscedasticity)
Evaluate normal distribution assumption
Evaluate independence assumption
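Before plotting, the residuals themselves have to be computed. The sketch below lists them for the house price data ordered by X, as a text stand-in for a residual plot; the variable names are illustrative assumptions. Because the fitted line includes an intercept, the residuals from a least-squares fit sum to zero up to floating-point error.

```python
# House price sample: price in $1000s (Y), size in square feet (X).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(prices)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n

# Least-squares fit (deviation form of the slope formula).
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))
     / sum((x - x_bar) ** 2 for x in sqft))
a = y_bar - b * x_bar

# e_i = Y_i - Y_hat_i for each observation.
residuals = [y - (a + b * x) for x, y in zip(sqft, prices)]

# List residuals in order of X, which is how a residual plot is read.
for x, e in sorted(zip(sqft, residuals)):
    print(f"{x:5d} sq. ft.  residual = {e:8.2f}")

print(f"Sum of residuals: {sum(residuals):.6f}")
```

Scanning the listing for a curve, a funnel shape, or runs of same-signed residuals is a rough substitute for the visual checks of linearity, constant variance, and independence.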

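[Figure: Y and residuals plotted against X, contrasting a not linear pattern with a linear one]

## Residual Analysis for Homoscedasticity

[Figure: Y and residuals plotted against x, contrasting non-constant variance with constant variance]

[Figure: residuals plotted against X, contrasting not independent residuals with independent residuals]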

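The standard error of the regression slope coefficient (b) is estimated by

$$S_b = \frac{S_{YX}}{\sqrt{SSX}} = \frac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$$

where:

$S_b$ = standard error of the slope
$S_{YX} = \sqrt{\dfrac{SSE}{n-2}}$ = standard error of the estimate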
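Both standard errors follow from the data in two steps: first $S_{YX}$ from SSE, then $S_b$ from $S_{YX}$ and SSX. The following is a minimal sketch in plain Python with illustrative variable names.

```python
import math

# House price sample: price in $1000s (Y), size in square feet (X).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(prices)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n

ssx = sum((x - x_bar) ** 2 for x in sqft)  # SSX = sum of (X_i - X_bar)^2
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices)) / ssx
a = y_bar - b * x_bar

# SSE = sum of squared residuals.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(sqft, prices))

s_yx = math.sqrt(sse / (n - 2))   # standard error of the estimate
s_b = s_yx / math.sqrt(ssx)       # standard error of the slope

print(f"S_YX = {s_yx:.5f}")
print(f"S_b  = {s_b:.5f}")
```

The two results should agree, to rounding, with the Standard Error under Regression Statistics and the Standard Error in the Square Feet row of the Excel output.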

## Excel Output

$$S_b = 0.03297$$

In the Excel output, this value is reported as the Standard Error in the Square Feet row of the coefficients table

$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \ne 0$ (linear relationship does exist)

Test statistic:

$$t = \frac{b - \beta_1}{S_b}, \qquad \text{d.f.} = n - 2$$

where:

b = regression slope coefficient
$\beta_1$ = hypothesized slope
$S_b$ = standard error of the slope


Estimated regression equation:

$$\widehat{\text{house price}} = 98.25 + 0.1098\,(\text{sq. ft.})$$

The slope of this model is 0.1098

Does square footage of the house affect its sales price?

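## t Test Example

$H_0: \beta_1 = 0$
$H_1: \beta_1 \ne 0$

| | Coefficients (b) | Standard Error ($S_b$) | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 |

$$t = \frac{b - \beta_1}{S_b} = \frac{0.10977 - 0}{0.03297} = 3.32938$$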

## t Test Example (continued)

$H_0: \beta_1 = 0$
$H_1: \beta_1 \ne 0$

From Excel output, for the Square Feet coefficient: b = 0.10977, $S_b$ = 0.03297, t Stat = 3.32938, P-value = 0.01039

d.f. = 10 - 2 = 8, with $\alpha/2 = 0.025$ in each tail

Critical values: $\pm t_{\alpha/2} = \pm 2.3060$

[Figure: t distribution with rejection regions below -2.3060 and above 2.3060; the test statistic 3.329 falls in the upper rejection region]

Decision: Reject $H_0$

Conclusion: There is sufficient evidence that square footage affects house price

## t Test Example (continued)

$H_0: \beta_1 = 0$
$H_1: \beta_1 \ne 0$

From Excel output, the P-value for the Square Feet coefficient is 0.01039

This is a two-tail test, so the p-value is P(t > 3.329) + P(t < -3.329) = 0.01039 (for 8 d.f.)

Decision: P-value < $\alpha$ = 0.05, so reject $H_0$

Conclusion: There is sufficient evidence that square footage affects house price
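The two decision rules for the t test (critical value and p-value) can be written out side by side. This sketch takes b, $S_b$, the critical value 2.3060 and the p-value 0.01039 as given in the text and Excel output rather than recomputing them; the variable names are illustrative.

```python
# Values quoted in the text and Excel output for the Square Feet coefficient.
b = 0.10977           # estimated slope
s_b = 0.03297         # standard error of the slope
beta_1_null = 0.0     # hypothesized slope under H0
n = 10                # sample size
alpha = 0.05          # level of significance (alpha/2 = 0.025 per tail)

t_critical = 2.3060   # t value for 8 d.f. and alpha/2 = 0.025, from the text
p_value = 0.01039     # two-tail p-value, from the Excel output

df = n - 2
t_stat = (b - beta_1_null) / s_b

# Critical-value approach: reject H0 if |t| exceeds the critical value.
reject_by_critical_value = abs(t_stat) > t_critical
# p-value approach: reject H0 if the p-value is below alpha.
reject_by_p_value = p_value < alpha

print(f"d.f. = {df}, t = {t_stat:.3f}")
print(f"Reject H0 (critical value): {reject_by_critical_value}")
print(f"Reject H0 (p-value):        {reject_by_p_value}")
```

Both rules lead to the same decision for these data, as they always do for the same test and the same level of significance.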

## F-Test for Significance

F test statistic:

$$F = \frac{MSR}{MSE}$$

where

$$MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1}$$

F follows an F distribution with k numerator and (n - k - 1) denominator degrees of freedom (k = the number of independent variables in the regression model)

## Excel Output

$$F = \frac{MSR}{MSE} = \frac{18934.9348}{1708.1957} = 11.0848$$

with 1 and 8 degrees of freedom

In the Excel output, MSR and MSE are the Regression and Residual entries in the MS column of the ANOVA table, and Significance F = 0.01039 is the P-value for the F-Test

## F-Test for Significance (continued)

$H_0: \beta_1 = 0$
$H_1: \beta_1 \ne 0$
$\alpha = 0.05$
$df_1 = 1$, $df_2 = 8$

Critical value: $F_{0.05} = 5.32$

Test statistic:

$$F = \frac{MSR}{MSE} = 11.08$$

[Figure: F distribution with the rejection region of area $\alpha = 0.05$ to the right of $F_{0.05} = 5.32$]

Decision: Reject $H_0$ at $\alpha = 0.05$

Conclusion: There is sufficient evidence that house size affects selling price
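The F statistic can be rebuilt from the sums of squares in the ANOVA table. The sketch below uses SSR, SSE, the critical value 5.32 and the t statistic as quoted in the text; the variable names are illustrative. It also checks a general property of simple linear regression (one independent variable) that the slides do not state: F equals the square of the slope's t statistic.

```python
# Sums of squares from the ANOVA table in the Excel output.
ssr = 18934.9348      # regression sum of squares
sse = 13665.5652      # error (residual) sum of squares
n = 10                # sample size
k = 1                 # number of independent variables

f_critical = 5.32     # F(0.05) with 1 and 8 d.f., from the text
t_stat = 3.32938      # t statistic for the slope, from the Excel output

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse

reject_h0 = f_stat > f_critical

print(f"MSR = {msr:.4f}, MSE = {mse:.4f}")
print(f"F = {f_stat:.4f} with {k} and {n - k - 1} d.f.")
print(f"Reject H0 at alpha = 0.05: {reject_h0}")
print(f"t squared = {t_stat ** 2:.4f}")
```

Because F is the square of t here, the F test and the two-tail t test for the slope report the same p-value, which is why Significance F matches the P-value in the Square Feet row.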

## Confidence Interval Estimate for the Slope

Confidence interval estimate of the slope:

$$b \pm t_{n-2}\,S_b, \qquad \text{d.f.} = n - 2$$

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 | -35.57720 | 232.07386 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 | 0.03374 | 0.18580 |

At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858)

## Confidence Interval Estimate for the Slope (continued)

Since the units of the house price variable are \$1000s, we are 95% confident that the average impact on sales price is between \$33.70 and \$185.80 per square foot of house size

This 95% confidence interval does not include 0.

Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance
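The interval limits follow directly from the slope, its standard error, and the t value for 8 degrees of freedom. This is a minimal sketch using the rounded values quoted in the text (b = 0.10977, $S_b$ = 0.03297, t = 2.3060); the variable names are illustrative.

```python
# Values quoted in the text and Excel output.
b = 0.10977           # estimated slope (price in $1000s per square foot)
s_b = 0.03297         # standard error of the slope
t_critical = 2.3060   # t value for 8 d.f. at 95% confidence, from the text

# b +/- t * S_b
margin = t_critical * s_b
lower = b - margin
upper = b + margin

# The slope is significant at the .05 level if the interval excludes 0.
excludes_zero = lower > 0 or upper < 0

print(f"95% confidence interval for the slope: ({lower:.5f}, {upper:.5f})")
print(f"Interval excludes 0: {excludes_zero}")
```

The limits are in \$1000s per square foot, so multiplying by 1000 expresses them in dollars per square foot.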

## Pitfalls of Regression Analysis

Lacking an awareness of the assumptions underlying least-squares regression

Not knowing how to evaluate the assumptions

Not knowing the alternatives to least-squares regression if a particular assumption is violated

Using a regression model without knowledge of the subject matter

Extrapolating outside the relevant range

## Strategies for Avoiding the Pitfalls of Regression

Start with a scatter plot of Y against X to observe a possible relationship

Plot the residuals vs. X to check for violations of assumptions such as homoscedasticity

Use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot of the residuals to uncover possible non-normality

If there is a violation of any assumption, use alternative methods or models

If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals

Avoid making predictions or forecasts outside the relevant range

## Chapter Summary

Introduced types of regression models

Reviewed assumptions of regression and correlation

Discussed determining the simple linear regression equation

Described measures of variation

Discussed residual analysis