
Chapter 11

Simple Linear Regression


Contents

Probabilistic Models
Fitting the Model: The Least Squares Approach
Model Assumptions
Assessing the Utility of the Model: Making Inferences
about the Slope 1
The Coefficients of Correlation and Determination
Using the Model for Estimation and Prediction
A Complete Example
Did You Know that Degree, Color and Race
Make a Difference in Home Refinancing?
A study found that purchasers without a college degree pay $1,472 more in broker fees than those with a college degree.
Not only did a degree matter, but race was also a factor: African Americans on average paid $500 more than whites, and Hispanics $275 more than whites.
Regression analysis was used to determine whether various
borrower characteristics had a bearing on the amount of
broker fees and closing costs paid.
Did You Know that the Presence of an
NFL Team Boosts Rental Costs?
Regression analysis has revealed that in cities with an NFL team, rental costs for apartments in the central-city area were 8 percent higher than in cities without an NFL team.
Property tax receipts were also found to be higher in cities with NFL teams.
Regression Analysis
Regression analysis examines associative relationships between a
metric dependent variable and one or more independent
variables in the following ways:
• Determine whether the independent variables explain a
significant variation in the dependent variable: whether a
relationship exists.
• Determine how much of the variation in the dependent variable
can be explained by the independent variables: strength of the
relationship.
• Determine the structure or form of the relationship: the
mathematical equation relating the independent and dependent
variables.
• Predict the values of the dependent variable.
• Control for other independent variables when evaluating the
contributions of a specific variable or set of variables.
Regression Analysis

NOTE that
Regression analysis is concerned with the nature and
degree of association between variables and does not
imply or assume any causality.
A First-Order (Straight Line) Probabilistic Model

y = β0 + β1x + ε
where
y = Dependent or response variable
(variable to be modeled)
x = Independent or predictor variable
(variable used as a predictor of y)

E(y) = β0 + β1x = Deterministic component

ε (epsilon) = Random error component
Independent Variable vs.
Dependent Variable
Independent variable
Explanatory or predictor variable
Often presumed to be a cause of the other
Dependent variable
Criterion Variable
Influenced by the independent variable
A First-Order (Straight Line) Probabilistic Model

y = β0 + β1x + ε

β0 (beta zero) = y-intercept of the line, that is, the
point at which the line intercepts or
cuts through the y-axis

β1 (beta one) = slope of the line, that is, the change
(amount of increase or decrease) in the
deterministic component of y for every
1-unit increase in x
A First-Order (Straight Line) Probabilistic Model

A positive slope implies that E(y) increases by the amount
β1 for each 1-unit increase in x. A negative slope implies
that E(y) decreases by the amount β1.
Five-Step Procedure

Step 1: Hypothesize the deterministic component of the


model that relates the mean, E(y), to the
independent variable x.
Step 2: Use the sample data to estimate unknown
parameters in the model.
Step 3: Specify the probability distribution of the random
error term and estimate the standard deviation of
this distribution.
Step 4: Statistically evaluate the usefulness of the model.
Step 5: When satisfied that the model is useful, use it for
prediction, estimation, and other purposes.
Explaining Attitude Toward the City of
Residence
Respondent No. | Attitude Toward the City | Duration of Residence | Importance Attached to Weather
 1 |  6 | 10 |  3
 2 |  9 | 12 | 11
 3 |  8 | 12 |  4
 4 |  3 |  4 |  1
 5 | 10 | 12 | 11
 6 |  4 |  6 |  1
 7 |  5 |  8 |  7
 8 |  2 |  2 |  4
 9 | 11 | 18 |  8
10 |  9 |  9 | 10
11 | 10 | 17 |  8
12 |  2 |  2 |  5
Scatter Diagram
[Scatter diagram of Attitude Toward the City (y-axis) versus Duration of Residence (x-axis, 0-20)]
Which Straight Line Is Best on this Scatter?

[Scatter of the attitude data with four candidate straight lines, Line 1 through Line 4, drawn through the points; Duration of Residence on the x-axis, 0-20]
Fitting the Model:
The Least Squares Approach
Scatterplot

1. Plot of all (xi, yi) pairs


2. Suggests how well model will fit

[Generic scatterplot of y versus x]
Which Line Fits Best?

• How would you draw a line through the points?


• How do you determine which line ‘fits best’?

[The same scatterplot of y versus x]
Least Squares Line

The least squares line ŷ = β̂0 + β̂1x is the one that has
the following two properties:
1. The sum of the errors equals 0,
i.e., the mean error is 0.
2. The sum of squared errors (SSE) is smaller than for any
other straight-line model, i.e., the error variance is
minimum.
Least Squares Graphically

LS minimizes SSE = Σ(i = 1 to n) ε̂i² = ε̂1² + ε̂2² + ε̂3² + ε̂4²

[Figure: scatter of four points with the fitted line ŷi = β̂0 + β̂1xi; the vertical deviations ε̂1, ε̂2, ε̂3, ε̂4 are the errors being squared and summed. For example, y2 = β̂0 + β̂1x2 + ε̂2.]
Formula for the Least Squares Estimates

Slope: β̂1 = SSxy / SSxx

y-intercept: β̂0 = ȳ − β̂1x̄

where SSxy = Σ(xi − x̄)(yi − ȳ)

SSxx = Σ(xi − x̄)²

n = sample size
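As an illustration (not part of the original slides), here is a minimal Python sketch of the least squares formulas above; the function and variable names are our own.

def least_squares(xs, ys):
    """Return (b0_hat, b1_hat) for the fitted line y_hat = b0_hat + b1_hat * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1_hat = ss_xy / ss_xx           # slope estimate
    b0_hat = y_bar - b1_hat * x_bar  # y-intercept estimate
    return b0_hat, b1_hat

Any standard statistics package produces the same estimates; the point here is only that the two formulas translate directly into code.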
Interpreting the Estimates of β0 and β1 in Simple
Linear Regression

y-intercept: β̂0 represents the predicted value of y when
x = 0. (Caution: This value will not be meaningful if
the value x = 0 is nonsensical or outside the
range of the sample data.)

slope: β̂1 represents the increase (or decrease) in y for
every 1-unit increase in x. (Caution: This
interpretation is valid only for x-values within the
range of the sample data.)
Example 1- Applying the Method of Least Squares
to the Advertising-Sales Data

Refer to the monthly advertising-sales data below.
Consider the straight-line model E(y) = β0 + β1x, where
y = sales revenue (thousands of dollars) and x = advertising
expenditure (hundreds of dollars).
a) Use the method of least squares to estimate the values of β0 and β1.
b) Predict the sales revenue when the advertising expenditure is $200.
c) Find SSE for the analysis.

Ad Exp., x ($100s) | Sales, y ($1000s)
1 | 1
2 | 1
3 | 2
4 | 2
5 | 4
Scattergram
Sales vs. Advertising

[Scattergram of Sales versus Advertising]
Parameter Estimation Solution

β̂1 = SSxy / SSxx = 7/10 = .70

β̂0 = ȳ − β̂1x̄ = 2 − (.70)(3) = −.10

ŷ = −.1 + .7x
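A quick numerical check of this solution (a sketch of ours, not from the slides), using the advertising-sales data; it also answers parts (b) and (c) of Example 1.

# Advertising-sales data: x in $100s, y in $1000s
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n                         # 3, 2
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 7.0
ss_xx = sum((x - x_bar) ** 2 for x in xs)                       # 10.0
b1 = ss_xy / ss_xx        # 0.70
b0 = y_bar - b1 * x_bar   # -0.10

# (b) predicted sales when advertising expenditure is $200, i.e. x = 2
print(b0 + b1 * 2)        # about 1.3 -> roughly $1,300 in sales revenue

# (c) SSE for the least squares line
print(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)))    # about 1.1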
Coefficient Interpretation Solution

1. Slope (β̂1)
• Sales volume (y) is expected to increase by $700
for each $100 increase in advertising (x), over the
sampled range of advertising expenditures from
$100 to $500.

2. y-intercept (β̂0)
• Since x = 0 is outside of the range of the sampled
values of x, the y-intercept has no meaningful
interpretation.
Regression Line Fitted to the Data

[Scattergram of Sales versus Advertising with the fitted regression line ŷ = −.1 + .7x]
Example: Income vs Consumption Expenditure

Income, x ($1,000s) | Consumption Expenditure, y ($1,000s)
1 | 7
5 | 6
9 | 9
13 | 8
17 | 10
Questions

Construct a scatterplot and determine whether a linear model is
appropriate. If so:
Find the least squares prediction line.
Estimate consumption expenditure for a household with
an income of (i) $6,000 and (ii) $25,000. Are you comfortable
with these estimates?
Compute the residuals.
Scatterplot

[Scatterplot: Consumption Expenditure ($1,000's) versus Household Income ($1,000's)]
Solution
Inc. x | Exp. y | xi − x̄ | (xi − x̄)² | yi − ȳ | (yi − ȳ)² | (xi − x̄)(yi − ȳ)
 1 |  7 | −8 | 64 | −1 | 1 |  8
 5 |  6 | −4 | 16 | −2 | 4 |  8
 9 |  9 |  0 |  0 |  1 | 1 |  0
13 |  8 |  4 | 16 |  0 | 0 |  0
17 | 10 |  8 | 64 |  2 | 4 | 16
Σx = 45 | Σy = 40 | Σ(xi − x̄) = 0 | Σ(xi − x̄)² = 160 | Σ(yi − ȳ) = 0 | Σ(yi − ȳ)² = 10 | Σ(xi − x̄)(yi − ȳ) = 32
least squares prediction line

β̂1 = SSxy / SSxx = 32/160 = 0.2
β̂0 = ȳ − β̂1x̄ = 8 − (0.2)(9) = 6.2

So the prediction line is ŷ = 6.2 + 0.2x.

For income $6,000 (x = 6):
6.2 + 0.2(6) = 7.4 (i.e., about $7,400)
For income $25,000 (x = 25):
6.2 + 0.2(25) = 11.2 (i.e., about $11,200; note that x = 25 lies outside the sampled income range of 1 to 17, so this estimate is an extrapolation)
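A small Python sketch (ours, not from the slides) of the two predictions above; all amounts are in $1,000s.

def predict(income_thousands):
    # least squares prediction line fitted above
    return 6.2 + 0.2 * income_thousands

print(predict(6))   # 7.4  -> about $7,400
print(predict(25))  # 11.2 -> about $11,200; x = 25 is outside the sampled
                    # income range (1 to 17), so this is an extrapolation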
Least Squares Prediction Line

[Scatterplot of Consumption Expenditure versus Household Income with the fitted line y = 6.2 + 0.2x]
Consumption Expenditure Prediction When x=$6,000

[The same plot, with the predicted value ŷ = 7.4 marked at x = 6]
Consumption Expenditure Prediction When x=$25,000

[The same plot extended to x = 25, with the predicted value ŷ = 11.2 marked]
C. Compute the Residuals

Inc. x | ConE y | ŷ = 6.2 + .2x | y − ŷ | (y − ŷ)²
 1 |  7 | 6.4 |   .6 |  .36
 5 |  6 | 7.2 | −1.2 | 1.44
 9 |  9 | 8.0 |  1.0 | 1.00
13 |  8 | 8.8 |  −.8 |  .64
17 | 10 | 9.6 |   .4 |  .16
Σ residuals = 0 | Σ(residuals)² = 3.6
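The residual computation above can be verified with a few lines of Python (an illustrative sketch, not part of the slides):

xs = [1, 5, 9, 13, 17]
ys = [7, 6, 9, 8, 10]

y_hats = [6.2 + 0.2 * x for x in xs]               # 6.4, 7.2, 8.0, 8.8, 9.6
residuals = [y - yh for y, yh in zip(ys, y_hats)]  # 0.6, -1.2, 1.0, -0.8, 0.4
print(round(sum(residuals), 10))                   # 0.0  (the residuals sum to zero)
print(round(sum(e ** 2 for e in residuals), 10))   # 3.6  (SSE)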
Residuals

[The same fitted-line plot; the residuals are the vertical deviations of the observed points from the line y = 6.2 + 0.2x]
Income Residual Plot

[Income residual plot: residuals (roughly −2 to 2) versus Income (0-20)]
residuals, (residuals)2

Note that
* Sresiduals = 0
* SSE=Sum of (residuals)2 = 3.6

Any other line drawn through the scatterplot


will have
* S(residuals)2 > 3.6
Assessing the Utility of the Model:
Making Inferences about the Slope β1
Usefulness of the Hypothesized Model

Testing the null hypothesis that the linear model
contributes no information for the prediction of y against
the alternative hypothesis that the linear model is useful
in predicting y, we test

H0: β1 = 0 against Ha: β1 ≠ 0

If the data support the alternative hypothesis, then x does
contribute information for the prediction of y using the
straight-line model.
A Test of Model Utility: Simple Linear Regression

H0: β1 = 0

Test statistic: t = β̂1 / s_β̂1 = β̂1 / (s / √SSxx)

Alternative Hypothesis | Rejection Region
Ha: β1 > 0 | t > tα
Ha: β1 < 0 | t < −tα
Ha: β1 ≠ 0 | t > tα/2 or t < −tα/2

where tα and tα/2 are based on (n − 2) degrees of freedom
Example 3 - Testing the Regression Slope, β1 - Sales
Revenue Model
Refer to the simple linear regression analysis of the
advertising-sales data (Examples 1 and 2). Conduct a test
(at α = 0.05) to determine if sales revenue (y) is linearly
related to advertising expenditure (x).
Test of Slope Coefficient Solution

H0: β1 = 0
Ha: β1 ≠ 0
α = .05
df = 5 − 2 = 3
Critical values: t.025 = ±3.182 (reject H0 if t < −3.182 or t > 3.182)
Test Statistic Solution

s_β̂1 = s / √SSxx = .6055 / √(55 − (15)²/5) = .6055 / √10 = .1914

t = β̂1 / s_β̂1 = .70 / .1914 = 3.657
Test of Slope Coefficient Solution

H0: β1 = 0
Ha: β1 ≠ 0
α = .05
df = 5 − 2 = 3
Critical values: ±3.182

Test statistic: t = 3.657

Decision: Reject H0 at α = .05.
Conclusion: There is evidence of a linear relationship
between sales revenue and advertising expenditure.
Test of Slope Coefficient Computer Output
Excel SUMMARY OUTPUT

Regression Statistics
Multiple R          0.903696114
R Square            0.816666667
Adjusted R Square   0.755555556
Standard Error      0.605530071
Observations        5

ANOVA
            df   SS    MS         F          Significance F
Regression   1   4.9   4.9        13.36364   0.035352847
Residual     3   1.1   0.366667
Total        4   6

                     Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept            -0.1           0.635085296      -0.15746   0.884884   -2.121124854   1.921124854
Ad Exp., x ($100s)    0.7           0.191485422       3.655631  0.035353    0.090607928   1.309392072

(Highlighted on the printout: β̂1 = 0.7, its standard error s_β̂1 = 0.1915, t = β̂1 / s_β̂1 = 3.656, and the corresponding P-value = 0.0354.)
The Coefficients of Correlation and
Determination
The Coefficient of Correlation (r)
Correlation Models

Answers ‘How strong is the linear relationship


between two variables?’
Coefficient of correlation (r)
Sample correlation coefficient denoted r
Values range from –1 to +1
Measure of the strength of the linear
relationship
A low correlation does not necessarily imply
that x and y are unrelated, only that x and y
are not strongly linearly related.
Coefficient of Correlation
Excel SUMMARY OUTPUT (the same printout as on the slope-test slide): the Multiple R value, 0.903696114, is the coefficient of correlation r.
The Coefficient of Determination (r²)
Decomposition of the Total Variation

[Figure: the total variation in Y decomposed into explained variation (SSreg) and residual variation (SSE), illustrated at X1 through X5]
Decomposition of the Total Variation

Total variation: SSyy = SSreg + SSE

Coefficient of Determination

The coefficient of determination measures the contribution of x
in predicting y. To accomplish this, we calculate how much the
errors of prediction of y are reduced by using the information
provided by x:

r² = (SSyy − SSE) / SSyy

r² = (coefficient of correlation)²

−1 ≤ r ≤ 1, 0 ≤ r² ≤ 1
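A short sketch (ours, not from the slides) computing r and r² for the advertising-sales data, both from the sums-of-squares decomposition and directly with numpy (assumed to be installed):

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 4], dtype=float)

ss_yy = np.sum((y - y.mean()) ** 2)          # total variation = 6.0
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)       # about 1.1
r_squared = (ss_yy - sse) / ss_yy            # about 0.8167
r = np.corrcoef(x, y)[0, 1]                  # about 0.9037
print(r, r ** 2, r_squared)                  # r**2 agrees with (SSyy - SSE) / SSyy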
Example - Obtaining the Value of r² for the
Sales Revenue Model

Calculate the coefficient of determination for the


advertising-sales example. Interpret the result.

r² = (coefficient of correlation)² = (.904)² = .817

Interpretation: About 81.7% of the sample


variation in Sales (y) can be explained by using Ad
$ (x) to predict Sales (y) in the linear model.
Excel SUMMARY OUTPUT (same printout): the R Square value, 0.816666667, is the coefficient of determination r².
From the ANOVA portion of the same printout, r² can also be computed as (SSyy − SSE) / SSyy = (6 − 1.1) / 6 = 0.8167, which agrees with the R Square value.
Testing Global Usefulness of the Model: The
Analysis of Variance F-Test
Excel SUMMARY OUTPUT (same printout): the F statistic = MSreg / MSresidual = 4.9 / 0.366667 = 13.36364, and its P-value (Significance F) = 0.035352847.
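The printout above can be reproduced with the statsmodels library (a sketch under the assumption that statsmodels and numpy are installed; the variable names are ours):

import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5], dtype=float)   # Ad Exp. ($100s)
y = np.array([1, 1, 2, 2, 4], dtype=float)   # Sales ($1000s)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.summary())     # R-squared about 0.817, F about 13.36, Prob(F) about 0.035
print(model.fvalue)        # F statistic = MSreg / MSresidual
print(model.f_pvalue)      # Significance F
print(model.params)        # approximately [-0.1, 0.7]  (intercept, slope)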
A Complete Example
Example

Suppose a fire insurance company wants to relate the


amount of fire damage in major residential fires to the
distance between the burning house and the nearest
fire station. The study is to be conducted in a large
suburb of a major city; a sample of 15 recent fires in
this suburb is selected. The amount of damage, y, and
the distance between the fire and the nearest fire
station, x, are recorded for each fire.
Example
[Scatterplot: fire DAMAGE, y (0-50, $1,000s), versus distance from the nearest fire station, x (0-7 miles)]
Example

Step 1:

First, we hypothesize a model to relate fire damage, y,


to the distance from the nearest fire station, x. We
hypothesize a straight-line probabilistic model:
y = β0 + β1x + ε
Example

Step 2:
Use a statistical software package to estimate the
unknown parameters in the deterministic component of
the hypothesized model. The Excel printout for the
simple linear regression analysis is shown on the next
slide. The least squares estimates of the slope β1 and
intercept β0, highlighted on the printout, are

β̂1 = 4.919331
β̂0 = 10.277929
Example

Least Squares Equation: ŷ = 10.278 + 4.919x


Example

This prediction equation is graphed in the Minitab scatterplot:
[scatterplot of the fire-damage data with the least squares line ŷ = 10.278 + 4.919x]
Interpretation

The least squares estimate of the slope, β̂1 = 4.919,
implies that the estimated mean damage increases by
$4,919 for each additional mile from the fire station. This
interpretation is valid over the sampled range of x, from .7 to
6.1 miles from the station. The estimated y-intercept,
β̂0 = 10.278, has the interpretation that a fire 0 miles from the
fire station has an estimated mean damage of $10,278.
Example

Step 3: Specify the probability distribution of the
random error component ε.

The estimate of the standard deviation σ of ε,
highlighted on the Excel printout, is

s = 2.31635

This implies that most of the observed fire damage (y)
values will fall within approximately 2s = 4.64
thousand dollars of their respective predicted values
when using the least squares line.
Example

Step 4:
First, test the null hypothesis that the slope β1 is 0 –
that is, that there is no linear relationship between fire
damage and the distance from the nearest fire station,
against the alternative hypothesis that fire damage
increases as the distance increases. We test
H0: β1 = 0
Ha: β1 > 0
The two-tailed observed significance level for testing is
approximately 0.
Example
The 95% confidence interval for β1 yields (4.070, 5.768).
How? It is computed as β̂1 ± t.025 · s_β̂1, with t.025 based on
n − 2 = 13 degrees of freedom.
So?
We estimate (with 95% confidence) that the interval
from $4,070 to $5,768 encloses the mean increase (β1)
in fire damage per additional mile of distance from the fire
station.
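The same confidence-interval calculation, sketched in Python on the advertising-sales numbers (where every input appears on the earlier slides); the fire-damage interval (4.070, 5.768) is obtained the same way from its own β̂1, s_β̂1, and 13 degrees of freedom.

from scipy import stats

b1_hat, s_b1, df = 0.70, 0.1914, 3     # advertising-sales example (n = 5)
t_crit = stats.t.ppf(0.975, df)        # about 3.182
lower = b1_hat - t_crit * s_b1
upper = b1_hat + t_crit * s_b1
print(lower, upper)                    # about (0.091, 1.309), matching the Excel printout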
So?
The coefficient of determination is r² = .9235, which implies
that about 92% of the sample variation in fire damage (y) is
explained by the distance (x) between the fire and the
fire station.
Example

The coefficient of correlation, r, which measures the
strength of the linear relationship between y and x, is

r = +√r² = √.9235 = .96

(positive because β̂1 is positive). The high correlation confirms
our conclusion that β1 is greater than 0; it appears that fire
damage and distance from the fire station are positively
correlated. All signs point to a strong linear relationship
between y and x.
F test

The global ANOVA F-test for this model leads to the same conclusion: in simple linear regression the F statistic equals t², so a significant t-test on β1 implies a significant F-test for overall model usefulness.