
Copyright 2003 Brooks/Cole

A division of Thomson Learning, Inc.



Example
Let y be the monthly sales revenue for a
company. This might be a function of
several variables:
$x_1$ = advertising expenditure
$x_2$ = time of year
$x_3$ = state of economy
$x_4$ = size of inventory
We want to predict y using knowledge of
$x_1$, $x_2$, $x_3$, and $x_4$.


A Simple Linear Model
In Chapter 3, we used the equation of
a line to describe the relationship between y
and x for a sample of n pairs, (x, y).
If we want to describe the relationship
between y and x for the whole population,
there are two models we can choose:
Deterministic Model: $y = \alpha + \beta x$
Probabilistic Model:
y = deterministic model + random error
$y = \alpha + \beta x + \epsilon$

A Simple Linear Model
Since the bivariate measurements that
we observe do not generally fall
exactly on a straight line, we choose to
use:
Probabilistic Model:
$y = \alpha + \beta x + \epsilon$
$E(y) = \alpha + \beta x$
Points deviate from the
line of means by an amount
$\epsilon$, where $\epsilon$ has a normal
distribution with mean 0 and
variance $\sigma^2$.

The Method of Least Squares
The equation of the best-fitting line
is calculated using a set of n pairs $(x_i, y_i)$.
We choose our estimates a
and b to estimate $\alpha$ and $\beta$ so
that the sum of the squared vertical
distances of the points from the line
is minimized.
Best fitting line: $\hat{y} = a + bx$
Choose a and b to minimize
$\mathrm{SSE} = \sum (y - \hat{y})^2 = \sum (y - a - bx)^2$

Least Squares Estimators


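Calculate the sums of squares:
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$
$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$
Best fitting line: $\hat{y} = a + bx$, where
$b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$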

Example
The table shows the math achievement test
scores for a random sample of n = 10 college
freshmen, along with their final calculus
grades.
Student 1 2 3 4 5 6 7 8 9 10
Math test, x 39 43 21 64 57 47 28 75 34 52
Calculus grade, y 65 78 52 82 92 89 73 98 56 75
Use your calculator
to find the sums
and sums of
squares.
$\sum x = 460$, $\sum y = 760$
$\sum x^2 = 23634$, $\sum y^2 = 59816$
$\sum xy = 36854$
$\bar{x} = 46$, $\bar{y} = 76$




Example
$S_{xx} = 23634 - \frac{(460)^2}{10} = 2474$
$S_{yy} = 59816 - \frac{(760)^2}{10} = 2056$
$S_{xy} = 36854 - \frac{(460)(760)}{10} = 1894$
$b = \frac{1894}{2474} = .76556$ and $a = 76 - .76556(46) = 40.78$
Best fitting line: $\hat{y} = 40.78 + .77x$
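The hand calculation for the calculus example can be checked with a few lines of code. This is a minimal Python sketch, not part of the original slides; the function name `least_squares_line` and the variable names are illustrative assumptions, and the data are the ten (x, y) pairs from the table.

```python
def least_squares_line(x, y):
    """Return (Sxx, Syy, Sxy, a, b) using the computing formulas for the sums of squares."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    s_xy = sum(u * v for u, v in zip(x, y)) - sum_x * sum_y / n
    b = s_xy / s_xx                    # slope estimate
    a = sum_y / n - b * sum_x / n      # intercept estimate: y-bar minus b times x-bar
    return s_xx, s_yy, s_xy, a, b


math_test = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # x, math achievement score
calculus = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]    # y, final calculus grade

s_xx, s_yy, s_xy, a, b = least_squares_line(math_test, calculus)
print(s_xx, s_yy, s_xy)
print(round(b, 5), round(a, 2))
```

The three sums of squares and the two estimates should agree with the slide values up to rounding.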

The Analysis of Variance
The total variation in the experiment is
measured by the total sum of squares:
Total SS $= S_{yy} = \sum (y - \bar{y})^2$
The Total SS is divided into two parts:
SSR (sum of squares for regression):
measures the variation explained by using x in
the model.
SSE (sum of squares for error): measures the
leftover variation not explained by x.

The Analysis of Variance
We calculate
$\mathrm{SSR} = \frac{(S_{xy})^2}{S_{xx}} = \frac{1894^2}{2474} = 1449.9741$
$\mathrm{SSE} = \text{Total SS} - \mathrm{SSR} = S_{yy} - \frac{(S_{xy})^2}{S_{xx}} = 2056 - 1449.9741 = 606.0259$

The ANOVA Table
Total df = n - 1
Regression df = 1
Error df = n - 1 - 1 = n - 2
Mean Squares
MSR = SSR/(1)
MSE = SSE/(n - 2)
Source      df     SS        MS           F
Regression  1      SSR       SSR/(1)      MSR/MSE
Error       n - 2  SSE       SSE/(n - 2)
Total       n - 1  Total SS

The Calculus Problem
Source df SS MS F
Regression 1 1449.9741 1449.9741 19.14
Error 8 606.0259 75.7532
Total 9 2056.0000
$\mathrm{SSR} = \frac{(S_{xy})^2}{S_{xx}} = \frac{1894^2}{2474} = 1449.9741$
$\mathrm{SSE} = \text{Total SS} - \mathrm{SSR} = 2056 - 1449.9741 = 606.0259$
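The entries of this ANOVA table follow mechanically from the three sums of squares. The sketch below is an illustrative Python helper (the name `anova_simple` is an assumption, not something from the slides) that rebuilds SSR, SSE, the mean squares, and F for a simple linear regression.

```python
def anova_simple(s_xx, s_yy, s_xy, n):
    """ANOVA quantities for simple linear regression (1 regression df, n - 2 error df)."""
    ssr = s_xy ** 2 / s_xx      # variation explained by x
    sse = s_yy - ssr            # leftover variation; Total SS = Syy
    msr = ssr / 1
    mse = sse / (n - 2)
    return ssr, sse, msr, mse, msr / mse


# Sums of squares from the calculus example (n = 10 students)
ssr, sse, msr, mse, f_stat = anova_simple(s_xx=2474, s_yy=2056, s_xy=1894, n=10)
print(f"SSR={ssr:.4f}  SSE={sse:.4f}  MSE={mse:.4f}  F={f_stat:.2f}")
```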

Testing the Usefulness
of the Model
The first question to ask is whether the
independent variable x is of any use in
predicting y.
If it is not, then the value of y does not change,
regardless of the value of x. This implies that
the slope of the line, $\beta$, is zero.
$H_0: \beta = 0$ versus $H_a: \beta \neq 0$

Testing the
Usefulness of the Model
The test statistic is a function of b, our best
estimate of $\beta$. Using MSE as the best estimate
of the random variation $\sigma^2$, we obtain a t
statistic.
Test statistic: $t = \frac{b - 0}{\sqrt{\mathrm{MSE}/S_{xx}}}$, which has a t distribution
with df = n - 2, or a confidence interval: $b \pm t_{\alpha/2}\sqrt{\mathrm{MSE}/S_{xx}}$

The Calculus Problem
Is there a significant relationship between
the calculus grades and the test scores at the
5% level of significance?
$H_0: \beta = 0$ versus $H_a: \beta \neq 0$
$t = \frac{b - 0}{\sqrt{\mathrm{MSE}/S_{xx}}} = \frac{.7656 - 0}{\sqrt{75.7532/2474}} = 4.38$
Reject $H_0$ when |t| > 2.306. Since t = 4.38 falls into
the rejection region, $H_0$ is rejected.
There is a significant linear relationship
between the calculus grades and the test scores
for the population of college freshmen.
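The test above can be reproduced numerically. The following is a small Python sketch under stated assumptions: the helper name `slope_t_statistic` is hypothetical, and the critical value 2.306 is simply the table value quoted on the slide for 8 df rather than something the code computes.

```python
import math


def slope_t_statistic(b, mse, s_xx, beta_null=0.0):
    """Return (t, SE(b)) for testing H0: beta = beta_null in simple linear regression."""
    se_b = math.sqrt(mse / s_xx)
    return (b - beta_null) / se_b, se_b


t_stat, se_b = slope_t_statistic(b=0.7656, mse=75.7532, s_xx=2474)
t_critical = 2.306                      # t(.025) with n - 2 = 8 df, taken from the slide
reject_h0 = abs(t_stat) > t_critical    # two-tailed test at the 5% level
print(round(t_stat, 2), round(se_b, 4), reject_h0)
```

The standard error returned here is the same quantity Minitab reports as "SE Coef" for x.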

The F Test
You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.
To test whether the model is useful in predicting y ($H_0: \beta = 0$):
Test statistic: $F = \frac{\mathrm{MSR}}{\mathrm{MSE}}$
Reject $H_0$ if $F > F_\alpha$ with 1 and n - 2 df.
This test is exactly equivalent to the t-test, with $t^2 = F$.
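The equivalence of the two tests follows from definitions already given in these slides ($b = S_{xy}/S_{xx}$, $\mathrm{SSR} = S_{xy}^2/S_{xx}$, and MSR = SSR/1). A short derivation, added here as a worked step:

```latex
t^2 = \left(\frac{b}{\sqrt{\mathrm{MSE}/S_{xx}}}\right)^2
    = \frac{b^2 S_{xx}}{\mathrm{MSE}}
    = \frac{(S_{xy}/S_{xx})^2\, S_{xx}}{\mathrm{MSE}}
    = \frac{S_{xy}^2/S_{xx}}{\mathrm{MSE}}
    = \frac{\mathrm{SSR}}{\mathrm{MSE}}
    = \frac{\mathrm{MSR}}{\mathrm{MSE}} = F
```

For the calculus problem, t is about 4.375 before rounding, and $4.375^2 \approx 19.14$, which matches the F value in the ANOVA table.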

Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x
Predictor Coef SE Coef T P
Constant 40.784 8.507 4.79 0.001
x 0.7656 0.1750 4.38 0.002

S = 8.704 R-Sq = 70.5% R-Sq(adj) = 66.8%

Analysis of Variance
Source DF SS MS F P
Regression 1 1450.0 1450.0 19.14 0.002
Residual Error 8 606.0 75.8
Total 9 2056.0
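Minitab Output
Callouts on the printout:
Least squares regression line: y = 40.8 + 0.766 x
Regression coefficients, a and b: the Coef column
To test $H_0: \beta = 0$: the T and P values for x
MSE: the MS value for Residual Error
$t^2 = F$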

Measuring the Strength
of the Relationship
If the independent variable x is useful in
predicting y, you will want to know how well
the model fits.
The strength of the relationship between x and y
can be measured using:
Correlation coefficient: $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$
Coefficient of determination: $r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = \frac{\mathrm{SSR}}{\text{Total SS}}$

Measuring the Strength
of the Relationship
Since Total SS = SSR + SSE, $r^2 = \frac{\mathrm{SSR}}{\text{Total SS}}$ measures
the proportion of the total variation in the
responses that can be explained by using the
independent variable x in the model.
Equivalently, it is the percent reduction in the total variation obtained by
using the regression equation rather than just
using the sample mean $\bar{y}$ to estimate y.
For the calculus problem, $r^2 = .705$ or
70.5%. The model is working well!

Checking the
Regression Assumptions
1. The relationship between x and y is linear,
given by $y = \alpha + \beta x + \epsilon$.
2. The random error terms $\epsilon$ are independent and,
for any value of x, have a normal distribution
with mean 0 and variance $\sigma^2$.
Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.

Residuals
The residual error is the leftover
variation in each data point after the
variation explained by the regression model
has been removed.

If all assumptions have been met, these
residuals should be normal, with mean 0
and variance $\sigma^2$.
Residual $= y_i - \hat{y}_i$ or $y_i - (a + bx_i)$
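To make the definition concrete, the residuals for the calculus example can be computed directly. This is an illustrative Python sketch (variable names are assumptions); it uses the rounded estimates a = 40.78424 and b = .76556 quoted elsewhere in these slides and checks two properties of least squares residuals: they sum to approximately zero, and their squares sum to SSE.

```python
math_test = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # x
calculus = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]    # y
a, b = 40.78424, 0.76556    # least squares estimates from the calculus example

# Residual = observed y minus fitted value a + b*x
residuals = [y - (a + b * x) for x, y in zip(math_test, calculus)]
sse = sum(e * e for e in residuals)

print([round(e, 2) for e in residuals])
print(round(sum(residuals), 6), round(sse, 4))
```

A normal probability plot of these ten values is what the diagnostic check on the next slide examines.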


Normal Probability Plot
If the normality assumption is valid, the
plot should resemble a straight line,
sloping upward to the right.
If not, you will often see the pattern fail
in the tails of the graph.

Estimation and Prediction
Once you have
- determined that the regression line is useful, and
- used the diagnostic plots to check for
  violation of the regression assumptions,
you are ready to use the regression line to
- estimate the average value of y for a
  given value of x, or
- predict a particular value of y for a
  given value of x.

Estimation and Prediction
Estimating the
average value of y
when $x = x_0$

Estimating a
particular value of y
when $x = x_0$


Estimation and Prediction
The best estimate of either E(y) or y for
a given value $x = x_0$ is
$\hat{y} = a + bx_0$
Particular values of y are more difficult to
predict, requiring a wider range of values in the
prediction interval.

Estimation and Prediction
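To estimate the average value of y when $x = x_0$:
$\hat{y} \pm t_{\alpha/2} \sqrt{\mathrm{MSE}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}$
To predict a particular value of y when $x = x_0$:
$\hat{y} \pm t_{\alpha/2} \sqrt{\mathrm{MSE}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}$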

The Calculus Problem
Estimate the average calculus grade for
students whose achievement score is 50
with a 95% confidence interval.
Calculate $\hat{y} = 40.78424 + .76556(50) = 79.06$
$\hat{y} \pm 2.306\sqrt{75.7532\left(\frac{1}{10} + \frac{(50 - 46)^2}{2474}\right)} = 79.06 \pm 6.55$
or 72.51 to 85.61.

The Calculus Problem
Predict the calculus grade for a
particular student whose achievement
score is 50 with a 95% prediction
interval.
Calculate $\hat{y} = 40.78424 + .76556(50) = 79.06$
$\hat{y} \pm 2.306\sqrt{75.7532\left(1 + \frac{1}{10} + \frac{(50 - 46)^2}{2474}\right)} = 79.06 \pm 21.11$
or 57.95 to 100.17.
Notice how
much wider this
interval is!
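The two intervals differ only by the extra "1 +" under the square root, which a short script makes easy to see. This is a hedged Python sketch: the function name `interval_half_widths` is hypothetical, and the critical value 2.306 is the table value used on the slides, passed in as a plain number.

```python
import math


def interval_half_widths(x0, n, x_bar, s_xx, mse, t_crit):
    """Return (confidence half-width, prediction half-width) at x = x0."""
    h = 1 / n + (x0 - x_bar) ** 2 / s_xx          # term shared by both formulas
    ci = t_crit * math.sqrt(mse * h)              # for the average value E(y)
    pi = t_crit * math.sqrt(mse * (1 + h))        # for one particular value of y
    return ci, pi


y_hat = 40.78424 + 0.76556 * 50
ci, pi = interval_half_widths(x0=50, n=10, x_bar=46, s_xx=2474, mse=75.7532, t_crit=2.306)

print(f"fit = {y_hat:.2f}")
print(f"95% CI: {y_hat - ci:.2f} to {y_hat + ci:.2f}")
print(f"95% PI: {y_hat - pi:.2f} to {y_hat + pi:.2f}")
```

Because $(x_0 - \bar{x})^2$ appears in both half-widths, both intervals are narrowest at $x_0 = \bar{x}$.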

Predicted Values for New Observations
New Obs Fit SE Fit 95.0% CI 95.0% PI
1 79.06 2.84 (72.51, 85.61) (57.95,100.17)

Values of Predictors for New Observations
New Obs x
1 50.0
Minitab Output: confidence and prediction
intervals when x = 50.
In the accompanying plot, the blue prediction
bands are always wider than the red
confidence bands.
Both intervals are
narrowest when $x = \bar{x}$.

Correlation Analysis
The strength of the relationship between x and y is
measured using the coefficient of correlation:
Correlation coefficient: $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$
Recall from Chapter 3 that
(1) $-1 \le r \le 1$ (2) r and b have the same sign
(3) $r \approx 0$ means no linear relationship
(4) $r \approx 1$ or $-1$ means a strong (+) or (-)
relationship

Example
The table shows the heights and weights of
n = 10 randomly selected college football
players.
Player 1 2 3 4 5 6 7 8 9 10
Height, x 73 71 75 72 72 75 67 69 71 69
Weight, y 185 175 200 210 190 195 150 170 180 175
Use your calculator
to find the sums
and sums of
squares.
$S_{xy} = 328$, $S_{xx} = 60.4$, $S_{yy} = 2610$
$r = \frac{328}{\sqrt{(60.4)(2610)}} = .8261$
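The sums of squares and r for the football players can be verified from the table. A minimal Python sketch follows; the variable names are illustrative and nothing beyond the standard library is assumed.

```python
import math

height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]             # x
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]   # y
n = len(height)

s_xx = sum(x * x for x in height) - sum(height) ** 2 / n
s_yy = sum(y * y for y in weight) - sum(weight) ** 2 / n
s_xy = sum(x * y for x, y in zip(height, weight)) - sum(height) * sum(weight) / n

r = s_xy / math.sqrt(s_xx * s_yy)    # sample correlation coefficient
print(round(s_xy, 1), round(s_xx, 1), round(s_yy, 1), round(r, 4))
```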


Some Correlation Patterns
Use the Exploring Correlation applet to
explore some correlation patterns:
r = 0; No
correlation
r = .931; Strong
positive correlation
r = 1; Linear
relationship
r = -.67; Weaker
negative correlation

Inference using r
The population coefficient of correlation is
called $\rho$ (rho). We can test for a significant
correlation between x and y using a t test:
To test $H_0: \rho = 0$ versus $H_a: \rho \neq 0$
Test statistic: $t = r\sqrt{\frac{n - 2}{1 - r^2}}$
Reject $H_0$ if $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$ with n - 2 df.
This test is exactly equivalent to the t-test for the
slope $\beta = 0$.

Example
Is there a significant positive correlation
between weight and height in the population
of all college football players?
$r = .8261$
$H_0: \rho = 0$
$H_a: \rho > 0$
Test statistic: $t = r\sqrt{\frac{n - 2}{1 - r^2}} = .8261\sqrt{\frac{8}{1 - (.8261)^2}} = 4.15$
Use the t-table with n - 2 = 8 df to
bound the p-value as p-value <
.005. There is a significant
positive correlation.
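The test statistic for the correlation can be evaluated in one line. The sketch below is illustrative Python (the helper name `correlation_t` is an assumption). The comparison value 3.355 is the standard t-table entry for an upper-tail area of .005 with 8 df; it is not quoted on the slide and is included only to show how the p-value bound arises.

```python
import math


def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


t_stat = correlation_t(r=0.8261, n=10)
t_005 = 3.355                        # t(.005) with 8 df, from a standard t table
p_value_below_005 = t_stat > t_005   # one-tailed test, Ha: rho > 0
print(round(t_stat, 2), p_value_below_005)
```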

Example
Let y be the monthly sales revenue for a
company. This might be a function of
several variables:
$x_1$ = advertising expenditure
$x_2$ = time of year
$x_3$ = state of economy
$x_4$ = size of inventory
We want to predict y using knowledge of
$x_1$, $x_2$, $x_3$, and $x_4$.


The General
Linear Model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$
where
y is the response variable you want to
predict.
$\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are unknown constants.
$x_1, x_2, \ldots, x_k$ are independent predictor
variables, measured without error.

Example
Consider the model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
This is a first order model (independent
variables appear only to the first power).
$\beta_0$ = y-intercept = value of E(y) when
$x_1 = x_2 = 0$.
$\beta_1$ and $\beta_2$ are the partial regression
coefficients: the change in y for a one-unit
change in $x_i$ when the other
independent variables are held constant.
The model traces a plane in three-dimensional space.


The Method of
Least Squares
The best-fitting prediction equation is
calculated using a set of n measurements
$(y, x_1, x_2, \ldots, x_k)$ as
$\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$
We choose our estimates $b_0, b_1, \ldots, b_k$ to
estimate $\beta_0, \beta_1, \ldots, \beta_k$ to minimize
$\mathrm{SSE} = \sum (y - \hat{y})^2 = \sum (y - b_0 - b_1 x_1 - \cdots - b_k x_k)^2$

Example


A computer database in a small community contains the
listed selling price y (in thousands of dollars), the amount
of living area $x_1$ (in hundreds of square feet), and the
number of floors $x_2$, bedrooms $x_3$, and bathrooms $x_4$, for n
= 15 randomly selected residences currently on the
market.
Property  y      x_1  x_2  x_3  x_4
1         69.0   6    1    2    1
2         118.5  10   1    2    2
3         116.5  10   1    3    2
...
15        209.9  21   2    4    3
Fit a first order
model to the data
using the method
of least squares.

Example


The first order model is
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$
fit using Minitab with the values of y and the
four independent variables entered into five
columns of the Minitab worksheet.
Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Predictor Coef SE Coef T P
Constant 18.763 9.207 2.04 0.069
SqFeet 6.2698 0.7252 8.65 0.000
NumFlrs -16.203 6.212 -2.61 0.026
Bdrms -2.673 4.494 -0.59 0.565
Baths 30.271 6.849 4.42 0.001
Partial regression
coefficients
Regression equation

The Analysis of Variance
The total variation in the experiment is
measured by the total sum of squares:
Total SS $= S_{yy} = \sum (y - \bar{y})^2$
The Total SS is divided into two parts:
SSR (sum of squares for regression):
measures the variation explained by using the
regression equation.
SSE (sum of squares for error): measures the
leftover variation not explained by the
independent variables.

The ANOVA Table
Total df = n - 1
Regression df = k
Error df = n - 1 - k = n - k - 1
Mean Squares
MSR = SSR/k
MSE = SSE/(n - k - 1)
Source      df         SS        MS               F
Regression  k          SSR       SSR/k            MSR/MSE
Error       n - k - 1  SSE       SSE/(n - k - 1)
Total       n - 1      Total SS

The Real Estate Problem
Another portion of the Minitab printout
shows the ANOVA Table, with n = 15
and k = 4.
S = 6.849 R-Sq = 97.1% R-Sq(adj) = 96.0%

Analysis of Variance
Source DF SS MS F P
Regression 4 15913.0 3978.3 84.80 0.000
Residual Error 10 469.1 46.9
Total 14 16382.2

Source DF Seq SS
SqFeet 1 14829.3
NumFlrs 1 0.9
Bdrms 1 166.4
Baths 1 916.5

MSE: the MS value for Residual Error (46.9).
Sequential sums of squares:
conditional contribution of
each independent variable
to SSR given the variables
already entered into the
model.

Testing the Usefulness
of the Model
The first question to ask is whether the
regression model is of any use in predicting y.
If it is not, then the value of y does not change,
regardless of the value of the independent
variables, $x_1, x_2, \ldots, x_k$. This implies that the
partial regression coefficients, $\beta_1, \beta_2, \ldots, \beta_k$, are
all zero.
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ versus
$H_a$: at least one $\beta_i$ is not zero

The F Test
You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.
To test whether the model is useful in predicting y, test
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
Test statistic: $F = \frac{\mathrm{MSR}}{\mathrm{MSE}}$
Reject $H_0$ if $F > F_\alpha$ with k and n - k - 1 df.

Measuring the Strength
of the Relationship
If the independent variables are useful in
predicting y, you will want to know how well
the model fits.
The strength of the relationship between x and y
can be measured using:
Multiple coefficient of determination: $R^2 = \frac{\mathrm{SSR}}{\text{Total SS}}$

Measuring the Strength
of the Relationship
Since Total SS = SSR + SSE, $R^2$ measures
the proportion of the total variation in the
responses that can be explained by using the
independent variables in the model.
Equivalently, it is the percent reduction in the total variation obtained by
using the regression equation rather than just
using the sample mean $\bar{y}$ to estimate y.
$R^2 = \frac{\mathrm{SSR}}{\text{Total SS}}$ and $F = \frac{\mathrm{MSR}}{\mathrm{MSE}} = \frac{R^2/k}{(1 - R^2)/(n - k - 1)}$

Testing the Partial
Regression Coefficients
Is a particular independent variable useful in
the model, in the presence of all the other
independent variables? The test statistic is
a function of $b_i$, our best estimate of $\beta_i$.
$H_0: \beta_i = 0$ versus $H_a: \beta_i \neq 0$
Test statistic: $t = \frac{b_i - 0}{\mathrm{SE}(b_i)}$
which has a t distribution with error df = n - k - 1.

The Real Estate Problem
Is the overall model useful in predicting list
price? How much of the overall variation in
the response is explained by the regression
model?
S = 6.849 R-Sq = 97.1% R-Sq(adj) = 96.0%

Analysis of Variance
Source DF SS MS F P
Regression 4 15913.0 3978.3 84.80 0.000
Residual Error 10 469.1 46.9
Total 14 16382.2

Source DF Seq SS
SqFeet 1 14829.3
NumFlrs 1 0.9
Bdrms 1 166.4
Baths 1 916.5

F = MSR/MSE = 84.80 with
p-value = .000 is highly
significant. The model is very
useful in predicting the list
price of homes.
$R^2 = .971$ indicates that
97.1% of the overall
variation is explained by
the regression model.
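Both conclusions can be recomputed from the sums of squares in the printout. The following Python sketch is illustrative only (the function name `overall_f_and_r2` is an assumption). Because Minitab prints rounded sums of squares, the recomputed values agree with the printout only up to rounding.

```python
def overall_f_and_r2(ssr, sse, total_ss, n, k):
    """Overall F statistic and R-squared for a model with k predictors."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse, ssr / total_ss


n, k = 15, 4
f_stat, r2 = overall_f_and_r2(ssr=15913.0, sse=469.1, total_ss=16382.2, n=n, k=k)

# The same F written in terms of R-squared, as in the formula on the earlier slide
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))

print(round(f_stat, 2), round(r2, 3), round(f_from_r2, 2))
```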

The Real Estate Problem


In the presence of the other three
independent variables, is the number of
bedrooms significant in predicting the list
price of homes? Test using $\alpha = .05$.
Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Predictor Coef SE Coef T P
Constant 18.763 9.207 2.04 0.069
SqFeet 6.2698 0.7252 8.65 0.000
NumFlrs -16.203 6.212 -2.61 0.026
Bdrms -2.673 4.494 -0.59 0.565
Baths 30.271 6.849 4.42 0.001
To test $H_0: \beta_3 = 0$, the test statistic is t =
-0.59 with p-value = .565.
The p-value is larger than .05 and $H_0$ is
not rejected.
We cannot conclude that number of
bedrooms is a valuable predictor in the
presence of the other variables.
Perhaps the model could be refit
without $x_3$.

Comparing
Regression Models
The strength of a regression model is
measured using $R^2$ = SSR/Total SS. This
value can never decrease as variables are
added to the model.
To fairly compare two models, it is better to
use a measure that has been adjusted using
df:
$R^2(\text{adj}) = \left(1 - \frac{\mathrm{MSE}}{\text{Total SS}/(n - 1)}\right) 100\%$
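As a check on the formula, the adjusted value can be recomputed for the real estate model from its ANOVA table (SSE = 469.1, Total SS = 16382.2, n = 15, k = 4). This is a small Python sketch; the function name `adjusted_r2_percent` is hypothetical.

```python
def adjusted_r2_percent(sse, total_ss, n, k):
    """R-squared adjusted for degrees of freedom, expressed as a percentage."""
    mse = sse / (n - k - 1)
    return (1 - mse / (total_ss / (n - 1))) * 100


real_estate_adj = adjusted_r2_percent(sse=469.1, total_ss=16382.2, n=15, k=4)
print(round(real_estate_adj, 1))
```

Unlike $R^2$, this quantity can go down when a variable that contributes little is added, because MSE divides by the error df.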

Checking the
Regression Assumptions
The random error terms $\epsilon$
- are independent,
- have mean 0 and common variance $\sigma^2$ for
  any set $x_1, x_2, \ldots, x_k$,
- have a normal distribution.
Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.

Normal Probability Plot
If the normality assumption is valid, the
plot should resemble a straight line,
sloping upward to the right.
If not, you will often see the pattern fail
in the tails of the graph.

Estimation and Prediction
Enter the appropriate values of $x_1, x_2, \ldots, x_k$ in
Minitab. Minitab calculates
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$
and both the confidence interval and the
prediction interval.
Particular values of y are more difficult to
predict, requiring a wider range of values in the
prediction interval.

The Real Estate Problem
Estimate the average list price for a home
with 1000 square feet of living space,
one floor, 3 bedrooms and two baths with
a 95% confidence interval.
Predicted Values for New Observations
New Obs Fit SE Fit 95.0% CI 95.0% PI
1 117.78 3.11 ( 110.86, 124.70) ( 101.02, 134.54)

Values of Predictors for New Observations
New Obs SqFeet NumFlrs Bdrms Baths
1 10.0 1.00 3.00 2.00
We estimate that the average list
price will be between $110,860
and $124,700 for a home like
this.
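The "Fit" value in the printout is just the fitted equation evaluated at the new predictor values, and the confidence interval is the fit plus or minus a t value times "SE Fit". The Python sketch below is illustrative; the dictionary names are assumptions, and the critical value 2.228 (t(.025) with 10 error df) is a standard table value that does not appear on the slide.

```python
# Coefficients from the Minitab printout for the real estate model
coefficients = {"Constant": 18.763, "SqFeet": 6.2698, "NumFlrs": -16.203,
                "Bdrms": -2.673, "Baths": 30.271}

# New observation: 1000 sq ft (10 hundreds), one floor, 3 bedrooms, 2 baths
new_home = {"SqFeet": 10.0, "NumFlrs": 1.0, "Bdrms": 3.0, "Baths": 2.0}

fit = coefficients["Constant"] + sum(
    coefficients[name] * value for name, value in new_home.items()
)

se_fit = 3.11      # "SE Fit" from the printout
t_crit = 2.228     # assumed table value: t(.025), n - k - 1 = 10 df
lower, upper = fit - t_crit * se_fit, fit + t_crit * se_fit

print(round(fit, 2), round(lower, 2), round(upper, 2))
```

Prices are in thousands of dollars, so the endpoints correspond to the dollar amounts quoted above, up to rounding of SE Fit.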

Using Regression Models
When you perform multiple regression analysis,
use a step-by-step approach:
1. Obtain the fitted prediction model.
2. Use the analysis of variance F test and $R^2$ to determine
how well the model fits the data.
3. Check the t tests for the partial regression coefficients to
see which ones are contributing significant information
in the presence of the others.
4. If you choose to compare several different models, use
$R^2$(adj) to compare their effectiveness.
5. Use diagnostic plots to check for violation of the
regression assumptions.

A Polynomial Model
A response y is related to a single independent
variable x, but not in a linear manner. The
polynomial model is:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \epsilon$
When k = 2, the model is quadratic:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$
When k = 3, the model is cubic:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon$

Example


A market research firm has observed the sales (y) as a
function of mass media advertising expenses (x) for 10
different companies selling a similar product.
Since there is only one
independent variable, you
could fit a linear, quadratic, or
cubic polynomial model.
Which would you pick?
Company 1 2 3 4 5 6 7 8 9 10
Expenditure, x 1.0 1.6 2.5 3.0 4.0 4.6 5.0 5.7 6.0 7.0
Sales, y 2.5 2.6 2.7 5.0 5.3 9.1 14.8 17.5 23.0 28.0

Two Possible Choices
A straight line model: $y = \beta_0 + \beta_1 x + \epsilon$
A quadratic model: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$
Here is the Minitab printout for the straight line:
Regression Analysis: y versus x
The regression equation is
y = - 6.47 + 4.34 x
Predictor Coef SE Coef T P
Constant -6.465 2.795 -2.31 0.049
x 4.3355 0.6274 6.91 0.000
S = 3.725 R-Sq = 85.6% R-Sq(adj) = 83.9%
Analysis of Variance
Source DF SS MS F P
Regression 1 662.46 662.46 47.74 0.000
Residual Error 8 111.00 13.88
Total 9 773.46
Overall F test is highly
significant, as is the t-test of
the slope. $R^2 = .856$ suggests a
good fit. Let's check the
residual plots.

Example
There is a strong pattern of a
curve leftover in the residual
plot.
This indicates that there is a
curvilinear relationship
unaccounted for by your
straight line model.
You should have used the
quadratic model! Use Minitab to fit the
quadratic model:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$
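The curved pattern in the straight-line residuals can be seen without a plot by looking at their signs in order of increasing x. This is an illustrative Python sketch using the advertising data and the straight-line coefficients from the Minitab printout; the variable names are assumptions.

```python
x = [1.0, 1.6, 2.5, 3.0, 4.0, 4.6, 5.0, 5.7, 6.0, 7.0]        # expenditure
y = [2.5, 2.6, 2.7, 5.0, 5.3, 9.1, 14.8, 17.5, 23.0, 28.0]    # sales
a, b = -6.465, 4.3355    # straight-line fit from the Minitab printout

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
signs = "".join("+" if e > 0 else "-" for e in residuals)
sse = sum(e * e for e in residuals)

print(signs)          # positive at both ends, negative in the middle
print(round(sse, 2))  # close to the Residual Error SS in the printout
```

A run of positive residuals at both ends with negative residuals in between is the kind of leftover curve that a quadratic term is meant to absorb.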

The Quadratic Model


Regression Analysis: y versus x, x-sq
The regression equation is
y = 4.66 - 3.03 x + 0.939 x-sq

Predictor Coef SE Coef T P
Constant 4.657 2.443 1.91 0.098
x -3.030 1.395 -2.17 0.067
x-sq 0.9389 0.1739 5.40 0.001
S = 1.752 R-Sq = 97.2% R-Sq(adj) = 96.4%

Analysis of Variance
Source DF SS MS F P
Regression 2 751.98 375.99 122.49 0.000
Residual Error 7 21.49 3.07
Total 9 773.47
Overall F test is highly significant,
as is the t-test of the quadratic term
$\beta_2$. $R^2 = .972$ suggests a very good
fit.
Let's compare the two models, and
check the residual plots.

Which Model to Use?

Use $R^2$(adj) to compare the models:
The straight line model: $y = \beta_0 + \beta_1 x + \epsilon$
$R^2(\text{adj}) = 83.9\%$
The quadratic model: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$
$R^2(\text{adj}) = 96.4\%$
The quadratic model is
better.
There are no patterns in the
residual plot, suggesting
that this is an appropriate
model for the data.

Using Qualitative
Variables
Multiple regression requires that the
response y be a quantitative variable.
Independent variables can be either
quantitative or qualitative.
Qualitative variables involving k categories
are entered into the model by using k-1
dummy variables.
Example: To enter gender as a variable, use
$x_i$ = 1 if male; 0 if female

Example


Data were collected on 6 male and 6 female assistant
professors. The researchers recorded their salaries (y)
along with years of experience ($x_1$). The professor's
gender enters into the model as a dummy variable: $x_2$ = 1
if male; 0 if not.
Professor  Salary, y  Experience, x_1  Gender, x_2  Interaction, x_1 x_2
1          $50,710    1                1            1
2          49,510     1                0            0
...
11         55,590     5                1            5
12         53,200     5                0            0

Example


We want to predict a professor's salary based on years
of experience and gender. We think that there may be a
difference in salary depending on whether you are
male or female.
The model we choose includes experience ($x_1$), gender
($x_2$), and an interaction term ($x_1 x_2$) to allow salaries
for males and females to behave differently.
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$

Minitab Output


We use Minitab to fit the model.
Regression Analysis: y versus x1, x2, x1x2
The regression equation is
y = 48593 + 969 x1 + 867 x2 + 260 x1x2

Predictor Coef SE Coef T P
Constant 48593.0 207.9 233.68 0.000
x1 969.00 63.67 15.22 0.000
x2 866.7 305.3 2.84 0.022
x1x2 260.13 87.06 2.99 0.017

S = 201.3 R-Sq = 99.2% R-Sq(adj) = 98.9%

Analysis of Variance
Source DF SS MS F P
Regression 3 42108777 14036259 346.24 0.000
Residual Error 8 324315 40539
Total 11 42433092
What is the regression
equation for males? For
females?
For males, $x_2 = 1$,
$y = 49459.7 + 1229.13 x_1$
For females, $x_2 = 0$,
$y = 48593.0 + 969.0 x_1$
Two different straight line
models.
Is the overall model useful
in predicting y?
The overall F test is F =
346.24 with p-value = .000.
The value of $R^2 = .992$
indicates that the model fits
very well.
Is there a difference in the relationship between salary and years of
experience, depending on the gender of the professor?
Yes. The individual t-test for the interaction term is t = 2.99 with p-
value = .017. This indicates a significant interaction between
gender and years of experience.
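The two gender-specific lines come from substituting $x_2 = 1$ or $x_2 = 0$ into the fitted equation: the dummy coefficient shifts the intercept and the interaction coefficient shifts the slope. A minimal Python sketch (the names `coef` and `line_for_gender` are illustrative assumptions):

```python
# Coefficients from the Minitab printout: y = 48593 + 969 x1 + 867 x2 + 260 x1x2
coef = {"const": 48593.0, "x1": 969.00, "x2": 866.7, "x1x2": 260.13}


def line_for_gender(x2):
    """Return (intercept, slope) of salary versus experience for a given dummy value."""
    intercept = coef["const"] + coef["x2"] * x2
    slope = coef["x1"] + coef["x1x2"] * x2
    return intercept, slope


male_intercept, male_slope = line_for_gender(1)
female_intercept, female_slope = line_for_gender(0)
print(round(male_intercept, 1), round(male_slope, 2))
print(round(female_intercept, 1), round(female_slope, 2))
```

If the interaction coefficient were zero, the two lines would be parallel and differ only in intercept.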

Example


Have any of the regression assumptions been violated, or
have we fit the wrong model?
It does not appear from the
diagnostic plots that there
are any violations of
assumptions.
The model is ready to be
used for prediction or
estimation.

Testing Sets of Parameters
Suppose the demand y may be related to five independent
variables, but that the cost of measuring three of them is very
high.
If it could be shown that these three contribute little or no
information, they can be eliminated.
You want to test the null hypothesis
$H_0: \beta_3 = \beta_4 = \beta_5 = 0$; that is, the independent variables $x_3$,
$x_4$, and $x_5$ contribute no information for the prediction of y,
versus the alternative hypothesis:
$H_a$: At least one of the parameters $\beta_3$, $\beta_4$, or $\beta_5$ differs
from 0; that is, at least one of the variables $x_3$, $x_4$, or $x_5$
contributes information for the prediction of y.

Testing Sets of Parameters
To explain how to test a hypothesis concerning a set
of model parameters, we define two models:
Model One (reduced model)
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r$
Model Two (complete model)
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r + \beta_{r+1} x_{r+1} + \beta_{r+2} x_{r+2} + \cdots + \beta_k x_k$
The terms through $\beta_r x_r$ are the terms in model 1; the remaining terms are the additional terms in model 2.

Testing Sets of Parameters
The test of the hypothesis
$H_0: \beta_3 = \beta_4 = \beta_5 = 0$
$H_a$: At least one of the $\beta_i$ differs from 0
uses the test statistic
$F = \frac{(\mathrm{SSE}_1 - \mathrm{SSE}_2)/(k - r)}{\mathrm{MSE}_2}$
where F is based on $df_1 = (k - r)$ and $df_2 = n - (k + 1)$.
The rejection region for the test is identical to
other analysis of variance F tests, namely $F > F_\alpha$.
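The slides give no data for the five-variable demand example, so the sketch below illustrates the statistic with a different pair of nested models from earlier in these slides: the straight-line model (reduced, r = 1, SSE = 111.00) and the quadratic model (complete, k = 2, SSE = 21.49) for the advertising data, n = 10. The function name `partial_f` is an assumption, not from the source.

```python
def partial_f(sse_reduced, sse_complete, n, k, r):
    """F statistic for testing that the k - r extra terms in the complete model are all zero."""
    mse_complete = sse_complete / (n - (k + 1))
    return ((sse_reduced - sse_complete) / (k - r)) / mse_complete


# SSE values taken from the two Minitab printouts for the advertising example
f_stat = partial_f(sse_reduced=111.00, sse_complete=21.49, n=10, k=2, r=1)
t_quadratic = 5.40     # t for the x-sq term in the quadratic printout

print(round(f_stat, 2), round(t_quadratic ** 2, 2))
```

With a single added term, this F is the square of that term's t statistic, so the two printed values should agree up to rounding.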

Stepwise Regression
A stepwise regression analysis fits a variety
of models to the data, adding variables that are
significant in the presence of the other
variables and deleting variables that are
nonsignificant.
The procedure stops once no remaining variable is
significant when added to the model and no variable
already in the model is nonsignificant.
These programs always fit first-order models and
are not helpful in detecting curvature or interaction
in the data.

Pearson's Chi-Square
Statistic
We have some preconceived idea about the
values of the $p_i$ and want to use sample
information to see if we are correct.
The expected number of times that
outcome i will occur is $E_i = np_i$. The
farther the observed cell counts, $O_i$, are from
what we hypothesize under $H_0$, the more
likely it is that $H_0$ should be rejected.

Pearson's Chi-Square
Statistic
We use the Pearson chi-square statistic:
$X^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
When $H_0$ is true, the differences O - E will
be small, but large when $H_0$ is false.
Look for large values of $X^2$ based on the
chi-square distribution with a particular
number of degrees of freedom.

Degrees of Freedom
These will be different depending on the
application.
1. Start with the number of categories or
cells in the experiment.
2. Subtract 1 df for each linear restriction on
the cell probabilities. (You always lose 1
df since $p_1 + p_2 + \cdots + p_k = 1$.)
3. Subtract 1 df for every population
parameter you have to estimate to
calculate or estimate $E_i$.

The Goodness of Fit Test
The simplest of the applications.
A single categorical variable is measured,
and exact numerical values are specified
for each of the $p_i$.
Expected cell counts are $E_i = np_i$
Degrees of freedom: df = k - 1
Test statistic: $X^2 = \sum \frac{(O_i - E_i)^2}{E_i}$

Example
Toss a die 300 times with the following
results. Is the die fair or biased?
Upper Face       1   2   3   4   5   6
Number of times  50  39  45  62  61  43
A multinomial experiment with k = 6 and
$O_1$ to $O_6$ given in the table.
We test:
$H_0: p_1 = 1/6; p_2 = 1/6; \ldots; p_6 = 1/6$ (die is fair)
$H_a$: at least one $p_i$ is different from 1/6 (die is biased)

Example
Calculate the expected cell counts:
E_i = np_i = 300(1/6) = 50
Upper Face   1   2   3   4   5   6
O_i         50  39  45  62  61  43
E_i         50  50  50  50  50  50
Test statistic and rejection region:
$$X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \frac{(50-50)^2}{50} + \frac{(39-50)^2}{50} + \cdots + \frac{(43-50)^2}{50} = 9.2$$
Reject H_0 if $X^2 > \chi^2_{.05} = 11.07$ with k - 1 = 6 - 1 = 5 df.
Do not reject H_0.
There is insufficient
evidence to indicate
that the die is biased.
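The die calculation can be reproduced with a few lines of plain Python. This is a minimal sketch: the variable names are illustrative, and the critical value 11.07 is simply copied from the example rather than computed from the chi-square distribution.

```python
# Observed counts for faces 1..6 from 300 tosses of the die.
observed = [50, 39, 45, 62, 61, 43]
n = sum(observed)   # total number of tosses
k = len(observed)   # number of cells

# Under H0 the die is fair, so every p_i = 1/6 and E_i = n * p_i.
p_null = [1 / 6] * k
expected = [n * p for p in p_null]

# Pearson statistic: sum over cells of (O_i - E_i)^2 / E_i.
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = k - 1            # all p_i are specified, so only one restriction
critical_05 = 11.07   # upper 5% point for 5 df, taken from the example

reject_h0 = x2 > critical_05
print(f"X^2 = {x2:.2f}, df = {df}, reject H0: {reject_h0}")
```

Because 9.2 does not exceed 11.07, the sketch reaches the same decision as the example: do not reject H_0.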

Some Notes
The test statistic, X^2, has only an
approximate chi-square distribution.
For the approximation to be accurate,
statisticians recommend E_i > 5 for all cells.
Goodness of fit tests are different from
previous tests since the experimenter uses
H_0 for the model he thinks is true.
H_0: model is correct (as specified)
H_a: model is not correct
Be careful not to accept H_0 (say the model is
correct) without reporting β, the probability of a Type II error.

Contingency Tables: A
Two-Way Classification
The experimenter measures two
qualitative variables to generate
bivariate data.
Gender and colorblindness
Age and opinion
Professorial rank and type of
university
Summarize the data by counting the
observed number of outcomes in each
of the intersections of category levels in
a contingency table.

r x c Contingency Table
The contingency table has r rows and c
columns, for rc total cells.
       1     2     ...   c
1      O_11  O_12  ...   O_1c
2      O_21  O_22  ...   O_2c
...    ...   ...   ...   ...
r      O_r1  O_r2  ...   O_rc
We study the relationship between the two
variables. Is one method of classification
contingent or dependent on the other?
Does the distribution of measurements in
the various categories for variable 1
depend on which category of variable 2 is
being observed?
If not, the variables are independent.

Chi-Square Test of Independence
H_0: classifications are independent
H_a: classifications are dependent
Observed cell counts are O_ij for row i and column j.
Expected cell counts are E_ij = np_ij
If H_0 is true and the classifications are
independent,
p_ij = p_i p_j = P(falling in row i)P(falling in column j)

Chi-Square Test of Independence
Test statistic:
$$X^2 = \sum_{i,j} \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}$$
Estimate p_i and p_j with r_i/n and c_j/n, where r_i and c_j
are the row and column totals:
$$\hat{E}_{ij} = n\left(\frac{r_i}{n}\right)\left(\frac{c_j}{n}\right) = \frac{r_i c_j}{n}$$
The test statistic has an approximate chi-square
distribution with df = (r-1)(c-1).

Example
Furniture defects are classified according
to type of defect and shift on which it
was made.
Shift
Type 1 2 3 Total
A 15 26 33 74
B 21 31 17 69
C 45 34 49 128
D 13 5 20 38
Total 94 96 119 309
Do the data present sufficient evidence to indicate that the type
of furniture defect varies with the shift during which the piece
of furniture is produced? Test at the 1% level of significance.
H_0: type of defect is independent of shift
H_a: type of defect depends on the shift

The Furniture Problem
Calculate the expected cell counts. For
example:
$$\hat{E}_{12} = \frac{r_1 c_2}{n} = \frac{74(96)}{309} = 22.99$$
Chi-Square Test: 1, 2, 3
Expected counts are printed below observed counts
           1      2      3  Total
    1     15     26     33     74
       22.51  22.99  28.50

    2     21     31     17     69
       20.99  21.44  26.57

    3     45     34     49    128
       38.94  39.77  49.29

    4     13      5     20     38
       11.56  11.81  14.63

Total     94     96    119    309
Chi-Sq = 2.506 + 0.394 + 0.711 +
         0.000 + 4.266 + 3.449 +
         0.944 + 0.836 + 0.002 +
         0.179 + 3.923 + 1.967 =
         19.178
DF = 6, P-Value = 0.004
Test statistic:
$$X^2 = \sum_{i,j} \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}} = \frac{(15-22.51)^2}{22.51} + \frac{(26-22.99)^2}{22.99} + \cdots + \frac{(20-14.63)^2}{14.63} = 19.18$$
Reject H_0 if $X^2 > \chi^2_{.01} = 16.81$ with (r-1)(c-1) = 6 df.
Reject H_0. There is
sufficient evidence to
indicate that the
proportions of defect
types vary from shift
to shift.
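The printed output can be checked by hand with a short script. This is a sketch under stated assumptions: variable names are illustrative, and the critical value 16.81 is the tabled upper 1% point of the chi-square distribution with 6 df, matching the 1% level requested in the example; it is typed in, not computed.

```python
# Observed defect counts: rows are defect types A-D, columns are shifts 1-3.
observed = [
    [15, 26, 33],
    [21, 31, 17],
    [45, 34, 49],
    [13, 5, 20],
]

row_totals = [sum(row) for row in observed]        # r_i
col_totals = [sum(col) for col in zip(*observed)]  # c_j
n = sum(row_totals)

# Estimated expected counts under independence: E_ij = r_i * c_j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Contribution of each cell to the statistic: (O_ij - E_ij)^2 / E_ij.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(o_row, e_row)]
    for o_row, e_row in zip(observed, expected)
]
x2 = sum(sum(row) for row in contributions)

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_01 = 16.81  # assumed table value: upper 1% point, 6 df

for e_row in expected:
    print("  ".join(f"{e:6.2f}" for e in e_row))
print(f"X^2 = {x2:.3f}, df = {df}, reject H0 at 1%: {x2 > critical_01}")
```

The per-cell contributions show where the departure from independence comes from: the largest terms belong to defect type B on shifts 2 and 3 and to type D on shift 2.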

Comparing Multinomial Populations
Sometimes researchers design an experiment
so that the number of experimental units falling
in one set of categories is fixed in advance.
Example: An experimenter selects 900 patients who
have been treated for flu prevention. She selects 300
from each of three types: no vaccine, one shot, and
two shots.
         No Vaccine  One Shot  Two Shots  Total
Flu                                       r_1
No Flu                                    r_2
Total    300         300       300        n = 900
The column totals have
been fixed in advance!

Comparing Multinomial Populations
Each of the c columns (or r rows) whose
totals have been fixed in advance is actually a
single multinomial experiment.
The chi-square test of independence with
(r-1)(c-1) df is equivalent to a test of the
equality of c (or r) multinomial populations.
         No Vaccine  One Shot  Two Shots  Total
Flu                                       r_1
No Flu                                    r_2
Total    300         300       300        n = 900
Three binomial populations: no
vaccine, one shot and two shots.
Is the probability of getting the flu
independent of the type of flu
prevention used?

Example
Random samples of 200 voters in each of
four wards were surveyed and asked if
they favor candidate A in a local
election.
                       Ward
                  1    2    3    4  Total
Favor A          76   53   59   48    236
Do not favor A  124  147  141  152    564
Total           200  200  200  200    800
Do the data present sufficient evidence to indicate that the
fraction of voters favoring candidate A differs in the four wards?
H_0: fraction favoring A is independent of ward
H_a: fraction favoring A depends on the ward
Equivalently:
H_0: p_1 = p_2 = p_3 = p_4
where p_i = fraction favoring A in each of the four
wards

The Voter Problem
Calculate the expected cell counts. For
example:
$$\hat{E}_{12} = \frac{r_1 c_2}{n} = \frac{236(200)}{800} = 59$$
Chi-Square Test: 1, 2, 3, 4
Expected counts are printed below observed counts
            1       2       3       4  Total
    1      76      53      59      48    236
        59.00   59.00   59.00   59.00

    2     124     147     141     152    564
       141.00  141.00  141.00  141.00

Total     200     200     200     200    800
Chi-Sq = 4.898 + 0.610 + 0.000 + 2.051 +
         2.050 + 0.255 + 0.000 + 0.858 =
         10.722
DF = 3, P-Value = 0.013
Test statistic:
$$X^2 = \sum_{i,j} \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}} = \frac{(76-59)^2}{59} + \frac{(53-59)^2}{59} + \cdots + \frac{(152-141)^2}{141} = 10.722$$
Reject H_0 if $X^2 > \chi^2_{.05} = 7.81$ with (r-1)(c-1) = 3 df.
Reject H_0. There is
sufficient evidence to
indicate that the
fraction of voters
favoring A varies
from ward to ward.

The Voter Problem
Since we know that there are differences
among the four wards, what is the
nature of the differences?
Look at the proportions in favor of
candidate A in the four wards.
Ward      1               2               3               4
Favor A   76/200 = .38    53/200 = .27    59/200 = .30    48/200 = .24
Candidate A is doing best in the first ward, and worst in the
fourth ward. More importantly, he does not have a majority of
the vote in any of the wards!
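Because every ward has the same fixed sample size, the voter test can also be written in terms of a pooled proportion. The sketch below takes that route and then prints the per-ward proportions. Variable names are illustrative, and the critical value 7.81 (upper 5% point, 3 df) is copied from the example rather than computed.

```python
favor = [76, 53, 59, 48]   # voters favoring A in wards 1-4
ward_size = 200            # fixed in advance for every ward

# Under H0 all wards share one proportion, estimated by pooling the samples.
pooled = sum(favor) / (ward_size * len(favor))   # 236 / 800

e_favor = ward_size * pooled          # expected "favor" count per ward
e_not = ward_size * (1 - pooled)      # expected "do not favor" count per ward

# Pearson statistic summed over both rows of the 2 x 4 table.
x2 = sum(
    (f - e_favor) ** 2 / e_favor + ((ward_size - f) - e_not) ** 2 / e_not
    for f in favor
)
df = (2 - 1) * (len(favor) - 1)
critical_05 = 7.81  # taken from the example

# Follow-up: sample proportion favoring A in each ward.
proportions = [f / ward_size for f in favor]
best = proportions.index(max(proportions)) + 1
worst = proportions.index(min(proportions)) + 1

print(f"X^2 = {x2:.3f}, df = {df}, reject H0: {x2 > critical_05}")
print("proportions:", [round(p, 3) for p in proportions])
print(f"best ward: {best}, worst ward: {worst}, any majority: {max(proportions) > 0.5}")
```

The pooled expected counts (59 and 141) are the same values that r_i c_j / n gives, which is why the test of equal proportions and the test of independence lead to the same statistic here.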
