Example: Let y Be the Monthly Sales Revenue for a Company. This Might Be a Function of Several Variables

Copyright 2003 Brooks/Cole
A division of Thomson Learning, Inc.

Example
Let y be the monthly sales revenue for a
company. This might be a function of
several variables:
$x_1$ = advertising expenditure
$x_2$ = time of year
$x_3$ = state of economy
$x_4$ = size of inventory
We want to predict $y$ using knowledge of $x_1$, $x_2$, $x_3$, and $x_4$.


A Simple Linear Model
In Chapter 3, we used the equation of
a line to describe the relationship between y
and x for a sample of n pairs, (x, y).
If we want to describe the relationship
between y and x for the whole population,
there are two models we can choose:
Deterministic Model: $y = \alpha + \beta x$
Probabilistic Model:
$y$ = deterministic model + random error
$y = \alpha + \beta x + \epsilon$

A Simple Linear Model
Since the bivariate measurements that
we observe do not generally fall
exactly on a straight line, we choose to
use:
Probabilistic Model:
$y = \alpha + \beta x + \epsilon$
$E(y) = \alpha + \beta x$
Points deviate from the line of means by an amount $\epsilon$, where $\epsilon$ has a normal distribution with mean 0 and variance $\sigma^2$.

The Method of Least Squares
The equation of the best-fitting line is calculated using a set of $n$ pairs $(x_i, y_i)$.
We choose our estimates $a$ and $b$ to estimate $\alpha$ and $\beta$ so that the sum of the squared vertical distances of the points from the line is minimized.
Best-fitting line: $\hat{y} = a + bx$
Choose $a$ and $b$ to minimize
$\mathrm{SSE} = \sum (y - \hat{y})^2 = \sum (y - a - bx)^2$

Least Squares Estimators

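Calculate the sums of squares:
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$
$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$
Best-fitting line: $\hat{y} = a + bx$, where
$b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$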
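The slides state the estimators without deriving them. A short sketch of the usual calculus argument, setting both partial derivatives of SSE to zero, is given here for orientation; the index $i$ on the data pairs is the only notation added.

```latex
\begin{align*}
\frac{\partial\,\mathrm{SSE}}{\partial a} &= -2\sum (y_i - a - b x_i) = 0
  \;\Rightarrow\; a = \bar{y} - b\bar{x}, \\
\frac{\partial\,\mathrm{SSE}}{\partial b} &= -2\sum x_i (y_i - a - b x_i) = 0
  \;\Rightarrow\; \sum x_i y_i - a\sum x_i - b\sum x_i^2 = 0 .
\end{align*}
% Substituting a = \bar{y} - b\bar{x} into the second equation:
\begin{align*}
\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}
  = b\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right)
  \;\Rightarrow\; b = \frac{S_{xy}}{S_{xx}} .
\end{align*}
```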

Example
The table shows the math achievement test
scores for a random sample of n = 10 college
freshmen, along with their final calculus
grades.
Student 1 2 3 4 5 6 7 8 9 10
Math test, x 39 43 21 64 57 47 28 75 34 52
Calculus grade, y 65 78 52 82 92 89 73 98 56 75
Use your calculator to find the sums and sums of squares.
$\sum x = 460$   $\sum y = 760$
$\sum x^2 = 23634$   $\sum y^2 = 59816$
$\sum xy = 36854$
$\bar{x} = 46$   $\bar{y} = 76$


Example
$S_{xx} = 23634 - \frac{(460)^2}{10} = 2474$
$S_{yy} = 59816 - \frac{(760)^2}{10} = 2056$
$S_{xy} = 36854 - \frac{(460)(760)}{10} = 1894$
$b = \frac{1894}{2474} = .76556$ and $a = \bar{y} - b\bar{x} = 76 - .76556(46) = 40.78$
Best-fitting line: $\hat{y} = 40.78 + .77x$
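The hand calculation for the calculus example can be checked with a few lines of Python. This is a minimal sketch using only the ten data pairs from the table; the variable names are illustrative and not part of the slides.

```python
# Math achievement test scores (x) and final calculus grades (y) from the table.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)

sum_x, sum_y = sum(x), sum(y)

# Sums of squares, using the computing formulas from the slide.
s_xx = sum(v * v for v in x) - sum_x ** 2 / n
s_yy = sum(v * v for v in y) - sum_y ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n

# Least squares estimators of the slope and intercept.
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n

print(f"Sxx = {s_xx:.0f}, Syy = {s_yy:.0f}, Sxy = {s_xy:.0f}")
print(f"b = {b:.5f}, a = {a:.2f}")
```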

The Analysis of Variance
The total variation in the experiment is measured by the total sum of squares:
Total SS $= S_{yy} = \sum (y - \bar{y})^2$
The Total SS is divided into two parts:
SSR (sum of squares for regression):
measures the variation explained by using x in
the model.
SSE (sum of squares for error): measures the
leftover variation not explained by x.

We calculate
$\mathrm{SSR} = \frac{(S_{xy})^2}{S_{xx}} = \frac{1894^2}{2474} = 1449.9741$
$\mathrm{SSE} = \text{Total SS} - \mathrm{SSR} = S_{yy} - \frac{(S_{xy})^2}{S_{xx}} = 2056 - 1449.9741 = 606.0259$

The ANOVA Table
Total df = n - 1
Regression df = 1
Error df = n - 1 - 1 = n - 2
Mean Squares:
MSR = SSR/(1)
MSE = SSE/(n - 2)

Source      df     SS        MS           F
Regression  1      SSR       SSR/(1)      MSR/MSE
Error       n - 2  SSE       SSE/(n - 2)
Total       n - 1  Total SS

The Calculus Problem
Source df SS MS F
Regression 1 1449.9741 1449.9741 19.14
Error 8 606.0259 75.7532
Total 9 2056.0000
$\mathrm{SSR} = \frac{(S_{xy})^2}{S_{xx}} = \frac{1894^2}{2474} = 1449.9741$
$\mathrm{SSE} = \text{Total SS} - \mathrm{SSR} = S_{yy} - \frac{(S_{xy})^2}{S_{xx}} = 2056 - 1449.9741 = 606.0259$
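As a rough check on the ANOVA table for the calculus problem, the entries can be rebuilt from the three sums of squares. The sketch below assumes only the values $S_{xx} = 2474$, $S_{yy} = 2056$, and $S_{xy} = 1894$ computed earlier; the print layout is illustrative.

```python
# Sums of squares taken from the worked example.
n = 10
s_xx, s_yy, s_xy = 2474.0, 2056.0, 1894.0

total_ss = s_yy
ssr = s_xy ** 2 / s_xx      # variation explained by x
sse = total_ss - ssr        # leftover variation
msr = ssr / 1               # regression df = 1
mse = sse / (n - 2)         # error df = n - 2
f_stat = msr / mse

for source, df, ss, ms in [("Regression", 1, ssr, msr), ("Error", n - 2, sse, mse)]:
    print(f"{source:<12}{df:>3}{ss:>12.4f}{ms:>12.4f}")
print(f"{'Total':<12}{n - 1:>3}{total_ss:>12.4f}")
print(f"F = {f_stat:.2f}")
```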

Testing the Usefulness
of the Model
The first question to ask is whether the
independent variable x is of any use in
predicting y.
If it is not, then the value of y does not change,
regardless of the value of x. This implies that
the slope of the line, $\beta$, is zero.
$H_0: \beta = 0$ versus $H_a: \beta \neq 0$

Testing the
Usefulness of the Model
The test statistic is a function of $b$, our best estimate of $\beta$. Using MSE as the best estimate of the random variation $\sigma^2$, we obtain a $t$ statistic.
Test statistic: $t = \frac{b - \beta_0}{\sqrt{\mathrm{MSE}/S_{xx}}}$, which has a $t$ distribution with df $= n - 2$,
or a confidence interval: $b \pm t_{\alpha/2}\sqrt{\mathrm{MSE}/S_{xx}}$

Is there a significant relationship between
the calculus grades and the test scores at the
5% level of significance?
$H_0: \beta = 0$ versus $H_a: \beta \neq 0$
$t = \frac{b - 0}{\sqrt{\mathrm{MSE}/S_{xx}}} = \frac{.7656 - 0}{\sqrt{75.7532/2474}} = 4.38$
Reject $H_0$ when $|t| > 2.306$. Since $t = 4.38$ falls into the rejection region, $H_0$ is rejected.
There is a significant linear relationship
between the calculus grades and the test scores
for the population of college freshmen.
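The slope test can be reproduced numerically. This is a sketch under the values already quoted in the example (SSE = 606.0259, $S_{xx} = 2474$, critical value 2.306); the 95% interval for the slope is computed from the confidence interval formula given earlier and is not itself reported on the slides.

```python
import math

n = 10
s_xx = 2474.0
b = 1894 / s_xx                  # slope estimate S_xy / S_xx
mse = 606.0259 / (n - 2)         # SSE / (n - 2) from the ANOVA table
t_crit = 2.306                   # t_.025 with 8 df, as quoted in the example

se_b = math.sqrt(mse / s_xx)     # standard error of the slope
t_stat = (b - 0) / se_b
reject = abs(t_stat) > t_crit
ci = (b - t_crit * se_b, b + t_crit * se_b)   # 95% interval for the slope

print(f"SE(b) = {se_b:.4f}, t = {t_stat:.2f}, reject H0: {reject}")
print(f"95% CI for slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```

An interval for the slope that excludes zero leads to the same conclusion as rejecting $H_0$ in the two-tailed test.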

The F Test
You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.
To test whether the model is useful in predicting $y$ ($H_0: \beta = 0$):
Test statistic: $F = \frac{\mathrm{MSR}}{\mathrm{MSE}}$
Reject $H_0$ if $F > F_\alpha$ with 1 and $n - 2$ df.
This test is exactly equivalent to the $t$-test, with $t^2 = F$.
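The equivalence $t^2 = F$ follows directly from the definitions used so far. The short derivation below is a sketch for the simple linear model only, using $b = S_{xy}/S_{xx}$, $\mathrm{SSR} = S_{xy}^2/S_{xx}$, and regression df = 1.

```latex
\begin{align*}
t^2 = \frac{b^2}{\mathrm{MSE}/S_{xx}}
    = \frac{(S_{xy}/S_{xx})^2\, S_{xx}}{\mathrm{MSE}}
    = \frac{S_{xy}^2/S_{xx}}{\mathrm{MSE}}
    = \frac{\mathrm{SSR}/1}{\mathrm{MSE}}
    = \frac{\mathrm{MSR}}{\mathrm{MSE}} = F .
\end{align*}
```

For the calculus data, the unrounded $t \approx 4.375$ gives $t^2 \approx 19.14$, the F value in the ANOVA table.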

Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x
Predictor Coef SE Coef T P
Constant 40.784 8.507 4.79 0.001
x 0.7656 0.1750 4.38 0.002

S = 8.704 R-Sq = 70.5% R-Sq(adj) = 66.8%

Analysis of Variance
Source DF SS MS F P
Regression 1 1450.0 1450.0 19.14 0.002
Residual Error 8 606.0 75.8
Total 9 2056.0
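Minitab Output callouts:
Regression coefficients, $a$ and $b$: the Coef column
Least squares regression line: the regression equation
MSE: the Residual Error mean square
To test $H_0: \beta = 0$: the T and P values for x, with $t^2 = F$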

Measuring the Strength
of the Relationship
If the independent variable $x$ is useful in predicting $y$, you will want to know how well the model fits.
The strength of the relationship between $x$ and $y$ can be measured using:
Correlation coefficient: $r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$
Coefficient of determination: $r^2 = \frac{S_{xy}^2}{S_{xx}S_{yy}} = \frac{\mathrm{SSR}}{\text{Total SS}}$

Measuring the Strength
of the Relationship
Since Total SS = SSR + SSE, $r^2$ measures
the proportion of the total variation in the responses that can be explained by using the independent variable $x$ in the model.
the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean $\bar{y}$ to estimate $y$.
$r^2 = \frac{\mathrm{SSR}}{\text{Total SS}}$
For the calculus problem, $r^2 = .705$ or 70.5%. The model is working well!

Checking the
Regression Assumptions
1. The relationship between $x$ and $y$ is linear, given by $y = \alpha + \beta x + \epsilon$.
2. The random error terms $\epsilon$ are independent and, for any value of $x$, have a normal distribution with mean 0 and variance $\sigma^2$.
Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.

Residuals
The residual error is the leftover
variation in each data point after the
variation explained by the regression model
has been removed.

Residual $= y_i - \hat{y}_i$ or $y_i - (a + bx_i)$
If all assumptions have been met, these residuals should be normal, with mean 0 and variance $\sigma^2$.
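To make the residual definition concrete, here is a small sketch that computes the residuals for the calculus data. It refits the line from the raw table values; the check that the residuals sum to zero and that their squares sum to SSE is a standard property of least squares, not something shown on the slides.

```python
# Calculus example data: test scores (x) and grades (y).
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares line y-hat = a + b x.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual = observed y minus fitted y.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in residuals)

for xi, yi, e in zip(x, y, residuals):
    print(f"x = {xi:2d}  y = {yi:2d}  residual = {e:7.2f}")
print(f"sum of residuals = {sum(residuals):.6f}, SSE = {sse:.4f}")
```

A normal probability plot or a plot of these residuals against the fitted values would be the next diagnostic step.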


Normal Probability Plot
If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.
If not, you will often see the pattern fail in the tails of the graph.

Estimation and Prediction
Once you have
determined that the regression line is useful, and
used the diagnostic plots to check for violation of the regression assumptions,
you are ready to use the regression line to
estimate the average value of $y$ for a given value of $x$
predict a particular value of $y$ for a given value of $x$.

[Figure panels: estimating the average value of $y$ when $x = x_0$; predicting a particular value of $y$ when $x = x_0$.]

The best estimate of either $E(y)$ or $y$ for a given value $x = x_0$ is
$\hat{y} = a + bx_0$
Particular values of $y$ are more difficult to predict, requiring a wider range of values in the prediction interval.
To estimate the average value of $y$ when $x = x_0$:
$\hat{y} \pm t_{\alpha/2}\sqrt{\mathrm{MSE}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}$
To predict a particular value of $y$ when $x = x_0$:
$\hat{y} \pm t_{\alpha/2}\sqrt{\mathrm{MSE}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}$

Estimate the average calculus grade for
students whose achievement score is 50
with a 95% confidence interval.
Calculate $\hat{y} = 40.78424 + .76556(50) = 79.06$
$\hat{y} \pm 2.306\sqrt{75.7532\left(\frac{1}{10} + \frac{(50 - 46)^2}{2474}\right)} = 79.06 \pm 6.55$
or 72.51 to 85.61.

Predict the calculus grade for a particular student whose achievement score is 50 with a 95% prediction interval.
Calculate $\hat{y} = 40.78424 + .76556(50) = 79.06$
$\hat{y} \pm 2.306\sqrt{75.7532\left(1 + \frac{1}{10} + \frac{(50 - 46)^2}{2474}\right)} = 79.06 \pm 21.11$
or 57.95 to 100.17.
Notice how much wider this interval is!
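Both intervals can be computed side by side, which makes the extra "1 +" term in the prediction interval easy to see. The sketch below assumes the summary values from the calculus example; the helper name `intervals` is hypothetical.

```python
import math

n = 10
x_bar, y_bar = 46.0, 76.0
s_xx, s_xy = 2474.0, 1894.0
mse = 606.0259 / (n - 2)
t_crit = 2.306            # t_.025 with n - 2 = 8 df, as used in the example

b = s_xy / s_xx
a = y_bar - b * x_bar

def intervals(x0):
    """Return (fit, confidence interval, prediction interval) at x = x0."""
    fit = a + b * x0
    spread = 1 / n + (x0 - x_bar) ** 2 / s_xx
    ci_half = t_crit * math.sqrt(mse * spread)        # average value of y
    pi_half = t_crit * math.sqrt(mse * (1 + spread))  # particular value of y
    return fit, (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)

fit, ci, pi = intervals(50)
print(f"fit = {fit:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI: ({pi[0]:.2f}, {pi[1]:.2f})")
```

Calling `intervals(46)` instead shows both intervals at their narrowest, since the $(x_0 - \bar{x})^2$ term vanishes at $x_0 = \bar{x}$.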

Minitab Output
Predicted Values for New Observations
New Obs  Fit    SE Fit  95.0% CI        95.0% PI
1        79.06  2.84    (72.51, 85.61)  (57.95, 100.17)

Values of Predictors for New Observations
New Obs  x
1        50.0
[Figure: confidence and prediction intervals when $x = 50$.]
Blue prediction bands are always wider than red confidence bands.
Both intervals are narrowest when $x = \bar{x}$.
Correlation Analysis
The strength of the relationship between x and y is
measured using the coefficient of correlation:
Correlation coefficient: $r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$
Recall from Chapter 3 that
(1) $-1 \le r \le 1$
(2) $r$ and $b$ have the same sign
(3) $r \approx 0$ means no linear relationship
(4) $r \approx 1$ or $-1$ means a strong (+) or (-) relationship

Example
The table shows the heights and weights of
n = 10 randomly selected college football
players.
Player 1 2 3 4 5 6 7 8 9 10
Height, x 73 71 75 72 72 75 67 69 71 69
Weight, y 185 175 200 210 190 195 150 170 180 175
Use your calculator to find the sums and sums of squares.
$S_{xy} = 328$   $S_{xx} = 60.4$   $S_{yy} = 2610$
$r = \frac{328}{\sqrt{(60.4)(2610)}} = .8261$

Some Correlation Patterns
Use the Exploring Correlation applet to
explore some correlation patterns:
r = 0; No correlation
r = .931; Strong positive correlation
r = 1; Linear relationship
r = -.67; Weaker negative correlation

Inference using r
The population coefficient of correlation is called $\rho$ (rho). We can test for a significant correlation between $x$ and $y$ using a $t$ test:
To test $H_0: \rho = 0$ versus $H_a: \rho \neq 0$
Test statistic: $t = r\sqrt{\frac{n - 2}{1 - r^2}}$
Reject $H_0$ if $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$ with $n - 2$ df.
This test is exactly equivalent to the $t$-test for the slope $\beta = 0$.

Example
Is there a significant positive correlation
between weight and height in the population
of all college football players?
$r = .8261$
$H_0: \rho = 0$
$H_a: \rho > 0$
Test statistic: $t = r\sqrt{\frac{n - 2}{1 - r^2}} = .8261\sqrt{\frac{8}{1 - .8261^2}} = 4.15$
Use the $t$-table with $n - 2 = 8$ df to bound the $p$-value as $p$-value < .005. There is a significant positive correlation.
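The correlation example can be verified from the raw heights and weights. This is a minimal sketch; it computes $r$ and the $t$ statistic only, and leaves the $p$-value bound to the $t$-table as in the slide.

```python
import math

# Heights (x, inches) and weights (y, pounds) of the ten football players.
height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]
n = len(height)

s_xx = sum(h * h for h in height) - sum(height) ** 2 / n
s_yy = sum(w * w for w in weight) - sum(weight) ** 2 / n
s_xy = sum(h * w for h, w in zip(height, weight)) - sum(height) * sum(weight) / n

r = s_xy / math.sqrt(s_xx * s_yy)                 # sample correlation coefficient
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))    # test of H0: rho = 0

print(f"Sxy = {s_xy:.1f}, Sxx = {s_xx:.1f}, Syy = {s_yy:.1f}")
print(f"r = {r:.4f}, t = {t_stat:.2f} with {n - 2} df")
```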

Example
Let y be the monthly sales revenue for a
company. This might be a function of
several variables:
$x_1$ = advertising expenditure
$x_2$ = time of year
$x_3$ = state of economy
$x_4$ = size of inventory
We want to predict $y$ using knowledge of $x_1$, $x_2$, $x_3$, and $x_4$.


The General
Linear Model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$
where
$y$ is the response variable you want to predict.
$\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are unknown constants.
$x_1, x_2, \ldots, x_k$ are independent predictor variables, measured without error.

Example
Consider the model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
This is a first order model (independent variables appear only to the first power).
$\beta_0$ = y-intercept = value of $E(y)$ when $x_1 = x_2 = 0$.
$\beta_1$ and $\beta_2$ are the partial regression coefficients: the change in $y$ for a one-unit change in $x_i$ when the other independent variables are held constant.
The model traces a plane in three-dimensional space.


The Method of
Least Squares
The best-fitting prediction equation is calculated using a set of $n$ measurements $(y, x_1, x_2, \ldots, x_k)$ as
$\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$
We choose our estimates $b_0, b_1, \ldots, b_k$ to estimate $\beta_0, \beta_1, \ldots, \beta_k$ to minimize
$\mathrm{SSE} = \sum (y - \hat{y})^2 = \sum (y - b_0 - b_1 x_1 - \cdots - b_k x_k)^2$
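The slides rely on Minitab for the minimization. As a rough illustration of what such a program does, the sketch below solves the normal equations $(X'X)b = X'y$ in plain Python. The matrix formulation, the helper names `solve` and `least_squares`, and the six data rows are all assumptions made for illustration; the data are generated exactly from a known plane so the recovered coefficients can be checked.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - known) / aug[i][i]
    return sol


def least_squares(rows, y):
    """Return [b0, b1, ..., bk] minimizing SSE; each row holds (x1, ..., xk)."""
    design = [[1.0] + list(r) for r in rows]   # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data generated exactly from y = 2 + 3*x1 - 1*x2 (no error term).
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [2 + 3 * x1 - x2 for x1, x2 in rows]
coef = least_squares(rows, y)
print([round(c, 6) for c in coef])
```

Statistical packages typically use more numerically stable factorizations than the normal equations; this form is shown only because it follows directly from setting the derivatives of SSE to zero.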

Example

A computer database in a small community contains the listed selling price $y$ (in thousands of dollars), the amount of living area $x_1$ (in hundreds of square feet), and the number of floors $x_2$, bedrooms $x_3$, and bathrooms $x_4$, for $n = 15$ randomly selected residences currently on the market.
Property  y      x1  x2  x3  x4
1         69.0   6   1   2   1
2         118.5  10  1   2   2
3         116.5  10  1   3   2
...
15        209.9  21  2   4   3
Fit a first order model to the data using the method of least squares.

Example

The first order model is
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$
fit using Minitab with the values of $y$ and the four independent variables entered into five columns of the Minitab worksheet.
Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Predictor  Coef     SE Coef  T      P
Constant   18.763   9.207    2.04   0.069
SqFeet     6.2698   0.7252   8.65   0.000
NumFlrs    -16.203  6.212    -2.61  0.026
Bdrms      -2.673   4.494    -0.59  0.565
Baths      30.271   6.849    4.42   0.001
Callouts: the first line of the printout is the regression equation; the Coef column gives the partial regression coefficients.

The Analysis of Variance
The total variation in the experiment is measured by the total sum of squares:
Total SS $= S_{yy} = \sum (y - \bar{y})^2$
The Total SS is divided into two parts:
SSR (sum of squares for regression):
measures the variation explained by using the
regression equation.
SSE (sum of squares for error): measures the
leftover variation not explained by the
independent variables.

The ANOVA Table
Total df = n - 1
Regression df = k
Error df = n - 1 - k = n - k - 1
Mean Squares:
MSR = SSR/k
MSE = SSE/(n - k - 1)

Source      df         SS        MS               F
Regression  k          SSR       SSR/k            MSR/MSE
Error       n - k - 1  SSE       SSE/(n - k - 1)
Total       n - 1      Total SS

The Real Estate Problem
Another portion of the Minitab printout
shows the ANOVA Table, with n = 15
and k = 4.
S = 6.849 R-Sq = 97.1% R-Sq(adj) = 96.0%

Source DF SS MS F P
Regression 4 15913.0 3978.3 84.80 0.000
Total 14 16382.2

Source DF Seq SS
SqFeet 1 14829.3
NumFlrs 1 0.9
Bdrms 1 166.4
Baths 1 916.5

Callouts: MSE is the error mean square. Sequential sums of squares give the conditional contribution of each independent variable to SSR given the variables already entered into the model.

Testing the Usefulness
of the Model
The first question to ask is whether the
regression model is of any use in predicting y.
If it is not, then the value of y does not change,
regardless of the value of the independent
variables, $x_1, x_2, \ldots, x_k$. This implies that the partial regression coefficients, $\beta_1, \beta_2, \ldots, \beta_k$, are all zero.
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ versus
$H_a$: at least one $\beta_i$ is not zero

The F Test
You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.
Testing whether the model is useful in predicting $y$ is equivalent to testing
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
Test statistic: $F = \frac{\mathrm{MSR}}{\mathrm{MSE}}$
Reject $H_0$ if $F > F_\alpha$ with $k$ and $n - k - 1$ df.

Measuring the Strength
of the Relationship
If the independent variables are useful in predicting $y$, you will want to know how well the model fits.
The strength of the relationship between $x$ and $y$ can be measured using:
Multiple coefficient of determination: $R^2 = \frac{\mathrm{SSR}}{\text{Total SS}}$

Measuring the Strength
of the Relationship
Since Total SS = SSR + SSE, $R^2$ measures
the proportion of the total variation in the responses that can be explained by using the independent variables in the model.
the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean $\bar{y}$ to estimate $y$.
$R^2 = \frac{\mathrm{SSR}}{\text{Total SS}}$ and $F = \frac{\mathrm{MSR}}{\mathrm{MSE}} = \frac{R^2/k}{(1 - R^2)/(n - k - 1)}$

Testing the Partial
Regression Coefficients
Is a particular independent variable useful in
the model, in the presence of all the other
independent variables? The test statistic is a function of $b_i$, our best estimate of $\beta_i$.
$H_0: \beta_i = 0$ versus $H_a: \beta_i \neq 0$
Test statistic: $t = \frac{b_i - 0}{\mathrm{SE}(b_i)}$
which has a $t$ distribution with error df $= n - k - 1$.

Is the overall model useful in predicting list
price? How much of the overall variation in
the response is explained by the regression
model?
S = 6.849 R-Sq = 97.1% R-Sq(adj) = 96.0%

Source DF SS MS F P
Regression 4 15913.0 3978.3 84.80 0.000
Total 14 16382.2

Source DF Seq SS
SqFeet 1 14829.3
NumFlrs 1 0.9
Bdrms 1 166.4
Baths 1 916.5

F = MSR/MSE = 84.80 with p-value = .000 is highly significant. The model is very useful in predicting the list price of homes.
$R^2 = .971$ indicates that 97.1% of the overall variation is explained by the regression model.
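The two routes to the F statistic, through mean squares and through $R^2$, can be compared on the real estate printout. The sketch below uses only the SS values shown there (SSR = 15913.0, Total SS = 16382.2) with $n = 15$ and $k = 4$; SSE is obtained by subtraction. Small differences from the printed 84.80 come from rounding in the printout.

```python
n, k = 15, 4
ssr, total_ss = 15913.0, 16382.2     # from the Minitab ANOVA table
sse = total_ss - ssr                 # Total SS = SSR + SSE

r2 = ssr / total_ss

# F computed from mean squares ...
f_from_ms = (ssr / k) / (sse / (n - k - 1))
# ... and from R-squared alone.
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))

print(f"R-Sq = {100 * r2:.1f}%")
print(f"F from mean squares = {f_from_ms:.2f}")
print(f"F from R-Sq         = {f_from_r2:.2f}")
```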


In the presence of the other three
independent variables, is the number of
bedrooms significant in predicting the list
price of homes? Test using $\alpha = .05$.
Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Constant 18.763 9.207 2.04 0.069
SqFeet 6.2698 0.7252 8.65 0.000
NumFlrs -16.203 6.212 -2.61 0.026
Bdrms -2.673 4.494 -0.59 0.565
Baths 30.271 6.849 4.42 0.001
To test $H_0: \beta_3 = 0$, the test statistic is t = -0.59 with p-value = .565.
The p-value is larger than .05 and $H_0$ is not rejected.
We cannot conclude that number of bedrooms is a valuable predictor in the presence of the other variables.
Perhaps the model could be refit without $x_3$.

Comparing
Regression Models
The strength of a regression model is measured using $R^2$ = SSR/Total SS. This value can only increase (it never decreases) as variables are added to the model.
To fairly compare two models, it is better to use a measure that has been adjusted using df:
$R^2(\text{adj}) = \left(1 - \frac{\mathrm{MSE}}{\text{Total SS}/(n - 1)}\right)100\%$
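The adjusted measure is easy to compute from an ANOVA table. The sketch below applies the formula to the two examples seen so far, using SS values from their printouts; the function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(sse, total_ss, n, k):
    """R-squared adjusted for degrees of freedom, as a proportion."""
    mse = sse / (n - k - 1)
    return 1 - mse / (total_ss / (n - 1))

# Calculus problem: simple linear regression, n = 10, k = 1.
calculus = adjusted_r2(606.0259, 2056.0, n=10, k=1)

# Real estate problem: n = 15, k = 4, SSE obtained as Total SS - SSR.
real_estate = adjusted_r2(16382.2 - 15913.0, 16382.2, n=15, k=4)

print(f"Calculus R-Sq(adj)    = {100 * calculus:.1f}%")
print(f"Real estate R-Sq(adj) = {100 * real_estate:.1f}%")
```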

Checking the
Regression Assumptions
The random error terms $\epsilon$
Are independent.
Have a mean 0 and common variance $\sigma^2$ for any set $x_1, x_2, \ldots, x_k$.
Have a normal distribution.
Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.

Normal Probability Plot
If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.
If not, you will often see the pattern fail in the tails of the graph.

Enter the appropriate values of $x_1, x_2, \ldots, x_k$ in Minitab. Minitab calculates
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$
and both the confidence interval and the prediction interval.
Particular values of $y$ are more difficult to predict, requiring a wider range of values in the prediction interval.

Estimate the average list price for a home
with 1000 square feet of living space,
one floor, 3 bedrooms and two baths with
a 95% confidence interval.
Predicted Values for New Observations
New Obs Fit SE Fit 95.0% CI 95.0% PI
1 117.78 3.11 ( 110.86, 124.70) ( 101.02, 134.54)

Values of Predictors for New Observations
New Obs SqFeet NumFlrs Bdrms Baths
1 10.0 1.00 3.00 2.00
We estimate that the average list
price will be between $110,860
and $124,700 for a home like
this.
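The fitted value in the printout can be reproduced by plugging the home's characteristics into the regression equation. In the sketch below the coefficients and SE Fit come from the Minitab output; the critical value 2.228 ($t_{.025}$ with $n - k - 1 = 10$ df) is taken from a standard $t$ table and does not appear in the slides, so the interval agrees with Minitab only up to rounding.

```python
# Partial regression coefficients from the Minitab printout.
coef = {"Constant": 18.763, "SqFeet": 6.2698, "NumFlrs": -16.203,
        "Bdrms": -2.673, "Baths": 30.271}

# Home to estimate: 1000 sq ft (10 hundreds), one floor, 3 bedrooms, 2 baths.
home = {"SqFeet": 10.0, "NumFlrs": 1.0, "Bdrms": 3.0, "Baths": 2.0}

fit = coef["Constant"] + sum(coef[name] * value for name, value in home.items())

se_fit = 3.11      # SE Fit reported by Minitab
t_crit = 2.228     # assumed table value: t_.025 with 10 df
ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)

print(f"Fit = {fit:.2f} (thousands of dollars)")
print(f"Approximate 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```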

Using Regression Models
When you perform multiple regression analysis, use a step-by-step approach:
1. Obtain the fitted prediction model.
2. Use the analysis of variance F test and $R^2$ to determine how well the model fits the data.
3. Check the t tests for the partial regression coefficients to see which ones are contributing significant information in the presence of the others.
4. If you choose to compare several different models, use $R^2$(adj) to compare their effectiveness.
5. Use diagnostic plots to check for violation of the regression assumptions.

A Polynomial Model
A response $y$ is related to a single independent variable $x$, but not in a linear manner. The polynomial model is:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \epsilon$
When k = 2, the model is quadratic:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$
When k = 3, the model is cubic:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon$

Example

A market research firm has observed the sales (y) as a
function of mass media advertising expenses (x) for 10
different companies selling a similar product.
Since there is only one
independent variable, you
could fit a linear, quadratic, or
cubic polynomial model.
Which would you pick?
Company 1 2 3 4 5 6 7 8 9 10
Expenditure, x 1.0 1.6 2.5 3.0 4.0 4.6 5.0 5.7 6.0 7.0
Sales, y 2.5 2.6 2.7 5.0 5.3 9.1 14.8 17.5 23.0 28.0

Two Possible Choices
A straight line model: $y = \beta_0 + \beta_1 x + \epsilon$
A quadratic model: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$
Here is the Minitab printout for the straight line:
Regression Analysis: y versus x
y = - 6.47 + 4.34 x
Constant -6.465 2.795 -2.31 0.049
x 4.3355 0.6274 6.91 0.000
S = 3.725 R-Sq = 85.6% R-Sq(adj) = 83.9%
Source DF SS MS F P
Regression 1 662.46 662.46 47.74 0.000
Total 9 773.46
The overall F test is highly significant, as is the t-test of the slope. $R^2 = .856$ suggests a good fit. Let's check the residual plots.

Example
There is a strong pattern of a
curve leftover in the residual
plot.
This indicates that there is a
curvilinear relationship
unaccounted for by your
straight line model.
You should have used the quadratic model!
Use Minitab to fit the quadratic model:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$

The Quadratic Model

Regression Analysis: y versus x, x-sq
y = 4.66 - 3.03 x + 0.939 x-sq

Constant 4.657 2.443 1.91 0.098
x -3.030 1.395 -2.17 0.067
x-sq 0.9389 0.1739 5.40 0.001
S = 1.752 R-Sq = 97.2% R-Sq(adj) = 96.4%

Source DF SS MS F P
Regression 2 751.98 375.99 122.49 0.000
Total 9 773.47
The overall F test is highly significant, as is the t-test of the quadratic term $\beta_2$. $R^2 = .972$ suggests a very good fit.
Let's compare the two models, and check the residual plots.

Which Model to Use?

Use $R^2$(adj) to compare the models:
The straight line model: $y = \beta_0 + \beta_1 x + \epsilon$, with $R^2(\text{adj}) = 83.9\%$
The quadratic model: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$, with $R^2(\text{adj}) = 96.4\%$
The quadratic model is better.
There are no patterns in the residual plot, indicating that this is the correct model for the data.
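The comparison can be reproduced without Minitab. The sketch below fits both polynomial models to the advertising data by solving the normal equations in plain Python and then computes $R^2$(adj) for each. The helper names and the normal-equations approach are assumptions made for illustration; a statistics package would normally be used instead.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - known) / aug[i][i]
    return sol


def fit_polynomial(x, y, degree):
    """Least squares coefficients [b0, b1, ..., b_degree] via the normal equations."""
    design = [[xi ** p for p in range(degree + 1)] for xi in x]
    size = degree + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(size)] for i in range(size)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(size)]
    return solve(xtx, xty)


def adjusted_r2(x, y, coef):
    """R-squared adjusted for df, for a polynomial with the given coefficients."""
    n, k = len(y), len(coef) - 1
    y_bar = sum(y) / n
    fitted = [sum(c * xi ** p for p, c in enumerate(coef)) for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - (sse / (n - k - 1)) / (total_ss / (n - 1))


# Advertising expenditure (x) and sales (y) for the 10 companies.
x = [1.0, 1.6, 2.5, 3.0, 4.0, 4.6, 5.0, 5.7, 6.0, 7.0]
y = [2.5, 2.6, 2.7, 5.0, 5.3, 9.1, 14.8, 17.5, 23.0, 28.0]

linear = fit_polynomial(x, y, 1)
quadratic = fit_polynomial(x, y, 2)
adj_linear = adjusted_r2(x, y, linear)
adj_quadratic = adjusted_r2(x, y, quadratic)

print("linear coefficients:   ", [round(c, 4) for c in linear])
print("quadratic coefficients:", [round(c, 4) for c in quadratic])
print(f"R-Sq(adj): linear {100 * adj_linear:.1f}%, quadratic {100 * adj_quadratic:.1f}%")
```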

Using Qualitative
Variables
Multiple regression requires that the
response y be a quantitative variable.
Independent variables can be either
quantitative or qualitative.
Qualitative variables involving k categories are entered into the model by using k - 1 dummy variables.
Example: To enter gender as a variable, use
$x_i$ = 1 if male; 0 if female

Example

Data were collected on 6 male and 6 female assistant professors. The researchers recorded their salaries ($y$) along with years of experience ($x_1$). Each professor's gender enters into the model as a dummy variable: $x_2$ = 1 if male; 0 if not.
Professor  Salary, y  Experience, x1  Gender, x2  Interaction, x1x2
1          $50,710    1               1           1
2          49,510     1               0           0
...
11         55,590     5               1           5
12         53,200     5               0           0

Example

We want to predict a professor's salary based on years of experience and gender. We think that there may be a difference in salary depending on whether you are male or female.
The model we choose includes experience ($x_1$), gender ($x_2$), and an interaction term ($x_1 x_2$) to allow salaries for males and females to behave differently.
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$

Minitab Output

We use Minitab to fit the model.
Regression Analysis: y versus x1, x2, x1x2
y = 48593 + 969 x1 + 867 x2 + 260 x1x2

Constant 48593.0 207.9 233.68 0.000
x1 969.00 63.67 15.22 0.000
x2 866.7 305.3 2.84 0.022
x1x2 260.13 87.06 2.99 0.017

S = 201.3 R-Sq = 99.2% R-Sq(adj) = 98.9%

Source DF SS MS F P
Regression 3 42108777 14036259 346.24 0.000
Residual Error 8 324315 40539
Total 11 42433092
What is the regression equation for males? For females?
For males, $x_2 = 1$: $\hat{y} = 49459.7 + 1229.13x_1$
For females, $x_2 = 0$: $\hat{y} = 48593.0 + 969.0x_1$
Two different straight line models.
Is the overall model useful in predicting $y$?
The overall F test is F = 346.24 with p-value = .000. The value of $R^2 = .992$ indicates that the model fits very well.
Is there a difference in the relationship between salary and years of experience, depending on the gender of the professor?
Yes. The individual t-test for the interaction term is t = 2.99 with p-value = .017. This indicates a significant interaction between gender and years of experience.
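The two straight lines come from substituting the dummy value into the fitted equation and collecting terms. A minimal sketch, using the coefficients from the Minitab output (the function name `salary_line` is hypothetical):

```python
# Fitted coefficients: y = b0 + b1*x1 + b2*x2 + b3*x1*x2
b0, b1, b2, b3 = 48593.0, 969.00, 866.7, 260.13

def salary_line(x2):
    """Return (intercept, slope) of salary against experience for a fixed x2."""
    return b0 + b2 * x2, b1 + b3 * x2

male = salary_line(1)      # x2 = 1 shifts the intercept by b2 and the slope by b3
female = salary_line(0)    # x2 = 0 leaves b0 and b1 unchanged

print(f"Males:   y = {male[0]:.1f} + {male[1]:.2f} x1")
print(f"Females: y = {female[0]:.1f} + {female[1]:.2f} x1")
```

The dummy variable coefficient changes the intercept, and the interaction coefficient changes the slope, which is why a significant interaction term means the two lines are not parallel.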

Example

Have any of the regression assumptions been violated, or
have we fit the wrong model?
It does not appear from the
diagnostic plots that there
are any violations of
assumptions.
The model is ready to be
used for prediction or
estimation.

Testing Sets of Parameters
Suppose the demand y may be related to five independent
variables, but that the cost of measuring three of them is very
high.
If it could be shown that these three contribute little or no
information, they can be eliminated.
You want to test the null hypothesis
$H_0: \beta_3 = \beta_4 = \beta_5 = 0$; that is, the independent variables $x_3$, $x_4$, and $x_5$ contribute no information for the prediction of $y$
versus the alternative hypothesis:
$H_a$: At least one of the parameters $\beta_3$, $\beta_4$, or $\beta_5$ differs from 0; that is, at least one of the variables $x_3$, $x_4$, or $x_5$ contributes information for the prediction of $y$.

To explain how to test a hypothesis concerning a set of model parameters, we define two models:
Model One (reduced model)
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r$
Model Two (complete model)
$E(y) = \underbrace{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r}_{\text{terms in model 1}} + \underbrace{\beta_{r+1} x_{r+1} + \beta_{r+2} x_{r+2} + \cdots + \beta_k x_k}_{\text{additional terms in model 2}}$

The test of the hypothesis
$H_0: \beta_3 = \beta_4 = \beta_5 = 0$
$H_a$: At least one of the $\beta_i$ differs from 0
uses the test statistic
$F = \frac{(\mathrm{SSE}_1 - \mathrm{SSE}_2)/(k - r)}{\mathrm{MSE}_2}$
where $F$ is based on df$_1 = (k - r)$ and df$_2 = n - (k + 1)$.
The rejection region for the test is identical to other analysis of variance F tests, namely $F > F_\alpha$.
=

Stepwise Regression
A stepwise regression analysis fits a variety of models to the data, adding variables that are significant in the presence of the other variables and deleting variables that are nonsignificant.
The procedure stops once no more variables are significant when added to the model and none of the variables already in the model are nonsignificant.
These programs always fit first-order models and are not helpful in detecting curvature or interaction in the data.

Pearson's Chi-Square
Statistic
We have some preconceived idea about the values of the $p_i$ and want to use sample information to see if we are correct.
The expected number of times that outcome $i$ will occur is $E_i = np_i$. The farther the observed cell counts, $O_i$, are from what we hypothesize under $H_0$, the more likely it is that $H_0$ should be rejected.

Pearson's Chi-Square
Statistic
We use the Pearson chi-square statistic:
$X^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
When $H_0$ is true, the differences $O - E$ will be small, but large when $H_0$ is false.
Look for large values of $X^2$ based on the chi-square distribution with a particular number of degrees of freedom.

Degrees of Freedom
These will be different depending on the
application.
1. Start with the number of categories or
cells in the experiment.
2. Subtract 1 df for each linear restriction on the cell probabilities. (You always lose 1 df since $p_1 + p_2 + \cdots + p_k = 1$.)
3. Subtract 1 df for every population parameter you have to estimate to calculate or estimate $E_i$.

The Goodness of Fit Test
The simplest of the applications.
A single categorical variable is measured, and exact numerical values are specified for each of the $p_i$.
Expected cell counts are $E_i = np_i$
Degrees of freedom: df = k - 1
Test statistic: $X^2 = \sum \frac{(O_i - E_i)^2}{E_i}$

Example
Toss a die 300 times with the following results. Is the die fair or biased?
Upper Face       1   2   3   4   5   6
Number of times  50  39  45  62  61  43
A multinomial experiment with $k = 6$ and $O_1$ to $O_6$ given in the table.
We test:
$H_0: p_1 = 1/6; p_2 = 1/6; \ldots; p_6 = 1/6$ (die is fair)
$H_a$: at least one $p_i$ is different from 1/6 (die is biased)

Example
Calculate the expected cell counts:
$E_i = np_i = 300(1/6) = 50$
Upper Face  1   2   3   4   5   6
O_i         50  39  45  62  61  43
E_i         50  50  50  50  50  50
Test statistic and rejection region:
$X^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{(50 - 50)^2}{50} + \frac{(39 - 50)^2}{50} + \cdots + \frac{(43 - 50)^2}{50} = 9.2$
Reject $H_0$ if $X^2 > \chi^2_{.05} = 11.07$ with $k - 1 = 6 - 1 = 5$ df.
Do not reject $H_0$. There is insufficient evidence to indicate that the die is biased.
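The die example reduces to a few lines of arithmetic. This sketch hard-codes the critical value 11.07 quoted in the example rather than computing it from a chi-square distribution.

```python
# Observed counts for faces 1 through 6 in 300 tosses.
observed = [50, 39, 45, 62, 61, 43]
n = sum(observed)
k = len(observed)

# Under H0 (fair die) every face has probability 1/6.
expected = [n / k] * k

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
chi2_crit = 11.07          # chi-square .05 critical value with k - 1 = 5 df
reject = x2 > chi2_crit

print(f"X^2 = {x2:.1f} with {k - 1} df; reject H0: {reject}")
```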

Some Notes
The test statistic, $X^2$, has only an approximate chi-square distribution.
For the approximation to be accurate, statisticians recommend $E_i > 5$ for all cells.
Goodness of fit tests are different from previous tests since the experimenter uses $H_0$ for the model he thinks is true.
$H_0$: model is correct (as specified)
$H_a$: model is not correct
Be careful not to accept $H_0$ (say the model is correct) without reporting $\beta$, the probability of a Type II error.

Contingency Tables: A
Two-Way Classification
The experimenter measures two
qualitative variables to generate
bivariate data.
Gender and colorblindness
Age and opinion
Professorial rank and type of
university
Summarize the data by counting the
observed number of outcomes in each
of the intersections of category levels in
a contingency table.

r x c Contingency Table
The contingency table has r rows and c columns, for rc total cells.
     1       2       ...  c
1    O_11    O_12    ...  O_1c
2    O_21    O_22    ...  O_2c
...
r    O_r1    O_r2    ...  O_rc

We study the relationship between the two
variables. Is one method of classification
contingent or dependent on the other?
Does the distribution of measurements in
the various categories for variable 1
depend on which category of variable 2 is
being observed?
If not, the variables are independent.

Chi-Square Test of
Independence
Observed cell counts are $O_{ij}$ for row $i$ and column $j$.
Expected cell counts are $E_{ij} = np_{ij}$
If $H_0$ is true and the classifications are independent,
$p_{ij} = p_i p_j$ = P(falling in row $i$)P(falling in column $j$)
$H_0$: classifications are independent
$H_a$: classifications are dependent

Chi-Square Test of
Independence
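Test statistic: $X^2 = \sum \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}$
Estimate $p_i$ and $p_j$ with $\hat{p}_i = \frac{r_i}{n}$ and $\hat{p}_j = \frac{c_j}{n}$, where $r_i$ is the total for row $i$ and $c_j$ is the total for column $j$, so that
$\hat{E}_{ij} = n\left(\frac{r_i}{n}\right)\left(\frac{c_j}{n}\right) = \frac{r_i c_j}{n}$
The test statistic has an approximate chi-square distribution with df = (r-1)(c-1).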

Example
Furniture defects are classified according
to type of defect and shift on which it
was made.
Shift
Type 1 2 3 Total
A 15 26 33 74
B 21 31 17 69
C 45 34 49 128
D 13 5 20 38
Total 94 96 119 309
Do the data present sufficient evidence to indicate that the type
of furniture defect varies with the shift during which the piece
of furniture is produced? Test at the 1% level of significance.
$H_0$: type of defect is independent of shift
$H_a$: type of defect depends on the shift

The Furniture Problem
Calculate the expected cell counts. For example:
$\hat{E}_{12} = \frac{r_1 c_2}{n} = \frac{74(96)}{309} = 22.99$
Chi-Square Test: 1, 2, 3
Expected counts are printed below observed counts
1 2 3 Total
1 15 26 33 74
22.51 22.99 28.50

2 21 31 17 69
20.99 21.44 26.57

3 45 34 49 128
38.94 39.77 49.29

4 13 5 20 38
11.56 11.81 14.63

Total 94 96 119 309
Test statistic: $X^2 = \sum \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}} = \frac{(15 - 22.51)^2}{22.51} + \frac{(26 - 22.99)^2}{22.99} + \cdots + \frac{(20 - 14.63)^2}{14.63} = 19.18$
Reject $H_0$ if $X^2 > \chi^2_{.01} = 16.81$ with $(r - 1)(c - 1) = 6$ df.
Chi-Sq = 2.506 + 0.394 + 0.711 +
0.000 + 4.266 + 3.449 +
0.944 + 0.836 + 0.002 +
0.179 + 3.923 + 1.967 =
19.178
DF = 6, P-Value = 0.004

Reject $H_0$. There is sufficient evidence to indicate that the proportions of defect types vary from shift to shift.
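The expected counts and the chi-square statistic for the furniture data can be computed directly from the observed table. This is a minimal sketch in plain Python; it stops at the statistic and degrees of freedom and leaves the comparison with the tabled critical value to the reader.

```python
# Observed defect counts: rows are defect types A-D, columns are shifts 1-3.
observed = [
    [15, 26, 33],
    [21, 31, 17],
    [45, 34, 49],
    [13, 5, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Estimated expected count for cell (i, j) is r_i * c_j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

x2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(f"E_12 = {expected[0][1]:.2f}")
print(f"X^2 = {x2:.3f} with {df} df")
```

The same code applies unchanged to a table whose column totals were fixed in advance, since the test of equality of multinomial populations uses the same statistic.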

Comparing Multinomial
Populations

Sometimes researchers design an experiment
so that the number of experimental units falling
in one set of categories is fixed in advance.
Example: An experimenter selects 900 patients who have been treated for flu prevention. She selects 300 from each of three types: no vaccine, one shot, and two shots.
         No Vaccine  One Shot  Two Shots  Total
Flu                                       r_1
No Flu                                    r_2
Total    300         300       300        n = 900
The column totals have been fixed in advance!
Comparing Multinomial
Populations
Each of the c columns (or r rows) whose totals have been fixed in advance is actually a single multinomial experiment.
The chi-square test of independence with (r-1)(c-1) df is equivalent to a test of the equality of c (or r) multinomial populations.
         No Vaccine  One Shot  Two Shots  Total
Flu                                       r_1
No Flu                                    r_2
Total    300         300       300        n = 900
Three binomial populations: no vaccine, one shot, and two shots.
Is the probability of getting the flu independent of the type of flu prevention used?

Example
Random samples of 200 voters in each of
four wards were surveyed and asked if
they favor candidate A in a local
election.
Ward
1 2 3 4 Total
Favor A 76 53 59 48 236
Do not favor A 124 147 141 152 564
Total 200 200 200 200 800
Do the data present sufficient evidence to indicate that the fraction of voters favoring candidate A differs in the four wards?
$H_0$: fraction favoring A is independent of ward
$H_a$: fraction favoring A depends on the ward
Equivalently, $H_0: p_1 = p_2 = p_3 = p_4$, where $p_i$ = fraction favoring A in each of the four wards

The Voter Problem
Calculate the expected cell counts. For example:
$\hat{E}_{12} = \frac{r_1 c_2}{n} = \frac{236(200)}{800} = 59$
Chi-Square Test: 1, 2, 3, 4
Expected counts are printed below observed counts

1 2 3 4 Total
1 76 53 59 48 236
59.00 59.00 59.00 59.00

2 124 147 141 152 564
141.00 141.00 141.00 141.00

Total 200 200 200 200 800
Test statistic: $X^2 = \sum \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}} = \frac{(76 - 59)^2}{59} + \frac{(53 - 59)^2}{59} + \cdots + \frac{(152 - 141)^2}{141} = 10.722$
Reject $H_0$ if $X^2 > \chi^2_{.05} = 7.81$ with $(r - 1)(c - 1) = 3$ df.
Chi-Sq = 4.898 + 0.610 + 0.000 + 2.051 +
2.050 + 0.255 + 0.000 + 0.858 =
10.722
DF = 3, P-Value = 0.013
Reject $H_0$. There is sufficient evidence to indicate that the fraction of voters favoring A varies from ward to ward.
The Voter Problem
Since we know that there are differences among the four wards, what is the nature of the differences?
Look at the proportions in favor of candidate A in the four wards.
Ward     1             2             3             4
Favor A  76/200 = .38  53/200 = .27  59/200 = .30  48/200 = .24
Candidate A is doing best in the first ward, and worst in the fourth ward. More importantly, he does not have a majority of the vote in any of the wards!
