Residual Analysis: There Will Be No Class Meeting On Tuesday, November 26

Today we'll complete our discussion of
Multiple Linear Regression and introduce
Goodness of Fit Testing.
Exam 3 is at 8:00 AM on Friday, December 13, 2013.
It has the same general format as Exams 1 and 2. It focuses on
Chapters 10-14 but is cumulative in the sense that it
requires a good grasp of probability, hypothesis testing
and interval estimation principles.
Note: Due to the ABET accreditation visit from
November 24-26, there will be no class meeting on
Tuesday, November 26.
HW Assignment 5 will be adjusted based on our
progress during the last three regular lectures. Only the
Chapter 14 problems and possibly some of the assigned
Chapter 15 problems will be due on December 6.
In assessing MLR model adequacy,
useful measures to examine adherence
to model assumptions include:
Residual Analysis: detects correlated errors,
departures from normality, missing variables, etc.
Tests for Multicollinearity: detect
correlation among independent
variables, which inflates the variance of
forecasts
Measures of Multicollinearity (correlation
between independent variables)
Ideally, Variance Inflation Factors should
not exceed about 5 (i.e. $R_i^2 \le .8$):
$$\mathrm{VIF}(\beta_i) = \frac{1}{1 - R_i^2} \quad \text{for } i = 1, 2, \ldots, k$$
where $R_i^2$ is the $R^2$ from regressing $x_i$
on the other $k-1$ independent variables.
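The VIF rule of thumb can be checked numerically. The sketch below is illustrative only: it assumes the special case of k = 2 predictors, where the R-squared from regressing one predictor on the other equals their squared Pearson correlation. The data values and the helper names `pearson_r` and `vif` are hypothetical and do not come from the lecture.

```python
def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif(r_squared):
    """Variance inflation factor 1 / (1 - R_i^2)."""
    return 1.0 / (1.0 - r_squared)


# Hypothetical predictor values (not from the lecture); x2 nearly tracks x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

# With only two predictors, R_1^2 = R_2^2 = r^2.
r_squared = pearson_r(x1, x2) ** 2

print("VIF at R^2 = .8:", vif(0.8))            # the threshold value, about 5
print("VIF for x1, x2: ", vif(r_squared))
print("Exceeds 5?      ", vif(r_squared) > 5)
```

With more than two predictors, each $R_i^2$ would come from a full regression of $x_i$ on the remaining variables, which statistical software reports directly.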
Determinant of the correlation matrix
also measures linear dependency:
Determinant = 0 implies exact linear
dependence of the x's
Determinant = 1 implies linear
independence
Direct examination of correlation matrix
elements for pairwise dependencies
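The two endpoints of the determinant measure can be seen with small correlation matrices. This is a minimal sketch for three predictors; the matrices are hypothetical examples chosen to show the extremes, and `det3` is an illustrative helper rather than part of any course software.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Hypothetical correlation matrices for three predictors.
uncorrelated = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]]
exactly_dependent = [[1.0, 1.0, 0.0],   # x1 and x2 perfectly correlated
                     [1.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]]
nearly_dependent = [[1.0, 0.95, 0.0],   # x1 and x2 strongly correlated
                    [0.95, 1.0, 0.0],
                    [0.0, 0.0, 1.0]]

for name, matrix in [("uncorrelated", uncorrelated),
                     ("exactly dependent", exactly_dependent),
                     ("nearly dependent", nearly_dependent)]:
    print(name, det3(matrix))
```

A determinant close to 0, as in the third matrix, signals near-dependence even when no exact linear relation holds.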
When the F-test for regression is
significant but individual coefficients are
not, this is another possible indication
of multicollinearity, which inflates the
variance of y-value estimates.
Sample problem illustrating the use of
software for Multiple Linear Regression:
experiment with the full model
statistical/visual examination of fit
standardized residual analysis
multicollinearity analysis/MSE inflation
forward model building based on correlation
comparison of 4 vs. 1 variable models
F stat, t-stat, adjusted R-square
Goodness of Fit Testing
(Moving on to Chapter 14)
It is often useful in statistics to
determine the degree to which sample
data fits a pre-defined distribution.
Goodness of fit testing is the primary
tool used for this type of analysis.
For a generalized binomial experiment
where each trial can result in k>2
possible outcomes, we can apply sample
data to test hypotheses of the form:
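$$H_0: p_1 = p_{10},\ p_2 = p_{20},\ \ldots,\ p_k = p_{k0}$$
vs.
$$H_a: \text{at least one } p_i \ne p_{i0}$$
where the corresponding test statistic
has the form:
$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - n p_{i0})^2}{n p_{i0}} \quad \text{vs.} \quad \chi^2_{\alpha,\,k-1}$$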
Thus, goodness of fit tests are upper tail
Chi-Square tests.
With continuous distributions, cell
probabilities for such tests are selected
to satisfy $p_{i0} = P(a_{i-1} \le X < a_i)$, where it is
recommended that cells be chosen to
satisfy $n p_{i0} \ge 5$.
Goodness of fit test application:
A statistics tutoring service was designed to
serve the following client base:
Mgt: 40%, Engr: 30%, Soc: 20%, Agr: 10%
A sample of 120 client visits yields:
Mgt: 52, Engr: 38, Soc: 21, Agr: 9
Was the design expectation correct?
Category              Observed   Expected
40% Business             52      .4(120) = 48
30% Engineering          38      .3(120) = 36
20% Social Science       21      .2(120) = 24
10% Agriculture           9      .1(120) = 12
Total                   120

$$H_0: p_1 = .4,\ p_2 = .3,\ p_3 = .2,\ p_4 = .1 \quad \text{vs.} \quad H_a: \text{Distribution is different}$$
$$df = 4 - 1 = 3, \qquad \chi^2_{.05,3} = 7.81$$
$$\chi^2 = \frac{(52-48)^2}{48} + \frac{(38-36)^2}{36} + \frac{(21-24)^2}{24} + \frac{(9-12)^2}{12} = 1.57 < 7.81$$
Fail to reject $H_0$.
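The tutoring-service calculation can be reproduced in a few lines. This is a sketch using only the counts and design proportions stated in the example; the critical value 7.81 is copied from the slide rather than computed, and the variable names are illustrative.

```python
# Observed visits and design proportions for Mgt, Engr, Soc, Agr.
observed = [52, 38, 21, 9]
p0 = [0.4, 0.3, 0.2, 0.1]

n = sum(observed)                      # 120 client visits
expected = [n * p for p in p0]         # n * p_i0 for each category

# Chi-square goodness-of-fit statistic: sum of (n_i - n p_i0)^2 / (n p_i0).
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1                 # k - 1 = 3
critical = 7.81                        # chi-square(.05, 3), taken from the slide

print("chi-square:", round(chi_sq, 2))
print("df:", df)
print("reject H0?", chi_sq > critical)
```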
Problem 14-9 (page 602)
Test with an exponential distribution
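14-9)
$$f(x) = \lambda e^{-\lambda x}, \qquad F(x) = 1 - e^{-\lambda x}, \qquad \lambda = 1$$
$$H_0: f(x) = e^{-x} \quad \text{vs.} \quad H_a: f(x) \ne e^{-x}$$
5 intervals: $(0, C_1),\ (C_1, C_2),\ (C_2, C_3),\ (C_3, C_4),\ (C_4, \infty)$
$$.2 = \int_0^{C_1} e^{-x}\,dx = 1 - e^{-C_1} \ \Rightarrow\ C_1 = .2231$$
$$.4 = \int_0^{C_2} e^{-x}\,dx = 1 - e^{-C_2} \ \Rightarrow\ C_2 = .5108$$
$$.6 = \int_0^{C_3} e^{-x}\,dx = 1 - e^{-C_3} \ \Rightarrow\ C_3 = .9163$$
$$.8 = \int_0^{C_4} e^{-x}\,dx = 1 - e^{-C_4} \ \Rightarrow\ C_4 = 1.61$$
Cell counts are 6, 8, 10, 7 and 9:
$$\chi^2 = \frac{(6-8)^2}{8} + \cdots + \frac{(9-8)^2}{8} = 1.25$$
Fail to reject $H_0$ since $1.25 < \chi^2_{.10,4} = 7.779$.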
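The equal-probability class boundaries follow from inverting the exponential CDF, $C_i = -\ln(1 - i/5)/\lambda$. The sketch below assumes $\lambda = 1$ and the five cell counts given above; the critical value 7.779 is taken from the slide, and the variable names are illustrative.

```python
import math

lam = 1.0          # hypothesized exponential rate
k = 5              # number of equal-probability cells

# Solve i/k = 1 - exp(-lam * C_i) for each interior boundary C_i.
boundaries = [-math.log(1 - i / k) / lam for i in range(1, k)]

counts = [6, 8, 10, 7, 9]              # observed cell counts
n = sum(counts)                        # sample size (40)
expected = n / k                       # equal expected count per cell (8)

chi_sq = sum((c - expected) ** 2 / expected for c in counts)
critical = 7.779                       # chi-square(.10, 4), taken from the slide

print("boundaries:", [round(c, 4) for c in boundaries])
print("chi-square:", chi_sq)
print("reject H0?", chi_sq > critical)
```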
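Test for a discrete distribution
x      0   1   2   3   4   5   6   7   8   9  10  11  12
Freq  24  16  16  18  15   9   6   5   3   4   3   0   1
$$H_0: p(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, \ldots$$
vs.
$$H_a: p(x) \text{ is non-Poisson}$$
$$\hat\lambda = 0\left(\frac{24}{120}\right) + 1\left(\frac{16}{120}\right) + \cdots + 11\left(\frac{0}{120}\right) + 12\left(\frac{1}{120}\right) = 3.167$$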
14-17
$$H_0: p(x) = \frac{e^{-3.167}\,3.167^x}{x!} \quad \text{for } x = 0, 1, \ldots$$
x          0      1      2    3    4    5    6      >=7
p-hat      .0421  .1334  ...                 .059   .043
n p-hat    5.05   16     ...                 7.1    5.16
obs        24     16     16   18   15   9    6      16
$$\chi^2 = \frac{(5.05-24)^2}{5.05} + \frac{(16-16)^2}{16} + \cdots + \frac{(5.16-16)^2}{5.16} = 104$$
$$\chi^2_{.01,7} = 18.47 < 104$$
Reject $H_0$.
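The Poisson fit can be sketched as follows, using the frequency table above and pooling counts of 7 or more into one cell as the slide does. Because the code keeps the unrounded estimate of the mean, its statistic differs slightly from the hand calculation but lands near 104. The critical value 18.47 is copied from the slide, not derived here, and the variable names are illustrative.

```python
import math

# Frequencies for x = 0, 1, ..., 12 from the table.
freq = [24, 16, 16, 18, 15, 9, 6, 5, 3, 4, 3, 0, 1]
n = sum(freq)

# Estimate the Poisson mean by the sample mean.
lam_hat = sum(x * f for x, f in enumerate(freq)) / n

# Cells: x = 0..6 individually, then x >= 7 pooled.
observed = freq[:7] + [sum(freq[7:])]
probs = [math.exp(-lam_hat) * lam_hat ** x / math.factorial(x) for x in range(7)]
probs.append(1.0 - sum(probs))         # P(X >= 7) as the remaining probability
expected = [n * p for p in probs]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical = 18.47                       # value used on the slide

print("lambda-hat:", round(lam_hat, 3))
print("expected:  ", [round(e, 2) for e in expected])
print("chi-square:", round(chi_sq, 1))
print("reject H0? ", chi_sq > critical)
```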
A Contingency Table is a table whose
rows represent the possible values of
one variable and whose columns
represent the possible values for a
second variable. The entries in the table
are the number of times that each pair of
values occurs.
Two Variables: defect type and shift
                      Shift:
Defect Type:     1st   2nd   3rd   Total
Color             27    13    10     50
Printing          20    17     7     44
Skewness           5     7     5     17
Total             52    37    22    111
In problems of this type, we may be
interested in testing whether proportions
in the different categories are the same
for all populations, i.e., whether the
populations are homogeneous. For
example, are all defect types distributed
the same way across the three shifts?
(In this case, the row totals are fixed and
the column totals are random variables.)
Defining $n_{ij}$ and $e_{ij}$ as the observed and
expected number from the $i$th sample
falling into category $j$, and $p_{ij}$ as the
proportion of individuals in population $i$
who fall into category $j$, we can use the
data in the contingency table to test
homogeneity using hypotheses of the
form:
$$H_0: p_{1j} = p_{2j} = \cdots = p_{Ij} \quad (\text{for } j = 1, \ldots, J)$$
vs.
$$H_a: H_0 \text{ is not true}$$
where the corresponding Chi-Square test
statistic has $(I-1)(J-1)$ degrees of freedom
and has the form:
$$\chi^2 = \sum_{\text{all cells}} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \quad \text{vs.} \quad \chi^2_{\alpha,\,(I-1)(J-1)}$$
where
$$e_{ij} = \frac{(i\text{th row total})(j\text{th column total})}{\text{grand total}}$$
As before, it is recommended that data be
collected to satisfy $e_{ij} \ge 5$ for all cells.
In tests of Homogeneity, $H_0$ states that
the proportion of individuals in category
$j$ is the same for each population and
that this is true for every category.
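As a rough illustration, the expected counts and the test statistic for the defect-type-by-shift table can be computed directly from the formulas above. The sketch uses only the table counts given in these notes; it reports the statistic, the degrees of freedom, and any cells that fall short of the recommended expected count of 5, without choosing a significance level. Variable names are illustrative.

```python
# Rows: Color, Printing, Skewness. Columns: 1st, 2nd, 3rd shift.
table = [[27, 13, 10],
         [20, 17, 7],
         [5, 7, 5]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# e_ij = (row i total)(column j total) / (grand total)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(len(table)) for j in range(len(col_totals)))
df = (len(table) - 1) * (len(col_totals) - 1)

# Cells that do not meet the e_ij >= 5 recommendation.
small_cells = [(i, j) for i in range(len(table)) for j in range(len(col_totals))
               if expected[i][j] < 5]

print("expected counts:", [[round(e, 2) for e in row] for row in expected])
print("chi-square:", round(chi_sq, 2), "with df =", df)
print("cells with e_ij < 5 (row, col indices):", small_cells)
```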
Two Variables: defect type and shift
                      Shift:
Defect Type:     1st   2nd   3rd   Total
Color             27    13    10     50
Printing          20    17     7     44
Skewness           5     7     5     17
Total             52    37    22    111
This procedure assumes fixed row totals; the column totals are random variables.
Illustration for the sample contingency
table with defect type fixed and shift as a
random variable.
A closely related procedure to
homogeneity testing can be used to test
for independence by defining $p_{ij}$ as the
proportion of individuals in category $(i,j)$
and $\hat{p}_{i\cdot}$ as the sample proportion for
category $i$ of factor 1 and $\hat{p}_{\cdot j}$ as the
sample proportion for category $j$ of
factor 2. These definitions define the
estimated expected cell counts:
$$e_{ij} = \frac{(i\text{th row total})(j\text{th column total})}{\text{grand total}}$$
and can be applied in tests of the form:
$$H_0: p_{ij} = p_{i\cdot}\,p_{\cdot j} \quad \text{for } i = 1, \ldots, I \text{ and } j = 1, \ldots, J$$
vs.
$$H_a: H_0 \text{ is not true}$$
where the corresponding test statistic
has the form:
$$\chi^2 = \sum_{\text{all cells}} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \quad \text{vs.} \quad \chi^2_{\alpha,\,(I-1)(J-1)}$$
As before, it is recommended that data
be collected to satisfy $e_{ij} \ge 5$ for all cells.
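The link between the sample proportions and the expected counts can be verified numerically: under independence the estimated expected count is $n\,\hat{p}_{i\cdot}\,\hat{p}_{\cdot j}$, which reduces to (row total)(column total)/(grand total). The sketch below checks this on the defect-type-by-shift counts from these notes; the variable names are illustrative.

```python
# Rows: Color, Printing, Skewness. Columns: 1st, 2nd, 3rd shift.
table = [[27, 13, 10],
         [20, 17, 7],
         [5, 7, 5]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Marginal sample proportions for each factor.
p_row = [r / n for r in row_totals]    # p-hat_{i.}
p_col = [c / n for c in col_totals]    # p-hat_{.j}

# Expected counts two ways: from proportions and from totals.
e_from_props = [[n * pi * pj for pj in p_col] for pi in p_row]
e_from_totals = [[r * c / n for c in col_totals] for r in row_totals]

max_diff = max(abs(a - b)
               for row_a, row_b in zip(e_from_props, e_from_totals)
               for a, b in zip(row_a, row_b))

print("largest difference between the two forms:", max_diff)
```

Because the expected counts coincide, the independence test uses the same statistic and degrees of freedom as the homogeneity test; only the sampling scheme differs.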
Tests of Homogeneity vs. Independence:
In tests of homogeneity, either the row
totals are fixed and the column totals are
random variables, or else the column
totals are fixed and the row totals are the
random variables.
In tests of independence, only the
sample size is fixed and the row and
column totals are both random variables.
Test of independence
