
STT 814 Section II

Chapter 1
2010 Robert J. Tempelman

Introduction

STT 814 builds upon the material presented in STT 464 Statistics for Biologists.
In STT 464, we essentially "started at the top", considering descriptive statistics,
exploratory data analysis and probability theory before pursuing important examples of
applied statistical inference. The nature of the statistical inference considered in STT 464
was rather fundamental...t-tests involving the mean of one population, the differences in
means between two populations and finally, the differences between several means in a
one-way analysis of variance. Then, we considered the basic elements of linear
regression before concluding the course with a short series on contingency table analysis.

In STT 814, we carry on with many of the same inferential topics as we did in
STT 464 but with much greater depth and generality. In particular, we will spend most of
our attention on those situations where several causal factors or variables must be
considered simultaneously. When all the causal variables are quantitative, then statistical
inference is a multiple regression problem. Consider the following example taken directly
from Ramsey and Schafer (1997) pp.227-228.

Biologists are keenly interested in the characteristics that enable various species to
survive. An interesting variable in this respect is brain size. One might expect that
bigger brains are better, but certain penalties seem to be associated with large brains, such
as the need for longer pregnancies and fewer offspring. Although the individual members
of the large-brained species may have a greater chance of surviving, the benefits for the
species must be good enough to compensate for these penalties. To shed some light on
this issue, it is helpful to determine exactly which characteristics are associated with large
brains, after adjusting for the effect of body size. This problem was considered in
Sacher, G.A. and Staffeldt, E.F. (1974) Relation of gestation time to brain weight for
placental mammals; implications for the theory of vertebrate growth, The American
Naturalist 108: 593-613.

These scientists collected data on average values of brain weight (Y), body weight (X_1),
gestation length (X_2) and litter size (X_3) for 96 species of animals. The natural
question would be: are X_2 and X_3 related to Y after accounting for, or adjusting for, X_1?
This is a multiple regression problem and will form the basis for one third of the STT 814
course material. Some scatterplots relating Y to X_1, X_2 and X_3 individually are provided
below:




If you look carefully at the plots above, the relationships between the variables appear to
be complicated but sometimes that is simply due to an inappropriate scaling of the
variables. Suppose we log transform each of the variables and recreate the scatterplots;
we then arrive at the following:




The linear relationships now appear much clearer.
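
As a rough illustration, the log transformation and scatterplots could be produced in SAS
along the following lines; the dataset name (brains) and the variable names (brain, body,
gest, litter) are assumptions for illustration, not taken from the original source.

   * Log-transform each variable and redraw the scatterplots (illustrative names);
   data brains_log;
      set brains;                /* assumed to contain brain, body, gest, litter */
      lbrain  = log(brain);
      lbody   = log(body);
      lgest   = log(gest);
      llitter = log(litter);
   run;

   proc gplot data=brains_log;
      plot lbrain*lbody lbrain*lgest lbrain*llitter;
   run; quit;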
Other issues that we need to be wary of include confounding or multicollinearity,
as we'll define later. For example, consider the scatterplot of log-transformed gestation
length versus log-transformed body size below:



It is quite possible that once we look at the relationship between brain weight and
gestation length AFTER adjusting for body size, we find no meaningful relationship.

When the causal factors are classification variables (like gender, breed, dietary
treatment, etc.), then the statistical inference becomes an analysis of variance or
experimental design problem. This general topic area will constitute the remaining two
thirds of STT 814. We considered some simple factorial design cases in STT 464, such as
2 x 2 factorials. Consider the data taken from a 3 x 2 factorial study in which gains in
weight (grams) of male rats were compared under 6 diet treatments:

High Protein Low Protein
Beef Cereal Pork Beef Cereal Pork
73 98 94 90 107 49
102 74 79 76 95 82
118 56 96 90 97 73
104 111 98 64 80 86
81 95 102 86 98 81
107 88 102 51 74 97
100 82 108 72 74 106
87 77 91 90 67 70
117 86 120 95 89 61
111 92 105 78 58 82

In these cases, we might not only be interested in, say, the effect of protein level
adjusted for protein source, but also in whether the differences between protein levels depend on
protein source (i.e. interaction). An interaction (means) plot is provided below:



If you examine this plot, there appears to be very little difference in
gains between the high and low protein levels when cereal is the protein source, as opposed to
the situation when beef or pork is the protein source. Of course, we shouldn't draw
such conclusions solely from a means plot but from formal statistical inference, as we'll
see later. We'll also consider multifactorial designs (i.e. those designs that involve 3 or
more factors), including discussion of the diabolical 3-way (and higher-order)
interactions.
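
A minimal SAS sketch of how such a 3 x 2 factorial analysis (including the interaction)
might be set up is given below; the dataset name (rats) and the variable names (gain,
source, level) are illustrative assumptions, not from the original text.

   * Two-way factorial ANOVA with interaction (illustrative names);
   proc glm data=rats;
      class source level;                  /* protein source and protein level */
      model gain = source level source*level;
      lsmeans source*level;                 /* cell means behind the interaction plot */
   run; quit;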

Sometimes, you can remove a lot of "noise" in your statistical inference by
blocking on important sources of variation. For example, in swine nutrition trials
involving young piglets, it may seem advisable to randomize hormonal treatments to
piglets within litters. This essentially removes litter to litter variation in the assessment
(i.e. standard errors) of hormonal effects since treatments are compared directly to each
other within each litter! If you have more treatments than experimental units within a
block, then an incomplete block design might be considered.

Some designs that we will consider are hierarchical in scope. For example, those
of you that need to deal with pens of animals in animal science research may collect data
on individual animals while treatments, such as diets, are given to entire pens. This is an
example of a nested design (or a split plot design if levels of a second factor were applied
to individual animals). Some repeated measures designs (i.e. designs in which physiological
measurements are taken repeatedly on individuals over time) are also related to this general
category of designs.

Finally, we'll consider "efficient" designs in the sense of allowing more factors to
be considered in a small study, but at a small price...ignoring (or presuming to be
unimportant) some or all of the interactions between the various factors. Latin
square designs are popular but they don't allow assessment of any type of interaction.
Fractional factorials and response surface designs, however, do allow some lower-order
interactions to be investigated.

If you're starting to feel a little overwhelmed by all these design names for
regression/ experimental design problems, you're not alone. It's often difficult to attach a
label to a particular design (although many researchers like to impress their colleagues
with their knowledge of design names). In some cases, a design might actually be a
hybrid of some of the basic classes of designs described above. What perhaps is more
important to know is how to write the linear model for a particular design (and hence the
choice of the textbook).

Analysis of covariance is one such example of a hybrid design/analysis that
combines analysis of variance with simple linear regression. Consider the following
example from Ramsey and Schafer (1997): A study of the effects of predatory intertidal
crab species on snail populations involved the measuring of mean closing forces and the
propodus heights of claws on several crabs of three species. The researchers wished to
obtain the linear regression of log force on log height (the reasons for the log
transformations will be seen later) for each of the three species; but is there a way of
combining all of this into one analysis (i.e. to investigate differences in the rates of change
from one species to the next)?

So you've now got a general feeling of the topics we'll consider in this course. As
you might have figured out by now, these topics exclusively involve the analysis of
continuous quantitative responses. If time permits (rather unlikely), we might also
consider logistic regression for the multiple regression analyses of binary response
variables (this topic is treated in much greater detail in EPI 826). The analysis of discrete
data is increasingly important in light of the higher profile presently afforded to fitness
and fertility traits (which tend to be measured on a discrete scale) in biological and
agricultural research.

I already mentioned to you that it's often difficult to attach labels to very elaborate
designs. Of course, this may beg the question as to what is fundamentally important. In
almost any analysis of quantitative continuous response variables, I feel that the most
important starting point is the linear model. Linear models will be a recurring theme
throughout this course. A linear statistical model could be loosely described as follows:

data = linear combination of effects + random noise

or in terms of mathematics:

y = μ_{y|x, β, σ²} + e

Note that the conditional mean μ_{y|x, β, σ²} will generally differ for each y depending on x
and β. What is meant by y|x, β, σ²? This is the conditional distribution of the data y given
knowledge of a covariate x (if one is specified), a vector of location parameters β, and a
dispersion parameter σ² (although there might be several). For example, in simple linear
regression, we would write:

y_i = β_0 + β_1 x_i + e_i ;   i = 1, ..., n

So here, the response y_i (or more precisely, its conditional distribution) on the ith
experimental unit depends on a covariate x_i and two location parameters β = [β_0  β_1]'. If the
error terms e_i are (normally) identically and independently distributed, i.e. e_i ~ IID(0, σ²),
as we often require for tractable statistical inference, then the conditional
distribution of y_i can be written as:


y_i | x_i, β_0, β_1, σ²  ~  ID(β_0 + β_1 x_i, σ²)   for all i

with the sole I in ID standing for independently, but certainly not identically, distributed,
because different means for each response on each experimental unit are possible because of
different covariate values x_i; i.e.

μ_{y|x=x_i} = β_0 + β_1 x_i

In a one-way analysis of variance, our model would resemble the following:

y_ij = μ_i + e_ij

or more generally (where μ_i = μ + τ_i) as

y_ij = μ + τ_i + e_ij

For example, suppose we have 3 dietary treatments with 5 animals randomly
assigned to each of the 3 treatments. Then we have a total of 15 observations, each
modeled as a function of the overall mean (μ), one of the three treatment effects (τ_1, τ_2 or
τ_3) and an error term (e_ij) specific to the observation in question. Here the location
parameters are written as follows:

β = [μ  τ_1  τ_2  τ_3]'

Again, if the error terms are such that e_ij ~ IID(0, σ²) for all i and j, then the
conditional distribution of y_ij can be written as:

y_ij | μ, τ_i, σ²  ~  ID(μ + τ_i, σ²)

STT 814 will be a modeling intensive course. This is because I believe that
writing the correct linear model is a much more important first step than getting bogged
down with experimental design names or labels. Of course, we will refer to the more
common design labels (e.g. randomized block, Latin square) often throughout the course
but the labels don't indicate the appropriate statistical inference procedures in quite the
same way a linear model would. Furthermore, an emphasis on modeling allows a
perfectly general approach to statistical inference, particularly when some data are
missing. An even greater emphasis on modeling is considered in ANS 870 Techniques
for the Analysis of Unbalanced Research Data and in CSS 921 Contemporary Statistical
Models in Biology.

The assumption of additive treatment effects implied in a linear model is perhaps
a little naive for many biological phenomena; biology is far too complex to just be
described by linear or additive combinations of anything. But a linear model is just a
model. It is by no means a perfect representation of truth. We just want it to provide a
good representation or approximation of the truth, to capture the most salient features of
the data. And linear models, for the most part, often are able to do this. Nevertheless,
there will be times when a non-linear model is appropriate and essential, particularly in
the modeling of growth curves and/or discrete responses. If you are particularly
interested in the latter, you might consider enrolling in EPI 826; also consider the
material presented in Chapters 13 and 14 in Kutner et al. (2005).

Simple Linear Regression
(Chapters 1 & 2 in Kutner et al., 2005)

Perhaps one of the simplest models we consider in statistical inference for
continuous data is the case where one quantitative variable y is predicted from
knowledge of another quantitative variable x. One might distinguish between functional
(or deterministic) and statistical relations, the latter of which include a stochastic error
term e.

y = β_0 + β_1 x + e ,   e ~ (0, σ²)

y: dependent variable, response variable
x: independent variable, explanatory variable.

Why regression in biology?
classical dose-response phenomenon...i.e. increasing levels of something that is
quantitative and examining its impact on something else that is quantitative.

Common inference problems:
1) inference on β_1
2) prediction of responses (or the mean thereof).

Now one-way classification models can be formulated in terms of a regression model.
For example let's say we have two treatments (Treatment 1 and Treatment 2). We could
use the regression model in the following way:

y = β_0 + β_1 x + e ,   e ~ N(0, σ²)

If Treatment = 1, then let x = 0
If Treatment = 2, then let x = 1

For example, suppose we have two observations. If individual 1 was assigned to
Treatment 1 and individual 2 was assigned to Treatment 2, then

E(y_1) = β_0 + β_1(0) = β_0 = μ_1

and

E(y_2) = β_0 + β_1(1) = β_0 + β_1 = μ_2

i.e. the difference in treatment effects would simply be the regression slope β_1.

The x variables in this case are referred to as dummy variables. We'll leave this
topic for now, but I wanted to plant the seeds of dummy variable modeling in your mind
now (but refer also to the discussion on page 623 of Ott and Longnecker, 2001 and pp. 314 ff.
in Kutner et al., 2005; we'll discuss this in much more detail later).
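
As a small illustrative sketch of this coding idea (the dataset mydata and the variables y
and trt are hypothetical, not from the text):

   * Create a 0/1 dummy variable and fit the regression (illustrative names);
   data two_trt;
      set mydata;            /* assumed to contain y and trt coded 1 or 2 */
      x = (trt = 2);         /* x = 0 for Treatment 1, x = 1 for Treatment 2 */
   run;

   proc reg data=two_trt;
      model y = x;           /* the slope estimate is the Treatment 2 minus Treatment 1 difference */
   run; quit;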

You should always plot y versus x to confirm a linear relationship. Note that β_0 is
the y-intercept and β_1 is the change in y per unit change in x. Again, no cause and
effect pattern is necessarily implied by the model unless randomization was implemented.
Note also that β_0 is perhaps not too important in itself; it just provides a point of reference.
That is, a regression may not be anywhere near linear outside of the range of the data.
And perhaps β_0 wouldn't have much biological relevance anyhow. Let's consider the
following example taken from Problem 1.27 of Kutner et al. (2005).

A person's muscle mass is expected to decrease with age. To explore this relationship in
women, a nutritionist randomly selected four women from each 10-year age group, beginning
with age 40 and ending with age 79. (Based on this randomization strategy, the experimental
design could be considered to be a completely randomized design, even though we'll be using
simple linear regression to analyze the data!) The results follow; X is age, and Y is a measure of
muscle mass.

Woman Muscle mass (Y) Age (X)
1 82 71
2 91 64
3 100 43
4 68 67
5 87 56
6 73 73
7 78 68
8 80 56
9 65 76
10 84 65
11 116 45
12 76 58
13 97 45
14 100 53
15 105 49
16 77 78

Before you do anything else, provide a scatterplot to confirm the relationship that
you wish to model. We'll use SAS to do this. In fact, we'll be using SAS for all
graphics and statistical analysis throughout the course. We'll get into SAS more in the
lab sections of this course.

The output using SAS PROC GPLOT (discussed later in lab) would look
something like the following:
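
A sketch of the corresponding SAS statements is given below; the dataset and variable
names (muscle, mass, age) are illustrative choices, not required names.

   * Read the muscle mass data and plot mass against age;
   data muscle;
      input mass age @@;
      datalines;
   82 71  91 64  100 43  68 67  87 56  73 73  78 68  80 56
   65 76  84 65  116 45  76 58  97 45  100 53  105 49  77 78
   ;
   run;

   proc gplot data=muscle;
      plot mass*age;
   run; quit;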



It certainly appears that a linear relationship exists between muscle mass and age.

Here we have 16 observations. We can indicate this in our regression model as
follows:

y_i = β_0 + β_1 x_i + e_i = μ_{y|x=x_i} + e_i ;   i = 1, 2, ..., n

where in this case, n = 16.

How do we estimate β_0 and β_1? Least squares estimation -> it's the most
fundamental method for estimating location parameters in linear regression and
classification models (i.e. experimental designs). One crucial assumption of least squares
estimation is that

e_i ~ IID(0, σ²)   for all i = 1, 2, ..., n

that is, the residual errors are independently and identically distributed. Note that I do not
mention normality. You don't need normality to provide least squares estimates of the
location parameters β_0 and β_1 (least squares estimation also leads to something called
the minimum variance unbiased estimators (MVUE) of β_0 and β_1); HOWEVER, if you
wish to do any hypothesis testing, you must make some distributional assumptions,
specifically IID for the most tractable statistical analyses. Least squares estimation is
akin to finding the values of β_0 and β_1 that minimize the sum of squares of the residuals
e_i = y_i − μ_{y|x=x_i} over all i; that is, minimizing

f = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − μ_{y|x=x_i})² = Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i)²

Again, finding the minimum of any function entails a little bit of calculus. That is,
determine

∂f/∂β_0   and   ∂f/∂β_1

equate both to zero, and solve for β_0 and β_1. The resulting least squares estimate of β_1 is

β̂_1 = [ Σ_{i=1}^{n} x_i y_i − (Σ_{i=1}^{n} x_i)(Σ_{i=1}^{n} y_i)/n ] / [ Σ_{i=1}^{n} x_i² − (Σ_{i=1}^{n} x_i)²/n ]

Divide both numerator and denominator by n−1 and you should recognize that

β̂_1 = cov(x, y) / var(x)

which further means that

β̂_1 = S_xy / S_xx = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²

Finally,

β̂_0 = ȳ − β̂_1 x̄

where β̂_0 = est(β_0) is the estimated intercept.

Let's go back to the example:

β̂_1 = S_xy / S_xx = −2012.31 / 1965.94 = cov(x, y) / var(x) = −134.15 / 131.06 = −1.024

β̂_0 = ȳ − β̂_1 x̄ = 86.1875 − (−1.024)(60.4375) = 148.05
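
As an aside, the covariance/variance route to the slope can be checked directly in SAS;
this is just an illustrative sketch using the muscle dataset created earlier.

   * The COV option prints the covariance matrix: cov(age,mass) and var(age);
   proc corr data=muscle cov;
      var age mass;
   run;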

Therefore our prediction equation for muscle mass based on the age of women (between 40
and 79 years) is:

ŷ = μ̂_{y|x} = β̂_0 + β̂_1 x = 148.05 − 1.024x

For example, let's suppose we wanted to predict the muscle mass for a woman of
age 55. This, of course, is synonymous with the least squares estimate of the conditional
mean μ_{y|x=55} (not bothering with the additional but implied notation):

ŷ_{x=55} = μ̂_{y|x=55} = β̂_0 + β̂_1(55) = 148.05 − 1.024(55) = 91.75 units
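
All of these estimates (and the standard errors, tests and confidence limits discussed
below) can also be obtained in one step; a minimal sketch with the illustrative muscle dataset:

   * Simple linear regression of muscle mass on age;
   proc reg data=muscle;
      model mass = age / clb;    /* clb adds 95% confidence limits for b0 and b1 */
   run; quit;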

Estimation of variability
As indicated by a plot of the regression line above, there would appear to be
plenty of variability about the regression line. Of course, we already explicitly accounted
for such a phenomenon in our model. Any hypothesis testing involving this regression
relationship would require some assessment of this variability and the normal
distribution assumption. When we say that the e_i's are IID(0, σ²), then

independence of the e_i's implies ->  cov(e_i, e_i′) = 0, or similarly, corr(e_i, e_i′) = 0, for all i ≠ i′!

Again this is a very important assumption in statistical inference for fundamental
linear models! We could write this in matrix form!

var( [e_1  e_2  e_3  ...  e_n]' )  =
   [ σ²   0    0   ...   0  ]
   [ 0    σ²   0   ...   0  ]
   [ 0    0    σ²  ...   0  ]
   [ ...               ...  ]
   [ 0    0    0   ...   σ² ]   =  I σ²

where I is a square identity matrix with n rows and n columns; i.e.

   I =
   [ 1  0  0  ...  0 ]
   [ 0  1  0  ...  0 ]
   [ 0  0  1  ...  0 ]
   [ ...          ...]
   [ 0  0  0  ...  1 ]

An identity matrix has all 1's down the diagonal (indicating that each error term is
correlated with itself by 1) and 0's everywhere else (indicating that the error terms are not
correlated with each other...i.e. independent)
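
If SAS/IML is available, the structure of this matrix can be illustrated directly; a tiny
sketch (the dimension 4 and the value 69.62 are purely illustrative):

   proc iml;
      n = 4;
      V = 69.62 # I(n);      /* var(e) = sigma-squared times an identity matrix */
      print V;
   quit;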
As you may recall, we often used the term mean squared error or MSE to assess
variability; the MSE is derived by first computing the error sums of squares (SSE):

SSE = Σ_{i=1}^{n} (y_i − μ̂_{y|x=x_i})²

i.e. taking the squared deviations of the observations from the regression line. It can be
shown that least squares estimation minimizes the sum of these squared deviations.

The term mean is perhaps a little deceptive because we really don't divide the sum
of squares by the total number of squared deviations; we divide by the error degrees of
freedom in order to get an unbiased estimate of σ². For ANOVA and regression, the error
df is determined by the number of location parameters you need to estimate in the model.
In the case of simple linear regression, you need to estimate 2 parameters, so your error
degrees of freedom are n−2 (by the way, if you're forcing the regression line to go through
the point x=0, y=0, or origin, then the intercept wouldn't need to be estimated and the error
degrees of freedom would be n−1, but that is beyond the scope of this course). Therefore
an unbiased estimate of the residual variance σ² (recall that we specify the error terms to
have a specific variance σ²) for the example is

MSE = SSE / (n−2) = 974.66 / 14 = 69.62

Recall from STT 464 or your similar prerequisite course what is meant by unbiasedness.
For variance component estimation, it means that if you conceptually repeated the
experiment many times, the long-run average of all of the variance estimates (MSE) would
approach the true variance (σ²).

Partitioning the total sums of squares

In analysis of variance and regression, invariably all statistical inference starts
with partitioning the total sums of squares (corrected for the mean) into two components:
1) Model sums of squares
2) Residual or Error sums of squares.

Throughout this course, we will often split apart the model sums of squares into
its various components. In the example:

TSS = Σ_{i=1}^{n} (y_i − ȳ)² = Σ_{i=1}^{n} y_i² − (Σ_{i=1}^{n} y_i)²/n = 121887 − (1379)²/16 = 3034.44
The deviation of each observation from the overall mean can be written in terms of its
two components:

Deviation of record from overall mean
= deviation of record from prediction + deviation of prediction from overall mean

(y_i − ȳ) = (y_i − μ̂_{y|x_i}) + (μ̂_{y|x_i} − ȳ)

This concept is illustrated below for one of the subjects in the dataset (the 14th subject, of
age 53):

Therefore, over all i,

Σ_{i=1}^{n} (y_i − ȳ) = Σ_{i=1}^{n} (y_i − μ̂_{y|x_i}) + Σ_{i=1}^{n} (μ̂_{y|x_i} − ȳ)

It can be shown for linear regression models that:

Σ_{i=1}^{n} (y_i − ȳ)² = Σ_{i=1}^{n} (y_i − μ̂_{y|x_i})² + Σ_{i=1}^{n} (μ̂_{y|x_i} − ȳ)²

i.e. TSS = SSE + SSR

Some easier ways of computing SSR "by hand" include the following:

SSR = S_xy² / S_xx = β̂_1² S_xx = β̂_1 S_xy

which is rather easy to compute once you have computed the regression equation. For the
example above,

SSR = (−1.0236)² (1965.9375) = 2059.78

such that

MSR = SSR / 1 = 2059.78

Note

SSE = TSS − SSR = 3034.44 − 2059.78 = 974.66

Under the null hypothesis H_0: β_1 = 0,

MSR / MSE ~ F_{1, n−2}

or, equivalently,

√(MSR / MSE) ~ t_{n−2}  (in absolute value)


Let's write up the Analysis of Regression (Variance) table:

Source df SS MS F
Regression 1 2059.78 2059.78 29.587**
Error 14 974.66 69.62
Total 15 3034.44
**P<.0001

Now the mean squares have the following expectations under IID assumptions on the
error terms:

1) E(MSR) = σ² + β_1² S_xx

Translation: If you repeated the experiment many times, the average value of the
MSR's from each experiment would be equal to σ² + β_1² S_xx.

2) E(MSE) = σ²

Translation: If you repeated the experiment many times, the average value of the MSE's
from each experiment would be equal to σ².

Therefore if β_1 = 0, then E(MSR) = E(MSE) and F = MSR/MSE should be a
random draw from an F-distribution with 1, n−2 degrees of freedom. If β_1 ≠ 0, then β_1² > 0
such that E(MSR) > E(MSE), and the probability of getting a value larger than the computed F
may be very small compared to that expected under an F_{1,n−2} distribution; that is,
a small P-value (e.g. Prob(F > 29.587) < 0.00001) would cast doubt upon H_0: β_1 = 0.

Why doesn't SSR include a test for β_0? Because all analysis of variance tables
correct for the mean. And it happens that if there were no regression relationship (i.e. β_1 = 0),
then β_0 would not only be the y-intercept, it would also be the overall mean of y.
Therefore, testing for β_0 is not generally considered in an analysis of variance table. This
is perhaps rather inconsequential as β_0 often has no intrinsic biological meaning...it just
provides a point of reference for the regression equation. It provides a prediction of the
response for the case of X = 0, which is meaningless for our particular example (unless
the covariates are standardized to have a mean of 0).

We have already seen that

√(MSR / MSE) = √(β̂_1² S_xx / MSE) ~ t_dfe

can be used to test H_0: β_1 = 0. This test statistic can be further rewritten as follows:

t = (normal random variable − its mean under H_0) / (estimated standard error)
  = (β̂_1 − 0) / √(MSE / S_xx) ~ t_{error df}

Therefore √(MSE / S_xx) is the standard error of β̂_1! Note that the standard error depends on x.
Consider the following, as further elaborated in Kutner et al. (2005) on pg. 42. The
estimate β̂_1 is seen to be a linear function of all of the observations y_i; i = 1, 2, ..., n; i.e. it
can be shown that

β̂_1 = Σ_{i=1}^{n} k_i y_i

where

k_i = (x_i − x̄) / Σ_{i=1}^{n} (x_i − x̄)²

is considered to be a constant term.
Therefore,

var(β̂_1) = var( Σ_{i=1}^{n} k_i y_i ) = Σ_{i=1}^{n} k_i² σ² = σ² Σ_{i=1}^{n} k_i²

if the observations are conditionally independent (implied in e_i ~ NIID(0, σ²)). Since σ² is
not known, var(β̂_1) is estimated as MSE Σ_{i=1}^{n} k_i².

Now it can be further shown that

Σ_{i=1}^{n} k_i² = 1 / Σ_{i=1}^{n} (x_i − x̄)²

Therefore, the estimate of var(β̂_1) is

MSE Σ_{i=1}^{n} k_i² = MSE / Σ_{i=1}^{n} (x_i − x̄)² = MSE / S_xx


For the example,

se(β̂_1) = √(MSE / S_xx) = √(69.62 / 1965.94) = 0.188

That is, the t-test statistic can be written as:

t = est / se = β̂_1 / se(β̂_1) = −1.0236 / 0.188 = −5.44

Compare to t_{14, α/2=.025} for two-tailed tests.

There are at least three advantages of doing a t-test over a single numerator degree of
freedom F-test:
1) Non-zero values of β_1 can be used in the null hypothesis specification.
2) One-tailed tests on β_1 are possible.
3) Confidence intervals can be provided.

The 95% confidence interval for β_1 can be computed as follows:

β̂_1 ± t_{α/2, n−2 = .025, 14} se(β̂_1)
= −1.0236 ± 2.145 (0.188)
= [−1.427, −0.620]

i.e. we are 95% confident that the true mean decrease of muscle mass with age between
40 and 79 years in women is between -1.427 and -0.620 units/year.

Another look at the interval estimator of β_1:
We previously saw that

se(β̂_1) = √(MSE / S_xx) = √( MSE / ((n−1) s_x²) )



How can we increase our precision on β_1, i.e. lower se(β̂_1) and sharpen the CI?

1) Minimize residual variation, i.e. minimize σ²_e, to allow a smaller MSE.
2) Increase n.
3) Increase the variance of x.

The third point suggests that we put half the values of x on each end of the range
of x. However, if the relationship is curvilinear, then it might be wise to include points in
the middle as well. If your goal is to get a good estimate of β_1, it makes sense to sample
the x values such that the variance is large. (See also pg. 546 in Ott and Longnecker,
2001).

The prediction equation is really only good for the range of values of x considered.
It is dangerous to extrapolate to other values of x not in the data. For example, we
certainly wouldn't expect muscle mass to decrease for women from age 10 to age 20!

Interval estimators for the conditional mean and prediction intervals for individual
observations.

y = μ_{y|x} + e

We generally wish to have inference on both y (an individual observation) and μ_{y|x} (the
conditional mean).

(1) Mean response:

Let: μ̂_{y|x} = est(μ_{y|x}) = β̂_0 + β̂_1 x

The 95% confidence interval for the conditional mean μ_{y|x} can be written as:

μ̂_{y|x} ± t_{α/2, dfe} se(μ̂_{y|x})

where

se(μ̂_{y|x}) = √( MSE ( 1/n + (x − x̄)² / S_xx ) )


(2) Prediction of an individual observation:

For a certain value of x, what is the 95% prediction interval for a single observation y?

y_x = μ_{y|x} + e = β_0 + β_1 x + e ;   ŷ_x = Pred(y_x) = Ê(y_x) = μ̂_{y|x}

The 95% prediction interval for y_x is:

ŷ_x ± t_{α/2, dfe} se(ŷ_x)

where

se(ŷ_x) = √( MSE ( 1 + 1/n + (x − x̄)² / S_xx ) )


Situations (1) and (2) are different in terms of another aspect. Situation (1) pertains to the
estimation of a parameter while situation (2) pertains to the prediction of an observation
on a single experimental unit. The relevance of each is nicely described in Section 11.4
of Ott and Longnecker (2001). Note the slight difference between the standard error of a
conditional mean and the standard error of a prediction. The farther x is away from the
overall mean of x (x̄), the greater the standard error and hence the wider the 95%
confidence intervals in both cases!

For a woman of age 71, we find that:

μ̂_{y|x=71} = 148.05 − 1.024(71) = 75.38

is both the conditional mean of y (that is, conditional on x = 71) and the predicted value
for y; i.e. ŷ_{x=71} = μ̂_{y|x=71}. The standard error of the conditional mean is:

se(μ̂_{y|x=71}) = √( 69.62 ( 1/16 + (71 − 60.4375)² / 1965.94 ) ) = 2.881

The 95% confidence interval for μ_{y|x} when x = 71 is then:

[ 75.38 − 2.145(2.881), 75.38 + 2.145(2.881) ] = [69.20, 81.56].

Now, the standard error of prediction on y given x = 71 is:

se(ŷ_{x=71}) = √( 69.62 ( 1 + 1/16 + (71 − 60.4375)² / 1965.94 ) ) = 8.83

such that the 95% prediction interval of y when x = 71 is

[ 75.38 − 2.145(8.83), 75.38 + 2.145(8.83) ] = [56.44, 94.31].

Therefore, the 95% confidence interval for the true mean muscle mass at an age of 71 years is
[69.20, 81.56] while the 95% prediction interval for any one woman of that age is [56.44, 94.31].
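
One convenient way to get both intervals from SAS is to append a record with a missing
response at the x value of interest; a sketch (again with the illustrative muscle dataset):

   * Add age = 71 with a missing mass, then request fitted values and limits;
   data muscle71;
      set muscle end=eof;
      output;
      if eof then do; age = 71; mass = .; output; end;
   run;

   proc reg data=muscle71;
      model mass = age / p clm cli;   /* clm: mean CIs, cli: prediction intervals */
   run; quit;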

It is important to note that in both cases, the standard errors and the confidence
intervals depend on x, and are smallest when x = x̄. This means that unlike the least
squares regression line, the LCL and UCL lines will be curvilinear in some way,
widening as x deviates further and further away from x̄. This can be noted from the plot
of the confidence (and prediction) interval bands below (we'll see how to create these in
SAS lab later):
[Scatterplot of mass versus age with the fitted line (mass = 148.05 − 1.0236 age) and the
95% confidence and prediction bands (L95M/U95M and L95/U95); N = 16, Rsq = 0.6788,
Adj Rsq = 0.6559, RMSE = 8.3438.]
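
One possible way (among several) to draw such bands in SAS/GRAPH is sketched below;
the lab may well use a different approach.

   * Overlay the fitted line and 95% confidence band for the mean on the scatterplot;
   symbol1 value=dot interpol=rlclm95;   /* use interpol=rlcli95 for prediction limits */
   proc gplot data=muscle;
      plot mass*age;
   run; quit;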



An alternative approach to hypothesis testing on location parameters:
The General Linear Test (Section 2.8 in Kutner et al., 2005)

Full models versus Reduced models

Let's say we have two classes of location parameters:

(1) those parameters we "absolutely" need (β*) and
(2) those parameters of interest we wish to test whether or not they are important (β**).

i.e. the vector of all parameters could be written as follows:

β = [ β*'  β**' ]'

Suppose you wanted to test whether β** was necessary or statistically important. In a
general way, the hypothesis test might resemble the following:

H_0: some parameters of interest β** = 0   versus   H_1: β** ≠ 0

Specific example of a lack of fit test:
As a specific example, suppose we wanted to know whether a regression
relationship between y and x actually existed; that is,

H_0: β_1 = 0   versus   H_1: β_1 ≠ 0

One of the most natural ways to do this is to fit two models

Model 1: Full Model

Include all parameter(s) of interest.
General case: y = f(β*) + f(β**) + e
Specific example: y = β_0 + β_1 x + e

For our specific linear regression example:

ŷ = μ̂_{y|x} = β̂_0 + β̂_1 x = 148.05 − 1.024x

Model 2: Reduced Model

Don't include the parameter(s) of interest.
General case: y = f(β*) + e
Specific example: y = β_0 + e -> i.e. a simple one-population model!!
Exactly the same as y = μ + e.

For the specific example, our best estimate of β_0 is ȳ = 86.19. Therefore our prediction
equation would be:

ŷ = μ̂_{y|x} = β̂_0 = 86.19

for all observations, regardless of the value of x!

For both cases, let's compute the SSE.

Model 1 (general case and specific example):

SSE(M_1) = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − β̂_0 − β̂_1 x_i)²

We already found this to be SSE(M_1) = 974.66 for the specific example.

Model 2 (general case and specific example):

SSE(M_2) = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − β̂_0)² = Σ_{i=1}^{n} (y_i − ȳ)² = S_yy

SSE(M_2) = S_yy = 3034.44 for the specific example.

NOTE: estimates of parameters depend on the model under which they were estimated!
i.e. for the specific example, β̂_0 is the estimate of the intercept of a straight-line
regression in Model 1; in Model 2, it is simply the overall mean!

Test statistic:

Now compare the two error sums of squares, SSE(M_1) and SSE(M_2). Note that the
error sums of squares for the full model will always be less than or equal to the error
sums of squares for the reduced model:

SSE(M_1) ≤ SSE(M_2)

If the two error sums of squares are not much different, then β_1 is not important
(i.e. in the specific example, for Model 2 we assume a linear regression model with slope
= 0 and intercept = ȳ). Conversely, a large difference suggests that H_1 is likely. The F test
statistic is:

F = [ (SSE(M_2) − SSE(M_1)) / (errordf_M2 − errordf_M1) ] / [ SSE(M_1) / errordf_M1 (= MSE(M_1)) ]
  = [ (3034.44 − 974.66) / (15 − 14) ] / [ 974.66 / 14 ]
  = 2059.78 / 69.62
  = 29.587

Under H_0, F is distributed as an F random variable with (errordf_M2 − errordf_M1) numerator
degrees of freedom and errordf_M1 denominator degrees of freedom. Since F = 29.587 >
F_{α=.05, 1, 14} = 4.60 (P < .0001), we conclude that β_1 is significantly different from 0. Of
course, do note that this F-test statistic is not different from that used to test H_0: β_1 = 0 in
the analysis of variance or regression table.

This concept of the general test, also known as the lack of fit test, will be
increasingly important later in both multiple regression and the analysis of experimental
designs.
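
In SAS, the full-versus-reduced comparison for this example can be requested directly; a
small sketch with the illustrative muscle dataset:

   * General linear test of H0: beta1 = 0 via the TEST statement;
   proc reg data=muscle;
      model mass = age;
      test age = 0;        /* reproduces F = 29.59 with 1 and 14 df */
   run; quit;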

Other points on simple linear regression (see also pp. 77-78 in Kutner et al., 2005):

1) What if x is not predetermined or fixed but is actually a random variable? This is no
problem. We are really interested in the distribution of y given x (i.e. y|x), specifically
μ_{y|x} and σ²_{y|x}, so the fact that x is fixed at predetermined levels or is allowed
to vary randomly is of no inferential consequence. However, big problems may be
apparent when x is measured with error...considerable bias in estimated regression
coefficients (slopes) can then result without correcting for this. The so-called
measurement-error-in-covariates problem is beyond the scope of this course. There is no
estimation problem if y is measured with random measurement error, since that just gets
added on as residual variability; nevertheless, too much measurement error can have a
substantial impact on precision and power.

2) Descriptive measures, e.g. the Coefficient of Determination:

0 ≤ R² = SSR / TSS ≤ 1

Most people use R² to indicate "how much variation is accounted for by the
model". Of course, this shouldn't imply, as the book emphasizes, that the model is a
random variable!

R² values tend to be overused and can be a little deceiving. In fact, R²'s depend on
experimental conditions and design (e.g. spacings of x) as well as on the model. In
highly controlled laboratory settings, some researchers might get nervous about their
technique if the R² value was less than 90% (must refine technique!!) whereas R² values >
50% might be considered to be remarkably good for regression analysis involving some
on-farm studies. Also see the very good discussion on pg. 75 in Kutner et al. (2005).

3) Power of tests could also be considered in a linear regression model. Under the null
hypothesis H_0,

β̂_1 ~ N(0, σ²/S_xx)

while under H_1:

β̂_1 ~ N(β_1*, σ²/S_xx)

where β_1* is the "true" size of the regression coefficient value that we wish to be able to
detect. It can then be shown that the power of the test with a Type I error rate of α = .05 can
be written as:

Power = Prob( Z > z_{α/2} − β_1* / √(σ²/S_xx) )

presuming that the variance σ² is treated as known (which is not really a big issue).
Otherwise, the power computation would be t-based (much like we did in Section 2 of
STT 464); i.e.

Power = Prob( t_{df=n−2} > t_{α/2, n−2} − β_1* / √(σ²/S_xx) )

Design 1: For example, suppose we set up a trial such that 5 of the x values are to be set at
x = 4 and 5 of the x values are to be set at x = 10. Then S_xx = 90. The variability about the
regression line is thought to be σ² = 10. We wish to have enough power to detect a true
β_1 of β_1* = 1. The power is then:

Power = Prob( Z > 1.96 − 1/√(10/90) ) = Prob( Z > −1.04 ) = 85%

Design 2: How about if we set up the trial such that the x values are equally spaced (i.e. 2
at x = 4, 2 at x = 5.5, 2 at x = 7.0, 2 at x = 8.5, and 2 at x = 10)? Then S_xx = 45. The power of
the test would then be:

Power = Prob( Z > 1.96 − 1/√(10/45) ) = Prob( Z > −0.16 ) = 56%
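
These two calculations can be checked quickly in a SAS data step; a sketch:

   * Power for Designs 1 and 2 (Sxx = 90 and 45), sigma-squared = 10, true beta1 = 1;
   data power;
      sigma2 = 10; beta1 = 1; z = probit(0.975);          /* z = 1.96 */
      do sxx = 90, 45;
         power = 1 - probnorm(z - beta1 / sqrt(sigma2 / sxx));
         output;
      end;
   run;

   proc print data=power; run;    /* prints approximately 0.85 and 0.56 */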

Again, you might take a little beating on power by spreading your points
throughout as in Design 2, but if your goals included having good predictions across the
range of x, Design 2 would be much more desirable. Furthermore, if the regression
relationship was not linear, you would be much better able to assess this with the latter
design; it would be impossible with the former.

You might be wondering: if high values of S_xx are key to improving power on β_1, why
not just spread out the extreme values of x even further? In the example, for
instance, this might mean going from x = 2 to say, x = 16. The caveat in all this is that by
spreading them out, you might arrive at a situation where linear relationships no longer
apply throughout the entire range. This certainly would have been true in our muscle
mass with age example.
