
This article was downloaded by: [University of Alberta]


On: 7 January 2009
Access details: Access Details: [subscription number 713587337]
Publisher Informa Healthcare
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House,
37-41 Mortimer Street, London W1T 3JH, UK
Encyclopedia of Biopharmaceutical Statistics
Publication details, including instructions for authors and subscription information:
http://www.informaworld.com/smpp/title~content=t713172960
Analysis of Variance
Maria Overbeck-Larisch (a); Werner Sanns (a)
(a) University of Applied Sciences, Darmstadt, Germany
Online Publication Date: 23 April 2003
To cite this Section: Overbeck-Larisch, Maria and Sanns, Werner (2003) 'Analysis of Variance', Encyclopedia of Biopharmaceutical Statistics, 1:1, 42–54
Full terms and conditions of use: http://www.informaworld.com/terms-and-conditions-of-access.pdf
This article may be used for research, teaching and private study purposes. Any substantial or
systematic reproduction, re-distribution, re-selling, loan or sub-licensing, systematic supply or
distribution in any form to anyone is expressly forbidden.
The publisher does not give any warranty express or implied or make any representation that the contents
will be complete or accurate or up to date. The accuracy of any instructions, formulae and drug doses
should be independently verified with primary sources. The publisher shall not be liable for any loss,
actions, claims, proceedings, demand or costs or damages whatsoever or howsoever caused arising directly
or indirectly in connection with or arising out of the use of this material.
Analysis of Variance
Maria Overbeck-Larisch
Werner Sanns
University of Applied Sciences, Darmstadt, Germany
INTRODUCTION
The analysis of variance (ANOVA) is a statistical method that was originally developed by R.A. Fisher (1890–1962) for problems in biological studies.[2] Since that time, ANOVA has been used in many different areas, and there exists a comprehensive literature concerning its theory and practice.[4,5,7–10] The classical idea of ANOVA was to find out whether one or more categorical variables (factor variables) influence a normally distributed random variable. This classical task of ANOVA can be regarded as a qualitative analysis. The modern approach to ANOVA, however, also includes a quantitative analysis via the general linear model (GLM); that is, it allows one to measure the strength of the influence of each factor variable.
Practicing ANOVA requires an appropriate software system. SAS® is one of these systems; it is a well-known tool in biopharmaceutical statistics. Therefore we include in this contribution the basic commands of the SAS procedures ANOVA and GLM (Ref. [6]).
MATHEMATICAL PROBLEM
Assume k random variables to be independent and normally distributed with the same variance $\sigma^2$ but with possibly different mean values (expected values) $\mu_1, \mu_2, \ldots, \mu_k$. The objective of the ANOVA is to test the hypothesis $H_0$ of all mean values being equal against the alternative $H_1$, which states that at least two mean values differ. Thus the ANOVA is a statistical test procedure which, for $k = 2$, corresponds to the situation of the two-sample t-test. If the hypothesis holds, the k random variables have identical normal distributions. Otherwise, at least two of them have different mean values.
EXAMPLE 1 OF TOTAL PROTEIN IN SERUM
In toxicological studies, many parameters such as the
number of red and white blood corpuscles (erythrocytes
and leucocytes) or the Na concentration in urine play an
important role, and it is of great interest to know the sta-
tistical distribution of these parameters. Some of them
such as the parameter TPRO (Total PROtein in serum)
can be assumed to be normally distributed. The following
table contains the values of the parameter TPRO, where
the measurements come from three control groups (not
treated with the test compound) consisting of dogs, mice,
and rats.
For further computations, we save the data set as a
SAS file named TPRO with the three variables Species,
Sex, and Value. In submitting the following SAS com-
mand lines, the file is created and printed on the screen
in the output window:
DOI: 10.1081/E-EBS 120007385
Copyright © 2003 by Marcel Dekker, Inc. All rights reserved.
By means of a one-way ANOVA, we shall show that
the values of the parameter TPRO differ significantly
from species to species. One-way ANOVA means that the
influence of only one factor, for instance the factor spe-
cies, is taken into account.
But if one also takes into account the fact that the first
six animals of each species are female whereas the last
six animals are male, one may wonder if the factor Sex
significantly influences the value of the parameter TPRO,
too. By means of a two-way ANOVA, we will answer
this question.
MODEL ASSUMPTIONS FOR THE
ONE-WAY ANALYSIS OF VARIANCE
In the general situation of a one-way ANOVA, there are n observations in k different samples (one for each of the k levels of a categorical factor variable) of possibly different sample sizes $n_1, n_2, \ldots, n_k$; sample i consists of the observations $y_{i1}, y_{i2}, \ldots, y_{in_i}$.
The observations $y_{ij}$, $1 \le i \le k$ and $1 \le j \le n_i$, come from random variables:
$$Y_{ij} = \mu_i + E_{ij} = \mu + \alpha_i + E_{ij}$$
where the $E_{ij}$ are independent $N(0, \sigma^2)$-distributed random variables, $\mu := \frac{1}{k}\sum_{i=1}^{k} \mu_i$ denotes the overall mean, and the parameters $\alpha_i := \mu_i - \mu$ denote the deviations of the particular mean values from $\mu$. Because of this definition, the parameters $\alpha_i$ satisfy the condition:
$$\sum_{i=1}^{k} \alpha_i = 0$$
which later on can be used for recoding the parameters within the framework of the linear model. In terms of the parameters $\alpha_i$, we have to test the hypothesis $H_0$ of all $\alpha_i$ being zero against the alternative $H_1$ that at least one $\alpha_i \ne 0$.
NOTATIONS
Some of the following notations are well known from
other statistical methods:
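$$n = \sum_{i=1}^{k} n_i$$
$$\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij} \quad\text{and}\quad s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \quad\text{for } 1 \le i \le k$$
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij} = \frac{1}{n}\sum_{i=1}^{k} n_i \bar{y}_i \quad\text{and}\quad s^2 = \frac{1}{n - 1}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$$
$$\mathrm{sst} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 = (n - 1)s^2$$
$$\mathrm{ss1} = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2$$
$$\mathrm{ss2} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 = \sum_{i=1}^{k} (n_i - 1)s_i^2$$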
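The name analysis of variance comes from the equation:
$$\mathrm{sst} = \mathrm{ss1} + \mathrm{ss2}$$
which is valid as the following computation shows:
$$\begin{aligned}
\mathrm{sst} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \bigl((y_{ij} - \bar{y}_i) + (\bar{y}_i - \bar{y})\bigr)^2 \\
&= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + 2\sum_{i=1}^{k} (\bar{y}_i - \bar{y}) \underbrace{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)}_{=\,0} + \sum_{i=1}^{k}\sum_{j=1}^{n_i} (\bar{y}_i - \bar{y})^2 \\
&= \mathrm{ss2} + \mathrm{ss1}
\end{aligned}$$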
Interpretation: The total sum of squares (sst), i.e., the sum of all squared differences between the particular observations and the total mean, can be split into two sums, ss1 and ss2. The first one results from the differences between the samples, and the second one results from the differences within the samples. So the above equation can be written in the form:
$$\mathrm{sst} = \mathrm{ss1} + \mathrm{ss2} = \mathrm{ss}_{\text{between}} + \mathrm{ss}_{\text{within}}$$
Layout of the k samples in the one-way ANOVA:
Sample 1: $y_{11}, y_{12}, \ldots, y_{1n_1}$
Sample 2: $y_{21}, y_{22}, \ldots, y_{2n_2}$
. . .
Sample k: $y_{k1}, y_{k2}, \ldots, y_{kn_k}$
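The decomposition sst = ss1 + ss2 can be checked numerically. The following Python sketch is only an illustration: the function name `sums_of_squares` and the data in `toy_samples` are hypothetical and are not the TPRO measurements.

```python
from statistics import mean

def sums_of_squares(samples):
    """Return (sst, ss1, ss2) for a list of samples (lists of numbers)."""
    all_values = [y for sample in samples for y in sample]
    grand_mean = mean(all_values)
    group_means = [mean(sample) for sample in samples]
    # Total sum of squares: deviations of every observation from the grand mean
    sst = sum((y - grand_mean) ** 2 for y in all_values)
    # Between-sample part: deviations of the sample means, weighted by n_i
    ss1 = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, group_means))
    # Within-sample part: deviations of the observations from their sample mean
    ss2 = sum((y - m) ** 2 for s, m in zip(samples, group_means) for y in s)
    return sst, ss1, ss2

# Hypothetical toy data with unequal sample sizes (assumption, not from the article)
toy_samples = [[5.1, 5.4, 5.6], [5.8, 6.0, 5.7, 6.1], [6.4, 6.8]]
sst, ss1, ss2 = sums_of_squares(toy_samples)
print(f"sst = {sst:.6f}, ss1 + ss2 = {ss1 + ss2:.6f}")
```

Up to floating-point rounding, the two printed numbers agree for any choice of samples, because the cross term in the derivation vanishes.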
It seems reasonable to reject the hypothesis that the k normal distributions are identical if the quotient $\mathrm{ss}_{\text{between}}/\mathrm{ss}_{\text{within}}$ is large. The question of how large the quotient has to be in order to reject the hypothesis is answered by the following theorem.
THEOREM
If the model assumptions are fulfilled and if the hypothesis of equal means is true, we can state the following:
$$\frac{\mathrm{SST}}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2 \sim \chi^2_{n-1}$$
$$\frac{\mathrm{SS1}}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^{k} n_i (\bar{Y}_i - \bar{Y})^2 \sim \chi^2_{k-1}$$
$$\frac{\mathrm{SS2}}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2 \sim \chi^2_{n-k}$$
$$V := \frac{\mathrm{SS1}/(k - 1)}{\mathrm{SS2}/(n - k)} \sim F_{k-1;\,n-k}$$
Under $H_0$, the random variable V tends to assume small values. Thus $H_0$ can be rejected if the realized value v of the test variable V exceeds a critical value $v^*$. For a given significance level $\alpha$, the critical value $v^*$ is determined in such a way that:
$$P(V > v^* \mid H_0) = \alpha$$
Let F be the cumulative distribution function of the $F_{k-1,\,n-k}$ distribution. Then the critical value $v^*$ is the solution of the equation:
$$1 - F(v^*) = \alpha$$
or
$$v^* = F^{-1}(1 - \alpha)$$
respectively. This means that $v^*$ is the $(1 - \alpha)$ percentile of the $F_{k-1,\,n-k}$ distribution.
Therefore we have the following procedure for a one-way ANOVA:
1st step: Choose an appropriate significance level $\alpha$.
2nd step: Compute the $(1 - \alpha)$ percentile $v^*$ of the $F_{k-1,\,n-k}$ distribution.
3rd step: Compute $\bar{y}_1, \ldots, \bar{y}_k$, $\bar{y}$, $s_1^2, \ldots, s_k^2$, ss1, ss2, and $v = [\mathrm{ss1}/(k - 1)]/[\mathrm{ss2}/(n - k)]$.
4th step: Reject $H_0$ if $v > v^*$.
REMARK
It is obvious that the above theorem and the resulting procedure for a one-way ANOVA strongly depend on the assumption of normally distributed random variables $Y_{ij}$. If this condition is not satisfied, the test variable V is not appropriate and a distribution-free test like the Kruskal–Wallis test should be chosen.
EXAMPLE 2 OF TOTAL PROTEIN IN SERUM
As the TPRO value can be assumed to be normally distributed, we follow the given procedure for a one-way ANOVA ($k = 3$, $n = 36$):
1st step: If the aim is to show that the values of the parameter TPRO differ significantly from species to species, we choose a small significance level. We choose the standard level of 5%.
2nd step: We have to compute the 95% percentile of the $F_{2,33}$ distribution. This can be performed, for instance, by the following SAS program:
We get $v^* = 3.28492$.
3rd step: Submitting the SAS commands:
the following results can be achieved:
$$\bar{y} = 5.9527778 \qquad \bar{y}_1 = 5.4666667 \qquad \bar{y}_2 = 5.7416667 \qquad \bar{y}_3 = 6.6500000$$
$$s = 0.5744494 \qquad s_1 = 0.2570226 \qquad s_2 = 0.3287949 \qquad s_3 = 0.1977142$$
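So we have:
$$\mathrm{ss1} = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2 = 12\sum_{i=1}^{3} (\bar{y}_i - \bar{y})^2 = 9.20388889$$
$$\mathrm{ss2} = \sum_{i=1}^{k} (n_i - 1)s_i^2 = 11\sum_{i=1}^{3} s_i^2 = 2.34583333$$
$$v = \frac{\mathrm{ss1}/(k - 1)}{\mathrm{ss2}/(n - k)} = \frac{9.20388889/2}{2.34583333/33} = 64.738$$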
4th step: As $v > v^*$, the hypothesis $H_0$ can be rejected at the 5% significance level. The means of the three species differ significantly.
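The four steps of this example can be retraced from the summary statistics alone. The Python sketch below is an illustration, not the authors' SAS program. It uses only the means and standard deviations quoted in the text, and it relies on one shortcut that holds only because the numerator has 2 degrees of freedom: in that case the F distribution function has the closed form $F(x) = 1 - (1 + 2x/d_2)^{-d_2/2}$ with $d_2 = n - k$, so the percentile can be written down directly. For other values of k a statistics library would be needed.

```python
# Summary statistics quoted in the text (k = 3 species, 12 animals each)
group_means = [5.4666667, 5.7416667, 6.6500000]
group_sds = [0.2570226, 0.3287949, 0.1977142]
n_i = 12
k = len(group_means)
n = k * n_i
alpha = 0.05  # 1st step: significance level

# 2nd step: 95% percentile of F(2, 33).
# Closed form, valid only because the numerator df k - 1 equals 2.
d2 = n - k
v_crit = (d2 / 2) * (alpha ** (-2 / d2) - 1)

# 3rd step: sums of squares and realized test statistic
grand_mean = sum(group_means) / k  # valid because all n_i are equal
ss1 = n_i * sum((m - grand_mean) ** 2 for m in group_means)
ss2 = (n_i - 1) * sum(s ** 2 for s in group_sds)
v = (ss1 / (k - 1)) / (ss2 / (n - k))

# 4th step: decision
print(f"ss1 = {ss1:.4f}, ss2 = {ss2:.4f}, v = {v:.3f}, v* = {v_crit:.5f}")
print("reject H0" if v > v_crit else "do not reject H0")
```

Because the inputs are rounded to seven decimals, the results should agree with the values in the text only up to rounding.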
These and many more computational results can be obtained by running the SAS procedure ANOVA. The SAS program:
yields, among other results, the following ANOVA table (we added our symbols in parentheses):
There are two notable differences between our hand calculation and the SAS output. The first one concerns the notation: The sum $\mathrm{ss1} = \mathrm{ss}_{\text{between}}$ describes that part of the total sum of squares that corresponds to the model, whereas the sum $\mathrm{ss2} = \mathrm{ss}_{\text{within}}$ is the remaining error sum of squares. For a better understanding of the SAS output, denote ss1 by $\mathrm{ss}_{\text{model}}$ and ss2 by $\mathrm{ss}_{\text{error}}$ and write the equation of a one-way ANOVA in the form:
$$\mathrm{sst} = \mathrm{ss}_{\text{model}} + \mathrm{ss}_{\text{error}}$$
The second difference is more essential: SAS does not compute the 95% percentile $v^*$ of the $F_{2,33}$ distribution, but in the column with the title Pr > F, it reports the probability $1 - F(v)$ or an upper bound. Here this upper bound is 0.0001. Because of the equivalence:
$$v > v^* \iff F(v) > F(v^*) \iff F(v) > 1 - \alpha \iff 1 - F(v) < \alpha$$
the hypothesis can be rejected as long as the value $1 - F(v)$ is less than the chosen significance level $\alpha$. As already mentioned, SAS produces some more computational results that are not printed above. As a very important number, for instance, we get:
$$R^2 = \mathrm{ss1}/\mathrm{sst} = \mathrm{ss}_{\text{between}}/\mathrm{ss}_{\text{total}} = \mathrm{ss}_{\text{model}}/\mathrm{ss}_{\text{total}}$$
which tells us how much of the total sum of squares can be explained by the model. In the example of TPRO, $R^2 = 0.796893$ means that about 80% of the variation in the data set can be explained by the fact that the animals belong to different species.
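The equivalence between the two rejection rules and the value of $R^2$ can be illustrated with the sums of squares from the text. This Python sketch is an illustration under one assumption: the upper tail probability is computed with the closed form $1 - F(v) = (1 + 2v/d_2)^{-d_2/2}$, which holds only when the numerator has 2 degrees of freedom, as in this example.

```python
ss_model = 9.20388889  # ss1 from the text
ss_error = 2.34583333  # ss2 from the text
k, n, alpha = 3, 36, 0.05
d2 = n - k

v = (ss_model / (k - 1)) / (ss_error / d2)

# Upper tail probability 1 - F(v), the quantity reported under "Pr > F".
# Closed form valid only for numerator df = 2.
p_value = (1 + 2 * v / d2) ** (-d2 / 2)

# Critical value v* from the same closed form
v_crit = (d2 / 2) * (alpha ** (-2 / d2) - 1)

# Both rules lead to the same decision
same_decision = (v > v_crit) == (p_value < alpha)

# Share of the total sum of squares explained by the model
r_squared = ss_model / (ss_model + ss_error)

print(f"p-value below 0.0001: {p_value < 1e-4}")
print(f"same decision: {same_decision}, R^2 = {r_squared:.6f}")
```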
MODEL ASSUMPTIONS FOR THE
TWO-WAY ANALYSIS OF VARIANCE
Within the framework of a two-way or higher-way ANOVA, we simultaneously explore whether two or more categorical variables (factor variables) influence a normally distributed random variable. For the sake of simplicity, let us restrict ourselves to a situation with only two factor variables. Regarding our TPRO data set, we want to find out simultaneously whether the Species or the Sex has an effect on the value of the parameter TPRO. The two factor variables divide the observations into r rows and p columns with $r \cdot p$ cells (samples). Assume, furthermore, that there are m observations in each cell (see matrix below). So for the number n of observations, we have $n = rpm$. (In the case of $m = 1$, we have the situation of a simple block experiment.)
Let us assume that the observations $y_{ijl}$, $1 \le i \le r$, $1 \le j \le p$, and $1 \le l \le m$, come from random variables:
$$Y_{ijl} = \mu + \alpha_i + \beta_j + E_{ijl}$$
with independent $N(0, \sigma^2)$-distributed random variables $E_{ijl}$. $\alpha_i$ denotes the effect of the ith row and $\beta_j$ denotes the effect of the jth column. As in the model of a one-way ANOVA, we assume two reparameterization conditions to be valid:
$$\sum_{i=1}^{r} \alpha_i = 0 \quad\text{and}\quad \sum_{j=1}^{p} \beta_j = 0$$
So we have to test the hypothesis $H_0: \alpha_1 = \cdots = \alpha_r = \beta_1 = \cdots = \beta_p = 0$ against the alternative $H_1$ that at least one of these $r + p$ parameters does not have a value of zero.
NOTATIONS
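$$\bar{y} = \frac{1}{rpm}\sum_{i=1}^{r}\sum_{j=1}^{p}\sum_{l=1}^{m} y_{ijl} \qquad \mathrm{sst} = \sum_{i=1}^{r}\sum_{j=1}^{p}\sum_{l=1}^{m} (y_{ijl} - \bar{y})^2$$
$$\bar{y}_{i\cdot} = \frac{1}{pm}\sum_{j=1}^{p}\sum_{l=1}^{m} y_{ijl} \quad\text{for } 1 \le i \le r \qquad \mathrm{ss1} = \sum_{i=1}^{r} pm\,(\bar{y}_{i\cdot} - \bar{y})^2$$
$$\bar{y}_{\cdot j} = \frac{1}{rm}\sum_{i=1}^{r}\sum_{l=1}^{m} y_{ijl} \quad\text{for } 1 \le j \le p \qquad \mathrm{ss2} = \sum_{j=1}^{p} rm\,(\bar{y}_{\cdot j} - \bar{y})^2$$
$$\mathrm{ss3} = \sum_{i=1}^{r}\sum_{j=1}^{p}\sum_{l=1}^{m} (y_{ijl} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y})^2$$
As in the case of a one-way ANOVA, the basic idea is to split the total sum of squares:
$$\mathrm{sst} = \mathrm{ss1} + \mathrm{ss2} + \mathrm{ss3}$$
Here ss1 results from the differences between the rows, ss2 results from the differences between the columns, and ss3 is the remaining error sum of squares. If the model assumptions are fulfilled and if the hypothesis is true, we can state: The random variables $\mathrm{SS1}/\sigma^2$, $\mathrm{SS2}/\sigma^2$, and $\mathrm{SS3}/\sigma^2$ are $\chi^2$-distributed. The number of degrees of freedom of $\mathrm{SS1}/\sigma^2$ and $\mathrm{SS2}/\sigma^2$ is $r - 1$ and $p - 1$, respectively. The number of degrees of freedom of $\mathrm{SS3}/\sigma^2$ is $n - 1 - (r - 1) - (p - 1) = n + 1 - r - p$ with $n = rpm$. Furthermore, we have:
$$V_1 := \frac{\mathrm{SS1}/(r - 1)}{\mathrm{SS3}/(n + 1 - r - p)} \sim F_{(r-1);\,(n+1-r-p)}$$
$$V_2 := \frac{\mathrm{SS2}/(p - 1)}{\mathrm{SS3}/(n + 1 - r - p)} \sim F_{(p-1);\,(n+1-r-p)}$$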
The hypothesis $H_0$ can be rejected if the realization $v_1$ of the test variable $V_1$ is greater than the $(1 - \alpha)$ percentile $v_1^*$ of the F-ratio distribution with $(r - 1), (n + 1 - r - p)$ degrees of freedom, or if the realized value $v_2$ of the test variable $V_2$ is greater than the $(1 - \alpha)$ percentile $v_2^*$ of the F-ratio distribution with $(p - 1), (n + 1 - r - p)$ degrees of freedom. This rejection rule is equivalent to the demands:
$$P(V_1 > v_1 \mid H_0) = 1 - F_1(v_1) < \alpha \quad\text{or}\quad P(V_2 > v_2 \mid H_0) = 1 - F_2(v_2) < \alpha$$
Here the functions $F_1$ and $F_2$ are the cumulative distribution functions of the F-ratio distributions with $(r - 1), (n + 1 - r - p)$ and $(p - 1), (n + 1 - r - p)$ degrees of freedom, respectively.
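The two-way decomposition and the two test statistics can be sketched in a few lines of Python. This is an illustration for a balanced layout only; the function name `two_way_anova` and the data in `toy_cells` are hypothetical and are not the TPRO measurements.

```python
from statistics import mean

def two_way_anova(cells):
    """cells[i][j] holds the m observations of row i and column j (balanced layout)."""
    r, p, m = len(cells), len(cells[0]), len(cells[0][0])
    n = r * p * m
    grand = mean(y for row in cells for cell in row for y in cell)
    row_means = [mean(y for cell in row for y in cell) for row in cells]
    col_means = [mean(y for row in cells for y in row[j]) for j in range(p)]
    sst = sum((y - grand) ** 2 for row in cells for cell in row for y in cell)
    ss1 = p * m * sum((a - grand) ** 2 for a in row_means)  # between rows
    ss2 = r * m * sum((b - grand) ** 2 for b in col_means)  # between columns
    ss3 = sum((y - row_means[i] - col_means[j] + grand) ** 2  # remaining error
              for i in range(r) for j in range(p) for y in cells[i][j])
    df3 = n + 1 - r - p
    v1 = (ss1 / (r - 1)) / (ss3 / df3)
    v2 = (ss2 / (p - 1)) / (ss3 / df3)
    return {"sst": sst, "ss1": ss1, "ss2": ss2, "ss3": ss3, "v1": v1, "v2": v2}

# Hypothetical toy data: r = 2 rows, p = 2 columns, m = 2 observations per cell
toy_cells = [[[5.2, 5.4], [5.5, 5.7]],
             [[6.4, 6.6], [6.5, 6.9]]]
result = two_way_anova(toy_cells)
print({name: round(value, 4) for name, value in result.items()})
```

The realized values `v1` and `v2` would then be compared with the percentiles of the corresponding F-ratio distributions, which are not computed here.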
EXAMPLE 3 OF TOTAL PROTEIN IN SERUM
Submitting the SAS program:
one gets the following ANOVA table:
As $1 - F_1(v_1) = 1 - F_1(65.58) < 0.0001 < \alpha = 5\%$, the hypothesis of no differences between the Species can be rejected. But as $1 - F_2(v_2) = 1 - F_2(1.43) = 0.2407$ is much greater than the significance level of $\alpha = 5\%$, the hypothesis of no differences between the sexes cannot be rejected.
Note that the $R^2$ is nearly the same as in the model with the one factor variable, Species.
If the number m of observations in the single cells is greater than one, the model of a two-way ANOVA allows one to include interactions. The corresponding model equation is:
$$Y_{ijl} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + E_{ijl}$$
where the following equations are assumed to be true:
$$\sum_{i=1}^{r} \alpha_i = 0; \qquad \sum_{j=1}^{p} \beta_j = 0$$
$$\sum_{j=1}^{p} (\alpha\beta)_{ij} = 0 \text{ for all } 1 \le i \le r; \qquad \sum_{i=1}^{r} (\alpha\beta)_{ij} = 0 \text{ for all } 1 \le j \le p$$
Without discussing this model in detail, we submit the following SAS program:
The SAS output contains an additional line for the interactions between the categorical variables Species and Sex. As the computed probability $1 - F(v_3) = 0.2776$ is much greater than the assumed significance level of 5%, we conclude that there are no significant interactions.
LINEAR MODEL OF ANALYSIS OF VARIANCE
To begin with, assume the situation of a one-way ANOVA with $k = 3$ samples (such as in the example of TPRO, where the samples correspond to different species), where sample i consists of the observations $y_{i1}, y_{i2}, \ldots, y_{in_i}$.
The n model equations are:
$$Y_{ij} = \mu_i + E_{ij} = \mu + \alpha_i + E_{ij} \quad\text{for } i = 1, 2, 3 \text{ and } 1 \le j \le n_i$$
Apply the following transformation to the observed values $y_{ij}$ and to the corresponding random variables $Y_{ij}$ and $E_{ij}$, stacking them sample by sample into column vectors with a single running index:
$$y = (y_{11}, y_{12}, \ldots, y_{1n_1}, y_{21}, y_{22}, \ldots, y_{2n_2}, y_{31}, y_{32}, \ldots, y_{3n_3})^T = (y_1, y_2, \ldots, y_{n_1}, y_{n_1+1}, y_{n_1+2}, \ldots, y_{n_1+n_2}, y_{n_1+n_2+1}, y_{n_1+n_2+2}, \ldots, y_n)^T$$
$$Y = (Y_1, Y_2, \ldots, Y_{n_1}, Y_{n_1+1}, Y_{n_1+2}, \ldots, Y_{n_1+n_2}, Y_{n_1+n_2+1}, Y_{n_1+n_2+2}, \ldots, Y_n)^T$$
$$E = (E_1, E_2, \ldots, E_{n_1}, E_{n_1+1}, E_{n_1+2}, \ldots, E_{n_1+n_2}, E_{n_1+n_2+1}, E_{n_1+n_2+2}, \ldots, E_n)^T$$
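Now the n model equations can be rewritten in the form:
$$Y_i = \mu x_{i0} + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \alpha_3 x_{i3} + E_i$$
where
$$x_{i0} = 1 \quad\text{for all } 1 \le i \le n$$
$$x_{i1} = \begin{cases} 1 & \text{if } 1 \le i \le n_1 \\ 0 & \text{else} \end{cases}$$
$$x_{i2} = \begin{cases} 1 & \text{if } n_1 + 1 \le i \le n_1 + n_2 \\ 0 & \text{else} \end{cases}$$
$$x_{i3} = \begin{cases} 1 & \text{if } n_1 + n_2 + 1 \le i \le n \\ 0 & \text{else} \end{cases}$$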
Then we have:
$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_{n_1} \\ Y_{n_1+1} \\ \vdots \\ Y_{n_1+n_2} \\ Y_{n_1+n_2+1} \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix} + E = X\beta + E; \qquad \beta = \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix}$$
The three samples of this one-way layout:
Sample 1: $y_{11}, y_{12}, \ldots, y_{1n_1}$
Sample 2: $y_{21}, y_{22}, \ldots, y_{2n_2}$
Sample 3: $y_{31}, y_{32}, \ldots, y_{3n_3}$
The equation:
$$Y = X\beta + E; \qquad X = (x_{ij})_{1 \le i \le n;\; 0 \le j \le 3}$$
is similar to a linear intercept regression model with $k = 3$ factor variables (called regressors). But there is the following important difference concerning the elements of the design matrix X: When X is the design matrix of a linear intercept regression model, for $j \ge 1$, the element $x_{ij}$ is the ith value of the jth regressor. In the present linear model of a one-way ANOVA, the element $x_{ij}$ of the design matrix X is an element of the set {0,1}. For $j \ge 1$, the value $x_{ij}$ only denotes which sample the ith observation belongs to.
For generalizing the one-way model, assume that the number of samples is an arbitrary number $k \ge 2$. In the general case, the design matrix X has n rows and $k + 1$ columns. All elements of the first column have a value of 1. In the second column, we have $n_1$ leading elements with a value of 1, followed by $n - n_1$ zeros. In the third column, we have $n_1$ leading zeros, followed by $n_2$ elements with a value of 1 and a further $n - n_1 - n_2$ zeros. The last column begins with $n - n_k$ zeros, followed by $n_k$ ones. So we have:
$$X = \begin{pmatrix} 1_{n_1} & 1_{n_1} & 0_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ 1_{n_2} & 0_{n_2} & 1_{n_2} & 0_{n_2} & \cdots & 0_{n_2} \\ 1_{n_3} & 0_{n_3} & 0_{n_3} & 1_{n_3} & \cdots & 0_{n_3} \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ 1_{n_k} & 0_{n_k} & 0_{n_k} & 0_{n_k} & \cdots & 1_{n_k} \end{pmatrix}; \qquad \beta = \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_k \end{pmatrix}$$
Here $1_{n_j}$ denotes the $n_j$-dimensional column vector with all its components having a value of 1 and $0_{n_j}$ is the $n_j$-dimensional zero vector.
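The block structure of the design matrix can be made concrete with a short Python sketch. The function name `one_way_design_matrix` is a hypothetical helper introduced for illustration; the sample sizes 12, 12, 12 are those of the TPRO example.

```python
def one_way_design_matrix(sizes):
    """Return the n x (k + 1) design matrix for samples of the given sizes.

    Column 0 is the intercept column; column j (j >= 1) indicates
    membership in sample j.
    """
    k = len(sizes)
    rows = []
    for group, n_j in enumerate(sizes):
        indicator = [1 if j == group else 0 for j in range(k)]
        rows.extend([[1] + indicator for _ in range(n_j)])
    return rows

X = one_way_design_matrix([12, 12, 12])
print(len(X), "rows,", len(X[0]), "columns")
print(X[0], X[12], X[24])  # first row of each of the three blocks

# In every row the intercept entry equals the sum of the indicator entries
print(all(row[0] == sum(row[1:]) for row in X))
```

The last check shows that the first column is the sum of the remaining columns, a property that matters as soon as the normal equations have to be solved.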
EXAMPLE 4 OF TOTAL PROTEIN IN SERUM
Neglecting the influence of the factor variable Sex, we have a one-way ANOVA with the linear model equation:
$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_{12} \\ Y_{13} \\ \vdots \\ Y_{24} \\ Y_{25} \\ \vdots \\ Y_{36} \end{pmatrix} = \begin{pmatrix} 1_{12} & 1_{12} & 0_{12} & 0_{12} \\ 1_{12} & 0_{12} & 1_{12} & 0_{12} \\ 1_{12} & 0_{12} & 0_{12} & 1_{12} \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix} + E = X\beta + E; \qquad \beta = \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix}$$
The equations for the linear model of a two-way ANOVA with $r \cdot p$ samples are:
$$Y_i = \mu x_{i0} + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \cdots + \alpha_r x_{ir} + \beta_1 x_{i(r+1)} + \cdots + \beta_p x_{i(r+p)} + E_i$$
Here for all $1 \le i \le n$, the coefficient $x_{i0}$ of the parameter $\mu$ has a value of 1. For $1 \le j \le r$, the coefficient $x_{ij}$ of the parameter $\alpha_j$ has a value of 1 if the observation with number i belongs to the jth category of the first factor variable; otherwise $x_{ij}$ has a value of 0. For $1 \le j \le p$, the coefficient $x_{i(r+j)}$ of the parameter $\beta_j$ has a value of 1 if the observation with number i belongs to the jth category of the second factor variable. Otherwise, $x_{i(r+j)}$ has a value of 0.
EXAMPLE 5 OF TOTAL PROTEIN IN SERUM
Taking into account the fact that the first six observations for each species come from female animals whereas the last six observations come from male animals, we write the model in the following way ($r = 3$, $p = 2$):
$$Y_i = \mu x_{i0} + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \alpha_3 x_{i3} + \beta_1 x_{i4} + \beta_2 x_{i5} + E_i$$
Here the numbers $x_{i0}$, $x_{i1}$, $x_{i2}$, and $x_{i3}$ are defined as above in the linear model of the one-way ANOVA, whereas the additional numbers $x_{i4}$ and $x_{i5}$ characterize the sex of the animals. Formally, they are defined in the following way:
$$x_{i4} = \begin{cases} 1 & \text{if the ith observation comes from a female animal} \\ 0 & \text{else} \end{cases}$$
$$x_{i5} = \begin{cases} 1 & \text{if the ith observation comes from a male animal} \\ 0 & \text{else} \end{cases}$$
So we get the linear model for the two-way ANOVA:
$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_{12} \\ Y_{13} \\ \vdots \\ Y_{24} \\ Y_{25} \\ \vdots \\ Y_{36} \end{pmatrix} = \begin{pmatrix} 1_6 & 1_6 & 0_6 & 0_6 & 1_6 & 0_6 \\ 1_6 & 1_6 & 0_6 & 0_6 & 0_6 & 1_6 \\ 1_6 & 0_6 & 1_6 & 0_6 & 1_6 & 0_6 \\ 1_6 & 0_6 & 1_6 & 0_6 & 0_6 & 1_6 \\ 1_6 & 0_6 & 0_6 & 1_6 & 1_6 & 0_6 \\ 1_6 & 0_6 & 0_6 & 1_6 & 0_6 & 1_6 \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \beta_1 \\ \beta_2 \end{pmatrix} + E = X\beta + E; \qquad \beta = \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \beta_1 \\ \beta_2 \end{pmatrix}$$
The aim of the classical ANOVA is to examine whether all sample effects are 0. But as soon as one has formulated the linear model $Y = X\beta + E$ of the ANOVA, one can do more than this qualitative analysis and ask the same questions as in regression analysis. Finding estimators for the parameters and testing hypotheses on them is a quantitative analysis.
In regression analysis, the parameter vector $\beta$ of the linear model $Y = X\beta + E$ is estimated by the least squares method (LSM). This means that the least squares estimator (LSE) b has to minimize the function:
$$f(b) := (y - Xb)^T (y - Xb)$$
Simple computations show that the function f becomes minimal if and only if b is a solution of the so-called normal equations:
$$(X^T X)\,b = X^T y$$
If the matrix X has full rank, the matrix $X^T X$ is regular (invertible) and the unique solution of the system of normal equations is:
$$b = (X^T X)^{-1} X^T y$$
It is important to emphasize that the LSE $b = (X^T X)^{-1} X^T y$ is an unbiased estimator of the unknown parameter vector $\beta$ of the linear model $Y = X\beta + E$ for every distribution of the random vector E with expectation zero. Thus contrary to the classical approach to ANOVA, where we assume from the very beginning that E follows a normal distribution law, this assumption is not necessary for estimating the parameter vector $\beta$ of the linear model by means of the LSM.
But there arises another problem: The design matrix X of the linear model for the one-way ANOVA does not have full rank. If the factor variable divides the observations into k samples, the matrix X has $1 + k$ columns. The first column with elements $x_{i0} = 1$, $1 \le i \le n$, is the sum of the following k columns. The problem is intensified if X is the design matrix of a linear model for a two-way ANOVA. If the two factor variables divide the observations into $r \cdot p$ samples, the matrix X has $1 + r + p$ columns. The first column of X with elements $x_{i0} = 1$, $1 \le i \le n$, is the sum of the following r columns and it is the sum of the last p columns as well. As a consequence, the matrix $X^T X$ is not regular and does not have an inverse.
EXAMPLE 6 OF TOTAL PROTEIN IN SERUM
Neglecting the influence of the factor variable Sex, the design matrix X of the linear model of this one-way ANOVA satisfies the equation:
$$X^T X = \begin{pmatrix} 36 & 12 & 12 & 12 \\ 12 & 12 & 0 & 0 \\ 12 & 0 & 12 & 0 \\ 12 & 0 & 0 & 12 \end{pmatrix}$$
This matrix is not regular: The first column is the sum of the last three columns. If the factor Sex, too, is taken into account, the design matrix X of the linear model of this two-way ANOVA satisfies the equation:
$$X^T X = \begin{pmatrix} 36 & 12 & 12 & 12 & 18 & 18 \\ 12 & 12 & 0 & 0 & 6 & 6 \\ 12 & 0 & 12 & 0 & 6 & 6 \\ 12 & 0 & 0 & 12 & 6 & 6 \\ 18 & 6 & 6 & 6 & 18 & 0 \\ 18 & 6 & 6 & 6 & 0 & 18 \end{pmatrix}$$
This matrix is not regular: The first column is the sum of the following three columns and it is the sum of the last two columns as well.
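The matrix $X^T X$ of the two-way model and its rank deficiency can be reproduced with exact arithmetic. The following Python sketch is an illustration; the helper names `two_way_design_matrix`, `gram`, and `rank` are hypothetical, and the row order (species first, then sex, six animals per cell) follows the description of the TPRO data in the text.

```python
from fractions import Fraction

def two_way_design_matrix(r, p, m):
    """Design matrix with 1 + r + p columns, m observations per cell."""
    rows = []
    for i in range(r):
        for j in range(p):
            row = [1] + [int(a == i) for a in range(r)] + [int(b == j) for b in range(p)]
            rows.extend([row[:] for _ in range(m)])
    return rows

def gram(X):
    """Return X^T X as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def rank(A):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    M = [[Fraction(x) for x in row] for row in A]
    rk = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(rk, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        M[rk], M[pivot] = M[pivot], M[rk]
        for i in range(rk + 1, len(M)):
            factor = M[i][col] / M[rk][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[rk])]
        rk += 1
    return rk

XtX = gram(two_way_design_matrix(3, 2, 6))
for row in XtX:
    print(row)
print("rank:", rank(XtX), "of", len(XtX))
```

The computed rank is smaller than the number of columns, which is the numerical counterpart of the two linear dependencies described above.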
There are two methods for getting LSEs in this situation. The SAS procedure GLM uses a generalized inverse, which is defined in the following way: The (m,n) matrix $A^-$ is a generalized inverse of the (n,m) matrix A if the equation:
$$A A^- A = A$$
is satisfied. If A is a regular square matrix, it has exactly one generalized inverse and this is the inverse matrix $A^{-1}$; otherwise the generalized inverse of the matrix A is not unique. If the equation system $Ax = b$, where A is an (n,m) matrix and b is an n-dimensional vector, is solvable, the set of all solutions is:
$$\{\, x \mid x = A^- b + (A^- A - I_m)\,z;\; z \in \mathbb{R}^m \,\}$$
So one special solution of the linear equation system $Ax = b$ is the vector $x = A^- b$.
Applying this statement to the system $(X^T X)\,b = X^T y$ of normal equations, we have the following: If G is a generalized inverse of the matrix $X^T X$, the vector $b = G X^T y$ is only a special solution, which, moreover, is not an unbiased estimator for the unknown parameter vector $\beta$.
EXAMPLE 7 OF TOTAL PROTEIN IN SERUM
The appropriate SAS procedure is GLM. In submitting
the commands:
SAS generates the same output as when using the SAS procedure ANOVA for the one-way case. But extending the model command line by the option solution makes SAS compute, in addition, a special solution b of the system of normal equations by means of a special generalized inverse G of the matrix $X^T X$. Extending the model command line once more by the two options xpx and i, the matrix $X^T X$ and the generalized inverse G are printed in the output window, too. Thus when submitting for the one-way ANOVA the SAS commands:
SAS computes:
$$X^T X = \begin{pmatrix} 36 & 12 & 12 & 12 \\ 12 & 12 & 0 & 0 \\ 12 & 0 & 12 & 0 \\ 12 & 0 & 0 & 12 \end{pmatrix}$$
$$G = \begin{pmatrix} 0.0833333333 & -0.0833333333 & -0.0833333333 & 0 \\ -0.0833333333 & 0.1666666667 & 0.0833333333 & 0 \\ -0.0833333333 & 0.0833333333 & 0.1666666667 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
$$b = \begin{pmatrix} \hat{\mu} \\ \hat{\alpha}_1 \\ \hat{\alpha}_2 \\ \hat{\alpha}_3 \end{pmatrix} = \begin{pmatrix} 6.650000000 \\ -1.183333333 \\ -0.908333333 \\ 0.000000000 \end{pmatrix}$$
The output contains the note that the estimates for the intercept $\mu$ and for the three effects $\alpha_1$, $\alpha_2$, $\alpha_3$ of the three species Dog, Mouse, and Rat are biased and not unique.
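That the printed matrix G satisfies the defining equation of a generalized inverse, and that $b = G X^T y$ gives the printed solution, can be checked with exact fractions. The Python sketch below is an illustration and makes one assumption that is not stated in the article: the entries of $X^T y$ (the total and the three group sums) are recovered as 12 times the reported group means, i.e., 65.6, 68.9, and 79.8.

```python
from fractions import Fraction as Fr

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

XtX = [[36, 12, 12, 12],
       [12, 12, 0, 0],
       [12, 0, 12, 0],
       [12, 0, 0, 12]]

# G as printed in the text, written with exact fractions
G = [[Fr(1, 12), Fr(-1, 12), Fr(-1, 12), 0],
     [Fr(-1, 12), Fr(1, 6), Fr(1, 12), 0],
     [Fr(-1, 12), Fr(1, 12), Fr(1, 6), 0],
     [0, 0, 0, 0]]

# Defining property of a generalized inverse: A G A = A
is_generalized_inverse = matmul(matmul(XtX, G), XtX) == XtX

# Assumption: group sums reconstructed as 12 * (reported group mean)
Xty = [[Fr(2143, 10)], [Fr(656, 10)], [Fr(689, 10)], [Fr(798, 10)]]
b = [row[0] for row in matmul(G, Xty)]

print(is_generalized_inverse)
print([round(float(v), 9) for v in b])
```

The zero in the last position of `b` reflects the zero row and column of G: with this particular generalized inverse, the third species serves as the reference level.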
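The second method for solving the problem of a less-than-full-rank design matrix X requires a recoding of the parameters to be estimated, which implies a reduction of their number and another interpretation. In the case of a one-way ANOVA with k samples, we make use of the fact that for every i, the number $x_{i0}$ equals 1 and the sum $x_{i1} + x_{i2} + \cdots + x_{ik}$ has a value of 1, too. So the model can be rewritten in the following form:
$$\begin{aligned}
Y_i &= \mu x_{i0} + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \cdots + \alpha_{k-1} x_{i(k-1)} + \alpha_k x_{ik} + E_i \\
&= \mu x_{i0} + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \cdots + \alpha_{k-1} x_{i(k-1)} + \alpha_k x_{ik} + E_i + \alpha_k x_{i0} - \alpha_k (x_{i1} + x_{i2} + \cdots + x_{ik}) \\
&= (\mu + \alpha_k) x_{i0} + (\alpha_1 - \alpha_k) x_{i1} + (\alpha_2 - \alpha_k) x_{i2} + \cdots + (\alpha_{k-1} - \alpha_k) x_{i(k-1)} + (\alpha_k - \alpha_k) x_{ik} + E_i \\
&= (\mu + \alpha_k) x_{i0} + (\alpha_1 - \alpha_k) x_{i1} + (\alpha_2 - \alpha_k) x_{i2} + \cdots + (\alpha_{k-1} - \alpha_k) x_{i(k-1)} + E_i
\end{aligned}$$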
The design matrix X of this linear model has k columns instead of $k + 1$ and the parameter vector $\beta$ has k components instead of $k + 1$:
$$X = \begin{pmatrix} 1_{n_1} & 1_{n_1} & 0_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ 1_{n_2} & 0_{n_2} & 1_{n_2} & 0_{n_2} & \cdots & 0_{n_2} \\ 1_{n_3} & 0_{n_3} & 0_{n_3} & 1_{n_3} & \cdots & 0_{n_3} \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ 1_{n_{k-1}} & 0_{n_{k-1}} & 0_{n_{k-1}} & 0_{n_{k-1}} & \cdots & 1_{n_{k-1}} \\ 1_{n_k} & 0_{n_k} & 0_{n_k} & 0_{n_k} & \cdots & 0_{n_k} \end{pmatrix}; \qquad \beta = \begin{pmatrix} \mu + \alpha_k \\ \alpha_1 - \alpha_k \\ \alpha_2 - \alpha_k \\ \vdots \\ \alpha_{k-1} - \alpha_k \end{pmatrix}$$
Obviously, the new design matrix X has full rank k. The first component of the parameter vector $\beta$ is the expectation of the last sample. The further components of the parameter vector denote differential effects; they measure the effect of the jth sample minus the effect of the last sample.
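The interpretation of the recoded parameters can be checked on a small example: solving the normal equations for the recoded design matrix should return the mean of the last sample, followed by the differences of the other sample means from it. The Python sketch below is an illustration; the helper names `reference_design` and `solve` and the data in `toy_samples` are hypothetical and are not the TPRO measurements.

```python
from fractions import Fraction

def reference_design(sizes):
    """n x k design matrix: intercept plus indicators of the first k - 1 samples."""
    k = len(sizes)
    rows = []
    for group, n_j in enumerate(sizes):
        row = [1] + [int(j == group) for j in range(k - 1)]
        rows.extend([row[:] for _ in range(n_j)])
    return rows

def solve(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination (A must be regular)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(A, rhs)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if M[i][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for i in range(n):
            if i != col:
                M[i] = [a - M[i][col] * b for a, b in zip(M[i], M[col])]
    return [row[-1] for row in M]

# Hypothetical toy data (assumption, not from the article)
toy_samples = [[5.1, 5.4, 5.6], [5.8, 6.0, 5.7, 6.1], [6.4, 6.8]]

X = reference_design([len(s) for s in toy_samples])
y = [Fraction(str(v)) for s in toy_samples for v in s]
cols = list(zip(*X))
XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
Xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]

b = solve(XtX, Xty)  # unique solution of the normal equations
print([float(v) for v in b])
```

Exact fractions are used so that the comparison with the sample means is not disturbed by rounding.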
EXAMPLE 8 OF TOTAL PROTEIN IN SERUM
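The new design matrix X and the new parameter vector $\beta$ for the one-way model are:
$$X = \begin{pmatrix} 1_{12} & 1_{12} & 0_{12} \\ 1_{12} & 0_{12} & 1_{12} \\ 1_{12} & 0_{12} & 0_{12} \end{pmatrix}; \qquad \beta = \begin{pmatrix} \mu + \alpha_3 \\ \alpha_1 - \alpha_3 \\ \alpha_2 - \alpha_3 \end{pmatrix}$$
This matrix X has full rank, so the matrix $X^T X$ is regular. The following computations can be performed by hand or by means of a computer algebra system like Mathematica®:
$$X^T X = \begin{pmatrix} 36 & 12 & 12 \\ 12 & 12 & 0 \\ 12 & 0 & 12 \end{pmatrix}$$
$$(X^T X)^{-1} = \begin{pmatrix} \frac{1}{12} & -\frac{1}{12} & -\frac{1}{12} \\ -\frac{1}{12} & \frac{1}{6} & \frac{1}{12} \\ -\frac{1}{12} & \frac{1}{12} & \frac{1}{6} \end{pmatrix}$$
The normal equations are solved uniquely by the estimation vector:
$$b = \begin{pmatrix} 6.6500 \\ -1.18333 \\ -0.908333 \end{pmatrix}$$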
Comparing these results with the SAS output of the procedure GLM, it is obvious that the only meaningful difference is the missing 4th component. The estimates computed by SAS by means of a generalized inverse are unbiased estimators for the new set of parameters: The number 6.6500 estimates the expectation $\mu + \alpha_3$ of the TPRO value in the species Rat. For the species Dog, the expectation of the TPRO value is $\mu + \alpha_1 = (\mu + \alpha_3) + (\alpha_1 - \alpha_3)$. It can be estimated by $6.6500 + (-1.18333) = 5.4667$. For the species Mouse, the expectation of the TPRO value is $\mu + \alpha_2 = (\mu + \alpha_3) + (\alpha_2 - \alpha_3)$. It can be estimated by $6.6500 + (-0.908333) = 5.7417$.
Obviously, these estimators are identical with the averages $\bar{y}_1$, $\bar{y}_2$, and $\bar{y}_3$ of the three samples corresponding to the single species which were computed by the SAS procedure MEANS, and the question arises whether this linear model is worth the effort. Moreover, it has to be admitted that for a one-way ANOVA, it would be possible to avoid a less-than-full-rank design matrix X by choosing a no-intercept model where the design matrix X arises from the original design matrix by deleting the first column. For the no-intercept linear model, the design matrix and the parameter vector are given by:
$$X = \begin{pmatrix} 1_{12} & 0_{12} & 0_{12} \\ 0_{12} & 1_{12} & 0_{12} \\ 0_{12} & 0_{12} & 1_{12} \end{pmatrix}; \qquad \beta = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}$$
The justification for leaving out the possibility of a
no-intercept model is the fact that this method does not
work in the case of a two-way ANOVA, as the sum of
the last p columns again gives a column where all
elements have a value of 1. So we have to go on with the
method of recoding the 1+r +p parameters. For the sake
of simplicity, we demonstrate the following method for
our example.
EXAMPLE 9 OF TOTAL PROTEIN IN SERUM
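We make use of the fact that for every i, the following equations are true:
$$x_{i0} = 1; \qquad x_{i1} + x_{i2} + x_{i3} = 1; \qquad x_{i4} + x_{i5} = 1$$
So the model can be rewritten in the following form:
$$\begin{aligned}
Y_i &= \mu x_{i0} + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \alpha_3 x_{i3} + \beta_1 x_{i4} + \beta_2 x_{i5} + E_i \\
&= \mu x_{i0} + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \alpha_3 x_{i3} + \beta_1 x_{i4} + \beta_2 x_{i5} + E_i + (\alpha_3 + \beta_2) x_{i0} - \alpha_3 (x_{i1} + x_{i2} + x_{i3}) - \beta_2 (x_{i4} + x_{i5}) \\
&= (\mu + \alpha_3 + \beta_2) x_{i0} + (\alpha_1 - \alpha_3) x_{i1} + (\alpha_2 - \alpha_3) x_{i2} + (\beta_1 - \beta_2) x_{i4} + E_i
\end{aligned}$$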
The design matrix X of this linear model has four
columns instead of six and the parameter vector has
four components instead of six:
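$$
X = \begin{pmatrix}
\mathbf{1}_6 & \mathbf{1}_6 & \mathbf{0}_6 & \mathbf{1}_6 \\
\mathbf{1}_6 & \mathbf{1}_6 & \mathbf{0}_6 & \mathbf{0}_6 \\
\mathbf{1}_6 & \mathbf{0}_6 & \mathbf{1}_6 & \mathbf{1}_6 \\
\mathbf{1}_6 & \mathbf{0}_6 & \mathbf{1}_6 & \mathbf{0}_6 \\
\mathbf{1}_6 & \mathbf{0}_6 & \mathbf{0}_6 & \mathbf{1}_6 \\
\mathbf{1}_6 & \mathbf{0}_6 & \mathbf{0}_6 & \mathbf{0}_6
\end{pmatrix}, \qquad
\boldsymbol{\beta} = \begin{pmatrix}
\mu + \alpha_3 + \beta_2 \\
\alpha_1 - \alpha_3 \\
\alpha_2 - \alpha_3 \\
\beta_1 - \beta_2
\end{pmatrix}
$$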
Obviously, the new design matrix X has a full rank of 4. The first component of the parameter vector is the expectation of the last sample (i.e., of male rats). The next two components of the parameter vector again denote differential effects; they measure the effect of the species Dog minus the effect of the species Rat and the effect of the species Mouse minus the effect of the species Rat. The last component of $\boldsymbol{\beta}$ measures the differential effect of Sex: female minus male.
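The rank statements for this example can be checked numerically. The following is a minimal sketch in Python using only the standard library; the helper names (`rank`, `design`, `full_row`) are hypothetical and not part of the original article. It builds the six-column design matrix, the five-column no-intercept variant, and the recoded four-column matrix for six animals per Species-by-Sex cell, and computes each rank by exact Gaussian elimination.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix (list of rows) via exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


N_PER_CELL = 6  # six animals per cell, matching the 1_6 blocks in the text


def design(row_for_cell):
    """Stack N_PER_CELL identical rows for each (species, sex) cell."""
    return [row_for_cell(s, x)
            for s in range(3) for x in range(2) for _ in range(N_PER_CELL)]


def full_row(s, x):
    # Columns: x0 (intercept), x1..x3 (species indicators), x4, x5 (sex indicators)
    return [1] + [int(s == k) for k in range(3)] + [int(x == k) for k in range(2)]


X_full = design(full_row)                                   # 36 x 6
X_no_intercept = [row[1:] for row in X_full]                # 36 x 5
X_recoded = [[row[0], row[1], row[2], row[4]] for row in X_full]  # 36 x 4

rank_full = rank(X_full)
rank_no_intercept = rank(X_no_intercept)
rank_recoded = rank(X_recoded)
print(rank_full, rank_no_intercept, rank_recoded)
```

All three matrices have rank 4, so only the recoded matrix, with exactly four columns, is of full rank.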
Submitting the SAS commands:
yields the estimation:
$$
b = \begin{pmatrix}
6.702777778 \\
-1.183333333 \\
-0.908333333 \\
-0.105555556
\end{pmatrix}
$$
with the note that the estimates are biased and not unique. Indeed, the output does not allow one to compute unbiased estimators for the original parameters $\mu$, $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, and $\beta_2$. But with the correct interpretation of the output, the expectation of the random variable TPRO for each of the six samples defined by Species and Sex can be estimated without bias:
Dog, female      6.702777778 - 1.183333333 - 0.105555556 ≈ 5.41389
Dog, male        6.702777778 - 1.183333333               ≈ 5.51944
Mouse, female    6.702777778 - 0.908333333 - 0.105555556 ≈ 5.68889
Mouse, male      6.702777778 - 0.908333333               ≈ 5.79444
Rat, female      6.702777778 - 0.105555556               ≈ 6.59722
Rat, male        6.702777778                             ≈ 6.70278
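The six estimates are the products of the distinct rows of the recoded design matrix with the estimate vector b. A minimal Python sketch of this arithmetic follows; the dictionary and function names are hypothetical, and the numbers are those quoted in the text.

```python
# Estimate vector b as quoted in the text:
# (intercept, Dog minus Rat, Mouse minus Rat, female minus male)
b = [6.702777778, -1.183333333, -0.908333333, -0.105555556]

# Indicator values (x1, x2) per species and x4 per sex in the recoded model
SPECIES_CODES = {"Dog": (1, 0), "Mouse": (0, 1), "Rat": (0, 0)}
SEX_CODES = {"female": 1, "male": 0}


def cell_mean(species, sex):
    """Estimated expectation of TPRO for one Species-by-Sex sample."""
    x1, x2 = SPECIES_CODES[species]
    row = (1, x1, x2, SEX_CODES[sex])
    return sum(x * coef for x, coef in zip(row, b))


cell_means = {(sp, sx): cell_mean(sp, sx)
              for sp in SPECIES_CODES for sx in SEX_CODES}

for (sp, sx), value in cell_means.items():
    print(f"{sp}, {sx}: {value:.5f}")
```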
At the end of this contribution, we make some remarks
concerning additional possibilities of the linear model.
The further discussion of these possibilities is beyond the
scope of this article.
REMARK 1
Neither the method of generalized inverses nor the described method of recoding the parameters allows one to compute unbiased estimators for the parameters $\mu, \alpha_1, \alpha_2, \ldots, \alpha_k$ or $\mu, \alpha_1, \ldots, \alpha_r, \beta_1, \ldots, \beta_p$ of the original linear model of the one-way or the two-way ANOVA, respectively. But there are linear functions of these parameters for which unbiased estimates do exist. In the one-way case, for instance, all sums $\mu+\alpha_i$, $1 \le i \le k$, can be estimated. In the two-way case, all sums $\mu+\alpha_i+\beta_j$, $1 \le i \le r$, $1 \le j \le p$, can be estimated.
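As a worked instance for the TPRO example (r = 3, p = 2), each of these sums can be written in terms of the recoded parameters. The decomposition below is a sketch that follows from the recoding of Example 9; it is not a formula quoted from the source.

```latex
\mu + \alpha_i + \beta_j
  = (\mu + \alpha_3 + \beta_2) + (\alpha_i - \alpha_3) + (\beta_j - \beta_2),
  \qquad i = 1, 2, 3, \quad j = 1, 2
```

For i = 3 or j = 2 the corresponding difference is zero, so every sum is a linear function of the four components of the recoded parameter vector, which can all be estimated without bias.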
REMARK 2
It was emphasized that for the task of estimating the parameters of a full-rank linear model, special distribution assumptions are not necessary. So up to now, in our example, the fact that the TPRO value can be assumed to be normally distributed has not been used. But with this normal distribution in mind, one can test hypotheses on the parameters. SAS, for instance, automatically tests, for each component of the parameter vector, the hypothesis that this component has a value of 0, using a Student-distributed test variable. The output shows that the estimators for the intercept and for the differential effects Dog minus Rat and Mouse minus Rat differ significantly from 0, whereas the last component, estimating the differential effect Female minus Male, does not.
Within the frame of the linear model $Y = X\boldsymbol{\beta} + E$ of ANOVA, it is possible to test more general hypotheses of the form $C\boldsymbol{\beta} = 0$. Here C is a matrix with an appropriate number of rows and columns. In the two-way model for the TPRO value, for instance, the matrix C could have one single row: With $C = (0 \;\; 1 \;\; 0 \;\; 0)$, the hypothesis $C\boldsymbol{\beta} = 0$ means that $\alpha_1 - \alpha_3 = 0$ (i.e., there is no difference between the species Dog and the species Rat). With $C = (0 \;\; 1 \;\; 0 \;\; {-10})$, the hypothesis $C\boldsymbol{\beta} = 0$ means that $\alpha_1 - \alpha_3 = 10(\beta_1 - \beta_2)$ (i.e., the difference between the effects of the species Dog and species Rat is 10 times as great as the difference between the effects Female and Male).
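For orientation, the standard normal-theory test statistic for such a hypothesis is sketched below. The source does not state this formula; the symbols $s^2$ (residual mean square), q (number of linearly independent rows of C), and n (number of observations, 36 in the example) are notation introduced here as assumptions.

```latex
F = \frac{(Cb)^\top \left[ C (X^\top X)^{-1} C^\top \right]^{-1} (Cb)}{q \, s^2}
  \;\sim\; F_{q,\; n-4} \quad \text{under } C\boldsymbol{\beta} = 0,
\qquad
s^2 = \frac{\lVert y - Xb \rVert^2}{n - 4}
```

For a single-row C (q = 1), this F statistic is the square of the Student-distributed variable $t = Cb \,/\, \bigl(s \sqrt{C (X^\top X)^{-1} C^\top}\bigr)$ with $n - 4$ degrees of freedom, which matches the component-wise tests mentioned above when C is a unit row vector.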
REMARK 3
In the linear model of a two-way or higher-way ANOVA,
it is possible to include interactions. In the example of
TPRO, there are six interaction parameters. Including
them in the model increases the number of columns of the
design matrix by six, but the rank is only increased by
two. So one has to reduce the number of interaction terms
from six to two, and it is rather difficult to explain the
meaning of the recoded parameters.
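The rank count in this remark can be verified with a small sketch in Python (standard library only; all function names are hypothetical). It uses one row per Species-by-Sex cell, since repeating rows does not change the rank, and compares the additive model with the model extended by six interaction columns.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix (list of rows) via exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


N_SPECIES, N_SEX = 3, 2


def additive_row(s, x):
    # Intercept, three species indicators, two sex indicators
    return ([1] + [int(s == k) for k in range(N_SPECIES)]
            + [int(x == k) for k in range(N_SEX)])


def interaction_row(s, x):
    # One indicator per (species, sex) combination: six columns
    return [int((s, x) == (k, l)) for k in range(N_SPECIES) for l in range(N_SEX)]


cells = [(s, x) for s in range(N_SPECIES) for x in range(N_SEX)]
X_additive = [additive_row(s, x) for s, x in cells]
X_interaction = [additive_row(s, x) + interaction_row(s, x) for s, x in cells]

rank_additive = rank(X_additive)
rank_interaction = rank(X_interaction)
print(len(X_interaction[0]) - len(X_additive[0]), rank_interaction - rank_additive)
```

Six columns are added, but the rank rises only from 4 to 6, the number of cells, which is why only two interaction parameters survive the recoding.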
REMARK 4
It should be mentioned that there is another recoding method for solving the problem of a less-than-full-rank design matrix in the linear model of ANOVA. Making use of the reparameterization conditions, the number of parameters for each factor variable can be reduced by 1. The resulting design matrix X has elements with values 0, +1, or -1. In the example of TPRO, for instance, we have the two conditions $\alpha_1 + \alpha_2 + \alpha_3 = 0$ and $\beta_1 + \beta_2 = 0$. These two equations yield:
$$
\alpha_3 = -\alpha_1 - \alpha_2 \quad \text{and} \quad \beta_2 = -\beta_1
$$
So the original model equations can be rewritten in the
following form:
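$$
\begin{aligned}
Y_i &= \mu x_{i0} + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \alpha_3 x_{i3} + \beta_1 x_{i4} + \beta_2 x_{i5} + E_i \\
&= \mu x_{i0} + \alpha_1 x_{i1} + \alpha_2 x_{i2} + (-\alpha_1 - \alpha_2)\, x_{i3} + \beta_1 x_{i4} + (-\beta_1)\, x_{i5} + E_i \\
&= \mu x_{i0} + \alpha_1 (x_{i1} - x_{i3}) + \alpha_2 (x_{i2} - x_{i3}) + \beta_1 (x_{i4} - x_{i5}) + E_i
\end{aligned}
$$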
Again the design matrix X of this new linear model
with only four columns instead of six has a full rank and
the parameter vector has four components instead of six:
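$$
X = \begin{pmatrix}
\mathbf{1}_6 & \mathbf{1}_6 & \mathbf{0}_6 & \mathbf{1}_6 \\
\mathbf{1}_6 & \mathbf{1}_6 & \mathbf{0}_6 & -\mathbf{1}_6 \\
\mathbf{1}_6 & \mathbf{0}_6 & \mathbf{1}_6 & \mathbf{1}_6 \\
\mathbf{1}_6 & \mathbf{0}_6 & \mathbf{1}_6 & -\mathbf{1}_6 \\
\mathbf{1}_6 & -\mathbf{1}_6 & -\mathbf{1}_6 & \mathbf{1}_6 \\
\mathbf{1}_6 & -\mathbf{1}_6 & -\mathbf{1}_6 & -\mathbf{1}_6
\end{pmatrix}, \qquad
\boldsymbol{\beta} = \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \end{pmatrix}
$$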
The advantage of this reparameterization method is that the meaning of the parameters is left unchanged. This method is not supported by the SAS procedure GLM, but it is used in the procedure CATMOD for the linear model in categorical data analysis, where the response variable is categorical as well.[3,6]
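To illustrate how the two codings relate, the following Python sketch converts the reference-cell estimates b quoted in Example 9 into parameters that satisfy the conditions $\alpha_1+\alpha_2+\alpha_3=0$ and $\beta_1+\beta_2=0$. It rests on the assumption that both full-rank codings produce the same fitted cell expectations, which holds because their design matrices span the same column space. All variable names are hypothetical, and the resulting values are derived here rather than taken from SAS output.

```python
# Reference-cell estimates quoted in the text
# (intercept, Dog minus Rat, Mouse minus Rat, female minus male)
b = [6.702777778, -1.183333333, -0.908333333, -0.105555556]

species_shift = {"Dog": b[1], "Mouse": b[2], "Rat": 0.0}
sex_shift = {"female": b[3], "male": 0.0}

# Fitted cell expectations mu + alpha_i + beta_j
fitted = {(sp, sx): b[0] + species_shift[sp] + sex_shift[sx]
          for sp in species_shift for sx in sex_shift}

# Sum-to-zero parameters: grand mean, then row and column deviations from it
mu = sum(fitted.values()) / len(fitted)
species_effect = {
    sp: sum(fitted[(sp, sx)] for sx in sex_shift) / len(sex_shift) - mu
    for sp in species_shift
}
sex_effect = {
    sx: sum(fitted[(sp, sx)] for sp in species_shift) / len(species_shift) - mu
    for sx in sex_shift
}


def effect_row(sp, sx):
    """Distinct row of the design matrix with entries 0, +1, -1."""
    x1 = {"Dog": 1, "Mouse": 0, "Rat": -1}[sp]
    x2 = {"Dog": 0, "Mouse": 1, "Rat": -1}[sp]
    x4 = {"female": 1, "male": -1}[sx]
    return (1, x1, x2, x4)


# Parameter vector (mu, alpha_1, alpha_2, beta_1) of the reparameterized model
theta = (mu, species_effect["Dog"], species_effect["Mouse"], sex_effect["female"])
reproduced = {cell: sum(x * t for x, t in zip(effect_row(*cell), theta))
              for cell in fitted}

print([round(t, 6) for t in theta])
```

The reproduced cell expectations agree with the fitted ones, so the two codings describe the same model and differ only in how the parameters are read.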
REMARK 5
In many animal studies, each animal is measured
repeatedly through time. In the ideal case, the number
of time points is identical for each animal. Performing a one-way or a higher-way ANOVA that treats the time point as a factor variable violates the primary assumption of independence, because repeated measurements on the same animal are correlated.
There exist different methods for solving this problem.
If the influence of time is negligible, the simplest method
is to choose one value per animal. For example, one can
choose the first observation, or one can choose one
observed value at random. If (and only if) the set of time
points is identical for each animal, a time-by-time
ANOVA can be performed. It consists of a number of
separate analyses, one for each time of observation.
The repeated-measures ANOVA, however, includes
the influence of time. It requires a modification of the
underlying linear model, adding further parameters for the
interactions between treatments and time points. Through the REPEATED statement, the SAS procedures ANOVA and GLM offer the opportunity of performing a repeated-measures ANOVA.
The repeated-measures ANOVA and further GLMs for
the analysis of longitudinal data like marginal models,
random effects models, and transition models are
discussed in Ref. [1].
REFERENCES
1. Diggle, P.J.; Liang, K.-Y.; Zeger, S.L. Analysis of Longitudinal Data; Clarendon Press: Oxford, 1994.
2. Fisher, R.A. The Design of Experiments; Oliver and Boyd: Edinburgh, 1947.
3. Forthofer, R.N.; Lehnen, R.G. Public Program Analysis: A New Categorical Data Approach; Lifetime Learning Publ.: Belmont, CA, 1981.
4. Hocking, R.R. Methods and Applications of Linear Models: Regression and the Analysis of Variance; Wiley-Interscience, 1996.
5. Lindman, H.R. Analysis of Variance in Experimental Design; Springer-Verlag: New York, 1992.
6. Littell, R.C.; Freund, R.J.; Spector, P.C. SAS System for Linear Models, 3rd ed.; SAS Institute Inc.: Cary, NC, 1991.
7. Roberts, M.J.; Russo, R. Student's Guide to Analysis of Variance; Routledge, 1999.
8. Sahai, H.; Ageel, M. The Analysis of Variance: Fixed, Random and Mixed Models; Springer, 2000.
9. Scheffé, H. The Analysis of Variance; John Wiley & Sons: New York, 1959.
10. Searle, S.R. Linear Models; John Wiley & Sons: New York, 1971.