Nonlinear Regression Analysis and Its Applications

Douglas M. Bates, Donald G. Watts

Copyright © 1988 by John Wiley & Sons, Inc.

CHAPTER 3.

Practical Considerations in Nonlinear Regression

“Rationally, let it be said in a whisper, experience is certainly worth more than theory.”
- Amerigo Vespucci

Nonlinear estimation, like all data analysis procedures, involves many practical
considerations. In this chapter, we discuss some techniques which help ensure a
successful nonlinear analysis. The topics include model specification, prelim-
inary analysis, determination of starting values, transformations of parameters
and variables, other iteration schemes, convergence, assessment of fit and
modification of models, correlated residuals, accumulated data, comparison of
models, parameters as functions of other variables, and presentation of results.
A case study in which we illustrate many of the techniques presented in this
chapter is given in Section 3.13. The important practical problem of designing
experiments for nonlinear models is discussed in the final section.

3.1 Model Specification

An important step in any nonlinear analysis is specification of the model, which
includes specifying both the expectation function and the characteristics of the
disturbance.

3.1.1 The Expectation Function

Ideally, physical, biological, chemical, or other theoretical considerations will
lead to a mechanistic model for the expectation function. The analyst’s job is
then to find the simplest form of the model and the parameter estimates which
provide an adequate fit of the model to the data, subject to the assumptions about
the disturbance. Note that it is not necessary for the expectation function to be
stated as an explicit function of the parameters and the control variables. In
Chapter 5 we discuss an important class of models, known as compartment
models, in which the expected response is given by the solution to a set of linear
differential equations. Special techniques, developed in that chapter, can be
used to avoid solving explicitly for the expectation function in terms of the
parameters and independent variables.
In other situations, the expectation function may be the solution to a non-
linear differential equation or an integral equation which has no analytic solu-
tion. Then the value of the expectation function must be determined numerical-
ly for any given parameter values for a regular nonlinear least squares program
to be used. In such situations, numerical parameter derivatives or a derivative-
free optimization procedure will often have to be used to calculate the least
squares estimates. However, as discussed in Caracotsios and Stewart (1985),
when an expectation function is obtained from the solution to a set of ordinary
differential equations, the parameter derivatives of the expectation function can
be determined from the sensitivity functions for the system of differential equa-
tions. These functions are evaluated numerically at the same time as the solu-
tion of the differential equations is evaluated.
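
The idea of integrating sensitivity functions together with the differential equations can be sketched on a toy problem. The one-parameter decay model df/dt = -θf, the hand-written Runge-Kutta integrator, and all names below are illustrative assumptions, not the method or notation of Caracotsios and Stewart (1985).

```python
import numpy as np

def rhs(state, theta):
    """Right-hand side of the augmented system [f, s], where s = df/dtheta.

    Assumed toy model: df/dt = -theta * f.
    Differentiating with respect to theta gives the sensitivity equation
    ds/dt = -f - theta * s.
    """
    f, s = state
    return np.array([-theta * f, -f - theta * s])

def integrate(theta, t_end, n_steps=200, f0=1.0):
    """Classical fourth-order Runge-Kutta for the augmented system."""
    h = t_end / n_steps
    # s(0) = 0 because the initial condition f(0) does not depend on theta.
    state = np.array([f0, 0.0])
    for _ in range(n_steps):
        k1 = rhs(state, theta)
        k2 = rhs(state + 0.5 * h * k1, theta)
        k3 = rhs(state + 0.5 * h * k2, theta)
        k4 = rhs(state + h * k3, theta)
        state = state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

# Response and its parameter derivative at t = 2 for theta = 0.7.
f_num, s_num = integrate(theta=0.7, t_end=2.0)
```

For this toy model the exact values are f = exp(-θt) and s = -t exp(-θt), so the numerical sensitivity can be checked directly; for a system without an analytic solution the same augmented integration supplies one column of the derivative matrix per parameter.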

Example: α-Pinene 1
The decomposition of α-pinene was investigated by Fuguitt and Hawkins
(1945, 1947), who reported the concentrations of five reactants as a function
of time, at a series of reaction temperatures. In Appendix 1, Section
A1.6, we present the data for the run at 189.5°C.
We discuss these data in Chapters 4 and 5 and fit a model which is
specified by a set of linear differential equations. As discussed in Chapter
5, the parameters in such models can be estimated very easily, due to the
ease with which they can be specified and the ease with which the
responses and the derivatives with respect to the parameters can be evaluated.
As will be also shown in Chapter 5, however, the linear differential
equation model does not provide an adequate fit to the α-pinene data.
Stewart and Sorensen (1981) analyzed the complete data set reported
by Fuguitt and Hawkins (1945, 1947), and proposed a model consisting of
a set of five nonlinear differential equations

$$\begin{aligned} &\;\;\vdots \\ \frac{df_3}{dt} &= \theta_1 f_1 \\ &\;\;\vdots \end{aligned}$$

where $f_i$, $i = 1, \ldots, 5$, represent the theoretical responses at time $t$.


There is no analytic solution to this set of differential equations, and
so we must use numerical procedures. For given values of
$\theta = (\theta_1, \ldots, \theta_8)^T$, the differential equations would be integrated
numerically using, say, a Runge-Kutta integration routine (Conte and de Boor,
1980). The numerical estimates of the responses, $f(t)$, and the observed
responses $y(t)$, at the observation times, could then be used to calculate
residuals from which an appropriate estimation criterion can be evaluated.
We discuss the choice of estimation criterion for multiresponse data
in Chapter 4. Methods for obtaining derivatives of the response functions
at the observation times by means of the “sensitivity functions”

$$\frac{\partial f_i}{\partial \theta_p}, \qquad i = 1, \ldots, 5, \quad p = 1, \ldots, 8,$$

are given in Caracotsios and Stewart (1985). The derivative matrix $V$ can
then be calculated from the sensitivity functions.

In other situations, a mechanistic model may not be advanced by the
researcher, in which case the statistician will be called upon to suggest an
equation. One approach is to ask the researcher to search through the literature to
see if models have been proposed. If not, the statistician and the researcher can
apply their modeling skills and develop a plausible mechanistic model. Failing
this, the statistician must formulate a model which has the same sort of behavior
as the data. If the data rise monotonically to an asymptote, perhaps a
Michaelis-Menten, exponential rise, or logistic model might be appropriate. If
the data peak and then decay towards zero, perhaps a double exponential, a
Michaelis-Menten model with a quadratic term in the denominator, or a gamma
function would be suitable.
Finally, if there are several sets of data, it may be possible to use the self-
modeling approach of Lawton, Sylvestre, and Maggio (1972). This approach
has been used in modeling spirometer curves which give the volume of air ex-
pelled from the lungs as a function of time for a number of subjects, and in
modeling the creatine phosphokinase serum levels in patients suffering myocar-
dial infarctions (Armstrong et al., 1979).

3.1.2 The Disturbance Term

All nonlinear estimation programs are based on specific assumptions about the
disturbance term, usually that the disturbance is additive and normally distributed
with zero mean, constant variance, and independence between cases (see
Section 1.3). Checking assumptions on the disturbance term is considerably
easier and more sensitive if the data include replications at some or all of the
design points. It is helpful if the experimental runs have been randomized,
although many nonlinear experiments involve sequential measurements of the
response, so that randomization may not be feasible.
At the initial stage, it is generally possible to check only one of the as-
sumptions on the disturbance, namely constancy of variance. If there are repli-
cations, one can simply plot the data and look to see if the spread of the data
tends to systematically increase or decrease with respect to any of the predictor
variables. Alternatively, one can use an analysis of variance program to obtain
averages and estimated variances and standard deviations for the replicated
responses and then plot the variances or standard deviations versus the average,
again looking for any systematic relationship, as discussed in Section 1.3. If
none is apparent, then it may be tentatively assumed that the variance is constant
and the analysis can proceed; if there is a relationship, then oftentimes a simple
power transformation such as square root, logarithm, or inverse will stabilize the
variance. Even without replications, some visual indication of constancy of
variance can be gained from a data plot but this is not as definitive as when re-
plications are available.
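
A minimal sketch of the replicate check described above, assuming the treated Puromycin data as distributed with R's `datasets` package (taken here to match the data used in the text). The function name `replicate_summary` is hypothetical.

```python
import numpy as np

# Puromycin (treated) data as distributed with R's `datasets` package;
# assumed here to match the data referred to in the text.
conc = np.array([0.02, 0.02, 0.06, 0.06, 0.11, 0.11,
                 0.22, 0.22, 0.56, 0.56, 1.10, 1.10])
rate = np.array([76, 47, 97, 107, 123, 139,
                 159, 152, 191, 201, 207, 200], dtype=float)

def replicate_summary(x, y):
    """Return (design point, mean, standard deviation) for each distinct x."""
    rows = []
    for level in np.unique(x):
        group = y[x == level]
        rows.append((level, group.mean(), group.std(ddof=1)))
    return rows

summary = replicate_summary(conc, rate)
for level, mean, sd in summary:
    print(f"x = {level:5.2f}  mean = {mean:6.1f}  sd = {sd:5.2f}")
```

Plotting the standard deviations against the means (or simply reading down the printed columns) shows whether the spread changes systematically with the level of the response.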
Note that transforming the data also involves transforming the expectation
function. Thus, if there is a well-justified expectation function for the response
but the data should be transformed to induce constant variance, then the same
transformation should be applied to the expectation function to preserve the fun-
damental relationship. (See Section 3.9 for an example.) This is discussed more
fully in Carroll and Ruppert (1984), where the Box-Cox transformations (Sec-
tion 1.3.2) are applied to both the observed responses and the expected
responses using the same transformation parameter $\lambda$. The optimal value of $\lambda$ is
determined by maximum likelihood. Alternatively, one can use weighted least
squares (Draper and Smith, 1981) if a reasonable decision can be made about
how the variance changes with respect to the response.
After a model has been fitted, it is possible to perform further checks on
the disturbance assumptions by examining the residuals, as described in Sections
1.3, 3.7, and 3.8. It is also possible to check adequacy of the model and to
compare rival models, as discussed in Section 3.10.

3.2 Preliminary Analysis

Having decided on a suitable expectation function (or set of plausible
expectation functions) and a transformation of the data (and the expectation function, if
necessary), we need to provide a computer program with the expectation func-
tion in some form and, unless numerical derivatives or derivative-free methods
are used, its derivatives with respect to the parameters. Naturally, the expecta-
tion function and derivatives must be correctly specified and correctly coded,
but (as most nonlinear analysts know from experience) a great many errors
occur at this stage.
One way to ensure that the function is correctly specified and correctly
coded is to use a separate program or even a calculator to evaluate the function
at one or two distinct design points and then compare these values with those from
the nonlinear estimation routine. The same technique can be used for the
derivatives, of course, but a better procedure is to compare the analytic deriva-
tives from the routine with numerical derivatives obtained from finite differ-
ences of the expectation function (see Section 3.5.3). These comparisons are
done on the basis of the relative differences between the derivatives calculated
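in the two ways. If $v_{np}$ is the analytic derivative for case $n$ and parameter $p$
while $\tilde{v}_{np}$ is the finite difference approximation, then the relative difference is

$$\begin{cases} |v_{np} - \tilde{v}_{np}| \, / \, |v_{np}| & \text{if } v_{np} \ne 0 \\ |v_{np} - \tilde{v}_{np}| & \text{if } v_{np} = 0 \end{cases}$$
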
Verifying that the relative differences are small not only provides a check on the
derivatives, but, indirectly, a check on the expectation function, because a
discrepancy between the numerical derivatives and the analytic derivatives can
be due to either incorrect specification or coding of the analytic derivatives, or
due to incorrect specification or coding of the expectation function, or both.
When coding the function, and especially when deriving and coding the
derivatives, it is good practice to use temporary variables and the chain rule for
derivatives, as demonstrated below. This helps avoid algebraic errors, which
can occur when trying to reduce a function to its simplest form.

Example: Isomerization 2
For the isomerization data of Example Isomerization 1, the function

$$f(x,\theta) = \frac{\theta_1 \theta_3 (x_2 - x_3/1.632)}{1 + \theta_2 x_1 + \theta_3 x_2 + \theta_4 x_3}$$

is considered appropriate. To code the function and its derivatives, suppose
the variables $x_1$, $x_2$, $x_3$ are coded as X(1), X(2), X(3), and the
parameters as THETA(1), THETA(2), THETA(3), and THETA(4).
Then we can code the function simply and accurately by introducing the
temporary variables
NUMX = X(2) - X(3)/1.632
DENOM = 1.0 + THETA(2)*X(1) + THETA(3)*X(2)
        + THETA(4)*X(3)
RATIO = NUMX/DENOM
so the function becomes
F = THETA(1)*THETA(3)*RATIO
Next, introducing the temporary variable
FD = -F/DENOM
the derivatives become (denoting $\partial f/\partial\theta_1$ by F1 and so on),
F1 = THETA(3)*RATIO
F2 = FD*X(1)
F3 = THETA(1)*RATIO + FD*X(2)
F4 = FD*X(3)
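
The same coding can be sketched in Python, together with the relative-difference check against forward differences recommended earlier in this section. The parameter values and design point are hypothetical, chosen only to exercise the code, and the relative step `eps` is an assumed value.

```python
import numpy as np

def isomerization(theta, x):
    """Expectation function and analytic gradient, using temporary variables."""
    numx = x[1] - x[2] / 1.632
    denom = 1.0 + theta[1] * x[0] + theta[2] * x[1] + theta[3] * x[2]
    ratio = numx / denom
    f = theta[0] * theta[2] * ratio
    fd = -f / denom
    grad = np.array([theta[2] * ratio,
                     fd * x[0],
                     theta[0] * ratio + fd * x[1],
                     fd * x[2]])
    return f, grad

def forward_difference(theta, x, eps=1e-7):
    """Forward-difference gradient with relative step eps * theta_p."""
    f0, _ = isomerization(theta, x)
    approx = np.empty(len(theta))
    for p in range(len(theta)):
        stepped = np.array(theta, dtype=float)
        stepped[p] *= 1.0 + eps
        approx[p] = (isomerization(stepped, x)[0] - f0) / (eps * theta[p])
    return approx

def relative_difference(analytic, numeric):
    """|v - v~| / |v| where v != 0, and |v - v~| where v == 0."""
    diff = np.abs(analytic - numeric)
    scale = np.where(analytic != 0.0, np.abs(analytic), 1.0)
    return diff / scale

# Hypothetical parameter values and design point, for illustration only.
theta = np.array([35.0, 0.07, 0.04, 0.17])
x = np.array([200.0, 100.0, 50.0])
_, analytic = isomerization(theta, x)
rel = relative_difference(analytic, forward_difference(theta, x))
```

If any entry of `rel` is not small, either the analytic derivative or the function itself is likely to be miscoded.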

It is also important to check that the data being analyzed are valid. That
is, one must always ensure that the correct numerical values of the response and
predictor variables have been entered into the machine. Probably the most ef-
fective way to check this is to plot the response versus each predictor variable,
making sure that the response behaves the way it should with respect to each of
the predictor variables.

3.3 Starting Values

One of the best things one can do to ensure a successful nonlinear analysis is to
obtain good starting values for the parameters, that is, values from which
convergence is quickly obtained.
Several simple but useful principles for determining starting values can be
used:

(1) interpret the behavior of the expectation function in terms of the
parameters analytically or graphically;
(2) interpret the behavior of derivatives of the expectation function in
terms of the parameters analytically or graphically;
(3) transform the expectation function analytically or graphically to obtain
simpler, preferably linear, behavior;
(4) reduce dimensions by substituting values for some parameters or by
evaluating the function at specific design values; and
(5) use conditional linearity.

We discuss each of these techniques in turn, and illustrate them with specific ex-
amples. For further discussion on obtaining starting values, see Ratkowsky
(1983).

3.3.1 Interpreting the Expectation Function Behavior

One of the advantages of nonlinear regression is that the parameters in the ex-
pectation function are usually meaningful to the scientist or researcher. This
meaning can be graphical, physical, biological, chemical, or in some other ap-
propriate form, and can be very helpful in determining starting values. Initial
estimates for some of the parameters may be available from related experiments.
Also, plotting a nonlinear expectation function using various values for the
parameters is an extremely beneficial exercise, because in this way one becomes
familiar with the function and how the parameters affect its behavior.
Sometimes starting values can be obtained by considering the behavior
near the origin or at other special design values. For example, letting $x = 0$
gives the initial value $\theta_1 + \theta_2$ for the model
$f(x,\theta) = \theta_1 + \theta_2 e^{-\theta_3 x}$, and letting
$x \to \infty$ gives the asymptote $\theta_1$ (assuming $\theta_3 > 0$).

Example: Puromycin 9
In the Michaelis-Menten expectation function, $f = \theta_1 x/(\theta_2 + x)$, the
parameter $\theta_1$ is the asymptotic velocity of the enzymatic reaction, and so
can be estimated by the maximum observed data value, $y_{\max}$, or by eye
from a plot. Graphically, $\theta_1$ represents the asymptotic value of $f$ as $x \to \infty$.
Similarly, $\theta_2$ represents the half-concentration, i.e. the value of $x$ such that
when the concentration reaches that value the velocity is one-half its ultimate
value. For the Puromycin data, $y_{\max} = 207$ provides a good starting
value for $\theta_1$. From a plot of the data (Figure 2.1), or simply from a listing,
it can be seen that the observed velocity reaches $y_{\max}/2$ at a concentration
of about 0.06 and so this value can be used as a starting value for $\theta_2$.

3.3.2 Interpreting Derivatives of the Expectation Function

Sometimes rates of change of the function at specified design values can be used
to obtain parameter starting estimates. For example, the derivative with respect
to $x$ of the Michaelis-Menten model at $x = 0$ is $\theta_1/\theta_2$, and so by estimating the
rate at $x = 0$ from the ratio of differences of adjacent $y$ values over differences
of adjacent $x$ values, and dividing this rate into $y_{\max}$, we can obtain a starting
value for $\theta_2$. For the Puromycin data, we obtain $\theta_2^0 = 207/(61/0.02) = 0.068$.
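
The two starting values can be reproduced with a few lines of Python. This is a sketch that assumes the treated Puromycin data as distributed with R's `datasets` package and estimates the initial rate from the origin to the mean response at the lowest concentration; the text rounds that mean (61.5) to 61, which accounts for the small difference from 0.068.

```python
import numpy as np

# Puromycin (treated) data as distributed with R's `datasets` package;
# assumed here to match the data referred to in the text.
conc = np.array([0.02, 0.02, 0.06, 0.06, 0.11, 0.11,
                 0.22, 0.22, 0.56, 0.56, 1.10, 1.10])
rate = np.array([76, 47, 97, 107, 123, 139,
                 159, 152, 191, 201, 207, 200], dtype=float)

# theta1: the asymptote, estimated by the largest observed response.
theta1_start = rate.max()

# Rate of change near x = 0: mean response at the lowest concentration
# divided by that concentration (a difference quotient from the origin).
lowest = conc == conc.min()
initial_rate = rate[lowest].mean() / conc.min()

# The derivative at x = 0 is theta1 / theta2, so theta2 = theta1 / rate.
theta2_start = theta1_start / initial_rate
```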
Similarly, derivatives at special values of x , such as limits or points of
inflection, can be used. For example, for the double exponential model
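$$f(x,\theta) = \theta_1 e^{-\theta_2 x} + \theta_3 e^{-\theta_4 x}$$
assuming $\theta_2 > \theta_4$, the function behaves like a simple exponential $\theta_3 e^{-\theta_4 x}$ for
large $x$ and like $\theta_3 + \theta_1 e^{-\theta_2 x}$ for small $x$. Thus, the rate of change at small $x$
provides an estimate of $\theta_2$, and at large $x$ an estimate of $\theta_4$.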

3.3.3 Transforming the Expectation Function

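Transformations of the expectation function can often be used to obtain starting
values. For instance, for the Michaelis-Menten model with a linear or quadratic
denominator, simply taking the reciprocal of the function produces a model
which can be rewritten as a linear model. Linear least squares can be used on
the reciprocal data to estimate the linear parameters, which can then be used to
obtain starting values for $\theta$. The model from Example Isomerization 1,

$$f(x,\theta) = \frac{\theta_1 \theta_3 (x_2 - x_3/1.632)}{1 + \theta_2 x_1 + \theta_3 x_2 + \theta_4 x_3}$$

is also transformably linear, since

$$\frac{x_2 - x_3/1.632}{f} = \frac{1}{\theta_1\theta_3} + \frac{\theta_2}{\theta_1\theta_3} x_1 + \frac{1}{\theta_1} x_2 + \frac{\theta_4}{\theta_1\theta_3} x_3 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

A linear regression (with a constant term) of $(x_2 - x_3/1.632)/y$ on $x_1$, $x_2$, and $x_3$
would yield starting values

$$\theta_1^0 = \frac{1}{\hat\beta_2}, \qquad \theta_2^0 = \frac{\hat\beta_1}{\hat\beta_0}, \qquad \theta_3^0 = \frac{\hat\beta_2}{\hat\beta_0}, \qquad \theta_4^0 = \frac{\hat\beta_3}{\hat\beta_0}$$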

For the model $f(x,\theta) = \exp[-\theta_1 x_1 \exp(-\theta_2/x_2)]$, used in a chemical
kinetics example (Bard, 1974, p. 124), taking logarithms twice gives

$$\ln(-\ln f) = \ln x_1 + \ln\theta_1 - \frac{\theta_2}{x_2}$$
and one could again use linear least squares to obtain starting values.
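
As a concrete instance of the reciprocal transformation mentioned at the start of this section, the following sketch regresses 1/y on 1/x for the Michaelis-Menten model, since 1/f = 1/θ1 + (θ2/θ1)(1/x). It assumes the treated Puromycin data as distributed with R's `datasets` package; the variable names are illustrative.

```python
import numpy as np

# Puromycin (treated) data as distributed with R's `datasets` package;
# assumed here to match the data referred to in the text.
conc = np.array([0.02, 0.02, 0.06, 0.06, 0.11, 0.11,
                 0.22, 0.22, 0.56, 0.56, 1.10, 1.10])
rate = np.array([76, 47, 97, 107, 123, 139,
                 159, 152, 191, 201, 207, 200], dtype=float)

# Linear model for the reciprocal response:
#   1/y = beta0 + beta1 * (1/x),  beta0 = 1/theta1,  beta1 = theta2/theta1.
X = np.column_stack([np.ones_like(conc), 1.0 / conc])
beta, *_ = np.linalg.lstsq(X, 1.0 / rate, rcond=None)

theta1_start = 1.0 / beta[0]
theta2_start = beta[1] / beta[0]
```

The reciprocal fit weights the small responses heavily, so the result is intended only as a starting value for the nonlinear iterations, not as a final estimate.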
Graphical transformations are also very effective. Plotting f versus x on
semilog paper or plotting In f versus x on regular graph paper often reveals the
true nature of the data or enables one to see when one portion of the model is
dominant, and hence where one can measure a rate and associate it with a par-
ticular parameter.
For example, the double exponential model
$$f(x,\theta) = \theta_1 e^{-\theta_2 x} + \theta_3 e^{-\theta_4 x}$$
with $\theta_2 > \theta_4$ is approximately $\ln f = \ln\theta_3 - \theta_4 x$ at large $x$, which gives a straight
line on a semilog plot. A simple fit can then be made by eye to obtain values for
$\theta_3$ and $\theta_4$. These values can then be used to calculate values of $\theta_3 e^{-\theta_4 x}$ at all
values of $x$, and hence residuals $\tilde{y} = y - \theta_3 e^{-\theta_4 x}$ can be derived. Plotting $\tilde{y}$
versus $x$ on semilog paper then enables one to estimate $\theta_1$ and $\theta_2$. This process,
known as peeling, can be used when the expectation function is a sum of several
exponentials.

Example: Sulfisoxazole 1
To demonstrate the technique of peeling, we consider sulfisoxazole data
given in Kaplan et al. (1972) and described in Appendix 1, Section A1.7.
In this experiment, sulfisoxazole was administered to a subject intravenous-
ly, blood samples were taken at specified times, and the concentration of
sulfisoxazole in the plasma was measured. The data are plotted in Figure
3.1.
[Figure 3.1 Plot of sulfisoxazole concentration in plasma versus time (min).]

[Figure 3.2 Curve peeling using the Sulfisoxazole data. In part a we show the data,
plotted on a log scale, together with a straight line fit (dashed line) to the last six points.
In part b we show, on a log scale, the residuals for the first six data points from the
straight line fit in part a. The dashed line is the fitted line through these (log) residuals.
The horizontal axis in both parts is time (min).]

Plotting the sulfisoxazole concentration on a log scale versus $x$, as in
Figure 3.2a, reveals monotonic decay with straight line behavior for large $x$,
which suggests a model of the form
$$f(x,\theta) = \theta_1 e^{-\theta_2 x} + \theta_3 e^{-\theta_4 x}$$
with all positive parameters. Fitting a straight line to the last six (log) data
values gives an intercept of 5.05 and a slope of -0.153, so that the starting
values are $\theta_3^0 = e^{5.05} = 156$ and $\theta_4^0 = 0.153$. Calculating the residuals
$$\tilde{y}_n = y_n - 156 e^{-0.153 x_n}$$
and plotting $\ln \tilde{y}$ versus $x$ for the first six data values, as in Figure 3.2b,
again reveals straight line behavior. Fitting a straight line to these (log)
residuals gives an intercept of 4.55 and a slope of -1.31, so that the starting
values are $\theta_1^0 = e^{4.55} = 95$ and $\theta_2^0 = 1.31$.
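
The two peeling stages can be sketched as follows. To avoid relying on a transcription of the Appendix data, the responses here are synthetic and noise-free, generated from the starting values quoted above; the sampling times and the function name `peel` are assumptions made for this illustration.

```python
import numpy as np

def peel(x, y, n_tail, n_head):
    """Two-stage curve peeling for f = t1*exp(-t2*x) + t3*exp(-t4*x), t2 > t4."""
    # Stage 1: straight line through log(y) for the last n_tail points,
    # where the slowly decaying term dominates.
    slope, intercept = np.polyfit(x[-n_tail:], np.log(y[-n_tail:]), 1)
    t3, t4 = np.exp(intercept), -slope
    # Stage 2: remove the fitted slow term and fit a straight line to the
    # log residuals of the first n_head points.
    resid = y[:n_head] - t3 * np.exp(-t4 * x[:n_head])
    slope, intercept = np.polyfit(x[:n_head], np.log(resid), 1)
    t1, t2 = np.exp(intercept), -slope
    return t1, t2, t3, t4

# Synthetic, noise-free responses; the times below are assumed values.
x = np.array([0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 12.0, 24.0, 48.0])
y = 95.0 * np.exp(-1.31 * x) + 156.0 * np.exp(-0.153 * x)

t1, t2, t3, t4 = peel(x, y, n_tail=6, n_head=6)
```

Even without noise the peeled values differ slightly from the generating ones, because the fast term has not died out completely at the start of the tail; this is acceptable for starting values. With real data the residuals in the second stage must be positive for the logarithm to exist, which may limit how many early points can be used.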

3.3.4 Reducing Dimensions

Peeling is an example of the general technique of reducing dimensions in order
to obtain starting values. In this technique one estimates parameters successively,
each estimated parameter making it easier to estimate the remaining ones.
As another example of reducing parameter dimensions, consider the model
$f = \theta_1 + \theta_2 e^{-\theta_3 x}$, where $\theta_3$ is positive. Then the limiting value of the response
when $x \to \infty$ is $\theta_1$ and the value at $x = 0$ is $\theta_1 + \theta_2$. Depending on whether the
data are increasing or decreasing, we can use $y_{\max}$ or $y_{\min}$ to get the starting value
$\theta_1^0$, and then use the difference $y(0) - \theta_1^0$ to get $\theta_2^0$. We perform a linear
regression (without a constant term) of $\ln[(y - \theta_1^0)/\theta_2^0]$ on $x$; the fitted slope
estimates $-\theta_3$ and so gives $\theta_3^0$. Alternatively, once $\theta_1^0$ and $\theta_2^0$ are determined,
we could substitute these values into the function and evaluate
$(1/x)\ln[(y - \theta_1^0)/\theta_2^0]$ at selected values of $x$ to obtain estimates of $-\theta_3$, and hence $\theta_3^0$.
Sometimes we can reduce the dimensionality of the model and indirectly
reduce the number of parameters. For example, with the model
$f(x,\theta) = \exp[-\theta_1 x_1 \exp(-\theta_2/x_2)]$, if there are some very large values of $x_2$, then
the model is approximately $f(x_1,\theta) = e^{-\theta_1 x_1}$, so it is easy to obtain a starting
estimate for $\theta_1$ by taking logarithms of the responses at large $x_2$. Similarly the
model $f(x,\theta) = \theta_1\theta_2 x_1/(1 + \theta_2 x_1 + \theta_3 x_2)$ reduces to a Michaelis-Menten type
when $x_2$ is small, so it is easy to obtain starting values.

3.3.5 Conditional Linearity

In many model functions, several of the parameters are conditionally linear (see
Section 2.1) and linear regression can be used to get starting values for these
parameters conditional on the nonlinear parameters. Alternatively, special algo-
rithms which exploit the conditional linearity, described in Section 3.5.5, can be
used. These algorithms only require starting estimates for the nonlinear
parameters. As an example of conditional linearity, in
$$f(x,\theta) = \theta_1 + \theta_2 e^{-\theta_3 x}$$
both $\theta_1$ and $\theta_2$ are conditionally linear, so it is possible to use linear least
squares to estimate $\theta_1^0$ and $\theta_2^0$ once an estimate for $\theta_3^0$ has been obtained. A
detailed example involving conditionally linear parameters is given in Section 3.6.
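
A minimal sketch of using conditional linearity for starting values, assuming synthetic noise-free data. The coarse grid scan over θ3 is an illustrative device added here, not a procedure taken from the text, and the function name is hypothetical.

```python
import numpy as np

def conditional_linear_fit(x, y, theta3):
    """Least squares theta1, theta2 in f = theta1 + theta2*exp(-theta3*x)
    for a fixed value of the nonlinear parameter theta3."""
    X = np.column_stack([np.ones_like(x), np.exp(-theta3 * x)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual_ss = float(np.sum((y - X @ coef) ** 2))
    return coef, residual_ss

# Synthetic data from assumed values (theta1, theta2, theta3) = (2, 5, 0.8).
x = np.linspace(0.0, 5.0, 11)
y = 2.0 + 5.0 * np.exp(-0.8 * x)

# For each trial theta3, the linear parameters follow from linear least
# squares; the trial value with the smallest residual sum of squares is kept.
grid = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2])
fits = [conditional_linear_fit(x, y, t3) for t3 in grid]
best = int(np.argmin([ss for _, ss in fits]))
theta3_start = grid[best]
theta1_start, theta2_start = fits[best][0]
```

Only the nonlinear parameter needs a guess; the conditionally linear parameters are then determined by the data.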

3.4 Parameter Transformations

As will be shown in Chapters 6 and 7, transforming the parameters in a
nonlinear regression model can produce a much better linear approximation. This
has the beneficial effects of making approximate inference regions better and
speeding convergence to the least squares value. Parameter transformations can
also be used to enforce constraints on the values of the parameters.
Note that transformations of parameters are very different from transfor-
mations of the responses. Transformations of the response distort the response
space and create a new expectation surface, thereby affecting the disturbances
and the validity of the assumptions on the disturbances. In contrast, transforma-
tions of the parameters merely relabel points in the parameter space and on the
existing expectation surface. Consequently they do not affect the assumptions
about the deterministic or the stochastic parts of the model, although they do af-
fect the validity of the linear approximation and inferences based on it.
The use of parameter transformations to improve validity of the linear ap-
proximation is discussed in Chapter 7; here we focus on transformations to im-
pose constraints on parameters and to improve convergence.

3.4.1 Constrained Parameters

The parameters in most nonlinear models are restricted to regions which make
sense scientifically. For example, in the Michaelis-Menten model and in the
isomerization model, all the parameters must be positive, and in exponential
models, the parameters in the exponent usually must be positive.
It is often possible to ignore the restrictions when fitting the model and
simply examine the converged parameter estimates to see if they satisfy the con-
straints. If the model fits the data well, the parameter estimates should be in a
meaningful range. Sometimes, though, it may be dangerous to allow the param-
eter estimates to go into proscribed regions during the iterations because the
parameter values may begin to oscillate wildly or cause numerical overflows. In
these situations, one should impose the constraints throughout the estimation
process.
General techniques for optimizing functions whose parameters are con-
strained, called nonlinear programming, are beyond the scope of this book. See,
for example, Gill, Murray, and Wright (1981) or Bard (1974) for details. For-
tunately, the types of constraints that are applied to the parameters of a non-
linear regression model are usually simple enough to be handled by parameter
transformations. For example, if $\theta_p$ must be positive, we reparametrize to
$\phi_p = \ln\theta_p$, so throughout the iterations the value of $\theta_p = e^{\phi_p}$ remains positive.
An interval constraint on a parameter, say
$$a \le \theta \le b$$
can be enforced by a logistic transformation of the form
$$\theta = a + \frac{b - a}{1 + e^{-\phi}}$$
while an order constraint on parameters $\theta_j, \ldots, \theta_k$, say
$$a \le \theta_j \le \theta_{j+1} \le \cdots \le \theta_k \le b$$
can be enforced by a transformation given in Jupp (1978).
The order constraint can be used to ensure a unique optimum in a model
with exchangeable parameters. As an example of such a model, consider the
double exponential model
$$f(x,\theta) = \theta_1 e^{-\theta_2 x} + \theta_3 e^{-\theta_4 x}$$
where the pairs of parameters $(\theta_1, \theta_2)$ and $(\theta_3, \theta_4)$
are exchangeable, that is, exchanging the parameter pair $(\theta_1, \theta_2)$ with the pair
$(\theta_3, \theta_4)$ will not alter the values of the expected responses. Exchangeable
parameters can create nasty optimization problems because the linear approximation
cannot account for that kind of symmetry.
In this example, we remove the exchangeability by requiring
$$0 \le \theta_2 \le \theta_4$$
and enforce this with the transformation
$$\theta_2 = e^{\phi_2}, \qquad \theta_4 = e^{\phi_2}(1 + e^{\phi_4})$$
Since $\theta_1$ and $\theta_3$ are conditionally linear parameters, their optimal values are
uniquely determined when $\theta_2$ and $\theta_4$ are distinct. Thus we only need to keep $\theta_2$
and $\theta_4$ ordered to eliminate the exchangeability.
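
The positivity, interval, and ordering transformations of this section can be sketched as small functions that map unconstrained values φ to constrained parameters θ. The function names are hypothetical; an optimizer would iterate on φ and evaluate the model at the mapped θ.

```python
import numpy as np

def positive(phi):
    """theta = exp(phi) is positive for every real phi."""
    return np.exp(phi)

def interval(phi, a, b):
    """Logistic map of the real line onto the open interval (a, b)."""
    return a + (b - a) / (1.0 + np.exp(-phi))

def ordered_pair(phi2, phi4):
    """Return (theta2, theta4) with 0 < theta2 < theta4."""
    theta2 = np.exp(phi2)
    theta4 = theta2 * (1.0 + np.exp(phi4))
    return theta2, theta4

# Unconstrained values spanning a wide range.
phi = np.linspace(-5.0, 5.0, 11)
theta_pos = positive(phi)
theta_int = interval(phi, a=1.0, b=3.0)
theta2, theta4 = ordered_pair(phi, phi[::-1])
```

Note that these maps reach only the interior of the constrained region, so a boundary value such as θ = a corresponds to φ tending to infinity.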

3.4.2 Facilitating Convergence

Parameter transformations can facilitate convergence because they prevent the
parameters from venturing into proscribed regions. Transformations can also
improve convergence by making the parameter lines behave more uniformly on
the expectation surface so that the Gauss increment is more accurate. Joint
variable-parameter transformations can also be used to improve the estimation
situation by improving conditioning of the derivative matrix V. Frequently this
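is done by centering or scaling the data. For example, the simple model
$f(x,\theta) = \theta_1 e^{-\theta_2 x}$ has derivatives

$$\frac{\partial f}{\partial\theta_1} = e^{-\theta_2 x}, \qquad \frac{\partial f}{\partial\theta_2} = -\theta_1 x e^{-\theta_2 x}$$

and the derivative vectors tend to be collinear when the values of $x$ are all
positive. Rewriting the model as

$$f(x,\theta) = \theta_1 e^{-\theta_2 (x - x_0 + x_0)}$$

and reparametrizing with

$$\phi_1 = \theta_1 e^{-\theta_2 x_0}, \qquad \phi_2 = \theta_2$$

gives $f(x,\phi) = \phi_1 e^{-\phi_2 (x - x_0)}$, and now the derivatives with respect to $\phi$ will be
more nearly orthogonal. A useful choice is $x_0 = \bar{x}$.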
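
The effect of centering on the derivative vectors can be checked numerically. The design points and parameter values in this sketch are assumed for illustration, and the function names are hypothetical.

```python
import numpy as np

def column_cosine(V):
    """Cosine of the angle between the two columns of V."""
    a, b = V[:, 0], V[:, 1]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def derivative_matrix(x, theta1, theta2, x0=0.0):
    """Derivatives of phi1*exp(-phi2*(x - x0)) with respect to (phi1, phi2),
    where phi1 = theta1*exp(-theta2*x0) and phi2 = theta2.
    With x0 = 0 this is the derivative matrix of the original model."""
    phi1 = theta1 * np.exp(-theta2 * x0)
    decay = np.exp(-theta2 * (x - x0))
    return np.column_stack([decay, -phi1 * (x - x0) * decay])

# Assumed design and parameter values; all x values are positive.
x = np.arange(1.0, 11.0)
theta1, theta2 = 1.0, 0.1

cos_raw = column_cosine(derivative_matrix(x, theta1, theta2))
cos_centered = column_cosine(derivative_matrix(x, theta1, theta2, x0=x.mean()))
```

A cosine near ±1 indicates nearly collinear columns; centering at the mean of x moves the cosine toward zero without changing the fitted responses.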
Scaling the variables and the parameters can also improve conditioning by
making the derivative matrix have column vectors which are more nearly equal
in length.
Other transformations can be useful, depending on the context of the
problem. For example, in chemical kinetics it is often useful to revise the model
so that reciprocal absolute temperature is used rather than temperature T. Com-
bining this with centering would then modify a term involving temperature to
the form 1/T - l/To.
The effect of parameter transformations on parameter effects nonlineari-
ties and the adequacy of linear approximation inference regions is discussed in
Chapter 7.

3.5 Other Iterative Techniques

The Gauss-Newton iterative algorithm for nonlinear least squares, described in
Section 2.2.1, is a simple, useful method for finding $\hat\theta$. Some modifications to
this method, as well as alternative methods, have been suggested, primarily to
deal with ill-conditioning of the derivative matrix $V$ and to avoid having to code
and specify the derivatives.

3.5.1 A Newton-Raphson Method

The Gauss-Newton method for estimating nonlinear parameters can be considered
as a special case of the more general Newton-Raphson method (Bard,
1974) which uses a local quadratic approximation to the objective function.
Near $\theta^0$, we approximate
$$S(\theta) \approx S(\theta^0) + \omega^T(\theta - \theta^0) + \tfrac{1}{2}(\theta - \theta^0)^T \Omega\, (\theta - \theta^0)$$
where
$$\omega = \frac{\partial S}{\partial\theta}$$
is the gradient of $S(\theta)$ evaluated at $\theta^0$ and
$$\Omega = \frac{\partial^2 S}{\partial\theta\,\partial\theta^T}$$
is the Hessian of $S(\theta)$ evaluated at $\theta^0$. The approximating sum of squares
function will have a stationary point when its gradient is zero, that is, when
$$\omega + \Omega(\theta - \theta^0) = 0$$
and this stationary point will be a minimum if $\Omega$ is positive definite (all its
eigenvalues positive). If $\Omega$ is positive definite, the Newton-Raphson step is
$$\delta^0 = -\Omega^{-1}\omega$$
For the function
$$S(\theta) = (y - \eta)^T(y - \eta)$$
the gradient is
$$\omega = -2V^T(y - \eta)$$
and the Hessian is
$$\Omega = 2V^TV - 2\frac{\partial V^T}{\partial\theta^T}(y - \eta)$$
where $V$ is the derivative matrix. The Gauss-Newton increment is therefore
equivalent to the Newton-Raphson increment with the second derivative term
$\partial V^T/\partial\theta^T$ set to zero.
Dennis, Gay, and Welsch (1981) describe a nonlinear least squares routine
which develops a quasi-Newton approximation (Dennis and Schnabel,
1983) to the second term in the Hessian. This extends the Gauss-Newton
algorithm and makes it closer to the Newton-Raphson algorithm, which has the
advantage that the approximating Hessian should be closer to the actual Hessian
than the single term $V^TV$ used in the Gauss-Newton algorithm. However, the
term $V^TV$ is necessarily positive definite (or at least positive semidefinite), since
the eigenvalues of $V^TV$ are the squares of the singular values of $V$. Adding
another term on to this to form an approximating Hessian can destroy the
positive definiteness, in which case the Newton-Raphson algorithm must be
modified to restore positive definiteness in the Hessian.

3.5.2 The Levenberg-Marquardt Compromise

A condition that can cause erratic behavior of Gauss-Newton iterations is
singularity of the derivative matrix $V$ caused by collinearity of the columns. When $V$
is nearly singular, $\delta$ can be very large, causing the parameters to go into
undesirable regions of the parameter space.
One solution to the problem of near-singularity is to perform the calcula-
tions for the increment in a numerically stable way, which is why we recom-
mend using the QR decomposition rather than the normal equations. We also
recommend using double precision or extended precision arithmetic for the cal-
culations, where feasible, and using joint variable-parameter transformations as
discussed in Section 3.4.
Another general method for dealing with near-singularity is to modify the
Gauss-Newton increment to
$$\delta(k) = (V^TV + kI)^{-1} V^T (y - \eta) \qquad (3.1)$$
as suggested in Levenberg (1944), or to
$$\delta(k) = (V^TV + kD)^{-1} V^T (y - \eta) \qquad (3.2)$$
as suggested in Marquardt (1963), where $k$ is a conditioning factor and $D$ is a
diagonal matrix with entries equal to the diagonal elements of $V^TV$. This is called
the Levenberg-Marquardt compromise because the direction of $\delta(k)$ is
intermediate between the direction of the Gauss-Newton increment ($k \to 0$) and the
direction of steepest descent $V^T(y - \eta)/\|V^T(y - \eta)\|$ ($k \to \infty$).
Note that Levenberg recommends inflating the diagonal of $V^TV$ by an
additive factor, while Marquardt recommends inflating the diagonal by a
multiplicative factor $1 + k$. Marquardt's method produces an increment which is
invariant under scaling transformations of the parameters, so that if the scale for one
component of the parameter vector is doubled, the increment calculated, and the
corresponding component of the increment halved, the result will be the same as
calculating the increment in the original scale. In Levenberg's method, this is
not true. Box and Kanemasu (1984) showed, however, that if one requires in-
variance of the increment under linear transformations of the parameter space,
the resulting increment is the Gauss-Newton increment with a step factor.
The Levenberg-Marquardt compromise is more difficult to implement
than the Gauss-Newton algorithm, since one must decide how to manipulate
both the conditioning factor $k$ and the step factor $\lambda$; nevertheless it is
implemented in many nonlinear least squares programs. Although we presented the
increment in terms of the inverse of an augmented $V^TV$ matrix, the actual
calculations for the increment should be done using a QR decomposition of $V$ and
applying updates from a diagonal matrix using the Givens rotations (Dongarra et
al., 1979, Chapter 10; Golub and Pereyra, 1973), since the Levenberg increment
(3.1) is the least squares solution to the system with derivative matrix

$$\begin{bmatrix} V \\ \sqrt{k}\, I \end{bmatrix}$$

and response vector

$$\begin{bmatrix} y - \eta \\ 0 \end{bmatrix}$$

For the Marquardt increment (3.2), the derivative matrix is changed to

$$\begin{bmatrix} V \\ \sqrt{k}\, D^{1/2} \end{bmatrix}$$

3.5.3 Numerical Derivatives

We have assumed that implementations of the algorithms we have described use
analytic derivatives with respect to the parameters. Obtaining these derivatives
and coding them is usually the most tedious and error-prone stage in a nonlinear
analysis.
As a general rule we recommend using analytic derivatives for accuracy,
although it is convenient to use programs which use numerical derivatives from
finite differences. Such convenience is not obtained without cost, however, be-
cause numerical derivatives can be inaccurate and they usually increase the
computing time necessary to obtain convergence. Furthermore, if second
derivatives are required to investigate the effect of nonlinearity on inferences,
the numerical second derivatives evaluated from numerical first derivatives can
be very inaccurate. Other problems with numerical derivatives involve the
choice of step size to determine the finite differences, and whether to use central
or forward differences.
With forward differences, for the $p$th parameter we evaluate the model
function using the current values of all the parameters except for the $p$th,
which is incremented to $\theta_p(1 + \epsilon)$. Dividing the differences between
the function values by the fractional amount $\epsilon\theta_p$ gives an approximate
derivative. This requires $1 + P$ evaluations of the expected response vector at
each iteration. Using central differences would require evaluation of the model
function at $\theta_p(1 \pm \epsilon)$ in addition to the central value, so the total
number of evaluations would be $1 + 2P$. Dennis and Schnabel (1983) recommend
setting $\epsilon$ equal to the square root of the relative machine precision (that
is, the square root of the smallest number which, when added to 1.0 in the
floating point arithmetic of the computer, produces a number greater than 1.0).
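
As a minimal sketch of the two difference schemes, the following Python code
builds forward and central difference derivative matrices with the relative
step described above. The exponential test model, the function names, and the
absolute-step fallback for a zero parameter are assumptions introduced only
for illustration.

```python
import math
import sys

# Square root of the relative machine precision, as suggested in the text.
EPS = math.sqrt(sys.float_info.epsilon)


def _step(value, eps):
    # Relative step eps * theta_p; an absolute step is an assumed fallback at zero.
    return eps * value if value != 0.0 else eps


def forward_jacobian(f, theta, eps=EPS):
    """Forward-difference derivative matrix (N x P) using 1 + P evaluations of f."""
    base = f(theta)
    columns = []
    for p in range(len(theta)):
        h = _step(theta[p], eps)
        shifted = list(theta)
        shifted[p] += h
        columns.append([(a - b) / h for a, b in zip(f(shifted), base)])
    return [list(row) for row in zip(*columns)]


def central_jacobian(f, theta, eps=EPS):
    """Central-difference derivative matrix using 2P evaluations of f.

    The central value, needed anyway for the residuals, brings the total to 1 + 2P.
    """
    columns = []
    for p in range(len(theta)):
        h = _step(theta[p], eps)
        up, down = list(theta), list(theta)
        up[p] += h
        down[p] -= h
        columns.append([(a - b) / (2.0 * h) for a, b in zip(f(up), f(down))])
    return [list(row) for row in zip(*columns)]


# Hypothetical model eta_n = theta_1 * exp(-theta_2 * x_n), used only as a test case.
xs = [0.0, 1.0, 2.0, 4.0]


def model(theta):
    return [theta[0] * math.exp(-theta[1] * x) for x in xs]


def analytic(theta):
    return [[math.exp(-theta[1] * x), -theta[0] * x * math.exp(-theta[1] * x)]
            for x in xs]


theta = [2.0, 0.5]
exact = analytic(theta)
fwd = forward_jacobian(model, theta)
ctr = central_jacobian(model, theta)
max_err_fwd = max(abs(a - b) for ra, rb in zip(fwd, exact) for a, b in zip(ra, rb))
max_err_ctr = max(abs(a - b) for ra, rb in zip(ctr, exact) for a, b in zip(ra, rb))
print(max_err_fwd, max_err_ctr)
```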

3.5.4 Derivative-Free Methods

There are derivative-free methods which do not simply use numerical approxi-
mations to derivatives. Ralston and Jennrich (1978) introduced one such rou-
tine, DUD (Doesn't Use Derivatives), which is based on using a secant plane ap-
proximation to the expectation surface rather than a tangent plane approxima-
tion.
To use DUD, one must provide starting values $\theta^0$. The program then
automatically produces a further set of $P$ parameter vectors by displacing each
parameter in turn by 10%. These parameter vectors are then used to calculate
expectation vectors $\eta_1, \eta_2, \ldots$, giving a secant plane which matches
the expectation surface at $P + 1$ points. A set of linear coordinates is
generated on the secant plane, and the projection of $y$ onto the secant plane is
made and mapped into the parameter plane. This information is used to calculate
a new $\theta$ vector, say $\theta_{new}$, for which $\eta(\theta_{new})$ is closer to
$y$ than any of the other parameter vectors. The parameter vector $\theta$
corresponding to the $\eta$ which is farthest from $y$ is then replaced by
$\theta_{new}$, and the process continued until convergence is achieved.

Example: Rumford 6
DUD can be illustrated very effectively using a two-observation example
such as in Example Rumford 2. To simplify arithmetic and to provide a
better scale for the figure, we provide the necessary two ($= P + 1$) starting
values rather than using the automatic 10% displacement. The two starting
values are chosen to be $\theta^1 = 0.02$ and $\theta^2 = 0.10$. Figure 3.3 shows the
expectation surface $\eta(\theta)$ together with the secant line $l$ through the
points $\eta(\theta^1)$ and $\eta(\theta^2)$. We now introduce a linear scale parameter
$\alpha$ on $\theta$ such that $\theta = \theta^1 + T\alpha$, where $T = (\theta^2 - \theta^1)$
and so $\alpha = 0$ at $\theta^1$ and $\alpha = 1$ at $\theta^2$. We also impose a linear
scale on $l$ such that $l(\alpha) = \eta(\theta^1) + H\alpha$, where
$H = \eta(\theta^2) - \eta(\theta^1)$. The linear coordinate system is also shown in
Figure 3.3.

Figure 3.3 A geometric interpretation of the calculation of the DUD increment using the
2-case Rumford data. A portion of the expectation surface (heavy solid line) is shown in
the response space together with the observed response $y$. Also shown is the projection
of $y - \eta(0.02)$ onto the secant plane joining $\eta(0.02)$ and $\eta(0.10)$ (solid
line). The tick marks indicate true positions on the expectation surface and linear
approximation positions on the secant plane. (Horizontal axis: $\eta_1$.)
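For this example, $T = \theta^2 - \theta^1 = 0.08$, so
$$\theta = \theta^1 + T\alpha = 0.02 + 0.08\alpha$$
and
$$l = \eta(\theta^1) + H\alpha
    = \begin{bmatrix} 124.62 \\ 90.83 \end{bmatrix}
    + \begin{bmatrix} -17.70 \\ -29.67 \end{bmatrix} \alpha$$
We now use linear least squares to project the residual vector
$$y - l(0) = \begin{bmatrix} 126 \\ 110 \end{bmatrix}
           - \begin{bmatrix} 124.62 \\ 90.83 \end{bmatrix}
           = \begin{bmatrix} 1.38 \\ 19.17 \end{bmatrix}$$
onto $l$ to obtain
$$\hat\alpha = (H^T H)^{-1} H^T (y - l(0))$$
For this example
$$\hat\alpha = -0.49$$
so the new value of $\theta$ is
$$\theta_{new} = 0.02 + T(-0.49) = 0.02 + 0.08(-0.49) = -0.019$$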
Evaluating the sum of squares at this point reveals that this new point is
farther from $y$ than either of the two starting points, and so a step factor
$\lambda$ is introduced to search along the increment vector to determine a better
point. Incorporating $\lambda$ as
$$\theta_{trial} = \theta_{new}\lambda + \theta_{old}(1 - \lambda)$$
gives, for this example,
$$\theta_{trial} = (-0.019)\lambda + 0.02(1 - \lambda)$$
and the minimum occurs at $\lambda = 0.5$ with $\theta_{trial} = 0.0005$. The point
$\theta^2 = 0.10$ is then replaced by $\theta^3 = 0.0005$ and the process is repeated
using the pair $(\theta^1, \theta^3)$.
In the general case of $P$ parameters, at the $i$th iteration we use the values
of $\eta(\theta)$ at $\theta_1^i, \theta_2^i, \ldots, \theta_{P+1}^i$ to determine the
secant plane as the $P$-dimensional plane which passes through $\eta(\theta_p^i)$,
$p = 1, \ldots, P+1$. For convenience, we assume that $\theta_{P+1}^i$ corresponds to
the point closest to $y$; we then determine the $P \times P$ matrix $T$ by setting
its $p$th column equal to $\theta_p^i - \theta_{P+1}^i$, and the $N \times P$ matrix
$H$ by setting its $p$th column equal to $\eta(\theta_p^i) - \eta(\theta_{P+1}^i)$.
Then, formally,
$$\hat\alpha = (H^T H)^{-1} H^T [y - \eta(\theta_{P+1}^i)]$$
$$\theta_{new} = \theta_{P+1}^i + T\hat\alpha$$
and
$$\theta_{trial} = \theta_{new}\lambda + \theta_{P+1}^i (1 - \lambda)$$
Note that Ralston and Jennrich (1978) allow the step factor to be negative, by
choosing $\lambda$ from a sequence of values $1, 1/2, -1/4, 1/8, -1/16, \ldots$. At
convergence the linear approximation parameter covariance matrix is given by
$s^2 T (H^T H)^{-1} T^T$, where $s^2$ is the usual variance estimate. Note that the
matrix $T$ may be ill conditioned by the time convergence is achieved and so the
linear approximation standard errors and correlations may not be reliable.
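
The general secant step can be sketched in a few lines of Python. This is an
illustration under stated assumptions rather than the DUD program itself: the
function names are hypothetical, the projection is computed from the normal
equations for brevity, and the test model is a straight line, chosen because
the secant plane then coincides with the expectation surface and a single step
must reach the least squares estimate.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def dud_step(eta, thetas, y):
    """One secant step; thetas holds P + 1 vectors, the last assumed closest to y."""
    base = thetas[-1]
    eta_base = eta(base)
    P = len(base)
    # Columns of T and H: differences of parameters and of expectation vectors.
    T_cols = [[t[j] - base[j] for j in range(P)] for t in thetas[:-1]]
    H_cols = [[e - eb for e, eb in zip(eta(t), eta_base)] for t in thetas[:-1]]
    resid = [yi - eb for yi, eb in zip(y, eta_base)]
    hth = [[sum(u * v for u, v in zip(H_cols[i], H_cols[j])) for j in range(P)]
           for i in range(P)]
    htr = [sum(u * r for u, r in zip(H_cols[i], resid)) for i in range(P)]
    alpha = solve(hth, htr)
    theta_new = [base[j] + sum(T_cols[p][j] * alpha[p] for p in range(P))
                 for j in range(P)]
    return theta_new, alpha


def trial_point(theta_new, theta_old, lam):
    """theta_trial = theta_new * lam + theta_old * (1 - lam)."""
    return [lam * n + (1.0 - lam) * o for n, o in zip(theta_new, theta_old)]


def step_factors(count):
    """First `count` terms of the sequence 1, 1/2, -1/4, 1/8, -1/16, ..."""
    return [(-1.0 if i >= 2 and i % 2 == 0 else 1.0) / 2 ** i for i in range(count)]


# Hypothetical straight-line model and data (exactly y = 1 + 2x).
xs = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]


def line(theta):
    return [theta[0] + theta[1] * x for x in xs]


# Base point (0.5, 1.0) plus two points displaced by 10% in each parameter.
thetas = [[0.55, 1.0], [0.5, 1.1], [0.5, 1.0]]
theta_new, alpha = dud_step(line, thetas, y)
print(theta_new)
print(trial_point([-0.019], [0.02], 0.5))  # the step-factor arithmetic of Rumford 6
```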

3.5.5 Removing Conditionally Linear Parameters

One way of simplifying a nonlinear regression problem is to eliminate
conditionally linear parameters. As mentioned in Sections 2.1 and 3.3.5, the
optimal values of the conditionally linear parameters, for fixed values of the
nonlinear parameters, can be determined by linear least squares. If we partition
the parameter vector $\theta$ into the conditionally linear parameters $\beta$ of
dimension $P_1$ and the nonlinear parameters $\phi$ of dimension $P_2$ with
$P = P_1 + P_2$, the expected responses can be written
$$\eta(\beta, \phi) = A(\phi)\beta$$
where the $N \times P_1$ matrix $A$ depends only on the nonlinear parameters. For any
value of $\phi$, the conditional estimate of $\beta$ is
$$\hat\beta(\phi) = A^+(\phi)\, y$$
where $A^+ = (A^T A)^{-1} A^T$ is the pseudoinverse of $A$. The associated expected
responses are
$$\hat\eta(\phi) = A(\phi) A^+(\phi)\, y$$
Golub and Pereyra (1973) formulated a Gauss-Newton algorithm to minimize
the reduced sum of squares function
$$\| y - A(\phi) A^+(\phi)\, y \|^2$$
that depends only on the nonlinear parameters. In particular, they give the
derivative of $A^+(\phi)$ with respect to $\phi$, which is the key ingredient in the
algorithm. The expression for this derivative is used in Chapter 4, where we
present a Gauss-Newton algorithm for multiresponse parameter estimation.
One difficulty with using projection over the conditionally linear
parameters is that additional information about the parameters must be given by
the user. The user must specify which parameters are conditionally linear as
well as specifying the derivatives of the entries of $A$ with respect to $\phi$.
This often results in more difficulty than simply ignoring the conditional
linearity. There are some structured problems, however, such as spline
regression with knot positions allowed to vary, as described in Jupp (1978),
where the division between conditionally linear and nonlinear parameters is
inherent in the specification of the problem, so the Golub-Pereyra method can be
used to advantage. These methods are discussed further in Kaufman (1975) and
Bates and Lindstrom (1986).
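
To make the idea of the reduced sum of squares concrete, here is a minimal
sketch for a hypothetical model $\eta = \beta_1 + \beta_2 e^{-\phi x}$ with two
conditionally linear parameters and one nonlinear parameter. The model, the
data, and the crude grid search over $\phi$ are assumptions for illustration
only; the Golub-Pereyra algorithm itself uses a Gauss-Newton iteration on
$\phi$, and the conditional fit here uses the normal equations for brevity.

```python
import math


def conditional_fit(phi, xs, ys):
    """Return (beta_hat, reduced sum of squares) for eta = b1 + b2 * exp(-phi * x)."""
    g = [math.exp(-phi * x) for x in xs]
    n = len(xs)
    # Normal equations for the two-column matrix A(phi) = [1, g].
    sg = sum(g)
    sgg = sum(v * v for v in g)
    sy = sum(ys)
    sgy = sum(v * y for v, y in zip(g, ys))
    det = n * sgg - sg * sg
    b1 = (sgg * sy - sg * sgy) / det
    b2 = (n * sgy - sg * sy) / det
    ss = sum((y - b1 - b2 * v) ** 2 for v, y in zip(g, ys))
    return (b1, b2), ss


# Hypothetical noise-free data generated with beta = (60, 70) and phi = 0.1.
xs = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
ys = [60.0 + 70.0 * math.exp(-0.1 * x) for x in xs]

# The search is over the single nonlinear parameter only.
grid = [0.01 * i for i in range(1, 51)]
phi_best = min(grid, key=lambda p: conditional_fit(p, xs, ys)[1])
beta_best, ss_best = conditional_fit(phi_best, xs, ys)
print(phi_best, beta_best, ss_best)
```

The optimization is carried out in one dimension instead of three, and the
linear parameters are recovered as a by-product of the conditional fit.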

3.6 Obtaining Convergence

Obtaining convergence is sometimes difficult. If you are having trouble, check
the following:

Is the expectation function correctly specified?
Is the expectation function correctly coded?
Are the derivatives correctly specified?
Are the derivatives correctly coded?
Are the data entered correctly?
Are all the observations reasonable?
Is the response variable correctly identified?
Do the starting values have the correct values?
Do the starting values correspond to the correct parameters?

If the answer to all these questions is yes, look carefully at the output
from the optimization program. Most good programs can produce detailed output
on each iteration to help find out what is wrong. Check to see that the initial
sum of squares, $S(\theta^0)$, is smaller than the sum of squares of the responses.
If not, then the fitted function is worse than no function and, in spite of your
checks, you probably have an incorrect expectation function, or incorrect data,
or incorrect starting values. You may even be trying to fit an $x$ variable
rather than the response $y$.


Look at the parameter values. Do the starting values have the correct
magnitudes? Correct signs? And are they assigned to the correct parameters?
Next, look at the parameter increments. Are they all of roughly the same
magnitude relative to the parameters? Does the increment, when added to the
parameter vector, place the parameter vector in a bad region in the parameter
space? For example, are any necessarily positive parameters driven negative?
Do any of the parameters become unreasonably large or small? If so, could
there be an error in the derivative functions? Try using numerical derivatives at
a few design points to check the analytic derivatives. Would different starting
values for some of the parameters help? Is there a transformation of the param-
eters which could help?
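
Two of the checks above lend themselves to automation: comparing the initial
sum of squares with the sum of squares of the responses, and comparing coded
derivatives with numerical ones. The following is a minimal sketch; the
function names, the exponential test model, and the deliberately wrong
derivative routine are hypothetical and serve only to show the checks at work.

```python
import math


def sum_sq(values):
    return sum(v * v for v in values)


def starting_fit_is_plausible(f, theta0, y):
    """True when S(theta0) is smaller than the sum of squares of the responses."""
    resid = [yi - fi for yi, fi in zip(y, f(theta0))]
    return sum_sq(resid) < sum_sq(y)


def derivative_discrepancy(f, jac, theta, rel_step=1e-6):
    """Largest absolute gap between coded derivatives and central differences."""
    coded = jac(theta)
    worst = 0.0
    for p in range(len(theta)):
        h = rel_step * (abs(theta[p]) if theta[p] != 0.0 else 1.0)
        up, down = list(theta), list(theta)
        up[p] += h
        down[p] -= h
        for n, (a, b) in enumerate(zip(f(up), f(down))):
            worst = max(worst, abs((a - b) / (2.0 * h) - coded[n][p]))
    return worst


# Hypothetical model and data for illustration.
xs = [0.5, 1.0, 2.0, 4.0]


def model(theta):
    return [theta[0] * math.exp(-theta[1] * x) for x in xs]


def good_jac(theta):
    return [[math.exp(-theta[1] * x), -theta[0] * x * math.exp(-theta[1] * x)]
            for x in xs]


def bad_jac(theta):
    # Sign error in the second column, a typical coding slip.
    return [[math.exp(-theta[1] * x), theta[0] * x * math.exp(-theta[1] * x)]
            for x in xs]


y = model([2.0, 0.5])
theta0 = [1.5, 0.4]
print(starting_fit_is_plausible(model, theta0, y))
print(derivative_discrepancy(model, good_jac, theta0))
print(derivative_discrepancy(model, bad_jac, theta0))
```

A large discrepancy points to an error in the derivative code, while a failed
plausibility check points to the model, the data, or the starting values.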
Sometimes convergence is not achieved because the model has too many
parameters. Look at the parameter values to see whether any of them are being
driven to extreme values corresponding to a simpler model function. Also look
at combinations of the parameter increments to see, for example, if pairs of them
tend to move together, suggesting collinearity or possibly overparametrization.
If there is a suspicion of overparametrization, try simplifying the expectation
function, even temporarily-it may be that a simpler model will produce better
parameter estimates, so that eventually the full model can be fitted.
Check to see that there are enough data in all regions of the design space
so that valid parameter estimates can be obtained. For example, when fitting a
transition type model in which there is, say, linear behavior to the left of a point
and different linear behavior to the right (Bacon and Watts, 1971; Hinkley,
1969; Watts and Bacon, 1974), it is often the case that there are lots of data
values to define the behavior away from the join point, but not many near the
join point. In this situation, the parameter which describes the sharpness of tran-
sition will be poorly estimated, and so convergence may be slow.
When dealing with a comprehensive model which involves combining
data from several experiments, it is generally good practice to fit each data set
with a possibly simpler restricted model, and gradually extend the model by in-
corporating more data sets and parameters. An example of this is given in
Ziegel (1985). Conversely, a large data set which has several reasonably dis-
tinct operating regions can be blocked into small subsets on that basis, so that a
reduced model can be fitted to each subset and the results used to provide start-
ing estimates for a model for the full data set, as illustrated below.

Example: Lubricant 1
To illustrate the process of getting starting values and obtaining
convergence for a complicated nonlinear model function, we consider data on the
kinematic viscosity of a lubricant as a function of temperature ($x_1$) and
pressure ($x_2$). The data, discussed in Linssen (1975), are reproduced in
Appendix 1, Section A1.8, and plotted in Figure 3.4. The model function is
$$f(x, \theta) = \frac{\theta_1}{\theta_2 + x_1} + \theta_3 x_2 + \theta_4 x_2^2
  + \theta_5 x_2^3 + (\theta_6 + \theta_7 x_2^2)\, x_2
  \exp\left( \frac{-x_1}{\theta_8 + \theta_9 x_2^2} \right)$$

Figure 3.4 Plot of the logarithm of the kinematic viscosity of a lubricant versus
pressure for four different temperatures. (Horizontal axis: Pressure (atm), 0 to
6000; plotting symbols distinguish the temperatures 0, 25.0, 37.8, and 98.9 °C.)

To begin, we note that six of the nine parameters are conditionally
linear, which is most helpful. Also, to improve conditioning, as discussed
in Section 3.4.1, we scale the pressure data $x_2$ by dividing by 1000 and
avoid confusion by writing $w_2 = x_2/1000$.
To obtain a starting estimate for $\theta_2$, we use the data for $w_2 = 0.001$
and assume that for this low value of scaled pressure, the model is a function
of $x_1$ only. Taking reciprocals and using linear least squares as
described in Section 3.3.3 gives $\theta_1^0 = 983$ and $\theta_2^0 = 192$.
Now we exploit the conditional linearity in the model because we
can use linear least squares to obtain starting estimates for the remaining
parameters once we have reasonable estimates for $\theta_8$ and $\theta_9$. Thus we
concentrate on getting estimates for only these two. We simplify the situation
even more by assuming that when $w_2$ is small, the model function is
essentially linear in $w_2$, so that
$$f(x, \theta) \approx \frac{\theta_1}{\theta_2 + x_1} + \theta_3 w_2
  + \theta_6 w_2\, e^{-x_1/\theta_8}$$

That is, we can ignore the terms involving $\theta_4$, $\theta_5$, $\theta_7$, and
$\theta_9$. By examining the plot, we see that the data for each temperature follow
quite straight lines for $w_2 < 2$, so we choose this for the range. Also, for
fixed values of $x_1$, the leading term and the exponential term are constant
and we may rearrange the model as
$$y' = \theta_3 w_2 + \theta_6 w_2 g = w_2 p$$
where
$$y' = y - \frac{983}{192 + x_1}, \qquad g = e^{-x_1/\theta_8}$$
and
$$p = \theta_3 + \theta_6 g$$
Regressing $y'$ on $w_2$ for each of the four temperatures 0, 25, 37.8,
98.9 gives $p$ values of 1.57, 1.49, 1.39, 1.37. We now use the $p$ values and
the relation $g = e^{-x_1/\theta_8}$ to obtain estimates for $\theta_3$ and $\theta_6$
by noting that when $x_1 = 0$ we have $g = 1$, so $p = \theta_3 + \theta_6$, and
when $x_1 \to \infty$, $p = \theta_3$. We therefore estimate the sum of the two
parameters as 1.57 (the value of $p$ at $x_1 = 0$), and assuming that the lower
asymptote has almost been reached at the highest temperature, we choose
$\theta_3$ to be 1.35, which is a bit smaller than 1.37 (the value of $p$ at
$x_1 = 98.9$). The value for $\theta_6$ is then estimated as $1.57 - 1.35 = 0.22$.
Finally, since $p = \theta_3 + \theta_6 g$, so that $(p - \theta_3)/\theta_6 = g$, we
regressed
$$\ln\left[ \frac{p - \theta_3}{\theta_6} \right]$$
on $x_1$ to give $\theta_8 = 35.5$.


Using these parameter estimates for $\theta_2$ and $\theta_8$, we performed a
nonlinear regression on the data for small $w_2$ values to get more refined
estimates, $\theta_2 = 202$ and $\theta_8 = 35.90$. We then used these estimates for
$\theta_2$ and $\theta_8$, and the data for all $w_2$ values, to estimate all the
parameters with $\theta_9 = 0$. The new values were $\theta_2 = 209$ and
$\theta_8 = 47.55$. Finally, we used these values plus the starting value
$\theta_9 = 0$ to converge on the full model. The final parameter estimates were
$$\hat\theta = (1053, 206.1, 1.464, -0.259, 0.0224, 0.398, 0.09354, 56.97, -0.463)^T$$
with a residual sum of squares of 0.08996.
3.7 Assessing the Fit and Modifying the Model


In any nonlinear analysis, it is necessary to assess the fit of the model to the data
and to assess the appropriateness of the assumptions about the disturbances. To
do so, we use the same techniques as in linear regression, namely sensibleness
of parameter values, comparison of mean squares and extra sums of squares,
and plots of residuals. If there are any inadequacies in the model, or if any of
the assumptions do not seem to be appropriate, then the model must be modified
and the analysis continued until a satisfactory result is obtained.
In nonlinear estimation, it is possible to converge to parameter values
which are obviously, or perhaps suspiciously, wrong. This is because we may
have converged to a local minimum, or got stalled because of some awkward
behavior of the expectation surface. Assessment of any fitted model should
therefore begin with a careful consideration of the parameter estimates and
whether they make sense scientifically. If the parameters do not make sense,
check to see that the correct starting values were used. Also check to see that
the program did not simply terminate due to lack of progress or too many itera-
tions, but that convergence was actually achieved. One should also scan the
iteration progress information to see if convergence occurred smoothly. Some
programs have special facilities for fixing some parameters while allowing oth-
ers to vary, and others have poor convergence criteria. It is incumbent on the
user to understand fully the program being used and to appreciate its idiosyn-
crasies. “Caveat emptor” is as true for nonlinear estimation packages as it is for
anything else in life.
If the program has proceeded smoothly to an apparently legitimate con-
vergence point, but the parameters are not reasonable, check the expectation
function and its coding, the derivatives and their codings, the starting values,
and the data, as in Section 3.6. Is the response variable correctly specified? Are
the residuals well behaved?
If these checks are satisfactory but the parameter vector is not, try a fairly
different starting vector. If you then converge to the same point, it may be that
the data are trying to tell you that the expectation function is not appropriate. At
this stage it may well be helpful to discuss things with the researcher or a col-
league; as often happens, in the course of such a discussion you may discover a
simple, “obvious” error.
When convergence to reasonable values has been reached, check the
parameter approximate standard errors and approximate t ratios [calculated as
(parameter estimate) / (approximate standard error)]. If a t ratio is not
significant, consider deleting that parameter from the expectation function and
refitting the model, as discussed more fully in Section 3.10.
Generally, the simpler the model the better (Ockham’s razor: see quota-
tion, p.1).
Also check the parameter approximate correlation matrix to see whether
any parameters are excessively highly correlated, since high correlations may
indicate overparametrization (a model which is too complicated for the data set).
Exactly what constitutes a “high” correlation is somewhat dependent on the type
of data and model being considered. In general, correlations above 0.99 in
absolute value should be investigated. Try simplifying the expectation function in
a scientifically sensible way or transforming the variables and parameters to
reduce collinearities (Section 3.4). For further discussion on simplifying
models, see Section 3.8. Further information on variability of parameter esti-
mates and nonlinear dependencies between parameter estimates can be obtained
using the techniques of Chapter 6.
When a simple, adequate expectation function has been found, a plot of
the fitted values overlaid with the observed responses is an excellent way to as-
sess the fit. Plots of the residuals versus the fitted values and the control vari-
ables are also powerful aids. The residuals should also be plotted against other,
possibly lurking factors to help detect model inadequacies. For further discus-
sion, see Draper and Smith (1981), Joiner (1981), or Cook and Weisberg (1983).
Particular attention should be paid to whether the residuals have a uniform
spread, since any systematic change in spread is suggestive of nonconstant variance.
If there is nonconstant variance, consider transforming the data to induce con-
stant variance, and transforming the model function to maintain the integrity of
the model, possibly using the approach in Carroll and Ruppert (1984) to optim-
ize the transformation parameter, or try using weighted least squares.
Nonrandom behavior of the residuals, as evidenced by plots of the residu-
als against the regressor variables or other variables, tends to indicate lack of
adequacy of the expectation function. In such cases, try expanding the model in
a scientifically sensible way to eliminate the nonrandom behavior. For example,
add “incremental” parameters to account for differences between subjects or
days, or between groups of subjects or days, as discussed in Section 3.10. When
dealing with sums of exponentials, perhaps add a constant term to allow for de-
cay to a nonzero asymptote.
Probability plots of the residuals should be made to verify the normal as-
sumption about the disturbances. If there is pronounced lack of normality, try to
decide whether it is due to a small number of outliers or whether it is due to
inadequacy of the expectation function. For obvious outliers, check that the data
have been correctly recorded and correctly entered into the computer. If they
have been correctly entered, discuss with the experimenter the propriety of
deleting them. Perhaps there are good nonstatistical reasons for removing
them-for example, a contaminated sample. If you are considering such edit-
ing, it may be helpful to present the experimenter with information concerning
the influence of the possible outliers, such as parameter estimates, standard devi-
ations, fitted values, and residual mean squares with and without the suspicious
data points.
If the residuals are clearly nonnormal, consider transforming the data and
the model (Carroll and Ruppert, 1984) or changing the criterion from least
squares to a “robust” estimation criterion (Huber, 1981). Note, however, that
the use of criteria other than least squares will usually require special software.
Assessment of adequacy of the expectation function is easier if there are
replications, because it will have been possible to check for, or transform to,
stable variance before fitting the model. Replications also allow one to test for
lack of fit of the model by comparing the ratio of the lack of fit mean square
to the replication mean square with the appropriate $F$ distribution value, as
discussed in Sections 1.3.2, 3.10, and 3.12.
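
The lack of fit ratio can be sketched as follows, assuming that replications
are identified by identical values of the control variable and that the fitted
values come from a model with `n_params` parameters. The function name and the
small data set are hypothetical; the resulting ratio would then be compared
with an $F$ value on the returned degrees of freedom.

```python
from collections import defaultdict


def lack_of_fit_ratio(x, y, fitted, n_params):
    """Return (F ratio, lack-of-fit df, replication df) from replicated data."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)  # replications share the same x value

    # Replication (pure error) sum of squares: scatter about each group mean.
    rep_ss = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        rep_ss += sum((v - mean) ** 2 for v in values)
    rep_df = len(y) - len(groups)

    # Lack of fit is what remains of the residual sum of squares.
    resid_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    lof_ss = resid_ss - rep_ss
    lof_df = len(groups) - n_params

    ratio = (lof_ss / lof_df) / (rep_ss / rep_df)
    return ratio, lof_df, rep_df


# Hypothetical replicated data and fitted values from a one-parameter model.
x = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
y = [1.0, 1.2, 2.1, 1.9, 2.9, 3.1]
fitted = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
ratio, lof_df, rep_df = lack_of_fit_ratio(x, y, fitted, n_params=1)
print(ratio, lof_df, rep_df)
```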

3.8 Correlated Residuals

Whenever time or distance is involved as a factor in a regression analysis, it is
prudent to check the assumption of independent disturbances. Correlation of the
disturbances can be detected from a time series plot of the residuals versus time
(or order of the experiments) or from a lag plot of the residual on the $n$th case
versus the residual on the $(n-1)$th case. Tendencies for the residuals to stay
positive or negative in runs on the time series plot, or nonrandom scatter of the
residuals when plotted on the lag plot, can reveal nonindependence or
correlation of the disturbances.

Example: Chloride 1
Sredni (1970) analyzed data on chloride ion transport through blood cell
walls. The data, derived from Sredni's thesis, are listed in Appendix 1,
Section A1.9, and plotted in Figure 3.5. The observation $y_n$ gives the
chloride concentration (in percent) at time $x_n$ (in minutes).

Figure 3.5 Plot of chloride concentration versus time for the chloride transport data.
(Horizontal axis: Time (min).)

The model function
$$f(x_n, \theta) = \theta_1 (1 - \theta_2 e^{-\theta_3 x_n})$$
was derived from the theory of ion transport, where $\theta_1$ represents the final
percentage concentration of chlorine, $\theta_3$ is a rate constant, and $\theta_2$
accounts for the unknown initial and final concentrations of the chlorine and
the unknown initial reaction time. As usual, it was assumed that the
disturbances had zero mean and constant variance and were independent.
An initial estimate for $\theta_1$ was obtained by extrapolating the data to
large time, giving $\theta_1^0 = 35$. Dividing $y_n$ by $\theta_1^0$ and linearizing
the equation by rearranging terms and taking logarithms allowed us to estimate
the remaining parameters by linear regression, to give
$\theta^0 = (35, 0.91, 0.22)^T$. Convergence was obtained to
$\hat\theta = (39.09, 0.828, 0.159)^T$ with a residual sum of squares of 1.88. A
time series plot of the residuals, shown in Figure 3.6a, shows runs in the
residuals. Similarly, the lag plot shown in Figure 3.6b shows positive
correlation. We are thus alerted to the possibility that the disturbances are
not independent, or that there is some deficiency in the form of the
expectation function.
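
The linearization used for the starting values can be sketched as follows.
Rearranging the model gives $\ln(1 - y_n/\theta_1) = \ln\theta_2 - \theta_3 x_n$,
a straight line in $x_n$ once $\theta_1$ is fixed. The function name is
hypothetical and the data are synthetic, generated without noise from assumed
parameter values, not the chloride data of Appendix 1.

```python
import math


def linearized_start(xs, ys, theta1_0):
    """Starting values for theta1 * (1 - theta2 * exp(-theta3 * x)) given theta1_0."""
    # ln(1 - y/theta1) = ln(theta2) - theta3 * x; keep only points below theta1_0.
    pts = [(x, math.log(1.0 - y / theta1_0)) for x, y in zip(xs, ys) if y < theta1_0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_z = sum(z for _, z in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in pts)
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return [theta1_0, math.exp(intercept), -slope]


# Synthetic, noise-free data from assumed values theta = (35, 0.9, 0.2).
xs = [2.5 + 0.5 * i for i in range(12)]
ys = [35.0 * (1.0 - 0.9 * math.exp(-0.2 * x)) for x in xs]

start = linearized_start(xs, ys, theta1_0=35.0)
print(start)
```

With real data the extrapolated value of $\theta_1$ is only approximate, so the
values returned are starting estimates to be refined by nonlinear least
squares.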

Figure 3.6 Time series plot (a) and lag plot (b) of the residuals from the fit to
the chloride data.

When the disturbances are not independent, the model for the observations
must be altered to account for dependence. Common forms for dependence, or
autocorrelation, of disturbances are moving average or autoregressive
models of variable order (Box and Jenkins, 1976). Simple examples of such
forms are a moving average process of order 1 where
$$Z_n = \epsilon_n - \theta_1 \epsilon_{n-1}$$
or an autoregressive process of order 1 where
$$Z_n = \epsilon_n + \phi_1 Z_{n-1}$$
and the $\epsilon_n$, $n = 1, 2, \ldots, N$, are independent random disturbances with
zero mean and constant variance, or more simply, white noise.
In regression situations, when the data are equally spaced in time, it is
relatively easy to determine an appropriate form for the dependence of the
disturbances by calculating and plotting the residual autocorrelation function,
$$r_k = \sum_{n=k+1}^{N} \frac{\hat z_n \hat z_{n-k}}{N s^2}, \qquad k = 1, 2, \ldots$$
versus the lag $k$. In the definition of $r_k$, $s^2$ is the variance estimate, and
the residuals are assumed to have zero average. The residual autocorrelation
function is usually calculated out to $k = N/5$. If the residual autocorrelation
function is consistently within the range $\pm 2/\sqrt{N}$ after lag 2 or 3, then
the model may be identified as a moving average process of order 1 or 2. If the
residual autocorrelation function tends to decay gradually to zero, then the
process may be identified as an autoregressive process. Alternatively, to
determine the order of the autoregressive process, it may be necessary to
calculate the partial autocorrelation function (Box and Jenkins, 1976). For
regression situations where time is not the only factor, or the most important
factor, first order autoregressive processes are often adequate.
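
The residual autocorrelation function is straightforward to compute. The
sketch below follows the definition of $r_k$ given above; the function names
and the short residual vector are hypothetical, and `s2` must be supplied by
the caller as the variance estimate from the fit.

```python
import math


def residual_acf(resid, s2, max_lag=None):
    """r_k = sum over n = k+1..N of z_n * z_(n-k), divided by N * s2, for k = 1..max_lag."""
    N = len(resid)
    if max_lag is None:
        max_lag = N // 5  # the usual range of lags, about N/5
    return [sum(resid[n] * resid[n - k] for n in range(k, N)) / (N * s2)
            for k in range(1, max_lag + 1)]


def white_noise_limit(N):
    """Half-width of the approximate 95% band, 2 / sqrt(N)."""
    return 2.0 / math.sqrt(N)


# Hypothetical alternating residuals with zero average, for illustration.
resid = [1.0, -1.0, 1.0, -1.0]
acf = residual_acf(resid, s2=1.0, max_lag=2)
print(acf, white_noise_limit(len(resid)))
```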

Example: Chloride 2
The residual autocorrelation function for the chloride data was calculated
and plotted as in Figure 3.7. The correlation estimates decay towards zero,
falling within the limits $\pm 2/\sqrt{N}$ (shown as dotted horizontal lines) quite
quickly. On the basis of this plot, it was decided that a first order
autoregressive process would adequately model the dependence in the residuals.

Figure 3.7 Autocorrelation function of the residuals from the original nonlinear least
squares fit to the chloride data. The dotted lines enclose the interval in which
approximately 95% of the correlations would be expected to lie if the true correlations
were 0. (Horizontal axis: Lag.)

The model to be fitted is now of the form $Y_n = f(x_n, \theta) + Z_n$, where
$Z_n = \epsilon_n + \phi Z_{n-1}$. To estimate the parameters $\theta$ and $\phi$, we
reduce the problem to an ordinary nonlinear least squares problem by
subtracting $\phi$ times the equation for $Y_{n-1}$ from $Y_n$, as
$$Y_n - \phi Y_{n-1} = f(x_n, \theta) - \phi f(x_{n-1}, \theta) + Z_n - \phi Z_{n-1}$$
or
$$Y_n = \phi Y_{n-1} + f(x_n, \theta) - \phi f(x_{n-1}, \theta) + \epsilon_n$$
Starting values for $\theta$ were taken from $\hat\theta$ above, and the starting
value for $\phi$ was taken as the lag one correlation estimate, $r_1 = 0.67$.
Convergence was obtained to $(\hat\theta^T, \hat\phi) = (37.58, 0.849, 0.178, 0.69)$
with a residual sum of squares of 0.98. The residuals $\hat\epsilon$ from this fit
are well behaved, as shown in Figure 3.8, and the residual autocorrelation
function, shown in Figure 3.9, was uniformly small.
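
The reduction to an ordinary least squares problem amounts to computing the
residuals $\epsilon_n$ of the rearranged equation for $n = 2, \ldots, N$ and
minimizing their sum of squares over $\theta$ and $\phi$ jointly. The sketch
below shows only the residual calculation; the function names are
hypothetical, the data are synthetic, and the disturbances are made up so that
the calculation can be checked exactly.

```python
import math


def chloride_model(x, theta):
    """f(x, theta) = theta1 * (1 - theta2 * exp(-theta3 * x))."""
    return theta[0] * (1.0 - theta[1] * math.exp(-theta[2] * x))


def ar1_residuals(theta, phi, xs, ys, f=chloride_model):
    """epsilon_n = Y_n - phi*Y_(n-1) - f(x_n) + phi*f(x_(n-1)) for n = 2..N."""
    return [ys[n] - phi * ys[n - 1] - f(xs[n], theta) + phi * f(xs[n - 1], theta)
            for n in range(1, len(xs))]


# Synthetic check: build AR(1) disturbances from made-up white noise values.
theta = [37.58, 0.849, 0.178]  # converged values quoted in the example
phi = 0.69
xs = [2.5 + 0.5 * i for i in range(6)]
eps = [0.10, -0.20, 0.05, 0.00, -0.10, 0.15]  # hypothetical white noise

z = [eps[0]]
for n in range(1, len(eps)):
    z.append(eps[n] + phi * z[n - 1])
ys = [chloride_model(x, theta) + zn for x, zn in zip(xs, z)]

recovered = ar1_residuals(theta, phi, xs, ys)
print(recovered)
```

At the true parameter values the function returns the white noise terms
themselves (from the second observation onward), which is why ordinary least
squares applies to the transformed problem.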

In general, as in the above example, the main effect of accounting for
dependence is to reduce the residual variance and reduce the correlation in the
residuals: the model parameter estimate $\hat\theta$ does not change much. However,
the model parameters are better estimated because they have smaller standard
errors and because the method of least squares has been applied correctly, since
the assumptions are satisfied. For a more complicated application of this
technique, see Watts and Bacon (1974).

Figure 3.8 Plots of the residuals $\hat\epsilon$ from the nonlinear least squares fit to
the chloride data using $\phi = 0.69$. The residuals are plotted as a time series in part a
and as a lag plot in part b. (Horizontal axis of part a: Observation number.)

Figure 3.9 Autocorrelation function of the residuals from the nonlinear least squares fit
to the chloride data using $\phi = 0.69$. The dotted lines enclose the interval in which
approximately 95% of the correlations would be expected to lie if the true correlations
were 0. (Horizontal axis: Lag.)

3.9 Accumulated Data

In some studies, when it is impractical to measure instantaneous concentrations,
accumulated responses are recorded.

Example: Ethyl acrylate 1


An experiment to study the metabolism of ethyl acrylate was performed by
giving rats a single bolus of radioactively tagged ethyl acrylate. Each rat
was given a measured dose of the compound via stomach intubation and
placed in an enclosed cage from which the air could be drawn through a
bubble chamber. The exhaled air was bubbled through the chamber, and at
a specified time the bubble chamber was replaced by a fresh one, so that
the measured response was the accumulated CO$_2$ during the time interval.
Preliminary analysis of the data revealed that normalizing each animal's
response by dividing by the actual dose received would permit combination
of the data so that a single model could be fitted to the data for all the rats.
Furthermore, the variability in the normalized data was such that it was
necessary to take logarithms of the data to produce constant variance across
the time points. The starting points and lengths of the accumulation inter-
vals and the averages for the nine rats, normalized by actual dose, are given
in Appendix 1, Section A1.10 (Watts, deBethizy, and Stiratelli, 1986), and
the cumulative CO$_2$ data are plotted in Figure 3.10.

Figure 3.10 Plot of cumulative exhaled CO$_2$ amounts versus collection interval end
point for the ethyl acrylate data. (Horizontal axis: Time (hours).)

Two methods for the analysis of such data were given in Renwick (1982).
The first method uses peeling of the “approximate concentration” data obtained
by dividing the accumulated amount by the accumulation time interval. The
second method uses the cumulative total, extrapolated to infinite time, and then
peeling of the differences [extrapolated − (cumulative total)]. This is called the
“sigma-minus” method.
We do not recommend either of these methods, and specifically decry use
of the sigma-minus method because it is so sensitive to variations in the extra-
polated value. It can be shown, for example, that small percentage changes in
the extrapolated value, say less than 2%, can cause changes in the rate constants
in excess of 100%. Furthermore, both methods are based on peeling, which re-
quires excessive subjective judgement. Instead of the abovementioned methods,
we recommend direct analysis of the accumulated data using integrated
responses as described below. In addition to avoiding the disadvantages of the
other methods, this method has the advantage that it provides measures of preci-
sion of the estimates in the form of parameter approximate standard errors and
correlations.

3.9.1 Estimating the Parameters by Direct Integration

Suppose that the theoretical response to the input stimulus at time $t$ is $f(t,\theta)$.
Then the accumulated output in the interval $t_{n-1}$ to $t_n$ is

$$F_n = \int_{t_{n-1}}^{t_n} f(t,\theta)\,dt$$

We therefore use the integrated function values $F_n$ and the observed accumulated
data pairs $(y_n, t_n)$ to estimate the parameters. We rewrite the model in terms
of the factors $x_{1n} = t_{n-1}$, the start of the interval, and $x_{2n} = t_n - t_{n-1}$,
the length of the interval, so the model for the amount accumulated in an interval
is $F(x_n,\theta)$, where $x_n = (x_{1n}, x_{2n})^T$.
To determine a tentative form for $f(t,\theta)$, we plot the approximate rates
$y_n/x_{2n}$ versus $x_{1n} + x_{2n}/2$ on semilog paper and use peeling to obtain
starting estimates for the parameters. The final estimation is done using nonlinear least
squares. Note that if the variance is not constant, it may be necessary to
transform the data and the function, as in the following example.
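The derived quantities used for the semilog plot are simple to compute. The following is a minimal sketch, assuming the first four collection intervals of the ethyl acrylate data as listed in Table 3.1; the function name `approximate_rates` is illustrative and not part of the original text.

```python
# First four collection intervals of the ethyl acrylate data (Table 3.1):
# interval start x1 (hr), interval length x2 (hr), accumulated normalized CO2 y.
starts = [0.0, 0.25, 0.5, 0.75]
lengths = [0.25, 0.25, 0.25, 0.25]
amounts = [0.01563, 0.04190, 0.05328, 0.05226]


def approximate_rates(starts, lengths, amounts):
    """Return (midpoint, rate) pairs: x1 + x2/2 and y/x2 for each interval."""
    return [(x1 + x2 / 2.0, y / x2)
            for x1, x2, y in zip(starts, lengths, amounts)]


points = approximate_rates(starts, lengths, amounts)
for midpoint, rate in points:
    # These are the values one would plot on semilog paper before peeling.
    print(f"{midpoint:6.3f}  {rate:.4f}")
```

The approximate rates serve only to suggest a form for the rate function and to supply starting estimates; the final fit uses the integrated model.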

Example: Ethyl acrylate 2


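The CO2 data are reproduced in Table 3.1 together with the derived quantities
(interval midpoint $x_{1n} + x_{2n}/2$ and approximate rate $y_n/x_{2n}$), which are
plotted in Figure 3.11. We can see from the figure that an appropriate
model for the data involves three exponentials (two to account for the peak,
and another to account for the change in slope of the decay from the peak).
Because the radioactivity prior to injection must be zero, the concentration
at $t = 0$ must be zero. A plausible model for the concentration at time $t$ is
therefore

$$f(t,\theta) = -(\theta_4 + \theta_5)e^{-\theta_1 t} + \theta_4 e^{-\theta_2 t} + \theta_5 e^{-\theta_3 t}$$

An appropriate model for the accumulated data in the collection interval
starting at $x_1 = t_{n-1}$ and of length $x_2$ is then

$$F(x,\theta) = \int_{x_1}^{x_1 + x_2} f(t,\theta)\,dt$$

or

$$F(x,\theta) = -\frac{\theta_4 + \theta_5}{\theta_1} e^{-\theta_1 x_1}\left(1 - e^{-\theta_1 x_2}\right) + \frac{\theta_4}{\theta_2} e^{-\theta_2 x_1}\left(1 - e^{-\theta_2 x_2}\right) + \frac{\theta_5}{\theta_3} e^{-\theta_3 x_1}\left(1 - e^{-\theta_3 x_2}\right)$$
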
Because of the nonconstant variance, the logarithms of F were fitted to the



Table 3.1 Collection intervals and averages of normalized exhaled CO2 for the
ethyl acrylate data together with the derived quantities: interval midpoint
and approximate rate.

  Collection Interval (hr)               Derived Quantities
  Start     Length     CO2        Interval      Approx.
  x1        x2         (g)        Midpoint      Rate
   0.0       0.25      0.01563      0.125       0.0625
   0.25      0.25      0.04190      0.375       0.1676
   0.5       0.25      0.05328      0.625       0.2131
   0.75      0.25      0.05226      0.875       0.2090
   1.0       0.5       0.08850      1.25        0.1770
   1.5       0.5       0.06340      1.75        0.1268
   2.0       2.0       0.13419      3.0         0.0671
   4.0       2.0       0.04502      5.0         0.0225
   6.0       2.0       0.02942      7.0         0.0147
   8.0      16.0       0.02716     16.0         0.0017
  24.0      24.0       0.01037     36.0         0.0004
  48.0      24.0       0.00602     60.0         0.0003

Figure 3.11 Approximate CO2 exhalation rate versus collection interval midpoint
(hours) for the ethyl acrylate data.

logarithms of the data. The results of this analysis together with the start-
ing estimates are presented in Table 3.2.
In an analysis of the logarithmic data for the individual rats, due at-
tention was paid to the behavior of the residuals. The triple rate constant
model fitted the data very well. ■

Table 3.2 Parameter summary for the 3-exponential model fitted to the ethyl
acrylate data.

                          Nonlinear Least Squares
                                       Approx.
  Parameter    Start      Estimate     Std. Err.
  θ1           4.461      3.025        0.752
  θ2           0.571      0.481        0.038
  θ3           0.0434     0.0258       0.0096
  θ4           0.355      0.310        0.049
  θ5           0.0034     0.0011       0.0005
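As an illustration of the direct integration approach, the sketch below evaluates the integrated three-exponential model in closed form and checks it against numerical quadrature. It assumes the rate function is a sum of three decaying exponentials constrained to be zero at t = 0, and it uses the estimates reported in Table 3.2. The helper names (`rate`, `accumulated`, `simpson`) are illustrative and do not come from the original text.

```python
import math


def rate(t, theta):
    """Three-exponential rate f(t, theta), constrained so that f(0) = 0."""
    th1, th2, th3, th4, th5 = theta
    return (-(th4 + th5) * math.exp(-th1 * t)
            + th4 * math.exp(-th2 * t)
            + th5 * math.exp(-th3 * t))


def accumulated(x1, x2, theta):
    """Closed-form integral of rate() over the interval [x1, x1 + x2]."""
    th1, th2, th3, th4, th5 = theta

    def term(coef, k):
        # Integral of coef * exp(-k t) from x1 to x1 + x2.
        return coef / k * math.exp(-k * x1) * (1.0 - math.exp(-k * x2))

    return term(-(th4 + th5), th1) + term(th4, th2) + term(th5, th3)


def simpson(func, a, b, steps=2000):
    """Composite Simpson rule (steps must be even); used only as a check."""
    h = (b - a) / steps
    total = func(a) + func(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0


theta_hat = (3.025, 0.481, 0.0258, 0.310, 0.0011)  # estimates from Table 3.2

# Collection interval starting at 2.0 hr with length 2.0 hr (Table 3.1).
closed = accumulated(2.0, 2.0, theta_hat)
numeric = simpson(lambda t: rate(t, theta_hat), 2.0, 4.0)

# Residual on the logarithmic scale, the scale on which the model was fitted.
log_residual = math.log(0.13419) - math.log(closed)
print(closed, numeric, log_residual)
```

In a full analysis, `math.log(accumulated(...))` would be the expectation function passed to a nonlinear least squares routine, with the logarithms of the observed amounts as the response.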

Example: Saccharin 1
As a second example of treating accumulated data, we analyze the saccha-
rin data in Renwick (1982). In this experiment, the measured response was
the amount of saccharin accumulated in the urine of a rat after receiving a
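single bolus of saccharin. The data are recorded in Appendix 1, Section
A1.11, and plotted in Figure 3.12.
The function involved only two rate constants, and the response was
modeled as

$$f(t,\theta) = \theta_3 e^{-\theta_1 t} + \theta_4 e^{-\theta_2 t}$$

so

$$F(x,\theta) = \frac{\theta_3}{\theta_1} e^{-\theta_1 x_1}\left(1 - e^{-\theta_1 x_2}\right) + \frac{\theta_4}{\theta_2} e^{-\theta_2 x_1}\left(1 - e^{-\theta_2 x_2}\right)$$
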
As in the ethyl acrylate example, the integrated model was fitted to the log-
arithms of the accumulated data to stabilize variance.
The curve peeling and the sigma-minus method results from
Renwick (1982) are given in columns 2 and 3 of Table 3.3, and the results
using the direct integration method are given in column 4. Note the consid-
erable differences between the results based on peeling and those obtained
by nonlinear least squares. Note too, that the peeling and sigma-minus

Figure 3.12 Plot of cumulative excreted amount versus collection interval end point
(time in minutes) for the saccharin data.

Table 3.3 Parameter summary for the saccharin data, comparing estimates
obtained using the sigma-minus method, using the approximate rate method, and
using nonlinear least squares to fit the integrated response function.

                        Estimate by
                                            Nonlinear Least Squares
                                                        Approx.
  Parameter    Peeling(a)    Sigma-Minus(a)    Value    Std. Err.
  θ1           0.0710        0.0833            0.122    0.031
  θ2           0.0234        0.0255            0.0279   0.003
  θ3           830           932               1345     249
  θ4           270           314               402      98

  (a) From Renwick (1982).
methods do not provide parameter standard errors.
There were two very large residuals from the nonlinear least squares
fit, at $x_1 = 5$ and $x_1 = 105$. A second analysis was done by simply combining
the observations at $x_1 = 5$ and $x_1 = 15$ and at $x_1 = 90$ and $x_1 = 105$, as
shown in Table 3.4. The residuals from this fit were very well behaved,

Table 3.4 Collection intervals and excreted amounts for original and combined
saccharin data.

          Original                            Combined
  Collection Interval (hr)            Collection Interval (hr)
  Start   Length   Saccharin          Start   Length   Saccharin
  x1      x2       (μg)               x1      x2       (μg)
    0       5       7518                0       5        7518
    5      10       6275                5      25       11264
   15      15       4989
   30      15       2580               30      15        2580
   45      15       1485               45      15        1485
   60      15        861               60      15         861
   75      15        561               75      15         561
   90      15        363               90      30         663
  105      15        300

and the residual variance was reduced to 0.0071 from 0.0158. The parameter
estimates (standard errors in parentheses) were $\hat\theta_1 = 0.154\ (0.035)$,
$\hat\theta_2 = 0.030\ (0.002)$, $\hat\theta_3 = 1506\ (233)$, and $\hat\theta_4 = 472\ (70)$.

3.10 Comparing Models

In some situations there may be more than one function which could be used as
a model. For example, in fitting a double exponential model,
$$f(x,\theta) = \theta_1 e^{-\theta_2 x} + \theta_3 e^{-\theta_4 x}$$
$\theta_4$ could be 0, in which case the model reduces to
$$f(x,\theta) = \theta_1 e^{-\theta_2 x} + \theta_3$$
or $\theta_3$ could be 0, in which case the model reduces to
$$f(x,\theta) = \theta_1 e^{-\theta_2 x}$$
In this situation of nested models, we would be interested in finding the simplest
model which adequately fits the data.
In other situations, we might compare non-nested models, for example,
model 1
$$f(x,\theta) = \theta_1\left(1 - e^{-\theta_2 x}\right)$$
versus model 2

both of which start at $f = 0$ when $x = 0$ and approach the asymptote $\theta_1$ as $x \to \infty$.
In these situations, one model may give a superior fit to the data, and we would
like to select that model.

3.10.1 Nested Models

To decide which is the simplest nested model to fit a data set adequately, we
proceed as in the linear case and use a likelihood ratio test (Draper and Smith,
1981). Because of the spherical normal assumption, this leads to an assessment
of the extra sum of squares due to the extra parameters involved in going from
the partial to the full model.
Letting $S$ denote the sum of squares, $\nu$ the degrees of freedom, and $P$ the
number of parameters, with subscripts $f$ and $p$ for the full and partial models and
a subscript $e$ for extra, the calculations can be summarized as in Table 3.5. To
complete the analysis, we compare the ratio $s_e^2/s_f^2$ to $F(\nu_e,\nu_f;\alpha)$ and accept the
partial model if the calculated mean square ratio is lower than the table value.
Otherwise, we retain the extra terms and use the full model. Illustrations of the

Table 3.5 Extra sum of squares analysis for nested models.

                      Sum of               Degrees of
  Source              Squares              Freedom                Mean Square             F Ratio
  Extra parameters    $S_e = S_p - S_f$    $\nu_e = P_f - P_p$    $s_e^2 = S_e/\nu_e$     $s_e^2/s_f^2$
  Full model          $S_f$                $\nu_f = N - P_f$      $s_f^2 = S_f/\nu_f$
  Partial model       $S_p$                $N - P_p$

use of the extra sum of squares analysis are given below in Example Puromycin
10 and in Section 3.11.
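The extra sum of squares calculation can be sketched in a few lines. The code below is an illustration only: the function names are hypothetical, and the tail probability is obtained by numerically integrating the incomplete beta function rather than by consulting an F table. It is applied to the sums of squares reported for the Puromycin data in Table 3.8, where the number of observations (N = 23) is inferred from the tabulated degrees of freedom.

```python
import math


def f_tail_prob(f_ratio, df1, df2, steps=2000):
    """Upper tail probability P(F > f_ratio) for an F(df1, df2) variable.

    Uses P(F > f) = I_x(df2/2, df1/2) with x = df2 / (df2 + df1 * f), where
    I_x is the regularized incomplete beta function, evaluated here by
    Simpson's rule. Assumes df2 >= 2 so the integrand is finite at zero.
    """
    if f_ratio <= 0.0:
        return 1.0
    a, b = df2 / 2.0, df1 / 2.0
    x = df2 / (df2 + df1 * f_ratio)
    h = x / steps

    def g(t):
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)

    total = g(0.0) + g(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(i * h)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (total * h / 3.0) / math.exp(log_beta)


def extra_sum_of_squares(s_partial, s_full, n_obs, p_partial, p_full):
    """Return (S_e, nu_e, nu_f, F ratio, p value) for nested models."""
    s_extra = s_partial - s_full
    df_extra = p_full - p_partial
    df_full = n_obs - p_full
    f_ratio = (s_extra / df_extra) / (s_full / df_full)
    return s_extra, df_extra, df_full, f_ratio, f_tail_prob(f_ratio, df_extra, df_full)


# Sums of squares from Table 3.8 (3- versus 4-parameter Michaelis-Menten).
s_extra, df_extra, df_full, f_ratio, p_value = extra_sum_of_squares(
    2241.0, 2055.0, n_obs=23, p_partial=3, p_full=4)
print(s_extra, df_extra, df_full, round(f_ratio, 2), round(p_value, 2))
```

A large p value, as here, indicates that the extra parameters do not reduce the residual sum of squares by more than would be expected by chance, so the partial model is retained.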
Note that for linear least squares, the extra sum of squares analysis is exact
because the data vector $y$ is being projected onto linear subspaces of the
response space to determine $S_p$ and $S_f$. Mathematically, the partial model
expectation plane is a linear subspace of the full model expectation plane.
Residual vectors can then be decomposed into orthogonal components and, from the
fact that the full model residual vector has a squared length which is distributed
as a $\sigma^2\chi^2$ random variable with $N - P$ degrees of freedom, it follows that the
squared lengths of the components are also distributed as $\sigma^2\chi^2$ random variables
with degrees of freedom equal to the dimensions of the linear subspaces.
For nonlinear models, as we might expect, the analysis is only approxi-
mate because the calculated mean square ratio will not have an exact F distribu-
tion. However, the distribution of the mean square ratio is only affected by in-
trinsic nonlinearity and not by parameter effects nonlinearity, and, as shown in
Chapter 7, the intrinsic nonlinearity is generally small. When the partial model
is inadequate, the effect of intrinsic nonlinearity on the analysis can be large but
the partial model will be rejected anyway: it is only when the fitted values are
very close that the form of the distribution is critical. In these cases, the intrin-
sic nonlinearity will usually have a small effect because the expected responses
being compared are close together on the expectation surface.

3.10.2 Incremental Parameters and Indicator Variables

Many nested models can be parametrized in terms of incremental parameters.
An incremental parameter accounts for a change in a parameter between blocks
of cases and is associated with an indicator variable. An advantage of using in-
cremental parameters is that a preliminary evaluation of the need for the full
model can be made directly from the regression output without having to do ad-
ditional computation. The use of incremental parameters is most easily
described by means of an example.

Example: Puromycin 10
In the Puromycin experiment, two blocks of experiments were run. In one
the enzyme was treated with Puromycin (Table A1.3a), and in the other the
same enzyme was untreated (Table A1.3b). It was hypothesized that the
Puromycin should affect the maximum velocity parameter $\theta_1$, but not the
half-velocity parameter $\theta_2$. The two data sets are plotted in Figure 3.13.
To determine if the $\theta_2$ parameter is unchanged, we use an extra sum
of squares analysis, which requires fitting a full and a partial model. The
full model corresponds to completely different sets of parameters for the
treated data and the untreated data, while the partial model corresponds to
different $\theta_1$ parameters but the same $\theta_2$ parameter. To combine the full
and partial models, we introduce the indicator variable

$$x_2 = \begin{cases} 0 & \text{untreated} \\ 1 & \text{treated} \end{cases}$$

and let $x_1$ be the substrate concentration. The combined model is then
written

$$f(x,\theta) = \frac{(\theta_1 + \phi_1 x_2)\,x_1}{(\theta_2 + \phi_2 x_2) + x_1} \qquad (3.4)$$

where $\theta_1$ is the maximum velocity for the untreated enzyme, $\phi_1$ is the
incremental maximum velocity due to the treatment, $\theta_2$ is the (possibly
common) “half-velocity” point, and $\phi_2$ is the change in the half-velocity due to
the treatment. Since we expect $\phi_1$ to be nonzero, we are interested in
testing whether $\phi_2$ could be zero.

Figure 3.13 Plot of enzyme velocity data versus substrate concentration (ppm). The data for
the enzyme treated (not treated) with Puromycin are shown as o (*).

The model (3.4) was fitted and the results of this fit are shown in
Table 3.6. It appears that $\phi_2$ could be zero, since it has a small t ratio, and
so we fit the partial model (3.4) with $\phi_2 = 0$. The results of this fit are given
in Table 3.7 and the extra sum of squares analysis is presented in Table 3.8.
In this well-designed experiment, which includes replications, it is also

Table 3.6 Parameter summary for the 4-parameter Michaelis-Menten model fitted
to the combined Puromycin data set.

                           Approx.                  Correlation
  Parameter    Estimate    Std. Err.    t Ratio     Matrix
  θ1           160.3       6.90         23.2         1.00
  θ2           0.0477      0.00828       5.8         0.77   1.00
  φ1           52.4        9.55          5.5        -0.72  -0.56   1.00
  φ2           0.0164      0.0114        1.4        -0.56  -0.72   0.77   1.00

Table 3.7 Parameter summary for the 3-parameter Michaelis-Menten model fitted
to the combined Puromycin data set.

                           Approx.                  Correlation
  Parameter    Estimate    Std. Err.    t Ratio     Matrix
  θ1           166.6       5.81         28.7         1.00
  θ2           0.058       0.00591       9.8         0.61   1.00
  φ1           42.0        6.27          6.7        -0.54   0.06   1.00

Table 3.8 Extra sum of squares analysis for the 3- and 4-parameter
Michaelis-Menten model fitted to the combined Puromycin data set.

                 Sum of     Degrees of    Mean
  Source         Squares    Freedom       Square    F Ratio    p Value
  Extra           186         1           186.0     1.7        0.21
  4-parameter    2055        19           108.2
  3-parameter    2241        20

possible to analyze for lack of fit of the partial model as shown in Table
3.9. These summary calculations, together with plots of the residuals (not
shown), suggest that a model which has a common half-velocity parameter
and a higher asymptotic velocity for the treated enzyme is adequate. ■
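To make the role of the indicator variable concrete, the sketch below codes a combined Michaelis-Menten model with incremental parameters and shows that it reproduces a separate curve for each block. It assumes the combined model has the form $(\theta_1 + \phi_1 x_2)x_1 / ((\theta_2 + \phi_2 x_2) + x_1)$ and uses the estimates reported in Table 3.6; the function names are illustrative.

```python
def michaelis_menten(x, vmax, k):
    """Ordinary Michaelis-Menten expectation function."""
    return vmax * x / (k + x)


def combined_model(x1, x2, theta1, theta2, phi1, phi2):
    """Michaelis-Menten model with incremental parameters.

    x1 is the substrate concentration and x2 is the indicator variable
    (0 = untreated, 1 = treated).
    """
    return (theta1 + phi1 * x2) * x1 / ((theta2 + phi2 * x2) + x1)


# Estimates for the full (4-parameter) model, taken from Table 3.6.
full = dict(theta1=160.3, theta2=0.0477, phi1=52.4, phi2=0.0164)

conc = 0.5  # an arbitrary substrate concentration chosen for illustration
untreated = combined_model(conc, 0, **full)
treated = combined_model(conc, 1, **full)

# With x2 = 0 the increments drop out; with x2 = 1 they are added on.
print(untreated, michaelis_menten(conc, 160.3, 0.0477))
print(treated, michaelis_menten(conc, 160.3 + 52.4, 0.0477 + 0.0164))
```

Setting `phi2=0.0` gives the partial model, so a single function covers both fits in the extra sum of squares analysis.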

In the above example, the t ratios for the incremental parameters permit
reliable inferences to be made concerning changes from one block to another.
We recommend, however, that the extra sum of squares analysis always be used,
since it is unaffected by parameter effects nonlinearity (see Chapter 7) and is
therefore more exact than the t test in the nonlinear case. We only use the t ra-
tios to suggest which incremental parameters might be zero and should be inves-
tigated further: the actual decision on whether to retain a parameter should be
based on an extra sum of squares analysis or a profile t analysis (see Chapter 6).
In summary, incremental parameters provide a direct and simple pro-
cedure for determining whether changes in parameters occur between different
blocks. Clearly, incremental parameters can also be used to advantage in linear
least squares to determine changes in parameters between blocks, since then the
t tests are exact. Even for linear least squares, however, we recommend fitting
the reduced model and using the extra sum of squares analysis to make any final
decisions concerning inclusion or deletion of parameters, so as to avoid prob-
lems with multicollinearity and inflation of variances. Incremental parameters
can also be used when there are more than two blocks by introducing additional
indicator variables or, possibly, by rewriting the parameters as functions of other
variables as in Section 3.11.

3.10.3 Non-nested Models

When trying to decide which of several non-nested models is best, the first ap-
proach should be to the researcher. That is, if there are scientific reasons for
preferring one model over the others, strong weight should be given to the
researcher’s reasons because the primary aim of data analysis is to explain or
account for the behavior of the data, not simply to get the best fit.

Table 3.9 Lack of fit analysis for the 3-parameter Michaelis-Menten model
fitted to the combined Puromycin data set.

                 Sum of     Degrees of    Mean
  Source         Squares    Freedom       Square    F Ratio    p Value
  Lack of fit    1144         9           127.3     1.3        0.35
  Replication    1097        11            99.7
  Residuals      2241        20

If the researcher cannot provide convincing reasons for choosing one
model over others, then statistical analyses can be used, the most important of
which is probably an analysis of the residuals. Generally the model with the
smallest residual mean square and the most random-looking residuals should be
chosen. The residuals should be plotted versus the predicted values, the control
variables, time order, and any other (possibly lurking) variables; see Section
3.7.

3.11 Parameters as Functions of Other Variables

In many situations, the parameters in a mechanistic model will depend on other
variables. For example, in chemical kinetic studies, we may have data from
several experiments in which the operating conditions have been varied, and it
may be thought that the rate constants should depend in some systematic way on
the operating conditions. We would then like to fit a model which incorporates
the dependence of the kinetic parameters $\theta$ on some process variables, say $w$,
and some process parameters, $\phi = (\phi_1, \ldots, \phi_L)^T$. That is, $\theta = \theta(w,\phi)$. The
expectation function is then $f(x,\theta) = f(x,\theta(w,\phi))$.
To estimate the parameters in such an extended model, we could express
the function in terms of the regular variables $x$, the process variables $w$, and the
process parameters $\phi$, determine the derivatives with respect to $\phi$, and then use a
Gauss-Newton algorithm to converge to $\hat\phi$. It is more efficient, however, to
build on what we already have and proceed as follows:
(1) At each level of $w$, estimate the kinetic parameters $\theta$ in the regular
model.
(2) Plot the parameter estimates $\hat\theta_p$ versus $w$ to determine a plausible
form for the relationship of $\theta_p$ to $w$ and to obtain starting estimates for the
process parameters $\phi$.
(3) Use the chain rule for derivatives to determine the derivatives with
respect to $\phi$, exploiting the existing derivatives with respect to $\theta$, as

$$\frac{\partial f}{\partial \phi_l} = \sum_{p=1}^{P} \frac{\partial f}{\partial \theta_p}\,\frac{\partial \theta_p}{\partial \phi_l}$$

for $l = 1, 2, \ldots, L$, where $L$ is the total number of process parameters.


An application of this method is described in Section 5.5.
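Step (3) can be illustrated with a small sketch. It assumes a Michaelis-Menten expectation function whose first parameter depends linearly on a process variable $w$ while the second is constant; this dependence, the numerical values, and all function names are hypothetical choices made for illustration. The chain-rule derivatives are checked against central finite differences.

```python
def model(x, theta):
    """Michaelis-Menten expectation function in the kinetic parameters."""
    th1, th2 = theta
    return th1 * x / (th2 + x)


def dmodel_dtheta(x, theta):
    """Existing derivatives of the model with respect to theta1, theta2."""
    th1, th2 = theta
    return [x / (th2 + x), -th1 * x / (th2 + x) ** 2]


def theta_of(w, phi):
    """Hypothetical dependence: theta1 linear in w, theta2 constant."""
    phi10, phi11, phi20 = phi
    return [phi10 + phi11 * w, phi20]


def dtheta_dphi(w):
    """Jacobian of theta with respect to phi (rows: theta_p, columns: phi_l)."""
    return [[1.0, w, 0.0],
            [0.0, 0.0, 1.0]]


def dmodel_dphi(x, w, phi):
    """Chain rule: df/dphi_l = sum over p of df/dtheta_p * dtheta_p/dphi_l."""
    d_theta = dmodel_dtheta(x, theta_of(w, phi))
    jac = dtheta_dphi(w)
    return [sum(d_theta[p] * jac[p][col] for p in range(len(d_theta)))
            for col in range(len(phi))]


def finite_difference(x, w, phi, index, step=1e-6):
    """Central difference approximation to df/dphi_index, used as a check."""
    up, down = list(phi), list(phi)
    up[index] += step
    down[index] -= step
    return (model(x, theta_of(w, up)) - model(x, theta_of(w, down))) / (2 * step)


phi = (160.0, 25.0, 0.05)  # hypothetical process parameters
analytic = dmodel_dphi(0.4, 2.0, phi)
numeric = [finite_difference(0.4, 2.0, phi, i) for i in range(3)]
print(analytic)
print(numeric)
```

Only `theta_of` and `dtheta_dphi` need to change when a different relationship between the kinetic parameters and the process variable is proposed; the derivatives of the original model are reused as they stand.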

Example: Puromycin 11
Suppose in the research on Puromycin (Example Puromycin 10) there
were, say, four treatment levels of Puromycin instead of just two (treated
and untreated). We could then proceed by incorporating three indicator
variables to account for changes in the parameters due to different
treatments. However, if the Puromycin treatments consist of different doses, it
might be possible to write

$$f(x,\theta) = \frac{\theta_1 x}{\theta_2 + x}$$

where a possible form of $\theta_1$ and $\theta_2$ is

$$\theta_1 = \phi_{10} + \phi_{11} w$$
$$\theta_2 = \phi_{20} + \phi_{21} w$$

In this example, the (regular) variable is $x$, the substrate concentration, and
the process variable is $w$, the Puromycin concentration.
Now suppose that at Puromycin concentration $w_1$ we get estimates
$\hat\theta_{11}$ and $\hat\theta_{21}$, at concentration $w_2$ we get estimates $\hat\theta_{12}$ and $\hat\theta_{22}$, and so on,
and that a plot of $\hat\theta_2$ versus $w$ looks essentially flat, which suggests
$\theta_2$ = constant. Then we would choose $\phi_{20} = \theta_2$. Suppose further that the
plot of $\hat\theta_1$ versus $w$ reveals a straight line relationship, $\theta_1 = \phi_{10} + \phi_{11} w$. We
could then use linear regression of $\hat\theta_1$ on $w$ to get starting estimates for $\phi_{10}$
and $\phi_{11}$.
The model to be fitted to the combined data vector would be

$$f(x,w,\phi) = \frac{(\phi_{10} + \phi_{11} w)\,x}{\phi_{20} + x}$$

3.12 Presenting the Results

As in all statistical analyses, the results from a nonlinear regression analysis
should be presented clearly and succinctly. This is usually done most effectively
by considering the needs and abilities of the prospective audience. The report
should always include a summary of the main findings and conclusions.
The summary should include a statement of the final model, the parameter
estimates and their standard errors, and an interpretation of the model and the
parameters in the context of the original problem.
In the main body of the report, it is useful to state the original problem
and possibly a derivation of the general form of mechanistic model proposed.
Plots of the data should be given, and any preliminary analyses should be dis-
cussed, particularly if they involved transformation of the data or the expecta-
tion function. A listing of the data should always be included (perhaps in an ap-
pendix), but otherwise plots should be used for effective communication.
The initial model should be presented with a brief description of the steps
taken to reduce or extend it, if necessary referring to detailed analyses in appen-
dices. The final model, together with parameter estimates and their approximate
standard errors and correlation matrix should be stated, along with a plot of the
data, the fitted expectation function, and an approximate confidence band for the

expectation function. Pairwise plots of the parameter inference region, and pos-
sibly profile t plots, as described in Chapter 6, should be given. Of great impor-
tance is an interpretation of the expectation function and the parameter values
relative to the original problem, and especially any new findings, such as the
need for additional variables in the model or the non-necessity of any variables
or parameters.
Finally, conclusions and recommendations should be made, especially
concerning possible future experiments or development of the research.
For further tips on report writing, see Ehrenberg (1981), Ehrenberg
(1982), and Watts (1981). The preparation and presentation of graphical materi-
al is covered in Tufte (1983), Cleveland (1984, 1985), and Chambers et al.
(1983).

3.13 Nitrite Utilization: A Case Study

To illustrate the techniques presented in this chapter, we present an analysis of
data on the utilization of nitrite in bush beans as a function of light intensity
(Elliott and Peirson, 1986). Portions of primary leaves from three 16-day-old bean
plants were subjected to eight levels of light intensity (μE/m² s), and the nitrite
utilization (nmol/g hr) was measured. The experiment was repeated on a different
day, resulting in the data listed in Appendix 1, Section A1.12.
The experimenters did not have a theoretical mechanism to explain the
behavior, but they thought that nitrite utilization should be zero at zero light in-
tensity, and should tend to an asymptote as light intensity increased.

3.13.1 Preliminary Analysis

From a plot of the data (Figure 3.14) it can be seen that there was a difference in
the nitrite utilization between experiments on the two days, particularly at higher
light intensities. There is also a tendency for the response to drop at high light
intensity. Note too that, even though the response ranges from 200 to 20000
nmol/g hr, the variance is effectively constant; there is no need to transform to
stabilize variance. To verify the apparent stable variance, we performed a two
way analysis of variance using indicator variables for days and for light intensities,
with the results shown in Tables 3.10 and 3.11.
For our purposes the most useful information from the two way analysis
of variance is the replication sum of squares and mean square, which can be
used for testing lack of fit. We note, however, that the lack of a significant
day × intensity interaction suggests that some of the model parameters may be
equal for the two days, although the significant day effect tends to corroborate
the observed difference between the heights of the maxima on the two days. A
plot of the replication standard deviations versus the replication averages, shown

Figure 3.14 Plot of nitrite utilization by bean plants versus light intensity for day 1 (*)
and day 2 (o).


Table 3.10 Two way analysis of variance for the nitrite utilization data.

                      Sum of                    Mean
                      Squares     Degrees of    Square
  Source              (10^6)      Freedom       (10^6)    F Ratio    p Value
  Days                   4.23       1             4.23      6.1      0.02
  Intensity           2040          7           291.5     420.       0.00
  Days × intensity      10.07       7             1.44      2.1      0.08
  Replication           22.21      32             0.694

Table 3.11 Replication averages and standard deviations for the nitrite
utilization data.

                     Day 1                    Day 2
                          Standard                 Standard
  Intensity    Average    Deviation     Average    Deviation
     2.2          826        652          1327        694
     5.5         2702        623          2541        758
     9.6         4136        719          4619        296
    17.5         7175        401          7554         86
    27.0        10567        908          9019        650
    46.0        16302       1154         14753       1082
    94.0        19296        963         17786        430
   170.0        18719       1117         17374       1408

Figure 3.15 Replication standard deviations plotted versus replication averages for the
nitrite utilization data. Day 1 data are shown as * and day 2 data as o.

in Figure 3.15, verified our earlier assessment that the variance is stable since
there is no systematic relation, and so we proceed to model fitting.
Note that the analysis of variance is used here only as a screening tool. It
is not intended as a final analysis of these data, since the underlying additive
linear model assumed in an analysis of variance is not appropriate.

3.13.2 Model Selection

Because the researchers did not have a model in mind, it was necessary to select
one on the basis of the behavior of the data. The Michaelis-Menten model

$$f(x,\theta) = \frac{\theta_1 x}{\theta_2 + x}$$

and the simple exponential rise model

$$f(x,\theta) = \theta_1\left(1 - e^{-\theta_2 x}\right)$$

were selected because they met the researchers' beliefs that nitrite utilization
was zero at zero light intensity and tended to an asymptote as the light intensity
increased. To simplify the description, we give details for the
Michaelis-Menten model analysis, and only present summaries for the exponential
rise model.
Since there are 24 observations for each day from this well-designed
experiment, it would be reasonable to fit a separate model for each day. We would
like to think, however, that the same parameter values, or at least some of the
same parameter values, would be valid for both days, and so we proceed to fit a
model for day 1 with incremental parameters for day 2. That is, we write

$$f(x,\theta) = \frac{(\theta_1 + \phi_1 x_2)\,x_1}{(\theta_2 + \phi_2 x_2) + x_1}$$

where $x_1$ is the light intensity and $x_2$ is an indicator variable

$$x_2 = \begin{cases} 0 & \text{day 1} \\ 1 & \text{day 2} \end{cases}$$

as described in Section 3.10.

3.13.3 Starting Values

Since the maximum value on day 1 is about 20000, and on day 2 is about
18000, we choose $\theta_1^0 = 25000$ and $\phi_1^0 = -3000$. The response reaches about
12500 at a light intensity of about 34 for day 1 and 35 for day 2, which gives
$\theta_2^0 = 34$ and $\phi_2^0 = 1$.

3.13.4 Assessing the Fit

Convergence was achieved at the values shown in Table 3.12.
It appears from the t ratios that both the incremental parameters could be
estimates of zero, and so a common model could be fitted. However, if we do a
lack of fit analysis on this model as in Table 3.13, we see that this four-
parameter model is not adequate. (The same conclusion was reached for the ex-

Table 3.12 Parameter summary for the 4-parameter Michaelis-Menten model fitted
to the nitrite utilization data.

                           Standard                Correlation
  Parameter    Estimate    Error       t Ratio     Matrix
  θ1           24743       1241        19.9         1.00
  θ2           35.27       4.66         7.6         0.88   1.00
  φ1           -2329       1720        -1.4        -0.72  -0.64   1.00
  φ2           -2.174      6.63        -0.3        -0.62  -0.70   0.88   1.00

Table 3.13 Lack of fit analysis for the 4-parameter Michaelis-Menten model
fitted to the nitrite utilization data.

                  Sum of                    Mean
                  Squares     Degrees of    Square
  Source          (10^6)      Freedom       (10^6)    F Ratio    p Value
  Lack of fit     64.30       12            5.36      7.72       0.00
  Replications    22.21       32            0.694
  Residuals       86.51       44

ponential rise model, in this case with a lack of fit ratio of 3.2, corresponding to
a p value of 0.00.)
A plot of the residuals versus light intensity, as in Figure 3.16, reveals
nonrandom behavior, with negative residuals at small and large intensities and
positive ones in the middle. The model must therefore be modified to allow the
nitrite utilization to drop with increasing light intensity, rather than leveling off
as suggested by the researchers.
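The lack of fit calculation is a short piece of arithmetic once the replication sum of squares is available. The sketch below is illustrative (the function name is hypothetical) and reproduces the entries of Table 3.13 from the residual and replication sums of squares, both in units of 10^6.

```python
def lack_of_fit(residual_ss, residual_df, replication_ss, replication_df):
    """Split the residual sum of squares into lack of fit and replication.

    Returns (lack-of-fit SS, lack-of-fit df, lack-of-fit mean square, F ratio),
    where the F ratio compares the lack-of-fit mean square with the
    replication mean square.
    """
    lof_ss = residual_ss - replication_ss
    lof_df = residual_df - replication_df
    lof_ms = lof_ss / lof_df
    replication_ms = replication_ss / replication_df
    return lof_ss, lof_df, lof_ms, lof_ms / replication_ms


# 4-parameter Michaelis-Menten model, values from Table 3.13 (units of 10^6).
lof_ss, lof_df, lof_ms, f_ratio = lack_of_fit(86.51, 44, 22.21, 32)
print(round(lof_ss, 2), lof_df, round(lof_ms, 2), round(f_ratio, 2))
```

The ratio is then referred to the F distribution with the lack-of-fit and replication degrees of freedom; a large value, as here, signals that the model is inadequate.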

3.13.5 Modifying the Model

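To alter the Michaelis-Menten expectation function to rise to a peak and then
fall, we added a quadratic term to the denominator to produce the quadratic
Michaelis-Menten model,

$$f(x,\theta) = \frac{\theta_1 x}{\theta_2 + x + \theta_3 x^2}$$

which, with incremental parameters and an indicator variable for the different
days, becomes

$$f(x,\theta) = \frac{(\theta_1 + \phi_1 x_2)\,x_1}{(\theta_2 + \phi_2 x_2) + x_1 + (\theta_3 + \phi_3 x_2)\,x_1^2}$$
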

Figure 3.16 Studentized residuals from the 4-parameter Michaelis-Menten model plotted
versus light intensity. Day 1 data are shown as * and day 2 data as o.

(For the exponential rise model, we replaced the unit term by an exponential to
produce the exponential difference model,
$$f(x,\theta) = \theta_1\left(e^{-\theta_3 x} - e^{-\theta_2 x}\right)$$
This model, augmented with incremental parameters and an indicator variable,
was also used to fit the data.)
Starting values for the parameters were obtained by taking reciprocals of
the function and the data and using linear least squares for the quadratic
Michaelis-Menten model. Taking reciprocals worked for the day 2 data, giving
$\theta = (107411, 234, 0.024)^T$, but gave some negative values for the day 1 data.
We therefore used the day 2 starting values with slight perturbations to get
starting values for the 6-parameter model of $\theta^0 = (110000, 234, 0.024)^T$ and
$\phi^0 = (-10000, 23, 0.002)^T$. (For the exponential difference model, we guessed
that the two rate constants might be in the ratio 1:5 and used the estimate for $\theta_2$
to give $\theta_3 = 0.006$. We then estimated $\theta_1$ by evaluating

$$\frac{y}{e^{-\theta_3 x} - e^{-\theta_2 x}}$$

for several $x$ values. This gave $\theta_1 = 37000$ for the day 1 data and 35000 for the
day 2 data, from which we got $\phi_1^0 = -2000$.)
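The reciprocal device for the quadratic Michaelis-Menten model rests on the identity $1/f = (\theta_2/\theta_1)(1/x) + 1/\theta_1 + (\theta_3/\theta_1)x$, which is linear in the three coefficients. The sketch below is an illustration under stated assumptions: it uses noise-free synthetic responses generated from the day 2 values quoted above, at the eight light intensities of Table 3.11, simply to confirm that the linearization recovers the parameters. It does not reproduce the original calculation on the observed data, and the helper names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def quadratic_mm(x, theta):
    """Quadratic Michaelis-Menten expectation function."""
    th1, th2, th3 = theta
    return th1 * x / (th2 + x + th3 * x * x)


def reciprocal_start(xs, ys):
    """Linear least squares of 1/y on (1/x, 1, x), mapped back to theta."""
    rows = [[1.0 / x, 1.0, x] for x in xs]
    z = [1.0 / y for y in ys]
    # Normal equations (A^T A) c = A^T z for the three linear coefficients.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atz = [sum(r[i] * zi for r, zi in zip(rows, z)) for i in range(3)]
    c_inv_x, c_const, c_x = solve_linear(ata, atz)
    theta1 = 1.0 / c_const
    return theta1, c_inv_x * theta1, c_x * theta1


intensities = [2.2, 5.5, 9.6, 17.5, 27.0, 46.0, 94.0, 170.0]
true_theta = (107411.0, 234.0, 0.024)  # day 2 values quoted in the text
synthetic = [quadratic_mm(x, true_theta) for x in intensities]  # noise-free
start = reciprocal_start(intensities, synthetic)
print(start)
```

With real data the reciprocal transformation gives heavy weight to small responses, so the resulting values are suitable only as starting estimates, and, as noted for the day 1 data, they may even be negative.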

3.13.6 Assessing the Fit

Quick convergence was achieved to the parameter estimates given in Table 3.14
for the quadratic Michaelis-Menten model. All the incremental parameters
have nonsignificant approximate t ratios, which suggests that the parameters
could be zero, and so a simpler model may be adequate. The extremely high
parameter approximate correlations also lead one to suspect that the model may
be overparametrized. The residual sum of squares ($32.02 \times 10^6$ on 42 df) is only
about a third of that for the previous model. (Similar conclusions were reached
for the 6-parameter exponential difference model.)
The residuals for this model, plotted versus light intensity in Figure 3.17,
are clearly well behaved and give no evidence of inadequacy of the model.

3.13.7 Reducing the Model

To determine what simplifications could be made in the quadratic
Michaelis-Menten model, we set $\phi_2$ and $\phi_3$ to zero, still retaining $\phi_1$ to account
for a difference between days. For starting values, we simply used the relevant
converged values from the 6-parameter model.

3.13.8 Assessing the Fit

The results for the 4-parameter quadratic model are given in Table 3.15. The
extra sum of squares analysis for the 4-parameter versus the 6-parameter qua-
dratic model, shown in Table 3.16, does not show a significant degradation of
the fit with elimination of $\phi_2$ and $\phi_3$. The residuals, when plotted versus light intensity
as in Figure 3.18, attest to the adequacy of the model. Furthermore, a
lack of fit analysis, shown in Table 3.17, suggests that the model is adequate.

Table 3.14 Parameter summary for the 6-parameter quadratic Michaelis-Menten
model fitted to the nitrite utilization data.

                           Standard               Correlation
  Parameter    Estimate    Error      t Ratio     Matrix
  θ1           89846       31583       2.4         1.00
  θ2           186.7       90.1        2.1         1.00   1.00
  θ3           0.01626     0.00922     1.8         1.00   0.99   1.00
  φ1           -38956      40020      -1.0        -0.94  -0.94  -0.94   1.00
  φ2           -83.23      96.8       -0.9        -0.93  -0.93  -0.92   1.00   1.00
  φ3           -0.00846    0.0993     -0.9        -0.93  -0.92  -0.93   1.00   0.99   1.00

Figure 3.17 Studentized residuals from the 6-parameter quadratic Michaelis-Menten
model plotted versus light intensity. Day 1 data are shown as * and day 2 data as o.

Table 3.15 Parameter summary for the 4-parameter quadratic Michaelis-Menten
model fitted to the nitrite utilization data.

                           Standard               Correlation
  Parameter    Estimate    Error      t Ratio     Matrix
  θ1           70096       16443       4.3         1.00
  θ2           139.4       39.3        3.6         1.00   1.00
  θ3           0.01144     0.00404     2.8         0.99   0.99   1.00
  φ1           -5381       1915       -2.8        -0.69  -0.66  -0.66   1.00

Table 3.16 Extra sum of squares analysis for the 6-parameter versus the
4-parameter quadratic Michaelis-Menten model fitted to the nitrite utilization
data.

                 Sum of                    Mean
                 Squares     Degrees of    Square
  Source         (10^6)      Freedom       (10^6)    F Ratio    p Value
  Extra           0.82        2            0.41      0.54       0.59
  6-parameter    32.02       42            0.762
  4-parameter    32.84       44            0.746

Figure 3.18 Studentized residuals from the 4-parameter quadratic Michaelis-Menten
model plotted versus light intensity. Day 1 data are shown as * and day 2 data as o.

Table 3.17 Lack of fit analysis for the 4-parameter quadratic
Michaelis-Menten model fitted to the nitrite utilization data.

               Sum of Squares   Degrees of   Mean Square
Source         (10^6)           Freedom      (10^6)        F Ratio   p Value
Lack of fit    10.63            12           0.886         1.28      0.28
Replications   22.21            32           0.694
Residuals      32.84            44           0.746

Note that the parameter $\phi_1$ is now apparently significantly different from
0, with an approximate t ratio of -2.8, confirming our earlier suspicion that there
was a difference between days. This parameter was not significantly different
from 0 in the 6-parameter model, which is further evidence for the 6-parameter
model being overparametrized and hence the parameter approximate standard
errors being artificially inflated, causing nonsignificant t ratios.
The parameter approximate correlation matrix for the quadratic
Michaelis-Menten 4-parameter model shown in Table 3.15 reveals that several
of the correlations are very large. This is not unusual in nonlinear regression,
and is induced by a combination of the form of the expectation function and the
design used. For example, for a Michaelis-Menten model, no matter how good
the design is, it is impossible to obtain zero correlation between the parameters
because it is impossible to force the derivatives to be orthogonal. To see this,
we note that the first column of the derivative matrix, $v_1$, has elements
$x/(\theta_2 + x)$, and the second column, $v_2$, has elements $-\theta_1 x/(\theta_2 + x)^2$. All the
elements in $v_1$ are positive and all the elements in $v_2$ are negative, and so the
two vectors $v_1$ and $-v_2$ will always tend to point in the same direction in the
response space. Consequently, they will tend to be collinear.
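
The collinearity argument can be illustrated numerically. The sketch below is an illustration only: the parameter values and the two designs are hypothetical, and the helper names are not from the text. It computes the cosine of the angle between $v_1$ and $v_2$; because every product of corresponding elements is negative, the cosine is negative for any design with positive $x$ values.

```python
import math


def mm_derivative_columns(xs, theta1, theta2):
    """Columns of the Michaelis-Menten derivative matrix at design points xs."""
    v1 = [x / (theta2 + x) for x in xs]                  # df/dtheta1
    v2 = [-theta1 * x / (theta2 + x) ** 2 for x in xs]   # df/dtheta2
    return v1, v2


def cosine(u, w):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, w))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in w))


# Hypothetical parameter values and designs, chosen only for illustration.
theta1, theta2 = 200.0, 0.1
designs = {
    "evenly spaced": [0.1, 0.3, 0.5, 0.7, 0.9, 1.1],
    "two extremes": [0.02, 0.02, 0.02, 1.1, 1.1, 1.1],
}
cosines = {}
for name, xs in designs.items():
    v1, v2 = mm_derivative_columns(xs, theta1, theta2)
    cosines[name] = cosine(v1, v2)
    print(f"{name}: cos(angle between v1 and v2) = {cosines[name]:.3f}")
```

For a two-parameter model the approximate correlation between the parameter estimates is minus this cosine, so a cosine that can never reach zero means the correlation can never be zero, whatever design is chosen.
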
As a final check on the model, the 3-parameter Michaelis-Menten model
$$ f = \frac{(\theta_1 + \phi_1 x_2)\, x_1}{\theta_2 + x_1} $$
could be fitted and compared with the 4-parameter quadratic Michaelis-Menten
model using an extra sum of squares analysis to further substantiate the necessity
for the parameter $\theta_3$. This was not done because the residuals for the original
3-parameter Michaelis-Menten model were so badly behaved.
(Similar results and conclusions were reached for the exponential differ-
ence model: that is, a 4-parameter model with common exponential parameters
and scale factor, plus an incremental parameter for day 2, was found to give an
adequate fit. Summary information on the fit is given in Table 3.18, and com-
parison with the 6-parameter model in Table 3.19. The lack of fit analysis is
given in Table 3.20. In this case, the lack of fit ratio was 1.46, still not
significant, but slightly larger than for the quadratic Michaelis-Menten model.)

Table 3.18 Parameter summary for the 4-parameter exponential
difference model fitted to the nitrite utilization data.

                           Standard
Parameter     Estimate     Error       t Ratio   Correlation Matrix
$\theta_1$    35115        8940         3.9       1.00
$\theta_2$    0.01845      0.00317      5.8      -0.99   1.00
$\theta_3$    0.00325      0.00120      2.7       0.99  -0.97   1.00
$\phi_1$      -2686        1006        -2.7      -0.71   0.67  -0.68   1.00

Table 3.19 Extra sum of squares analysis for the 6-parameter
versus the 4-parameter exponential difference model fitted to
the nitrite utilization data.

              Sum of Squares   Degrees of   Mean Square
Source        (10^6)           Freedom      (10^6)        F Ratio   p Value
Extra          0.37             2           0.19          0.23      0.80
6-parameter   33.97            42           0.809
4-parameter   34.34            44           0.780

Table 3.20 Lack of fit analysis for the 4-parameter exponential
difference model fitted to the nitrite utilization data.

               Sum of Squares   Degrees of   Mean Square
Source         (10^6)           Freedom      (10^6)        F Ratio   p Value
Lack of fit    12.13            12           1.011         1.46      0.19
Replications   22.21            32           0.694
Residuals      34.34            44           0.780

3.13.9 Comparing the Models

To compare the nested models we have used incremental parameters and the extra
sum of squares principle, but these cannot be used to compare the quadratic
Michaelis-Menten and the exponential difference models, which are not nested.
Our first step was to approach the researchers, asking them whether one model
was preferred on scientific grounds. In this case, the researchers had no
preference, and so we simply presented them with the results for both models.
Because the lack of fit ratio and the residual mean square were smaller, we had
a slight preference for
the Michaelis-Menten model. In Figure 3.19 we show the nitrite utilization data
together with the fitted curve and the approximate 95% confidence bands for the
4-parameter quadratic Michaelis-Menten model.
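
The two quantities behind this preference can be recomputed from the lack of fit tables. The following sketch uses only the sums of squares and degrees of freedom reported in Tables 3.17 and 3.20; the dictionary layout and variable names are illustrative assumptions.

```python
# Values reported in Tables 3.17 and 3.20 (sums of squares in units of 10^6).
replication_ss, replication_df = 22.21, 32
lack_of_fit = {
    "quadratic Michaelis-Menten": (10.63, 12),
    "exponential difference": (12.13, 12),
}

ms_replication = replication_ss / replication_df
summary = {}
for model, (lof_ss, lof_df) in lack_of_fit.items():
    # Lack of fit ratio: lack of fit mean square over replication mean square.
    f_ratio = (lof_ss / lof_df) / ms_replication
    # Residual mean square pools lack of fit and replication contributions.
    residual_ms = (lof_ss + replication_ss) / (lof_df + replication_df)
    summary[model] = (f_ratio, residual_ms)
    print(f"{model}: lack of fit F = {f_ratio:.2f}, residual MS = {residual_ms:.3f}")
```

Both the lack of fit ratio (1.28 against 1.46) and the residual mean square (0.746 against 0.780) come out smaller for the quadratic Michaelis-Menten model, in agreement with the tables. The differences are small, which is why the preference is described as slight.
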

3.13.10 Reporting the Results

A brief report was prepared for Professors Elliott and Peirson, along the lines of
Section 3.12. The major finding of interest was the need for a model which rose
to a peak rather than to an asymptote. This was not expected, at least at such a
low light level. As part of our report, we recommended additional experiments
be run, especially at higher light intensities, in order to verify the need for a
model which rises to a peak rather than approaching an asymptote, and to help
discriminate between the two competing models. It was further suggested that
future experiments involve fewer levels at low light intensity to reduce effort.

Figure 3.19 Plot of nitrite utilization versus light intensity together with the fitted
curves (solid lines) and the 95% approximate inference bands (dotted). Data for day 1
are shown as * and for day 2 as o.

3.14 Experimental Design

So far we have concentrated more on the analysis of data than on the design of
experiments to produce good data, although we believe that good experimental
design is vital to scientific progress. The reason for the prime importance of
experimental design is that the information content of the data is established when
the experiment is performed, and no amount of sensitive data analysis can
recover information which is not present in the data.
One reason for our emphasizing analysis rather than design is that we usu-
ally have to deal with data that have been obtained without the benefit of good
statistical design. Another reason is that, while good experimental design is ex-
tremely valuable, it is necessary to know how to analyze data in order to appre-
ciate what “good experimental design” is.

3.14.1 General Considerations

Experimentation is fundamental to scientific learning, which we may characterize
as reducing ignorance. At any stage of research, we are in a position of having
data, and of being able to explain part of that data, and as the research
proceeds, we are able to account for, or explain, more of the data. For example,
a chemical engineer trying to learn about how a particular product is produced
would know very little initially about the factors and the chemical reactions in-
volved. As she proceeds, planning and running experiments under various con-
ditions, she would endeavor to find out, at each stage, what the important factors
are, and how they affect the response. Initially, she would be involved in empirical
“screening designs” to try to isolate those factors which are most influential
in affecting the response, probably using factorial or fractional factorial designs
(Box, Hunter, and Hunter, 1978). If she was interested in optimizing some
characteristic, she might then proceed to response surface designs (Box, Hunter,
and Hunter, 1978; Box and Draper, 1987). Later on, perhaps to fine tune the
product or to gain better understanding of the mechanisms involved, she would
move from empirical models and their associated strategies to mechanistic (usu-
ally nonlinear) models. It is this aspect of experimental design which we con-
sider in this section.
We assume initially that the experimenter has a well-defined form for the
expectation function relating the factors to the response, and that the objectives
of the experiments are to provide the necessary and adequate information to:

(1) estimate the parameters of interest in the model with
    accuracy (i.e. small bias),
    precision (i.e. small variance), and
(2) verify the assumptions about
    the expectation function,
    the disturbance model.

As was the case for estimation, it is helpful first to discuss the linear situa-
tion. Accordingly, in the following section we present a brief review of experi-
mental design for linear expectation functions. For a more comprehensive
presentation, see Box, Hunter, and Hunter (1978), Davies (1956), and Cochran
and Cox (1957); and for general considerations on the planning of experiments,
Box and Draper (1959) and Draper and Smith (1981). A thorough review of optimal
designs is given in St. John and Draper (1973), Cochran (1973), and Steinberg
and Hunter (1984). Hamilton and Watts (1985) discussed designs using
second order derivatives, and the geometry of experimental designs was dis-
cussed in Silvey and Titterington (1973).
Before considering the more technical details of experimental design, we
offer some comments which help ensure attainment of the general objectives (1)
and (2) above.
With regard to providing accurate and precise estimates of the parameters,
it is helpful to recognize that an experimental design involves choosing the
values of the factors for a selected number of experimental cases (or runs). It is
therefore important that the number of cases be large enough to ensure attain-
ment of the specific objectives of the experiment. For example, if an expecta-
tion function involves five parameters, there will have to be at least five distinct
experimental conditions. It is equally important to limit the number of experi-
ments done at any one time. That is, one should not construct an extremely
large design and then proceed slavishly to follow that design to its completion.
Due account should be taken of what is learned at each stage of the experiment,
and this information should be exploited in the design of the next stage. The
number of experiments which should be run in a block will depend on the
number of factors and the type of experiment being run, of course, but blocks of
size 10 to 20 are usually informative and manageable.
The choices of the factor settings should be such that they are in useful
and appropriate ranges of the factors. That is, the factors should be located near
sensible values which will permit use of the parameter estimates in future inves-
tigations, and the levels of each factor should be spread out enough so that the
effect of each factor will be revealed in spite of the inherent variability of the
response.
With regard to verifying the assumptions about the expectation function, it
is important to provide replications to enable testing for lack of fit or inadequacy
of the expectation function. It is also important, when possible, to randomize
the order of the experiments, to ensure that the expectation function is appropri-
ate. (If there are unsuspected factors operating, randomizing will tend to cause
their effects to appear as increased variability rather than as incorrect parameter
estimates, as discussed in Section 1.3.)
With regard to verifying the assumptions about the disturbance model, re-
plications are again important. As discussed in Section 1.3, replications enable
one to test for constancy of variance and to determine a variance stabilizing
transformation if the variance is deemed not constant. Randomizing will also
tend to ensure that all of the assumptions concerning the disturbances will be ap-
propriate, as discussed in Section 1.3. Once again, we see the importance and
power of randomizing.
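
As a small practical illustration of randomizing the run order, the sketch below shuffles a planned set of replicated runs. The factor settings, the number of replicates, and the seed are all hypothetical, chosen only to make the example reproducible.

```python
import random

# Hypothetical design: three replicates at each of four factor settings.
settings = [10, 20, 40, 80]
planned_runs = [(level, rep) for level in settings for rep in range(1, 4)]

rng = random.Random(1988)       # fixed seed so the plan can be reproduced
run_order = planned_runs[:]     # copy, so the planned list stays intact
rng.shuffle(run_order)

for position, (level, rep) in enumerate(run_order, start=1):
    print(f"run {position:2d}: setting {level}, replicate {rep}")
```

Every planned run still appears exactly once; only the order in which the runs are performed changes, so that any unsuspected time trend is not confounded with the factor settings.
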
In summary, statistical analysis is concerned with the efficient extraction
and presentation of the information embodied in a data set, while statistical ex-
perimental design is concerned first with ensuring that the important necessary
information is embodied in a data set, and second with making the extraction
and presentation of that information easy.

3.14.2 The Determinant Criterion

Consider the linear model (1.1)
$$ Y = X\beta + Z $$
with the usual assumptions (1.2) and (1.3) about the disturbances $Z$,
$$ \mathrm{E}[Z] = 0 $$
$$ \mathrm{Var}[Z] = \sigma^2 I $$
For a linear regression model, a row of the derivative matrix $X$ depends only on
the choice of the K design variables, where the design variables determine such
characteristics as when the run is taken, at what pressure, at what temperature,
etc. An individual entry in the derivative matrix is calculated from the values of
the design variables. For any choice of the design variables generating a derivative
matrix $X$, the parameters $\beta$ will have a joint inference region whose volume
is proportional to $|X^T X|^{-1/2}$. Thus, a logical choice of design criterion is to
choose the design points so that the volume of the joint inference region is
minimized (Wald, 1943). Since the power $-1/2$ is inconsequential, Wald proposed
maximizing the determinant $D = |X^T X|$, and designs which satisfy this
criterion are called D-optimal designs. The criterion is referred to as the
determinant criterion.
From a geometric point of view, the determinant criterion implies that we
should choose the columns of $X$ so that each vector is as long as possible
($\|x_p\|^2$ is as large as possible, $p = 1, 2, \ldots, P$), and try to make the vectors
orthogonal ($x_p^T x_q = 0$, $p \neq q$). The former ensures that the expectation plane will
be well supported in the response space, and that the parameter lines will be
widely spaced on the expectation plane. Consequently the disturbances, whose
variance is beyond our control, will have small effect, thereby producing a joint
region in the parameter space with small volume. The latter ensures that the
parameter estimates associated with the factors will not be correlated. That is,
changes in the response will be correctly associated with changes in the ap-
propriate causative factor, and not attributed to other factors.
The two requirements of long length and orthogonality of the derivative
vectors ensure that a disk on the expectation plane will map to a small ellipse in
standard position on the parameter plane.
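
The effect of length and orthogonality on the determinant can be seen in the simplest linear case. The sketch below is an illustration under stated assumptions: a straight-line model with an intercept, four runs, and three hypothetical designs on the interval from -1 to 1. For a two-column $X$ the determinant reduces to the product of the squared lengths minus the squared inner product.

```python
def xtx_determinant(col1, col2):
    """|X^T X| for a two-column X: squared lengths minus squared inner product."""
    s11 = sum(a * a for a in col1)
    s22 = sum(b * b for b in col2)
    s12 = sum(a * b for a, b in zip(col1, col2))
    return s11 * s22 - s12 ** 2


# Straight-line model with intercept: the columns of X are 1 and x.
ones = [1.0] * 4
designs = {
    "two runs at each end": [-1.0, -1.0, 1.0, 1.0],
    "equally spaced": [-1.0, -1 / 3, 1 / 3, 1.0],
    "bunched, off-center": [0.4, 0.6, 0.8, 1.0],
}
dets = {name: xtx_determinant(ones, xs) for name, xs in designs.items()}
for name, d in dets.items():
    print(f"{name}: |X^T X| = {d:.3f}")
```

Placing the runs at the ends makes the $x$ column as long as possible while keeping it orthogonal to the column of ones, and gives the largest determinant. The bunched design gives a short, nearly collinear $x$ column and a much smaller determinant.
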
The determinant criterion was applied to nonlinear expectation functions
by Box and Lucas (1959) who used, in place of the $X$ matrix, the derivative
matrix $V^0$ evaluated at some initial parameter estimates $\theta^0$. That is, in nonlinear
design, the D-optimal criterion is modified to maximize
$$ D = |V^{0T} V^0| \tag{3.5} $$
The design of an experiment depends on the stage at which the researcher
is in an investigation. When only the form of the model is known, but not the
parameter values, as could be the case in enzyme kinetics or in biochemical ox-
ygen demand studies, the researcher would be concerned with choosing the
values of the factors to produce good parameter estimates. These are called
“starting designs.” Later on in an investigation, the researcher might wish to
design an experiment to improve the precision of estimates of some or all of the
parameters, exploiting data already obtained. Such designs are called “sequen-
tial designs,” and, when special interest is attached to a subset of the parameters,
“subset designs.”

3.14.3 Starting Designs

Box and Lucas (1959) proposed starting designs consisting of P points for a
P-parameter model, and therefore simplified the criterion (3.5) to that of maximizing
the modulus of $|V^0|$, since $|V^{0T} V^0| = |V^0|^2$ when $V^0$ is square. Geometrically,
the determinant criterion ensures that the expectation
surface is such that large regions on the tangent plane at $\eta(\theta^0)$ map to small
regions in the parameter space. When more than P points are to be chosen, the
D-optimal design usually results in replications of P distinct design points (Box,
1968), and these design points are those that would be chosen as D-optimal with
N = P. We therefore consider starting designs as having only P runs.

Example: Puromycin 12
To illustrate the choice of a starting design, we consider the case of enzyme
kinetics, which are assumed to follow a Michaelis-Menten model. We assume
that the maximum allowable substrate concentration is specified as
$x_{\max}$, and that initial estimates of the parameters $\theta^0$ are given.
The derivatives of the expectation function, evaluated at the initial
parameter estimates $\theta^0$, are
$$ \frac{x}{\theta_2^0 + x} \qquad \frac{-\theta_1^0 x}{(\theta_2^0 + x)^2} $$
and so the determinant to be maximized is
$$ \begin{vmatrix} \dfrac{x_1}{\theta_2^0 + x_1} & \dfrac{-\theta_1^0 x_1}{(\theta_2^0 + x_1)^2} \\[2ex] \dfrac{x_2}{\theta_2^0 + x_2} & \dfrac{-\theta_1^0 x_2}{(\theta_2^0 + x_2)^2} \end{vmatrix} $$
The modulus of this determinant is maximized when
$$ x_1 = x_{\max}, \qquad x_2 = \frac{\theta_2^0}{1 + 2(\theta_2^0/x_{\max})} $$
The determinant criterion therefore places the design points so as to tie
down the asymptote ($\theta_1$) by performing one experiment at the maximum
concentration, and to tie down the half-concentration by performing the
other experiment near the assumed half-concentration.
It is instructive to compare the D-optimal design with the dilution
design used by Treloar (1974). The dilution design used $x_{\max} = 1.1$ and five
dilutions by approximately one-half, with duplications, giving a total of 12
runs. With the same number of runs, the D-optimal design would consist
of 6 replications at $x_{\max}$ and 6 replications at $x_2 = \theta_2^0/[1 + 2(\theta_2^0/x_{\max})]$. We
take $\theta_2^0 = 0.1$ as a reasonable starting estimate, and so the design is $x_1 = 1.1$,
$x_2 = 0.085$. In Figure 3.20 we plot the linear approximation 95%
confidence region for the dilution design and data together with the linear
approximation confidence region for the D-optimal design assuming that
both designs gave the same parameter estimates and residual variance. We
see that the D-optimal design does indeed give a smaller joint confidence
region and smaller confidence intervals. In addition, the correlation
between the parameters is lower. However, the gain in precision from using
the D-optimal design would have to be balanced against any loss of
information about lack of fit.
Note that the design does not depend on the conditionally linear
parameter $\theta_1$, which is true in general for conditionally linear parameters,
as shown in Section 3.14.6.
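
The design in this example can be checked with a coarse grid search. The sketch below is not the authors' procedure; the function names, the grid resolution, and the two trial values of $\theta_1$ are assumptions made for illustration. It maximizes the modulus of the 2 x 2 determinant of the Michaelis-Menten derivative matrix over pairs of design points.

```python
def mm_det_modulus(x1, x2, theta1, theta2):
    """|det V0| for a two-point Michaelis-Menten design at (x1, x2)."""
    a1, b1 = x1 / (theta2 + x1), -theta1 * x1 / (theta2 + x1) ** 2
    a2, b2 = x2 / (theta2 + x2), -theta1 * x2 / (theta2 + x2) ** 2
    return abs(a1 * b2 - a2 * b1)


def best_two_point_design(theta1, theta2, x_max, steps=550):
    """Coarse grid search over design points with 0 < x2 < x1 <= x_max."""
    grid = [x_max * i / steps for i in range(1, steps + 1)]
    scored = (
        (mm_det_modulus(x1, x2, theta1, theta2), x1, x2)
        for i, x1 in enumerate(grid)
        for x2 in grid[:i]
    )
    _, x1_best, x2_best = max(scored)
    return x1_best, x2_best


theta2_0, x_max = 0.1, 1.1
design_a = best_two_point_design(200.0, theta2_0, x_max)  # hypothetical theta1
design_b = best_two_point_design(50.0, theta2_0, x_max)   # a different theta1
closed_form_x2 = theta2_0 / (1 + 2 * theta2_0 / x_max)
print(design_a, design_b, round(closed_form_x2, 3))
```

The search places one point at $x_{\max}$ and the other within one grid step of the closed-form value, and the result is the same for both trial values of $\theta_1$, since $\theta_1$ only rescales the determinant.
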

Figure 3.20 Comparison of 95% approximate inference regions for two designs for the
Puromycin data. The larger region results from the dilution design used, and the shaded
region results from a D-optimal design.

The determinant criterion provides an objective basis for determining P-point
designs for P-parameter models, but the design strategy should not be applied
blindly. The criterion was derived on the basis that the expectation function
is known, and provides only P design points to estimate the P parameters.
Replications at these P design points are useful because they provide information
concerning constancy of variance, but they cannot provide information
about lack of fit. It might be useful, therefore, to perform additional experiments
at other design points in order to detect lack of fit. In light of these
considerations, the dilution design strategy is eminently sensible, especially given
its high level of performance as demonstrated in the above example.

3.14.4 Sequential Designs

In many situations, some experiments will already have been done to check if
the equipment is functioning properly, or to screen possible models, as
described in Box and Hunter (1965). In other situations, it may be possible to
perform and analyze the result from a single experiment quite rapidly. In these
situations, it is possible to obtain even better parameter estimates by designing
the experiments sequentially; that is, an experimental run is designed, the data
are collected and analyzed, and the design for the next run is obtained by maximizing
$|V_{N+1}^T V_{N+1}|$ with respect to the design variables, $x_{N+1}$, where
$$ V_{N+1} = \begin{bmatrix} V_N \\ v_{N+1} \end{bmatrix} $$
and $v_{N+1}$ is the gradient vector $\partial f/\partial\theta^T$ evaluated at $x_{N+1}$ and at the least squares estimates
from the N runs already made.
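
A single sequential step can be sketched for a two-parameter model. The example below uses the Michaelis-Menten model rather than the isomerization model of the next example, and the previous runs, the current estimates, and the candidate grid are all hypothetical. It appends each candidate row to the derivative matrix and keeps the candidate giving the largest determinant.

```python
def mm_gradient(x, theta1, theta2):
    """Row of the derivative matrix for the Michaelis-Menten model at x."""
    return (x / (theta2 + x), -theta1 * x / (theta2 + x) ** 2)


def det_vtv(xs, theta1, theta2):
    """|V^T V| for a two-parameter model, accumulated row by row."""
    s11 = s12 = s22 = 0.0
    for x in xs:
        g1, g2 = mm_gradient(x, theta1, theta2)
        s11 += g1 * g1
        s12 += g1 * g2
        s22 += g2 * g2
    return s11 * s22 - s12 ** 2


def next_design_point(previous_xs, theta1, theta2, candidates):
    """Candidate x that maximizes |V_{N+1}^T V_{N+1}| given the runs so far."""
    return max(candidates, key=lambda x: det_vtv(previous_xs + [x], theta1, theta2))


# Hypothetical situation: four runs already made, current estimates (200, 0.1).
previous_xs = [0.3, 0.5, 0.8, 1.1]
theta_hat = (200.0, 0.1)
candidates = [0.01 * i for i in range(1, 111)]   # 0.01, 0.02, ..., 1.10
x_next = next_design_point(previous_xs, *theta_hat, candidates)
print(f"next run at x = {x_next:.2f}")
```

After the new run is made, the model would be refitted and the step repeated with the updated estimates, so the design adapts as information accumulates.
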

Example: Isomerization 3
To illustrate sequential design, we consider the model and data set from
Example Isomerization 1. The correlations between the parameters are
very high, and the linear approximation confidence regions include negative
values for the equilibrium constants. We would therefore like to
design experiments to provide better precision in the parameter estimates.
The design points are determined by the values of the partial pressure
of hydrogen, $x_1$, the partial pressure of n-pentane, $x_2$, and the partial pressure
of isopentane, $x_3$. In the previous runs these variables have ranged
from about 100 to 400 for $x_1$, 75 to 350 for $x_2$, and 30 to 150 for $x_3$, so we
use these limits to define a reasonable region within which to design further
runs. We begin by evaluating the sequential D-optimal design criterion at
the original design points and at sequential design points at the corners of
the region. This gives the values in Table 3.21. The combination which
optimizes the D-optimal criterion is low $x_1$ (100), high $x_2$ (350), and low
$x_3$ (30). Examination of nearby values confirms that the corner is a local
optimum, and since a coarse grid search of the design region did not reveal
any optima in the interior, we choose this corner as the design point for the
next run.

Table 3.21 Sequential D-optimal design criteria for the isomerization
model, evaluated at the corners of the design region.

       Factor           Criterion
x1     x2     x3        D (10^6)
100    100    30        4.63
400    100    30        1.95
100    350    30        9.44
400    350    30        3.99
100    100    150       1.82
400    100    150       1.82
100    350    150       3.98
400    350    150       2.42

3.14.5 Subset Designs

When only a subset of the parameters $\theta$ is of interest, the design criterion is
modified as suggested in Box (1971) and Hill and Hunter (1974). We assume
that the parameters have been ordered so the first $P_1$ parameters are the nuisance
parameters and the trailing $P_2$ parameters are the parameters of interest, and we
partition the vector $\theta^T$ as $(\theta_1^T \mid \theta_2^T)$, and the matrix $V^0$ as $[V_1^0 \mid V_2^0]$. Then the
variance-covariance matrix of the $P_2$ parameters is proportional to $D_{22}$, the
trailing $P_2 \times P_2$ block of
$$ D = (V^{0T} V^0)^{-1} $$
and so the D-optimal criterion is changed to minimization of
$$ D_S = |D_{22}| $$
which is equivalent to maximizing
$$ \frac{|V^{0T} V^0|}{|V_1^{0T} V_1^0|} $$
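
The equivalence between minimizing the determinant of the trailing block of the inverse and maximizing a ratio of determinants follows from the standard partitioned-inverse (Schur complement) identities. The steps below are a sketch supplied for illustration rather than taken from the text, with the superscript 0 dropped for brevity and $V = [V_1 \mid V_2]$:

```latex
\[
V^{T}V=\begin{bmatrix} V_1^{T}V_1 & V_1^{T}V_2\\ V_2^{T}V_1 & V_2^{T}V_2\end{bmatrix},
\qquad
D_{22}^{-1}=V_2^{T}V_2-V_2^{T}V_1\,(V_1^{T}V_1)^{-1}V_1^{T}V_2 ,
\]
\[
|V^{T}V| = |V_1^{T}V_1|\;|D_{22}^{-1}|
\quad\Longrightarrow\quad
|D_{22}| = \frac{|V_1^{T}V_1|}{|V^{T}V|}.
\]
```

Minimizing $|D_{22}|$ is therefore the same as maximizing the ratio of the full determinant to the determinant for the nuisance-parameter columns alone.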

Example: Isomerization 4
To illustrate subset design, we consider the model, data set, and design region
from Example Isomerization 3, and treat the situation in which we
wish to improve the estimates of $\theta_2$, $\theta_3$, and $\theta_4$. Evaluation of $D_S$ at the
corners of the same region gives the results in Table 3.22, which produce
similar conclusions and the same design point as in Example Isomerization
3.

Table 3.22 Sequential D-optimal subset design criteria for the isomerization
model, evaluated at the corners of the design region.

       Factor           Criterion
x1     x2     x3        D_S (10^6)
100    100    30         8.77
400    100    30         3.90
100    350    30        13.11
400    350    30         7.08
100    100    150        3.69
400    100    150        3.68
100    350    150        7.39
400    350    150        4.70

3.14.6 Conditionally Linear Models

It is awkward to have to specify initial estimates of the parameters $\theta$ before an
experimental design can be obtained, since, after all, the purpose of the experiment
is to determine parameter estimates. In Examples Puromycin 12 and
Isomerization 3, we saw that the D-optimal design was not affected by the value of
a conditionally linear parameter. For most models with conditionally linear
parameters, the locations of the D-optimal design points do not depend on the
conditionally linear parameters (Hill, 1980; Khuri, 1984), so the design problem
is simpler.
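The easiest type of conditionally linear model to demonstrate this for is
that with only one conditionally linear parameter, so the function can be written
$$ f(x, \theta) = \theta_1\, g(x, \theta_{-1}) $$
for some function $g$ where $\theta_{-1} = (\theta_2, \ldots, \theta_P)^T$. This includes the
Michaelis-Menten, BOD, and isomerization models. The gradient of the model
function can then be written
$$ \frac{\partial f}{\partial \theta^T} = \left( g,\; \theta_1 \frac{\partial g}{\partial \theta_{-1}^T} \right) = \left( g,\; \frac{\partial g}{\partial \theta_{-1}^T} \right) \begin{bmatrix} 1 & 0^T \\ 0 & \theta_1 I_{P-1} \end{bmatrix} \tag{3.6} $$
which isolates the dependence on $\theta_1$ from any dependence upon $x$. Using (3.6),
the derivative matrix $V$ can be written
$$ V = H(x, \theta_{-1})\, B(\theta_1) \tag{3.7} $$
where
$$ B(\theta_1) = \begin{bmatrix} 1 & 0^T \\ 0 & \theta_1 I_{P-1} \end{bmatrix} $$
is $P \times P$, so the D-optimal criterion is
$$ |V^T V| = |B^T H^T H B| = |B|^2\, |H^T H| $$
and, again, the dependence on $\theta_1$ is isolated from $x$. Therefore, the design does
not depend upon $\theta_1$.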
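
As a concrete instance of this factorization, consider the Michaelis-Menten model with design points $x_1, \ldots, x_N$. The worked form below is an illustration derived from the model's derivatives, not reproduced from the text:

```latex
\[
f(x,\theta)=\theta_1\,g(x,\theta_2),\qquad g(x,\theta_2)=\frac{x}{\theta_2+x},
\]
\[
V=\underbrace{\begin{bmatrix}
\dfrac{x_1}{\theta_2+x_1} & \dfrac{-x_1}{(\theta_2+x_1)^2}\\[2ex]
\vdots & \vdots\\
\dfrac{x_N}{\theta_2+x_N} & \dfrac{-x_N}{(\theta_2+x_N)^2}
\end{bmatrix}}_{H(x,\theta_2)}
\underbrace{\begin{bmatrix}1&0\\0&\theta_1\end{bmatrix}}_{B(\theta_1)},
\qquad
|V^{T}V|=\theta_1^{2}\,|H^{T}H| .
\]
```

Since $\theta_1^2$ is a constant multiplier, the design points that maximize $|V^T V|$ are those that maximize $|H^T H|$, which involves only $\theta_2$ and the design points.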


In general, for conditionally linear models of the form
$$ f(x, \theta) = \theta_1 g_1(x, \theta_{-L}) + \cdots + \theta_L g_L(x, \theta_{-L}) $$
where $\theta_{-L} = (\theta_{L+1}, \ldots, \theta_P)^T$, the D-optimal design will not depend on the
conditionally linear parameters $(\theta_1, \ldots, \theta_L)^T$ if $V$ can be factored as in (3.7),
provided $B$ is square. The condition that the matrix $B$ is square was not explicitly
stated in Hill (1980), nor was it emphasized in Khuri (1984), where it was
shown that conditionally linear parameters will usually affect subset designs.
This occurs because, for subset designs, the design criterion involves the ratio of
determinants of components of $V$, and so the simple factorization above usually
does not occur even if $B$ is square. Khuri (1984) gives conditions under which
the conditionally linear parameters do not affect designs for subsets of parameters.
In the common situation where each component of $\theta_{-L}$ enters into only
one of the functions $g_i$, $i = 1, \ldots, L$, the derivatives can be factored as in (3.7).
For example, D-optimal designs for the sum of exponentials model
$$ f(x, \theta) = \theta_1 e^{-\theta_2 x} + \theta_3 e^{-\theta_4 x} + \cdots + \theta_{P-1} e^{-\theta_P x} $$
do not depend on the conditionally linear parameters.

3.14.7 Other Design Criteria

Precise parameter estimation is not the only objective used for experimental
design. Methods have been proposed for constructing designs for discriminat-
ing between possible model functions (Box and Hill, 1974) and for balancing
the objectives of model discrimination and precise parameter estimation (Hill,
Hunter, and Wichern, 1968). The review article (Steinberg and Hunter, 1984)
describes many of these criteria. We also list several of the references for dif-
ferent experimental design criteria for single response and multiresponse non-
linear models in the bibliography.

Exercises
3.1 Use the data from Appendix 4, Section A4.2 to fit the logistic model

(a) Plot the data versus x = $\log_{10}$(NIF concentration). Note that you will
have to make a decision about how to incorporate the zero concentration
data. You may want to incorporate the actual NIF concentrations
also.
(b) Give graphical interpretations of the parameters in the model, and use
the plot to obtain starting values for each data set.
(c) Use the starting values in a nonlinear least squares routine to find the
least squares estimates for the parameters for each data set.
(d) Use incremental parameters and indicator variables to fit all of the data
sets together.
(e) Simplify the model by letting some of the parameters be common to all
of the data sets. Use extra sum of squares analyses to determine a sim-
ple adequate model.
(f) Write a short report about this analysis and your findings.
3.2 Use the data from Appendix 1, Section A1.14 to determine an appropriate
sum of exponentials model.
(a) Plot the data on semilog paper and use the plot to determine the number
of exponential terms to fit to the data.
(b) Use curve peeling to determine starting estimates for the parameters.
(c) Use the starting estimates from part (b) to fit the postulated model from
part (a).
3.3 (a) Use the plot from Problem 2.6 and sketch in the curve of steepest descent
from the point $\theta^0$. Hint: The direction of steepest descent is
perpendicular to the contours.
(b) Is the direction of the Gauss-Newton increment close to the initial
direction of steepest descent?
(c) Calculate and plot the Levenberg increment using a conditioning factor
of k =4.
(d) Calculate and plot the Marquardt increment using a conditioning factor
of k =4.
(e) Comment on the relative directions of the Gauss-Newton, Levenberg
and Marquardt increment vectors.
3.4 Use the data from Appendix 4, Section A4.3 to determine an appropriate
model and to estimate the parameters.
(a) Plot the concentration versus time on semilog paper, and use the plot to
determine the number of exponential terms necessary to fit the data.
(b) Use the plot and the method of curve peeling to determine starting
values for the parameters.
(c) Use a nonlinear estimation routine to estimate the parameters.
3.5 Use a nonlinear estimation routine and the data and model from Appendix
4, Section A4.4 to estimate the parameters. Take note of the number of
iterations required and any difficulties you encounter in each attempt.
(a) Use any approach you think is appropriate to obtain starting values for
the parameters in the model.
(b) Use your starting values in a nonlinear estimation routine to estimate
the parameters. If you achieve convergence, examine the parameter ap-
proximate correlation matrix, and comment on the conditioning of the
model.
(c) Reparametrize the model by centering the factor $1/x_3$, and use the
equivalent starting values from part (a) to estimate the parameters. If
you achieve convergence, examine the parameter approximate correla-
tion matrix, and comment on the conditioning of the model. What ef-
fect does this reparametrization have on the number of iterations to con-
vergence?
(d) Reparametrize the model in part (a) using $\theta_1 = e^{\phi_1}$ and $\theta_2 = e^{\phi_2}$ and the
equivalent starting values from part (a) to estimate the parameters. If
you achieve convergence, examine the parameter approximate correla-
tion matrix, and comment on the conditioning of the model. What ef-
fect does this reparametrization have on the number of iterations to con-
vergence?
(e) Reparametrize the model in part (b) using the same parametrization as
in part (c) and the equivalent starting values from part (a) to estimate
the parameters. If you achieve convergence, examine the parameter ap-
proximate correlation matrix, and comment on the conditioning of the
model. What effect does this reparametrization have on the number of
iterations to convergence?
3.6 Use a nonlinear estimation routine and the data and model from Appendix
4, Section A4.5 to estimate the parameters. Take note of the number of
iterations required and any difficulties you encounter in each attempt.
(a) Use any approach you think is appropriate to obtain starting values for
the parameters in the model.
(b) Use your starting values in a nonlinear estimation routine to estimate
the parameters. If you achieve convergence, examine the parameter ap-
proximate correlation matrix, and comment on the conditioning of the
model.
(c) Reparametrize the model in part (a) using $\theta_2 e^{\theta_3 x} = e^{\theta_3 (x - \phi_2)}$. If you
achieve convergence, examine the parameter approximate correlation
matrix, and comment on the conditioning of the model. What effect
does this reparametrization have on the number of iterations to conver-
gence?
3.7 (a) Show that the theoretical D-optimal starting design for the logistic
model of Problem 3.1 consists of $x = (-\infty,\ \theta_3 - 1.044/\theta_4,\ \theta_3 + 1.044/\theta_4,\ \infty)^T$.
(b) Interpret the choice of the design points graphically by plotting the
logistic function versus x and plotting the location of the design points
on the x-axis.
(c) Plot the derivatives with respect to the parameters versus x and use
these plots to help interpret the choice of the design points.