
CHEE824

Nonlinear Regression Analysis


J. McLellan
Winter 2004
Module 1:
Linear Regression
Outline -

• assessing systematic relationships


• matrix representation for multiple regression
• least squares parameter estimates
• diagnostics
» graphical
» quantitative
• further diagnostics
» testing the need for terms
» lack of fit test
• precision of parameter estimates, predicted responses
• correlation between parameter estimates

3
The Scenario

We want to describe the systematic relationship


between a response variable and a number of
explanatory variables

multiple regression

we will consider the case which is linear in the parameters

4
Assessing Systematic Relationships

Is there a systematic relationship?


Two approaches:
• graphical
» scatterplots, casement plots
• quantitative
» form correlations between response, explanatory
variables
» consider forming a correlation matrix - a table of pairwise
correlations between the response and the explanatory
variables, and between pairs of explanatory variables
• correlation between explanatory variables leads to
correlated parameter estimates

5
Graphical Methods for Analyzing Data

Visualizing relationships between variables

Techniques

• scatterplots
• scatterplot matrices
» also referred to as “casement plots”
• Time sequence plots

6


Scatterplots

… are also referred to as “x-y diagrams”


• plot values of one variable against another
• look for systematic trend in data
» nature of trend
• linear?
• exponential?
• quadratic?
» degree of scatter - does spread increase/decrease over
range?
• indication that variance isn’t constant over range of data

7


Scatterplots - Example

• tooth discoloration data - discoloration vs. fluoride

[Scatterplot (teeth data): DISCOLOR vs. FLUORIDE - annotation: trend - possibly nonlinear?]

8


Scatterplot - Example

• tooth discoloration data -discoloration vs. brushing

[Scatterplot (teeth data): DISCOLOR vs. BRUSHING - annotation: significant trend? - doesn’t appear to be present]

9


Scatterplot - Example

• tooth discoloration data -discoloration vs. brushing

[Scatterplot (teeth data): DISCOLOR vs. BRUSHING - annotation: variance appears to decrease as # of brushings increases]

10


Scatterplot matrices

… are a table of scatterplots for a set of variables


Look for -
» systematic trend between “independent” variable and
dependent variables - to be described by estimated
model
» systematic trend between supposedly independent
variables - indicates that these quantities are correlated
• correlation can negatively influence model estimation results
• not independent information
• scatterplot matrices can be generated automatically
with statistical software, manually using Excel

11


Scatterplot Matrices - tooth data

[Scatterplot matrix (teeth data): FLUORIDE, AGE, BRUSHING, DISCOLOR]

12


Time Sequence Plot

- for naphtha 90% point - indicates amount of heavy


hydrocarbons present in gasoline range material
[Time sequence plot: naphtha 90% point (degrees F) vs. time - annotations: excursion - sudden shift in operation; meandering about average operating point - time correlation in data]

13
What do dynamic data look like?

[Time series plot of industrial data: var1 and var2 over roughly 2100 observations]

14


Assessing Systematic Relationships

Quantitative Methods

• correlation
» formal def’n plus sample statistic (“Pearson’s r”)
• covariance
» formal def’n plus sample statistic

provide a quantitative measure of systematic LINEAR


relationships

15
Covariance

Formal Definition

• given two random variables X and Y, the covariance


is
Cov ( X , Y ) = E {( X − µ X )(Y − µY )}
• E{ } - expected value
• sign of the covariance indicates the sign of the slope
of the systematic linear relationship
» positive value --> positive slope
» negative value --> negative slope
• issue - covariance is SCALE DEPENDENT
16
Covariance

• motivation for covariance as a measure of systematic


linear relationship
» look at pairs of departures about the mean of X, Y

[Two sketches: scatter of (X, Y) points in the plane, with the mean of X and Y marked on each]
17
Correlation

• is the “dimensionless” covariance


» divide covariance by standard dev’ns of X, Y
• formal definition
$\mathrm{Corr}(X,Y) = \rho(X,Y) = \dfrac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y}$

• properties
» dimensionless
» range: $-1 \le \rho(X,Y) \le 1$
(values near -1: strong linear relationship with negative slope; values near +1: strong linear relationship with positive slope)
Note - the correlation gives NO information about the
actual numerical value of the slope.
18
Estimating Covariance, Correlation

… from process data (with N pairs of observations)


Sample Covariance

$R = \dfrac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})$

Sample Correlation

$r = \dfrac{\dfrac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{s_X\, s_Y}$

19
Making Inferences

The sample covariance and correlation are


STATISTICS, and have their own probability
distributions.

Confidence interval for sample correlation -


» the following is approximately distributed as a standard normal random variable:

$\sqrt{N-3}\,\bigl(\tanh^{-1}(r) - \tanh^{-1}(\rho)\bigr)$

» derive confidence limits for $\tanh^{-1}(\rho)$ and convert to confidence limits for the true correlation using tanh
20
Confidence Interval for Correlation

Procedure
1. find $z_{\alpha/2}$ for the desired confidence level
2. the confidence interval for $\tanh^{-1}(\rho)$ is

$\tanh^{-1}(r) \pm z_{\alpha/2}\,\dfrac{1}{\sqrt{N-3}}$

3. convert to confidence limits for the correlation by taking tanh of the limits in step 2

A hypothesis test can also be performed using this function of the


correlation and comparing to the standard normal distribution
21
Example - Solder Thickness

Objective - study the effect of temperature on solder


thickness
Data - in pairs
Solder Temperature (C) Solder Thickness (microns)
245 171.6
215 201.1
218 213.2
265 153.3
251 178.9
213 226.6
234 190.3
257 171
244 197.5
225 209.8
22
Example - Solder Thickness

[Scatterplot: solder thickness (microns) vs. solder temperature (C)]

Correlation matrix:
                              Solder Temperature (C)    Solder Thickness (microns)
Solder Temperature (C)        1
Solder Thickness (microns)    -0.920001236              1

23
Example - Solder Thickness

Confidence Interval

$z_{\alpha/2}$ = 1.96 (95% confidence level)

limits in $\tanh^{-1}(\rho)$:   -2.329837282   -0.848216548

limits in $\rho$:   -0.981238575   -0.690136605
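A minimal numpy sketch of this calculation for the solder data above; the variable names are illustrative and 1.96 is the $z_{\alpha/2}$ value for the 95% level used on the slide.

```python
import numpy as np

# solder temperature / thickness data from the table above
temp = np.array([245, 215, 218, 265, 251, 213, 234, 257, 244, 225], dtype=float)
thick = np.array([171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8])

r = np.corrcoef(temp, thick)[0, 1]                        # Pearson's r, approx. -0.92
N = len(temp)
half = 1.96 / np.sqrt(N - 3)
limits_z = (np.arctanh(r) - half, np.arctanh(r) + half)   # limits in tanh^-1(rho)
limits_rho = np.tanh(limits_z)                            # approx. (-0.98, -0.69)
```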

24
Empirical Modeling - Terminology

• response
» “dependent” variable - responds to changes in other
variables
» the response is the characteristic of interest which we are
trying to predict
• explanatory variable
» “independent” variable, regressor variable, input, factor
» these are the quantities that we believe have an
influence on the response
• parameter
» coefficients in the model that describe how the
regressors influence the response
25
Models

When we are estimating a model from data, we


consider the following form:

$Y = f(X, \theta) + \varepsilon$

(Y - response; X - explanatory variables; θ - parameters; ε - “random error”)

26
The Random Error Term

• is included to reflect fact that measured data contain


variability
» successive measurements under the same conditions
(values of the explanatory variables) are likely to be
slightly different
» this is the stochastic component
» the functional form describes the deterministic
component
» random error is not necessarily the result of mistakes in
experimental procedures - reflects inherent variability
» “noise”

27
Types of Models

• linear/nonlinear in the parameters


• linear/nonlinear in the explanatory variables
• number of response variables
– single response (standard regression)
– multi-response (or “multivariate” models)

From the perspective of statistical model-building,


the key point is whether the model is linear or
nonlinear in the PARAMETERS.

28
Linear Regression Models

• linear in the parameters

$T_{95} = b_1 T_{LGO} + b_2 T_{mid} + \varepsilon$

• can be nonlinear in the regressors, e.g. with a squared regressor

$T_{95} = b_1 T_{LGO} + b_2 T_{mid}^2 + \varepsilon$

29
Nonlinear Regression Models

• nonlinear in the parameters


– e.g., Arrhenius rate expression

$r = k_0 \exp\!\left(\dfrac{-E}{RT}\right)$

(nonlinear in E; linear in $k_0$ if E is fixed)

30
Nonlinear Regression Models

• sometimes transformably linear


• start with

$r = k_0 \exp\!\left(\dfrac{-E}{RT}\right) + \varepsilon$

and take ln of both sides to produce

$\ln(r) = \ln(k_0) - \dfrac{E}{RT} + \delta$

which is of the form

$Y = \beta_0 + \beta_1 \dfrac{1}{RT} + \delta$   -   linear in the parameters
31
Transformations

• note that linearizing the nonlinear equation by


transformation can lead to misleading estimates if the
proper estimation method is not used
• transforming the data can alter the statistical
distribution of the random error term

32
Ordinary LS vs. Multi-Response

• single response (ordinary least squares)


T95 = b1 TLGO + b2 Tmid + ε
• multi-response (e.g., Partial Least Squares)
T95, LGO = b11 TLGO + b12 Tmid + ε1
T95, kero = b21 Tkero + b22 Tmid + ε 2
– issue - joint behaviour of responses, noise

We will be focussing on single response models.

33
Linear Multiple Regression

Model Equation
$Y_i = \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i$

$Y_i$ - i-th observation of the response (i-th data point)
$X_{i1}, \ldots, X_{ip}$ - i-th values of the explanatory variables $X_1$ to $X_p$
$\varepsilon_i$ - random noise in the i-th observation of the response

The intercept can be considered as corresponding


to an X which always has the value “1”

34
Assumptions for Least Squares Estimation

Values of explanatory variables are known EXACTLY


» random error is strictly in the response variable
» practically - a random component will almost always be
present in the explanatory variables as well
» we assume that this component has a substantially
smaller effect on the response than the random
component in the response
» if random fluctuations in the explanatory variables are
important, consider alternative method (“Errors in
Variables” approach)

35
Assumptions for Least Squares Estimation

The form of the equation provides an adequate


representation for the data
» can test adequacy of model as a diagnostic

Variance of random error is CONSTANT over range of


data collected
» e.g., variance of random fluctuations in thickness
measurements at high temperatures is the same as
variance at low temperatures
» data is “heteroscedastic” if the variance is not constant -
different estimation procedure is required
» thought - percentage error in instruments?
36
Assumptions for Least Squares Estimation

The random fluctuations in each measurement are


statistically independent from those of other
measurements
» at same experimental conditions
» at other experimental conditions
» implies that random component has no “memory”
» no correlation between measurements
Random error term is normally distributed
» typical assumption
» not essential for least squares estimation
» important when determining confidence intervals,
conducting hypothesis tests
37
Least Squares Estimation - graphically

least squares - minimize sum of squared prediction errors

[Sketch: response (solder thickness) vs. T - data points scattered about the deterministic “true” relationship; the vertical distance from a point to the curve is the prediction error (“residual”)]
More Notation and Terminology

Random error is “independent, identically distributed”


(I.I.D) -- can say that it is IID Normal

Capitals - Y - denotes random variable


- except in case of explanatory variable - capital used
to denote formal def’n
Lower case - y, x - denotes measured values of
variables
Model Y = β0 + β1 X + ε

Measurement y = β0 + β1x + ε
39
More Notation and Terminology

Estimate - denoted by “hat”


» examples - estimates of response, parameter
$\hat{y},\ \hat{\beta}_0$

Residual - difference between measured and predicted


response
$e = y - \hat{y}$

40
Matrix Representation for Multiple Regression
We can arrange the observations in “tabular” form - a vector of observations and a matrix of explanatory values:

$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{N-1} \\ Y_N \end{bmatrix} =
\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{N-1,1} & X_{N-1,2} & \cdots & X_{N-1,p} \\ X_{N,1} & X_{N,2} & \cdots & X_{N,p} \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{N-1} \\ \varepsilon_N \end{bmatrix}$

41
Matrix Representation for Multiple Regression

The model is written as:

$Y = X\beta + \varepsilon$

(Y: N×1 vector; X: N×p matrix; β: p×1 vector; ε: N×1 vector)

N --> number of data observations


p --> number of parameters

42
Least Squares Parameter Estimates

We make the same assumptions as in the straight line


regression case:
» independent random noise components in each
observation
» explanatory variables known exactly - no randomness
» variance constant over experimental region (identically
distributed noise components)

43
Residual Vector
Given a set of parameter values $\tilde{\beta}$, the residual vector is formed
from the matrix expression:

$\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{N-1} \\ e_N \end{bmatrix} =
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{N-1} \\ Y_N \end{bmatrix} -
\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{N-1,1} & X_{N-1,2} & \cdots & X_{N-1,p} \\ X_{N,1} & X_{N,2} & \cdots & X_{N,p} \end{bmatrix}
\begin{bmatrix} \tilde{\beta}_1 \\ \tilde{\beta}_2 \\ \vdots \\ \tilde{\beta}_p \end{bmatrix}$
44
Sum of Squares of Residuals

… is the same as before, but can be expressed as the squared


length of the residual vector:
$SSE = \sum_{i=1}^{N} e_i^2 = e^T e = \|e\|^2 = (Y - X\tilde{\beta})^T (Y - X\tilde{\beta})$

45
Least Squares Parameter Estimates

Find the set of parameter values that minimize the sum


of squares of residuals (SSE)
» apply necessary conditions for an optimum from calculus
(stationary point)

$\left.\dfrac{\partial}{\partial \beta}(SSE)\right|_{\hat{\beta}} = 0$
» system of N equations in p unknowns, with number of
parameters < number of observations : over-determined
system of equations
» solution - set of parameter values that comes “closest to
satisfying all equations” (in a least squares sense)

46
Least Squares Parameter Estimates

The solution is:


$\hat{\beta} = (X^T X)^{-1} X^T Y$

$(X^T X)^{-1} X^T$ is a generalized inverse of X
- a generalization of the standard concept of a matrix inverse to the case of a non-square matrix

47
Example - Solder Thickness

Let’s analyze the data considered for the straight line case:

Solder Temperature (C)   Solder Thickness (microns)
245   171.6
215   201.1
218   213.2
265   153.3
251   178.9
213   226.6
234   190.3
257   171
244   197.5
225   209.8

Model:   $Y = \beta_0 + \beta_1 X + \varepsilon$

48
Example - Solder Thickness
1716.  1 245  ε1 
     
     
In matrix form:  2011
.  1 215  ε2 
     
     
213.2 1 218  ε3 
     
     
153.3 1 265  ε4 
     
     
178.9  1 251 β0   ε5 
Y = Xβ + ε ⇔ 

 
=
   
  + 
226.6 1 213  β1   ε6 
     
     
190.3 1 234   ε7 
     
     
 171  1 257   ε8 
     
     
197.5 1 244   ε9 
     
     
209.8 1 225 ε 
10
49
Example - Solder Thickness

In order to calculate the least squares estimates:

$(X^T X) = \begin{bmatrix} 10 & 2367 \\ 2367 & 563335 \end{bmatrix}; \qquad X^T Y = \begin{bmatrix} 1910 \\ 449420 \end{bmatrix}$

50
Example - Solder Thickness

The least squares parameter estimates are obtained as:

$\hat{\beta} = (X^T X)^{-1} X^T Y = \begin{bmatrix} 18.373 & -0.0772 \\ -0.0772 & 0.0003 \end{bmatrix}\begin{bmatrix} 1910 \\ 449420 \end{bmatrix} = \begin{bmatrix} 458.10 \\ -1.13 \end{bmatrix}$
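A minimal numpy sketch of the same calculation (the variable names are illustrative; solving the normal equations avoids forming the inverse explicitly, and `np.linalg.lstsq(X, y, rcond=None)` would give the same result):

```python
import numpy as np

# solder data from the table above
temp = np.array([245, 215, 218, 265, 251, 213, 234, 257, 244, 225], dtype=float)
y = np.array([171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8])

X = np.column_stack([np.ones_like(temp), temp])     # N x p, intercept as a column of ones
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # approx. [458.1, -1.13]
```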

51
Example - Wave Solder Defects

(page 8-31, Course Notes)


W ave Solder Defects Data
Run Conveyor Speed Pot Temp Flux Density No. of Defects
1 -1 -1 -1 100
2 1 -1 -1 119
3 -1 1 -1 118
4 1 1 -1 217
5 -1 -1 1 20
6 1 -1 1 42
7 -1 1 1 41
8 1 1 1 113
9 0 0 0 101
10 0 0 0 96
11 0 0 0 115

52
Example - Wave Solder Defects

In matrix form:

$Y = X\beta + \varepsilon \;\Leftrightarrow\;
\begin{bmatrix} 100 \\ 119 \\ 118 \\ 217 \\ 20 \\ 42 \\ 41 \\ 113 \\ 101 \\ 96 \\ 115 \end{bmatrix} =
\begin{bmatrix} 1 & -1 & -1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \end{bmatrix}$
53
Example - Wave Solder Defects

To calculate least squares parameter estimates:

11 0 0 0  1082 
   
   
0 8 0 0  212 
( X T X) =  ; XT Y =  
   
0 0 8 0  208 
   
   
 0 0 0 8 − 338
54
Example - Wave Solder Defects

Least squares parameter estimates:

1 0 0   1082   93.36 
0
11 
    
 1   212   26.50 
0 0 0    
8
β$ = ( XT X) −1 XT Y =   = 
 1    
0 0 0   208   26.0 
8
    
 1    
 0 0 0  − 338 − 42.25
8

55
Examples - Comments

• if there are N runs and the model has p parameters, $X^T X$ is a p×p
matrix (smaller dimension than the number of runs)
• the elements of $X^T Y$ are $\sum_i x_{ij} y_i$ for parameters j = 1, …, p
• in the Wave Solder Defects example, the values of the
explanatory variable for the runs followed very specific patterns
of -1 and +1, and XTX was a diagonal matrix
• in the Solder Thickness example, the values of the explanatory
variable did not follow a specific pattern, and XTX was not
diagonal
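A short numpy sketch for the wave solder example above; with the coded -1/0/+1 run conditions the cross-product matrix comes out diagonal, so each estimate is just the corresponding element of $X^T Y$ divided by a diagonal entry. Variable names are illustrative.

```python
import numpy as np

# wave solder defects data: intercept column plus the coded run conditions
X = np.array([
    [1, -1, -1, -1], [1,  1, -1, -1], [1, -1,  1, -1], [1,  1,  1, -1],
    [1, -1, -1,  1], [1,  1, -1,  1], [1, -1,  1,  1], [1,  1,  1,  1],
    [1,  0,  0,  0], [1,  0,  0,  0], [1,  0,  0,  0]], dtype=float)
y = np.array([100, 119, 118, 217, 20, 42, 41, 113, 101, 96, 115], dtype=float)

print(X.T @ X)                                      # diag(11, 8, 8, 8)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # approx. [98.36, 26.5, 26.0, -42.25]
```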

56
Graphical Diagnostics

Basic Principle - extract as much trend as possible from


the data

Residuals should have no remaining trend -


» with respect to the explanatory variables
» with respect to the data sequence number
» with respect to other possible explanatory variables
(“secondary variables”)
» with respect to predicted values

57
Graphical Diagnostics

Residuals vs. Predicted Response Values


[Sketch: residuals $e_i$ vs. predicted values $\hat{y}_i$ - even scatter over the range of prediction, no discernible pattern, roughly half the residuals positive and half negative: DESIRED RESIDUAL PROFILE]

58
Graphical Diagnostics

Residuals vs. Predicted Response Values


[Sketch: residuals $e_i$ vs. predicted values $\hat{y}_i$ - one point lies well outside the main body of residuals: RESIDUAL PROFILE WITH OUTLIERS]

59
Graphical Diagnostics

Residuals vs. Predicted Response Values


[Sketch: residuals $e_i$ vs. predicted values $\hat{y}_i$ - the variance of the residuals appears to increase with higher predictions: NON-CONSTANT VARIANCE]

60
Graphical Diagnostics

Residuals vs. Explanatory Variables


» ideal - no systematic trend present in plot
» inadequate model - evidence of trend present

[Sketch: residuals $e_i$ vs. explanatory variable x - leftover quadratic trend: a quadratic term is needed in the model]

61
Graphical Diagnostics

Residuals vs. Explanatory Variables Not in Model


» ideal - no systematic trend present in plot
» inadequate model - evidence of trend present

[Sketch: residuals $e_i$ vs. a variable w not in the model - systematic trend not accounted for in the model: include a linear term in “w”]

62
Graphical Diagnostics

Residuals vs. Order of Data Collection

[Sketch 1: residuals $e_i$ vs. time order t - drifting pattern: failure to account for a time trend in the data]

[Sketch 2: residuals $e_i$ vs. time order t - runs of residuals with the same sign: successive random noise components are correlated - consider a more complex model (time series model for the random component?)]

63
Quantitative Diagnostics - Ratio Tests

Residual Variance Test


» is the variance of the residuals significantly larger than the inherent
noise variance?
» same test as that for the straight line data
» only distinction - the number of degrees of freedom for the
Mean Squared Error is N-p, where p is the number of
parameters in the model
» compare the ratio to $F_{N-p,\,M-1,\,0.95}$, where M is the number of
data points used to estimate the inherent variance
» significant? -> model is INADEQUATE

64
Quantitative Diagnostics - Ratio Tests

Residual Variance Ratio


$\dfrac{s^2_{\text{residuals}}}{s^2_{\text{inherent}}} = \dfrac{\text{Mean Squared Error of Residuals}\ (MSE)}{s^2_{\text{inherent}}}$

Mean Squared Error of Residuals (variance of the residuals):

$s^2_{\text{residuals}} = MSE = \dfrac{\sum_{i=1}^{N} e_i^2}{N-p}$

65
Quantitative Diagnostics - Ratio Tests

Mean Square Regression Ratio


» same as in the straight line case except for degrees of
freedom
Variance described by the model:

$MSR = \dfrac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{p - 1}$

66
Quantitative Diagnostics - Ratio Test

Test ratio: $\dfrac{MSR}{MSE}$ is compared against $F_{p-1,\,N-p,\,0.95}$

Conclusions?
– ratio is statistically significant --> the model explains significant trend
– NOT statistically significant --> significant trend has NOT been modeled, and the model is inadequate in its present form

For the multiple regression case, this test is a coarse


measure of whether some trend has been modeled -
it provides no indication of which X’s are important
67
Analysis of Variance Tables

The ratio tests involve a dissection of the sum of squares:

$SSR = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 \qquad SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

$TSS = \sum_{i=1}^{N} (y_i - \bar{y})^2 = SSR + SSE$
68
Analysis of Variance (ANOVA) for Regression

Source of       Degrees of    Sum of     Mean               F-Value       p-value
Variation       Freedom       Squares    Square

Regression      p-1           SSR        MSR = SSR/(p-1)    F = MSR/MSE   p
Residuals       N-p           SSE        MSE = SSE/(N-p)
Total           N-1           TSS
69
Quantitative Diagnostics - R2

Coefficient of Determination (“R2 Coefficient”)


» square of the correlation between observed and predicted values:

$R^2 = [\mathrm{corr}(y, \hat{y})]^2$

» relationship to the sums of squares:

$R^2 = 1 - \dfrac{SSE}{TSS} = \dfrac{SSR}{TSS}$

» values typically reported in “%”, i.e., 100 R2
» ideal - R2 near 100%
70
Issues with R2

• R2 is sensitive to extreme data points, resulting in misleading


indication of quality of fit
• R2 can be made artificially large by adding more parameters to
the model
» put a curve through every point - “connect the dots”
model --> simply modeling noise in the data, rather than
trend
» solution - define the “adjusted R2”, which penalizes the
addition of parameters to the model

71
Adjusted R2

Adjust for number of parameters relative to number of observations


» account for degrees of freedom of the sums of squares
» define in terms of Mean Squared quantities

$R^2_{adj} = 1 - \dfrac{MSE}{TSS/(N-1)} = 1 - \dfrac{SSE/(N-p)}{TSS/(N-1)}$
» want value close to 1 (or 100%), as before
» if N>>p, adjusted R2 is close to R2
» provides measure of agreement, but does not account for
magnitude of residual error
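As a small sketch, both coefficients can be computed directly from the sums of squares defined earlier (the function name is illustrative):

```python
# coefficient of determination and its adjusted form
def r_squared(SSE, TSS, N, p):
    R2 = 1.0 - SSE / TSS                              # = SSR / TSS
    R2_adj = 1.0 - (SSE / (N - p)) / (TSS / (N - 1))
    return R2, R2_adj
```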

72
Testing the Need for Groups of Terms

In words: “Does a specific group of terms account for significant


trend in the model”?

Test
» compare difference in residual variance between full and
reduced model
» benchmark against an estimate of the inherent variation
» if significant, conclude that the group of terms ARE
required
» if not significant, conclude that the group of terms can be
dropped from the model - not explaining significant trend
» note that remaining parameters should be re-estimated in
this case
73
Testing the Need for Groups of Terms

Test:
A - denotes the full model (with all terms)
B - denotes the reduced model (group of terms deleted)
Form the ratio:

$\dfrac{SSE_{\text{model }B} - SSE_{\text{model }A}}{s^2 (p_A - p_B)}$

pA, pB are the numbers of parameters in models A, B


s2 is an estimate of the inherent noise variance:
» estimate as SSEA/(N-pA)

74
Testing the Need for Groups of Terms

Compare this ratio to

$F_{p_A - p_B,\ \nu_{\text{inherent}},\ 0.95}$

» if $MSE_A$ is used as the estimate of the inherent variance, then the
degrees of freedom of the inherent variance estimate are $N - p_A$
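A hedged sketch of this extra-sum-of-squares test as a helper function; the name `group_test` and the argument layout are hypothetical, and the SSE values would come from fitting the full and reduced models.

```python
from scipy import stats

# SSE_A, p_A: full model (A); SSE_B, p_B: reduced model (B); N: number of observations
def group_test(SSE_A, p_A, SSE_B, p_B, N, alpha=0.05):
    s2 = SSE_A / (N - p_A)                          # inherent variance estimate (MSE of full model)
    ratio = (SSE_B - SSE_A) / (s2 * (p_A - p_B))
    critical = stats.f.ppf(1 - alpha, p_A - p_B, N - p_A)
    return ratio, critical, ratio > critical        # True -> the group of terms is required
```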

75
Lack of Fit Test

If we have replicate runs in our regression data set, we can break


out the noise variance from the residuals, and assess the
component of the residuals due to unmodelled trend

Replicates -
» repeated runs at the SAME experimental conditions
» note that all explanatory variables must be at fixed
conditions
» indication of inherent variance because no other factors
are changing
» measure of repeatability of experiments

76
Using Replicates

We can estimate the sample variance for each set of replicates,


and pool the estimate of the variance
» constancy of variance can be checked using Bartlett’s
test
» constant variance is assumed for ordinary least squares estimation

For each replicate set “i” (with $n_i$ values and average $\bar{y}_{i\bullet}$), we have:

$s_i^2 = \dfrac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\bullet})^2}{n_i - 1}$

77
Using Replicates

The pooled estimate of variance is:

$s^2_{\text{pooled}} = \dfrac{\sum_{i=1}^{m} (n_i - 1)\, s_i^2}{\left(\sum_{i=1}^{m} n_i\right) - m}$
i.e., convert back to sums of squares, and divide by the total
number of degrees of freedom (the sum of the degrees of
freedom for each variance estimate)

78
The Lack of Fit Test

Back to the sum of squares “block”: the total sum of squares TSS splits into SSR and SSE, and SSE splits further into the “pure error” sum of squares ($SSE_P$) and the “lack of fit” sum of squares ($SSE_{LOF}$).
79
The Lack of Fit Test

We partition the SSE into two components:


» component due to inherent noise
» component due to unmodeled trend
Pure error sum of squares ($SSE_P$):

$SSE_P = \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\bullet})^2 \right)$

i.e., add together the sums of squares associated with each replicate group (there are “m” replicate groups in total)

80
The Lack of Fit Test

The “lack of fit” sum of squares ($SSE_{LOF}$) is formed by backing out $SSE_P$ from SSE:

$SSE_{LOF} = SSE - SSE_P$

Degrees of Freedom:
- for $SSE_P$:   $\left(\sum_{i=1}^{m} n_i\right) - m$
- for $SSE_{LOF}$:   $(N - p) - \left[\left(\sum_{i=1}^{m} n_i\right) - m\right]$
81
The Lack of Fit Test

The test ratio:

$\dfrac{MSE_{LOF}}{MSE_P} = \dfrac{SSE_{LOF}/\nu_{LOF}}{SSE_P/\nu_{Pure}}$

is compared to $F_{\nu_{LOF},\,\nu_{Pure},\,0.95}$

» significant? - there is significant unmodeled trend, and the model should be modified
» not significant? - there is no significant unmodeled trend, which supports model adequacy

82
Example - Wave Solder Defects

From the earlier regression, SSE = 2694.0 and SSR = 25306.5

Replicate set (runs 9-11, all at conditions 0, 0, 0): defect counts 101, 96, 115
  std. dev'n = 9.848858;  sample variance = 97;  sum of squares (as $(n_i - 1) s_i^2$) = 194

LACK OF FIT TEST (ANOVA)
Source         df    SS          MS          F value     from F-table (95% pt)
Residual       7     2694.045
Lack of fit    5     2500.045    500.0091    5.154733    19.3 (this is $F_{5,2,0.95}$)
Pure error     2     194         97

This was done by hand - Excel has no Lack of Fit test
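A short scipy sketch reproducing the lack of fit numbers tabulated above (sums of squares taken directly from the table; names are illustrative):

```python
from scipy import stats

SSE, df_E = 2694.045, 7          # residual sum of squares and its degrees of freedom
SSEP, df_P = 194.0, 2            # pure error from the replicate runs (101, 96, 115)
SSELOF, df_LOF = SSE - SSEP, df_E - df_P

F = (SSELOF / df_LOF) / (SSEP / df_P)               # approx. 5.15
F_crit = stats.f.ppf(0.95, df_LOF, df_P)            # approx. 19.3 -> no significant lack of fit
```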


83
A Comment on the Ratio Tests

Order of Preference (or “value”) - from most definitive to


least definitive:

• Lack of Fit Test -- MSELOF/MSEP


• MSE/s2inherent
• MSR/MSE

If at all possible, try to include replicate runs in your experimental


program so that the Lack of Fit test can be conducted

Many statistical software packages will perform the Lack of Fit test
in their Regression modules - Excel does NOT
84
The Parameter Estimate Covariance Matrix

… summarizes the variance-covariance structure of the parameter estimates:

$\Sigma = \begin{bmatrix} \mathrm{Var}(\hat{\beta}_1) & \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) & \cdots & \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_p) \\ \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) & \mathrm{Var}(\hat{\beta}_2) & \cdots & \mathrm{Cov}(\hat{\beta}_2, \hat{\beta}_p) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_p) & \mathrm{Cov}(\hat{\beta}_2, \hat{\beta}_p) & \cdots & \mathrm{Var}(\hat{\beta}_p) \end{bmatrix}$

85
Properties of the Covariance Matrix

• symmetric -- Cov(b1,b2) = Cov(b2,b1)


• diagonal entries are always non-negative
• off-diagonal entries can be +ve or -ve
• matrix is positive definite: $v^T \Sigma v > 0$ for any non-zero vector v

86
Parameter Estimate Covariance Matrix

The covariance matrix of the parameter estimates is defined as:

$\Sigma = E\{(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T\}$

Compare this expression with the variance of a single parameter:

$\mathrm{Var}(\hat{\beta}) = E\{(\hat{\beta} - \beta)^2\}$

For linear regression, the covariance matrix is obtained as:

$\Sigma = (X^T X)^{-1} \sigma_\varepsilon^2$

87
Parameter Estimate Covariance Matrix

Key point - the covariance structure of the parameter estimates is


governed by the experimental run conditions used for the
explanatory variables -
the Experimental Design

Example - the Wave Solder Defects data:

$(X^T X) = \begin{bmatrix} 11 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 8 & 0 \\ 0 & 0 & 0 & 8 \end{bmatrix}; \qquad (X^T X)^{-1} = \begin{bmatrix} \tfrac{1}{11} & 0 & 0 & 0 \\ 0 & \tfrac{1}{8} & 0 & 0 \\ 0 & 0 & \tfrac{1}{8} & 0 \\ 0 & 0 & 0 & \tfrac{1}{8} \end{bmatrix}$

The parameter estimates are uncorrelated, and the variances of the non-intercept parameters are the same - towards “uniform precision” of the parameter estimates.
88
Estimating the Parameter Covariance Matrix
The X matrix is known - set of run conditions - so the only
estimated quantity is the inherent noise variance
» from replicates, external estimate, or MSE

For the wave solder defect data, the residual variance estimate (MSE) is 384.86 with 7 degrees of freedom, and the parameter covariance matrix is estimated as:

$\hat{\Sigma} = (X^T X)^{-1} s_e^2 = \begin{bmatrix} \tfrac{1}{11} & 0 & 0 & 0 \\ 0 & \tfrac{1}{8} & 0 & 0 \\ 0 & 0 & \tfrac{1}{8} & 0 \\ 0 & 0 & 0 & \tfrac{1}{8} \end{bmatrix} (384.86) = \begin{bmatrix} 34.99 & 0 & 0 & 0 \\ 0 & 48.11 & 0 & 0 \\ 0 & 0 & 48.11 & 0 \\ 0 & 0 & 0 & 48.11 \end{bmatrix}$

(residual variance taken from the MSE)
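A minimal numpy sketch of this estimate, assuming the coded wave solder design above and MSE = 384.86 as the noise variance estimate (variable names are illustrative):

```python
import numpy as np

# intercept column plus the coded run conditions of the wave solder design
levels = [[-1, -1, -1], [1, -1, -1], [-1, 1, -1], [1, 1, -1],
          [-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1],
          [0, 0, 0], [0, 0, 0], [0, 0, 0]]
X = np.column_stack([np.ones(len(levels)), np.array(levels, dtype=float)])

cov_beta = np.linalg.inv(X.T @ X) * 384.86          # diag approx. [34.99, 48.11, 48.11, 48.11]
std_err = np.sqrt(np.diag(cov_beta))                # standard errors approx. [5.92, 6.94, 6.94, 6.94]
```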
89
Using the Covariance Matrix

Variances of parameter estimates


» are obtained from the diagonal of the matrix
» square root is the standard dev’n, or “standard error”, of
the parameter estimates
• use to formulate confidence intervals for the parameters
• use in hypothesis tests for the parameters
Correlations between the parameter estimates
» can be obtained by taking covariance from appropriate
off-diagonal element, and dividing by the standard errors
of the individual parameter estimates

90
Correlation of the Parameter Estimates

Note that
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$

I.e., the parameter estimate for the intercept depends


linearly on the slope!
» the slope and intercept estimates are correlated

changing slope changes


point of intersection with
axis because the line must
go through the centroid of the
data

91
Getting Rid of the Covariance

Let’s define the explanatory variable as the deviation


from its average:
$Z = X - \bar{X}$     (the average of Z is zero)

Least squares parameter estimates:

$\hat{\beta}_0 = \bar{Y} \qquad \hat{\beta}_1 = \dfrac{\sum_{i=1}^{N} z_i Y_i}{\sum_{i=1}^{N} z_i^2}$

- note that now there is no explicit dependence on the slope value in the intercept expression
92
Getting Rid of the Covariance

In this form of the model, the slope and intercept


parameter estimates are uncorrelated

Why is lack of correlation useful?


» allows independent decisions about parameter estimates
» decide whether slope is significant, intercept is significant
individually
» “unique” assignment of trend
• intercept clearly associated with mean of y’s
• slope clearly associated with steepness of trend
» correlation can be eliminated by altering form of model,
and choice of experimental points
93
Confidence Intervals for Parameters

… similar procedure to straight line case:


» given standard error for parameter estimate, use
appropriate t-value, and form interval as:

$\hat{\beta}_i \pm t_{\nu,\,\alpha/2}\ s_{\hat{\beta}_i}$
The degrees of freedom for the t-statistic come from the
estimate of the inherent noise variance
» the degrees of freedom will be the same for all of the
parameter estimates

If the confidence interval contains zero, the parameter is plausibly


zero and consideration should be given to deleting the term.
94
Hypothesis Tests for Parameters
… represent an alternative approach to testing whether the term
should be retained in the model

Null hypothesis - parameter = 0


Alternate hypothesis - parameter is not equal to 0

Test statistic: $\dfrac{\hat{\beta}_i}{s_{\hat{\beta}_i}}$

» compare its absolute value to $t_{\nu,\,\alpha/2}$
» if test statistic is greater (“outside the fence”), parameter
is significant -- retain
» inside the fence? - consider deleting the term
95
Example - Wave Solder Defects Data

The test statistic will be compared to $t_{7,\,0.025} = 2.365$


because MSE is used to calculate standard errors of parameters,
and has 7 degrees of freedom.

Test statistic for the intercept:

$\dfrac{\hat{\beta}_0}{s_{\hat{\beta}_0}} = \dfrac{98.36}{5.92} = 16.63$

Since 16.63 > 2.365, conclude that the intercept parameter IS significant and should be retained.

96
Example - Wave Solder Defects Data

For the next term in the model:

$\dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \dfrac{26.5}{6.94} = 3.82 > 2.365$

Therefore this term should be retained in the model.

Because the parameter estimates are uncorrelated in this model,


terms can be dropped without the need to re-estimate the other
parameters in the model -- in general, you will have to re-
estimate the final model once more to obtain the parameter
estimates corresponding to the final model form.
97
Example - Wave Solder Defects Data

From Excel:

                 Coefficients    Standard Error   t Stat     P-value     Lower 95%    Upper 95%
Intercept        98.36363636     5.915031978      16.62943   6.948E-07   84.376818    112.3505
Conveyor Speed   26.5            6.935989803      3.820652   0.0065367   10.099002    42.901
Pot Temp         26              6.935989803      3.748564   0.0071817   9.599002     42.401
Flux Density     -42.25          6.935989803      -6.09142   0.0004953   -58.651      -25.849

(Standard Error: standard deviation of each parameter estimate; t Stat: test statistic for each parameter; P-value: probability that a value is greater than the computed test ratio - a 2-tailed test; Lower/Upper 95%: confidence limits)
98
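A small sketch reproducing these columns from the estimates and standard errors shown above; scipy is assumed for the t-distribution, and 7 degrees of freedom come from the MSE-based variance estimate.

```python
import numpy as np
from scipy import stats

beta_hat = np.array([98.36, 26.5, 26.0, -42.25])
std_err = np.array([5.915, 6.936, 6.936, 6.936])
df = 7

t_stat = beta_hat / std_err                             # approx. [16.6, 3.8, 3.7, -6.1]
p_val = 2.0 * (1.0 - stats.t.cdf(np.abs(t_stat), df))   # two-tailed p-values
half = stats.t.ppf(0.975, df) * std_err
ci = np.column_stack([beta_hat - half, beta_hat + half])
```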
Precision of the Predicted Responses

The predicted response from an estimated model has uncertainty,


because it is a function of the parameter estimates which have
uncertainty:
e.g., Wave Solder Defect Model - the first response, at the point (-1, -1, -1):

$\hat{y}_1 = \hat{\beta}_0 + \hat{\beta}_1(-1) + \hat{\beta}_2(-1) + \hat{\beta}_3(-1)$

If the parameter estimates were uncorrelated, the variance of the predicted response would be:

$\mathrm{Var}(\hat{y}_1) = \mathrm{Var}(\hat{\beta}_0) + \mathrm{Var}(\hat{\beta}_1) + \mathrm{Var}(\hat{\beta}_2) + \mathrm{Var}(\hat{\beta}_3)$
(recall results for variance of sum of random variables)

99
Precision of the Predicted Responses
In general, both the variances and covariances of the parameter
estimates must be taken into account.
For prediction at the k-th data point:

$\mathrm{Var}(\hat{y}_k) = x_k^T (X^T X)^{-1} x_k\, \sigma_\varepsilon^2 = \begin{bmatrix} x_{k1} & x_{k2} & \cdots & x_{kp} \end{bmatrix} (X^T X)^{-1} \begin{bmatrix} x_{k1} \\ x_{k2} \\ \vdots \\ x_{kp} \end{bmatrix} \sigma_\varepsilon^2$
100
Example - Wave Solder Defects Model

In this example, the parameter estimates are uncorrelated


» XTX is diagonal
» variance of the predicted reponse is in fact the sum of the
variances of the parameter estimates

Variance of prediction at run #11 (0,0,0):

Var ( y$11 ) = Var ( β$0 ) + Var ( β$1 )( 0) + Var ( β$2 )( 0) + Var ( β$3 )( 0)
= Var ( β$ )
0

101
Precision of “Future” Predictions

Suppose we want to predict the response at conditions other than


those of the experimental runs --> future run.
The value we observe will consist of the component from the
deterministic component, plus the noise component.
In predicting this value, we must consider:
» uncertainty from our prediction of the deterministic
component
» noise component
The variance of this future prediction is $\mathrm{Var}(\hat{y}_{\text{future}}) + \sigma_\varepsilon^2$,
where $\mathrm{Var}(\hat{y}_{\text{future}})$ is computed using the same expression as for the variance of predicted responses at the experimental run conditions

102
Estimating Precision of Predicted Responses

Use an estimate of the inherent noise variance:

$s^2_{\hat{y}_k} = x_k^T (X^T X)^{-1} x_k\, s_e^2$

The degrees of freedom for the estimated variance of the predicted


response are those of the estimate of the noise variance
» replicates, external estimate, MSE

103
Confidence Limits for Predicted Responses

Follow an approach similar to that for parameters - 100(1-alpha)%


confidence limits for the predicted response at the k-th run are:

$\hat{y}_k \pm t_{\nu,\,\alpha/2}\ s_{\hat{y}_k}$

» the degrees of freedom are those of the inherent noise variance estimate

If the prediction is for a response at conditions OTHER than one of the experimental runs, the limits are:

$\hat{y}_k \pm t_{\nu,\,\alpha/2}\ \sqrt{s^2_{\hat{y}_{\text{future}}} + s_e^2}$
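A minimal sketch for the centre-point run of the wave solder design, using the values from the example above (the variable names are illustrative):

```python
import numpy as np
from scipy import stats

XtX_inv = np.diag([1/11, 1/8, 1/8, 1/8])
beta_hat = np.array([98.36, 26.5, 26.0, -42.25])
s2_e, df = 384.86, 7                                # MSE and its degrees of freedom

x_k = np.array([1.0, 0.0, 0.0, 0.0])                # intercept column plus run conditions (0,0,0)
y_k = x_k @ beta_hat
s2_yk = x_k @ XtX_inv @ x_k * s2_e                  # equals Var(beta0_hat) here, approx. 34.99

t_val = stats.t.ppf(0.975, df)
limits = (y_k - t_val * np.sqrt(s2_yk), y_k + t_val * np.sqrt(s2_yk))
# for a *future* observation at x_k, replace np.sqrt(s2_yk) with np.sqrt(s2_yk + s2_e)
```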

104
Practical Guidelines for Model Development

1) Consider CODING your explanatory variables (see the short sketch after this list)

Coding - one standard form:   $\tilde{x}_i = \dfrac{x_i - \bar{x}_i}{\tfrac{1}{2}\,\mathrm{range}(x_i)}$
» places designed experiment into +1,-1 form
» if run conditions are from an experimental design, this
coding must be used in order to obtain all of the benefits
from the design - uncorrelated parameter estimates
» if conditions are not from an experimental design, such a
coding improves numerical conditioning of the problem --
similar numerical scales for all variables
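A one-function sketch of this coding (the helper name `code` is illustrative):

```python
import numpy as np

def code(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (0.5 * (x.max() - x.min()))

# e.g. code(temp) maps the solder temperatures onto roughly a -1 to +1 scale
```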

105
Practical Guidelines for Model Development

2) Types of models -
» linear in the explanatory variables
» linear with two-factor interactions (xi xj)
» general polynomials

3) Watch for collinearity in the X matrix - run condition patterns for


two or more explanatory variables are almost the same
» prevents clear assignment of trend to each factor
» shows up as singularity in XTX matrix
» associated with very strong correlation between
parameter estimates

106
Practical Guidelines for Model Development

4) Be careful not to extrapolate excessively beyond the range of


the data

5) The maximum number of parameters that can be fit to a data set = the number of unique run conditions:

$N - \left[\left(\sum_{i=1}^{m} n_i\right) - m\right]$

» N - number of data points


» m - number of replicate sets
» ni - number of points in replicate set “i”
» as number of parameters increases, precision of
predictions decreases - start modeling noise
107
Practical Guidelines for Model Development

6) Model building sequence


» “building” approach - start with few terms and add as
necessary
» “pruning” approach - start with more terms and remove
those which aren’t statistically significant
» stepwise regression - terms are added, and retained
according to some criterion - frequently R2
• uncorrelated? criterion?
» “all subsets” regression - consider all subsets of model
terms of certain type, and select model with best criterion
• significant computational load

108
Polynomial Models

Order - the maximum over the p terms in the model of the sum of the exponents in a given term
e.g.,

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^2 + \beta_3 x_1^2 x_2^3 + \varepsilon$

is a fifth-order model

Two factor interaction -


» product term - x1x2
» implies that impact of x1 on response depends on value
of x2

109
Polynomial Models

Comments -
» polynomial models can sometimes suffer from collinearity
problems - coding helps this
» polynomials can provide approximations to nonlinear
functions - think of Taylor series approximations
» high-order polynomial models can sometimes be
replaced by fewer nonlinear function terms
• e.g., ln(x) vs. 3rd order polynomial

110
Joint Confidence Region (JCR)

… answers the question


Where do the true values of the parameters lie?

Recall that for individual parameters, we gain an understanding of


where the true value lies by:
» examining the variability pattern (distribution) for the
parameter estimate
» identify a range in which most of the values of the
parameter estimate are likely to lie
» manipulate this range to determine an interval which is
likely to contain the true value of the parameter

111
Joint Confidence Region

Confidence interval for individual parameter:


Step 1) The difference between the estimate and the true value, divided by the standard error of the estimate, is distributed as a Student's t-distribution with degrees of freedom equal to those of the variance estimate:

$\dfrac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}} \sim t_\nu$

Step 2) Find the interval $[-t_{\nu,\alpha/2},\ t_{\nu,\alpha/2}]$ which contains $100(1-\alpha)\%$ of the values - i.e., the probability of a t-value falling in this interval is $(1-\alpha)$

Step 3) Rearrange this interval to obtain the interval $\hat{\beta}_i \pm t_{\nu,\alpha/2}\, s_{\hat{\beta}_i}$, which contains the true value of the parameter $100(1-\alpha)\%$ of the time

112
Joint Confidence Region

Comments on Individual Confidence Intervals:


» sometimes referred to as marginal confidence intervals -
cf. marginal distributions vs. joint distributions from earlier
» marginal confidence intervals do NOT account for
correlations between the parameter estimates
» examining only marginal confidence intervals can
sometimes be misleading if there is strong correlation
between several parameter estimates
• value of one parameter estimate depends in part on another
• deletion of the other changes the value of the parameter
estimate
• decision to retain might be altered

113
Joint Confidence Region

Sequence:
Step 1) Identify a statistic which is a function of the parameter
estimate statistics

Step 2) Identify a region in which values of this statistic lie a certain


fraction of the time (a 100(1 − α )% region)

Step 3) Use this information to determine a region which contains


the true value of the parameters 100(1 − α )% of the time

114
Joint Confidence Region

The quantity

$\dfrac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{p\, s_\varepsilon^2} \sim F_{p,\,n-p}$

(where $s_\varepsilon^2$ is an estimate of the inherent noise variance; if MSE is used, its degrees of freedom are n-p)

is the ratio of two sums of squares, and is distributed as an F-distribution with p degrees of freedom in the numerator and n-p degrees of freedom in the denominator

115
Joint Confidence Region

We can define a region by thinking of those values of the ratio which have a value less than $F_{p,\,n-p,\,1-\alpha}$, i.e.,

$\dfrac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{p\, s_\varepsilon^2} \le F_{p,\,n-p,\,1-\alpha}$

Rearranging yields:

$(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le p\, s_\varepsilon^2\, F_{p,\,n-p,\,1-\alpha}$

116
Joint Confidence Region - Definition

The 100(1 − α )% joint confidence region for the parameters is


defined as those parameter values β satisfying:

$(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le p\, s_\varepsilon^2\, F_{p,\,n-p,\,1-\alpha}$

Interpretation:
» the region defined by this inequality contains the true
values of the parameters 100(1 − α )% of the time
» if values of zero for one or more parameters lie in this
region, those parameters are plausibly zero, and
consideration should be given to dropping the
corresponding terms from the model
117
Joint Confidence Region - Example with 2 Parameters

Let's reconsider the solder thickness example:

$(X^T X) = \begin{bmatrix} 10 & 2367 \\ 2367 & 563335 \end{bmatrix}; \qquad \hat{\beta} = \begin{bmatrix} 458.10 \\ -1.13 \end{bmatrix}; \qquad s_\varepsilon^2 = 135.38$

95% Joint Confidence Region (JCR) for the slope and intercept:

$(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) = \begin{bmatrix} \hat{\beta}_0 - \beta_0 & \hat{\beta}_1 - \beta_1 \end{bmatrix} X^T X \begin{bmatrix} \hat{\beta}_0 - \beta_0 \\ \hat{\beta}_1 - \beta_1 \end{bmatrix} \le p\, s_\varepsilon^2\, F_{p,\,n-p,\,0.95} = 2\, s_\varepsilon^2\, F_{2,\,10-2,\,0.95}$

118
Joint Confidence Region - Example with 2 Parameters

95% Joint Confidence Region (JCR) for the slope and intercept:

$\begin{bmatrix} 458.10 - \beta_0 & -1.13 - \beta_1 \end{bmatrix} X^T X \begin{bmatrix} 458.10 - \beta_0 \\ -1.13 - \beta_1 \end{bmatrix} \le 2\,(135.38)\, F_{2,\,8,\,0.95} = 2\,(135.38)(4.46) = 1207.59$

The boundary is an ellipse...
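A small sketch that evaluates the quadratic form for a candidate parameter pair and checks it against the bound; the helper name `in_jcr` and the test point are illustrative, and the numerical values are taken from the slides above.

```python
import numpy as np
from scipy import stats

XtX = np.array([[10.0, 2367.0], [2367.0, 563335.0]])
beta_hat = np.array([458.10, -1.13])
s2_e, p, n = 135.38, 2, 10

bound = p * s2_e * stats.f.ppf(0.95, p, n - p)     # approx. 1207.6, as computed above

def in_jcr(beta):
    d = beta_hat - np.asarray(beta, dtype=float)
    return d @ XtX @ d <= bound                    # True -> beta lies inside the ellipse

print(in_jcr([470.0, -1.2]))                       # a nearby candidate point
```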

119
Joint Confidence Region - Example with 2 Parameters

[Sketch: the 95% joint confidence region for slope and intercept - a rotated ellipse (rotation implies correlation between the estimates of slope and intercept), centred at the least squares parameter estimates; axes roughly Intercept 320 to 600 and Slope -1.6 to -0.6. The greater “shadow” along the horizontal (intercept) axis indicates that the variance of the intercept estimate is greater than that of the slope.]
120
Interpreting Joint Confidence Regions
1) Are the ellipse axes aligned with the coordinate axes?
» i.e., is the ellipse horizontal or vertical?
» alignment indicates no correlation between the parameter estimates
2) Which axis has the greatest shadow?
» projection of ellipse along axis
» indicates which parameter estimate has the greatest
variance
3) The elliptical region is, by definition, centred at the least squares
parameter estimates
4) Long, narrow, rotated ellipses indicate significant correlation
between parameter estimates
5) If a value of zero for one or more parameters lies in the region,
these parameters are plausibly zero - consider deleting from
model
121
Joint Confidence Regions

What is the motivation for the ratio $\dfrac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{p\, s_\varepsilon^2}$ used to define the joint confidence region?

Consider the joint distribution (density) for the parameter estimates:

$\dfrac{1}{(2\pi)^{p/2}\,\det(\Sigma_{\hat{\beta}})^{1/2}} \exp\!\left\{-\tfrac{1}{2} (\hat{\beta} - \beta)^T \Sigma_{\hat{\beta}}^{-1} (\hat{\beta} - \beta)\right\}$

Substituting the estimate $(X^T X)^{-1} s_\varepsilon^2$ for the parameter covariance matrix gives

$(\hat{\beta} - \beta)^T \left((X^T X)^{-1} s_\varepsilon^2\right)^{-1} (\hat{\beta} - \beta) = \dfrac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{s_\varepsilon^2}$
122
Confidence Intervals from Densities

[Sketch: left - individual interval: density $f_{\hat{\beta}_0}(b)$ for a single parameter estimate, with lower and upper limits enclosing area 1-alpha; right - joint region: joint density $f_{\hat{\beta}_0 \hat{\beta}_1}(b_0, b_1)$, with the joint confidence region enclosing volume 1-alpha]

123
Relationship to Marginal Confidence Limits

[Sketch: the joint confidence region (ellipse, centred at the least squares parameter estimates) with the marginal confidence interval for the slope and the marginal confidence interval for the intercept marked along the axes (Intercept roughly 320 to 600, Slope roughly -1.6 to -0.6)]
124
Relationship to Marginal Confidence Limits

[Sketch: the rectangular 95% confidence region implied by considering the parameters individually (the marginal intervals for slope and intercept) superimposed on the elliptical 95% confidence region for the parameters considered jointly]

125
Relationship to Marginal Confidence Intervals

Marginal confidence intervals are contained in joint confidence


region
» potential to miss portions of plausible parameter values
at tails of ellipsoid
» using individual confidence intervals implies a
rectangular region, which includes sets of parameter
values that lie outside the joint confidence region
» both situations can lead to
• erroneous acceptance of terms in model
• erroneous rejection of terms in model

126
Going Further - Nonlinear Regression Models
Model:

$Y_i = \eta(x_i, \theta) + \varepsilon_i$

($x_i$ - explanatory variables; $\theta$ - parameters; $\varepsilon_i$ - random noise component)
Estimation Approach:
» linearize model with respect to parameters
» treat linearization as a linear regression problem
» iterate by repeating linearization/estimation/linearization
about new estimates,… until convergence to parameter
values - Gauss-Newton iteration - or solve numerical
optimization problem
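A minimal sketch of a Gauss-Newton loop for the Arrhenius model discussed earlier; the data are synthetic and the starting values come from the log-transform (linearized) fit, so all numerical values here are illustrative rather than taken from the course notes.

```python
import numpy as np

R = 8.314

def eta(T, k0, E):                           # Arrhenius rate expression
    return k0 * np.exp(-E / (R * T))

def jac(T, k0, E):                           # columns: d eta/d k0, d eta/d E
    f = np.exp(-E / (R * T))
    return np.column_stack([f, -k0 * f / (R * T)])

rng = np.random.default_rng(1)
T = np.linspace(300.0, 400.0, 10)
y = eta(T, 2.0e5, 3.0e4) + rng.normal(0.0, 0.2, T.size)   # synthetic "observations"

# starting values from the transformably-linear fit: ln(r) = ln(k0) - E/(R*T)
slope, intercept = np.polyfit(1.0 / (R * T), np.log(y), 1)
k0, E = np.exp(intercept), -slope

for _ in range(20):                          # linearize / estimate / re-linearize
    resid = y - eta(T, k0, E)
    step, *_ = np.linalg.lstsq(jac(T, k0, E), resid, rcond=None)
    k0, E = k0 + step[0], E + step[1]
    if np.linalg.norm(step) < 1e-8 * (1.0 + abs(k0) + abs(E)):
        break
```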

127
Interpretation - Columns of X

– values of a given variable at different operating points -


– entries in XTX
» dot products of vectors of regressor variable values
» related to correlation between regressor variables
– form of XTX is dictated by experimental design
• e.g., 2k design - diagonal form

128


Parameter Estimation - Graphical View

[Sketch: the observation vector y, the approximating observation vector $\hat{y}$ (its projection onto the plane spanned by the columns of X), and the residual vector joining them]

129


Parameter Estimation - Nonlinear Regression Case

[Sketch: the observation vector y, the approximating observation vector $\hat{y}$ lying on the curved model surface, and the residual vector joining them]

130
Properties of LS Parameter Estimates

Key Point - parameter estimates are random variables


» because of how stochastic variation in data propagates
through estimation calculations
» parameter estimates have a variability pattern -
probability distribution and density functions
Unbiased
$E\{\hat{\beta}\} = \beta$
» the “average” of repeated data collection / estimation sequences will be the true value of the parameter vector

131


Properties of Parameter Estimates

Consistent
» behaviour as number of data points tends to infinity
» with probability 1, $\lim_{N \to \infty} \hat{\beta} = \beta$
» distribution narrows as N becomes large
Efficient
» variance of least squares estimates is less than that of
other types of parameter estimates

132


Properties of Parameter Estimates

Covariance Structure
» summarized by variance-covariance matrix

$\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} \sigma^2$

(structure dictated by the experimental design; $\sigma^2$ is the variance of the noise)

e.g., for two parameters $\beta_0, \beta_1$:

$\mathrm{Cov}(\hat{\beta}) = \begin{bmatrix} \mathrm{Var}(\hat{\beta}_0) & \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) \\ \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) & \mathrm{Var}(\hat{\beta}_1) \end{bmatrix}$

133


Prediction Variance

… in matrix form -

$\mathrm{var}(\hat{y}_k) = x_k^T (X^T X)^{-1} x_k\, \sigma^2$

where $x_k$ is the vector of conditions at the k-th data point

134


Joint Confidence Regions

Variability in data can affect parameter estimates jointly


depending on structure of data and model
[Sketch: a section (contour) of the sum of squares (or likelihood) function in the $(\beta_1, \beta_2)$ plane, with the marginal confidence limits marked on each axis]
135
