
CHEE824

Nonlinear Regression Analysis


J. McLellan
Winter 2004
Module 1:
Linear Regression
Outline -

• assessing systematic relationships


• matrix representation for multiple regression
• least squares parameter estimates
• diagnostics
» graphical
» quantitative
• further diagnostics
» testing the need for terms
» lack of fit test
• precision of parameter estimates, predicted responses
• correlation between parameter estimates

3
The Scenario

We want to describe the systematic relationship


between a response variable and a number of
explanatory variables

multiple regression

we will consider the case which is linear in the parameters

4
Assessing Systematic Relationships

Is there a systematic relationship?


Two approaches:
• graphical
» scatterplots, casement plots
• quantitative
» form correlations between response, explanatory
variables
» consider forming a correlation matrix - a table of pairwise
correlations between the response and the explanatory
variables, and between pairs of explanatory variables
• correlation between explanatory variables leads to
correlated parameter estimates

5
Graphical Methods for Analyzing Data

Visualizing relationships between variables

Techniques

• scatterplots
• scatterplot matrices
» also referred to as “casement plots”
• Time sequence plots

6


Scatterplots

… are also referred to as “x-y diagrams”


• plot values of one variable against another
• look for systematic trend in data
» nature of trend
• linear?
• exponential?
• quadratic?
» degree of scatter - does spread increase/decrease over
range?
• indication that variance isn’t constant over range of data

7


Scatterplots - Example

• tooth discoloration data - discoloration vs. fluoride

[Scatterplot (teeth data): DISCOLOR vs. FLUORIDE - annotation: trend - possibly nonlinear?]

8


Scatterplot - Example

• tooth discoloration data -discoloration vs. brushing

[Scatterplot (teeth data): DISCOLOR vs. BRUSHING - annotation: significant trend? - doesn’t appear to be present]

9


Scatterplot - Example

• tooth discoloration data -discoloration vs. brushing

[Scatterplot (teeth data): DISCOLOR vs. BRUSHING - annotation: variance appears to decrease as # of brushings increases]

10


Scatterplot matrices

… are a table of scatterplots for a set of variables


Look for -
» systematic trend between “independent” variable and
dependent variables - to be described by estimated
model
» systematic trend between supposedly independent
variables - indicates that these quantities are correlated
• correlation can negatively influence model estimation results
• not independent information
• scatterplot matrices can be generated automatically
with statistical software, manually using Excel

11


Scatterplot Matrices - tooth data

[Scatterplot matrix (teeth data): FLUORIDE, AGE, BRUSHING, DISCOLOR]

12


Time Sequence Plot

- for naphtha 90% point - indicates amount of heavy


hydrocarbons present in gasoline range material
[Time sequence plot: naphtha 90% point (degrees F) vs. time - annotations: excursion - sudden shift in operation; meandering about average operating point - time correlation in data]

13
What do dynamic data look like?

[Time series plot of industrial data: var1 and var2 over roughly 2100 observations]

14


Assessing Systematic Relationships

Quantitative Methods

• correlation
» formal def’n plus sample statistic (“Pearson’s r”)
• covariance
» formal def’n plus sample statistic

provide a quantitative measure of systematic LINEAR


relationships

15
Covariance

Formal Definition

• given two random variables X and Y, the covariance


is
Cov ( X , Y ) = E {( X − µ X )(Y − µY )}
• E{ } - expected value
• sign of the covariance indicates the sign of the slope
of the systematic linear relationship
» positive value --> positive slope
» negative value --> negative slope
• issue - covariance is SCALE DEPENDENT
16
Covariance

• motivation for covariance as a measure of systematic


linear relationship
» look at pairs of departures about the mean of X, Y

[Two sketches: scatter of (X, Y) points in the plane, with the mean of X and Y marked on each]
17
Correlation

• is the “dimensionless” covariance


» divide covariance by standard dev’ns of X, Y
• formal definition
$\mathrm{Corr}(X,Y) = \rho(X,Y) = \dfrac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y}$

• properties
» dimensionless
» range: $-1 \le \rho(X,Y) \le 1$
(values near -1: strong linear relationship with negative slope; values near +1: strong linear relationship with positive slope)
Note - the correlation gives NO information about the
actual numerical value of the slope.
18
Estimating Covariance, Correlation

… from process data (with N pairs of observations)


Sample Covariance

$R = \dfrac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})$

Sample Correlation

$r = \dfrac{\dfrac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{s_X\, s_Y}$

19
Making Inferences

The sample covariance and correlation are


STATISTICS, and have their own probability
distributions.

Confidence interval for sample correlation -


» the following is approximately distributed as a standard normal random variable:

$\sqrt{N-3}\,\bigl(\tanh^{-1}(r) - \tanh^{-1}(\rho)\bigr)$

» derive confidence limits for $\tanh^{-1}(\rho)$ and convert to confidence limits for the true correlation using tanh
20
Confidence Interval for Correlation

Procedure
1. find $z_{\alpha/2}$ for the desired confidence level
2. the confidence interval for $\tanh^{-1}(\rho)$ is

$\tanh^{-1}(r) \pm z_{\alpha/2}\,\dfrac{1}{\sqrt{N-3}}$

3. convert to confidence limits for the correlation by taking tanh of the limits in step 2

A hypothesis test can also be performed using this function of the


correlation and comparing to the standard normal distribution
21
Example - Solder Thickness

Objective - study the effect of temperature on solder


thickness
Data - in pairs
Solder Temperature (C) Solder Thickness (microns)
245 171.6
215 201.1
218 213.2
265 153.3
251 178.9
213 226.6
234 190.3
257 171
244 197.5
225 209.8
22
Example - Solder Thickness

[Scatterplot: solder thickness (microns) vs. solder temperature (C)]

Correlation matrix:
                              Solder Temperature (C)    Solder Thickness (microns)
Solder Temperature (C)        1
Solder Thickness (microns)    -0.920001236              1

23
Example - Solder Thickness

Confidence Interval

$z_{\alpha/2}$ = 1.96 (95% confidence level)

limits in $\tanh^{-1}(\rho)$:   -2.329837282   -0.848216548

limits in $\rho$:   -0.981238575   -0.690136605
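A minimal numpy sketch of this calculation for the solder data above; the variable names are illustrative and 1.96 is the $z_{\alpha/2}$ value for the 95% level used on the slide.

```python
import numpy as np

# solder temperature / thickness data from the table above
temp = np.array([245, 215, 218, 265, 251, 213, 234, 257, 244, 225], dtype=float)
thick = np.array([171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8])

r = np.corrcoef(temp, thick)[0, 1]                        # Pearson's r, approx. -0.92
N = len(temp)
half = 1.96 / np.sqrt(N - 3)
limits_z = (np.arctanh(r) - half, np.arctanh(r) + half)   # limits in tanh^-1(rho)
limits_rho = np.tanh(limits_z)                            # approx. (-0.98, -0.69)
```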

24
Empirical Modeling - Terminology

• response
» “dependent” variable - responds to changes in other
variables
» the response is the characteristic of interest which we are
trying to predict
• explanatory variable
» “independent” variable, regressor variable, input, factor
» these are the quantities that we believe have an
influence on the response
• parameter
» coefficients in the model that describe how the
regressors influence the response
25
Models

When we are estimating a model from data, we


consider the following form:

$Y = f(X, \theta) + \varepsilon$

(Y - response; X - explanatory variables; θ - parameters; ε - “random error”)

26
The Random Error Term

• is included to reflect fact that measured data contain


variability
» successive measurements under the same conditions
(values of the explanatory variables) are likely to be
slightly different
» this is the stochastic component
» the functional form describes the deterministic
component
» random error is not necessarily the result of mistakes in
experimental procedures - reflects inherent variability
» “noise”

27
Types of Models

• linear/nonlinear in the parameters


• linear/nonlinear in the explanatory variables
• number of response variables
– single response (standard regression)
– multi-response (or “multivariate” models)

From the perspective of statistical model-building,


the key point is whether the model is linear or
nonlinear in the PARAMETERS.

28
Linear Regression Models

• linear in the parameters

$T_{95} = b_1 T_{LGO} + b_2 T_{mid} + \varepsilon$

• can be nonlinear in the regressors, e.g. with a squared regressor

$T_{95} = b_1 T_{LGO} + b_2 T_{mid}^2 + \varepsilon$

29
Nonlinear Regression Models

• nonlinear in the parameters


– e.g., Arrhenius rate expression

$r = k_0 \exp\!\left(\dfrac{-E}{RT}\right)$

(nonlinear in E; linear in $k_0$ if E is fixed)

30
Nonlinear Regression Models

• sometimes transformably linear


• start with

$r = k_0 \exp\!\left(\dfrac{-E}{RT}\right) + \varepsilon$

and take ln of both sides to produce

$\ln(r) = \ln(k_0) - \dfrac{E}{RT} + \delta$

which is of the form

$Y = \beta_0 + \beta_1 \dfrac{1}{RT} + \delta$   -   linear in the parameters
31
Transformations

• note that linearizing the nonlinear equation by


transformation can lead to misleading estimates if the
proper estimation method is not used
• transforming the data can alter the statistical
distribution of the random error term

32
Ordinary LS vs. Multi-Response

• single response (ordinary least squares)


T95 = b1 TLGO + b2 Tmid + ε
• multi-response (e.g., Partial Least Squares)
T95, LGO = b11 TLGO + b12 Tmid + ε1
T95, kero = b21 Tkero + b22 Tmid + ε 2
– issue - joint behaviour of responses, noise

We will be focussing on single response models.

33
Linear Multiple Regression

Model Equation
$Y_i = \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i$

$Y_i$ - i-th observation of the response (i-th data point)
$X_{i1}, \ldots, X_{ip}$ - i-th values of the explanatory variables $X_1$ to $X_p$
$\varepsilon_i$ - random noise in the i-th observation of the response

The intercept can be considered as corresponding


to an X which always has the value “1”

34
Assumptions for Least Squares Estimation

Values of explanatory variables are known EXACTLY


» random error is strictly in the response variable
» practically - a random component will almost always be
present in the explanatory variables as well
» we assume that this component has a substantially
smaller effect on the response than the random
component in the response
» if random fluctuations in the explanatory variables are
important, consider alternative method (“Errors in
Variables” approach)

35
Assumptions for Least Squares Estimation

The form of the equation provides an adequate


representation for the data
» can test adequacy of model as a diagnostic

Variance of random error is CONSTANT over range of


data collected
» e.g., variance of random fluctuations in thickness
measurements at high temperatures is the same as
variance at low temperatures
» data is “heteroscedastic” if the variance is not constant -
different estimation procedure is required
» thought - percentage error in instruments?
36
Assumptions for Least Squares Estimation

The random fluctuations in each measurement are


statistically independent from those of other
measurements
» at same experimental conditions
» at other experimental conditions
» implies that random component has no “memory”
» no correlation between measurements
Random error term is normally distributed
» typical assumption
» not essential for least squares estimation
» important when determining confidence intervals,
conducting hypothesis tests
37
Least Squares Estimation - graphically

least squares - minimize sum of squared prediction errors

[Sketch: response (solder thickness) vs. T - data points scattered about the deterministic “true” relationship; the vertical distance from a point to the curve is the prediction error (“residual”)]
More Notation and Terminology

Random error is “independent, identically distributed”


(I.I.D) -- can say that it is IID Normal

Capitals - Y - denotes random variable


- except in case of explanatory variable - capital used
to denote formal def’n
Lower case - y, x - denotes measured values of
variables
Model Y = β0 + β1 X + ε

Measurement y = β0 + β1x + ε
39
More Notation and Terminology

Estimate - denoted by “hat”


» examples - estimates of response, parameter
$\hat{y},\ \hat{\beta}_0$

Residual - difference between measured and predicted


response
$e = y - \hat{y}$

40
Matrix Representation for Multiple Regression
We can arrange the observations in “tabular” form - a vector of observations and a matrix of explanatory values:

$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{N-1} \\ Y_N \end{bmatrix} =
\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{N-1,1} & X_{N-1,2} & \cdots & X_{N-1,p} \\ X_{N,1} & X_{N,2} & \cdots & X_{N,p} \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{N-1} \\ \varepsilon_N \end{bmatrix}$

41
Matrix Representation for Multiple Regression

The model is written as:

$Y = X\beta + \varepsilon$

(Y: N×1 vector; X: N×p matrix; β: p×1 vector; ε: N×1 vector)

N --> number of data observations


p --> number of parameters

42
Least Squares Parameter Estimates

We make the same assumptions as in the straight line


regression case:
» independent random noise components in each
observation
» explanatory variables known exactly - no randomness
» variance constant over experimental region (identically
distributed noise components)

43
Residual Vector
Given a set of parameter values $\tilde{\beta}$, the residual vector is formed
from the matrix expression:

$\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{N-1} \\ e_N \end{bmatrix} =
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{N-1} \\ Y_N \end{bmatrix} -
\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{N-1,1} & X_{N-1,2} & \cdots & X_{N-1,p} \\ X_{N,1} & X_{N,2} & \cdots & X_{N,p} \end{bmatrix}
\begin{bmatrix} \tilde{\beta}_1 \\ \tilde{\beta}_2 \\ \vdots \\ \tilde{\beta}_p \end{bmatrix}$
44
Sum of Squares of Residuals

… is the same as before, but can be expressed as the squared


length of the residual vector:
$SSE = \sum_{i=1}^{N} e_i^2 = e^T e = \|e\|^2 = (Y - X\tilde{\beta})^T (Y - X\tilde{\beta})$

45
Least Squares Parameter Estimates

Find the set of parameter values that minimize the sum


of squares of residuals (SSE)
» apply necessary conditions for an optimum from calculus
(stationary point)

$\left.\dfrac{\partial}{\partial \beta}(SSE)\right|_{\hat{\beta}} = 0$
» system of N equations in p unknowns, with number of
parameters < number of observations : over-determined
system of equations
» solution - set of parameter values that comes “closest to
satisfying all equations” (in a least squares sense)

46
Least Squares Parameter Estimates

The solution is:


$\hat{\beta} = (X^T X)^{-1} X^T Y$

$(X^T X)^{-1} X^T$ is a generalized inverse of X
- a generalization of the standard concept of a matrix inverse to the case of a non-square matrix

47
Example - Solder Thickness

Let’s analyze the data considered for the straight line case:

Solder Temperature (C)   Solder Thickness (microns)
245   171.6
215   201.1
218   213.2
265   153.3
251   178.9
213   226.6
234   190.3
257   171
244   197.5
225   209.8

Model:   $Y = \beta_0 + \beta_1 X + \varepsilon$

48
Example - Solder Thickness
1716.  1 245  ε1 
     
     
In matrix form:  2011
.  1 215  ε2 
     
     
213.2 1 218  ε3 
     
     
153.3 1 265  ε4 
     
     
178.9  1 251 β0   ε5 
Y = Xβ + ε ⇔ 

 
=
   
  + 
226.6 1 213  β1   ε6 
     
     
190.3 1 234   ε7 
     
     
 171  1 257   ε8 
     
     
197.5 1 244   ε9 
     
     
209.8 1 225 ε 
10
49
Example - Solder Thickness

In order to calculate the least squares estimates:

$(X^T X) = \begin{bmatrix} 10 & 2367 \\ 2367 & 563335 \end{bmatrix}; \qquad X^T Y = \begin{bmatrix} 1910 \\ 449420 \end{bmatrix}$

50
Example - Solder Thickness

The least squares parameter estimates are obtained as:

$\hat{\beta} = (X^T X)^{-1} X^T Y = \begin{bmatrix} 18.373 & -0.0772 \\ -0.0772 & 0.0003 \end{bmatrix}\begin{bmatrix} 1910 \\ 449420 \end{bmatrix} = \begin{bmatrix} 458.10 \\ -1.13 \end{bmatrix}$
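A minimal numpy sketch of the same calculation (the variable names are illustrative; solving the normal equations avoids forming the inverse explicitly, and `np.linalg.lstsq(X, y, rcond=None)` would give the same result):

```python
import numpy as np

# solder data from the table above
temp = np.array([245, 215, 218, 265, 251, 213, 234, 257, 244, 225], dtype=float)
y = np.array([171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8])

X = np.column_stack([np.ones_like(temp), temp])     # N x p, intercept as a column of ones
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # approx. [458.1, -1.13]
```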

51
Example - Wave Solder Defects

(page 8-31, Course Notes)


W ave Solder Defects Data
Run Conveyor Speed Pot Temp Flux Density No. of Defects
1 -1 -1 -1 100
2 1 -1 -1 119
3 -1 1 -1 118
4 1 1 -1 217
5 -1 -1 1 20
6 1 -1 1 42
7 -1 1 1 41
8 1 1 1 113
9 0 0 0 101
10 0 0 0 96
11 0 0 0 115

52
Example - Wave Solder Defects

In matrix form:

$Y = X\beta + \varepsilon \;\Leftrightarrow\;
\begin{bmatrix} 100 \\ 119 \\ 118 \\ 217 \\ 20 \\ 42 \\ 41 \\ 113 \\ 101 \\ 96 \\ 115 \end{bmatrix} =
\begin{bmatrix} 1 & -1 & -1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \end{bmatrix}$
53
Example - Wave Solder Defects

To calculate least squares parameter estimates:

11 0 0 0  1082 
   
   
0 8 0 0  212 
( X T X) =  ; XT Y =  
   
0 0 8 0  208 
   
   
 0 0 0 8 − 338
54
Example - Wave Solder Defects

Least squares parameter estimates:

1 0 0   1082   93.36 
0
11 
    
 1   212   26.50 
0 0 0    
8
β$ = ( XT X) −1 XT Y =   = 
 1    
0 0 0   208   26.0 
8
    
 1    
 0 0 0  − 338 − 42.25
8

55
Examples - Comments

• if there are N runs and the model has p parameters, $X^T X$ is a p×p
matrix (smaller dimension than the number of runs)
• the elements of $X^T Y$ are $\sum_i x_{ij} y_i$ for parameters j = 1, …, p
• in the Wave Solder Defects example, the values of the
explanatory variable for the runs followed very specific patterns
of -1 and +1, and XTX was a diagonal matrix
• in the Solder Thickness example, the values of the explanatory
variable did not follow a specific pattern, and XTX was not
diagonal
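A short numpy sketch for the wave solder example above; with the coded -1/0/+1 run conditions the cross-product matrix comes out diagonal, so each estimate is just the corresponding element of $X^T Y$ divided by a diagonal entry. Variable names are illustrative.

```python
import numpy as np

# wave solder defects data: intercept column plus the coded run conditions
X = np.array([
    [1, -1, -1, -1], [1,  1, -1, -1], [1, -1,  1, -1], [1,  1,  1, -1],
    [1, -1, -1,  1], [1,  1, -1,  1], [1, -1,  1,  1], [1,  1,  1,  1],
    [1,  0,  0,  0], [1,  0,  0,  0], [1,  0,  0,  0]], dtype=float)
y = np.array([100, 119, 118, 217, 20, 42, 41, 113, 101, 96, 115], dtype=float)

print(X.T @ X)                                      # diag(11, 8, 8, 8)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # approx. [98.36, 26.5, 26.0, -42.25]
```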

56
Graphical Diagnostics

Basic Principle - extract as much trend as possible from


the data

Residuals should have no remaining trend -


» with respect to the explanatory variables
» with respect to the data sequence number
» with respect to other possible explanatory variables
(“secondary variables”)
» with respect to predicted values

57
Graphical Diagnostics

Residuals vs. Predicted Response Values


[Sketch: residuals $e_i$ vs. predicted values $\hat{y}_i$ - even scatter over the range of prediction, no discernible pattern, roughly half the residuals positive and half negative: DESIRED RESIDUAL PROFILE]

58
Graphical Diagnostics

Residuals vs. Predicted Response Values


[Sketch: residuals $e_i$ vs. predicted values $\hat{y}_i$ - one point lies well outside the main body of residuals: RESIDUAL PROFILE WITH OUTLIERS]

59
Graphical Diagnostics

Residuals vs. Predicted Response Values


[Sketch: residuals $e_i$ vs. predicted values $\hat{y}_i$ - the variance of the residuals appears to increase with higher predictions: NON-CONSTANT VARIANCE]

60
Graphical Diagnostics

Residuals vs. Explanatory Variables


» ideal - no systematic trend present in plot
» inadequate model - evidence of trend present

[Sketch: residuals $e_i$ vs. explanatory variable x - leftover quadratic trend: a quadratic term is needed in the model]

61
Graphical Diagnostics

Residuals vs. Explanatory Variables Not in Model


» ideal - no systematic trend present in plot
» inadequate model - evidence of trend present

[Sketch: residuals $e_i$ vs. a variable w not in the model - systematic trend not accounted for in the model: include a linear term in “w”]

62
Graphical Diagnostics

Residuals vs. Order of Data Collection

[Sketch 1: residuals $e_i$ vs. time order t - drifting pattern: failure to account for a time trend in the data]

[Sketch 2: residuals $e_i$ vs. time order t - runs of residuals with the same sign: successive random noise components are correlated - consider a more complex model (time series model for the random component?)]

63
Quantitative Diagnostics - Ratio Tests

Residual Variance Test


» is the variance of the residuals significantly larger than the inherent
noise variance?
» same test as that for the straight line data
» only distinction - the number of degrees of freedom for the
Mean Squared Error is N-p, where p is the number of
parameters in the model
» compare the ratio to $F_{N-p,\,M-1,\,0.95}$, where M is the number of
data points used to estimate the inherent variance
» significant? -> model is INADEQUATE

64
Quantitative Diagnostics - Ratio Tests

Residual Variance Ratio


$\dfrac{s^2_{\text{residuals}}}{s^2_{\text{inherent}}} = \dfrac{\text{Mean Squared Error of Residuals}\ (MSE)}{s^2_{\text{inherent}}}$

Mean Squared Error of Residuals (variance of the residuals):

$s^2_{\text{residuals}} = MSE = \dfrac{\sum_{i=1}^{N} e_i^2}{N-p}$

65
Quantitative Diagnostics - Ratio Tests

Mean Square Regression Ratio


» same as in the straight line case except for degrees of
freedom
Variance described by the model:

$MSR = \dfrac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{p - 1}$

66
Quantitative Diagnostics - Ratio Test

Test ratio: $\dfrac{MSR}{MSE}$ is compared against $F_{p-1,\,N-p,\,0.95}$

Conclusions?
– ratio is statistically significant --> the model explains significant trend
– NOT statistically significant --> significant trend has NOT been modeled, and the model is inadequate in its present form

For the multiple regression case, this test is a coarse


measure of whether some trend has been modeled -
it provides no indication of which X’s are important
67
Analysis of Variance Tables

The ratio tests involve a dissection of the sum of squares:

$SSR = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 \qquad SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

$TSS = \sum_{i=1}^{N} (y_i - \bar{y})^2 = SSR + SSE$
68
Analysis of Variance (ANOVA) for Regression

Source of       Degrees of    Sum of     Mean               F-Value       p-value
Variation       Freedom       Squares    Square

Regression      p-1           SSR        MSR = SSR/(p-1)    F = MSR/MSE   p
Residuals       N-p           SSE        MSE = SSE/(N-p)
Total           N-1           TSS
69
Quantitative Diagnostics - R2

Coefficient of Determination (“R2 Coefficient”)


» square of the correlation between observed and predicted values:

$R^2 = [\mathrm{corr}(y, \hat{y})]^2$

» relationship to the sums of squares:

$R^2 = 1 - \dfrac{SSE}{TSS} = \dfrac{SSR}{TSS}$

» values typically reported in “%”, i.e., 100 R2
» ideal - R2 near 100%
70
Issues with R2

• R2 is sensitive to extreme data points, resulting in misleading


indication of quality of fit
• R2 can be made artificially large by adding more parameters to
the model
» put a curve through every point - “connect the dots”
model --> simply modeling noise in the data, rather than
trend
» solution - define the “adjusted R2”, which penalizes the
addition of parameters to the model

71
Adjusted R2

Adjust for number of parameters relative to number of observations


» account for degrees of freedom of the sums of squares
» define in terms of Mean Squared quantities

$R^2_{adj} = 1 - \dfrac{MSE}{TSS/(N-1)} = 1 - \dfrac{SSE/(N-p)}{TSS/(N-1)}$
» want value close to 1 (or 100%), as before
» if N>>p, adjusted R2 is close to R2
» provides measure of agreement, but does not account for
magnitude of residual error
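As a small sketch, both coefficients can be computed directly from the sums of squares defined earlier (the function name is illustrative):

```python
# coefficient of determination and its adjusted form
def r_squared(SSE, TSS, N, p):
    R2 = 1.0 - SSE / TSS                              # = SSR / TSS
    R2_adj = 1.0 - (SSE / (N - p)) / (TSS / (N - 1))
    return R2, R2_adj
```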

72
Testing the Need for Groups of Terms

In words: “Does a specific group of terms account for significant


trend in the model”?

Test
» compare difference in residual variance between full and
reduced model
» benchmark against an estimate of the inherent variation
» if significant, conclude that the group of terms ARE
required
» if not significant, conclude that the group of terms can be
dropped from the model - not explaining significant trend
» note that remaining parameters should be re-estimated in
this case
73
Testing the Need for Groups of Terms

Test:
A - denotes the full model (with all terms)
B - denotes the reduced model (group of terms deleted)
Form the ratio:

$\dfrac{SSE_{\text{model }B} - SSE_{\text{model }A}}{s^2 (p_A - p_B)}$

pA, pB are the numbers of parameters in models A, B


s2 is an estimate of the inherent noise variance:
» estimate as SSEA/(N-pA)

74
Testing the Need for Groups of Terms

Compare this ratio to

$F_{p_A - p_B,\ \nu_{\text{inherent}},\ 0.95}$

» if $MSE_A$ is used as the estimate of the inherent variance, then the
degrees of freedom of the inherent variance estimate are $N - p_A$
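A hedged sketch of this extra-sum-of-squares test as a helper function; the name `group_test` and the argument layout are hypothetical, and the SSE values would come from fitting the full and reduced models.

```python
from scipy import stats

# SSE_A, p_A: full model (A); SSE_B, p_B: reduced model (B); N: number of observations
def group_test(SSE_A, p_A, SSE_B, p_B, N, alpha=0.05):
    s2 = SSE_A / (N - p_A)                          # inherent variance estimate (MSE of full model)
    ratio = (SSE_B - SSE_A) / (s2 * (p_A - p_B))
    critical = stats.f.ppf(1 - alpha, p_A - p_B, N - p_A)
    return ratio, critical, ratio > critical        # True -> the group of terms is required
```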

75
Lack of Fit Test

If we have replicate runs in our regression data set, we can break


out the noise variance from the residuals, and assess the
component of the residuals due to unmodelled trend

Replicates -
» repeated runs at the SAME experimental conditions
» note that all explanatory variables must be at fixed
conditions
» indication of inherent variance because no other factors
are changing
» measure of repeatability of experiments

76
Using Replicates

We can estimate the sample variance for each set of replicates,


and pool the estimate of the variance
» constancy of variance can be checked using Bartlett’s
test
» constant variance is assumed for ordinary least squares estimation

For each replicate set “i” (with $n_i$ values and average $\bar{y}_{i\bullet}$), we have:

$s_i^2 = \dfrac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\bullet})^2}{n_i - 1}$

77
Using Replicates

The pooled estimate of variance is:

$s^2_{\text{pooled}} = \dfrac{\sum_{i=1}^{m} (n_i - 1)\, s_i^2}{\left(\sum_{i=1}^{m} n_i\right) - m}$
i.e., convert back to sums of squares, and divide by the total
number of degrees of freedom (the sum of the degrees of
freedom for each variance estimate)

78
The Lack of Fit Test

Back to the sum of squares “block”: the total sum of squares TSS splits into SSR and SSE, and SSE splits further into the “pure error” sum of squares ($SSE_P$) and the “lack of fit” sum of squares ($SSE_{LOF}$).
79
The Lack of Fit Test

We partition the SSE into two components:


» component due to inherent noise
» component due to unmodeled trend
Pure error sum of squares ($SSE_P$):

$SSE_P = \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\bullet})^2 \right)$

i.e., add together the sums of squares associated with each replicate group (there are “m” replicate groups in total)

80
The Lack of Fit Test

The “lack of fit” sum of squares ($SSE_{LOF}$) is formed by backing out $SSE_P$ from SSE:

$SSE_{LOF} = SSE - SSE_P$

Degrees of Freedom:
- for $SSE_P$:   $\left(\sum_{i=1}^{m} n_i\right) - m$
- for $SSE_{LOF}$:   $(N - p) - \left[\left(\sum_{i=1}^{m} n_i\right) - m\right]$
81
The Lack of Fit Test

The test ratio:

$\dfrac{MSE_{LOF}}{MSE_P} = \dfrac{SSE_{LOF}/\nu_{LOF}}{SSE_P/\nu_{Pure}}$

is compared to $F_{\nu_{LOF},\,\nu_{Pure},\,0.95}$

» significant? - there is significant unmodeled trend, and the model should be modified
» not significant? - there is no significant unmodeled trend, which supports model adequacy

82
Example - Wave Solder Defects

From the earlier regression, SSE = 2694.0 and SSR = 25306.5

Replicate set (runs 9-11, all at conditions 0, 0, 0): defect counts 101, 96, 115
  std. dev'n = 9.848858;  sample variance = 97;  sum of squares (as $(n_i - 1) s_i^2$) = 194

LACK OF FIT TEST (ANOVA)
Source         df    SS          MS          F value     from F-table (95% pt)
Residual       7     2694.045
Lack of fit    5     2500.045    500.0091    5.154733    19.3 (this is $F_{5,2,0.95}$)
Pure error     2     194         97

This was done by hand - Excel has no Lack of Fit test
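A short scipy sketch reproducing the lack of fit numbers tabulated above (sums of squares taken directly from the table; names are illustrative):

```python
from scipy import stats

SSE, df_E = 2694.045, 7          # residual sum of squares and its degrees of freedom
SSEP, df_P = 194.0, 2            # pure error from the replicate runs (101, 96, 115)
SSELOF, df_LOF = SSE - SSEP, df_E - df_P

F = (SSELOF / df_LOF) / (SSEP / df_P)               # approx. 5.15
F_crit = stats.f.ppf(0.95, df_LOF, df_P)            # approx. 19.3 -> no significant lack of fit
```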


83
A Comment on the Ratio Tests

Order of Preference (or “value”) - from most definitive to


least definitive:

• Lack of Fit Test -- MSELOF/MSEP


• MSE/s2inherent
• MSR/MSE

If at all possible, try to include replicate runs in your experimental


program so that the Lack of Fit test can be conducted

Many statistical software packages will perform the Lack of Fit test
in their Regression modules - Excel does NOT
84
The Parameter Estimate Covariance Matrix

… summarizes the variance-covariance structure of the parameter estimates:

$\Sigma = \begin{bmatrix} \mathrm{Var}(\hat{\beta}_1) & \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) & \cdots & \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_p) \\ \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) & \mathrm{Var}(\hat{\beta}_2) & \cdots & \mathrm{Cov}(\hat{\beta}_2, \hat{\beta}_p) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_p) & \mathrm{Cov}(\hat{\beta}_2, \hat{\beta}_p) & \cdots & \mathrm{Var}(\hat{\beta}_p) \end{bmatrix}$

85
Properties of the Covariance Matrix

• symmetric -- Cov(b1,b2) = Cov(b2,b1)


• diagonal entries are always non-negative
• off-diagonal entries can be +ve or -ve
• matrix is positive definite: $v^T \Sigma v > 0$ for any non-zero vector v

86
Parameter Estimate Covariance Matrix

The covariance matrix of the parameter estimates is defined as:

$\Sigma = E\{(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T\}$

Compare this expression with the variance of a single parameter:

$\mathrm{Var}(\hat{\beta}) = E\{(\hat{\beta} - \beta)^2\}$

For linear regression, the covariance matrix is obtained as:

$\Sigma = (X^T X)^{-1} \sigma_\varepsilon^2$

87
Parameter Estimate Covariance Matrix

Key point - the covariance structure of the parameter estimates is


governed by the experimental run conditions used for the
explanatory variables -
the Experimental Design

Example - the Wave Solder Defects data:

$(X^T X) = \begin{bmatrix} 11 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 8 & 0 \\ 0 & 0 & 0 & 8 \end{bmatrix}; \qquad (X^T X)^{-1} = \begin{bmatrix} \tfrac{1}{11} & 0 & 0 & 0 \\ 0 & \tfrac{1}{8} & 0 & 0 \\ 0 & 0 & \tfrac{1}{8} & 0 \\ 0 & 0 & 0 & \tfrac{1}{8} \end{bmatrix}$

The parameter estimates are uncorrelated, and the variances of the non-intercept parameters are the same - towards “uniform precision” of the parameter estimates.
88
Estimating the Parameter Covariance Matrix
The X matrix is known - set of run conditions - so the only
estimated quantity is the inherent noise variance
» from replicates, external estimate, or MSE

For the wave solder defect data, the residual variance estimate (MSE) is 384.86 with 7 degrees of freedom, and the parameter covariance matrix is estimated as:

$\hat{\Sigma} = (X^T X)^{-1} s_e^2 = \begin{bmatrix} \tfrac{1}{11} & 0 & 0 & 0 \\ 0 & \tfrac{1}{8} & 0 & 0 \\ 0 & 0 & \tfrac{1}{8} & 0 \\ 0 & 0 & 0 & \tfrac{1}{8} \end{bmatrix} (384.86) = \begin{bmatrix} 34.99 & 0 & 0 & 0 \\ 0 & 48.11 & 0 & 0 \\ 0 & 0 & 48.11 & 0 \\ 0 & 0 & 0 & 48.11 \end{bmatrix}$

(residual variance taken from the MSE)
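A minimal numpy sketch of this estimate, assuming the coded wave solder design above and MSE = 384.86 as the noise variance estimate (variable names are illustrative):

```python
import numpy as np

# intercept column plus the coded run conditions of the wave solder design
levels = [[-1, -1, -1], [1, -1, -1], [-1, 1, -1], [1, 1, -1],
          [-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1],
          [0, 0, 0], [0, 0, 0], [0, 0, 0]]
X = np.column_stack([np.ones(len(levels)), np.array(levels, dtype=float)])

cov_beta = np.linalg.inv(X.T @ X) * 384.86          # diag approx. [34.99, 48.11, 48.11, 48.11]
std_err = np.sqrt(np.diag(cov_beta))                # standard errors approx. [5.92, 6.94, 6.94, 6.94]
```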
89
Using the Covariance Matrix

Variances of parameter estimates


» are obtained from the diagonal of the matrix
» square root is the standard dev’n, or “standard error”, of
the parameter estimates
• use to formulate confidence intervals for the parameters
• use in hypothesis tests for the parameters
Correlations between the parameter estimates
» can be obtained by taking covariance from appropriate
off-diagonal element, and dividing by the standard errors
of the individual parameter estimates

90
Correlation of the Parameter Estimates

Note that
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$

I.e., the parameter estimate for the intercept depends


linearly on the slope!
» the slope and intercept estimates are correlated

changing slope changes


point of intersection with
axis because the line must
go through the centroid of the
data

91
Getting Rid of the Covariance

Let’s define the explanatory variable as the deviation


from its average:
$Z = X - \bar{X}$     (the average of Z is zero)

Least squares parameter estimates:

$\hat{\beta}_0 = \bar{Y} \qquad \hat{\beta}_1 = \dfrac{\sum_{i=1}^{N} z_i Y_i}{\sum_{i=1}^{N} z_i^2}$

- note that now there is no explicit dependence on the slope value in the intercept expression
92
Getting Rid of the Covariance

In this form of the model, the slope and intercept


parameter estimates are uncorrelated

Why is lack of correlation useful?


» allows independent decisions about parameter estimates
» decide whether slope is significant, intercept is significant
individually
» “unique” assignment of trend
• intercept clearly associated with mean of y’s
• slope clearly associated with steepness of trend
» correlation can be eliminated by altering form of model,
and choice of experimental points
93
Confidence Intervals for Parameters

… similar procedure to straight line case:


» given standard error for parameter estimate, use
appropriate t-value, and form interval as:

$\hat{\beta}_i \pm t_{\nu,\,\alpha/2}\ s_{\hat{\beta}_i}$
The degrees of freedom for the t-statistic come from the
estimate of the inherent noise variance
» the degrees of freedom will be the same for all of the
parameter estimates

If the confidence interval contains zero, the parameter is plausibly


zero and consideration should be given to deleting the term.
94
Hypothesis Tests for Parameters
… represent an alternative approach to testing whether the term
should be retained in the model

Null hypothesis - parameter = 0


Alternate hypothesis - parameter is not equal to 0

Test statistic: $\dfrac{\hat{\beta}_i}{s_{\hat{\beta}_i}}$

» compare its absolute value to $t_{\nu,\,\alpha/2}$
» if test statistic is greater (“outside the fence”), parameter
is significant -- retain
» inside the fence? - consider deleting the term
95
Example - Wave Solder Defects Data

The test statistic will be compared to $t_{7,\,0.025} = 2.365$


because MSE is used to calculate standard errors of parameters,
and has 7 degrees of freedom.

Test statistic for the intercept:

$\dfrac{\hat{\beta}_0}{s_{\hat{\beta}_0}} = \dfrac{98.36}{5.92} = 16.63$

Since 16.63 > 2.365, conclude that the intercept parameter IS significant and should be retained.

96
Example - Wave Solder Defects Data

For the next term in the model:

$\dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \dfrac{26.5}{6.94} = 3.82 > 2.365$

Therefore this term should be retained in the model.

Because the parameter estimates are uncorrelated in this model,


terms can be dropped without the need to re-estimate the other
parameters in the model -- in general, you will have to re-
estimate the final model once more to obtain the parameter
estimates corresponding to the final model form.
97
Example - Wave Solder Defects Data

From Excel:

                 Coefficients    Standard Error   t Stat     P-value     Lower 95%    Upper 95%
Intercept        98.36363636     5.915031978      16.62943   6.948E-07   84.376818    112.3505
Conveyor Speed   26.5            6.935989803      3.820652   0.0065367   10.099002    42.901
Pot Temp         26              6.935989803      3.748564   0.0071817   9.599002     42.401
Flux Density     -42.25          6.935989803      -6.09142   0.0004953   -58.651      -25.849

(Standard Error: standard deviation of each parameter estimate; t Stat: test statistic for each parameter; P-value: probability that a value is greater than the computed test ratio - a 2-tailed test; Lower/Upper 95%: confidence limits)
98
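A small sketch reproducing these columns from the estimates and standard errors shown above; scipy is assumed for the t-distribution, and 7 degrees of freedom come from the MSE-based variance estimate.

```python
import numpy as np
from scipy import stats

beta_hat = np.array([98.36, 26.5, 26.0, -42.25])
std_err = np.array([5.915, 6.936, 6.936, 6.936])
df = 7

t_stat = beta_hat / std_err                             # approx. [16.6, 3.8, 3.7, -6.1]
p_val = 2.0 * (1.0 - stats.t.cdf(np.abs(t_stat), df))   # two-tailed p-values
half = stats.t.ppf(0.975, df) * std_err
ci = np.column_stack([beta_hat - half, beta_hat + half])
```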
Precision of the Predicted Responses

The predicted response from an estimated model has uncertainty,


because it is a function of the parameter estimates which have
uncertainty:
e.g., Wave Solder Defect Model - the first response, at the point (-1, -1, -1):

$\hat{y}_1 = \hat{\beta}_0 + \hat{\beta}_1(-1) + \hat{\beta}_2(-1) + \hat{\beta}_3(-1)$

If the parameter estimates were uncorrelated, the variance of the predicted response would be:

$\mathrm{Var}(\hat{y}_1) = \mathrm{Var}(\hat{\beta}_0) + \mathrm{Var}(\hat{\beta}_1) + \mathrm{Var}(\hat{\beta}_2) + \mathrm{Var}(\hat{\beta}_3)$
(recall results for variance of sum of random variables)

99
Precision of the Predicted Responses
In general, both the variances and covariances of the parameter
estimates must be taken into account.
For prediction at the k-th data point:

$\mathrm{Var}(\hat{y}_k) = x_k^T (X^T X)^{-1} x_k\, \sigma_\varepsilon^2 = \begin{bmatrix} x_{k1} & x_{k2} & \cdots & x_{kp} \end{bmatrix} (X^T X)^{-1} \begin{bmatrix} x_{k1} \\ x_{k2} \\ \vdots \\ x_{kp} \end{bmatrix} \sigma_\varepsilon^2$
100
Example - Wave Solder Defects Model

In this example, the parameter estimates are uncorrelated


» XTX is diagonal
» variance of the predicted reponse is in fact the sum of the
variances of the parameter estimates

Variance of prediction at run #11 (0,0,0):

Var ( y$11 ) = Var ( β$0 ) + Var ( β$1 )( 0) + Var ( β$2 )( 0) + Var ( β$3 )( 0)
= Var ( β$ )
0

101
Precision of “Future” Predictions

Suppose we want to predict the response at conditions other than


those of the experimental runs --> future run.
The value we observe will consist of the component from the
deterministic component, plus the noise component.
In predicting this value, we must consider:
» uncertainty from our prediction of the deterministic
component
» noise component
The variance of this future prediction is $\mathrm{Var}(\hat{y}_{\text{future}}) + \sigma_\varepsilon^2$,
where $\mathrm{Var}(\hat{y}_{\text{future}})$ is computed using the same expression as for the variance of predicted responses at the experimental run conditions

102
Estimating Precision of Predicted Responses

Use an estimate of the inherent noise variance:

$s^2_{\hat{y}_k} = x_k^T (X^T X)^{-1} x_k\, s_e^2$

The degrees of freedom for the estimated variance of the predicted


response are those of the estimate of the noise variance
» replicates, external estimate, MSE

103
Confidence Limits for Predicted Responses

Follow an approach similar to that for parameters - 100(1-alpha)%


confidence limits for the predicted response at the k-th run are:

$\hat{y}_k \pm t_{\nu,\,\alpha/2}\ s_{\hat{y}_k}$

» the degrees of freedom are those of the inherent noise variance estimate

If the prediction is for a response at conditions OTHER than one of the experimental runs, the limits are:

$\hat{y}_k \pm t_{\nu,\,\alpha/2}\ \sqrt{s^2_{\hat{y}_{\text{future}}} + s_e^2}$
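A minimal sketch for the centre-point run of the wave solder design, using the values from the example above (the variable names are illustrative):

```python
import numpy as np
from scipy import stats

XtX_inv = np.diag([1/11, 1/8, 1/8, 1/8])
beta_hat = np.array([98.36, 26.5, 26.0, -42.25])
s2_e, df = 384.86, 7                                # MSE and its degrees of freedom

x_k = np.array([1.0, 0.0, 0.0, 0.0])                # intercept column plus run conditions (0,0,0)
y_k = x_k @ beta_hat
s2_yk = x_k @ XtX_inv @ x_k * s2_e                  # equals Var(beta0_hat) here, approx. 34.99

t_val = stats.t.ppf(0.975, df)
limits = (y_k - t_val * np.sqrt(s2_yk), y_k + t_val * np.sqrt(s2_yk))
# for a *future* observation at x_k, replace np.sqrt(s2_yk) with np.sqrt(s2_yk + s2_e)
```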

104
Practical Guidelines for Model Development

1) Consider CODING your explanatory variables (see the short sketch after this list)

Coding - one standard form:   $\tilde{x}_i = \dfrac{x_i - \bar{x}_i}{\tfrac{1}{2}\,\mathrm{range}(x_i)}$
» places designed experiment into +1,-1 form
» if run conditions are from an experimental design, this
coding must be used in order to obtain all of the benefits
from the design - uncorrelated parameter estimates
» if conditions are not from an experimental design, such a
coding improves numerical conditioning of the problem --
similar numerical scales for all variables
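A one-function sketch of this coding (the helper name `code` is illustrative):

```python
import numpy as np

def code(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (0.5 * (x.max() - x.min()))

# e.g. code(temp) maps the solder temperatures onto roughly a -1 to +1 scale
```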

105
Practical Guidelines for Model Development

2) Types of models -
» linear in the explanatory variables
» linear with two-factor interactions (xi xj)
» general polynomials

3) Watch for collinearity in the X matrix - run condition patterns for


two or more explanatory variables are almost the same
» prevents clear assignment of trend to each factor
» shows up as singularity in XTX matrix
» associated with very strong correlation between
parameter estimates

106
Practical Guidelines for Model Development

4) Be careful not to extrapolate excessively beyond the range of


the data

5) The maximum number of parameters that can be fit to a data set = the number of unique run conditions:

$N - \left[\left(\sum_{i=1}^{m} n_i\right) - m\right]$

» N - number of data points


» m - number of replicate sets
» ni - number of points in replicate set “i”
» as number of parameters increases, precision of
predictions decreases - start modeling noise
107
Practical Guidelines for Model Development

6) Model building sequence


» “building” approach - start with few terms and add as
necessary
» “pruning” approach - start with more terms and remove
those which aren’t statistically significant
» stepwise regression - terms are added, and retained
according to some criterion - frequently R2
• uncorrelated? criterion?
» “all subsets” regression - consider all subsets of model
terms of certain type, and select model with best criterion
• significant computational load

108
Polynomial Models

Order - the maximum over the p terms in the model of the sum of the exponents in a given term
e.g.,

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^2 + \beta_3 x_1^2 x_2^3 + \varepsilon$

is a fifth-order model

Two factor interaction -


» product term - x1x2
» implies that impact of x1 on response depends on value
of x2

109
Polynomial Models

Comments -
» polynomial models can sometimes suffer from collinearity
problems - coding helps this
» polynomials can provide approximations to nonlinear
functions - think of Taylor series approximations
» high-order polynomial models can sometimes be
replaced by fewer nonlinear function terms
• e.g., ln(x) vs. 3rd order polynomial

110
Joint Confidence Region (JCR)

… answers the question


Where do the true values of the parameters lie?

Recall that for individual parameters, we gain an understanding of


where the true value lies by:
» examining the variability pattern (distribution) for the
parameter estimate
» identify a range in which most of the values of the
parameter estimate are likely to lie
» manipulate this range to determine an interval which is
likely to contain the true value of the parameter

111
Joint Confidence Region

Confidence interval for individual parameter:


Step 1) The difference between the estimate and the true value, divided by the standard error of the estimate, is distributed as a Student's t-distribution with degrees of freedom equal to those of the variance estimate:

$\dfrac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}} \sim t_\nu$

Step 2) Find the interval $[-t_{\nu,\alpha/2},\ t_{\nu,\alpha/2}]$ which contains $100(1-\alpha)\%$ of the values - i.e., the probability of a t-value falling in this interval is $(1-\alpha)$

Step 3) Rearrange this interval to obtain the interval $\hat{\beta}_i \pm t_{\nu,\alpha/2}\, s_{\hat{\beta}_i}$, which contains the true value of the parameter $100(1-\alpha)\%$ of the time

112
Joint Confidence Region

Comments on Individual Confidence Intervals:


» sometimes referred to as marginal confidence intervals -
cf. marginal distributions vs. joint distributions from earlier
» marginal confidence intervals do NOT account for
correlations between the parameter estimates
» examining only marginal confidence intervals can
sometimes be misleading if there is strong correlation
between several parameter estimates
• value of one parameter estimate depends in part on another
• deletion of the other changes the value of the parameter
estimate
• decision to retain might be altered

113
Joint Confidence Region

Sequence:
Step 1) Identify a statistic which is a function of the parameter
estimate statistics

Step 2) Identify a region in which values of this statistic lie a certain


fraction of the time (a 100(1 − α )% region)

Step 3) Use this information to determine a region which contains


the true value of the parameters 100(1 − α )% of the time

114
Joint Confidence Region

The quantity

$\dfrac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{p\, s_\varepsilon^2} \sim F_{p,\,n-p}$

(where $s_\varepsilon^2$ is an estimate of the inherent noise variance; if MSE is used, its degrees of freedom are n-p)

is the ratio of two sums of squares, and is distributed as an F-distribution with p degrees of freedom in the numerator and n-p degrees of freedom in the denominator

115
Joint Confidence Region

We can define a region by thinking of those values of the ratio which have a value less than $F_{p,\,n-p,\,1-\alpha}$, i.e.,

$\dfrac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{p\, s_\varepsilon^2} \le F_{p,\,n-p,\,1-\alpha}$

Rearranging yields:

$(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le p\, s_\varepsilon^2\, F_{p,\,n-p,\,1-\alpha}$

116
Joint Confidence Region - Definition

The 100(1 − α )% joint confidence region for the parameters is


defined as those parameter values β satisfying:

$(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le p\, s_\varepsilon^2\, F_{p,\,n-p,\,1-\alpha}$

Interpretation:
» the region defined by this inequality contains the true
values of the parameters 100(1 − α )% of the time
» if values of zero for one or more parameters lie in this
region, those parameters are plausibly zero, and
consideration should be given to dropping the
corresponding terms from the model
117
Joint Confidence Region - Example with 2 Parameters

Let's reconsider the solder thickness example:

$(X^T X) = \begin{bmatrix} 10 & 2367 \\ 2367 & 563335 \end{bmatrix}; \qquad \hat{\beta} = \begin{bmatrix} 458.10 \\ -1.13 \end{bmatrix}; \qquad s_\varepsilon^2 = 135.38$

95% Joint Confidence Region (JCR) for the slope and intercept:

$(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) = \begin{bmatrix} \hat{\beta}_0 - \beta_0 & \hat{\beta}_1 - \beta_1 \end{bmatrix} X^T X \begin{bmatrix} \hat{\beta}_0 - \beta_0 \\ \hat{\beta}_1 - \beta_1 \end{bmatrix} \le p\, s_\varepsilon^2\, F_{p,\,n-p,\,0.95} = 2\, s_\varepsilon^2\, F_{2,\,10-2,\,0.95}$

118
Joint Confidence Region - Example with 2 Parameters

95% Joint Confidence Region (JCR) for the slope and intercept:

$\begin{bmatrix} 458.10 - \beta_0 & -1.13 - \beta_1 \end{bmatrix} X^T X \begin{bmatrix} 458.10 - \beta_0 \\ -1.13 - \beta_1 \end{bmatrix} \le 2\,(135.38)\, F_{2,\,8,\,0.95} = 2\,(135.38)(4.46) = 1207.59$

The boundary is an ellipse...
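A small sketch that evaluates the quadratic form for a candidate parameter pair and checks it against the bound; the helper name `in_jcr` and the test point are illustrative, and the numerical values are taken from the slides above.

```python
import numpy as np
from scipy import stats

XtX = np.array([[10.0, 2367.0], [2367.0, 563335.0]])
beta_hat = np.array([458.10, -1.13])
s2_e, p, n = 135.38, 2, 10

bound = p * s2_e * stats.f.ppf(0.95, p, n - p)     # approx. 1207.6, as computed above

def in_jcr(beta):
    d = beta_hat - np.asarray(beta, dtype=float)
    return d @ XtX @ d <= bound                    # True -> beta lies inside the ellipse

print(in_jcr([470.0, -1.2]))                       # a nearby candidate point
```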

119
Joint Confidence Region - Example with 2 Parameters

[Sketch: the 95% joint confidence region for slope and intercept - a rotated ellipse (rotation implies correlation between the estimates of slope and intercept), centred at the least squares parameter estimates; axes roughly Intercept 320 to 600 and Slope -1.6 to -0.6. The greater “shadow” along the horizontal (intercept) axis indicates that the variance of the intercept estimate is greater than that of the slope.]
120
Interpreting Joint Confidence Regions
1) Are the ellipse axes aligned with the coordinate axes?
» i.e., is the ellipse horizontal or vertical?
» alignment indicates no correlation between the parameter estimates
2) Which axis has the greatest shadow?
» projection of ellipse along axis
» indicates which parameter estimate has the greatest
variance
3) The elliptical region is, by definition, centred at the least squares
parameter estimates
4) Long, narrow, rotated ellipses indicate significant correlation
between parameter estimates
5) If a value of zero for one or more parameters lies in the region,
these parameters are plausibly zero - consider deleting from
model
121
Joint Confidence Regions

What is the motivation for the ratio $\dfrac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{p\, s_\varepsilon^2}$ used to define the joint confidence region?

Consider the joint distribution (density) for the parameter estimates:

$\dfrac{1}{(2\pi)^{p/2}\,\det(\Sigma_{\hat{\beta}})^{1/2}} \exp\!\left\{-\tfrac{1}{2} (\hat{\beta} - \beta)^T \Sigma_{\hat{\beta}}^{-1} (\hat{\beta} - \beta)\right\}$

Substituting the estimate $(X^T X)^{-1} s_\varepsilon^2$ for the parameter covariance matrix gives

$(\hat{\beta} - \beta)^T \left((X^T X)^{-1} s_\varepsilon^2\right)^{-1} (\hat{\beta} - \beta) = \dfrac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{s_\varepsilon^2}$
122
Confidence Intervals from Densities

[Sketch: left - individual interval: density $f_{\hat{\beta}_0}(b)$ for a single parameter estimate, with lower and upper limits enclosing area 1-alpha; right - joint region: joint density $f_{\hat{\beta}_0 \hat{\beta}_1}(b_0, b_1)$, with the joint confidence region enclosing volume 1-alpha]

123
Relationship to Marginal Confidence Limits

[Sketch: the joint confidence region (ellipse, centred at the least squares parameter estimates) with the marginal confidence interval for the slope and the marginal confidence interval for the intercept marked along the axes (Intercept roughly 320 to 600, Slope roughly -1.6 to -0.6)]
124
Relationship to Marginal Confidence Limits

[Sketch: the rectangular 95% confidence region implied by considering the parameters individually (the marginal intervals for slope and intercept) superimposed on the elliptical 95% confidence region for the parameters considered jointly]

125
Relationship to Marginal Confidence Intervals

Marginal confidence intervals are contained in joint confidence


region
» potential to miss portions of plausible parameter values
at tails of ellipsoid
» using individual confidence intervals implies a
rectangular region, which includes sets of parameter
values that lie outside the joint confidence region
» both situations can lead to
• erroneous acceptance of terms in model
• erroneous rejection of terms in model

126
Going Further - Nonlinear Regression Models
Model:

$Y_i = \eta(x_i, \theta) + \varepsilon_i$

($x_i$ - explanatory variables; $\theta$ - parameters; $\varepsilon_i$ - random noise component)
Estimation Approach:
» linearize model with respect to parameters
» treat linearization as a linear regression problem
» iterate by repeating linearization/estimation/linearization
about new estimates,… until convergence to parameter
values - Gauss-Newton iteration - or solve numerical
optimization problem
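A minimal sketch of a Gauss-Newton loop for the Arrhenius model discussed earlier; the data are synthetic and the starting values come from the log-transform (linearized) fit, so all numerical values here are illustrative rather than taken from the course notes.

```python
import numpy as np

R = 8.314

def eta(T, k0, E):                           # Arrhenius rate expression
    return k0 * np.exp(-E / (R * T))

def jac(T, k0, E):                           # columns: d eta/d k0, d eta/d E
    f = np.exp(-E / (R * T))
    return np.column_stack([f, -k0 * f / (R * T)])

rng = np.random.default_rng(1)
T = np.linspace(300.0, 400.0, 10)
y = eta(T, 2.0e5, 3.0e4) + rng.normal(0.0, 0.2, T.size)   # synthetic "observations"

# starting values from the transformably-linear fit: ln(r) = ln(k0) - E/(R*T)
slope, intercept = np.polyfit(1.0 / (R * T), np.log(y), 1)
k0, E = np.exp(intercept), -slope

for _ in range(20):                          # linearize / estimate / re-linearize
    resid = y - eta(T, k0, E)
    step, *_ = np.linalg.lstsq(jac(T, k0, E), resid, rcond=None)
    k0, E = k0 + step[0], E + step[1]
    if np.linalg.norm(step) < 1e-8 * (1.0 + abs(k0) + abs(E)):
        break
```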

127
Interpretation - Columns of X

– values of a given variable at different operating points -


– entries in XTX
» dot products of vectors of regressor variable values
» related to correlation between regressor variables
– form of XTX is dictated by experimental design
• e.g., 2k design - diagonal form

128


Parameter Estimation - Graphical View

[Sketch: the observation vector y, the approximating observation vector $\hat{y}$ (its projection onto the plane spanned by the columns of X), and the residual vector joining them]

129


Parameter Estimation - Nonlinear Regression Case

[Sketch: the observation vector y, the approximating observation vector $\hat{y}$ lying on the curved model surface, and the residual vector joining them]

130
Properties of LS Parameter Estimates

Key Point - parameter estimates are random variables


» because of how stochastic variation in data propagates
through estimation calculations
» parameter estimates have a variability pattern -
probability distribution and density functions
Unbiased
$E\{\hat{\beta}\} = \beta$
» the “average” of repeated data collection / estimation sequences will be the true value of the parameter vector

131


Properties of Parameter Estimates

Consistent
» behaviour as number of data points tends to infinity
» with probability 1, $\lim_{N \to \infty} \hat{\beta} = \beta$
» distribution narrows as N becomes large
Efficient
» variance of least squares estimates is less than that of
other types of parameter estimates

132


Properties of Parameter Estimates

Covariance Structure
» summarized by variance-covariance matrix

$\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} \sigma^2$

(structure dictated by the experimental design; $\sigma^2$ is the variance of the noise)

e.g., for two parameters $\beta_0, \beta_1$:

$\mathrm{Cov}(\hat{\beta}) = \begin{bmatrix} \mathrm{Var}(\hat{\beta}_0) & \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) \\ \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) & \mathrm{Var}(\hat{\beta}_1) \end{bmatrix}$

133


Prediction Variance

… in matrix form -

$\mathrm{var}(\hat{y}_k) = x_k^T (X^T X)^{-1} x_k\, \sigma^2$

where $x_k$ is the vector of conditions at the k-th data point

134


Joint Confidence Regions

Variability in data can affect parameter estimates jointly


depending on structure of data and model
[Sketch: a section (contour) of the sum of squares (or likelihood) function in the $(\beta_1, \beta_2)$ plane, with the marginal confidence limits marked on each axis]
135
