
NASA Reference Publication 1138

February 1985

Identification of Dynamic Systems
Theory and Formulation

Richard E. Maine and Kenneth W. Iliff

Ames Research Center
Dryden Flight Research Facility
Edwards, California

National Aeronautics and Space Administration

Scientific and Technical Information Branch

PREFACE

The subject of system identification is too broad to be covered completely in one book. This document is restricted to statistical system identification; that is, methods derived from probabilistic mathematical statements of the problem. We will be primarily interested in maximum-likelihood and related estimators.

Statistical methods are becoming increasingly important with the proliferation of high-speed, general-purpose digital computers. Problems that were once solved by hand-plotting the data and drawing a line through them are now done by telling a computer to fit the best line through the data (or by some completely different, formerly impractical method). Statistical approaches to system identification are well-suited to computer application. Automated statistical algorithms can solve more complicated problems more rapidly, and sometimes more accurately, than the older manual methods. There is a danger, however, of the engineer's losing the intuitive feel for the system that arises from long hours of working closely with the data. To use statistical estimation algorithms effectively, the engineer must have not only a good grasp of the system under analysis, but also a thorough understanding of the analytic tools used. The analyst must strive to understand how the system behaves and what characteristics of the data influence the statistical estimators in order to evaluate the validity and meaning of the results.

Our primary aim in this document is to provide the practicing data analyst with the background necessary to make effective use of statistical system identification techniques, particularly maximum-likelihood and related estimators. The intent is to present the theory in a manner that aids intuitive understanding at a concrete level useful in application. Theoretical rigor has not been sacrificed, but we have tried to avoid "elegant" proofs that may require three lines to write, but 3 years of study to comprehend the underlying theory. In particular, such theoretically intriguing subjects as martingales and measure theory are ignored. Several excellent volumes on these subjects are available, including Balakrishnan (1973), Royden (1968), Rudin (1974), and Kushner (1971).

We assume that the reader has a thorough background in linear algebra and calculus (Paige, Swift, and Slobko, 1974; Apostol, 1969; Nering, 1969; and Wilkinson, 1965), including complete familiarity with matrix operations, vector spaces, inner products, norms, gradients, eigenvalues, and related subjects. The reader should be familiar with the concept of function spaces as types of abstract vector spaces (Luenberger, 1969), but does not need expertise in functional analysis. We also assume familiarity with concepts of deterministic dynamic systems (Zadeh and Desoer, 1963; Wiberg, 1971; and Levan, 1983).

Chapter 1 introduces the basic concepts of system identification.
Chapter 2 is an introduction to numerical optimization methods, which are important to system identification. Chapter 3 reviews basic concepts from probability theory. The treatment is necessarily abbreviated, and previous familiarity with probability theory is assumed.

Chapters 4-10 present the body of the theory. Chapter 4 defines the concept of an estimator and some of the basic properties of estimators. Chapter 5 discusses estimation as a static problem in which time is not involved. Chapter 6 presents some simple results on stochastic processes. Chapter 7 covers the state estimation problem for dynamic systems with known coefficients. We first pose it as a static estimation problem, drawing on the results from Chapter 5. We then show how a recursive formulation results in a simpler solution process, arriving at the same state estimate. The derivation used for the recursive state estimator (Kalman filter) does not require a background in stochastic processes; only basic probability and the results from Chapter 5 are used.

Chapters 8-10 present the parameter estimation problem for dynamic systems. Each chapter covers one of the basic estimation algorithms. We have considered parameter estimation as a problem in its own right, rather than forcing it into the form of a nonlinear filtering problem. The general nonlinear filtering problem is more difficult than parameter estimation for linear systems, and it requires ad hoc approximations for practical implementation. We feel that our approach is more natural and is easier to understand.

Chapter 11 examines the accuracy of the estimates. The emphasis in this chapter is on evaluating the accuracy and analyzing causes of poor accuracy. The chapter also includes brief discussions about the roles of model structure determination and experiment design.


TABLE OF CONTENTS

PREFACE ... iii

NOMENCLATURE ... ix

1.0 INTRODUCTION ... 1
    1.1 SYSTEM IDENTIFICATION ... 2
    1.2 PARAMETER IDENTIFICATION ... 3
    1.3 TYPES OF SYSTEM MODELS ... 5
        1.3.1 Explicit Function ... 5
        1.3.2 State Space ... 5
        1.3.3 Others ... 7
    1.4 PARAMETER ESTIMATION ... 7
    1.5 OTHER APPROACHES ... 10

2.0 OPTIMIZATION METHODS ... 11
    2.1 ONE-DIMENSIONAL SEARCHES ... 12
    2.2 DIRECT METHODS ... 12
    2.3 GRADIENT METHODS ... 13
    2.4 SECOND ORDER METHODS ... 15
        2.4.1 Newton-Raphson ... 15
        2.4.2 Invariance ... 16
        2.4.3 Singularities ... 17
        2.4.4 Quasi-Newton Methods ... 18
    2.5 SUMS OF SQUARES ... 18
        2.5.1 Linear Case ... 19
        2.5.2 Nonlinear Case ... 19
    2.6 CONVERGENCE IMPROVEMENT ... 21

3.0 BASIC PRINCIPLES FROM PROBABILITY ... 23
    3.1 PROBABILITY SPACES ... 23
        3.1.1 Probability Triple ... 23
        3.1.2 Conditional Probabilities ... 23
    3.2 SCALAR RANDOM VARIABLES ... 23
        3.2.1 Distribution and Density Functions ... 23
        3.2.2 Expectations and Moments ... 24
    3.3 JOINT RANDOM VARIABLES ... 24
        3.3.1 Distribution and Density Functions ... 24
        3.3.2 Expectations and Moments ... 24
        3.3.3 Marginal and Conditional Distributions ... 25
        3.3.4 Statistical Independence ... 25
    3.4 TRANSFORMATION OF VARIABLES ... 26
    3.5 GAUSSIAN VARIABLES ... 26
        3.5.1 Standard Gaussian Distributions ... 27
        3.5.2 General Gaussian Distributions ... 27
        3.5.3 Properties ... 30
        3.5.4 Central Limit Theorem ... 33

4.0 STATISTICAL ESTIMATORS ... 35
    4.1 DEFINITION OF AN ESTIMATOR ... 35
    4.2 PROPERTIES OF ESTIMATORS ... 36
        4.2.1 Unbiased Estimators ... 36
        4.2.2 Minimum Variance Estimators ... 37
        4.2.3 Cramer-Rao Inequality (Efficient Estimators) ... 37
        4.2.4 Bayesian Optimal Estimators ... 39
        4.2.5 Asymptotic Properties ... 39
    4.3 COMMON ESTIMATORS ... 40
        4.3.1 A posteriori Expected Value ... 40
        4.3.2 Bayesian Minimum Risk ... 40
        4.3.3 Maximum a posteriori Probability ... 41
        4.3.4 Maximum Likelihood ... 42

5.0 THE STATIC ESTIMATION PROBLEM ... 45
    5.1 LINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE ... 45
        5.1.1 Joint Distribution of Z and ξ ... 45
        5.1.2 A posteriori Estimators ... 46
        5.1.3 Maximum Likelihood Estimator ... 48
        5.1.4 Comparison of Estimators ... 49
    5.2 PARTITIONING IN ESTIMATION PROBLEMS ... 50
        5.2.1 Measurement Partitioning ... 50
        5.2.2 Application to Linear Gaussian System ... 52
        5.2.3 Parameter Partitioning ... 53
    5.3 LIMITING CASES AND SINGULARITIES ... 54
        5.3.1 Singular P ... 55
        5.3.2 Singular GG* ... 55
        5.3.3 Singular CPC* + GG* ... 56
        5.3.4 Infinite P ... 57
        5.3.5 Infinite GG* ... 58
        5.3.6 Singular C*(GG*)⁻¹C + P⁻¹ ... 58
    5.4 NONLINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE ... 58
        5.4.1 Joint Distribution of Z and ξ ... 58
        5.4.2 Estimators ... 59
        5.4.3 Computation of the Estimates ... 60
        5.4.4 Singularities ... 61
        5.4.5 Partitioning ... 61
    5.5 MULTIPLICATIVE GAUSSIAN NOISE (ESTIMATION OF VARIANCE) ... 61
    5.6 NON-GAUSSIAN NOISE ... 64

6.0 STOCHASTIC PROCESSES ... 69
    6.1 DISCRETE TIME ... 69
        6.1.1 Linear Systems Forced by Gaussian White Noise ... 69
        6.1.2 Nonlinear Systems and Non-Gaussian Noise ... 70
    6.2 CONTINUOUS TIME ... 70
        6.2.1 Linear Systems Forced by White Noise ... 70
        6.2.2 Additive White Measurement Noise ... 72
        6.2.3 Nonlinear Systems ... 72

7.0 STATE ESTIMATION FOR DYNAMIC SYSTEMS ... 73
    7.1 EXPLICIT FORMULATION ... 73
    7.2 RECURSIVE FORMULATION ... 75
        7.2.1 Prediction Step ... 75
        7.2.2 Correction Step ... 76
        7.2.3 Kalman Filter ... 76
        7.2.4 Alternate Forms ... 77
        7.2.5 Innovations ... 78
    7.3 STEADY-STATE FORM ... 79
    7.4 CONTINUOUS TIME ... 81
    7.5 CONTINUOUS/DISCRETE TIME ... 82
    7.6 SMOOTHING ... 84
    7.7 NONLINEAR SYSTEMS AND NON-GAUSSIAN NOISE ... 86

8.0 OUTPUT ERROR METHOD FOR DYNAMIC SYSTEMS ... 89
    8.1 DERIVATION ... 90
    8.2 INITIAL CONDITIONS ... 91
    8.3 COMPUTATIONS ... 91
        8.3.1 Gauss-Newton Method ... 91
        8.3.2 System Response ... 92
        8.3.3 Finite Difference Response Gradient ... 93
        8.3.4 Analytic Response Gradient ... 93
    8.4 UNKNOWN G ... 94
    8.5 CHARACTERISTICS ... 95

9.0 FILTER ERROR METHOD FOR DYNAMIC SYSTEMS ... 97
    9.1 DERIVATION ... 97
        9.1.1 Static Derivation ... 97
        9.1.2 Derivation by Recursive Factoring ... 98
        9.1.3 Derivation Using the Innovation ... 98
        9.1.4 Steady-State Form ... 99
        9.1.5 Cost Function Discussion ... 99
    9.2 COMPUTATION ... 100
    9.3 FORMULATION AS A FILTERING PROBLEM ... 100

10.0 EQUATION ERROR METHOD FOR DYNAMIC SYSTEMS ... 101
    10.1 PROCESS-NOISE APPROACH ... 101
        10.1.1 Derivation ... 101
        10.1.2 Special Case of Filter Error ... 102
        10.1.3 Discussion ... 103
    10.2 GENERAL EQUATION ERROR FORM ... 104
        10.2.1 Discrete State-Equation Error ... 104
        10.2.2 Continuous/Discrete State-Equation Error ... 104
        10.2.3 Observation-Equation Error ... 105
    10.3 COMPUTATION ... 106
    10.4 DISCUSSION ... 107

11.0 ACCURACY OF THE ESTIMATES ... 109
    11.1 CONFIDENCE REGIONS ... 110
        11.1.1 Random Parameter Vector ... 110
        11.1.2 Nonrandom Parameter Vector ... 111
        11.1.3 Gaussian Approximation ... 112
        11.1.4 Nonstatistical Derivation ... 113
    11.2 ANALYSIS OF THE CONFIDENCE ELLIPSOID ... 113
        11.2.1 Sensitivity ... 113
        11.2.2 Correlation ... 114
        11.2.3 Cramer-Rao Bound ... 116
    11.3 OTHER MEASURES OF ACCURACY ... 117
        11.3.1 Bias ... 117
        11.3.2 Scatter ... 118
        11.3.3 Engineering Judgment ... 118
    11.4 MODEL STRUCTURE DETERMINATION ... 119
    11.5 EXPERIMENT DESIGN ... 120

A.0 MATRIX RESULTS ... 127
    A.1 MATRIX INVERSION LEMMAS ... 127
    A.2 MATRIX DIFFERENTIATION ... 129

REFERENCES ... 131

NOMENCLATURE

SYMBOLS

It is impractical to list all of the symbols used in this document. The following are symbols of particular significance and those used consistently in large portions of the document. In several specialized situations, the same symbols are used with different meanings not included in this list.

A        stability matrix
B        control matrix
b(.)     bias
C        state observation matrix
D        control observation matrix
E(.)     expected value
e        error vector
F(.)     system function
FF*      process noise covariance matrix
Fx(.)    probability distribution function of x
f(.)     system state function
GG*      measurement noise covariance matrix
g(.)     system observation function
h(.)     equation error function
J(.)     cost function
M        Fisher information matrix
mξ       prior mean of ξ
n        process noise vector
P        prior covariance of ξ, or covariance of filtered x
p(.)     probability density function of x, short notation
px(.)    probability density function of x, full notation
Q        covariance of predicted x
R        covariance of innovation
t        time
U        system input
u        dynamic system input vector
V        concatenated innovation vector
v        innovation vector
x        parameter vector in static models, or dynamic system state vector
Z        system response, or concatenated response vector
z        dynamic system response vector
Δ        sample interval
η        measurement noise vector
Φ        state transition matrix
Ψ        input transition matrix
ξ        vector of unknown parameters
Ξ        set of possible parameter values
ω        random noise vector
Ω        probability space
~        predicted estimate (in filtering contexts)
^        optimum (in optimization contexts), or estimate (in estimation contexts), or filtered estimate (in filtering contexts)
-        smoothed estimate

Subscript

ξ        indicates dependence on ξ

A lower case subscript generally indicates an element of a sequence.

Abbreviations and acronyms

arg max  value of x that maximizes the following function
   x
corr     correlation
cov      covariance
exp      exponential
ln       natural logarithm
MAP      maximum a posteriori probability
MLE      maximum-likelihood estimator
mse      mean-square error
var      variance

Mathematical notation

f(.)     the entire function f, as opposed to the value of the function at a particular point
*        transpose
∇x       gradient with respect to the vector x (result is a row vector when the operand is a scalar, or a matrix when the operand is a column vector)
∇²x      second gradient with respect to x
Σ        series summation
Π        series product
π        3.14159...
∪        set union
∩        set intersection
⊂        subset
∈        element of a set
{x:c}    the set of all x such that condition c holds
(.,.)    inner product
|        conditioned on (in probability contexts)
|.|      absolute value or determinant
d|.|     volume element
t+       right-hand limit at t
n-vector vector with n elements
x(i)     ith element of the vector x, or ith row of the matrix x

CHAPTER 1

1.0 INTRODUCTION

System identification is broadly defined as the deduction of system characteristics from measured data. It is commonly referred to as an inverse problem because it is the opposite of the problem of computing the response of a system with known characteristics. Gauss (1809, p. 85) refers to "the inverse problem, that is when the true is to be derived from the apparent place." The inverse problem might be phrased as, "Given the answer, what was the question?" Phrased in such general terms, system identification is seen as a simple concept used in everyday life, rather than as an obscure area of mathematics.

Example 1.0-1  Weighing yourself.

The system is your body, and the characteristic of interest is its mass. You perform an experiment by placing the system on a mechanical transducer in the bathroom, which gives as output a position approximately proportional to the system mass and the local gravitational field. Based on previous comparisons with the doctor's scales, you know that your scale consistently reads 2 lb high, so you subtract this figure from the reading. The result is still somewhat higher than expected, so you step off of the scales and then repeat the experiment. The new reading is more "reasonable," and from it you obtain an estimate of the system mass.

This simple example actually includes several important principles of system identification; for instance, the resulting estimates are biased (as defined in Chapter 4).

Example 1.0-2  The "guess your weight" booth at the fair.

The weight guesser's instrumentation and estimation algorithm are more difficult to describe precisely, but they are used to solve the same system identification problem.

Example 1.0-3  Newton's deduction of the theory of gravity.

Newton's problem was much more difficult than the first two examples. He had to deduce not just a single number, but also the form of the equations describing the system. Newton was a true expert in system identification (among other things).

As apparent from the above examples, system identification is as much an art as a science. This point is often forgotten by scientists who prove elegant mathematical theorems about a model that doesn't adequately represent the true system to begin with. On the other hand, engineers who reject what they consider to be "ivory tower theory" are foregoing tools that could give definite answers to some questions, and hints to aid in the understanding of others.

System identification is closely tied to control theory, partially by some common methodology, and partially by the use of identified system models for control design. Before you can design a controller for a system, you must have some notion of the equations describing the system. Another common purpose of system identification is to help gain an understanding of how a system works. Newton's investigations were more along this line. (It is unlikely that he wanted to control the motion of the planets.)

The application of system identification techniques is strongly dependent on the purpose for which the results are intended; radically different system models and identification techniques may be appropriate for different purposes related to the same system. The aircraft control system designer will be unimpressed when given a model based on inputs that cannot be influenced, outputs that cannot be measured, aspects of the system that the designer does not want to control, and a complicated model in a form not amenable to control analysis techniques. The same model might be ideal for the aerodynamicist studying the flow around the vehicle. The first and most important step of any system identification application is to define its purpose.

Following this chapter's overview, this document presents one aspect of the science of system identification: the theory of statistical estimation. The theory's main purpose is to help the engineer understand the system, not to serve as a formula for consistently producing the required results. Therefore, our exposition of the theory, although rigorously defensible, emphasizes intuitive understanding rather than mathematical sophistication. The following comments of Luenberger (1969, p. 2) also apply to the theory of system identification:

Some readers may look with great expectation toward functional analysis, hoping to discover new powerful techniques that will enable them to solve important problems beyond the reach of simpler mathematical analysis. Such hopes are rarely realized in practice. The primary utility of functional analysis is its role as a unifying discipline, gathering a number of apparently diverse, specialized mathematical tricks into one or a few geometric principles.

...

With good intuitive understanding, which arises from such unification, the reader will be better equipped to extend the ideas to other areas where the solutions, although simple, were not formerly obvious.

The literature of the field often uses the terms "system identification," "parameter identification," and "parameter estimation" interchangeably. The following sections define and differentiate these broad terms. The majority of the literature in the field, including most of this document, addresses the field most precisely called parameter estimation.

1.1 SYSTEM IDENTIFICATION

We begin by phrasing the system identification problem in formal mathematical terms. There are three elements essential to a system identification problem: a system, an experiment, and a response. We define these elements here in broad, abstract, set-theoretic terms, before introducing more concrete forms in Section 1.3.

Let U represent some experiment, taken from the set 𝒰 of possible experiments on the system. U could represent a discrete event, such as stepping on the scales; or a value, such as a voltage applied. U could also be a vector function of time, such as the motions of the control surfaces while an airplane is flown through a maneuver. In systems terminology, U is the input to the system. (We will use the terms "input," "control," and "experiment" more or less interchangeably.) Observe the response Z of the system to the experiment. As with U, Z could be represented in many forms, including as a discrete event (e.g., "the system blew up") or as a measured time function. It is an element of the set 𝒵 of possible responses. (We also use the terms "output" or "measurement" for Z.) The abstract system is a map (function) F from the set of possible experiments to the set of possible responses; that is,

F: 𝒰 → 𝒵    (1.1-1)

The system identification problem is to reconstruct the function F from a collection of experiments Uᵢ and the corresponding system responses Zᵢ. This is the purest form of the "black box" identification problem. We are asked to identify the system with no information at all about its internal structure, as if the system were in a black box which we could not see into. Our only information is the inputs and outputs. An obvious solution is to perform all of the experiments in 𝒰 and simply tabulate the responses. This is usually impossible because the set 𝒰 is too large (typically, infinite). Also, we may not have complete freedom in selecting the Uᵢ. Furthermore, even if this approach were possible, the tabular format of the result would generally be inconvenient and of little help in understanding the structure of the system.

If we cannot perform all of the experiments in 𝒰, the system identification problem is impossible without further information. Since we have made no assumptions about the form of F, we cannot be sure of its behavior without checking every point.

Example 1.1-1  The input U and output Z of a system are both represented by real-valued scalar variables. When an input of 1.0 is applied, the output is 1.0. When an input of -1.0 is applied, the output is also 1.0. Without further information we cannot tell which of the following representations (or an infinite number of others) of the system is correct:

a) Z = 1 (independent of U)
b) Z = |U|
c) Z = U²
d) The response depends on the time interval between applying U and measuring Z, which we forgot to consider.

Example 1.1-2  The input and output of a system are scalar time functions on the interval (-∞,∞). When the input is cos(t), the output is sin(t). Without more information we cannot distinguish among

a) z(t) = sin(t), independent of u
b) z(t) = u(t - π/2)
c) z(t) = -u̇(t)

Example 1.1-3  The input and output of a system are integers in the range 1 to 100. For every input except U = 37, we measure the output and find it equal to the input. We have no mathematical basis for drawing any conclusion about the response to the input U = 37. We could guess that the output might be Z = 37, but there is no mathematical justification for this guess in the problem as formulated.

Our inability to draw any conclusions in the above examples (particularly Example (1.1-3), which seems so obvious intuitively) points out the inadequacy of the pure black-box statement of the system identification problem. We cannot reconstruct the function F without some guidance on choosing a particular function from the infinite number of functions consistent with the results of the experiments performed.

We have seen that the pure black box system identification problem, where absolutely no information is given about the internal structure of the system, is impossible to solve. The information needed to construct the system function F is thus composed of two parts: information which is assumed, and information which is deduced from the experimental data. These two information sources can closely interact. For instance, the experimental data could contradict the assumptions made, requiring a revision of the assumptions, or the data could be used to select one of a set of candidate assumptions (hypotheses). Such interaction tends to obscure the role of the assumption, making it seem as though all of the information was obtained from the experimental data, and thus has a purely objective validity. In fact, this is never the case. Realistically, most of the information used for constructing the system function F will be assumptions based on knowledge of the nature of the physical processes of the system. System identification technology based on experimental data is used only to fill in the relatively small gaps in our knowledge of the system. From this perspective, we recognize system identification as an extremely useful tool for filling in such knowledge gaps, rather than as a panacea which will automatically tell us everything we need to know about a system. The capabilities of some modern techniques may invite the view of system identification as a cure-all, because the underlying assumptions are subtle and seldom explicitly stated.

Example 1.1-4  Return to the problem of Example (1.1-3). Seemingly, not much knowledge of the behavior of the system is required to deduce that Z will be 37 when U = 37; indeed, many common system identification algorithms would make such a deduction. In fact, the assumptions made are numerous. The specification of the set of possible inputs and outputs already implies many assumptions about the system; for instance, that there are no transient effects, or that such effects are unimportant. The problem statement does not allow for an event such as the system output's oscillating through several values. We have also made an assumption of repeatability. Perhaps the same experiment redone tomorrow would produce different results, depending on some factor not considered. Encompassing all of the other assumptions is the assumption of simplicity. We have applied "Occam's Razor" and found the simplest system consistent with the data. (One can easily imagine useful systems that select specific inputs for special treatment. Nothing in the data has eliminated such systems.) We can see that the assumptions play the largest role in solving this problem. Granted the assumption that we want the simplest consistent result, the deduction from the data that Z = U is trivial.

Two general types of assumptions exist. The first consists of restrictions on the allowable forms of the function F. Presumably, such restrictions would reflect the knowledge of what functions are reasonable considering the physics of the system. The second type of assumption is some criterion for selecting a "best" function from those consistent with the experimental results. In the following sections, we will see that these two approaches are combined: restricting the set of functions considered, and then selecting a best choice from this set.

For physical systems, information about the general form of the system function F can often be derived from knowledge of the system. Specific numerical values, however, are sometimes prohibitively difficult to compute theoretically without making unacceptable approximations. Therefore, the most widely used area of system identification is the subfield called parameter identification.

1.2 PARAMETER IDENTIFICATION

In parameter identification, the form of the system function is assumed to be known. This function contains a finite number of parameters, the values of which must be deduced from experimental data. Let ξ be a vector with the unknown parameters as its elements. Then the system response Z is a known function of the input U and the parameter vector ξ. We can restate this in a more convenient, but completely equivalent way. For each value of the parameter vector ξ, the system response Z is a known function of the input U. (The function can be different for different values of ξ.) We say that the function is parameterized by ξ and write

Z = Fξ(U)    (1.2-1)

The function Fξ(U) is referred to as the assumed system model. The subscript notation for ξ is used purely for convenience to indicate the special role of ξ. The function could be equivalently written as F(ξ,U). The parameter identification problem is then to deduce the value of ξ based on measurement of the responses Zᵢ to a set of inputs Uᵢ. This problem of identifying the parameter vector ξ is much less ambitious than the system identification problem of constructing the entire F function from experimental data; it is more in line with the amount of information that reasonably can be expected to be obtained from experimental data. Deducing the value of ξ amounts to solving the following set of simultaneous and generally nonlinear equations:

Zᵢ = Fξ(Uᵢ)    i = 1,2,...,N    (1.2-2)

where N is the number of experiments performed. Note that the only variable in these equations is the parameter vector ξ. The Uᵢ and Zᵢ represent the specific input used and response measured for the ith experiment. This is quite different from Equation (1.2-1), which expresses a general relationship among the three variables U, Z, and ξ.
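As a modern illustration (not part of the original exposition), the following Python sketch solves a small instance of Equation (1.2-2) numerically; the particular model function, data values, and solver are our assumptions, chosen to mirror Example 1.2-1 below.

    # Sketch: parameter identification as solving Z_i = F_xi(U_i), i = 1..N.
    # The model, data, and solver choice are illustrative assumptions only.
    import numpy as np
    from scipy.optimize import least_squares

    def F(xi, U):
        """Assumed system model F_xi(U); here the linear form a0 + a1*U."""
        a0, a1 = xi
        return a0 + a1 * U

    U = np.array([-1.0, 1.0])     # inputs U_i
    Z = np.array([ 1.0, 1.0])     # measured responses Z_i

    # Residuals of Equation (1.2-2); a solution makes all residuals zero.
    sol = least_squares(lambda xi: Z - F(xi, U), x0=np.zeros(2))
    print(sol.x)                  # [1. 0.]: a0 = 1, a1 = 0, as in Example 1.2-1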

Example 1.2-1  In the problem of Example (1.1-1), assume we are given that the response is a linear function of the input:

Z = Fξ(U) = a₀ + a₁U

The parameter vector is ξ = (a₀,a₁)*, the values of a₀ and a₁ being unknown. We were given that U = -1 and U = +1 both result in Z = 1; thus Equation (1.2-2) expands to

1 = Fξ(-1) = a₀ - a₁
1 = Fξ(+1) = a₀ + a₁

This system is easy to solve and gives a₀ = 1 and a₁ = 0. Thus we have F(U) = 1 (independent of U).

Example 1.2-2  In the problem of Example (1.1-2), assume we know that the system can be represented as

ż(t) = a u̇(t) + b ü(t)

or, equivalently, expressing Z as an explicit function of U,

z(t) = a u(t) + b u̇(t)

The unknown parameter vector for this system is ξ = (a,b)*. Since u(t) = cos(t) resulted in z(t) = sin(t), Equation (1.2-2) becomes

sin(t) = a cos(t) - b sin(t)    for all t ∈ (-∞,∞)

This equation is uniquely solved by a = 0 and b = -1.

Example 1.2-3  In the problem of Example (1.1-3), assume that the system can be represented by a polynomial of order 10 or less:

Z = Fξ(U) = a₀ + a₁U + a₂U² + ... + a₁₀U¹⁰

The unknown parameter vector is ξ = (a₀,a₁,...,a₁₀)*. Using the experimental data described in Example (1.1-3), Equation (1.2-2) becomes

Uᵢ = a₀ + a₁Uᵢ + a₂Uᵢ² + ... + a₁₀Uᵢ¹⁰    for each input Uᵢ ≠ 37

This system of equations is uniquely solved by a₁ = 1, with a₀ and a₂ through a₁₀ all equaling 0.

As with any set of equations, there are three possible results from Equation (1.2-2). First, there can be a unique solution, as in each of the examples above. Second, there could be multiple solutions, in which case either more experiments must be performed or more assumptions would be necessary to restrict the set of allowable solutions or to pick a best solution in some sense. The third possibility is that there could be no solutions, the experimental data being inconsistent with the assumed equations. This situation will require a basic change in our way of thinking about the problem. There will almost never be an exact solution with real data, so the first two possibilities are somewhat academic. The remainder of the document, and Section 1.4 in particular, will address the general situation where Equation (1.2-2) need not have an exact solution. The possibilities of one or more solutions are part of the general case.

Example 1.2-4  In the problem of Example (1.1-1), assume we are given that the response is a quadratic function of the input:

Z = Fξ(U) = a₀ + a₁U + a₂U²

The parameter vector is ξ = (a₀,a₁,a₂)*. We were given that U = -1 and U = +1 both result in Z = 1. With these data Equation (1.2-2) expands to

1 = Fξ(-1) = a₀ - a₁ + a₂
1 = Fξ(+1) = a₀ + a₁ + a₂

From this information we can deduce that a₁ = 0, but a₀ and a₂ are not uniquely determined. The values might be determined by performing the experiment U = 0. Alternately, we might decide that the lowest order system consistent with the data available is preferred, giving a₀ = 1 and a₂ = 0.

Example 1.2-5  In the problem of Example (1.1-1), assume that the response is a linear function of the input. We were given that U = -1 and U = +1 both result in Z = 1. Suppose that the experiment U = 0 is performed and results in Z = 0.95. There are then no parameter values consistent with the data.
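Both situations can be seen numerically. The following sketch (ours, not from the original text) builds the equations of Examples 1.2-4 and 1.2-5 as linear systems and inspects them with NumPy: the quadratic model is underdetermined, while the three-point data admit no exact linear solution.

    # Sketch: multiple solutions (Example 1.2-4) and no exact solution
    # (Example 1.2-5), examined with numpy's least-squares routine.
    import numpy as np

    # Example 1.2-4: quadratic model, two equations, three unknowns.
    A4 = np.array([[1.0, -1.0, 1.0],     # row is (1, U, U^2) at U = -1
                   [1.0,  1.0, 1.0]])    # row is (1, U, U^2) at U = +1
    Z4 = np.array([1.0, 1.0])
    print(np.linalg.matrix_rank(A4))     # 2 < 3: infinitely many solutions

    # Example 1.2-5: linear model, three equations, two unknowns.
    A5 = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
    Z5 = np.array([1.0, 0.95, 1.0])
    xi, res, *_ = np.linalg.lstsq(A5, Z5, rcond=None)
    print(xi, res)                       # best fit; nonzero residual, no exact fit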

1.3 TYPES OF SYSTEM MODELS

Although the basic concept of system modeling is quite general, more useful results can be obtained by examining specific types of system models. Clarity of exposition is also improved by using specific models, even when we can obtain the result in a more general context. This section describes some of the broad classes of system model forms which are often used in parameter identification.

1.3.1 Explicit Function

The most basic type of system model is the explicit function. The response Z is written as a known explicit function of the input U and the parameter vector ξ. This type of model corresponds exactly to Equation (1.2-1):

Z = Fξ(U)    (1.3-1)

In the simplest subset of the explicit function models, the response is a linear function of the parameter vector:

Z = f(U)ξ    (1.3-2)

In this equation, f(U) is a matrix which is a known function (nonlinear in general) of the input. This is the type of model used in linear regression. Many systems can be put into this easily analyzed form, even though the systems might appear quite complex at first glance. A common example of a model linear in its parameters is a finite polynomial expansion of Z in terms of U:

Z = ξ₀ + ξ₁U + ξ₂U² + ... + ξₙUⁿ

In this case, f(U) is the row vector (1, U, U²,...,Uⁿ). Note that Z is linear in the parameters ξⱼ, but not in the input U.

1.3.2 State Space

State-space models are very useful for dynamic systems; that is, systems with responses that are time functions. Wiberg (1971) and Zadeh and Desoer (1963) give general discussions of state-space models. Time can be treated as either a continuous or discretized variable in dynamic models; the theories of discrete- and continuous-time systems are quite different.

The general form for a continuous-time state-space model is

x(t₀) = x₀    (1.3-3a)
ẋ(t) = f[x(t),u(t),t,ξ]    (1.3-3b)
z(t) = g[x(t),u(t),t,ξ]    (1.3-3c)

where f and g are arbitrary known functions. The initial condition x₀ can be known or can be a function of ξ. The variable x(t) is defined as the state of the system at time t. Equation (1.3-3b) is called the state equation, and (1.3-3c) is called the observation equation. The measured system response is z. The state is not considered to be measured; it is an internal system variable. However, since g[x(t),u(t),t,ξ] = x(t) is a legitimate observation function, the measurement can be equal to the state if so desired.

Discrete-time state-space models are similar to continuous-time models, except that the differential equations are replaced by difference equations. The general form is

x(t₀) = x₀    (1.3-4a)
x(tᵢ₊₁) = f[x(tᵢ),u(tᵢ),tᵢ,ξ]    (1.3-4b)
z(tᵢ) = g[x(tᵢ),u(tᵢ),tᵢ,ξ]    (1.3-4c)

The system variables are defined only at the discrete times tᵢ.

This document is largely concerned with continuous-time dynamic systems described by differential equations (1.3-3b). The system response, however, is measured at discrete time points, and the computations are done in a digital computer. Thus, some features of both discrete- and continuous-time systems are pertinent. The system equations are

x(t₀) = x₀    (1.3-5a)
ẋ(t) = f[x(t),u(t),t,ξ]    (1.3-5b)
z(tᵢ) = g[x(tᵢ),u(tᵢ),tᵢ,ξ]    i = 1,2,...    (1.3-5c)

The response z(tᵢ) is considered to be defined only at the discrete time points tᵢ, although the state x(t) is defined in continuous time. We will see that the theory of parameter identification for continuous-time systems with discrete observations is virtually identical to the theory for discrete-time systems in spite of the superficial differences in the system equation forms. The theory of continuous-time observations requires much deeper mathematical background and will only be outlined in this document. Since practical application of the algorithms developed generally requires a digital computer, the continuous-time theory is of secondary importance.
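To make the continuous/discrete form (1.3-5) concrete, the sketch below (ours; the particular f, g, parameter values, and integration step are illustrative assumptions) integrates the state equation with a fourth-order Runge-Kutta step and samples the observation at the discrete times tᵢ:

    # Sketch: simulate Equation (1.3-5): continuous state equation,
    # observations at discrete times t_i.  All details are examples.
    import numpy as np

    def f(x, u, t, xi):        # state equation x_dot = f[x(t),u(t),t,xi]
        return np.array([x[1], -xi[0]*x[0] - xi[1]*x[1] + u])

    def g(x, u, t, xi):        # observation equation z = g[x,u,t,xi]
        return x[0]

    def rk4_step(x, u, t, dt, xi):
        k1 = f(x, u, t, xi)
        k2 = f(x + 0.5*dt*k1, u, t + 0.5*dt, xi)
        k3 = f(x + 0.5*dt*k2, u, t + 0.5*dt, xi)
        k4 = f(x + dt*k3, u, t + dt, xi)
        return x + (dt/6.0)*(k1 + 2*k2 + 2*k3 + k4)

    xi = np.array([4.0, 0.4])  # parameter vector (known, for simulation)
    dt, N = 0.05, 100          # sample interval and number of samples
    x = np.array([1.0, 0.0])   # initial condition x0
    z = np.empty(N)
    for i in range(N):
        t, u = i*dt, 0.0       # input held at zero in this sketch
        z[i] = g(x, u, t, xi)  # response defined only at the times t_i
        x = rk4_step(x, u, t, dt, xi)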
An important subset of systems described by state-space equations is the set of linear dynamic systems. Although the equations are sometimes rewritten in forms convenient for different applications, all linear dynamic system models can be written in the following forms. The continuous-time form is

x(t₀) = x₀    (1.3-6a)
ẋ(t) = Ax(t) + Bu(t)    (1.3-6b)
z(t) = Cx(t) + Du(t)    (1.3-6c)

The matrix A is called the stability matrix, B is called the control matrix, and C and D are called state and control observation matrices, respectively. The discrete-time form is

x(t₀) = x₀    (1.3-7a)
x(tᵢ₊₁) = Φx(tᵢ) + Ψu(tᵢ)    (1.3-7b)
z(tᵢ) = Cx(tᵢ) + Du(tᵢ)    (1.3-7c)

The matrices Φ and Ψ are called the system transition matrices. The form for continuous systems with discrete observations is identical to Equation (1.3-6), except that the observation is defined only at the discrete time points. In all three forms, A, B, C, D, Φ, and Ψ are matrix functions of the parameter vector ξ. These matrices are functions of time in general, but for notational simplicity, we will not explicitly indicate the time dependence unless it is important to a discussion.

The continuous-time and discrete-time state-equation forms are closely related. In many applications, the discrete-time form of Equation (1.3-7) is used as a discretized approximation to Equation (1.3-6). In this case, the transition matrices Φ and Ψ are related to the A and B matrices by the equations

Φ = exp(AΔ)    (1.3-8a)
Ψ = [∫₀^Δ exp(Aτ) dτ]B    (1.3-8b)

where Δ = tᵢ₊₁ - tᵢ is the sample interval. We discuss this relationship in more detail in Section 7.5. In a similar manner, Equation (1.3-4) is sometimes viewed as an approximation to Equation (1.3-3). Although the principle in the nonlinear case is the same as in the linear case, we cannot write precise expressions for the relationship in such simple closed forms as in the linear case.
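Numerically, Equation (1.3-8) is conveniently evaluated with a matrix exponential; a standard identity yields both Φ and Ψ from one exponential of an augmented matrix. The sketch below is our illustration (the A, B, and Δ values are arbitrary) using scipy.linalg.expm:

    # Sketch: Phi = exp(A*Delta) and Psi = (integral of exp(A*tau), 0..Delta) B,
    # both obtained from the exponential of the augmented matrix [[A, B], [0, 0]].
    import numpy as np
    from scipy.linalg import expm

    A = np.array([[0.0, 1.0], [-4.0, -0.4]])
    B = np.array([[0.0], [1.0]])
    Delta = 0.05                         # sample interval

    nx, nu = A.shape[0], B.shape[1]
    M = np.zeros((nx + nu, nx + nu))
    M[:nx, :nx], M[:nx, nx:] = A, B
    E = expm(M * Delta)
    Phi, Psi = E[:nx, :nx], E[:nx, nx:]  # transition matrices of (1.3-7)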

Standardized canonical forms of the state-space equations (Wiberg, 1971) play an important role in some approaches to parameter estimation. We will not emphasize canonical forms in this document. The basic theory of parameter identification is the same, whether canonical forms are used or not. In some applications, canonical forms are useful, or even necessary. Such forms, however, destroy any internal relationship between the model structure and the system, retaining only the external response characteristics. Fidelity to the internal as well as to the external system characteristics is a significant aid to engineering judgment and to the incorporation of known facts about the system, both of which play crucial roles in system identification. For instance, we might know the values of many locations of the A matrix in its "natural" form. When the A matrix is transformed to a canonical form, these simple facts generally become unwieldy equations which cannot reasonably be used. When there is little useful knowledge of the internal system structure, the use of canonical forms becomes more appropriate.

1.3.3 Others

Other types of system models are used in various applications. This document will not cover them explicitly, but many of the ideas and results from explicit function and state-space models can be applied to other model types.

One of these alternate model classes deserves special mention because of its wide use. This is the class of auto-regressive moving average (ARMA) models and related variants (Hajdasinski, Eykhoff, Damen, and van den Boom, 1982). Discrete-time ARMA models are in the general form

z(tᵢ) = a₁z(tᵢ₋₁) + ... + aₙz(tᵢ₋ₙ) + b₁u(tᵢ₋₁) + ... + bₙu(tᵢ₋ₙ)

Discrete-time ARMA models can be readily rewritten as linear state-space models (Schweppe, 1973), so all of the theory which we will develop for state-space models is directly applicable.
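As an illustration of this rewriting (our sketch, assuming the generic scalar ARMA form displayed above, which may differ in detail from the report's equation), a companion-form state-space realization can be built directly from the coefficients:

    # Sketch: companion-form state-space realization of a scalar ARMA model;
    # x(t_{i+1}) = Phi x(t_i) + Psi u(t_i), z(t_i) = C x(t_i).
    import numpy as np

    def arma_to_state_space(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        n = len(a)
        Phi = np.zeros((n, n))
        Phi[0, :] = a                   # AR coefficients in the top row
        Phi[1:, :n-1] = np.eye(n - 1)   # shift structure of the companion form
        Psi = np.zeros((n, 1))
        Psi[0, 0] = 1.0
        C = b.reshape(1, n)             # MA coefficients enter the observation
        return Phi, Psi, C

    Phi, Psi, C = arma_to_state_space(a=[1.5, -0.7], b=[0.2, 0.1])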

1.4 PARAMETER ESTIMATION

The examples in Section 1.2 were carefully chosen to have exact solutions. Real data is seldom so obliging. No matter how careful we have been in selecting the form of the assumed system model, it will not be an exact representation of the system. The experimental data will not be consistent with the assumed model form for any value of the parameter vector ξ. The model may be close, but it will not be exact, if for no other reason than that the measurements of the response will be made with real, and thus imperfect, instruments. The theoretical development seems to have arrived at a cul-de-sac. The black box system identification problem was not feasible because there were too many solutions consistent with the data. To remove this difficulty, it was necessary to assume a model form and define the problem as parameter identification. With the assumed model, however, there are no solutions consistent with the data.
We need to retain the concept of an assumed model structure in order to reduce the scope of the problem, yet avoid the inflexibility of requiring that the model exactly reproduce the experimental data. We do this by using the assumed model structure, but acknowledging that it is imperfect. The assumed model structure should include the essential characteristics of the true system. The selection of these essential characteristics is the most significant engineering judgment in system analysis. A good example is Gauss' (1809, p. xi) justification that the major axis of a cometary ellipse is not an essential parameter, and that a simplified parabolic model is therefore appropriate:

There existed, in point of fact, no sufficient reason why it should be taken for granted that the paths of comets are exactly parabolic: on the contrary, it must be regarded as in the highest degree improbable that nature should ever have favored such an hypothesis. Since, nevertheless, it was known, that the phenomena of a heavenly body moving in an ellipse or hyperbola, the major axis of which is very great relatively to the parameter, differs very little near the perihelion from the motion in a parabola of which the vertex is at the same distance from the focus; and that this difference becomes the more inconsiderable the greater the ratio of the axis to the parameter: and since, moreover, experience has shown that between the observed motion and the motion computed in the parabolic orbit, there remained differences scarcely ever greater than those which might safely be attributed to errors of observation (errors quite considerable in most cases): astronomers have thought proper to retain the parabola, and very properly, because there are no means whatever of ascertaining satisfactorily what, if any, are the differences from a parabola.

Chapter 11 discusses some aspects of this selection, including theoretical aids to making such judgments.

Given the assumed model structure, the primary question is how to treat imperfections in the model. We need to determine how to select the value of ξ which makes the mathematical model the "best" representation of the essential characteristics of the system. We also need to evaluate the error in the determination of ξ due to the unmodeled effects present in the experimental data. These needs introduce several new concepts.

One concept is that of a "best" representation as opposed to the correct representation. It is often impossible to define a single correct representation, even in principle, because we have acknowledged the assumed model structure to be imperfect and we have constrained ourselves to work within this structure. Thus ξ does not have a correct value. As Acton (1970) says on this subject,

A favorite form of lunacy among aeronautical engineers produces countless attempts to decide what differential equation governs the motion of some physical object, such as a helicopter rotor .... But arguments about which differential equation represents truth, together with their fitting calculations, are wasted time.

Example 1.4-1  Estimating the radius of the Earth. The Earth is not a perfect sphere and, thus, does not have a radius. Therefore, the problem of estimating the radius of the Earth has no correct answer. Nonetheless, a representation of the Earth as a sphere is a useful simplification for many purposes.
Even the concept of the "best" representation overstates the meaning of our estimates because there is no universal criterion for defining a single best representation (thus our quotes around "best"). Many system identification methods establish an optimality criterion and use numerical optimization methods to compute the optimal estimates as defined by the criterion; indeed most of this document is devoted to such optimal estimators or approximations to them. To be avoided, however, is the common attitude that optimal (by some criterion) is synonymous with correct, and that any nonoptimal estimator is therefore wrong. Klein (1975) uses the term "adequate model" to suggest that the appropriate judgment on an identified model is whether the model is adequate for its intended purpose. In addition to these concepts of the correct, best, or adequate values of ξ, we have the somewhat related issue of errors in the determination of ξ caused by the presence of unmodeled effects in the experimental data. Even if a correct value of ξ is defined in principle, it may not be possible to determine this value exactly from the experimental data due to contamination of the data by unmodeled effects. We can now define the task as to determine the best estimate of ξ obtainable from the data, or perhaps an adequate estimate of ξ, rather than to determine the correct value of ξ. This revised problem is more properly called parameter estimation than parameter identification. (Both terms are often used interchangeably.) Implied subproblems of parameter estimation include the definition of the criteria for best or adequate, and the characterization of potential errors in the estimates.

Example 1.4-2
Reconsider the problem of example (1.2-5). Although there is no linear model exactly consistent with the data, modeling the output as a constant value of 1 appears a reasonable approximation and agrees exactly with two of the three data points.

One approach to parameter estimation is to minimize the error between the model response and the actual measured response, using a least squares or some similar ad hoc criterion. The values of the parameter vector ξ which result in the minimum error are called the best estimates. Gauss (1809, p. 162) introduced this idea:

Finally, as all our observations, on account of the imperfection of the instruments and of the senses, are only approximations to the truth, an orbit based only on the six absolutely necessary data may still be liable to considerable errors. In order to diminish these as much as possible, and thus to reach the greatest precision attainable, no other method will be given except to accumulate the greatest number of the most perfect observations, and to adjust the elements, not so as to satisfy this or that set of observations with absolute exactness, but so as to agree with all in the best possible manner.

This approach is easy to understand without extensive mathematical background, and it can produce excellent results. It is restricted to deterministic models so that the model response can be calculated.
An alternate approach to parameter estimation introduces probabilistic concepts in order to take advantage of the extensive theory of statistical estimation. We should note that, from Gauss's time, these two approaches have been intimately linked. The sentence immediately following the above exposition in Theoria Motus (Gauss, 1809, p. 162) is

For which purpose, we will show in the third section how, according to the principles of the calculus of probabilities, such an agreement may be obtained, as will be, if in no one place perfect, yet in all places the strictest possible.

In the statistical approach, all of the effects not included in the deterministic system model are modeled as random noise; the characteristics of the noise and its position in the system equations vary for different applications. The probabilistic treatment solves the perplexing problem of how to examine the effect of the unmodeled portion of the system without first modeling it. The formerly unmodeled portion is modeled probabilistically, which allows description of its general characteristics such as magnitude and frequency content, without requiring a detailed model. Systems such as this, which involve both time and randomness, are referred to as stochastic systems. This document will examine a small part of the extensive theory of stochastic systems, which can be used to define estimates of the unknown parameters and to characterize the properties of these estimates.

Although this document will devote significant time to the treatment of the probabilistic approach, this approach should not be oversold. It is currently popular to disparage model-fitting approaches as nonrigorous and without theoretical basis. Such attitudes ignore two important facts: first, in many of the most common situations, the "sophisticated" probabilistic approach arrives at the same estimation algorithm as the model-fitting approaches. This fact is often obscured by the use of buzz words and unenlightening notation, apparently for fear that the theoretical effort will be considered as wasted. Our view is that such relationships should be emphasized and clearly explained. The two approaches complement each other, and the engineer who understands both is best equipped to handle real world problems. The model-fitting approach gives good intuitive understanding of such problems as modeling error, algorithm convergence, and identifiability, along with others. The probabilistic approach contributes quantitative characterization of the properties of the estimates (the accuracy), and an understanding of how these characteristics are affected by various factors.

The second fact ignored by those who disparage model fitting is that the probabilistic approach involves just as many (or more) unjustified ad hoc assumptions. Behind the smug front of mathematical rigor and sophistication lie patently ridiculous assumptions about the system. The contaminating noise seldom has any of the characteristics (Gaussian, white, etc.) assumed simply in order to get results in a usable form. More basic is the fact that the contaminating noise is not necessarily random noise at all. It is a composite of all of the otherwise unmodeled portions of the system output, some of which might be "truly" random (deferring the philosophical question of whether truly random events exist), but some of which are certainly deterministic even at the macroscopic level. In light of this consideration, the "rigor" of the probabilistic approach is tarnished from the start, no matter how precise the inner mathematics. Contrary to the impressions often given, the probabilistic approach is not the single correct answer, but is one of the possible avenues that can give useful results, making on the average as many unjustified or blatantly false assumptions as the alternatives. Bayes (1763, p. 9), in an essay reprinted by Barnard (1958), made a classical statement on the role of assumptions in mathematics:
It is not the business of the Mathematician to dispute whether quantities do in fact ever vary in the manner that is supposed, but only whether the notion of their doing so be intelligible; which being allowed, he has a right to take it for granted, and then see what deductions he can make from that supposition... He is not inquiring how things are in matter of fact, but supposing things to be in a certain way, what are the consequences to be deduced from them; and all that is to be demanded of him is, that his suppositions be intelligible, and his inferences just from the suppositions he makes.

The demands on the applications engineer are somewhat different, and more in line with Bayes' (1763, p. 50) later statement in the same document:

So far as Mathematics do not tend to make men more sober and rational thinkers, wiser and better men, they are only to be considered as an amusement, which ought not to take us off from serious business.

A few words are necessary in defense of the probabilistic approach, lest the reader decide that it is not worthwhile to pursue. The main issue is the description of deterministic phenomena as random. This disagrees with common modern perceptions of the meaning and use of randomness for physical situations, in which random and deterministic phenomena are considered as quite distinct and well delineated. Our viewpoint owes more to the earlier philosophy of probability theory: that it is a useful tool for studying complicated phenomena, which need not be inherently random (if anything is inherently random). Cramer (1946, p. 141) gives a classic exposition of this philosophy:

[The following is descriptive of] large and important groups of random experiments. Small variations in the initial state of the observed units, which cannot be detected by our instruments, may produce considerable changes in the final result. The complicated character of the laws of the observed phenomena may render exact calculation practically, if not theoretically, impossible. Uncontrollable action by small disturbing factors may lead to irregular deviations from a presumed "true value".

...

It is, of course, clear that there is no sharp distinction between these various modes of randomness. Whether we ascribe e.g. the fluctuations observed in the results of a series of shots at a target mainly to small variations in the initial state of the projectile, to the complicated nature of the ballistic laws, or to the action of small disturbing factors, is largely a matter of taste. The essential thing is that, in all cases where one or more of these circumstances are present, an exact prediction of the results of individual experiments becomes impossible, and the irregular fluctuations characteristic of random experiments will appear.
We shall now see that, in cases of this character, there appears amidst all the irregularity of the fluctuations a certain typical form of regularity that will serve as the basis of the mathematical theory of statistics.

The probabilistic methods allow quantitative analysis of the general behavior of these complicated phenomena, even though we are unable to model the exact behavior.

1.5 OTHER APPROACHES

Our aim in this document is to present a unified viewpoint of the system identification ideas leading to maximum-likelihood estimation of the parameters of dynamic systems, and of the application of these ideas. There are many completely different approaches to identification of dynamic systems. There are innumerable books and papers in the system identification literature. Eykhoff (1974) and Astrom and Eykhoff (1970) give surveys of the field. However, much of the work in system identification is published outside of the general body of system identification literature. Many techniques have been developed for specific areas of application by researchers oriented more toward the application area than toward general system identification problems. These specialized techniques are part of the larger field of system identification, although they are usually not labeled as such. (Sometimes they are recognizable as special cases or applications of more general results.) In the area most familiar to us, aircraft stability and control derivatives were estimated from flight data long before such estimation was classified as a system identification problem (Doetsch, 1953; Etkin, 1958; Flack, 1959; Greenberg, 1951; Rampy and Berry, 1964; Wolowicz, 1966; and Wolowicz and Holleman, 1958).
We do not even attempt here the monumental task of surveying the large body of system identification techniques. Suffice it to say that other approaches exist, some explicitly labeled as system identification techniques, and some not so labeled. We feel that we are better equipped to make a useful contribution by presenting, in an organized and comprehensible manner, the viewpoint with which we are most familiar. This orientation does not constitute a dismissal of other viewpoints.
We have sometimes been asked to refute claims that, in some specific application, a simple technique such as regression obtained superior results to a "sophisticated" technique bearing impressive-sounding credentials as an optimal nonlinear maximum likelihood estimator. The implication is that simple is somehow synonymous with poor, and sophisticated is synonymous with good, associations that we completely disavow. Indeed, the opposite association seems more often appropriate, and we try to present the maximum likelihood estimator in a simple light. We believe that these methods are all tools to be used when they help do the job. We have used quotations from Gauss several times in this chapter to illustrate his insight into what are still some of the important issues of the day, and we will close the chapter with yet another (Gauss, 1809, p. 108):

...we hope, therefore, it will not be disagreeable to the reader, that, besides the solution to be given hereafter, which seems to leave nothing further to be desired, we have thought proper to preserve also the one of which we have made frequent use before the former suggested itself to me. It is always profitable to approach the more difficult problems in several ways, and not to despise the good although preferring the better.

CHAPTER 2

2.0 OPTIMIZATION METHODS

Most of the estimators in this book require the minimization or maximization of a nonlinear function. Sometimes we can write an explicit expression for the minimum or maximum point. In many cases, however, we must use an iterative numerical algorithm to find the solution. Therefore a background in optimization methods is mandatory for appreciation of the various estimators. Optimization is a major field in its own right and we do not attempt a thorough treatment or even a survey of the field in this chapter. Our purpose is to briefly introduce a few of the optimization techniques most pertinent to parameter estimation. Several of the conclusions we draw about the relative merits of various algorithms are influenced by the general structure of parameter estimation problems and, thus, might not be supportable in a broader context of optimizing arbitrary functions. Numerous books such as Rao (1979), Luenberger (1969), Luenberger (1972), Dixon (1972), and Polak (1971) cover the detailed derivation and analysis of the techniques discussed here and others. These books give more thorough treatments of the optimization methods than we have room for here, but are not oriented specifically to parameter estimation problems. For those involved in the application of estimation theory, and particularly for those who will be writing computer programs for parameter estimation, we strongly recommend reading several of these books. The utility and efficiency of a parameter estimation program depend strongly on its optimization algorithms. The material in this chapter should be sufficient for a general understanding of the problems and the kinds of algorithms used, but not for the details of efficient application.

The basic optimization problem is to find the value of the vector x that gives the smallest or largest value of the scalar-valued function J(x). By convention we will talk about minimization problems; any maximization problem can be made into an equivalent minimization problem by changing the sign of the function. We will follow the widespread practice of calling the function to be minimized a cost function, regardless of whether or not it really has anything to do with monetary cost. To formalize the definition of the problem, a function J(x) is said to have a minimum at x̂ if

J(x̂) ≤ J(x)   (2.0-1)

for all x. This is sometimes called an unconstrained global minimum to distinguish it from local and constrained minima, which are defined below. Two kinds of side constraints are sometimes placed on the problem. Equality constraints are in the form

gᵢ(x) = 0   (2.0-2)

Inequality constraints are in the form

hᵢ(x) ≥ 0   (2.0-3)

The gᵢ and hᵢ are scalar-valued functions of x. There can be any number of constraints on a problem. A value of x is called admissible if it satisfies all of the constraints; if a value violates any of the constraints it is inadmissible. The constraints modify the problem statement as follows: x̂ is the constrained minimum of J(x) if x̂ is admissible and if Equation (2.0-1) holds for all admissible x.

Two crucial questions about any optimization problem are whether a solution exists and whether it is unique. These questions are important in application as well as in theory. A computer program can spend a long time searching for a solution that does not exist. A simple example of an optimization problem with no solution is the unconstrained minimization of J(x) = x. A problem can also fail to have a solution because there is no x satisfying the constraints. We will say that a problem that has no solution is ill-posed. A simple problem with a nonunique solution is the unconstrained minimization of J(x) = (x₁ − x₂)², where x is a 2-vector.

All of the algorithms that we discuss (and most other algorithms) search for a local minimum of the function, rather than the global minimum. A local minimum (also called a relative minimum) is defined as follows: x̂ is a local minimum of J(x) if a scalar ε > 0 exists such that

J(x̂) ≤ J(x̂ + h)   (2.0-4)

for all h with |h| < ε. To define a constrained local minimum, we must add the qualifications that x̂ and x̂ + h satisfy the constraints. The term "extremum" refers to either a local minimum or a local maximum. Figure (2.0-1) illustrates a problem with three local minima, one of which is the global minimum. Note that if a global minimum exists, even if it is not unique, it is also a local minimum. The converse to this statement is false; the existence of a local minimum does not even imply that a global minimum exists.

We can sometimes prove that a function has only one local minimum point, and that this point is also the global minimum. When we lack such proofs, there is no universal way to guarantee that the local minimum found by an algorithm is the global minimum. A reasonable check for iterative algorithms is to try the algorithm with many different starting values widely distributed within the realm of possible values. If the algorithm consistently converges to the same point, that point is probably the global minimum. The cost of such a test, however, is often prohibitively high.

The likelihood of local minima difficulties varies widely depending on the application. In some applications we can prove that there are no local minima except at the unique global minimum. At the other extreme, some applications are plagued by numerous local minima to the extent that most minimization algorithms are

worthless. Most applications lie between these extremes. We can often argue convincingly that a particular answer must be the global minimum, even when rigorous proof is impractical.

The algorithms in this chapter are, with a few exceptions, iterative. Given some starting value x₀, the algorithms give a procedure for computing a new value x₁; then x₂ is computed from x₁, etc. The intent of the iterative algorithms is to create a sequence xᵢ that converges to the minimum. The starting value can be from an independent estimate of a reasonable answer, or it can come from a special start-up algorithm.

The final step of any iterative algorithm is testing convergence. After the algorithm has proceeded for some time, we need to choose among the following alternatives: 1) the algorithm has converged to a value sufficiently close to the true minimum and should therefore be terminated; 2) the algorithm is making acceptable progress toward the solution and should be continued; 3) the algorithm is failing to converge or is converging too slowly to obtain a solution in an acceptable time, and it should therefore be abandoned; or 4) the algorithm is exhibiting behavior that suggests that switching to a different algorithm (or modifying the current one) might be productive. This decision is far from trivial because some algorithms can essentially stall at a point far from any local minimum, making such small changes in xᵢ that they appear to have converged.

We have briefly mentioned the problems of existence and uniqueness of solutions, local minima, starting values, and convergence tests. These are major issues in practical application, but we will not examine them further here. The references contain considerable discussion of these issues.

A cost function of an N-dimensional x vector can be visualized as a hypersurface in (N + 1)-dimensional space. For illustrating the behavior of the various algorithms, we will use isocline plots of cost functions of two variables. An isocline is the locus of all points in the x-space corresponding to some specified cost function value. The isoclines of positive definite quadratic functions are always ellipses. Furthermore, a quadratic function is completely specified by one of its isoclines and the fact that it is quadratic. Two-dimensional examples are sufficient to illustrate most of the pertinent points of the algorithms.
We will consider unconstrained minimization problems, which illustrate the basic points necessary for our purposes. The references address problems with equality and inequality constraints.

2.1 ONE-DIMENSIONAL SEARCHES

Optimization methodology is strongly influenced by whether or not x is a scalar. Because the optimization problems in this book are generally multi-dimensional, the methods applicable only to scalar x are not directly relevant. Many of the multi-dimensional optimization algorithms, however, require the solution of one-dimensional subproblems as part of the larger algorithm. Most such subproblems are in the form of minimizing the multi-dimensional cost function with x constrained to a line in the multi-dimensional space. This has the superficial appearance of a multi-dimensional problem, and furthermore one with the added complications of constraints. To clarify the one-dimensional nature of these subproblems, express them as follows: the vector x is restricted to a line defined by

x = x₀ + λx₁   (2.1-1)

where x₀ and x₁ are fixed vectors, and λ is a scalar variable representing position along the line. Restricted to this line, the cost can be written as a function of λ:

g(λ) = J(x₀ + λx₁)   (2.1-2)

The function g(λ) is a scalar function of a scalar variable, and one-dimensional minimization algorithms apply directly. Substituting the minimizing value of λ into Equation (2.1-1) then gives the minimizing point along the line in the space of x.

We will not examine the one-dimensional search algorithms in this book. Several of the references have good treatments of the subject. We will note that most of the relevant one-dimensional algorithms involve approximating the function by a low-order polynomial based on the values of the function and its first and second derivatives at one or more points. The minimum point of the polynomial, explicitly evaluated, replaces one of the original points, and the process repeats. The distinguishing features of the algorithms are the order of the polynomial, the number of points, and the order of the derivatives of J(x) evaluated. Variants of the algorithms depend on start-up procedures and methods for selecting the point to be replaced.

In some special cases we can solve the one-dimensional minimization problems explicitly by setting the derivative to zero, or by other means, even when we cannot explicitly solve the encompassing multi-dimensional problem. Several of our examples of multi-dimensional algorithms will use explicit solutions of the one-dimensional subproblems to avoid getting bogged down in detail. Real problems seldom will be so conveniently amenable to exact solution of the one-dimensional subproblems, except where the multi-dimensional problem could be directly solved without resort to iterative methods. Iterative one-dimensional searches are usually necessary with any method that involves one-dimensional subproblems. We will encounter one of the rare exceptions in the estimation of variance.
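For concreteness, the following Python fragment sketches the flavor of such a search: successive quadratic interpolation through three points bracketing the minimum of g(λ). It is only an illustrative sketch under simplified assumptions (unimodal g, valid starting bracket), with names of our own choosing, not one of the polished algorithms in the references.

    # Sketch of a one-dimensional search by successive quadratic fit.
    # a < b < c must bracket the minimum: g(b) < g(a) and g(b) < g(c).
    def quadratic_fit_search(g, a, b, c, iterations=20):
        for _ in range(iterations):
            ga, gb, gc = g(a), g(b), g(c)
            # vertex of the parabola through (a,ga), (b,gb), (c,gc)
            num = (b - a)**2 * (gb - gc) - (b - c)**2 * (gb - ga)
            den = (b - a) * (gb - gc) - (b - c) * (gb - ga)
            if den == 0.0:
                break                    # collinear points; no improvement
            lam = b - 0.5 * num / den
            # keep three points that still bracket the minimum
            if lam < b:
                a, b, c = (a, lam, b) if g(lam) < gb else (lam, b, c)
            else:
                a, b, c = (b, lam, c) if g(lam) < gb else (a, b, lam)
        return b

Applied to Equation (2.1-2), g would be the function g(λ) = J(x₀ + λx₁), so each call to g costs one evaluation of the multi-dimensional cost function.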

2.2 DIRECT METHODS

Optimization methods that do not require the evaluation of derivatives of the cost function are called direct methods or zero-order methods (because they use up to zeroth order derivatives). These methods use only the cost function values. Axial iteration, also called the univariate method or coordinate descent, is the basis for many of the direct methods. In this method we search along each of the coordinate directions of the x-space, one at a time. Starting with the point x₀, fix the values of all but the first coordinate, reducing the problem to one-dimensional minimization. Solve this problem using any one-dimensional algorithm. Call the resulting point x₁. Then fix the first coordinate at the value so determined and do a similar search along the direction of the second coordinate, giving the point x₂. Continue these one-dimensional searches until each of the N coordinate directions has been searched; the final point of this process is x_N.

The point x_N completes the first cycle of minimization. Repeat this cycle starting from the point x_N instead of x₀. Continue repeating the minimization cycle until the process converges (or until you give up, which may well come first).
The performance of the axial iteration algorithm on most problems is unacceptably poor. The algorithm performs well only when the minimum point along each axis is nearly independent of the values of the other coordinates.

Example 2.2-1  Use axial iteration to minimize J(x,y) = A(x − y)² + B(x + y)². The solution is the trivially obvious (0,0), but the problem is good for illustrating the behavior of algorithms in a simple case. Instead of using a one-dimensional search procedure, we will explicitly solve the one-dimensional subproblems. For any fixed y, obtain the minimizing x coordinate value by setting the derivative to zero

0 = ∂J(x,y)/∂x = 2A(x − y) + 2B(x + y)

giving

x = [(A − B)/(A + B)] y

Similarly, for fixed x, the minimizing y value is

y = [(A − B)/(A + B)] x

We see that for A >> B, the values of x and y descend slowly toward the true minimum at (0,0). Figure (2.2-1) illustrates this behavior on an isocline plot. Note that if A = B (the cost function isocline is circular) the exact minimum is obtained in one cycle, but as A/B increases the performance worsens.

Several modifications to the basic axial iteration method are available to improve its performance. Some of these modifications exploit the notion of the pattern direction, the direction from the beginning point x_iN of a cycle to the end point x_(i+1)N of the same cycle. Figure (2.2-2) illustrates the pattern direction, which tends to point in the general direction of the minimum. Powell's method is the most powerful of the direct methods that search along pattern directions. See the references for details.
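The cycle structure of axial iteration is easy to see in code. The following Python fragment is a minimal sketch of Example (2.2-1), with constants and the starting point chosen by us, using the explicit one-dimensional solutions derived above in place of a general one-dimensional search:

    # Axial iteration on J(x,y) = A*(x-y)**2 + B*(x+y)**2, Example (2.2-1).
    A, B = 10.0, 1.0            # A >> B gives the slow descent described
    x, y = 1.0, 1.0             # starting point (ours)
    k = (A - B) / (A + B)
    for cycle in range(10):
        x = k * y               # one-dimensional search along the x axis
        y = k * x               # one-dimensional search along the y axis
        print(cycle, x, y)

Each full cycle multiplies both coordinates by k², so the closer A/B is to 1, the faster the convergence, in agreement with the isocline argument above.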
2.3 GRADIENT METHODS
Optimization methods that use the first derivative (gradient) of the cost function are called gradient methods or first order methods. Gradient methods require that the cost function be differentiable; most of the cost functions considered in this book meet this requirement. The gradient methods generally converge in fewer iterations than many of the direct methods because the gradient methods use more information in each iteration. (There are exceptions, particularly when comparing simple-minded gradient methods with the most powerful of the direct methods.) The penalty paid for the generally improved performance of the gradient methods compared with the direct methods is the requirement to evaluate the gradient. We define the gradient of the function J(x) with respect to x to be the row vector

∇xJ(x) = [∂J(x)/∂x₁ ... ∂J(x)/∂x_N]   (2.3-1)

(Some texts define it as a column vector; the difference is inconsequential as long as one is consistent.)

A reasonable estimate of the computational cost of evaluating the gradient is N times the cost of evaluating the function. This estimate follows from the fact that the gradient can be approximately evaluated by N finite differences

[∇xJ(x)]ᵢ ≈ [J(x + εeᵢ) − J(x)]/ε   (2.3-2)

where eᵢ is the unit vector along the xᵢ axis and ε is a small number. In special cases, there can be expressions for the gradient that cost significantly less than N function evaluations.

Equation (2.3-2) somewhat obscures the distinction between the gradient methods and the direct methods. We can rewrite any gradient method in a finite difference form that does not explicitly involve gradients. There is, nonetheless, a fairly clear distinction between methods derived from gradient ideas and methods derived from direct search ideas. We will retain this philosophical distinction regardless of whether the gradients are evaluated explicitly or by finite differences.
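As a concrete sketch, the one-sided finite-difference evaluation of Equation (2.3-2) takes only a few lines of Python (the step size and the names are our own; choosing a good step size is itself a nontrivial matter):

    import numpy as np

    def fd_gradient(J, x, eps=1.0e-6):
        # Approximate gradient of J at x by N one-sided finite differences.
        g = np.zeros(len(x))
        J0 = J(x)
        for i in range(len(x)):
            e = np.zeros(len(x))
            e[i] = 1.0                      # unit vector along the x_i axis
            g[i] = (J(x + eps * e) - J0) / eps
        return g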

The method of steepest descent (also called the gradient method) involves a series of one-dimensional searches, as did the axial-iteration method and its variants. In the steepest-descent method, these searches are along the direction of the negative of the gradient vector, evaluated at the current point. The one-dimensional problem is to find the value of λ that minimizes

g(λ) = J(xᵢ + λsᵢ)   (2.3-3)

where sᵢ is the search direction given by

sᵢ = −∇x*J(x)|x=xᵢ   (2.3-4)

The negative of the gradient is the direction of steepest local descent of the cost function (thus the name of the method). To prove this property, first note that for any vector s we have

d/dλ J(x + λs)|λ=0 = (∇x*J(x), s)   (2.3-5)

We are using the (.,.) notation for the inner product

(x,y) = x*y   (2.3-6)

Equation (2.3-5) is a generalization of the definition of the gradient; it applies in spaces where Equation (2.3-1) is not meaningful. We then need only show that, if s is restricted to be a unit vector, Equation (2.3-5) is minimized by choosing s in the direction of −∇x*J(x). This follows immediately from the Cauchy-Schwartz inequality (Luenberger, 1969) of linear algebra.

Theorem 2.3-1 (Cauchy-Schwartz)  (x,y)² ≤ |x|² |y|² with equality if and only if x = ay for some scalar a.

Proof - The theorem is trivial if y = 0. For y ≠ 0 examine

0 ≤ |x + λy|² = |x|² + 2λ(x,y) + λ²|y|²   (2.3-7)

Choose

λ = −(x,y)/|y|²

Substitute into Equation (2.3-7) and rearrange to give

(x,y)² ≤ |x|² |y|²

Equality holds if and only if x + λy = 0 in Equation (2.3-7), which will be true if and only if x = ay (λ will then be −a).

On the surface, the steepest descent property of the method seems to imply excellent performance in minimizing the cost function value. The direction of steepest descent, however, is a local property which might point far from the direction of the global minimum. It is thus often a poor choice of search direction. Direct methods such as Powell's often converge more rapidly than steepest descent.

The steepest descent method performs worst in long narrow valleys of the cost function. It is also sensitive to scaling. These two difficulties are closely related; rescaling a problem can easily create long narrow valleys. The following examples illustrate the scaling and valley difficulties:

Example 2.3-1  Let the cost function be

J(x) = ½(x₁² + x₂²)

The steepest descent method works excellently for this cost function (so does almost every optimization method). The gradient of J(x) is

∇xJ(x) = (x₁, x₂)

Therefore, from any starting point, the negative of the gradient points exactly at the origin, which is the global minimum. The minimum will be attained exactly (or to the accuracy of the one-dimensional search methods used) in one iteration. Figure (2.3-1) illustrates the algorithm starting from the point (1,1)*.

Example 2.3-2  Rescale the preceding example by replacing x₁ by 0.1x₁. (Perhaps we just redefined the units of x₁ to be millimeters instead of centimeters.) The cost function is then

J(x) = ½(0.01x₁² + x₂²)

and the gradient is

∇xJ(x) = (0.01x₁, x₂)

Figure (2.3-2) shows the search direction used by the algorithm starting from the point (10,1)*, which corresponds to the point (1,1)* in the previous example. The search direction points almost 90° from the origin. A careless glance at Figure (2.3-2) invites the conclusion that the minimum in the search direction will be on the x₁ axis and thus that the second iteration of the steepest descent algorithm will attain the minimum. It is true that the minimum is close to the x₁ axis, but it is not exactly on the axis; the distinction makes an important difference in the algorithm's performance. For points x − λ∇x*J(x) along the search direction from any point (x₁,x₂)*, the cost function is

g(λ) = J(x − λ∇x*J(x)) = ½[0.01x₁²(1 − 0.01λ)² + x₂²(1 − λ)²]

The minimum of g(λ) is at

λ̂ = (0.0001x₁² + x₂²)/(0.000001x₁² + x₂²)

and thus the minimum point along the search direction is

(x₁ − 0.01x₁λ̂,  x₂ − x₂λ̂)*

with λ̂ defined as above. The following table and Figure (2.3-3) show several iterations of this process starting from the point (10,1)*.

Iteration      x₁         x₂
    0        10.000     1.0000
    1         9.899    -0.0099
    2         4.900     0.4900
    3         4.851    -0.0049
    4         2.400     0.2402

The trend of the algorithm is clear; every two iterations it moves essentially halfway to the solution. Consider the behavior starting from the point (10,0.1)* instead of (10,1)*:

Iteration      x₁         x₂
    0        10.000     0.1000
    1         9.802    -0.0980
    2         9.608     0.0961
    3         9.418    -0.0942
    4         9.231     0.0923

This behavior, plotted in Figure (2.3-4), is abysmal. The algorithm is bouncing back and forth across the valley, making little progress toward the minimum.
Several modifications to the steepest descent method are available to improve its performance. A rescaling step to eliminate valleys caused by scaling yields major improvements for some problems. The method of parallel tangents (PARTAN method) exploits pattern directions similar to those discussed in Section 2.2; searches in such pattern directions are often called acceleration steps. The conjugate gradient method is the most powerful of the modifications to steepest descent. The references discuss these and other gradient algorithms in detail.

2.4 SECOND ORDER METHODS

Optimization methods that use the second derivative (or an approximation to it) of the cost function are called second order methods. These methods require that the first and second derivatives of the cost function exist.

2.4.1 Newton-Raphson

The Newton-Raphson optimization algorithm (also called Newton's method) is the basis for all of the second order methods. The idea of this algorithm is to approximate the cost function by the first three terms of its Taylor series expansion about the current point

J₁(x) = J(xᵢ) + ∇xJ(xᵢ)(x − xᵢ) + ½(x − xᵢ)*[∇x²J(xᵢ)](x − xᵢ)   (2.4-1)

From a geometric viewpoint, this equation describes the paraboloid that best approximates the function near xᵢ. Equating the gradient of J₁(x) to zero gives an equation for the minimum point of the approximating function. Taking this gradient, note that ∇xJ(xᵢ) and ∇x²J(xᵢ) are evaluated at the fixed point xᵢ and thus are not functions of x.

∇xJ₁(x) = ∇xJ(xᵢ) + (x − xᵢ)*[∇x²J(xᵢ)]   (2.4-2)

The solution is

x = xᵢ − [∇x²J(xᵢ)]⁻¹∇x*J(xᵢ)   (2.4-3)

If the second gradient of J is positive definite, then Equation (2.4-3) gives the exact unique minimum of the approximating function; it is a reasonable guess at an approximate minimum of the original function. If the second gradient is not positive definite, then the approximating function does not have a unique minimum and the algorithm is likely to perform poorly. The Newton-Raphson algorithm uses Equation (2.4-3) iteratively; the x from this equation is the starting point for the next iteration. The algorithm is

xᵢ₊₁ = xᵢ − [∇x²J(xᵢ)]⁻¹∇x*J(xᵢ)   (2.4-4)
The performance of this algorithm in the close neighborhood of a strict local minimum is unexcelled; this performance represents an ideal toward which other algorithms strive. The Newton-Raphson algorithm attains the exact (except for numerical round-off errors) minimum of any positive-definite quadratic function in a single iteration. Convergence within 5 to 10 iterations is common on some practical nonquadratic problems with several dozen dimensions; direct and gradient methods typically count iterations in hundreds and thousands for such problems and settle for less accurate answers. See the references for analyses of convergence characteristics.

Three negative features of the Newton-Raphson algorithm balance its excellent convergence near the minimum. First is the behavior of the algorithm far from the minimum. If the initial estimate is far from the minimum, the algorithm often converges erratically or even diverges. Such problems are often associated with second gradient matrices that are not positive definite. Because of this problem, it is common to use special start-up procedures to get within the area where Newton-Raphson performs well. One such procedure is to start with a gradient method, switching to Newton-Raphson near the minimum. There are many other start-up procedures, and they play a key role in successful applications of the Newton-Raphson algorithm.

The second negative feature of the Newton-Raphson method is the computational cost and complexity of evaluating the second gradient matrix. The magnitude of this difficulty varies widely among applications. In some special cases the second gradient is little harder to compute than the first gradient; Newton-Raphson, perhaps with a start-up procedure, is a good choice for such applications. If, at the other extreme, you are reduced to finite-difference computation of the second gradient, Davidon-Fletcher-Powell (Section 2.4.4) is probably a more appropriate algorithm. In evaluating the computational burden of Newton-Raphson and other methods, remember that Newton-Raphson requires no one-dimensional searches. Equation (2.4-4) constitutes the entire algorithm. The one-dimensional searches required by most other algorithms can account for a majority of their computational cost.

The third negative feature of the Newton-Raphson algorithm is the necessity to invert the second gradient matrix (or at least to solve the set of linear equations involving the matrix). The computer time required for the inversion is seldom an issue; this time is typically small compared to the time required to evaluate the second gradient. Furthermore, the algorithm converges quickly enough that if one linear system solution per iteration is a large fraction of the total cost, then the total cost must be low, even if the linear system is on the order of 100-by-100. The crucial issue concerning the inversion of the second gradient is the possibility that the matrix could be singular or ill-conditioned. We will discuss singularities in Section 2.4.3.
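For reference, Equation (2.4-4) is as compact in code as the discussion suggests. The following Python sketch is ours (including the simple convergence test); it has no start-up safeguards, so it inherits the far-from-minimum weaknesses just described. It assumes user-supplied functions returning the first and second gradients:

    import numpy as np

    def newton_raphson(grad, hess, x, iterations=20, tol=1.0e-10):
        # Iterate Equation (2.4-4): x <- x - inv(second gradient) * gradient.
        for _ in range(iterations):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break                            # gradient essentially zero
            x = x - np.linalg.solve(hess(x), g)  # solve; do not invert
        return x

Solving the linear system instead of explicitly inverting the matrix reflects the remark above that the per-iteration linear-system solution is rarely the dominant cost.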
2.4.2 Invariance
The Newton-Raphson algorithm has far less difficulty with long narrow valleys of the cost function than does the steepest-descent method. This difference is related to an invariance property of the Newton-Raphson algorithm. Invariance of minimization algorithms is a useful concept which many texts mention briefly, if at all. We will therefore elaborate somewhat on the subject.

The examples in the section on steepest descent illustrate a strong link between scaling and narrow valleys. Scaling changes can easily create such valleys. Therefore we can generally state that minimization methods that are sensitive to scaling changes are likely to behave poorly in narrow valleys. This reasoning suggests a simple criterion for evaluating optimization algorithms: a good optimization algorithm should be invariant under scaling changes. This principle is almost so self-evident as to be unworthy of mention. The user of a program would be justifiably disgruntled if an algorithm that worked in the English Gravitational System (Imperial System) of units failed when applied to the same problem expressed in metric units (or vice versa). Someone trying to duplicate reported results would be perplexed by data published in metric units which could be duplicated only by converting to English Gravitational System units, in which the computation was really done. Nonetheless, many common algorithms, including the steepest descent method, fail to exhibit invariance under scaling. The criterion is neither necessary nor sufficient. It is easy to construct ridiculous algorithms that are invariant to scale changes (such as the algorithm that always returns the value zero), and scale-sensitive algorithms like the steepest descent method have achieved excellent results in some applications. It is safe to state, however, that you can usually improve a good scale-sensitive algorithm by making it scale-invariant. An initial step that rescales the problem can effectively make the steepest-descent method scale-invariant (although such a step destroys a different invariance property of the steepest-descent method: invariance under rotation of coordinates). Rescaling a problem can be done manually by the user, or it can be an automatic part of an algorithm; automatic rescaling has the obvious advantage of being easier for the user, and a secondary advantage of allowing dynamic scaling changes as the algorithm proceeds.

We can extend the idea of invariance beyond scale changes. In general, we would like an algorithm to be invariant under the largest possible set of transformations. A justification for this criterion is that almost any complicated minimization problem can be expressed as some transformation (possibly quite complicated) of a simpler problem. We can sometimes use such transformations to simplify the solution of the original problems. Often it is more difficult to do the transformation than to solve the original optimization problem. Even if we cannot do the transformations, we can use the concept to conclude that an optimization algorithm invariant over a large class of transformations is likely to work on a large class of problems. The Newton-Raphson algorithm is invariant under all invertible linear transformations. This is the widest invariance property that we can usually achieve.

The scale-invariance of the Newton-Raphson algorithm can be partially nullified by poor choice of matrix inversion (or linear system solution) algorithms. We have assumed exact arithmetic in the preceding discussion of scale-invariance. Some matrix inversion routines are sensitive to scaling effects. Inversion based on Cholesky factorization (Wilkinson, 1965, and Acton, 1970) is a good, easily implemented method for symmetric matrices (the second gradient is always symmetric), and is insensitive to scaling. Alternatively, prescale the matrix by using its diagonal elements.

2.4.3 Singularities

The second gradient matrix used in the Newton-Raphson algorithm is positive definite in a region near a strict local minimum. Ideally, the start-up procedure will reach such a region, and the Newton-Raphson algorithm will then converge without needing to contend with singularities. This viewpoint is overly optimistic; singular or ill-conditioned matrices (the difference is largely academic) arise in many situations. In the following discussion, we discount the effects of scaling. Matrices that have large condition numbers because of scaling do not represent intrinsically ill-conditioned problems, and do not require the techniques discussed in this section. In some situations, the second gradient matrix is exactly singular for all values of x; two columns (and rows) are identical or a column (and corresponding row) is zero. These simple singularities occur regularly even in complex nonlinear problems. They often result from errors in the problem formulation, such as minimizing with respect to a parameter that is irrelevant to the cost function.

In the more general case, the second gradient is singular (or ill-conditioned) at some points but not at others. Whenever we use the term singular in the following discussion, we implicitly mean singular or ill-conditioned. Because of this definition, there will be vaguely defined regions of singularity rather than isolated points. The consequences of singularities are different depending on whether or not they are near the minimum.
Singularities far from the minimum pose no basic theoretical difficulties. There are several practical methods for handling such singularities. One method is to use a gradient algorithm (or any other algorithm unaffected by such singularities) until x is out of the region of singularity. We can also use this method if the second gradient matrix has negative eigenvalues, whether the matrix is ill-conditioned or not. If the matrix has negative eigenvalues, the Newton-Raphson algorithm is likely to behave poorly. (It could even converge to a local maximum.) The second gradient is always positive semi-definite in a region around a local minimum, so negative eigenvalues are only a consideration away from the minimum.

Another method of handling singularities is to add a small positive definite matrix to the second gradient before inversion. We can also use this method to handle negative eigenvalues, if the added matrix is large enough. This method is closely related to the previous suggestion of using a gradient algorithm. If the added matrix is a large constant times an identity matrix, the Newton-Raphson algorithm, so modified, gives a small step in the negative gradient direction. For small constants, the algorithm has characteristics between those of steepest descent and Newton-Raphson. The computational cost of this method is high; in essence, we are getting performance like steepest descent while paying the computational cost of Newton-Raphson. Even small additions to the second derivative matrix can dramatically change the convergence behavior of the Newton-Raphson algorithm. We should therefore discontinue this modification when out of the region of singularity. The advantage of this method is its simplicity; excluding the tests for when the matrix is ill-conditioned, this modification can be done in two short lines of FORTRAN code.

The last method is to use a pseudo-inverse (rank-deficient solution). Penrose (1955), Aoki (1967), Luenberger (1969), Wilkinson and Reinsch (1971), Moler and Stewart (1973), and Garbow, Boyle, Dongarra, and Moler (1977) discuss pseudo-inverses in detail. The basic idea of the pseudo-inverse method is to ignore the directions in the x-space corresponding to zero eigenvalues (within some tolerance) of the second gradient. In the parameter estimation context, such directions represent parameters, or combinations of parameters, about which the data give little information. Lacking any information to the contrary, the method leaves such parameter combinations unchanged from their initial values. The pseudo-inverse method does not address the problem of negative eigenvalues, but it is popular in a large class of applications where negative eigenvalues are impossible. The method is easy to implement, being only a rewrite of the matrix-inversion or linear-system-solution subroutine.
It also has a useful property absent from the other proposed methods; it does not affect the Newton-Raphson algorithm when the matrix is well-conditioned. Therefore one can freely apply this method without testing whether it is needed. (It is true that condition tests in some form are part of a pseudo-inverse algorithm, but such tests are at a lower level, contained within the pseudo-inverse subroutine.)

Singularities near the minimum require special consideration. The excellent convergence of Newton-Raphson near the minimum is the primary reason for using the algorithm. If we significantly slow the convergence near the minimum, there is little argument for using Newton-Raphson. The use of a pseudo-inverse can handle singularities while maintaining the excellent convergence; the pseudo-inverse is thus an appropriate tool for this purpose.

Although pseudo-inverses handle the computational problems, singularities near the minimum also raise theoretical and application issues. Such a singularity indicates that the minimum point is poorly defined. The cost function is essentially flat in at least one direction from the minimum, and the minimum value of the cost function might be attained to machine accuracy by widely separated points. Although the algorithm converges to a minimum point, it might be the wrong minimum point if the minimum is flat. If the only goal is to minimize the cost function, any minimizing point might be acceptable. In the applications of this book, minimizing the cost function is only a means to an end; the desired output is the value of x. If multiple solutions exist, the problem statement is incomplete or faulty.

We strongly advise avoiding the routine use of pseudo-inverses or other computational machinations to "solve" uniqueness problems. If the basic problem statement is faulty, no numerical trick will solve it. The pseudo-inverse works by changing the problem statement of the inversion, adding the stipulation that the inverse have minimum norm. The interpretation of this stipulation is vague in the context of the optimization problem (unless the cost function is quadratic, in which case it specifies the solution nearest the starting point). If this stipulation is a reasonable addition to the problem statement, then the pseudo-inverse is an appropriate tool. This decision can have significant effects. For a nonquadratic cost function, for example, there might be large differences in the solution point, depending on small changes in the starting point, the data, or the algorithm. The pseudo-inverse can be a good diagnostic tool for getting the information needed to revise the problem statement, but one should not depend upon it to solve the problem autonomously. The analyst's strong point is in formulating the problem; the computer's strength is in crunching numbers to arrive at the solution. A failure in either role will compromise the validity of the solution. This statement is but a rephrasing of the computer cliche "garbage in, garbage out," which has been said many more times than it has been heard.
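As an illustration of the mechanics only (not a substitute for the cited treatments), the following Python sketch forms a pseudo-inverse (rank-deficient) Newton step from the eigendecomposition of the symmetric second gradient; the relative tolerance and the names are ours. Directions with eigenvalues below the tolerance are ignored, leaving the corresponding parameter combinations unchanged, and a well-conditioned matrix is treated exactly as by an ordinary inverse:

    import numpy as np

    def pseudo_inverse_step(H, g, rtol=1.0e-8):
        # Newton step -pinv(H) g for symmetric H, ignoring directions
        # whose eigenvalues are negligible relative to the largest.
        lam, V = np.linalg.eigh(H)
        keep = np.abs(lam) > rtol * np.abs(lam).max()
        H_pinv = (V[:, keep] / lam[keep]) @ V[:, keep].T
        return -H_pinv @ g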

2.4.4 Quasi-Newton Methods

Quasi-Newton methods are intended for problems where explicit evaluation of the second gradient of the cost function is complicated or costly, but the performance of the Newton-Raphson algorithm is desired. These methods form approximations to the second-gradient matrix using the first-gradient values from several iterations. The approximation to the second gradient then substitutes for the exact second gradient in Equation (2.4-4). Some of the methods directly form approximations of the inverse of the second-gradient matrix, avoiding the cost and some of the problems of matrix inversion.

Note that as long as the approximation to the second-gradient matrix is positive definite, Equation (2.4-4) can never converge to any point with a nonzero first gradient. Therefore approximations to the second gradient, no matter how poor, cannot affect the solution point. The approximations can greatly change the speed of convergence and the area of acceptable starting values. Approximations to the first gradient would affect the solution point as well. The steepest descent method can be considered as the crudest of the quasi-Newton methods, using a constant times the identity matrix as the approximation to the second gradient. The performance of the quasi-Newton methods approaches that of Newton-Raphson as the approximation to the second gradient improves. The Davidon-Fletcher-Powell method (variable metric method) is the most popular quasi-Newton method. See the references for discussions of these methods.
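As a flavor of how such approximations are built, the following Python sketch gives the standard Davidon-Fletcher-Powell rank-two update of the approximate inverse second gradient (the names are ours; the surrounding iteration and one-dimensional search are omitted):

    import numpy as np

    def dfp_update(B, s, y):
        # Update the approximate inverse second gradient B from the
        # parameter step s and the gradient change y between iterations.
        By = B @ y
        return B + np.outer(s, s) / (s @ y) - np.outer(By, By) / (y @ By)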
2.5 SUMS OF SQUARES
The algorithms discussed in the previous sections are generally applicable to any minimization problem. By tailoring algorithms to special characteristics of specific problem classes, we can often achieve far better performance than by using the general purpose algorithms. Many of the cost functions arising in estimation problems have the form of sums of squares. The general sums-of-squares form is

J(x) = Σᵢ fᵢ(x)* Wᵢ fᵢ(x)   (2.5-1)

The fᵢ are vector-valued functions of x, and the Wᵢ are weightings. To simplify some of the formulae, we assume that the Wᵢ are symmetric. This assumption does not really restrict the application because we can always substitute ½(Wᵢ + Wᵢ*) for nonsymmetric Wᵢ without changing the function values. In most applications, the Wᵢ are positive semi-definite; this is not a requirement, but we will see that it helps ensure that the stationary points encountered are local minima. The form of Equation (2.5-1) is common enough to merit special study.

The summation sign in Equation (2.5-1) is somewhat superfluous in that any function in the form of Equation (2.5-1) can be rewritten in an equivalent form without the summation sign. This can be done by concatenating the individual fᵢ(x) vectors into a single, longer f(x) vector and making a corresponding large W matrix with the Wᵢ matrices on diagonal blocks. The only difference is in the notation. We choose the longer notation with the summation sign because it more directly corresponds with the way many parameter estimation problems are naturally phrased.

Several of the algorithms discussed in the previous two sections work well with the form of Equation (2.5-1). For any reasonable fi functions, Equation (2.5-1) defines a cost function that is well approximated by quadratics over fairly large regions. Since many of the general minimization schemes are based on quadratic approximations, application of these schemes to Equation (2.5-1) is natural. This statement does not imply that there are never problems minimizing Equation (2.5-1); the problems are sometimes severe, but the odds of success with reasonable effort are much better than they are for arbitrary cost function forms. Although the general methods are usable, we can exploit the problem structure to do better.

2.5.1 Linear Case

If the fi functions in Equation (2.5-1) are linear, then the cost function is exactly quadratic and we can express the minimum point in closed form. In particular, let the fi be the arbitrary linear functions

$$ f_i(x) = A_i x - b_i \qquad (2.5-2) $$

Equation (2.5-1) then becomes

$$ J(x) = \frac{1}{2} \sum_i (A_i x - b_i)^* W_i (A_i x - b_i) \qquad (2.5-3) $$

Equating the gradient of Equation (2.5-3) to zero gives

$$ \sum_i A_i^* W_i (A_i x - b_i) = 0 \qquad (2.5-4) $$

Solving for x gives

$$ \hat{x} = \left( \sum_i A_i^* W_i A_i \right)^{-1} \sum_i A_i^* W_i b_i \qquad (2.5-5) $$

assuming that the inverse exists. If the inverse exists, then Equation (2.5-5) gives the only stationary point of Equation (2.5-3). This stationary point must be a minimum if all the Wi are positive semi-definite, and it must be a maximum if all the Wi are negative semi-definite. (We leave the straightforward proofs as an exercise.) If the Wi meet neither of these conditions, the stationary point can be a minimum, a maximum, or a saddle point.
If the inverse in Equation (2.5-5) does not exist, then there is a line (at least) of solutions to Equation (2.5-4). All of these points are stationary points of the cost function. Use of a pseudo-inverse will produce the solution with minimum norm, but this is usually a poor idea (see Section 2.4.3).
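A minimal numerical sketch of Equation (2.5-5) (Python with numpy; the helper name and the two-measurement test case are our own assumptions, not from the text):

import numpy as np

def linear_wls(A_list, b_list, W_list):
    # Closed-form minimizer for f_i(x) = A_i x - b_i, Equation (2.5-5).
    n = A_list[0].shape[1]
    M = np.zeros((n, n))
    v = np.zeros(n)
    for A, b, W in zip(A_list, b_list, W_list):
        M += A.T @ W @ A
        v += A.T @ W @ b
    return np.linalg.solve(M, v)     # assumes the inverse exists

# Two scalar measurements of the same unknown, with weights 1 and 3:
A = [np.array([[1.0]]), np.array([[1.0]])]
b = [np.array([2.0]), np.array([4.0])]
W = [np.eye(1), 3.0 * np.eye(1)]
print(linear_wls(A, b, W))           # [3.5], the weighted average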

2.5.2 Nonlinear Case

If the fi are nonlinear, there is no simple, closed-form solution like Equation (2.5-5). A natural question in such situations, in which there is an easy method to handle linear equations, is whether we can merely linearize the nonlinear equations and use the linear methodology. Such linearization does not give an acceptable closed-form solution to the current problem, but it does form the basis for an iterative method. Define the linearization of fi about any point xj as

$$ f_i^{(j)}(x) = A_i^{(j)} x - b_i^{(j)} \qquad (2.5-6) $$

where

$$ A_i^{(j)} = \nabla_x f_i(x_j) \qquad b_i^{(j)} = \nabla_x f_i(x_j)\,x_j - f_i(x_j) \qquad (2.5-7) $$

Equation (2.5-5), with the A_i^{(j)} and b_i^{(j)} substituted for Ai and bi, gives the stationary point of the cost with the linearized fi functions. This point is not, in general, a solution to the nonlinear problem. If, however, xj is close to the solution, then Equation (2.5-5) should give a point closer to the solution, because the linearization will give a good representation of the cost function in the region around xj. The iterative algorithm resulting from this concept is as follows: First, choose a starting value x1. The closer x1 is to the correct solution, the better the algorithm is likely to work. Then define revised xj values by

$$ x_{j+1} = x_j - \left[ \sum_i \nabla_x f_i(x_j)^* W_i \nabla_x f_i(x_j) \right]^{-1} \sum_i \nabla_x f_i(x_j)^* W_i f_i(x_j) \qquad (2.5-8) $$

This equation comes from substituting Equation (2.5-7) into Equation (2.5-5) and simplifying. Iterate Equation (2.5-8) until it converges by some criterion, or until you give up. This method is often called quasi-linearization because it is based on linearization not of the cost function itself, but of factors in the cost function.

We made several vague, unsupported statements in the process of deriving this algorithm. We now need to analyze the algorithm's performance and compare it with the performance of the algorithms discussed in the previous sections. This task is greatly simplified by noting that Equation (2.5-8) defines a quasi-Newton algorithm. To show this, we can write the first and second gradients of Equation (2.5-1):

$$ \nabla_x J(x) = \sum_i \left[ \nabla_x f_i(x) \right]^* W_i f_i(x) \qquad (2.5-9) $$

$$ \nabla_x^2 J(x) = \sum_i \left[ \nabla_x f_i(x) \right]^* W_i \left[ \nabla_x f_i(x) \right] + \sum_i \left[ \nabla_x^2 f_i(x) \right]^* W_i f_i(x) \qquad (2.5-10) $$

(We have not previously introduced the definition of the second gradient of a vector, as in the ∇ₓ²fi(x) above. The result is technically a tensor, but we will not need to consider it in detail here.) Comparing Equation (2.5-8) with Equations (2.4-4), (2.5-9), and (2.5-10), we see that the only difference between quasi-linearization and Newton-Raphson is that quasi-linearization has dropped the second term in Equation (2.5-10). Quasi-linearization is thus a quasi-Newton method using

$$ \sum_i \left[ \nabla_x f_i(x) \right]^* W_i \left[ \nabla_x f_i(x) \right] \qquad (2.5-11) $$

as an approximation for the second gradient. The algorithm in this form is also known as Gauss-Newton, the term we will adopt in this book.
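A compact sketch of the Gauss-Newton iteration, Equation (2.5-8) (Python with numpy; the exponential-fit test problem and all names are our own illustrative assumptions, not an example from the text):

import numpy as np

def gauss_newton(f, jac, W, x0, iters=50):
    # Iterate Equation (2.5-8) for a single term: the second gradient is
    # approximated by Equation (2.5-11), using only first derivatives of f.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = f(x)                     # f(x_j)
        A = jac(x)                   # gradient of f at x_j
        M = A.T @ W @ A              # Gauss-Newton approximation
        x = x - np.linalg.solve(M, A.T @ W @ r)
    return x

# Fit y = a exp(b t) to noise-free data with true (a, b) = (2, -1.5):
t = np.linspace(0.0, 1.0, 10)
y = 2.0 * np.exp(-1.5 * t)
f = lambda x: x[0] * np.exp(x[1] * t) - y
jac = lambda x: np.column_stack([np.exp(x[1] * t),
                                 x[0] * t * np.exp(x[1] * t)])
print(gauss_newton(f, jac, np.eye(len(t)), [1.0, -1.0]))  # -> [2.0, -1.5]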

Near the solution, the neglected term of the second gradient is generally small. Section 5.4.3 outlines this argument as it applies to the parameter estimation problem. Therefore, Gauss-Newton approaches the excellent performance of Newton-Raphson near the solution. Such approximation is the main goal of quasi-Newton methods. Accurately approximating the performance of Newton-Raphson far from the minimum is not of great concern because Newton-Raphson does not generally perform well in regions far from the minimum.

We can even argue that Gauss-Newton sometimes performs better than Newton-Raphson far from the minimum. The worst problems with Newton-Raphson occur when the second-gradient matrix has negative eigenvalues; Newton-Raphson can then go in the wrong direction, possibly converging to a local maximum or diverging. If all of the Wi are positive semi-definite (which is usually the case), then the second-gradient approximation given by Equation (2.5-11) is positive semi-definite for all x. A positive semi-definite second-gradient approximation does not guarantee good behavior, but it surely helps; negative eigenvalues virtually guarantee problems. Thus we can heuristically argue that Gauss-Newton should perform better than Newton-Raphson. We will not attempt a detailed support of this general argument in this book. In several specific cases the improvement of Gauss-Newton over Newton-Raphson is easily demonstrable.

Although Gauss-Newton sometimes performs better than Newton-Raphson far from the solution, it has many of the same basic start-up problems. Both algorithms exhibit their best performance near the minimum. Therefore, we will often need to begin with some other, more stable algorithm, changing to Gauss-Newton as we near the minimum.

The real argument in favor of Gauss-Newton over Newton-Raphson is the lower computational effort and complexity of Gauss-Newton. Any performance improvement is a coincidental side benefit. Equation (2.5-11) involves only first derivatives of fi(x). These first derivatives are also used in Equation (2.5-9) for the first gradient of the cost. Therefore, after computing the first gradient of J, the only significant computation remaining for the Gauss-Newton approximation is the matrix multiplication in Equation (2.5-11). The computation of the Gauss-Newton approximation for the second gradient can sometimes take less time than the computation of the first gradient, depending on the system dimensions. For complicated fi functions, evaluation of the ∇ₓ²fi(x) in Equation (2.5-10) is a major portion of the computation effort of the full Newton-Raphson algorithm. Gauss-Newton avoids this extra effort, obtaining the performance per iteration of Newton-Raphson (if not better in some areas) with computational effort per iteration comparable to gradient methods. Considering the cost of the one-dimensional searches required by gradient methods, Gauss-Newton can even be cheaper per iteration than gradient methods. The exact trade-off depends on the relative costs of evaluating the fi and their gradients, and on the typical number of evaluations required in the one-dimensional searches. Gauss-Newton is at its best when the cost of evaluating the fi is nearly as much as the cost of evaluating both the fi and their gradients, due to high overhead costs common to both evaluations. This is exactly the case in some aircraft applications, where the overhead consists largely of dimensionalizing the derivatives and building new system matrices at each time point.

The other quasi-Newton methods, such as Davidon-Fletcher-Powell, also approach Newton-Raphson performance without evaluating the second derivatives of the fi. These methods, however, do require one-dimensional searches. Gauss-Newton stands almost alone in avoiding both second-derivative evaluations and one-dimensional searches. This performance is difficult to match in general algorithms that do not take advantage of the special structure of the cost function.

Some analysts (Foster, 1983) introduce one-dimensional line searches into the Gauss-Newton algorithm to improve its performance. The utility of this idea depends on how well the Gauss-Newton method is performing. In most of our experience, Gauss-Newton works well enough that the one-dimensional line searches cannot measurably improve performance; the total computation time can well be larger with the line searches. When the Gauss-Newton algorithm is performing poorly, however, such line searches could help stabilize it.
For cost functions in the form of Equation (2.5-1), the cost/performance ratio of Gauss-Newton is so much better than that of most other algorithms that Gauss-Newton is the clearly preferred algorithm. You may want to modify Gauss-Newton for specific problems, and you will almost surely need to use some special start-up algorithm, but the best methods will be based on Gauss-Newton.

2.6 CONVERGENCE IMPROVEMENT

Second-order methods can and do converge quite rapidly in regions where they work well. There is usually such a region around the minimum point; the size of the region is problem-dependent. The price paid for this region of excellent convergence is that the second-order methods often converge poorly or diverge in regions far from the minimum. Techniques to detect and remedy such convergence problems are an important part of the practical implementation of second-order methods. In this section, we briefly list a few of the many convergence-improvement techniques.

Modifications to improve the behavior of second-order methods in regions far from the minimum almost inevitably slow the convergence in the region near the minimum. This reflects a natural trade-off between speed and reliability of convergence. Therefore, effective implementation of convergence-improvement techniques usually includes different treatment of regions far from the minimum and near the minimum. In regions far from the minimum, the second-order methods are modified or abandoned in favor of more conservative algorithms. In regions near the minimum, there is a transition to the fast second-order methods. The means of determining when to make such transitions vary widely. Transitions can be based on a simple iteration count, on adaptive criteria which examine the observed convergence behavior, or on other principles. Transitions can be either gradual or step changes.

Some convergence-improvement techniques abandon second-order methods in the regions far from the minimum, adopting gradient methods instead. In our experience, the pure gradient method is too slow for practical use on most parameter estimation problems. Accelerated gradient methods such as PARTAN and conjugate gradient are reasonable possibilities.
Other convergence-improvement techniques are modifications of the second-order methods. Many convergence problems relate to ill-conditioned or nonpositive second-gradient matrices. This suggests such modifications as adding positive definite matrices to the second gradient (a device sketched at the end of this section) or using rank-deficient solutions. Constraints on the allowable range of estimates or on the change per iteration can also have stabilizing effects. A particularly popular constraint is to fix some of the ordinates at constant values, thus reducing the dimension of the optimization problem; this is a form of axial iteration, and its effectiveness depends on a wise (or lucky) choice of the ordinates to be constrained. Relaxation methods, which reduce the indicated parameter changes by some fixed percentage, can sometimes stabilize oscillating behavior of the algorithm. Line searches in the indicated direction extend this concept, and should be capable of stabilizing almost any problem, at the cost of additional function evaluations.

The above list of convergence-improvement techniques is far from complete. It also omits mention of numerous important implementation details. This list serves only to call attention to the area of convergence improvement. See the references for more thorough treatments.
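A sketch of the positive-definite-modification idea mentioned above (Python with numpy; the damping value and test matrices are our own assumptions; this device is often associated with the Levenberg-Marquardt method):

import numpy as np

def damped_step(grad1, grad2, lam):
    # Add lam * identity to the second gradient before solving for the step.
    return -np.linalg.solve(grad2 + lam * np.eye(len(grad1)), grad1)

# An indefinite second gradient: the raw second-order step moves uphill
# along the second axis; damping restores a descent direction.
grad2 = np.array([[4.0, 0.0],
                  [0.0, -1.0]])
grad1 = np.array([1.0, 1.0])
print(damped_step(grad1, grad2, 0.0))  # [-0.25, 1.0]: uphill in x[1]
print(damped_step(grad1, grad2, 2.0))  # [-0.167, -1.0]: a descent direction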

[Figures for Chapter 2]

Figure (2.0-1). Illustration of local and global minima.

Figure (2.2-1). Behavior of axial iteration.

Figure (2.2-2). The pattern direction.

Figure (2.3-1). The gradient direction for a circular isocline.

Figure (2.3-2). The gradient direction near a narrow valley.

Figure (2.3-3). Behavior of the gradient algorithm in a narrow valley.

Figure (2.3-4). Worse behavior of the gradient algorithm.
CHAPTER 3

3.0 BASIC PRINCIPLES FROM PROBABILITY

In this chapter we will review some basic definitions and results from probability theory. We presume that the reader has had previous exposure to this material. Our aim here is to review and serve as a reference for those concepts that are used extensively in the following chapters. The treatment, therefore, is quite abbreviated, and devotes little time to motivating the field of study or philosophizing about the results. Proofs of several of the statements are omitted. Some of the other proofs are merely outlined, with some of the more tedious steps omitted. Apostol (1969), Ash (1970), and Papoulis (1965) give more detailed treatment.

3.1 PROBABILITY SPACES

3.1.1 Probability Triple

A probability space is formally defined by three items (Ω, ℬ, P), sometimes called the probability triple. Ω is called the sample space, and the elements ω of Ω are called outcomes or realizations. ℬ is a set of sets defined on Ω, closed under countable set operations (union, intersection, and complement). Each set B ∈ ℬ is called an event. In the current discussion, we will not be concerned with the fine details of the definition of ℬ; ℬ is referred to as the class of measurable sets and is studied in measure theory (Royden, 1968; Rudin, 1974). P is a scalar-valued function defined on ℬ, and is called the probability function or probability measure. For each set B in ℬ, the function P(B) defines the probability that ω will be in B. P must satisfy the following axioms:

1) 0 ≤ P(B) ≤ 1 for all B ∈ ℬ

2) P(Ω) = 1

3) P(∪ᵢ Bᵢ) = Σᵢ P(Bᵢ) for all countable sequences of disjoint Bᵢ ∈ ℬ

3.1.2 Conditional Probability

If A and B are two events and P(B) ≠ 0, the conditional probability of A given B is defined as

$$ P(A|B) = P(AB)/P(B) \qquad (3.1-1) $$

where AB is the set intersection of the events A and B.

The events A and B are statistically independent if P(A|B) = P(A). Note that this condition is symmetric; that is, if P(A|B) = P(A), then P(B|A) = P(B), provided that P(A|B) and P(B|A) are both defined.

3.2 SCALAR RANDOM VARIABLES

A scalar real-valued function X(ω) defined on Ω is called a random variable if the set {ω : X(ω) ≤ x} is in ℬ for all real x.

3.2.1 Distribution and Density Functions

Every random variable has a distribution function defined as follows:

$$ F_X(x) = P(\{\omega : X(\omega) \le x\}) \qquad (3.2-1) $$

It follows directly from the properties of a probability measure that F_X(x) must be a nondecreasing function of x, with F_X(-∞) = 0 and F_X(∞) = 1. By the Lebesgue decomposition lemma (Royden, 1968, p. 240; Rudin, 1974, p. 129), any distribution function can always be written as the sum of a differentiable component and a component which is piecewise constant with a countable number of discontinuities. In many cases, we will be concerned with variables with differentiable distribution functions. For such random variables, we define a function, p_X(x), called the probability density function, to be the derivative of the distribution function:

$$ p_X(x) = \frac{d}{dx} F_X(x) \qquad (3.2-2) $$

We have also the inverse relationship

$$ F_X(x) = \int_{-\infty}^{x} p_X(s)\, ds \qquad (3.2-3) $$

A probability density function must be nonnegative, and its integral over the real line must equal 1. For simplicity of notation, we will often shorten p_X(x) to p(x) where the meaning is clear. Where confusion is possible, we will retain the longer notation. A probability distribution can be defined completely by giving either the distribution function or the density function. We will work mainly with density functions, except when they are not defined.

3.2.2 Expectations and Moments

The expected value of a random variable, X, is defined by

$$ E\{X\} = \int_{-\infty}^{\infty} x\, p_X(x)\, dx \qquad (3.2-4) $$

If X does not have a density function, the precise definition of the expectation is somewhat more technical, involving a Stieltjes integral; Equation (3.2-4) is adequate for the needs of this document. The expected value is also called the expectation or the mean. Any (measurable) function of a random variable is also a random variable and

$$ E\{f(X)\} = \int_{-\infty}^{\infty} f(x)\, p_X(x)\, dx \qquad (3.2-5) $$

The expected value of Xⁿ for positive n is called the nth moment of X. Under mild conditions, knowledge of all of the moments of a distribution is sufficient to define the distribution (Papoulis, 1965, p. 158).
The variance of X is defined as

$$ \mathrm{var}(X) \equiv E\{(X - E\{X\})^2\} = E\{X^2\} - 2E\{X\}E\{X\} + E\{X\}^2 = E\{X^2\} - E\{X\}^2 \qquad (3.2-6) $$

The standard deviation is the square root of the variance.


3.3 JOINT RANDOM VARIABLES

Two random variables defined on the same sample space are called joint random variables.

3.3.1 Distribution and Density Functions

If two random variables, X and Y, are defined on the same sample space, we define a joint distribution function of these variables as

$$ F_{X,Y}(x,y) = P(\{\omega : X(\omega) \le x\} \cap \{\omega : Y(\omega) \le y\}) \qquad (3.3-1) $$

For absolutely continuous distribution functions, a joint probability density function p_{X,Y}(x,y) is defined by the partial derivative

$$ p_{X,Y}(x,y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x,y) \qquad (3.3-2) $$

We then have also

$$ F_{X,Y}(x,y) = \int_{-\infty}^{x} \int_{-\infty}^{y} p_{X,Y}(s,t)\, dt\, ds \qquad (3.3-3) $$

In a similar manner, joint distributions and densities of N random variables can be defined. As in the scalar case, the joint density function of N random variables must be nonnegative and its integral over the entire space must equal 1. A random N-vector is the same as N jointly random scalar variables, the only difference being in the terminology.

3.3.2 Expectations and Moments

The expected value of a random vector X is defined as in the scalar case:

$$ E\{X\} = \int x\, p_X(x)\, dx \qquad (3.3-4) $$

The covariance of X is a matrix defined by

$$ \mathrm{cov}(X) = E\{[X - E\{X\}][X - E\{X\}]^*\} \qquad (3.3-5) $$

The covariance matrix is always symmetric and positive semi-definite. It is positive definite if X has a density function. Higher-order moments of random vectors can be defined, but are notationally clumsy and seldom used.

Consider a random vector Y given by

$$ Y = AX + b \qquad (3.3-6) $$

where A is any deterministic matrix (not necessarily square), and b is an appropriate-length deterministic vector. Then the mean and covariance of Y are

$$ E\{Y\} = A\,E\{X\} + b \qquad (3.3-7) $$

$$ \mathrm{cov}(Y) = A\,\mathrm{cov}(X)\,A^* \qquad (3.3-8) $$

3.3.3 Marginal and Conditional Distributions

If X and Y are jointly random variables with a joint distribution function given by Equation (3.3-1), then X and Y are also individually random variables, with distribution functions defined as in Equation (3.2-1). The individual distributions of X and Y are called the marginal distributions, and the corresponding density functions are called marginal density functions.

The marginal distributions of X and Y can be derived from the joint distribution. (Note that the converse is false without additional assumptions.) By comparing Equations (3.2-1) and (3.3-1), we obtain

$$ F_X(x) = F_{X,Y}(x, \infty) \qquad (3.3-9a) $$

and correspondingly

$$ F_Y(y) = F_{X,Y}(\infty, y) \qquad (3.3-9b) $$

In terms of the density functions, using Equations (3.2-2) and (3.3-3), we obtain

$$ p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x,y)\, dy \qquad (3.3-10a) $$

$$ p_Y(y) = \int_{-\infty}^{\infty} p_{X,Y}(x,y)\, dx \qquad (3.3-10b) $$

The conditional distribution function of X given Y is defined as (see Equation (3.1-1))

$$ F_{X|Y}(x|y) = P(\{\omega : X(\omega) \le x\} \mid \{\omega : Y(\omega) \le y\}) \qquad (3.3-11) $$

and correspondingly for F_{Y|X}. The conditional density function, when it exists, can be expressed as

$$ p_{X|Y}(x|y) = p_{X,Y}(x,y)/p_Y(y) \qquad (3.3-12) $$

Equation (3.3-12) is known as Bayes' rule.

The conditional expectation is defined as

$$ E\{X|Y\} = \int_{-\infty}^{\infty} x\, p_{X|Y}(x|Y)\, dx \qquad (3.3-13) $$

assuming that the density function exists. Using Equation (3.3-13), we obtain the useful decomposition

$$ E\{f(X,Y)\} = E\{E\{f(X,Y)|Y\}\} \qquad (3.3-14) $$

3.3.4 Statistical Independence

Two random vectors X and Y defined on the same probability space are defined to be independent if

$$ F_{X,Y}(x,y) = F_X(x)\,F_Y(y) \qquad (3.3-15) $$

If the joint probability density function exists, we can write this condition as

$$ p_{X,Y}(x,y) = p_X(x)\,p_Y(y) \qquad (3.3-16) $$

An immediate corollary, using Equation (3.3-12), is that p_{X|Y}(x|y) does not depend on y, and p_{Y|X}(y|x) does not depend on x. If X and Y are independent, then f(X) and g(Y) are independent for any functions f and g.

Two vectors are uncorrelated if

$$ E\{XY^*\} = E\{X\}E\{Y^*\} \qquad (3.3-17) $$

or equivalently if

$$ E\{(X - E\{X\})(Y - E\{Y\})^*\} = 0 $$

If X and Y are uncorrelated, then the covariance of their sum equals the sum of their covariances.

If two vectors are independent, then they are uncorrelated, but the converse of this statement is false.
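A quick numerical illustration of the last statement (Python with numpy; the construction Y = X² is our own example, not from the text): X and Y = X² are uncorrelated but clearly dependent.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal(200000)      # standard Gaussian, so E{X^3} = 0
Y = X**2
print(np.mean(X * Y) - np.mean(X) * np.mean(Y))  # near 0: uncorrelated
print(np.corrcoef(np.abs(X), Y)[0, 1])           # far from 0: dependent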
3.4 TRANSFORMATION OF VARIABLES

A large part of probability theory is concerned in some manner with the transformation of variables; i.e., characterizing random variables defined as functions of other random variables. We have previously cited limited results on the means and covariances of some transformed variables (Equations (3.2-5), (3.3-7), and (3.3-8)). In this section we seek the entire density function. Our consideration is restricted to variables that have density functions. Let X be a random vector with density function p_X(x) defined on Rⁿ, the Euclidean space of real n-vectors. Then define Y ∈ Rᵐ by Y = f(X). We seek to derive the density function of Y. There are three cases to consider, depending on whether m = n, m > n, or m < n.
The primary case of interest is when m = n. Assume that f(·) is invertible and has continuous partial derivatives. (Technically, this is only required almost everywhere.) Define g(Y) = f⁻¹(Y). Then

$$ p_Y(y) = p_X(g(y))\, |\det J| \qquad (3.4-1) $$

where J is the Jacobian of the transformation g

$$ J_{ij} = \frac{\partial g_i(y)}{\partial y_j} \qquad (3.4-2) $$

See Rudin (1974, p. 186) and Apostol (1969, p. 394) for the proof.

Example 3.4-1 Let Y = CX, with C square and nonsingular. Then g(y) = C⁻¹y and J = C⁻¹, giving

$$ p_Y(y) = p_X(C^{-1}y)\, |\det C^{-1}| $$

as the transformation equation.

If f is not invertible, the distribution of Y is given by a sum of terms similar to Equation (3.4-1).
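A quick Monte Carlo check of Equation (3.4-1) for the linear case of Example 3.4-1 (Python with numpy; the matrix C and test point are our own assumptions, not from the text):

import numpy as np

rng = np.random.default_rng(0)
C = np.array([[2.0, 1.0],
              [0.0, 1.0]])
X = rng.standard_normal((200000, 2))    # p_X: standard Gaussian in R^2
Y = X @ C.T                             # Y = C X for each sample

y = np.array([1.0, -0.5])               # test point
x = np.linalg.solve(C, y)               # C^{-1} y
p_x = np.exp(-0.5 * x @ x) / (2.0 * np.pi)
p_y = p_x * abs(np.linalg.det(np.linalg.inv(C)))   # Equation (3.4-1)

h = 0.2                                 # compare with an empirical estimate
inbox = np.all(np.abs(Y - y) < h / 2, axis=1)
print(p_y, inbox.mean() / h**2)         # the two values roughly agree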

For the case with m > n, the distribution of Y will be concentrated on, at most, an n-dimensional hypersurface in Rᵐ, and will not have a density function in Rᵐ.

The simplest nontrivial case of m < n is when Y consists of a subset of the elements of X. In this case, the density function sought is the density function of the marginal distribution of the pertinent subset of the elements of X. Marginal distributions were discussed in Section 3.3.3. In general, when m < n, X can be transformed into a random vector Z ∈ Rⁿ, such that Y is a subset of the elements of Z.

Example 3.4-2 Let X ∈ R², and Y = X₁ + X₂. Define Z = CX, where

$$ C = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} $$

Then, using Example 3.4-1,

$$ p_Z(z) = p_X(C^{-1}z)\, |\det C^{-1}| $$

where

$$ C^{-1} = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} $$

Then Y = Z₁, so the distribution of Y is the marginal distribution of Z₁, which can be computed from Equation (3.3-10).

3.5 GAUSSIAN VARIABLES

Random variables with Gaussian distributions play a major role in this document and in much of probability theory. We will, therefore, briefly review the definition and some of the salient properties of Gaussian distributions. These distributions are often called normal distributions in the literature.

3.5.1 Standard Gaussian Distributions

All Gaussian distributions derive from the distribution of a standard Gaussian variable with mean 0 and covariance 1. The density function of the standard Gaussian distribution is defined to be

$$ p_X(x) = (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2}x^2\right) \qquad (3.5-1) $$

The distribution function does not have a simple closed-form expression. We will first show that Equation (3.5-1) is a valid density function with mean 0 and covariance 1. The most difficult part is showing that its integral over the real line is 1.

Theorem 3.5-1 Equation (3.5-1) defines a valid probability density function.

Proof The function is obviously nonnegative. There remains only to show that its integral over the real line is 1. Taking advantage of the symmetry about 0, we can reduce this problem to proving that

$$ \int_0^\infty (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2}x^2\right) dx = \tfrac{1}{2} \qquad (3.5-2) $$

There is no closed-form expression for this integral over any finite range, but for the semi-infinite range of Equation (3.5-2) the following "trick" works. Form the square of the integral:

$$ \left[\int_0^\infty (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2}x^2\right) dx\right]^2 = \frac{1}{2\pi} \int_0^\infty \int_0^\infty \exp\left(-\tfrac{1}{2}(x^2+y^2)\right) dx\, dy \qquad (3.5-3) $$

Then change variables to polar coordinates, substituting r² for x² + y² and r dr dθ for dx dy, to get

$$ \frac{1}{2\pi} \int_0^{\pi/2} \int_0^\infty \exp\left(-\tfrac{1}{2}r^2\right) r\, dr\, d\theta \qquad (3.5-4) $$

The integral in Equation (3.5-4) has a closed-form solution:

$$ \int_0^\infty \exp\left(-\tfrac{1}{2}r^2\right) r\, dr = -\exp\left(-\tfrac{1}{2}r^2\right)\Big|_0^\infty = 1 $$

Thus,

$$ \left[\int_0^\infty (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2}x^2\right) dx\right]^2 = \frac{1}{2\pi} \cdot \frac{\pi}{2} = \frac{1}{4} $$

Taking the square root gives Equation (3.5-2), completing the proof.

The mean of the distribution is trivially zero by symmetry. To derive the covariance, note that

$$ E\{1 - X^2\} = \int_{-\infty}^{\infty} (1 - x^2)(2\pi)^{-1/2} \exp\left(-\tfrac{1}{2}x^2\right) dx = (2\pi)^{-1/2}\, x\, \exp\left(-\tfrac{1}{2}x^2\right)\Big|_{-\infty}^{\infty} = 0 \qquad (3.5-9) $$

Thus,

$$ \mathrm{cov}(X) = E\{X^2\} - E\{X\}^2 = 1 - 0 = 1 $$

This completes our discussion of the scalar standard Gaussian. We define a standard multivariate Gaussian vector to be the concatenation of n independent standard Gaussian variables. The standard multivariate Gaussian density function is therefore the product of n marginal density functions in the form of Equation (3.5-1):

$$ p_X(x) = (2\pi)^{-n/2} \exp\left(-\tfrac{1}{2} x^* x\right) $$

The mean of this distribution is 0 and the covariance is an identity matrix.

3.5.2 General Gaussian Distributions

We will define the class of all Gaussian distributions by reference to the standard Gaussian distributions of the previous section. We define a random vector Y to have a Gaussian distribution if Y can be represented in the form

$$ Y = AX + m \qquad (3.5-12) $$

where X is a standard Gaussian vector, A is a deterministic matrix, and m is a deterministic vector. The A matrix need not be square. Note that any deterministic vector is a special case of a Gaussian vector with a zero A matrix.

We have defined the class of Gaussian random variables by a set of operations that can produce such variables. It now remains to determine the forms and properties of these distributions. (This is somewhat backwards from the most common approach, where the forms of the distributions are first defined and Equation (3.5-12) is proven as a result. We find that our approach makes it somewhat easier to handle singular and nonsingular cases consistently without introducing characteristic functions (Papoulis, 1965).)

By Equations (3.3-7) and (3.3-8), the Y defined by Equation (3.5-12) has mean m and covariance AA*. Our first major result will be to show that a Gaussian distribution is uniquely specified by its mean and covariance; that is, if two distributions are both Gaussian and have equal means and covariances, then the two distributions are identical. Note that this does not mean that the A matrices need to be identical; the reason the result is nontrivial is that an infinite number of different A matrices give the same covariance AA*.

Example 3.5-1 Consider three Gaussian vectors Y₁ = A₁X₁, Y₂ = A₂X₂, and Y₃ = A₃X₃, where X₁ and X₂ are standard Gaussian 2-vectors, X₃ is a standard Gaussian 3-vector, and the Aᵢ are chosen so that A₁A₁* = A₂A₂* = A₃A₃*. Then all three Yᵢ have equal covariance.

The rest of this section is devoted to proving this result in three steps. First, we will consider square, nonsingular A matrices. Second, we will consider general square A matrices. Finally, we will consider nonsquare A matrices. Each of these steps uses the results of the previous step.

Theorem 3.5-2 If Y is a Gaussian n-vector defined by Equation (3.5-12) with a nonsingular A matrix, then the probability density function of Y exists and is given by

$$ p_Y(y) = |2\pi\Lambda|^{-1/2} \exp\left[-\tfrac{1}{2}(y - m)^* \Lambda^{-1}(y - m)\right] \qquad (3.5-13) $$

where Λ is the covariance AA*.

Proof This is a direct application of the transformation of variables, Equation (3.4-1):

$$ p_Y(y) = p_X[A^{-1}(y - m)]\, |\det A^{-1}| $$

Substituting Λ for AA* then gives the desired result.

Note that the density function, Equation (3.5-13), depends only on the mean and covariance, thus proving the uniqueness result for the case restricted to nonsingular matrices. A particular case of interest is where m is 0 and A is unitary. (A unitary matrix is a square one with AA* = I.) In this case, Y has a standard Gaussian distribution.

Theorem 3.5-3 If Y is a Gaussian n-vector defined by Equation (3.5-12) with any square A matrix, then Y can be represented as

$$ Y = SX + m \qquad (3.5-14) $$

where X is a standard Gaussian n-vector and S is positive semi-definite. Furthermore, the S in this representation is unique and depends only on the covariance of Y.

Proof The uniqueness is easy to prove, and we will do it first. The covariance of the Y given by Equation (3.5-12) is AA*. The covariance of a Y expressed as in Equation (3.5-14) is SS*. A necessary (but not sufficient) condition for Equation (3.5-14) to be a valid representation of Y is, therefore, that SS* equal AA*. It is an elementary result of linear algebra (Wilkinson, 1965; Dongarra, Moler, Bunch, and Stewart, 1979; and Strang, 1980) that AA* is always positive semi-definite and that there is one and only one positive semi-definite matrix S satisfying SS* = AA*. S is called the matrix square root of AA*. This proves the uniqueness.

The existence proof relies on another result from linear algebra: any square matrix A can be factored as SQ, where S is positive semi-definite and Q is unitary. For nonsingular A, this factorization is easy: S is the matrix square root of AA*, and Q is S⁻¹A. A formal proof for general A matrices would be too long a diversion into linear algebra for our current purposes, so we will omit it. This factorization is closely related to, and can be formally derived from, the well-known QR factorization, where Q is unitary and R is upper triangular (Wilkinson, 1965; Dongarra, Moler, Bunch, and Stewart, 1979; and Strang, 1980). Given the SQ factorization of A, define

$$ \tilde{X} = QX $$

By Theorem (3.5-2), X̃ is a standard Gaussian n-vector. Substituting into Equation (3.5-12) gives Equation (3.5-14), completing the proof.

Because the S in the above theorem depends only on the covariance of Y, it immediately follows that the distribution of any Gaussian variable generated by a square A matrix is uniquely specified by the mean and covariance. It remains only to extend this result to rectangular A matrices.

Theorem 3.5-4 The distribution of any Gaussian vector is uniquely defined by its mean and covariance.

Proof We have already shown the result for Gaussian vectors generated by square A matrices. We need only show that a Gaussian vector generated by a rectangular A matrix can be rewritten in terms of a square A matrix. Let A be n-by-m, and consider the two cases, n > m and n < m.

If n > m, define a standard Gaussian n-vector X̃ by augmenting the X vector with n − m independent standard Gaussians, and define an n-by-n matrix Ã by augmenting A with n − m columns of zeros. We then have

$$ Y = \tilde{A}\tilde{X} + m $$

as desired.

For the case n < m, define a random m-vector Ỹ by augmenting Y with m − n zeros. Then

$$ \tilde{Y} = \tilde{A}X + \tilde{m} \qquad (3.5-15) $$

where Ã and m̃ are obtained by augmenting zeros to A and m. Use Theorem (3.5-3) to rewrite Ỹ as

$$ \tilde{Y} = S\tilde{X} + \tilde{m} \qquad (3.5-16) $$

Since the last m − n elements of Ỹ are zero, Equation (3.5-16) must be in the form

$$ \begin{bmatrix} Y \\ 0 \end{bmatrix} = \begin{bmatrix} \bar{S} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \bar{X} \\ X_2 \end{bmatrix} + \begin{bmatrix} m \\ 0 \end{bmatrix} $$

Thus Y = S̄X̄ + m, which is in the required form.

Theorem (3.5-4) is the central result of this approach to Gaussian variables. It makes the practical manipulation of Gaussian variables much easier. Once you have demonstrated that some result is Gaussian, you

need only derive the mean and covariance to specify the distribution completely. This is far easier than manipulating the full density function or distribution function, a process which often requires partial differential equations. If the covariance matrix is nonsingular, then the density function exists and is given by Equation (3.5-13). If the covariance is singular, a density function does not exist (unless you extend the definition of density functions to include components like impulse functions).

Two properties of the Gaussian density function often provide useful computational shortcuts to evaluating the mean and covariance of nonsingular Gaussians. The first property is that the mean of the density function occurs at its maximum. The mean is thus the unique solution of

$$ \nabla_y \ln p(y) = 0 $$

The logarithm in this equation can be removed, but the equation is usually most useful as written. The second property is that the covariance can be expressed as

$$ \mathrm{cov}(Y) = -\left[\nabla_y^2 \ln p(y)\right]^{-1} $$

Both of these properties are easy to verify by direct substitution into Equation (3.5-13).
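A numerical check of these two shortcuts (Python with numpy; the mean and covariance values are our own assumptions): finite differences of ln p(y) recover the mean condition and the covariance.

import numpy as np

m = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.5],
                [0.5, 1.0]])
Lam_inv = np.linalg.inv(Lam)

def log_p(y):                            # ln of Equation (3.5-13)
    d = y - m
    return -0.5 * d @ Lam_inv @ d - 0.5 * np.log(np.linalg.det(2*np.pi*Lam))

eps, I = 1e-4, np.eye(2)
grad = np.array([(log_p(m + eps*I[i]) - log_p(m - eps*I[i])) / (2*eps)
                 for i in range(2)])
hess = np.array([[(log_p(m + eps*(I[i]+I[j])) - log_p(m + eps*(I[i]-I[j]))
                   - log_p(m - eps*(I[i]-I[j])) + log_p(m - eps*(I[i]+I[j])))
                  / (4*eps**2) for j in range(2)] for i in range(2)])
print(grad)                              # ~0: the gradient vanishes at the mean
print(-np.linalg.inv(hess))              # ~Lam: recovers the covariance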

3.5.3 Properties

In this section we derive several useful properties of Gaussian vectors. Most of these properties relate to operations on Gaussian vectors that give Gaussian results. A major reason for the wide use of Gaussian distributions is that many basic operations on Gaussian vectors give Gaussian results, which can be characterized completely by the mean and covariance.

Theorem 3.5-5 If Y is a Gaussian vector with mean m and covariance Λ, and if Z is given by

$$ Z = BY + b $$

then Z is Gaussian with mean Bm + b and covariance BΛB*.

Proof By definition, Y can be expressed as

$$ Y = AX + m $$

where X is a standard Gaussian. Substituting Y into the expression for Z gives

$$ Z = BAX + (Bm + b) $$

proving that Z is Gaussian. The mean and covariance expressions for linear operations on any random vector were previously derived in Equations (3.3-7) and (3.3-8).

Several of the properties discussed in this section involve the concept of jointly Gaussian variables. Two or more random vectors are said to be jointly Gaussian if their joint distribution is Gaussian. Note that two vectors can both be Gaussian and yet not be jointly Gaussian.

Example 3.5-2 Let Y be a Gaussian random variable with mean 0 and variance 1. Define Z as

$$ Z = \begin{cases} Y & |Y| \le 1 \\ -Y & |Y| > 1 \end{cases} $$

The random variable Z is Gaussian with mean 0 and variance 1 (apply Equation (3.4-1) to show this), but Y and Z are not jointly Gaussian.

Theorem 3.5-6 Let Y₁ and Y₂ be jointly Gaussian vectors, and let the mean m and covariance Λ of the joint distribution be partitioned as

$$ m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix} $$

Then the marginal distributions of Y₁ and Y₂ are Gaussian with

$$ E\{Y_1\} = m_1 \qquad \mathrm{cov}(Y_1) = \Lambda_{11} $$

$$ E\{Y_2\} = m_2 \qquad \mathrm{cov}(Y_2) = \Lambda_{22} $$

Proof Apply Theorem (3.5-5) with B = [I 0] and B = [0 I].

The following two theorems relate to independent Gaussian variables:

Theorem 3.5-7 If Y and Z are two independent Gaussian variables, then Y and Z are jointly Gaussian.

Proof For nonsingular distributions, this proof is easy to do by writing out the product of the density functions. For a more general proof, we can proceed as follows: write Y and Z as

$$ Y = A_1 X_1 + m_1 \qquad Z = A_2 X_2 + m_2 $$

where X₁ and X₂ are standard Gaussian vectors. We can always construct the X₁ and X₂ in these equations to be independent, but the following argument avoids the necessity to prove that statement. Define two independent standard Gaussians, X̄₁ and X̄₂, and further define

$$ \bar{Y} = A_1 \bar{X}_1 + m_1 \qquad \bar{Z} = A_2 \bar{X}_2 + m_2 $$

Then Ȳ and Z̄ have the same joint distribution as Y and Z. The concatenation of X̄₁ and X̄₂ is a standard Gaussian vector. Therefore, Ȳ and Z̄ are jointly Gaussian because they can be expressed as

$$ \begin{bmatrix} \bar{Y} \\ \bar{Z} \end{bmatrix} = \begin{bmatrix} A_1 & 0 \\ 0 & A_2 \end{bmatrix} \begin{bmatrix} \bar{X}_1 \\ \bar{X}_2 \end{bmatrix} + \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} $$

Since Ȳ and Z̄ have the same joint distribution as Y and Z, Y and Z are also jointly Gaussian.

Theorem 3.5-8 If Y and Z are two uncorrelated jointly Gaussian variables, then Y and Z are independent and Gaussian.

Proof By Theorem (3.5-3), we can express

$$ \begin{bmatrix} Y \\ Z \end{bmatrix} = SX + m $$

where X is a standard Gaussian vector and S is positive semi-definite. Partition S as

$$ S = \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix} $$

By the definition of "uncorrelated," we must have S₁₂ = S₂₁* = 0. Therefore, partitioning X into X₁ and X₂, and partitioning m into m₁ and m₂, we can write

$$ Y = S_{11} X_1 + m_1 \qquad Z = S_{22} X_2 + m_2 $$

Since Y and Z are functions of the independent vectors X₁ and X₂, Y and Z are independent and Gaussian.

Since any two independent vectors are uncorrelated, Theorem (3.5-8) proves that independence and lack of correlation are equivalent for Gaussians.

We previously covered marginal distributions of Gaussian vectors. The following theorem considers conditional distributions. We will directly consider only conditional distributions of nonsingular Gaussians. Since the results of the theorem involve inverses, there are obvious difficulties that cannot be circumvented by avoiding the use of probability density functions in the proof.

Theorem 3.5-9 Let Y₁ and Y₂ be jointly Gaussian variables with a nonsingular joint distribution. Partition the mean, covariance, and inverse covariance of the joint distribution as

$$ m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix} \qquad \Gamma = \Lambda^{-1} = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix} $$

Then the conditional distributions of Y₁ given Y₂, and of Y₂ given Y₁, are Gaussian with means and covariances

$$ E\{Y_1|Y_2\} = m_1 + \Lambda_{12}\Lambda_{22}^{-1}(Y_2 - m_2) \qquad (3.5-18a) $$

$$ \mathrm{cov}(Y_1|Y_2) = \Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21} = (\Gamma_{11})^{-1} \qquad (3.5-18b) $$

$$ E\{Y_2|Y_1\} = m_2 + \Lambda_{21}\Lambda_{11}^{-1}(Y_1 - m_1) \qquad (3.5-19a) $$

$$ \mathrm{cov}(Y_2|Y_1) = \Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12} = (\Gamma_{22})^{-1} \qquad (3.5-19b) $$

Proof The joint probability density function of Y₁ and Y₂ is

$$ p(y_1,y_2) = c_1 \exp\left[-\tfrac{1}{2}(y - m)^* \Gamma (y - m)\right] \qquad (3.5-20) $$

where c₁ is a scalar constant, the magnitude of which we will not need to compute. Expanding the exponent, and recognizing that Γ₁₂ = Γ₂₁*, gives

$$ -\tfrac{1}{2}\left[(y_1 - m_1)^*\Gamma_{11}(y_1 - m_1) + 2(y_1 - m_1)^*\Gamma_{12}(y_2 - m_2) + (y_2 - m_2)^*\Gamma_{22}(y_2 - m_2)\right] $$

Completing squares results in

$$ p(y_1,y_2) = c_1 \exp\left[-\tfrac{1}{2} v^* \Gamma_{11} v - \tfrac{1}{2}(y_2 - m_2)^*(\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})(y_2 - m_2)\right] \qquad (3.5-21) $$

where v = y₁ − m₁ + Γ₁₁⁻¹Γ₁₂(y₂ − m₂). Integrating this expression with respect to y₁ gives the marginal density function of Y₂. The second term in the exponent does not involve y₁, and we recognize the first term as the exponent in a Gaussian density function with mean m₁ − Γ₁₁⁻¹Γ₁₂(y₂ − m₂) and covariance Γ₁₁⁻¹; its integral with respect to y₁ is therefore a constant independent of y₂. The marginal density of Y₂ is therefore

$$ p(y_2) = c_2 \exp\left[-\tfrac{1}{2}(y_2 - m_2)^*(\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})(y_2 - m_2)\right] \qquad (3.5-22) $$

where c₂ is a constant. Note that because we know that Equation (3.5-22) must be a probability density function, we need not compute the value of c₂; this saves us a lot of work. Equation (3.5-22) is an expression for a Gaussian density function with mean m₂ and covariance (Γ₂₂ − Γ₂₁Γ₁₁⁻¹Γ₁₂)⁻¹. The partitioned matrix inversion lemma (Appendix A) gives us

$$ (\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})^{-1} = \Lambda_{22} $$

thus independently verifying the result of Theorem (3.5-6) on the marginal distribution. The conditional density of Y₁ given Y₂ is obtained using Bayes' rule, by dividing Equation (3.5-21) by Equation (3.5-22):

$$ p(y_1|y_2) = c_3 \exp\left[-\tfrac{1}{2} v^* \Gamma_{11} v\right] \qquad v = y_1 - m_1 + \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2) $$

where c₃ is a constant. This is an expression for a Gaussian density function with mean m₁ − Γ₁₁⁻¹Γ₁₂(y₂ − m₂) and covariance Γ₁₁⁻¹. The partitioned matrix inversion lemma (Appendix A) then gives

$$ \Gamma_{11}^{-1} = \Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21} \qquad -\Gamma_{11}^{-1}\Gamma_{12} = \Lambda_{12}\Lambda_{22}^{-1} $$

Thus the conditional distribution of Y₁ given Y₂ is Gaussian with mean m₁ + Λ₁₂Λ₂₂⁻¹(y₂ − m₂) and covariance Λ₁₁ − Λ₁₂Λ₂₂⁻¹Λ₂₁, as we desired to prove. The conditional distribution of Y₂ given Y₁ follows by symmetry.
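A small numeric sketch of Theorem 3.5-9 (Python with numpy; the partition values and the observed y₂ are our own assumptions, not from the text):

import numpy as np

m1, m2 = np.array([0.0]), np.array([1.0])
L11, L12 = np.array([[2.0]]), np.array([[0.8]])
L21, L22 = L12.T, np.array([[1.0]])

y2 = np.array([2.0])                      # observed value of Y2
K = L12 @ np.linalg.inv(L22)
print(m1 + K @ (y2 - m2))                 # E{Y1|Y2}, Equation (3.5-18a): [0.8]
print(L11 - K @ L21)                      # cov(Y1|Y2), Equation (3.5-18b): [[1.36]]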

The final result of this section concerns sums of Gaussian variables.

Theorem 3.5-10 If Y₁ and Y₂ are jointly Gaussian random vectors of equal length and their joint distribution has mean and covariance partitioned as

$$ m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix} $$

then Y₁ + Y₂ is Gaussian with mean m₁ + m₂ and covariance Λ₁₁ + Λ₁₂ + Λ₂₁ + Λ₂₂.

Proof Apply Theorem (3.5-5) with B = [I I] and b = 0.

A simple summary of this section is that linear operations on Gaussian variables give Gaussian results. This principle is not generally true for nonlinear operations. Therefore, Gaussian distributions are strongly associated with the analysis of linear systems.

3.5.4 Central Limit Theorem

The Central Limit Theorem is often used as a basis for justifying the assumption that the distribution of some physical quantity is approximately Gaussian.

Theorem 3.5-11 Let Y₁, Y₂, ... be a sequence of independent, identically distributed random vectors with finite mean m and covariance Λ. Then the vectors

$$ Z_N = N^{-1/2} \sum_{i=1}^{N} (Y_i - m) $$

converge in distribution to a Gaussian vector with mean zero and covariance Λ.

Proof See Ash (1970, p. 171) and Apostol (1969, p. 567).

Cramer (1946) discusses several variants on this theorem, where the Yᵢ need not be independent and identically distributed, but other requirements are placed on the distributions. The general result is that sums of random variables tend to Gaussian limits under fairly broad conditions. The precise conditions will not concern us here. An implication of this theorem is that macroscopic behavior which is the result of the summation of a large number of microscopic events often has a Gaussian distribution. The classic example is Brownian motion. We will illustrate the Central Limit Theorem with a simple example.

Example 3.5-3 Let the distribution of the Yᵢ in Theorem (3.5-11) be uniform on the interval (-1,1). Then the mean is zero and the covariance is 1/3. Examine the density functions of the first few Z_N. The first function, Z₁, is equal to Y₁, and thus is uniform on (-1,1). Figure (3.5-1) compares the densities of Z₁ and the Gaussian limit. The Gaussian limit distribution has mean zero and variance 1/3. For the second function we have

$$ Z_2 = \frac{1}{\sqrt{2}}(Y_1 + Y_2) $$

and the density function of Z₂ is given by

$$ p_{Z_2}(z) = \begin{cases} \dfrac{\sqrt{2} - |z|}{2} & |z| \le \sqrt{2} \\ 0 & \text{otherwise} \end{cases} $$

Figure (3.5-2) compares the density of Z₂ with the Gaussian limit. The density function of Z₃ is given by

$$ p_{Z_3}(z) = \begin{cases} \dfrac{\sqrt{3}}{8}\,(3 - 3z^2) & |z| \le 1/\sqrt{3} \\ \dfrac{\sqrt{3}}{16}\,(3 - \sqrt{3}\,|z|)^2 & 1/\sqrt{3} \le |z| \le \sqrt{3} \\ 0 & \text{otherwise} \end{cases} $$

Figure (3.5-3) compares the density of Z₃ with the Gaussian limit. By the time N is 3, Z_N is already becoming reasonably close to Gaussian.
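A short simulation of Example 3.5-3 (Python with numpy; the sample sizes are our own choices): the variance of Z_N stays at the limit value 1/3 while the shape approaches Gaussian.

import numpy as np

rng = np.random.default_rng(2)
for N in (1, 2, 3, 10):
    Y = rng.uniform(-1.0, 1.0, size=(100000, N))
    Z = Y.sum(axis=1) / np.sqrt(N)        # Z_N of Theorem (3.5-11)
    # Variance stays near 1/3; excess kurtosis decays toward 0 (Gaussian).
    print(N, Z.var(), np.mean(Z**4) / Z.var()**2 - 3.0)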

Figure (3.5-1). Density functions of Z₁ and the limit Gaussian.

Figure (3.5-2). Density functions of Z₂ and the limit Gaussian.

Figure (3.5-3). Density functions of Z₃ and the limit Gaussian.

CHAPTER 4

4.0 STATISTICAL ESTIMATORS

In this chapter, we introduce the concept of an estimator. We then define some basic measures of estimator performance. We use these measures of performance to introduce several common statistical estimators. The definitions in this chapter are general. Subsequent chapters will treat specific forms. For other treatments of this and related material, see Sorenson (1980), Schweppe (1973), Goodwin and Payne (1977), and Eykhoff (1974). These books also cover other estimators that we do not mention here.
4.1 DEFINITION OF AN ESTIMATOR

The concept of estimation is central to our study. The statistical definition of an estimator is as follows: Perform an experiment (input) U, taken from the set 𝒰 of possible experiments on the system. The system response is a random variable:

$$ Z = Z(\xi, U, \omega) \qquad (4.1-1) $$

where ξ ∈ Ξ is the true value of the parameter vector and ω ∈ Ω is the random component of the system. An estimator is any function of Z with range in Ξ. The value of the function is called the estimate ξ̂. Thus

$$ \hat{\xi} = \hat{\xi}(Z, U) = \hat{\xi}(Z(\xi, U, \omega), U) \qquad (4.1-2) $$

This definition is readily generalized to multiple performances of the same experiment or to the performance of more than one experiment. If N experiments Uᵢ are performed, with responses Zᵢ, then an estimate would be of the form

$$ \hat{\xi} = \hat{\xi}(Z_1, \ldots, Z_N, U_1, \ldots, U_N) = \hat{\xi}(Z(\xi,U_1,\omega_1), \ldots, Z(\xi,U_N,\omega_N), U_1, \ldots, U_N) \qquad (4.1-3) $$

where the ωᵢ are independent. The N experiments can be regarded as a single "super-experiment," the response to which is the concatenated vector (Z₁, ..., Z_N), with input (U₁, ..., U_N) ∈ 𝒰 × ... × 𝒰 and random element (ω₁, ..., ω_N) ∈ Ω × ... × Ω. Equation (4.1-3) is then simply a restatement of Equation (4.1-2) on the larger space.

For simplicity of notation, we will generally omit the dependence on U from Equations (4.1-1) and (4.1-2). For the most part, we will be discussing parameter estimation based on responses to specific, known inputs; therefore, the dependence of the response and the estimate on the input are irrelevant, and merely clutter up the notation. Formally, all of the distributions and expectations may be considered to be implicitly conditioned on U.

Note that the estimate ξ̂ is a random variable because it is a function of Z, which is a random variable. When the experiment is actually performed, specific realizations of these random variables will be obtained. The true parameter value ξ is not usually considered to be random, simply unknown. In some situations, however, it is convenient to define ξ as a random variable instead of as an unknown parameter. The significant difference between these approaches is that a random variable has a probability distribution, which constitutes additional information that can be used in the random-variable approach. Several popular estimators can only be defined using the random-variable approach. These advantages of the random-variable approach are balanced by the necessity to know the probability distribution of ξ. If this distribution is not known, there are no differences, except in terminology, between the random-variable and unknown-parameter approaches.

A third view of ξ involves ideas from information theory. In this context, ξ is considered to be an unknown parameter as above. Even though ξ is not random, it is defined to have a "probability distribution." This probability distribution does not relate to any randomness of ξ, but reflects our knowledge or information about the value of ξ. Distributions with low variance correspond to a high degree of certainty about the value of ξ, and vice versa. The term "probability distribution" is a misnomer in this context. The terms "information distribution" or "information function" more accurately reflect this interpretation.
In the context of information theory, the marginal or prior distribution p(ξ) reflects the information about ξ prior to performing the experiment. A case where there is no prior information can be handled as a limit of prior distributions with less and less information (variance going to infinity). The distribution of the response Z is a function of the value of ξ. When ξ is a random variable, this is called p(Z|ξ), the conditional distribution of Z given ξ. We will use the same notation when ξ is not random, in order to emphasize the dependence of the distribution on ξ, and for consistency of notation. When p(ξ) is defined, the joint probability density is then

$$ p(Z, \xi) = p(Z|\xi)\, p(\xi) \qquad (4.1-4) $$

The marginal probability density of Z is

$$ p(Z) = \int p(Z, \xi)\, d\xi \qquad (4.1-5) $$

The conditional density of ξ given Z (also called the posterior density) is

$$ p(\xi|Z) = \frac{p(Z|\xi)\, p(\xi)}{p(Z)} \qquad (4.1-6) $$

In the information theory context, the posterior distribution reflects information about the value of ξ after the experiment is performed. It accounts for the information known prior to the experiment, and the information gained by the experiment.

The distinctions among the random variable, unknown parameter, and information theory points of view are largely academic. Although the conventional notations differ, the equations used are equivalent in all three cases. Our presentation uses the probability density notation throughout. We see little benefit in repeating identical derivations, substituting the term "information function" for "likelihood function" and changing notation. We derive the basic equations only once, restricting the distinctions among the three points of view to discussions of applicability and interpretation.

4.2 PROPERTIES OF ESTIMATORS

We can define an infinite number of estimators for a given problem. The definition of an estimator provides no means of evaluating these estimators, some of which can be ridiculously poor. This section will describe some of the properties used to evaluate estimators and to select a good estimator for a particular problem. The properties are all expressed in terms of optimality criteria.

4.2.1 Unbiased Estimators

A bias is a consistent or repeatable error. The parameter estimates from any specific data set will always be imperfect. It is reasonable to hope, however, that the estimates obtained from a large set of maneuvers would be centered around the true value. The errors in the estimates might be thought of as consisting of two components: consistent errors and random errors. Random errors are generally unavoidable. Consistent or average errors might be removable.

Let us restate the above ideas more precisely. The bias b of an estimator ξ̂(·) is defined as

$$ b(\xi) = E\{\hat{\xi}|\xi\} - \xi = E\{\hat{\xi}(Z(\xi,\omega))|\xi\} - \xi \qquad (4.2-1) $$

The ξ̂ in these equations is a random variable, not a specific realization. Note that the bias is a function of the true value. It averages out (by the E{·}) the random noise effects, but there is no averaging among different true values. The bias is also a function of the input U, but this dependence is not usually made explicit. All discussions of bias are implicitly referring to some given input.

An unbiased estimator i s defined as an estimatar f o r which the b i a s i s i d e n t i c a l l y zero:

This requirement is quite stringent because it must be met for every value of ξ. Unbiased estimators may not exist for some problems. For other problems, unbiased estimators may exist, but may be too complicated for practical computation. Any estimator that is not unbiased is called biased. Generally, it is considered desirable for an estimator to be unbiased. This judgment, however, does not apply to all situations. The bias of an estimator measures only the average of its behavior. It is possible for the individual estimates to be so poor that they are ludicrous, yet average out so that the estimator is unbiased. The following example is taken from Ferguson (1967, p. 126).

Example 4.2-1 A telephone operator has been working for 10 minutes and wonders if he would be missed if he took a 20-minute coffee break. Assume that calls are coming in as a Poisson process with the average rate of λ calls per 10 minutes, λ being unknown. The number Z of calls received in the first 10 minutes has a Poisson distribution with parameter λ.

On the basis of Z, the operator desires to estimate θ, the probability of receiving no calls in the next 20 minutes. For a Poisson process, θ = e^{-2λ}. If the estimator θ̂(Z) is to be unbiased, we must have E{θ̂(Z(θ,ω))|θ} = θ for all θ ∈ [0,1]. Thus

$\sum_{z=0}^{\infty} \hat{\theta}(z)\,\frac{\lambda^z}{z!}\,e^{-\lambda} = e^{-2\lambda}$

Multiply by $e^{\lambda}$, giving

$\sum_{z=0}^{\infty} \hat{\theta}(z)\,\frac{\lambda^z}{z!} = e^{-\lambda}$

Expand the right-hand side as a power series to get

$\sum_{z=0}^{\infty} \hat{\theta}(z)\,\frac{\lambda^z}{z!} = \sum_{z=0}^{\infty} (-1)^z\,\frac{\lambda^z}{z!}$

The convergent power series are equal for all λ ∈ [0,∞) if the coefficients are identical. Thus θ̂(Z) = (-1)^Z is the only unbiased estimator of θ for this problem. The operator would estimate the probability of missing no calls as +1 if he had received an even number of calls, and -1 if he had received an odd number of calls. This estimator is the only unbiased estimator for the problem, but it is a ridiculously poor one. If the estimates are required to lie in the meaningful range of [0,1], then there is no unbiased estimator, but some quite reasonable biased estimators can be easily constructed.
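As a concrete check of this example, the following minimal Monte Carlo sketch (assuming only Python with numpy; it is an illustration, not part of the original text) simulates the Poisson counts and confirms that θ̂(Z) = (-1)^Z averages to e^{-2λ} even though every individual estimate is +1 or -1.

```python
# Monte Carlo check of Example 4.2-1 (illustrative sketch, assumed values).
import numpy as np

rng = np.random.default_rng(0)
lam = 1.3                          # assumed true rate of calls per 10 minutes
theta = np.exp(-2.0 * lam)         # probability of no calls in the next 20 minutes
Z = rng.poisson(lam, size=1_000_000)
theta_hat = (-1.0) ** Z            # the only unbiased estimator derived above

print("true theta          :", theta)
print("average of estimates:", theta_hat.mean())      # close to theta: unbiased
print("possible estimates  :", np.unique(theta_hat))  # only -1 and +1: ridiculous
```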

The bias is a useful tool for studying estimators. In general, it is desirable for the bias to be zero, or at least small. However, because the bias measures only the average properties of the estimates, it cannot be used as the sole criterion for evaluating estimators. It is possible for a biased estimator to be clearly superior to all of the unbiased estimators for a problem.

4.2.2 Minimum Variance Estimators

The variance of an estimator is defined as

$\mathrm{var}(\hat{\xi}|\xi) = E\{[\hat{\xi} - E(\hat{\xi}|\xi)][\hat{\xi} - E(\hat{\xi}|\xi)]^* \mid \xi\}$   (4.2-3)

Note that the variance, like the bias, is a function of the input and the true value. The variance alone is not a reasonable measure for evaluating an estimator. For instance, any constant estimator (one that always returns a constant value, ignoring the data) has zero variance. These are obviously poor estimators in most situations.

A more useful measure is the mean square error:

$\mathrm{mse}(\xi) = E\{(\hat{\xi} - \xi)(\hat{\xi} - \xi)^* \mid \xi\}$   (4.2-4)
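A toy numerical illustration of the contrast between these two measures (a sketch assuming numpy; the example values are hypothetical): the constant estimator below has zero variance, but its mean square error is large, while the sample mean has nonzero variance and far smaller mean square error.

```python
# Variance versus mean square error for two estimators (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
xi_true = 3.0
Z = rng.normal(xi_true, 1.0, size=(100_000, 10))   # 10 measurements per data set

est_mean = Z.mean(axis=1)              # sample-mean estimator
est_const = np.zeros(100_000)          # constant estimator: ignores the data
for name, est in (("sample mean", est_mean), ("constant 0 ", est_const)):
    print(name, " var:", est.var(), " mse:", np.mean((est - xi_true) ** 2))
```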

The mean square error and variance are obviously identical for unbiased estimators (E{ξ̂|ξ} = ξ). An estimator is uniformly minimum mean-square error if, for every value of ξ, its mean square error is less than or equal to the mean square error of any other estimator. Note that the mean-square error is a symmetric matrix. One symmetric matrix is less than or equal to another if their difference is positive semi-definite. This definition is somewhat academic at this point because such estimators do not exist except in trivial cases. A constant estimator has zero mean-square error when ξ is equal to the constant. (The performance is poor at other values of ξ.) Therefore, in order to be uniformly minimum mean-square error, an estimator would have to have zero mean-square error for every ξ; otherwise, a constant estimator would be better for that ξ. The concept of minimum mean-square error becomes more useful if the class of estimators allowed is restricted. An estimator is uniformly minimum mean-square error unbiased if it is unbiased and, for every value of ξ, its mean-square error is less than or equal to that of any other unbiased estimator. Such estimators do not exist for every problem, because the requirement must hold for every value of ξ. Estimators optimum in this sense exist for many problems of interest. The mean-square error and the variance are identical for unbiased estimators, so such optimal estimators are also called uniformly minimum variance unbiased estimators. They are also often called simply minimum variance estimators. This term should be regarded as an abbreviation, because it is not meaningful in itself.

4.2.3 Cramer-Rao Inequality (Efficient Estimators)

The Cramer-Rao inequality is one of the central results used to evaluate the performance of estimators. The inequality gives a theoretical limit to the accuracy that is possible, regardless of the estimator used. In a sense, the Cramer-Rao inequality gives a measure of the information content of the data. Before deriving the Cramer-Rao inequality, let us prove a brief lemma.

Lemma 4.2-1 Let X and Y be two random N-vectors. Then

$E\{XX^*\} \geq E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\}$   (4.2-5)

assuming that the inverse exists.

Proof The proof is done by completing the square. Let A be any nonrandom N-by-N matrix. Then

$E\{(X - AY)(X - AY)^*\} \geq 0$   (4.2-6)

because it is a covariance matrix. Expanding,

$E\{XX^*\} - AE\{YX^*\} - E\{XY^*\}A^* + AE\{YY^*\}A^* \geq 0$   (4.2-7)

Choose

$A = E\{XY^*\}[E\{YY^*\}]^{-1}$   (4.2-8)

Then

$E\{XX^*\} \geq E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} + E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} - E\{XY^*\}[E\{YY^*\}]^{-1}E\{YY^*\}[E\{YY^*\}]^{-1}E\{YX^*\}$

or

$E\{XX^*\} \geq E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\}$   (4.2-9)

completing the lemma. We now seek to find a bound on $E\{(\hat{\xi} - \xi)(\hat{\xi} - \xi)^* \mid \xi\}$, the mean square error of the estimate.

Theorem 4.2-2 (Cramer-Rao) Assume that the density p(Z|ξ) exists and is smooth enough to allow the operations below. (See Cramér (1946) for details.) This assumption proves adequate for most cases of interest to us. Pitman (1979) discusses some of the cases where p(Z|ξ) is not as smooth as required here. Then

$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^* \mid \xi\} \geq [I + \nabla_\xi b(\xi)]M(\xi)^{-1}[I + \nabla_\xi b(\xi)]^*$   (4.2-10)

where

$M(\xi) = E\{[\nabla_\xi \ln p(Z|\xi)]^*[\nabla_\xi \ln p(Z|\xi)] \mid \xi\}$   (4.2-11)

Proof Let X and Y from the lemma (4.2-1) be $\hat{\xi}(Z) - \xi$ and $[\nabla_\xi \ln p(Z|\xi)]^*$, respectively, and let all the expectations in the lemma be conditioned on ξ. Concentrate first on the term

$E\{XY^*|\xi\} = \int (\hat{\xi}(Z) - \xi)(\nabla_\xi \ln p(Z|\xi))\,p(Z|\xi)\,d|Z|$   (4.2-12)

where d|Z| is the volume element in the space of Z. Substituting the relation

$\nabla_\xi p(Z|\xi) = p(Z|\xi)\,\nabla_\xi \ln p(Z|\xi)$   (4.2-13)

gives

$E\{XY^*|\xi\} = \int (\hat{\xi}(Z) - \xi)(\nabla_\xi p(Z|\xi))\,d|Z| = \int \hat{\xi}(Z)(\nabla_\xi p(Z|\xi))\,d|Z| - \int \xi(\nabla_\xi p(Z|\xi))\,d|Z|$   (4.2-14)

Now ξ̂(Z) is not a function of ξ. Therefore, assuming sufficient smoothness of p(Z|ξ) as a function of ξ, the first term becomes

$\int \hat{\xi}(Z)\,\nabla_\xi p(Z|\xi)\,d|Z| = \nabla_\xi \int \hat{\xi}(Z)\,p(Z|\xi)\,d|Z| = \nabla_\xi E\{\hat{\xi}(Z)|\xi\}$   (4.2-15)

Using the definition (Equation (4.2-1)) of the bias, obtain

$\nabla_\xi E\{\hat{\xi}(Z)|\xi\} = \nabla_\xi[\xi + b(\xi)] = I + \nabla_\xi b(\xi)$   (4.2-16)

In the second term of Equation (4.2-14), ξ is not a function of Z, so

$\int \xi\,\nabla_\xi p(Z|\xi)\,d|Z| = \xi\,\nabla_\xi \int p(Z|\xi)\,d|Z| = \xi\,\nabla_\xi 1 = 0$   (4.2-17)

Using Equations (4.2-16) and (4.2-17) in Equation (4.2-14) gives

$E\{XY^*|\xi\} = I + \nabla_\xi b(\xi)$   (4.2-18)

Define the Fisher information matrix

$M(\xi) \equiv E\{YY^*|\xi\} = E\{[\nabla_\xi \ln p(Z|\xi)]^*[\nabla_\xi \ln p(Z|\xi)] \mid \xi\}$   (4.2-19)

Then by lemma (4.2-1)

$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^* \mid \xi\} \geq [I + \nabla_\xi b(\xi)]M(\xi)^{-1}[I + \nabla_\xi b(\xi)]^*$

which is the desired result.

Equation (4.2-10) is the Cramer-Rao inequality. Its specialization to unbiased estimators is of particular interest. For an unbiased estimator, b(ξ) is zero, so

$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^* \mid \xi\} \geq M(\xi)^{-1}$   (4.2-20)

This gives us a lower bound, as a function of ξ, on the achievable variance of any unbiased estimator. An unbiased estimator which attains the equality in Equation (4.2-20) is called an efficient estimator. No estimator can achieve a lower variance than an efficient estimator except by introducing a bias in the estimates. In this sense, an efficient estimator makes the maximum use of the information available in the data. The above development gives no guarantee that an efficient estimator exists for every problem. When an efficient estimator does exist, it is also a uniformly minimum variance unbiased estimator. It is much easier to check for equality in Equation (4.2-20) than to directly prove that no other unbiased estimator has a smaller variance than a given estimator. The Cramer-Rao inequality is therefore useful as a sufficient (but not necessary) check that an estimator is uniformly minimum variance unbiased.

A useful alternative expression for the information matrix M can be obtained if p(Z|ξ) is sufficiently smooth. Applying Equation (4.2-13) to the definition of M (Equation (4.2-19)) gives

$M(\xi) = E\{[p(Z|\xi)^{-1}\nabla_\xi p(Z|\xi)]^*[p(Z|\xi)^{-1}\nabla_\xi p(Z|\xi)] \mid \xi\}$   (4.2-21)

Then examine

$E\{\nabla_\xi^2 \ln p(Z|\xi) \mid \xi\} = E\{p(Z|\xi)^{-1}\nabla_\xi^2 p(Z|\xi) \mid \xi\} - E\{[p(Z|\xi)^{-1}\nabla_\xi p(Z|\xi)]^*[p(Z|\xi)^{-1}\nabla_\xi p(Z|\xi)] \mid \xi\}$   (4.2-22)

The second term is equal to M(ξ), as shown in Equation (4.2-21). Evaluate the first term as

$E\{p(Z|\xi)^{-1}\nabla_\xi^2 p(Z|\xi) \mid \xi\} = \int \nabla_\xi^2 p(Z|\xi)\,d|Z| = \nabla_\xi^2 \int p(Z|\xi)\,d|Z| = \nabla_\xi^2 1 = 0$   (4.2-23)

Thus an alternate expression for the information matrix is

$M(\xi) = -E\{\nabla_\xi^2 \ln p(Z|\xi) \mid \xi\}$   (4.2-24)
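The following sketch (assuming numpy; the scalar example is ours, not from the text) illustrates the bound of Equation (4.2-20): for Z_i ~ N(ξ, σ²) with σ known, the Fisher information is M(ξ) = N/σ², and the sample mean, an unbiased estimator, attains the bound σ²/N, so it is efficient.

```python
# Cramer-Rao bound for estimating the mean of Gaussian data (illustrative sketch).
import numpy as np

rng = np.random.default_rng(2)
xi_true, sigma, N = 2.0, 0.5, 20
M = N / sigma**2                       # Fisher information for this problem
print("Cramer-Rao bound     :", 1.0 / M)

estimates = rng.normal(xi_true, sigma, size=(200_000, N)).mean(axis=1)
print("sample-mean variance :", estimates.var())   # essentially equal: efficient
```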

4.2.4 Bayesian Optimal Estimators

The optimality conditions of the previous sections have been quite restrictive in that they must hold simultaneously for every possible value of ξ. Thus for some problems, no estimators exist that are optimal by these criteria. The Bayesian approach avoids this difficulty by using a single, overall, optimality criterion which averages the errors made for different values of ξ. With this approach, an optimal estimator may be worse than a nonoptimal one for specific values of ξ, but the overall averaged performance of the Bayesian optimal estimator will be better. The Bayesian approach requires that a loss function (risk function, optimality criterion) be defined as a function of the true value ξ and the estimate ξ̂. The most common loss function is a weighted square error

$J(\xi,\hat{\xi}) = (\xi - \hat{\xi})^* R(\xi - \hat{\xi})$   (4.2-25)

where R is a weighting matrix. An estimator is considered optimal in the Bayesian sense if it minimizes the a posteriori expected value of the loss function:

$E\{J(\xi,\hat{\xi}) \mid Z\} = \int J(\xi,\hat{\xi}(Z))\,p(\xi|Z)\,d|\xi|$   (4.2-26)

An optimal estimator must minimize this expected value for each Z. Since p(Z) is not a function of ξ̂, it does not affect the minimization of Equation (4.2-26) with respect to ξ̂. Thus a Bayesian optimal estimator also minimizes the expression

$\int J(\xi,\hat{\xi}(Z))\,p(Z|\xi)\,p(\xi)\,d|\xi|$   (4.2-27)

Note that p(ξ), the probability density of ξ, is required in order to define Bayesian optimality. For this purpose, p(ξ) can be considered simply as a weighting that is part of the loss function, if it cannot appropriately be interpreted as a true probability density or an information function (Section 4.1).

4.2.5 Asymptotic Properties

Asymptotic properties concern the characteristics of the estimates as the amount of data used increases toward infinity. The amount of data used can increase either by repeating experiments or by increasing the time slice analyzed in a single experiment. (The latter is pertinent only for dynamic systems.) Since only a finite amount of data can be used in practice, it is not immediately obvious why there is any interest in asymptotic properties.

This interest arises primarily from considerations of simplicity. It is often simpler to compute asymptotic properties and to construct asymptotically optimal estimators than to do so for finite amounts of data. We can then use the asymptotic results as good approximations to the more difficult finite data results if the amount of data used is large enough. The finite data definitions of unbiased estimators and efficient estimators have direct asymptotic analogues of interest. An estimator is asymptotically unbiased if the bias goes to zero for all ξ as the amount of data goes to infinity. An estimator is asymptotically efficient if it is asymptotically unbiased and if

$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^* \mid \xi\}\,M(\xi) \to I$   (4.2-28)

as the amount of data approaches infinity. Equation (4.2-28) is an asymptotic expression for equality in Equation (4.2-20).

One important asymptotic property has no finite data analogue. This is the notion of consistency. An estimator is consistent if ξ̂ → ξ as the amount of data goes to infinity. For strong consistency, the convergence is required to be with probability one. Note that strong consistency is defined in terms of the convergence of individual realizations of the estimates, unlike the bias, variance, and other properties, which are defined in terms of average properties (expected values). Consistency is a stronger property than asymptotic unbiasedness; that is, all consistent estimators are asymptotically unbiased. This is a basic convergence result: convergence with probability one implies convergence in distribution (and thus, specifically, convergence in mean). We refer the reader to Liptser and Shiryayev (1977), Cramér (1946), Goodwin and Payne (1977), Zacks (1971), and Mehra and Lainiotis (1976) for this and other results on consistency. Results on consistency tend to involve careful mathematical arguments relating to different types of convergence.

We will not delve deeply into asymptotic properties such as consistency in this book. We generally feel that asymptotic properties, although theoretically intriguing, should be played down in practical application. Application of infinite-time results to finite data is an approximation, one that is sometimes useful, but sometimes gives completely misleading conclusions (see Section 8.2). The inconsistency should be evident in books that spend copious time arguing fine points of distinction between different kinds of convergence and then pass off application to finite data with cursory allusions to using large data samples. Although we de-emphasize the "rigorous" treatment of asymptotic properties, some asymptotic results are crucial to practical implementation. This is not because of any improved rigor of the asymptotic results, but because the asymptotic results are often simpler, sometimes enough simpler to make the critical difference in usability. This is our primary use of asymptotic results: as simplifying approximations to the finite-time results. Introduction of complicated convergence arguments hides this essential role. The approximations work well in many cases and, as with most approximations, fail in some situations. Our emphasis in asymptotic results will center on justifying when they are appropriate and understanding when they fail.

4.3 COMMON ESTIMATORS

This section will define some of the commonly used general types of estimators. The list is far from complete; we mention only those estimators that will be used in this book. We also present a few general results characterizing the estimators.

4.3.1 A Posteriori Expected Value

One of the most natural estimates is the a posteriori expected value. This estimate is defined as the mean of the posterior distribution:

$\hat{\xi}(Z) = E\{\xi \mid Z\}$   (4.3-1)

This estimator requires that p(ξ), the prior density of ξ, be known.

4.3.2 Bayesian Minimum Risk

Bayesian optimality was defined in Section 4.2.4. Any estimator which minimizes the a posteriori expected value of the loss function is a Bayesian minimum risk estimator. (In general, there can be more than one such estimator for a given problem.) The prior distribution of ξ must be known to define Bayesian estimators.

Theorem 4.3-1 The a posteriori expected value (Section 4.3.1) is the unique Bayesian minimum risk estimator for the loss function

$J(\xi,\hat{\xi}) = (\xi - \hat{\xi})^* R(\xi - \hat{\xi})$   (4.3-2)

where R is any positive definite symmetric matrix.

Proof A Bayesian minimum risk estimator must minimize

$E\{J \mid Z\} = E\{(\xi - \hat{\xi}(Z))^* R(\xi - \hat{\xi}(Z)) \mid Z\}$   (4.3-3)

Since R is symmetric, the gradient of this function is

$\nabla_{\hat{\xi}} E\{J \mid Z\} = -2E\{R(\xi - \hat{\xi}(Z)) \mid Z\}$   (4.3-4)

Setting this expression to zero gives

$0 = R\,E\{\xi - \hat{\xi}(Z) \mid Z\} = R[E\{\xi \mid Z\} - \hat{\xi}(Z)]$   (4.3-5)

Therefore

$\hat{\xi}(Z) = E\{\xi \mid Z\}$   (4.3-6)

is the unique stationary point of E{J|Z}. The second gradient is

$\nabla_{\hat{\xi}}^2 E\{J \mid Z\} = 2R > 0$   (4.3-7)

so the stationary point is the global minimum.

Theorem (4.3-1) applies only for the quadratic loss function of Equation (4.3-2). The following very similar theorem applies to a much broader class of loss functions, but requires the assumption that p(ξ|Z) is symmetric about its mean. Theorem (4.3-1) makes no assumptions about p(ξ|Z) except that it has finite mean and variance.

Theorem 4.3-2 Assume that p(ξ|Z) is symmetric about its mean for each Z; i.e.,
$p_{\xi|Z}(\hat{\xi}(Z) + \varepsilon \mid Z) = p_{\xi|Z}(\hat{\xi}(Z) - \varepsilon \mid Z)$   (4.3-8)

where ξ̂(Z) is the expected value of ξ given Z. Then the a posteriori expected value is the unique Bayesian minimum risk estimator for any loss function of the form

$J(\xi,\hat{\xi}) = J_1(\xi - \hat{\xi})$   (4.3-9)

where J₁ is symmetric about 0 and is strictly convex.

Proof We need to demonstrate that

$D(a) \equiv E\{J(\xi,\hat{\xi}(Z) + a) \mid Z\} - E\{J(\xi,\hat{\xi}(Z)) \mid Z\} > 0$   (4.3-10)

for all a ≠ 0. Using Equation (4.3-9) and the definition of expectation,

$D(a) = \int p(\xi|Z)[J_1(\xi - \hat{\xi}(Z) - a) - J_1(\xi - \hat{\xi}(Z))]\,d|\xi|$   (4.3-11)

Because of the symmetry of p(ξ|Z), we can replace the integral in Equation (4.3-11) by an integral over the region

$S = \{\xi : (\xi - \hat{\xi}(Z),\,a) \geq 0\}$

Using the symmetry of J₁ gives

$D(a) = \int_S p(\xi|Z)[J_1(\xi - \hat{\xi}(Z) - a) + J_1(\xi - \hat{\xi}(Z) + a) - 2J_1(\xi - \hat{\xi}(Z))]\,d|\xi|$   (4.3-15)

By the strict convexity of J₁,

$J_1(\xi - \hat{\xi}(Z) - a) + J_1(\xi - \hat{\xi}(Z) + a) > 2J_1(\xi - \hat{\xi}(Z))$

for all a ≠ 0. Therefore D(a) > 0 for all a ≠ 0, as we desired to show.

Note that if J₁ is convex, but not strictly convex, Theorem (4.3-2) still holds except for the uniqueness. Theorems (4.3-1) and (4.3-2) are two of the basic results in the theory of estimation. They motivate the use of a posteriori expected value estimators.
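A small numerical sketch of what these theorems assert (assuming numpy; the posterior shape is an arbitrary Gaussian chosen for illustration): scanning candidate estimates c, the a posteriori expected quadratic loss E{(ξ - c)²|Z} is minimized at the posterior mean.

```python
# Posterior mean minimizes the expected quadratic loss (illustrative sketch).
import numpy as np

xi = np.linspace(-4.0, 6.0, 4001)                 # grid over the parameter
post = np.exp(-0.5 * (xi - 1.2) ** 2 / 0.3)       # an assumed posterior density
post /= np.trapz(post, xi)                        # normalize

cands = np.linspace(-1.0, 3.0, 801)               # candidate estimates
risk = [np.trapz((xi - c) ** 2 * post, xi) for c in cands]
print("risk minimizer:", cands[np.argmin(risk)])
print("posterior mean:", np.trapz(xi * post, xi))  # the two agree
```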

4.3.3 Maximum a Posteriori Probability
The maximum a posteriori probability (MAP) estimate is defined as the mode of the posterior distribution of ξ (i.e., the value which maximizes the posterior density function). If the distribution is not unimodal, the MAP estimate may not be unique. As with the previously discussed estimators, the prior distribution of ξ must be known in order to define the MAP estimate.

The MAP estimate is equal to the a posteriori expected value (and thus to the Bayesian minimum risk for loss functions meeting the conditions of Theorem (4.3-2)) if the posterior distribution is symmetric about its mean and unimodal, since the mode and the mean of such distributions are equal. For nonsymmetric distributions, this equality does not hold.

The MAP estimate is generally much easier to calculate than the a posteriori expected value. The a posteriori expected value is (from Equation (4.3-1))

$\hat{\xi} = \int \xi\,p(\xi|Z)\,d|\xi| = \frac{\int \xi\,p(Z|\xi)\,p(\xi)\,d|\xi|}{\int p(Z|\xi)\,p(\xi)\,d|\xi|}$   (4.3-16)

This calculation requires the evaluation of two integrals over ξ. The MAP estimate requires the maximization of

$p(\xi|Z) = \frac{p(Z|\xi)\,p(\xi)}{p(Z)}$   (4.3-17)

with respect to ξ. The p(Z) is not a function of ξ, so the MAP estimate can also be obtained by

$\hat{\xi} = \arg\max_\xi\, p(Z|\xi)\,p(\xi)$   (4.3-18)

The "arg max" notation indicates that ξ̂ is the value of ξ that maximizes the density function p(Z|ξ)p(ξ). The maximization in Equation (4.3-18) is generally much simpler than the integrations in Equation (4.3-16).
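To make the computational contrast concrete, the sketch below (assuming numpy; the scalar Gaussian prior and likelihood are our own illustration) computes Equation (4.3-16) by two numerical integrations and Equation (4.3-18) by a single maximization over a grid; for this symmetric, unimodal posterior the two estimates coincide.

```python
# A posteriori expected value (two integrals) versus MAP (one maximization).
import numpy as np

xi = np.linspace(-5.0, 5.0, 20001)                  # parameter grid
prior = np.exp(-0.5 * (xi - 1.0) ** 2)              # p(xi): N(1,1), unnormalized
Z = 0.2
lik = np.exp(-0.5 * (Z - xi) ** 2 / 0.25)           # p(Z|xi): N(xi, 0.25)

unnorm = lik * prior                                # p(Z|xi) p(xi)
post_mean = np.trapz(xi * unnorm, xi) / np.trapz(unnorm, xi)  # Eq. (4.3-16)
map_est = xi[np.argmax(unnorm)]                     # Eq. (4.3-18)
print(post_mean, map_est)                           # equal for this posterior
```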

4.3.4 Maximum Likelihood Estimators

The previous estimators have all required that the prior distribution of ξ be known. When ξ is not random or when its distribution is not known, there are far fewer reasonable estimators to choose from. Maximum likelihood estimators are the only type that we will discuss. The maximum likelihood estimate is defined as the value of ξ which maximizes the likelihood functional p(Z|ξ); in other words,

$\hat{\xi} = \arg\max_\xi\, p(Z|\xi)$   (4.3-19)

The maximum likelihood estimator is closely related to the MAP estimator. The MAP estimator maximizes p(ξ|Z); heuristically, we could say that the MAP estimator selects the most probable value of ξ, given the data. The maximum likelihood estimator maximizes p(Z|ξ); i.e., it selects the value of ξ which makes the observed data most plausible. Although these may sound like two statements of the same concept, there are crucial differences. One of the most central differences is that maximum likelihood is defined whether or not the prior distribution of ξ is known. Comparing Equation (4.3-18) with Equation (4.3-19) reveals that the maximum likelihood estimate is identical to the MAP estimate if p(ξ) is a constant. If the parameter space Ξ has finite size, this implies that p(ξ) is the uniform distribution. For infinite Ξ, such as Rⁿ, there are no uniform distributions, so a strict equivalence cannot be established. If we relax our definition of a probability distribution to allow arbitrary density functions which need not integrate to 1 (sometimes called generalized probabilities), the equivalence can be established for any Ξ. Alternately, the uniform distribution for infinite size Ξ can be viewed as a limiting case of distributions with variance going to infinity (less and less prior certainty about the value of ξ). The maximum likelihood estimator places no preference on any value of ξ over any other value of ξ; the estimate is solely a function of the data. The MAP estimate, on the other hand, considers both the data and the preference defined by the prior distribution.

Maximum likelihood estimators have many interesting properties, which we will cover later. One of the most basic is given by the following theorem:

Theorem 4.3-3 If an efficient estimator exists for a problem, that estimator is a maximum likelihood estimator.

Proof (This proof requires the use of the full notation for probability density functions to avoid confusion.) Assume that ξ̂(Z) is any efficient estimator. An estimator will be efficient if and only if equality holds in lemma (4.2-1). Equality holds if and only if X = AY in Equation (4.2-6). Substituting for A from Equation (4.2-8) gives

$\hat{\xi}(Z) - \xi = E\{XY^*|\xi\}[E\{YY^*|\xi\}]^{-1}\,\nabla_\xi^* \ln p_{Z|\xi}(Z|\xi)$   (4.3-20)

Substituting for X and Y as in the proof of the Cramer-Rao bound, and using Equations (4.2-18) and (4.2-19), gives

$\hat{\xi}(Z) - \xi = [I + \nabla_\xi b(\xi)]M(\xi)^{-1}\,\nabla_\xi^* \ln p_{Z|\xi}(Z|\xi)$   (4.3-21)

Efficient estimators must be unbiased, so b(ξ) is zero and

$\hat{\xi}(Z) - \xi = M(\xi)^{-1}\,\nabla_\xi^* \ln p_{Z|\xi}(Z|\xi)$   (4.3-22)

For an efficient estimator, Equation (4.3-22) must hold for all values of Z and ξ. In particular, for each Z, the equation must hold for ξ = ξ̂(Z). The left-hand side is then zero, so we must have

$\nabla_\xi \ln p_{Z|\xi}(Z|\xi)\big|_{\xi=\hat{\xi}(Z)} = 0$   (4.3-23)

The estimate is thus at a stationary point of the likelihood functional. Taking the gradient of Equation (4.3-22) gives

$-I = M(\xi)^{-1}\nabla_\xi^2 \ln p_{Z|\xi}(Z|\xi) - M(\xi)^{-1}[\nabla_\xi M(\xi)]M(\xi)^{-1}\,\nabla_\xi^* \ln p_{Z|\xi}(Z|\xi)$

Evaluating this at ξ = ξ̂(Z), and using Equation (4.3-23), gives

$\nabla_\xi^2 \ln p_{Z|\xi}(Z|\xi)\big|_{\xi=\hat{\xi}(Z)} = -M(\hat{\xi}(Z))$

Since M is positive definite, the stationary point is a local maximum. In fact, it is the only local maximum, because a local maximum at any point other than ξ = ξ̂(Z) would violate Equation (4.3-22). The requirement for M(ξ) to be finite implies that p_{Z|ξ}(Z|ξ) → 0 as ξ → ∞, so the local maximum will be a global maximum. Therefore ξ̂ is a maximum likelihood estimator.


Corollary All efficient estimators for a problem are equivalent (i.e., if an efficient estimator exists, it is unique).

This theorem and its corollary are not as useful as they might seem at first glance, because efficient estimators do not exist for many problems. Therefore, it is not always true that a maximum likelihood estimator is efficient. The theorem does apply to some simple problems, however, and motivates the more widely applicable asymptotic results which will be discussed later.

Maximum likelihood estimates have the following natural invariance property: let ξ̂ be the maximum likelihood estimate of ξ; then f(ξ̂) is the maximum likelihood estimate of f(ξ) for any function f. The proof of this statement is trivial if f is invertible. Let L_ξ(ξ,Z) be the likelihood functional of ξ given Z, and define x = f(ξ). Then the likelihood function of x is

$L_x(x,Z) = L_\xi(f^{-1}(x),Z)$

This is the crucial equation. By definition, the left-hand side is maximized by x = x̂, and the right-hand side is maximized by f⁻¹(x) = ξ̂. Therefore

$\hat{x} = f(\hat{\xi})$   (4.3-26)
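The invariance property can be checked numerically. The sketch below (assuming numpy; the zero-mean Gaussian example is hypothetical) estimates the standard deviation σ of N(0, σ²) data both by transforming the variance MLE and by maximizing the log likelihood over σ directly.

```python
# Invariance of maximum likelihood estimates (illustrative sketch).
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(0.0, 2.0, size=10_000)
var_mle = np.mean(z ** 2)                 # MLE of the variance (known zero mean)

sig = np.linspace(1.5, 2.5, 2001)         # grid of candidate sigma values
loglik = -z.size * np.log(sig) - 0.5 * np.sum(z ** 2) / sig ** 2
print(sig[np.argmax(loglik)])             # direct MLE of sigma
print(np.sqrt(var_mle))                   # f(xi_hat): agrees to grid resolution
```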

The extension to noninvertible f is straightforward: simply realize that f⁻¹(x) is a set of values, rather than a single value. The same argument then still holds, regarding L_x(x,Z) as a one-to-many function (set-valued function). Finally, let us emphasize that, although maximum likelihood estimates are formally identical to MAP estimates with uniform prior distributions, there is a basic theoretical difference in interpretation. Maximum likelihood makes no statements about distributions of ξ, prior or posterior. Stating that a parameter has a uniform prior distribution is drastically different from saying that we have no information about the parameter. Several classic "paradoxes" of probability theory resulted from ignoring this difference. The paradoxes arise in transformations of variable. Let a scalar ξ have a uniform prior distribution, and let f be any continuous invertible function. Then, by Equation (3.4-1), x = f(ξ) has the density function

$p_x(x) = p_\xi(f^{-1}(x))\left|\frac{d}{dx} f^{-1}(x)\right|$

which is not a uniform distribution on x (unless f is linear). Thus if we say that there is no prior information (uniform distribution) about ξ, then this gives us prior information (nonuniform distribution) about x, and vice versa. This apparent paradox results from equating a uniform distribution with the idea of "no information." Therefore, although we can formally derive the equations for maximum likelihood estimators by substituting uniform prior distributions in the equations for MAP estimators, we must avoid misinterpretations. Fisher (1921, p. 326) discussed this subject at length:

There would be no need to emphasize the baseless character of the assumptions made under the titles of inverse probability and BAYES' Theorem in view of the decisive criticism to which they have been exposed.... I must indeed plead guilty in my original statement of the Method of Maximum Likelihood (1912) to having based my argument upon the principle of inverse probability; in the same paper, it is true, I emphasized the fact that such inverse probabilities were relative only. That is to say, that while one might speak of one value of p as having an inverse probability three times that of another value of p, we might on no account introduce the differential element dp, so as to be able to say that it was three times as probable that p should lie in one rather than the other of two equal elements. Upon consideration, therefore, I perceive that the word probability is wrongly used in such a connection: probability is a ratio of frequencies, and about the frequencies of such values we can know nothing whatever. We must return to the actual fact that one value of p, of the frequency of which we know nothing, would yield the observed result three times as frequently as would another value of p. If we need a word to characterize this relative property of different values of p, I suggest that we may speak without confusion of the likelihood of one value of p being thrice the likelihood of another, bearing always in mind that likelihood is not here used loosely as a synonym of probability, but simply to express the relative frequencies with which such values of the hypothetical quantity p would in fact yield the observed sample.

CHAPTER 5

5.0 THE STATIC ESTIMATION PROBLEM

This chapter begins the application of the general types of estimators defined in Chapter 4 to specific problems. The problems discussed in this chapter are static estimation problems; that is, problems where time is not explicitly involved. Subsequent chapters on dynamic systems draw heavily on these static results. Our treatment is far from complete; it is easy to spend an entire book on static estimation alone (Sorenson, 1980). The material presented here was selected largely on the basis of relevance to dynamic systems. We concentrate primarily on linear systems with additive Gaussian noise, where there are simple, closed-form solutions. We also cover nonlinear systems with additive Gaussian noise, which will prove of major importance in Chapter 8. Non-Gaussian and nonadditive noise are mentioned only briefly, except for the special problem of estimation of variance. We will initially treat nonsingular problems, where we assume that all relevant distributions have density functions. The understanding and handling of singular and ill-conditioned problems then receive special attention. Singularities and ill-conditioning are crucial issues in practical application, but are insufficiently treated in much of the current literature. We also discuss partitioning of estimation problems, an important technique for simplifying the computational task and treating some singularities.

The general form of a static system model is

$Z = f(\xi, U, \omega)$   (5.0-1)

We apply a known specific input U (or a set of inputs) to the system, and measure the response Z. The vector ω is a random vector contaminating the measured system response. We desire to estimate the value of ξ.

The estimators discussed in Chapter 4 require knowledge of the conditional distribution of Z given ξ and U. We assume, for now, that the distribution is nonsingular, with density p(Z|ξ,U). If ξ is considered random, you must know the joint density p(Z,ξ|U). In some simple cases, these densities might be given directly, in which case Equation (5.0-1) is not necessary; the estimators of Chapter 4 apply directly. More typically, p(Z|ξ,U) is a complicated density which is derived from Equation (5.0-1) and p(ω|ξ,U); it is often reasonable to assume quite simple distributions for ω, independent of ξ and U. In this chapter, we will look at several specific cases.

5.1 LINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE

The simplest and most classic results are obtained for linear static systems with additive Gaussian noise. The system equations are assumed to have the form

$Z = C(U)\xi + D(U) + G(U)\omega$   (5.1-1)

For any particular U, Z is a linear combination of ξ, ω, and a constant vector. Note that there are no assumptions about linearity with respect to U; the functions C, D, and G can be arbitrarily complicated. Throughout this section, we omit the explicit dependence on U from the notation. Similarly, all distributions and expectations are implicitly understood to be conditioned on U. The random noise vector ω is assumed to be Gaussian and independent of ξ. By convention, we will define the mean of ω to be 0, and the covariance to be identity. This convention does not limit the generality of Equation (5.1-1), for if ω has a mean m and a finite covariance FF*, we can define Ḡ = GF and D̄ = D + Gm to obtain

$Z = C\xi + \bar{D} + \bar{G}\omega_1$   (5.1-2)

where ω₁ has zero mean and identity covariance.

When ξ is considered as random, we will assume that its marginal (prior) distribution is Gaussian with mean m_ξ and covariance P:

$p(\xi) = |2\pi P|^{-1/2} \exp\left[-\frac{1}{2}(\xi - m_\xi)^* P^{-1}(\xi - m_\xi)\right]$   (5.1-3)

Equation (5.1-3) assumes that P is nonsingular. We will discuss the implications and handling of singular cases later.

5.1.1 Joint Distribution of Z and ξ

Several d l s t r l b u t i o n s which can be derived from Equation (5.1-1) w i l l be r q u l n d I n order t o analyze t h l s s y s t w . I * t us f i r s t conslder p(ZIc), the conditional density of Z given C. This d i s t r l b u t i o n i s defined whether C I s r a n d m o r not. I f { i s given, then Equation (5.1-1) i s simply the sun o f a copstant vector and r constant n a t r l x times a (irusslrn vector. Uslng the p m p e r t l e s o f Iirussian distributions discussed i n Chapter 3, we see t h a t the conditional d i s t r l b u t l o n o f Z glven { l s Qussfrn w l t h man and covarirncr.

Thus, assuming that GG* is nonsingular,

$p(Z|\xi) = |2\pi GG^*|^{-1/2} \exp\left[-\frac{1}{2}(Z - C\xi - D)^*(GG^*)^{-1}(Z - C\xi - D)\right]$   (5.1-6)

If ξ is random, with marginal density given by Equation (5.1-3), we can also meaningfully define the joint distribution of Z and ξ, the conditional distribution of ξ given Z, and the marginal distribution of Z. For the marginal distribution of Z, note that Equation (5.1-1) is a linear combination of independent Gaussian vectors. Therefore Z is Gaussian with mean and covariance

$E\{Z\} = Cm_\xi + D$   (5.1-7)

$\mathrm{cov}(Z) = CPC^* + GG^*$   (5.1-8)

For the joint distribution of ξ and Z, we now require the cross-correlation

$E\{[Z - E(Z)][\xi - E(\xi)]^*\} = CP$   (5.1-9)

The joint distribution of ξ and Z is thus Gaussian with mean and covariance

$E\left\{\begin{bmatrix} \xi \\ Z \end{bmatrix}\right\} = \begin{bmatrix} m_\xi \\ Cm_\xi + D \end{bmatrix}$   (5.1-10)

$\mathrm{cov}\left\{\begin{bmatrix} \xi \\ Z \end{bmatrix}\right\} = \begin{bmatrix} P & PC^* \\ CP & CPC^* + GG^* \end{bmatrix}$   (5.1-11)

Note that this joint distribution could also be derived by multiplying Equations (5.1-3) and (5.1-6) according to Bayes rule. That derivation arrives at the same results as Equations (5.1-10) and (5.1-11), but is much more tedious. Finally, we can derive the conditional distribution of ξ given Z (the posterior distribution of ξ) from the joint distribution of ξ and Z. Applying Theorem (3.5-9) to Equations (5.1-10) and (5.1-11), we see that the conditional distribution of ξ given Z is Gaussian with mean and covariance

$E\{\xi|Z\} = m_\xi + PC^*(CPC^* + GG^*)^{-1}(Z - Cm_\xi - D)$   (5.1-12)

$\mathrm{cov}(\xi|Z) = P - PC^*(CPC^* + GG^*)^{-1}CP$   (5.1-13)

Equations (5.1-12) and (5.1-13) assume that CPC* + GG* is nonsingular. If this matrix is singular, the problem is ill-posed and should be restated. We will discuss the singular case later.

Assuming that P, GG*, and (C*(GG*)⁻¹C + P⁻¹) are nonsingular, we can use the matrix inversion lemmas (Lemmas (A.1-3) and (A.1-4)) to put Equations (5.1-12) and (5.1-13) into forms that will prove intuitively useful:

$E\{\xi|Z\} = m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1}C^*(GG^*)^{-1}(Z - Cm_\xi - D)$   (5.1-14)

$\mathrm{cov}(\xi|Z) = (C^*(GG^*)^{-1}C + P^{-1})^{-1}$   (5.1-15)

We will have much occasion to contrast the form of Equations (5.1-12) and (5.1-13) with the form of Equations (5.1-14) and (5.1-15). We will call Equations (5.1-12) and (5.1-13) the covariance form because they are in terms of the uninverted covariances P and GG*. Equations (5.1-14) and (5.1-15) are called the information form because they are in terms of the inverses P⁻¹ and (GG*)⁻¹, which are related to the amount of information. (The larger the covariance, the less information you have, and vice versa.) Equation (5.1-15) has an interpretation as addition of information: P⁻¹ is the amount of prior information about ξ, and C*(GG*)⁻¹C is the amount of information in the measurement; the total information after the measurement is the sum of these two terms.
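The algebraic equivalence of the two forms is easy to verify numerically. The sketch below (assuming numpy; the matrices are randomly generated placeholders) evaluates the posterior mean and covariance both ways and confirms agreement.

```python
# Covariance form (5.1-12), (5.1-13) versus information form (5.1-14), (5.1-15).
import numpy as np

rng = np.random.default_rng(4)
C = rng.normal(size=(4, 2)); D = rng.normal(size=4)
GG = 0.09 * np.eye(4)                       # GG*
P = np.diag([2.0, 0.5]); m = np.array([1.0, -1.0])
Z = rng.normal(size=4)

S = C @ P @ C.T + GG                        # CPC* + GG*
mean_c = m + P @ C.T @ np.linalg.solve(S, Z - C @ m - D)
cov_c = P - P @ C.T @ np.linalg.solve(S, C @ P)

info = C.T @ np.linalg.solve(GG, C) + np.linalg.inv(P)
mean_i = m + np.linalg.solve(info, C.T @ np.linalg.solve(GG, Z - C @ m - D))
cov_i = np.linalg.inv(info)
print(np.allclose(mean_c, mean_i), np.allclose(cov_c, cov_i))   # True True
```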
5.1.2 A Posteriori Estimators

Let us first examine the three types of estimators that are based on the posterior distribution p(ξ|Z). These three types of estimators are a posteriori expected value, maximum a posteriori probability, and Bayesian minimum risk.
We previously derived the expression for the a posteriori expected value in the process of defining the posterior distribution. Either the covariance or information form can be used. We will use the information form because it ties in with other approaches, as will be seen below. The a posteriori expected value estimator is thus

$\hat{\xi} = m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1}C^*(GG^*)^{-1}(Z - Cm_\xi - D)$   (5.1-16)

The maximum a posteriori probability estimate is equal to the a posteriori expected value because the posterior distribution is Gaussian (and thus unimodal and symmetric about its mean). This fact suggests an alternate derivation of Equation (5.1-16) which is quite enlightening. To find the maximum point of the posterior distribution of ξ given Z, write

$p(\xi|Z) = \frac{p(Z|\xi)\,p(\xi)}{p(Z)}$   (5.1-17)

Expanding this equation using Equations (5.1-3) and (5.1-6) gives

$\ln p(\xi|Z) = -\frac{1}{2}(Z - C\xi - D)^*(GG^*)^{-1}(Z - C\xi - D) - \frac{1}{2}(\xi - m_\xi)^* P^{-1}(\xi - m_\xi) + a(Z)$   (5.1-18)

where a(Z) is a function of Z only. Equation (5.1-18) shows the problem in its "least squares" form. We are attempting to choose ξ to minimize weighted sums of squares of (ξ - m_ξ) and (Z - Cξ - D). The matrices P⁻¹ and (GG*)⁻¹ are weightings used in the cost function. The larger the value of (GG*)⁻¹, the more importance is placed on minimizing (Z - Cξ - D), and vice versa.

Obtain the estimate by setting the gradient of Equation (5.1-18) to zero, as suggested by Equation (3.5-17):

$0 = C^*(GG^*)^{-1}(Z - C\hat{\xi} - D) - P^{-1}(\hat{\xi} - m_\xi)$   (5.1-19)

Write this as

$0 = C^*(GG^*)^{-1}(Z - Cm_\xi - D) - P^{-1}(\hat{\xi} - m_\xi) - C^*(GG^*)^{-1}C(\hat{\xi} - m_\xi)$   (5.1-20)

and the solution is

$\hat{\xi} = m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1}C^*(GG^*)^{-1}(Z - Cm_\xi - D)$   (5.1-21)

assuming that the inverses exist.

For Gaussian distributions, Equation (3.5-18) gives the covariance as

$\mathrm{cov}(\xi|Z) = [-\nabla_\xi^2 \ln p(\xi|Z)]^{-1} = (C^*(GG^*)^{-1}C + P^{-1})^{-1}$   (5.1-22)

Note that the second gradient is negative definite (and the covariance positive definite), verifying that the solution is a maximum of the posterior probability density function. This derivation does not require the use of matrix inversion lemmas, or the expression from Chapter 3 for the Gaussian conditional distribution. For more complicated problems, such as conditional distributions of N jointly Gaussian vectors, the alternate derivation as in Equations (5.1-17) to (5.1-22) is much easier than the straightforward derivation as in Equations (5.1-10) to (5.1-15). Because of the symmetry of the posterior distribution, the Bayesian optimal estimate is also equal to the a posteriori expected value estimate if the Bayes loss function meets the criteria of Theorem (4.3-1).

We will now examine the statistical properties of the estimator given by Equation (5.1-16). Since the estimator is a linear function of Z, the bias is easy to compute:

$b(\xi) = E\{\hat{\xi} - \xi \mid \xi\} = E\{m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1}C^*(GG^*)^{-1}(Z - Cm_\xi - D) \mid \xi\} - \xi = [I - (C^*(GG^*)^{-1}C + P^{-1})^{-1}C^*(GG^*)^{-1}C](m_\xi - \xi)$   (5.1-23)

The estimator is biased for ξ ≠ m_ξ, given nonsingular P and GG*. The scalar case gives some insight into this bias. If ξ is scalar, the factor in brackets in Equation (5.1-23) lies between 0 and 1. As GG* decreases and/or P increases, the factor approaches 0, as does the bias. In this case, the estimator obtains less information from the initial guess of ξ (which has large covariance), and more information from the measurement (which has small covariance). If the situation is reversed, GG* increasing and/or P decreasing, the bias becomes larger. In this case, the estimator shows an increasing predilection to ignore the measured response and to keep the initial guess of ξ.

The variance and mean square error are also easy to compute. The variance of ξ̂ follows directly from Equations (5.1-16) and (5.1-5):

$\mathrm{cov}(\hat{\xi}|\xi) = (C^*(GG^*)^{-1}C + P^{-1})^{-1}C^*(GG^*)^{-1}GG^*(GG^*)^{-1}C(C^*(GG^*)^{-1}C + P^{-1})^{-1} = (C^*(GG^*)^{-1}C + P^{-1})^{-1}C^*(GG^*)^{-1}C(C^*(GG^*)^{-1}C + P^{-1})^{-1}$   (5.1-24)

The mean square error is then

$\mathrm{mse}(\hat{\xi}|\xi) = \mathrm{cov}(\hat{\xi}|\xi) + b(\xi)b(\xi)^*$   (5.1-25)

which is evaluated using Equations (5.1-23) and (5.1-24).

The most obvious question to ask in relation to Equations (5.1-24) and (5.1-25) is how they compare with other estimators and with the Cramer-Rao bound. Let us evaluate the Cramer-Rao bound. The Fisher information matrix (Equation (4.2-19)) is easy to compute using Equation (5.1-6):

$M(\xi) = C^*(GG^*)^{-1}C$   (5.1-26)

Thus the Cramer-Rao bound for unbiased estimators is

$\mathrm{mse}(\hat{\xi}|\xi) \geq (C^*(GG^*)^{-1}C)^{-1}$   (5.1-27)
Note that, for some values of ξ, the a posteriori expected value estimator has a lower mean-square error than the Cramer-Rao bound for unbiased estimators; naturally, this is because the estimator is biased. To compute the Cramer-Rao bound for an estimator with bias given by Equation (5.1-23), we need to evaluate

$I + \nabla_\xi b(\xi) = (C^*(GG^*)^{-1}C + P^{-1})^{-1}C^*(GG^*)^{-1}C$   (5.1-28)

The Cramer-Rao bound is then (from Equation (4.2-10))

$\mathrm{mse}(\hat{\xi}|\xi) \geq (C^*(GG^*)^{-1}C + P^{-1})^{-1}C^*(GG^*)^{-1}C(C^*(GG^*)^{-1}C + P^{-1})^{-1}$   (5.1-29)

Note that the estimator does not achieve the Cramer-Rao bound except at the single point ξ = m_ξ. At every other point, the second term in Equation (5.1-25) is positive, and the first term is equal to the bound; therefore, the mse is greater than the bound. For a single observation, we can say in summary that the a posteriori estimator is optimal Bayesian for a large class of loss functions, but it is biased and does not achieve the Cramer-Rao lower bound.

It remains to investigate the asymptotic properties. The asymptotic behavior of estimators for static systems is defined in terms of N independent repetitions of the experiment, where N approaches infinity. We must first define the application of the a posteriori estimator to repeated experiments. Assume that the system model is given by Equation (5.1-1), with ξ distributed according to Equation (5.1-3). Perform N experiments U₁...U_N. (It does not matter whether the U_i are distinct.) The corresponding system matrices are C_i, D_i, and G_i, and the measurements are Z_i. The random noise ω_i is an independent, zero-mean, identity-covariance, Gaussian vector for each i. The maximum a posteriori estimate of ξ is given by

$\hat{\xi} = m_\xi + \left(\sum_{i=1}^N C_i^*(G_iG_i^*)^{-1}C_i + P^{-1}\right)^{-1} \sum_{i=1}^N C_i^*(G_iG_i^*)^{-1}(Z_i - C_im_\xi - D_i)$   (5.1-30)

assuming that the inverses exist. The asymptotic properties are defined for repetition of the same experiment, so we do not need the full generality of Equation (5.1-30). If U_i = U_j, C_i = C_j, D_i = D_j, and G_i = G_j for all i and j, Equation (5.1-30) can be written

$\hat{\xi} = m_\xi + [NC^*(GG^*)^{-1}C + P^{-1}]^{-1}C^*(GG^*)^{-1}\sum_{i=1}^N (Z_i - Cm_\xi - D)$   (5.1-31)

Compute the bias, covariance, and mse of this estimate in the same manner as Equations (5.1-23) to (5.1-25):

$b(\xi) = [I - (NC^*(GG^*)^{-1}C + P^{-1})^{-1}NC^*(GG^*)^{-1}C](m_\xi - \xi)$   (5.1-32)

$\mathrm{cov}(\hat{\xi}|\xi) = [NC^*(GG^*)^{-1}C + P^{-1}]^{-1}NC^*(GG^*)^{-1}C[NC^*(GG^*)^{-1}C + P^{-1}]^{-1}$   (5.1-33)

$\mathrm{mse}(\hat{\xi}|\xi) = \mathrm{cov}(\hat{\xi}|\xi) + b(\xi)b(\xi)^*$   (5.1-34)

The Cramer-Rao bound for unbiased estimators is

$\mathrm{mse}(\hat{\xi}|\xi) \geq (NC^*(GG^*)^{-1}C)^{-1}$   (5.1-35)

As N increases, Equation (5.1-32) goes to zero, so the estimator is asymptotically unbiased. The effect of increasing N is exactly comparable to increasing (GG*)⁻¹; as we take more and better quality measurements, the estimator depends more heavily on the measurements and less on its initial guess. The estimator is also asymptotically efficient as defined by Equation (4.2-28) because

$\mathrm{cov}(\hat{\xi}|\xi)[NC^*(GG^*)^{-1}C] \to I$   (5.1-36)

$b(\xi)b(\xi)^*[NC^*(GG^*)^{-1}C] \to 0$   (5.1-37)

5.1.3 Maximum Likelihood Estimator

The derivation of the expression for the maximum likelihood estimator is similar to the derivation of the maximum a posteriori probability estimator done in Equations (5.1-17) to (5.1-22); the only difference is that instead of maximizing ln p(ξ|Z), we maximize

$\ln p(Z|\xi) = -\frac{1}{2}(Z - C\xi - D)^*(GG^*)^{-1}(Z - C\xi - D) + a(Z)$   (5.1-38)

The only relevant difference between Equation (5.1-38) and Equation (5.1-18) is the inclusion of the term based on the prior distribution of ξ in Equation (5.1-18). (The a(Z) are also different, but this is of no consequence at the moment.) The maximum likelihood estimate does not make use of the prior distribution; indeed, it does not require that such a distribution exist. We will see that many of the MLE results are equal to the MAP results with the terms from the prior distribution omitted. Find the maximum point of Equation (5.1-38) by setting the gradient to zero:

$0 = C^*(GG^*)^{-1}(Z - C\hat{\xi} - D)$   (5.1-39)

The solution, assuming that C*(GG*)⁻¹C is nonsingular, is given by

$\hat{\xi} = (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}(Z - D)$   (5.1-40)

This is the same form as that of the MAP estimate, Equation (5.1-21), with P⁻¹ set to zero.

A particularly simple case occurs when C = I and D = 0. In this event, Equation (5.1-40) reduces to

$\hat{\xi} = Z$   (5.1-41)

Note that the expression (C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹ is a left-inverse of C; that is,

$[(C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}]C = I$

We can view the estimator given by Equation (5.1-40) as a pseudo-inverse of the system given by Equation (5.1-1). Using both equations, write

$\hat{\xi} = (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}(C\xi + D + G\omega - D) = \xi + (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}G\omega$   (5.1-42)

Although we must use Equation (5.1-40) to compute ξ̂, because ξ and ω are not known, Equation (5.1-42) is useful in analyzing and understanding the behavior of the estimator. One interesting point is immediately obvious from Equation (5.1-42): the estimate is simply the sum of the true value plus the effect of the contaminating noise ω. For the particular realization ω = 0, the estimate is exactly equal to the true value. This property, which is not shared by the a posteriori estimators, is closely related to the bias. Indeed, the bias of the maximum likelihood estimator is immediately evident from Equation (5.1-42):

$b(\xi) = E\{\hat{\xi} - \xi \mid \xi\} = (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}G\,E\{\omega\} = 0$   (5.1-43)

The maximum likelihood estimate is thus unbiased. Note that Equation (5.1-32) for the MAP bias gives the same result if we substitute 0 for P⁻¹.

Since the estimator is unbiased, the covariance and mean square error are equal. Using Equation (5.1-42), they are given by

$\mathrm{cov}(\hat{\xi}|\xi) = \mathrm{mse}(\hat{\xi}|\xi) = (C^*(GG^*)^{-1}C)^{-1}$   (5.1-44)

We can also obtain this result from Equations (5.1-33) and (5.1-34) for the MAP estimator by substituting 0 for P⁻¹.
We previously computed the Cramer-Rao bound for unbiased estimators for this problem (Equation (5.1-27)). The mean square error of the maximum likelihood estimator is exactly equal to the Cramer-Rao bound. The maximum likelihood estimator is thus efficient and is, therefore, a minimum variance unbiased estimator. The maximum likelihood estimator is not, in general, Bayesian optimal. Bayesian optimality may not even be defined, since ξ need not be random.
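These properties are easy to observe in simulation. The sketch below (assuming numpy; the matrices are arbitrary illustrative values) applies Equation (5.1-40) to repeated realizations and compares the sample mean and covariance of the estimates with the true value and the bound of Equation (5.1-27).

```python
# Monte Carlo check that the MLE is unbiased and attains the Cramer-Rao bound.
import numpy as np

rng = np.random.default_rng(5)
C = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
D = np.array([0.1, -0.2, 0.3])
G = 0.2 * np.eye(3); GG = G @ G.T
xi_true = np.array([0.7, -1.1])

A = np.linalg.solve(C.T @ np.linalg.solve(GG, C), C.T @ np.linalg.inv(GG))
ests = np.array([A @ (C @ xi_true + D + G @ rng.normal(size=3) - D)
                 for _ in range(50_000)])             # Equation (5.1-40)
print(ests.mean(axis=0))                              # near xi_true: unbiased
print(np.cov(ests.T))                                 # near the bound below
print(np.linalg.inv(C.T @ np.linalg.solve(GG, C)))    # (C*(GG*)^-1 C)^-1
```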

The MLE results for repeated experiments can be obtained from the corresponding MAP equations by substituting zero for P⁻¹ and m_ξ. We will not repeat these equations here.

5.1.4 Comparison of Estimators

We have seen that the maximum likelihood estimator is unbiased and efficient, whereas the a posteriori estimators are only asymptotically unbiased and efficient. On the other hand, the a posteriori estimators are Bayesian optimal for a large class of loss functions. Thus neither estimator emerges as an unchallenged favorite. The reader might reasonably expect some guidance as to which estimator to choose for a given problem.

The roles of the two estimators are actually quite distinct and well-defined. The maximum likelihood estimator does the best possible job (in the sense of minimum mean square error) of estimating the value of ξ based on the measurements alone, without prejudice (bias) from any preconceived guess about the value. The maximum likelihood estimator is thus the obvious choice when we have no prior information. Having no prior information is analogous to having a prior distribution with infinite variance; i.e., P⁻¹ = 0. In this regard, examine Equation (5.1-16) for the a posteriori estimate as P⁻¹ goes to zero. The limit is (assuming that C*(GG*)⁻¹C is nonsingular)

$\hat{\xi} = m_\xi + (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}(Z - Cm_\xi - D) = m_\xi - (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}Cm_\xi + (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}(Z - D) = (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}(Z - D)$   (5.1-45)

which is equal to the maximum likelihood estimate. The maximum likelihood estimate is thus a limiting case of an a posteriori estimator as the variance of the prior distribution approaches infinity. The a posteriori estimate combines the information from the measurements with the prior information to obtain the optimal estimate considering both sources. This estimator makes use of more information and thus can obtain more accurate estimates, on the average. With this improved average accuracy comes a bias in favor of the prior estimate. If the prior estimate is good, the a posteriori estimate will generally be more accurate than the maximum likelihood estimate. If the prior estimate is poor, the a posteriori estimate will be poor. The advantages of the a posteriori estimators thus depend heavily on the accuracy of the prior estimate of the value. The basic criterion in deciding whether to use an MAP or MLE estimator is whether you want estimates based only on the current data or based on both the current data and the prior information. The MLE estimate is based only on the current data, and the MAP estimate is based on both the current data and the prior distribution. The distinction between the MLE and MAP estimators often becomes blurred in practical application. The estimators are closely related in numerical computation, as well as in theory. An MAP estimate can be an intermediate computational step to obtaining a final MLE estimate, or vice versa. The following paragraphs describe one of these situations; the other situation is discussed in Section 5.2.2. It is quite common to have a prior guess of the parameters, but to desire an independent verification of the value based on the measurements alone. In this case, the maximum likelihood estimator is the appropriate tool in order to make the estimates independent of the initial guess.

A two-step estimation is often the most appropriate to obtain maximum insight into a problem. First, use the maximum likelihood estimator to obtain the best estimates based on the measurements alone, ignoring any prior information. Then consider the prior information in order to obtain a final best estimate based on both the measurements and the prior information. By this two-step approach, we can see where the information is coming from: the prior distribution, the measurements, or both sources. The two-step approach also allows the freedom to independently choose the methodology for each step. For instance, we might desire to use a maximum likelihood estimator for obtaining the estimates based on the measurements, but use engineering judgment to establish the best compromise between the prior expectations and the maximum likelihood results. This is often the best approach because it may be difficult to completely and accurately characterize the prior information in terms of a specific probability distribution. The prior information often includes heuristic factors such as the engineer's judgment of what would constitute reasonable results.
The theory of sufficient statistics (Ferguson, 1967; Cramér, 1946; and Fisher, 1921) is useful in the two-step approach if we desire to use statistical techniques for both steps. The maximum likelihood estimate and its covariance form a sufficient statistic for this problem. Although we will not go into detail here, if we know the maximum likelihood estimate and its covariance, we know all of the statistically useful information that can be extracted from the data. The specific application is that the a posteriori estimates can be written in terms of the maximum likelihood estimate and its covariance instead of as a direct function of the data. The following expression is easy to verify using Equations (5.1-16), (5.1-40), and (5.1-44):

$\hat{\xi}_p = m_\xi + (Q^{-1} + P^{-1})^{-1}Q^{-1}(\hat{\xi}_m - m_\xi)$   (5.1-46)

where ξ̂_p is the a posteriori estimate (Equation (5.1-16)), ξ̂_m is the maximum likelihood estimate (Equation (5.1-40)), and Q is the covariance of the maximum likelihood estimate (Equation (5.1-44)). In this form, the relationship between the a posteriori estimate and the maximum likelihood estimate is plain. The prior distribution is the only factor which enters into the relationship; it has nothing directly to do with the measured data or even with what experiment was performed.
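A quick numerical verification of Equation (5.1-46) (a sketch assuming numpy; the matrices are arbitrary illustrative values): the a posteriori estimate computed directly from the data agrees with the combination of the maximum likelihood estimate and its covariance.

```python
# Verifying Equation (5.1-46): MAP estimate from (MLE, Q) and the prior alone.
import numpy as np

rng = np.random.default_rng(6)
C = rng.normal(size=(3, 2)); D = rng.normal(size=3)
GG = 0.25 * np.eye(3)                                  # GG*
P = np.diag([1.5, 0.8]); m = np.array([0.3, -0.4])
Z = rng.normal(size=3)

Qinv = C.T @ np.linalg.solve(GG, C)                    # Q^-1, Eq. (5.1-44)
xi_ml = np.linalg.solve(Qinv, C.T @ np.linalg.solve(GG, Z - D))   # Eq. (5.1-40)
xi_map = m + np.linalg.solve(Qinv + np.linalg.inv(P),
                             C.T @ np.linalg.solve(GG, Z - C @ m - D))  # (5.1-16)
xi_comb = m + np.linalg.solve(Qinv + np.linalg.inv(P),
                              Qinv @ (xi_ml - m))      # Eq. (5.1-46)
print(np.allclose(xi_map, xi_comb))                    # True
```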

Equation (5.1-46) is closely related to the measurement-partitioning ideas of the next section. Both relate to combining data from two different sources.

5.2 PARTITIONING IN ESTIMATION PROBLEMS

Partitioning estimation problems has some of the same benefits as partitioning optimization problems. A problem half the size of the original typically takes well less than half the effort to solve. Therefore, we can often come out ahead by partitioning a problem into smaller subproblems. Of course, this trick only works if the solutions to the subproblems can easily be combined to give the solution to the original problem. Two kinds of partitioning applicable to parameter estimation problems are measurement partitioning and parameter partitioning. Both of these schemes permit easy combination of the subproblem solutions in some situations.

5.2.1 Measurement Partitioning

A problem with multiple measurements can often be partitioned into a sequence of subproblems processing the measurements one at a time. The same principle applies to partitioning a vector measurement into a series of scalar (or shorter vector) measurements; the only difference is notational.

The estimators under discussion are all based on p(Z|ξ) or, for a posteriori estimators, p(ξ|Z). We will initially consider measurement partitioning as a problem in factoring these density functions. Let the measurement Z be partitioned into two measurements, Z₁ and Z₂. (Extensions to more than two partitions follow the same principles.) We would like to factor p(Z|ξ) into separate factors dependent on Z₁ and Z₂. By Bayes' rule, we can always write

$p(Z|\xi) = p(Z_2|Z_1,\xi)\,p(Z_1|\xi)$   (5.2-1)

This form does not directly achieve the required separation because p(Z₂|Z₁,ξ) involves both Z₁ and Z₂. To achieve the required separation, we introduce the requirement that

$p(Z_2|Z_1,\xi) = p(Z_2|\xi)$   (5.2-2)

W w i l l c a l l t h i s the Markov c r i t e r i o n . e H e u r i s t i c a l l y , the Harkov c r i t e r i o n assures t h a t p(Z1l:) contains a l l o f the useful information we can e x t r a c t from Z,. Therefore, having computed p(Z, 1 S ) a t the measured value o f 2,. we have no f u r t h e r need f o r 2,. I f the Markov c r i t e r i o n does n o t hold, then there a r e i n t e r a c t i o n s t h a t r e q u i r e Z and Z t o be , ; considered together instead o f separately. For systems w i t h a d d i t i v e noise, the Markov c r i t e r i o n imp1i e s t h a t Z i s independent o f t h a t i n 2,. , Note t h a t t h i s does not mean t h a t 2 , i s independent o f 2,. the noise i n For systems where the Markov c r i t e r i o n holds, we can s u b s t i t u t e Equation (5.2-2) t o get i n t o Equation (5.2-1)

which is the desired factorization of p(Z|ξ). When ξ has a prior distribution, the factorization of p(ξ|Z) follows from that of p(Z|ξ):

    p(ξ|Z) = p(Z₂|ξ)p(Z₁|ξ)p(ξ)/p(Z)    (5.2-4)

The mixing of Z₁ and Z₂ in the p(Z) in the denominator is not important, because the denominator is merely a normalizing constant, independent of ξ. It will prove convenient to write Equation (5.2-4) in the form

    p(ξ|Z) = p(Z₂|ξ)p(ξ|Z₁)p(Z₁)/p(Z)    (5.2-5)

Let us now consider measurement partitioning of an MAP estimator for a system with p(ξ|Z) factored as in Equation (5.2-5). The MAP estimate is

    ξ̂ = arg max p(Z₂|ξ)p(ξ|Z₁)    (5.2-6)

This equation is identical in form to Equation (4.3-1a), with p(ξ|Z₁) playing the role of the prior distribution. We have, therefore, the following two-step process for obtaining the MAP estimate by measurement partitioning: First, evaluate the posterior distribution of ξ given Z₁. This is a function of ξ, rather than a single value. Practical application demands that this distribution be easily representable by a few statistics, but we put off such considerations until the next section. Then use this as the prior distribution for an MAP estimator with the measurement Z₂. Provided that the system meets the Markov criterion, the resulting estimate should be identical to that obtained by the unpartitioned MAP estimator.

Measurement partitioning of the MLE estimator follows similar lines, except for some issues of interpretation. The MLE estimate for a system factored as in Equation (5.2-3) is

    ξ̂ = arg max p(Z₂|ξ)p(Z₁|ξ)    (5.2-7)

This equation is identical in form to Equation (4.3-18), with p(Z₁|ξ) playing the role of the prior distribution. The two steps of the partitioned MLE estimator are therefore as follows: first, evaluate p(Z₁|ξ) at the measured value of Z₁, giving a function of ξ. Then use this function as the prior density for an MAP estimator with measurement Z₂. Provided that the system meets the Markov criterion, the resulting estimate should be identical to that obtained by the unpartitioned MLE estimator.
The partitioned MLE estimator raises an issue of interpretation of p(Z₁|ξ). It is not a probability density function of ξ. The vector ξ need not even be random. We can avoid the issue of ξ not being random by using information terminology, considering p(Z₁|ξ) to represent the state of our knowledge of ξ based on Z₁ instead of being a probability density function of ξ. Alternately, we can simply consider p(Z₁|ξ) to be a function of ξ that arises at an intermediate step of computing the MLE estimate. The process described gives the correct MLE estimate of ξ, regardless of how we choose to interpret the intermediate steps.

The close connection between MAP and MLE estimators is illustrated by the appearance of an MAP estimator as a step in obtaining the MLE estimate with partitioned measurements. The result can be interpreted either as an MAP estimate based on the measurement Z₂ and the prior density p(Z₁|ξ), or as an MLE estimate based on both Z₁ and Z₂.

5.2.2 Application to Linear Gaussian Systems

We now consider the application of measurement partitioning to linear systems with additive Gaussian noise. We will first consider the partitioned MAP estimator, followed by the partitioned MLE estimator. Let the partitioned system be

    Z₁ = C₁ξ + D₁ + G₁ω₁
    Z₂ = C₂ξ + D₂ + G₂ω₂    (5.2-8)

where ω₁ and ω₂ are independent Gaussian random variables with mean 0 and covariance I. The Markov criterion requires that ω₁ and ω₂ be independent for measurement partitioning to apply. The prior distribution of ξ is Gaussian with mean m_ξ and covariance P, and is independent of ω₁ and ω₂.

The first step of the partitioned MAP estimator is to compute p(ξ|Z₁). We have previously seen that this is a Gaussian density with mean and covariance given by Equations (5.1-12) and (5.1-13). Denote the mean and covariance of p(ξ|Z₁) by m₁ and P₁. Then, Equations (5.1-12) and (5.1-13) give

    m₁ = m_ξ + PC₁*(C₁PC₁* + G₁G₁*)⁻¹(Z₁ - C₁m_ξ - D₁)    (5.2-9)
    P₁ = P - PC₁*(C₁PC₁* + G₁G₁*)⁻¹C₁P    (5.2-10)

The second step is to compute the MAP estimate of ξ using the measurement Z₂ and the prior density p(ξ|Z₁). This step is another application of Equation (5.1-12), using m₁ for m_ξ and P₁ for P. The result is

    ξ̂ = m₁ + P₁C₂*(C₂P₁C₂* + G₂G₂*)⁻¹(Z₂ - C₂m₁ - D₂)    (5.2-11)

The ξ̂ defined by Equation (5.2-11) is the MAP estimate. It should exactly equal the MAP estimate obtained by direct application of Equation (5.1-12) to the concatenated system. You can consider Equations (5.2-9) through (5.2-11) to be an algebraic rearrangement of the original Equation (5.1-12); indeed, they can be derived in such terms.

Example 5.2-1  Consider a system Z = ξ + ω, where ω is Gaussian with mean 0 and covariance 1, and ξ has a Gaussian prior distribution with mean 0 and covariance 1. We make two independent measurements of Z (i.e., the two samples of ω are independent) and desire the MAP estimate of ξ. Suppose the Z₁ measurement is 2 and the Z₂ measurement is -1. Without measurement partitioning, we could proceed as follows: write the concatenated system and directly apply Equation (5.1-12) with m_ξ = 0, P = 1, C = [1 1]*, D = 0, G = I, and Z = [2, -1]*. The MAP estimate is then

    ξ̂ = PC*(CPC* + GG*)⁻¹(Z - Cm_ξ - D) = (1/3)(Z₁ + Z₂) = 1/3

Now consider this same problem with measurement partitioning. To get p(ξ|Z₁), apply Equations (5.2-9) and (5.2-10) with m_ξ = 0, P = 1, C₁ = 1, D₁ = 0, G₁ = 1, and Z₁ = 2:

    m₁ = 1(2)⁻¹2 = 1
    P₁ = 1 - 1(2)⁻¹1 = 1/2

For the second step, apply Equation (5.2-11) with m₁ = 1, P₁ = 1/2, C₂ = 1, D₂ = 0, G₂ = 1, and Z₂ = -1:

    ξ̂ = 1 + (1/2)(3/2)⁻¹(-1 - 1) = 1/3

We see that the results of the two approaches are identical in this example, as claimed. Note that the partitioning removes the requirement to invert a 2-by-2 matrix, substituting two 1-by-1 inversions.
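As a cross-check on the arithmetic, the following short script (our own sketch in Python with NumPy; none of the variable names come from the text) computes the direct estimate of Equation (5.1-12) and the two-step estimate of Equations (5.2-9) through (5.2-11) for this example:

    import numpy as np

    # Direct MAP estimate for the concatenated system, Equation (5.1-12)
    P, m = 1.0, 0.0
    C = np.array([[1.0], [1.0]])
    Z = np.array([[2.0], [-1.0]])
    S = P * (C @ C.T) + np.eye(2)                   # C P C* + G G*
    xi_direct = m + (P * C.T) @ np.linalg.solve(S, Z - C * m)

    # Partitioned form, Equations (5.2-9) through (5.2-11)
    m1 = m + P / (P + 1.0) * (2.0 - m)              # = 1
    P1 = P - P / (P + 1.0) * P                      # = 1/2
    xi_part = m1 + P1 / (P1 + 1.0) * (-1.0 - m1)    # = 1/3

    print(xi_direct.item(), xi_part)                # both print 1/3

Both computations yield 1/3, illustrating that the partitioned form is an algebraic rearrangement of the unpartitioned one.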

The computational advantages of using the partitioned form of the MAP estimator vary depending on numerous factors. There are numerous other rearrangements of Equations (5.1-12) and (5.1-13). The information form of Equations (5.1-14) and (5.1-15) is often preferable if the required inverses exist. The information form can also be used in the partitioned estimator, replacing Equations (5.2-9) through (5.2-11) with corresponding information forms. Equation (5.1-30) is another alternative, which is often the most efficient. There is at least one circumstance in which a partitioned form is mandatory. This is when the data come in two separate batches and the first batch of data must be discarded (for any of several reasons, perhaps unavailability of enough computer memory) before processing the second batch. Such circumstances occur regularly. Partitioned estimators are also particularly appropriate when you have already computed the estimate based on the first batch of data before receiving the second batch.

Let us now consider the partitioned MLE estimator. The first step is to compute p(Z₁|ξ). Equation (5.1-38) gives a formula for p(Z₁|ξ). It is immediately evident that the logarithm of p(Z₁|ξ) is a quadratic form in ξ. Therefore, although p(Z₁|ξ) need not be interpreted as a probability density function of ξ, it has the algebraic form of a Gaussian density function, except for an irrelevant constant multiplier. Applying Equations (3.5-17) and (3.5-18) gives the mean and covariance of this function as

    m₁ = (C₁*(G₁G₁*)⁻¹C₁)⁻¹C₁*(G₁G₁*)⁻¹(Z₁ - D₁)    (5.2-12)
    P₁ = (C₁*(G₁G₁*)⁻¹C₁)⁻¹    (5.2-13)

The second step of the partitioned MLE estimator is identical to the second step of the partitioned MAP estimator. Apply Equation (5.2-11), using the m₁ and P₁ from the first step. For the partitioned MLE estimator, it is most natural (although not required) to use the information form of Equation (5.2-11), which is

    P₂ = (C₂*(G₂G₂*)⁻¹C₂ + P₁⁻¹)⁻¹    (5.2-14)
    ξ̂ = m₁ + P₂C₂*(G₂G₂*)⁻¹(Z₂ - C₂m₁ - D₂)    (5.2-15)

This form is more parallel to Equations (5.2-12) and (5.2-13).

Example 5.2-2  Consider a maximum likelihood estimator for the problem of Example 5.2-1, ignoring the prior distribution of ξ. To get the MLE estimate for the concatenated system, apply Equation (5.1-40) with C = [1 1]*, D = 0, G = I, and Z = [2, -1]*:

    ξ̂ = (2)⁻¹[1 1]Z = (1/2)(Z₁ + Z₂) = 1/2

Now consider the same problem with measurement partitioning. For the first step, apply Equations (5.2-12) and (5.2-13) with C₁ = 1, D₁ = 0, G₁ = 1, and Z₁ = 2:

    m₁ = Z₁ = 2
    P₁ = 1

For the second step, apply Equations (5.2-14) and (5.2-15) with C₂ = 1, D₂ = 0, G₂ = 1, and Z₂ = -1:

    P₂ = [1(1)⁻¹1 + (1)⁻¹]⁻¹ = 1/2
    ξ̂ = 2 + (1/2)(1)⁻¹(Z₂ - 2 - 0) = 2 + (1/2)(-3) = 1/2
The partitioned algorithm thus gives the same result as the original unpartitioned algorithm.

There is often confusion on the issue of the bias of the partitioned MLE estimator. This is an MLE estimate of ξ based on both Z₁ and Z₂. It is, therefore, unbiased like all MLE estimators for linear systems with additive Gaussian noise. On the other hand, the last step of the partitioned estimator is an MAP estimate based on Z₂ with a prior distribution described by m₁ and P₁. We have previously shown that MAP estimators are biased. There is no contradiction in these two viewpoints. The estimate is biased based on the measurement Z₂ alone, but unbiased based on Z₁ and Z₂. Therefore, it is overly simplistic to universally condemn MAP estimators as biased. The bias is not always so clear an issue, but requires you to define exactly on what data you are basing the bias definition. The primary basis for deciding whether to use an MAP or MLE estimator is whether you want estimates based only on the current set of data, or estimates based on the current data and prior information combined. The bias merely reflects this decision; it does not give you independent help in deciding.

5.2.3 Parameter Partitioning

In parameter partitioning, we write the parameter vector ξ as a function of two (or more; the generalizations are obvious) smaller vectors ξ₁ and ξ₂. The function must be invertible to obtain ξ₁ and ξ₂ from ξ, or the solution to the partitioned problem will not be unique. The simplest kinds of partitions are those in which ξ₁ and ξ₂ are partitions of the ξ vector. With the parameter ξ partitioned into ξ₁ and ξ₂, we have a partitioned optimization problem. Two possible solution methods apply. The best method, if it can be used, is generally to solve for ξ₁ in terms of ξ₂ (or vice versa) and substitute this relationship into the original problem. Axial iteration is another reasonable method if the solutions for ξ₁ and ξ₂ are nearly independent so that few iterations are required.

5.3 LIMITING CASES AND SINGULARITIES

In the previous discussions, we have simply assumed that all of the required matrix inverses exist. We made this assumption to present some of the basic results without getting sidetracked on fine points. We will now take a comprehensive look at all of the singularities and limiting cases, explaining both the circumstances that give rise to the various special cases, and how to handle such cases when they occur.

The reader will recognize that most of the special cases are idealizations which are seldom literally true. We almost never know any value perfectly (zero covariance). Conversely, it is rare to have absolutely no information about the value of a parameter (infinite covariance). There are very few parameters that would not be viewed with suspicion if an estimate of, say, 10^56 were obtained. These idealizations are useful in practice for two reasons. First, they avoid the necessity to quantify statements such as "virtually perfect" when the difference between virtually perfect and perfect is not of measurable consequence (although one must be careful: sometimes even an extremely small difference can be crucial). Second, numerical problems with finite arithmetic can be alleviated by recognizing essentially singular situations and treating them specially as though they were exactly singular.

We will address two kinds of singularities. The first kind of singularity involves Gaussian distributions with singular covariance matrices. These are perfectly valid probability distributions conforming to the usual definition. The distributions, however, do not have density functions; therefore the maximum a posteriori probability and maximum likelihood estimates cannot be defined as we have done. The singularity implies that the probability distribution is entirely concentrated on a subspace of the originally defined probability space. If the problem statement is redefined to include only the subspace, the restricted problem is nonsingular. You can also address this singularity by looking at limits as the covariance approaches the singular matrix, provided that the limits exist. The second kind of singularity involves Gaussian variables with infinite covariance. Conceptually, the meaning of infinite covariance is easily stated: we have no information about the value of the variable (but we must be careful about generalizing this idea, particularly in nonlinear transformations; see the discussion at the end of Section 4.3.4). Unluckily, infinite covariance Gaussians do not fit within the strict definition of a probability distribution. (They cannot meet axiom 2 in Section 3.1.1.)
For current purposes, we need only recognize that an infinite covariance Gaussian distribution can be considered as a limiting case (in some sense that we will not precisely define here) of finite covariance Gaussians. The term "generalized probability distribution" is sometimes used in connection with such limiting arguments. The equations which apply to the infinite covariance case are the limits of the corresponding finite covariance cases, provided that the limits exist. The primary concern in practice is thus how to compute the appropriate limits.

We could avoid several of the singularities by retreating to a higher level of abstraction in the mathematics. The theory can consistently treat Gaussian variables with singular covariances by replacing the concept of a probability density function with the more general concept of a Radon-Nikodym derivative. (A probability density function is a specific case of a Radon-Nikodym derivative.) Although such variables do not have probability density functions, they do have Radon-Nikodym derivatives with respect to appropriate measures. Substituting the more general and more abstract concept of σ-finite measures in place of probability measures allows strict definition of infinite covariance Gaussian variables within the same context. This level of abstraction requires considerable depth of mathematical background, but changes little in the practical application. We can derive the identical computational methods at a lower level of abstraction. The abstract theory serves to place all of the theoretical results in a common framework. In many senses the general abstract theory is simpler than the more concrete approach; there are fewer exceptions and special cases to consider. In implementing the abstract theory, the same computational issues arise, but the simplified viewpoint can help indicate how to resolve these issues. Simply knowing that the problem does have a well-defined solution is a major aid to finding the solution. The conceptual simplification gained by the abstract theory requires significantly more background than we assume in this book. Our emphasis will be on the computations required to deal with the singularities, rather than on the abstract theory. Royden (1968), Rudin (1974), and Liptser and Shiryayev (1977) treat such subjects as σ-finite measures and Radon-Nikodym derivatives.

We will consider two general computational methods for treating singularities. The first method is to use alternate forms of the equations which are not affected by the singularity. The covariance form (Equations (5.1-12) and (5.1-13)) and the information form (Equations (5.1-14) and (5.1-15)) of the posterior distribution are equivalent, but have different points of singularity. Therefore, a singularity in one form can often be handled simply by switching to the other form.
This simple method fails if a problem statement has singularities in both forms. Also, we may desire to stick with a particular form for other reasons. The second method is to partition the estimation problem into two parts: the totally singular part and the nonsingular part. This partitioning allows us to use one means of solving the singular part and another means of solving the nonsingular part; we then combine the partial solutions to give the final result.

5.3.1 Singular P

The first case that we will consider is singular P. A singular P matrix indicates that some parameter or linear combination of parameters is known perfectly before the experiment is performed. For instance, we might know that ξ₁ = 5ξ₂ + 3, even though ξ₁ and ξ₂ are unknown. In this case, we know the linear combination ξ₁ - 5ξ₂ exactly. The singular P matrix creates no problems if we use the covariance form instead of the information form. If we specifically desire to use the information form, we can handle the singularity as follows.

Since P is always symmetric, the range and the null space of P form an orthogonal decomposition of the parameter space. The singular eigenvectors of P span the null space, and the nonsingular eigenvectors span the range. Use the eigenvectors to decompose the parameter estimation problem into the totally singular subproblem and the totally nonsingular subproblem. This is a parameter partitioning as discussed in Section 5.2. The totally singular subproblem is trivial because we know the exact solution when we start (by definition). Substitute the solution of the singular problem in the original problem and solve the nonsingular subproblem in the normal manner. A specific implementation of this decomposition is as follows: let X_S be the matrix of orthonormal singular eigenvectors of P, and X_NS be the matrix of orthonormal nonsingular eigenvectors. Then define

    ξ_S = X_S*ξ        ξ_NS = X_NS*ξ    (5.3-1)

The covariances of ξ_S and ξ_NS are

    cov(ξ_S) = X_S*PX_S = 0        cov(ξ_NS) = X_NS*PX_NS = P_NS    (5.3-2)

where P_NS is nonsingular. Write

    ξ = X_Sξ_S + X_NSξ_NS    (5.3-3)

Substitute Equation (5.3-3) into the original problem. Use the exactly known value of ξ_S and restate the problem in terms of ξ_NS as the unknown parameter vector. Other decompositions, derived from multiplying Equation (5.3-1) by nonsingular transformations, can be used if they have advantages for specific situations.

We will henceforth assume that P is nonsingular. It is unimportant whether the original problem statement is nonsingular or we are working with the nonsingular subproblem. The implementation above is defined in very general terms, which would allow it to be done as an automatic computer subroutine. In practice, we usually know the fact of and reason for the singularity beforehand and can easily handle it more concretely. If an equation gives an exact relationship between two or more variables which we know prior to the experiment, we solve the equation for one variable and remove that variable from the problem by substitution.

Example 5.3-1  Assume that the unknowns are the force F and the moment M at the output of a system. An unknown point force is applied at a known position r referred to the origin. We thus know that

    M = r × F

If F and M are both considered as unknowns, the P matrix is singular. But this singularity is readily removed by substituting for M in terms of F so that F is the only unknown.
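The general eigenvector decomposition described above is straightforward to automate. The following fragment (a sketch, in Python with NumPy; the tolerance and the particular matrix are our own choices) splits the parameter space for the singular prior covariance implied by ξ₁ = 5ξ₂ + 3 with unit variance on ξ₂:

    import numpy as np

    # Singular prior covariance: if xi_1 = 5*xi_2 + 3, then with var(xi_2) = 1
    # the covariance of (xi_1, xi_2) is rank 1, with null direction [1, -5]*
    P = np.array([[25.0, 5.0],
                  [ 5.0, 1.0]])

    w, X = np.linalg.eigh(P)              # eigenvalues in ascending order
    tol = 1e-10 * w.max()
    X_S  = X[:, w <= tol]                 # singular eigenvectors (null space)
    X_NS = X[:, w >  tol]                 # nonsingular eigenvectors (range)

    # Transformed parameters as in Equation (5.3-1): xi_S = X_S* xi is known
    # exactly; xi_NS = X_NS* xi carries the remaining uncertainty P_NS
    P_NS = X_NS.T @ P @ X_NS
    print(X_S.ravel(), P_NS)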

5.3.2 Singular GG*

The treatment of singular GG* is similar in principle to that of singular P. A singular GG* matrix implies that some measurement or combination of measurements is made perfectly (i.e., noise-free). The covariance form does not involve the inverse of GG*, and thus can be used with no difficulty when GG* is singular. An alternate approach involves a sequential decomposition of the original problem into totally singular (GG* = 0) and nonsingular subproblems. The totally singular subproblem must be handled in the covariance form; the nonsingular subproblem can then be handled in either form. This is a measurement partitioning as described in Section 5.2. Divide the measurement into two portions, called the singular and the nonsingular measurements, Z_S and Z_NS. First ignore Z_S and find the posterior distribution of ξ given only Z_NS. Then use this result as the distribution prior to Z_S. We specifically implement this decomposition as follows:

For the first step of the decomposition, let X_NS be the matrix of nonsingular eigenvectors of GG*. Multiply Equation (5.1-1) on the left by X_NS*, giving

    X_NS*Z = X_NS*Cξ + X_NS*D + X_NS*Gω    (5.3-4)

Define

    Z_NS = X_NS*Z    C_NS = X_NS*C    D_NS = X_NS*D    G_NS = X_NS*G    (5.3-5)

Equation (5.3-4) then becomes

    Z_NS = C_NSξ + D_NS + G_NSω    (5.3-6)

Note that G_NS is nonsingular. Using the information form for the posterior distribution, the distribution of ξ conditioned on Z_NS is

    m_NS = E{ξ|Z_NS} = m_ξ + (C_NS*(G_NSG_NS*)⁻¹C_NS + P⁻¹)⁻¹C_NS*(G_NSG_NS*)⁻¹(Z_NS - C_NSm_ξ - D_NS)    (5.3-7a)
    P_NS = cov{ξ|Z_NS} = (C_NS*(G_NSG_NS*)⁻¹C_NS + P⁻¹)⁻¹    (5.3-7b)

For the second step, let X_S be the matrix of singular eigenvectors of GG*. Corresponding to Equation (5.3-6) is

    Z_S = C_Sξ + D_S    (5.3-8)

where

    Z_S = X_S*Z    C_S = X_S*C    D_S = X_S*D    G_S = X_S*G = 0    (5.3-9)

Use Equation (5.3-7) for the prior distribution for this step. Since G_S is 0, we must use the covariance form for the posterior distribution, which reduces to

    E{ξ|Z} = m_NS + P_NSC_S*(C_SP_NSC_S*)⁻¹(Z_S - C_Sm_NS - D_S)    (5.3-10a)
    cov{ξ|Z} = P_NS - P_NSC_S*(C_SP_NSC_S*)⁻¹C_SP_NS    (5.3-10b)

Equations (5.3-4), (5.3-6), (5.3-8), and (5.3-10) give an alternate expression for the posterior distribution of ξ given Z, which we can use when GG* is singular. It does require that C_SP_NSC_S* be nonsingular. This is a special case of the requirement that CPC* + GG* be nonsingular, which we discuss later. It is interesting to note that the covariance (Equation (5.3-10b)) of the estimate is singular. Multiply Equation (5.3-10b) on the right by C_S* and obtain

    cov{ξ|Z}C_S* = 0    (5.3-11)

Therefore the columns of C_S* are all singular eigenvectors of the covariance of the estimate.
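A minimal sketch of the measurement split itself (our own construction in Python with NumPy; the GG* matrix shown is hypothetical) follows the same eigenvector pattern as in Section 5.3.1:

    import numpy as np

    # Hypothetical case: two measurements, the second one noise-free
    GG = np.diag([1.0, 0.0])              # singular G G*

    w, X = np.linalg.eigh(GG)
    tol = 1e-10 * max(w.max(), 1.0)
    X_S  = X[:, w <= tol]                 # singular (noise-free) directions
    X_NS = X[:, w >  tol]                 # nonsingular directions

    C = np.array([[1.0], [1.0]])
    # Quantities of Equations (5.3-5) and (5.3-9): the nonsingular subproblem
    # is processed first (information form); the noise-free subproblem second
    # (covariance form, Equation (5.3-10))
    C_NS = X_NS.T @ C
    GG_NS = X_NS.T @ GG @ X_NS            # G_NS G_NS* for the first step
    C_S = X_S.T @ C
    print(C_NS, GG_NS, C_S)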

5.3.3 Singular CPC* + GG*

The next special case that we will consider is when CPC* + GG* is singular. Note first that this can happen only when GG* is also singular, because CPC* and GG* are both positive semi-definite, and the sum of two such matrices can be singular only if both terms are singular. Since both GG* and CPC* + GG* are singular, neither the covariance form nor the information form circumvents the singularity. In fact, there is no way to circumvent this singularity. If CPC* + GG* is singular, the problem is intrinsically ill-posed. The only solution is to restate the original problem.
If we examine what is implied by a singular CPC* + GG*, we will be able to see why it necessarily means that the problem is ill-posed, and what kinds of changes in the problem statement are required. Referring to Equation (5.1-6), we see that CPC* + GG* is the covariance of the measurement Z. GG* is the contribution of the measurement noise to this covariance, and CPC* is the contribution due to the prior variance of ξ. If CPC* + GG* is singular, we can exactly predict some part of the measured response. For this to occur, there must be neither measurement noise nor parameter uncertainty affecting that particular part of the response.

Clearly, there are serious mathematical difficulties in saying that we know exactly what the measured value will be before taking the measurement. At best, the measurement can agree with what we predicted, which adds no new information. If, however, there is any disagreement at all, even due to rounding error in the computations, there is an irresolvable contradiction: we said that we knew exactly what the value would be, and we were wrong. This is one situation where the difference between almost perfect and perfect is extremely important. As CPC* + GG* approaches singularity, the corresponding estimators diverge; we cannot talk about the limiting case because the estimators do not converge to a limit in any meaningful sense.

5.3.4 Infinite P

Up to this point, the special cases considered have all involved singular covariance matrices, corresponding to perfectly known quantities. The remaining special cases all concern limits as eigenvalues of a covariance matrix approach infinity, corresponding to total ignorance of the value of a quantity. The first such special case to discuss is when an eigenvalue of P approaches infinity. The problem is much easier to discuss in terms of the information matrix P⁻¹. As an eigenvalue of P approaches infinity, the corresponding eigenvalue of P⁻¹ approaches zero. At the limit, P⁻¹ is singular. To be cautious, we should not speak of P⁻¹ being singular, but only of the limit as P⁻¹ goes to a singularity, as it is not strictly meaningful to say that P⁻¹ is singular. Provided that we use the information form everywhere, all of the limits as P⁻¹ goes to a singularity are well-behaved and can be evaluated simply by substituting the singular value for P⁻¹. Thus this singularity poses no difficulties in practice, as long as we avoid the use of expressions involving a noninverted P. As previously mentioned, the limit as P⁻¹ goes to zero is particularly interesting and results in estimates identical to the maximum likelihood estimates. Using a singular P⁻¹ is tantamount to saying that there is no prior information about some parameter or set of parameters (or that we choose to discount any such information in order to obtain an independent check). There is no convenient way to decompose the problem so that the covariance form can be used with singular P⁻¹ matrices.

The meaning of a singular P⁻¹ is most clearly illustrated by some examples using confidence regions. A confidence region is the area where the probability density function (really a generalized probability density function here) is greater than or equal to some given constant. (See Chapter 11 for a more detailed discussion of confidence regions.) Let the parameter vector consist of two elements, ξ₁ and ξ₂. Assume that the prior distribution has mean zero and

    P⁻¹ = [1  0]
          [0  0]

The prior confidence regions are given by

    ξ*P⁻¹ξ ≤ C₁

or equivalently ξ₁² ≤ C₁, which reduces to

    |ξ₁| ≤ C₂

where C₁ and C₂ are constants depending on the level of confidence desired. For current purposes, we are interested only in the shape of the confidence region, which is independent of the values of the constants. Figure (5.3-1) is a sketch of the shape. Note that this confidence region is a limiting case of an ellipse with major axis length going to infinity while the minor axis is fixed. This prior distribution gives information about ξ₁, but none about ξ₂.

Now consider a second example, which is identical to the first except that

    P⁻¹ = [ 1  -1]
          [-1   1]

In this case, the prior confidence region is

    |ξ₁ - ξ₂| ≤ C₂

Figure (5.3-2) is a sketch of the shape of this confidence region. In this case, the difference between ξ₁ and ξ₂ is known with some confidence, but there is no information about the sum ξ₁ + ξ₂. The singular eigenvectors of P⁻¹ correspond to directions in the parameter space about which there is no prior knowledge.
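A two-line numerical check of the last statement (a sketch in Python with NumPy; the P⁻¹ shown is the assumed form of the second example):

    import numpy as np

    P_inv = np.array([[ 1.0, -1.0],
                      [-1.0,  1.0]])
    w, X = np.linalg.eigh(P_inv)
    print(w)          # eigenvalues 0 and 2
    print(X[:, 0])    # singular eigenvector, proportional to [1, 1]*

The zero eigenvalue pairs with the eigenvector along ξ₁ + ξ₂, the direction about which the prior carries no information.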

5.3.5 Infinite GG*

Corresponding to the case where P⁻¹ approaches a singular point is the similar case where (GG*)⁻¹ approaches a singularity. As in the case of singular P⁻¹, there are no computational problems. We can readily evaluate all of the limits simply by substituting the singular matrix for (GG*)⁻¹. The information form avoids the use of a noninverted GG*. A singular (GG*)⁻¹ matrix would indicate that some measurement or linear combination of measurements had infinite noise variance, which is rather unlikely. The primary use of singular (GG*)⁻¹ matrices in practice is to make the estimator ignore certain measurements if they are worthless or simply unavailable. It is mathematically cleaner to rewrite the system model so that the unused measurements are not included in the observation vector, but it is sometimes more convenient to simply use a singular (GG*)⁻¹ matrix. The two methods give the same result. (Not having a measurement at all is equivalent to having one and ignoring it.) One interesting specific case occurs when (GG*)⁻¹ approaches 0. This method then amounts to ignoring all of the measurements. As might be expected, the a posteriori estimate is then the same as the a priori estimate.

5.3.6 Singular C*(GG*)⁻¹C + P⁻¹

The final special case to be discussed is when the C*(GG*)⁻¹C + P⁻¹ in the information form approaches a singular value. Note that this can occur only if P⁻¹ is also approaching a singularity. Therefore, the problem cannot be avoided by using the covariance form. If C*(GG*)⁻¹C + P⁻¹ is singular, it means that there is no prior information about a parameter or combination of parameters, and that the experiment added no such information. The difficulty, then, is that there is absolutely no basis for estimating the value of the singular parameter or combination. The system is referred to as being unidentifiable when this singularity is present. Identifiability is an important issue in the theory of parameter estimation. The easiest computational solution is to restate the problem, deleting the parameter in question from the list of unknowns. Essentially the same result comes from using a pseudo-inverse in Equation (5.1-14) (but see the discussion in Section 2.4.3 on the blind use of pseudo-inverses to "solve" such problems). Of course, the best alternative is often to examine why the experiment gave no information about the parameter, and to redesign the experiment so that a usable estimate can be obtained.

5.4 NONLINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE

The general form of the system equations for a nonlinear system with additive Gaussian noise is

    Z = f(ξ,U) + G(U)ω    (5.4-1)

As in the case of linear systems, we will define by convention the mean of ω to be zero and the covariance to be identity. If ξ is random, we will assume that it is independent of ω and has the distribution given by Equation (5.1-3).

5.4.1 Joint Distribution of Z and ξ

To define the estimators of Chapter 4, we need to know the distribution p(Z|ξ,U). This distribution is easily derived from Equation (5.4-1). The expressions f(ξ,U) and G(U) are both constants if conditioned on specific values of ξ and U. Therefore we can apply the rules discussed in Chapter 3 for multiplication of Gaussian vectors by constants and addition of constants to Gaussian vectors. Using these rules, we see that the distribution of Z conditioned on ξ and U is Gaussian with mean f(ξ,U) and covariance G(U)G(U)*:

    p(Z|ξ,U) = |2πG(U)G(U)*|^(-1/2) exp{-(1/2)[Z - f(ξ,U)]*[G(U)G(U)*]⁻¹[Z - f(ξ,U)]}    (5.4-2)

This is the obvious nonlinear generalization of Equation (5.1-6); the nonlinearity does not change the basic method of derivation.

If ξ is random, we will need to know the joint distribution p(Z,ξ|U). The joint distribution is computed by Bayes rule

    p(Z,ξ|U) = p(Z|ξ,U)p(ξ|U)    (5.4-3)

Using Equations (5.1-3) and (5.4-2) gives

    p(Z,ξ|U) = [|2πP| |2πGG*|]^(-1/2) exp{-(1/2)[Z - f(ξ,U)]*[G(U)G(U)*]⁻¹[Z - f(ξ,U)] - (1/2)[ξ - m_ξ]*P⁻¹[ξ - m_ξ]}    (5.4-4)

Note that p(Z,ξ|U) is not, in general, Gaussian. Although Z conditioned on ξ is Gaussian, and ξ is Gaussian, Z and ξ need not be jointly Gaussian. This is one of the major differences between linear and nonlinear systems with additive Gaussian noise.


Example 5.4-1  Let Z and ξ be scalars, P = 1, m_ξ = 0, G(U) = 1, and f(ξ,U) = ξ². Then

    p(Z|ξ,U) = (2π)^(-1/2) exp{-(1/2)(Z - ξ²)²}

and

    p(ξ) = (2π)^(-1/2) exp{-(1/2)ξ²}

This gives

    p(Z,ξ|U) = (2π)⁻¹ exp{-(1/2)(Z - ξ²)² - (1/2)ξ²}

The general form of a joint Gaussian distribution for two variables Z and ξ is

    p(Z,ξ) = exp{aZ² + bZξ + cξ² + d}

where a, b, c, and d are constants. The joint distribution of Z and ξ cannot be manipulated into this form because a ξ⁴ term appears in the exponent. Thus Z and ξ are not jointly Gaussian, even though Z conditioned on ξ is Gaussian and ξ is Gaussian.

Given Equation (5.4-4), we can compute the marginal distribution of Z, and the conditional distribution of ξ given Z, from the equations

    p(Z|U) = ∫ p(Z,ξ|U) dξ    (5.4-5)

and

    p(ξ|Z,U) = p(Z,ξ|U)/p(Z|U)    (5.4-6)

The integral in Equation (5.4-5) is not easy to evaluate in general. Since p(Z,ξ) is not necessarily Gaussian, or any other standard distribution, the only general means of computing p(Z) is to numerically integrate Equation (5.4-5) for a grid of Z values. If ξ and Z are vectors, this can be a quite formidable task. Therefore, we will avoid the use of p(Z) and p(ξ|Z) for nonlinear systems.
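For the scalar example above, the grid integration is easy to carry out; the following sketch (ours, in Python with NumPy, using the example's assumptions f(ξ) = ξ², G = 1, P = 1, m_ξ = 0) evaluates Equation (5.4-5) by a simple Riemann sum:

    import numpy as np

    def p_joint(Z, xi):
        # Equation (5.4-4) specialized to the scalar example
        return np.exp(-0.5 * (Z - xi**2)**2 - 0.5 * xi**2) / (2.0 * np.pi)

    xi_grid = np.linspace(-6.0, 6.0, 2001)
    dxi = xi_grid[1] - xi_grid[0]
    for Z in (0.0, 1.0, 2.0):
        pZ = np.sum(p_joint(Z, xi_grid)) * dxi    # Equation (5.4-5)
        print(Z, pZ)

Even in this one-dimensional case, each value of p(Z) costs an integration; the cost grows rapidly with the dimensions of ξ and Z.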

5.4.2 Estimators

The a posteriori expected value and Bayes optimal estimators are seldom used for nonlinear systems because their computation is difficult. Computation of the expected value requires the numerical integration of Equation (5.4-5) and the evaluation of Equation (5.4-6) to find the conditional distribution, and then the integration of ξ times the conditional distribution. Theorem (4.3-1) says that the Bayes optimal estimator for quadratic loss is equal to the a posteriori expected value estimator. The computation of the Bayes optimal estimates requires the same or equivalent multidimensional integrations, so Theorem (4.3-1) does not provide us with a simplified means of computing the estimates. Since the posterior distribution of ξ need not be symmetric, the MAP estimate is not equal to the a posteriori expected value for nonlinear systems. The MAP estimator does not require the use of Equations (5.4-5) and (5.4-6). The MAP estimate is obtained by maximizing Equation (5.4-6) with respect to ξ. Since p(Z) is not a function of ξ, we can equivalently maximize Equation (5.4-4). For general nonlinear systems, we must do this maximization using numerical optimization techniques.
It is usually convenient to work with the logarithm of Equation (5.4-4). Since standard optimization conventions are phrased in terms of minimization, rather than maximization, we will state the problem as minimizing the negative of the logarithm of the probability density:

    -ln p(Z,ξ|U) = (1/2)[Z - f(ξ,U)]*(GG*)⁻¹[Z - f(ξ,U)] + (1/2)[ξ - m_ξ]*P⁻¹[ξ - m_ξ] + (1/2)ln(|2πP| |2πGG*|)    (5.4-7)

Since the last term of Equation (5.4-7) is a constant, it does not affect the optimization. We can therefore define the cost functional to be minimized as

    J(ξ) = (1/2)[Z - f(ξ,U)]*(GG*)⁻¹[Z - f(ξ,U)] + (1/2)[ξ - m_ξ]*P⁻¹[ξ - m_ξ]    (5.4-8)
We have omitted the dependence of J on Z and U from the notation because J will be evaluated for specific Z and U in application; ξ is the only variable with respect to which we are optimizing. Equation (5.4-8) makes it clear that the MAP estimator is also a least-squares estimator for this problem. The (GG*)⁻¹ and P⁻¹ matrices are weightings on the squared measurement error and the squared error in the prior estimate of ξ, respectively.

For the maximum likelihood estimate we maximize Equation (5.4-2) instead of Equation (5.4-4). As in the case of linear systems, the maximum likelihood estimate is equal to the limit of the MAP estimate as P⁻¹ goes to zero; i.e., the last term of Equation (5.4-8) is omitted.

For a single measurement, or even for a finite number of measurements, the nonlinear MAP and MLE estimators have none of the optimality properties discussed in Chapter 4. The estimates are neither unbiased, minimum variance, Bayes optimal, nor efficient. When there are a large number of measurements, the differences from optimality are usually small enough to ignore for practical purposes. The main benefits of the nonlinear MLE and MAP estimators are their relative ease of computation and their links to the intuitively attractive idea of least squares. These links give some reason to suspect that even if some of the assumptions about the noise distribution are questionable, the estimators still make sense from a nonstatistical viewpoint. The final practical judgment of an estimator is based on whether the estimates are adequate for their intended use, rather than on whether they are exactly optimum.

The extension of Equation (5.4-8) to multiple independent experiments is straightforward:

    J(ξ) = (1/2) Σ(i=1,N) [Zi - f(ξ,Ui)]*(GG*)⁻¹[Zi - f(ξ,Ui)] + (1/2)[ξ - m_ξ]*P⁻¹[ξ - m_ξ]    (5.4-9)

where N is the number of experiments performed. The maximum likelihood estimator is obtained by omitting the last term. The asymptotic properties are defined as N goes to infinity. The maximum likelihood estimator can be shown to be asymptotically unbiased and asymptotically efficient (and thus also asymptotically minimum-variance unbiased) under quite general conditions. The estimator is also consistent. The rigorous proofs of these properties (Cramer, 1946), although not extremely difficult, are fairly lengthy and will not be presented here. The only condition required is that

    (1/N) Σ(i=1,N) [∇_ξ f(ξ,Ui)]*(GG*)⁻¹[∇_ξ f(ξ,Ui)]

converge to a positive definite matrix. Cramer (1946) also proves that the estimates asymptotically approach a Gaussian distribution.

Since the maximum likelihood estimates are asymptotically efficient, the Cramer-Rao inequality (Equation (4.2-20)) gives a good estimate of the covariance of the estimate for large N. Using Equation (4.2-19) for the information matrix gives

    M(ξ) = Σ(i=1,N) [∇_ξ f(ξ,Ui)]*(GG*)⁻¹[∇_ξ f(ξ,Ui)]    (5.4-10)

The covariance of the maximum likelihood estimate is thus approximated by

    cov(ξ̂) ≈ {Σ(i=1,N) [∇_ξ f(ξ,Ui)]*(GG*)⁻¹[∇_ξ f(ξ,Ui)]}⁻¹    (5.4-11)

When ξ has a prior distribution, the corresponding approximation for the covariance of the posterior distribution of ξ is

    cov(ξ|Z) ≈ {Σ(i=1,N) [∇_ξ f(ξ,Ui)]*(GG*)⁻¹[∇_ξ f(ξ,Ui)] + P⁻¹}⁻¹    (5.4-12)
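The cost functional and the covariance approximations translate directly into code. The following sketch (ours, in Python with NumPy; the function and argument names are not from the text) implements Equations (5.4-9), (5.4-11), and (5.4-12):

    import numpy as np

    def map_cost(xi, f, Z_list, U_list, GG_inv, m_xi, P_inv):
        # Equation (5.4-9); a zero P_inv matrix reduces this to the MLE cost
        J = 0.0
        for Z, U in zip(Z_list, U_list):
            r = Z - f(xi, U)                 # residual for experiment i
            J += 0.5 * r @ GG_inv @ r
        d = xi - m_xi
        return J + 0.5 * d @ P_inv @ d

    def covariance_approx(xi, grad_f, U_list, GG_inv, P_inv=None):
        # Equation (5.4-11); supplying P_inv gives Equation (5.4-12)
        M = sum(grad_f(xi, U).T @ GG_inv @ grad_f(xi, U) for U in U_list)
        if P_inv is not None:
            M = M + P_inv
        return np.linalg.inv(M)

Here grad_f(xi, U) is the Jacobian of f with respect to ξ, evaluated at (xi, U).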

5.4.3 Computation of the Estimates

The discussion of the previous section did not address the question of how to compute the MAP and MLE estimates. Equation (5.4-9) (without the last term for the MLE) is the cost functional to minimize. Minimization of such nonlinear functions can be a difficult proposition, as discussed in Chapter 2. Equation (5.4-9) is in the form of a sum of squares. Therefore the Gauss-Newton method is often the best choice of optimization method. Chapter 2 discusses the details of the Gauss-Newton method. The probabilistic background of Equation (5.4-9) allows us to apply the central limit theorem to strengthen one of the arguments used to support the Gauss-Newton method. For simplicity, assume that all of the Ui are identical. Compare the limiting behavior of the two terms of the second gradient. The term retained by the Gauss-Newton approximation of the second gradient, as expressed by Equation (2.5-10), is N[∇_ξf]*(GG*)⁻¹[∇_ξf], which grows linearly with N. At the true value of ξ, Zi - f(ξ,Ui) is a Gaussian random variable with mean 0 and covariance GG*. Therefore, the omitted term of the second gradient is a sum of independent, identically distributed, random variables with zero mean. By the central limit theorem, the variance of 1/N times this term goes to zero as N goes to infinity. Since 1/N times the retained term goes to a nonzero constant, the omitted term is small compared to the retained one for large N. This conclusion is still true if the Ui are not identical, as long as f and its gradients are bounded and the first gradient does not converge to zero.

This demonstrates that for large N the omitted term is small compared to the retained term if ξ is at the true value, and, by continuity, if ξ is sufficiently close to the true value. When ξ is far from the true value, the arguments of Chapter 2 apply.
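A single Gauss-Newton step on Equation (5.4-9), retaining only the term of the second gradient discussed above, looks as follows (a sketch, ours, in Python with NumPy):

    import numpy as np

    def gauss_newton_step(xi, f, grad_f, Z_list, U_list, GG_inv,
                          m_xi=None, P_inv=None):
        n = xi.size
        A = np.zeros((n, n))              # retained second-gradient term
        g = np.zeros(n)                   # first gradient of J
        for Z, U in zip(Z_list, U_list):
            Jf = grad_f(xi, U)            # Jacobian of f at (xi, U)
            r = Z - f(xi, U)
            A += Jf.T @ GG_inv @ Jf
            g -= Jf.T @ GG_inv @ r
        if P_inv is not None:             # MAP terms; omit for the MLE
            A += P_inv
            g += P_inv @ (xi - m_xi)
        return xi - np.linalg.solve(A, g)

Iterating this step to convergence, and then applying the covariance approximation of Equation (5.4-11), gives both the estimate and a measure of its accuracy.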

5.4.4 Singularities

The singular cases which arise for nonlinear systems are basically the same as for linear systems and have similar solutions. Limits as P⁻¹ or (GG*)⁻¹ approach singular values pose no difficulty. Singular P or GG* matrices are handled by reducing the problem to a nonsingular subproblem as in the linear case. The one singularity which merits some additional discussion in the nonlinear case corresponds to singular

    C*(GG*)⁻¹C + P⁻¹

in the linear case. The equivalent matrix in the nonlinear case, if we use the Gauss-Newton algorithm, is given by

    Σ(i=1,N) [∇_ξ f(ξ,Ui)]*(GG*)⁻¹[∇_ξ f(ξ,Ui)] + P⁻¹    (5.4-13)
If Equation (5.4-13) is singular at the true value, the system is said to be unidentifiable. We discussed the computational problems of this singularity in Chapter 2. Even if the optimization algorithm correctly finds a unique minimum, Equation (5.4-11) indicates that the covariance of a maximum likelihood estimate would be very large. (The covariance is approximated by the inverse of a nearly singular matrix.) Thus the experimental data contain very little information about the value of some parameter or combination of parameters. Note that the covariance estimate is unrelated to the optimization algorithm; changes to the optimization algorithm might help you find the minimum, but will not change the properties of the resulting estimates. The singularity can be eliminated by using a prior distribution with a positive definite P⁻¹, but in this case, the estimated parameter values will be strongly influenced by the prior distribution, since the experimental data are lacking in information.

As with linear systems, unidentifiability is a serious problem. To obtain usable estimates, it is generally necessary to either reformulate the problem or redesign the experiment. With nonlinear systems, we have the additional difficulty of diagnosing whether identifiability problems are present or not. This difficulty arises because Equation (5.4-13) is a function of ξ and it is necessary to evaluate it at or near the minimum to ascertain whether the system is identifiable. If the system is not identifiable, it may be difficult for the algorithm to approach the (possibly nonunique) minimum because of convergence problems.

5.4.5 Partitioning

In both theory and computation, parameter estimation is much more difficult for nonlinear than for linear systems. Therefore, means of simplifying parameter estimation problems are particularly desirable for nonlinear systems. The partitioning ideas of Section 5.2 have this potential for some problems.

The parameter partitioning ideas of Section 5.2.3 make no linearity assumptions, and thus apply directly to nonlinear problems. We have little more to add to the earlier discussion of parameter partitioning except to say that parameter partitioning is often extremely important in nonlinear systems. It can make the critical difference between a tractable and an intractable problem formulation.

Measurement partitioning, as formulated in Section 5.2.1, is impractical for most nonlinear systems. For general nonlinear systems, the posterior density function p(ξ|Z₁) will not be Gaussian or any other simple form. The practical application of measurement partitioning to linear systems arises directly from the fact that Gaussian distributions are uniquely defined by their mean and covariance. The only practical method of applying measurement partitioning to nonlinear systems is to approximate the function p(ξ|Z₁) (or p(Z₁|ξ) for MLE estimates) by some simple form described by a few parameters. The obvious approximation in most cases is a Gaussian density function with the same mean and covariance. The exact covariance is difficult to compute, but Equations (5.4-11) and (5.4-12) give good approximations for this purpose.
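A minimal illustration of the Gaussian-approximation idea (our own scalar example; it is linear in ξ so that the partitioned and unpartitioned answers can be checked against each other exactly, whereas for genuinely nonlinear f the Gaussian form is only an approximation):

    import numpy as np

    rng = np.random.default_rng(0)
    xi_true = 2.0
    U1 = np.array([1.0, 2.0, 3.0]); Z1 = xi_true * U1 + rng.standard_normal(3)
    U2 = np.array([4.0, 5.0]);      Z2 = xi_true * U2 + rng.standard_normal(2)

    # Batch 1: MLE estimate and its Equation (5.4-11) covariance
    xi1 = np.sum(U1 * Z1) / np.sum(U1**2)
    P1  = 1.0 / np.sum(U1**2)

    # Batch 2: MAP estimate with the prior N(xi1, P1), Equation (5.4-9)
    P2  = 1.0 / (np.sum(U2**2) + 1.0 / P1)
    xi2 = xi1 + P2 * np.sum(U2 * (Z2 - U2 * xi1))

    # Unpartitioned MLE over all of the data agrees in this linear case
    U = np.concatenate([U1, U2]); Z = np.concatenate([Z1, Z2])
    print(xi2, np.sum(U * Z) / np.sum(U**2))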

5.5 MULTIPLICATIVE GAUSSIAN NOISE (ESTIMATION OF VARIANCE)

The previous sections of this chapter have assumed that the G matrix is known. The results are quite different when G is unknown because the noise multiplies G rather than adding to it. For convenience, we will work directly with GG* to avoid the necessity of taking matrix square roots. We compute the estimates of G by taking the positive semidefinite, symmetric-matrix square roots of the estimates of GG*. The general form of a nonlinear system with unknown G is

    Z = f(ξ,U) + G(ξ,U)ω    (5.5-1)

We will consider N independent measurements Zi resulting from the experiments Ui. The Zi are then independent Gaussian vectors with means f(ξ,Ui) and covariances G(ξ,Ui)G(ξ,Ui)*. We will use Equation (5.1-3) for the prior distribution of ξ. Bayes rule (Equation (5.4-3)) then gives us the joint distribution of ξ and the Zi given the Ui. Equations (5.4-5) and (5.4-6) define the marginal distribution of Z and the posterior distribution of ξ given Z. The latter distributions are cumbersome to evaluate and thus seldom used.

Because of the difficulty of computing the posterior distribution, the a posteriori expected value and Bayes optimal estimators are seldom used. We can compute the maximum likelihood estimates by minimizing the negative of the logarithm of the likelihood functional. Ignoring irrelevant constant terms, the resulting cost functional is

    J(ξ) = (1/2) Σ(i=1,N) {[Zi - f(ξ)]*[G(ξ)G(ξ)*]⁻¹[Zi - f(ξ)] + ln|G(ξ)G(ξ)*|}    (5.5-2)

or equivalently

    J(ξ) = (1/2) trace{[G(ξ)G(ξ)*]⁻¹ Σ(i=1,N) [Zi - f(ξ)][Zi - f(ξ)]*} + (N/2) ln|G(ξ)G(ξ)*|    (5.5-3)
We have omitted the explicit dependence on Ui from the notation and assume that all of the Ui are identical. (The generalization to different Ui is easy and changes little of essence.) The MAP estimator minimizes a cost functional equal to Equation (5.5-2) plus the extra term (1/2)[ξ - m_ξ]*P⁻¹[ξ - m_ξ]. The MAP estimate of GG* is seldom used because the MLE estimate is easier to compute and proves quite satisfactory.
We can use numerical methods to minimize Equation (5.5-2) and compute the MLE estimates. In most practical problems, the following parameter partitioning greatly simplifies the computation required: assume that the ξ vector can be partitioned into independent vectors ξ_G and ξ_f such that

    f = f(ξ_f)        G = G(ξ_G)    (5.5-4)

The partition ξ_f may be empty, in which case f is a constant (if ξ_G is empty we have a known GG* matrix, and the problem reduces to that discussed in the previous section). Assume further that the GG* matrix is completely unknown, except for the restriction that it be positive semidefinite. Set the gradients of Equation (5.5-2) with respect to GG* and ξ_f equal to zero in order to find the unconstrained minimum. Using the matrix differentiation results (A.2-5) and (A.2-6) from Appendix A, we get

    0 = (1/2) Σ(i=1,N) {(GG*)⁻¹ - (GG*)⁻¹[Zi - f(ξ_f)][Zi - f(ξ_f)]*(GG*)⁻¹}    (5.5-5)
    0 = -Σ(i=1,N) [∇_(ξ_f) f(ξ_f)]*(GG*)⁻¹[Zi - f(ξ_f)]    (5.5-6)

Equation (5.5-5) gives

    ĜĜ* = (1/N) Σ(i=1,N) [Zi - f(ξ_f)][Zi - f(ξ_f)]*    (5.5-7)

which is the familiar sample second moment of the residuals. The estimate of GG* from Equation (5.5-7) is always positive semidefinite. It is possible for this estimate to be singular, in which case we must use the techniques previously discussed for handling singular GG* matrices. For a given ξ_f, Equation (5.5-7) is a simple noniterative estimator for GG*. This closed-form expression is the reason for the partition of ξ into ξ_f and ξ_G.
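In code, Equation (5.5-7) is one line per residual (a sketch, ours, in Python with NumPy; the names are not from the text):

    import numpy as np

    def gg_estimate(Z_list, f_val):
        # Equation (5.5-7): sample second moment of the residuals about
        # f(xi_f), where f_val is f evaluated at the current xi_f
        R = sum(np.outer(Z - f_val, Z - f_val) for Z in Z_list)
        return R / len(Z_list)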

We can constrain GG* to be diagonal, in which case the solution is the diagonal elements of Equation (5.5-7). If we place other types of constraints on GG*, such as knowledge of the values of individual off-diagonal elements, such simple closed-form solutions are not apparent. In practice, such constraints are seldom required.

If ξ_f is empty, Equation (5.5-7) is the solution to the problem. If ξ_f is not empty, we need to combine this subproblem solution with a solution for ξ_f to get a solution of the entire problem. Let us investigate the two methods discussed in Section 5.2.3.
The first method is axial iteration. Axial iteration involves successively estimating ξ_G with fixed ξ_f, and estimating ξ_f with fixed ξ_G. Equation (5.5-5) gives the ξ_G estimate in closed form for fixed ξ_f. To estimate ξ_f with fixed ξ_G, we must minimize Equation (5.5-2) with respect to ξ_f. Unless the system is linear, this minimization requires an iterative method. For fixed G, Equation (5.5-2) is in the form of a sum of squares and the Gauss-Newton method is an appropriate choice (in fact this subproblem is identical to the problem discussed in Section 5.4). We thus have an inner iteration within the outer axial iteration of ξ_f and ξ_G. In such situations, efficiency is often improved by terminating the inner iteration before it converges, inasmuch as the largest changes in the ξ_f estimates occur on the early inner iterations. After these early iterations, more can be gained by revising GG* to reflect these large changes than by refining ξ_f. Since the estimates of ξ_f and GG* affect one another, there is no point in obtaining extremely accurate estimates of ξ_f until GG* is known to a corresponding accuracy. As Gauss (1809, p. 249) said concerning a different problem:

    It then can only be worth while to aim at the highest accuracy, when the final correction is to be given to the orbit to be determined. But as long as it appears probable that new observations will give rise to new corrections, it will be convenient to relax, more or less, as the case may be, from extreme precision, if in this way the length of the computations can be considerably diminished.

Exploiting this concept to its fullest suggests using only one iteration of the Gauss-Newton algorithm for the inner "iteration." In this case the inner iteration is no longer iterative, and the overall algorithm would be as follows (a sketch of one pass appears after the list):

1. Estimate GG* using Equation (5.5-7) and the current guess of ξ_f.

2. Use one iteration of the Gauss-Newton algorithm to revise the estimate of ξ_f.

3. Repeat steps 1 and 2 until convergence.

In general, axial iteration is a very poor algorithm, as discussed in Chapter 2. The convergence is often extremely slow. Furthermore, the algorithm can converge to a point that is not a strict local minimum and yet give no hint of a problem. For this particular application, however, the performance of axial iteration borders on spectacular.

Let us consider, for a while, the alternative to axial iteration: substituting Equation (5.5-7) into Equation (5.5-3). This substitution gives

    J(ξ_f) = (1/2) N trace{I} + (1/2) N ln|(1/N) Σ(i=1,N) [Zi - f(ξ_f)][Zi - f(ξ_f)]*|    (5.5-8)

The f i r s t tenn i s i r r e l e v a n t t o the minimization, so we w i l l redefine the cost function as

-i;: s You may sometinss see t h i s c o s t function w r i t t e n i n the equivalent ( f o r our p u r ~ ~ s e m) Jkf) = Examine the gradient o f Equation (5.5-9). f r a n Appendix A, we obtain

(ce*j

(5.5-10)

Using t h e matrix d i f f e r e n t i a t i o n r e s u l t s (A.7-3) and (A.2-6)

This i s more compactly expressed as

which i s exactly the same as Equation (5.5-6) evaluated a t G = 6 . F u r t h e m r e , the Gauss-Newton methtid used t o solve Equation (5.5-6) i s a good method f o r s o l v i r g Eq~tation(5.5-12) because

Equation (5.5-13) neglects the derivative of Ĝ* with respect to ξ_f, but we can easily show that the term so neglected is even smaller than the term containing ∇²f(ξ_f), the omission of which we previously justified. Therefore, axial iteration is identical to substitution of Equation (5.5-7) as a constraint. It seems likely that we could use this equality to make deductions about the geometry of the cost function and thence about the behavior of various algorithms. (Perhaps there may be some kind of orthogonality property buried here.) Several computer programs, including the Iliff-Maine MMLE3 code (Maine and Iliff, 1980; and Maine, 1981), use axial iteration, or a modification thereof, often with little more justification than that it seems to work well. This is, of course, the final and most important justification, but it is best used as verification of analytical arguments. Although Equations (5.5-12) and (5.5-13) are derived in standard texts, we have not seen the relationship between these equations and axial iteration pursued in the literature. It is plain that this equivalence relates to the excellent performance of axial iteration on this problem. We will leave further inquiry along this line to the reader.

An important special case of Equation (5.5-1) occurs when f(ξ_f) is linear:

    f(ξ_f) = Cξ_f                                               (5.5-14)

For linear f, Equation (5.5-6) is solved exactly in a single Gauss-Newton iteration, and the solution is

    ξ̂_f = [C*(GG*)^{-1}C]^{-1} C*(GG*)^{-1} (1/N) Σ_{i=1}^{N} Z_i     (5.5-15)

If C is invertible, this reduces to

    ξ̂_f = C^{-1} (1/N) Σ_{i=1}^{N} Z_i                          (5.5-16)

independent of GG*. This is, of course, C^{-1} times the sample mean. Substituting Equations (5.5-14) and (5.5-16) into (5.5-7) gives

    ĜĜ* = (1/N) Σ_{i=1}^{N} [Z_i - Z̄][Z_i - Z̄]*     where  Z̄ = (1/N) Σ_{i=1}^{N} Z_i     (5.5-17)

which is the familiar sample variance. Equation (5.5-17) can be manipulated into the alternate form

    ĜĜ* = (1/N) Σ_{i=1}^{N} Z_i Z_i* - Z̄ Z̄*                     (5.5-18)

Because ξ̂_f is not a function of GG*, the computation of ξ̂_f and ĜĜ* does not require iteration for this system model.

In general, the maximum likelihood estimates are asymptotically unbiased and efficient, but they need have no such properties for finite N. For linear invertible systems, the biases are easy to compute. From Equation (5.5-16),

    E{ξ̂_f|ξ_f} = C^{-1} (1/N) Σ_{i=1}^{N} E{Z_i|ξ_f} = C^{-1}Cξ_f = ξ_f     (5.5-19)

This equation shows that ξ̂_f is unbiased for finite N for linear invertible systems. From Equation (5.5-18), using the fact that Σ Z_i is Gaussian with mean NCξ_f and covariance NGG*,

    E{ĜĜ*|ξ_f} = (GG* + Cξ_fξ_f*C*) - [(1/N)GG* + Cξ_fξ_f*C*] = [(N - 1)/N] GG*     (5.5-20)

Thus ĜĜ* is biased for finite N. Examining Equation (5.5-20), we see that the estimator defined by multiplying the ML estimate by N/(N - 1) is unbiased for finite N if N > 1. This unbiased estimate is often used instead of the maximum likelihood estimate. For large N, the difference is inconsequential.
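The size of this bias is easy to confirm by simulation. The following sketch (illustrative, with assumed values; not from the text) averages the ML estimate (5.5-17) and the N/(N - 1) corrected estimate over many realizations of a scalar measurement:

    import numpy as np

    rng = np.random.default_rng(0)
    N, trials, true_var = 5, 100_000, 4.0

    avg_ml = avg_corrected = 0.0
    for _ in range(trials):
        Z = rng.normal(1.0, np.sqrt(true_var), N)  # scalar case, GG* = 4
        s = np.sum((Z - Z.mean())**2) / N          # ML estimate (5.5-17)
        avg_ml += s / trials
        avg_corrected += s * N / (N - 1) / trials  # multiplied by N/(N-1)

    print(avg_ml, avg_corrected)  # near (N-1)/N * 4 = 3.2 and near 4.0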

In this discussion, we have assumed that both GG* and ξ_f are unknown. If ξ_f is known, then the maximum likelihood estimator for GG* is given by Equation (5.5-7), and this estimate is unbiased. The proof is left as an exercise. This result gives insight into the reasons for the bias of the estimator given by Equation (5.5-17). Note that Equations (5.5-17) and (5.5-7) are identical except that the sample mean is used in Equation (5.5-17) in place of the true mean in Equation (5.5-7). This substitution of the sample mean for the true mean has resulted in a bias. The difference between the estimates from Equations (5.5-17) and (5.5-7) can be written in the form

    (1/N) Σ_{i=1}^{N} [Z_i - f(ξ_f)][Z_i - f(ξ_f)]* - (1/N) Σ_{i=1}^{N} [Z_i - Z̄][Z_i - Z̄]* = [Z̄ - f(ξ_f)][Z̄ - f(ξ_f)]*     (5.5-21)

As this expression shows, the estimate of GG* using the sample mean is less than or equal to the estimate using the true mean for every realization (i.e., the difference is positive semidefinite), equality occurring only when the sample mean Z̄ is equal to f(ξ_f). This is a stronger property than the bias difference; the bias difference implies only that the expected value using the sample mean is less.

5.6 NON-GAUSSIAN NOISE

Non-Gaussian noise is so general a classification that little can be said beyond the discussion in Chapter 4. The forms and properties of the estimators depend strongly on the types of noise distribution. The same comments apply to Gaussian noise if it is not additive or multiplicative, because the conditional distribution of Z given ξ is then non-Gaussian. In general, we apply the rules for transformation of variables to derive the conditional distribution of Z given ξ. Using this distribution, and the prior distribution of ξ if defined, we can derive the various estimators in principle.

The optimal estimators of Chapter 4 often require considerable computation for non-Gaussian noise. It is often possible to define much simpler estimators which have adequate performance. We will examine one situation where such simplification can occur. Let the system model be linear with additive noise:

    Z = Cξ + ω                                                  (5.6-1)

The distribution of ω must have finite mean and variance independent of ξ, but is otherwise unrestricted. Call the mean m and the variance GG*. We will restrict ourselves to considering only linear estimators of the form

    ξ̂(Z) = KZ + D                                               (5.6-2)

Within this class, we will look for minimum-variance, unbiased estimators. We will require that the variance be minimized only over the class of unbiased linear estimators; there will be no guarantee that a smaller variance cannot be attained by a nonlinear estimator.

The bias of an estimator of the form of Equation (5.6-2) is

    b(ξ) = E{ξ̂(Z)|ξ} - ξ = KCξ + Km + D - ξ                     (5.6-3)

If the estimator is to be unbiased, we must have

    D = -Km                                                     (5.6-4a)
    KC = I                                                      (5.6-4b)

The variance of an unbiased estimator of the given form is

    var(ξ̂) = KGG*K*                                             (5.6-5)

Note that the bias and variance of the estimate depend only upon the mean and variance of the noise distribution. The exact noise distribution need not even be known. If the noise distribution were Gaussian, a minimum-variance unbiased estimator would exist and be given by

    ξ̂(Z) = [C*(GG*)^{-1}C]^{-1} C*(GG*)^{-1} (Z - m)            (5.6-6)

This estimator is linear. Since no unbiased estimator, linear or not, can have a lower variance for the Gaussian case, this estimator is the minimum-variance, unbiased linear estimator for Gaussian noise. Since the bias and variance of a linear estimator depend only on the mean and variance of the noise, this is the minimum-variance, unbiased linear estimator for any noise distribution with the same mean and variance. The optimality of this estimator can also be easily proven without reference to Gaussian distributions (although the above proof is complete and rigorous). Let

    A = K - [C*(GG*)^{-1}C]^{-1}C*(GG*)^{-1}                    (5.6-7)

for any K. Then

    0 ≤ AGG*A* = KGG*K* + [C*(GG*)^{-1}C]^{-1}C*(GG*)^{-1}GG*(GG*)^{-1}C[C*(GG*)^{-1}C]^{-1}
                 - KGG*(GG*)^{-1}C[C*(GG*)^{-1}C]^{-1} - [C*(GG*)^{-1}C]^{-1}C*(GG*)^{-1}GG*K*     (5.6-8)

Using Equation (5.6-4b) as a constraint on K, Equation (5.6-8) becomes

    0 ≤ KGG*K* - [C*(GG*)^{-1}C]^{-1}                           (5.6-9)

or, using Equation (5.6-5),

    var(ξ̂) ≥ [C*(GG*)^{-1}C]^{-1}                               (5.6-10)

Thus no K satisfying Equation (5.6-4b) can achieve a variance lower than that given by Equation (5.6-10). The variance is equal to the minimum if and only if A is zero; that is, if

    K = [C*(GG*)^{-1}C]^{-1}C*(GG*)^{-1}                        (5.6-11)

Therefore Equation (5.6-6) defines the unique minimum-variance, unbiased linear estimator. We are assuming that GG* and C*(GG*)^{-1}C are nonsingular; Section 5.3 discusses the singular cases.

In summary, if the system is linear with additive noise, and the estimator is required to be linear and unbiased, the results for Gaussian distributions apply to any distribution with the same mean and variance.
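As a concrete illustration of Equation (5.6-6), the estimator is a few lines of code; the setup below is assumed for illustration and is not from the text:

    import numpy as np

    def mvu_linear_estimate(Z, C, GG, m):
        # Minimum-variance unbiased linear estimate for Z = C xi + omega,
        # where omega has mean m and covariance GG*, Equation (5.6-6).
        W = np.linalg.inv(GG)
        K = np.linalg.solve(C.T @ W @ C, C.T @ W)  # K = (C*WC)^-1 C*W
        return K @ (Z - m)

Note that only the mean m and covariance GG* of the noise enter the computation; the rest of the distribution is irrelevant, which is the point of this section.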

The use of optimal nonlinear estimators is seldom justifiable in view of the current state of the art. Although exceptional cases exist, three factors argue against using optimal nonlinear estimators. The first factor is the complexity and corresponding cost of deriving and implementing optimal nonlinear estimators. For some problems, we can construct fairly simple suboptimal nonlinear estimators that give better performance than the linear estimators (often by slightly modifying the linear estimator), but optimal nonlinear estimation is a difficult task.

The second factor is that linear estimators, perhaps slightly modified, often can give quite good estimates, even if they are not exactly optimal. Based on the central limit theorem, several results show that, under fairly general conditions, the linear estimates will approach the optimal nonlinear estimates as the number of samples increases. The precise conditions and proofs of these results are beyond the scope of this book.

The third factor is that we seldom have precise knowledge of the distribution anyway. The errors from inaccurate specification of the distribution are likely to be as large as the errors from using a suboptimal linear estimator. We need to consider this fact in deciding whether an optimal nonlinear estimator is really worth the cost. From Gauss (1809, p. 253):

    The investigation of an orbit having, strictly speaking, the maximum probability, will depend upon a knowledge of ...[the probability distribution]; but that depends upon so many vague and doubtful considerations - physiological included - which cannot be subjected to calculation, that it is scarcely, and indeed less than scarcely, possible....

Figure (5.3-1). Confidence region with singular P^{-1}.

Figure (5.3-2). Confidence region with another singular P^{-1}.

CHAPTER 6

6.0 STOCHASTIC PROCESSES

In simplest terms, a stochastic process is a random variable that is a function of time. Thus stochastic processes are basic to the study of parameter estimation for dynamic systems. A complete and rigorous study of stochastic process theory requires considerable depth of mathematical background, particularly for continuous-time processes. For the purposes of this book, such depth of background is not required. Our approach does not draw heavily on stochastic process theory. This chapter focuses on the few results that are needed for this document. Astrom (1970), Papoulis (1965), Lipster and Shiryayev (1977), and numerous other books give more complete treatments at varying levels of abstraction. The necessary results in this chapter are largely concerned with continuous-time models. Although we derive a few discrete-time equations in order to examine their continuous-time limits, the chapter can be omitted if you are studying only discrete-time analysis.

6.1 DISCRETE TIME

A discrete-time random process x is simply a collection of random variables x_i, one for each time point, defined on the same probability space. There can be a finite or infinite number of time points. The stochastic process is completely characterized by the joint distributions of all of the x_i. This can be a rather unwieldy means of characterizing the process, however, particularly if the number of time points is infinite.
If the x_i are jointly Gaussian, the process can be characterized by its first and second moments. Non-Gaussian processes are often also analyzed in terms of their first two moments because exact analyses are too complicated. The first two moments of the process x are

    m(i) = E{x_i}                                               (6.1-1)
    R(i,j) = E{x_i x_j*}                                        (6.1-2)

The function R(i,j) is called the autocorrelation function of the process.

A process is called stationary if the joint distribution of any collection of the x_i depends only on differences of the i values, not on the absolute time. This is called strict-sense stationarity. A process is stationary to second order, or wide-sense stationary, if the first moment is constant and the second moments depend only on time differences; i.e., if

    m(i) = m(i + k)    and    R(i,j) = R(i + k, j + k)          (6.1-3)

for all i, j, and k. For Gaussian processes, wide-sense stationarity implies strict-sense stationarity. The autocorrelation function of a wide-sense stationary process can be written as a function of one variable, the time difference:

    R(k) = R(i, i + k)                                          (6.1-4)

A process is called white if x_i is independent of x_j for all i ≠ j. Thus a Gaussian process is white if R(i,j) = 0 when i ≠ j. Any process that is not white is called colored. A white process can be characterized by the distribution of x_i for each i. If a process is both white and stationary, the distribution of x_i is the same as that of x_j for all i and j, and this distribution is sufficient to characterize the process.

6.1.1 Linear Systems Forced by Gaussian White Noise

Our primary interest in this chapter is in the results of passing random signals through dynamic systems. We will first look at the simplest case, stationary white Gaussian noise passing through a linear system. The system equation is

    x_{i+1} = Φx_i + Fn_i                                       (6.1-5)

where n is a stationary, Gaussian, white process with zero mean and identity covariance. The assumption of zero mean is made solely to simplify the equations. Results for nonzero mean can be obtained by linear superposition of the deterministic response to the mean and the stochastic response to the process with the mean removed. We are also given that x_0 is Gaussian with mean 0 and covariance P_0, and that x_0 is independent of the n_i. The x_i form a stochastic process generated from the n_i. We desire to examine the properties of the stochastic process x. It is immediately obvious that x is Gaussian, because x_i can be written as a linear combination of x_0 and n_0, n_1, ..., n_{i-1}. In fact, the joint distribution of the x_i can be easily derived by explicitly writing this linear relation and using Theorem (3.5-5). We will leave this derivation as an exercise, and pursue instead a derivation using recursion along the lines that will be used in Chapter 7.

Assume we know that x_i has mean 0 and covariance P_i. Then the distribution of x_{i+1} follows immediately from Equation (6.1-5):

    E{x_{i+1}} = ΦE{x_i} + FE{n_i} = 0                          (6.1-6)

    E{x_{i+1}x_{i+1}*} = ΦE{x_i x_i*}Φ* + FE{n_i n_i*}F* + ΦE{x_i n_i*}F* + FE{n_i x_i*}Φ*
                       = ΦP_iΦ* + FF*                           (6.1-7)

The cross terms in Equation (6.1-7) drop out because x_i is a function only of x_0 and n_0, ..., n_{i-1}, all of which are independent of n_i by assumption. We now have a recursive formula for the covariance of x_i:

    P_{i+1} = ΦP_iΦ* + FF*        i = 0, 1, ...                 (6.1-8)

P_0 is a given point from which we can start the recursion.
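The recursion (6.1-8) is easy to evaluate numerically. A minimal sketch (matrix names assumed for illustration, not from the text):

    import numpy as np

    def covariance_history(Phi, F, P0, n_steps):
        # Propagate P_{i+1} = Phi P_i Phi* + F F*, Equation (6.1-8).
        P = np.asarray(P0, dtype=float)
        history = [P]
        for _ in range(n_steps):
            P = Phi @ P @ Phi.T + F @ F.T
            history.append(P)
        return history

For a stable Φ, the P_i computed this way approach a constant matrix, a point that recurs in the steady-state discussion of Section 7.3.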

We know that the x_i are jointly Gaussian zero-mean variables with covariances given by the recursion (6.1-8). To complete the characterization of the joint distribution of the x_i, we need only the cross-covariances E{x_i x_j*} for i ≠ j. Assume without loss of generality that i > j. Then x_i can be written as

    x_i = Φ^{i-j} x_j + Σ_{k=j}^{i-1} Φ^{i-1-k} F n_k           (6.1-9)

Then

    E{x_i x_j*} = Φ^{i-j} E{x_j x_j*} + Σ_{k=j}^{i-1} Φ^{i-1-k} F E{n_k x_j*} = Φ^{i-j} P_j     (6.1-10)

The cross terms in Equation (6.1-10) are all zero by the same reasoning as used for Equation (6.1-7). For i < j, the same derivation (or transposition of the above result) gives

    E{x_i x_j*} = P_i (Φ*)^{j-i}                                (6.1-11)

This completes the derivation of the joint distribution of the x_i. Note that x is neither stationary nor white (except in special cases).

6.1.2 Nonlinear Systems and Non-Gaussian Noise

If the noise is not Gaussian, analyzing the system becomes much more difficult. Except in special cases, we then have to work with the probability distributions as functions instead of simply using the means and covariances. Similar problems arise for nonlinear systems or nonadditive noise even if the noise is Gaussian, because the distributions of the x_i will not then be Gaussian. Consider the system

    x_{i+1} = f(x_i, n_i)                                       (6.1-12)

Assume that f has continuous partial derivatives almost everywhere, and can be inverted to obtain n_i (trivial if the noise is additive):

    n_i = f^{-1}(x_i, x_{i+1})                                  (6.1-13)

The n_i are assumed to be white and independent of x_0, but not necessarily Gaussian. Then the conditional distribution of x_{i+1} given x_i can be obtained from Equation (3.4-1):

    p(x_{i+1}|x_i) = |J| p_n(f^{-1}(x_i, x_{i+1}))              (6.1-14)

where J is the Jacobian of the transformation. (For additive noise, x_{i+1} = f(x_i) + n_i, the Jacobian is the identity and this reduces to p(x_{i+1}|x_i) = p_n(x_{i+1} - f(x_i)).) The joint distribution of the x_i can then be obtained from

    p(x_0, x_1, ..., x_N) = p(x_0) Π_{i=0}^{N-1} p(x_{i+1}|x_i)     (6.1-15)

Equations (6.1-14) and (6.1-15) are, in general, too unwieldy to work with in practice. Practical work with nonlinear systems or non-Gaussian noise usually involves simplifying approximations.

6.2 CONTINUOUS TIME

We will look at continuous-time stochastic processes by looking at limits of discrete-time processes with the time interval going to 0. The discussion will focus on how to take the limit so that a useful result is obtained. We will not get involved in the intricacies of Ito or Stratonovich calculus (Astrom, 1970; Jazwinski, 1970; and Lipster and Shiryayev, 1977).

6.2.1 Linear Systems Forced by White Noise

Consider a linear continuous-time dynamic system driven by white, zero-mean noise:

    ẋ(t) = Ax(t) + F_c n(t)                                     (6.2-1)

We would like to look at this system as a limit (in some sense) of the discrete-time systems

    x(t_i + Δ) = (I + ΔA)x(t_i) + ΔF_c n(t_i)                   (6.2-2)

as Δ, the time interval between samples, goes to zero. Equation (6.2-2) is in the form of Euler's method for approximating the solution of Equation (6.2-1). For the moment we will consider the discrete n(t_i) to be Gaussian. The distribution of the n(t_i) is not particularly important to the end result, but our argument is somewhat easier if the n(t_i) are Gaussian. Equation (6.2-2) corresponds to Equation (6.1-5) with I + ΔA substituted for Φ, ΔF_c substituted for F, and some changes in notation to make the discrete and continuous notations more similar.
If n were a reasonably behaved deterministic process, we would get Equation (6.2-1) as a limit of Equation (6.2-2) when Δ goes to zero. For the stochastic system, however, the situation is quite different. Substituting I + ΔA for Φ and ΔF_c for F in Equation (6.1-8) gives

    P(t_i + Δ) = (I + ΔA)P(t_i)(I + ΔA)* + Δ²F_cF_c*            (6.2-3)

Subtracting P(t_i) and dividing by Δ gives

    [P(t_i + Δ) - P(t_i)]/Δ = AP(t_i) + P(t_i)A* + ΔAP(t_i)A* + ΔF_cF_c*     (6.2-4)

Thus in the limit,

    Ṗ(t) = AP(t) + P(t)A*                                       (6.2-5)

Note that F_c has completely dropped out of Equation (6.2-5). The distribution of x does not depend on the distribution of the forcing noise. In particular, if P_0 = 0, then P(t) = 0 for all t. The system simply does not respond to the forcing noise.

A model in which the system does not respond to the noise is not very useful. A useful model would be one that gives a finite nonzero covariance. Such a model is achieved by multiplying the noise by Δ^{-1/2} (and thus its covariance by Δ^{-1}). We rewrite Equation (6.2-2) as

    x(t_i + Δ) = (I + ΔA)x(t_i) + Δ^{1/2} F_c n(t_i)            (6.2-6)

The Δ in the ΔF_cF_c* term of Equation (6.2-4) then disappears and the limit becomes

    Ṗ(t) = AP(t) + P(t)A* + F_cF_c*                             (6.2-7)

Note that only a Δ^{-1} behavior of the covariance (or something asymptotic to Δ^{-1}) will give a finite nonzero result in the limit.
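The role of the Δ^{-1/2} factor can be checked numerically. The sketch below (a scalar example with assumed values, not from the text) propagates the discrete covariance over a fixed interval for decreasing Δ, with and without the scaling:

    import numpy as np

    def cov_at_T(a, fc, T, delta, scaled):
        # Scalar covariance recursion for Eq. (6.2-2) or Eq. (6.2-6).
        p = 0.0
        f_eff = fc * delta**0.5 if scaled else fc * delta  # per-step noise gain
        for _ in range(int(T / delta)):
            p = (1.0 + delta * a)**2 * p + f_eff**2
        return p

    for delta in (0.1, 0.01, 0.001):
        print(delta, cov_at_T(-1.0, 1.0, 5.0, delta, True),
              cov_at_T(-1.0, 1.0, 5.0, delta, False))
    # The scaled column approaches the continuous-time value 0.5;
    # the unscaled column shrinks toward zero as delta decreases.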

We will thus define the continuous-time white-noise process in Equation (6.2-1) as a limit, in some sense, of discrete-time processes with covariance Δ^{-1}. The autocorrelation function of the continuous-time process is

    R(t,τ) = I δ(t - τ)                                         (6.2-8)

The impulse function δ(s) is zero for s ≠ 0 and infinite for s = 0, and its integral over any finite range including the origin is 1. We will not go through the mathematical formalism required to rigorously define the impulse function; suffice it to say that the concept can be defined rigorously.

This model for a continuous-time white-noise process requires further discussion. It is obviously not a faithful representation of any physical process, because the variance of n(t) is infinite at every time point. The total power of the process is also infinite. The response of a dynamic system to this process, however, appears well-behaved. The reasons for this apparently anomalous behavior are most easily understood in the frequency domain. The power spectrum of the process n is flat; there is the same power in every frequency band of the same width. There is finite power in any finite frequency range, but because the process has infinite bandwidth, the total power is infinite. Because any physical system has finite bandwidth, the system response to the noise will be finite. If, on the other hand, we kept the total power of the noise finite as we originally tried to do, the power in any finite frequency band would go to zero as we approached infinite bandwidth; thus, a physical system would have zero response.

The preceding paragraph explains why it is necessary to have infinite power in a meaningful continuous-time white-noise process. It also suggests a rationale for justifying such a model even though any physical noise source must have finite power. We can envision the physical noise as being band-limited, but with a band limit much larger than the system band limit. If the noise band limit is large enough, its exact value is unimportant, because the system response to inputs at a very high frequency is negligible. Therefore, we can analyze the system with white noise of infinite bandwidth and obtain results that are very good approximations to the finite-bandwidth results. The analysis is much simpler in the infinite-bandwidth white-noise model (even though some fairly abstract mathematics is required to make it rigorous). In summary, continuous-time white noise is not physically realizable but can give results that are good approximations to physical systems.

6.2.2 Additive White Measurement Noise

We saw in the previous section that continuous-time white noise driving a dynamic system must have infinite power in order to obtain useful results. We will show in this section that the same conclusion applies to continuous-time white measurement noise.

We suppose that noise-corrupted measurements z are made of the system of Equation (6.2-1). The measurement equation is assumed to be linear with additive white noise:

    z(t) = Cx(t) + G_c n(t)                                     (6.2-9)
For convenience, we will assume that the mean of the noise is 0. We then ask what else must be said about n(t) in order to obtain useful results from this model. Presume that we have measured z(τ) over the interval 0 ≤ τ ≤ T, and we want to estimate some characteristic of the system, say x(T). This is a filtering problem, which we will discuss further in Chapter 7. For current purposes, we will simplify the problem by assuming that A = 0 and F = 0 in Equation (6.2-1). Thus x(t) is a constant over the interval, and dynamics do not enter the problem. We can consider this a static problem with repeated observations of a random variable, like those situations we covered in Chapter 5.

Let us look at the limit of the discrete-time equivalents to this problem. If samples are taken every Δ seconds, there are Δ^{-1}T total samples. Equation (5.1-31) is the MAP estimator for the discrete-time problem. The mean square error of the estimate is given by Equations (5.1-32) to (5.1-34). As Δ decreases to 0 and the number of samples increases to infinity, the mean square error decreases to 0. This result would imply that continuous-time estimates are always exact; it is thus not a very useful model. To get a useful model, we must let the covariance of the measurement noise go to infinity like Δ^{-1} as Δ decreases to 0. This argument is very similar to that used in the previous section. If the measurement noise had finite variance, each measurement would give us a finite amount of information, and we would have an infinite amount of information (no uncertainty) when the number of measurements was infinite. Thus the discrete-time equivalent of Equation (6.2-9) is

    z(t_i) = Cx(t_i) + Δ^{-1/2} G_c n(t_i)                      (6.2-10)

where n(t_i) has identity covariance.

Because any measurement is made using a physical device with a finite bandwidth, we stop getting much new information as we take samples faster than the response time of the instrument. In fact, the measurement equation is sometimes written as a differential equation for the instrument response instead of in the more idealized form of Equation (6.2-9). We need a noise model with a finite power in the bandwidth of the measurements because this is the frequency range that we are really working in. This argument is essentially the same as the one we used in the discussion of white noise forcing the system. The white noise can again be viewed as an approximation to band-limited noise with a large bandwidth. The lack of fidelity in representing very high-frequency characteristics is not too important, because high frequencies will tend to be filtered out when we operate on the data. (For instance, most operations on continuous-time data will have integrations at some point.) As a consequence of this modeling, we should be dubious of the practical application of any algorithm which results from this analysis and does not filter out high-frequency data in some manner.

We can generalize the conclusions in this and the previous section. Continuous-time white noise with finite variance is generally not a useful concept in any context. We will therefore take as part of the definition of continuous-time white noise that it have infinite covariance. We will use the spectral density rather than the covariance as a meaningful measure of the noise amplitude. White noise with autocorrelation

    R(t,τ) = G_cG_c* δ(t - τ)                                   (6.2-11)

has spectral density G_cG_c*.

6.2.3 Nonlinear Systems

As with discrete-time nonlinearities, exact analysis of nonlinear continuous-time systems is generally so difficult as to be impossible for most practical intents and purposes. The usual approach is to use a linearization of the system or some other approximation. Let the system equation be

    ẋ(t) = f(x(t),t) + F_c n(t)                                 (6.2-12)

where n is zero-mean white noise with unity power spectral density. For compactness of notation, let p represent the distribution of x at time t, given that x was x_0 at time t_0. The evolution of this distribution is described by the following parabolic partial differential equation:

    ∂p/∂t = - Σ_{i=1}^{n} ∂(f_i p)/∂x_i + (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} ∂²[(F_cF_c*)_{ij} p]/∂x_i ∂x_j     (6.2-13)

where n is the length of the x vector. The initial condition for this equation at t = t_0 is p = δ(x - x_0). See Jazwinski (1970) for the derivation of Equation (6.2-13). This equation is called the Fokker-Planck equation or the forward Kolmogorov equation. It is considered one of the basic equations of nonlinear filtering theory. In principle, this equation completely describes the behavior of the system and thus the problem is "solved." In practice, the solution of this multidimensional partial differential equation is usually too formidable to consider seriously.

CHAPTER 7

7.0 STATE ESTIMATION FOR DYNAMIC SYSTEMS

In this chapter, we address the estimation of the state of dynamic systems. The emphasis is on linear dynamic systems with additive Gaussian noise. We will initially develop the theory for discrete-time systems and then extend it to continuous-time and mixed continuous/discrete models. The general form of a linear discrete-time system model is

    x_{i+1} = Φx_i + Ψu_i + Fn_i                                (7.0-1a)
    z_i = Cx_i + Du_i + Gη_i                                    (7.0-1b)

The n_i and η_i are assumed to be independent Gaussian noise vectors with zero mean and identity covariance. The noise n is called process noise or state noise; η is called measurement noise. The input vectors u_i are assumed to be known exactly. The state of the system at the ith time point is x_i. The initial condition x_0 is a Gaussian random variable with mean m_0 and covariance P_0. (P_0 can be zero, meaning that the initial condition is known exactly.) In general, the system matrices Φ, Ψ, F, C, D, and G can be functions of time. This chapter will assume that the system is time-invariant in order to simplify the notation. Except for the discussion of steady-state forms in Section 7.3, the results are easily generalized to time-varying systems by adding appropriate time subscripts to the matrices.

The state estimation problem is defined as follows: based on the measurements z_1, z_2, ..., z_N, estimate the state x_M. To shorten the notation, we define

    Z_N = (z_1, z_2, ..., z_N)*                                 (7.0-2)

State estimation problems are commonly divided into three classes, depending on the relationship of M and N.

If M is equal to N, the problem is called a filtering problem. Based on all of the measurements taken up to the current time, we desire to estimate the current state. This type of problem is typical of those encountered in real-time applications. It is the most widely treated one, and the one on which we will concentrate.
If M is greater than N, we have a prediction problem. The data are available up to the current time N, and we desire to predict the state at some future time M. We will see that once the filtering problem is solved, the prediction problem is trivial.

If M is less than N, the problem is called a smoothing problem. This type of problem is most commonly encountered in postexperiment batch processing, in which all of the data are gathered before processing begins. In this case, the estimate of x_M can be based on all of the data gathered, both before and after time M. By using all values of M from 1 to N - 1, plus the filtered solution for M = N, we can construct the estimated state time history for the interval being processed. This is referred to as fixed-interval smoothing. Smoothing can also be used in a real-time environment where a few time points of delay in obtaining current state estimates is an acceptable price for the improved accuracy gained. For instance, it might be acceptable to gather data up to time N = M + 2 before computing the estimate of x_M. This is called fixed-lag smoothing. A third type of smoothing is fixed-point smoothing; in this case, it is desired to estimate x_M for a particular fixed M in a real-time environment, using new data to improve the estimate.

In all cases, x_M will have a prior distribution derived from Equation (7.0-1a) and the noise distributions. Since Equation (7.0-1) is linear in the noise, and the noise is assumed Gaussian, the prior and posterior distributions of x_M will be Gaussian. Therefore, the a posteriori expected value, MAP, and mean Bayes' minimum risk estimators will be identical. These are the obvious estimators for a problem with a well-defined prior distribution. The remainder of the chapter assumes the use of these estimators.

7.1 EXPLICIT FORMULATION

By manipulating Equation (7.0-1) into an appropriate form, we can write the state estimation problem as a special case of the static estimation problem studied in Chapter 5. In this section, we will solve the problem by such manipulation; the fact that a dynamic system is involved will thus play no special role in the meaning of the estimation problem. We will examine only the filtering problem here.

Our aim is to manipulate the state estimation problem into the form of Equation (5.1-1). The most obvious approach to this problem is to define the ξ of Equation (5.1-1) to be x_N, the vector which we desire to estimate. The observation, Z, would be a concatenation of z_1, ..., z_N; and the input, U, would be a concatenation of u_0, ..., u_{N-1}. The noise vector, ω, would then have to be a concatenation of n_0, ..., n_{N-1}, η_1, ..., η_N. The problem can indeed be written in this manner. Unfortunately, the prior distribution of x_N is not independent of ω (except for the case N = 0); therefore, Equation (5.1-16) is not the correct expression for the estimate of x_N. Of course, we could derive an appropriate expression allowing for the correlation, but we will take an alternate approach which allows the direct use of Equation (5.1-16).

Let the unknown parameter vector be the concatenation of the initial condition and all of the process noise vectors:

    ξ = (x_0*, n_0*, n_1*, ..., n_{N-1}*)*                      (7.1-1)

The vector x_N, which we really desire to estimate, can be written as an explicit function of the elements of ξ; in particular, Equation (7.0-1a) expands into

    x_N = Φ^N x_0 + Σ_{i=0}^{N-1} Φ^{N-1-i}(Ψu_i + Fn_i)        (7.1-2)

We can compute the MAP estimate of x_N by using the MAP estimates of x_0 and the n_i in Equation (7.1-2). Note that we can freely treat the n_i as noise or as unknown parameters with prior distributions without changing the essential nature of the problem. The probability distribution of Z is identical in either case. The only distinction is whether or not we want estimates of the n_i. For this choice of ξ, the remaining items of Equation (5.1-1) are the concatenations

    Z = (z_1*, z_2*, ..., z_N*)*    U = (u_0*, u_1*, ..., u_{N-1}*)*    ω = (η_1*, η_2*, ..., η_N*)*     (7.1-3)

We get an explicit formula for the z_i by substituting Equation (7.1-2) into Equation (7.0-1b), giving

    z_i = C[Φ^i x_0 + Σ_{j=0}^{i-1} Φ^{i-1-j}(Ψu_j + Fn_j)] + Du_i + Gη_i     (7.1-4)

which can be written in the form of Equation (5.1-1), with the matrices of Equation (5.1-1) assembled as block arrays from Φ, Ψ, F, C, D, and G (7.1-5). You can easily verify these matrices by substituting them into Equation (5.1-1). The mean and covariance of the prior distribution of ξ are

    m_ξ = (m_0*, 0, ..., 0)*        P_ξ = diag(P_0, I, ..., I)     (7.1-6)

The MAP estimate of ξ is then given by Equation (5.1-16). The MAP estimate of x_N, which we seek, is obtained from that of ξ by using Equation (7.1-2).

The filtering problem is thus "solved." This solution, however, is unacceptably cumbersome. If the system state is an ℓ-vector, the inversion of an (N + 1)ℓ-by-(N + 1)ℓ matrix is required in order to estimate x_N. The computational costs become unacceptable after a very few time points. We could investigate whether it is possible to take advantage of the structure of the matrices given in Equation (7.1-5) in order to simplify the computation. We can more readily achieve the same ends, however, by adopting a different approach to solving the problem from the start.

7.2 RECURSIVE FORMULATION

To find a simpler solution to the filtering problem than that derived in the preceding section, we need to take better advantage of the special structure of the problem. The above derivation used the linearity of the problem and the Gaussian assumption on the noise, which are secondary features of the problem structure. The fact that the problem involves a dynamic state-space model is much more basic, but was not used above to any special advantage; the first step in the derivation was to recast the system in the form of a static model. Let us reexamine the problem, making use of the properties of dynamic state-space systems.

The defining property of a state-space model is as follows: the future output is dependent only on the current state and the future input. In other words, provided that the current state of the system is known, knowledge of any previous states, inputs, or outputs is irrelevant to the prediction of future system behavior; all relevant facts about previous behavior are subsumed in the knowledge of the current state. This is essentially the definition of the state of a system. The probabilistic expression of this idea is

    p(x_{i+1}|x_i, Z_i) = p(x_{i+1}|x_i)                        (7.2-1)

It is this property that allows the system to be described in a recursive form, such as that of Equation (7.0-1). The recursive form involves much less computation than the mathematically equivalent explicit form of Equation (7.1-4).

This reasoning suggests that recursion might be used to some advantage in obtaining a solution to the filtering problem. The estimators under consideration (MAP, etc.) are all defined from the conditional distribution of x_N given Z_N. We will seek a recursive expression for the conditional distribution, and thus for the estimates. We will prove that such an expression exists by deriving it. In the nature of recursive forms, we start by assuming that the conditional distribution of x_N given Z_N is known for some N, and then we attempt to derive an expression for the conditional distribution of x_{N+1} given Z_{N+1}. (We recognize this task as similar to the measurement partitioning of Section 5.2.2, in that we want to simplify the solution by processing the measurements one at a time. Equations (5.2-1) and (7.2-1) express similar ideas and give the basis for the simplifications in both cases. The x_N of Equation (7.2-1) corresponds to the ξ of Equation (5.2-2).)

Our task then is to derive p(x_{N+1}|Z_{N+1}). We will divide this task into two steps. First, we derive p(x_{N+1}|Z_N) from p(x_N|Z_N). This is called the prediction step, because we are predicting x_{N+1} based on previous information. It is also called the time update because we are updating the estimate to a new time point based on the same data. The second step is to derive p(x_{N+1}|Z_{N+1}) from p(x_{N+1}|Z_N). This is called the correction step, because we are correcting the predicted estimate of x_{N+1} based on the new information in z_{N+1}. It is also called the measurement update because we are updating the estimate based on the new measurement.

Since all of the distributions are assumed to be Gaussian, they are completely defined by their means and covariance matrices. Denote the (presumed known) mean and covariance of the distribution p(x_N|Z_N) by x̂_N and P_N, respectively. In general, x̂_N and P_N are functions of Z_N, but we will not encumber the notation with this information. Likewise, denote the mean and covariance of p(x_{N+1}|Z_N) by x̃_{N+1} and Q_{N+1}. The task is thus to derive expressions for x̃_{N+1} and Q_{N+1} in terms of x̂_N and P_N, and expressions for x̂_{N+1} and P_{N+1} in terms of x̃_{N+1} and Q_{N+1}.

7.2.1 Prediction Step

The prediction step (time update) is straightforward. For x̃_{N+1}, simply take the expected value of Equation (7.0-1a) conditioned on Z_N:

    E{x_{N+1}|Z_N} = ΦE{x_N|Z_N} + Ψu_N + FE{n_N|Z_N}           (7.2-2)

The quantities E{x_{N+1}|Z_N} and E{x_N|Z_N} are, by definition, x̃_{N+1} and x̂_N, respectively. Z_N is a function of x_0, n_0, ..., n_{N-1}, η_1, ..., η_N, and deterministic quantities; n_N is independent of all of these, and therefore independent of Z_N. Thus

    E{n_N|Z_N} = E{n_N} = 0                                     (7.2-3)

Substituting this into Equation (7.2-2) gives

    x̃_{N+1} = Φx̂_N + Ψu_N                                      (7.2-4)

In order to evaluate Q_{N+1}, take the covariance of both sides of Equation (7.0-1a). Since the three terms on the right-hand side of the equation are independent, the covariance of their sum is the sum of their covariances:

    cov{x_{N+1}|Z_N} = cov{Φx_N|Z_N} + cov{Ψu_N|Z_N} + cov{Fn_N|Z_N}     (7.2-5)

The terms cov{x_{N+1}|Z_N} and cov{x_N|Z_N} are, by definition, Q_{N+1} and P_N, respectively. Ψu_N is deterministic and, thus, has zero covariance. By the independence of n_N and Z_N,

    cov{Fn_N|Z_N} = cov{Fn_N} = FF*                             (7.2-6)

Substituting these relationships into Equation (7.2-5) gives

    Q_{N+1} = ΦP_NΦ* + FF*                                      (7.2-7)


Equations (7.2-4) and (7.2-7) constitute the results desired for the prediction step (time update) of the filtering problem. They readily generalize to predicting more than one sample ahead. These equations justify our earlier statement that, once the filtering problem is solved, the prediction problem is easy; for suppose we desire to estimate x_M based on Z_N with M > N. If we can solve the filtering problem to obtain x̂_N, the filtered estimate of x_N, then, by a straightforward extension of Equation (7.2-4),

    x̂_M = Φ^{M-N} x̂_N + Σ_{i=N}^{M-1} Φ^{M-1-i} Ψu_i           (7.2-8)

is the desired MAP estimate of x_M.

7.2.2 Correction Step

For the correction step (measurement update), assume that we know the mean, x̃_{N+1}, and covariance, Q_{N+1}, of the distribution of x_{N+1} given Z_N. We seek the distribution of x_{N+1} given both Z_N and z_{N+1}. From Equation (7.0-1b),

    z_{N+1} = Cx_{N+1} + Du_{N+1} + Gη_{N+1}                    (7.2-9)

The distribution of η_{N+1} is Gaussian with zero mean and identity covariance. By the same argument as used for n_N, η_{N+1} is independent of Z_N. Thus, we can say that

    p(η_{N+1}|Z_N) = p(η_{N+1})                                 (7.2-10)

This trivial-looking statement is the key to the problem, for now everything in the problem is conditioned on Z_N: we know the distributions of x_{N+1} and η_{N+1} conditioned on Z_N, and we seek the distribution of x_{N+1} conditioned on Z_N and additionally conditioned on z_{N+1}. This problem is thus exactly in the form of Equation (5.1-1), except that all of the distributions involved are conditioned on Z_N. This amounts to nothing more than restating the problem of Chapter 5 on a different probability space, one conditioned on Z_N. The previous results apply directly to the new probability space. Therefore, from Equations (5.1-14) and (5.1-15),

    P_{N+1} = [Q_{N+1}^{-1} + C*(GG*)^{-1}C]^{-1}               (7.2-11)
    x̂_{N+1} = x̃_{N+1} + P_{N+1}C*(GG*)^{-1}(z_{N+1} - Cx̃_{N+1} - Du_{N+1})     (7.2-12)

In obtaining Equations (7.2-11) and (7.2-12) from Equations (5.1-14) and (5.1-15), we have identified the following quantities:

    (7.2-11),(7.2-12)        (5.1-14),(5.1-15)
    x_{N+1}                  ξ
    z_{N+1}                  Z
    x̃_{N+1}                  m_ξ
    Q_{N+1}                  P
    C                        C
    Du_{N+1}                 D
    GG*                      GG*
    x̂_{N+1}                  E{ξ|Z}
    P_{N+1}                  cov{ξ|Z}
This completes the derivation of the correction step (measurement update), which we see to be a direct application of the results from Chapter 5.
7.2.3 Kalman Filter

To complete the recursive solution to the filtering problem, we need only know the solution for some value of N, and we can then propagate that solution to larger N. The solution for N = 0 is immediate from the initial problem statement. The distribution of x_0, conditioned on Z_0 (i.e., conditioned on nothing, because Z_i = (z_1, ..., z_i)*), is given to be Gaussian with mean m_0 and covariance P_0.

Let us now fit together the pieces derived above to show how to solve the filtering problem:

Step 1: Initialization. Define x̂_0 = m_0; P_0 is given.

Step 2: Prediction (time update), starting with i = 0:

    x̃_{i+1} = Φx̂_i + Ψu_i                                      (7.2-13)
    z̃_{i+1} = Cx̃_{i+1} + Du_{i+1}                              (7.2-14)
    Q_{i+1} = ΦP_iΦ* + FF*                                      (7.2-15)

Step 3: Correction (measurement update):

    P_{i+1} = [Q_{i+1}^{-1} + C*(GG*)^{-1}C]^{-1}               (7.2-16)
    x̂_{i+1} = x̃_{i+1} + P_{i+1}C*(GG*)^{-1}(z_{i+1} - z̃_{i+1})     (7.2-17)

We have defined the quantity z̃_{i+1} by Equation (7.2-14) in order to make the form of Equation (7.2-17) more apparent; z̃_{i+1} can easily be shown to be E{z_{i+1}|Z_i}. Repeat the prediction and correction steps for i = 0, 1, ..., N - 1 in order to obtain x̂_N, the MAP estimate of x_N based on z_1, ..., z_N.
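The whole recursion fits in a few lines of code. The following is a minimal sketch of Equations (7.2-13) to (7.2-17), with matrix names following the text; it is an illustration, not a tested implementation, and it assumes u has N + 1 entries so that u_{i+1} is available at each step:

    import numpy as np

    def kalman_filter(z, u, Phi, Psi, F, C, D, G, m0, P0):
        # Discrete-time filter recursion, Equations (7.2-13) to (7.2-17).
        # Information form: assumes Q and GG* nonsingular (see Section 7.2.4).
        GG = G @ G.T
        x_hat, P = np.asarray(m0, dtype=float), np.asarray(P0, dtype=float)
        estimates = []
        for i in range(z.shape[0]):
            # prediction (time update), Equations (7.2-13) to (7.2-15)
            x_bar = Phi @ x_hat + Psi @ u[i]
            z_bar = C @ x_bar + D @ u[i + 1]
            Q = Phi @ P @ Phi.T + F @ F.T
            # correction (measurement update), Equations (7.2-16), (7.2-17)
            P = np.linalg.inv(np.linalg.inv(Q) + C.T @ np.linalg.solve(GG, C))
            x_hat = x_bar + P @ C.T @ np.linalg.solve(GG, z[i] - z_bar)
            estimates.append(x_hat)
        return estimates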

Equations (7.2-13) to (7.2-17) constitute the Kalman filter for discrete-time systems. The recursive form of this filter is particularly suited to real-time applications. Once x̂_N has been computed, it is not necessary, as it was using the methods of Section 7.1, to start from scratch in order to compute x̂_{N+1}; we need do only one more prediction step and one more correction step. It is extremely important to note that the computational cost of obtaining x̂_{N+1} from x̂_N is not a function of N. This means that real-time Kalman filters can be implemented using fixed finite resources to run for arbitrarily long time intervals. This was not the case using the methods of Section 7.1, where the estimator started from scratch for each time point, and each new estimate required more computation than the previous estimate. For some applications, it is also important that the P_i and Q_i do not depend on the measurements, and can thus be precomputed. Such precomputation can significantly reduce real-time computational requirements. None of these advantages should obscure the fact that the Kalman filter obtains the same estimates as were obtained in Section 7.1. The advantages of the Kalman filter lie in the easier computation of the estimates, not in improvements in the accuracy of the estimates.

7.2.4 Alternate Forms

The filter Equations (7.2-13) to (7.2-17) can be algebraically manipulated into several equivalent alternate forms. Although all of the variants are formally equivalent, different ones have computational advantages in different situations. Some of the advantages lie in different points of singularity and different size matrices to invert. We will show a few of the possible alternate forms in this section.

The first variant comes from using Equations (5.1-12) and (5.1-13) (the covariance form) instead of (5.1-14) and (5.1-15) (the information form). Equations (7.2-16) and (7.2-17) then become

    P_{i+1} = Q_{i+1} - Q_{i+1}C*(CQ_{i+1}C* + GG*)^{-1}CQ_{i+1}     (7.2-18)
    x̂_{i+1} = x̃_{i+1} + Q_{i+1}C*(CQ_{i+1}C* + GG*)^{-1}(z_{i+1} - z̃_{i+1})     (7.2-19)

The covariance form is particularly useful if GG* or any of the Q_i are singular. The exact conditions under which the Q_i can become singular are fairly complicated, but we can draw some simple conclusions from looking at Equation (7.2-15). First, if FF* is nonsingular, then Q_i can never be singular. Second, a singular P_0 (particularly P_0 = 0) is likely to cause problems if FF* is also singular. The only matrix to invert in Equations (7.2-18) and (7.2-19) is CQ_{i+1}C* + GG*. If this matrix is singular, the problem is ill-posed; the situation is the same as that discussed in Section 5.1.3. Note that the covariance form involves inversion of an r-by-r matrix, where r is the length of the observation vector. On the other hand, the information form involves inversion of a p-by-p matrix, where p is the length of the state vector. For some systems, the difference between r and p may be significant, resulting in a strong preference for one form or the other.

If GG* is diagonal (or if GG* is diagonalizable so the system can be rewritten with a diagonal G), Equations (7.2-18) and (7.2-19) can be manipulated into a form that involves no matrix inversions. The key to this manipulation is to consider the system to have r independent scalar observations at each time point instead of a single vector observation of length r. The scalar observations can then be processed one at a time. The Kalman filter partitions the estimation problem by processing the measurements one time point at a time; with this modification, we extend the same partitioning concept to process one element of the measurement vector at a time. The derivation of the measurement-update Equations (7.2-18) and (7.2-19) applies without change to a system with several independent observations at a time point. We need only apply the measurement-update equation r times with no intervening time updates. We do need a little more complicated notation to keep track of the process, but the equations are basically the same.

Let C^{(j)} and D^{(j)} be the jth rows of the C and D matrices, G^{(j,j)} be the jth diagonal element of G, and z^{(j)}_{i+1} be the jth element of z_{i+1}. Define x̂_{i+1,j} to be the estimate of x_{i+1} after the jth scalar observation at time i + 1 has been processed, and define P_{i+1,j} to be the covariance of x̂_{i+1,j}. We start the measurement update at each time point with

    x̂_{i+1,0} = x̃_{i+1}                                        (7.2-20)
    P_{i+1,0} = Q_{i+1}                                         (7.2-21)

Then, for each scalar measurement, we do the update

    P_{i+1,j+1} = P_{i+1,j} - P_{i+1,j}C^{(j+1)*}[C^{(j+1)}P_{i+1,j}C^{(j+1)*} + G^{(j+1,j+1)2}]^{-1}C^{(j+1)}P_{i+1,j}     (7.2-22)
    x̂_{i+1,j+1} = x̂_{i+1,j} + P_{i+1,j}C^{(j+1)*}[C^{(j+1)}P_{i+1,j}C^{(j+1)*} + G^{(j+1,j+1)2}]^{-1}(z^{(j+1)}_{i+1} - z̃^{(j+1)}_{i+1})     (7.2-23)

where

    z̃^{(j+1)}_{i+1} = C^{(j+1)}x̂_{i+1,j} + D^{(j+1)}u_{i+1}    (7.2-24)

Note that the inversions in Equations (7.2-22) and (7.2-23) are scalar inversions rather than matrix inversions. None of these scalars will be 0 unless CQ_{i+1}C* + GG* is singular. After processing all of the scalar measurements for the time point, we have

    x̂_{i+1} = x̂_{i+1,r}                                        (7.2-25)
    P_{i+1} = P_{i+1,r}                                         (7.2-26)
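A sketch of this sequential processing in code (assuming GG* diagonal with the squares of g_diag on the diagonal; names are illustrative, not from the text):

    import numpy as np

    def scalar_measurement_update(x_bar, Q, z, u, C, D, g_diag):
        # Process the r scalar observations one at a time,
        # Equations (7.2-20) to (7.2-26); only scalar divisions occur.
        x, P = np.asarray(x_bar, dtype=float), np.asarray(Q, dtype=float)
        for j in range(len(z)):
            c = C[j]                                # j-th row of C
            s = float(c @ P @ c) + g_diag[j]**2     # scalar innovation variance
            k = P @ c / s                           # gain for this observation
            x = x + k * (z[j] - c @ x - D[j] @ u)
            P = P - np.outer(k, c @ P)
        return x, P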

7.2.5 Innovations

A discussion of the Kalman filter would be incomplete without some mention of the innovations. The innovation at sample point i, also called the residual, is

    ν_i = z_i - z̃_i                                             (7.2-27)

where

    z̃_i = E{z_i|Z_{i-1}} = Cx̃_i + Du_i                          (7.2-28)

Following the notation for Z_i, we define

    V_i = (ν_1, ν_2, ..., ν_i)*                                 (7.2-29)

Now V_i is a linear function of Z_i. This is shown by Equations (7.2-13) to (7.2-17) and (7.2-27), which give formulae for computing the ν_i in terms of the z_i. It may not be immediately obvious that this function is invertible. We will prove invertibility by writing the inverse function; i.e., by expressing z_i in terms of V_i. Repeating Equations (7.2-13) and (7.2-14):

    x̃_{i+1} = Φx̂_i + Ψu_i                                      (7.2-30a)

Substituting Equation (7.2-27) into Equation (7.2-17) gives

    x̂_{i+1} = x̃_{i+1} + P_{i+1}C*(GG*)^{-1}ν_{i+1}             (7.2-30b)

Finally, from Equation (7.2-27),

    z_{i+1} = Cx̃_{i+1} + Du_{i+1} + ν_{i+1}                    (7.2-30c)

Equation (7.2-30) is called the innovations form of the system. It gives the recursive formula for computing the z_i from the ν_i.

Let us examine the distribution of the innovations. The innovations are obviously Gaussian, because they are linear functions of Z, which is Gaussian. Using Equation (3.3-10), it is immediate that the mean of the innovation is 0:

    E{ν_i} = E{z_i - E{z_i|Z_{i-1}}} = E{z_i} - E{E{z_i|Z_{i-1}}} = 0     (7.2-31)

Derive the covariance matrix of the innovation by writing

    ν_i = z_i - z̃_i = C(x_i - x̃_i) + Gη_i                      (7.2-32)

The two terms on the right are independent, so

    cov(ν_i) = C cov(x_i - x̃_i)C* + GG* = CQ_iC* + GG*          (7.2-33)

The most interesting property of the innovations is that ν_i is independent of ν_j for i ≠ j. To prove this, it is sufficient to show that ν_i is independent of V_{i-1}. Let us examine E{ν_i|V_{i-1}}. Since V_{i-1} is obtained from Z_{i-1} by an invertible continuous transformation, conditioning on V_{i-1} is the same as conditioning on Z_{i-1}. (If one is known, so is the other.) Therefore,

    E{ν_i|V_{i-1}} = E{ν_i|Z_{i-1}} = 0                         (7.2-34)

as shown in Equation (7.2-31). Thus we have

    E{ν_i|V_{i-1}} = E{ν_i}                                     (7.2-35)

Comparing this equation with the formula for the Gaussian conditional mean given in Theorem (3.5-9), we see that this can be true only if ν_i and V_{i-1} are uncorrelated (Λ_{12} = 0 in the theorem). Then by Theorem (3.5-8), ν_i and V_{i-1} are independent.

The innovation is thus a discrete-time white-noise process (i.e., each time point is independent of all of the others). Thus, the Kalman filter is often called a whitening filter; it creates a white process (V) as a function of a nonwhite process (Z).
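This whiteness property is also a practical diagnostic: if the model matrices are correct, the sample autocorrelation of the filter residuals should be near zero at nonzero lags. A minimal check (illustrative, not from the text), for a scalar innovation sequence stored in a NumPy array:

    import numpy as np

    def innovation_autocorrelation(nu, max_lag=10):
        # Sample autocorrelation of a scalar innovation sequence nu;
        # values near zero at nonzero lags are consistent with whiteness.
        nu = nu - nu.mean()
        r0 = float(nu @ nu)
        return [float(nu[k:] @ nu[:-k]) / r0 if k > 0 else 1.0
                for k in range(max_lag + 1)]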

7.3 STEADY-STATE FORM

The largest computational cost of the Kalman filter is in the computation of the covariance matrix P_i using Equations (7.2-15) and (7.2-16) (or any of the alternate forms). For a large and important class of problems, we can replace P_i and Q_i by constants P and Q, independent of time. This approach significantly lowers the computational cost of the filter. We will restrict the discussion in this section to time-invariant systems; in only a few special cases do time-invariant filters make sense for time-varying systems. The equations that a time-invariant filter must satisfy are easily derived. Using Equations (7.2-18) and (7.2-15), we can express Q_{i+1} as a function of Q_i:

    Q_{i+1} = Φ[Q_i - Q_iC*(CQ_iC* + GG*)^{-1}CQ_i]Φ* + FF*     (7.3-1)

Thus, for Q_i to equal a constant Q, we must have

    Q = Φ[Q - QC*(CQC* + GG*)^{-1}CQ]Φ* + FF*                   (7.3-2)

This is the algebraic matrix Riccati equation for discrete-time systems. (An alternate form can be obtained by using Equation (7.2-16) in place of Equation (7.2-18); the condition can also be written in terms of P instead of Q.)
If Q is a scalar, the algebraic Riccati equation is a quadratic equation in Q and the solution is simple. For nonscalar Q, the solution is far more difficult and has been the subject of numerous papers. We will not cover the details of deriving and implementing numerical methods for solving the Riccati equation. The most widely used methods are based on eigenvector decomposition (Potter, 1966; Vaughan, 1970; and Geyser and Lehtinen, 1975). When a unique solution exists, these methods give accurate results with small computational costs.
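Besides the eigenvector methods cited above, the steady-state covariance can be found, when the iteration converges, by simply repeating Equation (7.3-1) until Q stops changing. A minimal sketch (illustrative, not from the text):

    import numpy as np

    def steady_state_Q(Phi, F, C, GG, Q0, tol=1e-10, max_iter=100_000):
        # Iterate Equation (7.3-1) until successive iterates agree to tol.
        Q = np.asarray(Q0, dtype=float)
        FF = F @ F.T
        for _ in range(max_iter):
            S = C @ Q @ C.T + GG                   # must be nonsingular
            Q_next = Phi @ (Q - Q @ C.T @ np.linalg.solve(S, C @ Q)) @ Phi.T + FF
            if np.max(np.abs(Q_next - Q)) < tol:
                return Q_next
            Q = Q_next
        return Q

This is much slower than the eigenvector decompositions mentioned above, but it mirrors the convergence statement of the theorem that follows.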

The d e r i v a t i o n o f the conditions under which Equation (7.3-2) has ap acceptable solution i s more complie cated than would be approprirte for i n c l u s i o n i n t h i s text. W therefore present the f o l l o w i n g r e s u l t without proof: Theorem 7.3-1 I f a l l unstable o r marginally stable modes o f the system are c o n t r o l l a b l e by the process naise and are observable, and i f CFF*C* + GG* i s i n v e r t i b l e , then Equaticn (7.3-2) has a unique p o s i t i v e semidefinite solut i o n and Qi converges t o t h i s s o l u t i o n f o r a l l choices o f the i n i t i a l covariance, P o .


Proof See Schweppe (1973, p. 142) for a heuristic argument, or Balakrishnan (1984) and Kailath and Ljung (1976) for more rigorous treatments.

The condition on CFF*C* + GG* ensures that the problem is well-posed. Without this condition, the inverse in Equation (7.3-1) may not exist for some initial P_0 (particularly P_0 = 0). Some statements of the theorem incorporate the stronger requirement that GG* be invertible, but the weaker condition is sufficient. Perhaps the most important point to note is that the system is not required to be stable. Although the existence and uniqueness of the solution are easier to prove for stable systems, the more general conditions of Theorem (7.3-1) are important in the estimation and control of unstable systems.

We can achieve a heuristic understanding of the need for the conditions of Theorem (7.3-1) by examining one-dimensional systems, for which we can write the solutions to Equation (7.3-2) explicitly. If the system is one-dimensional, then it is observable if C is nonzero (and G is finite), and it is controllable by the process noise if F is nonzero. We will consider the problem in several cases.

Case 1: G = 0. In this case, we must have C ≠ 0 and F ≠ 0 in order for the problem to be well-posed. Equation (7.3-1) then reduces to Q_{i+1} = FF*, giving a unique time-invariant covariance satisfying Equation (7.3-2).

Case 2: G ≠ 0, C = 0, F = 0. In this case, Equation (7.3-1) becomes Q_{i+1} = φ²Q_i. This converges to the steady state Q = 0 if |φ| < 1 (stable system). If |φ| = 1, Q_i remains at the starting value, and thus the steady-state covariance is not unique. If |φ| > 1, the solution diverges or stays at 0, depending on the starting value.

Case 3: G ≠ 0, C = 0, F ≠ 0. In this case, Equation (7.3-2) reduces to

Q = φ²Q + F²   (7.3-3)

For |φ| < 1, this equation has a unique, nonnegative solution

Q = F²/(1 − φ²)   (7.3-4)

and convergence of Equation (7.3-1) to this solution is easily shown. If |φ| ≥ 1, the solution is negative, which is not an admissible covariance, or infinite; in either event, Equation (7.3-1) diverges to infinity.

Case 4: G ≠ 0, C ≠ 0, F = 0. In this case, Equation (7.3-2) is a quadratic equation with roots zero and (φ² − 1)G²/C². If |φ| < 1, the second root is negative, and thus there is a unique nonnegative root. If |φ| = 1, there is a double root at zero, and the solution is still unique. In both of these events, convergence of Equation (7.3-1) to the solution Q = 0 is easy to show. If |φ| > 1, there are two nonnegative roots, and the system can converge to either one, depending on whether or not the initial covariance is zero.
Case 5: G ≠ 0, C ≠ 0, F ≠ 0. In this case, Equation (7.3-2) is a quadratic equation with roots

Q = (1/2)H ± √[(1/4)H² + F²G²/C²]   (7.3-5)

where

H = F² + (φ² − 1)G²/C²   (7.3-6)

Regardless of the value of φ, the square-root term is always larger in magnitude than (1/2)H; therefore, there is one positive and one negative root. Convergence of Equation (7.3-1) to the positive root is easy to show.

Let us now summarize the results of these five cases. In all well-posed cases, the covariance converges to a unique value if the system is stable. For unstable or marginally stable systems, a unique converged value is assured if both C and F are nonzero. For one-dimensional systems, there is also a unique convergent solution for |φ| = 1, G ≠ 0, C ≠ 0, F = 0; this case illustrates that the conditions of Theorem (7.3-1) are not necessary, although they are sufficient. Heuristically, we can say that observability (C ≠ 0) prevents the covariance from diverging to infinity for unstable systems. Controllability by the process noise (F ≠ 0) ensures uniqueness by eliminating the possibility of perfect prediction (Q = 0).
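The dependence of Case 4 on the initial covariance is easy to verify numerically. A minimal sketch (scalar values chosen arbitrarily for illustration): with |φ| > 1, the recursion of Equation (7.3-1) stays at the root Q = 0 when started there, and converges to the positive root (φ² − 1)G²/C² from any nonzero start.

    # Case 4: G != 0, C != 0, F = 0, |phi| > 1 -- two nonnegative roots.
    phi, C, G = 1.5, 1.0, 1.0

    def iterate(Q, steps=200):
        # Scalar form of Equation (7.3-1) with F = 0.
        for _ in range(steps):
            Q = phi**2 * (Q - Q**2 * C**2 / (C**2 * Q + G**2))
        return Q

    print(iterate(0.0))    # stays at the root Q = 0
    print(iterate(0.01))   # converges to (phi**2 - 1)*G**2/C**2 = 1.25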
An important related question to consider is the stability of the filter. We define the corrected error vector to be

ẽ_i = x_i − x̂_i   (7.3-7)

Using Equations (7.0-1), (7.2-15), (7.2-16), and (7.2-19) gives the recursive relationship

ẽ_{i+1} = (I − KC)Φẽ_i + (I − KC)Fn_i − KGη_{i+1}   (7.3-8)

We can show that, given the conditions of Theorem (7.3-1), the system of Equation (7.3-8) is stable. This stability implies that, in the absence of new disturbances (noise), errors in the state estimate will die out with time; furthermore, for bounded disturbances, the errors will always be bounded. A rigorous proof is not presented here.

It is interesting to examine the stability of the one-dimensional example with G ≠ 0, C ≠ 0, F = 0, and |φ| = 1. We previously noted that Q_i for this case converges to 0 for all initial covariances. Let us examine the steady-state filter. For this case, Equation (7.3-8) reduces to

ẽ_{i+1} = φẽ_i

which is only marginally stable. Recall that this case did not meet the conditions of Theorem (7.3-1), so our stability guarantee does not apply. Although a steady-state filter exists, it does not perform at all like the time-varying filter. The time-varying filter reduces the error to zero asymptotically with time. The steady-state filter has no feedback, and the error remains at its initial value. Balakrishnan (1984) discusses the steady-state filter in more detail.

Two special cases of time-invariant Kalman filters deserve special note. The first case is where F is zero and the system is stable (and GG* must be invertible to ensure a well-posed problem). In this case, the

steady-state Kalman gain K is zero. The Kalman filter simply integrates the state equation, ignoring any available measurements. Since the system is stable and has no disturbances, the error will decay to zero. The same filter is obtained for nonzero F if C is zero or if G is infinite. The error does not then decay to zero, but the output contains no useful information to feed back.

The second special case is where G is zero and C is square and invertible. FF* must be invertible to ensure a well-posed problem. For this case, the Kalman gain is C⁻¹; the estimator then reduces to

x̂(t_i) = C⁻¹[z(t_i) − Du(t_i)]

which ignores all previous information. The current state can be reconstructed exactly from the current measurement, so there is no need to consider past data. This is the antithesis of the case where F is 0 and no information from the current measurement is used. Most realistic systems lie somewhere between these two extremes.

7.4 CONTINUOUS TIME

The form of a linear continuous-time system model is

ẋ(t) = Ax(t) + Bu(t) + F_c n(t)   (7.4-1a)
z(t) = Cx(t) + Du(t) + G_c η(t)   (7.4-1b)

where n and η are assumed to be zero-mean white-noise processes with unity power spectral density. The input u is assumed to be known exactly. As in the discrete-time analysis, we will simplify the notation by assuming that the system is time invariant. The same derivation applies to time-varying systems by evaluating the matrices at the appropriate time points. We will analyze Equation (7.4-1) as a limit of the discrete-time systems

x(t_i + Δ) = (I + ΔA)x(t_i) + ΔBu(t_i) + Δ^{1/2}F_c n_i   (7.4-2a)
z(t_i) = Cx(t_i) + Du(t_i) + Δ^{-1/2}G_c η_i   (7.4-2b)

where n_i and η_i are discrete-time white-noise processes with identity covariances. The reasons for the Δ^{1/2} factors were discussed in Section 6.2.

The filter for the system of Equation (7.4-2) is obtained by making appropriate substitutions in Equations (7.2-13) to (7.2-17). We need to substitute (I + ΔA) in place of Φ, ΔB in place of Ψ, ΔF_cF_c* in place of FF*, and Δ^{-1}G_cG_c* in place of GG*. Combining Equations (7.2-13), (7.2-14), and (7.2-17), making the substitutions, subtracting x̂(t_i), and dividing by Δ gives a difference quotient for the state estimate. Taking the limit as Δ → 0 gives the filter equation

d/dt x̂(t) = Ax̂(t) + Bu(t) + P(t)C*(G_cG_c*)^{-1}[z(t) − Cx̂(t) − Du(t)]   (7.4-5)

It remains to find the equation for P(t). First note that Equation (7.2-15) becomes

Q(t_i + Δ) = (I + ΔA)P(t_i)(I + ΔA)* + ΔF_cF_c*   (7.4-6)

and thus

Q(t_i + Δ) = P(t_i) + Δ[AP(t_i) + P(t_i)A* + F_cF_c*] + Δ²AP(t_i)A*   (7.4-7)

Equation (7.2-18) is a more convenient form for our current purposes than (7.2-16). Make the appropriate substitutions in Equation (7.2-18) to get

P(t_i + Δ) = Q(t_i + Δ) − Q(t_i + Δ)C*[CQ(t_i + Δ)C* + Δ^{-1}G_cG_c*]^{-1}CQ(t_i + Δ)   (7.4-8)

Subtract P(t_i) and divide by Δ to give

[P(t_i + Δ) − P(t_i)]/Δ = Δ^{-1}[Q(t_i + Δ) − P(t_i)] − Q(t_i + Δ)C*[ΔCQ(t_i + Δ)C* + G_cG_c*]^{-1}CQ(t_i + Δ)   (7.4-9)

For the first term on the right of Equation (7.4-9), substitute from Equation (7.4-7) to get

[P(t_i + Δ) − P(t_i)]/Δ = AP(t_i) + P(t_i)A* + F_cF_c* + ΔAP(t_i)A* − Q(t_i + Δ)C*[ΔCQ(t_i + Δ)C* + G_cG_c*]^{-1}CQ(t_i + Δ)   (7.4-10)

Thus in the limit Equation (7.4-9) becomes

Ṗ(t) = AP(t) + P(t)A* + F_cF_c* − P(t)C*(G_cG_c*)^{-1}CP(t)   (7.4-11)

Equation (7.4-11) is the continuous-time Riccati equation. The initial condition for the equation is P(t_0) = P_0, the covariance of the initial state; P_0 is assumed to be known. Equations (7.4-5) and (7.4-11) constitute the solution to the continuous-time filtering problem for linear systems with white process and measurement noise. The continuous-time filter requires G_cG_c* to be nonsingular.

One point worth noting about the continuous-time filter is that the innovation z(t) − ẑ(t) is a white-noise process with the same power spectral density as the measurement noise. (They are not, however, the same process.) The power spectrum of the innovation can be found by looking at the limit of Equation (7.2-33). Making the appropriate substitutions gives

cov(ν(t_i)) = CQ(t_i)C* + Δ^{-1}G_cG_c*   (7.4-12)

The power spectral density of the innovation is then

Δ cov(ν(t_i)) = ΔCQ(t_i)C* + G_cG_c*   (7.4-13)

The disappearance of the first term of Equation (7.4-12) in the limit makes the continuous-time filter simpler than the discrete-time one in many ways.

For time-invariant continuous-time systems, we can investigate the possibility that the filter reaches a steady state. As in the discrete-time steady-state filter, this outcome would result in a significant computational advantage. If the steady-state filter exists, it is obvious that the steady-state P(t) must satisfy the equation

0 = AP + PA* + F_cF_c* − PC*(G_cG_c*)^{-1}CP   (7.4-14)

obtained by setting Ṗ to 0 in Equation (7.4-11). The eigenvector decomposition methods referenced after Equation (7.3-2) are also the best practical numerical methods for solving Equation (7.4-14). The following theorem, comparable to Theorem (7.3-1), is not proven here.

Theorem 7.4-1 If all unstable or neutrally stable modes of the system are controllable by the process noise and are observable, and if G_cG_c* is invertible, then Equation (7.4-14) has a unique positive semidefinite solution, and P(t) converges to this solution for all choices of the initial covariance P_0.

Proof See Kailath and Ljung (1976), Balakrishnan (1981), or Kalman and Bucy (1961).
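In practice, library solvers based on such decompositions are readily available. A minimal Python sketch using SciPy (the system matrices are arbitrary illustrations): Equation (7.4-14) is the dual of the control-form Riccati equation that scipy.linalg.solve_continuous_are expects, so A and C enter transposed.

    import numpy as np
    from scipy.linalg import solve_continuous_are

    # Illustrative system; Fc and Gc as in Equation (7.4-1).
    A = np.array([[0.0, 1.0], [-2.0, -0.5]])
    C = np.array([[1.0, 0.0]])
    Fc = np.diag([0.0, 1.0])
    Gc = np.array([[0.3]])

    # solve_continuous_are(a, b, q, r) solves a*X + Xa - Xb r^{-1} b*X + q = 0;
    # with a = A.T and b = C.T this is Equation (7.4-14) with X = P.
    P = solve_continuous_are(A.T, C.T, Fc @ Fc.T, Gc @ Gc.T)

    residual = (A @ P + P @ A.T + Fc @ Fc.T
                - P @ C.T @ np.linalg.solve(Gc @ Gc.T, C @ P))
    print(np.max(np.abs(residual)))  # near zero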

7.5 CONTINUOUS/DISCRETE TIME

Many practical applications of filtering involve discrete sampled measurements of systems with continuous-time dynamics. Since this problem has elements of both discrete and continuous time, there is often debate over whether the discrete- or continuous-time filter is more appropriate. In fact, neither of these filters is appropriate, because they are both based on models that are not realistic representations of the true system. As Schweppe (1973, p. 206) says,

Some rather interesting arguments sometimes result when one asks the question, Are the discrete- or the continuous-time results more useful? The answer is, of course, that the question is stupid.... Neither is superior in all cases.

The appropriate model for a continuous-time dynamic system with discrete-time measurements is a continuous-time model with discrete-time measurements. Although this statement sounds like a tautology, its point has been missed enough to make it worth emphasizing. Some of the confusion may be due to the mistaken impression that such a mixed model could not be analyzed with the available tools. In fact, the derivation of the appropriate filter is trivial, given the pure continuous- and pure discrete-time results. The filter for this class of problems simply involves an appropriate combination of the discrete- and continuous-time filters previously derived. It takes only a few lines to show how the previously derived results fit this problem. We will spend most of this section talking about implementation issues in a little more detail.

Let the system be described by

ẋ(t) = Ax(t) + Bu(t) + F_c n(t)   (7.5-1a)
z(t_i) = Cx(t_i) + Du(t_i) + Gη_i   (7.5-1b)

Equation (7.5-1a) is identical to Equation (7.4-1a); and, except for a notation change, Equation (7.5-1b) is identical to Equation (7.0-1b). Note that the observation is only defined at the discrete points t_i, although the state is defined in continuous time.

Between the times of two observations, the analysis of Equation (7.5-1) is identical to that of Equation (7.4-1) with an infinite G matrix or a zero C matrix; either of these conditions is equivalent to having no useful observation. Let x̂(t_i) be the state estimate at time t_i based on the observations up to and including z(t_i). Then the predicted estimate in the interval (t_i, t_i + Δ] is obtained from

x̃(t_i) = x̂(t_i)   (7.5-2)
d/dt x̃(t) = Ax̃(t) + Bu(t)   (7.5-3)

The covariance of the prediction is obtained from

Q(t_i) = P(t_i)   (7.5-4)
Q̇(t) = AQ(t) + Q(t)A* + F_cF_c*   (7.5-5)

Equations (7.5-3) and (7.5-5) are obtained directly by substituting C = 0 in Equations (7.4-5) and (7.4-11). The notation has been changed to indicate that, because there is no observation in the interval, these are predicted estimates; whereas, in the pure continuous-time filter, the observations are continuously used and filtered estimates are obtained. Integrate Equations (7.5-3) and (7.5-5) over the interval (t_i, t_i + Δ) to obtain the predicted estimate x̃(t_i + Δ) and its covariance Q(t_i + Δ).

In practice, although u(t) is defined continuously, it will often be measured (or otherwise known) only at the time points t_i. Furthermore, the integration will likely be done by a digital computer which cannot integrate continuous-time data exactly. Thus Equation (7.5-3) will be integrated numerically. The simplest integration approximation would give

x̃(t_i + Δ) = x̂(t_i) + Δ[Ax̂(t_i) + Bu(t_i)]   (7.5-6)

This approximation may be adequate for some purposes, but it is more often a little too crude. If the A matrix is time-varying, there are several reasonable integration schemes which we will not discuss here; the most common are based on Runge-Kutta algorithms (Acton, 1970). For systems with time-invariant A matrices and constant sample intervals, the transition matrix is by far the most efficient approach. First define

Φ = exp(AΔ)   (7.5-7)
Ψ = [∫₀^Δ exp(Aτ) dτ]B   (7.5-8)

Then

x̃(t_i + Δ) = Φx̂(t_i) + Ψu(t_i)   (7.5-9)

This approximation is the exact solution to Equation (7.5-3) if u(t) holds its value between samples. Wiberg (1971) and Zadeh and Desoer (1963) derive this solution. Moler and Van Loan (1978) discuss various means of numerically evaluating Equations (7.5-7) and (7.5-8). Equation (7.5-9) has the advantage of being in the exact form in which discrete-time systems are usually written (Equation (7.0-1a)). Equation (7.5-9) introduces about 1/2-sample delay in the modeling of the response to the control input unless the continuous-time u(t) holds its value between samples; this delay is often unacceptable. Figure (7.5-1) shows a sample input signal and the signal as modeled by Equation (7.5-9). A better approximation is usually

x̃(t_i + Δ) = Φx̂(t_i) + Ψ(1/2)[u(t_i) + u(t_i + Δ)]   (7.5-10)

This equation models u(t) between samples as being constant at the average of the two sample values. Figure (7.5-2) illustrates this model. There is little phase lag in the model represented by Equation (7.5-10), and the difference in implementation cost between Equations (7.5-9) and (7.5-10) is negligible. Equation (7.5-10) is probably the most commonly used approximation method with time-invariant A matrices. The high-frequency content introduced by the jumps in the above models can be removed by modeling u(t) as a linear interpolation between the measured values, as illustrated in Figure (7.5-3). This model adds another term to Equation (7.5-10) proportional to u(t_i + Δ) − u(t_i). In our experience, this degree of fidelity is usually unnecessary, and is not worth the extra cost and complication. There are some applications where the accuracy required might justify this or even more complicated methods, such as higher-order spline fits. (The linear interpolation is a first-order spline.)

If you are using a Runge-Kutta algorithm instead of a transition-matrix algorithm for solving the differential equation, linear interpolation of the input introduces negligible extra cost and is common practice. Equation (7.5-5) does not involve measured data and thus does not present the problems of interpolating between the measurements. The exact solution of Equation (7.5-5) is

Q(t_i + Δ) = ΦP(t_i)Φ* + ∫₀^Δ exp(Aτ)F_cF_c* exp(A*τ) dτ   (7.5-11)

as can be verified by substitution. Note that Equation (7.5-11) is exactly in the form of a discrete-time update of the covariance (Equation (7.2-15)) if F is defined as a square root of the integral term. For small Δ, the integral term is well approximated by ΔF_cF_c*, resulting in

Q(t_i + Δ) = ΦP(t_i)Φ* + ΔF_cF_c*   (7.5-12)

The errors in this approximation are usually far smaller than the uncertainty in the value of F_c, and can thus be neglected. This approximation is significantly better than the alternate approximation

Q(t_i + Δ) = P(t_i) + Δ[AP(t_i) + P(t_i)A* + F_cF_c*]   (7.5-13)

obtained by inspection from Equation (7.5-5).

The above discussion has concentrated on propagating the estimate between measurements, i.e., the time update. It remains only to discuss the measurement update for the discrete measurements. We have x̃(t_i) and Q(t_i) at some time point. We need to use these and the measured data at the time point to obtain x̂(t_i) and P(t_i). This is identical to the discrete-time measurement update problem solved by Equations (7.2-16) and (7.2-17). We can also use the alternate forms discussed in Section 7.2.4.

To start the filter, we are given the a priori mean x̃(t_0) and covariance Q(t_0) of the state at time t_0. Use Equations (7.2-16) and (7.2-17) (or alternates) to obtain x̂(t_0) and P(t_0). Integrate Equations (7.5-2) to (7.5-5) from t_0 to t_1 by some means (most likely Equations (7.5-10) and (7.5-12)) to obtain x̃(t_1) and Q(t_1). This completes one time step of the filter; processing of subsequent time points uses the same procedure.

The solution for the steady-state form of the discrete/continuous filter follows immediately from that of the discrete-time filter, because the equations for the covariance updates are identical for the two filters with the appropriate substitution of F in terms of F_c. Theorem (7.3-1) therefore applies.

We can summarize this section by saying that there is a continuous/discrete-time filter derived from appropriate results in the pure discrete- and pure continuous-time analyses. If the input u holds its value between samples, then the form of the continuous/discrete filter is identical to that of the pure discrete-time filter with an appropriate substitution for the equivalent discrete-time process noise covariance. For more realistic behavior of u, we must adopt approximations if the analysis is done on a digital computer. It is also possible to view the continuous-time filter equations as giving reasonable approximations to the continuous/discrete-time filter in some situations. In any event, we will not go wrong as long as we recognize that we can write the exact filter equations for the continuous/discrete-time system and that we must consider any other equations used as approximations to the exact solution. With this frame of mind we can objectively evaluate the adequacy of the approximations involved for specific problems.
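To make the procedure concrete, the following Python sketch implements one cycle of the continuous/discrete filter, using the average-input model of Equation (7.5-10) for the state, the approximation of Equation (7.5-12) for the covariance, and the standard discrete measurement update written in gain form. The routine names are ours, and the Ψ formula assumes an invertible A.

    import numpy as np
    from scipy.linalg import expm

    def time_update(xhat, P, u_i, u_ip1, A, B, Fc, dt):
        # Propagate between measurements: Equations (7.5-7) to (7.5-12).
        Phi = expm(A * dt)                                    # (7.5-7)
        Psi = np.linalg.solve(A, (Phi - np.eye(len(A))) @ B)  # (7.5-8), A invertible
        xbar = Phi @ xhat + Psi @ (0.5 * (u_i + u_ip1))       # (7.5-10)
        Q = Phi @ P @ Phi.T + dt * Fc @ Fc.T                  # (7.5-12)
        return xbar, Q

    def measurement_update(xbar, Q, z, u, C, D, G):
        # Standard discrete-time measurement update in gain form.
        S = C @ Q @ C.T + G @ G.T          # innovation covariance
        K = Q @ C.T @ np.linalg.inv(S)     # Kalman gain
        xhat = xbar + K @ (z - C @ xbar - D @ u)
        P = Q - K @ C @ Q
        return xhat, P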

7.6 SMOOTHING

The derivation of optimal smoothers draws heavily on the derivation of the Kalman filter. Starting from the filter results, only a single step is required to compute the smoothed estimates. In this section, we briefly derive the fixed-interval smoother for discrete-time linear systems with additive Gaussian noise. Fixed-interval smoothers are the most widely used. The same general principles apply to deriving fixed-point and fixed-lag smoothers. See Meditch (1969) for derivations and equations for fixed-point and fixed-lag smoothers and for continuous-time forms.

There are alternate computational forms for the fixed-interval smoother; these forms give mathematically equivalent results. We will not discuss computational advantages of the various forms. See Bierman (1977) and Bach and Wingrove (1983) for alternate forms and discussions of their advantages.

Consider the fixed-interval smoothing problem on an interval with N time points. As in the filter derivation, we will concentrate on two time points at a time in order to get a recursive form. It is straightforward to write an explicit formulation for the smoother, like the explicit filter form of Section 7.1, but such a form is impractical.

In the nature of recursive derivations, assume that we have previously computed x̌_{i+1}, the smoothed estimate of x_{i+1}, and S_{i+1}, the covariance of x_{i+1} given Z_N. We seek to derive an expression for x̌_i and S_i. Note that this recursion runs backwards in time instead of forwards; a forward recursion will not work, for reasons which we will see later.

The smoothed estimates, x̌_i and x̌_{i+1}, are defined by

x̌_i = E{x_i | Z_N}     x̌_{i+1} = E{x_{i+1} | Z_N}   (7.6-1)

We will use the measurement partitioning ideas of Section 5.2.2, with the measurement Z_N partitioned into Z_i and

Z̄_i = (z_{i+1},...,z_N)   (7.6-2)

From the derivation of the Kalman filter, we can write the joint distribution of x_i and x_{i+1} conditioned on Z_i. It is Gaussian with

E{(x_i, x_{i+1}) | Z_i} = (x̂_i, x̃_{i+1})   (7.6-3)

cov{(x_i, x_{i+1}) | Z_i} = [ P_i      P_iΦ*  ]
                            [ ΦP_i     Q_{i+1} ]   (7.6-4)

We did not previously derive the cross term in the above covariance matrix. To derive the form shown, write

E{(x_i − x̂_i)(x_{i+1} − x̃_{i+1})* | Z_i} = E{(x_i − x̂_i)[Φ(x_i − x̂_i) + Fn_i]* | Z_i} = P_iΦ*   (7.6-5)

For the second step of the partitioned algorithm, we consider the measurements Z̄_i, using Equations (7.6-3) and (7.6-4) for the prior distribution. The measurements Z̄_i can be written in the form

Z̄_i = C̄_i x_{i+1} + D̄_i + Ḡ_i η̄_i   (7.6-6)

for some matrices C̄_i, D̄_i, and Ḡ_i, and some Gaussian, zero-mean, identity-covariance noise vector η̄_i. Although we could laboriously write out expressions for the matrices in Equation (7.6-6), this step is unnecessary; we need only know that such a form exists. The important thing about Equation (7.6-6) is that x_i does not appear in it. Using Equations (7.6-3) and (7.6-4) for the prior distribution and Equation (7.6-6) for the measurement equation, we can now obtain the joint posterior distribution of x_i and x_{i+1} given Z_N. This distribution is Gaussian with mean and covariance given by Equations (5.1-12) and (5.1-13), substituting Equation (7.6-3) for m_ξ, Equation (7.6-4) for P, and the matrices of Equation (7.6-6) for C, D, and G.

By definition (Equation (7.6-1)), the mean of this distribution gives the smoothed estimates x̌_i and x̌_{i+1}. Making the substitutions into Equation (5.1-12) and expanding gives expressions for both smoothed estimates. We can solve the expression for x̌_i in terms of x̌_{i+1}, which we assume to have been computed in the previous step of the backwards recursion:

x̌_i = x̂_i + P_iΦ*Q_{i+1}^{-1}(x̌_{i+1} − x̃_{i+1})   (7.6-9)

Equation (7.6-9) is the backwards recursive form sought. Note that the equation does not depend explicitly on the measurements or on the matrices in Equation (7.6-6). That information is all subsumed in x̌_{i+1}. The "initial" condition for the recursion is

x̌_N = x̂_N   (7.6-10)

which follows directly from the definitions. We do not have a corresponding known boundary condition at the beginning of the interval, which is why we must propagate the smoothing recursion backwards instead of forwards.

We can now describe the complete process of computing the smoothed state estimates for a fixed time interval. First propagate the Kalman filter through the entire interval, saving all of the values x̃_i, x̂_i, P_i, and Q_i. Then propagate Equation (7.6-9) backwards in time, using the saved values from the filter, and starting from the boundary condition given by Equation (7.6-10).

We can derive a formula for the smoother covariance by substituting appropriately into Equation (5.1-13) to get the joint covariance of x_i and x_{i+1} given Z_N (Equation (7.6-11)).

(The off-diagonal blocks are not relevant to this derivation.) We can solve Equation (7.6-11) for S_i in terms of S_{i+1}, giving

S_i = P_i + P_iΦ*Q_{i+1}^{-1}(S_{i+1} − Q_{i+1})Q_{i+1}^{-1}ΦP_i   (7.6-12)

This gives us a backwards recursion for the smoother covariance. The "initial" condition

S_N = P_N   (7.6-13)

follows from the definitions. Note that, as in the recursion for the smoothed estimate, the measurements and the measurement equation matrices have dropped out of Equation (7.6-12). All the necessary data about the future process is subsumed in S_{i+1}. Note also that it is not necessary to compute the smoother covariance S_i in order to compute the smoothed estimates.
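As an illustration, the complete forward/backward procedure can be sketched in a few lines. The backward pass below is written in the standard Rauch-Tung-Striebel form of the recursions above; the array layout of the saved filter quantities is our assumption.

    import numpy as np

    def rts_smoother(xpred, xfilt, Qpred, Pfilt, Phi):
        # Fixed-interval smoother. xpred[i], Qpred[i] are the predicted
        # state and covariance before measurement i; xfilt[i], Pfilt[i]
        # are the filtered values after measurement i.
        N = len(xfilt)
        xs, Ss = [None] * N, [None] * N
        xs[-1], Ss[-1] = xfilt[-1], Pfilt[-1]          # (7.6-10), (7.6-13)
        for i in range(N - 2, -1, -1):                 # backward recursion
            Ci = Pfilt[i] @ Phi.T @ np.linalg.inv(Qpred[i + 1])
            xs[i] = xfilt[i] + Ci @ (xs[i + 1] - xpred[i + 1])         # (7.6-9)
            Ss[i] = Pfilt[i] + Ci @ (Ss[i + 1] - Qpred[i + 1]) @ Ci.T  # (7.6-12)
        return xs, Ss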

7.7 NONLINEAR SYSTEMS AND NON-GAUSSIAN NOISE

Optimal state estimation for nonlinear dynamic systems is substantially more difficult than for linear systems. Only in rare special cases are there tractable exact solutions for optimal filters for nonlinear systems. The same comments apply to systems with non-Gaussian noise. Practical implementations of filters for nonlinear systems invariably involve approximations. The most common approximations are based on linearizing the system and using the optimal filter for the linearized system. Similarly, non-Gaussian noise is approximated, to first order, by Gaussian noise with the same mean and covariance. Consider a nonlinear dynamic system with additive noise

ẋ(t) = f(x(t),u(t)) + n(t)   (7.7-1a)
z(t_i) = g(x(t_i),u(t_i)) + η_i   (7.7-1b)

Assume that we have some nominal estimate, x_n(t), of the state time history. Then the linearization of Equation (7.7-1) about this nominal trajectory is

ẋ(t) = A(t)x(t) + B(t)u(t) + f_n(t) + n(t)   (7.7-2a)
z(t_i) = C(t_i)x(t_i) + D(t_i)u(t_i) + g_n(t_i) + η_i   (7.7-2b)

where

A(t) = ∇_x f(x,u)|_{x=x_n(t)}     B(t) = ∇_u f(x,u)|_{x=x_n(t)}   (7.7-3)

f_n(t) = f(x_n(t),u(t)) − A(t)x_n(t) − B(t)u(t)   (7.7-4)

with C(t_i), D(t_i), and g_n(t_i) defined correspondingly from g.

For a given nominal trajectory, Equations (7.7-2) to (7.7-4) define a time-varying linear system. The Kalman filter/smoother algorithms derived in previous sections of this chapter give optimal state estimates for this linearized system. The filter based on this linearized system is called a linearized Kalman filter or an extended Kalman filter (EKF). Its adequacy as an approximation to the optimal filter for the nonlinear system depends on several factors which we will not analyze in depth. It is a reasonable supposition that if the system is nearly linear, then the linearized Kalman filter will be a close approximation to the optimal filter for the system. If, on the other hand, nonlinearities play a major role in defining the characteristic system responses, the reasonableness of the linearized Kalman filter is questionable.

The above description is intended only to introduce the simplest ideas of linearized Kalman filters. Starting from this point, there are numerous extensions, modifications, and nuances of application. Nonlinear filtering is an area of current research. See Bach and Wingrove (1983) and Cox and Bryson (1980) for a few of the many investigations in this field. Schweppe (1973) and Jazwinski (1970) have fairly extensive discussions of nonlinear state estimation.
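When closed-form derivatives of f and g are inconvenient, the Jacobians of Equation (7.7-3) can be approximated numerically about the nominal trajectory. A minimal sketch (central differences; the step size is an illustrative choice):

    import numpy as np

    def jacobian(fun, x, eps=1e-6):
        # Central-difference Jacobian of fun at x.
        x = np.asarray(x, dtype=float)
        f0 = np.asarray(fun(x))
        J = np.zeros((f0.size, x.size))
        for j in range(x.size):
            dx = np.zeros_like(x)
            dx[j] = eps
            J[:, j] = (np.asarray(fun(x + dx)) - np.asarray(fun(x - dx))) / (2 * eps)
        return J

    # For a model f(x, u): A = jacobian(lambda x: f(x, u_n), x_n),
    # and similarly C from g; x_n and u_n are nominal values.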

Figure (7.5-1). Hold-last-value input model.

Figure (7.5-2). Average value input model.

Figure (7.5-3). Linear interpolation input model.

CHAPTER 8

8.0 OUTPUT ERROR METHOD FOR DYNAMIC SYSTEMS

In previous chapters, we have covered the static estimation problem and the estimation of the state of dynamic systems. With this background, we can now begin to address the principal subject of this book, estimation of the parameters of dynamic systems. Before addressing the more difficult parameter estimation problems posed by more general system models, we will consider the simplified case that leads to the algorithm called output error.

The simplification that leads to the output-error method is to omit the process-noise term from the state equation. For this reason, the output-error method is often described by terms like "the no-process-noise algorithm" or "the measurement-noise-only algorithm."

We will first discuss mixed continuous/discrete-time systems, which are most appropriate for the majority of the practical applications. We will follow this discussion by a brief summary of any differences for pure discrete-time systems, which are useful for some applications. The derivation and results are essentially identical. The pure continuous-time results, although similar in expression, involve extra complications. We have never seen an appropriate practical application of the pure continuous-time results; we therefore feel justified in omitting them.

In mixed continuous/discrete time, the most general system model that we will seriously consider is

x(t_0) = x_0   (8.0-1a)
ẋ(t) = f[x(t),u(t),ξ]   (8.0-1b)
z(t_i) = g[x(t_i),u(t_i),ξ] + Gη_i   (8.0-1c)

The measurement noise η is assumed to be a sequence of independent Gaussian random variables with zero mean and identity covariance. The input u is assumed to be known exactly. The initial condition x_0 can be treated in several ways, as discussed in Section 8.2. In general, the functions f and g can also be explicit functions of t. We omit this from the notation for simplicity. (In any event, explicit time dependence can be put in the notation of Equation (8.0-1) by defining an extra control equal to t.) The corresponding nonlinear model for pure discrete-time systems is

x(t_0) = x_0   (8.0-2a)
x(t_{i+1}) = f[x(t_i),u(t_i),ξ]   (8.0-2b)
z(t_i) = g[x(t_i),u(t_i),ξ] + Gη_i   (8.0-2c)

The assumptions are the same as in the continuous/discrete case. Although the output-error method applies to nonlinear systems, we will give special attention to the treatment of linear systems. The linear form of Equation (8.0-1) is

x(t_0) = x_0   (8.0-3a)
ẋ(t) = Ax(t) + Bu(t)   (8.0-3b)
z(t_i) = Cx(t_i) + Du(t_i) + Gη_i   (8.0-3c)
The matrices A, B, C, D, and G are functions of ξ; we will not complicate the notation by explicitly indicating this relationship. Of course, x and z are also functions of ξ through their dependence on the system matrices.

In general, the matrices A, B, C, D, and G can also be functions of time. For notational simplicity, we have not explicitly indicated this dependence. In several places, time invariance of the matrices introduces significant computational savings. The text will indicate such situations. Note that ξ cannot be a function of time. Problems with time-varying ξ must be reformulated with a time-invariant ξ in order for the techniques of this chapter to be applicable. The linear form of Equation (8.0-2) is

x(t_0) = x_0   (8.0-4a)
x(t_{i+1}) = Φx(t_i) + Ψu(t_i)   (8.0-4b)
z(t_i) = Cx(t_i) + Du(t_i) + Gη_i   (8.0-4c)

The transition matrices Φ and Ψ are functions of ξ, and possibly of time.

For any of the model forms, a prior distribution for ξ may or may not exist, depending on the particular application. When there is no prior distribution, or when you desire to obtain an estimate independent of the prior distribution, use a maximum-likelihood estimator. When a prior distribution is considered, MAP estimators are appropriate. For the parameter estimation problem, a posteriori expected-value estimates and Bayesian optimal estimates are impractical to compute, except in special cases. The posterior distribution of ξ is not, in general, symmetric; thus the a posteriori expected value need not equal the MAP estimate.

The basic method of derivation for the output-error method is to reduce the problem to the static form of Chapter 5. We will see that the dynamic system makes the models fairly complicated, but not different in any essential way from those of Chapter 5. We first consider the case where G and the initial condition are assumed to be known. Choose an arbitrary value of ξ. Given the initial condition x_0 and a specified input time-history u, the state equation (8.0-1b) can be solved to give the state as a function of time. We assume that f is sufficiently smooth to guarantee the existence and uniqueness of the solution (Brauer and Nohel, 1969). For complicated f functions, the solution may be difficult or impossible to express in closed form, but that aspect is irrelevant to the theory. (The practical implication is that the solution will be obtained using numerical approximation methods.) The important thing to note is that, because of the elimination of the process noise, the solution is deterministic. For a specified input u, the system state is thus a deterministic function of ξ and time. For consistency with the notation of the filter-error method discussed later, denote this function by x̃_ξ(t). The ξ subscript emphasizes its dependence on ξ. The dependence on u is not relevant to the current discussion, so the notation ignores this dependence for simplicity. Assuming known G, Equation (8.0-1c) then becomes

z(t_i) = g[x̃_ξ(t_i),u(t_i),ξ] + Gη_i   (8.1-1)

Equation (8.1-1) is in the form of Equation (2.4-1); it is a static nonlinear model with additive noise. There are multiple experiments, one at each t_i. The estimators of Section 5.4 apply directly. The assumptions adopted have allowed us to solve the system dynamics, leaving an essentially static problem.

The MAP estimate is obtained by minimizing Equation (5.4-9). In the notation of this chapter, this equation becomes

J(ξ) = (1/2) Σ_{i=1}^{N} [z(t_i) − ẑ_ξ(t_i)]*(GG*)^{-1}[z(t_i) − ẑ_ξ(t_i)] + (1/2)(ξ − m_ξ)*P^{-1}(ξ − m_ξ)   (8.1-2)

where

x̃_ξ(t_0) = x_0   (8.1-3a)
d/dt x̃_ξ(t) = f[x̃_ξ(t),u(t),ξ]   (8.1-3b)
ẑ_ξ(t_i) = g[x̃_ξ(t_i),u(t_i),ξ]   (8.1-3c)

The quantities m_ξ and P are the mean and covariance of the prior distribution of ξ, as in Chapter 5. For the MLE estimator, omit the last term of Equation (8.1-2), giving

J(ξ) = (1/2) Σ_{i=1}^{N} [z(t_i) − ẑ_ξ(t_i)]*(GG*)^{-1}[z(t_i) − ẑ_ξ(t_i)]   (8.1-4)

Equation (8.1-4) is a quadratic form in the difference between z, the measured response (output), and ẑ, the response computed from the deterministic part of the system model. This motivates the name "output error." The minimization of Equation (8.1-4) is an intuitively plausible estimator defensible even without statistical derivation. The minimizing value of ξ gives the system model that best approximates (in a least-squares sense) the actual system response to the test input. Although this does not necessarily guarantee that the model response and the system response will be similar for other test inputs, the minimizing value of ξ is certainly a plausible estimate.

The estimates that result from minimizing Equation (8.1-4) are sometimes called "least squares" estimates, in reference to the quadratic form of the equation. We prefer to avoid the use of this terminology because it is potentially confusing. Many of the estimators applicable to dynamic systems have a least-squares form, so the term is not definitive. Furthermore, the term "least squares" is most often applied to Equation (8.1-4) to contrast it with other forms labeled "maximum likelihood" (typically the estimators of Section 8.4, which apply to unknown G, or the estimators of Chapter 9, which account for process noise). This contrast is misleading because Equation (8.1-4) describes a completely rigorous, maximum-likelihood estimator for the problem as posed. The differences between Equation (8.1-4) and the estimators of Section 8.4 and Chapter 9 are differences in the problem statement, not differences in the statistical principles used for solution.
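As a concrete illustration, evaluating the cost of Equation (8.1-4) requires nothing more than simulating the deterministic system and accumulating the weighted squared residuals. A minimal Python sketch; the simulate routine is a placeholder for a solver of Equations (8.1-3a) to (8.1-3c).

    import numpy as np

    def output_error_cost(xi, z, u, simulate, GG_inv):
        # J(xi) = (1/2) sum_i [z_i - zhat_i]* (GG*)^{-1} [z_i - zhat_i],
        # Equation (8.1-4).  simulate(xi, u) returns the computed
        # responses zhat_xi(t_i), one per row.
        zhat = simulate(xi, u)
        r = z - zhat                   # output errors
        return 0.5 * np.einsum('ij,jk,ik->', r, GG_inv, r)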


To derive the output-error method for pure discrete-time systems, substitute the discrete-time Equation (8.0-2b) in place of Equation (8.0-1b). The derivation and the result are unchanged except that Equation (8.1-3b) becomes

x̃_ξ(t_{i+1}) = f[x̃_ξ(t_i),u(t_i),ξ]   (8.1-5)

8.2 INITIAL CONDITIONS

The above derivation of the output-error method assumed that the initial condition was known exactly. This assumption is seldom strictly true, except when using forms where the initial condition is zero by definition. The initial condition is typically based on imperfectly measured data. This characteristic suggests treating the initial condition as a random variable with some mean and covariance. Such treatment, however, is incompatible with the output-error method. The output-error method is predicated on a deterministic solution of the state equation. Treatment of a random initial condition requires the more complex filter-error method discussed later.
If the system is stable, then initial condition effects decay to a negligible level in a finite time. If this decay is sufficiently fast and the error in the initial condition is sufficiently small, the initial condition error will have negligible effect on the system response and can be ignored. If the errors in the initial condition are too large to justify neglecting them, there are several ways to resolve the problem without sacrificing the relative simplicity of the output-error method. One way is to simply improve the initial-condition values. This is sometimes trivially easy if the initial-condition value is computed from the measurement at the first time point of the maneuver (a common practice): change the start time by one sample to avoid an obvious wild point, average the first few data points, or draw a fairing through the noise and use the faired value.

When these methods are inapplicable or insufficient, we can include the initial condition in the list of unknown parameters to estimate. The initial condition is then a deterministic function of ξ. The solution of the state equation is thus still a deterministic function of ξ and time, as required for the output-error method. The equations of Section 8.1 still apply, provided that we substitute x̃_ξ(t_0) = x_0(ξ) for Equation (8.1-3a).

It is easy to show that the initial-condition estimates have poor asymptotic properties as the time interval increases. The initial-condition information is all near the beginning of the maneuver, and increasing the time interval does not add to this information. Asymptotically, we can and should ignore initial conditions for stable systems. This is one case where asymptotic results are misleading. For real data with finite time intervals we should always carefully consider initial conditions. Thus, we avoid making the mistake of one published paper (which we will leave anonymous) which blithely set the model initial condition to zero in spite of clearly nonzero data. It is not clear whether this was a simple oversight or whether the author thought that asymptotic results justified the practice; in any event, the resulting errors were so egregious as to render the results worthless (except as an object lesson).

8.3 COMPUTATIONS

Equations (8.1-2) and (8.1-3) define the cost function that must be minimized to obtain the MAP estimates (or, in the special case that P⁻¹ is zero, the MLE estimates). This is a fairly complicated function of ξ. Therefore we must use an iterative minimization scheme.
It is easy to become overwhelmed by the apparent complexity of J as a function of ξ; ẑ_ξ(t_i) is itself a complicated function of ξ, involving the solution of a differential equation. To get J as a function of ξ we must substitute this function for ẑ(t_i) in Equation (8.1-2). You might give up at the thought of evaluating first and second gradients of this function, as required by most iterative optimization methods. The complexity, however, is only apparent. It is crucial to recognize that we do not need to develop a closed-form expression, the development of which would be difficult at best. We are only required to develop a workable procedure for computing the result.

To evaluate the gradients of J, we need only proceed one step at a time; each step is quite simple, involving nothing more complicated than chain-rule differentiation. This step-by-step process follows the advice from Alice in Wonderland:

The White Rabbit put on his spectacles. "Where shall I begin, please your Majesty?" he asked. "Begin at the beginning," the King said, very gravely, "and go on till you come to the end: then stop."

8.3.1 Gauss-Newton Method

The cost function is in the form of a sum of squares, which makes Gauss-Newton the preferred optimization algorithm. Sections 2.5.2 and 5.4.3 discussed the Gauss-Newton algorithm. To gather together all the important equations, we repeat the basic equations of the Gauss-Newton algorithm in the notation of this chapter. Gauss-Newton is a quasi-Newton algorithm. The full Newton-Raphson algorithm is

ξ_{L+1} = ξ_L − [∇²_ξ J(ξ_L)]^{-1}[∇_ξ J(ξ_L)]*   (8.3-1)

The first gradient is

∇_ξ J(ξ) = − Σ_{i=1}^{N} [z(t_i) − ẑ_ξ(t_i)]*(GG*)^{-1}[∇_ξ ẑ_ξ(t_i)] + (ξ − m_ξ)*P^{-1}   (8.3-2)

For the Gauss-Newton algorithm, we approximate the second gradient by

∇²_ξ J(ξ) ≈ Σ_{i=1}^{N} [∇_ξ ẑ_ξ(t_i)]*(GG*)^{-1}[∇_ξ ẑ_ξ(t_i)] + P^{-1}   (8.3-3)

which corresponds to Equation (2.5-11) applied to the cost function of this chapter. Equations (8.3-1) through (8.3-3) are the same, whether the system is in pure discrete time or mixed continuous/discrete time. The only quantities in these equations requiring any discussion are ẑ_ξ(t_i) and ∇_ξ ẑ_ξ(t_i).
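A minimal Python sketch of one iteration built from Equations (8.3-1) to (8.3-3); the response and gradient arrays are assumed precomputed, and the prior terms may be dropped (Pinv = None) for the MLE case.

    import numpy as np

    def gauss_newton_step(xi, z, zhat, dzhat, GG_inv, Pinv=None, m_xi=None):
        # zhat:  (N, m) computed responses at the current xi.
        # dzhat: (N, m, p) response gradients d zhat(t_i) / d xi.
        p = xi.size
        grad = np.zeros(p)
        hess = np.zeros((p, p))
        for i in range(len(z)):
            Vi = dzhat[i]
            ri = z[i] - zhat[i]
            grad -= Vi.T @ GG_inv @ ri      # first gradient, Equation (8.3-2)
            hess += Vi.T @ GG_inv @ Vi      # approximate second gradient (8.3-3)
        if Pinv is not None:
            grad += Pinv @ (xi - m_xi)
            hess += Pinv
        return xi - np.linalg.solve(hess, grad)   # Equation (8.3-1)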

8.3.2 System Response

The methods for computation of the system response depend on whether the system is pure discrete time or mixed continuous/discrete time. The choice of method is also influenced by whether the system is linear or nonlinear. Computation of the response of discrete-time systems is simply a matter of plugging into the equations. The general equations for a nonlinear system are

x̃_ξ(t_0) = x_0(ξ)   (8.3-4a)
x̃_ξ(t_{i+1}) = f[x̃_ξ(t_i),u(t_i),ξ]   i = 0,1,...   (8.3-4b)
ẑ_ξ(t_i) = g[x̃_ξ(t_i),u(t_i),ξ]   i = 1,2,...   (8.3-4c)

The more specific equations for a linear discrete-time system are

x̃_ξ(t_0) = x_0(ξ)   (8.3-5a)
x̃_ξ(t_{i+1}) = Φx̃_ξ(t_i) + Ψu(t_i)   i = 0,1,...   (8.3-5b)
ẑ_ξ(t_i) = Cx̃_ξ(t_i) + Du(t_i)   i = 1,2,...   (8.3-5c)

For mixed continuous/discrete-time systems, numerical methods for approximate integration are required. You can use any of numerous numerical methods, but the utility of the more complicated methods is often limited by the available data. It makes little sense to use a high-order method to integrate the system equations between the time points where the input is measured. The errors implicit in interpolating the input measurements are probably larger than the errors in the integration method. For most purposes, a second-order Runge-Kutta algorithm is probably an appropriate choice:

x̃_ξ(t_{i+1}) = x̃_ξ(t_i) + (Δ/2)(k_1 + k_2)
k_1 = f[x̃_ξ(t_i),u(t_i),ξ]
k_2 = f[x̃_ξ(t_i) + Δk_1, u(t_{i+1}),ξ]   (8.3-6)

For linear systems, a transition matrix method i s m r e accurate and e f f i c i e n t than Equation (8.3-6).
+ ( t o ) = xo(E) (8.3-7a)

where

Section 7.5 discusses the form of Equation (8.3-7b). k l e r and Van Loan (1978) describe several ways of numer!cally evaluatlng Equations (8.3-8) and (8.3-9). I n t h l s application. because ti+, tl i s small compared t o the sjstem natural periods, s l ~ l p l eseries expansion works well.
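A minimal sketch of such a series evaluation of Equations (8.3-8) and (8.3-9) (the truncation order is an illustrative choice; for large AΔ a scaled-and-squared matrix exponential would be preferable):

    import numpy as np

    def transition_matrices(A, B, dt, order=8):
        # Phi = exp(A dt); Psi = (integral_0^dt exp(A tau) dtau) B,
        # both by truncated Taylor series.
        n = A.shape[0]
        term = np.eye(n)        # current term (A dt)^k / k!
        Phi = np.eye(n)
        S = np.eye(n) * dt      # accumulates the integral of exp(A tau)
        for k in range(1, order + 1):
            term = term @ A * dt / k
            Phi = Phi + term
            S = S + term * dt / (k + 1)
        return Phi, S @ B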


8.3.3 Finite Difference Response Gradient

It remains to discuss the computation of ∇_ξ ẑ_ξ(t_i), the gradient of the system response. There are two basic methods for evaluating this gradient: finite-difference differentiation and analytic differentiation. This section discusses the finite-difference approach, and the next section discusses the analytic approach.

Finite-difference differentiation is applicable to any model form. The method is easy to describe and equally easy to code. Because it is easy to code, finite-difference differentiation is appropriate for programs where quick results are needed or the production workload is small enough that saving program development time is more important than improving program efficiency. Because it applies with equal ease to all model forms, finite-difference differentiation is also appropriate for programs that must handle nonlinear models, for which analytic differentiation is numerically complicated (Jategaonkar and Plaetschke, 1983).

To use finite-difference differentiation, perturb the first element of the ξ vector by some small amount δξ^(1). Recompute the system response using this perturbed ξ vector, obtaining the perturbed system response ẑ_p. The partial derivative of the response with respect to ξ^(1) is then approximately

∂ẑ(t_i)/∂ξ^(1) ≈ [ẑ_p(t_i) − ẑ_ξ(t_i)]/δξ^(1)

Repeat this process, perturbing each element of ξ in turn, to approximate the partial derivatives with respect to each element of ξ. The finite-difference gradient is then the concatenation of the partial derivatives

∇_ξ ẑ(t_i) ≈ [∂ẑ(t_i)/∂ξ^(1) ... ∂ẑ(t_i)/∂ξ^(p)]   (8.3-14)

Selection of the size of the perturbations requires some thought. If the perturbation is too large, the difference quotient becomes a poor approximation of the partial derivative. If the perturbation is too small, roundoff errors become a problem.

Some people have reported excellent results using simple perturbation-size rules such as setting the perturbation magnitude at 1% of a typical expected magnitude of the corresponding ξ element (assuming that you understand the problem well enough to be able to establish such typical magnitudes). You could alternatively consider percentages of the current iteration estimates (with some special provision for handling zero or essentially zero estimates). Another reasonable rule, after the first iteration, would be to use percentages of the diagonal elements of the second gradient, raised to the −1/2 power. As a final resort (it takes more computer time and is more complex), you could try several perturbation sizes, using the results to gauge the degree of nonlinearity and roundoff error, and adaptively selecting the best perturbation size.

Due to our limited experience with the finite-difference approach, we defer making specific recommendations on perturbation sizes, but offer the opinion that the problem is amenable to reasonable solution. A little experimentation should suffice to establish an adequate perturbation-size rule for a specific class of problems. Note that the higher the precision of your computer, the more margin you have between the boundaries of linearity problems and roundoff problems. Those of us with 60- and 64-bit computers (or 32-bit computers in double precision) seldom have serious roundoff problems and can use simple perturbation-size rules with impunity. If you try to get by with single precision on a 32-bit computer, careful perturbation-size selection will be more important.
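A minimal Python sketch of the finite-difference gradient described above; the simulation routine is a placeholder, and the perturbation-size rule is one of the simple choices just discussed.

    import numpy as np

    def fd_response_gradient(xi, u, simulate, dxi):
        # One-sided differences: perturb each element of xi in turn
        # (Equation (8.3-14)).  simulate(xi, u) returns an (N, m) array
        # of responses; dxi[j] is the perturbation for element j,
        # e.g. 1% of a typical magnitude of xi[j].
        z0 = simulate(xi, u)
        grad = np.empty(z0.shape + (xi.size,))
        for j in range(xi.size):
            xp = xi.copy()
            xp[j] += dxi[j]
            grad[..., j] = (simulate(xp, u) - z0) / dxi[j]
        return grad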

8.3.4 Analytic Response Gradient

The other approach to computing the gradient of the system response is to analytically differentiate the system equations. For linear systems, this approach is sometimes far more efficient than finite-difference differentiation. For nonlinear systems, analytic differentiation is impractically clumsy (partially because you have to redo it for each new nonlinear model form). We will, therefore, restrict our discussion of analytic differentiation to linear systems.
We first consider pure discrete-time linear systems in the form of Equation (8.3-5). It is crucial to recall that we do not need a closed form for the gradient; we only need a method for computing it. A closed-form expression would be formidable, unlike the following equations, which give the almost embarrassingly obvious gradient of Equation (8.3-5), obtained by using nothing more complicated than the chain rule:

∇_ξ x̃(t_0) = ∇_ξ x_0(ξ)   (8.3-13a)
∇_ξ x̃(t_{i+1}) = Φ[∇_ξ x̃(t_i)] + (∇_ξ Φ)x̃(t_i) + (∇_ξ Ψ)u(t_i)   (8.3-13b)
∇_ξ ẑ(t_i) = C[∇_ξ x̃(t_i)] + (∇_ξ C)x̃(t_i) + (∇_ξ D)u(t_i)   (8.3-13c)

Equation (8.3-13b) gives a recursive formula for ∇_ξ x̃(t_i), with Equation (8.3-13a) as the initial condition. Equation (8.3-13c) expresses ∇_ξ ẑ(t_i) in terms of the solution of Equation (8.3-13b).

The quantities ∇_ξΦ, ∇_ξΨ, ∇_ξC, and ∇_ξD in Equation (8.3-13) are gradients of matrices with respect to the vector ξ. The results are vectors, the elements of which are matrices (if you are fond of buzz words, these are third-order tensors). If this starts to sound complicated, you will be pleased to know that the products like (∇_ξΨ)u(t_i) are ordinary matrices (and indeed sparse matrices; they have lots of zero elements). You can compute the products directly without ever forming the vector of matrices in your program. A program to implement Equation (8.3-13) takes fewer lines than the explanation.

We could write Equation (8.3-13) without using gradients of matrices. Simply replace ∇_ξ by ∂/∂ξ^(j) throughout, and then concatenate the partial derivatives to get the gradient of ẑ(t_i). We then have, at worst, partial derivatives of matrices with respect to scalars; these partial derivatives are matrices. The only difference between writing the equations with partial derivatives or gradients is notational. We choose to use the gradient notation because it is shorter and more consistent with the rest of the book.

Let us look at Equation (8.3-13c) in detail to see how these equations would be implemented in a program, and perhaps to better understand the equations. The left-hand side is a matrix. Each column of the matrix is the partial derivative of ẑ(t_i) with respect to one element of ξ:

∇_ξ ẑ(t_i) = [∂ẑ(t_i)/∂ξ^(1) ... ∂ẑ(t_i)/∂ξ^(p)]

The quantity ∇_ξ x̃(t_i) is a similar matrix, computed from Equation (8.3-13b); thus C[∇_ξ x̃(t_i)] is a multiplication of a matrix times a matrix, and this is a calculation we can handle. The quantity ∇_ξC is the vector of matrices

[∂C/∂ξ^(1) ... ∂C/∂ξ^(p)]

and the product (∇_ξC)x̃(t_i) is

[(∂C/∂ξ^(1))x̃(t_i) ... (∂C/∂ξ^(p))x̃(t_i)]

(Our notation does not indicate explicitly that this is the intended product formula, but the other conceivable interpretation of the notation is obviously wrong because the dimensions are incompatible. Formal tensor notation would make the intention explicit, but we do not really need to introduce tensor notation here because the correct interpretation is obvious.)

In many cases the matrix ∂C/∂ξ^(j) will be sparse. Typically these matrices are either zero or have only one nonzero element. We can take advantage of such sparseness in the computation. If C is not a function of ξ^(j) (presumably ξ^(j) affects other of the system matrices), then ∂C/∂ξ^(j) is a zero matrix. If only the (k,m) element of C is affected by ξ^(j), then [∂C/∂ξ^(j)]x̃(t_i) is a vector with [∂C^(k,m)/∂ξ^(j)]x̃^(m)(t_i) in the kth element and zeros elsewhere. If more than one element of C is affected by ξ^(j), then the result is a sum of such terms. This approach directly forms [∂C/∂ξ^(j)]x̃(t_i), taking advantage of sparseness, instead of forming the full ∂C/∂ξ^(j) matrix and using a general-purpose matrix multiply routine. The terms (∇_ξD)u(t_i), (∇_ξΦ)x̃(t_i), and (∇_ξΨ)u(t_i) are all similar in form to (∇_ξC)x̃(t_i). The initial condition ∇_ξx_0 is a zero matrix if x_0 is known; otherwise it has a nonzero element for each unknown element of x_0.
We now know how to evaluate all of the terms in Equation (8.3-13). This is significantly faster than finite differences for some applications. The speed-up is most significant if Φ, Ψ, C, and D are functions of time requiring significant work to evaluate at each point; straightforward finite-difference methods would have to reevaluate these matrices for each perturbation.
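A minimal sketch of the recursion of Equation (8.3-13), with the per-parameter matrix partials supplied as lists (dPhi[j] = ∂Φ/∂ξ^(j), and so on; in production code the sparseness discussed above would be exploited rather than storing full matrices):

    import numpy as np

    def response_gradient(Phi, Psi, C, D, dPhi, dPsi, dC, dD, x, u, dx0):
        # x: (N, n) states from Equation (8.3-5); u: (N, q) inputs;
        # dx0: (n, p) gradient of the initial condition (8.3-13a).
        # Returns (N, m, p) response gradients.
        N, p = len(x), dx0.shape[1]
        dz = np.empty((N, C.shape[0], p))
        dx = dx0.copy()
        for i in range(N):
            for j in range(p):
                dz[i, :, j] = C @ dx[:, j] + dC[j] @ x[i] + dD[j] @ u[i]  # (8.3-13c)
            if i + 1 < N:
                dx_next = np.empty_like(dx)
                for j in range(p):
                    dx_next[:, j] = (Phi @ dx[:, j]
                                     + dPhi[j] @ x[i] + dPsi[j] @ u[i])   # (8.3-13b)
                dx = dx_next
        return dz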

Gupta and Mehra (1974) discuss a method that is basically a modification of Equation (8.3-13) for computing ∇_ξ ẑ(t_i). Depending on the number of inputs, states, outputs, and unknown parameters, this method can sometimes save computer time by reducing the length of the gradient vector needed for propagation in Equation (8.3-13).

We now have everything needed to implement the basic Gauss-Newton minimization algorithm. Practical application will typically require some kind of start-up algorithm and methods for handling cases where the algorithm converges slowly or diverges. The Iliff-Maine code, MMLE3 (Maine and Iliff, 1980; and Maine, 1981), incorporates several such modifications. The line-search ideas (Foster, 1983) briefly discussed at the end of Section 2.5.2 also seem appropriate for handling convergence problems. We will not cover the details of such practical issues here. The discussions of singularities in Section 5.4.4 and of partitioning in Section 5.4.5 apply directly to the problem of this chapter, so we will not repeat them.

8.4 UNKNOWN G

The previous discussion in this chapter has assumed that the G-matrix is known. Equations (8.1-2) and (8.1-4) are derived based on this assumption. For unknown G, the methods of Section 5.5 apply directly. Equation (5.5-2) substitutes for Equation (8.1-4). In the terminology of this chapter, Equation (5.5-2) becomes

$$J(\xi) = \sum_{i=1}^{N} \left\{ [z(t_i) - \hat{z}_\xi(t_i)]^* [G(\xi)G(\xi)^*]^{-1} [z(t_i) - \hat{z}_\xi(t_i)] + \ln|G(\xi)G(\xi)^*| \right\} \qquad (8.4-1)$$

plus a constant. If G is known, this reduces to Equation (8.1-4).

As discussed in Section 5.5, the best approach to minimizing Equation (8.4-1) is to partition the parameter vector into a part ξG affecting G and a part ξf affecting ẑ. For each fixed ξG, the Gauss-Newton equations of Section 8.3 apply to revising the estimate of ξf. For each fixed ξf, the revised estimate of G is given by Equation (5.5-7), which becomes

$$\widehat{GG^*} = \frac{1}{N} \sum_{i=1}^{N} [z(t_i) - \hat{z}_\xi(t_i)][z(t_i) - \hat{z}_\xi(t_i)]^* \qquad (8.4-2)$$

in the current notation. Section 5.5 describes the axial iteration method, which alternately applies the Gauss-Newton equations of Section 8.3 for ξf and Equation (8.4-2) for G.
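A minimal sketch of this axial iteration follows, assuming a user-supplied routine that performs one Gauss-Newton revision of ξf with GG* held fixed (all names here are hypothetical, not taken from MMLE3):

```python
import numpy as np

def axial_iteration(z, predict, gn_step, xi_f, n_iter=10):
    """Alternately revise xi_f and GG*.

    z:       measured responses, shape (N, n_obs)
    predict: callable xi_f -> predicted responses z_hat, shape (N, n_obs)
    gn_step: callable (xi_f, GG) -> revised xi_f; one Gauss-Newton
             revision of Section 8.3 with GG* held fixed
    """
    N = z.shape[0]
    GG = np.eye(z.shape[1])          # arbitrary starting guess for GG*
    for _ in range(n_iter):
        xi_f = gn_step(xi_f, GG)     # revise xi_f with GG* fixed
        resid = z - predict(xi_f)    # output residuals
        GG = (resid.T @ resid) / N   # Equation (8.4-2)
    return xi_f, GG
```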

The cost function for estimation with unknown G is often written in alternate forms. Although the above form is usually the most useful for computation, the following forms provide some insight into the relations of the estimators with unknown G versus those with fixed G. When G is completely unknown, the minimization of Equation (8.4-1) is equivalent to the minimization of

$$J(\xi_f) = \left| \frac{1}{N} \sum_{i=1}^{N} [z(t_i) - \hat{z}_\xi(t_i)][z(t_i) - \hat{z}_\xi(t_i)]^* \right| \qquad (8.4-3)$$

which corresponds to Equation (5.5-9). Section 5.5 derives this equivalence by eliminating G. It is common to restrict G to be diagonal, in which case Equation (8.4-3) becomes

$$J(\xi_f) = \prod_{k} \sum_{i=1}^{N} \left[ z^{(k)}(t_i) - \hat{z}^{(k)}_\xi(t_i) \right]^2 \qquad (8.4-4)$$

This form is a product of the errors in the different signals, instead of the weighted sum-of-the-errors form of Equation (8.1-4).

8.5 CHARACTERISTICS

We have shown that the output error estimator is a direct application of the estimators derived in Section 5.4 for nonlinear static systems. To describe the statistical characteristics of output error estimates, we need only apply the corresponding Section 5.4 results to the particular form of output error.

In most cases, the corresponding static system is nonlinear, even for linear dynamic systems. Therefore, we must use the forms of Section 5.4 instead of the simpler forms of Section 5.1, which apply to linear static systems. In particular, the output error MLE and MAP estimators are both biased for finite time. Asymptotically, they are unbiased and efficient. From Equation (5.4-11), the covariance of the MLE output error estimate is approximated by

$$\mathrm{cov}(\hat{\xi}) \approx \left\{ \sum_{i=1}^{N} [\nabla_\xi \hat{z}_\xi(t_i)]^* (GG^*)^{-1} [\nabla_\xi \hat{z}_\xi(t_i)] \right\}^{-1} \qquad (8.5-1)$$
From Equation (5.4-12), the corresponding approximation for the posterior distribution of ξ in an MAP estimator is

$$\mathrm{cov}(\xi|Z) \approx \left\{ \sum_{i=1}^{N} [\nabla_\xi \hat{z}_\xi(t_i)]^* (GG^*)^{-1} [\nabla_\xi \hat{z}_\xi(t_i)] + P^{-1} \right\}^{-1} \qquad (8.5-2)$$
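These approximations are easily computed from the same response gradients used by the Gauss-Newton algorithm. A minimal sketch, assuming the sensitivities have already been evaluated (hypothetical names; not from any published code):

```python
import numpy as np

def mle_covariance(sens, GG):
    """Approximate cov(xi_hat) per Equation (8.5-1).

    sens: response sensitivities, shape (N, n_obs, n_par);
          sens[i] holds grad_xi z_hat(t_i)
    GG:   measurement noise covariance GG*
    """
    GG_inv = np.linalg.inv(GG)
    info = sum(S.T @ GG_inv @ S for S in sens)  # information matrix
    return np.linalg.inv(info)
```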

CHAPTER 9

9.0 FILTER ERROR METHOD FOR DYNAMIC SYSTEMS

In this chapter, we consider the parameter estimation problem for dynamic systems with both process and measurement noise. We restrict the consideration to linear systems with additive Gaussian noise, because the exact analysis of more general systems is impractically complicated except in special cases like output error (no process noise). The easiest way to handle nonlinear systems with both measurement and process noise is usually to linearize the system and apply the linear results. This method does not give exact results for nonlinear systems, but can give adequate approximations in some cases.

In mixed continuous/discrete time, the linear system model is

$$\dot{x}(t) = Ax(t) + Bu(t) + Fn(t) \qquad x(t_0) = x_0$$
$$z(t_i) = Cx(t_i) + Du(t_i) + G\eta_i \qquad (9.0-1)$$

The measurement noise η is assumed to be a sequence of independent Gaussian random variables with zero mean and identity covariance. The process noise n is a zero-mean, white-noise process, independent of the measurement noise, with identity spectral density. The initial condition x0 is assumed to be a Gaussian random variable, independent of n and η, with mean x̄0 and covariance P0. As special cases, P0 can be 0, implying that the initial condition is known exactly; or infinite, implying complete ignorance of the initial condition. The input u is assumed to be known exactly. As in the case of output error, the system matrices A, B, C, D, F, and G are functions of ξ and may be functions of time.

The corresponding pure discrete-time model is

$$x(t_{i+1}) = \Phi x(t_i) + \Psi u(t_i) + Fn_i \qquad x(t_0) = x_0$$
$$z(t_i) = Cx(t_i) + Du(t_i) + G\eta_i \qquad (9.0-2)$$

All of the same assumptions apply, except that n is a sequence of independent Gaussian random variables with zero mean and identity covariance.

9.1 DERIVATION

In order to obtain the maximum likelihood estimate of ξ, we need to choose ξ to maximize

$$L(\xi, Z_N) = p(Z_N|\xi) \qquad (9.1-1)$$

where Zi denotes the set of measurements {z(t1), z(t2), ..., z(ti)}. For the MAP estimate, we need to maximize p(Z_N|ξ)p(ξ). In either event, the crucial first step is to find a tractable expression for p(Z_N|ξ). We will discuss three ways of deriving this density function.

9.1.1 Static Derivation

The first means of deriving an expression for p(Z_N|ξ) is to solve the system equations, reducing them to the static form of Equation (5.0-1). This technique, although simple in principle, does not give a tractable solution. We briefly outline the approach here in order to illustrate the principle, before considering the more fruitful approaches of the following sections.

For a pure discrete-time linear system described by Equation (9.0-2), the explicit static expression for z(ti) is

$$z(t_i) = C\Phi^i x_0 + \sum_{j=0}^{i-1} C\Phi^{i-j-1} [\Psi u(t_j) + Fn_j] + Du(t_i) + G\eta_i \qquad (9.1-2)$$

This is a nonlinear static model in the general form of Equation (5.5-1). However, the separation of ξ into ξG and ξf as described by Equation (5.5-4) does not apply. Note that Equation (9.1-2) is a nonlinear function of ξ, even if the matrices are linear functions of ξ. In fact, the order of nonlinearity increases with the number of time points. The use of estimators derived directly from Equation (9.1-2) is unacceptably difficult for all but the simplest special cases, and we will not pursue it further.

For mixed continuous/discrete-time systems, similar principles apply, except that the ω of Equation (5.0-1) must be generalized to allow vectors of infinite dimension. The process noise in a mixed continuous/discrete-time system is a function of time, and cannot be written as a finite-dimensional random vector. The material of Chapter 5 covered only finite-dimensional vectors. The Chapter 5 results generalize

nicely to infinite-dimensional vector spaces (function spaces), but we will not find that level of abstraction necessary. Application to pure continuous-time systems would require further generalization to allow infinite-dimensional observations.

9.1.2 Derivation by Recursive Factoring

We will now consider a derivation based on factoring p(Z_N|ξ) by means of Bayes' rule (Equation (3.3-12)). The derivation applies either to pure discrete-time or mixed continuous/discrete-time systems; the derivation is identical in both cases. For the first step, write

$$p(Z_N|\xi) = p(z(t_N)|Z_{N-1}, \xi)\, p(Z_{N-1}|\xi) \qquad (9.1-3)$$

Recursive application of this formula gives

$$p(Z_N|\xi) = \prod_{i=1}^{N} p(z(t_i)|Z_{i-1}, \xi) \qquad (9.1-4)$$
For any particular ξ, the distribution of z(ti) given Zi-1 is known from the Chapter 7 results; it is Gaussian with mean

$$\hat{z}_\xi(t_i) = E\{z(t_i)|Z_{i-1}, \xi\} = E\{Cx(t_i) + Du(t_i) + G\eta_i\, |\, Z_{i-1}, \xi\} = C\tilde{x}_\xi(t_i) + Du(t_i) \qquad (9.1-5)$$

and covariance

$$R_i = \mathrm{cov}\{z(t_i)|Z_{i-1}, \xi\} \qquad (9.1-6)$$

Note that ẑξ(ti) and x̃ξ(ti) are functions of ξ because they are obtained from the Kalman filter based on a particular value of ξ; that is, they are conditioned on ξ. We use the ξ subscript notation to emphasize this dependence. Ri is also a function of ξ, although our notation does not explicitly indicate this. Substituting the appropriate Gaussian density functions characterized by Equations (9.1-5) and (9.1-6) into Equation (9.1-4) gives

$$L(\xi, Z_N) = p(Z_N|\xi) = \prod_{i=1}^{N} |2\pi R_i|^{-1/2} \exp\left\{ -\frac{1}{2} [z(t_i) - \hat{z}_\xi(t_i)]^* R_i^{-1} [z(t_i) - \hat{z}_\xi(t_i)] \right\} \qquad (9.1-7)$$
This is the desired expression for the likelihood functional.

9.1.3 Derivation Using the Innovation

Another derivation involves the properties of the innovation. This derivation also applies either to mixed continuous/discrete-time or to pure discrete-time systems.

We proved in Chapter 7 that the innovations are a sequence of independent, zero-mean Gaussian variables with covariances Ri given by Equation (7.2-33). This proof was done for the pure discrete-time case, but extends directly to mixed continuous/discrete-time systems. The Chapter 7 results assumed that the system matrices were known; thus the results are conditioned on ξ. The conditional probability density function of the innovations is therefore

$$p(V_N|\xi) = \prod_{i=1}^{N} |2\pi R_i|^{-1/2} \exp\left\{ -\frac{1}{2} v_i^* R_i^{-1} v_i \right\} \qquad (9.1-8)$$

We also showed in Chapter 7 that the innovations are an invertible linear function of the observations. Furthermore, it is easy to show that the determinant of the Jacobian of the transformation equals 1. (The Jacobian is triangular with 1's on the diagonal.) Thus by Equation (3.4-1), we can substitute

$$v_i = z(t_i) - \hat{z}_\xi(t_i) \qquad (9.1-9)$$

into Equation (9.1-8) to give

$$p(Z_N|\xi) = \prod_{i=1}^{N} |2\pi R_i|^{-1/2} \exp\left\{ -\frac{1}{2} [z(t_i) - \hat{z}_\xi(t_i)]^* R_i^{-1} [z(t_i) - \hat{z}_\xi(t_i)] \right\} \qquad (9.1-10)$$

which is identical to Equation (9.1-7). We see that the derivation by Bayes factoring and the derivation using the innovation give the same result.

9.1.4 Steady-State Form

For many applications, we can use the steady-state Kalman filter in the cost functional, resulting in major computational savings. This usage requires, of course, that the steady-state filter exist. We discussed the criteria for the existence of the steady-state filter in Chapter 7. The most important criterion is obviously that the system be time-invariant. The rest of this section assumes that a steady-state form exists.

When a steady-state form exists, two approaches can be taken to justifying its use. The first justification is that the steady-state form is a good approximation if the time interval is long enough. The time-varying filter gain converges to the steady-state gain with time constants at least as fast as those of the open-loop system, and sometimes significantly faster. Thus, if the maneuver analyzed is long compared to the system time constants, the filter gain would converge to the steady-state gain in a small portion of the maneuver time. We could verify this behavior by computing time-varying gains for representative values of ξ. If the filter gain does converge quickly to the steady-state gain, then the steady-state filter should give a good approximation to the cost functional.

The second possible justification for the use of the steady-state filter involves the choice of the initial state covariance P0. The time-varying filter requires P0 to be specified. It is a common practice to set P0 to zero. This practice arises more from a lack of better ideas than from any real argument that zero is a good value. It is seldom that we know the initial state exactly as implied by the zero covariance. One circumstance which would justify the zero initial covariance would be the case where the initial condition is included in the list of unknown parameters. In this case, the initial covariance is properly zero because the filter is conditioned on the values of the unknown parameters. Any prior information about the initial condition is then reflected in the prior distribution of ξ instead of in P0. Unless one has a specific need for estimates of the initial condition, there are usually better approaches.

We suggest that the steady-state covariance is often a reasonable value for the initial covariance. In this case, the time-varying and steady-state filters are identical; arguments about the speed of convergence and the length of the data interval are not required. Since the time-varying form requires significantly more computation than the steady-state form, the steady-state form is preferable except where it is clearly and significantly inferior. If the steady-state filter is used, Equation (9.1-7) becomes
$$L(\xi, Z_N) = p(Z_N|\xi) = \prod_{i=1}^{N} |2\pi R|^{-1/2} \exp\left\{ -\frac{1}{2} [z(t_i) - \hat{z}_\xi(t_i)]^* R^{-1} [z(t_i) - \hat{z}_\xi(t_i)] \right\} \qquad (9.1-11)$$
where R is the steady-state covariance of the innovation. In general, R is a function of ξ. The ẑξ(ti) in Equation (9.1-11) comes from the steady-state filter, unlike the ẑξ(ti) in Equation (9.1-7). We use the same notation for both quantities, distinguishing them by context. (The ẑξ(ti) from the steady-state filter is always associated with the steady-state covariance R, whereas the ẑξ(ti) from the time-varying filter is associated with the time-varying covariance Ri.)

9.1.5 Cost Function Discussion

The maximum-likelihood estimate of ξ is obtained by maximizing Equation (9.1-11) (or Equation (9.1-7) if the steady-state form is inappropriate) with respect to ξ. Because of the exponential in Equation (9.1-11), it is more convenient to work with the logarithm of the likelihood functional, called the log-likelihood functional for short. The log-likelihood functional is maximized by the same value of ξ that maximizes the likelihood functional because the logarithm is a monotonic increasing function. By convention, most optimization theory is written in terms of minimization instead of maximization. We therefore define the negative of the log-likelihood functional to be a cost functional which is to be minimized. We also omit the ln(2π) term from the cost functional, because it does not affect the minimization. The most convenient expression for the cost functional is then

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [z(t_i) - \hat{z}_\xi(t_i)]^* R^{-1} [z(t_i) - \hat{z}_\xi(t_i)] + \frac{N}{2} \ln|R| \qquad (9.1-12)$$
If R is known, then Equation (9.1-12) is in a least-squares form. This is sometimes called a prediction-error form because the quantity being minimized is the square of the one-step-ahead prediction error z(ti) - ẑξ(ti). The term "filter error" is also used because the quantity minimized is obtained from the Kalman filter.
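As an illustration, a minimal sketch of evaluating this cost with a steady-state filter, assuming the steady-state gain K and innovation covariance R have already been computed for the current ξ (hypothetical names; this is not the MMLE3 implementation):

```python
import numpy as np

def filter_error_cost(z, u, Phi, Psi, C, D, K, R):
    """Cost J(xi) of Equation (9.1-12), up to additive constants.

    Uses a steady-state Kalman filter with gain K and innovation
    covariance R, both precomputed for the current value of xi.
    """
    N = z.shape[0]
    R_inv = np.linalg.inv(R)
    x = np.zeros(Phi.shape[0])       # assumes a known zero initial state
    J = 0.5 * N * np.log(np.linalg.det(R))
    for i in range(N):
        v = z[i] - (C @ x + D @ u[i])       # innovation z - z_hat
        J += 0.5 * v @ R_inv @ v
        x = Phi @ (x + K @ v) + Psi @ u[i]  # correct, then predict
    return J
```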

Note that this form of the likelihood functional involves the Kalman filter, not a smoother. There is sometimes a temptation to replace the filter in this cost function by a smoother, assuming that this will give improved results. The smoother gives better state estimates than the filter, but the problem considered in this chapter is not state estimation. The state estimates are an incidental side-product of the algorithm for estimating the parameter vector ξ. There are ways of deriving and writing the parameter estimation problem which involve smoothers (Cox and Bryson, 1980), but the direct use of a smoother in Equation (9.1-12) is simply incorrect.

For MAP estimates, we modify the cost functional by adding the negative of the logarithm of the prior probability density of ξ. If the prior distribution of ξ is Gaussian with mean mξ and covariance W, the cost functional of Equation (9.1-12) becomes (ignoring constant terms)

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [z(t_i) - \hat{z}_\xi(t_i)]^* R^{-1} [z(t_i) - \hat{z}_\xi(t_i)] + \frac{N}{2} \ln|R| + \frac{1}{2} (\xi - m_\xi)^* W^{-1} (\xi - m_\xi) \qquad (9.1-13)$$
The filter-error forms of Equations (9.1-12) and (9.1-13) are parallel to the output-error forms of Equations (8.1-4) and (8.1-2). When there is no process noise, the steady-state Kalman filter becomes an integration of the system equations, and the innovation covariance R equals the measurement noise covariance GG*. Thus the output-error equations of the previous chapter are special cases of the filter-error equations with zero process noise.

9.2 COMPUTATION

The best methods for minimizing Equation (9.1-12) or (9.1-13) are based on the Gauss-Newton algorithm. Because these equations are so similar in form to the output-error equations of Chapter 8, most of the Chapter 8 material on computation applies directly or with only minor modification. The primary differences between computational methods for filter error and those for output error center on the treatment of the noise covariances, particularly when the covariances are unknown. Maine and Iliff (1981a) discuss the implementation details of the filter-error algorithm. The Iliff-Maine code, MMLE3 (Maine and Iliff, 1980; and Maine, 1981), implements the filter-error algorithm for linear continuous/discrete-time systems.

We generally presume the use of the steady-state filter in the filter-error algorithm. Implementation is significantly more complicated using the time-varying filter.

9.3 FORMULATION AS A FILTERING PROBLEM

An alternative to the direct approach of the previous section is to recast the parameter estimation problem into the form of a filtering problem. The techniques of Chapter 7 then apply.

Suppose we start with the system model

$$\dot{x}(t) = A(\xi)x(t) + B(\xi)u(t) + F(\xi)n(t)$$
$$z(t_i) = C(\xi)x(t_i) + D(\xi)u(t_i) + G(\xi)\eta_i \qquad (9.3-1)$$

This is the same as Equation (9.0-1), except that here we explicitly indicate the dependence of the matrices on ξ. The problem is to estimate ξ. In order to apply state estimation techniques to this problem, ξ must be part of the state vector. Therefore, we define an augmented state vector

$$x_a = \begin{bmatrix} x \\ \xi \end{bmatrix} \qquad (9.3-2)$$

We can combine Equation (9.3-1) with the trivial differential equation

$$\dot{\xi} = 0$$

to write a system equation with xa as the state vector. Note that the resulting system is nonlinear in xa (because it has products of ξ and x), even though Equation (9.3-1) is linear in x. In principle, we can apply the extended Kalman filter, discussed in Section 7.7, to the problem of estimating xa.

Unfortunately, the nonlinearity in the augmented system is crucial to the system behavior. The adequacy of the extended Kalman filter for this problem has seldom been analyzed in detail. Schweppe (1973, p. 433) says on this subject:

...the system identification problem has been transformed into a problem which has already been discussed extensively. The discussions are not terminated at this point for the simple reason that Part IV did not provide any "best" one way to solve a nonlinear state estimation problem. A major conclusion of Part IV was that the best way to proceed depends heavily on the explicit nature of the problem. System identification leads to special types of nonlinear estimation problems, so specialized discussions are needed. ...the state augmentation approach is not emphasized, as the author feels that it is much more appropriate to approach the system identification problem directly. However, there are special cases where state augmentation works very well.
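To make the augmentation concrete, here is a minimal sketch for a scalar example, ẋ = ξx + n, with hypothetical names; it shows the augmented dynamics and the linearization an extended Kalman filter would use:

```python
import numpy as np

def f_augmented(xa):
    """Dynamics of the augmented state xa = [x, xi]."""
    x, xi = xa
    return np.array([xi * x,   # the product of xi and x: the nonlinearity
                     0.0])     # trivial equation xi_dot = 0

def jacobian_augmented(xa):
    """Linearization of f_augmented used by an extended Kalman filter."""
    x, xi = xa
    return np.array([[xi, x],
                     [0.0, 0.0]])
```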

CHAPTER 10

10.0 EQUATION ERROR METHOD FOR DYNAMIC SYSTEMS

This chapter discusses the equation error approach to parameter estimation for dynamic systems. We will first define a restricted form of equation error, parallel to the treatments of output error and filter error in the previous chapters. This form of equation error is a special case of filter error where there is process noise, but no measurement noise. It therefore stands in counterpoint to output error, which is the special case where there is measurement noise, but no process noise. We will then extend the definition of equation error to a more general form. Some of the practical applications of equation error do not fit precisely into the overly restrictive form based on process noise only. In its most general forms, the term equation error encompasses output error and filter error, in addition to the forms most commonly associated with the term. The primary distinguishing feature of the methods emphasized in this chapter is their computational simplicity.

10.1 PROCESS-NOISE APPROACH

In this section, we consider equation error in a manner parallel to the previous treatments of output error and filter error. The filter-error method treats systems with both process noise and measurement noise, and output error treats the special case of systems with measurement noise only. Equation error completes this triad of algorithms by treating the special case of systems with process noise only.

The equation-error method applies to nonlinear systems with additive Gaussian process noise. We will restrict the discussion of this section to pure discrete-time models, for which the derivation is straightforward. Mixed continuous/discrete-time models can be handled by converting them to equivalent pure discrete-time models. Equation error does not strictly apply to pure continuous-time models. (The problem becomes ill-posed.) The general form of the nonlinear, discrete-time system model we will consider is

$$x(t_{i+1}) = f[x(t_i), u(t_i), \xi] + Fn_i$$
$$z(t_i) = g[x(t_i), u(t_i), \xi] \qquad (10.1-1)$$

The process noise, n, is a sequence of independent Gaussian random variables with zero mean and identity covariance. The matrix F can be a function of ξ, although the simplified notation ignores this possibility. It will prove convenient to assume that the measurements z(ti) are defined for i = 0,...,N; previous chapters have defined them only for i = 1,...,N.
The following derivation of the equation-error method closely parallels the derivation of the filter-error method in Section 9.1.3. Both are based primarily on application of the transformation-of-variables formula, Equation (3.4-1), starting from a process known to be a sequence of independent Gaussian variables.

By assumption, the probability density function of the process noise is

$$p(n_N) = \prod_{i=0}^{N-1} |2\pi I|^{-1/2} \exp\left\{ -\frac{1}{2} n_i^* n_i \right\} \qquad (10.1-2)$$
where nN is the concatenation of the ni. We further assume that F is invertible for all permissible values of ξ; this assumption is necessary to ensure that the problem is well-posed. We define XN to be the concatenation of the x(ti). Then, for each value of ξ, XN is an invertible linear function of nN. The inverse function is

$$n_i = F^{-1}[x(t_{i+1}) - \tilde{x}_\xi(t_{i+1})] \qquad (10.1-3)$$

where, for convenience and for consistency with the notation of previous chapters, we have defined

$$\tilde{x}_\xi(t_{i+1}) = f[x(t_i), u(t_i), \xi] \qquad (10.1-4)$$

The determinant of the Jacobian of the inverse transformation is |F|^{-N} because the inverse transformation matrix is block-triangular with F^{-1} in the diagonal blocks. Direct application of the transformation-of-variables formula, Equation (3.4-1), gives

$$p(X_N|\xi) = \prod_{i=1}^{N} |2\pi FF^*|^{-1/2} \exp\left\{ -\frac{1}{2} [x(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [x(t_i) - \tilde{x}_\xi(t_i)] \right\} \qquad (10.1-5)$$

In order to derive a simple expression for p(Z_N|ξ), we require that g be a continuous, invertible function of x for each value of ξ. The invertibility is critical to the simplicity of the equation-error

algorithm. This assumption, combined with the lack of measurement noise, means that we can reconstruct the state vector perfectly, provided that we know ξ. The inverse function gives this reconstruction:

$$\hat{x}_\xi(t_i) = g^{-1}[z(t_i), u(t_i), \xi] \qquad (10.1-6)$$
If g is not invertible, a recursive state estimator becomes imbedded in the algorithm and we are again faced with something as complicated as the filter-error algorithm. For invertible g, the transformation-of-variables formula, Equation (3.4-1), gives

$$p(Z_N|\xi) = \prod_{i=1}^{N} \left| J(g^{-1}) \right|\, |2\pi FF^*|^{-1/2} \exp\left\{ -\frac{1}{2} [\hat{x}_\xi(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [\hat{x}_\xi(t_i) - \tilde{x}_\xi(t_i)] \right\} \qquad (10.1-7)$$

where x̂ξ(ti) is given by Equation (10.1-6), and

$$\tilde{x}_\xi(t_i) = f[\hat{x}_\xi(t_{i-1}), u(t_{i-1}), \xi] \qquad (10.1-8)$$
Most practical applications of equation error separate the problems of state reconstruction and parameter estimation. In the context defined above, this is possible when g is not a function of ξ. Then Equation (10.1-6) is also independent of ξ; thus, we can reconstruct the state exactly without knowledge of ξ. Furthermore, the estimates of ξ depend only on the reconstructed state vector and the control vector. There is no direct dependence on the actual measurements z(ti) or on the exact form of the g-function. This is evident in Equation (10.1-7) because the Jacobian of g^{-1} is independent of ξ and, therefore, irrelevant to the parameter-estimation problem. In many practical applications, the state reconstruction is more complicated than a simple pointwise function as in Equation (10.1-6), but as long as the state reconstruction does not depend on ξ, the details do not matter to the parameter-estimation process.

You will seldom (if ever) see Equation (10.1-7) elsewhere in the form shown here, which includes the factor for the Jacobian of g^{-1}. The usual derivation ignores the measurement equation and starts from the assumption that the state is known exactly, whether by direct measurement or by some reconstruction. We have included the measurement equation only in order to emphasize the parallels between equation error, output error, and filter error. For the rest of this section, we will assume that g is independent of ξ. We will specifically assume that the determinant of the Jacobian of g is 1 (the actual value being irrelevant to the estimator anyway), so that we can write Equation (10.1-7) in a more conventional form as

$$p(Z_N|\xi) = \prod_{i=1}^{N} |2\pi FF^*|^{-1/2} \exp\left\{ -\frac{1}{2} [\hat{x}(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [\hat{x}(t_i) - \tilde{x}_\xi(t_i)] \right\} \qquad (10.1-9)$$

where

$$\hat{x}(t_i) = g^{-1}[z(t_i), u(t_i)]$$

You can derive slight generalizations, useful in some cases, from Equation (10.1-7).

The maximum-likelihood estimate of ξ is the value that maximizes Equation (10.1-9). As in previous chapters, it is convenient to work in terms of minimizing the negative log-likelihood functional

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [\hat{x}(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [\hat{x}(t_i) - \tilde{x}_\xi(t_i)] + \frac{N}{2} \ln|FF^*| \qquad (10.1-10)$$

If ξ has a Gaussian prior distribution with mean mξ and covariance P, then the MAP estimate minimizes

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [\hat{x}(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [\hat{x}(t_i) - \tilde{x}_\xi(t_i)] + \frac{N}{2} \ln|FF^*| + \frac{1}{2} (\xi - m_\xi)^* P^{-1} (\xi - m_\xi) \qquad (10.1-11)$$
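A minimal sketch of evaluating the cost of Equation (10.1-10), assuming the states have already been reconstructed from the measurements by Equation (10.1-6) (hypothetical names; not from any published code):

```python
import numpy as np

def equation_error_cost(x_rec, u, f, xi, FF):
    """J(xi) from Equation (10.1-10) for reconstructed states x_rec[0..N].

    x_rec: reconstructed states, one 1-D array per time point
    u:     control vectors, one 1-D array per time point
    f:     state function f(x, u, xi) of Equation (10.1-1)
    FF:    process noise covariance FF*
    """
    N = len(x_rec) - 1
    FF_inv = np.linalg.inv(FF)
    J = 0.5 * N * np.log(np.linalg.det(FF))
    for i in range(1, N + 1):
        r = x_rec[i] - f(x_rec[i - 1], u[i - 1], xi)  # one-step residual
        J += 0.5 * r @ FF_inv @ r
    return J
```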

10.1.2 Special Case of Filter Error

For linear systems, we can also derive state-equation error by plugging into the linear filter-error algorithm derived in Chapter 9. Assume that G is 0; FF* is invertible; C is square, invertible, and known exactly; and D is known exactly. These are the assumptions that mean we have perfect measurements of the state of the system. The Kalman filter for this case is (repeating Equation (7.3-11))

$$\hat{x}(t_i) = C^{-1}[z(t_i) - Du(t_i)]$$

and the covariance, Pi, of this filtered estimate is 0.

The one-step-ahead prediction is

$$\tilde{x}(t_{i+1}) = \Phi \hat{x}(t_i) + \Psi u(t_i)$$
The use of this form in an equation-error method presumes that the state x(ti) can be reconstructed as a function of the z(ti) and u(ti). This presumption is identical to that for discrete-time state-equation error, and it implies the same conditions: there must be noise-free measurements of the state, independent of ξ. It is implicit that a known invertible transformation of such measurements is statistically equivalent. As in the discrete-time case, we can define the estimator even when the measurements are noisy, but it will no longer be a maximum-likelihood estimator.

Equation (10.2-7) also presumes that the derivative ẋ(ti) can be reconstructed from the measurements. Neglecting for the moment the statistical implications, note that we can form a plausible equation-error estimator using any reasonable means of approximating a value for ẋ(ti) independently of ξ. The simplest case of this is when the observation vector includes measurements of the state derivatives in addition to the measurements of the states. If such derivative measurements are not directly available, we can always approximate ẋ(ti) by finite-difference differentiation of the state measurements, as in

$$\dot{x}(t_i) \approx \frac{x(t_{i+1}) - x(t_{i-1})}{t_{i+1} - t_{i-1}} \qquad (10.2-8)$$
Both direct measurement and finite-difference approximation are used in practice.

Rigorous statistical treatment is easiest for the case of finite-difference approximations. To arrive at such a form, we write the state equation in integrated form as

$$x(t_{i+1}) = x(t_i) + \int_{t_i}^{t_{i+1}} f[x(\tau), u(\tau), \xi]\, d\tau + \int_{t_i}^{t_{i+1}} F n(\tau)\, d\tau \qquad (10.2-9)$$

An approximate solution (not necessarily the best approximation) to Equation (10.2-9) is

$$x(t_{i+1}) \approx x(t_i) + (t_{i+1} - t_i) f[x(t_i), u(t_i), \xi] + F_d n_i \qquad (10.2-10)$$

where ni is a sequence of independent Gaussian variables, and Fd is the equivalent discrete F-matrix. Sections 6.2 and 7.5 discuss such approximations. Equation (10.2-10) is in the form of a discrete-time state equation. The discrete-time state-equation error method based on this equation uses

$$h[z(\cdot), u(\cdot), t_i, \xi] = x(t_i) - x(t_{i-1}) - (t_i - t_{i-1}) f[x(t_{i-1}), u(t_{i-1}), \xi] \qquad (10.2-11)$$

Redefining h by dividing by ti - ti-1 gives the form

$$h[z(\cdot), u(\cdot), t_i, \xi] = \dot{x}(t_i) - f[x(t_i), u(t_i), \xi] \qquad (10.2-12)$$

where the derivative is obtained from the finite-difference formula

$$\dot{x}(t_i) = \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}} \qquad (10.2-13)$$
Other discrete-time approximations of Equation (10.2-9) result in different finite-difference formulae. The central-difference form of Equation (10.2-8) is usually better than the one-sided form of Equation (10.2-13), although Equation (10.2-8) has a lower bandwidth. If the bandwidth of Equation (10.2-8) presents problems, a better approach than Equation (10.2-13) is to use

$$h[z(\cdot), u(\cdot), t_i, \xi] = \dot{x}(t_{i-1/2}) - f[x(t_{i-1/2}), u(t_{i-1/2}), \xi] \qquad (10.2-14)$$

where we have used the notation

$$\dot{x}(t_{i-1/2}) = \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}}$$

and

$$x(t_{i-1/2}) = \frac{1}{2}[x(t_i) + x(t_{i-1})] \qquad u(t_{i-1/2}) = \frac{1}{2}[u(t_i) + u(t_{i-1})]$$
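As an illustration of these formulae, a minimal sketch of a state-equation-error fit for a model linear in the parameters, using the central-difference derivative of Equation (10.2-8) (hypothetical names; the regressor construction depends on the model structure):

```python
import numpy as np

def state_equation_error_fit(x, u, t, regressors):
    """Least-squares estimate of xi from h = xdot - f[x, u, xi] = 0.

    x, u, t:    scalar state, control, and time histories (1-D arrays);
                for vector systems, each state equation can be fit
                separately when the parameters partition
    regressors: callable (x_i, u_i) -> row of partials of f w.r.t. xi,
                valid when f is linear in the parameters xi
    """
    A_rows, b_rows = [], []
    for i in range(1, len(t) - 1):
        xdot = (x[i + 1] - x[i - 1]) / (t[i + 1] - t[i - 1])  # Eq. (10.2-8)
        A_rows.append(regressors(x[i], u[i]))
        b_rows.append(xdot)
    A, b = np.vstack(A_rows), np.hstack(b_rows)
    xi_hat, *_ = np.linalg.lstsq(A, b, rcond=None)  # single-pass solution
    return xi_hat
```

Because the model is linear in ξ, the single least-squares solution is the exact minimum; no iteration is required.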

There are several other reasonable finite-difference formulae applicable to this problem.

Rigorous statistical treatment of the case in which direct state derivative measurements are available raises several complications. Furthermore, it is difficult to get a rigorous result in the form typically used: an equation-error method based on ẋ measurements substituted into Equation (10.2-7). It is probably best to regard this approach as an equation-error estimator derived from plausible, but ad hoc, reasoning.
We will briefly outline the statistical issues raised by state derivative measurements, without attempting a complete analysis. The first problem is that, for systems with white process noise, the state derivative is infinite at every point in time. (Careful argument is required even to define the derivative.) We could avoid this problem by requiring the process noise to be band-limited, or by other means, but the resulting estimator will not be in the desired form. A heuristic explanation is that the x measurements contain implicit information about the derivative (from the finite differences), and simple use of the measured derivative ignores this information. A rigorous maximum-likelihood estimator would use both sources of information. This statement assumes that the ẋ measurements and the finite-difference derivatives are independent data. It is conceivable that the x "measurements" are obtained as sums of the ẋ measurements (for instance, in an inertial navigation unit). Such cases are merely integrated versions of the finite-difference approach, not really comparable to cases of independent ẋ measurements.

The lack of a rigorous derivation for the state-equation error method with independently measured state derivatives does not necessarily mean that it is a poor estimator. If the information in the state derivative measurements is much better than the information in the finite-difference state derivatives, we can justify the approach as a good approximation. Furthermore, as expressed in our discussions in Section 1.4, an estimator does not have to be statistically derived to be a good estimator. For some problems, this estimator gives adequate results with low computational costs; when this result occurs, it is sufficient justification in itself.

Another specific case of the equation-error method is observation-equation error. In this case, the specific form of h comes from the observation equation, ignoring the noise. The equation is the same for pure discrete-time or mixed continuous/discrete-time systems. The observation equation for a system with additive noise is

$$z(t_i) = g[x(t_i), u(t_i), \xi] + G\eta_i$$

The h function based on this equation is

$$h[z(\cdot), u(\cdot), t_i, \xi] = z(t_i) - g[x(t_i), u(t_i), \xi]$$
As in the case of state-equation error, observation-equation error requires measurements or reconstructions of the state, because x(ti) appears in the equation. The comments in Section 10.2.1 about noise in the state measurement apply here also. Observation-equation error does not require measurements of the state derivative.

The observation-equation error method also requires that there be some measurements in addition to the states, or the method reduces to triviality. If the states were the only measurements, the observation equation would reduce to

$$z(t_i) = x(t_i)$$
which has no unknown parameters. There would, therefore, be nothing to estimate.

The observation-equation error method applies only to estimating parameters in the observation equation. Unknown parameters in the state equation do not enter this formulation. In fact, the existence of the state equation is largely irrelevant to the method. This irrelevance perhaps explains why observation-equation error is usually neglected in discussions of estimators for dynamic systems. The method is essentially a direct application of the static estimators of Chapter 5, taking no advantage of the dynamics of the system (the state equation). From a theoretical viewpoint, it may seem out of place in this chapter. In practice, the observation-equation-error method is widely used, sometimes contorted to look like a state-equation-error method. The observation-equation-error method is often a competitor to an output-error method. Our treatment of observation-equation error is intended to facilitate a fair evaluation of such choices and to avoid unnecessary contortions into state-equation error forms.

10.3 COMPUTATION

We have previously mentioned that a unifying characteristic of the methods discussed in this chapter is their computational simplicity. We have not, however, given much detail on the computational issues. Equation (10.2-3), which encompasses all equation-error forms, is in the form of Equation (2.5-1) if the weighting matrix W is known. Therefore, the Gauss-Newton optimization algorithm applies directly. Unknown weighting matrices can be handled by the method discussed in Sections 5.5 and 8.4.

In the most general definition of equation error, this is nearly the limit of what we can state about computation. The definition of Equation (10.2-3) is general enough to allow output error and filter error as special cases. Both output error and filter error have the special property that the dependence of h on z and u can be cast in a recursive form, significantly lowering the computational costs. Because of this recursive form, the total computational cost is roughly proportional to the number of time points, N. The general definition of equation error also encompasses nonrecursive forms, which could have computational costs proportional to N² or higher powers.

The equation-error methods discussed in this chapter have the property that, for each ti, the dependence of h on z(.) and u(.) is restricted to one or two time points. Therefore, the computational effort for each evaluation of h is independent of N, and the total computational cost is roughly proportional to N. In this regard, state-equation error and output-equation error are comparable to output error and filter error. For a completely general, nonlinear system, the computational cost of state-equation error or output-equation

error is roughly similar to the cost of output error. (General nonlinear models are currently impractical for filter error without using linearized approximations.)

In the large majority of practical applications, however, the f and g functions have special properties which make the computational costs of state-equation error and output-equation error far smaller than the computational costs of output error or filter error.

The first property is that the f and g functions are linear in ξ. This property holds true even for systems described as nonlinear; the nonlinearity meant by the term "nonlinear system" is as a function of x and u, not as a function of ξ. Equation (1.3-2) is a simple example of a static system nonlinear in the input, but linear in the parameters. The output-error method can seldom take advantage of linearity in the parameters, even when the system is also linear in x and u, because the system response is usually a nonlinear function of ξ. (There are some significant exceptions in special cases.) State-equation error and output-equation error methods, in contrast, can take excellent advantage of linearity in the parameters, even when the system is nonlinear in x and u. In this situation, state-equation error and output-equation error meet the conditions of Section 2.5.1 for the Gauss-Newton algorithm to attain the exact minimum in a single iteration. This is both a quantitative and a qualitative computational improvement relative to output error. The quantitative improvement is a division of the computational cost by the number of iterations required for the output-error method. The qualitative improvement is the elimination of the issues associated with iterative methods: starting values, convergence-testing criteria, failure to converge, convergence accelerators, multiple local solutions, and other issues. The most commonly cited of these benefits is that there is no need for reasonable starting values. You can evaluate the equations at any arbitrary point (zero is often convenient) without affecting the result.

Another simplifying property of f and g, not quite as universal, but true in the majority of cases, is that each element of ξ affects only one element of f or g. The simplest example of this is a linear system where the unknown parameters are individual elements of the system matrices. With this structure, if we constrain W to be diagonal, Equation (10.2-3) separates into a sum of independent minimization problems with scalar h, one problem for each element of h. If ℓ is the number of elements of the h-vector, we now have ℓ independent functions in the form of Equation (10.2-3), each with scalar h. Each element of ξ affects one and only one of these scalar functions.

This partitioning has the obvious benefit, common to most partitioning algorithms, that the sum of the ℓ problems with scalar h requires less computation than the unpartitioned vector problem. The outer-product computation of Equation (2.5-11), often the most time-consuming part of the algorithm, is proportional to the square of the number of unknowns and to ℓ.
Therefore, if the unknowns are evenly distributed among the ℓ elements of h, the computational cost of the vector problem could be as much as ℓ³ times the cost of each of the scalar problems. Other portions of the computational cost and overhead will reduce this factor somewhat, but the improvement is still dramatic.

Another benefit of the partitioning is that it allows us to avoid iteration when the noise covariances are unknown. With this partitioning, the minimizing values of ξ are independent of W. The normal role of W is in weighing the importance of fitting the different elements of h. One value of ξ might fit one element of h best, while another value of ξ fits another element of h best; W establishes how to strike a compromise among these conflicting aims. Since the partitioned problem structure makes the different elements of h independent, W is largely irrelevant. Therefore we can estimate the elements of ξ using any arbitrary value of W (usually an identity matrix). If we want an estimate of W, we can compute it after we estimate the other unknowns.

The combined effect of these computational improvements is to make the computational cost of the state-equation error and output-equation error methods negligible in many applications. It is common for the computational cost of the actual equation-error algorithm to be dwarfed by the overhead costs of obtaining the data, plotting the results, and related computations.
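A minimal sketch of the partitioned estimation (hypothetical names), showing that each scalar problem is solved independently and that no weighting matrix W enters:

```python
import numpy as np

def partitioned_fit(H_list, y_list):
    """Solve each scalar equation-error problem independently.

    H_list[k]: regressor matrix for the k-th element of h
    y_list[k]: corresponding "measured" values (for example, xdot_k)
    """
    # The problems are independent, so W is irrelevant and no
    # iteration is needed; each call is a single least-squares solve.
    return [np.linalg.lstsq(H, y, rcond=None)[0]
            for H, y in zip(H_list, y_list)]
```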

10.4 DISCUSSION

The undebated strong points of the state-equation-error and output-equation-error methods are their simplicity and low computational cost. Most important is that Gauss-Newton gives the exact minimum of the cost function without iteration. Because the methods are noniterative, they require no starting estimates. These methods have been used in many applications, sometimes under different names.

The weaknesses of these methods stem from their assumptions of perfect state measurements. Relatively small amounts of noise in the measurements can cause significant bias errors in the estimates. If a measurement of some state is unavailable, or if an instrument fails, these methods are not directly applicable (though such problems are sometimes handled by state reconstruction algorithms).

State-equation-error and output-equation-error methods can be used with either of two distinct approaches, depending upon the application. The first approach is to accept the problem of measurement-noise sensitivity and to emphasize the computational efficiency of the method. This approach is appropriate when computational cost is a more important consideration than accuracy. For example, state-equation error and output-equation error methods are popular for obtaining starting values for iterative procedures such as output error. In such applications, the estimates need only be accurate enough to cause the iterative methods to converge (presumably to better estimates). Another common use for state-equation error and output-equation error is to select a model from a large number of candidates by estimating the parameters in each candidate model. Once the model form is selected, the rough parameter estimates can be refined by some other method.

The second approach to using state-equation-error or output-equation-error methods is to spend the time and effort necessary to get accurate results from them, which first requires accurate state measurements with low noise levels. In many applications of these methods, most of the work lies in filtering the data and reconstructing estimates of unmeasured states. (A Kalman filter can sometimes be helpful here, provided that the filter does not depend upon the parameters to be estimated. This condition requires a special problem structure.) The total cost of obtaining good estimates from these methods, including the cost of data preprocessing, may be comparable to the cost of more complicated iterative algorithms that require less preprocessing. The trade-off is highly dependent on application variables such as the required accuracy of the estimates, the quality of the available instrumentation, and the existence of independent needs for accurate state measurements.

CHAPTER 11

11.0 ACCURACY OF THE ESTIMATES

Parameter estimates from real systems are, by their nature, imperfect. The accuracy of the estimates is a pervasive issue in the various stages of application, from the problem statement to the evaluation and use of the results.

We introduced the subject of parameter estimation in Section 1.4, using concepts of errors in the estimates and adequacy of the results. The subsequent chapters have largely concentrated on the derivation of algorithms. These derivations are all related to accuracy issues, based on the definitions and discussions in Chapter 4. However, the questions about accuracy have been largely overshadowed by the details of deriving and implementing the algorithms.

In this chapter, we return the emphasis to the critical issue of accuracy. The final judgment of the parameter estimation process for a particular application is based on the accuracy of the results. We examine the evaluation of the accuracy, factors contributing to inaccuracy, and means of improving accuracy. A truly comprehensive treatment of the subject of accuracy is impossible. We restrict our discussion largely to generic issues related to the theory and methodology of parameter estimation.

To make effective use of parameter estimates, we must have some gauge of their accuracy, be it a statistical measure, an intuitive guess, or some other source. If we absolutely cannot distinguish the extremes of accurate versus worthless estimates, we must always consider the possibility that the estimates are worthless, in which case the estimates could not be used in any application in which their validity was important. Therefore, measures of the estimate accuracy are as important as are the estimates themselves. Various means of judging the accuracy of parameter estimates are in current use.
We will group the uses for measures of estimate accuracy into three general classes. The first class of use is in planning the parameter estimation. Predictions of the estimate accuracy can be used to evaluate the adequacy of the proposed experiments and instrumentation system for the parameter estimation on the proposed model. There are limitations to this usage because it involves predicting accuracy before the actual data are obtained. Unexpected problems can always cause degradation of the results compared to the predictions. The accuracy predictions are most useful in identifying experiments that have no hope of success.

The second use is in the parameter estimation process itself. Measures of accuracy can help detect various problems in the estimation, from modeling failures, data problems, program bugs, or other sources. Another facet of this class of use is the comparison of different estimates. The comparisons can be between two different models or methods applied to the same data set, between estimates from independent data sets, or between predictions and estimates from the experimental data. In any of these events, measures of accuracy can help determine which of the conflicting values is best, or whether some compromise between them should be considered. Comparison of the accuracy measures with the differences in the estimates is a means to determine if the differences are significant. The magnitude of the observed differences between the estimates is, in itself, an indicator of accuracy.

The third use of measures of accuracy is for presentation with the final estimates for the user of the results. If the estimates are to be used in a control system design, for instance, knowledge of their accuracy is useful in evaluating the sensitivity of the control system. If the estimates are to be used by an explicit adaptive or learning control system, then it is important that the accuracy evaluation be systematic enough to be automatically implemented. Such immediate use of the estimates precludes the intercession of engineering judgment; the evaluation of the estimates must be entirely automatic. Such control systems must recognize poor results and suitably discount them (or ensure that they never occur, an overly optimistic goal).

The single most critical contributor to getting accurate parameter estimates in practical problems is the analyst's understanding of the physical system and the instrumentation. The most thorough knowledge of parameter estimation theory and the use of the most powerful techniques do not compensate for poor understanding of the system. This statement relates directly to the discussion in Chapter 1 about the "black box" identification problem and the roles of independent knowledge versus system identification. The principles discussed in this chapter, although no substitute for an understanding of the system, are a necessary adjunct to such understanding.

Before proceeding further, we need to review the definition of the term "accuracy" as it applies to real data. A system is never described exactly by the simplified models used for analysis. Regardless of the sophistication of the model, unexplained sources of modeling error will always remain. There is no unique, correct model. The concept of accuracy is difficult to define precisely if no correct model exists. It is easiest to approach by considering the problem in two parts: estimation and modeling. For analyzing the estimation problem, we assume that the model describes the system exactly. The definition of accuracy is then precise and quantitative. Many results are available in the subject area of estimation accuracy. Sections 11.1 and 11.2 discuss several of them.

The modeling problem addresses the question of whether the form of the model can describe the system adequately for its intended use. There is little guide from the theory in this area. Studies such as those of Gupta, Hall, and Trankle (1978), Fiske and Price (1977), and Akaike (1974) discuss selection of the best model from a set of candidates, but do not consider the more basic issue of defining the candidate models. Section 11.4 considers this point in more detail.
For the most part, the determination of model adequacy is based on engineering judgment and problem-specific analysis relying heavily on the analyst's understanding of the physics of the system. In some cases, we can test model adequacy by demonstration: if we try the model and it achieves its purpose, it was obviously adequate. Such tests are not always practical, however. This method assumes, of course, that the test was comprehensive. Such assumptions should not be made lightly; they have cost lives when systems encountered untested conditions.
After considering estimation and modeling as separate problems, we need to look at their interactions to complete the discussion of accuracy. We need to consider the estimates that result from a model judged to be adequate, although not exact. As in the modeling problem, this process involves considerable subjective judgment, although we can obtain some quantitative results.

We can examine some specific, postulated sources of modeling error through simulations or analyses that use more complex models than are practical or desirable in the parameter estimation. Such simulations or analyses can include, for example, models of specific, postulated instrumentation errors (Hodge and Bryant, 1978; and Sorensen, 1972). Maine and Iliff (1981b) present some more general, but less rigorous, results.
11.1 CONFIDENCE REGIONS

The concept of a confidence region is central to the analytical study of estimation accuracy. In general terms, a confidence region is a region within which we can be reasonably confident that the true value of ξ lies. Accurate estimates correspond to small confidence regions for a given level of confidence. Note that small confidence regions imply large confidence; in order to avoid this apparent inversion of terminology, the term "uncertainty region" is sometimes used in place of the term "confidence region." The following subsections define confidence regions more precisely.

For continuous, nonsingular estimation problems, the probability of any point estimate's being exactly correct is zero. We need a concept such as the confidence region to make statements with a nonzero confidence. Throughout the discussion of confidence regions, we assume that the system model is correct; that is, we assume that ξ has a true value lying in the parameter space. In later sections we will consider issues relating to modeling error.

11.1.1 Random Parameter Vector

Let us consider first the case in which ξ is a random variable with a known prior distribution. This situation usually implies the use of an MAP estimator.

In this case, ξ has a posterior distribution, and we can define the posterior probability that ξ lies in any fixed region. Although we will use the posterior distribution of ξ as the context for this discussion, we can equally well define prior confidence regions. None of the following development depends upon our working with a posterior distribution. For simplicity of exposition, we will assume that the posterior distribution of ξ has a density function. The posterior probability that ξ lies in a region R is then

$$P(R) = \int_R p(\xi|Z)\, d\xi \qquad (11.1-1)$$

We define R to be a confidence region for the confidence level α if P(R) = α, and no other region with the same probability is smaller than R. We use the volume of a region as a measure of its size.

Theorem 11.1 Let R be the set of all points with p(ξ|Z) ≥ c, where c is a constant. Then R is a confidence region for the confidence level α = P(R).

Proof Let R be as defined above, and let R' be any other region with P(R') = α. We need to prove that the volume of R' must be greater than or equal to that of R. We define T = R ∩ R', S = R - R', and S' = R' - R. Then T, S, and S' are disjoint, R = T ∪ S, and R' = T ∪ S'. Because S ⊂ R, we must have p(ξ|Z) ≥ c everywhere in S. Conversely, S' lies outside R, so p(ξ|Z) < c everywhere in S'. In order for P(R') = P(R), we must have P(S') = P(S). Therefore, the volume of S' must be greater than or equal to that of S. The volume of R' must then be greater than or equal to that of R, completing the proof.
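As a concrete illustration of Theorem 11.1, the following minimal sketch (in Python; the bimodal posterior density and the level c are invented for illustration) thresholds a one-dimensional density at a level c and integrates over the resulting region to obtain the corresponding confidence level α. Note that the region need not be a single interval.

    import numpy as np

    # Invented bimodal posterior density p(xi|Z), tabulated on a grid
    x = np.linspace(-10.0, 10.0, 4001)
    dx = x[1] - x[0]
    p = 0.7 * np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2 * np.pi) \
      + 0.3 * np.exp(-0.5 * ((x + 3.0) / 0.5)**2) / (0.5 * np.sqrt(2 * np.pi))

    # Theorem 11.1: R = {x: p(x|Z) >= c} is the smallest region with
    # confidence level alpha = P(R).
    c = 0.05
    R = p >= c                    # indicator of the confidence region
    alpha = np.sum(p[R]) * dx     # P(R), by numerical integration
    size = np.sum(R) * dx         # volume (here, length) of the region
    print(f"alpha = {alpha:.3f}, region length = {size:.2f}")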

It is often convenient to characterize a closed region by its boundary. The boundaries of the confidence regions defined by Theorem 11.1 are isoclines of the posterior density function p(ξ|Z).

We can write the confidence region derived in the above theorem as

R = {x: p_{ξ|Z}(x|Z) ≥ c}    (11.1-2)

We must use the full notation for the probability density function to avoid confusion in the following manipulations. For consistency with the following section, it is convenient to re-express the confidence region in terms of the density function of the error

e = ξ̂ - ξ    (11.1-3)

The estimate ξ̂ is a deterministic function of Z; therefore, Equation (11.1-3) trivially gives

p_{ξ|Z}(x|Z) = p_{e|Z}(ξ̂ - x|Z)    (11.1-4)

Substituting this into Equation (11.1-2) gives the expression

R = {x: p_{e|Z}(ξ̂ - x|Z) ≥ c}    (11.1-5)

Substituting x + ξ̂ for x in Equation (11.1-5) gives the convenient form

R = ξ̂ + {x: p_{e|Z}(-x|Z) ≥ c}    (11.1-6)

This form shows the boundaries of the confidence regions to be translated isoclines of the error-density function. Exact determination of the confidence regions is impractical except in simple cases. One such case occurs when ξ is scalar and p(ξ|Z) is unimodal. An isocline then consists of two points, and the line segment between the two points is the confidence region. In this one-dimensional case, the confidence region is often called a confidence interval. Another simple case occurs when the posterior density function is in some standard family of density functions expressible in closed form. This is most commonly the family of Gaussian density functions. An isocline of a Gaussian density function with mean m and nonsingular covariance A is a set of x values satisfying

(x - m)*A⁻¹(x - m) = c    (11.1-7)

This is the equation of an ellipsoid. For problems not fitting into one of these special cases, we usually must make approximations in the computation of the confidence regions. Section 11.1.3 discusses the most common approximation.
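The Gaussian case is simple enough to compute directly. The following sketch (Python; the mean and covariance are invented) traces the isocline (11.1-7) for a two-dimensional Gaussian by factoring the covariance, and verifies that the resulting points satisfy the quadratic form.

    import numpy as np

    m = np.array([1.0, -2.0])                  # mean of the Gaussian
    A = np.array([[4.0, 1.5], [1.5, 1.0]])     # nonsingular covariance
    c = 1.0                                    # isocline level

    # Boundary points satisfy (x - m)* A^{-1} (x - m) = c. Factor
    # A = L L* (Cholesky); then x = m + sqrt(c) L u for unit vectors u.
    L = np.linalg.cholesky(A)
    theta = np.linspace(0.0, 2.0 * np.pi, 200)
    u = np.vstack([np.cos(theta), np.sin(theta)])   # unit circle
    boundary = m[:, None] + np.sqrt(c) * (L @ u)    # ellipse points

    d = boundary - m[:, None]
    q = np.einsum('ij,ij->j', d, np.linalg.solve(A, d))
    assert np.allclose(q, c)   # each point lies on the isocline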

11.1.2 Nonrandom Parameter Vector

When ξ is simply an unknown parameter with no random nature, the development of confidence regions is more oblique, but the result is similar in form to the results of the previous section. The same comments apply when we wish to ignore any prior distribution of ξ and to obtain confidence regions based solely on the current experimental data. These situations usually imply the use of MLE estimators. In neither of these situations can we meaningfully discuss the probability of ξ lying in a given region. We proceed as follows to develop a substitute concept: the estimate ξ̂ is a function of the observation Z, which has a probability distribution conditioned on ξ. Therefore, we can define a probability distribution of ξ̂ conditioned on ξ. We will assume that this distribution has a density function p_{ξ̂|ξ}.

For a given value of ξ, the isoclines of p_{ξ̂|ξ} define boundaries of confidence regions for ξ̂. Let R₁ be such a confidence region, with confidence level α:

R₁ = {x: p_{ξ̂|ξ}(x|ξ) ≥ c}    (11.1-8)

It is convenient to define R₁ in terms of the error density function p_{e|ξ}, using the relation

p_{ξ̂|ξ}(x|ξ) = p_{e|ξ}(x - ξ|ξ)    (11.1-9)

This gives

R₁ = ξ + {x: p_{e|ξ}(x|ξ) ≥ c}    (11.1-10)
The estimate ξ̂ has probability α of being in R₁. For this chapter, we are more interested in the situation where we know the value of ξ̂ and seek to define a confidence region for ξ, which is unknown. We can define such a confidence region for ξ, given ξ̂, in two steps, starting with the region R₁.

The first step is to define a region R₂ which is a mirror image of R₁. A point ξ + x in R₁ reflects onto the point ξ̂ - x in R₂, as shown in Figure (11.1-1). We can thus write the region R₂ as

R₂ = {ξ̂ - x: ξ + x ∈ R₁}    (11.1-11)

This reflection interchanges ξ and ξ̂; therefore, ξ lies in R₂ if and only if ξ̂ is in R₁. Because there is probability α that ξ̂ lies in R₁, there is the same probability α that ξ lies in R₂.
To be technically correct, we must be careful about the phrasing of this statement. Because the true value is not random, it makes no sense to say that ξ has probability α of lying in R₂. The randomness is in the construction of the region R₂, because R₂ depends on the estimate ξ̂, which depends in turn on the noise-contaminated observations. We can sensibly say that the region R₂, constructed in this manner, has probability α of covering the true value ξ. This concept of a region covering the fixed point ξ replaces the concept of the point ξ lying in a fixed region. The distinction is more important in theory than in practice.

Although we have defined the region R₂ in principle, we cannot construct the region from the data available because R₂ depends on the value of ξ, which is unknown. Our next step is to construct a region R₃ which approximates R₂, but does not depend on the true value of ξ. We base the approximation on the assumption that p_{e|ξ} is approximately invariant as a function of ξ; that is

p_{e|ξ}(x|ξ) ≈ p_{e|ξ}(x|ξ̂)    (11.1-12)

This approximation is unlikely to be valid for large values of the error, except in simple cases. For small values of the error, the approximation is usually reasonable. We define the confidence region R₃ for ξ by applying this approximation to Equation (11.1-11):


R₃ = ξ̂ - {x: p_{e|ξ}(x|ξ̂) ≥ c}    (11.1-13)

The region R₃ depends only on ξ̂, p_{e|ξ}, and the arbitrary constant c. The function p_{e|ξ} is presumed known from the start, and ξ̂ is the estimate computed by the methods described in previous chapters. In principle, we have sufficient information to compute the region R₃. Practical application requires either that p_{e|ξ} be in one of the simple forms described in Section 11.1.1, or that we make further approximations as discussed in Section 11.1.3.

If the error is small (that is, if the estimate is accurate), then R₃ will likely be a close approximation to R₂. If the error is large, then the approximation is questionable. The result is that we are unable to define large confidence regions accurately except in special cases. We can tell that the confidence region is large, but its precise size and shape are difficult to determine.

Note that the confidence region for nonrandom parameters, defined by Equation (11.1-13), is almost identical in form to the confidence region for random parameters, defined by Equation (11.1-6). The only difference in the form is what the density functions are conditioned on.

11.1.3 Gaussian Approximation

The previous sections have derived the boundaries of confidence regions for both random and nonrandom parameter vectors in terms of isoclines of probability density functions of the error vector. Except in special cases, the probability density functions are too complicated to allow practical computation of the exact isoclines. Extreme precision in the computation of the confidence regions is seldom necessary; we have already made approximations in the definition of confidence regions for nonrandom parameters. In this section, we introduce approximations which allow relatively easy computation of confidence regions.

The central idea of this section is to approximate the pertinent probability density functions by Gaussian density functions. As discussed in Section 11.1.1, the isoclines of Gaussian density functions are ellipsoids, which are easy to compute. We call these "confidence ellipsoids" or "uncertainty ellipsoids." In many cases, we can justify the Gaussian approximation with arguments that the distributions asymptotically approach Gaussians as the amount of data increases. Section 5.4.2 discusses some pertinent asymptotic results.

A Gaussian approximation is defined by its mean and covariance. We will consider appropriate choices for the mean and covariance to make the Gaussian density function a reasonable approximation. An obvious possibility is to set the mean and covariance of the Gaussian approximation to match the mean and covariance of the original density function; we are often forced to settle for approximations to the mean and covariance of the original density function, the exact values being impractical to compute. Another possibility is to use Equations (3.5-17) and (3.5-18). We will illustrate the use of both of these options.

Consider first the case of an MLE estimator. Equation (11.1-13) defines the confidence region. We will use covariance matching to define the Gaussian approximation to p_{e|ξ}. The exact mean and covariance of p_{e|ξ} are difficult to compute, but there are asymptotic results which give reasonable approximations. We use zero as an approximation to the mean of p_{e|ξ}; this approximation is based on MLE estimators being asymptotically unbiased. Because MLE estimators are efficient, the Cramer-Rao bound gives an asymptotic approximation for the covariance of p_{e|ξ} as the inverse of the Fisher information matrix M(ξ). We can use either Equation (4.2-19) or (4.2-24) as equivalent expressions for the Fisher information matrix. Equation (5.4-11) gives the particular form of M(ξ) for static nonlinear systems with additive Gaussian noise. Both ξ̂ and M(ξ̂) are readily available in practical application. The estimate ξ̂ is the primary output of a parameter estimation program, and most MLE parameter-estimation programs compute M(ξ̂) or an approximation to it as a by-product of iterative minimization of the cost function.
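The following sketch outlines that computation for a static nonlinear system with additive Gaussian noise, under the usual Gauss-Newton form in which M(ξ) is a sum of response-gradient outer products weighted by the inverse noise covariance. The model f, the data, and all numbers are invented stand-ins, not the output of any particular program.

    import numpy as np

    def f(xi, u):
        # Invented scalar response model with two unknown parameters
        return xi[0] * u + xi[1] * u**2

    u = np.linspace(0.0, 1.0, 50)       # input samples
    xi_hat = np.array([2.0, -0.5])      # estimate from the minimization
    gg = 0.01                           # measurement noise variance GG*

    # Response gradients by central finite differences
    eps = 1e-6
    grads = np.empty((u.size, xi_hat.size))
    for j in range(xi_hat.size):
        step = np.zeros_like(xi_hat)
        step[j] = eps
        grads[:, j] = (f(xi_hat + step, u) - f(xi_hat - step, u)) / (2 * eps)

    M = grads.T @ grads / gg            # Fisher information (Gauss-Newton)
    A = np.linalg.inv(M)                # approximate error covariance
    print(np.sqrt(np.diag(A)))          # Cramer-Rao bounds on each element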

Now consider the case of an MAP estimator. We need a Gaussian approximation to p(e|Z). Equations (3.5-17) and (3.5-18) provide a convenient basis for such an approximation. By Equation (3.5-17), we set the mean of the Gaussian approximation equal to the point at which p(e|Z) is a maximum; by definition of the MAP estimator, this point is zero. We then set the covariance of the Gaussian approximation to

A = [-∇²_ξ ln p(ξ|Z)]⁻¹    (11.1-14)

evaluated at ξ = ξ̂. For static nonlinear systems with additive Gaussian noise, Equation (11.1-14) reduces to the form of Equation (5.4-12), which we could also have obtained by approximate covariance-matching arguments. This form for the covariance is the same as that used in the MLE confidence ellipsoid, with the addition of the prior covariance term. As the prior covariance goes to infinity, the confidence ellipsoid for the MAP estimator approaches that for the MLE estimator, as we would anticipate. Both the MLE and MAP confidence ellipsoids take the form

(x - ξ̂)*A⁻¹(x - ξ̂) = c    (11.1-15)

where A is an approximation to the error-covariance matrix. We have suggested suitable approximations in the above paragraphs, but most approximations to the error covariance are equally acceptable. The choice is usually dictated by what is conveniently available in a given program.
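A sketch of the corresponding MAP computation, assuming (consistently with the description above, though the exact form is Equation (5.4-12) in the text) that the ellipsoid covariance is the inverse of the Fisher information plus the inverse prior covariance. The numbers are invented; M would come from a computation like the previous sketch.

    import numpy as np

    M = np.array([[250.0, 90.0], [90.0, 40.0]])   # Fisher information at xi-hat
    P = np.eye(2)                                 # invented prior covariance

    A_mle = np.linalg.inv(M)                      # MLE ellipsoid covariance
    # As the prior covariance grows, the MAP ellipsoid approaches the MLE one.
    for scale in [1.0, 100.0, 1e6]:
        A_map = np.linalg.inv(np.linalg.inv(scale * P) + M)
        print(scale, np.max(np.abs(A_map - A_mle)))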

11.1.4 Nonstatistical Derivation

We can alternately derive the confidence ellipsoids for MAP and MLE estimators from a nonstatistical viewpoint. This derivation obtains the same result as the statistical approach and is easier to follow. Comparison of the ideas used in the statistical and nonstatistical derivations reveals the close relationships between the statistical characteristics of the estimates and the numerical problems of computing them. The nonstatistical approach generalizes easily to estimators and models for which precise statistical descriptions are difficult.

The nonstatistical derivation presumes that the estimate is defined as the minimizing point of some cost function. We examine the shape of this cost function as it affects the numerical minimization problem in the area of the minimum. For current purposes, we are not concerned with start-up problems, isolated local minima, and other problems manifested far from the solution point. A relatively flat, ill-defined minimum corresponds to a questionable estimate; the extreme case of this is a function without a discrete local minimum point. A steep, well-defined minimum corresponds to a reliable estimate. With this justification, we define a confidence region to be the set of points with cost-function values less than or equal to some constant. Different values of the constant give different confidence levels. The boundary of such a region is an isocline of the cost function. We then approximate the cost function in the neighborhood of the minimum by a quadratic Taylor-series expansion about the minimum point:

J(ξ) ≈ J(ξ̂) + (1/2)(ξ - ξ̂)*[∇²_ξ J(ξ̂)](ξ - ξ̂)    (11.1-16)

The isoclines of this quadratic approximation are the confidence ellipsoids

(ξ - ξ̂)*[∇²_ξ J(ξ̂)](ξ - ξ̂) = c    (11.1-17)
The inverse of the second gradient of an MLE or MAP cost function is an asymptotic approximation to the appropriate error covariance. Therefore, Equation (11.1-17) gives the same shape confidence ellipsoids as we previously derived on a statistical basis. In practice, the Gauss-Newton or other approximation to the second gradient is usually used. The constant c determines the size of the confidence ellipsoid. The nonstatistical derivation gives no obvious basis for selecting a value of c. The value c = 1 gives the most useful correspondence to the statistical derivation, as we will see in Section 11.2.1. Figures (11.1-2) and (11.1-3) illustrate the construction of one-dimensional confidence ellipsoids using the nonstatistical definition.

11.2 ANALYSIS OF THE CONFIDENCE ELLIPSOID

The confidence ellipsoid gives a comprehensive picture of the theoretically likely errors in the estimate. It is difficult, however, to display the information content of the ellipsoid on a two-dimensional sheet of paper. In the applications we most commonly work on, there are typically 10 to 30 unknown parameters; that is, the ellipsoid is 10- to 30-dimensional. We can print the covariance matrix which defines the shape of the ellipsoid, but it is difficult to draw useful conclusions from such a presentation format. The problem of meaningful presentation is further compounded when analyzing hundreds of experiments to obtain parameter estimates under a wide variety of conditions.

In the following sections, we discuss simplified statistics that characterize important features of the confidence ellipsoids in ways that are easy to describe and present. The emphasis in these statistics is on reducing the dimensionality of the problem. Many important questions about accuracy reduce to one-dimensional forms, such as the accuracy of the estimate of each element of the parameter vector.

All of the statistics discussed here are functions of the matrix A, which defines the shape of the confidence ellipsoid. We have seen above that A is an approximation to the error-covariance matrix. These two viewpoints of A will provide us with geometrical and statistical interpretations. A third interpretation comes from viewing A as the inverse of the second gradient of the cost function. In practice, A is usually computed from the Gauss-Newton or other convenient approximation to the second gradient.

These statistics are closely linked to some of the basic sources of estimation errors and difficulties. We will illustrate the discussion with idealized examples of these classes of difficulties. The exact means of overcoming such difficulties depends on the problem, but the first step is to understand the mechanism causing the difficulty. In a surprising number of applications, the major difficulties are cases of the simple idealizations discussed here.

11.2.1 Sensitivity

The sensitivity is the simplest of the statistics relating to the confidence ellipsoid. Although the sensitivity has both a statistical and a nonstatistical interpretation, the use of the statistical interpretation is relatively rare. The term "sensitivity" comes from the nonstatistical interpretation, which we will discuss first.

From the nonstatistical viewpoint, the sensitivity is a measure of how much the cost-function value changes for a given change in a scalar parameter value. The most common definition of the sensitivity with respect to a parameter ξᵢ is the second partial derivative of the cost function with respect to the parameter:

∂²J(ξ)/∂ξᵢ²    (11.2-1)

For the purposes of this chapter, we are interested in the sensitivity evaluated at the minimum point of the cost function; we will take this as part of the definition of the sensitivity. The ξᵢ in Equation (11.2-1) can be any scalar function of the ξ vector. In most cases, ξᵢ is one of the elements of the ξ vector. For simplicity, we will assume for the rest of this section that ξᵢ is the ith element of ξ. Generalizations are straightforward. When ξᵢ is the ith element of ξ, the second partial derivative with respect to ξᵢ is the ith diagonal element of the second-gradient matrix:

∂²J(ξ̂)/∂ξᵢ² = [∇²_ξ J(ξ̂)]ᵢᵢ = (A⁻¹)ᵢᵢ    (11.2-2)

The sensitivity has a simple geometric interpretation based on the confidence ellipsoid. Use the value c = 1 in Equation (11.1-17) to define a confidence ellipsoid. Draw a line passing through ξ̂ (the center of the ellipsoid) and parallel to the ξᵢ axis. The sensitivity with respect to ξᵢ is related to the distance, Iᵢ, from the center of the ellipsoid to the intercept of this line and the ellipsoid. We call this distance the insensitivity with respect to ξᵢ. Figure (11.2-1) shows the construction of the insensitivities with respect to ξ₁ and ξ₂ on a two-dimensional example. The relationship between the sensitivity and the insensitivity is

Iᵢ = [∂²J(ξ̂)/∂ξᵢ²]^{-1/2}    (11.2-3)

which follows immediately from Equation (11.1-17) for the confidence ellipsoid, and Equation (11.2-1) for the sensitivity.

We can rephrase the geometric interpretation of the insensitivity as follows: the insensitivity with respect to ξᵢ is the largest change that we can make in the ith element of ξ and still remain within the confidence ellipsoid. All other elements of ξ are constrained to remain equal to their estimated values during this search; that is, the search is constrained to a line parallel to the ξᵢ axis passing through ξ̂.

From the statistical viewpoint, the insensitivity with respect to ξᵢ is an approximation to the standard deviation of eᵢ, the corresponding component of the error, conditioned on all of the other components of the error. We can see this by recalling the results from Chapter 3 on conditional Gaussian distributions. If the covariance of e is A, then the covariance of eᵢ conditioned on all of the other components is [(A⁻¹)ᵢᵢ]⁻¹; therefore, the conditional standard deviation is [(A⁻¹)ᵢᵢ]^{-1/2}. From Equations (11.2-2) and (11.2-3), we can see that this expression equals the insensitivity. Note that the conditioning on the other elements in the statistical viewpoint corresponds directly to the constraint on the other elements in the geometric viewpoint.

A sensitivity analysis will detect one of the most obvious kinds of estimation difficulty: parameters which have little or no effect on the system response. If a parameter has no effect on the system response, then it should be obvious that the system response data give no basis for an estimate of the parameter; in statistical terms, the system is unidentifiable. Similarly, if a parameter has little effect on the system response, then there is little basis for an estimate of the parameter; we can expect the estimates to be inaccurate.

Checking for parameters which have no effect on the system response may seem like an academic exercise, considering that practical problems would not be likely to have such irrelevant parameters. In fact, this seemingly trivial difficulty is extremely common in practical applications. It can arise from typographical or other errors in input to computer programs. Perhaps the most common example of this problem is attempting to estimate the effect of an input which is identically zero. The input might either be validly zero, in which case its effect cannot be estimated, or the input signal might have been destroyed or misplaced by sensor or programming problems.

The sensitivity is a reasonable indicator of accuracy only when we are estimating a single parameter, because the estimates of other parameters are never exact, as the sensitivity analysis assumes. The sensitivity analysis ignores all effects of correlation between parameters; we can evaluate the sensitivity with respect to a parameter without even knowing what other parameters are being estimated. When more than one parameter is estimated, the sensitivity gives only a lower bound for the error estimate.
The error band is always at least as large as the insensitivity, regardless of what other parameters are estimated; correlation effects between parameters can increase, but never decrease, the error band. In other words, high sensitivity is a necessary, but not sufficient, condition for an accurate estimate. In practice, correlation effects tend to increase the error band so much that the sensitivity is virtually useless as an indicator of accuracy. The sensitivity analysis is usually useful only for detecting the problem of completely irrelevant parameters. The sensitivity will not indicate when the effect of a parameter is indistinguishable from the effects of other parameters, a more common problem.
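The following sketch shows the kind of degeneracy a sensitivity check catches; the quadratic output-error cost and the data are invented. If an input channel is identically zero, the cost does not depend on the corresponding parameter, and the diagonal of the second gradient (the sensitivity) vanishes exactly.

    import numpy as np

    t = np.linspace(0.0, 10.0, 100)
    u1 = np.sin(t)                          # active input
    u2 = np.zeros_like(t)                   # dead input (lost signal, etc.)
    z = 3.0 * u1 + 0.1 * np.random.default_rng(0).standard_normal(t.size)

    X = np.column_stack([u1, u2])
    # For the cost J = (1/2) sum (z - X xi)^2, the second gradient is X*X;
    # its diagonal elements are the sensitivities.
    H = X.T @ X
    print(np.diag(H))   # sensitivity w.r.t. the second parameter is zero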

11.2.2 Correlation

We noted in the previous section that correlations among parameters result in much larger error bands than indicated by the sensitivities alone. The inadequacy of the sensitivity as a measure of estimate accuracy has led to the widespread use of the statistical correlations to indicate accuracy. We will see in this section that the correlations also give an incomplete picture of the accuracy.

The statistical correlation between two error components eᵢ and eⱼ is defined to be

corr(eᵢ,eⱼ) = E{eᵢeⱼ}/[E{eᵢ²}E{eⱼ²}]^{1/2}    (11.2-4)

assuming that the means of eᵢ and eⱼ are zero. In terms of A, the covariance matrix of e, the correlation is

corr(eᵢ,eⱼ) = Aᵢⱼ/(AᵢᵢAⱼⱼ)^{1/2}    (11.2-5)

Geometrically, the correlations are related to the eccentricity of the confidence ellipsoid. If the sensitivities with respect to all of the unknown parameters are equal (which we can always arrange by a scale change), and if the correlations are all zero, then the confidence ellipsoid is spherical. As the magnitudes of the correlations become larger, the eccentricity of the scaled ellipsoid increases. The magnitude of the correlations can never exceed 1, except through approximations or round-off errors in the computation.

The definition above is for the unconditional or full correlations. Whenever the term correlation appears without a modifier, it implicitly means the unconditional correlation. We can also define conditional correlations, although they are less commonly used. The definition of the conditional correlation is identical to that of the unconditional correlations, except that the expected values are all conditioned on all of the parameters other than the two under consideration. We can express the conditional correlation of eᵢ and eⱼ as

cond corr(eᵢ,eⱼ) = -Γᵢⱼ/(ΓᵢᵢΓⱼⱼ)^{1/2}    (11.2-6)

where Γ = A⁻¹. This is similar to the expression for the unconditional correlation, the difference being that Γ replaces A and the sign is changed.
If there are only two unknowns, the conditional and unconditional correlations are identical. If there are more than two unknowns, the conditional and unconditional correlations can give quite different pictures. Consider the case in which Γ is an N-by-N matrix with 1's on the diagonal and with all of the off-diagonal elements equal to λ; the conditional correlations are then all equal to -λ. As λ approaches -1/(N - 1), the full correlation approaches 1. In the limit, when λ equals -1/(N - 1), the Γ matrix is singular. Thus, for large N, the full correlations can be quite high even when all of the conditional correlations are low. This same example inverts to show that the converse also is true.
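A minimal numeric check of this example (Python; N and the offset from singularity are arbitrary):

    import numpy as np

    N = 10
    lam = -1.0 / (N - 1) + 1e-4     # just short of the singular value
    Gamma = (1 - lam) * np.eye(N) + lam * np.ones((N, N))   # Gamma = A^{-1}
    A = np.linalg.inv(Gamma)        # error covariance

    full_corr = A[0, 1] / np.sqrt(A[0, 0] * A[1, 1])               # Eq. (11.2-5)
    cond_corr = -Gamma[0, 1] / np.sqrt(Gamma[0, 0] * Gamma[1, 1])  # Eq. (11.2-6)
    print(f"full: {full_corr:.3f}, conditional: {cond_corr:.3f}")
    # Prints a full correlation near 1 with a conditional correlation near 0.11.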

There are three objections to using the correlations, full or conditional, as primary indicators of accuracy. First, although the correlations give information about the shape of the confidence ellipsoid, they completely ignore its size. Figure (11.2-2) shows two confidence ellipsoids. Ellipse A is completely contained within ellipse B and is, therefore, clearly preferable; yet ellipse B has zero correlation and ellipse A has significant correlation. From this example, it is obvious that accurate estimates can have high correlations and poor estimates can have low correlations. To evaluate the accuracy of the estimates, you need information about the sensitivities as well as about the correlations; neither alone is adequate.

As a more concrete example of the interplay between correlation and sensitivity, consider a scalar linear system:

z(tᵢ) = Du(tᵢ) + H + n(tᵢ)    (11.2-7)

We wish to estimate D. Both D and the bias H are unknown. The input u(tᵢ) is an angular position of some control device. Suppose that the input time-history is as shown in Figure (11.2-3). A large portion of the energy in this input is from the steady-state value of 90°; the energy in the pulse is much smaller. This input is highly correlated with a constant bias input. Therefore, the estimate of D will be highly correlated with the estimate of H. (If this point is not obvious, we can choose a few time points on the figure and compute the corresponding covariance matrix.) The sensitivity with respect to D is high; because of the large values of u, small changes in D cause large changes in z.
Now we consider the same system, with the input shown in Figure (11.2-4). Both the correlation and the sensitivity are much lower than they were for the input of Figure (11.2-3). These changes balance each other, resulting in the same accuracy in estimating D. The inputs shown in the two figures are identical, but measured with respect to reference axes rotated by 90°. The choice of reference axis is a matter of convention which should not affect the accuracy; it does, however, affect both the sensitivity and correlation.

This example illustrates that the correlation alone is not a reasonable measure of accuracy. By redefining the reference axis of the input in this example, we can change the correlation at will to any value between -1 and 1.

The second objection to the use of correlations as indicators of accuracy is more serious because it cannot be answered by simply looking at sensitivities and correlations together. In the same way that sensitivities are one-dimensional tools, correlations are two-dimensional tools. The utility of a tool restricted to two-dimensional subspaces is limited. Three simple examples of idealized but realistic situations serve to illustrate the dimensional limitations of the correlations. These examples involve free lateral-directional oscillation of an aircraft.

For the first example, there is a yaw-rate feedback to the rudder and a rudder-to-aileron interconnect. Thus the aileron and rudder signals are both proportional to yaw rate. In this case, the conditional correlations of the aileron, rudder, and yaw-rate derivatives are 1 (or nearly so with imperfect data). Conditioned on the aileron derivatives being known exactly, changes in the rudder derivative estimates can be exactly compensated for by changes in the yaw-rate derivative estimates; thus, the conditional correlation is 1.

uncondfttonal correlations. however, are e a s i l y seen t o be only 1/2. Changes I n the rudder d e r l v a t t v e e s t t nates must be compensated f o r by some combination o f changes i n the a i l e r o n and yaw-rate d e r i v a t i v e estimates. Since there are no constratnts on how much o f the compensation must come from the a i l e r o n and how much from the yaw-rate dertvattve estimates, the unconditional c o r r e l a t i o n s would be 1/2 (because, on the average, 112 o i the compcnsatton would come from each source). For the second exanple, no feedback i s present and there i s a n c d t r a l l y damped, d u t c h - r o l l o s c t l l a t i o n (or a wtng rock). The sideslip, r o l l - r a t e , and yaw-rate stg~balsare thus a l l sinusotds o f the s a m frquency, w i t h d i f f e r e n t phases and anplttudts. Taken two a t a time, these stgnals have low correlations. The condit t o n a l cort.clations consider only two p t r a m t e r s a t a time, and thus the conditional c o r r e l a t f ~ n so f the derivatives w i l l be low. Nonetheless, the three stgnals a r e l t n e a r l y dependent when a?1 are c?nsidered t o ther, because they can a l l be w r i t t e n as l i n e a r co&tnations o f a sine wave and a cosine , . A , ? a t the dutchr o c frequency. The unconditional correlattons o f the d e r t v r l i v e s w i l l be 1 (or nearly so w i t n tmperfect data). 50th o f the above exanples have three-dimnsional c o r r e l a t i o n problems, which prevent the parameters from being i d e n t t f t a b l e . The condttlonal c o r r e l a t l o n s are low i n one case, and the unconditional c o r r e l a t i o n s are low i n the other. Although n d t h e r alone i s sufficient, examination o f both the condttional and uncondtttonal c o r r e l a t i o n s w i l l always r r v e a l three-dtmtnsional c o r r e l a t i o r ~problems. For the t h i r d example, suppose t h a t a wing l e v e l e r feeds back bank angle t o the aileron. and t h a t a n e u t r a l l y damped dutch r o l l i s present w i t h the feedback on. There are then f o u r p e r t i n v i ~ t ignals (sideslip, r o l l rate, yaw rate, and r t l e r o n ) t h a t are sinusotds w i t h the Sam frequency and d i f f e r e n t phases. I n t h i s case, both the conditfonal and the uneondttional c o r r e l a t i o n s w i l l be low. Nonetheless, there i s a c o r r e l a t t o n problem which r e s u l t s i n u n i d e n t i f i a b l e parameters. Thts c o r r e l a t i o n problem i s four-dimensional and cannot be seen using the two-dimensional correlatlons. The f u l l and conditional c o r r e l a t l o n s are c l o s e l y r e l a t e d t o the eigcnvalues o f 2-by-2 submtrtces of the r matrices. respectively, n o m l i t e d t o have u n t t y dtagonal elements. S p e c i f i c a l l y , the eigenvalues are 1 p l u s the c o r r e l a t l o n and 1 minus the correlatlon; thus. htgh c o r r e l a t l o n s correspond t o large etgenvalue spreads. Higher-order c o r r e l a t i o n s would be investigated using eigenvalues o f l a r g e r submtrices. Looked a t i n t h i s I i g n t , the i n v e s t i g a t i o n o f 2-by-2 submatrices i s revealed as an a r b i t r a r y choice d i c t a t e d by i t s f a m i l t a r i t y more than by any o b j e c t i v e c r t t e r i o n . The eigenvalues o f the f u l l n o m l i z e d A and T matrices would seem more approrriate tools. 
These eigenvalues and the corresponding eigenvectors can provide some information, but they are seldom used. In principle, small eigenvalues of the normalized Γ matrix or large eigenvalues of the normalized A matrix indicate correlations among the parameters with significant components in the corresponding eigenvectors. Note that the eigenvalues of the unnormalized Γ and A matrices are of little use in studying correlations, because scaling effects tend to dominate.
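The following sketch illustrates the eigenvalue diagnostic on an invented error covariance built from four nearly dependent response gradients: the largest eigenvalue of the normalized A matrix and its eigenvector identify the poorly determined combination of parameters, which the pairwise statistics may fail to localize.

    import numpy as np

    rng = np.random.default_rng(1)
    # Invented gradient matrix with one near-dependency among 4 parameters
    G = rng.standard_normal((100, 4))
    G[:, 3] = G[:, 0] - G[:, 1] + G[:, 2] + 1e-3 * rng.standard_normal(100)
    A = np.linalg.inv(G.T @ G)          # approximate error covariance

    d = np.sqrt(np.diag(A))
    A_norm = A / np.outer(d, d)         # normalized to unity diagonal
    w, v = np.linalg.eigh(A_norm)       # eigenvalues in ascending order
    print("eigenvalues:", np.round(w, 4))
    print("worst direction:", np.round(v[:, -1], 2))   # correlated combination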

The last objection to the use of the correlations is the difficulty of presentation. It is impractical to display the estimated correlations graphically in a problem with more than a handful of unknowns. The most common presentation is simply to print the matrix of estimated correlations. This option offers little improvement in comprehensibility over simply printing the A matrix. If there are a large number of experiments, it is pointless to print all of the correlation matrices. Such a nongraphical presentation cannot reasonably give a coherent picture of the system analyzed.

11.2.3 Cramer-Rao Bound

The Cramer-Rao bound is the last of the statistics based on the confidence ellipsoid. It proves to be the most useful of these statistics. The Cramer-Rao bound is often referred to by other names, including the standard deviation and the uncertainty level. We will consider both statistical and nonstatistical interpretations of the Cramer-Rao bound.

The Cramer-Rao bound of an estimated scalar parameter is the standard deviation of the error in that parameter. Strictly speaking, the term Cramer-Rao bound applies only to the approximation to the standard deviation obtained from the Cramer-Rao inequality. For the purposes of this section, the properties are similar, regardless of the source of the standard deviation. In terms of the A matrix, the Cramer-Rao bound of the ith element of ξ is (Aᵢᵢ)^{1/2}.

The Cramer-Rao bound is closely related to the insensitivity. Both are standard deviations of the error, the only difference being that the insensitivity is the conditional standard deviation, whereas the Cramer-Rao bound is unconditional. They are also computationally similar, the difference being in whether the inversion is of the matrix or of the individual element.

The geometric relationship between the Cramer-Rao bound and the insensitivity is particularly revealing. The Cramer-Rao bound on ξᵢ is the largest change that you can make in ξᵢ and still remain within the confidence ellipsoid. During this search, the other components are free to take any values that keep the point within the confidence ellipsoid. This definition is identical to the geometric definition of the insensitivity, except that the other components are constrained to the estimated values in the definition of insensitivity. This constraint is directly related to the statistical conditioning in the definition of the insensitivity; the Cramer-Rao bound has no such constraints and is an unconditional standard deviation. The Cramer-Rao bound must always be at least as large as the insensitivity, because releasing a constraint can never make the solution of a maximization problem smaller. This fact relates to our previous statement that correlation effects can increase, but never decrease, the error band defined by the insensitivity. Figure (11.2-5) illustrates the geometric interpretation of the Cramer-Rao bounds and insensitivities in a two-dimensional example.

To prove that the Cramer-Rao bound is the solution to the above optimization problem, we will state and prove a more general result. (The general result is actually easier to prove.)

Theorem 11.2-1 Given a fixed vector y and a positive definite symmetric matrix H, the maximum of x*y, subject to the constraint that x*Hx ≤ 1, is (y*H⁻¹y)^{1/2}.

Proof Since x*y has no unconstrained local extrema, the solution must lie on the constraint boundary; therefore, the inequality in the constraint can be replaced by an equality. This constrained optimization problem can be restated by the use of Lagrange multipliers (Luenberger, 1969) as the unconstrained maximization of

t(x,λ) = x*y + (λ/2)(1 - x*Hx)    (11.2-8)

where λ is the scalar Lagrange multiplier. The maximum is found by setting the gradients to zero as follows:

∇ₓ t(x,λ) = y - λHx = 0    (11.2-9)

∇_λ t(x,λ) = (1/2)(1 - x*Hx) = 0    (11.2-10)

From Equation (11.2-9) we have

x = λ⁻¹H⁻¹y    (11.2-11)

Substituting this into Equation (11.2-10) gives

λ⁻²y*H⁻¹HH⁻¹y = 1, so that λ = (y*H⁻¹y)^{1/2}    (11.2-12)

Substituting into Equation (11.2-11) gives

x = H⁻¹y(y*H⁻¹y)^{-1/2}    (11.2-13)

and thus

x*y = (y*H⁻¹y)^{1/2}    (11.2-14)

at the solution. This is the result sought.
The specific case of y being a unit vector along the ξᵢ axis gives the form claimed for the Cramer-Rao bound of the ξᵢ element. The general form of Theorem (11.2-1) has other applications. The value of any linear combination of the parameters can be expressed as ξ*y for some fixed y-vector. Thus the general form shows how to evaluate the accuracy of arbitrary linear combinations of parameters. This form applies to many situations where the sum, difference, or other combination of multiple parameters is of interest.
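A sketch applying Theorem (11.2-1) with H = A⁻¹ (the c = 1 ellipsoid): the bound on the linear combination y*ξ is (y*Ay)^{1/2}, and a unit vector y recovers the Cramer-Rao bound (Aᵢᵢ)^{1/2}. The covariance values are invented.

    import numpy as np

    A = np.array([[4.0, 1.5], [1.5, 1.0]])   # approximate error covariance

    def combo_bound(y, A):
        # Theorem 11.2-1 with H = A^{-1}: max of x*y over x*A^{-1}x <= 1
        return np.sqrt(y @ A @ y)

    e1 = np.array([1.0, 0.0])
    print(combo_bound(e1, A), np.sqrt(A[0, 0]))   # Cramer-Rao bound of xi_1
    print(combo_bound(np.array([1.0, -1.0]), A))  # bound on xi_1 - xi_2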
On the basis of this geometric picture, we can think of the Cramer-Rao bounds as insensitivities that are computed accounting for all parameter correlations. The computation and interpretation of the Cramer-Rao bounds are valid in any number of dimensions. In this respect, the Cramer-Rao bounds contrast with the insensitivities, which are one-dimensional tools, and the correlations, which are two-dimensional tools. The Cramer-Rao bounds are thus the best of the theoretical measures of accuracy that can be evaluated for a single experiment.
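To tie the three statistics together numerically, the following sketch evaluates the insensitivities, the correlation, and the Cramer-Rao bounds from one invented A matrix; the bounds are always at least as large as the insensitivities.

    import numpy as np

    A = np.array([[4.0, 1.9], [1.9, 1.0]])    # confidence-ellipsoid matrix
    Gamma = np.linalg.inv(A)

    insens = 1.0 / np.sqrt(np.diag(Gamma))    # conditional standard deviations
    bounds = np.sqrt(np.diag(A))              # unconditional (Cramer-Rao)
    corr = A[0, 1] / np.sqrt(A[0, 0] * A[1, 1])

    print("insensitivities: ", insens)
    print("Cramer-Rao bounds:", bounds)       # elementwise >= insensitivities
    print("correlation:     ", corr)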

11.3 OTHER MEASURES OF ACCURACY

The previous sections have discussed the Cramer-Rao bound and other accuracy statistics based on the confidence ellipsoid. Although the Cramer-Rao bound is the best single analytical measure of accuracy, overreliance on any single source of accuracy data is dangerous. Uncritical use of the Cramer-Rao bound can give extremely misleading results in realistic situations, as discussed by Maine and Iliff (1981b). This section discusses alternate accuracy measures, which can supplement the Cramer-Rao bound.

11.3.1 Bias

The bias of an estimator is occasionally cited as an indicator of accuracy. We do not consider it a useful indicator in most circumstances. This section is limited to a brief exposition of the reasons for this judgment.

Section 4.2.1 defines the bias of an estimator. Bias arises from several sources. Some estimators are intrinsically biased, regardless of the nature of the data. Random noise in the data often causes a bias. The bias from random noise sometimes goes to zero asymptotically for estimators matched to the noise characteristics. Finally, the inevitable modeling errors in analyzing real systems cause all estimators to be biased, even asymptotically. Most discussions of bias refer, implicitly or explicitly, to asymptotic bias. Even for idealized cases with no modeling error, estimators are seldom unbiased for finite time.

There are two reasons why the bias is of minimal use as a measure of accuracy. First, the bias reflects only the consistent errors; it ignores random scatter. As illustrated in Section 4.2.1, it is possible for an estimator to give ludicrous individual estimates which average out to a small or zero bias. This property is intrinsic to the definition of the bias. Second, the bias is difficult to compute in most cases. If we could compute the bias, we could subtract it from the estimates to obtain revised estimates that were unbiased. (Some estimators use this technique.) In some cases, it may be practical to compute a bound on the magnitude of the bias from a particular source, even when we cannot compute the actual bias. Although they are rarely used, such bounds can give a reasonable indication of the likely magnitude of the error from some sources. This is the most constructive use of bias information in evaluating accuracy.

In contrast, the often-repeated statements that a given estimator is or is not asymptotically unbiased are of little practical use. Most of the estimators considered in this document are asymptotically unbiased when the assumptions used in the derivation are true. The statement that other estimators are biased under the same conditions amounts to a restatement of the universal principle that estimators are biased in the presence of modeling error. Thus arguments about which of two estimators is biased are silly. These arguments reduce to the issue of what assumptions to use, an issue best addressed directly.

Although quantitative measures of bias may not be available, the analyst should always consider the issue of bias due to modeling error. Bias errors are added to all other types of error in the estimates. Unfortunately, some bias errors are impossible to detect solely by analyzing the data. The estimates can be repeatable with little scatter and appear to be accurate by all other measures, and still have large bias errors. An example of this type of problem is a calibration error in a nonredundant instrument. The only way to avoid such problems is to be meticulous in executing and documenting every step of the application, including modeling, instrumentation, and data handling. No automatic tests exist that adequately substitute for such care.
0"'

Scatter

When there are several experiments at the same condition, the scatter of the estimates is an indication of accuracy. We can also evaluate scatter about a smooth fairing of the estimates in a series of experiments with gradually changing conditions. This approach assumes that the parameters change smoothly as a function of experimental condition.

The scatter has a significant advantage over many of the theoretical measures of accuracy discussed previously. The scatter measures the actual performance that most of the theoretical measures are trying to predict. Therefore the scatter includes several effects, such as random errors in measuring the experiment conditions, that are ignored in the theoretical predictions. You can gain the most information, of course, by considering both the observed scatter and the theoretical predictions.

An inherent weakness in the use of scatter as a gauge of accuracy is that several data points are required to define it. Depending on the application, this objection can range from inconsequential to insurmountable. A related problem is that the scatter does not show the accuracy of individual points, some of which may be better than others. For instance, if only two conflicting data points are available, the scatter gives no hint as to which is more reliable.

Figure (11.3-1) shows estimates of the parameter Cnp obtained from flight data of a PA-30 aircraft. The scatter is large, showing estimates of both signs. Figure (11.3-2) shows the same data segregated into rudder and aileron maneuvers. In this case, the scatter makes it evident that the aileron maneuvers result in far more consistent estimates of Cnp than do the rudder maneuvers. Had there been only one or two aileron and one or two rudder maneuvers available, there would have been no way to deduce from the scatter that the aileron maneuvers were superior for estimating this parameter.

The scatter shares a weakness with most of the theoretical accuracy measures in that it does not account for consistent errors (i.e., biases). Many occurrences can result in small scatter about an incorrect value. The scatter, therefore, should be regarded as a lower bound. The estimates can be worse than is indicated by the scatter, but are seldom better.

Maine and Iliff (1981b) discuss well-documented situations in which the scatter is significantly larger than the Cramer-Rao bounds. In all such cases, we regard the scatter as a more realistic measure of the magnitude of the errors. The Cramer-Rao bound is still a reasonable means of determining which individual experiments are most accurate, but may not give a reasonable magnitude of the error. In spite of its problems, the data scatter is an easily used tool for evaluating accuracy, and it should always be examined when sufficient data points are available to define it.

11.3.3 Engineering Judgment

Engineering judgment is the oldest measure of estimate reliability. Even with the theoretical accuracy measures now available, the need for judgment remains; the theoretical measures are merely tools which supply more information on which to base the judgment. By definition, the process of applying engineering judgment

cannot be described precisely and quantitatively, or there would be no judgment involved. Algorithms can be devised to search for specific problems, but the engineer still needs to make a final unautomated judgment. Therefore, this section will simply list some of the factors most often considered in making a judgment.

One of the most basic factors in judging the accuracy of the estimates is the anticipated accuracy. The engineer usually has a priori knowledge of how accurately one can reasonably expect to be able to estimate the parameters. This knowledge can be based on previous experience, awareness of the relative importance and linear dependence of the parameters, and the quality of experimental data obtained.

Another basic criterion is the reasonability of the estimated parameter values. Before analysis is begun, we usually know the approximate range of values of the parameters. Drastic deviations from this range are reason to suspect the estimates unless we discover the reason for the poor prediction or we independently verify the suspect value.

We have previously mentioned the role of engineering judgment in evaluating model adequacy. The engineer must look for violations of specific assumptions made in deriving the model, and for unexplained problems that may indicate modeling errors. Both the estimator and the theoretical measures of accuracy can be invalidated by modeling errors. The magnitude of the modeling-error effects must be judged.

The engineer judges the quality of the fit of the measured and estimated time histories. The characteristics of this fit can give indications of many problems. Many modeling error problems first become apparent as poor time-history fits. Failed sensors and data processing errors or omissions are among the other classes of problems which can be deduced from the fits.

Finally, engineering judgment is used to assemble and weigh all of the available information about the estimates. You must combine the judgmental factors with information from the theoretical tools in order to give a final best estimate of the parameters and of their accuracies.

11.4 MODEL STRUCTURE DETERMINATION

In the previous sections, we have largely assumed that the assumed model form is correct. This is never strictly true in practice. Therefore, we must always consider the possible effects of modeling error as a special issue. The tools discussed in Section 11.3 can help in the evaluation of these effects.

In this section, we specifically examine the question of determining the best model structure for parameter estimation. One approach to minimizing the effects of model structure errors is to use a model structure which is close to that of the true system. There are, however, definite limits to this principle. The limitations arise both in how accurate you can make the model and in how accurate you should make it.

In the field of simulation, it is almost axiomatic that the simulation fidelity improves as more detail is added to the model. Practical considerations of cost and the degree of required fidelity dictate the level of detail included in the model. Simulation and system identification are closely related fields, and we might expect that such a basic principle would be common to both. Contrary to this expectation, system identification sometimes obtains better results from a simple than from a detailed model. The use of too detailed a model is probably one of the most common sources of difficulty in the practical application of system identification.

The problems that arise from too detailed a model are best illustrated by a simple example. Presume that Figure (11.4-1) shows experimental data from a system with a scalar input U and a scalar output Z. The line in the figure is the best linear fit to the data. This line appears to be a reasonable representation of the system. To investigate possible nonlinear effects, consider the case of polynomial models. It is obvious that the error between the model output and the experimental data will become smaller as the order of the model increases. High-order polynomials include lower-order polynomials as specific cases (we have no requirement that the high-order coefficient be nonzero), so the best second-order fit is at least as good as the best linear fit, and so forth. When the order of the polynomial becomes one less than the number of data points, the model will exactly match the experimental data (unless input values were repeated). Figure (11.4-2) shows such a perfect match of the data from Figure (11.4-1). Although the data points are matched perfectly, the curve oscillates wildly. The simple linear fit of Figure (11.4-1) is probably a much better representation of the system, even though the model of Figure (11.4-2) is more detailed. We could say that the model of Figure (11.4-2) is fitting the noise instead of the true response.

Essentially, as the model complexity increases, and more unknown parameters are estimated, the problem approaches the black-box system-identification problem where there are no assumptions about the model form. We have previously shown that the pure black-box problem is insoluble. One can deduce only a finite amount of information about the system from a finite amount of experimental data. The engineer provides, in the form of an assumed model structure, the rest of the information required to solve the system-identification problem. As the assumed model structure becomes more general, it provides less information, and thus more of the information must be deduced from the experimental data. Eventually, one reaches a point where the information available is insufficient; the estimation algorithms then perform poorly, giving ridiculous results. The Cramer-Rao bound gives a statistical basis for estimating whether the experimental data contain sufficient information to reliably estimate the parameters in a model.
The Cramer-Rao bound and related statistics can be used to determine the number and selection of terms to include in the model (Klein and Batterson, 1983; Gupta, Hall, and Trankle, 1978; and Trankle, Vincent, and Franklin, 1982). The basic principle is to include in the model only those terms that can be accurately estimated from the available experimental data. This process, known as model structure determination, is described in further detail in the cited references. We will restrict our discussion to the general nature and applicability of model structure determination.

Automatic model structure determination is often viewed as a panacea that eliminates the necessity for model selection to be based on engineering judgment and knowledge of the phenomenology of the system. Since we have repeatedly emphasized that pure black-box system identification is impossible, such claims for automatic model determination must be viewed with suspicion.

There is a basic fallacy in the argument that automatic model structure determination can replace engineering judgment in selecting a model. The model structure determination algorithms are not creative; they can only test candidate models suggested by the engineer. In fact, the model structure determination algorithms are a type of parameter estimation in disguise, in which the parameter is an index indicating which model is to be used. In a way, model structure determination is easier than most parameter estimation. At each stage, there are only two possible values for a term, zero or nonzero; whereas most parameter estimation demands that a specific value be picked from the entire real line. This task does not approach the scope of the black-box system-identification problem in which the number of possible models is a high order of infinity.

Engineering judgment is still needed, therefore, to select the types of candidate models to be tested. If the candidate models are not appropriate, the results will be questionable. The very best that could be expected from an automatic algorithm in this circumstance would be rejection of all of the candidates (and not all automatic tests have even that much capability). No automatic algorithm can suggest creative improvements that it has not been specifically programmed for.

Consider a system with an actual output of Z = sin(U). Assume that a polynomial model has been selected by the engineer, and automatic structure determination has been used to determine what order polynomial to use. The task is hopeless in this form. The data can be fit arbitrarily well with a polynomial of a high enough order, but the polynomial form does not describe the essence of the system. In particular, the finite polynomial will not be valid for extrapolating system performance outside of the range of the experimental data. In the above system, consider three ranges of U-values: |U| < 0.1, |U| < 1.0, and |U| < 10.0. In the range |U| < 0.1, the linear polynomial Z = U is a close approximation, as shown in Figure (11.4-3). The extrapolation of this approximation to the range |U| < 1.0 introduces noticeable errors, as shown in Figure (11.4-4). Over this range, the approximation Z = U - U³/6 is reasonable. If we expand our view to the range |U| < 10.0, as in Figure (11.4-5), then neither the linear nor the third-order polynomial is at all representative of the sine function. It would require at least a seventh-order polynomial to match even the gross characteristics of the sine function over this range; a good match would require a still higher order.
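The behavior over the three ranges is easy to verify numerically. A minimal sketch (ours; only the ranges and the two approximations come from the example above):

```python
import numpy as np

# Maximum error of the linear and third-order approximations of
# Z = sin(U) over the three ranges discussed in the text.
for bound in (0.1, 1.0, 10.0):
    U = np.linspace(-bound, bound, 1001)
    err_linear = np.max(np.abs(np.sin(U) - U))                # Z = U
    err_cubic = np.max(np.abs(np.sin(U) - (U - U**3 / 6.0)))  # Z = U - U^3/6
    print(f"|U| < {bound:4.1f}: linear error {err_linear:.1e}, "
          f"cubic error {err_cubic:.1e}")
```

The errors are negligible in the smallest range and grow to completely unrepresentative values over |U| < 10.0, matching Figures (11.4-3) through (11.4-5).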

Another problem with automatic model-structure determination is that it gives only a statistical estimate. Like all estimates, it is imperfect. If no better information is available, it is appropriate to use automatic model structure determination as the best guess. If, however, facts about the model structure are deducible from the physics of the system, it is silly to throw away known facts and use imperfect estimates. (This is one of the most basic principles in the entire field of system identification, not just in model structure determination: if a fact is known, use it and save the estimation theory for cases in which it is needed.)

The most basic problem with automatic model structure determination lies in the statement of the problem. The very term "model structure determination" is misleading, because there is seldom a correct model to determine. Even when there is a correct model, it may be far too complicated for practical purposes. The real model structure determination problem is not to determine some nonexistent "correct" model structure, but to determine an adequate model structure. We discussed the idea of adequate models in Section 1.4; the idea of an adequate model structure is an intimate part of the idea of an adequate model. This basic issue is addressed briefly, if at all, in most of the literature on model structure determination. Many papers generate simulated data with a specified model, and then demonstrate that a proposed model structure determination algorithm can determine the correct model. This approach has little to do with the real issue in model structure determination.

The previous paragraphs have emphasized the numerous problems of automatic model structure determination. That these problems exist does not mean that automatic model-structure determination is worthless, only that the mindless application of it is dangerous. Automatic model structure determination can be a valuable tool when used with an appreciation of its limitations. Most good model structure determination programs allow the engineer to override the statistical decision and force specific terms to be included or omitted. This approach makes good use of both the theory and the judgment, so that the theory is used as a tool to aid the judgment and to warn against some types of poor judgment, but the end responsibility lies with the engineer.

11.5 EXPERIMENT DESIGN

The previous discussion has, for the most part, assumed that a specific set of experimental data has already been gathered. In some cases, this is a valid assumption. In other cases, the opportunity is available to specify the experiments to be performed and the measurements to be taken. This section gives a brief overview of the subject of designing experiments for parameter identification. We leave detailed discussion to works cited in the references.

Methods for experiment design fall into two major categories. The first category is that of methods based on numerical optimization. Such methods choose an input, subject to appropriate constraints, which minimizes the Cramer-Rao bound or some related error estimate. Goodwin (1982) and Plaetschke and Schulz (1979) give theoretical and practical details of some optimization approaches to input design. Experiment design is often strongly constrained by practical considerations; in the extreme case, the constraints completely specify the input, leaving no latitude for design. In a design based on numerical optimization, the constraints must be expressed mathematically. The derivation of such expressions is sometimes straightforward, as when a control device is limited by a physical stop at a specific position. In other cases, the constraints involve issues such as safety that are difficult to quantify as precise limits.
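To illustrate the optimization-based viewpoint on a small example, the following sketch (ours; the first-order system, noise level, and the two candidate inputs are all hypothetical assumptions) computes the Cramer-Rao bound on the variance of an estimate of the parameter a for each candidate input:

```python
import numpy as np

def cramer_rao_bound(u, a=0.8, sigma=0.1):
    """Cramer-Rao bound for estimating a in the scalar system
    x(k+1) = a x(k) + u(k), z(k) = x(k) + noise (std dev sigma)."""
    x, s = 0.0, 0.0              # state and sensitivity dx/da
    info = 0.0                   # Fisher information about a
    for uk in u:
        info += s * s / sigma**2     # output sensitivity dz/da = dx/da
        s = x + a * s                # propagate the sensitivity
        x = a * x + uk               # propagate the state
    return 1.0 / info

N = 100
t = np.arange(N)
doublet = np.where(t < 10, 1.0, np.where(t < 20, -1.0, 0.0))  # pulse pair
slow_sine = np.sin(2.0 * np.pi * t / N)                       # one slow cycle

for name, u in (("doublet", doublet), ("slow sine", slow_sine)):
    print(f"{name:9s}: Cramer-Rao bound {cramer_rao_bound(u):.2e}")
```

Both inputs respect the same amplitude limit, yet they generally give different bounds; an optimization-based design searches over the admissible inputs for the one minimizing such a bound.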

Slight changes in the form of the constraints can change the entire character of the theoretical optimum input. Because the constraints are one of the major influences in the experiment design, adopting simplified constraint forms solely because they are easy to analyze is often inadvisable. In particular, "soft" constraints in the form of a cost penalty proportional to the square of the input are almost never accurate representations of practical constraints.

Most practical experiment design falls into the second major category, methods based more on heuristic design than on formal optimization of a cost function. Such designs draw heavily on the engineer's understanding of the system. There are several widely applicable rules of thumb to help heuristic experiment design; some of them consider issues such as frequency content, modal excitation, and independence. Plaetschke and Schulz (1979) describe some of these rules, and evaluate inputs based on them.
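As a concrete instance of such rules of thumb, the flight-test literature often uses multistep inputs such as the 3-2-1-1 sequence, one of the input shapes evaluated by Plaetschke and Schulz (1979), because its energy is spread over a band of frequencies rather than concentrated at one. A minimal sketch (ours; the sample interval and pulse width are arbitrary assumptions):

```python
import numpy as np

dt = 0.05                   # sample interval, s (assumed)
pulse = 0.5                 # width of the shortest ("1") pulse, s (assumed)
widths = (3, 2, 1, 1)       # the 3-2-1-1 multistep pattern
signs = (1, -1, 1, -1)      # alternating sign

u = np.concatenate([s * np.ones(int(w * pulse / dt))
                    for w, s in zip(widths, signs)])

# Energy spectrum: the multistep excites a broad band of frequencies,
# unlike a single-frequency sine.
spectrum = np.abs(np.fft.rfft(u, n=1024))**2
freqs = np.fft.rfftfreq(1024, d=dt)
peak = freqs[1 + np.argmax(spectrum[1:])]   # ignore the DC bin
print(f"{u.size} samples, spectral peak near {peak:.2f} Hz")
```

The broad excitation band is what makes such inputs robust to uncertainty in the modal frequencies of the system being tested.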

[Figures for Chapter 11; only the captions are recoverable here:]

Figure (11.1-1). Construction of R.
Figure (11.1-2). Construction of one-dimensional confidence ellipsoid.
Figure (11.1-3). Construction of two-dimensional confidence ellipsoid.
Figure (11.2-1). Geometric interpretation of insensitivity.
Figure (11.2-2). Correlation and sensitivity.
Figure (11.2-3). High correlation and high sensitivity.
Figure (11.2-4). Low correlation and low sensitivity.
Figure (11.2-5). Cramer-Rao bounds and insensitivities.
Figure (11.3-1). Estimates of Cn_p.
Figure (11.3-2). Estimates of Cn_p, segregated by input (rudder and aileron maneuvers).
Figure (11.4-1). Best linear fit of noise data.
Figure (11.4-2). Exact polynomial match of noise data.
Figure (11.4-3). Z = sin(U) in the range |U| < 0.1.
Figure (11.4-4). Z = sin(U) in the range |U| < 1.0.
Figure (11.4-5). Z = sin(U) in the range |U| < 10.0.
CHAPTER 12

12.0 SUMMARY

In this document, we have presented the theoretical background of statistical estimators for dynamic systems, with particular emphasis on maximum-likelihood estimators. An understanding of this theoretical background is crucial to the practical application of the estimators; the analyst needs to know the capabilities and limitations of the estimators. There are several examples of artificially complicated problems that succumb to simple approaches, and seemingly trivial questions that have no answers.

A thorough understanding of the system being analyzed is necessary to complement this theoretical background. No amount of theoretical sophistication can compensate for the lack of such understanding. The entire theory rests on the basis of the assumptions made about the system characteristics. The theory can give only limited help in validating or refuting such assumptions.

Errors and unexpected difficulties are inevitable in any substantial parameter estimation project. The eventual success of the project hinges on the analyst's ability to recognize unreasonable results and diagnose their causes. This ability, in turn, requires an understanding of both estimation theory and the system being analyzed. Problems can range from obvious instrumentation failures to subtle modeling inconsistencies and identifiability problems.

Probably the most difficult part of parameter estimation is to straddle the fine line between models too simple to adequately represent the system and models too complicated to be identifiable. There is no conservative position on this issue; excesses in either direction can be fatal. The solution is typically iterative, using diagnostic skills to detect problems and make improvements until an adequate result is obtained. The problem is exacerbated by there being no correct answer.

Neither is there a single correct method to solve parameter estimation problems. Although we have castigated some practices as demonstrably poor, we make no attempt to establish as dogma any particular method. The material of this document is intended more as a set of tools for parameter estimation problems. The selection of the best tools for a particular task is influenced by factors other than the purely theoretical. Better results often come from a crude, but adequate, method that the analyst thoroughly understands than from a sophisticated, but unfamiliar, method. We recommend the attitude expressed by Gauss (1809, p. 108):

It is always profitable to approach the more difficult problems in several ways, and not to despise the good although preferring the better.

APPENDIX A

A.0 MATRIX RESULTS

This appendix presents several matrix results used in the body of the book. The derivations are mostly exercises in simple matrix algebra. Various of these results are given in numerous other documents; Goodwin and Payne (1977, appendix E) present most of them.

A.1 MATRIX INVERSION LEMMAS

Consider a square, nonsingular matrix A, partitioned as

$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$   (A.1-1)

where $A_{11}$ and $A_{22}$ are square. Define the inverse of A to be $\Gamma$, similarly partitioned as

$\Gamma = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix}$   (A.1-2)

where $\Gamma_{11}$ is the same size as $A_{11}$. We want to express the partitions $\Gamma_{ij}$ in terms of the $A_{ij}$. To derive such expressions, we need to assume that either $A_{11}$ or $A_{22}$ is invertible; if both are singular, there is no useful form. Consider first the case where $A_{11}$ is invertible.

Lemma A.1-1  Given A and $\Gamma$ partitioned as in Equations (A.1-1) and (A.1-2), assume that A and $A_{11}$ are invertible. Then $(A_{22} - A_{21}A_{11}^{-1}A_{12})$ is invertible and the partitions of $\Gamma$ are given by

$\Gamma_{11} = A_{11}^{-1} + A_{11}^{-1}A_{12}(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}A_{21}A_{11}^{-1}$   (A.1-3)

$\Gamma_{12} = -A_{11}^{-1}A_{12}(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}$   (A.1-4)

$\Gamma_{21} = -(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}A_{21}A_{11}^{-1}$   (A.1-5)

$\Gamma_{22} = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}$   (A.1-6)
Proof  The condition $A\Gamma = I$ gives the four equations

$A_{11}\Gamma_{11} + A_{12}\Gamma_{21} = I$   (A.1-7)

$A_{11}\Gamma_{12} + A_{12}\Gamma_{22} = 0$   (A.1-8)

$A_{21}\Gamma_{11} + A_{22}\Gamma_{21} = 0$   (A.1-9)

$A_{21}\Gamma_{12} + A_{22}\Gamma_{22} = I$   (A.1-10)

and the condition $\Gamma A = I$ gives the four equations

$\Gamma_{11}A_{11} + \Gamma_{12}A_{21} = I$   (A.1-11)

$\Gamma_{11}A_{12} + \Gamma_{12}A_{22} = 0$   (A.1-12)

$\Gamma_{21}A_{11} + \Gamma_{22}A_{21} = 0$   (A.1-13)

$\Gamma_{21}A_{12} + \Gamma_{22}A_{22} = I$   (A.1-14)

Equations (A.1-8) and (A.1-13), respectively, give

$\Gamma_{12} = -A_{11}^{-1}A_{12}\Gamma_{22}$   (A.1-15)

$\Gamma_{21} = -\Gamma_{22}A_{21}A_{11}^{-1}$   (A.1-16)

Substitute Equation (A.1-15) into Equation (A.1-10) to get

$(A_{22} - A_{21}A_{11}^{-1}A_{12})\Gamma_{22} = I$   (A.1-17)

and substitute Equation (A.1-16) into Equation (A.1-14) to get

$\Gamma_{22}(A_{22} - A_{21}A_{11}^{-1}A_{12}) = I$   (A.1-18)

By the assumption of invertibility of A, the $\Gamma_{ij}$ exist and satisfy Equations (A.1-7) to (A.1-14). The assumption of invertibility of $A_{11}$ assures, through the above substitutions, that $\Gamma_{22}$ satisfies Equations (A.1-17) and (A.1-18). Therefore $(A_{22} - A_{21}A_{11}^{-1}A_{12})$ is invertible and $\Gamma_{22}$ is given by Equation (A.1-6). Substituting Equation (A.1-6) into Equations (A.1-15) and (A.1-16) gives Equations (A.1-4) and (A.1-5). Finally, substituting Equation (A.1-5) into Equation (A.1-7) and solving for $\Gamma_{11}$ gives Equation (A.1-3), completing the proof.
The case where $A_{22}$ is nonsingular is simply a permutation of the same lemma.

Lemma A.1-2  Given A and $\Gamma$ partitioned as in Equations (A.1-1) and (A.1-2), assume that A and $A_{22}$ are invertible. Then $(A_{11} - A_{12}A_{22}^{-1}A_{21})$ is invertible and the partitions of $\Gamma$ are given by

$\Gamma_{11} = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}$   (A.1-19)

$\Gamma_{12} = -(A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}A_{12}A_{22}^{-1}$   (A.1-20)

$\Gamma_{21} = -A_{22}^{-1}A_{21}(A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}$   (A.1-21)

$\Gamma_{22} = A_{22}^{-1} + A_{22}^{-1}A_{21}(A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}A_{12}A_{22}^{-1}$   (A.1-22)

Proof  Define a reordered matrix

$A' = \begin{bmatrix} A_{22} & A_{21} \\ A_{12} & A_{11} \end{bmatrix}$

The inverse of $A'$ is given by the corresponding reordering of $\Gamma$:

$\Gamma' = \begin{bmatrix} \Gamma_{22} & \Gamma_{21} \\ \Gamma_{12} & \Gamma_{11} \end{bmatrix}$

Then apply the previous lemma to $A'$ and $\Gamma'$.

When both $A_{11}$ and $A_{22}$ are invertible, we can combine the above lemmas to obtain two other useful results.
Lemma A.1-3  Assume that two matrices A and C are invertible. Further assume that one of the expressions $(A - BC^{-1}D)$ or $(C - DA^{-1}B)$ is invertible. Then the other expression is also invertible and

$(A - BC^{-1}D)^{-1} = A^{-1} + A^{-1}B(C - DA^{-1}B)^{-1}DA^{-1}$   (A.1-23)

Proof  Define $A_{11} = A$, $A_{12} = B$, $A_{21} = D$, and $A_{22} = C$. In order to apply Lemmas (A.1-1) and (A.1-2), we first need to show that A as defined by Equation (A.1-1) is invertible. If $(C - DA^{-1}B)$ is invertible, then the $\Gamma_{ij}$ defined by Equations (A.1-3) to (A.1-6) satisfy Equations (A.1-7) to (A.1-14). Therefore A is invertible. Lemma (A.1-2) then gives the invertibility of $(A - BC^{-1}D)$, which is one of the desired results.

Conversely, if we assume that $(A - BC^{-1}D)$ is invertible, then the $\Gamma_{ij}$ defined by Equations (A.1-19) to (A.1-22) satisfy Equations (A.1-7) to (A.1-14). Therefore A is invertible and Lemma (A.1-1) gives the invertibility of the expression $(C - DA^{-1}B)$.

Thus the invertibility of either expression implies invertibility of the other and of A. We can now apply both Lemmas (A.1-1) and (A.1-2). Equating the expressions for $\Gamma_{11}$ given by Equations (A.1-3) and (A.1-19), and putting the result in terms of A, B, C, and D, gives Equation (A.1-23), completing the proof.

Lemma A.1-4  Given A, B, C, and D as in Lemma (A.1-3), with the same invertibility assumptions, then

$A^{-1}B(C - DA^{-1}B)^{-1} = (A - BC^{-1}D)^{-1}BC^{-1}$   (A.1-24)

Proof  The proof is identical to that of Lemma (A.1-3), except that we equate the expressions for $\Gamma_{12}$ given by Equations (A.1-4) and (A.1-20), giving Equation (A.1-24) as a result.
A.2 MATRIX DIFFERENTIATION

For several of the following results, it is convenient to define the derivative of a scalar with respect to a matrix. If f is a scalar function of the matrix A, we define df/dA to be a matrix with elements equal to the derivatives of f with respect to corresponding elements of A:

$\left(\frac{df}{dA}\right)_{(i,j)} = \frac{\partial f}{\partial A_{(i,j)}}$   (A.2-1)

Two simple relations involving the trace function are useful in manipulating the matrix and vector quantities we work with.

Result A.2-1  If x and y are two vectors of the same length, then

$x^*y = \mathrm{tr}(yx^*)$   (A.2-2)

Proof  Both sides expand to $\sum_i x_{(i)}y_{(i)}$.

Result A.2-2  If A and B are two matrices of the same size, then

$\mathrm{tr}(A^*B) = \sum_{i,j} A_{(i,j)}B_{(i,j)}$   (A.2-3)

Proof  Expand the right side, element by element.
Both of these results are special cases of the same relationship between inner products and outer products. The following result is a particular application of Result (A.2-2).

Result A.2-3  If f(A) is a scalar function of the matrix A, and A is a function of the scalar x, then

$\frac{df}{dx} = \mathrm{tr}\left[\left(\frac{df}{dA}\right)^* \frac{dA}{dx}\right]$   (A.2-4)

Proof  Use the chain rule with the individual elements of A to write

$\frac{df}{dx} = \sum_{i,j} \frac{\partial f}{\partial A_{(i,j)}} \frac{dA_{(i,j)}}{dx}$   (A.2-5)

Equation (A.2-4) then follows from Result (A.2-2) and the definition given by Equation (A.2-1).

Result A.2-4  If the matrix A is a function of x, then

$\frac{d}{dx}A^{-1} = -A^{-1}\frac{dA}{dx}A^{-1}$   (A.2-6)

wherever A is invertible.

Proof  By the definition of the inverse

$AA^{-1} = I$   (A.2-7)

Take the derivative, using the chain rule:

$\frac{dA}{dx}A^{-1} + A\frac{d}{dx}A^{-1} = 0$   (A.2-8)

Solving for $\frac{d}{dx}A^{-1}$ gives Equation (A.2-6), as desired.

Result A.2-5  If A is invertible, and x and y are vectors, then

$\frac{d}{dA}(x^*A^{-1}y) = -(A^{-1}yx^*A^{-1})^*$   (A.2-10)

Proof  Use Result (A.2-4) to get

$\frac{\partial}{\partial A_{(i,j)}}(x^*A^{-1}y) = -x^*A^{-1}\frac{\partial A}{\partial A_{(i,j)}}A^{-1}y$   (A.2-11)

Now

$\frac{\partial A}{\partial A_{(i,j)}} = e_i e_j^*$   (A.2-12)

where $e_i$ is a vector with zeros in all but the ith element, which is 1. Therefore

$\frac{\partial}{\partial A_{(i,j)}}(x^*A^{-1}y) = -(x^*A^{-1})_{(i)}(A^{-1}y)_{(j)}$   (A.2-13)

which is the (i,j) element of $-(A^{-1}yx^*A^{-1})^*$. The definition of the matrix derivative then gives Equation (A.2-10), as desired.

Result A.2-6  If A is invertible, then

$\frac{d}{dA}\ln|A| = (A^{-1})^*$   (A.2-14)

Proof  Expanding the determinant by cofactors of the ith row gives

$|A| = \sum_j A_{(i,j)}(\mathrm{adj}\,A)_{(j,i)}$

Taking the derivative with respect to $A_{(i,j)}$ gives

$\frac{\partial |A|}{\partial A_{(i,j)}} = (\mathrm{adj}\,A)_{(j,i)}$   (A.2-15)

because $(\mathrm{adj}\,A)_{(j,i)}$ does not depend on $A_{(i,j)}$. Using Equation (A.2-15) and the expression for a matrix inverse in terms of the matrix of cofactors, we get

$\frac{\partial}{\partial A_{(i,j)}}\ln|A| = \frac{1}{|A|}(\mathrm{adj}\,A)_{(j,i)} = (A^{-1})_{(j,i)}$

Equation (A.2-14) then follows, as desired, from the definition of the derivative with respect to a matrix.
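The differentiation results can be checked the same way, by comparing the analytic derivatives against central finite differences. A minimal sketch (ours; * is read as transpose since the test matrices are real, and the test matrix and step size are arbitrary assumptions) checking Results (A.2-5) and (A.2-6):

```python
import numpy as np

rng = np.random.default_rng(2)
n, h = 4, 1e-6
M0 = rng.standard_normal((n, n))
A = M0 @ M0.T + n * np.eye(n)     # positive definite, so |A| > 0
x, y = rng.standard_normal(n), rng.standard_normal(n)

def matrix_gradient(f, A):
    """Finite-difference derivative of the scalar f with respect to A."""
    G = np.zeros_like(A)
    for i in range(n):
        for j in range(n):
            E = np.zeros_like(A)
            E[i, j] = h
            G[i, j] = (f(A + E) - f(A - E)) / (2.0 * h)
    return G

Ai = np.linalg.inv(A)

# Result (A.2-5): d/dA (x* A^-1 y) = -(A^-1 y x* A^-1)*
G = matrix_gradient(lambda M: x @ np.linalg.inv(M) @ y, A)
print("Result A.2-5 error:", np.max(np.abs(G + (Ai @ np.outer(y, x) @ Ai).T)))

# Result (A.2-6): d/dA ln|A| = (A^-1)*
G = matrix_gradient(lambda M: np.log(np.linalg.det(M)), A)
print("Result A.2-6 error:", np.max(np.abs(G - Ai.T)))
```

Both errors are at the level of the finite-difference truncation, confirming the algebra above.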

REFERENCES

Acton, Forman S.: Numerical Methods That Work. Harper & Row, New York, 1970.
Akaike, Hirotugu: A New Look at Statistical Model Identification. IEEE Trans. Automat. Contr., Vol. AC-19, No. 6, pp. 716-723, 1974.
Aoki, Masanao: Optimization of Stochastic Systems. Academic Press, New York, 1967.
Apostol, Tom M.: Calculus: Volume II. Xerox College Publishing, Waltham, Mass., 2nd ed., 1969.
Ash, Robert B.: Basic Probability Theory. John Wiley & Sons, Inc., New York, 1970.
Astrom, Karl J.: Introduction to Stochastic Control Theory. Academic Press, New York, 1970.
Astrom, Karl J. and Eykhoff, P.: System Identification - A Survey. Automatica, Vol. 7, pp. 123-162, 1971.
Bach, R. E. and Wingrove, R. C.: Applications of State Estimation in Aircraft Flight Data Analysis. AIAA Paper 83-2087, 1983.
Balakrishnan, A. V.: Stochastic Differential Systems I. Filtering and Control - A Function Space Approach. Lecture Notes in Economics and Mathematical Systems, 84, M. Beckman, G. Goos, and H. P. Kunzi, eds., Springer-Verlag, Berlin, 1973.
Balakrishnan, A. V.: Stochastic Filtering and Control. Optimization Software, Inc., Los Angeles, 1981.
Balakrishnan, A. V.: Kalman Filtering Theory. Optimization Software, Inc., New York, 1984.
Barnard, G. A.: Thomas Bayes' Essay Toward Solving a Problem in the Doctrine of Chances. Biometrika, Vol. 45, 1958.
Bayes, Thomas: An Introduction to the Doctrine of Fluxions, and a Defence of the Mathematicians Against the Objections of the Author of the Analyst. John Noon, 1736. (See Barnard, 1958.)
Bierman, G. J.: Factorization Methods for Discrete Sequential Estimation. Mathematics in Science and Engineering, Vol. 128, Academic Press, New York, 1977.
Brauer, Fred and Nohel, John A.: Qualitative Theory of Ordinary Differential Equations. W. A. Benjamin, New York, 1969.
Cox, A. B. and Bryson, A. E.: Identification by a Combined Smoothing Nonlinear Programming Algorithm. Automatica, Vol. 16, pp. 689-694, 1980.
Cramer, Harald: Mathematical Methods of Statistics. Princeton University Press, Princeton, N.J., 1946.
Dixon, L. C. W.: Nonlinear Optimization. Crane, Russak & Co., New York, 1972.
Doetsch, K. H.: The Time Vector Method for Stability Investigations. A.R.C. R. & M. 2945, 1953.
Dongarra, J. J.; Moler, C. B.; Bunch, J. R.; and Stewart, G. W.: LINPACK User's Guide. SIAM, Philadelphia, 1979.
Etkin, B.: Dynamics of Atmospheric Flight. John Wiley & Sons, Inc., New York, 1972.
Eykhoff, P.: System Identification, Parameter and State Estimation. John Wiley & Sons, London, 1974.
Ferguson, Thomas S.: Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York, 1967.
Fisher, R. A.: On the Mathematical Foundations of Theoretical Statistics. Phil. Trans. Roy. Soc. London, Vol. 222, pp. 309-368, 1921.
Fiske, P. H. and Price, C. F.: A New Approach to Model Structure Identification. AIAA Paper 77-1171, 1977.
Flack, Nelson D.: AFFTC Stability and Control Techniques. AFFTC-TN-59-21, Edwards, California, 1959.
Foster, G. W.: The Identification of Aircraft Stability and Control Parameters in Turbulence. RAE TR 83025, 1983.
Garbow, B. S.; Boyle, J. M.; Dongarra, J. J.; and Moler, C. B.: Matrix Eigensystem Routines - EISPACK Guide Extension. Springer-Verlag, Berlin, 1977.
Gauss, Karl Friedrich: Theory of the Motion of the Heavenly Bodies Moving About the Sun in Conic Sections. Translated by Charles Henry Davis, Dover Publications, Inc., New York, 1963. Translated from: Theoria Motus, 1809.
Geyser, Lucille C. and Lehtinen, Bruce: Digital Program for Solving the Linear Stochastic Optimal Control and Estimation Problem. NASA TN D-7820, 1975.
Goodwin, Graham C.: An Overview of the System Identification Problem - Experiment Design. Sixth IFAC Symposium on Identification and System Parameter Estimation, Washington, D.C., 1982.
Goodwin, Graham C. and Payne, Robert L.: Dynamic System Identification: Experiment Design and Data Analysis. Academic Press, New York, 1977.
Greenberg, H.: A Survey of Methods for Determining Stability Parameters of an Airplane from Dynamic Flight Measurements. NACA TN-2340, 1951.
Gupta, N. K.; Hall, W. E.; and Trankle, T. L.: Advanced Methods of Model Structure Determination from Test Data. AIAA J. Guidance and Control, Vol. 1, No. 3, 1978.
Gupta, N. K. and Mehra, R. K.: Computational Aspects of Maximum Likelihood Estimation and Reduction in Sensitivity Function Calculations. IEEE Trans. Automat. Contr., Vol. AC-19, No. 6, pp. 774-783, 1974.
Hajdasinski, A. K.; Eykhoff, P.; Damen, A. A. H.; and van den Boom, A. J. W.: The Choice and Use of Different Model Sets for System Identification. Sixth IFAC Symposium on Identification and System Parameter Estimation, Washington, D.C., 1982.
Hodge, Ward F. and Bryant, Wayne H.: Monte Carlo Analysis of Inaccuracies in Estimated Aircraft Parameters Caused by Unmodeled Flight Instrumentation Errors. NASA TN D-7712, 1975.
Jategaonkar, R. and Plaetschke, E.: Maximum Likelihood Parameter Estimation from Flight Test Data for General Nonlinear Systems. DFVLR-FB 83-14, 1983.
Jazwinski, Andrew H.: Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
Kailath, T. and Ljung, L.: Asymptotic Behavior of Constant-Coefficient Riccati Differential Equations. IEEE Trans. Automat. Contr., Vol. AC-21, pp. 385-388, 1976.
Kalman, R. E. and Bucy, R. S.: New Results in Linear Filtering and Prediction Theory. Trans. ASME, Series D, Journal of Basic Engineering, Vol. 83, pp. 95-108, 1961.
Klein, Vladislav: On the Adequate Model for Aircraft Parameter Estimation. CIT, Cranfield Report Aero No. 28, 1975.
Klein, Vladislav and Batterson, James G.: Determination of Airplane Model Structure from Flight Data Using Splines and Stepwise Regression. NASA TP-2126, 1983.
Kushner, Harold: Introduction to Stochastic Control. Holt, Rinehart and Winston, Inc., New York, 1971.
Levan, N.: Systems and Signals. Optimization Software, Inc., New York, 1983.
Liptser, R. S. and Shiryayev, A. N.: Statistics of Random Processes I: General Theory. Springer-Verlag, New York, 1977.
Luenberger, David G.: Optimization by Vector Space Methods. John Wiley & Sons, New York, 1969.
Luenberger, David G.: Introduction to Linear and Nonlinear Programming. Addison-Wesley, Reading, Mass., 1973.
Maine, Richard E.: Programmer's Manual for MMLE3, a General FORTRAN Program for Maximum Likelihood Parameter Estimation. NASA TP-1690, 1981.
Maine, Richard E. and Iliff, Kenneth W.: User's Manual for MMLE3, a General FORTRAN Program for Maximum Likelihood Parameter Estimation. NASA TP-1563, 1980.
Maine, Richard E. and Iliff, Kenneth W.: Formulation and Implementation of a Practical Algorithm for Parameter Estimation with Process and Measurement Noise. SIAM J. Appl. Math., Vol. 41, pp. 558-579, 1981(a).
Maine, Richard E. and Iliff, Kenneth W.: The Theory and Practice of Estimating the Accuracy of Dynamic Flight-Determined Coefficients. NASA RP-1077, 1981(b).
Meditch, J. S.: Stochastic Optimal Linear Estimation and Control. McGraw-Hill Book Co., New York, 1969.
Mehra, Raman K. and Lainiotis, Dimitri G. (eds.): System Identification: Advances and Case Studies. Academic Press, New York, 1976.
Moler, C. B. and Stewart, G. W.: An Algorithm for Generalized Matrix Eigenvalue Problems. SIAM J. of Numerical Analysis, Vol. 10, pp. 241-256, 1973.
Moler, Cleve and Van Loan, Charles: Nineteen Dubious Ways to Compute the Exponential of a Matrix. SIAM Review, Vol. 20, No. 4, pp. 801-836, 1978.
Nering, Evar D.: Linear Algebra and Matrix Theory. John Wiley & Sons, Inc., New York, 2nd ed., 1970.
Paige, Lowell J.; Swift, J. Dean; and Slobko, Thomas A.: Elements of Linear Algebra. Xerox College Publishing, Lexington, Mass., 2nd ed., 1974.
Papoulis, Athanasios: Probability, Random Variables, and Stochastic Processes. McGraw-Hill Book Co., New York, 1965.
Penrose, R.: A Generalized Inverse for Matrices. Proc. Cambridge Phil. Soc., Vol. 51, pp. 406-413, 1955.
Pitman, E. J. G.: Some Basic Theory for Statistical Inference. Chapman and Hall, London, 1979.
Plaetschke, E. and Schulz, G.: Practical Input Signal Design. AGARD Lecture Series No. 104, 1979.
Polak, E.: Computational Methods in Optimization: A Unified Approach. Academic Press, New York, 1971.
Potter, James E.: Matrix Quadratic Solutions. SIAM J. Appl. Math., Vol. 14, pp. 496-501, 1966.
Rampy, John M. and Berry, Donald T.: Determination of Stability Derivatives from Flight Test Data by Means of High Speed Repetitive Operation Analog Matching. FTC-TDR-64-8, Edwards, Calif., 1964.
Rao, S. S.: Optimization, Theory and Applications. Wiley Eastern Limited, New Delhi, 1979.
Royden, H. L.: Real Analysis. The Macmillan Co., London, 1968.
Rudin, Walter: Real and Complex Analysis. McGraw-Hill Book Co., New York, 1974.
Schweppe, Fred C.: Uncertain Dynamic Systems. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1973.
Sorensen, John A.: Analysis of Instrumentation Error Effects on the Identification Accuracy of Aircraft Parameters. NASA CR-112121, 1972.
Sorenson, Harold W.: Parameter Estimation: Principles and Problems. Marcel Dekker, Inc., New York, 1980.
Strang, Gilbert: Linear Algebra and Its Applications. Academic Press, New York, 1980.
Trankle, T. L.; Vincent, J. H.; and Franklin, S. N.: System Identification of Nonlinear Aerodynamic Models. AGARDograph, The Techniques and Technology of Nonlinear Filtering and Kalman Filtering, 1982.
Vaughan, David R.: A Nonrecursive Algebraic Solution for the Discrete Riccati Equation. IEEE Trans. Automat. Contr., Vol. AC-15, pp. 597-599, 1970.
Wiberg, Donald M.: State Space and Linear Systems. McGraw-Hill Book Co., New York, 1971.
Wilkinson, J. H.: The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, 1965.
Wilkinson, J. H. and Reinsch, C.: Handbook for Automatic Computation. Volume II, Linear Algebra, Part 2. Springer-Verlag, New York, 1971.
Wolowicz, Chester H.: Considerations in the Determination of Stability and Control Derivatives and Dynamic Characteristics from Flight Data. AGARD Rep. 549-Part 1, 1966.
Wolowicz, Chester H. and Holleman, Euclid C.: Stability-Derivative Determination from Flight Data. AGARD Report 224, 1958.
Zacks, Shelemyahu: The Theory of Statistical Inference. John Wiley & Sons, New York, 1971.
Zadeh, Lotfi A. and Desoer, Charles A.: Linear System Theory. McGraw-Hill Book Co., New York, 1963.
