Numerical Methods and Optimization
A Consumer Guide
Éric Walter
Laboratoire des Signaux et Systèmes
CNRS-SUPÉLEC-Université Paris-Sud
Gif-sur-Yvette
France
ISBN 978-3-319-07670-6
ISBN 978-3-319-07671-3 (eBook)
DOI 10.1007/978-3-319-07671-3
Springer Cham Heidelberg New York Dordrecht London
To my grandchildren
Contents

3.7  Iterative Methods
     3.7.1  Classical Iterative Methods
     3.7.2  Krylov Subspace Iteration
3.8  Taking Advantage of the Structure of A
     3.8.1  A Is Symmetric Positive Definite
     3.8.2  A Is Toeplitz
     3.8.3  A Is Vandermonde
     3.8.4  A Is Sparse
3.9  Complexity Issues
     3.9.1  Counting Flops
     3.9.2  Getting the Job Done Quickly
3.10 MATLAB Examples
     3.10.1  A Is Dense
     3.10.2  A Is Dense and Symmetric Positive Definite
     3.10.3  A Is Sparse
     3.10.4  A Is Sparse and Symmetric Positive Definite
3.11 In Summary
References

5.3  Univariate Case
     5.3.1  Polynomial Interpolation
     5.3.2  Interpolation by Cubic Splines
     5.3.3  Rational Interpolation
     5.3.4  Richardson's Extrapolation
5.4  Multivariate Case
     5.4.1  Polynomial Interpolation
     5.4.2  Spline Interpolation
     5.4.3  Kriging
5.5  MATLAB Examples
5.6  In Summary
References

8  Introduction to Optimization
   8.1  A Word of Caution
   8.2  Examples
   8.3  Taxonomy
   8.4  How About a Free Lunch?
        8.4.1  There Is No Such Thing
        8.4.2  You May Still Get a Pretty Inexpensive Meal
   8.5  In Summary
   References

11 Combinatorial Optimization
   11.1  Introduction
   11.2  Simulated Annealing
   11.3  MATLAB Example
   References

15 WEB Resources to Go Further
   15.1  Search Engines
   15.2  Encyclopedias
   15.3  Repositories
   15.4  Software
         15.4.1  High-Level Interpreted Languages
         15.4.2  Libraries for Compiled Languages
         15.4.3  Other Resources for Scientific Computing
   15.5  OpenCourseWare
   References

16 Problems
   16.1  Ranking Web Pages
   16.2  Designing a Cooking Recipe
   16.3  Landing on the Moon
   16.4  Characterizing Toxic Emissions by Paints
   16.5  Maximizing the Income of a Scraggy Smuggler
   16.6  Modeling the Growth of Trees
         16.6.1  Bypassing ODE Integration
         16.6.2  Using ODE Integration
   16.7  Detecting Defects in Hardwood Logs
   16.8  Modeling Black-Box Nonlinear Systems
         16.8.1  Modeling a Static System by Combining Basis Functions
         16.8.2  LOLIMOT for Static Systems
         16.8.3  LOLIMOT for Dynamical Systems
   16.9  Designing a Predictive Controller with l2 and l1 Norms
         16.9.1  Estimating the Model Parameters
         16.9.2  Computing the Input Sequence
         16.9.3  From an l2 Norm to an l1 Norm
   16.10 Discovering and Using Recursive Least Squares
         16.10.1  Batch Linear Least Squares
         16.10.2  Recursive Linear Least Squares
         16.10.3  Process Control
   16.11 Building a Lotka-Volterra Model
   16.12 Modeling Signals by Prony's Method
   16.13 Maximizing Performance
         16.13.1  Modeling Performance
         16.13.2  Tuning the Design Factors
   16.14 Modeling AIDS Infection
         16.14.1  Model Analysis and Simulation
         16.14.2  Parameter Estimation
   16.15 Looking for Causes
   16.16 Maximizing Chemical Production

Index
Chapter 1
From Calculus to Computation
High-school education has led us to view problem solving in physics and chemistry
as the process of elaborating explicit closed-form solutions in terms of unknown
parameters, and then using these solutions in numerical applications for specific
numerical values of these parameters. As a result, we were only able to consider a
very limited set of problems that were simple enough for us to find such closed-form
solutions.
Unfortunately, most real-life problems in pure and applied sciences are not
amenable to such an explicit mathematical solution. One must then often move from
formal calculus to numerical computation. This is particularly obvious in engineering, where computer-aided design based on numerical simulations is the rule.
This book is about numerical computation, and says next to nothing about formal
computation as made possible by computer algebra, although they usefully complement one another. Using floating-point approximations of real numbers means that
approximate operations are carried out on approximate numbers. To protect oneself
against potential numerical disasters, one should then select methods that keep final
errors as small as possible. It turns out that many of the methods learnt in high school
or college to solve elementary mathematical problems are ill suited to floating-point
computation and should be replaced.
Shifting paradigm from calculus to computation, we will attempt to
• discover how to escape the dictatorship of those particular cases that are simple enough to receive a closed-form solution, and thus gain the ability to solve complex, real-life problems,
• understand the principles behind recognized methods used in state-of-the-art numerical software,
• stress the advantages and limitations of these methods, thus gaining the ability to choose what pre-existing bricks to assemble for solving a given problem.
Presentation is at an introductory level, nowhere near the level of detail required
for implementing methods efficiently. Our main aim is to help the reader become
a better consumer of numerical methods, with some ability to choose among those
available for a given task, some understanding of what they can and cannot do, and
some power to perform a critical appraisal of the validity of their results.
By the way, the desire to write down every line of the code one plans to use should
be resisted. So much time and effort have been spent polishing code that implements
standard numerical methods that the probability one might do better seems remote
at best. Coding should be limited to what cannot be avoided or can be expected to
improve on the state of the art in easily available software (a tall order). One will
thus save time to think about the big picture:
• what is the actual problem that I want to solve? (As Richard Hamming puts it [1]: Computing is, or at least should be, intimately bound up with both the source of the problem and the use that is going to be made of the answers; it is not a step to be taken in isolation.)
• how can I put this problem in mathematical form without betraying its meaning?
• how should I split the resulting mathematical problem into well-defined and numerically achievable subtasks?
• what are the advantages and limitations of the numerical methods readily available for these subtasks?
• should I choose among these methods or find an alternative route?
• what is the most efficient use of my resources (time, computers, libraries of routines, etc.)?
• how can I check the quality of my results?
• what measures should I take, if it turns out that my choices have failed to yield a satisfactory solution to the initial problem?
A deservedly popular series of books on numerical algorithms [2] includes Numerical Recipes in their titles. Carrying on with this culinary metaphor, one should get a much more sophisticated dinner by choosing and assembling proper dishes from the menu of easily available scientific routines than by making up the equivalent of a turkey sandwich with mayo in one's numerical kitchen. To take another analogy, electrical engineers tend to avoid building systems from elementary transistors, capacitors, resistors, and inductors when they can take advantage of carefully designed, readily available integrated circuits.
Deciding not to code algorithms for which professional-grade routines are available does not mean we have to treat them as magical black boxes, so the basic
principles behind the main methods for solving a given class of problems will be
explained.
The level of mathematical proficiency required to read what follows is a basic
understanding of linear algebra as taught in introductory college courses. It is hoped
that those who hate mathematics will find here reasons to reconsider their position
in view of how useful it turns out to be for the solution of real-life problems, and that
those who love it will forgive me for daring simplifications and discover fascinating,
practical aspects of mathematics in action.
The main ingredients will be classical Cuisine Bourgeoise, with a few words about
recipes best avoided, and a dash of Nouvelle Cuisine.
Assume, for instance, that the two real roots $x_1$ and $x_2$ of the second-order polynomial equation

$$a x^2 + b x + c = 0 \qquad (1.1)$$

are to be evaluated, with a, b, and c known floating-point numbers such that $x_1$ and $x_2$ are real numbers. We have learnt in high school that

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \quad \text{and} \quad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}. \qquad (1.2)$$

When $b^2 \gg |4ac|$, however, $\sqrt{b^2 - 4ac}$ is close to $|b|$, and one of these two expressions then requires subtracting two nearly equal floating-point numbers. The resulting cancellation can be numerically disastrous, and should be avoided. To this end, one may use the following algorithm, which takes benefit from the fact that $x_1 x_2 = c/a$:

$$q = -\frac{b + \mathrm{sign}(b)\sqrt{b^2 - 4ac}}{2}, \qquad (1.3)$$

$$x_1 = \frac{q}{a}, \quad x_2 = \frac{c}{q}. \qquad (1.4)$$
Although these two algorithms are mathematically equivalent, the second one is
much more robust to errors induced by floating-point operations than the first (see
Sect. 14.7 for a numerical comparison). This does not, however, solve the problem
that appears when x1 and x2 tend toward one another, as b² − 4ac then tends to zero.
We will encounter many similar situations, where naive algorithms need to be
replaced by more robust or less costly variants.
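The difference between the two algorithms is easy to exhibit. The following Python sketch (an illustration in the spirit of the numerical comparison deferred to Sect. 14.7, not the book's own experiment; the values a = 1, b = 1e8, c = 1 are made up) evaluates both on a problem whose roots are close to −10⁻⁸ and −10⁸:

```python
import math

def roots_naive(a, b, c):
    # High-school formula (1.2): subtracts nearly equal numbers when b^2 >> |4ac|
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_robust(a, b, c):
    # Algorithm (1.3)-(1.4): avoids the cancellation, then uses x1 * x2 = c / a
    d = math.sqrt(b * b - 4 * a * c)
    q = -(b + math.copysign(d, b)) / 2
    return q / a, c / q

a, b, c = 1.0, 1e8, 1.0          # true roots: close to -1e-8 and -1e8
x1n, _ = roots_naive(a, b, c)    # small root from (1.2): grossly inaccurate
x1r, x2r = roots_robust(a, b, c) # x2r is the small root, now accurate
print(x1n, x2r)
```

With these values, (1.2) loses most of the significant digits of the small root to cancellation, while (1.3)-(1.4) recover it to essentially full precision.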
1.1.3 Unavailable
Quite frequently, there is no mathematical method for finding the exact solution of
the problem of interest. This will be the case, for instance, for most simulation or
optimization problems, as well as for most systems of nonlinear equations.
This classification is not tight. It may be a good idea to transform a given problem into another one. Here are a few examples:
• to find the roots of a polynomial equation, one may look for the eigenvalues of a matrix, as in Example 4.3,
• to evaluate a definite integral, one may solve an ordinary differential equation, as in Sect. 6.2.4,
• to solve a system of equations, one may minimize a norm of the deviation between the left- and right-hand sides, as in Example 9.8,
• to solve an unconstrained optimization problem, one may introduce new variables and impose constraints, as in Example 10.7.
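The first of these transformations can be sketched in a few lines. The NumPy example below (illustrative, not from the book; the companion-matrix construction is the standard one, also used internally by numpy.roots) finds the roots of a made-up cubic as eigenvalues:

```python
import numpy as np

# Roots of x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3), via the companion matrix,
# whose eigenvalues are the roots of the monic polynomial.
coeffs = [1.0, -6.0, 11.0, -6.0]     # monic coefficients, highest degree first
n = len(coeffs) - 1
C = np.zeros((n, n))
C[1:, :-1] = np.eye(n - 1)           # subdiagonal of ones
C[:, -1] = -np.array(coeffs[:0:-1])  # last column: -a0, -a1, -a2
roots = np.sort(np.linalg.eigvals(C).real)
print(roots)                         # close to [1, 2, 3]
```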
Most of the numerical methods selected for presentation are important ingredients
in professional-grade numerical code. Exceptions are
methods based on ideas that easily come to mind but are actually so bad that they
need to be denounced, as in Example 1.1,
prototype methods that may help one understand more sophisticated approaches,
as when one-dimensional problems are considered before the multivariate case,
promising methods mostly available at present from academic research institutions, such as methods for guaranteed optimization and simulation.
MATLAB is used to demonstrate, through simple yet not necessarily trivial examples typeset in typewriter font, how easily classical methods can be put to work. It
would be hazardous, however, to draw conclusions on the merits of these methods on
the sole basis of these particular examples. The reader is invited to consult the MATLAB documentation for more details about the functions available and their optional
arguments. Additional information, including illuminating examples, can be found
in [3], with ancillary material available on the WEB, and [4]. Although MATLAB is
the only programming language used in this book, it is not appropriate for solving all
numerical problems in all contexts. A number of potentially interesting alternatives
will be mentioned in Chap. 15.
This book concludes with a chapter about WEB resources that can be used to
go further and a collection of problems. Most of these problems build on material
pertaining to several chapters and could easily be translated into computer-lab work.
This book was typeset with TeXmacs before exportation to LaTeX. Many thanks to Joris van der Hoeven and his coworkers for this awesome and truly WYSIWYG piece of software, freely downloadable at http://www.texmacs.org/.
References
1. Hamming, R.: Numerical Methods for Scientists and Engineers. Dover, New York (1986)
2. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes. Cambridge University
Press, Cambridge (1986)
3. Moler, C.: Numerical Computing with MATLAB, revised, reprinted edn. SIAM, Philadelphia
(2008)
4. Ascher, U., Greif, C.: A First Course in Numerical Methods. SIAM, Philadelphia (2011)
Chapter 2
Notation and Norms
2.1 Introduction
This chapter recalls the usual convention for distinguishing scalars, vectors, and
matrices. Vetter's notation for matrix derivatives is then explained, as well as the meaning of the expressions "little o" and "big O" employed for comparing the local
or asymptotic behaviors of functions. The most important vector and matrix norms
are finally described. Norms find a first application in the definition of types of
convergence speeds for iterative algorithms.
If C = AB, then the entry in row i and column j of C is

$$c_{i,j} = \sum_{k} a_{i,k} \, b_{k,j}, \qquad (2.2)$$

and the number of columns in A must be equal to the number of rows in B. Recall that the product of matrices (or vectors) is not commutative, in general. Thus, for instance, when v and w are column vectors with the same dimension, vᵀw is a scalar whereas wvᵀ is a (rank-one) square matrix.
Useful relations are

$$(\mathbf{A}\mathbf{B})^{\mathrm{T}} = \mathbf{B}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}}, \qquad (2.3)$$

and, provided that A and B are invertible,

$$(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}. \qquad (2.4)$$
If M is square and symmetric, then all of its eigenvalues are real. M ≻ 0 then means that each of these eigenvalues is strictly positive (M is positive definite), while M ⪰ 0 allows some of them to be zero (M is non-negative definite).
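These definitions can be checked numerically. A small NumPy sketch (illustrative matrices, not from the book):

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # symmetric
eigvals = np.linalg.eigvalsh(M)   # eigvalsh: real eigenvalues of a symmetric matrix
print(eigvals)                    # all strictly positive, so M is positive definite

N = np.array([[1.0, 2.0],
              [2.0, 1.0]])        # symmetric but indefinite
print(np.linalg.eigvalsh(N))      # one eigenvalue is negative
```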
2.3 Derivatives

Provided that f(·) is a sufficiently differentiable function from R to R,

$$\dot{f}(x) = \frac{\mathrm{d}f}{\mathrm{d}x}(x), \qquad (2.5)$$

$$\ddot{f}(x) = \frac{\mathrm{d}^2 f}{\mathrm{d}x^2}(x), \qquad (2.6)$$

$$f^{(k)}(x) = \frac{\mathrm{d}^k f}{\mathrm{d}x^k}(x). \qquad (2.7)$$
Vetter's notation [1] will be used for derivatives of matrices with respect to matrices. (A word of caution is in order: there are other, incompatible notations, and one should be cautious about mixing formulas from different sources.) If A is (n_A × m_A) and B is (n_B × m_B), then

$$\mathbf{M} = \frac{\partial \mathbf{A}}{\partial \mathbf{B}} \qquad (2.8)$$

is an (n_A n_B × m_A m_B) matrix made up of blocks, the (i, j) block being

$$\mathbf{M}_{i,j} = \frac{\partial \mathbf{A}}{\partial b_{i,j}}. \qquad (2.9)$$
If J(·) is a differentiable function from Rⁿ to R, and x a vector of Rⁿ, then the gradient of J(·) at x is the column vector

$$\frac{\partial J}{\partial \mathbf{x}}(\mathbf{x}) = \begin{bmatrix} \frac{\partial J}{\partial x_1}(\mathbf{x}) \\ \vdots \\ \frac{\partial J}{\partial x_n}(\mathbf{x}) \end{bmatrix}, \qquad (2.11)$$

and, when J(·) is twice differentiable, its Hessian at x is the (n × n) matrix of second-order partial derivatives

$$\frac{\partial^2 J}{\partial \mathbf{x}\,\partial \mathbf{x}^{\mathrm{T}}}(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 J}{\partial x_1^2} & \cdots & \frac{\partial^2 J}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 J}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 J}{\partial x_n^2} \end{bmatrix}(\mathbf{x}). \qquad (2.12)$$

Schwarz's theorem ensures that

$$\frac{\partial^2 J}{\partial x_i \partial x_j}(\mathbf{x}) = \frac{\partial^2 J}{\partial x_j \partial x_i}(\mathbf{x}) \qquad (2.13)$$

provided that both are continuous at x and x belongs to an open set in which both are defined. Hessians are thus symmetric, except in pathological cases not considered here.
Example 2.4 If f(·) is a differentiable function from Rⁿ to R^p, and x a vector of Rⁿ, then

$$\mathbf{J}(\mathbf{x}) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}^{\mathrm{T}}}(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & & \vdots \\ \vdots & & \ddots & \vdots \\ \frac{\partial f_p}{\partial x_1} & \cdots & \cdots & \frac{\partial f_p}{\partial x_n} \end{bmatrix}(\mathbf{x}) \qquad (2.14)$$

is the (p × n) Jacobian matrix of f(·) at x. When p = n, the Jacobian matrix is square and its determinant is the Jacobian.
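A quick way to validate a hand-coded Jacobian matrix such as (2.14) is to compare it with a finite-difference approximation. The sketch below (Python/NumPy, with a made-up function f; the book itself uses MATLAB) builds the approximation column by column:

```python
import numpy as np

def f(x):
    # f: R^2 -> R^2 (p = n = 2, so the Jacobian matrix is square)
    return np.array([x[0] * x[1], x[0] + np.sin(x[1])])

def jac_analytic(x):
    return np.array([[x[1], x[0]],
                     [1.0, np.cos(x[1])]])

def jac_fd(f, x, h=1e-6):
    # (p x n) Jacobian matrix by forward differences, one column per variable
    p, n = len(f(x)), len(x)
    J = np.zeros((p, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x)) / h
    return J

x = np.array([1.0, 0.5])
print(jac_analytic(x))
print(jac_fd(f, x))   # agrees to about six digits
```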
Remark 2.2 The last three examples show that the Hessian of J(·) at x is the Jacobian matrix of its gradient function evaluated at x.

Remark 2.3 Gradients and Hessians are frequently used in the context of optimization, and Jacobian matrices when solving systems of nonlinear equations.
Remark 2.4 The Nabla operator ∇, a vector of partial derivatives with respect to all the variables of the function on which it operates,

$$\nabla = \left( \frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_n} \right)^{\mathrm{T}}, \qquad (2.15)$$

is often used to make notation more concise, especially for partial differential equations. Applying ∇ to a scalar function J and evaluating the result at x, one gets the gradient vector

$$\nabla J(\mathbf{x}) = \frac{\partial J}{\partial \mathbf{x}}(\mathbf{x}). \qquad (2.16)$$

If the scalar function is replaced by a vector function f, one gets the Jacobian matrix

$$\nabla \mathbf{f}(\mathbf{x}) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}^{\mathrm{T}}}(\mathbf{x}), \qquad (2.17)$$

where ∇f is interpreted as (∇fᵀ)ᵀ.

By applying ∇ twice to a scalar function J and evaluating the result at x, one gets the Hessian matrix

$$\nabla^2 J(\mathbf{x}) = \frac{\partial^2 J}{\partial \mathbf{x}\,\partial \mathbf{x}^{\mathrm{T}}}(\mathbf{x}). \qquad (2.18)$$

(∇² is sometimes taken to mean the Laplacian operator Δ, such that

$$\Delta f(\mathbf{x}) = \sum_{i=1}^{n} \frac{\partial^2 f}{\partial x_i^2}(\mathbf{x}) \qquad (2.19)$$

is a scalar. The context and dimensional considerations should make what is meant clear.)
Example 2.5 If v, M, and Q do not depend on x and Q is symmetric, then

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{v}^{\mathrm{T}}\mathbf{x}) = \mathbf{v}, \qquad (2.20)$$

$$\frac{\partial}{\partial \mathbf{x}^{\mathrm{T}}}(\mathbf{M}\mathbf{x}) = \mathbf{M}, \qquad (2.21)$$

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^{\mathrm{T}}\mathbf{M}\mathbf{x}) = (\mathbf{M} + \mathbf{M}^{\mathrm{T}})\,\mathbf{x} \qquad (2.22)$$

and

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^{\mathrm{T}}\mathbf{Q}\mathbf{x}) = 2\,\mathbf{Q}\mathbf{x}. \qquad (2.23)$$
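Identities such as (2.22) are easy to sanity-check numerically. A NumPy sketch (illustrative; the random M and x are made up here), comparing the analytic gradient with central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))       # not symmetric in general
x = rng.standard_normal(n)

def J(x):
    # scalar function J(x) = x^T M x
    return x @ M @ x

grad_analytic = (M + M.T) @ x         # identity (2.22)

h = 1e-6                              # central finite differences, column by column
grad_fd = np.array([(J(x + h * e) - J(x - h * e)) / (2 * h)
                    for e in np.eye(n)])
print(np.max(np.abs(grad_fd - grad_analytic)))  # tiny
```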
The function f(x) is o(g(x)) when x tends to x₀ if

$$\lim_{x \to x_0} \frac{f(x)}{g(x)} = 0. \qquad (2.24)$$
Consider, for instance, the polynomial

$$f(x) = \sum_{i=2}^{m} a_i x^i,$$

so f(x) = O(x²) when x tends to zero. If, on the other hand, x is taken equal to the (large) positive integer n, then

$$f(n) = \sum_{i=2}^{m} a_i n^i \le \sum_{i=2}^{m} |a_i n^i| \le \left( \sum_{i=2}^{m} |a_i| \right) n^m,$$

so f(n) = O(nᵐ) when n tends to infinity.
2.5 Norms

A function f(·) from a vector space V to R is a norm if it satisfies the following three properties:
1. f(v) > 0 for all v ≠ 0 (positivity),
2. f(λv) = |λ| · f(v) for all λ ∈ R and v ∈ V (positive scalability),
3. f(v₁ + v₂) ≤ f(v₁) + f(v₂) for all v₁ ∈ V and v₂ ∈ V (triangle inequality).
These properties imply that f(v) = 0 ⇒ v = 0 (non-degeneracy). Another useful relation is

$$|f(\mathbf{v}_1) - f(\mathbf{v}_2)| \le f(\mathbf{v}_1 - \mathbf{v}_2). \qquad (2.27)$$
Norms are used to quantify distances between vectors. They play an essential role,
for instance, in the characterization of the intrinsic difficulty of numerical problems
via the notion of condition number (see Sect. 3.3) or in the definition of cost functions
for optimization.
The most commonly used vector norms are

$$\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i|, \qquad (2.28)$$

$$\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2} = \sqrt{\mathbf{v}^{\mathrm{T}}\mathbf{v}}, \qquad (2.29)$$

$$\|\mathbf{v}\|_\infty = \max_{1 \le i \le n} |v_i|. \qquad (2.30)$$

The Euclidean norm ||v||₂ satisfies the Cauchy-Schwarz inequality

$$|\mathbf{v}^{\mathrm{T}}\mathbf{w}| \le \|\mathbf{v}\|_2 \, \|\mathbf{w}\|_2. \qquad (2.33)$$

When v is complex-valued, the Euclidean norm becomes

$$\|\mathbf{v}\|_2 = \sqrt{\mathbf{v}^{\mathrm{H}}\mathbf{v}}, \qquad (2.34)$$

where v^H is the transconjugate of v, i.e., the row vector obtained by transposing the column vector v and replacing each of its entries by its complex conjugate.
Example 2.7 For the complex vector

$$\mathbf{v} = \begin{bmatrix} a \\ \mathrm{i}a \end{bmatrix},$$

where a is some nonzero real number and i is the imaginary unit (such that i² = −1), vᵀv = a² + (ia)² = 0. This proves that √(vᵀv) is not a norm. The value of the Euclidean norm of v is √(v^H v) = √2 |a|.
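This computation can be replayed directly. A NumPy sketch (with the arbitrary choice a = 2):

```python
import numpy as np

a = 2.0
v = np.array([a, 1j * a])       # the vector of Example 2.7

vTv = v @ v                     # v^T v = a^2 + (ia)^2 = 0: not a usable norm
vHv = np.conj(v) @ v            # v^H v = a^2 + a^2 = 2 a^2
print(vTv, np.sqrt(vHv.real))   # 0, and sqrt(2) * |a|
```

Note that np.linalg.norm(v) returns the same value as √(v^H v), since it works with moduli of the entries.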
Remark 2.6 The so-called l0 norm of a vector is the number of its nonzero entries.
Used in the context of sparse estimation, where one is looking for an estimated
parameter vector with as few nonzero entries as possible, it is not a norm, as it does
not satisfy the property of positive scalability.
Each vector norm induces a matrix norm, defined as

$$\|\mathbf{M}\| = \max_{\|\mathbf{v}\| = 1} \|\mathbf{M}\mathbf{v}\|, \qquad (2.35)$$

so

$$\|\mathbf{M}\mathbf{v}\| \le \|\mathbf{M}\| \cdot \|\mathbf{v}\| \qquad (2.36)$$

for any M and v for which the product Mv makes sense. This matrix norm is subordinate to the vector norm inducing it. The matrix and vector norms are then said to be compatible, an important property for the study of products of matrices and vectors.
The matrix norm induced by the vector norm l₂ is the spectral norm, or 2-norm,

$$\|\mathbf{M}\|_2 = \sqrt{\rho(\mathbf{M}^{\mathrm{T}}\mathbf{M})}, \qquad (2.37)$$

where ρ(·) is the function that computes the spectral radius of its argument, i.e., the modulus of the eigenvalue(s) with the largest modulus. Since all the eigenvalues of MᵀM are real and non-negative, ρ(MᵀM) is the largest of these eigenvalues. Its square root is the largest singular value of M, denoted by σmax(M). So

$$\|\mathbf{M}\|_2 = \sigma_{\max}(\mathbf{M}). \qquad (2.38)$$
The matrix norm induced by the vector norm l₁ is the 1-norm

$$\|\mathbf{M}\|_1 = \max_{j} \sum_{i} |m_{i,j}|, \qquad (2.39)$$

which amounts to summing the absolute values of the entries of each column in turn and keeping the largest result. The matrix norm induced by the vector norm l∞ is the infinity norm

$$\|\mathbf{M}\|_\infty = \max_{i} \sum_{j} |m_{i,j}|, \qquad (2.40)$$

which amounts to summing the absolute values of the entries of each row in turn and keeping the largest result. Thus

$$\|\mathbf{M}\|_1 = \|\mathbf{M}^{\mathrm{T}}\|_\infty. \qquad (2.41)$$
Since each subordinate matrix norm is compatible with its inducing vector norm,

||v||₁ is compatible with ||M||₁,  (2.42)
||v||₂ is compatible with ||M||₂,  (2.43)
||v||∞ is compatible with ||M||∞.  (2.44)

The Frobenius norm

$$\|\mathbf{M}\|_{\mathrm{F}} = \sqrt{\sum_{i,j} m_{i,j}^2} = \sqrt{\operatorname{trace} \mathbf{M}^{\mathrm{T}}\mathbf{M}} \qquad (2.45)$$

is not induced by any vector norm; it is nevertheless compatible with ||v||₂, as

$$\|\mathbf{M}\mathbf{v}\|_2 \le \|\mathbf{M}\|_{\mathrm{F}} \, \|\mathbf{v}\|_2. \qquad (2.46)$$
Remark 2.7 To evaluate a vector or matrix norm with MATLAB (or any other interpreted language based on matrices), it is much more efficient to use the corresponding dedicated function than to access the entries of the vector or matrix individually to implement the norm definition. Thus, norm(X,p) returns the p-norm of X, which may be a vector or a matrix, while norm(M,'fro') returns the Frobenius norm of the matrix M.
Let x⋆ be the limit of a sequence of iterates x^k computed by some iterative method, and let e^k = x^k − x⋆ be the corresponding error. Convergence is linear if

$$\limsup_{k \to \infty} \frac{\|\mathbf{e}^{k+1}\|}{\|\mathbf{e}^{k}\|} = \alpha, \qquad (2.47)$$

with 0 < α < 1 the rate of convergence. It is superlinear if

$$\limsup_{k \to \infty} \frac{\|\mathbf{e}^{k+1}\|}{\|\mathbf{e}^{k}\|} = 0, \qquad (2.49)$$

and quadratic if

$$\limsup_{k \to \infty} \frac{\|\mathbf{e}^{k+1}\|}{\|\mathbf{e}^{k}\|^2} = \alpha < \infty. \qquad (2.50)$$
A method with quadratic convergence thus also has superlinear and linear convergence. It is customary, however, to qualify a method with the best convergence it achieves. Quadratic convergence is better than superlinear convergence, which is better than linear convergence.
Remember that these convergence speeds are asymptotic, valid when the error has become small enough, and that they do not take the effect of rounding into account. They are meaningless if the initial vector x⁰ was too badly chosen for the method to converge to x⋆. When the method does converge to x⋆, they may not describe accurately its initial behavior and will no longer be true when rounding errors become predominant. They are nevertheless an interesting indication of what can be expected at best.
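These rates are easy to observe on a toy problem. The sketch below (Python, not from the book) compares Newton's iteration for √2, which converges quadratically, with a crude fixed-point iteration that only converges linearly; the damping factor 0.1 is an arbitrary choice:

```python
import math

xstar = math.sqrt(2.0)

x = 1.0                      # Newton for f(x) = x^2 - 2: quadratic convergence
newton_errors = []
for _ in range(5):
    x = 0.5 * (x + 2.0 / x)
    newton_errors.append(abs(x - xstar))

x = 1.0                      # x <- x - 0.1 (x^2 - 2): only linear convergence
linear_errors = []
for _ in range(5):
    x = x - 0.1 * (x * x - 2.0)
    linear_errors.append(abs(x - xstar))

print(newton_errors)   # error roughly squares at each step
print(linear_errors)   # error shrinks by a roughly constant factor
```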
Reference
1. Vetter, W.: Derivative operations on matrices. IEEE Trans. Autom. Control 15, 241-244 (1970)
Chapter 3
Solving Systems of Linear Equations
3.1 Introduction
Linear equations are first-order polynomial equations in their unknowns. A system
of linear equations can thus be written as
Ax = b,
(3.1)
where the matrix A and the vector b are known and where x is a vector of unknowns.
We assume in this chapter that
• all the entries of A, b, and x are real numbers,
• there are n scalar equations in n scalar unknowns (A is a square (n × n) matrix and dim x = dim b = n),
• these equations uniquely define x (A is invertible).
When A is invertible, the solution of (3.1) for x is unique, and given mathematically
in closed form as x = A1 b. We are not interested here in this closed-form solution,
and wish instead to compute x numerically from numerically known A and b. This
problem plays a central role in so many algorithms that it deserves a chapter of
its own. Systems of linear equations with more equations than unknowns will be
considered in Sect. 9.2.
Remark 3.1 When A is square but singular (i.e., not invertible), its columns no longer
form a basis of Rn , so the vector Ax cannot take all directions in Rn . The direction of
b will thus determine whether (3.1) admits infinitely many solutions for x or none.
When b can be expressed as a linear combination of columns of A, the equations
are linearly dependent and there is a continuum of solutions. The system x1 + x2 = 1
and 2x1 + 2x2 = 2 corresponds to this situation.
When b cannot be expressed as a linear combination of columns of A, the equations
are incompatible and there is no solution. The system x1 + x2 = 1 and x1 + x2 = 2
corresponds to this situation.
Great books covering the topics of this chapter and Chap. 4 (as well as topics relevant to many other chapters) are [1-3].
3.2 Examples
Example 3.1 Determination of a static equilibrium
The conditions for a linear dynamical system to be in static equilibrium translate
into a system of linear equations. Consider, for instance, a series of three vertical
springs si (i = 1, 2, 3), with the first of them attached to the ceiling and the last
to an object with mass m. The mass of each spring is neglected, and the stiffness
coefficient of the ith spring is denoted by ki . We want to compute the elongation xi
of the bottom end of spring i (i = 1, 2, 3) resulting from the action of the mass of
the object when the system has reached static equilibrium. The sum of all the forces
acting at any given point is then zero. Provided that m is small enough for Hooke's
law of elasticity to apply, the following linear equations thus hold true
mg = k_3 (x_3 − x_2),  (3.2)
k_3 (x_2 − x_3) = k_2 (x_1 − x_2),  (3.3)
k_2 (x_2 − x_1) = k_1 x_1,  (3.4)
where g is the acceleration due to gravity. This system of linear equations can be
written as
\[
\begin{bmatrix}
k_1 + k_2 & -k_2 & 0 \\
-k_2 & k_2 + k_3 & -k_3 \\
0 & -k_3 & k_3
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \\ mg \end{bmatrix}. \tag{3.5}
\]
The matrix in the left-hand side of (3.5) is tridiagonal, as only its main descending
diagonal and the descending diagonals immediately over and below it are nonzero.
This would still be true if there were many more springs in series, in which case the
matrix would also be sparse, i.e., with a majority of zero entries. Note that changing
the mass of the object would only modify the right-hand side of (3.5), so one might
be interested in solving a number of systems that share the same matrix A.
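As an illustration, the system (3.5) can be assembled and solved numerically. The stiffnesses and mass below are made-up values, not taken from the text, and the sketch uses Python with NumPy rather than the MATLAB employed elsewhere in this book.

```python
import numpy as np

# Hypothetical values: stiffnesses in N/m, mass in kg, g in m/s^2.
k1, k2, k3 = 100.0, 150.0, 200.0
m, g = 0.5, 9.81

# Tridiagonal matrix of (3.5).
K = np.array([[k1 + k2, -k2,      0.0],
              [-k2,     k2 + k3, -k3],
              [0.0,    -k3,       k3]])
rhs = np.array([0.0, 0.0, m * g])

x = np.linalg.solve(K, rhs)            # elongations x1, x2, x3

# Changing the mass only changes the right-hand side, so K can be reused.
x_heavier = np.linalg.solve(K, 2.0 * rhs)
```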
Example 3.2 Polynomial interpolation
Assume that the value yi of some quantity of interest has been measured at time
ti (i = 1, 2, 3). Interpolating these data with the polynomial
P(t, x) = a_0 + a_1 t + a_2 t^2,  (3.6)
where x = (a_0, a_1, a_2)^T, boils down to solving (3.1) with
\[
A = \begin{bmatrix}
1 & t_1 & t_1^2 \\
1 & t_2 & t_2^2 \\
1 & t_3 & t_3^2
\end{bmatrix} \quad \text{and} \quad
b = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}. \tag{3.7}
\]
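A tiny numerical illustration (with made-up data, and in Python/NumPy rather than the MATLAB used elsewhere in this book): for three measurements, the matrix of (3.7) is a small Vandermonde matrix, which np.vander builds directly.

```python
import numpy as np

# Hypothetical measurement times and values.
t = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.0, 5.0])

# Rows of A are [1, t_i, t_i^2], as in (3.7).
A = np.vander(t, 3, increasing=True)
a0, a1, a2 = np.linalg.solve(A, y)

# The interpolating polynomial of (3.6); it must pass through the data.
def P(s):
    return a0 + a1 * s + a2 * s ** 2
```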
Consider the perturbed system (A + δA)x̂ = b + δb. Its solution satisfies
x̂ = A^{-1}[b + δb − (δA)x̂],  (3.10)
so the error δx = x̂ − x, with x = A^{-1}b, is
δx = A^{-1}[δb − (δA)x̂].  (3.11)
Taking norms, one gets
\[
\frac{\|\delta x\|}{\|\hat{x}\|} \leqslant \|A^{-1}\|\cdot\|A\| \left( \frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|A\|\cdot\|\hat{x}\|} \right). \tag{3.12}
\]
The multiplicative coefficient ||A^{-1}|| · ||A|| appearing in the right-hand side of (3.12)
is the condition number of A
cond A = ||A^{-1}|| · ||A||.  (3.13)
Assume first that only A is perturbed. Then
||δx||/||x̂|| ⩽ (cond A) · ||δA||/||A||.  (3.14)
Assume instead that only b is perturbed, and denote by x the solution of the unperturbed system,
x = A^{-1}b.  (3.16)
Subtracting Ax = b from A(x + δx) = b + δb gives
δb = A δx,  (3.17)
so
||δx|| · ||b|| ⩽ ||A^{-1}|| · ||A|| · ||δb|| · ||x||,  (3.18)
and thus
||δx||/||x|| ⩽ (cond A) · ||δb||/||b||.  (3.19)
Since
||A^{-1}|| · ||A|| ⩾ ||A^{-1}A|| = ||I|| = 1,  (3.20)
the condition number satisfies
cond A ⩾ 1.  (3.21)
Its value depends on the norm used. For the spectral norm,
||A||₂ = σ_max(A),  (3.22)
with σ_max(A) the largest singular value of A, and
||A^{-1}||₂ = 1/σ_min(A),  (3.23)
with σ_min(A) the smallest singular value of A, the condition number of A for the
spectral norm is the ratio of its largest singular value to its smallest
cond A = σ_max(A)/σ_min(A).  (3.24)
The larger the condition number of A is, the more ill-conditioned solving (3.1)
becomes.
It is useful to compare cond A with the inverse of the precision of the floating-point
representation. For a double-precision representation according to IEEE Standard
754 (typical of MATLAB computations), this precision is about 10^{-16}.
Solving (3.1) for x when cond A is not small compared to 10^{16} requires special
care.
Remark 3.4 Although this is probably the worst method for computing singular
values, the singular values of A are the square roots of the eigenvalues of AT A.
(When A is symmetric, its singular values are thus equal to the absolute values of its
eigenvalues.)
Remark 3.5 A is singular if and only if its determinant is zero, so one might have
thought of using the value of det A as an index of conditioning, with a small determinant indicative of a nearly singular system. However, it is very difficult to check
that a floating-point number differs significantly from zero (think of what happens to
the determinant of A if A and b are multiplied by a large or small positive number,
which has no effect on the difficulty of the problem). The condition number is a much
more meaningful index of conditioning, as it is invariant to a multiplication of A by
a nonzero scalar of any magnitude (a consequence of the positive scalability of the
norm). Compare det(10^{-1} I_n) = 10^{-n} with cond(10^{-1} I_n) = 1.
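The comparison at the end of Remark 3.5 is easy to reproduce (here in Python/NumPy rather than MATLAB): scaling the identity matrix makes its determinant arbitrarily small without making the problem any harder.

```python
import numpy as np

n = 20
A = 0.1 * np.eye(n)

det_A = np.linalg.det(A)     # 10^-20: looks desperately close to singular...
cond_A = np.linalg.cond(A)   # ...yet cond A = 1, the best possible value
```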
Remark 3.6 The numerical value of cond A depends on the norm being used, but an
ill-conditioned problem for one norm should also be ill-conditioned for the others,
so the choice of a given norm is just a matter of convenience.
Remark 3.7 Although evaluating the condition number of a matrix for the spectral
norm just takes one call to the MATLAB function cond(), this may actually require
more computation than solving (3.1). Evaluating the condition number of the same
matrix for the 1-norm (by a call to the function cond(,1)), is less costly than for
the spectral norm, and algorithms are available to get cheaper estimates of its order
of magnitude [2, 6, 7], which is what we are actually interested in, after all.
Remark 3.8 The concept of condition number extends to rectangular matrices, and
the condition number for the spectral norm is then still given by (3.24). It can also
be extended to nonlinear problems, see Sect. 14.5.2.1.
Unless A has some specific structure that makes inversion particularly simple, one
should thus think twice before inverting A to take advantage of the closed-form
solution
x = A^{-1}b.  (3.25)
Cramer's rule for solving systems of linear equations, which requires the computation of ratios of determinants, is the worst possible approach. Determinants are
notoriously difficult to compute accurately, and computing these determinants is
unnecessarily costly, even if much more economical methods than cofactor expansion are available.
Is A symmetric positive definite, i.e., such that vᵀAv > 0 for any nonzero v? This implies that all of its eigenvalues are real and strictly positive.
If A is large, is it sparse, i.e., such that most of its entries are zeros?
Is A diagonally dominant, i.e., such that the absolute value of each of its diagonal
entries is strictly larger than the sum of the absolute values of all the other entries
in the same row?
Is A tridiagonal, i.e., such that only its main descending diagonal and the diagonals
immediately over and below are nonzero?
\[
A = \begin{bmatrix}
b_1 & c_1 & 0 & \cdots & \cdots & 0 \\
a_2 & b_2 & c_2 & \ddots & & \vdots \\
0 & a_3 & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & 0 \\
\vdots & & \ddots & \ddots & b_{n-1} & c_{n-1} \\
0 & \cdots & \cdots & 0 & a_n & b_n
\end{bmatrix} \tag{3.27}
\]
Is A Toeplitz, i.e., such that all the entries on the same descending diagonal take the same
value?
\[
A = \begin{bmatrix}
h_0 & h_{-1} & h_{-2} & \cdots & h_{-n+1} \\
h_1 & h_0 & h_{-1} & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & h_{-2} \\
\vdots & & \ddots & h_0 & h_{-1} \\
h_{n-1} & h_{n-2} & \cdots & h_1 & h_0
\end{bmatrix} \tag{3.28}
\]
Ux = b,  (3.29)
where
\[
U = \begin{bmatrix}
u_{1,1} & u_{1,2} & \cdots & u_{1,n} \\
0 & u_{2,2} & \cdots & u_{2,n} \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & u_{n,n}
\end{bmatrix}. \tag{3.30}
\]
When U is invertible, all its diagonal entries are nonzero and (3.29) can be solved
for one unknown at a time, starting with the last
x_n = b_n/u_{n,n},  (3.31)
then continuing upwards with
x_{n−1} = (b_{n−1} − u_{n−1,n} x_n)/u_{n−1,n−1},  (3.32)
and, in general,
x_i = (b_i − Σ_{j=i+1}^{n} u_{i,j} x_j)/u_{i,i},   i = n − 1, . . . , 1.  (3.33)
Forward substitution, on the other hand, applies to the lower triangular system
Lx = b,  (3.34)
where
\[
L = \begin{bmatrix}
l_{1,1} & 0 & \cdots & 0 \\
l_{2,1} & l_{2,2} & \ddots & \vdots \\
\vdots & \vdots & \ddots & 0 \\
l_{n,1} & l_{n,2} & \cdots & l_{n,n}
\end{bmatrix}. \tag{3.35}
\]
It also solves (3.34) for one unknown at a time, but starts with x1 then moves down
to get x2 and so forth until xn is obtained.
Solving (3.29) by backward substitution can be carried out in MATLAB via the
instruction x=linsolve(U,b,optsUT), provided that optsUT.UT=true,
which specifies that U is an upper triangular matrix. Similarly, solving (3.34) by
forward substitution can be carried out via x=linsolve(L,b,optsLT), provided that optsLT.LT=true, which specifies that L is a lower triangular matrix.
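Backward and forward substitution are short enough to be written out in full. The sketch below (in Python/NumPy; the book itself uses MATLAB) implements both, with the unknowns computed in the orders described above.

```python
import numpy as np

def backward_substitution(U, b):
    """Solve Ux = b, U upper triangular, starting with the last unknown."""
    n = len(b)
    x = np.empty(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

def forward_substitution(L, b):
    """Solve Lx = b, L lower triangular, starting with the first unknown."""
    n = len(b)
    x = np.empty(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

rng = np.random.default_rng(0)
U = np.triu(rng.uniform(1.0, 2.0, (5, 5)))   # invertible: nonzero diagonal
b = np.arange(1.0, 6.0)
x_back = backward_substitution(U, b)
x_forw = forward_substitution(U.T, b)
```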
This classical approach for solving (3.1) has no advantage over LU factorization
presented next. As it works simultaneously on A and b, Gaussian elimination for a
right-hand side b not previously known cannot take advantage of past computations
carried out with other right-hand sides, even if A remains the same.
3.6.3 LU Factorization
LU factorization, a matrix reformulation of Gaussian elimination, is the basic workhorse to be used when A has no particular structure to be taken advantage of. Consider
first its simplest version.
A = LU,  (3.40)
with L a unit lower triangular matrix and U an upper triangular matrix:
\[
A = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
l_{2,1} & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
l_{n,1} & \cdots & l_{n,n-1} & 1
\end{bmatrix}
\begin{bmatrix}
u_{1,1} & u_{1,2} & \cdots & u_{1,n} \\
0 & u_{2,2} & \cdots & u_{2,n} \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & u_{n,n}
\end{bmatrix}. \tag{3.41}
\]
When (3.41) admits a solution for its unknowns l_{i,j} and u_{i,j}, this solution can be
obtained very simply by considering the equations in the proper order. Each unknown
is then expressed as a function of entries of A and already computed entries of L and
U. For the sake of notational simplicity, and because our purpose is not coding LU
factorization, we only illustrate this with a very small example.
Example 3.3 LU factorization without pivoting
For the system
\[
\begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ l_{2,1} & 1 \end{bmatrix}
\begin{bmatrix} u_{1,1} & u_{1,2} \\ 0 & u_{2,2} \end{bmatrix}, \tag{3.42}
\]
we get
u_{1,1} = a_{1,1},  u_{1,2} = a_{1,2},  l_{2,1} u_{1,1} = a_{2,1}  and  l_{2,1} u_{1,2} + u_{2,2} = a_{2,2}.  (3.43)
So, provided that a_{1,1} ≠ 0,
l_{2,1} = a_{2,1}/u_{1,1} = a_{2,1}/a_{1,1}  and  u_{2,2} = a_{2,2} − l_{2,1} u_{1,2} = a_{2,2} − (a_{2,1}/a_{1,1}) a_{1,2}.  (3.44)
Terms that appear in denominators, such as a1,1 in Example 3.3, are called pivots.
LU factorization without pivoting fails whenever a pivot turns out to be zero.
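For the 2 × 2 case of Example 3.3, the factorization can be carried out literally as in (3.43)–(3.44). A Python/NumPy sketch, on arbitrary numerical values with a nonzero pivot:

```python
import numpy as np

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

# Entries computed in the order suggested by (3.43):
u11, u12 = A[0, 0], A[0, 1]   # first row of U is the first row of A
l21 = A[1, 0] / u11           # requires the pivot a_{1,1} to be nonzero
u22 = A[1, 1] - l21 * u12

L = np.array([[1.0, 0.0],
              [l21, 1.0]])
U = np.array([[u11, u12],
              [0.0, u22]])
```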
After LU factorization, the system to be solved is
LUx = b.  (3.45)
Its solution for x is obtained in two steps. First,
Ly = b  (3.46)
is solved for y by forward substitution; then Ux = y is solved for x by backward substitution.
3.6.3.2 Pivoting
Pivoting is a short name for reordering the equations (and possibly the variables) so
as to avoid zero pivots. When only the equations are reordered, one speaks of partial
pivoting, whereas total pivoting, not considered here, also involves reordering the
variables. (Total pivoting is seldom used, as it rarely provides better results than
partial pivoting while being more expensive.)
Reordering the equations amounts to permuting the same rows in A and in b,
which can be carried out by left-multiplying A and b by a suitable permutation matrix.
The permutation matrix P that exchanges the ith and jth rows of A is obtained by
exchanging the ith and jth rows of the identity matrix. Thus, for instance,
\[
\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} =
\begin{bmatrix} b_3 \\ b_1 \\ b_2 \end{bmatrix}. \tag{3.48}
\]
Since det I = 1 and any exchange of two rows changes the sign of the determinant,
we have
det P = ±1.  (3.49)
P is an orthonormal matrix (also called unitary matrix), i.e., it is such that
PT P = I.
(3.50)
With partial pivoting, it is a permuted version of A that is factored as
PA = LU.  (3.51)
Left-multiplying (3.1) by P gives
PAx = Pb,  (3.52)
so
LUx = Pb.  (3.53)
First,
Ly = Pb  (3.54)
is solved for y by forward substitution, and then
Ux = y  (3.55)
is solved for x. Of course the (sparse) permutation matrix P need not be stored as an
(n × n) matrix; it suffices to keep track of the corresponding row exchanges.
Remark 3.11 Algorithms solving systems of linear equations via LU factorization
with partial or total pivoting are readily and freely available on the WEB with a
detailed documentation (in LAPACK, for instance, see Chap. 15). The same remark
applies to most of the methods presented in this book. In MATLAB, LU factorization
with partial pivoting is achieved by the instruction [L,U,P]=lu(A).
Remark 3.12 Although the pivoting strategy of LU factorization is not based on
keeping the condition number of the problem unchanged, the increase in this condition number is mitigated, which makes LU with partial pivoting applicable even to
some very ill-conditioned problems. See Sect. 3.10.1 for an illustration.
LU factorization is a first example of the decomposition approach to matrix computation [9], where a matrix is expressed as a product of factors. Other examples
are QR factorization (Sects. 3.6.5 and 9.2.3), SVD (Sects. 3.6.6 and 9.2.4), Cholesky
factorization (Sect. 3.8.1), and Schur and spectral decompositions, both carried out
by the QR algorithm (Sect. 4.3.6). By concentrating efforts on the development of
efficient, robust algorithms for a few important factorizations, numerical analysts
have made it possible to produce highly effective packages for matrix computation,
with surprisingly diverse applications. Huge savings can be achieved when a number
of problems share the same matrix, which then only needs to be factored once. Once
LU factorization has been carried out on a given matrix A, for instance, all the systems
(3.1) that differ only by their vector b are easily solved with the same factorization,
even if the values of b to be considered were not known when A was factored. This
is a definite advantage over Gaussian elimination where the factorization of A is
hidden in the solution of (3.1) for some pre-specified b.
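This reuse is exactly what the pair lu_factor/lu_solve of SciPy expresses (a Python analog of MATLAB's [L,U,P]=lu(A); the example below is an illustration, not the book's code): the O(n³) factorization is paid once, and each extra right-hand side only costs two O(n²) triangular solves.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(1)
n = 100
A = rng.standard_normal((n, n))

lu, piv = lu_factor(A)           # factor A once (with partial pivoting)

# Solve for several right-hand sides, reusing the same factorization.
b1 = rng.standard_normal(n)
b2 = rng.standard_normal(n)
x1 = lu_solve((lu, piv), b1)
x2 = lu_solve((lu, piv), b2)
```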
One would like the correction δx of the numerical solution x̂ to be such that
A(x̂ + δx) = b,  (3.56)
or equivalently that
A δx = b − Ax̂.  (3.57)
3.6.5 QR Factorization
Any (n × n) invertible matrix A can be factored as
A = QR,  (3.58)
where Q is an orthonormal matrix, such that
QᵀQ = I_n,  (3.59)
and R is an invertible upper triangular matrix, so that (3.1) translates into
Rx = Qᵀb,  (3.60)
which is easily solved for x by backward substitution.
Remark 3.15 Contrary to LU factorization, QR factorization also applies to rectangular matrices, and will prove extremely useful in the solution of linear least-squares
problems, see Sect. 9.2.3.
At least in principle, Gram–Schmidt orthogonalization could be used to carry out
QR factorization, but it suffers from numerical instability when the columns of A are
close to being linearly dependent. This is why the more robust approach presented
in the next section is usually preferred, although a modified Gram–Schmidt method
could also be employed [10].
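Once A = QR is available, solving (3.1) reduces to one matrix–vector product and one triangular solve, since QᵀQ = I. A Python/NumPy sketch (the book's own examples are in MATLAB):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A)        # Q orthonormal, R upper triangular

# Ax = b becomes Rx = Q^T b, solvable by backward substitution
# (delegated here to a generic solver for brevity).
x = np.linalg.solve(R, Q.T @ b)
```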
H(v) = I_n − 2 (vvᵀ)/(vᵀv).  (3.62)
H(v) is symmetric and is its own inverse,
H^{-1}(v) = H(v).  (3.63)

[Fig. 3.1 Householder transformation: x is reflected into H(v)x = x − 2 ((vᵀx)/(vᵀv)) v with respect to the hyperplane orthogonal to v.]
Take
v = x ± ||x||₂ e₁,  (3.65)
where e₁ is the vector corresponding to the first column of the identity matrix, and
where the ± sign indicates liberty to choose a plus or minus operator. The following
proposition makes it possible to use H(v) to transform x into a vector with all of its
entries equal to zero except for the first one.
Proposition 3.1 If
H(+) = H(x + ||x||₂ e₁)  (3.66)
and
H(−) = H(x − ||x||₂ e₁),  (3.67)
then
H(+) x = −||x||₂ e₁  (3.68)
and
H(−) x = +||x||₂ e₁.  (3.69)
One can check that, for this choice of v,
2 (vᵀx)/(vᵀv) = 1.  (3.70)
So
H(v)x = x − 2 ((vᵀx)/(vᵀv)) v = x − v = ∓||x||₂ e₁.  (3.71)
The sign in front of ||x||₂ e₁ is taken as that of the first entry of x,
v = x + sign(x₁) ||x||₂ e₁,  (3.72)
to protect oneself against the risk of having to compute the difference of floating-point
numbers that are close to one another. In practice, the matrix H(v) is not formed.
One computes instead the scalar
β = 2 (vᵀx)/(vᵀv),  (3.73)
and then
H(v)x = x − βv.  (3.74)
Starting from A₀ = A, orthonormal Householder matrices are applied in succession,
A_{k+1} = H_{k+1} A_k,  (3.76)
to transform A step by step into an upper triangular matrix.
H_{k+1} is in charge of shaping the (k + 1)-st column of A_k while leaving the k columns
to its left unchanged. Let a^{k+1} be the vector consisting of the last (n − k) entries
of the (k + 1)-st column of A_k. The Householder transformation must modify only
a^{k+1}, so
\[
H_{k+1} = \begin{bmatrix}
I_k & 0 \\
0 & H\left(a^{k+1} + \mathrm{sign}(a_1^{k+1})\, \|a^{k+1}\|_2\, e_1\right)
\end{bmatrix}. \tag{3.77}
\]
In the next equation, for instance, the top and bottom entries of a³ are indicated by
the symbol •:
\[
A_3 = \begin{bmatrix}
\times & \times & \times & \cdots & \times \\
0 & \times & \times & \cdots & \times \\
0 & 0 & \bullet & \cdots & \times \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & \bullet & \cdots & \times
\end{bmatrix} \tag{3.78}
\]
In (3.77), e₁ has the same dimension as a^{k+1} and all its entries are again zero, except
for the first one, which is equal to one.
At each iteration, the matrix H(+) or H(−) that leads to the more stable numerical
computation is selected, see (3.72). Finally
R = H_{n−1} H_{n−2} · · · H₁ A,  (3.79)
or equivalently
A = (H_{n−1} H_{n−2} · · · H₁)^{-1} R = H₁^{-1} H₂^{-1} · · · H_{n−1}^{-1} R = QR.  (3.80)
Since each Householder matrix is its own inverse,
Q = H₁ H₂ · · · H_{n−1}.  (3.81)
Instead of using Householder transformations, one may implement QR factorization via Givens rotations [2], which are also robust, orthonormal transformations,
but this makes computation more complex without improving performance.
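The key properties of the Householder transformation are easy to verify numerically: H(v) is its own inverse, and with the sign choice of (3.72) it sends x onto ∓||x||₂ e₁. A Python/NumPy sketch on an arbitrary vector (illustration only; the matrix H(v) would not be formed in practice):

```python
import numpy as np

def householder(v):
    """H(v) = I - 2 v v^T / (v^T v), as in (3.62)."""
    v = v.reshape(-1, 1)
    return np.eye(len(v)) - 2.0 * (v @ v.T) / (v.T @ v)

x = np.array([3.0, 1.0, 2.0])
e1 = np.eye(3)[0]

# Sign chosen as that of x_1, to avoid the difference of nearly
# equal floating-point numbers when forming v, as in (3.72).
v = x + np.sign(x[0]) * np.linalg.norm(x) * e1
Hx = householder(v) @ x       # all entries zeroed except the first
```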
Any (n × n) matrix A can be factored as
A = UΣVᵀ,  (3.82)
where U and V are orthonormal,
UᵀU = VᵀV = I,  (3.83)
and Σ is a diagonal matrix whose diagonal entries are the singular values of A. The system (3.1) then becomes
UΣVᵀx = b,  (3.85)
so
x = VΣ^{-1}Uᵀb.  (3.86)
Assume, for instance, that A is symmetric and b is perturbed by δb, an eigenvector of A associated with a very small eigenvalue λ. The resulting perturbation of the solution,
δx = A^{-1}δb = λ^{-1}δb,  (3.89)
then has a very large Euclidean norm, and should thus be completely different from
x, as the eigenvalue λ is also a (very small) singular value of A and 1/λ will be huge.
3.7.1.1 Principle
To solve (3.1) for x, decompose A into a sum of two matrices
A = A₁ + A₂,  (3.90)
with A₁ invertible, so that (3.1) becomes
A₁x = b − A₂x.  (3.91)
Define M = −A₁^{-1}A₂ and v = A₁^{-1}b to get
x = Mx + v.
(3.92)
The idea is to choose the decomposition (3.90) in such a way that the recursion
xk+1 = Mxk + v
(3.93)
converges to the solution of (3.1) when k tends to infinity. This will be the case if
and only if all the eigenvalues of M are strictly inside the unit circle.
In the Jacobi method, A₁ is taken as the diagonal matrix D with the same diagonal entries as A,
A₁ = D,  (3.94)
A₂ = A − D.  (3.95)
The scalar interpretation of this method is as follows. The jth row of (3.1) is
\[
\sum_{i=1}^{n} a_{j,i} x_i = b_j. \tag{3.96}
\]
Solving it for x_j (assuming a_{j,j} ≠ 0) gives
\[
x_j = \frac{b_j - \sum_{i \neq j} a_{j,i} x_i}{a_{j,j}}, \tag{3.97}
\]
and the Jacobi iteration computes
\[
x_j^{k+1} = \frac{b_j - \sum_{i \neq j} a_{j,i} x_i^{k}}{a_{j,j}}, \qquad j = 1, \ldots, n. \tag{3.98}
\]
A sufficient condition for convergence to the solution x of (3.1) (whatever the initial
vector x0 ) is that A be diagonally dominant. This condition is not necessary, and
convergence may take place under less restrictive conditions.
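A minimal Jacobi iteration (a Python/NumPy sketch on a made-up diagonally dominant system, so convergence is guaranteed); all the components of x^{k+1} are computed from x^k only, as in (3.98):

```python
import numpy as np

A = np.array([[10.0, 1.0, 2.0],
              [1.0,  8.0, 3.0],
              [2.0,  1.0, 9.0]])   # diagonally dominant
x_true = np.array([1.0, 1.0, 1.0])
b = A @ x_true

d = np.diag(A)
x = np.zeros(3)
for _ in range(100):
    # (3.98): the off-diagonal part uses the previous iterate only.
    x = (b - (A @ x - d * x)) / d
```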
In the Gauss–Seidel method, A₁ is taken instead as the lower triangular part of A (main diagonal included), which leads to
\[
x_j^{k+1} = \frac{b_j - \sum_{i=1}^{j-1} a_{j,i} x_i^{k+1} - \sum_{i=j+1}^{n} a_{j,i} x_i^{k}}{a_{j,j}}, \qquad j = 1, \ldots, n. \tag{3.100}
\]
Note the presence of xik+1 on the right-hand side of (3.100). The components of xk+1
that have already been evaluated are thus used in the computation of those that have
not. This speeds up convergence and makes it possible to save memory space.
Remark 3.16 The behavior of the Gauss–Seidel method depends on how the variables are ordered in x, contrary to what happens with the Jacobi method.
As with the Jacobi method, a sufficient condition for convergence to the solution
x of (3.1) (whatever the initial vector x0 ) is that A be diagonally dominant. This
condition is again not necessary, and convergence may take place under less restrictive
conditions.
In the successive over-relaxation method (SOR), a relaxation factor ω is introduced in the Gauss–Seidel iteration, which becomes
\[
x_j^{k+1} = (1-\omega)\, x_j^{k} + \omega\, \frac{b_j - \sum_{i=1}^{j-1} a_{j,i} x_i^{k+1} - \sum_{i=j+1}^{n} a_{j,i} x_i^{k}}{a_{j,j}}, \qquad j = 1, \ldots, n. \tag{3.103}
\]
As a result,
x^{k+1} = (1 − ω)x^k + ω x_GS^{k+1},  (3.104)
where x_GS^{k+1} is the approximation of the solution x suggested by the Gauss–Seidel
iteration.
A necessary condition for convergence is ω ∈ (0, 2). For ω = 1, the Gauss–
Seidel method is recovered. When ω < 1 the method is under-relaxed, whereas it is
over-relaxed if ω > 1. The optimal value of ω depends on A, but over-relaxation is
usually preferred, where the displacements suggested by the Gauss–Seidel method
are increased. The convergence of the Gauss–Seidel method may thus be accelerated
by extrapolating on iteration results. Methods are available to adapt ω based on past
behavior. They have largely lost their interest with the advent of Krylov subspace
iteration, however.
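A Python/NumPy sketch of SOR on a made-up diagonally dominant system (an illustration, not the book's code); taking ω = 1 recovers the Gauss–Seidel method:

```python
import numpy as np

A = np.array([[10.0, 1.0, 2.0],
              [1.0,  8.0, 3.0],
              [2.0,  1.0, 9.0]])
x_true = np.array([1.0, 2.0, 3.0])
b = A @ x_true

def sor(A, b, omega, iterations=60):
    n = len(b)
    x = np.zeros(n)
    for _ in range(iterations):
        for j in range(n):
            # Gauss-Seidel update of (3.100)...
            x_gs = (b[j] - A[j, :j] @ x[:j]
                    - A[j, j + 1:] @ x[j + 1:]) / A[j, j]
            # ...blended with the previous iterate, as in (3.104)
            x[j] = (1.0 - omega) * x[j] + omega * x_gs
    return x

x_gauss_seidel = sor(A, b, omega=1.0)
x_over_relaxed = sor(A, b, omega=1.1)
```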
x^{k+1} = −D^{-1}(L + U)x^k + D^{-1}b.  (3.105)
(3.105)
(3.106)
(3.107)
(3.108)
(3.109)
Subtract x from both sides of (3.109), and left-multiply the result by −A to get
r^{k+1} = r^k − Ar^k.  (3.110)
(3.111)
x^k = x^0 + Σ_{i=0}^{k−1} r^i.  (3.112)
Therefore,
x^k ∈ x^0 + span{r^0, Ar^0, . . . , A^{k−1}r^0},  (3.113)
where span{r^0, Ar^0, . . . , A^{k−1}r^0} is the kth Krylov subspace generated by A from
r^0, denoted by K_k(A, r^0).
Remark 3.17 The definition of Krylov subspaces implies that
K_{k−1}(A, r^0) ⊂ K_k(A, r^0),  (3.114)
and that each iteration increases the dimension of search space at most by one.
Assume, for instance, that x^0 = 0, which implies that r^0 = b, and that b is an
eigenvector of A such that
Ab = λb.  (3.115)
Then
∀k ⩾ 1, span{r^0, Ar^0, . . . , A^{k−1}r^0} = span{b}.  (3.116)
This is appropriate, as the solution is x = λ^{-1}b.
Denote by P_n(λ) the characteristic polynomial of A,
P_n(λ) = det(λI_n − A).  (3.117)
The Cayley–Hamilton theorem states that P_n(A) is the zero (n × n) matrix. In other
words, A^n is a linear combination of A^{n−1}, A^{n−2}, . . . , I_n, so
∀k ⩾ n, K_k(A, r^0) = K_n(A, r^0),  (3.118)
and the dimension of the space in which search takes place does not increase after
the first n iterations.
A crucial point, not proved here, is that there exists ν ⩽ n such that
x ∈ x^0 + K_ν(A, r^0).  (3.119)
In principle, one may thus hope to get the solution in no more than n = dim x
iterations in Krylov subspaces, whereas for Jacobi, Gauss–Seidel or SOR iterations
no such bound is available. In practice, with floating-point computations, one may
still get better results by iterating until the solution is deemed satisfactory.
When A is symmetric positive definite, solving (3.1) is equivalent to minimizing the quadratic cost function
J(x) = (1/2) xᵀAx − bᵀx.  (3.120)
Using theoretical optimality conditions presented in Sect. 9.1, it is easy to show that
the unique minimizer of this cost function is indeed x = A^{-1}b. Starting from x^k,
the approximation of x at iteration k, x^{k+1} is computed by line search along some
direction d^k as
x^{k+1}(λ_k) = x^k + λ_k d^k.  (3.121)
It is again easy to show that J(x^{k+1}(λ_k)) is minimum if
λ_k = (d^k)ᵀ(b − Ax^k) / ((d^k)ᵀAd^k).  (3.122)
The next search direction d^{k+1} is chosen so that
(d^{k+1})ᵀ A d^i = 0,   i = 0, . . . , k,  (3.123)
which means that it is conjugate with respect to A (or A-orthogonal) with all the
previous search directions. With exact computation, this would ensure convergence
to x in at most n iterations. Because of the effect of rounding errors, it may be useful
to allow more than n iterations, although n may be so large that n iterations is actually
more than can be achieved. (One often gets a useful approximation of the solution
in less than n iterations.)
After n iterations,
x^n = x^0 + Σ_{i=0}^{n−1} λ_i d^i,  (3.124)
so
x^n ∈ x^0 + span{d^0, . . . , d^{n−1}}.  (3.125)
The search directions are chosen so that
span{d^0, . . . , d^i} = K_{i+1}(A, r^0),   i = 0, 1, . . .  (3.126)
This can be achieved with an amazingly simple algorithm [19, 21], summarized in
Table 3.1. See also Sect. 9.3.4.6 and Example 9.8.
Remark 3.19 The notation := in Table 3.1 means that the variable on the left-hand
side is assigned the value resulting from the evaluation of the expression on the
right-hand side. It should not be confused with the equal sign, and one may write
k := k + 1 whereas k = k + 1 would make no sense. In MATLAB and a number of
other programming languages, however, the sign = is used instead of :=.

Table 3.1 Conjugate-gradient algorithm
r^0 := b − Ax^0,
d^0 := r^0,
δ_0 := ||r^0||₂²,
k := 0.
While ||r^k||₂ > tol, compute
  ρ_k := (d^k)ᵀAd^k,
  λ_k := δ_k/ρ_k,
  x^{k+1} := x^k + λ_k d^k,
  r^{k+1} := r^k − λ_k Ad^k,
  δ_{k+1} := ||r^{k+1}||₂²,
  β_k := δ_{k+1}/δ_k,
  d^{k+1} := r^{k+1} + β_k d^k,
  k := k + 1.
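A direct transcription of Table 3.1 (in Python/NumPy; the book's own scripts are in MATLAB), with an iteration cap added as a safeguard:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=1000):
    """Conjugate-gradient algorithm of Table 3.1;
    A must be symmetric positive definite."""
    x = x0.copy()
    r = b - A @ x
    d = r.copy()
    delta = r @ r
    k = 0
    while np.linalg.norm(r) > tol and k < max_iter:
        Ad = A @ d
        alpha = delta / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        delta_new = r @ r
        beta = delta_new / delta
        d = r + beta * d
        delta = delta_new
        k += 1
    return x

rng = np.random.default_rng(3)
G = rng.standard_normal((30, 30))
A = G @ G.T + 30.0 * np.eye(30)     # symmetric positive definite
b = rng.standard_normal(30)
x = conjugate_gradient(A, b, np.zeros(30))
```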
3.7.2.4 Preconditioning
The convergence speed of Krylov iteration strongly depends on the condition number
of A. Spectacular acceleration may be achieved by replacing (3.1) by
MAx = Mb,
(3.127)
(3.128)
42
where e_j is the jth column of I_n and m_j the jth column of M, computing M can be
split into solving n independent least-squares problems (one per column), subject to
sparsity constraints. The nonzero entries of m_j are then obtained by solving a small
unconstrained linear least-squares problem (see Sect. 9.2). The computation of the
columns of M is thus easily parallelized. The main difficulty is a proper choice for S,
which may be carried out by adaptive strategies [27]. One may start with M diagonal,
or with the same sparsity pattern as A.
Remark 3.20 Preconditioning may also be used with direct methods.
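The simplest illustration of the idea is diagonal (Jacobi) preconditioning, where M just rescales the rows of A. On a badly scaled matrix (a made-up example, in Python/NumPy), the improvement in conditioning is already spectacular:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
scales = 10.0 ** rng.uniform(-4.0, 4.0, n)
A = np.diag(scales) @ (np.eye(n) + 0.01 * rng.standard_normal((n, n)))

# Diagonal preconditioner: a cheap sparse approximation of A^{-1}.
M = np.diag(1.0 / np.diag(A))

cond_A = np.linalg.cond(A)       # huge, driven by the row scaling
cond_MA = np.linalg.cond(M @ A)  # close to 1
```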
(3.130)
and are therefore bounded. As Cholesky factorization fails if A is not positive definite,
it can also be used to test symmetric matrices for positive definiteness, which is preferable to computing the eigenvalues of A. In MATLAB, one may use U=chol(A) or
L=chol(A,lower).
When A is also large and sparse, see Sect. 3.7.2.2.
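In Python/NumPy, the same positive-definiteness test can be sketched as follows (np.linalg.cholesky plays the role of chol and raises an exception on failure):

```python
import numpy as np

def is_positive_definite(A):
    """Attempt a Cholesky factorization of the symmetric matrix A;
    failure means A is not positive definite."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

spd = np.array([[4.0, 1.0],
                [1.0, 3.0]])      # both eigenvalues are positive
not_spd = np.array([[1.0, 2.0],
                    [2.0, 1.0]])  # eigenvalues 3 and -1
```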
3.8.2 A Is Toeplitz
When all the entries in any given descending diagonal of A have the same value, i.e.,
\[
A = \begin{bmatrix}
h_0 & h_{-1} & h_{-2} & \cdots & h_{-n+1} \\
h_1 & h_0 & h_{-1} & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & h_{-2} \\
\vdots & & \ddots & h_0 & h_{-1} \\
h_{n-1} & h_{n-2} & \cdots & h_1 & h_0
\end{bmatrix}, \tag{3.134}
\]
3.8.3 A Is Vandermonde
When
\[
A = \begin{bmatrix}
1 & t_1 & t_1^2 & \cdots & t_1^n \\
1 & t_2 & t_2^2 & \cdots & t_2^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & t_{n+1} & t_{n+1}^2 & \cdots & t_{n+1}^n
\end{bmatrix}, \tag{3.135}
\]
3.8.4 A Is Sparse
A is sparse when most of its entries are zeros. This is particularly frequent when a
partial differential equation is discretized, as each node is influenced only by its close
neighbors. Instead of storing the entire matrix A, one may then use more economical
descriptions such as a list of pairs {address, value} or a list of vectors describing the
nonzero part of A, as illustrated by the following example.
Example 3.5 Tridiagonal systems
When
\[
A = \begin{bmatrix}
b_1 & c_1 & 0 & \cdots & \cdots & 0 \\
a_2 & b_2 & c_2 & \ddots & & \vdots \\
0 & a_3 & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & 0 \\
\vdots & & \ddots & \ddots & b_{n-1} & c_{n-1} \\
0 & \cdots & \cdots & 0 & a_n & b_n
\end{bmatrix}, \tag{3.136}
\]
the nonzero entries of A can be stored in three vectors a, b and c (one per nonzero
descending diagonal). This makes it possible to save memory that would have been
used unnecessarily to store zero entries of A. LU factorization then becomes extraordinarily simple using the Thomas algorithm [29].
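A Python/NumPy sketch of the Thomas algorithm (an illustration, not the code of [29]), storing only the three vectors a, b and c:

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system with subdiagonal a (a[0] unused),
    diagonal b, superdiagonal c (c[-1] unused) and right-hand side d.
    A specialization of LU factorization without pivoting."""
    n = len(b)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                 # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # backward substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Diagonally dominant tridiagonal test system with known solution.
n = 6
a = np.full(n, -1.0)
diag = np.full(n, 4.0)
c = np.full(n, -1.0)
A = np.diag(diag) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
d = A @ np.ones(n)
x = thomas(a, diag, c, d)
```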
How MATLAB handles sparse matrices is explained in [30]. A critical point when
solving large-scale systems is how the nonzero entries of A are stored. Ill-chosen
orderings may result in intense transfers to and from disk memory, thus slowing
down execution by several orders of magnitude. Algorithms (not presented here) are
available to reorder sparse matrices automatically.
1 + 3 + · · · + (2n − 1) = n².  (3.137)
Example 3.8 When A is tridiagonal, solving (3.1) with the Thomas algorithm
(a specialization of LU factorization) can be done in (8n − 6) flops only [29].
For a generic (n × n) matrix A, the number of flops required to solve a linear
system of equations turns out to be much higher than in Examples 3.7 and 3.8:
LU factorization requires (2n³/3) flops. Solving each of the two resulting triangular systems to get the solution for one right-hand side requires about n² more flops,
so the total number of flops for m right-hand sides is about (2n³/3) + m(2n²).
QR factorization requires 2n³ flops, and the total number of flops for m right-hand
sides is 2n³ + 3mn².
A particularly efficient implementation of SVD [2] requires (20n³/3) + O(n²)
flops.
Remark 3.21 For a generic (n × n) matrix A, LU, QR and SVD factorizations thus
all require O(n³) flops. They can nevertheless be ranked, from the point of view of
the number of flops required, as
LU < QR < SVD.
For small problems, each of these factorizations is obtained very quickly anyway, so
these issues become relevant only for large-scale problems or for problems that have
to be solved many times in an iterative algorithm.
When A is symmetric positive definite, Cholesky factorization applies, which
requires only n³/3 flops. The total number of flops for m right-hand sides thus
becomes (n³/3) + m(2n²).
The number of flops required by iterative methods depends on the degree of
sparsity of A, on the convergence speed of these methods (which itself depends on
the problem considered) and on the degree of approximation one is willing to tolerate
in the solution. For Krylov-space solvers, dim x is an upper bound on the number of
iterations needed to get an exact solution in the absence of rounding errors. This is
a considerable advantage over classical iterative methods.
3.10.1 A Is Dense
MATLAB offers a number of options for solving (3.1). The simplest of them is to
use Gaussian elimination
xGE = A\b;
No factorization of A is then available for later use, for instance for solving (3.1)
with the same A and another b.
It may make more sense to choose a factorization and use it. For an LU factorization with partial pivoting, one may write
[L,U,P] = lu(A);
% Same row exchange in b as in A
Pb = P*b;
% Solve Ly = Pb, with L lower triangular
opts_LT.LT = true;
y = linsolve(L,Pb,opts_LT);
% Solve Ux = y, with U upper triangular
opts_UT.UT = true;
xLUP = linsolve(U,y,opts_UT);
which gives access to the factorization of A that has been carried out. A one-liner
version with the same result would be
xLUP = linsolve(A,b);
but L, U and P would then no longer be made available for further use.
For a QR factorization, one may write
[Q,R] = qr(A);
QTb = Q'*b;
opts_UT.UT = true;
x_QR = linsolve(R,QTb,opts_UT);
and for an SVD factorization
[U,S,V] = svd(A);
xSVD = V*inv(S)*U'*b;
For an iterative solution via the Krylov method, one may use the function gmres,
which does not require A to be positive definite [23], and write
xKRY = gmres(A,b);
Although the Krylov method is particularly interesting when A is large and sparse,
nothing forbids using it on a small dense matrix, as here.
These five methods are used to solve (3.1) with
\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 + \alpha \end{bmatrix} \tag{3.138}
\]
and
\[
b = \begin{bmatrix} 10 \\ 11 \\ 12 \end{bmatrix}. \tag{3.139}
\]
For any nonzero α, the exact solution is
\[
x = \begin{bmatrix} -28/3 \\ 29/3 \\ 0 \end{bmatrix} \approx
\begin{bmatrix} -9.3333333333333333 \\ 9.6666666666666667 \\ 0 \end{bmatrix}. \tag{3.140}
\]
The fact that x_3 = 0 explains why x is independent of the numerical value taken by
α. However, the difficulty of computing x accurately does depend on this value. In
all the results to be presented in the remainder of this chapter, the condition number
referred to is for the spectral norm.
For α = 10^{−13}, cond A ≈ 10^{15} and
xGE =
-9.297539149888147e+00
9.595078299776288e+00
3.579418344519016e-02
xLUP =
-9.297539149888147e+00
9.595078299776288e+00
3.579418344519016e-02
xQR =
-9.553113553113528e+00
1.010622710622708e+01
-2.197802197802198e-01
xSVD =
-9.625000000000000e+00
1.025000000000000e+01
-3.125000000000000e-01
gmres converged at iteration 2 to a solution with
relative residual 9.9e-15.
xKRY =
-4.555555555555692e+00
1.111111111110619e-01
4.777777777777883e+00
LU factorization with partial pivoting turns out to have done a better job than QR
factorization or SVD on this ill-conditioned problem, for less computation. The
condition numbers of the matrices involved are evaluated as follows
CondA = 1.033684444145846e+15
% LU factorization
CondL = 2.055595570492287e+00
CondU = 6.920247514139799e+14
% QR factorization with partial pivoting
CondP = 1
CondQ = 1.000000000000000e+00
CondR = 1.021209931367105e+15
% SVD
CondU = 1.000000000000001e+00
CondS = 1.033684444145846e+15
CondV = 1.000000000000000e+00
For α = 10^{−5}, cond A ≈ 10^{7} and
xGE =
-9.333333332978063e+00
9.666666665956125e+00
3.552713679092771e-10
xLUP =
-9.333333332978063e+00
9.666666665956125e+00
3.552713679092771e-10
xQR =
-9.333333335508891e+00
9.666666671017813e+00
-2.175583929062594e-09
xSVD =
-9.333333335118368e+00
9.666666669771075e+00
-1.396983861923218e-09
gmres converged at iteration 3 to a solution
with relative residual 0.
xKRY =
-9.333333333420251e+00
9.666666666840491e+00
-8.690781427844740e-11
The condition numbers of the matrices involved are
CondA =
1.010884565427633e+07
% LU factorization
CondL = 2.055595570492287e+00
CondU = 6.868613692978372e+06
% QR factorization with partial pivoting
CondP = 1
CondQ = 1.000000000000000e+00
CondR = 1.010884565403081e+07
% SVD
CondU = 1.000000000000000e+00
CondS = 1.010884565427633e+07
CondV = 1.000000000000000e+00
For α = 1, cond A ≈ 88 and
xGE =
-9.333333333333330e+00
9.666666666666661e+00
3.552713678800503e-15
xLUP =
-9.333333333333330e+00
9.666666666666661e+00
3.552713678800503e-15
xQR =
-9.333333333333329e+00
9.666666666666687e+00
-2.175583928816833e-14
xSVD =
-9.333333333333286e+00
9.666666666666700e+00
-6.217248937900877e-14
gmres converged at iteration 3 to a solution with
relative residual 0.
xKRY =
-9.333333333333339e+00
9.666666666666659e+00
1.021405182655144e-14
The condition numbers of the matrices involved are
CondA = 8.844827992069874e+01
% LU factorization
CondL = 2.055595570492287e+00
CondU = 6.767412723516705e+01
% QR factorization with partial pivoting
CondP = 1
CondQ = 1.000000000000000e+00
CondR = 8.844827992069874e+01
% SVD
CondU = 1.000000000000000e+00
CondS = 8.844827992069871e+01
CondV =1.000000000000000e+00
The results xGE and xLUP are always identical, a reminder of the fact that LU factorization with partial pivoting is just a clever implementation of Gaussian elimination.
The better the conditioning of the problem, the closer the results of the five methods
get. Although the product of the condition numbers of L and U is slightly larger than
cond A, LU factorization with partial pivoting (or Gaussian elimination) turns out
here to outperform QR factorization or SVD, for less computation.
3.10.3 A Is Sparse
A and sA, standing for the (asymmetric) sparse matrix A, are built by the script
n = 1.e3
A = eye(n); % A is a 1000 by 1000 identity matrix
A(1,n) = 1+alpha;
A(n,1) = 1; % A now slightly modified
sA = sparse(A);
Thus, dim x = 1000, and sA is a sparse representation of A where the zeros are not
stored, whereas A is a dense representation of a sparse matrix, which comprises 10^{6}
entries, most of them being zeros. As in Sects. 3.10.1 and 3.10.2, A is singular for
α = 0, and its conditioning improves when α increases.
All the entries of the vector b are taken equal to one, so b is built as
b = ones(n,1);
For any α > 0, it is easy to check that the exact unique solution of (3.1) is then such
that all its entries are equal to one, except for the last one, which is equal to zero. This
system has been solved with the same script as in the previous section for Gaussian
elimination, LU factorization with partial pivoting, QR factorization and SVD, not
taking advantage of sparsity. For Krylov iteration, sA was used instead of A. The
following script was employed to tune some optional parameters of gmres:
restart = 10;
tol = 1e-12;
maxit = 15;
xKRY = gmres(sA,b,restart,tol,maxit);
(see the gmres documentation for details).
For α = 10^{−7}, cond A ≈ 4 · 10^{7} and the following results are obtained. The
time taken by each method is in seconds. As dim x = 1000, only the last two entries of the
numerical solution are provided. Recall that the first of them should be equal to one
and the last to zero.
TimeGE = 8.526009399999999e-02
LastofxGE =
1
0
TimeLUP = 1.363140280000000e-01
LastofxLUP =
1
0
TimeQR = 9.576683100000000e-02
LastofxQR =
1
0
TimeSVD = 1.395477389000000e+00
LastofxSVD =
1
0
gmres(10) converged at outer iteration 1
(inner iteration 4)
to a solution with relative residual 1.1e-21.
TimeKRY = 9.034646100000000e-02
LastofxKRY =
1.000000000000022e+00
1.551504706009954e-05
For α = 10^{−3}, cond(AᵀA) ≈ 1.6 · 10^{7} and the following results are obtained. As
dim x = 10^{6}, only the last two entries of the numerical solution are provided. Recall
that the first of them should be equal to one and the last to zero.
pcg converged at iteration 6 to a solution
with relative residual 2.2e-18.
TimePCG = 5.922985430000000e-01
LastofxPCG =
1
-5.807653514112821e-09
3.11 In Summary
Solving systems of linear equations plays a crucial role in almost all of the
methods to be considered in what follows, and often takes up most of computing
time.
Cramer's method is not even an option.
Matrix inversion is uselessly costly, unless A has a very specific structure.
The larger the condition number of A is, the more difficult the problem becomes.
Solution via LU factorization is the basic workhorse to be used if A has no
particular structure to be taken advantage of. Pivoting makes it applicable for
any nonsingular A. Although it increases the condition number of the problem,
it does so with measure and may work just as well as QR factorization or SVD
on ill-conditioned problems, for less computation.
When the solution is not satisfactory, iterative correction may lead quickly to a
spectacular improvement.
Solution via QR factorization is more costly than via LU factorization but does
not worsen conditioning. Orthonormal transformations play a central role in this
property.
Solution via SVD, also based on orthonormal transformations, is even more
costly than via QR factorization. It has the advantage of providing the condition
number of A for the spectral norm as a by-product and of making it possible to
find approximate solutions to some hopelessly ill-conditioned problems through
regularization.
Cholesky factorization is a special case of LU factorization, appropriate if A is
symmetric and positive definite. It can also be used to test matrices for positive
definiteness.
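The positive-definiteness test mentioned here can be sketched as follows (in Python/NumPy rather than the MATLAB used in this book): an attempted Cholesky factorization that fails signals that the symmetric matrix is not positive definite.

```python
import numpy as np

def is_positive_definite(A):
    # Attempt a Cholesky factorization; failure signals that the
    # symmetric matrix A is not positive definite.
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

spd = np.array([[4.0, 1.0], [1.0, 3.0]])    # positive definite
indef = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues 3 and -1
```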
When A is large and sparse, suitably preconditioned Krylov subspace iteration
has superseded classical iterative methods as it converges more quickly, more
often.
References
1. Golub, G., Van Loan, C.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)
2. Demmel, J.: Applied Numerical Linear Algebra. SIAM, Philadelphia (1997)
3. Ascher, U., Greif, C.: A First Course in Numerical Methods. SIAM, Philadelphia (2011)
4. Rice, J.: A theory of condition. SIAM J. Numer. Anal. 3(2), 287–310 (1966)
5. Demmel, J.: The probability that a numerical analysis problem is difficult. Math. Comput. 50(182), 449–480 (1988)
6. Higham, N.: Fortran codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation (algorithm 674). ACM Trans. Math. Softw. 14(4), 381–396 (1988)
7. Higham, N., Tisseur, F.: A block algorithm for matrix 1-norm estimation, with an application to 1-norm pseudospectra. SIAM J. Matrix Anal. Appl. 21, 1185–1201 (2000)
8. Higham, N.: Gaussian elimination. Wiley Interdiscip. Rev. Comput. Stat. 3(3), 230–238 (2011)
9. Stewart, G.: The decomposition approach to matrix computation. Comput. Sci. Eng. 2(1), 50–59 (2000)
10. Björck, A.: Numerical Methods for Least Squares Problems. SIAM, Philadelphia (1996)
11. Golub, G., Kahan, W.: Calculating the singular values and pseudo-inverse of a matrix. J. Soc. Indust. Appl. Math. Ser. B Numer. Anal. 2(2), 205–224 (1965)
12. Stewart, G.: On the early history of the singular value decomposition. SIAM Rev. 35(4), 551–566 (1993)
13. Varah, J.: On the numerical solution of ill-conditioned linear systems with applications to ill-posed problems. SIAM J. Numer. Anal. 10(2), 257–267 (1973)
14. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. SIAM, Philadelphia (2003)
15. Young, D.: Iterative methods for solving partial difference equations of elliptic type. Ph.D. thesis, Harvard University, Cambridge, MA (1950)
16. Gutknecht, M.: A brief introduction to Krylov space methods for solving linear systems. In: Kaneda, Y., Kawamura, H., Sasai, M. (eds.) Proceedings of the International Symposium on Frontiers of Computational Science 2005, pp. 53–62. Springer, Berlin (2007)
17. van der Vorst, H.: Krylov subspace iteration. Comput. Sci. Eng. 2(1), 32–37 (2000)
18. Dongarra, J., Sullivan, F.: Guest editors' introduction to the top 10 algorithms. Comput. Sci. Eng. 2(1), 22–23 (2000)
19. Hestenes, M., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49(6), 409–436 (1952)
20. Golub, G., O'Leary, D.: Some history of the conjugate gradient and Lanczos algorithms: 1948–1976. SIAM Rev. 31(1), 50–102 (1989)
21. Shewchuk, J.: An introduction to the conjugate gradient method without the agonizing pain. Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh (1994)
22. Paige, C., Saunders, M.: Solution of sparse indefinite systems of linear equations. SIAM J. Numer. Anal. 12(4), 617–629 (1975)
23. Saad, Y., Schultz, M.: GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7(3), 856–869 (1986)
24. van der Vorst, H.: Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 13(2), 631–644 (1992)
25. Benzi, M.: Preconditioning techniques for large linear systems: a survey. J. Comput. Phys. 182, 418–477 (2002)
26. Saad, Y.: Preconditioning techniques for nonsymmetric and indefinite linear systems. J. Comput. Appl. Math. 24, 89–105 (1988)
27. Grote, M., Huckle, T.: Parallel preconditioning with sparse approximate inverses. SIAM J. Sci. Comput. 18(3), 838–853 (1997)
28. Higham, N.: Cholesky factorization. Wiley Interdiscip. Rev. Comput. Stat. 1(2), 251–254 (2009)
29. Ciarlet, P.: Introduction to Numerical Linear Algebra and Optimization. Cambridge University Press, Cambridge (1989)
30. Gilbert, J., Moler, C., Schreiber, R.: Sparse matrices in MATLAB: design and implementation. SIAM J. Matrix Anal. Appl. 13, 333–356 (1992)
Chapter 4
This chapter is about the evaluation of the inverse, determinant, eigenvalues, and
eigenvectors of an (n × n) matrix A.
Before evaluating the inverse of a matrix, check that the actual problem is not
rather solving a system of linear equations (see Chap. 3).
Unless A has a very specific structure, such as being diagonal, it is usually inverted
by solving

A A^{-1} = I_n   (4.1)

for A^{-1}. This is equivalent to solving the n linear systems
A x_i = e_i,   i = 1, …, n,   (4.2)
costs only about 2n³/3 + 2n² flops.
For LU factorization with partial pivoting, solving (4.2) means solving the triangular systems
L y_i = P e_i,   i = 1, …, n,   (4.3)
for yi , and
U x_i = y_i,   i = 1, …, n,   (4.4)
for xi .
For QR factorization, it means solving the triangular systems
R x_i = Q^T e_i,   i = 1, …, n,   (4.5)
for xi .
For SVD factorization, one has directly
A^{-1} = V Σ^{-1} U^T,   (4.6)
with all of them requiring O(n³) flops. This is not that bad, considering that the mere
product of two generic (n × n) matrices already requires O(n³) flops.
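As an illustration in Python/NumPy (the test matrix is arbitrary), the n systems (4.2) can be solved in one call by using the identity matrix as a block of right-hand sides, so that one factorization of A serves all columns:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Arbitrary well-conditioned test matrix
A = rng.standard_normal((n, n)) + n * np.eye(n)

# Solving A X = I_n amounts to the n systems (4.2); passing the whole
# identity as right-hand side lets one factorization serve all columns.
A_inv = np.linalg.solve(A, np.eye(n))
inversion_error = np.linalg.norm(A @ A_inv - np.eye(n))
```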
The determinant of A is readily deduced from these factorizations. With LU
factorization and partial pivoting,

det A = det P^T · det L · det U,

where det L = 1 and

det P^T = (−1)^p,   (4.10)

with p the number of row exchanges carried out during pivoting, so

det A = (−1)^p ∏_{i=1}^{n} u_{i,i}.

With QR factorization, A = QR and

det Q = (−1)^q,   (4.14)

so

det A = (−1)^q ∏_{i=1}^{n} r_{i,i}.   (4.16)
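A minimal sketch of the LU-based determinant computation (in Python, with a hand-rolled elimination for clarity; in practice a library LU factorization would be used):

```python
import numpy as np

def det_via_lu(A):
    # det A = (-1)**p * prod(u_ii), with p the number of row
    # exchanges performed during partial pivoting.
    U = np.array(A, dtype=float)
    n = U.shape[0]
    p = 0  # number of row exchanges
    for k in range(n - 1):
        pivot = k + np.argmax(np.abs(U[k:, k]))
        if pivot != k:
            U[[k, pivot]] = U[[pivot, k]]
            p += 1
        if U[k, k] == 0.0:
            return 0.0
        for i in range(k + 1, n):
            m = U[i, k] / U[k, k]
            U[i, k:] -= m * U[k, k:]
    return (-1) ** p * np.prod(np.diag(U))

A = np.array([[2.0, 1.0, 1.0],
              [4.0, -6.0, 0.0],
              [-2.0, 7.0, 2.0]])
d = det_via_lu(A)
```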
only because the roots of a polynomial equation may be very sensitive to errors in the
coefficients of the polynomial (see the perfidious polynomial (4.59) in Sect. 4.4.3).
Example 4.3 will show that one may, instead, transform the problem of finding the
roots of a polynomial equation into that of finding the eigenvalues of a matrix.
Initially, all pages may be given the same probability, so the entries of the initial
probability vector x^0 are

x_i^0 = 1/N,   i = 1, …, N.   (4.19)
The evolution of x^k when one more page change takes place is described by the
Markov chain

x^{k+1} = S x^k,   (4.20)
where the transition matrix S corresponds to a model of the behavior of surfers.
Assume, for the time being, that a surfer randomly follows any one of the hyperlinks
present in the current page (each with the same probability). S is then a sparse matrix,
easily deduced from G, as follows. Its entry s_{i,j} is the probability of jumping from
page j to page i via a hyperlink, and s_{j,j} = 0 as one cannot stay in the jth page.
Each of the n_j nonzero entries of the jth column of S is equal to 1/n_j, so the sum
of all the entries of any given column of S is equal to one.
This model is not realistic, as some pages do not contain any hyperlink or are not
pointed to by any hyperlink. This is why it is assumed instead that the surfer may
randomly either jump to any page (with probability 0.15) or follow any one of the
hyperlinks present in the current page (with probability 0.85). This leads to replacing
S in (4.20) by
A = αS + (1 − α) (1 1^T)/N,   (4.21)

with α = 0.85 and 1 an N-dimensional column vector full of ones. With this model,
the probability of staying at the same page is no longer zero, but this makes evaluating
Axk almost as simple as if A were sparse; see Sect. 16.1.
After an infinite number of clicks, the asymptotic distribution of probabilities x^∞
satisfies

A x^∞ = x^∞,   (4.22)
so x^∞ is an eigenvector of A, associated with a unit eigenvalue. Eigenvectors are
defined up to a multiplicative constant, but the meaning of x^∞ implies that

∑_{i=1}^{N} x_i^∞ = 1.   (4.23)
Once x^∞ has been evaluated, the relevant pages with the highest values of their entry
in x^∞ are presented first. The transition matrices of Markov chains are such that their
eigenvalue with the largest magnitude is equal to one. Ranking WEB pages thus boils
down to computing the eigenvector associated with the (known) eigenvalue with the
largest magnitude of a tremendously large (and almost sparse) matrix.
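The whole ranking scheme can be sketched on a tiny hypothetical four-page web (Python/NumPy; the link matrix G below is invented for illustration):

```python
import numpy as np

# Tiny hypothetical 4-page web; G[i, j] = 1 if page j links to page i.
G = np.array([[0, 0, 1, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)

n_j = G.sum(axis=0)    # number of outgoing links of each page
S = G / n_j            # each column of S sums to one, as in the text
alpha = 0.85
N = 4
A = alpha * S + (1 - alpha) * np.ones((N, N)) / N   # Eq. (4.21)

x = np.full(N, 1.0 / N)   # uniform initial probabilities, Eq. (4.19)
for _ in range(200):      # power iteration; x converges to x_inf
    x = A @ x
# x now approximately satisfies A x = x and sums to one
```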
Example 4.2 Bridge oscillations
On the morning of November 7, 1940, the Tacoma Narrows bridge twisted violently in the wind before collapsing into the cold waters of the Puget Sound. The
bridge had earned the nickname Galloping Gertie for its unusual behavior, and it
is an extraordinary piece of luck that no thrill-seeker was killed in the disaster. The
video of the event, available on the WEB, is a stark reminder of the importance of
taking potential oscillations into account during bridge design.
A linear dynamical model of a bridge, valid for small displacements, is given by
the vector ordinary differential equation
M ẍ + C ẋ + K x = u,   (4.24)
(K − ω_k² M) ξ_k = 0.   (4.27)

Computing ω_k² and ξ_k is known as a generalized eigenvalue problem [3]. Usually,
M is invertible, so this equation can be transformed into
Ã ξ_k = λ_k ξ_k,   (4.28)

with Ã = M^{-1} K and λ_k = ω_k².

Example 4.3
The roots of the polynomial equation

x^n + a_{n−1} x^{n−1} + ··· + a_1 x + a_0 = 0   (4.29)

are the eigenvalues of its companion matrix

    | 0  0  ···  0  −a_0     |
    | 1  0  ···  0  −a_1     |
A = | 0  1  ···  0  −a_2     |   (4.30)
    | ⋮      ⋱   ⋮   ⋮       |
    | 0  0  ···  1  −a_{n−1} |

and one of the most efficient methods for computing these roots is to look for the
eigenvalues of A.
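A minimal sketch of this companion-matrix approach (Python/NumPy; the helper name is mine):

```python
import numpy as np

def companion_roots(a):
    # Roots of x^n + a[n-1] x^(n-1) + ... + a[0] = 0 as the
    # eigenvalues of the companion matrix (4.30).
    n = len(a)
    A = np.zeros((n, n))
    A[1:, :-1] = np.eye(n - 1)   # subdiagonal of ones
    A[:, -1] = -np.asarray(a)    # last column carries the -a_i
    return np.linalg.eigvals(A)

# x^2 - 3x + 2 = (x - 1)(x - 2): a0 = 2, a1 = -3
r = np.sort(companion_roots([2.0, -3.0]).real)
```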
The basic power iteration

v^{k+1} = A v^k   (4.31)

will then decrease the angle between v^{k+1} and v_max at each iteration. To ensure that
||v^{k+1}||_2 = 1, (4.31) is replaced by

v^{k+1} = (1/||A v^k||_2) A v^k.   (4.32)

Upon convergence,

A v^∞ = ||A v^∞||_2 v^∞,   (4.33)

so λ_max = ||A v^∞||_2 and v_max = v^∞. Convergence may be slow if other eigenvalues
are close in magnitude to λ_max.
Remark 4.2 When λ_max is negative, the method becomes

v^{k+1} = −(1/||A v^k||_2) A v^k,   (4.34)

and, upon convergence,

A v^∞ = −||A v^∞||_2 v^∞.   (4.35)
Remark 4.3 If A is symmetric, then its eigenvectors are orthogonal and, provided
that ||v_max||_2 = 1, the matrix

A′ = A − λ_max v_max v_max^T   (4.36)

has the same eigenvalues and eigenvectors as A, except for v_max, which is now
associated with λ = 0. One may thus apply power iterations to find the eigenvalue
with the second largest magnitude and the corresponding eigenvector. This deflation
procedure should be iterated with caution, as errors cumulate.
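Power iteration and one deflation step can be sketched as follows (Python/NumPy, on an arbitrary symmetric test matrix):

```python
import numpy as np

def power_iteration(A, iters=500):
    # Normalized power iteration (4.32); assumes the dominant
    # eigenvalue is real, positive and isolated.
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)
    return np.linalg.norm(A @ v), v   # lambda_max, v_max

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # symmetric test matrix
lam1, v1 = power_iteration(A)
# Deflation (4.36): remove lambda_max, then iterate again
A_defl = A - lam1 * np.outer(v1, v1)
lam2, v2 = power_iteration(A_defl)
```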
Inverse power iteration

v^{k+1} = (1/||A^{-1} v^k||_2) A^{-1} v^k   (4.37)

might be used to compute λ_min and the corresponding eigenvector (provided that
λ_min > 0). Inverting A is avoided by solving the system

A v^{k+1} = v^k   (4.38)

for v^{k+1} and normalizing the result. If a factorization of A is used for this purpose,
it needs to be carried out only once. A trivial modification of the algorithm makes it
possible to deal with the case λ_min < 0.
Since

A x_i = λ_i x_i,   (4.41)

we have

(A − σI) x_i = (λ_i − σ) x_i.   (4.42)

Multiply (4.42) on the left by (A − σI)^{-1} (λ_i − σ)^{-1}, to get

(A − σI)^{-1} x_i = (λ_i − σ)^{-1} x_i.   (4.43)

The vector x_i is thus also an eigenvector of (A − σI)^{-1}, associated with the eigenvalue
(λ_i − σ)^{-1}. By choosing σ close enough to λ_i, and provided that the other eigenvalues
of A are far enough, one can ensure that, for all j ≠ i,

1/|λ_i − σ| ≫ 1/|λ_j − σ|.   (4.44)
Shifted inverse power iteration thus solves

(A − σI) v^{k+1} = v^k   (4.46)

for v^{k+1} (usually via an LU factorization with partial pivoting of (A − σI), which
needs to be carried out only once). When σ gets close to λ_i, the matrix (A − σI)
becomes nearly singular, but the algorithm nevertheless works very well, at least
when A is normal. Its properties, including its behavior on non-normal matrices, are
investigated in [4].
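A sketch of shifted inverse power iteration (Python/NumPy; for simplicity the linear system is re-solved at each iteration instead of reusing a single LU factorization):

```python
import numpy as np

def shifted_inverse_power(A, sigma, iters=50):
    # Repeatedly solve (A - sigma*I) v_new = v, as in (4.46),
    # and normalize; converges to the eigenvector whose eigenvalue
    # is closest to sigma.
    n = A.shape[0]
    M = A - sigma * np.eye(n)
    v = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        w = np.linalg.solve(M, v)
        v = w / np.linalg.norm(w)
    # Rayleigh quotient recovers the eigenvalue closest to sigma
    return v @ A @ v, v

A = np.array([[4.0, 1.0], [1.0, 3.0]])
lam, v = shifted_inverse_power(A, sigma=2.0)  # targets eigenvalue near 2
```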
4.3.6 QR Iteration
QR iteration, based on QR factorization, makes it possible to compute all the eigenvalues
of a not-too-large and possibly dense matrix A with real coefficients. These
eigenvalues may be real or complex-conjugate. It is only assumed that their magnitudes
differ (except, of course, for a pair of complex-conjugate eigenvalues). An
interesting account of the history of this fascinating algorithm can be found in [5].
Its convergence is studied in [6].
The basic method is as follows. Starting with A0 = A and i = 0, repeat until
convergence
1. Factor Ai as Qi Ri .
2. Invert the order of the resulting factors Qi and Ri to get Ai+1 = Ri Qi .
3. Increment i by one and go to Step 1.
For reasons not trivial to explain, this transfers mass from the lower triangular part
of A_i to the upper triangular part of A_{i+1}. The fact that R_i = Q_i^{-1} A_i implies that
A_{i+1} = Q_i^{-1} A_i Q_i. The matrices A_{i+1} and A_i therefore have the same eigenvalues.
Upon convergence, A^∞ is a block upper triangular matrix with the same eigenvalues
as A, in what is called a real Schur form. There are only (1 × 1) and (2 × 2) diagonal
blocks in A^∞. Each (1 × 1) block contains a real eigenvalue of A, whereas the
eigenvalues of the (2 × 2) blocks are complex-conjugate eigenvalues of A. If B
is one such (2 × 2) block, then its eigenvalues are the roots of the second-order
polynomial equation

λ² − trace(B) λ + det B = 0.   (4.47)
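The three steps above can be sketched as follows (Python/NumPy, on a small symmetric test matrix so that A_i converges to a diagonal matrix):

```python
import numpy as np

def qr_iteration(A, iters=200):
    # Basic (unshifted) QR iteration: factor A_i = Q_i R_i, then
    # set A_{i+1} = R_i Q_i, which preserves the eigenvalues.
    Ai = np.array(A, dtype=float)
    for _ in range(iters):
        Q, R = np.linalg.qr(Ai)
        Ai = R @ Q
    return Ai

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])  # symmetric, distinct eigenvalues
A_inf = qr_iteration(A)
eigs = np.sort(np.diag(A_inf))
```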
The resulting factorization is

A = Q A^∞ Q^T,   (4.48)

with

Q = ∏_i Q_i.   (4.49)
Remark 4.4 After pointing out that "good implementations [of the QR algorithm]
have long been much more widely available than good explanations", [7] shows that
the QR algorithm is just a clever and numerically robust implementation of the power
iteration method of Sect. 4.3.3 applied to an entire basis of R^n rather than to a single
vector.
Remark 4.5 Whenever A is not an upper Hessenberg matrix (i.e., an upper triangular
matrix completed with an additional nonzero descending diagonal just below the
main descending diagonal), a trivial variant of the QR algorithm is used first to put
it into this form. This speeds up QR iteration considerably, as the upper Hessenberg
form is preserved by the iterations. Note that the companion matrix of Example 4.3
is already in upper Hessenberg form.
If A is symmetric, then all the eigenvalues λ_i (i = 1, …, n) of A are real, and the
corresponding eigenvectors v_i are orthogonal. QR iteration then produces a series of
symmetric matrices A_k that should converge to the diagonal matrix

Λ = Q^{-1} A Q,   (4.51)

with Q orthonormal and

Λ = diag(λ_1, λ_2, …, λ_n).   (4.52)

Equation (4.51) implies that

A Q = Q Λ,   (4.53)

or, equivalently,

A q_i = λ_i q_i,   i = 1, …, n,   (4.54)

where q_i is the ith column of Q. Thus, q_i is the eigenvector associated with λ_i, and
the QR algorithm computes the spectral decomposition of A

A = Q Λ Q^T.   (4.55)
When A is not symmetric, computing its eigenvectors from the Schur decomposition becomes significantly more complicated; see, e.g., [8].
Consider, for example,

A = | 0  1 |
    | 1  0 | .

Its QR factorization is A = QR with

Q = | 0  1 |       R = | 1  0 |
    | 1  0 | ,         | 0  1 | ,

so RQ = A and the method is stuck. This is not surprising, as the eigenvalues of A have the same
absolute value (λ_1 = 1 and λ_2 = −1).
To bypass this difficulty and speed up convergence, the basic shifted QR method
proceeds as follows. Starting with A_0 = A and i = 0, it repeats until convergence
1. Choose a shift σ_i.
2. Factor A_i − σ_i I as Q_i R_i.
3. Invert the order of the resulting factors Q_i and R_i and compensate the shift, to
get A_{i+1} = R_i Q_i + σ_i I.
A possible strategy is as follows. First set σ_i to the value of the last diagonal
entry of A_i, to speed up convergence of the last row, then set σ_i to the value of the
penultimate diagonal entry of A_i, to speed up convergence of the penultimate row,
and so on.
Much work has been carried out on the theoretical properties and details of the
implementation of (shifted) QR iteration, and its surface has only been scratched
here. QR iteration, which has been dubbed "one of the most remarkable algorithms
in numerical mathematics" ([9], quoted in [8]), turns out to converge in more general
situations than those for which its convergence has been proven. It has, however,
two main drawbacks. First, the eigenvalues with small magnitudes may be evaluated
with insufficient precision, which may justify iterative improvement, for instance by
(shifted) inverse power iteration. Second, the QR algorithm is not suited for very
large, sparse matrices, as it destroys sparsity. On the numerical solution of large
eigenvalue problems, the reader may consult [3], and discover that Krylov subspaces
once again play a crucial role.
For α = 10^{-13},
TrueDet = -3.000000000000000e-13
REdetDF = -7.460615985110166e-03
REdetLUP = -7.460615985110166e-03
REdetQR = -1.010931238834050e-02
REdetSVD = -2.205532173587620e-02
For α = 10^{-5},
TrueDet = -3.000000000000000e-05
REdetDF = -8.226677621822146e-11
REdetLUP = -8.226677621822146e-11
REdetQR = -1.129626855380858e-10
REdetSVD = -1.372496047658452e-10
The dedicated function and LU factorization with partial pivoting thus give slightly
better results than the more expensive QR or SVD approaches.
λ_2 = −1.116843969807017,   (4.56)
λ_3 = 1.666666666666699 · 10^{-14}.   (4.57)
P(x) = ∏_{i=1}^{20} (x − i).   (4.59)
We expand P(x) using poly and look for its roots using roots, which is based on
QR iteration applied to the companion matrix of the polynomial. The script
r = zeros(20,1);
for i=1:20,
r(i) = i;
end
% Computing the coefficients
% of the power series form
pol = poly(r);
% Computing the roots
PolRoots = roots(pol)
yields
PolRoots =
2.000032487811079e+01
1.899715998849890e+01
1.801122169150333e+01
1.697113218821587e+01
1.604827463749937e+01
1.493535559714918e+01
1.406527290606179e+01
1.294905558246907e+01
1.203344920920930e+01
1.098404124617589e+01
1.000605969450971e+01
8.998394489161083e+00
8.000284344046330e+00
6.999973480924893e+00
5.999999755878211e+00
5.000000341909170e+00
3.999999967630577e+00
3.000000001049188e+00
1.999999999997379e+00
9.999999999998413e-01
These results are not very accurate. Worse, they turn out to be extremely sensitive
to tiny perturbations of some of the coefficients of the polynomial in the power
series form (4.29). If, for instance, the coefficient of x^19, which is equal to −210,
is perturbed by adding 10^{-7} to it while leaving all the other coefficients unchanged,
then the solutions provided by roots become
PertPolRoots =
2.042198199932168e+01 + 9.992089606340550e-01i
2.042198199932168e+01 - 9.992089606340550e-01i
1.815728058818208e+01 + 2.470230493778196e+00i
1.815728058818208e+01 - 2.470230493778196e+00i
1.531496040228042e+01 + 2.698760803241636e+00i
1.531496040228042e+01 - 2.698760803241636e+00i
1.284657850244477e+01 + 2.062729460900725e+00i
1.284657850244477e+01 - 2.062729460900725e+00i
1.092127532120366e+01 + 1.103717474429019e+00i
1.092127532120366e+01 - 1.103717474429019e+00i
9.567832870568918e+00
9.113691369146396e+00
7.994086000823392e+00
7.000237888287540e+00
5.999998537003806e+00
4.999999584089121e+00
4.000000023407260e+00
2.999999999831538e+00
1.999999999976565e+00
1.000000000000385e+00
Ten of the 20 roots are now found to be complex conjugate, and radically different
from what they were in the unperturbed case. This illustrates the fact that finding the
roots of a polynomial equation from the coefficients of its power series form may be
an ill-conditioned problem. This was well known for multiple roots or roots that are
close to one another, but discovering that it could also affect a polynomial such as
(4.59), which has none of these characteristics, was, in Wilkinson's words, "the most
traumatic experience in [his] career as a numerical analyst" [10].
4.082482904638510e-01
-8.164965809277283e-01
4.082482904638707e-01
The diagonal entries of DiagonalizedA are, in the same order,
1.611684396980710e+01
-1.116843969807017e+00
1.551410816840699e-14
They are thus identical to the eigenvalues previously obtained with the instruction
eig(A).
A (very partial) check of the quality of these results can be carried out with the
script
Residual = A*EigVect-EigVect*DiagonalizedA;
NormResidual = norm(Residual,'fro')
which yields
NormResidual = 1.155747735077462e-14
4.5 In Summary
Think twice before inverting a matrix. You may just want to solve a system of
linear equations.
When necessary, the inversion of an (n × n) matrix can be carried out by solving
n systems of n linear equations in n unknowns. If an LU or QR factorization of A
is used, then it needs to be performed only once.
Think twice before evaluating a determinant. You may be more interested in a
condition number.
Computing the determinant of A is easy from an LU or QR factorization of A. The
result based on QR factorization requires more computation but should be more
robust to ill conditioning.
Power iteration can be used to compute the eigenvalue of A with the largest magnitude, provided that it is real and unique, and the corresponding eigenvector. It
is particularly interesting when A is large and sparse. Variants of power iteration
can be used to compute the eigenvalue of A with the smallest magnitude and the
corresponding eigenvector, or the eigenvector associated with any approximately
known isolated eigenvalue.
(Shifted) QR iteration is the method of choice for computing all the eigenvalues of
A simultaneously. It can also be used to compute the corresponding eigenvectors,
which is particularly easy if A is symmetric.
(Shifted) QR iteration can also be used for simultaneously computing all the
roots of a polynomial equation in a single indeterminate. The results may be very
sensitive to the values of the coefficients of the polynomial in power series form.
References
1. Langville, A., Meyer, C.: Google's PageRank and Beyond. Princeton University Press, Princeton (2006)
2. Bryan, K., Leise, T.: The $25,000,000,000 eigenvector: the linear algebra behind Google. SIAM Rev. 48(3), 569–581 (2006)
3. Saad, Y.: Numerical Methods for Large Eigenvalue Problems, 2nd edn. SIAM, Philadelphia (2011)
4. Ipsen, I.: Computing an eigenvector with inverse iteration. SIAM Rev. 39, 254–291 (1997)
5. Parlett, B.: The QR algorithm. Comput. Sci. Eng. 2(1), 38–42 (2000)
6. Wilkinson, J.: Convergence of the LR, QR, and related algorithms. Comput. J. 8, 77–84 (1965)
7. Watkins, D.: Understanding the QR algorithm. SIAM Rev. 24(4), 427–440 (1982)
8. Ciarlet, P.: Introduction to Numerical Linear Algebra and Optimization. Cambridge University Press, Cambridge (1989)
9. Strang, G.: Introduction to Applied Mathematics. Wellesley-Cambridge Press, Wellesley (1986)
10. Wilkinson, J.: The perfidious polynomial. In: Golub, G. (ed.) Studies in Numerical Analysis, Studies in Mathematics, vol. 24, pp. 1–28. Mathematical Association of America, Washington, DC (1984)
11. Acton, F.: Numerical Methods That (Usually) Work, revised edn. Mathematical Association of America, Washington, DC (1990)
12. Farouki, R.: The Bernstein polynomial basis: a centennial retrospective. Comput. Aided Geom. Des. 29, 379–419 (2012)
Chapter 5
5.1 Introduction
Consider a function f() such that
y = f(x),
(5.1)
with x a vector of inputs and y a vector of outputs, and assume it is a black box, i.e., it
can only be evaluated numerically and nothing is known about its formal expression.
Assume further that f() has been evaluated at N different numerical values xi of x,
so the N corresponding numerical values of the output vector
yi = f(xi ), i = 1, . . . , N ,
(5.2)
are known. Let g() be another function, usually much simpler to evaluate than f(),
and such that
(5.3)
g(xi ) = f(xi ), i = 1, . . . , N .
Computing g(x) is called interpolation if x is inside the convex hull of the xi s,
i.e., the smallest convex polytope that contains all of them. Otherwise, one speaks
of extrapolation (Fig. 5.1). A must-read on interpolation (and approximation) with
polynomial and rational functions is [1]; see also the delicious [2].
Although the methods developed for interpolation can also be used for extrapolation, the latter is much more dangerous. When at all possible, it should
therefore be avoided by enclosing the domain of interest in the convex hull of
the xi s.
Remark 5.1 It is not always a good idea to interpolate, if only because the data yi
are often corrupted by noise. It is sometimes preferable to get a simpler model that
Fig. 5.1 Extrapolation takes place outside the convex hull of the x_i's
satisfies

g(x_i) ≈ f(x_i),   i = 1, …, N.   (5.4)
5.2 Examples
Example 5.1 Computer experiments
Actual experiments in the physical world are increasingly being replaced by
numerical computation. To design cars that meet safety norms during crashes, for
instance, manufacturers have partly replaced the long and costly actual crashing of
prototypes by numerical simulations, much quicker and much less expensive but still
computer intensive.
A numerical computer code may be viewed as a black box that evaluates the
numerical values of its output variables (stacked in y) for given numerical values of
its input variables (stacked in x). When the code is deterministic (i.e., involves no
pseudorandom generator), it defines a function
y = f(x).
(5.5)
Except in trivial cases, this function can only be studied through computer experiments, where potentially interesting numerical values of its input vector are used to
compute the corresponding numerical values of its output vector [3].
To limit the number of executions of complex code, one may wish to replace f()
by a function g() much simpler to evaluate and such that
g(x) f(x)
(5.6)
for any x in some domain of interest X. Requesting that the simple code implementing
g() give the same outputs as the complex code implementing f() for all the input
vectors xi (i = 1, . . . , N ) at which f() has been evaluated is equivalent to requesting
that the interpolation Eq. (5.3) be satisfied.
Example 5.2 Prototyping
Assume now that a succession of prototypes are built for different values of a
vector x of design parameters, with the aim of getting a satisfactory product, as
quantified by the value of a vector y of performance characteristics measured on
these prototypes. The available data are again in the form (5.2), and one may again
wish to have at one's disposal a numerical code evaluating a function g such that
(5.3) be satisfied. This will help suggest new promising values of x, for which new
prototypes could be built. The very same tools that are used in computer experiments
may therefore also be employed here.
Example 5.3 Mining surveys
By drilling at latitude x1i , longitude x2i , and depth x3i in a gold field, one gets a
sample with concentration yi in gold. Concentration depends on location, so yi =
f (xi ), where xi = (x1i , x2i , x3i )T . From a set of measurements of concentrations in
such very costly samples, one wishes to deduce the most promising region, via the
interpolation of f (). This motivated the development of Kriging, to be presented in
Sect. 5.4.3. Although Kriging finds its origins in geostatistics, it is increasingly used
in computer experiments as well as in prototyping.
Figure 5.2 illustrates the obvious fact that the interpolating function is not unique. It
will be searched for in a prespecified class of functions, for instance polynomials or
rational functions (i.e., ratios of polynomials).
A polynomial in power series form is written as

P_n(x) = ∑_{i=0}^{n} a_i x^i,   (5.8)

and interpolation requires that

P_n(x_j) = y_j,   j = 0, …, n.   (5.10)
p_0 = a_n,
p_i = p_{i−1} x + a_{n−i}   (i = 1, …, n),
P(x) = p_n.   (5.11)
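Horner's scheme (5.11) can be sketched in Python (with the coefficient ordering a[0] = a_0, …, a[n] = a_n):

```python
def horner(a, x):
    # Evaluate P(x) = a[0] + a[1]*x + ... + a[n]*x**n with
    # Horner's scheme (5.11): p <- p*x + a_{n-i}.
    p = 0.0
    for coeff in reversed(a):
        p = p * x + coeff
    return p

# P(x) = 1 + 2x + 3x^2, so P(2) = 1 + 4 + 12 = 17
value = horner([1.0, 2.0, 3.0], 2.0)
```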
x = (2 x_initial − a − b)/(b − a),   (5.12)
so this is not restrictive.) A key point is how the x j s are distributed in [1, 1]. When
they are regularly spaced, interpolation should only be considered practical for small
values of n. It may otherwise yield useless results, with spurious oscillations known
as Runge phenomenon. This can be avoided by using Chebyshev points [1, 2], for
instance Chebyshev points of the second kind, given by
x_j = cos(jπ/n),   j = 0, 1, …, n.   (5.13)
P_n(x) = ∑_{j=0}^{n} ( ∏_{k≠j} (x − x_k)/(x_j − x_k) ) y_j.   (5.14)
The evaluation of p from the data is thus bypassed. It is trivial to check that Pn (x j ) =
y j since, for x = x j , all the products in (5.14) are equal to zero but the jth, which
is equal to 1. Despite its simplicity, (5.14) is seldom used in practice, because it is
numerically unstable.
A very useful reformulation is the barycentric Lagrange interpolation formula

P_n(x) = [ ∑_{j=0}^{n} (w_j/(x − x_j)) y_j ] / [ ∑_{j=0}^{n} w_j/(x − x_j) ],   (5.15)

with the barycentric weights

w_j = 1 / ∏_{k≠j} (x_j − x_k),   j = 0, 1, …, n.   (5.16)
These weights thus depend only on the location of the evaluation points x j , not on
the values of the corresponding data y j . They can therefore be computed once and
for all for a given node configuration. The result is particularly simple for Chebyshev
points of the second kind, as
w_j = (−1)^j δ_j,   j = 0, 1, …, n,   (5.17)

with δ_j = 1/2 for j = 0 and j = n, and δ_j = 1 otherwise.
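A sketch of barycentric interpolation at Chebyshev points (Python/NumPy; Runge's function is used as a test case, and the helper name is mine):

```python
import numpy as np

def barycentric_interp(x_nodes, y_nodes, w, x):
    # Barycentric formula (5.15); exact node hits returned directly
    # to avoid a division by zero.
    diff = x - x_nodes
    exact = np.where(diff == 0.0)[0]
    if exact.size:
        return y_nodes[exact[0]]
    tmp = w / diff
    return np.sum(tmp * y_nodes) / np.sum(tmp)

n = 40
j = np.arange(n + 1)
x_nodes = np.cos(j * np.pi / n)   # Chebyshev points (5.13)
w = (-1.0) ** j                   # weights (5.17), with the
w[0] *= 0.5                       # endpoint factors delta_j = 1/2
w[-1] *= 0.5

def f(x):
    return 1.0 / (1.0 + 25.0 * x ** 2)  # Runge's function

y_nodes = f(x_nodes)
approx = barycentric_interp(x_nodes, y_nodes, w, 0.3)
```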
A = | 1  x_0  x_0²  ⋯  x_0^n |
    | 1  x_1  x_1²  ⋯  x_1^n |
    | ⋮   ⋮    ⋮         ⋮   |   (5.19)
    | 1  x_n  x_n²  ⋯  x_n^n |

and

y = (y_0, y_1, …, y_n)^T.   (5.20)
A is a Vandermonde matrix, notoriously ill-conditioned for large n.
Remark 5.4 The fact that a Vandermonde matrix is ill-conditioned does not mean
that the corresponding interpolation problem cannot be solved. With appropriate
alternative formulations, it is possible to build interpolating polynomials of very
high degree. This is spectacularly illustrated in [2], where a sawtooth function is
interpolated with a 10,000th-degree polynomial at Chebyshev nodes. The plot of the
interpolant (using a clever implementation of the barycentric formula that requires
only O(n) operations for evaluating Pn (x)) is indistinguishable from the plot of the
function itself.
Remark 5.5 Any nth degree polynomial may be written as

P_n(x, p) = ∑_{i=0}^{n} a_i φ_i(x),   (5.21)

where the φ_i(x)'s form a basis and p = (a_0, …, a_n)^T. Equation (5.8) corresponds
to the power basis, where φ_i(x) = x^i, and the resulting polynomial representation is
called the power series form. For any other polynomial basis, the parameters of the
interpolatory polynomial are obtained by solving (5.18) for p, with (5.19) replaced
by

A = | 1  φ_1(x_0)  φ_2(x_0)  ⋯  φ_n(x_0) |
    | 1  φ_1(x_1)  φ_2(x_1)  ⋯  φ_n(x_1) |
    | ⋮     ⋮         ⋮             ⋮     |   (5.22)
    | 1  φ_1(x_n)  φ_2(x_n)  ⋯  φ_n(x_n) |
One may use, for instance, the Legendre basis, such that

φ_0(x) = 1,
φ_1(x) = x,
(i + 1) φ_{i+1}(x) = (2i + 1) x φ_i(x) − i φ_{i−1}(x),   i = 1, …, n − 1.   (5.23)

As

∫_{−1}^{1} φ_i(τ) φ_j(τ) dτ = 0   (5.24)

whenever i ≠ j, Legendre polynomials are orthogonal on [−1, 1]. This makes the
linear system to be solved better conditioned than with the power basis.
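The recurrence (5.23) and the orthogonality relation (5.24) can be checked numerically (Python/NumPy; the integral is approximated by a simple Riemann sum):

```python
import numpy as np

def legendre(i, x):
    # Three-term recurrence (5.23), vectorized over x
    p_prev = np.ones_like(x)
    p = np.asarray(x, dtype=float)
    if i == 0:
        return p_prev
    for k in range(1, i):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

# Numerical check of the orthogonality relation (5.24)
x = np.linspace(-1.0, 1.0, 20001)
dx = x[1] - x[0]
inner_23 = np.sum(legendre(2, x) * legendre(3, x)) * dx  # should be ~0
norm_2 = np.sum(legendre(2, x) ** 2) * dx                # should be ~2/5
```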
recurrence equation

P_{i,i}(x) = y_i,   i = 1, …, n + 1,

P_{i,j}(x) = (1/(x_j − x_i)) [(x_j − x) P_{i,j−1}(x) + (x − x_i) P_{i+1,j}(x)],
   1 ≤ i < j ≤ n + 1,   (5.25)

with P_{1,n+1}(x) the nth degree polynomial interpolating all the data.
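Neville's recurrence (5.25) can be sketched in Python (0-based indexing instead of the 1-based indexing of the text):

```python
def neville(xs, ys, x):
    # Neville's recurrence (5.25): builds the interpolated value
    # P_{1,n+1}(x) without forming the polynomial coefficients.
    n = len(xs)
    P = list(ys)   # P[i] holds the current P_{i, i+j}(x)
    for j in range(1, n):
        for i in range(n - j):
            P[i] = ((xs[i + j] - x) * P[i]
                    + (x - xs[i]) * P[i + 1]) / (xs[i + j] - xs[i])
    return P[0]

# Interpolate y = x^2 at three points; the quadratic is recovered
# exactly, so the value at x = 3 is 9
value = neville([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], 3.0)
```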
and assume the coordinates x_i of the knots (or breakpoints) are increasing with i. On
each subinterval Ik = [xk , xk+1 ], a third-degree polynomial is used
P_k(x) = a_0 + a_1 x + a_2 x² + a_3 x³,   (5.27)
so four independent constraints are needed per polynomial. Since Pk (x) must be an
interpolator on Ik , it must satisfy
P_k(x_k) = y_k   (5.28)

and

P_k(x_{k+1}) = y_{k+1}.   (5.29)
The first derivative of the interpolating polynomials must take the same value at each
common endpoint of two subintervals, so
P′_k(x_k) = P′_{k−1}(x_k).   (5.30)
The second derivative P″_k(x) is affine on I_k and interpolates u_k = P″(x_k) at x_k
and u_{k+1} = P″(x_{k+1}) at x_{k+1}, so

P″_k(x) = u_k (x_{k+1} − x)/(x_{k+1} − x_k) + u_{k+1} (x − x_k)/(x_{k+1} − x_k).   (5.31)

Integrating twice yields

P_k(x) = u_k (x_{k+1} − x)³/(6 h_{k+1}) + u_{k+1} (x − x_k)³/(6 h_{k+1})
   + a_k (x − x_k) + b_k,   (5.33)
where h_{k+1} = x_{k+1} − x_k. Take (5.28) and (5.29) into account to get the integration
constants

a_k = (y_{k+1} − y_k)/h_{k+1} − (h_{k+1}/6)(u_{k+1} − u_k)   (5.34)

and

b_k = y_k − u_k h_{k+1}²/6.   (5.35)
(5.36)
where u is the vector comprising all the u k s. This expression is cubic in x and affine
in u.
There are (N + 1 = dim u) unknowns, and (N 1) continuity conditions (5.30)
(as there are N subintervals Ik ), so two additional constraints are needed to make the
solution for u unique. In natural cubic splines, these constraints are u_0 = u_N = 0,
which amounts to saying that the cubic spline is affine on (−∞, x_0] and [x_N, +∞).
Other choices are possible; one may, for instance, fit the first derivative of f(·) at x_0
and x_N, or assume that f(·) is periodic and such that

f(x + x_N − x_0) ≡ f(x).   (5.37)
(r )
P0 (x0 ) = PN 1 (x N ), r = 0, 1, 2.
(5.38)
For any of these choices, the resulting set of linear equations can be written as

T u = d,   (5.39)

with u the vector of those u_i's still to be estimated and T tridiagonal, which greatly simplifies solving (5.39).
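Because T is tridiagonal, (5.39) can be solved in O(N) operations. A minimal sketch of such a solver (the Thomas algorithm, shown here in Python and assuming no pivoting is required) is:

```python
def solve_tridiagonal(sub, diag, sup, d):
    """Solve T u = d for tridiagonal T given by its subdiagonal `sub`, diagonal
    `diag` and superdiagonal `sup` (Thomas algorithm; assumes no pivoting needed)."""
    n = len(diag)
    c, y = [0.0] * n, [0.0] * n
    c[0] = sup[0] / diag[0]
    y[0] = d[0] / diag[0]
    for i in range(1, n):               # forward elimination
        m = diag[i] - sub[i - 1] * c[i - 1]
        c[i] = sup[i] / m if i < n - 1 else 0.0
        y[i] = (d[i] - sub[i - 1] * y[i - 1]) / m
    u = [0.0] * n
    u[-1] = y[-1]
    for i in range(n - 2, -1, -1):      # back substitution
        u[i] = y[i] - c[i] * u[i + 1]
    return u

print(solve_tridiagonal([1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0], [4.0, 8.0, 8.0]))
# solution of this small system is (1, 2, 3)
```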
A rational interpolant takes the form

F(x, p) = P(x, p) / Q(x, p),   (5.40)

where P(x, p) and Q(x, p) are polynomials in x,

P(x, p) = \sum_i a_i x^i,   Q(x, p) = \sum_{j=0} b_j x^j.   (5.41)
Consider, for instance, the rational function

F(x, p) = (a_0 + a_1 x) / (1 + b_1 x).   (5.42)

It depends on three parameters and can thus, in principle, be used to interpolate f(x)
at three values of x. The interpolation conditions

(a_0 + a_1 x_0) / (1 + b_1 x_0) = f(x_0),   (5.43)
(a_0 + a_1 x_1) / (1 + b_1 x_1) = f(x_1),   (5.44)
(a_0 + a_1 x_2) / (1 + b_1 x_2) = f(x_2)   (5.45)
thus define a set of nonlinear equations in p, the solution of which seems to require
tools such as those described in Chap. 7. This system, however, can be transformed
into a linear one by multiplying the ith equation in (5.45) by Q(xi , p) (i = 1, . . . , n)
to get the mathematically equivalent system of equations
Q(xi , p) f (xi ) = P(xi , p), i = 1, . . . , n,
(5.46)
Assume that the quantity of interest, r, is the limit of a computable quantity R(h) when the step-size h tends to zero,

r = lim_{h -> 0} R(h),   (5.47)

but that it is impossible in practice to make h tend to zero, as in the two following
examples.
Example 5.4 Evaluation of derivatives
One possible finite-difference approximation of the first-order derivative of a
function f(.) is

\dot{f}(x) \approx (1/h) [f(x + h) - f(x)]   (5.48)

(see Chap. 6). Mathematically, the smaller h is, the better the approximation becomes,
but making h too small is a recipe for disaster in floating-point computations, as it
entails computing the difference of numbers that are too close to one another.
Example 5.5 Evaluation of integrals
The rectangle method can be used to approximate the definite integral of a function
f(.) as

\int_a^b f(tau) dtau \approx h \sum_i f(a + ih).   (5.49)

Mathematically, the smaller h is, the better the approximation becomes, but when h
is too small the approximation requires too much computer time to be evaluated.
Because h cannot tend to zero, using R(h) instead of r introduces a method error,
and extrapolation may be used to improve accuracy on the evaluation of r. Assume
that

r = R(h) + O(h^n),   (5.50)

where the order n of the method error is known. Richardson's extrapolation principle
takes advantage of this knowledge to increase accuracy by combining results obtained
at various step-sizes. Equation (5.50) can be rewritten as

r = R(h) + c_n h^n + c_{n+1} h^{n+1} + ...   (5.51)

Evaluating R at the halved step-size similarly gives

r = R(h/2) + c_n (h/2)^n + c_{n+1} (h/2)^{n+1} + ...   (5.52)
To eliminate the nth order term, subtract (5.51) from 2^n times (5.52) to get

(2^n - 1) r = 2^n R(h/2) - R(h) + O(h^m),   (5.53)

with m > n, or equivalently

r = [2^n R(h/2) - R(h)] / (2^n - 1) + O(h^m).   (5.54)
Two evaluations of R have thus made it possible to gain at least one order of approximation. The idea can be pushed further by evaluating R(h_i) for several values of
h_i obtained by successive divisions by two of some initial step-size h_0. The value at
h = 0 of the polynomial P(h) extrapolating the resulting data (h_i, R(h_i)) may then
be computed with Neville's algorithm (see Sect. 5.3.1.3). In the context of the evaluation of definite integrals, the result is Romberg's method, see Sect. 6.2.2. Richardson's
extrapolation is also used, for instance, in numerical differentiation (see Sect. 6.4.3),
as well as for the integration of ordinary differential equations (see the Bulirsch-Stoer
method in Sect. 12.2.4.6).
Instead of increasing accuracy, one may use similar ideas to adapt the step-size h
in order to keep an estimate of the method error acceptable (see Sect. 12.2.4).
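As an illustrative Python sketch of (5.54), here is one Richardson step applied to a first-order forward-difference estimate of a derivative (n = 1); the test function (exp at 0) is a hypothetical example:

```python
import math

def richardson(R, h, n):
    """One Richardson step (5.54): combine R(h) and R(h/2), whose method error is O(h^n)."""
    return (2 ** n * R(h / 2) - R(h)) / (2 ** n - 1)

# R(h): first-order forward difference for the derivative of exp at 0 (error O(h), so n = 1)
R = lambda h: (math.exp(h) - 1.0) / h
h = 0.1
print(abs(R(h) - 1.0))                  # about 5e-2
print(abs(richardson(R, h, 1) - 1.0))   # much smaller, about 9e-4
```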
A polynomial in several input variables is still linear in its coefficients. A full second-degree polynomial in two input variables, for instance,

P(x, p) = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1^2 + a_4 x_1 x_2 + a_5 x_2^2,   (5.55)

is linear in

p = (a_0, a_1, ..., a_5)^T,   (5.56)
(5.56)
and this holds true whatever the degree of the polynomial and the number of input
variables. The values of these coefficients can therefore always be computed by
solving a set of linear equations enforcing interpolation, provided there are enough
of them. The choice of the structure of the polynomial (of which monomials to
include) is far from trivial, however.
5.4.3 Kriging
The name Kriging is a tribute to the seminal work of D.G. Krige on the Witwatersrand
gold deposits in South Africa, circa 1950 [13]. The technique was developed and
popularized by G. Matheron, from the Centre de géostatistique of the École des
mines de Paris, one of the founders of geostatistics, where it plays a central role [12,
14, 15]. Initially applied to two- and three-dimensional problems where the input
factors corresponded to space variables (as in mining), it extends directly to problems
with a much larger number of input factors (as is common in industrial statistics).
We describe here, with no mathematical justification for the time being, how the
simplest version of Kriging can be used for multidimensional interpolation. More
precise statements, including a derivation of the equations, are in Example 9.2.
Let y(x) be the scalar output value to be predicted based on the value taken by
the input vector x. Assume that a series of experiments (which may be computer
experiments or actual measurements in the physical world) has provided the output
values
(5.57)
yi = f (xi ), i = 1, . . . , N ,
for N numerical values xi of the input vector, and denote the vector of these output
values by y. Note that the meaning of y here differs from that in (5.1). The Kriging
prediction \hat{y}(x) of the value taken by f(x) for x not in {x^i, i = 1, ..., N} is linear in y,
and the weights of the linear combination depend on the value of x. Thus,

\hat{y}(x) = c^T(x) y.   (5.58)
It seems natural to assume that the closer x is to x^i, the more f(x) resembles f(x^i).
This leads to defining a correlation function r(x, x^i) between f(x) and f(x^i) such
that

r(x^i, x^i) = 1   (5.59)

and that r(x, x^i) decreases toward zero when the distance between x and x^i increases.
This correlation function often depends on a vector p of parameters to be tuned from
the available data. It will then be denoted by r(x, x^i, p).
Example 5.6 Correlation function for Kriging
A frequently employed parametrized correlation function is

r(x, x^i, p) = \prod_{j=1}^{dim x} exp(-p_j |x_j - x_j^i|^2).   (5.60)

The range parameters p_j > 0 specify how quickly the influence of the measurement
y_i decreases when the distance to x^i increases. If p is too large, then the influence of
the data quickly vanishes and \hat{y}(x) tends to zero whenever x is not in the immediate
vicinity of some x^i.
Assume, for the sake of simplicity, that the value of p has been chosen beforehand, so it no longer appears in the equations. (Statistical methods are available for
estimating p from the data, see Remark 9.5.)
The Kriging prediction is Gaussian, and thus entirely characterized (for any given
value of the input vector x) by its mean \hat{y}(x) and variance \hat{sigma}^2(x). The mean of the
prediction is

\hat{y}(x) = r^T(x) R^{-1} y,   (5.61)
where

R = [ r(x^1, x^1)  r(x^1, x^2)  ...  r(x^1, x^N)
      ...          ...               ...
      r(x^N, x^1)  r(x^N, x^2)  ...  r(x^N, x^N) ]   (5.62)

and

r^T(x) = [ r(x, x^1)  r(x, x^2)  ...  r(x, x^N) ].   (5.63)

The variance of the prediction is

\hat{sigma}^2(x) = sigma_y^2 [1 - r^T(x) R^{-1} r(x)],   (5.64)

where sigma_y^2 is a proportionality constant, which may also be estimated from the data,
see Remark 9.5.
[Figure: Kriging prediction and its confidence interval as functions of x]
In practice, (5.61) is evaluated as

\hat{y}(x) = r^T(x) v,   (5.66)

where

v = R^{-1} y   (5.67)

is computed once and for all by solving the system of linear equations

R v = y.   (5.68)
The prediction interpolates the data: since the ith row of R is r^T(x^i),

\hat{y}(x^i) = r^T(x^i) R^{-1} y = e_i^T R R^{-1} y = e_i^T y = y_i,   (5.69)

where e_i is the ith column of I_N. Even if this is true for any correlation function and
any value of p, the structure of the correlation function and the numerical value of p
impact the prediction and do matter.
The simplicity of (5.61), which is valid for any dimension of input factor space,
should not hide that solving (5.68) for v may be an ill-conditioned problem. One way
to improve conditioning is to force r(x, x^i) to zero when the distance between x and
x^i exceeds some threshold delta, which amounts to saying that only the pairs (y_i, x^i)
such that ||x - x^i|| <= delta contribute to \hat{y}(x). This is only feasible if there are enough
x^i's in the vicinity of x, which is forbidden by the curse of dimensionality when the
dimension of x is too large (see Example 8.6).
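A minimal Python sketch of the prediction equations (5.58)–(5.68), with the correlation function (5.60) and a value of p chosen beforehand (the data below are purely illustrative):

```python
import numpy as np

def kriging_predictor(X, y, p):
    """Build the Kriging mean predictor (5.61): X is N x n, y has N entries,
    p holds the range parameters of the correlation function (5.60)."""
    def corr(a, b):
        return np.exp(-np.sum(p * (a - b) ** 2))
    N = len(y)
    R = np.array([[corr(X[i], X[j]) for j in range(N)] for i in range(N)])
    v = np.linalg.solve(R, y)          # v = R^{-1} y, computed once (5.67)-(5.68)
    def y_hat(x):
        r = np.array([corr(x, X[i]) for i in range(N)])
        return r @ v                   # (5.66)
    return y_hat

X = np.array([[0.0], [0.5], [1.0]])
y = np.array([1.0, 0.0, 2.0])
y_hat = kriging_predictor(X, y, np.array([5.0]))
print(y_hat(np.array([0.5])))  # interpolates the data: essentially 0.0 here
```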
Remark 5.7 A slight modification of the Kriging equations transforms data interpolation into data approximation. It suffices to replace R by

R' = R + sigma_m^2 I,   (5.70)

where sigma_m^2 accounts for the variance of the measurement noise.

5.5 MATLAB Examples

The function

f(x) = 1 / (1 + 25 x^2)   (5.71)

was used by Runge to study the unwanted oscillations taking place when interpolating
with a high-degree polynomial over a set of regularly spaced interpolation points.
Data at n + 1 such points are generated by the script
for i=1:n+1,
x(i) = (2*(i-1)/n)-1;
Fig. 5.5 Polynomial interpolation at nine regularly spaced values of x; the graph of the interpolated
function is in solid line
y(i) = 1/(1+25*x(i)^2);
end
We first interpolate these data using polyfit, which proceeds via the construction
of a Vandermonde matrix, and polyval, which computes the value taken by the
resulting interpolating polynomial on a fine regular grid specified in FineX, as
follows
N = 20*n;
FineX = zeros(N+1,1);
for j=1:N+1,
FineX(j) = (2*(j-1)/N)-1;
end
polynomial = polyfit(x,y,n);
fPoly = polyval(polynomial,FineX);
Fig. 5.5 presents the useless results obtained with nine interpolation points, thus using
an eighth-degree polynomial. The graph of the interpolated function is a solid line,
the interpolation points are indicated by circles and the graph of the interpolating
polynomial is a dash-dot line. Increasing the degree of the polynomial while keeping
the xi s regularly spaced would only worsen the situation.
A better option is to replace the regularly spaced xi s by Chebyshev points satisfying (5.13) and to generate the data by the script
Fig. 5.6 Polynomial interpolation at 21 Chebyshev values of x; the graph of the interpolated
function is in solid line
for i=1:n+1,
x(i) = cos((i-1)*pi/n);
y(i) = 1/(1+25*x(i)^2);
end
The results with nine interpolation points still show some oscillations, but we can
now safely increase the order of the polynomial to improve the situation. With 21
interpolation points, we get the results of Fig. 5.6.
An alternative option is to use cubic splines. This can be carried out by using the
functions spline, which computes the piecewise polynomial, and ppval, which
evaluates this piecewise polynomial at points to be specified. One may thus write
PieceWisePol = spline(x,y);
fCubicSpline = ppval(PieceWisePol,FineX);
With nine regularly spaced xi s, the results are then as presented in Fig. 5.7.
5.6 In Summary
• Prefer interpolation to extrapolation, whenever possible.
• Interpolation may not be the right answer to an approximation problem; there is
no point in interpolating noisy or uncertain data.
Fig. 5.7 Cubic spline interpolation at nine regularly spaced values of x; the graph of the interpolated
function is in solid line
References
1. Trefethen, L.: Approximation Theory and Approximation Practice. SIAM, Philadelphia (2013)
2. Trefethen, L.: Six myths of polynomial interpolation and quadrature. Math. Today 47, 184–188 (2011)
3. Sacks, J., Welch, W., Mitchell, T., Wynn, H.: Design and analysis of computer experiments (with discussion). Stat. Sci. 4(4), 409–435 (1989)
4. Farouki, R.: The Bernstein polynomial basis: a centennial retrospective. Comput. Aided Geom. Des. 29, 379–419 (2012)
5. Berrut, J.-P., Trefethen, L.: Barycentric Lagrange interpolation. SIAM Rev. 46(3), 501–517 (2004)
6. Higham, N.: The numerical stability of barycentric Lagrange interpolation. IMA J. Numer. Anal. 24(4), 547–556 (2004)
7. de Boor, C.: Package for calculating with B-splines. SIAM J. Numer. Anal. 14(3), 441–472 (1977)
8. Stoer, J., Bulirsch, R.: Introduction to Numerical Analysis. Springer, New York (1980)
9. de Boor, C.: A Practical Guide to Splines, revised edn. Springer, New York (2001)
10. Kershaw, D.: A note on the convergence of interpolatory cubic splines. SIAM J. Numer. Anal. 8(1), 67–74 (1971)
11. Wahba, G.: Spline Models for Observational Data. SIAM, Philadelphia (1990)
12. Cressie, N.: Statistics for Spatial Data. Wiley, New York (1993)
13. Krige, D.: A statistical approach to some basic mine valuation problems on the Witwatersrand. J. Chem. Metall. Min. Soc. 52, 119–139 (1951)
14. Chilès, J.-P., Delfiner, P.: Geostatistics. Wiley, New York (1999)
15. Wackernagel, H.: Multivariate Geostatistics, 3rd edn. Springer, Berlin (2003)
16. Vazquez, E., Walter, E.: Estimating derivatives and integrals with Kriging. In: Proceedings of the 44th IEEE Conference on Decision and Control (CDC) and European Control Conference (ECC), pp. 8156–8161. Seville, Spain (2005)
Chapter 6
Integrating and Differentiating Functions
We are interested here in the numerical aspects of the integration and differentiation
of functions. When these functions are only known through the numerical values
that they take for some numerical values of their arguments, formal integration
or differentiation via computer algebra is out of the question. Section 6.6.2 will
show, however, that when the source of the code evaluating the function is available,
automatic differentiation, which involves some formal treatment, becomes possible.
The integration of differential equations will be considered in Chaps. 12 and 13.
Remark 6.1 When a closed-form symbolic expression is available for a function,
computer algebra may be used for its integration or differentiation. Computer algebra
systems such as Maple or Mathematica include methods for formal integration that
would be so painful to use by hand that they are not even taught in advanced calculus
classes. They also greatly facilitate the evaluation of derivatives or partial derivatives.
The following script, for instance, uses MATLAB's Symbolic Math Toolbox to
evaluate the gradient and Hessian functions of a scalar function of several variables.
syms x y
X = [x;y]
F = x^3*y^2-9*x*y+2
G = gradient(F,X)
H = hessian(F,X)
It yields
X =
x
y
F =
x^3*y^2 - 9*x*y + 2
G =
3*x^2*y^2 - 9*y
2*y*x^3 - 9*x
H =
[     6*x*y^2, 6*y*x^2 - 9]
[ 6*y*x^2 - 9,       2*x^3]
For vector functions of several variables, Jacobian matrices may be similarly generated, see Remark 7.10.
It is not assumed here that such closed-form expressions of the functions to be
integrated or differentiated are available.
6.1 Examples
Example 6.1 Inertial navigation
Inertial navigation systems are used, e.g., in aircraft and submarines. They include
accelerometers that measure acceleration along three independent axes (say, longitude, latitude, and altitude). Integrating these accelerations once, one can evaluate
the three components of speed, and a second integration leads to the three components of position, provided the initial conditions are known. Cheap accelerometers,
as made possible by micro electromechanical systems (MEMS), have found their
way into smartphones, videogame consoles and other personal electronic devices.
See Sect. 16.23.
Example 6.2 Power estimation
The power P consumed by an electrical appliance (in W) is

P = (1/T) \int_0^T u(tau) i(tau) dtau,   (6.1)
where the electric tension u delivered to the appliance (in V) is sinusoidal with period
T , and where (possibly after some transient) the current i through the appliance (in A)
is also periodic with period T , but not necessarily sinusoidal. To estimate the value
of P from measurements of u(tk ) and i(tk ) at some instants of time tk [0, T ],
k = 1, . . . , N , one has to evaluate an integral.
Example 6.3 Speed estimation
Computing the speed of a mobile from measurements of its position boils down
to differentiating a signal, the value of which is only known at discrete instants of
time. (When a model of the dynamical behavior of the mobile is available, it may be
taken into account via the use of a Kalman filter [1, 2], not considered here.)
The generic problem considered in this section is the evaluation of the definite integral

I = \int_a^b f(x) dx,   (6.2)
where the lower limit a and upper limit b have known numerical values and where the
integrand f (), a real function assumed to be integrable, can be evaluated numerically
at any x in [a, b]. Evaluating I is often called quadrature, a reminder of the method
approximating areas by unions of small squares.
Since, for any c in [a, b], for instance its middle,

\int_a^b f(x) dx = \int_a^c f(x) dx + \int_c^b f(x) dx,   (6.3)
the computation of I may recursively be split into subtasks whenever this is expected to lead to better accuracy, in a divide-and-conquer approach. This is adaptive
quadrature, which makes it possible to adapt to local properties of the integrand f ()
by putting more evaluations where f () varies quickly.
The decision about whether to bisect [a, b] is usually taken based on comparing
the numerical results I_+ and I_- of the evaluation of I by two numerical integration
methods, with I_+ expected to be more accurate than I_-. If

|I_+ - I_-| / |I_+| < delta,   (6.4)

where delta is some prescribed relative error tolerance, then the result I_+ provided by
the better method is kept, else [a, b] may be bisected and the same procedure applied
to the two resulting subintervals.
to the two resulting subintervals. To avoid endless bisections, a limit is set on the
number of recursion levels and no bisection is carried out on subintervals such that
their relative contribution to I is deemed too small. See [3] for a comparison of
strategies for adaptive quadrature and evidence of the fact that none of them will
give accurate answers for all integrable functions.
The interval [a, b] considered in what follows may be one of the subintervals
resulting from such a divide-and-conquer approach.
The integrand is evaluated at regularly spaced points

x_i = a + ih,  i = 0, ..., N,   (6.5)

with

h = (b - a)/N.   (6.6)
The interval [a, b] is partitioned into subintervals with equal width kh, so k must
divide N . Each subinterval contains (k+1) evaluation points, which makes it possible
to replace f () on this subinterval by a kth degree interpolating polynomial. The
value of the definite integral I is then approximated by the sum of the integrals of
the interpolating polynomials over the subintervals on which they interpolate f ().
Remark 6.2 The initial problem has thus been replaced by an approximate one that
can be solved exactly (at least from a mathematical point of view).
Remark 6.3 Spacing the evaluation points regularly may not be such a good idea,
see Sects. 6.2.3 and 6.2.4.
The integral of the interpolating polynomial over the subinterval [x0 , xk ] can then
be written as
I_SI(k) = h \sum_{j=0}^{k} c_j f(x_j),   (6.7)
where the coefficients c j depend only on the order k of the polynomial, and the same
formula applies for any one of the other subintervals, after a suitable incrementation
of the indices.
In what follows, NC(k) denotes the Newton-Cotes method based on an interpolating polynomial of order k, and f(x_j) is denoted by f_j. Because the x_j's are
equispaced, the order k must remain small. The local method error committed by NC(k)
over [x_0, x_k] is

e_NC(k) = \int_{x_0}^{x_k} f(x) dx - I_SI(k),   (6.8)
and the global method error over [a, b], denoted by E NC (k), is obtained by summing
the local method errors committed over all the subintervals.
Proofs of the results concerning the values of eNC (k) and E NC (k) presented below
can be found in [4]. In these results, f () is of course assumed to be differentiable
up to the order required.
NC(1) is the trapezoidal rule,

I_SI(1) = (h/2)(f_0 + f_1) = ((b - a)/(2N))(f_0 + f_1).   (6.9)
All endpoints are used twice when evaluating I, except for x_0 and x_N, so

I \approx ((b - a)/N) [ (f_0 + f_N)/2 + \sum_{i=1}^{N-1} f_i ].   (6.10)
The local method error is

e_NC(1) = -(1/12) \ddot{f}(eta) h^3,   (6.11)

for some eta in [x_0, x_1], and the global method error is such that

E_NC(1) = -((b - a)/12) \ddot{f}(zeta) h^2,   (6.12)
for some zeta in [a, b]. The global method error on I is thus O(h^2). If f(.) is a
polynomial of degree at most one, then \ddot{f}(.) == 0 and there is no method error,
which should come as no surprise.
Remark 6.4 The trapezoidal rule can also be used with irregularly spaced x_i's, as

I \approx \sum_{i=0}^{N-1} (x_{i+1} - x_i)(f_{i+1} + f_i)/2.   (6.13)
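Equation (6.13) translates into a few lines of code; the following Python sketch is only illustrative:

```python
def trapezoid(x, f):
    """Trapezoidal rule (6.13) for possibly irregularly spaced abscissas x."""
    return sum((x[i + 1] - x[i]) * (f[i + 1] + f[i]) / 2.0 for i in range(len(x) - 1))

x = [0.0, 0.1, 0.4, 1.0]                    # irregular spacing
print(trapezoid(x, [xi ** 2 for xi in x]))  # about 0.374, overestimating 1/3 (x^2 is convex)
```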
NC(2) is Simpson's 1/3 rule,

I_SI(2) = (h/3)(f_0 + 4 f_1 + f_2).   (6.14)
The name 1/3 comes from the leading coefficient in (6.14). It can be shown that

e_NC(2) = -(1/90) f^(4)(eta) h^5,   (6.15)

for some eta in [x_0, x_2], and

E_NC(2) = -((b - a)/180) f^(4)(zeta) h^4,   (6.16)

for some zeta in [a, b]. The global method error on I with NC(2) is thus O(h^4), much
better than with NC(1). Because of a lucky cancelation, there is no method error if
f(.) is a polynomial of degree at most three, and not just two as one might expect.
NC(3) is Simpson's 3/8 rule,

I_SI(3) = (3h/8)(f_0 + 3 f_1 + 3 f_2 + f_3).   (6.17)
The name 3/8 comes from the leading coefficient in (6.17). It can be shown that

e_NC(3) = -(3/80) f^(4)(eta) h^5,   (6.18)

for some eta in [x_0, x_3], and

E_NC(3) = -((b - a)/80) f^(4)(zeta) h^4,   (6.19)

for some zeta in [a, b]. The global method error on I with NC(3) is thus O(h^4), just
as with NC(2), and nothing seems to have been gained by increasing the order of
the interpolating polynomial. As with NC(2), there is no method error if f(.) is a
polynomial of degree at most three, but for NC(3) this is not surprising.
NC(4) is Boole's rule,

I_SI(4) = (2h/45)(7 f_0 + 32 f_1 + 12 f_2 + 32 f_3 + 7 f_4).   (6.20)
It can be shown that

e_NC(4) = -(8/945) f^(6)(eta) h^7,   (6.21)

for some eta in [x_0, x_4], and

E_NC(4) = -(2(b - a)/945) f^(6)(zeta) h^6,   (6.22)

for some zeta in [a, b]. The global method error on I with NC(4) is thus O(h^6). Again
because of a lucky cancelation, there is no method error if f(.) is a polynomial of
degree at most five.
Remark 6.5 A cursory look at the previous formulas may suggest that

E_NC(k) = ((b - a)/(kh)) e_NC(k),   (6.23)

which seems natural since the number of subintervals is (b - a)/(kh). Note, however,
that zeta in the expression for E_NC(k) is not the same as eta in that for e_NC(k).
Denote by \hat{I}(h, m) an estimate of I computed with step-size h and a method error of order m, so that

I = \hat{I}(h, m) + c_1 h^m + c_2 h^{m+1} + ...   (6.25)

and

I = \hat{I}(h/2, m) + c_1 (h/2)^m + c_2 (h/2)^{m+1} + ...   (6.26)
Instead of combining (6.25) and (6.26) to eliminate the first method-error term,
as in Richardson extrapolation, one may use them to estimate this term. Subtract (6.26) from (6.25) to get

\hat{I}(h, m) - \hat{I}(h/2, m) = c_1 (h/2)^m (1 - 2^m) + O(h^k),   (6.27)

so the first method-error term of \hat{I}(h/2, m) satisfies

c_1 (h/2)^m = [\hat{I}(h, m) - \hat{I}(h/2, m)] / (1 - 2^m) + O(h^k).   (6.28)
This estimate may be used to decide whether halving again the step-size would be
appropriate. A similar procedure may be employed to adapt step-size in the context
of solving ordinary differential equations, see Sect. 12.2.4.2.
Romberg's method starts from the trapezoidal rule, whose global method error can be shown to contain only even powers of h,

E_NC(1) = \sum_{i >= 1} c_{2i} h^{2i},   (6.29)

and each extrapolation step increases the order of the method error by two, with
method errors O(h^4), O(h^6), O(h^8)... This makes it possible to get extremely accurate results quickly.
Let R(i, j) be the value of (6.2) as evaluated by Romberg's method after j Richardson extrapolation steps based on an integration with the constant step-size

h_i = (b - a)/2^i.   (6.30)

R(i, 0) is obtained with the trapezoidal rule at step-size h_i, and, for j >= 1,

R(i, j) = [4^j R(i, j-1) - R(i-1, j-1)] / (4^j - 1).   (6.31)
Compare with (5.54), where the fact that there are no odd method-error terms is not
taken into account. The method error for R(i, j) is O(h_i^{2j+2}). R(i, 1) corresponds
to Simpson's 1/3 rule and R(i, 2) to Boole's rule. R(i, j) for j > 2 tends to be more
stable than its Newton-Cotes counterpart.
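As an illustrative Python sketch, Romberg's method follows directly from (6.30) and (6.31), with each R(i, 0) obtained by refining the previous trapezoidal estimate:

```python
import math

def romberg(f, a, b, imax):
    """Romberg's method: R[i][j] after j extrapolation steps (6.31), with R[i][0]
    the trapezoidal rule at step-size (b - a) / 2**i, as in (6.30)."""
    R = [[0.0] * (imax + 1) for _ in range(imax + 1)]
    R[0][0] = (b - a) * (f(a) + f(b)) / 2.0
    for i in range(1, imax + 1):
        h = (b - a) / 2 ** i
        # refine the trapezoidal estimate, evaluating f only at the new midpoints
        R[i][0] = R[i - 1][0] / 2.0 + h * sum(f(a + (2 * k - 1) * h) for k in range(1, 2 ** (i - 1) + 1))
        for j in range(1, i + 1):
            R[i][j] = (4 ** j * R[i][j - 1] - R[i - 1][j - 1]) / (4 ** j - 1)
    return R[imax][imax]

print(abs(romberg(math.sin, 0.0, math.pi, 5) - 2.0))  # very small
```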
Table 6.1 Evaluation points and weights for Gauss-Legendre quadrature on [-1, 1]

N | Evaluation points x_i  | Weights w_i
1 | 0                      | 2
2 | ±1/sqrt(3)             | 1
3 | 0                      | 8/9
  | ±0.774596669241483     | 5/9
4 | ±0.339981043584856     | 0.652145154862546
  | ±0.861136311594053     | 0.347854845137454
5 | 0                      | 0.568888888888889
  | ±0.538469310105683     | 0.478628670499366
  | ±0.906179845938664     | 0.236926885056189
Gaussian quadrature approximates the integral of f(.) over the normalized interval [-1, 1] as

I \approx \sum_{i=1}^{N} w_i f(x_i),   (6.32)

which has 2N parameters, namely the N evaluation points x_i and the associated
weights w_i. Since a (2N-1)th order polynomial has 2N coefficients, it thus becomes
possible to impose that (6.32) entails no method error if f(.) is a polynomial of degree
at most (2N-1). Compare with Newton-Cotes methods.
Gauss has shown that the evaluation points x_i in (6.32) are the roots of the Nth
degree Legendre polynomial [5]. These are not trivial to compute to high precision
for large N [6], but they are tabulated. Given the evaluation points, the corresponding
weights are much easier to obtain. Table 6.1 gives the values of x_i and w_i for up to five
evaluations of f(.) on a normalized interval [-1, 1]. Results for up to 16 evaluations
can be found in [7, 8].
The values x_i and w_i (i = 1, ..., N) in Table 6.1 are approximate solutions of
the system of nonlinear equations expressing that

\int_{-1}^{1} f(x) dx = \sum_{i=1}^{N} w_i f(x_i)   (6.33)

for f(x) == 1, f(x) == x, and so forth, until f(x) == x^{2N-1}. The first of these
equations implies that

\sum_{i=1}^{N} w_i = 2.   (6.34)
For N = 1, (6.34) implies that

w_1 = 2,   (6.35)

and

\int_{-1}^{1} x dx = 0 = w_1 x_1  =>  x_1 = 0.   (6.36)
One must therefore evaluate f () at the center of the normalized interval, and multiply
the result by 2 to get an estimate of the integral. This is the midpoint formula, exact for
integrating polynomials up to order one. The trapezoidal rule needs two evaluations
of f () to achieve the same performance.
Remark 6.6 For any a < b, the change of variables

x = [(b - a) tau + a + b] / 2   (6.37)

gives

\int_a^b f(x) dx = \int_{-1}^{1} f( [(b - a) tau + a + b] / 2 ) ((b - a)/2) dtau,   (6.38)

so

I = ((b - a)/2) \int_{-1}^{1} g(tau) dtau,   (6.39)

with

g(tau) = f( [(b - a) tau + a + b] / 2 ).   (6.40)
Remark 6.7 The initial horizon of integration [a, b] may of course be split into
subintervals on which Gaussian quadrature is carried out.
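A sketch combining the N = 3 line of Table 6.1 with the change of variables (6.37)–(6.40), in Python for illustration (larger N would use tabulated nodes, e.g., those produced by numpy.polynomial.legendre.leggauss):

```python
import math

def gauss3(f, a, b):
    """Three-point Gauss-Legendre quadrature on [a, b], using the N = 3 row of
    Table 6.1 and the change of variables (6.37)-(6.40)."""
    nodes = [-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0)]
    weights = [5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0]
    g = lambda tau: f(((b - a) * tau + a + b) / 2.0)  # g as in (6.40)
    return (b - a) / 2.0 * sum(w * g(t) for w, t in zip(weights, nodes))

print(gauss3(lambda x: x ** 5, 0.0, 1.0))  # exact for degree <= 5: 1/6
```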
(6.42)
A double integral

I = \int\int f(x, y) dx dy   (6.43)

may be rewritten with explicit integration limits as

I = \int_{y_1}^{y_2} \int_{x_1(y)}^{x_2(y)} f(x, y) dx dy,   (6.44)

[Fig. 6.1 Nested 1D integrations: an internal integration with respect to x inside an external integration with respect to y]
so one may perform one-dimensional inner integrations with respect to x at sufficiently many values of y and then perform a one-dimensional outer integration with
respect to y. As in the univariate case, there should be more numerical evaluations
of the integrand f (, ) in the regions where it varies quickly.
(6.47)

The same equation can be used to evaluate <f> as previously, provided that only
the x^i's in D are kept and N is the number of these x^i's.

(6.48)

(1/N) \sum_{i=1}^{N} f^2(x^i).   (6.49)
n is large, however, the situation would be much worse if the integrand had to be
evaluated on a regular grid.
Variance-reduction methods may be used to increase the precision of the estimate of <f>
obtained for a given N [12].
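A crude Monte Carlo estimate of a univariate definite integral (the empirical mean of f times the length of the interval) can be sketched as follows in Python; the integrand is a hypothetical example:

```python
import random

def mc_integral(f, a, b, N, seed=0):
    """Crude Monte Carlo estimate of the integral of f over [a, b]:
    (b - a) times the empirical mean of f at N uniformly drawn points."""
    rng = random.Random(seed)
    return (b - a) * sum(f(rng.uniform(a, b)) for _ in range(N)) / N

print(mc_integral(lambda x: x * x, 0.0, 1.0, 100000))  # near 1/3, error O(1/sqrt(N))
```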
Consider first the approximation of the first-order derivative \dot{f}(.) from two evaluations of f(.), via

\dot{f}(x_0) \approx [f(x_0 + h) - f(x_0)]/h   (6.52)

or

\dot{f}(x_1) \approx [f(x_1) - f(x_1 - h)]/h.   (6.53)

A second-order Taylor expansion,

f(x_0 + h) = f(x_0) + \dot{f}(x_0) h + (\ddot{f}(x_0)/2) h^2 + o(h^2),   (6.54)

implies that

[f(x_0 + h) - f(x_0)]/h = \dot{f}(x_0) + (\ddot{f}(x_0)/2) h + o(h).   (6.55)

So

\dot{f}(x_0) = [f(x_0 + h) - f(x_0)]/h + O(h),   (6.56)

and the method error committed when using (6.52) is O(h). This is why (6.52) is
called a first-order forward difference. Similarly,

\dot{f}(x_1) = [f(x_1) - f(x_1 - h)]/h + O(h),   (6.57)

and (6.53) is a first-order backward difference.
To allow a more precise evaluation of \dot{f}(.), consider now a second-order interpolating polynomial P_2(x), associated with the values taken by f(.) at three regularly
spaced points x_0, x_1 and x_2, such that

x_2 - x_1 = x_1 - x_0 = h.   (6.58)

Differentiating P_2(x) and evaluating the result at the central point x_1 yields the centered difference

\dot{P}_2(x_1) = [f(x_1 + h) - f(x_1 - h)] / (2h).   (6.61)

Now

f(x_1 + h) = f(x_1) + \dot{f}(x_1) h + (\ddot{f}(x_1)/2) h^2 + O(h^3)   (6.63)

and

f(x_1 - h) = f(x_1) - \dot{f}(x_1) h + (\ddot{f}(x_1)/2) h^2 + O(h^3),   (6.64)

so

\dot{f}(x_1) = [f(x_1 + h) - f(x_1 - h)] / (2h) + O(h^2).   (6.65)
115
(6.67)
(6.68)
(6.69)
(6.70)
(6.71)
(6.72)
(6.73)
116
(6.75)
Since

f(x_1 + h) = \sum_{i=0}^{5} (f^(i)(x_1)/i!) h^i + O(h^6)   (6.76)

and

f(x_1 - h) = \sum_{i=0}^{5} (f^(i)(x_1)/i!) (-h)^i + O(h^6),   (6.77)

the odd terms disappear when summing (6.76) and (6.77). As a result,

[f(x_1 + h) - 2 f(x_1) + f(x_1 - h)] / h^2 = (1/h^2) [ \ddot{f}(x_1) h^2 + (f^(4)(x_1)/12) h^4 + O(h^6) ],   (6.78)

and

\ddot{f}(x_1) = [f(x_1 + h) - 2 f(x_1) + f(x_1 - h)] / h^2 + O(h^2).   (6.79)
Similarly, one may write forward and backward differences. It turns out that

\ddot{f}(x_0) = [f(x_0 + 2h) - 2 f(x_0 + h) + f(x_0)] / h^2 + O(h)   (6.80)

and

\ddot{f}(x_2) = [f(x_2) - 2 f(x_2 - h) + f(x_2 - 2h)] / h^2 + O(h).   (6.81)
Remark 6.10 The method error of the centered difference is thus O(h 2 ), whereas
the method errors of the forward and backward differences are only O(h). This is
why the centered difference is used in the Crank-Nicolson scheme for solving some
partial differential equations, see Sect. 13.3.3.
Example 6.7 As in Example 6.6, take f(x) = x^4, so \ddot{f}(x) = 12 x^2. The first-order
forward difference satisfies

[f(x + 2h) - 2 f(x + h) + f(x)] / h^2 = 12 x^2 + 24 h x + 14 h^2 = \ddot{f}(x) + O(h).   (6.82)
The first-order forward difference R_1(h) = [f(x + h) - f(x)]/h is such that

\dot{f}(x) = R_1(h) + c_1 h + ...   (6.86)

Richardson extrapolation then yields

\dot{f}(x) = 2 R_1(h/2) - R_1(h) + O(h^m),   (6.87)

with m >= 2. With h' = h/2,

2 R_1(h') - R_1(2h') = [-f(x + 2h') + 4 f(x + h') - 3 f(x)] / (2h'),   (6.88)
The centered difference

R_2(h) = [f(x + h) - f(x - h)] / (2h)   (6.89)

is such that

\dot{f}(x) = R_2(h) + c_2 h^2 + ...   (6.90)

and Richardson extrapolation yields

\dot{f}(x) = [4 R_2(h/2) - R_2(h)] / 3 + O(h^4)   (6.91)
           = N(x) / (12 h'),   (6.92)

with h' = h/2 and

N(x) = -f(x + 2h') + 8 f(x + h') - 8 f(x - h') + f(x - 2h').   (6.93)

A Taylor expansion of f(.) around x shows that the even terms in the expansion of
N(x) cancel out and that

N(x) = 12 \dot{f}(x) h' + 0 · f^(3)(x) (h')^3 + O(h'^5),   (6.94)

so

N(x) / (12 h') = \dot{f}(x) + O(h'^4).   (6.95)
The evaluation of a second-order cross derivative may proceed in the same spirit. Since

d2f/dxdy (x, y) = d/dx [ df/dy (x, y) ],   (6.96)

one may first compute a centered-difference estimate g(x, y) of df/dy (x, y), such that

g(x, y) = df/dy (x, y) + O(h_y^2),   (6.97)

and then take a centered difference of g(., .) with respect to x. Take f(x, y) = x^3 y^3, for
instance, so that d2f/dxdy = 9 x^2 y^2. Then

g(x, y) = x^3 [(y + h_y)^3 - (y - h_y)^3] / (2 h_y) = x^3 (3 y^2 + h_y^2),

and

[(x + h_x)^3 - (x - h_x)^3] / (2 h_x) · (3 y^2 + h_y^2) = (3 x^2 + h_x^2)(3 y^2 + h_y^2)
= 9 x^2 y^2 + 3 x^2 h_y^2 + 3 y^2 h_x^2 + h_x^2 h_y^2
= d2f/dxdy + O(h_x^2) + O(h_y^2).   (6.98)

Globally,

d2f/dxdy \approx [(x + h_x)^3 - (x - h_x)^3] / (2 h_x) · [(y + h_y)^3 - (y - h_y)^3] / (2 h_y).   (6.99)
Gradient evaluation, at the core of some of the most efficient optimization methods,
is considered in some more detail in the next section, in the important special case
where the function to be differentiated is evaluated by a numerical code.
The aim is to evaluate the gradient of f(.) at x_0,

df/dx (x_0) = [ df/dx_1 (x_0)  df/dx_2 (x_0)  ...  df/dx_n (x_0) ]^T,   (6.100)
via the use of some numerical code deduced from the one evaluating f (x0 ).
We start with a description of the problems encountered when using finite differences, before describing two approaches to implementing automatic differentiation
[15-21]. Both of them make it possible to avoid any method error in the evaluation
of gradients (which does not eliminate the effect of rounding errors, of course). The
first approach may lead to a drastic reduction of the volume of computation, while
the second is simple to implement via operator overloading.
The forward-difference approximation of the ith component of the gradient is

df/dx_i (x_0) \approx [f(x_0 + delta x_i e_i) - f(x_0)] / delta x_i,  i = 1, ..., dim x,   (6.101)

with e_i the ith column of the identity matrix, while the centered-difference approximation is

df/dx_i (x_0) \approx [f(x_0 + delta x_i e_i) - f(x_0 - delta x_i e_i)] / (2 delta x_i).   (6.102)
The method error is O(delta x_i^2) for (6.102) instead of O(delta x_i) for (6.101), and (6.102)
does not introduce phase distortion, contrary to (6.101) (think of the case where f(x)
is a trigonometric function). On the other hand, (6.102) requires more computation
than (6.101).
As already mentioned, it is impossible to make delta x_i tend to zero, because this would
entail computing the difference of infinitesimally close real numbers, a disaster in
floating-point computation. One is thus forced to strike a compromise between the
rounding and method errors by keeping the delta x_i's finite (and not necessarily equal). A
good tuning of each of the delta x_i's is difficult, and may require trial and error. Even if one
assumes that appropriate delta x_i's have already been found, an approximate evaluation
of the gradient of f(.) at x_0 requires (dim x + 1) evaluations of f(.) with (6.101)
and (2 dim x) evaluations of f(.) with (6.102). This may turn out to be a challenge if
dim x is very large (as in image processing or shape optimization) or if many gradient
evaluations have to be carried out (as in multistart optimization).
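The finite-difference options (6.101) and (6.102) can be sketched as follows (Python, illustrative; the test function is hypothetical):

```python
def fd_gradient(f, x, dx=1e-6, centered=True):
    """Finite-difference gradient: centered differences as in (6.102)
    (2 dim x evaluations of f) or forward differences as in (6.101) (dim x + 1)."""
    g = []
    fx = None if centered else f(x)
    for i in range(len(x)):
        xp = list(x); xp[i] += dx
        if centered:
            xm = list(x); xm[i] -= dx
            g.append((f(xp) - f(xm)) / (2 * dx))
        else:
            g.append((f(xp) - fx) / dx)
    return g

f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]   # gradient is (2 x0 + 3 x1, 3 x0)
print(fd_gradient(f, [1.0, 2.0]))             # close to [8.0, 3.0]
```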
By contrast, automatic differentiation involves no method error and may reduce
the computational burden dramatically.
(6.103)
This instruction makes little sense, but variants more difficult to detect may lurk in
the direct code. Two types of variables are distinguished:
• the independent variables (the inputs of the direct code), which include the entries of x;
• the dependent variables (to be computed by the direct code), which include f(x).
All of these variables are stacked in a state vector v, a conceptual help not to
be stored as such in the computer. When x takes the numerical value x_0, one of the
dependent variables takes the numerical value f(x_0) upon completion of the execution
of the direct code.
For the sake of simplicity, assume first that the direct code is a linear sequence of
N assignment statements, with no loop or conditional branching. The kth assignment
statement modifies the nu(k)th entry of v as

v_{nu(k)} := phi_k(v).   (6.104)
In general, phi_k depends only on a few entries of v. Let I_k be the set of the indices of
these entries and replace (6.104) by a more detailed version of it,

v_{nu(k)} := phi_k({v_i | i in I_k}).   (6.105)

This may also be written as

v := Phi_k(v),   (6.106)

where Phi_k leaves all the entries of v unchanged, except for the nu(k)th, which is modified
according to (6.105).
Remark 6.11 The expression (6.106) should not be confused with an equation to be
solved for v.
Denote the state of the direct code after executing the kth assignment statement
by v_k. It satisfies

v_k = Phi_k(v_{k-1}),  k = 1, ..., N.   (6.107)
This is the state equation of a discrete-time dynamical system. State equations find
many applications in chemistry, mechanics, control, and signal processing, for instance. (See Chap. 12 for examples of state equations in a continuous-time context.)
The role of discrete time is taken here by the passage from one assignment statement to the next. The final state v_N is obtained from the initial state v_0 by function
composition, as

v_N = Phi_N o Phi_{N-1} o ... o Phi_1 (v_0).   (6.108)

Among other things, the initial state v_0 contains the value x_0 of x and the final state
v_N contains the value of f(x_0).
The chain rule for differentiation applied to (6.107) and (6.108) yields
$$\frac{\partial f}{\partial \mathbf{x}}(\mathbf{x}_0) = \frac{\partial \mathbf{v}_0^{\mathsf T}}{\partial \mathbf{x}}(\mathbf{x}_0)\, \frac{\partial \boldsymbol{\Phi}_1^{\mathsf T}}{\partial \mathbf{v}}(\mathbf{v}_0) \cdots \frac{\partial \boldsymbol{\Phi}_N^{\mathsf T}}{\partial \mathbf{v}}(\mathbf{v}_{N-1})\, \frac{\partial f}{\partial \mathbf{v}_N}(\mathbf{x}_0). \qquad (6.109)$$

As a mnemonic for (6.109), note that since Φ_k(v_{k-1}) = v_k, the fact that

$$\frac{\partial \mathbf{v}^{\mathsf T}}{\partial \mathbf{v}} = \mathbf{I} \qquad (6.110)$$

makes all the intermediary terms in the right-hand side of (6.109) disappear, leaving the same expression as in the left-hand side.
With

$$\frac{\partial \mathbf{v}_0^{\mathsf T}}{\partial \mathbf{x}} = \mathbf{C}, \qquad (6.111)$$

$$\frac{\partial \boldsymbol{\Phi}_k^{\mathsf T}}{\partial \mathbf{v}}(\mathbf{v}_{k-1}) = \mathbf{A}_k \qquad (6.112)$$

and

$$\frac{\partial f}{\partial \mathbf{v}_N}(\mathbf{x}_0) = \mathbf{b}, \qquad (6.113)$$

(6.109) becomes

$$\frac{\partial f}{\partial \mathbf{x}}(\mathbf{x}_0) = \mathbf{C}\,\mathbf{A}_1 \cdots \mathbf{A}_N\, \mathbf{b}. \qquad (6.114)$$

If the direct code stores the value of f(x0) in the last entry of v_N, then

$$f(\mathbf{x}_0) = \mathbf{b}^{\mathsf T}\mathbf{v}_N, \qquad (6.116)$$

with

$$\mathbf{b} = (0 \;\cdots\; 0 \;\; 1)^{\mathsf T}. \qquad (6.117)$$

The evaluation of the matrices A_i and the ordering of the computations remain to be considered.
Introduce the adjoint vectors d_k, computed backward in time via the recurrence

$$\mathbf{d}_{k-1} = \mathbf{A}_k \mathbf{d}_k, \quad k = N, N-1, \ldots, 1, \qquad (6.118)$$

initialized with the terminal condition

$$\mathbf{d}_N = \mathbf{b}. \qquad (6.119)$$

Equation (6.114) then implies that

$$\frac{\partial f}{\partial \mathbf{x}}(\mathbf{x}_0) = \mathbf{C}\,\mathbf{d}_0, \qquad (6.120)$$

which amounts to saying that the value of the gradient is in the first dim x entries of d_0. The vector d_k has the same dimension as v_k and is called its adjoint (or dual). The recurrence (6.118) is implemented in an adjoint code, obtained from the direct code by dualization in a systematic manner, as explained below. See Sect. 6.7.2 for a detailed example.
The matrices involved are

$$\mathbf{A}_k = \frac{\partial \boldsymbol{\Phi}_k^{\mathsf T}}{\partial \mathbf{v}}(\mathbf{v}_{k-1}), \qquad (6.121)$$

where

$$\left[\boldsymbol{\Phi}_k(\mathbf{v}_{k-1})\right]_{\mu(k)} = \varphi_k(\mathbf{v}_{k-1}) \qquad (6.122)$$

and

$$\left[\boldsymbol{\Phi}_k(\mathbf{v}_{k-1})\right]_i = v_i(k-1), \quad \forall i \neq \mu(k), \qquad (6.123)$$

so A_k coincides with the identity matrix, except for its μ(k)th column:

$$\mathbf{A}_k = \begin{pmatrix}
1 & 0 & \cdots & \frac{\partial \varphi_k}{\partial v_1}(\mathbf{v}_{k-1}) & \cdots & 0 \\
0 & \ddots & & \vdots & & \vdots \\
\vdots & & 1 & \frac{\partial \varphi_k}{\partial v_{\mu(k)-1}}(\mathbf{v}_{k-1}) & & \vdots \\
\vdots & & & \frac{\partial \varphi_k}{\partial v_{\mu(k)}}(\mathbf{v}_{k-1}) & & \vdots \\
\vdots & & & \frac{\partial \varphi_k}{\partial v_{\mu(k)+1}}(\mathbf{v}_{k-1}) & 1 & \vdots \\
0 & \cdots & & \frac{\partial \varphi_k}{\partial v_{\dim \mathbf{v}}}(\mathbf{v}_{k-1}) & \cdots & 1
\end{pmatrix}. \qquad (6.124)$$
Written entrywise, with d(k) the dual vector after step k, the backward recurrence (6.118) becomes

$$d_i(k-1) = d_i(k) + \frac{\partial \varphi_k}{\partial v_i}(\mathbf{v}_{k-1})\, d_{\mu(k)}(k), \quad \forall i \neq \mu(k), \qquad (6.125)$$

$$d_{\mu(k)}(k-1) = \frac{\partial \varphi_k}{\partial v_{\mu(k)}}(\mathbf{v}_{k-1})\, d_{\mu(k)}(k). \qquad (6.126)$$

Since we are only interested in d_0, the successive values taken by the dual vector d need not be stored, and the time indexation of d can be avoided. The adjoint instructions for

v_{μ(k)} := φ_k({v_i | i ∈ I_k});

will then be, in this order,

for all i ∈ I_k, i ≠ μ(k), do d_i := d_i + (∂φ_k/∂v_i)(v_{k-1}) * d_{μ(k)};
d_{μ(k)} := (∂φ_k/∂v_{μ(k)})(v_{k-1}) * d_{μ(k)};
Remark 6.12 If k depends nonlinearly on some variables of the direct code, then
the adjoint code will involve the values taken by these variables, which will have to
be stored during the execution of the direct code before the adjoint code is executed.
These storage requirements are a limitation of backward evaluation.
Example 6.12 Assume that the direct code contains the assignment statement
cost := cost+(y-ym)^2;

so φ_k = cost+(y-ym)^2. Let dcost, dy and dym be the dual variables of cost, y and ym. The dualization of this assignment statement yields the following (pseudo) instructions of the adjoint code:

dy := dy + (∂φ_k/∂y)·dcost = dy + 2*(y-ym)*dcost;
dym := dym + (∂φ_k/∂ym)·dcost = dym - 2*(y-ym)*dcost;
dcost := (∂φ_k/∂cost)·dcost = dcost; % useless
A single instruction of the direct code has thus resulted in several instructions of the
adjoint code.
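To make these dualization rules concrete, here is a minimal Python sketch (the book works in MATLAB; this stand-alone example, its one-exponential model and its data are illustrative assumptions) that runs a direct code forward and its adjoint code backward, then checks the gradient against a central finite difference:

```python
import math

def direct_and_adjoint(p, t, y):
    """Direct code: cost = sum_k (y_k - ym_k)^2 with ym_k = p[0]*exp(p[1]*t_k);
    adjoint code: dualize each assignment, in reverse order."""
    n = len(t)
    ym = [0.0] * n
    cost = 0.0
    for k in range(n):                                  # forward loop
        ym[k] = p[0] * math.exp(p[1] * t[k])
        cost += (y[k] - ym[k]) ** 2
    dcost, dym, dp = 1.0, [0.0] * n, [0.0, 0.0]
    for k in reversed(range(n)):                        # backward loop
        dym[k] += -2.0 * (y[k] - ym[k]) * dcost         # dual of the cost update
        dp[0] += math.exp(p[1] * t[k]) * dym[k]         # dual of the ym assignment
        dp[1] += p[0] * t[k] * math.exp(p[1] * t[k]) * dym[k]
    return cost, dp

p = [1.0, -0.5]
t = [0.1 * k for k in range(20)]
y = [math.exp(-0.3 * tk) for tk in t]
cost, grad = direct_and_adjoint(p, t, y)
h = 1e-6
for i in range(2):                                      # central finite differences
    pp, pm = list(p), list(p)
    pp[i] += h
    pm[i] -= h
    fd = (direct_and_adjoint(pp, t, y)[0] - direct_and_adjoint(pm, t, y)[0]) / (2 * h)
    assert abs(fd - grad[i]) < 1e-5 * max(1.0, abs(grad[i]))
```

The duals are accumulated exactly as in (6.125) and (6.126); comparing against a finite difference is the usual way of validating a hand-written adjoint code.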
When there are loops in the direct code, reversing time amounts to reversing the
direction of variation of their iteration counters and the order of the instructions in
the loop. Regarding conditional branching, if the direct code contains
if (condition C) then (code A) else (code B);
then the adjoint code should contain
if (condition C) then (adjoint of A) else (adjoint of B);
and the value taken by condition C during the execution of the direct code should be
stored for the adjoint code to know which branch it should follow.
6.6.3.3 Initializing Adjoint Code
The terminal condition (6.119) with b given by (6.117) means that all the dual
variables must be initialized to zero, except for the one associated with the value of
f (x0 ) upon completion of the execution of the direct code, which must be initialized
to one.
Remark 6.13 v, d and Ak are not stored as such. Only the direct and dual variables
intervene. Using a systematic convention for denoting the dual variables, for instance
by adding a leading d to the name of the dualized variable as in Example 6.12,
improves readability of the adjoint code.
6.6.3.4 In Summary
The adjoint-code procedure is summarized by Fig. 6.2.
The adjoint-code method avoids the method errors due to finite-difference approximation. The generation of the adjoint code from the source of the direct code
is systematic and can be automated.
The volume of computation needed for the evaluation of the function f(·) and its gradient is typically no more than three times that required by the sole evaluation of the function, whatever the dimension of x (compare with the finite-difference approach, where the evaluation of f(·) has to be repeated more than dim x times). The adjoint-code method is thus particularly appropriate when
• dim x is very large, as in some problems in image processing or shape optimization,
• many gradient evaluations are needed, as is often the case in iterative optimization,
• the evaluation of the function is time-consuming or costly.
On the other hand, this method can only be applied if the source of the direct code
is available and differentiable. Implementation by hand should be carried out with
care, as a single coding error may ruin the final result. (Verification techniques are
available, based on the fact that the scalar product of the dual vector with the solution
of a linearized state equation must stay constant along the state trajectory.) Finally, the execution of the adjoint code requires the knowledge of the values taken by some variables during the execution of the direct code (those variables that intervene nonlinearly in assignment statements of the direct code). One must therefore store these values, which may raise memory-size problems.

[Fig. 6.2 The adjoint-code procedure: the direct code maps x0 to f(x0); the adjoint code then runs backward from dN to d0, whose first dim x entries contain the gradient of f at x0]
Each variable v of the direct code is replaced by an ordered pair

$$V = \left(v,\; \frac{\partial v}{\partial \mathbf{x}}\right), \qquad (6.127)$$

and the operations of the direct code are carried out on these pairs. With A = (a, ∂a/∂x) and B = (b, ∂b/∂x),

$$A + B = \left(a + b,\; \frac{\partial a}{\partial \mathbf{x}} + \frac{\partial b}{\partial \mathbf{x}}\right), \qquad (6.128)$$

$$A - B = \left(a - b,\; \frac{\partial a}{\partial \mathbf{x}} - \frac{\partial b}{\partial \mathbf{x}}\right), \qquad (6.129)$$
$$A \times B = \left(ab,\; b\frac{\partial a}{\partial \mathbf{x}} + a\frac{\partial b}{\partial \mathbf{x}}\right), \qquad (6.130)$$

$$\frac{A}{B} = \left(\frac{a}{b},\; \frac{b\dfrac{\partial a}{\partial \mathbf{x}} - a\dfrac{\partial b}{\partial \mathbf{x}}}{b^2}\right), \qquad (6.131)$$

which is better evaluated as

$$\frac{A}{B} = \left(c,\; \frac{\dfrac{\partial a}{\partial \mathbf{x}} - c\dfrac{\partial b}{\partial \mathbf{x}}}{b}\right), \qquad (6.132)$$

with c = a/b.
The ordered pair associated with any real constant d is D = (d, 0), and that associated with the ith independent variable x_i is X_i = (x_i, e_i), where e_i is, as usual, the ith column of the identity matrix. The value g(v) taken by an elementary function g(·) intervening in some instruction of the direct code is replaced by the pair

$$G(V) = \left(g(\mathbf{v}),\; \frac{\partial \mathbf{v}^{\mathsf T}}{\partial \mathbf{x}}\, \frac{\partial g}{\partial \mathbf{v}}(\mathbf{v})\right), \qquad (6.133)$$

where V is a vector of pairs V_i = (v_i, ∂v_i/∂x), which contains all the entries of ∂v/∂x, and where ∂g/∂v is easy to compute analytically.
Example 6.13 Consider the direct code of the example in Sect. 6.7.2. It suffices to execute this direct code with each operation on reals replaced by the corresponding operation on ordered pairs, after initializing the pairs as follows:

$$F = (0,\; \mathbf{0}), \qquad (6.134)$$

$$P_1 = \left(p_1,\; \begin{pmatrix} 1 \\ 0 \end{pmatrix}\right), \qquad (6.136)$$

$$P_2 = \left(p_2,\; \begin{pmatrix} 0 \\ 1 \end{pmatrix}\right). \qquad (6.137)$$

Upon completion of the execution,

$$F = \left(f(\mathbf{x}_0),\; \frac{\partial f}{\partial \mathbf{x}}(\mathbf{x}_0)\right), \qquad (6.138)$$

where x0 = (p1, p2)ᵀ is the vector containing the numerical values of the parameters at which the gradient must be evaluated.
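The pair arithmetic (6.128)–(6.133) is straightforward to mimic in code. Below is a hedged Python sketch (the operator set and the function being differentiated are illustrative choices, not the book's implementation):

```python
import math

class Pair:
    """Ordered pair (value, gradient) implementing forward-mode AD rules."""
    def __init__(self, v, g):
        self.v, self.g = v, list(g)
    def __add__(self, o):                          # rule (6.128)
        return Pair(self.v + o.v, [a + b for a, b in zip(self.g, o.g)])
    def __sub__(self, o):                          # rule (6.129)
        return Pair(self.v - o.v, [a - b for a, b in zip(self.g, o.g)])
    def __mul__(self, o):                          # rule (6.130)
        return Pair(self.v * o.v, [o.v * a + self.v * b for a, b in zip(self.g, o.g)])
    def __truediv__(self, o):                      # rule (6.132), with c = a/b
        c = self.v / o.v
        return Pair(c, [(a - c * b) / o.v for a, b in zip(self.g, o.g)])

def exp(P):
    # elementary-function rule (6.133): G(V) = (g(v), (dv/dx) * dg/dv)
    e = math.exp(P.v)
    return Pair(e, [e * a for a in P.g])

# gradient of f(p1, p2) = p1*exp(p2) + p1/p2 at (2.0, 0.5)
P1 = Pair(2.0, [1.0, 0.0])
P2 = Pair(0.5, [0.0, 1.0])
F = P1 * exp(P2) + P1 / P2
# exact: df/dp1 = exp(p2) + 1/p2, df/dp2 = p1*exp(p2) - p1/p2^2
assert abs(F.g[0] - (math.exp(0.5) + 2.0)) < 1e-12
assert abs(F.g[1] - (2.0 * math.exp(0.5) - 8.0)) < 1e-12
```

Running the direct computation on pairs yields the value and its gradient in a single pass, exactly as in Example 6.13.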
An advantage of this forward approach is that the first-order sensitivity of the model output with respect to x,

$$\mathbf{s}(k, \mathbf{x}) = \frac{\partial y_m}{\partial \mathbf{x}}(k, \mathbf{x}), \quad k = 1, \ldots, n_t, \qquad (6.139)$$

is readily available, which makes it possible to use this information in a Gauss-Newton method (see Sect. 9.3.4.3). On the other hand, the number of flops will be higher than with the adjoint-code method, very much so if the dimension of x is large.
The Hessian of f(·) at x is the matrix of its second-order partial derivatives,

$$\frac{\partial^2 f}{\partial \mathbf{x}\,\partial \mathbf{x}^{\mathsf T}}(\mathbf{x}) = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(\mathbf{x}) \\
\frac{\partial^2 f}{\partial x_2 \partial x_1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_2^2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(\mathbf{x}) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_n \partial x_2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(\mathbf{x})
\end{pmatrix}. \qquad (6.140)$$

It can be viewed as the first-order derivative of the gradient, since

$$\frac{\partial^2 f}{\partial \mathbf{x}\,\partial \mathbf{x}^{\mathsf T}} = \frac{\partial}{\partial \mathbf{x}}\left(\frac{\partial f}{\partial \mathbf{x}^{\mathsf T}}\right), \qquad (6.141)$$

with the gradient denoted by

$$\mathbf{g}(\mathbf{x}) = \frac{\partial f}{\partial \mathbf{x}}(\mathbf{x}). \qquad (6.142)$$
Section 6.6.3 has shown that g(x) can be evaluated very efficiently by combining
the use of a direct code evaluating f (x) and of the corresponding adjoint code. This
combination can itself be viewed as a second direct code evaluating g(x). Assume
that the value of g(x) is in the last n entries of the state vector v of this second direct
code at the end of its execution. A second adjoint code can now be associated to this
second direct code to compute the Hessian. It will use a variant of (6.109), where the
output of the second direct code is the vector g(x) instead of the scalar f (x):
$$\frac{\partial \mathbf{g}^{\mathsf T}}{\partial \mathbf{x}}(\mathbf{x}) = \frac{\partial \mathbf{v}_0^{\mathsf T}}{\partial \mathbf{x}}(\mathbf{x})\, \frac{\partial \boldsymbol{\Phi}_1^{\mathsf T}}{\partial \mathbf{v}}(\mathbf{v}_0) \cdots \frac{\partial \boldsymbol{\Phi}_N^{\mathsf T}}{\partial \mathbf{v}}(\mathbf{v}_{N-1})\, \frac{\partial \mathbf{g}^{\mathsf T}}{\partial \mathbf{v}_N}(\mathbf{x}). \qquad (6.143)$$

(6.113) must then be replaced by

$$\frac{\partial \mathbf{g}^{\mathsf T}}{\partial \mathbf{v}_N}(\mathbf{x}) = \mathbf{B}, \qquad (6.144)$$

and (6.114) by

$$\frac{\partial^2 f}{\partial \mathbf{x}\,\partial \mathbf{x}^{\mathsf T}}(\mathbf{x}) = \mathbf{C}\,\mathbf{A}_1 \cdots \mathbf{A}_N\, \mathbf{B}, \qquad (6.145)$$

for the computation of the Hessian to boil down to the evaluation of the product of these matrices. Everything else is formally unchanged, but the computational burden increases, as the vector b has been replaced by a matrix B with n columns.
In the forward approach, each ordered pair is replaced by a triplet

$$V = \left(v,\; \frac{\partial v}{\partial \mathbf{x}},\; \frac{\partial^2 v}{\partial \mathbf{x}\,\partial \mathbf{x}^{\mathsf T}}\right). \qquad (6.146)$$

The fact that Hessians are symmetrical can be taken advantage of.
6.7 MATLAB Examples

6.7.1 Integration

Consider the Gaussian probability density function with zero mean and unit variance,

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right). \qquad (6.147)$$

The probability that x belongs to the interval [μ − 2σ, μ + 2σ] = [−2, 2] is given by

$$I = \int_{-2}^{+2} f(x)\,\mathrm{d}x, \qquad (6.148)$$

i.e.,

$$I = \operatorname{erf}(\sqrt{2}) \approx 0.9544997361036416. \qquad (6.149)$$

One of the functions available for this purpose is quad [3], which combines ideas of Simpson's 1/3 rule and Romberg integration, and recursively bisects the integration interval when and where needed for the estimated method error to stay below some absolute tolerance, set by default to 10^-6. The script

f = @(x) exp(-x.^2/2)/sqrt(2*pi);
Integral = quad(f,-2,2)

produces

Integral = 9.544997948576686e-01

so the absolute error is indeed less than 10^-6. Note the dot in the definition of the anonymous function f, needed because x is considered as a vector argument. See the MATLAB documentation for details.
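For readers without MATLAB, the same integral can be checked with a composite Simpson's rule; this Python sketch (a fixed-step illustration, not the adaptive, recursive bisection that quad performs) compares the estimate with erf(√2):

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson's 1/3 rule with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))  # odd nodes
    s += 2 * sum(f(a + 2 * i * h) for i in range(1, n // 2))            # even interior nodes
    return s * h / 3

f = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
I = simpson(f, -2.0, 2.0, 128)
exact = math.erf(math.sqrt(2))   # = 0.9544997361036416...
assert abs(I - exact) < 1e-6
```

With 128 subintervals the O(h^4) method error is already far below the 10^-6 tolerance that quad targets by default.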
I can also be evaluated with a Monte Carlo method, as in the script

f = @(x) exp(-x.^2/2)/sqrt(2*pi);
IntMC = zeros(20,1);
N = 1;
for i=1:20,
    X = 4*rand(N,1)-2; % X uniform between -2 and 2
                       % Width of [-2,2] = 4
    F = f(X);
    IntMC(i) = 4*mean(F)
    N = 2*N; % number of function evaluations
             % doubles at each iteration
end
ErrorOnInt = IntMC - 0.9545;
plot(ErrorOnInt,'o','MarkerEdgeColor',...
    'k','MarkerSize',7)
xlabel('log_2(N)')
ylabel('Absolute error on I')

[Fig. 6.3 Absolute error on I as a function of the logarithm of the number N of integrand evaluations]

This approach is no match for quad, and Fig. 6.3 confirms that the convergence to zero of the absolute error on the integral is slow.
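A Python equivalent of the Monte Carlo estimate can be sketched as follows (the seed and sample size are arbitrary choices):

```python
import math
import random

random.seed(0)
f = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
N = 1 << 16                      # 65536 integrand evaluations
# X uniform on [-2, 2]; the width of the interval is 4
est = 4 * sum(f(random.uniform(-2.0, 2.0)) for _ in range(N)) / N
exact = math.erf(math.sqrt(2))   # about 0.9545
assert abs(est - exact) < 0.01   # the error shrinks like 1/sqrt(N)
```

Even with 2^16 evaluations the estimate is only good to a couple of decimal places, consistent with the slow 1/√N convergence visible in Fig. 6.3.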
The redeeming feature of the Monte Carlo approach is its ability to deal with higher dimensional integrals. Let us illustrate this by evaluating

$$V_n = \int_{\mathcal{B}_n} \mathrm{d}\mathbf{x}, \qquad (6.151)$$

where B_n is the unit Euclidean ball

$$\mathcal{B}_n = \{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\|_2 \leqslant 1\}. \qquad (6.152)$$

This can be carried out by the following script, where n is the dimension of the Euclidean space and V(i) the volume V_n as estimated from 2^i pseudo-random x's in [-1, 1]^n.
clear all
V = zeros(20,1);
N = 1;
% n (the dimension) is assumed to be set beforehand, e.g., n = 6;
for i=1:20,
    F = zeros(N,1);
    X = 2*rand(n,N)-1; % X uniform between -1 and 1
    for j=1:N,
        x = X(:,j);
        if (norm(x,2)<=1)
            F(j) = 1;
        end
    end
    V(i) = mean(F)*2^n;
    N = 2*N; % Number of function evaluations
             % doubles at each iteration
end
V_n is the (hyper)volume of B_n, which can be computed exactly. The recurrence

$$V_n = \frac{2\pi}{n}\, V_{n-2} \qquad (6.153)$$

can, for instance, be used to compute it for even n's, starting from V_2 = π. It implies that V_6 = π³/6. Running our Monte Carlo script with n = 6; and adding

TrueV6 = (pi^3)/6;
RelErrOnV6 = 100*(V - TrueV6)/TrueV6;
plot(RelErrOnV6,'o','MarkerEdgeColor',...
    'k','MarkerSize',7)
xlabel('log_2(N)')
ylabel('Relative error on V_6 (in %)')

we get Fig. 6.4, which shows the evolution of the relative error on V6 as a function of log2 N.
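A Python version of the same experiment can be sketched as follows (the seed and sample size are illustrative choices):

```python
import math
import random

random.seed(1)
n, N = 6, 1 << 17               # dimension and number of draws
inside = 0
for _ in range(N):
    x = [random.uniform(-1.0, 1.0) for _ in range(n)]
    if sum(xi * xi for xi in x) <= 1.0:
        inside += 1
V6 = (inside / N) * 2 ** n      # fraction of the cube [-1,1]^6 inside B6
true_V6 = math.pi ** 3 / 6      # from the recurrence V_n = (2*pi/n)*V_{n-2}
assert abs(V6 - true_V6) / true_V6 < 0.05
```

Note how small the hit rate already is in dimension 6 (about 8% of the cube lies inside the ball), one reason why the relative error of such estimators grows with dimension at fixed N.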
[Fig. 6.4 Relative error on the volume of the six-dimensional unit Euclidean ball as a function of the logarithm of the number N of integrand evaluations]
6.7.2 Differentiation

Consider the multiexponential model

$$y_m(k, \mathbf{p}) = \sum_{i=1}^{n_{\exp}} p_i \exp(p_{n_{\exp}+i}\, t_k), \qquad (6.154)$$

and the quadratic cost function

$$J(\mathbf{p}) = \sum_{k=1}^{n_{\text{times}}} \left[y(k) - y_m(k, \mathbf{p})\right]^2. \qquad (6.155)$$

The gradient of this cost with respect to p can be evaluated by dualizing the following direct code:

cost = 0;
for k=1:ntimes, % Forward loop
    ym(k) = 0;
    for i=1:nexp, % Forward loop
        ym(k) = ym(k)+p(i)*exp(p(nexp+i)*t(k));
    end
    cost = cost+(y(k)-ym(k))^2;
end
The systematic rules described in Sect. 6.6.2 can be used to derive the following
script (adjoint code),
dcost = 1;
dy = zeros(ntimes,1);
dym = zeros(ntimes,1);
dp = zeros(2*nexp,1);
dt = zeros(ntimes,1);
for k=ntimes:-1:1, % Backward loop
    dy(k) = dy(k)+2*(y(k)-ym(k))*dcost;
    dym(k) = dym(k)-2*(y(k)-ym(k))*dcost;
    dcost = dcost;
    for i=nexp:-1:1, % Backward loop
        dp(i) = dp(i)+exp(p(nexp+i)*t(k))*dym(k);
        dp(nexp+i) = dp(nexp+i)...
            +p(i)*t(k)*exp(p(nexp+i)*t(k))*dym(k);
        dt(k) = dt(k)+p(i)*p(nexp+i)...
            *exp(p(nexp+i)*t(k))*dym(k);
        dym(k) = dym(k);
    end
    dym(k) = 0;
end
dcost = 0;
dp % contains the gradient vector
This code could of course be made more concise by eliminating useless instructions.
It could also be written in such a way as to minimize operations on entries of vectors,
which are inefficient in a matrix-oriented language.
Assume that the data are generated by the script
ntimes = 100; % number of measurement times
nexp = 2;
% number of exponential terms
% value of p used to generate the data:
pstar = [1; -1; -0.3; -1];
h = 0.2; % time step
t(1) = 0;
for k=2:ntimes,
t(k)=t(k-1)+h;
end
for k=1:ntimes,
y(k) = 0;
for i=1:nexp,
y(k) = y(k)+pstar(i)*exp(pstar(nexp+i)*t(k));
end
end
With these data, for p = (1.1, -0.9, -0.2, -0.9)ᵀ, the value of the gradient vector as computed by the adjoint code is found to be
dp =
7.847859612874749e+00
2.139461455801426e+00
3.086120784615719e+01
-1.918927727244027e+00
In this simple example, the gradient of the cost is easy to compute analytically, as

$$\frac{\partial J}{\partial \mathbf{p}} = -2 \sum_{k=1}^{n_{\text{times}}} \left[y(k) - y_m(k, \mathbf{p})\right] \frac{\partial y_m}{\partial \mathbf{p}}(k), \qquad (6.157)$$

with

$$\frac{\partial y_m}{\partial p_i}(k) = \exp(p_{n_{\exp}+i}\, t_k) \qquad (6.158)$$

and

$$\frac{\partial y_m}{\partial p_{n_{\exp}+i}}(k) = p_i\, t_k \exp(p_{n_{\exp}+i}\, t_k), \quad i = 1, \ldots, n_{\exp}. \qquad (6.159)$$
The results of the adjoint code can thus be checked by running the script
for i=1:nexp,
for k=1:ntimes,
s(i,k) = exp(p(nexp+i)*t(k));
s(nexp+i,k) = t(k)*p(i)*exp(p(nexp+i)*t(k));
end
end
for i=1:2*nexp,
g(i) = 0;
for k=1:ntimes,
g(i) = g(i)-2*(y(k)-ym(k))*s(i,k);
end
end
g % contains the gradient vector
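The same cross-check can be sketched in Python, using (6.157)–(6.159) for the analytic gradient and a central finite difference on the cost as an independent reference (the perturbed parameter vector below is an illustrative choice consistent with the data-generation script):

```python
import math

ntimes, nexp, h = 100, 2, 0.2
pstar = [1.0, -1.0, -0.3, -1.0]
t = [h * k for k in range(ntimes)]
y = [sum(pstar[i] * math.exp(pstar[nexp + i] * tk) for i in range(nexp)) for tk in t]

p = [1.1, -0.9, -0.2, -0.9]
ym = [sum(p[i] * math.exp(p[nexp + i] * tk) for i in range(nexp)) for tk in t]

# analytic gradient, Eqs. (6.157)-(6.159)
g = [0.0] * (2 * nexp)
for k, tk in enumerate(t):
    r = y[k] - ym[k]
    for i in range(nexp):
        s_i = math.exp(p[nexp + i] * tk)          # dym/dp_i
        g[i] -= 2 * r * s_i
        g[nexp + i] -= 2 * r * p[i] * tk * s_i    # dym/dp_{nexp+i}

# finite-difference check on the cost J(p) = sum_k (y_k - ym_k)^2
def cost(q):
    return sum((y[k] - sum(q[i] * math.exp(q[nexp + i] * tk)
                           for i in range(nexp))) ** 2
               for k, tk in enumerate(t))

delta = 1e-6
for j in range(2 * nexp):
    qp, qm = list(p), list(p)
    qp[j] += delta
    qm[j] -= delta
    fd = (cost(qp) - cost(qm)) / (2 * delta)
    assert abs(fd - g[j]) < 1e-4 * max(1.0, abs(g[j]))
```

The two evaluations agree to several digits, which is exactly the kind of verification recommended before trusting a hand-written adjoint code.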
6.8 In Summary
• Traditional methods for evaluating definite integrals, such as the Simpson and Boole rules, require the points at which the integrand is evaluated to be regularly spaced. As a result, they have fewer degrees of freedom than otherwise possible, and their error orders are lower than they might have been.
• Romberg's method applies Richardson's principle to the trapezoidal rule and can deliver extremely accurate results quickly thanks to lucky cancelations if the integrand is sufficiently smooth.
• Gaussian quadrature escapes the constraint of a regular spacing of the evaluation points, which makes it possible to increase error order, but still sticks to fixed rules for deciding where to evaluate the integrand.
• For all of these methods, a divide-and-conquer approach can be used to split the horizon of integration into subintervals in order to adapt to changes in the speed of variation of the integrand.
• Transforming function integration into the integration of an ordinary differential equation also makes it possible to adapt the step-size to the local behavior of the integrand.
• Evaluating definite integrals of multivariate functions is much more complicated than in the univariate case. For low-dimensional problems, and provided that the integrand is sufficiently smooth, nested one-dimensional integrations may be used.
• The Monte Carlo approach is simpler to implement (given a good random-number generator) and can deal with discontinuities of the integrand. To divide the standard deviation of the error by two, one needs to multiply the number of function evaluations by four. This holds true for any dimension of x, which makes Monte Carlo integration particularly suitable for high-dimensional problems.
• Numerical differentiation relies heavily on polynomial interpolation. The order of the approximation can be computed and used in Richardson's extrapolation to increase the order of the method error. This may help one avoid exceedingly small step-sizes that lead to an explosion of the rounding error.
• As the entries of gradients, Hessians, and Jacobian matrices are partial derivatives, they can be evaluated using the techniques available for univariate functions.
• Automatic differentiation makes it possible to evaluate the gradient of a function defined by a computer program. Contrary to the finite-difference approach, it involves no method error.
References

1. Jazwinski, A.: Stochastic Processes and Filtering Theory. Academic Press, New York (1970)
2. Borrie, J.: Stochastic Systems for Engineers. Prentice-Hall, Hemel Hempstead (1992)
3. Gander, W., Gautschi, W.: Adaptive quadrature—revisited. BIT 40(1), 84–101 (2000)
4. Fortin, A.: Numerical Analysis for Engineers. École Polytechnique de Montréal, Montréal (2009)
5. Stoer, J., Bulirsch, R.: Introduction to Numerical Analysis. Springer, New York (1980)
6. Golub, G., Welsch, J.: Calculation of Gauss quadrature rules. Math. Comput. 23(106), 221–230 (1969)
7. Lowan, A., Davids, N., Levenson, A.: Table of the zeros of the Legendre polynomials of order 1–16 and the weight coefficients for Gauss' mechanical quadrature formula. Bull. Am. Math. Soc. 48(10), 739–743 (1942)
8. Lowan, A., Davids, N., Levenson, A.: Errata to "Table of the zeros of the Legendre polynomials of order 1–16 and the weight coefficients for Gauss' mechanical quadrature formula". Bull. Am. Math. Soc. 49(12), 939 (1943)
9. Knuth, D.: The Art of Computer Programming: 2 Seminumerical Algorithms, 3rd edn. Addison-Wesley, Reading (1997)
10. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes. Cambridge University Press, Cambridge (1986)
11. Moler, C.: Numerical Computing with MATLAB, revised, reprinted edn. SIAM, Philadelphia (2008)
12. Robert, C., Casella, G.: Monte Carlo Statistical Methods. Springer, New York (2004)
13. Morokoff, W., Caflisch, R.: Quasi-Monte Carlo integration. J. Comput. Phys. 122, 218–230 (1995)
14. Owen, A.: Monte Carlo variance of scrambled net quadratures. SIAM J. Numer. Anal. 34(5), 1884–1910 (1997)
15. Gilbert, J., Le Vey, G., Masse, J.: La différentiation automatique de fonctions représentées par des programmes. Technical Report 1557, INRIA (1991)
16. Griewank, A., Corliss, G. (eds.): Automatic Differentiation of Algorithms: Theory, Implementation and Applications. SIAM, Philadelphia (1991)
17. Speelpenning, B.: Compiling fast partial derivatives of functions given by algorithms. Ph.D. thesis, Department of Computer Science, University of Illinois, Urbana-Champaign (1980)
18. Hammer, R., Hocks, M., Kulisch, U., Ratz, D.: C++ Toolbox for Verified Computing. Springer, Berlin (1995)
19. Rall, L., Corliss, G.: Introduction to automatic differentiation. In: Berz, M., Bischof, C., Corliss, G., Griewank, A. (eds.) Computational Differentiation: Techniques, Applications, and Tools. SIAM, Philadelphia (1996)
20. Nocedal, J., Wright, S.: Numerical Optimization. Springer, New York (1999)
21. Griewank, A., Walther, A.: Principles and Techniques of Algorithmic Differentiation, 2nd edn. SIAM, Philadelphia (2008)
Chapter 7
Solving Systems of Nonlinear Equations
7.2 Examples
Example 7.1 Equilibrium points of nonlinear differential equations
The chemical reactions taking place inside a constant-temperature continuous
stirred tank reactor (CSTR) can be described by a system of nonlinear ordinary
differential equations
x = f(x),
(7.1)
(7.2)
(7.3)
with θi the ith Euler angle. Computing all the solutions of such a system of equations is difficult, especially if one is interested only in the real solutions. That is
why this problem has become a benchmark in computer algebra [3], which can also
be solved numerically in an approximate but guaranteed way by interval analysis
[4]. The methods described in this chapter try, more modestly, to find some of the
solutions.
We want to find a value (or values) of the scalar variable x such that

$$f(x) = 0. \qquad (7.4)$$

Assume that an interval [a_k, b_k] is available such that f(a_k) f(b_k) < 0, so that this interval contains at least one solution. Evaluate f(·) at the midpoint

$$c_k = \frac{a_k + b_k}{2}. \qquad (7.5)$$

If

$$f(a_k) f(c_k) \leqslant 0, \qquad (7.6)$$

then take

$$[a_{k+1}, b_{k+1}] = [a_k, c_k]; \qquad (7.7)$$

otherwise, take

$$[a_{k+1}, b_{k+1}] = [c_k, b_k]. \qquad (7.8)$$

The resulting interval [a_{k+1}, b_{k+1}] is also guaranteed to contain at least one solution of (7.4). Unless an exact solution has been found at the middle of the last interval considered, the width of the interval in which at least one solution x* is trapped is divided by two at each iteration (Fig. 7.3). The method does not provide a point estimate of x*, but with a slight modification of the definition in Sect. 2.5.3, it can be said to converge linearly, with a rate equal to 0.5, as

$$\max_{x \in [a_{k+1}, b_{k+1}]} |x - x^\star| = 0.5 \max_{x \in [a_k, b_k]} |x - x^\star|. \qquad (7.9)$$
As long as the effect of rounding can be neglected, each iteration thus increases
the number of correct bits in the mantissa by one.

[Fig. 7.3 One step of bisection: the half of [a_k, b_k] over which f does not change sign is eliminated]

When computing with double floats, there is therefore no point in carrying out more than 52 iterations, and specific precautions must be taken for the results still to be guaranteed, see Sect. 14.5.2.3.
Remark 7.2 When there are several solutions of (7.4) in [ak , bk ], dichotomy will
converge to one of them.
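The scheme (7.5)–(7.8) can be sketched in Python as follows (the test function and iteration cap are illustrative choices):

```python
def bisect(f, a, b, kmax=60):
    """Bisection: [a, b] must satisfy f(a)*f(b) <= 0."""
    fa = f(a)
    for _ in range(kmax):
        c = 0.5 * (a + b)          # Eq. (7.5)
        fc = f(c)
        if fc == 0.0:
            return c, c            # exact solution hit
        if fa * fc < 0.0:          # Eq. (7.6): keep [a, c]
            b = c
        else:                      # otherwise keep [c, b]
            a, fa = c, fc
    return a, b

f = lambda x: x * x - 3.0          # one root is sqrt(3)
a, b = bisect(f, 1.0, 2.0)
assert abs(0.5 * (a + b) - 3 ** 0.5) < 1e-12
```

Sixty halvings of a unit-width interval already reach the limits of double precision, consistent with the remark that more than 52 iterations are pointless.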
One may transform (7.4) into

$$x = \varphi(x), \qquad (7.10)$$

with

$$\varphi(x) = x + \lambda f(x) \qquad (7.11)$$

and λ ≠ 0 a parameter to be chosen by the user. If it exists, the limit of the fixed-point iteration

$$x_{k+1} = \varphi(x_k), \quad k = 0, 1, \ldots \qquad (7.12)$$

is a solution of (7.4).
Figure 7.4 illustrates a situation where fixed-point iteration converges to the solution of the problem. An analysis of the conditions and speed of convergence of this
method can be found in Sect. 7.4.1.
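A quick numerical illustration in Python (the test function, λ, and starting point are illustrative choices; λ = −0.25 makes |φ′(x)| = |1 + 2λx| < 1 near √3, so φ is locally contracting):

```python
# phi(x) = x + lam*f(x) with f(x) = x^2 - 3
f = lambda x: x * x - 3.0
lam = -0.25
x = 1.0
for _ in range(100):               # fixed-point iteration (7.12)
    x = x + lam * f(x)
assert abs(x - 3 ** 0.5) < 1e-12   # converged to the root sqrt(3)
```

With λ = +0.25 instead, |φ′| > 1 near the root and the same iteration diverges, illustrating how critical the choice of λ is.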
[Fig. 7.4 Convergent fixed-point iteration: starting from x1, x2 = φ(x1), x3 = φ(x2), and so on]
The secant method interpolates f(·) between the last two evaluation points by

$$P_1(x) = f_k + \frac{f_k - f_{k-1}}{x_k - x_{k-1}}(x - x_k), \qquad (7.13)$$

where f_k stands for f(x_k). The next evaluation point x_{k+1} is chosen so as to ensure that P_1(x_{k+1}) = 0. One iteration thus computes

$$x_{k+1} = x_k - \frac{x_k - x_{k-1}}{f_k - f_{k-1}}\, f_k. \qquad (7.14)$$
(7.14)
As Fig. 7.5 shows, there is no guaranty that this procedure will converge to a solution,
and the choice of the two initial evaluation points x0 and x1 is critical.
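A Python sketch of the secant iteration (7.14) on the same test equation (the stopping rule is an illustrative choice):

```python
def secant(f, x0, x1, kmax=50, tol=1e-14):
    f0, f1 = f(x0), f(x1)
    for _ in range(kmax):
        if f1 == f0:                              # avoid division by zero
            return x1
        x2 = x1 - (x1 - x0) / (f1 - f0) * f1      # Eq. (7.14)
        if abs(x2 - x1) <= tol * abs(x2):
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    return x1

root = secant(lambda x: x * x - 3.0, 1.0, 2.0)
assert abs(root - 3 ** 0.5) < 1e-12
```

With these two initial points the iteration converges superlinearly; poorly chosen x0, x1 can make it wander off, as the text warns.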
[Fig. 7.5 The secant method: the root of the interpolating polynomial through (x_{k-1}, f_{k-1}) and (x_k, f_k) provides x_{k+1}]

Newton's method approximates f(·) around x_k by its first-order Taylor expansion

$$f(x) \approx P_1(x) = f(x_k) + \dot f(x_k)(x - x_k). \qquad (7.15)$$
The next evaluation point x_{k+1} is again chosen so as to ensure that P_1(x_{k+1}) = 0. One iteration thus computes

$$x_{k+1} = x_k - \frac{f(x_k)}{\dot f(x_k)}. \qquad (7.16)$$

A second-order Taylor expansion of f(·) around x_k, evaluated at a solution x*, gives

$$0 = f(x^\star) = f(x_k) + \dot f(x_k)(x^\star - x_k) + \frac{\ddot f(c_k)}{2}(x^\star - x_k)^2, \qquad (7.17)$$

with c_k between x_k and x*. Dividing (7.17) by ḟ(x_k) and taking (7.16) into account, one gets

$$x^\star - x_k + \frac{f(x_k)}{\dot f(x_k)} + \frac{\ddot f(c_k)}{2\dot f(x_k)}(x^\star - x_k)^2 = 0, \qquad (7.18)$$

so that

$$x_{k+1} - x^\star = \frac{\ddot f(c_k)}{2\dot f(x_k)}(x_k - x^\star)^2. \qquad (7.19)$$
[Fig. 7.6 Newton's method may fail to converge: from x0, the iterates x1, x2, ... move away from the root]
Asymptotically, when x_k is close enough to x*,

$$|x_{k+1} - x^\star| \approx \left|\frac{\ddot f(x^\star)}{2\dot f(x^\star)}\right| \cdot |x_k - x^\star|^2, \qquad (7.20)$$

provided that f(·) has continuous, bounded first and second derivatives in the neighborhood of x*, with ḟ(x*) ≠ 0. Convergence of x_k toward x* is then quadratic. The number of correct digits in the solution should approximately double at each iteration until rounding error becomes predominant. This is much better than the linear convergence of the bisection method, but there are drawbacks:
• there is no guarantee that Newton's method will converge to a solution (see Fig. 7.6),
• ḟ(x_k) must be evaluated,
• the choice of the initial evaluation point x0 is critical.
Rewrite (7.20) as

$$|x_{k+1} - x^\star| \approx \alpha |x_k - x^\star|^2, \qquad (7.21)$$

with

$$\alpha = \left|\frac{\ddot f(x^\star)}{2\dot f(x^\star)}\right|. \qquad (7.22)$$

Then

$$\alpha(x_{k+1} - x^\star) \approx \left[\alpha(x_k - x^\star)\right]^2, \qquad (7.23)$$

so convergence requires that

$$\alpha\, |x_0 - x^\star| < 1, \qquad (7.24)$$

although the method may still work when this condition is not satisfied.
Remark 7.3 Newton's method runs into trouble when ḟ(x*) = 0, which happens when the root x* is multiple, i.e., when

$$f(x) = (x - x^\star)^m g(x), \qquad (7.25)$$

with g(x*) ≠ 0 and m > 1. Its (asymptotic) convergence speed is then only linear. When the degree of multiplicity m is known, quadratic convergence speed can be restored by replacing (7.16) by

$$x_{k+1} = x_k - m\, \frac{f(x_k)}{\dot f(x_k)}. \qquad (7.26)$$

When m is not known, or when f(·) has several multiple roots, one may instead replace f(·) in (7.16) by h(·), with

$$h(x) = \frac{f(x)}{\dot f(x)}, \qquad (7.27)$$

as all the roots of h(·) are simple.
The (asymptotic) convergence of the secant method toward a simple root satisfies

$$|x_{k+1} - x^\star| \propto |x_k - x^\star|^{\frac{1+\sqrt 5}{2}}. \qquad (7.29)$$

It is thus not quadratic, but still superlinear, as the golden number (1 + √5)/2 is such that

$$1 < \frac{1+\sqrt 5}{2} \approx 1.618 < 2. \qquad (7.30)$$

Just as with Newton's method, the asymptotic convergence speed becomes linear if the root x* is multiple [7]. Recall that the secant method does not require the evaluation of ḟ(x_k), so each iteration is less expensive than with Newton's method.
Consider now a system of nonlinear equations

$$\mathbf{f}(\mathbf{x}) = \mathbf{0}. \qquad (7.31)$$

As in the univariate case, it can be transformed into the fixed-point problem

$$\mathbf{x} = \boldsymbol{\varphi}(\mathbf{x}), \qquad (7.32)$$

with

$$\boldsymbol{\varphi}(\mathbf{x}) = \mathbf{x} + \lambda \mathbf{f}(\mathbf{x}) \qquad (7.33)$$

and λ ≠ 0 some scalar parameter to be chosen by the user. If it exists, the limit of the fixed-point iteration

$$\mathbf{x}_{k+1} = \boldsymbol{\varphi}(\mathbf{x}_k), \quad k = 0, 1, \ldots \qquad (7.34)$$

is a solution of (7.31). This method will converge to the solution x* if φ(·) is contracting, i.e., such that

$$\exists \rho < 1 : \forall (\mathbf{x}_1, \mathbf{x}_2), \quad \|\boldsymbol{\varphi}(\mathbf{x}_1) - \boldsymbol{\varphi}(\mathbf{x}_2)\| \leqslant \rho\, \|\mathbf{x}_1 - \mathbf{x}_2\|, \qquad (7.35)$$

and the smaller ρ is, the better.
Newton's method extends to the multivariate case by replacing f(·) around x_k by its first-order Taylor expansion

$$\mathbf{f}(\mathbf{x}) \approx \mathbf{f}(\mathbf{x}_k) + \mathbf{J}(\mathbf{x}_k)(\mathbf{x} - \mathbf{x}_k), \qquad (7.37)$$

where J(x_k) is the Jacobian matrix of f(·) at x_k,

$$\mathbf{J}(\mathbf{x}_k) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}^{\mathsf T}}(\mathbf{x}_k), \qquad (7.38)$$

with entries

$$j_{i,l} = \frac{\partial f_i}{\partial x_l}(\mathbf{x}_k). \qquad (7.39)$$

The next evaluation point x_{k+1} is chosen so as to make the right-hand side of (7.37) equal to zero. One iteration thus computes

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{J}^{-1}(\mathbf{x}_k)\,\mathbf{f}(\mathbf{x}_k). \qquad (7.40)$$

Of course, the Jacobian matrix is not inverted. Instead, the corrective term

$$\Delta \mathbf{x}_k = \mathbf{x}_{k+1} - \mathbf{x}_k \qquad (7.41)$$

is evaluated by solving the linear system

$$\mathbf{J}(\mathbf{x}_k)\,\Delta \mathbf{x}_k = -\mathbf{f}(\mathbf{x}_k), \qquad (7.42)$$

and the next evaluation point is taken as

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \Delta \mathbf{x}_k. \qquad (7.43)$$
Remark 7.6 The condition number of J(xk ) is indicative of the local difficulty of
the problem, which depends on the value of xk . Even if the condition number of the
Jacobian matrix at an actual solution vector is not too large, it may take very large
values for some values of xk along the trajectory of the algorithm.
The properties of Newtons method in the multivariate case are similar to those
of the univariate case. Under the following hypotheses
f() is continuously differentiable in an open convex domain D (H1),
there exists x in D such that f(x ) = 0 and J(x ) is invertible (H2),
J(·) satisfies a Lipschitz condition at x*, i.e., there exists a constant γ such that ‖J(x) − J(x*)‖ ⩽ γ‖x − x*‖ (H3),
asymptotic convergence speed is quadratic provided that x0 is close enough to x .
In practice, the method may fail to converge to a solution and initialization remains
critical. Again, some divergence problems can be avoided by using a damped Newton method,

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \Delta \mathbf{x}_k, \qquad (7.44)$$

where the damping factor α_k ∈ (0, 1] is chosen to enforce

$$\|\mathbf{f}(\mathbf{x}_{k+1})\| < \|\mathbf{f}(\mathbf{x}_k)\|. \qquad (7.45)$$
Remark 7.8 Newtons method also plays a key role in optimization, see
Sect. 9.3.4.2.
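A Python sketch of one possible implementation on a toy two-equation system (the system, the starting point, and the Cramer's-rule solve for the 2×2 linear system (7.42) are illustrative choices, not the book's example):

```python
def newton2(F, J, x, kmax=50, tol=1e-12):
    """Undamped Newton for a 2-equation system; J*dx = -F is solved
    by Cramer's rule instead of inverting J, per Eqs. (7.42)-(7.43)."""
    for _ in range(kmax):
        f1, f2 = F(x)
        (a, b), (c, d) = J(x)
        det = a * d - b * c
        dx1 = (-f1 * d + f2 * b) / det
        dx2 = (-a * f2 + c * f1) / det
        x = (x[0] + dx1, x[1] + dx2)
        if abs(dx1) + abs(dx2) < tol:
            break
    return x

# toy system: x1^2 + x2^2 = 2 and x1 = x2, with solutions (1,1) and (-1,-1)
F = lambda x: (x[0] ** 2 + x[1] ** 2 - 2.0, x[0] - x[1])
J = lambda x: ((2 * x[0], 2 * x[1]), (1.0, -1.0))
sol = newton2(F, J, (2.0, 0.5))
assert abs(sol[0] - 1.0) < 1e-10 and abs(sol[1] - 1.0) < 1e-10
```

From this starting point the iterates converge quadratically to (1, 1); a different start may reach (−1, −1) or fail, which is why multistart is used later in the chapter.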
Broyden's method generalizes to the multivariate case the secant method, where ḟ(x_k) was approximated by a finite difference (see Remark 7.4). The approximation

$$\dot f(x_k) \approx \frac{f_k - f_{k-1}}{x_k - x_{k-1}} \qquad (7.46)$$

becomes

$$\mathbf{J}(\mathbf{x}_{k+1})\,\Delta \mathbf{x} \approx \Delta \mathbf{f}, \qquad (7.47)$$

where

$$\Delta \mathbf{x} = \mathbf{x}_{k+1} - \mathbf{x}_k, \qquad (7.48)$$

$$\Delta \mathbf{f} = \mathbf{f}(\mathbf{x}_{k+1}) - \mathbf{f}(\mathbf{x}_k). \qquad (7.49)$$

An estimate J_k of the Jacobian matrix is updated as

$$\mathbf{J}_{k+1} = \mathbf{J}_k + \mathbf{C}(\Delta \mathbf{x}, \Delta \mathbf{f}), \qquad (7.50)$$

where C(Δx, Δf) is a rank-one correction matrix (i.e., the product of a column vector by a row vector on its right). For

$$\mathbf{C}(\Delta \mathbf{x}, \Delta \mathbf{f}) = \frac{(\Delta \mathbf{f} - \mathbf{J}_k \Delta \mathbf{x})}{\Delta \mathbf{x}^{\mathsf T} \Delta \mathbf{x}}\, \Delta \mathbf{x}^{\mathsf T}, \qquad (7.51)$$

J_{k+1} satisfies the secant condition (7.47) exactly. Rather than updating an estimate of the Jacobian matrix and solving a linear system at each iteration, one may directly update an estimate M_k of the inverse Jacobian, thanks to the Sherman-Morrison formula

$$(\mathbf{J}_k + \mathbf{u}\mathbf{v}^{\mathsf T})^{-1} = \mathbf{J}_k^{-1} - \frac{\mathbf{J}_k^{-1}\mathbf{u}\mathbf{v}^{\mathsf T}\mathbf{J}_k^{-1}}{1 + \mathbf{v}^{\mathsf T}\mathbf{J}_k^{-1}\mathbf{u}}. \qquad (7.54)$$
Write the correction (7.51) as

$$\mathbf{J}_{k+1} = \mathbf{J}_k + \mathbf{u}\mathbf{v}^{\mathsf T}, \qquad (7.55)$$

with

$$\mathbf{u} = \frac{\Delta \mathbf{f} - \mathbf{J}_k \Delta \mathbf{x}}{\|\Delta \mathbf{x}\|_2} \qquad (7.56)$$

and

$$\mathbf{v} = \frac{\Delta \mathbf{x}}{\|\Delta \mathbf{x}\|_2}. \qquad (7.57)$$

Since, with M_k = J_k^{-1},

$$\mathbf{J}_k^{-1}\mathbf{u} = \frac{\mathbf{M}_k \Delta \mathbf{f} - \Delta \mathbf{x}}{\|\Delta \mathbf{x}\|_2}, \qquad (7.58)$$

substituting (7.56) and (7.57) into (7.54) yields

$$\mathbf{M}_{k+1} = \mathbf{M}_k - \frac{\dfrac{(\mathbf{M}_k\Delta \mathbf{f} - \Delta \mathbf{x})\,\Delta \mathbf{x}^{\mathsf T}\mathbf{M}_k}{\Delta \mathbf{x}^{\mathsf T}\Delta \mathbf{x}}}{1 + \dfrac{\Delta \mathbf{x}^{\mathsf T}(\mathbf{M}_k\Delta \mathbf{f} - \Delta \mathbf{x})}{\Delta \mathbf{x}^{\mathsf T}\Delta \mathbf{x}}} = \mathbf{M}_k - \frac{(\mathbf{M}_k\Delta \mathbf{f} - \Delta \mathbf{x})\,\Delta \mathbf{x}^{\mathsf T}\mathbf{M}_k}{\Delta \mathbf{x}^{\mathsf T}\mathbf{M}_k\Delta \mathbf{f}}. \qquad (7.59)$$

The correction term C'(Δx, Δf) is thus also a rank-one matrix. As with Newton's method, a damping procedure is usually employed, such that

$$\Delta \mathbf{x} = \lambda \mathbf{d}, \qquad (7.60)$$

where the search direction d is taken as in Newton's method, with J^{-1}(x_k) replaced by M_k, so

$$\mathbf{d} = -\mathbf{M}_k\, \mathbf{f}(\mathbf{x}_k). \qquad (7.61)$$

The correction term then becomes

$$\mathbf{C}'(\Delta \mathbf{x}, \Delta \mathbf{f}) = -\frac{(\mathbf{M}_k\Delta \mathbf{f} - \lambda\mathbf{d})\,\mathbf{d}^{\mathsf T}\mathbf{M}_k}{\mathbf{d}^{\mathsf T}\mathbf{M}_k\Delta \mathbf{f}}. \qquad (7.62)$$
In summary, starting from k = 0 and the pair (x0, M0) (M0 might be taken as J^{-1}(x0), or more simply as the identity matrix), the method proceeds as follows:

1. Compute f_k = f(x_k).
2. Compute d = -M_k f_k.
3. Find λ such that

$$\|\mathbf{f}(\mathbf{x}_k + \lambda \mathbf{d})\| < \|\mathbf{f}_k\|, \qquad (7.63)$$

and take

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \lambda \mathbf{d}, \quad \mathbf{f}_{k+1} = \mathbf{f}(\mathbf{x}_{k+1}). \qquad (7.64)$$

4. Compute Δf = f_{k+1} - f_k.
5. Compute

$$\mathbf{M}_{k+1} = \mathbf{M}_k - \frac{(\mathbf{M}_k\Delta \mathbf{f} - \lambda\mathbf{d})\,\mathbf{d}^{\mathsf T}\mathbf{M}_k}{\mathbf{d}^{\mathsf T}\mathbf{M}_k\Delta \mathbf{f}}, \qquad (7.65)$$

increment k by one, and go to Step 1.
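The steps above can be sketched in Python; the test system, the starting point, the halving line search, and the initialization of M0 at the inverse Jacobian are illustrative assumptions:

```python
def mv(M, v):  # matrix-vector product for M stored as a list of rows
    return [sum(Mi[j] * v[j] for j in range(len(v))) for Mi in M]

def broyden(F, x0, M, kmax=200, tol=1e-12):
    x, fk = x0, F(x0)
    for _ in range(kmax):
        d = [-di for di in mv(M, fk)]                    # step 2: d = -M_k f_k
        lam = 1.0
        while True:                                      # step 3: damping search
            xp = [xi + lam * di for xi, di in zip(x, d)]
            fp = F(xp)
            if sum(v * v for v in fp) < sum(v * v for v in fk) or lam < 1e-10:
                break
            lam *= 0.5
        if max(abs(lam * di) for di in d) < tol:
            return xp
        df = [a - b for a, b in zip(fp, fk)]             # step 4
        Mdf = mv(M, df)
        den = sum(a * b for a, b in zip(d, Mdf))         # d' M_k delta_f
        u = [a - lam * b for a, b in zip(Mdf, d)]        # M_k delta_f - lam d
        w = [sum(d[i] * M[i][j] for i in range(2)) for j in range(2)]  # d' M_k
        M = [[M[i][j] - u[i] * w[j] / den for j in range(2)]
             for i in range(2)]                          # step 5, Eq. (7.65)
        x, fk = xp, fp
    return x

# toy system: x1^2 + x2^2 = 2 and x1 = x2, with solutions (1,1) and (-1,-1)
F = lambda x: [x[0] ** 2 + x[1] ** 2 - 2.0, x[0] - x[1]]
M0 = [[0.2, 0.2], [0.2, -0.8]]   # inverse of the Jacobian at x0 = (2.0, 0.5)
sol = broyden(F, [2.0, 0.5], M0)
assert abs(sol[0] - 1.0) < 1e-8 and abs(sol[1] - 1.0) < 1e-8
```

No Jacobian is evaluated after the initialization: the rank-one update (7.65) maintains an approximation of its inverse, which is what makes each iteration cheap.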
One may stop iterating when

$$\|\mathbf{f}(\mathbf{x}_k)\| < \delta, \qquad (7.66)$$

where δ is a positive threshold to be chosen by the user, or when

$$\|\mathbf{f}(\mathbf{x}_k) - \mathbf{f}(\mathbf{x}_{k-1})\| < \delta. \qquad (7.67)$$

The first of these stopping criteria may never be met if δ is too small or if x0 was badly chosen, which provides a rationale for using the second one.
With either of these strategies, the number of iterations will change drastically for
a given threshold if the equations are arbitrarily multiplied by a very large or very
small real number.
One may prefer a stopping criterion that does not present this property, such as
stopping when

$$\|\mathbf{f}(\mathbf{x}_k)\| < \delta\, \|\mathbf{f}(\mathbf{x}_0)\| \qquad (7.68)$$

(which may never happen), or when

$$\frac{\|\mathbf{x}_k - \mathbf{x}_{k-1}\|}{\|\mathbf{x}_k\| + \texttt{realmin}} \leqslant \texttt{eps}, \qquad (7.70)$$

or when

$$\frac{\|\mathbf{f}(\mathbf{x}_k) - \mathbf{f}(\mathbf{x}_{k-1})\|}{\|\mathbf{f}(\mathbf{x}_k)\| + \texttt{realmin}} \leqslant \texttt{eps}, \qquad (7.71)$$

where eps is the relative precision of the floating-point representation employed and realmin the smallest strictly positive normalized floating-point number.
A last interesting idea is to stop when there is no longer any significant digit in the
evaluation of f(xk ), i.e., when one is no longer sure that a solution has not been
reached. This requires methods for assessing the precision of numerical results, such
as described in Chap. 14.
Several stopping criteria may be combined, and one should also specify a maximum number of iterations, if only as a safety measure against badly designed other
tests.
Consider the equation

$$x^2 - 3 = 0, \qquad (7.74)$$

a solution of which is x* = √3 ≈ 1.732050807568877. Let us solve it with the four methods presented in Sect. 7.3.
1.732050810014728e+00
1.732050807568877e+00
1.732050807568877e+00
1.732050807568877e+00
1.732050807568877e+00
1.732050807568877e+00
1.732050807568877e+00
Although an accurate solution is obtained very quickly, this script can be improved
in a number of ways.
First, there is no point in iterating when the solution has been reached (at least up to
the precision of the floating-point representation employed). A more sophisticated
stopping rule than just a maximum number of iterations must thus be specified. One
may, for instance, use (7.70) and replace the loop in the previous script by
for k=1:Kmax,
x(k+1) = x(k)-f(x(k))/fdot(x(k));
if ((abs(x(k+1)-x(k)))/(abs(x(k+1)+realmin))<=eps)
break
end
end
The inner loop typically breaks after 12 iterations, which confirms that the secant method is slower than Newton's, and a typical run yields
Solutions =
1.732050807568877e+00
1.732050807568877e+00
-1.732050807568877e+00
-1.732050807568877e+00
-1.732050807568877e+00
1.732050807568877e+00
-1.732050807568877e+00
1.732050807568877e+00
-1.732050807568877e+00
1.732050807568877e+00
so the secant method with multistart is able to find both solutions with the same accuracy as Newton's.
(7.75)
and
upper =
2.000000000000000e+00
2.000000000000000e+00
2.000000000000000e+00
1.750000000000000e+00
1.750000000000000e+00
1.750000000000000e+00
1.750000000000000e+00
1.734375000000000e+00
1.734375000000000e+00
1.734375000000000e+00
The last interval computed is

[a, b] = [1.732050807568157, 1.732050807569067].   (7.76)

Its width is indeed less than 10^-12, and it does contain √3.

Consider now the system of two equations in two unknowns

$$x_1^2 x_2^2 - 9 = 0, \qquad (7.77)$$

$$x_1^2 x_2 - 3 x_2 = 0, \qquad (7.78)$$

i.e., f(x) = 0 with x = (x1, x2)ᵀ. Its solutions are the four points x = (±√3, ±√3)ᵀ. The Jacobian matrix of f(·) is

$$\mathbf{J}(\mathbf{x}) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}^{\mathsf T}}(\mathbf{x}) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \end{pmatrix} = \begin{pmatrix} 2x_1x_2^2 & 2x_1^2x_2 \\ 2x_1x_2 & x_1^2 - 3 \end{pmatrix}. \qquad (7.82)$$
The function f and its Jacobian matrix J are evaluated by the following function

function [F,J] = SysNonLin(x)
% function
F = zeros(2,1);
F(1) = x(1)^2*x(2)^2-9;
F(2) = x(1)^2*x(2)-3*x(2);
% Jacobian matrix
J = zeros(2,2);
J(1,1) = 2*x(1)*x(2)^2;
J(1,2) = 2*x(1)^2*x(2);
J(2,1) = 2*x(1)*x(2);
J(2,2) = x(1)^2-3;
end
The (undamped) Newton method with multistart is implemented by the script
clear all
Smax = 10; % number of starts
Kmax = 20; % max number of iterations per start
Init = 2*rand(2,Smax)-1; % entries between -1 and 1
Solutions = zeros(Smax,2);
X = zeros(2,1);
Xplus = zeros(2,1);
for i=1:Smax
X = Init(:,i);
for k=1:Kmax
[F,J] = SysNonLin(X);
DeltaX = -J\F;
Xplus = X + DeltaX;
[Fplus] = SysNonLin(Xplus);
if (norm(Fplus-F)/(norm(F)+realmin)<=eps)
break
end
X = Xplus;
end
Solutions(i,:) = Xplus;
end
Solutions
A typical run of this script yields
Solutions =
1.732050807568877e+00
-1.732050807568877e+00
1.732050807568877e+00
1.732050807568877e+00
1.732050807568877e+00
1.732050807568877e+00
-1.732050807568877e+00
-1.732050807568877e+00
1.732050807568877e+00
-1.732050807568877e+00
-1.732050807568877e+00
-1.732050807568877e+00
-1.732050807568877e+00
-1.732050807568877e+00
-1.732050807568877e+00
-1.732050807568877e+00
1.732050807568877e+00
1.732050807568877e+00
-1.732050807568877e+00
1.732050807568877e+00
where each row corresponds to the solution as evaluated for one given initial value
of x. All four solutions have thus been evaluated accurately, and damping was again
not needed on this simple problem.
Remark 7.10 Computer algebra may be used to generate the formal expression of the Jacobian matrix. The following script uses the Symbolic Math Toolbox for doing so.

syms x y
X = [x;y]
F = [x^2*y^2-9;x^2*y-3*y]
J = jacobian(F,X)

It yields

X =
x
y
F =
x^2*y^2 - 9
y*x^2 - 3*y
J =
[ 2*x*y^2, 2*x^2*y]
[   2*x*y,  x^2 - 3]
The MATLAB function fsolve (from the Optimization Toolbox) may also be used; it looks for x minimizing the sum of the squared residuals

$$\sum_{i=1}^{\dim \mathbf{f}} f_i^2(\mathbf{x}). \qquad (7.83)$$
clear all
Smax = 10; % number of starts
Init = 2*rand(Smax,2)-1; % between -1 and 1
Solutions = zeros(Smax,2);
options = optimset('Jacobian','on');
for i=1:Smax
x0 = Init(i,:);
Solutions(i,:) = fsolve(@SysNonLin,x0,options);
end
Solutions
A typical result is
Solutions =
-1.732050808042171e+00
1.732050807568913e+00
-1.732050807570181e+00
1.732050807120480e+00
-1.732050807568903e+00
1.732050807569296e+00
1.732050807630857e+00
1.732050807796109e+00
-1.732050807966248e+00
-1.732050807568886e+00
-1.732050808135796e+00
1.732050807568798e+00
-1.732050807569244e+00
1.732050808372865e+00
1.732050807568869e+00
1.732050807569322e+00
-1.732050807642701e+00
-1.732050808527067e+00
-1.732050807938446e+00
1.732050807568879e+00
where each row again corresponds to the solution as evaluated for one given initial
value of x. All four solutions have thus been found, although less accurately than
with Newton's method.
end
Solutions
NumberOfIterations
A typical run of this script yields
Solutions =
  -1.732050807568899e+00  -1.732050807568901e+00
   1.732050807568442e+00  -1.732050807568877e+00
   1.732050807568591e+00   1.732050807569304e+00
   1.732050807568429e+00   1.732050807568774e+00
   1.732050807568853e+00  -1.732050807568868e+00
  -1.732050807568949e+00   1.732050807564629e+00
  -1.732050807570081e+00   1.732050807568877e+00
   1.732050807567701e+00   1.732050807576298e+00
  -1.732050807569200e+00   1.732050807564450e+00
  -1.732050807568735e+00   1.732050807568897e+00
The number of iterations for getting each one of these ten pairs of results ranges
between 18 and 134 (although one of the pairs of results of another run was obtained
after 291,503 iterations). Recall that Broyden's method does not use the Jacobian
matrix of f, contrary to the other two methods presented.
If, pressing our luck, we attempt to get more accurate results by setting tol =
1.e-15; then a typical run yields
Solutions =
                     NaN                      NaN
                     NaN                      NaN
   1.732050807568877e+00   1.732050807568877e+00
                     NaN                      NaN
   1.732050807568877e+00   1.732050807568877e+00
                     NaN                      NaN
                     NaN                      NaN
   1.732050807568877e+00  -1.732050807568877e+00
                     NaN                      NaN
   1.732050807568877e+00  -1.732050807568877e+00
While some results do get more accurate, the method thus fails in a significant number
of cases, as indicated by NaN, which stands for Not a Number.
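Broyden's rank-one update, on which the runs above rely, can be sketched as follows. This is an illustrative Python translation (the chapter's own scripts are in MATLAB); unlike a plain identity initialization, the initial Jacobian estimate is taken here by finite differences, an assumption made for robustness on this example.

```python
def f(x):
    # The test system used above: solutions are (±sqrt(3), ±sqrt(3)).
    return [x[0] ** 2 * x[1] ** 2 - 9.0, x[0] ** 2 * x[1] - 3.0 * x[1]]

def jac_fd(x, h=1e-6):
    # Finite-difference Jacobian estimate, used only for initialization.
    fx = f(x)
    B = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        xp = list(x)
        xp[j] += h
        fp = f(xp)
        for i in range(2):
            B[i][j] = (fp[i] - fx[i]) / h
    return B

def broyden(x0, max_iter=200, tol=1e-10):
    x, fx, B = list(x0), f(x0), jac_fd(x0)
    for _ in range(max_iter):
        det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
        # Solve B dx = -f(x) by Cramer's rule (fine for this 2x2 case).
        dx = [(-fx[0] * B[1][1] + fx[1] * B[0][1]) / det,
              (-fx[1] * B[0][0] + fx[0] * B[1][0]) / det]
        x_new = [x[0] + dx[0], x[1] + dx[1]]
        f_new = f(x_new)
        if max(abs(v) for v in f_new) < tol:
            return x_new
        # Rank-one update: B <- B + (df - B dx) dx^T / (dx^T dx).
        bdx = [B[0][0] * dx[0] + B[0][1] * dx[1],
               B[1][0] * dx[0] + B[1][1] * dx[1]]
        u = [f_new[0] - fx[0] - bdx[0], f_new[1] - fx[1] - bdx[1]]
        s = dx[0] ** 2 + dx[1] ** 2
        for i in range(2):
            for j in range(2):
                B[i][j] += u[i] * dx[j] / s
        x, fx = x_new, f_new
    return x

root = broyden([2.0, 2.0])  # approaches (sqrt(3), sqrt(3))
```

No Jacobian is evaluated after the initialization: each iteration only re-evaluates f and applies the secant-type update to B.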
7.8 In Summary
Solving sets of nonlinear equations is much more complex than solving sets of
linear equations. One may not know the number of solutions in advance, or even
whether a solution exists at all.
• The techniques presented in this chapter are iterative, and mostly aim at finding
one of these solutions.
• The quality of a candidate solution x_k can be assessed by computing f(x_k).
• If the method fails, this does not prove that there is no solution.
• Asymptotic convergence speed for isolated roots is typically linear for fixed-point
iteration, superlinear for the secant and Broyden's methods, and quadratic for
Newton's method.
• Initialization plays a crucial role, and multistart is the simplest strategy available
to explore the domain of interest in the search for all the solutions that it contains.
There is no guarantee that this strategy will succeed, however.
• For a given computational budget, stopping iteration as soon as possible makes it
possible to try other starting points.
References
1. Neumaier, A.: Interval Methods for Systems of Equations. Cambridge University Press, Cambridge (1990)
2. Jaulin, L., Kieffer, M., Didrit, O., Walter, E.: Applied Interval Analysis. Springer, London (2001)
3. Grabmeier, J., Kaltofen, E., Weispfenning, V. (eds.): Computer Algebra Handbook: Foundations, Applications, Systems. Springer, Berlin (2003)
4. Didrit, O., Petitot, M., Walter, E.: Guaranteed solution of direct kinematic problems for general configurations of parallel manipulators. IEEE Trans. Robot. Autom. 14(2), 259–266 (1998)
5. Ypma, T.: Historical development of the Newton–Raphson method. SIAM Rev. 37(4), 531–551 (1995)
6. Stewart, G.: Afternotes on Numerical Analysis. SIAM, Philadelphia (1996)
7. Díez, P.: A note on the convergence of the secant method for simple and multiple roots. Appl. Math. Lett. 16, 1211–1215 (2003)
8. Watson, L., Bartholomew-Biggs, M., Ford, J. (eds.): Optimization and nonlinear equations. J. Comput. Appl. Math. 124(1–2), 1–373 (2000)
9. Kelley, C.: Solving Nonlinear Equations with Newton's Method. SIAM, Philadelphia (2003)
10. Dennis Jr, J.E., Moré, J.J.: Quasi-Newton methods, motivation and theory. SIAM Rev. 19(1), 46–89 (1977)
11. Broyden, C.: A class of methods for solving nonlinear simultaneous equations. Math. Comput. 19(92), 577–593 (1965)
12. Hager, W.: Updating the inverse of a matrix. SIAM Rev. 31(2), 221–239 (1989)
13. Nocedal, J., Wright, S.: Numerical Optimization. Springer, New York (1999)
14. Lindfield, G., Penny, J.: Numerical Methods Using MATLAB, 3rd edn. Academic Press, Elsevier, Amsterdam (2012)
Chapter 8
Introduction to Optimization
8.2 Examples
Example 8.1 Parameter estimation
To estimate the parameters of a mathematical model from experimental data, a
classical approach is to look for the (hopefully unique) value of the parameter vector
x ∈ R^n that minimizes the quadratic cost function

J(x) = e^T(x) e(x) = Σ_{i=1}^{N} e_i²(x),   (8.1)
where the error vector e(x) ∈ R^N is the difference between a vector y of experimental
data and a vector y_m(x) of corresponding model outputs

e(x) = y − y_m(x).   (8.2)
Most often, no constraint is enforced on x, which may take any value in Rn , so this
is unconstrained optimization, to be considered in Chap. 9.
Example 8.2 Management
A company may wish to maximize benefit under constraints on production, to
minimize the cost of a product under constraints on performance, or to minimize
time-to-market under constraints on cost. This is constrained optimization, to be
considered in Chap. 10.
Example 8.3 Logistics
Traveling salespersons may wish to visit given sets of cities while minimizing the
total distance they have to cover. The optimal solutions are then ordered lists of cities,
which are not necessarily coded numerically. This is combinatorial optimization, to
be considered in Chap. 11.
8.3 Taxonomy
A synonym of optimization is programming, coined by mathematicians working on
logistics during World War II, before the advent of the ubiquitous computer. In this
context, a program is an optimization problem.
The objective function (or performance index) J(·) is a scalar-valued function
of n scalar decision variables x_i, i = 1, . . . , n. These variables are stacked in a
decision vector x, and the feasible set X is the set of all the values that x may take.
When the objective function must be minimized, it is a cost function. When it must
be maximized, it is a utility function. Transforming a utility function U(·) into a cost
function J(·) is trivial, for instance by taking

J(x) = −U(x).   (8.3)
The statement

x̂ = arg min_{x∈X} J(x)   (8.4)

means that

∀x ∈ X,   J(x̂) ≤ J(x).   (8.5)

Any x̂ that satisfies (8.5) is a global minimizer, and the corresponding cost J(x̂) is
the global minimum. Note that the global minimum is unique if it exists, whereas
[Fig. 8.1: a cost function with two global minimizers x1 and x2 (global minimum J1) and a local minimizer x3 (local minimum J3)]
there may be several global minimizers. The next two examples illustrate situations
to be avoided, if possible.
Example 8.4 When J(x) = −x and X is some open interval (a, b) ⊂ R (i.e., the
interval does not contain its endpoints a and b), there is no global minimizer (or
maximizer) and no global minimum (or maximum). The infimum is J(b), and the
supremum is J(a).
Example 8.5 When J(x) = x and X = R, there is no global minimizer (or
maximizer) and no global minimum (or maximum). The infimum is −∞ and the
supremum is +∞.
If (8.5) is only known to be valid in some neighborhood V(x̂) of x̂, i.e., if

∀x ∈ V(x̂),   J(x̂) ≤ J(x),   (8.6)

then x̂ is a local minimizer, and J(x̂) a local minimum.
Remark 8.1 Although this is not always done in the literature, distinguishing minima
from minimizers (and maxima from maximizers) clarifies statements.
In Fig. 8.1, x1 and x2 are both global minimizers, associated with the unique global
minimum J1 , whereas x3 is only a local minimizer, as J3 is larger than J1 .
Ideally, one would like to find all the global minimizers and the corresponding
global minimum. In practice, however, proving that a given minimizer is global
is often impossible. Finding a local minimizer may already improve performance
drastically compared to the initial situation.
c_j^e(x) = 0,   j = 1, . . . , n_e,   (8.7)

c_j^i(x) ≤ 0,   j = 1, . . . , n_i,   (8.8)

or, in vector form,

c^e(x) = 0   (8.9)

and

c^i(x) ≤ 0.   (8.10)
x_i(1 − x_i)(2 − x_i)(3 − x_i) = 0.   (8.11)
Remark 8.5 The number n = dim x of decision variables has a strong influence on
the complexity of the optimization problem and on the methods that can be used,
because of what is known as the curse of dimensionality. A method that would be
perfectly viable for n = 2 may fail hopelessly for n = 50, as illustrated by the next
example.
Example 8.6 Let X be an n-dimensional unit hypercube [0, 1] × · · · × [0, 1]. Assume
that minimization is by random search, with x_k (k = 1, . . . , N) picked at random in
X according to a uniform distribution and the decision vector x̂_k achieving the lowest
cost so far taken as an estimate of a global minimizer. The width of a hypercube H
that has a probability p of being hit is ε = p^{1/n}, and this width increases very
quickly with n. For p = 10⁻³, for instance, ε = 10⁻³ if n = 1, ε ≈ 0.5 if n = 10
and ε ≈ 0.87 if n = 50. When n increases, it thus soon becomes impossible to
explore any small region of decision space. To put it another way, if 100 points
are deemed appropriate for sampling the interval [0, 1], then 100^n samples must be
drawn in X to achieve a similar density. Fortunately, the regions of actual interest
in high-dimensional decision spaces often correspond to lower-dimensional
hypersurfaces that may still be explored efficiently provided that more sophisticated
search methods are used.
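The orders of magnitude in Example 8.6 are easy to check numerically. Below is a minimal Python sketch (the chapter's own scripts are in MATLAB); the quadratic cost used to illustrate random search is made up:

```python
import random

def hit_width(p, n):
    """Edge length of a hypercube in [0, 1]^n whose hit probability is p."""
    return p ** (1.0 / n)

# Width needed to capture probability mass p = 1e-3 under uniform sampling.
print(round(hit_width(1e-3, 1), 3))   # 0.001
print(round(hit_width(1e-3, 10), 3))  # 0.501
print(round(hit_width(1e-3, 50), 3))  # 0.871

def random_search(cost, n, num_samples, seed=0):
    """Pick num_samples points uniformly in [0,1]^n; keep the best one."""
    rng = random.Random(seed)
    best_x, best_cost = None, float("inf")
    for _ in range(num_samples):
        x = [rng.random() for _ in range(n)]
        c = cost(x)
        if c < best_cost:
            best_x, best_cost = x, c
    return best_x, best_cost

# Minimize a made-up quadratic centered at (0.5, ..., 0.5).
x_best, c_best = random_search(lambda x: sum((xi - 0.5) ** 2 for xi in x), 2, 1000)
```

With n = 2 and 1000 samples this works well; the same budget becomes hopeless as n grows, which is the point of the example.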
The type of the cost function also has a strong influence on the type of method to
be employed.
When J (x) is linear in x, it can be written as
J(x) = c^T x.   (8.12)
One must then introduce constraints to avoid x tending to infinity in the direction
−c, which would in general be meaningless. If the constraints are linear (or affine)
in x, then the problem pertains to linear programming (see Sect. 10.6).
If J (x) is quadratic in x and can be written as
J(x) = [Ax − b]^T Q [Ax − b],
(8.13)
J(x) = Σ_{i=1}^{N} [e_i(x)]²,   (8.14)
with ei (x) differentiable, then one may employ Taylor expansions of the cost
function, which leads to the gradient and Newton methods and their variants
(see Sect. 9.3.4).
If J(x) is not differentiable, for instance when minimizing

J(x) = Σ_{i=1}^{N} |e_i(x)|   (8.15)

or

J(x) = max_i |e_i(x)|,   (8.16)
then specific methods are necessary (see Sects. 9.3.5, 9.4.1.2 and 9.4.2.1). Even
such an innocent-looking cost function as (8.15), which is differentiable almost
everywhere if the e_i(x)'s are differentiable, cannot be minimized by an iterative
optimization method based on a limited expansion of the cost, as this method
will usually hurl itself onto a point where the cost is not differentiable to stay
stuck there.
When J (x) is convex on X, the powerful methods of convex optimization can be
employed, provided that X is also convex. See Sect. 10.7.
Remark 8.6 The time needed for a single evaluation of J (x) also has consequences
on the types of methods that can be employed. When each evaluation takes a fraction
of a second, random search and evolutionary algorithms may be viable options. This
is no longer the case when each evaluation takes several hours, for instance because it
involves the simulation of a complex knowledge-based model, as the computational
budget is then severely restricted, see Sect. 9.4.3.
Note that the time needed by a given algorithm to visit N distinct points in X cannot
be taken into account in the performance measure.
We only consider the first of the NFL theorems in [13], which can be summarized
as follows: for any pair of algorithms (A1 , A2 ), the mean performance over all
minimization problems is the same, i.e.,
(1/M) Σ_{j=1}^{M} P_N(A1, M_j) = (1/M) Σ_{j=1}^{M} P_N(A2, M_j).   (8.18)
8.5 In Summary
• Before attempting optimization, check that this does make sense for the actual
problem of interest.
• It is always possible to transform a maximization problem into a minimization
problem, so considering only minimization is not restrictive.
• The distinction between minima and minimizers is useful to keep in mind.
• Optimization problems can be classified according to the type of the feasible
domain X for their decision variables.
• The type of the cost function has a strong influence on the classes of methods that
can be used. Non-differentiable cost functions cannot be minimized using methods
based on a Taylor expansion of the cost.
• The dimension of the decision vector is a key factor to be taken into account in
the choice of an algorithm, because of the curse of dimensionality.
• The time required to carry out a single evaluation of the cost function should also
be taken into consideration.
• There is no free lunch.
Chapter 9
Optimizing Without Constraint

J(x̂ + Δx) = J(x̂) + Σ_{i=1}^{n} (∂J/∂x_i)(x̂) Δx_i + o(‖Δx‖),   (9.1)

where the gradient of the cost function at x is the vector

g(x) = (∂J/∂x)(x) = [(∂J/∂x_1)(x), . . . , (∂J/∂x_n)(x)]^T.   (9.3)
For x̂ to be a minimizer, no infinitesimal displacement Δx may decrease the cost.
Because there is no constraint on x, this is possible only if the gradient of the cost
at x̂ is zero. A necessary first-order optimality condition is thus

g(x̂) = 0.   (9.6)
Consider now the second-order Taylor expansion of the cost function around x̂

J(x̂ + Δx) = J(x̂) + g^T(x̂)Δx
           + (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (∂²J/∂x_i∂x_j)(x̂) Δx_i Δx_j + o(‖Δx‖²),   (9.7)

where the Hessian of the cost function is the (n × n) matrix

H(x) = (∂²J/∂x∂x^T)(x),   (9.9)

with entries

h_{i,j}(x) = (∂²J/∂x_i∂x_j)(x).   (9.10)
and the second-order term in Δx should never contribute to decreasing the cost. A
necessary second-order optimality condition is therefore

Δx^T H(x̂) Δx ≥ 0   ∀Δx,   (9.12)

i.e., the Hessian must be non-negative definite at x̂:

H(x̂) ≥ 0.   (9.13)
Together, (9.6) and (9.13) do not make a sufficient condition for optimality, even
locally, as zero eigenvalues of H(x̂) are associated with eigenvectors along which
it is possible to move away from x̂ without increasing the contribution of the
second-order term to the cost. It would then be necessary to consider higher-order
terms to reach a conclusion. To prove, for instance, that J(x) = x^1000 has a local
minimizer at x̂ = 0 via a Taylor-series expansion, one would have to compute all the
derivatives of this cost function up to order 1000, as all lower-order derivatives
take the value zero at x̂.
A sufficient condition for x̂ to be a local minimizer is that the cost be stationary
at x̂,

g(x̂) = 0,   (9.15)

with a positive definite Hessian

H(x̂) > 0.   (9.17)
Remark 9.1 There is, in general, no necessary and sufficient local optimality condition.
Remark 9.2 When nothing else is known about the cost function, satisfaction of
(9.17) does not guarantee that x̂ is a global minimizer.
Remark 9.3 The conditions on the Hessian are valid only for a minimization. For a
maximization, ≥ should be replaced by ≤, and > by <.
Remark 9.4 As (9.6) suggests, methods for solving systems of equations seen in
Chaps. 3 (for linear systems) and 7 (for nonlinear systems) can also be used to look
for minimizers. Advantage can then be taken of the specific properties of the Jacobian
matrix of the gradient (i.e., the Hessian), which (9.13) tells us should be symmetric
non-negative definite at any local or global minimizer.
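As an illustration of Remark 9.4, a stationary point can be found by applying Newton's method to g(x) = 0. Below is a Python sketch on a made-up smooth cost whose Hessian happens to be diagonal, so the Newton system is trivial to solve (the chapter's own scripts are in MATLAB):

```python
import math

def grad(x):
    # Gradient of the made-up cost J(x) = exp(x1) - x1 + x2**2.
    return [math.exp(x[0]) - 1.0, 2.0 * x[1]]

def hess(x):
    # Hessian diagonal of the same cost; symmetric positive definite here.
    return [math.exp(x[0]), 2.0]

def newton_minimize(x0, tol=1e-12, max_iter=50):
    """Find a stationary point by Newton iteration on g(x) = 0."""
    x = list(x0)
    for _ in range(max_iter):
        g, h = grad(x), hess(x)
        x = [x[i] - g[i] / h[i] for i in range(2)]  # H^{-1} g, diagonal H
        if max(abs(gi) for gi in grad(x)) < tol:
            break
    return x

x_hat = newton_minimize([1.0, -3.0])  # converges to the minimizer (0, 0)
```

Since the Hessian is positive definite at the solution, (9.15) and (9.17) confirm that the stationary point found is a local minimizer.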
Example 9.2 Kriging revisited
Equations (5.61) and (5.64) of the Kriging predictor can be derived via the theoretical optimality conditions (9.6) and (9.15). Assume, as in Sect. 5.4.3, that N
measurements have taken place, to get
yi = f (xi ), i = 1, . . . , N .
(9.18)
E{Y (x)} = 0
(9.19)
and

E{Y(x_i)Y(x_j)} = σ_y² r(x_i, x_j),   ∀x_i, x_j,   (9.20)

with r(·, ·) a correlation function, such that r(x, x) = 1, and with σ_y² the GP variance.
Let Ŷ(x) be a linear combination of the Y(x_i)'s, i.e.,

Ŷ(x) = c^T(x)Y.   (9.21)

Since the GP is zero-mean, E{Ŷ(x) − Y(x)} = 0.
There is thus no systematic error for any vector of weights c(x). The best linear
unbiased predictor (or BLUP) of Y (x) sets c(x) so as to minimize the variance of
the prediction error at x. Now
[Ŷ(x) − Y(x)]² = c^T(x)YY^T c(x) + [Y(x)]² − 2c^T(x)Y Y(x).   (9.24)
Setting the gradient of the expectation of (9.24) with respect to c(x) to zero, as
required by (9.6), yields

R ĉ(x) = r(x),   (9.27)

with R the correlation matrix of the Y(x_i)'s and r(x) the vector of their correlations
with Y(x). Provided that R is invertible, as it should be, (9.27) implies that the
optimal weighting vector is

ĉ(x) = R⁻¹ r(x).   (9.28)
Substituting (9.28) into the expression of the prediction variance then yields (5.64).
Condition (9.17) is satisfied, provided that

(∂²J/∂c∂c^T)(ĉ) = 2R > 0.   (9.32)
Remark 9.5 Example 9.2 neglects the fact that σ_y² is unknown and that the correlation
function r(x_i, x_j) often involves a vector p of parameters to be estimated from the
data, so R and r(x) should actually be written R(p) and r(x, p). The most common
approach for estimating p and σ_y² is maximum likelihood. The probability density of
the data vector y is then maximized under the hypothesis that it was generated by
a model with parameters p and σ_y². The maximum-likelihood estimates of p and σ_y²
are thus obtained by solving yet another optimization problem, as

p̂ = arg min_p [ N ln (y^T R⁻¹(p) y / N) + ln det R(p) ]   (9.33)

and

σ̂_y² = y^T R⁻¹(p̂) y / N.   (9.34)
The interpolation of the data should then be replaced by their approximation. Define
the error as the vector of residuals
e(x) = y − f(x).
(9.37)
The most commonly used strategy for estimating x from the data is to minimize a
cost function that is quadratic in e(x), such as
J(x) = e^T(x) W e(x),   (9.38)
where W > 0 is some known weighting matrix, chosen by the user. The weighted
least squares estimate of x is then

x̂ = arg min_{x∈R^n} [y − f(x)]^T W [y − f(x)].   (9.39)
One can always compute, for instance with the Cholesky method of Sect. 3.8.1, a
matrix M such that
W = M^T M,   (9.40)
so
x = arg min [My Mf(x)]T [My Mf(x)].
xRn
(9.41)
Replacing My by y′ and Mf(x) by f′(x), one can thus transform the initial problem
into one of unweighted least squares estimation:

x̂ = arg min_{x∈R^n} J(x),   (9.42)

where

J(x) = ‖y′ − f′(x)‖₂²,   (9.43)

with ‖·‖₂² the square of the l₂ norm. It is assumed in what follows that this
transformation has been carried out (unless W was already the (N × N) identity
matrix), but the prime signs are dropped to simplify notation.
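The transformation (9.40)–(9.41) is easy to check numerically. Below is a Python sketch (the chapter's own scripts are in MATLAB) with a made-up 2 × 2 weighting matrix W, factored by hand-written Cholesky for this tiny case:

```python
import math

# Made-up symmetric positive definite weighting matrix W.
W = [[4.0, 2.0],
     [2.0, 3.0]]

def cholesky2(w):
    """Upper-triangular M with M^T M = W, for the 2x2 case."""
    m11 = math.sqrt(w[0][0])
    m12 = w[0][1] / m11
    m22 = math.sqrt(w[1][1] - m12 ** 2)
    return [[m11, m12], [0.0, m22]]

M = cholesky2(W)

def weighted_cost(e):
    """J = e^T W e."""
    we = [W[0][0] * e[0] + W[0][1] * e[1], W[1][0] * e[0] + W[1][1] * e[1]]
    return e[0] * we[0] + e[1] * we[1]

def transformed_cost(e):
    """||M e||_2^2, which should equal e^T W e."""
    me = [M[0][0] * e[0] + M[0][1] * e[1], M[1][0] * e[0] + M[1][1] * e[1]]
    return me[0] ** 2 + me[1] ** 2
```

For any residual vector e, the two costs coincide, so minimizing the weighted cost is the same as minimizing an unweighted cost on the transformed residual Me.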
Assume now that the model output is linear in x, so that

f(x) = Fx,   (9.44)

with F a known (N × n) regression matrix. The error

e(x) = y − Fx   (9.45)

is thus affine in x. This implies that the cost function (9.43) is quadratic in x

J(x) = ‖y − Fx‖₂² = (y − Fx)^T (y − Fx).   (9.46)
The necessary first-order optimality condition (9.6) requests that the gradient of J(·)
at x̂ be zero. Since (9.46) is quadratic in x, the gradient of the cost function is affine
in x, and given by

(∂J/∂x)(x) = −2F^T(y − Fx) = −2F^T y + 2F^T F x.   (9.47)
Assume, for the time being, that F^T F is invertible, which is true if and only if all
the columns of F are linearly independent and which implies that F^T F > 0. The
necessary first-order optimality condition

(∂J/∂x)(x̂) = 0   (9.48)

then translates into the celebrated least squares formula

x̂ = (F^T F)⁻¹ F^T y,   (9.49)
which is a closed-form expression for the unique stationary point of the cost function.
Moreover, since F^T F > 0, the sufficient condition for local optimality (9.17) is
satisfied and (9.49) is a closed-form expression for the unique global minimizer of
the cost function. This is a considerable advantage over the general case where no such
closed-form solution exists. See Sect. 16.8 for a beautiful example of a systematic and
repetitive use of linear least squares in the context of building nonlinear black-box
models.
Example 9.3 Polynomial regression
Let yi be the value measured for some quantity of interest at the known instant
of time t_i (i = 1, . . . , N). Assume that these data are to be approximated with a kth
order polynomial in the power series form

P_k(t, x) = Σ_{i=0}^{k} p_i t^i,   (9.50)

where

x = (p_0, p_1, . . . , p_k)^T.   (9.51)
Assume also that there are more data than parameters (N > n = k + 1). To compute
the estimate x̂ of the parameter vector x, one may look for the value of x that
minimizes

J(x) = Σ_{i=1}^{N} [y_i − P_k(t_i, x)]²,   (9.52)
with
y = [y_1, y_2, . . . , y_N]^T   (9.53)

and

F =
[ 1   t_1   t_1²   · · ·   t_1^k ]
[ 1   t_2   t_2²   · · ·   t_2^k ]
[ ⋮    ⋮     ⋮             ⋮    ]
[ 1   t_N   t_N²   · · ·   t_N^k ]   (9.54)
Remark 9.6 The key point in Example 9.3 is that the model output Pk (t, x) is linear
in x. Thus, for instance, the function
f(t, x) = x_1 e^t + x_2 t² + x_3/t   (9.55)

could benefit from a similar treatment.
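For a concrete instance of Example 9.3's linear-in-x estimation, here is a Python sketch (the chapter's own scripts are in MATLAB) fitting a first-order polynomial to made-up data; at this tiny, well-conditioned size, solving the 2 × 2 normal equations directly is acceptable:

```python
# Made-up data: noisy samples of y = 2 t + 1 at known instants t_i.
t = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.1, 6.9]

# Regression matrix F for a first-order polynomial p0 + p1 t, cf. (9.54).
F = [[1.0, ti] for ti in t]

# Normal equations F^T F x = F^T y, solved directly for this 2x2 case.
a = sum(row[0] * row[0] for row in F)          # N
b = sum(row[0] * row[1] for row in F)          # sum of t_i
c = sum(row[1] * row[1] for row in F)          # sum of t_i^2
r0 = sum(F[i][0] * y[i] for i in range(4))     # sum of y_i
r1 = sum(F[i][1] * y[i] for i in range(4))     # sum of t_i * y_i
det = a * c - b * b
p0 = (c * r0 - b * r1) / det   # estimated intercept, close to 1
p1 = (a * r1 - b * r0) / det   # estimated slope, close to 2
```

The estimates land near the generating values (1, 2); the residuals absorb the made-up noise.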
Despite its elegant conciseness, (9.49) should seldom be used for computing
least squares estimates, for at least two reasons.
First, inverting FT F usually requires unnecessary computations and it is less work
to solve the system of linear equations
F^T F x̂ = F^T y,   (9.56)
which are called the normal equations. Since FT F is assumed, for the time being, to
be positive definite, one may use Cholesky factorization for this purpose. This is the
most economical approach, only applicable to well-conditioned problems.
Second, the condition number of FT F is almost always considerably worse than
that of F, as will be explained in Sect. 9.2.4. This suggests the use of methods such
as those presented in the next two sections, which avoid computing FT F.
Sometimes, however, FT F takes a particularly simple diagonal form. This may
be due to experiment design, as in Example 9.4, or to a proper choice of the model
representation, as in Example 9.5. Solving (9.56) then becomes trivial, and there is
no reason for avoiding it.
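The degradation of conditioning mentioned above can be observed numerically: for the 2-norm, cond(F^T F) = (cond F)². A Python sketch with made-up, nearly collinear regressors (eigenvalues of the symmetric 2 × 2 matrix are computed in closed form):

```python
import math

def cond_sym2(a, b, c):
    """2-norm condition number of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4.0 * det)
    lam_max, lam_min = (tr + disc) / 2.0, (tr - disc) / 2.0
    return lam_max / lam_min

# A mildly ill-conditioned regression matrix (made-up data).
F = [[1.0, 1.00], [1.0, 1.01], [1.0, 1.02]]
a = sum(r[0] * r[0] for r in F)
b = sum(r[0] * r[1] for r in F)
c = sum(r[1] * r[1] for r in F)

cond_FtF = cond_sym2(a, b, c)     # condition number of the normal equations
cond_F = math.sqrt(cond_FtF)      # 2-norm condition number of F itself
print(cond_F, cond_FtF)           # squaring makes a bad situation much worse
```

Here cond F is already in the hundreds, while cond(F^T F) is in the tens of thousands, which is why factorizations of F are preferred.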
Example 9.4 Factorial experiment design for a quadratic model
Assume that some quantity of interest y(u) is modeled as
y_m(u, x) = p_0 + p_1 u_1 + p_2 u_2 + p_3 u_1 u_2,   (9.57)
where u_1 and u_2 are input factors, the value of which can be chosen freely in the
normalized interval [−1, 1], and where

x = (p_0, . . . , p_3)^T.   (9.58)

The parameters p_1 and p_2 respectively quantify the effects of u_1 and u_2 alone, while
p_3 quantifies the effect of their interaction. Note that there is no term in u_1² or u_2².
The parameter vector x is to be estimated from the experimental data y(u_i),
i = 1, . . . , N, by minimizing

J(x) = Σ_{i=1}^{N} [y(u_i) − y_m(u_i, x)]².   (9.59)

A two-level full factorial design consists of collecting data at all possible
combinations of the two extreme possible values {−1, 1} of the factors, as in
Table 9.1, and this pattern may be repeated to decrease the influence of measurement
noise. Assume it is repeated once, so N = 8. The entries of the resulting (8 × 4)
regression matrix F are then those of Table 9.2, deprived of its first row and first
column.
Table 9.1 Two-level full factorial design

Experiment   Value of u_1   Value of u_2
1            −1             −1
2             1             −1
3            −1              1
4             1              1

Table 9.2 Repeated two-level full factorial design

Experiment   Constant   Value of u_1   Value of u_2   Value of u_1·u_2
1            1          −1             −1              1
2            1           1             −1             −1
3            1          −1              1             −1
4            1           1              1              1
5            1          −1             −1              1
6            1           1             −1             −1
7            1          −1              1             −1
8            1           1              1              1

The columns of F are then mutually orthogonal, so

F^T F = 8 I₄,   (9.60)

and the least squares estimate of x is simply

x̂ = (1/8) F^T y.   (9.61)
This example generalizes to any number of input factors, provided that the quadratic
polynomial model contains no quadratic term in any of the input factors alone.
Otherwise, the column of F associated with any such term would consist of ones and
thus be identical to the column of F associated with the constant term. As a result,
FT F would no longer be invertible. Three-level factorial designs may be used in this
case.
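The orthogonality of the columns of F in Example 9.4 is easy to verify. A Python sketch building the repeated design (the row ordering is an arbitrary choice; the chapter's own scripts are in MATLAB):

```python
# Two-level full factorial design for (9.57), repeated once (N = 8).
# Rows of F: [constant, u1, u2, u1*u2] at every corner of [-1, 1]^2.
levels = [-1.0, 1.0]
F = []
for _ in range(2):  # the pattern is repeated to average out noise
    for u2 in levels:
        for u1 in levels:
            F.append([1.0, u1, u2, u1 * u2])

# F^T F should be 8 times the 4x4 identity: the columns are orthogonal.
FtF = [[sum(F[k][i] * F[k][j] for k in range(8)) for j in range(4)]
       for i in range(4)]
```

Solving (9.56) then indeed reduces to x̂ = (1/8) F^T y, one inner product per parameter.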
Example 9.5 Least squares approximation of a function over [−1, 1]
We look for the polynomial (9.50) that best approximates a function f(·) over the
normalized interval [−1, 1], in the sense that

J(x) = ∫_{−1}^{1} [f(τ) − P_k(τ, x)]² dτ   (9.62)

is minimized. The first-order optimality condition leads to the linear system

M x̂ = v,   (9.63)

where m_{i,j} = ∫_{−1}^{1} τ^{i−1} τ^{j−1} dτ and v_i = ∫_{−1}^{1} τ^{i−1} f(τ) dτ,
and cond M deteriorates drastically when the order k of the approximating
polynomial increases. If the polynomial is written instead as

P_k(t, x) = Σ_{i=0}^{k} p_i φ_i(t),   (9.64)

with the φ_i's Legendre polynomials, which are such that

∫_{−1}^{1} φ_{i−1}(τ) φ_{j−1}(τ) dτ = λ_{i−1} δ_{i,j},   (9.65)

with

λ_{i−1} = 2/(2i − 1),   (9.66)

then M becomes diagonal and the optimal coefficients decouple:

p̂_i = (1/λ_i) ∫_{−1}^{1} φ_i(τ) f(τ) dτ,   i = 0, . . . , k.   (9.67)
The estimation of each of them thus boils down to the evaluation of a definite
integral (see Chap. 6). If one wants to increase the degree of the approximating
polynomial by one, it is only necessary to compute p̂_{k+1}, as the other coefficients
are left unchanged.
In general, however, computing FT F should be avoided, and one should rather
use a factorization of F, as in the next two sections. A tutorial history of the least
squares method and its implementation via matrix factorizations is provided in [4],
where the useful concept of total least squares is also explained.
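Example 9.5 can be checked numerically: each coefficient (9.67) is a definite integral, evaluated below by a simple midpoint rule. A Python sketch (the chapter's own scripts are in MATLAB) approximating exp(·) over [−1, 1] with Legendre polynomials up to degree 2; the integration step is an arbitrary choice:

```python
import math

def legendre(i, t):
    """Legendre polynomials P_0, P_1, P_2 on [-1, 1]."""
    return [1.0, t, 1.5 * t * t - 0.5][i]

def integrate(fun, n=20000):
    """Midpoint rule on [-1, 1]."""
    h = 2.0 / n
    return h * sum(fun(-1.0 + (k + 0.5) * h) for k in range(n))

f = math.exp  # function to approximate over [-1, 1]

# Each coefficient is an independent definite integral, cf. (9.67):
# p_i = (1/lambda_i) * int phi_i(tau) f(tau) dtau, with lambda_i = 2/(2i+1).
p = [((2 * i + 1) / 2.0) * integrate(lambda t, i=i: legendre(i, t) * f(t))
     for i in range(3)]

def approx(t):
    return sum(p[i] * legendre(i, t) for i in range(3))
```

Raising the degree to 3 would only require computing p̂₃; the three coefficients above stay unchanged.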
The QR factorization of F is

F = QR,   (9.68)

with Q an (N × N) orthonormal matrix and R an (N × n) upper triangular matrix.
Since the (N − n) last rows of R consist of zeros, one may as well write

F = [Q1 Q2] [R1; O] = Q1 R1,   (9.69)

where R1 consists of the first n rows of R and Q1 of the first n columns of Q.
After these orthonormal transformations, x̂ is the solution of the triangular linear
system

U x̂ = v,   (9.84)

and the residual cost is

J(x̂) = ρ².   (9.85)
J(x̂) is thus trivial to obtain from the QR factorization, without having to solve
(9.84). This might be particularly interesting if one has to choose between several
competing model structures (for instance, polynomial models of increasing order)
and wants to compute x̂ only for the best of them. Note that the model structure
that leads to the smallest value of J(x̂) is very often the most complex one, so some
penalty for model complexity is usually needed.
Remark 9.8 QR factorization also makes it possible to take data into account as soon
as they arrive, instead of waiting for all of them before starting to compute x̂. This
is interesting, for instance, in the context of adaptive control or fault detection. See
Sect. 16.10.
The singular value decomposition (SVD) of F is

F = UΣV^T,   (9.86)

where U has the same dimensions as F, and is such that

U^T U = I_n,   (9.87)

where Σ is an (n × n) diagonal matrix whose diagonal entries are the singular
values of F, and where V is an (n × n) orthonormal matrix, so

V^T V = VV^T = I_n.   (9.88)

In other words,

FV = UΣ   (9.89)

and

F^T U = VΣ,   (9.90)

so that

Fv_i = σ_i u_i,   (9.91)

F^T u_i = σ_i v_i,   (9.92)

where v_i is the ith column of V and u_i the ith column of U. This is why v_i and u_i
are called right and left singular vectors, respectively.
Remark 9.9 While (9.88) implies that

V⁻¹ = V^T,   (9.93)

(9.87) gives no magic trick for inverting U, which is not square!
The computation of the SVD (9.86) is classically carried out in two steps [6],
[7]. During the first of them, orthonormal matrices P1 and Q1 are computed so as to
ensure that
B = P1^T F Q1   (9.94)
is bidiagonal (i.e., it has nonzero entries only in its main descending diagonal and
the descending diagonal immediately above), and that its last (N − n) rows consist
of zeros. Left- or right-multiplication of a matrix by an orthonormal matrix
preserves its singular values, so the singular values of B are the same as those of F.
The computation of P1 and Q1 is achieved through two series of Householder
transformations. The dimensions of B are the same as those of F, but since the last
(N − n) rows of B consist of zeros, the (N × n) matrix P̄1 with the first n columns
of P1 is formed to get a more economical representation

B̄ = P̄1^T F Q1,   (9.95)
The best rank-k approximation F̂_k of F (in the spectral norm) is obtained by
truncating the SVD:

F̂_k = Σ_{i=1}^{k} σ_i u_i v_i^T,   (9.98)

and

‖F − F̂_k‖₂ = σ_{k+1}.   (9.99)
Still assuming, for the time being, that F^T F is invertible, replace F in (9.49) by
UΣV^T to get

x̂ = (VΣU^T UΣV^T)⁻¹ VΣU^T y   (9.100)
  = (VΣ²V^T)⁻¹ VΣU^T y   (9.101)
  = (V^T)⁻¹ Σ⁻² V⁻¹ VΣU^T y.   (9.102)

Since

(V^T)⁻¹ = V,   (9.103)

this yields

x̂ = VΣ⁻¹ U^T y.   (9.104)

In practice,
the solution obtained via QR factorization may actually be slightly more accurate
than the one obtained via SVD. SVD may be preferred when the problem is extremely
ill-conditioned, for reasons detailed in the next two sections.
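A tiny worked instance of solving least squares through a factorization of F rather than through F^T F: a thin QR factorization by classical Gram–Schmidt (adequate at this size; Householder transformations are preferred in practice), followed by back-substitution. The data are made up:

```python
import math

# Made-up overdetermined system: fit y ~ F x with 3 equations, 2 unknowns.
F = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0, 4.0]

# Thin QR factorization F = Q1 R1 by classical Gram-Schmidt.
n1 = math.sqrt(sum(r[0] ** 2 for r in F))
q1 = [r[0] / n1 for r in F]
r11, r12 = n1, sum(q1[i] * F[i][1] for i in range(3))
v = [F[i][1] - r12 * q1[i] for i in range(3)]
n2 = math.sqrt(sum(vi ** 2 for vi in v))
q2 = [vi / n2 for vi in v]
r22 = n2

# Solve R1 x = Q1^T y by back-substitution; F^T F is never formed.
b1 = sum(q1[i] * y[i] for i in range(3))
b2 = sum(q2[i] * y[i] for i in range(3))
x2 = b2 / r22
x1 = (b1 - r12 * x2) / r11
```

The result matches the normal-equation solution of (9.56), but the conditioning of the computation is governed by cond F, not by its square.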
Remark 9.14 When some prior information is available on the possible values of x,
a Bayesian approach to regularization might be preferable [10]. If, for instance, the
prior distribution of x is assumed to be Gaussian, with known mean x₀ and known
covariance Ω, then the maximum a posteriori estimate x̂_map of x satisfies the
linear system

(F^T F + Ω⁻¹) x̂_map = F^T y + Ω⁻¹ x₀,   (9.109)

and this system should be much better conditioned than the normal equations.
An iterative descent method ensures that

J(x_{k+1}) ≤ J(x_k).   (9.110)

Provided that J(x) is bounded from below (as is the case if J(x) is a norm), this
ensures that the sequence {J(x_k)}_{k=0}^∞ converges. Unless the algorithm gets
stuck at x₀, performance as measured by the cost function will thus have improved.
This raises two important questions that we will leave aside until Sect. 9.3.4.8:
• where to start from (how to choose x₀)?
• when to stop?
Before quitting linear least squares completely, let us consider a case where they can
be used to decrease the dimension of search space.
Assume that the error

e(x) = y − f(x)   (9.111)

is to be minimized in the least squares sense, and that the decision vector x can be
split into p and θ, in such a way that

f(x) = F(θ)p.   (9.112)

The error

y − F(θ)p   (9.113)

is then affine in p. For any given value of θ, the corresponding optimal value p̂(θ)
of p can thus be computed by linear least squares, so as to confine nonlinear search
to θ space.
Example 9.6 Fitting data with a sum of exponentials
If the ith data point y_i is modeled as

f_i(p, θ) = Σ_j p_j e^{θ_j t_i},   (9.114)

where the measurement time t_i is known, then the residual y_i − f_i(p, θ) is affine
in p and nonlinear in θ. The dimension of search space can thus be halved by using
linear least squares to compute p̂(θ), a considerable simplification.
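A Python sketch of this separable approach on made-up data with a single exponential (the chapter's own scripts are in MATLAB): p̂(θ) is obtained in closed form by linear least squares, and the remaining one-dimensional search over θ is done here by crude gridding, an arbitrary choice for illustration:

```python
import math

# Made-up data generated by y = 2 * exp(-0.5 t), to be refit.
t = [0.0, 1.0, 2.0, 3.0]
y = [2.0 * math.exp(-0.5 * ti) for ti in t]

def p_hat(theta):
    """Optimal linear parameter for fixed theta, by linear least squares."""
    num = sum(y[i] * math.exp(theta * t[i]) for i in range(4))
    den = sum(math.exp(theta * t[i]) ** 2 for i in range(4))
    return num / den

def cost(theta):
    """Residual cost once p has been eliminated: search is over theta only."""
    p = p_hat(theta)
    return sum((y[i] - p * math.exp(theta * t[i])) ** 2 for i in range(4))

# Crude 1D grid search over theta; the nonlinear search space is halved.
theta_best = min((k / 1000.0 - 1.0 for k in range(2001)), key=cost)
```

In practice the outer search over θ would use a proper nonlinear method; the key point is that p never enters it.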
P₂(λ) = f(λ₁) (λ − λ₂)(λ − λ₃) / [(λ₁ − λ₂)(λ₁ − λ₃)]
      + f(λ₂) (λ − λ₁)(λ − λ₃) / [(λ₂ − λ₁)(λ₂ − λ₃)]
      + f(λ₃) (λ − λ₁)(λ − λ₂) / [(λ₃ − λ₁)(λ₃ − λ₂)].   (9.116)

Its minimizer is

λ̂ = λ₂ − (1/2) · [(λ₂ − λ₁)²[f(λ₂) − f(λ₃)] − (λ₂ − λ₃)²[f(λ₂) − f(λ₁)]]
              / [(λ₂ − λ₁)[f(λ₂) − f(λ₃)] − (λ₂ − λ₃)[f(λ₂) − f(λ₁)]].   (9.117)

Trouble arises when the points (λ_i, f(λ_i)) are collinear, as the denominator in
(9.117) is then equal to zero, or when P₂(λ) turns out to be concave, as P₂(λ) is
then maximal at λ̂. This is why more sophisticated line searches are used in
practice, such as Brent's method.
φ = (√5 − 1)/2 ≈ 0.618.   (9.119)

Thus

λ_{k,1} = λ_min^k + (1 − φ)(λ_max^k − λ_min^k),   (9.120)

λ_{k,2} = λ_min^k + φ(λ_max^k − λ_min^k).   (9.121)

If f(λ_{k,1}) < f(λ_{k,2}), then the subinterval (λ_{k,2}, λ_max^k] is eliminated, which
leaves

[λ_min^{k+1}, λ_max^{k+1}] = [λ_min^k, λ_{k,2}],   (9.122)

else the subinterval [λ_min^k, λ_{k,1}) is eliminated, which leaves

[λ_min^{k+1}, λ_max^{k+1}] = [λ_{k,1}, λ_max^k].   (9.123)
In both cases, one of the two evaluation points of iteration k remains in the updated
search interval, and turns out to be conveniently located at a fraction φ of one
of its extremities. Each iteration but the first thus requires only one additional
evaluation of the cost function, because the other point is one of the two used during
the previous iteration. This method is called golden-section search, because of the
relation between φ and the golden number.
Even if golden-section search makes a thrifty use of cost evaluations, it is much
slower than parabolic interpolation on a good day, and Brent's algorithm switches
back to parabolic interpolation (9.117) as soon as the conditions become favorable.
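A minimal Python implementation of golden-section search (the stopping tolerance and the test function below are made up):

```python
import math

def golden_section(f, lo, hi, tol=1e-8):
    """Golden-section search for a minimizer of a unimodal f on [lo, hi].

    After the first iteration, only one new cost evaluation is needed per
    iteration: the other evaluation point is reused from the previous one.
    """
    phi = (math.sqrt(5.0) - 1.0) / 2.0  # ~0.618
    x1 = lo + (1.0 - phi) * (hi - lo)
    x2 = lo + phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:          # eliminate (x2, hi]
            hi, x2, f2 = x2, x1, f1
            x1 = lo + (1.0 - phi) * (hi - lo)
            f1 = f(x1)
        else:                # eliminate [lo, x1)
            lo, x1, f1 = x1, x2, f2
            x2 = lo + phi * (hi - lo)
            f2 = f(x2)
    return (lo + hi) / 2.0

lam = golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0)  # close to 2
```

The reuse of one point per iteration works because φ² = 1 − φ, so the surviving evaluation point always lands exactly where the shrunken interval needs it.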
Remark 9.16 When the time needed for evaluating the gradient of the cost function
is about the same as for the cost function itself, one may use, instead of Brent's
method, a safeguarded cubic interpolation where a third-degree polynomial is
requested to interpolate f(·) and to have the same slope at two trial points [14].
Golden-section search can then be replaced by bisection to search for λ̂ such that
ḟ(λ̂) = 0 when the results of cubic interpolation become unacceptable.
The first Wolfe condition, known as the Armijo condition, requests that the step
size λ > 0 satisfy

J(x_{k+1}(λ)) ≤ J(x_k) + α₁ λ g^T(x_k) d,   (9.124)

where

x_{k+1}(λ) = x_k + λd   (9.125)

and the cost is considered as a function of λ. If this function is denoted by f(·),
with

f(λ) = J(x_k + λd),   (9.126)

then

ḟ(0) = (∂J/∂x)(x_k)^T d = g^T(x_k) d.   (9.127)

So g^T(x_k)d in (9.124) is the initial slope of the cost function viewed as a function
of λ.
The Armijo condition provides an upper bound on the desirable value of J(x_{k+1}(λ)),
which is affine in λ. Since d is a descent direction, g^T(x_k)d < 0 and λ > 0.
Condition (9.124) states that the larger λ is, the smaller the cost must become. The
internal parameter α₁ should be such that 0 < α₁ < 1, and is usually taken quite
small (a typical value is α₁ = 10⁻⁴).
The Armijo condition is satisfied for any sufficiently small λ, so a bolder strategy
must be induced. This is the role of the second inequality, known as the curvature
condition, which requests that λ also satisfy

ḟ(λ) ≥ α₂ ḟ(0),   (9.128)

i.e.,

g^T(x_k + λd) d ≥ α₂ g^T(x_k) d.   (9.129)

Since ḟ(0) < 0, any λ such that ḟ(λ) > 0 will satisfy (9.128). To avoid this, strong
Wolfe conditions replace the curvature condition (9.129) by

|g^T(x_k + λd) d| ≤ |α₂ g^T(x_k) d|,   (9.130)
while keeping the Armijo condition (9.124) unchanged. With (9.130), ḟ(λ) is still
allowed to become positive, but can no longer get too large.
Provided that the cost function J(·) is smooth and bounded below, the existence
of λ's satisfying the Wolfe and strong Wolfe conditions is guaranteed. The principles
of a line search guaranteed to find such a λ for strong Wolfe conditions are in [11].
Several good software implementations are in the public domain.
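The Armijo condition (9.124) alone already yields a usable, if unsophisticated, backtracking line search. A Python sketch on a made-up quadratic cost (a full Wolfe line search would additionally enforce the curvature condition):

```python
def backtracking_line_search(J, g, x, d, alpha1=1e-4, shrink=0.5, lam=1.0):
    """Shrink lam until the Armijo condition (9.124) holds along d."""
    slope = sum(gi * di for gi, di in zip(g, d))  # g^T d, must be negative
    assert slope < 0.0, "d must be a descent direction"
    J0 = J(x)
    while J([xi + lam * di for xi, di in zip(x, d)]) > J0 + alpha1 * lam * slope:
        lam *= shrink
    return lam

# Made-up quadratic cost and a steepest-descent direction at x = (2, 0).
J = lambda x: x[0] ** 2 + 4.0 * x[1] ** 2
x = [2.0, 0.0]
g = [2.0 * x[0], 8.0 * x[1]]   # gradient at x
d = [-g[0], -g[1]]             # descent direction
lam = backtracking_line_search(J, g, x, d)
```

On this example the unit step overshoots (it returns to the same cost level), so one halving is enough to satisfy (9.124).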
Fig. 9.2 Bad idea for combining line searches (axes x1, x2; successive search
points 1–6 along a valley)
d = x_k^+ − x_k   (9.131)

to get x_{k+1};
3. replace the best of the d_i's in terms of cost reduction by d, increment k by one
and go to Step 1.
This procedure is shown in Fig. 9.3. While the elimination of the best performer
at Step 3 may hurt the reader's sense of justice, it contributes to maintaining linear
independence among the search directions of Step 1, thereby allowing changes of
Fig. 9.3 Powell's algorithm for combining line searches (axes x1, x2; points x_k
and x_k^+ along a valley)
direction that may turn out to be needed after a long sequence of nearly collinear
displacements.
The first-order Taylor expansion of the cost around x_k is

J(x_k + Δx) = J(x_k) + g^T(x_k) Δx + o(‖Δx‖),   (9.132)

so the variation ΔJ of the cost resulting from the displacement Δx is such that

ΔJ = g^T(x_k) Δx + o(‖Δx‖).   (9.133)
When Δx is small enough for higher order terms to be negligible, (9.133) suggests
taking Δx collinear with the gradient at x_k and in the opposite direction

Δx = −λ_k g(x_k), with λ_k > 0.   (9.134)
If J (x) were an altitude, then the gradient would point in the direction of steepest
ascent. This explains why the gradient method is sometimes called the steepest
descent method.
Three strategies are available for the choice of λ_k:
1. keep λ_k at a constant value λ; this is usually a bad idea, as suitable values may
vary by several orders of magnitude along the path followed by the algorithm;
when λ is too small, the algorithm is uselessly slow, whereas when λ is too large,
it may become unstable because of the contribution of higher order terms;
2. adapt λ_k based on the past behavior of the algorithm; if J(x_{k+1}) < J(x_k) then
make λ_{k+1} larger than λ_k, in an attempt to accelerate convergence, else restart
from x_k with a smaller λ_k;
3. choose λ_k by line search to minimize J(x_k − λ_k g(x_k)).
When λk is optimal, successive search directions of the gradient algorithm should
be orthogonal
g(xk+1) ⊥ g(xk),
(9.136)
and this is easy to check.
Remark 9.18 More generally, for any iterative optimization algorithm based on a
succession of line searches, it is informative to plot the (unoriented) angle θ(k)
between successive search directions dk and dk+1,
θ(k) = arccos [ (dk+1)T dk / (||dk+1||2 · ||dk||2) ],
(9.137)
as a function of the value of the iteration counter k, which is simple enough for
any dimension of x. If θ(k) is repeatedly obtuse, then the algorithm may oscillate
painfully in a crablike displacement along some mean direction that may be worth
exploring, an idea similar to that of Powell's algorithm. A repeatedly acute angle,
on the other hand, suggests coherence in the directions of the displacements.
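As a minimal illustration (ours, not from the book), the diagnostic angle (9.137) can be computed as follows; the function name is hypothetical.

```python
import math

def direction_angle(d_prev, d_next):
    # unoriented angle theta(k) of (9.137) between successive
    # search directions, given as sequences of floats
    dot = sum(a * b for a, b in zip(d_prev, d_next))
    n_prev = math.sqrt(sum(a * a for a in d_prev))
    n_next = math.sqrt(sum(b * b for b in d_next))
    # clip the cosine to [-1, 1] to guard arccos against rounding
    c = max(-1.0, min(1.0, dot / (n_prev * n_next)))
    return math.acos(c)
```

An angle repeatedly above π/2 then signals the oscillatory behavior described in the remark.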
The gradient method has a number of advantages:
• it is very simple to implement (provided one knows how to compute gradients, see
Sect. 6.6),
• it is robust to errors in the evaluation of g(xk) (with an efficient line search, convergence to a local minimizer is guaranteed provided that the absolute error in the
direction of the gradient is less than π/2),
• its domain of convergence to a given minimizer is as large as it can be for such a
local method.
Unless the cost function has some special properties such as convexity (see Sect. 10.7),
convergence to a global minimizer is not guaranteed, but this limitation is shared by
all local iterative methods. A more specific disadvantage is that a very large number
of iterations may be needed to get a good approximation of a local minimizer. After
a quick start, the gradient method usually gets slower and slower, which makes it
appropriate only for the initial part of the search.
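Strategy 2 above can be made concrete with a minimal NumPy sketch (ours, not from the book); the function name and the tuning constants `grow` and `shrink` are illustrative only.

```python
import numpy as np

def gradient_descent(J, grad, x0, lam=0.1, grow=1.2, shrink=0.5,
                     max_iter=5000, tol=1e-12):
    # Gradient method (9.135) with step adaptation (strategy 2):
    # if the cost decreased, accept the step and enlarge lam;
    # otherwise restart from the same point with a smaller lam.
    x = np.asarray(x0, dtype=float)
    Jx = J(x)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x_new = x - lam * g
        J_new = J(x_new)
        if J_new < Jx:
            x, Jx = x_new, J_new
            lam *= grow
        else:
            lam *= shrink
    return x

# ill-conditioned quadratic test problem, minimized at (1, -2)
J = lambda x: (x[0] - 1.0)**2 + 10.0 * (x[1] + 2.0)**2
g = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
x_hat = gradient_descent(J, g, [0.0, 0.0])
```

Even on this mildly ill-conditioned quadratic, many iterations are spent creeping along the valley, which illustrates the slow finish mentioned above.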
(9.138)
The variation ΔJ of the cost resulting from the displacement Δx is such that
ΔJ = gT(xk)Δx + (1/2) ΔxT H(xk) Δx + o(||Δx||²).
(9.139)
Newton's method computes the displacement Δx that minimizes this quadratic
approximation of the cost, i.e., the solution of the linear system
H(xk)Δx = −g(xk),
(9.142)
and then takes
xk+1 = xk + Δx.
(9.143)
Fig. 9.4 The domain of convergence of Newton's method to a minimizer (1) is smaller than that
of the gradient method (2)
Remark 9.19 Newton's method for optimization is the same as Newton's method
for solving g(x) = 0, as H(x) is the Jacobian matrix of g(x).
When it converges to a (local) minimizer, Newton's method is dramatically faster
than the gradient method (typically, fewer than ten iterations are needed, instead of
thousands). Even if each iteration requires more computation, this is a definite advantage. Convergence is not guaranteed, however, for at least two reasons.
First, depending on the choice of the initial vector x0, Newton's method may
converge toward a local maximizer or a saddle point instead of a local minimizer, as
it only attempts to find x that satisfies the stationarity condition g(x) = 0. Its domain
of convergence to a (local) minimizer may thus be significantly smaller than that of
the gradient method, as shown by Fig. 9.4.
Second, the size of the Newton step Δx may turn out to be too large for the higher
order terms to be negligible, even if the direction was appropriate. This is easily
avoided by introducing a positive damping factor λk to get the damped Newton
method
xk+1 = xk + λk Δx,
(9.144)
where Δx is still computed by solving (9.142). The resulting algorithm can be summarized as
xk+1 = xk − λk H⁻¹(xk) g(xk).
(9.145)
The damping factor λk can be adapted or optimized by line search, just as for the
gradient method. An important difference is that the nominal value for λk is known
here to be one, whereas there is no such nominal value in the case of the gradient
method.
Newton's method is particularly well suited to the final part of local search, when
the gradient method has become too slow to be useful. Combining an initial behavior
similar to that of the gradient method and a final behavior similar to that of Newton's
method thus makes sense. Before describing attempts at doing so, we consider an
important special case where Newton's method can be usefully simplified.
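The damped Newton iteration just described can be sketched as follows (our illustration, not from the book, with a simple step-halving choice of λk rather than a full line search):

```python
import numpy as np

def damped_newton(J, grad, hess, x0, max_iter=100, tol=1e-10):
    # Damped Newton method (9.144)-(9.145): the full step solves
    # H(xk) dx = -g(xk); lam starts at its nominal value 1 and
    # is halved until the cost decreases.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = np.linalg.solve(hess(x), -g)
        lam = 1.0
        while J(x + lam * dx) >= J(x) and lam > 1e-12:
            lam *= 0.5
        x = x + lam * dx
    return x

# on a quadratic cost, one full Newton step reaches the minimizer
J = lambda x: (x[0] - 3.0)**2 + (x[1] + 1.0)**2
g = lambda x: np.array([2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0)])
H = lambda x: 2.0 * np.eye(2)
x_hat = damped_newton(J, g, H, [0.0, 0.0])
```

On non-quadratic costs, the damping only triggers when the full step fails to decrease the cost, so the nominal λk = 1 behavior is preserved near a minimizer.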
J(x) = Σ_{l=1}^{N} wl el²(x),
(9.146)
where the wl's are known positive weights. The error el (also called residual) may,
for instance, be the difference between some measurement yl and the corresponding
model output ym(l, x). The gradient of the cost function is then
g(x) = ∂J/∂x (x) = 2 Σ_{l=1}^{N} wl el(x) · ∂el/∂x (x),
(9.147)
where ∂el/∂x (x) is the first-order sensitivity of the error with respect to x. The Hessian
of the cost can then be computed as
H(x) = ∂g/∂xT (x) = 2 Σ_{l=1}^{N} wl [∂el/∂x (x)] [∂el/∂x (x)]T + 2 Σ_{l=1}^{N} wl el(x) · ∂²el/∂x∂xT (x),
(9.148)
where ∂²el/∂x∂xT (x) is the second-order sensitivity of the error with respect to x. The
damped Gauss-Newton method is obtained by replacing H(x) in the damped Newton
method by the approximation
Ha(x) = 2 Σ_{l=1}^{N} wl [∂el/∂x (x)] [∂el/∂x (x)]T.
(9.149)
The damped Gauss-Newton iteration is thus
xk+1 = xk + λk dk,
(9.150)
with the search direction dk computed by solving
Ha(xk) dk = −g(xk).
(9.151)
Replacing H(xk ) by Ha (xk ) has two advantages. The first one, obvious, is that (at
least when dim x is small) the computation of the approximate Hessian Ha (x) requires
barely more computation than that of the gradient g(x), as the difficult evaluation of
second-order sensitivities is avoided. The second one, more unexpected, is that the
damped Gauss-Newton method has the same domain of convergence to a given local
minimizer as the gradient method, contrary to Newton's method. This is due to the
fact that Ha(x) ≻ 0 (except in pathological cases), so Ha⁻¹(x) ≻ 0. As a result, the
angle between the search direction −g(xk) of the gradient method and the search
direction −Ha⁻¹(xk)g(xk) of the Gauss-Newton method is less than π/2 in absolute
value.
When the magnitude of the residuals el (x) is small, the Gauss-Newton method
is much more efficient than the gradient method, at a limited additional computing
cost per iteration. Performance tends to deteriorate, however, when this magnitude
increases, because the neglected part of the Hessian gets too significant to be ignored
[11]. This is especially true if el(x) is highly nonlinear in x, as the second-order
sensitivity of the error is then large. In such a situation, one may prefer a quasi-Newton method, see Sect. 9.3.4.5.
Remark 9.20 Sensitivity functions may be evaluated via forward automatic differentiation, see Sect. 6.6.4.
Remark 9.21 When el = yl − ym(l, x), the first-order sensitivity of the error satisfies
∂el/∂x (x) = −∂ym/∂x (l, x).
(9.152)
(9.153)
(9.154)
J(x) = Σ_{i=1}^{N} [y(ti) − q2(ti, x)]²,
(9.155)
where the numerical values of ti and y(ti), (i = 1, . . . , N) are known as the result
of experimentation on the system being modeled. The gradient and approximate
Hessian of the cost function (9.155) can be computed from the first-order sensitivity
of ym with respect to the parameters. If s_{j,k} is the first-order sensitivity of q_j with
respect to x_k,
s_{j,k}(ti, x) = ∂q_j/∂x_k (ti, x),
(9.156)
then the gradient of the cost function is given by
g(x) = −2 [ Σ_{i=1}^{N} [y(ti) − q2(ti, x)] s_{2,1}(ti, x) ;
Σ_{i=1}^{N} [y(ti) − q2(ti, x)] s_{2,2}(ti, x) ;
Σ_{i=1}^{N} [y(ti) − q2(ti, x)] s_{2,3}(ti, x) ],
and the approximate Hessian by
Ha(x) = 2 Σ_{i=1}^{N} [ s_{2,1}²(ti, x)  s_{2,1}(ti, x)s_{2,2}(ti, x)  s_{2,1}(ti, x)s_{2,3}(ti, x) ;
s_{2,2}(ti, x)s_{2,1}(ti, x)  s_{2,2}²(ti, x)  s_{2,2}(ti, x)s_{2,3}(ti, x) ;
s_{2,3}(ti, x)s_{2,1}(ti, x)  s_{2,3}(ti, x)s_{2,2}(ti, x)  s_{2,3}²(ti, x) ].
(9.157)
Since q(0) does not depend on x, the initial condition of each of the first-order
sensitivities is equal to zero
s1,1 (0) = s2,1 (0) = s1,2 (0) = s2,2 (0) = s1,3 (0) = s2,3 (0) = 0.
(9.158)
The numerical solution of the system of eight first-order ordinary differential equations (9.153, 9.157) for the initial conditions (9.154, 9.158) can be obtained by
methods described in Chap. 12. One may solve instead three systems of four first-order ordinary differential equations, each of them computing x1, x2 and the two
sensitivity functions for one of the parameters.
Remark 9.22 Define the error vector as
e(x) = [e1 (x), e2 (x), . . . , e N (x)]T ,
(9.159)
and assume that the wl's have been set to one by the method described in Sect. 9.2.1.
Equation (9.151) can then be rewritten as
JT(xk)J(xk)dk = −JT(xk)e(xk),
(9.160)
where J(x) is the Jacobian matrix of the error vector,
J(x) = ∂e/∂xT (x).
(9.161)
Equation (9.160) is the normal equation for the linear least squares problem
dk = arg min_d ||J(xk)d + e(xk)||₂²,
(9.162)
and a better solution for dk may be obtained by using one of the methods recommended in Sect. 9.2, for instance via a QR factorization of J(xk ). An SVD of J(xk ) is
more complicated but makes it trivial to monitor the conditioning of the local problem to be solved. When the situation becomes desperate, it also allows regularization
to be carried out.
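As an illustration of this recommendation (ours, not from the book), the linear least-squares problem (9.162) can be solved through a QR factorization without ever forming JTJ; a rank-deficient Jacobian would call for the SVD route instead.

```python
import numpy as np

def gauss_newton_step(Jac, e):
    # solve dk = argmin_d ||Jac d + e||_2 via QR factorization,
    # avoiding the normal equations (9.160)
    Q, R = np.linalg.qr(Jac)           # Jac = Q R, R upper triangular
    return np.linalg.solve(R, -Q.T @ e)

Jac = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
e = np.array([1.0, 2.0, 3.0])
dk = gauss_newton_step(Jac, e)
```

The computed step satisfies the normal equation (9.160), but the condition number involved is that of J(xk) rather than its square.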
(Ha^s(xk) + μk I) Δx^s = −g^s(xk),
(9.164)
where the scaled quantities are defined by
h^s_{i,j} = h_{i,j} / √(h_{i,i} h_{j,j}),   g^s_i = g_i / √h_{i,i}   and   Δx^s_i = √h_{i,i} · Δx_i,
(9.165)
where h_{i,j} is the entry of Ha(xk) in position (i, j), g_i is the ith entry of g(xk) and
Δx_i is the ith entry of Δx. Since h_{i,i} > 0, such a scaling is always possible. The ith
row of (9.164) can then be written as
Σ_{j=1}^{n} (h^s_{i,j} + μk δ_{i,j}) Δx^s_j = −g^s_i,
(9.166)
or, in terms of the original variables,
Σ_{j=1}^{n} (h_{i,j} + μk δ_{i,j} h_{i,i}) Δx_j = −g_i,
(9.167)
with δ_{i,j} the Kronecker delta. In other words,
[Ha(xk) + μk diag Ha(xk)] Δx = −g(xk),
(9.168)
where diag Ha is a diagonal matrix with the same diagonal entries as Ha . This is the
Levenberg-Marquardt method, routinely used in software for nonlinear parameter
estimation.
One disadvantage of this method is that a new system of linear equations has to
be solved whenever the value of μk is changed, which makes the optimization of μk
significantly more costly than with usual line searches. This is why some adaptive
strategy for tuning μk based on past behavior is usually employed. See [18] for more
details.
The Levenberg-Marquardt method is one of those implemented in lsqnonlin,
which is part of the MATLAB Optimization Toolbox.
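A minimal Levenberg-Marquardt loop built on (9.168), with the usual adaptive tuning of μk, may be sketched as follows (our illustration, with unit weights and illustrative tuning factors; it is not the lsqnonlin implementation):

```python
import numpy as np

def levenberg_marquardt(residuals, jac, x0, mu=1e-3, max_iter=100):
    # solve [Ha + mu*diag(Ha)] dx = -g as in (9.168); decrease mu
    # after a successful step, increase it otherwise
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        e, Jm = residuals(x), jac(x)
        g = 2.0 * Jm.T @ e             # gradient, as in (9.147)
        if np.linalg.norm(g) < 1e-12:
            break
        Ha = 2.0 * Jm.T @ Jm           # approximation (9.149)
        dx = np.linalg.solve(Ha + mu * np.diag(np.diag(Ha)), -g)
        if np.sum(residuals(x + dx)**2) < np.sum(e**2):
            x, mu = x + dx, mu / 10.0  # success: more Newton-like
        else:
            mu *= 10.0                 # failure: more gradient-like
    return x

# toy problem: e(x) = (x1 - 1, 10 (x2 - 2)), minimized at (1, 2)
res = lambda x: np.array([x[0] - 1.0, 10.0 * (x[1] - 2.0)])
jac = lambda x: np.array([[1.0, 0.0], [0.0, 10.0]])
x_hat = levenberg_marquardt(res, jac, [0.0, 0.0])
```

Small μk makes the step close to the Gauss-Newton one, while large μk turns it into a short, scaled gradient step, which is the interpolation property that motivates the method.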
Jq(x) = J(xk) + gT(xk)(x − xk) + (1/2)(x − xk)T Hq (x − xk),
(9.169)
with gradient
gq(x) = ∂Jq/∂x (x)
(9.170)
and Hessian
Hq = ∂²Jq/∂x∂xT.
(9.171)
Since the approximation is quadratic, its Hessian Hq does not depend on x, which
allows Hq⁻¹ to be estimated from the behavior of the algorithm along a series of
iterations.
Remark 9.23 Of course, J(x) is not exactly quadratic in x (otherwise, using the linear least squares method of Sect. 9.2 would be a much better idea), but a quadratic
approximation usually becomes satisfactory when xk gets close enough to a minimizer.
The updating of the estimate of x is directly inspired from the damped Newton
method (9.145), with H⁻¹ replaced by the estimate Mk of Hq⁻¹ at iteration k:
xk+1 = xk − λk Mk g(xk).
(9.172)
(9.173)
Hq Δx = Δgq,
(9.174)
so
where
Δgq = gq(xk+1) − gq(xk)
(9.175)
and
Δx = xk+1 − xk.
(9.176)
(9.177)
(9.178)
The estimate of the inverse Hessian is updated as
Mk+1 = Mk + Ck,
(9.179)
with the correction Ck chosen so that Mk+1 satisfies the secant condition
Mk+1 Δg = Δx.
(9.180)
Since H⁻¹ is symmetric, its initial estimate M0 and the Ck's are taken symmetric.
This is an important difference with Broyden's method of Sect. 7.4.3, as the Jacobian
matrix of a generic vector function is not symmetric.
Quasi-Newton methods differ by their expressions for Ck. The only possible
symmetric rank-one correction is that of [20]:
Ck = (Δx − Mk Δg)(Δx − Mk Δg)T / [(Δx − Mk Δg)T Δg].
(9.181)
(9.182)
where
C1 = [1 + (ΔgT Mk Δg) / (ΔxT Δg)] · (Δx ΔxT) / (ΔxT Δg)
(9.183)
and
C2 = −(Δx ΔgT Mk + Mk Δg ΔxT) / (ΔxT Δg).
(9.184)
It is easy to check that this update satisfies (9.180) and may also be written as
Mk+1 = [I − (Δx ΔgT)/(ΔxT Δg)] Mk [I − (Δg ΔxT)/(ΔxT Δg)] + (Δx ΔxT)/(ΔxT Δg).
(9.185)
(9.186)
It suffices that
ΔgT Δx > 0
(9.187)
for Mk+1 to be positive definite. This is the case when strong Wolfe conditions are
enforced during the computation of λk [22]. Other options include
• freezing M whenever ΔgT Δx ≤ 0 (by setting Mk+1 = Mk),
• periodic restart, which forces Mk to the identity matrix every dim x iterations.
(If the actual cost function were quadratic in x and computation were carried out
exactly, convergence would take place in at most dim x iterations.)
The initial value for the approximation of H1 is taken as
M0 = I,
(9.188)
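For illustration (ours, not from the book), the BFGS update (9.185) of the inverse-Hessian approximation can be implemented directly, and the secant condition (9.180) checked numerically:

```python
import numpy as np

def bfgs_update(M, dx, dg):
    # update (9.185) of the approximation M of the inverse Hessian
    # from the displacement dx and the gradient change dg
    n = len(dx)
    rho = 1.0 / float(dx @ dg)     # requires dg.dx > 0, cf. (9.187)
    V = np.eye(n) - rho * np.outer(dx, dg)
    return V @ M @ V.T + rho * np.outer(dx, dx)

M0 = np.eye(2)                     # initial estimate, as in (9.188)
dx = np.array([1.0, 2.0])
dg = np.array([0.5, 1.0])
M1 = bfgs_update(M0, dx, dg)
```

Since V annihilates Δg on the right, M1 maps Δg exactly onto Δx, and symmetry of M is preserved by construction.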
(9.190)
(9.191)
(9.192)
(9.193)
Successive search directions of the optimally damped Newton method are thus conjugate with respect to the Hessian. Conjugate-gradient methods aim at achieving
the same property with respect to an approximation Hq of this Hessian. As the search
directions under consideration are not gradients, talking of conjugate gradients is
misleading, but imposed by tradition.
A famous member of the conjugate-gradient family is the Polak-Ribière method
[16, 25], which takes
βk^PR = gT(xk+1)[g(xk+1) − g(xk)] / [gT(xk)g(xk)]
(9.194)
and
dk+1 = −g(xk+1) + βk^PR dk.
(9.195)
If the cost function were actually given by (9.169), then this strategy would ensure
that dk+1 and dk are conjugate with respect to Hq, although Hq is neither known
nor estimated, a considerable advantage for large-scale problems. The method is
initialized by taking
d0 = −g(x0),
(9.196)
so it starts like a gradient method. Just as with quasi-Newton methods, a periodic
restart strategy may be employed, with dk taken equal to −g(xk) every dim x iterations.
Satisfaction of strong Wolfe conditions during line search does not guarantee,
however, that dk+1 as computed with the Polak-Ribière method is always a descent
direction [11]. To fix this, it suffices to replace βk^PR in (9.194) by
βk^PR+ = max{βk^PR, 0}.
(9.197)
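The safeguarded coefficient (9.197) is a one-liner; here is a sketch (ours, not from the book) together with the direction update (9.195):

```python
import numpy as np

def beta_pr_plus(g_new, g_old):
    # Polak-Ribiere coefficient (9.194) with the max{., 0}
    # safeguard (9.197) restoring the descent property
    beta = g_new @ (g_new - g_old) / (g_old @ g_old)
    return max(float(beta), 0.0)

def next_direction(g_new, g_old, d_old):
    # search-direction update (9.195)
    return -g_new + beta_pr_plus(g_new, g_old) * d_old
```

When the safeguard clips βk to zero, the next direction reduces to the plain antigradient, which amounts to an automatic restart.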
(9.199)
(9.200)
J(x) = bT A⁻¹ b − 2bT x + xT A x.
(9.201)
The cost function (9.201) is exactly quadratic, so its Hessian does not depend on x,
and using the conjugate-gradient method entails no approximation. The gradient of
the cost function, needed by the method, is easy to compute as
g(x) = 2(Ax − b).
(9.202)
A good approximation of the solution is often obtained with this approach in much
less than the dim x iterations theoretically needed.
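A self-contained sketch (ours, not from the book) of the resulting linear conjugate-gradient solver; in exact arithmetic it would terminate in at most dim x iterations:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    # CG for Ax = b with A symmetric positive definite; the
    # residual r = b - Ax is -1/2 times the gradient (9.202)
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x
    d = r.copy()
    for _ in range(n):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d  # A-conjugate to d
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

No factorization of A is required, only matrix-vector products, which is what makes the method attractive for large sparse systems.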
[(λmax − λmin) / (λmax + λmin)]²,
(9.204)
(9.205)
This is much better than a linear convergence speed. As long as the effect of rounding
can be neglected, the number of correct decimal digits in xk is approximately doubled
at each iteration.
The convergence speed of the Gauss-Newton or Levenberg-Marquardt method
lies somewhere between linear and quadratic, depending on the quality of the approximation of the Hessian, which itself depends on the magnitude of the residuals. When
this magnitude is small enough, convergence is quadratic, but for large enough residuals it becomes linear.
Quasi-Newton methods have a superlinear convergence speed, so
lim sup_{k→∞} ||xk+1 − x̂|| / ||xk − x̂|| = 0.
(9.206)
(9.208)
tref = c + (c − w) = 2c − w.
(9.209)
If J(b) ≤ J(tref) ≤ J(s), then w is replaced by tref. If the reflection has been more
successful and J(tref) < J(b), then the algorithm tries to go further in the same
direction. This is expansion (Fig. 9.6), where the trial point becomes
texp = c + 2(c − w).
(9.210)
Fig. 9.6 Expansion
If the expansion is a success, i.e., if J (texp ) < J (tref ), then w is replaced by texp ,
else it is still replaced by tref .
Remark 9.27 Some of the vertices kept from one iteration to the next must be renamed.
For instance, after a successful expansion, the trial point texp becomes the best vertex b.
When reflection is more of a failure, i.e., when J(tref) > J(s), two types of
contractions are considered (Fig. 9.7). If J(tref) < J(w), then a contraction on the
reflection side (or outside contraction) is attempted, with the trial point
tout = c + (1/2)(c − w) = (1/2)(c + tref),
(9.211)
whereas if J(tref) ≥ J(w) a contraction on the worst side (or inside contraction) is
attempted, with the trial point
tin = c − (1/2)(c − w) = (1/2)(c + w).
(9.212)
Fig. 9.7 Contractions (potential new simplices are in grey)
Let t̂ be the best out of tref and tin (or tref and tout). If J(t̂) < J(w), then the worst
vertex w is replaced by t̂.
Else, a shrinkage is performed (Fig. 9.8), during which each other vertex is moved
in the direction of the best vertex by halving its distance to b, before starting a new
iteration of the algorithm, by a reflection.
Iterations are stopped when the volume of the current simplex dwindles below
some threshold.
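One full iteration of the moves just described can be sketched as follows (our illustration, not from the book; vertex bookkeeping is simplified and the usual coefficients 1, 2 and 1/2 are hard-coded):

```python
import numpy as np

def nelder_mead_step(simplex, J):
    # one iteration: reflection (9.209), then possibly expansion
    # (9.210), contraction (9.211)/(9.212) or shrinkage
    simplex = sorted(simplex, key=J)          # b first, w last
    b, s, w = simplex[0], simplex[-2], simplex[-1]
    c = np.mean(simplex[:-1], axis=0)         # centroid without w
    t_ref = c + (c - w)
    if J(t_ref) < J(b):                       # try expansion
        t_exp = c + 2.0 * (c - w)
        simplex[-1] = t_exp if J(t_exp) < J(t_ref) else t_ref
    elif J(t_ref) <= J(s):                    # keep reflection
        simplex[-1] = t_ref
    else:                                     # contraction
        t = 0.5 * (c + t_ref) if J(t_ref) < J(w) else 0.5 * (c + w)
        if J(t) < J(w):
            simplex[-1] = t
        else:                                 # shrink toward b
            simplex = [b] + [0.5 * (b + v) for v in simplex[1:]]
    return simplex

J = lambda v: float(v[0]**2 + v[1]**2)
simplex = [np.array([1.0, 1.0]), np.array([1.5, 1.0]), np.array([1.0, 1.5])]
for _ in range(200):
    simplex = nelder_mead_step(simplex, J)
```

Only cost evaluations are used, never gradients, which is why the method remains applicable to nondifferentiable costs.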
(9.213)
(9.214)
g(x) = ∂/∂x [E_p{J(x, p)}].
(9.215)
Each iteration would thus require the evaluation of the gradient of a mathematical
expectation, which might be extremely costly as it might involve numerical evaluations of multidimensional integrals.
The stochastic gradient method, a particularly simple example of a stochastic
approximation technique, computes instead
xk+1 = xk − λk ĝ(xk),
(9.216)
with
ĝ(x) = ∂/∂x [J(x, pk)],
(9.217)
where pk is picked at random according to π(p) and λk should satisfy the three
following conditions:
• λk > 0 (for the steps to be in the right direction),
• Σ_{k=0}^∞ λk = ∞ (for all possible values of x to be reachable),
• Σ_{k=0}^∞ λk² < ∞ (for x to converge toward a constant vector when k tends to
infinity).
One may use, for instance,
λk = λ0 / (k + 1),
with λ0 > 0 to be chosen by the user. More sophisticated options are available; see,
e.g., [10]. The stochastic gradient method makes it possible to minimize a mathematical expectation without ever evaluating it or its gradient. As this is still a local
method, convergence to a global minimizer of E_p{J(x, p)} is not guaranteed and
multistart remains advisable.
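A minimal stochastic gradient sketch (ours, not from the book), on a toy problem where the expectation could of course be computed exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_gradient(grad_sample, sample_p, x0, lam0=1.0, n_iter=20000):
    # iteration (9.216) with lam_k = lam0/(k+1), which satisfies
    # the three conditions listed above
    x = float(x0)
    for k in range(n_iter):
        p = sample_p()
        x = x - (lam0 / (k + 1)) * grad_sample(x, p)
    return x

# minimize E_p{(x - p)^2} with p uniform on {1, 2, 3}: minimizer 2
grad_sample = lambda x, p: 2.0 * (x - p)
sample_p = lambda: float(rng.choice([1.0, 2.0, 3.0]))
x_hat = stochastic_gradient(grad_sample, sample_p, 0.0)
```

Each iteration touches a single random realization of p, so no multidimensional integral is ever evaluated.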
An interesting special case is when p can only take the values pi , i = 1, . . . , N ,
with N finite (but possibly very large), and each pi has the same probability 1/N .
Average-case optimization then boils down to computing
x̂ = arg min_x J(x),
(9.218)
with
J(x) = (1/N) Σ_{i=1}^{N} Ji(x),
(9.219)
where
Ji (x) = J (x, pi ).
(9.220)
Provided that each function Ji(·) is smooth and J(·) is strongly convex (as is often
the case in machine learning), the stochastic average gradient algorithm presented
in [35] can dramatically outperform a conventional stochastic gradient algorithm in
terms of convergence speed.
x̂ = arg min_x max_{p∈P} J(x, p).
(9.221)
This method leaves open the choice of the optimization routines to be employed
at Steps 2 and 3. Under reasonable technical conditions, it stops after a finite number
of iterations.
The next two sections briefly describe examples of the two strategies. In both cases,
search is assumed to take place in a possibly very large domain X taking the form of
an axis-aligned hyper-rectangle, or box. As no global optimizer is expected to belong
to the boundary of X, this is still unconstrained optimization.
Remark 9.28 When a vector x of model parameters must be estimated from experimental data by minimizing the lp-norm of an error vector (p = 1, 2, ∞), appropriate
experimental conditions may eliminate all suboptimal local minimizers, thus allowing local methods to be used to get a global minimizer [39].
1. Choose x0, set k = 0.
2. Pick a trial point xk+ = xk + δk, with δk random.
3. If J(xk+) < J(xk) then xk+1 = xk+, else xk+1 = xk.
4. Increment k by one and go to Step 2.
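These steps translate directly into code; a sketch (ours, not from the book), with a single fixed Gaussian perturbation for simplicity:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_search(J, x0, lower, upper, sigma=0.5, n_iter=5000):
    # the four steps above: Gaussian perturbation, truncation into
    # the box X = [lower, upper], greedy acceptance
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        trial = np.clip(x + sigma * rng.standard_normal(x.shape),
                        lower, upper)
        if J(trial) < J(x):
            x = trial
    return x

J = lambda x: (x[0] - 0.3)**2 + (x[1] + 0.7)**2
x_hat = random_search(J, [0.0, 0.0], -1.0, 1.0)
```

There is no guarantee of reaching a global minimizer in finite time, but any point of X can eventually be visited, which is the basis of the convergence arguments for such methods.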
(δk)i ~ N(0, σj²),   i = 1, . . . , dim x,
(9.222)
and truncation is carried out to ensure that xk+ stays in X. The distributions differ
by the value given to σj, j = 1, . . . , 5. One may take, for instance,
(9.223)
σj = σj−1 / 10,   j = 2, . . . , 5.
(9.224)
EI(x) = σ̂(x)[uΦ(u) + φ(u)],
(9.225)
where φ(·) and Φ(·) are the probability density and cumulative distribution functions
of the zero-mean Gaussian variable with unit variance, and where
u = [Jbest^sofar − Ĵ(x)] / σ̂(x),
(9.226)
with Jbest^sofar the lowest value of the cost over all the evaluations carried out so far.
EI(x) will be large if Ĵ(x) is low or σ̂²(x) is large, which gives EGO some ability
to escape the attraction of local minimizers and explore unknown regions. Figure 9.9
shows one step of EGO on a univariate problem. The Kriging prediction of the
cost function J(x) is on top, and the expected improvement EI(x) at the bottom (in
logarithmic scale). The graph of the cost function to be minimized is a dashed line.
The graph of the mean of the Kriging prediction is a solid line, with the previously
evaluated costs indicated by squares. The horizontal dashed line indicates the value
of Jbest^sofar. The 95 % confidence region for the prediction is in grey. J(x) should be
evaluated next where EI(x) reaches its maximum, i.e., around x = 0.62. This is far
from where the best cost had been achieved, because the uncertainty on J(x) makes
other regions potentially interesting.
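With φ and Φ available, the expected improvement (9.225)-(9.226) is easy to evaluate; a sketch (ours, not from the book) using only the standard library:

```python
import math

def expected_improvement(J_best, J_pred, sigma):
    # EI(x) = sigma*(u*Phi(u) + phi(u)), u = (J_best - J_pred)/sigma,
    # with J_pred and sigma the Kriging mean and standard deviation
    if sigma <= 0.0:
        return max(J_best - J_pred, 0.0)
    u = (J_best - J_pred) / sigma
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return sigma * (u * Phi + phi)
```

EI grows both when the predicted cost drops below the best cost found so far and when the prediction is uncertain, which is exactly the exploration-exploitation tradeoff described above.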
Fig. 9.9 Kriging prediction (top) and expected improvement on a logarithmic scale (bottom) (courtesy of Emmanuel Vazquez, Supélec)
Once
x̂ = arg max_{x∈X} EI(x)
(9.227)
J(x) = Σ_{i=1}^{nJ} wi Ji(x),
(9.228)
with positive weights wi to be chosen by the user. One may also give priority to one
of the cost functions and minimize it under constraints on the values allowed to the
others (see Chap. 10).
These two strategies restrict choice, however, and one may prefer to look for the
Pareto front, i.e., the set of all x ∈ X such that any local move that decreases a
given cost Ji increases at least one of the other costs. The Pareto front is thus a set
of tradeoff solutions. Computing a Pareto front is of course much more complicated
than minimizing a single cost function [58]. A single decision x̂ usually has to be
taken at a later stage anyway, which corresponds to minimizing (9.228) for a specific
choice of the weights wi. An examination of the shape of the Pareto front may help
the user choose the most appropriate tradeoff.
(9.230)
p⋆ = (10, −9, 8, −7, 6, −5, 4, −3, 2, −1, 0)T.
(9.231)
The estimate p̂ is computed as
p̂ = arg min_{p∈R¹¹} J(p),
(9.232)
where
J(p) = Σ_{i=1}^{N} [yi − ym(xi, p)]².
(9.233)
Since ym (xi , p) is linear in p, linear least squares apply. The feasible domain X for
the input vector xi is defined as the Cartesian product of the feasible ranges for each
of the input factors. The jth input factor can take any value in [min(j), max(j)],
with
min(1) = 0;   max(1) = 0.05;
min(2) = 50;  max(2) = 100;
min(3) = -1;  max(3) = 7;
min(4) = 0;   max(4) = 1.e5;
The feasible ranges for the four input factors are thus quite different, which tends to
make the problem ill-conditioned.
Two designs for data collection are considered. In Design D1, each xi is independently picked at random in X, whereas Design D2 is a two-level full factorial
design, in which the data are collected at all the possible combinations of the bounds
of the ranges of the input factors. Design D2 thus has 2⁴ = 16 different experimental
conditions xi. In what follows, the number N of pairs (yi, xi) of data points in D1 is
taken equal to 32, so D2 is repeated twice to get the same number of data points as
in D1.
The output data are in Y for D1 and in Yfd for D2, while the corresponding values
of the factors are in X for D1 and in Xfd, for D2. The following function is used
for estimating the parameters P from the output data Y and corresponding regression
matrix F
function[P,Cond] = LSforExample(F,Y,option)
% F is (nExp,nPar), contains the regression matrix.
% Y is (nExp,1), contains the measured outputs.
% option specifies how the LS estimate is computed;
% it is equal to 1 for NE, 2 for QR and 3 for SVD.
% P is (nPar,1), contains the parameter estimate.
% Cond is the condition number of the system solved
% by the approach selected (for the spectral norm).
[nExp,nPar] = size(F);
if (option == 1)
% Computing P by solving the normal equations
P = (F'*F)\(F'*Y);
% here, \ is by Gaussian elimination
Cond = cond(F'*F);
end
if (option == 2)
% Computing P by QR factorization
[Q,R] = qr(F);
QTY = Q'*Y;
opts_UT.UT = true;
P = linsolve(R,QTY,opts_UT);
Cond = cond(R);
end
if (option == 3)
% Computing P by SVD
[U,S,V] = svd(F,'econ');
P = V*inv(S)*U'*Y;
Cond = cond(S);
end
end
InitialCond = cond(F)
% Computing optimal P with normal equations
[PviaNE,CondViaNE] = LSforExample(F,Y,1)
OptimalCost = (norm(Y-F*PviaNE))^2
NormErrorP = norm(PviaNE-trueP)
% Computing optimal P via QR factorization
[PviaQR,CondViaQR] = LSforExample(F,Y,2)
OptimalCost = (norm(Y-F*PviaQR))^2
NormErrorP = norm(PviaQR-trueP)
% Computing optimal P via SVD
[PviaSVD,CondViaSVD] = LSforExample(F,Y,3)
OptimalCost = (norm(Y-F*PviaSVD))^2
NormErrorP = norm(PviaSVD-trueP)
The condition number of the initial problem is found to be
InitialCond =
2.022687340567638e+09
The results obtained by solving the normal equations are
PviaNE =
9.999999744351953e+00
-8.999994672834873e+00
8.000000003536115e+00
-6.999999981897417e+00
6.000000000000670e+00
-5.000000071944669e+00
3.999999956693500e+00
-2.999999999998153e+00
1.999999999730790e+00
-1.000000000000011e+00
2.564615186884112e-14
CondViaNE =
4.097361000068907e+18
OptimalCost =
8.281275106847633e-15
NormErrorP =
5.333988749555268e-06
Although the condition number of the normal equations is dangerously high, this
approach still provides rather good estimates of the parameters.
NormErrorP =
8.972414778806571e-08
The condition number of the problem solved is slightly higher than for the initial
problem and the QR approach, and the estimates slightly less accurate than with the
simpler QR approach.
2.000000000012406e+02
-1.250000000000000e+06
2.126983788033481e-09
NewCondViaQR =
5.633128746769874e+00
OptimalCost =
7.951945308823372e-17
Although the condition number of the transformed initial problem is recovered, the
solution is actually slightly less accurate than when solving the normal equations.
The results obtained via an SVD of the regression matrix are
NewPviaSVD =
-3.452720300000001e+06
-3.759299999998882e+03
-1.249653125000000e+06
5.724000000012747e+02
-3.453749999999998e+06
-3.125000001688022e+00
3.999999996158294e-01
-3.750000000000931e+03
2.000000000023283e+02
-1.250000000000001e+06
1.280568540096283e-09
NewCondViaSVD =
5.633128746769864e+00
OptimalCost =
1.847488972244773e-16
Once again, the solution obtained via SVD is slightly less accurate than the one
obtained via QR factorization. So the approach solving the normal equations is a
clear winner on this version of the problem, as it is the least expensive and the most
accurate.
9.5.1.3 Using a Two-Level Full Factorial Design
Let us finally process the data collected according to D2, defined as follows.
% Two-level full factorial design
% for the special case nFact = 4
FD = [-1, -1, -1, -1;
-1, -1, -1, +1;
-1, -1, +1, -1;
-1, -1, +1, +1;
-1, +1, -1, -1;
-1, +1, -1, +1;
-1, +1, +1, -1;
-1, +1, +1, +1;
+1, -1, -1, -1;
+1, -1, -1, +1;
+1, -1, +1, -1;
+1, -1, +1, +1;
+1, +1, -1, -1;
+1, +1, -1, +1;
+1, +1, +1, -1;
+1, +1, +1, +1];
The ranges of the factors are still normalized to [−1, 1], but each of the factors
is now always equal to ±1. Solving the normal equations is particularly easy, as the
resulting regression matrix Ffd is now such that Ffd'*Ffd is a multiple of the
identity matrix. We can thus use the script
% Filling the regression matrix
Ffd = zeros(nExp,nPar);
nRep = 2;
for j=1:nRep,
for i=1:16,
Ffd(16*(j-1)+i,1) = 1;
Ffd(16*(j-1)+i,2) = FD(i,1);
Ffd(16*(j-1)+i,3) = FD(i,2);
Ffd(16*(j-1)+i,4) = FD(i,3);
Ffd(16*(j-1)+i,5) = FD(i,4);
Ffd(16*(j-1)+i,6) = FD(i,1)*FD(i,2);
Ffd(16*(j-1)+i,7) = FD(i,1)*FD(i,3);
Ffd(16*(j-1)+i,8) = FD(i,1)*FD(i,4);
Ffd(16*(j-1)+i,9) = FD(i,2)*FD(i,3);
Ffd(16*(j-1)+i,10) = FD(i,2)*FD(i,4);
Ffd(16*(j-1)+i,11) = FD(i,3)*FD(i,4);
end
end
% Solving the (now trivial) normal equations
NewPviaNEandFD = Ffd'*Yfd/(16*nRep)
NewCondviaNEandFD = cond(Ffd)
OptimalCost = (norm(Yfd-Ffd*NewPviaNEandFD))^2
This yields
NewPviaNEandFD =
-3.452720300000000e+06
236
-3.759299999999965e+03
-1.249653125000000e+06
5.723999999999535e+02
-3.453750000000000e+06
-3.125000000058222e+00
3.999999999534225e-01
-3.749999999999965e+03
2.000000000000000e+02
-1.250000000000000e+06
-4.661160346586257e-11
NewCondviaNEandFD =
1.000000000000000e+00
OptimalCost =
1.134469775459169e-17
These results are the most accurate ones, and they were obtained with the least
amount of computation.
For the same problem, a normalization of the range of the input factors combined
with the use of an appropriate factorial design has thus reduced the condition number
of the normal equations from about 4.1 × 10¹⁸ to one.
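Why D2 makes the normal equations trivial can be checked numerically; the following sketch (in Python for convenience, whereas the book's own scripts are in MATLAB) rebuilds the 16 × 11 regression matrix of one replication and verifies that F'F is a multiple of the identity:

```python
import numpy as np
from itertools import product

# intercept, 4 main effects and 6 two-factor interactions for the
# two-level full factorial design
rows = []
for f in product([-1.0, 1.0], repeat=4):
    f = np.array(f)
    inter = [f[i] * f[j] for i in range(4) for j in range(i + 1, 4)]
    rows.append(np.concatenate(([1.0], f, inter)))
F = np.array(rows)                 # 16 x 11

# all columns are orthogonal and of squared norm 16, so F'F = 16 I
# and the condition number of F is exactly one
print(np.allclose(F.T @ F, 16.0 * np.eye(11)))   # prints True
```

With orthogonal columns, solving the normal equations reduces to a single matrix-vector product, exactly as in the MATLAB script above.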
(9.234)
Fig. 9.10 Data to be used in nonlinear parameter estimation
'MarkerSize',7)
xlabel('Time')
ylabel('Output data')
hold on
They are described by Fig. 9.10.
The parameters p̂ of the model will be estimated by minimizing either the quadratic
cost function
J(p) = Σ_{i=1}^{16} [ym(ti, p⋆) − ym(ti, p)]²
(9.235)
or the sum of the absolute values of the errors
J(p) = Σ_{i=1}^{16} |ym(ti, p⋆) − ym(ti, p)|.
(9.236)
In both cases, p̂ is expected to be close to p⋆, and J(p̂) close to zero. All the algorithms
are initialized at p0 = (1, 1, 1)T.
Fig. 9.11 Least-square fit of the data in Fig. 9.10, obtained by Nelder and Mead's simplex
after 393 evaluations of the cost function and every type of move Nelder and Mead's
simplex algorithm can carry out.
The results of the simulation of the best model are in Fig. 9.11, together with the
data. As expected, the fit is visually perfect. Since it turns out to be so with all the other
methods used to process the same data, no other such figure will be displayed.
p0 = [1;1;1];
optionsFMS = optimset('Display','iter',...
'TolX',1.e-8,'MaxFunEvals',1000,'MaxIter',1000);
[pHat,Jhat] = fminsearch(@(p) L1costExpMod...
(p,data,t),p0,optionsFMS)
the following results are obtained
pHat =
1.999999999761701e+00
1.000000000015123e-01
2.999999999356928e-01
Jhat =
1.628779979759212e-09
after 753 evaluations of the cost function.
9.6 In Summary
Recognize when the linear least squares method applies or when the problem is
convex, as there are extremely powerful dedicated algorithms.
When the linear least squares method applies, avoid solving the normal equations,
which may be numerically disastrous because of the computation of FT F, unless
some very specific conditions are met. Prefer, in general, the approach based on a
QR factorization or SVD of F. SVD provides the value of the condition number
of the problem for the spectral norm as a byproduct and allows ill-conditioned
problems to be regularized, but is more complex than QR factorization and does
not necessarily give more accurate results.
When the linear least-squares method does not apply, most of the methods presented are iterative and local. They converge at best to a local minimizer, with
no guarantee that it is global and unique (unless additional properties of the cost
function are known, such as convexity). When the time needed for a single local
optimization allows, multistart may be used in an attempt to escape the possible
attraction of parasitic local minimizers. This is a first and particularly simple example
of global optimization by random search, with no guarantee of success either.
Combining line searches should be done carefully, as limiting the search directions
to fixed subspaces may forbid convergence to a minimizer.
Not all the iterative methods based on Taylor expansion are equal. The best ones
start as gradient methods and finish as Newton methods. This is the case for the
quasi-Newton and conjugate-gradient methods.
When the cost function is quadratic in some error, the Gauss-Newton method has
significant advantages over the Newton method. It is particularly efficient when
the minimum of the cost function is close to zero.
Conjugate-gradient methods may be preferred over quasi-Newton methods when
there are many decision variables. The price to be paid for this choice is that no
estimate of the inverse of the Hessian at the minimizer will be provided.
Unless the cost function is differentiable everywhere, all the local methods based
on a Taylor expansion are bound to fail. The Nelder and Mead method, which
relies only on evaluations of the cost function, is thus particularly interesting for
nondifferentiable problems such as the minimization of a sum of absolute errors.
Robust optimization makes it possible to protect oneself against the effect of factors
that are not under control.
Branch-and-bound methods allow statements to be proven about the global minimum and global minimizers.
When the budget for evaluating the cost function is severely limited, one may
try Efficient Global Optimization (EGO), based on the use of a surrogate model
obtained by Kriging.
The shape of the Pareto front may help one select the most appropriate tradeoff
when objectives are conflicting.
References
1. Santner, T., Williams, B., Notz, W.: The Design and Analysis of Computer Experiments.
Springer, New York (2003)
2. Lawson, C., Hanson, R.: Solving Least Squares Problems. Classics in Applied Mathematics.
SIAM, Philadelphia (1995)
3. Björck, A.: Numerical Methods for Least Squares Problems. SIAM, Philadelphia (1996)
4. Nievergelt, Y.: A tutorial history of least squares with applications to astronomy and geodesy.
J. Comput. Appl. Math. 121, 37–72 (2000)
5. Golub, G., Van Loan, C.: Matrix Computations, 3rd edn. The Johns Hopkins University Press,
Baltimore (1996)
6. Golub, G., Kahan, W.: Calculating the singular values and pseudo-inverse of a matrix. J. Soc.
Indust. Appl. Math. B. Numer. Anal. 2(2), 205–224 (1965)
7. Golub, G., Reinsch, C.: Singular value decomposition and least squares solution. Numer. Math.
14, 403–420 (1970)
8. Demmel, J.: Applied Numerical Linear Algebra. SIAM, Philadelphia (1997)
9. Demmel, J., Kahan, W.: Accurate singular values of bidiagonal matrices. SIAM J. Sci. Stat.
Comput. 11(5), 873–912 (1990)
10. Walter, E., Pronzato, L.: Identification of Parametric Models. Springer, London (1997)
11. Nocedal, J., Wright, S.: Numerical Optimization. Springer, New York (1999)
12. Brent, R.: Algorithms for Minimization Without Derivatives. Prentice-Hall, Englewood Cliffs
(1973)
13. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes. Cambridge University Press, Cambridge (1986)
14. Gill, P., Murray, W., Wright, M.: Practical Optimization. Elsevier, London (1986)
15. Bonnans, J., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.: Numerical Optimization:
Theoretical and Practical Aspects. Springer, Berlin (2006)
16. Polak, E.: Optimization: Algorithms and Consistent Approximations. Springer, New York
(1997)
17. Levenberg, K.: A method for the solution of certain non-linear problems in least squares. Quart.
Appl. Math. 2, 164–168 (1944)
18. Marquardt, D.: An algorithm for least-squares estimation of nonlinear parameters. J. Soc.
Indust. Appl. Math. 11(2), 431–441 (1963)
19. Dennis Jr, J., Moré, J.: Quasi-Newton methods, motivations and theory. SIAM Rev. 19(1),
46–89 (1977)
20. Broyden, C.: Quasi-Newton methods and their application to function minimization. Math.
Comput. 21(99), 368–381 (1967)
21. Dixon, L.: Quasi Newton techniques generate identical points II: the proofs of four new theorems. Math. Program. 3, 345–358 (1972)
22. Gertz, E.: A quasi-Newton trust-region method. Math. Program. 100(3), 447–470 (2004)
23. Shewchuk, J.: An introduction to the conjugate gradient method without the agonizing pain.
School of Computer Science, Carnegie Mellon University, Pittsburgh, Technical report (1994)
24. Hager, W., Zhang, H.: A survey of nonlinear conjugate gradient methods. Pacific J. Optim.
2(1), 35–58 (2006)
25. Polak, E.: Computational Methods in Optimization. Academic Press, New York (1971)
26. Minoux, M.: Mathematical Programming: Theory and Algorithms. Wiley, New York (1986)
27. Shor, N.: Minimization Methods for Non-differentiable Functions. Springer, Berlin (1985)
28. Bertsekas, D.: Nonlinear Programming. Athena Scientific, Belmont (1999)
29. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. B 120,
221–259 (2009)
30. Walters, F., Parker, L., Morgan, S., Deming, S.: Sequential Simplex Optimization. CRC Press,
Boca Raton (1991)
31. Lagarias, J., Reeds, J., Wright, M., Wright, P.: Convergence properties of the Nelder-Mead
simplex method in low dimensions. SIAM J. Optim. 9(1), 112–147 (1998)
32. Lagarias, J., Poonen, B., Wright, M.: Convergence of the restricted Nelder-Mead algorithm in
two dimensions. SIAM J. Optim. 22(2), 501–532 (2012)
33. Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press,
Princeton (2009)
34. Bertsimas, D., Brown, D., Caramanis, C.: Theory and applications of robust optimization.
SIAM Rev. 53(3), 464–501 (2011)
35. Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In: Neural Information
Processing Systems (NIPS 2012). Lake Tahoe (2012)
36. Rustem, B., Howe, M.: Algorithms for Worst-Case Design and Applications to Risk Management. Princeton University Press, Princeton (2002)
37. Shimizu, K., Aiyoshi, E.: Necessary conditions for min-max problems and algorithms by a
relaxation procedure. IEEE Trans. Autom. Control AC-25(1), 62–66 (1980)
38. Horst, R., Tuy, H.: Global Optimization. Springer, Berlin (1990)
39. Pronzato, L., Walter, E.: Eliminating suboptimal local minimizers in nonlinear parameter estimation. Technometrics 43(4), 434–442 (2001)
40. Whitley, L. (ed.): Foundations of Genetic Algorithms 2. Morgan Kaufmann, San Mateo (1993)
41. Goldberg, D.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989)
42. Storn, R., Price, K.: Differential evolution: a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11, 341–359 (1997)
43. Dorigo, M., Stützle, T.: Ant Colony Optimization. MIT Press, Cambridge (2004)
44. Kennedy, J., Eberhart, R., Shi, Y.: Swarm Intelligence. Morgan Kaufmann, San Francisco
(2001)
45. Bekey, G., Masri, S.: Random search techniques for optimization of nonlinear systems with
many parameters. Math. Comput. Simul. 25, 210–213 (1983)
46. Pronzato, L., Walter, E., Venot, A., Lebruchec, J.F.: A general purpose global optimizer: implementation and applications. Math. Comput. Simul. 26, 412–422 (1984)
47. Jaulin, L., Kieffer, M., Didrit, O., Walter, E.: Applied Interval Analysis. Springer, London
(2001)
48. Neumaier, A.: Interval Methods for Systems of Equations. Cambridge University Press,
Cambridge (1990)
49. Rump, S.: INTLAB - INTerval LABoratory. In: Csendes, T. (ed.) Developments in Reliable
Computing, pp. 77–104. Kluwer Academic Publishers, Dordrecht (1999)
50. Rump, S.: Verification methods: rigorous results using floating-point arithmetic. Acta Numerica, 287–449 (2010)
51. Hansen, E.: Global Optimization Using Interval Analysis. Marcel Dekker, New York (1992)
52. Kearfott, R.: GlobSol user guide. Optim. Methods Softw. 24(4–5), 687–708 (2009)
53. Ratschek, H., Rokne, J.: New Computer Methods for Global Optimization. Ellis Horwood,
Chichester (1988)
54. Jones, D., Schonlau, M., Welch, W.: Efficient global optimization of expensive black-box
functions. J. Global Optim. 13(4), 455–492 (1998)
55. Mockus, J.: Bayesian Approach to Global Optimization. Kluwer, Dordrecht (1989)
56. Jones, D.: A taxonomy of global optimization methods based on response surfaces. J. Global
Optim. 21, 345–383 (2001)
57. Marzat, J., Walter, E., Piet-Lahanier, H.: Worst-case global optimization of black-box functions
through Kriging and relaxation. J. Global Optim. 55(4), 707–727 (2013)
58. Collette, Y., Siarry, P.: Multiobjective Optimization. Springer, Berlin (2003)
Chapter 10
Optimizing Under Constraints
10.1 Introduction
Many optimization problems become meaningless unless constraints are taken into
account. This chapter presents techniques that can be used for this purpose. More
information can be found in monographs such as [13]. The interior-point revolution
provides a unifying point of view, nicely documented in [4].
(10.2)
In both cases, the neighborhood of the location with minimum altitude may not be
horizontal, i.e., the gradient of J(·) need not be zero at a local or global minimizer.
The optimality conditions and resulting optimization methods thus differ from those
of the unconstrained case.
10.1.2 Motivations
A first motivation for introducing constraints on the decision vector x is forbidding
unrealistic values of decision variables. If, for instance, the ith parameter of a model
to be estimated from experimental data is the mass of a human being, one may take
É. Walter, Numerical Methods and Optimization,
DOI: 10.1007/978-3-319-07671-3_10,
Springer International Publishing Switzerland 2014
0 ≤ xi ≤ 300 kg.   (10.3)
Here, the minimizer of the cost function should not be on the boundary of the feasible
domain, so neither of these two inequality constraints should be active, except maybe
temporarily during search. They thus play no fundamental role, and are mainly used
to check a posteriori that the estimates found for the parameters are not absurd. If the
x̂i obtained by unconstrained minimization turns out not to belong to [0, 300] kg,
then forcing it to belong to this interval may result in x̂i = 0 kg or x̂i = 300 kg,
neither of which might be considered satisfactory.
A second motivation is the necessity of taking into account specifications, which
usually consist of constraints, for instance in the computer-aided design of industrial
products or in process control. Some inequality constraints are often saturated at the
optimum and would be violated unless explicitly taken into account. The constraints
may be on quantities that depend on x, so checking that a given x belongs to X may
require the simulation of a numerical model.
A third motivation is dealing with conflicting objectives, by optimizing one of
them under constraints on the others. One may, for instance, minimize the cost of
a space launcher under constraints on its payload, or maximize its payload under
constraints on its cost.
Remark 10.1 In the context of design, constraints are so crucial that the role of the
cost function may even become secondary, as a mere way to choose a point solution
x̂ in X as defined by the constraints. One may, for instance, maximize the Euclidean
distance between x in X and the closest point of the boundary ∂X of X. This ensures
some robustness to fluctuations, in mass production, of the characteristics of components
of the system being designed.
Remark 10.2 Even if an unconstrained minimizer is strictly inside X, it may not be
optimal for the constrained problem, as shown by Fig. 10.1.
Fig. 10.2 X, the part of the first quadrant in white, is not compact and there is no minimizer
(10.4)
(10.5)
This decreases the dimension of search space and eliminates the need to take the
constraint (10.4) into consideration. It may, however, have negative consequences
on the structure of some of the equations to be solved by making them less sparse.
A change of variable may make it possible to eliminate inequality constraints. To
enforce the constraint xi > 0, for instance, it suffices to replace xi by exp qi , and the
constraints a < xi < b can be enforced by taking
xi = (a + b)/2 + ((b − a)/2) tanh qi.   (10.6)
When such transformations are either impossible or undesirable, the algorithms and
theoretical optimality conditions must take the constraints into account.
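As a hedged sketch (the toy cost and tolerances are mine; SciPy is assumed available), the change of variables (10.6) lets a generic unconstrained optimizer respect a < x < b:

```python
import numpy as np
from scipy.optimize import minimize

# Enforce a < x < b by optimizing over an unconstrained q and setting
# x = (a+b)/2 + ((b-a)/2)*tanh(q), as in (10.6).
a, b = 1.0, 3.0
x_of_q = lambda q: (a + b) / 2.0 + (b - a) / 2.0 * np.tanh(q)

J = lambda x: (x - 2.2) ** 2   # toy cost; its minimizer 2.2 lies inside (a, b)
res = minimize(lambda q: J(x_of_q(q[0])), x0=[0.0])
x_hat = x_of_q(res.x[0])       # back-transform to the original variable
```

One caveat of this transformation: since tanh never reaches ±1, a constrained minimizer located exactly on the boundary can only be approached asymptotically, with the gradient in q vanishing as q grows.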
Remark 10.3 When there is a mixture of linear and nonlinear constraints, it is often a
good idea to treat the linear constraints separately, to take advantage of linear algebra;
see Chap. 5 of [5].
X = {x : ce(x) = 0},   (10.7)
Ax = b,
(10.8)
ce(x) = Ax − b.   (10.9)
(10.10)
(10.11)
cie(x̂ + δx) ≈ cie(x̂) + [∂cie/∂x(x̂)]ᵀ δx,  i = 1, …, ne.   (10.12)

Since cie(x̂ + δx) = cie(x̂) = 0, this implies that

[∂cie/∂x(x̂)]ᵀ δx = 0,  i = 1, …, ne.   (10.13)
∂cie/∂x(x̂),  i = 1, …, ne,   (10.14)
∂J/∂x(x̂) + Σ_{i=1}^{ne} λi ∂cie/∂x(x̂) = 0.   (10.15)
L(x, λ) = J(x) + Σ_{i=1}^{ne} λi cie(x),   (10.16)
(10.17)
Proposition 10.1 If x̂ and λ̂ are such that

L(x̂, λ̂) = min over x ∈ Rⁿ of max over λ ∈ Rⁿᵉ of L(x, λ),   (10.18)

then
1. the constraints are satisfied:

ce(x̂) = 0,   (10.19)

2. x̂ is a global minimizer of the cost function J(·) over X as defined by the constraints,
3. any global minimizer of J(·) over X is such that (10.18) is satisfied.
Proof 1. Equation (10.18) is equivalent to

L(x̂, λ) ≤ L(x̂, λ̂) ≤ L(x, λ̂).   (10.20)
L(x̂, λ) = J(x̂),   (10.21)

L(x̂, λ̂) = L(x̂, λ).   (10.22)
(10.23)
The inequalities (10.20) are thus satisfied, which implies that (10.18) is also
satisfied.
These results have been established without assuming that the Lagrangian is differentiable. When it is, the first-order necessary optimality conditions translate into
∂L/∂x(x̂, λ̂) = 0,   (10.24)

∂L/∂λ(x̂, λ̂) = 0,   (10.25)

which is equivalent to ce(x̂) = 0.
The Lagrangian thus makes it possible to eliminate the constraints from the
problem formally. Stationarity of the Lagrangian with respect to λ guarantees that these constraints are satisfied.
One may similarly define second-order optimality conditions. A necessary condition for the optimality of x̂ is that the Hessian of the cost be non-negative definite on
the tangent space to the constraints at x̂. A sufficient condition for (local) optimality is obtained when non-negative definiteness is replaced by positive definiteness,
provided that the first-order optimality conditions are also satisfied.
Example 10.1 Shape optimization.
One wants to minimize the surface of metal foil needed to build a cylindrical can
with a given volume V0 . The design variables are the height h of the can and the
radius r of its base, so x = (h, r )T . The surface to be minimized is
J(x) = 2πr² + 2πrh,   (10.26)
(10.27)
(10.28)
(10.29)
(10.30)
(10.31)
(10.32)
(10.33)
(10.34)
The height of the can should thus be equal to its diameter. Substitute (10.34) into (10.32)
to get
2πr³ = V0,   (10.35)

so

r = (V0/(2π))^(1/3)  and  h = 2 (V0/(2π))^(1/3).   (10.36)
cij(x) ≤ 0,  j = 1, …, ni,   (10.38)
where the number n i = dim ci (x) of inequality constraints may be larger than dim x.
It is important to note that the inequality constraints should be written in the standard
form prescribed by (10.38) for the results to be derived to hold true.
Inequality constraints can be transformed into equality constraints by writing
cij(x) + yj² = 0,  j = 1, …, ni,   (10.39)
where yj is a slack variable, which takes the value zero when the jth scalar inequality
constraint is active (i.e., acts as an equality constraint). When cij(x) = 0, one also
says that the jth inequality constraint is saturated or binding. (When cij(x) > 0,
the jth inequality constraint is said to be violated.)
The Lagrangian associated with the equality constraints (10.39) is
L(x, μ, y) = J(x) + Σ_{j=1}^{ni} μj [cij(x) + yj²].   (10.40)
When dealing with inequality constraints such as (10.38), the Lagrange multipliers
μj obtained in this manner are often called Kuhn and Tucker coefficients. If the
constraints and cost function are differentiable, then the first-order conditions for the
stationarity of the Lagrangian are
∂L/∂x(x̂, μ̂, ŷ) = ∂J/∂x(x̂) + Σ_{j=1}^{ni} μ̂j ∂cij/∂x(x̂) = 0,   (10.41)
∂L/∂μj(x̂, μ̂, ŷ) = cij(x̂) + ŷj² = 0,  j = 1, …, ni,   (10.42)

∂L/∂yj(x̂, μ̂, ŷ) = 2 μ̂j ŷj = 0,  j = 1, …, ni.   (10.43)
μ̂j cij(x̂) = 0,  j = 1, …, ni.   (10.45)
Remark 10.4 Compare with equality constraints, for which there is no constraint on
the sign of the Lagrange multipliers.
(10.46)
(10.47)
with
A(μ̂) x̂ = 0,   (10.48)

where

A(μ̂) = [ 2(1 − μ̂)    −μ̂ ]
        [ −μ̂    2(1 − μ̂) ].   (10.49)
Since the corresponding Kuhn and Tucker coefficient is strictly positive, the inequality constraint is saturated and
can be treated as an equality constraint
x1² + x2² + x1x2 = 1.   (10.50)
The solutions are x̂3 = (1/√3, 1/√3)ᵀ and x̂4 = −(1/√3, 1/√3)ᵀ, with J(x̂3) = J(x̂4) = 2/3. There
are thus two global minimizers, x̂3 and x̂4.
Example 10.4 Projection onto a slab
We want to project some numerically known vector p Rn onto the set
S = {v ∈ Rⁿ : −b ≤ y − fᵀv ≤ b},   (10.51)
(10.52)
(10.53)
(H+ and H− are both orthogonal to f, so they are parallel.) This operation is at the
core of the approach for sparse estimation described in Sect. 16.27, see also [6].
The result x̂ of the projection onto S can be computed as

x̂ = arg min over x ∈ S of ‖x − p‖₂².   (10.54)
(10.54)
(10.55)
(10.56)
(10.57)
∂L/∂x(x̂, μ̂1) = 0 = 2(x̂ − p) − μ̂1 f,   (10.58)

∂L/∂μ1(x̂, μ̂1) = 0 = y − fᵀx̂ − b.   (10.59)
μ̂1 = 2 (y − fᵀp − b) / (fᵀf),   (10.60)

x̂ = p + (f / (fᵀf)) (y − fᵀp − b),   (10.61)
and μ̂1 is positive, as it should be.
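A sketch of this projection as code (the function name and test data are my own): the case where the upper hyperplane is active applies formula (10.61), and the other active case is handled by symmetry.

```python
import numpy as np

# Project p onto the slab S = {v : -b <= y - f^T v <= b}.
def project_on_slab(p, f, y, b):
    r = y - f @ p                        # residual at p
    if abs(r) <= b:                      # p is already in S
        return p.copy()
    if r > b:                            # upper hyperplane active, as in (10.61)
        return p + f * (r - b) / (f @ f)
    return p + f * (r + b) / (f @ f)     # lower hyperplane active (symmetric case)

f = np.array([1.0, 1.0])
x_hat = project_on_slab(np.array([2.0, 0.0]), f, y=0.0, b=1.0)
```

After projection, the residual y − fᵀx̂ lies exactly on the boundary of [−b, b] whenever p was outside S, and points already inside S are left untouched.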
L(x, λ, μ) = J(x) + Σ_{i=1}^{ne} λi cie(x) + Σ_{j=1}^{ni} μj cij(x),   (10.62)

∂L/∂x(x̂, λ̂, μ̂) = ∂J/∂x(x̂) + Σ_{i=1}^{ne} λ̂i ∂cie/∂x(x̂) + Σ_{j=1}^{ni} μ̂j ∂cij/∂x(x̂) = 0,   (10.63)

ce(x̂) = 0,  ci(x̂) ≤ 0,   (10.64)

μ̂ ≥ 0,  μ̂j cij(x̂) = 0,  j = 1, …, ni.   (10.65)
No more than dim x independent constraints can be active for any given value
of x. (The active constraints are the equality constraints and saturated inequality
constraints.)
constraints, the KKT conditions boil down to a set of nonlinear equations, which may
be solved using the (damped) Newton method before checking whether the solution
thus computed belongs to X and whether the sign conditions on the Kuhn and Tucker
coefficients are satisfied. Recall, however, that
• satisfaction of the KKT conditions does not guarantee that a minimizer has been
reached;
• even if a minimizer has been found, the search has only been local, so multistart may
remain in order.
One may, for instance, use the l2 penalty function

p1(x) = Σ_{i=1}^{ne} [cie(x)]²   (10.67)

or an l1 penalty function

p2(x) = Σ_{i=1}^{ne} |cie(x)|.   (10.68)
For inequality constraints, one may use instead

p3(x) = Σ_{j=1}^{ni} [max{0, cij(x)}]²   (10.69)

and

p4(x) = Σ_{j=1}^{ni} max{0, cij(x)}.   (10.70)
A typical strategy is to perform a series of unconstrained minimizations

x̂k = arg min over x of [J(x) + λk p(x)],   (10.71)
with increasing positive values of λk in order to approach X from the outside. The
final estimate of the constrained minimizer obtained during the last minimization
serves as an initial point (or warm start) for the next.
Remark 10.7 The external iteration counter k in (10.71) should not be confused
with the internal iteration counter of the iterative algorithm carrying out each of the
minimizations.
Under reasonable technical conditions [7, 8], there exists a finite λ̄ such that p2(·)
and p4(·) yield a solution x̂k ∈ X as soon as λk > λ̄. One then speaks of exact
penalization [1]. With p1(·) and p3(·), λk must tend to infinity to get the same result,
which raises obvious numerical problems. The price to be paid for exact penalization
is that p2(·) and p4(·) are not differentiable, which complicates the minimization of
Jλk(x).
Example 10.5 Consider the minimization of J(x) = x² under the constraint x ≥ 1.
Using the penalty function p3(·), one is led to solving the unconstrained minimization
problem

x̂λ = arg min over x of (x² + λ [max{0, (1 − x)}]²),   (10.72)
Fig. 10.3 The penalty function p4(·) is used to implement an l1-penalized quadratic cost for the
constraint x ≥ 1; circles are for λ = 1 and crosses for λ = 3
x̂λ = λ / (1 + λ) < 1.   (10.74)
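This behavior is easy to reproduce numerically (a sketch, assuming SciPy; the values of λ are arbitrary):

```python
from scipy.optimize import minimize_scalar

# Example 10.5 redone numerically: the quadratic penalty p3 turns the
# constrained problem min x**2 s.t. x >= 1 into an unconstrained one.
def x_hat(lam):
    res = minimize_scalar(lambda x: x ** 2 + lam * max(0.0, 1.0 - x) ** 2)
    return res.x

# (10.74): x_hat(lam) = lam/(1 + lam) < 1, so the constraint is only
# satisfied in the limit lam -> infinity
```

For any finite λ the penalized minimizer stays strictly outside the feasible set, which is exactly the point made about non-exact penalization above.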
p5(x) = − Σ_{j=1}^{ni} ln[−cij(x)]   (10.75)

or

p6(x) = − Σ_{j=1}^{ni} 1/cij(x).   (10.76)
Since cij (x) < 0 in the interior of X, these barrier functions are well defined.
A typical strategy is to perform a series of unconstrained minimizations (10.71),
with decreasing positive values of λk in order to approach ∂X from the inside. The
estimate of the constrained minimizer obtained during the last minimization again
serves as an initial point for the next. This approach provides suboptimal but feasible
solutions.
Remark 10.8 Knowledge-based models often have a limited validity domain. As a
result, the evaluation of cost functions based on such models may not make sense
unless some inequality constraints are satisfied. Barrier functions are then much more
useful for dealing with these constraints than penalty functions.
(10.77)
(10.78)
Σ_{i=1}^{ne} [cie(x)]² + Σ_{j=1}^{ni} [max{0, cij(x)}]².   (10.79)
Several strategies are available for tuning x, λ, and μ, for a given α > 0. One of them
[9] alternates
1. minimizing the augmented Lagrangian with respect to x for fixed and , by
some unconstrained optimization method,
2. performing one iteration of a gradient algorithm with step-size α for maximizing
the augmented Lagrangian with respect to λ and μ for fixed x,
λk+1 = λk + α ce(xk),

μk+1 = μk + α ci(xk).   (10.80)
its warehouse, given that this warehouse is just large enough to accommodate one
metric ton of P2 (if no space is taken by P1 ) and that it is impossible to produce a
larger mass of P1 than of P2 ?
This question translates into the linear program
Maximize U(x) = 2x1 + x2   (10.81)

under the constraints

x1 ≥ 0,   (10.82)

x2 ≥ 0,   (10.83)

3x1 + x2 ≤ 1,   (10.84)

x1 − x2 ≤ 0.   (10.85)

Since the gradient of the utility function

∂U/∂x = (2, 1)ᵀ   (10.86)

is never zero, there is no stationary point, and any maximizer of U(·) must belong to
∂X. Now, the straight line
2x1 + x2 = a   (10.87)
corresponds to all the xs associated with the same value a of the utility function U (x).
The constrained maximizer of the utility function is thus the vertex of X located on
the straight line (10.87) associated with the largest value of a, i.e.,
x = [0 1]T .
(10.88)
The company should thus produce P2 only. The resulting utility is U(x̂) = 1.
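Since a linear cost attains its optimum at a vertex of X, the graphical result can be checked by enumerating the vertices of the feasible domain of (10.81) and evaluating the utility at each of them (a sketch; the vertex list is worked out by hand from the active constraints):

```python
# Vertices of X: intersections of pairs of active constraints among
# x1 = 0, x2 = 0, 3*x1 + x2 = 1 and x1 = x2.
U = lambda x1, x2: 2.0 * x1 + x2
vertices = [(0.0, 0.0),     # x1 = 0 and x2 = 0
            (0.0, 1.0),     # x1 = 0 and 3*x1 + x2 = 1
            (0.25, 0.25)]   # x1 = x2 and 3*x1 + x2 = 1
best = max(vertices, key=lambda v: U(*v))   # should be (0, 1), with U = 1
```

This brute-force vertex enumeration is of course only viable for tiny problems; the simplex method described below visits vertices selectively instead.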
x̂LS = arg min over x of ‖e(x)‖₂²,   (10.89)
where the error e(x) is the N -dimensional vector of the residuals between the data
and model outputs
e(x) = y − Fx.   (10.90)
When some of the data points yi are widely off the mark, for instance as a result of
sensor failure, these data points (called outliers) may affect the numerical value of
the estimate x̂LS so much that it becomes useless. Robust estimators are designed to
be less sensitive to outliers. One of them is the least-modulus (or l1) estimator
x̂LM = arg min over x of ‖e(x)‖₁.   (10.91)
Because the components of the error vector are not squared as in the l2 estimator,
the impact of a few outliers is much less drastic. The least-modulus estimator can be
computed [15, 16] as
x̂LM = arg min over x of Σ_{i=1}^{N} (ui + vi)   (10.92)

under the constraints

yi − fiᵀx = ui − vi,  ui ≥ 0,  vi ≥ 0,   (10.93)
for i = 1, …, N, with fiᵀ the ith row of F. Computing x̂LM has thus been translated
into a linear program, where the (n + 2N) decision variables are the n entries of x
and the ui and vi (i = 1, …, N). One could alternatively compute
x̂LM = arg min over x of 1ᵀs,   (10.94)

where 1 is a column vector with all its entries equal to one, under the constraints

y − Fx ≤ s,   (10.95)

−(y − Fx) ≤ s.   (10.96)
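A sketch of this last formulation with SciPy's linprog (an assumption on my part; the data are synthetic, with a single outlier): stack z = (xᵀ, sᵀ)ᵀ and minimize 1ᵀs under the two blocks of inequality constraints.

```python
import numpy as np
from scipy.optimize import linprog

# Synthetic data on the line y = 1 + 2*t, with one outlier.
F = np.column_stack([np.ones(8), np.arange(8.0)])
y = F @ np.array([1.0, 2.0])
y[3] += 20.0
n, N = 2, 8
c = np.concatenate([np.zeros(n), np.ones(N)])           # cost 1^T s
A_ub = np.block([[-F, -np.eye(N)],                      # y - F x <= s
                 [F, -np.eye(N)]])                      # -(y - F x) <= s
b_ub = np.concatenate([-y, y])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0.0, None)] * N)
x_lm = res.x[:n]   # least-modulus estimate of (1, 2)
```

With seven of the eight points exactly on the line, the l1 fit recovers the true parameters despite the outlier, illustrating the robustness claimed above.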
(10.97)
(10.98)
(10.99)
−(y − Fx) ≤ 1d,   (10.100)
Dantzig's simplex method, not to be confused with the Nelder and Mead simplex
of Sect. 9.3.5, explores X by moving along edges of X from one vertex to the next
while improving the value of the objective function. It is considered first. Interior-point methods, which are sometimes more efficient, will be presented in Sect. 10.6.3.
(10.101)
Any inequality constraint can be transformed into an equality constraint by introducing an additional nonnegative decision variable. For instance,

3x1 + x2 ≤ 1   (10.102)

translates into

3x1 + x2 + x3 = 1,   (10.103)

x3 ≥ 0,   (10.104)

where x3 is a slack variable, while

3x1 + x2 ≥ 1   (10.105)

translates into

3x1 + x2 − x3 = 1,   (10.106)

x3 ≥ 0,   (10.107)

where x3 is now a surplus variable.
The standard problem can thus be written, possibly after introducing additional
entries in the decision vector x, as that of finding
x = arg min J (x),
x
(10.108)
where the cost function in (10.108) is a linear combination of the decision variables:
J (x) = cT x,
(10.109)
Ax = b,
(10.110)
x ≥ 0.   (10.111)
Σ_{k=1}^{n} aj,k xk = bj,  j = 1, …, m.   (10.112)
The matrix A has thus m rows (as many as there are constraints) and n columns (as
many as there are variables).
Let us stress, once more, that the gradient of the cost is never zero, as
∂J/∂x = c.   (10.113)
(10.113)
Minimizing a linear cost in the absence of any constraint would thus not make sense,
as one could make J(x) tend to −∞ by making ‖x‖ tend to infinity in the direction
−c. The situation is thus quite different from that with quadratic cost functions.
Σ_{i=1}^{n} ai xi = b.   (10.114)
Index the columns of A so that the nonzero entries of x are indexed from 1 to r . Then
Σ_{i=1}^{r} ai xi = b.   (10.115)
Let us prove that the first r vectors ai are linearly independent. The proof is by
contradiction. If they were linearly dependent, then one could find a nonzero vector
δ ∈ Rⁿ such that δi = 0 for any i > r and

Σ_{i=1}^{r} ai (xi + λδi) = b  ⟺  A(x + λδ) = b.   (10.116)

For λ > 0 small enough, the points

x¹ = x + λδ  and  x² = x − λδ   (10.117)

would then both belong to X, and

x = (x¹ + x²)/2   (10.118)

could not be a vertex, as it would be strictly inside an edge. The first r vectors ai
are thus linearly independent. Now, since ai ∈ Rᵐ, there are at most m linearly
independent ai's, so r ≤ m and x ∈ Rⁿ has at least (n − m) zero entries.
A basic feasible solution is any xb ∈ X with at least (n − m) zero entries. We
assume in the description of the simplex method that one such xb has already been
found.
Remark 10.11 When no basic feasible solution is available, one may be generated (at
the cost of increasing the dimension of search space) by the following procedure [17]:
1. add a different artificial variable to the left-hand side of each constraint that
contains no slack variable (even if it contains a surplus variable),
2. solve the resulting set of constraints for the m artificial and slack variables, with
all the initial and surplus variables set to zero. This is trivial: the artificial or slack
variable introduced in the jth constraint of (10.110) just takes the value bj. As
there are now at most m nonzero variables, a basic feasible solution has thus been
obtained, but for a modified problem.
By introducing artificial variables, we have indeed changed the problem being treated,
unless all of these variables take the value zero. This is why the cost function is
modified by adding each of the artificial variables multiplied by a large positive
coefficient to the former cost function. Unless X is empty, all the artificial variables
should then eventually be driven to zero by the simplex algorithm, and the solution
finally provided should correspond to the initial problem. This procedure may also
be used to detect that X is empty. Assume, for instance, that J1 (x1 , x2 ) must be
minimized under the constraints
x1 − 2x2 = 0,   (10.119)

3x1 + 4x2 ≥ 5,   (10.120)

6x1 + 7x2 ≤ 8.   (10.121)
On such a simple problem, it is trivial to show that there is no solution for x1 and x2 ,
but suppose we failed to notice that. To put the problem in standard form, introduce
the surplus variable x3 in (10.120) and the slack variable x4 in (10.121), to get
x1 − 2x2 = 0,   (10.122)

3x1 + 4x2 − x3 = 5,   (10.123)

6x1 + 7x2 + x4 = 8.   (10.124)

Introduce now the artificial variables x5 in (10.122) and x6 in (10.123), to get

x1 − 2x2 + x5 = 0,   (10.125)

3x1 + 4x2 − x3 + x6 = 5,   (10.126)

6x1 + 7x2 + x4 = 8.   (10.127)
Solve (10.125)–(10.127) for the artificial and slack variables, with all the other
variables set to zero, to get
x5 = 0,   (10.128)

x6 = 5,   (10.129)

x4 = 8.   (10.130)
For the modified problem, x = (0, 0, 0, 8, 0, 5)ᵀ is a basic feasible solution, as four
out of its six entries take the value zero and n − m = 3. Replacing the initial cost
J1(x1, x2) by

J2(x) = J1(x1, x2) + M x5 + M x6   (10.131)
(with M some large positive coefficient) will not, however, coax the simplex
algorithm into getting rid of the artificial variables, as we know this is mission
impossible.
Provided that X is not empty, one of the basic feasible solutions is a global minimizer of the cost function, and the algorithm moves from one basic feasible solution
to the next while decreasing cost.
Among the zero entries of xb, (n − m) entries are selected and called off-base.
The remaining m entries are called basic variables. The basic variables thus include
all the nonzero entries of xb .
Equation (10.110) then makes it possible to express the basic variables and the cost
J (x) as functions of the off-base variables. This description will be used to decide
which off-base variable should become basic and which basic variable should leave
base to make room for this to happen. To simplify the presentation of the method,
we use Example 10.6.
Consider again the problem defined by (10.81)(10.85), put in standard form. The
cost function is
J(x) = −2x1 − x2,   (10.132)
with x1 0 and x2 0, and the inequality constraints (10.84) and (10.85) are
transformed into equality constraints by introducing the slack variables x3 and x4 ,
so
3x1 + x2 + x3 = 1,   (10.133)

x3 ≥ 0,   (10.134)

x1 − x2 + x4 = 0,   (10.135)

x4 ≥ 0.   (10.136)
[ 3   1   1   0 ]
[ 1  −1   0   1 ] (x1 x2 x3 x4)ᵀ = (1 0)ᵀ.   (10.137)
x1 = 1/4 − (1/4) x3 − (1/4) x4   (10.138)

and

x2 = 1/4 − (1/4) x3 + (3/4) x4.   (10.139)
Table 10.1 Initial situation in Example 10.6

       Constant coefficient   Coefficient of x3   Coefficient of x4
J      −3/4                   3/4                 −1/4
x1     1/4                    −1/4                −1/4
x2     1/4                    −1/4                3/4
It is trivial to check that the vector x obtained by setting x3 and x4 to zero and
choosing x1 and x2 so as to satisfy (10.138) and (10.139), i.e.,
x = (1/4  1/4  0  0)ᵀ,   (10.140)
satisfies all the constraints while having an appropriate number of zero entries, and
is thus a basic feasible solution.
The cost can also be expressed as a function of the off-base variables, as (10.132),
(10.138) and (10.139) imply that
J(x) = −3/4 + (3/4) x3 − (1/4) x4.   (10.141)
Table 10.2 Situation after the first exchange in Example 10.6

       Constant coefficient   Coefficient of x1   Coefficient of x3
J      −1                     1                   1
x2     1                      −3                  −1
x4     1                      −4                  −1
the one leaving base. In our example, there is only one negative coefficient in (10.141),
equal to −1/4 and associated with x4, so x4 enters the base. The variable x1 becomes
equal to zero and leaves the base when the new basic variable x4 reaches 1.
The third step updates the table. In our example, the basic variables are now x2
and x4 and the off-base variables x1 and x3 . It is thus necessary to express x2 , x4 and
J as functions of x1 and x3 . From (10.133) and (10.135), we get
x2 = 1 − 3x1 − x3,   (10.142)

−x2 + x4 = −x1,   (10.143)

or equivalently

x2 = 1 − 3x1 − x3,   (10.144)

x4 = 1 − 4x1 − x3.   (10.145)
The cost becomes

J(x) = −1 + x1 + x3.   (10.146)

No coefficient of an off-base variable in (10.146) is negative, so the corresponding
basic feasible solution is

x̂ = (0, 1, 0, 1)ᵀ,   (10.147)
which is thus (globally) optimal and associated with the lowest possible cost
J(x̂) = −1.   (10.148)
This corresponds to an optimal utility equal to 1, consistent with the results obtained
graphically.
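The same result can be cross-checked with an off-the-shelf LP solver applied to the standard form of Example 10.6 (a sketch, assuming SciPy's linprog; the constraint matrix transcribes the two equality constraints with slack variables x3 and x4):

```python
import numpy as np
from scipy.optimize import linprog

# Example 10.6 in standard form: minimize J = -2*x1 - x2 subject to
# 3*x1 + x2 + x3 = 1, x1 - x2 + x4 = 0, and x >= 0.
res = linprog(c=[-2.0, -1.0, 0.0, 0.0],
              A_eq=[[3.0, 1.0, 1.0, 0.0],
                    [1.0, -1.0, 0.0, 1.0]],
              b_eq=[1.0, 0.0],
              bounds=[(0.0, None)] * 4)
# expected optimum: x = (0, 1, 0, 1), J = -1
```

The solver should return the basic feasible solution found above, with the lowest possible cost −1.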
The only drawback of this algorithm was that its worst-case complexity could not
be bounded by a polynomial in the dimension of the problem (linear programming
was thus believed to be an NP-hard problem). Despite that, the method cheerfully
handled large-scale problems.
A paper published by Leonid Khachiyan in 1979 [18] made the headlines (including on the front page of The New York Times) by showing that polynomial complexity
could be brought to linear programming by specializing a previously known ellipsoidal method for nonlinear programming. This was a first breach in the dogma
that linear and nonlinear programming were entirely different matters. The resulting
algorithm, however, turned out not to be efficient enough in practice to challenge
the supremacy of Dantzig's simplex. This was what Margaret Wright called a puzzling and deeply unsatisfying anomaly in which an exponential-time algorithm was
consistently and substantially faster than a polynomial-time algorithm [4].
In 1984, Narendra Karmarkar presented another polynomial-time algorithm for
linear programming [19], with much better performance than Dantzigs simplex on
some test cases. This was so sensational a result that it also found its way to the general
press. Karmarkar's interior-point method escapes the combinatorial complexity of
exploring the edges of X by moving towards a minimizer of the cost along a path
that stays inside X and never reaches its boundary ∂X, although it is known that any
minimizer belongs to ∂X.
After some controversy, due in part to the lack of details in [19], it is now acknowledged that interior-point methods are much more efficient on some problems than
the simplex method. The simplex method nevertheless remains more efficient on
other problems and is still very much in use. Karmarkar's algorithm has been shown
in [20] to be formally equivalent to a logarithmic barrier method applied to linear
programming, which confirms that there is something to be gained by considering
linear programming as a special case of nonlinear programming.
Interior-point methods readily extend to convex optimization, of which linear
programming is a special case (see Sect. 10.7.6). As a result, the traditional divide
between linear and nonlinear programming tends to be replaced by a divide between
convex and nonconvex optimization.
Interior-point methods have also been used to develop general purpose solvers for
large-scale nonconvex constrained nonlinear optimization [21].
Fig. 10.5 The set on the left is convex; the one on the right is not, as the line segment joining the
two dots is not included in the set
(10.149)
(10.150)
(10.151)
Fig. 10.6 The function J1 on the left is convex; the function J2 on the right is not
(10.152)
J(x) = Σ_i wi Ji(x)   (10.153)
is convex if each of the functions Ji (x) is convex and each weight wi is positive.
Example 10.12 The function
J(x) = max over i of Ji(x)   (10.154)
J(x) ≥ J(x1) + gᵀ(x1)(x − x1)  for all x,   (10.155)
where g() is the gradient function of J (). This provides a global lower bound for
the function from the knowledge of the value of its gradient at any given point x1 .
(10.156)
(10.157)
L(x, μ) = J(x) + μᵀ ci(x),   (10.158)
where the vector μ of Lagrange (or Kuhn and Tucker) multipliers is also called the
dual vector. The dual function D(·) is the infimum of the Lagrangian over x

D(μ) = inf over x of L(x, μ).   (10.159)
Since J(x) and all the constraints cij(x) are assumed to be convex, L(x, μ) is a convex
function of x as long as μ ≥ 0, which must be true for inequality constraints anyway.
So the evaluation of D(μ) is an unconstrained convex minimization problem, which
can be solved with a local method such as Newton or quasi-Newton. If the infimum
of L(x, μ) with respect to x is reached at x̂μ, then

D(μ) = J(x̂μ) + μᵀ ci(x̂μ).   (10.160)
Moreover, if J(x) and the constraints cij(x) are differentiable, then x̂μ satisfies the
first-order optimality conditions

∂J/∂x(x̂μ) + Σ_{j=1}^{ni} μj ∂cij/∂x(x̂μ) = 0.   (10.161)
If μ is dual feasible, i.e., such that μ ≥ 0 and D(μ) > −∞, then for any feasible
point x

D(μ) = inf over x of L(x, μ) ≤ L(x, μ) = J(x) + μᵀ ci(x) ≤ J(x),   (10.162)

and D(μ) is thus a lower bound of the minimal cost of the constrained problem

D(μ) ≤ J(x̂).   (10.163)
Since this bound is valid for any λ ≥ 0, it can be improved by solving the dual
problem, namely by computing the optimal Lagrange multipliers
λ̂ = arg max_{λ ≥ 0} D(λ),  (10.164)
in order to make the lower bound in (10.163) as large as possible. Even if the initial
problem (also called the primal problem) is not convex, one always has
D(λ̂) ≤ J(x̂),  (10.165)
and the nonnegative difference
J(x̂) − D(λ̂)  (10.166)
is the duality gap. Duality is strong if this gap is equal to zero, which means that the order of the
maximization with respect to λ and minimization with respect to x of the Lagrangian
can be inverted.
A sufficient condition for strong duality (known as Slater's condition) is that the
cost function J(·) and constraint functions cij(·) are convex and that the interior of
X is not empty. It should be satisfied in the present context of convex optimization
(there should exist x such that cij(x) < 0, j = 1, …, ni).
Weak or strong, duality can be used to define stopping criteria. If xk and λk are
feasible points for the primal and dual problems obtained at iteration k, then
J(x̂) ∈ [D(λk), J(xk)]  (10.167)
and
D(λ̂) ∈ [D(λk), J(xk)],  (10.168)
with the duality gap given by the width of the interval [D(λk), J(xk)]. One may stop
as soon as the duality gap is deemed acceptable (in absolute or relative terms).
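As a toy numerical check of weak and strong duality (an illustrative example, not one from the text), consider minimizing J(x) = x² under c(x) = 1 − x ≤ 0, for which x̂ = 1 and J(x̂) = 1. A Python sketch:

```python
# Toy problem (illustrative): minimize J(x) = x^2 subject to 1 - x <= 0.
# The minimizer is xhat = 1 with J(xhat) = 1.
# Lagrangian: L(x, lam) = x^2 + lam*(1 - x); its infimum over x is reached
# at x = lam/2, so the dual function is D(lam) = lam - lam^2/4.

def dual(lam):
    x = lam/2.0                    # unconstrained minimizer of L(., lam)
    return x**2 + lam*(1.0 - x)    # = lam - lam^2/4

J_hat = 1.0                        # known primal optimum

# Weak duality: D(lam) <= J(xhat) for every dual-feasible lam >= 0
gaps = [J_hat - dual(lam) for lam in [0.0, 0.5, 1.0, 2.0, 3.0]]
assert all(g >= -1e-12 for g in gaps)

print(dual(2.0))  # 1.0: the gap closes at the optimal multiplier, strong duality
```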
A strictly feasible initial point x0 may be obtained (or shown not to exist) by solving
the auxiliary problem
(x0, ŵ) = arg min_{(x, w)} w  (10.169)
under the constraints
cij(x) ≤ w,  j = 1, …, ni.  (10.170)
If ŵ < 0, then x0 is strictly inside X. If ŵ = 0, then x0 belongs to the boundary ∂X and cannot be
used for an interior-point method. If ŵ > 0, then the initial problem has no solution.
To remain strictly inside X, one may use a barrier function, usually the logarithmic
barrier defined by (10.75), or more precisely by
plog(x) = −Σ_{j=1}^{ni} ln[−cij(x)] if ci(x) < 0, and plog(x) = +∞ otherwise.  (10.171)
This barrier is differentiable and convex inside X; it tends to infinity when x tends to
the boundary ∂X from within. One then solves the unconstrained convex minimization problem
x̂k = arg min_x [μk J(x) + plog(x)]  (10.172)
for an increasing sequence of values of the positive scalar μk; x̂k is called a central point, and the
set of the central points is the central path.
This can be done very efficiently by a Newton-type method, with a warm start at
x̂k−1 of the search for x̂k. The larger μk becomes, the more x̂k approaches the boundary ∂X, as the
relative weight of the cost with respect to the barrier increases. If J(x) and ci(x) are
both differentiable, then x̂k should satisfy the first-order optimality condition
μk ∂J/∂x(x̂k) + ∂plog/∂x(x̂k) = 0,  (10.174)
which is necessary and sufficient as the problem is convex. An important result [2]
is that
every central point x̂k is feasible for the primal problem,
a feasible point for the dual problem is
λ̂kj = −1/(μk cij(x̂k)),  j = 1, …, ni,  (10.175)
and the corresponding duality gap is
J(x̂k) − D(λ̂k) = ni/μk.  (10.176)
Remark 10.12 Since x̂k is strictly inside X, cij(x̂k) < 0 and λ̂kj as given by (10.175)
is strictly positive.
The duality gap thus tends to zero as μk tends to infinity, which ensures (at least
mathematically) that x̂k tends to an optimal solution of the primal problem when k
tends to infinity.
One may take, for instance,
μk = α μk−1,  (10.177)
with α > 1 and μ0 > 0 to be chosen. Two types of problems may arise:
when μ0 and especially α are too small, one will lose time crawling along the
central path;
when they are too large, the search for x̂k may be badly initialized by the warm
start and Newton's method may lose time multiplying iterations.
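The whole scheme (central points obtained by damped Newton iterations, the geometric increase (10.177) of the barrier weight, and the duality-gap bound (10.176) used as a stopping criterion) can be sketched in Python. The LP data below are illustrative, not from the book, and the fixed number of inner Newton iterations is a crude simplification:

```python
# Log-barrier sketch on an illustrative LP: minimize c^T x under A x <= b,
# with optimum at x = (1, 1) and cost -3 (data are hypothetical).
c = [-1.0, -2.0]
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # x1<=1, x2<=1, x>=0
b = [1.0, 1.0, 0.0, 0.0]
ni = len(b)

def slacks(x):
    return [b[j] - A[j][0]*x[0] - A[j][1]*x[1] for j in range(ni)]

def newton_step(x, mu):
    """One damped Newton step on J_mu(x) = mu*c^T x - sum_j ln(b_j - a_j^T x)."""
    s = slacks(x)
    g = [mu*c[i] + sum(A[j][i]/s[j] for j in range(ni)) for i in range(2)]
    H = [[sum(A[j][i]*A[j][k]/s[j]**2 for j in range(ni)) for k in range(2)]
         for i in range(2)]
    det = H[0][0]*H[1][1] - H[0][1]*H[1][0]
    dx = [-(H[1][1]*g[0] - H[0][1]*g[1])/det,
          -(H[0][0]*g[1] - H[1][0]*g[0])/det]
    t = 1.0
    while min(slacks([x[0] + t*dx[0], x[1] + t*dx[1]])) <= 0.0:
        t /= 2.0                        # damp to stay strictly inside X
    return [x[0] + t*dx[0], x[1] + t*dx[1]]

x, mu = [0.5, 0.5], 1.0
while ni/mu > 1e-8:                     # duality-gap bound, as in (10.176)
    for _ in range(50):                 # inner Newton iterations (crude)
        x = newton_step(x, mu)
    mu *= 10.0                          # mu_k = alpha*mu_{k-1}, alpha > 1

print(x)  # close to the optimizer (1, 1)
```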
The linear program that consists in minimizing
J(x) = cT x  (10.178)
under the constraints
Ax ≤ b  (10.179)
is a convex problem, since the cost function and the feasible domain are convex. The
Lagrangian is
L(x, λ) = cT x + λT(Ax − b) = −bT λ + (AT λ + c)T x.
(10.180)
The dual function is thus
D(λ) = −bT λ if AT λ + c = 0, and −∞ otherwise,  (10.181)
and λ is dual feasible if λ ≥ 0 and AT λ + c = 0.
The use of a logarithmic barrier leads to computing the central points
x̂k = arg min_x Jk(x),  (10.182)
where
Jk(x) = μk cT x − Σ_{j=1}^{ni} ln(bj − ajT x),  (10.183)
with ajT the jth row of A. This is unconstrained convex minimization, and thus easy.
A necessary and sufficient condition for x̂k to be a solution of (10.182) is that
gk(x̂k) = 0,  (10.184)
where
gk(x) = ∂Jk/∂x(x) = μk c + Σ_{j=1}^{ni} aj/(bj − ajT x).  (10.185)
To search for x̂k with a (damped) Newton method, one also needs the Hessian of
Jk(·), given by
Hk(x) = ∂²Jk/∂x∂xT(x) = Σ_{j=1}^{ni} aj ajT/(bj − ajT x)².  (10.186)
Equation (10.175) suggests taking as the dual vector associated with x̂k the vector λ̂k
with entries
λ̂kj = −1/(μk cij(x̂k)),  j = 1, …, ni,  (10.187)
i.e.,
λ̂kj = 1/(μk (bj − ajT x̂k)),  j = 1, …, ni.  (10.188)
The corresponding duality gap is then equal to ni/μk.  (10.189)
Let us employ Dantzig's simplex on Example 10.6. The function linprog assumes
that
a linear cost is to be minimized, so we use the cost function (10.109), with
c = (−2, −1)T;  (10.190)
the inequality constraints are not transformed into equality constraints, but written
as
Ax ≤ b,  (10.191)
so we take
A = [3 1; 1 −1]  and  b = (1, 0)T;  (10.192)
optionSIMPLEX = ...
optimset('LargeScale','off','Simplex','on')
[OptimalX, OptimalCost] = ...
linprog(c,A,b,[],[],LowerBound,...
[],[],optionSIMPLEX)
The brackets [] in the list of input arguments of linprog correspond to arguments
not used here, such as upper bounds on the decision variables. See the documentation
of the toolbox for more details. This script yields
Optimization terminated.
OptimalX =
0
1
OptimalCost =
-1
which should come as no surprise.
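For comparison, SciPy offers a similar solver; the sketch below uses illustrative data (not those of Example 10.6):

```python
from scipy.optimize import linprog

# Analogous call with SciPy's linprog (illustrative data):
#   minimize  -x1 - 2*x2   subject to  x1 + x2 <= 4  and  x >= 0.
res = linprog(c=[-1.0, -2.0],
              A_ub=[[1.0, 1.0]], b_ub=[4.0],
              bounds=[(0, None), (0, None)],
              method="highs")

print(res.x, res.fun)  # optimizer (0, 4), optimal cost -8
```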
Consider now Example 10.3, where the cost function J(x) = x1² + x2² must be
minimized under the nonlinear inequality constraint x1² + x2² + x1x2 ≥ 1. We know
that there are two global minimizers
x̂3 = (1/√3, 1/√3)T,  (10.193)
x̂4 = (−1/√3, −1/√3)T,  (10.194)
where 1/√3 ≈ 0.57735026919, and that J(x̂3) = J(x̂4) = 2/3 ≈ 0.66666666667.
The cost function is implemented by the function
function Cost = L2cost(x)
Cost = norm(x)^2;
end
The nonlinear inequality constraint is written as c(x) 0, and implemented by the
function
function [c,ceq] = NLConst(x)
c = 1 - x(1)^2 - x(2)^2 - x(1)*x(2);
ceq = [];
end
Since there is no nonlinear equality constraint, ceq is left empty but must be present.
Finally, patternsearch is called with the script
clear all
x0 = [0;0];
x = zeros(2,1);
[xOpt,CostOpt] = patternsearch(@(x) ...
L2cost(x),x0,[],[],...
[],[],[],[],@(x) NLConst(x))
which yields, after 4000 evaluations of the cost function,
Optimization terminated: mesh size less
than options.TolMesh and constraint violation
is less than options.TolCon.
xOpt =
-5.672302246093750e-01
-5.882263183593750e-01
CostOpt =
6.677603293210268e-01
The accuracy of this solution can be slightly improved (at the cost of a major increase
in computing time) by changing the options of patternsearch, as in the following script
clear all
x0 = [0;0];
x = zeros(2,1);
options = psoptimset('TolX',1e-10,'TolFun',...
1e-10,'TolMesh',1e-12,'TolCon',1e-10,...
'MaxFunEvals',1e5);
[xOpt,CostOpt] = patternsearch(@(x) ...
L2cost(x),x0,[],[],...
[],[],[],[],@(x) NLConst(x),options)
which yields, after 10^5 evaluations of the cost function
Optimization terminated: mesh size less
than options.TolMesh and constraint violation
is less than options.TolCon.
xOpt =
-5.757669508457184e-01
-5.789321511983871e-01
CostOpt =
6.666700173773681e-01
See the documentation of patternsearch for more details.
These less-than-stellar results suggest trying other approaches. With the penalized
cost function
function Cost = L2costPenal(x)
Cost = x(1)^2+x(2)^2+1.e6*...
max(0,1-x(1)^2-x(2)^2-x(1)*x(2));
end
the script
clear all
x0 = [1;1];
optionsFMS = optimset('Display',...
'iter','TolX',1.e-10,'MaxFunEvals',1.e5);
[xHat,Jhat] = fminsearch(@(x) ...
L2costPenal(x),x0,optionsFMS)
based on the pedestrian fminsearch produces
xHat =
5.773502679858542e-01
5.773502703933975e-01
Jhat =
6.666666666666667e-01
in 284 evaluations of the penalized cost function, without even attempting to fine-tune
the multiplicative coefficient of the penalty function.
With its second line replaced by x0 = [-1;-1];, the same script produces
xHat =
-5.773502679858542e-01
-5.773502703933975e-01
Jhat =
6.666666666666667e-01
which suggests that it would have been easy to obtain accurate approximations of
the two solutions with multistart.
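A comparable experiment can be sketched in Python, with SciPy's Nelder-Mead implementation playing the role of fminsearch on the same penalized cost (the tolerance settings mirror those of the MATLAB script):

```python
from scipy.optimize import minimize

# The penalized cost of L2costPenal, rewritten in Python:
def l2_cost_penal(x):
    violation = max(0.0, 1.0 - x[0]**2 - x[1]**2 - x[0]*x[1])
    return x[0]**2 + x[1]**2 + 1e6*violation

# Nelder-Mead plays the role of fminsearch here
res = minimize(l2_cost_penal, x0=[1.0, 1.0], method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-10, "maxfev": 100000})

print(res.x, res.fun)  # close to (1/sqrt(3), 1/sqrt(3)) and to 2/3
```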
SQP as implemented in the function fmincon of the Optimization Toolbox is
used in the script
clear all
x0 = [0;0];
x = zeros(2,1);
options = optimset('Algorithm','sqp');
[xOpt,CostOpt,exitflag,output] = fmincon(@(x) ...
L2cost(x),x0,[],[],...
[],[],[],[],@(x) NLConst(x),options)
which yields
xOpt =
5.773504749133580e-01
5.773500634738818e-01
CostOpt =
6.666666666759753e-01
in 94 function evaluations. Refining tolerances by replacing the options of fmincon
in the previous script by
options = optimset('Algorithm','sqp',...
'TolX',1.e-20,'TolFun',1.e-20,'TolCon',1.e-20);
we get the marginally more accurate results
xOpt =
5.773503628462886e-01
5.773501755329579e-01
CostOpt =
6.666666666666783e-01
in 200 function evaluations.
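A comparable run can be sketched in Python with SciPy's SLSQP solver, an SQP implementation; note that SciPy expects inequality constraints in the form g(x) ≥ 0, the opposite of the c(x) ≤ 0 convention used above:

```python
from scipy.optimize import minimize

# SLSQP (an SQP implementation) with the constraint written as g(x) >= 0
cons = {"type": "ineq",
        "fun": lambda x: x[0]**2 + x[1]**2 + x[0]*x[1] - 1.0}

res = minimize(lambda x: x[0]**2 + x[1]**2, x0=[1.0, 1.0],
               method="SLSQP", constraints=[cons],
               options={"ftol": 1e-12, "maxiter": 200})

print(res.x, res.fun)  # close to (1/sqrt(3), 1/sqrt(3)) and to 2/3
```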
10.10 In Summary
Constraints play a major role in most engineering applications of optimization.
Even if unconstrained minimization yields a feasible minimizer, this does not mean
that the constraints can be neglected.
The feasible domain X for the decision variables should be nonempty, and preferably closed and bounded.
The value of the gradient of the cost at a constrained minimizer usually differs
from zero, and specific theoretical optimality conditions have to be considered
(the KKT conditions).
Looking for a formal solution of the KKT equations is only possible in simple
problems, but the KKT conditions play a key role in sequential quadratic programming.
Introducing penalty or barrier functions is the simplest approach (at least conceptually) for constrained optimization, as it makes it possible to use methods designed
for unconstrained optimization. Numerical difficulties should not be underestimated, however.
The augmented-Lagrangian approach facilitates the practical use of penalty functions.
It is important to recognize a linear program on sight, as specific and very powerful
optimization algorithms are available, such as Dantzig's simplex.
The same can be said of convex optimization, of which linear programming is a
special case.
Interior-point methods can deal with large-scale convex and nonconvex problems.
References
1. Bertsekas, D.: Constrained Optimization and Lagrange Multiplier Methods. Athena Scientific,
Belmont (1996)
2. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge
(2004)
3. Papalambros, P., Wilde, D.: Principles of Optimal Design. Cambridge University Press,
Cambridge (1988)
4. Wright, M.: The interior-point revolution in optimization: history, recent developments, and
lasting consequences. Bull. Am. Math. Soc. 42(1), 39–56 (2004)
5. Gill, P., Murray, W., Wright, M.: Practical Optimization. Elsevier, London (1986)
6. Theodoridis, S., Slavakis, K., Yamada, I.: Adaptive learning in a world of projections. IEEE
Sig. Process. Mag. 28(1), 97–123 (2011)
7. Han, S.P., Mangasarian, O.: Exact penalty functions in nonlinear programming. Math. Program.
17, 251–269 (1979)
8. Zaslavski, A.: A sufficient condition for exact penalty in constrained optimization. SIAM J.
Optim. 16, 250–262 (2005)
9. Polyak, B.: Introduction to Optimization. Optimization Software, New York (1987)
10. Bonnans, J., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.: Numerical Optimization: Theoretical and Practical Aspects. Springer, Berlin (2006)
11. Boggs, P., Tolle, J.: Sequential quadratic programming. Acta Numer. 4, 1–51 (1995)
12. Boggs, P., Tolle, J.: Sequential quadratic programming for large-scale nonlinear optimization.
J. Comput. Appl. Math. 124, 123–137 (2000)
13. Nocedal, J., Wright, S.: Numerical Optimization. Springer, New York (1999)
14. Matoušek, J., Gärtner, B.: Understanding and Using Linear Programming. Springer, Berlin
(2007)
15. Gonin, R., Money, A.: Nonlinear L p -Norm Estimation. Marcel Dekker, New York (1989)
16. Kiountouzis, E.: Linear programming techniques in regression analysis. J. R. Stat. Soc. Ser. C
(Appl. Stat.) 22(1), 69–73 (1973)
17. Bronson, R.: Operations Research. Schaum's Outline Series. McGraw-Hill, New York (1982)
18. Khachiyan, L.: A polynomial algorithm in linear programming. Sov. Math. Dokl. 20, 191–194
(1979)
19. Karmarkar, N.: A new polynomial-time algorithm for linear programming. Combinatorica 4(4),
373–395 (1984)
20. Gill, P., Murray, W., Saunders, M., Tomlin, J., Wright, M.: On projected Newton barrier methods
for linear programming and an equivalence to Karmarkar's projective method. Math. Prog. 36,
183–209 (1986)
21. Byrd, R., Hribar, M., Nocedal, J.: An interior point algorithm for large-scale nonlinear
programming. SIAM J. Optim. 9(4), 877–900 (1999)
22. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston
(2004)
23. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms: Fundamentals. Springer, Berlin (1993)
24. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms:
Advanced Theory and Bundle Methods. Springer, Berlin (1993)
25. Sasena, M., Papalambros, P., Goovaerts, P.: Exploration of metamodeling sampling criteria for
constrained global optimization. Eng. Optim. 34(3), 263–278 (2002)
26. Sasena, M.: Flexibility and efficiency enhancements for constrained global design optimization
with kriging approximations. Ph.D. thesis, University of Michigan (2002)
27. Conn, A., Gould, N., Toint, P.: A globally convergent augmented Lagrangian algorithm for
optimization with general constraints and simple bounds. SIAM J. Numer. Anal. 28(2),
545–572 (1991)
28. Conn, A., Gould, N., Toint, P.: A globally convergent augmented Lagrangian barrier algorithm
for optimization with general constraints and simple bounds. Technical Report 92/07 (2nd
revision), IBM T.J. Watson Research Center, Yorktown Heights (1995)
29. Lewis, R., Torczon, V.: A globally convergent augmented Lagrangian pattern algorithm for
optimization with general constraints and simple bounds. Technical Report 9831, NASA
ICASE, NASA Langley Research Center, Hampton (1998)
Chapter 11
Combinatorial Optimization
11.1 Introduction
So far, the feasible domain X was assumed to be such that infinitesimal displacements
of the decision vector x were possible. Assume now that some decision variables xi
take only discrete values, which may be coded with integers. Two situations should
be distinguished.
In the first, the discrete values of xi have a quantitative meaning. A drug
prescription, for instance, may recommend taking an integer number of pills of a
given type. Then xi ∈ {0, 1, 2, . . .}, and taking two pills means ingesting twice as
much active principle as with one pill. One may then speak of integer programming.
A possible approach for dealing with such a problem is to introduce the constraint
xi (xi − 1)(xi − 2) · · · = 0  (11.1)
via a penalty function and then resort to unconstrained continuous optimization. See
also Sect. 16.5.
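This penalty idea can be sketched on a hypothetical scalar example, with the polynomial constraint enforced through a quadratic penalty and a crude grid search standing in for a continuous optimizer:

```python
# Hypothetical scalar example: force x into {0, 1, 2} via the penalty
# p(x) = [x*(x - 1)*(x - 2)]^2 while minimizing (x - 1.3)^2.
def penalized(x, w=100.0):
    return (x - 1.3)**2 + w*(x*(x - 1.0)*(x - 2.0))**2

# Crude continuous minimization by grid search (a real solver would do better)
xs = [i/1000.0 for i in range(3001)]        # grid on [0, 3]
x_best = min(xs, key=penalized)

print(x_best)  # near the admissible integer 1
```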
In the second situation, which is the one considered in this chapter, the discrete
values of the decision variables have no quantitative meaning, although they may be
coded with integers. Consider for instance, the famous traveling salesperson problem
(TSP), where a number of cities must be visited while minimizing the total distance
to be covered. If City X is coded by 1 and City Y by 2, this does not mean that
City Y is twice City X according to any measure. The optimal solution is an ordered
list of city names. Even if this list can be described by a series of integers (visit City
45, then City 12, then...), one should not confuse this with integer programming, and
should rather speak of combinatorial optimization.
Example 11.1 Combinatorial problems are countless in engineering and logistics.
One of them is the allocation of resources (men, CPUs, delivery trucks, etc.) to tasks.
This allocation can be viewed as the computation of an optimal array of names of
resources versus names of tasks (resource Ri should process task T j , then task Tk ,
then...). One may want, for instance, to minimize completion time under constraints
É. Walter, Numerical Methods and Optimization,
DOI: 10.1007/978-3-319-07671-3_11,
© Springer International Publishing Switzerland 2014
(Y(iStart)-Y(iFinish))^2);
% Coming back home
TripLength=TripLength +...
sqrt((X(iFinish)-X(iOrder(1)))^2+...
(Y(iFinish)-Y(iOrder(1)))^2);
end
The following script explores 10^5 itineraries generated at random to produce the
one plotted in Fig. 11.2 starting from the one plotted in Fig. 11.1. This result is clearly
suboptimal.
% X = table of city longitudes
% Y = table of city latitudes
% NumCities = number of cities
% InitialOrder = itinerary
% used as a starting point
% FinalOrder = finally suggested itinerary
NumCities = 10; NumIterations = 100000;
for i=1:NumCities,
X(i)=cos(2*pi*(i-1)/NumCities);
Y(i)=sin(2*pi*(i-1)/NumCities);
end
Fig. 11.2 Suboptimal itinerary suggested for the problem with 10 cities by simulated annealing
after the generation of 10^5 itineraries at random
Fig. 11.3 Optimal itinerary suggested for the problem with ten cities by simulated annealing after
the generation of 10^5 exchanges of two cities picked at random
It is not clear whether decreasing temperature plays any useful role in this
particular example. The following script refuses any modification of the itinerary
that would increase the distance to be covered, and yet also produces the optimal
itinerary of Fig. 11.5 from the itinerary of Fig. 11.4 for a problem with 20 cities.
NumCities = 20;
NumIterations = 100000;
for i=1:NumCities,
X(i)=cos(2*pi*(i-1)/NumCities);
Y(i)=sin(2*pi*(i-1)/NumCities);
end
InitialOrder=randperm(NumCities);
for i=1:NumCities,
InitialX(i)=X(InitialOrder(i));
InitialY(i)=Y(InitialOrder(i));
end
InitialX(NumCities+1)=X(InitialOrder(1));
InitialY(NumCities+1)=Y(InitialOrder(1));
% Plotting initial itinerary
figure;
plot(InitialX,InitialY)
Fig. 11.5 Optimal itinerary for the problem with 20 cities, obtained after the generation of 10^5
exchanges of two cities picked at random; no increase in the length of the TSP's trip has been
accepted
OldOrder = InitialOrder;
for i=1:NumIterations,
OldLength=TravelGuide(X,Y,OldOrder,NumCities);
% Changing trip at random
NewOrder = OldOrder;
Tempo=randperm(NumCities);
NewOrder(Tempo(1)) = OldOrder(Tempo(2));
NewOrder(Tempo(2)) = OldOrder(Tempo(1));
% Compute resulting trip length
NewLength=TravelGuide(X,Y,NewOrder,NumCities);
if(NewLength<OldLength)
OldOrder=NewOrder;
end
end
% Picking up the final suggestion
% and coming back home
FinalOrder=OldOrder;
for i=1:NumCities,
FinalX(i)=X(FinalOrder(i));
FinalY(i)=Y(FinalOrder(i));
end
FinalX(NumCities+1)=X(FinalOrder(1));
FinalY(NumCities+1)=Y(FinalOrder(1));
% Plotting suggested itinerary
figure;
plot(FinalX,FinalY)
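The same accept-only-improvements strategy can be sketched in Python (with fewer random exchanges than in the MATLAB script); putting the cities evenly on a circle makes the optimal tour length easy to bound from below:

```python
import math, random

# Accept-only-improvements exchange of two cities, for cities evenly spread
# on a circle, so the optimal closed tour has length 2*n*sin(pi/n).
random.seed(0)
n = 10
X = [math.cos(2*math.pi*i/n) for i in range(n)]
Y = [math.sin(2*math.pi*i/n) for i in range(n)]

def trip_length(order):
    # closed tour: come back home at the end
    return sum(math.hypot(X[order[k]] - X[order[(k + 1) % n]],
                          Y[order[k]] - Y[order[(k + 1) % n]])
               for k in range(n))

order = list(range(n))
random.shuffle(order)
initial = trip_length(order)
best = initial
for _ in range(20000):
    a, b = random.sample(range(n), 2)   # pick two cities at random
    order[a], order[b] = order[b], order[a]
    new = trip_length(order)
    if new < best:
        best = new                      # keep the improvement
    else:
        order[a], order[b] = order[b], order[a]  # refuse the exchange

print(initial, best)  # best <= initial, and best >= 2*n*sin(pi/n)
```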
Chapter 12
Solving Ordinary Differential Equations
12.1 Introduction
Differential equations play a crucial role in the simulation of physical systems, and
most of them can only be solved numerically. We consider only deterministic differential equations; for a practical introduction to the numerical simulation of stochastic
differential equations, see [1]. Ordinary differential equations (ODEs), which have
only one independent variable, are treated first, as this is the simplest case by far. Partial differential equations (PDEs) are left for Chap. 13. Classical references on solving
ODEs are [2, 3]. Information about popular codes for solving ODEs can be found
in [4, 5]. Useful complements for those who plan to use MATLAB ODE solvers are in
[6–10] and Chap. 7 of [11].
Most methods for solving ODEs assume that they are written as
ẋ(t) = f(x(t), t),  (12.1)
where x is a vector of Rn , with n the order of the ODE, and where t is the independent
variable. This variable is often associated with time, and this is how we will call it,
but it may just as well correspond to some other independently evolving quantity, as
in the example of Sect. 12.4.4. Equation (12.1) defines a system of n scalar first-order
differential equations. For any given value of t, the value of x(t) is the state of this
system, and (12.1) is a state equation.
Remark 12.1 The fact that the vector function f in (12.1) explicitly depends on
t makes it possible to consider ODEs that are forced by some input signal u(t),
provided that u(t) can be evaluated at any t at which f must be evaluated.
Example 12.1 Kinetic equations in continuous stirred tank reactors (CSTRs) are naturally in state-space form, with concentrations of chemical species as state variables.
Consider, for instance, the two elementary reactions
A + 2B → 3C and A + C → 2D.  (12.2)
Mass-action kinetics then translate (12.2) into state equations (12.3)–(12.5) for the
concentrations; the last of them is, for instance,
[Ḋ] = 2k2 [A][C].
Fig. 12.1 shows a two-compartment model, with an input flow u into Compartment 1,
exchanges δ2,1 and δ1,2 between the compartments and an elimination δ0,1 from
Compartment 1. The corresponding state equation takes the form
ẋ = Ax + Bu,  (12.6)
which is linear in the input-flow vector u, with A a function of the δi,j's. For the
model of Fig. 12.1,
A = [ −(δ0,1 + δ2,1)   δ1,2
        δ2,1          −δ1,2 ],  (12.7)
and B boils down to the column vector
b = (1, 0)T.  (12.8)
Remark 12.2 Although (12.6) is linear with respect to its input, its solution is strongly
nonlinear in A. This has consequences if the unknown parameters δi,j are to be
estimated from measurements
y(ti ) = Cx(ti ), i = 1, . . . , N ,
(12.9)
by minimizing some cost function. Even if this cost function is quadratic in the error,
the linear least-squares method will not apply because the cost function will not be
quadratic in the parameters.
Remark 12.3 When the vector function f in (12.1) depends not only on x(t) but
also on t, it is possible formally to get rid of the dependency in t by considering the
extended state vector
xe(t) = (xT(t), t)T.  (12.10)
This vector satisfies the extended state equation
ẋe(t) = (ẋT(t), 1)T = (fT(x, t), 1)T = fe(xe(t)),  (12.11)
which no longer depends explicitly on t.
Sometimes, putting ODEs in state-space form requires some work, as in the following example, which corresponds to a large class of ODEs.
Example 12.3 Any nth order scalar ODE that can be written as
y(n) = f(y, ẏ, …, y(n−1), t)  (12.12)
can be put in state-space form by defining the state vector as
x = (y, ẏ, …, y(n−1))T.  (12.13)
Indeed,
ẋ = (ẏ, ÿ, …, y(n))T = A x + b g(x, t) = f(x, t),  (12.14)
with g(x, t) = f(y, ẏ, …, y(n−1), t),
A = [ 0 1 0 … 0
      0 0 1 … 0
      ⋮        ⋱
      0 0 0 … 1
      0 0 0 … 0 ]  and  b = (0, …, 0, 1)T.  (12.15)
The solution y(t) of the initial scalar ODE is then in the first component of x(t).
Remark 12.4 This is just one way of obtaining a state equation from a scalar ODE.
Any state-space similarity transformation z = Tx, where T is invertible and independent of t, leads to another state equation.
ż = T f(T−1 z, t).  (12.16)
The solution of the initial scalar ODE is then
y(t) = cT x(t),  (12.17)
with
cT = (1, 0, …, 0).  (12.18)
Constraints must be provided for the solution of (12.1) to be completely specified.
We distinguish
initial-value problems (IVPs), where these constraints completely specify the value
of x for a single value t0 of t and the solution x(t) is to be computed for t ≥ t0,
boundary-value problems (BVPs), and in particular two-endpoint BVPs, where
these constraints provide partial information on x(tmin) and x(tmax) and the solution
x(t) is to be computed for tmin ≤ t ≤ tmax.
From the specifications of the problem, the ODE solver should ideally choose
a family of integration algorithms,
a member in this family,
a step-size.
It should also adapt these choices as the simulation proceeds, when appropriate. As
a result, the integration algorithms form only a small portion of the code of some
professional-grade ODE solvers. We limit ourselves here to a brief description of the
main families of integration methods (with their advantages and limitations) and of
how automatic step-size control may be carried out. We start in Sect. 12.2 with IVPs,
which are simpler than the BVPs treated in Sect. 12.3.
12.2 Initial-Value Problems
The IVP to be solved is specified by the state equation (12.1), rewritten here as
ẋ = f(x, t),  (12.19)
and the initial condition
x(t0) = x0.  (12.20)
Consider, for instance, the two scalar ODEs
ẋ = −x²,  x(0) = p,  (12.22)
and
ẋ = −x + x²,  x(0) = p.  (12.23)
The solution of (12.22) is
x(t) = p/(1 + pt).  (12.24)
When p > 0, this solution is valid for any t ≥ 0, but when p < 0, it has a finite escape
time: it tends to infinity when t tends to −1/p and is only valid for t ∈ [0, −1/p).
The nature of the solution of (12.23) depends on the magnitude of p. When |p|
is small enough, the effect of the quadratic term is negligible and the solution is
approximately equal to p exp(−t), whereas when |p| is large enough, the quadratic
term dominates and the solution has a finite escape time.
Remark 12.6 The final time tf of the computation may not be known in advance,
and may be defined as the first time such that
h (x (tf ) , tf ) = 0,
(12.25)
One may, for instance, use a truncated Taylor series
exp M ≈ Σ_{j=0}^{q} (1/j!) M^j,  (12.28)
or a (p, p) Padé approximation
exp M ≈ [Dp(M)]−1 Np(M),  (12.29)
(12.29)
where N p (M) and D p (M) are pth order polynomials in M. The coefficients of the
polynomials in the Padé approximation are chosen in such a way that its Taylor
expansion is the same as that of exp M up to order q = 2 p. Thus
Np(M) = Σ_{j=0}^{p} cj M^j  (12.30)
and
Dp(M) = Σ_{j=0}^{p} cj (−M)^j,  (12.31)
with
cj = (2p − j)! p! / [(2p)! j! (p − j)!].  (12.32)
When M is diagonalizable, so that M = V Λ V−1 with Λ diagonal, the solution of ẋ = Mx can
be computed as
x(t) = V exp[Λ(t − t0)] V−1 x(t0),  (12.34)
where the ith diagonal entry of the diagonal matrix exp[Λ(t − t0)] is exp[λi(t − t0)].
The scaling and squaring method [15–17], based on the relation
exp M = [exp(M/m)]^m,  (12.35)
is one of the most popular approaches for computing matrix exponentials. It is implemented in MATLAB as the function expm. During scaling, m is taken as the smallest
power of two such that ||M/m|| < 1. A Taylor or Padé approximation is then used
to evaluate exp(M/m), before evaluating exp M by repeated squaring.
Another option is to use one of the general-purpose methods presented next. See
also Sect. 16.19.
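The method can be sketched in pure Python for small matrices, with a truncated Taylor series standing in for the Taylor or Padé approximation of exp(M/m):

```python
# A pure-Python sketch of scaling and squaring for small matrices; a
# truncated Taylor series approximates exp(M/m) before repeated squaring.

def mat_mul(P, Q):
    n = len(P)
    return [[sum(P[i][k]*Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_expm(M, taylor_terms=12):
    n = len(M)
    norm = max(sum(abs(v) for v in row) for row in M)  # infinity norm
    m = 1
    while norm/m >= 1.0:        # smallest power of two with ||M/m|| < 1
        m *= 2
    S = [[M[i][j]/m for j in range(n)] for i in range(n)]
    E = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    T = [row[:] for row in E]
    for k in range(1, taylor_terms):    # E accumulates the Taylor series
        T = [[v/k for v in row] for row in mat_mul(T, S)]
        E = [[E[i][j] + T[i][j] for j in range(n)] for i in range(n)]
    while m > 1:                        # repeated squaring
        E = mat_mul(E, E)
        m //= 2
    return E

# Nilpotent test matrix: exp([[0, 4], [0, 0]]) = [[1, 4], [0, 1]] exactly
E = mat_expm([[0.0, 4.0], [0.0, 0.0]])
print(E)  # [[1.0, 4.0], [0.0, 1.0]]
```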
The simplest methods for solving initial-value problems are Euler's methods.
The explicit Euler method computes
xl+1 = xl + h fl,  (12.39)
where fl stands for f(xl, tl).
It is a single-step method, as the evaluation of x(tl+1 ) is based on the value of x
at a single value tl of t. The method error for one step (or local method error) is
generically O(h 2 ) (unless x (tl ) = 0).
Equation (12.39) boils down to replacing ẋ in (12.1) by the forward finite-difference approximation
ẋ(tl) ≈ (xl+1 − xl)/h.  (12.40)
As the evaluation of xl+1 by (12.39) uses only the past value xl of x, the explicit
Euler method is a prediction method.
One may instead replace ẋ in (12.1) by the backward finite-difference approximation
ẋ(tl+1) ≈ (xl+1 − xl)/h,  (12.41)
to get
xl+1 = xl + h fl+1.  (12.42)
Since fl+1 depends on xl+1 , xl+1 is now obtained by solving an implicit equation,
and this is the implicit Euler method. It has better stability properties than its explicit
counterpart, as illustrated by the following example.
Example 12.4 Consider the scalar first-order differential equation (n = 1)
ẋ = λx,  (12.43)
with λ some negative real constant, so (12.43) is asymptotically stable, i.e., x(t) tends
to zero when t tends to infinity. The explicit Euler method computes
xl+1 = xl + h(λxl) = (1 + λh)xl,  (12.44)
which is asymptotically stable if and only if
|1 + λh| < 1.  (12.45)
The implicit Euler method computes instead
xl+1 = (1/(1 − λh)) xl,  (12.46)
which is asymptotically stable for any step-size h since λ < 0 and 0 < 1/(1 − λh) < 1.
Except when (12.42) can be made explicit (as in Example 12.4), the implicit Euler
method is more complicated to implement than the explicit one, and this is true for
all the other implicit methods to be presented, see Sect. 12.2.2.4.
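The stability contrast of Example 12.4 is easy to check numerically; the sketch below takes λ = −2 and the step-size h = 1.2, which lies outside the stability range of the explicit method:

```python
# Explicit vs implicit Euler on xdot = lambda*x with lambda = -2 and a
# step-size h = 1.2 > 2/|lambda|, outside the explicit stability range.
lam, h, steps = -2.0, 1.2, 30

x_exp, x_imp = 1.0, 1.0
for _ in range(steps):
    x_exp = (1.0 + lam*h)*x_exp       # explicit Euler, factor -1.4
    x_imp = x_imp/(1.0 - lam*h)       # implicit Euler, factor 1/3.4

print(abs(x_exp), abs(x_imp))  # the first blows up, the second decays to ~0
```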
Taylor methods of order k use the expansion
xl+1 = xl + h ẋ(tl) + ··· + (h^k/k!) x(k)(tl) + o(h^k).  (12.47)
Runge-Kutta methods exploit the fact that
x(tl+1) = x(tl) + ∫_{tl}^{tl+1} f(x(τ), τ) dτ,  (12.48)
and approximate the integral by a quadrature formula
∫_{tl}^{tl+1} f(x(τ), τ) dτ ≈ h Σ_i bi f(x(tl,i), tl,i),  (12.49)
where the tl,i belong to [tl, tl+1].
RK(2), for instance, may be written as
k1 = h f(xl, tl),  (12.52)
k2 = h f(xl + k1/2, tl + h/2),  (12.53)
xl+1 = xl + k2,  (12.54)
tl+1 = tl + h,  (12.55)
with a local method error o(h 2 ), generically O(h 3 ). Figure 12.2 illustrates the procedure, assuming a scalar state x.
Although computations are carried out at midpoint tl + h/2, this is a single-step
method, as xl+1 is computed as a function of xl .
The most commonly used Runge-Kutta method is RK(4), which may be written
as
k1 = h f(xl, tl),  (12.56)
k2 = h f(xl + k1/2, tl + h/2),  (12.57)
k3 = h f(xl + k2/2, tl + h/2),  (12.58)
k4 = h f(xl + k3, tl + h),  (12.59)
xl+1 = xl + k1/6 + k2/3 + k3/3 + k4/6,  (12.60)
tl+1 = tl + h,  (12.61)
with a local method error o(h 4 ), generically O(h 5 ). The first derivative of the state
with respect to t is now evaluated once at tl , once at tl+1 and twice at tl + h/2. RK(4)
is nevertheless still a single-step method.
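RK(4) can be sketched in a few lines of Python and checked on ẋ = −x, whose exact solution is exp(−t):

```python
import math

def rk4_step(f, x, t, h):
    """One step of the classical RK(4) scheme for a scalar state."""
    k1 = h*f(x, t)
    k2 = h*f(x + k1/2, t + h/2)
    k3 = h*f(x + k2/2, t + h/2)
    k4 = h*f(x + k3, t + h)
    return x + k1/6 + k2/3 + k3/3 + k4/6, t + h

f = lambda x, t: -x          # xdot = -x, exact solution exp(-t)
x, t, h = 1.0, 0.0, 0.1
for _ in range(10):          # integrate up to t = 1
    x, t = rk4_step(f, x, t, h)

print(abs(x - math.exp(-1.0)))  # global error around 1e-7 for this step-size
```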
Fig. 12.2 One step of RK(2)
Remark 12.8 Just as the other explicit Runge-Kutta methods, RK(4) is self-starting.
Provided with the initial condition x0 , it computes x1 , which is the initial condition
for computing x2 , and so forth. The price to be paid for this nice property is that
none of the four numerical evaluations of f carried out to compute xl+1 can be
reused in the computation of xl+2 . This may be a major drawback compared to the
multistep methods of Sect. 12.2.2.3, if computational efficiency is important. On the
other hand, it is much easier to adapt step-size (see Sect. 12.2.4), and Runge-Kutta
methods are more robust when the solution presents near-discontinuities. They may
be viewed as ocean-going tugboats, which can get large cruise liners out of crowded
harbors and come to their rescue when the sea gets rough.
Implicit Runge-Kutta methods [19, 20] have also been derived. They are the only
Runge-Kutta methods that can be used with stiff ODEs, see Sect. 12.2.5. Each of their
steps requires the solution of an implicit set of equations and is thus more complex
for a given order. Based on [21, 22], MATLAB has implemented its own version of
an implicit Runge-Kutta method in ode23s, where the computation of xl+1 is via
the solution of a system of linear equations [6].
Remark 12.9 It was actually shown in [23], and further discussed in [24], that recursion relations often make it possible to use Taylor expansion with less computation
than with a Runge-Kutta method of the same order. The Taylor series approach is
indeed used (with quite large values of k) in the context of guaranteed integration,
where sets containing the mathematical solutions of the ODEs are computed numerically [25–27].
Linear multistep methods compute
xl+1 = Σ_{i=0}^{na−1} ai xl−i + h Σ_{j=j0}^{nb+j0−1} bj fl−j.  (12.62)
They differ by the values given to the number na of ai coefficients, the number nb of
bj coefficients and the initial value j0 of the index in the second sum of (12.62). As
soon as na > 1 or nb > 1 − j0, (12.62) corresponds to a multistep method, because
xl+1 is computed from several past values of x (or of ẋ, which is also computed from
the value of x).
Remark 12.10 Equation (12.62) only uses evaluations carried out with the constant
step-size h = ti+1 − ti. The evaluations of f used to compute xl+1 can thus be reused
to compute xl+2 , which is a considerable advantage over Runge-Kutta methods.
There are drawbacks, however:
• adapting step-size gets significantly more complicated than with Runge-Kutta methods;
• multistep methods are not self-starting; provided with the initial condition x_0, they are unable to compute x_1, and must receive the help of single-step methods to compute enough values of x and \dot{x} to allow the recurrence (12.62) to proceed.
If Runge-Kutta methods are tugboats, then multistep methods are cruise liners, which cannot leave the harbor of the initial conditions by themselves. Multistep methods may also fail later on, if the functions involved are not smooth enough, and Runge-Kutta methods (or other single-step methods) may then have to be called to their rescue.
We consider three families of linear multistep methods, namely Adams-Bashforth, Adams-Moulton, and Gear. The kth order member of any of these families has a local method error o(h^k), generically O(h^{k+1}).
Adams-Bashforth methods are explicit. In the kth order method AB(k), n_a = 1, a_0 = 1, j_0 = 0 and n_b = k, so

x_{l+1} = x_l + h \sum_{j=0}^{k-1} b_j f_{l-j}.   (12.63)
AB(2) is such that

x_{l+1} = x_l + \frac{h}{2}(3 f_l - f_{l-1}).   (12.65)

It is thus a multistep method, which cannot start by itself, just as AB(3), where

x_{l+1} = x_l + \frac{h}{12}(23 f_l - 16 f_{l-1} + 5 f_{l-2}),   (12.66)

and AB(4), where

x_{l+1} = x_l + \frac{h}{24}(55 f_l - 59 f_{l-1} + 37 f_{l-2} - 9 f_{l-3}).   (12.67)
Since j takes the value −1, all of the Adams-Moulton methods are implicit. When k = 1, there is a single coefficient b_{−1} = 1 and AM(1) is the implicit Euler method

x_{l+1} = x_l + h f_{l+1}.   (12.69)
AM(2) is the (implicit) trapezoidal method

x_{l+1} = x_l + \frac{h}{2}(f_{l+1} + f_l),   (12.70)

AM(3) satisfies

x_{l+1} = x_l + \frac{h}{12}(5 f_{l+1} + 8 f_l - f_{l-1}),   (12.71)

and AM(4)

x_{l+1} = x_l + \frac{h}{24}(9 f_{l+1} + 19 f_l - 5 f_{l-1} + f_{l-2}).   (12.72)
(12.72)
k1
ai xli + hbfl+1 .
(12.73)
i=0
The Gear methods are also called BDF methods, because backward-differentiation formulas can be employed to compute their coefficients. G(k) = BDF(k) is such that

\sum_{m=1}^{k} \frac{1}{m} \nabla^m x_{l+1} = h f_{l+1},   (12.74)

with

\nabla x_{l+1} = x_{l+1} - x_l,   (12.75)

\nabla^2 x_{l+1} = \nabla(\nabla x_{l+1}) = x_{l+1} - 2 x_l + x_{l-1},   (12.76)

and, more generally,

\nabla^m x_{l+1} = \nabla(\nabla^{m-1} x_{l+1}).   (12.77)
G(2) satisfies

x_{l+1} = \frac{1}{3}(4 x_l - x_{l-1} + 2 h f_{l+1}).   (12.78)

G(3) is such that

x_{l+1} = \frac{1}{11}(18 x_l - 9 x_{l-1} + 2 x_{l-2} + 6 h f_{l+1}),   (12.79)

and G(4) such that

x_{l+1} = \frac{1}{25}(48 x_l - 36 x_{l-1} + 16 x_{l-2} - 3 x_{l-3} + 12 h f_{l+1}).   (12.80)
A variant of (12.74),

\sum_{m=1}^{k} \frac{1}{m} \nabla^m x_{l+1} - h f_{l+1} - \kappa \left( \sum_{j=1}^{k} \frac{1}{j} \right) (x_{l+1} - x^0_{l+1}) = 0,   (12.81)
was studied in [28] under the name of numerical differentiation formulas (NDF), with the aim of improving on the stability properties of high-order BDF methods. In (12.81), κ is a scalar parameter and x^0_{l+1} a (rough) prediction of x_{l+1}, used as an initial value to solve (12.81) for x_{l+1} by a simplified Newton (chord) method. Based on NDFs, MATLAB has implemented its own methodology in ode15s [6, 8], with order varying from k = 1 to k = 5.
Remark 12.11 Changing the order k of a multistep method when needed is trivial, as it boils down to computing another linear combination of already computed vectors x_{l−i} or f_{l−i}. This can be taken advantage of to make Adams-Bashforth self-starting, by using AB(1) to compute x_1 from x_0, AB(2) to compute x_2 from x_1 and x_0, and so forth until the desired order has been reached.
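This order-ramping start-up is easy to sketch in code. The following Python fragment (the book works in MATLAB; the function and variable names here are mine, not the book's) integrates an IVP with AB(4), starting itself with AB(1) to AB(3) as past slopes become available:

```python
import math

# Adams-Bashforth coefficients b_j of (12.63), for AB(1) to AB(4)
AB_COEFS = {
    1: [1.0],
    2: [3 / 2, -1 / 2],
    3: [23 / 12, -16 / 12, 5 / 12],
    4: [55 / 24, -59 / 24, 37 / 24, -9 / 24],
}

def ab_integrate(f, x0, t0, tf, h, order=4):
    """Integrate dx/dt = f(x, t) from t0 to tf with AB(order),
    self-started by ramping the order up from AB(1)."""
    n = round((tf - t0) / h)
    x, t = x0, t0
    past = [f(x0, t0)]                      # f_l, f_{l-1}, ...
    for _ in range(n):
        k = min(order, len(past))           # order ramp-up at start-up
        x = x + h * sum(b * fp for b, fp in zip(AB_COEFS[k], past))
        t += h
        past.insert(0, f(x, t))
        past = past[:order]
    return x

# dx/dt = -x, x(0) = 1; the exact value at t = 1 is exp(-1)
approx = ab_integrate(lambda x, t: -x, 1.0, 0.0, 1.0, 0.001)
```

The few low-order start-up steps introduce only a small error, which here stays far below the AB(4) accuracy budget.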
Prediction can be carried out with AB(2),

x^1_{l+1} = x_l + \frac{h}{2}(3 f_l - f_{l-1}),   (12.83)

and correction with AM(2), where x_{l+1} on the right-hand side is replaced by x^1_{l+1}:

x^2_{l+1} = x_l + \frac{h}{2}\left[ f(x^1_{l+1}, t_{l+1}) + f_l \right].   (12.84)
Remark 12.12 The influence of prediction on the final local method error is less than that of correction, so one may use a (k − 1)th order predictor with a kth order corrector. When prediction is carried out by AB(1) (i.e., the explicit Euler method)

x^1_{l+1} = x_l + h f_l,   (12.85)

and correction by AM(2), one gets

x_{l+1} = x_l + \frac{h}{2}\left[ f(x^1_{l+1}, t_{l+1}) + f_l \right].   (12.86)
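The predictor-corrector pair (12.85)–(12.86) is Heun's method. A Python sketch (names mine), with a quick numerical check that the resulting method is indeed second order:

```python
import math

def heun_step(f, x, t, h):
    """One step of Heun's method: AB(1) predictor (12.85),
    then AM(2) corrector (12.86)."""
    x_pred = x + h * f(x, t)                             # prediction
    return x + (h / 2) * (f(x_pred, t + h) + f(x, t))    # correction

def final_error(h):
    """Global error at t = 1 for dx/dt = -x, x(0) = 1."""
    x, t = 1.0, 0.0
    for _ in range(round(1.0 / h)):
        x = heun_step(lambda xx, tt: -xx, x, t, h)
        t += h
    return abs(x - math.exp(-1.0))

# Halving h should divide the global error by about 4 (second order)
ratio = final_error(0.01) / final_error(0.005)
```

The observed error ratio close to 4 confirms the second-order behavior claimed for this pair.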
12.2.3 Scaling

Provided that upper bounds \bar{x}_i can be obtained on the absolute values of the state variables x_i (i = 1, ..., n), one may transform the initial state equation (12.1) into

\dot{q}(t) = g(q(t), t),   (12.87)

with

q_i = \frac{x_i}{\bar{x}_i}, \quad i = 1, \ldots, n.   (12.88)

This was more or less mandatory when analog computers were used, to avoid saturating operational amplifiers. The much larger range of magnitudes offered by floating-point numbers has made this practice less crucial, but it may still turn out to be very useful.
This motivates the study of the stability of numerical methods for solving IVPs on Dahlquist's test problem [30]

\dot{x} = \lambda x, \quad x(0) = 1,   (12.93)

where λ is a complex constant with strictly negative real part, rather than the real constant considered in Example 12.4. The step-size h must be such that the numerical integration scheme is stable for each of the test equations obtained by replacing λ by one of the eigenvalues of A.
The methodology for conducting this stability study, particularly clearly described in [31], is now explained; this part may be skipped by the reader interested only in its results.
Single-step methods

When applied to the test problem (12.93), single-step methods compute

x_{l+1} = R(z) x_l,   (12.94)

with

z = \lambda h.   (12.95)

For the kth order Taylor method, for instance,

x_{l+1} = \sum_{m=0}^{k} \frac{1}{m!} (\lambda h)^m x_l,   (12.97)

so that

R(z) = \sum_{m=0}^{k} \frac{1}{m!} z^m.   (12.98)
The same holds true for any kth order explicit Runge-Kutta method, as it has been designed to achieve this.

Example 12.6 When Heun's method is applied to the test problem, (12.85) becomes

x^1_{l+1} = x_l + \lambda h x_l = (1 + z) x_l,   (12.99)

and (12.86) becomes

x_{l+1} = x_l + \frac{z}{2}(x^1_{l+1} + x_l) = \left(1 + z + \frac{z^2}{2}\right) x_l.   (12.100)

This should come as no surprise, as Heun's method is a second-order explicit Runge-Kutta method.
For implicit single-step methods, R(z) will be a rational function. For AM(1), the implicit Euler method,

x_{l+1} = x_l + \lambda h x_{l+1}, \quad \text{so} \quad x_{l+1} = \frac{1}{1 - z}\, x_l.   (12.101)

For AM(2), the implicit trapezoidal method,

x_{l+1} = x_l + \frac{\lambda h}{2}(x_{l+1} + x_l), \quad \text{so} \quad x_{l+1} = \frac{1 + \frac{z}{2}}{1 - \frac{z}{2}}\, x_l.   (12.102)

For each of these methods, the solution of Dahlquist's test problem will be (absolutely) stable if and only if z is such that |R(z)| ≤ 1 [31].
Fig. 12.3 Contour plots of the absolute stability regions of explicit Runge-Kutta methods on Dahlquist's test problem, from RK(1) (top left) to RK(6) (bottom right); the region in black is unstable
For the explicit Euler method, this means that λh should be inside the disk with unit radius centered at −1, whereas for the implicit Euler method, λh should be outside the disk with unit radius centered at +1. Since h is always real and positive and λ is assumed here to have a negative real part, this means that the implicit Euler method is always stable on the test problem. The intersection of the stability disk of the explicit Euler method with the real axis is the interval [−2, 0], consistent with the results of Example 12.4.
AM(2) turns out to be absolutely stable for any z with negative real part (i.e., for any λ such that the test problem is stable) and unstable for any other z.
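These stability statements are easy to check numerically. A small Python experiment (real λ, step counts chosen arbitrarily for illustration):

```python
def explicit_euler(lam, h, n):
    """n explicit Euler steps on Dahlquist's test dx/dt = lam*x, x(0) = 1,
    i.e., x_{l+1} = (1 + lam*h) * x_l."""
    x = 1.0
    for _ in range(n):
        x += h * lam * x
    return x

def implicit_euler(lam, h, n):
    """n implicit Euler steps: x_{l+1} = x_l / (1 - lam*h), i.e., R(z) = 1/(1-z)."""
    x = 1.0
    for _ in range(n):
        x /= 1.0 - h * lam
    return x

lam = -10.0
x_stable = explicit_euler(lam, 0.19, 100)    # lam*h = -1.9, inside [-2, 0]
x_unstable = explicit_euler(lam, 0.21, 100)  # lam*h = -2.1, outside [-2, 0]
x_implicit = implicit_euler(lam, 0.21, 100)  # implicit Euler: stable anyway
```

With λh just inside the interval [−2, 0] the explicit iterates decay; just outside, they blow up, while implicit Euler remains stable at the same step-size.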
Figure 12.3 presents contour plots of the regions where z = λh must lie for the explicit Runge-Kutta methods of order k = 1 to 6 to be absolutely stable. The surface of the absolute stability region is found to increase when the order of the method is increased. See Sect. 12.4.1 for the MATLAB script employed to draw the contour plot for RK(4).
Multistep methods

When applied to the test problem (12.93), a linear multistep method computes x_{l+1} via a linear recurrence that can be written as

\sum_{j=0}^{r} (\alpha_j - z \beta_j)\, x_{l+j} = 0.   (12.104)

It is absolutely stable if and only if the roots of the characteristic polynomial

P_z(\xi) = \sum_{j=0}^{r} (\alpha_j - z \beta_j)\, \xi^j   (12.105)

all belong to the complex disk with unit radius centered on the origin. (More precisely, the simple roots must belong to the closed disk and the multiple roots to the open disk.)
Example 12.7 Although AB(1), AM(1) and AM(2) are single-step methods, they can be studied with the characteristic-polynomial approach, with the same results as previously. The characteristic polynomial of AB(1) is

P_z(\xi) = \xi - (1 + z),   (12.106)

whose only root is ξ_1 = 1 + z, so the absolute stability domain of AB(1) is

S = \{z : |1 + z| \leq 1\}.   (12.107)

For AM(1),

P_z(\xi) = (1 - z)\xi - 1,   (12.108)

whose only root is ξ_1 = 1/(1 − z), so

S = \left\{z : \left|\frac{1}{1 - z}\right| \leq 1\right\} = \{z : |1 - z| \geq 1\}.   (12.109)

For AM(2),

P_z(\xi) = \left(1 - \frac{z}{2}\right)\xi - \left(1 + \frac{z}{2}\right),   (12.110)

whose only root is

\xi_1 = \frac{1 + \frac{z}{2}}{1 - \frac{z}{2}},   (12.111)

consistent with (12.102).
When the degree r of the characteristic polynomial is greater than one, the situation becomes more complicated, as P_z(ξ) now has several roots. If z is on the boundary of the stability region, then at least one root ξ_1 of P_z(ξ) must have a modulus equal to one. It thus satisfies

\xi_1 = e^{i\theta},   (12.112)

for some θ ∈ [0, 2π].
Since z acts affinely in (12.105), P_z(ξ) can be rewritten as

P_z(\xi) = \rho(\xi) - z \sigma(\xi).   (12.113)

Since P_z(\xi_1) = 0, the candidate boundary points are

z(\theta) = \frac{\rho(e^{i\theta})}{\sigma(e^{i\theta})}.   (12.114)

By plotting z(θ) for θ ∈ [0, 2π], one gets all the values of λh that may be on the boundary of the absolute stability region, and this plot is called the boundary locus. For the explicit Euler method, for instance, ρ(ξ) = ξ − 1 and σ(ξ) = 1, so z(θ) = e^{iθ} − 1 and the boundary locus corresponds to a circle with unit radius centered at −1, as it should. When the boundary locus does not cross itself, it separates the absolute stability region from the rest of the complex plane and it is a simple matter to decide which is which, by picking any point z in one of the two regions and evaluating the roots of P_z(ξ) there. When the boundary locus crosses itself, it defines more than two regions in the complex plane, and each of these regions should be sampled, usually to find that absolute stability is achieved in at most one of them.
In a given family of linear multistep methods, the absolute stability domain tends to shrink when order is increased, in contrast with what was observed for the explicit Runge-Kutta methods.

With a fourth-order method such as RK(4), computing x(t_l + 2h_1) from x_l in two steps of size h_1 yields a result r_1 whose local method error involves

c_1 = \frac{x^{(5)}(t_l + \theta_1 h_1)}{5!} \quad \text{and} \quad c_2 = \frac{x^{(5)}(t_l + \theta_2 h_1)}{5!},   (12.118)

with c_1 ≈ c_2. Compute now x(t_l + 2h_1) starting from the same initial state x_l but in a single step with size h_2 = 2h_1, to get

x(t_l + h_2) = r_2 + h_2^5 c_1 + O(h^6).   (12.119)
Since h_2^5 = 32 h_1^5, the local method error of r_1 can be estimated as

\frac{r_1 - r_2}{15},   (12.121)

and that of r_2 as

\frac{32}{30}(r_1 - r_2).   (12.122)
As expected, the local method error thus increases considerably when the step-size is
doubled. Since an estimate of this error is now available, one might subtract it from
r1 to improve the quality of the result, but the estimate of the local method error
would then be lost.
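The estimate (12.121) can be checked on a test problem. A Python sketch using the classical four-stage RK(4) scheme (a standard implementation, not code from this book):

```python
import math

def rk4_step(f, x, t, h):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(x, t)
    k2 = f(x + h / 2 * k1, t + h / 2)
    k3 = f(x + h / 2 * k2, t + h / 2)
    k4 = f(x + h * k3, t + h)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

f = lambda x, t: -x          # exact solution exp(-t)
x0, t0, h1 = 1.0, 0.0, 0.1

# r1: two steps of size h1; r2: a single step of size h2 = 2*h1
r1 = rk4_step(f, rk4_step(f, x0, t0, h1), t0 + h1, h1)
r2 = rk4_step(f, x0, t0, 2 * h1)

true_error_r1 = abs(math.exp(-2 * h1) - r1)
estimated_error_r1 = abs(r1 - r2) / 15       # estimate (12.121)
```

On this smooth problem the estimate agrees with the true local error to within a few percent.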
12.2.4.3 Assessing Local Method Error by Varying Order
Instead of varying their step-size to assess their local method error, modern methods
tend to vary their order, in such a way that less computation is required. This is
the idea behind embedded Runge-Kutta methods such as the Runge-Kutta-Fehlberg
methods [3]. In RKF45, for instance [33], an RK(5) method is used, such that
x^5_{l+1} = x_l + \sum_{i=1}^{6} c_{5,i}\, k_i + O(h^6).   (12.123)

The coefficients of this method are chosen to ensure that an RK(4) method is embedded, such that

x^4_{l+1} = x_l + \sum_{i=1}^{6} c_{4,i}\, k_i + O(h^5).   (12.124)

The local method error estimate is then taken as x^5_{l+1} − x^4_{l+1}.
MATLAB provides two embedded explicit Runge-Kutta methods, namely ode23, based on a (2, 3) pair of formulas by Bogacki and Shampine [34], and ode45, based on a (4, 5) pair of formulas by Dormand and Prince [35]. Dormand and Prince proposed a number of other embedded Runge-Kutta methods [35–37], up to a (7, 8) pair. Shampine developed a MATLAB solver based on another Runge-Kutta (7, 8) pair with strong error control (available from his website), and compared its performance with that of ode45 in [7].
The local method error of multistep methods can similarly be assessed by comparing results at different orders. This is easy, as no new evaluation of f is required.
If the estimate of local method error on x_{l+1} turns out to be larger than some user-specified tolerance, then x_{l+1} is rejected and knowledge of the method order is used to assess a reduction in step-size that should make the local method error acceptable. One should, however, remain realistic in one's requests for precision, for two reasons:
• increasing precision entails reducing step-sizes and thus increasing the computational effort;
• when step-sizes become too small, rounding errors take precedence over method errors and the quality of the results degrades.
Remark 12.14 Step-size control based on such crude error estimates as described in Sects. 12.2.4.2 and 12.2.4.3 may be unreliable. An example is given in [38] for which a production-grade code increased the actual error when the error tolerance was decreased. A class of very simple problems for which the MATLAB solver ode45 with default options gives fundamentally incorrect results because its step-size often lies outside the stability region is presented in [39].
While changing step-size with a single-step method is easy, it becomes much more
complicated with a multistep method, as several past values of x must be updated
when h is modified. Let Z(h) be the matrix obtained by placing side by side all the
past values of the state vector on which the computation of xl+1 is based
Z(h) = [x_l, x_{l-1}, \ldots, x_{l-k}].   (12.125)
To replace the step-size h old by h new , one needs in principle to replace Z(h old ) by
Z(h new ), which seems to require the knowledge of unknown past values of the state.
Finite-difference approximations such as

\dot{x}(t_l) \approx \frac{x_l - x_{l-1}}{h}   (12.126)

and

\ddot{x}(t_l) \approx \frac{x_l - 2x_{l-1} + x_{l-2}}{h^2}   (12.127)

make it possible to compute a matrix V(t_l, h) containing approximations of x(t_l) and of its derivatives up to order k, as

V(t_l, h) = Z(h)\, T(h),   (12.128)

where, for k = 2,

T(h) = \begin{pmatrix} 1 & \frac{1}{h} & \frac{1}{h^2} \\ 0 & -\frac{1}{h} & -\frac{2}{h^2} \\ 0 & 0 & \frac{1}{h^2} \end{pmatrix}.   (12.130)

One can then take

Z(h_{new}) \approx V(t_l, h_{old})\, T^{-1}(h_{new}),   (12.131)

which allows step-size adaptation without the need for a new start-up via a single-step method.
Since

T(h) = N D(h),   (12.132)

with N a constant, invertible matrix and

D(h) = \text{diag}\left(1, \frac{1}{h}, \frac{1}{h^2}, \ldots\right),   (12.133)

we have

T^{-1}(h_{new})\, T(h_{old}) = D^{-1}(h_{new})\, D(h_{old}) = \text{diag}(1, \mu, \mu^2, \ldots),   (12.134)

where μ = h_new / h_old. Further simplification is made possible by using the Nordsieck
vector, which contains the coefficients of the Taylor expansion of x around tl up to
order k
n(t_l, h) = \left( x(t_l),\; h\dot{x}(t_l),\; \ldots,\; \frac{h^k}{k!}\, x^{(k)}(t_l) \right)^T,   (12.135)

with x any given component of x. It can be shown that

n(t_l, h) \approx M v(t_l, h),   (12.136)
with M a constant, invertible matrix. Since the ith entry of n(t_l, h) is proportional to h^i,

n(t_l, h_{new}) = \text{diag}(1, \mu, \ldots, \mu^k)\, n(t_l, h_{old}),   (12.137)

and it is easy to get an approximate value of v(t_l, h_new) as M^{-1} n(t_l, h_new), with the order of approximation unchanged.
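The convenience of (12.135) comes from the fact that its ith entry is proportional to h^i, so a step-size change is just a diagonal rescaling. A Python sketch, using x(t) = e^t (whose derivatives are all known) so that a freshly built Nordsieck vector can serve as reference:

```python
import math

def nordsieck(derivatives, h):
    """Nordsieck vector (12.135): entry i is (h**i / i!) * x^(i)(t_l)."""
    return [h**i / math.factorial(i) * d for i, d in enumerate(derivatives)]

# For x(t) = exp(t) at t_l = 0, all derivatives are equal to 1
derivs = [1.0] * 5                       # up to order k = 4
n_old = nordsieck(derivs, 0.1)           # built with h_old = 0.1

mu = 0.05 / 0.1                          # h_new / h_old
n_rescaled = [mu**i * ni for i, ni in enumerate(n_old)]

n_new = nordsieck(derivs, 0.05)          # what a fresh start-up would give
```

The rescaled vector coincides with the one a complete restart at the new step-size would produce, which is the whole point of the Nordsieck representation.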
324
tf t0
,
h
(12.139)
with h the average step-size. If the global error of a method with order k was equal
to N times its local error, it would be N O(h k+1 ) = O(h k ). The situation is actually
more complicated, as the global method error crucially depends on how stable the
ODE is. Let s(t N , x0 , t0 ) be the true value of a solution x(t N ) at the end of a simulation
x N be the estimate of this solution as provided by the
started from x0 at t0 and let
integration method. For any v Rn , the norm of the global error satisfies
s(t N , x0 , t0 )
x N = s(t N , x0 , t0 )
x N + v v
v
x N + s(t N , x0 , t0 ) v.
(12.140)
325
z_0 = x(t_l),
z_1 = z_0 + h f(z_0, t_l),
z_{i+1} = z_{i-1} + 2h f(z_i, t_l + ih), \quad i = 1, \ldots, N - 1,
\hat{x}(t_l + H) = \hat{x}(t_l + Nh) = \frac{1}{2}\left[ z_N + z_{N-1} + h f(z_N, t_l + Nh) \right].
A crucial advantage of this choice is that the method-error term in the computation
of x(tl + H ) is strictly even (it is a function of h 2 rather than of h). The order of
the method error is thus increased by two with each Richardson extrapolation step,
just as with Romberg integration (see Sect. 6.2.2). Extremely accurate results are thus
obtained quickly, provided that the solution of the ODE is smooth enough. This makes
the Bulirsch-Stoer method particularly appropriate when a high precision is required
or when the evaluation of f(x, t) is expensive. Although rational extrapolation was
initially used, polynomial extrapolation now tends to be favored.
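A minimal Python sketch of this modified midpoint scheme and of one polynomial (Richardson) extrapolation step (names mine):

```python
import math

def gragg(f, x0, t0, H, N):
    """Gragg's modified midpoint scheme over a macro-step H,
    using N micro-steps of size h = H/N; its method error is even in h."""
    h = H / N
    z_prev = x0
    z = z_prev + h * f(z_prev, t0)
    for i in range(1, N):
        z_prev, z = z, z_prev + 2 * h * f(z, t0 + i * h)
    return 0.5 * (z + z_prev + h * f(z, t0 + N * h))

f = lambda x, t: -x                     # exact solution exp(-t)
H = 0.5
A1 = gragg(f, 1.0, 0.0, H, 2)           # h = H/2
A2 = gragg(f, 1.0, 0.0, H, 4)           # h = H/4

# Error is c*h**2 + O(h**4): one Richardson step removes the h**2 term
extrapolated = A2 + (A2 - A1) / 3
```

Even this single extrapolation step reduces the error by about an order of magnitude on this smooth problem, consistent with the even error expansion.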
Consider the linear time-invariant model

\dot{x} = A x,   (12.141)

and assume it is asymptotically stable, i.e., all the eigenvalues of A have strictly negative real parts. This model is stiff if the absolute values of these real parts are such that the ratio of the largest to the smallest is very large. Similarly, the nonlinear model

\dot{x} = f(x)   (12.142)
is stiff if its dynamics comprises very slow and very fast components. This often
happens in chemical reactions, for instance, where rate constants may differ by
several orders of magnitude.
Stiff ODEs are particularly difficult to solve accurately, as the fast components require a small step-size, whereas the slow components require a long horizon of integration. Even when the fast components become negligible in the solution and one could dream of increasing step-size, explicit integration methods will continue to demand a small step-size to ensure stability. As a result, solving a stiff ODE with a method for non-stiff problems, such as MATLAB's ode23 or ode45, may be much too slow to be practical. Implicit methods, including implicit Runge-Kutta methods such as ode23s and Gear methods and their variants such as ode15s, may then save the day [40]. Prediction-correction methods such as ode113 do not qualify as implicit and should be avoided in the context of stiff ODEs.
A differential algebraic equation (DAE) may be written as

r(\dot{q}(t), q(t), t) = 0.   (12.143)

A semi-explicit DAE takes the form

\dot{x} = f(x, z, t),   (12.144)
0 = g(x, z, t).   (12.145)

Consider now the singularly perturbed system

\dot{x} = f(x, z, t, \varepsilon),   (12.146)
x(t_0) = x_0(\varepsilon),   (12.147)
\varepsilon \dot{z} = g(x, z, t, \varepsilon),   (12.148)
z(t_0) = z_0(\varepsilon),   (12.149)
with ε a positive parameter treated as a small perturbation term. The smaller ε is, the stiffer the system of ODEs becomes. In the limit, when ε is taken equal to zero, (12.148) becomes an algebraic equation

g(x, z, t, 0) = 0,   (12.150)

and a DAE is obtained. The perturbation is called singular because the dimension of the state space changes when ε becomes equal to zero.
It is sometimes possible, as in the next example, to solve (12.150) for z explicitly
as a function of x and t, and to plug the resulting formal expression in (12.146) to get
a reduced-order ODE in state-space form, with the initial condition x(t0 ) = x0 (0).
Example 12.8 Enzyme-substrate reaction
Consider the biochemical reaction

E + S ⇌ C → E + P,   (12.151)

in which an enzyme E binds a substrate S to form a complex C, which yields a product P while regenerating the enzyme. Mass-action kinetics translate it into

[\dot{E}] = -k_1 [E][S] + (k_{-1} + k_2)[C],   (12.152)
[\dot{S}] = -k_1 [E][S] + k_{-1}[C],   (12.153)
[\dot{C}] = k_1 [E][S] - (k_{-1} + k_2)[C],   (12.154)
[\dot{P}] = k_2 [C],   (12.155)

with the initial conditions

[E](t_0) = E_0,   (12.156)
[S](t_0) = S_0,   (12.157)
[C](t_0) = 0,   (12.158)
[P](t_0) = 0.   (12.159)
Sum (12.152) and (12.154) to prove that [\dot{E}] + [\dot{C}] \equiv 0, and eliminate (12.152) by substituting E_0 − [C] for [E] in (12.153) and (12.154) to get the reduced model

[\dot{S}] = -k_1 (E_0 - [C])[S] + k_{-1}[C],   (12.160)
[\dot{C}] = k_1 (E_0 - [C])[S] - (k_{-1} + k_2)[C],   (12.161)
[S](t_0) = S_0,   (12.162)
[C](t_0) = 0.   (12.163)
The quasi-steady-state approach [41] then assumes that, after some short transient and before [S] is depleted, the rate with which P is produced is approximately constant. Equation (12.155) then implies that [C] is approximately constant too, which transforms the ODE into a DAE

[\dot{S}] = -k_1 (E_0 - [C])[S] + k_{-1}[C],   (12.164)
0 = k_1 (E_0 - [C])[S] - (k_{-1} + k_2)[C].   (12.165)
The situation is simple enough here to make it possible to get a closed-form expression of [C] as a function of [S] and the kinetic constants

p = (k_1, k_{-1}, k_2)^T,   (12.166)

namely

[C] = \frac{E_0 [S]}{K_m + [S]},   (12.167)

with

K_m = \frac{k_{-1} + k_2}{k_1}.   (12.168)

[C] can then be replaced in (12.164) by its closed-form expression (12.167) to get an ODE where [\dot{S}] is expressed as a function of [S], E_0 and p.
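The quality of this quasi-steady-state reduction can be checked numerically. A Python sketch, with illustrative rate constants of my choosing (not from the book): integrate the reduced model (12.160)–(12.161) by explicit Euler with a small step (the fast transient makes the problem stiff) and compare [C] with (12.167):

```python
k1, km1, k2 = 100.0, 1.0, 1.0       # k_1, k_{-1}, k_2 (illustrative values)
E0, S0 = 1.0, 10.0

def rhs(S, C):
    """Right-hand sides of the reduced model (12.160)-(12.161)."""
    dS = -k1 * (E0 - C) * S + km1 * C
    dC = k1 * (E0 - C) * S - (km1 + k2) * C
    return dS, dC

S, C = S0, 0.0
h = 1e-5                             # small step, dictated by the fast mode
for _ in range(100_000):             # integrate up to t = 1
    dS, dC = rhs(S, C)
    S, C = S + h * dS, C + h * dC

Km = (km1 + k2) / k1                 # (12.168)
C_qssa = E0 * S / (Km + S)           # quasi-steady-state value (12.167)
```

After the short transient, the integrated [C] stays very close to the algebraic value (12.167), as the quasi-steady-state assumption predicts.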
Extensions of the quasi-steady-state approach to more general models are presented in [42, 43]. When an explicit solution of the algebraic equation is not available, repeated differentiation may be used to transform a DAE into an ODE, see Sect. 12.2.6.2. Another option is to try a finite-difference approach, see Sect. 12.3.3.
Example 12.9 Differentiate the algebraic constraint (12.165) with respect to time and solve for [\dot{C}] to get

[\dot{C}] = \frac{k_1 (E_0 - [C])}{k_{-1} + k_2 + k_1 [S]}\, [\dot{S}],   (12.169)

where [\dot{S}] is given by (12.164) and the denominator cannot vanish. The DAE has thus been transformed into the ODE

[\dot{S}] = -k_1 (E_0 - [C])[S] + k_{-1}[C],   (12.171)
[\dot{C}] = \frac{k_1 (E_0 - [C])}{k_{-1} + k_2 + k_1 [S]} \left\{ -k_1 (E_0 - [C])[S] + k_{-1}[C] \right\},   (12.172)

and the initial conditions should be chosen so as to satisfy (12.165).
The differential index of a DAE is the number of differentiations needed to transform it into an ODE. In Example 12.9, this index is equal to one.
A useful reminder of difficulties that may be encountered when solving a DAE
with tools intended for ODEs is [44].
Remark 12.15 Many methods for solving BVPs for ODEs also apply mutatis mutandis to PDEs, so this part may serve as an introduction to the next chapter.
The muzzle velocity v_0 of the shell is fixed, so the gunner can only choose the aiming angle θ, in the open interval (0, π/2). When drag is neglected, the shell altitude before impact satisfies

y_{shell}(t) = (v_0 \sin θ)(t - t_0) - \frac{g}{2}(t - t_0)^2,   (12.173)

with g the acceleration due to gravity and t_0 the instant of time at which the cannon was fired. The horizontal distance covered by the shell before impact is such that

x_{shell}(t) = (v_0 \cos θ)(t - t_0).   (12.174)
The gunner must thus find θ such that there exists t > t_0 at which x_{shell}(t) = x_{target} and y_{shell}(t) = 0, or equivalently

x_{target} = (v_0 \cos θ)(t - t_0),   (12.175)

(v_0 \sin θ)(t - t_0) = \frac{g}{2}(t - t_0)^2.   (12.176)
(12.176)
This is a two-endpoint BVP, as we have partial information on the initial and final
states of the shell. For any feasible numerical value of , computing the shell trajectory
is an IVP with a unique solution, but this does not imply that the solution of the BVP
is unique or even exists.
This example is so simple that the number of solutions is easy to find analytically.
Solve (12.176) for (t t0 ) and plug the result in (12.175) to get
xtarget = 2 sin( ) cos( )
v02
v2
= sin(2 ) 0 .
g
g
(12.177)
For to exist, xtarget must thus not exceed the maximal range v02 /g of the gun. For any
attainable xtarget , there are generically two values 1 and 2 of for which (12.177)
is satisfied, as any ptanque player knows. These values are symmetric with respect
to = /4, and the maximal range is reached when 1 = 2 = /4. Depending on
the conditions imposed on the final state, the number of solutions of this BVP may
thus be zero, one, or two.
Not knowing a priori whether a solution exists is a typical difficulty with BVPs.
We assume in what follows that the BVP has at least one solution.
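For this particular BVP, the two solutions can be computed in closed form and checked by substitution. A Python sketch with illustrative numerical values (not from the book):

```python
import math

v0, g = 100.0, 9.81                  # muzzle velocity and gravity (illustrative)
x_target = 500.0                     # below the maximal range v0**2/g

# Invert (12.177): sin(2*theta) = g*x_target/v0**2, two solutions in (0, pi/2)
s = g * x_target / v0**2
theta1 = 0.5 * math.asin(s)
theta2 = math.pi / 2 - theta1        # symmetric with respect to pi/4

def impact_distance(theta):
    """Solve (12.176) for the flight time, then plug it into (12.175)."""
    t_flight = 2 * v0 * math.sin(theta) / g
    return v0 * math.cos(theta) * t_flight

d1, d2 = impact_distance(theta1), impact_distance(theta2)
```

Both angles put the shell on the target, illustrating the non-uniqueness of this BVP's solution.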
Assume now that the ODE to be solved is specified by (12.180), together with boundary conditions at t_0 and t_f, and that it is not possible (or not desirable) to put it in state-space form. The principle of the finite-difference method (FDM) is then as follows:
• Discretize the interval of interest for the independent variable t, using regularly spaced points t_l. If the approximate solution is to be computed at t_l, l = 1, ..., N, make sure that the grid also contains any additional points needed to take into account the information provided by the boundary conditions.
• Substitute finite-difference approximations for the derivatives y^{(j)} in (12.180), for instance using the centered-difference approximations

\dot{y}_l \approx \frac{Y_{l+1} - Y_{l-1}}{2h}   (12.181)

and

\ddot{y}_l \approx \frac{Y_{l+1} - 2Y_l + Y_{l-1}}{h^2}.   (12.182)
Because the finite-difference approximations are local (they involve only a few grid points close to those at which the derivative is approximated), the linear systems to be solved are sparse, and often diagonally dominant.
Example 12.10 Assume that the time-varying linear ODE

\ddot{y}(t) + a_1(t)\dot{y}(t) + a_2(t)y(t) = u(t)   (12.184)

must satisfy the boundary conditions y(t_0) = y_0 and y(t_f) = y_f, with t_0, t_f, y_0 and y_f known (such conditions on the value of the solution at the boundary of the domain are called Dirichlet conditions). Assume also that the coefficients a_1(t), a_2(t) and the input u(t) are known for any t in [t_0, t_f].
Rather than using a shooting method to find the appropriate value for \dot{y}(t_0), take the grid

t_l = t_0 + lh, \quad l = 0, \ldots, N + 1, \quad \text{with} \quad h = \frac{t_f - t_0}{N + 1},   (12.185)

which has N interior points (not counting the boundary points t_0 and t_f). Denote by Y_l the approximate value of y(t_l) to be computed (l = 1, ..., N), with Y_0 = y_0 and Y_{N+1} = y_f. Plug (12.181) and (12.182) into (12.184) to get

\frac{Y_{l+1} - 2Y_l + Y_{l-1}}{h^2} + a_1(t_l)\, \frac{Y_{l+1} - Y_{l-1}}{2h} + a_2(t_l)\, Y_l = u(t_l).   (12.186)
Rearrange (12.186) as

a_l Y_{l-1} + b_l Y_l + c_l Y_{l+1} = h^2 u_l,   (12.187)

with

a_l = 1 - \frac{h}{2}\, a_1(t_l), \quad
b_l = h^2 a_2(t_l) - 2, \quad
c_l = 1 + \frac{h}{2}\, a_1(t_l), \quad
u_l = u(t_l).   (12.188)
The resulting system of linear equations can be written as

A x = b,   (12.189)

with

A = \begin{pmatrix}
b_1 & c_1 & 0 & \cdots & \cdots & 0 \\
a_2 & b_2 & c_2 & \ddots & & \vdots \\
0 & a_3 & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & 0 \\
\vdots & & \ddots & \ddots & b_{N-1} & c_{N-1} \\
0 & \cdots & \cdots & 0 & a_N & b_N
\end{pmatrix},   (12.190)

x = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{N-1} \\ Y_N \end{pmatrix}
\quad \text{and} \quad
b = \begin{pmatrix} h^2 u_1 - a_1 y_0 \\ h^2 u_2 \\ \vdots \\ h^2 u_{N-1} \\ h^2 u_N - c_N y_f \end{pmatrix}.   (12.191)
Since A is tridiagonal, solving (12.189) for x has very low complexity and can be
achieved quickly, even for large N . Moreover, the method can be used for unstable
ODEs, contrary to shooting.
Remark 12.18 The finite-difference approach may also be used to solve IVPs or
DAEs.
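A Python sketch of the whole procedure (the tridiagonal solve uses the Thomas algorithm; all names are mine). It is exercised on ÿ = 0 with y(0) = 1 and y(2) = 3, for which the exact solution is the straight line y = 1 + t, reproduced by the scheme up to rounding:

```python
def solve_tridiagonal(a, b, c, d):
    """Thomas algorithm for a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def fdm_bvp(a1, a2, u, t0, tf, y0, yf, N):
    """Finite-difference solution of (12.184) with Dirichlet conditions,
    following (12.185)-(12.191)."""
    h = (tf - t0) / (N + 1)
    t = [t0 + (l + 1) * h for l in range(N)]       # interior points
    al = [1 - h / 2 * a1(tl) for tl in t]
    bl = [h**2 * a2(tl) - 2 for tl in t]
    cl = [1 + h / 2 * a1(tl) for tl in t]
    d = [h**2 * u(tl) for tl in t]
    d[0] -= al[0] * y0                             # boundary contributions
    d[-1] -= cl[-1] * yf
    return solve_tridiagonal(al, bl, cl, d)

# y'' = 0, y(0) = 1, y(2) = 3: exact solution y = 1 + t
Y = fdm_bvp(lambda t: 0.0, lambda t: 0.0, lambda t: 0.0,
            0.0, 2.0, 1.0, 3.0, 9)
```

The Thomas algorithm exploits the tridiagonal structure of (12.190), so the cost grows only linearly with N.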
12.3.4.1 Collocation

For Example 12.10, collocation methods determine an approximate solution \hat{y}_N ∈ S(r, k, Δ) such that the N following equations are satisfied:

\ddot{\hat{y}}_N(x_i) + a_1(x_i)\, \dot{\hat{y}}_N(x_i) + a_2(x_i)\, \hat{y}_N(x_i) = u(x_i), \quad i = 1, \ldots, N - 2,   (12.193)

\hat{y}_N(t_0) = y_0 \quad \text{and} \quad \hat{y}_N(t_f) = y_f.   (12.194)

The x_i's at which \hat{y}_N must satisfy the ODE are the collocation points. Evaluating the derivatives of \hat{y}_N that appear in (12.193) is easy, as \hat{y}_N(·) is polynomial in any given subinterval. For S(3, 2, Δ), there is no need to introduce additional equations because of the differentiability constraints, so x_i = t_i and N = n + 1.
More information on the collocation approach to solving BVPs, including the
consideration of nonlinear problems, is in [49]. Information on the MATLAB solver
bvp4c can be found in [9, 50].
Assume that the solution must satisfy

y(t_j) = y_j, \quad j = 1, \ldots, m,   (12.196)

with the y_j's known. To take (12.196) into account, approximate y(t) by a linear combination \hat{y}_N(t) of known basis functions (for instance splines)

\hat{y}_N(t) = \sum_{j=1}^{N} x_j \varphi_j(t) + \varphi_0(t),   (12.197)

where

\varphi_0(t_i) = y_i, \quad i = 1, \ldots, m,   (12.198)

and

\varphi_j(t_i) = 0, \quad j = 1, \ldots, N, \quad i = 1, \ldots, m.   (12.199)
Here ⟨·, ·⟩ denotes the inner product in the function space, and the ψ_i's are known test functions. We choose basis and test functions that are square integrable on I, and take

\langle f_1, f_2 \rangle = \int_I f_1(\tau)\, f_2(\tau)\, d\tau.   (12.204)
Since

\langle L(\hat{y}_N - \varphi_0), \psi_i \rangle = \langle L(\Phi^T x), \psi_i \rangle,   (12.205)

the unknown vector x can be computed by solving a system of N linear equations in N unknowns,

A x = b.   (12.206)

The Ritz-Galerkin methods usually take identical basis and test functions, such that

\varphi_i \in S(r, k, Δ) \quad \text{and} \quad \psi_i = \varphi_i, \quad i = 1, \ldots, N.   (12.207)
Example 12.11 Consider again Example 12.10, where

L_t(y) = \ddot{y}(t) + a_1(t)\dot{y}(t) + a_2(t)y(t).   (12.209)
The function φ_0(t) must satisfy the boundary conditions; for instance

\varphi_0(t) = \frac{y_f - y_0}{t_f - t_0}(t - t_0) + y_0.   (12.211)
Equation (12.206) is satisfied, with

a_{i,j} = \int_I \left[ \ddot{\varphi}_j(\tau) + a_1(\tau)\dot{\varphi}_j(\tau) + a_2(\tau)\varphi_j(\tau) \right] \psi_i(\tau)\, d\tau   (12.212)

and

b_i = \int_I \left[ u(\tau) - L_\tau(\varphi_0) \right] \psi_i(\tau)\, d\tau,   (12.213)

for i = 1, ..., N and j = 1, ..., N.
Integration by parts may be used to decrease the number of derivations needed in (12.212) and (12.213). Since (12.199) translates into

\varphi_i(t_0) = \varphi_i(t_f) = 0, \quad i = 1, \ldots, N,   (12.214)

we have

\int_I \ddot{\varphi}_j(\tau)\varphi_i(\tau)\, d\tau = -\int_I \dot{\varphi}_j(\tau)\dot{\varphi}_i(\tau)\, d\tau,   (12.215)

\int_I \ddot{\varphi}_0(\tau)\varphi_i(\tau)\, d\tau = -\int_I \dot{\varphi}_0(\tau)\dot{\varphi}_i(\tau)\, d\tau.   (12.216)
The definite integrals involved are often evaluated by Gaussian quadrature on each of the subintervals generated by Δ. If the total number of quadrature points were equal to the dimension of x, Ritz-Galerkin would amount to collocation at these quadrature points, but more quadrature points are used in general [45].
The Ritz-Galerkin methodology can be extended to nonlinear problems.
The least-squares approach minimizes the integral of the squared residual

e_x(t) = L_t(\hat{y}_N) - u(t) = L_t(\Phi^T x) + L_t(\varphi_0) - u(t),   (12.217)

that is,

J(x) = \int_I e_x^2(\tau)\, d\tau,   (12.218)

which again leads to solving a linear system Ax = b, with

A = \int_I [L(\Phi)][L(\Phi)]^T\, d\tau   (12.220)

and

b = \int_I [L(\Phi)]\left[ u(\tau) - L(\varphi_0) \right] d\tau.   (12.221)
See [52] for more details (including a more general type of boundary condition
and the treatment of systems of ODEs) and a comparison with the results obtained
with the Ritz-Galerkin method on numerical examples. A comparison of the three
projection approaches of Sect. 12.3.4 can be found in [53, 54].
12.4.1.1 RK(4)

We take advantage of (12.98), which implies for RK(4) that

R(z) = 1 + z + \frac{z^2}{2} + \frac{z^3}{6} + \frac{z^4}{24}.   (12.222)
The region of absolute stability is the set of all z's such that |R(z)| ≤ 1. The script

clear all
[X,Y] = meshgrid(-3:0.05:1,-3:0.05:3);
Z = X + i*Y;
modR = abs(1 + Z + Z.^2/2 + Z.^3/6 + Z.^4/24);
GoodR = ((1-modR) + abs(1-modR))/2;
% 3D surface plot
figure;
surf(X,Y,GoodR);
colormap(gray)
xlabel('Real part of z')
ylabel('Imaginary part of z')
zlabel('Margin of stability')
% Filled 2D contour plot
figure;
contourf(X,Y,GoodR,15);
colormap(gray)
xlabel('Real part of z')
ylabel('Imaginary part of z')

yields Figs. 12.5 and 12.6.
Fig. 12.5 3D visualization of the margin of stability of RK(4) on Dahlquist's test; the region in black is unstable
Fig. 12.6 Contour plot of the margin of stability of RK(4) on Dahlquist's test; the region in black is unstable
Fig. 12.7 Absolute stability region is in gray for AB(1), in black for AB(2)
For AB(1), the boundary locus (12.114) is

z(\theta) = e^{i\theta} - 1,   (12.223)

and for AB(2) it is

z(\theta) = \frac{e^{2i\theta} - e^{i\theta}}{1.5\, e^{i\theta} - 0.5}.   (12.226)

Equations (12.223) and (12.226) suggest the following script, used to produce Fig. 12.7.
clear all
theta = 0:0.001:2*pi;
zeta = exp(i*theta);
hold on
% Filled area 2D plot for AB(1)
boundaryAB1 = zeta - 1;
area(real(boundaryAB1), imag(boundaryAB1),...
    'FaceColor',[0.5 0.5 0.5]); % Grey
xlabel('Real part of z')
ylabel('Imaginary part of z')
grid on
axis equal
% Filled area 2D plot for AB(2)
boundaryAB2 = (zeta.^2 - zeta)./(1.5*zeta - 0.5);
area(real(boundaryAB2), imag(boundaryAB2),...
    'FaceColor',[0 0 0]); % Black
Consider the flame propagation model

\dot{y} = y^2 - y^3, \quad y(0) = y_0,   (12.227)
where y(t) is the ball diameter at time t. This diameter increases monotonically from its initial value y_0 < 1 to its asymptotic value y = 1. For this asymptotic value, the rate of oxygen consumption inside the ball (proportional to y^3) balances the rate of oxygen delivery through the surface of the ball (proportional to y^2), and \dot{y} = 0. The smaller y_0 is, the stiffer the solution becomes, which makes this example particularly suitable for illustrating the influence of stiffness on the performance of ODE solvers [11]. All the solutions will be computed for times ranging from 0 to 2/y_0.
The following script calls ode45, a solver for non-stiff ODEs, with y_0 = 0.1 and a relative tolerance set to 10^{-4}.

clear all
y0 = 0.1;
f = @(t,y) y^2 - y^3;
option = odeset('RelTol',1.e-4);
ode45(f,[0 2/y0],y0,option);
xlabel('Time')
ylabel('Diameter')

It yields Fig. 12.8 in about 1.2 s. The solution is plotted as it unfolds.
Replacing the second line of this script by y0 = 0.0001; to make the system stiffer, we get Fig. 12.9 in about 84.8 s. The progression after the jump becomes very slow.
Instead of ode45, the next script calls ode23s, a solver for stiff ODEs, again with y_0 = 0.0001 and with the same relative tolerance.

clear all
y0 = 0.0001;
f = @(t,y) y^2 - y^3;
option = odeset('RelTol',1.e-4);
ode23s(f,[0 2/y0],y0,option);
xlabel('Time')
ylabel('Diameter')

It yields Fig. 12.10 in about 2.8 s. While ode45 crawled painfully after the jump to keep the local method error under control, ode23s achieved the same result with far fewer evaluations of \dot{y}.
Fig. 12.8 ode45 on flame propagation with y0 = 0.1
Fig. 12.9 ode45 on flame propagation with y0 = 0.0001
Fig. 12.10 ode23s on flame propagation with y0 = 0.0001
Had we used ode15s, another solver for stiff ODEs, the approximate solution
would have been obtained in about 4.4 s (for the same relative tolerance). This is
more than with ode23s, but still much less than with ode45. These results are
consistent with the MATLAB documentation, which states that ode23s may be
more efficient than ode15s at crude tolerances and can solve some kinds of stiff
problems for which ode15s is not effective. It is so simple to switch from one ODE
solver to another that one should not hesitate to experiment on the problem of interest
in order to make an informed choice.
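The same stiffness phenomenon can be reproduced outside MATLAB. The Python sketch below (names mine) applies explicit and implicit Euler to (12.227) near the plateau y = 1, where f'(1) = −1 restricts explicit Euler to h < 2, while implicit Euler (its scalar step equation solved by Newton iteration) tolerates the larger step:

```python
def f(y):
    """Right-hand side of the flame model (12.227)."""
    return y**2 - y**3

def explicit_euler(y, h, n):
    for _ in range(n):
        y = y + h * f(y)
    return y

def implicit_euler(y, h, n):
    """Each step solves z = y + h*f(z) for z by Newton iteration."""
    for _ in range(n):
        z = y
        for _ in range(50):
            g = z - y - h * (z**2 - z**3)
            dg = 1.0 - h * (2 * z - 3 * z**2)
            z -= g / dg
        y = z
    return y

h = 2.5                               # violates the explicit bound h < 2
y_exp = explicit_euler(1.2, h, 200)   # oscillates around 1, never settles
y_imp = implicit_euler(1.2, h, 200)   # converges to the plateau y = 1
```

The explicit iterates bounce around the equilibrium indefinitely, while the implicit ones converge, despite both using the same step-size.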
The initial condition is

x(0) = (1 \quad 0)^T,   (12.229)

and data are to be collected at the measurement times t ∈ {0, 1, 2, 4, 7, 10, 20, 30}.
Notice that these times are not regularly spaced. The ODE solver will have to produce
solutions at these specific instants of time as well as on a grid appropriate for plotting
the underlying continuous solutions. This is achieved by the following function,
which generates the data in Fig. 12.11:
function Compartments
% Parameters
p = [0.6;0.15;0.35];
% Initial conditions
x0 = [1;0];
% Measurement times and range
Times = [0,1,2,4,7,10,20,30];
Range = [0:0.01:30];
% Solver options
options = odeset('RelTol',1e-6);
% Solving Cauchy problem
% Solver called twice,
% for range and points
[t,X] = SimulComp(Times,x0,p);
[r,Xr] = SimulComp(Range,x0,p);
function [t,X] = SimulComp(RangeOrTimes,x0,p)
[t,X] = ode45(@Compart,RangeOrTimes,x0,options);
function [xDot]= Compart(t,x)
% Defines the compartmental state equation
M = [-(p(1)+p(3)), p(2);p(1),-p(2)];
xDot = M*x;
end
end
Fig. 12.11 Data generated for the compartmental model of Fig. 12.1 by ode45 for x(0) = (1, 0)T
and p = (0.6, 0.15, 0.35)T
% Plotting results
figure;
hold on
plot(t,X(:,1),'ks'); plot(t,X(:,2),'ko');
plot(r,Xr(:,1)); plot(r,Xr(:,2));
legend('x_1','x_2'); ylabel('State'); xlabel('Time')
end
Assume now that the true value of the parameter vector is
p = (0.6 0.35 0.15)T ,
(12.232)
which corresponds to exchanging the values of p2 and p3 . Compartments now
produces the data described by Fig. 12.12.
While the solutions for x1 are quite different in Figs. 12.11 and 12.12, the solutions
for x2 are extremely similar, as confirmed by Fig. 12.13, which corresponds to their
difference.
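Since the model is linear, this similarity can be checked independently of any ODE solver, via the matrix exponential x(t) = exp(Mt) x(0). The NumPy sketch below (not from the book; names are mine) shows that exchanging p2 and p3 leaves x2 unchanged up to roundoff.

```python
import numpy as np

# Check (not from the book) that exchanging p2 and p3 leaves the
# solution for x2 unchanged, using exp(Mt) via an eigendecomposition
# (M is diagonalizable with distinct real eigenvalues here).
def x2_traj(p, t_grid):
    M = np.array([[-(p[0] + p[2]), p[1]],
                  [p[0], -p[1]]])
    w, V = np.linalg.eig(M)
    Vinv = np.linalg.inv(V)
    x0 = np.array([1.0, 0.0])
    return np.array([(V @ (np.exp(w * t) * (Vinv @ x0))).real[1]
                     for t in t_grid])

t_grid = np.linspace(0.0, 30.0, 301)
d = x2_traj([0.6, 0.15, 0.35], t_grid) - x2_traj([0.6, 0.35, 0.15], t_grid)
# d is zero up to roundoff, in agreement with the identifiability analysis
```

Indeed, eliminating x1 shows that x2 obeys x2'' + (p1 + p2 + p3) x2' + p2 p3 x2 = 0 with x2(0) = 0 and x2'(0) = p1, which is symmetric in p2 and p3.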
This is actually not surprising, because an identifiability analysis [55] would show
that the parameters of this model cannot be estimated uniquely from measurements
carried out on x2 alone, as exchanging the role of p2 and p3 always leaves the solution
for x2 unchanged. See also Sect. 16.22. Had we tried to estimate
p with any of the
Fig. 12.12 Data generated for the compartmental model of Fig. 12.1 by ode45 for x(0) = (1, 0)T
and p = (0.6, 0.35, 0.15)T
Fig. 12.13 Difference between the solutions for x2 when p = (0.6, 0.15, 0.35)T and when p =
(0.6, 0.35, 0.15)T , as computed by ode45
T(r) = g(x(r)), (12.236)
with
f(x, r) = [0 1; 0 −1/r] x(r) (12.237)
and
g(x(r)) = (1 0) x(r). (12.238)
This BVP can be solved analytically, which provides the reference solution to which
the solutions obtained by numerical methods will be compared.
The analytical solution takes the form
T(r) = p1 ln(r) + p2, (12.240)
with p1 and p2 specified by the boundary conditions and obtained by solving the
linear system
[ln(rin) 1; ln(rout) 1] (p1, p2)^T = (T(rin), T(rout))^T. (12.241)
The following script evaluates and plots the analytical solution on a regular grid from
r = 1 to r = 2 as
Radius = (1:0.01:2);
A = [log(1),1;log(2),1];
b = [100;20];
p = A\b;
MathSol = p(1)*log(Radius)+p(2);
figure;
plot(Radius,MathSol)
xlabel('Radius')
ylabel('Temperature')
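The same computation transcribes directly into Python/NumPy (names are mine; np.linalg.solve stands in for the MATLAB backslash):

```python
import numpy as np

# Python counterpart of the script above: solve (12.241) for p1 and p2,
# then evaluate T(r) = p1*ln(r) + p2 on a regular grid from r = 1 to 2.
A = np.array([[np.log(1.0), 1.0],
              [np.log(2.0), 1.0]])
b = np.array([100.0, 20.0])            # T(r_in), T(r_out)
p = np.linalg.solve(A, b)
radius = np.linspace(1.0, 2.0, 101)
math_sol = p[0] * np.log(radius) + p[1]
```

The resulting profile interpolates the two boundary temperatures and decreases monotonically with the radius.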
It yields Fig. 12.14. The numerical methods used in Sects. 12.4.4.2–12.4.4.4 for solving this BVP produce plots that are visually indistinguishable from Fig. 12.14, so the
errors between the numerical and analytical solutions will be plotted instead.
Fig. 12.15 Error on the distribution of temperatures in the pipe, as computed by the shooting
method
a1(r) = 1/r, (12.242)
a2(r) = 0, (12.243)
u(r) = 0. (12.244)
This is implemented in the following script, in which sAgrid and sbgrid are
sparse representations of A and b as defined by (12.190) and (12.191).
% Solving pipe problem by FDM
clear all
% Boundary values
InitialSol = 100;
FinalSol = 20;
% Grid specification
Step = 0.001; % step-size
Grid = (1:Step:2);
NGrid = length(Grid);
% Np = number of grid points where
% the solution is unknown
Np = NGrid-2;
Radius = zeros(Np,1);
for i = 1:Np;
Radius(i) = Grid(i+1);
end
% Building up the sparse system of linear
% equations to be solved
a = zeros(Np,1);
c = zeros(Np,1);
HalfStep = Step/2;
for i=1:Np,
a(i) = 1-HalfStep/Radius(i);
c(i) = 1+HalfStep/Radius(i);
end
sAgrid = -2*sparse(1:Np,1:Np,1);
sAgrid(1,2) = c(1);
sAgrid(Np,Np-1) = a(Np);
for i=2:Np-1,
sAgrid(i,i+1) = c(i);
sAgrid(i,i-1) = a(i);
end
sbgrid = sparse(1:Np,1,0);
sbgrid(1) = -a(1)*InitialSol;
sbgrid(Np) = -c(Np)*FinalSol;
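A dense NumPy sketch of the same finite-difference scheme follows (the MATLAB script uses sparse storage, which matters for large grids; for this small 1D check, dense tridiagonal matrices are enough, and the names are mine):

```python
import numpy as np

# Same scheme as the MATLAB script: at interior point i,
#   a(i)*T(i-1) - 2*T(i) + c(i)*T(i+1) = 0,
# with a(i) = 1 - h/(2*r_i) and c(i) = 1 + h/(2*r_i).
T_in, T_out = 100.0, 20.0
grid = np.linspace(1.0, 2.0, 101)
step = grid[1] - grid[0]
r = grid[1:-1]                          # interior points
n = r.size
a = 1.0 - step / (2.0 * r)              # sub-diagonal coefficients
c = 1.0 + step / (2.0 * r)              # super-diagonal coefficients
A = -2.0 * np.eye(n) + np.diag(c[:-1], 1) + np.diag(a[1:], -1)
b = np.zeros(n)
b[0], b[-1] = -a[0] * T_in, -c[-1] * T_out
T = np.linalg.solve(A, b)
# Compare with the analytical solution T(r) = p1*ln(r) + p2
p = np.linalg.solve([[np.log(1.0), 1.0], [np.log(2.0), 1.0]],
                    [T_in, T_out])
err = T - (p[0] * np.log(r) + p[1])
```

With this second-order scheme the error should shrink roughly by a factor of 100 each time the step-size is divided by 10, which is consistent with the error levels reported for the finer grid of the text.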
Fig. 12.16 Error on the distribution of temperatures in the pipe, as computed by the finite-difference
method
the call to bvp4c. Finally, the function deval is in charge of evaluating the approximate solution provided by bvp4c on the same grid as used for the mathematical
solution.
% Solving pipe problem by collocation
clear all
% Choosing a starting point
Radius = (1:0.1:2); % Initial mesh
xInit = [0; 0]; % Initial guess for the solution
% Building structure for initial guess
PipeInit = bvpinit(Radius,xInit);
% Calling the collocation solver
SolByColloc = bvp4c(@PipeODE,...
@PipeBounds,PipeInit);
VisuCollocSol = deval(SolByColloc,Radius);
% Comparing with mathematical solution
A = [log(1),1;log(2),1];
Fig. 12.17 Error on the distribution of temperatures in the pipe, computed by the collocation method as implemented in bvp4c with RelTol = 10^−3
b = [100;20];
p = A\b;
MathSol = p(1)*log(Radius)+p(2);
Error = MathSol-VisuCollocSol(1,:);
% Plotting error
figure;
plot(Radius,Error)
xlabel('Radius')
ylabel('Error on temperature of the collocation method')
The results are in Fig. 12.17. A more accurate solution can be obtained by decreasing
the relative tolerance from its default value of 10^−3 (one could also make a more
educated guess to be passed to bvp4c by bvpinit). By just replacing the call to
bvp4c in the previous script by
optionbvp = bvpset('RelTol',1e-6);
SolByColloc = bvp4c(@PipeODE,...
@PipeBounds,PipeInit,optionbvp);
we get the results in Fig. 12.18.
Fig. 12.18 Error on the distribution of temperatures in the pipe, computed by the collocation method as implemented in bvp4c with RelTol = 10^−6
12.5 In Summary
ODEs have only one independent variable, which is not necessarily time.
Most methods for solving ODEs require them to be put in state-space form, which
is not always possible or desirable.
IVPs are simpler to solve than BVPs.
Solving stiff ODEs with solvers for non-stiff ODEs is possible, but very slow.
The methods available to solve IVPs may be explicit or implicit, one step or
multistep.
Implicit methods have better stability properties than explicit methods. They are,
however, more complex to implement, unless their equations can be put in explicit
form.
Explicit single-step methods are self-starting. They can be used to initialize multistep methods.
Most single-step methods require intermediary evaluations of the state derivative that cannot be reused. This tends to make them less efficient than multistep
methods.
Multistep methods need single-step methods to start. They should make a more
efficient use of the evaluations of the state derivative but are less robust to rough
seas.
It is often useful to adapt step-size along the state trajectory, which is easy with
single-step methods.
It is often useful to adapt method order along the state trajectory, which is easy
with multistep methods.
The solution of BVPs may be via shooting methods and the minimization of a
norm of the deviation of the solution from the boundary conditions, provided that
the ODE is stable.
Finite-difference methods do not require the ODEs to be put in state-space form.
They can be used to solve IVPs and BVPs. An important ingredient is the solution
of (large, sparse) systems of linear equations.
The projection approaches are based on finite-dimensional approximations of the
ODE. The free parameters of these approximations are evaluated by solving a
system of equations (collocation and Ritz-Galerkin approaches) or by minimizing
a quadratic cost function (least-squares approach).
Understanding finite-difference and projection approaches for ODEs should facilitate the study of the same techniques for PDEs.
References
1. Higham, D.: An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Rev. 43(3), 525–546 (2001)
2. Gear, C.: Numerical Initial Value Problems in Ordinary Differential Equations. Prentice-Hall, Englewood Cliffs (1971)
3. Stoer, J., Bulirsch, R.: Introduction to Numerical Analysis. Springer, New York (1980)
4. Gupta, G., Sacks-Davis, R., Tischer, P.: A review of recent developments in solving ODEs. ACM Comput. Surv. 17(1), 5–47 (1985)
5. Shampine, L.: Numerical Solution of Ordinary Differential Equations. Chapman & Hall, New York (1994)
6. Shampine, L., Reichelt, M.: The MATLAB ODE suite. SIAM J. Sci. Comput. 18(1), 1–22 (1997)
7. Shampine, L.: Vectorized solution of ODEs in MATLAB. Scalable Comput. Pract. Exper. 10(4), 337–345 (2009)
8. Ashino, R., Nagase, M., Vaillancourt, R.: Behind and beyond the MATLAB ODE suite. Comput. Math. Appl. 40, 491–512 (2000)
9. Shampine, L., Kierzenka, J., Reichelt, M.: Solving boundary value problems for ordinary differential equations in MATLAB with bvp4c. http://www.mathworks.com/ (2000)
10. Shampine, L., Gladwell, I., Thompson, S.: Solving ODEs with MATLAB. Cambridge University Press, Cambridge (2003)
11. Moler, C.: Numerical Computing with MATLAB, revised reprinted edn. SIAM, Philadelphia (2008)
12. Jacquez, J.: Compartmental Analysis in Biology and Medicine. BioMedware, Ann Arbor (1996)
13. Gladwell, I., Shampine, L., Brankin, R.: Locating special events when solving ODEs. Appl. Math. Lett. 1(2), 153–156 (1988)
14. Shampine, L., Thompson, S.: Event location for ordinary differential equations. Comput. Math. Appl. 39, 43–54 (2000)
15. Moler, C., Van Loan, C.: Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev. 45(1), 3–49 (2003)
16. Al-Mohy, A., Higham, N.: A new scaling and squaring algorithm for the matrix exponential. SIAM J. Matrix Anal. Appl. 31(3), 970–989 (2009)
17. Higham, N.: The scaling and squaring method for the matrix exponential revisited. SIAM Rev. 51(4), 747–764 (2009)
18. Butcher, J., Wanner, G.: Runge-Kutta methods: some historical notes. Appl. Numer. Math. 22, 113–151 (1996)
19. Alexander, R.: Diagonally implicit Runge-Kutta methods for stiff O.D.E.s. SIAM J. Numer. Anal. 14(6), 1006–1021 (1977)
20. Butcher, J.: Implicit Runge-Kutta processes. Math. Comput. 18(85), 50–64 (1964)
21. Steihaug, T., Wolfbrandt, A.: An attempt to avoid exact Jacobian and nonlinear equations in the numerical solution of stiff differential equations. Math. Comput. 33(146), 521–534 (1979)
22. Zedan, H.: Modified Rosenbrock-Wanner methods for solving systems of stiff ordinary differential equations. Ph.D. thesis, University of Bristol, Bristol, UK (1982)
23. Moore, R.: Mathematical Elements of Scientific Computing. Holt, Rinehart and Winston, New York (1975)
24. Moore, R.: Methods and Applications of Interval Analysis. SIAM, Philadelphia (1979)
25. Berz, M., Makino, K.: Verified integration of ODEs and flows using differential algebraic methods on high-order Taylor models. Reliable Comput. 4, 361–369 (1998)
26. Makino, K., Berz, M.: Suppression of the wrapping effect by Taylor model-based verified integrators: long-term stabilization by preconditioning. Int. J. Differ. Equ. Appl. 10(4), 353–384 (2005)
27. Makino, K., Berz, M.: Suppression of the wrapping effect by Taylor model-based verified integrators: the single step. Int. J. Pure Appl. Math. 36(2), 175–196 (2007)
28. Klopfenstein, R.: Numerical differentiation formulas for stiff systems of ordinary differential equations. RCA Rev. 32, 447–462 (1971)
29. Shampine, L.: Error estimation and control for ODEs. J. Sci. Comput. 25(1/2), 3–16 (2005)
30. Dahlquist, G.: A special stability problem for linear multistep methods. BIT Numer. Math. 3(1), 27–43 (1963)
31. LeVeque, R.: Finite Difference Methods for Ordinary and Partial Differential Equations. SIAM, Philadelphia (2007)
32. Hairer, E., Wanner, G.: On the instability of the BDF formulas. SIAM J. Numer. Anal. 20(6), 1206–1209 (1983)
33. Mathews, J., Fink, K.: Numerical Methods Using MATLAB, 4th edn. Prentice-Hall, Upper Saddle River (2004)
34. Bogacki, P., Shampine, L.: A 3(2) pair of Runge-Kutta formulas. Appl. Math. Lett. 2(4), 321–325 (1989)
35. Dormand, J., Prince, P.: A family of embedded Runge-Kutta formulae. J. Comput. Appl. Math. 6(1), 19–26 (1980)
36. Prince, P., Dormand, J.: High order embedded Runge-Kutta formulae. J. Comput. Appl. Math. 7(1), 67–75 (1981)
37. Dormand, J., Prince, P.: A reconsideration of some embedded Runge-Kutta formulae. J. Comput. Appl. Math. 15, 203–211 (1986)
38. Shampine, L.: What everyone solving differential equations numerically should know. In: Gladwell, I., Sayers, D. (eds.) Computational Techniques for Ordinary Differential Equations. Academic Press, London (1980)
39. Skufca, J.: Analysis still matters: a surprising instance of failure of Runge-Kutta-Fehlberg ODE solvers. SIAM Rev. 46(4), 729–737 (2004)
40. Shampine, L., Gear, C.: A user's view of solving stiff ordinary differential equations. SIAM Rev. 21(1), 1–17 (1979)
41. Segel, L., Slemrod, M.: The quasi-steady-state assumption: a case study in perturbation. SIAM Rev. 31(3), 446–477 (1989)
42. Duchêne, P., Rouchon, P.: Kinetic scheme reduction, attractive invariant manifold and slow/fast dynamical systems. Chem. Eng. Sci. 53, 4661–4672 (1996)
43. Boulier, F., Lefranc, M., Lemaire, F., Morant, P.E.: Model reduction of chemical reaction systems using elimination. Math. Comput. Sci. 5, 289–301 (2011)
44. Petzold, L.: Differential/algebraic equations are not ODEs. SIAM J. Sci. Stat. Comput. 3(3), 367–384 (1982)
45. Reddien, G.: Projection methods for two-point boundary value problems. SIAM Rev. 22(2), 156–171 (1980)
46. de Boor, C.: Package for calculating with B-splines. SIAM J. Numer. Anal. 14(3), 441–472 (1977)
47. Farouki, R.: The Bernstein polynomial basis: a centennial retrospective. Comput. Aided Geom. Des. 29, 379–419 (2012)
48. Bhatti, M., Bracken, P.: Solution of differential equations in a Bernstein polynomial basis. J. Comput. Appl. Math. 205, 272–280 (2007)
49. Russell, R., Shampine, L.: A collocation method for boundary value problems. Numer. Math. 19, 1–28 (1972)
50. Kierzenka, J., Shampine, L.: A BVP solver based on residual control and the MATLAB PSE. ACM Trans. Math. Softw. 27(3), 299–316 (2001)
51. Gander, M., Wanner, G.: From Euler, Ritz, and Galerkin to modern computing. SIAM Rev. 54(4), 627–666 (2012)
52. Lotkin, M.: The treatment of boundary problems by matrix methods. Am. Math. Mon. 60(1), 11–19 (1953)
53. Russell, R., Varah, J.: A comparison of global methods for linear two-point boundary value problems. Math. Comput. 29(132), 1007–1019 (1975)
54. de Boor, C., Swartz, B.: Comments on the comparison of global methods for linear two-point boundary value problems. Math. Comput. 31(140), 916–921 (1977)
55. Walter, E.: Identifiability of State Space Models. Springer, Berlin (1982)
Chapter 13
13.1 Introduction
Contrary to the ordinary differential equations (or ODEs) considered in Chap. 12,
partial differential equations (or PDEs) involve more than one independent variable.
Knowledge-based models of physical systems typically involve PDEs (Maxwell's
in electromagnetism, Schrödinger's in quantum mechanics, Navier–Stokes in fluid
dynamics, Fokker–Planck's in statistical mechanics, etc.). It is only in very special
situations that PDEs simplify into ODEs. In chemical engineering, for example,
concentrations of chemical species generally obey PDEs. It is only in continuous
stirred tank reactors (CSTRs) that they can be considered as position-independent
and that time becomes the only independent variable.
The study of the mathematical properties of PDEs is considerably more involved
than for ODEs. Proving, for instance, the existence and smoothness of Navier–Stokes
solutions on R³ (or giving a counterexample) would be one of the achievements for
which the Clay Mathematics Institute is ready, since May 2000, to award one of
its seven one-million-dollar Millennium Prizes.
This chapter will just scratch the surface of PDE simulation. Good starting points
to go further are [1], which addresses the modeling of real-life problems, the analysis
of the resulting PDE models and their numerical simulation via a finite-difference
approach, [2], which develops many finite-difference schemes with applications in
computational fluid dynamics and [3], where finite-difference and finite-element
methods are both considered. Each of these books treats many examples in detail.
13.2 Classification
The methods for solving PDEs depend, among other things, on whether they are linear
or not, on their order, and on the type of boundary conditions being considered.
The Laplace equation
∂²y/∂x₁² + ∂²y/∂x₂² = 0 (13.1)
is linear, whereas the Burgers equation
∂y/∂t + y ∂y/∂x − ν ∂²y/∂x² = 0, (13.2)
where y(t, x) is the fluid velocity and ν its viscosity, is nonlinear, as the second term
in its left-hand side involves the product of y by its partial derivative with respect
to x.
The PDE (13.4) is equivalent to
∂²y/∂t∂x₁ + ∂²y/∂x₁² = ∂²y/∂x₂². (13.5)
Its order is thus two.
interested, for instance, in the temperature and chemical composition at time t and
space coordinates specified by x in a plug-flow reactor. Such problems, which involve
several domains of physics and chemistry (here, fluid mechanics, thermodynamics,
and chemical kinetics), pertain to what is called multiphysics.
To simplify notation, we write
yx ≜ ∂y/∂x, yxx ≜ ∂²y/∂x², yxt ≜ ∂²y/∂x∂t, (13.6)
and so forth. The Laplacian operator, for instance, is then such that
Δy = ytt + yxx. (13.7)
Consider a second-order PDE in the two independent variables t and x,
a ytt + 2b yxt + c yxx = g(t, x, y, yt, yx), (13.8)
together with the differentials
dyt = ytt dt + yxt dx, (13.9)
dyx = ytx dt + yxx dx. (13.10)
Since yxt = ytx, the following system of linear equations must hold true
M (ytt, yxt, yxx)^T = (g(t, x, y, yt, yx), dyt, dyx)^T, (13.11)
where
M = [a 2b c; dt dx 0; 0 dt dx]. (13.12)
Curves along which M is singular are thus such that
a (dx/dt)² − 2b (dx/dt) + c = 0. (13.14)
dx/dt = (b ± √(b² − ac)) / a. (13.15)
They define the characteristic curves of the PDE. The number of real solutions
depends on the sign of the discriminant b² − ac.
• When b² − ac < 0, there is no real characteristic curve and the PDE is elliptic.
• When b² − ac = 0, there is a single real characteristic curve and the PDE is parabolic.
• When b² − ac > 0, there are two real characteristic curves and the PDE is hyperbolic.
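This classification rule is mechanical enough to transcribe directly (a trivial sketch, not from the book):

```python
# Classify a PDE a*ytt + 2*b*ytx + c*yxx = g(...) from the
# discriminant b^2 - a*c of its highest-order coefficients.
def pde_type(a, b, c):
    disc = b * b - a * c
    if disc < 0:
        return "elliptic"
    if disc == 0:
        return "parabolic"
    return "hyperbolic"
```

For example, the Laplace equation (a = c = 1, b = 0) is elliptic, the heat equation (a = b = 0, c ≠ 0) is parabolic, and the wave equation (a = 1, b = 0, c = −1) is hyperbolic.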
This classification depends only on the coefficients of the highest-order derivatives
in the PDE. The qualifiers of these three types of PDEs have been chosen because
the quadratic equation
a(dx)² − 2b(dx)(dt) + c(dt)² = constant
(13.16)
defines an ellipse, a parabola, or a hyperbola in the (dt, dx) plane, depending on the sign of b² − ac.
Example 13.3 Aircraft flying at Mach 0.7 will be heard by ground observers
everywhere around, and the PDE describing sound propagation during such a subsonic flight is elliptic. When speed is increased to Mach 1, a front develops ahead of
which the noise is no longer heard; this front corresponds to a single real characteristic curve, and the PDE describing sound propagation during sonic flight is parabolic.
When speed is increased further, the noise is only heard within Mach lines, which
form a pair of real characteristic curves, and the PDE describing sound propagation
Fig. 13.1 Regular grid
during supersonic flight is hyperbolic. The real characteristic curves, if any, thus
patch radically different solutions.
A regular grid (Fig. 13.1) is defined by
tl = t1 + (l − 1)ht, (13.21)
xm = x1 + (m − 1)hx. (13.22)
yt(tl, xm) ≈ (Yl,m − Yl−1,m)/ht, (13.25)
yt(tl, xm) ≈ (Yl+1,m − Yl,m)/ht, (13.26)
or
yt(tl, xm) ≈ (Yl+1,m − Yl−1,m)/(2ht). (13.27)
committed during the past steps of the recurrence impact the future steps. This is
why one may avoid these methods even when they are feasible, and prefer implicit
methods.
For linear PDEs, implicit methods require the solution of large systems of linear
equations
Ay = b,
(13.28)
with y = vect(Yl,m ). The difficulty is mitigated by the fact that A is sparse and often
diagonally dominant, so iterative methods are particularly well suited, see Sect. 3.7.
Because the size of A may be enormous, care should be exercised in its storage and
in the indexation of the grid points, to avoid slowing down computation by accesses
to disk memory that could have been avoided.
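The simplest of the classical iterative methods of Sect. 3.7 is Jacobi iteration, which converges when A is strictly diagonally dominant; a minimal sketch (not from the book) on a tiny tridiagonal example:

```python
import numpy as np

# Jacobi iteration: x <- D^{-1} (b - R x), with D the diagonal of A
# and R its off-diagonal part. Converges here because A is strictly
# diagonally dominant.
def jacobi(A, b, x0, n_iter=200):
    D = np.diag(A)
    R = A - np.diag(D)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = (b - R @ x) / D
    return x

A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x = jacobi(A, b, np.zeros(3))
```

For the large sparse systems produced by implicit methods, one would of course store A in a sparse format and use more sophisticated iterations (e.g., Krylov subspace methods), but the principle is the same.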
Consider the one-dimensional heat equation
yt = c yxx, (13.29)
where c is a positive constant.
Take a first-order forward approximation of yt(tl, xm)
yt(tl, xm) ≈ (Yl+1,m − Yl,m)/ht. (13.31)
At the midpoint of the edge between the grid points indexed by (l, m) and (l + 1, m),
it becomes a second-order centered approximation
yt(tl + ht/2, xm) ≈ (Yl+1,m − Yl,m)/ht. (13.32)
To take advantage of this increase in the order of method error, the Crank–Nicolson
scheme approximates (13.29) at such off-grid points (Fig. 13.2). The value of yxx at
the off-grid point indexed by (l + 1/2, m) is then approximated by the arithmetic
mean of its values at the two adjacent grid points
yxx(tl + ht/2, xm) ≈ (1/2)[yxx(tl+1, xm) + yxx(tl, xm)], (13.33)
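A compact NumPy sketch of the resulting Crank–Nicolson scheme follows (not from the book; grid sizes and names are mine). It solves yt = c·yxx on [0, 1] with zero Dirichlet boundary conditions and y(x, 0) = sin(πx), whose exact solution sin(πx)·exp(−cπ²t) allows the scheme to be checked.

```python
import numpy as np

# Crank-Nicolson for y_t = c*y_xx:
#   (I - r*D2) Y^{n+1} = (I + r*D2) Y^n,  r = c*ht/(2*hx^2),
# with D2 the 1D second-difference matrix on the interior points.
c, nx, nt, t_end = 1.0, 50, 50, 0.1
hx, ht = 1.0 / nx, t_end / nt
x = np.linspace(0.0, 1.0, nx + 1)
r = c * ht / (2.0 * hx ** 2)
n = nx - 1                               # interior unknowns
D2 = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
      + np.diag(np.ones(n - 1), -1))
I = np.eye(n)
A_left, A_right = I - r * D2, I + r * D2
Y = np.sin(np.pi * x[1:-1])              # initial space profile
for _ in range(nt):
    Y = np.linalg.solve(A_left, A_right @ Y)
err = Y - np.sin(np.pi * x[1:-1]) * np.exp(-c * np.pi ** 2 * t_end)
```

Here A_left is tridiagonal and constant, so in a serious implementation one would factor it once (or use a tridiagonal solver) instead of calling a dense solver at every step.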
Fig. 13.2 yt is best evaluated off-grid
and write down (13.35) wherever possible. The space profile at time tl can then
be computed as a function of the space profile at time tl1 , l = 2, . . . , N . An
explicit solution is thus obtained, since the initial space profile is known. One may
prefer an implicit approach, where all the equations linking the Yl,m s are
considered simultaneously. The resulting system can be put in the form (13.28),
with A tridiagonal, which simplifies solution considerably.
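When A is tridiagonal, the system can be solved in O(n) operations by the Thomas algorithm (a sketch, not from the book; it assumes no pivoting is needed, which holds, e.g., for diagonally dominant A):

```python
import numpy as np

# Thomas algorithm for a tridiagonal system.
# sub[i] multiplies x[i-1] (sub[0] unused); sup[i] multiplies x[i+1]
# (sup[-1] unused).
def thomas(sub, diag, sup, b):
    n = len(diag)
    c, d = np.zeros(n), np.zeros(n)
    c[0], d[0] = sup[0] / diag[0], b[0] / diag[0]
    for i in range(1, n):                # forward elimination
        m = diag[i] - sub[i] * c[i - 1]
        if i < n - 1:
            c[i] = sup[i] / m
        d[i] = (b[i] - sub[i] * d[i - 1]) / m
    x = np.zeros(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):       # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Small diagonally dominant example
sub = np.array([0.0, -1.0, -1.0, -1.0, -1.0])
diag = np.full(5, 4.0)
sup = np.array([-1.0, -1.0, -1.0, -1.0, 0.0])
b = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = thomas(sub, diag, sup, b)
```

This is what makes the implicit approach mentioned above practical even for fine grids.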
Figure 13.5 illustrates a 2D case where the finite elements are triangles over a
triangular mesh. In this simple configuration, the approximation of the solution on
a given triangle of the mesh is specified by the three values Y (ti , xi ) of the approximate solution at the vertices (ti , xi ) of this triangle, with the approximate solution
inside the triangle provided by linear interpolation. (More complicated interpolation
schemes may be used to ensure smoother transitions between the finite elements.)
The approximate solution at any given vertex (ti , xi ) must of course be the same for
all the triangles of the mesh that share this vertex.
Remark 13.4 In multiphysics, couplings at interfaces are taken into account
by imposing relations between the relevant physical quantities at the interface vertices.
Remark 13.5 As with the FDM, the approximate solution obtained by the FEM is
characterized by the values taken by Y (t, x) at specific points in the region of interest
in the space of the independent variables t and x. There are two important differences,
however:
1. these points are distributed much more flexibly,
2. the value of the approximate solution in the entire domain of interest can be taken
into consideration rather than just at grid points.
Fig. 13.5 A finite element (in light gray) and the corresponding mesh triangle (in dark gray)
yp(r) = Σ_{k=1}^{K} fk(r, Y1,k, Y2,k, Y3,k), (13.39)
where fk(r, ·, ·, ·) is zero outside the part of the mesh associated with the kth element
(assumed triangular here) and Yi,k is the value of the approximate solution at the ith
vertex of the kth triangle of the mesh (i = 1, 2, 3). The quantities to be determined
are then the entries of p, which are some of the Yi,k s. (Since the Yi,k s corresponding
to the same point in r space must be equal, this takes some bookkeeping.)
Assume that the PDE to be solved can be written as
Lr(y) = u(r), (13.40)
where L() is a linear differential operator, L r (y) is the value taken by L(y) at r, and
u(r) is a known input function. Assume also that the solution y(r) is to be computed
for known Dirichlet boundary conditions on D, with D some domain in r space.
To take these boundary conditions into account, rewrite (13.39) as
yp(r) = φT(r)p + φ0(r), (13.41)
where p now corresponds to the parameters needed to specify the solution once
the boundary conditions have been accounted for by φ0(·).
Plug the approximate solution (13.41) in (13.40) to define the residual
ep(r) = Lr(yp) − u(r), (13.43)
which is affine in p. The same projection methods as in Sect. 12.3.4 may be used to
tune p so as to make the residuals small.
13.4.3.1 Collocation
Collocation is the simplest of these approaches. As in Sect. 12.3.4.1, it imposes that
ep (ri ) = 0, i = 1, . . . , dim p,
(13.44)
where the ri s are the collocation points. This yields a system of linear equations to
be solved for p.
The Ritz–Galerkin approach imposes instead that
∫ ep(r) πi(r) dr = 0, i = 1, . . . , dim p, (13.45)
where πi(r) is a test function, which may be the ith entry of φ(r). Collocation is
obtained if πi(r) in (13.45) is replaced by δ(r − ri), with δ(·) the Dirac measure.
The least-squares approach computes
p̂ = arg min_p ∫ ep²(r) dr. (13.46)
Since ep(r) is affine in p, linear least squares may once again be used. The first-order
necessary conditions for optimality then translate into a system of linear equations
that p̂ must satisfy.
Remark 13.6 For linear PDEs, each of the three approaches of Sect. 13.4.3
yields a system of linear equations to be solved for p. This system will be sparse
as each entry of p relates to a very small number of elements, but nonzero entries
may turn out to be quite far from the main descending diagonal. Again, reindexing
may have to be carried out to avoid a potentially severe slowing down of the
computation.
When the PDE is nonlinear, the collocation and Ritz–Galerkin methods require
solving a system of nonlinear equations, whereas the least-squares solution is
obtained by nonlinear programming.
Consider a vibrating string, the elongation of which satisfies
ρ ytt = T yxx, (13.47)
where
• y(x, t) is the string elongation at location x and time t,
• ρ is the string linear density,
• T is the string tension.
The string is attached at its two ends, so
y(0, t) ≡ y(L, t) ≡ 0. (13.48)
Its initial shape is
y(x, 0) = sin(πx/L), (13.49)
and it is initially at rest, so
yt(x, 0) = 0. (13.50)
We define a regular grid on [0, tmax ] [0, L], such that (13.21) and (13.22) are
satisfied, and denote by Ym,l the approximation of y(xm , tl ). Using the second-order
centered difference ( 6.75), we take
ytt(xi, tn) ≈ (Y(i, n + 1) − 2Y(i, n) + Y(i, n − 1))/ht², (13.51)
yxx(xi, tn) ≈ (Y(i + 1, n) − 2Y(i, n) + Y(i − 1, n))/hx², (13.52)
and rewrite (13.47) at the grid points as
ρ (Y(i, n + 1) − 2Y(i, n) + Y(i, n − 1))/ht² = T (Y(i + 1, n) − 2Y(i, n) + Y(i − 1, n))/hx². (13.53)
With
R = (T ht²)/(ρ hx²), (13.54)
this recurrence becomes
Y(i, n + 1) + Y(i, n − 1) − R Y(i + 1, n) − 2(1 − R) Y(i, n) − R Y(i − 1, n) = 0. (13.55)
Equation (13.49) translates into
Y(i, 1) = sin(π(i − 1)hx), (13.56)
and (13.50) into
Y(i, 2) = Y(i, 1). (13.57)
The values of the approximate solution for y at all the grid points are stacked
in a vector z that satisfies a linear system Az = b, where the contents of A and b
are specified by (13.55) and the boundary conditions. After evaluating z, one must
unstack it to visualize the solution. This is achieved in the following script, which
produces Figs. 13.6 and 13.7. A rough (and random) estimate of the condition number
of A for the 1-norm is provided by condest, and found to be approximately equal
to 5,000, so this is not an ill-conditioned problem.
Fig. 13.6 2D visualization of the FDM solution for the string example
Fig. 13.7 3D visualization of the FDM solution for the string example
clear all
% String parameters
L = 1;
% Length
T = 4;
% Tension
Rho = 1;
% Linear density
% Discretization parameters
TimeMax = 1;
% Time horizon
Nx = 50;
% Number of space steps
Nt = 100;
% Number of time steps
hx = L/Nx;
% Space step-size
ht = TimeMax/Nt; % Time step-size
% Creating sparse matrix A and vector b
% full of zeros
SizeA = (Nx+1)*(Nt+1);
A = sparse(1:SizeA,1:SizeA,0);
b = sparse(1:SizeA,1,0);
% Filling A and b (MATLAB indices cannot be zero)
R = (T/Rho)*(ht/hx)^2;
Row = 0;
for i=0:Nx,
Column=i+1;
Row=Row+1;
A(Row,Column)=1;
b(Row)=sin(pi*i*hx/L);
end
for i=0:Nx,
DeltaCol=i+1;
Row=Row+1;
A(Row,(Nx+1)+DeltaCol)=1;
b(Row)=sin(pi*i*hx/L);
end
for n=1:Nt-1,
DeltaCol=1;
Row = Row+1;
A(Row,(n+1)*(Nx+1)+DeltaCol)=1;
for i=1:Nx-1
DeltaCol=i+1;
Row = Row+1;
A(Row,n*(Nx+1)+DeltaCol)=-2*(1-R);
A(Row,n*(Nx+1)+DeltaCol-1)=-R;
A(Row,n*(Nx+1)+DeltaCol+1)=-R;
A(Row,(n+1)*(Nx+1)+DeltaCol)=1;
A(Row,(n-1)*(Nx+1)+DeltaCol)=1;
end
i=Nx; DeltaCol=i+1;
Row=Row+1;
A(Row,(n+1)*(Nx+1)+DeltaCol)=1;
end
% Computing a (random) lower bound
% of Cond(A)for the 1-norm
ConditionNumber=condest(A)
% Solving the linear equations for z
Z=A\b;
% Unstacking z into Y
for i=0:Nx,
Delta=i+1;
for n=0:Nt,
ind_n=n+1;
Y(Delta,ind_n)=Z(Delta+n*(Nx+1));
end
end
% 2D plot of the results
figure;
for n=0:Nt
ind_n = n+1;
plot([0:Nx]*hx,Y(1:Nx+1,ind_n)); hold on
end
xlabel('Location')
ylabel('Elongation')
% 3D plot of the results
figure;
surf([0:Nt]*ht,[0:Nx]*hx,Y);
colormap(gray)
xlabel('Time')
ylabel('Location')
zlabel('Elongation')
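Rearranging (13.55) also gives an explicit recurrence, Y(i, n+1) = R[Y(i+1, n) + Y(i−1, n)] + 2(1 − R)Y(i, n) − Y(i, n−1). The NumPy sketch below (not from the book; parameters as in the script above) exploits the fact that for L = 1, T = 4, ρ = 1 the exact solution is y(x, t) = sin(πx)·cos(2πt). It starts the recurrence from two exact time levels (instead of the first-order start Y(i, 2) = Y(i, 1) of the text); with R = 1 the recurrence then reproduces this d'Alembert-type solution to within roundoff, which makes a sharp check.

```python
import numpy as np

# Explicit counterpart of (13.55), checked against the exact solution
# y(x, t) = sin(pi*x)*cos(2*pi*t) for L = 1, T = 4, rho = 1.
L, T, rho, t_max = 1.0, 4.0, 1.0, 1.0
nx, nt = 50, 100
hx, ht = L / nx, t_max / nt
R = (T / rho) * (ht / hx) ** 2          # equal to 1 here
x = np.linspace(0.0, L, nx + 1)

def exact(t):
    return np.sin(np.pi * x) * np.cos(2.0 * np.pi * t)

Y_old, Y = exact(0.0), exact(ht)        # two exact starting levels
for n in range(1, nt):
    Y_new = np.zeros(nx + 1)            # boundaries stay at zero
    Y_new[1:-1] = (R * (Y[2:] + Y[:-2]) + 2.0 * (1.0 - R) * Y[1:-1]
                   - Y_old[1:-1])
    Y_old, Y = Y, Y_new
err = Y - exact(t_max)
```

With R > 1 the same recurrence would be unstable, which is the usual CFL restriction on explicit schemes for hyperbolic PDEs.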
13.6 In Summary
References
1. Mattheij, R., Rienstra, S., ten Thije Boonkkamp, J.: Partial Differential Equations: Modeling,
Analysis, Computation. SIAM, Philadelphia (2005)
2. Hoffmann, K., Chiang, S.: Computational Fluid Dynamics, vol. 1, 4th edn. Engineering Education System, Wichita (2000)
3. Lapidus, L., Pinder, G.: Numerical Solution of Partial Differential Equations in Science and
Engineering. Wiley, New York (1999)
4. Gustafsson, B.: Fundamentals of Scientific Computing. Springer, Berlin (2011)
5. Chandrupatla, T., Belegundu, A.: Introduction to Finite Elements in Engineering, 3rd edn.
Prentice-Hall, Upper Saddle River (2002)
Chapter 14
14.1 Introduction
This chapter is mainly concerned with methods based on the use of the computer itself
for assessing the effect of its rounding errors on the precision of numerical results
obtained through floating-point computation. It marginally deals with the assessment
of the effect of method errors. (See also Sects. 6.2.1.5, 12.2.4.2 and 12.2.4.3 for
the quantification of method error based on varying step-size or method order.)
Section 14.2 distinguishes the types of algorithms to be considered. Section 14.3
describes the floating-point representation of real numbers and the rounding modes
available according to IEEE standard 754, with which most of today's computers
comply. The cumulative effect of rounding errors is investigated in Sect. 14.4. The
main classes of methods available for quantifying numerical errors are described in
Sect. 14.5. Section 14.5.2.2 deserves a special mention, as it describes a particularly
simple yet potentially very useful approach. Section 14.6 describes in some more
detail a method for evaluating the number of significant decimal digits in a floating-point result. This method may be seen as a refinement of that of Sect. 14.5.2.2,
although it was proposed earlier.
Iterative algorithms may be stopped when an absolute condition such as
|xk+1 − xk| < δ (14.1)
or a relative condition such as
|xk+1 − xk| < δ |xk| (14.2)
is satisfied, with δ some prespecified threshold.
Example 14.1 If an absolute condition such as (14.1) is used to evaluate the limit
when k tends to infinity of xk computed by the recurrence
xk+1 = xk + 1/(k + 1), x1 = 1, (14.3)
Fig. 14.1 Need for a compromise; solid curve: global error, dash-dot line: method error
x = 2;
F = x^3;
TrueDotF = 3*x^2;
i = -20:0;
h = 10.^i;
% first-order forward difference
NumDotF = ((x+h).^3-F)./h;
AbsErr = abs(TrueDotF - NumDotF);
MethodErr = 3*x*h;
loglog(h,AbsErr,'k-s');
hold on
loglog(h,MethodErr,'k-.');
xlabel('Step-size h (in log scale)')
ylabel('Absolute errors (in log scale)')
produces Fig. 14.1, which illustrates this need for a compromise. The solid curve
interpolates the absolute values taken by the global error for various values of h.
The dash-dot line corresponds to the sole effect of method error, as estimated from
the first neglected term in (6.55), which is equal to f''(x)h/2 = 3xh. When h is too
small, the rounding error dominates, whereas when h is too large it is the method
error.
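The same experiment transcribes directly into Python/NumPy (a sketch, names mine): for f(x) = x³ at x = 2, the first-order forward difference is exact to about 10⁻⁶ at best, is useless when h is so small that x + h rounds to x, and is dominated by method error when h is large.

```python
import numpy as np

# Forward-difference estimate of f'(2) for f(x) = x^3,
# over step-sizes h = 1e-20, ..., 1e0.
x = 2.0
true_dot_f = 3.0 * x ** 2               # exact derivative, 12
h = 10.0 ** np.arange(-20.0, 1.0)
num_dot_f = ((x + h) ** 3 - x ** 3) / h
abs_err = np.abs(true_dot_f - num_dot_f)
# At h = 1e-20, x + h == x in double precision, so the estimate is 0
# and the error equals 12; at h = 1, method error dominates.
```

The minimum of abs_err over this grid sits near h ≈ 10⁻⁸, i.e., near the square root of the unit roundoff, as the analysis of the compromise predicts.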
Ideally, one should choose h so as to minimize some measure of the global error
on the final result. This is difficult, however, as method error cannot be assessed
precisely. (Otherwise, one would rather subtract it from the numerical result to get
an exact algorithm.) Rough estimates of method errors may nevertheless be obtained,
for instance by carrying out the same computation with several step-sizes or method
orders, see Sects. 6.2.1.5, 12.2.4 and 12.2.4.3. Hard bounds on method errors may
be computed using interval analysis, see Remark 14.6.
14.3 Rounding
14.3.1 Real and Floating-Point Numbers
Any real number x can be written as
x = s · m · b^e, (14.6)
where b is the base (which belongs to the set N of all positive integers), e is the
exponent (which belongs to the set Z of all relative integers), s ∈ {−1, +1} is the
sign and m is the mantissa
m = Σ_{i=0}^{∞} ai · b^−i, ai ∈ {0, 1, . . . , b − 1}. (14.7)
Any nonzero real number has a normalized representation where m ∈ [1, b), such
that the triplet {s, m, e} is unique.
Such a representation cannot be used on a finite-memory computer, and a
floating-point representation using a finite (and fixed) number of bits is usually
employed instead [2].
Remark 14.2 Floating-point numbers are not necessarily the best substitutes to real
numbers. If the range of all the real numbers intervening in a given computation is
sufficiently restricted (for instance because some scaling has been carried out), then
one may be better off computing with integers or ratios of integers. Computer algebra
systems such as MAPLE also use ratios of integers for infinite-precision numerical
computation, with integers represented exactly by variable-length binary words.
Substituting floating-point numbers for real numbers has consequences on the
results of numerical computations, and these consequences should be minimized. In
what follows, lower case italics are used for real numbers and upper case italics for
their floating-point representations.
Let F be the set of all floating-point numbers in the representation considered.
One is led to replacing x ∈ R by X ∈ F, with

X = fl(x) = S · M · b^E.  (14.8)
If a normalized representation is used for x and X , provided that the base b is the
same, one should have S = s and E = e, but previous computations may have gone
so wrong that E differs from e, or even S from s.
Results are usually presented using a decimal representation (b = 10), but the
representation of the floating-point numbers inside the computer is binary (b = 2),
so

M = Σ_{i=0}^{p} A_i 2^{−i}.  (14.9)
14.3 Rounding
385
indicating that a problem has been encountered. Note that the statement NaN = NaN
is false, whereas the statement +0 = −0 is true.
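These statements, as well as the sign–mantissa–exponent decomposition (14.8) with b = 2, are easy to check for oneself. The sketch below is an added illustration (it is not part of the original text); it uses the standard function math.frexp, which returns a fraction in [0.5, 1) that is renormalized here so that the mantissa lies in [1, 2):

```python
import math

def decompose(x):
    """Return (s, m, e) such that x = s * m * 2**e, with m in [1, 2)."""
    if x == 0.0 or not math.isfinite(x):
        raise ValueError("finite nonzero input expected")
    m, e = math.frexp(x)          # x = m * 2**e with |m| in [0.5, 1)
    s = -1 if m < 0 else 1
    return s, abs(m) * 2, e - 1   # renormalize so that the mantissa is in [1, 2)

# NaN is not equal to itself, whereas +0 and -0 compare equal:
nan = float("nan")
print(nan == nan)      # False
print(0.0 == -0.0)     # True

print(decompose(-6.25))   # (-1, 1.5625, 2), since -6.25 = -1 * 1.5625 * 2**2
```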
Remark 14.3 The floating-point numbers thus created are not regularly spaced, as
it is the relative distance between two consecutive doubles of the same sign that is
constant. The distance between zero and the smallest positive double turns out to be
much larger than the distance between this double and the one immediately above,
which is one of the reasons for the introduction of subnormal numbers.
Results may thus depend on the order in which the computations are carried out.
Worse, some compilers eliminate parentheses that they deem superfluous, so one
may not even know what this order will be.
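A two-line experiment with doubles (added illustration) makes the point:

```python
# Floating-point addition is not associative, so the computed result
# depends on the order in which the operations are carried out.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)   # 0.6000000000000001
print(a + (b + c))   # 0.6
```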
toward 0,
toward the closest float or double,
toward +∞,
toward −∞.
|fl(X op Y) − (X op Y)| ≤ u · |fl(X op Y)|.  (14.14)

A bound on the rounding error on X op Y is thus easily computed, since the unit
roundoff u is known and fl(X op Y) is the floating-point number provided by the
computer as the result of evaluating X op Y. Equations (14.13) and (14.14) are at the
core of running error analysis (see Sect. 14.5.2.4).
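As an added sketch (not the author's code), running error analysis can be carried alongside a summation: each floating-point addition S = fl(S + v) commits an error bounded by u|S| as in (14.14), and these bounds are accumulated as the computation proceeds. Here u = 2^−53, the unit roundoff for doubles:

```python
def sum_with_running_error_bound(values, u=2.0 ** -53):
    """Accumulate a sum while propagating a running bound on its rounding error."""
    s = 0.0
    bound = 0.0
    for v in values:
        s = s + v                    # one rounding error is committed here...
        bound = bound + u * abs(s)   # ...and bounded using the computed result
    return s, bound

s, bound = sum_with_running_error_bound([0.1] * 10)
print(s != 1.0)   # True: the computed sum is not exactly 1
print(bound)      # a bound on the error committed by the ten additions
```

Note that the bound covers only the additions; the inputs 0.1 already carry their own representation errors.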
M = Σ_{i=1}^{p} A_i 2^{−i},  A_i ∈ {0, 1},  (14.17)

X = x (1 + α 2^{−p}),  (14.18)

with p the number of bits for the mantissa M, and α ∈ [−0.5, 0.5] when rounding
is to the nearest and α ∈ [−1, 1] when rounding is toward ±∞ [8]. The relative
rounding error |X − x|/|x| is thus equal to 2^{−p} at most.
In both cases, the correction term is either 0 or 1.
14.4.4 In Summary
Equations (14.20), (14.22) and (14.23) suggest that adding doubles that have the
same sign, multiplying doubles or dividing a double by a nonzero double should not
lead to a catastrophic loss of significant digits. Subtracting numbers that are close to
one another, on the other hand, has the potential for disaster.
One can sometimes reformulate the problems to be solved in such a way that a risk
of deadly subtraction is eliminated; see, for instance, Example 1.2 and Sect. 14.4.6.
This is not always possible, however. A case in point is when evaluating a derivative
by a finite-difference approximation, for instance
df/dx (x0) ≈ ( f(x0 + h) − f(x0) ) / h,  (14.24)
since the mathematical definition of a derivative requires that h should tend toward
zero. To avoid an explosion of the rounding error, one must take a nonzero h, thereby
introducing method error.
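The tradeoff can be observed numerically. The sketch below (added illustration, with f(x) = exp(x) at x0 = 1) evaluates (14.24) for a too-large, a near-optimal, and a too-small step-size:

```python
import math

def forward_diff(f, x0, h):
    """Forward-difference approximation (14.24) of the derivative of f at x0."""
    return (f(x0 + h) - f(x0)) / h

x0, exact = 1.0, math.exp(1.0)
errors = {h: abs(forward_diff(math.exp, x0, h) - exact) for h in (1e-2, 1e-8, 1e-14)}
# h = 1e-2: method error dominates; h = 1e-14: rounding error dominates;
# h = 1e-8, of the order of sqrt(machine epsilon), is near the best compromise.
for h, err in sorted(errors.items()):
    print(h, err)
```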
where the g_i's only depend on the data and the algorithm, and where δ_i ∈ [−0.5, 0.5] if
rounding is to the nearest and δ_i ∈ [−1, 1] if rounding is toward ±∞. The number
n_b of significant binary digits in R then satisfies
n_b ≈ −log2 |(R − r)/r| = p − log2 |(1/r) Σ_{i=1}^{n} δ_i g_i|.  (14.26)
The term

log2 |(1/r) Σ_{i=1}^{n} δ_i g_i|,  (14.27)
which approximates the loss in precision due to computation, does not depend on
the number p of bits in the mantissa. The remaining precision does depend on p, of
course.
f(x̃) = f(x) + Σ_{i=1}^{n} [∂f/∂x_i (x)] δx_i + O(δ²),  (14.29)

with x̃ the perturbed input vector.
The relative error on the result f(x) therefore satisfies

|f(x̃) − f(x)| / |f(x)| ≤ ( Σ_{i=1}^{n} |∂f/∂x_i (x)| · |x_i| / |f(x)| ) |δ| + O(δ²),  (14.30)

so the quantity

Σ_{i=1}^{n} |∂f/∂x_i (x)| · |x_i| / |f(x)|  (14.31)

acts as a condition number for the evaluation of f at x. With g(x) the gradient of
f at x, it can be written more concisely as

|g(x)|ᵀ |x| / |f(x)|,  (14.32)
where |g(x)| and |x| are vectors of absolute values. A result R is said to have n_d
significant decimal digits if

|R − r| ≤ |r| / 10^{n_d}.  (14.33)
n̂_d = log10 ( |R_+ + R_−| / (2 |R_+ − R_−|) ),  (14.34)

which may then be rounded to the nearest nonnegative integer. Similar computations
will be carried out in Sect. 14.6 based on statistical hypotheses on the errors.
will be carried out in Sect. 14.6 based on statistical hypotheses on the errors.
Remark 14.4 The estimate n̂_d provided by (14.34) may be widely off the mark, and
should be handled with caution. If R_+ and R_− are close, this does not prove that they
are close to r, if only because rounding is just one of the possible sources of errors.
If, on the other hand, R_+ and R_− differ markedly, then the results provided by the
computer should rightly be viewed with suspicion.
Remark 14.5 Evaluating n̂_d by visual inspection of R_+ and R_− may turn out to be
difficult. For instance, 1.999999991 and 2.000000009 are very close although they
have no digit in common, whereas 1.21 and 1.29 are less close than they may seem
visually, as one may realize by replacing them by their closest two-digit approximations.
[a] + [b] = [a− + b−, a+ + b+]  (14.37)

and

[a] − [b] = [a− − b+, a+ − b−].  (14.38)

For multiplication, [a] · [b] = [c−, c+], with

c− = min{a−b−, a−b+, a+b−, a+b+}  (14.39)

and

c+ = max{a−b−, a−b+, a+b−, a+b+}.  (14.40)
An interval function [f](·) is an inclusion function for f(·) if, for any interval [x] in
its domain,

f([x]) ⊆ [f]([x]).  (14.42)
When a formal expression is available for f (x), the natural inclusion function
[ f ]n ([x]) is obtained by replacing, in the formal expression of f (), each occurrence
of x by [x] and each operation or elementary function by its interval counterpart.
Example 14.4 If

f(x) = (x − 1)(x + 1),  (14.43)

then

[f]n1([−1, 1]) = ([−1, 1] − [1, 1]) · ([−1, 1] + [1, 1])
= [−2, 0] · [0, 2]
= [−4, 0].
Rewriting f(x) as

f(x) = x² − 1,  (14.44)

we get

[f]n2([−1, 1]) = [−1, 1]² − [1, 1]  (14.45)

= [0, 1] − [1, 1] = [−1, 0],  (14.46)
so [f]n2(·) is much more accurate than [f]n1(·). It is even a minimal inclusion
function, as
f([x]) = [f]n2([x]).  (14.47)
This is due to the fact that the formal expression of [ f ]n2 ([x]) contains only one
occurrence of [x].
A caricatural illustration of the pessimism introduced by multiple occurrences of
variables is the evaluation of
f(x) = x − x  (14.48)
on the interval [−1, 1] using a natural inclusion function. Because the two occurrences of x in (14.48) are treated as if they were independent,
[f]n([−1, 1]) = [−2, 2].  (14.49)
It is thus a good idea to look for formal expressions that minimize the number
of occurrences of the variables. Many other techniques are available to reduce the
pessimism of inclusion functions.
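These effects are easy to reproduce with a few lines of interval arithmetic. The sketch below is an added illustration only: it ignores outward rounding, which a real implementation such as INTLAB [30] must perform to remain guaranteed.

```python
class Interval:
    """Minimal interval arithmetic (no outward rounding, for illustration only)."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)
    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p) + 0.0, max(p) + 0.0)   # + 0.0 avoids printing -0.0
    def sqr(self):
        p = [self.lo ** 2, self.hi ** 2]
        return Interval(0.0 if self.lo <= 0.0 <= self.hi else min(p), max(p))
    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(-1.0, 1.0)
one = Interval(1.0, 1.0)
print((x - one) * (x + one))   # [-4.0, 0.0]: pessimistic, x occurs twice
print(x.sqr() - one)           # [-1.0, 0.0]: the minimal inclusion function
print(x - x)                   # [-2.0, 2.0]: the occurrences are treated as independent
```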
Interval computation easily extends to interval vectors and matrices. An interval
vector (or box) [x] is a Cartesian product of intervals, and [f]([x]) is an inclusion
function for the multivariate vector function f(x) if it computes an interval vector
[f]([x]) that contains the image of [x] by f(), i.e.,
f([x]) ⊆ [f]([x]).  (14.50)
(14.50)
Example 14.5 Assume that an inclusion function [g](·) is available for the gradient
g(·) of a cost function, and that its evaluation over the box [x] yields

0 ∉ [g]([x]).  (14.52)
The first-order optimality condition (9.6) is thus satisfied nowhere in the box [x],
so [x] can be eliminated from further search as it cannot contain any unconstrained
minimizer.
Example 14.6 Bisection
Consider again Example 14.5, but assume now that
0 [g]([x]),
(14.53)
which does not allow [x] to be eliminated. One may then split [x] into [x1 ] and [x2 ]
and attempt to eliminate these smaller boxes. This is made easier by the fact that inclusion functions usually get less pessimistic when the size of their interval arguments
decreases (until the effect of outward rounding becomes predominant). The curse
of dimensionality is of course lurking behind bisection. Contraction, which makes
it possible to reduce the size of [x] without losing any solution, is thus particularly
important when dealing with high-dimensional problems.
Example 14.7 Contraction
Let f(·) be a scalar univariate function, with a continuous first derivative on [x],
and let x* and x0 be two points in [x], with f(x*) = 0. The mean-value theorem
implies that there exists c ∈ [x] such that

f′(c) = ( f(x*) − f(x0) ) / ( x* − x0 ).  (14.54)

In other words,

x* = x0 − f(x0) / f′(c).  (14.55)

Since c is in [x],

x* ∈ x0 − f(x0) / [f′]([x]),  (14.56)

with [f′](·) an inclusion function for f′(·). Any zero x* of f(·) in [x] must therefore
also belong to

[x] ∩ ( x0 − f(x0) / [f′]([x]) ).  (14.57)

Iterating this contraction yields

[x_{k+1}] = [x_k] ∩ ( x_k − f(x_k) / [f′]([x_k]) ),  (14.58)
with xk some point in [xk ], for instance its center. Any solution belonging to [xk ]
belongs also to [xk+1 ], which may be much smaller.
396
The resulting interval Newton method is more complicated than it seems, as the
interval denominator [f′]([x_k]) may contain zero, so [x_{k+1}] may consist of two intervals, each of which will have to be processed at the next iteration. The interval Newton
method can be extended to finding approximations by boxes of all the solutions of
systems of nonlinear equations in several unknowns [16].
Remark 14.6 Interval computations may similarly be used to get bounds on the
remainder of Taylor expansions, thus making it possible to bound method errors.
Consider, for instance, the kth order Taylor expansion of a scalar univariate function
f(·) around x_c

f(x) = f(x_c) + Σ_{i=1}^{k} (1/i!) f^(i)(x_c) (x − x_c)^i + r(x, x_c, ξ),  (14.59)
where

r(x, x_c, ξ) = (1/(k + 1)!) f^(k+1)(ξ) (x − x_c)^{k+1}  (14.60)

is the Taylor remainder. Equation (14.59) holds true for some unknown ξ in [x, x_c].
An inclusion function [f](·) for f(·) is thus

[f]([x]) = f(x_c) + Σ_{i=1}^{k} (1/i!) f^(i)(x_c) ([x] − x_c)^i + [r]([x], x_c, [x]),  (14.61)

with [r](·, ·, ·) an inclusion function for r(·, ·, ·) and x_c any point in [x], for instance
its center.
With the help of these concepts, approximate but guaranteed solutions can be
found to problems such as
14.5.2.4 Running Error Analysis
The first term on the right-hand side of (14.63)–(14.66) is deduced from (14.14).
The following terms propagate input errors to the output while neglecting products
of error terms. The method is much simpler to implement than the interval approach
of Sect. 14.5.2.3, but the resulting bounds on the effect of rounding errors are approximate and method errors are not taken into account.
14.6 CESTAC/CADNA
The presentation of the method is followed by a discussion of its validity conditions,
which can partly be checked by the method itself.
14.6.1 Method
Let r R be some real quantity to be evaluated by a program and Ri F be
the corresponding floating-point result, as provided by the ith run of this program
(i = 1, . . . , N ). During each run, the result of each operation is randomly rounded
either toward + or toward , with the same probability. Each Ri may thus
be seen as an approximation of r. The fundamental hypothesis on which CESTAC/CADNA is based is that these R_i's are independently and identically distributed
according to a Gaussian law, with mean r.
Let μ be the arithmetic mean of the results provided by the computer in N runs

μ = (1/N) Σ_{i=1}^{N} R_i.  (14.67)
Since N is finite, μ is not equal to r, but it is in general closer to r than any of the
R_i's (μ is the maximum-likelihood estimate of r under the fundamental hypothesis).
Let σ be the empirical standard deviation of the R_i's

σ = [ (1/(N − 1)) Σ_{i=1}^{N} (R_i − μ)² ]^{1/2},  (14.68)
which characterizes the dispersion of the R_i's around their mean. Student's t-test
makes it possible to compute an interval centered at μ and having a given probability
β of containing r

Prob( |μ − r| ≤ τ σ/√N ) = β.  (14.69)
In (14.69), the value of τ depends on the value of β (to be chosen by the user) and
on the number of degrees of freedom, which is equal to N − 1 since there are N data
points R_i linked to μ by the equality constraint (14.67). Typical values are β = 0.95,
which amounts to accepting to be wrong in 5% of the cases, and N = 2 or 3, to keep
the volume of computation manageable. From (14.33), the number n_d of significant
decimal digits in μ satisfies
|μ − r| ≤ |μ| / 10^{n_d}.  (14.70)

Hence

n_d ≈ log10 ( |μ| / |μ − r| ) = log10 |μ| − log10 |μ − r|.  (14.71)
n̂_d = log10 (|μ|/σ) − 0.953 if N = 2,  (14.72)

and

n̂_d = log10 (|μ|/σ) − 0.395 if N = 3.  (14.73)
Remark 14.7 Assume N = 2 and denote the results of the two runs by R_+ and R_−.
Then

log10 (|μ|/σ) = log10 ( |R_+ + R_−| / |R_+ − R_−| ) − (1/2) log10 2,  (14.74)

so

n̂_d = log10 ( |R_+ + R_−| / |R_+ − R_−| ) − 1.1.  (14.75)

By comparison, the switching estimate (14.34) corresponds to

n̂_d = log10 ( |R_+ + R_−| / |R_+ − R_−| ) − 0.3.  (14.76)
Based on this analysis, one may now present each result in a format that only
shows the decimal digits that are deemed significant. A particularly spectacular case
is when the estimated number of significant digits becomes zero (n̂_d < 0.5), which
amounts to saying that nothing is known of the result, not even its sign. This led to
the concept of computational zero (CZ): the result of a numerical computation is a
CZ if its value is zero or if it contains no significant digit. A very large floating-point
number may turn out to be a CZ while another with a very small magnitude may not
be a CZ.
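The random-rounding idea is easy to mimic. The sketch below is an added, coarse emulation (Python 3.9+ for math.nextafter); the real CADNA library instruments every floating-point operation, whereas here only a few intermediate results of a cancellation-prone expression are randomly rounded before (14.68) and (14.73) are applied with N = 3:

```python
import math
import random

def rr(x):
    """Randomly round x one ulp up or down with equal probability
    (coarse emulation of CESTAC's random rounding)."""
    return math.nextafter(x, math.inf if random.random() < 0.5 else -math.inf)

def shaky_root(a, b, c):
    """Cancellation-prone root (-b + sqrt(b*b - 4*a*c)) / (2*a),
    with some intermediate results randomly rounded."""
    d = rr(rr(b * b) - 4.0 * a * c)
    return (-b + rr(math.sqrt(d))) / (2.0 * a)

random.seed(1)
runs = [shaky_root(1.0, 2.0e7, 1.0) for _ in range(3)]      # N = 3 perturbed runs
mu = sum(runs) / 3.0                                        # mean, as in (14.67)
sigma = math.sqrt(sum((r - mu) ** 2 for r in runs) / 2.0)   # std, as in (14.68)
if sigma > 0.0:
    print(math.log10(abs(mu) / sigma) - 0.395)   # (14.73): few digits are significant
else:
    print("all runs identical for this seed")
```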
The application of this approach depends on the type of algorithm being
considered, as defined in Sect. 14.2.
For exact finite algorithms, CESTAC/CADNA can provide each result with an
estimate of its number of significant decimal digits. When the algorithm involves
conditional branching, one should be cautious about the CESTAC/CADNA assessment of the accuracy of the results, as the perturbed runs may not all follow the same
branch of the code, which would make the hypothesis of a Gaussian distribution of
the results particularly questionable. This suggests analysing not only the precision
of the end results but also that of all floating-point intermediary results (at least those
involved in conditions). This may be achieved by running two or three executions
of the algorithm in parallel. Operator overloading makes it possible to avoid having to modify heavily the code to be tested. One just has to declare the variables
to be monitored as stochastic. For more details, see http://www-anp.lip6.fr/english/
cadna/. As soon as a CZ is detected, the results of all subsequent computations should
be subjected to serious scrutiny. One may even decide to stop computation there and
then. When evaluating a sum such as

Σ_{i=1}^{n} f_i,  (14.77)

the summation may, for instance, be stopped as soon as the term to be added is a CZ,
which means the iterative increment is no longer significant. (The usual transcendental functions are not computed via such an evaluation of series, and the procedures
actually used are quite sophisticated [29].)
For approximate algorithms, one should minimize the global error resulting from
the combination of the method and rounding errors. CESTAC/CADNA may help
find a good tradeoff by contributing to the assessment of the effects of the latter,
provided that the effects of the former are assessed by some other method.
Consider, for instance, the computation of the sum

s_n = Σ_{i=1}^{n} x_i.  (14.79)

For multiplication,

X_1 X_2 = x_1 (1 + ε_1) · x_2 (1 + ε_2) = x_1 x_2 (1 + ε_1 + ε_2 + ε_1 ε_2),  (14.80)

and ε_1 ε_2, the only error term with order higher than one, is negligible if ε_1 and ε_2
are small compared to one, i.e., if X_1 and X_2 are not CZs. For division,

X_1 / X_2 = x_1 (1 + ε_1) / ( x_2 (1 + ε_2) ) = (x_1/x_2)(1 + ε_1)(1 − ε_2 + ε_2² − · · ·),  (14.81)

and the particularly catastrophic effect that ε_2 would have if its absolute value were
larger than one is demonstrated. This would correspond to a division by a CZ, a first
cause of failure of the CESTAC/CADNA analysis.
A second one is when most of the final error is due to a few critical operations.
This may be the case, for instance, when a branching decision is based on the sign of a
quantity that turns out to be a CZ. Depending on the realization of the computations,
either of the branches of the algorithm will be followed, with results that may be
completely different and may have a multimodal distribution, thus quite far from a
Gaussian one.
These considerations suggest the following advice.
Any intermediary result that turns out to be a CZ should raise doubts as to
the estimated number of significant digits in the results of the computation to
follow, which should be viewed with caution. This is especially true if the CZ
appears in a condition or as a divisor.
Despite its limitations, this simple method has the considerable advantage of
alerting the user to the lack of numerical robustness of some operations in the
specific case of the data being processed. It can thus be viewed as an online numerical
debugger.
Consider the computation of the real roots of the second-order polynomial equation

a x² + b x + c = 0.  (14.82)

The high-school formulas give them as

x1hs = ( −b + √(b² − 4ac) ) / (2a) and x2hs = ( −b − √(b² − 4ac) ) / (2a).  (14.83)

A more robust alternative is to compute

q = −( b + sign(b) √(b² − 4ac) ) / 2,  (14.84)

and then

x1mr = c/q and x2mr = q/a.  (14.85)
Trouble arises when b² is very large compared to |ac|, so let us take a = c = 1 and
b = 2 · 10⁷. By typing

Digits:=20;
f:=x^2+2*10^7*x+1;
fsolve(f=0);

in MAPLE, one finds an accurate solution to be

x1as = −5.0000000000000125000 · 10⁻⁸,
x2as = −1.9999999999999950000 · 10⁷.
(14.86)
This solution will serve as a gold standard for assessing how accurately the methods
presented in Sects. 14.5.2.2, 14.5.2.3 and 14.6 evaluate the precision with which x1
and x2 are computed by the high-school and more robust formulas.
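A quick check in double precision is sketched below (added illustration; math.copysign plays the role of sign(b)):

```python
import math

def roots_hs(a, b, c):
    """'High-school' formulas (14.83)."""
    sq = math.sqrt(b * b - 4.0 * a * c)
    return (-b + sq) / (2.0 * a), (-b - sq) / (2.0 * a)

def roots_mr(a, b, c):
    """More robust formulas (14.84)-(14.85), avoiding the deadly subtraction."""
    sq = math.sqrt(b * b - 4.0 * a * c)
    q = -(b + math.copysign(sq, b)) / 2.0
    return c / q, q / a

a, b, c = 1.0, 2.0e7, 1.0
x1_exact = -5.0000000000000125e-8            # from the MAPLE gold standard (14.86)
x1_hs, _ = roots_hs(a, b, c)
x1_mr, _ = roots_mr(a, b, c)
print(abs(x1_hs - x1_exact) / abs(x1_exact))  # large: only the leading digits survive
print(abs(x1_mr - x1_exact) / abs(x1_exact))  # tiny: accurate to near machine precision
```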
Rounding these estimates to the closest nonnegative integer, we can write only the
decimal digits that are deemed significant in the results. Thus

x1hs = −5 · 10⁻⁸,
x2hs = −1.999999999999995 · 10⁷,
x1mr = −5.000000000000013 · 10⁻⁸,
x2mr = −1.999999999999995 · 10⁷.  (14.90)
x1hs = -5._______________e-008
x2hs = -1.999999999999995e+007
x1mr = -5.00000000000001_e-008
x2mr = -1.999999999999995e+007
They are fully consistent with those of the switching approach, and obtained in a
guaranteed manner. One should not be fooled, however, into believing that the guaranteed interval-computation approach can always be used instead of the nonguaranteed
switching or CESTAC/CADNA approach. This example is actually so simple that
the pessimism of interval computation is not revealed, although no effort has been
made to reduce its effect. For more complex computations, this would not be so, and
the widths of the intervals containing the results may soon become exceedingly large
unless specific and nontrivial measures are taken.
Rounding these estimates to the closest nonnegative integer, and keeping only the
decimal digits that are deemed significant, we get the slightly modified results
x1hs = −5 · 10⁻⁸,
x2hs = −1.99999999999999 · 10⁷,
x1mr = −5.00000000000001 · 10⁻⁸,
x2mr = −1.99999999999999 · 10⁷.  (14.92)
The CESTAC/CADNA approach thus suggests discarding digits that the switching
approach deemed valid. On this specific example, the gold standard (14.86) reveals
that the more optimistic switching approach is right, as these digits are indeed correct.
Both approaches, as well as interval computations, clearly evidence a problem with
x1 as computed with the high-school method.
14.8 In Summary
Moving from analytic calculus to numerical computation with floating-point numbers translates into unavoidable rounding errors, the consequences of which must
be analyzed and minimized.
Potentially the most dangerous operations are subtracting numbers that are close
to one another, dividing by a CZ, and branching based on the value or sign of a
CZ.
Among the methods available in the literature to assess the effect of rounding
errors, those using the computer to evaluate the consequences of its own errors
have two advantages: they are applicable to broad classes of algorithms, and they
take the specifics of the data being processed into account.
A mere switching of the direction of rounding may suffice to reveal a large uncertainty in numerical results.
Interval analysis produces guaranteed results with error estimates that may be very
pessimistic unless dedicated algorithms are used. This limits its applicability, but
being able to provide bounds on method errors is a considerable advantage.
Running error analysis loses this advantage and only provides approximate bounds
on the effect of the propagation of rounding errors, but is much simpler to implement in an ad hoc manner.
The random-perturbation approach CESTAC/CADNA does not suffer from the
pessimism of interval analysis. It should nevertheless be used with caution as a
variant of casting out the nines, which cannot guarantee that the numerical results
provided by the computer are correct but may detect that they are not. It can
contribute to checking whether its conditions of validity are satisfied.
References
1. Pichat, M., Vignes, J.: Ingénierie du contrôle de la précision des calculs sur ordinateur. Éditions Technip, Paris (1993)
2. Goldberg, D.: What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23(1), 5–48 (1991)
3. IEEE: IEEE standard for floating-point arithmetic. Technical Report IEEE Standard 754-2008, IEEE Computer Society (2008)
4. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia (2002)
5. Muller, J.M., Brisebarre, N., de Dinechin, F., Jeannerod, C.P., Lefèvre, V., Melquiond, G., Revol, N., Stehlé, D., Torres, S.: Handbook of Floating-Point Arithmetic. Birkhäuser, Boston (2010)
6. Chesneaux, J.M.: Étude théorique et implémentation en ADA de la méthode CESTAC. Ph.D. thesis, Université Pierre et Marie Curie (1988)
7. Chesneaux, J.M.: Study of the computing accuracy by using probabilistic approach. In: Ullrich, C. (ed.) Contribution to Computer Arithmetic and Self-Validating Methods, pp. 19–30. J.C. Baltzer AG, Amsterdam (1990)
8. Chesneaux, J.M.: L'arithmétique stochastique et le logiciel CADNA. Habilitation à diriger des recherches, Université Pierre et Marie Curie (1995)
9. Kulisch, U.: Very fast and exact accumulation of products. Computing 91, 397–405 (2011)
10. Wilkinson, J.: Rounding Errors in Algebraic Processes, reprinted edn. Dover, New York (1994)
11. Wilkinson, J.: Modern error analysis. SIAM Rev. 13(4), 548–568 (1971)
12. Kahan, W.: How futile are mindless assessments of roundoff in floating-point computation? www.cs.berkeley.edu/~wkahan/Mindless.pdf (2006) (work in progress)
13. Moore, R.: Automatic error analysis in digital computation. Technical Report LMSD-48421, Lockheed Missiles and Space Co., Palo Alto, CA (1959)
14. Moore, R.: Interval Analysis. Prentice-Hall, Englewood Cliffs (1966)
15. Moore, R.: Methods and Applications of Interval Analysis. SIAM, Philadelphia (1979)
16. Neumaier, A.: Interval Methods for Systems of Equations. Cambridge University Press, Cambridge (1990)
17. Jaulin, L., Kieffer, M., Didrit, O., Walter, E.: Applied Interval Analysis. Springer, London (2001)
18. Ratschek, H., Rokne, J.: New Computer Methods for Global Optimization. Ellis Horwood, Chichester (1988)
19. Hansen, E.: Global Optimization Using Interval Analysis. Marcel Dekker, New York (1992)
20. Berz, M., Makino, K.: Verified integration of ODEs and flows using differential algebraic methods on high-order Taylor models. Reliab. Comput. 4, 361–369 (1998)
21. Nedialkov, N., Jackson, K., Corliss, G.: Validated solutions of initial value problems for ordinary differential equations. Appl. Math. Comput. 105(1), 21–68 (1999)
22. Nedialkov, N.: VNODE-LP, a validated solver for initial value problems in ordinary differential equations. Technical Report CAS-06-06-NN, Department of Computing and Software, McMaster University, Hamilton (2006)
23. Wilkinson, J.: Error analysis revisited. IMA Bull. 22(11/12), 192–200 (1986)
24. Zahradnicky, T., Lorencz, R.: FPU-supported running error analysis. Acta Polytechnica 50(2), 30–36 (2010)
25. La Porte, M., Vignes, J.: Algorithmes numériques, analyse et mise en œuvre, 1: Arithmétique des ordinateurs. Systèmes linéaires. Technip, Paris (1974)
26. Vignes, J.: New methods for evaluating the validity of the results of mathematical computations. Math. Comput. Simul. 20(4), 227–249 (1978)
27. Vignes, J., Alt, R., Pichat, M.: Algorithmes numériques, analyse et mise en œuvre, 2: Équations et systèmes non linéaires. Technip, Paris (1980)
28. Vignes, J.: A stochastic arithmetic for reliable scientific computation. Math. Comput. Simul. 35, 233–261 (1993)
29. Muller, J.M.: Elementary Functions, Algorithms and Implementation, 2nd edn. Birkhäuser, Boston (2006)
30. Rump, S.: INTLAB — INTerval LABoratory. In: Csendes, T. (ed.) Developments in Reliable Computing, pp. 77–104. Kluwer Academic Publishers, Dordrecht (1999)
Chapter 15
This chapter suggests web sites that give access to numerical software as well as
to additional information on concepts and methods presented in the other chapters.
Most of the resources described can be used at no cost. Classification is not tight, as
the same URL may point to various types of facilities.
15.2 Encyclopedias
For just about any concept or numerical method mentioned in this book, additional
information may be found in Wikipedia (http://en.wikipedia.org/), which now contains more than four million articles.
15.3 Repositories
A ranking of repositories is at http://repositories.webometrics.info/en/world. It
contains pointers to many more repositories than those listed below, some of which are
also of interest in the context of numerical computation.
NETLIB (http://www.netlib.org/) is a collection of papers, databases, and
mathematical software. It gives access, for instance, to LAPACK, a freely available
collection of professional-grade routines for computing
ScaLAPACK, a library of high-performance linear algebra routines for distributed-memory computers and networks of workstations; ScaLAPACK is a continuation
of the LAPACK project;
SLEPc, a package for the solution of large, sparse eigenproblems on parallel computers, as well as related problems such as singular value decomposition;
SUNDIALS [1], a family of closely related solvers: CVODE, for systems of ordinary differential equations, CVODES, a variant of CVODE for sensitivity analysis,
KINSOL, for systems of nonlinear algebraic equations, and IDA, for systems of
differential-algebraic equations; these solvers can deal with extremely large systems, in serial or parallel environments;
SuperLU, a general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations via LU factorization;
TAO, a library for large-scale optimization, including nonlinear least squares, unconstrained minimization, bound-constrained optimization, and general nonlinear
optimization, with strong emphasis on the reuse of external tools where appropriate; TAO can be used in serial or parallel environments.
Pointers to a number of other interesting packages are also provided in the pages
dedicated to each of these products.
CiteSeerX (http://citeseerx.ist.psu.edu) focuses primarily on the literature in
computer and information science. It can be used to find papers that quote some
other papers of interest, and often provides free access to electronic versions of these
papers.
The Collection of Computer Science Bibliographies hosts more than three million
references, mostly to journal articles, conference papers, and technical reports. About
one million of them contain a URL for an online version of the paper (http://liinwww.ira.uka.de/bibliography).
The Arxiv Computing Research Repository (http://arxiv.org/) allows researchers
to search for and download papers through its online repository, at no charge.
HAL (http://hal.archives-ouvertes.fr/) is another multidisciplinary open access
archive for the deposit and dissemination of scientific research papers and PhD
dissertations.
Interval Computation (http://www.cs.utep.edu/interval-comp/) is a rich source of
information about guaranteed computation based on interval analysis.
15.4 Software
15.4.1 High-Level Interpreted Languages
High-level interpreted languages are mainly used for prototyping and teaching, as
well as for designing convenient interfaces with compiled code offering faster execution.
15.5 OpenCourseWare
OpenCourseWare, or OCW, consists of course material created by universities and
shared freely via the Internet. Material may include videos, lecture notes, slides,
exams and solutions, etc. Among the institutions offering courses in applied mathematics and computer science are
References
1. Hindmarsh, A., Brown, P., Grant, K., Lee, S., Serban, R., Shumaker, D., Woodward, C.: SUNDIALS: suite of nonlinear and differential/algebraic equation solvers. ACM Trans. Math. Softw. 31(3), 363–396 (2005)
2. Galassi, M., et al.: GNU Scientific Library Reference Manual, 3rd edn. Network Theory Ltd, Bristol (2009)
Chapter 16
Problems
This chapter consists of problems given over the last 10 years to students as part
of their final exam. Some of these problems present theoretically interesting and
practically useful numerical techniques not covered in the previous chapters. Many
of them translate easily into computer-lab work. Most of them build on material
pertaining to several chapters, and this is why they have been collected here.
x_i = 1/N,  i = 1, . . . , N.  (16.1)
1. To compute xk , one needs a probabilistic model of the behavior of the WEB surfer.
The simplest possible model is to assume that the surfer always moves from one
page to the next by clicking on a button and that all the buttons of a given page
have the same probability of being selected. One thus obtains the equation of a
huge Markov chain
x_{k+1} = S x_k,  (16.2)

É. Walter, Numerical Methods and Optimization, DOI: 10.1007/978-3-319-07671-3_16,
© Springer International Publishing Switzerland 2014
where S has the same dimensions as M. Explain how S is deduced from M. What
are the constraints that S must satisfy to express that (i) if one is in any given page
then one must leave it and (ii) all the ways of doing so have the same probability?
What are the constraints satisfied by the entries of xk+1 ?
2. Assume, for the time being, that each page can be reached from any other page
after a finite (although potentially very large) number of clicks (this is Hypothesis
H1). The Markov chain then converges toward a unique stationary state x , such
that
x = Sx ,
(16.3)
and the ith entry of x is the probability that the surfer is in page i. The higher this
probability is the more this page is visible from the others. PageRank basically
orders the pages answering a given query by decreasing values of the corresponding entries of x . If H1 is satisfied, the eigenvalue of S with the largest modulus
is unique, and equal to 1. Deduce from this fact an algorithm to evaluate x .
Assuming that ten pages point on average toward a given page, show that the
number of arithmetical operations needed to compute x_{k+1} from x_k is O(N).
3. Unfortunately, H1 is not realistic. Some pages, for instance, do not point toward
any other page, which translates into columns of zeros in M. Even when there are
buttons on which to click, the surfer may decide to jump to a page toward which
the present page does not point. This is why S is replaced in (16.2) by
A = (1 − α) S + (α/N) 1 · 1ᵀ,  (16.4)

with α = 0.15 and 1 a column vector with all of its entries equal to one. To
what hypothesis on the behavior of the surfer does this correspond? What is the
consequence of replacing S by A as regards the number of arithmetical operations
required to compute xk+1 from xk ?
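The resulting power iteration can be sketched as follows (added illustration; the tiny example web, the handling of dangling pages, and the page numbering are simplifications, not part of the problem statement):

```python
def pagerank(links, alpha=0.15, iters=100):
    """Power iteration x <- (1 - alpha) * S x + (alpha / N) * 1, cf. (16.2)-(16.4).

    links[j] lists the pages that page j points toward; each column of S is
    uniform over the outgoing links, so S x costs O(number of links) per
    iteration rather than O(N**2).
    """
    n = len(links)
    x = [1.0 / n] * n
    for _ in range(iters):
        y = [alpha / n] * n               # teleportation term of A in (16.4)
        for j, out in enumerate(links):
            if out:                       # dangling pages contribute via teleportation only (a simplification)
                w = (1.0 - alpha) * x[j] / len(out)
                for i in out:
                    y[i] += w
        x = y
    return x

# Tiny 3-page web: page 0 -> 1, page 1 -> 2, page 2 -> 0 and 1.
x = pagerank([[1], [2], [0, 1]])
print(x, sum(x))   # entries sum to 1; page 1, pointed to by two pages, ranks highest
```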
Table 16.1 Two-level factorial design for the four normalized factors

Experiment   x1   x2   x3   x4
1            −1   −1   −1   −1
2            −1   −1   +1   +1
3            −1   +1   −1   +1
4            −1   +1   +1   −1
5            +1   −1   −1   +1
6            +1   −1   +1   −1
7            +1   +1   −1   −1
8            +1   +1   +1   +1

Table 16.2 Heights of the resulting brioches: 12, 15.5, 14.5, 12, 9.5, 10.5, 11, . . .
1. Give affine transformations that replace the feasible intervals for the decision variables by the normalized interval [1, 1]. In what follows, it will be assumed that
these transformations have been carried out, so xi [1, 1], for i = 1, 2, 3, 4,
which defines the feasible domain X for the normalized decision vector x.
2. To study the influence of the value taken by x on the height of the brioche,
a statistician recommends carrying out the eight experiments summarized by
Table 16.1. (Because each decision variable (or factor) only takes two values, this
is called a two-level factorial design in the literature on experiment design. Not
all possible combinations of extreme values of the factors are considered, so this
is not a full factorial design.) Tell the cook what he or she should do.
3. The cook comes back with the results described by Table 16.2.
The height of a brioche is modeled by the polynomial

ym(x, p) = p0 + p1 x1 + p2 x2 + p3 x3 + p4 x4 + p5 x2 x3,  (16.5)

where

p = (p0, p1, p2, p3, p4, p5)ᵀ.  (16.6)
Explain in detail how you would use a computer to evaluate the value p̂ of p that
minimizes

J(p) = Σ_{j=1}^{8} [ y(xʲ) − ym(xʲ, p) ]²,  (16.7)
where x j is the value taken by the normalized decision vector during the jth experiment and y(x j ) is the height of the resulting brioche. (Do not take advantage, at
this stage, of the very specific values taken by the normalized decision variables;
the method proposed should remain applicable if the values of each of the normalized decision variables were picked at random in [1, 1].) If several approaches
are possible, state their pros and cons and explain which one you would choose
and why.
4. Take now advantage of the specific values taken by the normalized decision
variables to compute, by hand,
p̂.  (16.8)

What is the condition number of the problem for the spectral norm? What do
you deduce from the numerical value of p̂ as to the influence of the four factors?
Formulate your conclusions so as to make them understandable by the cook.
5. Based on the resulting polynomial model, one now wishes to design a recipe that
maximizes the height of the brioche while maintaining each of the normalized
decision variables in its feasible interval [-1, 1]. Explain how you would compute
x̂ = arg max_{x ∈ X} ym(x, θ̂).   (16.9)
u(t) = M(t),
(16.10)
(16.11)
and
where the value of c is assumed known. In what follows, the control input u(t) for t ∈ [tk, tk+1] is obtained by linear interpolation between uk = u(tk) and uk+1 = u(tk+1), and the problem to be solved is the computation of the sequence uk (k = 0, 1, . . . , N).
The instants of time tk are regularly spaced, so
tk+1 - tk = h, k = 0, 1, . . . , N,
(16.12)
x(t) = ( z(t), ż(t), M(t) )^T.   (16.13)
2. Show how this state equation can be integrated numerically with the explicit Euler
method when all the u k s and the initial condition x(0) are known.
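The explicit Euler scheme of question 2 can be sketched generically; the lander dynamics are not reproduced here, so a placeholder right-hand side stands in for them:

```python
import numpy as np

def euler_explicit(f, x0, t0, h, n_steps):
    """Explicit Euler: x_{k+1} = x_k + h * f(t_k, x_k)."""
    x = np.asarray(x0, dtype=float)
    t = t0
    traj = [x.copy()]
    for _ in range(n_steps):
        x = x + h * f(t, x)
        t += h
        traj.append(x.copy())
    return np.array(traj)

# Placeholder dynamics (not the lander equations): x' = -x has the exact
# solution exp(-t), which makes the first-order accuracy easy to verify
traj = euler_explicit(lambda t, x: -x, [1.0], 0.0, 0.01, 100)
```

For the lander, f would evaluate the right-hand side of the state equation built on (16.13), with u(t) interpolated from the uk's.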
3. Same question with the implicit Euler method. Show how it can be made explicit.
4. Same question with Gear's method of order 2; do not forget to address its initialization.
5. Show how to compute u 0 , u 1 , . . . , u N ensuring a safe landing, i.e.,
z(tN) = 0,  ż(tN) = 0.   (16.14)
Assume that N > Nmin , where Nmin is the smallest value of N that makes it
possible to satisfy (16.14), so there are infinitely many solutions. Which method
would you use to select one of them?
6. Show how the constraints
0 ≤ uk ≤ umax, k = 0, 1, . . . , N,   (16.15)
and
M(t) ≥ ME   (16.16)
can be taken into account, with ME the (known) mass of the module when the fuel tank is empty.
rate, while the partial pressure y(ti ) of formaldehyde in the air leaving the chamber
at ti > 0 was measured by chromatography (i = 1, . . . , N ). The instants ti were not
regularly spaced.
The partial pressure y(t) of formaldehyde, initially very high, turned out to
decrease monotonically, very quickly during the initial phase and then considerably more slowly. This led to postulating a model in which the paint is organized in
two layers. The top layer releases formaldehyde directly into the atmosphere with
which it is in contact, while the formaldehyde in the bottom layer must pass through
the top layer to be released. The resulting model is described by the following set of
differential equations:
ẋ1 = -p1 x1,
ẋ2 = p1 x1 - p2 x2,   (16.17)
ẋ3 = -c x3 + p3 x2,
where x1 is the formaldehyde concentration in the bottom layer, x2 is the formaldehyde concentration in the top layer and x3 is the formaldehyde partial pressure in the
air leaving the chamber. The constant c is known numerically whereas the parameters
p1 , p2 , and p3 and the initial conditions x1 (0), x2 (0), and x3 (0) are unknown and
define a vector p ∈ R⁶ of parameters to be estimated from the experimental data.
Each y(ti ) corresponds to a measurement of x3 (ti ) corrupted by noise.
1. For a given numerical value of p, show how the evolution of the state
x(t, p) = [x1 (t, p), x2 (t, p), x3 (t, p)]T
(16.18)
can be evaluated via the explicit and implicit Euler methods. Recall the advantages
and limitations of these methods. (Although (16.17) is simple enough to have a
closed-form solution, you are not asked to compute this solution.)
2. Same question for a second-order prediction-correction method.
3. Propose at least one procedure for evaluating the p̂ that minimizes
J(p) = Σ_{i=1}^{N} [y(ti) - x3(ti, p)]²,   (16.19)
The solution for x3 can be written as a sum of decaying exponentials,
x3(t, q) = a1 e^{-p1 t} + a2 e^{-p2 t} + a3 e^{-c t},   (16.20)
where
q = (a1, p1, a2, p2, a3)^T   (16.21)
is a new parameter vector. The initial formaldehyde partial pressure in the air leaving the chamber is then estimated as
x3 (0, q) = a1 + a2 + a3 .
(16.22)
4. Assuming that
c > p2 > p1 > 0,   (16.23)
show how a simple transformation makes it possible to use linear least squares for finding a first value of a1 and p1 based on the last data points. Use for this purpose the fact that, for t sufficiently large,
x3(t, q) ≈ a1 e^{-p1 t}.   (16.24)
5. Deduce from the previous question a method for estimating a2 and p2 , again with
linear least squares.
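The log-linear transformation of question 4 can be sketched on synthetic data generated with assumed values of a1 and p1:

```python
import numpy as np

rng = np.random.default_rng(1)
a1_true, p1_true = 2.0, 0.3          # assumed values, for illustration only
t = np.linspace(5.0, 20.0, 12)       # the "last data points"
y = a1_true * np.exp(-p1_true * t) * np.exp(rng.normal(0.0, 0.01, t.size))

# Taking logs turns (16.24) into a line: ln y = ln(a1) - p1 * t,
# a problem linear least squares solves directly
A = np.column_stack([np.ones_like(t), -t])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
a1_hat, p1_hat = np.exp(coef[0]), coef[1]
```

Subtracting the fitted tail a1_hat * exp(-p1_hat * t) from the data then exposes the next exponential term, which is the idea behind question 5.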
6. For the numerical values of p1 and p2 thus obtained, suggest a method for finding
the values of a1 , a2 , and a3 that minimize the cost
J(q) = Σ_{i=1}^{N} [y(ti) - x3(ti, q)]²,   (16.25)
(16.26)
1. Let xi be the number of Type i objects that the smuggler puts in his backpack
(i = 1, 2, 3). Compute the integer ximax that corresponds to the largest number
of Type i objects that the smuggler can take with him (if he only carries objects
of Type i). Compute the corresponding income (for i = 1, 2, 3). Deduce a lower
bound for the achievable income from your results.
2. Since the xi's should be integers, maximizing the smuggler's income under a constraint on the weight of his backpack is a problem of integer programming. Neglect this for the time being, and assume just that
0 ≤ xi ≤ ximax, i = 1, 2, 3.   (16.27)
Express then income maximization as a standard linear program, where all the
decision variables are non-negative and all the other constraints are equality constraints. What is the dimension of the resulting decision vector x? What is the
number of scalar equality constraints?
3. Detail one iteration of the simplex algorithm (start from a basic feasible solution
with x1 = 5, x2 = 0, x3 = 5, which seems reasonable to the smuggler as his
backpack is then as heavy as he can stand).
4. Show that the result obtained after this iteration is optimal. What can be said of
the income at this point compared with the income at a feasible point where the
xi s are integers?
5. One of the techniques available for integer programming is Branch and Bound,
which is based in the present context on solving a series of linear programs.
Whenever one of these problems leads to an optimal value x̂i that is not an integer when it should be, this problem is split (this is branching) into two new linear programs. In one of them, the constraint
xi ≤ ⌊x̂i⌋   (16.28)
is added, and in the other
xi ≥ ⌈x̂i⌉,   (16.29)
where ⌊x̂i⌋ is the largest integer that is smaller than x̂i and ⌈x̂i⌉ is the smallest integer that is larger. Write the resulting two problems in standard form (without attempting to find their solutions).
6. This branching process continues until one of the linear programs generated leads
to a solution where all the variables that should be integers are so. The associated
income is then a lower bound of the optimal feasible income (why?). How can
this information be taken advantage of to eliminate some of the linear programs
that have been created? What should be done with the surviving linear programs?
7. Explain the principle of Branch and Bound for integer programming in the general
case. Can the optimal feasible solution escape? What are the limitations of this
approach?
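The Branch-and-Bound principle can be illustrated on a toy 0/1 knapsack (made-up values and weights, not the smuggler's data; the smuggler's variables may exceed 1, which the same scheme handles with more branches per node). The LP relaxation provides the upper bound, and any integer-feasible solution provides the lower bound used for pruning:

```python
values = [60, 100, 120]
weights = [10, 20, 30]
W = 50
order = sorted(range(len(values)), key=lambda i: values[i] / weights[i],
               reverse=True)

def bound(i, value, room):
    """Upper bound from the LP relaxation: fill greedily, last item fractional."""
    b = value
    for j in order[i:]:
        if weights[j] <= room:
            room -= weights[j]
            b += values[j]
        else:
            b += values[j] * room / weights[j]
            break
    return b

def branch(i, value, room, best):
    best = max(best, value)              # integer-feasible incumbent: lower bound
    if i == len(order) or bound(i, value, room) <= best:
        return best                      # prune: the relaxation cannot beat it
    j = order[i]
    if weights[j] <= room:               # branch with x_j = 1
        best = branch(i + 1, value + values[j], room - weights[j], best)
    return branch(i + 1, value, room, best)   # branch with x_j = 0

best = branch(0, 0, W, 0)
```

No optimal feasible solution can escape, because a branch is discarded only when its relaxed optimum is already no better than a feasible incumbent.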
x1 = p1 x1 2 x2 3 ,
(16.30)
J(p) = Σ_{i, x2} [y(i, x2) - x1(i, x2, p)]².   (16.31)
How can one compute the gradient of this cost function? How could then one
implement a quasi-Newton method? Do not forget to address initialization and
stopping.
x^i = (x1i, x2i)^T, i = 1, . . . , N.   (16.32)
This problem concentrates on a given cross-section of the log, but the same operations
can be repeated on each of the cross-sections for which data are available.
To detect deviations from an (ideal) circular cross-section, we want to estimate the parameter vector p = (p1, p2, p3)^T of the circle equation
(x1i - p1)² + (x2i - p2)² = p3².   (16.33)
A first estimate of p is
p̂1 = arg min_p J1(p),   (16.34)
where
J1(p) = Σ_{i=1}^{N} (1/2) ei²(p),   (16.35)
425
with
ei(p) = (x1i - p1)² + (x2i - p2)² - p3².   (16.36)
To reduce the influence of outliers, a robust estimate
p̂2 = arg min_p J2(p)   (16.37)
is preferred, where
J2(p) = Σ_{i=1}^{N} ρ( ei(p) / s(p) ),   (16.38)
with ρ(·) a Huber-type function such that
ρ(v) = v²/2 if |v| ≤ δ, and ρ(v) = δ(|v| - δ/2) otherwise,   (16.39)
with δ = 3/2. The quantity s(p) in (16.38) is a robust estimate of the error dispersion based on the median of the absolute values of the residuals
s(p) = 1.4826 med_{i=1,...,N} |ei(p)|.   (16.40)
(The value 1.4826 was chosen to ensure that if the residuals ei(p) were independently and identically distributed according to a zero-mean Gaussian law with variance σ², then s would tend to the standard deviation σ when N tends to infinity.) In practice, an iterative procedure is used to take the dependency of s on p into account, and p^{k+1} is computed using
s(p^k)   (16.41)
instead of s(p).
a. Plot the graph of the function ρ(·), and explain why p̂2 can be expected to be a better estimate of p than p̂1.
b. Detail the computations required to implement a gradient algorithm to improve on p̂0 in the sense of J2(·). Provide, among other things, a closed-form expression for the gradient of the cost.
c. Detail the computations required to implement a Gauss-Newton algorithm. Provide, among other things, a closed-form expression for the approximate Hessian.
d. After convergence of the optimization procedure, one may eliminate the data points (x1i, x2i) associated with the largest values of |ei(p̂2)| from the sum in (16.38) before launching another minimization of J2, and this procedure may be iterated. What is your opinion about this strategy? What are the pros and cons of the following two options:
removing a single data point before each new minimization,
simultaneously removing the n > 1 data points that are associated with the largest values of |ei(p̂2)| before each new minimization?
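The robust scale (16.40) and the Huber-type loss (16.39) can be sketched directly; the residuals below are synthetic, not actual circle-fit errors:

```python
import numpy as np

def robust_scale(e):
    """Robust dispersion estimate (16.40)."""
    return 1.4826 * np.median(np.abs(e))

def huber(v, delta=1.5):
    """Huber-type loss (16.39) with delta = 3/2: quadratic near zero,
    linear in the tails, so gross outliers contribute O(|v|), not O(v^2)."""
    a = np.abs(v)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

rng = np.random.default_rng(2)
e = rng.normal(0.0, 2.0, size=10_000)   # synthetic residuals with sigma = 2
e[:50] += 100.0                         # a few gross outliers

s = robust_scale(e)        # stays close to 2.0 despite the outliers
J2 = np.sum(huber(e / s))  # robust cost of the form (16.38)
```

A sample standard deviation computed on the same residuals would be inflated by the outliers, which is exactly why the median-based scale is used.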
(16.42)
are the training data. They are used to build a mathematical model, which may then be employed to predict y(u) for u ≠ ui. The model output takes the form of a linear combination of basis functions φj(u), j = 1, . . . , n, with the parameter vector p of the model consisting of the weights pj of the linear combination
ym(u, p) = Σ_{j=1}^{n} pj φj(u).   (16.43)
1. Assuming that the basis functions have already been chosen, show how to compute the estimate
p̂ = arg min_p J(p),   (16.44)
where
J(p) = Σ_{i=1}^{N} [yi - ym(ui, p)]²,   (16.45)
with N ≥ n. Enumerate the methods available, recall their pros and cons, and choose one of them. Detail the contents of the matrices and vectors needed as input by a routine implementing this method, which you will assume available.
2. Radial basis functions are selected. They are such that
φj(u) = g( (u - cj)^T Wj (u - cj) ),   (16.46)
where the vector cj (to be chosen) is the center of the jth basis function, Wj (to be chosen) is a symmetric positive definite weighting matrix and g(·) is the Gaussian activation function, such that
g(x) = exp(-x²/2).   (16.47)
In the remainder of this problem, for the sake of simplicity, we assume that
dim u = 2, but the method extends without difficulty (at least conceptually) to
more than two inputs.
For
cj = (1, 1)^T,  Wj = (1/σj²) I2,   (16.48)
plot a level set of φj(u) (i.e., the locus in the (u1, u2) plane of the points such that φj(u) takes a given constant value). For a given value of φj(u), how does the level set evolve when σj² increases?
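One basis-function evaluation can be sketched under the reconstruction of (16.46)-(16.48) used here (isotropic Wj = I/σj² and Gaussian g(x) = exp(-x²/2), both assumptions of this reconstruction):

```python
import numpy as np

def phi(u, c, sigma):
    """Gaussian RBF: g(x) = exp(-x**2 / 2) applied to the weighted squared
    distance (u - c)^T W (u - c) with W = I / sigma**2."""
    d2 = np.sum((np.asarray(u) - c) ** 2) / sigma**2
    return np.exp(-d2**2 / 2)

c = np.array([1.0, 1.0])
u = np.array([1.5, 1.0])            # a point at distance 0.5 from the center

small_sigma = phi(u, c, sigma=0.5)  # steep basis function, small value here
large_sigma = phi(u, c, sigma=2.0)  # flatter basis function, value near 1
```

Since Wj is a multiple of the identity, the level sets are circles centered at cj, and the circle attached to a given φ-value grows as σj² increases.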
3. This very simple model may be refined, for instance by replacing p j by the jth
local model
p j,0 + p j,1 u 1 + p j,2 u 2 ,
(16.49)
which is linear in its parameters p j,0 , p j,1 , and p j,2 . This leads to
ym(u, p) = Σ_{j=1}^{n} (pj,0 + pj,1 u1 + pj,2 u2) φj(u),   (16.50)
where the weighting function φj(u) specifies how much the jth local model should contribute to the output of the global model. This is why φj(·) is called an activation function. It is still assumed that φj(u) is given by (16.46), with now
Wj = diag( 1/σ²_{1,j} , 1/σ²_{2,j} ).   (16.51)
2. What criterion would you suggest for choosing the rectangle to be split?
3. To avoid a combinatorial explosion of the number of rectangles, all possible bisections are considered and compared before selecting a single one of them. What
criterion would you suggest for comparing the performances of the candidate
bisections?
4. Summarize the algorithm for an arbitrary number of inputs, and point out its pros
and cons.
5. How would you deal with a system with several scalar outputs?
6. Why is the method called LOLIMOT?
7. Compare this approach with Kriging.
(16.52)
(16.53)
with
How can the method developed in Sect. 16.8.2 be adapted to deal with this new
situation?
2. How could it be adapted to deal with MISO dynamical systems?
3. How could it be adapted to deal with MIMO dynamical systems?
ym(k, p, u^{k-1}) = Σ_{i=1}^{n} hi u_{k-i},   (16.54)
where
p = (h1, . . . , hn)^T   (16.55)
and
u^{k-1} = (u_{k-1}, . . . , u_{k-n})^T.   (16.56)
The vector u^{k-1} thus contains all the values of the input needed for computing the model output ym at the instant of time indexed by k. Between k and k + 1, the input of the actual continuous-time process is assumed constant and equal to uk. When the input is such that u0 = 1 and ui = 0 for i ≠ 0, the value of the model output at the time indexed by i > 0 is hi when 1 ≤ i ≤ n and zero when i > n. Equation (16.54), which may be viewed as a discrete convolution, thus describes a finite impulse response (or FIR) model. A remarkable property of FIR models is that their output ym(k, p, u^{k-1}) is linear in p when u^{k-1} is fixed, and linear in u^{k-1} when p is fixed.
The goal of this problem is first to estimate p from input-output data collected
on the process, and then to compute a sequence of inputs u i enforcing some desired
behavior on the model output once p has been estimated, in the hope that this sequence
will approximately enforce the same behavior on the process output. In both cases,
the initial instant of time is indexed by zero. Finally, the consequences of replacing
the use of an l2 norm by that of an l1 norm are investigated.
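The FIR model and its two linearity properties can be checked in a few lines (h below is an arbitrary illustrative impulse response):

```python
import numpy as np

h = np.array([0.5, 0.3, 0.2])        # illustrative impulse response, n = 3
kernel = np.concatenate([[0.0], h])  # the sum in (16.54) starts at i = 1

def fir(u):
    """FIR output ym(k) = sum_i h_i u_{k-i}: a discrete convolution."""
    return np.convolve(u, kernel)[:u.size]

impulse = np.zeros(8)
impulse[0] = 1.0
ym_imp = fir(impulse)                # recovers h_i at times 1..n, 0 afterwards

u = np.arange(8.0)
lin_check = np.allclose(fir(2.0 * u), 2.0 * fir(u))   # linearity in the input
```

Linearity in p (for fixed inputs) is what makes the estimation step below a linear least-squares problem; linearity in the inputs (for fixed p) is what makes the control steps tractable.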
(16.57)
p̂ = arg min_p J1(p),   (16.58)
where
J1(p) = Σ_{k=1}^{N} e1²(k, p),   (16.59)
with N ≥ n and
e1(k, p) = yk - ym(k, p, u^{k-1}).   (16.60)
(16.61)
which has been chosen and is thus numerically known (it may be computed by some
reference model).
1. Assuming that the first entry of p̂ is nonzero, give a closed-form expression for the value of uk ensuring that the one-step-ahead prediction of the output provided by the model is equal to the corresponding value of the reference trajectory, i.e.,
ym(k + 1, p̂, u^k) = yr(k + 1).   (16.62)
(All the past values of the input are assumed known at the instant of time indexed
by k.)
2. What may make the resulting control law inapplicable?
3. Rather than adopting this short-sighted policy, one may look for a sequence of inputs that is optimal on some horizon [0, M]. Show how to compute
û = arg min_{u ∈ R^M} J2(u),   (16.63)
where
u = (u0, u1, . . . , u_{M-1})^T   (16.64)
and
J2(u) = Σ_{i=1}^{M} e2²(i, u^{i-1}),   (16.65)
with
e2(i, u^{i-1}) = yr(i) - ym(i, p̂, u^{i-1}).   (16.66)
(16.67)
To avoid unfeasible inputs (and save energy), one of the possible approaches is to use a penalty function and minimize
J3(u) = J2(u) + λ u^T u,   (16.68)
with λ > 0 chosen by the user and known numerically. Show that (16.68) can be rewritten as
J3(u) = (Au - b)^T (Au - b) + λ u^T u,   (16.69)
(16.70)
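A cost of the form (16.69) has a closed-form minimizer; a sketch with placeholder A and b (made up, not derived from (16.65)):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(12, 4))     # placeholder A and b, for illustration only
b = rng.normal(size=12)
lam = 0.5

# Setting the gradient of (16.69) to zero, 2 A^T (A u - b) + 2 lam u = 0,
# gives the regularized normal equations (A^T A + lam I) u = A^T b
u_hat = np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b)
grad = 2 * A.T @ (A @ u_hat - b) + 2 * lam * u_hat
```

The matrix A^T A + λI is positive definite for any λ > 0, so the system is always solvable and the stationary point is the global minimizer.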
(16.71)
where
J4(u^k) = λ (u^k)^T u^k + Σ_{i=k}^{k+M-1} e2²(i + 1, u^i),   (16.72)
with
u^k = (uk, uk+1, . . . , uk+M-1)^T.   (16.73)
Only the first entry of the minimizing sequence ûk is applied to the process; its other entries are discarded. The same procedure is carried out at the next discrete instant of time, with the index k incremented by one. Draw a detailed flow chart of a routine alternating two steps. In the first step, p̂ is estimated from past data, while in the second the input to be applied is computed by GPC from future desired behavior. You may refer to the numbers of the equations in this text instead of rewriting them. Whenever you need a general-purpose subroutine, assume that it is available and just specify its input and output arguments and what it does.
8. What are the advantages of this procedure compared to those previously considered in this problem?
J5(p) = Σ_{i=1}^{N} |e1(i, p)|.   (16.74)
Show that the optimal value for p can now be computed by minimizing
J6(p, x) = Σ_{i=1}^{N} xi   (16.75)
under the constraints
-xi ≤ e1(i, p) ≤ xi, i = 1, . . . , N.   (16.76)
2. What approach do you suggest for this computation? Put the problem in standard
form when n = 2 and N = 4.
3. Starting from the p̂ obtained by the method just described, how would you compute the sequence of inputs that minimizes
J7 (u) =
N
(16.77)
i=1
(16.78)
u(i) = 0, i < 0.
(16.79)
yk = Σ_{i=1}^{n} ai y_{k-i} + Σ_{j=1}^{n} bj u_{k-j} + εk.   (16.80)
In (16.80), the integer n ≥ 1 is assumed fixed beforehand. Although the general case is considered in what follows, you may take n = 2 for the purpose of illustration, and simplify (16.80) into
yk = a1 y_{k-1} + a2 y_{k-2} + b1 u_{k-1} + b2 u_{k-2} + εk.   (16.81)
In (16.80) and (16.81), uk is the input and yk the output, both measured on the process at the instant of time indexed by the integer k. The εk's are random variables accounting for the imperfect nature of the model. They are assumed independently and identically distributed according to a zero-mean Gaussian law with variance σ². Such a model is then called AutoRegressive with eXogenous variables (or ARX).
The unknown vector of parameters
p = (a1, . . . , an, b1, . . . , bn)^T   (16.82)
(16.83)
(16.84)
where
JN(p) = Σ_{i=1}^{N} [yi - ym(i, p)]²,   (16.85)
with
ym(k, p) = Σ_{i=1}^{n} ai y_{k-i} + Σ_{j=1}^{n} bj u_{k-j}.   (16.86)
(For the sake of simplicity, all the past values of y and u required for computing
ym (1, p) are assumed to be known.)
This problem consists of three parts. The first of them studies the evaluation of
p N from all the data (16.83) considered simultaneously. This corresponds to a batch
algorithm. The second part addresses the recursive treatment of the data, which makes
it possible to take each datum into account as soon as it becomes available, without
waiting for data collection to be completed. The third part applies the resulting
algorithms to process control.
(16.87)
(16.88)
for a matrix F N and a vector y N to be specified. You will assume in what follows
that the columns of F N are linearly independent.
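How FN and yN could be assembled for n = 2, on synthetic noise-free data (p_true below is an arbitrary made-up parameter vector):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2
p_true = np.array([0.5, -0.2, 1.0, 0.3])   # (a1, a2, b1, b2), made-up values

N = 50
u = rng.normal(size=N)
y = np.zeros(N)
for k in range(n, N):                       # past values y0, y1 taken as zero
    y[k] = p_true @ np.array([y[k-1], y[k-2], u[k-1], u[k-2]])

# Row k of F_N collects the regressors of (16.86); y_N stacks the outputs,
# so that y_N = F_N p for the noise-free data generated above
F = np.array([[y[k-1], y[k-2], u[k-1], u[k-2]] for k in range(n, N)])
yN = y[n:]

p_hat, *_ = np.linalg.lstsq(F, yN, rcond=None)
```

With noise-free data the least-squares estimate recovers p_true exactly (up to rounding), which makes the construction easy to validate.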
3. Let QN and RN be the matrices resulting from a QR factorization of the composite matrix [FN | yN]:
[FN | yN] = QN RN.   (16.89)
Assume that RN can be partitioned as
RN = [ MN ; O ],   (16.90)
with O a zero matrix, and show that
p̂N = arg min_{p ∈ R^{2n}} ‖ MN (p^T, -1)^T ‖².   (16.92)
4. Deduce from (16.92) the linear system of equations to be solved for computing p̂N.
(16.93)
M_{N+1} = [ MN ; f^T_{N+1}  y_{N+1} ],   (16.94)
where
R_{N+1} = [ U_{N+1}  v_{N+1} ; 0^T  α_{N+1} ].   (16.95)
It may be computed from
[ UN  vN ; f^T_{N+1}  y_{N+1} ]   (16.96)
(16.97)
J0(u0) = Σ_{i=1}^{M} [yr(i) - ym*(i, p̂, u)]² + λ Σ_{j=0}^{M-1} uj².   (16.98)
In (16.98), u comprises all the input values needed to evaluate J0(u0) (including u0), λ is a (known) positive tuning parameter, yr(i) for i = 1, . . . , M is the (known) desired trajectory and
ym*(i, p̂, u) = Σ_{j=1}^{n} âj ym*(i - j, p̂, u) + Σ_{k=1}^{n} b̂k u_{i-k}.   (16.99)
2. Why did we replace (16.86) by (16.99) in the previous question? What is the price
to be paid for this change of process model?
3. Why did we not replace (16.86) by (16.99) in Sects. 16.10.1 and 16.10.2? What
is the price to be paid?
4. Rather than applying the sequence of inputs just computed to the process without caring about how it responds, one may estimate in real time at tk the parameters p̂k of the model (16.86) from the data collected thus far (possibly with an exponential forgetting of the past), and then compute the sequence of inputs ûk that minimizes a cost function based on the prediction of future behavior
Jk(u^k) = Σ_{i=k+1}^{k+M} [yr(i) - ym*(i, p̂k, u)]² + λ Σ_{j=k}^{k+M-1} uj²,   (16.100)
with
u^k = (uk, uk+1, . . . , uk+M-1)^T.   (16.101)
ẋ1 = r1 x1 (k1 - x1 - α1,2 x2)/k1,  x1(0) = x10,   (16.102)
ẋ2 = r2 x2 (k2 - x2 - α2,1 x1)/k2,  x2(0) = x20,   (16.103)
where x1 and x2 are the population sizes, large enough to be treated as non-negative
real numbers. The initial sizes x10 and x20 of the two populations are assumed known
here, so the vector of unknown parameters is
p = (r1, k1, α1,2, r2, k2, α2,1)^T.   (16.104)
All of these parameters are real and non-negative. The parameter ri quantifies the rate
of increase in Population i (i = 1, 2) when x1 and x2 are small. This rate decreases
as specified by ki when xi increases, because available resources then get scarcer.
The negative effect on the rate of increase in Population i of competition for the resources with Population j ≠ i is expressed by αi,j.
1. Show how to solve (16.102) and (16.103) by the explicit and implicit Euler methods when the value of p is fixed. Explain the difficulties raised by the implementation of the implicit method, and suggest another solution than attempting to
make it explicit.
2. The estimate of p is computed as
p̂ = arg min_p J(p),   (16.105)
where
J(p) = Σ_{i=1}^{2} Σ_{j=1}^{N} [yi(tj) - xi(tj, p)]².   (16.106)
In (16.106), N = 6 and
yi (t j ) is the numerically known result of the measurement of the size of
Population i at the known instant of time t j , (i = 1, 2, j = 1, . . . , N ),
xi (t j , p) is the value taken by xi at time t j in the model defined by (16.102)
and (16.103).
Show how to proceed with a gradient algorithm.
3. Same question with a Gauss-Newton algorithm.
4. Same question with a quasi-Newton algorithm.
5. If one could also measure the time derivatives
ẏi(tj), i = 1, 2, j = 1, . . . , N,   (16.107)
how could one get an initial rough estimate p̂0 for p̂?
6. In order to provide the iterative algorithms considered in Questions 2 to 4 with an initial value p̂0, we want to use the result of Question 5 and evaluate ẏi(tj) numerically from the data yi(tj) (i = 1, 2, j = 1, . . . , N). How would you proceed if the measurement times were not regularly spaced?
ym(t, p) = Σ_{j=1}^{n} aj e^{λj t},   (16.108)
with
p = (a1, λ1, . . . , an, λn)^T.   (16.109)
The number n of these exponential terms is assumed fixed a priori. We keep the first m indices for the real λj's, with n ≥ m ≥ 0. If n > m, then the (n - m) following λk's form pairs of conjugate complex numbers. Equation (16.108) can be transformed into
ym(t, p) = Σ_{j=1}^{m} aj e^{λj t} + 2 Σ_{k=1}^{(n-m)/2} bk e^{αk t} cos(ωk t + φk),   (16.110)
ym(ti, p) = Σ_{k=1}^{n} ck ym(t_{i-k}, p), i = n + 1, . . . , N.   (16.111)
ĉ = arg min_{c ∈ R^n} J(c),   (16.112)
where
J(c) = Σ_{i=n+1}^{N} [ y(ti) - Σ_{k=1}^{n} ck y(t_{i-k}) ]²   (16.113)
and
c = (c1, . . . , cn)^T.   (16.114)
2. The characteristic equation associated with (16.111) is
f(z, c) = z^n - Σ_{k=1}^{n} ck z^{n-k} = 0.   (16.115)
We assume that it has no multiple root. Its roots zi are then related to the exponents λi of the model (16.108) by
zi = e^{λi T}.   (16.116)
Show how to estimate the parameters λ̂i, α̂k, ω̂k, and φ̂k of the model (16.110) from the roots ẑi (i = 1, . . . , n) of the equation f(z, ĉ) = 0.
3. Explain how to compute these roots.
4. Assume now that the parameters λ̂i, α̂k, ω̂k, and φ̂k of the model (16.110) are set to the values thus obtained. Show how to compute the values of the other parameters of this model so as to minimize the cost
J(p) = Σ_{i=1}^{N} [y(ti) - ym(ti, p)]².   (16.117)
5. Explain why the p̂ thus computed is not optimal in the sense of J(·).
6. Show how to improve p̂ with a gradient algorithm initialized at the suboptimal solution obtained previously.
7. Same question with the Gauss-Newton algorithm.
(16.118)
(16.119)
with ximin and ximax known. Give a change of variables x′ = f(x) that puts the constraints under the form
-1 ≤ x′i ≤ 1, i = 1, 2.   (16.120)
In what follows, unless the initial design factors already satisfy (16.120), it is assumed that this change of variables has been performed. To simplify notation, the normalized design factors satisfying (16.120) are still called xi (i = 1, 2).
2. The four elementary experiments that can be obtained with x1 ∈ {-1, 1} and x2 ∈ {-1, 1} are carried out. (This is known as a two-level full factorial design.)
Let y be the vector consisting of the resulting measured values of the performance
index
yi = y(xi ), i = 1, . . . , 4.
(16.121)
Show how to compute
p̂ = arg min_p J(p),   (16.122)
where
J(p) = Σ_{i=1}^{4} [y(x^i) - ym(x^i, p)]²,   (16.123)
successively for the two model structures
ym(x, p) = p1 + p2 x1 + p3 x2   (16.124)
and
ym(x, p) = p1 x1 + p2 x2.   (16.125)
In both cases, give the condition number (for the spectral norm) of the system of
linear equations associated with the normal equations. Do you recommend using
a QR or SVD factorization? Why? How would you suggest to choose between
these two model structures?
3. Due to the presence of measurement noise, it is deemed prudent to repeat N times
each of the four elementary experiments of the two-level full factorial design.
The dimension of y is thus now equal to 4N . What are the consequences of this
repetition of experiments on the normal equations and on their condition number?
4. If the model structure became
ym(x, p) = p1 + p2 x1 + p3 x2 + p4 x1²,   (16.126)
what problem would be encountered if one used the same two-level factorial
design as before? Suggest a solution to eliminate this problem.
(16.127)
with the p̂i (i = 1, . . . , 4) obtained by the method studied in Sect. 16.13.1.
1. Use theoretical optimality conditions to show (without detailing the computations) how this model could be employed to compute
x̂ = arg max_{x ∈ X} ym(x, p̂),   (16.128)
where
X = {x : -1 ≤ xi ≤ 1, i = 1, 2}.   (16.129)
(16.130)
where
T(t) is the number of healthy T cells,
T*(t) is the number of infected T cells,
V(t) is the viral load.
These integers are treated as real numbers, so x(t) ∈ R³. The state equation is
Ṫ = λ - dT - βVT,
Ṫ* = βVT - μT*,   (16.131)
V̇ = kT* - cV,
and the initial conditions x(0) are assumed known. The vector of unknown parameters is
p = [λ, d, β, μ, k, c]^T,   (16.132)
where
d, μ, and c are death rates,
λ is the rate of appearance of new healthy T cells,
β is linked to the probability that a healthy T cell encountering a virus becomes infected,
k links virus proliferation to the death of infected T cells,
and all of these parameters are real and positive.
(16.133)
where y(ti ) is the vector of the outputs measured at time ti on the patient.
2. The parameter vector is to be estimated by minimizing
J(p) = Σ_{i=1}^{N} ‖ y(ti) - Cxm(ti, p) ‖²,   (16.134)
where xm (ti , p) is the result at time ti of simulating the model (16.131) for the
value p of its parameter vector. Expand J (p) to show the state variables that are
measured on the patient (T (ti ), V (ti )) and those resulting from the simulation of
the model (Tm (ti , p), Vm (ti , p)).
3. To evaluate the first-order sensitivity of the state variables of the model with respect to the jth parameter (j = 1, . . . , dim p), it suffices to differentiate the state equation (16.131) (and its initial conditions) with respect to this parameter. One thus obtains another state equation (with its initial conditions), the solution of which is the first-order sensitivity vector
s_{pj}(t, p) = ∂xm(t, p)/∂pj.   (16.135)
Write down the state equation satisfied by the first-order sensitivity s (t, p) of
xm with respect to the parameter . What is its initial condition?
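The sensitivity-equation idea can be sketched on a scalar toy model rather than (16.131): for ẋ = -p x, differentiating with respect to p gives ṡ = -p s - x with s(0) = 0, integrated here jointly with the state by explicit Euler:

```python
import numpy as np

p, x0, h, T = 0.8, 1.0, 1e-4, 2.0    # made-up parameter, initial state, step
x, s = x0, 0.0                        # s = dx/dp, s(0) = 0 (x0 independent of p)
for _ in range(round(T / h)):
    # one Euler step for the augmented system (state + sensitivity)
    x, s = x + h * (-p * x), s + h * (-p * s - x)

# Exact values for comparison: x(t) = x0 exp(-p t), s(t) = -t x0 exp(-p t)
x_exact = x0 * np.exp(-p * T)
s_exact = -T * x0 * np.exp(-p * T)
```

The same augmentation applied to (16.131) yields, for each parameter, three extra differential equations integrated alongside the three state equations.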
4. Assume that the first-order sensitivities of all the state variables of the model
with respect to all the parameters have been computed with the method described
in Question 3. What method do you suggest to use to minimize J (p)? Detail its
implementation and its pros and cons compared to other methods you might think
of.
5. Local optimization methods based on second-order Taylor expansion encounter
difficulties when the Hessian or its approximation becomes too ill-conditioned,
and this is to be feared here. How would you overcome this difficulty?
ym(x, p) = Σ_{i=1}^{3} pi xi + p4 x1 x2 + p5 x1 x3 + p6 x2 x3 + p7 x1 x2 x3.   (16.136)
Experiment   x1   x2   x3   y (in Ω)
1   -1   -1   -1     0
2   +1   -1   -1     0.01
3   -1   +1   -1     0
4   +1   +1   -1     0.01
5   -1   -1   +1   120.6
6   +1   -1   +1   118.3
7   -1   +1   +1     1.155
8   +1   +1   +1     3.009
J(p) = Σ_{i=1}^{8} [y(x^i) - ym(x^i, p)]²,   (16.137)
with y(xi ) the resistance deviation measured during the ith elementary experiment, and xi the corresponding vector of factors. (Inverting a matrix may not be
such a bad idea here, provided that you explain why...)
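The reason inverting a matrix is harmless here is that the seven regressor columns of the two-level design are mutually orthogonal, so F^T F = 8 I and the "inversion" is trivial. A sketch (the column order of the design and the intercept-free model (16.136) follow the reconstruction used in this text):

```python
import numpy as np

x1 = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
x2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
x3 = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)

# Regressor matrix of model (16.136): columns x1, x2, x3 and their products
F = np.column_stack([x1, x2, x3, x1*x2, x1*x3, x2*x3, x1*x2*x3])

gram = F.T @ F            # the columns are mutually orthogonal: F^T F = 8 I
y = np.array([0, 0.01, 0, 0.01, 120.6, 118.3, 1.155, 3.009])
p_hat = F.T @ y / 8.0     # least-squares estimate without any real inversion
```

Each coefficient is thus a simple signed average of the measurements, which also explains why the condition number of the normal equations is 1.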
3. What is the value of J (
p)? Is
p a global minimizer of the cost function J ()?
Could these results have been predicted? Would it have been possible to compute
(16.138)
(16.139)
reactor, and the concentration of each species at any given instant of time is assumed
to be the same anywhere in the reactor. The evolution of the concentrations of the
quantities of interest is then described by the state equation
d[A]/dt = -p1 [A],
d[B]/dt = p1 [A] - p2 [B],   (16.140)
d[C]/dt = p2 [B],
with initial conditions
[A](0) = 1, [B](0) = [C](0) = 0.   (16.141)
The concentration of B then has the closed-form expression
[B](t, p) = ( p1 / (p1 - p2) ) [exp(-p2 t) - exp(-p1 t)],   (16.142)
where p = (p1, p2)^T.
1. Assuming that p is numerically known, and pretending to ignore that (16.142) is
available, show how to solve (16.140) with the initial conditions (16.141) by the
explicit and implicit Euler methods. Recall the pros and cons of the two methods.
2. One wishes to stop the reactions when [B] is maximal. Assuming again that
the value of p is known, compute the optimal stopping time using (16.142) and
theoretical optimality conditions.
3. Assume that p must be estimated from the experimental data
y(ti ), i = 1, 2, . . . , 10,
(16.143)
where y(ti ) is the result of measuring [B] in the reactor at time ti . Explain in
detail how to evaluate
(16.144)
where
J(p) = Σ_{i=1}^{10} {y(ti) - [B](ti, p)}²,   (16.145)
with [B](ti, p) the model output computed by (16.142), using the gradient, Gauss-Newton, and BFGS methods, successively. State the pros and cons of each of these methods.
4. Replace the closed-form solution (16.142) by that provided by a numerical ODE
solver for the Cauchy problem (16.140, 16.141) and consider again the same
question with the Gauss-Newton method. (To compute the first-order sensitivities of [A], [B], and [C] with respect to the parameter pj,
s_j^X(t, p) = ∂[X](t, p)/∂pj,  j = 1, 2,  X = A, B, C,   (16.146)
one may simulate the ODEs obtained by differentiating (16.140) with respect to
p j , from initial conditions obtained by differentiating (16.141) with respect to
p j .) Assume that a suitable ODE solver is available, without having to give details
on the matter.
5. We now wish to replace (16.145) by
J (p) =
10
(16.147)
i=1
(16.148)
where the vectors ai and scalars bi are known numerically. An optimal value for the
design vector is defined as
(16.149)
No closed-form expression for the cost function c() is available, but the numerical
value of c(x) can be obtained for any numerical value of x by running some available
numerical code with x as its input. The response-surface methodology [13, 14] can be used to look for x̂ based on this information, as illustrated in this problem.
Each design variable xi belongs to the normalized interval [-1, 1], so X is a hypercube of width 2 centered on the origin. It is also assumed that a feasible numerical value x̂0 for the design vector has already been chosen. The procedure for finding a better numerical value of the design vector is iterative. Starting from x̂k, it computes x̂k+1 as suggested in the questions below.
1. For small displacements δx around x̂k, one may use the approximation
c(x̂k + δx) ≈ c(x̂k) + (δx)^T p^k.   (16.150)
cj = c(x̂k + δx^j), j = 1, . . . , N,   (16.151)
where the δx^j's are small displacements and N > dim x. Show how the resulting data can be used to estimate the p̂k that minimizes
J(p) = Σ_{j=1}^{N} [cj - c(x̂k) - (δx^j)^T p]².   (16.152)
3. What condition should the δx^j's satisfy for the minimizer of J(p) to be unique? Is it a global minimizer? Why?
4. Show how to compute a displacement δx that minimizes the approximation of c(x̂k + δx) given by (16.150) for p^k = p̂^k, under the constraint x̂k + δx ∈ X. In what follows, this displacement is denoted by δx⁺.
5. To avoid getting too far from x̂k, at the risk of losing the validity of the approximation (16.150), δx⁺ is accepted as the displacement to be used to compute x̂k+1 according to
x̂k+1 = x̂k + δx⁺   (16.153)
only if condition (16.155) holds; otherwise the displacement is halved,
δx = δx⁺ / 2,   (16.154)
and
x̂k+1 = x̂k + δx.   (16.156)
(16.157)
(16.158)
To get data from which p will be estimated, a unit quantity of tracer is injected
into Compartment 1 at t0 = 0, so
x(0) = (1, 0)^T.   (16.159)
The quantity y(ti ) of tracer in the same compartment is then measured at known
instants of time ti > 0 (i = 1, . . . , N ), so one should have
y(ti) ≈ [1 0] x(ti).   (16.160)
(16.161)
(16.162)
(16.163)
where
Jmacro(q) = Σ_{i=1}^{N} [y(ti) - ymacro(ti, q)]².   (16.164)
(16.165)
for all s, where s is the Laplace variable and Y is the Laplace transform of y. (Recall that the Laplace transform of ẋ is sX(s) - x(0).)
6. How can these equations be used to get a first estimate p̂0 of the microparameters?
7. How can this estimate be improved in the sense of the cost function
Jmicro(p) = Σ_{i=1}^{N} [y(ti) - ymicro(ti, p)]²?   (16.166)
y(t) = (1/V) [1 0] x(t),   (16.167)
(16.167)
(16.168)
(16.169)
exp(At) = Σ_{i=0}^{∞} (1/i!) (At)^i.   (16.170)
Many numerical methods are available for solving (16.168). This problem is an
opportunity for exploring a few of them.
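One of these methods is the truncated series (16.170) itself, workable when ‖At‖ is modest (production routines prefer scaling and squaring). A sketch with a matrix whose exponential is known exactly:

```python
import numpy as np

def expm_series(A, t, terms=30):
    """Truncated series (16.170): sum_{i=0}^{terms-1} (A t)^i / i!"""
    S = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for i in range(1, terms):
        term = term @ (A * t) / i   # (A t)^i / i!, built incrementally
        S = S + term
    return S

# For this A, exp(A t) is a rotation matrix, so the result is easy to check
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
t = 1.0
E = expm_series(A, t)
exact = np.array([[np.cos(t), np.sin(t)], [-np.sin(t), np.cos(t)]])
```

For large ‖At‖ the naive series suffers from cancellation, which is one reason the other methods explored in this problem exist.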
(16.171)
(16.172)
3. How can one use this result to compute x(t) for t > 0? Why is the condition
number of T important?
4. What are the advantages of this approach compared with the use of generic methods?
5. Assume now that A is not known, and that the state is regularly measured every
h s., so x(i h) is approximately known, for i = 0, . . . , N . How can exp(Ah) be
estimated? How can an estimate of A be deduced from this result?
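The estimation in Question 5 can be sketched as follows (a Python illustration under an assumed noise-free setting; the function name is hypothetical): stack the sampled states, fit M \approx \exp(Ah) by linear least squares, and recover A through a principal matrix logarithm.

```python
import numpy as np
from scipy.linalg import logm

def estimate_transition(states, h):
    """Estimate M = exp(Ah) and A from regularly sampled states x(ih).

    states: array of shape (N+1, n), with states[i] = x(ih);
    assumes the noise-free relation x((i+1)h) = exp(Ah) x(ih)."""
    X0 = np.asarray(states[:-1]).T        # columns x(0), x(h), ..., x((N-1)h)
    X1 = np.asarray(states[1:]).T         # columns x(h), x(2h), ..., x(Nh)
    M_hat = X1 @ np.linalg.pinv(X0)       # least-squares fit of X1 = M X0
    A_hat = logm(M_hat).real / h          # principal matrix logarithm
    return M_hat, A_hat
```

Recovering A this way requires the columns of X0 to span the state space, and the matrix logarithm is only unambiguous when the eigenvalues of exp(Ah) stay away from the negative real axis (h small enough).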
J(p) = \sum_{i=1}^{N} [y(t_i) - y_m(t_i, p)]^2.   (16.174)
1. Explain how you would proceed in the absence of any additional constraint.
2. For some (admittedly rather mysterious) reasons, the model must comply with
the constraint
p_1^2 + p_2^2 = 1,   (16.175)
i.e., its parameters must belong to a circle with unit radius centered at the origin.
The purpose of the rest of this problem is to consider various ways of enforcing
(16.175) on the estimate \hat{p} of p.
a. Reparametrization approach. Find a transformation p = f(\theta), where \theta is a
scalar unknown parameter, such that (16.175) is satisfied for any real value
of \theta. Suggest a numerical method for estimating \theta from the data.
b. Lagrangian approach. Write down the Lagrangian of the constrained problem using a vector formulation where the sum in (16.174) is replaced by an
expression involving the vector
y = [y(t1 ), y(t2 ), . . . , y(t N )]T ,
(16.176)
c. Penalty approach. Consider penalizing the violation of (16.175) with two
penalty functions \pi_1(\cdot) and \pi_2(\cdot), e.g.,

\pi_1(p) = |p_1^2 + p_2^2 - 1|   (16.177)

and

\pi_2(p) = (p_1^2 + p_2^2 - 1)^2.   (16.178)

Describe in some detail how you would implement these strategies. What is
the difference with the Lagrangian approach? What are the pros and cons of
\pi_1(\cdot) and \pi_2(\cdot)? Which of the optimization methods described in this book
can be used with \pi_1(\cdot)?
d. Projection approach. In this approach, two steps are alternated. The first
step uses some unconstrained iterative method to compute an estimate \hat{p}^{k+}
of the solution at iteration k+1 from the constrained estimate \hat{p}^k of the
solution at iteration k, while the second computes \hat{p}^{k+1} by projecting \hat{p}^{k+}
orthogonally onto the curve defined by (16.175). Explain how you would
implement this option in practice. Why should one avoid using the linear
least-squares approach for the first step?
e. Any other idea? What are the pros and cons of these approaches?
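Approaches a and d above lend themselves to a few lines of Python (an illustrative sketch, not the book's code; both helper names are hypothetical): the reparametrization p = f(\theta) = (\cos\theta, \sin\theta)^T satisfies (16.175) for any real \theta, and the orthogonal projection of a nonzero p onto the unit circle is simply p / \|p\|_2.

```python
import numpy as np

def f(theta):
    # Reparametrization: any real theta yields a point on the unit circle,
    # so p1^2 + p2^2 = 1 holds by construction.
    return np.array([np.cos(theta), np.sin(theta)])

def project_on_circle(p):
    # Orthogonal projection onto the unit circle (undefined at p = 0).
    p = np.asarray(p, dtype=float)
    return p / np.linalg.norm(p)
```

With the first option, any unconstrained univariate method (e.g., a line search on \theta) can be used; with the second, an unconstrained step on p is followed by a projection.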
Consider the errors

e_i = y_i - f_i^T x, \quad i = 1, \ldots, N,   (16.181)

and the norms
\|e\|_1 = \sum_{i=1}^{N} |e_i|,   (16.182)

\|e\|_2 = \left( \sum_{i=1}^{N} e_i^2 \right)^{1/2}   (16.183)

and

\|e\|_\infty = \max_{1 \leq i \leq N} |e_i|.   (16.184)
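For reference, the three norms can each be evaluated in one line; a minimal Python sketch (the helper name is mine):

```python
import numpy as np

def lp_norms(e):
    """Return (||e||_1, ||e||_2, ||e||_inf) for a residual vector e."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.abs(e)), np.sqrt(np.sum(e**2)), np.max(np.abs(e))
```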
and by minimizing

J_1(x) = \sum_{i=1}^{N} (u_i + v_i)   (16.186)

under the constraints

y_i - f_i^T x = u_i - v_i,   (16.187)

u_i \geq 0, \quad v_i \geq 0, \quad i = 1, \ldots, N,   (16.188)

as well as by minimizing d under the constraints

-d \leq y_i - f_i^T x \leq d, \quad i = 1, \ldots, N.   (16.189)
6. Robust estimation. Assume that some of the entries of the data vector y are outliers,
i.e., pathological data resulting, for instance, from sensor failures. The purpose of
robust estimation is then to find a way of computing an estimate
x of the value of
x from these corrupted data that is as close as possible to the one that would have
been obtained had the data not been corrupted. What are, in your opinion, the
most and the least robust of the three l_p estimators considered in this problem?
7. Constrained estimation. Consider the special case where n = 2, and add the
constraint
|x_1| + |x_2| = 1.   (16.190)
\dot{x}(t) = \begin{bmatrix} -(k_{01} + k_{21}) x_1(t) + k_{12} x_2(t) + u(t) \\ k_{21} x_1(t) - k_{12} x_2(t) \end{bmatrix}.   (16.192)
The state of this model is x = [x1 , x2 ]T , with xi the quantity of some drug in
Compartment i. The outside of the model is considered as a compartment indexed
by zero. The data available consist of measurements of the quantity of drug y(ti ) in
Compartment 2 at N known instants of time ti , i = 1, . . . , N , where N is larger than
the number of unknown parameters. The input u(t) is known for
t \in [0, t_N].
The corresponding model output is
y_m(t_i, p) = x_2(t_i, p).   (16.193)
There was no drug inside the system at t = 0, so the initial condition of the model
is taken as x(0) = 0.
1. Draw a scheme of the compartmental model (16.192), (16.193), and put its equations under the form

\dot{x} = A(p) x + b u,   (16.194)

y_m(t, p) = c^T x(t).   (16.195)
2. Assuming, for the time being, that the numerical value of p is known, describe
two strategies for evaluating ym (ti , p) for i = 1, . . . , N . Without going into too
much detail, indicate the problems to be solved for implementing these strategies,
point out their pros and cons, and explain what your choice would be, and why.
3. To take measurement noise into account, p is estimated by minimizing

J(p) = \sum_{i=1}^{N} [y(t_i) - y_m(t_i, p)]^2.   (16.196)

4. The transfer function of the model is

H(s, p) = c^T [sI - A(p)]^{-1} b,   (16.197)
where s is the Laplace variable and I the identity matrix of appropriate dimension.
For any given numerical value of p, the Laplace transform Ym (s, p) of the model
output ym (t, p) is obtained from the Laplace transform U (s) of the input u(t) as
Ym (s, p) = H (s, p)U (s),
(16.198)
so H(s, p) characterizes the input-output behavior of the model. Show that for
almost any value p^* of the vector of the model parameters, there exists another
value p' \neq p^* such that, for all s,

H(s, p') = H(s, p^*).   (16.199)
is strapped down on the moving body and thus fixed in the reference frame of this
body. Strapdown IMUs tend to replace gimballed ones, as they are more robust and
less expensive. Computations are then needed to compensate for the rotations of the
strapdown IMU due to motion.
1. Assume first that a vehicle has to be located during a mission on the plane (2D
version), using a gimballed IMU that is stabilized in a local navigation frame
(considered as inertial). In this IMU, two sensors measure forces and convert
them into accelerations a N (ti ) and a E (ti ) in the North and East directions, at
known instants of time ti (i = 1, . . . , N ). It is assumed that ti+1 ti = t
where t is known and constant. Suggest a numerical method to evaluate the
position x(ti ) = (x N (ti ), x E (ti ))T and speed v(ti ) = (v N (ti ), v E (ti ))T of the
vehicle (i = 1, . . . , N ) in the inertial frame. You will assume that the initial
conditions x(t0 ) = (x N (t0 ), x E (t0 ))T and v(t0 ) = (v N (t0 ), v E (t0 ))T have been
measured at the start of the mission and are available. Explain your choice.
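Question 1 can be answered, for instance, with twice-repeated trapezoidal integration of the sampled accelerations; a minimal Python sketch (the constant sampling period and function name are assumptions):

```python
import numpy as np

def dead_reckon(a, x0, v0, dt):
    """Integrate sampled accelerations a[i] (shape (N, 2), North/East components)
    twice with the trapezoidal rule, starting from measured x0 and v0."""
    N = len(a)
    v = np.zeros((N, 2))
    x = np.zeros((N, 2))
    v[0], x[0] = v0, x0
    for i in range(1, N):
        v[i] = v[i - 1] + 0.5 * dt * (a[i - 1] + a[i])   # speed update
        x[i] = x[i - 1] + 0.5 * dt * (v[i - 1] + v[i])   # position update
    return x, v
```

Errors accumulate with time (dead reckoning drifts), which is why initial conditions and sensor biases matter so much in inertial navigation.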
2. The IMU is now strapped down on the vehicle (still moving on a plane), and
measures its axial and lateral accelerations a_x(t_i) and a_y(t_i). Let \theta(t_i) be the
angle at time ti between the axis of the vehicle and the North direction (assumed
to be measured by a compass, for the time being). How can one evaluate the
position x(ti ) = (x N (ti ), x E (ti ))T and speed v(ti ) = (v N (ti ), v E (ti ))T of the
vehicle (i = 1, . . . , N ) in the inertial frame? The rotation matrix
R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}   (16.200)
can be used to transform the vehicle frame into a local navigation frame, which
will be considered as inertial.
3. Consider the previous question again, assuming now that instead of measuring
\theta(t_i) with a compass, one measures the angular speed of the vehicle

\omega(t_i) = \frac{d\theta}{dt}(t_i)   (16.201)

with a gyrometer.
4. Consider the same problem with a 3D strapdown IMU to be used for a mission in
space. This IMU employs three gyrometers to measure the first derivatives with
respect to time of the roll \phi, pitch \theta, and yaw \psi of the vehicle. You will no longer
neglect the fact that the local navigation frame is not an inertial frame. Instead, you
will assume that the formal expressions of the rotation matrix R_1(\phi, \theta, \psi) that
transforms the vehicle frame into an inertial frame and of the matrix R_2(x, y)
that transforms the local navigation frame of interest (longitude x, latitude y,
altitude z) into the inertial frame are available. What are the consequences on the
computations of the fact that R_1(\phi, \theta, \psi) and R_2(x, y) are orthonormal matrices?
5. Draw a block diagram of the resulting system.
[Figure: diagram of the district-heating network, with the mass flows (in kg s^{-1}) at its nodes.]
All mass flows are in kg s^{-1}. The network description is considerably simplified, and
the use of such a model for the optimization of operating conditions is not considered
(see [15] and [16] for more details).
The central branch includes a pump and the secondary circuit of a heat exchanger,
the primary circuit of which is connected to an energy supplier. The northern branch
contains a valve to modulate m 1 and the primary circuit of another heat exchanger,
the secondary circuit of which is connected to a first energy consumer. The southern
branch contains only the primary circuit of a third heat exchanger, the secondary
circuit of which is connected to a second energy consumer.
(16.202)
Suggest a numerical method for estimating the parameters a_0, a_1, and a_2 of the
model (16.202). (Do not carry out the computations, but give the numerical values
of the matrices and vectors that will serve as inputs for this method.)
H_{pump} = g_2 m_0^2 + g_1 m_0 \frac{\omega}{\omega_0} + g_0 \left( \frac{\omega}{\omega_0} \right)^2,   (16.203)
where \omega is the actual pump angular speed (a control input at the disposal of the
network manager, in rad s^{-1}) and \omega_0 is the pump's (known) nominal angular speed.
Assuming that you can choose \omega and measure m_0 and the resulting H_{pump}, suggest
an experimental procedure and a numerical method for estimating the parameters
g_0, g_1, and g_2.
In the northern branch,

H_B - H_A = \frac{Z_1}{d} m_1^2,   (16.205)
where Z_1 is the (known) hydraulic resistance of the branch and d is the opening
degree of the valve (0 < d \leq 1). (This opening degree is another control input at the
disposal of the network manager.) Finally, in the southern branch
H_B - H_A = Z_2 m_2^2,   (16.206)
where Z 2 is the (known) hydraulic resistance of the branch. The mass flows in the
network must satisfy
m_0 = m_1 + m_2.   (16.207)
\frac{\partial T}{\partial t}(x_b, t) + \beta_b m_b(t) \frac{\partial T}{\partial x_b}(x_b, t) + \alpha_b (T(x_b, t) - T_0) = 0,   (16.208)
where T(x_b, t) is the temperature (in K) at the location x_b in pipe b at time t, and
where T_0, \alpha_b, and \beta_b are assumed known and constant. Discretizing this propagation
equation (with xb = L b i/N (i = 0, . . . , N ), where L b is the pipe length and N the
number of steps), show that one can get the following approximation
\frac{dx}{dt}(t) = A(t) x(t) + b(t) u(t) + b_0 T_0,   (16.209)
where u(t) = T (0, t) and x consists of the temperatures at the discretization points
indexed by i (i = 1, . . . , N ). Suggest a method for solving this ODE numerically.
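One possible scheme (an assumption; the problem leaves the choice open) is a first-order upwind discretization of the transport term, which yields the matrices of (16.209) directly; a Python sketch with hypothetical names:

```python
import numpy as np

def semi_discretized(N, L, beta_mdot, alpha):
    """Upwind semi-discretization of dT/dt + beta*mdot*dT/dx + alpha*(T - T0) = 0.

    Returns (A, b, b0) such that dx/dt = A x + b u + b0 T0, with x the
    temperatures at the N grid points and u(t) = T(0, t)."""
    dx = L / N
    a = beta_mdot / dx
    A = -(a + alpha) * np.eye(N) + a * np.eye(N, k=-1)   # upwind coupling
    b = np.zeros(N)
    b[0] = a                        # inflow boundary enters the first node
    b0 = alpha * np.ones(N)         # relaxation toward the outside temperature
    return A, b, b0

def euler_step(x, u, T0, A, b, b0, dt):
    # One explicit Euler step of the semi-discretized ODE (16.209).
    return x + dt * (A @ x + b * u + b0 * T0)
```

A stiffer but unconditionally stable alternative would be an implicit scheme; explicit Euler requires dt small enough with respect to dx.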
When thermal diffusion is no longer neglected, (16.208) becomes
\frac{\partial T}{\partial t}(x_b, t) + \beta_b m_b(t) \frac{\partial T}{\partial x_b}(x_b, t) + \alpha_b (T(x_b, t) - T_0) = \gamma_b \frac{\partial^2 T}{\partial x_b^2}(x_b, t),   (16.210)

with \gamma_b the thermal diffusion coefficient.
(16.211)
where the indices p and s correspond to the primary and secondary networks (with
the secondary network associated with the consumer), and the exponents in and
out correspond to the inputs and outputs of the exchanger. The efficiency k of the
exchanger and its exchange surface S are assumed known. Provided that the thermal
power losses between the primary and the secondary circuits are neglected, one can
also write
Q_c = c \, m_p (T_p^{in} - T_p^{out})   (16.212)
at the primary network, with m p the primary mass flow and c the (known) specific
heat of water (in J kg^{-1} K^{-1}), and
Q_c = c \, m_s (T_s^{out} - T_s^{in})   (16.213)
at the secondary network, with m s the secondary mass flow. Assuming that m p , m s ,
Tpin , and Tsout are known, show that the computation of Q c , Tpout , and Tsin boils down
to solving a linear system of three equations in three unknowns. It may be useful to
introduce the (known) parameter
E = \exp \left[ \frac{kS}{c} \left( \frac{1}{m_p} - \frac{1}{m_s} \right) \right].   (16.214)
What method do you recommend for solving this system?
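Once the third relation linking the temperatures is available, the small system can be solved by Gaussian elimination. In the Python sketch below, the closing relation T_p^{out} - T_s^{in} = E (T_p^{in} - T_s^{out}) is only an assumed stand-in for the exchanger equation, with E as in (16.214); the function name is hypothetical:

```python
import numpy as np

def solve_exchanger(c, m_p, m_s, Tp_in, Ts_out, E):
    """Solve for (Qc, Tp_out, Ts_in) the linear system
         Qc = c m_p (Tp_in - Tp_out),          # (16.212)
         Qc = c m_s (Ts_out - Ts_in),          # (16.213)
         Tp_out - Ts_in = E (Tp_in - Ts_out).  # assumed closing relation
    Unknowns ordered as (Qc, Tp_out, Ts_in)."""
    A = np.array([
        [1.0, c * m_p, 0.0],
        [1.0, 0.0,     c * m_s],
        [0.0, 1.0,    -1.0],
    ])
    rhs = np.array([c * m_p * Tp_in,
                    c * m_s * Ts_out,
                    E * (Tp_in - Ts_out)])
    return np.linalg.solve(A, rhs)
```

For a 3 x 3 well-conditioned system like this one, plain LU factorization (what np.linalg.solve performs) is entirely adequate.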
\dot{x}_1 = -p_1 x_1 + p_2 x_2 + u,
\dot{x}_2 = p_1 x_1 - (p_2 + p_3) x_2.   (16.215)
In (16.215), the scalars p_1, p_2, and p_3 are unknown, positive, real parameters.
The quantity of drug in Compartment i is denoted by xi (i = 1, 2), in mg, and u(t)
is the drug flow into Compartment 1 at time t due to intravenous administration (in
mg/min). The initial condition is x(0) = 0. The drug concentration (in mg/L) can be
measured in Compartment 1 at N known instants of time ti (in min) (i = 1, . . . , N ).
The model of the observations is thus
y_m(t_i, p) = \frac{1}{p_4} x_1(t_i, p),   (16.216)
where
p = (p_1, p_2, p_3, p_4)^T,   (16.217)

and the data satisfy

y(t_i) = y_m(t_i, p^*) + \epsilon(t_i),   (16.218)

where p^* is the unknown true value of p and \epsilon(t_i) combines the consequences of
the measurement error and the approximate nature of the model.
The first part of this problem is about estimating p for a specific patient based
on experimental data collected for a known input function u(); the second is about
using the resulting model to design an input function that satisfies the requirements
of the treatment of this patient.
1. In which units should the first three parameters be expressed?
2. The data to be employed for estimating the model parameters have been collected using the following input. During the first minute, u(t) was maintained
constant at 100 mg/min. During the following hour, u(t) was maintained constant at 20 mg/min. Although the response of the model to this input could be
computed analytically, this is not the approach to be taken here. For a step-size
h = 0.1 min, explain in some detail how you would simulate the model and compute its state x(ti , p) for this specific input and for any given feasible numerical
value of p. (For the sake of simplicity, you will assume that the measurement
times are such that ti = n i h, with n i a positive integer.) State the pros and cons
of your approach, explain what simple measures you could take to check that the
simulation is reasonably accurate and state what you would do if it turned out not
to be the case.
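A possible answer to Question 2 (a sketch; fixed-step classical RK4 is an assumed choice, not prescribed by the problem) holds the input constant over each step and integrates (16.215):

```python
import numpy as np

def rhs(x, u, p):
    # Right-hand side of the compartmental model (16.215).
    p1, p2, p3 = p[:3]
    return np.array([-p1 * x[0] + p2 * x[1] + u,
                      p1 * x[0] - (p2 + p3) * x[1]])

def simulate(p, u_of_t, t_end, h=0.1):
    """Classical fixed-step RK4 for (16.215), starting from x(0) = 0."""
    n_steps = int(round(t_end / h))
    x = np.zeros(2)
    traj = [x.copy()]
    for k in range(n_steps):
        u = u_of_t(k * h)               # input held constant over the step
        k1 = rhs(x, u, p)
        k2 = rhs(x + 0.5 * h * k1, u, p)
        k3 = rhs(x + 0.5 * h * k2, u, p)
        k4 = rhs(x + h * k3, u, p)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(x.copy())
    return np.array(traj)
```

A simple accuracy check is to halve h and verify that the computed state barely changes; the input discontinuities should be aligned with step boundaries, which the hypothesis t_i = n_i h makes easy.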
3. The estimate \hat{p} of p must be computed by minimizing

J(p) = \sum_{i=1}^{N} \left[ y(t_i) - \frac{1}{p_4} x_1(t_i, p) \right]^2,   (16.219)
where N = 10. The instants of time ti at which the data have been collected
are known, as well as the corresponding values of y(ti ). The value of x1 (ti , p)
is computed by the method that you have chosen in your answer to Question 2.
Explain in some detail how you would proceed to compute \hat{p}. State the pros and
cons of the method chosen, explain what simple measures you could take to check
whether the optimization has been carried out satisfactorily and state what you
would do if it turned out not to be the case.
4. From now on, p is taken equal to \hat{p}, the vector of numerical values obtained at
Question 3, and the problem is to choose a therapeutically appropriate one-hour
input, assumed constant over each minute:

u(t) = u_i \quad \text{for } t \in [i-1, i) \text{ min}, \quad i = 1, \ldots, 60,   (16.220)

so the input is uniquely specified by u = (u_1, \ldots, u_{60})^T \in R^{60}. Let x^j(u^1) be the model state
at time jh (j = 1, \ldots, 600), computed with a fixed step-size h = 0.1 min
from x(0) = 0 for the input u^1 such that u_1^1 = 1 and u_i^1 = 0, i = 2, \ldots, 60.
Taking advantage of the fact that the output of the model described by (16.215)
is linear in its inputs and time-invariant, express the state x^j(u) of the model at
time jh for a generic input u as a linear combination of suitably delayed x^k(u^1)'s
(k = 1, \ldots, 600).
5. The input u should be such that
• u_i \geq 0, i = 1, \ldots, 60 (why?),
• x_i^j \leq M_i, j = 1, \ldots, 600, where M_i is a known toxicity bound (i = 1, 2),
• x_2^j \in [m^-, m^+], j = 60, \ldots, 600, where m^- and m^+ are the known bounds
of the therapeutic range for the patient under treatment (with m^+ < M_2),
• the total quantity of drug ingested during the hour is minimal.
Explain in some detail how to proceed and how the problem could be expressed in
standard form. Under which conditions is the method that you suggest guaranteed
to provide a solution (at least from a mathematical point of view)? If a solution
\hat{u} is found, will it be a local or a global minimizer?
\frac{g}{2} (t - t_0)^2,   (16.221)
with g the gravitational acceleration and t0 the instant of time at which the cannon
was fired. Show also that the horizontal distance covered by the shell before impact
is
(16.222)
2. Explain why choosing \theta to hit the tank can be viewed as a two-endpoint boundary-value
problem, and suggest a numerical method for computing \theta. Explain why
the number of solutions may be 0, 1, or 2, depending on the position of the tank.
3. From now on, the tank may be moving. The radar indicates its position x_{tank}(t_i),
i = 1, \ldots, N, at a rate of one measurement per second. Suggest a numerical
method for evaluating the tank's instantaneous speed \dot{x}_{tank}(t) and acceleration
\ddot{x}_{tank}(t) based on these measurements. State the pros and cons of this method.
4. Suggest a numerical method based on the estimates obtained in Question 3 for
choosing \theta and t_0 in such a way that the shell hits the ground where the tank is
expected to be at the instant of impact.
The data satisfy

y_i = f_i^T p^* + v_i, \quad i = 1, \ldots, N,   (16.223)

where p^* is the (unknown) true value of the parameter vector and v_i is the measurement
n > N . Estimating p from the data then seems hopeless, but can still be carried out
if some hypotheses restrict the choice. We assume in this problem that the model is
sparse, in the sense that the number of nonzero entries in p is very small compared
to the dimension of p. This is relevant for many situations in signal processing.
A classical method for looking for a sparse estimate of p is to compute
\hat{p} = \arg\min_{p} \left\{ \sum_{i=1}^{N} (y_i - f_i^T p)^2 + \lambda \|p\|_1 \right\},   (16.226)

with \lambda > 0 a hyperparameter.
The purpose of this problem is to explore an alternative approach [17] for building
a sparsity-promoting algorithm. This approach is based on projections onto convex
sets (or POCS). Let C be a convex set in R^n. For each p \in R^n, there is a unique
\bar{p} \in R^n such that

\bar{p} = \arg\min_{q \in C} \|p - q\|_2^2.   (16.227)
2. The ith feasible slab is

S_i = \{ p \in R^n : -b \leq y_i - f_i^T p \leq b \}.   (16.229)

Illustrate this for n = 2 (you may try n = 3 if you feel gifted for drawings...).
Show that S_i is a convex set.
3. Given the data (yi , fi ), i = 1, . . . , N , and the bound b, the set S of all acceptable
values of p is the intersection of all these feasible slabs
S = \bigcap_{i=1}^{N} S_i.   (16.230)
Assuming that the current estimate p^k violates

-b \leq y_{k+1} - f_{k+1}^T p \leq b,   (16.233)

show how to compute p^{k+1} as a function of p^k, y_{k+1}, f_{k+1}, and b, and illustrate
the procedure for n = 2.
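The projection onto a slab has a closed form: move along f just far enough to bring the residual back to the nearest bound. A Python sketch (the helper name is mine):

```python
import numpy as np

def project_onto_slab(p, y, f, b):
    """Orthogonal projection of p onto {q : -b <= y - f^T q <= b}."""
    p = np.asarray(p, dtype=float)
    f = np.asarray(f, dtype=float)
    r = y - f @ p                    # signed residual at p
    if abs(r) <= b:
        return p                     # p already belongs to the slab
    # Move along f just far enough to bring the residual back to +/- b.
    return p + ((r - np.sign(r) * b) / (f @ f)) * f
```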
4. Sparsity still needs to be promoted. A natural approach for doing so would be to
replace p^{k+1} at each iteration by its projection onto the set

B_0(c) = \{ p \in R^n : \|p\|_0 \leq c \},   (16.234)
(16.235)
for n = 2 and c = 1. Are they convex? Which of the l_p norms gives the closest
result to that of Question 5?
7. To promote sparsity, p^{k+1} is replaced at each iteration by its projection onto B_1(c),
with c a hyperparameter. Explain how this projection can be carried out with a
Lagrangian approach and illustrate the procedure when n = 2.
8. Summarize an algorithm based on POCS for estimating p while promoting sparsity.
9. Is there any point in recirculating the data in this algorithm?
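The Lagrangian condition of Question 7 leads to soft thresholding, with a threshold computable after sorting the absolute entries; a Python sketch of this standard projection onto the l_1 ball (an illustration, not the book's algorithm):

```python
import numpy as np

def project_l1_ball(p, c):
    """Euclidean projection of p onto {q : ||q||_1 <= c}, c > 0."""
    p = np.asarray(p, dtype=float)
    if np.abs(p).sum() <= c:
        return p.copy()              # already inside the ball
    u = np.sort(np.abs(p))[::-1]     # absolute entries, in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, p.size + 1)
    k = np.nonzero(u * j > css - c)[0][-1]
    tau = (css[k] - c) / (k + 1.0)   # threshold from the Lagrangian condition
    return np.sign(p) * np.maximum(np.abs(p) - tau, 0.0)
```

Because small entries are set exactly to zero by the thresholding, this projection is what makes the POCS iteration sparsity-promoting.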
References
1. Langville, A., Meyer, C.: Google's PageRank and Beyond. Princeton University Press, Princeton (2006)
2. Chang, J., Guo, Z., Fortmann, R., Lao, H.: Characterization and reduction of formaldehyde emissions from a low-VOC latex paint. Indoor Air 12(1), 10–16 (2002)
3. Thomas, L., Mili, L., Shaffer, C., Thomas, E.: Defect detection on hardwood logs using high resolution three-dimensional laser scan data. In: IEEE International Conference on Image Processing, vol. 1, pp. 243–246. Singapore (2004)
4. Nelles, O.: Nonlinear System Identification. Springer, Berlin (2001)
5. Richalet, J., Rault, A., Testud, J., Papon, J.: Model predictive heuristic control: applications to industrial processes. Automatica 14, 413–428 (1978)
6. Clarke, D., Mohtadi, C., Tuffs, P.: Generalized predictive control - Part I. The basic algorithm. Automatica 23(2), 137–148 (1987)
7. Bitmead, R., Gevers, M., Wertz, V.: Adaptive Optimal Control, the Thinking Man's GPC. Prentice-Hall, Englewood Cliffs (1990)
8. Lawson, C., Hanson, R.: Solving Least Squares Problems. Classics in Applied Mathematics. SIAM, Philadelphia (1995)
9. Perelson, A.: Modelling viral and immune system dynamics. Nature 2, 28–36 (2002)
10. Adams, B., Banks, H., Davidian, M., Kwon, H., Tran, H., Wynne, S., Rosenberg, E.: HIV dynamics: modeling, data analysis, and optimal treatment protocols. J. Comput. Appl. Math. 184, 10–49 (2005)
11. Wu, H., Zhu, H., Miao, H., Perelson, A.: Parameter identifiability and estimation of HIV/AIDS dynamic models. Bull. Math. Biol. 70, 785–799 (2008)
12. Spall, J.: Factorial design for efficient experimentation. IEEE Control Syst. Mag. 30(5), 38–53 (2010)
13. del Castillo, E.: Process Optimization: A Statistical Approach. Springer, New York (2007)
14. Myers, R., Montgomery, D., Anderson-Cook, C.: Response Surface Methodology: Process and Product Optimization Using Designed Experiments, 3rd edn. Wiley, Hoboken (2009)
15. Sandou, G., Font, S., Tebbani, S., Hiret, A., Mondon, C.: District heating: a global approach to achieve high global efficiencies. In: IFAC Workshop on Energy Saving Control in Plants and Buildings. Bansko, Bulgaria (2006)
16. Sandou, G., Font, S., Tebbani, S., Hiret, A., Mondon, C.: Optimisation and control of supply temperatures in district heating networks. In: 13th IFAC Workshop on Control Applications of Optimisation. Cachan, France (2006)
17. Theodoridis, S., Slavakis, K., Yamada, I.: Adaptive learning in a world of projections. IEEE Signal Process. Mag. 28(1), 97–123 (2011)
Index
A
Absolute stability, 316
Active constraint, 170, 253
Adams-Bashforth methods, 310
Adams-Moulton methods, 311
Adapting step-size
multistep methods, 322
one-step methods, 320
Adaptive quadrature, 101
Adaptive random search, 223
Adjoint code, 124, 129
Angle between search directions, 202
Ant-colony algorithms, 223
Approximate algorithm, 381, 400
Armijo condition, 198
Artificial variable, 267
Asymptotic stability, 306
Augmented Lagrangian, 260
Automatic differentiation, 120
B
Backward error analysis, 389
Backward substitution, 23
Barrier functions, 257, 259, 277
Barycentric Lagrange interpolation, 81
Base, 383
Basic feasible solution, 267
Basic variable, 269
Basis functions, 83, 334
BDF methods, 311
Bernstein polynomials, 333
Best linear unbiased predictor (BLUP), 181
Best replay, 222
BFGS, 211
Big O, 11
Binding constraint, 253
Bisection method, 142
Bisection of boxes, 394
Black-box modeling, 426
Boole's rule, 104
Boundary locus method, 319
Boundary-value problem (BVP), 302, 328
Bounded set, 246
Box, 394
Branch and bound, 224, 422
Brent's method, 197
Broyden's method, 150
Bulirsch-Stoer method, 324
Burgers' equation, 360
C
Casting out the nines, 390
Cauchy condition, 303
Cauchy-Schwarz inequality, 13
Central path, 277
Central-limit theorem, 400
CESTAC/CADNA, 397
validity conditions, 400
Chain rule for differentiation, 122
Characteristic curves, 363
Characteristic equation, 61
Chebyshev norm, 13
Chebyshev points, 81
Cholesky factorization, 42, 183, 186
flops, 45
Chord method, 150, 313
Closed set, 246
Collocation, 334, 335, 372
Combinatorial optimization, 170, 289
Compact set, 246
Compartmental model, 300
Complexity, 44, 272
Computational zero, 399
Computer experiment, 78, 225
Condition number, 19, 28, 150, 186, 193
for the spectral norm, 20
nonlinear case, 391
preserving the, 30
Conditioning, 194, 215, 390
Conjugate directions, 40
Conjugate-gradient methods, 40, 213, 216
Constrained optimization, 170, 245
Constraints
active, 253
binding, 253
equality, 248
getting rid of, 247
inequality, 252
saturated, 253
violated, 253
Continuation methods, 153
Contraction
of a simplex, 218
of boxes, 394
Convergence speed, 15, 215
linear, 215
of fixed-point iteration, 149
of Newtons method, 145, 150
of optimization methods, 215
of the secant method, 148
quadratic, 215
superlinear, 215
Convex optimization, 272
Cost function, 168
convex, 273
non-differentiable, 216
Coupling at interfaces, 370
Cramer's rule, 22
Crank-Nicolson scheme, 366
Curse of dimensionality, 171
Cyclic optimization, 200
CZ, 399, 401
D
DAE, 326
Dahlquist's test problem, 315
Damped Newton method, 204
Dantzig's simplex algorithm, 266
Dealing with conflicting objectives, 226, 246
Decision variable, 168
Deflation procedure, 65
Dependent variables, 121
Derivatives
first-order, 113
second-order, 116
Design specifications, 246
Determinant evaluation
bad idea, 3
useful?, 60
via LU factorization, 60
via QR factorization, 61
via SVD, 61
Diagonally dominant matrix, 22, 36, 37, 366
Dichotomy, 142
Difference
backward, 113, 116
centered, 114, 116
first-order, 113
forward, 113, 116
second-order, 114
Differentiable cost, 172
Differential algebraic equations, 326
Differential evolution, 223
Differential index, 328
Differentiating
multivariate functions, 119
univariate functions, 112
Differentiation
backward, 123, 129
forward, 127, 130
Direct code, 121
Directed rounding, 385
switched, 391
Dirichlet conditions, 332, 361
Divide and conquer, 224, 394, 422
Double, 384
Double float, 384
Double precision, 384
Dual problem, 276
Dual vector, 124, 275
Duality gap, 276
Dualization, 124
order of, 125
E
EBLUP, 182
Efficient global optimization (EGO), 225, 280
Eigenvalue, 61, 62
computation via QR iteration, 67
Eigenvector, 61, 62
computation via QR iteration, 68
Elimination of boxes, 394
Elliptic PDE, 363
Empirical BLUP, 182
Encyclopedias, 409
eps, 154, 386
Equality constraints, 170
Equilibrium points, 141
Euclidean norm, 13
Event function, 304
Exact finite algorithm, 3, 380, 399
Exact iterative algorithm, 380, 400
Existence and uniqueness condition, 303
Expansion of a simplex, 217
Expected improvement, 225, 280
Explicit Euler method, 306
Explicit methods
for ODEs, 306, 308, 310
for PDEs, 365
Explicitation, 307
an alternative to, 313
Exponent, 383
Extended state, 301
Extrapolation, 77
Richardsons, 88
F
Factorial design, 186, 228, 417, 442
Feasible set, 168
convex, 273
desirable properties of, 246
Finite difference, 306, 331
Finite difference method (FDM)
for ODEs, 331
for PDEs, 364
Finite element, 369
Finite escape time, 303
Finite impulse response model, 430
Finite-element method (FEM), 368
FIR, 430
Fixed-point iteration, 143, 148
Float, 384
Floating-point number, 383
Flop, 44
Forward error analysis, 389
Forward substitution, 24
Frobenius norm, 15, 42
Functional optimization, 170
G
Galerkin methods, 334
Gauss-Lobatto quadrature, 109
Gauss-Newton method, 205, 215
Gauss-Seidel method, 36
Gaussian activation function, 427
Gaussian elimination, 25
Gaussian quadrature, 107
Gear methods, 311, 325
General-purpose ODE integrators, 305
Generalized eigenvalue problem, 64
Generalized predictive control (GPC), 432
Genetic algorithms, 223
Givens rotations, 33
Global error, 324, 383
Global minimizer, 168
Global minimum, 168
Global optimization, 222
GNU, 412
Golden number, 148
Golden-section search, 198
GPL, 412
GPU, 46
Gradient, 9, 119, 177
evaluation by automatic differentiation, 120
evaluation via finite differences, 120
evaluation via sensitivity functions, 205
Gradient algorithm, 202, 215
stochastic, 221
Gram-Schmidt orthogonalization, 30
Grid norm, 13
Guaranteed
integration of ODEs, 309, 324
optimization, 224
H
Heat equation, 363, 366
Hessian, 9, 119, 179
computation of, 129
Heun's method, 314, 316
Hidden bit, 384
Homotopy methods, 153
Horner's algorithm, 81
Householder transformation, 30
Hybrid systems, 304
Hyperbolic PDE, 363
I
IEEE 754, 154, 384
Ill-conditioned problems, 194
Implicit Euler method, 306
Implicit methods
for ODEs, 306, 311, 313
for PDEs, 365
Inclusion function, 393
Independent variables, 121
Inequality constraints, 170
Inexact line search, 198
Infimum, 169
Infinite-precision computation, 383
Infinity norm, 14
Initial-value problem, 302, 303
Initialization, 153, 216
Input factor, 89
Integer programming, 170, 289, 422
Integrating functions
multivariate case, 109
univariate case, 101
via the solution of an ODE, 109
Interior-point methods, 271, 277, 291
Interpolation, 77
by cubic splines, 84
by Kriging, 90
by Lagrange's formula, 81
by Neville's algorithm, 83
multivariate case, 89
polynomial, 18, 80, 89
rational, 86
univariate case, 79
Interval, 392
computation, 392
Newton method, 396
vector, 394
Inverse power iteration, 65
Inverting a matrix
flops, 60
useful?, 59
via LU factorization, 59
via QR factorization, 60
via SVD, 60
Iterative
improvement, 29
optimization, 195
solution of linear systems, 35
solution of nonlinear systems, 148
IVP, 302, 303
J
Jacobi iteration, 36
Jacobian, 9
Jacobian matrix, 9, 119
K
Karush, Kuhn and Tucker conditions, 256
Kriging, 79, 81, 90, 180, 225
confidence intervals, 92
correlation function, 91
data approximation, 93
mean of the prediction, 91
variance of the prediction, 91
Kronecker delta, 188
Krylov subspace, 39
Krylov subspace iteration, 38
Kuhn and Tucker coefficients, 253
L
l1 norm, 13, 263, 433
l2 norm, 13, 184
l p norm, 12, 454
l norm, 13, 264
Lagrange multipliers, 250, 253
Lagrangian, 250, 253, 256, 275
augmented, 260
LAPACK, 28
Laplace's equation, 363
Laplacian, 10, 119
Least modulus, 263
Least squares, 171, 183
for BVPs, 337
formula, 184
recursive, 434
regularized, 194
unweighted, 184
via QR factorization, 188
via SVD, 191
weighted, 183
when the solution is not unique, 194
Legendre basis, 83, 188
Legendre polynomials, 83, 107
Levenberg's algorithm, 209
Levenberg-Marquardt algorithm, 209, 215
Levinson-Durbin algorithm, 43
Line search, 196
combining line searches, 200
Linear convergence, 142, 215
Linear cost, 171
Linear equations, 139
solving large systems of, 214
system of, 17
Linear ODE, 304
Linear PDE, 366
Linear programming, 171, 261, 278
Lipschitz condition, 215, 303
Little o, 11
Local method error
estimate of, 320
for multistep methods, 310
of Runge-Kutta methods, 308
Local minimizer, 169
Local minimum, 169
Logarithmic barrier, 259, 272, 277, 279
LOLIMOT, 428
Low-discrepancy sequences, 112
LU factorization, 25
flops, 45
for tridiagonal systems, 44
Lucky cancelation, 104, 105, 118
M
Machine epsilon, 154, 386
Manhattan norm, 13
Mantissa, 383
Markov chain, 62, 415
Matrix
derivatives, 8
diagonally dominant, 22, 36, 332
exponential, 304, 452
inverse, 8
inversion, 22, 59
non-negative definite, 8
normal, 66
norms, 14
orthonormal, 27
permutation, 27
positive definite, 8, 22, 42
product, 7
singular, 17
sparse, 18, 43, 332
square, 17
symmetric, 22, 65
Toeplitz, 23, 43
triangular, 23
tridiagonal, 18, 22, 86, 368
unitary, 27
upper Hessenberg, 68
Vandermonde, 43, 82
Maximum likelihood, 182
Maximum norm, 13
Mean-value theorem, 395
Mesh, 368
Meshing, 368
Method error, 88, 379, 381
bounding, 396
local, 306
MIMO, 89
Minimax estimator, 264
Minimax optimization, 222
on a budget, 226
Minimizer, 168
Minimizing an expectation, 221
Minimum, 168
MISO, 89
Mixed boundary conditions, 361
Modified midpoint integration method, 324
Monte Carlo integration, 110
Monte Carlo method, 397
MOOCs, 414
Multi-objective optimization, 226
Multiphysics, 362
Multistart, 153, 216, 223
Multistep methods for ODEs, 310
Multivariate systems, 141
N
1-norm, 14
2-norm, 14
Nabla operator, 10
NaN, 384
Necessary optimality condition, 251, 253
Nelder and Mead algorithm, 217
Nested integrations, 110
Neumann conditions, 361
Newton contractor, 395
Newton's method, 144, 149, 203, 215, 257,
278, 280
damped, 147, 280
for multiple roots, 147
Newton-Cotes methods, 102
No free lunch theorems, 172
Nonlinear cost, 171
Nonlinear equations, 139
multivariate case, 148
univariate case, 141
Nordsieck vector, 323
Normal equations, 186, 337
Normalized representation, 383
Norms, 12
compatible, 14, 15
for complex vectors, 13
for matrices, 14
for vectors, 12
induced, 14
subordinate, 14
Notation, 7
NP-hard problems, 272, 291
Number of significant digits, 391, 398
Numerical debugger, 402
O
Objective function, 168
ODE, 299
scalar, 301
Off-base variable, 269
OpenCourseWare, 413
Operations on intervals, 392
Operator overloading, 129, 392, 399
Optimality condition
necessary, 178, 179
necessary and sufficient, 275, 278
sufficient local, 180
Optimization, 168
combinatorial, 289
in the worst case, 222
integer, 289
linear, 261
minimax, 222
nonlinear, 195
of a non-differentiable cost, 224
on a budget, 225
on average, 220
Order
of a method error, 88, 106
of a numerical method, 307
of an ODE, 299
Ordinary differential equation, 299
Outliers, 263, 425
Outward rounding, 394
Overflow, 384
P
PageRank, 62, 415
Parabolic interpolation, 196
Parabolic PDE, 363
Pareto front, 227
Partial derivative, 119
Partial differential equation, 359
Particle-swarm optimization, 223
PDE, 359
Penalty functions, 257, 280
Perfidious polynomial, 72
Performance index, 168
Periodic restart, 212, 214
Perturbing computation, 397
Pivoting, 27
Polak-Ribière algorithm, 213
Polynomial equation
nth order, 64
companion matrix, 64
second-order, 3
Polynomial regression, 185
Powell's algorithm, 200
Power iteration algorithm, 64
Preconditioning, 41
Prediction method, 306
Prediction-correction methods, 313
Predictive controller, 429
Primal problem, 276
Problems, 415
Program, 261
Programming, 168
combinatorial, 289
integer, 289
linear, 261
nonlinear, 195
sequential quadratic, 261
Prototyping, 79
Q
QR factorization, 29, 188
flops, 45
QR iteration, 67
shifted, 69
Quadratic convergence, 146, 215
Quadratic cost, 171
in the decision vector, 184
in the error, 183
Quadrature, 101
Quasi steady state, 327
Quasi-Monte Carlo integration, 112
Quasi-Newton equation, 151, 211
Quasi-Newton methods
for equations, 150
for optimization, 210, 215
R
Radial basis functions, 427
Random search, 223
Rank-one correction, 151, 152
Rank-one matrix, 8
realmin, 154
Reflection of a simplex, 217
Regression matrix, 184
Regularization, 34, 194, 208
Relaxation method, 222
Repositories, 410
Response-surface methodology, 448
Richardson's extrapolation, 88, 106, 117, 324
Ritz-Galerkin methods, 334, 372
Robin conditions, 361
Robust estimation, 263, 425
Robust optimization, 220
Romberg's method, 106
Rounding, 383
modes, 385
Rounding errors, 379, 385
cumulative effect of, 386
Runge phenomenon, 93
Runge-Kutta methods, 308
embedded, 321
Runge-Kutta-Fehlberg method, 321
Running error analysis, 397
S
Saddle point, 178
Saturated constraint, 170, 253
Scalar product, 389
Scaling, 314, 383
Schur decomposition, 67
Schwarz's theorem, 9
Search engines, 409
Secant method, 144, 148
Second-order linear PDE, 361
Self-starting methods, 309
Sensitivity functions, 205
evaluation of, 206
for ODEs, 207
Sequential quadratic programming (SQP), 261
Shifted inverse power iteration, 66
Shooting methods, 330
Shrinkage of a simplex, 219
Simplex, 217
Simplex algorithm
Dantzig's, 265
Nelder and Mead's, 217
Simpson's 1/3 rule, 103
Simpson's 3/8 rule, 104
Simulated annealing, 223, 290
Single precision, 384
Single-step methods for ODEs, 306, 307
Singular perturbations, 326
Singular value decomposition (SVD), 33, 191
flops, 45
Singular values, 14, 21, 191
Singular vectors, 191
Slack variable, 253, 265
Slater's condition, 276
Software, 411
Sparse matrix, 18, 43, 53, 54, 366, 373
Spectral decomposition, 68
Spectral norm, 14
Spectral radius, 14
Splines, 84, 333, 369
Stability of ODEs
influence on global error, 324
Standard form
for equality constraints, 248
for inequality constraints, 252
for linear programs, 265
State, 122, 299
State equation, 122, 299
Stationarity condition, 178
Steepest descent algorithm, 202
Step-size
influence on stability, 314
tuning, 105, 320, 322
Stewart-Gough platform, 141
Stiff ODEs, 325
Stochastic gradient algorithm, 221
Stopping criteria, 154, 216, 400
Storage of arrays, 46
Strong duality, 276
Strong Wolfe conditions, 199
Student's test, 398
Subgradient, 217
Successive over-relaxation (SOR), 37
Sufficient local optimality condition, 251
Superlinear convergence, 148, 215
Supremum, 169
Surplus variable, 265
Surrogate model, 225
T
Taxicab norm, 13
Taylor expansion, 307
of the cost, 177, 179, 201
Taylor remainder, 396
Termination, 154, 216
Test for positive definiteness, 43
Test function, 335
TeXmacs, 6
Theoretical optimality conditions
constrained case, 248
convex case, 275
unconstrained case, 177
Time dependency
getting rid of, 301
Training data, 427
Transcendental functions, 386
Transconjugate, 13
Transposition, 7
Trapezoidal rule, 103, 311
Traveling salesperson problem (TSP), 289
Trust-region method, 196
Two-endpoint boundary-value problems, 328
Types of numerical algorithms, 379
U
Unbiased predictor, 181
Unconstrained optimization, 170, 177
Underflow, 384
Uniform norm, 13
Unit roundoff, 386
Unweighted least squares, 184
Utility function, 168
V
Verifiable algorithm, 3, 379, 400
Vetter's notation, 8
Vibrating-string equation, 363
Violated constraint, 253
W
Warm start, 258, 278
Wave equation, 360
Weak duality, 276
Web resources, 409
Weighted least squares, 183
Wolfe's method, 198
Worst operations, 405
Worst-case optimization, 222