Jun Zou
Department of Mathematics
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong
These lecture notes were prepared by Jun Zou, purely for the convenience of his teaching of the course Numerical Analysis. Students taking this course may use the notes as part of their reading and reference materials. This version of the lecture notes was revised and extended from a previous version, so there may be many mistakes and typos, including English grammatical and spelling errors, in the notes. It would be greatly appreciated if students who use the notes as their reading or reference material report any mistakes and typos to the instructor Jun Zou, to help improve the lecture notes. Students are strongly recommended to refer to the recommended textbooks for more exercises and for more details about the background and motivation of the different concepts and numerical methods.
Contents

1 Introduction

5.5 Cholesky factorization
    5.5.1 Properties of symmetric positive definite matrices
    5.5.2 Cholesky factorization
5.6 LU factorization and Gaussian elimination
    5.6.1 Gaussian elimination
    5.6.2 LDU factorization
    5.6.3 Cholesky factorization from LDU factorization
5.7 LU factorization with partial pivoting
    5.7.1 The necessity of pivoting
    5.7.2 LU factorization with partial pivoting
5.8 LU factorization of upper Hessenberg matrix and tri-diagonal matrix
5.9 General non-square linear systems

6 Floating-point arithmetic
    6.1 Decimal and binary numbers
    6.2 Rounding errors
    6.3 Normalized scientific notation
    6.4 Accuracies in 32-bit representation
    6.5 Machine rounding
    6.6 Floating-point arithmetic
    6.7 Backward error analysis

7 Sensitivity of linear systems
    7.1 Vector and matrix norms
        7.1.1 Vector norms
        7.1.2 Matrix norms
    7.2 Relative errors
    7.3 Sensitivity of linear systems
    7.4 The condition of a linear system
    7.5 Importance of condition numbers

8 Polynomial interpolation
    8.1 What is interpolation?
    8.2 Vandermonde interpolation
    8.3 General quadratic interpolation
    8.4 Interpolation with polynomials of degree n
    8.5 Lagrange interpolation
    8.6 The Newton form of interpolation
        8.6.1 Divided differences
        8.6.2 Relations between derivatives and divided differences
        8.6.3 Symmetry of divided differences
        8.6.4 Relation between divided difference and Gaussian elimination
        8.6.5 How to compute the coefficients of the Newton form?
    8.7 Three fundamental questions in interpolation

9 Numerical integration
    9.1 Simple rules and their error estimates
    9.2 Composite trapezoidal rule and its accuracy
    9.3 Newton-Cotes quadrature rule
        9.3.1 Computing the coefficients of Newton-Cotes rules (I)
        9.3.2 Computing the coefficients of Newton-Cotes rules (II)
    9.4 Simpson's rule and its error estimates
    9.5 Composite Simpson rule and its accuracy
    9.6 Gaussian quadrature rule
    9.7 Quadrature rule from [a, b] to [c, d]
    9.8 Construction and accuracy of Gaussian quadrature rules
    9.9 Errors of different quadrature rules
        9.9.1 Trapezoidal rule
        9.9.2 Simpson's rule
        9.9.3 Gaussian rules

10 Numerical differentiation
    10.1 Aim of numerical differentiation
    10.2 Forward and backward differences
    10.3 Central differences
Introduction
Numerical Analysis is a fundamental branch of Computational and Applied Mathematics. In this section, we list some important topics from Numerical Analysis that will be covered in this course.
1. Nonlinear equations of one variable. We discuss how to solve nonlinear equations of one variable (in standard form):

    f(x) = 0,

where f(x) is nonlinear with respect to the variable x. The solutions may be very complicated, even though the function f(x) looks simple. In many applications we may not even have an explicit expression for f(x); e.g., we may know only measured data of f(x) (such as temperature or flow velocity) at some locations or times, but we need to find out different behaviors of f(x).
Systems of nonlinear equations. In most applications one may need to solve the following more general system of nonlinear equations:

    f1(x1, x2, . . . , xn) = 0,
    f2(x1, x2, . . . , xn) = 0,
        . . .
    fn(x1, x2, . . . , xn) = 0,

where each fi(x1, x2, . . . , xn) is a nonlinear function of the n variables x1, x2, . . . , xn. In general, the solutions of a system of nonlinear equations are much more complicated than the solutions of a single nonlinear equation, but many methods for equations of one variable can be generalized to systems of nonlinear equations.
2. Linear systems of algebraic equations. Linear systems are often the problems one needs to solve repeatedly, even thousands of times, during many mathematical modeling processes or physical/engineering simulations, e.g., in the numerical solution of a simple population model, or of the more complicated electromagnetic Maxwell system. We will study how to solve the following general system of linear algebraic equations:

    Ax = b,

where A is an m × n matrix and b is a vector with m components.
How can we find a solution x when the matrix A is square and nonsingular? We can solve a 2 × 2 or 3 × 3 system by hand, but it is already difficult to solve a 10 × 10 system by hand, and it is almost impossible to solve a 100 × 100 system by hand. So when m and n are much larger than 1000, how can we solve the system? We have to turn to computers for help.

When m > n, the system may not have a solution. Then how can we find some possible solutions which are physically or practically meaningful?
3. Floating-point arithmetic. When computers are used for numerical computations, round-off errors are always present. Then how can we solve the system of linear algebraic equations Ax = b, or the system of nonlinear algebraic equations F(x) = 0, with satisfactory accuracy? How can we judge whether the solutions computed by computers are reliable?
4. Interpolation. For a complicated function, can we find a simple and easy-to-compute function to approximate it?

For a given set of observation data, can we find a function to best represent the set of data? E.g., suppose we know the measured temperature at a set of selected locations along the boundary of China; can we locate the coldest or warmest places, or the places with the most rapid changes of temperature? Or if the temperature has been measured at a fixed location for all the months in the previous 10 years, can we locate the coldest or warmest time at that location, or the time with the most rapid change of temperature?
5. Numerical integration. Integration is involved in many practical applications, e.g., computing physical masses, surface areas, volumes, fluxes, etc. But most integrals are difficult or impossible to compute exactly. For the integral of a complicated function, can we compute a good approximate value when it is impossible to calculate the integral exactly?
6. Numerical differentiation. Can we compute the derivatives of some complicated functions when their exact derivatives cannot be computed, or even when the exact expressions of the functions are not available? This happens quite often in real applications. E.g., suppose you are given the prices of a stock over the past 10 years; can you find the times when the stock usually grows or drops most rapidly? Most importantly, numerical differentiation is needed in the construction of every numerical method for solving ordinary or partial differential equations.
In this section, we list some important mathematical theorems and tools that will be used frequently in this course.

Theorem on polynomials. A polynomial of degree n has exactly n roots (counted with multiplicity); and a polynomial of degree at most n which vanishes at (n + 1) distinct points is identically zero.

Rolle's theorem. Let f(x) be continuous on [a, b], let its derivative f'(x) be continuous on (a, b), and let f(a) = f(b). Then there exists ξ ∈ (a, b) such that f'(ξ) = 0.

Mean-value theorem. Let f(x) be continuous on [a, b] and its derivative f'(x) be continuous on (a, b). Then there exists ξ ∈ (a, b) such that

    f(b) = f(a) + (b - a) f'(ξ),   i.e.,   f'(ξ) = (f(b) - f(a)) / (b - a).
Taylor's theorem. If f is (n + 1) times continuously differentiable near x0, then

    f(x) = f(x0) + f'(x0)(x - x0) + · · · + (f^(n)(x0)/n!)(x - x0)^n + Rn(x),

where Rn(x) is the remainder of the Taylor series and can be given by

    Rn(x) = (1/n!) ∫ from x0 to x of f^(n+1)(t)(x - t)^n dt = (f^(n+1)(ξ)/(n + 1)!)(x - x0)^(n+1)

for some ξ between x0 and x.

Mean-value theorem for integrals. If f does not change sign on [a, b] and g is continuous on [a, b], then there exists ξ ∈ [a, b] such that

    ∫ from a to b of f(x) g(x) dx = g(ξ) ∫ from a to b of f(x) dx.

Taylor expansion in several variables.

    f(x) = f(x0) + Df(x0)^T (x - x0) + (1/2!)(x - x0)^T D²f(x0)(x - x0) + · · · ,

where Df(x) is the gradient of f(x) and D²f(x) is the Hessian matrix of f(x).

Fundamental theorem of calculus. Let F be differentiable in an open set Ω ⊂ R^n and x* ∈ Ω. Then for all x such that the line segment connecting x and x* lies inside Ω, we have

    F(x) - F(x*) = ∫ from 0 to 1 of DF(x* + t(x - x*)) (x - x*) dt.
3.1

A nonlinear equation may be given in the general form

    g(x) = h(x),   (3.1)

e.g.,

    4x^5 + x^4 - 3x^2 + 4x = 6,

and can always be rewritten in the standard form f(x) = 0. For example, suppose two particles travel along the paths

    x1 = f(t),   x2 = g(t),

where t stands for the time each particle has traveled and x for the location of each particle at time t (the location may be measured in terms of the arc length of the path). Find the time when the two particles meet each other, or when the two particles have the same speed. When f and g represent the populations of two cities, you may be asked to find the time when the two cities have the same population or reach the same population growth rate.
3.2
3.3

Now we introduce an important concept which can be used to measure the efficiency of different iterative methods: the rate of convergence.

Given a sequence x0, x1, x2, . . . converging to a limit x*.

Linear convergence. If there is a constant λ satisfying 0 ≤ λ ≤ 1 such that

    lim (k → ∞) |x_{k+1} - x*| / |xk - x*| = λ,

then

    the sequence {xk} is said to converge to x* linearly with rate λ for λ ∈ (0, 1);
    the sequence {xk} is said to converge to x* superlinearly for λ = 0;
    the sequence {xk} is said to converge to x* sublinearly for λ = 1.
Think about the following sequences and find out whether they converge, and whether they converge linearly, superlinearly or sublinearly:

    xk = 1 + 2^(-2k),   k = 0, 1, 2, . . .

    xk = 1 + 2^(-2k) + 1/(k + 1),   k = 0, 1, 2, . . .

    x_{k+1} = 2^(-k) xk,   k = 0, 1, 2, . . .

    x_{k+1} = (1/2) xk,   k = 0, 1, 2, . . .

    xk = d^(1/k) (d is fixed),   k = 1, 2, 3, . . .

    xk = 10^(-2^k),   k = 1, 2, 3, . . .
Convergence with order p. If there are two positive constants p > 1 and C such that

    lim (k → ∞) |x_{k+1} - x*| / |xk - x*|^p = C,   (3.2)

then the sequence {xk} is said to converge to x* with order p.
3.4

To measure the accuracy of the approximate values, we need to introduce some useful concepts.
3.4.1 Absolute errors

The absolute error between the true value (solution) x* and the approximate value xk is given by

    |xk - x*|.

We can see that this error considers only the distance of xk from x*, without taking into account the magnitude of the true solution x*. This may not always be satisfactory in applications. For example, consider ε = 10^(-5) and the true solution x* = 1. If we stop the iteration when |xk - x*| ≤ 10^(-5), then xk has 5 accurate digits after the decimal point (xk = 0.99999 . . .). But if the true solution is x* = 10^(-8), then we will stop the iteration even when xk = 10^(-5), since

    |xk - x*| = 10^(-5) - 10^(-8) ≤ 10^(-5);

this approximation xk is 1000 times the exact solution x*, so it is not accurate at all.
3.4.2 Relative errors

The disadvantage of the absolute error is that it ignores the information on the magnitude of the true solution x*. Let xk be an approximation to x* ≠ 0; then the error

    ε = |xk - x*| / |x*|

is called the relative error of xk. We can write

    xk = x*(1 + δ) = x* + δ x*,

where |δ| = ε. So xk can be viewed as a small perturbation of x*.
Now look at the following table of approximations to x* = e = 2.7182818 . . . :

    Approximation   relative error   magnitude   accurate digits*
    2.0             2 × 10^(-1)      10^(-1)     1
    2.7             6 × 10^(-3)      10^(-2)     2
    2.71            3 × 10^(-3)      10^(-3)     3
    2.718           1 × 10^(-4)      10^(-4)     4
    2.7182          3 × 10^(-5)      10^(-5)     5
    2.71828         6 × 10^(-7)      10^(-6)     6

* Count from the first non-zero digit when the value is written in the format x.xxxx . . . × 10^m, where m may be positive or negative.
3.5 Bisection algorithm

There are many different methods one can use to solve a given nonlinear equation

    f(x) = 0.

We introduce in this section one of the simplest but very effective iterative methods, called the bisection algorithm.

3.5.1 Basic conditions

We now present the interval bisection algorithm for finding the solution x* of the equation f(x) = 0. Usually one cannot find the exact solution x*. We will be satisfied when we find an approximate solution x̄ such that |f(x̄)| ≤ ε or |x̄ - x*| ≤ δ for some small tolerance parameters ε and δ. One may set ε and δ to be of different magnitudes in each application.
Assume³ that f(a) < 0 and f(b) > 0; then (for continuous f) there exists a point x* ∈ (a, b) such that f(x*) = 0. Next we state the simple and popular bisection algorithm to find an approximate solution x̄.

Bisection Algorithm. Input a, b, and two stopping tolerance parameters ε and δ. Set a0 := a, b0 := b, k := 0, xk := (ak + bk)/2.

Step 1. If |f((ak + bk)/2)| ≤ ε or bk - ak ≤ δ, stop and output x̄ = (ak + bk)/2;
    if f((ak + bk)/2) > 0, set ak+1 := ak, bk+1 := (ak + bk)/2; GO TO Step 2;
    if f((ak + bk)/2) < 0, set ak+1 := (ak + bk)/2, bk+1 := bk; GO TO Step 2.

Step 2. Set k := k + 1, xk := (ak + bk)/2; GO TO Step 1.

³ The case that f(a) > 0 and f(b) < 0 can be handled similarly.
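As an illustration (our own sketch, not part of the original notes; the function name and default tolerances are our choices), the algorithm above can be written in a few lines of Python:

```python
import math

def bisection(f, a, b, eps=1e-8, delta=1e-8, max_iter=200):
    """Interval bisection for f(x) = 0, assuming f(a) and f(b) have opposite signs."""
    if f(a) * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    # arrange the endpoints so that f(a) < 0 < f(b), as assumed in the notes
    if f(a) > 0:
        a, b = b, a
    for _ in range(max_iter):
        x = (a + b) / 2
        if abs(f(x)) <= eps or abs(b - a) <= delta:   # the two stopping tests
            return x
        if f(x) > 0:
            b = x      # the root lies in the half with f < 0 at one end
        else:
            a = x
    return (a + b) / 2

# Example 1 below: e^x = sin x on [-3*pi/2, -pi]
root = bisection(lambda x: math.exp(x) - math.sin(x), -3 * math.pi / 2, -math.pi)
```

Note that the swap of a and b keeps the bookkeeping identical to the stated algorithm even when f(a) > 0 > f(b).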
Example 1. Use the bisection method to find an approximate solution of the nonlinear equation

    e^x = sin x.

Solution. From the graphs of e^x and sin x, we can easily see that there are no positive solutions of the equation e^x = sin x, and that the solution closest to 0 lies in the interval [-3π/2, -π]. The following table gives the results generated by the bisection algorithm:

    k    xk        |f(xk)|
    1    -3.9270   6.8740 × 10^(-1)
    2    -3.5343   3.5350 × 10^(-1)
    3    -3.3379   1.5958 × 10^(-1)
    4    -3.2398   5.8844 × 10^(-2)
    . . .
Let us now analyse whether the bisection algorithm converges and how fast it may converge. For convenience, we denote the successive intervals generated by the bisection algorithm by

    [a0, b0], [a1, b1], . . . , [ak, bk], . . . ;

then we have

    f(ak) f(bk) < 0.

Since bk - ak = 2^(-k)(b0 - a0) → 0 and the sequences {ak}, {bk} are monotone and bounded, both converge to a common limit x*. Letting k → ∞ above and using the continuity of f, we obtain

    f(x*)² ≤ 0,

therefore the limit x* is a solution of the nonlinear equation f(x) = 0. Moreover, let

    xk = (ak + bk)/2;

then we have

    |xk - x*| ≤ (1/2)(bk - ak) = 2^(-(k+1))(b0 - a0).

This proves the following result:

    Let f(x) be a continuous function on [a, b] such that f(a) f(b) < 0. Then the bisection algorithm always converges to a solution x* of the equation f(x) = 0, and the following error estimate holds for the k-th approximate value xk:

        |xk - x*| ≤ (1/2)(bk - ak) = 2^(-(k+1))(b - a).
Think about the following further examples of nonlinear equations.

Example 2. Find a positive root of the nonlinear equation

    x² - 4x sin x + (2 sin x)² = 0,

and find a root of the equation

    2x + e^x + 2 cos x - 6 = 0

on the interval [1, 3].

Example 3. Consider the bisection algorithm starting with the interval [1.5, 3.5].

1. What is the width of the interval at the k-th step of the iteration?
2. What is the maximum possible distance between the solution x* and the midpoint of this interval?
3.6

Before we discuss more iterative methods, we first review and study a fundamental tool that is frequently used in numerical analysis: approximation by Taylor expansion.

The Taylor expansion or Taylor series is a tool for approximating a function, given the derivatives of the function at a specific point, say x0. It takes the form

    f(x) = f(x0) + f'(x0)(x - x0) + (f''(x0)/2!)(x - x0)² + · · · + (f^(n)(x0)/n!)(x - x0)^n + · · ·

(a) 0th order approximation: knowing only the function value at x0, we may use

    f(x) ≈ f(x0) = F(x)

for x close to x0, which satisfies F(x0) = f(x0).

(b) 1st order approximation: knowing also the first derivative at x0, we may use

    f(x) ≈ f(x0) + f'(x0)(x - x0) = F(x),

which satisfies F(x0) = f(x0) and F'(x0) = f'(x0). In fact, the 1st order approximation of the curve f(x) is nothing else but its tangent at x0.

(c) 2nd order approximation: we know the function value and the 1st and the 2nd order derivatives of f(x) at the point x0; then we may use the following 2nd order approximation of f(x):

    f(x) ≈ f(x0) + f'(x0)(x - x0) + (1/2) f''(x0)(x - x0)² = F(x)

for all x that are close to x0. Clearly we see that

    F(x0) = f(x0),   F'(x0) = f'(x0),   F''(x0) = f''(x0).

So the 2nd order approximation also captures the local geometric shape of f(x) at x0.

In practice, one does not often use derivatives higher than 2nd order for approximations.
As an example, consider approximating f(x) = e^x by its Taylor approximations at the point x0 = 1 (so all derivatives at x0 equal e), evaluated at x = 2, 1.1 and 1.01:

    x        0th order   error            1st order   error            2nd order   error            true value
    2        e           4.67 × 10^0      2e          1.95 × 10^0      2.5e        5.93 × 10^(-1)   e^2
    1.1      e           2.86 × 10^(-1)   1.1e        1.41 × 10^(-2)   1.105e      4.64 × 10^(-4)   e^1.1
    1.01     e           2.73 × 10^(-2)   1.01e       1.36 × 10^(-4)   1.01005e    4.54 × 10^(-7)   e^1.01
From this table, we observe the following properties of the Taylor approximation:

1. The Taylor approximation gives basically no accuracy at x = 2, since it is a bit too far away from x0. The accuracy behaves basically like (x - x0)^(n+1), where n is the order of the approximation, but it is also affected by the values of the (n + 1)-th order derivative of f(x) near x0 divided by (n + 1)!.

2. For the 0th order approximation, the error decreases by a factor of 10^(-1) each time x - x0 is decreased by a factor of 10^(-1). For the 1st and the 2nd order approximations, the reduction factors are respectively 10^(-2) and 10^(-3).

Explain why we observe such behaviors.
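The table above can be reproduced with a short script (our own sketch, not from the notes), using the fact that every derivative of e^x at x0 = 1 equals e:

```python
import math

def taylor_approx(x, x0, order):
    """Taylor approximation of exp(x) at x0 (all derivatives of exp equal exp(x0))."""
    return sum(math.exp(x0) * (x - x0) ** k / math.factorial(k)
               for k in range(order + 1))

# errors of the 0th, 1st and 2nd order approximations at x = 2, 1.1, 1.01
for x in (2.0, 1.1, 1.01):
    errors = [abs(math.exp(x) - taylor_approx(x, 1.0, n)) for n in (0, 1, 2)]
    print(x, ["%.2e" % e for e in errors])
```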
3.7 Newton's method

There are several approaches to derive Newton's method. We first present a geometric approach.
Start with an initial point x0: draw the tangent line to the curve y = f(x) at the point (x0, f(x0)), and find the intersection point of the tangent line with the x-axis, which Newton's method takes to be the new approximation x1. Repeating the same procedure, we get x2, x3, . . .

Let us now derive a formula for computing x1, x2, . . . We know the tangent line of the curve y = f(x) at the point (x0, f(x0)):

    y - f(x0) = f'(x0)(x - x0).

The intersection point of this line with the x-axis (y = 0) can be found from

    -f(x0) = f'(x0)(x - x0),

which gives

    x1 = x0 - f(x0)/f'(x0).

Repeating this process, we obtain Newton's method:

    x_{n+1} = xn - f(xn)/f'(xn),   n = 0, 1, 2, . . .
Next, we derive Newton's method using an analytical approach. We know the Taylor expansion

    f(x) = f(x0) + f'(x0)(x - x0) + (1/2) f''(ξ)(x - x0)²,

where ξ lies between x and x0. Now if x0 is close to x and f''(x) is not too large around x0, then

    F(x) = f(x0) + f'(x0)(x - x0) ≈ f(x),

so instead of solving f(x) = 0, we can solve

    F(x) = 0,

which gives

    x1 = x0 - f(x0)/f'(x0).

Similarly, we can derive x2, x3, . . . , and thus Newton's method:

    x_{n+1} = xn - f(xn)/f'(xn),   n = 0, 1, 2, . . .
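The iteration x_{n+1} = xn - f(xn)/f'(xn) is only a few lines of code. Below is our own generic sketch (the function names and the stopping test on |f| are our choices, not from the notes); the derivative must be supplied by the user:

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method for f(x) = 0, starting from the initial guess x0."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) <= tol:
            return x
        x = x - fx / fprime(x)   # x_{n+1} = x_n - f(x_n)/f'(x_n)
    return x

# square root of 2 via f(x) = x^2 - 2
r = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
```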
3.7.2

Example 1. Use Newton's method to find an approximate inverse⁵ of a given positive number a.

Solution. The question amounts to solving the equation

    1/x = a.

Let f(x) = 1/x - a. Newton's method is given by

    x_{k+1} = xk - f(xk)/f'(xk) = 2xk - a xk²,   k = 0, 1, 2, . . .

Clearly this avoids computing the inverse of any number. And we can verify that the sequence {xk} converges monotonically as long as x0 ∈ (0, 1/a). Why?
The following table gives the sequences {xk} approximating 1/a for a = 1, with different initial guesses:

    Iteration   xk (x0 = 0.25)   xk (x0 = 1.75)   xk (x0 = 2.0)   xk (x0 = 2.1)        x*
    0           0.250000         1.750000         2.000000        +2.100000 × 10^0     1.000000
    1           0.437500         0.437500         0.000000        -2.100000 × 10^(-1)  1.000000
    2           0.683594         0.683594         0.000000        -4.641000 × 10^(-1)  1.000000
    3           0.899887         0.899887         0.000000        -1.143589 × 10^0     1.000000
    4           0.989977         0.989977         0.000000        -3.594973 × 10^0     1.000000
    5           0.999899         0.999899         0.000000        -2.011378 × 10^1     1.000000
    6           0.999999         0.999999         0.000000        -4.447915 × 10^2     1.000000
The choice of the initial guess. The choice of the initial guess is very important for the convergence of Newton's method. This can be clearly seen from the previous table.
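Since the iteration x_{k+1} = 2xk - a xk² uses only multiplications and subtractions, it computes a reciprocal without performing any division. A small sketch of ours (the function name and iteration count are our choices):

```python
def reciprocal(a, x0, n_iter=10):
    """Approximate 1/a by Newton's method on f(x) = 1/x - a, division-free."""
    x = x0
    for _ in range(n_iter):
        x = 2 * x - a * x * x   # x_{k+1} = 2 x_k - a x_k^2
    return x

# with x0 in (0, 1/a) the iterates increase monotonically towards 1/a
r = reciprocal(4.0, 0.2)
```

One can check that the error e_k = 1/a - x_k satisfies e_{k+1} = a e_k², which explains both the quadratic convergence and the sensitivity to the initial guess.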
Example 2. Use Newton's method to find the square root of a given positive number a.

Solution. The problem amounts to solving the following nonlinear equation:

    x² = a.

Let f(x) = x² - a. Then Newton's method can be written as

    x_{k+1} = xk - f(xk)/f'(xk) = (1/2)(xk + a/xk),   k = 0, 1, 2, . . .

The following table indicates that Newton's method converges very rapidly for quite different initial guesses. This is an unusual example, as the function is a quadratic polynomial. Please look at the convergence of Newton's method geometrically for the quadratic nonlinear equation x² - a = 0, and determine whether the method converges globally, namely converges for any initial guess.

⁵ How do computers compute x = 1/a? Newton's method is a good choice: solve the equivalent nonlinear equation 1/x = a.
For a = 2 (x* = √2 = 1.414214 . . .):

    Iteration   xk (x0 = 1.0)   xk (x0 = 0.5)   xk (x0 = 6.0)   x*
    0           1.000000        0.500000        6.000000        1.414214
    1           1.500000        2.250000        3.166667        1.414214
    2           1.416667        1.569444        1.899123        1.414214
    3           1.414216        1.421890        1.476120        1.414214
    4           1.414214        1.414234        1.415512        1.414214
    5           1.414214        1.414214        1.414214        1.414214
Example 3. Use Newton's method to find the first negative solution of the nonlinear equation

    e^x = sin x.

Solution. Let f(x) = e^x - sin x. By a simple analysis, we know x* ∈ [-3π/2, -π]. Then Newton's method can be written as

    x_{k+1} = xk - f(xk)/f'(xk) = xk - (e^xk - sin xk)/(e^xk - cos xk),   k = 0, 1, 2, . . .

The following table shows the convergence of the algorithm with different initial guesses, from which we can clearly see again how important the initial guesses are.
    Iteration   xk (x0 = -0.5)   xk (x0 = -1.0)   xk (x0 = -2.0)   x*
    0           -0.500000        -1.000000        -2.000000        -3.183063
    1           +3.506451        +6.013863        -3.894228        -3.183063
    2           +2.523301        +5.010849        -3.010248        -3.183063
    3           +1.628274        +4.002502        -3.183451        -3.183063
    4           +0.833182        +3.000576        -3.183063        -3.183063
    5           -0.125327        +2.054193        -3.183063        -3.183063
    6           +9.035402        +1.217551        -3.183063        -3.183063

3.7.3
In this section, we shall discuss the convergence of Newton's method. As we saw from the numerical examples in the last subsection, the convergence of Newton's method depends strongly on the initial guess we have taken: the initial guess cannot be taken too far away from the exact solution x*. Such convergence is called local convergence. In the following, we shall show that Newton's method converges when the initial guess x0 is close enough to the true solution x* of f(x) = 0, and when

    f'(x*) ≠ 0.

This condition means x* is a simple root of f(x). The case of multiple roots will be discussed later in Section 3.8.4.

To analyse the convergence, we define the iteration function

    φ(x) = x - f(x)/f'(x),

so that Newton's method reads x_{k+1} = φ(xk), k = 0, 1, 2, . . .
Clearly, we have

    x* = φ(x*).

Next, we show that the sequence {xk} converges to x*. Let ek = xk - x* be the error at the k-th iteration; then

    lim (k → ∞) xk = x*   if and only if   lim (k → ∞) ek = 0.

Noting that

    x_{k+1} = φ(xk),   x* = φ(x*),

we see

    e_{k+1} = φ(xk) - φ(x*).

By the mean-value theorem, we derive the error equation for ek:

    e_{k+1} = φ'(ξk)(xk - x*) = φ'(ξk) ek,

where ξk lies between xk and x*. From this, we know ek → 0 if we can show

    |φ'(ξk)| ≤ 1/2.

To see this, we compute directly that

    φ'(x) = f(x) f''(x) / f'(x)²,

which implies, since f(x*) = 0 and f'(x*) ≠ 0,

    φ'(x*) = f(x*) f''(x*) / f'(x*)² = 0.

Then by the continuity of φ', there exists a δ > 0 such that

    |φ'(x)| ≤ 1/2 < 1   for all x ∈ [x* - δ, x* + δ].

If the initial guess x0 lies in [x* - δ, x* + δ], then all the iterates xk, and hence all the points ξk, stay inside this interval, so

    |e_{k+1}| ≤ (1/2)|ek|.

This yields

    |e_{k+1}| ≤ (1/2)^(k+1) |e0| = (1/2)^(k+1) |x0 - x*|,

so we have ek → 0 as k → ∞, namely xk → x* as k → ∞. This demonstrates the (local) convergence of Newton's method.
3.7.4

In the last subsection we showed the convergence of Newton's method, but we still do not know how fast the method converges. Next we will demonstrate, in two different ways, that Newton's method converges quadratically.

Approach 1. Let ek = xk - x* be the error at the k-th iteration of Newton's method; then we have

    e_{k+1} = x_{k+1} - x* = φ(xk) - φ(x*).

As we saw earlier, φ'(x*) = 0. By Taylor expansion,

    φ(xk) - φ(x*) = (1/2) φ''(ξk)(xk - x*)²,

where ξk lies between xk and x*. Hence

    e_{k+1} = (1/2) φ''(ξk)(xk - x*)² = (1/2) φ''(ξk) ek².

But we know

    φ'(x) = f(x) f''(x) / f'(x)²,

and a direct computation (using f(x*) = 0) gives

    φ''(x*) = f''(x*) / f'(x*).

Recall that we already know Newton's method converges, namely

    lim (k → ∞) xk = x*,

so ξk → x* as well, and therefore

    lim (k → ∞) e_{k+1} / ek² = φ''(x*)/2 = f''(x*) / (2 f'(x*)).   (3.3)
Approach 2. By Taylor expansion, expanding f around xk and evaluating at x*,

    0 = f(x*) = f(xk) + f'(xk)(x* - xk) + (1/2) f''(ξk)(x* - xk)²,

where ξk lies between xk and x*, hence

    f'(xk)(xk - x*) - f(xk) = (1/2) f''(ξk)(x* - xk)².

This gives

    x_{k+1} - x* = (xk - x*) - f(xk)/f'(xk)
                 = [(xk - x*) f'(xk) - f(xk)] / f'(xk)
                 = f''(ξk)(x* - xk)² / (2 f'(xk)).

Now we see again that

    lim (k → ∞) e_{k+1} / ek² = f''(x*) / (2 f'(x*)).
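The limit e_{k+1}/ek² → f''(x*)/(2 f'(x*)) can be observed numerically. For f(x) = x² - 2 the limit is 2/(2 · 2√2) = 1/(2√2) ≈ 0.35355; a small check of ours:

```python
import math

# Newton's method for f(x) = x^2 - 2; the ratio e_{k+1}/e_k^2 should
# approach f''(x*)/(2 f'(x*)) = 1/(2*sqrt(2)) ~ 0.35355
star = math.sqrt(2)
x = 1.0
ratios = []
for _ in range(4):
    e_old = abs(x - star)
    x = x - (x * x - 2) / (2 * x)      # one Newton step
    ratios.append(abs(x - star) / e_old ** 2)
print(ratios)
```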
Recall that Newton's method is derived by using the linear polynomial

    F(x) = f(x0) + f'(x0)(x - x0)

from the Taylor expansion

    f(x) = f(x0) + f'(x0)(x - x0) + (1/2) f''(ξ)(x - x0)²

to approximate the actual function f(x). Naturally, one may ask whether we can use the quadratic polynomial

    F(x) = f(x0) + f'(x0)(x - x0) + (1/2) f''(x0)(x - x0)²

from the Taylor expansion

    f(x) = f(x0) + f'(x0)(x - x0) + (1/2) f''(x0)(x - x0)² + (1/6) f'''(ξ)(x - x0)³

to approximate the actual function f(x), so that we may get a new iteration which converges even faster than Newton's method.

Indeed, this is possible. Students may try to work out the details of this new iterative method and analyse its convergence and convergence order.
3.8 Quasi-Newton methods

Recalling Newton's method

    x_{k+1} = xk - f(xk)/f'(xk),   k = 0, 1, 2, . . . ,

we see that at each iteration we need to compute the derivative f'(xk). This is simple in many applications, but it may also be a big trouble for some practical problems. For instance, this is the case when one of the following situations happens:

(1) the expression of f(x) is unknown;
(2) the derivative f'(x) is very expensive to compute;
(3) the value of the function f may be the result of a long numerical calculation, so the derivative has no formula available.
A good example. We next give a good example showing why the derivative f'(x) can sometimes be complicated and expensive to compute. Consider the following two-point boundary value problem:

    x''(t) + p(t) x'(t) + q(t) x(t) = g(t),   a < t < b,
    x(a) = α,   x(b) = β,   (3.4)

where p(t), q(t) and g(t) are given functions, and x(t) is unknown. Our aim is to find the solution x(t) of the system (3.4). One popular way to do this is to first solve the following initial value problem:

    x''(t) + p(t) x'(t) + q(t) x(t) = g(t),   t > a,
    x(a) = α,   x'(a) = z

for a given z. We write the solution as x(t, z); then x(t, z) will also be the solution of the boundary-value problem (3.4) if we can find a value z such that

    x(b, z) = β.

Let f(z) = x(b, z) - β; then z is the solution of the nonlinear equation

    f(z) = 0.

To find such a solution z, we may apply Newton's method:

    z_{k+1} = zk - f(zk)/f'(zk),   k = 0, 1, 2, . . .
One possibility to get rid of the derivatives in Newton's method is to find some approximation of the derivative f'(xk), say gk ≈ f'(xk), where gk should be much easier to compute than f'(xk). Then we replace Newton's method by the following:

    x_{k+1} = xk - f(xk)/gk,   k = 0, 1, 2, . . .

Such an iteration is called a quasi-Newton method. There are many possible approximations of f'(xk), generating many different quasi-Newton methods. For example, one can replace f'(xk) by the difference quotient

    gk = (f(xk) - f(x_{k-1})) / (xk - x_{k-1});

the resulting method is called the secant method. It is easy to see that this method needs two initial values x0 and x1 to start with.
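A sketch of the secant method in code (our own, not from the notes; names and tolerances are our choices). Note the two starting values:

```python
def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant method: replace f'(x_k) by the difference quotient g_k."""
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if abs(f1) <= tol:
            return x1
        gk = (f1 - f0) / (x1 - x0)        # g_k ~ f'(x_k)
        x0, x1 = x1, x1 - f1 / gk         # quasi-Newton step with slope g_k
    return x1

# x^2 - 2 = 0 with starting values 1 and 1.5
r = secant(lambda x: x * x - 2, 1.0, 1.5)
```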
The simplest way to approximate f'(xk) is to replace it by a constant, gk = g. Then Newton's method becomes

    x_{k+1} = xk - f(xk)/g,   k = 0, 1, 2, . . . ;

this is called the constant slope method. In particular, we might take g = f'(x0).
Let us analyse the convergence of the constant slope method

    x_{k+1} = xk - f(xk)/g,   k = 0, 1, 2, . . .

Its iteration function is

    φ(x) = x - f(x)/g,

and we assume the constant g is chosen such that⁶

    |φ'(x*)| = |1 - f'(x*)/g| < 1.

Then

    x_{k+1} - x* = φ(xk) - φ(x*) = φ'(ξk)(xk - x*),

where ξk lies between xk and x*. As |φ'(x*)| < 1, there exists a δ > 0 such that⁷

    |φ'(x)| ≤ λ < 1   for all x ∈ [x* - δ, x* + δ].

Then it is easy to see that {xk} will lie inside the interval [x* - δ, x* + δ] as long as the initial guess x0 ∈ [x* - δ, x* + δ]. This implies

    |x_{k+1} - x*| = |φ'(ξk)| |xk - x*| ≤ λ |xk - x*|,

or

    |x_{k+1} - x*| ≤ λ^(k+1) |x0 - x*|,   (3.5)

so the constant slope method converges linearly with rate |φ'(x*)|.

How fast is the linear convergence?

The constant slope method converges linearly with rate λ = |φ'(x*)|. The efficiency of a method with linear convergence depends strongly on the magnitude of the rate (constant) λ in (3.5). In fact, each iteration reduces the error by a factor of the constant λ. Then if we set the accuracy tolerance to be ε, the number of iterations k required to reach the tolerance is

    k = [log ε / log λ] + 1.

The following table gives us a clear picture of how fast a linearly convergent method converges and how strongly this depends on the rate λ:

⁶ For example, if we know the sign of f'(x*), then we can take g = 2 sign(f'(x*)) M, where M can be any estimate of an upper bound of max |f'(x)|.
⁷ Think about how to find such a fixed constant λ < 1.
(table of iteration counts k for tolerances ε = 10^(-5), 10^(-10), 10^(-15) at various rates λ)

We observe that when the rate λ is close to 0, linear convergence (superlinear convergence for λ = 0) may be nearly as efficient as quadratic convergence. When the rate λ is close to 1, the convergence can be extremely slow.
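The iteration counts can be computed directly from the formula k = [log ε / log λ] + 1; the sample rates λ below are our own choices for illustration:

```python
import math

def iterations_needed(rate, tol):
    """Iterations needed for a linear method: k = floor(log tol / log rate) + 1."""
    return math.floor(math.log(tol) / math.log(rate)) + 1

# rows: tolerance eps; columns: rates 0.1, 0.5, 0.9, 0.99 (our sample values)
for tol in (1e-5, 1e-10, 1e-15):
    print(tol, [iterations_needed(lam, tol) for lam in (0.1, 0.5, 0.9, 0.99)])
```

The printed rows show the dramatic growth of k as λ approaches 1.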
3.8.2

Both Newton's method and the quasi-Newton methods can be seen as special cases of the fixed-point iterative method.

What is a fixed point? For a given function φ(x), x* is called a fixed point of φ if x* satisfies

    φ(x*) = x*.

For Newton's method, we have

    φ(x) = x - f(x)/f'(x),

so the exact solution x* of the nonlinear equation f(x) = 0 is a fixed point of φ(x). For the quasi-Newton (constant slope) method, we have

    φ(x) = x - f(x)/g.

Given an iteration function φ, the fixed-point iteration computes x_{k+1} = φ(xk), k = 0, 1, 2, . . . A fixed point x* is said to be an attractive point if the fixed-point iteration converges as long as one starts close enough to the fixed point; a fixed point x* is said to be a repulsive point if the fixed-point iteration diverges no matter how close to the fixed point we start.

For the fixed-point iterative method, we have the following convergence result:

    If the iteration function φ(x) satisfies the condition

        |φ'(x*)| < 1,

    then there exists a δ > 0 such that for any x0 ∈ [x* - δ, x* + δ] the fixed-point iteration converges. If φ'(x*) ≠ 0, the convergence is linear with convergence rate λ = |φ'(x*)|. If

        φ'(x*) = φ''(x*) = · · · = φ^(p-1)(x*) = 0,   but φ^(p)(x*) ≠ 0,

    then the convergence is of order p.

Indeed, by Taylor expansion,

    φ(xk) = φ(x*) + (φ^(p)(ξk)/p!)(xk - x*)^p,

and using φ(x*) = x*, we have

    x_{k+1} - x* = (φ^(p)(ξk)/p!)(xk - x*)^p.
3.8.3 A numerical example

Consider solving the nonlinear equation x² - 2 = 0, whose positive solution is x* = √2 = 1.414214 . . . , by three of the methods above.

Newton's method:

    x_{k+1} = xk - (xk² - 2)/(2xk) = (xk + 2/xk)/2,   k = 0, 1, 2, . . .

Secant method:

    x_{k+1} = (xk x_{k-1} + 2)/(xk + x_{k-1}),   k = 0, 1, 2, . . .

Constant slope method:

    x_{k+1} = xk - (xk² - 2)/g,   k = 0, 1, 2, . . .
The following table gives the convergence history |x_k − x*| of these methods
(the last three columns are the constant slope method with g = 0.5, 1 and 2):

    k    Newton        Secant        g = 0.5       g = 1         g = 2
    0    4.14×10^-1    4.14×10^-1    4.14×10^-1    4.14×10^-1    4.14×10^-1
    1    8.57×10^-2    8.57×10^-2    1.58×10^0     5.85×10^-1    8.57×10^-2
    2    2.45×10^-3    1.42×10^-2    1.24×10^1     1.41×10^0     3.92×10^-2
    3    2.12×10^-6    4.20×10^-4    2.50×10^2     5.85×10^-1    1.54×10^-2
    4    1.59×10^-12   2.12×10^-6    1.24×10^5     1.41×10^0     6.52×10^-3
    5    2.22×10^-16   3.15×10^-10   3.08×10^10    5.85×10^-1    2.68×10^-3
    6                  2.22×10^-16   1.90×10^21    1.41×10^0     1.11×10^-3
    ⋮
    18                                             1.41×10^0     2.84×10^-8
    19                                             5.85×10^-1    1.17×10^-8
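The comparison in the table can be reproduced with a short script. Following the secant formula above and the tabulated errors, we take f(x) = x^2 − 2 with x* = √2 and starting guess x0 = 1 (these identifications are inferred from the formulas, not stated here explicitly).

```python
import math

f = lambda x: x * x - 2.0
df = lambda x: 2.0 * x
xs = math.sqrt(2.0)

def newton_errors(x, n):
    """Errors |x_k - x*| of Newton's method for f(x) = x^2 - 2."""
    errs = [abs(x - xs)]
    for _ in range(n):
        x = x - f(x) / df(x)
        errs.append(abs(x - xs))
    return errs

def constant_slope_errors(x, g, n):
    """Errors of the constant slope iteration x_{k+1} = x_k - f(x_k)/g."""
    errs = [abs(x - xs)]
    for _ in range(n):
        x = x - f(x) / g
        errs.append(abs(x - xs))
    return errs

newt = newton_errors(1.0, 5)                 # error roughly squares each step
slope2 = constant_slope_errors(1.0, 2.0, 5)  # g = 2: slow linear convergence
```

Printed side by side, the Newton errors square at each step, while the g = 2 errors shrink only by a roughly constant factor.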
As a second example, we test three fixed-point iterations x_{k+1} = φ_i(x_k),
all starting from x0 = 1, and measure the error against x* = √2. The second and
third iterations are

Second method:

    x_{k+1} = 2/x_k =: φ_2(x_k) ,   k = 0, 1, 2, ···

Third method:

    x_{k+1} = (6 − x_k^4)/x_k^2 =: φ_3(x_k) ,   k = 0, 1, 2, ···
The iteration errors |x_k − x*| are listed below:

    k    First Method    Second Method    Third Method
    0    4.1421×10^-1    4.1421×10^-1     4.1421×10^-1
    1    2.5245×10^-1    5.8579×10^-1     3.5858×10^0
    2    5.8658×10^-2    4.1421×10^-1     2.6174×10^1
    3    2.1245×10^-2    5.8579×10^-1     6.1446×10^2
    4    6.8720×10^-3    4.1421×10^-1     3.7583×10^5
    5    2.3130×10^-3    5.8579×10^-1     1.4125×10^11
    6    7.6849×10^-4    4.1421×10^-1     1.9951×10^22
    7    2.5644×10^-4    5.8579×10^-1     3.9802×10^44
    8    8.5450×10^-5    4.1421×10^-1     1.5842×10^89
    ⋮
    18   1.4472×10^-9    4.1421×10^-1
    19   4.8241×10^-10   5.8579×10^-1
One may observe from the above table that the first method converges very fast,
the second method does not converge and the third method diverges rapidly. It is
interesting to think about the reasons for such behaviours.
3.8.4 Newton's method for multiple zeros
In the previous sections, we have only considered the case of a simple zero x* of the
function f(x), that is, we have

    f(x*) = 0  but  f'(x*) ≠ 0 .

Next, we consider the case where x* is a multiple zero. We first define
the multiplicity of a zero.

A point x* is called a zero of the function f(x) with multiplicity
m ≥ 1 if it holds that

    f(x*) = f'(x*) = ··· = f^(m−1)(x*) = 0 ,  but  f^(m)(x*) ≠ 0 .
What does a function with a multiple zero look like? By Taylor expansion,

    f(x) = f(x*) + (f'(x*)/1!)(x − x*) + ··· + (f^(m−1)(x*)/(m−1)!)(x − x*)^(m−1)
           + (f^(m)(ξ)/m!)(x − x*)^m ,

and by the definition of multiplicity all but the last term vanish, so

    f(x) = (f^(m)(ξ)/m!)(x − x*)^m ≡ (x − x*)^m g(x) ,   with  g(x*) ≠ 0 .
Consider the fixed-point iteration x_{k+1} = φ(x_k). If we take

    φ(x) = x − f(x)/f'(x) ,

we get Newton's method. From this expression of φ we see that we cannot
compute the value of φ(x) at x*, since f'(x*) = 0. But fortunately, we only need to
compute φ(x) at x_k (not at x*) in each iteration, where f'(x_k) is generally nonzero.
On the other hand, we see from the expression of φ that φ(x) seems undefined at
x*. But a detailed computation, using f(x) = (x − x*)^m g(x), gives

    φ(x) = x − (x − x*)^m g(x) / [ m(x − x*)^(m−1) g(x) + (x − x*)^m g'(x) ]
         = x − (x − x*) g(x) / [ m g(x) + (x − x*) g'(x) ] ,

which indicates that φ(x) is actually well-defined at x*. But we cannot use this
equivalent formula to compute φ(x_k) at each Newton iteration (think about why).
Next, we check the convergence of Newton's method. By the convergence
theorem for the fixed-point iteration, we should have the condition

    |φ'(x*)| < 1 .
By direct calculation,

    φ'(x*) = 1 − 1/m .

So we always have |φ'(x*)| < 1 for any m ≥ 1. This indicates the local convergence
(i.e., for an initial guess x0 close to x*) of Newton's method when it is
applied to a nonlinear equation with a zero of multiplicity m: for m > 1 the
convergence is only linear, with rate 1 − 1/m.
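This linear behaviour is easy to observe. The sketch below (an illustrative example, not from the notes) applies Newton's method to f(x) = (x − 1)^2, a zero of multiplicity m = 2, where the error ratio settles at 1 − 1/m = 1/2.

```python
# Newton's method on a double root: f(x) = (x - 1)^2, x* = 1, m = 2.
f = lambda x: (x - 1.0) ** 2
df = lambda x: 2.0 * (x - 1.0)

x = 2.0
errs = [abs(x - 1.0)]
for _ in range(6):
    x = x - f(x) / df(x)     # here f/df = (x - 1)/2, so the error halves
    errs.append(abs(x - 1.0))

ratios = [errs[k + 1] / errs[k] for k in range(6)]  # each ~ 1 - 1/m = 1/2
```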
3.9 Secant method
Recall the quasi-Newton iteration

    x_{k+1} = x_k − f(x_k)/g_k

with g_k chosen to approximate the derivative f'(x_k). It is usually not easy to obtain
an accurate approximation of f'(x_k). One effective and accurate choice is to select g_k
using the difference quotient

    g_k = ( f(x_k) − f(x_{k−1}) ) / ( x_k − x_{k−1} ) ,        (3.6)

and this leads to the following secant method:

    x_{k+1} = x_k − f(x_k)( x_k − x_{k−1} ) / ( f(x_k) − f(x_{k−1}) ) ,   k = 1, 2, ··· .
From the expression of g_k in (3.6) we may expect that at the beginning x_k and
x_{k−1} may not be very close, so g_k is not a very accurate approximation of f'(x_k), and
the convergence of the secant method should be slower than that of Newton's
method. But as x_k becomes more and more accurate, g_k approximates f'(x_k) more and
more accurately, so the convergence of the secant method should approach
that of Newton's method. We will study the convergence of the secant method in
more detail in the next subsection.
Convergence order of the secant method

In order to study the convergence of the secant method, we introduce the iterative
function

    φ(u, v) = u − f(u)(u − v)/( f(u) − f(v) ) = ( v f(u) − u f(v) )/( f(u) − f(v) ) ;

then the secant method can be written in the form

    x_{k+1} = φ(x_k, x_{k−1}) .
Due to this relation, the secant method is also called a two-point iterative method,
since it uses the two previous approximate values x_k and x_{k−1} to compute each new
approximate value x_{k+1}.
Next we are interested in analyzing the order of convergence of the secant method
under the assumption that f'(x*) ≠ 0. To that end, we define the error at the kth
step to be e_k = x_k − x*. Then we apply the Taylor expansion to get

    f(x_k) = f(x* + e_k) = f'(x*) e_k + f''(x*) e_k^2 / 2 + O(e_k^3) .
Substituting these expansions into the secant formula and simplifying (the
intermediate steps are straightforward but lengthy), we obtain

    e_{k+1} = x_{k+1} − x* = φ(x_k, x_{k−1}) − x* ≈ ( f''(x*)/(2f'(x*)) ) e_k e_{k−1} ,

up to cubic terms in the errors.
To find an approximate order p, we let |e_{k+1}| ≈ C|e_k|^p. Then after dropping the cubic
terms above, we need to find p such that

    | e_{k−1} e_k f''(x*)/(2f'(x*)) | ≈ C |e_k|^p .

This gives D|e_{k−1} e_k| ≈ |e_k|^p, with D = |f''(x*)/(2Cf'(x*))|. We can write this
relation as D|e_{k−1}| ≈ |e_k|^(p−1). On the other hand, we should have |e_k| ≈ C|e_{k−1}|^p,
hence D|e_{k−1}| ≈ (C|e_{k−1}|^p)^(p−1). This implies

    D = C^(p−1) ,   p(p − 1) = 1 .

Clearly, the only positive solution is p = (1 + √5)/2 ≈ 1.618 (the golden mean).
And the constant C is given by

    C = |f''(x*)/(2f'(x*))|^(p−1) ≈ |f''(x*)/(2f'(x*))|^0.618 .
In summary, we have shown that the secant method converges with order
p ≈ 1.618. So it is much faster than linear convergence, and only slightly slower than
Newton's method.
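The order p ≈ 1.618 can also be checked empirically. The sketch below (test function chosen here for illustration) runs the secant method on f(x) = x^2 − 2 and estimates the order from the ratio log e_{k+1} / log e_k.

```python
import math

f = lambda x: x * x - 2.0
xs = math.sqrt(2.0)

x0, x1 = 1.0, 2.0
errs = [abs(x0 - xs), abs(x1 - xs)]
for _ in range(5):
    # secant step: x2 = x1 - f(x1)(x1 - x0)/(f(x1) - f(x0))
    x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
    x0, x1 = x1, x2
    errs.append(abs(x2 - xs))

# log e_{k+1} / log e_k tends (slowly) towards p = (1 + sqrt(5))/2 ~ 1.618
orders = [math.log(errs[k + 1]) / math.log(errs[k]) for k in (4, 5)]
```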
4.1 Newton's method for a system of two equations
Consider a system of two nonlinear equations, written in vector form as

    F(x) = [ f(x, y) ] ,    x = [ x ] .
           [ g(x, y) ]          [ y ]

Newton's method for such a system reads

    x_{k+1} = x_k − F'(x_k)^{-1} F(x_k) ,   k = 0, 1, 2, ··· .        (4.3)

Take the initial guess x_0 = (0, 0)^T. Then

    F(x_0) = [ −1 ] ,    F'(x_0) = [ 2   0 ] .
             [ −1 ]                [ 0   2 ]

Therefore

    x_1 = [ 0 ] − [ 2   0 ]^{-1} [ −1 ] = [ 1/2 ] .
          [ 0 ]   [ 0   2 ]      [ −1 ]   [ 1/2 ]

Next,

    F(x_1) = [ −1/4 ] ,    F'(x_1) = [  3/2   −1/2 ] .
             [ −1/4 ]                [ −1/2    3/2 ]

Therefore

    x_2 = [ 1/2 ] − [  3/2   −1/2 ]^{-1} [ −1/4 ] = [ 3/4 ] .
          [ 1/2 ]   [ −1/2    3/2 ]      [ −1/4 ]   [ 3/4 ]

Continuing in the same way gives x_3 = (0.875, 0.875)^T, x_4 = (0.9375, 0.9375)^T,
and clearly the sequence {x_k} converges to the limit x* = (1, 1)^T.
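Iteration (4.3) is easy to code for a 2×2 system. The system used below, f(x, y) = x^2 + y^2 − 2 and g(x, y) = x − y with solution (1, 1), is an illustrative choice and not the system of the example above.

```python
# Newton's method for a 2x2 nonlinear system (illustrative example):
#   f(x, y) = x^2 + y^2 - 2 = 0,  g(x, y) = x - y = 0,  solution (1, 1).
def newton2(x, y, n):
    for _ in range(n):
        f, g = x * x + y * y - 2.0, x - y
        # Jacobian [[2x, 2y], [1, -1]]; apply its inverse via the determinant
        det = 2.0 * x * (-1.0) - 2.0 * y * 1.0
        dx = (-f - 2.0 * y * g) / det
        dy = (2.0 * x * g - f) / det
        x, y = x - dx, y - dy
    return x, y

xn, yn = newton2(2.0, 0.5, 20)   # converges to (1, 1)
```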
4.2 Newton's method for general systems of nonlinear equations
Consider a general system of n nonlinear equations in n unknowns:

    f_1(x_1, x_2, ···, x_n) = 0 ,
    ···
    f_n(x_1, x_2, ···, x_n) = 0 .

Writing F = (f_1, ···, f_n)^T and x = (x_1, ···, x_n)^T, we expand each component
around the current iterate x_k:

    f_i(x) = f_i(x_k) + Σ_{j=1}^{n} (∂f_i/∂x_j)(x_k) (x − x_k)_j + ··· ,   i = 1, ···, n .

Dropping the higher-order terms and collecting the first derivatives, we obtain the
Jacobian matrix of F at x,

             [ ∂f_1/∂x_1 (x)   ∂f_1/∂x_2 (x)   ···   ∂f_1/∂x_n (x) ]
    F'(x) =  [ ∂f_2/∂x_1 (x)   ∂f_2/∂x_2 (x)   ···   ∂f_2/∂x_n (x) ]
             [      ···                                            ]
             [ ∂f_n/∂x_1 (x)   ∂f_n/∂x_2 (x)   ···   ∂f_n/∂x_n (x) ] ,

and Newton's method takes the form

    x_{k+1} = x_k − F'(x_k)^{-1} F(x_k) ,   k = 0, 1, 2, ··· .
When the Jacobian matrix F'(x) is large, it may not be efficient or stable to
compute the inverse F'(x_k)^{-1} at each iteration. Then we usually implement
Newton's method as follows: at each iteration, solve the linear system
F'(x_k) Δx_k = −F(x_k), then update x_{k+1} = x_k + Δx_k.
4.3 Convergence of Newton's method
In this section we will study the convergence of Newton's method for general
systems of nonlinear equations. For this purpose we need to introduce a few useful
results. The first one is the following fundamental theorem of calculus:

    Let F be differentiable in an open set Ω ⊂ R^n and x* ∈ Ω. Then
    for all x sufficiently close to x*, we have

        F(x) − F(x*) = ∫_0^1 F'( x* + t(x − x*) ) (x − x*) dt .
The second is a perturbation lemma: if A and B are n×n matrices with
‖I − BA‖ < 1, then A and B are both invertible, and

    ‖A^{-1}‖ ≤ ‖B‖ / (1 − ‖I − BA‖) ,    ‖B^{-1}‖ ≤ ‖A‖ / (1 − ‖I − BA‖) .

An outline of the proof. For any n×n matrix C, if ‖C‖ < 1 then I − C is invertible
and

    (I − C)^{-1} = I + C + C^2 + ··· .

So if ‖I − BA‖ < 1, we know BA = I − (I − BA) is invertible, so both A and B are
invertible. Then we can write

    A^{-1} = (BA)^{-1} B = (I − (I − BA))^{-1} B = { I + (I − BA) + (I − BA)^2 + ··· } B ,

which implies

    ‖A^{-1}‖ ≤ ‖B‖ / (1 − ‖I − BA‖) .   □
Now we make the following standard assumptions on F :
1. There is a solution x to the equation F (x) = 0.
which is the right inequality of (4.6). To show the left one, we note that

    F'(x*)^{-1} ( F(x) − F(x*) ) = ∫_0^1 F'(x*)^{-1} F'( x* + t(x − x*) ) (x − x*) dt
        = (x − x*) − ∫_0^1 ( I − F'(x*)^{-1} F'( x* + t(x − x*) ) ) (x − x*) dt ,
4.4 Broyden's method
Similar to the secant method, Broyden's method is a variant of Newton's method
that requires no derivative information.
Next we discuss how to derive Broyden's method for solving the system of
nonlinear equations F(x) = 0. Recall Newton's method

    x_{k+1} = x_k − F'(x_k)^{-1} F(x_k) ,   k = 0, 1, 2, ··· ,
which requires computing and inverting the Jacobian matrix F'(x_k) at each
iteration. When we derived Newton's method, we used the affine function

    y = F(x_k) + F'(x_k)(x − x_k)

to approximate the original nonlinear function F(x). Broyden's method replaces
the Jacobian matrix F'(x_k) by a simpler approximation A_k, so the affine
function becomes

    y = F(x_k) + A_k (x − x_k) .
Now we require this function to take the same value as F(x) at x = x_{k−1}, namely

    A_k ( x_k − x_{k−1} ) = F(x_k) − F(x_{k−1}) .

Noting that A_k has n^2 entries, we can view this equation as only n constraint
equations, so some additional information is needed to determine A_k uniquely. One
strategy is to choose A_k such that it is identical to A_{k−1} on all vectors that are
orthogonal to d_k = x_k − x_{k−1}, which implies

    A_k − A_{k−1} = u d_k^T
for some vector u ∈ R^n. We can find the vector u by multiplying this relation by d_k:

    ( A_k − A_{k−1} ) d_k = u ( d_k^T d_k ) ;

writing y_k = F(x_k) − F(x_{k−1}) and using the constraint A_k d_k = y_k, this gives

    u = ( y_k − A_{k−1} d_k ) / ( d_k^T d_k ) ,

i.e., the rank-one Broyden update  A_k = A_{k−1} + ( y_k − A_{k−1} d_k ) d_k^T / ( d_k^T d_k ).
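A sketch of the resulting method; the test problem F, the starting point, and the initial Jacobian approximation below are illustrative assumptions, not taken from the notes.

```python
import numpy as np

# Illustrative system: F(x) = (x0^2 + x1^2 - 2, x0 - x1), with root (1, 1).
def F(x):
    return np.array([x[0] ** 2 + x[1] ** 2 - 2.0, x[0] - x[1]])

A = np.array([[4.0, 1.0], [1.0, -1.0]])  # Jacobian at the initial guess
x_old = np.array([2.0, 0.5])
x = x_old - np.linalg.solve(A, F(x_old))
for _ in range(20):
    d = x - x_old                        # d_k = x_k - x_{k-1}
    if d @ d < 1e-30:                    # converged: avoid dividing by 0
        break
    yv = F(x) - F(x_old)                 # y_k = F(x_k) - F(x_{k-1})
    # rank-one Broyden update: A += (y - A d) d^T / (d^T d)
    A = A + np.outer(yv - A @ d, d) / (d @ d)
    x_old, x = x, x - np.linalg.solve(A, F(x))
```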
4.5 Steepest descent method
The previous iterative methods require a good initial guess for their convergence.
The steepest descent method is a simple iterative method that makes no assumption
on the initial guess. Instead of working with the original nonlinear equation

    F(x) = 0 ,

the steepest descent method seeks a local minimum of the function

    g(x) = F(x)^T F(x)/2 ,

moving along a descent direction at each iteration.
Descent direction. A direction p ∈ R^n is called a descent direction of a function f at
a point x_0 if it satisfies, for some t_0 > 0,

    f(x_0 + t p) < f(x_0)   for all 0 < t < t_0 .

It is easy to check that −∇g(x) is a descent direction of g, and

    ∇g(x) = F'(x)^T F(x) ,

where F'(x) is the Jacobian matrix of F(x). Here is the steepest descent algorithm.
Steepest descent method. Select x_0. For k = 0, 1, 2, ···, do the following:

1. Find α_k ≥ 0 that solves the one-dimensional minimization

    min_{α ≥ 0} g( x_k − α ∇g(x_k) ) .

2. Update

    x_{k+1} = x_k − α_k ∇g(x_k) .
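When g is the quadratic x^T A x / 2 with A symmetric positive definite, the line search has the closed form α_k = (r^T r)/(r^T A r) with r = ∇g(x_k) = A x_k, which makes the algorithm easy to sketch (the matrix and starting point below are illustrative choices).

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # an SPD matrix (illustrative)
x = np.array([1.0, -1.5])

g = lambda v: 0.5 * v @ A @ v
vals = [g(x)]
for _ in range(40):
    r = A @ x                            # gradient of g at x
    alpha = (r @ r) / (r @ A @ r)        # exact line search for a quadratic
    x = x - alpha * r
    vals.append(g(x))
```

The values g(x_k) decrease monotonically towards the minimum g = 0 at x = 0.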
The following result shows the convergence of the steepest descent method when
it is applied to a simple model problem.
Convergence rate. Let A be a symmetric positive definite matrix, and consider the
quadratic function g(x) = x^T A x / 2. Suppose the eigenvalues of A are given by

    0 < λ_1 ≤ λ_2 ≤ ··· ≤ λ_n ,   and let κ = λ_n/λ_1 .

Then the sequence {x_k} generated by the steepest descent method satisfies

    g(x_{k+1}) ≤ ( (κ − 1)/(κ + 1) )^2 g(x_k) ≤ ( (κ − 1)/(κ + 1) )^{2(k+1)} g(x_0) .
Proof. We can easily check that ∇g(x) = Ax. By Taylor expansion, for any α ∈ R,

    g(x + αAx) = g(x) + α ∇g(x)^T (Ax) + (α^2/2)(Ax)^T A (Ax)
               = g(x) + α (Ax)^T (Ax) + (α^2/2)(Ax)^T A (Ax) .

Thus x_{k+1} = (I − α_k A) x_k, and since α_k minimizes g along the search direction,
for any linear polynomial Q_k(z) = a + bz with a = Q_k(0) ≠ 0 we have

    g(x_{k+1}) ≤ g( Q_k(A) x_k / Q_k(0) ) .

Now choose

    Q_k(z) = ( 2z − (λ_1 + λ_n) ) / ( λ_n − λ_1 ) .

As Q_k(λ_1) = −1 and Q_k(λ_n) = 1, and Q_k is linear, we have

    |Q_k(λ_j)| ≤ 1   for all j .

Let {v_j} be the orthonormalized eigenvectors corresponding to {λ_j}, and expand

    x_k = Σ_{j=1}^{n} a_j^(k) v_j .

Then

    g(x_{k+1}) ≤ (1/(2 Q_k(0)^2)) Σ_{j=1}^{n} (a_j^(k))^2 Q_k(λ_j)^2 λ_j
               ≤ (1/(2 Q_k(0)^2)) Σ_{j=1}^{n} (a_j^(k))^2 λ_j
               = (1/Q_k(0)^2) g(x_k)
               = ( (λ_n − λ_1)/(λ_n + λ_1) )^2 g(x_k)
               = ( (κ − 1)/(κ + 1) )^2 g(x_k) .
In numerical solutions of nearly all mathematical models, including linear and nonlinear
differential equations, integral equations and systems of nonlinear algebraic equations,
one may have to solve systems of linear algebraic equations repeatedly, possibly
millions of times in many practical applications. Before discussing how to solve such
systems, we first introduce some fundamental definitions and concepts related to
systems of linear algebraic equations. We will not discuss most of the content of
sections 5.1 and 5.2 in the lectures.
5.1 Matrices and vectors
In this section we review very briefly some basic concepts about matrices, vectors,
scalars and linear systems. All of these are supposed to be known to
the students in this course, so we will not discuss them in any detail in the lectures.

Matrix. An m×n matrix A is a rectangular array of numbers of the form

        [ a_11   a_12   ···   a_1,n−1   a_1n ]
    A = [ a_21   a_22   ···   a_2,n−1   a_2n ]
        [  ···                               ]
        [ a_m1   a_m2   ···   a_m,n−1   a_mn ] .

The number a_ij is called the (i, j)th entry of A. For example, for the matrix

        [ 1   2   3 ]
    A = [ 4   0   1 ] ,
        [ 1   3   4 ]

we have a_12 = 2, a_32 = 3, and so on.
Vector. A (column) vector of dimension n is an n×1 array

        [ x_1 ]
    x = [ x_2 ] ,
        [ ··· ]
        [ x_n ]

and we often write x ∈ R^n. The integer n is called the dimension, and the number x_j is
called the jth component of x.

By convention, all vectors are column vectors, and the transpose of a column
vector is a row vector, e.g.,

    x^T = (x_1, x_2, ···, x_n) ,

where x is a column vector and x^T is its transpose, a row vector.
Operations with matrices. A matrix A multiplied by a scalar α is defined by

    αA = (α a_ij)_{m×n} ,

e.g.,

      [ 1   0  −1 ]   [ 2   0  −2 ]
    2 [ 0   2   1 ] = [ 0   4   2 ] .

If matrices A and B have the same numbers of rows and columns,
A = (a_ij)_{m×n} and B = (b_ij)_{m×n}, then we define

    A + B = (a_ij + b_ij)_{m×n} .
Given two matrices A = (a_ij)_{l×m} and B = (b_ij)_{m×n}, their product
C = AB = (c_ij)_{l×n} is defined by

    c_ij = Σ_{k=1}^{m} a_ik b_kj .

E.g., consider

        [  1  −1   0 ]        [  2   0 ]
    A = [  2   3  −1 ] ,  B = [ −1   0 ] ;
        [ −2   0   2 ]        [  2   1 ]

then we have

         [  3   0 ]
    AB = [ −1  −1 ] .
         [  0   2 ]
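The entrywise formula c_ij = Σ_k a_ik b_kj is easy to verify in code; the small matrices below are illustrative.

```python
import numpy as np

A = np.array([[1.0, -1.0, 0.0], [2.0, 3.0, -1.0], [-2.0, 0.0, 2.0]])
B = np.array([[2.0, 0.0], [-1.0, 0.0], [2.0, 1.0]])

# build the product entry by entry from the definition
C = np.zeros((3, 2))
for i in range(3):
    for j in range(2):
        C[i, j] = sum(A[i, k] * B[k, j] for k in range(3))
```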
Identity matrix. The n×n matrix

          [ 1   0   ···   0 ]
    I_n = [ 0   1   ···   0 ]
          [      ···        ]
          [ 0   0   ···   1 ]

is called the identity matrix of order n; its diagonal entries are one and its
off-diagonal entries are zero.
Diagonal matrix. A matrix D is called diagonal if its off-diagonal entries
are all zero, e.g.,

        [ d_1   0   ···   0  ]
    D = [ 0    d_2  ···   0  ] .
        [       ···          ]
        [ 0     0   ···  d_n ]
Matrix-vector product. For A = (a_ij)_{m×n} and x = (x_1, x_2, ···, x_n)^T,
we define

         [ a_11 x_1 + a_12 x_2 + ··· + a_1n x_n ]
    Ax = [ a_21 x_1 + a_22 x_2 + ··· + a_2n x_n ] .
         [               ···                    ]
         [ a_m1 x_1 + a_m2 x_2 + ··· + a_mn x_n ]
Matrix multiplication is associative,

    (AB)C = A(BC) ,

but in general not commutative. For instance, for

    A = [ 0   1 ] ,    B = [ 1   1 ] ,
        [ 1   0 ]          [ 1   0 ]

we have

    AB = [ 1   0 ] ,   but   BA = [ 1   1 ] ≠ AB .
         [ 1   1 ]                [ 0   1 ]
Transpose. The transpose A^T of an m×n matrix A is the n×m matrix whose
(i, j)th entry is a_ji, e.g.,

        [  1   2   3 ]          [ 1   4   7   10 ]
    A = [  4   5   6 ] ,  A^T = [ 2   5   8   11 ] .
        [  7   8   9 ]          [ 3   6   9   12 ]
        [ 10  11  12 ]

The transpose satisfies the following properties:

    (αA)^T = α A^T ,   (A + B)^T = A^T + B^T ,   (AB)^T = B^T A^T .

For an n×n matrix A and vectors x = (x_1, ···, x_n)^T and y = (y_1, ···, y_n)^T,

    x^T A y = Σ_{i,j=1}^{n} a_ij x_i y_j = y^T A^T x .
Block matrices. A matrix can be partitioned into submatrices. For example, a 5×7
matrix A with entries a_11, ···, a_57 may be written as

        [ A_11   A_12   A_13 ]
    A = [ A_21   A_22   A_23 ] ,

where, e.g.,

    A_11 = [ a_11   a_12 ] ,
           [ a_21   a_22 ]

and the other blocks are formed accordingly. If B is another matrix having the same
numbers of rows and columns as A and is decomposed blockwise in the same way,

    B = [ B_11   B_12   B_13 ] ,
        [ B_21   B_22   B_23 ]

then we have

    A + B = [ A_11 + B_11   A_12 + B_12   A_13 + B_13 ] .
            [ A_21 + B_21   A_22 + B_22   A_23 + B_23 ]
Column form of the matrix-vector product. For A = (a_ij)_{m×n} and
x = (x_1, ···, x_n)^T, write A columnwise as A = (a_1  a_2  ···  a_n); then we have

    Ax = x_1 a_1 + x_2 a_2 + ··· + x_n a_n ,

so Ax is nothing else but the linear combination of the column vectors of A with
coefficients being the components of x.
5.2 Systems of linear algebraic equations
From now on we will be concerned with the solution of the system of linear algebraic
equations

    Ax = b ,

where A is an n×n matrix, and x and b are two vectors of dimension n. Before discussing
how to solve the system, we first look at when the system has a unique solution.
We summarize some useful conclusions below:

Let A be an n×n matrix; then the following statements are equivalent:
(1) For all x, Ax = 0 implies x = 0.
(2) The columns (rows) of A are linearly independent.
(3) For any vector b, the system Ax = b has a solution.
(4) If Ax = b has a solution, it is unique.
(5) There is a matrix B such that BA = AB = I.
(6) det(A) ≠ 0.
Properties of inverse matrices.
1. The product AB is nonsingular if and only if A and B are both nonsingular,
and in this case

    (AB)^{-1} = B^{-1} A^{-1} .

2. If the matrix A is nonsingular, then the matrix A^T is also nonsingular and

    (A^T)^{-1} = (A^{-1})^T .
5.3 How to solve the system Ax = b
Consider the n×n square system Ax = b. A natural way to solve the system seems
to be to multiply both sides of Ax = b by A^{-1} to get the solution

    x = A^{-1} b .
This suggests the following algorithm for solving Ax = b:
Algorithm.
1. Compute C = A1 .
2. Compute the solution x = C b.
But how expensive is computing A^{-1}? Note that the ith column of A^{-1} solves
the system Ax = e_i, where e_i is the ith canonical basis vector

    e_i = (0, ···, 0, 1, 0, ···, 0)^T ,   i = 1, 2, ···, n ,

with the 1 in the ith position. So computing A^{-1} amounts to solving n systems of the
form Ax = b with different right-hand sides. This is much more expensive than solving
the original equation Ax = b directly. Is there a more efficient algorithm for solving
Ax = b?
Is there any more efficient algorithm for solving Ax = b?
Yes. One very popular approach is to use the LU factorization of A:

    A = LU ,

where L is a lower triangular matrix, i.e., its entries are all zero above the diagonal,
and U is an upper triangular matrix, i.e., its entries are all zero below the diagonal.
Using A = LU, the solution of the system Ax = b amounts to solving the
system LUx = b. Letting Ux = y, this suggests the following algorithm:
Algorithm.
1. Factorize A = LU .
2. Solve Ly = b.
3. Solve U x = y.
We next discuss how to solve triangular systems, then discuss how to find the LU
factorization of A.
5.4 Solution of triangular systems

Consider first a lower triangular system Lx = b with

        [ l_11                          ]
        [ l_21   l_22                   ]
    L = [  ···          ···             ]
        [ l_i1   l_i2   ···   l_ii      ]
        [  ···                     ···  ]
        [ l_n1   l_n2   ···   l_ni  ···  l_nn ]

(blank entries are zero).
To solve the system Lx = b, we can start with the first equation to find x_1, then
find x_2 from the second equation, and so on. Let us work out more details below.
Write Lx = b componentwise:

    l_11 x_1                               = b_1
    l_21 x_1 + l_22 x_2                    = b_2
        ···
    l_i1 x_1 + l_i2 x_2 + ··· + l_ii x_i   = b_i
        ···
    l_n1 x_1 + l_n2 x_2 + ··· + l_nn x_n   = b_n .

Then x_1 = b_1/l_11, x_2 = (b_2 − l_21 x_1)/l_22, and in general

    x_i = ( b_i − Σ_{j=1}^{i−1} l_ij x_j ) / l_ii ,   i = 3, 4, ···, n .
Similarly, consider an upper triangular system Ux = b with

        [ u_11  u_12  ···  u_1i  ···  u_1n ]
        [       u_22  ···  u_2i  ···  u_2n ]
    U = [                   ···            ]
        [             u_ii  ···  u_in      ]
        [                        ···       ]
        [                             u_nn ]

(blank entries are zero). Componentwise, the ith equation reads

    u_ii x_i + ··· + u_in x_n = b_i ,   and in particular   u_nn x_n = b_n .

Then we can solve the system backwards as follows:

    x_n = b_n / u_nn ,
    x_{n−1} = ( b_{n−1} − u_{n−1,n} x_n ) / u_{n−1,n−1} ,
    x_i = ( b_i − Σ_{j=i+1}^{n} u_ij x_j ) / u_ii ,   i = n−2, n−3, ···, 2, 1 .
Work out the total computational cost of the above solution process in terms of n,
the total number of degrees of freedom.
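The two substitution procedures can be sketched directly from the formulas above. The numbers in the check at the end are a small illustrative LU pair.

```python
def forward_sub(L, b):
    """Forward substitution for a lower triangular system Lx = b."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x

def back_sub(U, b):
    """Back substitution for an upper triangular system Ux = b."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / U[i][i]
    return x

# illustrative LU pair: L unit lower triangular, U upper triangular
L = [[1.0, 0.0], [2.0, 1.0]]
U = [[2.0, 4.0], [0.0, 3.0]]
c = forward_sub(L, [2.0, 1.0])   # -> [2.0, -3.0]
x = back_sub(U, c)               # -> [3.0, -1.0]
```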
5.5 Cholesky factorization
We have discussed how to solve the system Ax = b when we have the LU factorization
of A. Below we shall discuss how to factorize a matrix A. We start with the positive
definite matrices.
5.5.1 Properties of symmetric positive definite matrices

Symmetric matrix. A matrix A is symmetric if A^T = A, e.g.,

        [ 1   2   4 ]
    A = [ 2   3   5 ] = A^T .
        [ 4   5   6 ]

A symmetric matrix is determined by its entries on and above its diagonal, hence
only half of its entries need to be stored.
Symmetric positive definite matrix. An n×n matrix A is said to be symmetric
positive definite (SPD) if

(1) A is symmetric: A^T = A;
(2) A is positive definite: x^T A x > 0 for any x ≠ 0.
2. Any principal submatrix of an SPD matrix A, say

          [ a_{i_1 i_1}   a_{i_1 i_2}   ···   a_{i_1 i_k} ]
    A_k = [ a_{i_2 i_1}   a_{i_2 i_2}   ···   a_{i_2 i_k} ] ,
          [      ···                                      ]
          [ a_{i_k i_1}   a_{i_k i_2}   ···   a_{i_k i_k} ]

is again SPD. Indeed, for any vector of the form x = x_{i_1} e_{i_1} + ··· + x_{i_k} e_{i_k} ∈ R^n,
let x_k = (x_{i_1}, ···, x_{i_k})^T ∈ R^k; then it is easy to find that

    x^T A x = Σ_{l=1}^{k} Σ_{t=1}^{k} x_{i_l} x_{i_t} a_{i_l i_t} = x_k^T A_k x_k ,

from which we know that for any x_k ≠ 0 we have x_k^T A_k x_k > 0, by the positive
definiteness of A.
3. Any eigenvalue of an SPD matrix is positive.
Suppose λ is an eigenvalue; then there exists an x ≠ 0, called an eigenvector of
A, such that

    Ax = λx .

Then we have

    x^T A x = λ x^T x .

As x^T A x > 0, we get λ x^T x > 0. But x ≠ 0 implies x^T x > 0. So we must have
λ > 0.
4. For any rectangular matrix U , if its column vectors are linearly independent,
then the matrix U T U is an SPD matrix.
5.5.2 Cholesky factorization
We know that if a matrix has the form U^T U, with the columns of U
linearly independent, then the matrix is SPD. We next show that the
converse is also true: if A is an SPD matrix, then A can be factorized as
U^T U, where U is an upper triangular matrix. If, in addition, we require the diagonal
entries of U to be positive, then the factorization is unique, and it is called the Cholesky
factorization of A.
To factor A into U^T U, let us write

        [ a_11   a^T  ]          [ u_11   r^T  ]
    A = [ a     A_11  ] ,    U = [ 0     U_11  ] ,

where a, r ∈ R^{n−1} and A_11, U_11 are of order n−1. Then

    A = U^T U

is equivalent to

    [ a_11   a^T  ]   [ u_11    0     ] [ u_11   r^T  ]
    [ a     A_11  ] = [ r      U_11^T ] [ 0     U_11  ] .

Comparing the entries on both sides, we obtain:

1. u_11^2 = a_11, so u_11 = √a_11 ;
2. u_11 r = a, so r = a/u_11 ;
3. r r^T + U_11^T U_11 = A_11, i.e., U_11^T U_11 = A_11 − r r^T .
The step 3 above shows that U_11 is the Cholesky factor of the modified submatrix
A_11 − r r^T. One can therefore repeat the above procedure for this submatrix, and
the Cholesky factorization proceeds in n steps:

1. At the first step, the first row of U is computed and the (n−1)×(n−1)
submatrix A_11 in the right bottom corner is modified (replaced by A_11 − r r^T);
2. At the 2nd step, the second row of U is computed and the (n−2)×(n−2)
submatrix in the right bottom corner is modified.
3. The procedure continues till the nth step, or till nothing is left in the right
bottom corner.

Thus the algorithm begins with a loop over the rows of U to be computed; note that
we update the upper triangular part of A in place and do not need another matrix U for
storage. When the algorithm is finished, the upper triangular part of A stores the
required information of U.
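A sketch of this procedure in code: at step k the kth row of U is computed and the trailing submatrix is modified, overwriting the upper triangle of A. The SPD matrix used in the check is an illustrative example.

```python
import math

def cholesky_in_place(A):
    """Overwrite the upper triangle of SPD matrix A with its Cholesky
    factor U (so that A = U^T U); only entries A[i][j], i <= j, are used."""
    n = len(A)
    for k in range(n):
        A[k][k] = math.sqrt(A[k][k])          # u_kk = sqrt(a_kk)
        for j in range(k + 1, n):
            A[k][j] /= A[k][k]                # r = a / u_kk
        for i in range(k + 1, n):
            for j in range(i, n):
                A[i][j] -= A[k][i] * A[k][j]  # corner <- corner - r r^T
    return A

U = cholesky_in_place([[4.0, 0.5,    1.0],
                       [0.5, 1.0625, 0.25],
                       [1.0, 0.25,   0.515625]])
```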
The cost of the Cholesky factorization. The number of multiplications (and
likewise of subtractions), mainly for computing r r^T at each step, is

    Σ_{k=1}^{n} Σ_{j=k+1}^{n} (n − k) ≈ ∫_1^n (n − k)^2 dk = (n − 1)^3 / 3 .

But the matrix r r^T is symmetric, so we need to compute only half of its entries, and
the total cost is approximately

    n^3/6 (multiplications) + n^3/6 (subtractions) = n^3/3 .
Example. Compute the Cholesky factorization of

        [  3   −1   1 ]
    A = [ −1    3   0 ] .
        [  1    0   3 ]

Solution.

1. Update the 1st row and the submatrix at the right bottom corner:
u_11 = √3, u_12 = −1/√3, u_13 = 1/√3, and the 2×2 corner becomes

    [ 3 − 1/3    0 + 1/3 ]   [ 8/3   1/3 ]
    [ 0 + 1/3    3 − 1/3 ] = [ 1/3   8/3 ] .

2. Update the 2nd row and the submatrix at the right bottom corner:
u_22 = √(8/3) = 2√2/√3, u_23 = (1/3)/u_22 = 1/(2√6), and the corner becomes
8/3 − 1/24 = 63/24.

3. Update the 3rd row: u_33 = √(63/24).

This gives

        [ √3   −1/√3     1/√3    ]
    U = [ 0    2√2/√3    1/(2√6) ] .
        [ 0    0         √(63/24) ]
Example. Compute the Cholesky factorization of

        [ 4     1/2    1     ]
    A = [ 1/2   17/16  1/4   ] .
        [ 1     1/4    33/64 ]

Solution.

1. Update the 1st row and the submatrix at the right bottom corner:
u_11 = 2, u_12 = 1/4, u_13 = 1/2, and the corner becomes

    [ 17/16 − 1/16    1/4 − 1/8  ]   [ 1     1/8   ]
    [ 1/4 − 1/8       33/64 − 1/4 ] = [ 1/8   17/64 ] .

2. Update the 2nd row and the submatrix at the right bottom corner:
u_22 = 1, u_23 = 1/8, and the corner becomes 17/64 − 1/64 = 1/4.

3. Update the 3rd row: u_33 = √(1/4) = 1/2.

This gives

        [ 2   1/4   1/2 ]
    U = [ 0   1     1/8 ] .
        [ 0   0     1/2 ]
5.6 LU factorization and Gaussian elimination

The solution of non-SPD systems of linear algebraic equations arises in nearly all
applications of mathematics, but the Cholesky factorization introduced in the previous
section is only applicable to SPD matrices. Next we shall discuss the LU factorization
for general matrices. Consider solving a system of equations of the form

    Ax = b ,

where A is a nonsingular n×n matrix and b ∈ R^n is a vector. Gaussian
elimination is basically a process of the so-called LU factorization of the matrix A.
To start with, we first give some basic properties of matrix operations.
Consider a 4×4 matrix

          [ 1      0   0   0 ]
    L_1 = [ l_21   1   0   0 ] ,
          [ l_31   0   1   0 ]
          [ l_41   0   0   1 ]

where l_21, l_31 and l_41 are real numbers. It is easy to verify that

               [  1      0   0   0 ]
    L_1^{-1} = [ −l_21   1   0   0 ] .
               [ −l_31   0   1   0 ]
               [ −l_41   0   0   1 ]

The same facts are true for similar matrices L_2 and L_3. Please check it yourself!
Now consider the following two matrices:

          [ 1      0   0   0 ]          [ 1   0      0   0 ]
    L_1 = [ l_21   1   0   0 ] ,  L_2 = [ 0   1      0   0 ] .
          [ l_31   0   1   0 ]          [ 0   l_32   1   0 ]
          [ l_41   0   0   1 ]          [ 0   l_42   0   1 ]

Then

              [ 1      0      0   0 ]
    L_1 L_2 = [ l_21   1      0   0 ] .
              [ l_31   l_32   1   0 ]
              [ l_41   l_42   0   1 ]
Next consider the following operation:

            [ 1   0      0   0 ] [ a_11  a_12  a_13  a_14 ]
    L_2 A = [ 0   1      0   0 ] [ a_21  a_22  a_23  a_24 ]
            [ 0   l_32   1   0 ] [ a_31  a_32  a_33  a_34 ]
            [ 0   l_42   0   1 ] [ a_41  a_42  a_43  a_44 ]

          [ a_11              a_12              a_13              a_14             ]
        = [ a_21              a_22              a_23              a_24             ] .
          [ a_31 + l_32 a_21  a_32 + l_32 a_22  a_33 + l_32 a_23  a_34 + l_32 a_24 ]
          [ a_41 + l_42 a_21  a_42 + l_42 a_22  a_43 + l_42 a_23  a_44 + l_42 a_24 ]

If you look at the above operations carefully, you will come to the conclusion:

    Adding row 2 multiplied by l_32 and l_42 to row 3 and row 4, respectively,
    is equal to the operation L_2 A.

Similar results are true for the other L_i matrices and for any n×n matrix A.
5.6.1 Gaussian elimination

Gaussian elimination transforms the matrix

        [ a_11   a_12   ···   a_1n ]
    A = [         ···              ]
        [ a_n1   a_n2   ···   a_nn ]

(together with the right-hand side b_1, b_2, ···, b_n) into an upper triangular matrix

        [ ã_11   ã_12   ···   ã_1n ]
    Ã = [ 0      ã_22   ···   ã_2n ] .
        [              ···         ]
        [ 0      0      ···   ã_nn ]
Next, we explain Gaussian elimination and the LU factorization through two
simple examples: one a 2×2 system of equations, the other a 3×3 system. If
you understand these two simple examples, then you can easily carry out
Gaussian elimination and LU factorization for general n×n systems.
Gaussian elimination for a 2×2 matrix

Consider the following 2×2 system:

    Ax ≡ [ 2   4  ] [ x_1 ]   [ 2 ]
         [ 4   11 ] [ x_2 ] = [ 1 ] ≡ b .        (5.1)

Eliminating x_1 in the 2nd equation requires adding row 1 multiplied by −2 to row 2,
which equals

    L̃A ≡ [  1   0 ] [ 2   4  ]   [ 2   4 ]
         [ −2   1 ] [ 4   11 ] = [ 0   3 ] ≡ U .

Using the property of the matrix L̃, the above equation gives

    A = [ 2   4  ]   [ 1   0 ] [ 2   4 ]
        [ 4   11 ] = [ 2   1 ] [ 0   3 ] ≡ LU .

This gives an LU factorization of A. Using the factorization, solving the system
Ax = b is equivalent to solving the system LUx = b, which can be done as follows:

    L c = b ,   then   U x = c .

The forward substitution Lc = b gives

    [ c_1 ]   [  2 ]
    [ c_2 ] = [ −3 ] ,

and the back substitution

    Ux = c ≡ [ 2   4 ] [ x_1 ]   [  2 ]
             [ 0   3 ] [ x_2 ] = [ −3 ]

gives the solution of equation (5.1):

    [ x_1 ]   [  3 ]
    [ x_2 ] = [ −1 ] .
Example 5.1. Use Gaussian elimination to solve the following 2×2 system:

    [ 2   6  ] [ x_1 ]   [ 1 ]
    [ 4   13 ] [ x_2 ] = [ 0 ] .        (5.2)
LU factorization of a 3×3 matrix

Let us consider one more simple example of Gaussian elimination⁸:

    Ax ≡ [ 1   1   1 ] [ x_1 ]   [  0   ]
         [ 3   6   4 ] [ x_2 ] = [  2   ] ≡ b .        (5.3)
         [ 1   2   1 ] [ x_3 ]   [ −1/3 ]

Eliminating x_1 in the 2nd equation requires adding row 1 multiplied by −3 to row 2,
and eliminating x_1 in the 3rd equation requires adding row 1 multiplied by −1 to
row 3; this equals

    L̃_1 A ≡ [  1   0   0 ] [ 1   1   1 ]   [ 1   1   1 ]
            [ −3   1   0 ] [ 3   6   4 ] = [ 0   3   1 ] ≡ U_1 .
            [ −1   0   1 ] [ 1   2   1 ]   [ 0   1   0 ]

Now eliminating x_2 in the 3rd equation requires adding row 2 multiplied by −1/3 to
row 3, which equals

    L̃_2 L̃_1 A ≡ [ 1    0    0 ] [ 1   1   1 ]   [ 1   1    1   ]
                [ 0    1    0 ] [ 0   3   1 ] = [ 0   3    1   ] ≡ U .
                [ 0  −1/3   1 ] [ 0   1   0 ]   [ 0   0   −1/3 ]

Using the properties of the matrices L̃_1 and L̃_2, the above equations give

    A = [ 1   1   1 ]   [ 1   0    0 ] [ 1   1    1   ]
        [ 3   6   4 ] = [ 3   1    0 ] [ 0   3    1   ] ≡ LU .
        [ 1   2   1 ]   [ 1   1/3  1 ] [ 0   0   −1/3 ]

This completes an LU factorization of A. We can easily write out the above entire
LU factorization process directly for the system (5.3) of linear algebraic equations.

Using this factorization, solving the system Ax = b is equivalent to solving the
system LUx = b, i.e., Lc = b followed by Ux = c. Forward substitution,

    Lc = b ≡ [ 1   0    0 ] [ c_1 ]   [  0   ]
             [ 3   1    0 ] [ c_2 ] = [  2   ] ,
             [ 1   1/3  1 ] [ c_3 ]   [ −1/3 ]

gives

    [ c_1 ]   [  0 ]
    [ c_2 ] = [  2 ] .
    [ c_3 ]   [ −1 ]

Then back substitution,

    Ux = c ≡ [ 1   1    1   ] [ x_1 ]   [  0 ]
             [ 0   3    1   ] [ x_2 ] = [  2 ] ,
             [ 0   0   −1/3 ] [ x_3 ]   [ −1 ]

gives

    [ x_1 ]   [ −8/3 ]
    [ x_2 ] = [ −1/3 ] .
    [ x_3 ]   [  3   ]

⁸ One can write the Gaussian elimination and the LU factorisation in parallel.
Example 5.2. Use Gaussian elimination to solve the following two 3×3 systems,
and write down the LU factorisations in parallel:

    [ 1   0   2 ] [ x_1 ]   [ 0 ]
    [ 3   4   1 ] [ x_2 ] = [ 1 ]        (5.4)
    [ 1   3   4 ] [ x_3 ]   [ 1 ]

and

    [ 3   2   1 ] [ x_1 ]   [ 1 ]
    [ 3   4   3 ] [ x_2 ] = [ 2 ] .      (5.5)
    [ 1   2   1 ] [ x_3 ]   [ 1 ]
For a general n×n matrix, denote by ã_ij the entries of the working matrix at the
kth stage of the elimination, with leading principal submatrices

          [ ã_11   ···   ã_1k ]
    A_k = [  ···              ] ,   k = 1, 2, ···, n .
          [ ã_k1   ···   ã_kk ]

At the kth step, the entries below the diagonal of column k are eliminated by the
row operations

    row i :   ã_ij ← ã_ij − (ã_ik/ã_kk) ã_kj ,   j = k, ···, n ,   i = k+1, ···, n ,

and the corresponding elementary matrix L_k agrees with the identity except in
column k, where its entries in rows i = k+1, ···, n are −ã_ik/ã_kk .
LU factorization of the n×n matrix A (in Fortran-style pseudocode; the factors
overwrite A, with the multipliers of L stored below the diagonal):

      do 40 k=1,n-1
        do 10 i=k+1,n
          a(i,k) = a(i,k)/a(k,k)
   10   continue
        do 30 j=k+1,n
          do 20 i=k+1,n
            a(i,j) = a(i,j) - a(i,k)*a(k,j)
   20     continue
   30   continue
   40 continue
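The same loops transcribed into Python (0-based indexing). The 4×4 matrix used in the check is an illustrative nonsingular example.

```python
def lu_in_place(a):
    """Overwrite a with its LU factorization: multipliers of L below the
    diagonal, U on and above the diagonal (no pivoting)."""
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                  # store the multiplier l_ik
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]    # update the trailing block
    return a

a = lu_in_place([[ 6.0,  -2.0, 2.0,   4.0],
                 [12.0,  -8.0, 6.0,  10.0],
                 [ 3.0, -13.0, 9.0,   3.0],
                 [-6.0,   4.0, 1.0, -18.0]])
```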
Example. Find the LU factorization of the coefficient matrix of the system

    [  6   −2   2    4  ] [ x_1 ]   [ 12 ]
    [ 12   −8   6   10  ] [ x_2 ] = [ 34 ] .
    [  3  −13   9    3  ] [ x_3 ]   [ 27 ]
    [ −6    4   1  −18  ] [ x_4 ]   [ 38 ]
Solution. Let

              [  6   −2   2    4  ]
    A_1 = A = [ 12   −8   6   10  ] ;
              [  3  −13   9    3  ]
              [ −6    4   1  −18  ]

then with

          [  1     0   0   0 ]
    L_1 = [ −2     1   0   0 ] ,
          [ −1/2   0   1   0 ]
          [  1     0   0   1 ]

we get

                    [ 6    −2   2    4  ]
    A_2 = L_1 A_1 = [ 0    −4   2    2  ] .
                    [ 0   −12   8    1  ]
                    [ 0     2   3  −14  ]

Next, with

          [ 1    0    0   0 ]
    L_2 = [ 0    1    0   0 ] ,
          [ 0   −3    1   0 ]
          [ 0   1/2   0   1 ]

we obtain

                    [ 6   −2   2    4  ]
    A_3 = L_2 A_2 = [ 0   −4   2    2  ] ,
                    [ 0    0   2   −5  ]
                    [ 0    0   4  −13  ]

and with

          [ 1   0    0   0 ]
    L_3 = [ 0   1    0   0 ]
          [ 0   0    1   0 ]
          [ 0   0   −2   1 ]

we arrive at

                    [ 6   −2   2    4 ]
    A_4 = L_3 A_3 = [ 0   −4   2    2 ] .
                    [ 0    0   2   −5 ]
                    [ 0    0   0   −3 ]

Now we know

    U = A_4 = L_3 L_2 L_1 A .
This implies A = L_1^{-1} L_2^{-1} L_3^{-1} U, and

    L = L_1^{-1} L_2^{-1} L_3^{-1}

      = [ 1     0   0   0 ] [ 1     0    0   0 ] [ 1   0   0   0 ]
        [ 2     1   0   0 ] [ 0     1    0   0 ] [ 0   1   0   0 ]
        [ 1/2   0   1   0 ] [ 0     3    1   0 ] [ 0   0   1   0 ]
        [ −1    0   0   1 ] [ 0   −1/2   0   1 ] [ 0   0   2   1 ]

      = [  1      0    0   0 ]
        [  2      1    0   0 ] ,
        [ 1/2     3    1   0 ]
        [ −1    −1/2   2   1 ]

so that

    A = [  6   −2   2    4  ]   [  1      0    0   0 ] [ 6   −2   2    4 ]
        [ 12   −8   6   10  ] = [  2      1    0   0 ] [ 0   −4   2    2 ] = LU .
        [  3  −13   9    3  ]   [ 1/2     3    1   0 ] [ 0    0   2   −5 ]
        [ −6    4   1  −18  ]   [ −1    −1/2   2   1 ] [ 0    0   0   −3 ]

Using the LU factorization, one can easily find the desired solution of the given
system. (Please practice!)
5.6.2 LDU factorization
The LU factorization of a matrix may not be unique. Next we show how to obtain a
unique factorization. Suppose we have an LU factorization of A:

    A = L̃ Ũ .

Letting D = diag(Ũ) and scaling the rows of Ũ accordingly, we can further
factorize A as

    A = L D U ,

where L and U are lower and upper triangular matrices, respectively, both
with 1 as their diagonal entries, and D is a diagonal matrix.
The advantage of the LDU factorization is its uniqueness. Next, we show that the
LDU factorization is unique. That is, if we have
A = L1 D1 U1 = L2 D2 U2 ,
where Li are lower triangular matrices with 1 on the main diagonal, Di are diagonal
matrices, and Ui are upper triangular matrices with 1 on the main diagonal, then
L1 = L2 , D1 = D2 , and U1 = U2 .
From the assumption, we have

    L_2^{-1} L_1 D_1 = D_2 U_2 U_1^{-1} .

Let L ≡ L_2^{-1} L_1 and U ≡ U_2 U_1^{-1}. Then we have

    L D_1 = D_2 U .        (5.6)
Here L is lower triangular and U is upper triangular, both with unit diagonal, so

            [ (D_1)_11    0    ···      0     ]            [ (D_2)_11    *    ···      *     ]
    L D_1 = [    *       ···                  ] ,  D_2 U = [    0       ···                  ] ,
            [    *        *    ···  (D_1)_nn  ]            [    0        0    ···  (D_2)_nn  ]

where the * denote possibly nonzero entries. Comparing the entries on the left- and
right-hand sides of (5.6), we conclude that (D_1)_ii = (D_2)_ii for all i, i.e.,
D_1 = D_2. Moreover, all the * entries must be equal to zero. In particular, this means
that both U and L are nothing but the identity matrix I. Since I = L ≡ L_2^{-1} L_1, we
have L_1 = L_2. Similarly, U_1 = U_2. □
Remark 5.1. Please refer to the Tutorial Notes and Assignments for more details
about how to find the factorization of the form A = L D U . Try the example in the
previous subsection.
5.6.3 Cholesky factorization from LDU factorization

Now let A be an SPD matrix. By symmetry and the uniqueness of the LDU
factorization, we have U = L^T, i.e., A = L D L^T.
Next we claim that all diagonal entries of D are positive. (Remember that since
D is diagonal, all its off-diagonal entries are zero.) In fact, for each i, let
x_i = (L^T)^{-1} e_i, where the e_i are the canonical basis vectors; then we have

    0 < x_i^T A x_i = e_i^T ((L^T)^{-1})^T L D L^T (L^T)^{-1} e_i = e_i^T D e_i = D_ii .

Thus D_ii > 0 for all i, and hence we can write D = D^{1/2} D^{1/2}, where D^{1/2} is a
diagonal matrix with diagonal entries √D_ii. Then

    A = L D^{1/2} D^{1/2} L^T = ( L D^{1/2} )( L D^{1/2} )^T ≡ L̃ L̃^T ,

where of course L̃ = L D^{1/2} is a lower triangular matrix.
Remark 5.2. Note that the diagonal entries of L̃ in the Cholesky factorization need
not be 1, unlike the entries of L in the LU factorization.
Remark 5.3. Please check the Tutorial Notes for more details about how to find the
factorization of the form A = L LT .
5.7 LU factorization with partial pivoting

5.7.1 The necessity of pivoting

Recall that at the kth step of the LU factorization, the entries are updated by

    row i :   ã_ij ← ã_ij − (ã_ik/ã_kk) ã_kj ,   i, j = k+1, ···, n .
From this we see that the elimination process can continue only when the diagonal
entry ã_kk is nonzero, as it is used as a divisor. If no diagonal entry is zero at any
step, the LU factorization can be carried through to completion. If the diagonal entry
is zero at a certain step, the algorithm cannot continue. For example, this is the
case with the matrix

    A = [ 0   1 ] .
        [ 1   0 ]
Sometimes, even when the diagonal entry is not zero but very small, one also runs
into trouble. For example, consider solving the following simple system, where
ε > 0 is very small:

    [ ε   1 ] [ x_1 ]   [ 1 ]
    [ 1   1 ] [ x_2 ] = [ 2 ] .        (5.7)

The exact solution is

    x_1 = 1/(1 − ε) ≈ 1 ,    x_2 = (1 − 2ε)/(1 − ε) ≈ 1 .
But if we use Gaussian elimination, we may have a problem. We have

    A_1 = [ ε   1 ] ,    L̃_1 = [   1     0 ] ,
          [ 1   1 ]            [ −1/ε    1 ]

then

    A_2 = L̃_1 A_1 = [ ε      1     ] .
                    [ 0   1 − 1/ε  ]

This gives

    L = L̃_1^{-1} = [  1     0 ] ,    U = [ ε      1     ] .
                   [ 1/ε    1 ]          [ 0   1 − 1/ε  ]

Solving Lc = b and then Ux = c, we get

    [ ε      1     ] [ x_1 ]   [   1     ]
    [ 0   1 − 1/ε  ] [ x_2 ] = [ 2 − 1/ε ] ,

so that, in finite precision arithmetic where 1 − 1/ε and 2 − 1/ε are both rounded
to −1/ε for tiny ε,

    x_2 = (2 − 1/ε)/(1 − 1/ε) ≈ 1 ,    x_1 = (1 − x_2)/ε ≈ 0 ,

and the computed x_1 is completely wrong.
As a concrete numerical illustration, consider the system

    [ 0.0001    0.5 ] [ x_1 ]   [ 0.5 ]
    [ 0.4      −0.3 ] [ x_2 ] = [ 0.1 ] ,        (5.8)

whose exact solution is close to (1, 1)^T. Eliminating with the pivot 0.0001 (in
arithmetic carrying about five significant digits) gives

    −0.3 − (0.4/0.0001) × 0.5 = −2000.3 ≈ −2000 ,

so the system becomes

    [ 0.0001    0.5   ] [ x_1 ]   [   0.5  ]
    [ 0       −2000   ] [ x_2 ] = [ −2000  ] .        (5.9)

We see that the information of a_22 = −0.3 in (5.8) is completely wiped out by the
elimination: it is as if the original matrix in (5.8) had started out with a_22 = 0,
for one would get the same matrix as in (5.9) after the elimination.
Solving the upper triangular system (5.9), we have

    x_2 = 1 ,    x_1 = (0.5 − 0.5 x_2)/0.0001 = 0 .

We see that the solution, especially x_1, is totally wrong. Thus if there are small pivots,
such as the (1,1)-entry 0.0001 in the current example, then Gaussian elimination may
give very inaccurate results.
The remedy is to use partial pivoting, i.e., we permute the rows so that the largest
entry in magnitude in the column becomes the pivot. Let us see what happens if we
use partial pivoting for system (5.8): we permute the rows so that the largest entry
of the first column (i.e., 0.4) moves to the first row:

    [ 0.4      −0.3 ] [ x_1 ]   [ 0.1 ]
    [ 0.0001    0.5 ] [ x_2 ] = [ 0.5 ] .        (5.10)

Notice that the permutation must be applied to the entire rows, including the
right-hand side. Then the Gaussian elimination of (5.10) becomes

    [ 1             0 ] [ 0.4      −0.3 ] [ x_1 ]   [ 1             0 ] [ 0.1 ]
    [ −0.0001/0.4   1 ] [ 0.0001    0.5 ] [ x_2 ] = [ −0.0001/0.4   1 ] [ 0.5 ] .

Simplifying it, we have

    [ 0.4   −0.3    ] [ x_1 ]   [ 0.1    ]
    [ 0      0.5001 ] [ x_2 ] = [ 0.5000 ] .

Hence the solution is

    x_2 = 0.9998 ,    x_1 = (0.1 + 0.3 x_2)/0.4 = 0.9999 ,

which is accurate.
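This effect can be reproduced in code. Since Python floats carry about 16 significant digits rather than five, the pivot must be much smaller than 0.0001 to trigger the same failure; the value 1e-17 below is chosen for illustration.

```python
def solve2(a11, a12, a21, a22, b1, b2):
    """Gaussian elimination on a 2x2 system with a11 as the pivot."""
    m = a21 / a11
    a22p = a22 - m * a12        # eliminated (2,2) entry
    b2p = b2 - m * b1
    x2 = b2p / a22p
    x1 = (b1 - a12 * x2) / a11  # back substitution
    return x1, x2

# system [[1e-17, 0.5], [0.4, -0.3]] x = (0.5, 0.1), solution ~ (1, 1)
# without pivoting: the tiny pivot wipes out a22 and b2, destroying x1
x1_bad, x2_bad = solve2(1e-17, 0.5, 0.4, -0.3, 0.5, 0.1)
# with partial pivoting: rows swapped so that 0.4 is the pivot
x1_ok, x2_ok = solve2(0.4, -0.3, 1e-17, 0.5, 0.1, 0.5)
```

Without pivoting the computed x_1 comes out as 0; with pivoting both components are accurate.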
5.7.2 Partial pivoting

In the last subsection we saw that when the diagonal entry becomes too small at a
certain stage of the LU factorization, continuing the factorization may cause serious
problems. In this case, we should apply some special strategy before continuing
the factorization. The following pivoting is one such strategy:

    At the kth stage of the LU factorization, suppose the matrix A has
    become A_k = (a_ij^(k)). Determine an index p_k for which |a_{p_k,k}^(k)|
    is largest among all |a_{jk}^(k)| for k ≤ j ≤ n, and interchange rows
    k and p_k before proceeding to the next step of the factorization.
With the pivoting, the LU factorization process takes the form

    U = L_{n−1} P_{n−1} ··· L_2 P_2 L_1 P_1 A ,

where the P_k are the matrices for the row interchanges.
To carry out the LU factorization with pivoting, we first introduce an elementary
matrix for interchanging. Let P_{i,j} be the matrix obtained by interchanging rows i
and j of the identity matrix; then one can easily check that P_{i,j} A is the matrix
obtained by interchanging rows i and j of A. For instance,

                [ 0   0   1   ···   0 ] [ a_11   a_12   ···   a_1n ]   [ a_31   a_32   ···   a_3n ]
    P_{1,3} A = [ 0   1   0   ···   0 ] [ a_21   a_22   ···   a_2n ] = [ a_21   a_22   ···   a_2n ] .
                [ 1   0   0   ···   0 ] [ a_31   a_32   ···   a_3n ]   [ a_11   a_12   ···   a_1n ]
                [          ···        ] [           ···             ]   [           ···            ]
                [ 0   0   0   ···   1 ] [ a_n1   a_n2   ···   a_nn ]   [ a_n1   a_n2   ···   a_nn ]
The matrix P_{i,j} is called an elementary permutation matrix, and we can verify that

    P_{i,j}^{-1} = P_{i,j}^T = P_{i,j} .
Example. Solve the following system using the LU factorization with pivoting:

    [ 1   1   0   3 ] [ x_1 ]   [  4 ]
    [ 1   0   3   1 ] [ x_2 ] = [  0 ] .
    [ 0   1   1   1 ] [ x_3 ]   [  3 ]
    [ 3   0   1   2 ] [ x_4 ]   [ −1 ]

Solution.

Step 1. Permute rows 1 and 4 using P_{1,4} (the pivot of the first column is 3),
then eliminate the first column:

    [ 3   0     1      2  ] [ x_1 ]   [  −1  ]
    [ 0   0    8/3    1/3 ] [ x_2 ] = [  1/3 ] .
    [ 0   1     1      1  ] [ x_3 ]   [   3  ]
    [ 0   1   −1/3    7/3 ] [ x_4 ]   [ 13/3 ]

Step 2. Permute rows 2 and 3 using P_{2,3}, then eliminate the second column:

    [ 3   0     1      2  ] [ x_1 ]   [ −1  ]
    [ 0   1     1      1  ] [ x_2 ] = [  3  ] .
    [ 0   0    8/3    1/3 ] [ x_3 ]   [ 1/3 ]
    [ 0   0   −4/3    4/3 ] [ x_4 ]   [ 4/3 ]

Step 3. Eliminate the third column (the pivot 8/3 is already the largest, so no
interchange is needed):

    [ 3   0    1     2  ] [ x_1 ]   [ −1  ]
    [ 0   1    1     1  ] [ x_2 ] = [  3  ] .
    [ 0   0   8/3   1/3 ] [ x_3 ]   [ 1/3 ]
    [ 0   0    0    3/2 ] [ x_4 ]   [ 3/2 ]

Back substitution now gives x_4 = 1, x_3 = 0, x_2 = 2, x_1 = −1, i.e.,

        [ −1 ]
    x = [  2 ] .
        [  0 ]
        [  1 ]
Matrix-forms. It is interesting to write the above Gaussian elimination process in
matrix form. We can do this as follows. Let

        [ 1   1   0   3 ]
    A = [ 1   0   3   1 ] .
        [ 0   1   1   1 ]
        [ 3   0   1   2 ]

Then

                [ 3   0   1   2 ]
    P_{1,4} A = [ 1   0   3   1 ] ,
                [ 0   1   1   1 ]
                [ 1   1   0   3 ]

                      [ 3   0     1      2  ]          [  1     0   0   0 ]
    L_1 (P_{1,4} A) = [ 0   0    8/3    1/3 ] ,  L_1 = [ −1/3   1   0   0 ] ,
                      [ 0   1     1      1  ]          [  0     0   1   0 ]
                      [ 0   1   −1/3    7/3 ]          [ −1/3   0   0   1 ]

                              [ 3   0     1      2  ]
    P_{2,3} (L_1 P_{1,4} A) = [ 0   1     1      1  ] ,
                              [ 0   0    8/3    1/3 ]
                              [ 0   1   −1/3    7/3 ]

                                  [ 3   0     1      2  ]          [ 1    0   0   0 ]
    L_2 (P_{2,3} L_1 P_{1,4} A) = [ 0   1     1      1  ] ,  L_2 = [ 0    1   0   0 ] ,
                                  [ 0   0    8/3    1/3 ]          [ 0    0   1   0 ]
                                  [ 0   0   −4/3    4/3 ]          [ 0   −1   0   1 ]

and finally

                                      [ 3   0    1     2  ]          [ 1   0    0    0 ]
    L_3 (L_2 P_{2,3} L_1 P_{1,4} A) = [ 0   1    1     1  ] ≡ U ,  L_3 = [ 0   1    0    0 ] .
                                      [ 0   0   8/3   1/3 ]          [ 0   0    1    0 ]
                                      [ 0   0    0    3/2 ]          [ 0   0   1/2   1 ]

This gives the LU factorization

    A = P_{1,4}^{-1} L_1^{-1} P_{2,3}^{-1} L_2^{-1} L_3^{-1} U ,

where each P^{-1} equals P itself, and each L_i^{-1} is obtained from L_i by flipping
the signs of its off-diagonal entries.
5.8 Hessenberg and tridiagonal matrices

A matrix A is called upper Hessenberg if

    a_{ij} = 0  whenever  i > j + 1 ,

that is, A is of the form

        [ a_{11}  a_{12}  a_{13}  a_{14}  ...  a_{1,n-1}  a_{1n} ]
        [ a_{21}  a_{22}  a_{23}  a_{24}  ...  a_{2,n-1}  a_{2n} ]
    A = [ 0       a_{32}  a_{33}  a_{34}  ...  a_{3,n-1}  a_{3n} ]
        [ 0       0       a_{43}  a_{44}  ...  a_{4,n-1}  a_{4n} ]
        [ 0       0       0       a_{54}  ...  a_{5,n-1}  a_{5n} ]
        [ ...                                                    ]

For example, the matrices

    [ 2  1  0 ]      [ 1  2  3  4 ]
    [ 1  2  3 ] ,    [ 2  3  4  5 ]
    [ 0  0  1 ]      [ 0  4  5  6 ]
                     [ 0  0  6  7 ]

are upper Hessenberg. Work out the LU factorization of an upper Hessenberg matrix!
A matrix A is called tridiagonal if

    a_{ij} = 0  whenever  |j - i| > 1 ,

that is, A is of the form

        [ a_{11}  a_{12}  0       0       ...  0          0      ]
        [ a_{21}  a_{22}  a_{23}  0       ...  0          0      ]
    A = [ 0       a_{32}  a_{33}  a_{34}  ...  0          0      ]
        [ 0       0       a_{43}  a_{44}  ...  0          0      ]
        [ ...                                  ...               ]
        [ 0       0       0       0       ...  a_{n,n-1}  a_{nn} ]
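For a tridiagonal matrix, the LU factorization and the two triangular solves need only O(n) operations, since each elimination step touches a single sub-diagonal entry. A minimal Python sketch (not from the notes; it assumes no pivoting is required, e.g. for a diagonally dominant matrix):

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c and right-hand side d, by LU factorization without
    pivoting (valid e.g. when the matrix is diagonally dominant)."""
    n = len(b)
    beta = np.array(b, dtype=float)   # diagonal of U
    y = np.array(d, dtype=float)      # forward-substituted right-hand side
    for i in range(1, n):
        m = a[i - 1] / beta[i - 1]    # multiplier l_{i,i-1}
        beta[i] -= m * c[i - 1]       # eliminate the sub-diagonal entry
        y[i] -= m * y[i - 1]
    x = np.empty(n)
    x[-1] = y[-1] / beta[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        x[i] = (y[i] - c[i] * x[i + 1]) / beta[i]
    return x

# a 4-by-4 diagonally dominant example with exact solution (1, 1, 1, 1)
x = thomas_solve([1.0, 1.0, 1.0], [4.0, 4.0, 4.0, 4.0],
                 [1.0, 1.0, 1.0], [5.0, 6.0, 6.0, 5.0])
print(x)
```

The total work is about 3n multiplications/divisions, compared with O(n^3) for a full Gaussian elimination.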
5.9 Non-square systems and least-squares solutions

Up to now, all the methods that we have studied for solving the linear system

    Ax = b    (5.11)

apply only when the coefficient matrix A is a square matrix. But in many applications,
we also encounter non-square matrices A. In this section we shall discuss how
to solve such non-square systems.
Assume that A is an m x n matrix, with m > n. Clearly the system (5.11) is
unlikely to have a solution, as the number of equations is larger than the number of
unknowns.
Let us first study when this system has a solution x. For this we write A columnwise as

    A = (a_1, a_2, ..., a_n) ,

then we can write (5.11) as

    b = x_1 a_1 + x_2 a_2 + ... + x_n a_n .

So we conclude:

    The system (5.11) has a solution only when b lies in the
    subspace spanned by the column vectors of A.
Next we shall study the more general case when b does not lie in the subspace
spanned by the column vectors of A. In this case the system (5.11) does not have a
solution. In physical or engineering applications, it is often acceptable to find
some solution x that minimizes the error (Ax - b) in a certain sense. We will look for
one such type of solution, the least-squares solution, which is a vector x
that minimizes the error (Ax - b) in the Euclidean norm, namely solves

    min_{x in R^n} ||Ax - b||_2 .    (5.12)

Writing f(x) = ||Ax - b||_2^2, a minimizer x of (5.12) satisfies, for every y in R^n,

    d/dt f(x + t y) |_{t=0} = 0 .

This reduces to

    (Ax - b, Ay) = 0  for all y in R^n,    (5.13)

that is, to the normal equations

    A^T A x = A^T b .    (5.14)

When the columns of A are linearly independent, A^T A is symmetric positive definite,
so (5.14) can be solved by the Cholesky factorization A^T A = L L^T: first solve
L c = A^T b by forward substitution, then solve

    L^T x = c .
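The route through the normal equations and the Cholesky factorization can be sketched as follows; this is a small numpy illustration with made-up data (the `lstsq` call is used only as an independent cross-check).

```python
import numpy as np

# Least-squares solution of an overdetermined system via the normal
# equations A^T A x = A^T b, solved by Cholesky: A^T A = L L^T.
# Assumes A has linearly independent columns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])          # 3 equations, 2 unknowns
b = np.array([1.0, 2.0, 2.0])

L = np.linalg.cholesky(A.T @ A)     # A^T A = L L^T
c = np.linalg.solve(L, A.T @ b)     # forward solve:  L c = A^T b
x = np.linalg.solve(L.T, c)         # back solve:     L^T x = c

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # prints True
```

In practice, solving least-squares problems through A^T A can square the condition number; QR-based methods (as used by `lstsq`) are more robust, but the Cholesky route matches the derivation above.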
6 Floating-point arithmetic

6.1 Decimal and binary numbers
We are used to the decimal system in our daily life, but most computers adopt the
binary system, which uses 2 as the base instead of 10. In the decimal system, we use
the 10 digits 0, 1, 2, ..., 9, and write any number in powers of 10. Consider the
number 538.372, which is equivalent to

    538.372 = 5 x 10^2 + 3 x 10^1 + 8 x 10^0 + 3 x 10^{-1} + 7 x 10^{-2} + 2 x 10^{-3} .

Similarly, for the number pi, which is

    pi = 3.14159 26535 89793 23846 26433 8 ... ,

the last digit 8 shown stands for 8 x 10^{-26}.
In the binary system, one uses only two digits, 0 and 1. For example, we can write
the real number 9.90625 in the binary form

    (1001.11101)_2 = 1 x 2^3 + 0 x 2^2 + 0 x 2^1 + 1 x 2^0 + 1 x 2^{-1} + 1 x 2^{-2} + 1 x 2^{-3} + 0 x 2^{-4} + 1 x 2^{-5} .
Computers communicate with human users in the decimal system but work
internally in the binary system, so conversions take place inside the computer. Though
users do not need to worry about these conversions, they should know that
small roundoff errors are almost always involved in them.
Computers can only operate on real numbers expressed in a fixed number of
digits. The finite word length of a computer restricts the precision with which a real
number can be represented. Even a simple number like 1/10 cannot be stored exactly in
any binary computer, as its exact expression needs an infinite number of binary digits:

    1/10 = (0.0001 1001 1001 1001 ...)_2 .
For example, if we read 1/10 = 0.1 into a 32-bit computer and then print it out to
40 decimal places, we obtain the following result:
0.10000 00014 90116 11938 47656 25000 00000 00000 .
From this we see that there is some error caused during the decimal-binary conversion.
Two conversion errors are involved here:
from decimal to binary, and from binary to decimal.
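This behaviour is easy to reproduce; the following small illustration uses `np.float32` to model the 32-bit numbers discussed here.

```python
import numpy as np

# Read the decimal number 0.1 into a 32-bit (single precision) variable and
# print it back to 40 decimal places: the decimal-binary conversion error
# described above becomes visible.
x = np.float32(0.1)
print(f"{float(x):.40f}")
# 0.1000000014901161193847656250000000000000
```

The printed digits are exactly the decimal expansion of the binary number actually stored, which is the machine number nearest to 0.1.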
6.2 Rounding errors

When a real number is rounded to n decimal places, the resulting rounding error is
at most

    (1/2) x 10^{-n} .    (6.1)
6.3 Normalized scientific notation

In the decimal system, one can express any real number in normalized scientific
notation. For example, we can write

    2048.6076 = 0.20486076 x 10^4 ,    0.0004321 = 0.4321 x 10^{-3} .

In general, we can express a nonzero real number a in the normalized form

    a = +- r x 10^n    (6.2)

where r is a real number in the range 0.1 <= r < 1 and n is an integer (positive,
negative or zero).
A real number like 0.000259 is called a fixed-point number. Its floating-point
representation of the form (6.2) is 0.259 x 10^{-3}. The advantage of the floating-point
representation is that it can express numbers of hugely different magnitudes.
For example, if we are allowed to use only a fixed number of digits, say 8 digits with
7 after the decimal point, then the largest number we can express is 9.9999999 < 10
and the smallest is 0.0000001 = 10^{-7}. But if we use 8 digits to represent a
floating-point number, the number can range in magnitude from 10^{-99} to 10^{99}.
In the binary system, a nonzero real number a can similarly be written in the
normalized floating-point form

    a = +- q x 2^m ,

where 1/2 <= q < 1 and m is an integer. The number q is called the mantissa and the
integer m is called the exponent; both q and m are base-2 numbers. Consider
the floating-point binary number

    +0.10111010 x 2^4 .

We may shift the binary point 4 places to the right and get the binary number
1011.1010, i.e. the decimal number 11.625. Similarly, for the floating-point binary number

    +0.10111010 x 2^{-1} ,

we shift the binary point 1 place to the left and get the binary number +0.010111010.
The accuracy of a number represented by a computer depends on the word length
of the computer. Most modern PCs have a word length of 32 bits, while most modern
workstations and high-performance supercomputers have a word length of 64 bits. Let
us take a computer with a word length of 32 bits (binary digits) to discuss the accuracy
in more detail. The floating-point representation of a single-precision real number
is divided into three sections:

1. The first section contains one bit for the sign of the mantissa q;

2. The second section contains 8 bits for the exponent m;

3. The last section contains 23 bits for the mantissa q.
To save storage, a real number a = +- q x 2^m can be expressed as a normalized
binary number such that the first nonzero bit (which must be 1) in the mantissa is just
before the binary point, that is, q = (1.f)_2. So the first bit (it is always 1) does not
need to be stored. The mantissa is then in the range 1 <= q < 2, and the remaining 23
bits can be used to store the 23 bits of f. In this way, the computer effectively has a
24-bit mantissa for its floating-point numbers.
In summary, nonzero normalized machine numbers (in a computer with word
length of 32 bits) are strings of bits whose values are decoded in the
following normalized floating-point form:

    a = (-1)^s q x 2^{(-1)^p m}    (6.3)

where

    s, p = 0 or 1 ,    q = (1.f)_2 .

Note that we have 1 <= q < 2 and the most significant bit in q is 1 and is not stored;
s is the bit expressing the sign of a (positive: bit 0; negative: bit 1); p is the bit
expressing the sign of the exponent; m is the 7-bit exponent, and f is the 23-bit
fractional part of a, together with an implicit leading bit 1. That is,

    (00...00)_2 <= f <= (11...11)_2  (23 bits),    (00...00)_2 <= m <= (11...11)_2  (7 bits).
For instance, the bit string with s = 0, f = 00...00 (23 zeros), p = 1 and
m = 1111111, namely

    0 | 00...00 | 1 | 1111111 ,

represents the machine number (1.00...0)_2 x 2^{-127} = 2^{-127}.
A real number which can be represented in the normalized floating-point form (6.3)
is called a machine number of the computer with word length of 32 bits. A machine
number can be represented exactly by the computer, but this is not the case for most
real numbers. When a real number is not a machine number and serves as an input
datum or as the result of a computation, a representation error arises, and the number
is approximated by a nearest machine number.

Noting that m has 7 binary digits, the largest m is 127. From this, we know
that a computer with word length of 32 bits can handle numbers as small as

    (1.f)_2 x 2^{-127} >= 2^{-127} ~ 6.0 x 10^{-39}

and as large as

    (1.f)_2 x 2^{127} <= (2 - 2^{-23}) x 2^{127} ~ 3.4 x 10^{38} .
This is certainly not sufficient for some scientific computations. When this
happens, we must write a program in double-precision or extended-precision
arithmetic. A floating-point number in double precision occupies two computer words,
and the mantissa usually has at least twice as many bits. Hence there are roughly
twice as many decimal places of accuracy in double precision as in single precision.
But double-precision calculations can be much slower than single-precision ones,
often by a factor of 2 or more. This happens mainly when double-precision
arithmetic is done in software while single-precision arithmetic is done by the
hardware.
6.4 Precision of machine numbers
The restriction that the mantissa part occupies no more than 23 bits means that the
machine numbers have a limited precision of roughly 6 decimal places, since the least
significant bit in the mantissa represents units of 2^{-23}, and 2^{-23} ~ 1.2 x 10^{-7}
< 10^{-6}. Numbers expressed with more than 6 decimal digits will thus be approximated
by machine numbers.
The second smallest positive number that can be represented in 32 bits is

    a_min = (-1)^0 x (1.00...001)_2 x 2^{-1111111_2} = (1 + 2^{-23}) x 2^{-127} ,

with 22 zeros after the binary point. The maximum relative error that one can make in
truncating (or rounding) a number a lying between the neighbouring machine numbers
2^{-127} and a_min is

    (a_min - a)/a <= (0.00...001)_2 x 2^{-127} / (1 x 2^{-127}) = 2^{-23} ~ 10^{-6.9} .

This means that the number of significant digits retained is roughly 7 when
one truncates a very small number.
The largest number that can be represented in 32 bits is

    a_max = (-1)^0 x (1.11...1)_2 x 2^{127} ~ 2^{128} ,

with 23 ones after the binary point. The maximum relative error made in truncating a
number a near a_max is

    (0.00...01)_2 x 2^{127} / ((1.11...11)_2 x 2^{127}) ~ 2^{-24} ~ 10^{-7.2} .

That again means that the number of significant digits retained is roughly 7
when one truncates a very large number.
In general, the number of significant decimal digits retained when one truncates any
number in the 32-bit representation is about 7 (because we use 23 bits for the mantissa
and 2^{-24} ~ 10^{-7.2}). The number 2^{-24} is called the unit roundoff error or
machine precision. Since it is usually denoted by eps_M, it is also called the machine
epsilon. We see that the unit roundoff error depends only on the length of the mantissa.

In 64-bit machines, we use 52 bits for the mantissa, and hence the accuracy is
about 2^{-52} ~ 10^{-15.6}, i.e. about 16 digits of accuracy.
An integer can use all bits of the computer word in its representation except for
a single bit reserved for the sign. So a computer with word length of 32 bits can
handle the integers from -(2^{31} - 1) to 2^{31} - 1 = 2147483647 ~ 2 x 10^9 .
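These machine parameters can be queried directly; the following is a numpy illustration. (Note that actual IEEE single precision stores a biased 8-bit exponent rather than the sign-and-magnitude exponent described in these notes, so the extreme exponents differ slightly from +-127, but the mantissa length and integer range are as stated.)

```python
import numpy as np

f32 = np.finfo(np.float32)
print(f32.eps)                  # 2**-23, the spacing of floats just above 1.0
                                # (the unit roundoff 2**-24 is half of this)
print(np.iinfo(np.int32).max)   # 2**31 - 1 = 2147483647
```

`finfo` also reports the extreme magnitudes (`f32.tiny`, `f32.max`), which are close to the 6.0 x 10^{-39} and 3.4 x 10^{38} computed above.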
6.5 Machine rounding

In addition to the rounding of input data, rounding is also needed after most arithmetic
operations. The result of an arithmetic operation may reside in a long (e.g. 80-bit)
computer register and must be rounded to single precision before being placed in
memory; similarly for double-precision calculations. The usual rounding mode is
rounding to nearest: the closer of the two machine numbers to the left and right of the
real number is selected. The resulting relative roundoff error is then less than 2^{-24},
and 2^{-24} is called the unit roundoff error.
6.6 Floating-point arithmetic

6.7 Errors in floating-point operations

Given a real number x, let fl(x) be the floating-point representation of x, i.e. fl(x)
is a machine representable number closest to x. By the previous discussions, we have

    | fl(x) - x | / |x| <= 2^{-t} = eps_M ,    (6.4)

where eps_M is the machine precision, or the machine unit roundoff error. Here t = 24
for 32-bit computers and t = 52 for 64-bit computers. Using (6.4), we can write

    fl(x) = x (1 + delta) ,    (6.5)

where

    |delta| <= eps_M .    (6.6)

Equations (6.5) and (6.6) form the basis of backward error analysis.
Now we consider the relative error from adding two floating-point binary numbers
a and b. We write

    a = a_1 x 2^m ,    b = b_1 x 2^n ,

where a_1 and b_1 are of the normalized form (0.1 c_1 c_2 ...)_2, and m and n can be
any integers. We assume m >= n; then b_1 should be shifted m - n places
to the right to line up the binary points. The results are then added, normalized and
rounded. We have two possibilities: a carry ("overflow") occurs to the left of the
binary point, or it does not. The first case yields

    1 <= | a_1 + b_1 2^{n-m} | < 2 ,

while the second case indicates

    1/2 <= | a_1 + b_1 2^{n-m} | < 1 .

For the overflow case, a right shift of one place is needed, and we then have

    fl(a + b) = { (a_1 + b_1 2^{n-m}) 2^{-1} + eps } 2^{m+1} ,

with eps being the roundoff error, |eps| <= eps_M. So we can further write

    fl(a + b) = (a + b) ( 1 + 2 eps / (a_1 + b_1 2^{n-m}) ) = (a + b)(1 + E)    (6.7)

with |E| <= 2 |eps| <= 2 eps_M.

For the case without overflow, we can further write

    fl(a + b) = { (a_1 + b_1 2^{n-m}) + eps } 2^m
              = (a + b) ( 1 + eps / (a_1 + b_1 2^{n-m}) ) = (a + b)(1 + E)    (6.8)
with |E| <= 2 eps_M. This gives the bound on the relative error from adding two
floating-point binary numbers a and b.

Now let us consider adding up k floating-point binary numbers

    c_1 + c_2 + ... + c_k .

To do so, we form the partial sums s_i = c_1 + c_2 + ... + c_i. Let s_1 = c_1; then
using (6.7) or (6.8), we derive

    s_2 = fl(s_1 + c_2) = (s_1 + c_2)(1 + E_1)

with |E_1| <= 2 eps_M. We can rewrite s_2 as

    s_2 = c_1 (1 + E_1) + c_2 (1 + E_1) .

Similarly,

    s_3 = fl(s_2 + c_3) = (s_2 + c_3)(1 + E_2)
        = c_1 (1 + E_1)(1 + E_2) + c_2 (1 + E_1)(1 + E_2) + c_3 (1 + E_2) .

Continuing this process, we come to

    s_k = fl(s_{k-1} + c_k) = (s_{k-1} + c_k)(1 + E_{k-1})
        = c_1 (1 + d_1) + c_2 (1 + d_2) + ... + c_k (1 + d_k) ,

where for i = 2, 3, ..., k we have

    1 + d_i = (1 + E_{i-1})(1 + E_i) ... (1 + E_{k-1})

and 1 + d_1 = 1 + d_2. Noting the uniform bound on all the E_i, we can estimate

    (1 - 2 eps_M)^{k-i+1} <= 1 + d_i <= (1 + 2 eps_M)^{k-i+1} .

Summarizing the above results, we conclude that

    fl( sum_{j=1}^k c_j ) = sum_{j=1}^k c_j (1 + E)

with

    E = ( c_1 d_1 + c_2 d_2 + ... + c_k d_k ) / ( c_1 + c_2 + ... + c_k ) .

This gives the bound on the relative error from adding k floating-point binary numbers.
When we know more concrete data about the computer we are using and the actual
numbers c_1, c_2, ..., c_k, we obtain from this estimate a clearer impression of the
detailed bound on the relative error.
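The accumulation of the errors E_i is easy to observe numerically. The following is a small single-precision illustration (the numbers are made up, not from the notes):

```python
import numpy as np

# Sum 100000 copies of 0.1 in single precision.  Each addition contributes
# a relative error of at most 2*eps_M, and by the analysis above the errors
# accumulate roughly in proportion to the number of additions.
s = np.float32(0.0)
for _ in range(100_000):
    s = np.float32(s + np.float32(0.1))   # fl(s + c) at every step

print(s)                      # noticeably different from 10000.0
print(abs(float(s) - 10000.0))  # the accumulated rounding error
```

Repeating the experiment in double precision (t = 52) reduces the error by roughly nine orders of magnitude, consistent with the bound depending on eps_M.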
We now study the sensitivity of linear systems to changes or errors in their
coefficients. Suppose our aim is to solve

    Ax = b ,

but because of measurement errors or the rounding errors of computers, we are in
fact solving the perturbed system

    Ã x̃ = b

instead of the real system Ax = b.

Next, we shall analyse the error (x̃ - x) between the true solution x and the
solution x̃ of the perturbed system.

Will this error be small when the errors in A are small? How will the error (x̃ - x)
depend on the errors in A?

In order to measure x̃ - x, we are going to introduce some quantities.
7.1 Norms of vectors and matrices

7.1.1 Vector norms

Why do we need norms for vectors and matrices? For the same reason that we need
absolute values for real numbers. When we say the number 1.23456 approximates the
number 1.234559 well, we know that it is true because

    |1.23456 - 1.234559| = 10^{-6}

is a small number. The same is true for vectors and matrices. When we use a method
to solve a problem where the true solution is a matrix A* or a vector x*, we say that
the method is a good method if it gives us a matrix A or a vector x such that the
distance between A and A* or between x and x* is small. The distance between them
is called the norm of their difference, denoted by

    ||A - A*||  or  ||x - x*|| .

The simplest norm that we have learned in secondary school is what we call the
Euclidean norm. For a vector x = (x_1, x_2, ..., x_n)^t, it is defined as

    ||x|| = sqrt( x_1^2 + x_2^2 + ... + x_n^2 ) .    (7.1)

You can easily check that it satisfies all the requirements to be a norm:

(i) ||x|| >= 0, and ||x|| = 0 if and only if x = 0.

(ii) ||alpha x|| = |alpha| ||x|| for all alpha in R.

(iii) ||x + y|| <= ||x|| + ||y|| for all x and y.
More generally, for any p >= 1, the p-norm of x is defined by

    ||x||_p = ( sum_{i=1}^n |x_i|^p )^{1/p} ,

with the limiting case

    ||x||_inf = max_{1<=i<=n} |x_i| ,

and ||x||_2 is nothing but the Euclidean norm. One can verify that all p-norms
satisfy the three requirements of a norm listed above. In Matlab, one can compute the
p-norm of a vector x by norm(x, p); and norm(x) gives the 2-norm of the vector.

Given two n-vectors y and z, we can now measure the distance between them by
computing ||y - z||_p. For any n-vector x and arbitrary 1 <= p <= q <= inf,
one can prove that

    ||x||_inf <= ||x||_q <= ||x||_p <= ||x||_1 <= n ||x||_inf .

Thus in particular, if ||y - z||_p is small for some p, then we expect ||y - z||_q to be
still small for all other q's.
7.1.2 Matrix norms

To measure the distance between two matrices, we need matrix norms. The simplest
matrix norm is obtained by considering the matrix as a sequence of numbers and
computing the Euclidean norm of this sequence. The resulting norm is called the
Frobenius norm. More precisely, if the (i,j)th entry of an m-by-n matrix A is a_{ij},
then

    ||A||_F = ( sum_{i=1}^m sum_{j=1}^n a_{ij}^2 )^{1/2} .

But more frequently, we prefer matrix norms that are induced from a vector p-norm.
The p-norm of a matrix is defined as

    ||A||_p = max_{||x||_p = 1} ||Ax||_p .    (7.2)

For p = 1, 2 and inf, these norms can be computed explicitly:

    ||A||_1   = max_{1<=j<=n} sum_{i=1}^m |a_{ij}| = maximum column sum,
    ||A||_2   = sqrt( lambda_max(A^T A) ) , where lambda_max is the largest eigenvalue of A^T A,
    ||A||_inf = max_{1<=i<=m} sum_{j=1}^n |a_{ij}| = maximum row sum.
As an example, if

    A = [ 1  3 ]
        [ 2  4 ] ,

then

    ||A||_F = sqrt(1 + 4 + 9 + 16) = sqrt(30) ,

    ||A||_2 = sqrt( lambda_max( [ 5  11 ; 11  25 ] ) ) ~ 5.465 .
You can easily verify from (7.2) that the induced p-norms satisfy all the norm
requirements listed earlier. Moreover, they also satisfy the submultiplicative
inequality

    ||AB||_p <= ||A||_p ||B||_p    (7.3)

for any matrices A and B of compatible sizes. For the Frobenius norm, (7.3) also
holds. It is interesting to know that the elementwise maximum ||A||_m = max_{i,j} |a_{ij}|,
however, fails (7.3): for the 2-by-2 matrix of all ones, ||A A||_m = 2 > 1 = ||A||_m ||A||_m.
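The norms of the 2-by-2 example can be checked with numpy (an illustration, not part of the notes):

```python
import numpy as np

A = np.array([[1.0, 3.0],
              [2.0, 4.0]])

print(np.linalg.norm(A, 'fro'))                    # sqrt(30) ~ 5.477
print(np.linalg.norm(A, 2))                        # ~ 5.465
print(np.sqrt(max(np.linalg.eigvalsh(A.T @ A))))   # same value, via lambda_max(A^T A)
print(np.linalg.norm(A, 1), np.linalg.norm(A, np.inf))  # column sum 7.0, row sum 6.0
```

Note the column-sum/row-sum formulas: the largest column sum of |a_{ij}| is |3| + |4| = 7, the largest row sum is |2| + |4| = 6.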
7.2 Relative errors

Recall the relative error for scalars: let x̃ be an approximate value of x; then the
relative error is

    | x̃ - x | / |x| .

Similarly we can define the relative error of an approximate vector x̃ to x:

    || x̃ - x || / ||x|| ,

where || · || can be any norm in R^n.

Next, we show the following property:

    If the relative error of x̃ to x satisfies
        || x̃ - x || / ||x|| = eps < 1 ,
    then the relative error of x to x̃ satisfies
        || x̃ - x || / || x̃ || <= eps / (1 - eps) .
[Indeed,

    || x̃ - x || / || x̃ || = ( || x̃ - x || / ||x|| ) ( ||x|| / || x̃ || ) <= eps ||x|| / || x̃ || ;

but using

    || x̃ - x || <= eps ||x|| ,

we derive

    ||x|| - || x̃ || <= | ||x|| - || x̃ || | <= || x̃ - x || <= eps ||x|| ,

and this yields

    ||x|| / || x̃ || <= 1 / (1 - eps) ;

thus we obtain

    || x̃ - x || / || x̃ || <= eps / (1 - eps) . ]
From this result, we see that eps/(1 - eps) is not much different from eps if eps is
small; e.g., we have eps/(1 - eps) ~ 0.111 if eps = 0.1. This shows that if the relative
error of x̃ to x is small, then the relative error of x to x̃ is also small. So we may use
either of the two to measure the relative error. But the relative error

    || x̃ - x || / || x̃ ||

is much easier to estimate than the error

    || x̃ - x || / ||x||

in most applications, since it is usually complicated and difficult to obtain an accurate
estimate of the exact value x. In fact, our entire task is to find an accurate
approximation of the exact solution x.
7.3 Perturbation analysis of linear systems

Let x̃ be the solution of the perturbed system (A + E) x̃ = b, where Ax = b and
b != 0. Since A(x̃ - x) = -E x̃, we have

    || x̃ - x || / || x̃ || <= || A^{-1} E || = || A^{-1} Ã - I || .

If in addition

    || A^{-1} E || < 1 ,

then Ã = A + E is nonsingular and

    || x̃ - x || / ||x|| <= || A^{-1} E || / ( 1 - || A^{-1} E || ) .    (7.4)

Indeed, the first bound gives || x̃ - x || <= || A^{-1} E || || x̃ ||, hence

    ||x|| >= || x̃ || - || x̃ - x || >= ( 1 - || A^{-1} E || ) || x̃ || ,

so that

    || x̃ - x || <= || A^{-1} E || || x̃ || <= || A^{-1} E || ||x|| / ( 1 - || A^{-1} E || ) ,

which is (7.4).
7.4 Condition numbers

Recall that the relative error of the solution x̃ of the perturbed system Ã x̃ = b
satisfies

    || x̃ - x || / ||x|| <= || A^{-1} E || / ( 1 - || A^{-1} E || ) .

But the right-hand side of this relation seems difficult to interpret. Below we try to
find a more convenient relation. It is easy to see that

    || A^{-1} E || <= || A^{-1} || ||E|| = || A^{-1} || ||A|| ( ||E|| / ||A|| ) .

If we define

    kappa(A) = ||A|| || A^{-1} || ,

then we know

    || A^{-1} E || <= kappa(A) ||E|| / ||A|| ,

which implies

    || x̃ - x || / ||x|| <= ( kappa(A) ||E|| / ||A|| ) / ( 1 - kappa(A) ||E|| / ||A|| ) .

Now, if kappa(A) ||E|| / ||A|| is small, say less than 0.1, then the denominator
( 1 - kappa(A) ||E|| / ||A|| ) is less than 1 but close to 1, and thus approximately

    || x̃ - x || / ||x|| <~ kappa(A) ||E|| / ||A|| = kappa(A) || Ã - A || / ||A|| .    (7.5)

This relation indicates that the relative error of x̃ to the exact solution x is
controlled by a multiple of the relative error of the perturbed matrix Ã with respect
to the true matrix A.

Condition numbers. The real number kappa(A) given by

    kappa(A) = ||A|| || A^{-1} ||

is called the condition number of the matrix A.

We have:

    The condition number kappa(A) is always greater than or
    equal to one.

This is easily seen from

    1 <= ||I|| = || A A^{-1} || <= ||A|| || A^{-1} || = kappa(A) ,

where the first inequality is a consequence of

    ||I|| = || I · I || <= ||I|| ||I|| .
7.5 Condition numbers and solution accuracy

The condition number has a direct influence on the accuracy of the solution of the
linear system Ax = b.

Suppose the matrix A is rounded to Ã on a computer with rounding unit eps_M
(machine accuracy), so that ã_{ij} = a_{ij} + a_{ij} delta_{ij} with |delta_{ij}| <= eps_M.
Then in any of the usual norms, we have

    || Ã - A || <= eps_M ||A|| .

Suppose we solve the perturbed system

    Ã x̃ = b

without any further errors; then we get a solution x̃ that satisfies

    || x̃ - x || / ||x|| <~ kappa(A) || Ã - A || / ||A|| <= eps_M kappa(A) .

For instance, if eps_M = 10^{-t} and kappa(A) = 10^k, then the solution x̃ can have a
relative error as large as 10^{-t+k}. This justifies the rough rule that one loses about
k decimal digits of accuracy when solving a system whose condition number is 10^k.
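The rule can be observed on a notoriously ill-conditioned example, the Hilbert matrix. This is an illustration (not from the notes); in double precision eps_M ~ 10^{-16}.

```python
import numpy as np

# Hilbert matrix H_{ij} = 1/(i+j+1): kappa grows explosively with n.
n = 8
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

x_true = np.ones(n)
x = np.linalg.solve(A, A @ x_true)   # solve with an exactly known solution

print(f"kappa(A) ~ {np.linalg.cond(A):.1e}")   # about 1e10 for n = 8
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(f"relative error ~ {rel_err:.1e}")       # far larger than eps_M
```

With t ~ 16 and k ~ 10, the rule predicts only about 6 correct digits, which matches the observed relative error of roughly 10^{-7} within an order of magnitude.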
8 Polynomial interpolation

8.1 An example

Consider the following population data:

    Year                    1998     2002     2003
    Population (millions)   6.5437   6.787    6.8031

[Figure: the three data points plotted against the year (1998-2005) on the horizontal
axis, with the population in millions (6.5-6.8) on the vertical axis.]
8.2 Vandermonde interpolation

In this interpolation method, we try to find the unique polynomial of lowest degree
that passes through all the given data points. Since we have three given
data points P(1998), P(2002) and P(2003), we can determine a quadratic polynomial

    p(x) = alpha_0 + alpha_1 x + alpha_2 x^2

that fits them. How do we get the coefficients alpha_i, i = 0, 1, 2? We want p(x) to
pass through the data points, i.e.

    alpha_0 + alpha_1 x + alpha_2 x^2 = P(x)  for x = 1998, 2002, 2003 ,

which gives the linear system

    [ 1  1998  1998^2 ] [ alpha_0 ]   [ P(1998) ]
    [ 1  2002  2002^2 ] [ alpha_1 ] = [ P(2002) ] .    (8.1)
    [ 1  2003  2003^2 ] [ alpha_2 ]   [ P(2003) ]

This coefficient matrix is already very ill-conditioned. If we further receive the datum
P(2004), we have to solve a 4-by-4 system of the same form, with right-hand side
(P(1998), P(2002), P(2003), P(2004))^T, for the coefficients of a cubic polynomial.
The question is: can we make use of the alpha_i's that we have already computed
to speed up the solution of the new coefficients? Moreover, the matrix will be even
more ill-conditioned: in fact, the condition number of the 4-by-4 matrix is about
10^{19}, and we have no hope of solving the system to any digit of accuracy.

In this chapter, we will learn two other methods, the Lagrange method and the
Newton method, which compute the same interpolation polynomial by different
approaches so as to overcome the two aforementioned drawbacks.
8.3 Interpolation at general data points

Instead of using the concrete data points, we now look at the interpolation in a
slightly more general situation. Given the three observation data points

    x    x_0   x_1   x_2
    f    f_0   f_1   f_2

we seek a quadratic polynomial p(x) = alpha_0 + alpha_1 x + alpha_2 x^2 such that

    p(x_i) = f_i ,  i = 0, 1, 2 ,

that is,

    [ 1  x_0  x_0^2 ] [ alpha_0 ]   [ f_0 ]
    [ 1  x_1  x_1^2 ] [ alpha_1 ] = [ f_1 ] .    (8.2)
    [ 1  x_2  x_2^2 ] [ alpha_2 ]   [ f_2 ]

Solving this system gives the coefficients alpha_0, alpha_1 and alpha_2, hence a
quadratic interpolant p(x).
8.4 General polynomial interpolation

More generally, suppose we are given n + 1 data points with distinct nodes x_i:

    x    x_0   x_1   x_2   ...   x_n
    f    f_0   f_1   f_2   ...   f_n    (8.3)

and we look for a polynomial p(x) of degree <= n such that

    p(x_i) = f_i ,  i = 0, 1, ..., n .    (8.4)

Let us write

    p(x) = alpha_0 + alpha_1 x + alpha_2 x^2 + ... + alpha_n x^n ;

then using the conditions (8.4), we have

    [ 1  x_0  x_0^2  ...  x_0^n ] [ alpha_0 ]   [ f_0 ]
    [ 1  x_1  x_1^2  ...  x_1^n ] [ alpha_1 ] = [ f_1 ]    (8.5)
    [ ...                       ] [ ...     ]   [ ... ]
    [ 1  x_n  x_n^2  ...  x_n^n ] [ alpha_n ]   [ f_n ]

with the coefficient matrix

         [ 1  x_0  x_0^2  ...  x_0^n ]
    V_a = [ 1  x_1  x_1^2  ...  x_1^n ]    (8.6)
         [ ...                       ]
         [ 1  x_n  x_n^2  ...  x_n^n ]

called the Vandermonde matrix.

The interpolating polynomial is unique. Indeed, if a polynomial q(x) of degree <= n
also satisfies

    q(x_i) = f_i ,  i = 0, 1, ..., n ,

then r(x) = p(x) - q(x) is a polynomial of degree <= n with n + 1 roots. The only
case for this to hold is r(x) = 0 identically. That is, p(x) = q(x).

Noting that (8.4) is equivalent to the linear system (8.5), the uniqueness of the
polynomial p(x) ensures the uniqueness of solutions to (8.5). This implies the
non-singularity of the matrix V_a, i.e. the linear independence of the column vectors
of V_a, and hence the existence of a solution to (8.5), i.e. of a polynomial satisfying
(8.4). If this were not true, namely V_a singular or the column vectors of V_a linearly
dependent, then there would exist some non-zero vector x_0 such that V_a x_0 = 0. This
violates the established uniqueness: if alpha is a solution to (8.5), i.e. V_a alpha = f,
then alpha + x_0 is also a solution.
How do we find an interpolation polynomial p(x) satisfying the conditions (8.4)? One
may think of solving the system (8.5) by Gaussian elimination, but this procedure is
very expensive when n is large. Another obvious disadvantage of this approach is that
the system (8.5) can be very ill-conditioned even for small n; see the example in
Section 8.2.
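The claimed ill-conditioning is easy to check; the following illustration uses `np.vander` to build the matrix of (8.5) for the population-data nodes of Section 8.2.

```python
import numpy as np

nodes = np.array([1998.0, 2002.0, 2003.0, 2004.0])
V = np.vander(nodes, increasing=True)   # columns 1, x, x^2, x^3

print(f"{np.linalg.cond(V):.1e}")
# a huge value -- so large that double precision can barely even measure it
```

The matrix is nonsingular (its determinant is the product of the differences of the nodes, here 240), yet its columns are nearly parallel, which is exactly what a huge condition number expresses.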
8.5 Lagrange interpolation

There are many different approaches to computing the polynomial interpolation that
satisfies (8.4). We first introduce the Lagrange interpolation.

Instead of the standard basis functions

    1, x, x^2, ..., x^n

for polynomials of degree n, the Lagrange interpolation takes the following special
basis functions:

    l_j(x) = prod_{i=0, i != j}^{n} (x - x_i) / (x_j - x_i) ,  j = 0, 1, ..., n .    (8.7)

Each l_j is a polynomial of degree n satisfying l_j(x_j) = 1 and l_j(x_i) = 0 for
i != j, so the interpolant can be written down directly as

    p(x) = sum_{j=0}^{n} f_j l_j(x) ,  with  p(x_i) = f_i ,  i = 0, 1, 2, ..., n .
8.6 Newton form of interpolation

Suppose again that we are given the data points

    x    x_0   x_1   x_2   ...   x_n
    f    f_0   f_1   f_2   ...   f_n

and look for the interpolating polynomial p(x) of degree <= n satisfying

    p(x_i) = f_i ,  i = 0, 1, ..., n .    (8.8)
Below, we are interested in another form of interpolation, called the Newton form
of interpolation, which is much easier to evaluate than the Lagrange interpolation.

Noting that the polynomials

    1,  x - x_0,  (x - x_0)(x - x_1),  ...,  (x - x_0)(x - x_1) ... (x - x_{n-1})

are linearly independent, one may choose them as a basis of the polynomials of degree
<= n. That is, any polynomial of degree <= n can be represented in terms of this basis.
In particular, let p(x) be the polynomial of degree <= n satisfying (8.8); then we can
write

    p(x) = c_0 + c_1 (x - x_0) + c_2 (x - x_0)(x - x_1) + ...
           + c_n (x - x_0)(x - x_1) ... (x - x_{n-1}) .    (8.9)
One can see an obvious advantage of the Newton form: after adding one more data
point, the resulting form differs from the previous one only by one additional term.
This is quite different from the Lagrange interpolation.

How do we find the coefficients in the Newton form of interpolation (8.9)? To compute
the coefficients {c_i}, we introduce a new and elegant tool: divided differences.
8.6.1 Divided differences

Consider a function f defined on [a, b], and a set of distinct nodal points in [a, b]:
a = x_0 < x_1 < x_2 < ... < x_{n-1} < x_n = b. Then we can define the divided
differences of different orders as follows. We call

    f[x_0] = f(x_0),  f[x_1] = f(x_1),  ...,  f[x_n] = f(x_n)

the 0th order divided differences,

    f[x_0, x_1] = ( f[x_1] - f[x_0] ) / ( x_1 - x_0 ) ,
    f[x_1, x_2] = ( f[x_2] - f[x_1] ) / ( x_2 - x_1 ) ,  ...

the 1st order divided differences, and recursively

    f[x_0, x_1, x_2] = ( f[x_1, x_2] - f[x_0, x_1] ) / ( x_2 - x_0 )

and so on for the higher orders.

Note that for the linear interpolant p of f at x_0 and x_1, the leading coefficient is

    ( p(x_1) - p(x_0) ) / ( x_1 - x_0 ) = ( f[x_1] - f[x_0] ) / ( x_1 - x_0 )
                                        = f[x_0, x_1]  ( = f[x_1, x_0] ) ,

and an analogous computation combining the first order divided differences gives the
second order divided difference f[x_0, x_1, x_2] as the leading coefficient of the
quadratic interpolant at x_0, x_1, x_2.
8.6.2 Divided differences and derivatives
Next we shall study some relations between derivatives and divided differences. First
we can easily check the following facts:
1. For any constant function, its first order divided differences always vanish;
2. For any linear function, its first order divided differences are always constant,
and its second order divided differences always vanish;
3. For any quadratic function, its first and second order divided differences are
respectively linear and constant, and its third order divided differences always
vanish.
The above facts tell us that the action of divided differences is very similar to
that of derivatives. In fact, we have the following result.

    Suppose f ∈ C^n[a, b], and a = x_0 < x_1 < x_2 < ... < x_{n-1} < x_n = b are
    distinct points in [a, b]. Then there exists some xi ∈ (a, b) such that

        f[x_0, x_1, x_2, ..., x_n] = f^{(n)}(xi) / n! .

Proof. Let p(x) be the Newton form of interpolation of f(x) at the nodal points
a = x_0 < x_1 < ... < x_n = b, and g(x) = f(x) - p(x). Since f(x_i) = p(x_i)
for i = 0, 1, ..., n, g has n + 1 distinct zeros in [a, b]. Then repeated application of
Rolle's Theorem tells us there exists a point xi ∈ (a, b) such that g^{(n)}(xi) = 0,
which implies

    p^{(n)}(xi) = f^{(n)}(xi) .

But p(x) is a polynomial of degree <= n with leading coefficient f[x_0, x_1, x_2, ..., x_n],
so we have

    p^{(n)}(x) = n! f[x_0, x_1, x_2, ..., x_n]  for all x .

This completes the proof of the desired relation. ]
8.6.3 Symmetry of divided differences

The divided differences are symmetric: exchanging any two variables in a divided
difference does not change its value. Let us see why this is true.

Clearly, f[x_0] is symmetric.

For the 1st order divided difference, we have

    f[x_0, x_1] = ( f[x_1] - f[x_0] ) / ( x_1 - x_0 )
                = ( f[x_0] - f[x_1] ) / ( x_0 - x_1 ) = f[x_1, x_0] ,
and similarly for the higher orders.

For the quadratic case, write the Newton form as

    p(x) = c_0 + c_1 (x - x_0) + c_2 (x - x_0)(x - x_1) ,    (8.10)

for which

    c_0 = f[x_0] ,    c_1 = f[x_0, x_1] ,    c_2 = f[x_0, x_1, x_2] .

We can compute these divided differences through Gaussian elimination on the
following system:

    [ 1   0          0                     ] [ c_0 ]   [ f_0 ]   [ f[x_0] ]
    [ 1   x_1 - x_0  0                     ] [ c_1 ] = [ f_1 ] = [ f[x_1] ] .    (8.11)
    [ 1   x_2 - x_0  (x_2 - x_0)(x_2 - x_1)] [ c_2 ]   [ f_2 ]   [ f[x_2] ]
Next we will use Gaussian elimination to reduce the lower triangular matrix to
an identity matrix.

First we subtract row 2 from row 3 and put the result back in row 3 (i.e.
r_3 - r_2 -> r_3). This eliminates the 1 in the (3,1)-entry. Remember to do the same
thing to the right-hand side vector. Next we eliminate the 1 in the (2,1)-entry by
doing r_2 - r_1 -> r_2. The resulting system after these two row operations is:

    [ 1   0          0                     ] [ c_0 ]   [ f[x_0] ]
    [ 0   x_1 - x_0  0                     ] [ c_1 ] = [ f[x_1] - f[x_0] ] .
    [ 0   x_2 - x_1  (x_2 - x_0)(x_2 - x_1)] [ c_2 ]   [ f[x_2] - f[x_1] ]

Next we divide the 2nd row by x_1 - x_0 (i.e. 1/(x_1 - x_0) · r_2 -> r_2) to make the
(2,2)-entry 1. Also we do 1/(x_2 - x_1) · r_3 -> r_3 to make the (3,2)-entry 1. These
two steps are equivalent to pre-multiplying the whole system by the matrix:

    [ 1   0              0             ]
    [ 0   1/(x_1 - x_0)  0             ] .
    [ 0   0              1/(x_2 - x_1) ]
The resulting system is:

    [ 1  0  0         ] [ c_0 ]   [ f[x_0] ]                              [ f[x_0] ]
    [ 0  1  0         ] [ c_1 ] = [ (f[x_1] - f[x_0]) / (x_1 - x_0) ]  =  [ f[x_0, x_1] ] .
    [ 0  1  x_2 - x_0 ] [ c_2 ]   [ (f[x_2] - f[x_1]) / (x_2 - x_1) ]     [ f[x_1, x_2] ]

Finally, subtracting row 2 from row 3 and then dividing the new row 3 by x_2 - x_0
gives

    [ 1  0  0 ] [ c_0 ]   [ f[x_0] ]                                        [ f[x_0] ]
    [ 0  1  0 ] [ c_1 ] = [ f[x_0, x_1] ]                                =  [ f[x_0, x_1] ] .
    [ 0  0  1 ] [ c_2 ]   [ (f[x_1, x_2] - f[x_0, x_1]) / (x_2 - x_0) ]     [ f[x_0, x_1, x_2] ]
Putting these values of the c_i's back into (8.10), we have the expected result:

    p(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) .

Thus the computation of the divided differences amounts to solving the lower
triangular system (8.11) by Gaussian elimination.
8.6.5 The divided difference table

Given the data points

    x    x_0   x_1   x_2   ...   x_n
    f    f_0   f_1   f_2   ...   f_n

how do we compute the coefficients in the Newton form of interpolation? We take the
following simple case as an example:

Table for computing the divided difference f[x_0, x_1, x_2, x_3]

    x_0  f[x_0] = f_0
                        f[x_0, x_1]
    x_1  f[x_1] = f_1                  f[x_0, x_1, x_2]
                        f[x_1, x_2]                        f[x_0, x_1, x_2, x_3]
    x_2  f[x_2] = f_2                  f[x_1, x_2, x_3]
                        f[x_2, x_3]
    x_3  f[x_3] = f_3
Example. Compute the Newton form of interpolation satisfying the following conditions:

    x      3    1    5    6
    f(x)   1   -3    2    4

The divided difference table is:

    x_0 = 3   f[x_0] = 1
                             f[x_0, x_1] = 2
    x_1 = 1   f[x_1] = -3                        f[x_0, x_1, x_2] = -3/8
                             f[x_1, x_2] = 5/4                            f[x_0, x_1, x_2, x_3] = 7/40
    x_2 = 5   f[x_2] = 2                         f[x_1, x_2, x_3] = 3/20
                             f[x_2, x_3] = 2
    x_3 = 6   f[x_3] = 4

Thus the Newton form of interpolation can be written as follows:

    p(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1)
           + f[x_0, x_1, x_2, x_3](x - x_0)(x - x_1)(x - x_2)
         = 1 + 2(x - 3) - (3/8)(x - 3)(x - 1) + (7/40)(x - 3)(x - 1)(x - 5) .    (8.12)
Now if we add one more data point (x_4, f_4) = (0, 5), we have the data table:

    x      3    1    5    6    0
    f(x)   1   -3    2    4    5

and the extended divided difference table:

    x_0 = 3   f[x_0] = 1
                             f[x_0, x_1] = 2
    x_1 = 1   f[x_1] = -3                        f[x_0, x_1, x_2] = -3/8
                             f[x_1, x_2] = 5/4                            f[x_0, x_1, x_2, x_3] = 7/40
    x_2 = 5   f[x_2] = 2                         f[x_1, x_2, x_3] = 3/20                                  f[x_0, x_1, x_2, x_3, x_4] = 11/72
                             f[x_2, x_3] = 2                              f[x_1, x_2, x_3, x_4] = -17/60
    x_3 = 6   f[x_3] = 4                         f[x_2, x_3, x_4] = 13/30
                             f[x_3, x_4] = -1/6
    x_4 = 0   f[x_4] = 5

This immediately gives the Newton form of interpolation, adding only one term
to the polynomial (8.12):

    p(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1)
           + f[x_0, x_1, x_2, x_3](x - x_0)(x - x_1)(x - x_2)
           + f[x_0, x_1, x_2, x_3, x_4](x - x_0)(x - x_1)(x - x_2)(x - x_3)
         = 1 + 2(x - 3) - (3/8)(x - 3)(x - 1) + (7/40)(x - 3)(x - 1)(x - 5)
           + (11/72)(x - 3)(x - 1)(x - 5)(x - 6) .
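The divided difference table can be generated mechanically. The following sketch (not the notes' own code) reproduces the coefficients of the worked example, using exact rational arithmetic so the fractions come out exactly:

```python
from fractions import Fraction

def divided_differences(xs, fs):
    """Return the Newton coefficients c_i = f[x_0, ..., x_i].

    Column k of the table holds the kth order divided differences; the top
    entry of each column is the Newton coefficient c_k.
    """
    n = len(xs)
    table = [list(map(Fraction, fs))]               # 0th order column
    for k in range(1, n):
        prev = table[-1]
        table.append([
            (prev[i + 1] - prev[i]) / (Fraction(xs[i + k]) - Fraction(xs[i]))
            for i in range(n - k)
        ])
    return [col[0] for col in table]

coeffs = divided_differences([3, 1, 5, 6, 0], [1, -3, 2, 4, 5])
print(coeffs)
# [Fraction(1, 1), Fraction(2, 1), Fraction(-3, 8), Fraction(7, 40), Fraction(11, 72)]
```

These are exactly the coefficients 1, 2, -3/8, 7/40, 11/72 of the quartic Newton form above.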
8.7 Computational costs

8.7.1 Cost of the Lagrange form

Given the data points

    x    x_0   x_1   x_2   ...   x_n
    f    f_0   f_1   f_2   ...   f_n

the Lagrange form of the interpolant is

    p_n(x) = f_0 l_0(x) + f_1 l_1(x) + ... + f_n l_n(x) .    (8.13)

Let us rewrite it as

    p_n(x) = beta_0 (x - x_1)(x - x_2) ... (x - x_n)
           + beta_1 (x - x_0)(x - x_2) ... (x - x_n)
           + ... + beta_n (x - x_0)(x - x_1) ... (x - x_{n-1}) ,    (8.14)

where

    beta_i = f_i / ( (x_i - x_0)(x_i - x_1) ... (x_i - x_{i-1})(x_i - x_{i+1}) ... (x_i - x_n) ) .    (8.15)

Once the beta_i are available, p_n(t) can be evaluated efficiently with the help of the
product prod_{i=0}^{n} (t - x_i).

Suppose now one more data point is given:

    x    x_0   x_1   ...   x_n   x_{n+1}
    f    f_0   f_1   ...   f_n   f_{n+1}

The new interpolant p_{n+1}(x) looks very complicated indeed, but we can simplify it
using (8.15). In fact, if we rewrite p_{n+1}(x) as

    p_{n+1}(x) = beta'_0 (x - x_1)(x - x_2) ... (x - x_n)(x - x_{n+1})
               + beta'_1 (x - x_0)(x - x_2) ... (x - x_n)(x - x_{n+1})
               + ... + beta'_n (x - x_0)(x - x_1) ... (x - x_{n-1})(x - x_{n+1})
               + beta'_{n+1} (x - x_0)(x - x_1) ... (x - x_{n-1})(x - x_n) ,

then

    beta'_i = beta_i / (x_i - x_{n+1})                                            if 0 <= i <= n ,
    beta'_{n+1} = f_{n+1} / ( (x_{n+1} - x_0)(x_{n+1} - x_1) ... (x_{n+1} - x_n) )  if i = n + 1 .

Now it is clear that each beta'_i, 0 <= i <= n, can be computed in 1 addition and 1
multiplication, while beta'_{n+1} can be computed in (n + 1) additions and (n + 1)
multiplications. Thus p_{n+1} can be obtained in O(n) operations.
8.7.2 Cost of the Newton form

(1) Cost of generating the polynomial

Given the data points

    x    x_0   x_1   x_2   ...   x_n
    f    f_0   f_1   f_2   ...   f_n

the Newton form of the interpolant is

    p_n(x) = c_0 + c_1 (x - x_0) + ... + c_n (x - x_0)(x - x_1) ... (x - x_{n-1}) .    (8.16)

The question is how costly it is to compute the c_i. Recall that the c_i's are the first
entries in each column of the divided difference table.

Let us then compute the cost of obtaining all the entries in the divided difference
table. Recall that the table looks something like this:

    x        0th         1st                 ...   (n-1)th                nth
    x_0      f[x_0]
                         f[x_0, x_1]
    x_1      f[x_1]
                         f[x_1, x_2]               f[x_0, ..., x_{n-1}]
    ...      ...         ...                 ...                          f[x_0, ..., x_n]
                         f[x_{n-2}, x_{n-1}]       f[x_1, ..., x_n]
    x_{n-1}  f[x_{n-1}]
                         f[x_{n-1}, x_n]
    x_n      f[x_n]
For the 0th level column, we dont have to do anything. We just have f [xi ] = fi . In
the 1st level column, each divided difference is computed in the form:
f [xi ] f [xj ]
xi xj
which require 2 additions and 1 multiplication. Notice that there are only n 1st level
divided differences to compute in this column.
105
Next consider the 2nd-level divided differences. Again, each divided difference can be computed in 2 additions and 1 multiplication, but there are only (n-1) 2nd-level divided differences to compute. Repeating this argument, finally, for the nth-level divided difference there is only one to compute (namely c_n = f[x_0, \ldots, x_n]), and it can be computed in 2 additions and 1 multiplication. Thus computing all the entries of the divided difference table (in particular the c_i's) requires

    2 [n + (n-1) + \cdots + 1] = O(n^2)

additions and half that number of multiplications. Hence generating the Newton interpolation polynomial is an O(n^2) process.
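The column-by-column computation above can be sketched in Python (an illustrative sketch, not part of the original notes; the function name is ours):

```python
def divided_differences(xs, fs):
    """Return the Newton coefficients c_k = f[x_0, ..., x_k].

    The table is built column by column; each entry costs one
    division and two subtractions, so the total work is O(n^2).
    """
    n = len(xs)
    col = list(fs)           # 0th-level column: f[x_i] = f_i
    coeffs = [col[0]]
    for level in range(1, n):
        # Level-k column has n - k entries.
        col = [(col[i + 1] - col[i]) / (xs[i + level] - xs[i])
               for i in range(n - level)]
        coeffs.append(col[0])  # first entry of each column is c_level
    return coeffs

# Interpolating f(x) = x^2 at 0, 1, 3: f[x0,x1] = 1, f[x0,x1,x2] = (4-1)/3 = 1.
print(divided_differences([0.0, 1.0, 3.0], [0.0, 1.0, 9.0]))  # -> [0.0, 1.0, 1.0]
```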
(2) Cost of evaluating the polynomial

Suppose we have already obtained all the c_i in (8.16). Our next question is how costly it is to compute p_n(t) at an arbitrary point t. For this, we just invoke Horner's rule:

    p_n(t) = ((\cdots(c_n (t-x_{n-1}) + c_{n-1})(t-x_{n-2}) + c_{n-2})(t-x_{n-3}) + \cdots + c_1)(t-x_0) + c_0.
This can now be computed in 2n additions and n multiplications. Thus for any t,
pn (t) can be computed in O(n) operations.
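The nested evaluation can be sketched as follows (illustrative code, using the coefficients c_i and nodes x_i of the Newton form; the function name is ours):

```python
def newton_eval(coeffs, xs, t):
    """Evaluate p_n(t) = c_0 + c_1(t - x_0) + ... by Horner's rule:
    n multiplications and 2n additions/subtractions, i.e. O(n) per point."""
    result = coeffs[-1]
    # Work from the innermost bracket outwards.
    for c, x in zip(coeffs[-2::-1], xs[-2::-1]):
        result = result * (t - x) + c
    return result

# With c = [0, 1, 1] and nodes 0, 1, 3 (interpolating x^2), p(2) = 4.
print(newton_eval([0.0, 1.0, 1.0], [0.0, 1.0, 3.0], 2.0))  # -> 4.0
```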
(3) Cost of updating the polynomial

Suppose we are given one more data point to interpolate:

    x    | x_0  x_1  x_2  ...  x_n  x_{n+1}
    f(x) | f_0  f_1  f_2  ...  f_n  f_{n+1}

In the Newton form, all the old coefficients c_0, \ldots, c_n remain unchanged; only the single new coefficient c_{n+1} = f[x_0, \ldots, x_{n+1}] has to be computed, which amounts to appending one new diagonal of entries to the divided difference table. Hence the update costs only O(n) operations.
8.7.3 Comparison of the three methods

From the previous discussions, we can briefly summarize the computational costs of the three interpolation methods in the following table:

    Method        Generation   Evaluation (per point)   Updating
    Vandermonde   O(n^3)       O(n)                     O(n^3)
    Lagrange      O(n^2)       O(n)                     O(n)
    Newton        O(n^2)       O(n)                     O(n)
8.8 Interpolation error estimates

Suppose we are given the observation data of a function f(x) at n+1 distinct points:

    x    | x_0  x_1  x_2  ...  x_n
    f(x) | f_0  f_1  f_2  ...  f_n

Suppose f \in C^{n+1}[a, b] and p(x) is the interpolation polynomial of degree \le n of f at these points. Then for every x \in [a, b] there exists some \xi_x \in (a, b) such that

    f(x) - p(x) = \frac{f^{(n+1)}(\xi_x)}{(n+1)!} (x-x_0)(x-x_1)\cdots(x-x_n).    (8.17)
Proof. If x = x_i for some i \in \{0, 1, \ldots, n\}, then the result holds obviously. Now we consider x \in (a, b) with x \notin \{x_0, x_1, \ldots, x_n\} and introduce

    \varphi(t) = f(t) - p(t) - \lambda (t-x_0)(t-x_1)\cdots(t-x_n).

Choose \lambda such that \varphi(x) = 0 (note that x is fixed); then we have

    \varphi(x) = 0, \quad \varphi(x_0) = \varphi(x_1) = \cdots = \varphi(x_n) = 0.

By Rolle's theorem, we know \varphi'(t) has at least n+1 distinct zeros; similarly \varphi''(t) has at least n distinct zeros, and repeating the argument, \varphi^{(n+1)}(t) has at least one zero \xi_x \in (a, b). Since p^{(n+1)} \equiv 0 and the (n+1)-st derivative of (t-x_0)\cdots(t-x_n) equals (n+1)!, this gives

    0 = \varphi^{(n+1)}(\xi_x) = f^{(n+1)}(\xi_x) - \lambda (n+1)!,

so \lambda = \frac{f^{(n+1)}(\xi_x)}{(n+1)!}, and \varphi(x) = 0 yields

    f(x) - p(x) = \frac{f^{(n+1)}(\xi_x)}{(n+1)!} (x-x_0)(x-x_1)\cdots(x-x_n).    (8.18)

□
Some interesting results. Using the error estimate (8.18), we have the following relation if we take p(x) = L(x):

    f(x) = \sum_{i=0}^{n} f(x_i) l_i(x) + \frac{f^{(n+1)}(\xi_x)}{(n+1)!} (x-x_0)(x-x_1)\cdots(x-x_n).

In particular, applying this to f(x) = 1, x, x^2, \ldots, x^n (for which the error term vanishes), we obtain the identities

    \sum_{i=0}^{n} l_i(x) = 1, \quad \sum_{i=0}^{n} x_i l_i(x) = x, \quad \sum_{i=0}^{n} x_i^2 l_i(x) = x^2, \quad \ldots, \quad \sum_{i=0}^{n} x_i^n l_i(x) = x^n.
8.9 Chebyshev polynomials
Using the interpolation error estimate (8.17), we can estimate the accuracy of each
interpolation polynomial when all the interpolating nodes
    a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b
are given. But with a different set of interpolating nodes, the resulting polynomial will have a different accuracy. This raises a natural question: can we choose the interpolating nodes so that the resulting polynomial reaches an optimal accuracy? In this subsection we shall discuss how to find such an optimal polynomial. It is easy to see from the error estimate (8.17) that the optimal polynomial can be realized if we can choose a set of interpolating nodes x_0, x_1, \ldots, x_n such that the magnitude of the resulting polynomial (x-x_0)(x-x_1)\cdots(x-x_n) is minimized. An analysis of this optimization problem was first made by the mathematician Chebyshev, and it leads naturally to a system of polynomials called the Chebyshev polynomials, which we introduce next.
The Chebyshev polynomials are defined recursively as follows:
    T_0(x) = 1, \quad T_1(x) = x,

and for n \ge 1,

    T_{n+1}(x) = 2x\, T_n(x) - T_{n-1}(x).
You may compute the next few such polynomials:

    T_2(x) = 2x^2 - 1,
    T_3(x) = 4x^3 - 3x,
    T_4(x) = 8x^4 - 8x^2 + 1,
    T_5(x) = 16x^5 - 20x^3 + 5x,
    T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1.

[Closed form. The Chebyshev polynomials admit the closed form T_n(x) = \cos(n \arccos x) for -1 \le x \le 1. Indeed, let f_n(x) = \cos(n \arccos x); then

    f_0(x) = 1, \quad f_1(x) = x,

and for n \ge 1, the identity \cos((n+1)\theta) + \cos((n-1)\theta) = 2\cos\theta\,\cos(n\theta) with \theta = \arccos x gives

    f_{n+1}(x) = 2x\, f_n(x) - f_{n-1}(x).

So we have f_n = T_n for all n.]
We see the following properties directly from the closed form of the Chebyshev polynomials:

    |T_n(x)| \le 1 \quad (-1 \le x \le 1),

    T_n\Big(\cos\frac{j\pi}{n}\Big) = (-1)^j \quad (0 \le j \le n),

    T_n\Big(\cos\frac{(2j-1)\pi}{2n}\Big) = 0 \quad (1 \le j \le n).
A monic polynomial is one in which the term of highest degree has coefficient 1 (unity). From the recursion we see that the highest-degree term of T_n(x) is 2^{n-1} x^n for n > 0; therefore 2^{1-n} T_n is a monic polynomial for all n > 0.

We have the following estimate for monic polynomials: if p is a monic polynomial of degree n, then

    \|p\|_\infty = \max_{-1 \le x \le 1} |p(x)| \ge 2^{1-n}.
For interpolation on [-1, 1] (|x| \le 1), the error estimate (8.17) gives

    |f(x) - p(x)| \le \frac{1}{(n+1)!} \max_{|x| \le 1} |f^{(n+1)}(x)| \cdot \max_{|x| \le 1} |(x-x_0)(x-x_1)\cdots(x-x_n)|.

Since (x-x_0)(x-x_1)\cdots(x-x_n) is a monic polynomial of degree n+1, the estimate above shows its maximum over [-1, 1] is at least 2^{-n}, and this minimal value is attained exactly by 2^{-n} T_{n+1}(x). Hence the optimal interpolating nodes are the zeros of T_{n+1}:

    x_{i-1} = \cos\frac{(2i-1)\pi}{2(n+1)}, \quad i = 1, 2, \ldots, n+1.
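The recursion, the closed-form values and the zeros can be checked numerically (an illustrative sketch, not part of the original notes; function names are ours):

```python
import math

def chebyshev(n, x):
    """T_n(x) via the recursion T_0 = 1, T_1 = x, T_{n+1} = 2x T_n - T_{n-1}."""
    t_prev, t = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

def chebyshev_nodes(n):
    """Zeros of T_n: cos((2j - 1) pi / (2n)), j = 1, ..., n."""
    return [math.cos((2 * j - 1) * math.pi / (2 * n)) for j in range(1, n + 1)]

# T_4(x) = 8x^4 - 8x^2 + 1, and T_4 vanishes at its Chebyshev nodes.
x = 0.3
assert abs(chebyshev(4, x) - (8 * x**4 - 8 * x**2 + 1)) < 1e-12
assert all(abs(chebyshev(4, z)) < 1e-12 for z in chebyshev_nodes(4))
```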
8.10
Spline interpolation
In the previous few subsections we have discussed the interpolating polynomials for a given set of observation data of a function f(x):

    x    | x_0  x_1  x_2  ...  x_n
    f(x) | f_0  f_1  f_2  ...  f_n
Remember that all such interpolating polynomials are global polynomials on the entire given interval [a, b]. If we want higher interpolation accuracy, we need to use more interpolating nodes and require the interpolated function f(x) to be smoother. But with more and more interpolating nodes, the degree of the resulting polynomial becomes higher and higher, leading to very unstable computations.

Because of the instability of higher-order polynomials, one is interested in using lower-order polynomials to achieve higher accuracy. This can be realized by using piecewise polynomials instead of global polynomials. Spline functions are widely used piecewise polynomials with certain continuity conditions.
Spline functions. Given the following (n+1) nodal points

    x_0 < x_1 < x_2 < \cdots < x_n,

a spline function of degree k with these nodes is a function S(x) satisfying the following conditions:

(i) S(x) is a polynomial of degree \le k on each interval [x_{i-1}, x_i], i = 1, 2, \ldots, n;

(ii) S(x) has a continuous (k-1)-th order derivative on [x_0, x_n].
Spline interpolations. A spline interpolation S(x) of degree k associated with the following set of observation data of a function f(x):

    x    | x_0  x_1  x_2  ...  x_n
    f(x) | f_0  f_1  f_2  ...  f_n

is a spline function of degree k which fits the data:

    S(x_i) = f_i, \quad i = 0, 1, \ldots, n.
A spline function of degree 0 is piecewise constant:

    S(x) = S_i(x) = C_i, \quad x \in [x_i, x_{i+1}), \quad i = 0, 1, \ldots, n-1.

As the spline function of degree 0 has only n degrees of freedom, it can fit only n observation data. E.g., we may take the data

    S(x_i) = f_i, \quad i = 0, 1, 2, \ldots, n-1,

which gives

    S(x) = S_i(x) = f_i, \quad x \in [x_i, x_{i+1}), \quad i = 0, 1, \ldots, n-1.
Similarly, a spline function of degree 1 is piecewise linear:

    S(x) = S_i(x) = a_i x + b_i, \quad x \in [x_i, x_{i+1}], \quad i = 0, 1, \ldots, n-1.
How do we find the spline interpolation function? We have 2n unknown coefficients and thus need 2n conditions. Usually we require S(x) to fit the observed data:

    S(x_i) = f_i, \quad i = 0, 1, 2, \ldots, n,

which gives 1 condition at each endpoint x_0 and x_n, and 2 conditions at each internal point x_i, i = 1, 2, \ldots, n-1 (one for each of the two linear pieces meeting there). So this yields exactly the desired 2(n-1) + 2 = 2n conditions to find all the coefficients in S(x).
Geometrically, the unique determination of the 2n coefficients in S(x) is clear. In
fact, the linear function is uniquely determined in each subinterval by the assigned
values at the two endpoints.
Please check that the piecewise linear function determined above satisfies all the requirements a spline function of degree 1 should meet.
8.10.3 Cubic splines

A spline function of degree 3 is called a cubic spline; it is the most frequently used in practice. Let S(x) be a cubic spline. According to the definition of a spline function, we can write it as follows: for k = 0, 1, \ldots, n-1,

    S(x) = S_k(x) = s_{k,0} + s_{k,1}(x-x_k) + s_{k,2}(x-x_k)^2 + s_{k,3}(x-x_k)^3, \quad x \in [x_k, x_{k+1}].
How do we find a cubic spline interpolation function? It amounts to finding 4n unknown coefficients, 4 for each subinterval, so we have to find 4n conditions to determine the 4n unknowns. Usually we require S(x) to fit the observed data:

    S(x_i) = f_i, \quad i = 0, 1, 2, \ldots, n,

imposed on each piece at both of its endpoints (S_k(x_k) = f_k and S_k(x_{k+1}) = f_{k+1}), which gives 2n conditions; the continuity of S' and S'' at the n-1 internal nodes gives 2(n-1) more, for a total of 4n-2. The remaining two conditions are usually imposed at the two endpoints, e.g.,

    S''(x_0) = \alpha, \quad S''(x_n) = \beta,

with \alpha and \beta being two constants. A cubic spline S(x) is called a natural cubic spline if it satisfies

    S''(x_0) = 0, \quad S''(x_n) = 0.
Example. Determine whether the following function is a cubic spline:

    S(x) = 1 + x - x^3,                                x \in [0, 1],
    S(x) = 1 - 2(x-1) - 3(x-1)^2 + 4(x-1)^3,           x \in [1, 2].

(One checks the continuity of S, S' and S'' at x = 1; all three match, so S is indeed a cubic spline.)
8.10.4 Construction of cubic spline interpolations

In this subsection we discuss the unique existence and construction of cubic spline functions for a given set of observation data:

    x    | x_0  x_1  x_2  ...  x_n
    f(x) | f_0  f_1  f_2  ...  f_n

Let S(x) be a natural cubic spline function on [a, b]; then we can write, for k = 0, 1, \ldots, n-1,

    S(x) = S_k(x) = s_{k,0} + s_{k,1}(x-x_k) + s_{k,2}(x-x_k)^2 + s_{k,3}(x-x_k)^3, \quad x \in [x_k, x_{k+1}].    (8.19)
Noting that S(x) is piecewise cubic, its second derivative S''(x) must be piecewise linear on [a, b]. So we can use the linear Lagrange interpolation to represent S_k''(x):

    S_k''(x) = S''(x_k) \frac{x - x_{k+1}}{x_k - x_{k+1}} + S''(x_{k+1}) \frac{x - x_k}{x_{k+1} - x_k},    (8.20)

where we have used the fact that S''(x) is continuous on [a, b] and S''(x) = S_k''(x) on [x_k, x_{k+1}]. Letting m_k = S''(x_k), m_{k+1} = S''(x_{k+1}) and h_k = x_{k+1} - x_k, we can write

    S_k''(x) = \frac{m_k}{h_k}(x_{k+1} - x) + \frac{m_{k+1}}{h_k}(x - x_k).    (8.21)

Integrating (8.21) twice, we obtain

    S_k(x) = \frac{m_k}{6h_k}(x_{k+1} - x)^3 + \frac{m_{k+1}}{6h_k}(x - x_k)^3 + c_k x + d_k    (8.22)
for some constants c_k and d_k. More conveniently, using the linear independence of x_{k+1} - x and x - x_k, we can express c_k x + d_k in terms of x_{k+1} - x and x - x_k. Thus one can rewrite (8.22) as

    S_k(x) = \frac{m_k}{6h_k}(x_{k+1} - x)^3 + \frac{m_{k+1}}{6h_k}(x - x_k)^3 + p_k (x_{k+1} - x) + q_k (x - x_k).    (8.23)
Now using the observation data S(x_k) = f_k and S(x_{k+1}) = f_{k+1}, we can derive

    f_k = \frac{m_k}{6} h_k^2 + p_k h_k, \quad f_{k+1} = \frac{m_{k+1}}{6} h_k^2 + q_k h_k.

One can get p_k and q_k from these relations and then substitute into equation (8.23) to derive

    S_k(x) = \frac{m_k}{6h_k}(x_{k+1} - x)^3 + \frac{m_{k+1}}{6h_k}(x - x_k)^3
           + \Big(\frac{f_k}{h_k} - \frac{m_k h_k}{6}\Big)(x_{k+1} - x) + \Big(\frac{f_{k+1}}{h_k} - \frac{m_{k+1} h_k}{6}\Big)(x - x_k).    (8.24)
To find the values m_k, we can use the continuity of the derivative of S(x). Differentiating (8.24), we obtain

    S_k'(x) = -\frac{m_k}{2h_k}(x_{k+1} - x)^2 + \frac{m_{k+1}}{2h_k}(x - x_k)^2
            - \Big(\frac{f_k}{h_k} - \frac{m_k h_k}{6}\Big) + \Big(\frac{f_{k+1}}{h_k} - \frac{m_{k+1} h_k}{6}\Big).    (8.25)
Evaluating at x = x_k gives

    S_k'(x_k) = -\frac{m_k}{3} h_k - \frac{m_{k+1}}{6} h_k + d_k, \quad \text{where} \quad d_k = \frac{f_{k+1} - f_k}{h_k}.    (8.26)

Similarly, evaluating S_{k-1}'(x) at x = x_k gives

    S_{k-1}'(x_k) = \frac{m_k}{3} h_{k-1} + \frac{m_{k-1}}{6} h_{k-1} + d_{k-1}.    (8.27)
Using the fact that S_k'(x_k) = S_{k-1}'(x_k), we derive the equations for \{m_k\}:

    \frac{h_{k-1}}{6} m_{k-1} + \frac{h_{k-1} + h_k}{3} m_k + \frac{h_k}{6} m_{k+1} = d_k - d_{k-1}, \quad k = 1, 2, \ldots, n-1.    (8.28)

Together with m_0 = m_n = 0 (the natural boundary conditions), this is a tridiagonal, strictly diagonally dominant linear system, hence it admits unique solutions m_1, m_2, \ldots, m_{n-1}. This proves the unique existence of the natural cubic spline function for any given set of observation data.
Recovering the coefficients of the natural cubic spline function. Using the values of m_1, m_2, \ldots, m_{n-1}, we can recover the coefficients of S_k(x) in (8.19). To do so, we first easily know, by differentiating S_k(x) and evaluating at x_k, that

    s_{k,0} = f_k, \quad s_{k,2} = \frac{m_k}{2}, \quad s_{k,3} = \frac{m_{k+1} - m_k}{6 h_k},    (8.29)

and using s_{k,1} = S_k'(x_k) and (8.26), we get

    s_{k,1} = d_k - \frac{h_k (2 m_k + m_{k+1})}{6}.
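The whole construction, solving the tridiagonal system (8.28) with m_0 = m_n = 0 and then recovering the coefficients s_{k,j}, can be sketched in Python (an illustrative sketch, not part of the original notes; the Thomas algorithm used for the tridiagonal solve is our choice of a standard method):

```python
def natural_cubic_spline(xs, fs):
    """Return (m, coeffs): m_k = S''(x_k) with m_0 = m_n = 0, and the
    per-interval coefficients (s_{k,0}, s_{k,1}, s_{k,2}, s_{k,3})."""
    n = len(xs) - 1
    h = [xs[k + 1] - xs[k] for k in range(n)]
    d = [(fs[k + 1] - fs[k]) / h[k] for k in range(n)]
    m = [0.0] * (n + 1)
    if n > 1:
        # Tridiagonal system for the interior unknowns m_1, ..., m_{n-1}:
        # h[k-1]/6 m_{k-1} + (h[k-1]+h[k])/3 m_k + h[k]/6 m_{k+1} = d[k] - d[k-1]
        a = [h[k - 1] / 6.0 for k in range(1, n)]           # sub-diagonal
        b = [(h[k - 1] + h[k]) / 3.0 for k in range(1, n)]  # diagonal
        c = [h[k] / 6.0 for k in range(1, n)]               # super-diagonal
        r = [d[k] - d[k - 1] for k in range(1, n)]
        # Thomas algorithm: forward elimination, then back substitution.
        for i in range(1, n - 1):
            w = a[i] / b[i - 1]
            b[i] -= w * c[i - 1]
            r[i] -= w * r[i - 1]
        m[n - 1] = r[-1] / b[-1]
        for i in range(n - 3, -1, -1):
            m[i + 1] = (r[i] - c[i] * m[i + 2]) / b[i]
    coeffs = [(fs[k],
               d[k] - h[k] * (2 * m[k] + m[k + 1]) / 6.0,   # s_{k,1}
               m[k] / 2.0,                                  # s_{k,2}
               (m[k + 1] - m[k]) / (6.0 * h[k]))            # s_{k,3}
              for k in range(n)]
    return m, coeffs

# Data (0,0), (1,1), (2,0): equation (2/3) m_1 = -2 gives m_1 = -3.
m, coeffs = natural_cubic_spline([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
print(m)  # -> [0.0, -3.0, 0.0]
```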
An important property of cubic spline interpolations. Suppose f \in C^2[a, b] and S(x) is a cubic spline interpolation of f at the nodes a = x_0 < x_1 < \cdots < x_n = b satisfying either

    S''(a) = 0, \quad S''(b) = 0,

or

    S'(a) = f'(a), \quad S'(b) = f'(b).

Then it holds that

    \int_a^b \{S''(x)\}^2 dx \le \int_a^b \{f''(x)\}^2 dx.

Proof. Consider the difference g(x) = f(x) - S(x); then we know g(x_i) = 0 for i = 0, 1, \ldots, n, and

    \int_a^b \{f''(x)\}^2 dx = \int_a^b \{S''(x)\}^2 dx + \int_a^b \{g''(x)\}^2 dx + 2 \int_a^b S''(x) g''(x) dx.

The desired estimate follows once we show the last integral vanishes.
To see this, using integration by parts, the given conditions in the theorem, and the fact that S'''(x) is a constant, say c_i, on [x_{i-1}, x_i], we derive

    \int_a^b S''(x) g''(x) dx = \sum_{i=1}^n \int_{x_{i-1}}^{x_i} S''(x) g''(x) dx
    = \sum_{i=1}^n \big\{ (S'' g')(x_i) - (S'' g')(x_{i-1}) \big\} - \sum_{i=1}^n \int_{x_{i-1}}^{x_i} S'''(x) g'(x) dx
    = \big\{ (S'' g')(b) - (S'' g')(a) \big\} - \sum_{i=1}^n c_i \int_{x_{i-1}}^{x_i} g'(x) dx
    = 0 - \sum_{i=1}^n c_i \{ g(x_i) - g(x_{i-1}) \} = 0,

where the boundary term vanishes under either set of conditions (S'' = 0 at a and b, or g' = 0 at a and b), and g(x_i) = 0 at all the nodes. □
From the previous important property of cubic spline interpolations, we can immediately see that

    \int_a^b \{S''(x)\}^2 dx \le \int_a^b \{h''(x)\}^2 dx \quad \forall h \in V(f),

where V(f) denotes the set of all C^2[a, b] functions interpolating the same data (and satisfying the same boundary conditions). In other words, among all such interpolants, the cubic spline is the one for which the curvature measure

    \int_a^b S''(x)^2 dx

is minimized.
8.11 Hermite interpolation

Suppose we are given the observation data of a function f(x) together with its derivative values:

    x     | x_0   x_1   x_2   ...  x_n
    f(x)  | f_0   f_1   f_2   ...  f_n
    f'(x) | f_0'  f_1'  f_2'  ...  f_n'    (8.30)

We look for a polynomial H(x) of degree 2n+1 such that

    H(x_i) = f_i, \quad H'(x_i) = f_i', \quad i = 0, 1, \ldots, n.    (8.31)

Let l_0(x), l_1(x), \ldots, l_n(x) be the Lagrange basis functions in (8.7) associated with the set of nodal points x_0, x_1, \ldots, x_n; then we define

    u_i(x) = \big[1 - 2 l_i'(x_i)(x - x_i)\big]\, l_i^2(x), \quad v_i(x) = (x - x_i)\, l_i^2(x).

One can easily show that

1. u_i(x) and v_i(x) are all polynomials of degree 2n+1;
2. u_i(x_j) = \delta_{ij} and v_i(x_j) = 0 for all j;
3. u_i'(x_j) = 0 for all j, and v_i'(x_j) = \delta_{ij}.

Using these results, we can directly verify that the following polynomial

    H(x) = \sum_{i=0}^n f_i u_i(x) + \sum_{i=0}^n f_i' v_i(x)

is a polynomial of degree 2n+1 such that all the conditions in (8.31) are satisfied. This polynomial is called the Hermite interpolation.
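The basis construction of H(x) can be sketched directly (an illustrative sketch, not part of the original notes; the function name is ours, and l_i'(x_i) is computed from the known identity l_i'(x_i) = \sum_{j \ne i} 1/(x_i - x_j)):

```python
def hermite_interpolate(xs, fs, dfs, x):
    """Evaluate H(x) = sum f_i u_i(x) + sum f'_i v_i(x), with
    u_i = [1 - 2 l_i'(x_i)(x - x_i)] l_i(x)^2 and v_i = (x - x_i) l_i(x)^2."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        li = 1.0   # Lagrange basis l_i evaluated at x
        dli = 0.0  # l_i'(x_i) = sum over j != i of 1 / (x_i - x_j)
        for j in range(n):
            if j != i:
                li *= (x - xs[j]) / (xs[i] - xs[j])
                dli += 1.0 / (xs[i] - xs[j])
        ui = (1.0 - 2.0 * dli * (x - xs[i])) * li * li
        vi = (x - xs[i]) * li * li
        total += fs[i] * ui + dfs[i] * vi
    return total

# f(x) = x^3 with data at 0 and 1 (degree 2n+1 = 3, so H reproduces f).
print(hermite_interpolate([0.0, 1.0], [0.0, 1.0], [0.0, 3.0], 0.5))  # -> 0.125
```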
The Hermite interpolation has the following error estimate.

Suppose f \in C^{2n+2}[a, b], and H(x) is its Hermite interpolation at the n+1 distinct points

    a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b

such that all the conditions in (8.31) are satisfied. Then for every x the following error estimate holds for some \xi(x) \in (a, b):

    f(x) - H(x) = \frac{f^{(2n+2)}(\xi(x))}{(2n+2)!} (x-x_0)^2 (x-x_1)^2 \cdots (x-x_n)^2.    (8.32)
Proof. To prove (8.32), we fix x \in (a, b). If x is a node, the result holds clearly, so we assume that x is not a node. Let

    w(t) = f(t) - H(t) - \lambda (t-x_0)^2 (t-x_1)^2 \cdots (t-x_n)^2,

where \lambda is a constant such that w(x) = 0. One can easily see that w(t) has at least n+2 zeros in [a, b]: x, x_0, x_1, \ldots, x_n. By Rolle's theorem, w'(t) has at least n+1 zeros in (a, b) separating them. But w'(t) also vanishes at all the nodal points (each x_i is a double zero of w), so w'(t) has at least 2n+2 zeros in [a, b]. Recursively using Rolle's theorem, we know that w^{(2n+2)}(t) has some zero \xi \in (a, b); since H has degree 2n+1 and \prod_{i=0}^n (t - x_i)^2 is a monic polynomial of degree 2n+2, this gives

    0 = w^{(2n+2)}(\xi) = f^{(2n+2)}(\xi) - \lambda (2n+2)!.

Hence \lambda = f^{(2n+2)}(\xi)/(2n+2)!, and w(x) = 0 proves the desired error estimate (8.32). □
9 Numerical integration

Approximations of integrals are widely encountered in real applications. Many important physical quantities can be represented by integrals, e.g., mass, concentrations, heat flux, heat sources and so on.

In this section we shall discuss how to approximate a given integral on an interval. The approximation of integrals in higher dimensions can be reduced to integrals on intervals.
Given a function f(x) on an interval [a, b], we are now going to discuss how to approximate the integral

    \int_a^b f(x) dx.

The same ideas apply to higher dimensional integrals such as

    \int_a^b \int_c^d f(x, y)\, dx\, dy \quad \text{and} \quad \int_a^b \int_c^d \int_e^f f(x, y, z)\, dx\, dy\, dz.

Evaluating an integral may not be an easy job, possibly much more difficult than evaluating derivatives. For example, one can easily find the derivatives of e^{x^2} and e^{x^3}, but their integrals are not simple; in fact, their antiderivatives cannot be expressed in closed form.

However, we often need to know the value of an integral in real applications. If the integral is difficult to compute, do we have some way to find its approximate value?
Yes. The way to compute an integral approximately is called numerical integration, or a quadrature rule.
9.1 Simple quadrature rules

Geometrically, the integral

    \int_a^b f(x) dx

is the area under the curve y = f(x) between the lines x = a and x = b. When a is close to b, one may approximate this area by the area of some simple geometric domain.

Rectangular rules. If we use the area of the rectangle with base [a, b] and height f(a) or f(b), then we get two simple quadrature rules:

    \int_a^b f(x) dx \approx (b-a) f(a), \quad \int_a^b f(x) dx \approx (b-a) f(b).

Trapezoidal rule. If we use instead the area of the trapezoid with base [a, b] and parallel sides f(a) and f(b), we obtain

    \int_a^b f(x) dx \approx \frac{b-a}{2} (f(a) + f(b)).
Let us now try to understand how good this approximation is. To see the error of the trapezoidal rule, we first consider the Lagrange interpolation of f(x) at the two points a and b:

    L(x) = \frac{x-b}{a-b} f(a) + \frac{x-a}{b-a} f(b);

then we have

    \int_a^b L(x) dx = \int_a^b \Big( \frac{x-b}{a-b} f(a) + \frac{x-a}{b-a} f(b) \Big) dx
                     = \frac{f(a)}{a-b} \Big(-\frac{1}{2}\Big)(b-a)^2 + \frac{f(b)}{b-a} \cdot \frac{1}{2} (b-a)^2
                     = \frac{b-a}{2} (f(a) + f(b)).

This is exactly the same as the trapezoidal rule. So the error of the trapezoidal rule can be transferred to the error of the Lagrange interpolation: using (8.17), and the fact that (x-a)(x-b) does not change sign on [a, b] so the mean value theorem for integrals applies,

    \int_a^b f(x) dx - \frac{b-a}{2}(f(a) + f(b)) = \int_a^b (f(x) - L(x)) dx
    = \frac{1}{2} \int_a^b f''(\xi_x)(x-a)(x-b) dx
    = \frac{1}{2} f''(\xi) \int_a^b (x-a)(x-b) dx
    = -\frac{1}{12} f''(\xi)(b-a)^3.    (9.1)

So if the size |b-a| is small and f''(\xi) is not large on [a, b], then the trapezoidal rule can be very accurate. This error formula also indicates that the trapezoidal rule is exact for all linear polynomials.
9.2 Composite trapezoidal rule

The trapezoidal rule and its error can be summarized as

    \int_a^b f(x) dx - T(a, b; f) = -\frac{1}{12} f''(\xi)(b-a)^3, \quad T(a, b; f) = \frac{b-a}{2}(f(a) + f(b)).

We see that if the interval [a, b] is not small, the error of the trapezoidal rule can be very large, so the approximation is not accurate enough.

To derive a more accurate approximation, we divide [a, b] into n equally-spaced subintervals using the points

    a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b.

Let h = \frac{b-a}{n}, so that x_i = a + ih, i = 0, 1, \ldots, n, and approximate each \int_{x_{i-1}}^{x_i} f(x) dx by the trapezoidal rule.
Then using

    \int_a^b f(x) dx = \sum_{i=1}^n \int_{x_{i-1}}^{x_i} f(x) dx \approx \sum_{i=1}^n \frac{h}{2} (f(x_{i-1}) + f(x_i)),

we obtain the composite trapezoidal rule.
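A minimal implementation of the composite trapezoidal rule (an illustrative sketch, not part of the original notes; the function name is ours):

```python
import math

def composite_trapezoid(f, a, b, n):
    """Sum of h/2 * (f(x_{i-1}) + f(x_i)) over n equal subintervals;
    the error behaves like O(h^2) by the estimate (9.3)."""
    h = (b - a) / n
    # Interior points are shared by two trapezoids, hence full weight.
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return h * s

# sin on [0, 1]: the exact integral is 1 - cos(1).
exact = 1.0 - math.cos(1.0)
err = abs(composite_trapezoid(math.sin, 0.0, 1.0, 1000) - exact)
```

Halving h should reduce the error by roughly a factor of 4, matching the O(h^2) behaviour derived below.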
Letting x_{i-1/2} = x_{i-1} + h/2, then using integration by parts on each subinterval we derive the following error representation:

    E_h(f) = \int_a^b f(x) dx - \sum_{i=1}^n \frac{h}{2}(f(x_{i-1}) + f(x_i))
           = -\sum_{i=1}^n \int_{x_{i-1}}^{x_i} (x - x_{i-1/2}) f'(x) dx,    (9.2)

which immediately gives the bound

    |E_h(f)| \le \frac{h}{2} \int_a^b |f'(x)| dx.

If moreover \int_a^b |f''(x)| dx < \infty, a second integration by parts yields

    E_h(f) = \sum_{i=1}^n \Big\{ \frac{1}{2} \int_{x_{i-1}}^{x_i} (x - x_{i-1/2})^2 f''(x) dx - \frac{1}{8} h^2 \big(f'(x_i) - f'(x_{i-1})\big) \Big\}
           = \frac{1}{2} \sum_{i=1}^n \int_{x_{i-1}}^{x_i} \Big\{ (x - x_{i-1/2})^2 - \frac{h^2}{4} \Big\} f''(x) dx,

where we used \sum_{i=1}^n (f'(x_i) - f'(x_{i-1})) = \int_a^b f''(x) dx = \sum_{i=1}^n \int_{x_{i-1}}^{x_i} f''(x) dx. Since (x - x_{i-1/2})^2 - h^2/4 does not change sign on each subinterval, the mean value theorem gives

    E_h(f) = \frac{1}{2} \sum_{i=1}^n f''(\eta_i) \int_{x_{i-1}}^{x_i} \Big\{ (x - x_{i-1/2})^2 - \frac{h^2}{4} \Big\} dx
           = -\sum_{i=1}^n \frac{1}{12} f''(\eta_i)(x_i - x_{i-1})^3 = -\sum_{i=1}^n \frac{1}{12} f''(\eta_i) h^3 = -\frac{h^3}{12} \sum_{i=1}^n f''(\eta_i).
Using the mean value theorem, there exists a point \xi \in [a, b] such that

    f''(\xi) = \frac{1}{n} \big( f''(\eta_1) + f''(\eta_2) + \cdots + f''(\eta_n) \big),

hence we derive the error estimate for the composite trapezoidal rule:

    E_h(f) = -\frac{n h^3}{12} f''(\xi) = -\frac{b-a}{12} f''(\xi) h^2.    (9.3)
Example. Determine the mesh size h so that the error of the composite trapezoidal rule for computing the integral \int_0^1 \sin x\, dx is not bigger than 10^{-6}.

Solution. We know the error

    E_h(f) = -\frac{b-a}{12} f''(\xi) h^2 = \frac{1}{12} \sin(\xi)\, h^2,

and since |\sin \xi| \le 1, it suffices to require

    \frac{1}{12} h^2 \le 10^{-6},

i.e. h \le \sqrt{12} \times 10^{-3} \approx 3.46 \times 10^{-3}.
9.3 Newton-Cotes rules

Recall the composite trapezoidal rule and its error estimate (9.3):

    \int_a^b f(x) dx - \sum_{i=1}^n \frac{h}{2}(f(x_{i-1}) + f(x_i)) = -\frac{b-a}{12} f''(\xi) h^2.

Since f'' \equiv 0 for linear f, the trapezoidal rule gives the exact value of the integral for any linear polynomial f(x), that is,

    \int_a^b f(x) dx = \sum_{i=1}^n \frac{h}{2}(f(x_{i-1}) + f(x_i)).

Can we find more accurate quadrature rules which are exact for polynomials of higher degree? Below, we shall try to construct such quadrature rules.
For a set of given points

    x_0, \ x_1, \ x_2, \ \ldots, \ x_n

in the interval [a, b], we would like to find \omega_0, \omega_1, \ldots, \omega_n such that for any polynomial f of degree \le n, we have

    \int_a^b f(x) dx = \omega_0 f(x_0) + \omega_1 f(x_1) + \cdots + \omega_n f(x_n).    (9.2)

Writing such an f in the Lagrange form f(x) = \sum_{i=0}^n f(x_i) l_i(x) and integrating shows that the weights must be

    \omega_i = \int_a^b l_i(x) dx, \quad i = 0, 1, \ldots, n,    (9.3)

where

    l_i(x) = \prod_{j=0, j \ne i}^{n} \frac{x - x_j}{x_i - x_j}.
9.3.1 Computing the Newton-Cotes coefficients

Let us see how to compute the coefficients of the Newton-Cotes rule using (9.3).

Case n = 1 (x_0 = a, x_1 = b). We have

    \omega_0 = \int_a^b l_0(x) dx = \int_a^b \frac{x-b}{a-b} dx = \frac{1}{2}(b-a),

    \omega_1 = \int_a^b l_1(x) dx = \int_a^b \frac{x-a}{b-a} dx = \frac{1}{2}(b-a),

so we get the Newton-Cotes rule for n = 1:

    \int_a^b f(x) dx \approx \frac{b-a}{2} (f(a) + f(b)),

which is exactly the trapezoidal rule.
Case n = 2 (x_0 = a, x_1 = (a+b)/2, x_2 = b). With h = b - a we have

    \omega_0 = \int_a^b l_0(x) dx = \int_a^b \frac{(x-x_1)(x-b)}{(a-x_1)(a-b)} dx
             = \frac{2}{(b-a)^2} \int_a^b (x - b + h/2)(x - b) dx = \frac{1}{6}(b-a),

    \omega_1 = \int_a^b l_1(x) dx = \int_a^b \frac{(x-a)(x-b)}{(x_1-a)(x_1-b)} dx
             = -\frac{4}{(b-a)^2} \int_a^b (x-a)(x-b) dx = \frac{4}{6}(b-a),

    \omega_2 = \int_a^b l_2(x) dx = \int_a^b \frac{(x-a)(x-x_1)}{(b-a)(b-x_1)} dx = \frac{1}{6}(b-a),

so we obtain the Newton-Cotes rule for n = 2:

    \int_a^b f(x) dx \approx \frac{b-a}{6} \Big( f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big).
One can easily see that it is very technical and lengthy to compute the coefficients \omega_i of the Newton-Cotes rules using (9.3) for larger n. Usually we may calculate the coefficients \omega_i for larger n directly by the definition:

    The quadrature rule (9.2) holds for all polynomials of degree \le n.

Example. Find the Newton-Cotes rules (9.2) when n = 1 and n = 2.

Solution. When n = 1, we need two points x_0 and x_1. Let us consider x_0 = a and x_1 = b. As the quadrature rule

    \int_a^b f(x) dx = \omega_0 f(a) + \omega_1 f(b)

must be exact for f(x) = 1 and f(x) = x, we get the two equations

    \omega_0 + \omega_1 = b - a, \quad \omega_0\, a + \omega_1\, b = \frac{b^2 - a^2}{2},

whose solution is \omega_0 = \frac{b-a}{2}, \omega_1 = \frac{b-a}{2}, recovering the trapezoidal rule

    \int_a^b f(x) dx \approx \frac{b-a}{2}(f(a) + f(b)).

When n = 2, one can proceed similarly; the computations become simpler if the exactness is imposed on the basis 1, x - a, (x-a)(x-b) instead of 1, x, x^2.
9.4 Simpson's rule

Simpson's rule is one of the most important quadrature rules due to its nice properties. In this subsection we shall introduce three different methods to derive the error estimates of Simpson's rule:

    \int_a^b f(x) dx \approx \frac{b-a}{6} \Big\{ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big\}.    (9.4)
Method 1. Let \bar{x} = (a+b)/2 and h = b - a. Using the fact that the rule (9.4) is exact for all polynomials of degree \le 2, the error can be expressed through the cubic interpolation remainder on the two half intervals:

    \int_a^b f(x) dx - \frac{b-a}{6}\Big\{ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big\}
    = \frac{1}{6} f^{(3)}(\xi_1) \int_a^{\bar{x}} (x-a)(x-b)\Big(x - \frac{a+b}{2}\Big) dx
    + \frac{1}{6} f^{(3)}(\xi_2) \int_{\bar{x}}^b (x-a)(x-b)\Big(x - \frac{a+b}{2}\Big) dx
    = \frac{h^4}{384} \big( f^{(3)}(\xi_1) - f^{(3)}(\xi_2) \big).
Method 2. The second approach is to use the Taylor expansion. Let h = b - a and

    F(x) = \int_a^x f(t) dt,

and expand F(b) = F(a + h) around a. On the other hand, we can expand each term on the right-hand side of (9.4) to get

    f(a) = f(a),    (9.6)

    f\Big(\frac{a+b}{2}\Big) = f(a) + \frac{h}{2} f'(a) + \Big(\frac{h}{2}\Big)^2 \frac{f''(a)}{2} + \Big(\frac{h}{2}\Big)^3 \frac{f'''(a)}{6} + \Big(\frac{h}{2}\Big)^4 \frac{f^{(4)}(\eta_1)}{24},    (9.7)

    f(b) = f(a) + h f'(a) + \frac{h^2}{2} f''(a) + \frac{h^3}{6} f'''(a) + \frac{h^4}{24} f^{(4)}(\eta_2).    (9.8)
Combining these expansions, one arrives at

    \int_a^b f(x) dx - \frac{b-a}{6}\Big\{ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big\}
    = \frac{h^5}{120} f^{(4)}(\xi) - \frac{5 h^5}{576} f^{(4)}(\eta)
    = \frac{h^5}{120} \Big( f^{(4)}(\xi) - \frac{25}{24} f^{(4)}(\eta) \Big).
Method 3. The third approach writes the error with the kernel (x-a)(x-b)\big(x - \frac{a+b}{2}\big)^2, which does not change sign on [a, b], so the mean value theorem applies. With h = (b-a)/2 and the substitution y = x - \frac{a+b}{2},

    \int_a^b f(x) dx - \frac{b-a}{6}\Big\{ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big\}
    = \int_a^b \frac{f^{(4)}(\xi_x)}{4!} (x-a)(x-b)\Big(x - \frac{a+b}{2}\Big)^2 dx
    = \frac{f^{(4)}(\eta)}{4!} \int_{-h}^{h} (y+h)\, y^2\, (y-h)\, dy
    = -\Big(\frac{b-a}{2}\Big)^5 \frac{f^{(4)}(\eta)}{90}.
9.5 Composite Simpson's rule

As with the trapezoidal rule, we divide [a, b] into n equally-spaced subintervals,

    x_i = a + ih, \quad h = \frac{b-a}{n},

and use Simpson's rule on each subinterval. This gives the following composite Simpson's rule:

    \int_a^b f(x) dx = \sum_{i=1}^n \int_{x_{i-1}}^{x_i} f(x) dx
    = \int_{x_0}^{x_1} f(x) dx + \int_{x_1}^{x_2} f(x) dx + \cdots + \int_{x_{n-1}}^{x_n} f(x) dx
    \approx \frac{h}{6} \sum_{i=1}^n \Big\{ f(x_{i-1}) + 4 f\Big(\frac{x_{i-1} + x_i}{2}\Big) + f(x_i) \Big\}.
The error estimate of the composite Simpson's rule is a direct application of the error estimate of Simpson's rule:

    \int_a^b f(x) dx - \frac{h}{6} \sum_{i=1}^n \Big\{ f(x_{i-1}) + 4 f\Big(\frac{x_{i-1}+x_i}{2}\Big) + f(x_i) \Big\}
    = \sum_{i=1}^n \Big[ \int_{x_{i-1}}^{x_i} f(x) dx - \frac{h}{6}\Big\{ f(x_{i-1}) + 4 f\Big(\frac{x_{i-1}+x_i}{2}\Big) + f(x_i) \Big\} \Big]
    = -\sum_{i=1}^n \Big(\frac{h}{2}\Big)^5 \frac{f^{(4)}(\eta_i)}{90} = -\frac{(b-a) h^4}{2880} f^{(4)}(\eta),

where \eta lies in the interval [a, b].
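A minimal implementation of the composite Simpson's rule (an illustrative sketch, not part of the original notes; the function name is ours):

```python
import math

def composite_simpson(f, a, b, n):
    """h/6 * sum of f(x_{i-1}) + 4 f(midpoint) + f(x_i); error O(h^4)."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x0, x1 = a + i * h, a + (i + 1) * h
        total += f(x0) + 4.0 * f(0.5 * (x0 + x1)) + f(x1)
    return h * total / 6.0

# sin on [0, 1]: the exact integral is 1 - cos(1).
exact = 1.0 - math.cos(1.0)
err = abs(composite_simpson(math.sin, 0.0, 1.0, 10) - exact)
```

Halving h should now reduce the error by roughly a factor of 16, matching the O(h^4) estimate above.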
9.6 Gaussian quadrature rules

A quadrature rule with n+1 nodes that is exact for all polynomials of degree \le 2n+1 is called a Gaussian quadrature rule. For two nodes on [-1, 1], imposing exactness on 1, x, x^2, x^3 yields the weights and nodes

    \omega_0 = \omega_1 = 1, \quad x_0 = -\frac{1}{\sqrt{3}}, \quad x_1 = \frac{1}{\sqrt{3}},

i.e. the two-point Gaussian rule

    \int_{-1}^{1} f(x) dx \approx f\Big(-\frac{1}{\sqrt{3}}\Big) + f\Big(\frac{1}{\sqrt{3}}\Big).    (9.10)
9.7 Quadrature rules on general intervals

Suppose we are given a quadrature rule on an interval [a, b]:

    \int_a^b f(t) dt \approx \sum_{i=0}^n \omega_i f(t_i).

(Quadrature rules are usually given on [-1, 1] or [0, 1].) Suppose we want to use it to approximate the following integral on another interval [c, d]:

    \int_c^d g(s) ds.

The affine change of variables

    t = a + \frac{b-a}{d-c}(s - c) \iff s = c + \frac{d-c}{b-a}(t - a)

gives

    \int_c^d g(s) ds = \frac{d-c}{b-a} \int_a^b g\Big( c + \frac{d-c}{b-a}(t-a) \Big) dt \approx \frac{d-c}{b-a} \sum_{i=0}^n \omega_i\, g(s_i), \quad s_i = c + \frac{d-c}{b-a}(t_i - a).

Example. To approximate \int_{-10}^{6} e^{-t^2/2} dt by the two-point Gaussian rule (9.10) on [a, b] = [-1, 1], we have (d-c)/(b-a) = 16/2 = 8 and

    s_i = -10 + 8(t_i + 1)

for i = 0, 1. Hence we derive the following Gaussian quadrature rule on [-10, 6]:

    \int_{-10}^{6} e^{-t^2/2} dt \approx 8 e^{-[-10 + 8(-1/\sqrt{3} + 1)]^2/2} + 8 e^{-[-10 + 8(1/\sqrt{3} + 1)]^2/2} \approx 0.259.
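The transformed two-point Gaussian rule can be sketched as follows (an illustrative sketch, not part of the original notes; the function name is ours), reproducing the 0.259 value above:

```python
import math

def gauss2(g, c, d):
    """Two-point Gaussian rule mapped from [-1, 1] to [c, d]:
    nodes t = -+1/sqrt(3) with weights 1 become s = c + (d-c)(t+1)/2,
    and the integral picks up the common factor (d-c)/2."""
    half = 0.5 * (d - c)
    t = 1.0 / math.sqrt(3.0)
    return half * (g(c + half * (1.0 - t)) + g(c + half * (1.0 + t)))

# The example from the text: integral of exp(-t^2/2) over [-10, 6].
approx = gauss2(lambda t: math.exp(-0.5 * t * t), -10.0, 6.0)
print(approx)  # close to 0.259
```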
9.8 Construction of Gaussian quadrature rules

The construction is based on the Legendre polynomials, which are orthogonal on [-1, 1]; in monic form the first few are

    P_1(x) = x, \quad P_2(x) = x^2 - \frac{1}{3}, \quad P_3(x) = x^3 - \frac{3}{5} x, \quad \ldots
Then the following technique shows a general way to construct the Gaussian quadrature rules.
Let \{x_0, x_1, \ldots, x_n\} be the roots of the polynomial P_{n+1}(x), and set

    w_i = \int_{-1}^{1} l_i(x) dx = \int_{-1}^{1} \prod_{j=0, j \ne i}^{n} \frac{x - x_j}{x_i - x_j} dx, \quad i = 0, 1, 2, \ldots, n.    (9.11)

Then

    \int_{-1}^{1} f(x) dx = \sum_{i=0}^{n} w_i f(x_i)    (9.12)

holds for all polynomials of degree less than or equal to 2n+1, so (9.12) is a Gaussian quadrature rule.
We next prove that (9.12) is a Gaussian quadrature rule. Let f(x) be an arbitrary polynomial of degree \le 2n+1; then there exist two polynomials q and r of degree \le n such that f(x) = q(x) P_{n+1}(x) + r(x). Since P_{n+1}(x_i) = 0, we have f(x_i) = r(x_i), so by the Lagrange interpolation we know

    r(x) = \sum_{i=0}^{n} l_i(x)\, r(x_i) = \sum_{i=0}^{n} l_i(x)\, f(x_i).

Using the orthogonality of P_{n+1} to all polynomials of degree \le n, we then obtain

    \int_{-1}^{1} f(x) dx = \int_{-1}^{1} q(x) P_{n+1}(x) dx + \int_{-1}^{1} r(x) dx
    = 0 + \sum_{i=0}^{n} f(x_i) \int_{-1}^{1} \prod_{j=0, j \ne i}^{n} \frac{x - x_j}{x_i - x_j} dx
    = \sum_{i=0}^{n} w_i f(x_i)

by the definition of w_i. □
With \{w_i\} and \{x_i\} chosen as in (9.11) and (9.12), we can obtain some properties of the Gaussian quadrature rule and its accuracy:

Let \Omega_n(x) = \prod_{i=0}^{n} (x - x_i)^2, and let \{w_i\} and \{x_i\} be defined as in (9.11) and (9.12). Then we have

    w_i > 0, \quad i = 0, 1, \ldots, n,    (9.13)

and, for some \eta \in (-1, 1),

    \int_{-1}^{1} f(x) dx - \sum_{i=0}^{n} w_i f(x_i) = \frac{f^{(2n+2)}(\eta)}{(2n+2)!} \int_{-1}^{1} \Omega_n(x) dx.    (9.14)

To see (9.13), apply (9.12) to the polynomial

    l_i^2(x) = \prod_{j=0, j \ne i}^{n} \frac{(x - x_j)^2}{(x_i - x_j)^2},

which has degree 2n: since l_i^2(x_j) = \delta_{ij},

    w_i = \sum_{j=0}^{n} w_j\, l_i^2(x_j) = \int_{-1}^{1} l_i^2(x) dx > 0.
To derive the error estimate (9.14), we let p(x) be the Hermite interpolation polynomial of degree 2n+1 such that p(x_i) = f(x_i) and p'(x_i) = f'(x_i), i = 0, 1, \ldots, n. Then we have

    \int_{-1}^{1} p(x) dx = \sum_{j=0}^{n} w_j\, p(x_j) = \sum_{j=0}^{n} w_j\, f(x_j).

Using the error estimate (8.32) of the Hermite interpolation polynomial, we know

    f(x) - p(x) = \frac{f^{(2n+2)}(\xi(x))}{(2n+2)!} \Omega_n(x).    (9.15)

Integrating (9.15) over [-1, 1] and noting that \Omega_n(x) \ge 0, the mean value theorem gives

    \int_{-1}^{1} f(x) dx - \sum_{j=0}^{n} w_j f(x_j) = \int_{-1}^{1} (f(x) - p(x)) dx = \frac{f^{(2n+2)}(\eta)}{(2n+2)!} \int_{-1}^{1} \Omega_n(x) dx,

which is exactly (9.14). □
9.9 Accuracy comparison of the quadrature rules

We now compute a specific integral, \int_0^1 e^{-t^2} dt, using several different quadrature rules, to better understand the actual accuracy of each quadrature rule. The correct value of the integral is 0.7468241328124 up to 13 digits.
9.9.1 Trapezoidal rule

For example, with n = 4 subintervals (h = 1/4), the composite trapezoidal rule gives

    \int_0^1 e^{-t^2} dt \approx \frac{1}{4} \cdot \frac{1}{2}(e^{-0^2} + e^{-0.25^2}) + \frac{1}{4} \cdot \frac{1}{2}(e^{-0.25^2} + e^{-0.5^2})
    + \frac{1}{4} \cdot \frac{1}{2}(e^{-0.5^2} + e^{-0.75^2}) + \frac{1}{4} \cdot \frac{1}{2}(e^{-0.75^2} + e^{-1^2})
    = \frac{1}{4}\Big( \frac{e^{-0^2}}{2} + e^{-0.25^2} + e^{-0.5^2} + e^{-0.75^2} + \frac{e^{-1^2}}{2} \Big).

Using the rule for different n, we have

    n    h       error      ratio
    1    1       0.0629     --
    2    0.5     0.0155     4.06
    4    0.25    0.00384    4.04
    8    0.125   0.000959   4.004

It is clear from the ratio of the errors that they are decreasing like O(h^2): as h is halved, the error is decreased by a factor of 4.

From the trend, we can estimate how many subintervals we need in order to compute the integral correctly up to 13 digits. Since

    0.0629 \cdot \frac{1}{4^m} \approx 10^{-13} \implies m \approx 24.6,

we need n = 2^{24.6} \approx 10^{7.4}, i.e. about 25 million subintervals and hence about 25 million function evaluations.
9.9.2 Simpson's rule

Recall the composite Simpson's rule

    \int_0^1 f(x) dx \approx \frac{h}{6} \sum_{i=1}^{n} \Big\{ f(x_{i-1}) + 4 f\Big(\frac{x_{i-1}+x_i}{2}\Big) + f(x_i) \Big\}.

For example, if we use 2 subintervals [0, 0.5] and [0.5, 1], then

    \int_0^1 e^{-t^2} dt \approx \frac{1}{2} \cdot \frac{1}{6}(e^{-0^2} + 4 e^{-0.25^2} + e^{-0.5^2}) + \frac{1}{2} \cdot \frac{1}{6}(e^{-0.5^2} + 4 e^{-0.75^2} + e^{-1^2})
    = \frac{1}{12}\big( e^{-0^2} + 4 e^{-0.25^2} + 2 e^{-0.5^2} + 4 e^{-0.75^2} + e^{-1^2} \big).

Again notice that the factor 1/2 is the width of the subintervals.
Using the formula for different n, we have

    n    h       error      ratio
    1    1       3.56(-4)   --
    2    0.5     3.12(-5)   11.40
    4    0.25    1.99(-6)   15.72
    8    0.125   1.24(-7)   15.95

(3.56(-4) means 3.56 \times 10^{-4}.) It is clear from the ratio of the errors that they are decreasing like O(h^4): as h is halved, the error is decreased by a factor of 16.

From the trend, we can estimate how many subintervals we need in order to compute the integral correctly up to 13 digits. Since

    \frac{1}{16^m} \approx 10^{-13} \implies m \approx 12.3,

we need n = 2^{12.3} \approx 10^{3.7}, i.e. about 5,000 subintervals and hence about 10,000 function evaluations (in each subinterval, we need two new function evaluations). So the composite Simpson's rule is already about 2,500 times faster than the composite trapezoidal rule.

But can we get faster algorithms than the composite Simpson's rule?
9.9.3 Gaussian rules

Yes, we can use the Gaussian quadrature rules to evaluate the integral. The results are:

    No. of points   error
    2               2.29(-4)
    3               9.55(-6)
    4               3.35(-7)
    ...             ...
    7               7.88(-13)

It should be emphasized that we just use the simple Gaussian quadrature rules on the interval [0, 1], not any composite rules (i.e., we didn't divide the interval into subintervals). We see that the error is already about 10^{-13} when we use the 7-point Gaussian formula to evaluate \int_0^1 e^{-t^2} dt. This requires only 7 function evaluations. It is thus about 1,400 times faster than the composite Simpson's rule and about 3.5 million times faster than the composite trapezoidal rule.
10 Numerical differentiation

10.1 Aim of numerical differentiation

Given the values of a function f(x) at a set of equally-spaced points

    x_i = x_0 + ih, \quad i = 0, 1, \ldots, n,

we discuss how to approximate the derivatives f'(x_i) and f''(x_i) from these function values.

10.2 Forward and backward differences
By Taylor expansion, we have

    f(x_{i+1}) = f(x_i) + h f'(x_i) + \frac{h^2}{2} f''(\xi_1), \quad \xi_1 \in (x_i, x_{i+1}),

then

    \frac{f(x_{i+1}) - f(x_i)}{h} = f'(x_i) + \frac{h}{2} f''(\xi_1).

So if the mesh size h is small, we can use the approximation

    f'(x_i) \approx \frac{f(x_{i+1}) - f(x_i)}{h},    (10.16)

and the error is

    \frac{f(x_{i+1}) - f(x_i)}{h} - f'(x_i) = \frac{h}{2} f''(\xi_1).

The scheme (10.16) is called the forward difference scheme.
Similarly, by the Taylor expansion

    f(x_{i-1}) = f(x_i) - h f'(x_i) + \frac{h^2}{2} f''(\xi_2), \quad \xi_2 \in (x_{i-1}, x_i),

thus

    \frac{f(x_i) - f(x_{i-1})}{h} = f'(x_i) - \frac{h}{2} f''(\xi_2).

So if h is small, we can use the approximation

    f'(x_i) \approx \frac{f(x_i) - f(x_{i-1})}{h}.    (10.17)

This scheme (10.17) is called the backward difference scheme.
10.3 Central differences

By the Taylor expansions

    f(x_{i+1}) = f(x_i) + h f'(x_i) + \frac{h^2}{2} f''(x_i) + \frac{h^3}{6} f'''(\xi_1), \quad \xi_1 \in (x_i, x_{i+1}),

    f(x_{i-1}) = f(x_i) - h f'(x_i) + \frac{h^2}{2} f''(x_i) - \frac{h^3}{6} f'''(\xi_2), \quad \xi_2 \in (x_{i-1}, x_i),

subtracting the two expansions gives

    \frac{f(x_{i+1}) - f(x_{i-1})}{2h} = f'(x_i) + \frac{h^2}{12}\big(f'''(\xi_1) + f'''(\xi_2)\big).

So if h is small, we may use the approximation

    f'(x_i) \approx \frac{f(x_{i+1}) - f(x_{i-1})}{2h}    (10.18)

with an error

    \frac{f(x_{i+1}) - f(x_{i-1})}{2h} - f'(x_i) = \frac{h^2}{12}\big(f'''(\xi_1) + f'''(\xi_2)\big).

This scheme is called the central difference scheme.
Computing f''(x_i). We now discuss how to compute the second order derivative f''(x_i). By Taylor expansion, we have

    f(x_{i+1}) = f(x_i) + h f'(x_i) + \frac{h^2}{2} f''(x_i) + \frac{h^3}{6} f^{(3)}(x_i) + \frac{h^4}{24} f^{(4)}(\eta_1), \quad \eta_1 \in (x_i, x_{i+1}),

    f(x_{i-1}) = f(x_i) - h f'(x_i) + \frac{h^2}{2} f''(x_i) - \frac{h^3}{6} f^{(3)}(x_i) + \frac{h^4}{24} f^{(4)}(\eta_2), \quad \eta_2 \in (x_{i-1}, x_i).

Adding the two expansions gives

    \frac{f(x_{i+1}) - 2 f(x_i) + f(x_{i-1})}{h^2} = f''(x_i) + \frac{h^2}{24}\big(f^{(4)}(\eta_1) + f^{(4)}(\eta_2)\big),

so if h is small, we can use the approximation

    f''(x_i) \approx \frac{f(x_{i+1}) - 2 f(x_i) + f(x_{i-1})}{h^2}.

This scheme is called the second order central difference scheme.
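The three difference schemes can be compared numerically (an illustrative sketch, not part of the original notes; the function names are ours):

```python
import math

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h              # error O(h)

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)    # error O(h^2)

def second_central_diff(f, x, h):
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)  # error O(h^2)

# Test on f = exp, where all derivatives equal exp itself.
x, h = 1.0, 1e-4
e_fwd = abs(forward_diff(math.exp, x, h) - math.exp(x))
e_cen = abs(central_diff(math.exp, x, h) - math.exp(x))
e_sec = abs(second_central_diff(math.exp, x, h) - math.exp(x))
```

For this h, the central difference error is several orders of magnitude smaller than the forward difference error, reflecting the O(h^2) versus O(h) behaviour derived above.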