
Lecture Notes on Numerical Analysis1

(Course code: MATH3230)


(Academic Year: 2015/2016, Second Term)

Jun Zou
Department of Mathematics
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong

Mainly based on the textbooks

Numerical Analysis: Mathematics of Scientific Computing, by D. Kincaid and W. Cheney, Brooks/Cole Publishing Co., 2009

and

Afternotes on Numerical Analysis, by G. W. Stewart, SIAM, 2006
1 The lecture notes were prepared by Jun Zou, purely for the convenience of his teaching of the course Numerical Analysis. Students taking this course may use the notes as part of their reading and reference materials. This version of the lecture notes was revised and extended from a previous version, so there may still be mistakes and typos, including English grammatical and spelling errors, in the notes. Students who use the notes as reading or reference material are encouraged to report any mistakes and typos to the instructor Jun Zou so that the notes can be improved. Students are strongly recommended to consult the recommended textbooks for more exercises and for more details about the background and motivation of the different concepts and numerical methods.

Contents

1 Introduction

2 Auxiliary theorems and tools

3 Nonlinear equations of one variable
  3.1 Nonlinear equations of one variable
  3.2 Difficulties in solving nonlinear equations
  3.3 Iterative methods and rate of convergence
  3.4 Absolute and relative errors
    3.4.1 Absolute errors
    3.4.2 Relative errors
  3.5 Bisection algorithm
    3.5.1 Basic conditions
    3.5.2 Interval bisection algorithm
    3.5.3 Convergence of the bisection algorithm
  3.6 Approximation by Taylor expansion
  3.7 Newton's method
    3.7.1 Derivations of Newton's method
    3.7.2 Applications of Newton's method
    3.7.3 Local convergence analysis
    3.7.4 Quadratic convergence of Newton's method
    3.7.5 Extension of Newton's method
  3.8 Quasi-Newton methods
    3.8.1 Constant slope method
    3.8.2 Fixed-point iterative methods
    3.8.3 A numerical example
    3.8.4 Functions with multiple zeros
    3.8.5 Convergence of Newton's method in the case of multiple zeros
  3.9 The secant method

4 Systems of nonlinear equations
  4.1 Newton's method for a 2 × 2 system
  4.2 Newton's method for general nonlinear systems
  4.3 Convergence of Newton's method
  4.4 Broyden's method
  4.5 Steepest descent method

5 Solutions of linear systems of algebraic equations
  5.1 Matrices, vectors and scalars
  5.2 Theory of linear systems
  5.3 Simple solution of a linear system
  5.4 Solutions of triangular systems
  5.5 Cholesky factorization
    5.5.1 Properties of symmetric positive definite matrices
    5.5.2 Cholesky factorization
  5.6 LU factorization and Gaussian elimination
    5.6.1 Gaussian elimination
    5.6.2 LDU factorization
    5.6.3 Cholesky factorization from LDU factorization
  5.7 LU factorization with partial pivoting
    5.7.1 The necessity of pivoting
    5.7.2 LU factorization with partial pivoting
  5.8 LU factorization of upper Hessenberg and tri-diagonal matrices
  5.9 General non-square linear systems

6 Floating-point arithmetic
  6.1 Decimal and binary numbers
  6.2 Rounding errors
  6.3 Normalized scientific notation
  6.4 Accuracies in 32-bit representation
  6.5 Machine rounding
  6.6 Floating-point arithmetic
  6.7 Backward error analysis

7 Sensitivity of linear systems
  7.1 Vector and matrix norms
    7.1.1 Vector norms
    7.1.2 Matrix norms
  7.2 Relative errors
  7.3 Sensitivity of linear systems
  7.4 The condition of a linear system
  7.5 Importance of condition numbers

8 Polynomial interpolation
  8.1 What is interpolation?
  8.2 Vandermonde interpolation
  8.3 General quadratic interpolation
  8.4 Interpolation with polynomials of degree n
  8.5 Lagrange interpolation
  8.6 The Newton form of interpolation
    8.6.1 Divided differences
    8.6.2 Relations between derivatives and divided differences
    8.6.3 Symmetry of divided differences
    8.6.4 Relation between divided differences and Gaussian elimination
    8.6.5 How to compute the coefficients of the Newton form?
  8.7 Three fundamental questions in interpolation
    8.7.1 Lagrange interpolation polynomials
    8.7.2 Newton interpolation polynomials
    8.7.3 Summary of the interpolation methods
  8.8 Error estimates of polynomial interpolations
  8.9 Chebyshev polynomials
  8.10 Spline interpolation
    8.10.1 Spline interpolation of degree 0
    8.10.2 Spline interpolation of degree 1
    8.10.3 Cubic spline interpolations
    8.10.4 Existence and construction of natural cubic spline interpolations
    8.10.5 Properties of cubic splines
  8.11 Hermite's interpolation

9 Numerical integration
  9.1 Simple rules and their error estimates
  9.2 Composite trapezoidal rule and its accuracy
  9.3 Newton-Cotes quadrature rule
    9.3.1 Computing the coefficients of Newton-Cotes rules (I)
    9.3.2 Computing the coefficients of Newton-Cotes rules (II)
  9.4 Simpson's rule and its error estimates
  9.5 Composite Simpson rule and its accuracy
  9.6 Gaussian quadrature rule
  9.7 Quadrature rule from [a, b] to [c, d]
  9.8 Construction and accuracy of Gaussian quadrature rules
  9.9 Errors of different quadrature rules
    9.9.1 Trapezoidal rule
    9.9.2 Simpson's rule
    9.9.3 Gaussian rules

10 Numerical differentiation
  10.1 Aim of numerical differentiation
  10.2 Forward and backward differences
  10.3 Central differences

1 Introduction

Numerical Analysis is a fundamental branch of Computational and Applied Mathematics. In this section, we list some important topics from Numerical Analysis which will be covered in this course.
1. Nonlinear equations of one variable. We discuss how to solve nonlinear equations of one variable (in standard form):

f(x) = 0,

where f(x) is nonlinear with respect to the variable x. The solutions may be very complicated, even though the function f(x) looks simple. In many applications we may not even have an explicit expression for f(x); e.g., we may know only measured data of f(x) (such as temperature or flow velocity) at some locations or times, but we still need to find out different behaviors of f(x).

System of nonlinear equations. In many applications one may need to solve the following more general system of nonlinear equations:

f_1(x_1, x_2, ..., x_n) = 0,
f_2(x_1, x_2, ..., x_n) = 0,
......
f_n(x_1, x_2, ..., x_n) = 0,

where each f_i(x_1, x_2, ..., x_n) is a nonlinear function of the n variables x_1, x_2, ..., x_n. In general, the solutions of a system of nonlinear equations are much more complicated than the solutions of a single nonlinear equation, but many methods for equations of one variable can be generalized to systems of nonlinear equations.
2. Linear systems of algebraic equations. Linear systems are often the problems one needs to solve repeatedly, even thousands of times, during many mathematical modeling processes or physical/engineering simulations, e.g., in the numerical solution of a simple population model, or of the more complicated electromagnetic Maxwell system. We will study how to solve the following general system of linear algebraic equations:

Ax = b,

where A is an m × n matrix and b is a vector with m components.
How can we find a solution x when the matrix A is square and nonsingular? We can solve a 2 × 2 or 3 × 3 system by hand, but it is already difficult to solve a 10 × 10 system by hand, and it is almost impossible to solve a 100 × 100 system by hand. So when m and n are much larger than 1000, how can we solve the system? We have to turn to computers for help.
When m > n, the system may not have a solution. Then how can we find some possible solutions which are physically or practically meaningful?
3. Floating-point arithmetic. When computers are used for numerical computations, rounding errors are always present. How can we then solve the system of linear algebraic equations Ax = b, or the system of nonlinear algebraic equations F(x) = 0, with satisfactory accuracy? How can we judge whether the solutions computed by computers are reliable?
4. Interpolation. For a complicated function, can we find a simple and easy-to-compute function to approximate it?
For a given set of observation data, can we find a function that best represents the set of data? E.g., suppose we know the measured temperature at a set of selected locations along the boundary of China; can we locate the coldest or warmest places, or the places with the most rapid changes of temperature? Or if the temperature has been measured at a fixed location for all the months in the previous 10 years, can we locate the coldest or warmest time at that location, or the time with the most rapid change of temperature?
5. Numerical integration. Integration is involved in many practical applications, e.g., computing the physical masses, surface areas, volumes, fluxes, etc.
But most integrals are difficult or impossible to compute exactly. For the integration of a complicated function, can we compute some good approximate
value of the integral when it is impossible to calculate the integral exactly ?
6. Numerical differentiation. Can we compute the derivatives of some complicated functions when their exact derivatives can not be computed or even the
exact expressions of the functions are not available ? This happens quite often
in real applications. E.g., suppose you are given the prices of a stock in the
past 10 years, can you find the times when the stock usually grows or drops
most rapidly ? Most importantly, numerical differentiation is needed in the construction of every numerical method for solving ordinary or partial differential
equations.

2 Auxiliary theorems and tools

In this section, we list some important mathematical theorems and tools that will be
used frequently in this course.
Theorem on polynomials. A polynomial of degree n has exactly n roots (counting multiplicities); and a polynomial of degree at most n which vanishes at (n + 1) distinct points is identically zero.
Rolle's theorem. Let f(x) be continuous on [a, b], let its derivative f'(x) be continuous in (a, b), and let f(a) = f(b); then there exists ξ ∈ (a, b) such that f'(ξ) = 0.
Mean-value theorem. Let f(x) be continuous on [a, b] and its derivative f'(x) be continuous in (a, b); then there exists ξ ∈ (a, b) such that

f(b) = f(a) + (b - a) f'(ξ),   i.e.,   f'(ξ) = (f(b) - f(a)) / (b - a).

Intermediate value theorem. Let f(x) be continuous on [a, b]; then for any real number g that lies between f(a) and f(b) there exists ξ ∈ [a, b] such that f(ξ) = g.
Taylor's theorem in one dimension. Let f(x) be continuously differentiable on [a, b] up to order n, and let its (n+1)-th derivative f^(n+1)(x) exist in (a, b). Then for any two points x_0, x ∈ (a, b),

f(x) = f(x_0) + f'(x_0)(x - x_0) + f''(x_0)/2! (x - x_0)^2 + ... + f^(n)(x_0)/n! (x - x_0)^n + R_n(x),

where R_n(x) is the remainder of the Taylor series and can be written as

R_n(x) = 1/n! ∫_{x_0}^{x} f^(n+1)(t) (x - t)^n dt    or    R_n(x) = f^(n+1)(ξ)/(n+1)! (x - x_0)^{n+1}

for some ξ lying between x_0 and x.


Mean-value theorem for integrals. Let f(x) and g(x) be continuous on [a, b], and let f(x) not change sign in [a, b]. Then there exists ξ ∈ (a, b) such that

∫_a^b f(x) g(x) dx = g(ξ) ∫_a^b f(x) dx.

Taylor's theorem in higher dimensions. Let f(x) be sufficiently smooth in Ω ⊂ R^m. Then for any two points x_0, x ∈ Ω, we have

f(x) = f(x_0) + Df(x_0)^T (x - x_0) + 1/2! (x - x_0)^T D^2 f(x_0) (x - x_0) + ...,

where Df(x) is the gradient of f(x) and D^2 f(x) is the Hessian matrix of f(x).
Fundamental theorem of calculus. Let F be differentiable in an open set Ω ⊂ R^n and x* ∈ Ω. Then for all x such that the line segment connecting x and x* lies inside Ω, we have

F(x) - F(x*) = ∫_0^1 F'(x* + t(x - x*)) (x - x*) dt.

3 Nonlinear equations of one variable

3.1 Nonlinear equations of one variable

A nonlinear equation of one variable is an equation of the form

f(x) = 0,    (3.1)

where f is a nonlinear function with respect to x, and x is the only independent variable.
Equation (3.1) is the standard form of nonlinear equations of one variable. Nonlinear equations of more general forms, e.g.,

g(x) + h(x) = b,    g(x) = h(x),

can be easily written in the standard form (3.1).


Below are some simple examples of nonlinear equations of one variable:

e^{sin x} - 2 cos x = 1,    4x^5 + x^4 - 3x^2 + 4x = 6,    (x + 1)(x - 2)(x + 3)(x - 5) - x^2 = 2.

In many scientific applications, we need to solve this type of nonlinear equation, and often the number of variables is more than one. But in this section, we first discuss nonlinear equations of one variable.
Here is a simple practical example. Suppose there are two particles traveling around a closed path, following the rules

x_1 = f(t),    x_2 = g(t),

where t stands for the time each particle has traveled and x_1, x_2 stand for the locations of the particles at time t (the location may be measured in terms of the arc length along the path). Find the time when the two particles meet each other, or when the two particles have the same speed. When f and g represent the populations of two cities, you may be asked to find the time when the two cities have the same population or reach the same population growth rate.

3.2 Difficulties in solving nonlinear equations

1. Nonlinear equations may have no solutions. For example, the equation

f(x) = x^2 - 4x + 5 = 0

has no solutions in R. In fact, f(x) = (x - 2)^2 + 1 ≥ 1 for all x ∈ R.
Usually, we should check whether a nonlinear equation has solutions before solving it. If no solutions exist, we do not need to spend time on it. But it is not always easy to check whether a nonlinear equation has solutions. Think about whether the following equation has a solution:

e^x - sin x^2 = 0.

2. Solutions may not be unique. Consider the equation

f(x) = sin x cos x - 1/2 = 0.

It is easy to see that x = π/4 is a solution, and that x = 2nπ + π/4 is also a solution for any integer n.
Similarly, one may find many more nonlinear equations which have more than one solution. The simplest examples come from a special class of functions, the polynomials.
Most numerical methods can approximate only one solution from one initial guess. So when we construct numerical methods for a nonlinear equation, we should first use some mathematical tools to locate a range in which there is a unique solution to the nonlinear equation.

3.3 Iterative methods and rate of convergence

Consider solving a nonlinear equation

f(x) = 0

with an exact solution x*. The most popular and effective methods are the so-called iterative methods.
An iterative method starts with an initial value, called an initial guess and usually denoted by x_0, and then generates a sequence

x_1, x_2, ..., x_k, ...

such that

lim_{k→∞} x_k = x*.

Now we introduce an important concept which can be used to measure the efficiency of different iterative methods, namely the rate of convergence.
Given a sequence x_0, x_1, x_2, ... converging to a limit x*.

Linear convergence. If there is a constant λ satisfying 0 ≤ λ ≤ 1 such that

lim_{k→∞} |x_{k+1} - x*| / |x_k - x*| = λ,

then
the sequence {x_k} is said to converge to x* linearly with rate λ for λ ∈ (0, 1);
the sequence {x_k} is said to converge to x* superlinearly for λ = 0;
the sequence {x_k} is said to converge to x* sublinearly for λ = 1.

Think about the following sequences and find out whether they converge, and whether they converge linearly, superlinearly or sublinearly:

x_k = 1 + 2^{-2k},  k = 0, 1, 2, ...
x_k = 1 + 2^{-2k} + 1/(k+1),  k = 0, 1, 2, ...
x_{k+1} = 2^{-k} x_k,  k = 0, 1, 2, ...
x_{k+1} = (1/2) x_k,  k = 0, 1, 2, ...
x_k = d^{1/k}  (d is fixed),  k = 1, 2, 3, ...
x_k = 10^{-2^k},  k = 1, 2, 3, ...
Convergence with order p. If there are two positive constants p > 1 and C such that

lim_{k→∞} |x_{k+1} - x*| / |x_k - x*|^p = C,    (3.2)

then the sequence {x_k} is said to converge to x* with order p.
If the order parameter p = 2 in (3.2), the convergence is said to be quadratic. For example, we shall see that the following relation holds for Newton's method applied to the nonlinear equation f(x) = 0:

lim_{k→∞} (x_{k+1} - x*) / (x_k - x*)^2 = f''(x*) / (2 f'(x*)),

so we know that Newton's method converges quadratically.
If the order parameter p = 3 in (3.2), the convergence is said to be cubic.
It is very easy to show the following important result on the order of convergence:

If the sequence {x_k} converges to x* with order p > 0, then it converges to x* with any order q for 0 < q < p.

In general, most iterative methods have convergence of order less than 3. Also, the order parameter p does not need to be an integer. As we shall see, the secant method converges with order p ≈ 1.618 (the golden mean).
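As a quick illustration of these definitions, here is a short Python sketch (added for illustration; it is not part of the original notes) that estimates the rate and the order of convergence numerically from the errors of a sequence of iterates. The helper name estimate_order and the sample Newton sequence are illustrative choices only.

```python
import math

def estimate_order(xs, x_star):
    """Estimate the convergence rate and order from iterates xs converging to x_star."""
    errs = [abs(x - x_star) for x in xs]
    for k in range(1, len(errs) - 1):
        lam = errs[k + 1] / errs[k]                                            # -> rate (linear case)
        p = math.log(errs[k + 1] / errs[k]) / math.log(errs[k] / errs[k - 1])  # -> order
        print(f"k={k}  e_k={errs[k]:.3e}  rate~{lam:.3f}  order~{p:.3f}")

# Sample sequence: Newton's method for f(x) = x^2 - 2, so x* = sqrt(2)
xs = [1.0]
for _ in range(4):
    xs.append(0.5 * (xs[-1] + 2.0 / xs[-1]))
estimate_order(xs, math.sqrt(2.0))   # the estimated order approaches 2 (quadratic convergence)
```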

3.4 Absolute and relative errors

To measure the accuracy of approximate values, we need to introduce some useful concepts.

3.4.1 Absolute errors

The absolute error between the true value (solution) x* and the approximate value x_k is given by

|x_k - x*|.

We can see that this error considers only the distance of x_k from x*, without taking into account the magnitude of the true solution x*. This may not always be satisfactory in applications. For example, consider ε = 10^-5 and the true solution x* = 1. If we stop the iteration when |x_k - x*| ≤ 10^-5, then x_k has 5 accurate digits after the decimal point (x_k = 0.99999...). But if the true solution is x* = 10^-8, then we will stop the iteration even when x_k = 10^-5, since

|x_k - x*| = 10^-5 - 10^-8 ≤ 10^-5;

this approximation x_k is 1000 times the exact solution x*, so it is not accurate at all.
3.4.2 Relative errors

The disadvantage of the absolute error is that it ignores the information on the magnitude of the true solution x*. Let x_k be an approximation to x* ≠ 0; then the quantity

ρ = |x_k - x*| / |x*|

is called the relative error of x_k. We can write

x_k = x*(1 + δ) = x* + δ x*,

where |δ| = ρ. So x_k can be viewed as a small perturbation of x*.
Now look at the following table of approximations to x* = e = 2.7182818...:

Approximation   relative error ρ   ρ ≈ 10^-k   accurate digits k
2.0             2 × 10^-1          10^-1       1
2.7             6 × 10^-3          10^-2       2
2.71            3 × 10^-3          10^-3       3
2.718           1 × 10^-4          10^-4       4
2.7182          3 × 10^-5          10^-5       5
2.71828         6 × 10^-7          10^-6       6

This suggests the following fact^2:

If the relative error of x_k is approximately 10^-k, then x_k and x* agree to k digits, and vice versa.

^2 Count from the first non-zero digit when the value is written in the format x.xxxx... × 10^m, where m may be positive or negative.

3.5 Bisection algorithm

There are many different methods one can use to solve a given nonlinear equation

f(x) = 0.

We introduce in this section one of the simplest but very effective iterative methods, called the bisection algorithm.

3.5.1 Basic conditions

The bisection algorithm is based on the following intermediate value theorem:

If f(x) is continuous on the interval [a, b], and g is a given number lying between f(a) and f(b), then there exists a point x* ∈ [a, b] such that f(x*) = g.

This immediately implies the following result:

If f(a) f(b) < 0, then there exists at least one solution x* ∈ (a, b) such that f(x*) = 0.
3.5.2 Interval bisection algorithm

We now present the interval bisection algorithm for finding the solution x* of the equation f(x) = 0. Usually, one cannot find the exact solution x*. We will be satisfied when we find an approximate solution x̃ such that |f(x̃)| ≤ ε or |x̃ - x*| ≤ δ for some small tolerance parameters ε and δ. One may set ε and δ to different magnitudes in each application.
Assume^3 that f(a) < 0 and f(b) > 0; then there exists a point x* ∈ (a, b) such that f(x*) = 0. Next we state the simple and popular bisection algorithm for finding an approximate solution x̃.

Bisection Algorithm. Input a, b, and two stopping tolerance parameters ε and δ. Set a_0 := a, b_0 := b, k := 0, x_k := (a_k + b_k)/2.

Step 1  If |f((a_k + b_k)/2)| ≤ ε, stop and output x_k = (a_k + b_k)/2;
        If f((a_k + b_k)/2) > 0, set a_{k+1} := a_k, b_{k+1} := (a_k + b_k)/2; GO TO Step 2;
        If f((a_k + b_k)/2) < 0, set a_{k+1} := (a_k + b_k)/2, b_{k+1} := b_k; GO TO Step 2.

Step 2  If |b_{k+1} - a_{k+1}| < δ, stop and output x_{k+1} := (a_{k+1} + b_{k+1})/2;
        Otherwise, set k := k + 1, GO TO Step 1.

^3 The case f(a) > 0 and f(b) < 0 can be handled similarly.
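The following Python sketch (added for illustration; not part of the original notes) is one possible implementation of the algorithm above; the function name bisection and the default tolerances are illustrative choices.

```python
import math

def bisection(f, a, b, eps=1e-10, delta=1e-10, max_iter=200):
    """Bisection algorithm for f(x) = 0, assuming f(a) < 0 < f(b)."""
    assert f(a) < 0 < f(b), "requires f(a) < 0 < f(b)"
    for k in range(max_iter):
        x = (a + b) / 2
        fx = f(x)
        if abs(fx) <= eps or (b - a) / 2 < delta:   # stopping tests of Steps 1 and 2
            return x, k
        if fx > 0:        # the root lies in [a, x]
            b = x
        else:             # the root lies in [x, b]
            a = x
    return (a + b) / 2, max_iter

# Example 1 below: solve e^x = sin x on [-3*pi/2, -pi]
root, iters = bisection(lambda x: math.exp(x) - math.sin(x), -1.5 * math.pi, -math.pi)
print(root, iters)   # root close to -3.1831
```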

Example 1. Use the bisection method to find an approximate solution of the nonlinear equation

e^x = sin x.

Solution. From the graphs of e^x and sin x, we can easily see that there are no positive solutions of the equation e^x = sin x, and that the solution closest to 0 lies in the interval [-3π/2, -π]. The following table gives the results generated by the bisection algorithm:

k     x_k        |f(x_k)|
1     -3.9270    6.8740 × 10^-1
2     -3.5343    3.5350 × 10^-1
3     -3.3379    1.5958 × 10^-1
4     -3.2398    5.8844 × 10^-2
...   ...        ...
13    -3.1832    1.4451 × 10^-4
14    -3.1831    4.4742 × 10^-5
15    -3.1831    5.1407 × 10^-6
3.5.3 Convergence of the bisection algorithm

Let us now analyse whether the bisection algorithm converges and how fast it may converge.
For convenience, we denote the successive intervals generated by the bisection algorithm by

[a_0, b_0], [a_1, b_1], ..., [a_k, b_k], ...,

where a_0 = a and b_0 = b. We observe the following properties:

a_0 ≤ a_1 ≤ a_2 ≤ ... ≤ a_k ≤ ... ≤ b_0,
b_0 ≥ b_1 ≥ b_2 ≥ ... ≥ b_k ≥ ... ≥ a_0,
b_n - a_n = (1/2)(b_{n-1} - a_{n-1}),  n ≥ 1.

From these properties, we know that the sequences {a_k} and {b_k} both converge, and that they converge to the same limit, using the fact

b_n - a_n = 2^{-n} (b_0 - a_0).

Let

x* = lim_{k→∞} a_k = lim_{k→∞} b_k;

then we have

f(a_k) f(b_k) < 0,

which implies

f(x*)^2 ≤ 0,

and therefore the limit x* is a solution of the nonlinear equation f(x) = 0.
Moreover, let

x_k = (a_k + b_k)/2;

then we have

|x_k - x*| ≤ (1/2)(b_k - a_k) = 2^{-(k+1)} (b_0 - a_0).

This proves the following result^4:

Let f(x) be a continuous function on [a, b] such that f(a) f(b) < 0; then the bisection algorithm always converges to a solution x* of the equation f(x) = 0, and the following error estimate holds for the k-th approximate value x_k:

|x_k - x*| ≤ (1/2)(b_k - a_k) = 2^{-(k+1)} (b - a).

^4 Think about the rate of convergence of the bisection method.

Think about the following further examples of nonlinear equations.

Example 2. Find a positive root of the nonlinear equation

x^2 - 4x sin x + (2 sin x)^2 = 0,

and find a root of the equation

2x + e^x + 2 cos x - 6 = 0

on the interval [1, 3].

Example 3. Consider the bisection algorithm starting with the interval [1.5, 3.5].
1. What is the width of the interval at the k-th step of the iteration?
2. What is the maximum possible distance between the solution x* and the midpoint of this interval?

3.6 Approximation by Taylor expansion

Before we discuss more iterative methods, we first review a fundamental tool that is frequently used in numerical analysis: approximation by Taylor expansion.
The Taylor expansion, or Taylor series, is a tool for approximating a function, given the derivatives of the function at a specific point, say x_0. It takes the form

f(x) = f(x_0) + f'(x_0)(x - x_0) + f''(x_0)/2! (x - x_0)^2 + ... + f^(n)(x_0)/n! (x - x_0)^n + ...

1. Approximation by Taylor expansion

(a) 0th order approximation: we know only the function value (the 0th order derivative) of f(x) at a point x_0; then we may use the following 0th order approximation of f(x):

f(x) ≈ f(x_0) = F(x)

for all x that are close to x_0.

(b) 1st order approximation: we know the function value and the 1st order derivative of f(x) at the point x_0; then we may use the following 1st order approximation of f(x):

f(x) ≈ f(x_0) + f'(x_0)(x - x_0) = F(x)

for all x that are close to x_0. Clearly we see that

F(x_0) = f(x_0),    F'(x_0) = f'(x_0).

In fact, the 1st order approximation of the curve f(x) is nothing else but its tangent line at x_0.

(c) 2nd order approximation: we know the function value and the 1st and 2nd order derivatives of f(x) at the point x_0; then we may use the following 2nd order approximation of f(x):

f(x) ≈ f(x_0) + f'(x_0)(x - x_0) + (1/2) f''(x_0)(x - x_0)^2 = F(x)

for all x that are close to x_0. Clearly we see that

F(x_0) = f(x_0),    F'(x_0) = f'(x_0),    F''(x_0) = f''(x_0).

So the 2nd order approximation also captures the local geometric shape of f(x) at x_0.

In practice, one does not often use derivatives of order higher than 2 for such approximations.

2. Approximation errors of Taylor expansion

(a) 0th order approximation: by the mean-value theorem we have

f(x) - F(x) = f(x) - f(x_0) = f'(ξ)(x - x_0)

for some ξ lying between x and x_0.

(b) 1st order approximation:

f(x) - F(x) = f(x) - {f(x_0) + f'(x_0)(x - x_0)} = (1/2) f''(ξ)(x - x_0)^2

for some ξ lying between x and x_0.

(c) 2nd order approximation:

f(x) - F(x) = f(x) - {f(x_0) + f'(x_0)(x - x_0) + (1/2) f''(x_0)(x - x_0)^2} = (1/3!) f'''(ξ)(x - x_0)^3

for some ξ lying between x and x_0.

Example. Approximate e^x around the point x_0 = 1 using the 0th, 1st and 2nd order Taylor approximations.

Solution. Note that all the derivatives of e^x at x_0 = 1 are simply e.

1. 0th order approximation:

f(x) = e^x ≈ f(x_0) = e^{x_0} = e

for all x that are close to x_0 = 1.

2. 1st order approximation:

f(x) = e^x ≈ F(x) = e^{x_0} + e^{x_0}(x - x_0) = e x

for all x that are close to x_0 = 1.

3. 2nd order approximation:

f(x) = e^x ≈ F(x) = e^{x_0} + e^{x_0}(x - x_0) + (1/2) e^{x_0}(x - x_0)^2 = (e/2)(x^2 + 1)

for all x that are close to x_0 = 1.

Let us try these approximations and see their accuracies at x = 2, 1.1, and 1.01, i.e., x - x_0 = 1, 0.1, and 0.01:

            0th order        1st order        2nd order        true value
x = 2       e                2e               2.5e             e^2
  error     4.67 × 10^0      1.95 × 10^0      5.93 × 10^-1
x = 1.1     e                1.1e             1.105e           e^1.1
  error     2.86 × 10^-1     1.41 × 10^-2     4.64 × 10^-4
x = 1.01    e                1.01e            1.01005e         e^1.01
  error     2.73 × 10^-2     1.36 × 10^-4     4.54 × 10^-7

From this table, we observe the following properties of the Taylor approximation:

1. The Taylor approximation gives basically no accuracy at x = 2, since this point is a bit too far away from x_0. The accuracy behaves basically like (x - x_0)^{n+1}, where n is the order of the approximation, but it is also affected by the values of the (n + 1)-th order derivative of f(x) near x_0 divided by (n + 1)!.

2. For the 0th order approximation, the error is decreased by a factor of 10^-1 each time x - x_0 is decreased by a factor of 10^-1. For the 1st and the 2nd order approximations, the decreasing factors are respectively 10^-2 and 10^-3.

Explain why we observe such behaviors.
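A short Python check (added as an illustration, not part of the original notes) that reproduces the error rows of the table; the helper name taylor_err is hypothetical.

```python
import math

def taylor_err(x, x0=1.0, order=2):
    """Error of the Taylor approximation of exp(x) of the given order about x0."""
    approx = sum(math.exp(x0) * (x - x0) ** j / math.factorial(j) for j in range(order + 1))
    return abs(math.exp(x) - approx)

for x in (2.0, 1.1, 1.01):
    print(x, [f"{taylor_err(x, order=n):.2e}" for n in (0, 1, 2)])
# Each error column shrinks roughly like (x - x0)^(n+1), matching the table above.
```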

3.7 Newton's method

Consider solving the nonlinear equation

f(x) = 0.

We know from our previous discussions that the bisection algorithm is convergent under a very weak assumption on f(x), namely that f only needs to be continuous, and the algorithm converges very steadily. But the convergence is usually slow.
Newton's method, also called the Newton-Raphson method, is one of the most powerful numerical methods for solving nonlinear equations. Newton's method is also an iterative method. It begins with an initial guess x_0, then produces successive approximate values

x_1, x_2, ..., x_n, ....

If x_0 is a good initial value (not too far away from the true solution x*), then the sequence {x_n} converges to x* very rapidly. Usually only a few iterations are needed to produce a very accurate approximation of x*.

3.7.1 Derivations of Newton's method

There are several approaches to deriving Newton's method. We first present a geometrical approach.

Start with an initial point x_0: draw the tangent line to the curve y = f(x) at the point (x_0, f(x_0)), and find the intersection point of the tangent line with the x-axis, which Newton's method takes to be the new approximation x_1. Repeating the same procedure, we get x_2, x_3, ....
Let us now derive a formula for computing x_1, x_2, .... The tangent line of the curve y = f(x) at the point (x_0, f(x_0)) is

y - f(x_0) = f'(x_0)(x - x_0).

The intersection point of this line with the x-axis can be found from

-f(x_0) = f'(x_0)(x - x_0),

which gives

x_1 = x_0 - f(x_0)/f'(x_0).

Similarly we can find

x_2 = x_1 - f(x_1)/f'(x_1), ....

Thus we have derived the following Newton's method:

x_{n+1} = x_n - f(x_n)/f'(x_n),  n = 0, 1, 2, ....

Next, we derive Newton's method using an analytical approach. We know the Taylor expansion

f(x) = f(x_0) + f'(x_0)(x - x_0) + (1/2) f''(ξ)(x - x_0)^2,

where ξ lies between x and x_0.
Now if x_0 is close to x* and f''(x) is not too large around x_0, then

F(x) = f(x_0) + f'(x_0)(x - x_0) ≈ f(x),

so instead of solving f(x) = 0, we can solve

F(x) = 0,

which gives

x_1 = x_0 - f(x_0)/f'(x_0).

Similarly we can derive x_2, x_3, ..., and thus Newton's method:

x_{n+1} = x_n - f(x_n)/f'(x_n),  n = 0, 1, 2, ....
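As an added illustration (not part of the original notes), a minimal Python sketch of this iteration; the stopping test and the name newton are illustrative choices.

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method x_{n+1} = x_n - f(x_n)/f'(x_n) for f(x) = 0."""
    x = x0
    for n in range(max_iter):
        fx = f(x)
        if abs(fx) <= tol:
            return x, n
        x = x - fx / fprime(x)
    return x, max_iter

# Example 2 below: square root of a = 2, i.e. f(x) = x^2 - 2
root, iters = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root, iters)   # ~1.414214 after only a few iterations
```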

3.7.2 Applications of Newton's method

Example 1. Use Newton's method to find an approximation of the inverse^5 (reciprocal) of a given positive number a.

Solution. The question amounts to solving the equation

1/x = a.

Let f(x) = 1/x - a. Newton's method is given by

x_{k+1} = x_k - f(x_k)/f'(x_k) = 2 x_k - a x_k^2,  k = 0, 1, 2, ....

Clearly this avoids computing the inverse (division) of any number. And we can verify that the sequence {x_k} converges monotonically as long as x_0 ∈ (0, 1/a). Why?
The following table gives the sequence {x_k} approximating 1/a when a = 1, for different initial guesses:

Iteration   x_k (x_0 = 0.25)   x*         x_k (x_0 = 1.75)   x_k (x_0 = 2.0)   x_k (x_0 = 2.1)
0           0.250000           1.000000   1.750000           2.000000          +2.100000 × 10^0
1           0.437500           1.000000   0.437500           0.000000          -0.210000 × 10^0
2           0.683594           1.000000   0.683594           0.000000          -4.641000 × 10^-1
3           0.899887           1.000000   0.899887           0.000000          -1.143589 × 10^0
4           0.989977           1.000000   0.989977           0.000000          -3.594973 × 10^0
5           0.999899           1.000000   0.999899           0.000000          -2.011378 × 10^1
6           0.999999           1.000000   0.999999           0.000000          -4.447915 × 10^2

The choice of the initial guess. The choice of the initial guess is very important for the convergence of Newton's method. This can be clearly seen from the previous table.

^5 How do computers compute x = 1/a? Newton's method applied to the equivalent nonlinear equation 1/x = a is a good choice.
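A small Python illustration (added, not in the original notes) of the division-free iteration x_{k+1} = 2 x_k - a x_k^2 used in this example:

```python
def reciprocal(a, x0, iters=6):
    """Division-free Newton iteration for 1/a: x_{k+1} = 2*x_k - a*x_k**2."""
    x = x0
    for _ in range(iters):
        x = 2.0 * x - a * x * x
    return x

print(reciprocal(1.0, 0.25))   # converges to 1.0, as in the table above
print(reciprocal(1.0, 2.1))    # diverges, as in the last column of the table
```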
Example 2. Use Newton's method to find the square root of a given positive number a.

Solution. The problem amounts to solving the following nonlinear equation:

x^2 = a.

Let f(x) = x^2 - a. Then Newton's method can be written as

x_{k+1} = x_k - f(x_k)/f'(x_k) = (1/2)(x_k + a/x_k),  k = 0, 1, 2, ....

The following table indicates that Newton's method converges very rapidly for quite different initial guesses. This is an unusual example, as the function is a quadratic polynomial. Please look at the convergence of Newton's method geometrically for the quadratic nonlinear equation x^2 - a = 0 and investigate whether the method converges globally, namely whether it converges for any initial guess.

Iteration   x_k (x_0 = 1)   x*         x_k (x_0 = 0.5)   x_k (x_0 = 6)
0           1.000000        1.414214   0.500000          6.000000
1           1.500000        1.414214   2.250000          3.166667
2           1.416667        1.414214   1.569444          1.899123
3           1.414216        1.414214   1.421890          1.476120
4           1.414214        1.414214   1.414234          1.415512
5           1.414214        1.414214   1.414214          1.414214

Example 3. Use Newton's method to find the first negative solution of the nonlinear equation

e^x = sin x.

Solution. Let f(x) = e^x - sin x. By a simple analysis, we know x* ∈ [-3π/2, -π]. Newton's method can be written as

x_{k+1} = x_k - f(x_k)/f'(x_k) = x_k - (e^{x_k} - sin x_k)/(e^{x_k} - cos x_k),  k = 0, 1, 2, ....

The following table shows the convergence of the algorithm for different initial guesses, from which we can clearly see again how important the initial guess is.

Iteration   x_k (x_0 = -0.5)   x_k (x_0 = -1)   x_k (x_0 = -2)   x*
0           -0.500000          -1.000000        -2.000000        -3.183063
1           +3.506451          6.013863         -3.894228        -3.183063
2           +2.523301          5.010849         -3.010248        -3.183063
3           +1.628274          4.002502         -3.183451        -3.183063
4           +0.833182          3.000576         -3.183063        -3.183063
5           -0.125327          2.054193         -3.183063        -3.183063
6           +9.035402          1.217551         -3.183063        -3.183063

3.7.3 Local convergence analysis

In this section, we discuss the convergence of Newton's method. As we saw from the numerical examples in the last subsection, the convergence of Newton's method depends strongly on the initial guess: the initial guess cannot be taken too far away from the exact solution x*. Such convergence is called local convergence. In the following, we shall show that Newton's method converges when the initial guess x_0 is close enough to the true solution x* of f(x) = 0, and when

f'(x*) ≠ 0.

This condition means that x* is a simple root of f(x). The case of multiple roots will be discussed later in Section 3.8.4.
To analyse the convergence, we define an iteration function

φ(x) = x - f(x)/f'(x),

so that Newton's method can be written as the fixed-point iteration

x_{k+1} = φ(x_k),  k = 0, 1, 2, ....

Clearly, we have

x* = φ(x*).

Next, we show that the sequence {x_k} converges to x*. Let e_k = x_k - x* be the error at the k-th iteration; then

lim_{k→∞} x_k = x*  if and only if  lim_{k→∞} e_k = 0.

Noting that

x_{k+1} = φ(x_k),    x* = φ(x*),

we see

e_{k+1} = φ(x_k) - φ(x*).

By the mean-value theorem, we derive the error equation for e_k:

e_{k+1} = φ'(ξ_k)(x_k - x*) = φ'(ξ_k) e_k,

where ξ_k lies between x_k and x*. From this, we know e_k → 0 if we can show

|φ'(ξ_k)| ≤ 1/2.

By a simple computation, we have

φ'(x) = 1 - (f'(x) f'(x) - f''(x) f(x)) / f'(x)^2 = f(x) f''(x) / f'(x)^2,

which implies

φ'(x*) = f(x*) f''(x*) / f'(x*)^2 = 0,

since we know f(x*) = 0. Clearly, we have assumed here that f(x) is continuously differentiable up to second order near x = x*. Thus, by the continuity of φ'(x), we can find a constant δ > 0 such that

|φ'(x)| ≤ 1/2  for all x such that |x - x*| ≤ δ.

Now by mathematical induction, it is natural to verify that:

as long as the initial guess x_0 is taken such that |x_0 - x*| ≤ δ, all the x_k generated by Newton's method lie in the range |x - x*| ≤ δ.

In fact, by induction it is easy to check that for k = 1, 2, ..., we have

|φ'(ξ_k)| ≤ 1/2 < 1,    |e_{k+1}| ≤ (1/2)|e_k|.

This yields

|e_{k+1}| ≤ (1/2^{k+1}) |e_0| = (1/2^{k+1}) |x_0 - x*|,

so we have e_k → 0 as k → ∞, namely x_k → x* as k → ∞. This demonstrates the convergence of Newton's method. □

3.7.4 Quadratic convergence of Newton's method

In the last subsection we showed the convergence of Newton's method, but we still do not know how fast the method converges. Next we demonstrate, using two different approaches, that Newton's method converges quadratically.

Approach 1. Let e_k = x_k - x* be the error at the k-th iteration of Newton's method; then we have

e_{k+1} = x_{k+1} - x* = φ(x_k) - φ(x*).

As we saw earlier, φ'(x*) = 0. By Taylor expansion,

φ(x_k) - φ(x*) = (1/2) φ''(ξ_k)(x_k - x*)^2,

where ξ_k lies between x_k and x*. Hence

e_{k+1} = (1/2) φ''(ξ_k)(x_k - x*)^2 = (1/2) φ''(ξ_k) e_k^2.

But we know

φ'(x) = f(x) f''(x) / f'(x)^2,    or    φ'(x) f'(x)^2 = f(x) f''(x).

Taking the derivative on both sides,

φ''(x) f'(x)^2 + 2 φ'(x) f'(x) f''(x) = f'(x) f''(x) + f(x) f'''(x),

and letting x = x*, we get

φ''(x*) f'(x*)^2 = f'(x*) f''(x*),

which implies

φ''(x*) = f''(x*) / f'(x*).

Recall that we already know Newton's method converges, namely

lim_{k→∞} x_k = x*,

so we must have lim_{k→∞} ξ_k = x* (think about why). This leads to

lim_{k→∞} e_{k+1}/e_k^2 = (1/2) φ''(x*) = f''(x*) / (2 f'(x*)).    (3.3)

Note that the limit C = f''(x*)/(2 f'(x*)) is a constant. By the definition, we know from (3.3) that Newton's method converges quadratically.

Approach 2. By Taylor expansion we have

f(x*) = f(x_k) + f'(x_k)(x* - x_k) + (1/2) f''(ξ_k)(x* - x_k)^2,

thus

f'(x_k)(x_k - x*) - f(x_k) = (1/2) f''(ξ_k)(x* - x_k)^2,

which gives

x_{k+1} - x* = (x_k - x*) - f(x_k)/f'(x_k)
             = ((x_k - x*) f'(x_k) - f(x_k)) / f'(x_k)
             = f''(ξ_k)(x* - x_k)^2 / (2 f'(x_k)).

Now we see

lim_{k→∞} e_{k+1}/e_k^2 = f''(x*) / (2 f'(x*)).

This proves again the quadratic convergence of Newton's method. □

The previous analysis leads to the following conclusion about Newton's method:

Newton's method is a locally convergent iterative method, and it converges quadratically when the initial guess x_0 is taken within the convergence region.

We emphasize that it may not always be easy to find a good initial guess x_0 for Newton's method in practical applications.
3.7.5 Extension of Newton's method

Recall that Newton's method is derived by using the linear polynomial

F(x) = f(x_0) + f'(x_0)(x - x_0)

from the Taylor expansion

f(x) = f(x_0) + f'(x_0)(x - x_0) + (1/2) f''(ξ)(x - x_0)^2

to approximate the actual function f(x). Naturally, one may ask whether we can use the quadratic polynomial

F(x) = f(x_0) + f'(x_0)(x - x_0) + (1/2) f''(x_0)(x - x_0)^2

from the Taylor expansion

f(x) = f(x_0) + f'(x_0)(x - x_0) + (1/2) f''(x_0)(x - x_0)^2 + (1/6) f'''(ξ)(x - x_0)^3

to approximate the actual function f(x), so that we may obtain a new iteration which converges even faster than Newton's method.
Indeed, this is possible. Students may try to work out the details of this new iterative method and analyse its convergence and convergence order.
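The notes leave the details as an exercise; purely as an added illustration (not from the notes), one possible way to realize this idea is to solve the quadratic Taylor model f(x_k) + f'(x_k) d + (1/2) f''(x_k) d^2 = 0 for the correction d at each step, taking the root closest to zero and falling back to the Newton correction when the model has no real root. The names below are hypothetical.

```python
import math

def quadratic_newton_step(f, fp, fpp, x):
    """One step based on the 2nd order Taylor model f(x) + f'(x)d + 0.5 f''(x)d^2 = 0 (a sketch)."""
    a, b, c = 0.5 * fpp(x), fp(x), f(x)
    if a == 0.0:
        return x - c / b                 # reduces to the Newton correction
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return x - c / b                 # no real root of the model: use the Newton correction
    # root of the quadratic closest to zero (stable form of the quadratic formula)
    q = -0.5 * (b + math.copysign(math.sqrt(disc), b))
    return x + c / q

# Example: f(x) = x^2 - 2, starting from x0 = 1
f, fp, fpp = (lambda x: x * x - 2.0), (lambda x: 2.0 * x), (lambda x: 2.0)
x = 1.0
for _ in range(4):
    x = quadratic_newton_step(f, fp, fpp, x)
print(x)   # ~1.41421356...
```

For the quadratic f(x) = x^2 - 2 the second order model is exact, so this sketch essentially finds the root in one step; analysing its convergence order in general is exactly the exercise posed above.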

3.8 Quasi-Newton methods

Recall Newton's method:

x_{k+1} = x_k - f(x_k)/f'(x_k),  k = 0, 1, 2, ....

We see that at each iteration we need to compute the derivative f'(x_k). This is simple in many applications, but it can also be a big trouble for some practical problems. For instance, this is the case when one of the following situations happens:

(1) The expression of f(x) is unknown;
(2) The derivative f'(x) is very expensive to compute;
(3) The value of the function f may be the result of a long numerical calculation, so no formula is available for the derivative.

A good example. We next show a good example of why the derivative f'(x) can sometimes be complicated and expensive to compute. Consider the following two-point boundary value problem:

x''(t) + p(t) x'(t) + q(t) x(t) = g(t),  a < t < b,
x(a) = α,  x(b) = β,    (3.4)

where p(t), q(t) and g(t) are given functions, and x(t) is unknown. Our aim is to find the solution x(t) of the system (3.4). One popular way to do this is to first solve the following initial value problem:

x''(t) + p(t) x'(t) + q(t) x(t) = g(t),  t > a,
x(a) = α,  x'(a) = z

for a given z. We write the solution as x(t, z); then x(t, z) will also be the solution of the boundary-value problem (3.4) if we can find a value z such that

x(b, z) = β.

Let f(z) = x(b, z) - β; then z is a solution of the nonlinear equation

f(z) = 0.

To find such a solution z, we may apply Newton's method:

z_{k+1} = z_k - f(z_k)/f'(z_k),  k = 0, 1, 2, ....

Think about how to compute the derivative f'(z).

In the above example we can see that computing the derivative f'(z_k) is indeed complicated and cannot be done directly.
Can we then avoid computing the derivatives f'(x_k) when they are difficult to obtain?
3.8.1 Constant slope method

One possibility for getting rid of the derivatives in Newton's method is to find some approximation of the derivative f'(x_k), say g_k ≈ f'(x_k), where g_k should be much easier to compute than f'(x_k). Then we replace Newton's method by the following iteration:

x_{k+1} = x_k - f(x_k)/g_k.

Such an iteration is called a quasi-Newton method. There are many possible approximations of f'(x_k), leading to many different quasi-Newton methods. For example, one can replace f'(x_k) by the difference quotient

g_k = (f(x_k) - f(x_{k-1})) / (x_k - x_{k-1});

the resulting method is called the secant method. It is easy to see that this method needs two initial values x_0 and x_1 to start with.
The simplest way to approximate f'(x_k) is to replace it by a constant g_k = g. Then Newton's method becomes

x_{k+1} = x_k - f(x_k)/g,  k = 0, 1, 2, ...;

this is called the constant slope method. In particular, we might take g = f'(x_0).

Convergence of the constant slope method

Next, we show that the constant slope method

x_{k+1} = x_k - f(x_k)/g,  k = 0, 1, 2, ...,

converges locally. To see this, let

φ(x) = x - f(x)/g,

and assume that^6

|φ'(x*)| = |1 - f'(x*)/g| < 1.

Then

x_{k+1} - x* = φ(x_k) - φ(x*) = φ'(ξ_k)(x_k - x*),

where ξ_k lies between x_k and x*. As |φ'(x*)| < 1, there exists an ε > 0 such that^7

|φ'(x)| ≤ ρ < 1  for all x ∈ [x* - ε, x* + ε].

Then it is easy to see that {x_k} stays inside the interval [x* - ε, x* + ε] as long as the initial guess x_0 ∈ [x* - ε, x* + ε]. This implies

|x_{k+1} - x*| = |φ'(ξ_k)| |x_k - x*| ≤ ρ |x_k - x*|,

or

|x_{k+1} - x*| ≤ ρ^{k+1} |x_0 - x*|.    (3.5)

Thus we know x_k → x* as k → ∞, i.e., the constant slope method converges.
Moreover, we further see

lim_{k→∞} (x_{k+1} - x*)/(x_k - x*) = lim_{k→∞} φ'(ξ_k) = φ'(x*),

so the constant slope method converges linearly with rate |φ'(x*)|.

How fast is the linear convergence?

The constant slope method converges linearly with rate λ = |φ'(x*)|. The efficiency of a method with linear convergence depends strongly on the magnitude of the rate, i.e. the constant ρ in (3.5). In fact, each iteration reduces the error by a factor of the constant λ. So if we set the accuracy tolerance to ε, the number of iterations k required to reach the tolerance is

k = [log ε / log λ] + 1.

The following table gives a clear picture of how fast a linearly convergent method converges and how strongly this depends on the rate λ:

ε \ λ      0.99    0.90    0.50    0.10    0.01
10^-5      1146    110     17      6       3
10^-10     2292    219     34      11      6
10^-15     3437    328     50      16      8

We observe that when the rate λ is close to 0, linear convergence (superlinear convergence for λ = 0) may be nearly as efficient as quadratic convergence. When the rate λ is close to 1, the convergence can be extremely slow.

^6 For example, if we know the sign of f'(x*), then we can take g = 2 sign(f'(x*)) M, where M can be any estimate of an upper bound of max |f'(x)|.
^7 Think about how to find such a fixed constant ρ < 1.
3.8.2 Fixed-point iterative methods

Both Newton's method and the quasi-Newton methods can be seen as special cases of the fixed-point iterative method.

What is a fixed point? For a given function φ(x), x* is called a fixed point of φ if it satisfies

φ(x*) = x*.

For Newton's method we have

φ(x) = x - f(x)/f'(x),

so the exact solution x* of the nonlinear equation f(x) = 0 is a fixed point of φ(x).
For the quasi-Newton method we have

φ(x) = x - f(x)/g,

so an exact solution x* of the nonlinear equation f(x) = 0 is also a fixed point of φ(x), since φ(x*) = x*.

Fixed-point iterative methods. For a given function φ(x), the iterative method

x_{k+1} = φ(x_k),  k = 0, 1, 2, ...

is called the fixed-point iteration associated with the function φ(x), and φ(x) is called the iteration function.
Clearly, Newton's method and the quasi-Newton methods are special cases of fixed-point iterative methods.
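A generic fixed-point iteration in Python (an added illustration, not part of the notes); Newton's method is recovered by the particular choice of φ shown in the example.

```python
def fixed_point(phi, x0, tol=1e-12, max_iter=100):
    """Fixed-point iteration x_{k+1} = phi(x_k)."""
    x = x0
    for k in range(max_iter):
        x_new = phi(x)
        if abs(x_new - x) <= tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

# Newton's method for f(x) = x^2 - 2, written as a fixed-point iteration:
phi = lambda x: x - (x * x - 2.0) / (2.0 * x)
print(fixed_point(phi, 1.0))   # converges to sqrt(2) ~ 1.414214
```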
Geometrically, one can easily see the meaning of the fixed-point iteration. Based on this geometric illustration, a fixed point x* is said to be an attractive point if the fixed-point iteration converges whenever one starts close enough to the fixed point; a fixed point x* is said to be a repulsive point if the fixed-point iteration diverges no matter how close to the fixed point we start.
For the fixed-point iterative method, we have the following convergence result:

If the iteration function φ(x) satisfies the condition

|φ'(x*)| < 1,

then there exists a δ > 0 such that for any x_0 ∈ [x* - δ, x* + δ] the fixed-point iteration converges. If φ'(x*) ≠ 0, the convergence is linear with convergence rate λ = |φ'(x*)|. If

φ'(x*) = φ''(x*) = ... = φ^(p-1)(x*) = 0,  but  φ^(p)(x*) ≠ 0,

then the fixed-point iteration converges with order p.


Proof. Students are encouraged to complete the first part of the proof themselves.
For the second part, we suppose

φ'(x*) = φ''(x*) = ... = φ^(p-1)(x*) = 0,  but  φ^(p)(x*) ≠ 0.

By Taylor expansion,

φ(x) = φ(x*) + φ'(x*)/1! (x - x*) + φ''(x*)/2! (x - x*)^2 + ... + φ^(p-1)(x*)/(p-1)! (x - x*)^{p-1} + φ^(p)(ξ)/p! (x - x*)^p,

where ξ lies between x and x*. Thus, by the given assumptions,

x_{k+1} = φ(x_k) = φ(x*) + φ^(p)(ξ_k)/p! (x_k - x*)^p,

and using φ(x*) = x*, we have

x_{k+1} - x* = φ^(p)(ξ_k)/p! (x_k - x*)^p.

Since ξ_k lies between x_k and x*, and x_k → x* as k → ∞, we know ξ_k → x* as k → ∞. This leads to

lim_{k→∞} |x_{k+1} - x*| / |x_k - x*|^p = |φ^(p)(x*)| / p! ≠ 0,

so the convergence is of order p. □

3.8.3 A numerical example

We now consider solving the simple nonlinear equation f(x) = x^2 - 2 = 0 with x_0 = 1 by three different methods.

Newton's method:

x_{k+1} = (x_k^2 + 2)/(2 x_k),  k = 0, 1, 2, ....

Secant method:

x_{k+1} = (x_k x_{k-1} + 2)/(x_k + x_{k-1}),  k = 1, 2, ...

(the secant method additionally needs a second starting value; here x_1 = 1.5).

Constant slope method:

x_{k+1} = (g x_k - x_k^2 + 2)/g,  k = 0, 1, 2, ....

The following table gives the convergence history of these three methods, in terms of the errors |x_k - x*| (the last three columns are the constant slope method with g = 0.5, 1 and 2):

k     Newton          Secant          g = 0.5         g = 1           g = 2
0     4.14 × 10^-1    4.14 × 10^-1    4.14 × 10^-1    4.14 × 10^-1    4.14 × 10^-1
1     8.57 × 10^-2    8.57 × 10^-2    1.58 × 10^0     5.85 × 10^-1    8.57 × 10^-2
2     2.45 × 10^-3    1.42 × 10^-2    1.24 × 10^1     1.41 × 10^0     3.92 × 10^-2
3     2.12 × 10^-6    4.20 × 10^-4    2.50 × 10^2     5.85 × 10^-1    1.54 × 10^-2
4     1.59 × 10^-12   2.12 × 10^-6    1.24 × 10^5     1.41 × 10^0     6.52 × 10^-3
5     2.22 × 10^-16   3.15 × 10^-10   3.08 × 10^10    5.85 × 10^-1    2.68 × 10^-3
6                     2.22 × 10^-16   1.90 × 10^21    1.41 × 10^0     1.11 × 10^-3
...
18                                                    1.41 × 10^0     2.84 × 10^-8
19                                                    5.85 × 10^-1    1.17 × 10^-8

Next, we consider solving the nonlinear equation f(x) = x^2 - 2 = 0 by three different fixed-point methods, all started with x_0 = 1.

First method:

x_{k+1} = (x_k^2 + 4)/(3 x_k) = φ_1(x_k),  k = 0, 1, 2, ....

Second method:

x_{k+1} = 2/x_k = φ_2(x_k),  k = 0, 1, 2, ....

Third method:

x_{k+1} = (6 - x_k^4)/x_k^2 = φ_3(x_k),  k = 0, 1, 2, ....

The errors |x_k - x*| behave as follows:

k     First method      Second method     Third method
0     4.1421 × 10^-1    4.1421 × 10^-1    4.1421 × 10^-1
1     2.5245 × 10^-1    5.8579 × 10^-1    3.5858 × 10^0
2     5.8658 × 10^-2    4.1421 × 10^-1    2.6174 × 10^1
3     2.1245 × 10^-2    5.8579 × 10^-1    6.1446 × 10^2
4     6.8720 × 10^-3    4.1421 × 10^-1    3.7583 × 10^5
5     2.3130 × 10^-3    5.8579 × 10^-1    1.4125 × 10^11
6     7.6849 × 10^-4    4.1421 × 10^-1    1.9951 × 10^22
7     2.5644 × 10^-4    5.8579 × 10^-1    3.9802 × 10^44
8     8.5450 × 10^-5    4.1421 × 10^-1    1.5842 × 10^89
...
18    1.4472 × 10^-9    4.1421 × 10^-1
19    4.8241 × 10^-10   5.8579 × 10^-1

One may observe from the above table that the first method converges very fast, the second method does not converge (it oscillates), and the third method diverges rapidly. It is interesting to think about the reasons for such behaviours.
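The behaviour in the last table can be reproduced with a few lines of Python (an added illustration, not part of the notes):

```python
import math

def run(phi, x0=1.0, steps=8):
    """Errors |x_k - sqrt(2)| of the fixed-point iteration x_{k+1} = phi(x_k)."""
    x, errs = x0, []
    for _ in range(steps):
        errs.append(abs(x - math.sqrt(2.0)))
        x = phi(x)
    return errs

phi1 = lambda x: (x * x + 4.0) / (3.0 * x)   # converges (linearly)
phi2 = lambda x: 2.0 / x                     # oscillates between 1 and 2
phi3 = lambda x: (6.0 - x ** 4) / x ** 2     # diverges rapidly
for phi in (phi1, phi2, phi3):
    print([f"{e:.4e}" for e in run(phi)])
```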
3.8.4 Functions with multiple zeros

In the previous sections we have only considered the case of a simple zero x* of the function f(x), that is,

f(x*) = 0  but  f'(x*) ≠ 0.

Next, we consider the case where x* is a multiple zero. We first define the multiplicity of a zero.

A point x* is called a zero of the function f(x) with multiplicity m ≥ 1 if

f(x*) = f'(x*) = ... = f^(m-1)(x*) = 0,  but  f^(m)(x*) ≠ 0.

What does a function with a multiple zero look like? By Taylor expansion,

f(x) = f(x*) + f'(x*)/1! (x - x*) + ... + f^(m-1)(x*)/(m-1)! (x - x*)^{m-1} + f^(m)(ξ)/m! (x - x*)^m,

where ξ lies between x and x*. So if x* is a multiple zero of order m, then

f(x) = (x - x*)^m f^(m)(ξ)/m! = (x - x*)^m g(x),

with g(x*) ≠ 0. So f(x) behaves like a polynomial with a multiple zero.


3.8.5 Convergence of Newton's method in the case of multiple zeros

Let f(x) be a function with a zero x* of multiplicity m; then

f(x) = (x - x*)^m g(x),    g(x*) ≠ 0.

The fixed-point iteration for finding the solution x* is

x_{k+1} = φ(x_k),  k = 0, 1, 2, ....

If we take

φ(x) = x - f(x)/f'(x),

we get Newton's method. From this expression of φ we see that we cannot compute the value of φ(x) at x*, since f'(x*) = 0. But fortunately, in each iteration we only need to compute φ at x_k (not at x*), where f'(x_k) is generally not zero.
On the other hand, from the expression of φ it seems that φ(x) is undefined at x*. But a detailed computation gives

φ(x) = x - (x - x*)^m g(x) / (m (x - x*)^{m-1} g(x) + (x - x*)^m g'(x))
     = x - (x - x*) g(x) / (m g(x) + (x - x*) g'(x)),

which indicates that φ(x) is actually well defined at x*. But we cannot use this equivalent formula to compute φ(x_k) in each Newton iteration (think about why).
Next, we check the convergence of Newton's method. By the convergence theorem for the fixed-point iteration, we need the condition

|φ'(x*)| < 1.

By direct calculation,

φ'(x*) = 1 - 1/m.

So we always have |φ'(x*)| < 1 for m ≥ 1. This indicates the following local convergence (i.e., when the initial guess x_0 is close to x*) of Newton's method when it is applied to a nonlinear equation with a zero of multiplicity m:

1. It converges quadratically for m = 1, namely when x* is a simple zero;

2. It converges only linearly to the multiple zero x*, with rate λ = 1 - 1/m, for m > 1. If m = 2, the convergence rate is 1/2, the same as the convergence rate of the bisection algorithm. Newton's method converges more slowly when m is larger.
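A quick numerical check of this rate (an added illustration, not part of the notes): Newton's method applied to f(x) = (x - 1)^2, which has a zero of multiplicity m = 2 at x* = 1.

```python
def newton_step(x):
    # f(x) = (x - 1)**2, f'(x) = 2*(x - 1)
    return x - (x - 1.0) ** 2 / (2.0 * (x - 1.0))

x = 2.0
for k in range(6):
    err_old = abs(x - 1.0)
    x = newton_step(x)
    print(k, abs(x - 1.0) / err_old)   # the ratio stays ~0.5: linear convergence with rate 1 - 1/m = 1/2
```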

3.9 The secant method

Recall the previously introduced quasi-Newton method

x_{k+1} = x_k - f(x_k)/g_k,

with g_k chosen to approximate the derivative f'(x_k). It is usually not easy to obtain an accurate approximation of f'(x_k). One effective and accurate choice is to select g_k using the difference quotient

g_k = (f(x_k) - f(x_{k-1})) / (x_k - x_{k-1}),    (3.6)

and this leads to the following secant method:

x_{k+1} = (x_{k-1} f(x_k) - x_k f(x_{k-1})) / (f(x_k) - f(x_{k-1})),  k = 1, 2, ....

From the expression of g_k in (3.6), we may expect that at the beginning x_k and x_{k-1} are not very close, so g_k is not a very accurate approximation of f'(x_k), and the convergence of the secant method should be slower than that of Newton's method. But as x_k gets more and more accurate, g_k approximates f'(x_k) more and more accurately, and the convergence of the secant method should approach that of Newton's method. We will study the convergence of the secant method in more detail in the next subsection.
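A minimal Python sketch of the secant method (added for illustration; not part of the original notes). The two starting values in the example match the secant column of the table in Section 3.8.3.

```python
def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant method: x_{k+1} = x_k - f(x_k)*(x_k - x_{k-1})/(f(x_k) - f(x_{k-1}))."""
    f0, f1 = f(x0), f(x1)
    for k in range(max_iter):
        if abs(f1) <= tol:
            return x1, k
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f0, f1 = f1, f(x1)
    return x1, max_iter

print(secant(lambda x: x * x - 2.0, 1.0, 1.5))   # ~1.414214, cf. Section 3.8.3
```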
Convergence order of the secant method

In order to study the convergence of the secant method, we introduce the iteration function

φ(u, v) = u - f(u)(u - v)/(f(u) - f(v)) = (v f(u) - u f(v))/(f(u) - f(v));

then the secant method can be written in the form

x_{k+1} = φ(x_k, x_{k-1}).

Due to this relation, the secant method is also called a two-point iterative method, since it uses the two previous approximate values x_k and x_{k-1} to compute each new approximate value x_{k+1}.
Next we analyze the order of convergence of the secant method under the assumption that f'(x*) ≠ 0. To that end, we define the error at the k-th step to be e_k = x_k - x*. Then we apply the Taylor expansion to get

f(x_k) = f(x* + e_k) = f'(x*) e_k + f''(x*) e_k^2 / 2 + O(e_k^3).

Now we can deduce

e_{k+1} = x_{k+1} - x*
 = x_k - f(x_k)(x_k - x_{k-1})/(f(x_k) - f(x_{k-1})) - x*
 = ((x_{k-1} - x*) f(x_k) - (x_k - x*) f(x_{k-1})) / (f(x_k) - f(x_{k-1}))
 = (e_{k-1} f(x_k) - e_k f(x_{k-1})) / (f(x_k) - f(x_{k-1}))
 = (e_{k-1} f(x* + e_k) - e_k f(x* + e_{k-1})) / (f(x* + e_k) - f(x* + e_{k-1}))
 = (e_{k-1}(e_k f'(x*) + e_k^2 f''(x*)/2 + O(e_k^3)) - e_k(e_{k-1} f'(x*) + e_{k-1}^2 f''(x*)/2 + O(e_{k-1}^3)))
   / ((e_k f'(x*) + e_k^2 f''(x*)/2 + O(e_k^3)) - (e_{k-1} f'(x*) + e_{k-1}^2 f''(x*)/2 + O(e_{k-1}^3)))
 = (e_{k-1} e_k f''(x*)(e_k - e_{k-1})/2 + O(e_{k-1}^4)) / ((e_k - e_{k-1})(f'(x*) + (e_k + e_{k-1}) f''(x*)/2 + O(e_{k-1}^2)))
 = e_{k-1} e_k f''(x*) / (2 f'(x*)) + O(e_{k-1}^3).

To find an approximate order p, we let |e_{k+1}| ≈ C|e_k|^p. Then, after dropping the higher-order term above, we need to find p such that

|e_{k-1} e_k f''(x*)/(2 f'(x*))| ≈ C|e_k|^p.

This gives D|e_{k-1} e_k| ≈ |e_k|^p, with D = |f''(x*)/(2 C f'(x*))|. We can write this relation as D|e_{k-1}| ≈ |e_k|^{p-1}. On the other hand, we should have |e_k| ≈ C|e_{k-1}|^p, hence D|e_{k-1}| ≈ (C|e_{k-1}|^p)^{p-1}. This implies

D = C^{p-1},    p(p - 1) = 1.

Clearly, the only positive solution is p = (1 + √5)/2 ≈ 1.618 (the golden mean). And the constant C is given by

C = |f''(x*)/(2 f'(x*))|^{p-1} ≈ |f''(x*)/(2 f'(x*))|^{0.618}.

In summary, we have shown that the secant method converges with order p ≈ 1.618. So it is much faster than linear convergence, and only slightly slower than Newton's method.

4 Systems of nonlinear equations

4.1 Newton's method for a 2 × 2 system

We first take a 2 × 2 system as an example to illustrate Newton's method for solving systems of nonlinear equations. The extension to general systems of nonlinear equations is straightforward.
Consider solving the following 2 × 2 system:

f(x, y) = 2x + xy - 1 = 0,
g(x, y) = 2y - xy + 1 = 0.

This is a system of 2 nonlinear equations in 2 unknowns. It can be written in vector form

F(x) = 0,    (4.1)

where

F(x) = ( f(x, y), g(x, y) )^T,    x = ( x, y )^T.

Treating (4.1) as if it were a one-dimensional equation, we would write the following Newton iteration:

x_{k+1} = x_k - F(x_k)/F'(x_k).    (4.2)

But what are F'(x_k) and 1/F'(x_k)? Here F'(x) collects the derivatives of f and g with respect to x and y, and is defined as

F'(x) = F'((x, y)^T) = [ ∂f(x,y)/∂x   ∂f(x,y)/∂y ]
                       [ ∂g(x,y)/∂x   ∂g(x,y)/∂y ].

It is also known as the Jacobian of the function F at x.
Think about why we cannot define the Jacobian of the function F at x by

F̃'(x) = F̃'((x, y)^T) = [ ∂f(x,y)/∂x   ∂g(x,y)/∂x ]
                        [ ∂f(x,y)/∂y   ∂g(x,y)/∂y ].

With this definition, (4.2) can be written as

x_{k+1} = x_k - F'(x_k)^{-1} F(x_k),  k = 0, 1, 2, ....    (4.3)

This is Newton's method for 2-dimensional problems.

Example. Solve the nonlinear system

f(x, y) = 2x + xy - 1 = 0,
g(x, y) = 2y - xy + 1 = 0,

with initial guess

x_0 = (0, 0)^T.

Solution. For this system, we have

F'(x) = F'((x, y)^T) = [ 2 + y    x    ]
                       [  -y     2 - x ].

For the first iteration:

F(x_0) = F((0, 0)^T) = (-1, 1)^T,

and

F'(x_0) = F'((0, 0)^T) = [ 2  0 ]
                         [ 0  2 ].

Therefore

x_1 = (0, 0)^T - [2 0; 0 2]^{-1} (-1, 1)^T = (1/2, -1/2)^T.

For the second iteration:

F(x_1) = F((1/2, -1/2)^T) = (-1/4, 1/4)^T,

and

F'(x_1) = F'((1/2, -1/2)^T) = [ 3/2  1/2 ]
                              [ 1/2  3/2 ].

Therefore

x_2 = (1/2, -1/2)^T - [3/2 1/2; 1/2 3/2]^{-1} (-1/4, 1/4)^T = (3/4, -3/4)^T.

You can verify yourself that

x_3 = (0.875, -0.875)^T,    x_4 = (0.9375, -0.9375)^T,

and clearly the sequence {x_k} converges to the limit x* = (1, -1)^T.
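The iterations above are easy to reproduce with NumPy (an added illustration, not part of the notes):

```python
import numpy as np

def F(v):
    x, y = v
    return np.array([2*x + x*y - 1.0, 2*y - x*y + 1.0])

def J(v):                      # the Jacobian F'(x)
    x, y = v
    return np.array([[2.0 + y, x], [-y, 2.0 - x]])

v = np.array([0.0, 0.0])       # initial guess x_0
for k in range(5):
    v = v - np.linalg.solve(J(v), F(v))
    print(k + 1, v)            # (0.5, -0.5), (0.75, -0.75), ... -> (1, -1)
```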

4.2

Newtons method for general nonlinear systems

Consider the following system of nonlinear equations:


f1 (x1 , x2 , , xn ) = 0 ,
f2 (x1 , x2 , , xn ) = 0 ,

fn (x1 , x2 , , xn ) = 0 ,
37

where each fi (x1 , x2 , , xn ) is a nonlinear function of n variables x1 , x2 , , xn .


The system can be written simply as
F (x) = 0 ,
where F (x) = (f1 (x), f2 (x), , fn (x))T , and x = (x1 , x2 , , xn )T .
Suppose an approximate solution x_k of F(x) = 0 is available; we look for the next approximate solution x_{k+1} by Newton's method. To do so, we first look at the Taylor expansion of the vector-valued function F(x). For each component of F(x), the Taylor series gives
    f_i(x) = f_i(x_k) + ∂f_i/∂x_1(x_k)(x - x_k)_1 + ∂f_i/∂x_2(x_k)(x - x_k)_2 + ... + ∂f_i/∂x_n(x_k)(x - x_k)_n + ...,
for i = 1, 2, ..., n. If we let F'(x) denote the Jacobian matrix of F at x, given by
    F'(x) = [ ∂f_1/∂x_1(x)   ∂f_1/∂x_2(x)   ...   ∂f_1/∂x_n(x) ;
              ∂f_2/∂x_1(x)   ∂f_2/∂x_2(x)   ...   ∂f_2/∂x_n(x) ;
              ...
              ∂f_n/∂x_1(x)   ∂f_n/∂x_2(x)   ...   ∂f_n/∂x_n(x) ],
then we can write the previous expansions compactly as
    F(x) = F(x_k) + F'(x_k)(x - x_k) + ... .
Newton's method uses the first-order approximation of F(x), namely
    F(x_k) + F'(x_k)(x - x_k),
in place of F(x) to find the solution of the equation F(x) = 0. That is, setting
    F(x_k) + F'(x_k)(x - x_k) = 0,
we find Newton's method
    x_{k+1} = x_k - F'(x_k)^{-1} F(x_k),   k = 0, 1, 2, ...
When the Jacobian matrix F'(x) is large, it may not be efficient or stable to compute the inverse F'(x_k)^{-1} at each iteration. Then we usually implement Newton's method as follows:

Newton's method. For k = 0, 1, 2, ..., do the following:
1. Compute d_k by solving
    F'(x_k) d_k = -F(x_k).
2. Update x_{k+1} by
    x_{k+1} = x_k + d_k.
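A minimal numpy sketch of this implementation (not part of the original notes); the functions F and jac, the starting point x0 and the tolerance are assumed to be supplied by the user:

```python
import numpy as np

def newton_system(F, jac, x0, tol=1e-10, max_iter=50):
    """Newton's method for F(x) = 0: solve F'(x_k) d_k = -F(x_k), then x_{k+1} = x_k + d_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = np.linalg.solve(jac(x), -F(x))   # step 1: a linear solve instead of inverting
        x = x + d                            # step 2: update
        if np.linalg.norm(d) < tol:
            break
    return x
```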

4.3 Convergence of Newton's method

In this section we will study the convergence of Newton's method for general systems of nonlinear equations. For this purpose we need to introduce a few useful results. The first one is the following fundamental theorem of calculus.

Let F be differentiable in an open set Ω ⊂ R^n and x* ∈ Ω. Then for all x sufficiently close to x*, we have
    F(x) - F(x*) = ∫_0^1 F'(x* + t(x - x*))(x - x*) dt.
The following lemma is often called the Banach Lemma.

Let A and B be two n × n matrices such that B is an approximate inverse of A, namely ||I - BA|| < 1. Then both A and B are nonsingular and
    ||A^{-1}|| ≤ ||B|| / (1 - ||I - BA||),   ||B^{-1}|| ≤ ||A|| / (1 - ||I - BA||).

An outline of the proof. For any n × n matrix C, if ||C|| < 1 then I - C is invertible and
    (I - C)^{-1} = I + C + C^2 + ... .
So if ||I - BA|| < 1, we know BA = I - (I - BA) is invertible, hence both A and B are invertible. Then we can write
    A^{-1} = (BA)^{-1} B = (I - (I - BA))^{-1} B = { I + (I - BA) + (I - BA)^2 + ... } B,
which implies
    ||A^{-1}|| ≤ ||B|| / (1 - ||I - BA||).   □
Now we make the following standard assumptions on F:
1. There is a solution x* to the equation F(x) = 0.
2. The Jacobian matrix F'(x*) is nonsingular.
3. The Jacobian F' : Ω → R^{n×n} is Lipschitz continuous with Lipschitz constant γ.

For our analysis, we need to measure the surrounding points near x*. We define
    B_δ(x) = { y ∈ R^n ; ||y - x|| < δ }.
Then, under the above three standard assumptions, we can show:

There exists a δ > 0 such that for all x ∈ B_δ(x*) it holds that
    ||F'(x)|| ≤ 2 ||F'(x*)||,                                               (4.4)
    ||F'(x)^{-1}|| ≤ 2 ||F'(x*)^{-1}||,                                     (4.5)
    (1/2) ||F'(x*)^{-1}||^{-1} ||x - x*|| ≤ ||F(x)|| ≤ 2 ||F'(x*)|| ||x - x*||.   (4.6)

Proof. Let δ be small enough that B_δ(x*) ⊂ Ω. For all x ∈ B_δ(x*) we have
    ||F'(x)|| ≤ ||F'(x*)|| + γ ||x - x*||,
which implies (4.4) once γδ ≤ ||F'(x*)||.

The next result (4.5) follows from the Banach Lemma if we can show
    ||I - F'(x*)^{-1} F'(x)|| < 1/2.
In fact, if we choose δ appropriately we have
    ||I - F'(x*)^{-1} F'(x)|| = ||F'(x*)^{-1} (F'(x*) - F'(x))||
                              ≤ ||F'(x*)^{-1}|| γ ||x - x*||
                              ≤ γ δ ||F'(x*)^{-1}|| < 1/2.                  (4.7)

To prove the last estimate (4.6), we note that if x ∈ B_δ(x*) then x* + t(x - x*) ∈ B_δ(x*) for all 0 ≤ t ≤ 1. Using (4.4), the fundamental theorem of calculus and F(x*) = 0, we obtain
    ||F(x)|| ≤ ∫_0^1 ||F'(x* + t(x - x*))|| ||x - x*|| dt ≤ 2 ||F'(x*)|| ||x - x*||,
which is the right inequality of (4.6). To show the left one, we note that
    F'(x*)^{-1} F(x) = F'(x*)^{-1} ∫_0^1 F'(x* + t(x - x*))(x - x*) dt
                     = (x - x*) - ∫_0^1 ( I - F'(x*)^{-1} F'(x* + t(x - x*)) )(x - x*) dt,
hence by (4.7) we get
    ||F'(x*)^{-1} F(x)|| ≥ ||x - x*|| ( 1 - max_{0 ≤ t ≤ 1} ||I - F'(x*)^{-1} F'(x* + t(x - x*))|| ) ≥ ||x - x*|| / 2.
Therefore
    ||x - x*|| / 2 ≤ ||F'(x*)^{-1} F(x)|| ≤ ||F'(x*)^{-1}|| ||F(x)||,
which ends the proof.   □
We are now ready to demonstrate the convergence of Newton's method:

Under the three standard Assumptions 1, 2 and 3, there exist constants K > 0 and δ > 0 such that for any x_0 ∈ B_δ(x*), the sequence {x_n} generated by Newton's method satisfies x_n ∈ B_δ(x*) and
    ||x_{n+1} - x*|| ≤ K ||x_n - x*||^2,   n = 0, 1, 2, ...
So Newton's method converges quadratically.

Proof. By the definition we have
    x_{n+1} - x* = x_n - x* - F'(x_n)^{-1} F(x_n)
                 = F'(x_n)^{-1} ∫_0^1 ( F'(x_n) - F'(x* + t(x_n - x*)) )(x_n - x*) dt.
Using (4.4), (4.5) and the Lipschitz continuity of F', we derive
    ||x_{n+1} - x*|| ≤ (2 ||F'(x*)^{-1}||) γ ||x_n - x*||^2 / 2.
This completes the proof.   □

4.4 Broyden's method

Similar to the secant method, Broyden's method is also a variant of Newton's method that requires no derivative information.

Next we discuss how to derive Broyden's method for solving the system of nonlinear equations F(x) = 0. Recall Newton's method
    x_{k+1} = x_k - F'(x_k)^{-1} F(x_k),   k = 0, 1, 2, ...,
which requires computing and inverting the Jacobian matrix F'(x_k) at each iteration. When we derived Newton's method, we used the affine function
    y = F(x_k) + F'(x_k)(x - x_k)
to approximate the original nonlinear function F(x). Broyden's method replaces the Jacobian matrix F'(x_k) by a simpler approximation A_k, so the affine function becomes
    y = F(x_k) + A_k (x - x_k).
Now we require this function to take the same value as F(x) at x = x_{k-1}, namely
    A_k (x_k - x_{k-1}) = F(x_k) - F(x_{k-1}).
Noting that A_k has n^2 entries, we can view this equation as only n constraint equations, so some additional information is needed to determine A_k uniquely. One strategy is to choose A_k such that it is identical to A_{k-1} on all vectors that are orthogonal to d_k = x_k - x_{k-1}, which implies
    A_k - A_{k-1} = u d_k^T
for some vector u. We can find the vector u by solving
    (A_k - A_{k-1}) d_k = u d_k^T d_k,
namely
    u = (A_k - A_{k-1}) d_k / (d_k^T d_k) = (F(x_k) - F(x_{k-1}) - A_{k-1} d_k) / (d_k^T d_k).
The above process suggests the following Broyden's method:

Broyden's method. Select x_0 and A_0. For k = 0, 1, 2, ..., do the following:
1. Compute d_{k+1} using
    d_{k+1} = -A_k^{-1} F(x_k).
2. Update x_{k+1} by
    x_{k+1} = x_k + d_{k+1}.
3. Update A_{k+1} using
    A_{k+1} = A_k + (F(x_{k+1}) - F(x_k) - A_k d_{k+1}) d_{k+1}^T / (d_{k+1}^T d_{k+1}).

We can show that Broyden's method converges superlinearly. The following result gives a linear rate:

Let F be differentiable with F' Lipschitz continuous with Lipschitz constant γ in the ball B_δ(x*), and let ||F'(x*)^{-1}|| ≤ β for some constant β > 0. If the initial value x_0 and the initial matrix A_0 are chosen such that x_0 ∈ B_δ(x*), A_0 is invertible, and
    ||A_0 - F'(x*)|| + 2γ ||x_0 - x*|| ≤ 1/(8β),
then the iterates x_n and A_n generated by Broyden's method are all well defined, and the method converges linearly:
    ||x_{k+1} - x*|| ≤ (1/2) ||x_k - x*||.
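A compact numpy sketch of the resulting iteration (not part of the original notes); F, the starting point x0 and the initial matrix A0 are assumed to be supplied by the user:

```python
import numpy as np

def broyden(F, x0, A0, tol=1e-10, max_iter=100):
    """Broyden's method: rank-one updates of an approximate Jacobian A_k."""
    x, A = np.asarray(x0, dtype=float), np.asarray(A0, dtype=float)
    for _ in range(max_iter):
        d = np.linalg.solve(A, -F(x))              # d_{k+1} = -A_k^{-1} F(x_k)
        x_new = x + d
        if np.linalg.norm(d) < tol:
            return x_new
        y = F(x_new) - F(x)
        A = A + np.outer(y - A @ d, d) / (d @ d)   # rank-one Broyden update
        x = x_new
    return x
```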

4.5 Steepest descent method

The previous iterative methods require a good initial guess for their convergence. The steepest descent method is a simple iterative method that makes no assumptions on the initial guess. Instead of working with the original nonlinear equation
    F(x) = 0,
the steepest descent method targets a local minimum of the function
    g(x) = F(x)^T F(x) / 2,
always moving along a descent direction at each iteration.

Descent direction. A direction p ∈ R^n is called a descent direction of a function f at a point x_0 if it satisfies, for some t_0 > 0,
    f(x_0 + t p) < f(x_0)   for all 0 < t < t_0.
It is easy to check that -∇g is a descent direction of g, and
    ∇g(x) = F'(x)^T F(x),
where F'(x) is the Jacobian matrix of F(x). Here is the steepest descent algorithm.

Steepest descent method. Select x_0. For k = 0, 1, 2, ..., do the following:
1. Find α_k that solves the one-dimensional minimization
    min_{α ≥ 0} g(x_k - α ∇g(x_k)).
2. Update x_{k+1} by
    x_{k+1} = x_k - α_k ∇g(x_k).
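A small Python sketch of this algorithm (not part of the original notes). The exact one-dimensional minimization in step 1 is replaced here by a simple backtracking search; this substitute, and the function names F and jac, are assumptions of the sketch rather than part of the method as stated above.

```python
import numpy as np

def steepest_descent(F, jac, x0, tol=1e-8, max_iter=500):
    """Minimize g(x) = 0.5*||F(x)||^2 along -grad g(x) = -F'(x)^T F(x)."""
    g = lambda x: 0.5 * F(x) @ F(x)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        grad = jac(x).T @ F(x)                 # grad g(x) = F'(x)^T F(x)
        if np.linalg.norm(grad) < tol:
            break
        alpha = 1.0
        while g(x - alpha * grad) >= g(x) and alpha > 1e-14:
            alpha *= 0.5                       # crude substitute for the exact line search
        x = x - alpha * grad
    return x
```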
The following result shows the convergence of the steepest descent method when it is applied to a simple model problem.

Convergence rate. Let A be a symmetric positive definite matrix and consider the quadratic function g(x) = x^T A x / 2. Suppose the eigenvalues of A are given by
    0 < λ_1 ≤ λ_2 ≤ ... ≤ λ_n,   and let κ = λ_n / λ_1.
Then the sequence {x_k} generated by the steepest descent method satisfies
    g(x_{k+1}) ≤ ((κ - 1)/(κ + 1))^2 g(x_k) ≤ ((κ - 1)/(κ + 1))^{2(k+1)} g(x_0).

Proof. We can easily check that ∇g(x) = Ax. By Taylor expansion, we can write
    g(x + αAx) = g(x) + ∇g(x)^T (αAx) + (α^2/2)(Ax)^T A (Ax)
               = g(x) + α (Ax)^T (Ax) + (α^2/2)(Ax)^T A (Ax).
Using this relation and the selection of α_k, we derive
    g(x_{k+1}) = g(x_k - α_k A x_k) ≤ g(x_k + α A x_k)   for all α ∈ R^1,
which implies
    g(x_{k+1}) ≤ g( ((a + bA)/a) x_k )   for all a ≠ 0, b ∈ R^1.           (4.8)
Letting P_k(z) = 1 - α_k z, we can write
    x_k - α_k A x_k = P_k(A) x_k.
Now it follows from (4.8) that
    g(x_{k+1}) = g( P_k(A) x_k ) ≤ g( (Q_k(A)/Q_k(0)) x_k )                 (4.9)
for any polynomial Q_k(z) of degree one with Q_k(0) ≠ 0. Define
    Q_k(z) = (2z - (λ_1 + λ_n)) / (λ_n - λ_1).
As Q_k(λ_1) = -1 and Q_k(λ_n) = 1, we have
    |Q_k(λ_j)| ≤ 1   for all j.
Let {v_j} be the orthonormalized eigenvectors corresponding to {λ_j}. We can express
    x_k = Σ_{j=1}^n a_j^{(k)} v_j.
Then we can estimate as follows:
    g(x_{k+1}) ≤ (1 / (2 Q_k(0)^2)) Σ_{j=1}^n (a_j^{(k)})^2 Q_k(λ_j)^2 λ_j
               ≤ (1 / (2 Q_k(0)^2)) Σ_{j=1}^n (a_j^{(k)})^2 λ_j
               = (1 / Q_k(0)^2) g(x_k)
               ≤ ((κ - 1)/(κ + 1))^2 g(x_k).
This completes the proof of the desired estimate.   □

5 Solutions of linear systems of algebraic equations

In numerical solutions of nearly all mathematical models, including linear and nonlinear differential equations, integral equations and systems of nonlinear algebraic equations, one may have to solve systems of linear algebraic equations repeatedly, possibly millions of times in many practical applications. Before discussing how to solve such systems, we first introduce some fundamental definitions and concepts related to systems of linear algebraic equations. But we will not discuss most of the content of Sections 5.1 and 5.2 in the lectures.

5.1 Matrices, vectors and scalars

In this section we review very briefly some basic concepts about matrices, vectors, scalars and linear systems. All of these are supposed to be known to the students in this course, so we will not discuss them in any detail in the lectures.

Matrix. An m × n matrix A is a rectangular array of numbers of the form
    A = [ a_11  a_12  ...  a_1,n-1  a_1n ;
          a_21  a_22  ...  a_2,n-1  a_2n ;
          ...
          a_m1  a_m2  ...  a_m,n-1  a_mn ],
and it can be simply written as
    A = (a_ij)_{m×n},
or A ∈ R^{m×n}. If m = n, A is called a square matrix.

The numbers a_ij are called the entries or elements of the matrix. By convention, the first subscript i, called the row index, indicates the row in which the entry a_ij is located. The second subscript j, called the column index, indicates the column in which the entry a_ij is located. Thus the entry a_ij is located at the i-th row and the j-th column.

For example, for the matrix
    A = [ 1  2  3 ;  4  0  1 ;  1  3  4 ],
we have a_12 = 2, a_32 = 3, and so on.

Vector. An n-vector x is an array of the form
    x = [ x_1 ; x_2 ; ... ; x_n ],
and we often write x ∈ R^n. The integer n is called the dimension. The number x_j is called the j-th component of x.

By convention, all vectors are column vectors, and the transpose of a column vector is a row vector, e.g.,
    x = [ x_1 ; x_2 ; ... ; x_n ],   x^T = (x_1, x_2, ..., x_n),
where x is a column vector and x^T is its transpose, a row vector.

Operations with matrices. A matrix A multiplied by a scalar λ is defined by
    λA = (λ a_ij)_{m×n},
e.g.,
    2 [ 1  0 ;  0  1 ] = [ 2  0 ;  0  2 ].
If matrices A and B have the same number of rows and columns, and
    A = (a_ij)_{m×n},   B = (b_ij)_{m×n},
then we define
    A + B = (a_ij + b_ij)_{m×n}.
Given two matrices A and B,
    A = (a_ij)_{l×m},   B = (b_ij)_{m×n},
then AB is an l × n matrix defined by
    AB = (c_ij)_{l×n},   c_ij = Σ_{k=1}^m a_ik b_kj.
E.g., consider
    A = [ 1  -1  0 ;  2  3  -1 ;  -2  0  2 ],   B = [ 2  0 ;  -1  0 ;  2  1 ];
then we have
    AB = [ 3  0 ;  -1  -1 ;  0  2 ].

Identity matrix. The matrix
    I_n = [ 1  0  ...  0 ;  0  1  ...  0 ;  ... ;  0  0  ...  1 ]
is called the identity matrix of order n; its diagonal entries are one and its off-diagonal entries are zero.

Diagonal matrix. A matrix D is called a diagonal matrix if its off-diagonal entries are all zero, e.g.,
    D = [ d_1  0  ...  0 ;  0  d_2  ...  0 ;  ... ;  0  0  ...  d_n ],
which is often written as
    D = diag(d_1, d_2, ..., d_n).
Matrix-vector product. Consider a matrix A and a vector x:
    A = (a_ij)_{m×n},   x = (x_1, x_2, ..., x_n)^T;
then we define
    Ax = [ a_11 x_1 + a_12 x_2 + ... + a_1n x_n ;
           a_21 x_1 + a_22 x_2 + ... + a_2n x_n ;
           ... ;
           a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n ],
which is a vector of dimension m.


Properties of matrices.
(i) Associative laws:
    (A + B) + C = A + (B + C),   (AB)C = A(BC).
(ii) Distributive law:
    A(B + C) = AB + AC.
(iii) Commutative law:
    A + B = B + A.
But usually we do not have
    AB = BA.
Is this true when A is a diagonal matrix?

For example, consider the following two matrices
    A = [ 0  1 ;  1  1 ],   B = [ 0  1 ;  1  0 ];
then
    AB = [ 1  0 ;  1  1 ],   but   BA = [ 1  1 ;  0  1 ] ≠ AB.

Transpose of a matrix. The transpose of a matrix A = (a_ij)_{m×n} is given by A^T = (a_ji)_{n×m}. E.g.,
    A = [ 1  2  3 ;  4  5  6 ;  7  8  9 ;  10  11  12 ],   A^T = [ 1  4  7  10 ;  2  5  8  11 ;  3  6  9  12 ].
The transpose satisfies the following properties:
    (λA)^T = λ A^T,   (A + B)^T = A^T + B^T,   (AB)^T = B^T A^T.
Scalar products. For two vectors
    x = (x_1, x_2, ..., x_n)^T,   y = (y_1, y_2, ..., y_n)^T,
we define their scalar product by
    x^T y = x_1 y_1 + x_2 y_2 + ... + x_n y_n = y^T x.
This product is also called the inner product of x and y.
Using this relation, we can verify that for any n × n matrix A = (a_ij) we have
    x^T A y = Σ_{i,j=1}^n a_ij x_i y_j = y^T A^T x.

Block matrices. A matrix can be written in block form; this is a useful tool in analysing properties of matrices. For example, the 5 × 7 matrix
    A = [ a_11  a_12  ...  a_17 ;
          a_21  a_22  ...  a_27 ;
          a_31  a_32  ...  a_37 ;
          a_41  a_42  ...  a_47 ;
          a_51  a_52  ...  a_57 ]
can be written as a block matrix
    A = [ A_11  A_12  A_13 ;  A_21  A_22  A_23 ],
where, for instance,
    A_11 = [ a_11  a_12 ;  a_21  a_22 ],   A_22 = [ a_33  a_34  a_35 ;  a_43  a_44  a_45 ;  a_53  a_54  a_55 ].
If B is another matrix having the same numbers of rows and columns as A and decomposed blockwise in the same way as A above,
    B = [ B_11  B_12  B_13 ;  B_21  B_22  B_23 ],
then we have
    A + B = [ A_11 + B_11  A_12 + B_12  A_13 + B_13 ;  A_21 + B_21  A_22 + B_22  A_23 + B_23 ].
But note that the multiplication AB does not make sense here.

Matrix-vector product (column form). Consider
    A = (a_ij)_{m×n},   x = (x_1, x_2, ..., x_n)^T.
If we write A columnwise as A = (a_1  a_2  ...  a_n), then we have
    Ax = x_1 a_1 + x_2 a_2 + ... + x_n a_n,
so Ax is nothing else but the combination of the column vectors of A with coefficients being the components of x.

5.2 Theory of linear systems

From now on we will be concerned with the solution of the system of linear algebraic equations
    Ax = b,
where A is an n × n matrix, and x and b are vectors of dimension n. Before discussing how to solve the system, we first look at when the system has a unique solution. We summarize some useful conclusions below:

Let A be an n × n matrix; then the following statements are equivalent:
(1) For all x, Ax = 0 implies x = 0.
(2) The columns (rows) of A are linearly independent.
(3) For any vector b, the system Ax = b has a solution.
(4) If Ax = b has a solution, it is unique.
(5) There is a matrix B such that BA = AB = I.
(6) det(A) ≠ 0.

Properties of inverse matrices.
1. The product AB is nonsingular if and only if A and B are both nonsingular, and in this case
    (AB)^{-1} = B^{-1} A^{-1}.
2. If the matrix A is nonsingular, then the matrix A^T is also nonsingular and
    (A^T)^{-1} = (A^{-1})^T.

5.3 Simple solution of a linear system

Consider the n × n square system Ax = b. A natural way to solve the system seems to be multiplying both sides of Ax = b by A^{-1} to get the solution
    x = A^{-1} b.
This suggests the following algorithm for solving Ax = b:

Algorithm.
1. Compute C = A^{-1}.
2. Compute the solution x = C b.

This algorithm is very expensive when the size of A is large, as computing A^{-1} is extremely time-consuming, and more importantly it can be unstable.

In fact, a natural way to compute the inverse A^{-1} is equivalent to solving n linear systems. To see this, let
    A^{-1} = (y_1, y_2, ..., y_n);
as A A^{-1} = I, we have
    A y_i = e_i,   e_i = (0, ..., 0, 1, 0, ..., 0)^T,   i = 1, 2, ..., n.
So computing A^{-1} amounts to solving n equations like Ax = b with different right-hand sides b. This is much more expensive than solving the original equation Ax = b.

Is there any more efficient algorithm for solving Ax = b? Yes. One very popular algorithm is to use the LU factorization of A:
    A = LU,
where L is a lower triangular matrix, i.e., its entries are all zero above the diagonal, and U is an upper triangular matrix, i.e., its entries are all zero below the diagonal.

Using A = LU, the solution of the system Ax = b amounts to solving the system LUx = b. Letting Ux = y, this suggests the following algorithm:

Algorithm.
1. Factorize A = LU.
2. Solve Ly = b.
3. Solve Ux = y.

We next discuss how to solve triangular systems, and then discuss how to find the LU factorization of A.
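As an aside (not in the original notes), standard libraries follow exactly this factor-then-substitute pattern rather than forming A^{-1}. A small illustration using scipy's lu_factor/lu_solve pair, which performs the LU factorization (with partial pivoting) and then the two triangular solves; the matrix and right-hand side are just the 2 × 2 example used later in this chapter:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[2.0, 4.0], [4.0, 11.0]])
b = np.array([2.0, 1.0])

lu, piv = lu_factor(A)        # step 1: factorize A
x = lu_solve((lu, piv), b)    # steps 2-3: forward and backward substitution
```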

5.4 Solutions of triangular systems

Consider solving the lower triangular system
    Lx = b,
where L is a lower triangular matrix of order n, i.e., l_ij = 0 for all i < j. So we can write
    L = [ l_11  0     0    ...  0 ;
          l_21  l_22  0    ...  0 ;
          ...
          l_i1  l_i2  ...  l_ii  0  ...  0 ;
          ...
          l_n1  l_n2  ...  l_ni  ...  l_nn ].
To solve the system Lx = b, we can start with the first equation to find x_1, then find x_2 from the second equation, and so on. Let us work out more details. Write Lx = b componentwise:
    l_11 x_1                                   = b_1
    l_21 x_1 + l_22 x_2                        = b_2
    ...
    l_i1 x_1 + l_i2 x_2 + ... + l_ii x_i       = b_i
    ...
    l_n1 x_1 + l_n2 x_2 + ... + l_nn x_n       = b_n;
then we can solve the system as follows:
    x_1 = b_1 / l_11,
    x_2 = (b_2 - l_21 x_1) / l_22,
    x_i = (b_i - Σ_{j=1}^{i-1} l_ij x_j) / l_ii,   i = 3, 4, ..., n.

This algorithm is called the forward-substitution algorithm, as we start with x_1 and end with x_n. Let us now check the cost of this algorithm.
Total multiplications:
    Σ_{i=1}^n Σ_{j=1}^{i-1} 1 = Σ_{i=1}^n (i - 1) = n(n - 1)/2 ≈ n^2 / 2.
Total subtractions: the same as for multiplications.
So the total cost for solving the lower triangular system is approximately
    n^2/2 + n^2/2 = n^2.
If each arithmetic operation takes roughly one unit of time, the total running time is therefore approximately proportional to n^2. Note that in the above computational cost we have omitted the time for retrieving entries l(i, j) and checking whether i < n or j < i.

Similarly, we can solve the upper triangular system
    Ux = b,
where U is upper triangular of order n, i.e., u_ij = 0 for all i > j:
    U = [ u_11  u_12  ...  u_1i  ...  u_1n ;
          0     u_22  ...  u_2i  ...  u_2n ;
          ...
          0     0     ...  u_ii  ...  u_in ;
          ...
          0     0     ...  0     ...  u_nn ].
We can write the system Ux = b componentwise:
    u_11 x_1 + u_12 x_2 + ... + u_1i x_i + ... + u_1n x_n = b_1
              u_22 x_2 + ... + u_2i x_i + ... + u_2n x_n = b_2
    ...
                               u_ii x_i + ... + u_in x_n = b_i
    ...
                                               u_nn x_n = b_n.
Then we can solve the system as follows:
    x_n = b_n / u_nn,
    x_{n-1} = (b_{n-1} - u_{n-1,n} x_n) / u_{n-1,n-1},
    x_i = (b_i - Σ_{j=i+1}^n u_ij x_j) / u_ii,   i = n-2, n-3, ..., 2, 1.
Work out the total computational cost of the above solution process in terms of n, the total number of degrees of freedom.
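A sketch of both substitution algorithms in Python (not part of the original notes), assuming nonsingular triangular matrices stored as full numpy arrays:

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Lx = b for lower triangular L, computing x_1 first."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

def back_substitution(U, b):
    """Solve Ux = b for upper triangular U, computing x_n first."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x
```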

5.5 Cholesky factorization

We have discussed how to solve the system Ax = b when we have the LU factorization of A. Below we shall discuss how to factorize a matrix A. We start with positive definite matrices.

5.5.1 Properties of symmetric positive definite matrices

Symmetric matrices. A matrix A of order n is said to be symmetric if A^T = A, that is,
    a_ij = a_ji,   i, j = 1, 2, ..., n.
E.g., the following matrix A is symmetric:
    A = [ 1  2  4 ;  2  3  5 ;  4  5  6 ] = A^T.
A symmetric matrix is determined by its entries on and above its diagonal, hence only half of its entries need to be stored.

Symmetric positive definite matrix. An n × n matrix A is said to be symmetric positive definite (SPD) if
(1) A is symmetric;
(2) x^T A x > 0 for all x ≠ 0.

Some properties of SPD matrices.
1. An SPD matrix is nonsingular. Write A = (a_1  a_2  ...  a_n); then we should show that {a_i} are linearly independent, or equivalently that
    Ax = 0  implies  x = 0.
Suppose this is not true, that is, for some x ≠ 0 we have Ax = 0. Then x^T A x = 0, which contradicts the positive definiteness of A.

2. Any diagonal square submatrix of A is also an SPD matrix. To see this, let us look at a diagonal submatrix of A of the form
    A_k = [ a_{i1 i1}  a_{i1 i2}  ...  a_{i1 ik} ;
            a_{i2 i1}  a_{i2 i2}  ...  a_{i2 ik} ;
            ...
            a_{ik i1}  a_{ik i2}  ...  a_{ik ik} ].
For any vector of the form x = x_{i1} e_{i1} + ... + x_{ik} e_{ik} ∈ R^n, let x_k = (x_{i1}, ..., x_{ik})^T ∈ R^k; then it is easy to find that
    x^T A x = x^T (x_{i1} a_{i1} + ... + x_{ik} a_{ik}) = Σ_{l=1}^k Σ_{t=1}^k x_{il} x_{it} a_{il it} = x_k^T A_k x_k.
From this we know that for any x_k ≠ 0 we have x_k^T A_k x_k > 0 by the positive definiteness of A.

3. Any eigenvalue of an SPD matrix is positive. Suppose λ is an eigenvalue; then there exists an x ≠ 0, called an eigenvector of A, such that Ax = λx. Then we have
    x^T A x = λ x^T x.
As x^T A x > 0 and x ≠ 0 implies x^T x > 0, we must have λ > 0.

4. For any rectangular matrix U, if its column vectors are linearly independent, then the matrix U^T U is an SPD matrix.

5.5.2 Cholesky factorization

We know that a matrix of the form U^T U, with the columns of U linearly independent, is an SPD matrix. We next show that the converse is also true: if A is an SPD matrix, then A can be factorized as U^T U, where U is an upper triangular matrix. If, in addition, we require the diagonal entries of U to be positive, then the factorization is unique and is called the Cholesky factorization of A.

To factor A into U^T U, let us write
    A = [ α  a^T ;  a  A_11 ],   U = [ u_11  r^T ;  0  U_11 ];
then
    A = U^T U
is equivalent to
    [ α  a^T ;  a  A_11 ] = [ u_11  0 ;  r  U_11^T ] [ u_11  r^T ;  0  U_11 ].
Comparing both sides of the equation, we obtain
1. α = u_11^2,
2. a^T = u_11 r^T,
3. A_11 = r r^T + U_11^T U_11,
or equivalently,
1. u_11 = √α (take only the positive root),
2. r^T = a^T / u_11,
3. U_11^T U_11 = A_11 - r r^T =: Ã_11.
Step 3 above shows that U_11 is the Cholesky factor of the matrix Ã_11 = A_11 - r r^T.

One can repeat the above procedure for the submatrix Ã_11, and the Cholesky factorization proceeds in n steps:
1. At the first step, the first row of U is computed and the (n-1) × (n-1) submatrix in the bottom right corner is modified;
2. At the 2nd step, the second row of U is computed and the (n-2) × (n-2) submatrix in the bottom right corner is modified;
3. The procedure continues till the n-th step, or till nothing is left in the bottom right corner.

Thus the algorithm is a loop over the rows of U to be computed; we update the upper triangular part of A in place and do not need another matrix U for storage. When the algorithm is finished, the upper triangular part of A stores the required information of U.
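A numpy sketch of this in-place procedure (not part of the original notes); it assumes the input matrix is SPD and returns the upper triangular factor:

```python
import numpy as np

def cholesky_upper(A):
    """Return upper triangular U with A = U^T U, computed row by row in place."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n):
        A[k, k] = np.sqrt(A[k, k])                            # u_kk = sqrt(alpha)
        A[k, k+1:] /= A[k, k]                                 # r^T = a^T / u_kk
        A[k+1:, k+1:] -= np.outer(A[k, k+1:], A[k, k+1:])     # A_11 <- A_11 - r r^T
    return np.triu(A)

# Should reproduce the factor U of Example 1 below
U = cholesky_upper([[3.0, -1.0, 1.0], [-1.0, 3.0, 0.0], [1.0, 0.0, 3.0]])
```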
The cost of the Cholesky factorization. The number of multiplications (or subtractions), mainly for computing r r^T at each step, is
    Σ_{k=1}^n Σ_{i=k+1}^n Σ_{j=k+1}^n 1 ≈ ∫_0^n (n - k)^2 dk = (1/3)(n - 1)^3 ≈ (1/3) n^3.
But the matrix r r^T is symmetric, so we need to compute only half of its entries, and the total cost is approximately
    (1/6) n^3 (multiplications) + (1/6) n^3 (subtractions) = (1/3) n^3.

An important technical issue in Cholesky factorization. Recall that step 3 in the Cholesky factorization requires the submatrix Ã_11 = A_11 - r r^T to be symmetric and positive definite. Symmetry is clear, and positive definiteness follows from the identity obtained by writing x^T = (x_1, x̃^T):
    x^T [ α  a^T ;  a  A_11 ] x = α x_1^2 + 2 x_1 (a^T x̃) + x̃^T A_11 x̃
                                = ( √α x_1 + (a^T x̃)/√α )^2 + x̃^T ( A_11 - a a^T / α ) x̃.
Since A_11 - a a^T/α = A_11 - r r^T, choosing x_1 = -(a^T x̃)/α shows that x̃^T (A_11 - r r^T) x̃ = x^T A x > 0 for every x̃ ≠ 0.
Examples of Cholesky factorization

Example 1. Find the Cholesky factorization of the following matrix
    A = [ 3  -1  1 ;  -1  3  0 ;  1  0  3 ].

Solution.
1. Update the 1st row and the submatrix at the bottom right corner: the first row of U is (√3, -1/√3, 1/√3), and the remaining 2 × 2 block becomes
    [ 3 - 1/3   0 + 1/3 ;  0 + 1/3   3 - 1/3 ] = [ 8/3  1/3 ;  1/3  8/3 ].
2. Update the 2nd row and the submatrix at the bottom right corner: the second row of U is (0, 2√2/√3, 1/(2√6)), and the remaining entry becomes
    8/3 - 1/24 = 63/24.
3. Update the 3rd row: u_33 = √(63/24).
This gives
    U = [ √3  -1/√3  1/√3 ;  0  2√2/√3  1/(2√6) ;  0  0  √(63/24) ].
One can easily check that U^T U = A.

Example 2. Find the Cholesky factorization of the matrix
    A = [ 4  1/2  1 ;  1/2  17/16  1/4 ;  1  1/4  33/64 ].

Solution.
1. Update the 1st row and the submatrix at the bottom right corner: the first row of U is (2, 1/4, 1/2), and the remaining 2 × 2 block becomes
    [ 17/16 - 1/16   1/4 - 1/8 ;  1/4 - 1/8   33/64 - 1/4 ] = [ 1  1/8 ;  1/8  17/64 ].
2. Update the 2nd row and the submatrix at the bottom right corner: the second row of U is (0, 1, 1/8), and the remaining entry becomes
    17/64 - 1/64 = 1/4.
3. Update the third row: u_33 = √(1/4) = 1/2.
This gives
    U = [ 2  1/4  1/2 ;  0  1  1/8 ;  0  0  1/2 ].
One can easily verify that U^T U = A.

5.6 LU factorization and Gaussian elimination

The solution of non-SPD systems of linear algebraic equations arises in nearly all applications of mathematics, but the Cholesky factorization introduced in the previous section is only applicable to SPD matrices. Next we shall discuss the LU factorization for general matrices.

Consider solving a system of equations of the form
    Ax = b,
where A is a non-singular n × n matrix and b ∈ R^n is a vector. Gaussian elimination is basically a process of the so-called LU factorization of the matrix A. To start with, we first give some basic properties of matrix operations.

Consider a 4 × 4 matrix
    L_1 = [ 1  0  0  0 ;  l_21  1  0  0 ;  l_31  0  1  0 ;  l_41  0  0  1 ],
where l_21, l_31 and l_41 are real numbers. It is easy to verify that
    L_1^{-1} = [ 1  0  0  0 ;  -l_21  1  0  0 ;  -l_31  0  1  0 ;  -l_41  0  0  1 ].
The same facts are true for similar matrices L_2 and L_3. Please check it yourself!
Now consider the following two matrices:
    L_1 = [ 1  0  0  0 ;  l_21  1  0  0 ;  l_31  0  1  0 ;  l_41  0  0  1 ],
    L_2 = [ 1  0  0  0 ;  0  1  0  0 ;  0  l_32  1  0 ;  0  l_42  0  1 ];
we can directly check that
    L_1 L_2 = [ 1  0  0  0 ;  l_21  1  0  0 ;  l_31  l_32  1  0 ;  l_41  l_42  0  1 ].
Please check the product L_2 L_1!
The matrices L_1, L_2, etc. are called elementary matrices.

Now, try to understand the meaning of the following operation:
    L_2 A = [ 1  0  0  0 ;  0  1  0  0 ;  0  l_32  1  0 ;  0  l_42  0  1 ]
            [ a_11  a_12  a_13  a_14 ;  a_21  a_22  a_23  a_24 ;  a_31  a_32  a_33  a_34 ;  a_41  a_42  a_43  a_44 ]
          = [ a_11              a_12              a_13              a_14 ;
              a_21              a_22              a_23              a_24 ;
              a_31 + l_32 a_21  a_32 + l_32 a_22  a_33 + l_32 a_23  a_34 + l_32 a_24 ;
              a_41 + l_42 a_21  a_42 + l_42 a_22  a_43 + l_42 a_23  a_44 + l_42 a_24 ].
If you look at the above operation carefully, you will come to the conclusion:
The actions of adding row 2 multiplied by l_32 and l_42 to row 3 and row 4, respectively, equal the operation L_2 A.
Similar results are true for the other L_i matrices and any n × n matrix A.
5.6.1 Gaussian elimination

Gaussian elimination is a process to reduce a full n × n system of equations
    a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1
    a_21 x_1 + a_22 x_2 + ... + a_2n x_n = b_2
    ...
    a_n1 x_1 + a_n2 x_2 + ... + a_nn x_n = b_n
into an upper triangular system of equations
    a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1
               ã_22 x_2 + ... + ã_2n x_n = b̃_2
    ...
                                ã_nn x_n = b̃_n.
This is equivalent to a process of reducing the full matrix
    A = [ a_11  a_12  ...  a_1n ;  a_21  a_22  ...  a_2n ;  ... ;  a_n1  a_n2  ...  a_nn ]
into an upper triangular matrix
    Ã = [ a_11  a_12  ...  a_1n ;  0  ã_22  ...  ã_2n ;  ... ;  0  0  ...  ã_nn ].

Next, we will explain Gaussian elimination and the LU factorization with two simple examples: one is a 2 × 2 system of equations, the other a 3 × 3 system. If you understand these two simple examples, you may easily carry out Gaussian elimination and LU factorization for more general n × n systems.
Gaussian elimination for a 2 × 2 matrix

Consider the following 2 × 2 system:
    Ax ≡ [ 2  4 ;  4  11 ] (x_1, x_2)^T = (2, 1)^T ≡ b.                      (5.1)
Eliminating x_1 in the 2nd equation requires adding row 1 multiplied by -2 to row 2, which equals
    L̃A ≡ [ 1  0 ;  -2  1 ] [ 2  4 ;  4  11 ] = [ 2  4 ;  0  3 ] ≡ U.
Using the property of the matrix L̃, the above equation gives
    A = [ 2  4 ;  4  11 ] = [ 1  0 ;  2  1 ] [ 2  4 ;  0  3 ] ≡ L U.
This gives an LU factorization of A.
Using the factorization, solving the system Ax = b is equivalent to solving LUx = b, which can be done as follows:
    L c = b,   U x = c.
Applying this process to equation (5.1), we have
    L c = b  ⇒  [ 1  0 ;  2  1 ] (c_1, c_2)^T = (2, 1)^T,
which gives
    (c_1, c_2)^T = (2, -3)^T.
Then
    U x = c  ⇒  [ 2  4 ;  0  3 ] (x_1, x_2)^T = (2, -3)^T,
which gives the solution of equation (5.1):
    (x_1, x_2)^T = (3, -1)^T.

Example 5.1. Use Gaussian elimination to solve the following 2 × 2 system:
    [ 2  6 ;  4  13 ] (x_1, x_2)^T = (1, 0)^T.                               (5.2)

LU factorization of a 3 × 3 matrix

Let us consider one more simple example for Gaussian elimination (footnote 8: one can write the Gaussian elimination and the LU factorisation in parallel):
    Ax ≡ [ 1  1  1 ;  3  6  4 ;  1  2  1 ] (x_1, x_2, x_3)^T = (0, 2, -1/3)^T ≡ b.    (5.3)
Eliminating x_1 in the 2nd equation requires adding row 1 multiplied by -3 to row 2, and eliminating x_1 in the 3rd equation requires adding row 1 multiplied by -1 to row 3, which equals
    L̃_1 A ≡ [ 1  0  0 ;  -3  1  0 ;  -1  0  1 ] [ 1  1  1 ;  3  6  4 ;  1  2  1 ] = [ 1  1  1 ;  0  3  1 ;  0  1  0 ] ≡ U_1.
Now eliminating x_2 in the 3rd equation requires adding row 2 multiplied by -1/3 to row 3, which equals
    L̃_2 L̃_1 A ≡ [ 1  0  0 ;  0  1  0 ;  0  -1/3  1 ] [ 1  1  1 ;  0  3  1 ;  0  1  0 ] = [ 1  1  1 ;  0  3  1 ;  0  0  -1/3 ] ≡ U.
Using the properties of the matrices L̃_1 and L̃_2, the above equations give
    A = [ 1  1  1 ;  3  6  4 ;  1  2  1 ]
      = [ 1  0  0 ;  3  1  0 ;  1  0  1 ] [ 1  0  0 ;  0  1  0 ;  0  1/3  1 ] [ 1  1  1 ;  0  3  1 ;  0  0  -1/3 ]
      = [ 1  0  0 ;  3  1  0 ;  1  1/3  1 ] [ 1  1  1 ;  0  3  1 ;  0  0  -1/3 ]
      ≡ L U.
This completes an LU factorization of A. We can easily write the above LU factorization process directly for the system (5.3) of linear algebraic equations.

Using this factorization, solving the system Ax = b is equivalent to solving LUx = b, which can be done as follows:
    L c = b,   U x = c.
Applying this process to equation (5.3), we have
    L c = b  ⇒  [ 1  0  0 ;  3  1  0 ;  1  1/3  1 ] (c_1, c_2, c_3)^T = (0, 2, -1/3)^T,
which gives
    (c_1, c_2, c_3)^T = (0, 2, -1)^T.
Then
    U x = c  ⇒  [ 1  1  1 ;  0  3  1 ;  0  0  -1/3 ] (x_1, x_2, x_3)^T = (0, 2, -1)^T,
which gives the solution of equation (5.3):
    (x_1, x_2, x_3)^T = (-8/3, -1/3, 3)^T.
Example 5.2. Use Gaussian elimination to solve the following two 3 × 3 systems and write down the LU factorisation in parallel:
    [ 1  0  2 ;  3  4  1 ;  1  3  4 ] (x_1, x_2, x_3)^T = (0, 1, 1)^T,       (5.4)
and
    [ 3  2  1 ;  3  4  3 ;  1  2  1 ] (x_1, x_2, x_3)^T = (1, 2, 1)^T.       (5.5)

LU factorization for an n × n matrix

The previous LU factorization can be carried out for more general n × n matrices. One can show that if all n leading principal minors
    A_k = [ a_11  a_12  ...  a_1k ;  a_21  a_22  ...  a_2k ;  ... ;  a_k1  a_k2  ...  a_kk ],   k = 1, 2, ..., n,
of the n × n matrix A are non-singular, then A has an LU factorization
    A = LU,
where L is a lower triangular matrix with 1 as its diagonal entries, and U is an upper triangular matrix.

Code for the LU factorization. Next, we give a code for carrying out the LU factorization of a general n × n matrix. The code overwrites A with its LU factorization. When it is done, the entries of U occupy the upper part of the matrix A, including the diagonal, and the entries of L occupy the lower part of A, excluding the diagonal (as the diagonal entries of L are known to be one and do not need to be stored).

We may write the general elimination step k as follows:
    row k:   ã_kk  ...  ã_kj  ...   →   ã_kk  ...  ã_kj  ...
    row i:   ã_ik  ...  ã_ij  ...   →   0     ...  ã_ij - (ã_ik/ã_kk) ã_kj  ...
and the elementary matrix L_k has 1's on its diagonal and the entries -ã_ik/ã_kk in column k below the diagonal (rows i = k+1, ..., n).
LU factorization of the n × n matrix A.

      do 40 k = 1, n-1
         do 10 i = k+1, n
            a(i,k) = a(i,k)/a(k,k)
   10    continue
         do 30 j = k+1, n
            do 20 i = k+1, n
               a(i,j) = a(i,j) - a(i,k)*a(k,j)
   20       continue
   30    continue
   40 continue
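For readers who prefer to run the algorithm, here is a sketch of the same in-place factorization in Python/numpy (not part of the original notes). It assumes no pivoting is needed, i.e. all leading principal minors are nonsingular, and it is tried on the 4 × 4 matrix of the worked example below:

```python
import numpy as np

def lu_inplace(A):
    """LU without pivoting: on return, U sits in the upper part of A (with diagonal)
    and the multipliers of L sit in the strictly lower part (unit diagonal implied)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                                  # multipliers
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])      # update the trailing block
    return A

M = lu_inplace([[ 6.0,  -2.0, 2.0,   4.0],
                [12.0,  -8.0, 6.0,  10.0],
                [ 3.0, -13.0, 9.0,   3.0],
                [-6.0,   4.0, 1.0, -18.0]])
L = np.tril(M, -1) + np.eye(4)
U = np.triu(M)
```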

Example. Solve the following system by Gaussian elimination (footnote 9: write the Gaussian elimination and the LU factorisation in parallel):
    [  6   -2   2    4 ;
      12   -8   6   10 ;
       3  -13   9    3 ;
      -6    4   1  -18 ]  (x_1, x_2, x_3, x_4)^T = (12, 34, 27, -38)^T.

Solution. Let
    A_1 = A = [ 6  -2  2  4 ;  12  -8  6  10 ;  3  -13  9  3 ;  -6  4  1  -18 ];
then we have
    L_1 = [ 1  0  0  0 ;  -2  1  0  0 ;  -1/2  0  1  0 ;  1  0  0  1 ].
This gives
    A_2 = L_1 A_1 = [ 6  -2  2  4 ;  0  -4  2  2 ;  0  -12  8  1 ;  0  2  3  -14 ].
Looking at the second column of A_2, we know
    L_2 = [ 1  0  0  0 ;  0  1  0  0 ;  0  -3  1  0 ;  0  1/2  0  1 ],
which yields
    A_3 = L_2 A_2 = [ 6  -2  2  4 ;  0  -4  2  2 ;  0  0  2  -5 ;  0  0  4  -13 ].
Further, checking the third column of A_3, we get
    L_3 = [ 1  0  0  0 ;  0  1  0  0 ;  0  0  1  0 ;  0  0  -2  1 ],
which helps us find
    A_4 = L_3 A_3 = [ 6  -2  2  4 ;  0  -4  2  2 ;  0  0  2  -5 ;  0  0  0  -3 ].
Now we know
    U = A_4 = L_3 L_2 L_1 A.
This implies
    A = L_1^{-1} L_2^{-1} L_3^{-1} U
      = [ 1  0  0  0 ;  2  1  0  0 ;  1/2  0  1  0 ;  -1  0  0  1 ]
        [ 1  0  0  0 ;  0  1  0  0 ;  0  3  1  0 ;  0  -1/2  0  1 ]
        [ 1  0  0  0 ;  0  1  0  0 ;  0  0  1  0 ;  0  0  2  1 ]  U
      = [ 1  0  0  0 ;  2  1  0  0 ;  1/2  3  1  0 ;  -1  -1/2  2  1 ]
        [ 6  -2  2  4 ;  0  -4  2  2 ;  0  0  2  -5 ;  0  0  0  -3 ]
      = [ 6  -2  2  4 ;  12  -8  6  10 ;  3  -13  9  3 ;  -6  4  1  -18 ]  ≡  L U.
Using the LU factorization, one can easily find the desired solution of the given system. (Please practice!)
5.6.2 LDU factorization

The LU factorization of a matrix may not be unique. Next we show how to ensure a unique factorization. Suppose we have obtained an LU factorization of A:
    A = L̃ Ũ.
Then, letting D = diag(Ũ), we can further factorize A as
    A = L D U,
where L and U are lower and upper triangular matrices respectively, both with 1 as their diagonal entries, and D is a diagonal matrix.

The advantage of the LDU factorization is its uniqueness. Next we show that the LDU factorization is unique. That is, if we have
    A = L_1 D_1 U_1 = L_2 D_2 U_2,
where the L_i are lower triangular matrices with 1 on the main diagonal, the D_i are diagonal matrices, and the U_i are upper triangular matrices with 1 on the main diagonal, then L_1 = L_2, D_1 = D_2 and U_1 = U_2.

From the assumption, we have
    L_2^{-1} L_1 D_1 = D_2 U_2 U_1^{-1}.
Let L ≡ L_2^{-1} L_1 and U ≡ U_2 U_1^{-1}. Then we have
    L D_1 = D_2 U.                                                       (5.6)
It is tedious (but not difficult) to verify that L ≡ L_2^{-1} L_1 is still a lower triangular matrix with 1 on the main diagonal. (The proof goes like this: first show that the inverse of a lower triangular matrix L is still lower triangular, by comparing the entries of the product LB = I and arguing that B has to be lower triangular; then show that if the main diagonal entries of L are all 1, so are those of B; finally, show that the product of two lower triangular matrices with unit main diagonals again has a unit main diagonal.) By the same reasoning, U ≡ U_2 U_1^{-1} is still an upper triangular matrix with 1 on the main diagonal.

Writing the matrices in (5.6) out: L D_1 is lower triangular with diagonal entries (D_1)_11, ..., (D_1)_nn and possibly nonzero entries (marked *) below the diagonal, while D_2 U is upper triangular with diagonal entries (D_2)_11, ..., (D_2)_nn and possibly nonzero entries * above the diagonal. Comparing the entries on the left- and right-hand sides, we conclude that (D_1)_ii = (D_2)_ii for all i, i.e. D_1 = D_2. Moreover, all the entries * must be zero. In particular, that means both L and U are nothing but the identity matrix I. Since I = L ≡ L_2^{-1} L_1, we have L_1 = L_2. Similarly, U_1 = U_2.   □

Remark 5.1. Please refer to the Tutorial Notes and Assignments for more details about how to find the factorization of the form A = L D U. Try the example in the previous subsection.
5.6.3 Cholesky factorization from LDU factorization

A matrix A is said to be symmetric positive definite if A = B B^T for some nonsingular matrix B. An equivalent definition is
    x^T A x > 0   for all x ≠ 0.
Cholesky found a way to write every symmetric positive definite matrix A in the form A = L L^T, where L is lower triangular.

Let A be a symmetric positive definite matrix, and let us see how Cholesky writes A in the form A = L L^T. Let A = LDU be the unique LDU factorization of A. Because A is symmetric, we have
    L D U = (L D U)^T = U^T D L^T.
Note that U^T (resp. L^T) is a lower (resp. an upper) triangular matrix with 1 on the main diagonal. By uniqueness of the factorization, we have L^T = U and hence
    A = L D L^T.
Next we claim that all diagonal entries of D are positive. (Remember that since D is diagonal, all its off-diagonal entries are zero.) In fact, for each i let x_i = (L^T)^{-1} e_i, where the e_i are the canonical basis vectors; then we have
    0 < x_i^T A x_i = e_i^T ((L^T)^{-1})^T L D L^T (L^T)^{-1} e_i = e_i^T D e_i = D_ii.
Thus D_ii > 0 for all i, and hence we can write D = D^{1/2} D^{1/2}, where D^{1/2} is a diagonal matrix with main diagonal entries √D_ii. Then
    A = L D^{1/2} D^{1/2} L^T = L D^{1/2} (L D^{1/2})^T ≡ L̃ L̃^T,
where of course L̃ = L D^{1/2} is a lower triangular matrix.

Remark 5.2. Note that the diagonal entries of L̃ in the Cholesky factorization need not be 1, unlike the diagonal entries of L in the LU factorization.

Remark 5.3. Please check the Tutorial Notes for more details about how to find the factorization of the form A = L̃ L̃^T.

5.7 LU factorization with partial pivoting

5.7.1 The necessity of pivoting

Recall step k of the LU factorization:
    row k:   ã_kk  ã_kj   →   ã_kk   ã_kj
    row i:   ã_ik  ã_ij   →   0      ã_ij - (ã_ik/ã_kk) ã_kj.
From this we see that the elimination process can continue only when the diagonal entry ã_kk is nonzero, as it is used as a divisor. If no diagonal entry is zero at any step, the LU factorization can carry on to completion. If the diagonal entry is zero at a certain step, the algorithm cannot continue. For example, this is the case with the matrix
    A = [ 0  1 ;  1  0 ].
Sometimes, even if the diagonal entry is not zero but very small, one also runs into trouble. For example, consider solving the following simple system with a small parameter ε:
    [ ε  1 ;  1  1 ] (x_1, x_2)^T = (1, 2)^T;                                (5.7)
its exact solution is
    x_1 = 1/(1 - ε) ≈ 1,   x_2 = (1 - 2ε)/(1 - ε) ≈ 1.
But if we use Gaussian elimination in finite-precision arithmetic, we may have a problem. We have
    A_1 = [ ε  1 ;  1  1 ],   L_1 = [ 1  0 ;  -1/ε  1 ],

then
    A_2 = L_1 A_1 = [ ε  1 ;  0  1 - 1/ε ].
This gives
    L = L_1^{-1} = [ 1  0 ;  1/ε  1 ],   U = [ ε  1 ;  0  1 - 1/ε ].
Then solving the system (5.7) is to solve
    L U x = (1, 2)^T;
the forward substitution gives
    [ ε  1 ;  0  1 - 1/ε ] (x_1, x_2)^T = (1, 2 - 1/ε)^T.
In floating-point arithmetic with a very small ε, both 1 - 1/ε and 2 - 1/ε are rounded to -1/ε, and this leads to the following wrong solution:
    x_2 = (2 - 1/ε)/(1 - 1/ε) = 1,   x_1 = (1 - x_2)/ε = 0.

Below we present another example, carried out with a finite number of digits of accuracy, to better understand the necessity of partial pivoting in the LU factorization. Consider solving the simple system
    [ 0.0001  0.5 ;  0.4  -0.3 ] (x_1, x_2)^T = (0.5, 0.1)^T.                (5.8)
Let us assume that the computer (or calculator) carries only 4 digits of accuracy. Then the true solution is
    x_1 = 0.9999,   x_2 = 0.9998.
Using Gaussian elimination to eliminate the first column, we have
    [ 1  0 ;  -0.4/0.0001  1 ] [ 0.0001  0.5 ;  0.4  -0.3 ] (x_1, x_2)^T
        = [ 1  0 ;  -0.4/0.0001  1 ] (0.5, 0.1)^T.
Simplifying it, we get
    [ 0.0001  0.5 ;  0  -2000 ] (x_1, x_2)^T = (0.5, -2000)^T.               (5.9)
Notice that the (2,2)-entry is obtained by
    -(0.4/0.0001) × 0.5 - 0.3 = -2000.3 ≈ -2000.
We see that the information of a_22 = -0.3 in (5.8) is completely wiped out by the elimination. It is as if the original matrix in (5.8) had started out with a_22 = 0; one would get the same matrix as in (5.9) after the elimination.
Solving the upper-triangular system (5.9), we have
    x_2 = 1,   x_1 = (0.5 - 0.5 x_2)/0.0001 = 0.
We see that the solution, especially x_1, is totally wrong. Thus if there are small pivots, such as the (1,1)-entry 0.0001 in the current example, then Gaussian elimination may give very inaccurate results.
The remedy is to use partial pivoting, i.e., we permute the rows so that the largest entry in magnitude in the column becomes the pivot. Let us see what happens if we use partial pivoting. Consider system (5.8) again. Using partial pivoting, we permute the row containing the largest entry in magnitude of the first column (i.e. 0.4) to the first row:
    [ 0.4  -0.3 ;  0.0001  0.5 ] (x_1, x_2)^T = (0.1, 0.5)^T.                (5.10)
Notice that the permutation has to be applied to the whole rows, including the right-hand side. Then the Gaussian elimination of (5.10) becomes
    [ 1  0 ;  -0.0001/0.4  1 ] [ 0.4  -0.3 ;  0.0001  0.5 ] (x_1, x_2)^T
        = [ 1  0 ;  -0.0001/0.4  1 ] (0.1, 0.5)^T.
Now simplifying it, we have
    [ 0.4  -0.3 ;  0  0.5001 ] (x_1, x_2)^T = (0.1, 0.5000)^T.
Hence the solution is
    x_2 = 0.9998,   x_1 = (0.1 + 0.3 x_2)/0.4 = 0.9999.
We see that the solution is accurate up to the specified precision of 4 digits.


It is known by experience that partial pivoting is sufficient for all practical problems. But mathematicians have found examples where partial pivoting fails (see page 109 in Stewart's book). In those cases, one has to use full pivoting to ensure correctness of the solutions. Full pivoting means that one permutes the largest entry in absolute value over both the rows and the columns to the pivot position, instead of just considering the rows. In our case above, we would be solving
    [ 0.5  0.0001 ;  -0.3  0.4 ] (x_2, x_1)^T = (0.5, 0.1)^T.
Unfortunately, the computations become considerably more expensive if one uses full pivoting. Most commercial software (like Matlab) only uses partial pivoting.
In Matlab, given a matrix A, to see how partial pivoting and the LU factorization are done step by step, use rrefmovie(A).

5.7.2 LU factorization with partial pivoting

In the last subsection we saw that when the diagonal entry is too small at a certain stage of the LU factorization, it may cause serious problems if one continues the factorization. In this case, we should adopt some special strategy before continuing. The following pivoting is one such strategy:

At the kth stage of the LU factorization, suppose the matrix A has become A_k = (a_ij^{(k)}). Determine an index p_k for which |a_{p_k,k}^{(k)}| is largest among all |a_{jk}^{(k)}| for k ≤ j ≤ n. Then interchange rows k and p_k before proceeding to the next step of the factorization.

With the pivoting, the LU factorization process takes the form
    U = L_{n-1} P_{n-1} ... L_2 P_2 L_1 P_1 A,
where the P_k are the matrices for the interchanges.
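A numpy sketch of the pivoted factorization (not part of the original notes); it records the interchanges in a permutation vector p instead of forming the matrices P_k explicitly, and it assumes A is nonsingular:

```python
import numpy as np

def lu_partial_pivoting(A):
    """LU with partial pivoting: returns p, L, U such that A[p, :] = L U."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    p = np.arange(n)                              # row permutation, replaces the P_k
    for k in range(n - 1):
        pk = k + np.argmax(np.abs(A[k:, k]))      # row with the largest |a_jk|, j >= k
        if pk != k:
            A[[k, pk]] = A[[pk, k]]
            p[[k, pk]] = p[[pk, k]]
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return p, L, U
```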
To carry out the LU factorization with pivoting, we first introduce an elementary matrix for interchanging rows.
Let P_{i,j} be the matrix obtained by interchanging rows i and j of the identity matrix; then one can easily check that P_{i,j} A is the matrix obtained by interchanging rows i and j of A:
    P_{1,3} = [ 0  0  1  ...  0 ;  0  1  0  ...  0 ;  1  0  0  ...  0 ;  ... ;  0  0  0  ...  1 ],
    P_{1,3} A = [ a_31  a_32  a_33  ...  a_3n ;
                  a_21  a_22  a_23  ...  a_2n ;
                  a_11  a_12  a_13  ...  a_1n ;
                  ...
                  a_n1  a_n2  a_n3  ...  a_nn ].
The matrix P_{i,j} is called an elementary permutation, and we can verify that
    P_{i,j}^{-1} = P_{i,j}^T = P_{i,j}.

Example. Solve the following system using the LU factorization with pivoting:
    [ 1  1  0  3 ;
      1  0  3  1 ;
      0  1  1  1 ;
      3  0  1  2 ]  (x_1, x_2, x_3, x_4)^T = (4, 0, 3, -1)^T.

Solution. It can be done in the following n - 1 = 3 steps.

Step 1. Permute rows 1 and 4 using P_{1,4}:
    [ 3  0  1  2 ;  1  0  3  1 ;  0  1  1  1 ;  1  1  0  3 ] x = (-1, 0, 3, 4)^T.
Then do one step of the standard LU factorization:
    [ 3  0  1  2 ;  0  0  8/3  1/3 ;  0  1  1  1 ;  0  1  -1/3  7/3 ] x = (-1, 1/3, 3, 13/3)^T.

Step 2. Permute rows 2 and 3 using P_{2,3}:
    [ 3  0  1  2 ;  0  1  1  1 ;  0  0  8/3  1/3 ;  0  1  -1/3  7/3 ] x = (-1, 3, 1/3, 13/3)^T.
Then do one step of the standard LU factorization:
    [ 3  0  1  2 ;  0  1  1  1 ;  0  0  8/3  1/3 ;  0  0  -4/3  4/3 ] x = (-1, 3, 1/3, 4/3)^T.

Step 3. No permutation is needed. Do one step of the standard LU factorization:
    [ 3  0  1  2 ;  0  1  1  1 ;  0  0  8/3  1/3 ;  0  0  0  3/2 ] x = (-1, 3, 1/3, 3/2)^T.
Back substitution then gives the solution
    x = (-1, 2, 0, 1)^T.
Matrix forms. It is interesting to write the above Gaussian elimination process in matrix form. We can do this as follows. Let
    A = [ 1  1  0  3 ;  1  0  3  1 ;  0  1  1  1 ;  3  0  1  2 ].
Step 1. Permute rows 1 and 4 using P_{1,4}:
    P_{1,4} A = [ 3  0  1  2 ;  1  0  3  1 ;  0  1  1  1 ;  1  1  0  3 ].
Then do one step of the standard LU factorization:
    L_1 (P_{1,4} A) = [ 3  0  1  2 ;  0  0  8/3  1/3 ;  0  1  1  1 ;  0  1  -1/3  7/3 ],
    L_1 = [ 1  0  0  0 ;  -1/3  1  0  0 ;  0  0  1  0 ;  -1/3  0  0  1 ].

Step 2. Permute rows 2 and 3 using P_{2,3}:
    P_{2,3} (L_1 P_{1,4} A) = [ 3  0  1  2 ;  0  1  1  1 ;  0  0  8/3  1/3 ;  0  1  -1/3  7/3 ].
Then do one step of the standard LU factorization:
    L_2 (P_{2,3} L_1 P_{1,4} A) = [ 3  0  1  2 ;  0  1  1  1 ;  0  0  8/3  1/3 ;  0  0  -4/3  4/3 ],
    L_2 = [ 1  0  0  0 ;  0  1  0  0 ;  0  0  1  0 ;  0  -1  0  1 ].

Step 3. No permutation is needed. Do one step of the standard LU factorization:
    L_3 (L_2 P_{2,3} L_1 P_{1,4} A) = [ 3  0  1  2 ;  0  1  1  1 ;  0  0  8/3  1/3 ;  0  0  0  3/2 ] ≡ U,
    L_3 = [ 1  0  0  0 ;  0  1  0  0 ;  0  0  1  0 ;  0  0  1/2  1 ].

This gives the LU factorization:
    A = P_{1,4}^{-1} L_1^{-1} P_{2,3}^{-1} L_2^{-1} L_3^{-1} U
      = [ 1/3  0  0  1 ;  1/3  1  0  0 ;  0  0  1  0 ;  1  0  0  0 ]  P_{2,3}
        [ 1  0  0  0 ;  0  1  0  0 ;  0  0  1  0 ;  0  1  -1/2  1 ]  U
      = [ 1/3  0  0  1 ;  1/3  1  0  0 ;  0  0  1  0 ;  1  0  0  0 ]
        [ 1  0  0  0 ;  0  0  1  0 ;  0  1  0  0 ;  0  1  -1/2  1 ]  U
      = [ 1/3  1  -1/2  1 ;  1/3  0  1  0 ;  0  1  0  0 ;  1  0  0  0 ]
        [ 3  0  1  2 ;  0  1  1  1 ;  0  0  8/3  1/3 ;  0  0  0  3/2 ].

5.8 LU factorization of upper Hessenberg and tridiagonal matrices

A matrix A is upper Hessenberg if
    a_ij = 0   for all i > j + 1,
that is, A is of the form
    A = [ a_11  a_12  a_13  a_14  ...  a_1,n-1  a_1n ;
          a_21  a_22  a_23  a_24  ...  a_2,n-1  a_2n ;
          0     a_32  a_33  a_34  ...  a_3,n-1  a_3n ;
          0     0     a_43  a_44  ...  a_4,n-1  a_4n ;
          0     0     0     a_54  ...  a_5,n-1  a_5n ;
          ...                                        ].
Think about which of the following are upper Hessenberg matrices:
    [ 2  1  0 ;  1  2  3 ;  0  0  1 ],
    [ 1  2  3  4 ;  2  3  4  5 ;  0  4  5  6 ;  0  5  6  7 ],
    [ 1  2  3  4 ;  2  3  4  5 ;  0  4  5  6 ;  0  0  6  7 ].
Work out the LU factorization of an upper Hessenberg matrix!
A matrix A is called tridiagonal if
    a_ij = 0   for |j - i| > 1,
that is, A is of the form
    A = [ a_11  a_12  0     0     ...  0 ;
          a_21  a_22  a_23  0     ...  0 ;
          0     a_32  a_33  a_34  ...  0 ;
          0     0     a_43  a_44  ...  0 ;
          ...                            ].
Work out the LU factorization of a tridiagonal matrix!
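One possible answer to the last exercise, sketched in Python (not part of the original notes): if no pivoting is needed (all pivots nonzero), the LU factorization of a tridiagonal matrix touches only the three diagonals and costs O(n) operations instead of O(n^3).

```python
import numpy as np

def lu_tridiagonal(a, b, c):
    """LU of a tridiagonal matrix with sub-diagonal a, diagonal b, super-diagonal c.
    Returns l (sub-diagonal of L, unit diagonal implied) and u (diagonal of U);
    the super-diagonal of U is simply c."""
    n = len(b)
    u = np.array(b, dtype=float)
    l = np.zeros(n - 1)
    for k in range(n - 1):
        l[k] = a[k] / u[k]           # multiplier l_{k+1,k}
        u[k + 1] -= l[k] * c[k]      # update the next pivot
    return l, u
```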

5.9 General non-square linear systems

Up to now, all the methods we have studied for solving the linear system
    Ax = b                                                                   (5.11)
apply only when the coefficient matrix A is square. But in many applications we also encounter non-square matrices A. In this section we shall discuss how to solve such non-square systems.

Assume that A is an m × n matrix with m > n. Clearly the system (5.11) is unlikely to have a solution, as the number of equations is larger than the number of unknowns.

Let us first study when this system has a solution x. For this we write A columnwise as
    A = (a_1, a_2, ..., a_n);
then we can write (5.11) as
    b = x_1 a_1 + x_2 a_2 + ... + x_n a_n.
So we conclude:
The system (5.11) has a solution only when b lies in the subspace spanned by the column vectors of A.

Next we study the more general case when b does not lie in the subspace spanned by the column vectors of A. In this case the system (5.11) does not have a solution. In physical or engineering applications, it is often acceptable to find some solution x that minimizes the error Ax - b in a certain sense. We will look for one such type of solution, the least-squares solution, which seeks a vector x that minimizes the error Ax - b in the Euclidean norm, namely
    min_{x ∈ R^n} ||Ax - b||^2.                                              (5.12)
We assume that the columns of A are linearly independent. Then it is easy to check that A^T A is symmetric positive definite.

In order to find the minimizer, we define
    f(x) = ||Ax - b||^2.
Then we can easily see that the minimizer x of f satisfies
    d/dt f(x + t y) |_{t=0} = 0   for all y ∈ R^n.
This reduces to
    (Ax - b, Ay) = 0   for all y ∈ R^n,                                      (5.13)
so the least-squares solution x is the solution of the following normal equation:
    A^T A x = A^T b.                                                         (5.14)

Cholesky factorization for least-squares solutions. Since A^T A is a symmetric positive definite matrix, we may find the Cholesky factorization
    A^T A = L L^T,
where L is a lower triangular matrix. Then solving the normal equation (5.14) is equivalent to solving the following two triangular systems:
    L c = A^T b,   L^T x = c.
These are triangular systems and can be solved efficiently.
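A small sketch of this normal-equation approach (not part of the original notes), using scipy's Cholesky routines; the 3 × 2 example matrix and right-hand side are made up purely for illustration:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def least_squares_normal(A, b):
    """Least-squares solution via the normal equations A^T A x = A^T b and Cholesky."""
    AtA = A.T @ A
    Atb = A.T @ b
    c, low = cho_factor(AtA)            # A^T A = L L^T
    return cho_solve((c, low), Atb)     # two triangular solves

A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # overdetermined 3 x 2 system
b = np.array([1.0, 2.0, 2.0])
x = least_squares_normal(A, b)          # agrees with np.linalg.lstsq(A, b, rcond=None)[0]
```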


Observation. For the least-squares solution x = (A^T A)^{-1} A^T b, we can see directly that the error vector Ax - b is orthogonal to every column vector of A.

6 Floating-point arithmetic

6.1 Decimal and binary numbers

We are used to the decimal system in our daily life, but most computers adopt the binary system, which uses 2 as the base instead of 10. In the decimal system we use the ten digits 0, 1, 2, ..., 9 and write any number in powers of 10. Consider the number 538.372, which is equivalent to
    538.372 = 5 × 10^2 + 3 × 10^1 + 8 × 10^0 + 3 × 10^{-1} + 7 × 10^{-2} + 2 × 10^{-3}.
Also, for the number π, which is
    π = 3.14159 26535 89793 23846 26433 8 ...,
the last digit 8 shown stands for 8 × 10^{-26}.

In the binary system, one uses only the two digits 0 and 1. For example, we can write the real number 9.90625 in the binary form
    (1001.11101)_2 = 1×2^3 + 0×2^2 + 0×2^1 + 1×2^0 + 1×2^{-1} + 1×2^{-2} + 1×2^{-3} + 0×2^{-4} + 1×2^{-5}.

Computers communicate with their human users in the decimal system but work internally in the binary system, so conversions take place inside the computer. Though users do not need to worry about these conversions, they should know that small roundoff errors are always involved in them.

Computers can only operate on real numbers expressed with a fixed number of digits. The finite word length of computers restricts the precision with which a real number can be represented. Even a simple number like 1/10 cannot be stored exactly in any binary computer, as its exact expression needs an infinite number of binary digits:
    1/10 = (0.0001 1001 1001 1001 ...)_2.
For example, if we read 1/10 = 0.1 into a 32-bit computer and then print it out to 40 decimal places, we obtain the following result:
    0.10000 00014 90116 11938 47656 25000 00000 00000.
From this we see that some error is caused during the decimal–binary conversion. Two conversion errors are involved here: from decimal to binary, and from binary to decimal.
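A quick way to see this representation error for yourself (not part of the original notes; Python uses double precision here, whereas the printout above was for the 32-bit single-precision format):

```python
from decimal import Decimal

# 0.1 has no finite binary expansion, so the stored double is only close to 1/10
print(Decimal(0.1))     # 0.1000000000000000055511151231257827021181583404541015625
print((0.1).hex())      # 0x1.999999999999ap-4: the repeating binary pattern 1001
```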


6.2 Rounding errors

Rounding is encountered in scientific computing all the time. Consider a decimal number x of the form x.xxx...xxx with m digits after the decimal point. One may round x to n decimal places (n < m) in a manner that depends on the digit in the (n+1)th position. If this digit is 0, 1, 2, 3 or 4, then the nth digit is kept unchanged and all following digits are discarded. If this digit is 5, 6, 7, 8 or 9, then the nth digit is increased by one unit and the remaining digits are discarded. For example, the following seven-digit numbers are rounded to four digits as follows:
    0.1735499 → 0.1735
    0.9999500 → 1.000
    0.4321609 → 0.4322

Next we estimate the rounding errors for both decimal and binary numbers. First we consider a decimal number x, say of the form x = a.a_1 a_2 ... a_m. If x is rounded to x̃ with n digits after the decimal point, we have the error estimate
    |x - x̃| ≤ (1/2) × 10^{-n}.
Clearly, if the (n+1)th digit of x is 0, 1, 2, 3 or 4, then x = x̃ + ε × 10^{-n} with ε < 1/2, which proves the desired result.
If the (n+1)th digit of x is 5, 6, 7, 8 or 9, then we can write x = x̂ + ε × 10^{-n}, where x̂ is the number with the same first n digits as x and all digits beyond the nth equal to 0, and 1/2 ≤ ε < 1. In this case x̃ = x̂ + 10^{-n}, so x̃ - x = (1 - ε) × 10^{-n}, which yields the desired estimate by noting that 1 - ε ≤ 1/2.

If x is a decimal number, its chopped or truncated n-digit approximation is the number x̃ obtained by simply discarding all digits beyond the nth. For this we have
    |x - x̃| < 10^{-n}.
In fact, we can write x = x̃ + ε × 10^{-n} with 0 ≤ ε < 1, hence
    |x - x̃| = |ε| × 10^{-n} < 10^{-n}.

Now we consider a floating-point binary number x and round it to x̃ with n digits after the binary point; then we have the error estimate
    |x - x̃| ≤ 2^{k-(n+1)}.
To see this, one may write x = 0.1a_2 a_3 ... a_n a_{n+1} × 2^k (the case x = -0.1a_2 a_3 ... a_n a_{n+1} × 2^k is treated in the same way). Then x̃ = 0.1a_2 a_3 ... ã_n × 2^k, where ã_n = a_n if a_{n+1} = 0 and ã_n = a_n + 1 if a_{n+1} = 1. In both cases, by direct computation we get
    |x - x̃| ≤ 2^{k-(n+1)}.
Using this result, we can further estimate the relative error:
    |x - x̃| / |x| ≤ 2^{-n}.                                                 (6.1)

6.3 Normalized scientific notation

In the decimal system, one can express any real number in normalized scientific notation. For example, we can write
    2048.6076 = 0.20486076 × 10^4,   0.0004321 = 0.4321 × 10^{-3}.
In general, we can express a nonzero real number a in the normalized form
    a = r × 10^n,                                                            (6.2)
where r is a real number in the range 0.1 ≤ r < 1 and n is an integer (positive, negative or zero).

A real number like 0.000259 is called a fixed-point number. Its floating-point representation of the form (6.2) is 0.259 × 10^{-3}. The advantage of the floating-point representation is that it can express numbers of hugely different magnitudes. For example, if we are allowed to use only a fixed number of digits, say 8 digits with 7 after the decimal point, then the largest number we can express is 9.9999999 ≈ 10 and the smallest is 0.0000001 = 10^{-7}. But if we use 8 digits to represent a floating-point number, the number can range
    from 10^{-99} to 10^{99}
if we allocate 2 of the digits to represent a power of 10 (0.xxxxx × 10^{±xx}). This is tremendously different from the range between 10^{-7} and 10. So the floating-point representation can handle much smaller and much larger numbers than fixed-point numbers.

But the disadvantage of floating-point representations is that they have fewer figures of accuracy: e.g., the 8-digit floating-point number of the form 0.xxxxx × 10^{±xx} has only 5 figures of accuracy, while the fixed-point number of the form x.xxxxxxx has 8 figures of accuracy.
Most computers use the binary system. In the binary system, one can represent a nonzero binary number a in the standard form
    a = q × 2^m,
where 1/2 ≤ q < 1 and m is an integer. The number q is called the mantissa and the integer m the exponent, and both q and m are base-2 numbers. Let us look at the floating-point binary number
    +0.10111010 × 2^4.
We may shift the binary point 4 places to the right and get the binary number 1011.1010, or the decimal number 11.625. Similarly, for the floating-point binary number
    +0.10111010 × 2^{-1},
we shift the binary point 1 place to the left and get the binary number +0.010111010.
The accuracy of a number represented by a computer depends on the word length of the computer. Most modern PCs have a word length of 32 bits, while most modern workstations and high-performance supercomputers have a word length of 64 bits. Let us take a computer with a word length of 32 bits (binary digits) to discuss the accuracy in more detail. The floating-point representation of a single-precision real number is divided into three sections:
1. The first section contains one bit for the sign of the mantissa q;
2. The second section contains 8 bits for the exponent m;
3. The last section contains 23 bits for the mantissa q.
To save storage, a real number a = q × 2^m can be expressed as a normalized binary number such that the first nonzero bit (which must be 1) of the mantissa is just before the binary point, that is, q = (1.f)_2. So this first bit (it is always 1) does not need to be stored. Then the mantissa is in the range 1 ≤ q < 2, and the remaining 23 bits can be used to store the 23 bits of f. In this way, the computer has a 24-bit mantissa for its floating-point numbers.
In summary, nonzero normalized machine numbers (in a computer with word length of 32 bits) are strings of bits whose values are decoded in the following normalized floating-point form:
    a = (-1)^s q × 2^{(-1)^p m},                                             (6.3)
where
    s, p = 0 or 1,   q = (1.f)_2.
Note that we have 1 ≤ q < 2 and the most significant bit of q is 1 and is not stored; s is the bit expressing the sign of a (positive: bit 0; negative: bit 1); m is the 7-bit exponent, and f is the 23-bit fractional part of a, together with the implicit leading bit 1. That is,
    (00...00)_2 ≤ f ≤ (11...11)_2   (23 bits),
    (00...00)_2 ≤ m ≤ (11...11)_2   (7 bits).

In this representation, the binary number 10.001 is written as
    10.001 = (-1)^0 × (1.0001 00...00)_2 × 2^{(-1)^0 (0000001)_2},
so its computer representation is
    0 | 0001 00...00 | 0 | 0000001.
For the binary number -0.0010101, it is written as
    -0.0010101 = (-1)^1 × (1.0101 00...00)_2 × 2^{(-1)^1 (0000011)_2},
so its computer representation is
    1 | 0101 00...00 | 1 | 0000011.
The number 0 is represented as
    0 = (-1)^0 × (1.00...00)_2 × 2^{(-1)^1 (1111111)_2} = 0 | 00...00 | 1 | 1111111.
A real number which can be represented in the normalized floating-point form (6.3) is called a machine number of the computer having a word length of 32 bits (binary digits). So a machine number can be represented exactly by the computer. But this is not the case for most real numbers. When a real number is not a machine number and serves as an input datum or as the result of a computation, a representation error arises, and the number is approximated by a closest machine number.
A real number which can be represented as the normalized floating-point form (6.3)
is called a machine number in the computer having a word length of 32 bits (binary
digits). So a machine number can be precisely represented by the computer. But this
is not the case for most real numbers. When a real number is not a machine number
and serves as an input datum or as the result of a computation, a representation error
then arises, and it will be approximated by a most accurate machine number.
Noting that m has 7 digits in binary, so the largest m is 127. From this, we know
that a computer of word length of 32 bits can handle numbers as small as (1.f 2m )
2127 6.0 1039
and as large as (1.f 2m )
(2 223 )2127 3.4 1038 .
This is certainly not sufficiently enough for some scientific computings. When this
happens, we must write a program in double-precision or extended-precision arithmetic. A floating-point number in double precision occupies two computer words,
and the mantissa usually has at least twice as many bits. Hence there are roughly
twice the number of decimal places of accuracy in double precision as in single precision. But in double precision, calculations are much slower than in single precision,
often by a factor of 2 or more. This happens mainly because double-precision arithmetic is usually done using software, while single-precision arithmetic is done by the
hardware.
80

6.4

Accuracies in 32-bit representation

The restriction that the mantissa part occupies no more than 23 bits means that the
machine numbers have a limited precision of roughly 6 decimal places, since the least
significant bit in the mantissa represents units of 223 , approximately 1.2 107
106 . Then numbers expressed with more than 6 decimal digits will be approximated
by machine numbers.
The second smallest positive number that can be represented in 32-bit is:
amin = (1)0 (1. 00
. . 00} 1) 21111111 2127 .
| .{z
22

The maximum relative error that one can made in truncating (or rounding) a number
a between 0 and amin is:


amin a (0.00 . . . 001) 2127


= 223 = 106.9 .


a
1 2127
This means that the number of significant digits retained is roughly equal to 7 when
one truncates a very small number.
The largest number that can be represented in 32-bit is:
amax = (1)0 (1. 11
. . . 1}) 2127 2128 .
| {z
23

The second largest number that can be represented in 32-bit is:


as = (1)0 (1. 11
. . . 1} 0) 2127 .
| {z
22

The difference between the two numbers is huge:


|amax as | = (0.00 . . . 001) 2127 = 223 2127 = 2103 1031 .
However, the maximum relative error that one can made in truncating (or rounding)
a number a between as and amax is:


amax a (0.00 . . . 001) 2127
223 2127

= 224 = 107.2 .


a
(1.11 . . . 11) 2127
2 2127
That again means that the number of significant digits retained is roughly equal to 7
when one truncates a very large number.
In general, the number of significant digits retained when one truncates any number
in the 32-bit representation is always about 7 (because we use 23 bits for the mantissa and
2^{-24} \approx 10^{-7.2}). The number 2^{-24} is called the unit roundoff error or machine precision.
Since it is usually denoted by \epsilon_M, it is also called the machine epsilon. We see that the
unit roundoff error depends on the length of the mantissa only.
In 64-bit machines, we use 52 bits for the mantissa, and hence the accuracy is
within 2^{-52} \approx 10^{-15.6}, i.e. about 16 digits of accuracy.
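A quick sketch querying NumPy for the machine precision of single and double precision. NumPy reports the spacing between 1 and the next representable number (2^{-23} for single, 2^{-52} for double), which is twice the unit roundoff 2^{-24} used above; the two conventions differ only by this factor of 2.

    import numpy as np

    print(np.finfo(np.float32).eps)   # 1.1920929e-07        = 2**-23
    print(np.finfo(np.float64).eps)   # 2.220446049250e-16   = 2**-52

    # A number with more significant digits than the format can hold is rounded:
    x = np.float32(0.123456789)
    print(repr(x))                    # roughly 7 significant decimal digits survive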
An integer can use all bits of the computer word in its representation except for
a single bit reserved for the sign. So a computer having a word length of 32 bits can
handle the integers from -(2^{31} - 1) to 2^{31} - 1 = 2\,147\,483\,647 \approx 2 \times 10^{9}.

6.5    Machine rounding

In addition to rounding input data, rounding is also needed after most arithmetic
operations. The result of an arithmetic operation resides in a long 80-bit computer
register and must be rounded to single precision before being placed in memory; similarly
for double-precision calculations. The usual rounding mode is rounding to nearest: the
closer of the two machine numbers on the left and right of the real number is selected.
The relative roundoff error is then less than 2^{-24}, which is why 2^{-24} is called the
unit roundoff error.
During computations, suppose a number of the form \pm q \times 2^{(-1)^p m} is produced
with m > 127. In this case, if p = 0 (a positive exponent), we say an overflow has occurred;
this causes a fatal error and the computation is halted. If p = 1 (a negative exponent), we
say an underflow has occurred; the number is simply set to 0 in many computers, and the
computation proceeds.

6.6    Floating-point arithmetic

In this section, we discuss briefly the operations on floating-point numbers. In general,
the result of an operation on two floating-point numbers cannot be represented by a
floating-point number of the same size. For example, the product of two numbers with
five digits after the decimal point will often require ten digits for its result, say

    (2.0001 \times 10^{6}) \times (9.0001 \times 10^{2}) = 18.00110001 \times 10^{8}.

Thus the result of a floating-point operation can be represented only approximately.
Let us look at a simple example. Suppose we are computing the difference
1 - 0.999999 in six-digit decimal arithmetic. Mathematically, after aligning the numbers we have

    1.000000 - 0.999999 = 0.000001,

and normalizing the result gives the correct answer

    0.100000 \times 10^{-5}.

But the computer has only 6-digit decimal operations available, so during the
alignment 0.999999 is rounded to 1.00000:

    1.00000 - 1.00000 = 0.00000,

and we thus obtain the result zero.
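The same loss of digits can be reproduced in single-precision binary arithmetic; the decimal example above is only illustrative, while the minimal sketch below works with binary float32:

    import numpy as np

    a = np.float32(1.0)
    b = np.float32(0.999999)      # 0.999999 is not a machine number; it is rounded on storage
    print(repr(b))
    print(repr(a - b))            # the computed single-precision difference
    print(1.0 - 0.999999)         # double precision retains more digits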
6.7    Backward error analysis

Given a real number x, let fl(x) be the floating-point representation of x, i.e. fl(x)
is a machine representable number closest to x. By the previous discussions, we have

    \Big| \frac{fl(x) - x}{x} \Big| \le 2^{-\beta} \equiv \epsilon_M,                              (6.4)

where \epsilon_M is the machine precision, or the machine unit roundoff error. Here \beta = 24
for 32-bit computers and \beta = 52 for 64-bit computers. Using (6.4), we can write

    fl(x) = x(1 + \delta),                                                                         (6.5)

where

    |\delta| \le \epsilon_M.                                                                       (6.6)

Equations (6.5) and (6.6) form the basis of the backward error analysis.
Now we consider the relative error from adding two floating-point binary numbers
a and b. We write

    a = a_1 \times 2^m,   b = b_1 \times 2^n,

where a_1 and b_1 are of the normalized form 0.1c_1c_2\cdots, and m and n can be any
integers. We assume that b is the smaller number; then b_1 has to be shifted m - n places
to the right to line up the binary points. The results are then added, normalized and
rounded. We have two possibilities: overflow occurs to the left of the binary point,
or overflow does not occur. The first case yields

    1 \le |a_1 + b_1 2^{n-m}| < 2,

while the second case indicates

    \frac{1}{2} \le |a_1 + b_1 2^{n-m}| < 1.

For the overflow case, a right shift of one place is needed, and we then have

    fl(a + b) = \{(a_1 + b_1 2^{n-m})2^{-1} + \epsilon\} \times 2^{m+1}

with \epsilon being the roundoff error. So we can further write

    fl(a + b) = (a + b)\Big(1 + \frac{2\epsilon}{a_1 + b_1 2^{n-m}}\Big) = (a + b)(1 + E)           (6.7)

with |E| \le 2\epsilon \le 2\epsilon_M.
For the case without overflow, we can further write

    fl(a + b) = \{(a_1 + b_1 2^{n-m}) + \epsilon\} \times 2^{m}
              = (a + b)\Big(1 + \frac{\epsilon}{a_1 + b_1 2^{n-m}}\Big) = (a + b)(1 + E)           (6.8)

with |E| \le 2\epsilon_M. This gives the bound of the relative error from adding two floating-point binary numbers a and b.
Now let us consider adding up k floating-point binary numbers:

    c_1 + c_2 + \cdots + c_k.

To do so, we write the partial sums s_i = c_1 + c_2 + \cdots + c_i. Let s_1 = c_1; then using
(6.7) or (6.8), we derive

    s_2 = fl(s_1 + c_2) = (s_1 + c_2)(1 + E_1)

with |E_1| \le 2\epsilon_M. We can rewrite s_2 as

    s_2 = c_1(1 + E_1) + c_2(1 + E_1).

Similarly, we can do the following:

    s_3 = fl(s_2 + c_3) = (s_2 + c_3)(1 + E_2)
        = c_1(1 + E_1)(1 + E_2) + c_2(1 + E_1)(1 + E_2) + c_3(1 + E_2).

Continuing this process, we come to

    s_k = fl(s_{k-1} + c_k) = (s_{k-1} + c_k)(1 + E_{k-1})
        = c_1(1 + d_1) + c_2(1 + d_2) + \cdots + c_k(1 + d_k),

where for i = 2, 3, \cdots, k we have

    1 + d_i = (1 + E_{i-1})(1 + E_i) \cdots (1 + E_{k-1}),

and 1 + d_1 = 1 + d_2. Noting the uniform bound on all the E_i, we can estimate

    (1 - 2\epsilon_M)^{k-i+1} \le 1 + d_i \le (1 + 2\epsilon_M)^{k-i+1}.

Summarizing the above results, we conclude that

    fl\Big( \sum_{j=1}^{k} c_j \Big) = \Big( \sum_{j=1}^{k} c_j \Big)(1 + E)

with

    E = \frac{c_1 d_1 + c_2 d_2 + \cdots + c_k d_k}{c_1 + c_2 + \cdots + c_k}.

This gives the bound of the relative error from adding k floating-point binary numbers.
When we know more concrete data about the computer we are using and the detailed
numbers c_1, c_2, \ldots, c_k, we will have a clearer impression of the bound of the relative
error from the previous error estimate.
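A small sketch, assuming NumPy, illustrating how rounding errors accumulate when many numbers are summed in single precision; the analysis above suggests a worst-case relative error of the order k\epsilon_M.

    import numpy as np

    rng = np.random.default_rng(0)
    c = rng.random(10**5)

    s_single = np.float32(0.0)
    for cj in c.astype(np.float32):     # accumulate in single precision
        s_single += cj
    s_double = c.sum()                  # double-precision reference

    print(abs(s_single - s_double) / s_double)   # observed relative error
    print(len(c) * 2.0**-24)                     # crude worst-case estimate k * eps_M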

7    Sensitivity of linear systems

We now study the sensitivity of linear systems to errors in their coefficients. Suppose
our aim is to solve

    Ax = b,

but because of measurement errors or the rounding errors of computers, we are in fact
solving the perturbed system

    \tilde{A}\tilde{x} = b

instead of the real system Ax = b.
Next, we shall analyse the error (\tilde{x} - x) between the true solution x and the
solution \tilde{x} to the perturbed system.
Will this error be small when the errors in A are small? How will the error (\tilde{x} - x)
depend on the errors in A?
In order to measure \tilde{x} - x, we are going to introduce some quantities.

7.1    Vector and matrix norms

7.1.1    Vector norms

Why do we need norms for vectors and matrices? It is for the same reason that we need
absolute values for real numbers. When we say the number 1.23456 approximates the
number 1.234559 well, we know that it is true because

    |1.23456 - 1.234559| = 10^{-6}

is a small number. The same is true for vectors and matrices. When we use a method
to solve a problem where the true solution is a matrix A^* or a vector x^*, we say that
the method is a good method if it gives us a matrix A or a vector x such that the
distance between A and A^* or between x and x^* is small. The distance between them
is called the norm of their difference, and denoted by

    \|A - A^*\|   or   \|x - x^*\|.

The simplest norm that we have learned in secondary school is what we call the
Euclidean norm. For a vector x = (x_1, x_2, \ldots, x_n)^t, it is defined as

    \|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}.                                                 (7.1)

You can easily check that it satisfies all the requirements to be a norm:

(i) \|x\| \ge 0, and \|x\| = 0 if and only if x = 0.

(ii) \|\alpha x\| = |\alpha| \, \|x\|, for all \alpha \in R.

(iii) (Triangle inequality): \|x + y\| \le \|x\| + \|y\|.

However, one can extend the idea in (7.1) and define more general norms, the p-norms:

    \|x\|_p = (|x_1|^p + |x_2|^p + \cdots + |x_n|^p)^{1/p},   1 \le p \le \infty.

As an example, we see that for x = (2, 3)^t we have \|x\|_3 = (35)^{1/3}.
We easily find that

    \|x\|_1 = \sum_{i=1}^{n} |x_i|  (the Manhattan norm),    \|x\|_\infty = \max_{1 \le i \le n} |x_i|  (the sup norm),

and \|x\|_2 is nothing but the Euclidean norm. One can verify that all p-norms
satisfy the three requirements of a norm listed above. In Matlab, one can compute the
p-norm of a vector x by norm(x, p); and norm(x) gives the 2-norm of the vector.
Given two n-vectors y and z, we can now measure the distance between them by
computing \|y - z\|_p. For any n-vector x and any two numbers 1 \le p \le q \le \infty,
one can prove that

    \|x\|_\infty \le \|x\|_q \le \|x\|_p \le \|x\|_1 \le n \|x\|_\infty.

Thus in particular, if \|y - z\|_p is small for some p, then we expect \|y - z\|_q to be still
small for all other q.
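A quick sketch verifying these norm inequalities numerically with NumPy (the test vector is an arbitrary choice):

    import numpy as np

    x = np.array([2.0, -3.0, 1.0, 0.5])
    norms = {p: np.linalg.norm(x, p) for p in [1, 2, 3, np.inf]}
    print(norms)

    # Check  ||x||_inf <= ||x||_3 <= ||x||_2 <= ||x||_1 <= n * ||x||_inf
    assert norms[np.inf] <= norms[3] <= norms[2] <= norms[1] <= len(x) * norms[np.inf]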
7.1.2    Matrix norms

To measure the distance between two matrices, we need matrix norms. The trivial
matrix norm is to consider the matrix as a sequence of numbers, and then compute
the Euclidean norm of this sequence. The resulting norm is called the Frobenius
norm. More precisely, if the (i, j)-th entry of an m-by-n matrix A is a_{ij}, then

    \|A\|_F = \Big( \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \Big)^{1/2}.

But more frequently, we prefer matrix norms that are induced from a vector p-norm.
The p-norm of a matrix is defined as

    \|A\|_p = \max_{\|x\|_p = 1} \|Ax\|_p.

It can be shown that

    \|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}| = maximum column sum,

    \|A\|_2 = \sqrt{\lambda_{\max}(A^T A)},  where \lambda_{\max} is the largest eigenvalue of A^T A,    (7.2)

    \|A\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}| = maximum row sum.

As an example, if A = \begin{pmatrix} 1 & -3 \\ -2 & 4 \end{pmatrix}, then

    \|A\|_1 = \max\{1 + |-2|, |-3| + 4\} = 7,

    \|A\|_\infty = \max\{1 + |-3|, |-2| + 4\} = 6,

    \|A\|_F = \sqrt{1 + 4 + 9 + 16} = \sqrt{30},

    \|A\|_2 = \sqrt{ \lambda_{\max}\begin{pmatrix} 5 & -11 \\ -11 & 25 \end{pmatrix} } \approx 5.465.

You can easily verify from (7.2) that p-norms satisfy all the norm requirements
listed earlier. Moreover, they also satisfy the triangle inequality for multiplication:

    \|AB\|_p \le \|A\|_p \|B\|_p,                                                                  (7.3)

for any matrices A and B. For the Frobenius norm, (7.3) also holds. It is interesting
to know that

    \|A\|_m \equiv \max_{i,j} |a_{ij}|

defines a norm but it does not have property (7.3).
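The values of this example are easy to confirm with NumPy:

    import numpy as np

    A = np.array([[1.0, -3.0],
                  [-2.0, 4.0]])

    print(np.linalg.norm(A, 1))        # 7.0   (maximum column sum)
    print(np.linalg.norm(A, np.inf))   # 6.0   (maximum row sum)
    print(np.linalg.norm(A, 'fro'))    # sqrt(30) ~ 5.477
    print(np.linalg.norm(A, 2))        # sqrt(lambda_max(A^T A)) ~ 5.465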

7.2    Relative errors

Recall the relative error for scalars: let \tilde{x} be an approximate value of x; then the
relative error is

    \frac{|\tilde{x} - x|}{|x|}.

Similarly we can define the relative error of an approximate vector \tilde{x} to x:

    \frac{\|\tilde{x} - x\|}{\|x\|},

where \|\cdot\| can be any norm in R^n.
Next, we show the following property:

    If the relative error of \tilde{x} to x satisfies

        \frac{\|\tilde{x} - x\|}{\|x\|} \le \beta < 1,

    then the relative error of x to \tilde{x} satisfies

        \frac{\|\tilde{x} - x\|}{\|\tilde{x}\|} \le \frac{\beta}{1 - \beta}.

Proof. We easily see

    \frac{\|\tilde{x} - x\|}{\|\tilde{x}\|} = \frac{\|\tilde{x} - x\|}{\|x\|} \cdot \frac{\|x\|}{\|\tilde{x}\|} \le \beta \frac{\|x\|}{\|\tilde{x}\|},

but using \|\tilde{x} - x\| \le \beta \|x\|, we derive

    \|x\| - \|\tilde{x}\| \le \big| \|x\| - \|\tilde{x}\| \big| \le \|\tilde{x} - x\| \le \beta \|x\|,

which yields

    \frac{\|x\|}{\|\tilde{x}\|} \le \frac{1}{1 - \beta};

thus we obtain

    \frac{\|\tilde{x} - x\|}{\|\tilde{x}\|} \le \beta \frac{\|x\|}{\|\tilde{x}\|} \le \frac{\beta}{1 - \beta}.   ]

From this result, we see that \beta/(1 - \beta) is not much different from \beta if \beta is small;
e.g., we have \beta/(1 - \beta) = 0.111 if \beta = 0.1. This shows that if the relative error of
\tilde{x} to x is small, then the relative error of x to \tilde{x} is also small. So we may use either
of the two to measure the relative error. But the relative error

    \frac{\|\tilde{x} - x\|}{\|\tilde{x}\|}

is much easier to estimate than the error

    \frac{\|\tilde{x} - x\|}{\|x\|}

in most applications, since it is usually complicated and difficult to have an accurate
estimate of the exact value x. In fact our entire task is to find an accurate estimation
of the exact solution x.

7.3    Sensitivity of linear systems

Consider solving the linear system

    Ax = b.

Because of rounding errors, or errors in the observation data, the actual problem we
are solving on the computer is the perturbed system

    \tilde{A}\tilde{x} = b.

We would like to find out: will \tilde{x} be a good approximation to x?
To answer the question, let

    E = \tilde{A} - A,

so we can view \tilde{A} as a perturbation of A:

    \tilde{A} = A + E.

If A is nonsingular, is \tilde{A} also nonsingular?

    If A is nonsingular, and E is not too large in the sense that

        \|A^{-1}E\| < 1,

    then \tilde{A} = A + E is also nonsingular.

Proof. It suffices to show that

    for any x \ne 0,  we have (A + E)x \ne 0.

As A is nonsingular, we can write

    (A + E)x = Ax + Ex = A(x + A^{-1}Ex).

Thus (A + E)x \ne 0 if and only if x + A^{-1}Ex \ne 0. In fact, by \|A^{-1}E\| < 1, we obtain

    \|x + A^{-1}Ex\| \ge \|x\| - \|A^{-1}Ex\| \ge \|x\| - \|A^{-1}E\| \|x\| > 0,

which proves x + A^{-1}Ex \ne 0.  ]

Now we can show a fundamental perturbation result:

    Let A be a nonsingular matrix, and \tilde{A} = A + E. If

        Ax = b,   \tilde{A}\tilde{x} = b,

    and b \ne 0, then we have

        \frac{\|\tilde{x} - x\|}{\|\tilde{x}\|} \le \|A^{-1}E\| = \|A^{-1}\tilde{A} - I\|.

    If in addition

        \|A^{-1}E\| < 1,

    then \tilde{A} is nonsingular and

        \frac{\|\tilde{x} - x\|}{\|x\|} \le \frac{\|A^{-1}E\|}{1 - \|A^{-1}E\|}.

Proof. To prove the first result, we use \tilde{A} = A + E to see

    Ax = b = \tilde{A}\tilde{x} = A\tilde{x} + E\tilde{x}.

Noting b = Ax, we get

    A(x - \tilde{x}) = E\tilde{x},

which implies

    \|\tilde{x} - x\| \le \|A^{-1}E\| \, \|\tilde{x}\|,                                             (7.4)

and proves the first desired result.
To prove the second result, using the previous lemma we know \tilde{A} is nonsingular
as \|A^{-1}E\| < 1. Letting \beta = \|A^{-1}E\|, we obtain from (7.4) that

    \|\tilde{x} - x\| \le \beta \|\tilde{x}\|,

thus, since \|\tilde{x}\| \le \|x\| + \|\tilde{x} - x\| \le \|x\| + \beta\|\tilde{x}\|,

    \|\tilde{x}\| \le \frac{1}{1 - \beta}\|x\|.

Then we derive

    \frac{\|\tilde{x} - x\|}{\|x\|} \le \beta \frac{\|\tilde{x}\|}{\|x\|} \le \frac{\beta}{1 - \beta}.
7.4    The condition of a linear system

Recall that the relative error of the solution to the perturbed system \tilde{A}\tilde{x} = b satisfies

    \frac{\|\tilde{x} - x\|}{\|x\|} \le \frac{\|A^{-1}E\|}{1 - \|A^{-1}E\|}.

But the right-hand side of this relation seems difficult to interpret. Below we try to
find a more convenient relation.
It is easy to see that

    \|A^{-1}E\| \le \|A^{-1}\| \|E\| = \|A^{-1}\| \|A\| \frac{\|E\|}{\|A\|}.

If we define

    \kappa(A) = \|A\| \, \|A^{-1}\|,

then we know

    \|A^{-1}E\| \le \kappa(A) \frac{\|E\|}{\|A\|},

which implies

    \frac{\|\tilde{x} - x\|}{\|x\|} \le \frac{\kappa(A)\frac{\|E\|}{\|A\|}}{1 - \kappa(A)\frac{\|E\|}{\|A\|}}.

Now, if \kappa(A)\frac{\|E\|}{\|A\|} is small, say less than 0.1, then the denominator
1 - \kappa(A)\|E\|/\|A\| is less than 1 but close to 1, thus we can consider approximately

    \frac{\|\tilde{x} - x\|}{\|x\|} \lesssim \kappa(A)\frac{\|E\|}{\|A\|} = \kappa(A)\frac{\|\tilde{A} - A\|}{\|A\|}.                          (7.5)

This relation indicates that the relative error of \tilde{x} to the exact solution x can be
controlled by a factor of the relative error of the perturbed matrix \tilde{A} with respect to
the true matrix A.

Condition numbers. The real number \kappa(A) given by

    \kappa(A) = \|A\| \, \|A^{-1}\|

is called the condition number of the matrix A. We have:

    The condition number \kappa(A) is always greater than or equal to one.

This is easily seen from

    1 \le \|I\| = \|AA^{-1}\| \le \|A\| \|A^{-1}\| = \kappa(A),

where the first inequality is a consequence of

    \|I\| = \|I \cdot I\| \le \|I\| \, \|I\|.

7.5    Importance of condition numbers

The condition number has a direct influence on the accuracy of the solution to the
linear system Ax = b.
Suppose the matrix A is rounded to \tilde{A} on a computer with rounding unit \epsilon_M
(machine accuracy), so that \tilde{a}_{ij} = a_{ij} + a_{ij}\delta_{ij} with |\delta_{ij}| \le \epsilon_M. Then in any of
the usual norms, we have

    \|\tilde{A} - A\| \le \epsilon_M \|A\|.

Suppose we solve the perturbed system

    \tilde{A}\tilde{x} = b

without any further errors; then we get a solution \tilde{x} that satisfies

    \frac{\|\tilde{x} - x\|}{\|x\|} \lesssim \kappa(A)\frac{\|\tilde{A} - A\|}{\|A\|} \le \epsilon_M \, \kappa(A).

For instance, if \epsilon_M = 10^{-t} and \kappa(A) = 10^{k}, then the solution \tilde{x} can have a relative
error as large as 10^{-t+k}. This justifies the following (rough) rule:

    If \kappa(A) = 10^{k}, one should expect to lose at least k digits
    of accuracy in solving the system Ax = b.
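A sketch illustrating the rule. The Hilbert matrix used below is a standard ill-conditioned test matrix (my own choice for illustration, not taken from these notes):

    import numpy as np
    from scipy.linalg import hilbert

    n = 10
    A = hilbert(n)
    x_true = np.ones(n)
    b = A @ x_true

    x = np.linalg.solve(A, b)
    kappa = np.linalg.cond(A)
    print(f"kappa(A) ~ 1e{np.log10(kappa):.0f}")
    print(f"relative error = {np.linalg.norm(x - x_true) / np.linalg.norm(x_true):.1e}")
    # With kappa ~ 1e13 and ~16 digits in double precision, only a few digits survive.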

8    Polynomial interpolation

8.1    What is interpolation?

What is interpolation? Let us first look at a simple example.
According to the Hong Kong Census and Statistics Department, the population
of Hong Kong was:

    Year                       1998     2002    2003
    Population (in millions)   6.5437   6.787   6.8031
The questions that one would most likely be asking are:


1. (Interpolation) What was the population in the year 2000?
2. (Extrapolation) What will the population be in the year 2005?
The easiest way to answer these questions is to assume a linear growth of population
in between the census years (i.e. 1998, 2002, and 2003). Then we have Figure
1. In particular, the population in 2000 (let us denote this by P (2000)) should
be the average of P (1998) and P (2002), and will be equal to 6.665 millions. The
population in 2005, i.e. P (2005), will be 6.835 millions. This kind of interpolation
(or extrapolation) technique is called the piecewise linear interpolation method.
[Figure 1: Interpolation using piecewise linear polynomial — population of Hong Kong (in millions) against year, 1998-2005.]


The drawback of the method is that the future forecast will only depend on the
last two data P (2002) and P (2003). In general, we would like to involve P (1998) in
our forecast to reflect long-term effects. This leads to the Vandermonde interpolation
method.

8.2    Vandermonde Interpolation

In this interpolation method, we try to find the unique polynomial of lowest possible
degree that passes through all the given data points. Since we have three given
data points, P(1998), P(2002) and P(2003), we can determine a quadratic polynomial

    p(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2

that fits them. How do we get the coefficients \alpha_i, i = 0, 1, 2? We want p(x) to pass
through the data points, i.e.

    \alpha_0 + \alpha_1 x + \alpha_2 x^2 = P(x),   for x = 1998, 2002, 2003.

Thus in matrix notation we have

    \begin{pmatrix} 1 & 1998 & 1998^2 \\ 1 & 2002 & 2002^2 \\ 1 & 2003 & 2003^2 \end{pmatrix}
    \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \end{pmatrix}
    =
    \begin{pmatrix} P(1998) \\ P(2002) \\ P(2003) \end{pmatrix}.                                   (8.1)

Solving the system, we get \alpha_0 = -36115, \alpha_1 = 36.061 and \alpha_2 = -0.009. Thus
p(x) = -36115 + 36.061x - 0.009x^2. In particular, p(2005) = 6.781 millions.
This method is simple but naive, and it has two major drawbacks. First, the matrix
in (8.1) is a well-known ill-conditioned matrix called the Vandermonde matrix. Its
condition number here is about 10^{13}. Hence in the computed \alpha_i only about 3 significant
digits are accurate if we use double precision, and none of the digits will be correct in
single precision.
The second drawback is that solving (8.1) requires Gaussian elimination, which
is an O(n^3) process, where n is the degree of the interpolation polynomial. This
drawback becomes more serious when n is large or when new data come in very
often, as we will have to update the coefficients very often. As an example, if we
are now given the population P(2004), we will have to increase the degree of the
interpolation polynomial by one:

    q(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 x^3.

To get the coefficients \alpha_i, we will then have to solve a system similar to (8.1) but
with dimension 4:

    \begin{pmatrix} 1 & 1998 & 1998^2 & 1998^3 \\ 1 & 2002 & 2002^2 & 2002^3 \\ 1 & 2003 & 2003^2 & 2003^3 \\ 1 & 2004 & 2004^2 & 2004^3 \end{pmatrix}
    \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix}
    =
    \begin{pmatrix} P(1998) \\ P(2002) \\ P(2003) \\ P(2004) \end{pmatrix}.

The question is: can we make use of the \alpha_i that we have already computed to speed up
the solution for the new coefficients? Of course, the new matrix is even more ill-conditioned.
In fact, the condition number of this 4-by-4 matrix is about 10^{19}, and we have no hope of
solving the system to any digit of accuracy.
In this chapter, we will learn two other methods, the Lagrange method and the
Newton method, that compute the same interpolation polynomial using different
approaches so as to overcome the two aforementioned drawbacks.
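A sketch, assuming NumPy, that reproduces the conditioning claim for the 3-by-3 Vandermonde matrix in (8.1) and the extrapolated value p(2005):

    import numpy as np

    years = np.array([1998.0, 2002.0, 2003.0])
    pop = np.array([6.5437, 6.787, 6.8031])

    V = np.vander(years, increasing=True)          # columns 1, x, x^2
    print(f"cond(V) ~ {np.linalg.cond(V):.1e}")    # of the order 10^13

    alpha = np.linalg.solve(V, pop)                # alpha_0, alpha_1, alpha_2
    p2005 = alpha @ np.array([1.0, 2005.0, 2005.0**2])
    print(alpha, p2005)                            # p(2005) close to 6.78 million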
8.3    General quadratic interpolation

Instead of using the concrete data points, we now look at the interpolation in a slightly
more general situation. Given three observation data points

    x_0    x_1    x_2
    f_0    f_1    f_2

determine a quadratic polynomial

    p(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2

such that

    p(x_i) = f_i,   i = 0, 1, 2.

How do we get the coefficients \alpha_0, \alpha_1 and \alpha_2? By the above conditions,

    \alpha_0 + \alpha_1 x_i + \alpha_2 x_i^2 = f_i,   i = 0, 1, 2,

that is,

    \begin{pmatrix} 1 & x_0 & x_0^2 \\ 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \end{pmatrix}
    \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \end{pmatrix}
    =
    \begin{pmatrix} f_0 \\ f_1 \\ f_2 \end{pmatrix}.                                               (8.2)

Solving the system gives the coefficients \alpha_0, \alpha_1 and \alpha_2, hence a quadratic
interpolation p(x).

8.4    Interpolation with polynomials of degree n

More generally, suppose we are given n + 1 observation data

    x_0    x_1    x_2    \cdots    x_n
    f_0    f_1    f_2    \cdots    f_n                                                             (8.3)

where x_i \ne x_j for all i \ne j. We would like to see if it is possible to determine a
polynomial p(x) of degree n such that

    p(x_i) = f_i,   i = 0, 1, \cdots, n.                                                           (8.4)

Let us write

    p(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \cdots + \alpha_n x^n;

then using the conditions (8.4), we have

    \begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix}
    \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix}
    =
    \begin{pmatrix} f_0 \\ f_1 \\ \vdots \\ f_n \end{pmatrix},                                      (8.5)

where the coefficient matrix

    V_a = \begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix}      (8.6)

is called a Vandermonde matrix.
Clearly, finding a polynomial p(x) satisfying the conditions (8.4) amounts to finding the
(n + 1) coefficients \alpha_0, \alpha_1, \cdots, \alpha_n such that (8.5) is satisfied.
To see if system (8.5) is uniquely solvable, we should check if the coefficient matrix
V_a is nonsingular. A natural way is to calculate the determinant of V_a, but that is
computationally unstable. Below we present an alternative approach to demonstrate
this unique solvability.

Unique existence of polynomial interpolations. We first show that the polynomial
of degree n satisfying the conditions (8.4) is unique. If not, let q(x) be another
polynomial of degree n such that

    q(x_i) = f_i,   i = 0, 1, \cdots, n;

then r(x) = p(x) - q(x) is a polynomial of degree n and has n + 1 roots. The only
case for this to hold is r(x) \equiv 0, that is, p(x) = q(x).
Noting that (8.4) is equivalent to the linear system (8.5), the uniqueness of the
polynomial p(x) ensures the uniqueness of the solutions to (8.5). This implies the
non-singularity of the matrix V_a, or the linear independence of the column vectors of V_a,
hence the existence of a solution to (8.5), i.e. the existence of a polynomial satisfying (8.4).
Indeed, if this were not true, namely V_a were singular or the column vectors of V_a were
linearly dependent, then there would exist some non-zero vector \xi such that V_a \xi = 0.
This violates the established uniqueness: if \alpha is a solution to (8.5), i.e. V_a \alpha = f, then
\alpha + \xi is also a solution.

How do we find an interpolation polynomial p(x) that satisfies the conditions (8.4)? One
may think about solving the system (8.5) using Gaussian elimination. But this
procedure is very expensive when n is large. Another obvious disadvantage of the
Gaussian elimination approach is that the system (8.5) can be very ill-conditioned even for
small n; see the examples in Subsection 8.2.

8.5    Lagrange interpolation

There are many different approaches to computing the polynomial interpolation that
satisfies (8.4). We first introduce the Lagrange interpolation.
Instead of the standard basis functions

    1, x, x^2, \cdots, x^n

for polynomials of degree n, the Lagrange interpolation takes the following special basis
functions:

    l_j(x) = \frac{(x - x_0)(x - x_1) \cdots (x - x_{j-1})(x - x_{j+1}) \cdots (x - x_n)}{(x_j - x_0)(x_j - x_1) \cdots (x_j - x_{j-1})(x_j - x_{j+1}) \cdots (x_j - x_n)},    (8.7)

for j = 0, 1, \cdots, n. Clearly, l_j(x) is a polynomial of degree n, and

    l_j(x_j) = 1   and   l_j(x_i) = 0 for i \ne j.

Now one can readily verify that the polynomial of degree n

    L(x) = f_0 l_0(x) + f_1 l_1(x) + \cdots + f_n l_n(x)

satisfies

    L(x_i) = f_i,   i = 0, 1, 2, \cdots, n.

This polynomial L(x) is called the Lagrange interpolation of f(x).

Alternative verification of the non-singularity of V_a. Using the Lagrange
interpolation, we can easily verify the non-singularity of the Vandermonde matrix V_a
in (8.6). In fact, for any given vector (f_0, f_1, \cdots, f_n), the Lagrange interpolation provides
a polynomial p(x) of degree n such that (8.5) holds, that is, (8.5) has a solution. This proves
that any vector of dimension (n + 1) can be represented by the column vectors of V_a, so
the column vectors of V_a are linearly independent, and V_a is nonsingular.

Example. Write down the Lagrange interpolation for the observation data

    x       -2    -1    0    1    2
    f(x)    10     3    5    4    6
8.6    The Newton form of interpolation

Suppose we are given n + 1 data:

    x_0    x_1    x_2    \cdots    x_n
    f_0    f_1    f_2    \cdots    f_n

We would like to find a polynomial of degree n such that

    p(x_i) = f_i,   i = 0, 1, \cdots, n.                                                            (8.8)

We already know that the Lagrange interpolation

    L(x) = f_0 l_0(x) + f_1 l_1(x) + \cdots + f_n l_n(x)

is a polynomial that fulfills this requirement.
Below, we are interested in another form of interpolation, called the Newton form
of interpolation, which is much easier to evaluate than the Lagrange interpolation.
Noting the fact that

    1,   x - x_0,   (x - x_0)(x - x_1),   \cdots,   (x - x_0)(x - x_1) \cdots (x - x_{n-1})

are linearly independent, one may choose them as a basis of the polynomials of degree
n. That is, any polynomial of degree n can be represented in terms of this basis.
In particular, let p(x) be the polynomial of degree n satisfying (8.8); then we can write

    p(x) = c_0 + c_1(x - x_0) + c_2(x - x_0)(x - x_1) + \cdots
              + c_n(x - x_0)(x - x_1) \cdots (x - x_{n-1}).                                         (8.9)

One can see an obvious advantage of the Newton form: the interpolation obtained after
adding one more data point differs from the previous one only by one additional term.
This is quite different from the Lagrange interpolation.
How do we find the coefficients in the Newton form of interpolation (8.9)? To compute
the coefficients {c_i}, we introduce a new and elegant tool: divided differences.
the coefficients {ci }, we will introduce a new elegant tool: divided differences.
8.6.1    Divided differences

Consider a function f defined on [a, b], and a set of distinct nodal points in [a, b]:
a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b. Then we can define the divided differences
of different orders as follows. We call

    f[x_0] = f(x_0),   f[x_1] = f(x_1),   \cdots,   f[x_n] = f(x_n)

the zeroth-order divided differences of f(x). We call

    f[x_0, x_1] = \frac{f[x_1] - f[x_0]}{x_1 - x_0},   f[x_1, x_2] = \frac{f[x_2] - f[x_1]}{x_2 - x_1},   \cdots

the first-order divided differences of f(x); and for k = 2, 3, \cdots, n, we call

    f[x_0, x_1, \cdots, x_k] = \frac{f[x_1, x_2, \cdots, x_k] - f[x_0, x_1, \cdots, x_{k-1}]}{x_k - x_0}

the k-th order divided differences of f(x).
Using these divided differences, we can now calculate the coefficients in (8.9).
First, letting x = x_0 in (8.9), we get

    c_0 = p(x_0) = f_0 = f[x_0];

taking x = x_1 in (8.9), we see

    c_1 = \frac{p(x_1) - p(x_0)}{x_1 - x_0} = \frac{f[x_1] - f[x_0]}{x_1 - x_0} = f[x_0, x_1]  (= f[x_1, x_0]).

Similarly, taking x = x_2 in (8.9) we can derive

    c_2 = \Big( \frac{f(x_2) - f(x_0)}{x_2 - x_0} - f[x_0, x_1] \Big) \Big/ (x_2 - x_1) = \frac{f[x_0, x_2] - f[x_1, x_0]}{x_2 - x_1} = f[x_1, x_0, x_2].

We next show that f[x_1, x_0, x_2] = f[x_0, x_1, x_2]. In fact,

    f[x_1, x_0, x_2] = \frac{f[x_0, x_2] - f[x_1, x_0]}{x_2 - x_1}
        = \frac{f(x_2) - f(x_0)}{(x_2 - x_0)(x_2 - x_1)} - \frac{f(x_1) - f(x_0)}{(x_1 - x_0)(x_2 - x_1)}
        = \frac{f(x_2) - f(x_1)}{(x_2 - x_0)(x_2 - x_1)} + \frac{f(x_1) - f(x_0)}{(x_2 - x_0)(x_2 - x_1)} - \frac{f(x_1) - f(x_0)}{(x_1 - x_0)(x_2 - x_1)}
        = \frac{f[x_1, x_2]}{x_2 - x_0} + \frac{f(x_1) - f(x_0)}{x_2 - x_1} \Big\{ \frac{1}{x_2 - x_0} - \frac{1}{x_1 - x_0} \Big\}
        = \frac{f[x_1, x_2]}{x_2 - x_0} - \frac{f[x_0, x_1]}{x_2 - x_0}
        = f[x_0, x_1, x_2].

In general, we can deduce

    c_k = f[x_0, x_1, x_2, \cdots, x_k].

Thus the Newton form of interpolation of f(x) can be written as

    p(x) = f[x_0] + f[x_0, x_1](x - x_0) + \cdots + f[x_0, x_1, \cdots, x_n](x - x_0)(x - x_1) \cdots (x - x_{n-1}).

The Newton form of interpolation can be expressed in a simplified form when
x_0, x_1, x_2, \cdots, x_n are equally spaced. In this case, let us introduce h = x_{i+1} - x_i for
each i and write x = x_0 + sh. Then x - x_i = (s - i)h, and the Newton form
can be written as

    p(x) = p(x_0 + sh) = f[x_0] + sh\, f[x_0, x_1] + s(s-1)h^2 f[x_0, x_1, x_2] + \cdots
              + s(s-1)\cdots(s-n+1)h^n f[x_0, x_1, \cdots, x_n]
         = f[x_0] + \sum_{k=1}^{n} s(s-1)\cdots(s-k+1)h^k f[x_0, x_1, \cdots, x_k].
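The divided-difference recursion above is easy to program. The following Python sketch (the helper names divided_differences and newton_eval are my own, written directly from the definitions above) builds the coefficients c_k = f[x_0, ..., x_k] and evaluates the Newton form with Horner-type nesting.

    import numpy as np

    def divided_differences(xs, fs):
        """Return the Newton coefficients c_k = f[x_0, ..., x_k] of (8.9)."""
        xs = np.asarray(xs, float)
        coef = np.asarray(fs, float).copy()
        n = len(xs)
        for k in range(1, n):
            # after sweep k, coef[i] holds f[x_{i-k}, ..., x_i] for i >= k
            coef[k:] = (coef[k:] - coef[k-1:n-1]) / (xs[k:] - xs[:n-k])
        return coef

    def newton_eval(xs, coef, t):
        """Evaluate p(t) from the Newton form by nested multiplication."""
        result = coef[-1]
        for k in range(len(coef) - 2, -1, -1):
            result = result * (t - xs[k]) + coef[k]
        return result

    xs, fs = [0.0, 1.0, 2.0], [1.0, 3.0, 7.0]      # data of f(x) = x^2 + x + 1
    c = divided_differences(xs, fs)
    print(c)                                       # [1. 2. 1.]  ->  1 + 2x + x(x-1)
    print(newton_eval(xs, c, 1.5))                 # 4.75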

8.6.2    Relations between derivatives and divided differences

Next we shall study some relations between derivatives and divided differences. First
we can easily check the following facts:

1. For any constant function, its first-order divided differences always vanish;

2. For any linear function, its first-order divided differences are always constant,
   and its second-order divided differences always vanish;

3. For any quadratic function, its first- and second-order divided differences are
   respectively linear and constant, and its third-order divided differences always vanish.

The above facts tell us that the actions of divided differences are very similar to
those of derivatives. In fact, we have the following result.

    Suppose f \in C^n[a, b], and a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b are
    distinct points in [a, b]. Then there exists some \xi \in (a, b) such that

        f[x_0, x_1, x_2, \cdots, x_n] = \frac{f^{(n)}(\xi)}{n!}.

Proof. Let p(x) be the Newton form of interpolation of f(x) at the nodal points
a = x_0 < x_1 < \cdots < x_n = b, and set g(x) = f(x) - p(x). Since f(x_i) = p(x_i)
for i = 0, 1, \cdots, n, g has n + 1 distinct zeros in [a, b]. Then Rolle's theorem gives the
existence of a point \xi \in (a, b) such that g^{(n)}(\xi) = 0, which implies

    p^{(n)}(\xi) = f^{(n)}(\xi).

But p(x) is a polynomial of degree n with leading coefficient f[x_0, x_1, x_2, \cdots, x_n],
so we have

    p^{(n)}(x) = n! \, f[x_0, x_1, x_2, \cdots, x_n]   for all x.

This completes the proof of the desired relation. ]

8.6.3    Symmetry of divided differences

The divided differences are symmetric, i.e., exchanging any two variables in a divided
difference does not change its value. Let us see why this is true.
Clearly, f[x_0] is symmetric. For the first-order divided difference, we have

    f[x_0, x_1] = \frac{f[x_1] - f[x_0]}{x_1 - x_0} = \frac{f[x_0] - f[x_1]}{x_0 - x_1} = f[x_1, x_0],

so the first-order divided differences are symmetric.
To see the symmetry of the second-order divided difference, namely

    f[x_0, x_1, x_2] = f[x_1, x_0, x_2],

we recall that the interpolation of f(x) at the points x_0, x_1, x_2 is given by

    p_1(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1).

Now, if we consider the interpolation of f(x) at the points x_1, x_0, x_2 (just swap the
position of x_0 with that of x_1), we have

    p_2(x) = f[x_1] + f[x_1, x_0](x - x_1) + f[x_1, x_0, x_2](x - x_1)(x - x_0).

Knowing that the interpolation is unique, we have p_1(x) = p_2(x), thus the coefficients of x^2
in p_1(x) and p_2(x) must be equal. This gives

    f[x_1, x_0, x_2] = f[x_0, x_1, x_2].

More generally, we have:

    Consider a function f defined on [a, b], and a set of distinct nodal points
    in [a, b]: a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b. Then the divided
    difference is a symmetric function of its arguments. Namely, if z_0, z_1, \cdots, z_n
    is a permutation of x_0, x_1, \cdots, x_n, then

        f[z_0, z_1, z_2, \cdots, z_n] = f[x_0, x_1, x_2, \cdots, x_n].

Proof. It is easy to see that the divided difference f[z_0, z_1, z_2, \cdots, z_n] is the coefficient
of x^n in the polynomial of degree n that interpolates f at the points z_0, z_1, \cdots, z_n,
while the divided difference f[x_0, x_1, x_2, \cdots, x_n] is the coefficient of x^n in the polynomial
of degree n that interpolates f at the points x_0, x_1, \cdots, x_n. These two polynomials are
obviously the same. ]
8.6.4    Relation between divided differences and Gaussian elimination

In this subsection we illustrate that computing divided differences is the same as
doing Gaussian elimination for a lower triangular system. Let us just consider three
data points: (x_0, f_0), (x_1, f_1) and (x_2, f_2); more data points can be handled in the
same way. Recall that the Newton interpolation polynomial is

    p(x) = c_0 + c_1(x - x_0) + c_2(x - x_0)(x - x_1),                                              (8.10)

with

    c_0 = f[x_0],   c_1 = f[x_0, x_1],   c_2 = f[x_0, x_1, x_2].

We can compute these divided differences through Gaussian elimination for the
following system:

    \begin{pmatrix} 1 & 0 & 0 \\ 1 & x_1 - x_0 & 0 \\ 1 & x_2 - x_0 & (x_2 - x_0)(x_2 - x_1) \end{pmatrix}
    \begin{pmatrix} c_0 \\ c_1 \\ c_2 \end{pmatrix}
    =
    \begin{pmatrix} f_0 \\ f_1 \\ f_2 \end{pmatrix}
    =
    \begin{pmatrix} f[x_0] \\ f[x_1] \\ f[x_2] \end{pmatrix}.                                       (8.11)

Next we will use Gaussian elimination to change the lower triangular matrix into
an identity matrix.
First we subtract row 2 from row 3 and put the result back in row 3 (i.e. r_3 - r_2 \to r_3).
This eliminates the 1 in the (3,1) entry. Remember to do the same thing to the right-hand
side vector. Next we eliminate the 1 in the (2,1) entry by doing r_2 - r_1 \to r_2. The resulting
system after these two row operations is:

    \begin{pmatrix} 1 & 0 & 0 \\ 0 & x_1 - x_0 & 0 \\ 0 & x_2 - x_1 & (x_2 - x_0)(x_2 - x_1) \end{pmatrix}
    \begin{pmatrix} c_0 \\ c_1 \\ c_2 \end{pmatrix}
    =
    \begin{pmatrix} f[x_0] \\ f[x_1] - f[x_0] \\ f[x_2] - f[x_1] \end{pmatrix}.

Next we divide the 2nd row by x_1 - x_0 (i.e. \frac{1}{x_1 - x_0} r_2 \to r_2) to make the
(2,2) entry 1. Also we do \frac{1}{x_2 - x_1} r_3 \to r_3 to make the (3,2) entry 1. These
two steps are equivalent to pre-multiplying the whole system by the matrix

    \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{x_1 - x_0} & 0 \\ 0 & 0 & \frac{1}{x_2 - x_1} \end{pmatrix}.

The resulting system is:

    \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & x_2 - x_0 \end{pmatrix}
    \begin{pmatrix} c_0 \\ c_1 \\ c_2 \end{pmatrix}
    =
    \begin{pmatrix} f[x_0] \\ \frac{f[x_1] - f[x_0]}{x_1 - x_0} \\ \frac{f[x_2] - f[x_1]}{x_2 - x_1} \end{pmatrix}
    =
    \begin{pmatrix} f[x_0] \\ f[x_0, x_1] \\ f[x_1, x_2] \end{pmatrix}.

Finally, we eliminate the 1 in the (3,2) entry by r_3 - r_2 \to r_3, and normalize
the (3,3) entry to 1 by \frac{1}{x_2 - x_0} r_3 \to r_3. Then the resulting system is:

    \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
    \begin{pmatrix} c_0 \\ c_1 \\ c_2 \end{pmatrix}
    =
    \begin{pmatrix} f[x_0] \\ f[x_0, x_1] \\ \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} \end{pmatrix}
    =
    \begin{pmatrix} f[x_0] \\ f[x_0, x_1] \\ f[x_0, x_1, x_2] \end{pmatrix}.

Putting these values of c_i back into (8.10), we obtain the expected result

    p(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1).

Thus the computation of the divided differences is the same as solving the lower
triangular system (8.11) by Gaussian elimination.
8.6.5    How to compute coefficients of the Newton form?

Given a set of observation data

    x_0    x_1    x_2    \cdots    x_n
    f_0    f_1    f_2    \cdots    f_n

how do we compute the coefficients in the Newton form of interpolation? We take the
following simple case as an example.

Table for computing the divided difference f[x_0, x_1, x_2, x_3]:

    x_0   f[x_0] = f_0
                           f[x_0, x_1]
    x_1   f[x_1] = f_1                   f[x_0, x_1, x_2]
                           f[x_1, x_2]                       f[x_0, x_1, x_2, x_3]
    x_2   f[x_2] = f_2                   f[x_1, x_2, x_3]
                           f[x_2, x_3]
    x_3   f[x_3] = f_3

Example. Compute the Newton form of interpolation satisfying the following conditions:

    x       3    1    5    6
    f(x)    1   -3    2    4

Solution. We can compute as follows:

    x_0 = 3   f[x_0] = 1
                              f[x_0, x_1] = 2
    x_1 = 1   f[x_1] = -3                        f[x_0, x_1, x_2] = -3/8
                              f[x_1, x_2] = 5/4                          f[x_0, x_1, x_2, x_3] = 7/40
    x_2 = 5   f[x_2] = 2                         f[x_1, x_2, x_3] = 3/20
                              f[x_2, x_3] = 2
    x_3 = 6   f[x_3] = 4

Thus the Newton form of interpolation can be written as follows:

    p(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1)
              + f[x_0, x_1, x_2, x_3](x - x_0)(x - x_1)(x - x_2)
         = 1 + 2(x - 3) - \frac{3}{8}(x - 3)(x - 1) + \frac{7}{40}(x - 3)(x - 1)(x - 5).            (8.12)

Now if we add one more data point, we have the data table:

    x       3    1    5    6    0
    f(x)    1   -3    2    4    5

The calculation needs to be only slightly extended as follows:

    x_0 = 3   f[x_0] = 1
                              f[x_0, x_1] = 2
    x_1 = 1   f[x_1] = -3                        f[x_0, x_1, x_2] = -3/8
                              f[x_1, x_2] = 5/4                          f[x_0, x_1, x_2, x_3] = 7/40
    x_2 = 5   f[x_2] = 2                         f[x_1, x_2, x_3] = 3/20        f[x_0, x_1, x_2, x_3, x_4] = 11/72
                              f[x_2, x_3] = 2                            f[x_1, x_2, x_3, x_4] = -17/60
    x_3 = 6   f[x_3] = 4                         f[x_2, x_3, x_4] = 13/30
                              f[x_3, x_4] = -1/6
    x_4 = 0   f[x_4] = 5

This immediately gives the Newton form of interpolation, adding only one term
to the polynomial (8.12):

    p(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1)
              + f[x_0, x_1, x_2, x_3](x - x_0)(x - x_1)(x - x_2)
              + f[x_0, x_1, x_2, x_3, x_4](x - x_0)(x - x_1)(x - x_2)(x - x_3)
         = 1 + 2(x - 3) - \frac{3}{8}(x - 3)(x - 1) + \frac{7}{40}(x - 3)(x - 1)(x - 5)
              + \frac{11}{72}(x - 3)(x - 1)(x - 5)(x - 6).
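As a quick check, the divided_differences sketch from Subsection 8.6.1 above reproduces the coefficients of this worked example:

    xs = [3.0, 1.0, 5.0, 6.0, 0.0]
    fs = [1.0, -3.0, 2.0, 4.0, 5.0]
    print(divided_differences(xs, fs))
    # -> [1.  2.  -0.375  0.175  0.152777...]   i.e.  1, 2, -3/8, 7/40, 11/72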
72

8.7    Three fundamental questions in interpolation

8.7.1    Lagrange interpolation polynomials

(1) Cost of constructing the polynomial

Suppose we are already given (n + 1) data points to interpolate:

    x       x_0    x_1    x_2    \cdots    x_n
    f(x)    f_0    f_1    f_2    \cdots    f_n

The n-th degree Lagrange interpolation polynomial is given by

    p_n(x) = f_0 \frac{(x - x_1)(x - x_2) \cdots (x - x_n)}{(x_0 - x_1)(x_0 - x_2) \cdots (x_0 - x_n)}
           + f_1 \frac{(x - x_0)(x - x_2) \cdots (x - x_n)}{(x_1 - x_0)(x_1 - x_2) \cdots (x_1 - x_n)}
           + \cdots
           + f_n \frac{(x - x_0)(x - x_1) \cdots (x - x_{n-1})}{(x_n - x_0)(x_n - x_1) \cdots (x_n - x_{n-1})}.     (8.13)

Let us denote it by

    p_n(x) = \lambda_0 (x - x_1)(x - x_2) \cdots (x - x_n)
           + \lambda_1 (x - x_0)(x - x_2) \cdots (x - x_n)
           + \cdots + \lambda_n (x - x_0)(x - x_1) \cdots (x - x_{n-1}).                            (8.14)

The question is how costly it is to compute the \lambda_i. Notice that for 0 \le i \le n,

    \lambda_i = \frac{f_i}{(x_i - x_0)(x_i - x_1) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_n)}.        (8.15)

Obviously, it requires n additions and n multiplications to obtain one \lambda_i. Hence the
total cost of computing all (n + 1) \lambda_i's will be O(n^2).

(2) Cost of evaluating the polynomial

Suppose we have already obtained all the \lambda_i in (8.14). Our next question is how
costly it will be to compute p_n(t) at an arbitrary point t. Let us define

    P = \prod_{i=0}^{n} (t - x_i)

(which can be computed in (n + 1) additions and n multiplications). Then (8.14) becomes

    p_n(t) = P\Big( \frac{\lambda_0}{t - x_0} + \frac{\lambda_1}{t - x_1} + \cdots + \frac{\lambda_n}{t - x_n} \Big),

which can now be computed in (n + 1) additions and (n + 2) multiplications. Thus
p_n(t) can be computed in O(n) operations for any t.

(3) Cost of updating the polynomial

Suppose we are given one more data point to interpolate:

    x       x_0    x_1    x_2    \cdots    x_n    x_{n+1}
    f(x)    f_0    f_1    f_2    \cdots    f_n    f_{n+1}

The (n + 1)-th degree Lagrange interpolation polynomial is given by

    p_{n+1}(x) = f_0 \frac{(x - x_1)(x - x_2) \cdots (x - x_n)(x - x_{n+1})}{(x_0 - x_1)(x_0 - x_2) \cdots (x_0 - x_n)(x_0 - x_{n+1})}
               + f_1 \frac{(x - x_0)(x - x_2) \cdots (x - x_n)(x - x_{n+1})}{(x_1 - x_0)(x_1 - x_2) \cdots (x_1 - x_n)(x_1 - x_{n+1})}
               + \cdots
               + f_n \frac{(x - x_0)(x - x_1) \cdots (x - x_{n-1})(x - x_{n+1})}{(x_n - x_0)(x_n - x_1) \cdots (x_n - x_{n-1})(x_n - x_{n+1})}
               + f_{n+1} \frac{(x - x_0)(x - x_1) \cdots (x - x_{n-1})(x - x_n)}{(x_{n+1} - x_0)(x_{n+1} - x_1) \cdots (x_{n+1} - x_{n-1})(x_{n+1} - x_n)}.

It is very complicated indeed, but we can simplify it using (8.15). In fact, if we rewrite
p_{n+1}(x) as

    p_{n+1}(x) = \tilde{\lambda}_0 (x - x_1)(x - x_2) \cdots (x - x_n)(x - x_{n+1})
               + \tilde{\lambda}_1 (x - x_0)(x - x_2) \cdots (x - x_n)(x - x_{n+1})
               + \cdots + \tilde{\lambda}_n (x - x_0)(x - x_1) \cdots (x - x_{n-1})(x - x_{n+1})
               + \tilde{\lambda}_{n+1} (x - x_0)(x - x_1) \cdots (x - x_{n-1})(x - x_n),

then we can easily check that

    \tilde{\lambda}_i = \frac{\lambda_i}{x_i - x_{n+1}}   if 0 \le i \le n,

    \tilde{\lambda}_{n+1} = \frac{f_{n+1}}{(x_{n+1} - x_0)(x_{n+1} - x_1) \cdots (x_{n+1} - x_{n-1})(x_{n+1} - x_n)}   if i = n + 1.

Now it is clear that each \tilde{\lambda}_i, 0 \le i \le n, can be computed in 1 addition and 1
multiplication, while \tilde{\lambda}_{n+1} can be computed in (n + 1) additions and (n + 1)
multiplications. Thus p_{n+1} can be obtained in O(n) operations.
8.7.2    Newton interpolation polynomials

(1) Cost of constructing the polynomial

Suppose we are already given (n + 1) data points to interpolate:

    x       x_0    x_1    x_2    \cdots    x_n
    f(x)    f_0    f_1    f_2    \cdots    f_n

The n-th degree Newton interpolation polynomial is given by

    p_n(x) = c_0 + c_1(x - x_0) + \cdots + c_n(x - x_0)(x - x_1) \cdots (x - x_{n-1}).               (8.16)

The question is how costly it is to compute the c_i. Recall that the c_i are the first entries
in each column of the divided difference table.
Let us then compute the cost of obtaining all the entries in the divided difference
table. Recall that the table looks something like this:

    x          0th           1st                  \cdots    (n-1)th                   nth
    x_0        f[x_0]
                             f[x_0, x_1]
    x_1        f[x_1]
                             f[x_1, x_2]          \cdots
    \vdots     \vdots        \vdots                         f[x_0, \ldots, x_{n-1}]
                                                                                      f[x_0, \ldots, x_n]
    x_{n-1}    f[x_{n-1}]                                    f[x_1, \ldots, x_n]
                             f[x_{n-1}, x_n]
    x_n        f[x_n]

For the 0th level column, we don't have to do anything: we just have f[x_i] = f_i. In
the 1st level column, each divided difference is computed in the form

    \frac{f[x_i] - f[x_j]}{x_i - x_j},

which requires 2 additions and 1 multiplication. Notice that there are only n first-level
divided differences to compute in this column.
Next consider the 2nd level divided differences. Again, each divided difference can be
computed in 2 additions and 1 multiplication, but there are only (n - 1) second-level
divided differences to compute. Repeating this argument, finally, for the n-th level
divided difference, there is only one to compute (namely c_n = f[x_0, \ldots, x_n]); and it
can be computed in 2 additions and 1 multiplication. Thus to compute all the entries
in the divided difference table (in particular the c_i), it requires

    2[n + (n - 1) + \cdots + 1] = O(n^2)

additions and half that number of multiplications. Hence generating the Newton
interpolation polynomial is an O(n^2) process.

(2) Cost of evaluating the polynomial

Suppose we have already obtained all the c_i in (8.16). Our next question is how
costly it will be to compute p_n(t) at an arbitrary point t. For this, we just invoke Horner's
rule:

    p_n(t) = ((\cdots(c_n(t - x_{n-1}) + c_{n-1})(t - x_{n-2}) + c_{n-2})(t - x_{n-3}) + \cdots + c_1)(t - x_0) + c_0.

This can be computed in 2n additions and n multiplications. Thus for any t,
p_n(t) can be computed in O(n) operations.

(3) Cost of updating the polynomial

Suppose we are given one more data point to interpolate:

    x       x_0    x_1    x_2    \cdots    x_n    x_{n+1}
    f(x)    f_0    f_1    f_2    \cdots    f_n    f_{n+1}

The (n + 1)-th degree Newton interpolation polynomial is given by

    p_{n+1}(x) = c_0 + c_1(x - x_0) + \cdots + c_n(x - x_0)(x - x_1) \cdots (x - x_{n-1})
               + c_{n+1}(x - x_0)(x - x_1) \cdots (x - x_{n-1})(x - x_n).

Thus we only have to compute c_{n+1}.
To compute c_{n+1}, we have to add one entry at the bottom of each column in the
divided difference table. Since each entry in the table is of the form (a - b)/(c - d),
each of them can be computed in 2 additions and 1 multiplication. Moreover, since
there are (n + 2) columns in the table, we only have to compute (n + 2) new entries. We
therefore conclude that c_{n+1} can be computed in 2(n + 2) additions and (n + 2)
multiplications. Thus the update can be done in O(n) operations.

8.7.3    Summary of the interpolation methods

From the previous discussions, we can briefly summarize the computational costs of
the three interpolation methods in the following table:

                   finding the coefficients                 adding one more data point
    Vandermonde    solve Ax = b:                            solve a completely new system:
                   O(n^3), ill-conditioned                  O(n^3), even more ill-conditioned
    Lagrange       compute \prod_{i \ne j}(x_i - x_j):      compute \prod_{i=0}^{n}(x_i - x_{n+1}):
                   O(n^2)                                   O(n)
    Newton         solve Lx = b (L lower triangular):       add one more row to L:
                   O(n^2)                                   O(n)
8.8    Error estimates of polynomial interpolations

Given a set of n + 1 observation data

    x_0    x_1    x_2    \cdots    x_n
    f_0    f_1    f_2    \cdots    f_n

where x_i \ne x_j, we know there exists a unique polynomial p(x) of degree n satisfying
the conditions

    p(x_i) = f_i,   i = 0, 1, 2, \cdots, n.

We have learned two types of such polynomial interpolations:
The Lagrange interpolation

    L(x) = f_0 l_0(x) + f_1 l_1(x) + \cdots + f_n l_n(x),

where the l_j(x) are given by

    l_j(x) = \frac{(x - x_0)(x - x_1) \cdots (x - x_{j-1})(x - x_{j+1}) \cdots (x - x_n)}{(x_j - x_0)(x_j - x_1) \cdots (x_j - x_{j-1})(x_j - x_{j+1}) \cdots (x_j - x_n)}
           = \prod_{i=0,\, i \ne j}^{n} \frac{x - x_i}{x_j - x_i};

and the Newton form of interpolation

    N(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + \cdots
              + f[x_0, x_1, \cdots, x_n](x - x_0)(x - x_1) \cdots (x - x_{n-1}).

How good are these polynomial interpolations? Next we shall study the error function

    e(x) = L(x) - f(x) = N(x) - f(x).

We have the following error estimate for polynomial interpolation:

    Suppose f \in C^{n+1}[a, b], and p(x) is the polynomial interpolation of
    f(x) at the n + 1 distinct points

        a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b.

    Then for any x \in [a, b], there exists a point \xi_x \in (a, b) such that

        f(x) - p(x) = \frac{f^{(n+1)}(\xi_x)}{(n+1)!}(x - x_0)(x - x_1) \cdots (x - x_n).           (8.17)

Proof. If x = x_i for some i \in \{0, 1, \cdots, n\}, then the result holds obviously. Now
we consider x \in (a, b) with x \notin \{x_0, x_1, \cdots, x_n\}, and introduce

    \phi(t) = f(t) - p(t) - \lambda (t - x_0)(t - x_1) \cdots (t - x_n).

Choose \lambda such that \phi(x) = 0 (note that x is fixed); then we have

    \phi(x) = 0,   \phi(x_0) = \phi(x_1) = \cdots = \phi(x_n) = 0.

By Rolle's theorem, we know

    \phi'(t) has at least n + 1 distinct zeros;

similarly,

    \phi''(t) has at least n distinct zeros.

Continuing this way, we know that \phi^{(n+1)}(t) has at least one zero, say \xi_x \in (a, b), i.e.,

    \phi^{(n+1)}(\xi_x) = 0.

But notice that

    \phi^{(n+1)}(t) = f^{(n+1)}(t) - p^{(n+1)}(t) - \lambda (n+1)! = f^{(n+1)}(t) - \lambda (n+1)!,

since p has degree at most n; in particular

    \phi^{(n+1)}(\xi_x) = f^{(n+1)}(\xi_x) - \lambda (n+1)! = 0.

This gives

    \lambda = \frac{f^{(n+1)}(\xi_x)}{(n+1)!}.

Noting that \lambda was chosen such that \phi(x) = 0, we obtain

    f(x) - p(x) = \frac{f^{(n+1)}(\xi_x)}{(n+1)!}(x - x_0)(x - x_1) \cdots (x - x_n).               (8.18)

]
Some interesting results. Using the error estimate (8.18), we have the following
relation if we take p(x) = L(x):

    f(x) - \sum_{i=0}^{n} f(x_i) l_i(x) = \frac{f^{(n+1)}(\xi_x)}{(n+1)!}(x - x_0)(x - x_1) \cdots (x - x_n).

This immediately yields

    \sum_{i=0}^{n} l_i(x) = 1,
    \sum_{i=0}^{n} x_i l_i(x) = x,
    \sum_{i=0}^{n} x_i^2 l_i(x) = x^2,
    \vdots
    \sum_{i=0}^{n} x_i^n l_i(x) = x^n.

In fact, for any polynomial p_n(x) of degree n, we have

    \sum_{i=0}^{n} p_n(x_i) l_i(x) = p_n(x).

Example. Consider the function f(x) = \sin 4\pi x approximated by a polynomial of
degree 9 that interpolates f at 10 points in the interval [0, 1]. Estimate the error of
the interpolation.

Solution. We can apply the error estimate (8.17). Obviously we have |f^{(10)}(\xi_x)| \le (4\pi)^{10}
and |x - x_i| \le 1. So for all x \in [0, 1],

    |\sin 4\pi x - p(x)| \le \frac{(4\pi)^{10}}{10!}.

8.9    Chebyshev polynomials

Using the interpolation error estimate (8.17), we can estimate the accuracy of each
interpolation polynomial when all the interpolating nodes

    a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b

are given. But with a different set of interpolating nodes, the resulting polynomial
will have a different accuracy. This raises a natural question: can we choose the
interpolating nodes so that the resulting polynomial reaches an optimal accuracy? In
this subsection, we shall discuss how to find such an optimal polynomial. It is easy
to see from the error estimate (8.17) that the optimal polynomial can be realized if
we can choose a set of interpolating nodes x_0, x_1, \cdots, x_n such that the magnitude of the
resulting polynomial (x - x_0)(x - x_1) \cdots (x - x_n) is minimized. An analysis of this
optimization problem was first made by the mathematician Chebyshev. The process
leads naturally to a system of polynomials called Chebyshev polynomials, which we
introduce next.
The Chebyshev polynomials are defined recursively as follows:

    T_0(x) = 1,   T_1(x) = x,

and for n \ge 1,

    T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x).

You may compute the next few such polynomials:

    T_2(x) = 2x^2 - 1,
    T_3(x) = 4x^3 - 3x,
    T_4(x) = 8x^4 - 8x^2 + 1,
    T_5(x) = 16x^5 - 20x^3 + 5x,
    T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1.

The Chebyshev polynomials have closed forms:

    For x in the interval [-1, 1], the Chebyshev polynomials have the
    following closed form for n \ge 0:

        T_n(x) = \cos(n \cos^{-1} x),   -1 \le x \le 1.

Proof. Recall the formula

    \cos(A + B) = \cos A \cos B - \sin A \sin B.

So we obtain

    \cos(n + 1)\theta = \cos\theta \cos n\theta - \sin\theta \sin n\theta,
    \cos(n - 1)\theta = \cos\theta \cos n\theta + \sin\theta \sin n\theta;

adding them up, we have

    \cos(n + 1)\theta = 2\cos\theta \cos n\theta - \cos(n - 1)\theta.

Now let \theta = \cos^{-1} x, i.e. x = \cos\theta. We see from the above relation that the function

    f_n(x) = \cos(n \cos^{-1} x)

satisfies the following system:

    f_0(x) = 1,   f_1(x) = x,

and for n \ge 1,

    f_{n+1}(x) = 2x f_n(x) - f_{n-1}(x).

So we have f_n = T_n for all n. ]

We see the following properties directly from the closed form of the Chebyshev polynomials:

    |T_n(x)| \le 1   (-1 \le x \le 1),

    T_n\Big(\cos\frac{j\pi}{n}\Big) = (-1)^j   (0 \le j \le n),

    T_n\Big(\cos\frac{(2j-1)\pi}{2n}\Big) = 0   (1 \le j \le n).

A monic polynomial is one in which the term of highest degree has coefficient one.
We see that the term of highest degree of T_n(x) is 2^{n-1}x^n for n > 0.
Therefore 2^{1-n}T_n is a monic polynomial for all n > 0.
We have the following estimate for monic polynomials:

    If p is a monic polynomial of degree n, then

        \|p\|_\infty = \max_{-1 \le x \le 1} |p(x)| \ge 2^{1-n}.

Proof. We prove by contradiction. Suppose that

    |p(x)| < 2^{1-n}   (|x| \le 1).

Let q = T_n / 2^{n-1} and x_j = \cos(j\pi/n). As q is a monic polynomial of degree n, we have

    (-1)^j p(x_j) \le |p(x_j)| < 2^{1-n} = (-1)^j q(x_j).

Consequently,

    (-1)^j [q(x_j) - p(x_j)] > 0   (0 \le j \le n).

This shows that the polynomial q - p oscillates in sign n + 1 times on the interval
[-1, 1]. Therefore it must have at least n roots in (-1, 1). But this is impossible
because q - p has degree at most n - 1. ]

Choosing the nodes.

Using the interpolation error estimate (8.17), we can deduce that

    \max_{|x| \le 1} |f(x) - p(x)| \le \frac{1}{(n+1)!} \max_{|x| \le 1} |f^{(n+1)}(x)| \cdot \max_{|x| \le 1} |(x - x_0)(x - x_1) \cdots (x - x_n)|.

But by the property of monic polynomials, we know

    \max_{|x| \le 1} |(x - x_0)(x - x_1) \cdots (x - x_n)| \ge 2^{-n}.

This minimum value is attained if (x - x_0)(x - x_1) \cdots (x - x_n) is the monic multiple
of T_{n+1}, i.e. T_{n+1}/2^n. The nodes are then the roots of T_{n+1}, namely

    x_i = \cos\Big(\frac{(2i - 1)\pi}{2n + 2}\Big),   i = 1, 2, \cdots, n + 1.

These considerations lead to the following result:

    If the nodes x_i are chosen to be the roots of the Chebyshev polynomial
    T_{n+1}, then the error of the resulting interpolating polynomial
    for a given function f(x) will be minimized and can be estimated by

        |f(x) - p(x)| \le \frac{1}{2^n (n+1)!} \max_{|t| \le 1} |f^{(n+1)}(t)|.
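A small Python sketch, assuming NumPy, that checks the three-term recurrence against the closed form cos(n cos^{-1} x) and generates the Chebyshev nodes used above (the helper name chebyshev_T is my own):

    import numpy as np

    def chebyshev_T(n, x):
        """Evaluate T_n(x) by the recurrence T_{n+1} = 2x T_n - T_{n-1}."""
        t_prev, t = np.ones_like(x), np.asarray(x, float)
        if n == 0:
            return t_prev
        for _ in range(n - 1):
            t_prev, t = t, 2 * x * t - t_prev
        return t

    x = np.linspace(-1, 1, 5)
    print(np.allclose(chebyshev_T(6, x), np.cos(6 * np.arccos(x))))   # closed form check

    n = 9
    i = np.arange(1, n + 2)
    nodes = np.cos((2 * i - 1) * np.pi / (2 * n + 2))    # roots of T_{n+1} on [-1, 1]
    print(np.allclose(chebyshev_T(n + 1, nodes), 0.0))   # they are indeed roots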

8.10    Spline interpolation

In the previous few subsections we have discussed the interpolating polynomials for
a given set of observation data of a function f(x):

    x_0    x_1    x_2    \cdots    x_n
    f_0    f_1    f_2    \cdots    f_n

Remember that all such interpolating polynomials are global polynomials on the
entire given interval [a, b]. If we want higher interpolation accuracy, we need to use
more interpolating nodes and require the interpolated function f(x) to be smoother.
But with more and more interpolating nodes, the degree of the resulting polynomial
becomes higher and higher, leading to very unstable computations.
Because of the instability of higher-order polynomials, one is interested in using
lower-order polynomials to achieve higher accuracy. This can be realized by using
piecewise polynomials instead of global polynomials. Spline functions are widely
used piecewise polynomials with certain continuity conditions.

Spline functions. Given the following (n + 1) nodal points

    x_0 < x_1 < x_2 < \cdots < x_n,

a spline function of degree k with these nodes is a function S(x) satisfying the following
conditions:

(i) S(x) is a polynomial of degree k on each interval [x_{i-1}, x_i], i = 1, 2, \cdots, n;

(ii) S(x) has a continuous (k - 1)-th order derivative on [x_0, x_n].

Spline interpolations. A spline interpolation S(x) of degree k associated with
the following set of observation data of a function f(x):

    x_0    x_1    x_2    \cdots    x_n
    f_0    f_1    f_2    \cdots    f_n

is a spline function of degree k such that

    S(x_i) = f_i,   i = 0, 1, \cdots, n.

8.10.1    Spline interpolation of degree 0

A spline function of degree 0 is a piecewise constant function:

    S(x) = S_i(x) = C_i,   x \in [x_i, x_{i+1}),   i = 0, 1, \cdots, n - 1.

As the spline function of degree 0 has only n degrees of freedom, it can fit only n
observation data. For example, we may take the data

    S(x_i) = f_i,   i = 0, 1, 2, \cdots, n - 1;

then the spline function of degree 0 can be constructed as

    S(x) = S_i(x) = f_i,   x \in [x_i, x_{i+1}),   i = 0, 1, \cdots, n - 1,

with the last piece taken on the closed interval [x_{n-1}, x_n].

8.10.2    Spline interpolation of degree 1

A spline interpolation function of degree 1 is a continuous piecewise linear function:

    S(x) = S_i(x) = a_i x + b_i,   x \in [x_i, x_{i+1}],   i = 0, 1, \cdots, n - 1.

How do we find the spline interpolation function? We have 2n unknown coefficients and
thus need 2n conditions. Usually we require S(x) to fit the observed data:

    S(x_i) = f_i,   i = 0, 1, 2, \cdots, n,

which gives 1 condition at each endpoint x_0 and x_n, and 2 conditions at each internal
point x_i, i = 1, 2, \cdots, n - 1 (the two neighbouring linear pieces must both take the
value f_i there). So this yields exactly the desired 2(n - 1) + 2 = 2n conditions to find
all the coefficients in S(x).
Geometrically, the unique determination of the 2n coefficients in S(x) is clear: the
linear function on each subinterval is uniquely determined by the assigned values at its
two endpoints.
Please check that the piecewise linear function determined as above satisfies all the
requirements a spline function of degree 1 should meet.

8.10.3    Cubic spline interpolations

A spline function of degree 3 is called a cubic spline, which is most frequently used in
practice. Let S(x) be a cubic spline. According to the definition of a spline function,
we can write it as follows: for k = 0, 1, \ldots, n - 1,

    S(x) = S_k(x) = s_{k,0} + s_{k,1}(x - x_k) + s_{k,2}(x - x_k)^2 + s_{k,3}(x - x_k)^3,   x \in [x_k, x_{k+1}].

How do we find a cubic spline interpolation function? It amounts to finding 4n unknown
coefficients, 4 for each subinterval, so we have to find 4n conditions to determine
the 4n unknowns. Usually we require S(x) to fit the observed data:

    S(x_i) = f_i,   i = 0, 1, 2, \cdots, n,

which gives 2 conditions on each interval [x_k, x_{k+1}] for k = 0, 1, \ldots, n - 1, hence
2n conditions.
But as required, S'(x) and S''(x) must be continuous on [x_0, x_n]. This gives
2(n - 1) further conditions. Now we have found a total of 4n - 2 conditions. There are many
different ways to provide the remaining two conditions. A most popular one is to
provide these two conditions by requiring

    S''(x_0) = \alpha,   S''(x_n) = \beta

with \alpha and \beta being two given constants. A cubic spline S(x) is called a natural cubic
spline if it satisfies

    S''(x_0) = 0,   S''(x_n) = 0.

Example. Determine whether the following function

    S(x) = \begin{cases} 1 + x - x^3, & x \in [0, 1] \\ 1 - 2(x-1) - 3(x-1)^2 + 4(x-1)^3, & x \in [1, 2] \\ 4(x-2) + 9(x-2)^2 - 3(x-2)^3, & x \in [2, 3] \end{cases}

is a natural cubic spline which fits the following observation data:

    x_k    0    1    2    3
    y_k    1    1    0    10
8.10.4    Existence and construction of natural cubic spline interpolations

In this subsection we discuss the unique existence and construction of natural cubic spline
functions for a given set of observation data:

    x_0    x_1    x_2    \cdots    x_n
    f_0    f_1    f_2    \cdots    f_n

Let S(x) be a natural cubic spline function on [a, b]; then we can write, for
k = 0, 1, \cdots, n - 1,

    S(x) = S_k(x) = s_{k,0} + s_{k,1}(x - x_k) + s_{k,2}(x - x_k)^2 + s_{k,3}(x - x_k)^3,   x \in [x_k, x_{k+1}].    (8.19)

Noting that S(x) is piecewise cubic, its second derivative S''(x) must be piecewise
linear on [a, b]. So we can use the linear Lagrange interpolation to represent S_k''(x):

    S_k''(x) = S''(x_k)\frac{x - x_{k+1}}{x_k - x_{k+1}} + S''(x_{k+1})\frac{x - x_k}{x_{k+1} - x_k},                 (8.20)

where we have used the fact that S''(x) is continuous on [a, b] and S''(x) = S_k''(x) on
[x_k, x_{k+1}]. Letting m_k = S''(x_k), m_{k+1} = S''(x_{k+1}) and h_k = x_{k+1} - x_k, we can write

    S_k''(x) = \frac{m_k}{h_k}(x_{k+1} - x) + \frac{m_{k+1}}{h_k}(x - x_k);                                           (8.21)

integrating it twice, we obtain

    S_k(x) = \frac{m_k}{6h_k}(x_{k+1} - x)^3 + \frac{m_{k+1}}{6h_k}(x - x_k)^3 + c_k x + d_k                          (8.22)

for some constants c_k and d_k. More conveniently, using the linear independence of
x_{k+1} - x and x - x_k, we can express c_k x + d_k in terms of x_{k+1} - x and x - x_k. Thus
one can rewrite (8.22) as

    S_k(x) = \frac{m_k}{6h_k}(x_{k+1} - x)^3 + \frac{m_{k+1}}{6h_k}(x - x_k)^3 + p_k(x_{k+1} - x) + q_k(x - x_k).     (8.23)

Now using the observation data S(x_k) = f_k and S(x_{k+1}) = f_{k+1}, we can derive

    f_k = \frac{m_k}{6}h_k^2 + p_k h_k,   f_{k+1} = \frac{m_{k+1}}{6}h_k^2 + q_k h_k.

One can get p_k and q_k from these relations and then substitute them into equation (8.23)
to derive

    S_k(x) = \frac{m_k}{6h_k}(x_{k+1} - x)^3 + \frac{m_{k+1}}{6h_k}(x - x_k)^3
             + \Big(\frac{f_k}{h_k} - \frac{m_k h_k}{6}\Big)(x_{k+1} - x) + \Big(\frac{f_{k+1}}{h_k} - \frac{m_{k+1} h_k}{6}\Big)(x - x_k).    (8.24)

To find the values m_k, we can use the continuity of the first derivative of S(x).
Differentiating (8.24), we obtain

    S_k'(x) = -\frac{m_k}{2h_k}(x_{k+1} - x)^2 + \frac{m_{k+1}}{2h_k}(x - x_k)^2
              - \Big(\frac{f_k}{h_k} - \frac{m_k h_k}{6}\Big) + \Big(\frac{f_{k+1}}{h_k} - \frac{m_{k+1} h_k}{6}\Big).                        (8.25)

Evaluating at x = x_k gives

    S_k'(x_k) = -\frac{m_k}{3}h_k - \frac{m_{k+1}}{6}h_k + d_k,   where   d_k = \frac{f_{k+1} - f_k}{h_k}.                                     (8.26)

Similarly, by replacing k by k - 1 in (8.25) to obtain the expression for S_{k-1}'(x), and then
evaluating at x = x_k, we derive

    S_{k-1}'(x_k) = \frac{m_k}{3}h_{k-1} + \frac{m_{k-1}}{6}h_{k-1} + d_{k-1}.                                                                 (8.27)

Using the fact that S_k'(x_k) = S_{k-1}'(x_k), we derive the equations for \{m_k\}:

    h_{k-1}m_{k-1} + 2(h_{k-1} + h_k)m_k + h_k m_{k+1} = u_k,   k = 1, 2, \cdots, n - 1,                                                       (8.28)

where u_k = 6(d_k - d_{k-1}).
Noting that m_0 = \alpha and m_n = \beta are specified by the given end conditions, and the
coefficient matrix in (8.28) is diagonally dominant, the system (8.28) has a unique solution
m_1, m_2, \cdots, m_{n-1}. This proves the unique existence of the natural cubic spline
function for any given set of observation data.

Recovering the coefficients of the natural cubic spline function. Using
the values of m_1, m_2, \cdots, m_{n-1}, we can recover the coefficients of S_k(x) in (8.19). To
do so, we first easily see, by differentiating S_k(x) in (8.19) and evaluating at x_k, that

    s_{k,0} = f_k,   s_{k,1} = S_k'(x_k),   S_k''(x) = 2s_{k,2} + 6s_{k,3}(x - x_k).                                  (8.29)

Taking x = x_k in the last equation of (8.29), we know

    s_{k,2} = \frac{m_k}{2}.

Using s_{k,1} = S_k'(x_k) and (8.26), we get

    s_{k,1} = d_k - \frac{h_k(2m_k + m_{k+1})}{6}.

Finally, taking x = x_{k+1} in the last equation of (8.29), we derive

    s_{k,3} = \frac{m_{k+1} - m_k}{6h_k}.
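The construction above reduces the natural cubic spline to the tridiagonal system (8.28) plus the local formula (8.24). The following Python sketch (the function names natural_cubic_spline and spline_eval are my own) assembles and solves that system with m_0 = m_n = 0, and reproduces the example spline of Subsection 8.10.3 as a check.

    import numpy as np

    def natural_cubic_spline(xs, fs):
        """Return the moments m_k = S''(x_k) of the natural cubic spline (m_0 = m_n = 0)."""
        xs, fs = np.asarray(xs, float), np.asarray(fs, float)
        n = len(xs) - 1
        h = np.diff(xs)
        d = np.diff(fs) / h
        m = np.zeros(n + 1)
        if n > 1:
            A = np.zeros((n - 1, n - 1))
            u = 6.0 * np.diff(d)                    # u_k = 6(d_k - d_{k-1})
            for k in range(1, n):
                A[k-1, k-1] = 2.0 * (h[k-1] + h[k])
                if k > 1:
                    A[k-1, k-2] = h[k-1]
                if k < n - 1:
                    A[k-1, k] = h[k]
            m[1:n] = np.linalg.solve(A, u)
        return m

    def spline_eval(xs, fs, m, t):
        """Evaluate S(t) on the interval containing t, using formula (8.24)."""
        xs, fs = np.asarray(xs, float), np.asarray(fs, float)
        k = min(max(np.searchsorted(xs, t) - 1, 0), len(xs) - 2)
        h = xs[k+1] - xs[k]
        return (m[k]/(6*h)*(xs[k+1]-t)**3 + m[k+1]/(6*h)*(t-xs[k])**3
                + (fs[k]/h - m[k]*h/6)*(xs[k+1]-t) + (fs[k+1]/h - m[k+1]*h/6)*(t-xs[k]))

    xs, fs = [0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 0.0, 10.0]
    m = natural_cubic_spline(xs, fs)
    print(m)                               # [0, -6, 18, 0] = S''(x_k) of that example
    print(spline_eval(xs, fs, m, 1.5))     # -0.25, the value of the example spline at 1.5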

8.10.5    Properties of cubic splines

We now discuss an important property of some special cubic splines.

    Suppose f \in C^2[a, b], and S(x) is a cubic spline interpolation of
    f(x) at the n + 1 distinct points

        a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b

    such that

        S''(a) = 0,   S''(b) = 0

    or

        S'(a) = f'(a),   S'(b) = f'(b).

    Then it holds that

        \int_a^b \{S''(x)\}^2 \, dx \le \int_a^b \{f''(x)\}^2 \, dx.

Proof. Consider the difference g(x) = f(x) - S(x); then we know g(x_i) = 0 for
i = 0, 1, \cdots, n, and

    \int_a^b \{f''(x)\}^2 dx = \int_a^b \{S''(x)\}^2 dx + \int_a^b \{g''(x)\}^2 dx + 2\int_a^b S''(x) g''(x) dx.

Now the desired inequality follows if we can show that

    \int_a^b S''(x) g''(x) dx = 0.

To see this, using integration by parts, the given conditions in the theorem and
the fact that S'''(x) is a constant, say c_i, on [x_{i-1}, x_i], we derive

    \int_a^b S''(x) g''(x) dx = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} S''(x) g''(x) dx
        = \sum_{i=1}^{n} \big\{ (S''g')(x_i) - (S''g')(x_{i-1}) \big\} - \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} S'''(x) g'(x) dx
        = -\sum_{i=1}^{n} c_i \int_{x_{i-1}}^{x_i} g'(x) dx
        = -\sum_{i=1}^{n} c_i \{g(x_i) - g(x_{i-1})\} = 0.

Here the boundary terms telescope to (S''g')(b) - (S''g')(a), which vanish under either set
of the given end conditions, and each g(x_i) - g(x_{i-1}) = 0 since g vanishes at all the nodes.

From the previous important property of the cubic spline interpolations, we can
immediately see that

    \int_a^b \{S''(x)\}^2 dx \le \int_a^b \{h''(x)\}^2 dx   for all h \in V(f),

where V(f) is defined by V(f) = \{h \in C^2([a, b]): h is an interpolation of f(x) (not
necessarily a polynomial) such that h''(a) = h''(b) = 0, or h'(a) = f'(a), h'(b) = f'(b)\}.
We may recall that the curvature of a curve described by the equation y = f(x)
is the quantity

    |f''(x)| \big\{1 + f'(x)^2\big\}^{-3/2}.

If we drop the nonlinear bracketed term, |f''(x)| may be viewed as an approximation
to the curvature. In natural cubic spline interpolation, we are thus finding a curve with
minimal (approximate) curvature over the interval, because the quantity

    \int_a^b S''(x)^2 dx

is minimized.

8.11    Hermite interpolation

Suppose we are given the following set of observation data:

    x_0     x_1     x_2     \cdots    x_n
    f_0     f_1     f_2     \cdots    f_n                                                          (8.30)
    f_0'    f_1'    f_2'    \cdots    f_n'

where x_i \ne x_j for all i \ne j. We would like to see if it is possible to determine a
polynomial p(x) of degree 2n + 1 such that

    p(x_i) = f_i   and   p'(x_i) = f_i',   i = 0, 1, \cdots, n.                                    (8.31)

Let l_0(x), l_1(x), \cdots, l_n(x) be the Lagrange basis functions in (8.7) associated with the
set of nodal points x_0, x_1, \cdots, x_n; then we define

    u_i(x) = \big(1 - 2l_i'(x_i)(x - x_i)\big) l_i^2(x),   v_i(x) = (x - x_i) l_i^2(x).

One can easily show that:

1. u_i(x) and v_i(x) are all polynomials of degree 2n + 1;

2. u_i(x_j) = \delta_{ij} and v_i(x_j) = 0 for all i, j;

3. u_i'(x_j) = 0 and v_i'(x_j) = \delta_{ij} for all i, j.

Using these results, we can directly verify that the following polynomial

    H(x) = \sum_{i=0}^{n} f_i u_i(x) + \sum_{i=0}^{n} f_i' v_i(x)

is a polynomial of degree 2n + 1 such that all the conditions in (8.31) are satisfied.
This polynomial is called the Hermite interpolation.
The Hermite interpolation has the following error estimate.

    Suppose f \in C^{2n+2}[a, b], and H(x) is its Hermite interpolation at the
    n + 1 distinct points

        a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b

    such that all the conditions in (8.31) are satisfied. Then for each x \in [a, b]
    there exists some \xi(x) \in (a, b) such that

        f(x) - H(x) = \frac{f^{(2n+2)}(\xi(x))}{(2n+2)!}(x - x_0)^2 (x - x_1)^2 \cdots (x - x_n)^2.    (8.32)

Proof. To prove (8.32), we fix x \in [a, b]. If x is a node, the result holds clearly,
so we assume that x is not a node. Let

    w(t) = f(t) - H(t) - \lambda (t - x_0)^2 (t - x_1)^2 \cdots (t - x_n)^2,

where \lambda is a constant chosen such that w(x) = 0. One can easily see that w(t) has at
least n + 2 distinct zeros in [a, b]: x, x_0, x_1, \cdots, x_n. By Rolle's theorem, w'(t) has at
least n + 1 zeros strictly between them. But w'(t) also vanishes at all the nodal points, so
w'(t) has at least 2n + 2 zeros in [a, b]. Recursively applying Rolle's theorem, we know that
w^{(2n+2)}(t) has some zero \xi \in (a, b), that is,

    0 = w^{(2n+2)}(\xi) = f^{(2n+2)}(\xi) - H^{(2n+2)}(\xi) - \lambda (2n+2)! = f^{(2n+2)}(\xi) - \lambda (2n+2)!

(since H has degree at most 2n + 1). This gives \lambda = f^{(2n+2)}(\xi)/(2n+2)!, and since
w(x) = 0, the desired error estimate (8.32) follows. ]

Numerical integration

Approximations of integrals are widely encountered in real applications. Many important
physical quantities can be represented by integrals, e.g., mass, concentration, heat flux,
heat sources and so on.

In this section we shall discuss how to approximate a given integral on an interval.
The approximation of integrals in higher dimensions can be reduced to integrals on
many intervals.

Given a function f(x) on an interval [a,b], we are now going to discuss how to
approximate the integral

    \int_a^b f(x)\,dx .

The same approximations can be applied to higher dimensional integrals such as

    \int_a^b \int_c^d f(x,y)\,dx\,dy    and    \int_a^b \int_c^d \int_e^f f(x,y,z)\,dx\,dy\,dz .

Evaluation of an integral may not be an easy job, and can be much more difficult than
the evaluation of derivatives. For example, one can easily find the derivatives of e^{x^2}
and e^{x^3}, but their integrals are not simple: in fact, their antiderivatives cannot be
expressed in terms of elementary functions.

However, we often need to know the value of an integral in real applications. If
the integral is difficult to compute exactly, do we have some way to find an approximate
value of the integral? Yes: a method for computing an integral approximately is called
numerical integration, or a quadrature rule.

9.1    Simple rules and their error estimates

Recall that the integral

    \int_a^b f(x)\,dx

is the area of the region bounded by the curve y = f(x), the x-axis, and the lines x = a
and x = b. When a is close to b, one may approximate this area by the area of some
simple geometric domains.

Rectangular rules. If we use the area of the rectangle with base [a,b] and
height f(a) or f(b), then we get two simple quadrature rules:

    \int_a^b f(x)\,dx \approx (b-a)\, f(a) ,        \int_a^b f(x)\,dx \approx (b-a)\, f(b) .

Trapezoidal rule. A more accurate rule is to approximate the integral by the
area of the trapezoid formed by the base [a,b], the lines x = a and x = b, and the line
connecting (a, f(a)) and (b, f(b)). This leads immediately to the following trapezoidal
rule:

    \int_a^b f(x)\,dx \approx \frac{b-a}{2}\big( f(a) + f(b) \big) .

Let us now try to understand how good this approximation is. To see the error of
the trapezoidal rule, we first consider the Lagrange interpolation of f(x) at the two
points a and b:

    L(x) = \frac{x-b}{a-b}\, f(a) + \frac{x-a}{b-a}\, f(b) ;

then we have

    \int_a^b L(x)\,dx = \int_a^b \Big( \frac{x-b}{a-b}\, f(a) + \frac{x-a}{b-a}\, f(b) \Big)\,dx
                      = \frac{f(a)}{a-b}\Big(-\frac{1}{2}\Big)(b-a)^2 + \frac{f(b)}{b-a}\,\frac{1}{2}(b-a)^2
                      = \frac{b-a}{2}\big( f(a) + f(b) \big) .

This is exactly the same as the trapezoidal rule. So the error of the trapezoidal rule
can be transferred to the error of the Lagrange interpolation:

    \int_a^b f(x)\,dx - \frac{b-a}{2}\big( f(a) + f(b) \big)
      = \int_a^b \big( f(x) - L(x) \big)\,dx
      = \frac{1}{2} \int_a^b f''(\xi_x)\,(x-a)(x-b)\,dx
      = \frac{1}{2} f''(\xi) \int_a^b (x-a)(x-b)\,dx
      = - \frac{1}{12} f''(\xi)\,(b-a)^3 ,                                          (9.1)

where the mean value theorem for integrals applies because (x-a)(x-b) does not change
sign on [a,b]. So if the size |b-a| is small and f'' is not large on [a,b], then the
trapezoidal rule can be very accurate. This error formula also indicates that the
trapezoidal rule is exact for all linear polynomials.

9.2    Composite trapezoidal rule and its accuracy

Recall the trapezoidal rule:

    \int_a^b f(x)\,dx \approx \frac{b-a}{2}\big( f(a) + f(b) \big) = T(a,b;f)

and its error

    \int_a^b f(x)\,dx - T(a,b;f) = -\frac{1}{12} f''(\xi)\,(b-a)^3 .

We see that if the interval [a,b] is not small, the error of the trapezoidal rule can be
very large, so the approximation is not accurate enough.

To derive a more accurate approximation, we divide [a,b] into n equally-spaced
smaller subintervals using the points

    a = x_0 < x_1 < x_2 < ... < x_{n-1} < x_n = b .

Let h = (b-a)/n be the length of each subinterval; then we have

    x_i = a + ih ,      i = 0, 1, ..., n .

Now on each small subinterval [x_{i-1}, x_i] we can approximate \int_{x_{i-1}}^{x_i} f(x)\,dx by the
trapezoidal rule with good accuracy, i.e.,

    \int_{x_{i-1}}^{x_i} f(x)\,dx \approx \frac{h}{2}\big( f(x_{i-1}) + f(x_i) \big) .

Then using

    \int_a^b f(x)\,dx = \sum_{i=1}^n \int_{x_{i-1}}^{x_i} f(x)\,dx \approx \sum_{i=1}^n \frac{h}{2}\big( f(x_{i-1}) + f(x_i) \big) ,

we derive the following composite trapezoidal rule:

    \int_a^b f(x)\,dx \approx h \Big[ \frac{f(x_0)}{2} + f(x_1) + f(x_2) + \cdots + f(x_{n-1}) + \frac{f(x_n)}{2} \Big] .
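As a concrete illustration (this sketch is my own addition, not part of the notes; the
function name composite_trapezoid is just an illustrative choice), the composite
trapezoidal rule takes only a few lines:

```python
import numpy as np

def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals of [a, b]."""
    x = np.linspace(a, b, n + 1)        # nodes x_0, ..., x_n
    y = f(x)
    h = (b - a) / n
    return h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])

# quick check on an integral with a known value: int_0^pi sin(x) dx = 2
for n in (4, 8, 16):
    print(n, composite_trapezoid(np.sin, 0.0, np.pi, n))
```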
Error of the composite trapezoidal rule

As we shall see, the accuracy of any quadrature rule depends strongly on the
smoothness of the integrand f(x): the quadrature rule is more accurate when the
function f(x) is smoother. Below we shall consider three different cases.
1. Let us first assume that f(x) satisfies

    \int_a^b |f'(x)|\,dx < \infty .

Then we can write the error as

    E_h(f) = \int_a^b f(x)\,dx - \sum_{i=1}^n \frac{h}{2}\big( f(x_{i-1}) + f(x_i) \big)
           = \sum_{i=1}^n \Big[ \int_{x_{i-1}}^{x_i} f(x)\,dx - \frac{h}{2}\big( f(x_{i-1}) + f(x_i) \big) \Big] .

Letting x_{i-1/2} = x_{i-1} + h/2, then using integration by parts we derive

    E_h(f) = - \sum_{i=1}^n \int_{x_{i-1}}^{x_i} (x - x_{i-1/2})\, f'(x)\,dx ,              (9.2)

which, since |x - x_{i-1/2}| \le h/2 on [x_{i-1}, x_i], gives the error estimate

    |E_h(f)| \le \frac{h}{2} \int_a^b |f'(x)|\,dx .

2. Now we consider the case that

    \int_a^b |f''(x)|\,dx < \infty .

Then using (9.2) and integrating by parts one more time gives

    E_h(f) = - \sum_{i=1}^n \int_{x_{i-1}}^{x_i} (x - x_{i-1/2})\, f'(x)\,dx
           = - \sum_{i=1}^n \int_{x_{i-1}}^{x_i} \frac{1}{2} \big\{ (x - x_{i-1/2})^2 \big\}'\, f'(x)\,dx
           = \sum_{i=1}^n \Big[ \frac{1}{2} \int_{x_{i-1}}^{x_i} (x - x_{i-1/2})^2 f''(x)\,dx - \frac{h^2}{8}\big( f'(x_i) - f'(x_{i-1}) \big) \Big]
           = \frac{1}{2} \sum_{i=1}^n \int_{x_{i-1}}^{x_i} \Big\{ (x - x_{i-1/2})^2 - \frac{h^2}{4} \Big\} f''(x)\,dx ,

and this implies the following error estimate:

    |E_h(f)| \le \frac{1}{2} \sum_{i=1}^n \frac{h^2}{4} \int_{x_{i-1}}^{x_i} |f''(x)|\,dx = \frac{1}{8}\, h^2 \int_a^b |f''(x)|\,dx .
3. Finally, we assume that f \in C^2[a,b]. Then using (9.1) we have

    E_h(f) = \sum_{i=1}^n \Big[ \int_{x_{i-1}}^{x_i} f(x)\,dx - \frac{x_i - x_{i-1}}{2}\big( f(x_{i-1}) + f(x_i) \big) \Big]
           = \sum_{i=1}^n \Big( -\frac{1}{12} f''(\xi_i)\,(x_i - x_{i-1})^3 \Big)
           = - \sum_{i=1}^n \frac{1}{12} f''(\xi_i)\, h^3
           = - \frac{1}{12}\, h^3 \sum_{i=1}^n f''(\xi_i) .

By the intermediate value theorem applied to the continuous function f'', there exists
a point \xi \in [a,b] such that

    f''(\xi) = \frac{1}{n} \big( f''(\xi_1) + f''(\xi_2) + \cdots + f''(\xi_n) \big) ,

hence we derive the error estimate for the composite trapezoidal rule:

    E_h(f) = - \frac{n h^3}{12} f''(\xi) = - \frac{b-a}{12} f''(\xi)\, h^2 .                (9.3)

Example. Determine the mesh size h so that the error of the composite trapezoidal
rule for computing the integral \int_0^1 \sin(\pi x)\,dx is not bigger than 10^{-6}.

Solution. We know the error is

    E_h(f) = - \frac{b-a}{12} f''(\xi)\, h^2 = - \frac{1}{12} f''(\xi)\, h^2 ,

but f'(x) = \pi\cos(\pi x) and f''(x) = -\pi^2 \sin(\pi x), so we know |f''(x)| \le \pi^2. This
implies h should be chosen such that

    |E_h(f)| \le \frac{1}{12}\, \pi^2 h^2 \le 10^{-6} ,

which shows h must be not bigger than (2\sqrt{3}/\pi) \times 10^{-3}.

Clearly, we observe from (9.3) that the smaller h is, the more accurate the approximation
to \int_0^1 f(x)\,dx. But then the rule involves more points to evaluate. For example, we
see from above that we need h \approx 10^{-3} in order to evaluate \int_0^1 \sin(\pi x)\,dx to 6 digits
of accuracy. That would mean about 1,000 subintervals and about 1,000 function
evaluations.
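As a quick sanity check of this estimate (again my own addition, not from the notes),
one can evaluate the composite trapezoidal rule with n = 1000 subintervals, i.e. h = 10^{-3},
and compare with the exact value \int_0^1 \sin(\pi x)\,dx = 2/\pi:

```python
import numpy as np

n = 1000                                  # h = 1e-3, close to the bound (2*sqrt(3)/pi)*1e-3
x = np.linspace(0.0, 1.0, n + 1)
y = np.sin(np.pi * x)
h = 1.0 / n
approx = h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])
exact = 2.0 / np.pi                       # exact value of the integral
print(abs(approx - exact))                # about 5e-07, below the 1e-06 target
```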

9.3    Newton-Cotes quadrature rule

Recall the composite trapezoidal rule

    \int_a^b f(x)\,dx = \sum_{i=1}^n \int_{x_{i-1}}^{x_i} f(x)\,dx \approx \sum_{i=1}^n \frac{h}{2}\big( f(x_{i-1}) + f(x_i) \big)

and its error

    \int_a^b f(x)\,dx - \sum_{i=1}^n \frac{h}{2}\big( f(x_{i-1}) + f(x_i) \big) = - \frac{b-a}{12} f''(\xi)\, h^2 .

We know that for any linear polynomial f(x), the trapezoidal rule gives the exact
value of the integral, that is,

    \int_a^b f(x)\,dx = \sum_{i=1}^n \frac{h}{2}\big( f(x_{i-1}) + f(x_i) \big) .

Can we find more accurate quadrature rules which are exact for polynomials of
higher degree? Below, we shall try to construct such quadrature rules.

For a set of given points

    x_0, x_1, x_2, ..., x_n

in the interval [a,b], we would like to find coefficients \alpha_0, \alpha_1, ..., \alpha_n such that for
any polynomial f of degree \le n we have

    \int_a^b f(x)\,dx = \alpha_0 f(x_0) + \alpha_1 f(x_1) + \cdots + \alpha_n f(x_n) .              (9.2)

In fact, it is easy to see that the \alpha_i exist and

    \alpha_i = \int_a^b l_i(x)\,dx ,      i = 0, 1, ..., n ,                                  (9.3)

where

    l_i(x) = \prod_{j=0,\, j\ne i}^{n} \frac{x - x_j}{x_i - x_j} .

9.3.1    Computing the coefficients of Newton-Cotes rules (I)

Let us see how to compute the coefficients of the Newton-Cotes rule using (9.3).

Case n = 1. We have

    \alpha_0 = \int_a^b l_0(x)\,dx = \int_a^b \frac{x-b}{a-b}\,dx = \frac{1}{2}(b-a) ,
    \alpha_1 = \int_a^b l_1(x)\,dx = \int_a^b \frac{x-a}{b-a}\,dx = \frac{1}{2}(b-a) ,

so we get the Newton-Cotes rule for n = 1:

    \int_a^b f(x)\,dx \approx \frac{b-a}{2}\big( f(a) + f(b) \big) ,

which is exactly the trapezoidal rule.

Case n = 2. Let x_0 = a, x_1 = (a+b)/2 and x_2 = b, h = b - a. Then we have

    \alpha_0 = \int_a^b l_0(x)\,dx = \int_a^b \frac{(x - x_1)(x - b)}{(a - x_1)(a - b)}\,dx
             = \frac{2}{(b-a)^2} \int_a^b (x - b + h/2)(x - b)\,dx = \frac{1}{6}(b-a) ,

    \alpha_1 = \int_a^b l_1(x)\,dx = \int_a^b \frac{(x - a)(x - b)}{(x_1 - a)(x_1 - b)}\,dx
             = - \frac{4}{(b-a)^2} \int_a^b (x - a)(x - b)\,dx = \frac{4}{6}(b-a) ,

    \alpha_2 = \int_a^b l_2(x)\,dx = \int_a^b \frac{(x - a)(x - x_1)}{(b - a)(b - x_1)}\,dx = \frac{1}{6}(b-a) ,

so we get the Newton-Cotes rule for n = 2:

    \int_a^b f(x)\,dx \approx \frac{b-a}{6}\Big[ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big] ,

which is called Simpson's rule.
9.3.2    Computing the coefficients of Newton-Cotes rules (II)

One can easily see that it is very technical and lengthy to compute the coefficients
\alpha_i of the Newton-Cotes rules using (9.3) for larger n. Usually, we may calculate the
coefficients \alpha_i for larger n directly from the definition:

    the quadrature rule (9.2) holds for all polynomials of degree \le n.

Example. Find the Newton-Cotes rules (9.2) when n = 1 and n = 2.

Solution. When n = 1, we need two points x_0 and x_1. Let us consider x_0 = a and
x_1 = b. As the quadrature rule

    \int_a^b f(x)\,dx = \alpha_0 f(a) + \alpha_1 f(b)

holds for all polynomials of degree \le 1, we can immediately find

    \alpha_0 = \frac{b-a}{2} ,      \alpha_1 = \frac{b-a}{2}

by taking f(x) = 1 and f(x) = x - a. This gives the trapezoidal rule:

    \int_a^b f(x)\,dx \approx \frac{b-a}{2}\big( f(a) + f(b) \big) .

When n = 2, we need three points x_0, x_1 and x_2. Let us consider x_0 = a,
x_1 = (a+b)/2 and x_2 = b. Then using the fact that the formula

    \int_a^b f(x)\,dx = \alpha_0 f(x_0) + \alpha_1 f(x_1) + \alpha_2 f(x_2)

is exact for all polynomials of degree \le 2, we obtain

    \int_a^b 1\,dx = b - a = \alpha_0 + \alpha_1 + \alpha_2 ,
    \int_a^b (x-a)\,dx = \frac{(b-a)^2}{2} = \alpha_1\,\frac{b-a}{2} + \alpha_2\,(b-a) ,
    \int_a^b (x-a)(x-b)\,dx = -\frac{(b-a)^3}{6} = - \alpha_1 \Big(\frac{b-a}{2}\Big)^2

by taking

    f(x) = 1 ,      x - a ,      (x-a)(x-b) .

Solving the system, we derive the famous Simpson's rule:

    \int_a^b f(x)\,dx \approx \frac{b-a}{6}\Big[ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big] .
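The "exactness" conditions used in this example are easy to set up numerically. The
following sketch (my own illustration, not from the notes; newton_cotes_weights is just
an assumed name) solves them for equally spaced nodes and recovers the trapezoidal
and Simpson weights:

```python
import numpy as np

def newton_cotes_weights(a, b, n):
    """Weights alpha_i of the (n+1)-point closed Newton-Cotes rule on [a, b]."""
    x = np.linspace(a, b, n + 1)
    # exactness on 1, x, ..., x^n gives an (n+1) x (n+1) linear system
    V = np.vander(x, n + 1, increasing=True).T                       # row k holds x_i^k
    rhs = np.array([(b**(k + 1) - a**(k + 1)) / (k + 1) for k in range(n + 1)])
    return np.linalg.solve(V, rhs)

print(newton_cotes_weights(0.0, 1.0, 1))   # [0.5, 0.5]        -> trapezoidal rule
print(newton_cotes_weights(0.0, 1.0, 2))   # [1/6, 4/6, 1/6]   -> Simpson's rule
```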

9.4    Simpson's rule and its error estimates

Simpson's rule is one of the most important quadrature rules due to its nice properties.
In this subsection we shall introduce three different methods to derive the error
estimate of Simpson's rule:

    \int_a^b f(x)\,dx \approx \frac{b-a}{6}\Big[ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big] .        (9.4)
Method 1. Let \bar{x} = (a+b)/2 and h = b - a, and let L_2(x) be the quadratic Lagrange
interpolant of f at a, \bar{x}, b. Since the rule (9.4) is exact for all quadratic polynomials,
and f and L_2 agree at the three nodes, applying (9.4) to f and to L_2 gives the same
value, so we have

    E(f) = \int_a^b f(x)\,dx - \frac{b-a}{6}\Big[ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big]
         = \int_a^b \big( f(x) - L_2(x) \big)\,dx
         = \int_a^b \frac{f^{(3)}(\xi_x)}{3!}\,(x-a)(x-b)\Big(x - \frac{a+b}{2}\Big)\,dx
         = \int_a^{\bar{x}} \frac{f^{(3)}(\xi_x)}{3!}\,(x-a)(x-b)\Big(x - \frac{a+b}{2}\Big)\,dx
           + \int_{\bar{x}}^b \frac{f^{(3)}(\xi_x)}{3!}\,(x-a)(x-b)\Big(x - \frac{a+b}{2}\Big)\,dx
         = \frac{1}{6} f^{(3)}(\eta_1) \int_a^{\bar{x}} (x-a)(x-b)\Big(x - \frac{a+b}{2}\Big)\,dx
           + \frac{1}{6} f^{(3)}(\eta_2) \int_{\bar{x}}^b (x-a)(x-b)\Big(x - \frac{a+b}{2}\Big)\,dx
         = \frac{h^4}{384}\Big( f^{(3)}(\eta_1) - f^{(3)}(\eta_2) \Big) ,

where the mean value theorem for integrals can be used on each half interval because
(x-a)(x-b)(x - \bar{x}) does not change sign there.
Method 2. The second approach is to use the Taylor expansion. Let

    F(x) = \int_a^x f(t)\,dt ,

h = b - a and \bar{x} = (a+b)/2. Then we expand F(b) at x = a by Taylor series up to
the 5th order to obtain (using F(a) = 0 and F^{(k)} = f^{(k-1)}):

    F(b) = F(a) + h F'(a) + \frac{h^2}{2} F''(a) + \frac{h^3}{6} F'''(a) + \frac{h^4}{24} F^{(4)}(a) + \frac{h^5}{120} F^{(5)}(\xi)
         = h f(a) + \frac{h^2}{2} f'(a) + \frac{h^3}{6} f''(a) + \frac{h^4}{24} f^{(3)}(a) + \frac{h^5}{120} f^{(4)}(\xi) .          (9.5)

On the other hand, we can expand each term on the right-hand side of (9.4) to get

    f(a) = f(a) ,                                                                             (9.6)
    f(\bar{x}) = f(a) + \frac{h}{2} f'(a) + \Big(\frac{h}{2}\Big)^2 \frac{f''(a)}{2} + \Big(\frac{h}{2}\Big)^3 \frac{f^{(3)}(a)}{6} + \Big(\frac{h}{2}\Big)^4 \frac{f^{(4)}(\eta_1)}{24} ,     (9.7)
    f(b) = f(a) + h f'(a) + \frac{h^2}{2} f''(a) + \frac{h^3}{6} f^{(3)}(a) + \frac{h^4}{24} f^{(4)}(\eta_2) .                        (9.8)

Using the relations (9.6)-(9.8), we deduce

    \frac{b-a}{6}\Big[ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big]
      = h f(a) + \frac{h^2}{2} f'(a) + \frac{h^3}{6} f''(a) + \frac{h^4}{24} f^{(3)}(a) + \frac{h^5}{576}\Big( f^{(4)}(\eta_1) + 4 f^{(4)}(\eta_2) \Big) .

Subtracting this from (9.5), and writing f^{(4)}(\eta_1) + 4 f^{(4)}(\eta_2) = 5 f^{(4)}(\eta) for some \eta
(by the intermediate value theorem), yields

    \int_a^b f(x)\,dx - \frac{b-a}{6}\Big[ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big]
      = \frac{h^5}{120} f^{(4)}(\xi) - \frac{5 h^5}{576} f^{(4)}(\eta)
      = \frac{h^5}{120} \Big( f^{(4)}(\xi) - \frac{25}{24} f^{(4)}(\eta) \Big) .

Method 3. By taking special polynomials of degree 3, it is easy to check that
Simpson's rule is also exact for all polynomials of degree 3. Now for any \epsilon > 0, consider
the Lagrange polynomial L_\epsilon(x) of f(x) at the 4 nodal points a, \frac{a+b}{2}, \frac{a+b}{2} + \epsilon, b. As
Simpson's rule is exact for all polynomials of degree 3, we have

    \int_a^b L_\epsilon(x)\,dx = \frac{b-a}{6}\Big[ L_\epsilon(a) + 4 L_\epsilon\big(\tfrac{a+b}{2}\big) + L_\epsilon(b) \Big]
                             = \frac{b-a}{6}\Big[ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big] .

From this it is interesting to see that the integral of L_\epsilon(x) is independent of \epsilon, i.e. of
the nodal point \frac{a+b}{2} + \epsilon. Now by the error estimate for the Lagrange interpolation,
we have

    \int_a^b f(x)\,dx - \frac{b-a}{6}\Big[ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big]
      = \int_a^b \big( f(x) - L_\epsilon(x) \big)\,dx
      = \int_a^b \frac{f^{(4)}(\xi(\epsilon,x))}{4!}\,(x-a)(x-b)\Big(x - \frac{a+b}{2}\Big)\Big(x - \frac{a+b}{2} - \epsilon\Big)\,dx .          (9.9)

For convenience, we take a sequence \epsilon_n which strictly decreases to zero as n \to \infty.
Noting that \xi(\epsilon_n, x) is bounded, there exists a subsequence \xi(\epsilon_{n_k}, x) such that
\xi(\epsilon_{n_k}, x) \to \xi(x) as k \to \infty. Taking \epsilon = \epsilon_{n_k} in (9.9), then letting k \to \infty and using
the fact that (x-a)(x-b)(x - \frac{a+b}{2})^2 always keeps the same (negative) sign, we obtain,
with h = \frac{b-a}{2} and y = x - \frac{a+b}{2},

    \int_a^b f(x)\,dx - \frac{b-a}{6}\Big[ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big]
      = \int_a^b \frac{f^{(4)}(\xi(x))}{4!}\,(x-a)(x-b)\Big(x - \frac{a+b}{2}\Big)^2\,dx
      = \frac{f^{(4)}(\eta)}{4!} \int_{-h}^{h} (y+h)\,y^2\,(y-h)\,dy
      = - \Big(\frac{b-a}{2}\Big)^5 \frac{f^{(4)}(\eta)}{90} ,

where the mean value theorem for integrals is used in the second-to-last step.

9.5    Composite Simpson's rule and its accuracy

Recall the Simpson rule

    \int_a^b f(x)\,dx \approx \frac{b-a}{6}\Big[ f(a) + 4 f\big(\tfrac{a+b}{2}\big) + f(b) \Big] ,

which is accurate enough when the interval [a,b] is small. Otherwise Simpson's rule
may have poor accuracy. In this case, we can divide [a,b] into n equally-spaced small
subintervals:

    a = x_0 < x_1 < ... < x_{n-1} < x_n = b ,      x_i = a + ih ,      h = \frac{b-a}{n} ,

and use Simpson's rule on each subinterval. This gives the following composite
Simpson's rule:

    \int_a^b f(x)\,dx = \sum_{i=1}^n \int_{x_{i-1}}^{x_i} f(x)\,dx
                      = \int_{x_0}^{x_1} f(x)\,dx + \int_{x_1}^{x_2} f(x)\,dx + \cdots + \int_{x_{n-1}}^{x_n} f(x)\,dx
                      \approx \frac{h}{6} \sum_{i=1}^n \Big[ f(x_{i-1}) + 4 f\Big(\frac{x_{i-1}+x_i}{2}\Big) + f(x_i) \Big] .
The error estimate of the composite Simpson rule is a direct application of the
error estimate of the Simpson rule:

    \int_a^b f(x)\,dx - \frac{h}{6} \sum_{i=1}^n \Big[ f(x_{i-1}) + 4 f\Big(\frac{x_{i-1}+x_i}{2}\Big) + f(x_i) \Big]
      = \sum_{i=1}^n \Big\{ \int_{x_{i-1}}^{x_i} f(x)\,dx - \frac{h}{6}\Big[ f(x_{i-1}) + 4 f\Big(\frac{x_{i-1}+x_i}{2}\Big) + f(x_i) \Big] \Big\}
      = - \sum_{i=1}^n \Big(\frac{h}{2}\Big)^5 \frac{f^{(4)}(\eta_i)}{90}
      = - \frac{(b-a)\,h^4}{2880}\, f^{(4)}(\eta) ,

where \eta lies in the interval [a,b].
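As with the composite trapezoidal rule, the composite Simpson rule is easy to code. The
sketch below is my own addition (not from the notes); composite_simpson is just an
illustrative name, and the printout shows the expected O(h^4) decay of the error.

```python
import numpy as np

def composite_simpson(f, a, b, n):
    """Composite Simpson rule: Simpson's rule applied on each of n subintervals."""
    h = (b - a) / n
    x = np.linspace(a, b, n + 1)            # subinterval endpoints x_0, ..., x_n
    xm = 0.5 * (x[:-1] + x[1:])             # subinterval midpoints
    return (h / 6.0) * (f(x[:-1]) + 4.0 * f(xm) + f(x[1:])).sum()

# the error behaves like (b - a) * h^4 / 2880 * |f''''|
for n in (1, 2, 4, 8):
    print(n, abs(composite_simpson(np.sin, 0.0, np.pi, n) - 2.0))
```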

9.6    Gaussian quadrature rule

Recall that in the Newton-Cotes quadrature rule,

    \int_a^b f(x)\,dx \approx \alpha_0 f(x_0) + \alpha_1 f(x_1) + \cdots + \alpha_n f(x_n) ,              (9.10)

we fix the (n+1) points x_0, x_1, ..., x_n, and try to find the (n+1) coefficients
\alpha_0, \alpha_1, ..., \alpha_n by requiring that (9.10) is exact for all polynomials of degree \le n.

The Gaussian quadrature rule is of the same form

    \int_a^b f(x)\,dx \approx \alpha_0 f(x_0) + \alpha_1 f(x_1) + \cdots + \alpha_n f(x_n) .

We want to determine both the (n+1) points {x_i} and the (n+1) coefficients {\alpha_i}.
Altogether we have (2n+2) unknowns, and thus need (2n+2) conditions. Usually we
select the conditions as follows.

Conditions. Assume the relation

    \int_a^b f(x)\,dx = \alpha_0 f(x_0) + \alpha_1 f(x_1) + \cdots + \alpha_n f(x_n)

is exact for any polynomial of degree \le 2n+1.


Example. Let [a,b] = [-1,1]; work out the Gaussian quadrature rule for n = 1.

Solution. As n = 1, the Gaussian quadrature rule must be exact for all polynomials
of degree \le 3. By taking f(x) = 1, x, x^2 and x^3, we obtain

    \alpha_0 + \alpha_1 = 2 ,
    \alpha_0 x_0 + \alpha_1 x_1 = 0 ,
    \alpha_0 x_0^2 + \alpha_1 x_1^2 = \frac{2}{3} ,
    \alpha_0 x_0^3 + \alpha_1 x_1^3 = 0 .

Solving the system, we get

    x_0 = - \frac{1}{\sqrt{3}} ,      x_1 = \frac{1}{\sqrt{3}} ,      \alpha_0 = \alpha_1 = 1 ,

and this gives the Gaussian quadrature rule with n = 1:

    \int_{-1}^{1} f(x)\,dx \approx f\Big(-\frac{1}{\sqrt{3}}\Big) + f\Big(\frac{1}{\sqrt{3}}\Big) .

And this quadrature rule is exact for any polynomial of degree \le 3.
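A two-line check (my own, not part of the notes) confirms the exactness claim: the
two-point rule reproduces the integral of a cubic exactly, but not that of x^4.

```python
import numpy as np

def gauss2(f):
    """Two-point Gaussian rule on [-1, 1]: nodes +-1/sqrt(3), weights 1."""
    x = 1.0 / np.sqrt(3.0)
    return f(-x) + f(x)

# exact for every polynomial of degree <= 3:
print(gauss2(lambda t: 7*t**3 - 2*t**2 + t - 1), -4/3 - 2)   # both equal -10/3
# only approximate for degree 4:
print(gauss2(lambda t: t**4), 2/5)                           # 2/9 versus the exact 2/5
```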

9.7    Quadrature rule from [a, b] to [c, d]

Suppose we are given a quadrature rule defined on [a,b]:

    \int_a^b f(t)\,dt \approx \sum_{i=0}^n \alpha_i f(t_i) .

(Quadrature rules are usually given on [-1,1] or [0,1].) Suppose we want to use it
to approximate the following integral on another interval [c,d]:

    \int_c^d g(s)\,ds ;

what will be the corresponding quadrature rule?

The easiest way is to use a substitution:

    t = a + \frac{b-a}{d-c}(s - c)      \Longleftrightarrow      s = c + \frac{d-c}{b-a}(t - a) .

Note that when s = c, t = a, and when s = d, t = b. Moreover

    ds = \frac{d-c}{b-a}\,dt .
Using the substitution, we have

    \int_c^d g(s)\,ds = \int_a^b g\Big( c + \frac{d-c}{b-a}(t - a) \Big)\, \frac{d-c}{b-a}\,dt
                      \approx \frac{d-c}{b-a} \sum_{i=0}^n \alpha_i\, g(s_i) ,

where s_i = c + \frac{d-c}{b-a}(t_i - a) for i = 0, 1, 2, ..., n.
As an example of the above transformation, let us consider the following two-point
Gaussian quadrature rule:

    \int_{-1}^{1} f(t)\,dt \approx f\Big(-\frac{1}{\sqrt{3}}\Big) + f\Big(\frac{1}{\sqrt{3}}\Big) .

Using the previous transformation with [c,d] = [-10, 6], we have

    s_i = c + \frac{d-c}{b-a}(t_i - a) = -10 + 8\,(t_i + 1)

for i = 0, 1. Hence we derive the following Gaussian quadrature rule on [-10, 6]:

    \int_{-10}^{6} e^{-t^2/2}\,dt \approx 8\, e^{-[-10+8(-1/\sqrt{3}+1)]^2/2} + 8\, e^{-[-10+8(1/\sqrt{3}+1)]^2/2} \approx 0.259 .
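Here is a small sketch (my own addition, not from the notes) of the general [a,b] to [c,d]
transformation applied to the two-point Gauss rule; map_rule is just an assumed name.
Comparing with the value of the Gaussian integral also shows that, although the mapping
is correct, two nodes are far too few for such a long interval.

```python
import numpy as np

def map_rule(nodes, weights, a, b, c, d):
    """Map a quadrature rule given on [a, b] to the interval [c, d]."""
    scale = (d - c) / (b - a)
    return c + scale * (np.asarray(nodes) - a), scale * np.asarray(weights)

# two-point Gauss rule on [-1, 1]
t = np.array([-1/np.sqrt(3), 1/np.sqrt(3)])
w = np.array([1.0, 1.0])

g = lambda s: np.exp(-s**2 / 2)
s, ws = map_rule(t, w, -1.0, 1.0, -10.0, 6.0)
print(np.dot(ws, g(s)))          # about 0.259, as in the notes
print(np.sqrt(2 * np.pi))        # the true integral is close to sqrt(2*pi) ~ 2.5066
```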

9.8    Construction and accuracy of Gaussian quadrature rules

In this subsection we present a general approach to construct Gaussian quadrature
rules and analyse their accuracies.

Let {P_k(x)}_{k=0}^\infty be any orthogonal sequence of polynomials on [-1,1]. For
example, one may take the following Legendre polynomials:

    P_0(x) = 1 ,    P_1(x) = x ,    P_2(x) = x^2 - \frac{1}{3} ,    P_3(x) = x^3 - \frac{3}{5} x ,    ... .

Then the following technique shows a general way to construct Gaussian quadrature
rules.

Let {x_0, x_1, ..., x_n} be the roots of the polynomial P_{n+1}(x), and set

    w_i = \int_{-1}^{1} l_i(x)\,dx = \int_{-1}^{1} \prod_{j=0,\, j\ne i}^{n} \frac{x - x_j}{x_i - x_j}\,dx ,      i = 0, 1, 2, ..., n .      (9.11)

Then

    \int_{-1}^{1} f(x)\,dx = \sum_{i=0}^n w_i f(x_i)                                          (9.12)

holds for all polynomials of degree less than or equal to 2n+1, so (9.12) is a Gaussian
quadrature rule.
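In practice the nodes and weights are rarely computed by hand; NumPy, for instance,
ships Gauss-Legendre nodes and weights (numpy.polynomial.legendre.leggauss), which can
be used as a quick check of the construction above. The snippet is my own illustration,
not part of the notes.

```python
import numpy as np

n = 4                                            # (n+1) = 5 points, exact up to degree 2n+1 = 9
x, w = np.polynomial.legendre.leggauss(n + 1)    # roots of P_{n+1} and the weights w_i
print(x)                                         # nodes in (-1, 1), symmetric about 0
print(w.sum())                                   # weights sum to 2, the length of [-1, 1]

# exact for x^k with k <= 2n+1, but not for k = 2n+2
for k in (9, 10):
    exact = (1 - (-1)**(k + 1)) / (k + 1)        # integral of x^k over [-1, 1]
    print(k, np.dot(w, x**k), exact)
```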
We next prove that (9.12) is a Gaussian quadrature rule. Let f(x) be an arbitrary
polynomial of degree \le 2n+1; then there exist two polynomials q(x) and r(x) of degree
\le n such that f(x) = q(x) P_{n+1}(x) + r(x). Since the x_i are the roots of P_{n+1}, we have
f(x_i) = r(x_i), so by the Lagrange interpolation we know

    r(x) = \sum_{i=0}^n l_i(x)\, r(x_i) = \sum_{i=0}^n l_i(x)\, f(x_i) .

Now, using the orthogonality of P_{n+1} to all polynomials of degree \le n, we can derive

    \int_{-1}^{1} f(x)\,dx = \int_{-1}^{1} q(x) P_{n+1}(x)\,dx + \int_{-1}^{1} r(x)\,dx
                           = 0 + \sum_{i=0}^n f(x_i) \int_{-1}^{1} \prod_{j=0,\, j\ne i}^{n} \frac{x - x_j}{x_i - x_j}\,dx
                           = \sum_{i=0}^n w_i f(x_i)

by the definition of w_i. □
With {w_i} and {x_i} chosen as in (9.11) and (9.12), we can obtain the following
property of the Gaussian quadrature rule and its accuracy:

Let \Psi_n(x) = \prod_{i=0}^n (x - x_i)^2, and let {w_i} and {x_i} be defined as in (9.11) and
(9.12). Then we have

    w_i > 0 ,      i = 0, 1, ..., n ,                                                         (9.13)

and

    \int_{-1}^{1} f(x)\,dx - \sum_{i=0}^n w_i f(x_i) = \frac{f^{(2n+2)}(\eta)}{(2n+2)!} \int_{-1}^{1} \Psi_n(x)\,dx .             (9.14)

Proof. We first prove (9.13). For any fixed i, we choose

    f(x) = l_i^2(x) = \prod_{j=0,\, j\ne i}^{n} \frac{(x - x_j)^2}{(x_i - x_j)^2} ,

which is clearly a nonnegative polynomial of degree 2n with f(x_j) = \delta_{ij}. Thus we have

    0 < \int_{-1}^{1} f(x)\,dx = \sum_{j=0}^n w_j f(x_j) = w_i .

To derive the error estimate (9.14), we let p(x) be the Hermite interpolation polynomial
of degree 2n+1 such that p(x_i) = f(x_i) and p'(x_i) = f'(x_i), i = 0, 1, ..., n. Then we have

    \int_{-1}^{1} p(x)\,dx = \sum_{j=0}^n w_j p(x_j) = \sum_{j=0}^n w_j f(x_j) .

Using the error estimate (8.32) of the Hermite interpolation polynomial, we know

    f(x) - p(x) = \frac{f^{(2n+2)}(\xi(x))}{(2n+2)!}\, \Psi_n(x) ,                            (9.15)

and then the desired error estimate (9.14) follows. In fact,

    \int_{-1}^{1} f(x)\,dx - \sum_{j=0}^n w_j f(x_j)
      = \int_{-1}^{1} \frac{f^{(2n+2)}(\xi(x))}{(2n+2)!}\, \Psi_n(x)\,dx
      = \frac{f^{(2n+2)}(\eta)}{(2n+2)!} \int_{-1}^{1} \Psi_n(x)\,dx

by the mean value theorem for integrals, since \Psi_n(x) \ge 0. □

9.9    Errors of different quadrature rules

We now compute a specific integral, \int_0^1 e^{-t^2}\,dt, by several different quadrature
rules, to better understand the actual accuracy of each quadrature rule. The correct
value of the integral is 0.7468241328124 up to 13 digits.
9.9.1    Trapezoidal rule

The simple trapezoidal rule gives:

    \int_0^1 e^{-t^2}\,dt \approx \frac{1}{2}\big( e^{-0^2} + e^{-1^2} \big) = 0.6839397205857 .

Its error is 0.0629.
If we want to get higher accuracy, we can use the composite trapezoidal rule.
Recall that the composite trapezoidal rule for n subintervals using nodes x_0 = 0, x_1,
..., x_{n-1}, x_n = 1 is

    \int_0^1 e^{-t^2}\,dt \approx h \Big( \frac{e^{-x_0^2}}{2} + e^{-x_1^2} + e^{-x_2^2} + \cdots + e^{-x_{n-1}^2} + \frac{e^{-x_n^2}}{2} \Big) .

For example, if we use 4 subintervals [0, 1/4], [1/4, 1/2], [1/2, 3/4] and [3/4, 1], then

    \int_0^1 e^{-t^2}\,dt \approx \frac{1}{4}\cdot\frac{1}{2}\big( e^{-0^2} + e^{-0.25^2} \big) + \frac{1}{4}\cdot\frac{1}{2}\big( e^{-0.25^2} + e^{-0.5^2} \big)
                              + \frac{1}{4}\cdot\frac{1}{2}\big( e^{-0.5^2} + e^{-0.75^2} \big) + \frac{1}{4}\cdot\frac{1}{2}\big( e^{-0.75^2} + e^{-1^2} \big)
                            = \frac{1}{4}\Big( \frac{e^{-0^2}}{2} + e^{-0.25^2} + e^{-0.5^2} + e^{-0.75^2} + \frac{e^{-1^2}}{2} \Big) .

Note that the factor 1/4 is the width of the subintervals.


Using the formula for different n, we have

    n      h        error        ratio
    1      1        0.0629       -
    2      0.5      0.0155       4.06
    4      0.25     0.00384      4.04
    8      0.125    0.000959     4.004

It is clear from the ratio of the errors that they are decreasing like O(h^2), because
as h is halved, the error is decreased by a factor of about 4.

From the trend, we can estimate how many subintervals we need in order to
compute the integral correctly up to 13 digits. Since

    0.0629 \cdot \Big(\frac{1}{4}\Big)^m \approx 10^{-13}    \Longrightarrow    m \approx 19.6 ,

we need n = 2^{19.6} \approx 10^{5.9}, i.e. about 800,000 subintervals and hence about 800,000
function evaluations.
9.9.2    Simpson's rule

The simple Simpson's rule gives:

    \int_0^1 e^{-t^2}\,dt \approx \frac{1}{6}\big( e^{-0^2} + 4 e^{-0.5^2} + e^{-1^2} \big) = 0.7471804289095 .

The error is 0.000356, which is already very small. This indicates the superiority of
Simpson's rule over the trapezoidal rule.
Again, if we want to get higher accuracy, we could use the composite Simpson's
rule. Recall that the composite Simpson's rule for n subintervals using nodal points
x_0 = 0, x_1, ..., x_{n-1}, x_n = 1 is

    \int_a^b f(x)\,dx \approx \frac{h}{6} \sum_{i=1}^n \Big[ f(x_{i-1}) + 4 f\Big(\frac{x_{i-1}+x_i}{2}\Big) + f(x_i) \Big] .

For example, if we use 2 subintervals [0, 0.5] and [0.5, 1], then

    \int_0^1 e^{-t^2}\,dt \approx \frac{1}{2}\cdot\frac{1}{6}\big( e^{-0^2} + 4 e^{-0.25^2} + e^{-0.5^2} \big) + \frac{1}{2}\cdot\frac{1}{6}\big( e^{-0.5^2} + 4 e^{-0.75^2} + e^{-1^2} \big)
                            = \frac{1}{12}\Big( e^{-0^2} + 4 e^{-0.25^2} + 2 e^{-0.5^2} + 4 e^{-0.75^2} + e^{-1^2} \Big) .

Again notice that the factor 1/2 is the width of the subintervals.
Using the formula for different n, we have

    n      h        error        ratio
    1      1        3.56(-4)     -
    2      0.5      3.12(-5)     11.40
    4      0.25     1.99(-6)     15.72
    8      0.125    1.24(-7)     15.95

(3.56(-4) means 3.56 \times 10^{-4}.) It is clear from the ratio of the errors that they are
decreasing like O(h^4), because as h is halved, the error is decreased by a factor of
about 16.

From the trend, we can estimate how many subintervals we need in order to
compute the integral correctly up to 13 digits. Since

    3.56 \times 10^{-4} \cdot \Big(\frac{1}{16}\Big)^m \approx 10^{-13}    \Longrightarrow    m \approx 7.9 ,

we need n = 2^{7.9} \approx 250 subintervals and hence about 500 function evaluations (in
each subinterval, we need roughly two function evaluations). So the composite Simpson's
rule is already roughly 1,600 times faster than the composite trapezoidal rule.

But can we get faster algorithms than the composite Simpson's rule?

9.9.3    Gaussian rules

Yes, we can use Gaussian quadrature rules to evaluate the integral. The results are:

    No. of points    Error
    2                2.29(-4)
    3                9.55(-6)
    4                3.35(-7)
    ...              ...
    7                7.88(-13)

It should be emphasized that we just use the simple Gaussian quadrature rules on the
whole interval [0,1] and not any composite rules (i.e. we did not divide the interval into
subintervals). We see that the error is already about 10^{-13} when we use the 7-point
Gaussian formula to evaluate the integral \int_0^1 e^{-t^2}\,dt. This requires only 7 function
evaluations. It is thus roughly 70 times faster than the composite Simpson's rule and
about 100,000 times faster than the composite trapezoidal rule.
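The whole comparison is easy to reproduce. The sketch below is my own addition
(not from the notes): it re-implements the composite rules in a few lines and uses
NumPy's Gauss-Legendre nodes (numpy.polynomial.legendre.leggauss) mapped to [0, 1].

```python
import numpy as np

f = lambda t: np.exp(-t**2)
exact = 0.7468241328124              # reference value of the integral on [0, 1]

def trap(n):
    x = np.linspace(0.0, 1.0, n + 1)
    return (1.0 / n) * (0.5 * f(x[0]) + f(x[1:-1]).sum() + 0.5 * f(x[-1]))

def simpson(n):
    x = np.linspace(0.0, 1.0, n + 1)
    xm = 0.5 * (x[:-1] + x[1:])
    return (1.0 / (6.0 * n)) * (f(x[:-1]) + 4.0 * f(xm) + f(x[1:])).sum()

def gauss(npts):
    t, w = np.polynomial.legendre.leggauss(npts)    # rule on [-1, 1]
    return 0.5 * np.dot(w, f(0.5 * (t + 1.0)))      # mapped to [0, 1]

for n in (1, 2, 4, 8):
    print(n, abs(trap(n) - exact), abs(simpson(n) - exact))
for npts in (2, 3, 4, 7):
    print(npts, abs(gauss(npts) - exact))
```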
10    Numerical differentiation

10.1    Aim of numerical differentiation

The aim of numerical differentiation is to study how to compute derivatives of a
function approximately by using only its function values.

Given a function f(x) on [a,b] and a partition of the interval

    a = x_0 < x_1 < x_2 < ... < x_{n-1} < x_n = b ,

for the sake of simplicity we assume the partition is uniform with mesh size
h = (b-a)/n, i.e., we can write

    x_i = x_0 + i h ,      i = 0, 1, ..., n .

How can we compute f'(x_i) and f''(x_i) approximately using the function values

    f(x_j) ,      j = 0, 1, ..., n ?

10.2    Forward and backward differences

We first discuss how to approximate the first derivative f'(x_i).

Forward difference. By Taylor series, we know

    f(x_{i+1}) = f(x_i) + h f'(x_i) + \frac{h^2}{2} f''(\xi_1) ,      \xi_1 \in (x_i, x_{i+1}) ,

then

    \frac{f(x_{i+1}) - f(x_i)}{h} = f'(x_i) + \frac{h}{2} f''(\xi_1) .

So if the mesh size h is small, we can use the approximation

    f'(x_i) \approx \frac{f(x_{i+1}) - f(x_i)}{h} ,                                           (10.16)

and the error is

    \frac{f(x_{i+1}) - f(x_i)}{h} - f'(x_i) = \frac{h}{2} f''(\xi_1) .

The scheme (10.16) is called the forward difference scheme.

Backward difference. Again, by Taylor expansion,

    f(x_{i-1}) = f(x_i) - h f'(x_i) + \frac{h^2}{2} f''(\xi_2) ,      \xi_2 \in (x_{i-1}, x_i) ,

thus

    \frac{f(x_i) - f(x_{i-1})}{h} = f'(x_i) - \frac{h}{2} f''(\xi_2) .

So if h is small, we can use the approximation

    f'(x_i) \approx \frac{f(x_i) - f(x_{i-1})}{h} .                                           (10.17)

This scheme (10.17) is called the backward difference scheme.
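Both schemes are one-liners in code. The sketch below is my own illustration (not from
the notes); the printed errors shrink roughly like (h/2)|f''(x)|, i.e. first order in h.

```python
import numpy as np

def forward_diff(f, x, h):
    """Forward difference (10.16): first-order approximation of f'(x)."""
    return (f(x + h) - f(x)) / h

def backward_diff(f, x, h):
    """Backward difference (10.17): first-order approximation of f'(x)."""
    return (f(x) - f(x - h)) / h

# errors at x = 1 for f = exp, whose derivative there is e
for h in (0.1, 0.05, 0.025):
    print(h, abs(forward_diff(np.exp, 1.0, h) - np.e),
             abs(backward_diff(np.exp, 1.0, h) - np.e))
```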

10.3    Central differences

Next, we study some central finite difference schemes.

Computing f'(x_i). By Taylor expansion, we have

    f(x_{i+1}) = f(x_i) + h f'(x_i) + \frac{h^2}{2} f''(x_i) + \frac{h^3}{6} f'''(\xi_1) ,      \xi_1 \in (x_i, x_{i+1}) ,
    f(x_{i-1}) = f(x_i) - h f'(x_i) + \frac{h^2}{2} f''(x_i) - \frac{h^3}{6} f'''(\xi_2) ,      \xi_2 \in (x_{i-1}, x_i) ,

thus

    \frac{f(x_{i+1}) - f(x_{i-1})}{2h} = f'(x_i) + \frac{h^2}{12}\big( f'''(\xi_1) + f'''(\xi_2) \big) .

So if h is small, we may use the approximation

    f'(x_i) \approx \frac{f(x_{i+1}) - f(x_{i-1})}{2h}                                        (10.18)

with an error

    \frac{f(x_{i+1}) - f(x_{i-1})}{2h} - f'(x_i) = \frac{h^2}{12}\big( f'''(\xi_1) + f'''(\xi_2) \big) .

The scheme (10.18) is called the central difference scheme.


Computing f''(x_i). We now discuss how to compute the second order derivative
f''(x_i). By Taylor expansion, we have

    f(x_{i+1}) = f(x_i) + h f'(x_i) + \frac{h^2}{2} f''(x_i) + \frac{h^3}{6} f^{(3)}(x_i) + \frac{h^4}{24} f^{(4)}(\xi_1) ,      \xi_1 \in (x_i, x_{i+1}) ,
    f(x_{i-1}) = f(x_i) - h f'(x_i) + \frac{h^2}{2} f''(x_i) - \frac{h^3}{6} f^{(3)}(x_i) + \frac{h^4}{24} f^{(4)}(\xi_2) ,      \xi_2 \in (x_{i-1}, x_i) .

This gives

    \frac{f(x_{i+1}) - 2 f(x_i) + f(x_{i-1})}{h^2} = f''(x_i) + \frac{h^2}{24}\big( f^{(4)}(\xi_1) + f^{(4)}(\xi_2) \big) ,

so if h is small, we can use the approximation

    f''(x_i) \approx \frac{f(x_{i+1}) - 2 f(x_i) + f(x_{i-1})}{h^2} .

This scheme is called the second order central difference scheme.
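The following short sketch (my own addition, not from the notes) implements both central
schemes; as the error formulas above predict, halving h divides each error by about 4.

```python
import numpy as np

def central_diff(f, x, h):
    """Central difference (10.18): second-order approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_central_diff(f, x, h):
    """Second-order central approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

# errors at x = 1 for f = sin, with f'(1) = cos(1) and f''(1) = -sin(1)
for h in (0.1, 0.05, 0.025):
    print(h, abs(central_diff(np.sin, 1.0, h) - np.cos(1.0)),
             abs(second_central_diff(np.sin, 1.0, h) + np.sin(1.0)))
```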
