
# Contents

1 Introduction 1
1.1 Overview of typical issues in scientific computing . . . . . . . . . . . . . . 1
1.2 Structure of the course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Algebraic equations 4
2.1 Problem description and modeling of floating objects . . . . . . . . . . . . 5
2.2 Solving an algebraic equation by hand . . . . . . . . . . . . . . . . . . . . 6
2.3 Solving an algebraic equation with Matlab . . . . . . . . . . . . . . . . . . 7
2.3.1 Graphical approximation . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Symbolical calculations . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.3 Numerical calculations . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Digital representation of numbers . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.1 Floating point numbers and round-off errors . . . . . . . . . . . . . 10
2.4.2 Binary representation of integer numbers . . . . . . . . . . . . . . . 14
2.4.3 Binary representation of floating point numbers . . . . . . . . . . . 15
2.5 Iterative methods for algebraic equations . . . . . . . . . . . . . . . . . . . 17
2.6 Bisection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6.1 Mathematical background and method . . . . . . . . . . . . . . . . 19
2.6.2 Algorithm and program . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6.3 Programming issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6.4 Example: equation for floating sphere . . . . . . . . . . . . . . . . . 28
2.7 Fixed-point iterations (Picard iteration) . . . . . . . . . . . . . . . . . . . 30
2.7.1 Mathematical background and method . . . . . . . . . . . . . . . . 30
2.7.2 Example: equation for floating sphere . . . . . . . . . . . . . . . . . 31
2.7.3 Checking convergence . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.8 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.8.1 Mathematical background and method . . . . . . . . . . . . . . . . 34
2.8.2 Example: equation for floating sphere . . . . . . . . . . . . . . . . . 35
2.9 Rate of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.9.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.9.2 Example: equation for floating sphere . . . . . . . . . . . . . . . . . 37

3 Nonlinear systems of algebraic equations 40
3.1 Problem description and modeling: predator-prey models . . . . . . . . . . 41
3.2 Analytical solutions and solving with Matlab . . . . . . . . . . . . . . . . . 43
3.2.1 Analytical solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.3 Symbolical calculations . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.4 Numerical calculations . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Newton’s method for systems of equations . . . . . . . . . . . . . . . . . . 46
3.3.1 Mathematical background and method . . . . . . . . . . . . . . . . 46
3.3.2 Stopping criteria and vector norms . . . . . . . . . . . . . . . . . . 47
3.3.3 Example: predator-prey equations . . . . . . . . . . . . . . . . . . . 48
3.3.4 Choice of initial vector . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Solving linear systems for population models . . . . . . . . . . . . . . . . . 50
3.4.1 Solving very small systems . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.2 Solving a little larger systems: Gaussian elimination . . . . . . . . . 51
3.4.3 Built-in Matlab functions . . . . . . . . . . . . . . . . . . . . . . . 53

## 4 Linear boundary value problems 54

4.1 Problem description and modeling: pollution models . . . . . . . . . . . . 55
4.1.1 Governing equation . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1.2 Boundary conditions . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Analytical solutions and solving with Matlab . . . . . . . . . . . . . . . . . 59
4.2.1 Analytical solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.2 Symbolical calculations . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Solving BVPs numerically: Introduction . . . . . . . . . . . . . . . . . . . 61
4.3.1 Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.2 Numerical techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Solving BVPs numerically using Matlab: bvp4c . . . . . . . . . . . . . . . 63
4.5 Finite differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5.1 Mathematical background and method . . . . . . . . . . . . . . . . 66
4.5.2 Simple first-order finite difference formulas . . . . . . . . . . . . . . 66
4.5.3 More complicated finite difference expressions . . . . . . . . . . . . 67
4.5.4 Matrix-vector equation: example . . . . . . . . . . . . . . . . . . . 68
4.5.5 Matrix-vector equation: general approach . . . . . . . . . . . . . . . 69
4.5.6 Programming finite differences . . . . . . . . . . . . . . . . . . . . . 71
4.6 Eliminating boundary conditions . . . . . . . . . . . . . . . . . . . . . . . 73
4.6.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6.2 General approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.7 Finite elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7.1 Mathematical background . . . . . . . . . . . . . . . . . . . . . . . 77
4.7.2 Matrix-vector equation: example . . . . . . . . . . . . . . . . . . . 81
4.7.3 Matrix-vector equation: general approach . . . . . . . . . . . . . . . 84

4.7.4 Numerical computation of the integrals . . . . . . . . . . . . . . . . 88
4.7.5 Programming finite elements . . . . . . . . . . . . . . . . . . . . . . 88
4.8 Convergence of numerical methods for BVPs . . . . . . . . . . . . . . . . . 90
4.8.1 Finite differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.8.2 Finite elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.9 Solving linear systems for BVPs: Crout’s method . . . . . . . . . . . . . . 94

## 5 Initial value problems 98

5.1 Problem description and modelling . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Solving first order linear IVPs analytically . . . . . . . . . . . . . . . . . . 100
5.2.1 Analytical solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2.2 Symbolical calculations . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3 Solving IVPs numerically: Introduction . . . . . . . . . . . . . . . . . . . . 102
5.3.1 Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.2 Numerical techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Solving IVPs numerically using Matlab: ode45 . . . . . . . . . . . . . . . . 104
5.5 One-step methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5.1 Euler’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5.2 Trapezoidal rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5.3 Runge–Kutta methods . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5.4 Programming one-step methods . . . . . . . . . . . . . . . . . . . . 106
5.5.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.6 Test equation and amplifying factor . . . . . . . . . . . . . . . . . . . . . . 109
5.7 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.7.1 Local truncation error . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.7.2 Global error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.7.3 Impact of roundoff errors . . . . . . . . . . . . . . . . . . . . . . . . 114
5.8 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.8.2 Stability of one-step methods (test equation) . . . . . . . . . . . . . 117
5.8.3 Region of absolute stability . . . . . . . . . . . . . . . . . . . . . . 117
5.8.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.8.5 Linear stability analysis for nonlinear equations . . . . . . . . . . . 121
5.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

## 6 Systems of initial value problems 124

6.1 Problem description: predator-prey models . . . . . . . . . . . . . . . . . . 125
6.2 Checking numerical solutions for systems of IVPs . . . . . . . . . . . . . . 126
6.2.1 Equilibrium solutions . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2.2 Symbolic calculations . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2.3 Analytical solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2.4 Numerical calculations with Matlab . . . . . . . . . . . . . . . . . . 127
6.3 One-step methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

6.3.1 Programming one-step methods . . . . . . . . . . . . . . . . . . . . 130
6.4 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.5 Stability of one-step methods . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.5.1 Linear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.5.2 Nonlinear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

## 7 Partial differential equations 136

7.1 Problem description: pollution models . . . . . . . . . . . . . . . . . . . . 137
7.1.1 Governing equation . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.1.2 Boundary and initial conditions . . . . . . . . . . . . . . . . . . . . 137
7.1.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2 Validation of numerical code for PDEs . . . . . . . . . . . . . . . . . . . . 139
7.2.1 Equilibrium solutions . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.2.2 Analytical solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.2.3 Numerical calculations with Matlab . . . . . . . . . . . . . . . . . . 139
7.3 Solving PDEs numerically: Introduction . . . . . . . . . . . . . . . . . . . 142
7.4 Finite differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.5 Finite elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.6 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.6.1 Finite differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.6.2 Finite elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.7 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.8 Solving linear and non-linear systems for PDEs . . . . . . . . . . . . . . . 153
7.8.1 Direct methods: factorization . . . . . . . . . . . . . . . . . . . . . 153
7.8.2 LU factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

Chapter 1

Introduction

## 1.1 Overview of typical issues in scientific computing

Scientific computing
Many scientific models and methods developed in Math, Physics, Engineering, Economics,
etc. that describe real-life phenomena are too difficult to solve by hand (analytically).
Then scientific computing (computer simulations) needs to be used to obtain solutions.
Sometimes computers can obtain analytic solutions for you using symbolic calculations.
Otherwise, numerical calculations need to be performed to obtain approximate solutions.
This requires a good numerical method to approximate the problem as accurately as needed,
and a computer and computer code to perform the numerical calculations.
Numerical methods
Numerical methods have two different aspects:
• Numerical analysis: derivation and understanding of general behavior of numerical
methods (Numerical analysis Math 4445/4446).
• Choice and application of numerical techniques: How to choose a good numerical
technique for a specific problem and how to apply it (Scientific computing Math
4414).
Best numerical method
There is no best numerical method. Which numerical method to choose depends on the
specific problem you are trying to solve and on the available resources. It is important to
choose a numerical technique that ‘works well’ for the specific problem you are solving.
In choosing a numerical technique, it is important that you understand
• The problem you are trying to solve and its mathematical formulation:
– Are there always solutions to the mathematical problem you try to solve numerically?
If there are no solutions, no numerical method will be able to find one.
– If there is more than one solution, which is the solution you are looking for?
– Limitations of the theory: The mathematical model is usually only an approximation
to the physical process.

– When and why does a certain numerical method work?
– Is the numerical technique applicable or easy to change to handle slightly different
problems?
– Will the approximate numerical solution be sufficiently accurate for the problem
you try to solve? What errors should you expect? Numerical computations are
never exact and always contain errors. Errors should be kept as small as possible
or necessary so that the final result is a sufficiently accurate solution of the
underlying physical problem. If the mathematical model is not very accurate,
there is no need to do the computing extremely accurately.
– When might a method lead to difficulties for a specific problem that require
changes to the method or even a different numerical technique?
– Is a numerical technique easy to implement on a computer?
– How fast can a computer provide solutions for a given numerical method?
– Can calculations be done in a limited time?
– Is computer memory large enough to store all information?
– What software is already available?

Programming
Programming is an essential part of the course. Obtaining a solution with a numerical
method often requires a large number of algebraic manipulations. In order to perform
such calculations, you need to be able to transform the numerical method into a
computer program and let a computer do all calculations.
In this course we use Matlab to write numerical programs.

• Advantages: easier if you have no programming experience, a lot of numerical methods
are available in Matlab, and it is easy to plot solutions.

• Disadvantages: Matlab is too slow for more computationally demanding problems.
Numerical methods are usually implemented much more efficiently in C or Fortran.
Such problems are outside the scope of Math 4414.

When you write programs it is important to

• Understand the problem and the numerical technique before you start programming.

• Write your program in such a way that it is easily changed to solve similar problems.

• Validate the results of your computer program. If you do computations with a
computer, you always get an answer. You need to make sure that you
get an answer that makes sense. A numerical technique might not work very
well for a specific problem or a computer program might contain errors. (It is very
easy to make mistakes when you write programs, but it is also easy to check for them.)

– Plot solutions to check whether the numerical solution might be correct.
– Compare with an analytical solution (of a simplified problem).
– Compare with numerical data in books or class notes.
– Compare with results from standard Matlab functions.

## 1.2 Structure of the course

For scientific computing it is necessary that you understand the problem that you solve
and the available numerical methods and their limitations. We discuss the following
aspects of a number of selected problems from physics/engineering (about 10 weeks of
class: homework problems, midterm test):

1. Modeling: Brief description of problem, mathematical model, and relevant physical
parameters. The mathematical models we consider are algebraic equations, boundary
value problems, ordinary differential equations, and partial differential equations.

2. Numerical method: Make the mathematical model suitable to solve with the
help of a computer. Go from a continuous description (differential equation) to
an approximate discrete description (difference equation). Discuss one or several
basic techniques that can be used to solve the obtained equations and relevant
mathematical and computational concepts. (This will take about 3/4 of the time.)
The focus is on how to use the techniques, how they work, what the advantages
and disadvantages of the various numerical techniques are, and what the numerical
issues are.

3. Computing: Find a numerical solution to the problem using an appropriate numerical
method.

4. Validation and visualization of results: Convince yourself that you found the
correct solution. Visualize the relevant data.

The last 5 weeks of class you need to work on a longer project (in groups or on your
own, you may make your own project or I may make one). During these weeks some classes
will be replaced by office hours, so you can work on the project and/or presentation.
This should give you a good basis for when and how to use scientific computing and
what to pay attention to. Wherever applicable, I will use a laptop with Matlab to illustrate
what we are doing.

Chapter 2

Algebraic equations

• round-off errors

• rate of convergence

• algorithms

• floating point numbers and symbolic variables

• stopping criteria

• initial guess

Numerical methods:

ton’s method)

• Finding roots of systems of algebraic equations (bisection method, fixed-point
iterations, Newton's method)

Programming

• Introduction to Matlab: how to use it and how to write simple programs

## 2.1 Problem description and modeling of floating objects

Consider a ball made of wood with a radius $R = 1$ and a density $\rho_b = 1/2$. How much of
the ball will be submerged when it is placed into water ($\rho_w = 1$)? We don't know which
portion of the ball is below the water; let's call its depth $d$. See Fig. 2.1.

[Figure 2.1: sketch of the floating ball. Labels in the sketch: $R = 1$, $d$, $r = r_z$, $z = 0$.]

How to start this problem?

We need a mathematical/physical model: Archimedes' law. The mass of the ball $M_b$
equals the mass of the water $M_w$ displaced by the ball.
We have $M_b = M_w$ and the mass of the ball is $M_b = \rho_b V_b = 4\pi R^3 \rho_b/3 = 2\pi/3$. The
amount of water that is displaced corresponds to the volume $V_w$ of the part of the ball in
between the dashed and solid line. The mass of such a volume of water is $\rho_w V_w$. We
need to find the volume of a part of a sphere (Multivariable calculus). This is easiest in
cylindrical coordinates.
We have a sphere with radius $R = 1$ and center $(0, 0, 1)$. In rectangular coordinates the
equation for the sphere is $x^2 + y^2 + (z - 1)^2 = 1$. In cylindrical coordinates ($r^2 = x^2 + y^2$)
this gives $r_z^2 = 1 - (z - 1)^2$, where $r_z$ is the upper bound of $r$ (the surface of the sphere),
which depends on $z$ (see sketch). Taking $z = 0$ at the bottom of the sphere gives for the volume

$$V_w = \int_0^d \int_0^{2\pi} \int_0^{r_z} r \, dr \, d\theta \, dz = \int_0^d \pi r_z^2 \, dz = \pi \int_0^d \left[ 1 - (z - 1)^2 \right] dz. \tag{2.1}$$

After some algebra the result is

$$V_w = \frac{\pi (3d^2 - d^3)}{3}. \tag{2.2}$$

Using $M_b = M_w$ we arrive at an algebraic equation

$$3d^2 - d^3 = 2. \tag{2.3}$$

This equation can be solved by hand or with a computer.
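As a quick sanity check of (2.2), the last integral in (2.1) can also be evaluated numerically. The following is a minimal sketch that is not part of the original notes; it assumes the built-in Matlab function integral is available, and the variable names and the test depth are chosen here for illustration only.

```matlab
% Numerical check of the submerged volume (2.2) for one test depth d.
d = 1;                                        % test depth (assumed value)
rz2 = @(z) 1 - (z - 1).^2;                    % r_z^2 from the sphere equation
Vw_numeric = integral(@(z) pi*rz2(z), 0, d);  % last integral in (2.1)
Vw_formula = pi*(3*d^2 - d^3)/3;              % closed form (2.2)
fprintf('numeric: %.12f, formula: %.12f\n', Vw_numeric, Vw_formula);
% With rho_w = 1 the displaced mass equals Vw; compare with Mb = 2*pi/3.
Mb = 2*pi/3;
fprintf('Mb: %.12f\n', Mb);
```

For the test depth $d = 1$ both volumes agree with the mass of the ball, which is consistent with $d = 1$ satisfying (2.3).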

## 2.2 Solving an algebraic equation by hand

Why is it useful to have an analytic solution?
To check whether a numerical program that you wrote is working correctly.
Solution method 1: You might know the general formula for the roots of a third-order
(cubic) equation or know where to find it.
Solution method 2: The polynomial has integer coefficients. Then we can look
for roots that are also integers. Any such root must divide the constant term. For
$d^3 - 3d^2 + 2 = 0$ this leaves four possibilities: $1, -1, 2, -2$. We can easily verify by
substitution that $d = 1$ is a solution and the others are not. The other 2 roots can be
found by reducing the degree of the original polynomial,
$(d^3 - 3d^2 + 2)/(d - 1) = d^2 - 2d - 2$.
This has the roots $d_1 = 1 + \sqrt{3}$ and $d_2 = 1 - \sqrt{3}$.
We have 3 possible solutions. If we let a body float in the water it will stay at one height
only. Which is the correct solution? The problem requires that $0 \le d \le 2R = 2$, so $d = 1$
is the solution we are looking for.
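The hand calculation can be mirrored step by step in Matlab. This is only an illustrative sketch, not the notes' own program; it uses the built-in functions polyval (evaluate a polynomial) and deconv (polynomial division), and the variable names are chosen here.

```matlab
% Coefficients of d^3 - 3*d^2 + 0*d + 2, highest power first.
c = [1 -3 0 2];
candidates = [1 -1 2 -2];          % integer divisors of the constant term
values = polyval(c, candidates);   % substitute each candidate
disp([candidates; values]);        % only d = 1 gives the value 0
% Divide out the factor (d - 1): quotient q and remainder r.
[q, r] = deconv(c, [1 -1]);
disp(q);                           % coefficients of d^2 - 2*d - 2
% Roots of the quadratic factor from the quadratic formula.
d_plus  = 1 + sqrt(3);
d_minus = 1 - sqrt(3);
fprintf('residuals: %.2e %.2e\n', polyval(q, d_plus), polyval(q, d_minus));
```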

## 2.3 Solving an algebraic equation with Matlab
See the separate Matlab guide for an introduction on how to use Matlab and how to write
simple programs. All Matlab code and Matlab output will be in Sans Serif font.

## 2.3.1 Graphical approximation

Make a plot of the function $d^3 - 3d^2 + 2$ with Matlab! This can be done with
ezplot('d^3 - 3*d^2 + 2')
which opens a new window and plots the function. See Fig. 2.2. We need to find where
the function intersects the $d$-axis. From the graph we can estimate this number. For a
more accurate number you can zoom in around the root in the Matlab window.

[Figure 2.2: Plot of $d^3 - 3d^2 + 2$ (a) using ezplot, and (b) zoom around the roots.]

How can we get a more exact number, i.e. solve the equation $d^3 - 3d^2 + 2 = 0$?
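The visual estimate can be mimicked in code by evaluating the function on a grid and looking for sign changes between neighboring points. This is a rough sketch added for illustration, not a method from the notes; the grid range (taken from the zoomed plot) and the number of grid points are assumptions.

```matlab
f = @(d) d.^3 - 3*d.^2 + 2;        % function whose roots we want
d = linspace(-1.5, 3.5, 500);      % grid covering the zoomed plot range
fd = f(d);
% A root lies between neighboring grid points where f changes sign.
% (A root that falls exactly on a grid point would need separate handling.)
k = find(fd(1:end-1) .* fd(2:end) < 0);
brackets = [d(k); d(k+1)];         % each column brackets one root
disp(brackets);
```

Each column of brackets gives an interval of width about 0.01 that contains one root, which is roughly what zooming in on the plot provides.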

## 2.3.2 Symbolical calculations

The roots of an algebraic equation like $d^3 - 3d^2 + 2 = 0$ can be found exactly using
the built-in symbolic Matlab function solve. Symbolic calculations give exact results (no
round-off errors).
First define a symbolic variable
syms d
Then use solve
sol = solve(d^3 - 3*d^2 + 2)

This gives
sol =
1
1 + 3^(1/2)
1 - 3^(1/2)

Remarks

• the roots $d_{2,3} = 1 \pm \sqrt{3}$ are found exactly, not as finite precision numbers.

• Alternatively, you could have divided out a root from a polynomial symbolically:
syms d
y=(d^3-3*d^2+2) / (d-1)
y = simplify(y)
gives
y = d^2-2*d-2

• Symbolic calculations take much longer than floating point calculations. Thus if
computing time is an issue and a numerical solution is sufficient do not use symbolic
calculations.

• If no analytical solution is found (and the number of equations equals the number
of unknowns), a numerical solution is attempted. For example
syms d
sol = solve(d^5 + 3*d^2 - d^3 - 2)
gives the numerical values
[ .85052644896432252802899764958837]
[ .73793253717418508330026780848423+1.1865616439828408330458807692618*i]
[ -.77762851016613744223048982300174]
[ -1.5487630131465552523990434435551]
[ .73793253717418508330026780848423-1.1865616439828408330458807692618*i]

• The calculations are exact: no roundoff errors

• Only for relatively simple problems that you could (in principle) also do by hand

## 2.3.3 Numerical calculations

In Matlab, roots of algebraic equations can be computed numerically using the built-in
Matlab function roots or fzero.
For polynomial algebraic equations, roots can be used to compute all roots of the
polynomial. As input it needs an array of the coefficients of the polynomial. To use roots,
first write the polynomial in the form $f(d) = 0$, so for this case $d^3 - 3d^2 + 2 = 0$. Then
make an array with the coefficients of the polynomial (the values of an array in Matlab
are in between [..]). Coefficients need to be ordered from the highest to the lowest power
in $d$ (and don't forget the zero coefficient of the missing term $0 \cdot d$):
c=[1 -3 0 2]
Then use d=roots(c) to compute all three roots
d=
2.7321
1.0000
-0.7321

For non-polynomial algebraic equations, fzero can be used to compute a single root
of an arbitrary function. For example fzero('x^3-3*x^2+2', 0.5) tries to find a root of
$x^3 - 3x^2 + 2 = 0$ near $x_0 = 0.5$. To use fzero you need a reasonable estimate of the value
of the root, for which you could use a plot of the function.

• The calculations are not exact: roundoff errors
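The two numerical routes can be compared side by side with the exact roots. The sketch below is an illustration rather than the notes' own program: it passes the function to fzero as a function handle and gives it the bracketing interval [0, 2] (an assumption made here so that the root $d = 1$ is singled out) instead of a single starting value.

```matlab
c = [1 -3 0 2];                    % coefficients of d^3 - 3*d^2 + 2
d_all = sort(roots(c));            % all three roots, in increasing order
f = @(x) x.^3 - 3*x.^2 + 2;        % same polynomial as a function handle
d_one = fzero(f, [0 2]);           % f(0) > 0 > f(2), so a root lies in between
% Compare with the exact roots 1 - sqrt(3), 1, 1 + sqrt(3).
d_exact = [1 - sqrt(3); 1; 1 + sqrt(3)];
err_roots = max(abs(d_all - d_exact));
err_fzero = abs(d_one - 1);
fprintf('max error roots: %.2e, error fzero: %.2e\n', err_roots, err_fzero);
```

The errors are small but in general not zero, which is the roundoff effect mentioned above.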

## 2.4 Digital representation of numbers

## 2.4.1 Floating point numbers and round-off errors
When the root problem $d^3 - 3d^2 + 2 = 0$ was solved with the Matlab function roots or
fzero, the exact roots were not found. Matlab produced so-called floating point numbers,
which are only approximations to the roots. A numerical computation on
a computer is different from a calculation in algebra and calculus courses: it
never produces an exact solution, only an approximation. Even Matlab's 1.0000
is not the same as $d = 1$ which we obtained analytically. It could be any number in the
range $0.99995 \le d < 1.00005$. To check this in Matlab:
d = 1.0000499999999999
which gives
d = 1.0000
and
d = 0.99995
which also gives
d = 1.0000

The approximations that Matlab gave were accurate up to 5 digits. It would be
incorrect to write $d_1 = 2.7321$, because it is not exactly the root. We should write instead
$d_1 \approx 2.7321$.
Note that when we found a root using fzero,
d = fzero('x^3-3*x^2+2', 2.1)
Matlab didn't compute the result in 5 digits precision, it only displayed the result d =
2.7321 in 5 digits. The 5-digit format is called format short in Matlab and this is the
default. To see 16 significant digits in floating point format, we don't need to redo the
calculation, just type
format long e
d
which gives
d = 2.732050807568878e+00
Here the notation e+00 means $10^0$.
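The difference between the stored value and the displayed value can also be made explicit without changing the display format. The following is a small sketch, assuming sprintf is used for the formatting; the value 1 + sqrt(3) stands in for the root returned by fzero, and the variable names are chosen here.

```matlab
d = 1 + sqrt(3);                   % stand-in for the computed root
short_str = sprintf('%.4f', d);    % 5 significant digits, like format short
long_str  = sprintf('%.15e', d);   % 16 significant digits, like format long e
disp(short_str);
disp(long_str);
% Formatting only changes what is shown; the stored d keeps all its digits,
% so it differs from the number that the short string represents.
differs_from_short = abs(d - str2double(short_str)) > 1e-6;
```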
Numerical computations always use a finite number of digits to represent a number.
Most often about 16 decimal digits (double precision) are used, sometimes about 8 digits
(single precision) or about 32 digits. This introduces an error called roundoff error.
If we type in a number in Matlab that has more than 16 digits, Matlab doesn’t say
that you gave an incorrect number, but it makes a floating point number of it (using 16
digits). For example,
y=1.234567890123456789
gives
y = 1.234567890123457e+00
We see that Matlab rounds the number to the nearest floating point number: if the 17th
digit is 5 or larger, 1 is added to the 16th digit (round up). If the 17th digit is lower
than 5, all digits from 17 on are dropped (round down). Another method is simply
chopping the number after 16 digits.
The notation for the floating point number of $y$ is fl($y$). Matlab uses 16 significant
digits for floating point numbers. Using more digits has two consequences:
• Computations are slower and more memory is required to store numbers.
• Roundoff error becomes smaller.
Nowadays, most computations are done in double precision (16 digits).
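The size of the roundoff error in double precision can be probed directly. The sketch below uses the built-in Matlab constant eps (the spacing between 1 and the next larger floating point number), which is not discussed in the text above and is brought in here only as an illustration.

```matlab
% eps is the gap between 1 and the next larger double precision number.
fprintf('eps = %.3e\n', eps);      % about 2.2e-16, i.e. roughly 16 digits
a = 1 + eps;                       % representable: differs from 1
b = 1 + eps/2;                     % not representable: rounded back to 1
is_a_distinct = (a > 1);
is_b_rounded  = (b == 1);
% A number typed with 19 digits is stored as the nearest double.
y = 1.234567890123456789;
y_str = sprintf('%.15e', y);       % shows the 16 digits that are kept
disp(y_str);
```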

## Calculations with finite digit arithmetic

When we do calculations in finite digit arithmetic, the rounding (chopping) occurs at
every step of the calculation. Finite digit computations can be done in Matlab via Maple,
but it is easier to do it by hand.

Example: Compute $\sqrt{3} + \sqrt{3}$ using 5 significant digits with rounding. We get

$$\begin{aligned}
\mathrm{fl}(\sqrt{3}) + \mathrm{fl}(\sqrt{3}) &= \mathrm{fl}(\texttt{1.732050807568877e+00}) + \mathrm{fl}(\texttt{1.732050807568877e+00}) \\
&= \texttt{1.7321e+00} + \texttt{1.7321e+00} = \texttt{3.4642e+00}
\end{aligned}$$

Note that first $\sqrt{3}$ is rounded to 5-digit precision. The + then adds the two 5-digit numbers.
This is not the same as computing $\sqrt{3} + \sqrt{3} = 2\sqrt{3}$ exactly or in more digits precision
and rounding the result to 5-digit precision:

$$\begin{aligned}
\mathrm{fl}(\sqrt{3} + \sqrt{3}) &= \mathrm{fl}(\texttt{1.732050807568877e+00} + \texttt{1.732050807568877e+00}) \\
&= \mathrm{fl}(\texttt{3.464101615137754e+00}) = \texttt{3.4641e+00}
\end{aligned}$$

The last (5th) digit is different.
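This example can be reproduced in Matlab with a small helper that rounds to a given number of significant digits. The code is a sketch under assumptions: the helper name fl_n is hypothetical (chosen to mirror the fl notation) and it only handles nonzero inputs.

```matlab
% fl_n(x, n): round nonzero x to n significant decimal digits
% (hypothetical helper, not a built-in Matlab function).
fl_n = @(x, n) round(x .* 10.^(n - 1 - floor(log10(abs(x))))) ...
               ./ 10.^(n - 1 - floor(log10(abs(x))));
n = 5;
% Round each term first, then add, then round again (rounding at every step).
s_steps = fl_n(fl_n(sqrt(3), n) + fl_n(sqrt(3), n), n);
% Add in full double precision, then round once at the end.
s_once = fl_n(sqrt(3) + sqrt(3), n);
fprintf('%.4e  %.4e\n', s_steps, s_once);
```

The two printed values differ in the fifth digit, as in the hand calculation.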

Similar errors occur when you use 16 digits: the last (16th) digit may not be accurate
(for a single calculation). These so-called roundoff errors are caused by the finite
number of digits that a computer uses to represent numbers.
Example: Try in Matlab
x = (1-sqrt(3))*(1+sqrt(3))
which gives
x = -2.00000000000000
Then try y = x + 2 which gives exactly 0 in exact calculations. Using (standard) 16 digits
arithmetic in Matlab: y = 4.440892098500626e-16

## Implications of finite precision: subtracting nearly equal numbers

In finite digit arithmetic you can get inaccurate results when you subtract nearly equal
numbers in your calculations. If you write a program this needs to be avoided if possible.
As a simple example we discuss the calculation of the roots of a quadratic equation.

In the floating object problem we started from a cubic equation, divided out $d - 1$,
and found the quadratic equation $d^2 - 2d - 2 = 0$. A quadratic equation $ax^2 + bx + c = 0$
has the solution

$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$

In "exact (calculus) calculations" this always gives the correct answer. When a finite
number of digits is used, this equation can give a poor approximation of a root.
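Example:
Use $a = 1$, $b = 123.4$, $c = 1.2$. The exact solutions are up to 8 digits:
x1 = -9.7252397e-03 and x2 = -1.2339027e+02.
Using 4-digit arithmetic we would get:

$$\sqrt{b^2 - 4ac} = \sqrt{\mathrm{fl}(\texttt{1.522756e4}) - \mathrm{fl}(\texttt{4.800000e0})} = \sqrt{\texttt{1.523e4} - \texttt{4.800e0}} = \sqrt{\texttt{1.523e4}} = \texttt{1.234e2}$$

Now compute $x_{1,2}$ (using 4-digit arithmetic):

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} = \frac{\texttt{-1.234e2} + \texttt{1.234e2}}{\texttt{2.000e0}} = \texttt{0.000e0}$$

$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} = \frac{\texttt{-1.234e2} - \texttt{1.234e2}}{\texttt{2.000e0}} = \texttt{-1.234e2}$$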
The approximation of $x_2$ using 4-digit precision is accurate up to 4 digits, but the
approximation for $x_1$ is not! $x_1$ has no accurate digits, which is rather problematic if $x_1$ is
the physical solution. The main problem is subtracting two almost equal numbers. Using
more digits for the calculations will improve the result, but it cannot completely eliminate
the inaccuracy of subtracting two nearly equal numbers with finite precision arithmetic.
One can avoid the subtraction of two nearly equal numbers by rewriting the expression for
$x_1$:

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \times \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{2c}{-b - \sqrt{b^2 - 4ac}}$$

Using 4-digit arithmetic this gives

$$\frac{2c}{-b - \sqrt{b^2 - 4ac}} = \frac{\texttt{2.400e0}}{\texttt{-1.234e2} - \texttt{1.234e2}} = \texttt{-9.724e-3}$$

This is an accurate approximation to $x_1$ with a relative error of only $10^{-4}$. Similarly, we
can write for $x_2$,

$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \times \frac{-b + \sqrt{b^2 - 4ac}}{-b + \sqrt{b^2 - 4ac}} = \frac{2c}{-b + \sqrt{b^2 - 4ac}}$$

Using 4-digit arithmetic, however, this would lead to a zero denominator.
What we want to use is a proper combination of the two ways to compute the roots
that avoids the subtraction of almost equal numbers. For which of the two expressions
almost equal numbers are subtracted depends on whether $b$ is positive or negative. To
take this into account properly, first evaluate

$$q = -\frac{1}{2} \left[ b + \operatorname{sign}(b) \sqrt{b^2 - 4ac} \right],$$

where $\operatorname{sign}(b) = 1$ if $b \ge 0$ and $-1$ otherwise. Then the two roots are

$$\frac{q}{a} \quad \text{and} \quad \frac{c}{q}.$$

For the example above ($b > 0$), 4-digit arithmetic gives $c/q$ = -9.724e-03 for $x_1$ and
$q/a$ = -1.234e+02 for $x_2$, which are both accurate up to 3 digits.
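The same cancellation shows up in double precision when the coefficients are extreme enough. The following sketch compares the standard formula with the $q$-based formulas; the coefficients $a = 1$, $b = 10^8$, $c = 1$ and all variable names are assumptions chosen here for illustration and do not come from the notes.

```matlab
% Coefficients chosen (as an assumption) so that cancellation is severe
% even in double precision: the roots are close to -1e8 and -1e-8.
a = 1; b = 1e8; c = 1;
D = sqrt(b^2 - 4*a*c);
% Standard formula for the small root: subtracts nearly equal numbers.
x_small_naive = (-b + D) / (2*a);
% sign(b) as defined in the text (Matlab's sign(0) returns 0, so do it by hand).
if b >= 0
    s = 1;
else
    s = -1;
end
q = -0.5 * (b + s*D);
x_large = q / a;                   % root of large magnitude
x_small = c / q;                   % root of small magnitude, no cancellation
fprintf('naive: %.6e, stable: %.6e\n', x_small_naive, x_small);
```

The product of the two roots must equal $c/a = 1$, which gives a simple way to judge which value of the small root is trustworthy.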

## Implications of finite precision calculations: testing for equal numbers

You should normally not test whether 2 floating point numbers are exactly equal:
if x == y
The result of this test will only be true when all digits are the same. Usually this will not
be the case due to roundoff errors. For example, type
x = (1-sqrt(3))*(1+sqrt(3))
which gives
x = -2.00000000000000
Comparing with the exact value -2
x == -2
gives
0
meaning false or not equal (a value of 1 means true or equal numbers).
Instead, when we test whether 2 numbers are (nearly) equal, we should use abs(x-y) < $\varepsilon$
with $\varepsilon$ a small number. How small depends on the type of calculations. Never
smaller than $10^{-(\text{number of digits}) + 1}$. For example, for a 16-digit calculation to find roots, $10^{-15}$
will probably work. For more complicated calculations (more computations, subtraction of
almost equal numbers), you can maybe only reach an answer that is accurate up to 8 or 10 digits
($\varepsilon \approx 10^{-8}$ or $\varepsilon \approx 10^{-10}$ would then be the smallest tolerance you want to choose).
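Applied to the example above, the tolerance test looks as follows. This is a minimal sketch; the tolerance value follows the 16-digit suggestion in the text and the variable names are chosen here.

```matlab
x = (1 - sqrt(3)) * (1 + sqrt(3));     % exact value is -2
tol = 1e-15;                           % tolerance for a 16-digit calculation
exactly_equal = (x == -2);             % false because of roundoff
nearly_equal  = (abs(x - (-2)) < tol); % true: the difference is below tol
fprintf('x == -2: %d, abs(x+2) < tol: %d\n', exactly_equal, nearly_equal);
```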

## Implications of finite precision calculations: stopping criteria

Stopping criteria in numerical methods are used to determine whether solutions are
sufficiently accurate, which can be done by comparing the two most recent approximations
to a solution $p$, say $p_{i+1}$ and $p_i$. When you try to determine a large $p \approx 10^8$ and test
whether the error $|p_{i+1} - p_i| < 10^{-10}$, you want 18 digits correct. This is not possible with
16-digit numbers. When you try to determine a small $p \approx 10^{-8}$ and test whether the
error $|p_{i+1} - p_i| < 10^{-10}$, you only have 2 accurate digits.
To avoid a dependence on the magnitude of $p$, a relative error $|p_{i+1} - p_i|/|p_i| < 10^{-10}$
should be used. Note that this is not such a good idea if $p_i$ equals or is very close to zero.
Safer is $|p_{i+1} - p_i|/|p_i + \varepsilon| < 10^{-10}$ with $\varepsilon$ a small number, for example $10^{-6}$.
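A relative stopping criterion of this kind could be used in an iteration loop as sketched below. The iteration itself is only a placeholder: a Newton step (Newton's method is listed in the contents as Section 2.8) applied to the floating sphere equation. The initial guess, the iteration limit, and the variable names are assumptions made for this illustration.

```matlab
f  = @(p) p.^3 - 3*p.^2 + 2;       % floating sphere equation
df = @(p) 3*p.^2 - 6*p;            % its derivative
p = 0.5;                           % initial guess (assumed)
tol = 1e-10;                       % relative tolerance from the text
delta = 1e-6;                      % small number guarding against p near 0
max_iter = 50;                     % safety limit (assumed)
for i = 1:max_iter
    p_new = p - f(p)/df(p);        % Newton step (one possible iteration)
    rel_change = abs(p_new - p) / abs(p + delta);
    p = p_new;
    if rel_change < tol
        break                      % last two approximations agree
    end
end
fprintf('p = %.15f after %d iterations\n', p, i);
```

The iteration limit prevents an endless loop in case the criterion is never met, for example when the tolerance is smaller than the available precision allows.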

2.4.2 Binary representation of integer numbers
Computers do not use base-10 numbers but base-2 (binary) numbers to do calculations.
We start with binary representation of integer numbers, since this is easier to understand
than the binary representation for floating point numbers. By default Matlab stores every
number as a (double precision) floating point number; integer types such as int32 exist
but have to be requested explicitly. Programming languages like C and Fortran, however,
routinely use separate integer types.
Integer base 10 numbers: What does the number 123 mean exactly?
$123 = 1 \times 10^2 + 2 \times 10^1 + 3 \times 10^0$. Any integer number can be written as a
base 10 number. For a positive integer this is $\sum_{i=0}^{n} a_i 10^i$, with each $a_i$ an
integer from 0 to 9.
Integer base 2 numbers: Any integer number can also be written as a base 2 number. For
a positive integer this is $\sum_{i=0}^{n} b_i 2^i$, with each $b_i$ a 0 or 1. To distinguish
base-2 numbers from base-10 numbers we use the notation $(\;)_2$ for base-2 numbers and
$(\;)_{10}$ for base-10 numbers. To go from base-2 to base-10 numbers and vice versa, just
use $\sum_{i=0}^{n} b_i 2^i$:
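base 2        expansion                                                          base 10
$(1)_2$       $1 \times 2^0$                                                     $(1)_{10}$
$(10)_2$      $1 \times 2^1 + 0 \times 2^0$                                      $(2)_{10}$
$(11)_2$      $1 \times 2^1 + 1 \times 2^0$                                      $(3)_{10}$
$(1011)_2$    $1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0$        $(11)_{10}$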
Try the following two examples yourself:
What is the base-10 number corresponding to the base-2 number (10101)2 ? What is the
base-2 number corresponding to the base-10 number (101)10 ?
There are Matlab functions that convert from binary to decimal and vice versa: bin2dec
and dec2bin. You can check the above 2 examples using bin2dec('10101') and dec2bin(101).
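As a small sketch (using the last row of the table above rather than the two exercises), the conversion can be done by hand with the sum formula and compared with the built-in functions. The variable names are chosen for this illustration only.

```matlab
% Convert between base 2 and base 10, by hand and with the built-in functions.
bits   = [1 0 1 1];                      % digits of (1011)_2, most significant first
powers = numel(bits)-1:-1:0;             % exponents 3, 2, 1, 0
value_by_hand = sum(bits .* 2.^powers);  % 1*2^3 + 0*2^2 + 1*2^1 + 1*2^0

value_builtin  = bin2dec('1011');        % character string in, number out
string_builtin = dec2bin(11);            % number in, character string out

fprintf('(1011)_2 = %d (by hand), %d (bin2dec)\n', value_by_hand, value_builtin);
fprintf('(11)_10  = (%s)_2 (dec2bin)\n', string_builtin);
```

Note that bin2dec expects a character string, while dec2bin returns one.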
How much memory is reserved for an integer?
How do we measure the amount of memory used?
A bit is a binary digit, i.e. a 0 or a 1.
A byte is a group of eight bits.
A word is the smallest addressable unit of memory for a computer (often 2 bytes or 4
bytes).
If we have one word of 2 bytes to store an integer number, the binary number (1011)2
would be stored as 0000 0000 0000 1011 (just fill up with zeros at the front). Why is
the above number not stored as a 4-bit number? Computer hardware can be kept
simpler and more efficient if it only handles numbers with a predetermined number of bits.
What is the largest integer on a computer if 2 bytes are used to store integers (don't
consider negative integers)?
We need to have all ones to have the largest integer, so for 2 bytes (= 16 bits):
$\sum_{k=0}^{15} 2^k = (1 - 2^{16})/(1 - 2) = 2^{16} - 1$ (geometric series). In this way we
cannot represent negative numbers, only integers from 0 to $2^{16} - 1$. These types of integers
are called unsigned integers. Note that when signed integers are used, only integers
in the range $[-2^{15}, 2^{15} - 1] = [-32768, 32767]$ are available. (This is not symmetric since
0 is included.) If 32 bits are used for an integer this is called a long integer.
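These limits can be checked numerically. The sketch below is an illustration only; it uses Matlab's explicit 16-bit integer types through intmax and intmin, which exist alongside the default floating point type, and the variable names are hypothetical.

```matlab
% Largest 16-bit unsigned integer: all 16 bits equal to 1
largest_by_sum     = sum(2.^(0:15));   % 2^0 + 2^1 + ... + 2^15
largest_by_formula = 2^16 - 1;         % closed form of the geometric series

% Matlab's integer types report their limits via intmax and intmin
u16_max = intmax('uint16');   % unsigned 16-bit: 0 ... 2^16 - 1
i16_min = intmin('int16');    % signed 16-bit lower limit: -2^15
i16_max = intmax('int16');    % signed 16-bit upper limit: 2^15 - 1

fprintf('sum = %d, formula = %d, intmax(uint16) = %d\n', ...
        largest_by_sum, largest_by_formula, u16_max);
fprintf('int16 range: [%d, %d]\n', i16_min, i16_max);
```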

2.4.3 Binary representation of floating point numbers
Single precision numbers have 32 bits (4 bytes). The single precision IEEE standard
floating point number is defined as:

$$(-1)^s \times 2^{c-127} \times (1.f)_2$$

and Fig. 2.3 contains a graphical representation.

[Figure 2.3 layout: s (1 bit) | c (8 bits) | f (23 bits)]
Figure 2.3: Single precision floating point number (32 bits): 1 sign bit for mantissa (s), 8
bits for the exponent (c), and 23 bits for the fractional part of the mantissa (f).

Note that you can always rewrite a number so that the first digit is non-zero. For example
for base 10 numbers: $0.9 = 9 \times 10^{-1}$. For binary numbers the only non-zero digit is 1.
The leftmost bit of a floating point number is for the sign of the number: s = 0 for
positive and s = 1 for negative numbers.
The next 8 bits are for the exponent c − 127. The value of c can take $2^8 = 256$
different values from 0 to 255. The first and last value (c = 0 and c = 255 for single
precision) are always reserved for special cases including ±0 and ±∞ (this also holds for
other precisions like double precision). Thus for single precision the range of values for
the exponent is

$$0 < c < (11\,111\,111)_2 = 255 \quad \text{or} \quad -126 \le c - 127 \le 127$$

The last 23 bits are used for the mantissa (the number multiplying the exponential
function, here with base 2). Since the first bit is always 1, it doesn’t need to be stored.
So the mantissa actually corresponds to 24 bits since there is one ’hidden’ bit. Zero
in floating point number notation is represented by all zero bits (with the sign bit as a
possible exception). The mantissa is restricted by
$$1 = (1.000\ldots0)_2 \le (1.f)_2 \le (1.111\ldots1)_2 = \sum_{i=0}^{23} (1/2)^i = 2 - 2^{-23}$$

Note that all together there are 24 1's in $(1.111\ldots1)_2$, meaning
$1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2} + \cdots + 1 \times 2^{-23}$.
The largest single precision number is $(2 - 2^{-23}) \times 2^{127} \approx 3.4 \times 10^{38}$ (largest
mantissa and largest exponent). The smallest (positive) single precision number is
$1 \times 2^{-126} \approx 1.2 \times 10^{-38}$ (smallest mantissa and smallest exponent). There are no
(accurate) single precision numbers in between 0 and approximately $1.2 \times 10^{-38}$ and there
are no single precision numbers above the maximum single precision number $3.4 \times 10^{38}$.
Similarly, the most negative single precision number is approximately $-3.4 \times 10^{38}$ and the
negative number closest to zero is approximately $-1.2 \times 10^{-38}$. See Fig. 2.4 for the usable
range of numbers.

[Figure 2.4 number line, from left to right: overflow | usable range | denormal, underflow | 0 |
denormal, underflow | usable range | overflow, with marks at $-3 \times 10^{38}$, $-10^{-38}$, 0,
$10^{-38}$, $3 \times 10^{38}$]
Figure 2.4: Usable range of numbers in single precision using standard IEEE notation.

What happens when we produce a number outside the usable range of values?
A number that is too large gives an overflow (Inf). A (positive) number that is too small
first gives a less accurate number (denormal) and then 0. The situation for negative
numbers is similar.
The machine epsilon is the smallest positive machine number ε so that 1 + ε ≠ 1. In
single precision (23 bits for the mantissa) this is $2^{-23} \approx 1.2 \times 10^{-7}$. Note that this value
is much larger than the smallest single precision number.
Double precision numbers have 64 bits (8 bytes). The double precision IEEE
standard floating point number uses 1 bit for the sign, 11 bits for the number c in the
exponent (c − 1023) and 52 bits for the mantissa. In Matlab you can find the maximum
number by using realmax, the minimum positive number by realmin, and the machine
epsilon by eps.
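The sketch below prints these limits and compares the single precision values with the formulas given earlier in this section. It is an illustration only; the variable names are hypothetical, and it relies on realmax, realmin, and eps accepting the argument 'single'.

```matlab
% Floating point limits reported by Matlab (double precision by default)
largest_double  = realmax;    % largest finite double
smallest_double = realmin;    % smallest positive normalized double
eps_double      = eps;        % machine epsilon, 2^(-52) for doubles

% The same functions accept 'single' to report single precision limits
largest_single  = realmax('single');   % (2 - 2^(-23)) * 2^127
smallest_single = realmin('single');   % 2^(-126)
eps_single      = eps('single');       % 2^(-23)

fprintf('double: realmax = %g, realmin = %g, eps = %g\n', ...
        largest_double, smallest_double, eps_double);
fprintf('single: realmax = %g, realmin = %g, eps = %g\n', ...
        largest_single, smallest_single, eps_single);

% Adding eps to 1 changes the result; adding eps/2 is lost to rounding
changes_one  = (1 + eps_double) ~= 1;
leaves_one   = (1 + eps_double/2) == 1;
overflow_inf = isinf(2 * realmax);     % exceeding realmax gives Inf
```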

2.5 Iterative methods for algebraic equations
Iterative methods do not try to compute the exact solution, but give only an approxi-
mation to the solution. The user should specify how close the approximation should be
to the solution. Iterative methods involve the following 3 steps:

• Generate a sequence of approximations that (we hope) converges to the solution.
One step to generate a new number in the sequence is called an iteration.

• The sequence is stopped when the approximation is "close enough to the solution".

• The sequence should also be stopped when it is clear that it will not approach the
solution at all (otherwise it keeps on generating numbers forever). This might be
because the method generates a sequence that does not approach the solution or
because a (Matlab) program is written incorrectly.

Usually a while-loop is used for an iterative method, since you don’t know in advance how
many times you need to compute an approximation in the sequence. The structure of a
while-loop for an iterative method is as follows
function [x] = iterative_method(initial_guess, maxiter, tolerance)
% Initializations, for example
iter = 1;
xold = initial_guess;

% Perform at most maxiter iterations
while iter <= maxiter
    % Compute a new approximation x from xold
    ···

    % Stop the iteration process if the approximation is good enough,
    % for example
    if abs(x-xold) < tolerance
        break;
    end

    % Prepare for next iteration, for example
    iter = iter + 1;
    xold = x;
end
Note that there are 2 mechanisms that terminate the iteration process. First, the
while-loop is terminated when iter exceeds the maximum number of iterations (to prevent
the iterations from continuing forever). Second, it is terminated when the approximation
is good enough (break terminates the while loop and continues the program after the end
corresponding to the while loop).
Three iterative methods to find roots will be discussed in the next sections: bisection,
fixed-point iterations, and Newton's method.

2.6 Bisection method
2.6.1 Mathematical background and method
The idea of the bisection method is based on the intermediate value theorem:
If f is continuous on [a, b], and f (a) and f (b) have opposite sign, then there exists a point
p with f (p) = 0. See Fig. 2.5.

[Figure 2.5: graph of f from a, where f(a) < 0, to b, where f(b) > 0, crossing the x-axis
at p, where f(p) = 0]
Figure 2.5: A continuous line from a to b where f(a) and f(b) have opposite sign should
cross the x-axis (y = 0).

## How does the bisection method work?

See Fig. 2.6. Find the sign of f halfway along the interval, i.e. at $p_1 = (a + b)/2$.
If f(a) and f($p_1$) have opposite signs, then the root p is in (a, $p_1$). Take a point halfway
along this interval, $p_2 = (a + p_1)/2$, etc.
If f(b) and f($p_1$) have opposite signs, then the root p is in ($p_1$, b). Take a point halfway
along this interval, $p_2 = (p_1 + b)/2$, etc.

[Figure 2.6: the interval [a, b] with the successive midpoints $p_1$, $p_2$, $p_3$, $p_4$ and the
shrinking intervals [$a_1$, $b_1$], [$a_2$, $b_2$]]
Figure 2.6: Graphical representation of bisection method.

• Convergence is relatively slow.

• You only find 1 root, which depends on the initial points a and b.

## 2.6.2 Algorithm and program

An algorithm describes in words (pseudocode) which steps a method needs to perform.
So it is not necessary to worry about the exact (Matlab) commands. Writing down an
algorithm before you write a (Matlab) program makes it easier to write the program.
Algorithms are particularly helpful when you start programming or when the method
involves a lot of steps and it is difficult to picture the structure of the program.
Below are an algorithm and a naive program for the bisection method
(i.e. a program you might write if you are unaware of typical programming issues). You
can run the bisection function like any built-in Matlab function. Using input parameters
a = 0.1 and b = 2 for the endpoints, tolerance ε = $10^{-6}$, and maximum number of
iterations N = 100, you would use
[p, k] = bisection0(0.1, 2, 100, 1e-6)

Algorithm: Bisection
Input: 2 points a and b, a tolerance ε, and a maximum number of iterations N
Output: approximation to a root in [a, b] and the number of iterations performed

Checks
    Check whether f(a) and f(b) have opposite sign

Initialization
    Compute the function value f(a) (Done already in Checks)
    Initialize iteration counter

Actual method
    While the number of iterations does not exceed N, do
        Compute new p = (a + b)/2
        Stop the iteration process when the approximation is good enough
        Update the right point if f(p) and f(a) have opposite sign
        Otherwise update the left point
    End while-loop

    Check if iterative method gives a good approximation
    Write an error message if this is not the case (if maximum number of
    iterations is exceeded)

Matlab program: Bisection
(Following the algorithm, ignorant of the programming issues in Sec. 2.6.3. Using these
it can be improved significantly.)

function [p, k] = bisection0(a0, b0, N, epsil)

%=============================================
% Description:
% Approximate one root of the function x^3 - 3x^2 + 2 in the interval [a0,b0]
% using the bisection method
% Ignorant version
% Input parameters:
% a0 initial guess for left point a
% b0 initial guess for right point b
% N maximum number of bisection iterations to be performed
% epsil tolerance for the error
% Output parameters:
% p array of approximations to the root pk
% k number of performed iterations
%=============================================

%———————————————————————————————————
% Checks and initializations
%———————————————————————————————————
k = 1;
a(1) = a0;
b(1) = b0;
fa(1) = a0^3 - 3*a0^2 + 2;
fb(1) = b0^3 - 3*b0^2 + 2;
if (fa(1) <= 0 && fb(1) <= 0) || (fa(1) >= 0 && fb(1) >= 0)
error('Initial guesses do not have opposite sign')
end

%———————————————————————————————————
% Iteration loop
%———————————————————————————————————
while k <= N
%————————————————————————————————-
% Calculate new approximation to root pk
%————————————————————————————————-
p(k) = (a(k) + b(k)) / 2;
fp(k) = p(k)^3 - 3*p(k)^2 + 2;

%———————————————————————————————————
% Check whether close to a root
%———————————————————————————————————
if k > 1
if abs(p(k) - p(k-1)) < epsil
break
end
end

%———————————————————————————————————
% Prepare for next iteration
%———————————————————————————————————
if (fa(k) < 0 && fp(k) > 0) || (fa(k) > 0 && fp(k) < 0)
a(k+1) = a(k);
fa(k+1) = fa(k);
b(k+1) = p(k);
fb(k+1) = b(k+1)^3 - 3*b(k+1)^2 + 2;
else
a(k+1) = p(k);
fa(k+1) = a(k+1)^3 - 3*a(k+1)^2 + 2;
b(k+1) = b(k);
fb(k+1) = fb(k);
end

k = k + 1;
end

%———————————————————————————————————
% Check if iterative method converged or not
%———————————————————————————————————
if k > N
error('Bisection method did not converge in %d iterations', N);
end

2.6.3 Programming issues
Programming with minimal memory usage
The naive program uses 6 arrays with all previous values of a, b, p, fa, fb, and fp. If a large number
of iterations is performed, the large arrays may take quite a lot of memory and make
computations slower. In addition, no memory is allocated (reserved) in advance for the
large arrays. Every time a new number is stored in an array, Matlab needs to create space
first to store that number. This makes the process even slower. However, storing all those
results is totally unnecessary: only the most recent values of the left point a, the right
point b, and the middle point p are used and some corresponding function values. So a
single variable instead of an array is sufficient for each.
How to check efficiently whether f(a) and f(p) have opposite sign?
If you use if statements inside for or while loops, this makes a code (much) slower than
necessary. A more efficient way to check whether two numbers have opposite sign is to
check whether the product is negative, thus check if f(a) × f(p) < 0. Note: if both f(a)
and f(p) are very large, the product f(a) × f(p) might be larger than the maximum
floating point number (see Sec. 2.4); if both are very small, the product might underflow
to zero. To avoid such problems the sign function can be used: sign(fa)*sign(fp). If fa is
positive, sign(fa) equals 1, if it is negative it equals −1, and if it is zero it equals 0.
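The difference between the two tests shows up at the extremes of the floating point range. The sketch below uses made-up function values and hypothetical variable names. In IEEE double precision the overflowing product becomes -Inf, which still compares as negative, whereas the underflowing product becomes zero and the product test then fails.

```matlab
% Opposite-sign test: product of values versus product of signs
fa =  1e-200;   % made-up tiny positive function value
fp = -1e-200;   % made-up tiny negative function value

product_test = (fa * fp) < 0;              % fa*fp underflows to zero
sign_test    = (sign(fa) * sign(fp)) < 0;  % sign() only returns -1, 0, or 1

fprintf('fa*fp             = %g\n', fa * fp);
fprintf('product test says : %d\n', product_test);
fprintf('sign test says    : %d\n', sign_test);

% Very large values overflow to -Inf instead, which still compares as negative
big_product = 1e200 * (-1e200);
```

The sign-based test gives the same answer regardless of the magnitude of the function values.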
Make m-files easy to modify
Use separate functions for parts that need to be modified frequently. Advantage: Once
the bisection function is working, you never need to modify it anymore. You only need
to modify the accompanying function that evaluates the function f . Additionally, there
is only one line that defines the function f , even if you evaluate f at various places in the
bisection m-file. In the bisection function you need to compute f (a), f (b), and f (p). You
just need to call the function several times with the correct value at which the function
needs to be evaluated, a, b, and p.
Disadvantage: Computations take a little longer due to extra function calls. Only
when you find that you can save a significant amount of computing time, you may want
to avoid the function calls.
Easiest way to do this in Matlab: We write a separate function to evaluate the function
f at the end of the m-file for the bisection method. If we solve $d^3 - 3d^2 + 2 = 0$, for example,
we make a function funcbisec that evaluates the function
function [f] = funcbisec(x)
% Floating sphere equation
f = x^3 - 3*x^2 + 2;
The function funcbisec is called in the m-file for the bisection method, bisection, at every
place where f needs to be evaluated, with the proper value for x. To evaluate f (p), for
example, and assign this value to the variable fp, use
fp = funcbisec(p);
Here p needs to have a proper numerical value.
If we want to compute roots of another problem, we only need to modify the expression
for f in the function funcbisec.
How to check whether the approximation is 'good enough'?
For the bisection method, we can do better than comparing $p_i$ and $p_{i-1}$. We know that
the root is in between the most recent values of a and b. So if we choose p in the middle,
we are certain that the error is less than or equal to |b − a|/2 = |p − a|. Thus if |p − a| is
smaller than a specified tolerance ε, we are certain that the actual error is less than or
equal to ε.
The stopping criterion can be made more robust by using a relative stopping criterion
and/or checking how small the residual f($p_i$) is (this measures how well the equation you
try to solve is satisfied; for a root f(p) = 0 exactly, so you want f($p_i$) to be small).
When you are not sure whether the numerical solution is good enough, you can always
try to decrease ε (and probably increase the maximum number of iterations as well).
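Because every bisection step halves the interval, the distance |p − a| after k steps equals $(b_0 - a_0)/2^k$ for the initial interval $[a_0, b_0]$, so the number of steps needed for this test can be estimated in advance. The sketch below is an illustration with hypothetical variable names, using the interval and tolerance of the running example.

```matlab
% Number of bisection steps needed before |p - a| drops below the tolerance.
% After k steps the midpoint is at distance (b0 - a0)/2^k from the left point.
a0  = 0.1;     % initial left point
b0  = 2;       % initial right point
tol = 1e-6;    % tolerance on |p - a|

k_needed   = ceil(log2((b0 - a0) / tol));   % smallest k with (b0-a0)/2^k <= tol
half_width = (b0 - a0) / 2^k_needed;        % guaranteed error bound after k_needed steps

fprintf('steps needed: %d, error bound: %g\n', k_needed, half_width);
```

A program that additionally requires a small residual |f(p)| can take more steps than this estimate.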
Function evaluations
Evaluating functions like sin(x), exp(x), etc. is computationally much more expensive
than multiplications or additions. For not too simple functions, most of the computing
time will be in the evaluation of f . Thus a fast program for the bisection method contains
as few function evaluations as possible.
Our function bisection0 has several function evaluations inside the while-loop. Only
the function evaluation at the point pi is necessary.
Unnecessary operations inside loops
Every operation and function call takes computing time (CPU time). If part of a compu-
tation is repeated exactly every iteration, it saves CPU time if you do the computation
once before the loop starts. For example, the sign of f(a) never changes during the
bisection iterations. Thus it can be computed before the while loop, and stored in a
variable (signfa in function bisection). Decisions on whether to use function calls (more
flexible, easier to read code) or not (save CPU time) depend on what is most important
for the problem you are solving. If your computations take a long time and a significant
percentage of the total CPU time can be saved by avoiding function calls, you may want
to minimize the function calls.

A Matlab function with all the above modifications is given below.
You can run the bisection function using (with input parameters a = 0.1 and b = 2 for
the endpoints, tolerance ε = $10^{-6}$, and maximum number of iterations N = 100)
[p, k] = bisection(0.1, 2, 100, 1e-6)

Matlab program: Bisection
(More flexible and robust bisection method, including the programming issues in
Sec. 2.6.3)

function [p, k] = bisection(a, b, N, epsil)

%=============================================
% Description:
% Approximate one root of the function in funcbisec in the interval [a,b]
% using the bisection method
% More flexible and robust version
% Input parameters:
% a initial guess for left point a
% b initial guess for right point b
% N maximum number of bisection iterations to be performed
% epsil tolerance for the error
% Output parameters:
% p approximation to the root
% k number of performed iterations
%=============================================

%———————————————————————————————————
% Checks and initializations
% Note: use sign to avoid overflow
% signfa will never change, no need to recompute
%———————————————————————————————————
k = 1;
fa = funcbisec(a);
fb = funcbisec(b);
signfa = sign(fa);
if signfa*sign(fb) >= 0
error('Initial guesses do not have opposite sign')
end

%———————————————————————————————————
% Iteration loop
%———————————————————————————————————
while k <= N
%———————————————————————————————————
% Calculate new approximation to root pk
%———————————————————————————————————
p = (a + b) / 2;
fp = funcbisec(p);

%———————————————————————————————————
% Check whether close to a root
% Note: Both f and the difference of 2 approximations should be small
%———————————————————————————————————
if abs(fp) < epsil && abs(p-a) < epsil
break
end

%———————————————————————————————————
% Prepare for next iteration
%———————————————————————————————————
if signfa*sign(fp) < 0
b = p;
else
a = p;
end

k = k + 1;
end

%———————————————————————————————————
% Check if iterative method converged or not
%———————————————————————————————————
if k > N
error('Bisection method did not converge in %d iterations', N);
end

%———————————————————————————————————
% Function f(x)
%———————————————————————————————————
function [f] = funcbisec(x)
% Floating body problem
f = x^3 - 3*x^2 + 2;

2.6.4 Example: equation for floating sphere
As an example we take as starting values a = 0.1 and b = 2. In addition we use N = 100 for
the maximum number of iterations and ε = $10^{-6}$ for the tolerance (we can always change
the values and run the problem again if these are not sufficient). Table 2.1 contains the
sequence of approximations.

i   p_i                     |1 − p_i|
1 1.050000000000000e+00 5.000000000000004e-02
2 5.750000000000001e-01 4.249999999999999e-01
3 8.125000000000000e-01 1.875000000000000e-01
4 9.312500000000000e-01 6.874999999999998e-02
5 9.906250000000001e-01 9.374999999999911e-03
6 1.020312500000000e+00 2.031250000000018e-02
7 1.005468750000000e+00 5.468750000000133e-03
8 9.980468750000001e-01 1.953124999999889e-03
9 1.001757812500000e+00 1.757812500000178e-03
10 9.999023437500001e-01 9.765624999991118e-05
11 1.000830078125000e+00 8.300781250001332e-04
12 1.000366210937500e+00 3.662109375000000e-04
13 1.000134277343750e+00 1.342773437500444e-04
14 1.000018310546875e+00 1.831054687517764e-05
15 9.999603271484376e-01 3.967285156236677e-05
16 9.999893188476564e-01 1.068115234359457e-05
17 1.000003814697266e+00 3.814697265847045e-06
18 9.999965667724611e-01 3.433227538929273e-06
19 1.000000190734863e+00 1.907348634588857e-07
20 9.999983787536623e-01 1.621246337735194e-06
21 9.999992847442629e-01 7.152557370826429e-07
22 9.999997377395632e-01 2.622604368118786e-07

Table 2.1: Iteration number i, approximations $p_i$, and absolute difference with the exact
solution $|1 - p_i|$; bisection method for $d^3 - 3d^2 + 2 = 0$ using initial points a = 0.1 and
b = 2, tolerance ε = $10^{-6}$, and maximum number of iterations N = 100.

Such a long list of numbers is difficult to interpret; a plot is easier. The easiest way to
generate a plot of the errors is to compute the errors $|1 - p_i|$ in the bisection function
and store these in an array, say e. Thus, for the above example, e is an array of length 22
which contains the actual error (error with the exact solution), here $|1 - p_i|$.
Then create an array with the iteration numbers 1 to 22:
i=1:1:22
Then make a plot with marker x:
plot(i, e, 'x')
This plots the errors, but it is still difficult to see what happens (does the error continue
to decrease or not).
To see more clearly the behavior at small errors, we can use a logarithmic scale for the
"y-axis". Instead of plot, now use
semilogy(i, e, 'x')
See Fig. 2.7. It is clear from the logarithmic plot that the error continues to decrease.
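The steps above can be put together in one short script. This is a minimal sketch, not the bisection function itself: it uses an anonymous function for f, a for-loop with break, and hypothetical variable names such as n_done, and it applies the same two-part stopping test as the bisection function (small residual and small |p − a|).

```matlab
% Bisection for f(d) = d^3 - 3d^2 + 2 on [0.1, 2], storing the error |1 - p_i|
f   = @(x) x.^3 - 3*x.^2 + 2;   % floating sphere equation
a   = 0.1;  b = 2;              % initial interval
N   = 100;  tol = 1e-6;         % maximum iterations and tolerance
e   = zeros(1, N);              % preallocated array for the errors
signfa = sign(f(a));            % sign of f at the left point, set once before the loop

n_done = 0;
for k = 1:N
    p  = (a + b) / 2;
    fp = f(p);
    e(k) = abs(1 - p);          % actual error; the exact root is d = 1
    n_done = k;
    if abs(fp) < tol && abs(p - a) < tol
        break
    end
    if signfa * sign(fp) < 0
        b = p;                  % root lies in (a, p)
    else
        a = p;                  % root lies in (p, b)
    end
end
e = e(1:n_done);                % drop the unused part of the array

i = 1:n_done;
semilogy(i, e, 'x')
xlabel('iteration i'), ylabel('error |1-p_i|')
```

Preallocating e with zeros avoids growing the array inside the loop, in line with the memory remarks in Sec. 2.6.3.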

[Figure 2.7: two panels, both titled "Convergence for bisection method", showing the error
$|1 - p_i|$ against the iteration i: (a) linear y-axis, (b) logarithmic y-axis]
Figure 2.7: Actual error as a function of the iteration number for the bisection method
($d^3 - 3d^2 + 2 = 0$, a = 0.1, b = 2) using unscaled (a) and logarithmic y-axis (b).

During the 22 iterations, the approximation $p_i$ approaches the solution d = 1 in a
quite irregular manner. The value of $p_1$ after the first iteration, for example, is much
closer to the actual solution than the value of $p_2$. This suggests that it should be possible
to develop faster methods.

2.7 Fixed-point iterations (Picard iteration)
2.7.1 Mathematical background and method
A number p is a fixed point for a given function g if g(p) = p.
Relation between roots and fixed points:
Finding solutions of the root problem f (p) = 0 is equivalent to finding the fixed points
of a corresponding fixed point problem. There is not one single way to transform a root
problem into a fixed point problem.
For example, g(x) = x − f(x) and g(x) = x + f(x)/3 both define fixed point problems
corresponding to the root problem f(x) = 0. For both cases, if p is a root of f, then
f(p) = 0 and thus g(p) = p, i.e. p is a fixed point of g.
Conversely, if p is a fixed point of g, then g(p) = p and thus f(p) = 0, i.e. p is a root of f.
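How does a fixed-point iteration work?
Fixed points are the intersection points of the line y = x and the curve y = g(x). A
fixed-point iteration consists of two steps:

1. Choose an initial point $p_0$

2. Take $p_n = g(p_{n-1})$ for n ≥ 1

[Figure: the line y = x and the curve g(x), with the iterates $p_0$, $p_1$, $p_2$, $p_3$ approaching
the fixed point p]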

## When is a fixed point iteration guaranteed to converge?

The sequence of a fixed point iteration is convergent on the interval [a, b] if the interval
contains a fixed point, if g is continuous on [a, b], and if

$$|g'(x)| < 1$$

for all x in [a, b] (and the initial guess is in [a, b]).
When does a fixed point iteration converge fast?
When the absolute value of the derivative $|g'(x)|$ is small. If you have to find such a g
by trial and error it might be much more work than the bisection method takes to solve
the problem. In the next section, we discuss a fixed-point iteration that converges fast:
Newton's method.

• It does not always converge to a solution.

• If a fixed-point iteration converges, it may (or may not) be faster than the bisection
method.

## 2.7.2 Example: equation for floating sphere

The root problem $f(d) = d^3 - 3d^2 + 2 = 0$ can be written as a fixed-point problem d = g(d)
in several ways, for example:

1. Adding a d on the left and right: $d = d + d^3 - 3d^2 + 2$. Starting from d = 0.99 this does
not converge. However, $d = d + (d^3 - 3d^2 + 2)/2$ converges and $d = d + (d^3 - 3d^2 + 2)/3$
converges in just 4 iterations starting from d = 0.5.

2. Writing $d^2 = (3d^2 - 2)/d$, and taking the square root: $d = \sqrt{3d - 2/d}$. Converges in 30
iterations, starting from d = 0.5. In Matlab, the result has a small imaginary part.
Some other programming languages would give an error when you try to compute
the square root of a negative number.

3. Writing $d^2(3 - d) = 2$, dividing by (3 − d) and taking the square root: $d = \sqrt{2/(3 - d)}$.
Converges in 11 iterations, starting from d = 0.5. See Table 2.2.
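The third variant can be tried with a few lines of Matlab. The sketch below is an illustration only: the loop structure and variable names are assumptions, and the stopping test compares the two latest results with tolerance $10^{-6}$.

```matlab
% Fixed-point iteration d = g(d) with g(d) = sqrt(2/(3 - d)), starting at d = 0.5
g     = @(d) sqrt(2 ./ (3 - d));
p_old = 0.5;        % initial guess d0
N     = 100;        % maximum number of iterations
tol   = 1e-6;       % tolerance on |p_i - p_(i-1)|

converged = false;
for k = 1:N
    p = g(p_old);                    % new approximation
    fprintf('%3d  %.15e  %.15e\n', k, p, abs(1 - p));
    if abs(p - p_old) < tol          % compare the two latest results
        converged = true;
        break
    end
    p_old = p;                       % prepare for next iteration
end
if ~converged
    error('Fixed-point iteration did not converge in %d iterations', N);
end
```

Replacing the definition of g by one of the other variants is enough to experiment with their convergence behavior.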

## 2.7.3 Checking convergence

For a fixed-point iteration, we do not have two points $a_i$ and $b_i$ surrounding p that get
closer and closer to the fixed point p from the left and right. If we want to use a
similar criterion, the best we can do for the fixed-point iteration is compare the two
latest results. This is much more tricky: the difference $|p_i - p_{i-1}|$ may be small just
because the true solution p is approached slowly, not because $|p - p_i|$ is small. For the
floating object equation (Table 2.2) the stopping criterion used, $|p_i - p_{i-1}| < \epsilon = 10^{-6}$,
worked fine. When this was fulfilled the actual error (difference with the exact solution)
$|1 - p_i| \approx 10^{-7}$ was small as well.

i   p_i                     |1 − p_i|
1 8.944271909999159e-01 1.055728090000841e-01
2 9.746077623781704e-01 2.539223762182963e-02
3 9.937117548732618e-01 6.288245126738201e-03
4 9.984316360970824e-01 1.568363902917591e-03
5 9.996081394766783e-01 3.918605233217409e-04
6 9.999020492625699e-01 9.795073743013027e-05
7 9.999755132150757e-01 2.448678492428247e-05
8 9.999938783599811e-01 6.121640018896812e-06
9 9.999984695935085e-01 1.530406491534464e-06
10 9.999996173985968e-01 3.826014032259906e-07
11 9.999999043496630e-01 9.565033698422098e-08

Table 2.2: Iteration number i, approximations $p_i$, and absolute difference with the exact
solution $|1 - p_i|$; fixed point iteration with $d = \sqrt{2/(3 - d)}$ using $d_0 = 1/2$, tolerance
ε = $10^{-6}$, and maximum number of iterations N = 100.

The stopping criterion is not always sufficient. Consider a function g whose derivative
$g'$ is almost equal to 1. Take as example p = g(p) with
$g(p) = p + 10^{-4} \times (p^3 - 3p^2 + 2)$ to determine the root of the floating object equation
$p^3 - 3p^2 + 2 = 0$. The derivative is $g'(p) = 1 + 10^{-4} \times (3p^2 - 6p)$ which is just below 1
in (0, 1]. Again we take $p_0 = 0.5$ and ε = $10^{-6}$. The actual error at each iteration is
displayed in Fig. 2.9 on a semilogarithmic scale.
[Figure 2.9: plot titled "Convergence for fixed-point iteration", showing the error $|1 - p_i|$
on a logarithmic axis from $10^{-3}$ to $10^0$ against the iteration i from 0 to 18000]
Figure 2.9: Actual error as a function of iteration number for the fixed-point iteration
$p = p + 10^{-4} \times (p^3 - 3p^2 + 2)$, $p_0 = 0.5$.

The difference between the root and the approximation when the stopping criterion
was met, $|1 - p_{16846}|$ = 3.332001912442095e−03, is much larger than $10^{-6}$. Another
stopping criterion that could be used is checking whether the residual r is small. The
residual measures how well the original equation is satisfied, here how well f(p) = 0 is
satisfied. For the floating body equation we need to check whether $r_i = p_i^3 - 3p_i^2 + 2$ is
small. At the final iteration we have $r_{16846} = p_{16846}^3 - 3p_{16846}^2 + 2$ = 9.995968744652473e−03
which is quite large compared to the tolerance $10^{-6}$.
To obtain the required accuracy, it often helps to check the residual as well. If we use
as additional stopping criterion $|r_i| < 10^{-6}$, both stopping criteria are met after 47542
iterations. Then $p_{47542}$ = 9.999996667462827e−01 and $|1 - p_{47542}|$ = 3.332537172884287e−07
which agrees well with the accuracy we wanted ($10^{-6}$).
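The effect of the extra residual test can be reproduced with the sketch below. It is an illustration only; the variable names and the iteration limit of 100000 are assumptions, and the two loops differ only in their stopping test.

```matlab
% Slowly converging fixed-point iteration p = g(p), g(p) = p + 1e-4*f(p)
f   = @(p) p.^3 - 3*p.^2 + 2;    % floating object equation, root at p = 1
g   = @(p) p + 1e-4 * f(p);
tol = 1e-6;
N   = 100000;                    % assumed iteration limit, large enough here

% (1) stop on the difference of the two latest results only
p_old = 0.5;
for k_diff = 1:N
    p_diff = g(p_old);
    if abs(p_diff - p_old) < tol
        break
    end
    p_old = p_diff;
end

% (2) additionally require a small residual |f(p)|
p_old = 0.5;
for k_both = 1:N
    p_both = g(p_old);
    if abs(p_both - p_old) < tol && abs(f(p_both)) < tol
        break
    end
    p_old = p_both;
end

fprintf('difference only : %6d iterations, error %g\n', k_diff, abs(1 - p_diff));
fprintf('with residual   : %6d iterations, error %g\n', k_both, abs(1 - p_both));
```

The first loop stops while the actual error is still far above the tolerance; the second loop runs much longer but ends with an error below the tolerance.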

2.8 Newton’s method
Newton’s method for solving f (x) = 0 is a special choice of a fixed-point iteration. It is
also called the Newton–Raphson method.

## 2.8.1 Mathematical background and method

There are several ways to derive Newton's method. One is using a Taylor polynomial.
The 1st order Taylor polynomial for f(x) around $x_0$, together with its remainder term, gives

$$f(x) = f(x_0) + (x - x_0) f'(x_0) + O((x - x_0)^2)$$

assuming $f''$ exists and f and $f'$ are continuous on the interval of interest.
At the root, x = p (f(p) = 0):

$$0 = f(x_0) + (p - x_0) f'(x_0) + O((p - x_0)^2)$$

Now assume that we are "close" to the root, then the $O((p - x_0)^2)$ term is small compared
to the linear term:
$$0 \simeq f(x_0) + (p - x_0) f'(x_0),$$
or after rearranging:
$$p \simeq x_0 - \frac{f(x_0)}{f'(x_0)}.$$
This is used in Newton's method to find a new approximation $p_n$.
How does Newton’s method work?

1. Choose an initial point $p_0$

2. Take for n ≥ 1
$$p_n = p_{n-1} - \frac{f(p_{n-1})}{f'(p_{n-1})}$$

Note that Newton's method is of the form $p_n = g(p_{n-1})$ with
$g(p_{n-1}) = p_{n-1} - f(p_{n-1})/f'(p_{n-1})$, i.e. Newton's method is a fixed-point iteration.
Remarks

• There is a problem when $f'(p_{n-1}) = 0$, so Newton's method might not converge if
$f'$ is zero close to the root.

• In the derivation it was assumed that the remainder term, which contains a term
$(p - x_0)^2$, was small compared to the linear term in $p - x_0$. This is not true if the
initial guess is not close enough to the root, and the method might therefore not
converge if the starting point is "too far" from the root.

• Graphical interpretation: Newton's method finds at every iteration an approximation
to the root by walking a distance $(p_i - p_{i-1})$ along the tangent line at
$(p_{i-1}, f(p_{i-1}))$ until the tangent line crosses the x-axis (y = 0). Thus for linear
functions f, you will arrive at the root in 1 iteration.

## When is Newton's method guaranteed to converge?

Assume $f \in C^2$ near the root (f, its derivative and second derivative exist and are
continuous near the root). If $f'(p) \ne 0$ then Newton's method will converge to p if the
initial guess $p_0$ is close enough to the true root p. Unfortunately, we usually don't know
how close to p the initial guess should be. For hard problems this might be very close,
say $|p - p_0| < 10^{-3}$ or $10^{-4}$.

• Fast (quadratic) convergence when you get close enough to the root.

• The method does not always converge to a solution: zero derivative, initial guess not
sufficiently close. (If you don’t have a good enough initial approximation, you could
first do, for example, some bisection iterations and then start Newton’s method)

• You only find 1 root, which depends on the initial point $p_0$.

• You need to calculate a derivative (for complicated functions, you can use a symbolic
calculation)

## 2.8.2 Example: equation for floating sphere

Note that we cannot start from $p_0 = 0$ or $p_0 = 2$ since $f'$ is zero in those points. Starting
from $p_0 = 1/2$ converges in only 4 iterations (see Table 2.3).

i   p_i                     |1 − p_i|
1 1.111111111111111e+00 1.111111111111112e-01
2 9.990740740740740e-01 9.259259259259967e-04
3 1.000000000529222e+00 5.292219995567393e-10
4 1.000000000000000e+00 0.000000000000000e+00

Table 2.3: Iteration number $i$, approximations $p_i$, and absolute difference with the exact
solution $|1 - p_i|$; Newton's method using $p_0 = 1/2$, tolerance $\epsilon = 10^{-6}$, and maximum
number of iterations $N = 100$.
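
The iteration behind Table 2.3 can be sketched as follows. This is a minimal sketch, not necessarily the program used for the table: it assumes the floating sphere equation $f(p) = p^3 - 3p^2 + 2$, a hand-computed derivative, and an absolute stopping criterion on $|p_i - p_{i-1}|$ (the variable names are illustrative).

```matlab
% Newton's method for f(p) = p^3 - 3p^2 + 2 (floating sphere equation)
f  = @(p) p.^3 - 3*p.^2 + 2;   % function whose root we seek
df = @(p) 3*p.^2 - 6*p;        % derivative, computed by hand

p    = 1/2;     % initial guess p0
tol  = 1e-6;    % tolerance on |p_i - p_(i-1)| (assumed stopping criterion)
Nmax = 100;     % maximum number of iterations

for i = 1:Nmax
    pnew = p - f(p)/df(p);              % Newton update
    fprintf('%2d  %.15e  %.15e\n', i, pnew, abs(1 - pnew));
    converged = abs(pnew - p) < tol;    % compare two successive iterates
    p = pnew;
    if converged, break; end
end
```

The loop fails with a division by zero if `df(p)` vanishes, which is why $p_0 = 0$ and $p_0 = 2$ cannot be used as starting points.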

## 2.9 Rate of convergence

## 2.9.1 Definition
A sequence $\{p_n\}$ converges to $p$ with order $\alpha$ if positive constants $\alpha$ and $\gamma$ exist so that
$$\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|^{\alpha}} = \gamma.$$

An iterative technique of the form $p_n = g(p_{n-1})$ is said to converge with order $\alpha$ if the
corresponding sequence $\{p_n\}$ converges with order $\alpha$.
$\alpha$ can be any positive number, but there are two important cases:
If $\alpha = 1$, the sequence and iterative technique are linearly convergent.
If $\alpha = 2$, the sequence and iterative technique are quadratically convergent.

## What does linear/quadratic convergence mean exactly?

This is easiest to see when we look at the actual error, the difference between the true solution
$p$ and the approximation at the $n$th iteration: $e_n = |p_n - p|$.
Linearly convergent: since $\alpha = 1$, we have $e_{n+1} \approx \gamma e_n$ for large $n$. This means that if $e_n$ is the
current error at iteration $n$, the error at the next iteration level, $n + 1$, is $\gamma$ times the
previous error ($e_{n+1} = \gamma e_n = \gamma^2 e_{n-1}$ etc).
Example 1: if $\gamma = 0.1$ then the error is a factor 10 smaller every iteration and you have
approximately 1 more correct digit every iteration.
Example 2: how many iterations do we need to do for 1 more accurate digit if $\gamma = 0.9$?
We need to find $k$ so that the error after $k$ iterations is reduced by a factor 10:
$e_{n+k} = 0.1 e_n$. Since we have linear convergence: $e_{n+k} = \gamma^k e_n = 0.1 e_n$ or $\gamma^k = 0.1$. This gives for
$\gamma = 0.9$: $k = \log(0.1)/\log(0.9) \approx 22$ iterations.
Example 3: Similarly, if $\gamma = 0.99$ you need to do $\log(0.1)/\log(0.99) \approx 229$ iterations
for 1 more correct digit. Values of $\gamma$ close to 1 are unfortunately not unusual for linearly
convergent techniques.
Quadratically convergent: since $\alpha = 2$, we have $e_{n+1} \approx \gamma e_n^2$. This means that if $e_n$ is
the error at iteration $n$, the error at the next iteration level, $e_{n+1}$, is $\gamma e_n$ times the current
error $e_n$.
Example: Suppose we have $e_n = 0.1$ and $\gamma$ a little smaller than 1. A linearly convergent scheme
would converge very slowly for such a $\gamma$. For a quadratically convergent method you would
have $e_{n+1} \approx 10^{-2}$, $e_{n+2} \approx 10^{-4}$, $e_{n+3} \approx 10^{-8}$, $e_{n+4} \approx 10^{-16}$. Once you are reasonably
close to the solution convergence is very fast: in just a few iterations a very accurate
approximation is reached. Roughly, the number of correct digits doubles every step for a
quadratically convergent method (when you are not too far from the solution $p$).
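
The arithmetic in the examples above can be reproduced with a few lines of Matlab. This is only an illustrative sketch: the starting error 0.1 and the value $\gamma = 1$ used for the quadratic case are assumptions chosen to match the rounded numbers in the text.

```matlab
% Linear convergence: iterations needed to gain one decimal digit
gammas = [0.1 0.9 0.99];
k = log(0.1) ./ log(gammas);      % solve gamma^k = 0.1 for k
for j = 1:numel(gammas)
    fprintf('gamma = %.2f: about %.1f iterations per digit\n', gammas(j), k(j));
end

% Quadratic convergence: e_(n+1) = gamma * e_n^2
e = 0.1;          % assumed current error
gamma_q = 1;      % assumed constant (the text takes gamma a little below 1)
for n = 1:4
    e = gamma_q * e^2;
    fprintf('e_(n+%d) = %.1e\n', n, e);
end
```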
To find the order $\alpha$, we can take the log on both sides in the definition:
$$\log |p_{n+1} - p| = \log \gamma + \alpha \log |p_n - p|,$$
or
$$\log e_{n+1} = \log \gamma + \alpha \log e_n.$$

Thus if we make an xy-plot of $y = \log e_{n+1}$ vs. $x = \log e_n$, $\alpha$ corresponds to the slope.
Once $\alpha$ is known, $\gamma$ can be determined easily from the above equation.

## 2.9.2 Example: equation for floating sphere

Consider Table 2.4 for Newton's method, which shows the errors with respect to the root
$p = 1 - \sqrt{3}$ when starting from $p_0 = -20$ with a tolerance of $10^{-13}$.

| i | $e_i = \lvert p - p_i \rvert$ |
|---|---|
| 1 | 1.229976737424930e+01 |
| 2 | 7.670248252043633e+00 |
| 3 | 4.607864445396778e+00 |
| 4 | 2.602396109292012e+00 |
| 5 | 1.320034026906669e+00 |
| 6 | 5.473710251344159e-01 |
| 7 | 1.497420289157650e-01 |
| 8 | 1.616422346263446e-02 |
| 9 | 2.214556785341548e-04 |
| 10 | 4.245948570513747e-08 |
| 11 | 1.665334536937735e-15 |
| 12 | 1.110223024625157e-16 |

Table 2.4: Iteration number $i$ and absolute difference with the exact solution $|p - p_i|$ with
$p = 1 - \sqrt{3}$; Newton's method using $p_0 = -20$ and tolerance $\epsilon = 10^{-13}$.

We see that for the first 7 iterations, when the approximation is not yet close to
$p = 1 - \sqrt{3}$, Newton's method converges slowly. For iterations 8 to 11, when the
approximation is close to $p = 1 - \sqrt{3}$, convergence is fast. The
number of accurate digits approximately doubles (this suggests quadratic convergence).
The fast convergence stops at iteration 12 since we have reached the maximum precision
of 15-16 digits for double precision calculations (machine accuracy) so that the error can't
decrease any further.
The order of convergence ($\alpha$ and $\gamma$) close to the true solution can be determined as
follows. In an array x we put the logarithm of the error at the "old level" i: x(i) = log e(i)
for i = 1, ..., 11. In array y we put the logarithm of the error at the "new level" i + 1:
y(i) = log e(i+1) for i = 1, ..., 11. Now we make a plot: plot(x,y).
The result is in Fig. 2.10.
To find $\alpha$ and $\gamma$ we should not consider $e_n$'s that are not yet close to the solution due
to a poor initial guess. Of course, we cannot let $n \to \infty$ as in the definition; the best we
can do is look at the convergence behavior for sufficiently large values of $n$, i.e. close to the
true solution. However, we should also not consider $e_n$'s that are too close to the solution:
these $e_n$'s are affected by the finite precision in the calculation (for double precision you
can never get more than 16 accurate digits). From Fig. 2.10 it is clear that we should
only use $e_7$ to $e_{11}$ in Table 2.4 and that the slope of the line segment is approximately 2,
i.e. $\alpha \approx 2$.

[Figure 2.10: plot of $\log(e_{i+1})$ versus $\log(e_i)$ with two curves, "Newton" and "fixed-point iteration".]

Figure 2.10: Rate of convergence plot for floating sphere problem ($d^3 - 3d^2 + 2 = 0$);
Newton's method using $p_0 = -20$, tolerance $\epsilon = 10^{-13}$. Rate of convergence of a typical
fixed-point iteration is shown for comparison.

The value of $\alpha$ can be approximated more precisely by using the numerical
values for the $e$'s, for example
$$\alpha \approx \frac{\Delta y}{\Delta x} = \frac{\log e_9 - \log e_{10}}{\log e_8 - \log e_9} \approx \frac{-3.6547 + 7.3720}{-1.7914 + 3.6547} \approx 1.995.$$
Thus Newton's method converges quadratically ($\alpha \approx 2$) close to the solution. Once $\alpha$ has
been found, $\gamma$ can be determined by taking the ratio of $e_{n+1}$ and $e_n^{\alpha}$. For example,
$$\gamma \approx \frac{e_9}{e_8^{1.995}} \approx 0.83.$$

Theoretically, if we solve $f(x) = 0$ with Newton's method, this should be
$$\gamma \approx \frac{|f''(p)|}{2|f'(p)|}.$$

For the problem we consider this gives $\gamma \approx \sqrt{3}/2 \approx 0.866$, which is indeed close to what
we find numerically.
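
The estimates of $\alpha$ and $\gamma$ can be reproduced directly from the errors in Table 2.4. The sketch below assumes base-10 logarithms (which match the numbers quoted in the text) and uses illustrative variable names; the plot command is left as a comment so the script also runs without a display.

```matlab
% Errors e(1)..e(12) copied from Table 2.4
e = [1.229976737424930e+01 7.670248252043633e+00 4.607864445396778e+00 ...
     2.602396109292012e+00 1.320034026906669e+00 5.473710251344159e-01 ...
     1.497420289157650e-01 1.616422346263446e-02 2.214556785341548e-04 ...
     4.245948570513747e-08 1.665334536937735e-15 1.110223024625157e-16];

x = log10(e(1:end-1));    % x(i) = log e(i),   "old level"
y = log10(e(2:end));      % y(i) = log e(i+1), "new level"
% plot(x, y, 'o-')        % uncomment to draw the convergence plot

% Slope between the points built from e8, e9 and e10
alpha = (y(8) - y(9)) / (x(8) - x(9));
gamma = e(9) / e(8)^alpha;
gamma_theory = sqrt(3)/2;   % |f''(p)| / (2|f'(p)|) at p = 1 - sqrt(3)

fprintf('alpha = %.3f, gamma = %.2f, theoretical gamma = %.3f\n', ...
        alpha, gamma, gamma_theory);
```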
In general a fixed-point iteration converges linearly, $\alpha \approx 1$. This can also be observed
from Fig. 2.10 where the slope is approximately 1, meaning $\alpha \approx 1$. As fixed-point iteration
we used $g(p) = p - (p^3 - 3p^2 + 2)/10$ and an initial guess $p_0 = -2$. For a linearly convergent
fixed-point iteration we should find a constant ($\gamma$) if we take the ratio of two consecutive
errors ($e_{n+1}/e_n$). For the fixed-point iteration we consider, we have $\gamma \approx e_{n+1}/e_n \approx 0.400$.
This corresponds to the theoretical value $\gamma \approx |g'(p)|$ for a fixed-point iteration $g(p) = p$.
For the problem we consider $|g'(1 - \sqrt{3})| = 2/5 = 0.4$.
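
A short experiment shows the error ratio settling near 0.4. This is a sketch using the $g$ and $p_0$ quoted above; the number of iterations (25) is an assumption, chosen so that the error is still well above machine accuracy when the last ratio is computed.

```matlab
% Fixed-point iteration p_n = g(p_(n-1)) for the floating sphere equation
g     = @(p) p - (p.^3 - 3*p.^2 + 2)/10;
ptrue = 1 - sqrt(3);           % exact root, used only to measure the error

p     = -2;                    % initial guess p0
e_old = abs(p - ptrue);
for n = 1:25                   % assumed iteration count
    p     = g(p);
    e_new = abs(p - ptrue);
    ratio = e_new / e_old;     % approaches gamma = |g'(ptrue)|
    e_old = e_new;
end
fprintf('last ratio e_(n+1)/e_n = %.4f\n', ratio);
```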
Newton’s method for the root d = 1 is a special case. See Table 2.5. Convergence is
faster than quadratic: better than doubling of the number of accurate digits. What is

| i | $e_i = \lvert p - p_i \rvert$ |
|---|---|
| 1 | 1.111111111111112e-01 |
| 2 | 9.259259259259967e-04 |
| 3 | 5.292219995567393e-10 |
| 4 | 0.000000000000000e+00 |

Table 2.5: Iteration number $i$ and absolute difference with the exact solution $|p - p_i|$ with
$p = 1$; Newton's method using $p_0 = 1/2$ and tolerance $\epsilon = 10^{-6}$.

## How to use convergence to check a program

To see whether a numerical technique is correctly implemented, it is always good to check
numerical results with an analytical solution. For harder problems, however, no analytical
solution can be found. Then convergence behavior can be checked. For Newton's
method you expect $\alpha = 2$ and for fixed-point iterations $\alpha = 1$. If you get less, this is
an indication that something unusual is going on either in the numerics (incorrect $f'$ in
Newton's method, for example) or in the equations you are trying to solve (maybe the
solution is not sufficiently smooth: derivatives might not exist or are not continuous, for
example).
To check the convergence behavior, we need $|p - p_i|$, i.e. we need the exact solution
$p$. This value, however, is typically not available for more complicated problems (that's
why we try to solve them numerically). Then we can take for the value of $p$ a very good
approximation. For example, an accurate value obtained using a built-in Matlab function
or a very accurate value using your own code (preferably machine accuracy, but in any
case a value with much smaller errors than the $e_n$'s you consider).

Chapter 3

equations

• vector norms

• initial guess for vector equations

Numerical methods:

ations)

– Direct vs. iterative methods.

– Gaussian elimination.

## 3.1 Problem description and modeling: predator-prey models

Consider an environment with 2 species: one of them is a predator and the other a prey.
We want to know how the populations of the predator and prey evolve in time. Will one
or both species die out or will they coexist?
In the model we denote by x1 (t) the population of prey at time t and by x2 (t) the
population of predators at time t. The basic model to describe changing quantities in
time is:

"Rate of change = rate of increase − rate of decrease"

Here the rate of change of the prey population is $\dot{x}_1 = dx_1/dt$ and the rate of change of
the predator population is $\dot{x}_2 = dx_2/dt$. The rates of increase and decrease of the prey
and predators need to be modeled. To keep the model simple, we make the following
assumptions:
• Rate of increase of prey (birth rate of prey).
We assume that the birth rate at time t is proportional to the number of prey at
time t. This means the birth rate is modeled by ax1 (t) with a a constant. A more
sophisticated model could include for example the effects of insufficient food supplies
and/or illnesses on the birth rate.
• Rate of decrease of prey (death rate of prey).
We assume that the death rate at time t is proportional to the product of the number
of prey and number of predators at time t. This means the death rate is modeled by
bx1 (t)x2 (t) with b a constant. A more sophisticated model could include for example
the effects of insufficient food supplies and/or illnesses on the death rate.
• Rate of increase of predators (birth rate of predators).
We assume that the birth rate at time t is proportional to the product of the number
of prey and number of predators at time t (i.e. the birth rate depends on the amount
of food available and the number of predators currently alive). This means the birth
rate is modeled by cx1 (t)x2 (t) with c a constant. A more sophisticated model could
include for example the effects of illnesses and/or age of the predators on the birth
rate.
• Rate of decrease of predators (death rate of predators).
We assume that the death rate at time t is proportional to the number of predators
alive at time t. This means the death rate is modeled by dx2 (t) with d a constant.
A more sophisticated model could include for example the effects of insufficient food
supplies, age, and illnesses on the death rate.
The resulting model is
$$\begin{aligned}
\dot{x}_1 &= a x_1 - b x_1 x_2, \\
\dot{x}_2 &= c x_1 x_2 - d x_2.
\end{aligned}$$

We are interested in whether the predator and/or prey die out or coexist. Thus we
are interested in possible equilibrium solutions (i.e. when the populations do not change
anymore in size, or dx1 /dt = dx2 /dt = 0),

0 = ax1 − bx1 x2 ,
0 = cx1 x2 − dx2 .

As an example, we consider this system of equations with a = d = 2 and b = c = 1,

0 = 2x1 − x1 x2 ,
0 = x1 x2 − 2x2 .

We first discuss some methods to solve systems of equations analytically, before we dis-
cuss some numerical techniques (to see whether numerical methods converge to a correct
solution).

## 3.2 Analytical solutions and solving with Matlab

In this section we discuss three methods to obtain solutions of a 2 × 2 system of equations.
In the next sections we will solve the above problem numerically and see whether numerical
methods converge to one of the two equilibrium solutions found in this section.

## 3.2.1 Analytical solution

The example 2 × 2 system of equations can easily be solved analytically by factoring out
common terms,

2x1 − x1 x2 = x1 (2 − x2 ) = 0,
x1 x2 − 2x2 = x2 (x1 − 2) = 0.

The first equation is satisfied for x1 = 0 or x2 = 2. For x1 = 0 we get from the second
equation x2 = 0. For x2 = 2 we get from the second equation x1 = 2. Thus we have two
equilibrium solutions (0, 0) and (2, 2).
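
As a quick sanity check, the two equilibrium points can be substituted back into the right-hand sides. The sketch below assumes the parameter values of the example; the function handle name `f` is illustrative. From the factored form, the nonzero equilibrium for general parameters is $(d/c, a/b)$.

```matlab
% Right-hand sides of the predator-prey equilibrium equations
a = 2; b = 1; c = 1; d = 2;
f = @(x) [a*x(1) - b*x(1)*x(2); c*x(1)*x(2) - d*x(2)];

r0 = f([0; 0]);        % residual at the trivial equilibrium (0, 0)
r2 = f([d/c; a/b]);    % residual at the nonzero equilibrium (d/c, a/b) = (2, 2)
disp([r0 r2])          % both columns should be zero
```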

## 3.2.2 Plotting

The curves corresponding to the two equations 2x1 − x1 x2 = 0 and x1 x2 − 2x2 = 0 can
be plotted using Matlab’s ezplot
ezplot('2*x1 - x1*x2=0')
hold on
ezplot('x1*x2 - 2*x2=0')
grid on
setcurve2('color','red')
where setcurve2 is a small Matlab script to plot the two curves corresponding to the second
equation in red. In Fig. 3.1 the blue curves correspond to solutions of the first equation
and red lines to solutions of the second equation. The intersections of a blue and red
curve near (0, 0) and (2, 2) correspond to the two equilibrium points. You can zoom in
near these points to obtain more accurate values. Note that n × n systems of equations
need n-dimensional plots so that this method is only useful for 2 × 2 systems.

## 3.2.3 Symbolical calculations

Systems of algebraic equations can be solved symbolically using the built in Matlab func-
tion solve. For the example we consider you would use
[x1, x2] = solve('2*x1 - x1*x2=0', 'x1*x2 - 2*x2=0')
which returns as output
x1 =
0
2
x2 =
0
2
From the first line of solutions for x1 and x2 we get the equilibrium point (0, 0). From
the second line of solutions for x1 and x2 we get the equilibrium point (2, 2).

[Figure 3.1: horizontal axis x1 and vertical axis x2, both from −6 to 6.]

Figure 3.1: Plot of 2x1 − x1 x2 = 0 and x1 x2 − 2x2 = 0 using Matlab's ezplot.

## 3.2.4 Numerical calculations

Systems of algebraic equations can be solved numerically using the built-in Matlab
function fsolve, which solves the system f(x) = 0. For this you first need to create an m-file
that defines the vector function f(x). For the example we consider you could use the
following m-file fun_fsolve.m

function f = fun_fsolve(x)
% Predator prey system: returns the residual vector f(x)
a = 2;
b = 1;
c = 1;
d = 2;
f = zeros(2,1);
f(1) = a*x(1) - b*x(1)*x(2);
f(2) = c*x(1)*x(2) - d*x(2);

Then you need to select an initial vector $p^{(0)}$, say ⟨3, 3⟩. To solve the system
numerically using fsolve, use

p0 = [3; 3]
p = fsolve('fun_fsolve', p0)
which gives
p =
2.000000141789899e+00
2.000000141789899e+00
This uses a default tolerance of $10^{-6}$. To increase the accuracy, you need to change the
option TolFun. To use a tolerance of $10^{-10}$ use
options = optimset('TolFun', 1e-10)
p0 = [3; 3]
p = fsolve('fun_fsolve', p0, options)
which gives the more accurate solution
p =
2.000000000000007e+00
2.000000000000007e+00

## 3.3 Newton's method for systems of equations

Newton’s method for systems can be used to solve a system of m nonlinear equations
with m unknowns
$$\mathbf{f}(\mathbf{x}) = \mathbf{0}$$
or in component form
$$\begin{aligned}
f_1(x_1, x_2, \ldots, x_m) &= 0, \\
f_2(x_1, x_2, \ldots, x_m) &= 0, \\
&\;\;\vdots \\
f_m(x_1, x_2, \ldots, x_m) &= 0.
\end{aligned}$$

## 3.3.1 Mathematical background and method

Similarly to Newton’s method for functions of one variable, Newton’s method for systems
can be derived using a 1st order Taylor expansion for functions of several variables.
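Using a first order Taylor expansion around $\mathbf{x}_0$ gives at the root $\mathbf{x} = \mathbf{p}$
$$\mathbf{0} = \mathbf{f}(\mathbf{p}) \approx \mathbf{f}(\mathbf{x}_0) + J(\mathbf{x}_0)\,(\mathbf{p} - \mathbf{x}_0),$$
where $J$ is the Jacobian matrix (matrix with first partial derivatives)
$$J(\mathbf{x}) = \begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_m} \\
\dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_m} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \dfrac{\partial f_m}{\partial x_2} & \cdots & \dfrac{\partial f_m}{\partial x_m}
\end{pmatrix}.$$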

This is used in Newton's method to find a new approximation $\mathbf{p}^{(k)}$.

Algorithm of Newton's method

4. Stop if convergence criteria are met, otherwise perform another iteration.
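
One common way to organize the iteration is sketched below for the predator-prey system of Sec. 3.1. It is a sketch under stated assumptions rather than the exact algorithm of these notes: the Jacobian is entered by hand, the linear system $J \mathbf{y} = -\mathbf{f}$ is solved with the backslash operator instead of forming $J^{-1}$, and the stopping criterion is the $l_2$ norm of the update. All variable names are illustrative.

```matlab
% Newton's method for the 2 x 2 predator-prey equilibrium equations
a = 2; b = 1; c = 1; d = 2;
f = @(x) [a*x(1) - b*x(1)*x(2); c*x(1)*x(2) - d*x(2)];     % residual vector
J = @(x) [a - b*x(2), -b*x(1); c*x(2), c*x(1) - d];        % Jacobian, by hand

p    = [3; 3];     % initial vector p^(0)
tol  = 1e-13;      % tolerance on the l2 norm of the update (assumed criterion)
Nmax = 50;         % maximum number of iterations (assumed)

for k = 1:Nmax
    y = J(p) \ (-f(p));     % solve J(p) y = -f(p) for the update y
    p = p + y;              % new approximation p^(k)
    fprintf('%2d  (%.15f, %.15f)\n', k, p(1), p(2));
    if norm(y, 2) < tol, break; end
end
```

Solving the linear system each iteration avoids computing the inverse explicitly; the iteration still breaks down when $J$ is singular at an iterate.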

Remarks
• There is a problem when $J(\mathbf{p}^{(n-1)})$ is singular (so that $J^{-1}$ does not exist), and Newton's
method might not converge if $J$ is (nearly) singular close to the solution vector.

• Like in Newton’s method for scalars, in the derivation of Newton’s method for
systems it was assumed that higher order terms are small compared to the linear
terms. This is not true if the initial guess is not close enough to the solution vector,
and the method might not converge then.

• Provided certain conditions on the partial derivatives of the functions fi are satisfied,
also Newton’s method for systems converges quadratically when the initial vector
p(0) is close enough to the true solution vector p. For hard problems you might need
to be very close to the true solution vector before Newton’s method will converge.

• Fast (quadratic) convergence close to the solution vector.

• It does not always converge to a solution (singular Jacobian, initial guess not suf-
ficiently close).

• You only find 1 solution vector, which depends on the initial vector p(0) .

• Need to calculate m × m derivatives for an m × m system. (Even using symbolic
calculations, this might not be trivial when m is large.)

## 3.3.2 Stopping criteria and vector norms

An accurate numerical solution to an m × m system of equations is obtained when all
components involved are approximated accurately. Instead of checking whether each
component satisfies the stopping criteria, vector norms are used to check whether the
numerical approximation is good enough. The notation for a norm is $\|\cdot\|$ with a possible
subscript to indicate the type of norm. Two frequently used norms are the $l_2$ norm and
$l_\infty$ norm. The $l_2$ or Euclidean norm is defined as
$$\|\mathbf{y}\|_2 = \sqrt{y_1^2 + y_2^2 + \ldots + y_m^2},$$
and the $l_\infty$ or maximum norm as
$$\|\mathbf{y}\|_\infty = \max_{1 \le i \le m} |y_i|.$$

As stopping criteria we can use the norm of the difference between two successive solution
vectors, $\|\mathbf{p}^{(k)} - \mathbf{p}^{(k-1)}\|$, and the norm of the residual vector $\|\mathbf{r}\| = \|\mathbf{f}(\mathbf{p}^{(k)})\|$, either using the $l_2$ or
$l_\infty$ norm. Note that if the $l_2$ or $l_\infty$ norm of a vector is small, then every component of
the vector must be small. As for scalar equations, relative stopping criteria can be used
to increase robustness.
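
Both norms are available through Matlab's built-in `norm` function. The sketch below evaluates the definitions directly for an arbitrary illustrative vector and compares them with the built-in results.

```matlab
% l2 and l-infinity norms of an example vector (values chosen for illustration)
y = [3; -4; 1];

n2_def   = sqrt(sum(y.^2));    % definition of the l2 (Euclidean) norm
ninf_def = max(abs(y));        % definition of the l-infinity (maximum) norm

n2   = norm(y, 2);             % built-in l2 norm
ninf = norm(y, Inf);           % built-in l-infinity norm

fprintf('l2: %.6f (built-in %.6f), linf: %g (built-in %g)\n', ...
        n2_def, n2, ninf_def, ninf);
```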

## 3.3.3 Example: predator-prey equations

We solve the algebraic predator-prey system of equations discussed in Sec. 3.1 using
Newton's method. As initial vector we take $\mathbf{p}^{(0)} = \langle 3, 3 \rangle$ and as tolerance $10^{-13}$. We
see from Table 3.1 that Newton's method converges to the analytical solution
$\mathbf{p} = \langle p_1, p_2 \rangle = \langle 2, 2 \rangle$ in only 6 iterations.

| k | $\mathbf{p}^{(k)}$ | $\lVert \mathbf{p} - \mathbf{p}^{(k)} \rVert_2$ | $\lVert \mathbf{p} - \mathbf{p}^{(k)} \rVert_\infty$ |
|---|---|---|---|
| 1 | (2.250000000000000, 2.250000000000000) | 3.5355339e-01 | 2.5000000e-01 |
| 2 | (2.025000000000000, 2.025000000000000) | 3.5355339e-02 | 2.4999999e-02 |
| 3 | (2.000304878048780, 2.000304878048780) | 4.3116267e-04 | 3.0487804e-04 |
| 4 | (2.000000046461147, 2.000000046461147) | 6.5705984e-08 | 4.6461147e-08 |
| 5 | (2.000000000000001, 2.000000000000001) | 1.8841109e-15 | 1.3322676e-15 |
| 6 | (2.000000000000000, 2.000000000000000) | 0.0000000e+00 | 0.0000000e+00 |

Table 3.1: Iteration number $k$, approximations $\mathbf{p}^{(k)}$, and $l_2$ and $l_\infty$ error norms; Newton's
method using $\mathbf{p}^{(0)} = \langle 3, 3 \rangle$ and tolerance $\epsilon = 10^{-13}$.

The error norms for the considered 2 × 2 example are
$$\begin{aligned}
\|\mathbf{p} - \mathbf{p}^{(k)}\|_2 &= \left( (p_1 - p_1^{(k)})^2 + (p_2 - p_2^{(k)})^2 \right)^{1/2}, \\
\|\mathbf{p} - \mathbf{p}^{(k)}\|_\infty &= \max\left( |p_1 - p_1^{(k)}|, |p_2 - p_2^{(k)}| \right),
\end{aligned}$$
with $p_1 = 2$ and $p_2 = 2$. We note from Table 3.1 that the $l_2$ error is always slightly larger
than the $l_\infty$ error. It is easy to see why this should always be true. Consider a vector
$\mathbf{x}$ with $m$ components, then
$$\|\mathbf{x}\|_\infty = \max_{1 \le i \le m} |x_i| = \max_{1 \le i \le m} \sqrt{x_i^2} = \sqrt{\max_{1 \le i \le m} x_i^2} \le \sqrt{\sum_{i=1}^{m} x_i^2} = \|\mathbf{x}\|_2.$$

The $l_\infty$ norm only considers the maximum $x_i^2$, while for the $l_2$ norm some nonnegative
numbers are added to this value before taking the square root. The convergence behavior
of both norms, however, is very similar. Both give approximately a doubling of the
number of accurate digits each iteration. This indicates quadratic convergence like for
Newton's method for a single equation (see Sec. 2.8). Typically, the $l_2$ and $l_\infty$ norm give
very similar results for small systems of equations. For large m × m systems you need
to be a little more careful when you do finite precision calculations. If all components of the
error vector have the same error, say machine accuracy $e = 10^{-16}$, then the error in the
$l_2$ norm is
$$\|\mathbf{e}\|_2 = \sqrt{\sum_{i=1}^{m} e_i^2} = \sqrt{m e^2} = \sqrt{m}\,|e|.$$

Thus if $m$ is sufficiently large, you need to increase the tolerance for $l_2$ norms accordingly,
i.e. to at least $\sqrt{m}\,10^{-15}$, in order to be able to satisfy stopping criteria.
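
The growth of the $l_2$ norm with the system size can be seen in a small experiment. The sketch assumes an error vector whose components all equal $10^{-16}$; the list of sizes is illustrative.

```matlab
% l2 and l-infinity norms of a vector with m identical components e
e = 1e-16;                      % assumed error in every component
for m = [2 100 1e4 1e6]
    v = e * ones(m, 1);
    fprintf('m = %8d   l2 = %.3e   linf = %.3e\n', m, norm(v, 2), norm(v, Inf));
end
```

The $l_\infty$ norm stays at $10^{-16}$ for every $m$, while the $l_2$ norm grows like $\sqrt{m}$.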

## 3.3.4 Choice of initial vector

Finding a good initial guess for a system of equations is a little more complicated than
for a single equation. First, we cannot generalize the bisection method to systems since
we do not have a point with opposite function value on each side of the root in two or
more dimensional space. We only discuss some simple options.

1. Plotting might give you an initial vector for 2 × 2 systems, but cannot be used for
m × m systems.

2. If nonlinear terms are relatively small so that the solution is dominated by the linear
terms, one could first solve the linear system Ax = b and use the solution vector of
the linear system as initial guess for Newton's method for systems.

3. If nonlinear terms are not small, one could first try to solve an easier problem or
a sequence of easier problems (continuation). For example, for our predator-prey
equations, we could first try to find a solution for the problem with b = c = 0.1.
Then we could use the solution corresponding to b = c = 0.1 as initial vector for the
problem with b = c = 0.5. The solution corresponding to b = c = 0.5 could then be
used as initial vector for the problem you are interested in (b = c = 1). The harder
the problem, the more intermediate solutions you might need to generate to find an
appropriate initial vector for the problem you are really interested in.

4. Often Newton’s method for systems is part of a larger calculation which provides an
initial guess in a natural way. For example, for partial differential equations (PDEs)
involving a dependence on time and space, you typically generate a solution at the
next time level n + 1 for every coordinate position from the solution at the current
time level n. Since time steps ∆t = tn+1 − tn should be taken small for reasons
of accuracy, the solution at time level n is usually a good approximation to start
Newton’s method to obtain the solution at tn+1 . We discuss this further when we
discuss PDEs.
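
The continuation idea in option 3 can be sketched as follows for the predator-prey equations. This is a minimal sketch under stated assumptions: the starting vector (25, 25) for the easiest problem is an assumption (chosen near the nonzero equilibrium $(d/c, a/b) = (20, 20)$ for $b = c = 0.1$), the Jacobian is entered by hand, and the inner loop is a plain Newton iteration with an assumed tolerance.

```matlab
% Continuation in b = c: each solution is the initial vector for the next problem
a = 2; d = 2;
p = [25; 25];                    % assumed initial vector for b = c = 0.1

for bc = [0.1 0.5 1]
    b = bc; c = bc;
    f = @(x) [a*x(1) - b*x(1)*x(2); c*x(1)*x(2) - d*x(2)];
    J = @(x) [a - b*x(2), -b*x(1); c*x(2), c*x(1) - d];
    for k = 1:50                 % Newton iterations for the current b, c
        y = J(p) \ (-f(p));
        p = p + y;
        if norm(y, Inf) < 1e-12, break; end
    end
    fprintf('b = c = %.1f: p = (%.6f, %.6f) after %d iterations\n', ...
            bc, p(1), p(2), k);
end
```

For harder problems more intermediate parameter values would be inserted between the easy problem and the target problem.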

## 3.4 Solving linear systems for population models

The solution of a linear system Ax = b can be found using a direct or iterative method.
Direct methods are methods that compute the solution x of Ax = b in a fixed number
of operations that can be determined a priori. Provided that the matrix is nonsingular,
a unique solution will be found. (At least when calculations can be done exactly. When
finite precision arithmetic is used, nearly singular matrices will also give problems.)
Iterative methods for Ax = b are methods that give an approximation of the solution vector x
by generating a sequence of vectors $x^{(k)}$ that (we hope) converge to the true solution. The
number of operations cannot be determined a priori and convergence is not guaranteed.
As for all iterative methods, an initial guess to the solution needs to be provided.
Direct methods:
• The solution is subject to round-off errors only.
• The number of operations might be very large for large systems of equations, so
that it takes a long time to solve a linear system.
• For large systems, you need a lot of memory on your computer to store the matrix
A (your computer may run out of memory just for storage of a large matrix).
Iterative methods:
• Relatively few operations per iteration and thus fast for a single iteration.
• Additional errors since an iterative method only gives an approximation to the
solution.
• Cheap in memory. The matrix A need not be stored, only vectors that result
from matrix-vector products like $Ax^{(k)}$.
• Convergence might be slow (resulting in a large number of iterations and large
computing time) or even impossible.
Of course, one may try to improve on basic direct and iterative methods to alleviate the
typical shortcomings. This is outside the scope of Math 4414. We only discuss methods
that are most useful for the problems we consider. For population models, the size of the
system of equations is typically relatively small and solving Ax = b using a direct method
can still be done in an efficient manner. Iterative methods will be discussed in some more
detail when they are more useful: for numerical solutions of differential equations.

## 3.4.1 Solving very small systems

Very small systems, like 2 × 2 and 3 × 3 systems, can be solved as you are used to in
linear algebra: first compute the inverse of $A$ and then $x = A^{-1}b$. We discuss the typical
issues of accuracy, CPU time, and memory for the 2 × 2 predator-prey system we consider
in this chapter.

Accuracy: A direct method is only affected by round-off errors due to finite precision.
Iterative methods have additional errors, or might not produce an accurate solution at
all if they do not converge. Thus for reasons of accuracy a direct method is preferred.
An example of a direct method is computing the inverse $A^{-1}$ of the matrix $A$ and then
performing the matrix-vector multiplication $x = A^{-1}b$. For a 2 × 2 system we have a formula
for the inverse available,
$$A^{-1} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.$$

To compute $A^{-1}$ exactly we need that $\det A = ad - bc \neq 0$. With finite precision
arithmetic, we also expect large errors when $\det A = ad - bc$ is very small. Then we subtract
two nearly equal numbers $ad$ and $bc$ and next we divide by this inaccurate small
number. Other techniques, however, will have similar difficulties when the matrix is nearly
singular (ill-conditioned).
Computing time: This is not a real issue for a 2 × 2 system: only a few operations
(multiplications and additions) are involved and computing $x = A^{-1}b$ will be fast.
Memory usage: For 2 × 2 systems this is not a real issue either. We only need
to store 4 numbers.
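
The 2 × 2 inverse formula translates directly into Matlab. The sketch below uses an illustrative matrix and right-hand side, and the threshold used to flag a (nearly) zero determinant is an assumption, not a recommended value.

```matlab
% Solve a 2 x 2 system with the explicit formula for the inverse
A = [2 1; 1 2];
b = [1; 1];

detA = A(1,1)*A(2,2) - A(1,2)*A(2,1);     % ad - bc
if abs(detA) < 1e-12                      % assumed threshold for "nearly singular"
    error('Matrix is (nearly) singular: determinant = %g', detA);
end

Ainv = [A(2,2), -A(1,2); -A(2,1), A(1,1)] / detA;
x = Ainv * b
```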

## 3.4.2 Solving a little larger systems: Gaussian elimination

Larger systems arise in population models when we include more species, for example 10.
Finding the inverse of an m × m matrix using minors ((m − 1) × (m − 1) submatrices)
involves $O(m!)$ operations. For a relatively small 10 × 10 matrix this would require
approximately 10! ≈ 3.6 million operations to compute the inverse. For such problems it
is advantageous to use other direct methods that require fewer operations for matrices of
this size. Such methods are not only faster, but also produce a result with less round-off
error (fewer operations, so less possibility to accumulate errors). For example, Gaussian
elimination with backward substitution takes $O(m^3)$ operations, i.e. approximately 1
thousand operations for a 10 × 10 system. Since the time it takes to run a program
is proportional to the number of operations, Gaussian elimination for a 10 × 10 matrix
is approximately 3600 times faster than computing the inverse using minors. Gaussian elimination is
analyzed in detail in Math 4445. Here we only focus on some important aspects.
Background
The idea is to transform Ax = b into an upper triangular system Ux = f by subtraction
of rows. To create a zero at matrix entry $a_{21}$, subtract $l_{21} = a_{21}/a_{11}$ times the first row
from the second row in both A and b. Similarly, subtracting $l_{31} = a_{31}/a_{11}$ times the first
row from the third row gives a zero at $a_{31}$, etc. Once the first column is done, you can
create zeros below $a_{22}$ using the second row. Since the first column of the second row
has a zero, the zeros in the first column won't change anymore. This technique can be
continued until an upper-triangular matrix U is obtained. Backward substitution is then
performed starting at the $m$th row, which can be used to find $x_m$. Once $x_m$ is known,
$x_{m-1}$ can be determined from row m − 1 etc.

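We illustrate Gaussian elimination with backward substitution for a simple 2 × 2
system:
$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$
Using Gaussian elimination
$$\left(\begin{array}{cc|c} 2 & 1 & 1 \\ 1 & 2 & 1 \end{array}\right)
\;\xrightarrow{\;E_2 := E_2 - (1/2) \times E_1\;}\;
\left(\begin{array}{cc|c} 2 & 1 & 1 \\ 0 & 3/2 & 1/2 \end{array}\right).$$

Solving with backward substitution gives from the second line $x_2 = (1/2)/(3/2) = 1/3$
and by substituting this in the first line
$$x_1 = \frac{1 - x_2}{2} = \frac{1}{3}.$$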
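
The elimination and backward substitution steps above can be sketched in Matlab as follows. This is a minimal sketch without any pivoting: it assumes every pivot `U(j,j)` is nonzero, and the variable names are illustrative.

```matlab
% Gaussian elimination with backward substitution (no pivoting)
A = [2 1; 1 2];
b = [1; 1];
m = length(b);

U = A; f = b;                        % work on copies: U x = f after elimination
for j = 1:m-1                        % loop over columns (pivot rows)
    for i = j+1:m                    % rows below the pivot
        l = U(i,j) / U(j,j);         % multiplier l_ij (pivot assumed nonzero)
        U(i,:) = U(i,:) - l * U(j,:);
        f(i)   = f(i)   - l * f(j);
    end
end

x = zeros(m, 1);
for i = m:-1:1                       % backward substitution, last row first
    x(i) = (f(i) - U(i,i+1:m) * x(i+1:m)) / U(i,i);
end
disp(x')
```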
The Gaussian elimination described above fails when you want to make zeros below a
diagonal entry aii (pivot) and the value of aii is exactly zero. Then we need to interchange
that row i (including the row of vector b) with a row below row i that has a non-zero
value in column i. This is called pivoting.
Implications of finite precision: pivoting
In finite precision calculations, very small pivots can also create large errors. Consider
as example the system
$$\begin{pmatrix} 0.001 & 1 \\ -1 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 9 \end{pmatrix},$$
which has the solution $x_1 \approx -4.9800797$ and $x_2 \approx 1.0049801$. If we try to solve this with
the Gaussian elimination/backward substitution algorithm in three digit arithmetic with
rounding, we get
$$\left(\begin{array}{cc|c} \text{1.00e-3} & 1.00 & 1.00 \\ -1.00 & 4.00 & 9.00 \end{array}\right)
\;\xrightarrow{\;E_2 := E_2 + \text{1.00e3} \times E_1\;}\;
\left(\begin{array}{cc|c} \text{1.00e-3} & 1.00 & 1.00 \\ 0 & \text{1.00e3} & \text{1.01e3} \end{array}\right).$$
Solving with backward substitution gives $x_2 = 1.01$ and
$$x_1 = \frac{1.00 - 1.00 \times 1.01}{\text{1.00e-3}} = \text{-1.00e1}.$$
We see that $x_1$ is not at all close to the true solution. The reason lies in the backward
substitution: to determine $x_1$ we subtract nearly equal numbers and divide by a small
pivot number. Here, small errors in the numerator are amplified by a factor 1000, because
of the small pivot of 1.00e-3.
If we first interchange rows, we can obtain an accurate solution
$$\left(\begin{array}{cc|c} -1.00 & 4.00 & 9.00 \\ \text{1.00e-3} & 1.00 & 1.00 \end{array}\right)
\;\xrightarrow{\;E_2 := E_2 + \text{1.00e-3} \times E_1\;}\;
\left(\begin{array}{cc|c} -1.00 & 4.00 & 9.00 \\ 0 & 1.00 & 1.01 \end{array}\right),$$
which gives using backward substitution $x_2 = 1.01$ and $x_1 = -4.96$, which has 2 accurate
digits.
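
The same effect can be reproduced in double precision if the pivot is made small enough. The sketch below is an analogous experiment, not the three digit calculation above: the pivot value 1e-20 is an assumption chosen so that the round-off is visible in 16 digit arithmetic, and the elimination without row interchange is written out by hand for the 2 × 2 case.

```matlab
% Small pivot in double precision: elimination without vs. with pivoting
A = [1e-20 1; -1 4];       % assumed tiny pivot in position (1,1)
b = [1; 9];

% Elimination without row interchange, written out for the 2 x 2 case
l   = A(2,1) / A(1,1);            % huge multiplier
u22 = A(2,2) - l * A(1,2);        % the entry 4 is lost in round-off
f2  = b(2)   - l * b(1);          % the entry 9 is lost in round-off
x2  = f2 / u22;
x1  = (b(1) - A(1,2)*x2) / A(1,1);
x_naive = [x1; x2];

% Backslash uses elimination with row interchanges (pivoting)
x_piv = A \ b;

fprintf('no pivoting:   x1 = %g, x2 = %g\n', x_naive(1), x_naive(2));
fprintf('with pivoting: x1 = %g, x2 = %g\n', x_piv(1), x_piv(2));
```

The true solution of this modified system is very close to $x_1 = -5$, $x_2 = 1$; only the version with row interchange recovers $x_1$.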

Thus to increase accuracy in finite precision calculations, we need to perform pivoting
not only when the pivot element is zero, but also when it is "small". The safest pivoting
technique is complete pivoting. Both row and column interchanges are performed to
find the largest pivot element. Of course, the comparisons and the interchanging of rows and
columns make Gaussian elimination more time consuming ($O(m^3)$ more operations
are required).
In 16 digits arithmetic, the problem is less serious, but the problem is not eliminated.
Also keep in mind that matrices are much larger than 2 × 2. In a large system many more
operations need to be performed and errors accumulate. Even with the best pivoting
technique a significant number of digits may be lost when solving Ax = b. The larger the
linear system, the more computations and the more inaccurate digits can be expected.

## 3.4.3 Built-in Matlab functions

A small system of linear equations can be solved in Matlab by computing the inverse, for
example
A = [2 1; 1 2]
b = [1; 1]
x = inv(A)*b
gives the numerical solution
x=
3.333333333333333e-01
3.333333333333333e-01

Larger systems, like 10×10, can be solved more efficiently using Gaussian elimination with
backward substitution. A safe Gaussian elimination and backward substitution technique
with pivoting is implemented in Matlab through the backslash operator \. For the above
example, you would use
A = [2 1; 1 2]
b = [1; 1]
x = A\b
which gives the numerical solution
x=
3.333333333333332e-01
3.333333333333335e-01

Chapter 4

• Discretization in space: grids

• Accuracy

• Order of convergence

Numerical methods:

– Direct vs. iterative methods.

– Direct methods: Crout factorization for tridiagonal matrices.

## 4.1 Problem description and modeling: pollution models

## 4.1.1 Governing equation

We consider a river which is being polluted by some factories along that river. The
pollutant affects the fish population once it reaches a critical concentration level for a
sufficiently long period in time. For this we want to predict the concentration of pollutant
along the river.
The basic model to describe changing quantities is

"Rate of change = rate of increase − rate of decrease"

Now we need to apply the basic model locally to a small volume element ∆V = ∆x∆y∆z
in the river. The pollutant concentration in ∆V increases when the factories add pollutant
to it, and decreases because it decays. In addition, the concentration may change if the
amount of pollutant that is transported into ∆V is different than the amount that is
transported out of ∆V .
The basic model in words to describe changes in concentration in a flowing river then
becomes

"Rate of change = transport + production − decay"

The terms on the right-hand side need to be modeled. There are two types of transport
into and out of a small volume element: convection and diffusion of the pollutant. We
discuss all 4 terms on the right-hand side separately.

• Convection: Convection describes how the pollutant is carried along by the flow.
We assume that the pollutant moves with the same velocity v as the water, i.e.
convection can be modeled as −v · ∇c. If the pollutant is small and light, that is a
good approximation. Larger, heavier objects typically will not follow the water, and
then the model needs modification.

• Diffusion: Diffusion describes how the pollutant spreads over the river in the
absence of flow. Diffusion is described mathematically by the divergence of the mass
flux φ, i.e. −∇ · φ where ∇ is the gradient vector. We assume that the mass flux φ
of the pollutant is proportional to the concentration gradient, φ = −D∇c (Fick’s
law) where D is the diffusivity which we assume constant. This gives a diffusion
term ∇ · (D∇c) = D∇ · ∇c.

• Production of pollutant: We assume that the pollutant added to the river
depends only on where and when the factories add pollutant to the river. This means
the source term r depends on position x and time t only and is independent of the
concentration of pollutant in the river.

• Decay of pollutant: We assume that the rate of decay is proportional to the
concentration, i.e. it can be modeled by kc with k a constant decay rate. Note that
the rate of decay depends indirectly on position and time through the concentration
c(x, t).

The resulting model is a partial differential equation (PDE) for the concentration of
the pollutant which depends on position x in the river and time t, i.e. c = c(x, t):
$$\frac{\partial c}{\partial t} = -v \cdot \nabla c + D \nabla \cdot \nabla c - kc + r. \tag{4.1}$$
In case k = r = 0, this equation reduces to the so-called convection-diffusion equation.
Solving Eq. (4.1) for the concentration in three spatial dimensions and time is beyond
the scope of Math 4414. However, we can take advantage of the geometry of the river to
simplify the equation. We consider a narrow and shallow river so that we can consider a
concentration that depends on x (coordinate along the river) and time t only: c = c(x, t).
Flow will then occur in the x direction only, represented by a scalar velocity v. Then the
model simplifies to
$$\frac{\partial c}{\partial t} = -v \frac{\partial c}{\partial x} + D \frac{\partial^2 c}{\partial x^2} + r - kc. \tag{4.2}$$
Here the velocity v and pollution rate r may depend on position along the river x and
time t.
To solve Eq. (4.2) it is important to realize how the different terms on the right-hand
side affect the solution. Of course the source term r tends to increase the concentration
and the decay term −kc tends to decrease the concentration, so we only discuss convection
and diffusion in some more detail.
Convection: If ∂c/∂x is negative at some location, there is more pollution just
upstream than just downstream (taking v > 0). Therefore, the concentration at that
location will increase, which is accounted for in Eq. (4.2) by the minus sign. Convection
affects the concentration level downstream while leaving the concentration upstream
unchanged. Fig. 4.1(a) shows how an initial concentration profile changes in time by
convection: the initial concentration profile shifts in the direction of the flow, leaving
the shape unaltered.
Diffusion: If there is a difference in concentration, pollutant will be transported from
high to low concentration by thermal motion. This diffusion process is independent of
the convection and may transport pollutant in both the upstream and downstream
direction. Fig. 4.1(b) shows how an initial concentration profile changes in time by
diffusion: the initial concentration profile will smooth out.
We are interested in whether the concentration reaches an equilibrium solution when
the rate at which pollution is added to the river and the velocity of the water are constant
in time, i.e. r = r(x) and v = v(x). Then the concentration does not change anymore
in time (∂c/∂t = 0) and the model reduces to a 2nd order ordinary differential equation
(ODE) in which the concentration only depends on position c = c(x),

$$v \frac{dc}{dx} - D \frac{d^2 c}{dx^2} + kc = r$$
where v and r may depend on x only.

[Figure: two panels plotting c against x. (a) Profiles at t = 0, 0.5, 1, 3 for 0 ≤ x ≤ 6. (b) Profiles at t = 0, 0.01, 0.05, 0.1, 0.5 for 0 ≤ x ≤ 1.]

Figure 4.1: Effect of transport terms on c(x, t) (a) pure convection, and (b) pure diffusion.


We are interested in the concentration profile near the factories where the largest
concentration is to be expected. Therefore we consider only a small part of the river.
As an example we consider a river segment from x = −1 to x = 1. The pollution source
from the factories r can be described by r = 17 + 9x. In addition we choose v = 8, D = 1,
k = 9. After multiplying the equation by −1, this gives
$$\frac{d^2 c}{dx^2} - 8 \frac{dc}{dx} - 9c = -17 - 9x, \qquad -1 < x < 1.$$

## 4.1.2 Boundary conditions

In order to solve the 2nd order differential equation we need to specify boundary
conditions for the concentration at the boundary. For the 2nd order ODE describing the
concentration, boundary conditions are specified at both endpoints (x = −1 and x = 1).
This is called a two-point boundary value problem (BVP). From differential equations,
we know that to determine a unique solution of an nth order BVP we need n boundary
conditions. Thus for the problem considered here, we need one boundary condition on
each side.
Various types of boundary conditions can be used for 2nd order BVPs at a boundary
point $x = x_b$:

• Dirichlet boundary conditions. The concentration is prescribed at the boundary,
$$c(x = x_b) = C_D,$$
where $C_D$ is a constant.

• Neumann boundary conditions. The mass flux (or concentration gradient) is
prescribed at the boundary:
$$\frac{dc}{dx}(x = x_b) = C_N,$$
where $C_N$ is a constant.

• Robin boundary conditions. This is a combination of a Dirichlet and Neumann
condition:
$$\frac{dc}{dx}(x = x_b) + k_R\, c(x = x_b) = C_R,$$
where $k_R$ and $C_R$ are constants.

## 4.1.3 Example

For simplicity, we assume we know the concentration at the endpoints, i.e. we use
Dirichlet boundary conditions. Thus the example we use for BVPs throughout this
chapter is
$$\frac{d^2 c}{dx^2} - 8 \frac{dc}{dx} - 9c = -17 - 9x, \qquad -1 < x < 1,$$
$$c(x = -1) = e^{-18}, \qquad c(x = 1) = 3.$$

## 4.2 Analytical solutions and solving with Matlab

In this section we discuss two methods to obtain the analytical solution of a boundary
value problem. In the next sections we will solve BVPs numerically and see whether the
solution obtained by a numerical method converges to the analytical solution found in
this section.

## 4.2.1 Analytical solution

The solution of a nonhomogeneous BVP consists of a homogeneous solution and a
particular solution, $c(x) = c_h(x) + c_p(x)$. The solution to the homogeneous differential
equation
$$\frac{d^2 c}{dx^2} - 8 \frac{dc}{dx} - 9c = 0$$
is found using a characteristic equation (i.e. you look for solutions $e^{\lambda x}$),
$$\lambda^2 - 8\lambda - 9 = 0$$
which has roots $\lambda_1 = -1$ and $\lambda_2 = 9$. This gives solutions $e^{-x}$ and $e^{9x}$ and the
homogeneous solution is a linear combination of these,
$$c_h(x) = C_1 e^{-x} + C_2 e^{9x}$$
The particular solution can be found using the method of undetermined coefficients. Here
you would try a polynomial of the same order as −17 − 9x, i.e. $c_p(x) = Ax + B$ and try
to find A and B by substitution. This gives −8A − 9B = −17 and −9A = −9 which has
solution A = 1 and B = 1. Thus the particular solution equals $c_p(x) = 1 + x$.
This gives for the total concentration $c(x) = C_1 e^{-x} + C_2 e^{9x} + 1 + x$. The values for $C_1$
and $C_2$ are obtained from the boundary conditions c(x = 1) = 3 and $c(x = -1) = e^{-18}$.
This gives $C_1 = 0$ and $C_2 = e^{-9}$ which gives for the solution of the pollution BVP
$$c(x) = e^{9x-9} + 1 + x$$
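
A quick numerical check of this result can be reassuring before it is used as a reference for numerical methods. The sketch below substitutes the solution and its derivatives (computed by hand) into the ODE and the boundary conditions; the names dc, d2c, and residual are chosen here for illustration.

```matlab
% Check that c(x) = exp(9x-9) + 1 + x solves the pollution BVP.
c   = @(x) exp(9*x - 9) + 1 + x;   % candidate solution
dc  = @(x) 9*exp(9*x - 9) + 1;     % first derivative, by hand
d2c = @(x) 81*exp(9*x - 9);        % second derivative, by hand

x = linspace(-1, 1, 11);           % a few sample points in [-1, 1]
residual = d2c(x) - 8*dc(x) - 9*c(x) + 17 + 9*x;   % should vanish

fprintf('max |ODE residual| = %.2e\n', max(abs(residual)));
fprintf('c(-1) - exp(-18)   = %.2e\n', c(-1) - exp(-18));
fprintf('c(1) - 3           = %.2e\n', c(1) - 3);
```

All three printed numbers should be zero up to round-off.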

## 4.2.2 Symbolical calculations

Ordinary differential equations can be solved symbolically using the built-in Matlab
function dsolve. For the example we consider you would use
dsolve('D2c - 8*Dc - 9*c = -17 - 9*x', 'x')
where D2c denotes the second derivative c'' and Dc the first derivative c'. Here the 'x'
denotes that the independent variable is x instead of the default t. Matlab will give the
symbolic solution
C1*exp(-x) + C2*exp(9*x) + 1 + x

To find the unique solution of the boundary value problem with the boundary
conditions $c(-1) = e^{-18}$ and c(1) = 3, use
c = dsolve('D2c - 8*Dc - 9*c = -17 - 9*x', 'c(-1)=exp(-18)', 'c(1)=3', 'x')
simplify(c)
which will give
c=
(-exp(-9+9*x)+exp(11+9*x)+exp(20)-1+x*exp(20)-x)/(exp(20)-1)
which is the correct solution but written in a different way: since
exp(11+9*x) - exp(-9+9*x) = (exp(20)-1)*exp(9*x-9), the expression reduces to
exp(9*x-9) + 1 + x.
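
Newer Matlab releases may no longer accept the character-string input to dsolve used above. The following sketch shows the same calculation with symbolic expressions instead; it assumes the Symbolic Math Toolbox is available, and the names ode, bcs, and csol are chosen here for illustration.

```matlab
% Symbolic solution of the pollution BVP (requires Symbolic Math Toolbox).
syms c(x)
Dc  = diff(c, x);        % first derivative c'
D2c = diff(c, x, 2);     % second derivative c''

ode = D2c - 8*Dc - 9*c == -17 - 9*x;
bcs = [c(-1) == exp(sym(-18)), c(1) == 3];

csol = simplify(dsolve(ode, bcs));
disp(csol)

% Numerical spot check against exp(9x-9) + 1 + x at x = 0.5
err = abs(double(subs(csol, x, 0.5)) - (exp(9*0.5 - 9) + 1 + 0.5));
```

The exact form in which Matlab displays csol can differ between releases, so comparing numerical values as in the last line is a safer check than comparing the printed expression.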

## 4.3 Solving BVPs numerically: Introduction

## 4.3.1 Grids

The solution to a boundary value problem is a function c(x) which is defined for every
x. If we use numerical techniques, we find an approximation to the solution at certain
discrete coordinate values xi only. The collection of xi's is called a (computational) grid.
Before we can solve a BVP numerically, we need to specify the computational grid, i.e.
all grid points xi . A numerical technique will produce approximations to the function c
at these grid points only, i.e. we will obtain approximations for the values c(xi ) which
will be denoted by ci . See Fig. 4.2.
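
[Figure: sketch of the solution c(x) on the interval from a = x0 to b = xn, with grid points x0, x1, x2, x3, . . ., xn−1, xn and the approximations c0, c1, c2, c3, . . ., cn−1, cn marked at the grid points.]

Figure 4.2: Grid points xi and numerical approximations at grid points ci.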

Often we choose the interval [a, b] to be divided into N equally spaced subintervals of
length h = (b − a)/N , which corresponds to the grid points

xi = a + ih

for i = 0, . . . , N . The length of a subinterval h is called the mesh size. Note that the
nodes are numbered by increasing x coordinate, thus x0 = a, x1 = a + h, . . ., xN = b.
This numbering is called natural numbering. Henceforth, we will only use natural
numbering since it simplifies the derivation of methods.

Example
We solve a BVP numerically on [0, 1] and divide the interval [0, 1] into N = 4 equally
spaced subintervals. The length of a subinterval is h = (1 − 0)/4 = 1/4. We obtain a
numerical approximation to the solution at the N + 1 = 5 discrete coordinate values only:
x0 = 0, x1 = 1/4, x2 = 1/2, x3 = 3/4, and x4 = 1.
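
In Matlab the grid of this example can be set up in a few lines. This is a minimal sketch; note that Matlab arrays start at index 1, so the grid point xi is stored in x(i+1).

```matlab
% Equally spaced grid with natural numbering on [a, b].
a = 0;  b = 1;      % interval end points
N = 4;              % number of subintervals
h = (b - a)/N;      % mesh size
x = a + (0:N)*h;    % grid points x_0, ..., x_N (N+1 values)

% x_i is stored in x(i+1) because Matlab indices start at 1.
disp(x)
```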

Remarks:
1. Typically, the more mesh points, the more accurate the approximation to the
solution and the more work to compute the approximation. The goal is to compute
an accurate numerical solution with as few grid points as possible, to minimize
computing time.

2. Almost always there are certain regions in [a, b] where the solution changes more
rapidly (i.e. where you want more mesh points). An equally spaced mesh is then
not the best choice.

3. An equally spaced mesh makes it easiest to introduce the numerical techniques and
will therefore be used in this chapter.

## 4.3.2 Numerical techniques

If a BVP cannot be solved analytically, one can try to solve it numerically. Numerical
techniques, however, will not always work either. If no unique solution exists for the BVP,
we can’t expect that a numerical technique will produce something useful. In addition,
there is no best method to solve BVPs. Which method to choose depends on the problem
you want to solve. Issues are

1. Accuracy.

2. Computing time and memory (Typically not an important issue for BVPs, only for
PDEs in 2 or 3 spatial dimensions)

The four most frequently used techniques that can easily be extended to two or three
dimensions are finite differences (FD), finite elements (FE), finite volumes (FV), and
spectral methods. We discuss two of these techniques, FD and FE, and apply the
techniques to solve BVPs. We focus on how these methods work, how to program them,
and typical numerical issues. Finite differences are the easiest setting in which to
understand the mathematical and numerical concepts. Finite elements are particularly
useful when dealing with complex geometries (curved boundaries) in two or three
dimensions. We will further discuss this issue in Chap. 7.

## 4.4 Solving BVPs numerically using Matlab: bvp4c

Solving BVPs numerically with Matlab is a little more complicated than solving algebraic
equations. We discuss only the use of bvp4c. To solve a boundary value problem in
Matlab, you first need to write the ODE as a system of first order ODEs y' = f(y, x)
(See Math 2214), using the solution vector
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} y \\ y' \\ \vdots \\ y^{(m-1)} \end{pmatrix}$$
where $y^{(m-1)}$ is the (m − 1)th derivative of y with respect to x. For the pollution BVP
one would introduce
$$\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = \begin{pmatrix} c \\ c' \end{pmatrix}$$
Taking the derivative of this vector and using the pollution ODE gives
$$\mathbf{c}' = \begin{pmatrix} c_1' \\ c_2' \end{pmatrix} = \begin{pmatrix} c' \\ c'' \end{pmatrix} = \begin{pmatrix} c_2 \\ 9c_1 + 8c_2 - 17 - 9x \end{pmatrix} = f(\mathbf{c}, x)$$

For the pollution BVP written as a system of first order ODEs one would
use in the Matlab Command Window (in addition you need to write 3 small m-files as
discussed below)

x = -1:0.1:1;
solinit = bvpinit(x, 'init_bvp');
options = bvpset('AbsTol', 1e-8);
sol = bvp4c('func_bvp', 'bc_bvp', solinit, options);

The first line specifies the initial grid points (Matlab might add some grid points if it
thinks it is necessary to obtain a solution that is sufficiently accurate).
The second line creates a solution structure solinit which contains in solinit.x the initial
mesh and in solinit.y the initial approximation specified in the user-written m-file named
here init_bvp.m. If we use a linear initial guess between $c(-1) = e^{-18} \approx 0$ and c(1) = 3,
we get $y_{init}(x) = (3 + 3x)/2$. We need to specify the initial guess for the system of first
order ODEs, so we also need to specify the derivative $y'_{init}(x) = 3/2$. The initial guess
needs to be specified as a column vector,

function c = init_bvp(x)
% initial concentration profile for pollution model
c(1,1) = 1.5*(x+1);
c(2,1) = 1.5;

The third line specifies options for BVPs that are different from the defaults used to
solve a BVP. Here we specify an absolute tolerance for the residual AbsTol = 1e-8, where
the default is 1e-6. The tolerance is satisfied at every grid point in the mesh.
The fourth line solves the BVP. In addition to the initial solution structure solinit,
it needs two user-written m-files which we named here func_bvp and bc_bvp. The fourth
parameter options is optional and can be omitted if default options are used.
The right-hand-side vector f should be specified in func_bvp as a column vector,

function f = func_bvp(x, y)
% right-hand-side vector f for pollution model
f(1,1) = y(2);
f(2,1) = 9*y(1) + 8*y(2) - 17 - 9*x;

where the array y contains the solution vector (y(1) = y and y(2) = y') at the grid point
x (single variable, not all grid points). The boundary conditions should be specified in
bc_bvp, written in residual form . . . = 0, i.e. $y(x = -1) - e^{-18} = 0$ and y(x = 1) − 3 = 0.
The residual needs to be specified as a column vector,

function res = bc_bvp(ya, yb)
% boundary conditions for pollution model
res(1,1) = ya(1) - exp(-18);
res(2,1) = yb(1) - 3;

where the array ya contains the solution vector at the left endpoint (ya(1) = y(x = a)
and ya(2) = y'(x = a)) and the array yb contains the solution vector at the right endpoint
(yb(1) = y(x = b) and yb(2) = y'(x = b)).
The function bvp4c produces as output a data structure which we named sol. The data
structure sol has two members sol.x and sol.y. The grid points at which the solution is
approximated are stored in sol.x (These are typically not the same grid points as the initial
mesh you specified, but might be refined to satisfy the default tolerances or tolerances
specified in options). The solution vector y at each grid point is stored in the
two-dimensional array sol.y. Row 1 contains $y_1 = y$ at all grid points, row 2 contains
$y_2 = y'$ at all grid points, etc.
A specific column or row in a two-dimensional array y can be selected using colon
notation. For example, the first row can be selected by using y(1,:) (meaning row 1, all
columns) and the second column by using y(:,2) (meaning all rows, column 2).
After the calculation with bvp4c we only have some (long) arrays with numbers from
which it is difficult to obtain insight into what the model predicts exactly. For this we
need to plot the solution. The approximation to the solution is in the first row of sol.y
and can be selected using colon notation, sol.y(1,:). The following plots the numerical
solution from bvp4c and the exact solution,

plot(sol.x, sol.y(1,:), 'r+')
hold on
ezplot('exp(9*x-9)+1+x',[-1,1])

[Figure: c plotted against x on [−1, 1], with a legend distinguishing the exact solution and the bvp4c solution.]

Figure 4.3: Numerical solution using bvp4c and exact solution using ezplot for the pollution
BVP.
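
For quick experiments the three m-files can be replaced by anonymous functions, so that the whole calculation fits in one script. This is a sketch of an alternative to the m-file approach described above, not a replacement for it; the names odefun, bcfun, and guess are chosen here for illustration. Note that bvp4c passes the independent variable x as the first argument.

```matlab
% Pollution BVP with bvp4c, using anonymous functions instead of m-files.
odefun = @(x, y) [y(2); 9*y(1) + 8*y(2) - 17 - 9*x];   % right-hand side f
bcfun  = @(ya, yb) [ya(1) - exp(-18); yb(1) - 3];      % residuals at x = -1, 1
guess  = @(x) [1.5*(x + 1); 1.5];                      % linear initial guess

solinit = bvpinit(-1:0.1:1, guess);
options = bvpset('AbsTol', 1e-8);
sol = bvp4c(odefun, bcfun, solinit, options);

% Compare with the exact solution at the grid points chosen by bvp4c.
c_exact = exp(9*sol.x - 9) + 1 + sol.x;
err = max(abs(sol.y(1,:) - c_exact));
fprintf('%d grid points, max error = %.2e\n', numel(sol.x), err);
```

The printed number of grid points shows whether bvp4c refined the initial mesh of 21 points.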

## 4.5 Finite differences

In this section, we discuss the discretization in space of a second-order BVP using the
finite difference method. Finite differences is conceptually the easiest, but not necessarily
the best method. In section 4.7 we discuss an alternative method: finite elements.
We only consider equally spaced grid points and use natural numbering. This simplifies
the derivation.

## 4.5.1 Mathematical background and method

In finite difference methods, all derivatives are approximated by finite difference
formulas. By using more grid points in the approximation, a more accurate approximation
of a derivative can be obtained. Finite difference formulas can be derived using Taylor’s
theorem. We use the form
$$f(x_i + \Delta x) = \sum_{k=0}^{n} \frac{(\Delta x)^k}{k!} f^{(k)}(x_i) + O((\Delta x)^{n+1}).$$
Here $O((\Delta x)^{n+1})$ means that all terms we neglect are $C(\Delta x)^{n+1}$ (with C a constant) or
of higher order. We consider three cases that frequently occur.

## 4.5.2 Simple first-order finite difference formulas

Simple expressions, i.e. when a derivative is approximated by 2 points, can be derived
directly from a Taylor series.

## O(h) approximation of $f'(x_i)$

Using Taylor’s theorem with ∆x = h gives
$$f(x_i + h) = f(x_i) + h f'(x_i) + O(h^2).$$
Rewriting gives the forward difference formula for the 1st derivative
$$f'(x_i) = \frac{1}{h}\left[f(x_i + h) - f(x_i)\right] + O(h).$$
Alternatively, we can write the forward difference a little shorter, using the notation
introduced in Fig. 4.2,
$$f'(x_i) = \frac{1}{h}\left[f_{i+1} - f_i\right] + O(h).$$
By taking ∆x = −h, i.e. h → −h in the forward difference formula, the backward
difference formula can be obtained
$$f'(x_i) = \frac{1}{-h}\left[f(x_i - h) - f(x_i)\right] + O(h) = \frac{1}{h}\left[f(x_i) - f(x_i - h)\right] + O(h) = \frac{1}{h}\left[f_i - f_{i-1}\right] + O(h).$$

Both formulas are equally accurate, O(h), so if you have a choice it doesn’t matter which
one you use. However, for the first node (no previous node) the backward difference
formula cannot be used and for the last node (no next node) the forward difference
formula cannot be used.
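
The O(h) behavior can be observed numerically: if h is halved, the error should be roughly halved as well. The sketch below uses f(x) = exp(x) at x = 1 as a test case; this test function and the two mesh sizes are choices made here for illustration.

```matlab
% Forward and backward differences for f(x) = exp(x) at x = 1.
f  = @(x) exp(x);
xi = 1;
exact = exp(1);                 % f'(x) = exp(x), so f'(1) = e

h = [0.1 0.05];                 % two mesh sizes, the second half the first
fwd = (f(xi + h) - f(xi))./h;   % forward difference
bwd = (f(xi) - f(xi - h))./h;   % backward difference

err_fwd = abs(fwd - exact);
err_bwd = abs(bwd - exact);

% For an O(h) formula, halving h should roughly halve the error.
ratio_fwd = err_fwd(1)/err_fwd(2);
ratio_bwd = err_bwd(1)/err_bwd(2);
fprintf('error ratio forward: %.3f, backward: %.3f\n', ratio_fwd, ratio_bwd);
```

Both ratios come out close to 2, consistent with an error proportional to h.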

## 4.5.3 More complicated finite difference expressions

Derivation of more accurate (higher-order) finite difference formulas and finite difference
formulas for higher-order derivatives is very tedious when Taylor series are used directly.
Instead we can use the following theorem:
If a (q + 1)-point difference formula is exact for the polynomials $P_0(x) = 1$, $P_1(x) = x$,
$P_2(x) = x^2$, . . ., $P_q(x) = x^q$, then the error is $O(h^q)$. In addition, the value of the grid
point $x_i$ doesn’t matter. To obtain simple algebra $x_i = 0$ is usually convenient.
Thus we can try to find coefficients $\alpha_1, \ldots, \alpha_{q+1}$ so that if we approximate a certain
derivative at grid point $x_i$ by a finite difference formula involving the points k, . . . , k + q,
say
$$\alpha_1 f_k + \alpha_2 f_{k+1} + \cdots + \alpha_{q+1} f_{k+q}$$
we get the exact value of that derivative (zero error) if f(x) is a polynomial of order q or
lower. This will lead to a linear system of q + 1 equations and q + 1 unknowns. Note that
to approximate a derivative at a grid point $x_i$ you can use function values at any
q + 1 grid points. However, function values in grid points closer to $x_i$ will contain more
information about the behavior of a function close to $x_i$ and will typically have a smaller
error.

## $O(h^2)$ approximation of $f'(x_i)$

We derive an $O(h^2)$ accurate approximation for the 1st derivative at grid point $x_i$, i.e. for
$f'(x_i)$ using the function values in the three closest grid points $x_i - h$, $x_i$, and $x_i + h$. To
simplify algebra we choose $x_i = 0$ (any other choice should lead to the same result), so
that $x_{i-1} = -h$ and $x_{i+1} = h$. We need to find three coefficients ($\alpha_1$, $\alpha_2$, and $\alpha_3$) so that
$$f'(0) = \alpha_1 f(0 - h) + \alpha_2 f(0) + \alpha_3 f(0 + h)$$
is exact when we use for the function f the polynomials $P_0(x) = 1$, $P_1(x) = x$, and
$P_2(x) = x^2$ (Note that you also know f'(0) for these functions: 0, 1, and 0). This gives
three equations with three unknowns,
$$\begin{aligned}
0 &= \alpha_1 (1) + \alpha_2 (1) + \alpha_3 (1), \\
1 &= \alpha_1 (-h) + \alpha_2 (0) + \alpha_3 (h), \\
0 &= \alpha_1 (-h)^2 + \alpha_2 (0)^2 + \alpha_3 (h)^2.
\end{aligned}$$
The third line gives $\alpha_1 = -\alpha_3$. Substituting in line 2 gives $\alpha_3 = 1/(2h)$ and $\alpha_1 = -1/(2h)$.
Line 1 gives then $\alpha_2 = 0$.

Thus we found the central finite difference formula
$$f'(x_i) = \frac{1}{2h}\left[f(x_i + h) - f(x_i - h)\right] + O(h^2).$$
Alternatively we could have solved the linear system with Matlab. The command
sol = solve('a1+a2+a3 = 0', '-h*a1+h*a3=1', 'h^2*a1+h^2*a3=0', 'a1', 'a2', 'a3')
produces a structure sol. The values of a1, a2, and a3 can be found by typing sol.a1,
sol.a2, and sol.a3 in the command window.

## $O(h^2)$ approximation of $f''(x_i)$

We now derive an $O(h^2)$ accurate approximation for the 2nd derivative at grid point $x_i$,
i.e. for $f''(x_i)$ using the function values in the three closest grid points $x_i - h$, $x_i$, and
$x_i + h$. Again we use $x_i = 0$, call the three coefficients $\alpha_1$, $\alpha_2$, and $\alpha_3$, and require that
$$f''(0) = \alpha_1 f(0 - h) + \alpha_2 f(0) + \alpha_3 f(0 + h)$$
is exact when we use for the function f the polynomials $P_0(x) = 1$, $P_1(x) = x$, and
$P_2(x) = x^2$. This gives three equations with three unknowns,
$$\begin{aligned}
0 &= \alpha_1 (1) + \alpha_2 (1) + \alpha_3 (1), \\
0 &= \alpha_1 (-h) + \alpha_2 (0) + \alpha_3 (h), \\
2 &= \alpha_1 (-h)^2 + \alpha_2 (0)^2 + \alpha_3 (h)^2.
\end{aligned}$$
The second line gives $\alpha_1 = \alpha_3$. Substituting in line 3 gives $\alpha_3 = 1/h^2$ and $\alpha_1 = 1/h^2$.
Line 1 gives then $\alpha_2 = -2/h^2$. Thus we obtained a central finite difference formula for
the second derivative,
$$f''(x_i) = \frac{1}{h^2}\left[f(x_i - h) - 2f(x_i) + f(x_i + h)\right] + O(h^2).$$
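
Both three-point formulas can also be obtained numerically by solving the 3 × 3 systems with the backslash operator for a specific value of h. This is a minimal sketch; the value h = 0.1 and the names s, V, alpha_d1, and alpha_d2 are chosen here for illustration.

```matlab
% Coefficients of three-point difference formulas on the points
% x_i - h, x_i, x_i + h, found by requiring exactness for 1, x, x^2.
h = 0.1;                        % sample mesh size
s = [-h 0 h];                   % the three grid points, taking x_i = 0

% Row p+1 holds the values of x^p at the three points.
V = [s.^0; s.^1; s.^2];

alpha_d1 = V \ [0; 1; 0];       % right-hand side: f'(0) for 1, x, x^2
alpha_d2 = V \ [0; 0; 2];       % right-hand side: f''(0) for 1, x, x^2

disp(alpha_d1.')                % compare with [-1/(2h), 0, 1/(2h)]
disp(alpha_d2.')                % compare with [1/h^2, -2/h^2, 1/h^2]
```

For h = 0.1 this gives approximately [−5, 0, 5] and [100, −200, 100], in agreement with the formulas derived by hand.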

## 4.5.4 Matrix-vector equation: example

To reduce the amount of algebra a little bit, we consider the discretization of the following
second order linear ODE instead of Eq. (4.2)
$$-\frac{d^2 c}{dx^2} + c = x$$
with boundary conditions c(x = 0) = 0 and c(x = 1) = 3 using n = 4 subintervals of
length h = 0.25.
We make a grid in the x-direction using natural numbering and we denote the
numerical approximation of c at $x = x_j$ by $c_j = \tilde{c}(x_j)$ for j = 0, . . . , n = 4.
At every point $x_j$ in the interior (0, 1), the differential equation holds. The second
derivative is approximated by a difference formula. Here we use the $O(h^2)$ approximation
for c''. The concentration itself is approximated by the nodal value $c_j$ at $x = x_j$. The
value of x in node j we know: $x_j$. We obtain 3 finite difference equations for j = 1, 2, 3
(for which the value of $x_j$ is in (0, 1)),
$$\begin{aligned}
j = 1: &\quad -\frac{c_0 - 2c_1 + c_2}{h^2} + c_1 = x_1, \\
j = 2: &\quad -\frac{c_1 - 2c_2 + c_3}{h^2} + c_2 = x_2, \\
j = 3: &\quad -\frac{c_2 - 2c_3 + c_4}{h^2} + c_3 = x_3,
\end{aligned}$$
with h = 1/4 here. At the endpoints $x_0 = 0$ and $x_4 = 1$ we have the boundary condition
instead of the ODE. Since $c_0$ is the approximation of c at x = 0, we have $c_0 = 0$. At the
last point $x_4 = 1$ we have the boundary condition: $c_4 = 3$.
All together we have 5 linear equations with 5 unknowns (including the boundary
conditions) which can be written in matrix form
$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
-1/h^2 & 1 + 2/h^2 & -1/h^2 & 0 & 0 \\
0 & -1/h^2 & 1 + 2/h^2 & -1/h^2 & 0 \\
0 & 0 & -1/h^2 & 1 + 2/h^2 & -1/h^2 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
=
\begin{pmatrix} 0 \\ x_1 \\ x_2 \\ x_3 \\ 3 \end{pmatrix},$$
or using the known values for h = 1/4 and the grid points $x_j = j/4$
$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
-16 & 33 & -16 & 0 & 0 \\
0 & -16 & 33 & -16 & 0 \\
0 & 0 & -16 & 33 & -16 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 1/4 \\ 1/2 \\ 3/4 \\ 3 \end{pmatrix}.$$

The matrix and right-hand side are constant, so that we have reduced the problem to a
linear algebra problem: solving Ax = b. Once we have set up the linear system, we can
use any appropriate solution method to solve Ax = b to find the values $c_j$.
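
The example system can be assembled and solved in Matlab as follows. This is a sketch for this specific example with n = 4. The exact solution c(x) = x + 2 sinh(x)/sinh(1) used for comparison is not given in the text; it was derived for this sketch and can be verified by substitution into the ODE and the boundary conditions.

```matlab
% Finite difference system for -c'' + c = x, c(0) = 0, c(1) = 3, n = 4.
n = 4;
h = 1/n;
x = (0:n)'*h;                   % grid points x_0, ..., x_4 (column vector)

A = zeros(n+1);
f = zeros(n+1, 1);

A(1,1) = 1;      f(1)   = 0;    % boundary condition c_0 = 0
A(n+1,n+1) = 1;  f(n+1) = 3;    % boundary condition c_4 = 3

for j = 1:n-1                   % internal grid points x_1, ..., x_3
    row = j + 1;                % Matlab indices start at 1
    A(row, row-1) = -1/h^2;
    A(row, row)   = 1 + 2/h^2;
    A(row, row+1) = -1/h^2;
    f(row)        = x(row);
end

c = A\f;                        % numerical solution at all grid points

% Exact solution of this BVP (derived for this sketch, not in the text)
c_exact = x + 2*sinh(x)/sinh(1);
fprintf('max error = %.2e\n', max(abs(c - c_exact)));
```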

## 4.5.5 Matrix-vector equation: general approach

If we need to program the finite difference method, we do not want to write a new code
if we use a different number of grid points. So we want to find a more general equation
Ax = b in terms of $c_j$ for the n + 1 grid points $x_0, \ldots, x_n$.
We consider again the pollution equation
$$-\frac{d^2 c}{dx^2} + 8 \frac{dc}{dx} + 9c = 17 + 9x, \qquad -1 < x < 1,$$
$$c(x = -1) = e^{-18}, \qquad c(x = 1) = 3.$$

We use the $O(h^2)$ central FD formulas of Sec. 4.5.3 to approximate the derivatives
c' and c'' at $x_j$. The concentration itself at $x = x_j$ gives $c_j$. For each internal
node j = 1, . . . , n − 1 we then have
$$-\frac{c_{j-1} - 2c_j + c_{j+1}}{h^2} + 8\,\frac{c_{j+1} - c_{j-1}}{2h} + 9c_j = 17 + 9x_j.$$
The boundary conditions are: $c_0 = e^{-18}$ and $c_n = 3$.
We can write the equations in matrix form:

Ac = f ,

where A is an (n + 1) × (n + 1) matrix and c and f are vectors of length n + 1.

• In the first row we put the equation corresponding to the grid point $x_0$. This is
the boundary condition $c_0 = e^{-18}$. Thus we need a value of 1 in the first column
corresponding to $c_0$ and zeros in the other columns since these $c_j$'s are not involved in
the Dirichlet boundary condition. In $f_0$ we need the value of the boundary condition
$e^{-18}$.

• The next n − 1 rows (rows 2, . . . , n) correspond to the internal grid points
$x_1, \ldots, x_{n-1}$, at which the (discretized) ODE needs to be satisfied: the second row
corresponds to $x_1$, the third row to $x_2$ etc. until row n. Thus the rows are ordered the
same way as the grid points. For each j = 1, . . . , n − 1, we have

– $9 + 2/h^2$ in the column corresponding to $c_j$, i.e. on the diagonal (component $A_{j,j}$).
– $-1/h^2 - 4/h$ in the column corresponding to $c_{j-1}$, i.e. in the column on the left
of the diagonal (component $A_{j,j-1}$).
– $-1/h^2 + 4/h$ in the column corresponding to $c_{j+1}$, i.e. in the column on the right
of the diagonal (component $A_{j,j+1}$).
– $17 + 9x_j$ in the right-hand-side vector (component $f_j$).

• In the last row n + 1 we put the equation corresponding to $x_n$. This is the boundary
condition $c_n = 3$. Thus we need a value of 1 in the last column, n + 1, corresponding
to $c_n$ and zeros in the other columns since these $c_j$'s are not involved in the Dirichlet
boundary condition. In $f_n$ we need the value of the boundary condition 3.

All together, this gives the matrix-vector equation
$$\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
-1/h^2 - 4/h & 9 + 2/h^2 & -1/h^2 + 4/h & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -1/h^2 - 4/h & 9 + 2/h^2 & -1/h^2 + 4/h \\
0 & \cdots & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \\ c_n \end{pmatrix}
=
\begin{pmatrix} e^{-18} \\ 17 + 9x_1 \\ \vdots \\ 17 + 9x_{n-1} \\ 3 \end{pmatrix},$$
where the dots denote a continuation of the same value, for example $9 + 2/h^2$ on the
diagonal.
Once the matrix A and right-hand side vector f have been constructed, any
appropriate method to solve Ax = b can be used. An efficient method will be discussed in
Sec. 4.9.
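
The construction above translates directly into a loop over the internal grid points. The sketch below assembles the full matrix for two grid sizes and compares with the analytical solution of Sec. 4.2.1; the grid sizes n = 200 and n = 400 and the use of a full (non-sparse) matrix are choices made here for illustration.

```matlab
% Central finite differences for the pollution BVP
%   -c'' + 8c' + 9c = 17 + 9x,  c(-1) = exp(-18),  c(1) = 3.
n_values = [200 400];               % two grids, to observe the error decrease
err = zeros(size(n_values));

for k = 1:numel(n_values)
    n = n_values(k);
    h = 2/n;                        % the interval [-1, 1] has length 2
    x = -1 + (0:n)'*h;              % grid points x_0, ..., x_n

    A = zeros(n+1);
    f = zeros(n+1, 1);
    A(1,1) = 1;      f(1)   = exp(-18);   % row for x_0 (boundary condition)
    A(n+1,n+1) = 1;  f(n+1) = 3;          % row for x_n (boundary condition)

    for j = 1:n-1                   % internal grid points
        row = j + 1;                % shift by one: Matlab indices start at 1
        A(row, row-1) = -1/h^2 - 4/h;
        A(row, row)   =  9 + 2/h^2;
        A(row, row+1) = -1/h^2 + 4/h;
        f(row)        = 17 + 9*x(row);
    end

    c = A\f;
    err(k) = max(abs(c - (exp(9*x - 9) + 1 + x)));
end

fprintf('max error n=200: %.2e, n=400: %.2e, ratio: %.2f\n', ...
        err(1), err(2), err(1)/err(2));
```

Since both derivatives are approximated with $O(h^2)$ formulas, halving h is expected to reduce the error by a factor of about 4.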

## 4.5.6 Programming finite differences

A numerical code for a finite difference method consists of two different parts. First, the
matrix A and right-hand-side vector f need to be constructed. Second, the matrix-vector
equation needs to be solved.

1. Setup of Ax = b
A lot of things can vary when you use the finite difference method to solve a second
order ODE. In a general form, a second order ODE reads
$$-\frac{d^2 c}{dx^2} + p(x) \frac{dc}{dx} + q(x) c = r(x) \tag{4.3}$$
where p(x), q(x), and r(x) are functions of x that are different from one problem
to the next. In addition, the type and values of the boundary conditions may vary
and you may want to change the order of the approximation of a derivative.

2. Solving Ax = b
Depending on the size of your matrix you may want to choose a different method to
solve Ax = b. To economize in memory, you might want to store only the non-zero
components in your matrix (in the examples above, just three diagonals instead of
the full (n + 1) × (n + 1) matrix with all the zeros).

## Consequences for programming finite differences

We don’t want to have a large number of nearly identical programs for each case, so you
need to create several functions that are sufficiently general and focus on one or a few
aspects and combine these in a clever way. To obtain a readable code it is very important
that you program in a structured way.
There are various ways to program finite differences in a well-structured way. Here, I
give one possibility to program the second order ODE −y'' + p(x)y' + q(x)y = r(x), i.e.
the form of Eq. (4.3), using proper boundary conditions. We use several (levels of)
functions

• A main function finitedif with calls to

– a function fdmatvec that sets up the FD part of the matrix A and vector f . This
function calls
∗ a function ode_param that computes the values of p(x), q(x), and r(x).
– a function bcmatvec that sets up the boundary condition part of the matrix A
and vector f . This function calls
∗ a function bc_param that specifies the type (Dirichlet, Neumann) and
values (CD , CN ) of boundary conditions.
– a function that solves Ax = b

In the main file, it needs to be decided (input) which order of approximation for the
derivatives is used, how the matrix is stored, and how Ax = b is solved.
In fdmatvec the part of the matrix A and f corresponding to the finite difference
equations is set up. Since you have the same type of expression for every internal grid
point you can use a for-loop for this. Here you may want to distinguish different orders
of approximation using if statements. These if statements should be kept outside of
for-loops if possible, for reasons of efficiency. The function ode_param is called for every
grid point and evaluates the functions p(x), q(x), and r(x) at a certain grid point.
Boundary conditions would typically be incorporated in a separate function bcmatvec,
since it is independent of how you handle internal grid points. To keep this function
general, you would use another function to specify the type and value (say Ca for the left
and Cb for the right endpoint) of the boundary condition used at each end point. Different
types of boundary conditions can be distinguished using if-statements. The parameters
Ca and Cb are used in the matrix A and right-hand side f to keep these expressions
general.
In the main function finitedif different solvers would be distinguished through
if-statements. Since different solvers do not have much in common and for reasons of
efficiency, different methods to solve Ax = b would be in different functions. In Matlab
you could use, for example, the \ operator.
Once your program is working, you just need to change the simple functions ode_param
and bc_param if you solve a different BVP. The more complicated functions fdmatvec,
bcmatvec, and possibly the functions to solve Ax = b you can leave unchanged.
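
The structure described above could look as follows when all functions are placed in one file finitedif.m. This is only a sketch under simplifying assumptions: the function names are those used in the text, but the argument lists are assumptions made here, only Dirichlet boundary conditions and the $O(h^2)$ central formulas are handled, and the backslash operator is the only solver. The functions ode_param and bc_param are filled in for the pollution BVP.

```matlab
function [x, c] = finitedif(n)
% Main function: solve -c'' + p(x)c' + q(x)c = r(x) with n subintervals.
a = -1;  b = 1;                   % end points of the interval
h = (b - a)/n;
x = a + (0:n)'*h;                 % grid points x_0, ..., x_n
[A, f] = fdmatvec(x, h);          % finite difference part (internal points)
[A, f] = bcmatvec(A, f);          % boundary condition part (first/last row)
c = A\f;                          % solver; could be replaced by another one
end

function [A, f] = fdmatvec(x, h)
% Rows of A and f for the internal grid points, central O(h^2) formulas.
n = numel(x) - 1;
A = zeros(n+1);
f = zeros(n+1, 1);
for row = 2:n                     % row j+1 belongs to grid point x_j
    [p, q, r] = ode_param(x(row));
    A(row, row-1) = -1/h^2 - p/(2*h);
    A(row, row)   =  2/h^2 + q;
    A(row, row+1) = -1/h^2 + p/(2*h);
    f(row)        = r;
end
end

function [A, f] = bcmatvec(A, f)
% First and last row of A and f (Dirichlet conditions only in this sketch).
[Ca, Cb] = bc_param();
A(1, :) = 0;    A(1, 1) = 1;      f(1) = Ca;
A(end, :) = 0;  A(end, end) = 1;  f(end) = Cb;
end

function [p, q, r] = ode_param(x)
% Coefficients of the ODE at grid point x (pollution BVP).
p = 8;  q = 9;  r = 17 + 9*x;
end

function [Ca, Cb] = bc_param()
% Boundary values at the left and right end points (pollution BVP).
Ca = exp(-18);  Cb = 3;
end
```

A call such as [x, c] = finitedif(200) followed by max(abs(c - (exp(9*x-9)+1+x))) then gives the maximum error on the grid. To solve a different BVP with Dirichlet conditions on the same interval, only ode_param and bc_param would need to change.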

## 4.6 Eliminating boundary conditions

There are two reasons to eliminate boundary conditions from a linear system Ac = f .
First, you reduce the number of unknowns, so you need to solve a smaller system
which is faster. For a BVP this is not a very important issue since the boundary consists
only of two grid points (the end points). However, when we solve PDEs on 2D regions the
boundary consists of curves and on 3D regions the boundary consists of surfaces. Both
typically have a lot of grid points.
Second, a certain pattern in the matrix might be distorted by the boundary conditions.
For example, symmetry or a tridiagonal structure of the matrix may be lost due to the
boundary conditions. These are important properties of the matrix when you solve the
linear system. Some efficient techniques to solve Ax = b require a symmetric matrix,
for example the conjugate gradient method. A tridiagonal system can be solved very
efficiently using a direct method (Crout’s method, see Sec. 4.9).

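## 4.6.1 Example

In the FD example in Sec. 4.5.4,
$$\begin{aligned}
j = 0: &\quad c_0 = 0, \\
j = 1: &\quad -\frac{c_0 - 2c_1 + c_2}{h^2} + c_1 = x_1, \\
j = 2: &\quad -\frac{c_1 - 2c_2 + c_3}{h^2} + c_2 = x_2, \\
j = 3: &\quad -\frac{c_2 - 2c_3 + c_4}{h^2} + c_3 = x_3, \\
j = 4: &\quad c_4 = 3,
\end{aligned}$$
$c_0 = 0$ and $c_4 = 3$ are known and do not need to be solved for. The unknowns $c_0$ and
$c_4$ can be eliminated from the above set of equations (bring the $c_0$ and $c_4$ terms in the
equations for the internal points to the right-hand side and substitute the values). Only
the equations for j = 1 and j = 3 contain $c_0$ and $c_4$ and are affected,
$$\begin{aligned}
j = 1: &\quad -\frac{-2c_1 + c_2}{h^2} + c_1 = x_1 + \frac{c_0}{h^2} = x_1, \\
j = 2: &\quad -\frac{c_1 - 2c_2 + c_3}{h^2} + c_2 = x_2, \\
j = 3: &\quad -\frac{c_2 - 2c_3}{h^2} + c_3 = x_3 + \frac{c_4}{h^2} = x_3 + \frac{3}{h^2}.
\end{aligned}$$
In matrix-vector form this becomes
$$\begin{pmatrix}
1 + 2/h^2 & -1/h^2 & 0 \\
-1/h^2 & 1 + 2/h^2 & -1/h^2 \\
0 & -1/h^2 & 1 + 2/h^2
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 + 3/h^2 \end{pmatrix},$$
or using h = 0.25, $x_1 = 0.25$, $x_2 = 0.5$, and $x_3 = 0.75$:
$$\begin{pmatrix}
33 & -16 & 0 \\
-16 & 33 & -16 \\
0 & -16 & 33
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix}
=
\begin{pmatrix} 0.25 \\ 0.5 \\ 48.75 \end{pmatrix}.$$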

Solving the 3 × 3 linear system with Matlab

A = [33 -16 0;-16 33 -16; 0 -16 33];
f = [0.25; 0.5; 48.75];
c=A\f
gives
c=
6.802295047529017e-01
1.387348353552860e+00
2.149926474449871e+00
Note that the array c only contains the solution at the internal grid points (since we
eliminated the boundary conditions). To insert the values c0 and c4 at the correct
positions in the array, corresponding to the first and last grid point, first shift all
values in the array c by one position. Then set the first array element c(1) and the
last array element c(5) to the boundary values:
c(2:4) = c(1:3)
c(1) = 0
c(5) = 3

Remarks

• The size of the matrix-vector equation is reduced by 2, so it can be solved a little
faster.

• The matrix is tridiagonal, which is a result of the natural numbering. Instead of
the full matrix A we could store only the non-zero elements of A (to save memory)
and use an efficient solver for tridiagonal matrices (to save computing time). See
Sec. 4.9.

• By eliminating the boundary conditions from the linear system, we have created a
symmetric matrix. Solvers that require a symmetric matrix can now be used and
storing a matrix only requires storing two diagonals (the third diagonal elements
are then known because of the symmetry).
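
The elimination above can be sketched for an arbitrary number of intervals. This is only an illustration, assuming the model problem is −c'' + c = x on (0, 1) with c(0) = 0 and c(1) = 3, as the equations above suggest; the variable names (cint, c0, cn) are chosen here and do not come from the course code.

```matlab
% Sketch: -c'' + c = x on (0,1), c(0) = 0, c(1) = 3 (assumed model problem),
% with the boundary values eliminated from the linear system.
n  = 4;                      % number of intervals
h  = 1/n;                    % grid size
x  = (1:n-1)'*h;             % internal grid points x_1, ..., x_{n-1}
c0 = 0;  cn = 3;             % Dirichlet boundary values

% Tridiagonal (n-1) x (n-1) matrix: 1 + 2/h^2 on the diagonal, -1/h^2 next to it
e = ones(n-1,1);
A = spdiags([-e/h^2, (1 + 2/h^2)*e, -e/h^2], -1:1, n-1, n-1);

% Right-hand side: the known boundary values are moved to the right-hand side
f      = x;
f(1)   = f(1)   + c0/h^2;
f(end) = f(end) + cn/h^2;

cint = A\f;                  % solution at the internal grid points
c    = [c0; cint; cn];       % solution at all grid points x_0, ..., x_n
```

For n = 4 this builds the same 3 × 3 system as above. Concatenating the boundary values, as in the last line, is an alternative to shifting the array entries by hand.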

## 4.6.2 General approach

There are n + 1 equations with n + 1 unknowns: two equations from the boundary
conditions and n − 1 equations on the internal grid points,
$$
\begin{aligned}
&c_0 = e^{-18},\\
&-\frac{c_{j-1} - 2c_j + c_{j+1}}{h^2} + 8\,\frac{c_{j+1} - c_{j-1}}{2h} + 9c_j = 17 + 9x_j, \qquad j = 1, \ldots, n-1,\\
&c_n = 3.
\end{aligned}
$$

However, c0 and cn are known from the boundary conditions. We can eliminate these
from the equations to obtain only n − 1 equations with n − 1 unknowns. Thus we will
eliminate the first and last row and the first and last column by eliminating c0 and cn from
equations j = 1, . . . , n − 1 using the 2 rows corresponding to the boundary conditions. c0
only appears in the finite difference formula of j = 1 and cn only in the finite difference
formula of j = n − 1. After substituting the values for c0 and cn and bringing the known
terms to the right-hand side we obtain a new equation for j = 1 and j = n − 1:
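$$
-\frac{-2c_1 + c_2}{h^2} + 8\,\frac{c_2}{2h} + 9c_1 = 17 + 9x_1 + 8\,\frac{c_0}{2h} + \frac{c_0}{h^2} = 17 + 9x_1 + \frac{4e^{-18}}{h} + \frac{e^{-18}}{h^2},
$$
and j = n − 1:
$$
-\frac{c_{n-2} - 2c_{n-1}}{h^2} - 8\,\frac{c_{n-2}}{2h} + 9c_{n-1} = 17 + 9x_{n-1} - 8\,\frac{c_n}{2h} + \frac{c_n}{h^2} = 17 + 9x_{n-1} - \frac{12}{h} + \frac{3}{h^2}.
$$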
The equations for j = 2, . . . , n − 2 remain the same since they don't contain c0 and cn.
Thus we now have the (n − 1) × (n − 1) system of equations
$$
\begin{bmatrix}
9 + 2/h^2 & -1/h^2 + 4/h & 0 & \cdots & 0\\
-1/h^2 - 4/h & 9 + 2/h^2 & -1/h^2 + 4/h & \ddots & \vdots\\
0 & \ddots & \ddots & \ddots & 0\\
\vdots & \ddots & -1/h^2 - 4/h & 9 + 2/h^2 & -1/h^2 + 4/h\\
0 & \cdots & 0 & -1/h^2 - 4/h & 9 + 2/h^2
\end{bmatrix}
\begin{bmatrix} c_1\\ c_2\\ \vdots\\ c_{n-2}\\ c_{n-1} \end{bmatrix}
=
\begin{bmatrix}
17 + 9x_1 + 4e^{-18}/h + e^{-18}/h^2\\
17 + 9x_2\\
\vdots\\
17 + 9x_{n-2}\\
17 + 9x_{n-1} - 12/h + 3/h^2
\end{bmatrix}
$$

To insert the values c0 and cn at the correct positions in the array, corresponding to
the first and last grid point, use
n = length(c) + 1;
c(2:n) = c(1:n-1)
c(1) = exp(-18)
c(n+1) = 3

Remarks
• The size of the matrix-vector equations is reduced by 2, so it can be solved a little
faster.

• The matrix is tridiagonal, which is a result of the natural numbering. Instead of
the full matrix A we could store only the non-zero elements of A (to save memory)
and use an efficient solver for tridiagonal matrices (to save computing time). See
Sec. 4.9.

• The convection term makes the matrix A non-symmetric, so solvers that require
a symmetric matrix cannot be used. In the absence of convection, however, the
matrix would be symmetric after eliminating the boundary conditions,
which would allow the use of solvers for symmetric matrices.
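
A minimal sketch of the general approach is given below, under the assumption that the reduced tridiagonal system above is stored with spdiags and solved with the backslash operator. The closed-form solution c(x) = 1 + x + e^{9(x−1)} used for the comparison is not stated in the text; it can be checked by substituting it into the ODE and both boundary conditions. All variable names are illustrative.

```matlab
% Sketch: -c'' + 8c' + 9c = 17 + 9x on (-1,1), c(-1) = exp(-18), c(1) = 3,
% centered differences with the boundary values eliminated.
n  = 160;                        % number of intervals on [-1, 1]
h  = 2/n;                        % grid size
x  = -1 + (0:n)'*h;              % all grid points x_0, ..., x_n
xi = x(2:n);                     % internal grid points x_1, ..., x_{n-1}
c0 = exp(-18);  cn = 3;          % Dirichlet boundary values

e   = ones(n-1,1);
sub = (-1/h^2 - 4/h)*e;          % sub-diagonal
dia = ( 9 + 2/h^2 )*e;           % main diagonal
sup = (-1/h^2 + 4/h)*e;          % super-diagonal
A   = spdiags([sub, dia, sup], -1:1, n-1, n-1);

f      = 17 + 9*xi;
f(1)   = f(1)   + 4*c0/h + c0/h^2;   % known c_0 terms moved to the right-hand side
f(end) = f(end) - 4*cn/h + cn/h^2;   % known c_n terms moved to the right-hand side

c = A\f;                         % internal values only
c = [c0; c; cn];                 % add the boundary values

cexact = 1 + x + exp(9*(x - 1)); % closed form, verified by substitution
err    = norm(c - cexact, inf);  % maximum error over all grid points
```

The last two lines compare the result with the closed-form solution in the infinity norm, which is the error measure used in Sec. 4.8.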

4.7 Finite elements
In this section, we discuss the discretization in space of a second-order BVP using the finite
element method (FEM). The method is conceptually much harder than finite differences.
The advantage lies in the easy incorporation of Neumann and Robin type boundary
conditions and mesh refinement. This is particularly useful in two or three dimensions,
especially when there are curved boundaries or steep gradients in the solution. We only
consider equally spaced grids and use natural numbering. This simplifies the derivation.

## 4.7.1 Mathematical background

Finite elements
The interval [a, b] is divided into N elements of equal length h = (b − a)/N that only
have endpoints in common. We number the elements e1, . . . , eN using natural numbering.
The grid points (nodes) are xj = a + jh, j = 0, . . . , n, which are also numbered using
natural numbering. You can consider an element as a local "computational domain". An
element contains a certain number of grid points, which we have not specified yet. We only
consider elements with at least two grid points: the endpoints of the element. Frequently
used elements are linear, quadratic, and cubic elements. A linear element contains only
the endpoints of the element. A quadratic element contains the endpoints and the
midpoint of the element. A cubic element contains the endpoints and two points at 1/3
and 2/3 of the element. We will mainly focus on linear finite elements, for which n = N.
Fig. 4.4 shows the linear elements and positions of the nodes.

Figure 4.4: Linear finite elements e1, . . . , eN and nodes x0, . . . , xN.

Basis functions
On the interval [a, b], basis functions φj are defined with the following properties:

• φj (xj ) = 1

• φj (xi ) = 0 for i ≠ j

• φj is continuous and a piecewise polynomial (per element). For linear elements, φj
is a piecewise linear polynomial, for quadratic elements, φj is a piecewise quadratic
polynomial, etc.

Fig. 4.5 shows the linear finite element basis functions φ0, φj for 0 < j < N, and φN.
Outside the sketched region every basis function is exactly equal to zero.

[Figure 4.5: sketch of the linear basis functions φ0, φj, and φN over the nodes x0, . . . , xN.]

## Finite element approximation

We can approximate a function using the FE basis functions and the nodal values,
$$
c(x) = \sum_{j=0}^{N} c_j \varphi_j(x).
$$

This results in a piecewise polynomial approximation of the function c(x). Fig. 4.6 shows
a linear finite element approximation and how it is constructed using the basis functions
and nodal values cj , j = k − 2, . . . , k + 2.

## Element matrix and element vector

The easiest way to compute the coefficients of the matrix A and vector f of the linear
system is to compute the contributions to A and f element-by-element. The contributions
of element l are stored in an element matrix $a^{(l)}$ and element vector $f^{(l)}$.
We first note that the integral over the entire interval [a, b] is just the sum of the
integrals over all elements,
$$
\int_a^b dx = \sum_{l=1}^{N} \int_{e_l} dx.
$$
On an element $e_l = [x_{l-1}, x_l]$, only two basis functions are non-zero: $\varphi_{l-1}$ and $\varphi_l$ (see
Fig. 4.7). The non-zero contributions to A on this element can be put in an element
matrix $a^{(l)}$ and the non-zero contributions to f in an element vector $f^{(l)}$.
After the element matrices and vectors have been computed, they need to be assembled
into (added to) the global matrix A and global vector f .

Figure 4.6: Finite element approximation. The solid black line is the sum of the colored
dashed lines. Red dashed line: $c_{k-1}\varphi_{k-1}$, cyan dashed line: $c_k\varphi_k$, green dashed line:
$c_{k+1}\varphi_{k+1}$, blue dashed line: right part of $c_{k-2}\varphi_{k-2}$, magenta dashed line: left part of
$c_{k+2}\varphi_{k+2}$.

Figure 4.7: Two non-zero basis functions on element $e_l$.

Reference element
To simplify the numerical calculations, elements $e_l = [x_{l-1}, x_l]$ are mapped to the
reference element ê = [0, 1] using the transformation
$$
x = x_{l-1} + h\xi, \qquad \xi \in [0, 1].
$$
We define two basis functions on the reference element (see Fig. 4.8),
$$
\hat\varphi_1 = 1 - \xi, \qquad \hat\varphi_2 = \xi,
$$
whose derivatives are $d\hat\varphi_1/d\xi = -1$ and $d\hat\varphi_2/d\xi = 1$. The two non-zero basis functions on
element $e_l$ are mapped to the reference basis functions by
$$
\hat\varphi_1(\xi) = \varphi_{l-1}(x_{l-1} + h\xi), \qquad \hat\varphi_2(\xi) = \varphi_l(x_{l-1} + h\xi).
$$
Derivatives on the x and ξ domain are related via the chain rule. We have $\hat\varphi(\xi) = \varphi(x)$
with x = x(ξ), thus
$$
\frac{d\hat\varphi}{d\xi} = \frac{d\varphi}{dx}\,\frac{dx}{d\xi}.
$$
Since dx/dξ = h we get
$$
\frac{d\varphi}{dx} = \frac{1}{h}\,\frac{d\hat\varphi}{d\xi}.
$$

Figure 4.8: Reference element ê = [0, 1] with linear basis functions $\hat\varphi_1(\xi)$ and $\hat\varphi_2(\xi)$.

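Integrals are transformed using the substitution $x = x_{l-1} + h\xi$ with $dx = h\,d\xi$. For
example, since $\hat\varphi_1(\xi) = 1 - \xi$,
$$
\int_{e_l} p(x)\,\varphi_{l-1}(x)\,dx = \int_0^1 p(x_{l-1} + h\xi)\,\hat\varphi_1(\xi)\,h\,d\xi = h\int_0^1 p(x_{l-1} + h\xi)\,(1 - \xi)\,d\xi.
$$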

## Weak form and boundary conditions

For the finite element method the ODE is not discretized directly, but written in the weak
form first. The weak form of an ODE is obtained by multiplying the ODE by suitable test
functions, integrating the resulting equation over the whole domain [a, b], and applying an
integration by parts to reduce the order of the derivatives. When functions are sufficiently
smooth, the weak form and original ODE are equivalent, i.e. they have the same solution.
Boundary conditions are applied to the weak form. Dirichlet and Neumann/Robin
type boundary conditions are handled very differently in finite element methods.
Dirichlet boundary conditions
Dirichlet boundary conditions are handled in a similar way as for finite differences: the
row corresponding to that boundary condition is replaced by the Dirichlet boundary
condition.
Neumann/Robin boundary conditions
Neumann/Robin boundary conditions are not handled the same way as for finite
differences. In the weak form, the derivative dc/dx at a boundary appears naturally.
Substitution of the Neumann/Robin boundary condition into the weak form is all that is
needed.

## Finite element method: overview of discretization steps

There are several steps involved to obtain a linear system in the finite element method.

• Find the weak form: multiply the differential equation by a test function ψ(x),
integrate over the whole domain, and apply integration by parts.

• Choose suitable test functions ψ. We will choose the n + 1 basis functions φj . This
is called Galerkin finite element method.

• Replace c(x) by its finite element approximation $c(x) = \sum_{j=0}^{n} c_j \varphi_j(x)$.

• Evaluate integrals over each finite element using a reference element (analytically
or with some appropriate numerical technique).

• Assemble contributions of each element into the matrix A and right-hand-side vector
f.

## 4.7.2 Matrix-vector equation: example

To reduce the amount of algebra a little bit, we consider the discretization of the following
second-order linear ODE instead of Eq. (4.2),
$$
-c'' + c = 1, \qquad 0 < x < 1,
$$
with $c(x = 0) = 1$ and $c'(x = 1) = 0$, using N = 4 subintervals of length 0.25.

Weak form
Multiply by test functions ψ(x), integrate over the whole domain [0, 1], and use integration
by parts for the second derivative term (so that we only have first derivatives):
$$
\int_0^1 \left(\psi' c' + \psi c\right) dx - \psi c'\,\Big|_0^1 = \int_0^1 \psi\,dx
$$
for all suitable test functions ψ.

Galerkin FEM
Using the five basis functions $\varphi_0, \ldots, \varphi_4$ for ψ gives 5 equations with 5 unknowns,
$$
\int_0^1 \left(\varphi_i'(x)\,c' + \varphi_i(x)\,c\right) dx - \varphi_i(x)\,c'\,\Big|_0^1 = \int_0^1 \varphi_i(x)\,dx
$$
for 0 ≤ i ≤ 4. Each i corresponds to a row in the linear system: i = 0 corresponds to row
1, i = 1 corresponds to row 2, etc.
FEM approximation
Next the finite element approximation $c(x) = \sum_{j=0}^{4} c_j \varphi_j(x)$ is substituted in the terms
under the integral and the integral over the whole domain is written as a sum of integrals
over all elements,
$$
\sum_{l=1}^{4} \int_{e_l} \left[\varphi_i'(x) \sum_{j=0}^{4} c_j \varphi_j'(x) + \varphi_i(x) \sum_{j=0}^{4} c_j \varphi_j(x)\right] dx = \varphi_i(x)\,c'\,\Big|_0^1 + \sum_{l=1}^{4} \int_{e_l} \varphi_i(x)\,dx.
$$
We note for further reference that the term $\varphi_i(x)\,c'|_0^1 = \varphi_i(x=1)\,c'(x=1) - \varphi_i(x=0)\,c'(x=0)$
doesn't affect the equations for i = 1, . . . , 3. The only non-zero basis function
at x = 1 is $\varphi_4$ and the only non-zero basis function at x = 0 is $\varphi_0$. Thus the equation for
i = 0 has an extra term $-\varphi_0(x=0)\,c'(x=0) = -c'(x=0)$ and the equation for i = 4 has
an extra term $\varphi_4(x=1)\,c'(x=1) = c'(x=1)$.
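Element matrix and vector
The element matrix $a^{(l)}$ and element vector $f^{(l)}$ contain the contributions of the element
integrals. The term $\varphi_i(x)\,c'|_0^1$ is related to the boundary conditions and will be discussed
when applying the boundary conditions. On an element $e_l = [x_{l-1}, x_l]$, only two basis
functions are non-zero: $\varphi_{l-1}$ and $\varphi_l$. Thus only the equations for i = l − 1 and i = l give
non-zero contributions on element $e_l$ in the equation above. Similarly, only the j = l − 1 and j = l
terms in the finite element approximation $\sum_{j=0}^{4} c_j \varphi_j$ give a non-zero contribution to this
equation. The non-zero integrals can be put into a local matrix $a^{(l)}$,
$$
\begin{bmatrix} a_{11}^{(l)} & a_{12}^{(l)}\\ a_{21}^{(l)} & a_{22}^{(l)} \end{bmatrix}
\begin{bmatrix} c_{l-1}\\ c_l \end{bmatrix}
$$
with
$$
\begin{aligned}
a_{11}^{(l)} &= \int_{e_l} \left(\varphi_{l-1}'\varphi_{l-1}' + \varphi_{l-1}\varphi_{l-1}\right) dx,\\
a_{12}^{(l)} &= \int_{e_l} \left(\varphi_{l-1}'\varphi_l' + \varphi_{l-1}\varphi_l\right) dx,\\
a_{21}^{(l)} &= \int_{e_l} \left(\varphi_l'\varphi_{l-1}' + \varphi_l\varphi_{l-1}\right) dx,\\
a_{22}^{(l)} &= \int_{e_l} \left(\varphi_l'\varphi_l' + \varphi_l\varphi_l\right) dx,
\end{aligned}
$$
and a local vector $f^{(l)}$ with
$$
f_1^{(l)} = \int_{e_l} \varphi_{l-1}\,dx, \qquad f_2^{(l)} = \int_{e_l} \varphi_l\,dx.
$$
Row 1 in $a^{(l)}$ and $f^{(l)}$ corresponds to i = l − 1 and row 2 to i = l. Column 1 corresponds
to j = l − 1 and column 2 to j = l.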
Evaluating integrals using a reference element
By applying the transformation of variables discussed in Sec. 4.7.1, we can write all integrals
in terms of ξ. For the element matrix we get
$$
\begin{aligned}
a_{11}^{(l)} &= \frac{1}{h}\int_0^1 1\,d\xi + h\int_0^1 (1-\xi)^2\,d\xi = \frac{1}{h} + \frac{h}{3},\\
a_{12}^{(l)} = a_{21}^{(l)} &= \frac{1}{h}\int_0^1 (-1)\,d\xi + h\int_0^1 \xi(1-\xi)\,d\xi = -\frac{1}{h} + \frac{h}{6},\\
a_{22}^{(l)} &= \frac{1}{h}\int_0^1 1\,d\xi + h\int_0^1 \xi^2\,d\xi = \frac{1}{h} + \frac{h}{3}.
\end{aligned}
$$
For the element vector we get
$$
f_1^{(l)} = h\int_0^1 (1-\xi)\,d\xi = \frac{h}{2}, \qquad
f_2^{(l)} = h\int_0^1 \xi\,d\xi = \frac{h}{2}.
$$

Note that we have the same element matrix and element vector for every element in this
case (equally spaced grid points and constant coefficient ODE).
Assembling
The element matrices and vectors need to be assembled element-by-element into the global
matrix A and right-hand-side vector f. We start from the zero matrix and zero vector. We first
add the contributions of the first element, i = 0 and i = 1, corresponding to rows 1 and 2,
$$
A := \begin{bmatrix}
1/h + h/3 & -1/h + h/6 & 0 & 0 & 0\\
-1/h + h/6 & 1/h + h/3 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0
\end{bmatrix},
\qquad
f := \begin{bmatrix} h/2\\ h/2\\ 0\\ 0\\ 0 \end{bmatrix}.
$$
Then we add the contributions of the second element, i = 1 and i = 2, corresponding to
rows 2 and 3,
$$
A := \begin{bmatrix}
1/h + h/3 & -1/h + h/6 & 0 & 0 & 0\\
-1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 & 0\\
0 & -1/h + h/6 & 1/h + h/3 & 0 & 0\\
0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0
\end{bmatrix},
\qquad
f := \begin{bmatrix} h/2\\ h\\ h/2\\ 0\\ 0 \end{bmatrix}.
$$
Then we add the contributions of the third element, i = 2 and i = 3, corresponding to
rows 3 and 4,
$$
A := \begin{bmatrix}
1/h + h/3 & -1/h + h/6 & 0 & 0 & 0\\
-1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 & 0\\
0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0\\
0 & 0 & -1/h + h/6 & 1/h + h/3 & 0\\
0 & 0 & 0 & 0 & 0
\end{bmatrix},
\qquad
f := \begin{bmatrix} h/2\\ h\\ h\\ h/2\\ 0 \end{bmatrix}.
$$
Finally, we add the contributions of the fourth element, i = 3 and i = 4, corresponding to
rows 4 and 5,
$$
A := \begin{bmatrix}
1/h + h/3 & -1/h + h/6 & 0 & 0 & 0\\
-1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 & 0\\
0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0\\
0 & 0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6\\
0 & 0 & 0 & -1/h + h/6 & 1/h + h/3
\end{bmatrix},
\qquad
f := \begin{bmatrix} h/2\\ h\\ h\\ h\\ h/2 \end{bmatrix}.
$$
Boundary conditions
To incorporate c(0) = 1 we replace the first row by $c_0 = 1$. To incorporate $c'(1) = 0$ we
add $\varphi_4(x=1)\,c'(x=1) = c'(1) = 0$ to the right-hand side of the last row, $f_4$:
$$
\begin{bmatrix}
1 & 0 & 0 & 0 & 0\\
-1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 & 0\\
0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0\\
0 & 0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6\\
0 & 0 & 0 & -1/h + h/6 & 1/h + h/3
\end{bmatrix}
\begin{bmatrix} c_0\\ c_1\\ c_2\\ c_3\\ c_4 \end{bmatrix}
=
\begin{bmatrix} 1\\ h\\ h\\ h\\ h/2 + 0 \end{bmatrix}.
$$

This is a linear system that can be solved with any appropriate numerical technique.
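
The assembly and boundary-condition steps of this example can be sketched in a few lines of Matlab. This is an illustration only; the names ael, fel, and idx are chosen here. For this particular problem the exact solution is the constant c(x) = 1 (it satisfies the ODE and both boundary conditions), which gives a simple check of the assembled system.

```matlab
% Sketch: linear finite elements for -c'' + c = 1 on (0,1), c(0) = 1, c'(1) = 0.
N = 4;  h = 1/N;                         % four linear elements of length h

ael = [ 1/h + h/3, -1/h + h/6;           % element matrix (same for every element)
       -1/h + h/6,  1/h + h/3];
fel = [h/2; h/2];                        % element vector (same for every element)

A = zeros(N+1);  f = zeros(N+1,1);       % start from the zero matrix and vector
for l = 1:N
    idx = [l, l+1];                      % rows/columns of nodes x_{l-1} and x_l
    A(idx,idx) = A(idx,idx) + ael;       % add (not overwrite) the contributions
    f(idx)     = f(idx)     + fel;
end

A(1,:) = 0;  A(1,1) = 1;  f(1) = 1;      % Dirichlet condition: replace row 1 by c_0 = 1
f(N+1) = f(N+1) + 0;                     % Neumann condition: add c'(1) = 0 to the last row

c = A\f;                                 % nodal values c_0, ..., c_4
```

Because the exact solution is constant, it lies in the finite element space and the computed nodal values should all equal 1 up to round-off.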

## 4.7.3 Matrix-vector equation: general approach

If we need to program the finite element method, we do not want to write a new code
if we use a different number of grid points. So we want to find a more general equation
Ac = f in terms of cj for n elements e1 , . . . en and n + 1 grid points x0 , . . . , xn .
We consider again the pollution equation
$$
-\frac{d^2c}{dx^2} + 8\frac{dc}{dx} + 9c = 17 + 9x, \qquad -1 < x < 1,
$$
$$
c(x = -1) = e^{-18}, \qquad c'(x = 1) = 10.
$$
To point out the different treatment of Dirichlet and Neumann boundary conditions in
finite elements, we replaced the boundary condition c(x = 1) = 3 by the equivalent
Neumann boundary condition $c'(x = 1) = 10$ (obtained from the analytical solution).
Weak form
The weak form is obtained by multiplying the ODE by suitable test functions ψ and
integrating over the whole domain, here [−1, 1],
$$
\int_{-1}^{1} \psi\left[-c'' + 8c' + 9c\right] dx = \int_{-1}^{1} \psi\left[17 + 9x\right] dx
$$
for all suitable test functions ψ. Now we can use integration by parts for the $c''$ term to
reduce the order of the highest derivative,
$$
\int_{-1}^{1} \left(\psi' c' + \psi\left[8c' + 9c\right]\right) dx - \psi c'\,\Big|_{-1}^{1} = \int_{-1}^{1} \psi\left[17 + 9x\right] dx
$$
for all suitable test functions ψ.

Galerkin FEM
Using the n + 1 basis functions $\varphi_0, \ldots, \varphi_n$ for ψ gives n + 1 equations with n + 1 unknowns,
$$
\int_{-1}^{1} \left(\varphi_i'(x)\,c' + \varphi_i(x)\left[8c' + 9c\right]\right) dx - \varphi_i(x)\,c'\,\Big|_{-1}^{1} = \int_{-1}^{1} \varphi_i(x)\left[17 + 9x\right] dx
$$
for 0 ≤ i ≤ n. Each i corresponds to a row in the linear system: i = 0 corresponds to row
1, i = 1 corresponds to row 2, etc., until i = n, which corresponds to row n + 1.
FEM approximation
Next the finite element approximation $c(x) = \sum_{j=0}^{n} c_j \varphi_j(x)$ is substituted in the terms
under the integral and the integral over the whole domain is written as a sum of integrals
over all elements,
$$
\sum_{l=1}^{n} \int_{e_l} \left[\varphi_i' \sum_{j=0}^{n} c_j \varphi_j' + \varphi_i\left(8\sum_{j=0}^{n} c_j \varphi_j' + 9\sum_{j=0}^{n} c_j \varphi_j\right)\right] dx = \varphi_i\,c'\,\Big|_{-1}^{1} + \sum_{l=1}^{n} \int_{e_l} \varphi_i\,(17 + 9x)\,dx.
$$
We note for further reference that the term $\varphi_i(x)\,c'|_{-1}^{1} = \varphi_i(x=1)\,c'(x=1) - \varphi_i(x=-1)\,c'(x=-1)$
doesn't affect the equations for i = 1, . . . , n − 1. The only non-zero basis
function at x = 1 is $\varphi_n$ and the only non-zero basis function at x = −1 is $\varphi_0$. Thus the
equation for i = 0 has an extra term $-\varphi_0(x=-1)\,c'(x=-1) = -c'(x=-1)$ and the
equation for i = n has an extra term $\varphi_n(x=1)\,c'(x=1) = c'(x=1)$.
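Element matrix and vector
The element matrix $a^{(l)}$ and element vector $f^{(l)}$ contain the contributions of the element
integrals. The term $\varphi_i(x)\,c'|_{-1}^{1}$ is related to the boundary conditions and will be discussed
when applying the boundary conditions. On an element $e_l = [x_{l-1}, x_l]$, only two basis
functions are non-zero: $\varphi_{l-1}$ and $\varphi_l$. Thus only the equations for i = l − 1 and i = l give
non-zero contributions on element $e_l$ in the equation above. Similarly, only the j = l − 1 and j = l
terms in the finite element approximation $\sum_{j=0}^{n} c_j \varphi_j$ give a non-zero contribution to this
equation.
The non-zero integrals can be put into a local matrix $a^{(l)}$,
$$
\begin{bmatrix} a_{11}^{(l)} & a_{12}^{(l)}\\ a_{21}^{(l)} & a_{22}^{(l)} \end{bmatrix}
\begin{bmatrix} c_{l-1}\\ c_l \end{bmatrix}
$$
with
$$
\begin{aligned}
a_{11}^{(l)} &= \int_{e_l} \left(\varphi_{l-1}'\varphi_{l-1}' + \varphi_{l-1}\left[8\varphi_{l-1}' + 9\varphi_{l-1}\right]\right) dx,\\
a_{12}^{(l)} &= \int_{e_l} \left(\varphi_{l-1}'\varphi_l' + \varphi_{l-1}\left[8\varphi_l' + 9\varphi_l\right]\right) dx,\\
a_{21}^{(l)} &= \int_{e_l} \left(\varphi_l'\varphi_{l-1}' + \varphi_l\left[8\varphi_{l-1}' + 9\varphi_{l-1}\right]\right) dx,\\
a_{22}^{(l)} &= \int_{e_l} \left(\varphi_l'\varphi_l' + \varphi_l\left[8\varphi_l' + 9\varphi_l\right]\right) dx,
\end{aligned}
$$
and local vector $f^{(l)}$
$$
f_1^{(l)} = \int_{e_l} \varphi_{l-1}\,(17 + 9x)\,dx, \qquad
f_2^{(l)} = \int_{e_l} \varphi_l\,(17 + 9x)\,dx.
$$
Row 1 in $a^{(l)}$ and $f^{(l)}$ corresponds to i = l − 1 and row 2 to i = l. Column 1 corresponds
to j = l − 1 and column 2 to j = l.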
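Evaluating integrals using a reference element
By applying the transformation of variables discussed in Sec. 4.7.1, we can write all integrals
in terms of ξ. For the element matrix we get
$$
\begin{aligned}
a_{11}^{(l)} &= \int_0^1 \left[\frac{1}{h^2} - \frac{8}{h}(1-\xi) + 9(1-\xi)^2\right] h\,d\xi = \frac{1}{h} - 4 + 3h,\\
a_{12}^{(l)} &= \int_0^1 \left[-\frac{1}{h^2} + \frac{8}{h}(1-\xi) + 9\xi(1-\xi)\right] h\,d\xi = -\frac{1}{h} + 4 + \frac{3h}{2},\\
a_{21}^{(l)} &= \int_0^1 \left[-\frac{1}{h^2} - \frac{8}{h}\xi + 9\xi(1-\xi)\right] h\,d\xi = -\frac{1}{h} - 4 + \frac{3h}{2},\\
a_{22}^{(l)} &= \int_0^1 \left[\frac{1}{h^2} + \frac{8}{h}\xi + 9\xi^2\right] h\,d\xi = \frac{1}{h} + 4 + 3h.
\end{aligned}
$$
For the element vector we get
$$
\begin{aligned}
f_1^{(l)} &= h\int_0^1 (1-\xi)\left[17 + 9(x_{l-1} + h\xi)\right] d\xi = \frac{h}{2}(17 + 9x_{l-1}) + \frac{3h^2}{2},\\
f_2^{(l)} &= h\int_0^1 \xi\left[17 + 9(x_{l-1} + h\xi)\right] d\xi = \frac{h}{2}(17 + 9x_{l-1}) + 3h^2.
\end{aligned}
$$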
Note that, because of the dependence on x of the right-hand side, the element vector is
different for every element.
Assembling
The element matrices and vectors need to be assembled element-by-element into the global
matrix A and right-hand-side vector f. We start from the zero matrix and zero vector. We first
add the contributions of the first element, i = 0 and i = 1, corresponding to rows 1 and 2. At
element 1, we have $x_{l-1} = x_0$,
$$
A := \begin{bmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & 0 & \cdots & 0\\
a_{21}^{(1)} & a_{22}^{(1)} & 0 & \cdots & 0\\
0 & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & 0
\end{bmatrix},
\qquad
f := \begin{bmatrix} f_1^{(1)}\\ f_2^{(1)}\\ 0\\ \vdots\\ 0 \end{bmatrix}.
$$
Then we add the contributions of the second element, i = 1 and i = 2, corresponding to
rows 2 and 3,
$$
A := \begin{bmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & 0 & 0 & \cdots & 0\\
a_{21}^{(1)} & a_{22}^{(1)} + a_{11}^{(2)} & a_{12}^{(2)} & 0 & \cdots & 0\\
0 & a_{21}^{(2)} & a_{22}^{(2)} & 0 & \cdots & 0\\
0 & 0 & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & 0 & \cdots & 0
\end{bmatrix},
\qquad
f := \begin{bmatrix} f_1^{(1)}\\ f_2^{(1)} + f_1^{(2)}\\ f_2^{(2)}\\ 0\\ \vdots\\ 0 \end{bmatrix}.
$$
And so on until the contributions of the nth element, i = n − 1 and i = n, corresponding
to rows n and n + 1,
$$
A := \begin{bmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & 0 & \cdots & 0\\
a_{21}^{(1)} & a_{22}^{(1)} + a_{11}^{(2)} & a_{12}^{(2)} & \cdots & 0\\
0 & \ddots & \ddots & \ddots & 0\\
\vdots & \ddots & a_{21}^{(n-1)} & a_{22}^{(n-1)} + a_{11}^{(n)} & a_{12}^{(n)}\\
0 & \cdots & 0 & a_{21}^{(n)} & a_{22}^{(n)}
\end{bmatrix},
\qquad
f := \begin{bmatrix} f_1^{(1)}\\ f_2^{(1)} + f_1^{(2)}\\ \vdots\\ f_2^{(n-1)} + f_1^{(n)}\\ f_2^{(n)} \end{bmatrix}.
$$

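Substituting all the values for the element matrices and vectors gives
$$
A := \begin{bmatrix}
\frac{1}{h} - 4 + 3h & -\frac{1}{h} + 4 + \frac{3h}{2} & 0 & \cdots & 0\\
-\frac{1}{h} - 4 + \frac{3h}{2} & \frac{2}{h} + 6h & -\frac{1}{h} + 4 + \frac{3h}{2} & \cdots & 0\\
0 & \ddots & \ddots & \ddots & 0\\
\vdots & \ddots & -\frac{1}{h} - 4 + \frac{3h}{2} & \frac{2}{h} + 6h & -\frac{1}{h} + 4 + \frac{3h}{2}\\
0 & \cdots & 0 & -\frac{1}{h} - 4 + \frac{3h}{2} & \frac{1}{h} + 4 + 3h
\end{bmatrix},
$$
$$
f := \begin{bmatrix}
(17 + 9x_0)\frac{h}{2} + \frac{3h^2}{2}\\
(34 + 9x_0 + 9x_1)\frac{h}{2} + \frac{9h^2}{2}\\
\vdots\\
(34 + 9x_{n-2} + 9x_{n-1})\frac{h}{2} + \frac{9h^2}{2}\\
(17 + 9x_{n-1})\frac{h}{2} + 3h^2
\end{bmatrix}.
$$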

Boundary conditions
To incorporate the Dirichlet boundary condition $c(-1) = e^{-18}$ we replace the first row by
$c_0 = e^{-18}$. To incorporate $c'(1) = 10$ we add $c'(1) = 10$ to the right-hand side of the last row,
$f_n$,
$$
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0\\
a_{21}^{(1)} & a_{22}^{(1)} + a_{11}^{(2)} & a_{12}^{(2)} & \cdots & 0\\
0 & \ddots & \ddots & \ddots & 0\\
\vdots & \ddots & a_{21}^{(n-1)} & a_{22}^{(n-1)} + a_{11}^{(n)} & a_{12}^{(n)}\\
0 & \cdots & 0 & a_{21}^{(n)} & a_{22}^{(n)}
\end{bmatrix}
\begin{bmatrix} c_0\\ c_1\\ \vdots\\ c_{n-1}\\ c_n \end{bmatrix}
=
\begin{bmatrix} e^{-18}\\ f_2^{(1)} + f_1^{(2)}\\ \vdots\\ f_2^{(n-1)} + f_1^{(n)}\\ f_2^{(n)} + 10 \end{bmatrix}.
$$

This is a linear system that can be solved with any appropriate numerical technique.
For a Robin boundary condition at x = 1, $c' = \alpha c + \beta$, the right-hand side of the
boundary condition also contains c, so that the last row of the matrix A would change
as well. The way natural boundary conditions are handled in 2D and 3D, including
curved boundaries, is similar (just substitution), but there are of course more nodes on
the boundary. This makes boundary conditions much easier to handle in the finite element
method than in the finite difference method.
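
The general approach can be sketched as follows, using the analytically evaluated element matrix and element vector from this section. This is a minimal illustration, not the course implementation: a full matrix is used for simplicity and all variable names are chosen here. The closed-form solution c(x) = 1 + x + e^{9(x−1)} used for the comparison is not given in the text; it can be verified by substitution into the ODE and the boundary conditions.

```matlab
% Sketch: linear finite elements for -c'' + 8c' + 9c = 17 + 9x on (-1,1),
% with c(-1) = exp(-18) (Dirichlet) and c'(1) = 10 (Neumann).
n = 320;  h = 2/n;                       % n linear elements of length h
x = -1 + (0:n)'*h;                       % nodes x_0, ..., x_n

ael = [ 1/h - 4 + 3*h,    -1/h + 4 + 3*h/2;   % element matrix (same for every element)
       -1/h - 4 + 3*h/2,   1/h + 4 + 3*h ];

A = zeros(n+1);  f = zeros(n+1,1);
for l = 1:n
    idx = [l, l+1];                      % rows/columns of nodes x_{l-1} and x_l
    xl  = x(l);                          % x_{l-1} in the notation of the text
    fel = [ h/2*(17 + 9*xl) + 3*h^2/2;   % element vector depends on the element
            h/2*(17 + 9*xl) + 3*h^2 ];
    A(idx,idx) = A(idx,idx) + ael;
    f(idx)     = f(idx)     + fel;
end

A(1,:) = 0;  A(1,1) = 1;  f(1) = exp(-18);   % Dirichlet: replace the first row
f(n+1) = f(n+1) + 10;                        % Neumann: add c'(1) to the last row

c   = A\f;
err = norm(c - (1 + x + exp(9*(x - 1))), inf);   % compare with the closed form
```

Note that the Dirichlet condition overwrites a row, whereas the Neumann condition only adds a number to the right-hand side.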

4.7.4 Numerical computation of the integrals
For an ODE $c'' + p(x)c' + q(x)c = r(x)$ with relatively simple functions p(x), q(x), and
r(x) the integrals in the element matrices and vectors can be computed analytically. For
more complicated functions, the integrals can only be approximated numerically.
An integral can be approximated by a weighted sum of function values (numerical
quadrature) in integration points $x_i$,
$$
\int_a^b f(x)\,dx \approx \sum_{i=1}^{n_{int}} w_i f(x_i),
$$

where nint is the number of integration points and wi the weight for each integration point.
We only discuss two closed Newton-Cotes formulas, i.e. formulas that contain the
endpoints of the interval and additional points are chosen so that integration points are
equally spaced over the interval.
The trapezoidal rule only uses the two endpoints of the interval, i.e. $\xi_1 = 0$ and $\xi_2 = 1$
on the reference element 0 ≤ ξ ≤ 1,
$$
\int_0^1 g(\xi)\,d\xi \approx \frac{1}{2}\left[g(0) + g(1)\right] = \sum_{i=1}^{n_{int}} w_i\,g(\xi_i)
$$
with $n_{int} = 2$, $w_1 = w_2 = 1/2$, $\xi_1 = 0$, and $\xi_2 = 1$. The trapezoidal rule is exact for
polynomials up to degree 1.
Simpson's rule uses the two endpoints of the interval and the midpoint of the interval,
i.e. $\xi_1 = 0$, $\xi_2 = 1/2$, and $\xi_3 = 1$ on the reference element 0 ≤ ξ ≤ 1,
$$
\int_0^1 g(\xi)\,d\xi \approx \frac{1}{6}\left[g(0) + 4g(1/2) + g(1)\right] = \sum_{i=1}^{n_{int}} w_i\,g(\xi_i)
$$
with $n_{int} = 3$, $w_1 = w_3 = 1/6$, $w_2 = 4/6$, $\xi_1 = 0$, $\xi_2 = 1/2$, and $\xi_3 = 1$. Simpson's rule is
exact for polynomials up to degree 3.
Alternatively, Gauss integration can be used (details are discussed in Math 4446).
Gauss integration is a little more complicated but more efficient: fewer function evaluations
are necessary than for a Newton–Cotes formula with the same order of error.
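
The two Newton–Cotes rules above can be checked with a short sketch on the reference element. The helper name quadsum and the test polynomials g1 and g3 are chosen here for illustration only.

```matlab
% Sketch: trapezoidal and Simpson's rule on the reference element [0, 1]
xi_trap = [0, 1];        w_trap = [1/2, 1/2];        % trapezoidal rule
xi_simp = [0, 1/2, 1];   w_simp = [1/6, 4/6, 1/6];   % Simpson's rule

quadsum = @(g, xi, w) sum(w .* g(xi));   % weighted sum of function values

g1 = @(xi) 2*xi + 1;     % degree 1, exact integral over [0,1] is 2
g3 = @(xi) 4*xi.^3;      % degree 3, exact integral over [0,1] is 1

err_trap_g1 = abs(quadsum(g1, xi_trap, w_trap) - 2);   % trapezoidal, degree 1
err_trap_g3 = abs(quadsum(g3, xi_trap, w_trap) - 1);   % trapezoidal, degree 3
err_simp_g3 = abs(quadsum(g3, xi_simp, w_simp) - 1);   % Simpson, degree 3
```

The trapezoidal rule integrates g1 exactly but not g3, while Simpson's rule integrates the cubic g3 exactly (up to round-off), in line with the degrees of exactness stated above.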

## 4.7.5 Programming finite elements

There are various ways to program finite elements in a well-structured way. Here, I give
one possibility to program the second-order ODE $c'' + p(x)c' + q(x)c = r(x)$ with proper
boundary conditions.
The finite elements can be programmed using the same overall structure as the finite
difference method. We will only discuss the differences. In the function that sets up the
matrix A and right-hand-side vector f, contributions are added element-by-element. We
name this function fematvec. Instead of a for-loop over all grid points, we now have a
for-loop over all elements. Since row i + 1, corresponding to test function $\varphi_i$, has contributions
from two different elements, we need to add the local matrix $a^{(l)}$ and vector $f^{(l)}$ to the
matrix A and vector f (for FD you can just set the elements of a row). Inside the
for-loop the element matrix and vector need to be computed and assembled into the
global matrix A and vector f. If you distinguish, for example, different quadrature rules
and element orders, it is more convenient to compute the element matrix and element
vector in a separate function that we name fematvecelm. To keep fematvecelm general, a
function ode_param is called to evaluate the functions p(x), q(x), and r(x) at a point x.
Boundary conditions would typically be incorporated in a separate function febc, since
this is independent of how you handle the element contributions. To keep this function
general, you would use another function to specify the type and value of the boundary
condition used at each end point. Implementation of a Dirichlet boundary condition in
a certain row requires that all existing values in that row of A and f are replaced by
the boundary condition. Implementation of a Neumann/Robin boundary condition in a
certain row requires that the boundary condition is added to the existing row.
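
The structure described above can be sketched in a single script. This is not the course code: the function handles p, q, and r stand in for ode_param, the inner loop plays the role of fematvecelm (here with Simpson's rule), the outer loop plays the role of fematvec, and the last block plays the role of febc. The weak form used for $c'' + pc' + qc = r$ is an assumption made here: after integration by parts the element integrand is $-\varphi_i'\varphi_j' + \varphi_i(p\varphi_j' + q\varphi_j)$. The test problem is the one from Sec. 4.8 with Dirichlet conditions at both ends, and the closed form 1 + x + e^{9(x−1)} can be checked by substitution.

```matlab
% Sketch: linear finite elements with numerical quadrature for c'' + p c' + q c = r
p = @(x) -8 + 0*x;                       % stand-ins for ode_param (hypothetical)
q = @(x) -9 + 0*x;
r = @(x) -17 - 9*x;
a = -1;  b = 1;  n = 320;  h = (b - a)/n;
x = a + (0:n)'*h;                        % nodes x_0, ..., x_n

xiq  = [0, 1/2, 1];  wq = [1/6, 4/6, 1/6];   % Simpson's rule on the reference element
phi  = @(xi) [1 - xi; xi];               % reference basis functions
dphi = [-1; 1]/h;                        % their derivatives with respect to x

A = zeros(n+1);  f = zeros(n+1,1);
for l = 1:n                              % loop over all elements
    ael = zeros(2);  fel = zeros(2,1);
    for k = 1:numel(xiq)                 % loop over the integration points
        xk  = x(l) + h*xiq(k);           % integration point mapped to element l
        ph  = phi(xiq(k));
        ael = ael + wq(k)*h*( -dphi*dphi' + ph*(p(xk)*dphi' + q(xk)*ph') );
        fel = fel + wq(k)*h*( ph*r(xk) );
    end
    idx = [l, l+1];
    A(idx,idx) = A(idx,idx) + ael;       % add to the existing entries
    f(idx)     = f(idx)     + fel;
end

A(1,:)   = 0;  A(1,1)     = 1;  f(1)   = exp(-18);   % Dirichlet at x = a
A(n+1,:) = 0;  A(n+1,n+1) = 1;  f(n+1) = 3;          % Dirichlet at x = b

c   = A\f;
err = norm(c - (1 + x + exp(9*(x - 1))), inf);
```

For this constant-coefficient problem Simpson's rule integrates the element contributions exactly, so the computed element matrix is the negative of the one derived analytically in Sec. 4.7.3 (the ODE is multiplied by −1 relative to that section).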

4.8 Convergence of numerical methods for BVPs
We expect to get a better approximation to the solution of a BVP if we decrease the
grid size h. The order of convergence tells how fast the numerical solution approaches
the exact solution. Since we solve a system of equations, we use error norms to measure
the error (see Sec. 3.3). In this section we measure the actual error using the infinity
norm $e_\infty = \|c(x_i) - c_i\|_\infty$, i.e. we compute at every grid point the absolute difference with
the exact solution and take the maximum.
To obtain the order of convergence we need to determine how fast the error approaches
zero if h → 0. Thus we need to determine the value p in $e = O(h^p) = Ch^p$. For this we
need to do numerical computations on several grids, and record the error with respect to the
exact solution (or, if this is not available, a very accurate numerical solution). To determine p
we take the logarithm of $e = Ch^p$,
$$
\log_{10} e = \log_{10}(Ch^p) = \log_{10} C + p \log_{10} h.
$$
Thus p is the slope on a plot of $\log_{10} e$ versus $\log_{10} h$. From the plot you can select suitable
values to determine p: h should not be too large (the order of convergence is for h → 0)
and not too small (round-off errors). Note that when h is small, you subtract nearly equal
numbers in the numerator of an FD formula, since the values of $y_{i-1}$, $y_i$, and $y_{i+1}$ only differ
slightly. In addition, you divide by a small number of O(h) for an FD formula of a first
derivative or of $O(h^2)$ for an FD formula of a second derivative. This limits the smallest
grid size h that you can use in your calculations. For too small values of h, the error $e_\infty$
will not be dominated by the error due to the finite difference approximation but by the
round-off errors due to finite precision calculations, and $e_\infty$ will start to
increase. This is not a real problem in Matlab, since you will run out of memory before
round-off errors become important.
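
As an illustration of how the slope p can be extracted, the sketch below fits a straight line through the first five (h, e∞) pairs listed in Table 4.1, using polyfit on the logarithms. The variable names are chosen here, and the choice of which grid sizes to include in the fit is an assumption.

```matlab
% Sketch: estimate the order of convergence p from errors on several grids
h = 1./[10 20 40 80 160];                        % grid sizes (Table 4.1)
e = [1.87e-2 4.41e-3 1.09e-3 2.72e-4 6.79e-5];   % corresponding errors (Table 4.1)

coef = polyfit(log10(h), log10(e), 1);   % least-squares fit log10(e) = p*log10(h) + log10(C)
p    = coef(1);                          % slope: estimated order of convergence
C    = 10^coef(2);                       % estimated error constant
```

The fitted slope is close to 2, as expected for an $O(h^2)$ scheme. Grid sizes for which round-off errors dominate should be left out of the fit.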
To determine the order of convergence numerically, we consider the problem
$$
\frac{d^2c}{dx^2} - 8\frac{dc}{dx} - 9c = -17 - 9x, \qquad -1 < x < 1,
$$
$$
c(x = -1) = e^{-18}, \qquad c(x = 1) = 3.
$$

## 4.8.1 Finite differences

We assume that the solution c(x) is sufficiently smooth, i.e. all derivatives exist and are
continuous. Then for finite difference schemes, the order of the actual error $e_\infty$ is determined by the
sum of the discretization errors in each finite difference formula used to approximate a
derivative of a certain order. Since $O(h^{p-1}) + O(h^p) = O(h^{p-1})$, the order of the actual
error is governed by the lowest order of the error term in one of the finite difference
formulas used.
We consider the case where both $c'$ and $c''$ are approximated by centered finite
difference formulas, which are both $O(h^2)$. We therefore expect that the numerical error
$e_\infty$ we find is $O(h^2) + O(h^2) = O(h^2)$. Similarly, if O(h) difference formulas are used in a
finite difference method, we would find a numerical error $e_\infty$ of O(h). If $O(h^3)$ difference
formulas are used for all derivatives, we would find a numerical error $e_\infty$ of $O(h^3)$, and so
on.
We start with a medium grid size of h = 0.1 to compute the numerical solution. The next
meshes are obtained by dividing the grid size by a factor of two (i.e. doubling the number
of intervals). Fig. 4.9 shows $e_\infty$ for the various grid sizes considered. We indeed observe a
slope of approximately 2 up to $h \approx 10^{-4.5}$. For smaller values of h, $e_\infty$ starts to increase. This is caused
by round-off errors, which are then the dominant contribution to the error $e_\infty$.

Figure 4.9: Error plot for $O(h^2)$ centered finite difference scheme ($\log_{10} e_\infty$ versus $\log_{10} h$).

Table 4.1 shows the numerical values of $e_\infty$ for the various grid sizes considered. If the step size h
is halved for an $O(h^2)$ method, the error on the refined mesh is $O((h/2)^2) = \frac{1}{4} O(h^2)$,
i.e. 1/4 of the error using mesh size h. This is what we observe in Table 4.1 up to
h = 1/81920. For smaller values of h, the round-off error becomes important and the error
starts to grow rapidly: an increase by a factor of almost 10 instead of a decrease
by a factor of 4.

## 4.8.2 Finite elements

We assume that the solution c(x) is sufficiently smooth, i.e. all derivatives exist and
are continuous. Then for finite elements of degree p, the order of the actual error $e_\infty$
is $O(h^{p+1})$. Thus for linear elements we expect $e_\infty = O(h^2)$, for quadratic elements
$e_\infty = O(h^3)$, for cubic elements $e_\infty = O(h^4)$, etc.
We consider the same grids as for the FD calculations. Fig. 4.10 shows $e_\infty$ using linear
finite elements for the various grid sizes considered. We indeed observe a slope of approximately 2,
which is one order higher than the order of the element. When round-off errors become
important, the error would start to increase, just as for finite differences. The quadratic
convergence can be observed more accurately from Table 4.2, which shows the numerical
values of $e_\infty$ for the various grid sizes considered. If the step size h is halved, the error on
the refined mesh is 1/4 of the error using mesh size h, typical for quadratic convergence.

| h | e∞ | ratio |
|---|---|---|
| 1/10 | 1.87e-2 | - |
| 1/20 | 4.41e-3 | 0.235 |
| 1/40 | 1.09e-3 | 0.246 |
| 1/80 | 2.72e-4 | 0.250 |
| 1/160 | 6.79e-5 | 0.250 |
| 1/320 | 1.70e-5 | 0.250 |
| 1/640 | 4.24e-6 | 0.250 |
| 1/1280 | 1.06e-6 | 0.250 |
| 1/2560 | 2.65e-7 | 0.250 |
| 1/5120 | 6.63e-8 | 0.250 |
| 1/10240 | 1.66e-8 | 0.250 |
| 1/20480 | 4.15e-9 | 0.250 |
| 1/40960 | 1.07e-9 | 0.251 |
| 1/81920 | 1.07e-9 | 0.258 |
| 1/163840 | 1.07e-8 | 9.993 |
| 1/327680 | 9.89e-8 | 9.237 |

Table 4.1: Error for $O(h^2)$ centered finite difference scheme.

Figure 4.10: Error plot for $O(h^2)$ linear finite elements ($\log_{10} e_\infty$ versus $\log_{10} h$).

| h | e∞ | ratio |
|---|---|---|
| 1/10 | 2.47e-2 | - |
| 1/20 | 5.76e-3 | 0.233 |
| 1/40 | 1.42e-3 | 0.246 |
| 1/80 | 3.54e-4 | 0.250 |
| 1/160 | 8.84e-5 | 0.250 |
| 1/320 | 2.21e-5 | 0.250 |
| 1/640 | 5.52e-6 | 0.250 |
| 1/1280 | 1.38e-6 | 0.250 |
| 1/2560 | 3.45e-7 | 0.250 |
| 1/5120 | 8.62e-8 | 0.250 |

Table 4.2: Error for $O(h^2)$ linear finite elements.

4.9 Solving linear systems for BVPs: Crout’s method
Typically, the number of grid points needed to obtain a decent numerical approximation is
relatively small for boundary value problems, and the linear system can still be solved
efficiently using a direct method. To obtain more accurate numerical approximations one
could use a finer grid (typically non-uniform) or higher-order finite difference or finite
element schemes. For the FD scheme using $O(h^2)$ centered finite difference approximations
and for the linear finite elements we considered, a tridiagonal matrix was obtained. For this
type of matrix a very efficient direct solver is available: Crout's method, which uses only
O(n) operations to solve a (tridiagonal) linear system. Higher-order finite differences and
higher-order finite elements lead to more than 3 non-zero diagonals, and solving the
linear system is computationally more expensive. Using the basic Gaussian elimination
technique would take $O(n^3)$ operations. The most computationally efficient approach is
therefore usually to use the $O(h^2)$ schemes in combination with grid refinement. Note
that this argument only holds for BVPs, where only one spatial dimension is involved.
For $O(h^2)$ methods for PDEs in two or three dimensions the resulting linear system is
no longer tridiagonal. Since Crout's method only requires O(n) operations, there is no
advantage in using iterative techniques to solve the linear system: iterative techniques
use O(n) operations per iteration.

Background
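A tridiagonal N × N matrix (which arises in $O(h^2)$ finite difference methods and linear
finite elements for BVPs)
\[
A = \begin{pmatrix}
a_{11} & a_{12} & 0 & \cdots & 0 \\
a_{21} & a_{22} & a_{23} & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & \ddots & \ddots & a_{N-1,N} \\
0 & \cdots & 0 & a_{N,N-1} & a_{NN}
\end{pmatrix}
\]
has a Crout factorization A = LU of the form
\[
L = \begin{pmatrix}
l_{11} & 0 & \cdots & \cdots & 0 \\
l_{21} & l_{22} & \ddots & & \vdots \\
0 & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & l_{N,N-1} & l_{NN}
\end{pmatrix},
\qquad
U = \begin{pmatrix}
1 & u_{12} & 0 & \cdots & 0 \\
0 & 1 & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
\vdots & & \ddots & \ddots & u_{N-1,N} \\
0 & \cdots & \cdots & 0 & 1
\end{pmatrix}.
\]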
This is easy to check: just take the matrix product LU and start comparing coefficients
from the top to the bottom row. If we have the LU factorization of the matrix A, we can
solve Ax = b very fast. Since A = LU we need to solve
LU x = b.

The matrix vector product U x is a vector, say y. Thus we can first solve the lower
triangular system
Ly = b
using forward substitution (first y1 , then y2 using the already computed value of y1 etc.)
to find the intermediate vector y, and then solve

Ux = y

using backward substitution (first xN , then xN −1 etc.) to find the solution x of the system
U x = y which is the solution of LU x = b.
The non-zero entries in L and U can be calculated using the Crout factorization
algorithm for tridiagonal systems and the system LU x = b can be solved using the Crout
forward/backward substitution algorithm.

• The solution is subject to round-off errors only.

• Computing a Crout factorization and solving a system is very fast. The Crout
factorization requires only about 3N operations, i.e. $O(N)$. The forward and backward
substitutions are also only $O(N)$. This is very cheap compared to Gaussian elimination with
backward substitution ($O(N^3)$ operations) and faster than any iterative technique can be
($O(N)$ operations per iteration).

• The method is only for tridiagonal systems.

Algorithm for Crout factorization for tridiagonal systems
Input: tridiagonal matrix A
Output: Crout LU factorization of tridiagonal matrix A

First row of L and U
    Set $l_{11} = a_{11}$
    Set $u_{12} = a_{12} / l_{11}$
Rows 2 to N − 1 of L and U
    Do for i = 2, . . . , N − 1
        Set $l_{i,i-1} = a_{i,i-1}$
        Set $l_{ii} = a_{ii} - l_{i,i-1} u_{i-1,i}$
        Set $u_{i,i+1} = a_{i,i+1} / l_{ii}$
    End Do (i-loop)
Last row of L
    Set $l_{N,N-1} = a_{N,N-1}$
    Set $l_{NN} = a_{NN} - l_{N,N-1} u_{N-1,N}$

Algorithm for Crout forward/backward substitution
Input: L and U of the Crout factorization, right-hand side b
Output: Solution vector x of Ax = b

First solve Ly = b using forward substitution
    $y_1 = b_1 / l_{11}$
    Do for i = 2, . . . , N
        $y_i = (b_i - l_{i,i-1} y_{i-1}) / l_{ii}$
    End Do (i-loop)
Now solve Ux = y using backward substitution
    $x_N = y_N$
    Do for i = N − 1, . . . , 1
        $x_i = y_i - u_{i,i+1} x_{i+1}$
    End Do (i-loop)

Remarks on Crout factorization algorithm:

• The factorization cannot be performed when $l_{ii} = 0$ for some i. In a computer code
you might want to add a test and error message for this to make the code more
robust. For matrices arising from 1D finite difference and finite element methods
you typically have $l_{ii} > 0$.

• It is not necessary to define a full matrix for A, L, and U . Storing all zeros is a
waste of memory. Only storing the unknown entries in L and U as arrays and using
the proper array elements in the forward and backward substitution is sufficient.
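The two algorithms and the remarks above can be combined into one short script. This is a minimal sketch, not the course's reference implementation: the array names (`sub`, `main`, `sup`, `l`, `u`) and the small test system are assumptions made for illustration. Only the three diagonals of A, the diagonal of L, and the super-diagonal of U are stored.

```matlab
% Hypothetical tridiagonal test system: 2 on the diagonal, -1 next to it
N = 5;
sub  = -ones(N-1, 1);    % sub(i)  = a(i+1,i), sub-diagonal of A (and of L)
main = 2*ones(N, 1);     % main(i) = a(i,i),   main diagonal of A
sup  = -ones(N-1, 1);    % sup(i)  = a(i,i+1), super-diagonal of A
b    = ones(N, 1);       % right-hand side

% Crout factorization
l = zeros(N, 1);         % l(i) = l(i,i)
u = zeros(N-1, 1);       % u(i) = u(i,i+1)
l(1) = main(1);
for i = 1:N-1
    if l(i) == 0
        error('Crout factorization failed: l(%d,%d) = 0', i, i);
    end
    u(i)   = sup(i) / l(i);
    l(i+1) = main(i+1) - sub(i)*u(i);
end
if l(N) == 0
    error('Crout factorization failed: l(%d,%d) = 0', N, N);
end

% Forward substitution: solve L y = b
y = zeros(N, 1);
y(1) = b(1) / l(1);
for i = 2:N
    y(i) = (b(i) - sub(i-1)*y(i-1)) / l(i);
end

% Backward substitution: solve U x = y
x = zeros(N, 1);
x(N) = y(N);
for i = N-1:-1:1
    x(i) = y(i) - u(i)*x(i+1);
end

% Check against the full matrix (built here only for verification)
A = diag(main) + diag(sub, -1) + diag(sup, 1);
residual = norm(A*x - b, inf);
```

The full matrix A is formed only to verify the result; the solver itself touches each stored entry a fixed number of times, which is where the O(N) cost comes from.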

Chapter 5

• Discretization in time: grids.

• Accuracy

• Stability

• Order of convergence

Numerical methods:

• Solving initial value problems (IVPs)

– Euler
– Trapezoidal rule
– Runge–Kutta

• polynomial approximation

– Lagrange polynomials
– piecewise Lagrange polynomials

## 5.1 Problem description and modelling
Problem: A hot cup of tea (of temperature T0 ) is placed in a room with a lower tem-
perature. What is the temperature of the tea as a function of time?
Simplifying assumptions: We assume that the tea is well-stirred so that its tempera-
ture doesn’t vary in space, only in time: T = T (t). In Chap. 7 we consider the temperature
to be a function of time and space. The room is large, so we may assume that the tem-
perature of the room is basically unchanged by the tea, i.e. the room temperature Tsur
remains constant.
Basic model:
"Rate of change = rate of increase − rate of decrease".
Apply this to temperature (see footnote 1):
• Rate of change: dT/dt, change of temperature in time.
• Rate of increase: 0, no heat source.
• Rate of decrease: the cup loses energy to the environment. Newton’s cooling law:
experimentally one observes that the rate at which the temperature of the liquid decreases
is proportional to the difference in temperature between the object and its
surroundings: $k(T - T_{sur})$.
Substitution results in the initial value problem
\[ \frac{dT}{dt} = -k(T - T_{sur}), \qquad T(t = 0) = T_0. \]
Henceforth, we take k = 1, $T_{sur} = 20$, and $T_0 = 80$. (All quantities are expressed in an
arbitrary consistent system of units.) Substituting gives
\[ \frac{dT}{dt} = -T + 20, \qquad T(t = 0) = 80, \]
which we will solve numerically on the time interval 0 ≤ t ≤ 10.
The cooling model is a linear, nonhomogeneous first-order initial value problem (IVP),
which has the general form $y' + p(t)y = g(t)$, $y(t_0) = y_0$. From differential equations
we know that a unique solution exists if p(t) and g(t) are continuous functions on the t
interval considered. For those cases, a numerical technique should give a good numerical
approximation to the solution. If it does not, there is likely an error in the program.
For nonlinear first-order IVPs, which have the general form $y' = f(t, y)$, $y(t_0) = y_0$, it
is often unknown whether a unique solution exists or not. If you experience problems in
solving nonlinear IVPs, there can be several causes: a unique solution might not exist,
the numerical method or the grid might not be good enough, or you might have an error in your
program. A careful inspection of the numerical results is then necessary to determine the
most likely cause and a possible fix.
Footnote 1: Actually it should be an internal energy balance, but for the usual assumption
that the internal energy only depends on temperature, the result is the same.

## 5.2 Solving first order linear IVPs analytically
In this section we discuss two methods to obtain solutions of a (linear) IVP. In the next
sections we will solve the linear IVP numerically and see whether numerical methods
converge to the analytical solution found in this section.

## 5.2.1 Analytical solution

Analytical solution of a first-order linear nonhomogeneous IVP:
\[ y' + p(t)y = g(t). \]
First find the integrating factor µ(t):
\[ \mu(t) = \exp\left( \int^{t} p(s)\, ds \right). \]
Then multiply the ODE by µ(t):
\[ \mu y' + \mu p(t) y = \mu(t) g(t) \]
and combine the terms on the left (using the product rule and $\mu' = p(t)\mu$):
\[ \frac{d(\mu y)}{dt} = \mu(t) g(t). \]
Integration gives
\[ y(t) = \frac{1}{\mu(t)} \left( \int \mu(t) g(t)\, dt + C \right). \]
The constant C can be determined from the initial condition.
In our example, we have p(t) = 1 and g(t) = 20. This gives $\mu(t) = \exp(\int 1\, dt) = e^{t}$.
Multiplying and combining terms on the left, the ODE results in
\[ \frac{d(e^{t} T)}{dt} = 20 e^{t}. \]
After integration and dividing by $e^{t}$, we get
\[ T(t) = 20 + C e^{-t}, \]
where C needs to be determined from the initial condition: 80 = T(0) = 20 + C or C = 60.
The (unique) solution to the IVP is thus
\[ T(t) = 20 + 60 e^{-t}. \]

## 5.2.2 Symbolical calculations
ODEs and IVPs can be solved symbolically using Matlab with dsolve. In order to solve
the 1st order ODE $T' = -T + 20$, use
T = dsolve('DT=20-T')
which gives the general solution
T = 20+exp(-t)*C1
In order to solve the 1st order IVP $T' = -T + 20$ with T(0) = 80, just add the initial value
T = dsolve('DT=20-T', 'T(0)=80')
which gives the unique solution
T = 20+60*exp(-t)
If Matlab cannot solve the ODE, for example $T' = \sin(T^2)$,
dsolve('DT=sin(T^2)')
it will respond
T = RootOf(-t+Int(1/sin( a^2), a = .. Z)-C1)
which means that it cannot find an analytical solution.

## 5.3 Solving IVPs numerically: Introduction

## 5.3.1 Grids
The solution to an initial value problem is a function y(t) which is defined for every t. If
we use numerical techniques, we find an approximation to the solution at certain discrete
time levels $t_i$ only. The collection of $t_i$’s is called a (computational) grid. Before we can
solve an IVP numerically, we need to specify the computational grid, i.e. all times $t_i$ in
[a, b] for which we want to obtain a numerical approximation. A numerical technique will
produce approximations to the function y at these grid points only, i.e. we will obtain
approximations for the values $y(t_i)$, which will be denoted by $y_i$. See Fig. 5.1.
Figure 5.1: Grid points $t_i$ (from $a = t_0$ to $t_n = b$) and numerical approximations $y_i$ at the grid points, with $y_0 = y(a)$.

Often we choose the interval [a, b] to be divided into N equally spaced subintervals of
length h = (b − a)/N , which corresponds to the grid points
ti = a + ih
for i = 0, . . . , N . The length of a subinterval h is called the step size.
Example
We solve an initial value problem numerically on [0, 1] and divide the interval [0, 1] into
N = 4 equally spaced subintervals. The length of a subinterval is h = (1 − 0)/4 = 1/4.
We obtain a numerical approximation to the solution at the N + 1 = 5 discrete time levels
only: t0 = 0 (just the initial condition), t1 = 1/4, t2 = 1/2, t3 = 3/4, and t4 = 1.
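In Matlab the grid of this example can be built in one line. This is a small sketch with illustrative variable names; note that Matlab arrays start at index 1, so `t(1)` holds $t_0$ and `t(N+1)` holds $t_N$.

```matlab
a = 0; b = 1;        % interval [a, b] from the example
N = 4;               % number of subintervals
h = (b - a) / N;     % step size
t = a + (0:N) * h;   % the N+1 grid points t_0, ..., t_N
disp(t)
```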
Remarks:
1. Typically, the more grid points the more accurate the approximation to the solution
and the more work to compute the approximation. The goal is to compute an accu-
rate numerical solution with as few grid points as possible, to minimize computing
time.
2. Almost always there are certain regions in [a, b] where the solution changes more
rapidly (i.e. where you want more grid points). An equally spaced grid is then not
the best choice.

3. An equally spaced grid is the easiest setting in which to introduce the numerical
techniques and will therefore be used in this chapter.

## 5.3.2 Numerical techniques

If an IVP cannot be solved analytically, one can try to solve it numerically. Numerical
techniques, however, will not always work either. If no solution exists for the IVP, we can’t
expect that a numerical technique will produce something useful. In addition, there is no
single best method to solve IVPs. Which method to choose depends on the problem you want
to solve. Issues are

1. Accuracy.

2. Stability.

3. Computing time and memory (Typically not a very important issue for IVPs, only
for time-dependent PDEs in 2 or 3 spatial dimensions)

We only discuss some well-known one-step methods, for which a solution at some time
ti is computed using only quantities at the previous time level ti−1 . We discuss Euler’s
method, trapezoidal rule, and Runge-Kutta methods. We focus on how these methods
work, how to program them, and typical numerical issues.

## 5.4 Solving IVPs numerically using Matlab: ode45
Matlab has several built-in functions to solve initial value problems numerically. We only
discuss ode45, which uses a Runge–Kutta technique with adaptive time stepping, i.e. at every
step a suitable value of the step size h is determined in order to obtain a solution within
a specified tolerance. Matlab’s ode45 solves the IVP $y' = f(t, y)$ with $y(a) = y_0$ on the
time interval a ≤ t ≤ b.
To solve the cooling problem with default tolerances, one would type in the Matlab
Command Window
[t, T] = ode45('func_ode', [0 10], 80)
where [0 10] is the time interval [a, b] on which you want to obtain a numerical solution
and 80 is the value of the initial condition $y_0$. The string 'func_ode' (the quotes indicate
that it is a string) specifies the name of the m-file in which the right-hand-side function
f(t, y) is defined. For the cooling problem, f(t, T) = 20 − T needs to be specified in the
m-file func_ode.m,

function f = func_ode(t, T)
f = 20 - T;

The result of ode45 is 2 arrays with the discrete time values used (t) and the corresponding
approximations to the solution (T). To satisfy the default tolerance values, 49
grid points are used.
The accuracy can be increased by using odeset. By default an absolute tolerance of
$10^{-6}$ is used to determine the step size h at grid point $t_i$. To solve the IVP with an
absolute tolerance of $10^{-8}$, you would type in the Matlab Command Window
options = odeset('AbsTol', 1e-8)
[t,y] = ode45('func_ode', [0 10], 80, options)
which produces, in this case, the same solution (49 grid points).
Fig. 5.2 shows the numerical solution together with the analytical solution. No differ-
ences are observed on the scale of the plot.

Figure 5.2: Numerical (ode45) and analytical (exact) solution of the cooling IVP, T versus t for 0 ≤ t ≤ 10.
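The comparison with the analytical solution can also be made quantitative. The sketch below passes the right-hand side as an anonymous function handle instead of an m-file name, and it tightens both the absolute and the relative tolerance; the tolerance values and the variable names are assumptions chosen for illustration.

```matlab
% Right-hand side f(t, T) = 20 - T of the cooling problem as a function handle
f = @(t, T) 20 - T;

% Tighter tolerances than the defaults (values chosen only for illustration)
options = odeset('AbsTol', 1e-8, 'RelTol', 1e-8);
[t, T] = ode45(f, [0 10], 80, options);

% Compare with the analytical solution T(t) = 20 + 60 exp(-t)
T_exact = 20 + 60*exp(-t);
max_err = max(abs(T - T_exact));
fprintf('%d grid points, maximum error %.2e\n', length(t), max_err);
```

Setting the relative tolerance as well matters here because ode45 controls the error with both tolerances at once.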

## 5.5 One-step methods
Details about the derivation of one-step methods for $y' = f(t, y)$ are discussed in Math
4446. Here we focus on how the methods work and typical numerical issues. We start with
the easiest method: Euler’s method. It is the simplest numerical technique to demonstrate
all numerical concepts. However, it is almost never the best numerical technique for the
problem you solve. We consider a constant step size $h = t_{i+1} - t_i$.

## 5.5.1 Euler’s method

Euler’s method can easily be derived using a forward finite difference formula
$y'(t_i) \approx (y_{i+1} - y_i)/h$. Starting from the initial condition $y_0$, any next $y_{i+1}$ is obtained from
\[ y_{i+1} = y_i + h f(t_i, y_i), \qquad i = 0, \ldots, N - 1. \]
Euler’s method is an explicit method. The right-hand side only depends on known quantities
(at time level $t_i$). This makes it easy to apply: just evaluate the right-hand side
using the known values $t_i$ and $y_i$.

## 5.5.2 Trapezoidal rule

Integrating $y' = f(t, y)$ on both sides from $t_i$ to $t_{i+1}$ gives
\[ y(t_{i+1}) - y(t_i) = \int_{t_i}^{t_{i+1}} f(t, y)\, dt. \]
The integral is approximated with the trapezoidal rule (average of the function values
in the end points $t_i$ and $t_{i+1}$). Starting from the initial condition $y_0$, any next $y_{i+1}$ is
obtained from
\[ y_{i+1} = y_i + \frac{h}{2} \left[ f(t_i, y_i) + f(t_{i+1}, y_{i+1}) \right], \]
for i = 0, . . . , N − 1.
This is an implicit method. The right-hand side also depends on the a priori unknown
$y_{i+1}$. In general, this makes it more difficult to compute $y_{i+1}$. Only for relatively simple
functions f can an explicit equation for $y_{i+1}$ be obtained. Otherwise, a nonlinear equation
needs to be solved. For example, bisection or Newton can be used. Generally, this requires
much more work per time step. The test equation and cooling equation are relatively
simple and solving a nonlinear equation is not necessary (the right-hand side is linear in
y).
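For the cooling problem the implicit relation can be solved by hand, because f(t, T) = 20 − T is linear in T. The sketch below uses step size h = 1/2, the value used in the example of this chapter; the rearranged update formula is derived in the comments and the variable names are illustrative.

```matlab
% Trapezoidal rule for the cooling problem T' = 20 - T, T(0) = 80.
% The implicit relation
%   T(i+1) = T(i) + h/2 * ((20 - T(i)) + (20 - T(i+1)))
% is linear in T(i+1) and can be rearranged to
%   T(i+1) = ((1 - h/2)*T(i) + 20*h) / (1 + h/2)
h = 1/2;
t = 0:h:10;
N = length(t) - 1;
T = zeros(size(t));
T(1) = 80;
for i = 1:N
    T(i+1) = ((1 - h/2)*T(i) + 20*h) / (1 + h/2);
end
fprintf('T(t = 1) = %.14f\n', T(3));
```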

## 5.5.3 Runge–Kutta methods

Runge–Kutta methods are explicit methods that use proper points (tl , yl ) to evaluate the
function f (t, y) so that a higher order method is obtained.

A frequently used Runge–Kutta method is the Runge–Kutta method of order four
(RK4). Starting from the initial condition $y_0$, any next $y_{i+1}$ is obtained from
\[
\begin{aligned}
k_1 &= h f(t_i, y_i), \\
k_2 &= h f\left(t_i + \tfrac{h}{2},\; y_i + \tfrac{k_1}{2}\right), \\
k_3 &= h f\left(t_i + \tfrac{h}{2},\; y_i + \tfrac{k_2}{2}\right), \\
k_4 &= h f(t_{i+1},\; y_i + k_3), \\
y_{i+1} &= y_i + \tfrac{1}{6} (k_1 + 2k_2 + 2k_3 + k_4)
\end{aligned}
\]
for i = 0, . . . , N − 1. The method is explicit: all substeps only involve known quantities
in the right-hand side.
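The four substeps translate almost literally into code. The following is a minimal sketch for the cooling problem with step size h = 1/2; the function handle `f` and the array names are illustrative choices.

```matlab
% RK4 for the cooling problem y' = 20 - y, y(0) = 80
f = @(t, y) 20 - y;
h = 1/2;
t = 0:h:10;
N = length(t) - 1;
y = zeros(size(t));
y(1) = 80;
for i = 1:N
    k1 = h*f(t(i),       y(i));
    k2 = h*f(t(i) + h/2, y(i) + k1/2);
    k3 = h*f(t(i) + h/2, y(i) + k2/2);
    k4 = h*f(t(i+1),     y(i) + k3);
    y(i+1) = y(i) + (k1 + 2*k2 + 2*k3 + k4)/6;
end
fprintf('y(t = 1) = %.14f\n', y(3));
```

Each step needs four evaluations of f, compared with one for Euler's method.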

## 5.5.4 Programming one-step methods

All one-step methods have a similar structure, summarized in the following algorithm.

Algorithm: one-step methods for first-order initial value problems

Input: discrete times $t_i$, value of initial condition $y_0$.

Initializations
    Set initial condition
    Compute number of subintervals N
One-step method
    Do for i = 0, . . . , N − 1
        Compute step size h
        Compute next approximation $y_{i+1}$ from the known values h, $y_i$, $t_i$, and $t_{i+1}$.
    End do (i-loop)

Programming (explicit) one-step methods is straightforward: apart from the initializations, one
for-loop suffices to find the approximation $y_{i+1}$ at the next time level for all i.

function [y] = euler(t, y0)
%=============================================
% Euler's method for 1st order IVPs
% Input/output parameters:
%   t   array with (N+1) grid points ti
%   y0  value of initial condition
%   y   array with (N+1) approximations yi
%=============================================

%---------------------------------------------
% Initializations
%---------------------------------------------
N = length(t) - 1;
y = zeros(size(t));    % preallocate the array of approximations
y(1) = y0;

%---------------------------------------------
% Euler's method
%---------------------------------------------
for i = 1:N
    h = t(i+1) - t(i);             % step size of the current subinterval
    f = funcivp(y(i), t(i));
    y(i+1) = y(i) + h*f;
end

%---------------------------------------------
% Function f(y, t) with y and t scalars
%---------------------------------------------
function [f] = funcivp(y, t)
% cooling problem
f = 20 - y;

Remarks:

• Note that the function euler is general. When changing to a different f (t, y) only
funcivp needs to be changed.

• For different one-step methods just the part inside the for-loop needs to be modified.
For the explicit RK4 we just have some more explicit substeps to perform. If you
solve a nonlinear IVP with the implicit trapezoidal rule, you need to solve a
nonlinear algebraic equation in every time step, which can be done with Newton’s method.
Then you need to call a function newton to find $y_{i+1}$. Since the time step is typically small, $y_i$
is usually a good enough initial guess for Newton’s method.
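The last remark can be sketched as follows. The test problem $y' = -y^2$, $y(0) = 1$ with exact solution $y(t) = 1/(1+t)$ is a hypothetical example that does not appear in these notes, and the Newton iteration is written inline instead of calling a separate function newton. The derivative `dfdy`, the tolerance, and the iteration limit are likewise assumptions.

```matlab
% Hypothetical nonlinear test problem: y' = -y^2, y(0) = 1, exact y = 1/(1+t)
f    = @(t, y) -y.^2;
dfdy = @(t, y) -2*y;         % partial derivative of f with respect to y

h = 0.01;
t = 0:h:1;
N = length(t) - 1;
y = zeros(size(t));
y(1) = 1;

for i = 1:N
    % Solve g(z) = z - y(i) - h/2*(f(t(i),y(i)) + f(t(i+1),z)) = 0 for z = y(i+1)
    z = y(i);                          % initial guess: previous value
    for iter = 1:20
        g  = z - y(i) - h/2*(f(t(i), y(i)) + f(t(i+1), z));
        dg = 1 - h/2*dfdy(t(i+1), z);
        dz = -g/dg;
        z  = z + dz;
        if abs(dz) < 1e-12
            break
        end
    end
    y(i+1) = z;
end

err = abs(y(end) - 1/(1 + t(end)));
fprintf('error at t = 1: %.2e\n', err);
```

Because $y_i$ is already close to $y_{i+1}$, Newton's method typically needs only a few iterations per time step.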

## 5.5.5 Example
We solve the cooling problem Eq. (5.1) using a step size h = 1/2 and compare with the
exact solution T (t) = 20 + 60e−t .
Fig. 5.3 shows the approximations Ti for the three different one-step methods consid-
ered: Euler, trapezoidal rule, and RK4.

Figure 5.3: One-step methods for the cooling problem using step size h = 1/2 (curves: exact, Euler, trapezoidal, RK4; $T$, $T_i$ versus t).

| | exact | Euler | trapezoidal | RK4 |
|---|---|---|---|---|
| T3 (t = 1) | 42.07276647028654 | 35.00000000000000 | 41.60000000000000 | 42.09025065104166 |
| T11 (t = 5) | 20.40427681994513 | 20.05859375000000 | 20.36279705600000 | 20.40588052828283 |
| T21 (t = 10) | 20.00272399578575 | 20.00005722045898 | 20.00219369506403 | 20.00274565005399 |

Table 5.1: One-step methods for the cooling problem using step size h = 1/2.

We observe from the comparison with the exact solution in Fig. 5.3 and Table 5.1 that RK4 gives a
better approximation than the trapezoidal rule, which gives a better approximation than
Euler. Of course RK4 requires more work per time step than the trapezoidal rule, which
requires more work per time step than Euler. In Sec. 5.7 we discuss accuracy in more
detail.

## 5.6 Test equation and amplifying factor
The general IVP $y' = f(t, y)$ is rather complicated to analyze. For this purpose a test equation
is introduced (just use f(t, y) = λy):
\[ y' = \lambda y, \]
where λ is a complex number. This is a rather simple equation, but can still demonstrate
the main concepts. Results we derive remain valid for more complicated equations.
Applying a one-step method to the test equation results in an equation of the form
\[ y_{i+1} = k(h\lambda)\, y_i, \]
where k is the amplifying factor, which depends on hλ. The amplifying factor of a
certain numerical technique contains enough information to determine its accuracy and
stability properties.
Euler’s method
The Euler method for the test equation is
\[ y_{i+1} = y_i + h\lambda y_i = (1 + h\lambda)\, y_i. \]
Thus the amplifying factor of Euler’s method is k = 1 + hλ.

Trapezoidal rule
Applying the trapezoidal rule to the test equation gives
\[ y_{i+1} = y_i + \frac{h}{2} \left[ \lambda y_i + \lambda y_{i+1} \right]. \]
After some algebra (collecting the $y_{i+1}$ terms on the left) this gives the amplifying factor
\[ k(h\lambda) = \frac{1 + h\lambda/2}{1 - h\lambda/2}. \]

Runge–Kutta (RK4)
Writing the RK4 method in the form $y_{i+1} = k(h\lambda) y_i$ gives the amplifying factor. After
some algebra this gives
\[ k(h\lambda) = 1 + h\lambda + \frac{1}{2}(h\lambda)^2 + \frac{1}{6}(h\lambda)^3 + \frac{1}{24}(h\lambda)^4. \]
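The "some algebra" can be checked numerically by performing one step of each method on the test equation and comparing with $k(h\lambda) y_0$. In this sketch the values of `lambda`, `h`, and `y0` are arbitrary test values, and all names are illustrative.

```matlab
% Amplifying factors k(z), z = h*lambda, of the three methods
k_euler = @(z) 1 + z;
k_trap  = @(z) (1 + z/2) ./ (1 - z/2);
k_rk4   = @(z) 1 + z + z.^2/2 + z.^3/6 + z.^4/24;

% Arbitrary test values (lambda may be complex)
lambda = -3 + 2i;  h = 0.1;  y0 = 1.5;
z = h*lambda;
f = @(t, y) lambda*y;

% One Euler step
y1_euler = y0 + h*f(0, y0);
diff_euler = abs(y1_euler - k_euler(z)*y0);

% One RK4 step, written out with the four substeps
s1 = h*f(0,   y0);
s2 = h*f(h/2, y0 + s1/2);
s3 = h*f(h/2, y0 + s2/2);
s4 = h*f(h,   y0 + s3);
y1_rk4 = y0 + (s1 + 2*s2 + 2*s3 + s4)/6;
diff_rk4 = abs(y1_rk4 - k_rk4(z)*y0);

% Trapezoidal rule: y1 = k(z)*y0 must satisfy the implicit relation
y1_trap  = k_trap(z)*y0;
res_trap = abs(y1_trap - y0 - h/2*(f(0, y0) + f(h, y1_trap)));
```

All three differences should be at the level of round-off.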

## 5.7 Accuracy
To say more about the error we can expect when using a numerical technique to solve
IVPs, we distinguish two types of errors:
• local truncation error (easiest to use, but doesn’t include accumulation of errors),
• global error (includes the accumulation of errors over all steps).

## 5.7.1 Local truncation error

The local truncation error measures, at a specified time, the amount by which the exact
solution and numerical approximation differ, assuming that the method was exact at the
previous step. In a numerical simulation there is always one time level at which we know
the exact solution: the initial condition $y(t_0) = y_0$. Performing one step starting from the
exact initial condition gives you the local truncation error $e_1$. The order $O(h^n)$ can be
obtained by computing the local truncation error for various step sizes h and determining
the slope on a $\log_{10} h$ vs. $\log_{10} e_1$ plot. Determining the slope is similar to boundary value
problems. See Sec. 4.8.
The local truncation error is defined as
\[ e_{i+1}(h) = |y(t_{i+1}) - y_{i+1}|, \]
where we assume that the solution at the previous step was exact: $y_i = y(t_i)$.
To analyze the local truncation error we consider the test equation $y' = \lambda y$. Integration
from $t_i$ to $t_{i+1}$ gives
\[ y(t_{i+1}) = e^{h\lambda} y(t_i). \]
To investigate errors, we use the Taylor series
\[ e^{h\lambda} = \sum_{j=0}^{n-1} \frac{(\lambda h)^j}{j!} + O(h^n) = 1 + \lambda h + \frac{(\lambda h)^2}{2} + \cdots + \frac{(\lambda h)^{n-1}}{(n-1)!} + O(h^n). \]
Combining with the general form for a one-step method $y_{i+1} = k(h\lambda) y_i$ gives
\[ e_{i+1}(h) = |e^{h\lambda} y(t_i) - k(h\lambda) y_i| = |e^{h\lambda} - k(h\lambda)|\, |y_i|, \]
since $y_i = y(t_i)$ is assumed to be exact.

Euler’s method
For Euler we obtain
\[ e_{i+1}(h) = |e^{h\lambda} - (1 + h\lambda)|\, |y_i| = \left| \frac{(h\lambda)^2}{2} + O(h^3) \right| |y_i|. \]
Thus the local truncation error is $O(h^2)$.

The local truncation error for the cooling problem Eq. (5.1) is given in Table 5.2. We
observe that the error decreases by a factor of $4 = 2^2$ each time h is halved. This is exactly
what we expect: if we decrease the step size h to h/2, the error decreases from $O(h^2)$ to
$O((h/2)^2) = \tfrac{1}{4} O(h^2)$.

| h | $y_1$ | $y(t_1)$ | $e_1$ | ratio |
|---|---|---|---|---|
| 0.1 | 74.000 | 7.4290245e+01 | 2.90e-01 | - |
| 0.05 | 77.000 | 7.7073765e+01 | 7.38e-02 | 0.254 |
| 0.025 | 78.500 | 7.8518594e+01 | 1.86e-02 | 0.252 |
| 0.0125 | 79.250 | 7.9254668e+01 | 4.67e-03 | 0.251 |
| 0.00625 | 79.625 | 7.9626169e+01 | 1.17e-03 | 0.251 |

Table 5.2: Local truncation error for the cooling problem Eq. (5.1) using Euler’s method.
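The entries of Table 5.2 can be reproduced with a few lines, since the local truncation error only requires one Euler step from the exact initial condition. This is a sketch with illustrative variable names, using k = 1, $T_{sur} = 20$, and $T_0 = 80$ as in the text.

```matlab
% One Euler step from the exact initial condition T(0) = 80 of T' = 20 - T
h  = 0.1 ./ 2.^(0:4);              % step sizes 0.1, 0.05, ..., 0.00625
y1 = 80 + h .* (20 - 80);          % Euler approximation at t1 = h
y_exact = 20 + 60*exp(-h);         % exact solution at t1 = h
e1 = abs(y_exact - y1);            % local truncation error
ratio = e1(2:end) ./ e1(1:end-1);  % approaches 1/4 as h decreases

disp([h' y1' y_exact' e1']);
disp(ratio');
```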
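Trapezoidal rule
Using Taylor expansions for the exponential and for $1/(1 - h\lambda/2)$ (using
$1/(1 - x) = \sum_{i=0}^{\infty} x^i$) gives after some algebra
\[ e_{i+1}(h) = |e^{h\lambda} - k(h\lambda)|\, |y_i| = \left| -\frac{(h\lambda)^3}{12} + O(h^4) \right| |y_i| = O(h^3). \]
This is one order higher than Euler, which was $O(h^2)$.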

The local truncation error for the cooling problem Eq. (5.1) is given in Table 5.3. We
observe that the error decreases by a factor of $8 = 2^3$ each time h is halved. This is exactly
what we expect: if we decrease the step size h to h/2, the error decreases from $O(h^3)$ to
$O((h/2)^3) = \tfrac{1}{8} O(h^3)$.

| h | $y_1$ | $y(t_1)$ | $e_1$ | ratio |
|---|---|---|---|---|
| 0.1 | 7.42857143e1 | 7.42902451e1 | 4.53e-3 | - |
| 0.05 | 7.70731707e1 | 7.70737655e1 | 5.95e-4 | 0.131 |
| 0.025 | 7.85185185e1 | 7.85185947e1 | 7.62e-5 | 0.128 |
| 0.0125 | 7.92546584e1 | 7.92546680e1 | 9.64e-6 | 0.127 |
| 0.00625 | 7.96261682e1 | 7.96261694e1 | 1.21e-6 | 0.126 |

Table 5.3: Local truncation error for the cooling problem Eq. (5.1) using the trapezoidal
rule.
Runge–Kutta (RK4)
For RK4 we obtain
\[ e_{i+1}(h) = |e^{h\lambda} - k(h\lambda)|\, |y_i| = O(h^5). \]
This is two orders higher than the trapezoidal rule and three orders higher than Euler.
The local truncation error for the cooling problem Eq. (5.1) is given in Table 5.4. We
observe that the error decreases by a factor of $32 = 2^5$ each time h is halved. This is exactly
what we expect: if we decrease the step size h to h/2, the error decreases from $O(h^5)$ to
$O((h/2)^5) = \tfrac{1}{32} O(h^5)$.

| h | $y_1$ | $y(t_1)$ | $e_1$ | ratio |
|---|---|---|---|---|
| 0.1 | 7.4290250000000e1 | 7.4290245082158e1 | 4.92e-06 | - |
| 0.05 | 7.7073765625000e1 | 7.7073765470043e1 | 1.55e-07 | 3.150e-2 |
| 0.025 | 7.8518594726563e1 | 7.8518594721700e1 | 4.86e-09 | 3.135e-2 |
| 0.0125 | 7.9254668029785e1 | 7.9254668029633e1 | 1.52e-10 | 3.128e-2 |
| 0.00625 | 7.9626169437408e1 | 7.9626169437404e1 | 4.76e-12 | 3.132e-2 |

Table 5.4: Local truncation error for the cooling problem Eq. (5.1) using RK4.

## 5.7.2 Global error

For the global error we look at the error at a fixed time $t_i = t^*$. The global error is
the absolute difference of the exact solution $y(t_i)$ and the numerical approximation $y_i$
after applying the numerical scheme a number of times, starting from $y(t_0) = y_0$ until $t_i$ is
reached,
\[ \varepsilon_i = |y(t_i) - y_i|. \]
Note that if we decrease h, the index i gets larger if the time $t_i = t^*$ is fixed.
Theorem
If the local error is $e_n = O(h^{p+1})$ then the global error is $\varepsilon_n = O(h^p)$.

Explanation (no proof):
Consider the error at a fixed time t, say t = 1. To do Euler with step size h = 1/N, the
number of Euler steps is N = 1/h, each with an error of $O(h^2)$. Then the total error
becomes $1/h \times O(h^2) = O(h)$.

Consequence
The global error is one order lower than the local truncation error, thus for Euler we have
$\varepsilon_i = O(h)$, for the trapezoidal rule $\varepsilon_i = O(h^2)$, and for RK4 $\varepsilon_i = O(h^4)$.

Application
We can use the order of the global error to estimate a value of the step size h that is
required to obtain a solution that is a certain amount more accurate. Assume we have
a solution with error $\varepsilon_0$ obtained using step size $h_0$. To obtain a solution with an error
$\varepsilon = 10^{-4} \varepsilon_0$ (4 more accurate digits) with a method of order p, we would need
\[ \varepsilon = C_h h^p = 10^{-4} \varepsilon_0 = 10^{-4} C_{h_0} h_0^p. \]
Since $C_h \approx C_{h_0}$ if $h_0$ is sufficiently small, we have
\[ h^p \approx 10^{-4} h_0^p. \]
For Euler, the global error is O(h), i.e. p = 1, and we would need $h \approx 10^{-4} h_0$ or $10^4$ times as
many intervals. For the trapezoidal rule, the global error is $O(h^2)$, i.e. p = 2, and we
would need $h^2 \approx 10^{-4} h_0^2$ or $h \approx 10^{-2} h_0$ or $10^2$ times as many intervals. For RK4, the global
error is $O(h^4)$, i.e. p = 4, and we would need $h^4 \approx 10^{-4} h_0^4$ or $h \approx 10^{-1} h_0$ or only $10^1$ times as
many intervals.
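The same estimate can be evaluated for any order p and any desired error reduction. The sketch below solves $h^p = \text{gain} \cdot h_0^p$ for $h/h_0$; the names `gain`, `h_factor`, and `intervals_factor` are illustrative assumptions.

```matlab
p = [1 2 4];                       % order of the global error: Euler, trapezoidal, RK4
gain = 1e-4;                       % desired error reduction (4 more accurate digits)
h_factor = gain.^(1./p);           % h/h0 that follows from h^p = gain*h0^p
intervals_factor = 1 ./ h_factor;  % factor by which the number of intervals grows

disp([p' h_factor' intervals_factor']);
```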
Example
For the cooling problem Eq. (5.1), the exact solution at t = 1 is
$T(t = 1) = 20 + 60e^{-1} \approx$ 4.207276647028654e+01.
Euler
The global error is one order lower than the local truncation error, thus $\varepsilon_i = O(h)$.
The global error for the cooling problem Eq. (5.1) is given in Table 5.5. We observe
that the error decreases by a factor of $2 = 2^1$ each time h is halved. This is exactly what we
expect: if we decrease the step size h to h/2, the error decreases from O(h) to
$O(h/2) = \tfrac{1}{2} O(h)$.

| h | $T_i(t = 1)$ | $\varepsilon(t = 1)$ | ratio |
|---|---|---|---|
| 1/10 | 4.0921e1 | 1.2e0 | - |
| 1/20 | 4.1509e1 | 5.6e-1 | 0.489 |
| 1/40 | 4.1794e1 | 2.8e-1 | 0.495 |
| 1/80 | 4.1934e1 | 1.4e-1 | 0.497 |
| 1/160 | 4.2004e1 | 6.9e-2 | 0.499 |
| 1/320 | 4.2038e1 | 3.5e-2 | 0.499 |

Table 5.5: Global error for the cooling problem Eq. (5.1) using Euler.
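The global error study for Euler can be reproduced by integrating up to t = 1 for a sequence of halved step sizes. This is a minimal sketch; the loop structure and the variable names are illustrative.

```matlab
% Global error of Euler's method at t = 1 for the cooling problem T' = 20 - T
f = @(t, T) 20 - T;
T_exact = 20 + 60*exp(-1);
Nvals = 10 * 2.^(0:5);             % N = 10, 20, ..., 320 steps, h = 1/N
err = zeros(size(Nvals));
for m = 1:length(Nvals)
    N = Nvals(m);
    h = 1/N;
    T = 80;                        % initial condition T(0) = 80
    for i = 1:N
        T = T + h*f((i-1)*h, T);   % Euler step from t = (i-1)*h to t = i*h
    end
    err(m) = abs(T_exact - T);
end
ratio = err(2:end) ./ err(1:end-1);

disp([1./Nvals' err']);
disp(ratio');
```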
Trapezoidal rule
The global error is one order lower than the local truncation error, thus $\varepsilon_i = O(h^2)$.
The global error for the cooling problem Eq. (5.1) is given in Table 5.6. We observe
that the error decreases by a factor of $4 = 2^2$ each time h is halved. This is exactly what we
expect: if we decrease the step size h to h/2, the error decreases from $O(h^2)$ to
$O((h/2)^2) = \tfrac{1}{4} O(h^2)$. Note that the global error for h = 0.1 is already better than for
Euler using h = 1/320.

| h | w(t = 1) | $\varepsilon(t = 1)$ | ratio |
|---|---|---|---|
| 1/10 | 4.2054352e1 | 1.84e-2 | - |
| 1/20 | 4.2068166e1 | 4.60e-3 | 0.250 |
| 1/40 | 4.2071616e1 | 1.15e-3 | 0.250 |
| 1/80 | 4.2072479e1 | 2.87e-4 | 0.250 |
| 1/160 | 4.2072694e1 | 7.19e-5 | 0.250 |

Table 5.6: Global error for the cooling problem Eq. (5.1) using the trapezoidal rule.
Runge–Kutta (RK4)
The global error is one order lower than the local truncation error, thus $\varepsilon_i = O(h^4)$.
The global error for the cooling problem Eq. (5.1) is given in Table 5.7. We observe that
the error decreases by a factor of $16 = 2^4$ each time h is halved. This is exactly what we
expect: if we decrease the step size h to h/2, the error decreases from $O(h^4)$ to
$O((h/2)^4) = \tfrac{1}{16} O(h^4)$. Note that the global error for h = 0.1 is already better than the
global error for the trapezoidal rule for h = 1/160.

| h | w(t = 1) | $\varepsilon(t = 1)$ | ratio |
|---|---|---|---|
| 1/10 | 4.207278646475e1 | 2.00e-05 | - |
| 1/20 | 4.207276766885e1 | 1.20e-06 | 5.99e-2 |
| 1/40 | 4.207276654365e1 | 7.34e-08 | 6.12e-2 |
| 1/80 | 4.207276647482e1 | 4.54e-09 | 6.19e-2 |
| 1/160 | 4.207276647057e1 | 2.82e-10 | 6.21e-2 |

Table 5.7: Global error for the cooling problem Eq. (5.1) using RK4.

## 5.7.3 Impact of roundoff errors

We can’t continue to decrease the step size to get a better solution. Because of round-off
errors (due to finite number of digits to represent numbers on a computer), the error may
increase for small step sizes.
Example: Consider the following simple initial value problem (just to demonstrate the
problem caused by round-off errors)
\[ y' = 1, \qquad 0 \le t \le 1, \qquad y(0) = 1, \]
which has solution y(t) = 1 + t. This is easily verified by substitution.
Using Euler and 8-digit arithmetic this becomes
\[ w_0 = 1.0000000, \qquad w_{i+1} = w_i + h. \]
If we choose $h = 1.0 \times 10^{-8}$, we get $w_1 = w_0 + h = 1.0000000 + 1.0 \times 10^{-8} = 1.0000000$
because we only have digits up to $10^{-7}$. Also $w_2 = w_1 + h = 1.0000000 + 1.0 \times 10^{-8} =
1.0000000$ and so on until $w_n = w_{n-1} + h = 1.0000000 + 1.0 \times 10^{-8} = 1.0000000$. The
result is not close to the exact solution y(1) = 2 at all! We never find anything
other than the initial condition.
Of course this problem does not only appear when the time step is "smaller than the
last digit". If we choose $h = 1/3 \times 10^{-6}$, we find $w_1 = 1.0000003$ because we only have
digits up to $10^{-7}$. Continuing, $w_2 = 1.0000006$, $w_3 = 1.0000009$. This gives an error of
10% in every increment: each step adds $3 \times 10^{-7}$ instead of $3.33\ldots \times 10^{-7}$.

More digits of precision shift the problem to smaller step sizes. If we did the
calculations in double precision (16 digits accuracy) there is no problem representing
$1 + 1.0 \times 10^{-8} = 1.00000001$. But $1 + 1.0 \times 10^{-16} = 1$ does give the same problem. How
round-off errors affect the global error in Euler’s method for the cooling problem Eq. (5.1)
can be observed in Table 5.8. Down to step size $h = 10^{-9}$, the global error is 1/10 times the
previous error if h is divided by 10, i.e. O(h). For a smaller step size the round-off error
dominates and the global error starts to grow rapidly. Thus there is an optimum h with
minimum error. In Table 5.8 the optimum value of h would be around $10^{-9}$.

| h | w(t = 0.1) | e(t = 0.1) | ratio |
|---|---|---|---|
| $10^{-4}$ | 7.429021793686086e+01 | 2.7e-05 | - |
| $10^{-5}$ | 7.429024236764370e+01 | 2.7e-06 | 0.1 |
| $10^{-6}$ | 7.429024481071004e+01 | 2.7e-07 | 0.1 |
| $10^{-7}$ | 7.429024505501881e+01 | 2.7e-08 | 0.1 |
| $10^{-8}$ | 7.429024507944148e+01 | 2.7e-09 | 0.1 |
| $10^{-9}$ | 7.429024508190086e+01 | 2.6e-10 | 0.1 |
| $10^{-10}$ | 7.872524636435026e+01 | 4.43 | |

Table 5.8: Impact of round-off error on global error in Euler’s method for the cooling
problem Eq. (5.1).
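The loss of small increments can be observed directly in Matlab. The sketch below uses binary single precision (about 7 significant decimal digits) as a stand-in for the 8-digit decimal arithmetic of the example, so the thresholds differ slightly from the text; the variable names are illustrative.

```matlab
% Single precision: the increment 1e-8 is lost completely when added to 1
a_single = single(1) + single(1e-8);
lost_in_single = (a_single == single(1));

% Double precision (about 16 digits): 1e-8 is resolved, 1e-16 is not
kept_in_double = ((1 + 1e-8)  ~= 1);
lost_in_double = ((1 + 1e-16) == 1);

% Euler for y' = 1, y(0) = 1 in single precision with a step too small to register
h = single(1e-8);
w = single(1);
for i = 1:1000
    w = w + h;       % every update leaves w unchanged
end
fprintf('w after 1000 steps: %.8f\n', w);
```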

## 5.8 Stability
In numerical computations there are always small errors: round-off errors, discretization
errors. Stability is related to whether these errors grow without bound or not. Stability
is particularly important for so-called stiff equations.

## 5.8.1 Introduction
To develop some intuition for stability, we consider the following examples, which we solve
using Euler’s method.

1. $y' + y = -99e^{-100t}$, y(0) = 2, with exact solution $y(t) = e^{-t} + e^{-100t}$, which decays
from 2 to 0. Solving this IVP numerically is straightforward and the numerical
solution looks as expected. We will call this behavior stable.

2. $y' + 100y = 99e^{-t}$, y(0) = 2, with exact solution $y(t) = e^{-t} + e^{-100t}$. The exact
solution is exactly the same, but solving this IVP numerically is much harder. The
global errors are huge for step sizes up to and including h = 1/32. We will call this
behavior unstable.

3. $y' + 100y = 0$, y(0) = 1, with exact solution $y(t) = e^{-100t}$. The numerical solution is
again unstable: the global errors are again huge up to and including h = 1/32.

To analyze what happens exactly, we only consider the simplest equation for which we observe the unstable behavior, $y' = -100y$ with $y(0) = 1$. The analytical solution $y(t) = e^{-100t}$ decays very rapidly to zero. Numerical solution using Euler gives

$$y_{i+1} = y_i - 100h\,y_i = (1 - 100h)\,y_i = (1 - 100h)^{i+1} y_0 = (1 - 100h)^{i+1}.$$

Every step the previous value is multiplied by $(1 - 100h)$. The numerical solution will only tend to 0 if $|1 - 100h| < 1$. Thus we need $0 < h < 2/100 = 0.02$. For $h > 0.02$, $|y_i|$ and thus the error will grow, since the exact solution $y \to 0$ as $t \to \infty$.
Now consider $y' = -100e^{-100t}$ with $y(0) = 1$, which has the same analytical solution $e^{-100t}$. Numerical solution using Euler gives

$$y_{i+1} = y_i - 100h\,e^{-100t_i} = y_{i-1} - 100h \sum_{j=i-1}^{i} e^{-100t_j} = y_0 - 100h \sum_{j=0}^{i} e^{-100t_j} = 1 - 100h \sum_{j=0}^{i} e^{-100t_j},$$

which is conceptually very different. There is no multiplication of the previous value, just an addition of a small term every step.
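The effect of the factor $(1 - 100h)$ can be made visible with a few lines of Matlab. The following is only a minimal sketch: the two step sizes (one above and one below 0.02), the end time t = 1 and all variable names are illustrative choices, not taken from the text.

```matlab
% Euler for y' = -100*y, y(0) = 1, integrated up to t = 1.
% Step sizes are illustrative: h = 0.025 > 0.02 and h = 0.015 < 0.02.
lambda = -100;
hs   = [0.025, 0.015];
yend = zeros(size(hs));
for m = 1:numel(hs)
    h = hs(m);
    N = round(1/h);          % number of steps needed to reach t = 1 (approximately)
    y = 1;
    for i = 1:N
        y = y + h*lambda*y;  % identical to y = (1 - 100*h)*y
    end
    yend(m) = y;
end
fprintf('h = %.3f: y after N steps = %.3e\n', [hs; yend]);
```

For h = 0.025 the factor is -1.5, so the magnitude grows every step; for h = 0.015 the factor is -0.5 and the numerical solution decays.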

5.8.2 Stability of one-step methods (test equation)
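We solve two test equations with a one-step method using slightly different initial conditions,

$$y' = \lambda y, \quad y(t_0) = y_0$$

and

$$w' = \lambda w, \quad w(t_0) = y_0 + \epsilon.$$

Thus we have the same ODE (and the same general solution) but a slightly different value (by $\epsilon$) for the initial condition.
Applying a one-step method to the above equations gives

$$y_{i+1} = k(h\lambda)\,y_i \quad \text{(starting from } y_0\text{)}$$

and

$$w_{i+1} = k(h\lambda)\,w_i, \quad w_0 = y_0 + \epsilon.$$

Subtracting gives an equation for the difference $\epsilon_i = y_i - w_i$ between the two solutions at time level $i + 1$,

$$\epsilon_{i+1} = k(h\lambda)\,\epsilon_i, \quad \epsilon_0 = -\epsilon$$

(only the magnitude of $\epsilon_0$ matters in what follows). Applying the one-step method $i + 1$ times gives

$$\epsilon_{i+1} = [k(h\lambda)]^{i+1}\,\epsilon_0.$$

A scheme is called stable if the difference remains bounded,

$$\lim_{i \to \infty} |\epsilon_i| < \infty,$$

and unstable otherwise. A scheme is called absolutely stable if

$$\lim_{i \to \infty} |\epsilon_i| = 0.$$

Thus the scheme is stable if $|k(h\lambda)| \le 1$ (error will not grow), absolutely stable if $|k(h\lambda)| < 1$ (error will decay to zero), and unstable if $|k(h\lambda)| > 1$ (error will grow and $y_i$ and $w_i$ will be very different for large values of i).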
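As a quick numerical illustration of this argument, the sketch below propagates two Euler solutions that differ only in the initial condition and compares their difference with the prediction $k^n$ times the initial difference. The values of lambda, h, n and the perturbation are arbitrary assumptions made for the illustration.

```matlab
% Two Euler runs for y' = lambda*y with slightly different initial values.
lambda  = -20;  h = 0.01;  n = 10;    % illustrative values
epsilon = 1e-6;                       % perturbation of the initial condition
k = 1 + h*lambda;                     % amplifying factor of Euler, k(z) = 1 + z

y = 1/3;
w = y + epsilon;
for i = 1:n
    y = k*y;
    w = k*w;
end

diff_numeric = y - w;                 % difference y_n - w_n of the two runs
diff_theory  = k^n * (-epsilon);      % k^n times the initial difference y_0 - w_0
```

Since |k| = 0.8 < 1 here, the difference shrinks with every step; choosing h > 0.1 would make |k| > 1 and the difference would grow instead.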

## 5.8.3 Region of absolute stability

In the test equation $\lambda \in \mathbb{C}$ is a given constant and h needs to be chosen such that the criterion for absolute stability is fulfilled in order to obtain a stable numerical solution. The region of absolute stability (R) is defined as the set of values $z = h\lambda$ for which the method is absolutely stable:

$$R = \{ z \in \mathbb{C} : |k(z)| < 1 \}, \quad z = h\lambda.$$

Euler
For Euler, $k(h\lambda) = 1 + h\lambda$. To have absolute stability we need $|k(h\lambda)| < 1$, which gives $|1 + z| < 1$, where $z = h\lambda$ is a complex number. For the magnitude of $1 + z$ with $z = x + iy$, we have

$$|1 + z| = \sqrt{(1 + z)(1 + \bar{z})} = \sqrt{(1 + x + iy)(1 + x - iy)} = \sqrt{(1 + x)^2 + y^2}.$$

Here $\bar{z}$ denotes the complex conjugate of z. After squaring we get

$$(x + 1)^2 + y^2 < 1.$$

Since $(x + 1)^2 + y^2 = 1$ is a circle around $(-1, 0)$ with radius 1, this corresponds to the region inside the circle. The circle itself is not included, which is represented in a figure by a dashed line.
The boundary of the region of absolute stability can also be plotted directly with
Matlab using
syms x y;
z = x + i*y;
k = 1 + z;
ezplot(abs(k) - 1, [-3, 1, -1.5, 1.5]);
grid on;
setcurve('Line', ':');
Here abs gives the magnitude of a complex number and [-3, 1, -1.5, 1.5] are the minimum and maximum x and y values in the figure (appropriate values were found by trial and error). The last line makes a dashed line. For this you need the m-file setcurve.m in your Current Directory. The resulting Matlab figure is Fig. 5.4. Note that we only plotted the boundary. The region of absolute stability $|k(z)| < 1$ is inside the closed curve. This is easily verified by checking one point inside the circle and one point outside the circle ($z = -1 + 0i$ satisfies $|1 + z| < 1$ and $z = 1 + 0i$ does not).

[Plot "Region of absolute stability: Euler"; horizontal axis x from -3 to 1, vertical axis y from -1.5 to 1.5.]

Figure 5.4: Region of absolute stability for Euler's method: inside the circle. A dashed curve means that the curve itself is not included.
Trapezoidal rule
For the region of absolute stability we need $|k(h\lambda)| < 1$. Using $z = h\lambda$ this gives

$$\left| \frac{1 + z/2}{1 - z/2} \right| < 1 \quad \text{or} \quad |1 + z/2| < |1 - z/2|.$$

The boundary of the region of stability is displayed in Fig. 5.5. The region of stability is on the left of the imaginary axis (check a point on each side). Geometrically, the inequality says that z is closer to -2 than to 2, which holds exactly when Re(z) < 0. Thus the region of stability is the whole left half plane.

[Plot "Region of absolute stability: trapezoidal rule"; horizontal axis x from -6 to 6, vertical axis y from -6 to 6.]

Figure 5.5: Region of absolute stability for trapezoidal rule: the left half plane.
RK4
The boundary of the region of stability is displayed in Fig. 5.6. The region of absolute
stability is inside the closed curve (check 1 point inside and outside the closed curve).

Application
From the region of absolute stability, you can obtain a maximum value of h for which the solution should remain stable. For example,
1. Assume λ is real and negative. For which values of h is Euler absolutely stable?
λ is real, thus we are on the real axis. The part of the real axis inside the circle corresponds to $-2 < h\lambda < 0$. Since λ < 0 this gives $0 < h < 2/(-\lambda) = 2/|\lambda|$.
2. Assume λ is purely imaginary. For which values of h is Euler absolutely stable?
λ is purely imaginary, thus we are on the imaginary axis. No point of the imaginary axis lies inside the region of absolute stability. Thus Euler is not absolutely stable for any h > 0.
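The first question can also be answered numerically for other methods. The sketch below does this for RK4, assuming the amplifying factor $k(z) = 1 + z + z^2/2 + z^3/6 + z^4/24$; the value of lambda and the search bracket [-3, -2] are illustrative choices (the bracket was found by evaluating k at a few points).

```matlab
% Largest step size for absolute stability when lambda is real and negative.
lambda = -20;                                   % illustrative value
k_rk4  = @(z) 1 + z + z.^2/2 + z.^3/6 + z.^4/24;

% Euler: the part of the real axis inside the circle is -2 < z < 0.
h_max_euler = 2/abs(lambda);

% RK4: find the left end of the stability interval on the real axis,
% i.e. the point where |k(z)| = 1, with a root search on [-3, -2].
z_left    = fzero(@(z) abs(k_rk4(z)) - 1, [-3, -2]);
h_max_rk4 = z_left/lambda;

fprintf('Euler: h < %.4f,  RK4: h < %.4f\n', h_max_euler, h_max_rk4);
```

The RK4 interval on the negative real axis turns out to be somewhat longer than the interval (-2, 0) of Euler, so RK4 tolerates a larger step size for the same λ.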

[Plot "Region of absolute stability: RK4"; horizontal axis x from -3 to 1, vertical axis y from -3 to 3.]

Figure 5.6: Region of absolute stability for RK4: inside the closed curve.

5.8.4 Example
We consider the IVP $y' = -20y$ (i.e. $\lambda = -20$) with $y(t = 0) = 1/3$, which has exact solution $y(t) = e^{-20t}/3$. The initial condition $y_0 = 1/3$ cannot be represented exactly, so we have a small round-off error in $y_0$. We use various step sizes h inside and outside the region of stability to see whether the numerical result is stable.
Euler
Table 5.9 shows results for Euler's method using various step sizes h. We see that for h = 1 the numerical solution and the error grow without bound, so the method is unstable. For h = 0.1 the numerical solution and error are bounded but the error does not decay. This means the solution is stable but not absolutely stable. For h < 0.1 the numerical solution and error decay, so the method is absolutely stable. This is exactly what we expect from the theory: absolutely stable for $h < 2/|\lambda| = 2/20 = 0.1$.

| t | y (exact) | $y_i$, h = 1 | $y_i$, h = 0.1 | $y_i$, h = 0.05 | $y_i$, h = 0.01 |
|---|---|---|---|---|---|
| 1 | 6.87e-10 | -6.33e0 | 3.33e-1 | 0 | 6.79e-11 |
| 2 | 1.41e-18 | 1.20e2 | 3.33e-1 | 0 | 1.38e-20 |
| 3 | 2.91e-27 | -2.28e3 | 3.33e-1 | 0 | 2.82e-30 |
| 4 | 6.01e-36 | 4.34e4 | 3.33e-1 | 0 | 5.47e-40 |
| 5 | 1.24e-44 | -8.25e5 | 3.33e-1 | 0 | 1.16e-49 |

Table 5.9: Stability for test equation using Euler.
Trapezoidal rule
Table 5.10 shows results for the trapezoidal rule using various step sizes h. We note that all numerical solutions are absolutely stable (all decay). This is exactly what we expect from theory: the left half plane implies all h > 0 since λ < 0.

| t | y (exact) | $y_i$, h = 1 | $y_i$, h = 0.1 | $y_i$, h = 0.02 | $y_i$, h = 0.01 |
|---|---|---|---|---|---|
| 1 | 6.87e-10 | -2.73e-1 | 0.00e0 | 5.23e-10 | 6.42e-10 |
| 2 | 1.41e-18 | 2.23e-1 | 0.00e0 | 8.20e-19 | 1.24e-18 |
| 3 | 2.91e-27 | -1.83e-1 | 0.00e0 | 1.29e-27 | 2.39e-27 |
| 4 | 6.01e-36 | 1.49e-1 | 0.00e0 | 2.02e-36 | 4.60e-36 |
| 5 | 1.24e-44 | -1.22e-1 | 0.00e0 | 3.16e-45 | 8.87e-45 |

Table 5.10: Stability for test equation using the trapezoidal rule.
Here we see the advantage of an implicit method: implicit methods can have an infinite
region of stability. For the trapezoidal rule, for Re(λ) < 0 all values of h are in the region
of stability. Explicit methods always have a finite region of stability. Thus h should be
chosen small enough to ensure absolute stability for explicit methods.
Note that although the trapezoidal rule is stable for h = 1, the numerical solution is
not very accurate. Accuracy requires a smaller value of h. Also note that h = 0.1 is a
special case: the amplifying factor equals zero giving a zero solution except for the initial
condition.
RK4
Table 5.11 shows results for RK4 using various step sizes h. Note that for h = 1, RK4 is unstable. For h = 0.1, however, RK4 is already absolutely stable, contrary to Euler. The reason is that the region of stability of RK4 includes a larger portion of the negative real axis, which contains the point $h\lambda = -2$ that corresponds to h = 0.1.

| t | y (exact) | $y_i$, h = 1 | $y_i$, h = 0.1 | $y_i$, h = 0.02 | $y_i$, h = 0.01 |
|---|---|---|---|---|---|
| 1 | 6.87e-10 | 1.84e03 | 5.65e-06 | 6.91e-10 | 6.87e-10 |
| 2 | 1.41e-18 | 1.01e07 | 9.56e-11 | 1.43e-18 | 1.42e-18 |
| 3 | 2.91e-27 | 5.59e10 | 1.62e-15 | 2.97e-27 | 2.92e-27 |
| 4 | 6.01e-36 | 3.08e14 | 2.74e-20 | 6.16e-36 | 6.02e-36 |
| 5 | 1.24e-44 | 1.70e18 | 4.64e-25 | 1.28e-44 | 1.24e-44 |

Table 5.11: Stability for test equation using RK4.

The unstable behavior for too large values of h is typical for explicit methods. A very accurate explicit technique doesn't eliminate the unstable behavior. Stability and accuracy are two different subjects.
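Because the test equation is linear, the numerical solution after n steps is simply $k(h\lambda)^n y_0$, so entries of the three tables can be recomputed directly from the amplifying factors. The sketch below does this for a subset of the step sizes; it assumes Matlab R2016b or later (implicit expansion of arrays), and the variable names are illustrative.

```matlab
% Amplifying factors k(z) of the three one-step methods, z = h*lambda.
k_euler = @(z) 1 + z;
k_trap  = @(z) (1 + z/2)./(1 - z/2);
k_rk4   = @(z) 1 + z + z.^2/2 + z.^3/6 + z.^4/24;

lambda = -20;  y0 = 1/3;
h = [1, 0.1, 0.01];       % a subset of the step sizes used in the tables
t = (1:5)';               % output times

% For the test equation y_n = k^n * y0, with n = t/h steps to reach time t.
n = round(t ./ h);        % 5-by-3 matrix of step counts
Y_euler = y0 * k_euler(h*lambda).^n;
Y_trap  = y0 * k_trap(h*lambda).^n;
Y_rk4   = y0 * k_rk4(h*lambda).^n;

disp(Y_euler);  disp(Y_trap);  disp(Y_rk4);
```

Each row corresponds to a time t = 1, ..., 5 and each column to one step size, in the same layout as the tables.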

## 5.8.5 Linear stability analysis for nonlinear equations

Stability analysis for nonlinear IVPs can be done with a linearization.

To determine the linear stability for $y' = g(y)$ with initial condition $y(0) = y_0$, we also consider the perturbed problem $w' = g(w)$ with $w(0) = y_0 + \epsilon_0$. Introducing $\epsilon = y - w$ and differentiating gives

$$\epsilon' = y' - w' = g(y) - g(y - \epsilon) = \frac{g(y) - g(y - \epsilon)}{\epsilon}\,\epsilon = \frac{dg(y)}{dy}\,\epsilon + O(\epsilon^2).$$

Neglecting the higher order terms in $\epsilon$ gives

$$\epsilon' = \frac{dg(y)}{dy}\,\epsilon.$$

Comparing with the test equation, we now have dg/dy instead of λ. Thus to determine the stability, dg/dy needs to be evaluated at a known value of y (typically the solution at the previous time step $y_i$).
Example: We use Euler's method to solve the nonlinear IVP

$$y' = -y^2, \quad y(0) = 1.$$

We have

$$\frac{dg}{dy} = -2y,$$

which is real. For the test equation we need $h < 2/|\lambda|$ for stability with Euler. Here we have $-2y$ instead of λ. At time $t_i$ we thus get as stability criterion

$$h < \frac{2}{|-2y_i|} = \frac{1}{|y_i|}.$$

Remarks

• We neglected higher order terms when we did the linearization. Thus it is safer to
take the step size a little smaller than the one obtained from the linearization.

• The stability criterion depends on the solution yi which is a priori unknown. The
maximum allowable step size to guarantee stability for step i thus needs to be
determined at every step.
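The two remarks can be combined in a small program. The sketch below applies Euler to $y' = -y^2$, $y(0) = 1$ and limits the step size at every step by the linearized criterion. The safety factor 0.5, the accuracy cap h_acc = 0.01 and the end time are assumptions made for this illustration; the exact solution $y(t) = 1/(1+t)$ is used only to check the result.

```matlab
% Euler for y' = -y^2, y(0) = 1, with a step size limit from linear stability.
g = @(y) -y.^2;
t = 0;  y = 1;  t_end = 5;
safety = 0.5;                 % take h a little smaller than the linearized limit
h_acc  = 0.01;                % step size wanted for accuracy (illustrative)

while t < t_end
    h_stab = safety/abs(y);   % linearized criterion h < 1/|y_i| with safety factor
    h = min([h_acc, h_stab, t_end - t]);   % last term makes the final step end at t_end
    y = y + h*g(y);
    t = t + h;
end

err = abs(y - 1/(1 + t));     % global error at t_end
```

For this problem $|y_i| \le 1$, so the stability limit is never the active restriction; for a problem with larger $|dg/dy|$ the same loop would reduce h automatically.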

5.9 Discussion
Which method you would choose depends on the type of problem you are solving and on
how many times you are solving such problems. You need to consider the following:

• Implementation time. If you only solve a problem once, you want something
that you can program quickly. So you would typically choose an explicit technique
since these are straightforward to program. That computing time is larger is often
not a problem.

• Accuracy. If you need an accurate solution, you would typically choose a higher-
order method to obtain a small global error (lower computing time for accurate
solutions, but a little harder to implement in more complex problems). If you just
need a rough estimate of what the solution looks like, an O(h) method might be
sufficient.

• Stability. If stability is an issue, you would typically use an implicit technique. This avoids the small step sizes needed to satisfy the stability criterion. This is a particularly good choice when the solution hardly changes over a long period of time.

• Implicit/explicit. If stability is not an issue for the problem you are solving, there is no need to consider implicit techniques. The numerical program is harder to write (for implicit methods a nonlinear equation needs to be solved every time step, for example with bisection or Newton's method) and there is no benefit compared to explicit techniques.

Chapter 6

## Mathematical and computational concepts

• Accuracy

• Stability

• Order of convergence

Numerical methods:

• Euler

• Trapezoidal rule

• Runge–Kutta

6.1 Problem description: predator-prey models
In Sec. 3.1 we discussed a population model for a predator-prey system. The resulting model was

$$\dot{x}_1 = a x_1 - b x_1 x_2,$$
$$\dot{x}_2 = c x_1 x_2 - d x_2,$$

where $x_1(t)$ is the population of prey at time t, $x_2(t)$ the population of predators at time t, and a, b, c, and d some given constants. In Chap. 3, we determined equilibrium solutions (i.e. when the populations do not change anymore in size, or $dx_1/dt = dx_2/dt = 0$). In this chapter, we will solve the transient equations, i.e. we will predict the populations as a function of time.
As an example, we consider the system of equations with a = d = 2 and b = c = 1,

$$\dot{x}_1 = 2x_1 - x_1 x_2,$$
$$\dot{x}_2 = x_1 x_2 - 2x_2, \tag{6.1}$$

with initial conditions $x_1(t = 0) = 1$ and $x_2(t = 0) = 1$.

Eq. (6.1) represents a system of initial value problems. The general form of a system of m IVPs is

$$\begin{pmatrix} y_1' \\ \vdots \\ y_m' \end{pmatrix} = \begin{pmatrix} f_1(t, y_1, \ldots, y_m) \\ \vdots \\ f_m(t, y_1, \ldots, y_m) \end{pmatrix}, \qquad \begin{pmatrix} y_1(t = t_0) \\ \vdots \\ y_m(t = t_0) \end{pmatrix} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_m \end{pmatrix},$$

or in more compact form $\mathbf{y}' = \mathbf{f}(t, \mathbf{y})$ with $\mathbf{y}(t = t_0) = \mathbf{Y}$.

6.2 Checking numerical solutions for systems of IVPs
Systems of IVPs are typically difficult to solve analytically. In this section we discuss four
methods to validate a numerical program for solving a system of equations.

## 6.2.1 Equilibrium solutions

Equilibrium solutions are usually much easier to obtain. A first check to validate a program could be to take the initial value $\mathbf{y}_0$ close to an equilibrium solution and check whether the numerical solution approaches the equilibrium solution. Note that the exact solution does not always approach an equilibrium solution. The exact solution approaches the equilibrium point only when the equilibrium point is asymptotically stable, which can be checked by inspecting the eigenvalues of the linearized system at the equilibrium point (see Math 2214).

## 6.2.2 Symbolic calculations

Systems of IVPs can be solved symbolically using the built in Matlab function dsolve. For
the example we consider, you would use
sol = dsolve('Dx = 2*x - x*y', 'Dy = x*y - 2*y', 'x(0) = 1', 'y(0) = 1')
which returns as output
Warning: Explicit solution could not be found.
sol = [ empty sym ]
which means that no analytical solution can be found (by Matlab).
Of course you can simplify the equations and try to find an analytical solution to test
your code. A linear system is much easier to solve, for example
sol = dsolve('Dx = 2*x - y', 'Dy = x - 2*y', 'x(0) = 1', 'y(0) = 1')
gives the analytical solution in sol.x and sol.y. The disadvantage is that a linear system is
usually much easier to solve than a nonlinear system, so that it is not necessarily a good
test to determine whether your program works for nonlinear equations. Of course it is
not completely useless: if it doesn’t work for a linear system there is at least one thing
wrong and what goes wrong for a linear system is easier to detect.

## 6.2.3 Analytical solutions

Even though it is hard to solve a nonlinear system of IVPs, it is not difficult to find an analytical solution of a nonlinear system to test a numerical code. Instead of Eq. (6.1), we consider

$$\dot{x}_1 = 2x_1 - x_1 x_2 + q_1(t),$$
$$\dot{x}_2 = x_1 x_2 - 2x_2 + q_2(t).$$

We substitute the analytical solution $x_1(t)$ and $x_2(t)$ that we want and try to find the corresponding $q_1(t)$ and $q_2(t)$. For example, we take $x_1(t) = e^{-t}$ and $x_2(t) = e^{-2t}$. Substitution gives $q_1(t) = e^{-3t} - 3e^{-t}$ and $q_2(t) = -e^{-3t}$. The initial conditions at t = 0 that correspond to the analytical solution are $x_1(0) = 1$ and $x_2(0) = 1$.
Thus we obtained the analytical solution for a slightly more difficult system of IVPs

$$\dot{x}_1 = 2x_1 - x_1 x_2 + e^{-3t} - 3e^{-t},$$
$$\dot{x}_2 = x_1 x_2 - 2x_2 - e^{-3t},$$

with ICs $x_1(0) = 1$ and $x_2(0) = 1$. This should be sufficient to test the numerical code.
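Before using such a manufactured solution to test a code, it is worth checking the source terms themselves. A minimal sketch of such a check is given below: it evaluates the residual of both equations for the chosen solution at a set of sample times (the sample times and variable names are arbitrary choices).

```matlab
% Check that x1 = exp(-t), x2 = exp(-2t) satisfies the modified system.
x1 = @(t) exp(-t);        dx1 = @(t) -exp(-t);       % solution and its derivative
x2 = @(t) exp(-2*t);      dx2 = @(t) -2*exp(-2*t);
q1 = @(t) exp(-3*t) - 3*exp(-t);                     % source terms found above
q2 = @(t) -exp(-3*t);

t = linspace(0, 5, 51);   % sample times (arbitrary)
res1 = dx1(t) - (2*x1(t) - x1(t).*x2(t) + q1(t));
res2 = dx2(t) - (x1(t).*x2(t) - 2*x2(t) + q2(t));
max_residual = max(abs([res1, res2]));               % should be at round-off level
```

A residual that is not at round-off level indicates an algebra mistake in $q_1$ or $q_2$, which would otherwise be mistaken for a bug in the solver.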

## 6.2.4 Numerical calculations with Matlab

Matlab has several built-in functions to solve systems of initial value problems numerically. We only discuss ode45, which uses a Runge–Kutta technique with adaptive time stepping, i.e. every step a proper value of the step size h is determined in order to obtain a solution within a specified tolerance. Matlab's ode45 solves the system of IVPs $\mathbf{y}' = \mathbf{f}(t, \mathbf{y})$ with $\mathbf{y}(a) = \mathbf{y}_0$ on the time interval $a \le t \le b$.
Systems of initial value problems can be solved using ode45 in the same way as scalar IVPs (see Sec. 5.4). To solve the population problem with default tolerances, one would type in the Matlab Command Window
[ti, xi] = ode45('func_ode', [0 10], [1 1])
where [0 10] is the time interval [a, b] on which you want to obtain a numerical solution and [1 1] the values of the initial conditions $\mathbf{y}_0$.
The string func_ode (the quotes indicate that it is a string) specifies the name of your m-file in which the right-hand-side vector $\mathbf{f}(t, \mathbf{x})$ is specified. For the population problem $f_1(t, \mathbf{x}) = x_1(2 - x_2)$ and $f_2(t, \mathbf{x}) = x_2(x_1 - 2)$ need to be specified as a column vector in the m-file func_ode.m,

function f = func_ode(t, x)
% Right-hand side of the population problem, returned as a column vector
x1 = x(1);
x2 = x(2);

f(1,1) = x1*(2.0 - x2);
f(2,1) = x2*(x1 - 2.0);

The result of ode45 is 2 arrays with the discrete time values used (ti) and the corre-
sponding approximations to the solution (xi). The first column of the matrix xi contains
the approximation to x1 and can be selected using xi(:,1). The second column of the
matrix xi contains the approximation to x2 and can be selected using xi(:,2).
To satisfy the default tolerance values, 101 grid points are used. The accuracy can be
increased by using odeset, similar to scalar IVPs (See Sec. 5.4).
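As a variation on the call above, the right-hand side can also be passed to ode45 as a function handle, so that no separate m-file is needed. The sketch below does this and tightens the tolerances with odeset; the tolerance values are illustrative. The quantity H used as a check is constant along exact solutions of this particular system (this follows by differentiating H and inserting Eq. (6.1)); it is introduced here only to monitor the accuracy.

```matlab
% Right-hand side of the population problem as an anonymous function.
f = @(t, x) [x(1)*(2.0 - x(2)); x(2)*(x(1) - 2.0)];

% Tighter tolerances than the defaults (values chosen for illustration).
options = odeset('RelTol', 1e-8, 'AbsTol', 1e-10);
[ti, xi] = ode45(f, [0 10], [1; 1], options);

% H = x1 - 2*ln(x1) + x2 - 2*ln(x2) is conserved by the exact solution,
% so its drift gives a rough indication of the numerical error.
H = xi(:,1) - 2*log(xi(:,1)) + xi(:,2) - 2*log(xi(:,2));
drift = max(abs(H - H(1)));

plot(ti, xi(:,1), ti, xi(:,2));
legend('x_1', 'x_2');  xlabel('t');
```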
Fig. 6.1 shows the numerical solution for $x_1$ and $x_2$. We note that the population does not approach the equilibrium solution $(x_1, x_2) = (2, 2)$ but oscillates periodically around $(x_1, x_2) = (2, 2)$. This is easily explained by examining the eigenvalues of the Jacobian at $(x_1, x_2) = (2, 2)$. The Jacobian has purely imaginary eigenvalues, which correspond to a periodic solution.

[Plot of $x_1$ and $x_2$ versus t for 0 ≤ t ≤ 10.]

Figure 6.1: Numerical solution of the population problem.
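The statement about the eigenvalues is easy to verify. The sketch below builds the Jacobian of $f_1 = x_1(2 - x_2)$, $f_2 = x_2(x_1 - 2)$ by hand (the partial derivatives were worked out for this illustration) and evaluates its eigenvalues at the equilibrium point.

```matlab
% Jacobian of f1 = x1*(2 - x2), f2 = x2*(x1 - 2) at the equilibrium (2, 2).
x1 = 2;  x2 = 2;
J = [2 - x2,  -x1;        % [df1/dx1, df1/dx2]
     x2,      x1 - 2];    % [df2/dx1, df2/dx2]
lam = eig(J);             % eigenvalues of the linearized system
disp(lam);
```

The real parts are zero and the imaginary parts are ±2, so the linearized system oscillates without growth or decay.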

6.3 One-step methods
A system of m IVPs can be solved by applying any one-step method discussed in Chapter 5 to the system. Conceptually there is nothing new; it is only more work. We just apply every single step of the one-step method to all components, i.e. m times instead of only once.
Below we discuss Euler, the trapezoidal rule, and RK4 for systems.

Euler
Applying Euler to the vector equation $\mathbf{y}' = \mathbf{f}(t, \mathbf{y})$ gives

$$\mathbf{y}_{i+1} = \mathbf{y}_i + h\,\mathbf{f}(t_i, \mathbf{y}_i).$$

Note that the right-hand side only contains values at level i, which are all known. The difference with Euler's method for scalar IVPs in Chapter 5 is that $\mathbf{y}$ and $\mathbf{f}$ are now vectors (arrays) with m components. This means we just need to evaluate the m component functions $f_1, \ldots, f_m$ and use the values to compute the m components $y_{i+1,j}$, where $i = 0, \ldots, N$ denotes the time level and $j = 1, \ldots, m$ the component.

Trapezoidal rule
Applying the trapezoidal rule to the vector equation $\mathbf{y}' = \mathbf{f}(t, \mathbf{y})$ gives

$$\mathbf{y}_{i+1} = \mathbf{y}_i + \frac{h}{2}\left[ \mathbf{f}(t_i, \mathbf{y}_i) + \mathbf{f}(t_{i+1}, \mathbf{y}_{i+1}) \right],$$

for $i = 0, \ldots, N - 1$.
This is an implicit system of equations since the right-hand side also depends on the a priori unknown $\mathbf{y}_{i+1}$. This makes it more difficult to compute $\mathbf{y}_{i+1}$. A nonlinear system of equations needs to be solved every time step. This can be done using, for example, Newton's method for systems (see Sec. 3.3). This requires solving a linear system $J\mathbf{y}_{i+1} = \mathbf{b}(\mathbf{y}_i, t_i, t_{i+1})$ at every time step. Since the time step is typically small, $\mathbf{y}_i$ is usually a good enough initial guess for Newton's method for systems. Implicit methods for systems require much more work per time step compared to explicit methods but have a larger region of stability.
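For the special case of a linear system $\mathbf{y}' = A\mathbf{y}$ the implicit equation is itself linear, so no Newton iteration is needed: rearranging the trapezoidal rule gives $(I - \tfrac{h}{2}A)\,\mathbf{y}_{i+1} = (I + \tfrac{h}{2}A)\,\mathbf{y}_i$. The sketch below illustrates this case only; the matrix A, the initial vector and the deliberately large step size are assumptions made for the illustration.

```matlab
% Trapezoidal rule for the linear system y' = A*y (A is an illustrative choice
% with eigenvalues -1 and -20).
A  = [-1 0; 1 -20];
y0 = [1; 1];
h  = 0.5;  N = 20;                  % large step size on purpose
I2 = eye(2);

Y = zeros(2, N+1);                  % column i+1 holds the solution at time level i
Y(:,1) = y0;
for i = 1:N
    % Solve (I - h/2*A) * y_new = (I + h/2*A) * y_old with the backslash operator.
    Y(:,i+1) = (I2 - h/2*A) \ ((I2 + h/2*A) * Y(:,i));
end
```

Euler would be unstable for this step size (h·λ = -10 lies far outside the circle), whereas the trapezoidal solution decays, at the price of one linear solve per step.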

RK4
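Applying RK4 to the vector equation $\mathbf{y}' = \mathbf{f}(t, \mathbf{y})$ gives

$$\mathbf{k}_1 = h\,\mathbf{f}(t_i, \mathbf{y}_i),$$
$$\mathbf{k}_2 = h\,\mathbf{f}(t_i + h/2, \mathbf{y}_i + \mathbf{k}_1/2),$$
$$\mathbf{k}_3 = h\,\mathbf{f}(t_i + h/2, \mathbf{y}_i + \mathbf{k}_2/2),$$
$$\mathbf{k}_4 = h\,\mathbf{f}(t_{i+1}, \mathbf{y}_i + \mathbf{k}_3),$$
$$\mathbf{y}_{i+1} = \mathbf{y}_i + (\mathbf{k}_1 + 2\mathbf{k}_2 + 2\mathbf{k}_3 + \mathbf{k}_4)/6.$$

The difference with the RK4 method for scalar IVPs in Chapter 5 is that $\mathbf{y}$, $\mathbf{f}$, $\mathbf{k}_1$, $\mathbf{k}_2$, $\mathbf{k}_3$, $\mathbf{k}_4$ are now vectors (arrays) with m components. All substeps, however, are still explicit. First $\mathbf{k}_1$ needs to be computed for all m components. Once all components of $\mathbf{k}_1$ are known, the values of $\mathbf{k}_2$ can be computed, etc.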
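A possible Matlab realization of these substeps is sketched below. As test problem it uses the system with known solution $x_1 = e^{-t}$, $x_2 = e^{-2t}$ from Sec. 6.2.3; the right-hand side is written as an anonymous function returning a column vector, and the step size h = 0.1 and the variable names are illustrative choices.

```matlab
% RK4 for a system of m = 2 IVPs, with all vectors stored as columns.
% Test problem: the system with exact solution x1 = exp(-t), x2 = exp(-2t).
f = @(t, x) [2*x(1) - x(1)*x(2) + exp(-3*t) - 3*exp(-t);
             x(1)*x(2) - 2*x(2) - exp(-3*t)];

h = 0.1;  N = 10;                  % integrate from t = 0 to t = 1
t = 0;    y = [1; 1];              % initial condition
for i = 1:N
    k1 = h*f(t,       y);          % every k is a vector with m components
    k2 = h*f(t + h/2, y + k1/2);
    k3 = h*f(t + h/2, y + k2/2);
    k4 = h*f(t + h,   y + k3);
    y  = y + (k1 + 2*k2 + 2*k3 + k4)/6;
    t  = t + h;
end

err = max(abs(y - [exp(-t); exp(-2*t)]));   % error in the maximum norm at t = 1
```

The loop body is the same as for a scalar IVP; Matlab's vector arithmetic takes care of doing each substep for all components.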

## 6.3.1 Programming one-step methods

One-step methods for systems of IVPs have a structure similar to that of the one-step methods for scalar IVPs. We just need to do every operation for all components, i.e. m times. The algorithm is therefore almost identical, except that we now have a vector $\mathbf{y}$ instead of a scalar y.

## Algorithm: one-step methods for first-order systems of IVPs

Input: discrete times $t_i$, value of initial vector $\mathbf{y}_0$.

Initializations
    Set initial condition
    Compute number of subintervals N
One-step method
    Do for i = 0, ..., N − 1
        Compute step size h
        Compute next approximation $\mathbf{y}_{i+1}$ from the known values h, $\mathbf{y}_i$, $t_i$, and $t_{i+1}$.
    End do (i-loop)

To program Euler for systems in a general way, we introduce a two-dimensional array $y_{i,j}$ (in Matlab y(i+1,j)) for all grid points in time $i = 0, \ldots, N$ and for every component $j = 1, \ldots, m$.
To do one Euler step inside the for-loop, you can use colon notation in Matlab to pass all components of $\mathbf{y}_i$ to a function and compute $\mathbf{y}_{i+1}$:

[f] = funcivp(y(i,:), t(i));
y(i+1,:) = y(i,:) + h*f;

where funcivp computes all components of the right-hand-side vector $\mathbf{f}$ using the solution $\mathbf{y}_i$ and time $t_i$ (funcivp should return f as a row vector so that it can be added to the row y(i,:)). The colon notation in Matlab does the operation for all possible values at the place of the colon. Thus y(i,:) is a vector of length m with all components of $\mathbf{y}_i$, so that the function funcivp receives the complete vector. The second line computes $\mathbf{y}_{i+1}$ for all components, due to the colon in the two-dimensional array.
Remarks

• By using a function funcivp to compute the right-hand-side vector $\mathbf{f}$, you keep euler general.

• Different one-step methods have the same structure. Only the part inside the for-loop needs to be modified. For RK4 just some more explicit substeps need to be performed. If you solve a nonlinear equation with the implicit trapezoidal rule, you need to solve a system of nonlinear algebraic equations, which can be done with Newton's method for systems. Then you need to call a function newtonsys to find the vector $\mathbf{y}_{i+1}$. Since the time step is typically small, the vector $\mathbf{y}_i$ is usually a good enough initial guess for Newton's method.

• Alternatively, for-loops could be used instead of the colon notation. For-loops would typically be used in Fortran-77 or C.

• If the product of the number of time steps and number of components N × m is large, you might not want to store all $\mathbf{y}_i$'s. You could overwrite (y(:) = y(:) + h*f;) and print the values of y(:) to a file every now and then. Then you only use an array of length m.
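Putting the algorithm and the colon notation together, a complete Euler program for the population problem might look as follows. This is a sketch under stated assumptions: funcivp is written here as an anonymous function returning a row vector (in the text it is a separate m-file), and the time interval, step size and variable names are illustrative.

```matlab
% Euler for a system of m IVPs, storing all time levels in the rows of y.
funcivp = @(y, t) [y(1)*(2 - y(2)), y(2)*(y(1) - 2)];   % population problem

t0 = 0;  tN = 10;  h = 0.1;
N  = round((tN - t0)/h);          % number of subintervals
t  = t0 + h*(0:N)';               % discrete times t(1), ..., t(N+1)
m  = 2;                           % number of components

y = zeros(N+1, m);
y(1,:) = [1, 1];                  % initial condition

for i = 1:N
    f = funcivp(y(i,:), t(i));    % all m components of the right-hand side
    y(i+1,:) = y(i,:) + h*f;      % Euler step for all components at once
end

plot(t, y(:,1), t, y(:,2));
legend('x_1', 'x_2');  xlabel('t');
```

Only the two lines inside the for-loop are specific to Euler; replacing them with the RK4 substeps gives an RK4 program with the same structure.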

The numerical solutions using Euler and RK4 with h = 0.1 are displayed in Fig. 6.2. The solution obtained using RK4 agrees well with the ode45 solution, but the Euler solution is far off at larger values of t.

[Two plots versus t for 0 ≤ t ≤ 10, each with curves for ode45, euler and rk4: (a) $x_1$, (b) $x_2$.]

Figure 6.2: Numerical solution of the population problem using ode45, Euler and RK4. (a) $x_1$, and (b) $x_2$.

6.4 Accuracy
To determine the local truncation error and global error for a system of IVPs at time $t_i$, we use the $l_\infty$ norm

$$\epsilon_\infty = \| \mathbf{y}(t_i) - \mathbf{y}_i \|_\infty,$$

i.e. the maximum absolute error over all components $j = 1, \ldots, m$.
The results of the analysis of the local truncation error for the scalar test equation remain valid for systems of IVPs. Furthermore, the global error is still one order lower than the local truncation error. Thus when there are no discontinuities, we have the following global errors:

• Euler: $O(h)$.

• Trapezoidal rule: $O(h^2)$.

• RK4: $O(h^4)$.

Example
We solve using one-step methods

$$\dot{x}_1 = 2x_1 - x_1 x_2 + e^{-3t} - 3e^{-t},$$
$$\dot{x}_2 = x_1 x_2 - 2x_2 - e^{-3t},$$

with ICs $x_1(0) = 1$ and $x_2(0) = 1$. The analytical solution is $x_1(t) = e^{-t}$ and $x_2(t) = e^{-2t}$. The global error at t = 1 for various values of h is given in Table 6.1. We observe that RK4 using h = 0.1 is already much more accurate than Euler using h = 0.00625.

| h | Euler | RK4 |
|---|---|---|
| 0.1 | 4.00e-02 | 9.69e-06 |
| 0.05 | 1.96e-02 | 5.88e-07 |
| 0.025 | 9.69e-03 | 3.62e-08 |
| 0.0125 | 4.82e-03 | 2.24e-09 |
| 0.00625 | 2.40e-03 | 1.39e-10 |

Table 6.1: Global error $\epsilon_\infty$ at t = 1 for Euler and RK4.

The errors are plotted in a $\log_{10} h$ vs. $\log_{10} \epsilon_\infty$ plot in Fig. 6.3. The error behaves as expected. For Euler the slope equals approximately 1, indicating a global error of $O(h)$. For RK4 the slope equals approximately 4, indicating a global error of $O(h^4)$.
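The slopes can also be computed directly from the numbers in Table 6.1. The sketch below does this in two ways: from the ratio of successive errors (h is halved each time, so the observed order is the base-2 logarithm of the ratio) and from a least-squares line through the log-log data. The variable names are illustrative.

```matlab
% Observed order of convergence from the errors in Table 6.1.
h       = [0.1, 0.05, 0.025, 0.0125, 0.00625];
e_euler = [4.00e-02, 1.96e-02, 9.69e-03, 4.82e-03, 2.40e-03];
e_rk4   = [9.69e-06, 5.88e-07, 3.62e-08, 2.24e-09, 1.39e-10];

% h is halved from one row to the next, so p = log2( e(h) / e(h/2) ).
p_euler = log2(e_euler(1:end-1) ./ e_euler(2:end));
p_rk4   = log2(e_rk4(1:end-1)   ./ e_rk4(2:end));

% Slope of the straight line through the log10(h), log10(error) data.
c_euler = polyfit(log10(h), log10(e_euler), 1);
c_rk4   = polyfit(log10(h), log10(e_rk4),   1);

fprintf('Euler: slope %.2f,  RK4: slope %.2f\n', c_euler(1), c_rk4(1));
```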

[Plot of $\log_{10} \epsilon_\infty$ versus $\log_{10} h$ with curves for euler and rk4.]

Figure 6.3: Global error $\epsilon_\infty$ at t = 1 for Euler and RK4.

6.5 Stability of one-step methods
To determine whether small errors grow without bound or not, we follow the same struc-
ture as in Sec. 5.8 for scalar IVPs. We first consider stability for linear systems and then
extend the result to non-linear systems.

## 6.5.1 Linear systems

We consider the linear system of IVPs

$$\mathbf{y}' = A\mathbf{y}, \quad \mathbf{y}(t_0) = \mathbf{y}_0,$$

with A a nonsingular m × m matrix with constant coefficients.
A similar analysis as for the test equation in Sec. 5.8.2 can be performed, but now using an amplifying matrix $G(hA)$ instead of the amplifying factor $k(h\lambda)$. The result is a criterion similar to that for scalar IVPs. A one-step method for systems is absolutely stable if

$$|k(h\lambda_j)| < 1, \quad j = 1, \ldots, m,$$

where $\lambda_j$ are the m eigenvalues of the matrix A. Note that all m eigenvalues $\lambda_j$ need to satisfy the stability criterion. If $|k(h\lambda_j)| > 1$ for one eigenvalue, the numerical technique becomes unstable.
Using the expressions for the amplifying factors $k(h\lambda)$ obtained in Sec. 5.6 gives

• Euler: $|1 + h\lambda_j| < 1$ for $j = 1, \ldots, m$. This means all m values of $h\lambda_j$ should be inside the circle depicted in Fig. 5.4.

• Trapezoidal rule: $\left| \dfrac{1 + h\lambda_j/2}{1 - h\lambda_j/2} \right| < 1$ for $j = 1, \ldots, m$. This means all m values of $h\lambda_j$ should be in the left half-plane.

• RK4: $\left| 1 + h\lambda_j + \tfrac{1}{2}(h\lambda_j)^2 + \tfrac{1}{6}(h\lambda_j)^3 + \tfrac{1}{24}(h\lambda_j)^4 \right| < 1$ for $j = 1, \ldots, m$. This means all m values of $h\lambda_j$ should be inside the curve depicted in Fig. 5.6.
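These criteria translate directly into a small check that can be run before choosing a step size. In the sketch below the matrix A and the trial step size h are illustrative assumptions; the three function handles are the amplifying factors listed above.

```matlab
% Check absolute stability of Euler, the trapezoidal rule and RK4 for y' = A*y.
A = [-1 0; 1 -20];                 % illustrative matrix with eigenvalues -1 and -20
h = 0.12;                          % trial step size
z = h*eig(A);                      % all values h*lambda_j

k_euler = @(z) 1 + z;
k_trap  = @(z) (1 + z/2)./(1 - z/2);
k_rk4   = @(z) 1 + z + z.^2/2 + z.^3/6 + z.^4/24;

% A method is absolutely stable only if ALL eigenvalues satisfy |k| < 1.
stable_euler = all(abs(k_euler(z)) < 1);
stable_trap  = all(abs(k_trap(z))  < 1);
stable_rk4   = all(abs(k_rk4(z))   < 1);
```

For this choice h·λ = -2.4 for the eigenvalue -20, which lies outside the circle of Euler but inside the regions of the trapezoidal rule and RK4, so only Euler fails the test.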

## 6.5.2 Nonlinear systems

We consider the nonlinear system of IVPs

$$\mathbf{y}' = \mathbf{g}(\mathbf{y}), \quad \mathbf{y}(t_0) = \mathbf{y}_0.$$

A similar analysis as in Sec. 5.8.5 can be performed. Local linearization of the nonlinear system leads to a vector equation for the error $\boldsymbol{\epsilon} = \mathbf{y} - \mathbf{w}$,

$$\boldsymbol{\epsilon}' = J\boldsymbol{\epsilon},$$

where J is the Jacobian matrix. This linear system can be treated as in Sec. 6.5.1.

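Example
We use Euler's method for systems to solve the nonlinear system

$$\begin{pmatrix} y_1' \\ y_2' \end{pmatrix} = \begin{pmatrix} -y_1^2 \\ y_1 - 20y_2 \end{pmatrix}, \qquad \begin{pmatrix} y_1(0) \\ y_2(0) \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

We have

$$J = \begin{pmatrix} -2y_1 & 0 \\ 1 & -20 \end{pmatrix},$$

which has two real eigenvalues $\lambda_1 = -2y_1$ and $\lambda_2 = -20$. For stability we need $|k(h\lambda_j)| < 1$, which becomes $h < 2/|\lambda_j|$ for Euler. This gives two conditions, $h < 1/|y_1|$ and $h < 0.1$, or combining

$$h < \min\left( \frac{1}{|y_1|},\, 0.1 \right).$$

The numerical solution for the second component $y_2$ (plotted as $x_2$) for h = 0.2, h = 0.1, and h = 0.05 is depicted in Fig. 6.4. Stability is in agreement with the linearized theory: absolutely stable for h < 0.1.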

[Plot of $x_2$ versus t for 0 ≤ t ≤ 1 with curves for h=0.2, h=0.1 and h=0.05.]

Figure 6.4: Unstable, stable, and absolutely stable behavior of Euler's method for systems at various h.
Remarks
• We neglected higher order terms when we did the linearization. Thus it is safer to
take the step size a little smaller than the one obtained from the linearization.
• The stability criterion depends on the solution y i which is a priori unknown. The
maximum allowable step size to guarantee stability for step i thus needs to be
determined at every step.
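The behavior in the example can be reproduced with a short script. The sketch below applies Euler to the system $y_1' = -y_1^2$, $y_2' = y_1 - 20y_2$ with the three step sizes of the example and records the second component at t = 1; the variable names are illustrative.

```matlab
% Euler for y1' = -y1^2, y2' = y1 - 20*y2, y(0) = (1, 1), integrated to t = 1.
g  = @(y) [-y(1)^2; y(1) - 20*y(2)];
hs = [0.2, 0.1, 0.05];             % step sizes of the example
y2_end = zeros(size(hs));
for m = 1:numel(hs)
    h = hs(m);
    N = round(1/h);                % number of steps to reach t = 1
    y = [1; 1];
    for i = 1:N
        y = y + h*g(y);            % Euler step for both components
    end
    y2_end(m) = y(2);
end
fprintf('h = %.2f: y2(1) = %10.3e\n', [hs; y2_end]);
```

For h = 0.2 the second component is multiplied by roughly 1 - 20h = -3 every step and grows quickly, for h = 0.1 it stays bounded without decaying, and for h = 0.05 it remains small.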

Chapter 7

• Boundary and initial conditions

• Accuracy

• Stability

Numerical methods:

• Finite element method

7.1 Problem description: pollution models
7.1.1 Governing equation
In Sec. 4.1 we discussed pollution of a narrow and shallow river for which the concentration
of the pollutant depends on x (coordinate along the river) and time t: c = c(x, t). Flow
will then occur in the x direction only, represented by a scalar velocity v. The resulting
model was the partial differential equation (PDE) of Eq. (4.2)

$$\frac{\partial c}{\partial t} = -v\frac{\partial c}{\partial x} + D\frac{\partial^2 c}{\partial x^2} + r - kc,$$

where v is the velocity of the river, D the diffusivity, r(x, t) a production term, and k the rate of decay. In Chap. 4, we determined equilibrium solutions (i.e. when the concentration does not change anymore in time, or $\partial c/\partial t = 0$) by solving a BVP. In this chapter, we will solve the partial differential equation, i.e. we will predict the concentration as a function of time t and space x.
As an example, we consider the equation with v = k = r = 0 and $D = 1/\pi^2$,

$$\frac{\partial c}{\partial t} = \frac{1}{\pi^2}\frac{\partial^2 c}{\partial x^2}. \tag{7.1}$$

## 7.1.2 Boundary and initial conditions

Eq. (7.1) contains a second-order derivative in space and a first-order derivative in time. From differential equations theory, we know that to determine a unique solution of a PDE that is nth order in 1D space (x) and mth order in time, we need n boundary conditions and m initial conditions. Thus for the PDE considered here, we need two boundary conditions and one initial condition.
Initial conditions are similar to initial conditions of IVPs (except that the constant can now be a function of x). As initial condition, the concentration can be specified at time $t_0$ for all values of x,

$$c(x, t = t_0) = c_0(x),$$

where $c_0(x)$ is the initial concentration profile.
Boundary conditions are similar to boundary conditions for BVPs (except that the constants can now be functions of t and that we have partial derivatives instead of derivatives of a function of one variable). The following boundary conditions can be specified at a boundary $x = x_b$ for all values of t.

• Dirichlet boundary conditions. The concentration is prescribed at the boundary,

$$c(x = x_b, t) = C_D(t),$$

where $C_D(t)$ is the given concentration at the boundary point $x_b$ as a function of time t.

• Neumann boundary conditions. The mass flux (or concentration gradient) is prescribed at the boundary:

$$\frac{\partial c}{\partial x}(x = x_b, t) = C_N(t),$$

where $C_N(t)$ may depend on t.

• Robin boundary conditions. This is a combination of a Dirichlet and Neumann condition:

$$\frac{\partial c}{\partial x}(x = x_b, t) + k_R(t)\,c(x = x_b, t) = C_R(t),$$

where $k_R(t)$ and $C_R(t)$ may depend on t.

7.1.3 Example
The example we use for PDEs throughout this chapter is

$$\frac{\partial c}{\partial t} = \frac{1}{\pi^2}\frac{\partial^2 c}{\partial x^2}, \quad 0 < x < 1,$$
$$c(x = 0, t) = 0, \quad c(x = 1, t) = 0,$$
$$c(x, t = 0) = \sin(\pi x). \tag{7.2}$$

7.2 Validation of numerical code for PDEs
Numerical calculations to solve PDEs are much more involved than calculations for ODEs.
It is thus very important to check the numerical solution carefully. We discuss three ways.

## 7.2.1 Equilibrium solutions

For equilibrium solutions, the solution doesn't change anymore in time, i.e. $\partial c/\partial t = 0$. This gives a BVP which can more easily be solved, in simple cases analytically. For the example Eq. (7.2) we need to solve $d^2c/dx^2 = 0$, which gives $c(x) = Ax + B$. Applying the boundary conditions c(x = 0) = 0 and c(x = 1) = 0 gives A = B = 0, and the equilibrium solution is c(x) = 0. The equilibrium also makes physical sense: diffusion transports all pollution through the boundaries.

## 7.2.2 Analytical solutions

Even though it is hard to solve a PDE, it is not difficult to find an analytical solution of a PDE to test a numerical code. Instead of Eq. (7.2), we consider

$$\frac{\partial c}{\partial t} = \frac{1}{\pi^2}\frac{\partial^2 c}{\partial x^2} + f(x, t), \quad 0 < x < 1.$$

From the theory of partial differential equations we know that solutions can be written as products of exponentials in time and sine or cosine functions in space. We take

$$c(x, t) = e^{-t}\sin(\pi x).$$

We substitute the analytical solution c(x, t) into the PDE and try to find the corresponding f(x, t). We find f(x, t) = 0, which means that $c(x, t) = e^{-t}\sin(\pi x)$ satisfies the PDE in Eq. (7.2). The initial condition that corresponds to this solution is $c(x, t = 0) = \sin(\pi x)$. The boundary conditions that correspond to this solution are c(x = 0, t) = 0 and c(x = 1, t) = 0. Both the initial and the boundary conditions are thus identical to those in Eq. (7.2). Hence $c(x, t) = e^{-t}\sin(\pi x)$ is the solution of Eq. (7.2).
Fig. 7.1 shows the analytical solution for various times.
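For completeness, the substitution that leads to f(x, t) = 0 can be written out term by term. The derivation below uses only the chosen solution and the modified PDE with source term f.

```latex
\begin{align*}
c(x,t) &= e^{-t}\sin(\pi x), \\
\frac{\partial c}{\partial t} &= -e^{-t}\sin(\pi x), \\
\frac{\partial^2 c}{\partial x^2} &= -\pi^2 e^{-t}\sin(\pi x), \\
f(x,t) &= \frac{\partial c}{\partial t} - \frac{1}{\pi^2}\frac{\partial^2 c}{\partial x^2}
        = -e^{-t}\sin(\pi x) + e^{-t}\sin(\pi x) = 0 .
\end{align*}
```

The factor $1/\pi^2$ in the diffusivity is what makes the decay rate in time exactly equal to 1 for the mode $\sin(\pi x)$.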

## 7.2.3 Numerical calculations with Matlab

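PDEs in time and 1D space can be solved using Matlab's built-in function pdepe. We only consider the case m = 0 in pdepe, which is sufficient to solve the problems we consider. Then pdepe solves the PDE

$$c\!\left(x, t, u, \frac{\partial u}{\partial x}\right)\frac{\partial u}{\partial t} = \frac{\partial}{\partial x} f\!\left(x, t, u, \frac{\partial u}{\partial x}\right) + s\!\left(x, t, u, \frac{\partial u}{\partial x}\right).$$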
[Plot of c(x,t) versus x for 0 ≤ x ≤ 1 with curves for t=0, t=0.1, t=0.5, t=1, t=2 and t=10.]

Figure 7.1: Numerical solution of the pde Eq. (7.2) at various time levels.

For our example PDE Eq. (7.2), we have c(x, t, u, ∂u/∂x) = 1, f(x, t, u, ∂u/∂x) = (1/π²) ∂u/∂x,
and s(x, t, u, ∂u/∂x) = 0. The boundary conditions for the left and right end point should have
the form

$$p(x,t,u) + q(x,t)\, f\left(x,t,u,\frac{\partial u}{\partial x}\right) = 0,$$

where f is identical to the f in the PDE. For our example PDE Eq. (7.2) we have p = u and
q = 0 for both the left and the right boundary point.
A PDE can be solved by typing in the Command Window
sol = pdepe(0, 'pdefun', 'pdeic', 'pdebc', xj, ti);
where the first input argument, 0, is the value of m. The array xj should contain
the grid points at which you want to obtain the numerical solution. The array ti should
contain the time values at which you want to obtain the numerical approximation (note
that these are not all time levels that Matlab uses in the computation, only the time
levels at which a solution is stored in the output array sol). Intermediate time levels needed to
obtain a sufficiently accurate solution are determined inside pdepe. In the solution matrix
sol(i,j), row i corresponds to the selected time level ti(i) and column j to the grid point
xj(j).
The three strings pdefun, pdeic, and pdebc are the names of the m-files that contain
the information for the PDE, the initial condition, and the boundary conditions, respectively.
For our example PDE Eq. (7.2), we would use the following three m-files:

function [c, f, s] = pdefun(x, t, u, DuDx)

% Evaluate c, f, s for c Du/Dt = Df/Dx + s

c = 1;
f = DuDx / pi^2;
s = 0;

function [u0] = pdeic(x)

% Initial condition for a pde as a function of x

u0 = sin(pi*x);

function [pl, ql, pr, qr] = pdebc(xl, ul, xr, ur, t)

% Boundary conditions p + q*f = 0 at the left (l) and right (r) end point

pl = ul;
ql = 0;
pr = ur;
qr = 0;

Fig. 7.2 shows the numerical approximation using pdepe together with the exact solution
as a function of x at various times. The results for pdepe were obtained using h = 1/10.
We see that qualitatively (on the scale of the figure) the numerical solution agrees well
with the exact solution.

[Plot for Figure 7.2: c(x,t) versus x on 0 ≤ x ≤ 1, with curves for t = 0, 0.1, 0.5, 1, 2, and 10.]

Figure 7.2: Exact solution (solid line) and numerical approximation using pdepe (symbols)
of Eq. (7.2) at times indicated in the legend.

## 7.3 Solving PDEs numerically: Introduction
The solution c(x, t) of Eq. (7.2) depends on spatial coordinate x and time t. Thus we now
need a grid for the spatial and time integration. For the spatial discretization we use a
grid xj for j = 0, . . . , m as we used to solve BVPs in Chap. 4. To keep the algebra as
simple as possible, we only consider equally spaced grid points xj with grid size h. For the
time discretization we use a grid tk for k = 0, . . . , n as we used to solve IVPs in Chap. 5.
To keep the algebra as simple as possible, we only consider equally spaced times tk with
step size ∆t.
In a PDE, partial derivatives need to be discretized. Discretization of partial derivatives
is identical to the discretization of derivatives of functions of one variable. Thus for
Eq. (7.2) we can use the finite difference formulas or the finite element method discussed for
BVPs to discretize ∂²c/∂x² (see Sec. 4.5 and 4.7).
The general strategy to discretize a PDE in time and 1D space is as follows.

• First discretize the PDE in the spatial direction x. For this you can use finite
differences or finite elements. This is similar to BVPs (see Sec. 4.5 and 4.7).

• If necessary, eliminate boundary conditions to obtain a system of IVPs. Boundary
conditions can be eliminated as in Sec. 4.6 for BVPs.

• Solve the system of IVPs using any method for systems of IVPs. For example,
Euler, trapezoidal rule, or RK4 for systems (see Chap. 6).

• If boundary conditions had to be eliminated, construct the whole solution vector

## 7.4 Finite differences
We only consider the PDE with boundary and initial conditions as described in Eq. (7.2).
The PDE is first discretized in the x direction using the finite difference formulas discussed in
Sec. 4.5. The key difference with Sec. 4.5 is that after the discretization in space the
approximate values at the nodes are still a function of time: c_j = c_j(t). Thus after using
the central O(h²) approximation for ∂²c/∂x² we get the m − 1 ODEs and 2 boundary
conditions

$$\begin{aligned}
c_0 &= 0,\\
\frac{dc_j}{dt}(t) &= -\frac{1}{\pi^2 h^2}\left[-c_{j-1}(t) + 2c_j(t) - c_{j+1}(t)\right], \qquad j = 1,\dots,m-1,\\
c_m &= 0,
\end{aligned}$$

with initial condition c_j(t = 0) = sin(πx_j). Note that the partial derivative with respect
to time has become a d/dt derivative since c_j only depends on time.
Next we eliminate the boundary conditions, to get a system of IVPs (see Sec. 4.6).
Only the equation for j = 1 contains c_0 and only the equation for node j = m − 1 contains
c_m. Substitution of c_0 = 0 and c_m = 0 gives

$$\begin{aligned}
\frac{dc_1}{dt} &= -\frac{1}{\pi^2 h^2}\left(-0 + 2c_1 - c_2\right) = -\frac{1}{\pi^2 h^2}\left(2c_1 - c_2\right),\\
\frac{dc_j}{dt}(t) &= -\frac{1}{\pi^2 h^2}\left[-c_{j-1}(t) + 2c_j(t) - c_{j+1}(t)\right], \qquad j = 2,\dots,m-2,\\
\frac{dc_{m-1}}{dt} &= -\frac{1}{\pi^2 h^2}\left(-c_{m-2} + 2c_{m-1} - 0\right) = -\frac{1}{\pi^2 h^2}\left(-c_{m-2} + 2c_{m-1}\right),
\end{aligned}$$

with initial condition c_j(t = 0) = sin(πx_j).
We obtained a system of IVPs which can be written in general matrix-vector form

$$\frac{dc}{dt} = Pc + r$$

with

$$P = \frac{-1}{\pi^2 h^2}\begin{pmatrix} 2 & -1 & 0 & \cdots & 0\\ -1 & 2 & -1 & \ddots & \vdots\\ 0 & \ddots & \ddots & \ddots & 0\\ \vdots & \ddots & -1 & 2 & -1\\ 0 & \cdots & 0 & -1 & 2\end{pmatrix},\qquad c = \begin{pmatrix} c_1\\ c_2\\ \vdots\\ c_{m-2}\\ c_{m-1}\end{pmatrix},\qquad r = \begin{pmatrix} 0\\ 0\\ \vdots\\ 0\\ 0\end{pmatrix},$$

with initial condition c(t = 0) = sin(πx). Note that in general r is non-zero due to
non-zero boundary conditions and/or a non-zero right-hand-side function r(x, t) in the
PDE.
The system of IVPs can be solved in time using any of the numerical techniques
discussed in Sec. 6.3 for c′ = f(t, c). The right-hand-side vector is now f(t, c) = Pc + r.
Remarks

• Since you have a large system of equations, and you want to be able to change the
number of grid points m easily, you would use a for-loop to compute the right-hand-side
vector f(t, c) = Pc + r.

• For explicit methods to solve systems of IVPs, it is not necessary to introduce a
matrix. Only the vector Pc is necessary.

• Since f(t, c) = Pc + r is linear in c, a tridiagonal, linear system needs to be solved
for implicit techniques to find c at the new time level i + 1. This can be done with
Crout’s method (see Sec. 4.9) so that solving the system is still fast. Also no full
matrix needs to be stored. Storing just the three diagonals is sufficient.

• For nonlinear PDEs you need to solve a nonlinear system every time step, using for
example Newton’s method for systems.
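The first two remarks can be illustrated with a short script. This is a minimal sketch under stated assumptions, not the course's own program: it uses h = 1/10 and ∆t = 2·10⁻³, integrates up to an assumed end time t = 1, and the variable names (cpad, nsteps, err) are chosen here for illustration.

```matlab
% Finite differences in space + Euler in time for Eq. (7.2)
h  = 1/10;                 % grid size
dt = 2e-3;                 % time step
m  = round(1/h);           % number of intervals
x  = (1:m-1)' * h;         % interior grid points x_1, ..., x_{m-1}
c  = sin(pi * x);          % initial condition at the interior points
nsteps = round(1/dt);      % number of time steps to reach t = 1 (assumed)

for i = 1:nsteps
    cpad = [0; c; 0];      % add the boundary values c_0 = c_m = 0
    f = zeros(m-1, 1);     % right-hand side f = P c, no matrix is stored
    for j = 1:m-1
        % cpad(j), cpad(j+1), cpad(j+2) hold c_{j-1}, c_j, c_{j+1}
        f(j) = (cpad(j) - 2*cpad(j+1) + cpad(j+2)) / (pi^2 * h^2);
    end
    c = c + dt * f;        % Euler step
end

% maximum error at t = 1 compared with the exact solution
err = max(abs(c - exp(-1) * sin(pi * x)));
disp(err)
```

Changing the number of grid points only requires changing h; the for-loop over j adapts automatically.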

Fig. 7.3 shows the finite difference approximation together with the exact solution as
a function of x at various times. The results for the FD method were obtained using
h = 1/10 and ∆t = 2·10⁻³. We see that qualitatively (on the scale of the figure) the
numerical solution agrees well with the exact solution.

[Plot for Figure 7.3: c(x,t) versus x on 0 ≤ x ≤ 1, with curves for t = 0, 0.1, 0.5, 1, 2, and 10.]

Figure 7.3: Exact solution (solid line) and FD approximation (symbols) of Eq. (7.2) at
times indicated in the legend.

## 7.5 Finite elements
We only consider the PDE with boundary and initial conditions as described in Eq. (7.2).
The PDE is first discretized in the x direction using finite elements. We only consider
linear finite elements as in Sec. 4.7. The key difference with Sec. 4.7 is that after the
discretization in space the approximate values at the nodes are still a function of time:

$$c(x,t) = \sum_{j=0}^{m} c_j(t)\,\phi_j(x).$$

The weak form of the heat equation (PDE) is again obtained by multiplying by a test
function ψ, integrating over the domain, and integrating the second-derivative term by parts:

$$\int_0^1 \frac{\partial c}{\partial t}\,\psi\,dx = -\frac{1}{\pi^2}\int_0^1 \frac{\partial c}{\partial x}\frac{\partial \psi}{\partial x}\,dx + \frac{1}{\pi^2}\left[\psi\,\frac{\partial c}{\partial x}\right]_0^1$$

for all suitable test functions ψ.
The Galerkin finite element method is obtained by choosing for the test functions
ψ the basis functions φ_i(x), i = 0, . . . , m, substituting the finite element approximation
c(x, t) = Σ_{j=0}^{m} c_j(t) φ_j(x) in the integral terms, and evaluating the integrals
element-by-element:

$$\sum_{l=1}^{m}\int_{e_l}\sum_{j=0}^{m}\frac{dc_j}{dt}\,\phi_j\,\phi_i\,dx = -\frac{1}{\pi^2}\sum_{l=1}^{m}\int_{e_l}\sum_{j=0}^{m} c_j\,\frac{d\phi_j}{dx}\frac{d\phi_i}{dx}\,dx + \frac{1}{\pi^2}\left[\phi_i\,\frac{\partial c}{\partial x}\right]_0^1,$$

for i = 0, . . . , m. The contributions of the element integrals can be put in two element
matrices (the element vector is the zero vector). After evaluating the element integrals,
we obtain

$$k^{(l)} = \frac{h}{6}\begin{pmatrix} 2 & 1\\ 1 & 2\end{pmatrix},\qquad p^{(l)} = \frac{-1}{\pi^2 h}\begin{pmatrix} 1 & -1\\ -1 & 1\end{pmatrix},$$

where k^{(l)} is the element matrix corresponding to the dc_j/dt term and p^{(l)} the element
matrix corresponding to the c_j term.
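Assembling into two global matrices K and P gives the (m + 1) × (m + 1) matrices

$$K = \frac{h}{6}\begin{pmatrix} 2 & 1 & 0 & \cdots & 0\\ 1 & 4 & 1 & \ddots & \vdots\\ 0 & \ddots & \ddots & \ddots & 0\\ \vdots & \ddots & 1 & 4 & 1\\ 0 & \cdots & 0 & 1 & 2\end{pmatrix},\qquad P = \frac{-1}{\pi^2 h}\begin{pmatrix} 1 & -1 & 0 & \cdots & 0\\ -1 & 2 & -1 & \ddots & \vdots\\ 0 & \ddots & \ddots & \ddots & 0\\ \vdots & \ddots & -1 & 2 & -1\\ 0 & \cdots & 0 & -1 & 1\end{pmatrix}.$$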

Boundary conditions are handled exactly the same way as in Sec. 4.7. For the two
Dirichlet boundary conditions considered here, we replace the equation for i = 0 by
c_0 = 0, and the equation for i = m by c_m = 0. Next we eliminate the Dirichlet boundary
conditions, to get a system of IVPs (see Sec. 4.6). Note that we do not only need to
eliminate c_0 and c_m from the equations, but also dc_0/dt and dc_m/dt. This does not lead
to major complications since we know the Dirichlet boundary conditions as a function of
time. Since the values of the boundary conditions we consider are independent of time,
both derivatives are zero. Only the equation for i = 1 contains contributions from c_0 and
dc_0/dt and only the equation for i = m − 1 contains c_m and dc_m/dt. Substitution of c_0 = 0,
dc_0/dt = 0, c_m = 0, and dc_m/dt = 0 gives in matrix-vector form

$$K\frac{dc}{dt} = Pc + r,$$

where K and P are (m − 1) × (m − 1) matrices and r is a vector of length m − 1:

$$K = \frac{h}{6}\begin{pmatrix} 4 & 1 & 0 & \cdots & 0\\ 1 & 4 & 1 & \ddots & \vdots\\ 0 & \ddots & \ddots & \ddots & 0\\ \vdots & \ddots & 1 & 4 & 1\\ 0 & \cdots & 0 & 1 & 4\end{pmatrix},\qquad P = \frac{-1}{\pi^2 h}\begin{pmatrix} 2 & -1 & 0 & \cdots & 0\\ -1 & 2 & -1 & \ddots & \vdots\\ 0 & \ddots & \ddots & \ddots & 0\\ \vdots & \ddots & -1 & 2 & -1\\ 0 & \cdots & 0 & -1 & 2\end{pmatrix},\qquad r = \begin{pmatrix} 0\\ 0\\ \vdots\\ 0\\ 0\end{pmatrix},$$

with initial condition c_j(t = 0) = sin(πx_j).

The system of equations can be written as a system of IVPs by multiplying by K⁻¹:

$$\frac{dc}{dt} = K^{-1}Pc + K^{-1}r.$$

Applying Euler, for example, would give (since r = 0)

$$c_{i+1} = c_i + \Delta t\,K^{-1}Pc_i,$$

where c_i now denotes the solution vector at time level i. Since it is computationally
expensive to compute the inverse, however, you would multiply by K and solve the system

$$Kc_{i+1} = (K + \Delta t\,P)\,c_i.$$

Since K is tridiagonal, this system can be solved very efficiently using Crout’s method (see
Sec. 4.9). Also no full matrix needs to be stored. Storing just the three diagonals is sufficient.
For the right-hand side there is also no need to introduce the big matrices K and P
explicitly. The result of the matrix-vector product (K + ∆t P)c_i is a vector and that is all
you need.
Fig. 7.4 shows the linear finite element approximation together with the exact solution
as a function of x at various times. The results for FEM were obtained using h = 1/10
and ∆t = 2·10⁻³. We see that qualitatively (on the scale of the figure) the numerical
solution agrees well with the exact solution.
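The FEM time stepping with Euler can be sketched as follows. This is an illustration under assumptions rather than the course's own code: K and P are stored as sparse matrices with spdiags, the linear system is solved with Matlab's backslash operator instead of Crout's method, and the end time t = 1 is an assumed value.

```matlab
% Linear finite elements in space + Euler in time for Eq. (7.2)
h  = 1/10;                 % grid size
dt = 2e-3;                 % time step
n  = round(1/h) - 1;       % number of interior nodes, m - 1
x  = (1:n)' * h;           % interior grid points
e  = ones(n, 1);

% reduced (m-1) x (m-1) matrices after eliminating the boundary conditions
K = (h/6) * spdiags([e 4*e e], -1:1, n, n);
P = -(1/(pi^2*h)) * spdiags([-e 2*e -e], -1:1, n, n);

c = sin(pi * x);           % initial condition
for i = 1:round(1/dt)      % integrate up to t = 1 (assumed end time)
    rhs = (K + dt*P) * c;  % right-hand side vector
    c = K \ rhs;           % solve K c_{i+1} = (K + dt P) c_i
end

err = max(abs(c - exp(-1) * sin(pi * x)));
disp(err)
```

Only the three diagonals of K and P are stored in the sparse format, in line with the remark above.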

[Plot for Figure 7.4: c(x,t) versus x on 0 ≤ x ≤ 1, with curves for t = 0, 0.1, 0.5, 1, 2, and 10.]

Figure 7.4: Exact solution (solid line) and FEM approximation (symbols) of Eq. (7.2) at
times indicated in the legend.

## 7.6 Stability
In Sec. 6.5, we discussed stability for a system of IVPs. For a linear system y′ = Ay we
need all eigenvalues λ_j of A to satisfy the stability criterion |k(hλ_j)| < 1, where h is the
step size in time (denoted ∆t in this chapter). To determine the step sizes for which a
computation is stable we need to find the eigenvalues of A.

Eigenvalues
To find the eigenvalue of largest magnitude of a matrix A we can calculate the eigenvalues
or estimate them:

• The eigenvalues of a matrix A can be calculated with the built-in Matlab function
eig. For example
A = [2 1;1 2];
lambda = eig(A)
creates an array lambda with all eigenvalues
lambda =
1
3
The built-in Matlab function max computes the maximum of an array of values:
lammax = max(lambda)
gives
lammax =
3
If eigenvalues can be negative, use max(abs(lambda)) to obtain the largest magnitude.
The disadvantage is that we can only use this for a matrix with numerical values, not
for a matrix that contains a variable such as h. In addition, it may take a lot of
computing time to compute eigenvalues for large matrices.

• The eigenvalues of P can be estimated using

– Gerschgorin’s theorem
To find bounds for the eigenvalues τ using Gerschgorin’s theorem we
need to check for every row k of an N × N matrix A

$$|\tau - a_{kk}| \le \sum_{j=1,\ j\ne k}^{N} |a_{kj}|.$$

The right-hand side is just the sum of the magnitudes of the off-diagonal entries
in row k. Every eigenvalue satisfies this inequality for at least one row k.
– Rayleigh’s quotient
Rayleigh’s quotient is useful to relate eigenvalues of the element matrix to
eigenvalues of the global matrix in the finite element method. How to use
Rayleigh’s quotient exactly is outside the scope of Math 4414.
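Gerschgorin's bounds are easy to evaluate numerically. The sketch below uses a small symmetric example matrix chosen here for illustration (it is not from the notes) and compares the resulting interval with the eigenvalues computed by eig. For a symmetric matrix the eigenvalues are real, so the discs reduce to intervals on the real axis.

```matlab
% Example matrix (assumed for illustration): tridiagonal with 2 and -1
A = [2 -1 0; -1 2 -1; 0 -1 2];

centers = diag(A);                         % a_kk for every row k
radii   = sum(abs(A), 2) - abs(centers);   % sum of |a_kj| over j ~= k

% for real eigenvalues every tau lies in the union of the intervals
% [a_kk - radius_k, a_kk + radius_k]
lower = min(centers - radii);
upper = max(centers + radii);

lambda = eig(A);                           % exact eigenvalues for comparison
disp([lower upper])
disp(lambda')
```

For this matrix the bounds are 0 and 4, and the computed eigenvalues 2 − √2, 2, and 2 + √2 indeed lie in between.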

Eigenvalues of symmetric matrices
From linear algebra, we know that symmetric matrices have real eigenvalues. This
restricts the relevant part of the region of absolute stability to the part on the real axis.
Thus for symmetric matrices we need for stability of Euler’s method ∆t ≤ 2/|λ_j| and for RK4
approximately ∆t ≤ 2.75/|λ_j|. This needs to hold for all eigenvalues λ_j, thus we only need to
determine the eigenvalue of largest magnitude.

## 7.6.1 Finite differences

For the FD example we considered, we obtained dc/dt = Pc with P the (m − 1) × (m − 1)
matrix

$$P = \frac{-1}{\pi^2 h^2}\begin{pmatrix} 2 & -1 & 0 & \cdots & 0\\ -1 & 2 & -1 & \ddots & \vdots\\ 0 & \ddots & \ddots & \ddots & 0\\ \vdots & \ddots & -1 & 2 & -1\\ 0 & \cdots & 0 & -1 & 2\end{pmatrix}.$$

Applying Gerschgorin’s theorem to each row j = 2, . . . , m − 2 gives

$$\left|\tau + \frac{2}{\pi^2 h^2}\right| \le \frac{2}{\pi^2 h^2},$$

or, since P is symmetric and therefore has real eigenvalues,

$$-\frac{4}{\pi^2 h^2} \le \tau \le 0.$$

Similarly, for the first and last row, corresponding to j = 1 and j = m − 1, the off-diagonal
sum is 1/(π²h²) and we get

$$-\frac{3}{\pi^2 h^2} \le \tau \le -\frac{1}{\pi^2 h^2},$$

which lies inside the previous interval. Thus the maximum magnitude is |τ|max = 4/(π²h²).
Applying the stability criterion for real eigenvalues gives for Euler

$$\Delta t < \frac{2}{|\tau|_{\max}} = \frac{\pi^2 h^2}{2}$$

and for RK4

$$\Delta t < \frac{2.75}{|\tau|_{\max}} = \frac{2.75\,\pi^2 h^2}{4}.$$

Note that the stability criterion severely restricts the maximum value of ∆t that can be
used. For h = 10⁻³ you would need for Euler ∆t < 4.93·10⁻⁶ and for RK4 ∆t < 6.78·10⁻⁶.
Fig. 7.5 shows two solutions: one with time step ∆t = 0.04, which is inside the region of
absolute stability, the other with ∆t = 0.06, which is outside the region of absolute stability.
(For h = 1/10, for example, the Euler bound is ∆t < π²h²/2 ≈ 0.0493, which lies between
these two values.) For ∆t = 0.04 the numerical solution remains stable up to t = 100. For
∆t = 0.06 the numerical solution starts to grow rapidly around t = 7. If ∆t is just outside
the region of stability, errors will start to grow immediately but it may take a while before
the magnitude of the error becomes significant.
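The effect of the stability bound can be reproduced with a short experiment. The sketch below is an illustration under assumptions: it takes h = 1/10 (the grid size used for the earlier figures; the notes do not state the value used for Fig. 7.5), time steps 0.04 and 0.06, and an assumed end time t = 20. The matrix P is stored as a full matrix for simplicity.

```matlab
% Euler for dc/dt = P c with a time step below and above the stability bound
h = 1/10;                          % assumed grid size
n = round(1/h) - 1;                % number of interior nodes
x = (1:n)' * h;
e = ones(n, 1);
T = diag(2*e) - diag(e(1:n-1), 1) - diag(e(1:n-1), -1);  % tridiag(-1,2,-1)
P = -T / (pi^2 * h^2);

dtmax = pi^2 * h^2 / 2;            % Gerschgorin-based Euler bound
dts   = [0.04 0.06];               % one step size below, one above the bound
cmax  = zeros(size(dts));

for k = 1:length(dts)
    dt = dts(k);
    c  = sin(pi * x);              % initial condition
    for i = 1:round(20/dt)         % integrate up to t = 20 (assumed)
        c = c + dt * (P * c);      % Euler step
    end
    cmax(k) = max(abs(c));         % size of the numerical solution at t = 20
end
disp(dtmax)
disp(cmax)
```

With the smaller step the solution decays towards zero as expected. With the larger step, round-off errors in the highest-frequency components are amplified every step and eventually dominate the solution.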

[Plots for Figure 7.5: c(x,t) versus x on 0 ≤ x ≤ 1. Panel (a): curves for t = 0, 0.5, 2, 5, 20, and 100. Panel (b): curves for t = 0, 0.5, 2, 5, 7, and 8.]

Figure 7.5: Numerical solution of Eq. (7.2) using finite differences and Euler. (a) ∆t = 0.04
within region of absolute stability, and (b) ∆t = 0.06 outside region of absolute stability.

## 7.6.2 Finite elements

For finite elements we would need to apply Gerschgorin to K⁻¹P, which is not so trivial.
For finite elements using Rayleigh’s quotient is easier. The result is

$$|\tau|_{\max} = \frac{12}{\pi^2 h^2}.$$

Note that |τ|max is three times as large as for finite differences, so that we need to take
a time step that is three times as small. For Euler we now have ∆t < 2/|τ|max = π²h²/6,
which means we need a larger value of h for finite elements than for finite differences to
get a stable scheme for the same step size ∆t.
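The two estimates for |τ|max can be compared with eigenvalues computed by eig for a specific grid. This is a minimal sketch assuming h = 1/10; the matrices are stored as full matrices because eig is applied to them, and the variable names are chosen here for illustration.

```matlab
% Largest eigenvalue magnitude for the FD matrix P and the FEM matrix K^{-1} P
h = 1/10;                          % assumed grid size
n = round(1/h) - 1;                % number of interior nodes
e = ones(n, 1);
off = diag(e(1:n-1), 1) + diag(e(1:n-1), -1);

Pfd  = -(2*eye(n) - off) / (pi^2 * h^2);   % finite differences
K    = (h/6) * (4*eye(n) + off);           % FEM matrix in front of dc/dt
Pfem = -(2*eye(n) - off) / (pi^2 * h);     % FEM matrix in front of c

tau_fd  = max(abs(eig(Pfd)));
tau_fem = max(abs(eig(K \ Pfem)));

bound_fd  = 4  / (pi^2 * h^2);     % Gerschgorin estimate
bound_fem = 12 / (pi^2 * h^2);     % Rayleigh quotient estimate
disp([tau_fd bound_fd; tau_fem bound_fem])
```

On a finite grid the computed values stay somewhat below the estimates, so the estimates give slightly conservative time step restrictions.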

## 7.7 Accuracy
In the discretization of a PDE, we make a discretization error O((∆t)^p) for the time
derivatives and O(h^q) for the spatial derivatives. The total error is the sum of these:
ε = O((∆t)^p) + O(h^q). For example, consider a spatial discretization with error O(h²). If
we use Euler’s method for the time discretization the total discretization error would be
ε = O(∆t) + O(h²), and if we use RK4, ε = O((∆t)⁴) + O(h²).
We simulate Eq. (7.2) numerically. Since the behavior of the error is similar for finite
differences and finite elements, we only consider finite differences. We consider O(h²) finite
differences with Euler and RK4 for the time discretization. We look at the global error at
t = 1/2 and take the l∞ norm over all grid points, ε = ‖y(t = 1/2, x_i) − y_i(t = 1/2)‖∞.
Fig. 7.6(a) shows the global error for a discretization in space with h = 1/10 and
various step sizes ∆t, and Fig. 7.6(b) shows the global error for a step size in time ∆t = 10⁻³
and various grid sizes h.

[Plots for Figure 7.6: (a) log10 ε versus log10 ∆t and (b) log10 ε versus log10 h, each with curves for Euler and RK4.]

Figure 7.6: Global error at t = 1/2 using O(h²) finite differences for the heat equation
using Euler and RK4. (a) h = 1/10 and various ∆t. (b) ∆t = 10⁻³ and various h.

We observe in both figures that the solution obtained with RK4 is not more accurate
than the one obtained with Euler’s method. To explain this perhaps unexpected result,
we need to look more carefully at the discretization errors we make. For Euler the total
discretization error is O(∆t) + O(h²). However, for stability we need ∆t < π²h²/2, so that
∆t = O(h²). Thus the total error we make is

$$\varepsilon = O(\Delta t) + O(h^2) = O(h^2) + O(h^2) = O(h^2).$$

For RK4 the total discretization error is O((∆t)⁴) + O(h²). However, for stability we need
∆t < 2.75π²h²/4. Thus the total error we make is

$$\varepsilon = O((\Delta t)^4) + O(h^2) = O(h^8) + O(h^2) = O(h^2),$$

which is the same order as Euler’s method.
Thus, if we take a very accurate discretization in time using Euler or RK4 and a not
very accurate discretization in space, the total error is dominated by the error in the
space discretization, O(h²). We cannot avoid this situation since we need an accurate
discretization in time to satisfy the stability criterion.
Remarks

• For the PDE considered here, there is no advantage in using RK4 instead of Euler:
it takes more computational time and gives the same accuracy.

• An implicit technique like the trapezoidal rule would be very useful. There is no
stability criterion for ∆t and the error is ε = O((∆t)²) + O(h²). Both ∆t and h
can be varied independently to obtain a more accurate solution.
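The last remark can be illustrated with the trapezoidal rule applied to the finite difference system dc/dt = Pc. The sketch below is an assumption-laden illustration, not the course's program: it uses h = 1/10 and ∆t = 0.1, which is about twice the Euler stability bound π²h²/2, an assumed end time t = 1, and Matlab's backslash operator for the tridiagonal solve.

```matlab
% Trapezoidal rule for dc/dt = P c:
%   (I - dt/2 P) c_{i+1} = (I + dt/2 P) c_i
h  = 1/10;                 % grid size
dt = 0.1;                  % time step, larger than the Euler bound (about 0.049)
n  = round(1/h) - 1;       % number of interior nodes
x  = (1:n)' * h;
e  = ones(n, 1);

P = -(1/(pi^2*h^2)) * spdiags([-e 2*e -e], -1:1, n, n);
I = speye(n);
Aimp = I - (dt/2) * P;     % matrix of the linear system (tridiagonal)
Aexp = I + (dt/2) * P;     % matrix for the right-hand side

c = sin(pi * x);           % initial condition
for i = 1:round(1/dt)      % integrate up to t = 1 (assumed end time)
    c = Aimp \ (Aexp * c);
end

err = max(abs(c - exp(-1) * sin(pi * x)));
disp(err)
```

Even with this large time step the solution stays bounded and the error remains small on the scale of the solution.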

## 7.8 Solving linear and non-linear systems for PDEs
For explicit finite difference methods, the values c_j at the new time level i + 1 are obtained
directly. Other methods require solving a linear or non-linear system of equations. In this
section we discuss some efficient solution methods for the linear and non-linear systems
resulting from the discretization of a PDE in time and 1D space.

## 7.8.1 Direct methods: factorization

Linear systems with tridiagonal matrices
Linear tridiagonal systems are obtained when using linear finite elements with an explicit
time discretization, or when using O(h²) finite differences or linear finite elements with
the trapezoidal rule for the time discretization. Then one needs to solve a system of the
form

$$Kc_{i+1} = f(c_i),$$

where K is tridiagonal (see Sec. 7.5 for FEM with Euler).
As discussed in Sec. 4.9, tridiagonal systems can be solved very efficiently using Crout’s
method (O(m) operations and only the three diagonals need to be stored). For
time-dependent systems the computing time can be reduced even more. Since the matrix K
is the same at every time level i, the factorization only needs to be performed once: in
the initialization step of the program the non-zero L and U components can be computed
and stored. In the i-loop the stored L and U can then be used in the forward and backward
substitution.
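The idea of factorizing once and reusing the factors in the time loop can be sketched as follows. Note the assumptions: Matlab's built-in lu (with row interchanges, permutation matrix called Perm here) is used in place of Crout's method, the matrices are stored as full matrices for simplicity, and the FEM/Euler system of Sec. 7.5 with h = 1/10, ∆t = 2·10⁻³ and end time t = 1 serves as the example.

```matlab
% Factorize K once, then reuse L and U in every time step
h  = 1/10;  dt = 2e-3;
n  = round(1/h) - 1;               % number of interior nodes
x  = (1:n)' * h;
e  = ones(n, 1);
off = diag(e(1:n-1), 1) + diag(e(1:n-1), -1);

K = (h/6) * (4*eye(n) + off);      % FEM matrix in front of dc/dt
P = -(1/(pi^2*h)) * (2*eye(n) - off);
B = K + dt * P;                    % matrix for the right-hand side

[L, U, Perm] = lu(K);              % initialization: factorization done once

c = sin(pi * x);                   % initial condition
for i = 1:round(1/dt)              % time loop up to t = 1 (assumed)
    b = B * c;                     % right-hand side f(c_i)
    y = L \ (Perm * b);            % forward substitution
    c = U \ y;                     % backward substitution
end

err = max(abs(c - exp(-1) * sin(pi * x)));
disp(err)
```

Inside the loop only the two triangular solves are performed, which is where the savings compared with refactorizing at every time level come from.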

Linear systems with non-tridiagonal matrices

For higher-order finite elements more than three basis functions have a non-zero
contribution at a grid point j and the matrix K would no longer be tridiagonal. Often, the
number of grid points is rather small, say O(10) to O(10²), and a direct method can still be
used. Instead of the Crout LU factorization for tridiagonal matrices, we need to perform
an LU factorization for more general matrices. If pivoting is included we need a PLU
factorization. Such factorizations are discussed in more detail in Sec. 7.8.2.
As for tridiagonal systems, it is not necessary to perform the factorization at every time
step. Since the matrix K is the same at every time level i, the factorization only needs
to be performed once: in the initialization step of the program the non-zero L and U
components can be computed and stored. In the i-loop the stored L and U can then be used
in the forward and backward substitution.

Non-linear systems
A nonlinear system in c_{i+1} results if the PDE is nonlinear in c and an implicit method
like the trapezoidal rule is used for the time discretization. The nonlinear system can be
written in the form f(c_{i+1}) = 0 and can be solved, for example, with Newton’s method.
Often c_i, the solution vector at the previous time step, is a good enough initial guess
c_{i+1}^{(0)} in the Newton iteration

$$J\left(c_{i+1}^{(k-1)}\right)\Delta c_{i+1}^{(k)} = -f\left(c_{i+1}^{(k-1)}, c_i\right),$$

after which the new iterate is c_{i+1}^{(k)} = c_{i+1}^{(k−1)} + ∆c_{i+1}^{(k)}.
Since J depends on c_{i+1}^{(k−1)}, the factorization now needs to be performed at every Newton
step (for each time step). If J is tridiagonal, factorization and the forward/backward
substitution can be done efficiently using the O(m) Crout method. Otherwise, an O(m³) LU
factorization for more general matrices is necessary (see Sec. 7.8.2).
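A single time step with Newton's method might look as follows. This is only a sketch: the nonlinear system dc/dt = Pc − c³ is a hypothetical example invented here (it does not appear in the notes), the trapezoidal rule is used in time, and h = 1/10 and ∆t = 0.05 are assumed values. The Jacobian is tridiagonal and is solved with Matlab's backslash operator.

```matlab
% One trapezoidal-rule time step with Newton's method for the
% hypothetical nonlinear system dc/dt = g(c) = P*c - c.^3
h  = 1/10;  dt = 0.05;
n  = round(1/h) - 1;
x  = (1:n)' * h;
e  = ones(n, 1);
P  = -(1/(pi^2*h^2)) * spdiags([-e 2*e -e], -1:1, n, n);

g  = @(c) P*c - c.^3;                       % right-hand side of the IVPs
dg = @(c) P - spdiags(3*c.^2, 0, n, n);     % Jacobian of g

cold = sin(pi * x);        % solution at the previous time level i
cnew = cold;               % initial guess for time level i+1

for k = 1:20
    % residual of the trapezoidal rule, f(c_{i+1}, c_i)
    F = cnew - cold - (dt/2) * (g(cnew) + g(cold));
    if norm(F, inf) < 1e-12, break; end
    J = speye(n) - (dt/2) * dg(cnew);       % Jacobian, changes every Newton step
    cnew = cnew - J \ F;                    % solve J*dc = -F and update
end

resid = norm(cnew - cold - (dt/2) * (g(cnew) + g(cold)), inf);
disp(resid)
```

Because the initial guess is the previous time level, only a few Newton steps are typically needed before the residual drops to round-off level.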

## 7.8.2 LU factorization
LU factorization without row interchanges
For some types of matrices, no row interchanges are necessary in the Gaussian elimination
process. Then the m × m matrix A can be factored into A = LU, with L a lower-triangular
matrix and U an upper-triangular matrix:

$$L = \begin{pmatrix} 1 & 0 & \cdots & 0\\ l_{21} & 1 & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ l_{m1} & \cdots & l_{m,m-1} & 1\end{pmatrix},\qquad U = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1m}\\ 0 & u_{22} & \ddots & \vdots\\ \vdots & \ddots & \ddots & u_{m-1,m}\\ 0 & \cdots & 0 & u_{mm}\end{pmatrix}.$$

If we have the LU factorization of the matrix A, we can solve Ax = b relatively fast.
Since A = LU we need to solve

$$LUx = b.$$

The matrix-vector product Ux is a vector, say y. Thus we can first solve the lower
triangular system

$$Ly = b$$

using forward substitution (first y_1, then y_2 using the already computed value of y_1, etc.)
to find the intermediate vector y, and then solve

$$Ux = y$$

using backward substitution (first x_m, then x_{m−1}, etc.) to find the solution x of the system
LUx = b.
Both the lower and the upper triangular system only take O(m²) operations to solve.
Thus once we have the LU factorization, it is relatively cheap to solve a system involving
the matrix A = LU and any vector b. However, the LU factorization needs to be
computed first, which takes about 2m³/3 operations, i.e. O(m³).
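Forward and backward substitution can be written as two short loops. The sketch below uses a small L and U chosen here for illustration (they are not from the notes), with the right-hand side constructed so that the exact solution is x = (1, 1, 1).

```matlab
% Solve L U x = b by forward and backward substitution
L = [1 0 0; 2 1 0; 0 2 1];         % example unit lower-triangular matrix
U = [4 3 0; 0 2 1; 0 0 3];         % example upper-triangular matrix
A = L * U;
b = A * [1; 1; 1];                 % right-hand side with known solution

m = length(b);

% forward substitution: L y = b (diagonal of L is 1)
y = zeros(m, 1);
for k = 1:m
    y(k) = b(k) - L(k, 1:k-1) * y(1:k-1);
end

% backward substitution: U x = y
x = zeros(m, 1);
for k = m:-1:1
    x(k) = (y(k) - U(k, k+1:m) * x(k+1:m)) / U(k, k);
end
disp(x')
```

Each loop touches every non-zero entry of the triangular matrix once, which is where the O(m²) operation count comes from.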

P LU factorization with row interchanges
If row interchanges are necessary, then a permutation matrix P exists so that PA = LU.
This just means that the rows can be interchanged (via P) so that an LU factorization
exists. Thus we solve

$$PAx = Pb,$$

which is “Ax = b with a different order of the rows”. For PA we can make an LU
factorization, so we solve

$$LUx = Pb.$$

Starting from the matrices P, L, and U and the vector b, we can find the solution x of
Ax = b in three steps:

• compute z = Pb (O(m²) operations)

• compute y from Ly = z (O(m²) operations)

• compute x from Ux = y (O(m²) operations)
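The three steps above can be sketched in Matlab with the built-in function lu. The right-hand side b is chosen here for illustration so that the solution is x = (1, 1, 1); the backslash operator applied to the triangular matrices L and U performs the forward and backward substitution.

```matlab
% Solve A x = b using a PLU factorization in three steps
A = [1 2 6; 4 8 -1; -2 3 -5];
b = [9; 11; -4];           % equals A*[1;1;1], so the solution is x = [1;1;1]

[L, U, P] = lu(A);         % P*A = L*U

z = P * b;                 % step 1: reorder the right-hand side
y = L \ z;                 % step 2: forward substitution,  L y = z
x = U \ y;                 % step 3: backward substitution, U x = y
disp(x')
```

If several right-hand sides have to be handled, only the three steps are repeated; the call to lu is done once.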

Saving memory
The matrices L and U contain a lot of zeros. Storing L and U in separate matrices
is a waste of memory. Usually, L and U are stored in a single matrix

$$\begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1m}\\ l_{21} & u_{22} & \ddots & \vdots\\ \vdots & \ddots & \ddots & u_{m-1,m}\\ l_{m1} & \cdots & l_{m,m-1} & u_{mm}\end{pmatrix}.$$

Note that the 1’s on the diagonal of L are not stored. Storing them is not necessary
since we know exactly what the values on that diagonal are. They are always 1, so we can
just use l_ii = 1 wherever the l_ii’s are needed. Often the matrix A is no longer necessary
after the LU factorization has been performed. By overwriting the matrix A with L and
U, we don’t need any additional memory at all.

• The solution is subject to round-off errors only.

• If you need to solve Ax = b several times with the same matrix A you need to do the
expensive LU factorization only once. Solving the triangular systems is relatively
cheap. This occurs, for example, when

– solving Ax = b with multiple right-hand-side vectors.

– solving a time-dependent problem (like Kc_{i+1} = f(c_i)) where the matrix K
doesn’t depend on time.

Matlab
A PLU factorization of a matrix A can be obtained in Matlab using the built-in function
lu. For example:
A = [1 2 6; 4 8 -1; -2 3 -5]
[L, U, P] = lu(A)
gives

$$L = \begin{pmatrix} 1 & 0 & 0\\ -0.5 & 1 & 0\\ 0.25 & 0 & 1\end{pmatrix},\qquad U = \begin{pmatrix} 4 & 8 & -1\\ 0 & 7 & -5.5\\ 0 & 0 & 6.25\end{pmatrix},\qquad P = \begin{pmatrix} 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\end{pmatrix}.$$

If we check P*A - L*U this gives indeed the zero matrix.
An LU factorization with the L and U matrices stored in a single matrix K can be obtained by
using lu with a single output argument. For example
A = [4 8 -1; -2 3 -5; 1 2 6]
K = lu(A)
gives

$$K = \begin{pmatrix} 4 & 8 & -1\\ -0.5 & 7 & -5.5\\ 0.25 & 0 & 6.25\end{pmatrix}.$$

Here the rows of A were already ordered such that no row interchanges are needed.
Note that information about P is lost.
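The separate factors can be recovered from the single matrix K with the built-in functions tril and triu. This is a small sketch for the example above, where no row interchanges occur, so that L*U reproduces A itself.

```matlab
% Recover L and U from the single matrix returned by lu
A = [4 8 -1; -2 3 -5; 1 2 6];
K = lu(A);                         % L and U stored in one matrix

L = tril(K, -1) + eye(size(A));    % strictly lower part plus the 1's on the diagonal
U = triu(K);                       % upper part including the diagonal

disp(L*U - A)                      % zero matrix, since no rows were interchanged
```

For a matrix that does require row interchanges, L*U would equal a row-permuted version of A, and the permutation would have to be obtained separately with the three-output form of lu.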
