Attribution Non-Commercial (BY-NC)

1 Introduction
1.1 Overview of typical issues in scientific computing
1.2 Structure of the course

2 Algebraic equations
2.1 Problem description and modeling of floating objects
2.2 Solving an algebraic equation by hand
2.3 Solving an algebraic equation with Matlab
2.3.1 Graphical approximation
2.3.2 Symbolical calculations
2.3.3 Numerical calculations
2.4 Digital representation of numbers
2.4.1 Floating point numbers and round-off errors
2.4.2 Binary representation of integer numbers
2.4.3 Binary representation of floating point numbers
2.5 Iterative methods for algebraic equations
2.6 Bisection method
2.6.1 Mathematical background and method
2.6.2 Algorithm and program
2.6.3 Programming issues
2.6.4 Example: equation for floating sphere
2.7 Fixed-point iterations (Picard iteration)
2.7.1 Mathematical background and method
2.7.2 Example: equation for floating sphere
2.7.3 Checking convergence
2.8 Newton’s method
2.8.1 Mathematical background and method
2.8.2 Example: equation for floating sphere
2.9 Rate of convergence
2.9.1 Definition
2.9.2 Example: equation for floating sphere

3 Nonlinear systems of algebraic equations
3.1 Problem description and modeling: predator-prey models
3.2 Analytical solutions and solving with Matlab
3.2.1 Analytical solution
3.2.2 Plotting
3.2.3 Symbolical calculations
3.2.4 Numerical calculations
3.3 Newton’s method for systems of equations
3.3.1 Mathematical background and method
3.3.2 Stopping criteria and vector norms
3.3.3 Example: predator-prey equations
3.3.4 Choice of initial vector
3.4 Solving linear systems for population models
3.4.1 Solving very small systems
3.4.2 Solving a little larger systems: Gaussian elimination
3.4.3 Built-in Matlab functions

4.1 Problem description and modeling: pollution models
4.1.1 Governing equation
4.1.2 Boundary conditions
4.1.3 Example
4.2 Analytical solutions and solving with Matlab
4.2.1 Analytical solution
4.2.2 Symbolical calculations
4.3 Solving BVPs numerically: Introduction
4.3.1 Grids
4.3.2 Numerical techniques
4.4 Solving BVPs numerically using Matlab: bvp4c
4.5 Finite differences
4.5.1 Mathematical background and method
4.5.2 Simple first-order finite difference formulas
4.5.3 More complicated finite difference expressions
4.5.4 Matrix-vector equation: example
4.5.5 Matrix-vector equation: general approach
4.5.6 Programming finite differences
4.6 Eliminating boundary conditions
4.6.1 Example
4.6.2 General approach
4.7 Finite elements
4.7.1 Mathematical background
4.7.2 Matrix-vector equation: example
4.7.3 Matrix-vector equation: general approach
4.7.4 Numerical computation of the integrals
4.7.5 Programming finite elements
4.8 Convergence of numerical methods for BVPs
4.8.1 Finite differences
4.8.2 Finite elements
4.9 Solving linear systems for BVPs: Crout’s method

5.1 Problem description and modelling
5.2 Solving first order linear IVPs analytically
5.2.1 Analytical solution
5.2.2 Symbolical calculations
5.3 Solving IVPs numerically: Introduction
5.3.1 Grids
5.3.2 Numerical techniques
5.4 Solving IVPs numerically using Matlab: ode45
5.5 One-step methods
5.5.1 Euler’s method
5.5.2 Trapezoidal rule
5.5.3 Runge–Kutta methods
5.5.4 Programming one-step methods
5.5.5 Example
5.6 Test equation and amplifying factor
5.7 Accuracy
5.7.1 Local truncation error
5.7.2 Global error
5.7.3 Impact of roundoff errors
5.8 Stability
5.8.1 Introduction
5.8.2 Stability of one-step methods (test equation)
5.8.3 Region of absolute stability
5.8.4 Example
5.8.5 Linear stability analysis for nonlinear equations
5.9 Discussion

6.1 Problem description: predator-prey models
6.2 Checking numerical solutions for systems of IVPs
6.2.1 Equilibrium solutions
6.2.2 Symbolic calculations
6.2.3 Analytical solutions
6.2.4 Numerical calculations with Matlab
6.3 One-step methods
6.3.1 Programming one-step methods
6.4 Accuracy
6.5 Stability of one-step methods
6.5.1 Linear systems
6.5.2 Nonlinear systems

7.1 Problem description: pollution models
7.1.1 Governing equation
7.1.2 Boundary and initial conditions
7.1.3 Example
7.2 Validation of numerical code for PDEs
7.2.1 Equilibrium solutions
7.2.2 Analytical solutions
7.2.3 Numerical calculations with Matlab
7.3 Solving PDEs numerically: Introduction
7.4 Finite differences
7.5 Finite elements
7.6 Stability
7.6.1 Finite differences
7.6.2 Finite elements
7.7 Accuracy
7.8 Solving linear and non-linear systems for PDEs
7.8.1 Direct methods: factorization
7.8.2 LU factorization

Chapter 1

Introduction

Scientific computing

Many scientific models and methods developed in Math, Physics, Engineering, Economics etc. that describe real-life phenomena are too difficult to solve by hand (analytically). Then scientific computing (computer simulations) needs to be used to obtain solutions. Sometimes computers can obtain analytic solutions for you using symbolic calculations. Otherwise, numerical calculations need to be performed to obtain approximate solutions. This requires a good numerical method to approximate the problem as accurately as needed, and a computer and computer code to perform the numerical calculations.

Numerical methods

Numerical methods have two different aspects:

• Numerical analysis: derivation and understanding of the general behavior of numerical methods (Numerical analysis Math 4445/4446).

• Choice and application of numerical techniques: how to choose a good numerical technique for a specific problem and how to apply it (Scientific computing Math 4414).

Best numerical method

There is no best numerical method. Which numerical method to choose depends on the

specific problem you are trying to solve and on the available resources. It is important to

choose a numerical technique that ‘works well’ for the specific problem you are solving.

In choosing a numerical technique, it is important that you understand

• The problem you are trying to solve and its mathematical formulation:

– Are there always solutions to the mathematical problem you try to solve numerically? If there are no solutions, no numerical method will be able to find one.

– If there is more than one solution, which is the solution you are looking for?

– Limitations of the theory: the mathematical model is usually only an approximation to the physical process.

– Is the numerical technique applicable or easy to change to handle slightly different problems?

– Will the approximate numerical solution be sufficiently accurate for the problem you try to solve? What errors should you expect? Numerical computations are never exact and always contain errors. Errors should be kept as small as possible or necessary, so that the final result is a sufficiently accurate solution of the underlying physical problem. If the mathematical model is not very accurate, there is no need to do the computing extremely accurately.

– When might a method lead to difficulties for a specific problem that require changes to the method or even a different numerical technique?

– Is a numerical technique easy to implement on a computer?

– How fast can a computer provide solutions for a given numerical method?

– Is computer memory large enough to store all information?

– What software is already available?

Programming

Programming is an essential part of the course. Obtaining a solution with a numerical method often requires a large number of algebraic manipulations. In order to perform such calculations, you need to be able to transform the numerical method into a computer program and let a computer do all calculations.

In this course we use Matlab to write numerical programs. Many numerical methods are available in Matlab, and it is easy to plot solutions.

Numerical methods are usually implemented much more efficiently in C or Fortran.

Such problems are outside the scope of Math 4414.

• Understand the problem and the numerical technique before you start programming.

• Write your program in such a way that it is easily changed to solve similar problems.


• Validate the results of your computer program. If you do computations with a

computer, you always get an answer. You need to make sure that you

get an answer that makes sense. A numerical technique might not work very

well for a specific problem or a computer program might contain errors. (It is very

easy to make mistakes if you write programs, but also easy to check.)

– Compare with an analytical solution (of a simplified problem).

– Compare with numerical data in books or class notes.

– Compare with results from standard Matlab functions.

For scientific computing it is necessary that you understand the problem that you solve and the available numerical methods and their limitations. We discuss the following aspects of a number of selected problems from physics/engineering (about 10 weeks of class: homework problems, midterm test)

parameters. Mathematical models we consider are algebraic equations, boundary value problems, ordinary differential equations, and partial differential equations.

2. Numerical method: make the mathematical model suitable to solve with the help of a computer. Go from a continuous description (differential equation) to an approximate discrete description (difference equation). Discuss one or several basic techniques that can be used to solve the obtained equations and relevant mathematical and computational concepts. (This will take about 3/4 of the time.) The focus is on how to use the techniques, how they work, what the advantages and disadvantages of the various numerical techniques are, and what the numerical issues are.

ical method.

4. Validation and visualization of results: Convince yourself that you found the correct solution. Visualization of relevant data.

The last 5 weeks of class you need to work on a longer project (in groups or on your

own, you may make your own project or I may make one). During these weeks some classes

will be replaced by office hours, so you can work on the project and/or presentation.

This should give you a good basis for when and how to use scientific computing and

what to pay attention to. Wherever applicable, I will use a laptop with Matlab to illustrate

what we are doing.


Chapter 2

Algebraic equations

• rate of convergence

• algorithms

• stopping criteria

• initial guess

Numerical methods:

ton’s method)

ations, Newton’s method)

Programming


2.1 Problem description and modeling of floating objects

Consider a ball made of wood with radius $R = 1$ and density $\rho_b = 1/2$. How much of the ball will be submerged when it is placed into water ($\rho_w = 1$)? We don’t know which portion of the ball is below the water; let’s call this $d$. See Fig. 2.1.

[Figure 2.1: sketch of the floating ball, labeled with $R = 1$, $d$, $r = r_z$, and $z = 0$.]

We need a mathematical/physical model: Archimedes’ law. The mass of the ball $M_b$ equals the mass of the water $M_w$ displaced by the ball.

We have $M_b = M_w$, and the mass of the ball is $M_b = \rho_b V_b = 4\pi R^3 \rho_b / 3 = 2\pi/3$. The amount of water that is displaced corresponds to the volume $V_w$ of the part of the ball in between the dashed and solid line. The mass of such a volume of water is $\rho_w V_w$. We need to find the volume of a part of a sphere (multivariable calculus). This is easiest in cylindrical coordinates.

We have a sphere with radius $R = 1$ and center $(0, 0, 1)$. In rectangular coordinates the equation for the sphere is $x^2 + y^2 + (z - 1)^2 = 1$. In cylindrical coordinates ($r^2 = x^2 + y^2$) this gives $r_z^2 = 1 - (z - 1)^2$, where $r_z$ is the upper surface in the r-direction, which depends on $z$ (see sketch). Taking $z = 0$ at the bottom of the sphere gives for the volume

$$V_w = \int_0^d \int_0^{2\pi} \int_0^{r_z} r \, dr \, d\theta \, dz = \int_0^d \pi r_z^2 \, dz = \pi \int_0^d \left[ 1 - (z - 1)^2 \right] dz. \qquad (2.1)$$

This integral evaluates to

$$V_w = \frac{\pi (3d^2 - d^3)}{3}. \qquad (2.2)$$

Using $M_b = M_w$ we arrive at an algebraic equation

$$3d^2 - d^3 = 2. \qquad (2.3)$$
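As a quick sanity check of (2.2), the integral in (2.1) can be approximated numerically and compared with the closed form. The sketch below uses the trapezoidal rule (trapz) on a fine grid; the grid size and the variable names are choices made here for illustration, not part of the notes.

```matlab
% Numerical check of V_w = pi*(3*d^2 - d^3)/3 for a chosen depth d.
d = 1;                               % submerged depth to test (0 <= d <= 2)
z = linspace(0, d, 2001);            % grid in z (grid size is an arbitrary choice)
integrand = pi*(1 - (z - 1).^2);     % cross-sectional area pi*r_z^2 at height z
V_numeric = trapz(z, integrand);     % trapezoidal approximation of (2.1)
V_formula = pi*(3*d^2 - d^3)/3;      % closed form (2.2)
fprintf('numeric: %.8f   formula: %.8f\n', V_numeric, V_formula);
```

For d = 1 both values should be close to 2π/3, the mass of the ball, which is consistent with d = 1 solving (2.3).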


2.2 Solving an algebraic equation by hand

Why is it useful to have an analytic solution?

To check whether a numerical program that you wrote is working correctly.

Solution method 1: You might know the general formula for a third order (cubic) equation or know where to find it.

Solution method 2: The polynomial has integer coefficients. Then we can look for roots that are also integers. Any such root must divide the constant term. For $d^3 - 3d^2 + 2 = 0$ this leaves four possibilities: 1, -1, 2, -2. We can easily verify by substitution that $d = 1$ is a solution and the others are not. The other 2 roots can be found by reducing the degree of the original polynomial, $(d^3 - 3d^2 + 2)/(d - 1) = d^2 - 2d - 2$. This has the roots $d_1 = 1 + \sqrt{3}$ and $d_2 = 1 - \sqrt{3}$.

We have 3 possible solutions. If we let a body float in the water it will stay at one height only. Which is the correct solution? The problem requires that $0 \le d \le 2R = 2$, so $d = 1$ is the solution we are looking for.
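The two steps of solution method 2 (testing the integer candidates and dividing out the root) can also be checked with standard Matlab polynomial functions. This is only a sketch of such a check; the variable names are made up here.

```matlab
c = [1 -3 0 2];                              % coefficients of d^3 - 3*d^2 + 0*d + 2
candidates = [1 -1 2 -2];                    % integer divisors of the constant term
values = polyval(c, candidates);             % a value of 0 marks an actual root
[quotient, remainder] = deconv(c, [1 -1]);   % polynomial division by (d - 1)
disp(values);
disp(quotient);                              % coefficients of the reduced polynomial
disp(remainder);                             % all zeros if (d - 1) divides exactly
```

Only the first candidate gives the value 0, and the quotient coefficients correspond to $d^2 - 2d - 2$.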


2.3 Solving an algebraic equation with Matlab

See the separate Matlab guide for an introduction on how to use Matlab and how to write

simple programs. All Matlab code and Matlab output will be in Sans Serif font.

2.3.1 Graphical approximation

Make a plot of the curve $d^3 - 3d^2 + 2$ with Matlab! This can be done with

ezplot('d^3 - 3*d^2 + 2')

which opens a new window and plots the function. See Fig. 2.2.

[Figure 2.2: Plot of $d^3 - 3d^2 + 2$ (a) using ezplot, and (b) zoom around the roots.]

We need to find where the function intersects the d-axis. From the graph we can estimate this number. For a more accurate number you can zoom in around the root in the Matlab window.

2.3.2 Symbolical calculations

How do we get a more exact number, that is, how do we solve the equation $d^3 - 3d^2 + 2 = 0$?

The roots of an algebraic equation like $d^3 - 3d^2 + 2 = 0$ can be found exactly using the built-in symbolic Matlab function solve. Symbolic calculations give exact results (no roundoff errors).

First define a symbolic variable

syms d

Then use solve

sol = solve(d^3 - 3*d^2 + 2)

This gives

sol =

1

1 + 3^(1/2)

1 - 3^(1/2)

Remarks

• The roots $d_{2,3} = 1 \pm \sqrt{3}$ are found exactly, not as finite precision numbers.

• Alternatively, you could have divided out a root from a polynomial symbolically:

syms d

y=(d^3-3*d^2+2) / (d-1)

y = simplify(y)

gives

y = d^2-2*d-2

• Symbolic calculations take much longer than floating point calculations. Thus if

computing time is an issue and a numerical solution is sufficient do not use symbolic

calculations.

• If no analytical solution is found (and the number of equations equals the number

of unknowns), a numerical solution is attempted. For example

syms d

sol = solve(d^5 + 3*d^2 - d^3 - 2)

gives the numerical values

[ .85052644896432252802899764958837]

[ .73793253717418508330026780848423+1.1865616439828408330458807692618*i]

[ -.77762851016613744223048982300174]

[ -1.5487630131465552523990434435551]

[ .73793253717418508330026780848423-1.1865616439828408330458807692618*i]

• Only for relatively simple problems that you could (in principle) also do by hand

2.3.3 Numerical calculations

In Matlab, roots of algebraic equations can be computed numerically using the built-in Matlab functions roots or fzero.

For polynomial algebraic equations, roots can be used to compute all roots of the polynomial. As input it needs an array of the coefficients of the polynomial. To use roots, first write the polynomial in the form $f(d) = 0$, so for this case $d^3 - 3d^2 + 2 = 0$. Then make an array with the coefficients of the polynomial (the values of an array in Matlab are in between [..]). Coefficients need to be ordered from the highest to the lowest power in d (and don’t forget the zero coefficient of the missing term $0 \cdot d$):

c=[1 -3 0 2]

Then use d=roots(c) to compute all three roots

d=

2.7321

1.0000

-0.7321

For non-polynomial algebraic equations, fzero can be used to compute a single root of an arbitrary function. For example fzero('x^3-3*x^2+2', 0.5) tries to find a root of $x^3 - 3x^2 + 2 = 0$ near $x_0 = 0.5$. To use fzero you need a reasonable estimate of the value of the root, for which you could use a plot of the function.
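The two calls can be combined in a short script that also checks the computed roots by substituting them back into the polynomial. As an assumption made here, the function is passed to fzero as an anonymous function handle instead of a string; the variable names are illustrative.

```matlab
f = @(x) x.^3 - 3*x.^2 + 2;               % function handle for the polynomial
c = [1 -3 0 2];                           % same polynomial as a coefficient array
all_roots   = roots(c);                   % all three roots at once
single_root = fzero(f, 0.5);              % one root, search started near x0 = 0.5
residuals   = abs(polyval(c, all_roots)); % |f(root)| should be close to zero
disp(all_roots);
fprintf('fzero root: %.15f, residual: %.2e\n', single_root, abs(f(single_root)));
```

Substituting a computed root back into the equation is a cheap validation step: the residual is not exactly zero, but it should be of the order of the roundoff error.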

Advantages and disadvantages of numerical calculations:


2.4 Digital representation of numbers

2.4.1 Floating point numbers and round-off errors

When the root problem $d^3 - 3d^2 + 2 = 0$ was solved with the Matlab function roots or fzero, the exact roots were not found. Matlab produced so-called floating point numbers, which are only approximations to the roots. A numerical computation on a computer is different from a calculation in algebra and calculus courses: it never produces an exact solution, only an approximation. Even Matlab’s 1.0000 is not the same as $d = 1$ which we obtained analytically. It could be any number in the range $0.99995 \le d < 1.00005$. To check this in Matlab:

d = 1.0000499999999999

which gives

d = 1.0000

and

d = 0.99995

which also gives

d = 1.0000

It is therefore incorrect to write $d_1 = 2.7321$, because it is not exactly the root. We should write instead $d_1 \approx 2.7321$.

Note that when we found a root using fzero,

d = fzero('x^3-3*x^2+2', 2.1)

Matlab didn’t compute the result in 5 digits precision; it only displayed the result d = 2.7321 in 5 digits. The 5-digit format is called format short in Matlab and this is the default. To see 16 significant digits in floating point format, we don’t need to redo the calculation, just type

format long e

d

which gives

d = 2.732050807568878e+00

Here the notation e+00 means $\times 10^0$.

Numerical computations always use a finite number of digits to represent a number. Most often about 16 digits (double precision) are used, sometimes about 8 digits (single precision) or 32 digits. This introduces an error called roundoff error.

If we type in a number in Matlab that has more than 16 digits, Matlab doesn’t say that you gave an incorrect number, but it makes a floating point number of it (using 16 digits). For example,

y=1.234567890123456789

gives

y = 1.234567890123457e+00

We see that Matlab rounds the number to the nearest floating point number: if the 17th digit is 5 or larger, 1 is added to the 16th digit (round up); if the 17th digit is lower than 5, all digits from 17 on are dropped (round down). Another method is simply chopping the number after 16 digits, whatever the 17th digit is.

The notation for the floating point number of y is fl(y). Matlab uses 16 significant digits for floating point numbers.

Advantages/disadvantages of more digits

• Computations are slower and more memory is required to store numbers.

• Roundoff error becomes smaller.

Nowadays, most computations are done in double precision (16 digits).

When we do calculations in finite digit arithmetic, the rounding (chopping) occurs at every step of the calculation. Finite digit computations can be done in Matlab via Maple, but it is easier to do it by hand.

Example: Compute $\sqrt{3} + \sqrt{3}$ using 5 significant digits with rounding. We get

$$\begin{aligned} \mathrm{fl}(\sqrt{3}) + \mathrm{fl}(\sqrt{3}) &= \mathrm{fl}(\texttt{1.732050807568877e+00}) + \mathrm{fl}(\texttt{1.732050807568877e+00}) \\ &= \texttt{1.7321e+00} + \texttt{1.7321e+00} = \texttt{3.4642e+00} \end{aligned}$$

Note that first $\sqrt{3}$ is rounded to 5-digit precision. Then the + adds the two 5-digit numbers. This is not the same as computing $\sqrt{3} + \sqrt{3} = 2\sqrt{3}$ exactly or in more digits precision and rounding the result to 5-digit precision:

$$\begin{aligned} \mathrm{fl}(\sqrt{3} + \sqrt{3}) &= \mathrm{fl}(\texttt{1.732050807568877e+00} + \texttt{1.732050807568877e+00}) \\ &= \mathrm{fl}(\texttt{3.464101615137754e+00}) = \texttt{3.4641e+00} \end{aligned}$$

Similar errors occur when you use 16 digits: the last (16th) digit may not be accurate (for a single calculation). These so-called roundoff errors are caused by the finite number of digits that a computer uses to represent numbers.
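The 5-digit example can be imitated in Matlab by rounding every intermediate result to n significant digits. The helper fl below is a hypothetical function written for this illustration (it is not a built-in Matlab function) and it assumes a nonzero argument.

```matlab
% Round x to n significant decimal digits (x must be nonzero).
% k is the number of decimal places to keep. Only powers 10^m with
% integer m >= 0 are used for scaling, since those are stored exactly.
k  = @(x, n) n - 1 - floor(log10(abs(x)));
fl = @(x, n) round(x .* 10.^max(k(x,n),0) ./ 10.^max(-k(x,n),0)) ...
             .* 10.^max(-k(x,n),0) ./ 10.^max(k(x,n),0);

n = 5;
step_by_step = fl(fl(sqrt(3), n) + fl(sqrt(3), n), n);  % round, add, round again
round_at_end = fl(sqrt(3) + sqrt(3), n);                % add in 16 digits, round once
fprintf('%.4e   %.4e\n', step_by_step, round_at_end);
```

The two results differ in the last digit (3.4642 versus 3.4641), in agreement with the hand calculation above.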

Example: Try in Matlab

x = (1-sqrt(3))*(1+sqrt(3))

which gives

x = -2.00000000000000

Then try y = x + 2, which would give exactly 0 in exact calculations. Using (standard) 16-digit arithmetic, Matlab gives y = 4.440892098500626e-16.

In finite digit arithmetic you can get inaccurate results when you subtract nearly equal numbers in your calculations. If you write a program this needs to be avoided if possible.

As a simple example we discuss the calculation of the roots of a quadratic equation.


In the floating object problem we started from a cubic equation, divided out $d - 1$, and found the quadratic equation $d^2 - 2d - 2 = 0$. A quadratic equation $ax^2 + bx + c = 0$ has the solution

$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$

In "exact (calculus) calculations" this always gives the correct answer. When a finite number of digits is used, a poor approximation of the root can be found using this equation.

Example:

Use $a = 1$, $b = 123.4$, $c = 1.2$. The exact solutions are up to 8 digits: x1 = -9.7252397e-03 and x2 = -1.2339027e+02.

Using 4-digit arithmetic we would get:

$$\sqrt{b^2 - 4ac} = \sqrt{\mathrm{fl}(\texttt{1.522756e4}) - \mathrm{fl}(\texttt{4.800000e0})} = \sqrt{\texttt{1.523e4} - \texttt{4.800e0}} = \sqrt{\texttt{1.523e4}} = \texttt{1.234e2}$$

Now compute $x_{1,2}$ (using 4-digit arithmetic):

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} = \frac{\texttt{-1.234e2} + \texttt{1.234e2}}{\texttt{2.000e0}} = \texttt{0.000e0}$$

$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} = \frac{\texttt{-1.234e2} - \texttt{1.234e2}}{\texttt{2.000e0}} = \texttt{-1.234e2}$$

The approximation of $x_2$ using 4-digit precision is accurate up to 4 digits, but the approximation for $x_1$ is not! $x_1$ has no accurate digits, which is rather problematic if $x_1$ is the physical solution. The main problem is subtracting two almost equal numbers. Using more digits for the calculations will improve the result, but it cannot completely eliminate the inaccuracy of subtracting two nearly equal numbers with finite precision arithmetic.
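The 4-digit calculation can be imitated in Matlab by rounding each intermediate result to 4 significant digits. This is a sketch only: fl is a hypothetical helper defined here (not a built-in function), it assumes nonzero arguments, and the variable names are illustrative.

```matlab
% Round x to n significant decimal digits (x must be nonzero).
k  = @(x, n) n - 1 - floor(log10(abs(x)));
fl = @(x, n) round(x .* 10.^max(k(x,n),0) ./ 10.^max(-k(x,n),0)) ...
             .* 10.^max(-k(x,n),0) ./ 10.^max(k(x,n),0);

n = 4;  a = 1;  b = 123.4;  c = 1.2;
b2   = fl(b^2, n);                 % 1.523e4
ac4  = fl(4*a*c, n);               % 4.800e0
disc = fl(b2 - ac4, n);            % 1.523e4
sq   = fl(sqrt(disc), n);          % 1.234e2
x1_naive = (-b + sq)/(2*a);        % numerator cancels completely
x2_naive = fl((-b - sq)/(2*a), n); % no cancellation here
fprintf('x1 = %.3e,  x2 = %.3e\n', x1_naive, x2_naive);
```

The computed x1 is zero, so all significant digits of the small root are lost, while x2 is still accurate to 4 digits.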

One can avoid the subtraction of two nearly equal numbers by rewriting the expression for $x_1$:

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \times \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{2c}{-b - \sqrt{b^2 - 4ac}}$$

Using 4-digit arithmetic, this gives

$$\frac{2c}{-b - \sqrt{b^2 - 4ac}} = \frac{\texttt{2.400e0}}{\texttt{-1.234e2} - \texttt{1.234e2}} = \texttt{-9.724e-3}$$

Similarly, we can write for $x_2$,

$$x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \times \frac{-b + \sqrt{b^2 - 4ac}}{-b + \sqrt{b^2 - 4ac}} = \frac{2c}{-b + \sqrt{b^2 - 4ac}}$$

Using 4-digit arithmetic, however, this would lead to a zero denominator.

What we want to use is a proper combination of the two ways to compute the roots that avoids the subtraction of almost equal numbers. For which of the two expressions almost equal numbers are subtracted depends on whether $b$ is positive or negative. To take this into account properly, first evaluate

$$q = -\frac{1}{2} \left[ b + \mathrm{sign}(b) \sqrt{b^2 - 4ac} \right],$$

where $\mathrm{sign}(b) = 1$ if $b \ge 0$ and $-1$ otherwise. Then the roots are

$$x_1 = \frac{q}{a}, \qquad x_2 = \frac{c}{q}.$$

Using 4-digit arithmetic for the example above ($b > 0$), this gives $q/a$ = -1.234e+02 and $c/q$ = -9.724e-03, which are both accurate up to 3 digits. Note that these two values correspond to $x_2$ and $x_1$ of the example, respectively: which root gets which number depends on the sign of $b$.
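A minimal Matlab sketch of this combined formula, applied to the example coefficients in standard 16-digit arithmetic, could look as follows. The variable names are chosen here; the sign is set by hand because the text defines sign(0) = 1, whereas Matlab’s built-in sign(0) returns 0.

```matlab
a = 1;  b = 123.4;  c = 1.2;
D = sqrt(b^2 - 4*a*c);            % square root of the discriminant (assumed real)
if b >= 0
    s = 1;                        % sign(b) as defined in the text
else
    s = -1;
end
q  = -0.5*(b + s*D);              % b and s*D have the same sign: no cancellation
r1 = q/a;                         % root of larger magnitude
r2 = c/q;                         % root of smaller magnitude
fprintf('%.7e   %.7e\n', r1, r2);
```

A useful cross-check is that the product of the two roots should equal c/a.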

Usually, you should not test whether 2 floating point numbers are exactly equal:

if x == y

The result of this test will only be true when all digits are the same. Usually this will not

be the case due to roundoff errors. For example, type

x = (1-sqrt(3))*(1+sqrt(3))

which gives

x = -2.00000000000000

Comparing with the exact value -2

x == -2

gives

0

meaning false or not equal (a value of 1 means true or equal numbers).

Instead, when we test whether 2 numbers are (nearly) equal, we should use abs(x-y) < ε, with ε a small number. How small depends on the type of calculations, but never smaller than $10^{-(\text{number of digits}) + 1}$. For example, for a 16-digit calculation to find roots, ε = $10^{-15}$ will probably work. For more complicated calculations (more computations, subtraction of almost equal numbers), you can maybe only reach an answer that is accurate up to 8 or 10 digits (ε ≈ $10^{-8}$ or ε ≈ $10^{-10}$ would then be the smallest tolerance you want to choose).
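The difference between the exact test and a tolerance test can be seen with the same number x as above. The tolerance value and the variable names below are illustrative choices, not prescribed by the notes.

```matlab
x   = (1 - sqrt(3))*(1 + sqrt(3));   % exact value would be -2
tol = 1e-12;                         % tolerance, an illustrative choice
exactly_equal = (x == -2);           % exact comparison (the text reports 0 here)
nearly_equal  = abs(x - (-2)) < tol; % comparison with a tolerance
fprintf('exact test: %d,  tolerance test: %d\n', exactly_equal, nearly_equal);
```

The tolerance test accepts x as equal to -2 even though the last digits differ because of roundoff.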

Stopping criteria in numerical methods are used to determine whether solutions are sufficiently accurate, which can be done by comparing the two most recent approximations to a solution $p$, say $p_{i+1}$ and $p_i$. When you try to determine a large $p \approx 10^8$ and test whether the error $|p_{i+1} - p_i| < 10^{-10}$, you want 18 digits correct. This is not possible with 16-digit numbers. When you try to determine a small $p \approx 10^{-8}$ and test whether the error $|p_{i+1} - p_i| < 10^{-10}$, you only have 2 accurate digits.

To avoid a dependence on the magnitude of $p$, a relative error $|p_{i+1} - p_i| / |p_i| < 10^{-10}$ should be used. Note that this is not such a good idea if $p_i$ equals or is very close to zero. Safer is $|p_{i+1} - p_i| / |p_i + \varepsilon| < 10^{-10}$ with $\varepsilon$ a small number, for example $10^{-6}$.
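The effect of the different stopping tests can be tried out with a few made-up numbers. In the sketch below, converged is a hypothetical helper that implements the guarded relative test from the text; the name eps0 is used for ε to avoid overwriting Matlab’s built-in eps, and the values of p are invented for illustration.

```matlab
% Guarded relative stopping test; eps0 protects against p_old close to zero.
converged = @(p_new, p_old, tol, eps0) abs(p_new - p_old) ./ abs(p_old + eps0) < tol;

tol = 1e-10;  eps0 = 1e-6;
p_old = 1e8;
p_new = p_old*(1 + 1e-12);                   % agrees with p_old to about 12 digits
abs_ok   = abs(p_new - p_old) < tol;         % absolute test: asks for 18 digits
rel_ok   = converged(p_new, p_old, tol, eps0);
small_ok = converged(1e-8 + 1e-11, 1e-8, tol, eps0);   % small p, only ~3 digits agree
fprintf('absolute: %d, relative: %d, small p: %d\n', abs_ok, rel_ok, small_ok);
```

For the large value the absolute test can never be satisfied, while the relative test accepts the iterate. For the small value the guarded test still rejects an iterate that an absolute test with the same tolerance would have accepted.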


2.4.2 Binary representation of integer numbers

Computers do not use base-10 numbers but base-2 (binary) numbers to do calculations. We start with the binary representation of integer numbers, since this is easier to understand than the binary representation of floating point numbers. By default, Matlab does not store numbers in a separate integer representation, only as floating point numbers. Programming languages like C and Fortran, however, do have one.

Integer base 10 numbers: What does the number 123 mean exactly? $123 = 1 \times 10^2 + 2 \times 10^1 + 3 \times 10^0$. Any integer number we can write as a base 10 number. For a positive integer: $\sum_{i=0}^{n} a_i 10^i$, with each $a_i$ any integer number from 0 to 9.

Integer base 2 numbers: Any integer number we can write as a base 2 number. For a positive integer: $\sum_{i=0}^{n} b_i 2^i$, with $b_i$ a 0 or 1. To distinguish base-2 numbers from base-10 numbers we use the notation $(\ )_2$ for base-2 numbers and $(\ )_{10}$ for base-10 numbers. To go from base-2 to base-10 numbers and vice versa, just use $\sum_{i=0}^{n} b_i 2^i$:

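base 2        expansion                                                        base 10
$(1)_2$       $1 \times 2^0$                                                   $(1)_{10}$
$(10)_2$      $1 \times 2^1 + 0 \times 2^0$                                    $(2)_{10}$
$(11)_2$      $1 \times 2^1 + 1 \times 2^0$                                    $(3)_{10}$
$(1011)_2$    $1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0$      $(11)_{10}$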

Try the following two examples yourself:

What is the base-10 number corresponding to the base-2 number (10101)2 ? What is the

base-2 number corresponding to the base-10 number (101)10 ?

There are Matlab functions that convert from binary to decimal and vice versa: bin2dec and dec2bin. The above 2 examples you can check using bin2dec('10101') and dec2bin(101).
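The conversion rule $\sum_{i=0}^{n} b_i 2^i$ can also be written out by hand. The following is a small sketch in Python; the two function names are made up for this illustration and are not Matlab functions.

```python
def binary_to_decimal(bits):
    """Evaluate sum_i b_i * 2**i for a string of binary digits such as '1011'."""
    return sum(int(b) * 2**i for i, b in enumerate(reversed(bits)))


def decimal_to_binary(n):
    """Repeatedly divide a non-negative integer by 2 and collect the remainders."""
    digits = ""
    while n > 0:
        digits = str(n % 2) + digits
        n //= 2
    return digits or "0"


print(binary_to_decimal("1011"))  # 11, the last row of the table above
print(decimal_to_binary(11))      # 1011
```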

How much memory is reserved for an integer?

How do we measure the amount of memory used?

A bit is a binary digit, i.e. a 0 or a 1.

A byte is a group of eight bits.

A word is the smallest addressable unit of memory for a computer (often 2 bytes or 4

bytes).

If we have one word of 2 bytes to store an integer number, the binary number $(1011)_2$ would be stored as 0000 0000 0000 1011 (just fill up with zeros at the front). Why is the above number not stored as a 4-bit number? Computer hardware can be kept simpler and more efficient if it only handles numbers with a predetermined number of bits.

What is the largest integer on a computer if 2 bytes are used to store integers (not considering negative integers)?

We need to have all ones to have the largest integer, so for 2 bytes (= 16 bits): $\sum_{k=0}^{15} 2^k = (1 - 2^{16})/(1 - 2) = 2^{16} - 1$ (geometric series). In this way we cannot represent negative numbers, only integers from 0 to $2^{16} - 1$. These types of integers are called unsigned integers. Note that when signed integers are used, only integers in the range $[-2^{15}, 2^{15} - 1] = [-32768, 32767]$ are available. (This is not symmetric since 0 is included.) If 32 bits are used for an integer this is called a long integer.


2.4.3 Binary representation of floating point numbers

Single precision numbers have 32 bits (4 bytes). The single precision IEEE standard floating point number is defined as

$(-1)^s \times 2^{c-127} \times (1.f)_2$

and Fig. 2.3 contains a graphical representation.

[Figure 2.3: bit layout from left to right: s (1 bit), c (8 bits), f (23 bits).]
Figure 2.3: Single precision floating point number (32 bits): 1 sign bit for mantissa (s), 8 bits for the exponent (c), and 23 bits for the fractional part of the mantissa (f).

Note that you can always rewrite a number so that the first digit is non-zero. For example for base 10 numbers: $0.9 = 9 \times 10^{-1}$. For binary numbers the only non-zero digit is 1.

The leftmost bit of a floating point number is for the sign of the number: s = 0 for positive and s = 1 for negative numbers.

The next 8 bits are for the exponent c − 127. The value of c can take $2^8 = 256$ different values from 0 to 255. The first and last value (c = 0 and c = 255 for single precision) are always reserved for special cases including ±0 and ±∞ (this also holds for other precisions like double precision). Thus for single precision the range of values for the exponent is $-126 \le c - 127 \le 127$.

The last 23 bits are used for the mantissa (the number multiplying the exponential function, here with base 2). Since the first bit is always 1, it doesn't need to be stored. So the mantissa actually corresponds to 24 bits since there is one 'hidden' bit. Zero in floating point number notation is represented by all zero bits (with the sign bit as a possible exception). The mantissa is restricted by

$1 = (1.000\ldots0)_2 \le (1.f)_2 \le (1.111\ldots1)_2 = \sum_{i=0}^{23} (1/2)^i = 2 - 2^{-23}$

Note that all together there are 24 1's in $(1.111\ldots1)_2$, meaning $1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2} + \cdots + 1 \times 2^{-23}$.
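To make the three fields concrete, the sketch below splits a number into s, c, and f and rebuilds its value. It is written in Python and uses the standard `struct` module to obtain the 32 bits; the function names are assumptions for this illustration, and the rebuild step only covers normalised numbers (c between 1 and 254).

```python
import struct


def decode_single(x):
    """Return the fields (s, c, f) of x stored as an IEEE single precision number."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31           # 1 sign bit
    c = (bits >> 23) & 0xFF  # 8 exponent bits
    f = bits & 0x7FFFFF      # 23 bits for the fractional part of the mantissa
    return s, c, f


def value_from_fields(s, c, f):
    """Rebuild (-1)^s * 2^(c-127) * (1.f)_2 for a normalised number."""
    return (-1) ** s * 2.0 ** (c - 127) * (1 + f / 2**23)


# -6.5 = -(1.101)_2 * 2^2, so s = 1 and c = 2 + 127 = 129
s, c, f = decode_single(-6.5)
print(s, c, format(f, "023b"))
print(value_from_fields(s, c, f))
```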

The largest single precision number is $(2 - 2^{-23}) \times 2^{127} \approx 3.4 \times 10^{38}$ (largest mantissa and largest exponent). The smallest (positive) single precision number is $1 \times 2^{-126} \approx 1.2 \times 10^{-38}$ (smallest mantissa and smallest exponent). There are no (accurate) single precision numbers in between 0 and approximately $1.2 \times 10^{-38}$ and there are no single precision numbers above the maximum single precision number $3.4 \times 10^{38}$. Similarly the largest negative single precision number is approximately $-3.4 \times 10^{38}$ and the smallest negative number is approximately $-1.2 \times 10^{-38}$. See Fig. 2.4 for the usable range of numbers.

[Figure 2.4: number line marking the overflow, usable, and denormal ranges.]
Figure 2.4: Usable range of numbers in single precision using standard IEEE notation.

What happens when we produce a number outside the usable range of values?

A number that is too large gives an overflow (Inf). A (positive) number that is too small first gives a less accurate number (denormal) and then 0. The situation for negative numbers is similar.

The machine epsilon is the smallest positive machine number $\varepsilon$ so that $1 + \varepsilon \ne 1$. In single precision (23 bits for the mantissa) this is $2^{-23} \approx 1.2 \times 10^{-7}$. Note that this value is much larger than the smallest single precision number.

Double precision numbers have 64 bits (8 bytes). The double precision IEEE standard floating point number uses 1 bit for the sign, 11 bits for the number c in the exponent (c − 1023) and 52 bits for the mantissa. In Matlab you can find the maximum number by using realmax, the minimum positive number by realmin, and the machine epsilon by eps.
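The same three double precision quantities can be inspected outside Matlab. The sketch below uses Python, whose floats are IEEE double precision numbers on common platforms, and reads them from the standard `sys.float_info` record. It also shows overflow, a denormal number, and underflow to 0.

```python
import sys

largest = sys.float_info.max       # counterpart of realmax: (2 - 2^-52) * 2^1023
smallest = sys.float_info.min      # counterpart of realmin: 2^-1022
epsilon = sys.float_info.epsilon   # counterpart of eps: 2^-52

print(largest, smallest, epsilon)

# Machine epsilon: adding it to 1 changes the result, adding half of it does not.
print(1.0 + epsilon != 1.0)
print(1.0 + epsilon / 2 == 1.0)

# Outside the usable range: overflow gives Inf, too small first gives a
# denormal number and finally 0.
print(largest * 2)
print(smallest / 2)
print(smallest / 2.0**60)
```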


2.5 Iterative methods for algebraic equations

Iterative methods do not try to compute the exact solution, but give only an approximation to the solution. The user should specify how close the approximation should be to the solution. Iterative methods involve the following 3 steps:

One step to generate a new number in the sequence is called an iteration.

• The sequence is stopped when the approximation is "close enough to the solution". The sequence should also be stopped when it is clear that it will not approach the solution at all (otherwise it keeps on generating numbers forever). This might be because the method generates a sequence that does not approach the solution or because a (Matlab) program is written incorrectly.

Usually a while-loop is used for an iterative method, since you don't know in advance how many times you need to compute an approximation in the sequence. The structure of a while-loop for an iterative method is as follows

function [x] = iterative_method(initial_guess, maxiter, tolerance)
% Initializations, for example
iter = 1;
xold = initial_guess;
while iter <= maxiter
    % Compute a new approximation x from xold
    % ···
    % Check whether the approximation is good enough, for example
    if abs(x-xold) < tolerance
        break;
    end
    iter = iter + 1;
    xold = x;
end

Note that there are 2 mechanisms that terminate the iteration process. First, the while-loop is terminated when iter exceeds the maximum number of iterations (to avoid that the iterations continue forever). Second, it is terminated when the approximation is good enough (break terminates the while loop and continues the program after the end corresponding to the while loop).

Three iterative methods to find roots will be discussed in the next sections: bisection,

fixed-point iterations, and Newton’s method.
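The loop structure above, with its two ways of terminating, might look as follows when filled in. This is a sketch in Python, not the course's Matlab code: the function `iterate`, its return values, and the update rule `heron_step` (averaging x and 2/x, which approaches the square root of 2) are all hypothetical choices made only to have something concrete to run.

```python
def iterate(step, initial_guess, maxiter, tolerance):
    """Generic iteration: returns (x, iterations used, converged flag)."""
    xold = initial_guess
    for it in range(1, maxiter + 1):
        x = step(xold)                 # compute a new approximation from xold
        if abs(x - xold) < tolerance:  # mechanism 2: good enough, stop early
            return x, it, True
        xold = x
    return xold, maxiter, False        # mechanism 1: maxiter reached


def heron_step(x):
    """Hypothetical update rule: average of x and 2/x."""
    return 0.5 * (x + 2.0 / x)


root, iters, converged = iterate(heron_step, 1.0, 50, 1e-12)
print(root, iters, converged)

# A sequence that never settles is cut off by the iteration limit.
print(iterate(lambda x: x + 1.0, 0.0, 10, 1e-6))
```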


2.6 Bisection method

2.6.1 Mathematical background and method

The idea of the bisection method is based on the intermediate value theorem:

If f is continuous on [a, b], and f(a) and f(b) have opposite sign, then there exists a point p in (a, b) with f(p) = 0. See Fig. 2.5.

[Figure 2.5: curve from (a, f(a)) with f(a) < 0 to (b, f(b)) with f(b) > 0, crossing the x-axis at p where f(p) = 0.]
Figure 2.5: A continuous line from a to b where f(a) and f(b) have opposite sign should cross the x-axis (y = 0).

See Fig. 2.6. Find the sign of f halfway along the interval, i.e. at $p_1 = (a + b)/2$.

[Figure 2.6: the interval [a, b] with the successive midpoints p1, p2, p3, p4 and the nested intervals [a1, b1] and [a2, b2].]
Figure 2.6: Graphical representation of bisection method.

If f(a) and f($p_1$) have opposite signs, then the root p is in (a, $p_1$). Take a point halfway along this interval, $p_2 = (a + p_1)/2$, etc.

If f(b) and f($p_1$) have opposite signs, then the root p is in ($p_1$, b). Take a point halfway along this interval, $p_2 = (p_1 + b)/2$, etc.

Advantages/disadvantages of the bisection method:

• You only find 1 root, which depends on the initial points a and b.

An algorithm describes in words (pseudocode) which steps a method needs to perform.

So it is not necessary to worry about the exact (Matlab) commands. Writing down an

algorithm before you write a (Matlab) program makes it easier to write the program.

Algorithms are particularly helpful when you start programming or when the method

involves a lot of steps and it is difficult to picture the structure of the program.

Below are an algorithm and a naive program of the bisection method (i.e. a program you might write if you are unaware of typical programming issues). You can run the bisection function as any built-in Matlab function. Using input parameters a = 0.1 and b = 2 for the endpoints, tolerance $\varepsilon = 10^{-6}$, and maximum number of iterations N = 100, you would use

[p, k] = bisection0(0.1, 2, 100, 1e-6)


Algorithm: Bisection

Input: 2 points a and b, a tolerance $\varepsilon$, and a maximum number of iterations N
Output: approximation to a root in [a, b] and the number of iterations performed

Checks
    Check whether f(a) and f(b) have opposite sign
Initialization
    Compute the function value f(a) (Done already in Checks)
Actual method
    While the number of iterations does not exceed N, do
        Compute new p = (a + b)/2
        Update the right point if f(p) and f(a) have opposite sign
        Otherwise update the left point
    End while-loop
    Write an error message if this is not the case (if maximum number of iterations is exceeded)


Matlab program: Bisection

(Following the algorithm, ignorant of the programming issues in Sec. 2.6.3. Using these

it can be improved significantly.)

function [p, k] = bisection0(a0, b0, N, epsil)
%=============================================
% Description:
%   Approximate one root of the function x^3 - 3x^2 + 2 in the interval [a0,b0]
%   using the bisection method
%   Ignorant version
% Input parameters:
%   a0     initial guess for left point a
%   b0     initial guess for right point b
%   N      maximum number of bisection iterations to be performed
%   epsil  tolerance for the error
% Output parameters:
%   p      array of approximations to the root pk
%   k      number of performed iterations
%=============================================

%---------------------------------------------
% Checks and initializations
%---------------------------------------------
k = 1;
a(1) = a0;
b(1) = b0;
fa(1) = a0^3 - 3*a0^2 + 2;
fb(1) = b0^3 - 3*b0^2 + 2;
if (fa(1) <= 0 & fb(1) <= 0) | (fa(1) >= 0 & fb(1) >= 0)
    error('Initial guesses do not have opposite sign')
end

%---------------------------------------------
% Iteration loop
%---------------------------------------------
while k <= N
    %---------------------------------------------
    % Calculate new approximation to root pk
    %---------------------------------------------
    p(k) = (a(k) + b(k)) / 2;
    fp(k) = p(k)^3 - 3*p(k)^2 + 2;
    %---------------------------------------------
    % Check whether close to a root
    %---------------------------------------------
    if k > 1
        if abs(p(k) - p(k-1)) < epsil
            break
        end
    end
    %---------------------------------------------
    % Prepare for next iteration
    %---------------------------------------------
    if (fa(k) < 0 & fp(k) > 0) | (fa(k) > 0 & fp(k) < 0)
        a(k+1) = a(k);
        fa(k+1) = fa(k);
        b(k+1) = p(k);
        fb(k+1) = b(k+1)^3 - 3*b(k+1)^2 + 2;
    else
        a(k+1) = p(k);
        fa(k+1) = a(k+1)^3 - 3*a(k+1)^2 + 2;
        b(k+1) = b(k);
        fb(k+1) = fb(k);
    end
    k = k + 1;
end

%---------------------------------------------
% Check if iterative method converged or not
%---------------------------------------------
if k > N
    error(sprintf('Bisection method did not converge in %d iterations', N));
end


2.6.3 Programming issues

Programming with minimal memory usage

Now we use 6 arrays with all previous values of a, b, p, fa, fb, and fp. If a large number

of iterations is performed, the large arrays may take quite a lot of memory and make

computations slower. In addition, no memory is allocated (reserved) in advance for the

large arrays. Every time a new number is stored in an array, Matlab needs to create space

first to store that number. This makes the process even slower. However, storing all those

results is totally unnecessary: only the most recent values of the left point a, the right

point b, and the middle point p are used and some corresponding function values. So a

single variable instead of an array is sufficient for each.

How to check efficiently whether f (a) and f (p) have opposite sign?

If you use if statements inside for or while loops, this can make the code (much) slower than necessary. A more efficient way to check whether two numbers have opposite sign is to check whether the product is negative, thus check if $f(a) \times f(p) < 0$. Note: if both f(a) and f(p) are very large, the product $f(a) \times f(p)$ might be larger than the maximum floating point number (see Sec. 2.4). To avoid such problems the sign function can be used: sign(fa)*sign(fb). If fa is positive, sign(fa) equals 1; if it is negative it equals −1.
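A short experiment illustrates why the sign function is the more robust choice. The sketch is in Python, which has no built-in sign function, so a small helper is defined here; both helper names are assumptions for this illustration. In IEEE arithmetic a product of two very large values overflows to ±Inf, and a product of two very small values underflows to zero, in which case the test "product < 0" no longer detects the sign change.

```python
def sign(x):
    """Return 1 for positive x, -1 for negative x and 0 for zero."""
    return (x > 0) - (x < 0)


def opposite_sign(fa, fb):
    """Sign test that never forms the product fa * fb itself."""
    return sign(fa) * sign(fb) < 0


big_a, big_b = 1e200, -1e200
tiny_a, tiny_b = 1e-200, -1e-200

print(big_a * big_b)                  # product overflows
print(tiny_a * tiny_b < 0)            # product underflows to zero, test fails
print(opposite_sign(big_a, big_b))    # sign test is unaffected
print(opposite_sign(tiny_a, tiny_b))  # sign test is unaffected
```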

Make m-files easy to modify

Use separate functions for parts that need to be modified frequently. Advantage: Once

the bisection function is working, you never need to modify it anymore. You only need

to modify the accompanying function that evaluates the function f . Additionally, there

is only one line that defines the function f , even if you evaluate f at various places in the

bisection m-file. In the bisection function you need to compute f (a), f (b), and f (p). You

just need to call the function several times with the correct value at which the function

needs to be evaluated, a, b, and p.

Disadvantage: Computations take a little longer due to extra function calls. Only

when you find that you can save a significant amount of computing time, you may want

to avoid the function calls.

Easiest way to do this in Matlab: We write a separate function to evaluate the function f at the end of the m-file for the bisection method. If we solve $d^3 - 3d^2 + 2 = 0$, for example, we make a function funcbisec that evaluates the function

function [f] = funcbisec(x)
% Floating sphere equation
f = x^3 - 3*x^2 + 2;

The function funcbisec is called in the m-file for the bisection method, bisection, at every

place where f needs to be evaluated, with the proper value for x. To evaluate f (p), for

example, and assign this value to the variable fp, use

fp = funcbisec(p);

Here p needs to have a proper numerical value.

If we want to compute roots of another problem, we only need to modify the expression

for f in the function funcbisec.

How to check whether the approximation is 'good enough'?

For the bisection method, we can do better than comparing $p_i$ and $p_{i-1}$. We know that the root is in between the most recent values of a and b. So if we choose p in the middle, we are certain that the error is less than or equal to $|b - a|/2 = |p - a|$. Thus if $|p - a|$ is smaller than a specified tolerance $\varepsilon$, we are certain that the actual error is less than $\varepsilon$.

The stopping criterion can be made more robust by using a relative stopping criterion and/or checking how small the residual $f(p_i)$ is (this measures how well the equation you try to solve is satisfied; for a root f(p) = 0 exactly, so you want $f(p_i)$ to be small).

When you are not sure whether the numerical solution is good enough, you can always try to decrease $\varepsilon$ (and probably increase the maximum number of iterations as well).
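The error bound also tells in advance how many bisection iterations are needed. The following worked estimate is a sketch that assumes the interval is halved exactly at every iteration and ignores round-off; here a and b denote the initial endpoints and $p_k$ the midpoint computed at iteration k.

```latex
|p_k - p| \le \frac{b - a}{2^k} < \varepsilon
\quad\Longrightarrow\quad
k > \log_2 \frac{b - a}{\varepsilon},
\qquad\text{e.g. } a = 0.1,\ b = 2,\ \varepsilon = 10^{-6}:\quad
k > \log_2\!\left(1.9 \times 10^{6}\right) \approx 20.9 .
```

So about 21 iterations guarantee an error below $10^{-6}$ for these values. A program that additionally requires a small residual may need a few more.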

Function evaluations

Evaluating functions like sin(x), exp(x), etc. is computationally much more expensive than multiplications or additions. For functions that are not too simple, most of the computing time will be in the evaluation of f. Thus a fast program for the bisection method contains as few function evaluations as possible.

Our function bisection0 has several function evaluations inside the while-loop. Only the function evaluation at the point $p_i$ is necessary.

Unnecessary operations inside loops

Every operation and function call takes computing time (CPU time). If part of a computation is repeated exactly every iteration, it saves CPU time if you do the computation once before the loop starts. For example, the sign of f(a) never changes during the bisection iterations. Thus it can be computed before the while loop, and stored in a variable (signfa in function bisection). Decisions on whether to use function calls (more flexible, easier to read code) or not (save CPU time) depend on what is most important for the problem you are solving. If your computations take a long time and a significant percentage of the total CPU time can be saved by avoiding function calls, you may want to minimize the function calls.

A Matlab function with all the above modifications can be found below. You can run the bisection function using (with input parameters a = 0.1 and b = 2 for the endpoints, tolerance $\varepsilon = 10^{-6}$, and maximum number of iterations N = 100)

[p, k] = bisection(0.1, 2, 100, 1e-6)


Matlab program: Bisection

(More flexible and robust bisection method, including the programming issues in

Sec. 2.6.3)

function [p, k] = bisection(a, b, N, epsil)
%=============================================
% Description:
%   Approximate one root of the function in funcbisec in the interval [a,b]
%   using the bisection method
%   More flexible and robust version
% Input parameters:
%   a      initial guess for left point a
%   b      initial guess for right point b
%   N      maximum number of bisection iterations to be performed
%   epsil  tolerance for the error
% Output parameters:
%   p      approximation to the root
%   k      number of performed iterations
%=============================================

%---------------------------------------------
% Checks and initializations
% Note: use sign to avoid overflow
%       signfa will never change, no need to recompute
%---------------------------------------------
k = 1;
fa = funcbisec(a);
fb = funcbisec(b);
signfa = sign(fa);
if signfa*sign(fb) >= 0
    error('Initial guesses do not have opposite sign')
end

%---------------------------------------------
% Iteration loop
%---------------------------------------------
while k <= N
    %---------------------------------------------
    % Calculate new approximation to root pk
    %---------------------------------------------
    p = (a + b) / 2;
    fp = funcbisec(p);
    %---------------------------------------------
    % Check whether close to a root
    % Note: Both f and the difference of 2 approximations should be small
    %---------------------------------------------
    if abs(fp) < epsil & abs(p-a) < epsil
        break
    end
    %---------------------------------------------
    % Prepare for next iteration
    %---------------------------------------------
    if signfa*sign(fp) < 0
        b = p;
    else
        a = p;
    end
    k = k + 1;
end

%---------------------------------------------
% Check if iterative method converged or not
%---------------------------------------------
if k > N
    error(sprintf('Bisection method did not converge in %d iterations', N));
end

%---------------------------------------------
% Function f(x)
%---------------------------------------------
function [f] = funcbisec(x)
% Floating body problem
f = x^3 - 3*x^2 + 2;
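For readers who want to experiment outside Matlab, the same logic can be sketched in Python. This is a port written for illustration, not part of the course material: the `sign` helper, the use of exceptions, and passing f as an optional argument are assumptions of this sketch.

```python
def sign(x):
    """Return 1, -1 or 0 depending on the sign of x."""
    return (x > 0) - (x < 0)


def funcbisec(x):
    """Floating body problem."""
    return x**3 - 3 * x**2 + 2


def bisection(a, b, N, epsil, f=funcbisec):
    """Return (p, k): approximate root in [a, b] and iterations performed."""
    signfa = sign(f(a))                # never changes, computed once
    if signfa * sign(f(b)) >= 0:
        raise ValueError("Initial guesses do not have opposite sign")
    for k in range(1, N + 1):
        p = (a + b) / 2
        fp = f(p)                      # the only evaluation inside the loop
        if abs(fp) < epsil and abs(p - a) < epsil:
            return p, k
        if signfa * sign(fp) < 0:
            b = p                      # root is in the left half
        else:
            a = p                      # root is in the right half
    raise RuntimeError("Bisection method did not converge in %d iterations" % N)


p, k = bisection(0.1, 2, 100, 1e-6)
print(p, k)
```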


2.6.4 Example: equation for floating sphere

As an example we take as starting values a = 0.1 and b = 2. In addition we use N = 100 for the maximum number of iterations and $\varepsilon = 10^{-6}$ for the tolerance (we can always change the values and run the problem again if these are not sufficient). Table 2.1 contains the sequence of approximations.

 i    p_i                      |1 − p_i|
 1    1.050000000000000e+00    5.000000000000004e-02
 2    5.750000000000001e-01    4.249999999999999e-01
 3    8.125000000000000e-01    1.875000000000000e-01
 4    9.312500000000000e-01    6.874999999999998e-02
 5    9.906250000000001e-01    9.374999999999911e-03
 6    1.020312500000000e+00    2.031250000000018e-02
 7    1.005468750000000e+00    5.468750000000133e-03
 8    9.980468750000001e-01    1.953124999999889e-03
 9    1.001757812500000e+00    1.757812500000178e-03
10    9.999023437500001e-01    9.765624999991118e-05
11    1.000830078125000e+00    8.300781250001332e-04
12    1.000366210937500e+00    3.662109375000000e-04
13    1.000134277343750e+00    1.342773437500444e-04
14    1.000018310546875e+00    1.831054687517764e-05
15    9.999603271484376e-01    3.967285156236677e-05
16    9.999893188476564e-01    1.068115234359457e-05
17    1.000003814697266e+00    3.814697265847045e-06
18    9.999965667724611e-01    3.433227538929273e-06
19    1.000000190734863e+00    1.907348634588857e-07
20    9.999983787536623e-01    1.621246337735194e-06
21    9.999992847442629e-01    7.152557370826429e-07
22    9.999997377395632e-01    2.622604368118786e-07

Table 2.1: Iteration number i, approximations $p_i$, and absolute difference with the exact solution $|1 - p_i|$; bisection method for $d^3 - 3d^2 + 2 = 0$ using initial points a = 0.1 and b = 2, tolerance $\varepsilon = 10^{-6}$, and maximum number of iterations N = 100.

Such a long list of numbers is difficult to interpret; a plot is easier. The easiest way to generate a plot of the errors is to compute the errors $|1 - p_i|$ in the bisection function and store these in an array, say e. Thus, for the above example, e is an array of length 22 which contains the actual error (error with the exact solution), here $|1 - p_i|$.

Then create an array with the iteration numbers 1 to 22:

i = 1:1:22

Then make a plot with marker x:

plot(i, e, 'x')

This plots the errors, but it is still difficult to see what happens (does the error continue to decrease or not).

To see more clearly the behavior at small errors, we can use a logarithmic scale for the y-axis. Instead of plot, now use

semilogy(i, e, 'x')

See Fig. 2.7. It is clear from the logarithmic plot that the error continues to decrease.

[Figure 2.7: two panels, (a) and (b), each showing the error |1 − p_i| against the iteration i.]
Figure 2.7: Actual error as a function of the iteration number for the bisection method ($d^3 - 3d^2 + 2 = 0$, a = 0.1, b = 2) using unscaled (a) and logarithmic y-axis (b).

The error decreases, however, in a quite irregular manner. The value of $p_1$ after the first iteration, for example, is much closer to the actual solution than the value of $p_2$. This suggests that it should be possible to develop faster methods.


2.7 Fixed-point iterations (Picard iteration)

2.7.1 Mathematical background and method

A number p is a fixed point for a given function g if g(p) = p.

Relation between roots and fixed points:

Finding solutions of the root problem f (p) = 0 is equivalent to finding the fixed points

of a corresponding fixed point problem. There is not one single way to transform a root

problem into a fixed point problem.

For example, g(x) = x − f(x) and g(x) = x + f(x)/3 are 2 fixed-point problems corresponding to the root problem f(x) = 0. For both cases, if p is a root of f, then f(p) = 0 and thus g(p) = p, i.e. p is a fixed point of g.

Also if p is a fixed point of g, then g(p) = p and thus f(p) = 0, i.e. p is a root of f.

How does a fixed-point iteration work?

Fixed points are the intersection points of the line y = x and the curve y = g(x). A fixed-point iteration consists of two steps:

1. Choose an initial approximation $p_0$.
2. Take $p_n = g(p_{n-1})$ for n ≥ 1.

[Figure: the line y = x, the curve g(x), and the iterates p0, p1, p2, p3 approaching the fixed point p.]

The sequence of a fixed-point iteration is convergent on the interval [a, b] if the interval contains a fixed point, if g is continuous on [a, b], and if

$|g'(x)| < 1$

for all x in [a, b] (and the initial guess is in [a, b]).

When does a fixed point iteration converge fast?

When the absolute value of the derivative $|g'(x)|$ is small. If you have to find such a g by trial and error it might be much more work than the bisection method takes to solve the problem. In the next section, we discuss a fixed-point iteration that converges fast: Newton's method.

Advantages/disadvantages of fixed-point iterations:

• If a fixed-point iteration converges, it may (or may not) be faster than the bisection method.

The root problem $f(d) = d^3 - 3d^2 + 2 = 0$ can be written as a fixed-point problem d = g(d) in several ways, for example:

1. Adding a d on the left and right: $d = d + d^3 - 3d^2 + 2$. Starting from d = 0.99 this does not converge. However, $d = d + (d^3 - 3d^2 + 2)/2$ converges and $d = d + (d^3 - 3d^2 + 2)/3$ converges in just 4 iterations starting from d = 0.5.

2. Writing $d^2 = (3d^2 - 2)/d$, and taking the square root: $d = \sqrt{3d - 2/d}$. Converges in 30 iterations, starting from d = 0.5. In Matlab, the result has a small imaginary part. Some other programming languages would give an error when you try to compute the square root of a negative number.

3. Writing $d^2(3 - d) = 2$, dividing by (3 − d) and taking the square root: $d = \sqrt{2/(3 - d)}$. Converges in 11 iterations, starting from d = 0.5. See Table 2.2.
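The iteration counts quoted above can be checked with a few lines of code. The sketch below is in Python; the function `fixed_point`, its stopping test (difference of the two latest iterates below the tolerance), and the names `g1` and `g3` are assumptions of this illustration.

```python
import math


def f(d):
    """Floating object equation."""
    return d**3 - 3 * d**2 + 2


def fixed_point(g, p0, N, epsil):
    """Iterate p = g(p_old); return (p, i) once |p - p_old| < epsil."""
    p_old = p0
    for i in range(1, N + 1):
        p = g(p_old)
        if abs(p - p_old) < epsil:
            return p, i
        p_old = p
    raise RuntimeError("Fixed-point iteration did not converge in %d iterations" % N)


def g1(d):
    """Form 1 with the factor 1/3: d = d + (d^3 - 3d^2 + 2)/3."""
    return d + f(d) / 3


def g3(d):
    """Form 3: d = sqrt(2/(3 - d))."""
    return math.sqrt(2 / (3 - d))


print(fixed_point(g1, 0.5, 100, 1e-6))
print(fixed_point(g3, 0.5, 100, 1e-6))
```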

For a fixed-point iteration, we do not have two points $a_i$ and $b_i$ surrounding p that get closer and closer to the fixed point p from the left and right. If we want to use a similar criterion, the best we can do for the fixed-point iteration is compare the two latest results. This is much more tricky: the difference $|p_i - p_{i-1}|$ may be small just because the true solution p is approached slowly, not because $|p - p_i|$ is small. For the floating object equation (Table 2.2) the stopping criterion used, $|p_i - p_{i-1}| < \varepsilon = 10^{-6}$, worked fine. When this was fulfilled the actual error (difference with the exact solution) $|1 - p_i| \approx 10^{-7}$ was small as well.


i pi |1 − pi |

1 8.944271909999159e-01 1.055728090000841e-01

2 9.746077623781704e-01 2.539223762182963e-02

3 9.937117548732618e-01 6.288245126738201e-03

4 9.984316360970824e-01 1.568363902917591e-03

5 9.996081394766783e-01 3.918605233217409e-04

6 9.999020492625699e-01 9.795073743013027e-05

7 9.999755132150757e-01 2.448678492428247e-05

8 9.999938783599811e-01 6.121640018896812e-06

9 9.999984695935085e-01 1.530406491534464e-06

10 9.999996173985968e-01 3.826014032259906e-07

11 9.999999043496630e-01 9.565033698422098e-08

Table 2.2: Iteration number i, approximations $p_i$, and absolute difference with the exact solution $|1 - p_i|$; fixed-point iteration with $d = \sqrt{2/(3 - d)}$ using $d_0 = 1/2$, tolerance $\varepsilon = 10^{-6}$, and maximum number of iterations N = 100.

The stopping criterion is not always sufficient. Consider a function g whose derivative $g'$ is almost equal to 1. Take as example p = g(p) with $g(p) = p + 10^{-4} \times (p^3 - 3p^2 + 2)$ to determine the root of the floating object equation $p^3 - 3p^2 + 2 = 0$. The derivative is $g'(p) = 1 + 10^{-4} \times (3p^2 - 6p)$, which is just below 1 in (0, 1]. Again we take $p_0 = 0.5$ and $\varepsilon = 10^{-6}$. The actual error at each iteration is displayed in Fig. 2.9 on a semilogarithmic scale.

[Figure 2.9: plot titled "Convergence for fixed-point iteration"; error |1 − p_i| on a logarithmic axis against iteration i, for i from 0 to 18000.]
Figure 2.9: Actual error as a function of iteration number for the fixed-point iteration $p = p + 10^{-4} \times (p^3 - 3p^2 + 2)$, $p_0 = 0.5$.

The difference between the root and the approximation when the stopping criterion was met, $|1 - p_{16846}|$ = 3.332001912442095e-03, is much larger than $10^{-6}$. Another stopping criterion that could be used is checking whether the residual r is small. The residual measures how well the original equation is satisfied, here how well f(p) = 0 is satisfied. For the floating body equation we need to check whether $r_i = p_i^3 - 3p_i^2 + 2$ is small. At the final iteration we have $r_{16846} = p_{16846}^3 - 3p_{16846}^2 + 2$ = 9.995968744652473e-03, which is quite large compared to the tolerance $10^{-6}$.

To obtain the required accuracy, it often helps to check the residual as well. If we use as additional stopping criterion $|r_i| < 10^{-6}$, both stopping criteria are met after 47542 iterations. Then $p_{47542}$ = 9.999996667462827e-01 and $|1 - p_{47542}|$ = 3.332537172884287e-07, which agrees well with the accuracy we wanted ($10^{-6}$).
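The experiment can be repeated with a short script. The sketch below is in Python; the function name and the `check_residual` switch are assumptions of this illustration, and the iteration counts it prints should come out close to the values quoted in the text.

```python
def f(p):
    """Residual of the floating object equation."""
    return p**3 - 3 * p**2 + 2


def slow_fixed_point(p0, epsil, N, check_residual=False):
    """Iterate p = p + 1e-4 * f(p); return (p, i) when the criteria are met."""
    p_old = p0
    for i in range(1, N + 1):
        p = p_old + 1e-4 * f(p_old)
        small_step = abs(p - p_old) < epsil
        small_residual = abs(f(p)) < epsil
        if small_step and (small_residual or not check_residual):
            return p, i
        p_old = p
    raise RuntimeError("No convergence in %d iterations" % N)


# Only the difference of the two latest iterates is tested.
p, i = slow_fixed_point(0.5, 1e-6, 100000)
print(i, abs(1 - p), f(p))

# The residual is tested as well.
p, i = slow_fixed_point(0.5, 1e-6, 100000, check_residual=True)
print(i, abs(1 - p), f(p))
```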


2.8 Newton’s method

Newton’s method for solving f (x) = 0 is a special choice of a fixed-point iteration. It is

also called the Newton–Raphson method.

There are several ways to derive Newton's method. One is using a Taylor polynomial. The 1st order Taylor polynomial for f(x) around $x_0$ equals

$f(x) = f(x_0) + (x - x_0) f'(x_0) + O((x - x_0)^2)$.

At the root, x = p (f(p) = 0):

$0 = f(x_0) + (p - x_0) f'(x_0) + O((p - x_0)^2)$.

Now assume that we are "close" to the root, then the $O((p - x_0)^2)$ term is small compared to the linear term:

$0 \approx f(x_0) + (p - x_0) f'(x_0)$,

or after rearranging:

$p \approx x_0 - \frac{f(x_0)}{f'(x_0)}$.

This is used in Newton's method to find a new approximation $p_n$.

How does Newton’s method work?

2. Take for n ≥ 1

$p_n = p_{n-1} - \frac{f(p_{n-1})}{f'(p_{n-1})}$

Note that Newton's method is of the form $p_n = g(p_{n-1})$ with $g(p_{n-1}) = p_{n-1} - f(p_{n-1})/f'(p_{n-1})$, i.e. Newton's method is a fixed-point iteration.

Remarks

$f'$ is zero close to the root.

• In the derivation it was assumed that the remainder term, which contains a term $(x - p)^2$, was small compared to the linear term in x − p. This is not true if the initial guess is not close enough to the root, and the method might therefore not converge if the starting point is "too far" from the root.

• Graphical interpretation: Newton's method finds at every iteration an approximation to the root by walking a distance $(p_i - p_{i-1})$ along the tangent line at $(p_{i-1}, f(p_{i-1}))$ to the point where the tangent line crosses the x-axis (y = 0). Thus for linear functions f, you will arrive at the root in 1 iteration.

Assume $f \in C^2$ near the root (f, its derivative and second derivative exist and are continuous near the root). If $f'(p) \ne 0$ then Newton's method will converge to p if the initial guess $p_0$ is close enough to the true root p. Unfortunately, we usually don't know how close to p the initial guess should be. For hard problems this might be very close, say $|p - p_0| < 10^{-3}$ or $10^{-4}$.

Advantages/disadvantages of Newton’s method:

• Fast (quadratic) convergence when you get close enough to the root.

• The method does not always converge to a solution: zero derivative, initial guess not

sufficiently close. (If you don’t have a good enough initial approximation, you could

first do, for example, some bisection iterations and then start Newton’s method)

• You need to calculate a derivative (for complicated functions, you can use a symbolic

calculation)

Note that we cannot start from $p_0 = 0$ or $p_0 = 2$ since $f'$ is zero in those points. Starting from $p_0 = 1/2$ converges in only 4 iterations (see Table 2.3).

i pi |1 − pi |

1 1.111111111111111e+00 1.111111111111112e-01

2 9.990740740740740e-01 9.259259259259967e-04

3 1.000000000529222e+00 5.292219995567393e-10

4 1.000000000000000e+00 0.000000000000000e+00

Table 2.3: Iteration number i, approximations $p_i$, and absolute difference with the exact solution $|1 - p_i|$; Newton's method using $p_0 = 1/2$, tolerance $\varepsilon = 10^{-6}$, and maximum number of iterations N = 100.
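The numbers in Table 2.3 can be reproduced with a few lines of code. The sketch below is in Python; the function `newton`, its stopping test (difference of the two latest iterates below the tolerance), and the explicit check for a zero derivative are assumptions of this illustration.

```python
def f(x):
    """Floating object equation."""
    return x**3 - 3 * x**2 + 2


def fprime(x):
    """Derivative of f."""
    return 3 * x**2 - 6 * x


def newton(f, fprime, p0, N, epsil):
    """Return (p, n): approximate root and number of iterations performed."""
    p_old = p0
    for n in range(1, N + 1):
        slope = fprime(p_old)
        if slope == 0:
            raise ZeroDivisionError("Derivative is zero at p = %g" % p_old)
        p = p_old - f(p_old) / slope   # Newton update
        if abs(p - p_old) < epsil:
            return p, n
        p_old = p
    raise RuntimeError("Newton's method did not converge in %d iterations" % N)


p, n = newton(f, fprime, 0.5, 100, 1e-6)
print(p, n)
```

Starting from p0 = 0 or p0 = 2 triggers the zero-derivative check in this sketch.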


2.9 Rate of convergence

2.9.1 Definition

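A sequence $\{p_n\}$ converges to p of order α if constants α and γ exist so that

$\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|^{\alpha}} = \gamma.$

An iterative technique is said to converge of order α if the corresponding sequence $\{p_n\}$ converges of order α.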

α can be any number, but there are two important cases:

If α = 1, the sequence and iterative technique are linearly convergent.

If α = 2, the sequence and iterative technique are quadratically convergent.

This is easiest to understand when we look at the actual error, the difference between the true solution p and the approximation at the nth iteration: $e_n = |p_n - p|$, where $e_n$ is the error at the nth iteration.

Linearly convergent: since α = 1, we have $e_{n+1} = \gamma e_n$. This means that if $e_n$ is the current error at iteration n, the error at the next iteration level, n + 1, is γ times the current error ($e_{n+1} = \gamma e_n = \gamma^2 e_{n-1}$ etc).

Example 1: if γ = 0.1 then the error is a factor 10 smaller every iteration and you have approximately 1 more correct digit every iteration.

Example 2: how many iterations do we need to do for 1 more accurate digit if γ = 0.9? We need to find k so that the error after k iterations is reduced by a factor 10: $e_{n+k} = 0.1 e_n$. Since we have linear convergence: $e_{n+k} = \gamma^k e_n = 0.1 e_n$ or $\gamma^k = 0.1$. This gives for γ = 0.9: $k = \log(0.1)/\log(0.9) \approx 22$ iterations.

Example 3: Similarly, if γ = 0.99 you need to do $\log(0.1)/\log(0.99) \approx 229$ iterations for 1 more correct digit. Values of γ close to 1 are unfortunately not unusual for linearly convergent techniques.
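The three examples follow from the same one-line formula, which can be evaluated for any γ. The sketch below is in Python and the function name is made up for this illustration.

```python
import math


def iterations_per_digit(gamma):
    """Solve gamma**k = 0.1 for k: iterations needed to gain one decimal digit."""
    return math.log(0.1) / math.log(gamma)


for gamma in (0.1, 0.9, 0.99):
    print(gamma, iterations_per_digit(gamma))
```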

Quadratically convergent: since α = 2, we have $e_{n+1} = \gamma e_n^2$. This means that if $e_n$ is the error at iteration n, the error at the next iteration level, $e_{n+1}$, is $\gamma e_n$ times the current error $e_n$.

Example: If we have $e_n = 0.1$ and γ a little smaller than 1, a linearly convergent scheme would converge very slowly. For a quadratically convergent method you would have $e_{n+1} \approx 10^{-2}$, $e_{n+2} \approx 10^{-4}$, $e_{n+3} \approx 10^{-8}$, $e_{n+4} \approx 10^{-16}$. Once you are reasonably

close to the solution convergence is very fast: in just a few iterations a very accurate

approximation is reached. Roughly, the number of correct digits doubles every step for a

quadratically convergent method (when you are not too far from the solution p).

To find the order α, we can take the log on both sides in the definition:

$$\log |p_{n+1} - p| = \log \gamma + \alpha \log |p_n - p|,$$

or

$$\log e_{n+1} = \log \gamma + \alpha \log e_n.$$


Thus if we make an xy-plot of $y = \log e_{n+1}$ vs. $x = \log e_n$, α corresponds to the slope. Once α is known, γ can be determined easily from the above equation.

Consider Table 2.4 for Newton's method, which shows the errors with respect to the root $p = 1 - \sqrt{3}$ starting from $p_0 = -20$ using a tolerance $10^{-13}$.

i     e_n = |p − p_n|
1     1.229976737424930e+01
2     7.670248252043633e+00
3     4.607864445396778e+00
4     2.602396109292012e+00
5     1.320034026906669e+00
6     5.473710251344159e-01
7     1.497420289157650e-01
8     1.616422346263446e-02
9     2.214556785341548e-04
10    4.245948570513747e-08
11    1.665334536937735e-15
12    1.110223024625157e-16

Table 2.4: Iteration number i and absolute difference with the exact solution $p = 1 - \sqrt{3}$; Newton's method using $p_0 = -20$ and tolerance = $10^{-13}$.

We see that for the first 7 iterations, when the solution is not yet close to $p = 1 - \sqrt{3}$, Newton's method converges slowly. For iterations 8 to 11, when the approximation is close to $p = 1 - \sqrt{3}$, convergence is fast. The number of accurate digits approximately doubles (this suggests quadratic convergence). The fast convergence stops at iteration 12 since we have reached the maximum precision of 15-16 digits for double precision calculations (machine accuracy), so that the error can't decrease any further.

The order of convergence (α and γ) close to the true solution can be determined as follows. In an array x we put the logarithm of the error at the "old level" i: x(i) = log e(i) for i = 1, ..., 11. In an array y we put the logarithm of the error at the "new level" i + 1: y(i) = log e(i+1) for i = 1, ..., 11. Now we make a plot: plot(x,y).

The result is in Fig. 2.10.

To find α and γ we should not consider $e_n$'s that are not yet close to the solution due to a poor initial guess. Of course, we cannot let n → ∞ as in the definition; the best we can do is look at the convergence behavior for sufficiently large values of n, i.e. close to the true solution. However, we should also not consider $e_n$'s that are too close to the solution: these $e_n$'s are affected by the finite precision in the calculation (for double precision you can never get more than 16 accurate digits). From Fig. 2.10 it is clear that we should only use $e_7$ to $e_{11}$ in Table 2.4 and that the slope of the line segment is approximately 2, i.e. α ≈ 2.

Figure 2.10: Rate of convergence plot, $\log(e_{i+1})$ versus $\log(e_i)$, for the floating sphere problem ($d^3 - 3d^2 + 2 = 0$); Newton's method using $p_0 = -20$, tolerance = $10^{-13}$. The rate of convergence of a typical fixed-point iteration is shown for comparison.

The value of α can be approximated more precisely by using the numerical values for the e's, for example

$$\alpha \approx \frac{\Delta y}{\Delta x} = \frac{\log e_9 - \log e_{10}}{\log e_8 - \log e_9} \approx \frac{-3.6547 + 7.3720}{-1.7914 + 3.6547} \approx 1.995.$$

Thus Newton's method converges quadratically (α ≈ 2) close to the solution. Once α has been found, γ can be determined by taking the ratio of $e_{n+1}$ and $e_n^{\alpha}$. For example,

$$\gamma \approx \frac{e_9}{e_8^{1.995}} \approx 0.83.$$

In theory, Newton's method has

$$\gamma \approx \frac{|f''(p)|}{2|f'(p)|}.$$

For the problem we consider this gives $\gamma \approx \sqrt{3}/2 \approx 0.866$, which is indeed close to what we find numerically.
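The procedure described above can be sketched in Matlab as follows. The errors are copied from Table 2.4; the names `e`, `alpha`, and `gamma_est` are illustrative (the name `gamma_est` is used because `gamma` is a built-in Matlab function), and base-10 logarithms are assumed, which matches the numbers quoted in the text.

```matlab
% Errors e_n = |p - p_n| copied from Table 2.4 (Newton's method, p0 = -20).
e = [1.229976737424930e+01; 7.670248252043633e+00; 4.607864445396778e+00; ...
     2.602396109292012e+00; 1.320034026906669e+00; 5.473710251344159e-01; ...
     1.497420289157650e-01; 1.616422346263446e-02; 2.214556785341548e-04; ...
     4.245948570513747e-08; 1.665334536937735e-15; 1.110223024625157e-16];

x = log10(e(1:end-1));    % log of the error at the "old level" i
y = log10(e(2:end));      % log of the error at the "new level" i+1

plot(x, y, 'o-');         % the slope of this curve approximates alpha
xlabel('log(e_i)'); ylabel('log(e_{i+1})'); grid on

% Slope between two neighbouring points close to the solution, but not yet
% at machine accuracy (here e_8, e_9, e_10).
alpha = (log10(e(10)) - log10(e(9))) / (log10(e(9)) - log10(e(8)));
gamma_est = e(9) / e(8)^alpha;     % gamma from e_{n+1} = gamma * e_n^alpha
fprintf('alpha = %.3f, gamma = %.2f\n', alpha, gamma_est);
```

Using other neighbouring pairs from the range $e_7$ to $e_{10}$ gives slightly different estimates, which is a quick way to judge how reliable the slope is.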

In general a fixed-point iteration converges linearly, α ≈ 1. This can also be observed from Fig. 2.10 where the slope is approximately 1, meaning α ≈ 1. As fixed-point iteration we used $g(p) = p - (p^3 - 3p^2 + 2)/10$ and an initial guess $p_0 = -2$. For a linearly convergent fixed-point iteration we should find a constant (γ) if we take the ratio of two consecutive errors ($e_{n+1}/e_n$). For the fixed-point iteration we consider, we have $\gamma \approx e_{n+1}/e_n \approx 0.400$. This corresponds to the theoretical value $\gamma \approx |g'(p)|$ for a fixed-point iteration g(p) = p. For the problem we consider $|g'(1 - \sqrt{3})| = 2/5 = 0.4$.
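A minimal sketch of this check, assuming 20 iterations are enough to reach the asymptotic regime without hitting machine accuracy (the iteration count `N` and the variable names are choices made here, not taken from the text):

```matlab
g  = @(p) p - (p.^3 - 3*p.^2 + 2)/10;   % fixed-point function from the text
dg = @(p) 1 - (3*p.^2 - 6*p)/10;        % its derivative g'(p)
p_exact = 1 - sqrt(3);                  % root used as reference solution

N = 20;                                 % number of iterations (assumed)
p = zeros(N+1, 1);
p(1) = -2;                              % initial guess p0 = -2
for n = 1:N
    p(n+1) = g(p(n));                   % fixed-point iteration
end

e = abs(p - p_exact);                   % errors e_n
ratio = e(2:end) ./ e(1:end-1);         % e_{n+1}/e_n, should approach gamma
fprintf('observed gamma = %.4f, |g''(p)| = %.4f\n', ratio(end), abs(dg(p_exact)));
```

The first few ratios differ noticeably from 0.4 because the iterates are still far from the root; only the later ratios should be compared with $|g'(p)|$.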

Newton’s method for the root d = 1 is a special case. See Table 2.5. Convergence is

faster than quadratic: better than doubling of the number of accurate digits. What is


i    e_n = |p − p_n|
1    1.111111111111112e-01
2    9.259259259259967e-04
3    5.292219995567393e-10
4    0.000000000000000e+00

Table 2.5: Iteration number i and absolute difference with the exact solution $|p - p_n|$ with p = 1; Newton's method using $p_0 = 1/2$ and tolerance = $10^{-6}$.
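As a rough check of the statement that convergence is faster than quadratic, the slope formula for α can be applied to the three nonzero errors in Table 2.5. This is only a sketch: with three data points a single slope is available, so the result is indicative at best.

```matlab
% Nonzero errors from Table 2.5 (Newton's method for the root p = 1).
e = [1.111111111111112e-01; 9.259259259259967e-04; 5.292219995567393e-10];

% Slope of log(e_{n+1}) versus log(e_n) between the two available points.
alpha = (log10(e(3)) - log10(e(2))) / (log10(e(2)) - log10(e(1)));
fprintf('estimated order alpha = %.2f\n', alpha);
```

For this data the estimated slope is clearly larger than 2, consistent with the faster-than-quadratic behavior noted above.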

To see whether a numerical technique is correctly implemented, it is always good to check numerical results against an analytical solution. For harder problems, however, no analytical solution can be found. Then the convergence behavior can be checked. For Newton's method you expect α = 2 and for fixed-point iterations α = 1. If you get less, this is an indication that something unusual is going on either in the numerics (an incorrect f′ in Newton's method, for example) or in the equations you are trying to solve (maybe the solution is not sufficiently smooth: derivatives might not exist or are not continuous, for example).

To check the convergence behavior, we need |p − pi |, i.e. we need the exact solution

p. This value, however, is typically not available for more complicated problems (that’s

why we try to solve them numerically). Then we can take for the value of p a very good

approximation. For example, an accurate value obtained using a built-in Matlab function

or a very accurate value using your own code (preferably machine accuracy, but in any

case a value with much smaller errors than the en ’s you consider).


Chapter 3

equations

• vector norms

Numerical methods:

ations)

– Gaussian elimination.


3.1 Problem description and modeling: predator-prey models

Consider an environment with 2 species: one of them is a predator and the other a prey. We want to know how the populations of the predator and prey evolve in time. Will one or both species die out or will they coexist?

In the model we denote by $x_1(t)$ the population of prey at time t and by $x_2(t)$ the population of predators at time t. The basic model to describe changing quantities in time is:

rate of change = rate of increase − rate of decrease.

Here the rate of change of the prey population is $\dot{x}_1 = dx_1/dt$ and the rate of change of the predator population is $\dot{x}_2 = dx_2/dt$. The rate of increase and rate of decrease of the prey and predators need to be modeled. To keep the model simple, we make the following assumptions:

• Rate of increase of prey (birth rate of prey).

We assume that the birth rate at time t is proportional to the number of prey at

time t. This means the birth rate is modeled by ax1 (t) with a a constant. A more

sophisticated model could include for example the effects of insufficient food supplies

and/or illnesses on the birth rate.

• Rate of decrease of prey (death rate of prey).

We assume that the death rate at time t is proportional to the product of the number

of prey and number of predators at time t. This means the death rate is modeled by

bx1 (t)x2 (t) with b a constant. A more sophisticated model could include for example

the effects of insufficient food supplies and/or illnesses on the death rate.

• Rate of increase of predators (birth rate of predators).

We assume that the birth rate at time t is proportional to the product of the number

of prey and number of predators at time t (i.e. the birth rate depends on the amount

of food available and the number of predators currently alive). This means the birth

rate is modeled by cx1 (t)x2 (t) with c a constant. A more sophisticated model could

include for example the effects of illnesses and/or age of the predators on the birth

rate.

• Rate of decrease of predators (death rate of predators).

We assume that the death rate at time t is proportional to the number of predators

alive at time t. This means the death rate is modeled by dx2 (t) with d a constant.

A more sophisticated model could include for example the effects of insufficient food

supplies, age, and illnesses on the death rate.

The resulting model is

ẋ1 = ax1 − bx1 x2 ,

ẋ2 = cx1 x2 − dx2 .


We are interested in whether the predator and/or prey die out or coexist. Thus we are interested in possible equilibrium solutions (i.e. when the populations do not change anymore in size, or $dx_1/dt = dx_2/dt = 0$),

$$0 = a x_1 - b x_1 x_2,$$
$$0 = c x_1 x_2 - d x_2.$$

For the parameter values used in this chapter (a = 2, b = 1, c = 1, d = 2) this becomes

$$0 = 2 x_1 - x_1 x_2,$$
$$0 = x_1 x_2 - 2 x_2.$$

We first discuss some methods to solve systems of equations analytically, before we dis-

cuss some numerical techniques (to see whether numerical methods converge to a correct

solution).


3.2 Analytical solutions and solving with Matlab

In this section we discuss three methods to obtain solutions of a 2 × 2 system of equations.

In the next sections we will solve the above problem numerically and see whether numerical

methods converge to one of the two equilibrium solutions found in this section.

The example 2 × 2 system of equations can easily be solved analytically by factoring out

common terms,

2x1 − x1 x2 = x1 (2 − x2 ) = 0,

x1 x2 − 2x2 = x2 (x1 − 2) = 0.

The first equation is satisfied for x1 = 0 or x2 = 2. For x1 = 0 we get from the second

equation x2 = 0. For x2 = 2 we get from the second equation x1 = 2. Thus we have two

equilibrium solutions (0, 0) and (2, 2).
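The two equilibrium points can be checked by substituting them back into the equations. A minimal sketch, with the anonymous function `f` introduced here only for this check:

```matlab
% Left-hand sides of the two equilibrium equations as a vector function.
f = @(x) [2*x(1) - x(1)*x(2); x(1)*x(2) - 2*x(2)];

disp(f([0; 0]))    % both components are zero at (0, 0)
disp(f([2; 2]))    % both components are zero at (2, 2)
```

Any other candidate point can be tested the same way: a nonzero component means the point is not an equilibrium.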

3.2.2 Plotting

The curves corresponding to the two equations 2x1 − x1 x2 = 0 and x1 x2 − 2x2 = 0 can

be plotted using Matlab’s ezplot

ezplot('2*x1 - x1*x2=0')
hold on
ezplot('x1*x2 - 2*x2=0')
grid on
setcurve2('color','red')

where setcurve2 is a small Matlab script to plot the two curves corresponding to the second

equation in red. In Fig. 3.1 the blue curves correspond to solutions of the first equation

and red lines to solutions of the second equation. The intersections of a blue and red

curve near (0, 0) and (2, 2) correspond to the two equilibrium points. You can zoom in

near these points to obtain more accurate values. Note that n × n systems of equations

need n-dimensional plots so that this method is only useful for 2 × 2 systems.

Systems of algebraic equations can be solved symbolically using the built-in Matlab function solve. For the example we consider you would use

[x1, x2] = solve('2*x1 - x1*x2=0', 'x1*x2 - 2*x2=0')

which returns as output

x1 =
0
2

x2 =
0
2

[Figure 3.1: curves of the two equations in the (x1, x2) plane; both axes run from −6 to 6.]

From the first line of solutions for x1 and x2 we get the equilibrium point (0, 0). From

the second line of solutions for x1 and x2 we get the equilibrium point (2, 2).

Systems of algebraic equations can be solved numerically using the built-in Matlab function fsolve, which solves the system f(x) = 0. For this you first need to create an m-file that defines the vector function f(x). For the example we consider you could use the following m-file fun_fsolve.m

function f = fun_fsolve(x)
% Equilibrium equations of the predator-prey model
a = 2;
b = 1;
c = 1;
d = 2;
f(1) = a*x(1) - b*x(1)*x(2);
f(2) = c*x(1)*x(2) - d*x(2);

Then you need to select an initial vector $p^{(0)}$, say ⟨3, 3⟩. To solve the system numerically using fsolve, use

p0 = [3; 3]
p = fsolve('fun_fsolve', p0)

which gives

p =
2.000000141789899e+00
2.000000141789899e+00

This uses a default tolerance of $10^{-6}$. To increase the accuracy, you need to change the option TolFun. To use a tolerance of $10^{-10}$ use

options = optimset('TolFun', 1e-10)
p0 = [3; 3]
p = fsolve('fun_fsolve', p0, options)

which gives the more accurate solution

p =
2.000000000000007e+00
2.000000000000007e+00


3.3 Newton’s method for systems of equations

Newton’s method for systems can be used to solve a system of m nonlinear equations

with m unknowns

f (x) = 0

or in component form

$$\begin{aligned}
f_1(x_1, x_2, \ldots, x_m) &= 0,\\
f_2(x_1, x_2, \ldots, x_m) &= 0,\\
&\;\;\vdots\\
f_m(x_1, x_2, \ldots, x_m) &= 0.
\end{aligned}$$

Similarly to Newton’s method for functions of one variable, Newton’s method for systems

can be derived using a 1st order Taylor expansion for functions of several variables.

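Using a first order Taylor expansion around $x_0$ gives at the root x = p

$$0 = f(p) \approx f(x_0) + J(x_0)\,(p - x_0),$$

where J(x) is the Jacobian matrix

$$J(x) = \begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_m}\\
\dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_m}\\
\vdots & \vdots & \ddots & \vdots\\
\dfrac{\partial f_m}{\partial x_1} & \dfrac{\partial f_m}{\partial x_2} & \cdots & \dfrac{\partial f_m}{\partial x_m}
\end{pmatrix}.$$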

Algorithm of Newton’s method
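A minimal sketch of the iteration, assuming the usual form of Newton's method for systems: solve $J(p^{(n-1)})\,y = -f(p^{(n-1)})$ for the update y and set $p^{(n)} = p^{(n-1)} + y$. The predator-prey equilibrium equations with a = 2, b = 1, c = 1, d = 2 serve as example; the Jacobian `J`, the stopping test on the size of the update, and the values of `tol` and `Nmax` are choices made here for illustration.

```matlab
a = 2; b = 1; c = 1; d = 2;
f = @(x) [a*x(1) - b*x(1)*x(2); c*x(1)*x(2) - d*x(2)];    % system f(x) = 0
J = @(x) [a - b*x(2), -b*x(1); c*x(2), c*x(1) - d];       % Jacobian of f

p    = [3; 3];      % initial vector p^(0)
tol  = 1e-13;       % tolerance on the size of the update (assumed)
Nmax = 100;         % maximum number of iterations (assumed)
for n = 1:Nmax
    y = J(p) \ (-f(p));       % solve J(p^(n-1)) y = -f(p^(n-1))
    p = p + y;                % p^(n) = p^(n-1) + y
    fprintf('%2d  (%.15f, %.15f)\n', n, p(1), p(2));
    if max(abs(y)) < tol      % stop when the update is small enough
        break
    end
end
```

The linear system is solved with the backslash operator rather than by forming the inverse of the Jacobian explicitly, which avoids unnecessary work and round-off.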

Remarks

• There is a problem when $J(p^{(n-1)})$ is singular, since then $J^{-1}$ does not exist. Newton's method might not converge if J is (nearly) singular close to the solution vector.


• Like in Newton’s method for scalars, in the derivation of Newton’s method for

systems it was assumed that higher order terms are small compared to the linear

terms. This is not true if the initial guess is not close enough to the solution vector,

and the method might not converge then.

• Provided certain conditions on the partial derivatives of the functions fi are satisfied,

also Newton’s method for systems converges quadratically when the initial vector

p(0) is close enough to the true solution vector p. For hard problems you might need

to be very close to the true solution vector before Newton’s method will converge.

• It does not always converge to a solution (singular Jacobian, initial guess not suf-

ficiently close).

• You only find 1 solution vector, which depends on the initial vector p(0) .

calculations, this might not be trivial when m is large.)

An accurate numerical solution to an m × m system of equations is obtained when all components involved are approximated accurately. Instead of checking whether each component satisfies the stopping criteria, vector norms are used to check whether the numerical approximation is good enough. The notation for a norm is $\|\cdot\|$ with a possible subscript to indicate the type of norm. Two frequently used norms are the $l_2$ norm and the $l_\infty$ norm. The $l_2$ or Euclidean norm is defined as

$$\|y\|_2 = \sqrt{y_1^2 + y_2^2 + \ldots + y_m^2}.$$

The $l_\infty$ or maximum norm is defined as

$$\|y\|_\infty = \max_{1 \le i \le m} |y_i|.$$

As stopping criteria we can use the norm of the difference between two solution vectors, $\|p^{(k)} - p^{(k-1)}\|$, and the norm of the residual vector, $\|r\| = \|f(p^{(k)})\|$, using either the $l_2$ or the $l_\infty$ norm. Note that if the $l_2$ or $l_\infty$ norm of a vector is small, then every component of the vector must be small. As for scalar equations, relative stopping criteria can be used to increase robustness.


3.3.3 Example: predator-prey equations

We solve the algebraic predator-prey system of equations discussed in Sec. 3.1 using Newton's method. As initial vector we take $p^{(0)} = \langle 3, 3 \rangle$ and as tolerance $10^{-13}$. We see from Table 3.1 that Newton's method converges to the analytical solution $p = \langle p_1, p_2 \rangle = \langle 2, 2 \rangle$ in only 6 iterations.

k    p^(k)                                     l2 error         l∞ error
1    (2.250000000000000, 2.250000000000000)    3.5355339e-01    2.5000000e-01
2    (2.025000000000000, 2.025000000000000)    3.5355339e-02    2.4999999e-02
3    (2.000304878048780, 2.000304878048780)    4.3116267e-04    3.0487804e-04
4    (2.000000046461147, 2.000000046461147)    6.5705984e-08    4.6461147e-08
5    (2.000000000000001, 2.000000000000001)    1.8841109e-15    1.3322676e-15
6    (2.000000000000000, 2.000000000000000)    0.0000000e+00    0.0000000e+00

Table 3.1: Iteration number k, approximations $p^{(k)}$, and $l_2$ and $l_\infty$ error norms; Newton's method using $p^{(0)} = \langle 3, 3 \rangle$ and tolerance = $10^{-13}$.

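The error norms in Table 3.1 are

$$\|p - p^{(k)}\|_2 = \left[ \left(p_1 - p_1^{(k)}\right)^2 + \left(p_2 - p_2^{(k)}\right)^2 \right]^{1/2},$$

$$\|p - p^{(k)}\|_\infty = \max\left( |p_1 - p_1^{(k)}|,\; |p_2 - p_2^{(k)}| \right),$$

with $p_1 = 2$ and $p_2 = 2$. We note from Table 3.1 that the $l_2$ error is always slightly larger. It is easy to see why the $l_2$ norm can never be smaller than the $l_\infty$ norm. Consider a vector x with m components, then

$$\|x\|_\infty = \max_{1 \le i \le m} |x_i| = \max_{1 \le i \le m} \sqrt{x_i^2} = \sqrt{\max_{1 \le i \le m} x_i^2} \le \sqrt{\sum_{i=1}^{m} x_i^2} = \|x\|_2.$$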

The $l_\infty$ norm only considers the maximum $x_i^2$, while for the $l_2$ norm some nonnegative numbers are added to this value before taking the square root. The convergence behavior of both norms, however, is very similar. Both give approximately a doubling of the number of accurate digits each iteration. This indicates quadratic convergence, as for Newton's method for a single equation (see Sec. 2.8). Typically, the $l_2$ and $l_\infty$ norms give very similar results for small systems of equations. For large m × m systems you need to be a little more careful when you do finite precision calculations. If all components of the error vector have the same error, say machine accuracy $e = 10^{-16}$, then the error in the $l_2$ norm is

$$\|e\|_2 = \sqrt{\sum_{i=1}^{m} e_i^2} = \sqrt{m e^2} = \sqrt{m}\,|e|.$$

Thus if $\sqrt{m}$ is sufficiently large, you need to increase the tolerance for $l_2$ norms accordingly, i.e. to at least $\sqrt{m}\,10^{-15}$, in order to be able to satisfy stopping criteria.
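The growth of the $l_2$ norm with the system size can be illustrated with Matlab's built-in norm function. The system size `m` below is a hypothetical value chosen only to make the effect visible.

```matlab
m = 1e6;                      % hypothetical (large) number of unknowns
e = 1e-16 * ones(m, 1);       % every component has the same error 1e-16

n_inf = norm(e, inf);         % l_inf norm: largest component in absolute value
n_2   = norm(e, 2);           % l_2 norm: sqrt of the sum of squares
fprintf('l_inf norm  : %.3e\n', n_inf);
fprintf('l_2 norm    : %.3e\n', n_2);
fprintf('sqrt(m)*|e| : %.3e\n', sqrt(m) * 1e-16);
```

With a million unknowns the $l_2$ norm is a factor 1000 larger than the $l_\infty$ norm, even though no single component is less accurate.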


3.3.4 Choice of initial vector

Finding a good initial guess for a system of equations is a little more complicated than

for a single equation. First, we cannot generalize the bisection method to systems since

we do not have a point with opposite function value on each side of the root in two or

more dimensional space. We only discuss some simple options.

1. Plotting might give you an initial vector for 2 × 2 systems, but cannot be used for

m × m systems.

2. If nonlinear terms are relatively small so that the solution is dominated by the linear terms, one could first solve the linear system Ax = b and use the solution vector of the linear system as initial guess for Newton's method for systems.

3. If nonlinear terms are not small, one could first try to solve an easier problem or

a sequence of easier problems (continuation). For example, for our predator-prey

equations, we could first try to find a solution for the problem with b = c = 0.1.

Then we could use the solution corresponding to b = c = 0.1 as initial vector for the

problem with b = c = 0.5. The solution corresponding to b = c = 0.5 could then be

used as initial vector for the problem you are interested in (b = c = 1). The harder

the problem, the more intermediate solutions you might need to generate to find an

appropriate initial vector for the problem you are really interested in.

4. Often Newton’s method for systems is part of a larger calculation which provides an

initial guess in a natural way. For example, for partial differential equations (PDEs)

involving a dependence on time and space, you typically generate a solution at the

next time level n + 1 for every coordinate position from the solution at the current

time level n. Since time steps ∆t = tn+1 − tn should be taken small for reasons

of accuracy, the solution at time level n is usually a good approximation to start

Newton’s method to obtain the solution at tn+1 . We discuss this further when we

discuss PDEs.


3.4 Solving linear systems for population models

The solution of a linear system Ax = b can be found using a direct or iterative method.

Direct methods are methods that compute the solution x of Ax = b in a fixed number

of operations that can be determined a priori. Provided that the matrix is nonsingular,

a unique solution will be found. (At least when calculations can be done exactly. When

finite precision arithmetic is used also nearly singular matrices will give problems.) Itera-

tive methods for Ax = b are methods that give an approximation of the solution vector x

by generating a sequence of vectors x(k) that (we hope) converge to the true solution. The

number of operations can not be determined a priori and convergence is not guaranteed.

As for all iterative methods, an initial guess to the solution needs to be provided.

Typical advantages and disadvantages of direct methods

• The solution is subject to round off errors only.

• The number of operations might be very large for large systems of equations, so

that it takes a long time to solve a linear system.

• For large systems, you need a lot of memory on your computer to store the matrix

A (your computer may run out of memory just for storage of a large matrix).

Typical advantages and disadvantages of iterative methods

• Relatively few operations per iteration and thus fast for a single iteration.

• Additional errors since an iterative method only gives an approximation to the

solution.

• Cheap in memory. The matrix A need not be stored, only vectors that result from matrix-vector products like $Ax^{(k)}$.

• Convergence might be slow (resulting in a large number of iterations and large

computing time) or even impossible.

Of course, one may try to improve on basic direct and iterative methods to alleviate the

typical shortcomings. This is outside the scope of Math 4414. We only discuss methods

that are most useful for the problems we consider. For population models, the size of the

system of equations is typically relatively small and solving Ax = b using a direct method

can still be done in an efficient manner. Iterative methods will be discussed in some more

detail when they are more useful: for numerical solutions of differential equations.

Very small systems like 2 × 2 and 3 × 3 systems can be solved as you are used to in linear algebra: first compute the inverse of A and then $x = A^{-1}b$. We discuss the typical issues of accuracy, CPU time, and memory for the 2 × 2 predator-prey system we consider in this chapter.


Accuracy: A direct method is only affected by round-off errors due to finite precision. Iterative methods have additional errors, or might not produce an accurate solution at all if they do not converge. Thus for reasons of accuracy a direct method is preferred. An example of a direct method is computing the inverse $A^{-1}$ of the matrix A and then performing the matrix-vector multiplication $x = A^{-1}b$. For a 2 × 2 system we have a formula for the inverse available,

$$A^{-1} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.$$

To compute $A^{-1}$ exactly we need that det A = ad − bc ≠ 0. With finite precision arithmetic, we also expect large errors when det A = ad − bc is very small. Then we subtract two nearly equal numbers ad and bc and next we divide by this inaccurate small number. Other techniques, however, will have similar difficulties when the matrix is nearly singular (ill-conditioned).

Computing time: This is not a real issue for a 2 × 2 system: only a few operations (multiplications and additions) are involved and computing $x = A^{-1}b$ will be fast.

Memory usage: For 2 × 2 systems this is not a real issue either. We only need to store 4 numbers.
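The formula for the 2 × 2 inverse can be written out directly in Matlab. The sketch below applies it to the matrix A = [2 1; 1 2] and right-hand side b = [1; 1], which are used here only as an illustration; the names `detA` and `Ainv` are also chosen here.

```matlab
A = [2 1; 1 2];
b = [1; 1];

detA = A(1,1)*A(2,2) - A(1,2)*A(2,1);                 % det A = ad - bc
Ainv = [A(2,2), -A(1,2); -A(2,1), A(1,1)] / detA;     % formula for the inverse
x = Ainv * b                                          % x = inv(A)*b
```

If `detA` is zero or very small compared to the products ad and bc, the division amplifies round-off errors and the result should not be trusted.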

Larger systems arise in population models when we include more species, for example 10. Finding the inverse of an m × m matrix using minors ((m − 1) × (m − 1) submatrices) involves O(m!) operations. For a relatively small 10 × 10 matrix this would require approximately 10! ≈ 3.6 million operations to compute the inverse. For such problems it is advantageous to use other direct methods that require fewer operations for matrices of this size. Such methods are not only faster, but also produce a result with smaller round-off errors (fewer operations, so less opportunity to accumulate errors). For example, Gaussian elimination with backward substitution takes $O(m^3)$ operations, i.e. approximately 1 thousand operations for a 10 × 10 system. Since the time it takes to run a program is roughly proportional to the number of operations, Gaussian elimination for a 10 × 10 matrix is approximately 3600 times faster than computing the inverse using minors. Gaussian elimination is analyzed in detail in Math 4445. Here we only focus on some important aspects.

Background

The idea is to transform Ax = b into an upper triangular system Ux = f by subtraction of rows. To create a zero at matrix entry $a_{21}$, subtract $l_{21} = a_{21}/a_{11}$ times the first row from the second row in both A and b. Similarly, subtracting $l_{31} = a_{31}/a_{11}$ times the first row from the third row gives a zero at $a_{31}$, etc. Once the first column is done, you can create zeros below $a_{22}$ using the second row. Since the second row has a zero in the first column, the zeros in the first column won't change anymore. This technique can be continued until an upper-triangular matrix U is obtained. Backward substitution is then performed starting at the mth row, which can be used to find $x_m$. Once $x_m$ is known, $x_{m-1}$ can be determined from row m − 1, etc.


We illustrate Gaussian elimination with backward substitution for a simple 2 × 2

system:

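$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

Using Gaussian elimination

$$\left(\begin{array}{cc|c} 2 & 1 & 1 \\ 1 & 2 & 1 \end{array}\right) \;\xrightarrow{\;E_2 := E_2 - (1/2) \times E_1\;}\; \left(\begin{array}{cc|c} 2 & 1 & 1 \\ 0 & 3/2 & 1/2 \end{array}\right).$$

Solving with backward substitution gives from the second line $x_2 = (1/2)/(3/2) = 1/3$ and by substituting this in the first line

$$x_1 = \frac{1 - x_2}{2} = \frac{1}{3}.$$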
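The elimination and backward substitution steps can be sketched in Matlab as follows. This is a bare-bones version without pivoting, so it assumes that no zero (or very small) pivot is encountered; the loop structure and variable names are choices made here.

```matlab
A = [2 1; 1 2];
b = [1; 1];
m = length(b);

% Forward elimination: create zeros below the diagonal, column by column.
for i = 1:m-1
    for j = i+1:m
        l = A(j,i) / A(i,i);              % multiplier l_ji (assumes A(i,i) ~= 0)
        A(j,:) = A(j,:) - l * A(i,:);     % E_j := E_j - l_ji * E_i
        b(j)   = b(j)   - l * b(i);       % same operation on the right-hand side
    end
end

% Backward substitution, starting at row m.
x = zeros(m, 1);
for i = m:-1:1
    x(i) = (b(i) - A(i,i+1:m) * x(i+1:m)) / A(i,i);
end
x
```

After the elimination loop, A holds the upper triangular matrix U and b holds the modified right-hand side f of the system Ux = f.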

The Gaussian elimination described above fails when you want to make zeros below a

diagonal entry aii (pivot) and the value of aii is exactly zero. Then we need to interchange

that row i (including the row of vector b) with a row below row i that has a non-zero

value in column i. This is called pivoting.

Implications of finite precision: pivoting

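In finite precision calculations, very small pivots can also create large errors. Consider as example the system

$$\begin{pmatrix} 0.001 & 1 \\ -1 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 9 \end{pmatrix},$$

which has the solution $x_1 \approx -4.9800797$ and $x_2 \approx 1.0049801$. If we try to solve this with the Gaussian elimination/backward substitution algorithm in three digit arithmetic with rounding, we get

$$\left(\begin{array}{cc|c} 1.00\times10^{-3} & 1.00 & 1.00 \\ -1.00 & 4.00 & 9.00 \end{array}\right) \;\xrightarrow{\;E_2 := E_2 + 1.00\times10^{3} \times E_1\;}\; \left(\begin{array}{cc|c} 1.00\times10^{-3} & 1.00 & 1.00 \\ 0 & 1.00\times10^{3} & 1.01\times10^{3} \end{array}\right).$$

Solving with backward substitution gives $x_2 = 1.01$ and

$$x_1 = \frac{1.00 - 1.00 \times 1.01}{1.00\times10^{-3}} = -1.00\times10^{1}.$$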

We see that x1 is not at all close to the true solution. The reason lies in the backward

substitution: to determine x1 we subtract nearly equal numbers and divide by a small

pivot number. Here, small errors in the numerator are amplified by a factor 1000, because

of the small pivot of 1.00e-3.

If we first interchange rows, we can obtain an accurate solution

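$$\left(\begin{array}{cc|c} -1.00 & 4.00 & 9.00 \\ 1.00\times10^{-3} & 1.00 & 1.00 \end{array}\right) \;\xrightarrow{\;E_2 := E_2 + 1.00\times10^{-3} \times E_1\;}\; \left(\begin{array}{cc|c} -1.00 & 4.00 & 9.00 \\ 0 & 1.00 & 1.01 \end{array}\right),$$

which gives using backward substitution $x_2 = 1.01$ and $x_1 = -4.96$, which has 2 accurate digits.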
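The effect of the row interchange can be imitated in Matlab by rounding every intermediate result to three significant digits. The helper `fl` below is a hypothetical rounding function written for this sketch (it is not a Matlab built-in and only handles nonzero numbers); the loop runs the same 2 × 2 elimination once without and once with the rows interchanged.

```matlab
% fl(x): round a nonzero number x to three significant digits.
fl = @(x) round(x ./ 10.^(floor(log10(abs(x))) - 2)) .* 10.^(floor(log10(abs(x))) - 2);

A = [0.001 1; -1 4];
b = [1; 9];
orders = {[1 2], [2 1]};        % row order without and with interchange
x = zeros(2, 2);                % column k holds the solution for ordering k
for k = 1:2
    Ak = A(orders{k}, :);
    bk = b(orders{k});
    l   = fl(Ak(2,1) / Ak(1,1));                  % multiplier for row 2
    a22 = fl(Ak(2,2) - fl(l * Ak(1,2)));          % new entry (2,2), rounded
    b2  = fl(bk(2)   - fl(l * bk(1)));            % new right-hand side, rounded
    x2  = fl(b2 / a22);                           % backward substitution
    x1  = fl(fl(bk(1) - fl(Ak(1,2) * x2)) / Ak(1,1));
    x(:, k) = [x1; x2];
end
disp(x)     % first column: no interchange, second column: rows interchanged
```

Both orderings give the same rounded value for $x_2$; the difference in $x_1$ comes entirely from dividing by the small pivot in the first ordering.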


Thus to increase accuracy in finite precision calculations, we need to perform pivoting

not only when the pivot element is zero, but also when it is ”small”. The safest pivoting

technique is complete pivoting. Both row and column interchanges are performed to

find the largest pivot element. Of course, the comparisons and interchanging of rows and columns make the Gaussian elimination more time consuming ($O(m^3)$ more operations are required).

In 16 digits arithmetic, the problem is less serious, but the problem is not eliminated.

Also keep in mind that matrices are much larger than 2 × 2. In a large system many more

operations need to be performed and errors accumulate. Even with the best pivoting

technique a significant number of digits may be lost when solving Ax = b. The larger the

linear system, the more computations and the more inaccurate digits can be expected.

A small system of linear equations can be solved in Matlab by computing the inverse, for

example

A = [2 1; 1 2]

b = [1; 1]

x = inv(A)*b

gives the numerical solution

x=

3.333333333333333e-01

3.333333333333333e-01

Larger systems, like 10×10, can be solved more efficiently using Gaussian elimination with

backward substitution. A safe Gaussian elimination and backward substitution technique

with pivoting is implemented in Matlab through the backslash operator \. For the above

example, you would use

A = [2 1; 1 2]

b = [1; 1]

x=A\b

gives the numerical solution

x=

3.333333333333332e-01

3.333333333333335e-01


Chapter 4

• Accuracy

• Order of convergence

Numerical methods:

– Direct methods: Crout factorization for tridiagonal matrices.


4.1 Problem description and modeling: pollution models

4.1.1 Governing equation

We consider a river which is being polluted by some factories along that river. The

pollutant affects the fish population once it reaches a critical concentration level for a

sufficiently long period in time. For this we want to predict the concentration of pollutant

along the river.

The basic model to describe changing quantities is

rate of change = rate of increase − rate of decrease.

Now we need to apply the basic model locally to a small volume element ∆V = ∆x∆y∆z

in the river. The pollutant concentration in ∆V increases when the factories add pollutant

to it, and decreases because it decays. In addition, the concentration may change if the

amount of pollutant that is transported into ∆V is different than the amount that is

transported out of ∆V .

The basic model in words to describe changes in concentration in a flowing river then becomes

rate of change of concentration = convection + diffusion + production − decay.

The terms on the right-hand side need to be modeled. There are two types of transport into and out of a small volume element: convection and diffusion of the pollutant. We discuss all 4 terms on the right-hand side separately.

• Convection: We assume that the pollutant moves with the same velocity v as the water, i.e. convection can be modeled as −v · ∇c. If the pollutant is small and light this is a good approximation. Larger, heavier objects typically will not follow the water and then the model needs modification.

• Diffusion: Diffusion describes how the pollutant spreads over the river in the ab-

sence of flow. Diffusion is described mathematically by the divergence of the mass

flux φ, i.e. −∇ · φ where ∇ is the gradient vector. We assume that the mass flux φ

of the pollutant is proportional to the concentration gradient, φ = −D∇c (Fick’s

law) where D is the diffusivity which we assume constant. This gives a diffusion

term ∇ · (D∇c) = D∇ · ∇c.

• Production of pollutant: We assume that the pollutant added to the river de-

pends only on where and when the factories add pollutant to the river. This means

the source term r depends on position x and time t only and is independent of the

concentration of pollutant in the river.


• Decay of pollutant: We assume that the rate of decay is proportional to the

concentration, i.e. it can be modeled by kc with k a constant decay rate. Note that

the rate of change depends indirectly on position and time through the concentration

c(x, t).

The resulting model is a partial differential equation (PDE) for the concentration of the pollutant which depends on position x in the river and time t, i.e. c = c(x, t):

$$\frac{\partial c}{\partial t} = -v \cdot \nabla c + D \nabla \cdot \nabla c - kc + r. \tag{4.1}$$

In case k = r = 0, this equation reduces to the so-called convection-diffusion equation.

Solving Eq. (4.1) for the concentration in three spatial dimensions and time is beyond

the scope of Math 4414. However, we can take advantage of the geometry of the river to

simplify the equation. We consider a narrow and shallow river so that we can consider a

concentration that depends on x (coordinate along the river) and time t only: c = c(x, t).

Flow will then occur in the x direction only, represented by a scalar velocity v. Then the

model simplifies to

$$\frac{\partial c}{\partial t} = -v \frac{\partial c}{\partial x} + D \frac{\partial^2 c}{\partial x^2} + r - kc. \tag{4.2}$$

Here the velocity v and pollution rate r may depend on position along the river x and

time t.

To solve Eq. (4.2) it is important to realize how the different terms on the right-hand

side affect the solution. Of course the source term r tends to increase the concentration

and the decay term −kc tends to decrease the concentration, so we only discuss convection

and diffusion in some more detail.

Convection: If ∂c/∂x is negative at some location, there is more pollution just up-

stream than just downstream. Therefore, the concentration will increase which is ac-

counted for in Eq. (4.2) by the minus sign. Convection affects the concentration level

downstream while leaving the concentration upstream unchanged. Fig. 4.1(a) shows how

an initial concentration profile changes in time by convection: the concentration profile
at the initial time shifts in the direction of the flow, leaving the shape unaltered.

Diffusion: If there is a difference in concentration, pollutant will be transported from

high to low concentration by thermal motion. This diffusion process is independent from

the convection and may transport pollutant in both the upstream and downstream direc-

tion. Fig. 4.1(b) shows how an initial concentration profile changes in time by diffusion:

the initial concentration profile will smooth out.

We are interested in whether the concentration reaches an equilibrium solution when

the rate at which pollution is added to the river and the velocity of the water are constant

in time, i.e. r = r(x) and v = v(x). Then the concentration does not change anymore

in time (∂c/∂t = 0) and the model reduces to a 2nd order ordinary differential equation

(ODE) in which the concentration only depends on position c = c(x),

\[
v\,\frac{dc}{dx} - D\,\frac{d^2 c}{dx^2} + kc = r.
\]


[Figure 4.1: Effect of transport terms on c(x, t): (a) pure convection, and (b) pure diffusion. Plots not reproduced; both panels show c versus x at several times, from t = 0 to t = 3 in panel (a) and from t = 0 to t = 0.5 in panel (b).]

We are interested in the concentration profile near the factories where the largest

concentration is to be expected. Therefore we consider only a small part of the river.

As an example we consider a river segment from x = −1 to x = 1. The pollution source
from the factories r can be described by r = 17 + 9x. In addition we choose v = 8, D = 1,
k = 9. This gives
\[
\frac{d^2 c}{dx^2} - 8\,\frac{dc}{dx} - 9c = -17 - 9x, \qquad -1 < x < 1.
\]

In order to solve the 2nd order differential equation we need to specify boundary
conditions for the concentration. For the 2nd order ODE describing the
concentration, boundary conditions are specified at both endpoints (x = −1 and x = 1).
This is called a two-point boundary value problem (BVP). From differential equations,
we know that to determine a unique solution of an nth order BVP we need n boundary
conditions. Thus for the problem considered here, we need one boundary condition on
each side.

Various types of boundary conditions can be used for 2nd order BVPs at a boundary
point $x = x_b$:
• Dirichlet boundary conditions. The concentration is prescribed at the boundary,
\[
c(x = x_b) = C_D,
\]
where $C_D$ is a constant.
• Neumann boundary conditions. The mass flux (or concentration gradient) is prescribed
at the boundary:
\[
\frac{dc}{dx}(x = x_b) = C_N,
\]
where $C_N$ is a constant.
• Robin boundary conditions. The concentration and its gradient are combined in a single
condition:
\[
\frac{dc}{dx}(x = x_b) + k_R\, c(x = x_b) = C_R,
\]
where $k_R$ and $C_R$ are constants.

4.1.3 Example

For simplicity, we assume we know the concentration at the endpoints:

\[
\frac{d^2 c}{dx^2} - 8\,\frac{dc}{dx} - 9c = -17 - 9x, \qquad -1 < x < 1,
\]
\[
c(x = -1) = e^{-18}, \qquad c(x = 1) = 3.
\]


4.2 Analytical solutions and solving with Matlab

In this section we discuss two methods to obtain the analytical solution of a boundary

value problem. In the next sections we will solve BVPs numerically and see whether the

solution obtained by a numerical method converges to the analytical solution found in

this section.

The solution of a nonhomogeneous BVP consists of a homogeneous solution and a particular
solution, $c(x) = c_h(x) + c_p(x)$. The solution to the homogeneous differential equation
\[
\frac{d^2 c}{dx^2} - 8\,\frac{dc}{dx} - 9c = 0
\]
is found using a characteristic equation (i.e. you look for solutions $e^{\lambda x}$),
\[
\lambda^2 - 8\lambda - 9 = 0,
\]
which has roots $\lambda_1 = -1$ and $\lambda_2 = 9$. This gives solutions $e^{-x}$ and $e^{9x}$, and the homogeneous
solution is a linear combination of these,
\[
c_h(x) = C_1 e^{-x} + C_2 e^{9x}.
\]
The particular solution can be found using the method of undetermined coefficients. Here
you would try a polynomial of the same order as −17 − 9x, i.e. $c_p(x) = Ax + B$, and try
to find A and B by substitution. This gives −8A − 9B = −17 and −9A = −9, which has
solution A = 1 and B = 1. Thus the particular solution equals $c_p(x) = 1 + x$.
This gives for the total concentration $c(x) = C_1 e^{-x} + C_2 e^{9x} + 1 + x$. The values for $C_1$
and $C_2$ are obtained from the boundary conditions c(x = 1) = 3 and $c(x = -1) = e^{-18}$.
This gives $C_1 = 0$ and $C_2 = e^{-9}$, which gives for the solution of the pollution BVP
\[
c(x) = e^{9x-9} + 1 + x.
\]
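As a quick check, the two boundary conditions form a 2 × 2 linear system for the constants, which can also be solved numerically. The sketch below is an illustration only; the variable names M, rhs, and C are assumptions and not part of the original notes.

```matlab
% Boundary conditions applied to c(x) = C1*exp(-x) + C2*exp(9*x) + 1 + x:
%   c(-1) = exp(-18):  C1*exp(1)  + C2*exp(-9) + 0 = exp(-18)
%   c( 1) = 3:         C1*exp(-1) + C2*exp(9)  + 2 = 3
M   = [exp(1),  exp(-9);
       exp(-1), exp(9)];
rhs = [exp(-18); 3 - 2];
C   = M \ rhs;            % C(1) holds C1, C(2) holds C2
```

Up to round-off, C(1) should be zero and C(2) should equal exp(-9), in agreement with the hand calculation.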

Ordinary differential equations can be solved symbolically using the built-in Matlab function
dsolve. For the example we consider you would use
dsolve('D2c - 8*Dc - 9*c = -17 - 9*x', 'x')
where D2c denotes the second derivative c'' and Dc the first derivative c'. Here the 'x'
denotes that the independent variable is x instead of the default t. Matlab will give the
symbolic solution
C1*exp(-x) + C2*exp(9*x) + 1 + x
To find the unique solution of the boundary value problem with the boundary conditions
$c(-1) = e^{-18}$ and c(1) = 3, use
c = dsolve('D2c - 8*Dc - 9*c = -17 - 9*x', 'c(-1)=exp(-18)', 'c(1)=3', 'x')
simplify(c)
which will give
c =
(-exp(-9+9*x)+exp(11+9*x)+exp(20)-1+x*exp(20)-x)/(exp(20)-1)
which is the correct solution, written in a different form.
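Recent Matlab releases may no longer accept equations passed to dsolve as character strings. A minimal sketch of the same computation in the symbolic-variable syntax is given here, assuming the Symbolic Math Toolbox is available; the names ode, conds, and cSol are illustrative and do not appear in the original notes.

```matlab
syms c(x)                                    % c is a symbolic function of x
ode   = diff(c, x, 2) - 8*diff(c, x) - 9*c == -17 - 9*x;
conds = [c(-1) == exp(sym(-18)), c(1) == 3]; % the two Dirichlet conditions
cSol  = simplify(dsolve(ode, conds));        % symbolic solution of the BVP
```

The result should be equivalent to exp(9*x - 9) + 1 + x, although Matlab may display it in a different form.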


4.3 Solving BVPs numerically: Introduction

4.3.1 Grids

The solution to a boundary value problem is a function c(x) which is defined for every

x. If we use numerical techniques, we find an approximation to the solution at certain
discrete coordinate values $x_i$ only. The collection of $x_i$'s is called a (computational) grid.

Before we can solve a BVP numerically, we need to specify the computational grid, i.e.

all grid points xi . A numerical technique will produce approximations to the function c

at these grid points only, i.e. we will obtain approximations for the values c(xi ) which

will be denoted by ci . See Fig. 4.2.

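[Figure 4.2: Grid points $a = x_0, x_1, x_2, \ldots, x_{n-1}, x_n = b$ and the numerical approximations $c_0, c_1, \ldots, c_n$ to the solution c(x) at these points. Sketch not reproduced.]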

Often we choose the interval [a, b] to be divided into N equally spaced subintervals of

length h = (b − a)/N , which corresponds to the grid points

xi = a + ih

for i = 0, . . . , N . The length of a subinterval h is called the mesh size. Note that the

nodes are numbered by increasing x coordinate, thus x0 = a, x1 = a + h, . . ., xN = b.

This numbering is called natural numbering. Henceforth, we will only use natural

numbering since it simplifies the derivation of methods.

Example

We solve a BVP numerically on [0, 1] and divide the interval [0, 1] into N = 4 equally

spaced subintervals. The length of a subinterval is h = (1 − 0)/4 = 1/4. We obtain a

numerical approximation to the solution at the N + 1 = 5 discrete coordinate values only:

x0 = 0, x1 = 1/4, x2 = 1/2, x3 = 3/4, and x4 = 1.
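The grid of this example can be generated in Matlab as follows. This is a small sketch; the variable names are assumptions. Note that Matlab arrays start at index 1, so the grid point $x_i$ is stored in x(i+1).

```matlab
a = 0;  b = 1;  N = 4;   % interval [a, b] and number of subintervals
h = (b - a)/N;           % mesh size
x = a + (0:N)*h;         % the N+1 grid points x_0, ..., x_N
```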

Remarks:

1. Typically, the more mesh points the more accurate the approximation to the so-

lution and the more work to compute the approximation. The goal is to compute


an accurate numerical solution with as few grid points as possible, to minimize

computing time.

2. Almost always there are certain regions in [a, b] where the solution changes more

rapidly (i.e. where you want more mesh points). An equally spaced mesh is then

not the best choice.

3. An equally spaced mesh makes it easiest to introduce the numerical techniques and will
therefore be used in this chapter.

If a BVP cannot be solved analytically, one can try to solve it numerically. Numerical
techniques, however, will not always work either. If no unique solution exists for the BVP, we

can’t expect that a numerical technique will produce something useful. In addition, there

is no best method to solve BVPs. Which method to choose depends on the problem you

want to solve. Issues are

1. Accuracy.

2. Computing time and memory (Typically not an important issue for BVPs, only for

PDEs in 2 or 3 spatial dimensions)

The four most frequently used techniques that can easily be extended to two or three

dimensions are finite differences (FD), finite elements (FE), finite volumes (FV), and

spectral methods. We discuss two of these techniques, FD and FE, and apply the tech-

niques to solve BVPs. We focus on how these methods work, how to program them, and

typical numerical issues. Finite differences make it easiest to understand the mathematical and
numerical concepts. Finite elements are particularly useful when dealing with complex

geometries (curved boundaries) in two or three dimensions. We will further discuss this

issue in Chap. 7.


4.4 Solving BVPs numerically using Matlab: bvp4c

Solving BVPs numerically with Matlab is a little more complicated than solving algebraic

equations. We discuss only the use of bvp4c. To solve a boundary value problem in

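Matlab, you first need to write the ODE as a system of first order ODEs $\mathbf{y}' = \mathbf{f}(\mathbf{y}, x)$
(See Math 2214), using the solution vector
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}
=
\begin{pmatrix} y \\ y' \\ \vdots \\ y^{(m-1)} \end{pmatrix},
\]
where $y^{(m-1)}$ is the (m − 1)th derivative of y with respect to x. For the pollution BVP
one would introduce
\[
\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = \begin{pmatrix} c \\ c' \end{pmatrix}.
\]
Taking the derivative of this vector and using the pollution ODE gives
\[
\mathbf{c}' = \begin{pmatrix} c_1' \\ c_2' \end{pmatrix}
= \begin{pmatrix} c' \\ c'' \end{pmatrix}
= \begin{pmatrix} c_2 \\ 9c_1 + 8c_2 - 17 - 9x \end{pmatrix}
= \mathbf{f}(\mathbf{c}, x).
\]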

For the pollution BVP written as a system of first order ODEs one would
use in the Matlab Command Window (in addition you need to write 3 small m-files as
discussed below)
x = -1:0.1:1;
solinit = bvpinit(x, @init_bvp);
options = bvpset('AbsTol', 1e-8);
sol = bvp4c(@func_bvp, @bc_bvp, solinit, options);

The first line specifies the initial grid points (Matlab might add some grid points if it

thinks it is necessary to obtain a solution that is sufficiently accurate).

The second line creates a solution structure solinit which contains in solinit.x the initial
mesh and in solinit.y the initial approximation specified in the user-written m-file named
here init_bvp.m. If we use a linear initial guess between $c(-1) = e^{-18} \approx 0$ and c(1) = 3,
we get $y_{init}(x) = (3 + 3x)/2$. We need to specify the initial guess for the system of first
order ODEs, so we also need to specify the derivative $y_{init}'(x) = 3/2$. The initial guess
needs to be specified as a column vector,
c(1,1) = 1.5*(x+1);
c(2,1) = 1.5;


The third line specifies options for BVPs that are different from the defaults used to

solve a BVP. Here we specify an absolute tolerance for the residual AbsTol = 1e-8, where

the default is 1e-6. The tolerance is satisfied at every grid point in the mesh.

The fourth line solves the BVP. In addition to the initial solution structure solinit,
it needs two user-written m-files which we named here func_bvp and bc_bvp. The fourth
parameter options is optional and can be omitted if default options are used.
The right-hand-side vector f should be specified in func_bvp as a column vector,
f(1,1) = y(2);
f(2,1) = 9*y(1) + 8*y(2) - 17 - 9*x;
where the array y contains the solution vector (y(1) = y and y(2) = y') at the grid point
x (single variable, not all grid points). The boundary conditions should be specified in
bc_bvp, written in residual form . . . = 0, i.e. $y(x = -1) - e^{-18} = 0$ and y(x = 1) − 3 = 0.
The residual needs to be specified as a column vector,
res(1,1) = ya(1) - exp(-18);
res(2,1) = yb(1) - 3;
where the array ya contains the solution vector at the left endpoint (ya(1) = y(x = a)
and ya(2) = y'(x = a)) and the array yb contains the solution vector at the right endpoint
(yb(1) = y(x = b) and yb(2) = y'(x = b)).

The function bvp4c produces as output a data structure which we named sol. The data

structure sol has two members sol.x and sol.y. The grid points at which the solution is

approximated are stored in sol.x (These are typically not the same grid points as the initial

mesh you specified, but might be refined to satisfy the default tolerances or tolerances

specified in options). The solution vector y at each grid point is stored in the two-dimensional
array sol.y. Row 1 contains $y_1 = y$ at all grid points, row 2 contains $y_2 = y'$
at all grid points, etc.
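Putting the pieces together, the command-window lines and the three user-written functions can be collected in one function file with local functions. This is a sketch under two assumptions: function handles are used to pass the user functions, and the file is saved under the hypothetical name pollution_bvp4c.m.

```matlab
function sol = pollution_bvp4c()
% Solve the pollution BVP c'' - 8c' - 9c = -17 - 9x on [-1, 1] with bvp4c.
x       = -1:0.1:1;                         % initial mesh
solinit = bvpinit(x, @init_bvp);            % initial guess on that mesh
options = bvpset('AbsTol', 1e-8);           % tighter absolute tolerance
sol     = bvp4c(@func_bvp, @bc_bvp, solinit, options);
end

function c = init_bvp(x)
% Linear initial guess for [c; c'] at a single point x.
c = [1.5*(x + 1); 1.5];
end

function f = func_bvp(x, y)
% Right-hand side of the first order system, y(1) = c and y(2) = c'.
f = [y(2); 9*y(1) + 8*y(2) - 17 - 9*x];
end

function res = bc_bvp(ya, yb)
% Boundary conditions in residual form at the left (ya) and right (yb) endpoint.
res = [ya(1) - exp(-18); yb(1) - 3];
end
```

Calling sol = pollution_bvp4c() then returns the structure described above, with the grid in sol.x and the approximation to c in sol.y(1,:).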

A specific column or row in a two-dimensional array y can be selected using colon

notation. For example, the first row can be selected by using y(1,:) (meaning row 1, all

columns) and the second column by using y(:,2) (meaning all rows, column 2).

After the calculation with bvp4c we only have some (long) arrays with numbers from

which it is difficult to obtain insight into what the model predicts exactly. For this we

need to plot the solution. The approximation to the solution is in the first row of sol.y

and can be selected using colon notation, sol.y(1,:). The following plots the numerical
solution from bvp4c and the exact solution,
plot(sol.x, sol.y(1,:), 'o')
hold on
ezplot('exp(9*x-9)+1+x', [-1,1])

[Figure 4.3: Numerical solution using bvp4c and exact solution using ezplot for the pollution
BVP. Plot of c versus x not reproduced.]


4.5 Finite differences

In this section, we discuss the discretization in space of a second-order BVP using the

finite difference method. Finite differences is conceptually the easiest, but not necessarily

the best method. In section 4.7 we discuss an alternative method: finite elements.

We only consider equally spaced grid points and use natural numbering. This simplifies

the derivation.

In finite difference methods, all derivatives are approximated by finite difference for-

mulas. By using more grid points in the approximation, a more accurate approximation

of a derivative can be obtained. Finite difference formulas can be derived using Taylor’s

theorem. We use the form

\[
f(x_i + \Delta x) = \sum_{k=0}^{n} \frac{(\Delta x)^k}{k!}\, f^{(k)}(x_i) + O((\Delta x)^{n+1}).
\]
Here $O((\Delta x)^{n+1})$ means that all terms we neglect are $C(\Delta x)^{n+1}$ (with C a constant) or
of higher order. We consider three cases that frequently occur.

Simple expressions, i.e. when a derivative is approximated by 2 points, can be derived

directly from a Taylor series.

Using Taylor's theorem with ∆x = h gives
\[
f(x_i + h) = f(x_i) + h f'(x_i) + O(h^2).
\]
Rewriting gives the forward difference formula for the 1st derivative
\[
f'(x_i) = \frac{1}{h}\,[f(x_i + h) - f(x_i)] + O(h).
\]
Alternatively, we can write the forward difference a little shorter, using the notation
introduced in Fig. 4.2,
\[
f'(x_i) = \frac{1}{h}\,[f_{i+1} - f_i] + O(h).
\]
By taking ∆x = −h, i.e. h → −h in the forward difference formula, the backward
difference formula can be obtained
\[
f'(x_i) = \frac{1}{-h}\,[f(x_i - h) - f(x_i)] + O(h) = \frac{1}{h}\,[f(x_i) - f(x_i - h)] + O(h) = \frac{1}{h}\,[f_i - f_{i-1}] + O(h).
\]


Both formulas are equally accurate, O(h), so if you have a choice it doesn't matter which
one you use. However, for the first node (no previous node) the backward difference
formula cannot be used and for the last node (no next node) the forward difference
formula cannot be used.
Derivation of more accurate (higher-order) finite difference formulas and finite difference
formulas for higher-order derivatives is very tedious when Taylor series are used directly.
Instead we can use the following theorem:

If a (q + 1)-point difference formula is exact for the polynomials $P_0(x) = 1$, $P_1(x) = x$,
$P_2(x) = x^2$, . . ., $P_q(x) = x^q$, then the error is $O(h^q)$. In addition, the value of the grid
point $x_i$ doesn't matter. To obtain simple algebra $x_i = 0$ is usually convenient.
Thus we can try to find coefficients $\alpha_1, \ldots, \alpha_{q+1}$ so that if we approximate a certain
derivative at grid point $x_i$ by a finite difference formula involving the points k, . . . , k + q,
say
\[
\alpha_1 f_k + \alpha_2 f_{k+1} + \cdots + \alpha_{q+1} f_{k+q},
\]
we get the exact value of that derivative (zero error) if f(x) is a polynomial of degree q or
lower. This will lead to a linear system of q + 1 equations and q + 1 unknowns. Note that
to approximate a derivative at a grid point $x_i$ you can use function values at any
q + 1 grid points. However, function values in grid points closer to $x_i$ will contain more
information about the behavior of a function close to $x_i$ and will typically have a smaller
error.

We derive an $O(h^2)$ accurate approximation for the 1st derivative at grid point $x_i$, i.e. for
$f'(x_i)$, using the function values in the three closest grid points $x_i - h$, $x_i$, and $x_i + h$. To
simplify algebra we choose $x_i = 0$ (any other choice should lead to the same result), so
that $x_{i-1} = -h$ and $x_{i+1} = h$. We need to find three coefficients ($\alpha_1$, $\alpha_2$, and $\alpha_3$) so that
\[
f'(0) = \alpha_1 f(0 - h) + \alpha_2 f(0) + \alpha_3 f(0 + h)
\]
is exact when we use for the function f the polynomials $P_0(x) = 1$, $P_1(x) = x$, and
$P_2(x) = x^2$ (Note that you also know $f'(0)$ for these functions). This gives three equations
with three unknowns,
\[
\begin{aligned}
0 &= \alpha_1 + \alpha_2 + \alpha_3, \\
1 &= \alpha_1(-h) + \alpha_2(0) + \alpha_3(h), \\
0 &= \alpha_1(-h)^2 + \alpha_2(0)^2 + \alpha_3(h)^2.
\end{aligned}
\]
The third line gives $\alpha_1 = -\alpha_3$. Substituting in line 2 gives $\alpha_3 = 1/(2h)$ and $\alpha_1 = -1/(2h)$.
Line 1 gives then $\alpha_2 = 0$.


Thus we found the central finite difference formula
\[
f'(x_i) = \frac{1}{2h}\,[f(x_i + h) - f(x_i - h)] + O(h^2).
\]
Alternatively we could have solved the linear system with Matlab:
sol = solve('a1+a2+a3 = 0', '-h*a1+h*a3=1', 'h^2*a1+h^2*a3=0', 'a1', 'a2', 'a3')
produces a structure sol. The values of the coefficients can be found by typing sol.a1, sol.a2, and sol.a3
in the command window.
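The orders of accuracy of the forward and central formulas can be checked numerically: halving h should roughly halve the error of an O(h) formula and divide the error of an $O(h^2)$ formula by four. The sketch below uses exp as a test function, chosen here only because its derivative is known; all variable names are assumptions.

```matlab
f     = @exp;                 % test function with known derivative
x0    = 1;
exact = exp(1);               % f'(x0)
h     = [0.1, 0.05];          % a mesh size and half that mesh size

errFwd = abs((f(x0 + h) - f(x0))./h - exact);           % forward difference
errCen = abs((f(x0 + h) - f(x0 - h))./(2*h) - exact);   % central difference

ratioFwd = errFwd(1)/errFwd(2);   % close to 2 for a first order formula
ratioCen = errCen(1)/errCen(2);   % close to 4 for a second order formula
```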

We now derive an $O(h^2)$ accurate approximation for the 2nd derivative at grid point $x_i$,
i.e. for $f''(x_i)$, using the function values in the three closest grid points $x_i - h$, $x_i$, and
$x_i + h$. Again we use $x_i = 0$, call the three coefficients $\alpha_1$, $\alpha_2$, and $\alpha_3$, and require that
\[
f''(0) = \alpha_1 f(0 - h) + \alpha_2 f(0) + \alpha_3 f(0 + h)
\]
is exact when we use for the function f the polynomials $P_0(x) = 1$, $P_1(x) = x$, and
$P_2(x) = x^2$. This gives three equations with three unknowns,
\[
\begin{aligned}
0 &= \alpha_1 + \alpha_2 + \alpha_3, \\
0 &= \alpha_1(-h) + \alpha_2(0) + \alpha_3(h), \\
2 &= \alpha_1(-h)^2 + \alpha_2(0)^2 + \alpha_3(h)^2.
\end{aligned}
\]
The second line gives $\alpha_1 = \alpha_3$. Substituting in line 3 gives $\alpha_3 = 1/h^2$ and $\alpha_1 = 1/h^2$.
Line 1 gives then $\alpha_2 = -2/h^2$. Thus we obtained a central finite difference formula for
the second derivative,
\[
f''(x_i) = \frac{1}{h^2}\,[f(x_i - h) - 2f(x_i) + f(x_i + h)] + O(h^2).
\]
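For a given numerical value of h, both sets of coefficients can also be obtained by solving the 3 × 3 systems with the backslash operator. The following is a sketch only; h = 0.1 is an arbitrary sample value and the names xs, V, alphaD1, and alphaD2 are assumptions.

```matlab
h  = 0.1;                       % sample mesh size
xs = [-h, 0, h];                % stencil points relative to x_i = 0
V  = [ones(1,3); xs; xs.^2];    % row k+1: P_k(x) = x^k evaluated at the stencil

alphaD1 = V \ [0; 1; 0];        % right-hand side: first derivatives of 1, x, x^2 at 0
alphaD2 = V \ [0; 0; 2];        % right-hand side: second derivatives of 1, x, x^2 at 0
```

Here alphaD1 should reproduce [−1/(2h); 0; 1/(2h)] and alphaD2 should reproduce [1/h²; −2/h²; 1/h²], matching the formulas derived by hand.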

To reduce the amount of algebra a little bit, we consider the discretization of the following
second order linear ODE instead of Eq. (4.2)
\[
-\frac{d^2 c}{dx^2} + c = x
\]
with boundary conditions c(x = 0) = 0 and c(x = 1) = 3 using n = 4 subintervals of
length h = 0.25.
We make a grid in the x-direction using natural numbering and we denote the numerical
approximation of c at $x = x_j$ by $c_j = \tilde{c}(x_j)$ for j = 0, . . . , n = 4.

At every point $x_j$ in the interior (0, 1), the differential equation holds. The second
derivative is approximated by a difference formula. Here we use the $O(h^2)$ approximation
for $c''$. The concentration itself is approximated by the nodal value $c_j$ at $x = x_j$. The
value of x at node j is known: $x_j$. We obtain 3 finite difference equations for j = 1, 2, 3
(for which the value of $x_j$ is in (0, 1)),
\[
\begin{aligned}
j = 1: &\quad -\frac{c_0 - 2c_1 + c_2}{h^2} + c_1 = x_1, \\
j = 2: &\quad -\frac{c_1 - 2c_2 + c_3}{h^2} + c_2 = x_2, \\
j = 3: &\quad -\frac{c_2 - 2c_3 + c_4}{h^2} + c_3 = x_3,
\end{aligned}
\]
with h = 1/4 here. At the endpoints $x_0 = 0$ and $x_4 = 1$ we have the boundary condition
instead of the ODE. Since $c_0$ is the approximation of c at x = 0, we have $c_0 = 0$. At the
last point $x_4 = 1$ we have the boundary condition: $c_4 = 3$.

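All together we have 5 linear equations with 5 unknowns (including the boundary
conditions) which can be written in matrix form
\[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
-1/h^2 & 1 + 2/h^2 & -1/h^2 & 0 & 0 \\
0 & -1/h^2 & 1 + 2/h^2 & -1/h^2 & 0 \\
0 & 0 & -1/h^2 & 1 + 2/h^2 & -1/h^2 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
=
\begin{pmatrix} 0 \\ x_1 \\ x_2 \\ x_3 \\ 3 \end{pmatrix},
\]
or using the known values for h = 1/4 and the grid points $x_j = j/4$
\[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
-16 & 33 & -16 & 0 & 0 \\
0 & -16 & 33 & -16 & 0 \\
0 & 0 & -16 & 33 & -16 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 1/4 \\ 1/2 \\ 3/4 \\ 3 \end{pmatrix}.
\]
The matrix and right-hand side are constant, so that we have reduced the problem to a
linear algebra problem: solving Ax = b. Once we have set up the linear system, we can
use any appropriate solution method to solve Ax = b to find the values $c_j$.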
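As an illustration, the 5 × 5 system above can be assembled and solved in Matlab with a short loop. This is a sketch, not the program structure recommended in these notes; the loop index k and the variable names are assumptions. Because Matlab arrays start at index 1, row k corresponds to grid point $x_{k-1}$.

```matlab
n = 4;  h = 1/n;
x = (0:n)'*h;                   % grid points x_0, ..., x_4
A = zeros(n+1);  f = zeros(n+1, 1);

A(1,1) = 1;  f(1) = 0;          % boundary condition c_0 = 0
for k = 2:n                     % rows for the internal points x_1, x_2, x_3
    A(k, k-1) = -1/h^2;
    A(k, k)   = 1 + 2/h^2;
    A(k, k+1) = -1/h^2;
    f(k)      = x(k);
end
A(n+1, n+1) = 1;  f(n+1) = 3;   % boundary condition c_4 = 3

c = A \ f;                      % c(1) = c_0, ..., c(5) = c_4
```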

If we need to program the finite difference method, we do not want to write a new code

if we use a different number of grid points. So we want to find a more general equation

Ax = b in terms of cj for the n + 1 grid points x0 , . . . , xn .

We consider again the pollution equation
\[
-\frac{d^2 c}{dx^2} + 8\,\frac{dc}{dx} + 9c = 17 + 9x, \qquad -1 < x < 1,
\]
\[
c(x = -1) = e^{-18}, \qquad c(x = 1) = 3.
\]


We use the $O(h^2)$ central FD formulas to approximate derivatives at $x_j$, i.e. Eq. (4.5.3)
for $c'$ and Eq. (4.5.3) for $c''$. The concentration itself at $x = x_j$ gives $c_j$. For each internal
node j = 1, . . . , n − 1 we then have
\[
-\frac{c_{j-1} - 2c_j + c_{j+1}}{h^2} + 8\,\frac{c_{j+1} - c_{j-1}}{2h} + 9c_j = 17 + 9x_j.
\]
The boundary conditions are: $c_0 = e^{-18}$ and $c_n = 3$.
We can write the equations in matrix form:
\[
A\mathbf{c} = \mathbf{f},
\]

• In the first row we put the equation corresponding to the grid point $x_0$. This is
the boundary condition $c_0 = e^{-18}$. Thus we need a value of 1 in the first column
corresponding to $c_0$ and zeros in the other columns since these $c_j$'s are not involved in
the Dirichlet boundary condition. In $f_0$ we need the value of the boundary condition
$e^{-18}$.
• The next n − 1 rows (rows 2, . . . , n) correspond to the internal grid points $x_1, \ldots, x_{n-1}$,
at which the (discretized) ODE needs to be satisfied: the second row corresponds
to $x_1$, the third row to $x_2$, etc., until row n. Thus the rows are ordered the same way
as the grid points. For each j = 1, . . . , n − 1, we have
– $-1/h^2 - 4/h$ in column j − 1, i.e. in the column on the left of the diagonal
(component $A_{j,j-1}$).
– $9 + 2/h^2$ in column j, i.e. on the diagonal (component $A_{j,j}$).
– $-1/h^2 + 4/h$ in column j + 1, i.e. in the column on the right of the diagonal
(component $A_{j,j+1}$).
– $17 + 9x_j$ in the right-hand-side vector (component $f_j$).
• In the last row n + 1 we put the equation corresponding to $x_n$. This is the boundary
condition $c_n = 3$. Thus we need a value of 1 in the last column, n + 1, corresponding
to $c_n$ and zeros in the other columns since these $c_j$'s are not involved in the Dirichlet
boundary condition. In $f_n$ we need the value of the boundary condition 3.

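\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
-1/h^2 - 4/h & 9 + 2/h^2 & -1/h^2 + 4/h & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -1/h^2 - 4/h & 9 + 2/h^2 & -1/h^2 + 4/h \\
0 & \cdots & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \\ c_n \end{pmatrix}
=
\begin{pmatrix} e^{-18} \\ 17 + 9x_1 \\ \vdots \\ 17 + 9x_{n-1} \\ 3 \end{pmatrix},
\]
where the dots denote a continuation of the same value, for example $9 + 2/h^2$ on the diagonal.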

Once the matrix A and right-hand side vector f have been constructed, any appro-

priate method to solve Ax = b can be used. An efficient method will be discussed in

Sec. 4.9.
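The row-by-row construction described above might be coded as follows for an arbitrary number of subintervals n. This is a minimal sketch that stores the full matrix and uses the backslash operator; n = 40 and all variable names are assumptions made for the illustration. Row k of the Matlab arrays corresponds to grid point $x_{k-1}$.

```matlab
n = 40;  a = -1;  b = 1;
h = (b - a)/n;
x = a + (0:n)'*h;                    % grid points x_0, ..., x_n
A = zeros(n+1);  f = zeros(n+1, 1);

A(1,1) = 1;  f(1) = exp(-18);        % first row: c_0 = exp(-18)
for k = 2:n                          % internal grid points x_1, ..., x_{n-1}
    A(k, k-1) = -1/h^2 - 4/h;        % left of the diagonal
    A(k, k)   =  9 + 2/h^2;          % diagonal
    A(k, k+1) = -1/h^2 + 4/h;        % right of the diagonal
    f(k)      = 17 + 9*x(k);
end
A(n+1, n+1) = 1;  f(n+1) = 3;        % last row: c_n = 3

c   = A \ f;                                 % numerical solution at all grid points
err = max(abs(c - (exp(9*x - 9) + 1 + x)));  % comparison with the exact solution
```

Repeating the computation with a larger n should reduce err, which gives a simple check that the assembly is correct.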

A numerical code for a finite difference method consists of two different parts. First, the

matrix A and right-hand-side vector f need to be constructed. Second, the matrix-vector

equation needs to be solved.

1. Setup of Ax = b

A lot of things can vary when you use the finite difference method to solve a second

order ODE. In a general form, a second order ODE reads

\[
-\frac{d^2 c}{dx^2} + p(x)\,\frac{dc}{dx} + q(x)\,c = r(x), \tag{4.3}
\]

where p(x), q(x), and r(x) are functions of x that are different from one problem

to the next. In addition, the type and values of the boundary conditions may vary

and you may want to change the order of the approximation of a derivative.

2. Solving Ax = b

Depending on the size of your matrix you may want to choose a different method to

solve Ax = b. To economize in memory, you might want to store only the non-zero

components in your matrix (in the examples above, just three diagonals instead of

the full (n + 1) × (n + 1) matrix with all the zeros).

We don’t want to have a large number of nearly identical programs for each case, so you

need to create several functions that are sufficiently general and focus on one or a few

aspects and combine these in a clever way. To obtain a readable code it is very important

that you program in a structured way.

There are various ways to program finite differences in a well-structured way. Here, I
give one possibility to program the second order ODE of Eq. (4.3), −y'' + p(x)y' + q(x)y = r(x), using
proper boundary conditions. We use several (levels of) functions
– a function fdmatvec that sets up the FD part of the matrix A and vector f. This
function calls
∗ a function ode_param that computes the values of p(x), q(x), and r(x).
– a function bcmatvec that sets up the boundary condition part of the matrix A
and vector f. This function calls
∗ a function bc_param that specifies the type (Dirichlet, Neumann) and values
($C_D$, $C_N$) of boundary conditions.
– a function that solves Ax = b
In the main file, it needs to be decided (input) which order of approximation for the
derivatives is used, how the matrix is stored, and how Ax = b is solved.

In fdmatvec the part of the matrix A and f corresponding to the finite difference

equations is set up. Since you have the same type of expression for every internal grid

point you can use a for-loop for this. Here you may want to distinguish different orders

of approximation using if statements. These if statements should be kept outside of for-loops
if possible, for reasons of efficiency. The function ode_param is called for every grid

point and evaluates the functions p(x), q(x), and r(x) at a certain grid point.

Boundary conditions would typically be incorporated in a separate function bcmatvec,

since it is independent of how you handle internal gridpoints. To keep this function

general, you would use another function to specify the type and value (say Ca for the left

and Cb for the right endpoint) of the boundary condition used at each end point. Different

types of boundary conditions can be distinguished using if-statements. The parameters

Ca and Cb are used in the matrix A and right-hand side f to keep these expressions

general.

In the main function finitedif different solvers would be distinguished through if-

statements. Since different solvers do not have much in common and for reasons of

efficiency, different methods to solve Ax = b would be in different functions. In Matlab

you could use, for example, the \ operator.

Once your program is working, you just need to change the simple functions ode_param
and bc_param if you solve a different BVP. The more complicated functions fdmatvec,
bcmatvec, and any functions to solve Ax = b can be left unchanged.
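The structure described above could be sketched as follows for the pollution BVP. The function names finitedif, fdmatvec, ode_param, bcmatvec, and bc_param are taken from the text, but their argument lists are assumptions, and the sketch is deliberately reduced: only Dirichlet conditions, only the $O(h^2)$ central formulas, a full matrix, and the backslash solver.

```matlab
function c = finitedif(n)
% Main function: grid, matrix setup, boundary conditions, solve.
a = -1;  b = 1;
h = (b - a)/n;
x = a + (0:n)'*h;
[A, f] = fdmatvec(x, h);
[A, f] = bcmatvec(A, f);
c = A \ f;
end

function [A, f] = fdmatvec(x, h)
% FD rows for the internal grid points of -c'' + p c' + q c = r.
n = length(x) - 1;
A = zeros(n+1);  f = zeros(n+1, 1);
for k = 2:n
    [p, q, r] = ode_param(x(k));
    A(k, k-1) = -1/h^2 - p/(2*h);
    A(k, k)   =  2/h^2 + q;
    A(k, k+1) = -1/h^2 + p/(2*h);
    f(k)      = r;
end
end

function [p, q, r] = ode_param(x)
% Coefficients of the pollution ODE at a single grid point x.
p = 8;  q = 9;  r = 17 + 9*x;
end

function [A, f] = bcmatvec(A, f)
% Dirichlet conditions in the first and last row.
[Ca, Cb] = bc_param();
A(1, :)   = 0;  A(1, 1)     = 1;  f(1)   = Ca;
A(end, :) = 0;  A(end, end) = 1;  f(end) = Cb;
end

function [Ca, Cb] = bc_param()
% Boundary values at the left and right endpoint.
Ca = exp(-18);  Cb = 3;
end
```

With this layout, solving a different Dirichlet BVP on the same interval only requires editing ode_param and bc_param.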


4.6 Eliminating boundary conditions

There are two reasons to eliminate boundary conditions from a linear system Ac = f .

First, you reduce the number of unknowns, so you need to solve a smaller system

which is faster. For a BVP this is not a very important issue since the boundary consists

only of two grid points (the end points). However, when we solve PDEs on 2D regions the

boundary consists of curves and on 3D regions the boundary consists of surfaces. Both

typically have a lot of grid points.

Second, a certain pattern in the matrix might be distorted by the boundary conditions.

For example, symmetry or a tridiagonal structure of the matrix may be lost due to the

boundary conditions. These are important properties of the matrix when you solve the

linear system. Some efficient techniques to solve Ax = b require a symmetric matrix,

for example the conjugate gradient method. A tridiagonal system can be solved very

efficiently using a direct method (Crout’s method, see Sec. 4.9).

4.6.1 Example

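In the FD example in Sec. 4.5.4,
\[
\begin{aligned}
j = 0: &\quad c_0 = 0, \\
j = 1: &\quad -\frac{c_0 - 2c_1 + c_2}{h^2} + c_1 = x_1, \\
j = 2: &\quad -\frac{c_1 - 2c_2 + c_3}{h^2} + c_2 = x_2, \\
j = 3: &\quad -\frac{c_2 - 2c_3 + c_4}{h^2} + c_3 = x_3, \\
j = 4: &\quad c_4 = 3,
\end{aligned}
\]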

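$c_0 = 0$ and $c_4 = 3$ are known and do not need to be solved for. The unknowns $c_0$ and
$c_4$ can be eliminated from the above set of equations (bring the $c_0$ and $c_4$ terms in the
equations for the internal points to the right-hand side and substitute the values). Only
the equations for j = 1 and j = 3 contain $c_0$ and $c_4$ and are affected,
\[
\begin{aligned}
j = 1: &\quad -\frac{-2c_1 + c_2}{h^2} + c_1 = x_1 + \frac{c_0}{h^2} = x_1, \\
j = 2: &\quad -\frac{c_1 - 2c_2 + c_3}{h^2} + c_2 = x_2, \\
j = 3: &\quad -\frac{c_2 - 2c_3}{h^2} + c_3 = x_3 + \frac{c_4}{h^2} = x_3 + \frac{3}{h^2}.
\end{aligned}
\]
In matrix-vector form this becomes
\[
\begin{pmatrix}
1 + 2/h^2 & -1/h^2 & 0 \\
-1/h^2 & 1 + 2/h^2 & -1/h^2 \\
0 & -1/h^2 & 1 + 2/h^2
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 + 3/h^2 \end{pmatrix},
\]
or using h = 0.25, $x_1 = 0.25$, $x_2 = 0.5$, and $x_3 = 0.75$:
\[
\begin{pmatrix}
33 & -16 & 0 \\
-16 & 33 & -16 \\
0 & -16 & 33
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix}
=
\begin{pmatrix} 0.25 \\ 0.5 \\ 48.75 \end{pmatrix}.
\]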

A = [33 -16 0;-16 33 -16; 0 -16 33];

f = [0.25; 0.5; 48.75];

c=A\f

gives

c=

6.802295047529017e-01

1.387348353552860e+00

2.149926474449871e+00

Note that the array c only contains the solution at the internal grid points (since we

eliminated the boundary conditions). To incorporate the values $c_0$ and $c_4$ at the correct
positions in the array corresponding to the first and last grid point, you can do as follows.

First shift all values in the array c one position. Then incorporate the boundary condi-

tions at the first array element c(1) and the last array element c(5)

c(2:4) = c(1:3)

c(1) = 0

c(5) = 3

Remarks
• The size of the matrix-vector equations is reduced by 2, so it can be solved a little
faster.
• The matrix is tridiagonal, which is a result of the natural numbering. Instead of
the full matrix A we could store only the non-zero elements of A (to save memory)
and use an efficient solver for tridiagonal matrices (to save computing time). See
Sec. 4.9.
• By eliminating the boundary conditions from the linear system, we have created a
symmetric matrix. Solvers that require a symmetric matrix can now be used and
storing a matrix only requires storing two diagonals (the third diagonal elements
are then known because of the symmetry).

There are n + 1 equations with n + 1 unknowns, two equations from the boundary conditions
and (n − 1) on the internal grid points,
\[
c_0 = e^{-18},
\]
\[
-\frac{c_{j-1} - 2c_j + c_{j+1}}{h^2} + 8\,\frac{c_{j+1} - c_{j-1}}{2h} + 9c_j = 17 + 9x_j, \qquad j = 1, \ldots, n - 1,
\]
\[
c_n = 3.
\]

However, $c_0$ and $c_n$ are known from the boundary conditions. We can eliminate these
from the equations to obtain only n − 1 equations with n − 1 unknowns. Thus we will
eliminate the first and last row and the first and last column by eliminating $c_0$ and $c_n$ from
equations j = 1, . . . , n − 1 using the 2 rows corresponding to the boundary conditions. $c_0$
only appears in the finite difference formula of j = 1 and $c_n$ only in the finite difference
formula of j = n − 1. After substituting the values for $c_0$ and $c_n$ and bringing the known
terms to the right-hand side we obtain a new equation for j = 1:
\[
-\frac{-2c_1 + c_2}{h^2} + 8\,\frac{c_2}{2h} + 9c_1 = 17 + 9x_1 + 8\,\frac{c_0}{2h} + \frac{c_0}{h^2} = 17 + 9x_1 + \frac{4e^{-18}}{h} + \frac{e^{-18}}{h^2},
\]
and j = n − 1:
\[
-\frac{c_{n-2} - 2c_{n-1}}{h^2} - 8\,\frac{c_{n-2}}{2h} + 9c_{n-1} = 17 + 9x_{n-1} - 8\,\frac{c_n}{2h} + \frac{c_n}{h^2} = 17 + 9x_{n-1} - \frac{12}{h} + \frac{3}{h^2}.
\]
The equations for j = 2, . . . , n − 2 remain the same since they don't contain $c_0$ and $c_n$.
Thus we now have the (n − 1) × (n − 1) system of equations
\[
\begin{pmatrix}
9 + 2/h^2 & -1/h^2 + 4/h & 0 & \cdots & 0 \\
-1/h^2 - 4/h & 9 + 2/h^2 & -1/h^2 + 4/h & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -1/h^2 - 4/h & 9 + 2/h^2 & -1/h^2 + 4/h \\
0 & \cdots & 0 & -1/h^2 - 4/h & 9 + 2/h^2
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_{n-2} \\ c_{n-1} \end{pmatrix}
=
\begin{pmatrix}
17 + 9x_1 + 4e^{-18}/h + e^{-18}/h^2 \\
17 + 9x_2 \\
\vdots \\
17 + 9x_{n-2} \\
17 + 9x_{n-1} - 12/h + 3/h^2
\end{pmatrix}.
\]

To incorporate the values $c_0$ and $c_n$ at the correct positions in the array corresponding to
the first and last grid point, use

n = length(c) + 1;

c(2:n) = c(1:n-1)

c(1) = exp(-18)

c(n+1) = 3
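The reduced system itself might be set up as sketched below, storing only the three diagonals in a sparse matrix built with spdiags. The choice n = 40 and the names lo, di, up, and cInt are assumptions made for the illustration; the boundary values are appended in one step at the end instead of shifting the array.

```matlab
n = 40;  a = -1;  b = 1;
h = (b - a)/n;
x = a + (1:n-1)'*h;                          % internal grid points x_1, ..., x_{n-1}

lo = (-1/h^2 - 4/h)*ones(n-1, 1);            % sub-diagonal
di = ( 9 + 2/h^2)*ones(n-1, 1);              % diagonal
up = (-1/h^2 + 4/h)*ones(n-1, 1);            % super-diagonal
A  = spdiags([lo di up], -1:1, n-1, n-1);    % sparse tridiagonal matrix

f      = 17 + 9*x;
f(1)   = f(1)   + (4/h + 1/h^2)*exp(-18);    % known c_0 terms moved to the right-hand side
f(end) = f(end) - 12/h + 3/h^2;              % known c_n terms moved to the right-hand side

cInt = A \ f;                                % solution at the internal grid points
c    = [exp(-18); cInt; 3];                  % add the boundary values c_0 and c_n
```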

Remarks

• The size of the matrix-vector equations is reduced by 2, so it can be solved a little

faster.


• The matrix is tridiagonal, which is a result of the natural numbering. Instead of

the full matrix A we could store only the non-zero elements of A (to save memory)

and use an efficient solver for tridiagonal matrices (to save computing time). See

Sec. 4.9.

• The convection term makes the matrix A non-symmetric. Solvers that require
a symmetric matrix cannot be used. In the absence of convection, however, the
resulting matrix would be symmetric after eliminating the boundary conditions,
allowing the use of solvers for symmetric matrices.


4.7 Finite elements

In this section, we discuss the discretization in space of a second-order BVP using the finite

element method (FEM). The method is conceptually much harder than finite differences.

The advantage lies in the easy incorporation of Neumann and Robin type boundary

conditions and mesh refinement. This is particularly useful in two or three dimensions,

especially when there are curved boundaries or steep gradients in the solution. We only

consider equally spaced grids and use natural numbering. This simplifies the derivation.

Finite elements

The interval [a, b] is divided into N elements of equal length h = (b − a)/N that only
have endpoints in common. We number the elements $e_1, \ldots, e_N$ using natural numbering.
The grid points (nodes) are $x_j = a + jh$, j = 0, . . . , n, which are also numbered using natural
numbering. You can consider an element as a local "computational domain". An element
contains a certain number of grid points which we have not specified yet. We only consider
elements with at least two grid points: the endpoints of the element. Frequently
used elements are linear, quadratic, and cubic elements. A linear element just contains
the endpoints of the element. A quadratic element contains the endpoints and a point
halfway along the element. A cubic element contains the endpoints and two points at 1/3
and 2/3 of the element. We will mainly focus on linear finite elements for which n = N.
Fig. 4.4 shows the linear elements and positions of the nodes.

[Figure 4.4: elements e1 , . . . , eN with nodes a = x0 , x1 , . . . , xN = b]

Basis functions

On the interval [a, b], basis functions φj are defined with the following properties:

• φj (xj ) = 1

• φj (xi ) = 0 for i ≠ j

• φj is continuous and a piecewise polynomial (per element). For linear elements, φj

is a piecewise linear polynomial, for quadratic elements, φj is a piecewise quadratic

polynomial etc.

Fig. 4.5 shows the linear finite element basis functions φ0 , φj for 0 < j < N , and φN .

Outside the sketched region every basis function is exactly equal to zero.

[Figure 4.5: linear finite element basis functions φ0 , φj , and φN on the nodes x0 , . . . , xN ]

We can approximate a function using the FE basis functions and the nodal values

$$ c(x) = \sum_{j=0}^{N} c_j \phi_j(x). $$

This results in a piecewise polynomial approximation of the function c(x). Fig. 4.6 shows

a linear finite element approximation and how it is constructed using the basis functions

and nodal values cj , j = k − 2, . . . , k + 2.
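To make the expansion concrete, the following Matlab sketch evaluates the sum of the basis functions weighted by the nodal values and compares it with piecewise linear interpolation. The interval, the number of elements, and the nodal values (taken from x² purely for illustration) are assumptions of this example.

```matlab
% Sketch: evaluate c(x) = sum_j c_j*phi_j(x) with linear (hat) basis functions.
a = 0; b = 1; N = 4;             % illustrative interval and number of elements
h = (b - a)/N;
xn = a + (0:N)*h;                % nodes x_0, ..., x_N
cn = xn.^2;                      % example nodal values c_j (here taken from x^2)

% Hat function phi_j: 1 at x_j, 0 at all other nodes, linear per element.
phi = @(j, x) max(0, 1 - abs(x - xn(j+1))/h);

x = linspace(a, b, 201);         % evaluation points
c = zeros(size(x));
for j = 0:N
    c = c + cn(j+1)*phi(j, x);   % add the contribution c_j*phi_j(x)
end

% The sum is the piecewise linear interpolant of the nodal values.
cref = interp1(xn, cn, x);
fprintf('max difference with interp1: %.2e\n', max(abs(c - cref)));
```

At most two basis functions are non-zero at any point x, which is why only two terms of the sum contribute on each element.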

The easiest way to compute the coefficients of the matrix A and vector f of the linear

system is to compute the contributions to A and f element-by-element. The contributions

of element l are stored in an element matrix a(l) and element vector f (l) .

We first note that the integral over the entire interval [a, b] is just the sum of the

integrals over all elements

$$ \int_a^b (\,\cdot\,)\,dx = \sum_{l=1}^{N} \int_{e_l} (\,\cdot\,)\,dx. $$

On an element el = [xl−1 , xl ], only two basis functions are non-zero: φl−1 and φl . (See

Fig. 4.7). The non-zero contributions to A on this element can be put in an element

matrix a(l) and the non-zero contributions to f in an element vector f (l) .

After the element matrices and vectors have been computed, they need to be assembled

into (added to) the global matrix A and global vector f .


Figure 4.6: Finite element approximation. The solid black line is the sum of the colored

dashed lines. Red dashed line: ck−1 φk−1 , cyan dashed line: ck φk , green dashed line:

ck+1 φk+1 , blue dashed line: right part of ck−2 φk−2 , magenta dashed line: left part of

ck+2 φk+2 .


Figure 4.7: Two non-zero basis functions on element el .

Reference element

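To simplify the numerical calculations, elements el = [xl−1 , xl ] are mapped to the
reference element ê = [0, 1] using the transformation

$$ x = x_{l-1} + h\xi, \qquad 0 \le \xi \le 1. $$

We define two basis functions on the reference element (see Fig. 4.8),

$$ \hat\phi_1(\xi) = 1 - \xi, \qquad \hat\phi_2(\xi) = \xi, $$

whose derivatives are dφ̂1 /dξ = −1 and dφ̂2 /dξ = 1. The two non-zero basis functions on
element el are mapped to the reference basis functions by

$$ \phi_{l-1}(x) = \hat\phi_1(\xi), \qquad \phi_l(x) = \hat\phi_2(\xi). $$

Derivatives on the x and ξ domain are related via the chain rule. We have φ̂(ξ) = φ(x)
with x = x(ξ), thus

$$ \frac{d\hat\phi}{d\xi} = \frac{d\phi}{dx}\,\frac{dx}{d\xi}. $$

Since dx/dξ = h we get

$$ \frac{d\phi}{dx} = \frac{1}{h}\,\frac{d\hat\phi}{d\xi}. $$

Figure 4.8: Reference element ê = [0, 1] with linear basis functions φ̂1 (ξ) and φ̂2 (ξ).

Integrals over an element are transformed to the reference element in the same way, using
dx = h dξ. For example,

$$ \int_{e_l} p(x)\,\phi_{l-1}(x)\,dx = \int_0^1 p(x_{l-1} + h\xi)\,\hat\phi_1(\xi)\,h\,d\xi = h \int_0^1 p(x_{l-1} + h\xi)\,(1 - \xi)\,d\xi. $$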

For the finite element method the ODE is not discretized directly, but written in the weak

form first. The weak form of an ODE is obtained by multiplying the ODE by suitable test

functions, integrating the resulting equation over the whole domain [a, b], and applying an

integration by parts to reduce the order of the derivatives. When functions are sufficiently

smooth, the weak form and original ODE are equivalent, i.e. they have the same solution.

Boundary conditions are applied to the weak form. Dirichlet and Neumann/Robin

type boundary conditions are handled very differently in finite element methods.

Dirichlet boundary conditions

Dirichlet boundary conditions are handled in a similar way as for finite differences. The
row corresponding to that boundary condition is replaced by the Dirichlet boundary
condition.

Neumann/Robin boundary conditions

Neumann/Robin boundary conditions are not handled the same way as for finite differ-

ences. In the weak form, the derivative dc/dx at a boundary appears naturally. Sub-

stitution of the Neumann/Robin boundary condition into the weak form is all that is

needed.

There are several steps involved to obtain a linear system in the finite element method.

• Find the weak form: multiply the differential equation by a test function ψ(x),

integrate over the whole domain, and apply integration by parts.

• Choose suitable test functions ψ. We will choose the n + 1 basis functions φj . This
is called the Galerkin finite element method.

• Replace c(x) by its finite element approximation $c(x) = \sum_{j=0}^{n} c_j \phi_j(x)$.

• Evaluate integrals over each finite element using a reference element (analytically

or with some appropriate numerical technique).

• Assemble contributions of each element into the matrix A and right-hand-side vector

f.

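To reduce the amount of algebra a little, we first consider the discretization of the
following second-order linear ODE instead of Eq. (4.2),

$$ -c'' + c = 1, \qquad 0 < x < 1, $$

with the boundary conditions c(0) = 1 and c'(1) = 0 that are applied at the end of this
example, on a grid of four linear elements (five nodes, h = 1/4).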

Weak form

Multiply by test functions ψ(x), integrate over the whole domain [0, 1], and use integration
by parts for the second derivative term (so that we only have first derivatives):

$$ \int_0^1 \left[ \psi' c' + \psi c \right] dx - \psi c' \Big|_0^1 = \int_0^1 \psi \, dx. $$

Galerkin FEM

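Using the five basis functions φ0 , . . . , φ4 for ψ gives 5 equations with 5 unknowns,

$$ \int_0^1 \left[ \phi_i'(x)\,c' + \phi_i(x)\,c \right] dx - \phi_i(x)\,c' \Big|_0^1 = \int_0^1 \phi_i(x)\,dx $$

for 0 ≤ i ≤ 4. Each i corresponds to a row in the linear system: i = 0 corresponds to row
1, i = 1 corresponds to row 2, etc.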

FEM approximation

Next the finite element approximation $c(x) = \sum_{j=0}^{4} c_j \phi_j(x)$ is substituted in the terms
under the integral and the integral over the whole domain is written as a sum of integrals
over all elements,

$$ \sum_{l=1}^{4} \int_{e_l} \left[ \phi_i'(x) \sum_{j=0}^{4} c_j \phi_j'(x) + \phi_i(x) \sum_{j=0}^{4} c_j \phi_j(x) \right] dx = \phi_i(x)\,c' \Big|_0^1 + \sum_{l=1}^{4} \int_{e_l} \phi_i(x)\,dx. $$

We note for further reference that the term $\phi_i(x)c'|_0^1 = \phi_i(1)c'(1) - \phi_i(0)c'(0)$
doesn't affect the equations for i = 1, . . . , 3. The only non-zero basis function
at x = 1 is φ4 and the only non-zero basis function at x = 0 is φ0 . Thus the equation for
i = 0 has an extra term $-\phi_0(0)c'(0) = -c'(0)$ and the equation for i = 4 has
an extra term $\phi_4(1)c'(1) = c'(1)$.

Element matrix and vector

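The element matrix $a^{(l)}$ and element vector $f^{(l)}$ contain the contributions of the element
integrals. The term $\phi_i(x)c'|_0^1$ is related to the boundary conditions and will be discussed
when applying the boundary conditions. On an element el = [xl−1 , xl ], only two basis
functions are non-zero: φl−1 and φl . Thus only the equations for i = l − 1 and i = l give
non-zero contributions on element el in Eq. (4.7.2). Similarly, only the j = l − 1 and j = l
terms in the finite element approximation $\sum_{j=0}^{4}$ give a non-zero contribution to this
equation. The non-zero integrals can be put into a local matrix $a^{(l)}$, which multiplies the
local unknowns, and a local vector $f^{(l)}$,

$$ \begin{pmatrix} a_{11}^{(l)} & a_{12}^{(l)} \\ a_{21}^{(l)} & a_{22}^{(l)} \end{pmatrix} \begin{pmatrix} c_{l-1} \\ c_l \end{pmatrix}, \qquad \begin{pmatrix} f_1^{(l)} \\ f_2^{(l)} \end{pmatrix} $$

with

$$
\begin{aligned}
a_{11}^{(l)} &= \int_{e_l} \left[ \phi_{l-1}'\phi_{l-1}' + \phi_{l-1}\phi_{l-1} \right] dx, &
a_{12}^{(l)} &= \int_{e_l} \left[ \phi_{l-1}'\phi_{l}' + \phi_{l-1}\phi_{l} \right] dx, \\
a_{21}^{(l)} &= \int_{e_l} \left[ \phi_{l}'\phi_{l-1}' + \phi_{l}\phi_{l-1} \right] dx, &
a_{22}^{(l)} &= \int_{e_l} \left[ \phi_{l}'\phi_{l}' + \phi_{l}\phi_{l} \right] dx, \\
f_{1}^{(l)} &= \int_{e_l} \phi_{l-1}\,dx, &
f_{2}^{(l)} &= \int_{e_l} \phi_{l}\,dx.
\end{aligned}
$$

Row 1 in $a^{(l)}$ and $f^{(l)}$ corresponds to i = l − 1 and row 2 to i = l. Column 1 corresponds
to j = l − 1 and column 2 to j = l.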

Evaluating integrals using a reference element

By applying the transformation of variables discussed in 4.7.1, we can write all integrals
in terms of ξ. For the element matrix we get

$$
\begin{aligned}
a_{11}^{(l)} &= \frac{1}{h}\int_0^1 1\,d\xi + h\int_0^1 (1-\xi)^2\,d\xi = \frac{1}{h} + \frac{h}{3}, \\
a_{12}^{(l)} = a_{21}^{(l)} &= \frac{1}{h}\int_0^1 (-1)\,d\xi + h\int_0^1 \xi(1-\xi)\,d\xi = -\frac{1}{h} + \frac{h}{6}, \\
a_{22}^{(l)} &= \frac{1}{h}\int_0^1 1\,d\xi + h\int_0^1 \xi^2\,d\xi = \frac{1}{h} + \frac{h}{3}.
\end{aligned}
$$

For the element vector we get

$$
f_1^{(l)} = h\int_0^1 (1-\xi)\,d\xi = h/2, \qquad
f_2^{(l)} = h\int_0^1 \xi\,d\xi = h/2.
$$

Note that we have the same element matrix and element vector for every element in this

case (equally spaced grid points and constant coefficient ODE).

Assembling

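The element matrix and vector need to be assembled element-by-element into the global
matrix A and right-hand-side vector f . We start from the zero matrix and zero vector. We
first add the contributions of the first element, i = 0 and i = 1, corresponding to rows 1 and 2,

$$
A := \begin{pmatrix}
1/h + h/3 & -1/h + h/6 & 0 & 0 & 0 \\
-1/h + h/6 & 1/h + h/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
f := \begin{pmatrix} h/2 \\ h/2 \\ 0 \\ 0 \\ 0 \end{pmatrix}.
$$

Next we add the contributions of the second element, i = 1 and i = 2, corresponding to
rows 2 and 3,

$$
A := \begin{pmatrix}
1/h + h/3 & -1/h + h/6 & 0 & 0 & 0 \\
-1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 & 0 \\
0 & -1/h + h/6 & 1/h + h/3 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
f := \begin{pmatrix} h/2 \\ h \\ h/2 \\ 0 \\ 0 \end{pmatrix},
$$

then the contributions of the third element, i = 2 and i = 3, corresponding to rows 3 and 4,

$$
A := \begin{pmatrix}
1/h + h/3 & -1/h + h/6 & 0 & 0 & 0 \\
-1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 & 0 \\
0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 \\
0 & 0 & -1/h + h/6 & 1/h + h/3 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
f := \begin{pmatrix} h/2 \\ h \\ h \\ h/2 \\ 0 \end{pmatrix},
$$

and finally the contributions of the fourth element, i = 3 and i = 4, corresponding to
rows 4 and 5,

$$
A := \begin{pmatrix}
1/h + h/3 & -1/h + h/6 & 0 & 0 & 0 \\
-1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 & 0 \\
0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 \\
0 & 0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 \\
0 & 0 & 0 & -1/h + h/6 & 1/h + h/3
\end{pmatrix}, \qquad
f := \begin{pmatrix} h/2 \\ h \\ h \\ h \\ h/2 \end{pmatrix}.
$$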


Boundary conditions

To incorporate c(0) = 1 we replace the first row by $c_0 = 1$. To incorporate c'(1) = 0 we
add $+\phi_4(1)c'(1) = c'(1) = 0$ to the right-hand side of the last row, $f_4$.

$$
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
-1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 & 0 \\
0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 & 0 \\
0 & 0 & -1/h + h/6 & 2/h + 2h/3 & -1/h + h/6 \\
0 & 0 & 0 & -1/h + h/6 & 1/h + h/3
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
=
\begin{pmatrix} 1 \\ h \\ h \\ h \\ h/2 + 0 \end{pmatrix}
$$

This is a linear system that can be solved with any appropriate numerical technique.
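The assembly steps of this small example can be sketched in Matlab as follows. This is a minimal illustration under the assumptions of the example (four linear elements, h = 1/4); the variable names ae, fe, and rows are not from the course code.

```matlab
% Sketch: linear FE for -c'' + c = 1 on 0 < x < 1, c(0) = 1, c'(1) = 0.
N = 4;  h = 1/N;                      % four elements, five nodes
ae = [ 1/h + h/3, -1/h + h/6;         % element matrix (same for every element)
      -1/h + h/6,  1/h + h/3];
fe = [h/2; h/2];                      % element vector

A = zeros(N+1);  f = zeros(N+1,1);
for l = 1:N
    rows = [l, l+1];                  % element l contributes to rows l and l+1
    A(rows,rows) = A(rows,rows) + ae; % add to, do not overwrite, existing entries
    f(rows)      = f(rows) + fe;
end

A(1,:) = 0;  A(1,1) = 1;  f(1) = 1;   % Dirichlet: replace row 1 by c_0 = 1
f(N+1) = f(N+1) + 0;                  % Neumann: add c'(1) = 0 to the last row

c = A\f;
disp(c');
```

For this particular problem the exact solution is c(x) = 1, which satisfies the ODE and both boundary conditions. The rows of the assembled system sum to the corresponding right-hand-side entries, so the finite element solution reproduces this constant at every node.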

If we need to program the finite element method, we do not want to write a new code

if we use a different number of grid points. So we want to find a more general equation

Ac = f in terms of cj for n elements e1 , . . . en and n + 1 grid points x0 , . . . , xn .

We consider again the pollution equation

$$ -\frac{d^2 c}{dx^2} + 8\frac{dc}{dx} + 9c = 17 + 9x, \qquad -1 < x < 1, $$

$$ c(x = -1) = e^{-18}, \qquad c'(x = 1) = 10. $$

To point out the different treatment of Dirichlet and Neumann boundary conditions in
finite elements, we replaced the boundary condition c(x = 1) = 3 by the equivalent
Neumann boundary condition c'(x = 1) = 10 (obtained from the analytical solution).

Weak form

The weak form is obtained by multiplying the ODE by suitable test functions ψ and

integrating over the whole domain, here [−1, 1],

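$$ \int_{-1}^{1} \psi \left[ -c'' + 8c' + 9c \right] dx = \int_{-1}^{1} \psi \left[ 17 + 9x \right] dx $$

for all suitable test functions ψ. Now we can use integration by parts for the c'' term to
reduce the order of the highest derivative,

$$ \int_{-1}^{1} \left[ \psi' c' + \psi \left[ 8c' + 9c \right] \right] dx - \psi c' \Big|_{-1}^{1} = \int_{-1}^{1} \psi \left[ 17 + 9x \right] dx. $$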

Galerkin FEM

Using the n + 1 basis functions φ0 , . . . , φn for ψ gives n + 1 equations with n + 1 unknowns

$$ \int_{-1}^{1} \left[ \phi_i'(x)\,c' + \phi_i(x) \left[ 8c' + 9c \right] \right] dx - \phi_i(x)\,c' \Big|_{-1}^{1} = \int_{-1}^{1} \phi_i(x) \left[ 17 + 9x \right] dx $$

for 0 ≤ i ≤ n. Each i corresponds to a row in the linear system: i = 0 corresponds to row
1, i = 1 corresponds to row 2, etc., until i = n, which corresponds to row n + 1.

FEM approximation

Next the finite element approximation $c(x) = \sum_{j=0}^{n} c_j \phi_j(x)$ is substituted in the terms
under the integral and the integral over the whole domain is written as a sum of integrals
over all elements,

$$ \sum_{l=1}^{n} \int_{e_l} \left[ \phi_i' \sum_{j=0}^{n} c_j \phi_j' + \phi_i \left( 8 \sum_{j=0}^{n} c_j \phi_j' + 9 \sum_{j=0}^{n} c_j \phi_j \right) \right] dx = \phi_i\,c' \Big|_{-1}^{1} + \sum_{l=1}^{n} \int_{e_l} \phi_i\,(17 + 9x)\,dx. $$

We note for further reference that the term $\phi_i(x)c'|_{-1}^{1} = \phi_i(1)c'(1) - \phi_i(-1)c'(-1)$
doesn't affect the equations for i = 1, . . . , n − 1. The only non-zero basis
function at x = 1 is φn and the only non-zero basis function at x = −1 is φ0 . Thus the
equation for i = 0 has an extra term $-\phi_0(-1)c'(-1) = -c'(-1)$ and the
equation for i = n has an extra term $\phi_n(1)c'(1) = c'(1)$.

Element matrix and vector

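The element matrix $a^{(l)}$ and element vector $f^{(l)}$ contain the contributions of the element
integrals. The term $\phi_i(x)c'|_{-1}^{1}$ is related to the boundary conditions and will be discussed
when applying the boundary conditions. On an element el = [xl−1 , xl ], only two basis
functions are non-zero: φl−1 and φl . Thus only the equations for i = l − 1 and i = l give
non-zero contributions on element el in Eq. (4.7.2). Similarly, only the j = l − 1 and j = l
terms in the finite element approximation $\sum_{j=0}^{n}$ give a non-zero contribution to this
equation.

The non-zero integrals can be put into a local matrix $a^{(l)}$, which multiplies the local
unknowns, and a local vector $f^{(l)}$,

$$ \begin{pmatrix} a_{11}^{(l)} & a_{12}^{(l)} \\ a_{21}^{(l)} & a_{22}^{(l)} \end{pmatrix} \begin{pmatrix} c_{l-1} \\ c_l \end{pmatrix}, \qquad \begin{pmatrix} f_1^{(l)} \\ f_2^{(l)} \end{pmatrix} $$

with

$$
\begin{aligned}
a_{11}^{(l)} &= \int_{e_l} \left[ \phi_{l-1}'\phi_{l-1}' + \phi_{l-1}\left[ 8\phi_{l-1}' + 9\phi_{l-1} \right] \right] dx, \\
a_{12}^{(l)} &= \int_{e_l} \left[ \phi_{l-1}'\phi_{l}' + \phi_{l-1}\left[ 8\phi_{l}' + 9\phi_{l} \right] \right] dx, \\
a_{21}^{(l)} &= \int_{e_l} \left[ \phi_{l}'\phi_{l-1}' + \phi_{l}\left[ 8\phi_{l-1}' + 9\phi_{l-1} \right] \right] dx, \\
a_{22}^{(l)} &= \int_{e_l} \left[ \phi_{l}'\phi_{l}' + \phi_{l}\left[ 8\phi_{l}' + 9\phi_{l} \right] \right] dx, \\
f_{1}^{(l)} &= \int_{e_l} \phi_{l-1}\,(17 + 9x)\,dx, \\
f_{2}^{(l)} &= \int_{e_l} \phi_{l}\,(17 + 9x)\,dx.
\end{aligned}
$$

Row 1 in $a^{(l)}$ and $f^{(l)}$ corresponds to i = l − 1 and row 2 to i = l. Column 1 corresponds
to j = l − 1 and column 2 to j = l.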

Evaluating integrals using a reference element

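By applying the transformation of variables discussed in 4.7.1, we can write all integrals
in terms of ξ. For the element matrix we get

$$
\begin{aligned}
a_{11}^{(l)} &= \int_0^1 \left[ \frac{1}{h^2} - \frac{8}{h}(1-\xi) + 9(1-\xi)^2 \right] h\,d\xi = \frac{1}{h} - 4 + 3h, \\
a_{12}^{(l)} &= \int_0^1 \left[ -\frac{1}{h^2} + \frac{8}{h}(1-\xi) + 9\xi(1-\xi) \right] h\,d\xi = -\frac{1}{h} + 4 + \frac{3h}{2}, \\
a_{21}^{(l)} &= \int_0^1 \left[ -\frac{1}{h^2} - \frac{8}{h}\xi + 9\xi(1-\xi) \right] h\,d\xi = -\frac{1}{h} - 4 + \frac{3h}{2}, \\
a_{22}^{(l)} &= \int_0^1 \left[ \frac{1}{h^2} + \frac{8}{h}\xi + 9\xi^2 \right] h\,d\xi = \frac{1}{h} + 4 + 3h.
\end{aligned}
$$

For the element vector we get

$$
\begin{aligned}
f_1^{(l)} &= h\int_0^1 (1-\xi)\left[ 17 + 9(x_{l-1} + h\xi) \right] d\xi = \frac{h}{2}(17 + 9x_{l-1}) + \frac{3h^2}{2}, \\
f_2^{(l)} &= h\int_0^1 \xi\left[ 17 + 9(x_{l-1} + h\xi) \right] d\xi = \frac{h}{2}(17 + 9x_{l-1}) + 3h^2.
\end{aligned}
$$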

Note that, because of the dependence on x of the right-hand side, the element vector is

different for every element.

Assembling

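The element matrices and vectors need to be assembled element-by-element into the global
matrix A and right-hand-side vector f . We start from the zero matrix and zero vector. We
first add the contributions of the first element, i = 0 and i = 1, corresponding to rows 1
and 2. At element 1, we have $x_{l-1} = x_0$,

$$
A := \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & 0 & \cdots & 0 \\
a_{21}^{(1)} & a_{22}^{(1)} & 0 & \cdots & 0 \\
0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 0
\end{pmatrix}, \qquad
f := \begin{pmatrix} f_1^{(1)} \\ f_2^{(1)} \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
$$

Next we add the contributions of the second element, i = 1 and i = 2, corresponding to
rows 2 and 3,

$$
A := \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & 0 & 0 & \cdots & 0 \\
a_{21}^{(1)} & a_{22}^{(1)} + a_{11}^{(2)} & a_{12}^{(2)} & 0 & \cdots & 0 \\
0 & a_{21}^{(2)} & a_{22}^{(2)} & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 0
\end{pmatrix}, \qquad
f := \begin{pmatrix} f_1^{(1)} \\ f_2^{(1)} + f_1^{(2)} \\ f_2^{(2)} \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
$$

And so on until the contributions of the nth element, i = n − 1 and i = n, corresponding
to rows n and n + 1,

$$
A := \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & 0 & \cdots & 0 \\
a_{21}^{(1)} & a_{22}^{(1)} + a_{11}^{(2)} & a_{12}^{(2)} & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & a_{21}^{(n-1)} & a_{22}^{(n-1)} + a_{11}^{(n)} & a_{12}^{(n)} \\
0 & \cdots & 0 & a_{21}^{(n)} & a_{22}^{(n)}
\end{pmatrix}, \qquad
f := \begin{pmatrix} f_1^{(1)} \\ f_2^{(1)} + f_1^{(2)} \\ \vdots \\ f_2^{(n-1)} + f_1^{(n)} \\ f_2^{(n)} \end{pmatrix}.
$$

Substituting the element matrices and vectors gives

$$
A := \begin{pmatrix}
\frac{1}{h} - 4 + 3h & -\frac{1}{h} + 4 + \frac{3h}{2} & 0 & \cdots & 0 \\
-\frac{1}{h} - 4 + \frac{3h}{2} & \frac{2}{h} + 6h & -\frac{1}{h} + 4 + \frac{3h}{2} & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -\frac{1}{h} - 4 + \frac{3h}{2} & \frac{2}{h} + 6h & -\frac{1}{h} + 4 + \frac{3h}{2} \\
0 & \cdots & 0 & -\frac{1}{h} - 4 + \frac{3h}{2} & \frac{1}{h} + 4 + 3h
\end{pmatrix},
$$

$$
f := \begin{pmatrix}
(17 + 9x_0)\frac{h}{2} + \frac{3h^2}{2} \\
(34 + 9x_0 + 9x_1)\frac{h}{2} + \frac{9h^2}{2} \\
\vdots \\
(34 + 9x_{n-2} + 9x_{n-1})\frac{h}{2} + \frac{9h^2}{2} \\
(17 + 9x_{n-1})\frac{h}{2} + 3h^2
\end{pmatrix}.
$$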

Boundary conditions

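To incorporate the Dirichlet boundary condition $c(-1) = e^{-18}$ we replace the first row by
$c_0 = e^{-18}$. To incorporate c'(1) = 10 we add c'(1) = 10 to the right-hand side of the last row,
$f_n$,

$$
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
a_{21}^{(1)} & a_{22}^{(1)} + a_{11}^{(2)} & a_{12}^{(2)} & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & a_{21}^{(n-1)} & a_{22}^{(n-1)} + a_{11}^{(n)} & a_{12}^{(n)} \\
0 & \cdots & 0 & a_{21}^{(n)} & a_{22}^{(n)}
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \\ c_n \end{pmatrix}
=
\begin{pmatrix} e^{-18} \\ f_2^{(1)} + f_1^{(2)} \\ \vdots \\ f_2^{(n-1)} + f_1^{(n)} \\ f_2^{(n)} + 10 \end{pmatrix}
$$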

This is a linear system that can be solved with any appropriate numerical technique.
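The general assembly and the treatment of both boundary conditions can be sketched in Matlab as follows. This is a minimal illustration using the analytic element integrals derived above, not the course code; the names ae, fe, and rows and the choice n = 40 are illustrative. The exact solution c(x) = e^{9(x−1)} + x + 1 used for the comparison is an assumption of this sketch; substitution shows that it satisfies the ODE, c(−1) = e^{−18}, and c'(1) = 10.

```matlab
% Sketch: linear FE for -c'' + 8c' + 9c = 17 + 9x on -1 < x < 1,
% c(-1) = exp(-18), c'(1) = 10, using the analytic element integrals.
n = 40;  h = 2/n;                        % number of elements (illustrative)
x = linspace(-1, 1, n+1)';               % nodes x_0, ..., x_n

ae = [ 1/h - 4 + 3*h,    -1/h + 4 + 3*h/2;   % element matrix a^(l)
      -1/h - 4 + 3*h/2,   1/h + 4 + 3*h  ];

A = zeros(n+1);  f = zeros(n+1,1);
for l = 1:n
    xl = x(l);                           % left node x_{l-1} of element l
    fe = [h/2*(17 + 9*xl) + 3*h^2/2;     % element vector f^(l)
          h/2*(17 + 9*xl) + 3*h^2];
    rows = [l, l+1];                     % element l contributes to rows l and l+1
    A(rows,rows) = A(rows,rows) + ae;
    f(rows)      = f(rows) + fe;
end

A(1,:) = 0;  A(1,1) = 1;  f(1) = exp(-18);  % Dirichlet at x = -1: replace row 1
f(n+1) = f(n+1) + 10;                        % Neumann at x = 1: add c'(1) = 10

c = A\f;
cexact = exp(9*(x-1)) + x + 1;           % assumed exact solution (see text above)
err = max(abs(c - cexact));
fprintf('h = %g, max error = %.2e\n', h, err);
```

Changing n requires no other change to the script, which is the point of deriving the system for general n.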

For a Robin boundary condition at x = 1, c' = αc + β, the right-hand side of the
boundary condition also contains c, so that the last row of the matrix A would change
as well. The way natural boundary conditions are handled in 2D and 3D, including
curved boundaries, is similar (just substitution), but there are of course more nodes on
the boundary. This makes boundary conditions much easier to handle in the finite element
method than in the finite difference method.
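As a brief worked illustration of the Robin case (a sketch that follows the derivation of the Neumann case, with α and β the constants of the Robin condition): since c(1) = $c_n$, the boundary term in the equation for i = n becomes $\phi_n(1)c'(1) = \alpha c_n + \beta$. Moving the part that contains the unknown $c_n$ to the left-hand side changes only the last row,

$$ A_{n+1,n+1} = a_{22}^{(n)} - \alpha, \qquad f_{n+1} = f_2^{(n)} + \beta. $$

For α = 0 and β = 10 this reduces to the Neumann condition c'(1) = 10 treated above.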


4.7.4 Numerical computation of the integrals

For an ODE c'' + p(x)c' + q(x)c = r(x) with relatively simple functions p(x), q(x), and
r(x) the integrals in the element matrices and vectors can be computed analytically. For
more complicated functions, the integrals can only be approximated numerically.

An integral can be approximated by a weighted sum of function values (numerical
quadrature) in integration points $x_i$,

$$ \int_a^b f(x)\,dx \approx \sum_{i=1}^{n_{int}} w_i f(x_i), $$

where $n_{int}$ is the number of integration points and $w_i$ the weight for each integration point.

We only discuss two closed Newton-Cotes formulas, i.e. formulas that contain the
endpoints of the interval and in which additional points are chosen so that the integration
points are equally spaced over the interval.

The trapezoidal rule only uses the two endpoints of the interval, i.e. ξ1 = 0 and ξ2 = 1
on the reference element 0 ≤ ξ ≤ 1,

$$ \int_0^1 g(\xi)\,d\xi \approx \frac{1}{2}\left[ g(0) + g(1) \right] = \sum_{i=1}^{n_{int}} w_i g(\xi_i), $$

with $n_{int} = 2$ and weights $w_1 = w_2 = 1/2$. The trapezoidal rule is exact for
polynomials up to degree 1.

Simpson’s rule uses the two endpoints of the interval and the midpoint of the interval,
i.e. ξ1 = 0, ξ2 = 1/2, and ξ3 = 1 on the reference element 0 ≤ ξ ≤ 1,

$$ \int_0^1 g(\xi)\,d\xi \approx \frac{1}{6}\left[ g(0) + 4g(1/2) + g(1) \right] = \sum_{i=1}^{n_{int}} w_i g(\xi_i), $$

with $n_{int} = 3$ and weights $w_1 = 1/6$, $w_2 = 4/6$, $w_3 = 1/6$. Simpson’s rule is
exact for polynomials up to degree 3.

Alternatively, Gauss integration can be used (details are discussed in Math 4446).
Gauss integration is a little more complicated but more efficient: fewer function evaluations
are necessary than for a Newton–Cotes formula with the same order of error.
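The two Newton–Cotes rules on the reference element can be written as a single weighted sum. The following Matlab sketch illustrates this; the helper name quad_ref and the test polynomials g1 and g3 are illustrative choices, not part of the course code.

```matlab
% Sketch: trapezoidal and Simpson's rule on the reference element [0, 1].
xi_trap = [0, 1];        w_trap = [1/2, 1/2];       % points and weights
xi_simp = [0, 1/2, 1];   w_simp = [1/6, 4/6, 1/6];

quad_ref = @(g, xi, w) sum(w .* g(xi));   % weighted sum of function values

g1 = @(xi) 2*xi + 1;     % degree 1, exact integral over [0, 1] is 2
g3 = @(xi) 4*xi.^3;      % degree 3, exact integral over [0, 1] is 1

fprintf('trapezoidal, degree 1: %g\n', quad_ref(g1, xi_trap, w_trap));
fprintf('trapezoidal, degree 3: %g\n', quad_ref(g3, xi_trap, w_trap));
fprintf('Simpson,     degree 3: %g\n', quad_ref(g3, xi_simp, w_simp));
```

The trapezoidal rule reproduces the integral of the linear function but not that of the cubic, while Simpson's rule integrates the cubic exactly.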

There are various ways to program finite elements in a well-structured way. Here, I give
one possibility to program the second order ODE c'' + p(x)c' + q(x)c = r(x) using proper
boundary conditions.

The finite element method can be programmed using the same overall structure as the finite
difference method. We will only discuss the differences. In the function that sets up the
matrix A and right-hand-side vector f , contributions are added element-by-element. We
name this function fematvec. Instead of a for loop over all grid points, we now have a for
loop over all elements. Since row i + 1, corresponding to test function φi , has contributions
from two different elements, we need to add the local matrix a(l) and vector f (l) to the
matrix A and vector f (for FD you can just set the elements of a row). Inside the
for loop the element matrix and vector need to be computed and assembled into the
global matrix A and vector f . If you distinguish, for example, different quadrature rules
and element orders, it is more convenient to compute the element matrix and element
vector in a separate function that we name fematvecelm. To keep fematvecelm general, a
function ode_param is called to evaluate the functions p(x), q(x), and r(x) at a point x.

Boundary conditions would typically be incorporated in a separate function febc, since
this is independent of how you handle the element contributions. To keep this function
general, you would use another function to specify the type and value of the boundary
condition used at each end point. Implementation of a Dirichlet boundary condition in
a certain row requires that all existing values in that row of A and f are replaced by
the boundary condition. Implementation of a Neumann/Robin boundary condition in a
certain row requires that boundary conditions are added to the existing row.
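What the body of an element routine such as fematvecelm might look like is sketched below for a single linear element, using Simpson's rule. This is an illustration under stated assumptions, not the course implementation: the anonymous functions p, q, r play the role of ode_param, and the element data xl and h are arbitrary. The ODE is taken in the form c'' + p(x)c' + q(x)c = r(x), so after integration by parts the second-derivative term contributes −φi'φj'.

```matlab
% Sketch of the body of an element routine such as fematvecelm: element
% matrix and vector for c'' + p(x)c' + q(x)c = r(x) with linear elements,
% integrated with Simpson's rule on the reference element.
p = @(x) -8 + 0*x;   q = @(x) -9 + 0*x;   r = @(x) -17 - 9*x;  % role of ode_param
xl = -1;  h = 0.1;                     % left node and length of the element

xi = [0, 1/2, 1];  w = [1/6, 4/6, 1/6];   % Simpson points and weights
phi  = [1 - xi; xi];                   % reference basis functions at the points
dphi = [-1, -1, -1; 1, 1, 1]/h;        % their derivatives with respect to x
x = xl + h*xi;                         % integration points in the element

ae = zeros(2);  fe = zeros(2,1);
for i = 1:2
    for j = 1:2
        % integrand: -phi_i'*phi_j' + phi_i*(p*phi_j' + q*phi_j)
        g = -dphi(i,:).*dphi(j,:) + phi(i,:).*(p(x).*dphi(j,:) + q(x).*phi(j,:));
        ae(i,j) = h*sum(w.*g);         % dx = h*dxi
    end
    fe(i) = h*sum(w.*phi(i,:).*r(x));
end

% Compare with the analytic element matrix of the pollution equation
% (multiplied by -1, because that ODE was written as -c'' + 8c' + 9c = 17 + 9x).
ae_exact = -[ 1/h - 4 + 3*h,    -1/h + 4 + 3*h/2;
             -1/h - 4 + 3*h/2,   1/h + 4 + 3*h  ];
fe_exact = -[h/2*(17 + 9*xl) + 3*h^2/2;  h/2*(17 + 9*xl) + 3*h^2];
dev = max([abs(ae(:) - ae_exact(:)); abs(fe - fe_exact)]);
fprintf('max deviation from analytic values: %.2e\n', dev);
```

For constant p and q and linear r the integrands are at most quadratic in ξ, so Simpson's rule reproduces the analytic values up to round-off. For general coefficient functions the quadrature is only an approximation.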


4.8 Convergence of numerical methods for BVPs

We expect that we get a better approximation to the solution of a BVP if we decrease the

grid size h. The order of convergence tells how fast the numerical solution approximates

the exact solution. Since we solve a system of equations, we use error norms to compute

the error (see Sec. 3.3). In this section we measure the actual error using the infinity
norm $e_\infty = \|c(x_i) - c_i\|_\infty$, i.e. we compute at every grid point the absolute difference with
the exact solution and take the maximum.

To obtain the order of convergence we need to determine how fast the error approaches
zero if h → 0. Thus we need to determine the value p in $e = O(h^p) \approx Ch^p$. For this we
need to do numerical computations on several grids, and record the error with respect to the
exact solution (or, if this is not available, a very accurate numerical solution). To determine p
we take the logarithm of $e = Ch^p$,

$$ \log_{10} e = \log_{10} C + p \log_{10} h. $$

Thus p is the slope on a log10 h vs. log10 e plot. From the plot you can determine suitable

values to determine p: h should not be too large (the order of convergence is for h → 0)

and not too small (round-off errors). Note that when h is small, you subtract nearly equal

numbers in the numerator of a FD formula, since the values of yi−1 , yi , and yi+1 only differ

slightly. In addition, you divide by a small number of O(h) for an FD formula of a first

derivative or of O(h2 ) for an FD formula of a second derivative. This limits the smallest

grid size h that you can use in your calculations. For too small values of h, the error e∞

will not be dominated by the error due to the finite difference approximation but by the

error caused by round-off errors due to finite precision calculations and e∞ will start to

increase. This is not a real problem in Matlab, since you will run out of memory before

round-off errors become important.

To determine the order of convergence numerically, we consider the problem

$$ \frac{d^2 c}{dx^2} - 8\frac{dc}{dx} - 9c = -17 - 9x, \qquad -1 < x < 1, $$

$$ c(x = -1) = e^{-18}, \qquad c(x = 1) = 3. $$

We assume that the solution c(x) is sufficiently smooth, i.e. all derivatives exist and are
continuous. Then for finite difference schemes, the actual error e∞ is of the order of the
sum of the discretization errors in each finite difference formula used to approximate a
derivative of a certain order. Since $O(h^{p-1}) + O(h^p) = O(h^{p-1})$, the order of the actual
error is governed by the lowest order of the error term in one of the finite difference
formulas used.

We consider the case where both c' and c'' are approximated by centered finite
difference formulas, which are both $O(h^2)$. We therefore expect that the numerical error
e∞ we find is $O(h^2) + O(h^2) = O(h^2)$. Similarly, if O(h) difference formulas are used in a
finite difference method, we would find a numerical error e∞ of O(h). If $O(h^3)$ difference
formulas are used for all derivatives, we would find a numerical error e∞ of $O(h^3)$, and so
on.

We start with a medium grid size of h = 0.1 to compute the numerical solution. The next
meshes are obtained by dividing the grid size by a factor of two (i.e. doubling the number
of intervals). Fig. 4.9 shows e∞ for the various grid sizes considered. We indeed observe a
slope p ≈ 2 down to h ≈ 10^−4.5 . For smaller values of h, e∞ starts to increase. This is caused
by round-off errors, which are then the dominant contribution to the error e∞ .

Figure 4.9: Error plot (log10 e∞ versus log10 h) for O(h^2) centered finite difference scheme.

Table 4.1 shows the numerical values of e∞ for the various grid sizes considered. If the step
size h is halved for an O(h^2) method, the error on the refined mesh is O((h/2)^2) = 1/4 O(h^2),
i.e. 1/4 of the error using mesh size h. This is what we observe in Table 4.1 up to
h = 1/81920. For smaller values of h, the round-off error becomes important and the error
starts to grow rapidly: an increase by a factor of almost 10 instead of a decrease
by a factor of 4.

Table 4.1: e∞ for the O(h^2) centered finite difference scheme (ratio: error divided by the
error on the previous, coarser grid).

h           e∞         ratio
1/10        1.87e-2    -
1/20        4.41e-3    0.235
1/40        1.09e-3    0.246
1/80        2.72e-4    0.250
1/160       6.79e-5    0.250
1/320       1.70e-5    0.250
1/640       4.24e-6    0.250
1/1280      1.06e-6    0.250
1/2560      2.65e-7    0.250
1/5120      6.63e-8    0.250
1/10240     1.66e-8    0.250
1/20480     4.15e-9    0.250
1/40960     1.07e-9    0.251
1/81920     1.07e-9    0.258
1/163840    1.07e-8    9.993
1/327680    9.89e-8    9.237

We assume that the solution c(x) is sufficiently smooth, i.e. all derivatives exist and
are continuous. Then for finite elements of degree p, the order of the actual error e∞
is O(h^(p+1)). Thus for linear elements we expect e∞ = O(h^2), for quadratic elements
e∞ = O(h^3), for cubic elements e∞ = O(h^4), etc.

We consider the same grids as for the FD calculations. Fig. 4.10 shows e∞ using linear
finite elements for the various grid sizes considered. We indeed observe a slope of
approximately 2, which is one order higher than the degree of the element. When round-off
errors become important, the error would start to increase, just as for finite differences.
The quadratic convergence can be observed more accurately from Table 4.2, which shows the
numerical values of e∞ for the various grid sizes considered. If the step size h is halved,
the error on the refined mesh is 1/4 of the error using mesh size h, typical for quadratic
convergence.

[Figure 4.10: log10 e∞ versus log10 h for linear finite elements]

Table 4.2: e∞ for linear finite elements (ratio: error divided by the error on the previous,
coarser grid).

h           e∞         ratio
1/10        2.47e-2    -
1/20        5.76e-3    0.233
1/40        1.42e-3    0.246
1/80        3.54e-4    0.250
1/160       8.84e-5    0.250
1/320       2.21e-5    0.250
1/640       5.52e-6    0.250
1/1280      1.38e-6    0.250
1/2560      3.45e-7    0.250
1/5120      8.62e-8    0.250
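The order of convergence can also be estimated from tabulated errors rather than read off a plot. The Matlab sketch below uses the relation log10 e = log10 C + p log10 h that follows from e = C h^p, with the first six rows of Table 4.2 as data; the variable names coef and p_pair are illustrative.

```matlab
% Sketch: estimate the order of convergence p from e = C*h^p.
% Data: the first rows of Table 4.2 (linear finite elements).
h = 1 ./ [10 20 40 80 160 320];
e = [2.47e-2 5.76e-3 1.42e-3 3.54e-4 8.84e-5 2.21e-5];

% log10(e) = log10(C) + p*log10(h): fit a straight line, the slope is p.
coef = polyfit(log10(h), log10(e), 1);
p = coef(1);

% Alternative: observed order from each pair of successive grids (h halved).
p_pair = log(e(1:end-1)./e(2:end)) / log(2);

fprintf('fitted order p = %.3f\n', p);
disp(p_pair);
```

The pairwise estimate corresponds to the ratio column of the tables: a ratio of 1/4 when h is halved gives an order of 2. Rows affected by round-off should be left out of such a fit.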


4.9 Solving linear systems for BVPs: Crout’s method

Typically, the number of grid points needed to obtain a decent numerical approximation is
relatively small for boundary value problems and the linear system can still be solved
efficiently using a direct method. To obtain more accurate numerical approximations one
could use a finer grid (typically non-uniform) or higher-order finite difference or finite
element schemes. For the FD scheme using O(h^2) centered finite difference approximations
and for the linear finite elements we considered, a tridiagonal matrix was obtained. For this
type of matrix a very efficient direct solver is available: Crout’s method, which uses only
O(n) operations to solve a (tridiagonal) linear system. Higher-order finite differences and
higher-order finite elements lead to more than 3 non-zero diagonals, and solving the
linear system is computationally more expensive. Using the basic Gaussian elimination
technique would take O(n^3) operations. The most computationally efficient approach is
therefore usually to use the O(h^2) schemes in combination with grid refinement. Note
that this argument only holds for BVPs where only one spatial dimension is involved.
For O(h^2) methods for PDEs in two or three dimensions the resulting linear system is
no longer tridiagonal. Since Crout’s method only requires O(n) operations in total, there is
no advantage in using iterative techniques to solve the linear system: iterative techniques
use O(n) operations per iteration.

Background

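A tridiagonal N × N matrix (which arises in O(h^2) finite difference methods and linear
finite elements for BVPs)

$$
A = \begin{pmatrix}
a_{11} & a_{12} & 0 & \cdots & 0 \\
a_{21} & a_{22} & a_{23} & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & \ddots & \ddots & a_{N-1,N} \\
0 & \cdots & 0 & a_{N,N-1} & a_{NN}
\end{pmatrix}
$$

has a Crout factorization A = LU of the form

$$
L = \begin{pmatrix}
l_{11} & 0 & \cdots & \cdots & 0 \\
l_{21} & l_{22} & \ddots & & \vdots \\
0 & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & l_{N,N-1} & l_{NN}
\end{pmatrix}, \qquad
U = \begin{pmatrix}
1 & u_{12} & 0 & \cdots & 0 \\
0 & 1 & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
\vdots & & \ddots & \ddots & u_{N-1,N} \\
0 & \cdots & \cdots & 0 & 1
\end{pmatrix}.
$$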

This is easy to check: just take the matrix product LU and start comparing coefficients

from the top to the bottom row. If we have the LU factorization of the matrix A, we can

solve Ax = b very fast. Since A = LU we need to solve

LU x = b.


The matrix vector product U x is a vector, say y. Thus we can first solve the lower

triangular system

Ly = b

using forward substitution (first y1 , then y2 using the already computed value of y1 etc.)

to find the intermediate vector y, and then solve

Ux = y

using backward substitution (first xN , then xN −1 etc.) to find the solution x of the system

U x = y which is the solution of LU x = b.

The non-zero entries in L and U can be calculated using the Crout factorization

algorithm for tridiagonal systems and the system LU x = b can be solved using the Crout

forward/backward substitution algorithm.

• The solution is subject to round off errors only.

• Computing a Crout factorization and solving a system is very fast. The Crout
factorization requires only about 3N operations, i.e. O(N ). The forward and backward
substitutions are also only O(N ). This is very cheap compared to Gaussian elimination
with backward substitution (O(N^3) operations) and faster than any iterative technique
can be (O(N ) operations per iteration).

Algorithm for Crout factorization for tridiagonal systems

Input: tridiagonal matrix A
Output: Crout LU factorization of tridiagonal matrix A

First row of L and U
    Set l_{11} = a_{11}
    Set u_{12} = a_{12} / l_{11}
Row 2 to N − 1 of L and U
    Do for i = 2, . . . , N − 1
        Set l_{i,i−1} = a_{i,i−1}
        Set l_{ii} = a_{ii} − l_{i,i−1} u_{i−1,i}
        Set u_{i,i+1} = a_{i,i+1} / l_{ii}
    End Do (i-loop)
Last row of L
    Set l_{N,N−1} = a_{N,N−1}
    Set l_{NN} = a_{NN} − l_{N,N−1} u_{N−1,N}

Algorithm for Crout forward/backward substitution for tridiagonal systems

Input: L and U of the Crout factorization, right-hand-side vector b
Output: Solution vector x of Ax = b

First solve Ly = b using forward substitution
    y_1 = b_1 / l_{11}
    Do for i = 2, . . . , N
        y_i = (b_i − l_{i,i−1} ∗ y_{i−1}) / l_{ii}
    End Do (i-loop)
Now solve U x = y using backward substitution
    x_N = y_N
    Do for i = N − 1, . . . , 1
        x_i = y_i − u_{i,i+1} ∗ x_{i+1}
    End Do (i-loop)


Remarks on Crout factorization algorithm:

• The factorization cannot be performed when lii = 0 for some i. In a computer code

you might want to add a test and error message for this to make the code more

robust. For matrices arising from 1D finite difference and finite element methods

you typically have lii > 0.

• It is not necessary to define a full matrix for A, L, and U . Storing all zeros is a

waste of memory. Only storing the unknown entries in L and U as arrays and using

the proper array elements in the forward and backward substitution is sufficient.
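The two algorithms and the remarks above (test for a zero l_ii, store only the diagonals) can be combined in a short Matlab sketch. The array names sub, dia, sup, l, u and the test matrix are illustrative assumptions, not the course code.

```matlab
% Sketch: Crout factorization and solve for a tridiagonal system, storing
% only the three diagonals (array names are illustrative).
N = 6;
sub = -1*ones(N,1);  sub(1) = 0;      % a_{i,i-1}, i = 2..N (sub(1) is unused)
dia =  4*ones(N,1);                   % a_{ii}
sup = -2*ones(N,1);  sup(N) = 0;      % a_{i,i+1}, i = 1..N-1 (sup(N) is unused)
b   = (1:N)';

l = zeros(N,1);  u = zeros(N,1);      % l(i) = l_ii, u(i) = u_{i,i+1}; l_{i,i-1} = sub(i)
y = zeros(N,1);  x = zeros(N,1);

% Factorization A = L*U
l(1) = dia(1);
for i = 1:N-1
    if l(i) == 0, error('Crout: zero pivot l(%d,%d)', i, i); end
    u(i)   = sup(i)/l(i);
    l(i+1) = dia(i+1) - sub(i+1)*u(i);
end
if l(N) == 0, error('Crout: zero pivot l(%d,%d)', N, N); end

% Forward substitution L*y = b
y(1) = b(1)/l(1);
for i = 2:N
    y(i) = (b(i) - sub(i)*y(i-1))/l(i);
end

% Backward substitution U*x = y
x(N) = y(N);
for i = N-1:-1:1
    x(i) = y(i) - u(i)*x(i+1);
end

% Check against Matlab's backslash on the full matrix
A = diag(dia) + diag(sub(2:N),-1) + diag(sup(1:N-1),1);
fprintf('max difference with A\\b: %.2e\n', max(abs(x - A\b)));
```

The full matrix A is built only for the check in the last lines; the factorization and the two substitutions use five arrays of length N.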


Chapter 5

• Accuracy

• Stability

• Order of convergence

Numerical methods:

– Euler

– Trapezoidal rule

– Runge–Kutta

• polynomial approximation

– Lagrange polynomials

– piecewise Lagrange polynomials


5.1 Problem description and modelling

Problem: A hot cup of tea (of temperature T0 ) is placed in a room with a lower tem-

perature. What is the temperature of the tea as a function of time?

Simplifying assumptions: We assume that the tea is well-stirred so that its tempera-

ture doesn’t vary in space, only in time: T = T (t). In Chap. 7 we consider the temperature

to be a function of time and space. The room is large, so we may assume that the tem-

perature of the room is basically unchanged by the tea, i.e. the room temperature Tsur

remains constant.

Basic model:

”Rate of change = rate of increase - rate of decrease”.

Apply this to temperature¹:

• Rate of change: dT /dt, change of temperature in time.

• Rate of increase: 0, no heat source.

• Rate of decrease: the cup loses energy to the environment. Newton’s cooling law:
experimentally one observes that the rate at which the temperature of the liquid decreases
is proportional to the difference in temperature between the object and its
surroundings: k(T − Tsur ).

Substitution results in the initial value problem

$$ \frac{dT}{dt} = -k(T - T_{sur}), \qquad T(t = 0) = T_0. $$

Henceforth, we take k = 1, Tsur = 20, and T0 = 80. (All quantities are expressed in an

arbitrary consistent system of units.) Substituting gives

$$ \frac{dT}{dt} = -T + 20, \qquad T(t = 0) = 80, $$

which we will solve numerically on the time interval 0 ≤ t ≤ 10.

The cooling model is a linear, nonhomogeneous first-order initial value problem (IVP),
which has the general form y' + p(t)y = g(t), y(t0 ) = y0 . From differential equations
we know that a unique solution exists if p(t) and g(t) are continuous functions on the t
interval considered. For those cases, a numerical technique should give a good numerical
approximation to the solution. If this is not the case, there is likely an error in the program.
For non-linear first-order IVPs, which have the general form y' = f (t, y), y(t0 ) = y0 , it
is often unknown whether a unique solution exists or not. If you experience problems in
solving nonlinear IVPs, there can be several causes: a unique solution might not exist,
the numerical method or the grid is not good enough, or you might have an error in your
program. Then a careful inspection of the numerical results is necessary to determine the
most likely cause and a possible fix.

Footnote 1: Actually it should be an internal energy balance, but for the usual assumption that the internal energy only depends on temperature, the result is the same.


5.2 Solving first order linear IVPs analytically

In this section we discuss two methods to obtain solutions of a (linear) IVP. In the next

sections we will solve the linear IVP numerically and see whether numerical methods

converge to the analytical solution found in this section.

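Analytical solution of a first-order linear nonhomogeneous IVP:

$$y' + p(t)y = g(t).$$

Multiply both sides by the integrating factor

$$\mu(t) = \exp\left(\int^{t} p(s)\,ds\right),$$

which gives

$$\mu y' + \mu p(t) y = \mu(t) g(t),$$

and combine the terms on the left (using the product rule together with $\mu' = p\mu$, which follows from the chain rule):

$$\frac{d(\mu y)}{dt} = \mu(t) g(t).$$

Integration gives

$$y(t) = \frac{1}{\mu(t)}\left[\int \mu(t) g(t)\,dt + C\right].$$

The constant C can be determined from the initial condition.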

In our example, we have p(t) = 1 and g(t) = 20. This gives $\mu(t) = \exp(\int 1\,dt) = e^{t}$. Multiplying and combining terms on the left, the ODE results in

$$\frac{d(e^{t}T)}{dt} = 20e^{t}.$$

After integration and dividing by $e^{t}$, we get

$$T(t) = 20 + Ce^{-t},$$

where C needs to be determined from the initial condition: 80 = T(0) = 20 + C, or C = 60. The (unique) solution to the IVP is thus

$$T(t) = 20 + 60e^{-t}.$$
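As a quick sanity check (not part of the derivation itself), the result can be verified numerically. The sketch below evaluates the residual of the ODE with a central finite difference; the variable names and the difference step d are illustrative assumptions.

```matlab
% Check that T(t) = 20 + 60*exp(-t) satisfies T' = -T + 20 and T(0) = 80.
T  = @(t) 20 + 60*exp(-t);               % analytical solution found above
t  = linspace(0, 10, 101);               % sample times on the interval of interest
d  = 1e-5;                               % finite-difference step (illustrative choice)
dT = (T(t + d) - T(t - d)) / (2*d);      % central-difference approximation of T'
residual = max(abs(dT - (20 - T(t))));   % should be close to zero
T0 = T(0);                               % should equal the initial condition 80
fprintf('T(0) = %g, max residual = %.2e\n', T0, residual);
```

A small residual only confirms the algebra at the sampled times; it is not a proof.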


5.2.2 Symbolical calculations

ODEs and IVPs can be solved symbolically using Matlab with dsolve. In order to solve the 1st order ODE $T' = -T + 20$, use

T = dsolve('DT=20-T')

which gives the general solution

T = 20+exp(-t)*C1

In order to solve the 1st order IVP $T' = -T + 20$ with T(0) = 80, just add the initial value

T = dsolve('DT=20-T', 'T(0)=80')

which gives the unique solution

T = 20+60*exp(-t)

If Matlab cannot solve the ODE, for example $T' = \sin(T^2)$,

dsolve('DT=sin(T^2)')

it will respond

T = RootOf(-t+Int(1/sin( a^2), a = .. Z)-C1)

which means that it cannot find an analytical solution.


5.3 Solving IVPs numerically: Introduction

5.3.1 Grids

The solution to an initial value problem is a function y(t) which is defined for every t. If

we use numerical techniques, we find an approximation to the solution at certain discrete

time levels ti only. The collection of ti ’s is called a (computational) grid. Before we can

solve an IVP numerically, we need to specify the computational grid, i.e. all times ti in

[a, b] for which we want to obtain a numerical approximation. A numerical technique will

produce approximations to the function y at these grid points only, i.e. we will obtain

approximations for the values y(ti ) which will be denoted by yi . See Fig. 5.1.

[Figure 5.1: the exact solution y(t) and the approximations y0 = y(a), y1, y2, y3, ..., yn−1, yn at the grid points a = t0, t1, t2, t3, ..., tn−1, tn = b.]

Often we choose the interval [a, b] to be divided into N equally spaced subintervals of

length h = (b − a)/N , which corresponds to the grid points

ti = a + ih

for i = 0, . . . , N . The length of a subinterval h is called the step size.

Example

We solve an initial value problem numerically on [0, 1] and divide the interval [0, 1] into

N = 4 equally spaced subintervals. The length of a subinterval is h = (1 − 0)/4 = 1/4.

We obtain a numerical approximation to the solution at the N + 1 = 5 discrete time levels

only: t0 = 0 (just the initial condition), t1 = 1/4, t2 = 1/2, t3 = 3/4, and t4 = 1.
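The grid of this example can be built in a few lines. This is a minimal sketch; the variable names are illustrative. Note that Matlab arrays start at index 1, so the grid point t0 is stored in t(1).

```matlab
% Equally spaced grid on [a, b] with N subintervals (values from the example).
a = 0;  b = 1;  N = 4;
h = (b - a) / N;        % step size
t = a + (0:N) * h;      % grid points t_0, ..., t_N, stored in t(1), ..., t(N+1)
disp(t)
```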

Remarks:

1. Typically, the more grid points the more accurate the approximation to the solution

and the more work to compute the approximation. The goal is to compute an accu-

rate numerical solution with as few grid points as possible, to minimize computing

time.

2. Almost always there are certain regions in [a, b] where the solution changes more

rapidly (i.e. where you want more grid points). An equally spaced grid is then not

the best choice.


3. An equally spaced grid is the easiest setting in which to introduce the numerical techniques and will therefore be used in this chapter.

If an IVP cannot be solved analytically, one can try to solve it numerically. Numerical techniques, however, will not always work either. If no solution exists for the IVP, we can't expect that a numerical technique will produce something useful. In addition, there is no single best method to solve IVPs. Which method to choose depends on the problem you want to solve. Issues are

1. Accuracy.

2. Stability.

3. Computing time and memory (Typically not a very important issue for IVPs, only

for time-dependent PDEs in 2 or 3 spatial dimensions)

We only discuss some well-known one-step methods, for which a solution at some time

ti is computed using only quantities at the previous time level ti−1 . We discuss Euler’s

method, trapezoidal rule, and Runge-Kutta methods. We focus on how these methods

work, how to program them, and typical numerical issues.


5.4 Solving IVPs numerically using Matlab: ode45

Matlab has several built-in functions to solve initial value problems numerically. We only discuss ode45, which uses a Runge–Kutta technique with adaptive time stepping, i.e. at every step a proper value of the step size h is determined in order to obtain a solution within a specified tolerance. Matlab's ode45 solves the IVP $y' = f(t, y)$ with $y(a) = y_0$ on the time interval a ≤ t ≤ b.

To solve the cooling problem with default tolerances, one would type in the Matlab Command Window

[t, T] = ode45('func_ode', [0 10], 80)

where [0 10] is the time interval [a, b] on which you want to obtain a numerical solution and 80 is the value of the initial condition y0. The string 'func_ode' (the quotes indicate that it is a string) specifies the name of your m-file in which the right-hand-side function f(t, y) is defined. For the cooling problem, f(t, T) = 20 − T needs to be specified in the m-file func_ode.m,

function f = func_ode(t, T)
f = 20 - T;

The result of ode45 consists of two arrays: the discrete time values used (t) and the corresponding approximations to the solution (T). To satisfy the default tolerance values, 49 grid points are used.

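The accuracy can be increased by using odeset. By default an absolute tolerance of $10^{-6}$ is used to determine the step size h at grid point $t_i$. To solve the IVP with an absolute tolerance of $10^{-8}$, you would type in the Matlab Command Window

options = odeset('AbsTol', 1e-8)

[t,y] = ode45('func_ode', [0 10], 80, options)

which produces, in this case, the same solution (49 grid points). A likely explanation is that ode45 also applies a relative tolerance (RelTol, default $10^{-3}$), which is the binding one here because the solution stays between 20 and 80; tightening only AbsTol then has no effect.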

Fig. 5.2 shows the numerical solution together with the analytical solution. No differ-

ences are observed on the scale of the plot.

[Figure 5.2: exact solution and ode45 solution of the cooling problem, T versus t for 0 ≤ t ≤ 10.]
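The ode45 call can also be written without a separate m-file by passing the right-hand side as a function handle. This is a sketch under that assumption (the handle f and the error check are not part of the original listing); the number of grid points and the size of the error depend on the Matlab version and tolerances.

```matlab
% Cooling problem T' = 20 - T, T(0) = 80 on [0, 10] with a function handle.
f = @(t, T) 20 - T;                  % f(t, T); t is unused but required by ode45
[t, T] = ode45(f, [0 10], 80);       % default tolerances
Texact = 20 + 60*exp(-t);            % analytical solution at the same time levels
maxErr = max(abs(T - Texact));       % largest deviation from the exact solution
fprintf('%d grid points, max error %.2e\n', numel(t), maxErr);
```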

5.5 One-step methods

Details about the derivation of one-step methods for $y' = f(t, y)$ are discussed in Math 4446. Here we focus on how the methods work and typical numerical issues. We start with the easiest method: Euler's method. It is the simplest numerical technique with which to demonstrate all numerical concepts. However, it is almost never the best numerical technique for the problem you solve. We consider a constant step size $h = t_{i+1} - t_i$.

Euler's method can easily be derived using the forward finite difference formula $y'(t_i) \approx (y_{i+1} - y_i)/h$. Starting from the initial condition $y_0$, any next $y_{i+1}$ is obtained from

$$y_{i+1} = y_i + h f(t_i, y_i), \qquad i = 0, \ldots, N - 1.$$

Euler's method is an explicit method: the right-hand side only depends on known quantities (at time level $t_i$). This makes it easy to solve: just evaluate the right-hand side using the known values $t_i$ and $y_i$.

Integrating $y' = f(t, y)$ on both sides from $t_i$ to $t_{i+1}$ gives

$$y(t_{i+1}) - y(t_i) = \int_{t_i}^{t_{i+1}} f(t, y)\,dt.$$

The integral is approximated with the trapezoidal rule (average of the function values at the end points $t_i$ and $t_{i+1}$). Starting from the initial condition $y_0$, any next $y_{i+1}$ is obtained from

$$y_{i+1} = y_i + \frac{h}{2}\left[f(t_i, y_i) + f(t_{i+1}, y_{i+1})\right],$$

for i = 0, ..., N − 1.

This is an implicit method: the right-hand side also depends on the a priori unknown $y_{i+1}$. In general, this makes it more difficult to compute $y_{i+1}$. Only for relatively simple functions f can an explicit equation for $y_{i+1}$ be obtained. Otherwise, a nonlinear equation needs to be solved at every step, for example with bisection or Newton's method. Generally, this requires much more work per time step. The test equation and cooling equation are relatively simple and solving a nonlinear equation is not necessary (the right-hand side is linear in y).
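For the cooling problem the implicit equation can be solved by hand, because f(t, T) = 20 − T is linear in T. The sketch below uses that rearrangement; it applies to this linear problem only, and the variable names are illustrative.

```matlab
% Trapezoidal rule for T' = 20 - T, T(0) = 80, solved explicitly for T(i+1).
h = 0.5;  t = 0:h:10;  N = length(t) - 1;
T = zeros(1, N+1);  T(1) = 80;
for i = 1:N
    % T(i+1) = T(i) + h/2*((20 - T(i)) + (20 - T(i+1))), rearranged for T(i+1)
    T(i+1) = ((1 - h/2)*T(i) + 20*h) / (1 + h/2);
end
fprintf('T at t = 1: %.14f\n', T(3));   % T(3) holds the value at t = 1
```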

The Runge–Kutta methods considered here are explicit methods that evaluate the function f(t, y) at suitably chosen intermediate points $(t_l, y_l)$ so that a higher-order method is obtained.


A frequently used Runge–Kutta method is the Runge–Kutta method of order four (RK4). Starting from the initial condition $y_0$, any next $y_{i+1}$ is obtained from

$$\begin{aligned}
k_1 &= h f(t_i, y_i), \\
k_2 &= h f\left(t_i + \tfrac{h}{2},\; y_i + \tfrac{k_1}{2}\right), \\
k_3 &= h f\left(t_i + \tfrac{h}{2},\; y_i + \tfrac{k_2}{2}\right), \\
k_4 &= h f(t_{i+1},\; y_i + k_3), \\
y_{i+1} &= y_i + \tfrac{1}{6}(k_1 + 2k_2 + 2k_3 + k_4),
\end{aligned}$$

for i = 0, ..., N − 1. The method is explicit: all substeps only involve known quantities in the right-hand side.
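The RK4 substeps translate almost line by line into Matlab. The following sketch applies them to the cooling problem with h = 1/2; the function handle f and the array names are illustrative assumptions.

```matlab
% RK4 for the cooling problem T' = 20 - T, T(0) = 80 with step size h = 1/2.
f = @(t, y) 20 - y;                  % right-hand side f(t, y)
h = 0.5;  t = 0:h:10;  N = length(t) - 1;
y = zeros(1, N+1);  y(1) = 80;
for i = 1:N
    k1 = h * f(t(i),       y(i));
    k2 = h * f(t(i) + h/2, y(i) + k1/2);
    k3 = h * f(t(i) + h/2, y(i) + k2/2);
    k4 = h * f(t(i) + h,   y(i) + k3);
    y(i+1) = y(i) + (k1 + 2*k2 + 2*k3 + k4) / 6;
end
fprintf('RK4 value at t = 1: %.14f\n', y(3));   % y(3) holds the value at t = 1
```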

All one-step methods have a similar structure, summarized in the following algorithm.

Input: discrete times ti , value of initial condition y0 .

Initializations

Set initial condition

Compute number of subintervals N

One-step method

Do for i = 0, . . . , N − 1

Compute step size h

Compute next approximation yi+1 from the known values h, yi , ti , and ti+1 .

End do (i-loop)

The Matlab implementation of Euler's method below follows this algorithm and uses a for-loop to find the approximation $y_{i+1}$ at the next time level for all i.


function [y] = euler(t, y0)
%=============================================
% Euler's method for 1st order IVPs
% Input/output parameters:
%   t   array with (N+1) grid points ti
%   y0  value of initial condition
%   y   array with (N+1) approximations yi
%=============================================
%---------------------------------------------
% Initializations
%---------------------------------------------
N = length(t) - 1;       % number of subintervals
y = zeros(size(t));      % preallocate the solution array
y(1) = y0;               % initial condition
%---------------------------------------------
% Euler's method
%---------------------------------------------
for i = 1:N
    h = t(i+1) - t(i);   % step size of the current subinterval (no division by N)
    f = funcivp(y(i), t(i));
    y(i+1) = y(i) + h*f;
end
%---------------------------------------------
% Function f(y, t) with y and t scalars
%---------------------------------------------
function [f] = funcivp(y, t)
% cooling problem
f = 20 - y;

Remarks:

• Note that the function euler is general. When changing to a different f (t, y) only

funcivp needs to be changed.

• For different one-step methods just the part inside the for-loop needs to be modified. For the explicit RK4 we just have some more explicit substeps to perform. If you solve a nonlinear ODE with the implicit trapezoidal rule, you need to solve a nonlinear algebraic equation at every step, which can be done with Newton's method. Then you need to call a function newton to find $y_{i+1}$. Since the time step is typically small, $y_i$ is usually a good enough initial guess for Newton's method.
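The last remark can be illustrated with a small sketch of the trapezoidal rule combined with Newton's method. Everything here is an assumption for illustration: the nonlinear test problem y' = −y², y(0) = 1 (exact solution 1/(1+t)), the inline Newton loop instead of a separate function newton, and the tolerance and iteration limit.

```matlab
% Implicit trapezoidal rule with Newton's method for y' = -y^2, y(0) = 1.
f    = @(t, y) -y.^2;        % right-hand side (illustrative nonlinear example)
dfdy = @(t, y) -2*y;         % derivative of f with respect to y, needed by Newton
h = 0.1;  t = 0:h:1;  N = length(t) - 1;
y = zeros(1, N+1);  y(1) = 1;
for i = 1:N
    Y = y(i);                                   % initial guess: previous value
    for iter = 1:20
        G  = Y - y(i) - h/2*(f(t(i), y(i)) + f(t(i+1), Y));   % residual G(Y)
        dG = 1 - h/2*dfdy(t(i+1), Y);                          % derivative dG/dY
        dY = -G / dG;                           % Newton update
        Y  = Y + dY;
        if abs(dY) < 1e-12, break; end          % converged
    end
    y(i+1) = Y;
end
err = abs(y(end) - 1/(1 + t(end)));             % compare with exact solution 1/(1+t)
fprintf('error at t = 1: %.2e\n', err);
```

Only the two handles f and dfdy change when a different ODE is solved.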


5.5.5 Example

We solve the cooling problem Eq. (5.1) using a step size h = 1/2 and compare with the exact solution $T(t) = 20 + 60e^{-t}$.

Fig. 5.3 shows the approximations Ti for the three different one-step methods consid-

ered: Euler, trapezoidal rule, and RK4.

[Figure 5.3: One-step methods for the cooling problem using step size h = 1/2. Plot of T and Ti versus t for 0 ≤ t ≤ 10: exact, Euler, trapezoidal, RK4.]

       exact               Euler               trapezoidal         RK4
T3     42.07276647028654   35.00000000000000   41.60000000000000   42.09025065104166
T11    20.40427681994513   20.05859375000000   20.36279705600000   20.40588052828283
T21    20.00272399578575   20.00005722045898   20.00219369506403   20.00274565005399

Table 5.1: One-step methods for the cooling problem using step size h = 1/2. The indices follow Matlab's array numbering, so T3, T11, and T21 are the values at t = 1, 5, and 10.

It is clear from the comparison with the exact solution in Fig. 5.3 and Table 5.1 that RK4 gives a

better approximation than the trapezoidal rule which gives a better approximation than

Euler. Of course RK4 requires more work per time step than the trapezoidal rule which

requires more work per time step than Euler. In Sec. 5.7 we discuss accuracy in more

detail.


5.6 Test equation and amplifying factor

The general IVP $y' = f(t, y)$ is rather complicated to analyze. For this reason a test equation is introduced (just use f(t, y) = λy):

$$y' = \lambda y,$$

where λ is a complex number. This is a rather simple equation, but it can still demonstrate the main concepts. Results we derive remain valid for more complicated equations.

Applying a one-step method to the test equation results in an equation of the form

$$y_{i+1} = k(h\lambda)\, y_i,$$

where k is the amplifying factor, which depends on hλ. The amplifying factor of a certain numerical technique contains enough information to determine its accuracy and stability properties.

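Euler's method

The Euler method for the test equation is

$$y_{i+1} = y_i + h\lambda y_i = (1 + h\lambda)\, y_i,$$

so the amplifying factor is $k(h\lambda) = 1 + h\lambda$.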

Trapezoidal rule

The trapezoidal rule for the test equation is

$$y_{i+1} = y_i + \frac{h}{2}\left[\lambda y_i + \lambda y_{i+1}\right].$$

Writing this in the form $y_{i+1} = k(h\lambda) y_i$ gives, after some algebra, the amplifying factor

$$k(h\lambda) = \frac{1 + h\lambda/2}{1 - h\lambda/2}.$$

Runge-Kutta (RK4)

Writing the RK4 method in the form $y_{i+1} = k(h\lambda) y_i$ gives the amplifying factor. After some algebra this gives

$$k(h\lambda) = 1 + h\lambda + \frac{1}{2}(h\lambda)^2 + \frac{1}{6}(h\lambda)^3 + \frac{1}{24}(h\lambda)^4.$$
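The three amplifying factors are easy to evaluate numerically. The sketch below relies on the observation that T − 20 in the cooling problem satisfies the test equation with λ = −1, so that Ti = 20 + 60 k(hλ)^i; the handle names are illustrative.

```matlab
% Amplifying factors of the three one-step methods as functions of z = h*lambda.
kEuler = @(z) 1 + z;
kTrap  = @(z) (1 + z/2) ./ (1 - z/2);
kRK4   = @(z) 1 + z + z.^2/2 + z.^3/6 + z.^4/24;
z = -0.5;                                % cooling problem: lambda = -1, h = 1/2
kExact = exp(z);                         % factor of the exact solution per step
% Two steps of size h = 1/2 give the approximations at t = 1.
T1 = 20 + 60*[kEuler(z), kTrap(z), kRK4(z), kExact].^2;
fprintf('%.10f\n', T1);
```

The four values correspond to the columns Euler, trapezoidal, RK4, and exact in the first row of Table 5.1.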


5.7 Accuracy

To say more about the error we can expect when using a numerical technique to solve IVPs, we distinguish two types of errors:

• local truncation error (easiest to use, but doesn't include accumulation of errors),

• global error (includes the accumulation of errors over all steps).

The local truncation error measures, at a specified time, the amount by which the exact solution and numerical approximation differ, assuming that the method was exact at the previous step. In a numerical simulation there is always one time level at which we know the exact solution: the initial condition $y(t_0) = y_0$. Performing one step starting from the exact initial condition gives you the local truncation error $e_1$. The order $O(h^n)$ can be obtained by computing the local truncation error for various step sizes h and determining the slope on a $\log_{10} h$ vs. $\log_{10} e_1$ plot. Determining the slope is done in the same way as for boundary value problems; see Sec. 4.8.

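The local truncation error is defined as

$$e_{i+1}(h) = |y(t_{i+1}) - y_{i+1}|,$$

where we assume that the solution at the previous step was exact: $y_i = y(t_i)$.

To analyze the local truncation error we consider the test equation $y' = \lambda y$. Integration from $t_i$ to $t_{i+1}$ gives

$$y(t_{i+1}) = e^{h\lambda}\, y(t_i).$$

To investigate errors, we use the Taylor series

$$e^{h\lambda} = \sum_{j=0}^{n-1} \frac{(\lambda h)^j}{j!} + O(h^n) = 1 + \lambda h + \frac{(\lambda h)^2}{2} + \cdots + \frac{(\lambda h)^{n-1}}{(n-1)!} + O(h^n).$$

Combining with the general form for a one-step method $y_{i+1} = k(h\lambda) y_i$ gives

$$e_{i+1}(h) = \left|e^{h\lambda} - k(h\lambda)\right| |y_i|.$$

Euler's method

For Euler we obtain

$$e_{i+1}(h) = \left|e^{h\lambda} - (1 + h\lambda)\right| |y_i| = O(h^2).$$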

The local truncation error for the cooling problem Eq. (5.1) is given in Table 5.2.

h         y1       y(t1)           e1         ratio
0.1       74.000   7.4290245e+01   2.90e-01   -
0.05      77.000   7.7073765e+01   7.38e-02   0.254
0.025     78.500   7.8518594e+01   1.86e-02   0.252
0.0125    79.250   7.9254668e+01   4.67e-03   0.251
0.00625   79.625   7.9626169e+01   1.17e-03   0.251

Table 5.2: Local truncation error for the cooling problem Eq. (5.1) using Euler's method.

We observe that the error decreases by a factor of $4 = 2^2$ each time the step size is halved. This is exactly what we expect: if we decrease the step size h to h/2, the error decreases from $O(h^2)$ to $O((h/2)^2) = \frac{1}{4} O(h^2)$.

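Trapezoidal rule

Using Taylor expansions for the exponential and for $1/(1 - h\lambda/2)$ (using $1/(1 - x) = \sum_{i=0}^{\infty} x^i$) gives, after some algebra,

$$e_{i+1}(h) = \left|e^{h\lambda} - \frac{1 + h\lambda/2}{1 - h\lambda/2}\right| |y_i| = O(h^3).$$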

The local truncation error for the cooling problem Eq. (5.1) is given in Table 5.3.

h         y1             y(t1)          e1        ratio
0.1       7.42857143e1   7.42902451e1   4.53e-3   -
0.05      7.70731707e1   7.70737655e1   5.95e-4   0.131
0.025     7.85185185e1   7.85185947e1   7.62e-5   0.128
0.0125    7.92546584e1   7.92546680e1   9.64e-6   0.127
0.00625   7.96261682e1   7.96261694e1   1.21e-6   0.126

Table 5.3: Local truncation error for the cooling problem Eq. (5.1) using the trapezoidal rule.

We observe that the error decreases by a factor of $8 = 2^3$ each time the step size is halved. This is exactly what we expect: if we decrease the step size h to h/2, the error decreases from $O(h^3)$ to $O((h/2)^3) = \frac{1}{8} O(h^3)$.

Runge-Kutta (RK4)

For RK4 we obtain

$$e_{i+1}(h) = \left|e^{h\lambda} - k(h\lambda)\right| |y_i| = O(h^5).$$

This is two orders higher than the trapezoidal rule and three orders higher than Euler. The local truncation error for the cooling problem Eq. (5.1) is given in Table 5.4.

h         y1                  y(t1)               e1         ratio
0.1       7.4290250000000e1   7.4290245082158e1   4.92e-06   -
0.05      7.7073765625000e1   7.7073765470043e1   1.55e-07   3.150e-2
0.025     7.8518594726563e1   7.8518594721700e1   4.86e-09   3.135e-2
0.0125    7.9254668029785e1   7.9254668029633e1   1.52e-10   3.128e-2
0.00625   7.9626169437408e1   7.9626169437404e1   4.76e-12   3.132e-2

Table 5.4: Local truncation error for the cooling problem Eq. (5.1) using RK4.

We observe that the error decreases by a factor of $32 = 2^5$ each time the step size is halved. This is exactly what we expect: if we decrease the step size h to h/2, the error decreases from $O(h^5)$ to $O((h/2)^5) = \frac{1}{32} O(h^5)$.
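The slope on the log-log plot can be estimated with polyfit instead of being read from a figure. The sketch below does this for the three methods; it assumes that one step for the cooling problem can be written as T1 = 20 + 60 k(hλ) with λ = −1, and the cell array k and the other names are illustrative.

```matlab
% Local truncation error e1 after one step from the exact initial condition,
% using T_1 = 20 + 60*k(h*lambda) with lambda = -1 for the cooling problem.
k = {@(z) 1 + z, ...                               % Euler
     @(z) (1 + z/2) ./ (1 - z/2), ...              % trapezoidal rule
     @(z) 1 + z + z.^2/2 + z.^3/6 + z.^4/24};      % RK4
h = 0.1 ./ 2.^(0:4);                               % 0.1, 0.05, ..., 0.00625
order = zeros(1, 3);
for m = 1:3
    e1 = abs(60*exp(-h) - 60*k{m}(-h));            % |y(t1) - y1| for each h
    p  = polyfit(log10(h), log10(e1), 1);          % straight-line fit in log-log scale
    order(m) = p(1);                               % slope = observed order
end
disp(order)    % slopes close to 2, 3 and 5 are expected
```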

For the global error we look at the error at a fixed time $t_i = t^*$. The global error is the absolute difference of the exact solution $y(t_i)$ and the numerical approximation $y_i$ after applying the numerical scheme a number of times, starting from $y(t_0) = y_0$ until $t_i$ is reached,

$$\epsilon_i = |y(t_i) - y_i|.$$

Note that if we decrease h, the index i gets larger if the time $t_i = t^*$ is fixed.

Theorem

If the local error is $e_n = O(h^{p+1})$ then the global error is $\epsilon_n = O(h^p)$.

Consider the error at a fixed time t, say t = 1. To do Euler with step size h = 1/N, the number of Euler steps is N = 1/h, each with an error of $O(h^2)$. Then the total error becomes $1/h \times O(h^2) = O(h)$.

Consequence

The global error is one order lower than the local truncation error, thus for Euler we have $\epsilon_i = O(h)$, for the trapezoidal rule $\epsilon_i = O(h^2)$, and for RK4 $\epsilon_i = O(h^4)$.

Application

We can use the order of the global error to estimate a value of the step size h that is required to obtain a solution that is a certain amount more accurate. Assume we have a solution with error $\epsilon_0$ obtained using step size $h_0$. To obtain a solution with an error $\epsilon = 10^{-4}\epsilon_0$ (4 more accurate digits) with a method of order p, we would need

$$h^p \approx 10^{-4} h_0^p.$$

For Euler, the global error is O(h), i.e. p = 1, and we would need $h \approx 10^{-4} h_0$, or $10^4$ times as many intervals. For the trapezoidal rule, the global error is $O(h^2)$, i.e. p = 2, and we would need $h^2 \approx 10^{-4} h_0^2$, or $h \approx 10^{-2} h_0$, or $10^2$ times as many intervals. For RK4, the global error is $O(h^4)$, i.e. p = 4, and we would need $h^4 \approx 10^{-4} h_0^4$, or $h \approx 10^{-1} h_0$, or only 10 times as many intervals.
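The same estimate can be written as a small calculation. This is only a sketch of the arithmetic above; the variable names are illustrative.

```matlab
% Step-size reduction needed for 4 more accurate digits with a method of order p.
p = [1 2 4];                     % global order of Euler, trapezoidal rule, RK4
gain = 1e-4;                     % desired error reduction factor
hFactor = gain.^(1./p);          % required h/h0 from h^p = gain * h0^p
intervalFactor = 1 ./ hFactor;   % how many times more intervals are needed
disp(intervalFactor)
```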

Example

For the cooling problem Eq. (5.1), the exact solution at t = 1 is $T(t = 1) = 20 + 60e^{-1} \approx$ 4.207276647028654e+01.

Euler

The global error is one order lower than the local truncation error, thus $\epsilon_i = O(h)$. The global error for the cooling problem Eq. (5.1) is given in Table 5.5.

h       Ti(t = 1)   ε(t = 1)   ratio
1/10    4.0921e1    1.2e0      -
1/20    4.1509e1    5.6e-1     0.489
1/40    4.1794e1    2.8e-1     0.495
1/80    4.1934e1    1.4e-1     0.497
1/160   4.2004e1    6.9e-2     0.499
1/320   4.2038e1    3.5e-2     0.499

Table 5.5: Global error for the cooling problem Eq. (5.1) using Euler.

We observe that the error decreases by a factor of 2 each time the step size is halved. This is exactly what we expect: if we decrease the step size h to h/2, the error decreases from O(h) to O(h/2) = 1/2 O(h).
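A table of this kind can be generated with a short loop. The following is a minimal sketch for Euler's method; the variable names are illustrative, and the loop keeps only the current value of the solution rather than the whole array.

```matlab
% Global error of Euler's method at t = 1 for the cooling problem.
f = @(t, y) 20 - y;
exact = 20 + 60*exp(-1);            % exact solution at t = 1
Ns = 10 * 2.^(0:5);                 % 10, 20, ..., 320 steps, i.e. h = 1/10, ..., 1/320
err = zeros(size(Ns));
for m = 1:numel(Ns)
    N = Ns(m);  h = 1/N;  y = 80;   % restart from the initial condition
    for i = 1:N
        y = y + h*f((i-1)*h, y);    % Euler step from t_{i-1} to t_i
    end
    err(m) = abs(exact - y);        % global error at t = 1
end
ratio = err(2:end) ./ err(1:end-1); % error ratio when h is halved
disp([err(:), [NaN; ratio(:)]])
```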

Trapezoidal rule

The global error is one order lower than the local truncation error, thus $\epsilon_i = O(h^2)$. The global error for the cooling problem Eq. (5.1) is given in Table 5.6.

h       Ti(t = 1)     ε(t = 1)   ratio
1/10    4.2054352e1   1.84e-2    -
1/20    4.2068166e1   4.60e-3    0.250
1/40    4.2071616e1   1.15e-3    0.250
1/80    4.2072479e1   2.87e-4    0.250
1/160   4.2072694e1   7.19e-5    0.250

Table 5.6: Global error for the cooling problem Eq. (5.1) using the trapezoidal rule.

We observe that the error decreases by a factor of $4 = 2^2$ each time the step size is halved. This is exactly what we expect: if we decrease the step size h to h/2, the error decreases from $O(h^2)$ to $O((h/2)^2) = \frac{1}{4} O(h^2)$. Note that the global error for h = 0.1 is already better than for Euler using h = 1/320.

Runge-Kutta (RK4)

The global error is one order lower than the local truncation error, thus $\epsilon_i = O(h^4)$. The global error for the cooling problem Eq. (5.1) is given in Table 5.7.

h       w(t = 1)           ε(t = 1)   ratio
1/10    4.207278646475e1   2.00e-05   -
1/20    4.207276766885e1   1.20e-06   5.99e-2
1/40    4.207276654365e1   7.34e-08   6.12e-2
1/80    4.207276647482e1   4.54e-09   6.19e-2
1/160   4.207276647057e1   2.82e-10   6.21e-2

Table 5.7: Global error for the cooling problem Eq. (5.1) using RK4.

We observe that the error decreases by a factor of $16 = 2^4$ each time the step size is halved. This is exactly what we expect: if we decrease the step size h to h/2, the error decreases from $O(h^4)$ to $O((h/2)^4) = \frac{1}{16} O(h^4)$. Note that the global error for h = 0.1 is already better than the global error for the trapezoidal rule for h = 1/160.

We can’t continue to decrease the step size to get a better solution. Because of round-off

errors (due to finite number of digits to represent numbers on a computer), the error may

increase for small step sizes.

Example: Consider the following simple initial value problem (just to demonstrate the problem caused by round-off errors)

$$y' = 1, \qquad 0 \le t \le 1, \qquad y(0) = 1.$$

Using Euler and 8-digit arithmetic this becomes

$$w_0 = 1.0000000, \qquad w_{i+1} = w_i + h.$$

With step size $h = 1.0 \times 10^{-8}$ we find $w_1 = w_0 + h = 1.0000000 + 1.0 \times 10^{-8} = 1.0000000$, because we only have digits up to $10^{-7}$. Also $w_2 = w_1 + h = 1.0000000 + 1.0 \times 10^{-8} = 1.0000000$, and so on until $w_n = w_{n-1} + h = 1.0000000 + 1.0 \times 10^{-8} = 1.0000000$. The result is not close to the exact solution y(1) = 2 at all! We never find anything other than the initial condition.

Of course this problem does not only appear when the time step is "smaller than the last digit". If we choose $h = 1/3 \times 10^{-6}$, we find $w_1 = 1.0000003$ because we only have digits up to $10^{-7}$. Continuing, $w_2 = 1.0000006$, $w_3 = 1.0000009$. This gives an error of about 10% in the increment at each step.

More digits of precision shift the problem to smaller step sizes. If we did the calculations in double precision (16 digits accuracy) there is no problem representing $1 + 1.0 \times 10^{-8} = 1.00000001$. But $1 + 1.0 \times 10^{-16} = 1$ does give the same problem. How round-off errors affect the global error in Euler's method for the cooling problem Eq. (5.1) can be observed in Table 5.8. Down to step size $h = 10^{-9}$, the global error is 1/10 times the previous error if h is divided by 10, i.e. O(h). For a smaller step size the round-off error dominates and the global error starts to grow rapidly. Thus there is an optimum h with minimum error. In Table 5.8 the optimum value of h would be around $10^{-9}$.

h       w(t = 0.1)              e(t = 0.1)   ratio
1e-4    7.429021793686086e+01   2.7e-05      -
1e-5    7.429024236764370e+01   2.7e-06      0.1
1e-6    7.429024481071004e+01   2.7e-07      0.1
1e-7    7.429024505501881e+01   2.7e-08      0.1
1e-8    7.429024507944148e+01   2.7e-09      0.1
1e-9    7.429024508190086e+01   2.6e-10      0.1
1e-10   7.872524636435026e+01   4.43

Table 5.8: Impact of round-off error on global error in Euler's method for the cooling problem Eq. (5.1).
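The double-precision statements above can be checked directly in Matlab, which uses double precision by default. The sketch is illustrative only; the variable names are assumptions.

```matlab
% Double precision: the spacing of numbers near 1 is eps, about 2.2e-16.
a = (1 + 1e-8)  > 1;     % true: the increment 1e-8 is representable next to 1
b = (1 + 1e-16) > 1;     % false: 1 + 1e-16 rounds back to 1
% Repeating w = w + h with h = 1e-16 therefore never leaves w = 1.
w = 1;  h = 1e-16;
for i = 1:1000
    w = w + h;           % each addition is rounded back to 1
end
stuck = (w == 1);
fprintf('a = %d, b = %d, stuck = %d, eps = %.3e\n', a, b, stuck, eps);
```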


5.8 Stability

In numerical computations there are always small errors: round-off errors, discretization

errors. Stability is related to whether these errors grow without bound or not. Stability

is particularly important for so-called stiff equations.

5.8.1 Introduction

To develop some intuition for stability, we consider the following examples which we solve

using Euler’s method.

1. $y' + y = -99e^{-100t}$, y(0) = 2 with exact solution $y(t) = e^{-t} + e^{-100t}$, which decays from 2 to 0. Solving this IVP numerically is straightforward and the numerical solution looks as expected. This we will call stable further on.

2. $y' + 100y = 99e^{-t}$, y(0) = 2 with exact solution $y(t) = e^{-t} + e^{-100t}$. The exact solution is exactly the same, but solving this IVP numerically is much harder. The global errors are huge for step sizes down to and including h = 1/32. This we will call unstable further on.

3. $y' + 100y = 0$, y(0) = 1 with exact solution $y(t) = e^{-100t}$. The numerical solution is again unstable: the global errors are again huge for step sizes down to and including h = 1/32.

To analyze what happens exactly, we only consider the simplest equation for which we observe the unstable behavior, $y' = -100y$ with y(0) = 1. The analytical solution, $y(t) = e^{-100t}$, decays very rapidly to zero. Numerical solution using Euler gives

$$y_{i+1} = y_i - 100h\, y_i = (1 - 100h)\, y_i.$$

Every step the previous value is multiplied by (1 − 100h). The numerical solution will only tend to 0 if |1 − 100h| < 1. Thus we need 0 < h < 2/100 = 0.02. For h > 0.02, $|y_i|$ and thus the error will grow, whereas the exact solution y → 0 as t → ∞.
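The effect of the limit h < 0.02 can be seen by running Euler's method with one step size above and one below it. This is a sketch; the step sizes 1/32 and 1/64 and the variable names are illustrative choices.

```matlab
% Euler's method for y' = -100*y, y(0) = 1, integrated up to t = 1.
lambda = -100;
hs = [1/32, 1/64];              % one step size above and one below the limit 0.02
yEnd = zeros(size(hs));
for m = 1:numel(hs)
    h = hs(m);  N = round(1/h);  y = 1;
    for i = 1:N
        y = (1 + h*lambda) * y; % Euler step: multiply by 1 - 100*h
    end
    yEnd(m) = y;                % numerical value at t = 1
end
fprintf('h = 1/32: %.3e,  h = 1/64: %.3e\n', yEnd(1), yEnd(2));
```

For h = 1/32 the factor is 1 − 100/32 = −2.125, so the magnitude grows every step; for h = 1/64 it is −0.5625 and the solution decays.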

Now consider $y' = -100e^{-100t}$ with y(0) = 1, which has the same analytical solution $e^{-100t}$. Numerical solution using Euler gives

$$y_{i+1} = y_i - 100h\,e^{-100t_i} = y_{i-1} - 100h \sum_{j=i-1}^{i} e^{-100t_j} = y_0 - 100h \sum_{j=0}^{i} e^{-100t_j} = 1 - 100h \sum_{j=0}^{i} e^{-100t_j},$$

which is conceptually very different. There is no multiplication of the previous value, just an addition of a small term every step.


5.8.2 Stability of one-step methods (test equation)

We solve two test equations with a one-step method using slightly different initial conditions,

$$y' = \lambda y, \qquad y(t_0) = y_0$$

and

$$w' = \lambda w, \qquad w(t_0) = y_0 + \epsilon.$$

Thus we have the same ODE (and the same general solution) but a slightly different value (ε) for the initial condition.

Applying a one-step method to the above equations gives

$$y_{i+1} = k(h\lambda)\, y_i, \qquad \text{starting from } y_0,$$

and

$$w_{i+1} = k(h\lambda)\, w_i, \qquad w_0 = y_0 + \epsilon.$$

Subtracting gives an equation for the difference $\epsilon_i = w_i - y_i$ between the 2 solutions at time level i + 1,

$$\epsilon_{i+1} = k(h\lambda)\, \epsilon_i, \qquad \epsilon_0 = \epsilon.$$

Applying the one-step method i + 1 times gives

$$\epsilon_{i+1} = [k(h\lambda)]^{i+1}\, \epsilon.$$

If $|k(h\lambda)| < 1$, the difference decays:

$$\lim_{i\to\infty} |\epsilon_i| = 0.$$

Thus the scheme is stable if |k(hλ)| ≤ 1 (error will not grow), absolutely stable if |k(hλ)| < 1 (error will decay to zero), and unstable if |k(hλ)| > 1 (error will grow and $y_i$ and $w_i$ will be very different for large values of i).

In the test equation λ ∈ C is a given constant and h needs to be chosen such that the criterion for absolute stability is fulfilled in order to obtain a stable numerical solution. The region of absolute stability (R) is defined by those values of z = hλ for which the method is absolutely stable:

$$R = \{z \in \mathbb{C} : |k(z)| < 1\}.$$


Euler

For Euler, k(hλ) = 1 + hλ. To have absolute stability we need |k(hλ)| < 1, which gives |1 + z| < 1, where z = hλ is a complex number. For the magnitude of the complex number 1 + z with z = x + iy, we have

$$|1 + z| = \sqrt{(1 + z)(1 + \bar{z})} = \sqrt{(1 + x + iy)(1 + x - iy)} = \sqrt{(1 + x)^2 + y^2}.$$

Thus absolute stability requires

$$(x + 1)^2 + y^2 < 1.$$

Since $(x + 1)^2 + y^2 = 1$ is a circle around (−1, 0) with radius 1, this corresponds to the region inside the circle. The circle itself is not included, which is represented in a figure by a dashed line.

The boundary of the region of absolute stability can also be plotted directly with

Matlab using

syms x y;

z = x + i*y;

k = 1 + z;

ezplot(abs(k) - 1, [-3, 1, -1.5, 1.5]);

grid on;

setcurve('Line', ':');

Here abs gives the magnitude of a complex number and [-3, 1, -1.5, 1.5] are the minimum and maximum x and y values in the figure (appropriate values were found by trial and error). The last line makes a dashed line. For this you need the m-file setcurve.m in your Current Directory. The resulting Matlab figure is Fig. 5.4.

[Figure 5.4: Region of absolute stability for Euler's method: inside the circle. A dashed curve means that the curve itself is not included. Axes: x from −3 to 1, y from −1.5 to 1.5.]

Note that we only plotted the boundary. The region of absolute stability |k(z)| < 1 is inside the closed curve. This is easily verified by checking one point inside the circle and one point outside the circle (z = −1 + 0i satisfies |1 + z| < 1 and z = 1 + 0i does not).

Trapezoidal rule

For the region of absolute stability we need |k(hλ)| < 1. Using z = hλ this gives

$$\left|\frac{1 + z/2}{1 - z/2}\right| < 1 \quad\text{or}\quad |1 + z/2| < |1 - z/2|.$$

The boundary of the region of stability is displayed in Fig. 5.5.

[Figure 5.5: Region of absolute stability for trapezoidal rule: the left half plane. Axes: x and y from −6 to 6.]

The region of stability is on the left of the imaginary axis (check a point on each side). Thus the region of stability is the whole left half plane.

RK4

The boundary of the region of stability is displayed in Fig. 5.6. The region of absolute

stability is inside the closed curve (check 1 point inside and outside the closed curve).

Application

From the region of absolute stability, you can obtain a maximum value of h for which the solution should remain stable. For example,

1. Assume λ is real and negative. For which values of h is Euler absolutely stable?

λ is real, thus we are on the real axis. The part of the real axis inside the circle

corresponds to −2 < hλ < 0. Since λ < 0 this gives 0 < h < 2/(−λ) = 2/|λ|.

2. Assume λ is purely imaginary. For which values of h is Euler absolutely stable?

λ is purely imaginary, thus we are on the imaginary axis. There is no point from

the region of absolute stability on the imaginary axis. Thus Euler is unstable for

any h.

[Figure 5.6: Region of absolute stability for RK4: inside the closed curve. Axes: x from −3 to 1, y from −3 to 3.]

5.8.4 Example

We consider the IVP $y' = -20y$ (i.e. λ = −20) with y(t = 0) = 1/3, which has exact solution $y(t) = e^{-20t}/3$. The initial condition $y_0 = 1/3$ cannot be represented exactly, so we have a small round-off error in $y_0$. We use various step sizes h inside and outside the region of stability to see whether the numerical result behaves as the theory predicts.

Euler

Table 5.9 shows results for Euler's method using various step sizes h.

t   y          yi (h = 1)   yi (h = 0.1)   yi (h = 0.05)   yi (h = 0.01)
1   6.87e-10   -6.33e0      3.33e-1        0               6.79e-11
2   1.41e-18   1.20e2       3.33e-1        0               1.38e-20
3   2.91e-27   -2.28e3      3.33e-1        0               2.82e-30
4   6.01e-36   4.34e4       3.33e-1        0               5.47e-40
5   1.24e-44   -8.25e5      3.33e-1        0               1.16e-49

Table 5.9: Stability for test equation using Euler's method.

We see that for h = 1 the numerical solution and the error grow without bound, so the method is unstable. For h = 0.1 the numerical solution and error are bounded but the error does not decay. This means the solution is stable but not absolutely stable. For h < 0.1 the numerical solution and error decay, so the method is absolutely stable. This is exactly what we expect from the theory: absolutely stable for h < 2/|λ| = 2/20 = 0.1.

Trapezoidal rule

Table 5.10 shows results for the trapezoidal rule using various step sizes h.

t   y          yi (h = 1)   yi (h = 0.1)   yi (h = 0.02)   yi (h = 0.01)
1   6.87e-10   -2.73e-1     0.00e0         5.23e-10        6.42e-10
2   1.41e-18   2.23e-1      0.00e0         8.20e-19        1.24e-18
3   2.91e-27   -1.83e-1     0.00e0         1.29e-27        2.39e-27
4   6.01e-36   1.49e-1      0.00e0         2.02e-36        4.60e-36
5   1.24e-44   -1.22e-1     0.00e0         3.16e-45        8.87e-45

Table 5.10: Stability for test equation using the trapezoidal rule.

We note that all numerical solutions are absolutely stable (all decay). This is exactly what we expect from theory: the left half plane implies all h > 0 since λ < 0.

Here we see the advantage of an implicit method: implicit methods can have an infinite

region of stability. For the trapezoidal rule, for Re(λ) < 0 all values of h are in the region

of stability. Explicit methods always have a finite region of stability. Thus h should be

chosen small enough to ensure absolute stability for explicit methods.

Note that although the trapezoidal rule is stable for h = 1, the numerical solution is

not very accurate. Accuracy requires a smaller value of h. Also note that h = 0.1 is a

special case: the amplifying factor equals zero giving a zero solution except for the initial

condition.

RK4

Table 5.11 shows results for RK4 using various step sizes h.

t   y          yi (h = 1)   yi (h = 0.1)   yi (h = 0.02)   yi (h = 0.01)
1   6.87e-10   1.84e03      5.65e-06       6.91e-10        6.87e-10
2   1.41e-18   1.01e07      9.56e-11       1.43e-18        1.42e-18
3   2.91e-27   5.59e10      1.62e-15       2.97e-27        2.92e-27
4   6.01e-36   3.08e14      2.74e-20       6.16e-36        6.02e-36
5   1.24e-44   1.70e18      4.64e-25       1.28e-44        1.24e-44

Table 5.11: Stability for test equation using RK4.

Note that for h = 1, RK4 is unstable. For h = 0.1, however, RK4 is already absolutely stable, contrary to Euler. The reason is that the region of stability of RK4 includes a larger portion of the real axis, which includes hλ = −2 (i.e. h = 0.1).

The unstable behavior for too large values of h is typical for explicit methods. A

very accurate explicit technique doesn’t eliminate the unstable behavior. Stability and

accuracy are two different subjects.

Stability analysis for nonlinear IVPs can be done with a linearization.


To determine the linear stability for $y' = g(y)$ with initial condition $y(0) = y_0$, we also consider the perturbed problem $w' = g(w)$ with $w(0) = y_0 + \epsilon_0$. Introducing $\epsilon = y - w$ and differentiating gives

\[ \epsilon' = y' - w' = g(y) - g(y - \epsilon) = \frac{dg(y)}{dy}\,\epsilon + O(\epsilon^2). \]

Neglecting the higher order terms in $\epsilon$ gives

\[ \epsilon' = \frac{dg(y)}{dy}\,\epsilon. \]

Comparing with the test equation, we now have dg/dy instead of λ. Thus to determine

the stability, dg/dy needs to be evaluated at a known value of y (typically the solution at

the previous time step yi ).

Example: We use Euler's method to solve the nonlinear IVP

\[ y' = -y^2, \qquad y(0) = 1. \]

We have

\[ \frac{dg}{dy} = -2y \]

which is real. For the test equation, Euler requires h < 2/|λ| for stability. Here we have −2y instead of λ. At time $t_i$ we thus get as stability criterion

\[ h < \frac{2}{|-2y_i|} = \frac{1}{|y_i|}. \]

Remarks

• We neglected higher order terms when we did the linearization. Thus it is safer to

take the step size a little smaller than the one obtained from the linearization.

• The stability criterion depends on the solution yi which is a priori unknown. The

maximum allowable step size to guarantee stability for step i thus needs to be

determined at every step.
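The two remarks can be combined in an Euler loop that recomputes the largest allowable step at every time level. The following is a minimal sketch for the example y' = −y², y(0) = 1. The final time, the safety factor 0.5, and the cap hmax are illustrative assumptions and are not prescribed by the text.

```matlab
% Euler for y' = -y^2, y(0) = 1, with the step size limited at every
% step by the linearized stability criterion h < 2/|dg/dy| = 1/|y_i|.
g      = @(y) -y.^2;     % right-hand side g(y)
dgdy   = @(y) -2*y;      % dg/dy, plays the role of lambda
tend   = 5;              % final time (assumed for illustration)
safety = 0.5;            % assumed safety factor (< 1), see first remark
hmax   = 0.05;           % assumed cap on h for accuracy reasons

t = 0;  y = 1;
while t < tend
    hstab = 2/abs(dgdy(y));                  % stability limit at this step
    h = min([safety*hstab, hmax, tend - t]); % do not step past tend
    y = y + h*g(y);                          % one Euler step
    t = t + h;
end
yexact = 1/(1 + tend);                       % exact solution y(t) = 1/(1+t)
err = abs(y - yexact);
fprintf('y(%g) = %.6f, exact = %.6f, error = %.2e\n', tend, y, yexact, err);
```

For this particular problem |y_i| ≤ 1, so the stability limit is never smaller than 1 and the accuracy cap hmax determines the step size. For problems where the solution grows, the stability limit becomes the active constraint.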


5.9 Discussion

Which method you would choose depends on the type of problem you are solving and on

how many times you are solving such problems. You need to consider the following:

• Implementation time. If you only solve a problem once, you want something

that you can program quickly. So you would typically choose an explicit technique

since these are straightforward to program. The longer computing time is often not a problem.

• Accuracy. If you need an accurate solution, you would typically choose a higher-

order method to obtain a small global error (lower computing time for accurate

solutions, but a little harder to implement in more complex problems). If you just

need a rough estimate of what the solution looks like, an O(h) method might be

sufficient.

This avoids the small step sizes needed to satisfy the stability criterion. Particularly when the solution hardly changes over a long period of time this is a good choice.

• Implicit/explicit. If stability is not an issue for the problem you are solving, there is no need to consider implicit techniques. The numerical program is harder to write (an implicit method requires solving a nonlinear equation every time step, for example with bisection or Newton's method) and there is no benefit compared to explicit techniques.


Chapter 6

• Accuracy

• Stability

• Order of convergence

Numerical methods:

• Euler

• Trapezoidal rule

• Runge–Kutta


6.1 Problem description: predator-prey models

In Sec. 3.1 we discussed a population model for a predator-prey system. The resulting

model was

\[ \dot{x}_1 = a x_1 - b x_1 x_2, \qquad \dot{x}_2 = c x_1 x_2 - d x_2, \]

where x1 (t) is the population of prey at time t, x2 (t) the population of predators at time t,

and a, b, c, and d some given constants. In Chap. 3, we determined equilibrium solutions

(i.e. when the populations do not change anymore in size, or dx1 /dt = dx2 /dt = 0). In

this chapter, we will solve the transient equations, i.e. we will predict the populations as

a function of time.

As an example, we consider the system of equations with a = d = 2 and b = c = 1,

\[ \dot{x}_1 = 2x_1 - x_1 x_2, \qquad \dot{x}_2 = x_1 x_2 - 2x_2. \qquad (6.1) \]

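Eq. (6.1) represents a system of initial value problems. The general form for a system of m IVPs is

\[ \begin{pmatrix} y_1' \\ \vdots \\ y_m' \end{pmatrix} = \begin{pmatrix} f_1(t, y_1, \ldots, y_m) \\ \vdots \\ f_m(t, y_1, \ldots, y_m) \end{pmatrix}, \qquad \begin{pmatrix} y_1(t = t_0) \\ \vdots \\ y_m(t = t_0) \end{pmatrix} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_m \end{pmatrix}. \]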


6.2 Checking numerical solutions for systems of IVPs

Systems of IVPs are typically difficult to solve analytically. In this section we discuss four

methods to validate a numerical program for solving a system of equations.

Equilibrium solutions are usually much easier to obtain. A first check to validate a program could be to take the initial condition $y_0$ close to an equilibrium solution and check whether the numerical solution approaches the equilibrium solution. Note that the exact solution does not always approach an equilibrium solution. The exact solution approaches the equilibrium point only when that point is asymptotically stable, which can be checked by inspecting the eigenvalues of the linearized system at the equilibrium point (See Math 2214).

Systems of IVPs can be solved symbolically using the built in Matlab function dsolve. For

the example we consider, you would use

sol = dsolve('Dx = 2*x - x*y', 'Dy = x*y - 2*y', 'x(0) = 1', 'y(0) = 1')

which returns as output

Warning: Explicit solution could not be found.

sol = [ empty sym ]

which means that no analytical solution can be found (by Matlab).

Of course you can simplify the equations and try to find an analytical solution to test

your code. A linear system is much easier to solve, for example

sol = dsolve('Dx = 2*x - y', 'Dy = x - 2*y', 'x(0) = 1', 'y(0) = 1')

gives the analytical solution in sol.x and sol.y. The disadvantage is that a linear system is

usually much easier to solve than a nonlinear system, so that it is not necessarily a good

test to determine whether your program works for nonlinear equations. Of course it is

not completely useless: if it doesn’t work for a linear system there is at least one thing

wrong and what goes wrong for a linear system is easier to detect.

Even though it is hard to solve a nonlinear system of IVPs, it is not difficult to find an analytical solution of a nonlinear system to test a numerical code. Instead of Eq. (6.1), we consider

\[ \dot{x}_1 = 2x_1 - x_1 x_2 + q_1(t), \qquad \dot{x}_2 = x_1 x_2 - 2x_2 + q_2(t). \]

We substitute the analytical solution $x_1(t)$ and $x_2(t)$ that we want and find the corresponding $q_1(t)$ and $q_2(t)$. For example, we take $x_1(t) = e^{-t}$ and $x_2(t) = e^{-2t}$. Substitution gives $q_1(t) = e^{-3t} - 3e^{-t}$ and $q_2(t) = -e^{-3t}$. The initial conditions at t = 0 that correspond to the analytical solution are $x_1(0) = 1$ and $x_2(0) = 1$.

Thus we obtained the analytical solution for a slightly more difficult system of IVPs

\[ \dot{x}_1 = 2x_1 - x_1 x_2 + e^{-3t} - 3e^{-t}, \qquad \dot{x}_2 = x_1 x_2 - 2x_2 - e^{-3t}, \]

with ICs $x_1(0) = 1$ and $x_2(0) = 1$. This should be sufficient to test the numerical code.
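Before using such a constructed solution to test a solver, it is worth checking the source terms themselves. The sketch below evaluates the residual dx/dt − f(t, x) for the chosen solution x1 = e^{−t}, x2 = e^{−2t} with q1 = e^{−3t} − 3e^{−t} and q2 = −e^{−3t}. The sample times and the variable names are illustrative choices.

```matlab
% Check the constructed solution x1(t) = exp(-t), x2(t) = exp(-2t)
% against the forced predator-prey system.
q1 = @(t) exp(-3*t) - 3*exp(-t);             % source term, first equation
q2 = @(t) -exp(-3*t);                        % source term, second equation
f  = @(t, x) [2*x(1) - x(1)*x(2) + q1(t); ...
              x(1)*x(2) - 2*x(2) + q2(t)];   % right-hand-side vector
xexact  = @(t) [exp(-t); exp(-2*t)];         % chosen analytical solution
dxexact = @(t) [-exp(-t); -2*exp(-2*t)];     % its time derivative

tt  = linspace(0, 5, 51);                    % sample times (assumed)
res = zeros(size(tt));
for n = 1:numel(tt)
    % residual dx/dt - f(t,x); should vanish up to round-off
    res(n) = max(abs(dxexact(tt(n)) - f(tt(n), xexact(tt(n)))));
end
maxres = max(res);
fprintf('max residual = %.2e\n', maxres);
```

A residual at round-off level indicates that q1 and q2 were derived correctly, so any remaining discrepancy in a later test comes from the numerical method and not from the test problem.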

Matlab has several built-in functions to solve systems of initial value problems numerically.

We only discuss ode45 which uses a Runge–Kutta technique with adaptive time stepping,

i.e. every step a proper value of the step size h is determined in order to obtain a solution

within a specified tolerance. Matlab's ode45 solves the system of IVPs $y' = f(t, y)$ with $y(a) = y_0$ on the time interval a ≤ t ≤ b.

Systems of initial value problems can be solved using ode45 similar to the scalar IVPs

(See Sec. 5.4). To solve the population problem with default tolerances, one would type

in the Matlab Command Window

[ti, xi] = ode45('func_ode', [0 10], [1 1])

where [0 10] is the time interval [a, b] on which you want to obtain a numerical solution and [1 1] contains the values of the initial conditions $y_0$.

The string func_ode (the quotes indicate that it is a string) specifies the name of the m-file in which the right-hand-side vector f(t, x) is specified. For the population problem, f1(t, x) = x1(2 − x2) and f2(t, x) = x2(x1 − 2) need to be specified as a column vector in the m-file func_ode.m,

function f = func_ode(t, x)
x1 = x(1);
x2 = x(2);
f(1,1) = x1*(2.0 - x2);
f(2,1) = x2*(x1 - 2.0);

The result of ode45 is 2 arrays with the discrete time values used (ti) and the corre-

sponding approximations to the solution (xi). The first column of the matrix xi contains

the approximation to x1 and can be selected using xi(:,1). The second column of the

matrix xi contains the approximation to x2 and can be selected using xi(:,2).

To satisfy the default tolerance values, 101 grid points are used. The accuracy can be

increased by using odeset, similar to scalar IVPs (See Sec. 5.4).
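As a sketch of how tighter tolerances might be requested, the right-hand side can also be passed as a function handle instead of a file name. The tolerance values below are assumptions chosen for illustration. The quantity H is not discussed in the text: for this particular system H = x1 − 2 ln x1 + x2 − 2 ln x2 stays constant along exact solutions, which offers an additional check on the numerical result.

```matlab
% Population problem solved with ode45 and user-specified tolerances.
f = @(t, x) [x(1)*(2 - x(2)); x(2)*(x(1) - 2)];     % column vector f(t,x)
options = odeset('RelTol', 1e-8, 'AbsTol', 1e-10);  % assumed tolerances
[ti, xi] = ode45(f, [0 10], [1; 1], options);

% Invariant of this system (extra check, not part of the text):
% dH/dt = 0 along exact solutions.
H = xi(:,1) - 2*log(xi(:,1)) + xi(:,2) - 2*log(xi(:,2));
drift = max(abs(H - H(1)));
fprintf('%d time levels, drift in H = %.2e\n', numel(ti), drift);
```

Tightening the tolerances increases the number of time levels that ode45 uses, and the drift in H should shrink accordingly.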

Fig. 6.1 shows the numerical solution for x1 and x2. We note that the population does not approach the equilibrium solution (x1, x2) = (2, 2) but oscillates periodically around (x1, x2) = (2, 2). This is easily explained by examining the eigenvalues of the Jacobian at (x1, x2) = (2, 2). The Jacobian has purely imaginary eigenvalues (λ = ±2i), which correspond to a periodic solution.

[Figure 6.1 plot: x1 and x2 versus t for 0 ≤ t ≤ 10.]


6.3 One-step methods

A system of m IVPs can be solved by applying any one-step method discussed in Chapter 5 to the system. Conceptually there is nothing new; it is only more work. We apply every step of the one-step method to all components, i.e. m times instead of once.

Below we discuss Euler, the trapezoidal rule, and RK4 for systems.

Euler

Applying Euler to the vector equation $y' = f(t, y)$ gives

\[ y_{i+1} = y_i + h f(t_i, y_i). \]

Note that the right-hand side only contains values at level i, which are all known. The difference with Euler's method for scalar IVPs in Chapter 5 is that y and f are now vectors (arrays) with m components. This means we just need to evaluate the m component functions $f_1, \ldots, f_m$ and use the values to compute the m components $y_{i+1,j}$, where i = 0, ..., N denotes the time level and j = 1, ..., m the component.

Trapezoidal rule

Applying the trapezoidal rule to the vector equation $y' = f(t, y)$ gives

\[ y_{i+1} = y_i + \frac{h}{2}\left[ f(t_i, y_i) + f(t_{i+1}, y_{i+1}) \right], \]

for i = 0, ..., N − 1.

This is an implicit system of equations since the right-hand side also depends on the a priori unknown $y_{i+1}$. This makes it more difficult to compute $y_{i+1}$. A nonlinear system of equations needs to be solved every time step. This can be done using, for example, Newton's method for systems (See Sec. 3.3). This requires solving a linear system $J y_{i+1} = b(y_i, t_i, t_{i+1})$ at every time step. Since the time step is typically small, $y_i$ is usually a good enough initial guess for Newton's method for systems. Implicit methods for systems require much more work per time step compared to explicit methods but have a larger region of stability.
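For a linear right-hand side f = Ay the implicit step reduces to a single linear solve, (I − (h/2)A) y_{i+1} = (I + (h/2)A) y_i, so no Newton iteration is needed. The sketch below illustrates this special case only. The matrix is taken from the linear dsolve test system (Dx = 2x − y, Dy = x − 2y); the step size and the comparison with expm are assumptions made for the illustration.

```matlab
% Trapezoidal rule for the linear system y' = A*y, y(0) = y0.
A  = [2 -1; 1 -2];          % linear test system from the dsolve example
y0 = [1; 1];
h  = 0.01;  N = 100;        % integrate to t = N*h = 1 (assumed values)
I2 = eye(2);
y  = y0;
for i = 1:N
    % implicit step: (I - h/2*A) y_{i+1} = (I + h/2*A) y_i
    y = (I2 - h/2*A) \ ((I2 + h/2*A)*y);
end
yexact = expm(A*N*h)*y0;    % exact solution via the matrix exponential
err = norm(y - yexact, inf);
fprintf('error at t = %g: %.2e\n', N*h, err);
```

For a nonlinear f the backslash solve would be replaced by a Newton iteration started from y_i, as described above.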

RK4

Applying RK4 to the vector equation $y' = f(t, y)$ gives

\[ \begin{aligned}
k_1 &= h f(t_i, y_i), \\
k_2 &= h f(t_i + h/2,\; y_i + k_1/2), \\
k_3 &= h f(t_i + h/2,\; y_i + k_2/2), \\
k_4 &= h f(t_{i+1},\; y_i + k_3), \\
y_{i+1} &= y_i + (k_1 + 2k_2 + 2k_3 + k_4)/6.
\end{aligned} \]


The difference with the RK4 method for scalar IVPs in Chapter 5 is that y, f, k1, k2, k3, k4 are now vectors (arrays) with m components. All substeps, however, are still explicit.

First k1 needs to be computed for all m. Once all components of k1 are known, the values

of k2 can be computed etc.
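In Matlab the vector substeps look the same as in the scalar case when f returns a column vector. The sketch below applies RK4 to the population problem with h = 0.1. The function handle and the check based on H = x1 − 2 ln x1 + x2 − 2 ln x2 (constant along exact solutions of this particular system) are assumptions added for illustration.

```matlab
% RK4 for the population problem, all m = 2 components at once.
f = @(t, y) [y(1)*(2 - y(2)); y(2)*(y(1) - 2)];   % column vector f(t,y)
h = 0.1;  N = 100;                                % integrate to t = 10
t = 0;  y = [1; 1];
for i = 1:N
    k1 = h*f(t,       y);            % all components of k1 first
    k2 = h*f(t + h/2, y + k1/2);     % then all components of k2, etc.
    k3 = h*f(t + h/2, y + k2/2);
    k4 = h*f(t + h,   y + k3);
    y  = y + (k1 + 2*k2 + 2*k3 + k4)/6;
    t  = t + h;
end
H = y(1) - 2*log(y(1)) + y(2) - 2*log(y(2));      % equals 2 at t = 0
fprintf('x1 = %.4f, x2 = %.4f, H - 2 = %.2e\n', y(1), y(2), H - 2);
```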

One-step methods for systems of IVPs have a similar structure as the one-step methods for scalar IVPs. We just need to do every operation for all components, i.e. m times. The algorithm is therefore almost identical, except that we now have a vector y instead of a scalar y.

Input: discrete times t_i, value of initial vector y_0.
Initializations
    Set initial condition
    Compute number of subintervals N
One-step method
    Do for i = 0, ..., N − 1
        Compute step size h
        Compute next approximation y_{i+1} from the known values h, y_i, t_i, and t_{i+1}.
    End do (i-loop)
Output: approximations y_{i,j} (in Matlab y(i+1,j)) for all grid points in time i = 0, ..., N and for every component j = 1, ..., m.

To do one step of Euler inside the for-loop, you can use colon notation in Matlab to pass all components of y_i to a function and compute y_{i+1}:

f = funcivp(t(i), y(i,:));
y(i+1,:) = y(i,:) + h*f;

where funcivp computes all components of the right-hand-side vector f using the solution y_i and time t_i. The colon notation in Matlab does the operation for all possible values at the place of the colon. Thus y(i,:) is a vector of length m with all components of y_i, so the function funcivp receives all components of the vector, as it should. The second line computes y_{i+1} for all components, due to the colon in the two-dimensional array.
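Putting the algorithm and the colon notation together gives a complete Euler program for systems. The sketch below is one possible arrangement: funcivp is written here as an anonymous function returning a row vector (an assumption, so that it can be added to the row y(i,:)), and the time grid and initial values are those of the population problem.

```matlab
% Euler for a system of m IVPs, storing all time levels in y(i+1,j).
funcivp = @(t, y) [y(1)*(2 - y(2)), y(2)*(y(1) - 2)];  % row vector f
t  = linspace(0, 10, 101);       % discrete times t_i (h = 0.1)
y0 = [1 1];                      % initial vector

% Initializations
N = numel(t) - 1;                % number of subintervals
m = numel(y0);                   % number of components
y = zeros(N+1, m);
y(1,:) = y0;                     % set initial condition

% One-step method
for i = 1:N
    h = t(i+1) - t(i);           % step size
    f = funcivp(t(i), y(i,:));   % right-hand side at level i
    y(i+1,:) = y(i,:) + h*f;     % Euler step for all components
end
plot(t, y(:,1), t, y(:,2));  legend('x1', 'x2');  xlabel('t');
```

Because Matlab indices start at 1, the row y(i+1,:) holds the approximation at time level i of the algorithm.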

Remarks

• By using a function funcivp to compute the right-hand-side vector f, you keep euler general.

• Different one-step methods have the same structure. Only the part inside the for-loop needs to be modified. For RK4 just some more explicit substeps need to be performed. If you solve a nonlinear system with the implicit trapezoidal rule, you need to solve a system of nonlinear algebraic equations every time step, which can be done with Newton's method for systems. Then you need to call a function newtonsys to find the vector y_{i+1}. Since the time step is typically small, the vector y_i is usually a good enough initial guess for Newton's method.

• Alternatively, for-loops could be used instead of the colon notation. For-loops would typically be used in Fortran-77 or C.

• If the number of time steps is large, you might not want to store all y_i's. You could overwrite (y(:) = y(:) + h*f;) and print the values of y(:) to a file every now and then. Then you only use an array of length m.

The numerical solutions using Euler and RK4 with h = 0.1 are displayed in Fig. 6.2. The solution obtained using RK4 agrees well with the ode45 solution, but the Euler solution is far off at larger values of t.

[Figure 6.2 plots: (a) x1 versus t and (b) x2 versus t for 0 ≤ t ≤ 10; curves: ode45, euler, rk4.]

Figure 6.2: Numerical solution of the population problem using ode45, Euler and RK4. (a) x1, and (b) x2.


6.4 Accuracy

To determine the local truncation error and global error for a system of IVPs at time $t_i$, we use the $l_\infty$ norm

\[ \epsilon_\infty = \| y(t_i) - y_i \|_\infty = \max_{j = 1, \ldots, m} | y_j(t_i) - y_{i,j} |, \]

i.e. the maximum over all components j = 1, ..., m.

The results of the analysis of the local truncation error for the scalar test equation

remain valid for systems of IVPs. Furthermore, the global error is still one order lower

than the local truncation error. Thus when there are no discontinuities, we have the

following global errors:

• Euler: $O(h)$.

• Trapezoidal rule: $O(h^2)$.

• RK4: $O(h^4)$.

Example

We solve using one-step methods

\[ \dot{x}_1 = 2x_1 - x_1 x_2 + e^{-3t} - 3e^{-t}, \qquad \dot{x}_2 = x_1 x_2 - 2x_2 - e^{-3t}, \]

with ICs $x_1(0) = 1$ and $x_2(0) = 1$. The analytical solution is $x_1(t) = e^{-t}$ and $x_2(t) = e^{-2t}$. The global error at t = 1 for various values of h is given in Table 6.1.

h         Euler      RK4
0.1       4.00e-02   9.69e-06
0.05      1.96e-02   5.88e-07
0.025     9.69e-03   3.62e-08
0.0125    4.82e-03   2.24e-09
0.00625   2.40e-03   1.39e-10

We observe that RK4 using h = 0.1 is already much more accurate than Euler using h = 0.00625.

The errors are plotted as $\log_{10} \epsilon_\infty$ versus $\log_{10} h$ in Fig. 6.3. The error behaves as expected. For Euler the slope equals approximately 1, indicating a global error of $O(h)$. For RK4 the slope equals approximately 4, indicating a global error of $O(h^4)$.
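The slope can also be estimated without a plot: when h is halved, an error of O(h^p) drops by a factor 2^p, so p ≈ log2(ε(h)/ε(h/2)). The sketch below does this for the forced predator-prey test problem with exact solution x1 = e^{−t}, x2 = e^{−2t}. The set of step sizes and the variable names are illustrative assumptions.

```matlab
% Observed order of Euler and RK4 at t = 1 for the test problem with
% known exact solution x1 = exp(-t), x2 = exp(-2t).
q1 = @(t) exp(-3*t) - 3*exp(-t);
q2 = @(t) -exp(-3*t);
f  = @(t, x) [2*x(1) - x(1)*x(2) + q1(t); x(1)*x(2) - 2*x(2) + q2(t)];
xexact = [exp(-1); exp(-2)];             % exact solution at t = 1
hs  = [0.1 0.05 0.025];                  % step sizes (assumed subset)
err = zeros(2, numel(hs));               % row 1: Euler, row 2: RK4
for n = 1:numel(hs)
    h = hs(n);  N = round(1/h);
    xe = [1; 1];  xr = [1; 1];  t = 0;
    for i = 1:N
        xe = xe + h*f(t, xe);                     % Euler step
        k1 = h*f(t,       xr);                    % RK4 substeps
        k2 = h*f(t + h/2, xr + k1/2);
        k3 = h*f(t + h/2, xr + k2/2);
        k4 = h*f(t + h,   xr + k3);
        xr = xr + (k1 + 2*k2 + 2*k3 + k4)/6;
        t  = i*h;                                 % time after step i
    end
    err(1,n) = norm(xe - xexact, inf);            % l-infinity global error
    err(2,n) = norm(xr - xexact, inf);
end
order = log2(err(:,1:end-1)./err(:,2:end));       % observed orders p
disp(order)
```

The first row of order should be close to 1 and the second row close to 4 if the implementation is correct. An observed order that is too low is a common sign of a programming error.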

132

0

euler

rk4

−2

−4

log10 ε∞

−6

−8

−10

−12

−14

−3.5 −3 −2.5 −2 −1.5 −1

log10 h

133

6.5 Stability of one-step methods

To determine whether small errors grow without bound or not, we follow the same struc-

ture as in Sec. 5.8 for scalar IVPs. We first consider stability for linear systems and then

extend the result to non-linear systems.

We consider the linear system of IVPs

\[ y' = Ay, \qquad y(t_0) = y_0. \]

A similar analysis as for the test equation in Sec. 5.8.2 can be performed, but now using an amplifying matrix G(hA) instead of the amplifying factor k(hλ). The result is a criterion similar to that for scalar IVPs. A one-step method for systems is absolutely stable if

\[ |k(h\lambda_j)| < 1, \qquad j = 1, \ldots, m, \]

where $\lambda_j$ are the m eigenvalues of the matrix A. Note that all m eigenvalues $\lambda_j$ need to satisfy the stability criterion. If $|k(h\lambda_j)| > 1$ for even one eigenvalue, the numerical technique becomes unstable.

Using the expressions for the amplifying factors k(hλ) obtained in Sec. 5.6 gives

• Euler: $|1 + h\lambda_j| < 1$ for j = 1, ..., m. This means all m values of $h\lambda_j$ should be inside the circle depicted in Fig. 5.4.

• Trapezoidal rule: $\left| \dfrac{1 + h\lambda_j/2}{1 - h\lambda_j/2} \right| < 1$ for j = 1, ..., m. This means all m values of $h\lambda_j$ should be in the left half-plane.

• RK4: $\left| 1 + h\lambda_j + \frac{1}{2}(h\lambda_j)^2 + \frac{1}{6}(h\lambda_j)^3 + \frac{1}{24}(h\lambda_j)^4 \right| < 1$ for j = 1, ..., m. This means all m values of $h\lambda_j$ should be inside the curve depicted in Fig. 5.6.
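These criteria are easy to check numerically once the eigenvalues are available. The sketch below evaluates the three amplifying factors for all eigenvalues of a small example matrix and two step sizes. The matrix (with eigenvalues −2 and −20) and the step sizes are assumptions chosen for illustration.

```matlab
% Check |k(h*lambda_j)| < 1 for all eigenvalues of A.
A = [-2 0; 1 -20];                  % example matrix (assumed)
lambda = eig(A);                    % eigenvalues -2 and -20
kEuler = @(z) 1 + z;
kTrap  = @(z) (1 + z/2)./(1 - z/2);
kRK4   = @(z) 1 + z + z.^2/2 + z.^3/6 + z.^4/24;
hs = [0.2 0.05];                    % step sizes to test (assumed)
stable = false(numel(hs), 3);       % columns: Euler, trapezoidal, RK4
for n = 1:numel(hs)
    z = hs(n)*lambda;               % all values h*lambda_j
    stable(n,:) = [all(abs(kEuler(z)) < 1), ...
                   all(abs(kTrap(z))  < 1), ...
                   all(abs(kRK4(z))   < 1)];
end
disp(stable)
```

For h = 0.2 the eigenvalue −20 gives hλ = −4, which lies outside the stability regions of Euler and RK4 but inside the left half-plane, so only the trapezoidal rule is absolutely stable. For h = 0.05 all three methods satisfy the criterion.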

We consider the nonlinear system of IVPs

\[ y' = g(y), \qquad y(t_0) = y_0. \]

A similar analysis as in Sec. 5.8.5 can be performed. Local linearization of the nonlinear system leads to a vector equation for the error $\epsilon = y - w$,

\[ \epsilon' = J\epsilon, \]

where J is the Jacobian matrix. This linear system can be treated as in Sec. 6.5.1.


Example

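We use Euler's method for systems to solve the nonlinear system

\[ \begin{pmatrix} y_1' \\ y_2' \end{pmatrix} = \begin{pmatrix} -y_1^2 \\ y_1 - 20 y_2 \end{pmatrix}, \qquad \begin{pmatrix} y_1(0) \\ y_2(0) \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \]

We have

\[ J = \begin{pmatrix} -2y_1 & 0 \\ 1 & -20 \end{pmatrix} \]

which has two real eigenvalues $\lambda_1 = -2y_1$ and $\lambda_2 = -20$. For stability we need $|k(h\lambda_j)| < 1$, which for Euler becomes $h < 2/|\lambda_j|$. This gives two conditions, $h < 1/|y_1|$ and $h < 0.1$, or combining

\[ h < \min\left( \frac{1}{|y_1|},\; 0.1 \right). \]

The numerical solution for the second component (labeled x2 in Fig. 6.4) for h = 0.2, h = 0.1, and h = 0.05 is depicted in Fig. 6.4. Stability is in agreement with the linearized theory: absolutely stable for h < 0.1.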

[Figure 6.4 plot: x2 versus t on 0 ≤ t ≤ 1 for h = 0.2, h = 0.1, and h = 0.05.]

Figure 6.4: Unstable, stable, and absolutely stable behavior of Euler's method for systems at various h.

Remarks

• We neglected higher order terms when we did the linearization. Thus it is safer to

take the step size a little smaller than the one obtained from the linearization.

• The stability criterion depends on the solution y i which is a priori unknown. The

maximum allowable step size to guarantee stability for step i thus needs to be

determined at every step.
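The three regimes of the example can be reproduced with a few lines of Euler for systems. The sketch below integrates the example system to t = 1 for the three step sizes and records the final value of the second component. The variable names are illustrative.

```matlab
% Euler for y1' = -y1^2, y2' = y1 - 20*y2, y(0) = [1; 1],
% for step sizes above, at, and below the stability limit h = 0.1.
g  = @(y) [-y(1)^2; y(1) - 20*y(2)];
hs = [0.2 0.1 0.05];
y2end = zeros(size(hs));
for n = 1:numel(hs)
    h = hs(n);  N = round(1/h);      % integrate to t = 1
    y = [1; 1];
    for i = 1:N
        y = y + h*g(y);              % Euler step for both components
    end
    y2end(n) = y(2);
    fprintf('h = %4.2f: y2(1) = %10.3e\n', h, y2end(n));
end
```

For h = 0.2 the factor 1 + hλ2 equals −3, so the second component grows in magnitude with alternating sign. For h = 0.1 the factor is −1 and the component stays bounded without decaying. For h = 0.05 the factor is 0 and the component follows 0.05 y1 after the first step.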


Chapter 7

• Accuracy

• Stability

Numerical methods:


7.1 Problem description: pollution models

7.1.1 Governing equation

In Sec. 4.1 we discussed pollution of a narrow and shallow river for which the concentration

of the pollutant depends on x (coordinate along the river) and time t: c = c(x, t). Flow

will then occur in the x direction only, represented by a scalar velocity v. The resulting

model was the partial differential equation (PDE) of Eq. (4.2)

\[ \frac{\partial c}{\partial t} = -v \frac{\partial c}{\partial x} + D \frac{\partial^2 c}{\partial x^2} + r - kc, \]

where v is the velocity of the river, D the diffusivity, r(x, t) a production term, and k the

rate of decay. In Chap. 4, we determined equilibrium solutions (i.e. when the concentra-

tion does not change anymore in time, or ∂c/∂t = 0) by solving a BVP. In this chapter,

we will solve the partial differential equation, i.e. we will predict the concentration as a

function of time t and space x.

As an example, we consider the PDE with v = k = r = 0 and D = 1/π²,

\[ \frac{\partial c}{\partial t} = \frac{1}{\pi^2} \frac{\partial^2 c}{\partial x^2}. \qquad (7.1) \]

Eq. (7.1) contains a second order derivative in space and a first order derivative in time.

From differential equations theory, we know that to determine a unique solution of a PDE

that is nth order in 1D space (x) and mth order in time, we need n boundary conditions

and m initial conditions. Thus for the PDE considered here, we need two boundary

conditions and one initial condition.

Initial conditions are similar to initial conditions of IVPs (except that the constant

can now be a function of x). As initial condition, the concentration can be specified at

time t0 for all values of x

c(x, t = t0 ) = c0 (x),

where c0 (x) is the initial concentration profile.

Boundary conditions are similar to boundary conditions for BVPs (except that the

constants can now be functions of t and that we have partial derivatives instead of deriva-

tives for a function of 1 variable). The following boundary conditions can be specified at

a boundary x = xb for all values of t.

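• Dirichlet boundary conditions. The concentration is prescribed at the boundary:

\[ c(x = x_b, t) = C_D(t), \]

where $C_D(t)$ may depend on time t.

• Neumann boundary conditions. The mass flux (or concentration gradient) is prescribed at the boundary:

\[ \frac{\partial c}{\partial x}(x = x_b, t) = C_N(t), \]

where $C_N(t)$ may depend on t.

• Robin boundary conditions. A combination of the concentration gradient and the concentration is prescribed at the boundary, which gives a mixed condition:

\[ \frac{\partial c}{\partial x}(x = x_b, t) + k_R(t)\, c(x = x_b, t) = C_R(t), \]

where $k_R(t)$ and $C_R(t)$ may depend on t.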

7.1.3 Example

The example we use for PDEs throughout this chapter is

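\[ \begin{aligned}
&\frac{\partial c}{\partial t} = \frac{1}{\pi^2} \frac{\partial^2 c}{\partial x^2}, \qquad 0 < x < 1, \\
&c(x = 0, t) = 0, \qquad c(x = 1, t) = 0, \\
&c(x, t = 0) = \sin(\pi x).
\end{aligned} \qquad (7.2) \]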


7.2 Validation of numerical code for PDEs

Numerical calculations to solve PDEs are much more involved than calculations for ODEs.

It is thus very important to check the numerical solution carefully. We discuss three ways.

For equilibrium solutions, the solution doesn’t change anymore in time, i.e. ∂c/∂t = 0.

This gives a BVP which can more easily be solved, in simple cases analytically. For the

example Eq. (7.2) we need to solve $d^2c/dx^2 = 0$ which gives c(x) = Ax + B. Applying the

boundary conditions c(x = 0) = 0 and c(x = 1) = 0 gives A = B = 0 and the equilibrium

solution is c(x) = 0. The equilibrium also makes physical sense: diffusion transports all

pollution through the boundaries.

Even though it is hard to solve a PDE, it is not difficult to find an analytical solution of a PDE to test a numerical code. Instead of Eq. (7.2), we consider

\[ \frac{\partial c}{\partial t} = \frac{1}{\pi^2} \frac{\partial^2 c}{\partial x^2} + f(x, t), \qquad 0 < x < 1. \]

From the theory of partial differential equations we know that solutions can be written as products of exponentials in time and sine or cosine functions in space. We take

\[ c(x, t) = e^{-t} \sin(\pi x). \]

We substitute this analytical solution c(x, t) into the PDE and find the corresponding f(x, t). We find f(x, t) = 0, which means that $c(x, t) = e^{-t} \sin(\pi x)$ satisfies the PDE in Eq. (7.2). The initial condition that corresponds to this solution is c(x, t = 0) = sin(πx). The boundary conditions that correspond to this solution are c(x = 0, t) = 0 and c(x = 1, t) = 0. Both the initial and the boundary conditions are identical to those in Eq. (7.2). Thus $c(x, t) = e^{-t} \sin(\pi x)$ is the solution of Eq. (7.2).
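The substitution can be written out explicitly. With c(x, t) = e^{−t} sin(πx), the two derivatives and the resulting source term are:

```latex
\begin{align*}
\frac{\partial c}{\partial t} &= -e^{-t}\sin(\pi x), \\
\frac{\partial^2 c}{\partial x^2} &= -\pi^2 e^{-t}\sin(\pi x), \\
f(x,t) &= \frac{\partial c}{\partial t}
          - \frac{1}{\pi^2}\frac{\partial^2 c}{\partial x^2}
        = -e^{-t}\sin(\pi x) + e^{-t}\sin(\pi x) = 0 .
\end{align*}
```

The factor π² produced by differentiating sin(πx) twice cancels the diffusivity 1/π², which is why this choice of D leads to the simple decay rate e^{−t}.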

Fig. 7.1 shows the analytical solution for various times.

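PDEs in time and 1D space can be solved using the built-in Matlab function pdepe. We only consider the case m=0 in pdepe, which is sufficient to solve the problems we consider. Then pdepe solves the PDE

\[ c\left(x, t, u, \frac{\partial u}{\partial x}\right) \frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left[ f\left(x, t, u, \frac{\partial u}{\partial x}\right) \right] + s\left(x, t, u, \frac{\partial u}{\partial x}\right). \]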

[Figure 7.1 plot: c(x,t) versus x on 0 ≤ x ≤ 1 for t = 0, 0.1, 0.5, 1, 2, 10.]

Figure 7.1: Numerical solution of the pde Eq. (7.2) at various time levels.

For our example pde Eq. (7.2), we have $c(x, t, u, \partial u/\partial x) = 1$, $f(x, t, u, \partial u/\partial x) = (1/\pi^2)\, \partial u/\partial x$, and $s(x, t, u, \partial u/\partial x) = 0$. The boundary conditions for the left and right point should have the form

\[ p(x, t, u) + q(x, t)\, f(x, t, u, \partial u/\partial x) = 0, \]

where f is identical to the f in the pde. For our example pde Eq. (7.2) we have for both the left and right boundary point p = u and q = 0.

A pde can be solved by typing in the Command Window

sol = pdepe(0, 'pdefun', 'pdeic', 'pdebc', xj, ti);

where the first input argument is the value of m (here m = 0). The array xj should contain the grid points at which you want to obtain the numerical solution. The array ti should contain the time values at which you want to obtain the numerical approximation (note that these are not all time levels that Matlab uses in the computation, only the time levels at which a solution is stored in the output array sol). Intermediate time levels to obtain a sufficiently accurate solution are determined in pdepe. In the solution matrix sol(i,j), row i corresponds to the selected time level ti and column j to the grid point xj.

The three strings pdefun, pdeic, and pdebc are the names of the m-files that contain

the info for the pde, initial condition, and boundary conditions, respectively.

For our example pde Eq. (7.2), we would use the following three m-files:

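function [c, f, s] = pdefun(x, t, u, DuDx)
% Evaluate c, f, s for c Du/Dt = Df/Dx + s
c = 1;
f = DuDx / pi^2;
s = 0;

function u0 = pdeic(x)
% Initial condition for a pde as a function of x
u0 = sin(pi*x);

function [pl, ql, pr, qr] = pdebc(xl, ul, xr, ur, t)
% Boundary conditions for a pde at the left (l) and right (r) boundary
pl = ul;
ql = 0;
pr = ur;
qr = 0;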

Fig. 7.2 shows the numerical approximation using pdepe together with the exact solution as a function of x at various times. The results for pdepe were obtained using h = 1/10. We see that qualitatively (on the scale of the figure) the numerical solution agrees well with the exact solution.

[Figure 7.2 plot: c(x,t) versus x on 0 ≤ x ≤ 1 for t = 0, 0.1, 0.5, 1, 2, 10.]

Figure 7.2: Exact solution (solid line) and numerical approximation using pdepe (symbols) of Eq. (7.2) at times indicated in the legend.


7.3 Solving PDEs numerically: Introduction

The solution c(x, t) of Eq. (7.2) depends on spatial coordinate x and time t. Thus we now

need a grid for the spatial and time integration. For the spatial discretization we use a

grid xj for j = 0, . . . , m as we used to solve BVPs in Chap. 4. To keep the algebra as

simple as possible, we only consider equally spaced grid points xj with grid size h. For the

time discretization we use a grid tk for k = 0, . . . , n as we used to solve IVPs in Chap. 5.

To keep the algebra as simple as possible, we only consider equally spaced times tk with

step size ∆t.

In a PDE, partial derivatives need to be discretized. Discretization of partial deriva-

tives is identical to the discretization of derivatives of functions of 1 variable. Thus for

Eq. (7.2) we can use the finite difference formulas or finite element method discussed for BVPs to discretize $\partial^2 c/\partial x^2$ (See Sec. 4.5 and 4.7).

The general strategy to discretize a PDE in time and 1D space is as follows.

• First discretize the PDE in the spatial direction x. For this you can use finite differences or finite elements. This is similar to BVPs (See Sec. 4.5 and 4.7).

• This gives a system of IVPs in time for the values at the grid points. The boundary conditions can be eliminated as in Sec. 4.6 for BVPs.

• Solve the system of IVPs using any method for systems of IVPs. For example, Euler, trapezoidal rule, or RK4 for systems (See Chap. 6).

• Add the boundary values to the solution (add y0 and/or ym).


7.4 Finite differences

We only consider the PDE with boundary and initial conditions as described in Eq. (7.2).

The PDE is first discretized in the x direction using finite difference formulas discussed in

Sec. 4.5. The key difference with section 4.5 is that after the discretization in space the

approximate values at the nodes are still a function of time: $c_j = c_j(t)$. Thus after using the central $O(h^2)$ approximation for $\partial^2 c/\partial x^2$ we get the m − 1 ODEs and 2 boundary conditions

\[ \begin{aligned}
c_0 &= 0, \\
\frac{dc_j}{dt}(t) &= -\frac{1}{\pi^2 h^2} \left[ -c_{j-1}(t) + 2c_j(t) - c_{j+1}(t) \right], \qquad j = 1, \ldots, m-1, \\
c_m &= 0,
\end{aligned} \]

with initial condition $c_j(t = 0) = \sin(\pi x_j)$. Note that the partial derivative with respect to time has become a d/dt derivative since $c_j$ only depends on time.

Next we eliminate the boundary conditions, to get a system of IVPs (See Sec. 4.6).

Only the equation for j = 1 contains c0 and only the equation for node j = m − 1 contains

cm . Substitution of c0 = 0 and cm = 0 gives

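\[ \begin{aligned}
\frac{dc_1}{dt} &= -\frac{1}{\pi^2 h^2} \left( -0 + 2c_1 - c_2 \right) = -\frac{1}{\pi^2 h^2} \left( 2c_1 - c_2 \right), \\
\frac{dc_j}{dt}(t) &= -\frac{1}{\pi^2 h^2} \left[ -c_{j-1}(t) + 2c_j(t) - c_{j+1}(t) \right], \qquad j = 2, \ldots, m-2, \\
\frac{dc_{m-1}}{dt} &= -\frac{1}{\pi^2 h^2} \left( -c_{m-2} + 2c_{m-1} - 0 \right) = -\frac{1}{\pi^2 h^2} \left( -c_{m-2} + 2c_{m-1} \right),
\end{aligned} \]

with initial condition $c_j(t = 0) = \sin(\pi x_j)$.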

We obtained a system of IVPs which can be written in general matrix-vector form

\[ \frac{dc}{dt} = Pc + r \]

with

\[ P = \frac{-1}{\pi^2 h^2} \begin{pmatrix} 2 & -1 & 0 & \cdots & 0 \\ -1 & 2 & -1 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & -1 & 2 & -1 \\ 0 & \cdots & 0 & -1 & 2 \end{pmatrix}, \qquad c = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_{m-2} \\ c_{m-1} \end{pmatrix}, \qquad r = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}, \]

with initial condition $c(t = 0) = \sin(\pi x)$. Note that in general r is non-zero due to non-zero boundary conditions and/or a non-zero right-hand-side function r(x, t) in the PDE.

The system of IVPs can be solved in time using any of the numerical techniques discussed in Sec. 6.3 for $c' = f(t, c)$. The right-hand-side vector is now $f(t, c) = Pc + r$.
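As a sketch of the complete procedure, the following applies Euler to the finite difference system for Eq. (7.2) and compares the result with the exact solution e^{−t} sin(πx) at t = 1. The values h = 1/10 and Δt = 2·10⁻³ and the final time are assumptions for the illustration. The product Pc is evaluated with shifted copies of c rather than by forming the matrix P.

```matlab
% Finite differences in space + Euler in time for Eq. (7.2).
m  = 10;  h = 1/m;               % grid size h = 1/10
dt = 2e-3;                       % time step (assumed)
x  = (1:m-1)'*h;                 % interior grid points x_1, ..., x_{m-1}
c  = sin(pi*x);                  % initial condition at the interior nodes
nsteps = round(1/dt);            % integrate to t = 1
for k = 1:nsteps
    cl = [0; c(1:end-1)];        % c_{j-1}, with boundary value c_0 = 0
    cr = [c(2:end); 0];          % c_{j+1}, with boundary value c_m = 0
    f  = -(-cl + 2*c - cr)/(pi^2*h^2);   % f(t,c) = P*c (r = 0 here)
    c  = c + dt*f;               % Euler step for all interior nodes
end
call = [0; c; 0];                % add the boundary values again
err  = max(abs(c - exp(-1)*sin(pi*x)));
fprintf('max error at t = 1: %.2e\n', err);
```

The remaining error combines the O(h²) spatial error with the O(Δt) error of Euler. Halving Δt alone therefore does not make the error go to zero, because the spatial error stays.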

Remarks


• Since you have a large system of equations, and you want to be able to change the number of grid points m easily, you would use a for-loop to compute the right-hand-side vector f(t, c) = Pc + r.

• For explicit techniques there is no need to store the full matrix. Only the vector Pc is necessary.

• A tridiagonal linear system needs to be solved for implicit techniques to find c at the new time level i + 1. This can be done with Crout's method (See Sec. 4.9) so that solving the system is still fast. Also no full matrix needs to be stored. Just the three diagonals are sufficient.

• For nonlinear PDEs you need to solve a nonlinear system every time step, using for example Newton's method for systems.

Fig. 7.3 shows the finite difference approximation together with the exact solution as a function of x at various times. The results for the FD method were obtained using h = 1/10 and $\Delta t = 2 \cdot 10^{-3}$. We see that qualitatively (on the scale of the figure) the numerical solution agrees well with the exact solution.

[Figure 7.3 plot: c(x,t) versus x on 0 ≤ x ≤ 1 for t = 0, 0.1, 0.5, 1, 2, 10.]

Figure 7.3: Exact solution (solid line) and FD approximation (symbols) of Eq. (7.2) at times indicated in the legend.


7.5 Finite elements

We only consider the PDE with boundary and initial conditions as described in Eq. (7.2).

The PDE is first discretized in the x direction using finite elements. We only consider

linear finite elements as in Sec. 4.7. The key difference with section 4.7 is that after the

discretization in space the approximate values at the nodes are still a function of time:

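\[ c(x, t) = \sum_{j=0}^{m} c_j(t)\, \phi_j(x). \]

The weak form of the heat equation (PDE) is again obtained by multiplying by a test function ψ and integrating over the domain, using integration by parts for the diffusion term:

\[ \int_0^1 \frac{\partial c}{\partial t}\, \psi \, dx = -\frac{1}{\pi^2} \int_0^1 \frac{\partial c}{\partial x} \frac{\partial \psi}{\partial x} \, dx + \frac{1}{\pi^2} \left[ \psi \frac{\partial c}{\partial x} \right]_0^1. \]

The Galerkin finite element method is obtained by choosing for the test functions ψ the basis functions $\phi_i(x)$, i = 0, ..., m, substituting the finite element approximation $c(x, t) = \sum_{j=0}^{m} c_j(t) \phi_j(x)$ in the integral terms, and evaluating the integrals element-by-element:

\[ \sum_{l=1}^{m} \int_{e_l} \sum_{j=0}^{m} \frac{dc_j}{dt}\, \phi_j \phi_i \, dx = -\sum_{l=1}^{m} \int_{e_l} \frac{1}{\pi^2} \sum_{j=0}^{m} c_j \frac{d\phi_j}{dx} \frac{d\phi_i}{dx} \, dx + \frac{1}{\pi^2} \left[ \phi_i \frac{\partial c}{\partial x} \right]_0^1, \]

for i = 0, ..., m. The contributions of the element integrals can be put in two element matrices (the element vector is the zero vector). After evaluating the element integrals, we obtain

\[ k^{(l)} = \frac{h}{6} \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad p^{(l)} = \frac{-1}{\pi^2 h} \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \]

where $k^{(l)}$ is the element matrix corresponding to the $dc_j/dt$ term and $p^{(l)}$ the element matrix corresponding to the $c_j$ term.

Assembling into two global matrices K and P gives the (m + 1) × (m + 1) matrices

\[ K = \frac{h}{6} \begin{pmatrix} 2 & 1 & 0 & \cdots & 0 \\ 1 & 4 & 1 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & 1 & 4 & 1 \\ 0 & \cdots & 0 & 1 & 2 \end{pmatrix}, \qquad P = \frac{-1}{\pi^2 h} \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ -1 & 2 & -1 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & -1 & 2 & -1 \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix}. \]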

Boundary conditions are handled exactly the same way as in Sec. 4.7. For the two

Dirichlet boundary conditions considered here, we replace the equation for i = 0 by

c0 = 0, and the equation for i = m by cm = 0. Next we eliminate the Dirichlet boundary


conditions, to get a system of IVPs (See Sec. 4.6). Note that we do not only need to

eliminate c0 and cm from the equations, but also dc0 /dt and dcm /dt. This does not lead

to major complications since we know the Dirichlet boundary conditions as a function of

time. Since the values of the boundary conditions we consider are independent of time,

both derivatives are zero. Only the equation for i = 1 contains contributions from c0 and

dc0 /dt and only equation for i = m − 1 contains cm and dcm /dt. Substitution of c0 = 0,

dc0 /dt = 0, cm = 0, and dcm /dt = 0 gives in matrix-vector form

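\[ K \frac{dc}{dt} = Pc + r, \]

where K and P are (m − 1) × (m − 1) matrices and r an m − 1 vector:

\[ K = \frac{h}{6} \begin{pmatrix} 4 & 1 & 0 & \cdots & 0 \\ 1 & 4 & 1 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & 1 & 4 & 1 \\ 0 & \cdots & 0 & 1 & 4 \end{pmatrix}, \qquad P = \frac{-1}{\pi^2 h} \begin{pmatrix} 2 & -1 & 0 & \cdots & 0 \\ -1 & 2 & -1 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & -1 & 2 & -1 \\ 0 & \cdots & 0 & -1 & 2 \end{pmatrix}, \qquad r = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}. \]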

The system of equations can be written as a system of IVPs by multiplying by $K^{-1}$:

\[
\frac{dc}{dt} = K^{-1} P c + K^{-1} r.
\]

Applying Euler with time step $\Delta t$, for example, would give (since $r = 0$)

\[
c_{i+1} = c_i + \Delta t\, K^{-1} P c_i .
\]

Since it is computationally expensive to compute the inverse, however, you would multiply by $K$ and solve the system

\[
K c_{i+1} = (K + \Delta t\, P) c_i .
\]

Since $K$ is tridiagonal, this system can be solved very efficiently using Crout's method (see Sec. 4.9). No full matrix needs to be stored; the three diagonals are sufficient. For the right-hand side there is also no need to form the big matrices $K$ and $P$ explicitly. The result of the matrix-vector product $(K + \Delta t\, P) c_i$ is a vector, and that is all you need.
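A minimal Matlab sketch of this time-stepping loop is given below. It uses sparse storage and the backslash operator rather than a hand-written Crout solver. The PDE form $c_t = c_{xx}/\pi^2$ is inferred from the matrices above, and the initial condition $\sin(\pi x)$ (with exact solution $e^{-t}\sin(\pi x)$) is an assumption made here so that the result can be checked; it is not taken from Eq. (7.2).

```matlab
% Linear FEM in space + Euler in time: solve K*c_new = (K + dt*P)*c_old.
m  = 10;  h = 1/m;  dt = 2e-3;           % h and dt as quoted for Fig. 7.4
n  = m - 1;                              % number of interior unknowns
e  = ones(n,1);
K  = (h/6)          * spdiags([e 4*e e],   -1:1, n, n);  % only 3 diagonals stored
P  = (-1/(pi^2*h))  * spdiags([-e 2*e -e], -1:1, n, n);
x  = (1:n)'*h;                           % interior grid points
c  = sin(pi*x);                          % ASSUMED initial condition
nsteps = round(0.5/dt);                  % integrate up to t = 0.5
for i = 1:nsteps
    c = K \ ((K + dt*P)*c);              % one Euler step; no inverse is formed
end
err = max(abs(c - exp(-0.5)*sin(pi*x))); % compare with exact solution of assumed problem
```

For sparse tridiagonal matrices the backslash operator selects a banded solver, so each step costs O(m) operations.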

Fig. 7.4 shows the linear finite element approximation together with the exact solution

as a function of x at various times. The results for FEM were obtained using $h = 1/10$ and $\Delta t = 2 \times 10^{-3}$. We see that qualitatively (on the scale of the figure) the numerical

solution agrees well with the exact solution.

[Plot: c(x,t) versus x for 0 ≤ x ≤ 1; legend: t = 0, 0.1, 0.5, 1, 2, 10.]

Figure 7.4: Exact solution (solid line) and FEM approximation (symbols) of Eq. (7.2) at times indicated in the legend.

7.6 Stability

In Sec. 6.5, we discussed stability for a system of IVPs. For a linear system $y' = Ay$ we need all eigenvalues $\lambda_j$ of $A$ to satisfy the stability criterion $|k(h\lambda_j)| < 1$. To determine the values of $h$ for which a computation is stable we need to find the eigenvalues of $A$.

Eigenvalues

To find the maximum eigenvalue of a matrix A we can calculate the eigenvalues or estimate

them:

• The eigenvalues of a matrix A can be calculated with the built-in Matlab function

eig. For example

A = [2 1;1 2];

lambda = eig(A)

creates an array lambda with all eigenvalues

lambda =

1

3

The built-in Matlab function max can compute the maximum of an array of values

lammax = max(lambda)

gives

lammax =

3

The disadvantage is that we can only use this for a matrix with numerical values, not for a matrix that contains a variable h. In addition, it may take a lot of computing time to compute eigenvalues for large matrices.

– Gerschgorin’s theorem

To find approximations for the eigenvalues $\tau$ using Gerschgorin's theorem we need to check for every row $k$ of a matrix $A$

\[
|\tau - a_{kk}| \le \sum_{j=1,\; j \ne k}^{N} |a_{kj}|.
\]

The right-hand side is just the sum of the magnitudes of the off-diagonal entries in row $k$.

– Rayleigh's quotient

Rayleigh’s quotient is useful to relate eigenvalues of the element matrix to

eigenvalues of the global matrix in the finite element method. How to use

Rayleigh’s quotient exactly is outside the scope of Math 4414.
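The Gerschgorin estimate is easy to evaluate numerically. The sketch below applies it to the 2 × 2 example matrix used above and compares the bounds with eig. The variable names (centers, radii, lower, upper) are chosen here for illustration, and the interval form of the bound relies on the matrix being symmetric, so that all eigenvalues are real.

```matlab
% Gerschgorin bounds for a symmetric matrix, compared with eig.
A = [2 1; 1 2];
centers = diag(A);                         % a_kk for every row k
radii   = sum(abs(A), 2) - abs(centers);   % sum of |a_kj| over j ~= k
lower   = min(centers - radii);            % all eigenvalues are >= lower
upper   = max(centers + radii);            % all eigenvalues are <= upper
lambda  = eig(A);                          % exact eigenvalues for comparison
```

For this matrix the bounds are 1 and 3, which happen to coincide with the exact eigenvalues; in general the bounds are only an enclosure.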


Eigenvalues of symmetric matrices

From linear algebra, we know that symmetric matrices have real eigenvalues. This restricts the relevant part of the region of absolute stability to the part on the real axis. Thus for symmetric matrices we need for stability of Euler's method $\Delta t < 2/|\lambda_j|$ and for RK4 approximately $\Delta t < 2.75/|\lambda_j|$. This needs to hold for all eigenvalues $\lambda_j$, thus we only need to determine the eigenvalue of largest magnitude.

For the FD example we considered, we obtained $dc/dt = Pc$ with $P$ the $(m-1)\times(m-1)$ matrix

\[
P = \frac{-1}{\pi^2 h^2}\begin{pmatrix}
2 & -1 & 0 & \cdots & 0 \\
-1 & 2 & -1 & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -1 & 2 & -1 \\
0 & \cdots & 0 & -1 & 2
\end{pmatrix}.
\]

Applying Gerschgorin's theorem to each row $j = 2, \ldots, m-2$ gives

\[
\left|\tau + \frac{2}{\pi^2 h^2}\right| \le \frac{2}{\pi^2 h^2}
\quad\Longrightarrow\quad
\frac{-4}{\pi^2 h^2} \le \tau \le 0.
\]

Similarly, for the first and last row corresponding to $j = 1$ and $j = m-1$, we get

\[
\frac{-3}{\pi^2 h^2} \le \tau \le 0.
\]

Thus the maximum magnitude is $|\tau|_{\max} = 4/(\pi^2 h^2)$. Applying the stability criterion for real eigenvalues gives for Euler

\[
\Delta t < \frac{2}{|\tau|_{\max}} = \frac{\pi^2 h^2}{2}
\]

and for RK4

\[
\Delta t < \frac{2.75}{|\tau|_{\max}} = \frac{2.75\,\pi^2 h^2}{4}.
\]

Note that the stability criterion severely restricts the maximum value of $\Delta t$ that can be used. For $h = 10^{-3}$ you would need for Euler $\Delta t < 4.93 \times 10^{-6}$ and for RK4 $\Delta t < 6.78 \times 10^{-6}$.

Fig. 7.5 shows two solutions: one with $\Delta t = 0.04$, which is inside the region of absolute stability, the other with $\Delta t = 0.06$, which is outside the region of absolute stability. For $\Delta t = 0.04$ the numerical solution remains stable up to $t = 100$. For $\Delta t = 0.06$ the numerical solution starts to grow rapidly around $t = 7$. If the step size is just outside the region of stability, errors will start to grow immediately but it may take a while before the magnitude of the error becomes significant.

[Plots: c(x,t) versus x for 0 ≤ x ≤ 1, vertical range −1 to 2. Panel (a) legend: t = 0, 0.5, 2, 5, 20, 100. Panel (b) legend: t = 0, 0.5, 2, 5, 7, 8.]

Figure 7.5: Numerical solution of Eq. (7.2) using finite differences and Euler. (a) $\Delta t = 0.04$, within the region of absolute stability, and (b) $\Delta t = 0.06$, outside the region of absolute stability.
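The stability limit can be checked numerically by looking at the amplification factors $1 + \Delta t\,\tau_j$ of Euler's method. The sketch below assumes a grid size h = 1/10, which is not stated for Fig. 7.5 but gives an Euler limit of about 0.049, lying between the two step sizes 0.04 and 0.06. The variable names are illustrative.

```matlab
% Euler stability check for the FD matrix P, ASSUMING grid size h = 1/10.
h = 1/10;
n = round(1/h) - 1;                        % number of interior grid points
e = ones(n,1);
P = (-1/(pi^2*h^2)) * full(spdiags([-e 2*e -e], -1:1, n, n));
tau       = eig(P);                        % exact eigenvalues (all real, negative)
tau_gersh = 4/(pi^2*h^2);                  % Gerschgorin bound on |tau|
dt_limit  = 2/tau_gersh;                   % Euler limit pi^2*h^2/2
rho_stable   = max(abs(1 + 0.04*tau));     % largest amplification factor, dt = 0.04
rho_unstable = max(abs(1 + 0.06*tau));     % largest amplification factor, dt = 0.06
dt_small = pi^2*(1e-3)^2/2;                % Euler limit for h = 1e-3 (about 4.93e-6)
```

With these assumptions the largest amplification factor is below 1 for the smaller step and above 1 for the larger step, consistent with the behaviour described for the two panels.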

For finite elements we would need to apply Gerschgorin to $K^{-1}P$, which is not so trivial. For finite elements using Rayleigh's quotient is easier. The result is

\[
|\tau|_{\max} = \frac{12}{\pi^2 h^2}.
\]

Note that $|\tau|_{\max}$ is three times as large as for finite differences, so that we need to take a time step that is three times as small. For Euler we now have $\Delta t < 2/|\tau|_{\max} = \pi^2 h^2/6$, which means we need a larger value of $h$ for finite elements than for finite differences to get a stable scheme for the same step size $\Delta t$.
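The finite element bound can be compared with computed eigenvalues for a small case. The sketch below solves the generalized eigenvalue problem $P v = \tau K v$, which has the same eigenvalues as $K^{-1}P$. The grid size h = 1/10 and the variable names are assumptions made here for illustration.

```matlab
% Compare the FEM eigenvalues with the bound 12/(pi^2*h^2), ASSUMING h = 1/10.
h = 1/10;
n = round(1/h) - 1;
e = ones(n,1);
K = (h/6)         * full(spdiags([e 4*e e],   -1:1, n, n));
P = (-1/(pi^2*h)) * full(spdiags([-e 2*e -e], -1:1, n, n));
tau_fem  = eig(P, K);                      % generalized problem P*v = tau*K*v
tau_max  = max(abs(tau_fem));
bound    = 12/(pi^2*h^2);                  % bound quoted in the text
dt_limit = 2/bound;                        % Euler limit pi^2*h^2/6
```

The computed maximum stays below the bound but is well above the finite difference value $4/(\pi^2 h^2)$.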


7.7 Accuracy

In the discretization of a PDE, we make a discretization error $O((\Delta t)^p)$ for the time derivatives and $O(h^q)$ for the spatial derivatives. The total error is the sum of these: $\varepsilon = O((\Delta t)^p) + O(h^q)$. For example, consider a spatial discretization with error $O(h^2)$. If we use Euler's method for the time discretization, the total discretization error would be $\varepsilon = O(\Delta t) + O(h^2)$, and if we use RK4, $\varepsilon = O((\Delta t)^4) + O(h^2)$.

We simulate Eq. (7.2) numerically. Since the behavior of the error is similar for finite differences and finite elements, we only consider finite differences. We consider $O(h^2)$ finite differences with Euler and RK4 for the time discretization. We look at the global error at $t = 1/2$ and take the $l_\infty$ norm over all grid points, $\varepsilon = \| y(t = 1/2, x_i) - y_i(t = 1/2) \|_\infty$.

Fig. 7.6(a) shows the global error for a discretization in space with $h = 1/10$ and various step sizes $\Delta t$, and Fig. 7.6(b) shows the global error for a step size in time $\Delta t = 10^{-3}$ and various grid sizes $h$.

[Plots: log10 ε versus log10 Δt in panel (a) and log10 ε versus log10 h in panel (b); legend: Euler, RK4.]

Figure 7.6: Global error at t = 1/2 using $O(h^2)$ finite differences for the heat equation using Euler and RK4. (a) $h = 1/10$ and various $\Delta t$. (b) $\Delta t = 10^{-3}$ and various $h$.

We observe in both figures that the solution obtained with RK4 is not more accurate than the one obtained with Euler's method. To explain this perhaps unexpected result, we need to look more carefully at the discretization errors we make. For Euler the total discretization error is $O(\Delta t) + O(h^2)$. However, for stability we need $\Delta t < \pi^2 h^2/2$. Thus the total error we make is

\[
\varepsilon = O(h^2) + O(h^2) = O(h^2).
\]

For RK4 the total discretization error is $O((\Delta t)^4) + O(h^2)$. However, for stability we need $\Delta t < 2.75\pi^2 h^2/4$. Thus the total error we make is

\[
\varepsilon = O(h^8) + O(h^2) = O(h^2),
\]

which is the same order as Euler's method.

Thus, if we take a very accurate discretization in time using Euler or RK4 and a not very accurate discretization in space, the total error is dominated by the error in the space discretization, $O(h^2)$. We cannot avoid this situation, since we need an accurate discretization in time to satisfy the stability criterion.
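The $O(h^2)$ behaviour can be observed in a small experiment: halving $h$ while keeping $\Delta t$ at a fixed fraction of the Euler stability limit should reduce the error by roughly a factor of four. In the sketch below the initial condition $\sin(\pi x)$, the factor 0.8 and all variable names are assumptions made for illustration.

```matlab
% Error of FD + Euler at t = 0.5 for two grid sizes, dt tied to the stability limit.
hs  = [1/10 1/20];
err = zeros(size(hs));
for q = 1:numel(hs)
    h = hs(q);
    n = round(1/h) - 1;
    e = ones(n,1);
    P = (-1/(pi^2*h^2)) * spdiags([-e 2*e -e], -1:1, n, n);
    nsteps = ceil(0.5 / (0.8*pi^2*h^2/2)); % dt at most 80% of the Euler limit
    dt = 0.5/nsteps;                       % hit t = 0.5 exactly
    x = (1:n)'*h;
    c = sin(pi*x);                         % ASSUMED initial condition
    for i = 1:nsteps
        c = c + dt*(P*c);                  % Euler step
    end
    err(q) = max(abs(c - exp(-0.5)*sin(pi*x)));  % l_inf error vs assumed exact solution
end
ratio = err(1)/err(2);                     % close to 4 for second-order behaviour
```

Because $\Delta t$ is proportional to $h^2$ here, the $O(\Delta t)$ time error and the $O(h^2)$ space error shrink at the same rate.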

Remarks

• For the PDE considered here, there is no advantage in using RK4 instead of Euler: it takes more computational time and gives the same accuracy.

• An implicit technique like the trapezoidal rule would be very useful. There is no stability restriction on $\Delta t$ and the error is $\varepsilon = O((\Delta t)^2) + O(h^2)$. Both $\Delta t$ and $h$ can be varied independently to obtain a more accurate solution.
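A minimal sketch of the trapezoidal rule for the finite difference system $dc/dt = Pc$ is shown below. Each step solves $(I - \tfrac{\Delta t}{2}P)\,c_{i+1} = (I + \tfrac{\Delta t}{2}P)\,c_i$. The grid size, the step size (about twenty times the Euler limit for this $h$), the initial condition $\sin(\pi x)$ and the names Am and Bm are assumptions made here.

```matlab
% Trapezoidal rule for dc/dt = P*c with a time step far above the Euler limit.
h  = 1/100;
n  = round(1/h) - 1;
dt = 1e-2;                                 % Euler would need dt < pi^2*h^2/2, about 4.9e-4
e  = ones(n,1);
P  = (-1/(pi^2*h^2)) * spdiags([-e 2*e -e], -1:1, n, n);
I  = speye(n);
Am = I - (dt/2)*P;                         % matrix acting on the new time level
Bm = I + (dt/2)*P;                         % matrix acting on the old time level
x  = (1:n)'*h;
c  = sin(pi*x);                            % ASSUMED initial condition
for i = 1:round(0.5/dt)
    c = Am \ (Bm*c);                       % tridiagonal solve every step
end
err = max(abs(c - exp(-0.5)*sin(pi*x)));   % error at t = 0.5 for the assumed problem
```

The price for the larger time step is a tridiagonal solve at every time level.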


7.8 Solving linear and non-linear systems for PDEs

For explicit finite difference methods, the values $c_j$ at the new time level $n+1$ are obtained directly. Other methods require solving a linear or non-linear system of equations. In this section we discuss some efficient solution methods for the linear and non-linear systems resulting from the discretization of a PDE in time and 1D space.

Linear systems with tridiagonal matrices

Linear tridiagonal systems are obtained when using linear finite elements with an explicit time discretization, or when using $O(h^2)$ finite differences or linear finite elements with the trapezoidal rule for the time discretization. Then one needs to solve a system of the form

\[
K c_{i+1} = f(c_i),
\]

where $K$ is tridiagonal (see Sec. 7.5 for FEM with Euler).

As discussed in Sec. 4.9, tridiagonal systems can be solved very efficiently using Crout's method ($O(m)$ operations, and only the three diagonals need to be stored). For time-dependent systems the computing time can be reduced even more. Since the matrix $K$ is the same at every time level $i$, the factorization only needs to be performed once: in the initialization step of the program the non-zero $L$ and $U$ components can be computed and stored. In the $i$-loop the $L$ and $U$ matrices can then be used in the forward and backward substitution.
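The idea of factoring once and reusing the factors can be sketched with Matlab's built-in lu in place of a hand-written Crout routine (a substitution made here for brevity). The permutation matrix is called Q to avoid a clash with the matrix $P$ of the PDE; the initial condition and the number of steps are arbitrary choices for illustration.

```matlab
% Factor K once, then reuse L and U in every time step.
m = 10;  h = 1/m;  dt = 2e-3;  n = m - 1;
e = ones(n,1);
K = (h/6)         * spdiags([e 4*e e],   -1:1, n, n);
P = (-1/(pi^2*h)) * spdiags([-e 2*e -e], -1:1, n, n);
[L, U, Q] = lu(K);                         % initialization: Q*K = L*U, done once
B = K + dt*P;                              % right-hand side matrix for Euler
c = sin(pi*(1:n)'*h);                      % ASSUMED initial condition
c_ref = c;                                 % reference: solve from scratch each step
for i = 1:100
    y = L \ (Q*(B*c));                     % forward substitution
    c = U \ y;                             % backward substitution
    c_ref = K \ (B*c_ref);                 % same step without reusing the factors
end
maxdiff = max(abs(c - c_ref));             % difference should be at round-off level
```

Both loops produce the same solution up to round-off; the first one avoids refactoring $K$ at every time level.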

For higher-order finite elements more than three basis functions have a non-zero contribution at a grid point $j$ and the matrix $K$ would no longer be tridiagonal. Often, the number of grid points is rather small, say $O(10)$ to $O(10^2)$, and a direct method can still be used. Instead of the Crout LU factorization for tridiagonal matrices, we need to perform an LU factorization for more general matrices. If pivoting is included we need a PLU factorization. Such factorizations are discussed in more detail in Sec. 7.8.2.

As for tridiagonal systems, it is not necessary to perform the factorization at every time step: since the matrix $K$ is the same at every time level $i$, the $L$ and $U$ components can be computed and stored once in the initialization step and reused in the forward and backward substitution inside the $i$-loop.

Non-linear systems

A nonlinear system in $c_{i+1}$ results if the PDE is nonlinear in $c$ and an implicit method like the trapezoidal rule is used for the time discretization. The nonlinear system can be written in the form $f(c_{i+1}) = 0$ and can be solved, for example, with Newton's method. Often $c_i$, the solution vector at the previous time step, is a good enough initial guess $c_{i+1}^{(0)}$ in the Newton iteration

\[
J\big(c_{i+1}^{(k-1)}\big)\, \Delta c_{i+1}^{(k)} = -f\big(c_{i+1}^{(k-1)}, c_i\big),
\qquad
c_{i+1}^{(k)} = c_{i+1}^{(k-1)} + \Delta c_{i+1}^{(k)}.
\]

Since $J$ depends on $c_{i+1}^{(k-1)}$, the factorization now needs to be performed at every Newton step (for each time step). If $J$ is tridiagonal, factorization and the forward/backward substitution can be done efficiently using the $O(m)$ Crout method. Otherwise, an $O(m^3)$ LU factorization for more general matrices is necessary (see Sec. 7.8.2).
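As an illustration, the sketch below performs one trapezoidal time step with Newton's method for a hypothetical nonlinear problem $dc/dt = Pc - c^2$, where $P$ is the finite difference matrix from Sec. 7.6. The reaction term $-c^2$, the grid size, the step size and the initial condition are all assumptions made here and are not taken from the text.

```matlab
% One trapezoidal step with Newton's method for the HYPOTHETICAL system dc/dt = P*c - c.^2.
h = 1/20;  n = round(1/h) - 1;  dt = 0.05;
e = ones(n,1);
P = (-1/(pi^2*h^2)) * spdiags([-e 2*e -e], -1:1, n, n);
I = speye(n);
g = @(c) P*c - c.^2;                       % hypothetical right-hand side
x = (1:n)'*h;
c_old = sin(pi*x);                         % ASSUMED solution at time level i
c_new = c_old;                             % initial guess: previous time level
for k = 1:20
    f = c_new - c_old - (dt/2)*(g(c_new) + g(c_old));     % residual f(c_new, c_old)
    if norm(f, inf) < 1e-12, break; end
    J  = I - (dt/2)*(P - 2*spdiags(c_new, 0, n, n));      % tridiagonal Jacobian
    dc = -(J \ f);                         % solve J*dc = -f (new factorization each step)
    c_new = c_new + dc;                    % Newton update
end
res = norm(c_new - c_old - (dt/2)*(g(c_new) + g(c_old)), inf);
```

The Jacobian is rebuilt inside the loop because it depends on the current iterate, in contrast to the linear case where the factors can be reused.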

7.8.2 LU factorization

LU factorization without row interchanges

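For some types of matrices, no row interchanges are necessary in the Gaussian elimination process. Then the $m \times m$ matrix $A$ can be factored into $A = LU$, with $L$ a lower-triangular matrix and $U$ an upper-triangular matrix:

\[
L = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
l_{21} & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
l_{m1} & \cdots & l_{m,m-1} & 1
\end{pmatrix}, \qquad
U = \begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1m} \\
0 & u_{22} & \ddots & \vdots \\
\vdots & \ddots & \ddots & u_{m-1,m} \\
0 & \cdots & 0 & u_{mm}
\end{pmatrix}.
\]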

Since A = LU we need to solve

LU x = b.

The matrix vector product U x is a vector, say y. Thus we can first solve the lower

triangular system

Ly = b

using forward substitution (first y1 , then y2 using the already computed value of y1 etc.)

to find the intermediate vector y, and then solve

Ux = y

using backward substitution (first xm , then xm−1 etc.) to find the solution x of the system

LU x = b.

Both the upper and lower triangular systems only take $O(m^2)$ operations to solve. Thus once we have the LU factorization, it is relatively cheap to solve a system involving the matrix $A = LU$ and any vector $b$. However, the LU factorization needs to be computed first, which takes about $2m^3/3$ operations, i.e. $O(m^3)$.
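Forward and backward substitution can be written out explicitly as two short loops. The sketch below uses a 3 × 3 pair of triangular factors and an arbitrary right-hand side b chosen for illustration; the loop structure, not the particular numbers, is the point.

```matlab
% Solve L*U*x = b by forward substitution (L*y = b) and backward substitution (U*x = y).
L = [1 0 0; -0.5 1 0; 0.25 0 1];           % unit lower-triangular factor
U = [4 8 -1; 0 7 -5.5; 0 0 6.25];          % upper-triangular factor
b = [1; 2; 3];                             % arbitrary right-hand side
m = numel(b);
y = zeros(m,1);
for k = 1:m                                % first y(1), then y(2), ...
    y(k) = b(k) - L(k,1:k-1)*y(1:k-1);     % no division needed since l_kk = 1
end
x = zeros(m,1);
for k = m:-1:1                             % first x(m), then x(m-1), ...
    x(k) = (y(k) - U(k,k+1:m)*x(k+1:m)) / U(k,k);
end
res = norm(L*U*x - b);                     % residual of the original system
```

Each loop touches every non-zero entry of its triangular matrix once, which is where the $O(m^2)$ operation count comes from.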


PLU factorization with row interchanges

If row interchanges are necessary, then a permutation matrix $P$ exists so that $PA = LU$. This just means that the rows can be interchanged (via $P$) so that an LU factorization exists. Thus we solve

\[
PAx = Pb,
\]

which is "$Ax = b$ with a different order of the rows". For $PA$ we can make an LU factorization, so we solve

\[
LUx = Pb.
\]

Starting from the matrices $P$, $L$, and $U$ and the vector $b$, we can find the solution $x$ of $Ax = b$ in three steps:

• compute $z = Pb$ ($O(m^2)$ operations)

• solve $Ly = z$ using forward substitution ($O(m^2)$ operations)

• solve $Ux = y$ using backward substitution ($O(m^2)$ operations)

Saving memory

The matrices $L$ and $U$ contain a lot of zeros. To store $L$ and $U$ in separate matrices is a waste of memory. Usually, $L$ and $U$ are stored in a single matrix

\[
\begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1m} \\
l_{21} & u_{22} & \ddots & \vdots \\
\vdots & \ddots & \ddots & u_{m-1,m} \\
l_{m1} & \cdots & l_{m,m-1} & u_{mm}
\end{pmatrix}.
\]

Note that the 1's on the diagonal of $L$ are not stored. Storing them is not necessary, since we know exactly what the values on that diagonal are: they are always 1, so we can just use $l_{ii} = 1$ wherever the $l_{ii}$'s are needed. Often the matrix $A$ is no longer needed after the LU factorization has been performed. By overwriting the matrix $A$ with $L$ and $U$, we don't need any additional memory at all.

• The solution is subject to round off errors only.

• If you need to solve Ax = b several times with the same matrix A you need to do the

expensive LU factorization only once. Solving the triangular systems is relatively

cheap. This occurs, for example, when

– solving a time-dependent problem (like $K c_{i+1} = f(c_i)$) and the matrix $K$ doesn't depend on time.


Matlab

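A PLU factorization of a matrix A can be obtained in Matlab using the built-in function lu. For example:

A = [1 2 6; 4 8 -1; -2 3 -5]

[L, U, P] = lu(A)

gives

\[
L = \begin{pmatrix} 1 & 0 & 0 \\ -0.5 & 1 & 0 \\ 0.25 & 0 & 1 \end{pmatrix}, \qquad
U = \begin{pmatrix} 4 & 8 & -1 \\ 0 & 7 & -5.5 \\ 0 & 0 & 6.25 \end{pmatrix}, \qquad
P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.
\]

If we check P*A - L*U this indeed gives the zero matrix.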

An LU factorization with the L and U matrices stored in a single matrix K can be done by using lu with a single output argument. For example

A = [4 8 -1; -2 3 -5; 1 2 6;]

K = lu(A)

gives

\[
K = \begin{pmatrix} 4 & 8 & -1 \\ -0.5 & 7 & -5.5 \\ 0.25 & 0 & 6.25 \end{pmatrix}.
\]

Note that information about P is lost.
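Putting the pieces together, the three-output form of lu can be used to solve $Ax = b$ in the three steps described for the PLU factorization. The right-hand side b below is an arbitrary vector chosen for illustration.

```matlab
% Solve A*x = b with a PLU factorization: z = P*b, L*y = z, U*x = y.
A = [1 2 6; 4 8 -1; -2 3 -5];
b = [1; 2; 3];                             % arbitrary right-hand side
[L, U, P] = lu(A);                         % P*A = L*U
z = P*b;                                   % reorder the right-hand side
y = L \ z;                                 % forward substitution
x = U \ y;                                 % backward substitution
res = norm(A*x - b);                       % residual of the original system
```

For a different right-hand side only the last three solution steps need to be repeated; the factorization itself can be reused.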
