Numerical Analysis I: California State University Northridge Lecture Notes For Math 481A

California State University Northridge
Lecture Notes for Math 481A:

Numerical Analysis I
Bruce E. Shapiro, Ph.D.
Last Revision: July 5, 2008
This document is provided in the hope that it will be useful but without any
warranty, without even the implied warranty of merchantability or fitness for a
particular purpose. The document is provided on an “as is” basis and the author
has no obligations to provide corrections or modifications. The author makes no
claims as to the accuracy of this document. In no event shall the author be liable
to any party for direct, indirect, special, incidental, or consequential damages,
including lost profits, unsatisfactory class performance, poor grades, confusion,
misunderstanding, emotional disturbance or other general malaise arising out of
the use of this document or any software described herein, even if the author has
been advised of the possibility of such damage.
« 2008. This document is licensed under a Creative Commons Attribution-

Noncommercial-No Derivative Works 3.0 United States License (by-nc-nd). For
specifics and a copy of the license please see the Creative Commons web site at
http://creativecommons.org/licenses/by-nc-nd/3.0/us/
Please report any errors to bruce.e.shapiro at csun.edu. All feedback, com-

ments, suggestions for improvement, etc., is appreciated, especially if you’ve used
these notes for a class, either at CSUN or elsewhere, from both instructors and
students.
ii
Your fearless leader. Above: Typical view during a

class lecture. Below: Typical view during an exam.
The pictures were drawn by former students. The con-
sumpution of cookies and caffeinated beverages during
class is optional but is strongly encouraged.
Math 481A « 2008, B.E.Shapiro

California State University Northridge Last revised: July 5, 2008
Contents
1 A Motivational Example 1
2 Limits and Continuity 9
3 Sequences 13
4 Theorems About Derivatives 17
5 Error 23
6 Number Representation 27
7 Fixed and Floating Point 31
8 Roots and Bisection 35
9 Fixed Point Iteration 41
10 Newton’s Method 59
11 Secant Method 67
12 Error Analysis for Iterative Methods 71
13 The Aitken-Steffensen Methods 81
14 Synthetic Division and Horner’s Method 87
15 Müller’s Method 93
16 Linear Systems 99
17 Lagrange Interpolation 105
18 Newton Interpolation 119
iii
iv CONTENTS
19 Hermite Interpolation 125
20 Cubic Splines 135
21 Bezier Curves 143
22 Least Squares 153
23 Numerical Differentiation 161
24 Richardson Extrapolation 169
25 Numerical Integration 175
26 Theory of Differential Equations 187
27 Method of Successive Approximations 197
28 Euler’s Method 211
29 The Backwards Euler Method 215
30 Improving Euler’s Method 221

Lesson 1
A Motivational Example
Numerical analysis is a branch of mathematics that deals with the development and
implementation of methods for solving problems numerically with continuous math-
ematics. A related field, discrete or finite mathematics, deals with problems that
do not contain or depend on the concept of continuity. In practice both fields of
mathematics overlap the with the subjects of numerical computation and computer
science, which deal with the actual implementations (e.g., computer programs and
algorithms) used to solve these problems. √
As an example of a numerical algorithm, consider finding the square root a of a
number. We know that the solution satisfies
x2 = a (1.1)
Many algorithms for finding the value of x are based on finding the root of the
polynomial ⇓:rev.5/28/08
f (x) = x2 − a (1.2)
i.e., finding the value of x that satisfies the equation f (x) = 0. This number is called a
root of f (x). We will explore some of these algorithms this semester. We start with an
example
√ that was first observed
√ by the ancient Babylonians. If x is an approximation
to a then since a/x ≈ a is an equally good approximation. Furthermore,
√ 1 1
x< a =⇒ √ < (1.3)
a x
√ a a
=⇒ a = √ < (1.4)
a x
and
√ √ a
x> a =⇒ a > (1.5)
x
√ √
In other words, if x0 is any approxomation to a, then the actual value of a must
lie between x and a/x. Since there is no reason to believe that x is any better of an
1
2 LESSON 1. A MOTIVATIONAL EXAMPLE
approximation then
√ a/x, and vice versa, this suggests that we can obtain a better
approximation to a by averaging the two estimates:

1 a
x1 = x0 + (1.6)
x x0
We can repeat this argument with x1 to generate a better estimate x2 , and so forth,
leading us to the sequence of approximations x0 , x1 , x2 , . . . given by

1 a
xi+1 = xi + (1.7)
2 xi
Equation 1.7 is an example of an iteration formula. It gives us a sequence of better and
better approximations to the number we are looking for. We will see iteration formulas
again and again throughout this class; they are one of the principal techniques by
which we summarize a numerical algorithm. The basic technique is summarized here:
Given: x0
i=0
Repeat
xi+1 = f (xi )
i=i+1
Until the approximation is “good enough”
A standard way of defining “good enough” is by using a tolerance. To do this we

keep repeating the calculation until the difference between two successive iterations
is less than the tolerance, which we will denote by :
Given: x0 ,
i=0
Repeat
xi+1 = f (xi )
i=i+1
Until |xi − xi−1 | <
Finally, we note that there is always the possibility that there could be a “bug”
in our implementation that could lead to an infinite loop. Hence it is wise to always
include an iteration counter that will force termination:
Given: x0 , , N
i=0
Repeat
xi+1 = f (xi )
i=i+1
Until |xi − xi−1 | < or i > N

LESSON 1. A MOTIVATIONAL EXAMPLE 3
When you are debugging your code it is generally a good idea to use a very small
value of N such as 2 or 3, even if you expect a much larger number of iterations to
5/28/08:⇑ occur in the final version.
The only remaining problem is to figure out what to use for the first guess x0 .
This algorithm is so good, in fact,
√ that it doesn’t much matter. We can use x0 = 1
or x0 = a. For example, to find 2 using x0 = 2 we have

1 2
x1 = 2+ = 1.5 (1.8)
2 2

1 2
x2 = 1.5 + = 1.41667 (1.9)
2 1.5

1 2
x2 = 1.41667 + = 1.41422 (1.10)
2 1.41667
and so forth. This algorithm converges rather quickly; in fact, it precisely reproduces
the same formula as Newton’s method (which we will discuss in section 10).
Throughout this class we will give examples using a programming language called
Mathematica. We have chosen this language because it is extremely powerful; uses
a fairly intuitive mathematical interface that is easy to learn rapidly; and allows us
to program without worrying about many of the details such as types, classes, and
objects that we need to worry over in more primitive languages such as Java, C++
or FORTRAN. In Mathematica, we can implement the square root finding algorithm
quite easily (Don’t worry about the details of this program if you don’t know Math-
ematica; we’ll come back to that and do some training in the Math Lab before you
have to start coding for your homework). The following will calculate shows that the
algorithm converges to the first 50 digits in only 8 iterations!
In:=
f[x_] := (1/2)(x + 2/x)

NestList[f, 2.0‘50, 8]
Out:=
{2.0000000000000000000000000000000000000000000000000,
1.5000000000000000000000000000000000000000000000000,
1.4166666666666666666666666666666666666666666666667,
1.4142156862745098039215686274509803921568627450980,
1.4142135623746899106262955788901349101165596221157,
1.4142135623730950488016896235025302436149819257762,
1.4142135623730950488016887242096980785696718753772,
1.4142135623730950488016887242096980785696718753769,
1.4142135623730950488016887242096980785696718753769}
« 2008, B.E.Shapiro Math 481A

Last revised: July 5, 2008 California State University Northridge
From an implementation perspective, one could just catalog a table of algorithms

that give methods for solving various problems, and then proceed to translate these
algorithms into computer programs. We will see, however, that this blind approach
can lead to disaster, because a method that works under one set of conditions may not
work under another set of conditions. We will need to understand both the mathemat-
ics underlying the algorithms as well as the nature of the computer representations
before we can be confident that an implementation will work. We will approach this
subject, then, from an interdisciplinary perspective, as neither mathematicians nor
computer scientists, but as mathematical scientists. We illustrate the necessity of
such an interdisciplinary approach with a tragic example.
The Patriot Mission system used by the US Army is a surface-to-air (SAM) mis-
sile used primarily as an advanced aerial interceptor – i.e., to target and destroy
incoming missiles. The acronym “Patriot” actually stands for “Phased Array Track-
ing Radar to Intercept of Target;” a more politically charged version is “Protection
Against Threats, Real, Imagined, Or Theorized.” The system was developed during
the 1970’s, first deployed in 1984 as an anti-aircraft weapon, and in 1988 as an anti-
ballistic-missle defense. Patriots deployed during the first Gulf War in 1991 used a
24 bit integer counter to measure time in tenths of a second. This number needed to
be converted to floating point and used in a calculation to determine if and when an
missile should be fired. The following is taken from a GAO report:1
“The heart of the Patriot system is its weapons control computer. It performs
the system’s major functions for tracking and intercepting a target, as well as other
battle management, command and control functions. The Patriot’s weapons control
computer used in Operation Desert Storm is based on a 1970s design with relatively
limited capability to perform high precision calculations.
“To carry out its mission, the Patriot’s weapons control computer obtains target
information from the system’s radar. The Patriot’s radar sends out electronic pulses
that scan the air space above it. When the pulses hit a target they are reflected back
to the radar system and shown as an object (or plot) on the Patriot’s display screens.
Patriot operators use the software to instruct the system to intercept certain types
of objects such as planes, cruise missiles, or tactical ballistic missiles (such as Scuds).
During Desert Storm the Patriot was instructed to intercept tactical ballistic missiles.
For the Patriot’s computer to identify, track, and intercept these missiles, important
information describing them was kept by the system’s range-gate algorithm.
“After the Patriot’s radar detects an airborne object that has the characteristics
of a Scud, the range gate–an electronic detection device within the radar system–
calculates an area in the air space where the system should next look for it ... Finding
an object within the calculated range gate area confirms that it is a Scud missile.
“The range gate’s prediction of where the Scud will next appear is a function of the
Scud’s know velocity and the time of the last radar detection. Velocity is a real number
1
United States Office of the General Accounting Office Memorandum GAO/IMTEC-92-26, “Pa-
triot Missile Software Problem.”

that can be expressed as a whole number and a decimal (e.g., 3750.2563...miles per
hour). Time is kept continuously by the system’s internal clock in tenths of seconds
but is expressed as an integer or whole number (e.g., 32, 33, 34...). The longer the
system has been running, the larger the number representing time. To predict where
the Scud will next appear, both time and velocity must be expressed as real numbers.
Because of the way the Patriot computer performs its calculations and the fact that
its registers are only 24 bits long, the conversion of time from an integer to a real
number cannot be any more precise than 24 bits. This conversion results in a loss
of precision causing a less accurate time calculation. The effect of this inaccuracy on
the range gate’s calculation is directly proportional to the target’s velocity and the
length of time the system has been running. Consequently, performing the conversion
after the Patriot has been running continuously for extended periods causes the range
gate to shift away from the center of the target, making it less likely that the target,
in this case a Scud, will be successfully intercepted.
“... after about 20 hours, the inaccurate time calculation becomes sufficiently large
to cause the radar to look in the wrong place for the target ... Army officials said
that they believed that ... Patriot users were not running their systems for 8 or more
hours at a time ... Significant shifts of the range gate away from the desired center
of the target could be eliminated by rebooting the system-turning the system off and
on–every few hours. Rebooting, which takes about 60 to 90 seconds, reinitializes the
computer’s clock, setting the time back to zero.
“... On February 25, Alpha Battery had been in operation for over 100 consecutive
hours ...”
Lets examine this calculation in some detail. Each bit in a binary number rep-
resents a power of 2. The numbers to the right of the radix point are fractions,
representing 2−1 , 2−2 , 2−3 , ... as we move from left to right; the numbers to the left
of the binary point represent 20 , 21 , 22 , 23 , ... as we move to the left, starting at the
binary point. The nth bit to the right of the radix point then represents 2− n and
the nth bit to the left represents 2n−1 . We can convert a binary number back to its
decimal representation by then adding up the value. Let bn = 1 or 0 represent the
nth bit. Then
X X
decimal value = bn × 2n−1 + bn × 2−n (1.11)
whole bits f ractional bits
where “whole bits” means the bits to the left of the radix point and “fractional bits”
means the bits to the right of the radix point.
Google Calculator gives you a convenient way to convert between bases. In Google
calculator a binary integer begins with “0b” (zero followed by the letter b). Unfortu-
nately it does not work with fractions, on integers. If we enter the string
0b110011001100110011 in Decimal
in any Google search window it will return the number 209715.

To determine its value in decimal we observe that the least significant bit corresponds
to 2−21 so we enter the string
2^(-21)*209715
in the search window.
This returns a number that is very close to – but not precisely equal to – one tenth.
Example 1.1. Find the decimal equivalent of the binary number 101101.011
Solution.
101101.011 = 1 × 25 + 0 × 24 + 1 × 23 + 1 × 22 + 0 × 21 + 1 × 20

+ 0 × 2−1 + 1 × 2−2 + 1 × 2−3

(1.12)
1 1
= 32 + 8 + 4 + 1 + + (1.13)
4 8
3
= 45 (1.14)
8
In Mathematica we can do the conversion in example 1.1 quite easily; entering
2^^101101.011
returns the value
45.375
An related function is
BaseForm[2^^101101.011, 10]
which also will return the value 45. 375. To go in the other direction, we can type in

BaseForm[45.375, 2]
which returns the value 101101.0112 . Unfortunately BaseForm only returns a string
representation of the number, not an actual binary number, so doing calculations
with the binary number requires a bit more work.
In the Patriot software, integers were converted to decimal numbers by multiplying
by the 24-bit binary representation of the decimal number 0.1 with one bit to the left
of the decimal and 23 bits to the right of the decimal. This number is
209715
m = 0.0001 1001 1001 1001 1001 100 = (1.15)
2097152
Spaces are used to separate every fourth bit to make the binary numbers easier to
read. The choice of four bits in a binary number is convenient because 4 binary bits
corresponds to precisely one hexa-decimal (base 16) digit. (We can, of course, find
this crucial number in Mathematica by
BaseForm[0.1, 2]
which returns the string representation of binary number show above.)

So the Patriot missile software used the number 209715/2097152 to approximate
the decimal fraction 1/10, it made an error of
1 209715 1
= − = ≈ 9.54 × 10−8 seconds (1.16)
10 2097152 10485760
or about a tenth of a microsecond each second. While this number may seem small,
it adds up over time. Suppose, for example, the counter is running for one hour (3600
seconds); then the total error that builds up is
3600
= ≈ 0.003433 seconds (1.17)
10485760
At the time of a Scud missile attack on Feb 25, 1991 the system operating at Dhahran,
Saudi Arabia had been operating for approximately 100 hours. The accumulated
roundoff error was therefor
3600 × 100
= ≈ 0.3433 seconds (1.18)
10485760
The clock was off by about 1/3 of a second. While this may still seem small, it
happens that Scud missiles travel at a speed of approximately Mach 5 or 1650 meters
per second (about 6000 km/hour). The Patriot missile systems was using this time
calculation to determine the actual position of the incoming missile, so it made an
error of
≈ (1650 meters/second) × (0.3433 seconds) = 566 meters. (1.19)

The calculation was off by over half of a kilometer. This caused the system to
repeatedly recycle and try to recalculate the position again. It was unable to converge ⇓:rev.5/28/08
and so a missile was allowed to penetrate the bases defenses on 25 Feb 1991, killing
28 people. Ironically, the bug was known and a patch had been released on 16 Feb
1991 correcting the problem, but was still in the mail. It arrived one day too late, on
5/28/08:⇑ 26 Feb 1991. President Bush declared that hostilities had ended on 28 Feb.
Other Famous Numerical Errors
Approximately 36 seconds after the launch of an Ariane rocket from French

Guiana on 4 June 1996 the rocket’s guidance system shut down because of a
software error. It was trying to calculate the rocket’s veclocity and performed an
illegal conversion of a 64 bit real number into a 16 bit integer. Since the backup
system had the same software installed, it also shut off. So the rocket veered
off course and controllers were forced to activate the on-board self-destruct
mechanism. Strangely enough, this part of the calculation was only used before
launch and wasn’t even needed once the rocket took off. The rocket and its
cargo cost approximately $500 million.
Over the course of 22 months starting in 1982 the Vancouver Stock exchange
accumulated enough numerical error due to rounoff to reduce the correct value
of the index (1098.98) to 574.08.
Under German law (in 1992) no party with less than a 5 percent vote may be
seated in parliament. On April 5, 1992 the Green party obtained 4.97% of the
vote, but a computer program that prints out results was set to round to one
decimal place - exactly 5.0%. Hence the early official results showed that the
candidate had been elected.
in 1995 Microsoft announced that some versions of the their spreadsheet pro-
gram Excel makes mistakes because of a base 10 to base two conversion error.
An oil platform off the coast of Norway sank on August 23, 1991 at a cost of
nearly a billion dollars as a result of an error in a finite element approximation
– a method that was used to calculate the linear elastic stresses on the structure
supporting the platform.

Lesson 2
Limits and Continuity
In the next sections we will make a brief review of some mathematical preliminaries
before we turn to a study of how numbers are represented in computers.
Definition 2.1 (Limit). We say “the limit of f (x) as x approaches x0 is equal to
L” and write
lim f (x) = L (2.1)
x→x0
if, given any > 0 there exists some δ > 0 such that
|x − x0 | < δ =⇒ |f (x) − L| < (2.2)
The value of the number is allowed to depend on the value of the number δ. This
concept of a limit is illustrated in the following figure: If you give me the value of ,
I can find a value of δ such that |f (x) − L| < whenever |x − x0 | < δ.
You name ε f(x)
box for
smaller ε
I will find δ
x0
9
10 LESSON 2. LIMITS AND CONTINUITY
√
Example 2.1. Show that limx→3 x+1=2
Solution. Using the nomenclature of the definition,
√
f (x) = x + 1 (2.3)
L=2 (2.4)
x0 = 3 (2.5)
Let > 0 be any small number. The we need to find a number δ > 0 such that
√
|x − 3| < δ =⇒ x + 1 − 2 < (2.6)

or equivalently, by the definition of absolute values,

√
−δ < x − 3 < δ =⇒ − < x + 1 − 2 < (2.7)
What condition does this impose on δ? We calculate
−δ + 3 < x < δ + 3 (2.8)
−δ + 4 < x + 1 < δ + 4 (2.9)

√√ √
4−δ < x+1< 4+δ (2.10)
√ √ √
4−δ−2< x+1−2< 4+δ−2 (2.11)
But to get equation 2.7, out of this we would need to get
√ √ √
− < 4 − δ − 2 < x + 1 − 2 < 4 + δ − 2 < (2.12)
This leads to the following two conditions:

√
− < 4 − δ − 2 (2.13)
√
4+δ−2< (2.14)
By condition 2.13: √
2−< 4−δ (2.15)
(2 − )2 < 4 − δ (2.16)
δ < 4 − (2 − )2 = (4 − ) (2.17)
Note that the last quantity is going to be positive for small . By condition 2.14:
√
4+δ <+2 (2.18)
4 + δ < ( + 2)2 (2.19)

δ < (4 + ) (2.20)

LESSON 2. LIMITS AND CONTINUITY 11
Since we need both conditions 2.13 and 2.14 to hold then we require both equations
2.17 and 2.20. Since > 0,
4−<4+ (2.21)
(4 − ) < (4 + ) (2.22)
we determine that condition 2.17 is more restrictive and when it holds, we are ensured
that condition 2.20 is also met. This gives us enough information to construct a proof,
which we present immediately.
Let > 0. We need to show that there is some δ such that |x − x0 | < δ implies
that |f (x0 ) − L| < . In other words we need to show that there is some δ such that
|x − 3| < δ (2.23)
implies that √
x + 1 − 2 < (2.24)

We do this by choosing
δ = (4 − ) (2.25)
hence
|x − 3| < δ = (4 − ) (2.26)
−(4 − ) < x − 3 < (4 − ) (2.27)
−2 − 4 + 4 < x + 1 < −2 + 4 + 4 (2.28)
( − 2)2 < x + 1 < −2 + 4 + 4 < 2 + 4 + 4 = ( + 2)2 (2.29)
p √ p
( − 2)2 < x + 1 < ( + 2)2 (2.30)
From the equality on the left, we know that
√ p
x + 1 > ( − 2)2 = ±( − 2) (2.31)
and that this must hold for both values of the root. The value − 2 is near -2 and
the value 2 − is near 2, so we chose:
√
2−< x+1 (2.32)
Combining this with the inequality on the right hand side of 2.30 we obtain
√
2−< x+1<2+ (2.33)
√
− < x + 1 − 2 < (2.34)
√
x + 1 − 2 < (2.35)

|f (x) − L| < (2.36)

This is sufficient to prove that that limx→x0 = L; hence we conclude that
√
lim 1 + x = 2 (2.37)
x→2

12 LESSON 2. LIMITS AND CONTINUITY
Definition 2.2 (Continuity). We say “f (x) is continuous at x0 ” or just “f is

continuous at x0 if
lim f (x) = f (x0 ) (2.38)
x→x0
Example 2.2. The function √

f (x) = x+1 (2.39)
is continuous at x = 3, because, as we showed in the previous example,
lim f (x) = 2 = f (3) (2.40)

x→3
Example 2.3. The function

(√
1+x , x 6= 3
f (x) = (2.41)
0 ,x = 3
is not continuous at x = 3 because
lim f (x) = 2 6= 0 = f (3) (2.42)

x→3
as illustrated in the figure.
f(x) is not continuous

at x=3
L
f(x)=(1+x)1/2

Lesson 3
Sequences
We will frequently use iterative processes in our study of numerical analysis. In such
a process, one computes a sequence of values, usually in a loop or other similar control
structure. Such iterative processes can be related to the concept of a sequence: at
each iteration of the the loop we calculate the value of some number an . The complete
set of all possible an is a sequence. More specifically, we have the following definition.
Definition 3.1 (Sequence). A sequence is a function that maps positive integers

to the real numbers:
n 7→ xn , n ∈ Z+ , xn ∈ R (3.1)
and we write the sequence as one of the following:
x1 , x2 , x3 , ... (3.2)
xn (3.3)
{xn }∞
n=1 (3.4)
Sometimes we will define a sequence on the set of non-negative integers, {0} ∪ Z+

rather than just the positive integers.
Definition 3.2 (Convergence of a Sequence). We say that the sequence xn con-

verges to to the limit x if, given any real number > 0, then we can find an integer N
(usually large, and usually N will depend on ), such that |x − xn | < for all |n > N |,
and we write
xn → x as n → ∞ (3.5)
and
lim xn = x (3.6)
n→∞
The concept of a limit is illustrated in the following figure.
13
14 LESSON 3. SEQUENCES
x1
x2
9OUNAMEε
xN+1
x3
L
xN for all n>N,

the dots fall in the grey band
I will find N
1 2 3 N n>N
3(1 + 2n )
Example 3.1. Show that the sequence xn = → 3 as n → ∞.
2n
Solution. We need to show that for any > 0 there exists some N such that
|xn − x| = |xn − 3| < (3.7)
for all n > N . We begin by observing that

3(1 + 2n )

|xn − 3| =
n
− 3 (3.8)
2
3 + (3)(2n ) − (3)(2n ))

= (3.9)
2n
3
= n (3.10)
2
which can be made as small as we like by choosing n sufficiently large.
Let > 0 be given. Then so long as

3
< (3.11)
2n
we will have
|xn − 3| < (3.12)

LESSON 3. SEQUENCES 15
To find the value of N , we solve 3.11 for n:
2n > 3/ (3.13)

n ln 2 > ln(3/) (3.14)
ln(3/)
n> = log2 (3/) (3.15)
ln 2
Hence given any , we can choose any integer N > log2 (3/); then we are ensured
that |xn − 3| < for all n > N , which means that xn → 3 .
Theorem 3.1. If f (x) is continuous at the point x = c and xn is a converging

sequence such that xn → c then

lim = f lim xn (3.16)
n→∞ n→∞
or, equivalently,
f (xn ) → f (c) (3.17)
This result is sometimes stated as “the limit of function of a sequence is the

function of the limit of the sequence.”
r
3(1 + 2n )
Example 3.2. Find lim 1 + .
n→∞ 2n
√
Solution. Let f (x) = x + 1. This is a continuous function at x = 3, as we showed in
3(1 + 2n )
an earlier example. Furthermore, we showed that the sequence xn = →3
2n
as n → ∞. Hence by the theorem,
√
f (xn ) → f (3) = 3 + 1 = 2 (3.18)

16 LESSON 3. SEQUENCES

Lesson 4
Theorems About Derivatives
Definition 4.1 (Derivative). The derivative is given by either of the following two
equivalent formulas:
f (x0 + h) − f (x0 ) f (x) − f (x0 )
f 0 (x0 ) = lim = lim (4.1)
h→0 h x→x0 x − x0
The second definition can be derived from the first with the substitution x = h + x0 .
Theorem 4.1 (Intermediate Value Theorem (IVT)). Suppose that f (x) is a
continuous function on the interval [a, b], and that K is a number between f (a) and
f (b). Then there exists at least one (and possibly many) number(s) c ∈ [a, b] such
that f (c) = K .
Figure 4.1: Illustration of the Intermediate Value Theorem.
f(a)
f(c)
f(b)
a c b
Thus a continuous value takes on all values between the values it obtains at the
endpoints of its domain (See figure 4.1). The following corollary is illustrated in figure
4.2.
17
18 LESSON 4. THEOREMS ABOUT DERIVATIVES
Corollary 4.1. Under the same conditions as the IVT,f (a) and f (b) have different
signs, then there is a root between a and b.
Corollary 4.2. Under the same conditions as the IVT, if f (a)f (b) < 0, then there
is a root in the interval (a, b).
Proof. If f (a)f (b) < 0 then either f (a) < 0 and f (b) > 0, or f (a) > 0 and f (b) < 0.
In either case, the number 0 is between f (a) and f (b). Hence there is some number
c such f (c) = 0.
Figure 4.2: If f (a)f (b) < 0 then there is a root between a and b.
f(a)
b
a c
f(b)
Theorem 4.2 (Mean Value Theorem (MVT)). If f is continuous on [a, b] and

differentiable on (a, b) then there exists some c ∈ [a, b] such that
f (b) − f (a)
f 0 (c) = (4.2)
b−a
The interpretation of the mean value theorem is as follows: there is at least one
point in the interval [a, b] where the slope of f (x) is identical to the slope of a straight
rev.5/29/08:⇓ line between the end points of the function. See figure 4.3.
Figure 4.3: Illustration of the Mean Value Theorem.
f HbL
f HaL
a c b
5/29/08:⇑

LESSON 4. THEOREMS ABOUT DERIVATIVES 19
Theorem 4.3 (Rolle’s Theorem). If f is continuous on [a, b] and differentiable on

(a, b), and f (a) = f (b) , then there exists some c ∈ [a, b] such that f 0 (c) = 0 .
Theorem 4.4 (Generalized Rolle’s Theorem). Suppose f is continuous on [a, b]

and n times differentiable on (a, b). If f (x) = 0 at n + 1 distinct points x0 , x1 , .., xn
in [a, b] then there is a number c ∈ (a, b) such that f (n) = 0.
⇓:rev.5/29/08
The Generalized Rolle’s theorem is illustrated in figure 4.4. The top curve shows
a plot of some function f (x) with unique zeroes at x1 < x2 < x3 < x4 . Rolle’s
theorem then tells us that between each pair of points (x1 , x2 ), (x2 , x3 ) and (x3 , x4 )
there are points p1 , p2 , p3 such that f 0 (p1) = 0, f 0 (p2 ) = 0, and f 0 (p3 ) = 0. With this
information we can sketch a plot of f 0 (x), as shown in the middle graph. We know
that f 0 (x) has three unique zeroes at p1 , p2 and p3 . Hence by Rolle’s theorem, there
is ap point q1 ∈ (p1 , p2 ), and a point q2 ∈ (p2 , p3 ) where the derivative of f 0 (x) is
zero, i.e., f 00 (q1 ) = f 00 (q2 ) = 0. Next, we can sketch a plot of f 00 (x), which is shown
in the bottom curve. It has two zeroes, at q1 and q2 . Hence by Rolle’s theorem there
is some point r1 ∈ (q1 , q2 ) such that f 000 (r1 ) = 0. But this is the precise prediction of
the Generalized Rolle’s Theorem: since f was continuous and hand 4 unique zeroes
(i.e., n + 1 = 4) then there is some point with f (n) (x) = 0, where n = 3.
Figure 4.4: Illustration of the Generalized Rolle’s Theorem.
x1 p1 x2 p2 x3 p3 x4
p1 q1 p2 q2 p3
q1 r1 q2
⇑:5/29/08
Theorem 4.5 (Extreme Value Theorem (EVT)). If f is continuous on an in-

terval [a, b] then it takes on both a minimum and a maximum value on [a, b]. If f
is differentiable on (a, b) then the extrema occur either at the endpoints or where
f 0 (x) = 0.

Definition 4.2 (Continuously Differentiable). We say a function f is continu-

ously differentiable on [a, b] if f is differentiable on (a, b) and its derivative is contin-
uous on [a, b].
Theorem 4.6 (Taylor’s Theorem with Remainder). Let f be n − times contin-
uously differentiable and suppose that f (n+1) exists on [a, b], and let x0 be any point
in (a, b). Then for all x ∈ [a, b] there exists some number c ∈ [a, b] such that
f (x) = Pn (x) + Rn (x) (4.3)
where
n
X f (k) (x0 )
Pn (x) = (x − x0 )k (4.4)
k=0
k!
f (n) (x0 )
= f (x0 ) + (x − x0 )f 0 (x0 ) + · · · + (x − x0 )n (4.5)
n!
and
f (n+1) (c)
Rn (x) = (x − x0 )n+1 (4.6)
(n + 1)!
The polynomial Pn x is called the Taylor Polynomial of Order n and the function
Rn (x) is called the Remainder.
When x0 = 0, Taylor’s theorem gives the Maclaurin Polynomials:
n
X f (k) (0)
Pn (x) = xk (4.7)
k=0
k!
f (n) (0) n
= f (0) + xf 0 (0) + · · · + x (4.8)
n!
The corresponding Maclaurin Remainder Formula is
f (n+1) (c) n+1

Rn (x) = (x) (4.9)
(n + 1)!
where c is some number between 0 and x, inclusive.
√
Example 4.1. Find the Taylor Polynomial of order for f (x) = x + 1 about the
point x0 = 0, and find the remainder.
Solution.
√
f (x) = x + 1, f (0) = 1 (4.10)
1 1
f 0 (x) = (x + 1)−1/2 , f 0 (0) = (4.11)
2 2

LESSON 4. THEOREMS ABOUT DERIVATIVES 21
1 1
f 00 (x) = − (x + 1)−3/2 , f 00 (0) = − (4.12)
4 4
000 3 −5/2 000 3
f (x) = (x + 1) , f (0) = (4.13)
8 8
(4) 15 −7/2
f (x) = − (x + 1) (4.14)
16
Hence
0 x2 00 x3 000
P3 (x) = f (0) + xf (0) + f (0) + f (0) (4.15)
2 3!
1 1 1 1 3
=1+ x+ − x2 + x3 (4.16)
2 2 4 6 8
1 1 1
= 1 + x − x2 + x3 (4.17)
2 8 16
and similarly rev:5/29/08
f (4) (c) 4 −15(c + 1)−7/2 x4
R3 (x) = x = (4.18)
4! 384
Example 4.2. Use the Maclaurin series found in the previous example to estimate
√
2.
√ √
Solution. The formula in the previous example is for f (x) = 1 + x, so to get 2 we
need to use x = 1. Thus
√ 1 1 1
2≈1+ − + = 1.4375 (4.19)
2 8 16
Example 4.3. Use the remainder formula
√ found in the previous example to determine
the maximum error in calculating 2 with this formula.
Solution. We start with the formula rev:5/29/08
−15(c + 1)−7/2
R3 (1) = (4.20)
384
where c is some number between 0 and 1 (because 1 is the argument of f (x) that we
evaluated the polynomial at. The maximum value occurs when c = 0, hence we have
rev:5/29/08
−15(0 + 1)−7/2

|R3 (1)| < = 15 ≈ 0.0391 (4.21)
384 384
√
thus we can conclude that our calculation of 2 ≈ 1.4375 ± 0.0391


Lesson 5
Error
As software designers we will need to understand the sources of error in a numerical

calculation if we want to avoid disasters such as the one we discussed in lesson 1. To
understand error, we will also have to learn about how numbers are represented in
computers.
One important fact to always remember is that computers do not represent exact
numbers: they only represent them with a finite number of digits. And since the base
representation used by the computer is rarely base 10, it may not even be possible
to represent a number that we are accustomed to representing exactly, such as 1/10,
which has a repeating fraction in base 2:
0.110 = 0.000110012 (5.1)
Since the computer represents numbers with a finite number of bits, it will have to
truncate this approximation with a finite number of repeats of the 1001. This will
lead to a small error, which as we have seen, can compound into a very large error.
We will use the following definitions:
error = approximate value - true value (5.2)

true value = approximate value + remainder (5.3)
error
relative error = (5.4)
true value
This gives us the useful result
approximate value = (true value) × (1 + relative error) (5.5)
We will sometimes use the term unit in the last place or ulp to represent the value
of a 1 when placed in the rightmost digit of a numerical representation of a number.
For example, if we represent the irrational number
e = 2.718281820459045... (5.6)
23
24 LESSON 5. ERROR
with the approximation

ê = 2.71828 (5.7)
then we say that
1 ulp = 0.00001 (5.8)
and that the error is
= 2.71828 − 2.718281828459045... ≈ 0.1828 ulps (5.9)
The following bit of Mathematica code will find the ulp on your computer:
In:=
ulp=1.0;
While[((1+ulp)-1>0, ulp=ulp/2];
Print["ulp=", 2*ulp];
Out:=
2.2045 ×10−16
One caution about this program: if you forget to put the decimal point in the initial
assignment ulp=1.0, and just write it as ulp=1, the program will run in an infinite
loop, because all of the calculations are rational. To see what is happening, insert a
print statement in the loop.
Because the error is often small, it is sometimes more meaningful to measure the
decimal places of accuracy = − log10 |error| (5.10)
The decimal places of accuracy gives approximately the number of digits that are
accurately represented to the right of decimal point. In our approximation to e we
had 1 ulp = 10−5 , so that an error of 1 ulp represents 5 decimal places of accuracy.
We are sometimes only interested in the relative error, which we can define as
digits of accuracy = − log10 |relative error| (5.11)
The digits of accuracy gives approximately the total number of digits of accuracy,
starting from the first nonzero digit. So 3.124, 3124, and 0.003124 all have 4 digits of
accuracy, whereas they have 3, 0, and 6 decimal places of accuracy, respectively.
We will see that there are two sources of error that we will have to worry about
in a computer program:
data error: error that is already present in the input data before a computation
begins. Typical sources of data error include:
– measurement error: the number supplied to the program is wrong

LESSON 5. ERROR 25
– previous computation; successive computations depend on earlier com-

putations; if the result of one computation that has an error in it is used
as the input to another computation, that causes a data error.
– modeling error: the theory behind the implementation could be approx-
imate. For example, one might model a gravity force by F = −mg (an
approximation that is only valid near the surface of the earth) instead of
F = −GM m/r2 .
computational error: errors introduced by the computation itself. Compu-

tation error is typically classified into the following subclasses:
– roundoff error: error due to the fact that computers use a finite number
of digits to represent numbers.
– truncation error: error due to the truncation of an infinite process, such
as calculating only a finite number of terms in a Taylor Series approxima-
tion.
Let x be a true value of some quantity, and x̃ be the same quantity with data error
in it, and let the function f (x) represent the thing we are trying to compute. Then
the
propagated data error = f (ã) − f (a) (5.12)
Note that the propagated data error defined in this way has nothing to do with the
computer implementation of how we calculate f : it only depends on the true definition
of f . For example, suppose we want to calculate cos(π/3) where we supply as input
the value π = 3.1416. Then the
propagated data error = cos(3.1416/3) − cos(π/3) (5.13)
The computational error depends on the way in which calculate f . Suppose we define
fˆ to be the computer implementation that is used to calculate the true function f .
For example, we might use the first 3 terms of the Taylor series for f (x) = cos(x):
x2 x4
fˆ(x) = cos x ≈ 1 − + (5.14)
2 24
We define the
computational error = fˆ(x̃) − f (x̃) (5.15)
Then for our example implementation,
(3.1416/3)2 (3.1416/3)4
computational error ≈ 1 − + − cos(3.1416/3) (5.16)
2 24

26 LESSON 5. ERROR
Example 5.1. Calculate the relative errors in the ratio

2
x−y
f (x, y) = (5.17)
x+y
with x = 100 and y = 99 assuming an input error of (a) 0.1% and (b) 1.0% for x.
Solution. The exact solution is

2 2
100 − 99 1 1
f (100, 99) = = = ≈ 0.00002525 (5.18)
100 + 99 199 39601
If we start with an error of 0.1% in x, i.e., x̃ = 100.1 then

2
100.1 − 99
f (100.1, 99) = = 0.00003052 (5.19)
100.1 + 99
and the relative error is

f (100.1, 99) − f (100, 99) 0.00003052 − 0.00002525
= = 0.21 (5.20)
f (100, 99) 0.00002525
i.e., an 0.1% error in the input leads to a 21% error in the result. If we are off by as
much as a full percent, say x̃ = 101, then
1
f (101, 99) = (5.21)
10000
hence the relative error is
f (101, 99) − f (100, 99) 0.0001 − 0.00002525
= = 2.9601 (5.22)
f (100, 99) 0.00002525
A data error of 1% gives a propagated error of 296%.

Lesson 6
Number Representation
Number representations in computers are limited because they only store a finite
number of digits.
√ While the loss of information here is obvious for irrational numbers
such as π or 2, what is not obvious at first glance is that even simple integer
operations can be seriously affected. Before proceeding to a formal description of
number representations we present a simple example of a computer that can store 3
digit decimal numbers. What is the best way to represent this kind of number? Our
first guess might be to use a representation that contains 3 machine digits such as:
d d d
where each “d” represents a digit in the range 0, 1, . . . , 9. This is good for numbers
such as 547 or 612, but what about negative numbers such as -43? And what happens
if we try to add two numbers together, such as 547+612? There is no way to represent
1159 in this scheme. So when we try to add two numbers whose sum is larger 999,
we get an error called an “overflow.”
There are two standard ways to represent negative numbers. One way is to use the
following mapping: ‘000’ represents -500, ‘001’ represents -499, ‘002’ represents -498,
..., ‘998’ represents 498, ‘999’ represents 499. In this way we shift the representation
from one that represents all integers 0 < z < 999 to one which represents only integers
in the range −500 < z < 499. This method is called an “excess 500” representation
- the number actually stored in memory is 500 in excess of the number it represents.
A simpler method is to add a sign bit:
s d d d
where s is not a digit that is only allowed to take on two values: 1 or 0, with 0
representing a positive number and 1 representing a negative numbers. This allows
us to represent everything in the range −999 < z < 999. So we represent 765 as
0 7 6 5
and -43 by
27
28 LESSON 6. NUMBER REPRESENTATION
1 0 4 3
Of course neither of these methods allow us to represent anything besides integers.

So we propose another solution: and an exponent.
s d d d e
The number represented by this scheme is given by
z = ±(0.ddd) × 10e (6.1)
allowing us to represent a much larger range of integers. If we want to represent

fractions we could use an excess-5 representation for the exponent, allowing it to
represent numbers from -5 to 4; for now, however, we will restrict our computer to
only be able to store integers, and let the exponent range from 0 to 9.
Now lets see what happens when we add 547 + 612 = 0.547 × 103 + 0.612 × 103 :
0 5 4 7 3
0 6 1 2 3
The answer is 1159, which can not be represented in three digits, so some sort of
approximation scheme is needed. Two common schemes include:
rev:5/29/08 chopping: drop the final digit, 1159 ≈ 0.115 × 104 ; and
0 1 1 5 4
rounding: round off the final digit, 1159 ≈ 0.116 × 104 .
0 1 1 6 4
The following example illustrates one of the dangers of these approximations.
Example 6.1. Calculate the average of two numbers x and y using the formula
x+y
average = (6.2)
2
for
rev:5/29/08 a) x = 563, y = 566, using chopping
b) x = 568, y = 566, using rounding

LESSON 6. NUMBER REPRESENTATION 29
Solution. For (a), we find using truncation that
563 + 566 = 1129 ≈ 1120 (6.3)
1120
average = = 560 (6.4)
2
which is not even between the two values 563 and 566. Had we used rounding, we
would have rounded 1129 to 1130 and obtained the answer 565 which is as close as
we can represent the exact answer 564.5.
For (b) we use rounding:
568 + 566 = 1134 ≈ 1130 (6.5)
1130
average = = 565 (6.6)
2
Again, the answer 565 is not between the two original numbers. We would have
obtained the same answer with truncation.
We will use the operator F l (for “Float” or “Floating Point”, the topic of next
section) to represent our “approximation.” When we rounded, for example, we have
F l(1137) = 1140 (6.7)

F l(1131) = 1130 (6.8)
while for truncation

F l(1137) = F l(1131) = 1130 (6.9)
When we calculated our average we used

F l(F l(a) + F l(b))
average = F l (6.10)
F l(2)
One way to improve the situation we found in the previous example is to use the
revised formula
b−a
average = a + (6.11)
2
Mathematically, both equations 6.2 and 6.12 are identical, but they will give us
different results when implemented in a computer, because of how the F l operator is
applied:
F l(F l(b) − F l(a))
average = F l F l(a) + F l (6.12)
F l(2)
Example 6.2. Repeat the previous example using equation 6.12.

30 LESSON 6. NUMBER REPRESENTATION
Solution. In (a) we used truncation to find the average of 563 and 566

F l(F l(566) − F l(563))
average = F l F l(563) + F l (6.13)
F l(2)

F l(566 − 563)
= F l 563 + F l (6.14)
2

3
= F l 563 + F l (6.15)
2
= F l (563 + F l (1.5)) (6.16)
= F l (563 + 1) (6.17)
= F l(564) (6.18)
= 564 (6.19)
In (b) we used rounding to find the average of 568 and 566:

F l(F l(566) − F l(568))
average = F l F l(568) + F l (6.20)
F l(2)

F l(566 − 568)
= F l 568 + F l (6.21)
2

F l(−2)
= F l 568 + F l (6.22)
2

−2
= F l 568 + F l (6.23)
2
= F l (568 + F l (−1)) (6.24)
= F l (568 − 1) (6.25)
= F l(567) (6.26)
= 567 (6.27)

Lesson 7
Fixed and Floating Point
The two most common representations of numbers in computers are
Fixed point representation: the sign and the radix point have a fixed loca-
tion:
sign digits
This representation is commonly used for integers.
Floating point representation: in addition to providing space for the sign

and the digits, space is also provided to specify the location of radix point. The
field that specifies this location is usually called the exponent while the field
that specifies the digits is called a mantissa.
sign exponent mantissa
Most modern computers have the ability to store both fixed point and floating point
numbers; fixed point representations are typically used for integer and boolean vari-
ables. In some cases the representation will span many computer bytes. Floating
point representations may be implemented in either hardware or software or both.
For example, a typical “32 bit floating point” computer provides hardware (e.g., mem-
ory, registers, and arithmetic operations such as addition and multiplication) for a
floating point representation that uses a total of 32 bits. High level compilers such
as C or FORTRAN also provide additional representations, such as 64 bit “double
precision” and 128 bit “quadruple precision.” The details on how 8-bit bytes are
mapped to 32 bit long integers or 128 bit quadruple precision floating point reals is of
no concern to us here, only the ultimate representation. As one text says, “the details
of how numbers are represented do not concern us in numerical analysis; rather our
concern is whether a number is representable.”1
1
Skeel and Keiper, page 39.
31
32 LESSON 7. FIXED AND FLOATING POINT
Definition 7.1. A real number x is called an n-digit number if it can be expressed

as
x = ±d1 d2 · · · dn × 10e (7.1)
for some integer e and digits d1 , . . . , dn .
There are lots of ways we can represent any real number as a floating point n-digit
number, e.g., we can represent 467.2 as
467.2 = 467.2 × 100 (7.2)
= 4.672 × 102 (7.3)
= 0.04672 × 104 (7.4)
Of course this leads to several different machine representations. Most implemen-
tations typically choose a particular normalization for the number, e.g., represent
it in such a way that the first digit is nonzero. Thus we would represent 467.2 as
0.4672 × 103 (the leading zero to the left of the decimal point would not actually be
stored). In this way we can find a unique representation for each number. (Zero is
sometimes an exception, as we will see).
The most standard notations are given by the IEEE Floating Point Standard
(IEEE-754). These standards provide 32-bit, 64-bit, and 128-bit floating point rep-
resentations. The IEEE 32-bit Standard representation can store numbers in the
approximate range
1.17 × 10−38 < x < 3.4 × 1038 (7.5)
with a precision of around 8 digits (223 ≈ 8 × 106 ) :
Sign Exponent Mantissa
1 Bit 8 Bits 23 bits
s e = e1 e2 · · · e8 m = d1 d2 d3 d4 · · · d23
The 8-bit exponent can take on 256 possible values; the values with all zeroes and
or all 1s (255) have special meanings. The remaining values 1, . . . , 254 are used
to represent the true exponent in an excess-127 representation, so that exponents
−126 ≤ e ≤ 127. The general conversion formula is
x = (−1)s 2e−127 × (1.d1 d2 d3 . . . d23 ) (7.6)
If e = 0 and m 6= 0 then
x = (−1)s 2−126 × (0.d1 d2d 3 . . . d23 ) (7.7)
rev:5/29/08 If e = 255 = 111111112 , then

nan if m 6= 0,

x = −∞ if m = 0 and s = 1, (7.8)

∞ if m = 0 and s = 0.


LESSON 7. FIXED AND FLOATING POINT 33
Here “nan” is a special symbol used to mean “not a number.” Finally, there are two
different ways to represent zero, which we call 0 and −0:
(
0 if e = m = s = 0,
x= (7.9)
−0 if e = m = 0 and s = 1.
The related IEEE 64-bit Standard representation an store numbers in the approx-
imate range
2.22 × 10−308 < x < 1.8 × 10308 (7.10)
with a precision of around 15 to 16 digits (252 ≈ 4.5 × 1015 ):
Sign Exponent Mantissa

1 Bit 11 Bits 52 bits
s e = e1 e2 · · · e11 m = d1 d2 d3 d4 · · · d52
The general formula is now
x = (−1)s 2e−1023 × (1.d1 d2 d3 d4 · · · d52 ) (7.11)
If e = 0 and m 6= 0 then
x = (−1)s 2−1022 × (0.d1 d2d 3 . . . d52 ) (7.12)
If e = 2047 = 111111111112 , then rev:r/29/08

nan if m 6= 0,

x = −∞ if m = 0 and s = 1, (7.13)

∞ if m = 0 and s = 0.

The same representations for 0 and -0 apply.

The IEEE 128-bit Standard uses 1 sign bit, 15 exponent bits, and a 113 bit
mantissa. The various representations are modified accordingly. This standard is
used for quadruple precision numbers in various computer languages.
The IEED standard allows for up four different methods for rounding:
Unbiased: rounds to the nearest value. If the number falls midway it is rounded
to the nearest value with an even (zero) least significant bit. This mode is
required to be default.
Towards zero: round of in the direction of zero.
Towards positive infinity: round of in the direction of ∞ (round “up”).
Towards negative infinity: round off in the direction of −∞ (round “down”).

34 LESSON 7. FIXED AND FLOATING POINT
The following overflow and underflow conditions that may occur as a result of an
operation are not representable and should generate error messages:
Negative overflow: Negative numbers less than −(2 − 2−23 ) × 2127 (32 bit) or
−(2 − 2−52 ) × 21023 (64 bit).
Negative underflow: Negative numbers greater than −2−149 (32 bit denor-
malized (leading 0)), −2−126 (32 bit normalized (leading 1)), −2−1022 (64 bit
normalized), or −21074 (64 bit denormalized).
Positive underflow: Positive numbers less than 2−149 (32 bit denormalized),
2−126 (32 bit normalized), 2−1022 (64 bit normalized), or 2−1074 (64 bit denor-
malized).
Positive overflow: Positive numbers greater than (2 − 2−23 ) × 2127 (32 bit)
or(2 − 2−52 ) × 21023 (64 bit).

Lesson 8
Roots and Bisection
The first numerical problem we will face is root finding: given a function f (x), find a
number r such that f (x) = 0 at x = r. The bisection algorithm uses a binary search
strategy. It assumes we already know two points a and b, one to the right of the root
and one to the left of the root. Since the two points are on opposite sides of the root,
they must be on opposite sides of the x−axis; hence either f (a) > 0 and f (b) < 0,
if the function is decreasing through the root; or f (a) < 0 and f (b) > 0, when the
function is increasing as it passes through the root. In either case,
f (a)f (b) < 0 (8.1)
Then we simply split the interval [a, b] in half: pick a new point
b−a
c=a+ (8.2)
2
and calculate the product f (a)f (c). If f (a)f (c) > 0 then a and c are on the same side
of the root, so we replace a with c. If f (a)f (c) < 0 then a and c are on different sides
of the root, so we replace b with c. Then we repeat the process until our interval size
∆=b−a< (8.3)
where is some desired tolerance.

In general the following additions are good programming style:
1. It might take a long time to reach the desired , so it always a good idea to
include a counter and terminate after some number N steps regardless of how
close you’ve gotten. This is especially important when you are debugging the
program.
2. As you get closer and closer to the root the product f (a)f (c) will get smaller
and smaller, and could run into the level of machine accuracy. Thus its better
to check the product Sign(f (a))Sign(f(b)) rather than the product f (a)f (c).
35
36 LESSON 8. ROOTS AND BISECTION
Here is the algorithm:
Algorithm Bisection
Input a, b, f , , N ;
Let ∆ = (b − a)/2; i = 0;
If f (a)f (b) > 0, Print error message and stop;
While ∆ > and i < N ,
p = a + ∆;
If f (p) = 0, Return(p);
If Sign(f (a))Sign(f (p)) < 0,
Let b = p;
Otherwise,
Let a = p;
End If;
∆ = (b − a)/2
i = i + 1;
End While;
If i = N , Print a message saying that tolerance not reached.
Return (a + ∆).
Of all the algorithms we will discuss for root finding, bisection is the slowest.
In fact, we can predict precisely the number of iterations it will take to converge.
Because the size of the interval is halved each time, it will be the smallest integer n
such that n
1
|b − a| < (8.4)
2
Hence

n log(1/2) < log (8.5)
|b − a|
Since log(1/2) = − log 2, n is the smallest integer for which

1 |b − a|
n>− log = log2 (8.6)
log 2 |b − a|
Thus we could add a test at the beginning of our program, and just iterate n times.
This is actually more efficient, because we don’t need to do a comparison to check
the size of the interval each iteration. Here is the revised algorithm.
Algorithm Bisection (revised)

Input a, b, f , ;
Let N = log2 ((b − a)/); i = 0;
If f (a)f (b) > 0, Print error message and stop;
While i < N ,

LESSON 8. ROOTS AND BISECTION 37
p = a + (b − a)/2;
If f (p) = 0, Return(p);
If Sign(f (a))Sign(f (p)) < 0,
Let b = p;
Otherwise,
Let a = p;
End If;
i = i + 1;
End While;
Return (a + (b − a)/2).
This analysis actually allows us to prove the following.
Theorem 8.1. The Bisection algorithm converges.
Proof. Either the algorithm reaches the exact root at some step of the iteration or it
does not. Let L = |b − a|. Let the value of a, b, and p after the ith iteration be ai , bi ,
and pi , respectively. If for some n we have
f (an )f (bn ) = 0 (8.7)
then either a is a root or b is a root, and the algorithm has converged.

If we never have f (an )f (bn ) = 0 then we continue to split the interval. Since each
iteration splits the interval in half, we have
|a1 − b1 | = L/2 (8.8)

|a2 − b2 | = L/4 (8.9)
|a3 − b3 | = L/8 (8.10)
..
. (8.11)
|an − bn | = L/2n (8.12)
Furthermore, by construction, f (ai )f (bi ) < 0, so the root is in each interval. Therefore
each pi is a distance no larger than |ai − bi | from the root r. Hence
|p1 − r| ≤ |a1 − b1 | = L/2 (8.13)

|p2 − r| ≤ |a2 − b2 | = L/22 (8.14)
|p3 − r| ≤ |a3 − b3 | = L/23 (8.15)
..
. (8.16)
|pn − r| ≤ |an − bn | = L/2n (8.17)
Therefore
0 ≤ lim |pn − r| ≤ lim L/2n = 0 (8.18)
n→∞ n→∞

Hence
lim pn = r (8.19)
n→∞
which proves that the sequence of iterations converges to the root.

√
Example 8.1. Estimate 2 by finding the root of f (x) = x2 − 2.
Solution.
Step 1.
a = 1, b = 2, f (a) = −1, f (b) = 2 (8.20)
p = (a + b)/2 = 1.5 (8.21)
f (p) = (1.5)2 − 2 = 0.25 (8.22)
f (a)f (p) = (−)(+) < 0 (8.23)
so the root is between a and p. So we set
b = p = 1.5 (8.24)
Step 2.
a = 1, b = 1.5, f (a) = −1, f (b) = 0.25 (8.25)
p = (1 + 1.5)/2 = 1.25 (8.26)
f (p) = (1.25)2 − 2 = −0.4375 (8.27)
f (a)f (p) = (−1)(−1) > 0 (8.28)
The root is between p and b, so set
a = p = 1.25 (8.29)
Step 3.
a = 1.25, b = 1.5, f (a) = −0.4375, f (b) = 0.25 (8.30)
p = (1.25 + 1.5)/2 = 1.375 (8.31)
f (p) = 1.3752 − 2 = −0.109 (8.32)
f (p)f (a) = (−.4375)(−.109) > 0 (8.33)
so set
a = p = 1.375 (8.34)
The root is between a=1.375 and b=1.5. As we continue the process we compute the
sequence 1.5, 1.25, 1.375, 1.4375, 1.40625, 1.42188, 1.41406, ...

LESSON 8. ROOTS AND BISECTION 39
Illustration of bisection, showing locations along the x-axis of successive iterations

for the example x2 − 2 = 0.
a5p5b5
a4 p4 b4
a3 p3 b3
a2 p2 b2
a1 p1 b1
a0 p0 b0
1 1.25 2 1.5 1.75 2


Lesson 9
Fixed Point Iteration
Anyone who has every played with their calculator by typing in a number and then
hitting the same function key repeatedly has used fixed point iteration. For example,
√
if you type the number 16 and then start pressing the key you will generate the
sequence (this was generated with a TI-36 which has 10 digit accuracy):
x0 = 16 (9.1)
√ √
x1 = x0 = 16 = 4 (9.2)
√ √
x2 = x1 = 4=2 (9.3)
√ √
x3 = x2 = 2 = 1.414213562 (9.4)
√ √
x4 = x3 = 1.414213562 = 1.1892070115 (9.5)
√ √
x5 = x4 = 1.1892070115 = 1.090507733 (9.6)
..
.
Eventually, after around 30 iterations, the calculator will display something like
1.0000000000 (9.7)
on all subsequent iterations, because

√
1.0000000000 = 1.0000000000 (9.8)
In fact, this iteration has found the fixed point of the square root function
√
f (x) = x (9.9)
to within the machine epsilon of the calculator (1 part in 1010 ), namely, the point
where √
x = f (x) = x (9.10)
41
42 LESSON 9. FIXED POINT ITERATION
Equation 9.10 has only two solutions: x = 1 and x = 0. We have converged on the
first of these solutions. Had we started with any positive number, we still would have
converged on the solution x = 1, regardless of which number we typed in for x0 . Had
we started with x = 0 we would have converged on the other root, x = 0, and had
we started with a negative number, we would have gotten an error message.
What we are doing during this iteration is computing a sequence of function
applications:
x1 = g(x0 ) (9.11)
x2 = g(g(x0 )) = g 2 (x0 ) (9.12)
x3 = g(g(g(x0 ))) = g 3 (x0 ) (9.13)
..
.
xn = g n (x0 ) (9.14)
where we have used the notation g k (x) to denote the repeated application of the
function g(x) k times.
Definition 9.1 (Fixed Point). A number p is called a fixed point of the function
f (x) if p = f (p).
Example 9.1. Find the fixed points of the function f (x) = x4 + 2x2 + x − 3.
Solution. We need to solve
x = f (x) = x4 + 2x2 + x − 3 (9.15)
for x. Hence
0 = x4 + 2x2 − 3 (9.16)
= (x2 − 1)(x2 + 3) (9.17)
= (x − 1)(x + 1)(x2 + 3) (9.18)
Hence there are two fixed points: x = ±1.

Theorem 9.1. A continuous function f (x) will have a fixed point if and only if it
crosses the line y = x (see figure 9.1). A fixed point of f (x) always occurs at an
intersection of the two curves y = f (x) and y = x, namely at the point p such that
p = f (p).
rev.6/5/08:⇓ √
Consider again the example of fixed point iteration f (x) = x, which we illus-
trated by repeatedly pushing the square-root button a calculator. This algorithm can
be visualized as shown in figure 9.2. The top figure shows both the function y = f (x)
and the line y = x. We then draw a line from (x0 , f (x0 )) horizontally to the line
y = x. The two lines meet at the point (f (x0 ), f (x0 )) (middle plot). The head of

LESSON 9. FIXED POINT ITERATION 43
Figure 9.1: A fixed point occurs at the intersection of the curve y = f (x) with the
line y = x. If there are multiple intersections then there are multiple fixed points.
y=x
b
a b
the arrow lies directly over the value of f (x0 ) on the x-axis, so that by projecting
vertically to the curve of y = f (x), we intersect at f (f (x0 )) (bottom plot). We then
repeat this process, generating successive iterations,
√ approaching closer and closer ot
the fixed point (see figure 9.3 at (1, 1) = (x, x).
. ⇑:6/5/08
Example 9.2. Find the first 25 iterations in the fixed point iteration for the function
f (θ) = cos θ (9.19)
to 10 digits of precision using Mathematica with x0 = π.
Solution. We can do fixed point iteration with the function NestList.

In:=
f[x_]:=Cos[x];
c=N[NestList[f, Pi, 25], 10]
Out:=
{3.141592654, -1.0000000000, 0.5403023059, 0.8575532158,

0.6542897905, 0.7934803587, 0.7013687736, 0.7639596829,
0.7221024250, 0.7504177618, 0.7314040424, 0.7442373549,
0.7356047404, 0.7414250866, 0.7375068905, 0.7401473356,
0.7383692041, 0.7395672022, 0.7387603199, 0.7393038924,
0.7389377567, 0.7391843998, 0.7390182624, 0.7391301765,
0.7390547907, 0.7391055719}
Convergence is illustrated in figure 9.4.

√
Figure 9.2: Visualization of fixed point iteration on y = x. See text for description.
4
0 2 4 6 8 10 12 14 16 18
0 2 4 6 8 10 12 14 16 18
0 2 4 6 8 10 12 14 16 18
Despite the success illustrated by the rapid convergence in example 9.2, fixed point
iteration does not always work. This is illustrated by the following example.
Example 9.3. Compute the fixed point of the function
f (x) = x2 − 2 (9.20)
and then, using Mathematica, compute the result of 100 iterations of the fixed point
algorithm using x0 = 1.9 and plot the results as we did in the previous example.
Solution. The fixed point occurs when f (x) = x, so that means
x = x2 − 2 (9.21)
0 = x2 − x − 2 (9.22)
= (x − 2)(x + 1) (9.23)
rev.6/5/08:⇓ So fixed points occur at x = 2 and x = −1 To compute, say, the first 50 fixed point
iterations starting with x0 = 1.5, in Mathematica, In:=
g[x_] := x^2 - 2;
N[NestList[g, 1.9, 100], 5]

√
Figure 9.3: Fixed point iteration on y = f x (continued from fig. 9.1).
2
1.8
1.6
1.4
1.2
1 1.5 2 2.5 3 3.5 4
Out:=
{1.5, 0.25, -1.9375, 1.75391, 1.07619, -0.841821, -1.29134,

-0.332449, -1.88948, 1.57013, 0.465297, -1.7835, 1.18087, -0.605549,
-1.63331, 0.667703, -1.55417, 0.415451, -1.8274, 1.33939, -0.206031,
-1.95755, 1.83201, 1.35625, -0.160593, -1.97421, 1.8975, 1.60052,
0.561675, -1.68452, 0.837612, -1.29841, -0.31414, -1.90132, 1.615,
0.608231, -1.63005, 0.657078, -1.56825, 0.459403, -1.78895, 1.20034,
-0.559188, -1.68731, 0.847012, -1.28257, -0.355011, -1.87397,
1.51175, 0.285399, -1.91855}
No obvious pattern is discernible in the numbers; this is confirmed in figure 9.5. In

fact, the resulting sequence of iteration is chaotic. Not only does it never converge,
it has another interesting property: if we iterate long enough we will calculate an
iteration that is arbitrarily close to virtually every value between -2 and 2. ⇑:6/5/08
Theorem 9.2 (Sufficient Condition for a Fixed Point). Suppose that f (x) is a
continuous function that maps its domain onto a subset of itself, i.e., f (x) ∈ C[a, b]
such that1
f (x) : [a, b] 7→ S ⊂ [a, b] (9.24)
Then f (x) has a fixed point in [a, b].
Proof. Case1: f (a) = a or f (b) = b, in which case the fixed point is at x = a or

x = b, respectively, and the theorem is proved.
1
By C[a, b] we mean the set of all continuous function whose domain is the interval [a, b]

Figure 9.4: Visualization of fixed point iteration in example 9.2.

1
!1
!Π 3Π Π Π 0 Π Π 3Π Π
! ! !
4 2 4 4 2 4
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
Π 5Π 3Π 7Π Π 9Π 5Π
8 32 16 32 4 32 16
Case 2. Both f (a) 6= a and f (b) 6= b. By assumption 9.24 we must have
f (a) > a and (9.25)

f (b) < b (9.26)
Let
h(x) = f (x) − x (9.27)
Since f (x) is continuous, so is h(x), and
h(a) = f (a) − a > a − a = 0 (9.28)

h(b) = f (b) − b < b − b = 0 (9.29)
Hence by the intermediate value theorem, h(x) has a root r ∈ (a, b), such that
h(r) = 0. But at r we have
0 = h(r) = f (r) − r (9.30)
Thus since f (r) = r, r must be a fixed point of f .

Figure 9.5: Left: The first 5 fixed point iterations on g(x) = x2 − 2 starting from x0 =
1.5. Right: The first 100 iterations. There is no discernable pattern of convergence;
in fact, the iteration is chaotic.
2 2
1 1
0 0
-1 -1
-2 -2
-2 -1 0 1 2 -2 -1 0 1 2
Theorem 9.3. Every continuous bounded functions on the real numbers has a fixed
point.
Proof. Let
f (x) : R 7→ R (9.31)
be continuous and bounded. Then it has a least upper bound a and a greatest lower
bound b. Hence
f (x) : R 7→ [a, b] (9.32)
Thus the conditions of Theorem 9.2 are met and hence f (x) has a fixed point.
⇓:rev.6/5/08
(The example that was here previously has been deleted because it wasn’t very helpful.) ⇑:6/5/08
Theorem 9.4 (Condition for a Unique Fixed Point). Let f (x) be a continuous
and diferentiable function that maps its domain onto a subset of itself,
f (x) : [a, b] 7→ S ⊂ [a, b] (9.33)
Suppose further that there exists some positive constant
0<K<1 (9.34)
such that
|f 0 (x)| ≤ K (9.35)
for all x ∈ [a, b]. Then f (x) has a unique fixed point p ∈ [a, b].

Proof. By theorem 9.3 at least one fixed point exists; call it p. Then
f (p) = p (9.36)
Suppose that a second fixed point q 6= p exists. Since q is also a fixed point,
q = f (q) (9.37)
By the Mean Value theorem, there exists some number c ∈ [min(p, q), max(p, q)] such
that
f (p) − f (q)
f 0 (c) = (9.38)
p−q
By equation 9.35, |f 0 (c)| ≤ K, hence

f (p) − f (q)
p−q ≤K (9.39)

i.e.,
|f (p) − f (q)| ≤ K|p − q| < |p − q| (9.40)
because K < 1. But by equations 9.36 and 9.37 we have
|f (p) − f (q)| = |p − q| (9.41)
and therefore
|p − q| < |p − q| (9.42)
Since p 6= q we know that |p − q| 6= 0 hence we can cancel it on both sides of the
equation to gives 1 < 1, which is a contradiction. Hence our original assumption
p 6= q must be wrong. Thus the fixed point is unique.
Example 9.4. Show that
1 x
g(x) = π + sin (9.43)
2 2
has a unique fixed point.
Solution. We first observe that g(x) is continuous and differentiable, and that

1 1
Range(g) = π − , π + ⊂ (−∞, ∞) = Domain(g) (9.44)
2 2
Hence by theorem 9.3 at least one fixed point exists. To verify uniqueness we calculate

0
1 x 1
|g (x)| = cos ≤ < 1
(9.45)
4 2 4
Hence the conditions of theorem 9.4 are met with K = 1/4. Hence the fixed point is
unique. (see figure 9.6.)

Figure 9.6: The fixed point of f (x) = π + (1/2) sin(x/2) is unique. See example 9.4.
4
!2 Π 3Π !Π Π 0 Π Π 3Π 2Π 5Π 3Π 7Π 4Π
! !
2 2 2 2 2 2
Example 9.5. Calculate the first four fixed point iterates of the function in the pre-
vious example, starting with x0 = π, and then use NestList to calculate the first 10
iterations to 20 digits.
Solution.
p1 = π + 0.5 sin(π/2) ≈ 3.64159 (9.46)

p2 = π + 0.5 sin(3.64159/2) ≈ 3.62605 (9.47)
p3 = π + 0.5 sin(3.62605/2) ≈ 3.62700 (9.48)
p4 = π + 0.5 sin(3.62700/2) ≈ 3.62694 (9.49)
In Mathematica,
In:=
g[x_] := Pi + (1/2) Sin[x/2];

N[NestList[g, Pi, 10], 20]
Out:=
{3.1415926535897932385, 3.6415926535897932385, 3.6260488644451156305,

3.6269956224387354753, 3.6269387942254171004, 3.6269422083510946963,
3.6269420032482992065, 3.6269420155698412408, 3.6269420148296252521,
3.6269420148740936891, 3.6269420148714222501}
To find the root of a function f (x) using the fixed point algorithm, we define
g(x) = x − f (x) (9.50)
Then if p is a root of f (x),
g(p) = p − f (p) = p − 0 = p (9.51)
Hence p is a fixed point of g(x) = x − f (x). This suggests that we use the following
algorithm.

Algorithm FixedPointRoot
Input f (x), a first guess p0 , and an error tolerance ;
Define g(x) = x − f (x);
Define ∆ = ∞;
While ∆ > 0,
Let pold = p;
Let p = g(p);
∆ = |p − pold |;
Return (p).
p
Example 9.6. Use the fixed point algorithm to find 1/2 to 25 digits accuracy.
p
Solution. We know that 1/2 is a root of f (x) = x2 − 1/2, so we form the function
g(x) = x − f (x) = x − x2 + 1/2 (9.52)
We can use
In:=
f[x_] := x^2 - 1/2;

g[x_] := x - f[x];
NestList[g, 1.0, 10]
but this returns
Out:=
{1., 0.5, 0.75, 0.6875, 0.714844, 0.703842, 0.708448, 0.706549,

0.707337, 0.707011, 0.707146}
which does not give us enough digits. We also want the computer to calculate the
error for us, so that it can automatically figure out when to stop the calcultions. One
way to do this is by literally translating the iterative algorithm into Mathematica:
∆ = ∞;
p = 1.0‘50;
n = 0;
While[∆ > 10−25 ,
pold = p;
p = g[p];
∆= Abs[p - pold];
n++;
];
Print["The root is ", N[p, 25]," after ", n, " iterations."];

The initialization of p=1.0‘50 ensures that the data starts with 50 digit accuracy.
This is a good general rule of thumb, that you data should have at least twice the
digits that you need in your final answer, although in fact it will depend upon what
kind of calculation you are doing. The output is
The root is 0.7071067811865475244008444after 64 iterations.
That the convergence to 25 digits does occur after 64 iterations can be verified by
including a statement
Print[p]
before the end of the While loop.
√
Example 9.7. Repeat the previous example with 2, starting with x0 = 1.5.
√
Solution. As before we observe that 2 is the root of f (x) = x2 −2 and so we calculate
g(x) = x − f (x) = x − x2 + 2 (9.53)

√
Then 2 is a fixed point of g. The first several iterations are:
x1 = 1.5 − 1.52 + 2 = 1.25 (9.54)

x2 = 1.25 − 1.252 + 2 = 1.6875 (9.55)
x3 = 1.6875 − 1.68752 + 2 = 0.8398 (9.56)
x4 = 0.8398 − 0.83982 + 2 = 2.1345 (9.57)
So far there is no discernible pattern; in fact, the first 100 iterations are (from Math-
ematica):
In:=
g[x_] := x - x^2 + 2;
q = NestList[g, 1.5, 120]
Out:=
{1.5, 1.25, 1.6875, 0.839844, 2.13451, -0.421611, 1.40063, 1.43886, 1.36854,

1.49563, 1.25872, 1.67434, 0.870917, 2.11242, -0.3499, 1.52767, 1.19389,
1.76851, 0.640879, 2.23015, -0.74343, 0.703883, 2.20843, -0.668739, 0.884049,
2.10251, -0.318025, 1.58083, 1.0818, 1.91151, 0.257634, 2.19126, -0.610355,
1.01711, 1.9826, 0.0519074, 2.04921, -0.150061, 1.82742, 0.487955, 2.24985,
-0.811992, 0.528676, 2.24918, -0.809622, 0.534889, 2.24878, -0.808241,
0.538505, 2.24852, -0.807313, 0.540933, 2.24832, -0.806639, 0.542696,
2.24818, -0.806123, 0.544042, 2.24806, -0.805715, 0.545109, 2.24797,
-0.805382, 0.545977, 2.24789, -0.805106, 0.546699, 2.24782, -0.804872,
0.547309, 2.24776, -0.804671, 0.547832, 2.24771, -0.804497, 0.548286,
2.24767, -0.804345, 0.548684,

2.24763, -0.80421, 0.549036, 2.2476, -0.80409, 0.54935, 2.24756,

-0.803982, 0.549631, 2.24754, -0.803885, 0.549884, 2.24751, -0.803797,
0.550114, 2.24749, -0.803716, 0.550324, 2.24747, -0.803643, 0.550516,
2.24745, -0.803575, 0.550692, 2.24743, -0.803513, 0.550855, 2.24741,
-0.803455, 0.551005, 2.2474, -0.803401, 0.551145, 2.24738, -0.803352,
0.551274, 2.24737, -0.803305, 0.551396, 2.24736, -0.803262, 0.551509}
It appears that the answer is “cycling” between three different values, which are
approximately
x0 = 0.554, x1 = 2.47, x2 − 0.802 (9.58)
none of which are the correct answer! We call this a “period-3 cycle.” Such phenom-
ena often occur in the study of dynamical systems, of which fixed point iteration is
an example (See figure 9.7.)
Figure 9.7: Convergence of fixed point iteration on f (x) = x − x2 − 2 to a period-3

limit cycle.
!1
!1 0 1 2
p
So why, when things worked √so well with 1/2, does fixed point iteration fail so
miserably in when we calculate 2? For one thing,
g 0 (x) = 1 − 2x (9.59)
√
Near the root, say at x = 2 + , we have
√ √
|g 0 ( 2)| = |1 − 2( 2 + )| ≈ |1 − 2.83 − 2| ≈ | − 1.83 − 2| (9.60)
There is no way that we can bound this number by a constant that is smaller than
1, so that theorem 9.3 does not even guarantee √
the existence of a fixed point (even
though we know that one does, in fact, exist at 2). The next theorem gives us an
idea.

Theorem 9.5. The fixed point iteration algorithm on a function g(x) will converge
to a fixed point of g(x) if the conditions of theorem 9.4 are satisfied. More precisely,
suppose that g(x) is a continuous function on [a, b] such that g : [a, b] 7→ S ⊂ [a, b],
and that there is a positive number K < 1 such that |g 0 (x)| ≤ K on [a, b]. Then for a
starting point p0 , the sequence pn = g(pn−1 ) converges to a unique fixed point of g(x).
Proof. By theorem 9.4 a unique fixed point exists. We need to show that
lim pn = p (9.61)
n→∞
Since g : [a, b] 7→ S ⊂ [a, b], then all of the pn = g(pn−1 ) ∈ [a, b]. Furthermore, since
p is a fixed point, p = g(p), and
|pn − p| = |pn − g(p)| = |g(pn−1 ) − g(p)| (9.62)
If there is some n such that pn−1 = p then the sequence has converged, and the
theorem has been proven. So we may assume that there is no n such that pn = p.
Since pn−1 6= p, we know that there is some point cn between pn−1 and p such that

0
g(pn−1 ) − g(p)
|g (cn )| = (9.63)
pn−1 − p
hence there exists some K < 1 such that
|pn−1 − p| = |g 0 (cn )| |g(pn−1 ) − g(p)| ≤ K |g(pn−1 ) − g(p)| (9.64)
Substituting equation 9.62,
|pn − p| ≤ K |pn−1 − p| (9.65)
Since this is true for all n,
|pn − p| ≤ K |pn−1 − p| (9.66)

≤ K 2 |pn−2 − p| (9.67)
≤ K 3 |pn−3 − p| (9.68)
..
. (9.69)
≤ K n |p0 − p| (9.70)
Hence
0 ≤ lim |pn − p| ≤ lim K n |p0 − p| = 0 (9.71)
n→∞ n→∞
because K < 1 implies that K n → 0 as n → ∞. Since |pn − p| → 0 is equivalent to

pn → p, the algorithm converges.

Example
p 9.8. Prove that the fixed point algorithm for g(x) = x − x2 + 1/2 converges
to 1/2 ≈ 0.707.
p
Solution. First we observe that 1/2 is a fixed point of g(x) since
p p p
g 1/2 = 1/2 − 1/2 + 1/2 = 1/2 (9.72)
Next we calculate
|g 0 (x)| = |1 − 2x| (9.73)
We want to determine if there is some positive constant K < 1 such that |g 0 (x)| ≤ K,
which requires that
−K ≤ 1 − 2x ≤ K (9.74)
−1 − K ≤ −2x ≤ −1 + K (9.75)
K −1 K +1
≤x≤ (9.76)
2 2
If we try K = 0.8 then
−0.1 ≤ x ≤ 0.9 (9.77)
In other words, for all x ∈ [−0.1, 0.9], we have |g 0 (x)| ≤ K < 1. Thus the conditions
of the theorem are met for any starting point in [−0.1, 0.9]. If we start with, say,
x0 = 1/2, which is clearly in this interval, the algorithm converges by theorem 9.5.
From equation 9.70 we could calculate an error estimate based on the size of the
original interval [a, b]. Since both p and p0 are in the interval [a, b], if we stop the
iteration after n steps the error is limited by
|pn − p| ≤ K n |p0 − p| ≤ K n |b − a| (9.78)
Thus each iteration reduces the error by a factor of K. While this is a significant
improvement, equation 9.78 is not very useful if the interval [a, b] is especially large,
such as the whole real line. Fortunately we can make an improved estimate based on
the values of the first guess and the first iteration.
Theorem 9.6 (Error Estimate for Fixed Point Iteration). If fixed point itera-
tion is terminated after n ≥ 1 steps then the error is limited by
K n |p1 − p0 |
|pn − p| ≤ (9.79)
1−K
rev.6/5/08:⇓

Proof. Prove by induction. For n = 1 wee need to prove

K
|p1 − p| ≤ |p1 − p0 | (9.80)
1−K
To demonstrate 9.80 we use the Mean Value Theorem: there is some number c between
p0 and p such that

0
g(p0 ) − g(p) p1 − p
|g (c)| = = ≤K (9.81)
p0 − p p0 − p
where the last step follows because |g 0 (c)| ≤ K. Hence by the triangle inequality,
|p1 − p| ≤ K|p0 − p| (9.82)

= K|p0 − p1 + p1 − p| (9.83)
≤ K (|p0 − p1 | + |p1 − p|) (9.84)
= K|p0 − p1 | + K|p1 − p| (9.85)
Solving the last equation for |p1 − p| yields equation 9.80.

For the inductive step we assume that equation 9.79 holds, and attempt to prove
from that that it holds from n = 1, namely, that
K n+1 |p1 − p0 |
|pn+1 − p| ≤ (9.86)
1−K
We again use the Mean Value Theorem: there is some number c between pn and p
such that
0
g(pn ) − g(p) pn+1 − p
|g (c)| =
= ≤K (9.87)
pn − p pn − p
Hence
|pn+1 − p| ≤ K|pn − p| (9.88)
Substituting equation 9.79 on the right yields equation 9.86.
⇑:6/5/08
Example 9.9. Estimate the number of iterations required for fixed point iteration to
converge to the fixed point of
1 x
g(x) = π + sin (9.89)
2 2
with (a) 4 digit accuracy and (b) 10 digit accuracy, using p0 = π.
Solution. By theorem 9.6, to achieve an accuracy of , it is sufficient to find the

smallest n such that
K n |p1 − p0 |
|p − pn | ≤ < (9.90)
1−K

(1 − K)
Kn < (9.91)
|p1 − p0 |
(1 − K)
n log K < log (9.92)
|p1 − p0 |
1 (1 − K)
n> log (9.93)
log K |p1 − p0 |
where we reversed the direction of the less-than sign because for K < 1 then log K <
0. To find K we calculate

0
1 x 1
|g (x)| = cos ≤ (9.94)
4 2 4
so we chose K = 1/4. Hence, since p0 = π and log(1/4) = − log 4,
−1 (3/4)
n> log (9.95)
log 4 |p1 − π|
To get p1 we iterate once,
1 π 1
p1 = π + sin = π + (9.96)
2 2 2
Therefore we need to find an n such that
−1 (3/4) −1 3
n> log = log (9.97)
log 4 1/2 log 4 2
For = 10−4 then

−1
n> log(1.5 × 10−4 ) ≈ 6.3 (9.98)
log 4
hence we will need at most 7 iterations. For = 10−10 ,
−1
n> log(1.5 × 10−10 ) ≈ 16.3 (9.99)
log 4
so that 17 iterations are guaranteed to be sufficient.

Appendix
The fixed point plots shown in this section can be generated with the following Math-
ematica program:
fixedPointPlot[f_, x0_, n_] :=

Module[{data, u, g, g2, g1, min, max, range},
data = NestList[f, x0, n];
data = Partition[
Flatten[Table[{data[[i]], data[[i]]},
{i, 1, Length[data]}]], 2, 1];
g = Graphics[Line[data]];
max = Max[data];
min = Min[data];
range = max - min;
If[range == 0, range = 1];
max = max + range/4;
min = min - range/4;
g1 = Plot[u, {u, min, max}];
g2 = Plot[f[u], {u, min, max}];
Show[g2, g1, g]
]
A typical plot can be produced as follows:
In:=
g[x_]:= 4.0 x (1-x);

fixedPointPlot[g, 0.49, 100]
Out:=
1.0
0.5
!0.2 0.2 0.4 0.6 0.8 1.0 1.2
!0.5
!1.0


Lesson 10
Newton’s Method
Suppose we already have an estimate p0 for the root of f (x). If we project the tangent
line to f (x) at the point (p0 , f (p0 )) down to where it intersects with the x-axis, this
should give us a better guess for root, as illustrated in figure 10.1.
Figure 10.1: Derivation of Newton’s method.
f(x)
tangent lines
p3 p2 p1 p0
The slope of the straight line connecting the points (p0 , f (p0 )) and (p1 , 0) is
rise f (p0 ) − 0 f (p0 )
f 0 (p0 ) = slope = = = (10.1)
run p0 − p1 p0 − p1
Solving for p1 ,
f (p0 )
p1 = p0 − (10.2)
f 0 (p0 )
This gives us the well know formula for Newon’s Method
f (pn )
pn+1 = pn − (10.3)
f 0 (pn )
59
60 LESSON 10. NEWTON’S METHOD
This gives us the following algorithm.
Algorithm Newtons Method

Input f (x), p0 , tolerance
Let ∆ = f (p0 )/f 0 (p0 )
While ∆ > ,
p0 = p0 − ∆
∆ = f (p0 )/f 0 (p0 )
End While
Return p0
√
Example 10.1. Find 2 with Newtons method as the root of f (x) = x2 − 2. Use
p0 = 2.
Solution. To get an iteration formula for pn we need to know the derivative of f (x),
which is
f 0 (x) = 2x (10.4)
Hence the iteration formula is
p2n − 2
pn+1 = pn − (10.5)
2pn
Although it is possible to simplify this algebraically, convergence of the algorithm
(which we have not proven yet) ensures that the second term above approaches zero
and hence it is computationally preferable to leave it in this form rather than placing
the sum over a common denominator. Thus
22 − 2
p1 = 2 − = 1.5 (10.6)
2×2
1.52 − 2
p2 = 1.5 − = 1.4167 (10.7)
2 × 1.5
1.41672 − 2
p3 = 1.4167 − = 1.4142 (10.8)
2 × 1.4167
and so forth.
In general Newton’s method converges extremely rapidly. The only time it will
be slow to converge is√when f 0 (p) = 0. As the following Mathematicaillustrates, the
method converges to 2 to 50 digits, starting with p0 = 2, in only 6 iterations.
In:=
f[x_] := x^2 - 2;
g[x_] := x - f[x]/f’[x];
NestList[g, 2.0‘50, 7]

LESSON 10. NEWTON’S METHOD 61
Out:=
{2.0000000000000000000000000000000000000000000000000,
1.5000000000000000000000000000000000000000000000000,
1.4166666666666666666666666666666666666666666666667,
1.414215686274509803921568627450980392156862745098,
1.414213562374689910626295578890134910116559622116,
1.414213562373095048801689623502530243614981925776,
1.41421356237309504880168872420969807856967187538,
1.41421356237309504880168872420969807856967187538}
Theorem 10.1 (Convergence of Newton’s Method). Suppose that f (x) is con-

tinuously differentiable1 and has a root p ∈ [a, b]. Suppose further that f 0 (p) 6= 0.
Then there is some interval
I = [p − δ, p + δ] (10.9)
for some number δ ≥ 0 such that for any p0 ∈ I, Newton’s method will converge.
Proof. We need to prove that pn → p as n → ∞, when
pn+1 = pn − f 0 (pn )/f 0 (pn ) (10.10)
We first observe that Newton’s method is nothing more than fixed point iteration on
the function
g(x) = x − f (x)/f 0 (x) (10.11)
Furthermore, since p is a root of f , then it is also a fixed point of g, because
g(p) = p − f (p)/f 0 (p) = p − 0/f 0 (p) = p (10.12)
Since f 0 (p) 6= 0 then by continuity there must be some interval U = [p−, p+] ⊂ [a, b]
about p such that f 0 (x) 6= 0 for all x ∈ U . Since f (x) and f 0 (x) are defined and
continuous on [a, b] then they are defined and continuous on U ⊂ [a, b]. Since by
construction of U , f 0 (x) 6= 0 on U then g(x) is also defined and continuous on U .
Therefore

0 d f (x)
g (x) = x− 0 (10.13)
dx f (x)
f 0 (x)f 0 (x) − f (x)f 00 (x)
=1− (10.14)
(f 0 (x))2
f (x)f 00 (x)
= (10.15)
(f 0 (x))2
Furthermore, since f (p) = 0, g 0 (p) = 0. So if we pick any small number K < 1

then by continuity there must be some interval I = [p − δ, p + δ] about p such that

Figure 10.2: Figures for proof of convergence of Newton’s method.
f ′(p)≠0 K
p-δ p p+δ
-K
p-ε p p+ε
|g 0 (x)| ≤ K, as we see in fig. 10.2. This proves that there is some K > 0 such that
|g 0 (x)| ≤ K < 1 in some interval about p.
To see that the g : I 7→ S ⊂ I, let x ∈ I. Then by the mean value theorem there
is a point c ∈ I, between p and x. such that
|g(p) − g(x)|
|g 0 (c)| = (10.16)
|p − x|
or
|g(p) − g(x)| = |p − x||g 0 (c)| (10.17)
Since the maximum distance between p and x in I is δ,
|g(p) − g(x)| ≤ δ|g 0 (c)| < δK < δ (10.18)
because |g 0 (x)| ≤ K < 1. But since p is a fixed point of g, we know that g(p) = p,
and therefore
|p − g(x)| < δ (10.19)
or equivalently,
p − δ < g(x) < p + δ (10.20)
Thus g maps I into a subset of itself, and hence all of the hypotheses of theorem 9.5
are met. Therefore fixed point iteration on g converges to the fixed point of g, which
we have already shown is a root of f . Thus Newton’s method converges.
We can also do some error analysis for Newton’s method. Recall that by Taylor’s
theorem (theorem 4.6),
1
f (p + ) ≈ f (p) + f 0 (p) + 2 f 00 (p) + · · · (10.21)
2
1
≈ f 0 (p) + 2 f 00 (p) + · · · (10.22)
2
1
By continuously differentiable we me the the function is continuous and differentiable and its
first derivative is also continuous.

because f (p) = 0. Similarly,

f 0 (p + ) ≈ f 0 (p) + f 00 (p) + · · · (10.23)
Let i be the error after the ith iteration. Then since
f (xi )
xi+1 = xi − (10.24)
f 0 (xi )
we have
i+1 − i = (xi+1 − p) − (xi − p) (10.25)
= xi+1 − xi (10.26)
f (xi )
=− 0 (10.27)
f (xi )
i f 0 (p) + 12 2i f 00 (p) + · · ·
=− 0 (10.28)
f (p) + i f 00 (p) + · · ·
Solving for i+1 ,

(f 0 (p) + f 00 (p) + · · · ) − f 0 (p) + 1 2 f 00 (p)
i i i 2 i
|i+1 | ≈ (10.29)

0 00
f (p) + i f (p) + · · ·

2 00
f (p)|
≈ i 0 (10.30)
2f (p)
In other words, the error reduces by the square of the error on the previous step. By
comparison, the bisection algorithm has
1
i+1 = |i | (10.31)
2
The quadratic factor results in Newton’s method converging much more quickly than
the bisection algorithm. We will return to this in section 12.
We can also implement Newton’s method iteratively in Mathematica:
NewtonsMethod[f_, p0_, eps_: 0.001, Nmax_: 10] :=
Module[{delta, Delta i, p},
Delta[x_] := f[x]/f’[x];
p = p0;
delta = Delta[p]; i = 0;
While[And[delta>eps, i++<Nmax],
p = p - delta;
delta = Delta[p];
];
Return[{i, p}];
]

To find the root of x2 − 2 to 50 significant figures,

In:=
f[x_] := x^2 - 2;
NewtonsMethod[f, 1.5‘53, 10^-50]
Out:=
{6, 1.414213562373095048801688724209698078569671875376948}
Under certain conditions Newton’s method will not converge, even if a root does
exist. For example, by theorem 10.1 if the derivativep is not continuous in the entire
interval then it will fail. The function f (x) = x/ |x| provides an example of this
situation. The derivative is everywhere continuous except at the origin where it
becomes infinite. The plot of f (x) is also a mirror image of itself through the origin.
In this case Newton’s method can lead to cyclic iteration. A similar case can occur
if the initial point is chosen√on the edge of an open interval of convergence, as with
f (x) = x/(1 + x2 ) at x = 1/ 3|. In both cases we have a situation where xn+1 = −xn
and thepfunction is a mirror image of itself. The same thing happens with f (x) = x2
if x = 5/3.
A variation on the Newton’s method called the Damped Newton’s Method can fix
these situations by checking if successive iterations decrease in magnitude. If they do
not, the interval is halved, until they do. The damped Newton method will always
converge to either a root or to a local minimum.
Algorithm Damped Newton Method

Let ∆ = f (p0 )/f 0 (p0 )
Let p = p0 , fnew = f (p)
While ∆ > ,
fold = fnew
pnew = p − ∆
fnew = f (pnew )
While |fnew | ≥ |fold |,
∆ = ∆/2
pnew = p − ∆
fnew = f (pnew )
End While
p = pnew
∆ = f (p)/f 0 (p)
End While
Return p0
The formula that is now known as Newton’s method was actually developed by the
British mathematician Thomas Simpson (1740), who is better know as the inventor

Figure 10.3: When Newton’s method fails. Top: f (x) = x/|x| has an infinite deriva-
2
tive at the origin. Middle: f (x) = x/(1
√+ x √ ) is a mirror image of itself. Newton’s
method converges on the interval (−1/ 3, 1/ 3), diverges outside this interval, and
oscillates right on the endpoints. Bottom: f (x) = x2 has a local
p minimum, but no
root, at x = 0. Newton oscillation can become trapped if x0 = 5/3.
1.5
1.0
0.5
0.0
!0.5
!1.0
!1.5
!2 !1 0 1 2
0.4
0.2
0.0
!0.2
!0.4
!0.5 0.0 0.5
0
!2 !1 0 1 2

of Simpson’s Rule to numerically calculate integrals. Newton (1669) and Joseph

Raphson (1690) published formulas based on the results of François Viète (1540-1603),
who derived a set of formulas for the roots of polynomials. Hero of Alexandria (10-
70) wrote about the “Babylonian Algorithm” for the square root that we discussed in
chapter 1, which is also a form of Newton’s method. This method is widely
√ attributed
to the ancient Babylonians because of the existence of a formula for 2 on an ancient
tablet, but the evidence that they used this algorithm is not conclusive.

Lesson 11
Secant Method
The main problem with Newtons method is that we need to know both the function
and its derivative. If the derivative is easy to calculate this is not a problem, but
sometimes it can be very expensive computationally to calculate the derivative. To
solution to this problem is to stop calculating the derivative after the first iteration
and instead approximate it by the slope of the line connecting the first two guesses
(see figure).
secant line
f(x)
p2 p1 p0
The slope of the line through the points (pn , f (pn )) and (pn−1 , f (pn−1 ) is used to
approximate the derivative at pn .
f (pn ) − f (pn−1 )
f 0 (pn ) ≈ (11.1)
pn − pn−1
67
68 LESSON 11. SECANT METHOD
The derivation is similar to the derivation of Newton’s method; we just use the slope
derived here in place of the derivative:
f (pn )
pn+1 = pn − (11.2)
f 0 (pn )
pn − pn−1
= pn − f (pn ) (11.3)
f (pn ) − f (pn−1 )
This method converges at about the same rate as Newton’s method. Here is the
algorithm.
Algorithm SecantMethod
Input f (x), p0 , p1 , tolerance
Let q0 = f (p0 ), q1 = f (p1 )
Let ∆ = q1 (p1 − p0 )/(q1 − q0 )
While ∆ > ,
p0 = p1 ;
p1 = p1 − ∆
q0 = q1
q1 = f (p1 )
∆ = q1 (p1 − p0 )/(q1 − q0 )
End While
Return p1
One complaint about both Newtons method and the Secant method is that it is
difficult to estimate an error bounds. With bisection, on the other hand, we had
|| < |an − bn |/2, because we know that the actual root always lies somewhere inside
the interval [an , bn ] . Since successive iterations of either Newtons Method (or the
Secant method) will not, in general, bracket the root, we cannot make this type of
simple error limit. The Method of Regula Falsi (Method of False Position) is a
modification of the Secant Method that ensures that successive iterations bracket the
root, at some (sometimes significant) cost of execution time.
Here is the idea behind the algorithm. We initially start with two guesses p0 , p1
that are known to bracket the root and then calculate p2 using the secant method. If
f (p1 )f (p2 ) < 0 (11.4)
then the root is in the interval [p1 , p2 ], so we use p1 and p2 to calculate p3 , and p1 and
p3 become our next initial values. Otherwise, we use p0 and p2 to calculate p3 , and
p0 and p3 become our next values. The algorithm is shown on the next page.

LESSON 11. SECANT METHOD 69
Algorithm Regula Falsi

Input f (x), p0 , p1 , tolerance
Let q0 = f (p0 ), q1 = f (p1 )
Let ∆ = q1 (p1 − p0 )/(q1 − q0 )
While ∆ > ,
p = p1 − ∆;
q = f (p);
If qq1 < 0 then
p0 = p1 ;
q0 = 11 ;
End if.
p1 = p;
q1 = q;
∆ = q1 (p1 − p0 )/(q1 − q0 )
End While
Return p1
Versions of the method of false position were cited in Vaishali Ganit, written
in India around the 3rd century BC, and The Nine Chapters of Mathematical Art
written in China a century or two later. It was well known by the middle ages and
was cited by Fibonacci in his text Liber Abaci written in 1202.

70 LESSON 11. SECANT METHOD

Lesson 12
Error Analysis for Iterative

Methods
Definition 12.1 (Order of Convergence). We say that a sequence pn converges

to p (or write pn → p) with order k > 0 and asymptotic error constant λ > 0 if
n+1 |pn+1 − p|
lim = lim =λ (12.1)
k
n→∞ n n→∞ |p − p|k
n
where n is the error after the nth iteration.
We say that pn → p converges linearly if
n+1 |pn+1 − p|
lim = lim =λ (12.2)
n→∞ n n→∞ |pn − p|
We say pn → p converges quadratically if
n+1 |pn+1 − p|
lim = lim =λ (12.3)
2
n→∞ n n→∞ |p − p|2
n
The following is a good general rule of thumb: the higher the order of conver-
gence, the faster the sequence converges. To see why this general rule is true,
suppose, for example, that an → p linearly with asymptotic error constant λ and that
bn → p with the same asymptotic error constant λ. Then
|an+1 − p| |bn+1 − p|
lim = λ = lim (12.4)
n→∞ |an − p| n→∞ |b − p|2
n
71
72 LESSON 12. ERROR ANALYSIS FOR ITERATIVE METHODS
Hence for sufficiently large n,
|an − p| ≈ λ|an−1 − p| (12.5)

≈ λ2 |an−2 − p| (12.6)
≈ λ3 |an−3 − p| (12.7)
..
.
≈ λn |a0 − p| (12.8)
= λn ∆ (12.9)
where ∆ = |a0 − p|. Now suppose that b0 = a0 . Then for sufficiently large n,
|bn − p| ≈ λ|bn−1 − p|2 (12.10)

2
≈ λ λ|bn−2 − p|2 (12.11)
= λ3 |bn−2 − p|4 (12.12)
4
≈ λ3 λ|bn−3 − p|2 (12.13)
= λ7 |bn−3 − p|8 (12.14)
..
.
n −1 n
≈ λ2 |b0 − p|2 (12.15)
2n −1 2n
=λ ∆ (12.16)
2n
(λ∆)
= (12.17)
λ
The following table illustrates the differences in the rate of convergence between
linearly convergent and quadratically convergence sequences. The table shows the
values of n for using ∆ = 1 and for different values of λ.
Linear Quadratic
λ 0.5 0.5 0.9 0.99
n=1 0.25 0.125 0.729 0.97
n=2 0.125 7.8 × 10−3 0.478 0.932
n=3 0.0625 3.1 × 10−5 0.206 0.860
n=4 0.0312 6.6 × 10−10 0.0382 0.732
n=5 0.0156 1.1 × 10−19 0.00131 0.531
n=6 0.0078 5.9 × 10−39 1.5 × 10−6 0.279
n=7 0.0039 1.7 × 10−77 2.1 × 10−12 0.0771
n=8 0.0019 1.5 × 10−154 4.1 × 10−24 5.9 × 10−3
n=9 9.7 × 10−4 1.1 × 10−308 1.5 × 10−47 3.4 × 10−5
n = 10 4.9 × 10−4 6.2 × 10−617 2.2 × 10−94 1.2 × 10−9

LESSON 12. ERROR ANALYSIS FOR ITERATIVE METHODS 73
√
Example 12.1. Suppose we know two different algorithms to find 2, one of which is
linearly convergent with error constant λ = 1/2, and the other is quadratically conver-
gent with error constant λ = 1/2. Assuming our initial error is ∆ = 1, estimate the
number of iterations each algorithm will require to converge to 50 significant figures.
Solution. For the linearly convergent algorithm we have n ≈ λn ∆, hence
n
−50 1
10 > × (1) (12.18)
2
2n > 1050 (12.19)
n log 2 > 50 log 10 (12.20)
log 10 (50)
n > 50 ≈ ≈ 166 (12.21)
log 2 0.301
n n −1
For the quadratically convergent sequence n ≈ (λ∆)2 /λ = (λ)2 for ∆ = 1. Hence
2n −1
−50 1
10 > (12.22)
2
2n −1
2 > 1050 (12.23)
(2n − 1) log 2 > 50 log 10 (12.24)
log 10
2n > 1 + 50 ≈ 167 (12.25)
log 2
n log 2 > log 167 (12.26)
log 167
n> ≈ 7.4 (12.27)
log 2
so 8 iterations will suffice.
√
Example 12.2. An iteration formula to find 3
7 as the root of f (x) = x3 − 7, that
can be derived using Newton’s method, is
x3 − 7
g(x) = x − (12.28)
3x2
Show that this algorithm converges quadratically.
Solution. Let x = pn . Then
n+1 g(x) − 71/3

= (12.29)
2 (x − 71/3 )2
n
x − (x3 − 7)/(3x2 ) − 71/3

= (12.30)
x2 − 2(71/3 )x + 72/3
3 1/3 2

2x − 3(7 )x + 7
= 4 (12.31)
3x − 6(71/3 )x3 + 3(72/3 )x2

√
Hence, since we know that pn → 3 7 as n → ∞,
2x3 − 3(71/3 )x2 + 7

n+1
lim 2 = lim

√

4 1/3 3 2/3 2
(12.32)
n→∞ n 3
x→ 7
3x − 6(7 )x + 3(7 )x
√
3
If we plug x = 7 into the right hand side of this limit we get 0/0, so we apply
L’Hopital’s rule:
2 1/3

n+1 6x − 6(7 )x
lim 2 = lim √

3

1/3 )x2 + 6(72/3 )x
(12.33)
n→∞ n x→ 3 7 12x − 18(7

x − 71/3

= lim
√

2 1/3 )x + 72/3
(12.34)
x→ 3 7 2x − 3(7
(12.35)
Again this gives 0/0 so we can use L’Hopital a second time,

n+1 1
lim 2 = lim
1/3 )
(12.36)
n→∞ n x→71/3 4x − 3(7

1
= 1/3
(12.37)
4(7 ) − 3(71/3 )
= 7−1/3 ≈ 0.523 (12.38)
which proves that the iteration converges quadratically with asymptotic error constant
λ ≈ 0.522.
Theorem 12.1. Newton’s Method converges quadratically if f 0 (p) 6= 0.
Proof. Recall from equation 10.30 that for Newton’s method

2 00
f (p)
|n+1 | = n 0 (12.39)
2f (p)
Hence
n+1 f 00 (p)

= (12.40)
2 2f 0 (p)
n
Thus Newton’s method converges quadratically unless f 0 (p) = 0, with
λ = |f 00 (p)/2f 0 (p)| (12.41)
Theorem 12.2. If all of the conditions of the fixed point theorem (theorem 9.5) are
met, and g 0 (p) 6= 0, then the fixed point algorithm converges (at least) linearly.

We observe that this says that fixed point converges at least linearly; this does not
mean that every fixed point algorithm only converges linearly. As we saw above, New-
ton’s method, which is a type of fixed point iteration, in fact, converges quadratically.
So this theorem says that convergence is linear or better, i.e, k ≥ 1.
Proof. Since p is a fixed point, p = g(p). Let p1 , p2 , . . . be the sequence of fixed point
iterates pn+1 = g(pn ). Then by the mean value theorem, for each n there is a number
cn between the fixed point p and the nth fixed-point iterate pn such that
g(pn ) − g(p) pn+1 − p

g 0 (cn ) = = (12.42)
pn − p pn − p
Therefore
pn+1 − p
lim = lim |g 0 (cn )| (12.43)
n→∞ pn − p n→∞
Since cn is between p and pn ,
|cn − p| ≤ |pn − p| (12.44)
Furthermore, since the conditions of theorem 9.5 are met, we know that pn → p and
thereofore
0 ≤ lim n → ∞|cn − p| ≤ lim |pn − p| = 0 (12.45)
n→∞
hence
lim cn = p (12.46)
n→∞
Therefore
lim g 0 (cn ) = g 0 (p) (12.47)
n→∞
Using equation 12.47 in equation 12.43, we find that

pn+1 − p
lim = |g 0 (p)| =
6 0 (12.48)
n→∞ pn − p
Thus the sequence converges linearly with asymptotic error constant λ = |g 0 (p)|.
Theorem 12.3. Let I be an open interval and suppose that the following conditions
hold:
1. g(x) is twice continuously differentiable on I;
2. p ∈ I is a fixed point of g(x);
3. g 0 (p) = 0;
4. g 00 (p) 6= 0;

5. |g 0 (x)| ≤ K < 1 on I;
6. |g 00 (x)| < M on I (i.e., g is bounded on I).

Then there exists some δ > 0 such that for any p0 in the interval
[p − δ, p + δ] ⊂ I (12.49)
the sequence pn → p quadratically.
Geometry for theorem 12.3.

I
p
( ( | ) )
p±δ
Proof. Chose some δ > 0 such that S = [p − δ, p + δ] ⊂ I. Since |g 0 (x)| ≤ K < 1 on

I then |g 0 (x)| ≤ K < 1 on S. Hence by theorem 9.5, for any p0 ∈ S the sequence pk
lies entirely in S and converges to p.
Pick any point x ∈ S. By Taylor’s theorem, there is some number c between p
and x such that
g 00 (c)
g(x) = g(p) + g 0 (p)(x − p) + (x − p)2 (12.50)
2
By assumption 3 of the theorem, g 0 (p) = 0, so that
g 00 (c)
g(x) = g(p) + (x − p)2 (12.51)
2
Since p is a fixed point of g, then g(p) = p, and
g 00 (c)
g(x) = p + (x − p)2 (12.52)
2
Hence for each pn in the sequence there is a number cn such that
g 00 (cn )
g(pn ) = p + (pn − p)2 (12.53)
2
where cn is between p and pn . But since g(pn ) = pn+1 ,
g 00 (cn )
pn+1 − p = (pn − p)2 (12.54)
2
or
pn+1 − p 1
lim = lim |g 00 (cn )| (12.55)
n→∞ (pn − p)2 2 n→∞

Since cn → p, g 00 (cn ) → g 00 (p) and

pn+1 − p 1 00
lim = |g (p)| (12.56)
n→∞ (pn − p)2 2
Since |g 00 (p)| =
6 0, the sequence converges quadratically with asymptotic error constant
00
|g (p)|/2.
One could ask the following question: given any linearly convergent sequence, how
can we turn it into a quadratically convergent sequence? One way to do this is as
follows. Let p be a root of f (p); the goal is to find a method that converges to p
quadratically. Since f (p) = 0, we can form a function
g(x) = x − h(x)f (x) (12.57)
where h(x) is any function. But now g(p) = p − h(p)f (p) = p so p is a fixed point of
g. By theorem 12.3, we need g 0 (p) = 0 to get quadratic convergence:
0 = g 0 (p) (12.58)
= 1 − h0 (p)f (p) − h(p)f 0 (p) (12.59)
= 1 − h(p)f 0 (p) (12.60)
or
1
h(p) = (12.61)
f 0 (p)
so long f 0 (p) 6= 0. Substituting equation 12.61 into equation 12.57,
f (x)
g(x) = x − (12.62)
f 0 (x)
which is precisely the formula for Newton’s method.
Definition 12.2 (Zero, Multiplicity). A root p of a function f (x) is called a zero

of multiplicity m if there exists some function q(x) such that
f (x) = (x − p)m q(x) (12.63)
and
lim q(x) 6= 0 (12.64)
x→p
If q is continuous this also means that q(p) 6= 0. A simple zero or simple root is
a zero of multiplicity 1. Roots of multiplicity m > 1 are called repeated roots.
Example 12.3. The function f (x) = x2 + 7x + 12 has simple zeroes at x = −4 and

x = −3.

Example 12.4. The function f (x) = (x − 2)2 (x − 3) has a simple root at x = 3 and
a root of multiplicity 2 at x = 2.
Theorem 12.4. Let f (x) be a continuously differentiable function on [a, b]. Then f
has a simple zero p ∈ (a, b) if and only if f (p) = 0 and f 0 (p) 6= 0.
Proof. Since this is an “if-and-only-if” theorem we need to prove two things:
(a) If p is a simple root then f (p) = 0 and f 0 (p) 6= 0; and
(b) If f (p) = 0 and f 0 (p) 6= 0 then p is a simple root.

To prove (a) we first assume that p is a simple root. Then since it is a root, we
automatically know that f (p) = 0. The only other thing we need to show is that
f 0 (p) 6= 0. But since p is a simple root, then there must exist some function q(x) such
that
lim q(x) 6= 0 (12.65)
x→p
and
f (x) = (x − p)q(x) (12.66)
Since f is continuously differentiable, then so is q. In particular, q is continuous at p,
which means that
lim q(x) = q(p) (12.67)
x→p
Hence by 12.65, q(p) 6= 0. But from equation 12.66
f 0 (x) = q(x) + (x − p)q 0 (x) (12.68)

f 0 (p) = q(p) + (p − p)q 0 (p) (12.69)
= q(p) 6= 0 (12.70)
which completes the proof of part (a).

To prove part (b), assume that both f (p) = 0 and f 0 (p) 6= 0 are true. Then by
the mean value theorem there is a number c between p and x such that
f (x) − f (p) f (x)
f 0 (c) = = (12.71)
x−p x−p
Consequently
f (x) = (x − p)f 0 (c) (12.72)
Since c is between x and p then by pinching,
lim c = p (12.73)
x→p
Let
q(x) = f 0 (c) (12.74)

then
f (x) = q(x)(x − p) (12.75)
where
lim q(x) = lim f 0 (c) (12.76)

x→p x→p

0
= f lim c (12.77)
x→p
= f 0 (p) (12.78)
6= 0 (12.79)
Hence the function has a simple zero.
Corollary 12.1. Newton’s method converges quadratically if p is a simple root.
Theorem 12.5. Suppose the function f (x) is m−times continuously differentiable in

the interval [a, b]. Then f (x) has a zero of multiplicity m at p in (a, b) if and only if
0 = f (p) = f 0 (p) = f 00 (p) = · · · = f (m−1) (p) and f (m) (p) 6= 0 (12.80)
Theorem 12.6. Suppose that f (x) is continuously differentiable on [a, b] and has a
root of multiplicity m > 1 at p ∈ (a, b). Then p is a simple root of µ(x) = f (x)/f 0 (x).
Proof. Since f (x) has a root of multiplicity m > 1 then there is some function g(x)
such that
f (x) = (x − p)m g(x) (12.81)
where g(p) 6= 0. Differentiating,
f 0 (x) = m(x − p)m−1 g(x) + (x − p)m g 0 (x) (12.82)
Therefore
(x − p)m g(x)
µ(x) = (12.83)
m(x − p)m−1 g(x) + (x − p)m g 0 (x)
(x − p)m−1 (x − p)g(x)
= (12.84)
m(x − p)m−1 g(x) + (x − p)m−1 (x − p)g 0 (x)
(x − p)g(x)
= (12.85)
mg(x) + (x − p)g 0 (x)
= (x − p)q(x) (12.86)
where
g(x)
q(x) = (12.87)
mg(x) + (x − p)g 0 (x)

Since g(p) 6= 0,
g(p) 1
q(p) = 0
= 6= 0 because m > 1 (12.88)
mg(p) + (p − p)g (p) m
hence
µ(x) = (x − p)q(x) (12.89)
where q(p)] 6= 0. Since µ(p) = 0 then p is a root of µ; since q(p) 6= 0, it is a simple
root.
Therefore we know that Newton’s method will converge quadratically to a root
of µ(x) even though it will only converge linearly to a repeated root of f (x). Using
Newtons method to find the simple root of µ(x) gives
µ(x)
g(x) = x − (12.90)
µ0 (x)
The function g has a fixed point at any root of µ(x), and the following iteration
converges quadratically,
µ(xn )
xn+1 = xn − 0 (12.91)
µ (xn )
because µ0 (p) 6= 0. But since µ(x) = f (x)/f 0 (x) then by the quotient formula for
differentiation,
f 0 f 0 − f f 00
µ0 (x) = (12.92)
(f 0 )2
f 02 − f f 00
= (12.93)
f 02
Therefore,
f (x)/f 0 (x)
g(x) = x − (12.94)
(f 0 (x)2 − f (x)f 00 (x))/(f 0 (x))2
f 0 (x)f (x)
=x− 0 (12.95)
(f (x))2 − f (x)f 00 (x)
This gives us the following quadratically convergent iteration formula:
f 0 (xn )f (x)
xn+1 = xn − (12.96)
(f 0 (xn ))2 − f (xn )f 00 (xn )
The problem with this formula arises from the fact that both f (p) and f 0 (p) are zero
and therefore as the iteration approaches the root, both (f 0 (xn ))2 and f (xn )f 00 (xn )
are very small numbers: taking the difference of two very small numbers can lead to
round-off errors.

Lesson 13
The Aitken-Steffensen Methods
Definition 13.1. Let pn be a sequence. Then the first forward difference is
∆pn = pn+1 − pn (13.1)
Successive forward differences are defined recursively, as follows:
The k th forward difference is given in terms of the k − 1st forward difference

as
∆k pn = ∆(∆k−1 pn ) (13.2)
Equations for specific differences expand out via Pascal’s triangle.
The second forward difference is
∆2 pn = ∆(∆pn ) = ∆pn+1 − ∆pn (13.3)

= (pn+2 − pn+1 ) − (pn+1 − pn ) (13.4)
= pn+2 − 2pn+1 + pn (13.5)
The third forward difference is
∆3 pn = ∆(∆2 pn ) = ∆2 pn+1 − ∆2 pn (13.6)

= (pn+3 − 2pn+2 + pn+1 ) − (pn+2 − 2pn+1 + pn ) (13.7)
= pn+3 − 3pn+2 + 3pn+1 − pn (13.8)
and so forth.
Aitken’s method (for its inventor, George Aitken, 1895-1967) is based on the
following observation. For any linearly convergent method,

pn+1 − p
lim =λ>0 (13.9)
n→∞ pn − p
81
82 LESSON 13. THE AITKEN-STEFFENSEN METHODS
Thus for sufficiently large n we might expect that

pn+1 − p
≈λ (13.10)
pn − p
Since this should be true for all n after a certain point, we would also expect that
pn+2 − p
≈λ (13.11)
pn+1 − p
Setting the last two expressions for λ equal to one another gives
pn+1 − p pn+2 − p
≈ (13.12)
pn − p pn+1 − p
Cross-multiplying and solving for p,
(pn+1 − p)2 = (pn+2 − p)(pn − p) (13.13)

p2n+1 − 2ppn+1 + p2 = pn+2 pn − p(pn + pn+2 ) + p2 (13.14)
p2n+1 − 2ppn+1 = pn+2 pn − p(pn + pn+2 ) (13.15)
p(pn + pn+2 ) − 2ppn+1 = pn+2 pn − p2n+1 (13.16)
p(pn+2 − 2pn+1 + pn ) = pn+2 pn − p2n+1 (13.17)
p∆2 pn = pn+2 pn − p2n+1 (13.18)
Hence
pn+2 pn − p2n+1
p= (13.19)
∆2 pn
pn+2 pn − p2n+1
= + pn − pn (13.20)
∆2 pn
pn+2 pn − p2n+1 − pn ∆2 pn
= pn + (13.21)
∆2 pn
Expanding the numerator,
pn+2 pn − p2n+1 − pn ∆2 pn = pn+2 pn − p2n+1 − pn (pn+2 − 2pn+1 + pn ) (13.22)

= pn+2 pn − p2n+1 − pn pn+2 + 2pn pn+1 − p2n (13.23)
= −p2n+1 + 2pn pn+1 − p2n (13.24)
= −(p2n+1 − 2pn pn+1 + p2n ) (13.25)
= −(pn+1 − pn )2 (13.26)
= −(∆pn )2 (13.27)
Therefore,
(∆pn )2
p = pn − (13.28)
∆2 pn

LESSON 13. THE AITKEN-STEFFENSEN METHODS 83
Aitken’s idea was that if we have a converging sequence pn → p as n → ∞ then the

sequence
(∆pn )2
qn = pn − (13.29)
∆2 pn
should converge to p faster than pn . We will accept this fact without proof in the
following theorem.
Theorem 13.1 (Aitken). Suppose that pn → p linearly and there is some number
N such that for all n > N ,
(pn − p)(pn+1 − p) > 0 (13.30)
then the sequence qn → p faster than pn → p, where
(∆pn )2
qn = pn − (13.31)
∆2 pn
in the sense that

qn − p
lim =0 (13.32)
n→∞ pn − p
Proof. (Outline of Proof)

Define
pn+1 − pn
δn = −λ (13.33)
pn − p
Then (... proof left as an exercise .. )
lim δn = 0 (13.34)
n→∞
and (...derivation left as an exercise ...)
qn − p λ(δn + δn+1 ) − 2δn + δn δn+1 − 2δn (λ − 1) − δn2

= (13.35)
pn − p (λ − 1)2 + λ(δn + δn+1 ) − 2δn + δn δn+1
Taking the limit gives equation 13.32.
Johann Frederik Steffensen (1873-1961) observed that the sequence would converge
faster if we started each iteration with (qi , g(qi ), g(g(qi )) instead of (pi , g(pi ), g(g(pi )).
The difference between the two methods (which is subtle) is illustrated in figure 13.1
Theorem 13.2 (Steffensen’s Method). Suppose that f (x) is thrice continuously

differentiable and has a fixed point p with f 0 (p) 6= 1. Then Aitken’s method can be
made to converge quadratically if we replace (qi , g(qi ), g(g(qi )) instead of (pi , g(pi ), g(g(pi ))
at each iteration.

Figure 13.1: Top: In Aitken’s method, at the nd of each iteration, the next iteration
begins by setting p1 = p0 . Bottom: In Steffensen’s method we set p0 = q. In both
methods, p1 = f (p0 ), p2 = f (p1 ), and q is computed from equation 13.31
Aitken’s Method:
p p p q
0 1 2
p p p q
0 1 2
p p p q
0 1 2
Steffensen’s Method:
p p p q
0 1 2
p p p q
0 1 2
p p p q
0 1 2
The following algorithms uses Aitken’s method to find the fixed point of the function
f (x).
Algorithm Aitken
Let δ = ∞;
Let p = p0 ;
While δ > ,
p1 = f (p);
p2 = f (p1 );
∆p = p1 − p0 ;
∆∆p = (p2 − p1 ) − ∆p;
p = p − (∆p)2 /∆∆p;
δ = |p − p0 |;
p0 = p1 (this is where Steffensen’s method differs)
End While
Return p
Steffensen’s method is only different in one place, as indicated in the following

alorithm.

LESSON 13. THE AITKEN-STEFFENSEN METHODS 85
Algorithm Steffensen
Let δ = ∞;
Let p = p0 ;
While δ > ,
p1 = f (p);
p2 = f (p1 );
∆p = p1 − p0 ;
∆∆p = (p2 − p1 ) − ∆p;
p = p − (∆p)2 /∆∆p;
δ = |p − p0 |;
p0 = p (this is where Aitken’s method differs)
End While
Return p
Both Aitken’s method and Steffensen’s method can be used to find the fixed point
of a function. To find the root of the function we have the following algorithm, which
significantly improves the rate of convergence of Newton’s method when there are
repeated roots (e.g., for functions such as f (x) = (x − 2)2 .
Algorithm Newton-Steffensen
Define g(x) = x − f (x)/f 0 (x);
p = Steffensen(g, p0 , );
Return p.


Lesson 14
Synthetic Division and Horner’s

Method
Definition 14.1 (Polynomial). Let a0 , . . . , an be arbitrary constants. Then any

function P (x) of the form
n
X
P (x) = ak xk = a0 + a1 x + a2 x2 + · · · + an xn (14.1)
k=0
with an 6= 0 is called a polynomial of order n in x.
To implement a polynomial most efficiently, we observe that once we know x2 , it

is faster to calculate x3 = x × x2 rather than as x × x × x; once we know x3 it is faster
to calculate x4 as x × x3 rather than x × x × x × x; and so forth. In general, we want
to calculate xn as x × xn−1 . This produces the concept of nested multiplication:
P (x) = a0 + a1 x + a2 x + · · · + an xn (14.2)
= a0 + x(a1 + a2 x + a3 x2 + · · · + an xn−1 ) (14.3)
..
.
= a0 + x(a1 + x(a2 + x(a3 + · · · + x(an−1 + an x))) · · · ) (14.4)
Theorem 14.1 (Fundamental Theorem of Algebra). Every polynomial of degree

n has precisely n roots.
Some (or all) of the roots may be complex. Since complex roots come in conjugate
pairs, the total number of complex roots must be even. Thus a polynomial of odd
degree always has at least one real root. If the unique roots are given as r1 , r2 , ..., rk
each with multiplicity m1 , m2 , ..., mk then we can always write a polynomial as
P (x) = C(x − r1 )m1 (x − r2 )m2 · · · (x − rk )mk (14.5)
87
88 LESSON 14. SYNTHETIC DIVISION AND HORNER’S METHOD
Descarte proposed in 1637 that one could imagine that there were n roots to a poly-
nomial. Albert Girard (1629) proposed that an nth order polynomial has n roots but
that they may exist in a field larger than the complex numbers. The first published
proof of the fundamental theorem of algebra was by DAlembert in 1746, but his proof
was based on an earlier theorem that itself used the theorem, and hence is unsatisfac-
tory. At about the same time Euler proved it for polynomials with real coefficients up
to 6th. Between 1799 (in his doctoral dissertation) and 1816 Gauss published three
different proofs for polynomials with with real coefficients, and in 1849 he proved the
general case for polynomials with complex coefficients.
Theorem 14.2. If two polynomials of degree n agree at n + 1 unique points, then

they must be identical. More precisely: If P (x) and Q(x) are two polynomials of the
same degree n that agree at at least n + 1 distinct points, i.e, if there exist unique
numbers x1 , ..., xn+1 such that P (xk ) = Q(xk ) then P (x) = Q(x) for all x.
For example, if two lines are equal at two points, they are identical; if two parabo-
las match at three points, they are identical; and so on.
Theorem 14.3 (Horner’s Method for Synthetic Division). Let P (x) be any
polynomial of degree n, given by
P (x) = a0 + a1 x + a2 x + · · · + an xn (14.6)
Then for any number x0 there exists another polynomial Q(x) of degree n − 1, given
by
Q(x) = b1 + b2 x + b3 x2 + · · · + bn xn−1 (14.7)
such that
P (x) = (x − x0 )Q(x) + b0 (14.8)
where bn = an ,
bk = ak + bk+1 x0 (14.9)
for k = n − 1, n − 2, ..., 0, for k = n − 1, n − 2, . . . , 0 Furthermore, b0 = P (x0 ) and
P 0 (x0 ) = Q(x0 ) (14.10)
Proof. Suppose that
Q(x) = bn xn−1 + bn−1 xn−2 + · · · + b3 x2 + b2 x + b1 (14.11)
for some undetermined numbers b1 , . . . , bn . Then we ask what conditions will ensure
that
P (x) = (x − x0 )Q(x) + b0 (14.12)

LESSON 14. SYNTHETIC DIVISION AND HORNER’S METHOD 89
Multiplying things out,
(x − x0 )Q(x) + b0 = b0 +
(x − x0 )(bn xn−1 + bn−1 xn−2 + · · · + b3 x2 + b2 x + b1 ) (14.13)
= b0 + x(bn xn−1 + bn−1 xn−2 + · · · + b3 x2 + b2 x + b1 )
− x0 (bn xn−1 + bn−1 xn−2 + · · · + b3 x2 + b2 x + b1 ) (14.14)
= bn xn + bn−1 xn−1 + bn−2 xn−2 + · · · + b2 x2 + b1 x
− bn x0 xn−1 − x0 bn−1 xn−2 − · · · − x0 b3 x2
− x 0 b2 x − x 0 b1 + b0 (14.15)
= bn xn + (bn−1 − bn x0 )xn−1 +
(bn−2 − x0 bn−1 )xn−2 + · · · + (14.16)
(b2 − x0 b3 )x2 + (b1 − x0 b2 )x + (b0 − x0 b1 ) (14.17)
We want this to be equal to
P (x) = a0 + a1 x + a2 x + · · · + an xn (14.18)
Equating coefficients of like powers of x gives us
an = b n (14.19)
an−1 = bn−1 − bn x0 (14.20)
an−2 = bn−2 − bn−1 x0 (14.21)
..
.
a0 = b 0 − x 0 b 1 (14.22)
Rearranging,
b n = an (14.23)
bn−1 = an−1 + bn x0 (14.24)
bn−2 = an−2 + bn−1 x0 (14.25)
..
.
b 0 = a0 + b 1 x 0 (14.26)
This proves equation 14.9.

Next, to see that b0 = P (x0 ) we observe that
P (0) = (x0 − x0 )Q(x0 ) + b0 = b0 (14.27)

Furthermore, differentiating P (x) = (x − x0 )Q(x) + b0 gives
P 0 (x) = (x − x0 )Q0 (x) + Q(x) (14.28)

hence
P 0 (x0 ) = Q(x0 ) (14.29)
which gives us equation 14.10.
The following gives a recapitulation of the algorithm for Horner’s method to cal-
culate the numbers P (x0 ) and P 0 (x0 ) for a polynomial.
Algorithm Horner
Input a0 , . . . , an , x0 ;
Set y = an ; (y will give the bn for P )
Set z = an ; (z gives the bn−1 for Q)
For j = n − 1, n − 2, . . . , 1,
y = x0 y + aj ; (this gives bj for P (x0 ))
z = x0 z + y; (this gives bj−1 for the calculation of Q(x0 ) )
End For;
y = x0 y + a0 ; (this gives b0 )
Return y (which is P (x0 )) and z (which is P 0 (x0 ) = Q(x0 ))
We can make two interesting observations about Horner’s method. First, it has
the same number of multiplications as nested multiplication, making it at least as
efficient as that algorithm. Secondly, it gives us a number for both P (x0 ) and P 0 (x0 )
for no extra cost. This becomes useful in operations where both numbers are needed,
such as in the calculation of Newton’s method (for the roots of a polynomial).
Algorithm Newton’s Method with Horner

Input a0 , . . . , an , x0 , tolerance ;
(p0 , p00 ) = Horner(a0 , . . . , an , x0 )
p = x0
δ = p0 /p00
While |δ| > ,
p=p−δ
(p0 , p00 ) = Horner(a0 , . . . , an , p)
δ = p0 /p00
End While
Return p
Let x0 be a root of P . Then we know that there exists a second polynomial Q(x)
such that P (x) = (x − x0 )Q(x) + P (x0 ) = (x − x0 )Q(x). So if P has any other
roots that are different from x0 then they are also roots of Q. Hence if we repeat the
process on Q iteratively we will find all the subsequent roots of P . Unfortunately
this leads to round-off error that can be avoided by using a different algorithm that
we will discuss subsequently.
Example 14.1. Find P (1) and P 0 (1) for P (x) = x3 −2x2 −5 using Horner’s method.

LESSON 14. SYNTHETIC DIVISION AND HORNER’S METHOD 91
Solution. We have a0 = −5, a1 = 0, a2 = −2, and a3 = 1, and also x0 = 1. Then If

we set z = a3 = 1,
b3 = a3 = 1 (14.30)
b2 = a2 + b3 x0 = −2 + (1)(1) = −1 (14.31)
b1 = a1 + b2 x0 = 0 + (−1)(1) = −1 (14.32)
b0 = a0 + b1 x0 = −5 + (−1)(1) = −6 (14.33)
Hence P (1) = −6. Using the same algorithm for Q we have
c 2 = b3 = 1 (14.34)
c1 = b2 + c2 x0 = (−1) + (1)(1) = 0 (14.35)
c0 = b1 + c1 x0 = (−1) + (0)(1) = −1 (14.36)
Hence Q(1) = P 0 (1) = c0 = −1.

Horner’s method is also fairly easy to implement in Mathematica. We can take
advantage of the fact that a list may have an arbitrary number of elements, so we
don’t even need to know the order of the polynomial:
Horner[A_?ListQ, x0_] := Module[{z, y, a},

a = A;
y = z = Last[a];
a = Most[a];
While[Length[a] > 1,
y = x0*y + Last[a];
z = x0*z + y;
a = Most[a];
];
y = x0*y + Last[a];
Print["P(x)=" A.("x"^Range[0, Length[A] - 1])];
Print["P(", x0, ")=", y, "\n", "P’(", x0, ")=", z];
Return[{y, z}]
]


Lesson 15
Müller’s Method
Müller’s method is s based on the idea that if a straight line is good, then a parabola
is better. It’s really a modification of the Secant method, replacing the projectin of a
secant line with the projectiong of a parabola, fit to three consecutive points on the
curve, to find the next guess. Suppose we “know” the value of f at three points on
the curve of f (x) at x = p, x = q, and x = r. The we need to find a parabola through
the three points
(p, f (p)), (q, f (q)), (r, f (r)) (15.1)
Figure 15.1: Illustration of Müller’s method. A parabola is fit to three points on the
curve, and the intersection of the parabola with the x−axis is used to for the next
guess of the root.
93
94 LESSON 15. MÜLLER’S METHOD
The general equation for a parabola is

P (x) = Ax2 + Bx + C (15.2)
= Ax2 + Bx + C − 2pxA + 2pxA + p2 A − p2 A (15.3)
= A(x2 − 2px + p2 ) + (B + 2pA)x + C − Ap2 (15.4)
= A(x − p)2 + (B + 2pA)x + C − Ap2 − (B + 2pA)p + (B + 2pA)p (15.5)
= A(x − p)2 + (B + 2pA)(x − p) + C − Ap2 + (B + 2pA)p (15.6)
Make the following substitutions:
a=A (15.7)
b = B + 2pA (15.8)
c = C − Ap2 + (B + 2pA)p (15.9)
This gives us
P (x) = a(x − p)2 + b(x − p) + c (15.10)
Since P (p) = f (p), P (q) = f (q), and P (r) = f (r),
f (p) = P (p) = a(p − p)2 + b(p − p)2 + c = c (15.11)
f (q) = P (q) = a(q − p)2 + b(q − p) + c (15.12)
= a(q − p)2 + b(q − p) + f (p) (15.13)
f (r) = P (r) = a(r − p)2 + b(r − p) + c (15.14)
= a(r − p)2 + b(r − p) + f (p) (15.15)
Rearranging,
f (q) − f (p) = a(q − p)2 + b(q − p) (15.16)
f (r) − f (p) = a(r − p)2 + b(r − p) (15.17)
This is a systems of two equations in two unknowns, a and b. Multiplying equation
15.16 by r − p and equation 15.17 by q − p gives
(r − p) (f (q) − f (p)) = a(q − p)2 (r − p) + b(q − p)(r − p) (15.18)
(q − p) (f (r) − f (p)) = a(r − p)2 (q − p) + b(r − p)(q − p) (15.19)
Subtracting equation 15.19 from equation 15.18 gives
(r − p) (f (q) − f (p)) − (q − p)[f (r) − f (p)] (15.20)

2 2

= a (q − p) (r − p) − (r − p) (q − p) (15.21)
= a(q − p)(r − p) ((q − p) − (r − p)) (15.22)
= a(q − p)(r − p) (q − p − r + p) (15.23)
= a(q − p)(r − p)(q − r) (15.24)

LESSON 15. MÜLLER’S METHOD 95
Thus
(r − p) (f (q) − f (p)) − (q − p) (f (r) − f (p))
a= (15.25)
(q − p)(r − p)(q − r)
Next we multiply equation 15.16 by (r − p)2 , equation 15.17 by (q − p)2 , and subtract,
which gives
(r − p)2 (f (q) − f (p)) = a(q − p)2 (r − p)2 + b(q − p)(r − p)2 (15.26)
(q − p)2 (f (r) − f (p)) = a(r − p)2 (q − p)2 + b(r − p)(q − p)2 (15.27)
The subtraction gives
(r − p)2 (f (q) − f (p)) − (q − p)2 (f (r) − f (p)) (15.28)

= b(q − p)(r − p)2 − b(r − p)(q − p)2 (15.29)
= b(q − p)(r − p) ((r − p) − (q − p)) (15.30)
= b(q − p)(r − p) (r − p − q + p) (15.31)
= b(q − p)(r − p)(r − q) (15.32)
and therefore,
(r − p)2 (f (q) − f (p)) − (q − p)2 (f (r) − f (p))

b= (15.33)
(q − p)(r − p)(r − q)
Müller’s method uses the intersection of the parabola with the x-axis as the next
guess. Given three guesses p, q, r, the parabola intersects at s where
0 = P (s) = a(s − p)2 + b(s − p) + c (15.34)
If we define δ = s − p then 0 = aδ 2 + bδ + c and hence s = p + δ where

√
−b ± b2 − 4ac
δ= (15.35)
2a
and therefore √
−b ± b2 − 4ac
s=p+ (15.36)
2a
where a and b are given by equations 15.25 and 15.33.
If b is a large positive number then the positive root
√
−b + b2 − 4ac
δ+ = (15.37)
2a
has two large and nearly equal numbers in the numerator; this could lead to roundoff
errors. To improve our accuracy we rearrange by rationalizing the numerator:
√ √
−b +b2 − 4ac −b − b2 − 4ac
δ+ = × √ (15.38)
2a −b − b2 − 4ac
b2 − b2 + 4ac
= √ (15.39)
2a −b − b2 − 4ac
−2c
= √ (15.40)
b + b2 − 4ac
Thus if no roundoff error here because now we are adding two large positive numbers
in the denominator, and not subtracting them. Thus if b is large and positive, our
two intersection points are
√
−b − b2 − 4ac
s =p + (15.41)
2a
2c
s =p − √ (15.42)
b + b2 − 4ac
By a similar argument, if b is a large negative number then the negative root is

subtracting two nearly equal numbers and so the solutions are
√
−b + b2 − 4ac
s =p + (15.43)
2a
2c
s =p − √ (15.44)
b − b2 − 4ac
Since we don’t know up front which, if either, special case occurs, we can do the
following: choose the sign of the square root to agree with the sign of b. This will
work in either case! Hence
2c
s =p − √ (15.45)
b + sign(b) b2 − 4ac
This assures that of the two possible roots of the parabola, the one closest to p will
be selected.
Müller’s algorithm also uses Horner’s method to evaluate the polynomial (it ignores
the derivatives since they aren’t really needed). The algorithm to find the root of a
polynomial with coefficients given by a0 , . . . , an is

LESSON 15. MÜLLER’S METHOD 97
Algorithm Muller (root of a polynomial)

Input a0 , . . . , an , x0 , x1 , x2 , tolerance ;
Let p = x2 and f (p) = Horner(a0 , . . . , an , p)
Let q = x1 and f (q) = Horner(a0 , . . . , an , q)
Let r = x0 and f (r) = Horner(a0 , . . . , an , r)
Let δ = ∞
While |δ| > ,
(r − p) (f (q) − f (p)) − (q − p) (f (r) − f (p))
a=
(q − p)(r − p)(q − r)
(r − p)2 (f (q) − f (p)) − (q − p)2 (f (r) − f (p))
b=
(q − p)(r − p)(r − q)
c = f (p)
2c
δ= √
b + sign(b) b2 − 4ac
r=q
q=p
p=p−δ
fr = fq
fq = fp
f p = Horner(a0 , . . . , an , p)
End While
Return p


Lesson 16
Linear Systems
In this section we will study the solution of a linear system of n equations with n
unknowns. We cover it briefly here because some understanding of the problem will
be necessary in our study of interpolation. However, this subject is normally part of
the Math 481A curriculum and hence will not be covered in any detail here.
Given a square n × n matrix A and n numbers b1 , . . . , bn , we would like to solve
the linear system
Ax = b (16.1)
Since it is generally numerically inefficient to compute an inverse (it generally requires
O(n3 ) operations we will not solve the system as
x = A−1 b (16.2)
although this is technically correct. Instead we will use the process of Gaussian
elimination. We begin by observing that if we can transform equation 16.1 into a
form
T x = b0 (16.3)
where T is an upper triangular matrix, and b0 is a modified version of b, then we can
read the solution for xn off the bottom row of the matrix, namely,
xn = b0n /Tnn (16.4)
The matrix T is said to be in Row Echelon Form. The second to the last row of
the system 16.3 only depends on two variables, xn and xn−1 . Once we read off xn
then we can solve for xn−1 . This process of back substitution moves back up the
matrix one line at a time, solving for one variable at each step.
Gaussian elimination is then summarized as follows:
1. Convert the system Ax = b into an equivalent form T x = b0 where T is upper-

triangular.
99
100 LESSON 16. LINEAR SYSTEMS
2. Solve for x using back-substitution.

It is possible to take this idea one step further. If we can reduce equation 16.3 to
the form
Dx = b00 (16.5)
where D is a diagonal matrix, then it is even easier to read off the solutions, namel,
xi = b0i /Dii . In this revised form the matrix D is said to be in Reduced Row
Echelon Form, and the revised algorithm is called Gauss-Jordan Elimination.
The revised algorithm is summarized:
1. Convert the system Ax = b into an equivalent form T x = b0 , where T is

upper-triangular.
2. Convert the system T x = b0 into an equivalent form Dx = b00 , where D is
diagonal.
3. Solve for the xi .
We will outline the first algorithm (row reduction followed by back-substitution). We

start by writing the linear system
a11 x1 + a12 x2 + · · · + a1n xn = b1 (16.6)
a21 x1 + a22 x2 + · · · + a2n xn = b2 (16.7)
..
.
an11 x1 + an2 x2 + · · · + ann xn = bn (16.8)
From equation 16.6 we can solve for x1 in terms of x2 , . . . , xn ,
x1 = (b1 − a12 x2 − a13 x3 − · · · − a1n xn )/a11 (16.9)
so if we already know x2 , . . . , xn we can solve for x1 immediately. But if we eliminate
x1 from each of the remaining equations, we have system of n − 1 equations in the
n − 1 variables x2 , . . . , xn , which is easier to solve than the original system because
it is smaller. We get this system by subtracting an appropriate multiple of the first
equation from each of the remaining equations, namely we subtract
(ai1 /a11 ) × (a11 x1 + a12 x2 + · · · + a1n xn = b1 ) (16.10)
from the ith equation. The resulting system for x2 , . . . , xn is
(a22 − a21 a12 /a11 )x2 + · · · + (a2n − a21 a1n /a11 )xn = b2 − a21 b1 /a11 (16.11)
(a32 − a31 a12 /a11 )x2 + · · · + (a3n − a31 a1n /a11 )xn = b3 − a31 b1 /a11 (16.12)
..
.
(an2 − an1 a12 /a11 )x2 + · · · + (ann − an1 a1n /a11 )xn = bn − an1 b1 /a11 (16.13)

LESSON 16. LINEAR SYSTEMS 101
The idea is to keep repeating this process until there is only one equation in the
reduced system. The result is an “upper triangular system.” If the original matrix
system is     
a11 a12 a13 · · · a1n x1 b1
 a21 a22 a23 · · · a2n   x2   b2 
    
 a31 a32 a33   x 3   b3 
  = 
 .. .. .. ..   ..   .. 
 . . . .   .   . 
an1 an2 an3 · · · ann an1 xn
Then the reduced matrix system is
    
a11 a12 a13 · · · a01n x1 b01
 0 a0 a0 · · · a02n  x2   b02 
 22 23    
0
 0
 0 a 33 · · · a03n 
 x3 =
  b03 

 .. .. ..  ..   .. 
 . . .  .   . 
0 ··· 0 0 a0nn an1 b‘n
This process is called Gaussian Reduction. We can then solve the system by
starting on the bottom equation for xn , then the second from the bottom for xn−1 ,
and so forth, until we obtain x1 . This second step is called back substitution.
Example 16.1. Solve the system

    
1 2 3 x 5
 4 5 2   y  =  10 
2 8 5 z 15
using Gaussian Reduction and back substitution.
Solution. The first step is to subtract multiples of the first row from each of the
remaining two rows to make the coefficients of x zero in each of rows 2 and 3 of the
system. Since the coefficient of x is 1 in the first row, 4 in the second row, and 2 in
the third row, we subtract four times the first row from the second row, and twice
the first row from the third row.
    
1 2 3 x 5
 4 − 4(1) 5 − 4(2) 2 − 4(3)   y  =  10 − 4(5) 
2 − 2(1) 8 − 2(2) 5 − 2(3) z 15 − 2(5)
    
1 2 3 x 5
 0 −3 −10   y  =  −10 
0 4 −1 z 5
Now the first column is all zeroes (except for the first row). The next step is to
subtract a multiple of the second row from the third row to get a zero in the second

entry of the third row. Since the coefficient of y is -3 in the second row and 4 in the
third row, we can add 4/3 times the second row to the third row.
    
1 2 3 x 5
 0 −3 −10  y  =  −10 
0 4 + (4/3)(−3) −1 + (4/3)(−10) z 5 + (4/3)(−10)
    
1 2 3 x 5
 0 −3 −10   y  =  −10 
0 0 −43/3 z −25/3
This completes the Gaussian elimination. We can then read off the solution by back-
substitution. From the third row of the matrix,
z = (−25/3)/(−43/3) = 25/43
From the second row of the matrix,
−3y − 10z = −10
hence
1 60
y = − (−10 + 10(25/43)) =
3 43
Finally, from the first row, we have
x + 2y + 3z = 5

60 25 20
x=5−2 −3 =
43 43 43
We can write a simple recursive algorithm for Gaussian elimination as
Algorithm LinearSolve
Input: A, b
If n > 1,
{A0 , b0 } = Reduce(A, b)
LinearSolve (A0 , b0 )
End if
x1 = (b1 − a12 x2 − a13 x3 − · · · − a1n xn )/a11
Return {x1 , x2 , . . . , xn }
Algorithm Reduce
Input: A, b
n = dimension(b)
For k = 2, . . . , n,

LESSON 16. LINEAR SYSTEMS 103
m = ak1 /a11
For j = 2, . . . , n,
a0k−1,j−1 = akj − ma1j
End For
b0k−1 = bk − mb1
End For
Return {A0 , b0 }
The recurse algorithm can be almost literally translated into Mathematica:
reduce[A_, b_] := Module[{n, j, k, Aprime, bprime, m, row},
n = Length[b];
Aprime = {}; bprime = {};
For[k = 2, k n , k++,
m = A[[k, 1]]/A[[1, 1]];
row = {};
For[j = 2, j n, j++,
AppendTo[row, A[[k, j]] - m* A[[1, j]]];
];
AppendTo[Aprime, row];
AppendTo[bprime, b[[k]] - m*b[[1]]];
];
Return[{Aprime, bprime}];
];
gauss[A_, b_] := Module[{n, x, x1, Aprime, bprime},

n = Length[b];
x = {};
If[n > 1,
{Aprime, bprime} = reduce[A, b];
x = gauss[Aprime, bprime];
];
x1 = b[[1]]/A[[1, 1]];
For[k = 2, k n, k++,
x1 = x1 - A[[1, k]]x[[k - 1]]/A[[1, 1]];
];
x = Prepend[x, x1];
Return[x];
]
For example, to solve the system
    
0.116093 0.230616 0.34202 x1 3
0.461232 0.897598 1.28558 x2  = 17 (16.14)
1.02606 1.92836 2.59808 x3 5

One could use this function by typing
In:=
A={{0.116093, 0.230616, 0.34202},

{0.461232, 0.897598, 1.28558},
{1.02606, 1.92836, 2.59808}};
b={3, 17, 5};
gauss[A, b]
Out:=
{-33612.9, 27351.9, -7024.58}
In Mathematicawe can also solve the system directly by using the built in function
LinearSolve[A,b].
Gaussian elimination can fail if we divide by zero, and is susceptible to large
errors or possible overflow if we divide by a very small number (relative to the other
numbers in the matrix). Division occurs in two places in the algorithm: during the
row reduction phase where we define m = ak1 /a11 and during the back-substitution
step at the end of the algorithm, where we solve for x1 (here we also divide by a11 ,
but its usually a different a11 ). These numbers are called pivots. The solution is to
rearrange the matrix (and the corresponding elements of b): if at any step along the
way the pivot is zero, then the entire row is exchanged with a row that does not have
zero in that column. If all of the remaining elements in that column are zero then
the matrix is singular and there is no unique solution (or no solution at all).

Lesson 17
Lagrange Interpolation
Suppose we know the values of some function f (x) at n + 1 distinct grid points
a = x0 , x1 , x2 , ..., xn = b (17.1)
Denote the values of the function at each of these points as
fk = f (xk ), k = 0, 1, 2, ..., n (17.2)
The problem of interpolation is to find an approximate (numerical) value for f (x)

at any point x ∈ [a, b] that does not necessarily correspond to one of the grid points.
Function known only at the grid points
(x2,f2) (x3,f3) (x4,f4)
(x1,f1)
(x5,f5)
x1 x2 x3 x4 x5
The unknown function f(x) goes through

the grid points
(x3,f3)
(x ,f ) (x4,f4)
2 2
(x1,f1) f(x)
(x5,f5)
x1 x2 x3 x4 x5
105
106 LESSON 17. LAGRANGE INTERPOLATION
The simplest method is linear interpolation: draw line segments connecting each
pair of consecutive grid points (xk , fk ) and (xk+1 , fk+1 ). For xk ≤ x ≤ xk+1 we have:
fk+1 − fk
y = fk + m(x − xk ) = fk + (x − xk ) (17.3)
xk+1 − xk
(x3,f3) (x4,f4) (x5,f5)

(x2,f2)
(x1,f1)
x1 x2 x3 x4 x5
In general, unless the grid points are very close, linear interpolation does not give
very accurate result. A better approximation would be given by a polynomial. The
key is to find the right polynomial, not just any polynomial that goes through the
points. As it turns out, it is possible to find a polynomial that approximates the
function to any desired degree of accuracy. This result is called the Weirstrass Ap-
proximation Theorem. Furthermore, given any n + 1 points it is possible to find
a unique polynomial of minimum degree that fits all the points. For example, any
two points can be fit by a line; and three non-collinear points can be fit by a unique
parabola; any four points that do not line on the same line or on the same parabola
can be fit by a unique cubic; and so forth.
Let us suppose that we have defined n + 1 unique points
(x0 , f0 ), (x1 , f1 ), . . . , (xn , fn ) (17.4)
where
x0 < x1 < · · · < xn (17.5)
and that we want to find the polynomial of lowest order
P (x) = a0 + a1 x + a2 x2 + · · · + an xn (17.6)
Math 481A «2008, B.E.Shapiro

LESSON 17. LAGRANGE INTERPOLATION 107
to these points. We begin by substituting the points 17.4 into the polynomial to get
n + 1 equations in the n + 1 unknowns a0 , a1 , . . . , an :
f0 = a0 + a1 x0 + a2 x20 + · · · + an xn0 (17.7)
f1 = a0 + a1 x1 + a2 x21 + · · · + an xnn (17.8)
..
.
fn = a0 + a1 xn + a2 x2n + · · · + an xnn (17.9)
which we can write as the matrix system
    
1 x0 x20 · · · xn0 a0 f0
2 n  
1 x 1 x 1 · · · x 1   a1   f 1 

..   ..  =  ..  (17.10)
 
 ..
. .   .   .
2 n
1 xn xn · · · xn an fn
This equation has a solution if the matrix of coefficients is non-singular. But because
the points are distinct the lines of the matrix form a linearly independent set of vectors
(proof left as an exercise). Hence the matrix is non-singular. To find the a0 , a1 , . . .
we could use Gaussian elimination or some other method as we have discussed. It
turns out that this is not necessary because the form of the matrix allows us to write
a much simpler iterative process for finding these coefficients.
We will actually present two different methods for constructing the polynomial:
the Lagrange method (in this section) and the Newton method (in the next section).
Because of uniqueness both polynomials will be identical; however, they are con-
structed differently. The Newton method is particularly useful when one needs to
calculate numbers by hand, as was done in the 19th century. The Lagrange method,
which we will discuss first, is somewhat more intuitive. Before providing the general
form, we will illustrate the technique with linear (n=1) and quadratic (n=2) interpo-
lation.
For n = 1 we start with two points (x0 , f0 ), (x1 , f1 ) that we want to fit a line to.
Of course we have already done it, but this time we will construct the line in such a
way that the method can be easily extending to higher degree fits (with more points).
We define the functions
x − x1
L0 (x) = (17.11)
x0 − x1
x − x0
L1 (x) = (17.12)
x1 − x0
and we observe that
L0 (x0 ) = 1 L0 (x1 ) = 0 (17.13)
L1 (x0 ) = 0 L1 (x1 ) = 1 (17.14)

We write this more compactly as

(
1, i = j,
Li (xj ) = δij = (17.15)
6 j
0, i =
known as the Kroeneker delta function (for the German mathematician Leopold
Kroeneker, 1823-1891). Next, we define the function
1
X
P (x) = Li (x)fi (17.16)
k=0
= L0 (x)f0 + L1 (x)f1 (17.17)
x − x1 x − x0
= f0 + f1 (17.18)
x0 − x1 x1 − x0
We observe that P (xi ) = fi and that P is linear in x. Hence it is the equation of
a line that goes through both points (xi , fi ), i = 0, 1. A rearrangement of this gives
equation 17.3.
For n = 2 we have 3 points: (x0 , f0 ), (x1 , f1 ), and (x2 , f2 ). Again, we have already
solved for the equation of a parabola through three points in the previous section,
but we will do it this time by extending the Lagrange technique. We define the three
functions
(x − x1 )(x − x2 )
L0 (x) = (17.19)
(x0 − x1 )(x0 − x2 )
(x − x0 )(x − x2 )
L1 (x) = (17.20)
(x1 − x0 )(x1 − x2 )
(x − x0 )(x − x1 )
L2 (x) = (17.21)
(x2 − x0 )(x2 − x1 )
We observe that
L0 (x0 ) = 1 L0 (x1 ) = 0 L0 (x2 ) = 0 (17.22)

L1 (x0 ) = 0 L1 (x1 ) = 1 L1 (x2 ) = 0 (17.23)
L2 (x0 ) = 0 L2 (x1 ) = 0 L2 (x2 ) = 1 (17.24)
or in general Li (xj ) = δij , as before with the linear functions. Then we define the
function
X2
P (x) = L0 (x)f0 + L1 (x)f1 + L2 (x)f2 = Li (x)fi (17.25)
k=0
We observe now that P (x) is quadratic, and that P (xi ) = fi . Thus it goes through
all three points, and hence by uniqueness it is the only parabola that goes through
all three points.

In the general case it becomes more convenient to add a second index indicating
the order of the polynomials to the L functions. Thus we rename our linear functions
from L0 and L1 to L10 and L11 , and our quadratic functions L0 , L1 , and L2 become
L20 , L21 , and L22 . The general definition is
n
Y x − xj
Lnk (x) = (17.26)
j=0,j6=k
xk − xj
for k = 0, . . . , n. It is easily observed that (a) each of the Lnk has degree n; and that
(b) that Lnk (xi ) = δik . Hence the polynomial
n
X
P (x) = Lnk (x)fk (17.27)
k=0
is also of degree at most k and that P (xj ) = fj . Thus P (x) is our interpolating
polynomial, and we have derived the following result.
Theorem 17.1 (Lagrange Interpolating Polynomial). Suppose that we are given

the values of the function f (x) at n + 1 distinct points x0 , . . . , xn , which we denote by
f0 , . . . , fn . Then the nth Lagrange interpolating polynomial
n n n
X X Y x − xj
P (x) = Lnk (x)fk = fk (17.28)
k=0 k=0 j=0,j6=k
xk − x j
is a polynomial of degree at most n that matches f (x) at each of the xi .

√
Example 17.1. Let f (x) = 1 + x. Construct the Lagrange polynomial of degree
at most 2 to interpolate the point f (0.45) using grid points at x0 = 0, x1 = 0.6 and
x2 = 0.9.
Solution. First we calculate the fi :

√
f0 = f (x0 ) = f (0) = 1=1 (17.29)
√
f1 = f (x1 ) = f (0.6) = 1.6 ≈ 1.265 (17.30)
√
f2 = f (x2 ) = f (0.9) = 1.9 ≈ 1.378 (17.31)
So the Lagrange polynomial is
P (x) = L20 (x)f0 + L21 (x)f1 + L22 (x)f2 (17.32)

= L20 (x) + 1.27L21 (x) + 1.38L22 (x) (17.33)

Where
(x − 0.6)(x − 0.9)
L20 (x) = = 1.85(x − 0.6)(x − 0.9) (17.34)
(0 − 0.6)(0 − 0.9)
(x − 0)(x − 0.9)
L21 (x) = = −5.56x(x − 0.9) (17.35)
(0.6 − 0)(0.6 − 0.9)
(x − 0)(x − 0.6)
L22 (x) = = 3.70x(x − 0.6) (17.36)
(0.9 − 0)(0.9 − 0.6)
Thus
P (x) = L20 (x) + 1.27L21 (x) + 1.38L22 (x) (17.37)
= 1.85(x − 0.6)(x − 0.9) − 1.27(5.56)x(x − 0.9)
+ 1.38(3.70)x(x − 0.6) (17.38)
= 1.85(x − 0.6)(x − 0.9) − 7.03x(x − 0.9) + 5.11x(x − 0.6) (17.39)
= 0.999 + 0.486x − 0.07x2 (17.40)
Hence
P (0.45) ≈ 0.999 + (0.486)(0.45) − 0.07)(0.45)2 = 1.20 (17.41)
We summarize the algorithm for Lagrange Interpolation here.1
Algorithm LagrangeInterpolatingFunctions
Input: x0 , . . . , xn , x
For i = 0, 1, . . . , n,
Define the set Ui ={x0 , . . . , xn } − {xi }
Define numerator = 1, denominator = 1
For j = 0, . . . , n − 1
numerator = numerator × (x − Uij )
denominator = denominator × (xi − Uij )
End For
Lni = numerator/denominator
End For
Return the list {Ln1 , . . . , Lnn }
Algorithm LagrangeInterpolatingPolynomial
Input: x0 , . . . , xn , f0 , . . . , fn , x
Let L be the list LagrangeInterpolatingFunctions(x0 , . . . , xn , x)
P = f0 ∗ L0 + f1 ∗ L1 + · · · + fn ∗ Ln
Return P
1
The notation A − B, where A and B are sets, means the relative complement of the set B in
the set A, e.g., all of the elements of A that are not in B. For an ordered set Ui , the notation Uij
means the j th element of Ui . An ordered set is also called a List.

We now illustrate how to calculate the Lagrange Interpolating Polynomials both

analytically and numerically in Mathematica. First we make a few observations. The
relative complement of the set B in A is given by Complement[A, B]., e.g,
In:=
U={x1, x2, x3, x4, x5}

Complement[U, {x4}]
Out:=
{x1, x2, x3, x5}
Next, we observe that if U is a list such as the one defined above, then Map[f, U]
returns the result of f[u], for every element u of U. Recall that f/@U is a shorthand
for Map[f, U],
In:=
f/@U
Out:=
{f[x1], f[x2], f[x3], f[x4], f[x5]}
Suppose that f[x] represents the function f (x) = x − 3. We can calculate some
value, say f (u) in two different ways. The first is the usual way,
In:=
f[x_]:=x-3;
f[u]
Out:=
u-3
The second way is to use pure functions:
In:=
(#-3)&[u]
Out:=
u-3

Pure functions allow us to define a function and use it in a single statement. In-
stead of saying f[x] we replace the f with the pure function (#-3)&. The symbol &
tells us where the function definition ends, and the symbol # is used in place of the
function’s argument x. We can also combine pure functions with the Map function.
This is convenient because it lets us map an expression that we are only going to use
once; otherwise we’d have to use an extra line of code to define an unnecessary extra
variable to hold the function. Thus
In:=
f[x_]:= x-3; V=Map[f, U]
and
In:=
V=(#-3)&/@U
both return the identical output
Out:=
{-3 + x1, -3 + x2, -3 + x3, -3 + x4, -3 + x5}
Next, we observe that the generalization of multiplication in Mathematicais the Times

function. Times can take an arbitrary number of arguments and returns their prod-
uct. Thus
In:=
Times[-3+x1, -3+x2, -3+x3, -3+x4, -3+x5]
Out:=
(-3 + x1) (-3 + x2) (-3 + x3) (-3 + x4) (-3 + x5)
To multiply out the elements of V we need to take all the elements of V and place
them as arguments to Times. We do this with the Apply command, which has a
shorthand of @@. The following are:
In:=
Apply[Times, V]
In:=
Times@@V

and both return the same thing (recall the definition of V, above):
Out:=
(-3 + x1) (-3 + x2) (-3 + x3) (-3 + x4) (-3 + x5)
Now suppose we want to combine these two functions. We want to subtract 3 from
every element of the list U, which we can do with Map, and then take the product of
the results with Apply and Times:
In:=
Times@@(#-3)&/@U
or In:=
Apply[Times, Map[(# - 3)&, U]]
both of which return
Out:=
(-3 + x1) (-3 + x2) (-3 + x3) (-3 + x4) (-3 + x5)
With this we can define a function to calculate the Lagrange Interpolating Functions
in Mathematica.
LagrangeInterpolatingFunctions[{xj__}, x_] :=
Module[ {i, n, xi, xjc, L, xgrid, num, den},
xgrid = {xj};
n = Length[xgrid];
L = {};
For[i = 1, i <= n, i++,
xi = xgrid[[i]];
xjc = Complement[xgrid, {xi}];
den = Times @@ ((xi - #) & /@ xjc);
num = Times @@ ((x - #) & /@ xjc);
L = Append[L, num/den];
];
Return[L];
]
We can now calculated a set of functions analytically. For example,
In:=

LagrangeInterpolatingFunctions[{x1, x2, x3}, x]
Out:=
(x − x2)(x − x3) (x − x1)(x − x3) (x − x1)(x − x2)
, ,
(x1 − x2)(x1 − x3) (x2 − x1)(x2 − x3) (x3 − x1)(x3 − x2)
Repeating our earlier example,
In:=
LagrangeInterpolatingFunctions[{0, 0.6, 0.9}, x]
Out:=
{1.85185 (-0.9 + x) (-0.6 + x), -5.55556 (-0.9 + x) x,

3.7037 (-0.6 + x) x}
We can also calculate at a point:
In:=
LagrangeInterpolatingFunctions[{0, 0.6, 0.9},0.45]
Out:=
{0.125, 1.125, -0.25}
Next we observe that the dot product of two lists A and B of the same length is
calculated with the dot operator, which is a period:
In:=
{f1, f2, f3, f4}.{L1, L2, L3, L4}
Out:=
f1 L1 + f2 L2 + f3 L3 + f4 L4
So we can repeat our previous example as follows:
In:=
f[x_]:= Sqrt[1.0+x];
points = {0.0, 0.6, 0.9};
(f/@points).LagrangeInterpolatingFunctions[points, 0.45]
Out:=

1.20342
We can get the Lagrange Interpolating Polynomial with

In:=
(f /@ points).LagrangeInterpolatingFunctions[points, x] // Expand
Out:=
1.+ 0.483656 x - 0.0702286 x^2
Theorem 17.2 (Error Bounds for Lagrange Interpolation). Suppose that f (x)
is n + 1 times continuously differentiable, and suppose that the points x0 , . . . , xn are
distinct. Then for any x ∈ [a, b] there exists a number c ∈ [a, b] such that
f n+1 (c)(x − x0 )(x − x1 ) · · · (x − xn )

f (x) = P (x) + (17.42)
(n + 1)!
where
n n n
X X Y x − xj
P (x) = fk Lnk (x) = fk (17.43)
k=0 k=0 j=0,j6=k
xk − x j
Proof. If x = xk and P (xk ) = fk for some k, then the second term in equation 17.42
is zero, regardless of the value of c, and the result holds identically.
So suppose that x 6= xk for all k, and define the function
n
Y t − xi
g(t) = f (t) − P (t) − [f (x) − P (x)] (17.44)
i=0
x − xi
Then
n
Y xk − xi
g(xk ) = f (xk ) − P (xk ) − [f (x) − P (x)] =0 (17.45)
i=0
x − xi
The second equality follows because (a) by construction, f (xk ) = P (xk ), so the first
term is zero; and (b) for some i we have i = k and hence there is a factor of xk − xk
in the numerator of the second term, making it zero as well. Furthermore,
n
Y x − xi
g(x) = f (x) − P (x) − [f (x) − P (x)] (17.46)
i=0
x − xi
= f (x) − P (x) − [f (x) − P (x)] (17.47)
=0 (17.48)
Hence g(t) = 0 as the n + 2 numbers x, x0 , . . . , xn . Since it is also continuously

differentiable n + 1 times, then by the generalized Rolle’s theorem, theorem 4.4, there

exists at least one number c ∈ (a, b) such that g (n+1) (c) = 0. Differentiating g(t) a
total of n + 1 times,
n
d(n+1) Y t − xi
g (n+1) (t) = f (n+1) (t) − P (n+1) (t) − [f (x) − P (x)] (17.49)
dt(n+1) i=0 x − xi
hence at t = c we have
0 = g (n+1) (c) (17.50)

n

(n+1) Y
d t − xi
= f (n+1) (c) − P (n+1) (c) − [f (x) − P (x)] (17.51)
dt(n+1) x − xi

i=0 t=c

(n+1) Y n
[f (x) − P (x)] d
= f (n+1) (c) − P (n+1) (c) − Qn (t − xi ) (17.52)

(n+1)
i=0 (x − xi ) dt i=0

t=c
Now since
P (t) = a0 + a1 t + · · · + an tn (17.53)
then P (n+1) (t) = 0 for all t, and hence P (n+1) (c) = 0, so that

(n+1) Y n
[f (x) − P (x)] d
0 = f (n+1) (c) − Qn (t − x ) (17.54)

(n+1) i
i=0 (x − x i ) dt i=0

t=c
Furthermore,
n
d(n+1) Y d(n+1)
(t − x i ) = (t − x1 )(t − x2 )(t − x3 ) · · · (t − xn ) (17.55)
dt(n+1) i=0 dt(n+1)
d(n+1) n
t + (stuff) × tn−1 + (more stuff) × tn−2 + · · ·

= (n+1)
dt
(17.56)
= (n + 1)! (17.57)
Substituting equation 17.57 into equation 17.58 gives
[f (x) − P (x)]
0 = f (n+1) (c) − Qn (n + 1)! (17.58)
i=0 (x − xi )
Solving for f (x) gives us equation 17.42, completing the proof.
Example 17.2. Suppose you want to make a table of the natural logarithms over the
range 1 ≤ x ≤ 100. What step size is sufficient to ensure that linear interpolation
between each successive pair of points will be accurate to within 10−5 ?
Math 481A «2008, B.E.Shapiro

Solution. From equation 17.42 we have

n+1
f (c)(x − x0 )(x − x1 ) · · · (x − xn )
|f (x) − P (x)| = (17.59)
(n + 1)!
For linear interpolation we use n = 1 (there are two points, x0 and x1 , so that
(2)
f (c)(x − x0 )(x − x1 )
|f (x) − P (x)| = (17.60)
2!
1
≤ max |f 00 (c)| × max |(x − x0 )(x − x1 )| (17.61)
2
on each interval. Since f (x) = log x we have f 0 (x) = 1/x and f 00 (x) = −1/x2 . The
maximum value of | − 1/x2 | on [1, 100] is 1, so that
1
|f (x) − P (x)| ≤ max |(x − x0 )(x − x1 )| (17.62)
2
To find the maximum value of g(x) = (x − x0 )(x − x1 ) = x2 − (x0 + x1 )x + x0 x1
on [x0 , x1 ] we observe that the maximum either occurs at an endpoint or at a point
where g 0 (x) = 0. At the endpoints g(x) = 0. So first we differentiate:
0 = g 0 (x) = 2x − (x0 + x1 ) (17.63)
which gives a possible maximum at x = (x0 + x1 )/2. The value of g at this point is

x 0 + x 1 x0 + x1 x 0 + x 1
g = − x 0 − x 1
(17.64)
2 2 2

x1 − x0 x0 − x1
=
(17.65)
2 2
h2
= (17.66)
4
where h is the size between entries in the table (the number we are solving for).
Substituting equation 17.66 into example 17.62 gives
h2
|f (x) − P (x)| ≤ (17.67)
8
Since we want to ensure that the error is no larger than 10−5 we set
h2
< 10−5 (17.68)
8
or √
h< 8 × 10−5 ≈ 0.0089 (17.69)
so if choose any step size smaller than h ≈ 0.0089 we are guaranteed to have an error
of no larger than 10−5 .


Lesson 18
Newton Interpolation
Newton’s method for interpolation is derived by seeking a polynomial of the form
P (x) = a0 +a1 (x − xn ) (18.1)

+a2 (x − xn )(x − xn−1 )
+a3 (x − xn )(x − xn−1 )(x − xn−2 )
..
.
+an (x − xn )(x − xn−1 ) · · · (x − x1 )
that interpolates the points
P (x0 ) = f (x0 ) (18.2)

P (x1 ) = f (x1 ) (18.3)
..
.
P (xn−1 ) = f (xn−1 ) (18.4)
P (xn ) = f (xn ) (18.5)
We define the backward difference operator ∇ for an element fn of a sequence as
∇fn = fn − fn−1 (18.6)

∇2 fn = ∇fn − ∇fn−1 = fn − 2fn−1 + fn−2 (18.7)
∇3 fn = ∇2 fn − ∇2 fn−1 = fn − 3fn−1 + 3fn−2 − fn−3 (18.8)
..
.
∇k fn = ∇k−1 fn − ∇k−1 fn−1 (18.9)
Letting fn = f (xn ) we have by substituting 18.5 into 18.1 that
f n = a0 (18.10)
119
120 LESSON 18. NEWTON INTERPOLATION
From 18.4 we get
fn−1 = fn + a1 (xn−1 − xn ) (18.11)

= f n − a1 h (18.12)
1 1
a1 = (fn − fn−1 ) = ∇fn (18.13)
h h
Substituting at x = xn−2 = xn − 2h gives
fn−2 = a0 + a1 (xn−2 − xn ) + a2 (xn−2 − xn )(xn−2 − xn−2 ) (18.14)

1
= fn + (fn − fn−1 )(−2h) + a2 (−2h)(−h) (18.15)
h
= 2fn−1 − fn + 2h2 a2 (18.16)
1 1
a2 = 2 (fn − 2fn−1 + fn−2 ) = 2 ∇2 fn (18.17)
2h 2h
Continuing the process we find in general that
1
ak = ∇ k fn (18.18)
k!hk
Next we define the polynomials Qk by
k−1
Y
Qk (x) = (x − xn−j ) (18.19)
j=0
Using 18.18 and 18.19 in 18.1
P (x) = a0 + a1 Q1 (x) + a2 Q2 (x) + · · · + an Qn (x) (18.20)

Xn
= a0 + ak Qk (x) (18.21)
k=1
n
X ∇ k fn
= fn + Qk (x) (18.22)
k=1
k!hk
Define the parameter s, −1 ≤ s ≤ 0 in the interval [xn−1 , xn ] by
x = xn + sh (18.23)

LESSON 18. NEWTON INTERPOLATION 121
From equation 18.19,

k−1
Y
Qk (x) = (xn + sh − (xn − jh)) (18.24)
j=0
k−1
Y
= (j + s)h (18.25)
j=0
k−1
Y
= hk (s + j) (18.26)
j=0
k
= h s(s + 1)(s + 2) · · · (s + k − 1) (18.27)
Recall the definition of the binomial coefficient for n, m integers,

n n! n(n − 1)(n − 2) · · · (n − m + 1)
= = (18.28)
m m!(n − m)! m!
we can define, for any real number t, not necessarily integer,

t t(t − 1)(t − 2) · · · (t − m + 1)
= (18.29)
k k!
Using this we calculate

−s −s(−s − 1)(−s − 2) · · · (−s − k + 1)
= (18.30)
k k!
(−1)k
= s(s + 1)(s + 2) · · · (s + k − 1) (18.31)
k!
(−1)k
= Qk (x) (18.32)
k!hk
Using 18.32 in 18.22 we get
n
X
k −s
P (x) = fn + (−1) ∇ k fn (18.33)
k
k=1
which is known as Newton’s Backward Difference Formula.

We can also derive a formula using forward differences, and the forward difference
operators that we defined in section 13,
∆fn = fn+1 − fn (18.34)
∆2 fn = ∆fn+1 − ∆fn = fn+2 − 2fn+1 + fn (18.35)
..
. (18.36)
∆k fn = ∆k−1 fn+1 − ∆k fn (18.37)

The result is known as Newton’s Forward Difference Formula,

n
X s
P (x) = f0 + ∆k f0 (18.38)
k
k=1
where x = x0 + hs.
Example 18.1. Find e1.2 using a first, second, third, and fourth order differences
using the data e = 2.71828, e1.5 = 4.48169, e2 = 7.38906, e2.5 = 12.1829, e3 =
20.085554 with Newton’s forward difference formula.
Solution. We want to use equation 18.34 at x = 1.2 with x0 = 1. Hence
1.2 = x = x0 + hs = 1 + 0.5s (18.39)
and so s = 0.4. We can construct the following table of forward differences based on
the input data. The actual data values that we will use are colored yellow.
xk fk ∆fk ∆2 fk ∆3 fk ∆4 fk
1 2.71828
1.76341
1.5 4.48169 1.14396
2.90737 0.74211
2 7.38906 1.88607 0.48142
4.79344 1.22353
2.5 12.18249 3.10961
7.90304
3 20.08554
We then calculated the following binomial coefficients, using s = 0.4

s 0.4
= = 0.4 (18.40)
1 1

s 0.4 (0.4)(−0.6)
= = = −0.12 (18.41)
2 2 2!

s 0.4 (0.4)(−0.6)(−1.6)
= = = 0.064 (18.42)
3 3 3!

s 0.4 (0.4)(−0.6)(−1.6)(−2.6)
= = = −0.0416 (18.43)
4 4 4!
For n = 1, the interpolated value is

s
P (x + sh) = f0 + ∆f0 (18.44)
1
= 2.71828 + (0.4)(1.76341) (18.45)
= 3.42364 (18.46)

LESSON 18. NEWTON INTERPOLATION 123
For n = 2,

s s
P (x + sh) = f0 + ∆f0 + ∆2 f0 (18.47)
1 2
= 3.42364 + (−0.12)(1.14396) (18.48)
= 3.28636 (18.49)
For n = 3,

s s 2 s
P (x + sh) = f0 + ∆f0 + ∆ f0 + ∆3 f0 (18.50)
1 2 3
= 3.28636 + (0.064)(0.74211) (18.51)
= 3.33386 (18.52)
For n = 4,

s s 2 s 3 s
P (x + sh) = f0 + ∆f0 + ∆ f0 + ∆ f0 + ∆4 f0 (18.53)
1 2 3 4
= 3.33386 + (−0.0416)(0.48142) (18.54)
= 3.31383 (18.55)
The correct value is approximately 3.32012.

We can also use backward differences for numbers closer to the end of the table.
Example 18.2. Using the same data as the previous example, calculate e2.7 using
backward differences.
Solution. We have
2.7 = x = xn + sh = 3 + (0.5)s (18.56)
Hence s = −0.6. The backwards difference formula gives

0.6 2 0.6 2 3 0.6
P (2.7) = fn + (−1) ∇fn + (−1) ∇ fn + (−1) ∇ 3 fn
1 2 3

4 0.6
+ (−1) ∇4 fn + · · · (18.57)
4
(0.6)(−0.4) 2 (0.6)(−0.4)(−1.4) 3
= fn + (−1)(0.6)∇fn + (−1)2 ∇ fn + (−1)3 ∇ fn
2! 3!
(0.6)(−0.4)(−1.4)(−2.4) 4
+ (−1)4 ∇ fn + · · · (18.58)
4!
= fn − 0.6∇fn − 0.12∇2 fn − 0.056∇3 fn − 0.0336∇4 fn + · · · (18.59)
We now can read data off the lower diagonal in the table.

xk fk ∆fk ∆2 fk ∆3 fk ∆4 fk
1 2.71828
1.76341
1.5 4.48169 1.14396
2.90737 0.74211
2 7.38906 1.88607 0.48142
4.79344 1.22353
2.5 12.18249 3.10961
7.90304
3 20.08554
Substituting numbers from the table gives us
P (2.7) = 20.08554 − 0.6(7.90304) − 0.12(3.10961)

− 0.056(1.22353) − 0.0336(0.48142) (18.60)
= 14.88586 (18.61)

Lesson 19
Hermite Interpolation
One of the problems with polynomial interpolation is that although it fits the points
the shape of the curve doesnt always match very well:
(x2,f2) (x4,f4) P(x)
(x1,f1) f(x)
(x3,f3) (x5,f5)
x1 x2 x3 x4 x5
One approach to this problem is to try to match the derivatives as well as the points.
Suppose that we know the function f (x) at n+1 points, given by (x0 , f0 ), . . . , (xn , fn ),
and that we also know the derivatives at these same n + 1 points,
fi0 = f 0 (xi ) (19.1)
Then our approach will be to try to find a polynomial that matches both the function
and the derivative as these points. Our conditions are then:
P (xi ) = fi (19.2)
P 0 (xi ) = fi0 (19.3)
for i = 0, . . . , n. Since there are 2(n + 1) = 2n + 2 conditions, we are potentially able

to determine up to 2n + 2 unknowns in our model. Typically this means that our
polynomial will have 2n + 2 unknown coefficients, i.e., that it will have degree 2n + 1.
We can construct the solution using Lagrange Interpolating polynomials. The result
is given first.
125
126 LESSON 19. HERMITE INTERPOLATION
Theorem 19.1. Suppose that f (t) is continuously differentiable on [a, b] and that the
numbers x0 , . . . , xn ∈ [a, b] are unique, and let Lnj (x) be the Lagrange interpolating
functions. Then
n
X n
X
P (x) = H2n+1 (x) = fj Hnj (x) + fj0 Ĥnj (x) (19.4)
j=0 j=0
where
Hnj (x) = 1 − 2(x − xj )L0nj (xj ) (Lnj (x))2

(19.5)
Ĥnj (x) = (x − xj ) (Lnj (x))2 (19.6)
satisfies equations 19.2 and 19.3. Equation 19.4 is called the Hermite Interpolat-
ing Polynomial.
Proof. Since Lnj (xi ) = δij ,
Hnj (xi ) = 1 − 2(xi − xj )L0nj (xj ) L2nj (xi )

(19.7)
= 1 − 2(xi − xj )L0nj (xj ) δij

(19.8)
If i 6= j then Hnj = 0, while if i = j,
Hnj (xj ) = 1 − 2(xj − xj )L0nj (xj ) δjj = 1

(19.9)
Hence
Hnj (xi ) = δij (19.10)
Similarly,
Ĥnj (xi ) = (xi − xj )δij = 0 (19.11)
For all i and j. Substituting into equation 19.4,
n
X n
X
P (xi ) = fj Hnj (xi ) + fj0 Ĥnj (xi ) (19.12)
j=0 j=0
Xn
= fj δij (19.13)
j=0
= fi (19.14)
hence equation 19.2 is satisfied. To demonstrate equation 19.3 we differentiate 19.4,

n
X n
X
0 0
P (x) = fj Hnj (x) + fj0 Ĥnj
0
(x) (19.15)
j=0 j=0

LESSON 19. HERMITE INTERPOLATION 127
and therefore
n
X n
X
0 0
P (xi ) = fj Hnj (xi ) + fj0 Ĥnj
0
(xi ) (19.16)
j=0 j=0
To evaluate equation 19.16 we calculate the derivative,

d
0
1 − 2(x − xj )L0nj (xj ) (Lnj (x))2

Hnj (x) = (19.17)
dx
= 2 1 − 2(x − xj )L0nj (xj ) Lnj (x)L0nj (x) − 2 (Lnj (x))2 L0nj (xj )

(19.18)
0
(xi ) = 2 1 − 2(xi − xj )L0nj (xj ) Lnj (xi )L0nj (xi ) − 2 (Lnj (xi ))2 L0nj (xj )

Hnj (19.19)
= 2 1 − 2(xi − xj )L0nj (xj ) L0nj (xi )δij − 2δij L0nj (xj )

(19.20)
0
If i 6= j then clearly this Hnj (xi ) = 0 because of the common factor δij . If i = j,
0
(xj ) = 2 1 − 2(xj − xj )L0nj (xj ) L0nj (xi ) − 2L0nj (xj )

Hnj (19.21)
= 2L0nj (xi ) − 2L0nj (xj ) (19.22)
=0 (19.23)
0
Hence Hnj (xi ) = 0 for all i and j. Substituting this into 19.16,
n
X
P 0 (xi ) = fj0 Ĥnj
0
(xi ) (19.24)
j=0
Differentiating Ĥnj ,
d
0
(x − xj ) (Lnj (x))2

Ĥnj (x) = (19.25)
dx
= 2(x − xj )Lnj (x)L0nj (x) + L2nj (x) (19.26)
0
Ĥnj (xi ) = 2(xi − xj )Lnj (xi )L0nj (xi ) + L2nj (xi ) (19.27)
= 2(xi − xj )δij L0nj (xi ) + δij (19.28)
= δij (19.29)
Therefore
n
X
0
P (xi ) = fj0 δij = fi0 (19.30)
j=0
which is equation 19.3.

Theorem 19.2. Under the same conditions as theorem 19.1 then the Hermite inter-
polating polynomials are the unique polynomials of least degree (at most 2n + 1) that
satisfy the conditions of equation 19.2 and 19.3.

Proof. Certainly H2n+1 is a polynomial of order at most n + 1 in x because Lnj is a

polynomial of at most degree n. We have already shown that H2n+1 (x) satisfies the
conditions of equation 19.2 and 19.3 (in theorem 19.1). Now suppose that there is
some other polynomial g(x), also of degree at most 2n+1, such that g(xi ) = fi and
g 0 (xi ) = fi0 . Let
∆(x) = H2n+1 (x) − g(x) (19.31)
be the difference between these two polynomials. Since ∆(x) is the difference of two
polynomials of degree at most 2n + 1 that satisfy 19.2 and 19.3, then ∆(x) is also a
polynomial of degree at most 2n + 1. Furthermore, it satisfies 19.2 and 19.3, since
∆(xi ) = H2n+1 (xi ) − g(xi ) = fi − fi = 0 (19.32)

∆0 (xi ) = H2n+1
0
(xi ) − g 0 (xi ) = fi0 − fi0 = 0 (19.33)
Thus by theorem 12.6 ∆(x) has zeroes of multiplicity 2 at each of x0 , . . . , xn , and

hence there exists some function q(x), which does not have a zero at any of these
points, such that
∆(x) = (x − x0 )2 (x − x1 )2 · · · (x − xn )2 q(x) (19.34)
This says that either ∆(x) has 2(n + 1) zeroes, which contradicts our earlier obser-
vation that it only has 2n + 1 zeroes; or that q(x) = 0 identically. But if q(x) = 0
identically, then ∆(x) = 0 identically, which implies that g(x) = H2n+1 (x) for all x.
In other words, H2n+1 is unique.
Example 19.1. Find a √ Hermite interpolating polynomial for the following data,
which is based on f (x) = x.
f (x) f 0 (x)
x0 = 1 f0 = 1 f00 = 1/2
x1 = 4 f1 = 2 f10 = 1/4
Solution. Since n = 1 there are 2n + 2 = 4 conditions that must be met, and therefore
the order of the polynomial will be 2n + 1 = 3. The interpolating polynomial is
P (x) = H3 (x) (19.35)

1
X 1
X
= fj H1j (x) + fj0 Ĥ1j (x) (19.36)
j=0 j=0
= f0 H10 (x) + f1 H11 (x) + f00 Ĥ10 (x) + f10 Ĥ11 (x) (19.37)
1 1
= H10 (x) + 2H11 (x) + Ĥ10 (x) + Ĥ11 (x) (19.38)
2 4

To find the H’s we need to first find the L1j ’s,

x − x1 x−4 1 4
L10 (x) = = =− x+ (19.39)
x0 − x1 −3 3 3
x − x0 x−1 1 1
L11 (x) = = = x− (19.40)
x1 − x0 3 3 3
From this we can determined that L010 = −1/3 and L011 = 1/3. Hence
H10 (x) = [1 − 2(x − x0 )L010 (x0 )]L210 (x) (19.41)
2
1 1 4
= 1 − 2(x − 1) − − x+ (19.42)
3 3 3
1
= (1 + 2x)(4 − x)2 (19.43)
27
and
H11 (x) = [1 − 2(x − x1 )L011 (x1 )]L211 (x) (19.44)
2
1 1 1
= 1 − 2(x − 4) x− (19.45)
3 3 3
1
= (11 − 2x)(x − 1)2 (19.46)
27
Similarly,
Ĥ10 (x) = (x − x0 )L210 (x) (19.47)
1
= (x − 1)(4 − x)2 (19.48)
9
Ĥ11 x = (x − x1 )L211 (x) (19.49)
1
= (x − 4)(x − 1)2 (19.50)
9
Thus from equation 19.35
1 1
P (x) = H10 (x) + 2H11 (x) + Ĥ10 (x) + Ĥ11 (x) (19.51)
2 4
1 2
= (1 + 2x)(4 − x)2 + + (11 − 2x)(x − 1)2 + (19.52)
27 27
1 1
(x − 1)(4 − x)2 + (x − 4)(x − 1)2 (19.53)
18 36
Theorem 19.3 (Error Formula for Hermite Interpolation). Suppose that f is
n+2 times continuously differentiable and that the same conditions hold as in theorem
19.2. Then there is some number c ∈ (a, b) such that
(x − x0 )2 · · · (x − xn )2 (2n+2)
f (x) = H2n+1 (x) + f (c) (19.54)
(2n + 2)!

Proof. First, suppose that x = xk for some k . Then the second term in equation
19.54 is zero and it becomes f (x) = H2n+1 (x), which is the interpolation condition,
because x = xk . This condition is known to hold true because of theorem 19.1.
Now suppose that x 6= xk for all k. Then define the function g(t) by
(t − x0 )2 · · · (t − xn )2
g(t) = f (t) − H2n+1 (t) − [f (x) − H2n+1 (x)] (19.55)
(x − x0 )2 · · · (x − xn )2
then
g(xk ) = f (xk ) − H2n+1 (xk )−

(xk − x0 )2 · · · (xk − xk )2 · · · (xk − xn )2
[f (x) − H2n+1 (x)] (19.56)
(x − x0 )2 · · · (x − xn )2
= f (xk ) − H2n+1 (xk ) (19.57)
=0 (19.58)
and
g(x) = f (x) − H2n+1 (x)−

(x − x0 )2 · · · (x − xn )2
[f (x) − H2n+1 (x)] (19.59)
(x − x0 )2 · · · (x − xn )2
= f (x) − H2n+1 (x) − [f (x) − H2n+1 (x)] (19.60)
=0 (19.61)
Hence g(x) has n + 2 distinct zeros in [a, b] at x, x0 , . . . , xn . By Rolle’s theorem,

between each pair of consecutive zeroes there is a number ck such that g 0 (ck ) = 0.
Since there are n + 1 such pairs of consecutive points, g 0 (t) has n + 1 unique zeroes
at c0 , . . . , cn , none of which are equal to any of the grid points x0 , . . . , xn .
If we differentiate 19.55
(t − x0 )2 · · · (t − xn )2

0 0 0 d
g (t) = f (t) − H2n+1 (t) − [f (x) − H2n+1 (x)] (19.62)
(x − x0 )2 · · · (x − xn )2
dt
n
f (x) − H2n+1 (x) d Y
= f 0 (t) − H2n+1
0
(t) − (t − xk )2 (19.63)
(x − x0 )2 · · · (x − xn )2 dt k=0
But by the product rule for derivatives
d
(a1 a2 a3 · · · an ) = a01 a2 a3 · · · an + a1 a02 a3 · · · an + · · · + a1 · · · an−1 a0n (19.64)
dt
Hence
n n n
d Y 2
Y
2
Y
(t − xk ) = 2(t − x0 ) (t − xk ) + 2(t − x1 ) (t − xk )2 +
dt k=0 k=0,k6=0 k=0,k6=1
n
Y
2(t − x2 ) (t − xk )2 + · · · +
k=0,k6=2
Yn
2(t − xn ) (t − xk )2 (19.65)
k=0,k6=n
n
Y Xn Y n
=2 (t − xk ) (t − xj ) (19.66)
k=0 i=0 j=0,j6=i
= P (t)Q(t) (19.67)
where
n
Y
P (t) = 2 (t − xk ) (19.68)
k=0
n
X n
Y
Q(t) = (t − xj ) (19.69)
i=0 j=0,j6=i
Consequently
f (x) − H2n+1 (x)
g 0 (t) = f 0 (t) − H2n+1
0
(t) − P (t)Q(t) (19.70)
(x − x0 )2 · · · (x − xn )2
Since
f (x) − H2n+1 (x)
g 0 (xk ) = f 0 (xk ) − H2n+1
0
(xk ) − P (xk )Q(xk ) = 0 (19.71)
(x − x0 )2 · · · (x − xn )2
the two facts that fk0 = H2n+1 0
(xk ) and P (xk ) = 0 (from equation 19.68) we see that
g has roots at x0 , . . . , xn . Therefore g 0 (t) has 2n + 2 unique zeroes at c0 , . . . , cn and
0
x0 , . . . , xn . Hence by the generalized Rolle’s theorem, there is some number c ∈ [a, b]

such that g (2n+2) (c) = 0.
But
(2n+2)
g (2n+2) (t) = f (2n+2) (t) − H2n+1 (t)−
n
f (x) − H2n+1 (x) d2n+2 Y
(t − xk )2 (19.72)
(x − x0 )2 · · · (x − xn )2 dt2n+2 k=0
Since H2n+1 (t) is a polynomial of order 2n + 1, its 2n + 2-th derivative is zero,

n
(2n+2) (2n+2) f (x) − H2n+1 (x) d2n+2 Y
g (t) = f (t) − (t − xk )2 (19.73)
(x − x0 )2 · · · (x − xn )2 dt2n+2 k=0

Next we calculate
n
d2n+2 Y 2 d2n+2 2n+2 2n+1

(t − x k ) = t + (stuff) × t + · · · + (stuff) (19.74)
dt2n+2 k=0 dt2n+2
= (2n + 2)! (19.75)
and therefore
f (x) − H2n+1 (x)
g (2n+2) (t) = f (2n+2) (t) − (2n + 2)! (19.76)
(x − x0 )2 · · · (x − xn )2
Since there is some point c such that g (2n+2) (c) = 0,
f (x) − H2n+1 (x)

0 = f (2n+2) (c) − (2n + 2)! (19.77)
(x − x0 )2 · · · (x − xn )2
Solving for f (x) gives equation 19.54.

√
Example
√ 19.2. Calculate the error bounds on 16 from the following data for f (x) =
x according to the Hermite polynomial error formula.
x 5 10 15 20 25
f (x) 2.24 3.16 3.87 4.47 5.00
Solution. Since there are 5 data points x0 , . . . , x4 , we have n = 4, so we need the 10th
derivative of f (x). Using Mathematica, we find that
34, 459, 425
f (10) (x) = − (19.78)
1024x19/2
At x = 16, equation 19.54 gives
(16 − 5)2 (16 − 10)2 (16 − 15)2 (16 − 20)2 (16 − 25)2 34, 459, 425

|error| = ×
10! 1024c19/2
(19.79)
52352.6
= (19/2) (19.80)
c
The maximum of 1/c19/2 on (5, 25) occurs at the minimum of c19/2 on (5, 25), which
occurs at c = 5.
52352.6
|error| ≤ ≈ 0.012 (19.81)
5(19/2)
Of course this is just a theoretical limit, because we do not know the actual value of
c. In this case, the actual Hermite approximation gives a much smaller error, of only
9.3 × 10−7 .

The Hermite polynomials are easily calculated in Mathematicaas follows.
Hermite[{x__}, {f__}, {fprime__}, t_] :=

Module[{L, L2, Lprime, z, H, HH, H2NP1},
L = LagrangeInterpolatingPolynomials[{x}, z];
Lprime = MapThread[#1 /. {z -> #2} &, {D[L, z], {x}}];
L2 = L^2;
HH = (((t - #) & /@ {x})*L2) /. {z -> t};
H = (L2 - 2*Lprime*HH) /. {z -> t};
H2NP1 = H.{f} + HH.{fprime};
Return[H2NP1];
]
The polynomial in the last example is then given by
f[x_] := Sqrt[x];
xdata = Range[5, 25, 5]; (* list of x values *)
fdata = f /@ xdata; (* list of f values *)
fpdata = f’[#] & /@ xdata; (* list of f’ values *)
Hermite[xdata, fdata, fpdata, x]
To get the actual error value quoted in the example we use Hermite[xdata, fdata,
fpdata, 16] - 4.0, because we know that the correct answer is 4.


Lesson 20
Cubic Splines
As before we are trying to find an interpolating function for a function that we know
at n + 1 points a = x0 < x1 < · · · < xn = b . Instead of fitting a single polynomial
to all n + 1 points, an alternative strategy is to fit a different polynomial to each
successive pair of points. We will define a set of functions Si (x), on each interval
[xi , xi+1 ]. To keep the solution smooth we would like the match the first and second
derivatives, as well as the function itself, at each grid point. Our conditions are
Si (xi ) = fi (20.1)
Si (xi+1 ) = Si+1 (xi+1 ) (20.2)
Si0 (xi+1 ) = Si+1
0
(xi+1 ) (20.3)
00 00
Si (xi+1 ) = Si+1 (xi+1 ) (20.4)
The functions S0 (x), . . . , Sn−1 (x) are called Spline functions.
Si(x) ... Sn-1(x)

Si-1(x)
...
S0(x)
... ...
x0 x1 xi-1 xi xi+1 xn-1 xn
Equations 20.1 through 20.4 give us a total of 4n − 2 conditions. Since there are
n spline functions, we need 3n parameters if the functions are quadratic and 4n
parameters if the functions are cubic. Since 4n−2 > 3n the system is over-determined
for a quadratic to work, and since 4n − 2 < 4n the system is under-determined for
a cubic to work. By adding two additional conditions, however, we can uniquely
determine a set of cubic spline functions. These are typically either free (natural)
boundary conditions,
S000 (x0 ) = Sn−1
00
(xn ) = 0 (20.5)
135
136 LESSON 20. CUBIC SPLINES
or the clamped boundary conditions:
S00 (x0 ) = f 0 (x0 ) (20.6)

0
Sn−1 (xn ) = f 0 (xn ) (20.7)
Let the cubic splines be given by
Si (x) = ai + bi (x − xi ) + ci (x − xi )3 + di (x − xi )3 (20.8)
on each interval [xi , xi+1 ]. Substituting equation 20.1 into 20.8 gives
ai = f i (20.9)
From equation 20.2,

ai + bi hi + ci h2i + di h3i = ai+1 (20.10)
where hi = xi+1 − xi . Differentiating equation 20.8 twice
Si0 (x) = bi + 2ci (x − xi ) + 3di (x − xi )2 (20.11)

Si00 (x) = 2ci + 6di (x − xi ) (20.12)
Substituting equation 20.3 into 20.11
bi + 2ci hi + 3di h2i = bi+1 (20.13)
Similarly, if we substitute equation 20.4 into 20.12,
2ci + 6di hi = 2ci+1 (20.14)
where i = 0, . . . , n − 1. We will define one additional number
cn = cn−1 + 3dn−1 hn−1 (20.15)
From equation 20.5 in 20.12

c0 = 0 (20.16)
and
2cn = 2ci−1 + 6dn−1 hn−1 = 0 (20.17)
Rearranging equation 20.17
ci+1 − ci
di = (20.18)
3hi
Thus we now have determined all of the ai , and the di are fully determined by the ci .
To get bi we substitute 20.18 into equation 20.10,
ci+1 − ci 3
ai + bi hi + ci h2i + hi = ai+1 (20.19)
3hi

LESSON 20. CUBIC SPLINES 137
Rearranging,
h2i
ai + bi hi + (2ci + ci+1 ) = ai+1 (20.20)
3
Solving for bi ,
h3i
bi hi = ai+1 − ai − (2ci + ci+1 ) (20.21)
2
or
ai+1 − ai hi
bi = − (2ci + ci+1 ) (20.22)
hi 3
Reducing the index by 1 all the way through,
ai − ai−1 hi−1
bi−1 = − (2ci−1 + ci ) (20.23)
hi−1 3
Similarly, by substituting 20.18 into 20.13
bi+1 = bi + 2ci hi + (ci+1 − ci )hi (20.24)

= bi + hi (ci + ci+1 ) (20.25)
Again, we reduce the index by 1,
bi = bi−1 + hi−1 (ci−1 + ci ) (20.26)
Using equation 20.22 for the bi on the left hand side of equation 20.26, and equation
20.23 for the bi−1 on the right hand side of equation 20.26,
ai+1 − ai hi ai − ai−1 hi−1

− (2ci + ci+1 ) = − (2ci−1 + ci ) + hi−1 (ci−1 + ci ) (20.27)
hi 3 hi−1 3
Rearranging a bit,
ai+1 − ai ai − ai−1 hi hi−1

− = (2ci + ci+1 ) − (2ci−1 + ci ) + hi−1 (ci−1 + ci ) (20.28)
hi hi−1 3 3
ai+1 − ai ai − ai−1
3 −3 = hi ci+1 + 2ci (hi − hi−1 ) + ci−1 hi−1 (20.29)
hi hi−1

This equation is defined for i = 0, .., n − 1. We can extend it to include an extra

parameter an that we are not actually interested in, to give
···
  
1 0 0 c0
..   c 
h0 2(h0 + h1 ) h1 .   1

  c2 
0 h1 2(h1 + h2 ) h2

  ..  =
 
. . . .
 .. .. .. ..  . 
  
0 0 hn−2 2(hn−1 + hn−2 ) hn−1   
0 0 1 cn
 
0
3 3

h1
(a 2 − a 1 ) − h0
(a1 − 10 ) 
..
 
(20.30)
 

 3 . 
3
 h (an − an−1 ) − h (an−1 − an−2 )

n−1 n−2
0
If the grid points are equally spaced with hi = h for some number h, then
···
  
1 0 0 c0  
..   c  0
h 4h h 0 .   1

  c2  3  a0 − 2a1 + a2 
 
0 h 4h h 0

..
.=  (20.31)
   
.
 .. ... ... ...    ..  h  . 

   an−2 − 2an−1 + an 
0 0 h 4h h   0
0 0 1 cn
Denoting the square matrix by A and the vectors by c and w, this can be written
concisely as Ac = w. Since all of the ai are already known, the only unknowns in
this equation are the c, for which we can solve as c = A−1 w.
For clamped cubic splines, the corresponding equations are
···
  
2h0 h0 0 c0
..   c 
 h0 2(h0 + h1 ) h1 .   1

  c2 
 0 h1 2(h1 + h2 ) h2

  ..  =
 
 . . . .
 .. .. .. .. 0  . 
  
 0 0 hn−2 2(hn−1 + hn−2 ) hn−1   
0 0 hn−1 2hn−1 cn
3
(a1 − a0 ) − 3f 0 (a)
 
h0
3

h1
(a2 − a1 ) − h30 (a1 − 10 ) 
..
 
(20.32)
 
 . 
 3 3
 hn−1 (an − an−1 ) − hn−2 (an−1 − an−2 )

3f 0 (b) − hn−13
(an − an−1 )

Example 20.1. Fit a free cubic spline to the following data.

x0 = 0 x1 = 1 x2 = 2 x3 = 3
f0 = 0 f1 = 0.5 f2 = 0.8 f3 = 0.9
Solution. By equation 20.9,
a0 = 0; a1 = 0.5; a2 = 0.8; a3 = 0.9 (20.33)
Since the points are equally spaced with h = 1, we can use 20.31. The right hand
side is given by
w0 = 0 (20.34)
3
w1 = (0 − 2(.5) + .8) = −0.6 (20.35)
1
3
w2 = (.5 − 2(.8) + .9) = −0.6 (20.36)
1
w3 = 0 (20.37)
So the 20.31 becomes

    
1 0 0 0 c0 0
 c1  = −0.6
1 4 1 0    
 (20.38)
0 1 4 1  c2   −0.6
0 0 0 1 c3 0
Multiplying the matrices on the left and setting like components equal gives the
equivalent system of equations:
c0 =0 (20.39)
c0 + 4c1 + c2 = −0.6 (20.40)
c1 + 4c2 + c3 = −0.6 (20.41)
c3 =0 (20.42)
Subsituting the first and last result into the middle two equations,
4c1 + c2 = −0.6 (20.43)

c1 + 4c2 = −0.6 (20.44)
Multiplying 20.43 by 4 and subtracting 20.44
15c1 = −1.8 (20.45)
Hence c1 = −1.8/15 = −0.12. Equation 20.43 then gives
c2 = −0.6 − 4(c1 ) = −0.6 − 4(−0.12) = −0.12 (20.46)

Summarizing, we have
c0 = 0; c1 = −0.12; c1 = −0.12; c3 = 0 (20.47)
ai+1 −ai hi
From 20.22, bi = hi
− 3
(2ci + ci+1 ) and therefore
a1 − a0 h
b0 = − (2c0 + c1 ) (20.48)
h 3
1
= (0.5 − 0) − (2(0) − 0.12) (20.49)
3
= 0.54 (20.50)
a2 − a1 h
b1 = − (2c1 + c2 ) (20.51)
h 3
1
= (0.8 − 0.5) − (2(−0.12) + (−0.12)) (20.52)
3
= 0.42 (20.53)
a3 − a2 h
b2 = − (2c2 + c3 ) (20.54)
h 3
1
= (0.9 − 0.8) − (2(−0.12) + 0) (20.55)
3
0 = 0.18 (20.56)
From equation 20.18 di = ci+1 − ci /3hi = ci+1 − ci /3 (since h = 1), so that
d0 = (c1 − c0 )/3 = (−.12 − 0)/3 = −0.04 (20.57)
d1 = (c2 − c1 )/3 = (−.12 − −.12)/3 = 0 (20.58)
d2 = (c3 − c2 )/3 = (0 − −.12)/3 = 0.04 (20.59)
Combining equations 20.33, 20.48, 20.47 and 20.57,
S0 = a0 + b0 (x − x0 ) + c0 (x − x0 )2 + d0 (x − x0 )3 (20.60)
= 0 + (0.54)(x) + (0)(x)2 + (−0.04)(x)3 (20.61)
= 0.54x − 0.04x3 (20.62)
S1 = a1 + b1 (x − x1 ) + c1 (x − x1 )2 + d1 (x − x1 )3 (20.63)
= 0.5 + (0.42)(x − 1) + (−0.12)(x − 1)2 + (0)(x − 1)3 (20.64)
= 0.5 + 0.42x − 0.42 − 0.12x2 + 0.24x − 0.12 (20.65)
= −0.04 + 0.66x − 0.12x2 (20.66)
S2 = a2 + b2 (x − x2 ) + c2 (x − x2 )2 + d2 (x − x2 )3 (20.67)
= 0.8 + (0.18)(x − 2) + (−0.12)(x − 2)2 + (0.04)(x − 2)3 (20.68)
= 0.8 + 0.18x − 0.36 − 0.12x2 + 0.48x − 0.48
+ 0.04x3 − 0.24x2 + 0.48x − 0.32 (20.69)
= −0.36 + 1.14x − 0.36x2 + 0.04x3 (20.70)

Therefore the spline solution is


3
0.54x − 0.04x ,
 x ∈ [0, .5],
f (x) = −0.04 + 0.66x − 0.12x , 2
x ∈ [.5, .8] (20.71)
 2 3
−0.36 + 1.14x − 0.36x + 0.04x , x ∈ [.8, .9]

Theorem 20.1 (Error Bounds for Clamped Cubic Splines). Let f (x) be 4-times
continuously differentiable on [a, b], and define

M = sup |f (4) (x) (20.72)
[a,b]
If S(x) is the unique clamped cubic spline interpolant to f (x) on the nodes x0 , . . . , xn ,
where a = x0 and b = xn , then
5M
|f (x) − S(x)| ≤ max hj (20.73)
384 j=1,...,n−1


Lesson 21
Bezier Curves
Bezier curves, commonly used in computer graphics, are a modification of Hermite

interpoloation in which the derivatives are computed based on certain control points,
or handles, rather than just the end points. They are named for Piere Bézier (1910-
1999) who created them to model automobile surfaces in the early 1960’s. He worked
for Renault for virtually his entire career.
Bezier curves are used by many interactive graphics programs to draw smooth
curves. One typically marks the end points of the curve by clicking a mouse, then
then drags a mouse to define the handles, which can be used interactively to pull
the curve in different directions. In general, Bezier curves are described by systems
of parametric equations that describe the path of the curve in the xy-plane over the
interval [0, 1].
The simplest type of Bezier curves is a Bezier Line, describe parametrically by
joining the points P0 = (x0 , y0 ) and P2 = (x1 , y1 ) with a straight line. We can describe
this construction as
P (t) = P0 (1 − t) + P1 t (21.1)
In terms of the separate x and y components,
x(t) = x0 (1 − t) + x1 t (21.2)
y(t) = y0 (1 − t) + y1 t (21.3)
Quadratic Bezier curves are constructed as illustrated in figure 21.1. We draw
a line that connects points P0 and P2 ; the shape of this line is determined by the
position of a third “guide point” or “handle” that we label P1 . We then define
parameterizations of the line segements P0 P1 and P1 P2 as
P01 (t) = P0 (1 − t) + P1 t (21.4)
P12 (t) = P1 (1 − t) + P2 t (21.5)
Finally, we parameterize the line segment from P01 to P12 :
P (t) = (1 − t)P01 (t) + tP12 (t) (21.6)
143
144 LESSON 21. BEZIER CURVES
The Bezier Quadratic is the curved traced out P (t) as t goes from t = 0 to t = 1.
Substituting the expressions for P01 and P12 gives
P (t) = (1 − t)[(1 − t)P0 + tP1 ] + t[(1 − t)P1 + tP2 ] (21.7)
= (1 − t)2 P0 + 2t(1 − t)P1 + t2 P2 (21.8)
In terms of the x and y coordinates, the quadratic Bezier interpolants are:
x(t) = (1 − t)2 x0 + 2t(1 − t)x1 + t2 x2 (21.9)
y(t) = (1 − t)2 y0 + 2t(1 − t)y1 + t2 y2 (21.10)
Bezier quadratics are used, for example, to describe true-type fonts. The spline
functions constructud in this way have slopes tangent to the line segment P01 P12 .
Figure 21.1: Construction of Bezier quadratics. See text for description.

P1
P12
P01
P
P2
P0
Figure 21.2: Construction of the Bezier Cubic spline.

P12 P2
P1 P123
P012
P P23
P01
P3
P0
Now suppose that we have two control points, P1 and P2 , and that we want to
draw a curve connecting P0 and P3 using points P1 and P2 to control our movement.

LESSON 21. BEZIER CURVES 145
See figure 21.2. As before we construct the line segments P0 P1 , P1 P2 and P2 P3 , and
at any time t ∈ [0, 1] define points on these three segments:
P01 (t) = (1 − t)P0 + tP1 (21.11)
P12 (t) = (1 − t)P1 + tP2 (21.12)
P23 (t) = (1 − t)P2 + tP3 (21.13)
Next, construct line segments P01 P12 and P12 P23 and define their parameterization
on [0, 1] as follows:
P012 (t) = (1 − t)P01 (t) + tP12 (t) (21.14)
P123 (t) = (1 − t)P12 (t) + tP23 (t) (21.15)
Finally we construct a line segment P012 P123 with parameterization
P (t) = (1 − t)P012 (t) + tP123 (t) (21.16)
= (1 − t)[(1 − t)P01 (t) + tP12 (t)] + t[(1 − t)P12 (t) + tP23 (t)] (21.17)
= (1 − t)2 P01 (t) + 2t(1 − t)P12 (t) + t2 P23 (t) (21.18)
= (1 − t)2 [(1 − t)P0 + tP1 ] + 2t(1 − t)[(1 − t)P1 + tP2 ] (21.19)
+ t2 [(1 − t)P2 + tP3 ]
= (1 − t)3 P0 + 3t(1 − t)2 P1 + 3t2 (1 − t)P2 + t3 P3 (21.20)
The cartesian coordinates of the point P constructed in this ware are
x(t) = (1 − t)3 x0 + 3t(1 − t)2 x1 + 3t2 (1 − t)x2 + t3 x3 (21.21)
y(t) = (1 − t)3 y0 + 3t(1 − t)2 y1 + 3t2 (1 − t)y2 + t3 y3 (21.22)
Bezier cubics in this form are used to describe Postscript fonts. It is left as an
exercise to verify that the splines formed in this way approach the fixed endpoints
P0 and P3 with tangent lines P0 P1 and P2 P3 . Becase they are described by a cubic
parameterization there are a total of eight coefficients (4 for the x and 4 for the y),
which we have described by the coordinates of the points P0 , P1 P2 , P3 . By uniqueness,
there can be only curve that matches our restriction, and so the following derivation
of the Bezier cubics must, of necessity give the same curve. We present the derivation
because the notation, which is different from the derivation given above, is commonly
used to describe Bezier curves in various graphics applications. Suppose we want to
join the two points as show in figure 21.3, given by
P0 = (x0 , y0 ) (21.23)
P1 = (x1 , y1 ) (21.24)
in such a way that the slopes at P0 and P1 are defined in terms of the points Q0 and
Q1 by the vectors P0 Q0 and P1 Q1 , where
Q0 = (x0 + 3α0 , y0 + 3β0 ) (21.25)
Q1 = (x1 − 3α1 , y1 − 3β1 ) (21.26)

Figure 21.3: Alternate derivation of Bezier cubic splines.

(x0+α0, y0+β0)
(x1-α1, y1-β1)
(x0,y0)
(x1,y1)
As before we find a parametric representation of the curve (x(t), y(t) on the interval
t ∈ [0, 1], where
x(0) = x0 x(1) = x1 x0 (0) = 3α0 x0 (1) = 3α1 (21.27)
y(0) = y0 y(1) = y1 y 0 (0) = 3β0 y 0 (1) = 3β1 (21.28)
The factor of 3 in the definitions of the numbers α0 , α1 , β0 and β1 is not used in all
textbooks but is standard in the implementation used in most graphics programs, so
we will abide by it. Let us write the parametric equation for x(t) as
x(t) = A + Bt + Ct2 + Dt3 (21.29)
Differentiating equation 21.29,
x0 (t) = B + 2Ct + 3Dt (21.30)
The boundary conditions at t = 0 give
x(0) = A = x0 (21.31)
x0 (0) = B = 3α0 (21.32)
Substituting 21.31 and 21.32 back into 21.29 and 21.30 and then setting t = 1 gives
x(1) = x0 + 3α0 + C + D = x1 (21.33)
x0 (1) = 3α0 + 2C + 3D = 3α1 (21.34)
Multiplying equation 21.33 by 3,
3x0 + 9α0 + 3C + 3D = 3x1 (21.35)
Subtracting equation 21.34 from 21.35,
3x0 + 6α0 + C = 3x1 − 3α1 (21.36)

Hence
C = 3(x1 − x0 ) − 3(2α0 + α1 ) (21.37)
Multiplying equation 21.33 by 2,
2x0 + 6α0 + 2C + 2D = 2x1 (21.38)
Subtracting equation 21.34 from 21.38,
2x0 + 3α0 − D = 2x1 − 3α1 (21.39)
D = 2(x0 − x1 ) + 3(α0 + α1 ) (21.40)

Therefore
x(t) = x0 + 3α0 t + [3(x1 − x0 ) − 3(2α0 + α1 )]t2 + (21.41)

[2(x0 − x1 ) + 3(α0 + α1 )]t3
y(t) = y0 + 3β0 t + [3(y1 − y0 ) − 3(2β0 + β1 )]t2 + (21.42)
[2(y0 − y1 ) + 3(β0 + β1 )]t3
A variation of Bezier Cubics is used for Postscript fonts, which are defined in terms of
positions of the two points (x0 , y0 ) and (x3 , y3 ) and their handles (x1 , y1 ) and (x2 , y2 )
rather than the derivatives (so it has a slightly different form):
x(t) = (1 − t)3 x0 + 3t(1 − t)2 x1 + 3t2 (1 − t)x2 + t3 x3 (21.43)

y(t) = (1 − t)3 y0 + 3t(1 − t)2 y1 + 3t2 (1 − t)y2 + t3 y3 (21.44)
Bezier curves can be defined of any order, using any number of points. The points
define a sequence of line segments that “pull” the curve towards them, with the Bezier
curve parallel to the first and last segment. The general formula is
n
X n
x(t) = xi (1 − t)n−i ti (21.45)
i
i=0
n
X n
y(t) = yi (1 − t)n−i ti (21.46)
i
i=0
We can generate the Bezier Curve equations for a set of points in Mathematica as
follows.
bezier[points_?ListQ, t_] := Module[{n, bezx, bezy, x, y},
n = Length[points] - 1;
bezx = 0; bezy = 0;
x[i_] := points[[i + 1, 1]];
y[i_] := points[[i + 1, 2]];

Figure 21.4: A. A typical Bezier Curve generated with the 6 points (1, 1.52), (2,
1.94), (3, 1.39), (4, 1.0), (5, 1.54), (6, 1.55).B, C: Rearrangements of the points give
different curves. C: The curve is closed becasue the first and last point are the same.
For[i = 0, i n, i++,
bezx = bezx + Binomial[n, i] * x[i] (1 - t)^(n - i) tî;
bezy = bezy + Binomial[n, i] * y[i] (1 - t)^(n - i) tî;
];
Return[{bezx, bezy}]
];
This function bezier is passed a list of points in cartesian coordinates;
In:=
bezier[{{1, 1}, {1.5, 2}, {4.5, 1.5}, {5.5, .5}}, t]
Out:=
{1 + 4.5*(1 - t)^2*t - t^2 + 13.5*(1 - t)*t^2 + 5.5*t^3,

1 + 6*(1 - t)^2*t - t^2 + 4.5*(1 - t)*t^2 + 0.5*t^3}
We can plot the points, line segments with handles, and Bezier curve with the Math-
ematica function bezierPlot:
bezierPlot[points_, opt___?OptionQ] := Module[{p1, p2, p3, t},

p1 = ListPlot[points,
PlotStyle -> PointSize[0.03], DisplayFunction -> Identity];
p2 = ListPlot[points, PlotJoined -> True, DisplayFunction -> Identity];
p3 = ParametricPlot[bezier[points, t], {t, 0, 1}, DisplayFunction ->

Identity];
Return[Show[{p1, p2, p3}, DisplayFunction -> $DisplayFunction, opt]]
];
Then the command bezierPlot[points] will generate the following plot
2
1.8
1.6
1.4
1.2
2 3 4 5
0.8
0.6
Standard options for Plot, such as PlotRange, Axes, TextStyle, etc, can be used by
bezierPlot. A generalization to higher dimensions is given by Bezier Surfaces, which
were also invented by Pierre Bezier in 1972. The general form of a Bezier Surface is
given in terms of (m + 1)(n + 1) points (x0,0 , y0,0 , z0,0 ), . . . , (xm,n , ym,n , zm,n ) as
n X
m
X n m
x(s, t) = si (1 − s)n−i tj (1 − t)n−j xi,j (21.47)
i j
i=0 j=0
n X
m
X n m
y(s, t) = si (1 − s)n−i tj (1 − t)n−j yi,j (21.48)
i j
i=0 j=0
n X
m
X n m
z(s, t) = si (1 − s)n−i tj (1 − t)n−j zi,j (21.49)
i j
i=0 j=0
where s, t ∈ [0, 1]. This can be implemented in Mathematicaby the following function.
bezierSurface[points_?ListQ, {s_, t_}] :=

Module[{x, y, z, i, j, m, n, bezx, bezy, bezz, bezcoef},
n = Length[points] - 1;
m = Union[Length /@ points];
If[Length[m] 1, Return[$Failed]];
m = First[m] - 1;
x[i_, j_] := points[[i + 1, j + 1, 1]];
y[i_, j_] := points[[i + 1, j + 1, 2]];
z[i_, j_] := points[[i + 1, j + 1, 3]];
bezx = bezy = bezz = 0;

For[i = 0, i n, i++,
For[j = 0, j m, j++,
bezcoef = Binomial[n, i] Binomial[m, j]
(sî)((1-s)^(n-i))(t^j)((1-t)^(m-j));
bezx = bezx + bezcoef* x[i, j];
bezy = bezy + bezcoef * y[i, j];
bezz = bezz + bezcoef * z[i, j] ;
] ] ;
Return[{bezx, bezy, bezz}];
];
Consider the set of points on the corners of a cube given by
(0, 0, 0) (1, 0, 0), (1, 1, 0)
(0, 0, 1) (1, 0, 1), (1, 1, 1)
The Bezier Surface calculated with this algorithm is
x(t) = 2(1 − s)(1 − t)t + 2s(1 − t)t + (1 − s)t2 + st2 (21.50)
y(t) = (1 − s)t2 + st2 (21.51)
z(t) = s(1 − t)2 + 2s(1 − t)t + st2 (21.52)
which can be found in Mathematica via
In:=
data = {{{0,0,0}, {1,0,0}, {1,1,0}},
{{0,0,1}, {1,0,1}, {1,1,1}}};
surface = bezierSurface[data, {s, t}]
Out:=
{2*(1 - s)*(1 - t)*t + 2*s*(1 - t)*t + (1 - s)*t^2 + s*t^2,
(1 - s)*t^2 + s*t^2,
s*(1 - t)^2 + 2*s*(1 - t)*t + s*t^2}
The surface and its generating points are illustrated below. They are produced with
the following commands:
<<Graphics‘Graphics3D‘
dataPlot = ScatterPlot3D[Partition[Flatten[hinge], 3],
PlotStyle -> PointSize[.03]];
surfacepoints = Table[surface, {s, 0, 1, .05}, {t, 0, 1, .05},
DisplayFunction-> Identity];
surfacePlot = ListSurfacePlot3D[surfacepoints,
DisplayFunction-> Identity];
Show[dataPlot, surfacePlot, DisplayFunction-> \$DisplayFunction]

The following data was generated on a fixed (x, y) grid with z−values determined by
a random number generator. It produces a more complicated surface.
(1, 1, 4.63) (1, 2, 4.41) (1, 3, 3.05) (1, 4, 3.76) (1, 5, 2.87) (1, 6, 4.05) (1, 7, 2.81)
(2, 1, 3.31) (2, 2, 2.61) (2, 3, 3.17) (2, 4, 2.47) (2, 5, 4.55) (2, 6, 2.35) (2, 7, 3.)
(3, 1, 3.63) (3, 2, 4.99) (3, 3, 2.21) (3, 4, 3.46) (3, 5, 3.74) (3, 6, 4.62) (3, 7, 2.24)
(4, 1, 3.18) (4, 2, 4.33) (4, 3, 3.98) (4, 4, 2.62) (4, 5, 3.76) (4, 6, 3.28) (4, 7, 2.22)
(5, 1, 4.75) (5, 2, 4.71) (5, 3, 2.47) (5, 4, 3.91) (5, 5, 4.14) (5, 6, 3.54) (5, 7, 5.)
Three different views of the resulting Bezier surface are shown in the following figure.
Points that are blocked by the surface are not shown. The figure on the top left and
on the bottom shown only the bezier surface and the points from different angles.
The figure on the top right also shows a triangulated surface joined by connecting the
points.


Lesson 22
Least Squares
Suppose we have a large set of data points in the xy-plane
{(xi , yi ) : i = 1, 2, ..., n} (22.1)
and we want to find the “best fit” straight line to our data, namely, we want to find
number m and b such that
y = mx + b (22.2)
is the “best” possible line in the sense that it minimizes the total sum-squared vertical
distance between the data points and the line. This process is known as the linear
least-squares problem or linear regression.
The vertical distance between any point (xi , yi ) and the line (see figure 22.1),
which we will denote by di , is
di = |mxi + b − yi | (22.3)
Since this distance is also minimized when its square is minimized, we instead calculate
d2i = (mxi + b − yi )2 (22.4)
The total of all these square-distances (the “sum-squared-distance”) is
n
X n
X
f (m, b) = d2i = (mxi + b − yi )2 (22.5)
i=1 i=1
The only unknowns in this expression are the slope m and y-intercept b. Thus we
have written the expression as a function f (m, b). Our goal is to find the values of m
and b that correspond to the global minimum of f (m, b).
153
154 LESSON 22. LEAST SQUARES
Figure 22.1: The sum of all the vertical distances is minimized in the least-squares
linear fit.
Setting ∂f /∂b = 0 gives
n
∂f ∂ X
0 = = (mxi + b − yi )2 (22.6)
∂b ∂b i=1
n
X
= 2 (mxi + b − yi ) (22.7)
i=1
X n
= 2 (mxi + b − yi ) (22.8)
i=1
Dividing by 2 and separating the three sums
n
X
0 = (mxi + b − yi ) (22.9)
i=1
n
X n
X n
X
= mxi + b− yi (22.10)
i=1 i=1 i=1
n
X n
X
= m xi + nb − yi (22.11)
i=1 i=1

LESSON 22. LEAST SQUARES 155
Defining
n
X
X = xi (22.12)
i=1
n
X
Y = yi (22.13)
i=1
then we have
0 = mX + nb − Y (22.14)
Next, we set ∂f /∂m = 0, which gives
n
∂f ∂ X
0 = = (mxi + b − yi )2 (22.15)
∂m ∂m i=1
n
X
= 2xi (mxi + b − yi ) (22.16)
i=1
X n
= 2 xi (mxi + b − yi ) (22.17)
i=1
Dividing by 2 and separating the three sums as before

n
X
0 = xi (mxi + b − yi ) (22.18)
i=1
Xn n
X n
X
= mx2i + b xi − xi yi (22.19)
i=1 i=1 i=1
n
X n
X
= m x2i + bX − xi y i (22.20)
i=1 i=1
where X is defined in equation 22.12. Next we define,

n
X
A = x2i (22.21)
i=1
Xn
C = xi y i (22.22)
i=1
so that
0 = mA + bX − C (22.23)
Equations 22.14 and 22.23 give us a a system of two linear equations in two
variables m and b. Multiplying equation 22.14 by A and equation 22.23 by X gives
0 = A (mX + nb − Y ) = AXm + Anb − AY (22.24)
0 = X (mA + bX − C) = AXm + X 2 b − CX (22.25)

Subtracting these two equations gives
0 = Anb − AY − X 2 b + CX = b(An − X 2 ) + CX − AY (22.26)
and therefore n n n n
x2i
P P P P
yi −
xi y i xi
AY − CX i=1 i=1 i=1 i=1
b= = n 2 (22.27)
An − X 2 n
P 2
P
n xi − xi
i=1 i=1
If we instead multiply equation 22.14 by X and equation 22.23 by n we obtain
0 = X (mX + nb − Y ) = mX 2 + nXb − Y X (22.28)
0 = n (mA + bX − C) = nAm + nXb − nC (22.29)

Subtracting these two equations,
0 = m X 2 − nA − (Y X − nC)

(22.30)
Solving for m and substituting the definitions of A, C, X and Y , gives

n
P Pn n
P
y i − n xi y i
xi
XY − nC
m= 2 = i=1 i=1
n 2
i=1
(22.31)
X − nA n
xi − n x2i
P P
i=1 i=1
We can summarize the algorithm as follows. To find a best-fit line to a set of n

data points (x1 , y1 ), (x2 , y2 ), ..., (xn , yn ) calculate,
Xn
X= xi (22.32)
Xni=1
Y = yi (22.33)
Xni=1
A= x2i (22.34)
i=1
Xn
C= xi y i (22.35)
i=1
The best fit line is y = mx + b where

XY − nC
m= (22.36)
X 2 − nA
AY − CX
b= (22.37)
An − X 2
Example 22.1. Find the least squares fit to the data (3, 2), (4,3), (5,4), (6,4) and
(7,5).

Solution. First we calculate the numbers X, Y , A, and C,

Xn
X = xi = 3 + 4 + 5 + 6 + 7 = 25 (22.38)
Xi=1n
Y = yi = 2 + 3 + 4 + 4 + 5 = 18 (22.39)
Xi=1n
A = x2i = 9 + 16 + 25 + 36 + 49 = 135 (22.40)
i=1
Xn
C = xi yi = (3)(2) + (4)(3) + (5)(4) + (6)(4) + (7)(5)
i=1
= 97 (22.41)
Therefore
XY − nC (25)(18) − (5)(97) 450 − 485 −35
m= 2
= 2
= = = 0.7 (22.42)
X − nA (25) − 5(135) 625 − 675 −50
and
AY − CX (135)(18) − (97)(25) 2430 − 2425 5
b= 2
= 2
= = = 0.1 (22.43)
An − X (135)(5) − 25 50 50
So the best fit line is y = 0.7x + 0.1

We can generalize the least squares problem to any order polynomial. Suppose
we want to fit a polynomial of degree n − 1 to m points (x1 , y1 ), . . . , (xm , ym ). Denote
the polynomial by
P (x) = c1 + c2 x + · · · + cn xn−1 (22.44)
If we were to attempt to fit the data exactly this would give us the system of equations
c1 + c2 x1 + · · · + cn xn−1
n = y1 (22.45)
..
. (22.46)
n−1
c1 + c2 x m + · · · + cn x m = y m (22.47)
If m = n, which is usually not the case, we could solve this equation exactly, by
writing     
1 x1 x21 · · · xn−11 c1 y1
 1 x2 x2 n−1  
2 x 2 c
  2   y2 
  
  ..  =  ..  (22.48)

 ..
.  .   . 
1 xm x2m · · · xn−1m cm ym
or more simply,
Ac = y (22.49)
The solution is of course
c = A−1 y (22.50)

However, this is not usually a good idea. Even if we had m = n, when the number
of data points is relatively large such a curve would be extremely over-fit, giving a
“kink” or “bump” for each point. This is not usually a good approximation to the
data. Furthermore, when we have m > n the equation is not even solvable. In general
it is better to look for some sort of solution to
Ac ≈ y (22.51)
in the sense that the two sides of the equation match one-another in some sort of
minimized least squares sense. In other words, we want to find the “best fit” linear
combination
Ac = c1 a1 + c2 a2 + · · · + cn an (22.52)
where each aj is the j th column of the matrix A,
j−1 T
aj = xj−1 j−1

1 x 2 · · · x m (22.53)
We denote the pth component of ai as
aip = Api = AT
ip (22.54)
We define the best fit as the set of values c1 , . . . , cn that minimizes the distance
min |y − Ac| (22.55)
c1 ,...,cn
Define the residual error in fitting the ith data point

X
ri = cj aji − yi (22.56)
j
As with linear least squares, we will minimize the sum-square residual error
" #2
X X
= cj ATji − yi (22.57)
i j
by setting the partial derivatives equal to zero:

" #2
∂ X X
0= cj ATji − yi (22.58)
∂cp i j
for p = 1, . . . , n. Expanding gives us

"
∂ XXX XX
0= cj ATji ck ATki − cj ATji yi
∂cp i j k i j
#
XX X
− yi cj ATji + yi2 (22.59)
i j i
XXX ∂ XX
= ATji ATki (cj ck ) − 2 yi ATji δjp (22.60)
i j k
∂cp i j

Since
∂ ∂ ∂
(cj ck ) = cj ck + ck cj (22.61)
∂cp ∂cp ∂cp
= cj δpk + ck δpj (22.62)
Equation 22.60 gives
XXX X
0= ATji ATki (cj δpk + ck δpj ) − 2 yi ATpi (22.63)
i j k i
XXX XXX
= ATji ATki cj δpk + ATji ATki ck δpj − 2(AT y)p (22.64)
i j k i j k
XX XX
= ATji ATpi cj + ATpi ATki ck − 2(AT y)p (22.65)
i j i k
X X X X
= ATpi ATji cj + ATpi ATki ck − 2(AT y)p (22.66)
i j i k
X X X X
= ATpi Aij cj + ATpi Aik ck − 2(AT y)p (22.67)
i j i k
X X
= ATpi (Ac)i + ATpi (Ac)i − 2(AT y)p (22.68)
i i
= 2(AT Ac)p − 2(A y)p T
(22.69)
Hence c is the solution of the linear equation
AT Ac = AT y (22.70)
Formally, the least squares solution is given exactly by
c = (AT A)−1 AT y (22.71)

In practice, it turns out that there are generally better ways to solve the system 22.70
than calculation of the inverse matrix, which we will discuss subsequently.
As an example to illustrate that our earlier calculation of the linear least squares
fit falls out of equation 22.71 we note that for linear data,
 
1 x1
1 x 2 
A =  .. ..  (22.72)
 
. . 
1 xn
 
1 x1
1 1 · · · 1 1 x 2 
  P
T n P x2
A A=  .. ..  = P (22.73)
x1 x2 · · · xn  . .  x x
1 xn

Define X X 2
∆=n x2 − x (22.74)
Then P 2 P
T −1 1 x − x
(A A) = P (22.75)
∆ − x n
Hence eq. 22.71 gives
 
y
 1
· · · 1  y2 
P 2 P
1 x − x 1 1
c= (22.76)
· · · xn  ... 
P
∆ − x n x1 x2
 
yn
P 2 P P
1 x − x
= P P y (22.77)
∆ − x n xy
P 2 P P P
1 x y − xP xy
= P P (22.78)
∆ − x y + n xy
and therefore
x2 y − x xy
P P P P
b = c1 = (22.79)
n x2 − ( x)2
P P
P P P
− x y + n xy
m = c2 = (22.80)
n x2 − ( x)2
P P
which is identical to equation 22.36.

Lesson 23
Numerical Differentiation
Recall the definition of a derivative
df (x) f (x + h) − f (x)
f 0 (x) = = lim (23.1)
dx h→0 h
If we choose h sufficiently small, then we can approximate the derivative by
f (x + h) − f (x)
f 0 (x) ≈ (23.2)
h
If we represent the function by a table of numbers , f0 = f (x0 ), . . . , fn = f (xn ) the
for fixed values of h
f (xi + h) − f (xi ) fi+1 − fi
f 0 (xi ) ≈ = , i = 0, 1, ..., n − 1 (23.3)
h h
To find an upper bound on the error, we use Taylor’s theorem:
1 1 1
f (x+h) = f (x)+hf 0 (x)+ h2 f 00 (x)+· · ·+ hn f (n) (x)+ hn+1 f (n+1) (c) (23.4)
2 n! (n + 1)!
where c is some unknown number between x and xh The Taylor formula for n = 1 is
1
f (x + h) = f (x) + hf 0 (x) + h2 f 00 (c) (23.5)
2
Thus
1
hf 0 (x) = f (x + h) − f (x) − h2 f 00 (c) (23.6)
2
and dividing by h
f (x + h) − f (x) 1 00
f 0 (x) = − hf (c) (23.7)
h 2
The first term gives precisely the same thing as equation 23.1; the second term
gives the error. Thus we have the following approximation formulas depending upon
whether h > 0 or h < 0. The Forward Difference Formula is
1 1
fi0 ≈ (fi+1 − fi ) − hf 00 (c) (23.8)
h 2
161
162 LESSON 23. NUMERICAL DIFFERENTIATION
and the Backward Difference Formula is

1 1
fi0 ≈ (fi − fi−1 ) + hf 00 (c) (23.9)
h 2
We can get a better approximation if we observe that
1 1
f (x + h) = f (x) + hf 0 (x) + h2 f 00 (x) + h3 f (3) (c) (23.10)
2 3!
where x ≤ c ≤ x + h and
1 1
f (x − h) = f (x) − hf 0 (x) + h2 f 00 (x) + (−h)3 f (3) (c1 ) (23.11)
2 3!
where x − h ≤ c1 ≤ x. Subtracting 23.11 from 23.10
1
f (x + h) − f (x − h) = 2hf 0 (x) + h3 (f (3) (c) + f (3) (c1 )) (23.12)
6
Solving for f 0 (x) gives
f (x + h) − f (x − h) 1
f 0 (x) = − h2 (f (3) (c) + f (3) (c1 )) (23.13)
2h 12
By the intermediate value theorem there is some number ξ, c1 ≤ ξ ≤ c, such that
1
f (3) (ξ) = (f (3) (c) + f (3) (c1 )) (23.14)
2
Using 23.14 in 23.13 gives us The Central Difference Formula
f (x + h) − f (x − h) 1
f 0 (x) = − h2 f (3) (ξ) (23.15)
2h 16
Example 23.1. Compare the forward difference, central difference, and backward
difference methods for the following data.
x = 1.1 x = 1.2 x = 1.3 x = 1.4
f (x) = 9.025 f (x) = 11.023 f (x) = 13.464 f (x) = 16.645
Solution. According to the forward difference formula:
f (1.2) − f (1.1) 11.023 − 9.025
f 0 (1.1) = = = 19.98 (23.16)
0.1 .1
f (1.3) − f (1.2) 13.464 − 11.023
f 0 (1.2) = = = 24.41 (23.17)
0.1 0.1
f (1.4) − f (1.3) 16.645 − 13.464
f 0 (1.3) = = = 31.81 (23.18)
0.1 0.1

LESSON 23. NUMERICAL DIFFERENTIATION 163
Figure 23.1: Points used in the calculation of the derivative at (x1 , f (x1 )) for the
forward difference formula (top left0; the backward difference formula (top right) and
the central difference formula (bottom). The slope of the line joining the points shown
is used to approximate the derivative. The actual tangent is indicated by a dashed
line.
Hx1 , f Hx1 LL Hx1 , f Hx1 LL
Hx0 , f Hx0 LL Hx0 , f Hx0 LL

8x2 , fHx2 L< 8x2 , fHx2 L<
Hx1 , f Hx1 LL
Hx0 , f Hx0 LL
8x2 , fHx2 L<
According to the backward difference formula:

f (1.2) − f (1.1) 11.023 − 9.025
f 0 (1.2) = = = 19.98 (23.19)
0.1 .1
f (1.3) − f (1.2) 13.464 − 11.023
f 0 (1.3) = = = 24.41 (23.20)
0.1 0.1
f (1.4) − f (1.3) 16.645 − 13.464
f 0 (1.4) = = = 31.81 (23.21)
0.1 0.1
According to the central difference formula:
f (1.3) − f (1.1) 13.464 − 9.025
f 0 (1.2) = = = 22.195 (23.22)
.2 .2
f (1.4) − f (1.2) 16.645 − 11.023
f 0 (1.3) = = = 28.110 (23.23)
.2 .2
It is instructive to compare the results tabularly, as follows; we see immediately that
the backward difference calculations are identical to the forward difference calculations
but shifted one step to the right.
x = 1.1 x = 1.2 x = 1.3 x = 1.4
FD 19.98 24.41 34.81
BD 19.98 24.41 34.81
CD 22.195 28.110

There is no forward difference value for the right endpoint, no backward difference
value for the left endpoint, and no central difference value for either endpoint.
To get a more accurate number for the derivative, we can find the derivative of an
interpolating polynomial. We start with the Lagrange interpolating polynomial with
error term: n n
X 1 Y
f (x) = fk Lk (x) + (x − xk )f (n+1) (c) (23.24)
k=0
(n + 1)! k=0
where n
Y (x − xj )
Lk = (23.25)
j=0,j6=k
(xk − xj )
and c is some point in [a, b]. Differentiating,

( n n
)
d X 1 Y
f 0 (x) = fk Lk (x) + f (n+1) (c(x)) (x − xk ) (23.26)
dx k=0 (n + 1)! k=0
n
( n
)
X
0 1 d (n+1)
Y
= fk Lk (x) + f (c(x)) (x − xk ) (23.27)
k=0
(n + 1)! dx k=0
n n
X 1 d (n+1) Y
= fk L0k (x) + f (c(x)) (x − xk )+
k=0
(n + 1)! dx
( n
) k=0
1 d Y
f (n+1) (c(x)) (x − xk ) (23.28)
(n + 1)! dx k=0
At any grid point xi ,

n n
0
X 1 d (n+1) Y
fk L0k (xi )

f (xi ) = + f (c(x))
(xi − xk )+
k=0
(n + 1)!
dx x=xi k=0
( n
)
1 (n+1) d Y
f (c(xi )) (x − xk ) (23.29)

(n + 1)! dx k=0

x=xi
Notice that the product in the middle term has a factor of (xi − xi ) and is therefore
zero. Thus
n
( n
)
X 1 d Y
f 0 (xi ) = fk L0k (xi ) + f (n+1) (c(xi )) (x − xk ) (23.30)

k=0
(n + 1)! dx k=0

x=xi
By the product rule for differentiation,

n n n
d Y X Y
(x − xk ) = (x − xj ) (23.31)
dx k=0 k=0 j=0,j6=k

and therefore at a grid point

n n n
d Y X Y
(x − xk ) = (xi − xj ) (23.32)

dx k=0
k=0 j=0,j6=k
xi
Every term in the summation except for the k = i term will have a factor xi − xi in
the product; therefore the only nonzero term is for k = i.
n
n
d Y Y
(x − xk ) = (xi − xj ) (23.33)

dx
k=0 j=0,j6=i
xi
Renaming the index on right from j to k gives

n
n
d Y Y
(x − xk ) = (xi − xk ) (23.34)

dx
k=0 k=0,k6=i
xi
Substitution back into equation 23.30 yields the n + 1-point approximation for-
mula,
n n
X 1 Y
f 0 (xi ) = fk L0k (xi ) + f (n+1) (c(xi )) (xi − xk ) (23.35)
k=0
(n + 1)! k=0,k6=i
The first term gives the approximation and the second gives an error formula.
A two-point approximation is obtained when n = 1. The Lagrange Polynomials
are
x − x1 x − x0
L0 = , L1 = (23.36)
x0 − x1 x1 − x0
Hence
1 1
L00 = , L01 = (23.37)
x0 − x1 x1 − x0
Therefore
f0 f1 f1 − f0
f 0 (xi ) ≈ f0 L00 (xi ) + f1 L01 (xi ) = + = (23.38)
x0 − x1 x1 − x0 x 1 − x0
This is precisely the forward difference formula.
When n = 2 we obtain the 3-point formulas. The Lagrange functions are
(x − x1 )(x − x2 ) x2 − (x1 + x2 )x + x1 x2
L0 (x) = = (23.39)
(x0 − x1 )(x0 − x2 ) (x0 − x1 )(x0 − x2 )
(x − x0 )(x − x2 ) x2 − (x0 + x2 )x + x0 x2
L1 (x) = = (23.40)
(x1 − x0 )(x1 − x2 ) (x1 − x0 )(x1 − x2 )

(x − x0 )(x − x1 ) x2 − (x0 + x1 )x + x0 x1
L2 (x) = = (23.41)
(x2 − x0 )(x2 − x1 ) (x2 − x0 )(x2 − x1 )
Hence
2x − (x1 + x2 )
L00 (x) = (23.42)
(x0 − x1 )(x0 − x2 )
2x − (x0 + x2 )
L01 (x) = (23.43)
(x1 − x0 )(x1 − x2 )
2x − (x0 + x1 )
L02 (x) = (23.44)
(x2 − x0 )(x2 − x1 )
and therefore
f 0 (xi ) ≈ f0 L00 (xi ) + f1 L01 (xi ) + f2 L02 (xi ) (23.45)

2xi − (x1 + x2 ) 2xi − (x0 + x2 )
= f0 + f1
(x0 − x1 )(x0 − x2 ) (x1 − x0 )(x1 − x2 )
2xi − (x0 + x1 )
+f2 (23.46)
(x2 − x0 )(x2 − x1 )
If the grid points are equally spaced then
2xi − (x1 + x2 ) 2xi − (x0 + x2 ) 2xi − (x0 + x1 )

f 0 (xi ) = f0 + f1 + f2 (23.47)
(−h)(−2h) (h)(−h) (2h)(h)
2xi − (x1 + x2 ) 2xi − (x0 + x2 ) 2xi − (x0 + x1 )
= f0 2
− f1 2
+ f2 (23.48)
2h h 2h2
At each of the points
2x0 − (x1 + x2 ) 2x0 − (x0 + x2 ) 2x0 − (x0 + x1 )

f 0 (x0 ) = f0 2
− f1 2
+ f2 (23.49)
2h h 2h2
2x0 − (x0 + h + x0 + 2h) 2x0 − (x0 + x0 + 2h)
= f0 2
− f1 (23.50)
2h h2
2x0 − (x0 + x0 + h)
+ f2 (23.51)
2h2
(−3h) (−2h) (−h) 1 3 1
= f0 − f1 + f2 = − f0 + 2f1 − f2 (23.52)
2h2 h2 2h2 h 2 2

2x1 − (x1 + x2 ) 2x1 − (x0 + x2 ) 2x1 − (x0 + x1 )

f 0 (x1 ) = f0 2
− f1 2
+ f2 (23.53)
2h h 2h2
2(x0 + h) − (x0 + h + x0 + 2h) 2(x0 + h) − (x0 + x0 + 2h)
= f0 2
− f1
2h h2
2(x0 + h) − (x0 + x0 + h)
+ f2 (23.54)
2h2
(−h) (h)
= f0 2
+ f2 2 (23.55)
2h 2h
1 1 1
= − f0 + f2 (23.56)
h 2 2
2x2 − (x1 + x2 ) 2x2 − (x0 + x2 ) 2x2 − (x0 + x1 )

f 0 (x2 ) = f0 2
− f1 2
+ f2 (23.57)
2h h 2h2
2(x0 + 2h) − (x0 + h + x0 + 2h) 2(x0 + 2h) − (x0 + x0 + 2h)
= f0 2
− f1
2h h2
2(x0 + 2h) − (x0 + x0 + h)
+ f2 (23.58)
2h2
h 2h 3h 1 1 3
= f0 2 − f1 2 + f2 2 = f0 − 2f1 + f2 (23.59)
2h h 2h h 2 2
This gives us the general three-point formula
1
fi0 =
(−3fi + 4fi+1 − fi+2 ) (23.60)
2h
1
fi0 = (−fi−1 + fi+1 ) (23.61)
2h
1
fi0 = (fi−2 − 4fi−1 + 3fi ) (23.62)
2h
The middle formula returns the central difference formula; the first and third formulas
give us a method to extend the technique to the endpoints.
Another method for deriving formulas for the derivative is to use the Taylor series,
1
f (x + h) = f (x) + hf 0 (x) + h2 f 00 (x) + · · ·
2
1 n (n) 1
+ h f (x) + hn+1 f (n+1) (c) (23.63)
n! (n + 1)!
for some c between x and x + h. In particular, this method can be used to derive
approximation formulas for higher order derivatives. We will illustrate this process
by deriving an approximation formula for f 00 . Letting x = x0 in the Taylor series
23.63
1 1 (n) 1
f1 = f0 + hf00 + h2 f 00 0 + · · · + hn f0 + hn+1 f (n+1) (c1 ) (23.64)
2 n! (n + 1)!

where c1 ∈ [x0 , x0 + h]. Next, let x−1 = x − h,

1 2 00 1 (n)
f−1 = f0 − hf00 + h f0 + · · · + (−h)n f0 (x) (23.65)
2! n!
1
+ (−hn+1 )f (n+1) (c−1 )
(n + 1)!
for some c−1 ∈ [x0 − h, x0 ]. Adding equations 23.64 and 23.65 gives

1 2 00 1 4 (4) 1 n (n)
f1 + f−1 = 2 f0 + h f 0 + h f0 + · · · + h f0
2! 4! n!
n+1
h
f (n+1) (c1 ) + (−1)n+1 f (n+1) (c−1 )

+ (23.66)
(n + 1)!
where n is even. If n is odd the term in the square brackets terminates at the hn−1
term instead of the hn term. For example, if n = 3,
h4 (4)

1 2 00
f (c1 ) + (−1)4 f (4) (c−1 )

f1 + f−1 = 2 f0 + h f 0 + (23.67)
2! 4!
1
= 2f0 + h2 f000 + h4 f (4) (c1 ) + f (4) (c−1 )

(23.68)
24
Solving for f000 ,
1 1
f 00 0 = [f1 − 2f0 + f−1 ] − h2 f (4) (c1 ) + f (4) (c−1 )

2
(23.69)
h 24
By the intermediate value theorem, since [f (4) (c1 ) + f (4) (c−1 )]/2 is between f (4) (c1 )
and f (4) (c−1 ), then (assume f (4) is continuous in [a, b] then there is some number
c0 ∈ [c−1 , c1 ] such that
(4) 1 (4) (4)

f0 (c0 ) = [f0 (c1 ) + f0 (c−1 )] (23.70)
2
Hence
1 1
f 00 0 = 2
[f1 − 2f0 + f−1 ] − h2 f (4) (c0 ) (23.71)
h 12
Replacing x0 with xk ,
1 1
f 00 k = 2
[fk+1 − 2fk + fk−1 ] − h2 f (4) (c0 ) (23.72)
h 12

Lesson 24
Richardson Extrapolation
Richardson extrapolation gives a method to “accelerate” the convergence of a se-

quence; in other words, if we have a method that converges as O(hm ), where h is
some small parameter (e.g., the error in the calculation is proportional to hm ) it can
produce a method that converges as O(hm+1 ) (e.g., the error in the calculation is
proportional to hm+1 ). This reduces the error by a factor of h, and can be used
iteratively to produce methods that converge at any rate we want. It was developed
by Lewis Fry Richardson (1891-1953).
Suppose we have an approximation formula for x that depends on some parameter
h, such as the step size. We will write this as N (h) plus higher order terms:
x = N (h) + c1 h + c2 h2 + c3 h3 + · · · (24.1)
where x is the exact value and N (h) is the approximate value. Equation 24.1 says
that N (h) is accurate to O(h). The terms with higher order in h are the error terms.
Then we can write
x = N (h/2) + c1 h/2 + c2 h2 /4 + c3 h3 /8 + · · · (24.2)
Multiplying equation 24.2 gives
2x = 2N (h/2) + c1 h + c2 h2 /2 + c3 h3 /4 + · · · (24.3)
Subtracting equation 24.1 from 24.3 gives
1 1
x = 2N (h/2) + c1 h + c2 h2 + c3 h3 + · · ·
2 4
−(N (h) + c1 h + c2 h2 + c3 h3 + · · · ) (24.4)
1 3
= 2N (h/2) − N (h) − c2 h2 − c3 h3 (24.5)
2 4
If we define the function N2 (h) by
N (h/2) − N (h)
N2 (h) = N (h/2) + (24.6)
22−1 − 1
169
170 LESSON 24. RICHARDSON EXTRAPOLATION
then
1 3
x = N2 (h) − c2 h2 − c3 h3 + · · · (24.7)
2 4
which tells us that N2 approximates x to O(h2 ). Repeating the process,
1 3
x = N2 (h/2) − c2 (h/2)2 − c3 (h/2)3 + · · · (24.8)
2 4
1 2 3
= N2 (h/2) − c2 h − c3 h3 − · · · (24.9)
8 32
Multiplying by 4,
1 3
4x = 4N2 (h/2) − c2 h2 − c3 h3 − · · · (24.10)
2 8
Subtracting equation 24.7 from 24.10 gives

1 2 3 3
3x = 4N2 (h/2) − c2 h − c3 h − · · ·
2 8

1 2 3 3
− N2 (h) − c2 h − c3 h − · · · (24.11)
2 4
3
= 4N2 (h/2) − N2 (h) + c3 h3 + · · · (24.12)
8
Solving for x,
4N2 (h/2) − N2 (h) 1 3
x= + c3 h + · · · (24.13)
3 8
It is convenient to rewrite equation 24.13 as
N2 (h/2) − N2 (h) 1 3
x = N2 (h/2) + + c3 h + · · · (24.14)
3 8
Let us define
N2 (h/2) − N2 (h)
N3 (h) = N2 (h/2) + (24.15)
23−1 − 1
Then
1
x = N3 (h) + c3 h3 + · · · (24.16)
8
Hence N3 (h) is accurate to O(h3 ). In general we define
Nj−1 (h/2) − Nj−1 (h)

Nj (h) = Nj−1 (h/2) + (24.17)
2j−1 − 1
which gives
x = Nj (h) + O(hj ) (24.18)

LESSON 24. RICHARDSON EXTRAPOLATION 171
Equation 24.17 is called Richardson Extrapolation. It takes a method that is

O(hm ) and gives a method that is O(hm+1 ). The method is sometimes summarized
as:
Better Estimate = More Accurate

1
+ j−1 (More Accurate − Less Accurate) (24.19)
2 −1
We next illustrate how Richardson extrapolation can be used to produce differentia-
tion formulas. We start with the forward difference formula,
1
f00 ≈ (f1 − f0 ) + O(h) (24.20)
h
Doubling the step size,
1
f00 = (f2 − f0 ) + O(h) (24.21)
2h
Define
1 1
N (h) = (f2 − f0 ) = (f (x0 + 2h) − f (x0 )) (24.22)
2h 2h
1 1
N (h/2) = (f (x0 + h) − f (x0 )) = (f1 − f0 ) (24.23)
h h
and therefore,
N (h/2) − N (h)
N2 (h) = N (h/2) + (24.24)
22−1 − 1
1 1 1
= (f1 − f0 ) + (f1 − f0 ) − (f2 − f0 ) (24.25)
h h 2h
1
= (2f1 − 2f0 + 2f1 − 2f0 − f2 + f0 ) (24.26)
2h
1
= (−3f0 + 4f1 − f2 ) (24.27)
2h
This is the same formula we found by using Taylor series.
To get a higher order approximation we would have to split the step size again.
Since we don’t have any data more finely grained that at intervals of h, we use the
following trick: go back to the beginning and start with 4h, then 2h, then h. Starting
with a centered difference with a step size of 4h,
1 1
N (h) = (f4 − f0 ) = (f (x0 + 4h) − f0 ) (24.28)
4h 4h
1 1
N (h/2) = (f (x0 + 2h) − f0 ) = (f2 − f0 ) (24.29)
2h 2h
N2 (h) = N (h/2) + [N (h/2) − N (h)] (24.30)

1 1 1
= (f2 − f0 ) + (f2 − f0 ) − (f4 − f0 ) (24.31)
2h 2h 4h
1
= (2f2 − 2f0 + 2f2 − 2f0 − f4 + f0 ) (24.32)
4h
1
= (−3f0 + 4f2 − f4 ) (24.33)
4h
Continuing the process,
1
N2 (h/2) = (−3f0 + 4f1 − f2 ) (24.34)
2h
hence
1
N3 = N2 (h/2) + [N2 (h/2) − N2 (h)] (24.35)
3
1
= (−3f0 + 4f1 − f2 ) +
2h
1 1 1
(−3f0 + 4f1 − f2 ) − (−3f0 + 4f2 − f4 ) (24.36)
3 2h 4h
1
= (−3f0 + 4f1 − f2 ) +
2h
1 1
(−3f0 + 4f1 − f2 ) − (−3f0 + 4f2 − f4 ) (24.37)
6h 12h
1
= [6(−3f0 + 4f1 − f2 ) + 2(−3f0 + 4f1 − f2 )
12h
−(−3f0 + 4f2 − f4 )] (24.38)
1
= (−21f0 + 32f1 − 12f2 + f4 ) (24.39)
12h
and so forth, so that
1
f0 = (−21f0 + 32f1 − 12f2 + f4 ) + O(h3 ) (24.40)
12h
Example 24.1. Use a centered difference and Richardson Extrapolation to determine

f 0 (x) for the function f (x) = x + ex at the origin (x=0) using h = 0.4 through N3 (0.4)
Solution. The centered difference formula is
f (x + h) − f (x − h)
f 0 (x) = = N1 (h) (24.41)
2h
LESSON 24. RICHARDSON EXTRAPOLATION 173
Letting x = 0 and h = .4,
N1 (.4) = 2.02688 (24.42)

N1 (.2) = 2.00668 (24.43)
N2 (.4) = N1 (.2) + [N1 (.2) − N1 (.4)] (24.44)
= 2.00668 + 2.00668 − 2.02688 (24.45)
= 1.98648 (24.46)
N1 (.1) = 2.00167 (24.47)
N2 (.2) = N1 (.1) + [N1 (.1) − N1 (.2)] (24.48)
= 2.00167 + 2.00167 − 2.00668 (24.49)
= 1.99665 (24.50)
N2 (.2) − N2 (.4)
N3 (.4) = N2 (.2) + (24.51)
3
1.99665 − 1.98648
= 1.99665 + (24.52)
3
= 2.00004 (24.53)
Observe in this example that we had to obtain information in the following table:
N1 (.4)
N1 (.2) N2 (.4)
N1 (.1) N2 (.2) N3 (.4)
N1 (.05) N2 (.1) N3 (.2) N4 (.4)
.. .. .. .. ...
. . . .
In other words, to get any item in the table requires knowledge of everything above and
to the left of it in the table. This is true in general in using Richardson Extrapolation.


Lesson 25
Numerical Integration
The simplest method for numerical integration is a direct implementation of Riemann

Sums. If you know the function values f0 , f1 , ..., fn at the points a = x0 , x1 , ..., xn = b
then one form for the Riemann Sum is
Z b n−1
X
f (x)dx ≈ f (xi )(xi+1 − xi ) (25.1)
a i=0
i.e., the areas of rectangles whose upper left hand corners touch the curve of f (x) We
can write a similar formula down using the right hand corners:
Z b n
X
f (x)dx ≈ f (xi )(xi − xi−1 ) (25.2)
a i=i
If the points are equally spaced then
xi = x0 + ih (25.3)
so that we have
Z b n−1
X n
X
f (x)dx ≈ h f (xi ) ≈ h f (xi ) (25.4)
a i=0 i=1
Alternatively, we can calculate the area using boxes that cross the curve. For
example, if we know the function at three points (x0 , f0 ), (x1 , f1 ), and (x2 , f2 ) where
x1 = x0 + h and x2 = x0 + 2h then we can approximate the area under the curve of
f (x) on [x0 , x2 ] by a box whose width is x2 − x0 = 2h and whose height is f (x1 ):
Z x2
f (x)dx = 2hf (x1 ) (25.5)
x0
175
176 LESSON 25. NUMERICAL INTEGRATION
Figure 25.1: Calculation of an integral as the area under the curve can be apprxi-
mated with vertical rectangles. Top row, left: upper left hand corner of rectangles
fit to curve. Right: Upper right hand corner of each rectangle is fit to the curve.
Bottom row., left: Midpoint of top of each rectangle is fit to the curve. Right: in the
trapezoidal rule, the rectangles are replaced by trapezoids whose tops fit the function
at both upper corners.
a b a b
a b a b
To get the area over the entire interval [a, b], where a = x0 < x1 < x2 < · · · < xn = b,
and n is assumed to be even, we obtaine the Composite Midpoint Rule,
Z b Z x2 Z x4
f (x)dx = f (x)dx + f (x)dx + · · ·
a x0 x2
Z xn−2 Z xn
+ f (x)dx + f (x)dx (25.6)
xn−4 xn−2
≈ 2hf1 + 2hf3 + 2hf5 + · · · + 2hfn−3 + 2hfn−1 (25.7)
= 2h(f1 + f3 + · · · fn−1 ) (25.8)
R 10
Example 25.1. Find 0
x2 e−x dx using n = 4 and n = 10. Compare your result
with the exact integral.
Solution. The exact solution is

Z 10
10
x2 e−x dx = e−x −2 − 2x − x2 0

(25.9)
0
= 2 − 122e−10 ≈ 1.99446 (25.10)

LESSON 25. NUMERICAL INTEGRATION 177
x2 e−x with the midpoint rule for n = 4 and n = 10.

R
Figure 25.2: Approximation of
See example 25.1.
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0. 0.
. 0. 2.5 5. 7.5 10. 0 1 2 3 4 5 6 7 8 9 10
For n = 4 we have h = 10/4 = 2.5. Let f (x) = x2 e−x . Then
f1 = f (x1 ) = f (h) = f (2.5) = 2.52 e−2.5 ≈ 0.513031 (25.11)

f3 = f (x3 ) = f (3h) = f (7.5) = 7.52 e−7.5 ≈ 0.031111 (25.12)
hence
Z 10
x2 e−x dx = 2h[f1 + f3 ] (25.13)
0
≈ 2(2.5)(0.513031 + 0.031111) (25.14)
≈ 2.72072 (25.15)
The relative error is

2.72071 − 1.99446
= = 36% (25.16)
1.99446
For n = 10 we have h = 1 so that we need to calculate f − 1, f3 , f5 , f7 and f9 .
f1 = f (x1 ) = f (1) = (1)2 e−1 ≈ 0.367879 (25.17)

f3 = f (x3 ) = f (3) = (3)2 e−3 ≈ 0.448084 (25.18)
f5 = f (x5 ) = f (5) = (5)2 e−5 ≈ 0.168449 (25.19)
f7 = f (x7 ) = f (7) = (7)2 e−7 ≈ 0.044682 (25.20)
f9 = f (x9 ) = f (9) = (9)2 e−9 ≈ 0.0099996 (25.21)
hence
Z 10
f (x)dx = 2h[f1 + f3 + f5 + f7 + f9 ] (25.22)
0
≈ 2(0.367879 + 0.48084 + 0.168449
+ 0.044682 + 0.0099996) (25.23)
≈ 2.07818 (25.24)

This gives a relative error of

2.07818 − 1.99f 446
≈ ≈ 5.03% (25.25)
1.99446
This example illustrates how we can decrease the error by decreasing the step size.
In general, the relative error will decrease as we increase the number of intervals,
and, correspoondingly, decrease the step size. The relative error in ths method as a
function of number of intervals is plotted for n as large as 1000 in figure 25.3.
Figure 25.3: The relative error as a function of number of intervals for the integral
solved in example 25.1.
100 %
10 %
1%
0.1 %
0.01%
0.001%
0.0001%
0.00001%
5 10 50 100 500 1000
Since the midpoint rule does not use all the information that we know about the
function, we could modify it by interpreting the xi as the center of rectangles of width
h rather than treating the odd-numbered xi as centers of rectangles of 2h. The first
and last points x0 and xn become the left- and right-hand ends of rectangles of width
h/2 in this scheme,
Z b Z x0 +h/2 n−1 Z
X xi +h/2
f (x)dx = f (x)dx + f (x)dx
a x0 i=1 xi −h/2
Z xn
+ f (x)dx (25.26)
xn −h/2
n−1
h X h
≈ f0 + hfi + fn (25.27)
2 i=1
2

The resulting Composite Trapezoidal Rule is
Z b
h
f (x)dx ≈ [f0 + 2f1 + 2f2 + · · · + 2fn−2 + 2fn−1 + fn (25.28)
a 2
Example 25.2. Repeat example 25.1 using the composite Trapezoidal rule with h =
2.5.
x2 e−x using the trapezoidal rule as illustrated in

R
Figure 25.4: Approximation of
example 25.2.
0.5
0.4
0.3
0.2
0.1
0.
0. 2.5 5. 7.5 10.
Solution. Since h = 2.5, then n = (b − a)/h = 4. According to equation 25.28,
Z 10
h
f (x)dx = [f0 + 2f1 + 2f2 + 2f3 + f4 ] (25.29)
0 2
2.5 2 −0
= 0 e + 2(2.52 e−2.5 ) + 2(52 e−5 )
2
+ 2(7.52 e−7.5 ) + 102 e−10

(25.30)
= 1.25[2 + 2(0.513031) + 2(0.168449) + 2(0.031111) + 0.00454] (25.31)
= 1.78715 (25.32)
The relative error is

1.78715 − 1.99446
= = 11.6% (25.33)
1.99446

Figure 25.5: Relative error for trapezoidal method (orange); midpoint method (red);
and Simpson’s method (blue) for various step sizes.
100 %
10 %
1%
0.1 %
0.01%
0.001%
0.0001%
0.00001%
5 10 50 100 500 1000
Simpson’s rule is derived by fitting a quadratic to three equally spaced points. Let
p(x) = A(x − x0 )2 + B(x − x1 ) + C (25.34)
be a quadratic that passes through the three points (x0 , f0 ), (x1 , f1 ), and (x2 , f2 ),
where x1 = x0 + h, and x2 = x0 + 2h.
f0 = p(x0 ) = B(x0 − x1 ) + C = B(−h) + C (25.35)

= −Bh + C (25.36)
f1 = A(x1 − x0 )2 + C (25.37)
= Ah2 + C (25.38)
f2 = A(x2 − x0 )2 + B(x2 − x1 ) + C = A(2h)2 + Bh + C (25.39)
= 4Ah2 + Bh + C (25.40)
Adding equations 25.36 and 25.40
f0 + f2 = 4Ah2 + 2C (25.41)
while multiplying equation 25.38 by 2,
2f1 = 2Ah2 + 2C (25.42)
Subtracting equation 25.42 from equation 25.41,
2Ah2 = f0 − 2f1 + f2 (25.43)

Multiplying equation 25.38 by 2 and substituting 25.43,
2C = 2f1 − 2Ah2 = −f0 + 4f1 − f2 (25.44)

Multiplying equation 25.41 by 2 and substituting 25.44,
2Bh = 2C − 2f0 = −3f0 + 4f1 − f2 (25.45)
Equations 25.43, 25.44 and 25.45 give us the values of A, B, and C. As it turns out,
we will only need to know A and C but not B. The integral is
Z x2
I= [A(x − x0 )2 + B(x − x1 ) + C]dx (25.46)
x0
x2
A 3 B 2

= [ (x − x0 ) + (x − x1 ) + Cx] (25.47)
3 2 x0
A
= [(x2 − x0 )3 − (x0 − x0 )3 ]
3
B
+ [(x2 − x1 )2 − (x0 − x1 )2 + C(x2 − x0 ) (25.48)
2
A B
= (2h)3 + [h2 − h2 ] + C(2h) (25.49)
3 2
8
= Ah3 + 2Ch (25.50)
3
Substituting equations 25.43 and 25.44
h h
I= [8Ah2 + 6C] = [4(2Ah2 ) + 3(2C)] (25.51)
3 3
h
= [4(f0 − 2f1 + f2 ) + 3(−f0 + 4f1 − f2 )] (25.52)
3
h
= [f0 + 4f1 + f2 ] (25.53)
3
This gives us Simpson’s Rule:
Z x2
h
f (x)dx = [f (x0 ) + 4f (x1 ) + f (x2 )] (25.54)
x0 3
where x1 = (x2 + x0 )/2, or equivalently

Z b
h
f (x)dx = [f (a) + 4f ((a + b)/2) + f (b)] (25.55)
a 3
In general, we can write
Z xi+2
h
f (x)dx = [f (xi ) + 4f (xi+1 ) + f (xi+2 )] (25.56)
xi 3

If n is even then
Z xn Z x2 Z x4 Z x6
f (x)dx = f (x)dx + f (x)dx + f (x)dx
x0 x0 x2 x4
Z xn
+ ··· + f (x)dx (25.57)
xn−2
Substituting equation 25.56 in each term of 25.57

Z xn
h
f (x)dx = [(f0 + 4f1 + f2 ) + (f2 + 4f3 + f4 )
x0 3
+ (f4 + 4f5 + f6 ) + · · · + (fn−4 + 4fn−3 + fn−2 )
+ (fn−2 + 4fn−1 + fn )] (25.58)
Collecting terms gives the Composite Simpson’s Rule:
Z xn
h
f (x)dx = [f0 + 4f1 + 2f2 + 4f3 + 2f4 + · · · + 2fn−2 + 4fn−1 + fn ] (25.59)
x0 3
The pattern of coefficients is 1, 4, 2, 4, 2, 4, . . . , 4, 2, 4, 1.

Example 25.3. Repeat example 25.1 using Simpson’s rule with n = 4.
Solution. Since n = 4 then h = (b − a)/n = 2.5. The formula is
Z 10
h
e−x x2 dx = [f0 + 4f1 + 2f2 + 4f3 + f4 ] (25.60)
0 3
2.5 − 2
= [e 00 + 4e−2.5 (2.52 ) + 2e−5 (52 ) + 4e−7.5 (7.52 ) + e−10 (102 )]
3
(25.61)
2.5
= [0 + 4(0.513031) + 2(0.168449) + 4(0.031111) + 0.00454999]
3
(25.62)
= 2.09834 (25.63)
To arrive at Simpson’s rule we derived the coefficients by fitting a parabola to
three successive points. What if we were to fit a higher order polynomial? Given any
n + 1 points we can fit a unique polynomial of order at most n. If
n
X
f (x) ≈ ci x i (25.64)
i=0
Then
Z b n
Z bX n Z b n
i
X
i
X ci
f (x)dx ≈ ci x dx ≈ ci x dx = (bi+1 − ai+1 ) (25.65)
a a i=0 i=0 a i=0
i+1

which gives us a general Quadrature Formula:

Z b n
X
f (x)dx ≈ ai f i (25.66)
a i=0
For example, we know how to fit n + 1 points with the nth Lagrange Polynomial
Quadrature for n=1 (2 points) using the Lagrange Polynomials. For n = 1 the points
are a = x0 and x1 = b = a + h and so we write
x − x1 x−b 1
L0 = = = (b − x) (25.67)
x0 − x1 a−b h
and
x − x0 x−a 1
L1 = = = (x − a) (25.68)
x 1 − x0 b−a h
The error is
f 00 (c)
R= (x − a)(x − b) (25.69)
2!
Hence
f0 f1
f (x) ≈ P (x) = L0 (x)f0 + L1 (x)f1 = (b − x) + (x − a) (25.70)
h h
from which we can calculate the integral
Z b Z b
f (x)dx ≈ [P (x) + R(x)]dx (25.71)
a a
f0 b f1 b
Z Z
= (b − x)dx + (x − a)dx
h a h a
f 00 (c) b 2
Z
+ (x − (a + b)x + ab)dx (25.72)
2 a
b b
f0 1 2 f1 1 2
= bx − x + x − ax
h 2 a h 2 a
00
b
f (c) 1 3 1 2
+ x − (a + b)x + abx (25.73)
2 3 2 a

f0 1 2 2 f 1 1 2 2
= b(b − a) − (b − a ) + (b − a ) − a(b − a)
h 2 h 2
f 00 (c) 1 3

3 1 2 2
+ (b − a ) − (a + b)(b − a ) + ab(b − a) (25.74)
2 3 2
Applying the algebraic relations
b−a=h (25.75)
b − a2 = (b − a)(b + a) = h(b + a)
2
(25.76)
b3 − a3 = (b − a)(b2 + ab + a2 ) = h(b2 + ab + a2 ) (25.77)

gives
Z b
f0 h f1 h
f (x)dx = bh − (b + a) + (b + a) − ah (25.78)
a h 2 h 2
f 00 (c) h 2

2 h 2
+ (b + ab + a ) − (a + b) + abh (25.79)
2 3 2
b−a b−a
= f0 + f1
2 2
hf 00 (c) 2
+ [2b + 2ab + 2a2 − 3a2 − 6ab − 3b2 + 6ab] (25.80)
12
h hf 00 (c)
= [f0 + f1 ] + [−b2 + 2ab − a2 ] (25.81)
2 12
h h3 f 00 (c)
= [f0 + f1 ] − (25.82)
2 12
The resulting Trapezoidal Rule with Remainder: is
b
h3 f 00 (c)
Z
h
f (x)dx = [f0 + f1 ] − (25.83)
a 2 12
We then can obtain a composite quadrature rule by applying this formula at each
pair of successive grid points [xi , xi+1 ] as:
xi+1
h3 f 00 (ci )
Z
h
f (x)dx = [fi + fi+1 ] − (25.84)
xi 2 12
we get
Z b n−1 Z
X xi+1
f (x)dx = f (x)dx (25.85)
a i=0 xi
n−1
h3 f 00 (ci )

X h
= [fi + fi+1 ] − (25.86)
i=0
2 12
n−1 n−1
h X h3 X 00
= [fi + fi+1 ] − f (ci ) (25.87)
2 i=0
12 i=0
By the intermediate value theorem there is some number µ ∈ [a, b] such that f 00 (µ) is
the average
1
f 00 (µ) = [f 00 (c1 ) + f 00 (c1 ) + · · · + f 00 (cn−1 )] (25.88)
n
and therefore

Z b
h
f (x)dx = [(f0 + f1 + · · · + fn−1 ) + (f1 + f2 + · · · + fn )]
a 2
h3
− nf 00 (µ) (25.89)
12
Substituting nh = b − a we arrive at the Composite Trapezoidal Rule with
Remainder
Z b
h h2 (b − a) 00
f (x)dx = [(f0 + 2f1 + 2f2 + · · · + 2fn−1 + fn )] − f (µ) (25.90)
a 2 12
If we repeat the same process at three points x0 , x1 , and x2 we end up with

Simpsons Rule with Remainder:
Z x2
h h5
f (x)dx = [f0 + 4f1 + f2 ] − f (4) (c) (25.91)
x0 3 90
The corresponding and the Composite Simpson’s Rule with Remainder is
Z b
h
f (x)dx = [f0 + 4f1 + 2f2 + · · · + 2fn−2 + 4fn−1 + fn ]
a 3
b − a 4 (4)
− h f (µ) (25.92)
180
If we use four points x0 , . . . , x3 then we obtain Simpson’s Three-Eighths rule:
Z x3
3h 3h5 (4)
f (x)dx = [f0 + 3f1 + 3f2 + f3 ] − f (c) (25.93)
x0 8 80
Using five points x0 , . . . , x4 gives
Z x4
2x 8h7 (6)
f (x)dx = [7f0 + 32f1 + 12f2 + 32f3 + 7f4 ] − f (c) (25.94)
x0 45 945
The procedures we have outlined above are called the Closed Newton-Cotes
Technique. The general method for deriving an integration formula is as follows.
Suppose you know the function values f0 , f1 , ..., fn at n + 1 equally spaced points
a = x0 < x1 < · · · < xn = b where xi = xi−1 + h. Then
Z b Xn
f (x)dx ≈ ai f i + R (25.95)
a i=0
where n
b b
x − xk
Z Z Y
ai = Li (x)dx = dx (25.96)
a a x i − xk
k=0,k6=i

hn+3 f (n+2) (c) n 2

 Z
t (t − 1)(t − 2) · · · (t − n)dt, n even


(n + 2)! Z0

R= (25.97)
hn+2 f (n+1) (c) n
t(t − 1)(t − 2) · · · (t − n)dt, n odd



(n + 1)! 0
A similar technique, called the Open Newton-Cotes Technique, does not include
the endpoints in the polynomial interpolation. By renumbering the grid points so
that we have n + 3 grid points at a = x−1 < x0 < x1 < · · · < xn < xn+1 = b
Then equations 25.95 and 25.96 still hold; the modified remainder formula is
hn+3 f (n+2) (c) n+1 2

 Z
t (t − 1)(t − 2) · · · (t − n)dt, n even


(n + 2)! Z−1

R= (25.98)
hn+2 f (n+1) (c) n+1
t(t − 1)(t − 2) · · · (t − n)dt, n odd



(n + 1)! −1
The open Newton-Cotes method with n = 0 gives the Midpoint Rule

Z x1
h3
f (x)dx = 2hf0 + f 00 (c) (25.99)
x−1 3
The open Newton-Cotes method with n = 1 gives

Z x2
3h 3h3 00
f (x)dx = [f0 + f1 ] + f (c) (25.100)
x−1 2 4
14h5 (4)
Z
4h
x3 f (x)dx = [2f0 − f1 + 2f2 ] + f (c) (25.101)
x−1 3 45

Z x4
5h 95h5 (4)
f (x)dx = [11f0 + f1 + f2 + 11f3 ] + f (c) (25.102)
x−1 24 144

Lesson 26
Theory of Differential Equations
Definition 26.1 (Ordinary Differential Equation, ODE). Let y ∈ R be a variable1

that depends on t. Then we define a differential equation as any equation of the
form
F (t, y, y 0 ) = 0 (26.1)
where F is any function of the 3 variables t, y, y 0 .
Definition 26.2. We will call any function φ(t) that satisfies
F (t, φ(t), φ0 (t)) = 0 (26.2)
a solution of the differential equation.
We will use the terms “Ordinary Differential Equation” and “Differential Equa-
tion”, as well as the abbreviations ODE and DE, interchangeably. More generally, on
can include partial derivatives in the definition, in which case one must distinguish
between “Partial” DEs (PDEs) and “Ordinary” DEs (ODEs). We will leave the study
of PDEs to another class.
Equation 26.1 is, in general, very difficult, and often, impossible, to solve, either
analytically (e.g., by finding a formula that describes y), or numerically (e.g., for
example, by using a computer to draw a picture of the graph of the solution). Often
it is possible to solve 26.1 explicitly for the derivatives:
y 0 = f (t, y) (26.3)
Many important problems can be put into this form, and solutions are known to
exists for a wide class of functions, particular as a result of theorem 26.2. The class
of problems in which equation 26.1 can be converted to the form 26.3, at least locally,
1
In general, this theory can be extended to higher dimensions, where y ∈ Rn ; all of the same
results hold.
187
188 LESSON 26. THEORY OF DIFFERENTIAL EQUATIONS
is not seriously restrictive from a practical point of view. The only requirements are
that F be sufficiently smooth2 and that the matrix of partial derivatives ∂F/∂(y0 )
(Jacobian matrix) be nonsingular3 (∂F/∂(y 0 ) for a scalar equation). Then by the
implicit function theorem we can solve for y 0 locally. An equation of the form 26.1
for which the Jacobian is nonsingular is thus called an ordinary differential equa-
tion, and we will focus on equations of this form in the first several chapters of these
notes. It turns out that an equation for which the Jacobian is singular actually has
hidden constraints: it is really a combination of differential equations and algebraic
constraints, and is called a differential algebraic equation.
Theorem 26.1. Implicit Function Theorem on R.4 Let F (t, y) have continuous
derivatives ∂F/∂t and ∂F/∂y in the neighborhood of a point (t0 , y0 ), where
∂F (t0 , y0 )
F (t0 , y0 ) = 0, 6= 0 (26.4)
∂y
Then there are intervals I and J where
I = [t0 − a, t0 + a] (26.5)
J = [y0 − b, y0 + b] (26.6)
and a R rectangle R = I × J, such that the equation F (t, y) has precise one solution
y = f (t) lying in the rectangle R, such that
F (t, y(t)) = 0 (26.7)
y(t) ∈ J (26.8)
Fy (t, f (t)) 6= 0 (26.9)
for all t ∈ I.
Example 26.1. Solve y 0 = y.
Solution. Writing y 0 = dy/dt and integrating we find
Z Z
1
dy = dt (26.10)
y
ln|y| = t + C (26.11)
y = Ket (26.12)
where K = ±eC . There is no restriction on the values of either C or K.
2
Throughout these notes we will assume that F is sufficiently smooth without explicitly stating
so. By “sufficiently” smooth we mean that F is continuously differentiable enough times to give us
the results we want.
3
Strictly speaking, nonsingularity is not really a requirement. Nonsingularity is sufficient to
ensure that a solution exists, but is not required. There are examples of functions with singularities
at points but for which solutions may exist.
4
For a proof, see Richard Courant and Fritz John, Introduction to Calculus and Analysis, Volume
II/1, Springer Classics in Mathematics, 1998, page 225

LESSON 26. THEORY OF DIFFERENTIAL EQUATIONS 189
Thus we see that equation 26.1 (or 26.3) will often admit to an infinite number of
solutions owing to arbitrary constants of integration that arise during its solution. For
example 26.1 this is illustrated if figure 26.1, which shows the one parameter family
of solutions to the example. A particular physical problem may only correspond to
one member of this family. To fix down this constant, the problem must be further
constrained. Such a constraint can take various forms. The nature of the constraint
can have an enormous impact on our ability to solve the equation.
Figure 26.1: One parameter family of solutions to y 0 = y, showing the solutions for
various values of the constant of integration.
5 4 3 2 1
2
0.5
0 0
-1
-0.5
-2
-5 -4 -3 -2 -1
t
-1 0 1
Equation 26.3 has an intuitive feeling as the description of a dynamical system:

given any starting point y0 , then the subsequent “motion” or “time-evolution” of y is
given by equation 26.3. By adding an initial condition, that is, by specifying a point
that the solution passes through, we obtain an initial value problem (IVP).
Definition 26.3. (Initial Value Problem) Let D ∈ Rn+1 be a set. Let (t0 , y0 ) ∈ D
and suppose that f (t, y) : D 7→ Rn . Then
y 0 = f (t, y) (26.13)
y(t0 ) = y0 (26.14)

is called an initial value problem. The constraint 26.14 is called an initial con-
dition.
Figure 26.2: Illustration of a differential equation as a dynamical system. Given any

starting point an object moves as described by the differential equation. The curve
on the left shows the coordinates of an object at several time points. On the right,
the points are annotated with the direction of motion, an arrow whose direction is
specified by the components of the differential equation.
(y′1(t1), y′2(t1))
y2 (y1(t1), y2(t1)) y2
(y1(t2), y2(t2))
(y′1(t2), y′2(t2)) (y′1(t0), y′2(t0))
(y1(t0), y2(t0))
y1 y1
Example 26.2. Solve the initial value problem y 0 = (3 − y)/2, y(2π) = 4.

Solution. We can rearrange variables as
2dy
= dt (26.15)
3−y
and integrate to obtain
−t = 2 + ln |y − 3| + C (26.16)
Substituting the initial condition gives
C = −2π − ln |4 − 3| = −2π (26.17)
which gives
|y − 3| = eπ−t/2 (26.18)
Thus either y = 3 + eπ−t/2 or y = 3 − eπ−t/2 . At the initial condition, however,
y(2π) = 4, which is only obtained with the plus sign in the solution. Hence
y = 3 + eπ−t/2 (26.19)
is the unique solution. The solution of the initial value problem is plotted in figure
26.3 in comparison with the one-parameter family.

Figure 26.3: The one-parameter family of solutions for y 0 = (3 − y)/2 for different
values of the constant of integration, and the solution to the initial value (heavy line)
problem through (t0 , y0 ) = (2π, 4). The initial condition is indicated by the large gray
dot.
We will say that an initial value problem is well posed if it meets the following
criteria:
A solution exists;
The solution is unique;
The solution depends continuously on the data.
If a problem is not well posed then there is no point in trying to solve it numerically,
so we begin our study of initial value problems by looking at what it takes to make
a problem well posed. We will find that a Lipschitz Condition, defined below in
definition 26.4 is sufficient to ensure that the problem is well posed.
The importance (and usefulness) of initial value problems is enhanced by a general
existence theorem and the fact that under appropriate conditions (namely, a Lipschitz
Condition) the solution is unique. While we will defer the proof of this statement
until later, we will present one of many different versions of the fundamental existence
theorem.
Definition 26.4 (Lipschitz Condition). A function f (t, y) on D is said to be

Lipschitz (or Lipschitz continuous, or satisfy a Lipschitz condition) on y if
there exists some constant K > 0 if for all (x, y1 ), (x, y2 ) ∈ D then
|f (x, y1 ) − f (x, y2 )| ≤ K|y1 − y2 | (26.20)

The constant K is called a Lipschitz Constant for f . We will sometimes denote

this as f ∈ L(y; K)(D).
Theorem 26.2. Fundamental Existence and Uniqueness Theorem Suppose

that f (t, y) ∈ L(y; K)(R) for some convex domain R. Then for any point (t0 , y0 ) ∈ R
there exists a neighborhood N of (t0 , y0 ) and a unique differentiable function φ(t) on
N satisfying
y 0 = f (t, φ(t)) (26.21)
such that y 0 (t0 ) = y0 .
The existence theorem is illustrated in figure 26.4. Given any initial value, there
is some solution that passes through the point. Observe that the existence of the
solution is not guaranteed globally, only within some open neighborhood of the initial
condition.
Figure 26.4: Illustration of the existence of a solution.
!t0 ,y0 " N
Theorem 26.3 (Continuous dependence on IC). Under the same conditions, the
solution depends continuously on the initial data, i.e., if ỹ is a solution satisfying the
same ODE with ỹ(t0 ) = ỹ0 , then
|y(t) − ỹ(t)| ≤ eKt |y0 − ỹ0 | (26.22)

Theorem 26.4 (Perturbed Equation). Under the same conditions, suppose that ỹ is
a solution of the perturbed ODE,
ỹ 0 = f (t, ỹ) + r(t, ỹ) (26.23)
where r is bounded on D, i.e., there exists some M > 0 such that |r(t)| ≤ M on D.
Then
M
|y(t) − ỹ(t)| ≤ eKt |y0 − ỹ0 | + (eKt − 1) (26.24)
K
Proving that a function is Lipschitz is considerably eased by the following theorem.
Theorem 26.5. Suppose that |∂f /∂y| is bounded by K on a set D. Then f (t, y) ∈
L(y, K)(D).
Proof. The result follows immediately from the mean value theorem. Let (t, y1 ), (t, y2 ) ∈
D. Then there is some number c between y1 and y2 such that
|f (t, y1 ) − f (t, y2 )| = |fy (c)||y1 − y2 | < K|y1 − y2 | (26.25)

Hence f is Lipschitz in y on D.
Example 26.3. Show that a unique solution exists to the initial value problem
y 0 = sin(ty), y(0) = 0.5 (26.26)
Solution. We have f (t, y) = sin(ty), hence fy = t cos(ty). Thus |fy | ≤ |t| which is
bounded for any finite range of t. Let R be a bounded, convex set enclosing (0, 0.5),
and let
K = 1 + sup |t| (26.27)
t∈R
Since R is bounded we know that the supremum exists. By adding 1 we ensure that
we have a number that is strictly larger than the maximum value of |t|. Then K is a
Lipschitz constant for f and hence a unique solution exists in some neighborhood N
of (0, 5). See figure 26.5.
Example 26.4. Analyze the uniqueness of solutions to

p
y0 = 4 − y2 (26.28)
y(0) = 2 (26.29)
Solution. Finding a “solution” is easy enough. We can separate the variables and
integrate. It is easily verified (by direct substitution) that
π
y = 2 sin t + (26.30)
2
Figure 26.5: A solution exists in some neighborhood N of (0, 0.5). See Example 26.3
1.5
1.25
0.75
0.5 N
0.25 R
!10 !5 0 5 10
satisfies both the differential equation and the initial condition, hence it is a solution.
It is also easily verified that y = 2 is a solution, as are functions of the form


π

2 sin t + 2
 t<0
y= 2 0≤t≤φ (26.31)

2 sin t + π2 − φ

t>φ

for any positive real number φ. See Figure 26.6.
Since the solution is not unique, any condition that guarantees existence must be
violated. We have two such conditions: the boundedness of the partial derivative,
and the Lipschitz condition. The first implies the second, and the second implies
uniqueness. By
∂f −y
=p (26.32)
∂y 4 − y2
which is unbounded at y = 2. So the first condition is violated. Of course, a violation
of the condition does not ensure non-uniqueness, all it tells us is that uniqueness is
not ensured.

p
Figure 26.6: There are several solutions to y 0 = 4 − y 2 that pass through the point
(0, 2). See Example 26.4
0 Φ
p
What about the Lipschitz condition? Suppose that the function f (x) = (4 − y 2 )
is Lipschitz with Lipschitz constant K > 0 on some domain D. Then for any y1 , y2
in D,
q q

2 2
K|y1 − y2 | ≥ |f (y1 , y2 )| = 4 − y1 − 4 − y2
(26.33)
Let y2 = 2 and y1 = 2 − for some small number . Then

p
K|| ≥ 4 − (2 − )2 (26.34)

√
2
≥ 4 −

(26.35)
K 2 2 ≥ 4 − 2 (26.36)
(K 2 + 1)2 ≥ 4 (26.37)
4
K2 + 1 ≥ (26.38)

4
K2 ≥ − 1 (26.39)

But since we can choose to be arbitrarily small (including 0), the right hand side of
the equation can be arbitrarily large. But then K is not a finite number, especially

when = 0. So f (t, y) is not Lipschitz, either. Again, this does not guarantee
non-uniqueness; it just tells us that uniqueness is not guaranteed.

Lesson 27
Method of Successive
Approximations
The Method of Successive Approximations or Picard Iteration takes the initial

value problem
y 0 = f (t, y) (27.1)
y(t0 ) = y0 (27.2)
through a sequence of recursive iterations. If φ(t) is a solution to equations (27.1,27.2)

then φ(t) must also be a solution to the integral equation
Z t
φ(t) = y0 + f (s, φ(s))ds (27.3)
t0
At t = t0 then φ(t0 ) = y0 . If h = t − t0 is sufficiently small then a reasonable first

approximation to φ(t) is
φ0 (t) = y0 + (error terms) (27.4)
If we substitute φ0 from equation 27.4 for φ(s) in the integral equation 27.3, we get
a second approximation Z t
φ1 (t) = y0 + f (s, φ0 )ds (27.5)
t0
Better guesses are generated by repeated substitutions:

Z t
φ2 (t) = y0 + f (s, φ1 (s))ds (27.6)
t0
Z t
φ3 (t) = y0 + f (s, φ2 (s))ds (27.7)
t0
..
.
197
198 LESSON 27. METHOD OF SUCCESSIVE APPROXIMATIONS
It turns out the sequence of functions φ0 , φ1 , φ2 , · · · → φ where φ is a solution of the

initial value problem if f (t, y) satisfies a Lipschitz condition on the second variable
(see Definition 26.4). The algorithm is summarized below in Algorithm 27.1.
Algorithm 27.1. Picard Iteration To solve the initial value problem

y 0 = f (t, y), y(t0 ) = y0
for the function y(t)
1. input: f (t, y), t0 , y0 , nmax

2. let φ0 = y0
3. For i = 1, 2, . . . , nmax
Z t
let φi+1 (t) = y0 + f (s, φi (s))ds
t0
4. output: φi (t)
We will show in this chapter that when f is Lipschitz in y, algorithm 27.1 con-
verges to the unique solution of equation 27.1. Technically speaking, however, Picard
Iteration1 does not guarantee a solution to any specific accuracy except in the limit
as n → ∞. Thus it is usually quite impractical in practice. Nevertheless it has
the advantage that it is easily implemented in a computer algebra system, and will
sometimes yield useful results.
Example 27.1. Solve y 0 = y, y(0) = 1 using Picard Iteration.
Solution. Since f (t, y) = y, t0 = 0, y0 = 1, we have the following:
Z t
φ0 = 1 + ds = 1 + t (27.8)
0
Z t
t2
φ1 = 1 + (1 + s)ds = 1 + t + (27.9)
2
Z0 t 2

s
φ2 = 1 + 1+s+ ds (27.10)
0 2
t2 t3
=1+t+ + (27.11)
2 3!
1
The method bears the name of Charles Emile Picard (1856-1941), who popularized the technique,
and published it in 1890, but gave credit to Hermann Schwartz. Guisseppe Peano in 1887, Ernst
Leonard Lindeloff in 1890, and G. von Escherich in 1899 also published existence proofs based on this
technique. Hartman claims that both Liouville and Cauchy were aware of this method. Schwartz,
for his part, outlined the technique in a Festschrift honoring Karl Weierstrass’ 70’th birthday in
1885.

LESSON 27. METHOD OF SUCCESSIVE APPROXIMATIONS 199
We begin to see a pattern that suggests to us that

n+1 k
X t
φn = (27.12)
k=0
k!
We can check this out by induction. It certainly holds for n = 1. For the inductive
step, assume equation eq:picard-ind and solve for φn+1 :
n+1 k
Z tX
s
φn+1 =1+ ds (27.13)
0 k=0 k!
n+1
X tk+1
=1+ (27.14)
k=0
(k + 1)!
Making a change of index j = k + 1, we have

n+2 n+2
X tj X tj
φn+1 =1+ = (27.15)
j=1
(j)! j=0
(j)!
which is exactly what equation 27.12 gives for φn+1 . Hence by the convergence theo-
rem (Theorem 27.3), the corresponding infinite series converges to the actual solution
of the IVP: ∞
X tk
φ(t) = = et (27.16)
k=0
k!
where the last step follows from Taylor’s theorem.
Picard iteration is quite easy to implement in Mathematica; here is one possible
implementation that will print out the first n iterations of the algorithm.
Picard[f ,t , t0 , y0 , n ]:=
Module[{i, y=y0}
Print[Subscript["φ", 0], "=", y0];
For[i=0, i<n, i++,
Z t
ynext=y0+ (f[s, y/.{t->s}]) ds;
t0
y=ynext;
Print[Subscript["φ", i+1], "=", y];
];
Return[Expand[y]]
]
Function Picard has five arguments (f, t, t0, y0, n) and two local variables (i,
y)

Picard[f ,t , t0 , y0 , n ]:=
Module[{i, y=y0},
...
]
The local variable y is initialized to the value of the parameter y0 in the list of
variable declarations. This is equivalent to initializing the value of the variable in the
first line of the program. The first line of the program prints the initial iteration as
φ0 =value of parameter y0 ,
Print[Subscript["φ", 0], "=", y0];
The output will be displayed on the console in an “output cell.” The next line of the
program is a For loop. A For statement takes on four arguments:
For[initialization,
test,
increment,
statement;
..
.
statement;
]
The For loop takes the following actions:
1. The initialization statement (or sequence of statements) is executed;
2. The test is evaluated. If it evaluates to False then the rest of the For is
ignored.
3. Each of the statments is evaluated in sequence.
4. The increment statement is evaluated.
5. Steps (2) through (4) are repeated until test is False.
In our program, we have a counter i that is initially set equal to zero; then the con-
tents of the For are executed only so long as i < n; and the value of i is incremented
by 1 on each iteration. Hence the loop will execute n times. Within the loop three
statements are executed on each iteration:
For[i=0, i<n, i++,
Z t
t0
y=ynext;
Print[Subscript["φ", i+1], "=", y];
];

There are two important variables used in this loop: y and ynext. At the start of
each iteration, y refers to the value of the previous iteration φi−1 , while at the end of
each iteration (because of the statement y=ynext) it refers to the current iteration φi .
In the first line of the iteration the next iteration after φi−1 , namely, φi , is calculated
and saved in ynext. The value depends on the integral
Z t
f (s, φi−1 (s))ds
t0
. But φi−1 (s) is represented by the value of y at this point. Unfortunately, the
expression for y depends upon t, and we need to integrate over s and not s. So to
get the right variable in the expression for f (s, φi−1 (s)) we need to replace t everywhere
by s. We do that with the expression
y/.{t->s}
which means, quite literally, take the expression for y, and everywhere that a t
appears in it, replace the t with an s. To perform this substitution inside the integral
only we do the following:
Z t
t0
So then ynext (φi ) is calculated and saved as y, and the results of the current iteration
are printed on the console. The final line of the program returns the value of the final
iteration in expanded form, namely, with all multiplications and factoring expanded
out:
Return[Expand[y]]
To print the first 5 iterations of y 0 = y cos t, y(0) = 1 using this function, one enters
g[tvariable , yvariable ]:= yvariable*Cos[tvariable];
Picard[g, t, 0, 1, 5];
which prints
φ0 =1
φ1 =1+Sin[t]
φ2 =1+Sin[t]+ 1
2 Sin[t]
2
φ3 =1+Sin[t]+ 1 2 1
2 Sin[t] + 6 Sin[t]
3
φ4 =1+Sin[t]+ 1 2 1 3 1
2 Sin[t] + 6 Sin[t] + 24 Sin[t]
4
φ5 =1+Sin[t]+ 1 2 1 3 1 4 1
2 Sin[t] + 6 Sin[t] + 24 Sin[t] + 120 Sin[t]
5

and returns the value

1+Sin[t]+ 1 2 1 3 1 4 1
2 Sin[t] + 6 Sin[t] + 24 Sin[t] + 120 Sin[t]
5
It appears that the sequence is converging to the series

∞
X sink (t)
φ(t) = = esin t
k=0
k!
It is easily verified by separation of variables or direct substitution that this is, in

fact, the correct solution.
The gist of the proof of Algorithm 27.1 is that it is a form of fixed point iteration.
We recall from chapter 9 that a function f : R 7→ R has a fixed point if and only if
its graph intersects with the line y = x. If there are multiple intersections, then there
are multiple fixed points. Consequently a sufficient condition is that the range of f
is contained in its domain. We first recall some basic theorem from our earlier study
of fixed point theory.
Definition 27.1. Fixed Point. A number a is called a fixed point of the function f
if f (a) = a.
Theorem 27.1 (Sufficient condition for fixed point). Suppose that f (t) is a
continuous function that maps its domain into a subset of itself, i.e.,
f (t) : [a, b] 7→ S ⊂ [a, b] (27.17)
Then f (t) has a fixed point in [a, b].

Theorem 27.2 (Condition for a unique fixed point). Suppose that
(a) f ∈ C[a, b] maps its domain into a subset of itself.
(b) There exists some K > 0, K < 1, such that
|f 0 (t)| ≤ K, ∀t ∈ [a, b] (27.18)
Then f (t) has a unique fixed point in [a, b].

Theorem 27.3 (Fixed point iteration). Under the same conditions as theorem
27.2 then fixed point iteration converges.
Theorem 27.4. Under the same conditions as theorem 27.3 except that the condi-
tion of equation 27.18 is replaced with the following condition: f (t) is Lipschitz with
Lipschitz constant K < 1. Then fixed point iteration converges.
To prove that Picard iteration converges we need to generalized the concept of a
fixed point of a function to general vector spaces.

Definition 27.2 (Vector Space). A vector space V is a set that is closed under
two operations that we call addition and scalar multiplication such that the following
properties hold:
Closure For all vectors u, v ∈ V, and for all a ∈ R,
u+v ∈V (27.19)
av ∈ V (27.20)
Commutivity of Vector Addition For all u, v ∈ V,
u+v =v+u (27.21)
Associativity of Vector Addition For all u, v, w ∈ V,
u + (v + w) = (u + v) + w (27.22)
Identity for Addition There is some element 0 ∈ V such that for all v ∈ V
0+v =v+0=v (27.23)
Inverse for Addition For each v ∈ V there is a vector −v ∈ V such that
v + (−v) = (−v) + v = 0 (27.24)
Associativity of Scalar multiplication For all v ∈ V and for all a, b ∈ R,
a(bv) = (ab)v (27.25)
Distributivity For all a, b ∈ R and for all u, v ∈ V,
(a + b)v = av + bv (27.26)
a(u + v) = au + av (27.27)
Identity for Scalar Multiplication For all vectors v ∈ V,
1v = v (27.28)
Example 27.2. The usual Cartesian vector space to which we are accustomed is a
vector space with vectors being defined as ordered triples of coordinates hx, y, zi.
Example 27.3. Show that the set F[a, b] of all integrable functions f : [a, b] 7→ R is
a vector space.

Solution. Let f, g, h ∈ F[a, b] and c, d ∈ R Then
V is closed: Let p(t) = f (t) + g(t) and q(t) = ch(t). Then p, q : [a, b] 7→ R hence
p, q ∈ F[a, b]
f (t) + g(t) = g(t) + f (t) so commutivity holds.
(f (t) + g(t)) + h(t) = f (t) + (g(t) + h(t)) and c(df (t)) = (cd)f (t) so both
associative properties hold.
The function f (t) = 0 is an additive identity.
For any function f (t) the function −f (t) is an additive inverse.
(c + d)f (t) = cf (t) + df (t) and c(f (t) + g(t)) = cf (t) + cg(t) so both distributive
properties hold.
The number 1 acts as an identity for scalar multiplication.
Hence the set F[a, b] is a vector space.
Definition 27.3 (Norm). A norm k · k : V 7→ R on a vector space V is a function

mapping the vector space to the real numbers such that
1. For all v ∈ V, kvk ≥ 0.
2. kvk = 0 if and only if v = 0.
3. For al v ∈ V and for all a ∈ R, kavk = |a| kvk.
4. The norm satisfies a triangle inequality: for all v, w ∈ V,
kv + wk ≤ kvk + kwk (27.29)
Definition 27.4 (Normed Vector Space). A vector space on which a norm has
been defined is a normed vector space.
Example 27.4. Let V be ordinary Euclidean space and v = hx, y, zi a vector in V.

Then we can define many different norms on this space:
Taxicab (Manhatten, City Block) Norm The L1 norm is: kvk1 = |x| + |y| + |z|
p
Euclidean Distance Function The L2 norm is: kvk2 = x2 + y 2 + z 2
p-norm The Lp norm is: kvkp = (xp + y p + z p )1/p for p ∈ Z+ .
sup-norm The L∞ norm is kvk∞ = sup(|x|, |y|, |z|)

Example 27.5. The following norms can be defined on the vector space F[a, b] of
integrable functions on [a, b]:
qR
b
L2 -norm: kf k2 = a
|f (x)|2 dx
R 1/p
b
Lp -norm: kf kp = a
|f (x)|p dx
L∞ or sup-norm: kf k = sup |f (x)|

x∈[a,b]
Definition 27.5 (Contraction). Let V be a normed vector space, S ⊂ V. Then a

contraction is any mapping T : S 7→ V that satisfies
kT (f ) − T (g)k ≤ Kkf − gk (27.30)
form some K ∈ R, 0 < K < 1, for all f, g ∈ S. We will call the number K the
contraction constant.
Lemma 27.1. Let T be a contraction on a complete normed vector space V with

contraction constant K. Then for any g ∈ Vm
1 − Kn
kT n g − gk ≤ kT g − gk (27.31)
1−K
Proof. Use induction. For n = 1, we have
1−K
kT g − gk ≤ kT g − gk (27.32)
1−K
As our inductive hypothesis choose any n > 1 and suppose that equation 27.31 holds.
Then by the triangle inequality
kT n+1 g − gk ≤ kT n+1 g − T n gk + kT n g − gk (27.33)

1 − Kn
≤ kT n+1 g − T n gk + kT g − gk (27.34)
1−K
1 − Kn
≤ K n kT g − gk + kT g − gk (27.35)
1−K
(1 − K)K n + (1 − K n )
= kT g − gk (27.36)
1−K
1 − K n+1
= kT g − gk (27.37)
1−K
which proves the conjecture for n + 1.

Theorem 27.5 (Contraction Mapping Theorem2 ). Let T be a contraction on a

normed vector space V. Then T has a unique fixed point h ∈ V such that T (h) = h.
Furthermore, any sequence of functions g1 , g2 , . . . defined by gk = T gk−1 converges to
the unique fixed point T g = g. We denote this by gk → g.
Proof. 3 Let > 0 be given. Then since K n /(1 − K) → 0 as n → ∞ (because T is a

contraction, K < 1) it is possible to choose an integer N such that
K n kT g − gk
< (27.38)
1−K
Pick any two integers m ≥ n ≥ N , and define the sequence g0 = g, gn = T gn−1 . Then
kgm − gn k = kT m g − T n gk (27.39)
≤ K n kT m−n g − gk (27.40)
m−n
n1 − K
≤K kT g − gk (27.41)
1−K
by Lemma 27.1. Hence
Kn − Km Kn
kgm − gn k ≤ kT g − gk ≤ kT g − gk < (27.42)
1−K 1−K
Therefore gn is a Cauchy sequence, and every Cauchy sequence on a complete normed
vector space converges. Define f = limn→∞ gn . Then either f is a fixed point of T or
it is not a fixed point of T . Suppose that it is not a fixed point of T . Then T f 6= f
and hence there exists some δ > 0 such that
kT f − f k > δ (27.43)
On the other hand, because gn → f , there exists an integer N such that for all n > N ,
kgn − f k < δ/2 (27.44)
Hence
kT f − f k ≤ kT f − gn+1 k + kf − gn+1 k (27.45)

≤ Kkf − gn k + kf − gn+1 k (27.46)
≤ kf − gn k + kf − gn+1 k (27.47)
= 2kf − gn k (27.48)
<δ (27.49)
2
The contraction mapping theorem is sometimes called the Banach Fixed Point Theorem
3
The proof follows “Proof of Banach Fixed Point Theorem,” Encyclopedia of Mathematics (Vol-
ume 2, 54A20:2034), PlanetMath.org.

This is a contradiction, and hence f must be a fixed point of T .
To prove uniqueness, suppose that there is another fixed point h 6= f . Then

kh − f k > 0 (otherwise they are equal). But
kh − f k = kT h − T f k (27.50)
≤ Kkh − f k (27.51)
< kh − f k (27.52)
which is impossible and hence and contradiction. Thus f is the unique fixed point of
T.
We restate the fundamental existence theorem here for reference. While it is stated
in terms of the scalar problem, the vector problem is not fundamentally different, and
the proof is completely analogous.
Theorem 27.6 (Fundamental Existence Theorem). Let D ∈ R2 be convex and

suppose that f is continuously differentiable on D. Then the initial value problem
y 0 = f (t, y), y(t0 ) = y0 (27.53)
has a unique solution φ(t) in the sense that φ0 (t) = f (t, φ(y)), φ(t0 ) = y0 .
Proof. We begin by observing that φ is a solution of equation 27.53 if and only if it

is a solution of Z t
φ(t) = y0 + f (x, φ(x))dx (27.54)
t0
Our goal will be to prove 27.54.
Let S be the set of all continuous integrable functions on an interval (a, b) that
contains t0 . Corresponding to any function φ ∈ S we can define the mapping T :
S 7→ S as Z t
T [φ] = y0 + f (x, φ(x))dx (27.55)
t0
We will assume t > t0 . The proof for t < t0 is completely analogous. Using the
sup-norm on (a, b), we calculate that for any two functions g, h ∈ S,
Z t

kT [g] − T [h]k =
[f (x, g(x)) − f (x, h(x))] dx
(27.56)
t0
Z t

≤ sup [f (x, g(x)) − f (x, h(x))] dx (27.57)
a≤t≤b t0

Since f is differentiable it is Lipschitz in its second argument, hence

Z t
kT [g] − T [h]k ≤ K sup |g(x)) − h(x)| dx (27.58)
a≤t≤b t0
≤ K(b − a) sup |g(x)) − h(x)| (27.59)
a≤t≤b
≤ K(b − a) kg − hk (27.60)
Where K is any number larger than sup(a,b) fy . If we choose the endpoints a and b
such that |b − a| < 1/K we have K|b − a| < 1. Thus T is a contraction. By the
contraction mapping theorem it has a fixed point; call this point φ. Equation 27.54
follows immediately.
Theorem 27.7 (Error Bounds on Picard Iteration). Under the same conditions
as before, let φn be the nth Picard iterate, and let φ be the solution of the IVP. Then
M |K(t − t0 )|n+1 K|t−t0 |
|φ(t) − φn (t)| ≤ e (27.61)
K(n + 1)!
where M = supD |f (t, y)| and K is a Lipschitz constant. Furthermore, if L = |b − a|
then
M [KL]n+1 eKL
kφ(t) − φn (t)k ≤ (27.62)
K(n + 1)!
where k · k denotes the sup-norm.
Proof. We begin by proving the conjecture
K n−1 M
|φn − φn−1 | ≤ |t − t0 |n (27.63)
n!
For n = 1, equation 27.63 says that
|φ1 − y0 | ≤ M |t − t0 | (27.64)
which follows immediately from equation 27.54. Next, make the inductive hypothesis
27.63 and calculate
Z t

|φn+1 − φn| = [f (s, φn (s)) − f (s, φn−1 (s))] ds
(27.65)
t0
Z t
≤K |φn (s) − φn−1 (s)| ds (27.66)
t0
by the definition of φn and the Lipschitz condition. Applying the inductive hypothesis
and then integrating,
K nM t
Z
|φn+1 − φn| ≤ |s − t0 |n ds (27.67)
n! t0
K nM
≤ |t − t0 |n+1 (27.68)
(n + 1)!

which proves conjecture 27.63. Now let

n
X
φn (t) = φ0 (t) + [φi (t) − φi−1 (t)] (27.69)
i=1
Then since the sequence of Picard iterates converges to the solution,

∞
X
φ(t) = lim φn (t) = φ0 (t) + [φi (t) − φi−1 (t)] (27.70)
n→∞
i=1
Hence by equation 27.63,

X∞
|φ(t) − φn (t)| = (φi (t) − φi−1 (t)) (27.71)

i=n+1
X∞
≤ |φi (t) − φi−1 (t)| (27.72)
i=n+1
∞
X K i−1 M
≤ |t − t0 |i (27.73)
i=n+1
i!
∞
M X |K(t − t0 )|i
≤ (27.74)
K i=n+1 i!
Therefore by comparison with a Taylor series for eK(b−a) ,

∞
M X |K(b − a)|i
kφ(t) − φn (t)k ≤ (27.75)
K i=n+1 i!
n
!
M X |K(b − a)|i
≤ eK(b−a) − (27.76)
K i=0
i!
M
≤ sup Rn (t) (27.77)
K 0≤t≤KL
where Rn (t) is the Taylor Series Remainder for et after n terms,
tn+1 ec (KL)n+1 eKL

sup Rn (t) ≤ sup ≤ (27.78)
0≤t≤KL 0≤{c,t}≤KL (n + 1)! (n + 1)!
for some unknown c between a and b.Hence

M (KL)n+1 eKL
kφ(t) − φn (t)k ≤ (27.79)
K (n + 1)!
proving the theorem.

The following example shows that this bounds is not very useful in practice.
Example 27.6. Estimate the number of iterations required to obtain an solution to
y 0 = t, y(0) = 1 on [0, 10] with a precision of no more that 10−7 .
Solution. Since f (t, y) = t we have fy = 0 and hence a Lipschitz constant is K = 1
(or any positive number), and we can use M = 10 on [0, 10]. The precision in the
error is bounded by
M (KL)n+1 eKL 10(10)n+1 e10
≤ ≤ (27.80)
K(n + 1)! (n + 1)!
We can determine the minimum value of n by using Mathematica. The following will
print a list of values of equation (27.80) for n ranging from 1 to 50.
errs = Table[{n, 10 (10) ˆ (n + 1) (E ˆ 10.)/(n + 1)!}, {n, 1, 50}]
The output is a list of number pairs, which can be plotted with ListPlot or
<<‘Graphics‘Graphics‘
LogListPlot[errs];
The output of LogListPlot is shown below; we have annotated the plot with an
additional line at the desired tolerance of 10−7 showing that it occurs at the 47th
iteration.
This example shows that Picard iteration will produce the desired accuracy if we
perform 47 iterations. For this particular problem, this suggestion is absurd, because
Picard iteration converges to the exact solution after 1 iteration. The calculated
solution does not change upon further calculations. Hence the method vastly over-
estimates the potential error (at least for this example).

Lesson 28
Euler’s Method
By a numerical solution of the initial value problem
y 0 = f (t, y), y(t0 ) = y0 (28.1)

we mean a sequence of values
y0 , y1 , y2 , ..., yn−1 , yn ; (28.2)
a corresponding mesh or grid M by
M = {t0 < t1 < t2 < · · · < tn−1 < tn }; (28.3)
and a grid spacing as

hj = tj+1 − tj (28.4)
Then the numerical solution or numerical approximation to the solution is the se-
quence of points
(t0 , y0 ), (t1 , y1 ), . . . , (tn−1 , yn−1 ), (tn , yn ) (28.5)
In this solution the point (tj , yj ) represents the numerical approximation to the solu-
tion point y(tj ). We can imagine plotting the points (28.5) and then “connecting the
dots” to represent an approximate image of the graph of y(t), t0 ≤ t ≤ tn . We will
use the convenient notation
yn ≈ y(tn ) (28.6)
which is read as “yn is the numerical approximation to y(t) at t = tn .”
Euler’s Method or the Forward Euler’s Method is constructed as illustrated

in figure 28.1. At grid point tn , y(t) ≈ yn , and the slope of the solution is given by
exactly y 0 = f (tn , y(tn )). If we approximate the slope by the straight line segment
between the numerical solution at tn and the numerical solution at tn+1 then
yn+1 − yn yn+1 − yn
yn0 (tn ) ≈ = (28.7)
tn+1 − tn hn
211
212 LESSON 28. EULER’S METHOD
Figure 28.1: Illustration of Euler’s Method. A tangent line with slope f (t0 , y0 ) is con-
structed from (t0 , y0 ) forward a distance h = t1 − t0 in the t− direction to determined
y1 . Then a line with slope f (t1 , y1 ) is constructed forward from (t1 , y1 ) to determine
y2 , and so forth. Only the first line is tangent to the actual solution; the subsequent
lines are only approximately tangent.
y2
y1
y0
t0 t1 t2
Since y 0 (t) = f (t, y), we can approximate the left hand side of (28.7) by
yn0 (tn ) ≈ f (tn , yn ) (28.8)
and hence
yn+1 = yn + hn f (tn , yn ) (28.9)
It is often the case that we use a fixed step size h = tj+1 − tj , in which case we have
tj = t0 + jh (28.10)
In this case the Forward Euler’s method becomes
yn+1 = yn + hf (tn , yn ) (28.11)
The Forward Euler’s method is sometimes just called Euler’s Method. The application
of Euler’s method is summarized below.
Algorithm Forward Euler

Input f , t0 , y0 , h, tmax
Let t = t0 , y = y0
While t < tmax
let y = y + hf (t, y)
let t = t + h
let tn = t, yn = y
End While
Return {(t0 , y0 ), . . . , (tn , yn )}

LESSON 28. EULER’S METHOD 213
An alternate derivation of equation (28.9) is to expand the solution y(t) in a Taylor

Series about the point t = tn :
h2n 00
y(tn+1 ) = y(tn + hn ) = y(tn ) + hn y 0 (tn ) + y (tn ) + · · · (28.12)
2
= y(tn ) + hn f (tn , y(n )) + · · · (28.13)
We then observe that since yn ≈ y(tn ) and yn+1 ≈ y(tn+1 ), then (28.9) follows imme-
diately from (28.13).
If the scalar initial value problem of equation (28.1) is replaced by a systems of

equations
y0 = f (t, y), y(t0 ) = y0 (28.14)
then the Forward Euler’s Method has the obvious generalization
yn+1 = yn + hf (tn , yn ) (28.15)
Example 28.1. Solve y 0 = y, y(0) = 1 on the interval [0, 1] using h = 0.25.
Solution. The exact solution is y = ex . We compute the values using Euler’s method.
For any given time point tk , the value yk depends purely on the values of tk1 and
yk1 . This is often a source of confusion for students: although the formula yk+1 =
yk + hf (tk , yk ) only depends on tk and not on tk+1 it gives the value of yk+1 .
We are given the following information:
(t0 , y0 ) = (0, 1) (28.16)

f (t, y) = y (28.17)
h = 0.25 (28.18)
We first compute the solution at t = t1 .
y1 = y0 + hf (t0 , y0 ) = 1 + (0.25)(1) = 1.25 (28.19)

t1 = t0 + h = 0 + 0.25 = 0.25 (28.20)
(t1 , y1 ) = (0.25, 1.25) (28.21)
Then we compute the solutions at t = t1 , t2 , . . . until tk+1 = 1.
y2 = y1 + hf (t1 , y1 ) (28.22)
= 1.25 + (0.25)(1.25) = 1.5625 (28.23)
t2 = t1 + h = 0.25 + 0.25 = 0.5 (28.24)
(t2 , y2 ) = (0.5, 1.5625) (28.25)

214 LESSON 28. EULER’S METHOD
y3 = y2 + hf (t2 , y2 ) (28.26)
= 1.5625 + (0.25)(1.5625) = 1.953125 (28.27)
t3 = t2 + h = 0.5 + 0.25 = 0.75 (28.28)
(t3 , y3 ) = (0.75, 1.953125) (28.29)
y4 = y3 + hf (t3 , y3 ) (28.30)
= 1.953125 + (0.25)(1.953125) = 2.44140625 (28.31)
t4 = t3 + 0.25 = 1.0 (28.32)
(t4 , y4 ) = (1.0, 2.44140625) (28.33)
Since t4 = 1 we are done. The solutions are tabulated in table ?? for this and other
step sizes.
t h = 1/2 h = 1/4 h = 1/8 h = 1/16 exact solution
0.0000 1.0000 1.0000 1.0000 1.0000 1.0000
0.0625 1.0625 1.0645
0.1250 1.1250 1.1289 1.1331
0.1875 1.1995 1.2062
0.2500 1.2500 1.2656 1.2744 1.2840
0.3125 1.3541 1.3668
0.3750 1.4238 1.4387 1.4550
0.4375 1.5286 1.5488
0.5000 1.5000 1.5625 1.6018 1.6242 1.6487
0.5625 1.7257 1.7551
0.6250 1.8020 1.8335 1.8682
0.6875 1.9481 1.9887
0.7500 1.9531 2.0273 2.0699 2.1170
0.8125 2.1993 2.2535
0.8750 2.2807 2.3367 2.3989
0.9375 2.4828 2.5536
1.0000 2.2500 2.4414 2.5658 2.6379 2.7183

Lesson 29
The Backwards Euler Method
Now consider the IVP

5 1
y 0 = −5ty 2 + − 2 , y(1) = 1 (29.1)
t t
The exact solution is y = 1/t. The numerical solution is plotted for three different
step sizes on the interval [1, 25] in the following figure. Clearly something appears
to be happening here around h = 0.2, but what is it? For smaller step sizes, a
relatively smooth solution is obtained, and for larger values of h the solution becomes
progressively more jagged.
0.4
0.3
0.2
0.1
t
5 10 15 20 25
This example illustrates a problem that occurs in the solution of differential equa-
tions, known as stiffness. Stiffness occurs when the numerical method becomes
unstable. An exploration of this phenomenon is beyond the scope of Math 481A
(the topic is covered in great detail in Math 582B). One solution is to modify Euler’s
method as illustrated in figure 29.1 to give the Backward’s Euler Method:
yn = yn−1 + hn f (tn , yn ) (29.2)
[h]
215
216 LESSON 29. THE BACKWARDS EULER METHOD
Figure 29.1: Illustration of the Backward’s Euler Method. Instead of constructing

a tangent line with slope f (t0 , y0 ) through (t0 , y0 ) a line with slope f (t1 , y1 ) is con-
structed. This necessitates knowing the solution at the t1 in order to determine y1 (t1 )
y2
y1
y0
t0 t1 t2
The problem with the Backward’s Euler method is that we need to know the
answer to compute the solution: yn exists on both sides of the equation, and in general,
we can not solve explicitly for it. The Backwards Euler Method is an example of an
implicit method, because it contains yn implicitly. In general it is not possible
to solve for yn explicitly as a function of yn−1 in equation 29.2, even though it is
sometimes possible to do so for specific differential equations. Thus at each mesh
point one needs to make some first guess to the value of yn and then perform some
additional refinement to improve the calculation of yn before moving on to the next
mesh point. A common method is to use fixed point iteration on the equation
y = k + hf (t, y) (29.3)
where k = yn−1 . The technique is summarized here:
Make a first guess at yn and use that in right hand side of 29.2. A common first
guess that works reasonably well is
yn(0) = yn−1 (29.4)
Use the better estimate of yn produced by 29.2 and then evaluate 29.2 again to
get a third guess, e.g.,
yn(ν+1) = yn−1 + hf (tn , yn(ν) ) (29.5)
Repeat the process until the difference between two successive guesses is smaller
than the desired tolerance.
Of course we know that Fixed Point iteration will only converge if there is some
number K < 1 such that |∂g/∂y| < K where g(t, y) = k+hf (t, y). An implementation
of Backward’s Euler method with Fixed Point iteration in Mathematica as follows:

LESSON 29. THE BACKWARDS EULER METHOD 217
Figure 29.2: Result of the forward Euler method to solve y 0 = −100(y−sin t), y(0) = 1
with h = 0.001 (top), h = 0.019 (middle), and h = 0.02 (third). The bottom figure
shows the same equation solved with the backward Euler method for step sizes of
h = 0.001, 0.02, 0.1, 0.3, left to right curves, respectively
1
0.8
0.6
0.4
0.2
. 0 0.5 1 1.5 2 2.5 3

1
0.5
!0.5
!1
0 0.5 1 1.5 2 2.5 3
2
1.5
1
0.5
0
!0.5
!1
0 0.5 1 1.5 2 2.5 3
1
0.8
0.6
0.4
0.2
0 0.5 1 1.5 2 2.5 3
BackwardEuler[f, {t0_, y0_},

h_, tmax_, tol_:0.003, nmax_:5] := Module[{ t, time,
yval, yvaln, yvalp, delta, eps, r, J, JV},
r = {{t0, y0}};
yval = y0;
J = D[f[t, y], {y, 1}];
JV = J /. {y -> y0, t -> t0};
If[Abs[JV] >= 1, Return[$Failed]];
J = J /. {t -> time};
For[t = t0, t < tmax, t += h,
yvaln = yval + f[t, yval];
For[i = 1, i nmax, i++,
yvalp = yvaln;

yvaln = yval + f[t, yvalp];

delta = yvaln - yvalp;
If[Abs[delta] < tol, Break[]];
];
yval = yvaln;
AppendTo[r, {t + h, yval}];
JV = J /. {y -> ynext, time -> t + h};
If[Abs[JV] >= 1, Return[$Failed]];
];
Return[r];
]
The input parameter tol gives the tolerance for the fixed point iteration: when
two successive guesses are this close together, the iteration stops. The default value
is 0.003. Similarly, we have added a parameter nmax which is an emergency cut-off
for the fixed point iteration. This number prevents infinite loops in the event the
tolerance is never reached. The number of iterations is counted, and if they reach the
value of nmax the fixed point iteration stops. Because there is always the possibility
(either through a program bug or some sort of bizarre input) that the algorithm will
not terminate, it is generally a good programming practice to always include this type
of counter and cut-off value. In the implementation shown above nmax has a default
value of 5. Since both nmax and tol have default values, they are considered optional
parameters by Mathematica: if you are happy with the values of the defaults, you do
not have to supply them when you call the program.
However, to avoid ill-conditioned equations it is usually better to use a root-
finding algorithm such as Newton’s method to find the root y of y = k + f (t, y), e.g,
use Newton’s method to find the root of
g(s) = s − yn−1 − f (tn , s) (29.6)
at each iteration. An implementation is shown below:
BackwardsEulerNewtonsMethod[f_, {t0_, y0_}, h_, tmax_,

tol_:0.003, nmax_:5] :=
Module[{
t, y, n, time, yval, yvaln, yvalp, delta, eps, r, J, JV, fv},
r = {{t0, y0}};
yval = y0;
J = D[f[t, y], {y, 1}];
For[time = t0, time < tmax, time += h,
fv = f[time, yval];
JV = J /. {t -> time, y -> yval};
yvaln = (-yval + h (-fv + JV *yval))/ (-1 + h JV);

LESSON 29. THE BACKWARDS EULER METHOD 219
For[i = 1, i <= nmax, i++,

yvalp = yvaln;
JV = J /. {t -> time, y -> yvalp};
fv = f[time, yvalp];
yvaln = (yval + h(fv - JV* yvalp))/( 1 - h JV);
delta = yvaln - yvalp;
If[Abs[delta] < tol, Break[]];
];
yval = yvaln;
AppendTo[r, {time + h, yval}];
];
Return[r];
]
To solve the initial value problem y 0 = −50(y − sin t), y(0) = 1 on the interval [0, 3]
using a step size of h = 0.3,
In:=
f[t_, y_] := -50(y - Sin[t]);

BackwardsEulerNewtonsMethod[f, {0, 1}, 0.3];
Out:=
{{0, 1}, {0.3, 0.0625}, {0.6, 0.280956}, {0.9,

0.546912}, {1.2, 0.768551}, {1.5, 0.921821}, {1.8, 0.992765},
{2.1,0.97503}, {2.4, 0.870198}, {2.7, 0.687634}, {3., 0.443646}}
which produces a list of values {{t0 , y0 }, . . . , {tn , yn }}.


Lesson 30
Improving Euler’s Method
All numerical methods for initial value problems of the form
y 0 (t) = f (t, y), y(t0 ) = y0 (30.1)
variations of the form

yn+1 = yn + φ(tn , yn , . . . ) (30.2)
for some function φ. In Euler’s method, φ = hf (tn , yn ); in the Backward’s Euler
method, φ = hf (tn+1 , yn+1 ). In general we can get a more accurate result with a
smaller step size. However, in order to reduce computation time, it is desirable to
find methods that will give better results withou a significant decrease in step size.
We can do this by making φ depend on values of the solution at multiple time points.
For example, a Linear Multistep Method has the form
yn+1 + a0 yn + a1 yn−1 + · · · = h(b0 fn+1 + b1 fn + b2 fn−1 + · · · ) (30.3)
For some numbers a0 , a1 , . . . and b0 , b1 , . . . . Euler’s method has a0 = −1, a1 = a2 =

· · · = 0 and b1 = 1, b0 = b2 = b3 = · · · = 0
Here we introduce the Local Truncation Error, one measure of the “goodness”
of a numerical method. The Local truncation error tells us the error in the calculation
of y, in units of h, at each step tn assuming that there we know yn−1 precisely correctly.
Suppose we have a numerical estimate yn of the correct solution at y(tn ). Then the
Local Truncation Error is defined as
1
LTE = (y(tn ) − yn ) (30.4)
h
1
= (y(tn ) − y(tn−1 ) + y(tn−1 ) − yn ) (30.5)
h
Assuming we know the answer precisely correctly at tn−1 then we have
yn−1 = y(tn−1 ) (30.6)
221
222 LESSON 30. IMPROVING EULER’S METHOD
so that
y(tn ) − y(tn−1 ) yn−1 − yn
LTE = + (30.7)
h h
y( tn ) − y(tn−1 ) 1
= − φ(tn , yn , . . . ) (30.8)
h h
For Euler’s method,
φ = hf (t, y) (30.9)
hence
y(tn ) − y(tn−1 )
LTE(Euler) = − f (tn , yn ) (30.10)
h
If we expand y in a Taylor series about tn−1 ,
h2 00
y(tn ) = y(tn−1 ) + hy 0 (tn−1 ) + y (tn−1 ) + · · · (30.11)
2
h2
= y(tn−1 ) + hf (tn−1 , yn−1 ) + y 00 (tn−1 ) + · · · (30.12)
2
Thus
h 00
LTE(Euler) = y (tn−1 ) + c2 h2 + c3 h3 + · · · (30.13)
2
for some constants c1 , c2 , ... Because the lowest order term in powers of h is propor-
tional to h, we say that
LTE(Euler) = O(h) (30.14)
and say that Euler’s method is a First Order Method. In general, to improve
accuracy for a given step size, we look for higher order methods, which are O(hn );
the larger the value of n, the better the method in general.
The Trapezoidal Method averages the values of f at the two end points. It has
an iteration formula given by
hn
yn = yn−1 + (f (tn , yn ) + f (tn−1 , yn−1 )) (30.15)
2
We can find the LTE as follows by expanding the Taylor Series,
y(tn ) − y(tn−1 )
LTE(Trapezoidal) = − f (tn , yn ) (30.16)
h
h2 00 h3 000

1 0
= y(tn−1 ) + hy (tn−1 ) + y (tn−1 ) + y (tn−1 ) + · · · − y(tn−1 )
h 2 3!
1
− (f (tn , yn ) + f (tn−1 , yn−1 )) (30.17)
2

LESSON 30. IMPROVING EULER’S METHOD 223
Therefore using y 0 (tn−1 ) = f (tn−1 , yn−1 ),
1 h h2 1
LTE(Trapezoidal) = f (tn−1 , yn−1 ) + y 00 (tn−1 ) + y 000 (tn−1 ) · · · − f (tn , yn )
2 2 6 2
(30.18)
Expanding the final term in a Taylor series,
f (tn , yn ) = y 0 (tn ) (30.19)

2
h 000
= y 0 (tn−1 ) + hy 00 (tn−1 ) + y (tn−1 ) + · · · (30.20)
2
h2
= f (tn−1 , yn−1 ) + hy 00 (tn−1 ) + y 000 (tn−1 ) + · · · (30.21)
2
Therefore the Trapezoidal method is a second order method:
1 h 00 h2 000
LTE(Trapezoidal = fn−1 + yn−1 + yn−1 + ···
2 2 6
1 1 00 1 000
− fn−1 − hyn−1 − h2 yn−1 + ··· (30.22)
2 2 4
1 000
= − h2 yn−1 + ··· (30.23)
12
= O(h2 ) (30.24)
The theta method is given by
yn = yn−1 + h [θf (tn−1 , yn−1 ) + (1 − θ)f (tn , yn )] (30.25)
The theta method is implicit except when θ = 1, where it reduces to Euler’s method,
and is first order unless θ = 1/2. For θ = 1/2 it becomes the trapezoidal method. The
usefullness of the comes from the ability to remove the error for specific high order
terms. For example, when θ = 2/3, there is no h3 term even though there is still an
h2 term. This can help if the coefficient of the h3 is so larger that it overwhelms the
the h2 term for some values of h.
The second-order midpoint method is given by

1
yn = yn−1 + hn f tn−1/2 , [yn + yn−1 ] (30.26)
2
The modified Euler Method, which is also second order, is

hn
yn = yn−1 + [f (tn−1 , yn−1 ) + f (tn , yn−1 + hf (tn−1 , yn−1 ))] (30.27)
2
224 LESSON 30. IMPROVING EULER’S METHOD
Heun’s Method is

hn 2 2
yn = yn−1 + f (tn−1 , yn−1 ) + 3f tn−1 + h, yn−1 + hf (tn−1 , yn−1 ) (30.28)
4 3 3
Both Heun’s method and the modified Euler method are second order and are ex-
amples of two-step Runge-Kutta methods. It is clearer to implement these in two
“stages,” eg., for the modified Euler method,
ỹn = yn−1 + hf (tn−1 , yn−1 ) (30.29)

hn
yn = yn−1 + [f (tn−1 , yn−1 ) + f (tn , ỹn )] (30.30)
2
while for Heun’s method,
2
ỹn = yn−1 + hf (tn−1 , yn−1 ) (30.31)
3
hn 2
yn = yn−1 + f (tn−1 , yn−1 ) + 3f tn−1 + h, ỹn (30.32)
4 3


Numerical Analysis I: California State University Northridge Lecture Notes For Math 481A

Загружено:

Сведения о документе

Исходное описание:

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Numerical Analysis I: California State University Northridge Lecture Notes For Math 481A

Загружено:

Авторское право:

Доступные форматы

California State University Northridge

Lecture Notes for Math 481A:

Last Revision: July 5, 2008

« 2008. This document is licensed under a Creative Commons Attribution-

Please report any errors to bruce.e.shapiro at csun.edu. All feedback, com-

Your fearless leader. Above: Typical view during a

Math 481A « 2008, B.E.Shapiro

2 Limits and Continuity 9

4 Theorems About Derivatives 17

7 Fixed and Floating Point 31

8 Roots and Bisection 35

9 Fixed Point Iteration 41

12 Error Analysis for Iterative Methods 71

13 The Aitken-Steffensen Methods 81

14 Synthetic Division and Horner’s Method 87

17 Lagrange Interpolation 105

18 Newton Interpolation 119

19 Hermite Interpolation 125

20 Cubic Splines 135

21 Bezier Curves 143

22 Least Squares 153

23 Numerical Differentiation 161

24 Richardson Extrapolation 169

25 Numerical Integration 175

26 Theory of Differential Equations 187

27 Method of Successive Approximations 197

28 Euler’s Method 211

29 The Backwards Euler Method 215

30 Improving Euler’s Method 221

Math 481A « 2008, B.E.Shapiro

A standard way of defining “good enough” is by using a tolerance. To do this we

Math 481A « 2008, B.E.Shapiro

f[x_] := (1/2)(x + 2/x)

« 2008, B.E.Shapiro Math 481A

From an implementation perspective, one could just catalog a table of algorithms

Math 481A « 2008, B.E.Shapiro

« 2008, B.E.Shapiro Math 481A

+ 0 × 2−1 + 1 × 2−2 + 1 × 2−3

Math 481A « 2008, B.E.Shapiro

which returns the string representation of binary number show above.)

 ≈ (1650 meters/second) × (0.3433 seconds) = 566 meters. (1.19)

« 2008, B.E.Shapiro Math 481A

Other Famous Numerical Errors

 Approximately 36 seconds after the launch of an Ariane rocket from French

Math 481A « 2008, B.E.Shapiro

Limits and Continuity

|x − x0 | < δ =⇒ |f (x) − L| <  (2.2)

You name ε f(x)

or equivalently, by the definition of absolute values,

What condition does this impose on δ? We calculate

−δ + 3 < x < δ + 3 (2.8)

−δ + 4 < x + 1 < δ + 4 (2.9)

This leads to the following two conditions:

4 + δ < ( + 2)2 (2.19)

Math 481A « 2008, B.E.Shapiro

|f (x) − L| <  (2.36)

« 2008, B.E.Shapiro Math 481A

Definition 2.2 (Continuity). We say “f (x) is continuous at x0 ” or just “f is

Example 2.2. The function √

lim f (x) = 2 = f (3) (2.40)

Example 2.3. The function

is not continuous at x = 3 because

≈ (1650 meters/second) × (0.3433 seconds) = 566 meters. (1.19)

Approximately 36 seconds after the launch of an Ariane rocket from French

|x − x0 | < δ =⇒ |f (x) − L| < (2.2)

4 + δ < ( + 2)2 (2.19)

|f (x) − L| < (2.36)

|xn − x| = |xn − 3| < (3.7)

Let > 0 be given. Then so long as

2n > 3/ (3.13)

= 2.71828 − 2.718281828459045... ≈ 0.1828 ulps (5.9)

computational error: errors introduced by the computation itself. Compu-

rounding: round off the final digit, 1159 ≈ 0.116 × 104 .