
California State University Northridge

Lecture Notes for Math 481A:


Numerical Analysis I
Bruce E. Shapiro, Ph.D.

Last Revision: July 5, 2008

This document is provided in the hope that it will be useful but without any
warranty, without even the implied warranty of merchantability or fitness for a
particular purpose. The document is provided on an “as is” basis and the author
has no obligations to provide corrections or modifications. The author makes no
claims as to the accuracy of this document. In no event shall the author be liable
to any party for direct, indirect, special, incidental, or consequential damages,
including lost profits, unsatisfactory class performance, poor grades, confusion,
misunderstanding, emotional disturbance or other general malaise arising out of
the use of this document or any software described herein, even if the author has
been advised of the possibility of such damage.

© 2008. This document is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License (by-nc-nd). For
specifics and a copy of the license please see the Creative Commons web site at
http://creativecommons.org/licenses/by-nc-nd/3.0/us/

Please report any errors to bruce.e.shapiro at csun.edu. All feedback, comments,
suggestions for improvement, etc., are appreciated, especially if you’ve used
these notes for a class, either at CSUN or elsewhere, from both instructors and
students.

Your fearless leader. Above: Typical view during a class lecture. Below: Typical
view during an exam. The pictures were drawn by former students. The consumption
of cookies and caffeinated beverages during class is optional but is strongly
encouraged.

Contents

1 A Motivational Example
2 Limits and Continuity
3 Sequences
4 Theorems About Derivatives
5 Error
6 Number Representation
7 Fixed and Floating Point
8 Roots and Bisection
9 Fixed Point Iteration
10 Newton’s Method
11 Secant Method
12 Error Analysis for Iterative Methods
13 The Aitken-Steffensen Methods
14 Synthetic Division and Horner’s Method
15 Müller’s Method
16 Linear Systems
17 Lagrange Interpolation
18 Newton Interpolation
19 Hermite Interpolation
20 Cubic Splines
21 Bezier Curves
22 Least Squares
23 Numerical Differentiation
24 Richardson Extrapolation
25 Numerical Integration
26 Theory of Differential Equations
27 Method of Successive Approximations
28 Euler’s Method
29 The Backwards Euler Method
30 Improving Euler’s Method

Lesson 1

A Motivational Example

Numerical analysis is a branch of mathematics that deals with the development and
implementation of methods for solving problems numerically in continuous mathematics.
A related field, discrete or finite mathematics, deals with problems that
do not contain or depend on the concept of continuity. In practice both fields of
mathematics overlap with the subjects of numerical computation and computer
science, which deal with the actual implementations (e.g., computer programs and
algorithms) used to solve these problems.

As an example of a numerical algorithm, consider finding the square root √a of a
number a. We know that the solution satisfies

x^2 = a    (1.1)

Many algorithms for finding the value of x are based on finding the root of the
polynomial

f(x) = x^2 − a    (1.2)

i.e., finding the value of x that satisfies the equation f(x) = 0. This number is called a
root of f(x). We will explore some of these algorithms this semester. We start with an
example that was first observed by the ancient Babylonians. If x is an approximation
to √a, then a/x ≈ √a is an equally good approximation. Furthermore,

x < √a  =⇒  1/√a < 1/x    (1.3)
        =⇒  √a = a/√a < a/x    (1.4)

and

x > √a  =⇒  √a > a/x    (1.5)

In other words, if x is any approximation to √a, then the actual value of √a must
lie between x and a/x. Since there is no reason to believe that x is any better an
approximation than a/x, or vice versa, this suggests that we can obtain a better
approximation to √a by averaging the two estimates:

x_1 = (1/2)(x_0 + a/x_0)    (1.6)

We can repeat this argument with x_1 to generate a better estimate x_2, and so forth,
leading us to the sequence of approximations x_0, x_1, x_2, . . . given by

x_{i+1} = (1/2)(x_i + a/x_i)    (1.7)
Equation 1.7 is an example of an iteration formula. It gives us a sequence of better and
better approximations to the number we are looking for. We will see iteration formulas
again and again throughout this class; they are one of the principal techniques by
which we summarize a numerical algorithm. The basic technique is summarized here:

Given: x_0
i = 0
Repeat
    x_{i+1} = f(x_i)
    i = i + 1
Until the approximation is “good enough”

A standard way of defining “good enough” is by using a tolerance. To do this we
keep repeating the calculation until the difference between two successive iterations
is less than the tolerance, which we will denote by ε:

Given: x_0, ε
i = 0
Repeat
    x_{i+1} = f(x_i)
    i = i + 1
Until |x_i − x_{i−1}| < ε
Finally, we note that there is always the possibility that there could be a “bug”
in our implementation that could lead to an infinite loop. Hence it is wise to always
include an iteration counter that will force termination:

Given: x_0, ε, N
i = 0
Repeat
    x_{i+1} = f(x_i)
    i = i + 1
Until |x_i − x_{i−1}| < ε or i > N
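To make this concrete, here is one way the loop above might be written in Mathematica. This is only a sketch (the names iterate, g, x0, tol, and nmax are ours, not anything standard):

In:=

(* generic iteration with a tolerance and an iteration cap; *)
(* returns the last iterate computed                        *)
iterate[g_, x0_, tol_, nmax_] :=
  Module[{xold = x0, xnew, i},
    xnew = g[x0]; i = 1;
    While[Abs[xnew - xold] >= tol && i < nmax,
      xold = xnew;
      xnew = g[xold];
      i = i + 1];
    xnew]

(* example: the Babylonian square root iteration for a = 2 *)
iterate[(#/2 + 1/#) &, 2.0, 10.^-10, 100]

Out:=

1.41421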


When you are debugging your code it is generally a good idea to use a very small
value of N such as 2 or 3, even if you expect a much larger number of iterations to
occur in the final version.
The only remaining problem is to figure out what to use for the first guess x_0.
This algorithm is so good, in fact, that it doesn’t much matter. We can use x_0 = 1
or x_0 = a. For example, to find √2 using x_0 = 2 we have

x_1 = (1/2)(2 + 2/2) = 1.5    (1.8)
x_2 = (1/2)(1.5 + 2/1.5) = 1.41667    (1.9)
x_3 = (1/2)(1.41667 + 2/1.41667) = 1.41422    (1.10)
and so forth. This algorithm converges rather quickly; in fact, it precisely reproduces
the same formula as Newton’s method (which we will discuss in lesson 10).
Throughout this class we will give examples using a programming language called
Mathematica. We have chosen this language because it is extremely powerful; uses
a fairly intuitive mathematical interface that is easy to learn rapidly; and allows us
to program without worrying about many of the details such as types, classes, and
objects that we need to worry over in more primitive languages such as Java, C++,
or FORTRAN. In Mathematica, we can implement the square root finding algorithm
quite easily. (Don’t worry about the details of this program if you don’t know Mathematica;
we’ll come back to that and do some training in the Math Lab before you
have to start coding for your homework.) The following calculation shows that the
algorithm converges to the first 50 digits in only 8 iterations!

In:=

f[x_] := (1/2)(x + 2/x)
NestList[f, 2.0`50, 8]

Out:=

{2.0000000000000000000000000000000000000000000000000,
1.5000000000000000000000000000000000000000000000000,
1.4166666666666666666666666666666666666666666666667,
1.4142156862745098039215686274509803921568627450980,
1.4142135623746899106262955788901349101165596221157,
1.4142135623730950488016896235025302436149819257762,
1.4142135623730950488016887242096980785696718753772,
1.4142135623730950488016887242096980785696718753769,
1.4142135623730950488016887242096980785696718753769}


From an implementation perspective, one could just catalog a table of algorithms
that give methods for solving various problems, and then proceed to translate these
algorithms into computer programs. We will see, however, that this blind approach
can lead to disaster, because a method that works under one set of conditions may not
work under another set of conditions. We will need to understand both the mathematics
underlying the algorithms as well as the nature of the computer representations
before we can be confident that an implementation will work. We will approach this
subject, then, from an interdisciplinary perspective, as neither mathematicians nor
computer scientists, but as mathematical scientists. We illustrate the necessity of
such an interdisciplinary approach with a tragic example.

The Patriot missile system used by the US Army is a surface-to-air missile (SAM)
used primarily as an advanced aerial interceptor, i.e., to target and destroy
incoming missiles. The acronym “Patriot” actually stands for “Phased Array Tracking
Radar to Intercept On Target;” a more politically charged version is “Protection
Against Threats, Real, Imagined, Or Theorized.” The system was developed during
the 1970’s, first deployed in 1984 as an anti-aircraft weapon, and in 1988 as an
anti-ballistic-missile defense. Patriots deployed during the first Gulf War in 1991 used a
24-bit integer counter to measure time in tenths of a second. This number needed to
be converted to floating point and used in a calculation to determine if and when a
missile should be fired. The following is taken from a GAO report:¹
“The heart of the Patriot system is its weapons control computer. It performs
the system’s major functions for tracking and intercepting a target, as well as other
battle management, command and control functions. The Patriot’s weapons control
computer used in Operation Desert Storm is based on a 1970s design with relatively
limited capability to perform high precision calculations.
“To carry out its mission, the Patriot’s weapons control computer obtains target
information from the system’s radar. The Patriot’s radar sends out electronic pulses
that scan the air space above it. When the pulses hit a target they are reflected back
to the radar system and shown as an object (or plot) on the Patriot’s display screens.
Patriot operators use the software to instruct the system to intercept certain types
of objects such as planes, cruise missiles, or tactical ballistic missiles (such as Scuds).
During Desert Storm the Patriot was instructed to intercept tactical ballistic missiles.
For the Patriot’s computer to identify, track, and intercept these missiles, important
information describing them was kept by the system’s range-gate algorithm.
“After the Patriot’s radar detects an airborne object that has the characteristics
of a Scud, the range gate–an electronic detection device within the radar system–
calculates an area in the air space where the system should next look for it ... Finding
an object within the calculated range gate area confirms that it is a Scud missile.
“The range gate’s prediction of where the Scud will next appear is a function of the
Scud’s known velocity and the time of the last radar detection. Velocity is a real number

¹United States General Accounting Office, Memorandum GAO/IMTEC-92-26, “Patriot Missile Software Problem.”


that can be expressed as a whole number and a decimal (e.g., 3750.2563...miles per
hour). Time is kept continuously by the system’s internal clock in tenths of seconds
but is expressed as an integer or whole number (e.g., 32, 33, 34...). The longer the
system has been running, the larger the number representing time. To predict where
the Scud will next appear, both time and velocity must be expressed as real numbers.
Because of the way the Patriot computer performs its calculations and the fact that
its registers are only 24 bits long, the conversion of time from an integer to a real
number cannot be any more precise than 24 bits. This conversion results in a loss
of precision causing a less accurate time calculation. The effect of this inaccuracy on
the range gate’s calculation is directly proportional to the target’s velocity and the
length of time the system has been running. Consequently, performing the conversion
after the Patriot has been running continuously for extended periods causes the range
gate to shift away from the center of the target, making it less likely that the target,
in this case a Scud, will be successfully intercepted.
“... after about 20 hours, the inaccurate time calculation becomes sufficiently large
to cause the radar to look in the wrong place for the target ... Army officials said
that they believed that ... Patriot users were not running their systems for 8 or more
hours at a time ... Significant shifts of the range gate away from the desired center
of the target could be eliminated by rebooting the system-turning the system off and
on–every few hours. Rebooting, which takes about 60 to 90 seconds, reinitializes the
computer’s clock, setting the time back to zero.
“... On February 25, Alpha Battery had been in operation for over 100 consecutive
hours ...”
Let’s examine this calculation in some detail. Each bit in a binary number represents
a power of 2. The bits to the right of the radix point represent fractions,
namely 2^{−1}, 2^{−2}, 2^{−3}, . . . as we move from left to right; the bits to the left
of the radix point represent 2^0, 2^1, 2^2, 2^3, . . . as we move to the left, starting at the
radix point. The nth bit to the right of the radix point then represents 2^{−n} and
the nth bit to the left represents 2^{n−1}. We can convert a binary number back to its
decimal representation by adding up these values. Let b_n = 1 or 0 represent the
nth bit. Then

decimal value = Σ_{whole bits} b_n × 2^{n−1} + Σ_{fractional bits} b_n × 2^{−n}    (1.11)

where “whole bits” means the bits to the left of the radix point and “fractional bits”
means the bits to the right of the radix point.
Google Calculator gives you a convenient way to convert between bases. In Google
Calculator a binary integer begins with “0b” (zero followed by the letter b). Unfortunately
it works only with integers, not fractions. If we enter the string

0b110011001100110011 in Decimal

in any Google search window it will return the number 209715.


To determine its value in decimal we observe that the least significant bit corresponds
to 2^{−21}, so we enter the string

2^(-21)*209715

in the search window. This returns a number that is very close to, but not precisely
equal to, one tenth.
Example 1.1. Find the decimal equivalent of the binary number 101101.011.

Solution.

101101.011 = 1 × 2^5 + 0 × 2^4 + 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0
             + 0 × 2^{−1} + 1 × 2^{−2} + 1 × 2^{−3}    (1.12)
           = 32 + 8 + 4 + 1 + 1/4 + 1/8    (1.13)
           = 45 3/8    (1.14)
In Mathematica we can do the conversion in example 1.1 quite easily; entering

2^^101101.011

returns the value

45.375

A related function is

BaseForm[2^^101101.011, 10]

which will also return the value 45.375. To go in the other direction, we can type in


BaseForm[45.375, 2]

which returns the value 101101.011₂. Unfortunately BaseForm only returns a string
representation of the number, not an actual binary number, so doing calculations
with the binary number requires a bit more work.
In the Patriot software, integers were converted to decimal numbers by multiplying
by the 24-bit binary representation of the decimal number 0.1, with one bit to the left
of the radix point and 23 bits to the right. This number is

m = 0.0001 1001 1001 1001 1001 100 = 209715/2097152    (1.15)

Spaces are used to separate every fourth bit to make the binary numbers easier to
read. Grouping in fours is convenient because 4 binary bits correspond to precisely
one hexadecimal (base 16) digit. (We can, of course, find this crucial number in
Mathematica by

BaseForm[0.1, 2]

which returns the string representation of the binary number shown above.)


Because the Patriot missile software used the number 209715/2097152 to approximate
the decimal fraction 1/10, it made an error of

ε = 1/10 − 209715/2097152 = 1/10485760 ≈ 9.54 × 10^{−8} seconds    (1.16)
per tick of the counter, i.e., about a tenth of a microsecond for each tenth of a second
counted. While this number may seem small, it adds up over time. Suppose, for example,
the counter runs for one hour; since there are 36,000 tenth-of-second ticks in an hour,
the total error that builds up is

ε_total = 36000/10485760 ≈ 0.003433 seconds    (1.17)

At the time of a Scud missile attack on Feb 25, 1991, the system operating at Dhahran,
Saudi Arabia had been operating for approximately 100 hours. The accumulated
roundoff error was therefore

ε_total = 36000 × 100/10485760 ≈ 0.3433 seconds    (1.18)
The clock was off by about 1/3 of a second. While this may still seem small, it
happens that Scud missiles travel at a speed of approximately Mach 5, or about 1650
meters per second (about 6000 km/hour). The Patriot missile system was using this
time calculation to determine the actual position of the incoming missile, so it made
an error of

ε ≈ (1650 meters/second) × (0.3433 seconds) ≈ 566 meters    (1.19)

The calculation was off by over half a kilometer. This caused the system to
repeatedly recycle and try to recalculate the position again. It was unable to converge,
and so a missile was allowed to penetrate the base’s defenses on 25 Feb 1991, killing
28 people. Ironically, the bug was known, and a patch correcting the problem had
been released on 16 Feb 1991, but it was still in the mail. It arrived one day too late,
on 26 Feb 1991. President Bush declared that hostilities had ended on 28 Feb.
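As a check on the arithmetic in this story, the entire error budget can be reproduced in Mathematica with exact rational arithmetic (a sketch; the variable names are ours, and 36000 is the number of tenth-of-second ticks in one hour):

In:=

m = 209715/2097152;     (* the 24-bit approximation of 1/10     *)
err = 1/10 - m;         (* error per tick of the 0.1 s counter  *)
N[{err, 36000 err, 100*36000 err, 1650*100*36000 err}]

Out:=

{9.53674*10^-8, 0.00343323, 0.343323, 566.483}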

Other Famous Numerical Errors

• Approximately 36 seconds after the launch of an Ariane rocket from French
Guiana on 4 June 1996, the rocket’s guidance system shut down because of a
software error. It was trying to calculate the rocket’s velocity and performed an
illegal conversion of a 64-bit real number into a 16-bit integer. Since the backup
system had the same software installed, it also shut off. So the rocket veered
off course and controllers were forced to activate the on-board self-destruct
mechanism. Strangely enough, this part of the calculation was only used before
launch and wasn’t even needed once the rocket took off. The rocket and its
cargo cost approximately $500 million.

• Over the course of 22 months starting in 1982, the Vancouver Stock Exchange
accumulated enough numerical error due to roundoff to reduce the correct value
of the index (1098.98) to 574.08.

• Under German law (in 1992) no party with less than 5 percent of the vote may
be seated in parliament. On April 5, 1992 the Green party obtained 4.97% of
the vote, but a computer program that prints out results was set to round to
one decimal place, showing exactly 5.0%. Hence the early official results showed
that the party had been seated.

• In 1995 Microsoft announced that some versions of their spreadsheet program
Excel make mistakes because of a base 10 to base 2 conversion error.

• An oil platform off the coast of Norway sank on August 23, 1991, at a cost of
nearly a billion dollars, as a result of an error in a finite element approximation,
a method that was used to calculate the linear elastic stresses on the structure
supporting the platform.

Lesson 2

Limits and Continuity

In the next sections we will make a brief review of some mathematical preliminaries
before we turn to a study of how numbers are represented in computers.

Definition 2.1 (Limit). We say “the limit of f(x) as x approaches x_0 is equal to
L” and write

lim_{x→x_0} f(x) = L    (2.1)

if, given any ε > 0, there exists some δ > 0 such that

|x − x_0| < δ =⇒ |f(x) − L| < ε    (2.2)

The value of the number δ is allowed to depend on the value of the number ε. This
concept of a limit is illustrated in the following figure: if you give me the value of ε,
I can find a value of δ such that |f(x) − L| < ε whenever |x − x_0| < δ.

[Figure: illustration of the ε-δ definition of a limit. You name ε; I will find δ so that
the graph of f(x) stays within ε of L whenever x is within δ of x_0; a smaller ε requires
a smaller box.]



Example 2.1. Show that lim_{x→3} √(x + 1) = 2.

Solution. Using the nomenclature of the definition,

f(x) = √(x + 1)    (2.3)
L = 2    (2.4)
x_0 = 3    (2.5)

Let ε > 0 be any small number. Then we need to find a number δ > 0 such that

|x − 3| < δ =⇒ |√(x + 1) − 2| < ε    (2.6)

or equivalently, by the definition of absolute values,

−δ < x − 3 < δ =⇒ −ε < √(x + 1) − 2 < ε    (2.7)

What condition does this impose on δ? We calculate

−δ + 3 < x < δ + 3    (2.8)
−δ + 4 < x + 1 < δ + 4    (2.9)
√(4 − δ) < √(x + 1) < √(4 + δ)    (2.10)
√(4 − δ) − 2 < √(x + 1) − 2 < √(4 + δ) − 2    (2.11)

But to get equation 2.7 out of this we would need

−ε < √(4 − δ) − 2 < √(x + 1) − 2 < √(4 + δ) − 2 < ε    (2.12)

This leads to the following two conditions:

−ε < √(4 − δ) − 2    (2.13)
√(4 + δ) − 2 < ε    (2.14)

By condition 2.13:

2 − ε < √(4 − δ)    (2.15)
(2 − ε)^2 < 4 − δ    (2.16)
δ < 4 − (2 − ε)^2 = ε(4 − ε)    (2.17)

Note that the last quantity is going to be positive for small ε. By condition 2.14:

√(4 + δ) < ε + 2    (2.18)
4 + δ < (ε + 2)^2    (2.19)
δ < ε(4 + ε)    (2.20)


Since we need both conditions 2.13 and 2.14 to hold, we require both equations
2.17 and 2.20. Since ε > 0,

4 − ε < 4 + ε    (2.21)
ε(4 − ε) < ε(4 + ε)    (2.22)

we determine that condition 2.17 is more restrictive, and when it holds we are ensured
that condition 2.20 is also met. This gives us enough information to construct a proof,
which we present immediately.

Let ε > 0. We need to show that there is some δ such that |x − x_0| < δ implies
that |f(x) − L| < ε. In other words we need to show that there is some δ such that

|x − 3| < δ    (2.23)

implies that

|√(x + 1) − 2| < ε    (2.24)

We do this by choosing

δ = ε(4 − ε)    (2.25)

hence

|x − 3| < δ = ε(4 − ε)    (2.26)
−ε(4 − ε) < x − 3 < ε(4 − ε)    (2.27)
ε^2 − 4ε + 4 < x + 1 < −ε^2 + 4ε + 4    (2.28)
(ε − 2)^2 < x + 1 < −ε^2 + 4ε + 4 < ε^2 + 4ε + 4 = (ε + 2)^2    (2.29)
√((ε − 2)^2) < √(x + 1) < √((ε + 2)^2)    (2.30)

From the inequality on the left, we know that

√(x + 1) > √((ε − 2)^2) = |ε − 2|    (2.31)

Since ε is small, |ε − 2| = 2 − ε, so

2 − ε < √(x + 1)    (2.32)

Combining this with the inequality on the right hand side of 2.30 we obtain

2 − ε < √(x + 1) < 2 + ε    (2.33)
−ε < √(x + 1) − 2 < ε    (2.34)
|√(x + 1) − 2| < ε    (2.35)
|f(x) − L| < ε    (2.36)

This is sufficient to prove that lim_{x→3} f(x) = L; hence we conclude that

lim_{x→3} √(x + 1) = 2    (2.37)


Definition 2.2 (Continuity). We say “f(x) is continuous at x_0” or just “f is
continuous at x_0” if

lim_{x→x_0} f(x) = f(x_0)    (2.38)

Example 2.2. The function

f(x) = √(x + 1)    (2.39)

is continuous at x = 3, because, as we showed in the previous example,

lim_{x→3} f(x) = 2 = f(3)    (2.40)

Example 2.3. The function

f(x) = { √(1 + x),  x ≠ 3
       { 0,         x = 3    (2.41)

is not continuous at x = 3 because

lim_{x→3} f(x) = 2 ≠ 0 = f(3)    (2.42)

as illustrated in the figure.

[Figure: graph of f(x) = (1 + x)^{1/2} with the single value at x = 3 redefined to 0,
showing that f(x) is not continuous at x = 3.]

Lesson 3

Sequences

We will frequently use iterative processes in our study of numerical analysis. In such
a process, one computes a sequence of values, usually in a loop or other similar control
structure. Such iterative processes can be related to the concept of a sequence: at
each iteration of the loop we calculate the value of some number a_n. The complete
set of all the a_n is a sequence. More specifically, we have the following definition.

Definition 3.1 (Sequence). A sequence is a function that maps the positive integers
to the real numbers:

n ↦ x_n,  n ∈ Z⁺,  x_n ∈ R    (3.1)

and we write the sequence as one of the following:

x_1, x_2, x_3, . . .    (3.2)
x_n    (3.3)
{x_n}_{n=1}^∞    (3.4)

Sometimes we will define a sequence on the set of non-negative integers, {0} ∪ Z⁺,
rather than just the positive integers.

Definition 3.2 (Convergence of a Sequence). We say that the sequence x_n converges
to the limit x if, given any real number ε > 0, we can find an integer N
(usually large, and usually N will depend on ε) such that |x − x_n| < ε for all n > N,
and we write

x_n → x as n → ∞    (3.5)

or

lim_{n→∞} x_n = x    (3.6)

The concept of a limit is illustrated in the following figure.


[Figure: convergence of a sequence. You name ε; I will find N such that for all n > N
the points x_n fall in the grey band of half-width ε about the limit L.]

Example 3.1. Show that the sequence x_n = 3(1 + 2^n)/2^n → 3 as n → ∞.
Solution. We need to show that for any ε > 0 there exists some N such that

|x_n − x| = |x_n − 3| < ε    (3.7)

for all n > N. We begin by observing that

|x_n − 3| = |3(1 + 2^n)/2^n − 3|    (3.8)
          = |(3 + 3(2^n) − 3(2^n))/2^n|    (3.9)
          = 3/2^n    (3.10)

which can be made as small as we like by choosing n sufficiently large.
Let ε > 0 be given. Then so long as

3/2^n < ε    (3.11)

we will have

|x_n − 3| < ε    (3.12)


To find the value of N, we solve 3.11 for n:

2^n > 3/ε    (3.13)
n ln 2 > ln(3/ε)    (3.14)
n > ln(3/ε)/ln 2 = log_2(3/ε)    (3.15)

Hence given any ε, we can choose any integer N > log_2(3/ε); then we are ensured
that |x_n − 3| < ε for all n > N, which means that x_n → 3.
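We can verify this bound numerically (a quick sketch; the choice ε = 10^−6 is arbitrary and ours):

In:=

xn[n_] := 3 (1 + 2^n)/2^n;
eps = 10.^-6;
{Log[2, 3/eps], Abs[xn[22] - 3] < eps}  (* any integer N >= 22 works *)

Out:=

{21.5165, True}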

Theorem 3.1. If f(x) is continuous at the point x = c and x_n is a converging
sequence such that x_n → c, then

lim_{n→∞} f(x_n) = f( lim_{n→∞} x_n )    (3.16)

or, equivalently,

f(x_n) → f(c)    (3.17)

This result is sometimes stated as “the limit of a function of a sequence is the
function of the limit of the sequence.”
Example 3.2. Find lim_{n→∞} √(1 + 3(1 + 2^n)/2^n).

Solution. Let f(x) = √(x + 1). This is a continuous function at x = 3, as we showed
in an earlier example. Furthermore, we showed that the sequence x_n = 3(1 + 2^n)/2^n → 3
as n → ∞. Hence by the theorem,

f(x_n) → f(3) = √(3 + 1) = 2    (3.18)

Lesson 4

Theorems About Derivatives

Definition 4.1 (Derivative). The derivative is given by either of the following two
equivalent formulas:

f′(x_0) = lim_{h→0} (f(x_0 + h) − f(x_0))/h = lim_{x→x_0} (f(x) − f(x_0))/(x − x_0)    (4.1)

The second definition can be derived from the first with the substitution x = h + x_0.
Theorem 4.1 (Intermediate Value Theorem (IVT)). Suppose that f(x) is a
continuous function on the interval [a, b], and that K is a number between f(a) and
f(b). Then there exists at least one (and possibly many) number(s) c ∈ [a, b] such
that f(c) = K.

Figure 4.1: Illustration of the Intermediate Value Theorem.


Thus a continuous function takes on all values between the values it obtains at the
endpoints of its domain (see figure 4.1). The following corollary is illustrated in figure
4.2.

17
18 LESSON 4. THEOREMS ABOUT DERIVATIVES

Corollary 4.1. Under the same conditions as the IVT, if f(a) and f(b) have different
signs, then there is a root between a and b.

Corollary 4.2. Under the same conditions as the IVT, if f(a)f(b) < 0, then there
is a root in the interval (a, b).

Proof. If f(a)f(b) < 0 then either f(a) < 0 and f(b) > 0, or f(a) > 0 and f(b) < 0.
In either case, the number 0 is between f(a) and f(b). Hence there is some number
c such that f(c) = 0.

Figure 4.2: If f (a)f (b) < 0 then there is a root between a and b.


Theorem 4.2 (Mean Value Theorem (MVT)). If f is continuous on [a, b] and
differentiable on (a, b) then there exists some c ∈ (a, b) such that

f′(c) = (f(b) − f(a))/(b − a)    (4.2)

The interpretation of the mean value theorem is as follows: there is at least one
point in the interval [a, b] where the slope of f(x) is identical to the slope of the
straight line between the end points of the function. See figure 4.3.

Figure 4.3: Illustration of the Mean Value Theorem.



Theorem 4.3 (Rolle’s Theorem). If f is continuous on [a, b] and differentiable on
(a, b), and f(a) = f(b), then there exists some c ∈ (a, b) such that f′(c) = 0.

Theorem 4.4 (Generalized Rolle’s Theorem). Suppose f is continuous on [a, b]
and n times differentiable on (a, b). If f(x) = 0 at n + 1 distinct points x_0, x_1, . . . , x_n
in [a, b], then there is a number c ∈ (a, b) such that f^(n)(c) = 0.
The Generalized Rolle’s theorem is illustrated in figure 4.4. The top curve shows
a plot of some function f(x) with unique zeroes at x_1 < x_2 < x_3 < x_4. Rolle’s
theorem then tells us that between each pair of points (x_1, x_2), (x_2, x_3) and (x_3, x_4)
there are points p_1, p_2, p_3 such that f′(p_1) = 0, f′(p_2) = 0, and f′(p_3) = 0. With this
information we can sketch a plot of f′(x), as shown in the middle graph. We know
that f′(x) has three unique zeroes at p_1, p_2 and p_3. Hence by Rolle’s theorem, there
is a point q_1 ∈ (p_1, p_2), and a point q_2 ∈ (p_2, p_3), where the derivative of f′(x) is
zero, i.e., f″(q_1) = f″(q_2) = 0. Next, we can sketch a plot of f″(x), which is shown
in the bottom curve. It has two zeroes, at q_1 and q_2. Hence by Rolle’s theorem there
is some point r_1 ∈ (q_1, q_2) such that f‴(r_1) = 0. But this is precisely the prediction of
the Generalized Rolle’s Theorem: since f is continuous and has 4 unique zeroes
(i.e., n + 1 = 4), there is some point c with f^(n)(c) = 0, where n = 3.

Figure 4.4: Illustration of the Generalized Rolle’s Theorem.

[Figure: three stacked plots of f(x), f′(x), and f″(x), with zeroes at x_1, . . . , x_4;
p_1, p_2, p_3; and q_1, q_2, respectively; r_1 marks the zero of f‴.]

Theorem 4.5 (Extreme Value Theorem (EVT)). If f is continuous on an interval
[a, b] then it takes on both a minimum and a maximum value on [a, b]. If f
is differentiable on (a, b) then the extrema occur either at the endpoints or where
f′(x) = 0.


Definition 4.2 (Continuously Differentiable). We say a function f is continuously
differentiable on [a, b] if f is differentiable on (a, b) and its derivative is continuous
on [a, b].

Theorem 4.6 (Taylor’s Theorem with Remainder). Let f be n times continuously
differentiable and suppose that f^(n+1) exists on [a, b], and let x_0 be any point
in (a, b). Then for all x ∈ [a, b] there exists some number c ∈ [a, b] such that

f(x) = P_n(x) + R_n(x)    (4.3)

where

P_n(x) = Σ_{k=0}^{n} f^(k)(x_0)/k! (x − x_0)^k    (4.4)
       = f(x_0) + (x − x_0) f′(x_0) + · · · + f^(n)(x_0)/n! (x − x_0)^n    (4.5)

and

R_n(x) = f^(n+1)(c)/(n+1)! (x − x_0)^{n+1}    (4.6)

The polynomial P_n(x) is called the Taylor Polynomial of Order n and the function
R_n(x) is called the Remainder.

When x_0 = 0, Taylor’s theorem gives the Maclaurin Polynomials:

P_n(x) = Σ_{k=0}^{n} f^(k)(0)/k! x^k    (4.7)
       = f(0) + x f′(0) + · · · + f^(n)(0)/n! x^n    (4.8)

The corresponding Maclaurin Remainder Formula is

R_n(x) = f^(n+1)(c)/(n+1)! x^{n+1}    (4.9)

where c is some number between 0 and x, inclusive.

Example 4.1. Find the Taylor polynomial of order 3 for f(x) = √(x + 1) about the
point x_0 = 0, and find the remainder.
Solution.

f(x) = √(x + 1),  f(0) = 1    (4.10)
f′(x) = (1/2)(x + 1)^{−1/2},  f′(0) = 1/2    (4.11)


f″(x) = −(1/4)(x + 1)^{−3/2},  f″(0) = −1/4    (4.12)
f‴(x) = (3/8)(x + 1)^{−5/2},  f‴(0) = 3/8    (4.13)
f^(4)(x) = −(15/16)(x + 1)^{−7/2}    (4.14)

Hence

P_3(x) = f(0) + x f′(0) + (x^2/2) f″(0) + (x^3/3!) f‴(0)    (4.15)
       = 1 + (1/2)x + (1/2)(−1/4)x^2 + (1/6)(3/8)x^3    (4.16)
       = 1 + x/2 − x^2/8 + x^3/16    (4.17)

and similarly

R_3(x) = f^(4)(c)/4! x^4 = −15(c + 1)^{−7/2} x^4 / 384    (4.18)
Example 4.2. Use the Maclaurin polynomial found in the previous example to
estimate √2.

Solution. The formula in the previous example is for f(x) = √(1 + x), so to get √2
we need to use x = 1. Thus

√2 ≈ 1 + 1/2 − 1/8 + 1/16 = 1.4375    (4.19)
Example 4.3. Use the remainder formula found in the previous example to determine
the maximum error in calculating √2 with this formula.

Solution. We start with the formula

R_3(1) = −15(c + 1)^{−7/2} / 384    (4.20)

where c is some number between 0 and 1 (because 1 is the argument of f(x) at which
we evaluated the polynomial). The maximum magnitude occurs when c = 0; hence

|R_3(1)| ≤ 15/384 ≈ 0.0391    (4.21)

Thus we can conclude that √2 = 1.4375 ± 0.0391.
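Mathematica can confirm both the polynomial and the error estimate using the built-in Series command (a sketch, not part of the examples above):

In:=

p3 = Normal[Series[Sqrt[1 + x], {x, 0, 3}]]

Out:=

1 + x/2 - x^2/8 + x^3/16

In:=

{p3 /. x -> 1, N[Sqrt[2] - 23/16]}  (* actual error is within the bound *)

Out:=

{23/16, -0.0232864}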

Lesson 5

Error

As software designers we will need to understand the sources of error in a numerical
calculation if we want to avoid disasters such as the one we discussed in lesson 1. To
understand error, we will also have to learn about how numbers are represented in
computers.

One important fact to always remember is that computers do not represent numbers
exactly: they only represent them with a finite number of digits. And since the base
representation used by the computer is rarely base 10, it may not even be possible
to represent a number that we are accustomed to representing exactly, such as 1/10,
which has a repeating fraction in base 2:

0.1_{10} = 0.0001 1001 1001 1001 . . ._{2}    (5.1)

Since the computer represents numbers with a finite number of bits, it will have to
truncate this approximation after a finite number of repeats of the 1001. This will
lead to a small error, which, as we have seen, can compound into a very large error.
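In Mathematica we can see the repeating block directly (a sketch; applied to an exact rational, RealDigits returns the digit pattern with the infinitely repeating block set off in its own sublist):

In:=

RealDigits[1/10, 2]
(* the block {1, 1, 0, 0} repeats forever, beginning at the
   fourth binary place *)

Out:=

{{{1, 1, 0, 0}}, -3}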
We will use the following definitions:

error = approximate value − true value    (5.2)
true value = approximate value + remainder    (5.3)
relative error = error / true value    (5.4)

This gives us the useful result

approximate value = (true value) × (1 + relative error)    (5.5)

We will sometimes use the term unit in the last place or ulp to represent the value
of a 1 when placed in the rightmost digit of a numerical representation of a number.
For example, if we represent the irrational number

e = 2.718281828459045 . . .    (5.6)

with the approximation

ê = 2.71828    (5.7)

then we say that

1 ulp = 0.00001    (5.8)

and that the error is

ε = 2.71828 − 2.718281828459045 . . . ≈ −0.18 ulps    (5.9)

The following bit of Mathematica code will find the ulp on your computer:

In:=

ulp = 1.0;
While[(1 + ulp) - 1 > 0, ulp = ulp/2];
Print["ulp = ", 2*ulp];

Out:=

2.22045 × 10^-16

One caution about this program: if you forget to put the decimal point in the initial
assignment ulp=1.0, and just write it as ulp=1, the program will run in an infinite
loop, because all of the calculations are then carried out in exact rational arithmetic.
To see what is happening, insert a print statement in the loop.
Because the error is often small, it is sometimes more meaningful to measure the

decimal places of accuracy = −log_{10} |error|    (5.10)

The decimal places of accuracy gives approximately the number of digits that are
accurately represented to the right of the decimal point. In our approximation to e
we had 1 ulp = 10^{−5}, so that an error of 1 ulp represents 5 decimal places of
accuracy. We are sometimes only interested in the relative error, for which we can
define the

digits of accuracy = −log_{10} |relative error|    (5.11)

The digits of accuracy gives approximately the total number of digits of accuracy,
starting from the first nonzero digit. So 3.124, 3124, and 0.003124 all have 4 digits of
accuracy, whereas they have 3, 0, and 6 decimal places of accuracy, respectively.
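Both measures are one-liners in Mathematica (a sketch using our earlier approximation to e; the variable name approx is ours):

In:=

approx = 2.71828;
{-Log[10, Abs[approx - E]],          (* decimal places of accuracy *)
 -Log[10, Abs[(approx - E)/E]]}      (* digits of accuracy         *)

Out:=

{5.73791, 6.17221}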
We will see that there are two sources of error that we will have to worry about
in a computer program:

• data error: error that is already present in the input data before a computation
begins. Typical sources of data error include:

  – measurement error: the number supplied to the program is wrong.
  – previous computation: successive computations depend on earlier computations;
    if the result of one computation that has an error in it is used
    as the input to another computation, that causes a data error.
  – modeling error: the theory behind the implementation could be approximate.
    For example, one might model a gravity force by F = −mg (an
    approximation that is only valid near the surface of the earth) instead of
    F = −GMm/r^2.

• computational error: errors introduced by the computation itself. Computational
error is typically classified into the following subclasses:

  – roundoff error: error due to the fact that computers use a finite number
    of digits to represent numbers.
  – truncation error: error due to the truncation of an infinite process, such
    as calculating only a finite number of terms in a Taylor Series approximation.

Let x be the true value of some quantity, let x̃ be the same quantity with data error
in it, and let the function f(x) represent the thing we are trying to compute. Then
the

propagated data error = f(x̃) − f(x)    (5.12)

Note that the propagated data error defined in this way has nothing to do with the
computer implementation of how we calculate f: it only depends on the true definition
of f. For example, suppose we want to calculate cos(π/3) where we supply as input
the value π ≈ 3.1416. Then the

propagated data error = cos(3.1416/3) − cos(π/3)    (5.13)

The computational error depends on the way in which we calculate f. Suppose we
define f̂ to be the computer implementation that is used to calculate the true function
f. For example, we might use the first 3 terms of the Taylor series for f(x) = cos(x):

f̂(x) ≈ 1 − x^2/2 + x^4/24    (5.14)

We define the

computational error = f̂(x̃) − f(x̃)    (5.15)

Then for our example implementation,

computational error ≈ 1 − (3.1416/3)^2/2 + (3.1416/3)^4/24 − cos(3.1416/3)    (5.16)
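Both error types for this cosine example can be evaluated directly (a sketch; fhat is our name for the truncated-series implementation):

In:=

fhat[x_] := 1 - x^2/2 + x^4/24;
Cos[3.1416/3] - Cos[Pi/3]       (* propagated data error, about -2.1*10^-6 *)
fhat[3.1416/3] - Cos[3.1416/3]  (* computational error, about 1.8*10^-3    *)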


Example 5.1. Calculate the relative errors in the ratio

f(x, y) = ((x − y)/(x + y))^2    (5.17)

with x = 100 and y = 99, assuming an input error of (a) 0.1% and (b) 1.0% in x.

Solution. The exact solution is

f(100, 99) = ((100 − 99)/(100 + 99))^2 = (1/199)^2 = 1/39601 ≈ 0.00002525    (5.18)

If we start with an error of 0.1% in x, i.e., x̃ = 100.1, then

f(100.1, 99) = ((100.1 − 99)/(100.1 + 99))^2 = 0.00003052    (5.19)

and the relative error is

(f(100.1, 99) − f(100, 99))/f(100, 99) = (0.00003052 − 0.00002525)/0.00002525 = 0.21    (5.20)

i.e., a 0.1% error in the input leads to a 21% error in the result. If we are off by as
much as a full percent, say x̃ = 101, then

f(101, 99) = 1/10000    (5.21)

hence the relative error is

(f(101, 99) − f(100, 99))/f(100, 99) = (0.0001 − 0.00002525)/0.00002525 = 2.9601    (5.22)

A data error of 1% gives a propagated error of 296%.
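The same computation can be scripted (a sketch; relerr is a helper name of ours):

In:=

f[x_, y_] := ((x - y)/(x + y))^2;
relerr[xt_] := (f[xt, 99.] - f[100., 99.])/f[100., 99.];
{relerr[100.1], relerr[101.]}

Out:=

{0.208785, 2.9601}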

Lesson 6

Number Representation

Number representations in computers are limited because they only store a finite
number of digits. While the loss of information here is obvious for irrational numbers
such as π or √2, what is not obvious at first glance is that even simple integer
operations can be seriously affected. Before proceeding to a formal description of
number representations we present a simple example of a computer that can store
3-digit decimal numbers. What is the best way to represent this kind of number?
Our first guess might be to use a representation that contains 3 machine digits such
as:

d d d

where each “d” represents a digit in the range 0, 1, . . . , 9. This is good for numbers
such as 547 or 612, but what about negative numbers such as -43? And what happens
if we try to add two numbers together, such as 547 + 612? There is no way to represent
1159 in this scheme. So when we try to add two numbers whose sum is larger than
999, we get an error called an “overflow.”

There are two standard ways to represent negative numbers. One way is to use the
following mapping: ‘000’ represents -500, ‘001’ represents -499, ‘002’ represents -498,
..., ‘998’ represents 498, ‘999’ represents 499. In this way we shift the representation
from one that represents all integers 0 ≤ z ≤ 999 to one which represents only integers
in the range −500 ≤ z ≤ 499. This method is called an “excess 500” representation:
the number actually stored in memory is 500 in excess of the number it represents.

A simpler method is to add a sign bit:

s d d d

where s is a sign bit that is only allowed to take on two values: 1 or 0, with 0
representing a positive number and 1 representing a negative number. This allows
us to represent everything in the range −999 ≤ z ≤ 999. So we represent 765 as

0 7 6 5

and -43 by

1 0 4 3

Of course neither of these methods allows us to represent anything besides integers.
So we propose another solution: add an exponent.

s d d d e

The number represented by this scheme is given by

z = ±(0.ddd) × 10^e    (6.1)

allowing us to represent a much larger range of integers. If we want to represent
fractions we could use an excess-5 representation for the exponent, allowing it to
represent numbers from -5 to 4; for now, however, we will restrict our computer to
only be able to store integers, and let the exponent range from 0 to 9.

Now let’s see what happens when we add 547 + 612 = 0.547 × 10^3 + 0.612 × 10^3:

0 5 4 7 3
0 6 1 2 3

The answer is 1159, which cannot be represented in three digits, so some sort of
approximation scheme is needed. Two common schemes include:

• chopping: drop the final digit, 1159 ≈ 0.115 × 10^4:

0 1 1 5 4

• rounding: round off the final digit, 1159 ≈ 0.116 × 10^4:

0 1 1 6 4

The following example illustrates one of the dangers of these approximations.

Example 6.1. Calculate the average of two numbers x and y using the formula

average = (x + y)/2    (6.2)

for

a) x = 563, y = 566, using chopping
b) x = 568, y = 566, using rounding


Solution. For (a), we find using chopping that

563 + 566 = 1129 ≈ 1120    (6.3)
average = 1120/2 = 560    (6.4)

which is not even between the two values 563 and 566. Had we used rounding, we
would have rounded 1129 to 1130 and obtained the answer 565, which is as close as
we can get to the exact answer 564.5.

For (b) we use rounding:

568 + 566 = 1134 ≈ 1130    (6.5)
average = 1130/2 = 565    (6.6)

Again, the answer 565 is not between the two original numbers. We would have
obtained the same answer with chopping.
We will use the operator Fl (for “Float” or “Floating Point,” the topic of the next
lesson) to represent our “approximation.” When we round, for example, we have

Fl(1137) = 1140    (6.7)
Fl(1131) = 1130    (6.8)

while for chopping

Fl(1137) = Fl(1131) = 1130    (6.9)

When we calculated our average we used

average = Fl( Fl(Fl(a) + Fl(b)) / Fl(2) )    (6.10)

One way to improve the situation we found in the previous example is to use the
revised formula

average = a + (b − a)/2    (6.11)

Mathematically, both equations 6.2 and 6.11 are identical, but they will give us
different results when implemented in a computer, because of how the Fl operator is
applied:

average = Fl( Fl(a) + Fl( Fl(Fl(b) − Fl(a)) / Fl(2) ) )    (6.12)

Example 6.2. Repeat the previous example using the revised formula 6.11.


Solution. In (a) we used chopping to find the average of 563 and 566:

average = Fl( Fl(563) + Fl( Fl(Fl(566) − Fl(563)) / Fl(2) ) )    (6.13)
        = Fl( 563 + Fl( Fl(566 − 563)/2 ) )    (6.14)
        = Fl( 563 + Fl(3/2) )    (6.15)
        = Fl( 563 + Fl(1.5) )    (6.16)
        = Fl( 563 + 1 )    (6.17)
        = Fl(564)    (6.18)
        = 564    (6.19)

In (b) we used rounding to find the average of 568 and 566:

average = Fl( Fl(568) + Fl( Fl(Fl(566) − Fl(568)) / Fl(2) ) )    (6.20)
        = Fl( 568 + Fl( Fl(566 − 568)/2 ) )    (6.21)
        = Fl( 568 + Fl( Fl(−2)/2 ) )    (6.22)
        = Fl( 568 + Fl(−2/2) )    (6.23)
        = Fl( 568 + Fl(−1) )    (6.24)
        = Fl( 568 − 1 )    (6.25)
        = Fl(567)    (6.26)
        = 567    (6.27)

Both results now lie between the two input values.
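The whole three-digit machine can be simulated in Mathematica (a sketch; fl, which rounds to three significant digits, and the other names are ours; this reproduces the rounding case):

In:=

fl[x_] := If[x == 0, 0,
   With[{p = Floor[Log[10, Abs[x]]] - 2},
    Round[x/10^p] 10^p]];                        (* 3 significant digits *)
avg1[a_, b_] := fl[fl[fl[a] + fl[b]]/fl[2]];     (* equation 6.2  *)
avg2[a_, b_] := fl[fl[a] + fl[fl[fl[b] - fl[a]]/fl[2]]];  (* equation 6.11 *)
{avg1[568, 566], avg2[568, 566]}

Out:=

{565, 567}

A chopping version would use Floor in place of Round for positive arguments.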

Lesson 7

Fixed and Floating Point

The two most common representations of numbers in computers are:

• Fixed point representation: the sign and the radix point have a fixed location:

  sign | digits

  This representation is commonly used for integers.

• Floating point representation: in addition to providing space for the sign
  and the digits, space is also provided to specify the location of the radix point.
  The field that specifies this location is usually called the exponent while the
  field that specifies the digits is called a mantissa.

  sign | exponent | mantissa

Most modern computers have the ability to store both fixed point and floating point
numbers; fixed point representations are typically used for integer and boolean
variables. In some cases the representation will span many computer bytes. Floating
point representations may be implemented in either hardware or software or both.
For example, a typical “32 bit floating point” computer provides hardware (e.g.,
memory, registers, and arithmetic operations such as addition and multiplication) for
a floating point representation that uses a total of 32 bits. High level languages such
as C or FORTRAN also provide additional representations, such as 64-bit “double
precision” and 128-bit “quadruple precision.” The details of how 8-bit bytes are
mapped to 32-bit long integers or 128-bit quadruple precision floating point reals are
of no concern to us here, only the ultimate representation. As one text says, “the
details of how numbers are represented do not concern us in numerical analysis;
rather our concern is whether a number is representable.”¹

¹Skeel and Keiper, page 39.


Definition 7.1. A real number x is called an n-digit number if it can be expressed
as

x = ±d_1 d_2 · · · d_n × 10^e    (7.1)

for some integer e and digits d_1, . . . , d_n.
There are lots of ways we can represent any real number as a floating point n-digit
number; e.g., we can represent 467.2 as

467.2 = 467.2 × 10^0    (7.2)
      = 4.672 × 10^2    (7.3)
      = 0.04672 × 10^4    (7.4)

Of course this leads to several different machine representations. Most implementations
typically choose a particular normalization for the number, e.g., represent
it in such a way that the first digit after the radix point is nonzero. Thus we would
represent 467.2 as 0.4672 × 10^3 (the leading zero to the left of the decimal point
would not actually be stored). In this way we can find a unique representation for
each number. (Zero is sometimes an exception, as we will see.)
The most standard notations are given by the IEEE Floating Point Standard
(IEEE-754). These standards provide 32-bit, 64-bit, and 128-bit floating point
representations. The IEEE 32-bit standard representation can store numbers in the
approximate range

1.17 × 10^{−38} < x < 3.4 × 10^{38}    (7.5)

with a precision of around 7 digits (2^23 ≈ 8 × 10^6):

Sign    Exponent                  Mantissa
1 bit   8 bits                    23 bits
s       e = e_1 e_2 · · · e_8     m = d_1 d_2 d_3 d_4 · · · d_23

The 8-bit exponent can take on 256 possible values; the values with all zeroes
or all ones (255) have special meanings. The remaining values 1, . . . , 254 are used
to represent the true exponent in an excess-127 representation, so that the true
exponent e − 127 satisfies −126 ≤ e − 127 ≤ 127. The general conversion formula is

x = (−1)^s 2^{e−127} × (1.d_1 d_2 d_3 . . . d_23)    (7.6)

If e = 0 and m ≠ 0 then

x = (−1)^s 2^{−126} × (0.d_1 d_2 d_3 . . . d_23)    (7.7)
If e = 255 = 11111111_2, then

x = { nan  if m ≠ 0,
    { −∞   if m = 0 and s = 1,    (7.8)
    { ∞    if m = 0 and s = 0.


Here “nan” is a special symbol used to mean “not a number.” Finally, there are two
different ways to represent zero, which we call 0 and −0:

x = { 0    if e = m = s = 0,
    { −0   if e = m = 0 and s = 1.    (7.9)
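To make the conversion formula concrete, here is a small decoder for the normalized case of equation 7.6 (a sketch; decode32 is our name, and none of the special cases are handled):

In:=

(* decode a normalized IEEE 32-bit pattern given as a list of
   32 bits: 1 sign bit, 8 exponent bits, 23 mantissa bits *)
decode32[bits_List] :=
  Module[{s = bits[[1]], e, m},
    e = FromDigits[bits[[2 ;; 9]], 2];
    m = FromDigits[bits[[10 ;; 32]], 2]/2^23;
    (-1)^s 2^(e - 127) (1 + m)];

(* sign 0, exponent 127, mantissa 0 encodes the number 1 *)
decode32[Join[{0}, IntegerDigits[127, 2, 8], Table[0, {23}]]]

Out:=

1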

The related IEEE 64-bit standard representation can store numbers in the approximate
range

2.22 × 10^{−308} < x < 1.8 × 10^{308}    (7.10)

with a precision of around 15 to 16 digits (2^52 ≈ 4.5 × 10^{15}):

Sign    Exponent                   Mantissa
1 bit   11 bits                    52 bits
s       e = e_1 e_2 · · · e_11     m = d_1 d_2 d_3 d_4 · · · d_52

The general formula is now

x = (−1)^s 2^{e−1023} × (1.d_1 d_2 d_3 d_4 · · · d_52)    (7.11)

If e = 0 and m ≠ 0 then

x = (−1)^s 2^{−1022} × (0.d_1 d_2 d_3 . . . d_52)    (7.12)

If e = 2047 = 11111111111_2, then

x = { nan  if m ≠ 0,
    { −∞   if m = 0 and s = 1,    (7.13)
    { ∞    if m = 0 and s = 0.

The same representations for 0 and −0 apply.

The IEEE 128-bit standard uses 1 sign bit, 15 exponent bits, and a 112-bit stored
mantissa (113 bits of precision, counting the implicit leading bit). The various
representations are modified accordingly. This standard is used for quadruple precision
numbers in various computer languages.

The IEEE standard allows for up to four different methods of rounding:

• Unbiased: round to the nearest value. If the number falls midway it is rounded
to the nearest value with an even (zero) least significant bit. This mode is
required to be the default.

• Towards zero: round off in the direction of zero.

• Towards positive infinity: round off in the direction of ∞ (round “up”).

• Towards negative infinity: round off in the direction of −∞ (round “down”).


The following overflow and underflow conditions that may occur as a result of an
operation are not representable and should generate error messages:

• Negative overflow: negative numbers less than −(2 − 2^{−23}) × 2^{127} (32 bit) or
−(2 − 2^{−52}) × 2^{1023} (64 bit).

• Negative underflow: negative numbers greater than −2^{−149} (32 bit denormalized
(leading 0)), −2^{−126} (32 bit normalized (leading 1)), −2^{−1022} (64 bit
normalized), or −2^{−1074} (64 bit denormalized).

• Positive underflow: positive numbers less than 2^{−149} (32 bit denormalized),
2^{−126} (32 bit normalized), 2^{−1022} (64 bit normalized), or 2^{−1074} (64 bit
denormalized).

• Positive overflow: positive numbers greater than (2 − 2^{−23}) × 2^{127} (32 bit)
or (2 − 2^{−52}) × 2^{1023} (64 bit).
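The extreme values quoted in this list follow directly from the representation; a quick sketch evaluating the 32-bit limits:

In:=

N[{(2 - 2^-23) 2^127, 2^-126, 2^-149}]

Out:=

{3.40282*10^38, 1.17549*10^-38, 1.4013*10^-45}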

Lesson 8

Roots and Bisection

The first numerical problem we will face is root finding: given a function f(x), find a
number r such that f(r) = 0. The bisection algorithm uses a binary search
strategy. It assumes we already know two points a and b, one to the right of the root
and one to the left of the root. Since the two points are on opposite sides of the root,
the function values must be on opposite sides of the x-axis; hence either f(a) > 0 and
f(b) < 0, if the function is decreasing through the root, or f(a) < 0 and f(b) > 0, when
the function is increasing as it passes through the root. In either case,

f(a)f(b) < 0    (8.1)

Then we simply split the interval [a, b] in half: pick a new point

c = a + (b − a)/2    (8.2)

and calculate the product f(a)f(c). If f(a)f(c) > 0 then a and c are on the same side
of the root, so we replace a with c. If f(a)f(c) < 0 then a and c are on different sides
of the root, so we replace b with c. Then we repeat the process until our interval size

∆ = b − a < ε    (8.3)

where ε is some desired tolerance.


In general the following additions are good programming style:

1. It might take a long time to reach the desired ε, so it is always a good idea to
include a counter and terminate after some number N of steps regardless of how
close you’ve gotten. This is especially important when you are debugging the
program.

2. As you get closer and closer to the root the product f(a)f(c) will get smaller
and smaller, and could run into the level of machine accuracy. Thus it’s better
to check the product Sign(f(a))Sign(f(c)) rather than the product f(a)f(c).

Here is the algorithm:

Algorithm Bisection
Input a, b, f, ε, N;
Let ∆ = (b − a)/2; i = 0;
If f(a)f(b) > 0, Print error message and stop;
While ∆ > ε and i < N,
    p = a + ∆;
    If f(p) = 0, Return(p);
    If Sign(f(a))Sign(f(p)) < 0,
        Let b = p;
    Otherwise,
        Let a = p;
    End If;
    ∆ = (b − a)/2;
    i = i + 1;
End While;
If i = N, Print a message saying that the tolerance was not reached;
Return(a + ∆).

Of all the algorithms we will discuss for root finding, bisection is the slowest.
In fact, we can predict precisely the number of iterations it will take to converge.
Because the size of the interval is halved each time, it will be the smallest integer n
such that

(1/2)^n |b − a| < ε    (8.4)

Hence

n log(1/2) < log( ε / |b − a| )    (8.5)

Since log(1/2) = −log 2, n is the smallest integer for which

n > −(1/log 2) log( ε / |b − a| ) = log_2( |b − a| / ε )    (8.6)

Thus we could add a test at the beginning of our program, and just iterate n times.
This is actually more efficient, because we don’t need to do a comparison to check
the size of the interval at each iteration. Here is the revised algorithm.

Algorithm Bisection (revised)
Input a, b, f, ε;
Let N = ⌈log_2((b − a)/ε)⌉; i = 0;
If f(a)f(b) > 0, Print error message and stop;
While i < N,
    p = a + (b − a)/2;
    If f(p) = 0, Return(p);
    If Sign(f(a))Sign(f(p)) < 0,
        Let b = p;
    Otherwise,
        Let a = p;
    End If;
    i = i + 1;
End While;
Return(a + (b − a)/2).
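A direct Mathematica transcription of the revised algorithm might look like this (a sketch; bisect is our name, and the check that f(a)f(b) < 0 is left to the caller):

In:=

bisect[f_, a0_, b0_, eps_] :=
  Module[{a = a0, b = b0, p, n},
    n = Ceiling[Log[2, (b0 - a0)/eps]];  (* iterations needed *)
    Do[
      p = a + (b - a)/2;
      If[f[p] == 0, Break[]];            (* landed exactly on the root *)
      If[Sign[f[a]] Sign[f[p]] < 0,
        b = p,                           (* root lies in [a, p] *)
        a = p],                          (* root lies in [p, b] *)
      {n}];
    a + (b - a)/2];

bisect[#^2 - 2 &, 1., 2., 10.^-6]

Out:=

1.41421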

This analysis actually allows us to prove the following.

Theorem 8.1. The bisection algorithm converges.

Proof. Either the algorithm reaches the exact root at some step of the iteration or it
does not. Let L = |b − a|, and let the values of a, b, and p after the ith iteration be
a_i, b_i, and p_i, respectively. If for some n we have

f(a_n)f(b_n) = 0    (8.7)

then either a_n is a root or b_n is a root, and the algorithm has converged.


If we never have f(a_n)f(b_n) = 0 then we continue to split the interval. Since each
iteration splits the interval in half, we have

|a_1 − b_1| = L/2    (8.8)
|a_2 − b_2| = L/4    (8.9)
|a_3 − b_3| = L/8    (8.10)
. . .
|a_n − b_n| = L/2^n    (8.12)

Furthermore, by construction, f(a_i)f(b_i) < 0, so the root is in each interval. Therefore
each p_i is a distance no larger than |a_i − b_i| from the root r. Hence

|p_1 − r| ≤ |a_1 − b_1| = L/2    (8.13)
|p_2 − r| ≤ |a_2 − b_2| = L/2^2    (8.14)
|p_3 − r| ≤ |a_3 − b_3| = L/2^3    (8.15)
. . .
|p_n − r| ≤ |a_n − b_n| = L/2^n    (8.17)

Therefore

0 ≤ lim_{n→∞} |p_n − r| ≤ lim_{n→∞} L/2^n = 0    (8.18)


Hence

lim_{n→∞} p_n = r    (8.19)

which proves that the sequence of iterations converges to the root.



Example 8.1. Estimate √2 by finding the root of f(x) = x^2 − 2.

Solution.
Step 1.

a = 1, b = 2, f(a) = −1, f(b) = 2    (8.20)
p = (a + b)/2 = 1.5    (8.21)
f(p) = (1.5)^2 − 2 = 0.25    (8.22)
f(a)f(p) = (−)(+) < 0    (8.23)

so the root is between a and p. So we set

b = p = 1.5    (8.24)

Step 2.

a = 1, b = 1.5, f(a) = −1, f(b) = 0.25    (8.25)
p = (1 + 1.5)/2 = 1.25    (8.26)
f(p) = (1.25)^2 − 2 = −0.4375    (8.27)
f(a)f(p) = (−)(−) > 0    (8.28)

The root is between p and b, so set

a = p = 1.25    (8.29)

Step 3.

a = 1.25, b = 1.5, f(a) = −0.4375, f(b) = 0.25    (8.30)
p = (1.25 + 1.5)/2 = 1.375    (8.31)
f(p) = 1.375^2 − 2 = −0.109    (8.32)
f(p)f(a) = (−0.109)(−0.4375) > 0    (8.33)

so set

a = p = 1.375    (8.34)

The root is between a = 1.375 and b = 1.5. As we continue the process we compute
the sequence 1.5, 1.25, 1.375, 1.4375, 1.40625, 1.42188, 1.41406, . . .


[Figure: illustration of bisection, showing the locations along the x-axis of the
successive iterates a_i, p_i, b_i, for i = 0, . . . , 5, for the example x^2 − 2 = 0.]

Lesson 9

Fixed Point Iteration

Anyone who has ever played with a calculator by typing in a number and then
hitting the same function key repeatedly has used fixed point iteration. For example,
if you type the number 16 and then start pressing the √ key you will generate the
following sequence (this was generated with a TI-36, which has 10 digit accuracy):

    x0 = 16                                              (9.1)
    x1 = √x0 = √16 = 4                                   (9.2)
    x2 = √x1 = √4 = 2                                    (9.3)
    x3 = √x2 = √2 = 1.414213562                          (9.4)
    x4 = √x3 = √1.414213562 = 1.189207115                (9.5)
    x5 = √x4 = √1.189207115 = 1.090507733                (9.6)
    ⋮

Eventually, after around 30 iterations, the calculator will display something like

1.0000000000 (9.7)

on all subsequent iterations, because



    √1.0000000000 = 1.0000000000                         (9.8)

In fact, this iteration has found the fixed point of the square root function

    f(x) = √x                                            (9.9)

to within the machine epsilon of the calculator (1 part in 10¹⁰), namely, the point
where
    x = f(x) = √x                                        (9.10)


Equation 9.10 has only two solutions: x = 1 and x = 0. We have converged on the
first of these solutions. Had we started with any positive number, we still would have
converged on the solution x = 1, regardless of which number we typed in for x0 . Had
we started with x = 0 we would have stayed at the other fixed point, x = 0, and had
we started with a negative number, we would have gotten an error message.
What we are doing during this iteration is computing a sequence of function
applications:

    x1 = g(x0)                                           (9.11)
    x2 = g(g(x0)) = g²(x0)                               (9.12)
    x3 = g(g(g(x0))) = g³(x0)                            (9.13)
    ⋮
    xn = gⁿ(x0)                                          (9.14)

where we have used the notation gᵏ(x) to denote the repeated application of the
function g(x) k times.
Definition 9.1 (Fixed Point). A number p is called a fixed point of the function
f (x) if p = f (p).
Example 9.1. Find the fixed points of the function f(x) = x⁴ + 2x² + x − 3.

Solution. We need to solve

    x = f(x) = x⁴ + 2x² + x − 3                          (9.15)

for x. Hence

    0 = x⁴ + 2x² − 3                                     (9.16)
      = (x² − 1)(x² + 3)                                 (9.17)
      = (x − 1)(x + 1)(x² + 3)                           (9.18)

Hence there are two (real) fixed points: x = ±1.
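
As a quick sanity check (my own addition, not in the original notes), we can ask
Mathematica to solve the fixed point equation directly:

In:=

f[x_] := x^4 + 2 x^2 + x - 3;
Solve[x == f[x], x]

which returns the two real fixed points x = −1 and x = 1 along with the complex
pair ±i√3.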


Theorem 9.1. A continuous function f(x) will have a fixed point if and only if its
graph intersects the line y = x (see figure 9.1). A fixed point of f(x) always occurs
at an intersection of the two curves y = f(x) and y = x, namely at a point p such
that p = f(p).
Figure 9.1: A fixed point occurs at the intersection of the curve y = f(x) with the
line y = x. If there are multiple intersections then there are multiple fixed points.
[Figure omitted.]

Consider again the example of fixed point iteration on f(x) = √x, which we illustrated
by repeatedly pushing the square-root button on a calculator. This algorithm can be
visualized as shown in figure 9.2. The top plot shows both the function y = f(x) and
the line y = x. We then draw a line from (x0, f(x0)) horizontally to the line y = x.
The two lines meet at the point (f(x0), f(x0)) (middle plot). The head of the arrow
lies directly over the value of f(x0) on the x-axis, so that by projecting vertically to
the curve of y = f(x), we intersect it at f(f(x0)) (bottom plot). We then repeat this
process, generating successive iterates that approach closer and closer to the fixed
point at (1, 1) = (x, √x) (see figure 9.3).

Example 9.2. Find the first 25 iterations in the fixed point iteration for the function

f (θ) = cos θ (9.19)

to 10 digits of precision using Mathematica with x0 = π.

Solution. We can do fixed point iteration with the function NestList.


In:=

f[x_]:=Cos[x];
c=N[NestList[f, Pi, 25], 10]

Out:=

{3.141592654, -1.0000000000, 0.5403023059, 0.8575532158,


0.6542897905, 0.7934803587, 0.7013687736, 0.7639596829,
0.7221024250, 0.7504177618, 0.7314040424, 0.7442373549,
0.7356047404, 0.7414250866, 0.7375068905, 0.7401473356,
0.7383692041, 0.7395672022, 0.7387603199, 0.7393038924,
0.7389377567, 0.7391843998, 0.7390182624, 0.7391301765,
0.7390547907, 0.7391055719}

Convergence is illustrated in figure 9.4.



Figure 9.2: Visualization of fixed point iteration on y = √x. See text for description.
[Three-panel figure omitted; each panel plots the iteration path for 0 ≤ x ≤ 18.]

Despite the success illustrated by the rapid convergence in example 9.2, fixed point
iteration does not always work. This is illustrated by the following example.

Example 9.3. Compute the fixed points of the function

    f(x) = x² − 2                                        (9.20)

and then, using Mathematica, compute the result of iterating the fixed point algorithm
starting from x0 = 1.5, and plot the results as we did in the previous example.

Solution. The fixed point occurs when f(x) = x, which means

    x = x² − 2                                           (9.21)
    0 = x² − x − 2                                       (9.22)
      = (x − 2)(x + 1)                                   (9.23)

So the fixed points occur at x = 2 and x = −1. To compute, say, the first 50 fixed
point iterations starting with x0 = 1.5, in Mathematica:

In:=

g[x_] := x^2 - 2;
NestList[g, 1.5, 50]



Figure 9.3: Fixed point iteration on y = √x (continued from figure 9.2). [Figure
omitted; the plot shows the iteration path over 1 ≤ x ≤ 4.]

Out:=

{1.5, 0.25, -1.9375, 1.75391, 1.07619, -0.841821, -1.29134,


-0.332449, -1.88948, 1.57013, 0.465297, -1.7835, 1.18087, -0.605549,
-1.63331, 0.667703, -1.55417, 0.415451, -1.8274, 1.33939, -0.206031,
-1.95755, 1.83201, 1.35625, -0.160593, -1.97421, 1.8975, 1.60052,
0.561675, -1.68452, 0.837612, -1.29841, -0.31414, -1.90132, 1.615,
0.608231, -1.63005, 0.657078, -1.56825, 0.459403, -1.78895, 1.20034,
-0.559188, -1.68731, 0.847012, -1.28257, -0.355011, -1.87397,
1.51175, 0.285399, -1.91855}

No obvious pattern is discernible in the numbers; this is confirmed in figure 9.5. In
fact, the resulting sequence of iterates is chaotic. Not only does it never converge,
it has another interesting property: if we iterate long enough we will calculate an
iterate that is arbitrarily close to virtually every value between −2 and 2.

Theorem 9.2 (Sufficient Condition for a Fixed Point). Suppose that f(x) is a
continuous function that maps its domain into a subset of itself, i.e., f(x) ∈ C[a, b]¹
such that
    f : [a, b] → S ⊂ [a, b]                              (9.24)
Then f(x) has a fixed point in [a, b].

Proof. Case 1: f(a) = a or f(b) = b, in which case the fixed point is at x = a or
x = b, respectively, and the theorem is proved.

¹By C[a, b] we mean the set of all continuous functions whose domain is the interval [a, b].


Figure 9.4: Visualization of fixed point iteration in example 9.2. [Two-panel figure
omitted; the top panel shows the iteration of cos x over −π ≤ x ≤ π, and the bottom
panel zooms in on the convergence toward the fixed point near x ≈ 0.739.]

Case 2: Both f(a) ≠ a and f(b) ≠ b. By assumption 9.24 we must have

f (a) > a and (9.25)


f (b) < b (9.26)

Let
h(x) = f (x) − x (9.27)
Since f (x) is continuous, so is h(x), and

h(a) = f (a) − a > a − a = 0 (9.28)


h(b) = f (b) − b < b − b = 0 (9.29)

Hence by the intermediate value theorem, h(x) has a root r ∈ (a, b), such that
h(r) = 0. But at r we have
0 = h(r) = f (r) − r (9.30)
Thus since f (r) = r, r must be a fixed point of f .


Figure 9.5: Left: The first 5 fixed point iterations on g(x) = x² − 2 starting from
x0 = 1.5. Right: The first 100 iterations. There is no discernible pattern of
convergence; in fact, the iteration is chaotic. [Two cobweb plots omitted; both axes
run from −2 to 2.]

Theorem 9.3. Every continuous bounded function on the real numbers has a fixed
point.

Proof. Let
    f : R → R                                            (9.31)
be continuous and bounded. Then its range has a greatest lower bound a and a least
upper bound b. Hence
    f : R → [a, b]                                       (9.32)
In particular, f maps the interval [a, b] into itself, so the conditions of Theorem 9.2
are met and hence f(x) has a fixed point.

Theorem 9.4 (Condition for a Unique Fixed Point). Let f(x) be a continuous
and differentiable function that maps its domain into a subset of itself,

    f : [a, b] → S ⊂ [a, b]                              (9.33)

Suppose further that there exists some positive constant

    0 < K < 1                                            (9.34)

such that
    |f′(x)| ≤ K                                          (9.35)
for all x ∈ [a, b]. Then f(x) has a unique fixed point p ∈ [a, b].


Proof. By theorem 9.2 at least one fixed point exists; call it p. Then

    f(p) = p                                             (9.36)

Suppose that a second fixed point q ≠ p exists. Since q is also a fixed point,

    q = f(q)                                             (9.37)

By the Mean Value theorem, there exists some number c ∈ [min(p, q), max(p, q)] such
that
    f′(c) = (f(p) − f(q))/(p − q)                        (9.38)

By equation 9.35, |f′(c)| ≤ K, hence

    |(f(p) − f(q))/(p − q)| ≤ K                          (9.39)

i.e.,
    |f(p) − f(q)| ≤ K|p − q| < |p − q|                   (9.40)

because K < 1. But by equations 9.36 and 9.37 we have

    |f(p) − f(q)| = |p − q|                              (9.41)

and therefore
    |p − q| < |p − q|                                    (9.42)

Since p ≠ q we know that |p − q| ≠ 0, hence we can cancel it from both sides to get
1 < 1, which is a contradiction. Hence our original assumption q ≠ p must be wrong,
and the fixed point is unique.
Example 9.4. Show that
    g(x) = π + (1/2) sin(x/2)                            (9.43)
has a unique fixed point.

Solution. We first observe that g(x) is continuous and differentiable, and that

    Range(g) = [π − 1/2, π + 1/2] ⊂ (−∞, ∞) = Domain(g)  (9.44)

Hence by theorem 9.3 at least one fixed point exists. To verify uniqueness we calculate

    |g′(x)| = |(1/4) cos(x/2)| ≤ 1/4 < 1                 (9.45)

Hence the conditions of theorem 9.4 are met with K = 1/4. Hence the fixed point is
unique. (See figure 9.6.)


Figure 9.6: The fixed point of g(x) = π + (1/2) sin(x/2) is unique. See example 9.4.
[Figure omitted; the plot shows y = g(x) and y = x for −2π ≤ x ≤ 4π.]

Example 9.5. Calculate the first four fixed point iterates of the function in the pre-
vious example, starting with x0 = π, and then use NestList to calculate the first 10
iterations to 20 digits.
Solution.

p1 = π + 0.5 sin(π/2) ≈ 3.64159 (9.46)


p2 = π + 0.5 sin(3.64159/2) ≈ 3.62605 (9.47)
p3 = π + 0.5 sin(3.62605/2) ≈ 3.62700 (9.48)
p4 = π + 0.5 sin(3.62700/2) ≈ 3.62694 (9.49)

In Mathematica,

In:=

g[x_] := Pi + (1/2) Sin[x/2];


N[NestList[g, Pi, 10], 20]
Out:=

{3.1415926535897932385, 3.6415926535897932385, 3.6260488644451156305,


3.6269956224387354753, 3.6269387942254171004, 3.6269422083510946963,
3.6269420032482992065, 3.6269420155698412408, 3.6269420148296252521,
3.6269420148740936891, 3.6269420148714222501}

To find the root of a function f (x) using the fixed point algorithm, we define

g(x) = x − f (x) (9.50)

Then if p is a root of f (x),

g(p) = p − f (p) = p − 0 = p (9.51)

Hence p is a fixed point of g(x) = x − f (x). This suggests that we use the following
algorithm.


Algorithm FixedPointRoot
Input f(x), a first guess p0, and an error tolerance ε;
Define g(x) = x − f(x);
Let p = p0;
Define ∆ = ∞;
While ∆ > ε,
    Let pold = p;
    Let p = g(p);
    ∆ = |p − pold|;
End While;
Return (p).
Example 9.6. Use the fixed point algorithm to find √(1/2) to 25 digits accuracy.

Solution. We know that √(1/2) is a root of f(x) = x² − 1/2, so we form the function

    g(x) = x − f(x) = x − x² + 1/2                       (9.52)

We can use

In:=

f[x_] := x^2 - 1/2;


g[x_] := x - f[x];
NestList[g, 1.0, 10]

but this returns

Out:=

{1., 0.5, 0.75, 0.6875, 0.714844, 0.703842, 0.708448, 0.706549,


0.707337, 0.707011, 0.707146}

which does not give us enough digits. We also want the computer to calculate the
error for us, so that it can automatically figure out when to stop the calculations. One
way to do this is by literally translating the iterative algorithm into Mathematica:

∆ = ∞;
p = 1.0‘50;
n = 0;
While[∆ > 10−25 ,
pold = p;
p = g[p];
∆= Abs[p - pold];
n++;
];
Print["The root is ", N[p, 25]," after ", n, " iterations."];


The initialization p=1.0`50 ensures that the data starts with 50 digit accuracy.
A good general rule of thumb is that your data should have at least twice the
digits that you need in your final answer, although in fact it will depend upon what
kind of calculation you are doing. The output is

The root is 0.7071067811865475244008444 after 64 iterations.

That the convergence to 25 digits does occur after 64 iterations can be verified by
including a statement Print[p] before the end of the While loop.

Example 9.7. Repeat the previous example with √2, starting with x0 = 1.5.

Solution. As before we observe that √2 is a root of f(x) = x² − 2, and so we calculate

    g(x) = x − f(x) = x − x² + 2                         (9.53)



Then √2 is a fixed point of g. The first several iterations are:

    x1 = 1.5 − 1.5² + 2 = 1.25                           (9.54)
    x2 = 1.25 − 1.25² + 2 = 1.6875                       (9.55)
    x3 = 1.6875 − 1.6875² + 2 = 0.8398                   (9.56)
    x4 = 0.8398 − 0.8398² + 2 = 2.1345                   (9.57)

So far there is no discernible pattern; in fact, the first 120 iterations are (from
Mathematica):
In:=

g[x_] := x - x^2 + 2;
q = NestList[g, 1.5, 120]
Out:=

{1.5, 1.25, 1.6875, 0.839844, 2.13451, -0.421611, 1.40063, 1.43886, 1.36854,


1.49563, 1.25872, 1.67434, 0.870917, 2.11242, -0.3499, 1.52767, 1.19389,
1.76851, 0.640879, 2.23015, -0.74343, 0.703883, 2.20843, -0.668739, 0.884049,
2.10251, -0.318025, 1.58083, 1.0818, 1.91151, 0.257634, 2.19126, -0.610355,
1.01711, 1.9826, 0.0519074, 2.04921, -0.150061, 1.82742, 0.487955, 2.24985,
-0.811992, 0.528676, 2.24918, -0.809622, 0.534889, 2.24878, -0.808241,
0.538505, 2.24852, -0.807313, 0.540933, 2.24832, -0.806639, 0.542696,
2.24818, -0.806123, 0.544042, 2.24806, -0.805715, 0.545109, 2.24797,
-0.805382, 0.545977, 2.24789, -0.805106, 0.546699, 2.24782, -0.804872,
0.547309, 2.24776, -0.804671, 0.547832, 2.24771, -0.804497, 0.548286,
2.24767, -0.804345, 0.548684,


2.24763, -0.80421, 0.549036, 2.2476, -0.80409, 0.54935, 2.24756,


-0.803982, 0.549631, 2.24754, -0.803885, 0.549884, 2.24751, -0.803797,
0.550114, 2.24749, -0.803716, 0.550324, 2.24747, -0.803643, 0.550516,
2.24745, -0.803575, 0.550692, 2.24743, -0.803513, 0.550855, 2.24741,
-0.803455, 0.551005, 2.2474, -0.803401, 0.551145, 2.24738, -0.803352,
0.551274, 2.24737, -0.803305, 0.551396, 2.24736, -0.803262, 0.551509}
It appears that the answer is "cycling" between three different values, which are
approximately
    x ≈ 0.551, 2.247, −0.803                             (9.58)
none of which is the correct answer! We call this a "period-3 cycle." Such phenomena
often occur in the study of dynamical systems, of which fixed point iteration is an
example (see figure 9.7).

Figure 9.7: Convergence of fixed point iteration on g(x) = x − x² + 2 to a period-3
limit cycle. [Cobweb plot omitted.]

So why, when things worked so well with √(1/2), does fixed point iteration fail so
miserably when we calculate √2? For one thing,

    g′(x) = 1 − 2x                                       (9.59)

Near the root, say at x = √2 + ε, we have

    |g′(√2 + ε)| = |1 − 2(√2 + ε)| ≈ |1 − 2.83 − 2ε| ≈ |−1.83 − 2ε|   (9.60)

There is no way that we can bound this number by a constant that is smaller than
1, so that theorem 9.4 does not even guarantee the existence of a fixed point (even
though we know that one does, in fact, exist at √2). The next theorem gives us an
idea.


Theorem 9.5. The fixed point iteration algorithm on a function g(x) will converge
to a fixed point of g(x) if the conditions of theorem 9.4 are satisfied. More precisely,
suppose that g(x) is a continuous and differentiable function on [a, b] such that
g : [a, b] → S ⊂ [a, b], and that there is a positive number K < 1 such that |g′(x)| ≤ K
on [a, b]. Then for any starting point p0 ∈ [a, b], the sequence pn = g(pn−1) converges
to the unique fixed point of g(x).

Proof. By theorem 9.4 a unique fixed point exists. We need to show that

    lim_{n→∞} pn = p                                     (9.61)

Since g : [a, b] → S ⊂ [a, b], all of the iterates pn = g(pn−1) lie in [a, b]. Furthermore,
since p is a fixed point, p = g(p), and

    |pn − p| = |pn − g(p)| = |g(pn−1) − g(p)|            (9.62)

If there is some n such that pn−1 = p then the sequence has converged, and the
theorem has been proven. So we may assume that there is no n such that pn = p.
Since pn−1 ≠ p, by the mean value theorem there is some point cn between pn−1 and
p such that
    |g′(cn)| = |(g(pn−1) − g(p))/(pn−1 − p)|             (9.63)

Hence, since |g′(cn)| ≤ K,

    |g(pn−1) − g(p)| = |g′(cn)| |pn−1 − p| ≤ K |pn−1 − p|   (9.64)

Substituting equation 9.62,

    |pn − p| ≤ K |pn−1 − p|                              (9.65)

Since this is true for all n,

    |pn − p| ≤ K |pn−1 − p|                              (9.66)
            ≤ K² |pn−2 − p|                              (9.67)
            ≤ K³ |pn−3 − p|                              (9.68)
            ⋮                                            (9.69)
            ≤ Kⁿ |p0 − p|                                (9.70)

Hence
    0 ≤ lim_{n→∞} |pn − p| ≤ lim_{n→∞} Kⁿ |p0 − p| = 0   (9.71)

because K < 1 implies that K n → 0 as n → ∞. Since |pn − p| → 0 is equivalent to


pn → p, the algorithm converges.


Example 9.8. Prove that the fixed point algorithm for g(x) = x − x² + 1/2 converges
to √(1/2) ≈ 0.707.

Solution. First we observe that √(1/2) is a fixed point of g(x), since

    g(√(1/2)) = √(1/2) − 1/2 + 1/2 = √(1/2)              (9.72)

Next we calculate
    |g′(x)| = |1 − 2x|                                   (9.73)
We want to determine if there is some positive constant K < 1 such that |g′(x)| ≤ K,
which requires that
    −K ≤ 1 − 2x ≤ K                                      (9.74)
    −1 − K ≤ −2x ≤ −1 + K                                (9.75)
    (1 − K)/2 ≤ x ≤ (1 + K)/2                            (9.76)
If we try K = 0.8 then
    0.1 ≤ x ≤ 0.9                                        (9.77)
In other words, for all x ∈ [0.1, 0.9], we have |g′(x)| ≤ K < 1. One can also check
that g maps [0.1, 0.9] into itself: on this interval g takes its minimum value g(0.1) =
g(0.9) = 0.59 at the endpoints and its maximum value g(1/2) = 0.75 at the center.
Thus the conditions of the theorem are met for any starting point in [0.1, 0.9]. If we
start with, say, x0 = 1/2, which is clearly in this interval, the algorithm converges by
theorem 9.5.

From equation 9.70 we could calculate an error estimate based on the size of the
original interval [a, b]. Since both p and p0 are in the interval [a, b], if we stop the
iteration after n steps the error is limited by

    |pn − p| ≤ Kⁿ |p0 − p| ≤ Kⁿ |b − a|                  (9.78)

Thus each iteration reduces the error by a factor of K. While this is a significant
improvement, equation 9.78 is not very useful if the interval [a, b] is especially large,
such as the whole real line. Fortunately we can make an improved estimate based on
the values of the first guess and the first iteration.

Theorem 9.6 (Error Estimate for Fixed Point Iteration). If fixed point iteration
is terminated after n ≥ 1 steps then the error is limited by

    |pn − p| ≤ Kⁿ |p1 − p0| / (1 − K)                    (9.79)


Proof. We prove this by induction. For n = 1 we need to prove

    |p1 − p| ≤ K |p1 − p0| / (1 − K)                     (9.80)

To demonstrate 9.80 we use the Mean Value Theorem: there is some number c between
p0 and p such that

    |g′(c)| = |(g(p0) − g(p))/(p0 − p)| = |(p1 − p)/(p0 − p)| ≤ K     (9.81)

where the last step follows because |g′(c)| ≤ K. Hence, by the triangle inequality,

    |p1 − p| ≤ K |p0 − p|                                (9.82)
            = K |p0 − p1 + p1 − p|                       (9.83)
            ≤ K (|p0 − p1| + |p1 − p|)                   (9.84)
            = K |p0 − p1| + K |p1 − p|                   (9.85)

Solving the last inequality for |p1 − p| yields equation 9.80.

For the inductive step we assume that equation 9.79 holds for n, and attempt to
prove that it holds for n + 1, namely, that

    |pn+1 − p| ≤ Kⁿ⁺¹ |p1 − p0| / (1 − K)                (9.86)

We again use the Mean Value Theorem: there is some number c between pn and p
such that

    |g′(c)| = |(g(pn) − g(p))/(pn − p)| = |(pn+1 − p)/(pn − p)| ≤ K   (9.87)

Hence
    |pn+1 − p| ≤ K |pn − p|                              (9.88)

Substituting equation 9.79 on the right yields equation 9.86.

Example 9.9. Estimate the number of iterations required for fixed point iteration to
converge to the fixed point of
    g(x) = π + (1/2) sin(x/2)                            (9.89)
with (a) 4 digit accuracy and (b) 10 digit accuracy, using p0 = π.

Solution. By theorem 9.6, to achieve an accuracy of ε, it is sufficient to find the
smallest n such that

    |p − pn| ≤ Kⁿ |p1 − p0| / (1 − K) < ε                (9.90)


    Kⁿ < ε(1 − K)/|p1 − p0|                              (9.91)
    n log K < log(ε(1 − K)/|p1 − p0|)                    (9.92)
    n > (1/log K) log(ε(1 − K)/|p1 − p0|)                (9.93)

where we reversed the direction of the inequality because K < 1 implies log K < 0.
To find K we calculate

    |g′(x)| = |(1/4) cos(x/2)| ≤ 1/4                     (9.94)

so we choose K = 1/4. Hence, since p0 = π and log(1/4) = − log 4,

    n > −(1/log 4) log(ε(3/4)/|p1 − π|)                  (9.95)

To get p1 we iterate once,

    p1 = π + (1/2) sin(π/2) = π + 1/2                    (9.96)
2 2 2

Therefore we need to find an n such that

−1 (3/4) −1 3
n> log = log (9.97)
log 4 1/2 log 4 2

For ε = 10⁻⁴,

    n > −(1/log 4) log(1.5 × 10⁻⁴) ≈ 6.3                 (9.98)

hence we will need at most 7 iterations. For ε = 10⁻¹⁰,

    n > −(1/log 4) log(1.5 × 10⁻¹⁰) ≈ 16.3               (9.99)

so that 17 iterations are guaranteed to be sufficient.
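
As a quick numerical check (my own addition, not in the original notes), we can
evaluate the bound 9.97 directly in Mathematica:

In:=

nBound[eps_] := -Log[3 eps/2]/Log[4];
N[{nBound[10^-4], nBound[10^-10]}]   (* returns approximately {6.35, 16.32} *)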


Appendix
The fixed point plots shown in this section can be generated with the following
Mathematica program:

fixedPointPlot[f_, x0_, n_] :=
 Module[{data, u, g, g2, g1, min, max, range},
  (* iterate f n times and build the cobweb path of points (x, x) *)
  data = NestList[f, x0, n];
  data = Partition[
    Flatten[Table[{data[[i]], data[[i]]},
      {i, 1, Length[data]}]], 2, 1];
  g = Graphics[Line[data]];
  (* pad the plot range by 25% on each side *)
  max = Max[data];
  min = Min[data];
  range = max - min;
  If[range == 0, range = 1];
  max = max + range/4;
  min = min - range/4;
  g1 = Plot[u, {u, min, max}];      (* the line y = x *)
  g2 = Plot[f[u], {u, min, max}];   (* the curve y = f(x) *)
  Show[g2, g1, g]
  ]

A typical plot can be produced as follows:

In:=

g[x_]:= 4.0 x (1-x);


fixedPointPlot[g, 0.49, 100]
Out:=

[Cobweb plot omitted; the iteration of g(x) = 4x(1 − x) is chaotic.]

Lesson 10

Newton’s Method

Suppose we already have an estimate p0 for the root of f(x). If we project the tangent
line to f(x) at the point (p0, f(p0)) down to where it intersects the x-axis, this
should give us a better guess for the root, as illustrated in figure 10.1.

Figure 10.1: Derivation of Newton's method. [Figure omitted; it shows the curve f(x)
with tangent lines generating the successive iterates p0, p1, p2, p3 along the x-axis.]

The slope of the straight line connecting the points (p0, f(p0)) and (p1, 0) is

    f′(p0) = slope = rise/run = (f(p0) − 0)/(p0 − p1) = f(p0)/(p0 − p1)   (10.1)

Solving for p1,
    p1 = p0 − f(p0)/f′(p0)                               (10.2)

This gives us the well known formula for Newton's Method:

    pn+1 = pn − f(pn)/f′(pn)                             (10.3)


This gives us the following algorithm.

Algorithm Newtons Method

Input f(x), p0, tolerance ε
Let ∆ = f(p0)/f′(p0)
While |∆| > ε,
    p0 = p0 − ∆
    ∆ = f(p0)/f′(p0)
End While
Return p0

Example 10.1. Find √2 with Newton's method as the root of f(x) = x² − 2. Use
p0 = 2.

Solution. To get an iteration formula for pn we need to know the derivative of f(x),
which is
    f′(x) = 2x                                           (10.4)
Hence the iteration formula is
    pn+1 = pn − (pn² − 2)/(2pn)                          (10.5)
Although it is possible to simplify this algebraically, convergence of the algorithm
(which we have not proven yet) ensures that the second term above approaches zero,
and hence it is computationally preferable to leave it in this form rather than placing
the sum over a common denominator. Thus

    p1 = 2 − (2² − 2)/(2 × 2) = 1.5                      (10.6)
    p2 = 1.5 − (1.5² − 2)/(2 × 1.5) = 1.4167             (10.7)
    p3 = 1.4167 − (1.4167² − 2)/(2 × 1.4167) = 1.4142    (10.8)

and so forth.
In general Newton’s method converges extremely rapidly. The only time it will
be slow to converge is√when f 0 (p) = 0. As the following Mathematicaillustrates, the
method converges to 2 to 50 digits, starting with p0 = 2, in only 6 iterations.

In:=

f[x_] := x^2 - 2;
g[x_] := x - f[x]/f'[x];
NestList[g, 2.0`50, 7]


Out:=

{2.0000000000000000000000000000000000000000000000000,
1.5000000000000000000000000000000000000000000000000,
1.4166666666666666666666666666666666666666666666667,
1.414215686274509803921568627450980392156862745098,
1.414213562374689910626295578890134910116559622116,
1.414213562373095048801689623502530243614981925776,
1.41421356237309504880168872420969807856967187538,
1.41421356237309504880168872420969807856967187538}

Theorem 10.1 (Convergence of Newton's Method). Suppose that f(x) is con-
tinuously differentiable¹ and has a root p ∈ [a, b]. Suppose further that f′(p) ≠ 0.
Then there is some interval
    I = [p − δ, p + δ]                                   (10.9)
for some number δ > 0 such that for any p0 ∈ I, Newton's method will converge.

Proof. We need to prove that pn → p as n → ∞, where

    pn+1 = pn − f(pn)/f′(pn)                             (10.10)

We first observe that Newton's method is nothing more than fixed point iteration on
the function
    g(x) = x − f(x)/f′(x)                                (10.11)
Furthermore, since p is a root of f, it is also a fixed point of g, because

    g(p) = p − f(p)/f′(p) = p − 0/f′(p) = p              (10.12)

Since f′(p) ≠ 0, by continuity there must be some interval U = [p − ε, p + ε] ⊂ [a, b]
about p such that f′(x) ≠ 0 for all x ∈ U. Since f(x) and f′(x) are defined and
continuous on [a, b], they are defined and continuous on U ⊂ [a, b]. Since, by
construction of U, f′(x) ≠ 0 on U, g(x) is also defined and continuous on U.
Therefore

    g′(x) = d/dx [x − f(x)/f′(x)]                        (10.13)
          = 1 − (f′(x)f′(x) − f(x)f″(x))/(f′(x))²        (10.14)
          = f(x)f″(x)/(f′(x))²                           (10.15)

Furthermore, since f(p) = 0, g′(p) = 0. So if we pick any small number K < 1,
then by continuity there must be some interval I = [p − δ, p + δ] about p such that


Figure 10.2: Figures for the proof of convergence of Newton's method. [Two sketches
omitted: one shows the interval p − ε ≤ x ≤ p + ε on which f′(x) ≠ 0; the other shows
the interval p − δ ≤ x ≤ p + δ on which |g′(x)| ≤ K.]

|g′(x)| ≤ K, as we see in fig. 10.2. This proves that there is some K > 0 such that
|g′(x)| ≤ K < 1 in some interval about p.
To see that g : I → S ⊂ I, let x ∈ I. Then by the mean value theorem there is a
point c ∈ I between p and x such that

    |g′(c)| = |g(p) − g(x)| / |p − x|                    (10.16)

or
    |g(p) − g(x)| = |p − x| |g′(c)|                      (10.17)

Since the maximum distance between p and x in I is δ,

    |g(p) − g(x)| ≤ δ |g′(c)| ≤ δK < δ                   (10.18)

because |g′(x)| ≤ K < 1. But since p is a fixed point of g, we know that g(p) = p,
and therefore
    |p − g(x)| < δ                                       (10.19)
or equivalently,
    p − δ < g(x) < p + δ                                 (10.20)

Thus g maps I into a subset of itself, and hence all of the hypotheses of theorem 9.5
are met. Therefore fixed point iteration on g converges to the fixed point of g, which
is the root of f. Thus Newton's method converges.
We can also do some error analysis for Newton's method. Recall that by Taylor's
theorem (theorem 4.6),

    f(p + ε) ≈ f(p) + εf′(p) + (1/2)ε²f″(p) + · · ·      (10.21)
             ≈ εf′(p) + (1/2)ε²f″(p) + · · ·             (10.22)

¹By continuously differentiable we mean that the function is continuous and differentiable and
its first derivative is also continuous.


because f(p) = 0. Similarly,

    f′(p + ε) ≈ f′(p) + εf″(p) + · · ·                   (10.23)

Let εᵢ be the error after the ith iteration. Then since

    xᵢ₊₁ = xᵢ − f(xᵢ)/f′(xᵢ)                             (10.24)

we have

    εᵢ₊₁ − εᵢ = (xᵢ₊₁ − p) − (xᵢ − p)                    (10.25)
             = xᵢ₊₁ − xᵢ                                 (10.26)
             = −f(xᵢ)/f′(xᵢ)                             (10.27)
             = −(εᵢf′(p) + (1/2)εᵢ²f″(p) + · · ·)/(f′(p) + εᵢf″(p) + · · ·)   (10.28)

Solving for εᵢ₊₁,

    |εᵢ₊₁| ≈ |(εᵢ(f′(p) + εᵢf″(p) + · · ·) − (εᵢf′(p) + (1/2)εᵢ²f″(p)))/(f′(p) + εᵢf″(p) + · · ·)|   (10.29)
          ≈ |εᵢ²f″(p)/(2f′(p))|                          (10.30)
In other words, the error at each step is proportional to the square of the error on
the previous step. By comparison, the bisection algorithm has

    |εᵢ₊₁| = (1/2)|εᵢ|                                   (10.31)

The quadratic factor results in Newton's method converging much more quickly than
the bisection algorithm. We will return to this in lesson 12.
We can also implement Newton’s method iteratively in Mathematica:
NewtonsMethod[f_, p0_, eps_: 0.001, Nmax_: 10] :=
 Module[{delta, Delta, i, p},
  Delta[x_] := f[x]/f'[x];
  p = p0;
  delta = Delta[p]; i = 0;
  While[And[Abs[delta] > eps, i++ < Nmax],
   p = p - delta;
   delta = Delta[p];
   ];
  Return[{i, p}];
  ]


To find the root of x² − 2 to 50 significant figures,

In:=

f[x_] := x^2 - 2;
NewtonsMethod[f, 1.5`53, 10^-50]
Out:=
{6, 1.414213562373095048801688724209698078569671875376948}

Under certain conditions Newton's method will not converge, even if a root does
exist. For example, by theorem 10.1, if the derivative is not continuous in the entire
interval then it may fail. The function f(x) = x/√|x| provides an example of this
situation. The derivative is everywhere continuous except at the origin, where it
becomes infinite. The plot of f(x) is also a mirror image of itself through the origin.
In this case Newton's method can lead to cyclic iteration. A similar case can occur
if the initial point is chosen on the edge of an open interval of convergence, as with
f(x) = x/(1 + x²) at x = 1/√3. In both cases we have a situation where xn+1 = −xn
and the function is a mirror image of itself. The same thing happens with f(x) = x²
if x0 = √(5/3).
A variation on the Newton’s method called the Damped Newton’s Method can fix
these situations by checking if successive iterations decrease in magnitude. If they do
not, the interval is halved, until they do. The damped Newton method will always
converge to either a root or to a local minimum.

Algorithm Damped Newton Method

Input f(x), p0, tolerance ε
Let ∆ = f(p0)/f′(p0)
Let p = p0, fnew = f(p)
While |∆| > ε,
    fold = fnew
    pnew = p − ∆
    fnew = f(pnew)
    While |fnew| ≥ |fold|,
        ∆ = ∆/2
        pnew = p − ∆
        fnew = f(pnew)
    End While
    p = pnew
    ∆ = f(p)/f′(p)
End While
Return p
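
A Mathematica sketch of this damped method might look as follows; the name
dampedNewton and the cap of 50 halvings (a guard against the inner loop stalling
when |f| cannot be decreased further) are my own additions:

dampedNewton[f_, p0_, eps_] := Module[{p = p0, pnew, delta, fold, fnew, k},
  delta = f[p]/f'[p];
  fnew = f[p];
  While[Abs[delta] > eps,
   fold = fnew;
   pnew = p - delta;
   fnew = f[pnew];
   k = 0;
   While[Abs[fnew] >= Abs[fold] && k < 50,   (* halve the step until |f| decreases *)
    delta = delta/2;
    pnew = p - delta;
    fnew = f[pnew];
    k++];
   p = pnew;
   delta = f[p]/f'[p]];
  p]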

The formula that is now known as Newton's method was actually developed by the
British mathematician Thomas Simpson (1740), who is better known as the inventor


Figure 10.3: When Newton’s method fails. Top: f (x) = x/|x| has an infinite deriva-
2
tive at the origin. Middle: f (x) = x/(1
√+ x √ ) is a mirror image of itself. Newton’s
method converges on the interval (−1/ 3, 1/ 3), diverges outside this interval, and
oscillates right on the endpoints. Bottom: f (x) = x2 has a local
p minimum, but no
root, at x = 0. Newton oscillation can become trapped if x0 = 5/3.

1.5

1.0

0.5

0.0

!0.5

!1.0

!1.5
!2 !1 0 1 2

0.4

0.2

0.0

!0.2

!0.4

!0.5 0.0 0.5

0
!2 !1 0 1 2


of Simpson’s Rule to numerically calculate integrals. Newton (1669) and Joseph


Raphson (1690) published formulas based on the results of François Viète (1540-1603),
who derived a set of formulas for the roots of polynomials. Hero of Alexandria (10-
70) wrote about the “Babylonian Algorithm” for the square root that we discussed in
chapter 1, which is also a form of Newton’s method. This method is widely
√ attributed
to the ancient Babylonians because of the existence of a formula for 2 on an ancient
tablet, but the evidence that they used this algorithm is not conclusive.

Lesson 11

Secant Method

The main problem with Newton's method is that we need to know both the function
and its derivative. If the derivative is easy to calculate this is not a problem, but
sometimes it can be very expensive computationally to calculate the derivative. The
solution to this problem is to stop calculating the derivative and instead approximate
it by the slope of the line connecting the two most recent guesses (see the figure
below).

[Figure omitted: the curve f(x) with a secant line through (p0, f(p0)) and (p1, f(p1))
intersecting the x-axis at p2.]

The slope of the line through the points (pn, f(pn)) and (pn−1, f(pn−1)) is used to
approximate the derivative at pn:

    f′(pn) ≈ (f(pn) − f(pn−1))/(pn − pn−1)               (11.1)


The derivation is similar to the derivation of Newton's method; we just use the slope
derived here in place of the derivative:

    pn+1 = pn − f(pn)/f′(pn)                             (11.2)
         = pn − f(pn)(pn − pn−1)/(f(pn) − f(pn−1))       (11.3)

This method converges at about the same rate as Newton’s method. Here is the
algorithm.

Algorithm SecantMethod
Input f(x), p0, p1, tolerance ε
Let q0 = f(p0), q1 = f(p1)
Let ∆ = q1(p1 − p0)/(q1 − q0)
While |∆| > ε,
    p0 = p1;
    p1 = p1 − ∆
    q0 = q1
    q1 = f(p1)
    ∆ = q1(p1 − p0)/(q1 − q0)
End While
Return p1
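
A direct Mathematica translation of this algorithm might look like the following (the
function name secant is my own):

secant[f_, P0_, P1_, eps_] := Module[{p0 = P0, p1 = P1, q0, q1, delta},
  q0 = f[p0]; q1 = f[p1];
  delta = q1 (p1 - p0)/(q1 - q0);
  While[Abs[delta] > eps,
   p0 = p1;
   p1 = p1 - delta;
   q0 = q1;
   q1 = f[p1];
   delta = q1 (p1 - p0)/(q1 - q0)];
  p1]

For example, secant[#^2 - 2 &, 1., 2., 10^-10] converges to √2 in a handful of
iterations.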

One complaint about both Newton's method and the Secant method is that it is
difficult to estimate an error bound. With bisection, on the other hand, we had
|ε| < |an − bn|/2, because we know that the actual root always lies somewhere inside
the interval [an, bn]. Since successive iterations of either Newton's Method (or the
Secant method) will not, in general, bracket the root, we cannot make this type of
simple error limit. The Method of Regula Falsi (Method of False Position) is a
modification of the Secant Method that ensures that successive iterations bracket the
root, at some (sometimes significant) cost of execution time.
Here is the idea behind the algorithm. We initially start with two guesses p0 , p1
that are known to bracket the root and then calculate p2 using the secant method. If

f (p1 )f (p2 ) < 0 (11.4)

then the root is in the interval [p1 , p2 ], so we use p1 and p2 to calculate p3 , and p1 and
p3 become our next initial values. Otherwise, we use p0 and p2 to calculate p3 , and
p0 and p3 become our next values. The algorithm is shown below.


Algorithm Regula Falsi

Input f(x), p0, p1, tolerance ε
Let q0 = f(p0), q1 = f(p1)
Let ∆ = q1(p1 − p0)/(q1 − q0)
While |∆| > ε,
    p = p1 − ∆;
    q = f(p);
    If qq1 < 0 then
        p0 = p1;
        q0 = q1;
    End If;
    p1 = p;
    q1 = q;
    ∆ = q1(p1 − p0)/(q1 − q0)
End While
Return p1
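
A corresponding Mathematica sketch of Regula Falsi (the name regulaFalsi is mine)
differs from the secant code only in the bracket-keeping test:

regulaFalsi[f_, P0_, P1_, eps_] := Module[{p0 = P0, p1 = P1, p, q, q0, q1, delta},
  q0 = f[p0]; q1 = f[p1];
  delta = q1 (p1 - p0)/(q1 - q0);
  While[Abs[delta] > eps,
   p = p1 - delta;
   q = f[p];
   If[q q1 < 0,              (* keep the endpoint that still brackets the root *)
    p0 = p1; q0 = q1];
   p1 = p; q1 = q;
   delta = q1 (p1 - p0)/(q1 - q0)];
  p1]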

Versions of the method of false position were cited in Vaishali Ganit, written
in India around the 3rd century BC, and The Nine Chapters of Mathematical Art
written in China a century or two later. It was well known by the middle ages and
was cited by Fibonacci in his text Liber Abaci written in 1202.

Lesson 12

Error Analysis for Iterative Methods

Definition 12.1 (Order of Convergence). We say that a sequence pn converges
to p (or write pn → p) with order k > 0 and asymptotic error constant λ > 0 if

    lim_{n→∞} εn+1/εnᵏ = lim_{n→∞} |pn+1 − p|/|pn − p|ᵏ = λ     (12.1)

where εn is the error after the nth iteration.

We say that pn → p converges linearly if

    lim_{n→∞} εn+1/εn = lim_{n→∞} |pn+1 − p|/|pn − p| = λ       (12.2)

We say pn → p converges quadratically if

    lim_{n→∞} εn+1/εn² = lim_{n→∞} |pn+1 − p|/|pn − p|² = λ     (12.3)

The following is a good general rule of thumb: the higher the order of conver-
gence, the faster the sequence converges. To see why this general rule is true,
suppose, for example, that an → p linearly with asymptotic error constant λ and that
bn → p quadratically with the same asymptotic error constant λ. Then

    lim_{n→∞} |an+1 − p|/|an − p| = λ = lim_{n→∞} |bn+1 − p|/|bn − p|²   (12.4)


Hence for sufficiently large n,

    |an − p| ≈ λ|an−1 − p|                               (12.5)
             ≈ λ²|an−2 − p|                              (12.6)
             ≈ λ³|an−3 − p|                              (12.7)
             ⋮
             ≈ λⁿ|a0 − p|                                (12.8)
             = λⁿ∆                                       (12.9)

where ∆ = |a0 − p|. Now suppose that b0 = a0. Then for sufficiently large n,

    |bn − p| ≈ λ|bn−1 − p|²                              (12.10)
             ≈ λ(λ|bn−2 − p|²)²                          (12.11)
             = λ³|bn−2 − p|⁴                             (12.12)
             ≈ λ³(λ|bn−3 − p|²)⁴                         (12.13)
             = λ⁷|bn−3 − p|⁸                             (12.14)
             ⋮
             ≈ λ^(2ⁿ−1)|b0 − p|^(2ⁿ)                     (12.15)
             = λ^(2ⁿ−1)∆^(2ⁿ)                            (12.16)
             = (λ∆)^(2ⁿ)/λ                               (12.17)
The following table illustrates the differences in the rate of convergence between
linearly convergent and quadratically convergent sequences. The table shows the
values of εn using ∆ = 1 and different values of λ.

              Linear      |               Quadratic
    λ         0.5         |   0.5            0.9           0.99
    n = 1     0.25        |   0.125          0.729         0.97
    n = 2     0.125       |   7.8 × 10⁻³     0.478         0.932
    n = 3     0.0625      |   3.1 × 10⁻⁵     0.206         0.860
    n = 4     0.0312      |   6.6 × 10⁻¹⁰    0.0382        0.732
    n = 5     0.0156      |   1.1 × 10⁻¹⁹    0.00131       0.531
    n = 6     0.0078      |   5.9 × 10⁻³⁹    1.5 × 10⁻⁶    0.279
    n = 7     0.0039      |   1.7 × 10⁻⁷⁷    2.1 × 10⁻¹²   0.0771
    n = 8     0.0019      |   1.5 × 10⁻¹⁵⁴   4.1 × 10⁻²⁴   5.9 × 10⁻³
    n = 9     9.7 × 10⁻⁴  |   1.1 × 10⁻³⁰⁸   1.5 × 10⁻⁴⁷   3.4 × 10⁻⁵
    n = 10    4.9 × 10⁻⁴  |   6.2 × 10⁻⁶¹⁷   2.2 × 10⁻⁹⁴   1.2 × 10⁻⁹
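
The table can be reproduced with a short Mathematica computation (my own sketch;
here the linear entries are λⁿ⁺¹ and the quadratic entries λ^(2ⁿ⁺¹−1), matching the
indexing of the rows above, and exact rationals are used so the tiny quadratic errors
do not underflow):

In:=

linErr[lam_, n_] := lam^(n + 1);
quadErr[lam_, n_] := lam^(2^(n + 1) - 1);
Table[{n, N[linErr[1/2, n], 2], N[quadErr[1/2, n], 2],
   N[quadErr[9/10, n], 2], N[quadErr[99/100, n], 2]}, {n, 1, 10}] // TableForm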



Example 12.1. Suppose we know two different algorithms to find √2, one of which is
linearly convergent with error constant λ = 1/2, and the other is quadratically conver-
gent with error constant λ = 1/2. Assuming our initial error is ∆ = 1, estimate the
number of iterations each algorithm will require to converge to 50 significant figures.
Solution. For the linearly convergent algorithm we have εn ≈ λⁿ∆, hence

    10⁻⁵⁰ > (1/2)ⁿ × (1)                                 (12.18)
    2ⁿ > 10⁵⁰                                            (12.19)
    n log 2 > 50 log 10                                  (12.20)
    n > 50 log 10/log 2 ≈ 50/0.301 ≈ 166                 (12.21)
For the quadratically convergent sequence εn ≈ (λ∆)^(2ⁿ)/λ = λ^(2ⁿ−1) for ∆ = 1. Hence

    10⁻⁵⁰ > (1/2)^(2ⁿ−1)                                 (12.22)
    2^(2ⁿ−1) > 10⁵⁰                                      (12.23)
    (2ⁿ − 1) log 2 > 50 log 10                           (12.24)
    2ⁿ > 1 + 50 log 10/log 2 ≈ 167                       (12.25)
    n log 2 > log 167                                    (12.26)
    n > log 167/log 2 ≈ 7.4                              (12.27)
so 8 iterations will suffice.

Example 12.2. An iteration formula to find ∛7 as the root of f(x) = x³ − 7, that
can be derived using Newton's method, is

    g(x) = x − (x³ − 7)/(3x²)                            (12.28)

Show that this algorithm converges quadratically.
Solution. Let x = pn. Then

    εn+1/εn² = (g(x) − 7^(1/3))/(x − 7^(1/3))²           (12.29)
             = (x − (x³ − 7)/(3x²) − 7^(1/3))/(x² − 2(7^(1/3))x + 7^(2/3))   (12.30)
             = (2x³ − 3(7^(1/3))x² + 7)/(3x⁴ − 6(7^(1/3))x³ + 3(7^(2/3))x²)  (12.31)



Hence, since we know that pn → ∛7 as n → ∞,

    lim_{n→∞} εn+1/εn² = lim_{x→∛7} (2x³ − 3(7^(1/3))x² + 7)/(3x⁴ − 6(7^(1/3))x³ + 3(7^(2/3))x²)   (12.32)

If we plug x = ∛7 into the right hand side of this limit we get 0/0, so we apply
L'Hopital's rule:

    lim_{n→∞} εn+1/εn² = lim_{x→∛7} (6x² − 6(7^(1/3))x)/(12x³ − 18(7^(1/3))x² + 6(7^(2/3))x)   (12.33)
                       = lim_{x→∛7} (x − 7^(1/3))/(2x² − 3(7^(1/3))x + 7^(2/3))                (12.34)

Again this gives 0/0 so we can use L'Hopital a second time,

    lim_{n→∞} εn+1/εn² = lim_{x→∛7} 1/(4x − 3(7^(1/3)))   (12.36)
                       = 1/(4(7^(1/3)) − 3(7^(1/3)))      (12.37)
                       = 7^(−1/3) ≈ 0.523                 (12.38)

which proves that the iteration converges quadratically with asymptotic error constant
λ = 7^(−1/3) ≈ 0.523.

Theorem 12.1. Newton’s Method converges quadratically if f 0 (p) 6= 0.

Proof. Recall from equation 10.30 that for Newton's method

    |εn+1| = |εn²f″(p)/(2f′(p))|                         (12.39)

Hence
    |εn+1|/εn² = |f″(p)/(2f′(p))|                        (12.40)

Thus Newton's method converges quadratically unless f′(p) = 0, with

    λ = |f″(p)/(2f′(p))|                                 (12.41)

Theorem 12.2. If all of the conditions of the fixed point theorem (theorem 9.5) are
met, and g′(p) ≠ 0, then the fixed point algorithm converges (at least) linearly.


We observe that this says that fixed point iteration converges at least linearly; it does
not mean that every fixed point algorithm converges only linearly. As we saw above,
Newton's method, which is a type of fixed point iteration, in fact converges quadrati-
cally. So this theorem says that convergence is linear or better, i.e., k ≥ 1.

Proof. Since p is a fixed point, p = g(p). Let p1, p2, . . . be the sequence of fixed point
iterates pn+1 = g(pn). Then by the mean value theorem, for each n there is a number
cn between the fixed point p and the nth fixed-point iterate pn such that

    g′(cn) = (g(pn) − g(p))/(pn − p) = (pn+1 − p)/(pn − p)    (12.42)

Therefore
    lim_{n→∞} |(pn+1 − p)/(pn − p)| = lim_{n→∞} |g′(cn)|     (12.43)

Since cn is between p and pn,

    |cn − p| ≤ |pn − p|                                  (12.44)

Furthermore, since the conditions of theorem 9.5 are met, we know that pn → p and
therefore
    0 ≤ lim_{n→∞} |cn − p| ≤ lim_{n→∞} |pn − p| = 0      (12.45)

hence
    lim_{n→∞} cn = p                                     (12.46)

Therefore
    lim_{n→∞} g′(cn) = g′(p)                             (12.47)

Using equation 12.47 in equation 12.43, we find that

    lim_{n→∞} |(pn+1 − p)/(pn − p)| = |g′(p)| ≠ 0        (12.48)

Thus the sequence converges linearly with asymptotic error constant λ = |g′(p)|.

Theorem 12.3. Let I be an open interval and suppose that the following conditions
hold:

1. g(x) is twice continuously differentiable on I;

2. p ∈ I is a fixed point of g(x);

3. g′(p) = 0;

4. g″(p) ≠ 0;


5. |g′(x)| ≤ K < 1 on I;

6. |g″(x)| < M on I (i.e., g″ is bounded on I).


Then there exists some δ > 0 such that for any p0 in the interval

[p − δ, p + δ] ⊂ I (12.49)

the sequence pn → p quadratically.

[Sketch omitted: the geometry for theorem 12.3, showing the interval [p − δ, p + δ]
inside the open interval I around the fixed point p.]

Proof. Choose some δ > 0 such that S = [p − δ, p + δ] ⊂ I. Since |g′(x)| ≤ K < 1
on I, then |g′(x)| ≤ K < 1 on S; and since g(p) = p, the same argument as in the
proof of theorem 10.1 shows that g maps S into a subset of itself. Hence by theorem
9.5, for any p0 ∈ S the sequence pk lies entirely in S and converges to p.
Pick any point x ∈ S. By Taylor's theorem, there is some number c between p
and x such that
    g(x) = g(p) + g′(p)(x − p) + (g″(c)/2)(x − p)²       (12.50)

By assumption 3 of the theorem, g′(p) = 0, so that

    g(x) = g(p) + (g″(c)/2)(x − p)²                      (12.51)

Since p is a fixed point of g, g(p) = p, and

    g(x) = p + (g″(c)/2)(x − p)²                         (12.52)

Hence for each pn in the sequence there is a number cn such that

    g(pn) = p + (g″(cn)/2)(pn − p)²                      (12.53)

where cn is between p and pn. But since g(pn) = pn+1,

    pn+1 − p = (g″(cn)/2)(pn − p)²                       (12.54)

or
    lim_{n→∞} |(pn+1 − p)/(pn − p)²| = (1/2) lim_{n→∞} |g″(cn)|   (12.55)


Since cn → p, g″(cn) → g″(p) and

    lim_{n→∞} |(pn+1 − p)/(pn − p)²| = (1/2)|g″(p)|      (12.56)

Since |g″(p)| ≠ 0, the sequence converges quadratically with asymptotic error constant
|g″(p)|/2.
One could ask the following question: given any linearly convergent sequence, how
can we turn it into a quadratically convergent sequence? One way to do this is as
follows. Let p be a root of f(x); the goal is to find a method that converges to p
quadratically. Since f(p) = 0, we can form a function

    g(x) = x − h(x)f(x)                                  (12.57)

where h(x) is any function. But now g(p) = p − h(p)f(p) = p, so p is a fixed point of
g. By theorem 12.3, we need g′(p) = 0 to get quadratic convergence:

    0 = g′(p)                                            (12.58)
      = 1 − h′(p)f(p) − h(p)f′(p)                        (12.59)
      = 1 − h(p)f′(p)                                    (12.60)

or
    h(p) = 1/f′(p)                                       (12.61)

so long as f′(p) ≠ 0. Substituting equation 12.61 into equation 12.57,

    g(x) = x − f(x)/f′(x)                                (12.62)

which is precisely the formula for Newton's method.

Definition 12.2 (Zero, Multiplicity). A root p of a function f(x) is called a zero
of multiplicity m if there exists some function q(x) such that

    f(x) = (x − p)ᵐ q(x)                                 (12.63)

and
    lim_{x→p} q(x) ≠ 0                                   (12.64)

If q is continuous this also means that q(p) ≠ 0. A simple zero or simple root is
a zero of multiplicity 1. Roots of multiplicity m > 1 are called repeated roots.

Example 12.3. The function f(x) = x² + 7x + 12 has simple zeroes at x = −4 and
x = −3.


Example 12.4. The function f(x) = (x − 2)²(x − 3) has a simple root at x = 3 and
a root of multiplicity 2 at x = 2.
Theorem 12.4. Let f(x) be a continuously differentiable function on [a, b]. Then f
has a simple zero p ∈ (a, b) if and only if f(p) = 0 and f′(p) ≠ 0.
Proof. Since this is an "if-and-only-if" theorem we need to prove two things:

(a) If p is a simple root then f(p) = 0 and f′(p) ≠ 0; and

(b) If f(p) = 0 and f′(p) ≠ 0 then p is a simple root.


To prove (a) we first assume that p is a simple root. Then since it is a root, we
automatically know that f(p) = 0. The only other thing we need to show is that
f′(p) ≠ 0. But since p is a simple root, there must exist some function q(x) such
that
    lim_{x→p} q(x) ≠ 0                                   (12.65)

and
    f(x) = (x − p)q(x)                                   (12.66)

Since f is continuously differentiable, so is q. In particular, q is continuous at p,
which means that
    lim_{x→p} q(x) = q(p)                                (12.67)

Hence by 12.65, q(p) ≠ 0. But from equation 12.66,

    f′(x) = q(x) + (x − p)q′(x)                          (12.68)
    f′(p) = q(p) + (p − p)q′(p)                          (12.69)
          = q(p) ≠ 0                                     (12.70)

which completes the proof of part (a).


To prove part (b), assume that both f(p) = 0 and f′(p) ≠ 0 are true. Then for
each x, by the mean value theorem there is a number c between p and x such that

    f′(c) = (f(x) − f(p))/(x − p) = f(x)/(x − p)         (12.71)

Consequently
    f(x) = (x − p)f′(c)                                  (12.72)

Since c is between x and p, by pinching,

    lim_{x→p} c = p                                      (12.73)

Let
    q(x) = f′(c)                                         (12.74)

then
    f(x) = q(x)(x − p)                                   (12.75)

where

    lim_{x→p} q(x) = lim_{x→p} f′(c)                     (12.76)
                   = f′(lim_{x→p} c)                     (12.77)
                   = f′(p)                               (12.78)
                   ≠ 0                                   (12.79)

Hence the function has a simple zero.

Corollary 12.1. Newton’s method converges quadratically if p is a simple root.

Theorem 12.5. Suppose the function f(x) is m-times continuously differentiable in
the interval [a, b]. Then f(x) has a zero of multiplicity m at p in (a, b) if and only if

    0 = f(p) = f′(p) = f″(p) = · · · = f^(m−1)(p)  and  f^(m)(p) ≠ 0    (12.80)

Theorem 12.6. Suppose that f(x) is continuously differentiable on [a, b] and has a
root of multiplicity m > 1 at p ∈ (a, b). Then p is a simple root of µ(x) = f(x)/f′(x).

Proof. Since f(x) has a root of multiplicity m > 1, there is some function g(x)
such that
    f(x) = (x − p)ᵐ g(x)                                 (12.81)

where g(p) ≠ 0. Differentiating,

    f′(x) = m(x − p)ᵐ⁻¹ g(x) + (x − p)ᵐ g′(x)            (12.82)

Therefore

    µ(x) = (x − p)ᵐ g(x) / (m(x − p)ᵐ⁻¹ g(x) + (x − p)ᵐ g′(x))                  (12.83)
         = (x − p)ᵐ⁻¹(x − p) g(x) / (m(x − p)ᵐ⁻¹ g(x) + (x − p)ᵐ⁻¹(x − p) g′(x))   (12.84)
         = (x − p) g(x) / (m g(x) + (x − p) g′(x))                              (12.85)
         = (x − p) q(x)                                                         (12.86)

where
    q(x) = g(x) / (m g(x) + (x − p) g′(x))               (12.87)


Since g(p) ≠ 0,

    q(p) = g(p)/(m g(p) + (p − p)g′(p)) = 1/m ≠ 0  because m > 1    (12.88)

hence
    µ(x) = (x − p)q(x)                                   (12.89)

where q(p) ≠ 0. Since µ(p) = 0, p is a root of µ; since q(p) ≠ 0, it is a simple
root.
Therefore we know that Newton’s method will converge quadratically to a root
of µ(x) even though it will only converge linearly to a repeated root of f (x). Using
Newtons method to find the simple root of µ(x) gives
µ(x)
g(x) = x − (12.90)
µ0 (x)
The function g has a fixed point at any root of µ(x), and the following iteration
converges quadratically,
µ(xn )
xn+1 = xn − 0 (12.91)
µ (xn )
because µ0 (p) 6= 0. But since µ(x) = f (x)/f 0 (x) then by the quotient formula for
differentiation,
f 0 f 0 − f f 00
µ0 (x) = (12.92)
(f 0 )2
f 02 − f f 00
= (12.93)
f 02
Therefore,
f (x)/f 0 (x)
g(x) = x − (12.94)
(f 0 (x)2 − f (x)f 00 (x))/(f 0 (x))2
f 0 (x)f (x)
=x− 0 (12.95)
(f (x))2 − f (x)f 00 (x)
This gives us the following quadratically convergent iteration formula:
f 0 (xn )f (x)
xn+1 = xn − (12.96)
(f 0 (xn ))2 − f (xn )f 00 (xn )
The problem with this formula arises from the fact that both f (p) and f 0 (p) are zero
and therefore as the iteration approaches the root, both (f 0 (xn ))2 and f (xn )f 00 (xn )
are very small numbers: taking the difference of two very small numbers can lead to
round-off errors.
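
As an illustration (a sketch of my own, not from the text), we can compare ordinary
Newton iteration with the modified formula 12.96 on a function with a double root:

In:=

f[x_] := (x - 2)^2 (x + 1);     (* root of multiplicity 2 at x = 2 *)
newton[x_] := x - f[x]/f'[x];
modified[x_] := x - f[x] f'[x]/(f'[x]^2 - f[x] f''[x]);
NestList[newton, 3.0, 6]        (* the error roughly halves each step: linear *)
NestList[modified, 3.0, 4]      (* converges quadratically to 2 *)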

Lesson 13

The Aitken-Steffensen Methods

Definition 13.1. Let pn be a sequence. Then the first forward difference is

    ∆pn = pn+1 − pn                                      (13.1)

Successive forward differences are defined recursively: the kth forward difference is
given in terms of the (k − 1)st forward difference as

    ∆ᵏpn = ∆(∆ᵏ⁻¹pn)                                     (13.2)

The coefficients in the expanded formulas for specific differences follow Pascal's triangle.

The second forward difference is

    ∆²pn = ∆(∆pn) = ∆pn+1 − ∆pn                          (13.3)
         = (pn+2 − pn+1) − (pn+1 − pn)                   (13.4)
         = pn+2 − 2pn+1 + pn                             (13.5)

The third forward difference is

    ∆³pn = ∆(∆²pn) = ∆²pn+1 − ∆²pn                       (13.6)
         = (pn+3 − 2pn+2 + pn+1) − (pn+2 − 2pn+1 + pn)   (13.7)
         = pn+3 − 3pn+2 + 3pn+1 − pn                     (13.8)

and so forth.
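
As a quick check (my own addition, not in the original notes), Mathematica's built-in
Differences computes exactly these forward differences:

In:=

p = {1, 4, 9, 16, 25};          (* pn = (n + 1)^2 *)
{Differences[p], Differences[p, 2], Differences[p, 3]}

Out:=

{{3, 5, 7, 9}, {2, 2, 2}, {0, 0}}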

Aitken’s method (for its inventor, George Aitken, 1895-1967) is based on the
following observation. For any linearly convergent method,

pn+1 − p
lim =λ>0 (13.9)
n→∞ pn − p


Thus for sufficiently large n we might expect that

    (pn+1 − p)/(pn − p) ≈ λ                              (13.10)

Since this should be true for all n after a certain point, we would also expect that

    (pn+2 − p)/(pn+1 − p) ≈ λ                            (13.11)

Setting the last two expressions for λ equal to one another gives

    (pn+1 − p)/(pn − p) ≈ (pn+2 − p)/(pn+1 − p)          (13.12)
Cross-multiplying and solving for p,

    (pn+1 − p)² = (pn+2 − p)(pn − p)                     (13.13)
    pn+1² − 2p pn+1 + p² = pn+2 pn − p(pn + pn+2) + p²   (13.14)
    pn+1² − 2p pn+1 = pn+2 pn − p(pn + pn+2)             (13.15)
    p(pn + pn+2) − 2p pn+1 = pn+2 pn − pn+1²             (13.16)
    p(pn+2 − 2pn+1 + pn) = pn+2 pn − pn+1²               (13.17)
    p ∆²pn = pn+2 pn − pn+1²                             (13.18)

Hence

    p = (pn+2 pn − pn+1²)/∆²pn                           (13.19)
      = (pn+2 pn − pn+1²)/∆²pn + pn − pn                 (13.20)
      = pn + (pn+2 pn − pn+1² − pn ∆²pn)/∆²pn            (13.21)
Expanding the numerator,

    pn+2 pn − pn+1² − pn ∆²pn = pn+2 pn − pn+1² − pn(pn+2 − 2pn+1 + pn)    (13.22)
                              = pn+2 pn − pn+1² − pn pn+2 + 2pn pn+1 − pn²  (13.23)
                              = −pn+1² + 2pn pn+1 − pn²                    (13.24)
                              = −(pn+1² − 2pn pn+1 + pn²)                  (13.25)
                              = −(pn+1 − pn)²                              (13.26)
                              = −(∆pn)²                                    (13.27)

Therefore,
    p = pn − (∆pn)²/∆²pn                                 (13.28)


Aitken’s idea was that if we have a converging sequence pn → p as n → ∞ then the


sequence
(∆pn )2
qn = pn − (13.29)
∆2 pn
should converge to p faster than pn . We will accept this fact without proof in the
following theorem.

Theorem 13.1 (Aitken). Suppose that pn → p linearly and there is some number
N such that for all n > N,

    (pn − p)(pn+1 − p) > 0                               (13.30)

then the sequence qn → p faster than pn → p, where

    qn = pn − (∆pn)²/∆²pn                                (13.31)

in the sense that

    lim_{n→∞} (qn − p)/(pn − p) = 0                      (13.32)

Proof. (Outline of Proof) Define

    δn = (pn+1 − p)/(pn − p) − λ                         (13.33)

Then (proof left as an exercise)

    lim_{n→∞} δn = 0                                     (13.34)

and (derivation left as an exercise)

    (qn − p)/(pn − p) = (λ(δn + δn+1) − 2δn + δn δn+1 − 2δn(λ − 1) − δn²) /
                        ((λ − 1)² + λ(δn + δn+1) − 2δn + δn δn+1)          (13.35)

Taking the limit gives equation 13.32.

Johan Frederik Steffensen (1873-1961) observed that the sequence would converge
faster if we started each iteration with (qi, g(qi), g(g(qi))) instead of (pi, g(pi), g(g(pi))).
The difference between the two methods (which is subtle) is illustrated in figure 13.1.

Theorem 13.2 (Steffensen’s Method). Suppose that f (x) is thrice continuously


differentiable and has a fixed point p with f 0 (p) 6= 1. Then Aitken’s method can be
made to converge quadratically if we replace (qi , g(qi ), g(g(qi )) instead of (pi , g(pi ), g(g(pi ))
at each iteration.


Figure 13.1: Top: In Aitken’s method, at the nd of each iteration, the next iteration
begins by setting p1 = p0 . Bottom: In Steffensen’s method we set p0 = q. In both
methods, p1 = f (p0 ), p2 = f (p1 ), and q is computed from equation 13.31

Aitken’s Method:
p p p q
0 1 2

p p p q
0 1 2

p p p q
0 1 2

Steffensen’s Method:
p p p q
0 1 2

p p p q
0 1 2

p p p q
0 1 2

The following algorithm uses Aitken's method to find the fixed point of the function
f(x).

Algorithm Aitken
Input f(x), p0, tolerance ε
Let δ = ∞;
While δ > ε,
    p1 = f(p0);
    p2 = f(p1);
    ∆p = p1 − p0;
    ∆∆p = (p2 − p1) − ∆p;
    p = p0 − (∆p)²/∆∆p;
    δ = |p − p0|;
    p0 = p1; (this is where Steffensen's method differs)
End While
Return p

Steffensen’s method is only different in one place, as indicated in the following


alorithm.


Algorithm Steffensen
Input f(x), p0, tolerance ε
Let δ = ∞;
While δ > ε,
    p1 = f(p0);
    p2 = f(p1);
    ∆p = p1 − p0;
    ∆∆p = (p2 − p1) − ∆p;
    p = p0 − (∆p)²/∆∆p;
    δ = |p − p0|;
    p0 = p; (this is where Aitken's method differs)
End While
Return p
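
Here is a minimal Mathematica sketch of the Steffensen algorithm above (the function
name steffensen is my own); applied to g(x) = cos x it reproduces the fixed point of
example 9.2 in just a few iterations:

steffensen[f_, P0_, eps_] := Module[{p0 = P0, p, p1, p2, dp, ddp, delta = Infinity},
  While[delta > eps,
   p1 = f[p0];
   p2 = f[p1];
   dp = p1 - p0;
   ddp = (p2 - p1) - dp;
   p = p0 - dp^2/ddp;
   delta = Abs[p - p0];
   p0 = p];                     (* Aitken's method would set p0 = p1 here instead *)
  p]

For example, steffensen[Cos, 1.0, 10^-10] returns 0.739085....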

Both Aitken’s method and Steffensen’s method can be used to find the fixed point
of a function. To find the root of the function we have the following algorithm, which
significantly improves the rate of convergence of Newton’s method when there are
repeated roots (e.g., for functions such as f (x) = (x − 2)2 .

Algorithm Newton-Steffensen
Input f(x), p0, tolerance ε
Define g(x) = x − f(x)/f′(x);
p = Steffensen(g, p0, ε);
Return p.

Lesson 14

Synthetic Division and Horner’s


Method

Definition 14.1 (Polynomial). Let a0, . . . , an be arbitrary constants. Then any
function P(x) of the form

    P(x) = Σₖ₌₀ⁿ aₖxᵏ = a0 + a1x + a2x² + · · · + anxⁿ    (14.1)

with an ≠ 0 is called a polynomial of order n in x.

To implement a polynomial most efficiently, we observe that once we know x², it
is faster to calculate x³ = x × x² rather than as x × x × x; once we know x³ it is faster
to calculate x⁴ as x × x³ rather than x × x × x × x; and so forth. In general, we want
to calculate xⁿ as x × xⁿ⁻¹. This produces the concept of nested multiplication:

    P(x) = a0 + a1x + a2x² + · · · + anxⁿ                 (14.2)
         = a0 + x(a1 + a2x + a3x² + · · · + anxⁿ⁻¹)       (14.3)
         ⋮
         = a0 + x(a1 + x(a2 + x(a3 + · · · + x(an−1 + anx))) · · · )   (14.4)
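
In Mathematica the nested form can be evaluated with a single Fold (a sketch of my
own; the list a holds the coefficients a0, . . . , an):

In:=

a = {1, 2, 3, 4};               (* P(x) = 1 + 2x + 3x^2 + 4x^3 *)
horner[a_List, x_] := Fold[#1 x + #2 &, 0, Reverse[a]];
{horner[a, 2], 1 + 2*2 + 3*2^2 + 4*2^3}

Out:=

{49, 49}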

Theorem 14.1 (Fundamental Theorem of Algebra). Every polynomial of degree


n has precisely n roots.

Some (or all) of the roots may be complex. Since the complex roots of a polynomial
with real coefficients come in conjugate pairs, the total number of complex roots must
be even, and thus a polynomial of odd degree (with real coefficients) always has at
least one real root. If the unique roots are given as r1, r2, ..., rk, each with multiplicity
m1, m2, ..., mk, then we can always write a polynomial as

    P(x) = C(x − r1)^m1 (x − r2)^m2 · · · (x − rk)^mk     (14.5)


Descartes proposed in 1637 that one could imagine that there were n roots to a poly-
nomial. Albert Girard (1629) proposed that an nth order polynomial has n roots but
that they may exist in a field larger than the complex numbers. The first published
proof of the fundamental theorem of algebra was by d'Alembert in 1746, but his proof
was based on an earlier theorem that itself used the theorem, and hence is unsatisfac-
tory. At about the same time Euler proved it for polynomials with real coefficients up
to degree 6. Between 1799 (in his doctoral dissertation) and 1816 Gauss published
three different proofs for polynomials with real coefficients, and in 1849 he proved the
general case for polynomials with complex coefficients.

Theorem 14.2. If two polynomials of degree n agree at n + 1 distinct points, then they must be identical. More precisely: if P(x) and Q(x) are two polynomials of the same degree n, and there exist distinct numbers x_1, . . . , x_{n+1} such that P(x_k) = Q(x_k) for k = 1, . . . , n + 1, then P(x) = Q(x) for all x.

For example, if two lines agree at two points, they are identical; if two parabolas agree at three points, they are identical; and so on.

Theorem 14.3 (Horner's Method for Synthetic Division). Let P(x) be any polynomial of degree n, given by

P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n    (14.6)

Then for any number x_0 there exists another polynomial Q(x) of degree n − 1, given by

Q(x) = b_1 + b_2 x + b_3 x^2 + \cdots + b_n x^{n−1}    (14.7)

such that

P(x) = (x − x_0)Q(x) + b_0    (14.8)

where b_n = a_n and

b_k = a_k + b_{k+1} x_0    (14.9)

for k = n − 1, n − 2, . . . , 0. Furthermore, b_0 = P(x_0) and

P'(x_0) = Q(x_0)    (14.10)

Proof. Suppose that

Q(x) = bn xn−1 + bn−1 xn−2 + · · · + b3 x2 + b2 x + b1 (14.11)

for some undetermined numbers b1 , . . . , bn . Then we ask what conditions will ensure
that
P (x) = (x − x0 )Q(x) + b0 (14.12)


Multiplying things out,

(x − x_0)Q(x) + b_0
 = b_0 + (x − x_0)(b_n x^{n−1} + b_{n−1} x^{n−2} + \cdots + b_3 x^2 + b_2 x + b_1)    (14.13)
 = b_0 + x(b_n x^{n−1} + b_{n−1} x^{n−2} + \cdots + b_2 x + b_1) − x_0(b_n x^{n−1} + b_{n−1} x^{n−2} + \cdots + b_2 x + b_1)    (14.14)
 = b_n x^n + b_{n−1} x^{n−1} + \cdots + b_2 x^2 + b_1 x − b_n x_0 x^{n−1} − x_0 b_{n−1} x^{n−2} − \cdots − x_0 b_2 x − x_0 b_1 + b_0    (14.15)
 = b_n x^n + (b_{n−1} − b_n x_0) x^{n−1} + (b_{n−2} − x_0 b_{n−1}) x^{n−2} + \cdots    (14.16)
   + (b_2 − x_0 b_3) x^2 + (b_1 − x_0 b_2) x + (b_0 − x_0 b_1)    (14.17)

We want this to be equal to

P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n    (14.18)

Equating coefficients of like powers of x gives us

a_n = b_n    (14.19)
a_{n−1} = b_{n−1} − b_n x_0    (14.20)
a_{n−2} = b_{n−2} − b_{n−1} x_0    (14.21)
 ⋮
a_0 = b_0 − x_0 b_1    (14.22)

Rearranging,

b_n = a_n    (14.23)
b_{n−1} = a_{n−1} + b_n x_0    (14.24)
b_{n−2} = a_{n−2} + b_{n−1} x_0    (14.25)
 ⋮
b_0 = a_0 + b_1 x_0    (14.26)

This proves equation 14.9.

Next, to see that b_0 = P(x_0) we observe that

P(x_0) = (x_0 − x_0)Q(x_0) + b_0 = b_0    (14.27)


Furthermore, differentiating P(x) = (x − x_0)Q(x) + b_0 gives

P'(x) = (x − x_0)Q'(x) + Q(x)    (14.28)

hence

P'(x_0) = Q(x_0)    (14.29)

which gives us equation 14.10, completing the proof.

The following gives a recapitulation of the algorithm for Horner's method to calculate the numbers P(x_0) and P'(x_0) for a polynomial.

Algorithm Horner
Input a_0, . . . , a_n, x_0;
Set y = a_n; (y accumulates the b_j for P)
Set z = a_n; (z accumulates the c_j for Q)
For j = n − 1, n − 2, . . . , 1,
    y = x_0 y + a_j; (this gives b_j for P(x_0))
    z = x_0 z + y; (this gives c_{j−1} for the calculation of Q(x_0))
End For;
y = x_0 y + a_0; (this gives b_0 = P(x_0))
Return y (which is P(x_0)) and z (which is P'(x_0) = Q(x_0))

We can make two interesting observations about Horner's method. First, it has the same number of multiplications as nested multiplication, making it at least as efficient as that algorithm. Second, it gives us values for both P(x_0) and P'(x_0) at essentially no extra cost. This becomes useful in operations where both numbers are needed, such as in Newton's method for the roots of a polynomial.

Algorithm Newton's Method with Horner
Input a_0, . . . , a_n, x_0, tolerance ε;
(y, z) = Horner(a_0, . . . , a_n, x_0)
p = x_0
δ = y/z
While |δ| > ε,
    p = p − δ
    (y, z) = Horner(a_0, . . . , a_n, p)
    δ = y/z
End While
Return p

Let x_0 be a root of P. Then we know that there exists a second polynomial Q(x) such that P(x) = (x − x_0)Q(x) + P(x_0) = (x − x_0)Q(x). So if P has any other roots different from x_0, then they are also roots of Q. Hence if we repeat the process on Q iteratively we will find all the remaining roots of P. Unfortunately this deflation process accumulates round-off error, which can be avoided by using a different algorithm that we will discuss subsequently.
Example 14.1. Find P (1) and P 0 (1) for P (x) = x3 −2x2 −5 using Horner’s method.


Solution. We have a_0 = −5, a_1 = 0, a_2 = −2, and a_3 = 1, and also x_0 = 1. Then

b_3 = a_3 = 1    (14.30)
b_2 = a_2 + b_3 x_0 = −2 + (1)(1) = −1    (14.31)
b_1 = a_1 + b_2 x_0 = 0 + (−1)(1) = −1    (14.32)
b_0 = a_0 + b_1 x_0 = −5 + (−1)(1) = −6    (14.33)

Hence P(1) = −6. Using the same algorithm for Q we have

c_2 = b_3 = 1    (14.34)
c_1 = b_2 + c_2 x_0 = (−1) + (1)(1) = 0    (14.35)
c_0 = b_1 + c_1 x_0 = (−1) + (0)(1) = −1    (14.36)

Hence Q(1) = P'(1) = c_0 = −1.


Horner's method is also fairly easy to implement in Mathematica. We can take advantage of the fact that a list may have an arbitrary number of elements, so we don't even need to know the degree of the polynomial:

Horner[A_?ListQ, x0_] := Module[{z, y, a},
  a = A;
  y = z = Last[a];        (* y accumulates P(x0), z accumulates Q(x0) *)
  a = Most[a];
  While[Length[a] > 1,
    y = x0*y + Last[a];
    z = x0*z + y;
    a = Most[a];
  ];
  y = x0*y + Last[a];     (* the final step gives b0 = P(x0) *)
  Print["P(x)=", A.("x"^Range[0, Length[A] - 1])];
  Print["P(", x0, ")=", y, "\n", "P'(", x0, ")=", z];
  Return[{y, z}]
]
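For the polynomial of Example 14.1 this returns both values at once (the In:=/Out:= below show only the returned list; the Print output is omitted):

In:=

Horner[{-5, 0, -2, 1}, 1]

Out:=

{-6, -1}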

Lesson 15

Müller’s Method

Müller's method is based on the idea that if a straight line is good, then a parabola is better. It is really a modification of the secant method, replacing the projection of a secant line with the projection of a parabola, fit to three consecutive points on the curve, to find the next guess. Suppose we "know" the value of f at three points x = p, x = q, and x = r on the curve of f(x). Then we need to find a parabola through the three points

(p, f(p)), (q, f(q)), (r, f(r))    (15.1)

Figure 15.1: Illustration of Müller's method. A parabola is fit to three points on the curve, and the intersection of the parabola with the x-axis is used for the next guess of the root.


The general equation for a parabola is

P(x) = Ax^2 + Bx + C    (15.2)
     = Ax^2 + Bx + C − 2pxA + 2pxA + p^2 A − p^2 A    (15.3)
     = A(x^2 − 2px + p^2) + (B + 2pA)x + C − Ap^2    (15.4)
     = A(x − p)^2 + (B + 2pA)x + C − Ap^2 − (B + 2pA)p + (B + 2pA)p    (15.5)
     = A(x − p)^2 + (B + 2pA)(x − p) + C − Ap^2 + (B + 2pA)p    (15.6)

Make the following substitutions:

a = A    (15.7)
b = B + 2pA    (15.8)
c = C − Ap^2 + (B + 2pA)p    (15.9)

This gives us

P(x) = a(x − p)^2 + b(x − p) + c    (15.10)
Since P(p) = f(p), P(q) = f(q), and P(r) = f(r),

f(p) = P(p) = a(p − p)^2 + b(p − p) + c = c    (15.11)
f(q) = P(q) = a(q − p)^2 + b(q − p) + c    (15.12)
            = a(q − p)^2 + b(q − p) + f(p)    (15.13)
f(r) = P(r) = a(r − p)^2 + b(r − p) + c    (15.14)
            = a(r − p)^2 + b(r − p) + f(p)    (15.15)
Rearranging,

f(q) − f(p) = a(q − p)^2 + b(q − p)    (15.16)
f(r) − f(p) = a(r − p)^2 + b(r − p)    (15.17)

This is a system of two equations in the two unknowns a and b. Multiplying equation 15.16 by r − p and equation 15.17 by q − p gives

(r − p)(f(q) − f(p)) = a(q − p)^2(r − p) + b(q − p)(r − p)    (15.18)
(q − p)(f(r) − f(p)) = a(r − p)^2(q − p) + b(r − p)(q − p)    (15.19)
Subtracting equation 15.19 from equation 15.18 gives

(r − p)(f(q) − f(p)) − (q − p)(f(r) − f(p))    (15.20)
 = a((q − p)^2(r − p) − (r − p)^2(q − p))    (15.21)
 = a(q − p)(r − p)((q − p) − (r − p))    (15.22)
 = a(q − p)(r − p)(q − p − r + p)    (15.23)
 = a(q − p)(r − p)(q − r)    (15.24)


Thus

a = \frac{(r − p)(f(q) − f(p)) − (q − p)(f(r) − f(p))}{(q − p)(r − p)(q − r)}    (15.25)

Next we multiply equation 15.16 by (r − p)^2 and equation 15.17 by (q − p)^2, which gives

(r − p)^2(f(q) − f(p)) = a(q − p)^2(r − p)^2 + b(q − p)(r − p)^2    (15.26)
(q − p)^2(f(r) − f(p)) = a(r − p)^2(q − p)^2 + b(r − p)(q − p)^2    (15.27)

Subtracting,

(r − p)^2(f(q) − f(p)) − (q − p)^2(f(r) − f(p))    (15.28)
 = b(q − p)(r − p)^2 − b(r − p)(q − p)^2    (15.29)
 = b(q − p)(r − p)((r − p) − (q − p))    (15.30)
 = b(q − p)(r − p)(r − p − q + p)    (15.31)
 = b(q − p)(r − p)(r − q)    (15.32)

and therefore

b = \frac{(r − p)^2(f(q) − f(p)) − (q − p)^2(f(r) − f(p))}{(q − p)(r − p)(r − q)}    (15.33)

Müller's method uses the intersection of the parabola with the x-axis as the next guess. Given three guesses p, q, r, the parabola intersects the axis at s, where

0 = P(s) = a(s − p)^2 + b(s − p) + c    (15.34)

If we define δ = s − p, then 0 = aδ^2 + bδ + c, and hence s = p + δ where

δ = \frac{−b ± \sqrt{b^2 − 4ac}}{2a}    (15.35)

and therefore

s = p + \frac{−b ± \sqrt{b^2 − 4ac}}{2a}    (15.36)

where a and b are given by equations 15.25 and 15.33.
If b is a large positive number then the positive root

δ_+ = \frac{−b + \sqrt{b^2 − 4ac}}{2a}    (15.37)

has two large and nearly equal numbers subtracted in the numerator; this could lead to roundoff errors. To improve our accuracy we rearrange by rationalizing the numerator:

δ_+ = \frac{−b + \sqrt{b^2 − 4ac}}{2a} × \frac{−b − \sqrt{b^2 − 4ac}}{−b − \sqrt{b^2 − 4ac}}    (15.38)
    = \frac{b^2 − b^2 + 4ac}{2a(−b − \sqrt{b^2 − 4ac})}    (15.39)
    = \frac{−2c}{b + \sqrt{b^2 − 4ac}}    (15.40)

There is no roundoff error here because now we are adding two large positive numbers in the denominator, not subtracting them. Thus if b is large and positive, our two intersection points are

s = p + \frac{−b − \sqrt{b^2 − 4ac}}{2a}    (15.41)
s = p − \frac{2c}{b + \sqrt{b^2 − 4ac}}    (15.42)

By a similar argument, if b is a large negative number then the negative root involves subtracting two nearly equal numbers, and so the solutions are

s = p + \frac{−b + \sqrt{b^2 − 4ac}}{2a}    (15.43)
s = p − \frac{2c}{b − \sqrt{b^2 − 4ac}}    (15.44)

Since we don't know up front which, if either, special case occurs, we can do the following: choose the sign of the square root to agree with the sign of b. This will work in either case! Hence

s = p − \frac{2c}{b + \operatorname{sign}(b)\sqrt{b^2 − 4ac}}    (15.45)

This assures that of the two possible roots of the parabola, the one closest to p will be selected.

Müller's algorithm can use Horner's method to evaluate the polynomial (the derivative value that Horner also returns is simply ignored, since it is not needed here). The algorithm to find the root of a polynomial with coefficients a_0, . . . , a_n is


Algorithm Muller (root of a polynomial)
Input a_0, . . . , a_n, x_0, x_1, x_2, tolerance ε;
Let p = x_2 and f_p = Horner(a_0, . . . , a_n, p)
Let q = x_1 and f_q = Horner(a_0, . . . , a_n, q)
Let r = x_0 and f_r = Horner(a_0, . . . , a_n, r)
Let δ = ∞
While |δ| > ε,
    a = [(r − p)(f_q − f_p) − (q − p)(f_r − f_p)] / [(q − p)(r − p)(q − r)]
    b = [(r − p)^2(f_q − f_p) − (q − p)^2(f_r − f_p)] / [(q − p)(r − p)(r − q)]
    c = f_p
    δ = 2c / (b + sign(b)\sqrt{b^2 − 4ac})
    r = q
    q = p
    p = p − δ
    f_r = f_q
    f_q = f_p
    f_p = Horner(a_0, . . . , a_n, p)
End While
Return p
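For a general function (not necessarily a polynomial) the loop is short enough to sketch directly in Mathematica. The name muller, the argument list, and the use of fun in place of Horner are our own choices; the sketch assumes the three points stay distinct and b ≠ 0 at every step (and note that the iterates may become complex if b² − 4ac < 0, which is a legitimate feature of Müller's method):

muller[fun_, {x0_, x1_, x2_}, eps_] := Module[{p, q, r, fp, fq, fr, a, b, c, d},
  {r, q, p} = {x0, x1, x2};
  {fr, fq, fp} = fun /@ {r, q, p};
  d = Infinity;
  While[Abs[d] > eps,
    a = ((r - p)*(fq - fp) - (q - p)*(fr - fp))/((q - p)*(r - p)*(q - r));
    b = ((r - p)^2*(fq - fp) - (q - p)^2*(fr - fp))/((q - p)*(r - p)*(r - q));
    c = fp;
    d = 2*c/(b + Sign[b]*Sqrt[b^2 - 4*a*c]);   (* equation 15.45 *)
    {r, q} = {q, p}; {fr, fq} = {fq, fp};      (* shift the three points *)
    p = p - d;
    fp = fun[p];
  ];
  p
]

muller[#^3 - 2 #^2 - 5 &, {1., 2., 3.}, 10^-10]   (* returns 2.69065 *)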

Lesson 16

Linear Systems

In this section we will study the solution of a linear system of n equations in n unknowns. We cover it briefly here because some understanding of the problem will be necessary in our study of interpolation; the subject is normally covered in detail elsewhere in the curriculum, so it will not be treated in depth here.
Given a square n × n matrix A and n numbers b_1, . . . , b_n, we would like to solve the linear system

Ax = b    (16.1)

Since it is generally numerically inefficient to compute an inverse (doing so requires O(n^3) operations), we will not solve the system as

x = A^{−1} b    (16.2)

although this is technically correct. Instead we will use the process of Gaussian elimination. We begin by observing that if we can transform equation 16.1 into a form

T x = b'    (16.3)

where T is an upper triangular matrix and b' is a modified version of b, then we can read the solution for x_n off the bottom row of the matrix, namely,

x_n = b'_n / T_{nn}    (16.4)

The matrix T is said to be in Row Echelon Form. The second-to-last row of the system 16.3 only depends on two variables, x_n and x_{n−1}. Once we read off x_n we can solve for x_{n−1}. This process of back substitution moves back up the matrix one line at a time, solving for one variable at each step.

Gaussian elimination is then summarized as follows:

1. Convert the system Ax = b into an equivalent form T x = b' where T is upper-triangular.


2. Solve for x using back-substitution.


It is possible to take this idea one step further. If we can reduce equation 16.3 to the form

Dx = b''    (16.5)

where D is a diagonal matrix, then it is even easier to read off the solutions, namely x_i = b''_i / D_{ii}. In this revised form the matrix D is said to be in Reduced Row Echelon Form, and the revised algorithm is called Gauss-Jordan Elimination. The revised algorithm is summarized:

1. Convert the system Ax = b into an equivalent form T x = b', where T is upper-triangular.

2. Convert the system T x = b' into an equivalent form Dx = b'', where D is diagonal.

3. Solve for the x_i.

We will outline the first algorithm (row reduction followed by back-substitution). We start by writing the linear system

a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1    (16.6)
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2    (16.7)
 ⋮
a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n = b_n    (16.8)
From equation 16.6 we can solve for x_1 in terms of x_2, . . . , x_n,

x_1 = (b_1 − a_{12} x_2 − a_{13} x_3 − \cdots − a_{1n} x_n)/a_{11}    (16.9)

so if we already know x_2, . . . , x_n we can solve for x_1 immediately. But if we eliminate x_1 from each of the remaining equations, we have a system of n − 1 equations in the n − 1 variables x_2, . . . , x_n, which is easier to solve than the original system because it is smaller. We get this system by subtracting an appropriate multiple of the first equation from each of the remaining equations, namely we subtract

(a_{i1}/a_{11}) × (a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1)    (16.10)

from the ith equation. The resulting system for x_2, . . . , x_n is

(a_{22} − a_{21}a_{12}/a_{11}) x_2 + \cdots + (a_{2n} − a_{21}a_{1n}/a_{11}) x_n = b_2 − a_{21}b_1/a_{11}    (16.11)
(a_{32} − a_{31}a_{12}/a_{11}) x_2 + \cdots + (a_{3n} − a_{31}a_{1n}/a_{11}) x_n = b_3 − a_{31}b_1/a_{11}    (16.12)
 ⋮
(a_{n2} − a_{n1}a_{12}/a_{11}) x_2 + \cdots + (a_{nn} − a_{n1}a_{1n}/a_{11}) x_n = b_n − a_{n1}b_1/a_{11}    (16.13)


The idea is to keep repeating this process until there is only one equation in the reduced system. The result is an "upper triangular system." If the original matrix system is

\begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ a_{31} & a_{32} & a_{33} & & \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{pmatrix}

then the reduced matrix system is

\begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a'_{22} & a'_{23} & \cdots & a'_{2n} \\ 0 & 0 & a'_{33} & \cdots & a'_{3n} \\ \vdots & & & \ddots & \vdots \\ 0 & \cdots & 0 & 0 & a'_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b'_2 \\ b'_3 \\ \vdots \\ b'_n \end{pmatrix}
This process is called Gaussian Reduction. We can then solve the system by
starting on the bottom equation for xn , then the second from the bottom for xn−1 ,
and so forth, until we obtain x1 . This second step is called back substitution.

Example 16.1. Solve the system

\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 2 \\ 2 & 8 & 5 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 5 \\ 10 \\ 15 \end{pmatrix}

using Gaussian reduction and back substitution.

Solution. The first step is to subtract multiples of the first row from each of the remaining two rows to make the coefficients of x zero in rows 2 and 3 of the system. Since the coefficient of x is 1 in the first row, 4 in the second row, and 2 in the third row, we subtract four times the first row from the second row, and twice the first row from the third row:

\begin{pmatrix} 1 & 2 & 3 \\ 4 − 4(1) & 5 − 4(2) & 2 − 4(3) \\ 2 − 2(1) & 8 − 2(2) & 5 − 2(3) \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 5 \\ 10 − 4(5) \\ 15 − 2(5) \end{pmatrix}

\begin{pmatrix} 1 & 2 & 3 \\ 0 & −3 & −10 \\ 0 & 4 & −1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 5 \\ −10 \\ 5 \end{pmatrix}
Now the first column is all zeroes (except for the first row). The next step is to subtract a multiple of the second row from the third row to get a zero in the second entry of the third row. Since the coefficient of y is −3 in the second row and 4 in the third row, we can add 4/3 times the second row to the third row:

\begin{pmatrix} 1 & 2 & 3 \\ 0 & −3 & −10 \\ 0 & 4 + (4/3)(−3) & −1 + (4/3)(−10) \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 5 \\ −10 \\ 5 + (4/3)(−10) \end{pmatrix}

\begin{pmatrix} 1 & 2 & 3 \\ 0 & −3 & −10 \\ 0 & 0 & −43/3 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 5 \\ −10 \\ −25/3 \end{pmatrix}
This completes the Gaussian elimination. We can then read off the solution by back-substitution. From the third row of the matrix,

z = (−25/3)/(−43/3) = 25/43

From the second row of the matrix,

−3y − 10z = −10

hence

y = −\frac{1}{3}(−10 + 10(25/43)) = \frac{60}{43}

Finally, from the first row, we have

x + 2y + 3z = 5

x = 5 − 2\left(\frac{60}{43}\right) − 3\left(\frac{25}{43}\right) = \frac{20}{43}
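As a quick check (our own addition), the built-in Mathematica function LinearSolve, mentioned again at the end of this lesson, reproduces this solution exactly:

In:=

LinearSolve[{{1, 2, 3}, {4, 5, 2}, {2, 8, 5}}, {5, 10, 15}]

Out:=

{20/43, 60/43, 25/43}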
We can write a simple recursive algorithm for Gaussian elimination as

Algorithm LinearSolve
Input: A, b
n = dimension(b)
If n > 1,
    {A', b'} = Reduce(A, b)
    {x_2, . . . , x_n} = LinearSolve(A', b')
End if
x_1 = (b_1 − a_{12} x_2 − a_{13} x_3 − \cdots − a_{1n} x_n)/a_{11}
Return {x_1, x_2, . . . , x_n}

Algorithm Reduce
Input: A, b
n = dimension(b)
For k = 2, . . . , n,


    m = a_{k1}/a_{11}
    For j = 2, . . . , n,
        a'_{k−1,j−1} = a_{kj} − m a_{1j}
    End For
    b'_{k−1} = b_k − m b_1
End For
Return {A', b'}
The recursive algorithm can be almost literally translated into Mathematica:
reduce[A_, b_] := Module[{n, j, k, Aprime, bprime, m, row},
  n = Length[b];
  Aprime = {}; bprime = {};
  For[k = 2, k <= n, k++,
    m = A[[k, 1]]/A[[1, 1]];
    row = {};
    For[j = 2, j <= n, j++,
      AppendTo[row, A[[k, j]] - m*A[[1, j]]];
    ];
    AppendTo[Aprime, row];
    AppendTo[bprime, b[[k]] - m*b[[1]]];
  ];
  Return[{Aprime, bprime}];
];

gauss[A_, b_] := Module[{n, k, x, x1, Aprime, bprime},
  n = Length[b];
  x = {};
  If[n > 1,
    {Aprime, bprime} = reduce[A, b];
    x = gauss[Aprime, bprime];
  ];
  x1 = b[[1]]/A[[1, 1]];
  For[k = 2, k <= n, k++,
    x1 = x1 - A[[1, k]]*x[[k - 1]]/A[[1, 1]];
  ];
  x = Prepend[x, x1];
  Return[x];
]
For example, to solve the system

\begin{pmatrix} 0.116093 & 0.230616 & 0.34202 \\ 0.461232 & 0.897598 & 1.28558 \\ 1.02606 & 1.92836 & 2.59808 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 3 \\ 17 \\ 5 \end{pmatrix}    (16.14)


One could use this function by typing

In:=

A={{0.116093, 0.230616, 0.34202},


{0.461232, 0.897598, 1.28558},
{1.02606, 1.92836, 2.59808}};
b={3, 17, 5};
gauss[A, b]

Out:=

{-33612.9, 27351.9, -7024.58}

In Mathematica we can also solve the system directly by using the built-in function LinearSolve[A, b].
Gaussian elimination can fail if we divide by zero, and it is susceptible to large errors or possible overflow if we divide by a very small number (relative to the other numbers in the matrix). Division occurs in two places in the algorithm: during the row reduction phase, where we define m = a_{k1}/a_{11}, and during the back-substitution step at the end of the algorithm, where we solve for x_1 (here we also divide by a_{11}, but it is usually a different a_{11}). These numbers are called pivots. The solution is to rearrange the matrix (and the corresponding elements of b): if at any step along the way the pivot is zero, the entire row is exchanged with a row that does not have zero in that column. If all of the remaining elements in that column are zero, then the matrix is singular and there is no unique solution (or no solution at all). A sketch of such a pivoting step follows.
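The following Mathematica sketch shows what a partial pivoting step might look like; the name pivot and its interface (call it on {A, b} at the top of each reduction step) are our own choices, not part of the algorithm above:

pivot[A_, b_] := Module[{k, Ap = A, bp = b},
  k = Ordering[Abs[A[[All, 1]]], -1][[1]];  (* row with the largest first entry *)
  If[Ap[[k, 1]] == 0, Print["matrix is singular"]];
  Ap[[{1, k}]] = Ap[[{k, 1}]];              (* exchange rows 1 and k *)
  bp[[{1, k}]] = bp[[{k, 1}]];
  {Ap, bp}
]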

Lesson 17

Lagrange Interpolation

Suppose we know the values of some function f (x) at n + 1 distinct grid points

a = x0 , x1 , x2 , ..., xn = b (17.1)

Denote the values of the function at each of these points as

fk = f (xk ), k = 0, 1, 2, ..., n (17.2)

The problem of interpolation is to find an approximate (numerical) value for f(x) at any point x ∈ [a, b] that does not necessarily correspond to one of the grid points.

[Figure: a function known only at the grid points (x_1, f_1), . . . , (x_5, f_5).]

[Figure: the unknown function f(x) passes through the grid points.]


The simplest method is linear interpolation: draw line segments connecting each pair of consecutive grid points (x_k, f_k) and (x_{k+1}, f_{k+1}). For x_k ≤ x ≤ x_{k+1} we have:

y = f_k + m(x − x_k) = f_k + \frac{f_{k+1} − f_k}{x_{k+1} − x_k}(x − x_k)    (17.3)

[Figure: the piecewise-linear interpolant through the grid points.]

In general, unless the grid points are very close together, linear interpolation does not give very accurate results. A better approximation would be given by a polynomial. The key is to find the right polynomial, not just any polynomial that goes through the points. As it turns out, it is possible to find a polynomial that approximates the function to any desired degree of accuracy. This result is called the Weierstrass Approximation Theorem. Furthermore, given any n + 1 points it is possible to find a unique polynomial of minimum degree that fits all the points. For example, any two points can be fit by a line; any three non-collinear points can be fit by a unique parabola; any four points that do not lie on the same line or on the same parabola can be fit by a unique cubic; and so forth.

Let us suppose that we have defined n + 1 distinct points

(x0 , f0 ), (x1 , f1 ), . . . , (xn , fn ) (17.4)

where
x0 < x1 < · · · < xn (17.5)
and that we want to find the polynomial of lowest order

P (x) = a0 + a1 x + a2 x2 + · · · + an xn (17.6)


to these points. We begin by substituting the points 17.4 into the polynomial to get n + 1 equations in the n + 1 unknowns a_0, a_1, . . . , a_n:

f_0 = a_0 + a_1 x_0 + a_2 x_0^2 + \cdots + a_n x_0^n    (17.7)
f_1 = a_0 + a_1 x_1 + a_2 x_1^2 + \cdots + a_n x_1^n    (17.8)
 ⋮
f_n = a_0 + a_1 x_n + a_2 x_n^2 + \cdots + a_n x_n^n    (17.9)

which we can write as the matrix system

\begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & & & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} f_0 \\ f_1 \\ \vdots \\ f_n \end{pmatrix}    (17.10)
This equation has a solution if the matrix of coefficients is non-singular. But because the points are distinct, the rows of the matrix form a linearly independent set of vectors (proof left as an exercise). Hence the matrix is non-singular. To find the a_0, a_1, . . . we could use Gaussian elimination or some other method as we have discussed. It turns out that this is not necessary, because the form of the matrix allows us to write a much simpler iterative process for finding these coefficients.

We will actually present two different methods for constructing the polynomial:
the Lagrange method (in this section) and the Newton method (in the next section).
Because of uniqueness both polynomials will be identical; however, they are con-
structed differently. The Newton method is particularly useful when one needs to
calculate numbers by hand, as was done in the 19th century. The Lagrange method,
which we will discuss first, is somewhat more intuitive. Before providing the general
form, we will illustrate the technique with linear (n=1) and quadratic (n=2) interpo-
lation.

For n = 1 we start with two points (x_0, f_0), (x_1, f_1) that we want to fit a line to. Of course we have already done this, but this time we will construct the line in such a way that the method can be easily extended to higher degree fits (with more points). We define the functions

L_0(x) = \frac{x − x_1}{x_0 − x_1}    (17.11)
L_1(x) = \frac{x − x_0}{x_1 − x_0}    (17.12)

and we observe that

L_0(x_0) = 1    L_0(x_1) = 0    (17.13)
L_1(x_0) = 0    L_1(x_1) = 1    (17.14)


We write this more compactly as

L_i(x_j) = δ_{ij} = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}    (17.15)

known as the Kronecker delta (for the German mathematician Leopold Kronecker, 1823–1891). Next, we define the function

P(x) = \sum_{i=0}^{1} L_i(x) f_i    (17.16)
     = L_0(x) f_0 + L_1(x) f_1    (17.17)
     = \frac{x − x_1}{x_0 − x_1} f_0 + \frac{x − x_0}{x_1 − x_0} f_1    (17.18)

We observe that P(x_i) = f_i and that P is linear in x. Hence it is the equation of a line that goes through both points (x_i, f_i), i = 0, 1. A rearrangement of this gives equation 17.3.
For n = 2 we have 3 points: (x_0, f_0), (x_1, f_1), and (x_2, f_2). Again, we have already solved for the equation of a parabola through three points in the previous section, but we will do it this time by extending the Lagrange technique. We define the three functions

L_0(x) = \frac{(x − x_1)(x − x_2)}{(x_0 − x_1)(x_0 − x_2)}    (17.19)
L_1(x) = \frac{(x − x_0)(x − x_2)}{(x_1 − x_0)(x_1 − x_2)}    (17.20)
L_2(x) = \frac{(x − x_0)(x − x_1)}{(x_2 − x_0)(x_2 − x_1)}    (17.21)
We observe that

L_0(x_0) = 1    L_0(x_1) = 0    L_0(x_2) = 0    (17.22)
L_1(x_0) = 0    L_1(x_1) = 1    L_1(x_2) = 0    (17.23)
L_2(x_0) = 0    L_2(x_1) = 0    L_2(x_2) = 1    (17.24)

or in general L_i(x_j) = δ_{ij}, as before with the linear functions. Then we define the function

P(x) = L_0(x) f_0 + L_1(x) f_1 + L_2(x) f_2 = \sum_{i=0}^{2} L_i(x) f_i    (17.25)

We observe now that P (x) is quadratic, and that P (xi ) = fi . Thus it goes through
all three points, and hence by uniqueness it is the only parabola that goes through
all three points.


In the general case it becomes more convenient to add a second index indicating the order of the polynomials to the L functions. Thus we rename our linear functions from L_0 and L_1 to L_{10} and L_{11}, and our quadratic functions L_0, L_1, and L_2 become L_{20}, L_{21}, and L_{22}. The general definition is

L_{nk}(x) = \prod_{j=0, j \neq k}^{n} \frac{x − x_j}{x_k − x_j}    (17.26)

for k = 0, . . . , n. It is easily observed (a) that each of the L_{nk} has degree n; and (b) that L_{nk}(x_i) = δ_{ik}. Hence the polynomial

P(x) = \sum_{k=0}^{n} L_{nk}(x) f_k    (17.27)

is also of degree at most n, and P(x_j) = f_j. Thus P(x) is our interpolating polynomial, and we have derived the following result.

Theorem 17.1 (Lagrange Interpolating Polynomial). Suppose that we are given the values of the function f(x) at n + 1 distinct points x_0, . . . , x_n, which we denote by f_0, . . . , f_n. Then the nth Lagrange interpolating polynomial

P(x) = \sum_{k=0}^{n} L_{nk}(x) f_k = \sum_{k=0}^{n} f_k \prod_{j=0, j \neq k}^{n} \frac{x − x_j}{x_k − x_j}    (17.28)

is a polynomial of degree at most n that matches f(x) at each of the x_i.
Example 17.1. Let f(x) = \sqrt{1 + x}. Construct the Lagrange polynomial of degree at most 2 to interpolate the point f(0.45) using grid points at x_0 = 0, x_1 = 0.6 and x_2 = 0.9.

Solution. First we calculate the f_i:

f_0 = f(x_0) = f(0) = \sqrt{1} = 1    (17.29)
f_1 = f(x_1) = f(0.6) = \sqrt{1.6} ≈ 1.265    (17.30)
f_2 = f(x_2) = f(0.9) = \sqrt{1.9} ≈ 1.378    (17.31)

So the Lagrange polynomial is

P(x) = L_{20}(x) f_0 + L_{21}(x) f_1 + L_{22}(x) f_2    (17.32)
     = L_{20}(x) + 1.27 L_{21}(x) + 1.38 L_{22}(x)    (17.33)


where

L_{20}(x) = \frac{(x − 0.6)(x − 0.9)}{(0 − 0.6)(0 − 0.9)} = 1.85(x − 0.6)(x − 0.9)    (17.34)
L_{21}(x) = \frac{(x − 0)(x − 0.9)}{(0.6 − 0)(0.6 − 0.9)} = −5.56 x(x − 0.9)    (17.35)
L_{22}(x) = \frac{(x − 0)(x − 0.6)}{(0.9 − 0)(0.9 − 0.6)} = 3.70 x(x − 0.6)    (17.36)

Thus

P(x) = L_{20}(x) + 1.27 L_{21}(x) + 1.38 L_{22}(x)    (17.37)
     = 1.85(x − 0.6)(x − 0.9) − 1.27(5.56) x(x − 0.9) + 1.38(3.70) x(x − 0.6)    (17.38)
     = 1.85(x − 0.6)(x − 0.9) − 7.03 x(x − 0.9) + 5.11 x(x − 0.6)    (17.39)
     = 0.999 + 0.486 x − 0.07 x^2    (17.40)
Hence

P(0.45) ≈ 0.999 + (0.486)(0.45) − (0.07)(0.45)^2 = 1.20    (17.41)
We summarize the algorithm for Lagrange interpolation here.¹

Algorithm LagrangeInterpolatingFunctions
Input: x_0, . . . , x_n, x
For i = 0, 1, . . . , n,
    Define the set U_i = {x_0, . . . , x_n} − {x_i}
    Define numerator = 1, denominator = 1
    For j = 0, . . . , n − 1
        numerator = numerator × (x − U_{ij})
        denominator = denominator × (x_i − U_{ij})
    End For
    L_{ni} = numerator/denominator
End For
Return the list {L_{n0}, . . . , L_{nn}}

Algorithm LagrangeInterpolatingPolynomial
Input: x0 , . . . , xn , f0 , . . . , fn , x
Let L be the list LagrangeInterpolatingFunctions(x0 , . . . , xn , x)
P = f0 ∗ L0 + f1 ∗ L1 + · · · + fn ∗ Ln
Return P

¹The notation A − B, where A and B are sets, means the relative complement of the set B in the set A, i.e., all of the elements of A that are not in B. For an ordered set U_i, the notation U_{ij} means the jth element of U_i. An ordered set is also called a List.


We now illustrate how to calculate the Lagrange interpolating polynomials both analytically and numerically in Mathematica. First we make a few observations. The relative complement of the set B in A is given by Complement[A, B], e.g.,

In:=

U={x1, x2, x3, x4, x5}


Complement[U, {x4}]

Out:=

{x1, x2, x3, x5}

Next, we observe that if U is a list such as the one defined above, then Map[f, U]
returns the result of f[u], for every element u of U. Recall that f/@U is a shorthand
for Map[f, U],

In:=

f/@U

Out:=

{f[x1], f[x2], f[x3], f[x4], f[x5]}

Suppose that f[x] represents the function f (x) = x − 3. We can calculate some
value, say f (u) in two different ways. The first is the usual way,

In:=

f[x_]:=x-3;
f[u]

Out:=

u-3

The second way is to use pure functions:

In:=

(#-3)&[u]

Out:=

u-3


Pure functions allow us to define a function and use it in a single statement. In-
stead of saying f[x] we replace the f with the pure function (#-3)&. The symbol &
tells us where the function definition ends, and the symbol # is used in place of the
function’s argument x. We can also combine pure functions with the Map function.
This is convenient because it lets us map an expression that we are only going to use
once; otherwise we’d have to use an extra line of code to define an unnecessary extra
variable to hold the function. Thus

In:=

f[x_]:= x-3; V=Map[f, U]

and

In:=

V=(#-3)&/@U

both return the identical output

Out:=

{-3 + x1, -3 + x2, -3 + x3, -3 + x4, -3 + x5}

Next, we observe that the generalization of multiplication in Mathematica is the Times function. Times can take an arbitrary number of arguments and returns their product. Thus

In:=

Times[-3+x1, -3+x2, -3+x3, -3+x4, -3+x5]

Out:=

(-3 + x1) (-3 + x2) (-3 + x3) (-3 + x4) (-3 + x5)

To multiply out the elements of V we need to take all the elements of V and place them as arguments to Times. We do this with the Apply command, which has the shorthand @@. The following are equivalent:

In:=

Apply[Times, V]

In:=

Times@@V


and both return the same thing (recall the definition of V, above):

Out:=

(-3 + x1) (-3 + x2) (-3 + x3) (-3 + x4) (-3 + x5)

Now suppose we want to combine these two functions. We want to subtract 3 from
every element of the list U, which we can do with Map, and then take the product of
the results with Apply and Times:

In:=

Times@@(#-3)&/@U

or

In:=

Apply[Times, Map[(# - 3)&, U]]

both of which return

Out:=

(-3 + x1) (-3 + x2) (-3 + x3) (-3 + x4) (-3 + x5)

With this we can define a function to calculate the Lagrange Interpolating Functions
in Mathematica.

LagrangeInterpolatingFunctions[{xj__}, x_] :=
Module[ {i, n, xi, xjc, L, xgrid, num, den},
xgrid = {xj};
n = Length[xgrid];
L = {};
For[i = 1, i <= n, i++,
xi = xgrid[[i]];
xjc = Complement[xgrid, {xi}];
den = Times @@ ((xi - #) & /@ xjc);
num = Times @@ ((x - #) & /@ xjc);
L = Append[L, num/den];
];
Return[L];
]

We can now calculate a set of functions analytically. For example,

In:=


LagrangeInterpolatingFunctions[{x1, x2, x3}, x]

Out:=

{ (x − x2)(x − x3)/((x1 − x2)(x1 − x3)), (x − x1)(x − x3)/((x2 − x1)(x2 − x3)), (x − x1)(x − x2)/((x3 − x1)(x3 − x2)) }
Repeating our earlier example,

In:=

LagrangeInterpolatingFunctions[{0, 0.6, 0.9}, x]

Out:=

{1.85185 (-0.9 + x) (-0.6 + x), -5.55556 (-0.9 + x) x, 3.7037 (-0.6 + x) x}

We can also calculate at a point:

In:=

LagrangeInterpolatingFunctions[{0, 0.6, 0.9},0.45]

Out:=

{0.125, 1.125, -0.25}

Next we observe that the dot product of two lists A and B of the same length is
calculated with the dot operator, which is a period:

In:=

{f1, f2, f3, f4}.{L1, L2, L3, L4}

Out:=

f1 L1 + f2 L2 + f3 L3 + f4 L4

So we can repeat our previous example as follows:

In:=

f[x_]:= Sqrt[1.0+x];
points = {0.0, 0.6, 0.9};
(f/@points).LagrangeInterpolatingFunctions[points, 0.45]

Out:=


1.20342

We can get the Lagrange Interpolating Polynomial with


In:=

(f /@ points).LagrangeInterpolatingFunctions[points, x] // Expand

Out:=

1.+ 0.483656 x - 0.0702286 x^2

Theorem 17.2 (Error Bounds for Lagrange Interpolation). Suppose that f(x) is n + 1 times continuously differentiable, and suppose that the points x_0, . . . , x_n are distinct. Then for any x ∈ [a, b] there exists a number c ∈ [a, b] such that

f(x) = P(x) + \frac{f^{(n+1)}(c)(x − x_0)(x − x_1) \cdots (x − x_n)}{(n + 1)!}    (17.42)

where

P(x) = \sum_{k=0}^{n} f_k L_{nk}(x) = \sum_{k=0}^{n} f_k \prod_{j=0, j \neq k}^{n} \frac{x − x_j}{x_k − x_j}    (17.43)

Proof. If x = x_k for some k, then since P(x_k) = f_k the second term in equation 17.42 is zero, regardless of the value of c, and the result holds identically.

So suppose that x ≠ x_k for all k, and define the function

g(t) = f(t) − P(t) − [f(x) − P(x)] \prod_{i=0}^{n} \frac{t − x_i}{x − x_i}    (17.44)

Then

g(x_k) = f(x_k) − P(x_k) − [f(x) − P(x)] \prod_{i=0}^{n} \frac{x_k − x_i}{x − x_i} = 0    (17.45)
The second equality follows because (a) by construction, f(x_k) = P(x_k), so the first term is zero; and (b) for some i we have i = k, and hence there is a factor of x_k − x_k in the numerator of the second term, making it zero as well. Furthermore,

g(x) = f(x) − P(x) − [f(x) − P(x)] \prod_{i=0}^{n} \frac{x − x_i}{x − x_i}    (17.46)
     = f(x) − P(x) − [f(x) − P(x)]    (17.47)
     = 0    (17.48)

Hence g(t) = 0 at the n + 2 numbers x, x_0, . . . , x_n. Since g is also continuously differentiable n + 1 times, then by the generalized Rolle's theorem (theorem 4.4) there

exists at least one number c ∈ (a, b) such that g^{(n+1)}(c) = 0. Differentiating g(t) a total of n + 1 times,

g^{(n+1)}(t) = f^{(n+1)}(t) − P^{(n+1)}(t) − [f(x) − P(x)] \frac{d^{n+1}}{dt^{n+1}} \prod_{i=0}^{n} \frac{t − x_i}{x − x_i}    (17.49)

hence at t = c we have

0 = g^{(n+1)}(c)    (17.50)
  = f^{(n+1)}(c) − P^{(n+1)}(c) − [f(x) − P(x)] \left[ \frac{d^{n+1}}{dt^{n+1}} \prod_{i=0}^{n} \frac{t − x_i}{x − x_i} \right]_{t=c}    (17.51)
  = f^{(n+1)}(c) − P^{(n+1)}(c) − \frac{f(x) − P(x)}{\prod_{i=0}^{n}(x − x_i)} \left[ \frac{d^{n+1}}{dt^{n+1}} \prod_{i=0}^{n} (t − x_i) \right]_{t=c}    (17.52)

Now since

P(t) = a_0 + a_1 t + \cdots + a_n t^n    (17.53)

then P^{(n+1)}(t) = 0 for all t, and hence P^{(n+1)}(c) = 0, so that

0 = f^{(n+1)}(c) − \frac{f(x) − P(x)}{\prod_{i=0}^{n}(x − x_i)} \left[ \frac{d^{n+1}}{dt^{n+1}} \prod_{i=0}^{n} (t − x_i) \right]_{t=c}    (17.54)

Furthermore,

\frac{d^{n+1}}{dt^{n+1}} \prod_{i=0}^{n} (t − x_i) = \frac{d^{n+1}}{dt^{n+1}} (t − x_0)(t − x_1) \cdots (t − x_n)    (17.55)
 = \frac{d^{n+1}}{dt^{n+1}} \left[ t^{n+1} + (\text{stuff}) × t^n + (\text{more stuff}) × t^{n−1} + \cdots \right]    (17.56)
 = (n + 1)!    (17.57)

Substituting equation 17.57 into equation 17.54 gives

0 = f^{(n+1)}(c) − \frac{f(x) − P(x)}{\prod_{i=0}^{n}(x − x_i)} (n + 1)!    (17.58)

Solving for f(x) gives us equation 17.42, completing the proof.

Example 17.2. Suppose you want to make a table of the natural logarithms over the
range 1 ≤ x ≤ 100. What step size is sufficient to ensure that linear interpolation
between each successive pair of points will be accurate to within 10−5 ?


Solution. From equation 17.42 we have

|f(x) − P(x)| = \left| \frac{f^{(n+1)}(c)(x − x_0)(x − x_1) \cdots (x − x_n)}{(n + 1)!} \right|    (17.59)

For linear interpolation we use n = 1 (there are two points, x_0 and x_1), so that

|f(x) − P(x)| = \left| \frac{f^{(2)}(c)(x − x_0)(x − x_1)}{2!} \right|    (17.60)
 ≤ \frac{1}{2} \max |f''(c)| × \max |(x − x_0)(x − x_1)|    (17.61)

on each interval. Since f(x) = \log x we have f'(x) = 1/x and f''(x) = −1/x^2. The maximum value of |−1/x^2| on [1, 100] is 1, so that

|f(x) − P(x)| ≤ \frac{1}{2} \max |(x − x_0)(x − x_1)|    (17.62)
To find the maximum of |g(x)|, where g(x) = (x − x_0)(x − x_1) = x^2 − (x_0 + x_1)x + x_0 x_1 on [x_0, x_1], we observe that the extremum either occurs at an endpoint or at a point where g'(x) = 0. At the endpoints g(x) = 0. So first we differentiate:

0 = g'(x) = 2x − (x_0 + x_1)    (17.63)

which gives a possible extremum at x = (x_0 + x_1)/2. The value of g at this point is

g\left(\frac{x_0 + x_1}{2}\right) = \left(\frac{x_0 + x_1}{2} − x_0\right)\left(\frac{x_0 + x_1}{2} − x_1\right)    (17.64)
 = \left(\frac{x_1 − x_0}{2}\right)\left(\frac{x_0 − x_1}{2}\right)    (17.65)
 = −\frac{h^2}{4}    (17.66)

where h = x_1 − x_0 is the step size between entries in the table (the number we are solving for); hence \max |(x − x_0)(x − x_1)| = h^2/4. Substituting equation 17.66 into equation 17.62 gives

|f(x) − P(x)| ≤ \frac{h^2}{8}    (17.67)
Since we want to ensure that the error is no larger than 10^{−5} we set

\frac{h^2}{8} < 10^{−5}    (17.68)

or

h < \sqrt{8 × 10^{−5}} ≈ 0.0089    (17.69)

so if we choose any step size smaller than h ≈ 0.0089 we are guaranteed to have an error of no larger than 10^{−5}.
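We can check this bound numerically with a short Mathematica sketch (our own addition); the worst interval is [1, 1 + h], since |f''| is largest near x = 1:

h = 0.0089;
p[x_] := (Log[1 + h]/h)*(x - 1);   (* linear interpolant of Log[x] on [1, 1+h] *)
NMaxValue[{Abs[Log[x] - p[x]], 1 <= x <= 1 + h}, x]
(* approximately 9.8*10^-6, just under the 10^-5 target *)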

Lesson 18

Newton Interpolation

Newton's method for interpolation is derived by seeking a polynomial of the form

P(x) = a_0 + a_1(x − x_n)
     + a_2(x − x_n)(x − x_{n−1})
     + a_3(x − x_n)(x − x_{n−1})(x − x_{n−2})
      ⋮
     + a_n(x − x_n)(x − x_{n−1}) \cdots (x − x_1)    (18.1)

that interpolates the points

P(x_0) = f(x_0)    (18.2)
P(x_1) = f(x_1)    (18.3)
 ⋮
P(x_{n−1}) = f(x_{n−1})    (18.4)
P(x_n) = f(x_n)    (18.5)

Here the grid points are assumed to be equally spaced, with spacing h = x_{i+1} − x_i.

We define the backward difference operator ∇ for an element f_n of a sequence as

∇f_n = f_n − f_{n−1}    (18.6)
∇²f_n = ∇f_n − ∇f_{n−1} = f_n − 2f_{n−1} + f_{n−2}    (18.7)
∇³f_n = ∇²f_n − ∇²f_{n−1} = f_n − 3f_{n−1} + 3f_{n−2} − f_{n−3}    (18.8)
 ⋮
∇^k f_n = ∇^{k−1} f_n − ∇^{k−1} f_{n−1}    (18.9)

Letting f_n = f(x_n), we have by substituting 18.5 into 18.1 that

f_n = a_0    (18.10)

From 18.4 we get

f_{n−1} = f_n + a_1(x_{n−1} − x_n)    (18.11)
        = f_n − a_1 h    (18.12)
a_1 = \frac{1}{h}(f_n − f_{n−1}) = \frac{1}{h} ∇f_n    (18.13)

Substituting at x = x_{n−2} = x_n − 2h gives

f_{n−2} = a_0 + a_1(x_{n−2} − x_n) + a_2(x_{n−2} − x_n)(x_{n−2} − x_{n−1})    (18.14)
        = f_n + \frac{1}{h}(f_n − f_{n−1})(−2h) + a_2(−2h)(−h)    (18.15)
        = 2f_{n−1} − f_n + 2h^2 a_2    (18.16)
a_2 = \frac{1}{2h^2}(f_n − 2f_{n−1} + f_{n−2}) = \frac{1}{2h^2} ∇²f_n    (18.17)

Continuing the process we find in general that

a_k = \frac{1}{k! h^k} ∇^k f_n    (18.18)

Next we define the polynomials Q_k by

Q_k(x) = \prod_{j=0}^{k−1} (x − x_{n−j})    (18.19)

Using 18.18 and 18.19 in 18.1,

P(x) = a_0 + a_1 Q_1(x) + a_2 Q_2(x) + \cdots + a_n Q_n(x)    (18.20)
     = a_0 + \sum_{k=1}^{n} a_k Q_k(x)    (18.21)
     = f_n + \sum_{k=1}^{n} \frac{∇^k f_n}{k! h^k} Q_k(x)    (18.22)

Define the parameter s, −1 ≤ s ≤ 0 in the interval [xn−1 , xn ] by

x = xn + sh (18.23)


From equation 18.19,

Q_k(x) = \prod_{j=0}^{k−1} (x_n + sh − (x_n − jh))    (18.24)
       = \prod_{j=0}^{k−1} (j + s)h    (18.25)
       = h^k \prod_{j=0}^{k−1} (s + j)    (18.26)
       = h^k s(s + 1)(s + 2) \cdots (s + k − 1)    (18.27)
Recall the definition of the binomial coefficient for integers n, m:

\binom{n}{m} = \frac{n!}{m!(n − m)!} = \frac{n(n − 1)(n − 2) \cdots (n − m + 1)}{m!}    (18.28)

For any real number t, not necessarily an integer, we can define

\binom{t}{k} = \frac{t(t − 1)(t − 2) \cdots (t − k + 1)}{k!}    (18.29)
Using this we calculate

\binom{−s}{k} = \frac{(−s)(−s − 1)(−s − 2) \cdots (−s − k + 1)}{k!}    (18.30)
 = \frac{(−1)^k}{k!} s(s + 1)(s + 2) \cdots (s + k − 1)    (18.31)
 = \frac{(−1)^k}{k! h^k} Q_k(x)    (18.32)
Using 18.32 in 18.22 we get

P(x) = f_n + \sum_{k=1}^{n} (−1)^k \binom{−s}{k} ∇^k f_n    (18.33)

which is known as Newton's Backward Difference Formula.


We can also derive a formula using forward differences, and the forward difference
operators that we defined in section 13,
∆fn = fn+1 − fn (18.34)
∆2 fn = ∆fn+1 − ∆fn = fn+2 − 2fn+1 + fn (18.35)
..
. (18.36)
∆k fn = ∆k−1 fn+1 − ∆k fn (18.37)


The result is known as Newton's Forward Difference Formula,

P(x) = f_0 + \sum_{k=1}^{n} \binom{s}{k} ∆^k f_0    (18.38)

where x = x_0 + hs.
Example 18.1. Find e^{1.2} using first, second, third, and fourth order differences with Newton's forward difference formula, given the data e = 2.71828, e^{1.5} = 4.48169, e^{2} = 7.38906, e^{2.5} = 12.18249, e^{3} = 20.08554.

Solution. We want to use equation 18.38 at x = 1.2 with x_0 = 1. Hence

1.2 = x = x_0 + hs = 1 + 0.5s    (18.39)

and so s = 0.4. We can construct the following table of forward differences based on the input data. The values we will use lie along the upper diagonal of the table.
x_k    f_k         ∆f_k      ∆²f_k     ∆³f_k     ∆⁴f_k
1      2.71828
                   1.76341
1.5    4.48169               1.14396
                   2.90737             0.74211
2      7.38906               1.88607             0.48142
                   4.79344             1.22353
2.5    12.18249              3.10961
                   7.90304
3      20.08554
We then calculate the following binomial coefficients, using s = 0.4:

\binom{s}{1} = \binom{0.4}{1} = 0.4    (18.40)
\binom{s}{2} = \binom{0.4}{2} = \frac{(0.4)(−0.6)}{2!} = −0.12    (18.41)
\binom{s}{3} = \binom{0.4}{3} = \frac{(0.4)(−0.6)(−1.6)}{3!} = 0.064    (18.42)
\binom{s}{4} = \binom{0.4}{4} = \frac{(0.4)(−0.6)(−1.6)(−2.6)}{4!} = −0.0416    (18.43)
For n = 1, the interpolated value is

P(x_0 + sh) = f_0 + \binom{s}{1} ∆f_0    (18.44)
 = 2.71828 + (0.4)(1.76341)    (18.45)
 = 3.42364    (18.46)

For n = 2,

P(x_0 + sh) = f_0 + \binom{s}{1} ∆f_0 + \binom{s}{2} ∆²f_0    (18.47)
 = 3.42364 + (−0.12)(1.14396)    (18.48)
 = 3.28636    (18.49)

For n = 3,

P(x_0 + sh) = f_0 + \binom{s}{1} ∆f_0 + \binom{s}{2} ∆²f_0 + \binom{s}{3} ∆³f_0    (18.50)
 = 3.28636 + (0.064)(0.74211)    (18.51)
 = 3.33386    (18.52)

For n = 4,

P(x_0 + sh) = f_0 + \binom{s}{1} ∆f_0 + \binom{s}{2} ∆²f_0 + \binom{s}{3} ∆³f_0 + \binom{s}{4} ∆⁴f_0    (18.53)
 = 3.33386 + (−0.0416)(0.48142)    (18.54)
 = 3.31383    (18.55)

The correct value is approximately 3.32012.
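A short Mathematica sketch (our own, using the built-in Differences and the fact that Binomial accepts a non-integer first argument) reproduces this computation, assuming equally spaced data:

newtonForward[xdata_, fdata_, x_] := Module[{h, s, n = Length[fdata]},
  h = xdata[[2]] - xdata[[1]];     (* uniform grid spacing *)
  s = (x - xdata[[1]])/h;
  fdata[[1]] + Sum[Binomial[s, k]*First[Differences[fdata, k]], {k, 1, n - 1}]
]

newtonForward[{1, 1.5, 2, 2.5, 3}, {2.71828, 4.48169, 7.38906, 12.18249, 20.08554}, 1.2]
(* returns 3.31383, as above *)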


We can also use backward differences for numbers closer to the end of the table.

Example 18.2. Using the same data as the previous example, calculate e2.7 using
backward differences.

Solution. We have
2.7 = x = xn + sh = 3 + (0.5)s (18.56)
Hence s = −0.6. The backwards difference formula gives

P(2.7) = f_n + (−1)\binom{0.6}{1} ∇f_n + (−1)^2 \binom{0.6}{2} ∇²f_n + (−1)^3 \binom{0.6}{3} ∇³f_n + (−1)^4 \binom{0.6}{4} ∇⁴f_n + \cdots    (18.57)
 = f_n − 0.6 ∇f_n + \frac{(0.6)(−0.4)}{2!} ∇²f_n − \frac{(0.6)(−0.4)(−1.4)}{3!} ∇³f_n + \frac{(0.6)(−0.4)(−1.4)(−2.4)}{4!} ∇⁴f_n + \cdots    (18.58)
 = f_n − 0.6 ∇f_n − 0.12 ∇²f_n − 0.056 ∇³f_n − 0.0336 ∇⁴f_n + \cdots    (18.59)

We can now read the data off the lower diagonal of the difference table constructed above: f_n = 20.08554, ∇f_n = 7.90304, ∇²f_n = 3.10961, ∇³f_n = 1.22353, and ∇⁴f_n = 0.48142.

Substituting these numbers gives us

P(2.7) = 20.08554 − 0.6(7.90304) − 0.12(3.10961) − 0.056(1.22353) − 0.0336(0.48142)    (18.60)
 = 14.88586    (18.61)

The correct value is e^{2.7} ≈ 14.87973.
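The backward formula (18.33) admits the same sort of sketch as before (our own), reading the last entry of each difference column:

newtonBackward[xdata_, fdata_, x_] := Module[{h, s, n = Length[fdata]},
  h = xdata[[2]] - xdata[[1]];
  s = (x - Last[xdata])/h;
  Last[fdata] + Sum[(-1)^k*Binomial[-s, k]*Last[Differences[fdata, k]], {k, 1, n - 1}]
]

newtonBackward[{1, 1.5, 2, 2.5, 3}, {2.71828, 4.48169, 7.38906, 12.18249, 20.08554}, 2.7]
(* returns 14.8859 *)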

Lesson 19

Hermite Interpolation

One of the problems with polynomial interpolation is that although the polynomial fits the points, the shape of the curve doesn't always match very well:

[Figure: an interpolating polynomial P(x) through (x_1, f_1), . . . , (x_5, f_5) that oscillates away from the underlying function f(x).]

One approach to this problem is to try to match the derivatives as well as the points. Suppose that we know the function f(x) at n + 1 points, given by (x_0, f_0), . . . , (x_n, f_n), and that we also know the derivatives at these same n + 1 points,

f'_i = f'(x_i)    (19.1)

Then our approach will be to try to find a polynomial that matches both the function and the derivative at these points. Our conditions are then:

P(x_i) = f_i    (19.2)
P'(x_i) = f'_i    (19.3)

for i = 0, . . . , n. Since there are 2(n + 1) = 2n + 2 conditions, we are potentially able to determine up to 2n + 2 unknowns in our model. Typically this means that our polynomial will have 2n + 2 unknown coefficients, i.e., that it will have degree 2n + 1. We can construct the solution using Lagrange interpolating polynomials. The result is given first.


Theorem 19.1. Suppose that f(t) is continuously differentiable on [a, b] and that the numbers x_0, . . . , x_n ∈ [a, b] are distinct, and let L_{nj}(x) be the Lagrange interpolating functions. Then

P(x) = H_{2n+1}(x) = \sum_{j=0}^{n} f_j H_{nj}(x) + \sum_{j=0}^{n} f'_j \hat{H}_{nj}(x)    (19.4)

where

H_{nj}(x) = [1 − 2(x − x_j)L'_{nj}(x_j)] (L_{nj}(x))^2    (19.5)
\hat{H}_{nj}(x) = (x − x_j)(L_{nj}(x))^2    (19.6)

satisfies equations 19.2 and 19.3. Equation 19.4 is called the Hermite Interpolating Polynomial.

Proof. Since L_{nj}(x_i) = δ_{ij},

H_{nj}(x_i) = [1 − 2(x_i − x_j)L'_{nj}(x_j)] L²_{nj}(x_i)    (19.7)
            = [1 − 2(x_i − x_j)L'_{nj}(x_j)] δ_{ij}    (19.8)

If i ≠ j then H_{nj}(x_i) = 0, while if i = j,

H_{nj}(x_j) = [1 − 2(x_j − x_j)L'_{nj}(x_j)] δ_{jj} = 1    (19.9)

Hence

H_{nj}(x_i) = δ_{ij}    (19.10)

Similarly,

\hat{H}_{nj}(x_i) = (x_i − x_j)δ_{ij} = 0    (19.11)

for all i and j. Substituting into equation 19.4,

P(x_i) = \sum_{j=0}^{n} f_j H_{nj}(x_i) + \sum_{j=0}^{n} f'_j \hat{H}_{nj}(x_i)    (19.12)
       = \sum_{j=0}^{n} f_j δ_{ij}    (19.13)
       = f_i    (19.14)

hence equation 19.2 is satisfied. To demonstrate equation 19.3 we differentiate 19.4,

P'(x) = \sum_{j=0}^{n} f_j H'_{nj}(x) + \sum_{j=0}^{n} f'_j \hat{H}'_{nj}(x)    (19.15)


and therefore

P'(x_i) = \sum_{j=0}^{n} f_j H'_{nj}(x_i) + \sum_{j=0}^{n} f'_j \hat{H}'_{nj}(x_i)    (19.16)

To evaluate equation 19.16 we calculate the derivative,

H'_{nj}(x) = \frac{d}{dx}\left\{[1 − 2(x − x_j)L'_{nj}(x_j)](L_{nj}(x))^2\right\}    (19.17)
 = 2[1 − 2(x − x_j)L'_{nj}(x_j)] L_{nj}(x)L'_{nj}(x) − 2(L_{nj}(x))^2 L'_{nj}(x_j)    (19.18)

so that

H'_{nj}(x_i) = 2[1 − 2(x_i − x_j)L'_{nj}(x_j)] L_{nj}(x_i)L'_{nj}(x_i) − 2(L_{nj}(x_i))^2 L'_{nj}(x_j)    (19.19)
 = 2[1 − 2(x_i − x_j)L'_{nj}(x_j)] L'_{nj}(x_i)δ_{ij} − 2δ_{ij}L'_{nj}(x_j)    (19.20)

If i ≠ j then clearly H'_{nj}(x_i) = 0 because of the common factor δ_{ij}. If i = j,

H'_{nj}(x_j) = 2[1 − 2(x_j − x_j)L'_{nj}(x_j)] L'_{nj}(x_j) − 2L'_{nj}(x_j)    (19.21)
 = 2L'_{nj}(x_j) − 2L'_{nj}(x_j)    (19.22)
 = 0    (19.23)
Hence H'_{nj}(x_i) = 0 for all i and j. Substituting this into 19.16,

P'(x_i) = \sum_{j=0}^{n} f'_j \hat{H}'_{nj}(x_i)    (19.24)

Differentiating \hat{H}_{nj},

\hat{H}'_{nj}(x) = \frac{d}{dx}\left[(x − x_j)(L_{nj}(x))^2\right]    (19.25)
 = 2(x − x_j)L_{nj}(x)L'_{nj}(x) + L²_{nj}(x)    (19.26)
\hat{H}'_{nj}(x_i) = 2(x_i − x_j)L_{nj}(x_i)L'_{nj}(x_i) + L²_{nj}(x_i)    (19.27)
 = 2(x_i − x_j)δ_{ij}L'_{nj}(x_i) + δ_{ij}    (19.28)
 = δ_{ij}    (19.29)

Therefore

P'(x_i) = \sum_{j=0}^{n} f'_j δ_{ij} = f'_i    (19.30)

which is equation 19.3.


Theorem 19.2. Under the same conditions as theorem 19.1, the Hermite interpolating polynomial is the unique polynomial of least degree (at most 2n + 1) that satisfies the conditions of equations 19.2 and 19.3.


Proof. Certainly H_{2n+1} is a polynomial of degree at most 2n + 1 in x, because each L_{nj} is a polynomial of degree n. We have already shown that H_{2n+1}(x) satisfies the conditions of equations 19.2 and 19.3 (in theorem 19.1). Now suppose that there is some other polynomial g(x), also of degree at most 2n + 1, such that g(x_i) = f_i and g'(x_i) = f'_i. Let

∆(x) = H_{2n+1}(x) − g(x)    (19.31)

be the difference between these two polynomials. Since ∆(x) is the difference of two polynomials of degree at most 2n + 1, it is also a polynomial of degree at most 2n + 1. Furthermore,

∆(x_i) = H_{2n+1}(x_i) − g(x_i) = f_i − f_i = 0    (19.32)
∆'(x_i) = H'_{2n+1}(x_i) − g'(x_i) = f'_i − f'_i = 0    (19.33)

Thus by theorem 12.6, ∆(x) has zeroes of multiplicity 2 at each of x_0, . . . , x_n, and hence there exists some polynomial q(x), which does not have a zero at any of these points, such that

∆(x) = (x − x_0)^2 (x − x_1)^2 \cdots (x − x_n)^2 q(x)    (19.34)

If q(x) were not identically zero, then ∆(x) would have degree at least 2(n + 1) = 2n + 2, which contradicts our earlier observation that ∆(x) has degree at most 2n + 1. Hence q(x) = 0 identically, so ∆(x) = 0 identically, which implies that g(x) = H_{2n+1}(x) for all x. In other words, H_{2n+1} is unique.

Example 19.1. Find a Hermite interpolating polynomial for the following data, which is based on f(x) = \sqrt{x}.

          f(x)      f'(x)
x_0 = 1   f_0 = 1   f'_0 = 1/2
x_1 = 4   f_1 = 2   f'_1 = 1/4

Solution. Since n = 1 there are 2n + 2 = 4 conditions that must be met, and therefore the degree of the polynomial will be 2n + 1 = 3. The interpolating polynomial is

P(x) = H_3(x)    (19.35)
     = \sum_{j=0}^{1} f_j H_{1j}(x) + \sum_{j=0}^{1} f'_j \hat{H}_{1j}(x)    (19.36)
     = f_0 H_{10}(x) + f_1 H_{11}(x) + f'_0 \hat{H}_{10}(x) + f'_1 \hat{H}_{11}(x)    (19.37)
     = H_{10}(x) + 2H_{11}(x) + \frac{1}{2}\hat{H}_{10}(x) + \frac{1}{4}\hat{H}_{11}(x)    (19.38)


To find the H's we need to first find the L_{1j}'s,

L_{10}(x) = \frac{x − x_1}{x_0 − x_1} = \frac{x − 4}{−3} = −\frac{1}{3}x + \frac{4}{3}    (19.39)
L_{11}(x) = \frac{x − x_0}{x_1 − x_0} = \frac{x − 1}{3} = \frac{1}{3}x − \frac{1}{3}    (19.40)

From this we can determine that L'_{10} = −1/3 and L'_{11} = 1/3. Hence

H_{10}(x) = [1 − 2(x − x_0)L'_{10}(x_0)] L²_{10}(x)    (19.41)
 = \left[1 − 2(x − 1)\left(−\frac{1}{3}\right)\right]\left(−\frac{1}{3}x + \frac{4}{3}\right)^2    (19.42)
 = \frac{1}{27}(1 + 2x)(4 − x)^2    (19.43)

and

H_{11}(x) = [1 − 2(x − x_1)L'_{11}(x_1)] L²_{11}(x)    (19.44)
 = \left[1 − 2(x − 4)\left(\frac{1}{3}\right)\right]\left(\frac{1}{3}x − \frac{1}{3}\right)^2    (19.45)
 = \frac{1}{27}(11 − 2x)(x − 1)^2    (19.46)

Similarly,

\hat{H}_{10}(x) = (x − x_0)L²_{10}(x) = \frac{1}{9}(x − 1)(4 − x)^2    (19.47–19.48)
\hat{H}_{11}(x) = (x − x_1)L²_{11}(x) = \frac{1}{9}(x − 4)(x − 1)^2    (19.49–19.50)

Thus from equation 19.35,

P(x) = H_{10}(x) + 2H_{11}(x) + \frac{1}{2}\hat{H}_{10}(x) + \frac{1}{4}\hat{H}_{11}(x)    (19.51)
 = \frac{1}{27}(1 + 2x)(4 − x)^2 + \frac{2}{27}(11 − 2x)(x − 1)^2    (19.52)
 + \frac{1}{18}(x − 1)(4 − x)^2 + \frac{1}{36}(x − 4)(x − 1)^2    (19.53)
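A quick Mathematica check (our own addition) confirms that this cubic reproduces the data and the derivatives:

P[x_] := (1 + 2 x) (4 - x)^2/27 + 2 (11 - 2 x) (x - 1)^2/27 +
    (x - 1) (4 - x)^2/18 + (x - 4) (x - 1)^2/36;
{P[1], P[4], P'[1], P'[4]}   (* returns {1, 2, 1/2, 1/4} *)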
Theorem 19.3 (Error Formula for Hermite Interpolation). Suppose that f is 2n + 2 times continuously differentiable and that the same conditions hold as in theorem 19.2. Then there is some number c ∈ (a, b) such that

f(x) = H_{2n+1}(x) + \frac{(x − x_0)^2 \cdots (x − x_n)^2}{(2n + 2)!} f^{(2n+2)}(c)    (19.54)


Proof. First, suppose that x = x_k for some k. Then the second term in equation 19.54 is zero and the equation reduces to f(x_k) = H_{2n+1}(x_k), which is the interpolation condition, already known to hold by theorem 19.1.

Now suppose that x ≠ x_k for all k. Then define the function g(t) by

g(t) = f(t) − H_{2n+1}(t) − \frac{(t − x_0)^2 \cdots (t − x_n)^2}{(x − x_0)^2 \cdots (x − x_n)^2}[f(x) − H_{2n+1}(x)]    (19.55)

Then

g(x_k) = f(x_k) − H_{2n+1}(x_k) − \frac{(x_k − x_0)^2 \cdots (x_k − x_k)^2 \cdots (x_k − x_n)^2}{(x − x_0)^2 \cdots (x − x_n)^2}[f(x) − H_{2n+1}(x)]    (19.56)
 = f(x_k) − H_{2n+1}(x_k)    (19.57)
 = 0    (19.58)

and

g(x) = f(x) − H_{2n+1}(x) − \frac{(x − x_0)^2 \cdots (x − x_n)^2}{(x − x_0)^2 \cdots (x − x_n)^2}[f(x) − H_{2n+1}(x)]    (19.59)
 = f(x) − H_{2n+1}(x) − [f(x) − H_{2n+1}(x)]    (19.60)
 = 0    (19.61)
=0 (19.61)

Hence g(t) has n + 2 distinct zeros in [a, b], at x, x_0, . . . , x_n. By Rolle's theorem, between each pair of consecutive zeroes there is a number c_k such that g'(c_k) = 0. Since there are n + 1 such pairs of consecutive points, g'(t) has n + 1 distinct zeroes at c_0, . . . , c_n, none of which are equal to any of the grid points x_0, . . . , x_n.

If we differentiate 19.55,

g'(t) = f'(t) − H'_{2n+1}(t) − [f(x) − H_{2n+1}(x)] \frac{d}{dt}\left[\frac{(t − x_0)^2 \cdots (t − x_n)^2}{(x − x_0)^2 \cdots (x − x_n)^2}\right]    (19.62)
 = f'(t) − H'_{2n+1}(t) − \frac{f(x) − H_{2n+1}(x)}{(x − x_0)^2 \cdots (x − x_n)^2} \frac{d}{dt}\prod_{k=0}^{n}(t − x_k)^2    (19.63)

But by the product rule for derivatives,

\frac{d}{dt}(a_1 a_2 a_3 \cdots a_n) = a'_1 a_2 a_3 \cdots a_n + a_1 a'_2 a_3 \cdots a_n + \cdots + a_1 \cdots a_{n−1} a'_n    (19.64)

Hence

\frac{d}{dt}\prod_{k=0}^{n}(t − x_k)^2 = 2(t − x_0)\prod_{k \neq 0}(t − x_k)^2 + 2(t − x_1)\prod_{k \neq 1}(t − x_k)^2 + \cdots + 2(t − x_n)\prod_{k \neq n}(t − x_k)^2    (19.65)
 = 2\prod_{k=0}^{n}(t − x_k) \sum_{i=0}^{n}\prod_{j=0, j \neq i}^{n}(t − x_j)    (19.66)
 = P(t)Q(t)    (19.67)

where

P(t) = 2\prod_{k=0}^{n}(t − x_k)    (19.68)
Q(t) = \sum_{i=0}^{n}\prod_{j=0, j \neq i}^{n}(t − x_j)    (19.69)

Consequently

g'(t) = f'(t) − H'_{2n+1}(t) − \frac{f(x) − H_{2n+1}(x)}{(x − x_0)^2 \cdots (x − x_n)^2} P(t)Q(t)    (19.70)

Since

g'(x_k) = f'(x_k) − H'_{2n+1}(x_k) − \frac{f(x) − H_{2n+1}(x)}{(x − x_0)^2 \cdots (x − x_n)^2} P(x_k)Q(x_k) = 0    (19.71)

by the two facts that f'_k = H'_{2n+1}(x_k) and P(x_k) = 0 (from equation 19.68), we see that g' has roots at x_0, . . . , x_n. Therefore g'(t) has 2n + 2 distinct zeroes, at c_0, . . . , c_n and x_0, . . . , x_n. Hence by the generalized Rolle's theorem, there is some number c ∈ [a, b] such that g^{(2n+2)}(c) = 0.
But

g^{(2n+2)}(t) = f^{(2n+2)}(t) − H^{(2n+2)}_{2n+1}(t) − \frac{f(x) − H_{2n+1}(x)}{(x − x_0)^2 \cdots (x − x_n)^2} \frac{d^{2n+2}}{dt^{2n+2}}\prod_{k=0}^{n}(t − x_k)^2    (19.72)

Since H_{2n+1}(t) is a polynomial of degree 2n + 1, its (2n + 2)-th derivative is zero, so

g^{(2n+2)}(t) = f^{(2n+2)}(t) − \frac{f(x) − H_{2n+1}(x)}{(x − x_0)^2 \cdots (x − x_n)^2} \frac{d^{2n+2}}{dt^{2n+2}}\prod_{k=0}^{n}(t − x_k)^2    (19.73)


Next we calculate

\frac{d^{2n+2}}{dt^{2n+2}}\prod_{k=0}^{n}(t − x_k)^2 = \frac{d^{2n+2}}{dt^{2n+2}}\left[t^{2n+2} + (\text{stuff}) × t^{2n+1} + \cdots + (\text{stuff})\right]    (19.74)
 = (2n + 2)!    (19.75)

and therefore

g^{(2n+2)}(t) = f^{(2n+2)}(t) − \frac{f(x) − H_{2n+1}(x)}{(x − x_0)^2 \cdots (x − x_n)^2}(2n + 2)!    (19.76)

Since there is some point c such that g^{(2n+2)}(c) = 0,

0 = f^{(2n+2)}(c) − \frac{f(x) − H_{2n+1}(x)}{(x − x_0)^2 \cdots (x − x_n)^2}(2n + 2)!    (19.77)

Solving for f(x) gives equation 19.54.



Example 19.2. Calculate the error bound on \sqrt{16} from the following data for f(x) = \sqrt{x}, according to the Hermite polynomial error formula.

x      5     10    15    20    25
f(x)   2.24  3.16  3.87  4.47  5.00

Solution. Since there are 5 data points x_0, . . . , x_4, we have n = 4, so we need the 10th derivative of f(x). Using Mathematica, we find that

f^{(10)}(x) = −\frac{34{,}459{,}425}{1024\, x^{19/2}}    (19.78)

At x = 16, equation 19.54 gives

|error| = \frac{(16 − 5)^2(16 − 10)^2(16 − 15)^2(16 − 20)^2(16 − 25)^2}{10!} × \frac{34{,}459{,}425}{1024\, c^{19/2}}    (19.79)
 = \frac{52352.6}{c^{19/2}}    (19.80)

The maximum of 1/c^{19/2} on (5, 25) occurs at the minimum of c^{19/2} on (5, 25), which occurs at c = 5. Hence

|error| ≤ \frac{52352.6}{5^{19/2}} ≈ 0.012    (19.81)

Of course this is just a theoretical limit, because we do not know the actual value of c. In this case, the actual Hermite approximation gives a much smaller error, of only 9.3 × 10^{−7}.


The Hermite polynomials are easily calculated in Mathematica as follows.

Hermite[{x__}, {f__}, {fprime__}, t_] :=
  Module[{L, L2, Lprime, z, H, HH, H2NP1},
    L = LagrangeInterpolatingFunctions[{x}, z];   (* defined in Lesson 17 *)
    Lprime = MapThread[#1 /. {z -> #2} &, {D[L, z], {x}}];  (* L'_{nj}(x_j) *)
    L2 = L^2;
    HH = (((t - #) & /@ {x})*L2) /. {z -> t};     (* the Hhat functions *)
    H = (L2 - 2*Lprime*HH) /. {z -> t};           (* the H functions *)
    H2NP1 = H.{f} + HH.{fprime};
    Return[H2NP1];
  ]

The polynomial in the last example is then given by

f[x_] := Sqrt[x];
xdata = Range[5, 25, 5]; (* list of x values *)
fdata = f /@ xdata; (* list of f values *)
fpdata = f’[#] & /@ xdata; (* list of f’ values *)
Hermite[xdata, fdata, fpdata, x]

To get the actual error value quoted in the example we use Hermite[xdata, fdata,
fpdata, 16] - 4.0, because we know that the correct answer is 4.

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
134 LESSON 19. HERMITE INTERPOLATION

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
Lesson 20

Cubic Splines

As before we are trying to find an interpolating function for a function that we know
at n + 1 points a = x0 < x1 < · · · < xn = b . Instead of fitting a single polynomial
to all n + 1 points, an alternative strategy is to fit a different polynomial to each
successive pair of points. We will define a set of functions Si (x), on each interval
[xi , xi+1 ]. To keep the solution smooth we would like the match the first and second
derivatives, as well as the function itself, at each grid point. Our conditions are

Si (xi ) = fi (20.1)
Si (xi+1 ) = Si+1 (xi+1 ) (20.2)
Si0 (xi+1 ) = Si+1
0
(xi+1 ) (20.3)
00 00
Si (xi+1 ) = Si+1 (xi+1 ) (20.4)

The functions S0 (x), . . . , Sn−1 (x) are called Spline functions.

Si(x) ... Sn-1(x)


Si-1(x)
...

S0(x)
... ...
x0 x1 xi-1 xi xi+1 xn-1 xn

Equations 20.1 through 20.4 give us a total of 4n − 2 conditions. Since there are
n spline functions, we need 3n parameters if the functions are quadratic and 4n
parameters if the functions are cubic. Since 4n−2 > 3n the system is over-determined
for a quadratic to work, and since 4n − 2 < 4n the system is under-determined for
a cubic to work. By adding two additional conditions, however, we can uniquely
determine a set of cubic spline functions. These are typically either free (natural)
boundary conditions,
S000 (x0 ) = Sn−1
00
(xn ) = 0 (20.5)

135
136 LESSON 20. CUBIC SPLINES

or the clamped boundary conditions:

S00 (x0 ) = f 0 (x0 ) (20.6)


0
Sn−1 (xn ) = f 0 (xn ) (20.7)

Let the cubic splines be given by

Si (x) = ai + bi (x − xi ) + ci (x − xi )3 + di (x − xi )3 (20.8)

on each interval [xi , xi+1 ]. Substituting equation 20.1 into 20.8 gives

ai = f i (20.9)

From equation 20.2,


ai + bi hi + ci h2i + di h3i = ai+1 (20.10)
where hi = xi+1 − xi . Differentiating equation 20.8 twice

Si0 (x) = bi + 2ci (x − xi ) + 3di (x − xi )2 (20.11)


Si00 (x) = 2ci + 6di (x − xi ) (20.12)

Substituting equation 20.3 into 20.11

bi + 2ci hi + 3di h2i = bi+1 (20.13)

Similarly, if we substitute equation 20.4 into 20.12,

2ci + 6di hi = 2ci+1 (20.14)

where i = 0, . . . , n − 1. We will define one additional number

cn = cn−1 + 3dn−1 hn−1 (20.15)

From equation 20.5 in 20.12


c0 = 0 (20.16)
and
2cn = 2ci−1 + 6dn−1 hn−1 = 0 (20.17)
Rearranging equation 20.17
ci+1 − ci
di = (20.18)
3hi
Thus we now have determined all of the ai , and the di are fully determined by the ci .
To get bi we substitute 20.18 into equation 20.10,
ci+1 − ci 3
ai + bi hi + ci h2i + hi = ai+1 (20.19)
3hi

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 20. CUBIC SPLINES 137

Rearranging,
h2i
ai + bi hi + (2ci + ci+1 ) = ai+1 (20.20)
3

Solving for bi ,
h3i
bi hi = ai+1 − ai − (2ci + ci+1 ) (20.21)
2
or
ai+1 − ai hi
bi = − (2ci + ci+1 ) (20.22)
hi 3

Reducing the index by 1 all the way through,

ai − ai−1 hi−1
bi−1 = − (2ci−1 + ci ) (20.23)
hi−1 3

Similarly, by substituting 20.18 into 20.13

bi+1 = bi + 2ci hi + (ci+1 − ci )hi (20.24)


= bi + hi (ci + ci+1 ) (20.25)

Again, we reduce the index by 1,

bi = bi−1 + hi−1 (ci−1 + ci ) (20.26)

Using equation 20.22 for the bi on the left hand side of equation 20.26, and equation
20.23 for the bi−1 on the right hand side of equation 20.26,

ai+1 − ai hi ai − ai−1 hi−1


− (2ci + ci+1 ) = − (2ci−1 + ci ) + hi−1 (ci−1 + ci ) (20.27)
hi 3 hi−1 3

Rearranging a bit,

ai+1 − ai ai − ai−1 hi hi−1


− = (2ci + ci+1 ) − (2ci−1 + ci ) + hi−1 (ci−1 + ci ) (20.28)
hi hi−1 3 3

ai+1 − ai ai − ai−1
3 −3 = hi ci+1 + 2ci (hi − hi−1 ) + ci−1 hi−1 (20.29)
hi hi−1

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
138 LESSON 20. CUBIC SPLINES

This equation is defined for i = 0, .., n − 1. We can extend it to include an extra


parameter an that we are not actually interested in, to give

···
  
1 0 0 c0
..   c 
h0 2(h0 + h1 ) h1 .   1

  c2 
0 h1 2(h1 + h2 ) h2

  ..  =
 
. . . .
 .. .. .. ..  . 
  
0 0 hn−2 2(hn−1 + hn−2 ) hn−1   
0 0 1 cn
 
0
3 3

h1
(a 2 − a 1 ) − h0
(a1 − 10 ) 
..
 
(20.30)
 

 3 . 
3
 h (an − an−1 ) − h (an−1 − an−2 )

n−1 n−2
0

If the grid points are equally spaced with hi = h for some number h, then

···
  
1 0 0 c0  
..   c  0
h 4h h 0 .   1

  c2  3  a0 − 2a1 + a2 
 
0 h 4h h 0

..
.=  (20.31)
   
.
 .. ... ... ...    ..  h  . 

   an−2 − 2an−1 + an 
0 0 h 4h h   0
0 0 1 cn

Denoting the square matrix by A and the vectors by c and w, this can be written
concisely as Ac = w. Since all of the ai are already known, the only unknowns in
this equation are the c, for which we can solve as c = A−1 w.
For clamped cubic splines, the corresponding equations are

···
  
2h0 h0 0 c0
..   c 
 h0 2(h0 + h1 ) h1 .   1

  c2 
 0 h1 2(h1 + h2 ) h2

  ..  =
 
 . . . .
 .. .. .. .. 0  . 
  
 0 0 hn−2 2(hn−1 + hn−2 ) hn−1   
0 0 hn−1 2hn−1 cn
3
(a1 − a0 ) − 3f 0 (a)
 
h0
3

h1
(a2 − a1 ) − h30 (a1 − 10 ) 
..
 
(20.32)
 
 . 
 3 3
 hn−1 (an − an−1 ) − hn−2 (an−1 − an−2 )

3f 0 (b) − hn−13
(an − an−1 )

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 20. CUBIC SPLINES 139

Example 20.1. Fit a free cubic spline to the following data.


x0 = 0 x1 = 1 x2 = 2 x3 = 3
f0 = 0 f1 = 0.5 f2 = 0.8 f3 = 0.9
Solution. By equation 20.9,

a0 = 0; a1 = 0.5; a2 = 0.8; a3 = 0.9 (20.33)

Since the points are equally spaced with h = 1, we can use 20.31. The right hand
side is given by

w0 = 0 (20.34)
3
w1 = (0 − 2(.5) + .8) = −0.6 (20.35)
1
3
w2 = (.5 − 2(.8) + .9) = −0.6 (20.36)
1
w3 = 0 (20.37)

So the 20.31 becomes


    
1 0 0 0 c0 0
 c1  = −0.6
1 4 1 0    
 (20.38)
0 1 4 1  c2   −0.6
0 0 0 1 c3 0

Multiplying the matrices on the left and setting like components equal gives the
equivalent system of equations:

c0 =0 (20.39)
c0 + 4c1 + c2 = −0.6 (20.40)
c1 + 4c2 + c3 = −0.6 (20.41)
c3 =0 (20.42)

Subsituting the first and last result into the middle two equations,

4c1 + c2 = −0.6 (20.43)


c1 + 4c2 = −0.6 (20.44)

Multiplying 20.43 by 4 and subtracting 20.44

15c1 = −1.8 (20.45)

Hence c1 = −1.8/15 = −0.12. Equation 20.43 then gives

c2 = −0.6 − 4(c1 ) = −0.6 − 4(−0.12) = −0.12 (20.46)

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
140 LESSON 20. CUBIC SPLINES

Summarizing, we have
c0 = 0; c1 = −0.12; c1 = −0.12; c3 = 0 (20.47)
ai+1 −ai hi
From 20.22, bi = hi
− 3
(2ci + ci+1 ) and therefore
a1 − a0 h
b0 = − (2c0 + c1 ) (20.48)
h 3
1
= (0.5 − 0) − (2(0) − 0.12) (20.49)
3
= 0.54 (20.50)
a2 − a1 h
b1 = − (2c1 + c2 ) (20.51)
h 3
1
= (0.8 − 0.5) − (2(−0.12) + (−0.12)) (20.52)
3
= 0.42 (20.53)
a3 − a2 h
b2 = − (2c2 + c3 ) (20.54)
h 3
1
= (0.9 − 0.8) − (2(−0.12) + 0) (20.55)
3
0 = 0.18 (20.56)
From equation 20.18 di = ci+1 − ci /3hi = ci+1 − ci /3 (since h = 1), so that
d0 = (c1 − c0 )/3 = (−.12 − 0)/3 = −0.04 (20.57)
d1 = (c2 − c1 )/3 = (−.12 − −.12)/3 = 0 (20.58)
d2 = (c3 − c2 )/3 = (0 − −.12)/3 = 0.04 (20.59)
Combining equations 20.33, 20.48, 20.47 and 20.57,
S0 = a0 + b0 (x − x0 ) + c0 (x − x0 )2 + d0 (x − x0 )3 (20.60)
= 0 + (0.54)(x) + (0)(x)2 + (−0.04)(x)3 (20.61)
= 0.54x − 0.04x3 (20.62)
S1 = a1 + b1 (x − x1 ) + c1 (x − x1 )2 + d1 (x − x1 )3 (20.63)
= 0.5 + (0.42)(x − 1) + (−0.12)(x − 1)2 + (0)(x − 1)3 (20.64)
= 0.5 + 0.42x − 0.42 − 0.12x2 + 0.24x − 0.12 (20.65)
= −0.04 + 0.66x − 0.12x2 (20.66)
S2 = a2 + b2 (x − x2 ) + c2 (x − x2 )2 + d2 (x − x2 )3 (20.67)
= 0.8 + (0.18)(x − 2) + (−0.12)(x − 2)2 + (0.04)(x − 2)3 (20.68)
= 0.8 + 0.18x − 0.36 − 0.12x2 + 0.48x − 0.48
+ 0.04x3 − 0.24x2 + 0.48x − 0.32 (20.69)
= −0.36 + 1.14x − 0.36x2 + 0.04x3 (20.70)

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 20. CUBIC SPLINES 141

Therefore the spline solution is



3
0.54x − 0.04x ,
 x ∈ [0, .5],
f (x) = −0.04 + 0.66x − 0.12x , 2
x ∈ [.5, .8] (20.71)
 2 3
−0.36 + 1.14x − 0.36x + 0.04x , x ∈ [.8, .9]

Theorem 20.1 (Error Bounds for Clamped Cubic Splines). Let f (x) be 4-times
continuously differentiable on [a, b], and define

M = sup |f (4) (x) (20.72)
[a,b]

If S(x) is the unique clamped cubic spline interpolant to f (x) on the nodes x0 , . . . , xn ,
where a = x0 and b = xn , then
5M
|f (x) − S(x)| ≤ max hj (20.73)
384 j=1,...,n−1

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
142 LESSON 20. CUBIC SPLINES

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
Lesson 21

Bezier Curves

Bezier curves, commonly used in computer graphics, are a modification of Hermite


interpoloation in which the derivatives are computed based on certain control points,
or handles, rather than just the end points. They are named for Piere Bézier (1910-
1999) who created them to model automobile surfaces in the early 1960’s. He worked
for Renault for virtually his entire career.
Bezier curves are used by many interactive graphics programs to draw smooth
curves. One typically marks the end points of the curve by clicking a mouse, then
then drags a mouse to define the handles, which can be used interactively to pull
the curve in different directions. In general, Bezier curves are described by systems
of parametric equations that describe the path of the curve in the xy-plane over the
interval [0, 1].
The simplest type of Bezier curves is a Bezier Line, describe parametrically by
joining the points P0 = (x0 , y0 ) and P2 = (x1 , y1 ) with a straight line. We can describe
this construction as
P (t) = P0 (1 − t) + P1 t (21.1)
In terms of the separate x and y components,
x(t) = x0 (1 − t) + x1 t (21.2)
y(t) = y0 (1 − t) + y1 t (21.3)
Quadratic Bezier curves are constructed as illustrated in figure 21.1. We draw
a line that connects points P0 and P2 ; the shape of this line is determined by the
position of a third “guide point” or “handle” that we label P1 . We then define
parameterizations of the line segements P0 P1 and P1 P2 as
P01 (t) = P0 (1 − t) + P1 t (21.4)
P12 (t) = P1 (1 − t) + P2 t (21.5)
Finally, we parameterize the line segment from P01 to P12 :
P (t) = (1 − t)P01 (t) + tP12 (t) (21.6)

143
144 LESSON 21. BEZIER CURVES

The Bezier Quadratic is the curved traced out P (t) as t goes from t = 0 to t = 1.
Substituting the expressions for P01 and P12 gives
P (t) = (1 − t)[(1 − t)P0 + tP1 ] + t[(1 − t)P1 + tP2 ] (21.7)
= (1 − t)2 P0 + 2t(1 − t)P1 + t2 P2 (21.8)
In terms of the x and y coordinates, the quadratic Bezier interpolants are:
x(t) = (1 − t)2 x0 + 2t(1 − t)x1 + t2 x2 (21.9)
y(t) = (1 − t)2 y0 + 2t(1 − t)y1 + t2 y2 (21.10)
Bezier quadratics are used, for example, to describe true-type fonts. The spline
functions constructud in this way have slopes tangent to the line segment P01 P12 .

Figure 21.1: Construction of Bezier quadratics. See text for description.


P1

P12

P01
P

P2
P0

Figure 21.2: Construction of the Bezier Cubic spline.


P12 P2

P1 P123
P012
P P23
P01

P3

P0

Now suppose that we have two control points, P1 and P2 , and that we want to
draw a curve connecting P0 and P3 using points P1 and P2 to control our movement.

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 21. BEZIER CURVES 145

See figure 21.2. As before we construct the line segments P0 P1 , P1 P2 and P2 P3 , and
at any time t ∈ [0, 1] define points on these three segments:
P01 (t) = (1 − t)P0 + tP1 (21.11)
P12 (t) = (1 − t)P1 + tP2 (21.12)
P23 (t) = (1 − t)P2 + tP3 (21.13)
Next, construct line segments P01 P12 and P12 P23 and define their parameterization
on [0, 1] as follows:
P012 (t) = (1 − t)P01 (t) + tP12 (t) (21.14)
P123 (t) = (1 − t)P12 (t) + tP23 (t) (21.15)
Finally we construct a line segment P012 P123 with parameterization
P (t) = (1 − t)P012 (t) + tP123 (t) (21.16)
= (1 − t)[(1 − t)P01 (t) + tP12 (t)] + t[(1 − t)P12 (t) + tP23 (t)] (21.17)
= (1 − t)2 P01 (t) + 2t(1 − t)P12 (t) + t2 P23 (t) (21.18)
= (1 − t)2 [(1 − t)P0 + tP1 ] + 2t(1 − t)[(1 − t)P1 + tP2 ] (21.19)
+ t2 [(1 − t)P2 + tP3 ]
= (1 − t)3 P0 + 3t(1 − t)2 P1 + 3t2 (1 − t)P2 + t3 P3 (21.20)
The cartesian coordinates of the point P constructed in this ware are
x(t) = (1 − t)3 x0 + 3t(1 − t)2 x1 + 3t2 (1 − t)x2 + t3 x3 (21.21)
y(t) = (1 − t)3 y0 + 3t(1 − t)2 y1 + 3t2 (1 − t)y2 + t3 y3 (21.22)
Bezier cubics in this form are used to describe Postscript fonts. It is left as an
exercise to verify that the splines formed in this way approach the fixed endpoints
P0 and P3 with tangent lines P0 P1 and P2 P3 . Becase they are described by a cubic
parameterization there are a total of eight coefficients (4 for the x and 4 for the y),
which we have described by the coordinates of the points P0 , P1 P2 , P3 . By uniqueness,
there can be only curve that matches our restriction, and so the following derivation
of the Bezier cubics must, of necessity give the same curve. We present the derivation
because the notation, which is different from the derivation given above, is commonly
used to describe Bezier curves in various graphics applications. Suppose we want to
join the two points as show in figure 21.3, given by
P0 = (x0 , y0 ) (21.23)
P1 = (x1 , y1 ) (21.24)
in such a way that the slopes at P0 and P1 are defined in terms of the points Q0 and
Q1 by the vectors P0 Q0 and P1 Q1 , where
Q0 = (x0 + 3α0 , y0 + 3β0 ) (21.25)
Q1 = (x1 − 3α1 , y1 − 3β1 ) (21.26)

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
146 LESSON 21. BEZIER CURVES

Figure 21.3: Alternate derivation of Bezier cubic splines.


(x0+α0, y0+β0)
(x1-α1, y1-β1)

(x0,y0)
(x1,y1)

As before we find a parametric representation of the curve (x(t), y(t) on the interval
t ∈ [0, 1], where
x(0) = x0 x(1) = x1 x0 (0) = 3α0 x0 (1) = 3α1 (21.27)
y(0) = y0 y(1) = y1 y 0 (0) = 3β0 y 0 (1) = 3β1 (21.28)
The factor of 3 in the definitions of the numbers α0 , α1 , β0 and β1 is not used in all
textbooks but is standard in the implementation used in most graphics programs, so
we will abide by it. Let us write the parametric equation for x(t) as
x(t) = A + Bt + Ct2 + Dt3 (21.29)
Differentiating equation 21.29,
x0 (t) = B + 2Ct + 3Dt (21.30)
The boundary conditions at t = 0 give
x(0) = A = x0 (21.31)
x0 (0) = B = 3α0 (21.32)
Substituting 21.31 and 21.32 back into 21.29 and 21.30 and then setting t = 1 gives
x(1) = x0 + 3α0 + C + D = x1 (21.33)
x0 (1) = 3α0 + 2C + 3D = 3α1 (21.34)
Multiplying equation 21.33 by 3,
3x0 + 9α0 + 3C + 3D = 3x1 (21.35)
Subtracting equation 21.34 from 21.35,
3x0 + 6α0 + C = 3x1 − 3α1 (21.36)

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 21. BEZIER CURVES 147

Hence
C = 3(x1 − x0 ) − 3(2α0 + α1 ) (21.37)
Multiplying equation 21.33 by 2,

2x0 + 6α0 + 2C + 2D = 2x1 (21.38)

Subtracting equation 21.34 from 21.38,

2x0 + 3α0 − D = 2x1 − 3α1 (21.39)

D = 2(x0 − x1 ) + 3(α0 + α1 ) (21.40)


Therefore

x(t) = x0 + 3α0 t + [3(x1 − x0 ) − 3(2α0 + α1 )]t2 + (21.41)


[2(x0 − x1 ) + 3(α0 + α1 )]t3
y(t) = y0 + 3β0 t + [3(y1 − y0 ) − 3(2β0 + β1 )]t2 + (21.42)
[2(y0 − y1 ) + 3(β0 + β1 )]t3

A variation of Bezier Cubics is used for Postscript fonts, which are defined in terms of
positions of the two points (x0 , y0 ) and (x3 , y3 ) and their handles (x1 , y1 ) and (x2 , y2 )
rather than the derivatives (so it has a slightly different form):

x(t) = (1 − t)3 x0 + 3t(1 − t)2 x1 + 3t2 (1 − t)x2 + t3 x3 (21.43)


y(t) = (1 − t)3 y0 + 3t(1 − t)2 y1 + 3t2 (1 − t)y2 + t3 y3 (21.44)

Bezier curves can be defined of any order, using any number of points. The points
define a sequence of line segments that “pull” the curve towards them, with the Bezier
curve parallel to the first and last segment. The general formula is
n  
X n
x(t) = xi (1 − t)n−i ti (21.45)
i
i=0
n  
X n
y(t) = yi (1 − t)n−i ti (21.46)
i
i=0

We can generate the Bezier Curve equations for a set of points in Mathematica as
follows.
bezier[points_?ListQ, t_] := Module[{n, bezx, bezy, x, y},
n = Length[points] - 1;
bezx = 0; bezy = 0;
x[i_] := points[[i + 1, 1]];
y[i_] := points[[i + 1, 2]];

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
148 LESSON 21. BEZIER CURVES

Figure 21.4: A. A typical Bezier Curve generated with the 6 points (1, 1.52), (2,
1.94), (3, 1.39), (4, 1.0), (5, 1.54), (6, 1.55).B, C: Rearrangements of the points give
different curves. C: The curve is closed becasue the first and last point are the same.

For[i = 0, i n, i++,
bezx = bezx + Binomial[n, i] * x[i] (1 - t)^(n - i) t^i;
bezy = bezy + Binomial[n, i] * y[i] (1 - t)^(n - i) t^i;
];
Return[{bezx, bezy}]
];

This function bezier is passed a list of points in cartesian coordinates;

In:=

bezier[{{1, 1}, {1.5, 2}, {4.5, 1.5}, {5.5, .5}}, t]

Out:=

{1 + 4.5*(1 - t)^2*t - t^2 + 13.5*(1 - t)*t^2 + 5.5*t^3,


1 + 6*(1 - t)^2*t - t^2 + 4.5*(1 - t)*t^2 + 0.5*t^3}

We can plot the points, line segments with handles, and Bezier curve with the Math-
ematica function bezierPlot:

bezierPlot[points_, opt___?OptionQ] := Module[{p1, p2, p3, t},


p1 = ListPlot[points,
PlotStyle -> PointSize[0.03], DisplayFunction -> Identity];
p2 = ListPlot[points, PlotJoined -> True, DisplayFunction -> Identity];
p3 = ParametricPlot[bezier[points, t], {t, 0, 1}, DisplayFunction ->

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 21. BEZIER CURVES 149

Identity];
Return[Show[{p1, p2, p3}, DisplayFunction -> $DisplayFunction, opt]]
];

Then the command bezierPlot[points] will generate the following plot

2
1.8
1.6
1.4
1.2

2 3 4 5
0.8
0.6

Standard options for Plot, such as PlotRange, Axes, TextStyle, etc, can be used by
bezierPlot. A generalization to higher dimensions is given by Bezier Surfaces, which
were also invented by Pierre Bezier in 1972. The general form of a Bezier Surface is
given in terms of (m + 1)(n + 1) points (x0,0 , y0,0 , z0,0 ), . . . , (xm,n , ym,n , zm,n ) as
n X
m   
X n m
x(s, t) = si (1 − s)n−i tj (1 − t)n−j xi,j (21.47)
i j
i=0 j=0

n X
m   
X n m
y(s, t) = si (1 − s)n−i tj (1 − t)n−j yi,j (21.48)
i j
i=0 j=0
n X
m   
X n m
z(s, t) = si (1 − s)n−i tj (1 − t)n−j zi,j (21.49)
i j
i=0 j=0

where s, t ∈ [0, 1]. This can be implemented in Mathematicaby the following function.

bezierSurface[points_?ListQ, {s_, t_}] :=


Module[{x, y, z, i, j, m, n, bezx, bezy, bezz, bezcoef},
n = Length[points] - 1;
m = Union[Length /@ points];
If[Length[m] 1, Return[$Failed]];
m = First[m] - 1;
x[i_, j_] := points[[i + 1, j + 1, 1]];
y[i_, j_] := points[[i + 1, j + 1, 2]];
z[i_, j_] := points[[i + 1, j + 1, 3]];
bezx = bezy = bezz = 0;

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
150 LESSON 21. BEZIER CURVES

For[i = 0, i n, i++,
For[j = 0, j m, j++,
bezcoef = Binomial[n, i] Binomial[m, j]
(s^i)((1-s)^(n-i))(t^j)((1-t)^(m-j));
bezx = bezx + bezcoef* x[i, j];
bezy = bezy + bezcoef * y[i, j];
bezz = bezz + bezcoef * z[i, j] ;
] ] ;
Return[{bezx, bezy, bezz}];
];
Consider the set of points on the corners of a cube given by
(0, 0, 0) (1, 0, 0), (1, 1, 0)
(0, 0, 1) (1, 0, 1), (1, 1, 1)
The Bezier Surface calculated with this algorithm is
x(t) = 2(1 − s)(1 − t)t + 2s(1 − t)t + (1 − s)t2 + st2 (21.50)
y(t) = (1 − s)t2 + st2 (21.51)
z(t) = s(1 − t)2 + 2s(1 − t)t + st2 (21.52)
which can be found in Mathematica via

In:=
data = {{{0,0,0}, {1,0,0}, {1,1,0}},
{{0,0,1}, {1,0,1}, {1,1,1}}};
surface = bezierSurface[data, {s, t}]
Out:=
{2*(1 - s)*(1 - t)*t + 2*s*(1 - t)*t + (1 - s)*t^2 + s*t^2,
(1 - s)*t^2 + s*t^2,
s*(1 - t)^2 + 2*s*(1 - t)*t + s*t^2}

The surface and its generating points are illustrated below. They are produced with
the following commands:
<<Graphics‘Graphics3D‘
dataPlot = ScatterPlot3D[Partition[Flatten[hinge], 3],
PlotStyle -> PointSize[.03]];
surfacepoints = Table[surface, {s, 0, 1, .05}, {t, 0, 1, .05},
DisplayFunction-> Identity];
surfacePlot = ListSurfacePlot3D[surfacepoints,
DisplayFunction-> Identity];
Show[dataPlot, surfacePlot, DisplayFunction-> \$DisplayFunction]

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 21. BEZIER CURVES 151

The following data was generated on a fixed (x, y) grid with z−values determined by
a random number generator. It produces a more complicated surface.

(1, 1, 4.63) (1, 2, 4.41) (1, 3, 3.05) (1, 4, 3.76) (1, 5, 2.87) (1, 6, 4.05) (1, 7, 2.81)
(2, 1, 3.31) (2, 2, 2.61) (2, 3, 3.17) (2, 4, 2.47) (2, 5, 4.55) (2, 6, 2.35) (2, 7, 3.)
(3, 1, 3.63) (3, 2, 4.99) (3, 3, 2.21) (3, 4, 3.46) (3, 5, 3.74) (3, 6, 4.62) (3, 7, 2.24)
(4, 1, 3.18) (4, 2, 4.33) (4, 3, 3.98) (4, 4, 2.62) (4, 5, 3.76) (4, 6, 3.28) (4, 7, 2.22)
(5, 1, 4.75) (5, 2, 4.71) (5, 3, 2.47) (5, 4, 3.91) (5, 5, 4.14) (5, 6, 3.54) (5, 7, 5.)
Three different views of the resulting Bezier surface are shown in the following figure.
Points that are blocked by the surface are not shown. The figure on the top left and
on the bottom shown only the bezier surface and the points from different angles.
The figure on the top right also shows a triangulated surface joined by connecting the
points.

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
152 LESSON 21. BEZIER CURVES

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
Lesson 22

Least Squares

Suppose we have a large set of data points in the xy-plane

{(xi , yi ) : i = 1, 2, ..., n} (22.1)

and we want to find the “best fit” straight line to our data, namely, we want to find
number m and b such that
y = mx + b (22.2)

is the “best” possible line in the sense that it minimizes the total sum-squared vertical
distance between the data points and the line. This process is known as the linear
least-squares problem or linear regression.
The vertical distance between any point (xi , yi ) and the line (see figure 22.1),
which we will denote by di , is

di = |mxi + b − yi | (22.3)

Since this distance is also minimized when its square is minimized, we instead calculate

d2i = (mxi + b − yi )2 (22.4)

The total of all these square-distances (the “sum-squared-distance”) is

n
X n
X
f (m, b) = d2i = (mxi + b − yi )2 (22.5)
i=1 i=1

The only unknowns in this expression are the slope m and y-intercept b. Thus we
have written the expression as a function f (m, b). Our goal is to find the values of m
and b that correspond to the global minimum of f (m, b).

153
154 LESSON 22. LEAST SQUARES

Figure 22.1: The sum of all the vertical distances is minimized in the least-squares
linear fit.

Setting ∂f /∂b = 0 gives

n
∂f ∂ X
0 = = (mxi + b − yi )2 (22.6)
∂b ∂b i=1
n
X
= 2 (mxi + b − yi ) (22.7)
i=1
X n
= 2 (mxi + b − yi ) (22.8)
i=1

Dividing by 2 and separating the three sums

n
X
0 = (mxi + b − yi ) (22.9)
i=1
n
X n
X n
X
= mxi + b− yi (22.10)
i=1 i=1 i=1
n
X n
X
= m xi + nb − yi (22.11)
i=1 i=1

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 22. LEAST SQUARES 155

Defining
n
X
X = xi (22.12)
i=1
n
X
Y = yi (22.13)
i=1

then we have
0 = mX + nb − Y (22.14)
Next, we set ∂f /∂m = 0, which gives
n
∂f ∂ X
0 = = (mxi + b − yi )2 (22.15)
∂m ∂m i=1
n
X
= 2xi (mxi + b − yi ) (22.16)
i=1
X n
= 2 xi (mxi + b − yi ) (22.17)
i=1

Dividing by 2 and separating the three sums as before


n
X
0 = xi (mxi + b − yi ) (22.18)
i=1
Xn n
X n
X
= mx2i + b xi − xi yi (22.19)
i=1 i=1 i=1
n
X n
X
= m x2i + bX − xi y i (22.20)
i=1 i=1

where X is defined in equation 22.12. Next we define,


n
X
A = x2i (22.21)
i=1
Xn
C = xi y i (22.22)
i=1

so that
0 = mA + bX − C (22.23)
Equations 22.14 and 22.23 give us a a system of two linear equations in two
variables m and b. Multiplying equation 22.14 by A and equation 22.23 by X gives
0 = A (mX + nb − Y ) = AXm + Anb − AY (22.24)
0 = X (mA + bX − C) = AXm + X 2 b − CX (22.25)

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
156 LESSON 22. LEAST SQUARES

Subtracting these two equations gives

0 = Anb − AY − X 2 b + CX = b(An − X 2 ) + CX − AY (22.26)

and therefore n n n n
x2i
P P P P
yi −
xi y i xi
AY − CX i=1 i=1 i=1 i=1
b= =  n 2 (22.27)
An − X 2 n
P 2
P
n xi − xi
i=1 i=1

If we instead multiply equation 22.14 by X and equation 22.23 by n we obtain

0 = X (mX + nb − Y ) = mX 2 + nXb − Y X (22.28)

0 = n (mA + bX − C) = nAm + nXb − nC (22.29)


Subtracting these two equations,

0 = m X 2 − nA − (Y X − nC)

(22.30)

Solving for m and substituting the definitions of A, C, X and Y , gives


n
P Pn n
P
y i − n xi y i
xi
XY − nC
m= 2 = i=1 i=1
 n 2
i=1
(22.31)
X − nA n
xi − n x2i
P P
i=1 i=1

We can summarize the algorithm as follows. To find a best-fit line to a set of n


data points (x1 , y1 ), (x2 , y2 ), ..., (xn , yn ) calculate,
Xn
X= xi (22.32)
Xni=1
Y = yi (22.33)
Xni=1
A= x2i (22.34)
i=1
Xn
C= xi y i (22.35)
i=1

The best fit line is y = mx + b where


XY − nC
m= (22.36)
X 2 − nA
AY − CX
b= (22.37)
An − X 2
Example 22.1. Find the least squares fit to the data (3, 2), (4,3), (5,4), (6,4) and
(7,5).

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 22. LEAST SQUARES 157

Solution. First we calculate the numbers X, Y , A, and C,


Xn
X = xi = 3 + 4 + 5 + 6 + 7 = 25 (22.38)
Xi=1n
Y = yi = 2 + 3 + 4 + 4 + 5 = 18 (22.39)
Xi=1n
A = x2i = 9 + 16 + 25 + 36 + 49 = 135 (22.40)
i=1
Xn
C = xi yi = (3)(2) + (4)(3) + (5)(4) + (6)(4) + (7)(5)
i=1
= 97 (22.41)

Therefore
XY − nC (25)(18) − (5)(97) 450 − 485 −35
m= 2
= 2
= = = 0.7 (22.42)
X − nA (25) − 5(135) 625 − 675 −50

and
AY − CX (135)(18) − (97)(25) 2430 − 2425 5
b= 2
= 2
= = = 0.1 (22.43)
An − X (135)(5) − 25 50 50

So the best fit line is y = 0.7x + 0.1


We can generalize the least squares problem to any order polynomial. Suppose
we want to fit a polynomial of degree n − 1 to m points (x1 , y1 ), . . . , (xm , ym ). Denote
the polynomial by
P (x) = c1 + c2 x + · · · + cn xn−1 (22.44)
If we were to attempt to fit the data exactly this would give us the system of equations

c1 + c2 x1 + · · · + cn xn−1
n = y1 (22.45)
..
. (22.46)
n−1
c1 + c2 x m + · · · + cn x m = y m (22.47)

If m = n, which is usually not the case, we could solve this equation exactly, by
writing     
1 x1 x21 · · · xn−11 c1 y1
 1 x2 x2 n−1  
2 x 2 c
  2   y2 
  
  ..  =  ..  (22.48)

 ..
.  .   . 
1 xm x2m · · · xn−1m cm ym
or more simply,
Ac = y (22.49)
The solution is of course
c = A−1 y (22.50)

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
158 LESSON 22. LEAST SQUARES

However, this is not usually a good idea. Even if we had m = n, when the number
of data points is relatively large such a curve would be extremely over-fit, giving a
“kink” or “bump” for each point. This is not usually a good approximation to the
data. Furthermore, when we have m > n the equation is not even solvable. In general
it is better to look for some sort of solution to
Ac ≈ y (22.51)
in the sense that the two sides of the equation match one-another in some sort of
minimized least squares sense. In other words, we want to find the “best fit” linear
combination
Ac = c1 a1 + c2 a2 + · · · + cn an (22.52)
where each aj is the j th column of the matrix A,
j−1 T
aj = xj−1 j−1
 
1 x 2 · · · x m (22.53)
We denote the pth component of ai as
aip = Api = AT
ip (22.54)
We define the best fit as the set of values c1 , . . . , cn that minimizes the distance
min |y − Ac| (22.55)
c1 ,...,cn

Define the residual error in fitting the ith data point


X
ri = cj aji − yi (22.56)
j

As with linear least squares, we will minimize the sum-square residual error
" #2
X X
= cj ATji − yi (22.57)
i j

by setting the partial derivatives equal to zero:


" #2
∂ X X
0= cj ATji − yi (22.58)
∂cp i j

for p = 1, . . . , n. Expanding gives us


"
∂ XXX XX
0= cj ATji ck ATki − cj ATji yi
∂cp i j k i j
#
XX X
− yi cj ATji + yi2 (22.59)
i j i
XXX ∂ XX
= ATji ATki (cj ck ) − 2 yi ATji δjp (22.60)
i j k
∂cp i j

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 22. LEAST SQUARES 159

Since
∂ ∂ ∂
(cj ck ) = cj ck + ck cj (22.61)
∂cp ∂cp ∂cp
= cj δpk + ck δpj (22.62)
Equation 22.60 gives
XXX X
0= ATji ATki (cj δpk + ck δpj ) − 2 yi ATpi (22.63)
i j k i
XXX XXX
= ATji ATki cj δpk + ATji ATki ck δpj − 2(AT y)p (22.64)
i j k i j k
XX XX
= ATji ATpi cj + ATpi ATki ck − 2(AT y)p (22.65)
i j i k
X X X X
= ATpi ATji cj + ATpi ATki ck − 2(AT y)p (22.66)
i j i k
X X X X
= ATpi Aij cj + ATpi Aik ck − 2(AT y)p (22.67)
i j i k
X X
= ATpi (Ac)i + ATpi (Ac)i − 2(AT y)p (22.68)
i i
= 2(AT Ac)p − 2(A y)p T
(22.69)
Hence c is the solution of the linear equation
AT Ac = AT y (22.70)
Formally, the least squares solution is given exactly by

c = (AT A)−1 AT y (22.71)


In practice, it turns out that there are generally better ways to solve the system 22.70
than calculation of the inverse matrix, which we will discuss subsequently.

As an example to illustrate that our earlier calculation of the linear least squares
fit falls out of equation 22.71 we note that for linear data,
 
1 x1
1 x 2 
A =  .. ..  (22.72)
 
. . 
1 xn
 
1 x1
1 1 · · · 1 1 x 2 
    P 
T n P x2
A A=  .. ..  = P (22.73)
x1 x2 · · · xn  . .  x x
1 xn

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
160 LESSON 22. LEAST SQUARES

Define X X 2
∆=n x2 − x (22.74)
Then P 2 P 
T −1 1 x − x
(A A) = P (22.75)
∆ − x n
Hence eq. 22.71 gives
 
y
  1
· · · 1  y2 
P 2 P 
1 x − x 1 1
c= (22.76)
· · · xn  ... 
P
∆ − x n x1 x2
 
yn
P 2 P  P 
1 x − x
= P P y (22.77)
∆ − x n xy
P 2 P P P 
1 x y − xP xy
= P P (22.78)
∆ − x y + n xy

and therefore
x2 y − x xy
P P P P
b = c1 = (22.79)
n x2 − ( x)2
P P
P P P
− x y + n xy
m = c2 = (22.80)
n x2 − ( x)2
P P

which is identical to equation 22.36.

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
Lesson 23

Numerical Differentiation

Recall the definition of a derivative

df (x) f (x + h) − f (x)
f 0 (x) = = lim (23.1)
dx h→0 h
If we choose h sufficiently small, then we can approximate the derivative by
f (x + h) − f (x)
f 0 (x) ≈ (23.2)
h
If we represent the function by a table of numbers , f0 = f (x0 ), . . . , fn = f (xn ) the
for fixed values of h
f (xi + h) − f (xi ) fi+1 − fi
f 0 (xi ) ≈ = , i = 0, 1, ..., n − 1 (23.3)
h h
To find an upper bound on the error, we use Taylor’s theorem:
1 1 1
f (x+h) = f (x)+hf 0 (x)+ h2 f 00 (x)+· · ·+ hn f (n) (x)+ hn+1 f (n+1) (c) (23.4)
2 n! (n + 1)!
where c is some unknown number between x and xh The Taylor formula for n = 1 is
1
f (x + h) = f (x) + hf 0 (x) + h2 f 00 (c) (23.5)
2
Thus
1
hf 0 (x) = f (x + h) − f (x) − h2 f 00 (c) (23.6)
2
and dividing by h
f (x + h) − f (x) 1 00
f 0 (x) = − hf (c) (23.7)
h 2
The first term gives precisely the same thing as equation 23.1; the second term
gives the error. Thus we have the following approximation formulas depending upon
whether h > 0 or h < 0. The Forward Difference Formula is
1 1
fi0 ≈ (fi+1 − fi ) − hf 00 (c) (23.8)
h 2
161
162 LESSON 23. NUMERICAL DIFFERENTIATION

and the Backward Difference Formula is


1 1
fi0 ≈ (fi − fi−1 ) + hf 00 (c) (23.9)
h 2
We can get a better approximation if we observe that
1 1
f (x + h) = f (x) + hf 0 (x) + h2 f 00 (x) + h3 f (3) (c) (23.10)
2 3!
where x ≤ c ≤ x + h and
1 1
f (x − h) = f (x) − hf 0 (x) + h2 f 00 (x) + (−h)3 f (3) (c1 ) (23.11)
2 3!
where x − h ≤ c1 ≤ x. Subtracting 23.11 from 23.10
1
f (x + h) − f (x − h) = 2hf 0 (x) + h3 (f (3) (c) + f (3) (c1 )) (23.12)
6
Solving for f 0 (x) gives
f (x + h) − f (x − h) 1
f 0 (x) = − h2 (f (3) (c) + f (3) (c1 )) (23.13)
2h 12
By the intermediate value theorem there is some number ξ, c1 ≤ ξ ≤ c, such that
1
f (3) (ξ) = (f (3) (c) + f (3) (c1 )) (23.14)
2
Using 23.14 in 23.13 gives us The Central Difference Formula
f (x + h) − f (x − h) 1
f 0 (x) = − h2 f (3) (ξ) (23.15)
2h 16

Example 23.1. Compare the forward difference, central difference, and backward
difference methods for the following data.
x = 1.1 x = 1.2 x = 1.3 x = 1.4
f (x) = 9.025 f (x) = 11.023 f (x) = 13.464 f (x) = 16.645
Solution. According to the forward difference formula:
f (1.2) − f (1.1) 11.023 − 9.025
f 0 (1.1) = = = 19.98 (23.16)
0.1 .1
f (1.3) − f (1.2) 13.464 − 11.023
f 0 (1.2) = = = 24.41 (23.17)
0.1 0.1
f (1.4) − f (1.3) 16.645 − 13.464
f 0 (1.3) = = = 31.81 (23.18)
0.1 0.1

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 23. NUMERICAL DIFFERENTIATION 163

Figure 23.1: Points used in the calculation of the derivative at (x1 , f (x1 )) for the
forward difference formula (top left0; the backward difference formula (top right) and
the central difference formula (bottom). The slope of the line joining the points shown
is used to approximate the derivative. The actual tangent is indicated by a dashed
line.
Hx1 , f Hx1 LL Hx1 , f Hx1 LL

Hx0 , f Hx0 LL Hx0 , f Hx0 LL


8x2 , fHx2 L< 8x2 , fHx2 L<

Hx1 , f Hx1 LL

Hx0 , f Hx0 LL
8x2 , fHx2 L<

According to the backward difference formula:


f (1.2) − f (1.1) 11.023 − 9.025
f 0 (1.2) = = = 19.98 (23.19)
0.1 .1
f (1.3) − f (1.2) 13.464 − 11.023
f 0 (1.3) = = = 24.41 (23.20)
0.1 0.1
f (1.4) − f (1.3) 16.645 − 13.464
f 0 (1.4) = = = 31.81 (23.21)
0.1 0.1
According to the central difference formula:
f (1.3) − f (1.1) 13.464 − 9.025
f 0 (1.2) = = = 22.195 (23.22)
.2 .2
f (1.4) − f (1.2) 16.645 − 11.023
f 0 (1.3) = = = 28.110 (23.23)
.2 .2
It is instructive to compare the results tabularly, as follows; we see immediately that
the backward difference calculations are identical to the forward difference calculations
but shifted one step to the right.
x = 1.1 x = 1.2 x = 1.3 x = 1.4
FD 19.98 24.41 34.81
BD 19.98 24.41 34.81
CD 22.195 28.110

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
164 LESSON 23. NUMERICAL DIFFERENTIATION

There is no forward difference value for the right endpoint, no backward difference
value for the left endpoint, and no central difference value for either endpoint.
To get a more accurate number for the derivative, we can find the derivative of an
interpolating polynomial. We start with the Lagrange interpolating polynomial with
error term: n n
X 1 Y
f (x) = fk Lk (x) + (x − xk )f (n+1) (c) (23.24)
k=0
(n + 1)! k=0
where n
Y (x − xj )
Lk = (23.25)
j=0,j6=k
(xk − xj )

and c is some point in [a, b]. Differentiating,


( n n
)
d X 1 Y
f 0 (x) = fk Lk (x) + f (n+1) (c(x)) (x − xk ) (23.26)
dx k=0 (n + 1)! k=0
n
( n
)
X
0 1 d (n+1)
Y
= fk Lk (x) + f (c(x)) (x − xk ) (23.27)
k=0
(n + 1)! dx k=0
n   n
X 1 d (n+1) Y
= fk L0k (x) + f (c(x)) (x − xk )+
k=0
(n + 1)! dx
( n
) k=0
1 d Y
f (n+1) (c(x)) (x − xk ) (23.28)
(n + 1)! dx k=0

At any grid point xi ,


n   n
0
X 1 d (n+1) Y
fk L0k (xi )

f (xi ) = + f (c(x))
(xi − xk )+
k=0
(n + 1)!
dx x=xi k=0
( n
)
1 (n+1) d Y
f (c(xi )) (x − xk ) (23.29)

(n + 1)! dx k=0

x=xi

Notice that the product in the middle term has a factor of (xi − xi ) and is therefore
zero. Thus
n
( n
)
X 1 d Y
f 0 (xi ) = fk L0k (xi ) + f (n+1) (c(xi )) (x − xk ) (23.30)

k=0
(n + 1)! dx k=0

x=xi

By the product rule for differentiation,


n n n
d Y X Y
(x − xk ) = (x − xj ) (23.31)
dx k=0 k=0 j=0,j6=k

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 23. NUMERICAL DIFFERENTIATION 165

and therefore at a grid point



n n n
d Y X Y
(x − xk ) = (xi − xj ) (23.32)

dx k=0
k=0 j=0,j6=k
xi

Every term in the summation except for the k = i term will have a factor xi − xi in
the product; therefore the only nonzero term is for k = i.
n
n
d Y Y
(x − xk ) = (xi − xj ) (23.33)

dx
k=0 j=0,j6=i
xi

Renaming the index on right from j to k gives


n
n
d Y Y
(x − xk ) = (xi − xk ) (23.34)

dx
k=0 k=0,k6=i
xi

Substitution back into equation 23.30 yields the n + 1-point approximation for-
mula,
n n
X 1 Y
f 0 (xi ) = fk L0k (xi ) + f (n+1) (c(xi )) (xi − xk ) (23.35)
k=0
(n + 1)! k=0,k6=i

The first term gives the approximation and the second gives an error formula.
A two-point approximation is obtained when n = 1. The Lagrange Polynomials
are
x − x1 x − x0
L0 = , L1 = (23.36)
x0 − x1 x1 − x0
Hence
1 1
L00 = , L01 = (23.37)
x0 − x1 x1 − x0
Therefore
f0 f1 f1 − f0
f 0 (xi ) ≈ f0 L00 (xi ) + f1 L01 (xi ) = + = (23.38)
x0 − x1 x1 − x0 x 1 − x0
This is precisely the forward difference formula.
When n = 2 we obtain the 3-point formulas. The Lagrange functions are

(x − x1 )(x − x2 ) x2 − (x1 + x2 )x + x1 x2
L0 (x) = = (23.39)
(x0 − x1 )(x0 − x2 ) (x0 − x1 )(x0 − x2 )

(x − x0 )(x − x2 ) x2 − (x0 + x2 )x + x0 x2
L1 (x) = = (23.40)
(x1 − x0 )(x1 − x2 ) (x1 − x0 )(x1 − x2 )

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
166 LESSON 23. NUMERICAL DIFFERENTIATION

(x − x0 )(x − x1 ) x2 − (x0 + x1 )x + x0 x1
L2 (x) = = (23.41)
(x2 − x0 )(x2 − x1 ) (x2 − x0 )(x2 − x1 )

Hence
2x − (x1 + x2 )
L00 (x) = (23.42)
(x0 − x1 )(x0 − x2 )

2x − (x0 + x2 )
L01 (x) = (23.43)
(x1 − x0 )(x1 − x2 )

2x − (x0 + x1 )
L02 (x) = (23.44)
(x2 − x0 )(x2 − x1 )

and therefore

f 0 (xi ) ≈ f0 L00 (xi ) + f1 L01 (xi ) + f2 L02 (xi ) (23.45)


2xi − (x1 + x2 ) 2xi − (x0 + x2 )
= f0 + f1
(x0 − x1 )(x0 − x2 ) (x1 − x0 )(x1 − x2 )
2xi − (x0 + x1 )
+f2 (23.46)
(x2 − x0 )(x2 − x1 )

If the grid points are equally spaced then

2xi − (x1 + x2 ) 2xi − (x0 + x2 ) 2xi − (x0 + x1 )


f 0 (xi ) = f0 + f1 + f2 (23.47)
(−h)(−2h) (h)(−h) (2h)(h)
2xi − (x1 + x2 ) 2xi − (x0 + x2 ) 2xi − (x0 + x1 )
= f0 2
− f1 2
+ f2 (23.48)
2h h 2h2

At each of the points

2x0 − (x1 + x2 ) 2x0 − (x0 + x2 ) 2x0 − (x0 + x1 )


f 0 (x0 ) = f0 2
− f1 2
+ f2 (23.49)
2h h 2h2
2x0 − (x0 + h + x0 + 2h) 2x0 − (x0 + x0 + 2h)
= f0 2
− f1 (23.50)
2h h2
2x0 − (x0 + x0 + h)
+ f2 (23.51)
2h2  
(−3h) (−2h) (−h) 1 3 1
= f0 − f1 + f2 = − f0 + 2f1 − f2 (23.52)
2h2 h2 2h2 h 2 2

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 23. NUMERICAL DIFFERENTIATION 167

2x1 − (x1 + x2 ) 2x1 − (x0 + x2 ) 2x1 − (x0 + x1 )


f 0 (x1 ) = f0 2
− f1 2
+ f2 (23.53)
2h h 2h2
2(x0 + h) − (x0 + h + x0 + 2h) 2(x0 + h) − (x0 + x0 + 2h)
= f0 2
− f1
2h h2
2(x0 + h) − (x0 + x0 + h)
+ f2 (23.54)
2h2
(−h) (h)
= f0 2
+ f2 2 (23.55)
2h 2h 
1 1 1
= − f0 + f2 (23.56)
h 2 2

2x2 − (x1 + x2 ) 2x2 − (x0 + x2 ) 2x2 − (x0 + x1 )


f 0 (x2 ) = f0 2
− f1 2
+ f2 (23.57)
2h h 2h2
2(x0 + 2h) − (x0 + h + x0 + 2h) 2(x0 + 2h) − (x0 + x0 + 2h)
= f0 2
− f1
2h h2
2(x0 + 2h) − (x0 + x0 + h)
+ f2 (23.58)
2h2  
h 2h 3h 1 1 3
= f0 2 − f1 2 + f2 2 = f0 − 2f1 + f2 (23.59)
2h h 2h h 2 2
This gives us the general three-point formula
1
fi0 =
(−3fi + 4fi+1 − fi+2 ) (23.60)
2h
1
fi0 = (−fi−1 + fi+1 ) (23.61)
2h
1
fi0 = (fi−2 − 4fi−1 + 3fi ) (23.62)
2h
The middle formula returns the central difference formula; the first and third formulas
give us a method to extend the technique to the endpoints.
Another method for deriving formulas for the derivative is to use the Taylor series,
1
f (x + h) = f (x) + hf 0 (x) + h2 f 00 (x) + · · ·
2
1 n (n) 1
+ h f (x) + hn+1 f (n+1) (c) (23.63)
n! (n + 1)!
for some c between x and x + h. In particular, this method can be used to derive
approximation formulas for higher order derivatives. We will illustrate this process
by deriving an approximation formula for f 00 . Letting x = x0 in the Taylor series
23.63
1 1 (n) 1
f1 = f0 + hf00 + h2 f 00 0 + · · · + hn f0 + hn+1 f (n+1) (c1 ) (23.64)
2 n! (n + 1)!

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
168 LESSON 23. NUMERICAL DIFFERENTIATION

where c1 ∈ [x0 , x0 + h]. Next, let x−1 = x − h,


1 2 00 1 (n)
f−1 = f0 − hf00 + h f0 + · · · + (−h)n f0 (x) (23.65)
2! n!
1
+ (−hn+1 )f (n+1) (c−1 )
(n + 1)!

for some c−1 ∈ [x0 − h, x0 ]. Adding equations 23.64 and 23.65 gives
 
1 2 00 1 4 (4) 1 n (n)
f1 + f−1 = 2 f0 + h f 0 + h f0 + · · · + h f0
2! 4! n!
n+1 
h
f (n+1) (c1 ) + (−1)n+1 f (n+1) (c−1 )

+ (23.66)
(n + 1)!

where n is even. If n is odd the term in the square brackets terminates at the hn−1
term instead of the hn term. For example, if n = 3,

h4  (4)
 
1 2 00
f (c1 ) + (−1)4 f (4) (c−1 )

f1 + f−1 = 2 f0 + h f 0 + (23.67)
2! 4!
1 
= 2f0 + h2 f000 + h4 f (4) (c1 ) + f (4) (c−1 )

(23.68)
24
Solving for f000 ,
1 1 
f 00 0 = [f1 − 2f0 + f−1 ] − h2 f (4) (c1 ) + f (4) (c−1 )

2
(23.69)
h 24
By the intermediate value theorem, since [f (4) (c1 ) + f (4) (c−1 )]/2 is between f (4) (c1 )
and f (4) (c−1 ), then (assume f (4) is continuous in [a, b] then there is some number
c0 ∈ [c−1 , c1 ] such that

(4) 1 (4) (4)


f0 (c0 ) = [f0 (c1 ) + f0 (c−1 )] (23.70)
2
Hence
1 1
f 00 0 = 2
[f1 − 2f0 + f−1 ] − h2 f (4) (c0 ) (23.71)
h 12
Replacing x0 with xk ,
1 1
f 00 k = 2
[fk+1 − 2fk + fk−1 ] − h2 f (4) (c0 ) (23.72)
h 12

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
Lesson 24

Richardson Extrapolation

Richardson extrapolation gives a method to “accelerate” the convergence of a se-


quence; in other words, if we have a method that converges as O(hm ), where h is
some small parameter (e.g., the error in the calculation is proportional to hm ) it can
produce a method that converges as O(hm+1 ) (e.g., the error in the calculation is
proportional to hm+1 ). This reduces the error by a factor of h, and can be used
iteratively to produce methods that converge at any rate we want. It was developed
by Lewis Fry Richardson (1891-1953).
Suppose we have an approximation formula for x that depends on some parameter
h, such as the step size. We will write this as N (h) plus higher order terms:
x = N (h) + c1 h + c2 h2 + c3 h3 + · · · (24.1)
where x is the exact value and N (h) is the approximate value. Equation 24.1 says
that N (h) is accurate to O(h). The terms with higher order in h are the error terms.
Then we can write
x = N (h/2) + c1 h/2 + c2 h2 /4 + c3 h3 /8 + · · · (24.2)
Multiplying equation 24.2 gives
2x = 2N (h/2) + c1 h + c2 h2 /2 + c3 h3 /4 + · · · (24.3)
Subtracting equation 24.1 from 24.3 gives
1 1
x = 2N (h/2) + c1 h + c2 h2 + c3 h3 + · · ·
2 4
−(N (h) + c1 h + c2 h2 + c3 h3 + · · · ) (24.4)
1 3
= 2N (h/2) − N (h) − c2 h2 − c3 h3 (24.5)
2 4
If we define the function N2 (h) by
N (h/2) − N (h)
N2 (h) = N (h/2) + (24.6)
22−1 − 1

169
170 LESSON 24. RICHARDSON EXTRAPOLATION

then
1 3
x = N2 (h) − c2 h2 − c3 h3 + · · · (24.7)
2 4
which tells us that N2 approximates x to O(h2 ). Repeating the process,

1 3
x = N2 (h/2) − c2 (h/2)2 − c3 (h/2)3 + · · · (24.8)
2 4
1 2 3
= N2 (h/2) − c2 h − c3 h3 − · · · (24.9)
8 32
Multiplying by 4,
1 3
4x = 4N2 (h/2) − c2 h2 − c3 h3 − · · · (24.10)
2 8
Subtracting equation 24.7 from 24.10 gives
 
1 2 3 3
3x = 4N2 (h/2) − c2 h − c3 h − · · ·
2 8
 
1 2 3 3
− N2 (h) − c2 h − c3 h − · · · (24.11)
2 4
3
= 4N2 (h/2) − N2 (h) + c3 h3 + · · · (24.12)
8
Solving for x,
4N2 (h/2) − N2 (h) 1 3
x= + c3 h + · · · (24.13)
3 8
It is convenient to rewrite equation 24.13 as

N2 (h/2) − N2 (h) 1 3
x = N2 (h/2) + + c3 h + · · · (24.14)
3 8
Let us define
N2 (h/2) − N2 (h)
N3 (h) = N2 (h/2) + (24.15)
23−1 − 1
Then
1
x = N3 (h) + c3 h3 + · · · (24.16)
8
Hence N3 (h) is accurate to O(h3 ). In general we define

Nj−1 (h/2) − Nj−1 (h)


Nj (h) = Nj−1 (h/2) + (24.17)
2j−1 − 1

which gives
x = Nj (h) + O(hj ) (24.18)

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 24. RICHARDSON EXTRAPOLATION 171

Equation 24.17 is called Richardson Extrapolation. It takes a method that is


O(hm ) and gives a method that is O(hm+1 ). The method is sometimes summarized
as:

Better Estimate = More Accurate


1
+ j−1 (More Accurate − Less Accurate) (24.19)
2 −1
We next illustrate how Richardson extrapolation can be used to produce differentia-
tion formulas. We start with the forward difference formula,

1
f00 ≈ (f1 − f0 ) + O(h) (24.20)
h
Doubling the step size,
1
f00 = (f2 − f0 ) + O(h) (24.21)
2h
Define
1 1
N (h) = (f2 − f0 ) = (f (x0 + 2h) − f (x0 )) (24.22)
2h 2h
1 1
N (h/2) = (f (x0 + h) − f (x0 )) = (f1 − f0 ) (24.23)
h h
and therefore,

N (h/2) − N (h)
N2 (h) = N (h/2) + (24.24)
22−1 − 1
1 1 1
= (f1 − f0 ) + (f1 − f0 ) − (f2 − f0 ) (24.25)
h h 2h
1
= (2f1 − 2f0 + 2f1 − 2f0 − f2 + f0 ) (24.26)
2h
1
= (−3f0 + 4f1 − f2 ) (24.27)
2h
This is the same formula we found by using Taylor series.
To get a higher order approximation we would have to split the step size again.
Since we don’t have any data more finely grained that at intervals of h, we use the
following trick: go back to the beginning and start with 4h, then 2h, then h. Starting
with a centered difference with a step size of 4h,

1 1
N (h) = (f4 − f0 ) = (f (x0 + 4h) − f0 ) (24.28)
4h 4h
1 1
N (h/2) = (f (x0 + 2h) − f0 ) = (f2 − f0 ) (24.29)
2h 2h
« 2008, B.E.Shapiro Math 481A
Last revised: July 5, 2008 California State University Northridge
172 LESSON 24. RICHARDSON EXTRAPOLATION

N2 (h) = N (h/2) + [N (h/2) − N (h)] (24.30)


1 1 1
= (f2 − f0 ) + (f2 − f0 ) − (f4 − f0 ) (24.31)
2h 2h 4h
1
= (2f2 − 2f0 + 2f2 − 2f0 − f4 + f0 ) (24.32)
4h
1
= (−3f0 + 4f2 − f4 ) (24.33)
4h

Continuing the process,

1
N2 (h/2) = (−3f0 + 4f1 − f2 ) (24.34)
2h

hence

1
N3 = N2 (h/2) + [N2 (h/2) − N2 (h)] (24.35)
3
1
= (−3f0 + 4f1 − f2 ) +
2h  
1 1 1
(−3f0 + 4f1 − f2 ) − (−3f0 + 4f2 − f4 ) (24.36)
3 2h 4h
1
= (−3f0 + 4f1 − f2 ) +
2h
1 1
(−3f0 + 4f1 − f2 ) − (−3f0 + 4f2 − f4 ) (24.37)
6h 12h
1
= [6(−3f0 + 4f1 − f2 ) + 2(−3f0 + 4f1 − f2 )
12h
−(−3f0 + 4f2 − f4 )] (24.38)
1
= (−21f0 + 32f1 − 12f2 + f4 ) (24.39)
12h

and so forth, so that

1
f0 = (−21f0 + 32f1 − 12f2 + f4 ) + O(h3 ) (24.40)
12h

Example 24.1. Use a centered difference and Richardson Extrapolation to determine


f 0 (x) for the function f (x) = x + ex at the origin (x=0) using h = 0.4 through N3 (0.4)

Solution. The centered difference formula is

f (x + h) − f (x − h)
f 0 (x) = = N1 (h) (24.41)
2h
Math 481A « 2008, B.E.Shapiro
California State University Northridge Last revised: July 5, 2008
LESSON 24. RICHARDSON EXTRAPOLATION 173

Letting x = 0 and h = .4,

N1 (.4) = 2.02688 (24.42)


N1 (.2) = 2.00668 (24.43)
N2 (.4) = N1 (.2) + [N1 (.2) − N1 (.4)] (24.44)
= 2.00668 + 2.00668 − 2.02688 (24.45)
= 1.98648 (24.46)
N1 (.1) = 2.00167 (24.47)
N2 (.2) = N1 (.1) + [N1 (.1) − N1 (.2)] (24.48)
= 2.00167 + 2.00167 − 2.00668 (24.49)
= 1.99665 (24.50)
N2 (.2) − N2 (.4)
N3 (.4) = N2 (.2) + (24.51)
3
1.99665 − 1.98648
= 1.99665 + (24.52)
3
= 2.00004 (24.53)

Observe in this example that we had to obtain information in the following table:
N1 (.4)
N1 (.2) N2 (.4)
N1 (.1) N2 (.2) N3 (.4)
N1 (.05) N2 (.1) N3 (.2) N4 (.4)
.. .. .. .. ...
. . . .
In other words, to get any item in the table requires knowledge of everything above and
to the left of it in the table. This is true in general in using Richardson Extrapolation.

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
174 LESSON 24. RICHARDSON EXTRAPOLATION

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
Lesson 25

Numerical Integration

The simplest method for numerical integration is a direct implementation of Riemann


Sums. If you know the function values f0 , f1 , ..., fn at the points a = x0 , x1 , ..., xn = b
then one form for the Riemann Sum is
Z b n−1
X
f (x)dx ≈ f (xi )(xi+1 − xi ) (25.1)
a i=0

i.e., the areas of rectangles whose upper left hand corners touch the curve of f (x) We
can write a similar formula down using the right hand corners:
Z b n
X
f (x)dx ≈ f (xi )(xi − xi−1 ) (25.2)
a i=i

If the points are equally spaced then

xi = x0 + ih (25.3)

so that we have
Z b n−1
X n
X
f (x)dx ≈ h f (xi ) ≈ h f (xi ) (25.4)
a i=0 i=1

Alternatively, we can calculate the area using boxes that cross the curve. For
example, if we know the function at three points (x0 , f0 ), (x1 , f1 ), and (x2 , f2 ) where
x1 = x0 + h and x2 = x0 + 2h then we can approximate the area under the curve of
f (x) on [x0 , x2 ] by a box whose width is x2 − x0 = 2h and whose height is f (x1 ):
Z x2
f (x)dx = 2hf (x1 ) (25.5)
x0

175
176 LESSON 25. NUMERICAL INTEGRATION

Figure 25.1: Calculation of an integral as the area under the curve can be apprxi-
mated with vertical rectangles. Top row, left: upper left hand corner of rectangles
fit to curve. Right: Upper right hand corner of each rectangle is fit to the curve.
Bottom row., left: Midpoint of top of each rectangle is fit to the curve. Right: in the
trapezoidal rule, the rectangles are replaced by trapezoids whose tops fit the function
at both upper corners.

a b a b

a b a b

To get the area over the entire interval [a, b], where a = x0 < x1 < x2 < · · · < xn = b,
and n is assumed to be even, we obtaine the Composite Midpoint Rule,
Z b Z x2 Z x4
f (x)dx = f (x)dx + f (x)dx + · · ·
a x0 x2
Z xn−2 Z xn
+ f (x)dx + f (x)dx (25.6)
xn−4 xn−2
≈ 2hf1 + 2hf3 + 2hf5 + · · · + 2hfn−3 + 2hfn−1 (25.7)
= 2h(f1 + f3 + · · · fn−1 ) (25.8)
R 10
Example 25.1. Find 0
x2 e−x dx using n = 4 and n = 10. Compare your result
with the exact integral.

Solution. The exact solution is


Z 10
10
x2 e−x dx = e−x −2 − 2x − x2 0

(25.9)
0
= 2 − 122e−10 ≈ 1.99446 (25.10)

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 25. NUMERICAL INTEGRATION 177

x2 e−x with the midpoint rule for n = 4 and n = 10.


R
Figure 25.2: Approximation of
See example 25.1.
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0. 0.
. 0. 2.5 5. 7.5 10. 0 1 2 3 4 5 6 7 8 9 10

For n = 4 we have h = 10/4 = 2.5. Let f (x) = x2 e−x . Then

f1 = f (x1 ) = f (h) = f (2.5) = 2.52 e−2.5 ≈ 0.513031 (25.11)


f3 = f (x3 ) = f (3h) = f (7.5) = 7.52 e−7.5 ≈ 0.031111 (25.12)

hence
Z 10
x2 e−x dx = 2h[f1 + f3 ] (25.13)
0
≈ 2(2.5)(0.513031 + 0.031111) (25.14)
≈ 2.72072 (25.15)

The relative error is


2.72071 − 1.99446
= = 36% (25.16)
1.99446
For n = 10 we have h = 1 so that we need to calculate f − 1, f3 , f5 , f7 and f9 .

f1 = f (x1 ) = f (1) = (1)2 e−1 ≈ 0.367879 (25.17)


f3 = f (x3 ) = f (3) = (3)2 e−3 ≈ 0.448084 (25.18)
f5 = f (x5 ) = f (5) = (5)2 e−5 ≈ 0.168449 (25.19)
f7 = f (x7 ) = f (7) = (7)2 e−7 ≈ 0.044682 (25.20)
f9 = f (x9 ) = f (9) = (9)2 e−9 ≈ 0.0099996 (25.21)

hence
Z 10
f (x)dx = 2h[f1 + f3 + f5 + f7 + f9 ] (25.22)
0
≈ 2(0.367879 + 0.48084 + 0.168449
+ 0.044682 + 0.0099996) (25.23)
≈ 2.07818 (25.24)

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
178 LESSON 25. NUMERICAL INTEGRATION

This gives a relative error of


2.07818 − 1.99f 446
≈ ≈ 5.03% (25.25)
1.99446
This example illustrates how we can decrease the error by decreasing the step size.
In general, the relative error will decrease as we increase the number of intervals,
and, correspoondingly, decrease the step size. The relative error in ths method as a
function of number of intervals is plotted for n as large as 1000 in figure 25.3.

Figure 25.3: The relative error as a function of number of intervals for the integral
solved in example 25.1.
100 %

10 %

1%

0.1 %

0.01%

0.001%

0.0001%

0.00001%

5 10 50 100 500 1000

Since the midpoint rule does not use all the information that we know about the
function, we could modify it by interpreting the xi as the center of rectangles of width
h rather than treating the odd-numbered xi as centers of rectangles of 2h. The first
and last points x0 and xn become the left- and right-hand ends of rectangles of width
h/2 in this scheme,
Z b Z x0 +h/2 n−1 Z
X xi +h/2
f (x)dx = f (x)dx + f (x)dx
a x0 i=1 xi −h/2
Z xn
+ f (x)dx (25.26)
xn −h/2
n−1
h X h
≈ f0 + hfi + fn (25.27)
2 i=1
2

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 25. NUMERICAL INTEGRATION 179

The resulting Composite Trapezoidal Rule is

Z b
h
f (x)dx ≈ [f0 + 2f1 + 2f2 + · · · + 2fn−2 + 2fn−1 + fn (25.28)
a 2

Example 25.2. Repeat example 25.1 using the composite Trapezoidal rule with h =
2.5.

x2 e−x using the trapezoidal rule as illustrated in


R
Figure 25.4: Approximation of
example 25.2.

0.5
0.4
0.3
0.2
0.1
0.
0. 2.5 5. 7.5 10.

Solution. Since h = 2.5, then n = (b − a)/h = 4. According to equation 25.28,

Z 10
h
f (x)dx = [f0 + 2f1 + 2f2 + 2f3 + f4 ] (25.29)
0 2
2.5  2 −0
= 0 e + 2(2.52 e−2.5 ) + 2(52 e−5 )
2
+ 2(7.52 e−7.5 ) + 102 e−10

(25.30)
= 1.25[2 + 2(0.513031) + 2(0.168449) + 2(0.031111) + 0.00454] (25.31)
= 1.78715 (25.32)

The relative error is


1.78715 − 1.99446
= = 11.6% (25.33)
1.99446

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
180 LESSON 25. NUMERICAL INTEGRATION

Figure 25.5: Relative error for trapezoidal method (orange); midpoint method (red);
and Simpson’s method (blue) for various step sizes.
100 %

10 %

1%

0.1 %

0.01%

0.001%

0.0001%

0.00001%

5 10 50 100 500 1000

Simpson’s rule is derived by fitting a quadratic to three equally spaced points. Let

p(x) = A(x − x0 )2 + B(x − x1 ) + C (25.34)

be a quadratic that passes through the three points (x0 , f0 ), (x1 , f1 ), and (x2 , f2 ),
where x1 = x0 + h, and x2 = x0 + 2h.

f0 = p(x0 ) = B(x0 − x1 ) + C = B(−h) + C (25.35)


= −Bh + C (25.36)
f1 = A(x1 − x0 )2 + C (25.37)
= Ah2 + C (25.38)
f2 = A(x2 − x0 )2 + B(x2 − x1 ) + C = A(2h)2 + Bh + C (25.39)
= 4Ah2 + Bh + C (25.40)

Adding equations 25.36 and 25.40

f0 + f2 = 4Ah2 + 2C (25.41)

while multiplying equation 25.38 by 2,

2f1 = 2Ah2 + 2C (25.42)

Subtracting equation 25.42 from equation 25.41,

2Ah2 = f0 − 2f1 + f2 (25.43)

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 25. NUMERICAL INTEGRATION 181

Multiplying equation 25.38 by 2 and substituting 25.43,

2C = 2f1 − 2Ah2 = −f0 + 4f1 − f2 (25.44)


Multiplying equation 25.41 by 2 and substituting 25.44,

2Bh = 2C − 2f0 = −3f0 + 4f1 − f2 (25.45)

Equations 25.43, 25.44 and 25.45 give us the values of A, B, and C. As it turns out,
we will only need to know A and C but not B. The integral is
Z x2
I= [A(x − x0 )2 + B(x − x1 ) + C]dx (25.46)
x0
x2
A 3 B 2

= [ (x − x0 ) + (x − x1 ) + Cx] (25.47)
3 2 x0
A
= [(x2 − x0 )3 − (x0 − x0 )3 ]
3
B
+ [(x2 − x1 )2 − (x0 − x1 )2 + C(x2 − x0 ) (25.48)
2
A B
= (2h)3 + [h2 − h2 ] + C(2h) (25.49)
3 2
8
= Ah3 + 2Ch (25.50)
3
Substituting equations 25.43 and 25.44
h h
I= [8Ah2 + 6C] = [4(2Ah2 ) + 3(2C)] (25.51)
3 3
h
= [4(f0 − 2f1 + f2 ) + 3(−f0 + 4f1 − f2 )] (25.52)
3
h
= [f0 + 4f1 + f2 ] (25.53)
3
This gives us Simpson’s Rule:
Z x2
h
f (x)dx = [f (x0 ) + 4f (x1 ) + f (x2 )] (25.54)
x0 3

where x1 = (x2 + x0 )/2, or equivalently


Z b
h
f (x)dx = [f (a) + 4f ((a + b)/2) + f (b)] (25.55)
a 3
In general, we can write
Z xi+2
h
f (x)dx = [f (xi ) + 4f (xi+1 ) + f (xi+2 )] (25.56)
xi 3

« 2008, B.E.Shapiro Math 481A


Last revised: July 5, 2008 California State University Northridge
182 LESSON 25. NUMERICAL INTEGRATION

If n is even then
Z xn Z x2 Z x4 Z x6
f (x)dx = f (x)dx + f (x)dx + f (x)dx
x0 x0 x2 x4
Z xn
+ ··· + f (x)dx (25.57)
xn−2

Substituting equation 25.56 in each term of 25.57


Z xn
h
f (x)dx = [(f0 + 4f1 + f2 ) + (f2 + 4f3 + f4 )
x0 3
+ (f4 + 4f5 + f6 ) + · · · + (fn−4 + 4fn−3 + fn−2 )
+ (fn−2 + 4fn−1 + fn )] (25.58)
Collecting terms gives the Composite Simpson’s Rule:
Z xn
h
f (x)dx = [f0 + 4f1 + 2f2 + 4f3 + 2f4 + · · · + 2fn−2 + 4fn−1 + fn ] (25.59)
x0 3

The pattern of coefficients is 1, 4, 2, 4, 2, 4, . . . , 4, 2, 4, 1.


Example 25.3. Repeat example 25.1 using Simpson’s rule with n = 4.
Solution. Since n = 4 then h = (b − a)/n = 2.5. The formula is
Z 10
h
e−x x2 dx = [f0 + 4f1 + 2f2 + 4f3 + f4 ] (25.60)
0 3
2.5 − 2
= [e 00 + 4e−2.5 (2.52 ) + 2e−5 (52 ) + 4e−7.5 (7.52 ) + e−10 (102 )]
3
(25.61)
2.5
= [0 + 4(0.513031) + 2(0.168449) + 4(0.031111) + 0.00454999]
3
(25.62)
= 2.09834 (25.63)
To arrive at Simpson’s rule we derived the coefficients by fitting a parabola to
three successive points. What if we were to fit a higher order polynomial? Given any
n + 1 points we can fit a unique polynomial of order at most n. If
n
X
f (x) ≈ ci x i (25.64)
i=0

Then
Z b n
Z bX n Z b n
i
X
i
X ci
f (x)dx ≈ ci x dx ≈ ci x dx = (bi+1 − ai+1 ) (25.65)
a a i=0 i=0 a i=0
i+1

Math 481A « 2008, B.E.Shapiro


California State University Northridge Last revised: July 5, 2008
LESSON 25. NUMERICAL INTEGRATION 183

which gives us a general Quadrature Formula:


$$ \int_a^b f(x)\,dx \approx \sum_{i=0}^{n} a_i f_i \qquad (25.66) $$

For example, we know how to fit n + 1 points with the nth Lagrange polynomial. Consider quadrature for n = 1 (2 points) using the Lagrange polynomials. For n = 1 the points are x0 = a and x1 = b = a + h, and so we write

$$ L_0 = \frac{x-x_1}{x_0-x_1} = \frac{x-b}{a-b} = \frac{1}{h}(b-x) \qquad (25.67) $$

and

$$ L_1 = \frac{x-x_0}{x_1-x_0} = \frac{x-a}{b-a} = \frac{1}{h}(x-a) \qquad (25.68) $$

The error is

$$ R = \frac{f''(c)}{2!}(x-a)(x-b) \qquad (25.69) $$

Hence

$$ f(x) \approx P(x) = L_0(x)f_0 + L_1(x)f_1 = \frac{f_0}{h}(b-x) + \frac{f_1}{h}(x-a) \qquad (25.70) $$
from which we can calculate the integral

$$ \int_a^b f(x)\,dx \approx \int_a^b \left[P(x) + R(x)\right] dx \qquad (25.71) $$
$$ = \frac{f_0}{h}\int_a^b (b-x)\,dx + \frac{f_1}{h}\int_a^b (x-a)\,dx + \frac{f''(c)}{2}\int_a^b \left(x^2 - (a+b)x + ab\right) dx \qquad (25.72) $$
$$ = \frac{f_0}{h}\left[bx - \frac{1}{2}x^2\right]_a^b + \frac{f_1}{h}\left[\frac{1}{2}x^2 - ax\right]_a^b + \frac{f''(c)}{2}\left[\frac{1}{3}x^3 - \frac{1}{2}(a+b)x^2 + abx\right]_a^b \qquad (25.73) $$
$$ = \frac{f_0}{h}\left[b(b-a) - \frac{1}{2}(b^2-a^2)\right] + \frac{f_1}{h}\left[\frac{1}{2}(b^2-a^2) - a(b-a)\right] + \frac{f''(c)}{2}\left[\frac{1}{3}(b^3-a^3) - \frac{1}{2}(a+b)(b^2-a^2) + ab(b-a)\right] \qquad (25.74) $$
Applying the algebraic relations
$$ b - a = h \qquad (25.75) $$
$$ b^2 - a^2 = (b-a)(b+a) = h(b+a) \qquad (25.76) $$
$$ b^3 - a^3 = (b-a)(b^2+ab+a^2) = h(b^2+ab+a^2) \qquad (25.77) $$


gives

$$ \int_a^b f(x)\,dx = \frac{f_0}{h}\left[bh - \frac{h}{2}(b+a)\right] + \frac{f_1}{h}\left[\frac{h}{2}(b+a) - ah\right] \qquad (25.78) $$
$$ \qquad + \frac{f''(c)}{2}\left[\frac{h}{3}(b^2+ab+a^2) - \frac{h}{2}(a+b)^2 + abh\right] \qquad (25.79) $$
$$ = \frac{b-a}{2}f_0 + \frac{b-a}{2}f_1 + \frac{hf''(c)}{12}\left[2b^2 + 2ab + 2a^2 - 3a^2 - 6ab - 3b^2 + 6ab\right] \qquad (25.80) $$
$$ = \frac{h}{2}\left[f_0 + f_1\right] + \frac{hf''(c)}{12}\left[-b^2 + 2ab - a^2\right] \qquad (25.81) $$
$$ = \frac{h}{2}\left[f_0 + f_1\right] - \frac{h^3 f''(c)}{12} \qquad (25.82) $$
This gives the Trapezoidal Rule with Remainder:

$$ \int_a^b f(x)\,dx = \frac{h}{2}\left[f_0 + f_1\right] - \frac{h^3 f''(c)}{12} \qquad (25.83) $$

We can then obtain a composite quadrature rule by applying this formula on each pair of successive grid points [x_i, x_{i+1}]:

$$ \int_{x_i}^{x_{i+1}} f(x)\,dx = \frac{h}{2}\left[f_i + f_{i+1}\right] - \frac{h^3 f''(c_i)}{12} \qquad (25.84) $$

we get
$$ \int_a^b f(x)\,dx = \sum_{i=0}^{n-1} \int_{x_i}^{x_{i+1}} f(x)\,dx \qquad (25.85) $$
$$ = \sum_{i=0}^{n-1} \left[\frac{h}{2}\left[f_i + f_{i+1}\right] - \frac{h^3 f''(c_i)}{12}\right] \qquad (25.86) $$
$$ = \frac{h}{2}\sum_{i=0}^{n-1}\left[f_i + f_{i+1}\right] - \frac{h^3}{12}\sum_{i=0}^{n-1} f''(c_i) \qquad (25.87) $$

By the intermediate value theorem there is some number µ ∈ [a, b] such that f″(µ) is the average

$$ f''(\mu) = \frac{1}{n}\left[f''(c_0) + f''(c_1) + \cdots + f''(c_{n-1})\right] \qquad (25.88) $$
and therefore


$$ \int_a^b f(x)\,dx = \frac{h}{2}\left[(f_0 + f_1 + \cdots + f_{n-1}) + (f_1 + f_2 + \cdots + f_n)\right] - \frac{h^3}{12}nf''(\mu) \qquad (25.89) $$

Substituting nh = b − a we arrive at the Composite Trapezoidal Rule with Remainder:

$$ \int_a^b f(x)\,dx = \frac{h}{2}\left[f_0 + 2f_1 + 2f_2 + \cdots + 2f_{n-1} + f_n\right] - \frac{h^2(b-a)}{12}f''(\mu) \qquad (25.90) $$

If we repeat the same process at three points x0, x1, and x2 we end up with Simpson's Rule with Remainder:

$$ \int_{x_0}^{x_2} f(x)\,dx = \frac{h}{3}\left[f_0 + 4f_1 + f_2\right] - \frac{h^5}{90}f^{(4)}(c) \qquad (25.91) $$
The corresponding Composite Simpson's Rule with Remainder is

$$ \int_a^b f(x)\,dx = \frac{h}{3}\left[f_0 + 4f_1 + 2f_2 + \cdots + 2f_{n-2} + 4f_{n-1} + f_n\right] - \frac{b-a}{180}h^4 f^{(4)}(\mu) \qquad (25.92) $$
If we use four points x0, ..., x3 then we obtain Simpson's Three-Eighths Rule:

$$ \int_{x_0}^{x_3} f(x)\,dx = \frac{3h}{8}\left[f_0 + 3f_1 + 3f_2 + f_3\right] - \frac{3h^5}{80}f^{(4)}(c) \qquad (25.93) $$
Using five points x0, ..., x4 gives

$$ \int_{x_0}^{x_4} f(x)\,dx = \frac{2h}{45}\left[7f_0 + 32f_1 + 12f_2 + 32f_3 + 7f_4\right] - \frac{8h^7}{945}f^{(6)}(c) \qquad (25.94) $$
The procedures we have outlined above are called the Closed Newton-Cotes Technique. The general method for deriving an integration formula is as follows. Suppose you know the function values f0, f1, ..., fn at n + 1 equally spaced points a = x0 < x1 < ··· < xn = b, where xi = xi−1 + h. Then

$$ \int_a^b f(x)\,dx \approx \sum_{i=0}^{n} a_i f_i + R \qquad (25.95) $$

where

$$ a_i = \int_a^b L_i(x)\,dx = \int_a^b \prod_{k=0,\,k\neq i}^{n} \frac{x - x_k}{x_i - x_k}\,dx \qquad (25.96) $$


$$ R = \begin{cases} \dfrac{h^{n+3} f^{(n+2)}(c)}{(n+2)!} \displaystyle\int_0^n t^2(t-1)(t-2)\cdots(t-n)\,dt, & n \text{ even} \\[2ex] \dfrac{h^{n+2} f^{(n+1)}(c)}{(n+1)!} \displaystyle\int_0^n t(t-1)(t-2)\cdots(t-n)\,dt, & n \text{ odd} \end{cases} \qquad (25.97) $$
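Equation 25.96 can be evaluated symbolically to recover the coefficients of any closed Newton-Cotes formula. Here is a minimal Mathematica sketch (the name NewtonCotesWeights is ours); it works on the normalized interval [0, n] with unit spacing, so the returned weights must still be multiplied by h:

    (* Equation 25.96: integrate the Lagrange basis polynomials over the
       normalized grid 0, 1, ..., n to obtain the weights a_i / h. *)
    NewtonCotesWeights[n_] :=
     Module[{x},
      Table[Integrate[
         Product[If[k == i, 1, (x - k)/(i - k)], {k, 0, n}],
         {x, 0, n}], {i, 0, n}]
      ]

For example, NewtonCotesWeights[2] returns {1/3, 4/3, 1/3}, the coefficients of Simpson's rule, and NewtonCotesWeights[3] returns {3/8, 9/8, 9/8, 3/8}, the coefficients of the three-eighths rule.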

A similar technique, called the Open Newton-Cotes Technique, does not include the endpoints in the polynomial interpolation. Renumber the grid points so that we have n + 3 grid points at a = x−1 < x0 < x1 < ··· < xn < xn+1 = b. Then equations 25.95 and 25.96 still hold; the modified remainder formula is

$$ R = \begin{cases} \dfrac{h^{n+3} f^{(n+2)}(c)}{(n+2)!} \displaystyle\int_{-1}^{n+1} t^2(t-1)(t-2)\cdots(t-n)\,dt, & n \text{ even} \\[2ex] \dfrac{h^{n+2} f^{(n+1)}(c)}{(n+1)!} \displaystyle\int_{-1}^{n+1} t(t-1)(t-2)\cdots(t-n)\,dt, & n \text{ odd} \end{cases} \qquad (25.98) $$

The open Newton-Cotes method with n = 0 gives the Midpoint Rule


$$ \int_{x_{-1}}^{x_1} f(x)\,dx = 2hf_0 + \frac{h^3}{3}f''(c) \qquad (25.99) $$

The open Newton-Cotes method with n = 1 gives


$$ \int_{x_{-1}}^{x_2} f(x)\,dx = \frac{3h}{2}\left[f_0 + f_1\right] + \frac{3h^3}{4}f''(c) \qquad (25.100) $$

The open Newton-Cotes method with n = 2 gives

$$ \int_{x_{-1}}^{x_3} f(x)\,dx = \frac{4h}{3}\left[2f_0 - f_1 + 2f_2\right] + \frac{14h^5}{45}f^{(4)}(c) \qquad (25.101) $$

The open Newton-Cotes method with n = 3 gives


$$ \int_{x_{-1}}^{x_4} f(x)\,dx = \frac{5h}{24}\left[11f_0 + f_1 + f_2 + 11f_3\right] + \frac{95h^5}{144}f^{(4)}(c) \qquad (25.102) $$

Lesson 26

Theory of Differential Equations

Definition 26.1 (Ordinary Differential Equation, ODE). Let y ∈ R be a variable¹ that depends on t. Then we define a differential equation as any equation of the form

$$ F(t, y, y') = 0 \qquad (26.1) $$

where F is any function of the 3 variables t, y, y′.

Definition 26.2. We will call any function φ(t) that satisfies

$$ F(t, \phi(t), \phi'(t)) = 0 \qquad (26.2) $$

a solution of the differential equation.

We will use the terms "Ordinary Differential Equation" and "Differential Equation", as well as the abbreviations ODE and DE, interchangeably. More generally, one can include partial derivatives in the definition, in which case one must distinguish between "Partial" DEs (PDEs) and "Ordinary" DEs (ODEs). We will leave the study of PDEs to another class.

Equation 26.1 is, in general, very difficult, and often impossible, to solve, either analytically (e.g., by finding a formula that describes y) or numerically (e.g., by using a computer to draw a picture of the graph of the solution). Often it is possible to solve 26.1 explicitly for the derivative:

$$ y' = f(t, y) \qquad (26.3) $$

Many important problems can be put into this form, and solutions are known to exist for a wide class of functions, particularly as a result of theorem 26.2. The class of problems in which equation 26.1 can be converted to the form 26.3, at least locally,

¹In general, this theory can be extended to higher dimensions, where y ∈ Rⁿ; all of the same results hold.


is not seriously restrictive from a practical point of view. The only requirements are that F be sufficiently smooth² and that the matrix of partial derivatives ∂F/∂y′ (the Jacobian matrix) be nonsingular³ (∂F/∂y′ ≠ 0 for a scalar equation). Then by the implicit function theorem we can solve for y′ locally. An equation of the form 26.1 for which the Jacobian is nonsingular is thus called an ordinary differential equation, and we will focus on equations of this form in the first several chapters of these notes. It turns out that an equation for which the Jacobian is singular actually has hidden constraints: it is really a combination of differential equations and algebraic constraints, and is called a differential algebraic equation.

Theorem 26.1 (Implicit Function Theorem on R).⁴ Let F(t, y) have continuous derivatives ∂F/∂t and ∂F/∂y in a neighborhood of a point (t0, y0), where

$$ F(t_0, y_0) = 0, \qquad \frac{\partial F(t_0, y_0)}{\partial y} \neq 0 \qquad (26.4) $$

Then there are intervals I and J, where

$$ I = [t_0 - a,\ t_0 + a] \qquad (26.5) $$
$$ J = [y_0 - b,\ y_0 + b] \qquad (26.6) $$

and a rectangle R = I × J, such that the equation F(t, y) = 0 has precisely one solution y = f(t) lying in the rectangle R, such that

$$ F(t, f(t)) = 0 \qquad (26.7) $$
$$ f(t) \in J \qquad (26.8) $$
$$ F_y(t, f(t)) \neq 0 \qquad (26.9) $$

for all t ∈ I.
Example 26.1. Solve y′ = y.
Solution. Writing y′ = dy/dt and integrating we find

$$ \int \frac{1}{y}\,dy = \int dt \qquad (26.10) $$
$$ \ln|y| = t + C \qquad (26.11) $$
$$ y = Ke^t \qquad (26.12) $$

where K = ±e^C. There is no restriction on the values of either C or K.
²Throughout these notes we will assume that F is sufficiently smooth without explicitly stating so. By "sufficiently smooth" we mean that F is continuously differentiable enough times to give us the results we want.
³Strictly speaking, nonsingularity is not really a requirement. Nonsingularity is sufficient to ensure that a solution exists, but is not required. There are examples of functions with singularities at points but for which solutions may exist.
⁴For a proof, see Richard Courant and Fritz John, Introduction to Calculus and Analysis, Volume II/1, Springer Classics in Mathematics, 1998, page 225.


Thus we see that equation 26.1 (or 26.3) will often admit an infinite number of solutions owing to arbitrary constants of integration that arise during its solution. For example 26.1 this is illustrated in figure 26.1, which shows the one parameter family of solutions to the example. A particular physical problem may only correspond to one member of this family. To pin down this constant, the problem must be further constrained. Such a constraint can take various forms. The nature of the constraint can have an enormous impact on our ability to solve the equation.

Figure 26.1: One parameter family of solutions to y 0 = y, showing the solutions for
various values of the constant of integration.


Equation 26.3 has an intuitive interpretation as the description of a dynamical system: given any starting point y0, the subsequent "motion" or "time-evolution" of y is given by equation 26.3. By adding an initial condition, that is, by specifying a point that the solution passes through, we obtain an initial value problem (IVP).
Definition 26.3 (Initial Value Problem). Let D ⊂ Rⁿ⁺¹ be a set. Let (t0, y0) ∈ D and suppose that f(t, y) : D ↦ Rⁿ. Then

$$ y' = f(t, y) \qquad (26.13) $$
$$ y(t_0) = y_0 \qquad (26.14) $$


is called an initial value problem. The constraint 26.14 is called an initial con-
dition.

Figure 26.2: Illustration of a differential equation as a dynamical system. Given any


starting point an object moves as described by the differential equation. The curve
on the left shows the coordinates of an object at several time points. On the right,
the points are annotated with the direction of motion, an arrow whose direction is
specified by the components of the differential equation.


Example 26.2. Solve the initial value problem y′ = (3 − y)/2, y(2π) = 4.

Solution. We can rearrange variables as

$$ \frac{2\,dy}{3-y} = dt \qquad (26.15) $$

and integrate to obtain

$$ -t = 2\ln|y-3| + C \qquad (26.16) $$

Substituting the initial condition gives

$$ C = -2\pi - 2\ln|4-3| = -2\pi \qquad (26.17) $$

which gives

$$ |y-3| = e^{\pi - t/2} \qquad (26.18) $$

Thus either y = 3 + e^{π−t/2} or y = 3 − e^{π−t/2}. At the initial condition, however, y(2π) = 4, which is only obtained with the plus sign in the solution. Hence

$$ y = 3 + e^{\pi - t/2} \qquad (26.19) $$

is the unique solution. The solution of the initial value problem is plotted in figure 26.3 in comparison with the one-parameter family.
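As a quick check, this initial value problem can also be handed to a computer algebra system; in Mathematica, for instance, the built-in DSolve handles it directly:

    (* Checking example 26.2 symbolically. *)
    DSolve[{y'[t] == (3 - y[t])/2, y[2 Pi] == 4}, y[t], t]
    (* should return {{y[t] -> 3 + E^(Pi - t/2)}}, matching equation 26.19 *)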


Figure 26.3: The one-parameter family of solutions for y′ = (3 − y)/2 for different values of the constant of integration, and the solution to the initial value problem (heavy line) through (t0, y0) = (2π, 4). The initial condition is indicated by the large gray dot.

We will say that an initial value problem is well posed if it meets the following criteria:

• A solution exists;
• The solution is unique;
• The solution depends continuously on the data.

If a problem is not well posed then there is no point in trying to solve it numerically, so we begin our study of initial value problems by looking at what it takes to make a problem well posed. We will find that a Lipschitz condition, defined below in definition 26.4, is sufficient to ensure that the problem is well posed.
The importance (and usefulness) of initial value problems is enhanced by a general
existence theorem and the fact that under appropriate conditions (namely, a Lipschitz
Condition) the solution is unique. While we will defer the proof of this statement
until later, we will present one of many different versions of the fundamental existence
theorem.

Definition 26.4 (Lipschitz Condition). A function f(t, y) on D is said to be Lipschitz (or Lipschitz continuous, or to satisfy a Lipschitz condition) in y if there exists some constant K > 0 such that for all (x, y1), (x, y2) ∈ D,

$$ |f(x, y_1) - f(x, y_2)| \leq K|y_1 - y_2| \qquad (26.20) $$


The constant K is called a Lipschitz Constant for f . We will sometimes denote


this as f ∈ L(y; K)(D).

Theorem 26.2 (Fundamental Existence and Uniqueness Theorem). Suppose that f(t, y) ∈ L(y; K)(R) for some convex domain R. Then for any point (t0, y0) ∈ R there exists a neighborhood N of (t0, y0) and a unique differentiable function φ(t) on N satisfying

$$ \phi'(t) = f(t, \phi(t)) \qquad (26.21) $$

such that φ(t0) = y0.

The existence theorem is illustrated in figure 26.4. Given any initial value, there
is some solution that passes through the point. Observe that the existence of the
solution is not guaranteed globally, only within some open neighborhood of the initial
condition.

Figure 26.4: Illustration of the existence of a solution.


Theorem 26.3 (Continuous dependence on IC). Under the same conditions, the
solution depends continuously on the initial data, i.e., if ỹ is a solution satisfying the
same ODE with ỹ(t0 ) = ỹ0 , then

$$ |y(t) - \tilde{y}(t)| \leq e^{Kt}|y_0 - \tilde{y}_0| \qquad (26.22) $$


Theorem 26.4 (Perturbed Equation). Under the same conditions, suppose that ỹ is a solution of the perturbed ODE,

$$ \tilde{y}' = f(t, \tilde{y}) + r(t, \tilde{y}) \qquad (26.23) $$

where r is bounded on D, i.e., there exists some M > 0 such that |r(t)| ≤ M on D. Then

$$ |y(t) - \tilde{y}(t)| \leq e^{Kt}|y_0 - \tilde{y}_0| + \frac{M}{K}\left(e^{Kt} - 1\right) \qquad (26.24) $$
Proving that a function is Lipschitz is considerably eased by the following theorem.

Theorem 26.5. Suppose that |∂f/∂y| is bounded by K on a set D. Then f(t, y) ∈ L(y; K)(D).

Proof. The result follows immediately from the mean value theorem. Let (t, y1), (t, y2) ∈ D. Then there is some number c between y1 and y2 such that

$$ |f(t, y_1) - f(t, y_2)| = |f_y(t, c)|\,|y_1 - y_2| \leq K|y_1 - y_2| \qquad (26.25) $$

Hence f is Lipschitz in y on D.

Example 26.3. Show that a unique solution exists to the initial value problem

$$ y' = \sin(ty), \qquad y(0) = 0.5 \qquad (26.26) $$

Solution. We have f(t, y) = sin(ty), hence f_y = t cos(ty). Thus |f_y| ≤ |t|, which is bounded for any finite range of t. Let R be a bounded, convex set enclosing (0, 0.5), and let

$$ K = 1 + \sup_{t \in R} |t| \qquad (26.27) $$

Since R is bounded we know that the supremum exists. By adding 1 we ensure that we have a number that is strictly larger than the maximum value of |t|. Then K is a Lipschitz constant for f and hence a unique solution exists in some neighborhood N of (0, 0.5). See figure 26.5.
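The derivative bound used here is simple enough to verify with a one-line symbolic computation; in Mathematica, for instance:

    (* The partial derivative used in example 26.3. *)
    D[Sin[t y], y]
    (* returns t Cos[t y]; its absolute value is bounded by |t| *)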

Example 26.4. Analyze the uniqueness of solutions to

$$ y' = \sqrt{4 - y^2} \qquad (26.28) $$
$$ y(0) = 2 \qquad (26.29) $$

Solution. Finding a "solution" is easy enough. We can separate the variables and integrate. It is easily verified (by direct substitution) that

$$ y = 2\sin\left(t + \frac{\pi}{2}\right) \qquad (26.30) $$

Figure 26.5: A solution exists in some neighborhood N of (0, 0.5). See Example 26.3

satisfies both the differential equation and the initial condition, hence it is a solution. It is also easily verified that y = 2 is a solution, as are functions of the form

$$ y = \begin{cases} 2\sin\left(t + \frac{\pi}{2}\right) & t < 0 \\ 2 & 0 \leq t \leq \phi \\ 2\sin\left(t + \frac{\pi}{2} - \phi\right) & t > \phi \end{cases} \qquad (26.31) $$

for any positive real number φ. See Figure 26.6.

Since the solution is not unique, any condition that guarantees uniqueness must be violated. We have two such conditions: the boundedness of the partial derivative, and the Lipschitz condition. The first implies the second, and the second implies uniqueness. But

$$ \frac{\partial f}{\partial y} = \frac{-y}{\sqrt{4 - y^2}} \qquad (26.32) $$

is unbounded at y = 2, so the first condition is violated. Of course, a violation of the condition does not ensure non-uniqueness; all it tells us is that uniqueness is not ensured.


Figure 26.6: There are several solutions to y′ = √(4 − y²) that pass through the point (0, 2). See Example 26.4.

p
What about the Lipschitz condition? Suppose that the function f (x) = (4 − y 2 )
is Lipschitz with Lipschitz constant K > 0 on some domain D. Then for any y1 , y2
in D,
q q

2 2
K|y1 − y2 | ≥ |f (y1 , y2 )| = 4 − y1 − 4 − y2
(26.33)

Let y2 = 2 and y1 = 2 − ε for some small number ε. Then

$$ K|\epsilon| \geq \sqrt{4 - (2-\epsilon)^2} \qquad (26.34) $$
$$ = \sqrt{4\epsilon - \epsilon^2} \qquad (26.35) $$
$$ K^2\epsilon^2 \geq 4\epsilon - \epsilon^2 \qquad (26.36) $$
$$ (K^2 + 1)\epsilon^2 \geq 4\epsilon \qquad (26.37) $$
$$ K^2 + 1 \geq \frac{4}{\epsilon} \qquad (26.38) $$
$$ K^2 \geq \frac{4}{\epsilon} - 1 \qquad (26.39) $$
But since we can choose ε to be arbitrarily small, the right hand side of the inequality can be made arbitrarily large. But then K cannot be a finite


number. So f(t, y) is not Lipschitz, either. Again, this does not guarantee non-uniqueness; it just tells us that uniqueness is not guaranteed.

Lesson 27

Method of Successive
Approximations

The Method of Successive Approximations or Picard Iteration takes the initial


value problem

$$ y' = f(t, y) \qquad (27.1) $$
$$ y(t_0) = y_0 \qquad (27.2) $$

through a sequence of recursive iterations. If φ(t) is a solution to equations (27.1, 27.2) then φ(t) must also be a solution to the integral equation

$$ \phi(t) = y_0 + \int_{t_0}^{t} f(s, \phi(s))\,ds \qquad (27.3) $$

At t = t0 we have φ(t0) = y0. If h = t − t0 is sufficiently small then a reasonable first approximation to φ(t) is

$$ \phi_0(t) = y_0 + (\text{error terms}) \qquad (27.4) $$

If we substitute φ0 from equation 27.4 for φ(s) in the integral equation 27.3, we get a second approximation

$$ \phi_1(t) = y_0 + \int_{t_0}^{t} f(s, \phi_0)\,ds \qquad (27.5) $$

Better guesses are generated by repeated substitutions:

$$ \phi_2(t) = y_0 + \int_{t_0}^{t} f(s, \phi_1(s))\,ds \qquad (27.6) $$
$$ \phi_3(t) = y_0 + \int_{t_0}^{t} f(s, \phi_2(s))\,ds \qquad (27.7) $$
$$ \vdots $$


It turns out that the sequence of functions φ0, φ1, φ2, ... converges to a function φ, where φ is a solution of the initial value problem, provided f(t, y) satisfies a Lipschitz condition in the second variable (see Definition 26.4). The algorithm is summarized below in Algorithm 27.1.

Algorithm 27.1 (Picard Iteration). To solve the initial value problem

$$ y' = f(t, y), \quad y(t_0) = y_0 $$

for the function y(t):

1. input: f(t, y), t0, y0, nmax
2. let φ0 = y0
3. for i = 0, 1, 2, ..., nmax − 1, let
$$ \phi_{i+1}(t) = y_0 + \int_{t_0}^{t} f(s, \phi_i(s))\,ds $$
4. output: φnmax(t)

We will show in this chapter that when f is Lipschitz in y, algorithm 27.1 converges to the unique solution of equation 27.1. Technically speaking, however, Picard iteration¹ does not guarantee a solution to any specific accuracy except in the limit as n → ∞. Thus it is usually quite impractical. Nevertheless it has the advantage that it is easily implemented in a computer algebra system, and will sometimes yield useful results.
Example 27.1. Solve y′ = y, y(0) = 1 using Picard iteration.
Solution. Since f(t, y) = y, t0 = 0, y0 = 1, we have the following:

$$ \phi_0 = 1 + \int_0^t ds = 1 + t \qquad (27.8) $$
$$ \phi_1 = 1 + \int_0^t (1+s)\,ds = 1 + t + \frac{t^2}{2} \qquad (27.9) $$
$$ \phi_2 = 1 + \int_0^t \left(1 + s + \frac{s^2}{2}\right) ds \qquad (27.10) $$
$$ = 1 + t + \frac{t^2}{2} + \frac{t^3}{3!} \qquad (27.11) $$
¹The method bears the name of Charles Emile Picard (1856-1941), who popularized the technique and published it in 1890, but gave credit to Hermann Schwartz. Giuseppe Peano in 1887, Ernst Leonard Lindelöf in 1890, and G. von Escherich in 1899 also published existence proofs based on this technique. Hartman claims that both Liouville and Cauchy were aware of this method. Schwartz, for his part, outlined the technique in a Festschrift honoring Karl Weierstrass' 70th birthday in 1885.


We begin to see a pattern that suggests to us that

$$ \phi_n = \sum_{k=0}^{n+1} \frac{t^k}{k!} \qquad (27.12) $$

We can check this by induction. It certainly holds for n = 1. For the inductive step, assume equation 27.12 and solve for φ_{n+1}:

$$ \phi_{n+1} = 1 + \int_0^t \sum_{k=0}^{n+1} \frac{s^k}{k!}\,ds \qquad (27.13) $$
$$ = 1 + \sum_{k=0}^{n+1} \frac{t^{k+1}}{(k+1)!} \qquad (27.14) $$

Making a change of index j = k + 1, we have

$$ \phi_{n+1} = 1 + \sum_{j=1}^{n+2} \frac{t^j}{j!} = \sum_{j=0}^{n+2} \frac{t^j}{j!} \qquad (27.15) $$

which is exactly what equation 27.12 gives for φ_{n+1}. Hence by the convergence theorem (Theorem 27.3), the corresponding infinite series converges to the actual solution of the IVP:

$$ \phi(t) = \sum_{k=0}^{\infty} \frac{t^k}{k!} = e^t \qquad (27.16) $$

where the last step follows from Taylor's theorem.
Picard iteration is quite easy to implement in Mathematica; here is one possible implementation that will print out the first n iterations of the algorithm.

Picard[f_, t_, t0_, y0_, n_] :=
 Module[{i, y = y0},
  Print[Subscript["φ", 0], "=", y0];
  For[i = 0, i < n, i++,
   ynext = y0 + Integrate[f[s, y /. {t -> s}], {s, t0, t}];  (* equation 27.5 *)
   y = ynext;
   Print[Subscript["φ", i + 1], "=", y];
   ];
  Return[Expand[y]]
  ]

Function Picard has five arguments (f, t, t0, y0, n) and two local variables (i,
y)


Picard[f_, t_, t0_, y0_, n_] :=
 Module[{i, y = y0},
  ...
  ]

The local variable y is initialized to the value of the parameter y0 in the list of
variable declarations. This is equivalent to initializing the value of the variable in the
first line of the program. The first line of the program prints the initial iteration as
φ0 =value of parameter y0 ,
Print[Subscript["φ", 0], "=", y0];
The output will be displayed on the console in an “output cell.” The next line of the
program is a For loop. A For statement takes on four arguments:
For[initialization,
test,
increment,
statement;
..
.
statement;
]
The For loop takes the following actions:
1. The initialization statement (or sequence of statements) is executed;
2. The test is evaluated. If it evaluates to False then the rest of the For is
ignored.
3. Each of the statements is evaluated in sequence.
4. The increment statement is evaluated.
5. Steps (2) through (4) are repeated until test is False.
In our program, we have a counter i that is initially set equal to zero; then the con-
tents of the For are executed only so long as i < n; and the value of i is incremented
by 1 on each iteration. Hence the loop will execute n times. Within the loop three
statements are executed on each iteration:
For[i = 0, i < n, i++,
 ynext = y0 + Integrate[f[s, y /. {t -> s}], {s, t0, t}];
 y = ynext;
 Print[Subscript["φ", i + 1], "=", y];
 ];


There are two important variables used in this loop: y and ynext. At the start of each iteration, y refers to the value of the previous iteration φ_{i−1}, while at the end of each iteration (because of the statement y=ynext) it refers to the current iteration φ_i. In the first line of the iteration the next iterate after φ_{i−1}, namely φ_i, is calculated and saved in ynext. The value depends on the integral

$$ \int_{t_0}^{t} f(s, \phi_{i-1}(s))\,ds $$

But φ_{i−1}(s) is represented by the value of y at this point. Unfortunately, the expression for y depends upon t, and we need to integrate over s and not t. So to get the right variable in the expression for f(s, φ_{i−1}(s)) we need to replace t everywhere by s. We do that with the expression

y/.{t->s}

which means, quite literally: take the expression for y, and everywhere that a t appears in it, replace the t with an s. To perform this substitution inside the integral only, we do the following:

ynext = y0 + Integrate[f[s, y /. {t -> s}], {s, t0, t}];
So then ynext (φi ) is calculated and saved as y, and the results of the current iteration
are printed on the console. The final line of the program returns the value of the final
iteration in expanded form, namely, with all multiplications and factoring expanded
out:
Return[Expand[y]]
To print the first 5 iterations of y′ = y cos t, y(0) = 1 using this function, one enters

g[tvariable_, yvariable_] := yvariable*Cos[tvariable];
Picard[g, t, 0, 1, 5];
which prints
φ0 = 1
φ1 = 1 + Sin[t]
φ2 = 1 + Sin[t] + 1/2 Sin[t]^2
φ3 = 1 + Sin[t] + 1/2 Sin[t]^2 + 1/6 Sin[t]^3
φ4 = 1 + Sin[t] + 1/2 Sin[t]^2 + 1/6 Sin[t]^3 + 1/24 Sin[t]^4
φ5 = 1 + Sin[t] + 1/2 Sin[t]^2 + 1/6 Sin[t]^3 + 1/24 Sin[t]^4 + 1/120 Sin[t]^5


and returns the value

1 + Sin[t] + 1/2 Sin[t]^2 + 1/6 Sin[t]^3 + 1/24 Sin[t]^4 + 1/120 Sin[t]^5

It appears that the sequence is converging to the series

$$ \phi(t) = \sum_{k=0}^{\infty} \frac{\sin^k(t)}{k!} = e^{\sin t} $$

It is easily verified by separation of variables or direct substitution that this is, in fact, the correct solution.
The gist of the proof of Algorithm 27.1 is that it is a form of fixed point iteration. We recall from chapter 9 that a function f : R ↦ R has a fixed point if and only if its graph intersects the line y = x. If there are multiple intersections, then there are multiple fixed points. Consequently a sufficient condition is that the range of f is contained in its domain. We first recall some basic theorems from our earlier study of fixed point theory.
Definition 27.1. Fixed Point. A number a is called a fixed point of the function f
if f (a) = a.
Theorem 27.1 (Sufficient condition for a fixed point). Suppose that f(t) is a continuous function that maps its domain into a subset of itself, i.e.,

$$ f(t) : [a, b] \mapsto S \subset [a, b] \qquad (27.17) $$

Then f(t) has a fixed point in [a, b].


Theorem 27.2 (Condition for a unique fixed point). Suppose that

(a) f ∈ C[a, b] maps its domain into a subset of itself.
(b) There exists some K, 0 < K < 1, such that

$$ |f'(t)| \leq K, \quad \forall t \in [a, b] \qquad (27.18) $$

Then f(t) has a unique fixed point in [a, b].


Theorem 27.3 (Fixed point iteration). Under the same conditions as theorem 27.2, fixed point iteration converges.
Theorem 27.4. Under the same conditions as theorem 27.3 except that the condi-
tion of equation 27.18 is replaced with the following condition: f (t) is Lipschitz with
Lipschitz constant K < 1. Then fixed point iteration converges.
To prove that Picard iteration converges we need to generalize the concept of a fixed point of a function to general vector spaces.


Definition 27.2 (Vector Space). A vector space V is a set that is closed under
two operations that we call addition and scalar multiplication such that the following
properties hold:
Closure For all vectors u, v ∈ V, and for all a ∈ R,

u+v ∈V (27.19)

av ∈ V (27.20)

Commutativity of Vector Addition For all u, v ∈ V,

u+v =v+u (27.21)

Associativity of Vector Addition For all u, v, w ∈ V,

u + (v + w) = (u + v) + w (27.22)

Identity for Addition There is some element 0 ∈ V such that for all v ∈ V

0+v =v+0=v (27.23)

Inverse for Addition For each v ∈ V there is a vector −v ∈ V such that

v + (−v) = (−v) + v = 0 (27.24)

Associativity of Scalar multiplication For all v ∈ V and for all a, b ∈ R,

a(bv) = (ab)v (27.25)

Distributivity For all a, b ∈ R and for all u, v ∈ V,

(a + b)v = av + bv (27.26)

a(u + v) = au + av (27.27)

Identity for Scalar Multiplication For all vectors v ∈ V,

1v = v (27.28)

Example 27.2. The usual Cartesian vector space to which we are accustomed is a vector space with vectors being defined as ordered triples of coordinates ⟨x, y, z⟩.
Example 27.3. Show that the set F[a, b] of all integrable functions f : [a, b] ↦ R is a vector space.


Solution. Let f, g, h ∈ F[a, b] and c, d ∈ R. Then:

• V is closed: Let p(t) = f(t) + g(t) and q(t) = ch(t). Then p, q : [a, b] ↦ R, hence p, q ∈ F[a, b].
• f(t) + g(t) = g(t) + f(t), so commutativity holds.
• (f(t) + g(t)) + h(t) = f(t) + (g(t) + h(t)) and c(df(t)) = (cd)f(t), so both associative properties hold.
• The function f(t) = 0 is an additive identity.
• For any function f(t) the function −f(t) is an additive inverse.
• (c + d)f(t) = cf(t) + df(t) and c(f(t) + g(t)) = cf(t) + cg(t), so both distributive properties hold.
• The number 1 acts as an identity for scalar multiplication.

Hence the set F[a, b] is a vector space.

Definition 27.3 (Norm). A norm ‖·‖ : V ↦ R on a vector space V is a function mapping the vector space to the real numbers such that

1. For all v ∈ V, ‖v‖ ≥ 0.
2. ‖v‖ = 0 if and only if v = 0.
3. For all v ∈ V and for all a ∈ R, ‖av‖ = |a| ‖v‖.
4. The norm satisfies a triangle inequality: for all v, w ∈ V,

$$ \|v + w\| \leq \|v\| + \|w\| \qquad (27.29) $$

Definition 27.4 (Normed Vector Space). A vector space on which a norm has
been defined is a normed vector space.

Example 27.4. Let V be ordinary Euclidean space and v = ⟨x, y, z⟩ a vector in V. Then we can define many different norms on this space:

Taxicab (Manhattan, City Block) Norm: the L¹ norm is ‖v‖₁ = |x| + |y| + |z|
Euclidean Distance Function: the L² norm is ‖v‖₂ = √(x² + y² + z²)
p-norm: the Lᵖ norm is ‖v‖_p = (|x|ᵖ + |y|ᵖ + |z|ᵖ)^{1/p} for p ∈ Z⁺
sup-norm: the L∞ norm is ‖v‖∞ = sup(|x|, |y|, |z|)


Example 27.5. The following norms can be defined on the vector space F[a, b] of integrable functions on [a, b]:

$$ \text{L}^2\text{-norm:} \quad \|f\|_2 = \left(\int_a^b |f(x)|^2\,dx\right)^{1/2} $$
$$ \text{L}^p\text{-norm:} \quad \|f\|_p = \left(\int_a^b |f(x)|^p\,dx\right)^{1/p} $$
$$ \text{L}^\infty\text{ or sup-norm:} \quad \|f\| = \sup_{x \in [a,b]} |f(x)| $$
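These norms are easy to compute numerically for a concrete function. A minimal Mathematica sketch, using f(x) = x e^{−x} on [0, 1] purely as a stand-in example:

    (* Numerical L2 and sup norms of f(x) = x Exp[-x] on [0, 1]. *)
    f[x_] := x Exp[-x];
    l2norm  = Sqrt[NIntegrate[f[x]^2, {x, 0, 1}]];
    supnorm = First[NMaximize[{Abs[f[x]], 0 <= x <= 1}, x]];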

Definition 27.5 (Contraction). Let V be a normed vector space, S ⊂ V. Then a contraction is any mapping T : S ↦ V that satisfies

$$ \|T(f) - T(g)\| \leq K\|f - g\| \qquad (27.30) $$

for some K ∈ R, 0 < K < 1, for all f, g ∈ S. We will call the number K the contraction constant.

Lemma 27.1. Let T be a contraction on a complete normed vector space V with contraction constant K. Then for any g ∈ V,

$$ \|T^n g - g\| \leq \frac{1 - K^n}{1 - K}\|Tg - g\| \qquad (27.31) $$

Proof. Use induction. For n = 1, we have

$$ \|Tg - g\| \leq \frac{1 - K}{1 - K}\|Tg - g\| \qquad (27.32) $$
As our inductive hypothesis choose any n > 1 and suppose that equation 27.31 holds. Then by the triangle inequality

$$ \|T^{n+1}g - g\| \leq \|T^{n+1}g - T^n g\| + \|T^n g - g\| \qquad (27.33) $$
$$ \leq \|T^{n+1}g - T^n g\| + \frac{1 - K^n}{1 - K}\|Tg - g\| \qquad (27.34) $$
$$ \leq K^n \|Tg - g\| + \frac{1 - K^n}{1 - K}\|Tg - g\| \qquad (27.35) $$
$$ = \frac{(1-K)K^n + (1 - K^n)}{1 - K}\|Tg - g\| \qquad (27.36) $$
$$ = \frac{1 - K^{n+1}}{1 - K}\|Tg - g\| \qquad (27.37) $$

which proves the conjecture for n + 1.


Theorem 27.5 (Contraction Mapping Theorem²). Let T be a contraction on a complete normed vector space V. Then T has a unique fixed point h ∈ V such that T(h) = h. Furthermore, any sequence of functions g1, g2, ... defined by g_k = Tg_{k−1} converges to the unique fixed point Tg = g. We denote this by g_k → g.

Proof.³ Let ε > 0 be given. Then since Kⁿ/(1 − K) → 0 as n → ∞ (because T is a contraction, K < 1) it is possible to choose an integer N such that

$$ \frac{K^n \|Tg - g\|}{1 - K} < \epsilon \qquad (27.38) $$

Pick any two integers m ≥ n ≥ N, and define the sequence g0 = g, g_n = Tg_{n−1}. Then

$$ \|g_m - g_n\| = \|T^m g - T^n g\| \qquad (27.39) $$
$$ \leq K^n \|T^{m-n}g - g\| \qquad (27.40) $$
$$ \leq K^n \frac{1 - K^{m-n}}{1 - K}\|Tg - g\| \qquad (27.41) $$

by Lemma 27.1. Hence

$$ \|g_m - g_n\| \leq \frac{K^n - K^m}{1 - K}\|Tg - g\| \leq \frac{K^n}{1 - K}\|Tg - g\| < \epsilon \qquad (27.42) $$
Therefore g_n is a Cauchy sequence, and every Cauchy sequence in a complete normed vector space converges. Define f = lim_{n→∞} g_n. Then either f is a fixed point of T or it is not. Suppose that it is not a fixed point of T. Then Tf ≠ f and hence there exists some δ > 0 such that

$$ \|Tf - f\| > \delta \qquad (27.43) $$

On the other hand, because g_n → f, there exists an integer N such that for all n > N,

$$ \|g_n - f\| < \delta/2 \qquad (27.44) $$

Hence

$$ \|Tf - f\| \leq \|Tf - g_{n+1}\| + \|f - g_{n+1}\| \qquad (27.45) $$
$$ \leq K\|f - g_n\| + \|f - g_{n+1}\| \qquad (27.46) $$
$$ \leq \|f - g_n\| + \|f - g_{n+1}\| \qquad (27.47) $$
$$ < \frac{\delta}{2} + \frac{\delta}{2} \qquad (27.48) $$
$$ = \delta \qquad (27.49) $$
²The contraction mapping theorem is sometimes called the Banach Fixed Point Theorem.
³The proof follows "Proof of Banach Fixed Point Theorem," Encyclopedia of Mathematics (Volume 2, 54A20:2034), PlanetMath.org.


This is a contradiction, and hence f must be a fixed point of T.

To prove uniqueness, suppose that there is another fixed point h ≠ f. Then ‖h − f‖ > 0 (otherwise they would be equal). But

$$ \|h - f\| = \|Th - Tf\| \qquad (27.50) $$
$$ \leq K\|h - f\| \qquad (27.51) $$
$$ < \|h - f\| \qquad (27.52) $$

which is impossible, and hence a contradiction. Thus f is the unique fixed point of T.

We restate the fundamental existence theorem here for reference. While it is stated
in terms of the scalar problem, the vector problem is not fundamentally different, and
the proof is completely analogous.

Theorem 27.6 (Fundamental Existence Theorem). Let D ⊂ R² be convex and suppose that f is continuously differentiable on D. Then the initial value problem

$$ y' = f(t, y), \quad y(t_0) = y_0 \qquad (27.53) $$

has a unique solution φ(t), in the sense that φ′(t) = f(t, φ(t)) and φ(t0) = y0.

Proof. We begin by observing that φ is a solution of equation 27.53 if and only if it is a solution of

$$ \phi(t) = y_0 + \int_{t_0}^{t} f(x, \phi(x))\,dx \qquad (27.54) $$

Our goal will be to prove 27.54.

Let S be the set of all continuous integrable functions on an interval (a, b) that contains t0. Corresponding to any function φ ∈ S we can define the mapping T : S ↦ S as

$$ T[\phi] = y_0 + \int_{t_0}^{t} f(x, \phi(x))\,dx \qquad (27.55) $$

We will assume t > t0; the proof for t < t0 is completely analogous. Using the sup-norm on (a, b), we calculate that for any two functions g, h ∈ S,

$$ \|T[g] - T[h]\| = \left\|\int_{t_0}^{t} \left[f(x, g(x)) - f(x, h(x))\right] dx\right\| \qquad (27.56) $$
$$ \leq \sup_{a \leq t \leq b} \int_{t_0}^{t} \left|f(x, g(x)) - f(x, h(x))\right| dx \qquad (27.57) $$


Since f is differentiable it is Lipschitz in its second argument, hence

$$ \|T[g] - T[h]\| \leq K \sup_{a \leq t \leq b} \int_{t_0}^{t} |g(x) - h(x)|\,dx \qquad (27.58) $$
$$ \leq K(b-a) \sup_{a \leq t \leq b} |g(t) - h(t)| \qquad (27.59) $$
$$ = K(b-a) \|g - h\| \qquad (27.60) $$

where K is any number larger than sup_{(a,b)} f_y. If we choose the endpoints a and b such that |b − a| < 1/K, we have K|b − a| < 1. Thus T is a contraction. By the contraction mapping theorem it has a fixed point; call this point φ. Equation 27.54 follows immediately.
Theorem 27.7 (Error Bounds on Picard Iteration). Under the same conditions as before, let φ_n be the nth Picard iterate, and let φ be the solution of the IVP. Then

$$ |\phi(t) - \phi_n(t)| \leq \frac{M|K(t - t_0)|^{n+1}}{K(n+1)!} e^{K|t - t_0|} \qquad (27.61) $$

where M = sup_D |f(t, y)| and K is a Lipschitz constant. Furthermore, if L = |b − a| then

$$ \|\phi(t) - \phi_n(t)\| \leq \frac{M(KL)^{n+1} e^{KL}}{K(n+1)!} \qquad (27.62) $$

where ‖·‖ denotes the sup-norm.
Proof. We begin by proving the conjecture

$$ |\phi_n - \phi_{n-1}| \leq \frac{K^{n-1}M}{n!}|t - t_0|^n \qquad (27.63) $$

For n = 1, equation 27.63 says that

$$ |\phi_1 - y_0| \leq M|t - t_0| \qquad (27.64) $$

which follows immediately from equation 27.54. Next, make the inductive hypothesis 27.63 and calculate

$$ |\phi_{n+1} - \phi_n| = \left|\int_{t_0}^{t} \left[f(s, \phi_n(s)) - f(s, \phi_{n-1}(s))\right] ds\right| \qquad (27.65) $$
$$ \leq K \int_{t_0}^{t} |\phi_n(s) - \phi_{n-1}(s)|\,ds \qquad (27.66) $$

by the definition of φ_n and the Lipschitz condition. Applying the inductive hypothesis and then integrating,

$$ |\phi_{n+1} - \phi_n| \leq \frac{K^n M}{n!} \int_{t_0}^{t} |s - t_0|^n\,ds \qquad (27.67) $$
$$ \leq \frac{K^n M}{(n+1)!} |t - t_0|^{n+1} \qquad (27.68) $$


which proves conjecture 27.63. Now let

$$ \phi_n(t) = \phi_0(t) + \sum_{i=1}^{n} \left[\phi_i(t) - \phi_{i-1}(t)\right] \qquad (27.69) $$

Then since the sequence of Picard iterates converges to the solution,

$$ \phi(t) = \lim_{n \to \infty} \phi_n(t) = \phi_0(t) + \sum_{i=1}^{\infty} \left[\phi_i(t) - \phi_{i-1}(t)\right] \qquad (27.70) $$

Hence by equation 27.63,

$$ |\phi(t) - \phi_n(t)| = \left|\sum_{i=n+1}^{\infty} \left(\phi_i(t) - \phi_{i-1}(t)\right)\right| \qquad (27.71) $$
$$ \leq \sum_{i=n+1}^{\infty} |\phi_i(t) - \phi_{i-1}(t)| \qquad (27.72) $$
$$ \leq \sum_{i=n+1}^{\infty} \frac{K^{i-1}M}{i!}|t - t_0|^i \qquad (27.73) $$
$$ \leq \frac{M}{K} \sum_{i=n+1}^{\infty} \frac{|K(t - t_0)|^i}{i!} \qquad (27.74) $$

Therefore by comparison with the Taylor series for e^{K(b−a)},

$$ \|\phi(t) - \phi_n(t)\| \leq \frac{M}{K} \sum_{i=n+1}^{\infty} \frac{|K(b-a)|^i}{i!} \qquad (27.75) $$
$$ \leq \frac{M}{K} \left(e^{K(b-a)} - \sum_{i=0}^{n} \frac{|K(b-a)|^i}{i!}\right) \qquad (27.76) $$
$$ \leq \frac{M}{K} \sup_{0 \leq t \leq KL} R_n(t) \qquad (27.77) $$

where R_n(t) is the Taylor series remainder for e^t after n terms,

$$ \sup_{0 \leq t \leq KL} R_n(t) \leq \sup_{0 \leq \{c, t\} \leq KL} \frac{t^{n+1} e^c}{(n+1)!} \leq \frac{(KL)^{n+1} e^{KL}}{(n+1)!} \qquad (27.78) $$

for some unknown c between a and b. Hence

$$ \|\phi(t) - \phi_n(t)\| \leq \frac{M}{K} \frac{(KL)^{n+1} e^{KL}}{(n+1)!} \qquad (27.79) $$

proving the theorem.


The following example shows that this bound is not very useful in practice.
Example 27.6. Estimate the number of iterations required to obtain a solution to y′ = t, y(0) = 1 on [0, 10] with a precision of no more than 10⁻⁷.
Solution. Since f(t, y) = t we have f_y = 0 and hence a Lipschitz constant is K = 1 (or any positive number), and we can use M = 10 on [0, 10]. The precision in the error is bounded by

$$ \frac{M(KL)^{n+1} e^{KL}}{K(n+1)!} \leq \frac{10(10)^{n+1} e^{10}}{(n+1)!} \qquad (27.80) $$
We can determine the minimum value of n by using Mathematica. The following will print a list of values of equation (27.80) for n ranging from 1 to 50.

errs = Table[{n, 10 (10)^(n + 1) (E^10.)/(n + 1)!}, {n, 1, 50}]

The output is a list of number pairs, which can be plotted with ListPlot or

<<Graphics`Graphics`
LogListPlot[errs];

The output of LogListPlot is shown below; we have annotated the plot with an additional line at the desired tolerance of 10⁻⁷, showing that it occurs at the 47th iteration.

This example shows that Picard iteration will produce the desired accuracy if we
perform 47 iterations. For this particular problem, this suggestion is absurd, because
Picard iteration converges to the exact solution after 1 iteration. The calculated
solution does not change upon further calculations. Hence the method vastly over-
estimates the potential error (at least for this example).

Lesson 28

Euler’s Method

By a numerical solution of the initial value problem

$$ y' = f(t, y), \quad y(t_0) = y_0 \qquad (28.1) $$

we mean a sequence of values

$$ y_0, y_1, y_2, \ldots, y_{n-1}, y_n; \qquad (28.2) $$

a corresponding mesh or grid M given by

$$ M = \{t_0 < t_1 < t_2 < \cdots < t_{n-1} < t_n\}; \qquad (28.3) $$

and a grid spacing

$$ h_j = t_{j+1} - t_j \qquad (28.4) $$
Then the numerical solution or numerical approximation to the solution is the se-
quence of points
(t0 , y0 ), (t1 , y1 ), . . . , (tn−1 , yn−1 ), (tn , yn ) (28.5)
In this solution the point (tj , yj ) represents the numerical approximation to the solu-
tion point y(tj ). We can imagine plotting the points (28.5) and then “connecting the
dots” to represent an approximate image of the graph of y(t), t0 ≤ t ≤ tn . We will
use the convenient notation
yn ≈ y(tn ) (28.6)
which is read as “yn is the numerical approximation to y(t) at t = tn .”

Euler's Method or the Forward Euler Method is constructed as illustrated in figure 28.1. At grid point t_n, y(t) ≈ y_n, and the slope of the solution is given exactly by y′ = f(t_n, y(t_n)). If we approximate the slope by the straight line segment between the numerical solution at t_n and the numerical solution at t_{n+1} then

$$ y'(t_n) \approx \frac{y_{n+1} - y_n}{t_{n+1} - t_n} = \frac{y_{n+1} - y_n}{h_n} \qquad (28.7) $$


Figure 28.1: Illustration of Euler's Method. A tangent line with slope f(t0, y0) is constructed from (t0, y0) forward a distance h = t1 − t0 in the t-direction to determine y1. Then a line with slope f(t1, y1) is constructed forward from (t1, y1) to determine y2, and so forth. Only the first line is tangent to the actual solution; the subsequent lines are only approximately tangent.


Since y′(t) = f(t, y), we can approximate the left hand side of (28.7) by

$$ y'(t_n) \approx f(t_n, y_n) \qquad (28.8) $$

and hence

$$ y_{n+1} = y_n + h_n f(t_n, y_n) \qquad (28.9) $$

It is often the case that we use a fixed step size h = t_{j+1} − t_j, in which case we have

$$ t_j = t_0 + jh \qquad (28.10) $$

In this case the Forward Euler method becomes

$$ y_{n+1} = y_n + h f(t_n, y_n) \qquad (28.11) $$

The Forward Euler’s method is sometimes just called Euler’s Method. The application
of Euler’s method is summarized below.

Algorithm Forward Euler


Input f , t0 , y0 , h, tmax
Let t = t0 , y = y0
While t < tmax
let y = y + hf (t, y)
let t = t + h
let tn = t, yn = y
End While
Return {(t0 , y0 ), . . . , (tn , yn )}
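The algorithm translates almost line for line into Mathematica; the following is a minimal sketch (the function name ForwardEuler is ours):

    (* Forward Euler with fixed step size h, equation 28.11. *)
    ForwardEuler[f_, t0_, y0_, h_, tmax_] :=
     Module[{t = t0, y = y0, r},
      r = {{t0, y0}};
      While[t < tmax,
       y = y + h f[t, y];   (* equation 28.11 *)
       t = t + h;
       AppendTo[r, {t, y}];
       ];
      r
      ]

For instance, ForwardEuler[Function[{t, y}, y], 0., 1., 0.25, 1.] reproduces the values computed by hand in example 28.1 below.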


An alternate derivation of equation (28.9) is to expand the solution y(t) in a Taylor series about the point t = t_n:

$$ y(t_{n+1}) = y(t_n + h_n) = y(t_n) + h_n y'(t_n) + \frac{h_n^2}{2} y''(t_n) + \cdots \qquad (28.12) $$
$$ = y(t_n) + h_n f(t_n, y(t_n)) + \cdots \qquad (28.13) $$

We then observe that since y_n ≈ y(t_n) and y_{n+1} ≈ y(t_{n+1}), (28.9) follows immediately from (28.13).

If the scalar initial value problem of equation (28.1) is replaced by a system of equations

$$ \mathbf{y}' = \mathbf{f}(t, \mathbf{y}), \quad \mathbf{y}(t_0) = \mathbf{y}_0 \qquad (28.14) $$

then the Forward Euler method has the obvious generalization

$$ \mathbf{y}_{n+1} = \mathbf{y}_n + h\,\mathbf{f}(t_n, \mathbf{y}_n) \qquad (28.15) $$

Example 28.1. Solve y′ = y, y(0) = 1 on the interval [0, 1] using h = 0.25.

Solution. The exact solution is y = e^t. We compute the values using Euler's method. For any given time point t_k, the value y_k depends purely on the values of t_{k−1} and y_{k−1}. This is often a source of confusion for students: although the formula y_{k+1} = y_k + hf(t_k, y_k) only depends on t_k and not on t_{k+1}, it gives the value of y_{k+1}.
We are given the following information:

(t0 , y0 ) = (0, 1) (28.16)


f (t, y) = y (28.17)
h = 0.25 (28.18)

We first compute the solution at t = t1 .

y1 = y0 + hf (t0 , y0 ) = 1 + (0.25)(1) = 1.25 (28.19)


t1 = t0 + h = 0 + 0.25 = 0.25 (28.20)
(t1 , y1 ) = (0.25, 1.25) (28.21)

Then we compute the solutions at t2, t3, ... until we reach t = 1.

y2 = y1 + hf (t1 , y1 ) (28.22)
= 1.25 + (0.25)(1.25) = 1.5625 (28.23)
t2 = t1 + h = 0.25 + 0.25 = 0.5 (28.24)
(t2 , y2 ) = (0.5, 1.5625) (28.25)


y3 = y2 + hf (t2 , y2 ) (28.26)
= 1.5625 + (0.25)(1.5625) = 1.953125 (28.27)
t3 = t2 + h = 0.5 + 0.25 = 0.75 (28.28)
(t3 , y3 ) = (0.75, 1.953125) (28.29)

y4 = y3 + hf (t3 , y3 ) (28.30)
= 1.953125 + (0.25)(1.953125) = 2.44140625 (28.31)
t4 = t3 + 0.25 = 1.0 (28.32)
(t4 , y4 ) = (1.0, 2.44140625) (28.33)

Since t4 = 1 we are done. The solutions are tabulated in the table below for this and other step sizes.
t h = 1/2 h = 1/4 h = 1/8 h = 1/16 exact solution
0.0000 1.0000 1.0000 1.0000 1.0000 1.0000
0.0625 1.0625 1.0645
0.1250 1.1250 1.1289 1.1331
0.1875 1.1995 1.2062
0.2500 1.2500 1.2656 1.2744 1.2840
0.3125 1.3541 1.3668
0.3750 1.4238 1.4387 1.4550
0.4375 1.5286 1.5488
0.5000 1.5000 1.5625 1.6018 1.6242 1.6487
0.5625 1.7257 1.7551
0.6250 1.8020 1.8335 1.8682
0.6875 1.9481 1.9887
0.7500 1.9531 2.0273 2.0699 2.1170
0.8125 2.1993 2.2535
0.8750 2.2807 2.3367 2.3989
0.9375 2.4828 2.5536
1.0000 2.2500 2.4414 2.5658 2.6379 2.7183

Lesson 29

The Backwards Euler Method

Now consider the IVP

$$ y' = -5ty^2 + \frac{5}{t} - \frac{1}{t^2}, \qquad y(1) = 1 \qquad (29.1) $$
The exact solution is y = 1/t. The numerical solution is plotted for three different
step sizes on the interval [1, 25] in the following figure. Clearly something appears
to be happening here around h = 0.2, but what is it? For smaller step sizes, a
relatively smooth solution is obtained, and for larger values of h the solution becomes
progressively more jagged.


This example illustrates a problem that occurs in the solution of differential equa-
tions, known as stiffness. Stiffness occurs when the numerical method becomes
unstable. An exploration of this phenomenon is beyond the scope of Math 481A
(the topic is covered in great detail in Math 582B). One solution is to modify Euler’s
method as illustrated in figure 29.1 to give the Backward’s Euler Method:

yn = yn−1 + hn f (tn , yn ) (29.2)



Figure 29.1: Illustration of the Backward Euler Method. Instead of constructing a tangent line with slope f(t0, y0) through (t0, y0), a line with slope f(t1, y1) is constructed. This necessitates knowing the solution at t1 in order to determine y1.


The problem with the Backward Euler method is that we need to know the answer in order to compute the solution: yn appears on both sides of the equation, and in general, we cannot solve explicitly for it. The Backward Euler method is an example of an implicit method, because it contains yn implicitly. In general it is not possible to solve for yn explicitly as a function of yn−1 in equation 29.2, even though it is sometimes possible to do so for specific differential equations. Thus at each mesh point one needs to make a first guess at the value of yn and then perform some additional refinement to improve the calculation of yn before moving on to the next mesh point. A common method is to use fixed point iteration on the equation

$$ y = k + hf(t, y) \qquad (29.3) $$

where k = yn−1. The technique is summarized here:
• Make a first guess at yn and use it in the right hand side of 29.2. A common first guess that works reasonably well is

$$ y_n^{(0)} = y_{n-1} \qquad (29.4) $$

• Use the better estimate of yn produced by 29.2, and then evaluate 29.2 again to get the next guess, e.g.,

$$ y_n^{(\nu+1)} = y_{n-1} + hf(t_n, y_n^{(\nu)}) \qquad (29.5) $$

• Repeat the process until the difference between two successive guesses is smaller than the desired tolerance.
Of course we know that fixed point iteration will only converge if there is some number K < 1 such that |∂g/∂y| ≤ K, where g(t, y) = k + hf(t, y). An implementation of the Backward Euler method with fixed point iteration in Mathematica is as follows:


Figure 29.2: Result of the forward Euler method to solve y 0 = −100(y−sin t), y(0) = 1
with h = 0.001 (top), h = 0.019 (middle), and h = 0.02 (third). The bottom figure
shows the same equation solved with the backward Euler method for step sizes of
h = 0.001, 0.02, 0.1, 0.3, left to right curves, respectively

BackwardEuler[f_, {t0_, y0_}, h_, tmax_, tol_:0.003, nmax_:5] :=
 Module[{t, i, yval, yvaln, yvalp, delta, r, J, JV},
  r = {{t0, y0}};
  yval = y0;
  J = D[f[t, y], y];                    (* symbolic df/dy *)
  JV = J /. {y -> y0, t -> t0};
  If[Abs[h JV] >= 1, Return[$Failed]];  (* fixed point iteration would diverge *)
  For[t = t0, t < tmax, t += h,
   yvaln = yval + h f[t, yval];         (* first guess: one forward Euler step *)
   For[i = 1, i <= nmax, i++,
    yvalp = yvaln;
    yvaln = yval + h f[t + h, yvalp];   (* fixed point iteration, equation 29.5 *)
    delta = yvaln - yvalp;
    If[Abs[delta] < tol, Break[]];
    ];
   yval = yvaln;
   AppendTo[r, {t + h, yval}];
   JV = J /. {y -> yval, t -> t + h};
   If[Abs[h JV] >= 1, Return[$Failed]];
   ];
  Return[r];
  ]

The input parameter tol gives the tolerance for the fixed point iteration: when
two successive guesses are this close together, the iteration stops. The default value
is 0.003. Similarly, we have added a parameter nmax which is an emergency cut-off
for the fixed point iteration. This number prevents infinite loops in the event the
tolerance is never reached. The number of iterations is counted, and if they reach the
value of nmax the fixed point iteration stops. Because there is always the possibility
(either through a program bug or some sort of bizarre input) that the algorithm will
not terminate, it is generally a good programming practice to always include this type
of counter and cut-off value. In the implementation shown above nmax has a default
value of 5. Since both nmax and tol have default values, they are considered optional
parameters by Mathematica: if you are happy with the values of the defaults, you do
not have to supply them when you call the program.
However, to avoid ill-conditioned equations it is usually better to use a root-finding algorithm such as Newton's method to find the root y of y = k + hf(t, y), e.g., use Newton's method to find the root of

$$ g(s) = s - y_{n-1} - hf(t_n, s) \qquad (29.6) $$

at each iteration. An implementation is shown below:

BackwardsEulerNewtonsMethod[f_, {t0_, y0_}, h_, tmax_, tol_:0.003, nmax_:5] :=
 Module[{t, y, i, time, yval, yvaln, yvalp, delta, r, J, JV, fv},
  r = {{t0, y0}};
  yval = y0;
  J = D[f[t, y], y];                                 (* symbolic df/dy *)
  For[time = t0, time < tmax, time += h,
   fv = f[time + h, yval];                           (* g is evaluated at t_n = time + h *)
   JV = J /. {t -> time + h, y -> yval};
   yvaln = (yval + h (fv - JV yval))/(1 - h JV);     (* first Newton step for eq. 29.6 *)
   For[i = 1, i <= nmax, i++,
    yvalp = yvaln;
    JV = J /. {t -> time + h, y -> yvalp};
    fv = f[time + h, yvalp];
    yvaln = (yval + h (fv - JV yvalp))/(1 - h JV);   (* subsequent Newton steps *)
    delta = yvaln - yvalp;
    If[Abs[delta] < tol, Break[]];
    ];
   yval = yvaln;
   AppendTo[r, {time + h, yval}];
   ];
  Return[r];
  ]

To solve the initial value problem y′ = −50(y − sin t), y(0) = 1 on the interval [0, 3] using a step size of h = 0.3,

In:=

f[t_, y_] := -50 (y - Sin[t]);
BackwardsEulerNewtonsMethod[f, {0, 1}, 0.3, 3]

Out:=

{{0, 1}, {0.3, 0.0625}, {0.6, 0.280956}, {0.9, 0.546912}, {1.2, 0.768551}, {1.5, 0.921821}, {1.8, 0.992765}, {2.1, 0.97503}, {2.4, 0.870198}, {2.7, 0.687634}, {3., 0.443646}}

which produces a list of values {{t0 , y0 }, . . . , {tn , yn }}.

Lesson 30

Improving Euler’s Method

All numerical methods for initial value problems of the form

$$ y'(t) = f(t, y), \quad y(t_0) = y_0 \qquad (30.1) $$

are variations of the form

$$ y_{n+1} = y_n + \phi(t_n, y_n, \ldots) \qquad (30.2) $$

for some function φ. In Euler's method, φ = hf(t_n, y_n); in the Backward Euler method, φ = hf(t_{n+1}, y_{n+1}). In general we can get a more accurate result with a smaller step size. However, in order to reduce computation time, it is desirable to find methods that will give better results without a significant decrease in step size. We can do this by making φ depend on values of the solution at multiple time points. For example, a Linear Multistep Method has the form

$$ y_{n+1} + a_0 y_n + a_1 y_{n-1} + \cdots = h(b_0 f_{n+1} + b_1 f_n + b_2 f_{n-1} + \cdots) \qquad (30.3) $$

for some numbers a_0, a_1, ... and b_0, b_1, .... Euler's method has a_0 = −1, a_1 = a_2 = ··· = 0 and b_1 = 1, b_0 = b_2 = b_3 = ··· = 0.

Here we introduce the Local Truncation Error (LTE), one measure of the "goodness" of a numerical method. The local truncation error tells us the error in the calculation of y, in units of h, at each step t_n, assuming that we know y_{n−1} exactly. Suppose we have a numerical estimate y_n of the correct solution y(t_n). Then the local truncation error is defined as

$$ LTE = \frac{1}{h}\left(y(t_n) - y_n\right) \qquad (30.4) $$
$$ = \frac{1}{h}\left(y(t_n) - y(t_{n-1}) + y(t_{n-1}) - y_n\right) \qquad (30.5) $$

Assuming we know the answer exactly at t_{n−1}, we have

$$ y_{n-1} = y(t_{n-1}) \qquad (30.6) $$


so that

$$ LTE = \frac{y(t_n) - y(t_{n-1})}{h} + \frac{y_{n-1} - y_n}{h} \qquad (30.7) $$
$$ = \frac{y(t_n) - y(t_{n-1})}{h} - \frac{1}{h}\phi(t_{n-1}, y_{n-1}, \ldots) \qquad (30.8) $$

For Euler's method,

$$ \phi = hf(t, y) \qquad (30.9) $$

hence

$$ LTE(\text{Euler}) = \frac{y(t_n) - y(t_{n-1})}{h} - f(t_{n-1}, y_{n-1}) \qquad (30.10) $$
If we expand y in a Taylor series about t_{n−1},

$$ y(t_n) = y(t_{n-1}) + hy'(t_{n-1}) + \frac{h^2}{2}y''(t_{n-1}) + \cdots \qquad (30.11) $$
$$ = y(t_{n-1}) + hf(t_{n-1}, y_{n-1}) + \frac{h^2}{2}y''(t_{n-1}) + \cdots \qquad (30.12) $$

Thus

$$ LTE(\text{Euler}) = \frac{h}{2}y''(t_{n-1}) + c_2 h^2 + c_3 h^3 + \cdots \qquad (30.13) $$

for some constants c_2, c_3, .... Because the lowest order term in powers of h is proportional to h, we say that

$$ LTE(\text{Euler}) = O(h) \qquad (30.14) $$
and say that Euler’s method is a First Order Method. In general, to improve
accuracy for a given step size, we look for higher order methods, which are O(hn );
the larger the value of n, the better the method in general.
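The order of a method can also be observed numerically: if Euler's method is O(h), halving the step size should roughly halve the error at a fixed time. A quick Mathematica check on y′ = y, y(0) = 1, whose exact solution at t = 1 is e (the helper name eulerError is ours):

    (* Error of forward Euler at t = 1 for y' = y, y(0) = 1. *)
    eulerError[h_] := Module[{t = 0., y = 1.},
      While[t < 1. - h/2, y = y + h y; t = t + h];
      Abs[y - E]]
    Table[{h, eulerError[h]}, {h, {0.1, 0.05, 0.025}}]

The errors come out to roughly 0.125, 0.065, and 0.033: each halving of h halves the error, consistent with a first order method. (Strictly, this measures the global error at t = 1, which for Euler's method is also O(h).)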

The Trapezoidal Method averages the values of f at the two end points. It has an iteration formula given by

$$ y_n = y_{n-1} + \frac{h_n}{2}\left(f(t_n, y_n) + f(t_{n-1}, y_{n-1})\right) \qquad (30.15) $$
We can find the LTE as follows by expanding the Taylor series:

$$ LTE(\text{Trapezoidal}) = \frac{y(t_n) - y(t_{n-1})}{h} - \frac{1}{2}\left(f(t_n, y_n) + f(t_{n-1}, y_{n-1})\right) \qquad (30.16) $$
$$ = \frac{1}{h}\left(y(t_{n-1}) + hy'(t_{n-1}) + \frac{h^2}{2}y''(t_{n-1}) + \frac{h^3}{3!}y'''(t_{n-1}) + \cdots - y(t_{n-1})\right) - \frac{1}{2}\left(f(t_n, y_n) + f(t_{n-1}, y_{n-1})\right) \qquad (30.17) $$


Therefore, using y′(t_{n−1}) = f(t_{n−1}, y_{n−1}),

$$ LTE(\text{Trapezoidal}) = \frac{1}{2}f(t_{n-1}, y_{n-1}) + \frac{h}{2}y''(t_{n-1}) + \frac{h^2}{6}y'''(t_{n-1}) + \cdots - \frac{1}{2}f(t_n, y_n) \qquad (30.18) $$

Expanding the final term in a Taylor series,

$$ f(t_n, y_n) = y'(t_n) \qquad (30.19) $$
$$ = y'(t_{n-1}) + hy''(t_{n-1}) + \frac{h^2}{2}y'''(t_{n-1}) + \cdots \qquad (30.20) $$
$$ = f(t_{n-1}, y_{n-1}) + hy''(t_{n-1}) + \frac{h^2}{2}y'''(t_{n-1}) + \cdots \qquad (30.21) $$
Therefore the Trapezoidal method is a second order method:

$$ LTE(\text{Trapezoidal}) = \frac{1}{2}f_{n-1} + \frac{h}{2}y''_{n-1} + \frac{h^2}{6}y'''_{n-1} + \cdots - \frac{1}{2}f_{n-1} - \frac{h}{2}y''_{n-1} - \frac{h^2}{4}y'''_{n-1} - \cdots \qquad (30.22) $$
$$ = -\frac{h^2}{12}y'''_{n-1} + \cdots \qquad (30.23) $$
$$ = O(h^2) \qquad (30.24) $$

The theta method is given by

$$ y_n = y_{n-1} + h\left[\theta f(t_{n-1}, y_{n-1}) + (1-\theta)f(t_n, y_n)\right] \qquad (30.25) $$

The theta method is implicit except when θ = 1, where it reduces to Euler's method, and is first order unless θ = 1/2. For θ = 1/2 it becomes the trapezoidal method. The usefulness of the theta method comes from the ability to remove the error for specific high order terms. For example, when θ = 2/3, there is no h³ term even though there is still an h² term. This can help if the coefficient of the h³ term is so large that it overwhelms the h² term for some values of h.

The second-order midpoint method is given by

$$ y_n = y_{n-1} + h_n f\left(t_{n-1/2},\ \frac{1}{2}\left[y_n + y_{n-1}\right]\right) \qquad (30.26) $$

The modified Euler method, which is also second order, is

$$ y_n = y_{n-1} + \frac{h_n}{2}\left[f(t_{n-1}, y_{n-1}) + f(t_n,\ y_{n-1} + hf(t_{n-1}, y_{n-1}))\right] \qquad (30.27) $$

Heun's Method is

$$ y_n = y_{n-1} + \frac{h_n}{4}\left[f(t_{n-1}, y_{n-1}) + 3f\left(t_{n-1} + \frac{2}{3}h,\ y_{n-1} + \frac{2}{3}hf(t_{n-1}, y_{n-1})\right)\right] \qquad (30.28) $$

Both Heun's method and the modified Euler method are second order and are examples of two-stage Runge-Kutta methods. It is clearer to implement these in two "stages," e.g., for the modified Euler method,

$$ \tilde{y}_n = y_{n-1} + hf(t_{n-1}, y_{n-1}) \qquad (30.29) $$
$$ y_n = y_{n-1} + \frac{h_n}{2}\left[f(t_{n-1}, y_{n-1}) + f(t_n, \tilde{y}_n)\right] \qquad (30.30) $$

while for Heun's method,

$$ \tilde{y}_n = y_{n-1} + \frac{2}{3}hf(t_{n-1}, y_{n-1}) \qquad (30.31) $$
$$ y_n = y_{n-1} + \frac{h_n}{4}\left[f(t_{n-1}, y_{n-1}) + 3f\left(t_{n-1} + \frac{2}{3}h,\ \tilde{y}_n\right)\right] \qquad (30.32) $$
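In Mathematica the two-stage form of Heun's method is only a few lines; the following minimal sketch (the name Heun is ours) uses a fixed step size h:

    (* Heun's method, equations 30.31-30.32, with fixed step h. *)
    Heun[f_, t0_, y0_, h_, tmax_] :=
     Module[{t = t0, y = y0, ytilde, r},
      r = {{t0, y0}};
      While[t < tmax,
       ytilde = y + (2/3) h f[t, y];                        (* stage 1, eq. 30.31 *)
       y = y + (h/4) (f[t, y] + 3 f[t + (2/3) h, ytilde]);  (* stage 2, eq. 30.32 *)
       t = t + h;
       AppendTo[r, {t, y}];
       ];
      r
      ]

Because each step uses f(t, y) twice, a more careful implementation would evaluate it once and reuse it; the sketch above favors clarity over efficiency.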

