This document is provided in the hope that it will be useful but without any
warranty, without even the implied warranty of merchantability or fitness for a
particular purpose. The document is provided on an “as is” basis and the author
has no obligations to provide corrections or modifications. The author makes no
claims as to the accuracy of this document. In no event shall the author be liable
to any party for direct, indirect, special, incidental, or consequential damages,
including lost profits, unsatisfactory class performance, poor grades, confusion,
misunderstanding, emotional disturbance or other general malaise arising out of
the use of this document or any software described herein, even if the author has
been advised of the possibility of such damage.
Contents
1  A Motivational Example
3  Sequences
5  Error
6  Number Representation
10 Newton's Method
11 Secant Method
15 Müller's Method
16 Linear Systems
A Motivational Example
Numerical analysis is a branch of mathematics that deals with the development and
implementation of methods for solving problems numerically with continuous math-
ematics. A related field, discrete or finite mathematics, deals with problems that
do not contain or depend on the concept of continuity. In practice both fields of
mathematics overlap with the subjects of numerical computation and computer
science, which deal with the actual implementations (e.g., computer programs and
algorithms) used to solve these problems.
As an example of a numerical algorithm, consider finding the square root √a of a
number. We know that the solution satisfies
x^2 = a    (1.1)
Many algorithms for finding the value of x are based on finding the root of the
polynomial
f(x) = x^2 − a    (1.2)
i.e., finding the value of x that satisfies the equation f(x) = 0. This number is called a
root of f(x). We will explore some of these algorithms this semester. We start with an
example that was first observed by the ancient Babylonians. If x is an approximation
to √a, then a/x ≈ √a is an equally good approximation. Furthermore,
x < √a  ⟹  1/√a < 1/x    (1.3)
        ⟹  √a = a/√a < a/x    (1.4)
and
x > √a  ⟹  √a > a/x    (1.5)
In other words, if x is any approximation to √a, then the actual value of √a must
lie between x and a/x. Since there is no reason to believe that x is any better an
approximation than a/x, and vice versa, this suggests that we can obtain a better
approximation to √a by averaging the two estimates:
x_1 = (1/2)(x_0 + a/x_0)    (1.6)
We can repeat this argument with x_1 to generate a better estimate x_2, and so forth,
leading us to the sequence of approximations x_0, x_1, x_2, . . . given by
x_{i+1} = (1/2)(x_i + a/x_i)    (1.7)
Equation 1.7 is an example of an iteration formula. It gives us a sequence of better and
better approximations to the number we are looking for. We will see iteration formulas
again and again throughout this class; they are one of the principal techniques by
which we summarize a numerical algorithm. The basic technique is summarized here:
Given: x_0
i = 0
Repeat
    x_{i+1} = f(x_i)
    i = i + 1
Until the approximation is "good enough"
In practice we specify a numerical tolerance ε and a maximum number of iterations N:
Given: x_0, ε, N
i = 0
Repeat
    x_{i+1} = f(x_i)
    i = i + 1
Until |x_i − x_{i−1}| < ε or i > N
When you are debugging your code it is generally a good idea to use a very small
value of N such as 2 or 3, even if you expect a much larger number of iterations to
occur in the final version.
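In Mathematica this generic iteration pattern is built in. The sketch below uses FixedPoint with the Babylonian map for a = 2; the function name g and the tolerance are illustrative choices, not part of the text:
In:=
g[x_] := (x + 2/x)/2;  (* the Babylonian averaging map for a = 2 *)
(* iterate until successive values agree to 10^-12, but at most 100 times *)
FixedPoint[g, 1.0, 100, SameTest -> (Abs[#1 - #2] < 10^-12 &)]
Out:=
1.41421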
The only remaining problem is to figure out what to use for the first guess x_0.
This algorithm is so good, in fact, that it doesn't much matter. We can use x_0 = 1
or x_0 = a. For example, to find √2 using x_0 = 2 we have
x_1 = (1/2)(2 + 2/2) = 1.5    (1.8)
x_2 = (1/2)(1.5 + 2/1.5) = 1.41667    (1.9)
x_3 = (1/2)(1.41667 + 2/1.41667) = 1.41422    (1.10)
and so forth. This algorithm converges rather quickly; in fact, it precisely reproduces
the same formula as Newton’s method (which we will discuss in section 10).
Throughout this class we will give examples using a programming language called
Mathematica. We have chosen this language because it is extremely powerful; uses
a fairly intuitive mathematical interface that is easy to learn rapidly; and allows us
to program without worrying about many of the details such as types, classes, and
objects that we need to worry over in more primitive languages such as Java, C++
or FORTRAN. In Mathematica, we can implement the square root finding algorithm
quite easily. (Don't worry about the details of this program if you don't know Math-
ematica; we'll come back to that and do some training in the Math Lab before you
have to start coding for your homework.) The following calculation shows that the
algorithm converges to the first 50 digits in only 8 iterations!
In:=
g[x_] := (x + 2/x)/2;
NestList[g, 2.0`50, 8]
Out:=
{2.0000000000000000000000000000000000000000000000000,
1.5000000000000000000000000000000000000000000000000,
1.4166666666666666666666666666666666666666666666667,
1.4142156862745098039215686274509803921568627450980,
1.4142135623746899106262955788901349101165596221157,
1.4142135623730950488016896235025302436149819257762,
1.4142135623730950488016887242096980785696718753772,
1.4142135623730950488016887242096980785696718753769,
1.4142135623730950488016887242096980785696718753769}
The target's velocity is a real number that can be expressed as a whole number and a decimal (e.g., 3750.2563...miles per
hour). Time is kept continuously by the system’s internal clock in tenths of seconds
but is expressed as an integer or whole number (e.g., 32, 33, 34...). The longer the
system has been running, the larger the number representing time. To predict where
the Scud will next appear, both time and velocity must be expressed as real numbers.
Because of the way the Patriot computer performs its calculations and the fact that
its registers are only 24 bits long, the conversion of time from an integer to a real
number cannot be any more precise than 24 bits. This conversion results in a loss
of precision causing a less accurate time calculation. The effect of this inaccuracy on
the range gate’s calculation is directly proportional to the target’s velocity and the
length of time the system has been running. Consequently, performing the conversion
after the Patriot has been running continuously for extended periods causes the range
gate to shift away from the center of the target, making it less likely that the target,
in this case a Scud, will be successfully intercepted.
“... after about 20 hours, the inaccurate time calculation becomes sufficiently large
to cause the radar to look in the wrong place for the target ... Army officials said
that they believed that ... Patriot users were not running their systems for 8 or more
hours at a time ... Significant shifts of the range gate away from the desired center
of the target could be eliminated by rebooting the system (turning the system off and
on) every few hours. Rebooting, which takes about 60 to 90 seconds, reinitializes the
computer’s clock, setting the time back to zero.
“... On February 25, Alpha Battery had been in operation for over 100 consecutive
hours ...”
Let's examine this calculation in some detail. Each bit in a binary number rep-
resents a power of 2. The bits to the right of the radix point are fractions,
representing 2^{-1}, 2^{-2}, 2^{-3}, ... as we move from left to right; the bits to the left
of the binary point represent 2^0, 2^1, 2^2, 2^3, ... as we move to the left, starting at the
binary point. The nth bit to the right of the radix point then represents 2^{-n} and
the nth bit to the left represents 2^{n-1}. We can convert a binary number back to its
decimal representation by adding up these values. Let b_n = 1 or 0 represent the
nth bit. Then
decimal value = Σ_{whole bits} b_n × 2^{n-1} + Σ_{fractional bits} b_n × 2^{-n}    (1.11)
where “whole bits” means the bits to the left of the radix point and “fractional bits”
means the bits to the right of the radix point.
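Equation 1.11 is easy to implement directly. Here is a small sketch (the function name binToDecimal is my own choice, not from the text); Mathematica's built-in FromDigits does the same job:
In:=
(* equation (1.11): sum b_n 2^(n-1) over whole bits and b_n 2^(-n) over fractional bits *)
binToDecimal[whole_List, frac_List] :=
  Total[Reverse[whole] Table[2^(n - 1), {n, Length[whole]}]] +
  Total[frac Table[2^(-n), {n, Length[frac]}]]
binToDecimal[{1, 0, 1, 1, 0, 1}, {0, 1, 1}]  (* the binary number 101101.011 *)
Out:=
363/8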
Google Calculator gives you a convenient way to convert between bases. In Google
Calculator a binary integer begins with "0b" (zero followed by the letter b). Unfortu-
nately it does not work with fractions, only with integers. If we enter the string
0b110011001100110011 in Decimal
in any Google search window it will return the number 209715.
To determine its value in decimal we observe that the least significant bit corresponds
to 2^{-21}, so we enter the string
2^(-21)*209715
in the search window.
This returns a number that is very close to – but not precisely equal to – one tenth.
Example 1.1. Find the decimal equivalent of the binary number 101101.011.
Solution.
101101.011 = 1×2^5 + 0×2^4 + 1×2^3 + 1×2^2 + 0×2^1 + 1×2^0 + 0×2^{-1} + 1×2^{-2} + 1×2^{-3}
           = 32 + 8 + 4 + 1 + 1/4 + 1/8 = 45.375
We can check this in Mathematica with
BaseForm[45.375, 2]
which returns the value 101101.011₂. Unfortunately BaseForm only returns a string
representation of the number, not an actual binary number, so doing calculations
with the binary number requires a bit more work.
In the Patriot software, integers were converted to decimal numbers by multiplying
by the 24-bit binary representation of the decimal number 0.1, with one bit to the left
of the radix point and 23 bits to the right. This number is
m = 0.0001 1001 1001 1001 1001 100₂ = 209715/2097152    (1.15)
Spaces are used to separate every fourth bit to make the binary numbers easier to
read. The choice of four bits is convenient because 4 binary bits correspond to
precisely one hexadecimal (base 16) digit. (We can, of course, find this crucial
number in Mathematica by typing BaseForm[0.1, 2].)
The calculation was off by over half of a kilometer. This caused the system to
repeatedly recycle and try to recalculate the position again. It was unable to converge,
and so a missile was allowed to penetrate the base's defenses on 25 Feb 1991, killing
28 people. Ironically, the bug was known and a patch correcting the problem had been
released on 16 Feb 1991, but it was still in the mail. It arrived one day too late, on
26 Feb 1991. President Bush declared that hostilities had ended on 28 Feb.
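A quick Mathematica check of the arithmetic behind that half-kilometer figure; the Scud closing speed of roughly 1676 m/s is an assumption commonly quoted for this incident, not a number from the text above:
In:=
err = 1/10 - 209715/2097152;   (* timing error per 0.1 s clock tick *)
drift = err*10*3600*100;       (* accumulated drift in seconds after 100 hours *)
N[{drift, drift*1676}]         (* drift in seconds, and in meters at ~1676 m/s *)
Out:=
{0.343323, 575.409}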
Over the course of 22 months starting in 1982, the Vancouver Stock Exchange
accumulated enough numerical error due to roundoff to reduce the correct value
of the index (1098.98) to 574.08.
Under German law (in 1992) no party with less than a 5 percent vote may be
seated in parliament. On April 5, 1992 the Green party obtained 4.97% of the
vote, but a computer program that prints out results was set to round to one
decimal place: exactly 5.0%. Hence the early official results showed that the
candidate had been elected.
In 1995 Microsoft announced that some versions of its spreadsheet program
Excel make mistakes because of a base 10 to base 2 conversion error.
An oil platform off the coast of Norway sank on August 23, 1991 at a cost of
nearly a billion dollars as a result of an error in a finite element approximation,
a method that was used to calculate the linear elastic stresses on the structure
supporting the platform.
In the next sections we will make a brief review of some mathematical preliminaries
before we turn to a study of how numbers are represented in computers.
Definition 2.1 (Limit). We say "the limit of f(x) as x approaches x_0 is equal to
L" and write
lim_{x→x_0} f(x) = L    (2.1)
if, given any ε > 0, there exists some δ > 0 such that |f(x) − L| < ε whenever
0 < |x − x_0| < δ. The value of the number δ is allowed to depend on the value of
the number ε. This concept of a limit is illustrated in the following figure: if you
give me the value of ε, I can find a value of δ such that |f(x) − L| < ε whenever
|x − x_0| < δ.
[Figure: the ε-δ definition of a limit. Given ε (the box around L, which shrinks for smaller ε), "I will find δ" so that f(x) stays in the box whenever x is within δ of x_0.]
Example 2.1. Show that lim_{x→3} √(x+1) = 2.
Solution. Using the nomenclature of the definition,
f(x) = √(x+1)    (2.3)
L = 2    (2.4)
x_0 = 3    (2.5)
Let ε > 0 be any small number. Then we need to find a number δ > 0 such that
|x − 3| < δ  ⟹  |√(x+1) − 2| < ε    (2.6)
Since we need both conditions 2.13 and 2.14 to hold, we require both equations
2.17 and 2.20. Since ε > 0,
4 − ε < 4 + ε    (2.21)
ε(4 − ε) < ε(4 + ε)    (2.22)
we determine that condition 2.17 is more restrictive: when it holds, we are ensured
that condition 2.20 is also met. This gives us enough information to construct a proof,
which we present immediately.
Let ε > 0. We need to show that there is some δ such that |x − x_0| < δ implies
that |f(x) − L| < ε. In other words we need to show that there is some δ such that
|x − 3| < δ    (2.23)
implies that
|√(x+1) − 2| < ε    (2.24)
We do this by choosing
δ = ε(4 − ε)    (2.25)
hence
|x − 3| < δ = ε(4 − ε)    (2.26)
−ε(4 − ε) < x − 3 < ε(4 − ε)    (2.27)
ε^2 − 4ε + 4 < x + 1 < −ε^2 + 4ε + 4    (2.28)
(ε − 2)^2 < x + 1 < −ε^2 + 4ε + 4 < ε^2 + 4ε + 4 = (ε + 2)^2    (2.29)
√((ε − 2)^2) < √(x + 1) < √((ε + 2)^2)    (2.30)
From the inequality on the left, we know that
√(x + 1) > √((ε − 2)^2) = ±(ε − 2)    (2.31)
and that this must hold for both values of the root. The value ε − 2 is near −2 and
the value 2 − ε is near 2, so we choose
2 − ε < √(x + 1)    (2.32)
Combining this with the inequality on the right hand side of 2.30 we obtain
2 − ε < √(x + 1) < 2 + ε    (2.33)
−ε < √(x + 1) − 2 < ε    (2.34)
|√(x + 1) − 2| < ε    (2.35)
Sequences
We will frequently use iterative processes in our study of numerical analysis. In such
a process, one computes a sequence of values, usually in a loop or other similar control
structure. Such iterative processes can be related to the concept of a sequence: at
each iteration of the loop we calculate the value of some number a_n. The complete
set of all possible a_n is a sequence. More specifically, we have the following definition.
x_1, x_2, x_3, . . .    (3.2)
x_n    (3.3)
{x_n}_{n=1}^∞    (3.4)
[Figure: convergence of a sequence x_1, x_2, x_3, . . . to L: "you name ε", and beyond the index N every term x_n with n > N lies within ε of L.]
Example 3.1. Show that the sequence x_n = 3(1 + 2^n)/2^n → 3 as n → ∞.
Solution. We need to show that for any ε > 0 there exists some N such that
or, equivalently,
f (xn ) → f (c) (3.17)
Definition 4.1 (Derivative). The derivative is given by either of the following two
equivalent formulas:
f'(x_0) = lim_{h→0} [f(x_0 + h) − f(x_0)]/h = lim_{x→x_0} [f(x) − f(x_0)]/(x − x_0)    (4.1)
The second definition can be derived from the first with the substitution x = h + x_0.
Theorem 4.1 (Intermediate Value Theorem (IVT)). Suppose that f (x) is a
continuous function on the interval [a, b], and that K is a number between f (a) and
f (b). Then there exists at least one (and possibly many) number(s) c ∈ [a, b] such
that f (c) = K .
[Figure 4.1: the Intermediate Value Theorem. The graph of f passes through every value between f(a) and f(b); in particular f(c) = K for some c between a and b.]
Thus a continuous function takes on all values between the values it attains at the
endpoints of its domain (see figure 4.1). The following corollary is illustrated in figure
4.2.
Corollary 4.1. Under the same conditions as the IVT, if f(a) and f(b) have different
signs, then there is a root between a and b.
Corollary 4.2. Under the same conditions as the IVT, if f (a)f (b) < 0, then there
is a root in the interval (a, b).
Proof. If f(a)f(b) < 0 then either f(a) < 0 and f(b) > 0, or f(a) > 0 and f(b) < 0.
In either case, the number 0 is between f(a) and f(b). Hence there is some number
c such that f(c) = 0.
Figure 4.2: If f (a)f (b) < 0 then there is a root between a and b.
where
P_n(x) = Σ_{k=0}^{n} [f^{(k)}(x_0)/k!] (x − x_0)^k    (4.4)
       = f(x_0) + (x − x_0)f'(x_0) + · · · + (x − x_0)^n f^{(n)}(x_0)/n!    (4.5)
and
R_n(x) = [f^{(n+1)}(c)/(n+1)!] (x − x_0)^{n+1}    (4.6)
The polynomial P_n(x) is called the Taylor Polynomial of Order n and the function
R_n(x) is called the Remainder.
When x_0 = 0, Taylor's theorem gives the Maclaurin Polynomials:
P_n(x) = Σ_{k=0}^{n} [f^{(k)}(0)/k!] x^k    (4.7)
       = f(0) + x f'(0) + · · · + [f^{(n)}(0)/n!] x^n    (4.8)
The corresponding Maclaurin Remainder Formula is
The corresponding Maclaurin Remainder Formula is
f''(x) = −(1/4)(x + 1)^{-3/2},   f''(0) = −1/4    (4.12)
f'''(x) = (3/8)(x + 1)^{-5/2},   f'''(0) = 3/8    (4.13)
f^{(4)}(x) = −(15/16)(x + 1)^{-7/2}    (4.14)
Hence
P_3(x) = f(0) + x f'(0) + (x^2/2) f''(0) + (x^3/3!) f'''(0)    (4.15)
       = 1 + (1/2)x + (1/2)(−1/4)x^2 + (1/6)(3/8)x^3    (4.16)
       = 1 + x/2 − x^2/8 + x^3/16    (4.17)
and similarly
R_3(x) = [f^{(4)}(c)/4!] x^4 = −15(c + 1)^{-7/2} x^4/384    (4.18)
Example 4.2. Use the Maclaurin series found in the previous example to estimate √2.
Solution. The formula in the previous example is for f(x) = √(1 + x), so to get √2 we
need to use x = 1. Thus
√2 ≈ 1 + 1/2 − 1/8 + 1/16 = 1.4375    (4.19)
Example 4.3. Use the remainder formula found in the previous example to determine
the maximum error in calculating √2 with this formula.
R_3(1) = −15(c + 1)^{-7/2}/384    (4.20)
where c is some number between 0 and 1 (because 1 is the argument of f(x) at which
we evaluated the polynomial). The maximum value occurs when c = 0, hence we have
|R_3(1)| < 15/384 ≈ 0.0391    (4.21)
thus we can conclude that our calculation gives √2 ≈ 1.4375 ± 0.0391.
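We can cross-check both the polynomial and the error bound with Mathematica's built-in Series; this is just a verification sketch, not part of the original derivation:
In:=
p3 = Normal[Series[Sqrt[1 + x], {x, 0, 3}]]   (* the degree-3 Maclaurin polynomial *)
p3 /. x -> 1                                   (* evaluate at x = 1 *)
N[Sqrt[2] - (p3 /. x -> 1)]                    (* the actual error, well inside the 0.0391 bound *)
Out:=
1 + x/2 - x^2/8 + x^3/16
23/16
-0.0232864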
Error
Since the computer represents numbers with a finite number of bits, it will have to
truncate this approximation with a finite number of repeats of the 1001. This will
lead to a small error which, as we have seen, can compound into a very large error.
We will use the following definitions:
We will sometimes use the term unit in the last place or ulp to represent the value
of a 1 when placed in the rightmost digit of a numerical representation of a number.
For example, if we represent the irrational number
e = 2.718281828459045...    (5.6)
The following bit of Mathematica code will find the ulp on your computer:
In:=
ulp = 1.0;
While[(1 + ulp) - 1 > 0, ulp = ulp/2];
Print["ulp=", 2*ulp];
Out:=
2.22045 × 10^{-16}
One caution about this program: if you forget to put the decimal point in the initial
assignment ulp=1.0, and just write it as ulp=1, the program will run in an infinite
loop, because all of the calculations are rational. To see what is happening, insert a
print statement in the loop.
Because the error is often small, it is sometimes more meaningful to measure the
error relative to the size of the number being approximated.
The decimal places of accuracy gives approximately the number of digits that are
accurately represented to the right of the decimal point. In our approximation to e we
had 1 ulp = 10^{-5}, so that an error of 1 ulp represents 5 decimal places of accuracy.
We are sometimes only interested in the relative error, which we can define as the
error divided by the true value of the quantity.
The digits of accuracy gives approximately the total number of digits of accuracy,
starting from the first nonzero digit. So 3.124, 3124, and 0.003124 all have 4 digits of
accuracy, whereas they have 3, 0, and 6 decimal places of accuracy, respectively.
We will see that there are two sources of error that we will have to worry about
in a computer program:
data error: error that is already present in the input data before a computation
begins. Typical sources of data error include:
– roundoff error: error due to the fact that computers use a finite number
of digits to represent numbers.
– truncation error: error due to the truncation of an infinite process, such
as calculating only a finite number of terms in a Taylor Series approxima-
tion.
Let x be the true value of some quantity, let x̃ be the same quantity with data error
in it, and let the function f(x) represent the thing we are trying to compute. Then
the
propagated data error = f(x̃) − f(x)    (5.12)
Note that the propagated data error defined in this way has nothing to do with the
computer implementation of how we calculate f: it only depends on the true definition
of f. For example, suppose we want to calculate cos(π/3) where we supply as input
the value π = 3.1416. Then the
The computational error depends on the way in which we calculate f. Suppose we define
f̂ to be the computer implementation that is used to calculate the true function f.
For example, we might use the first 3 terms of the Taylor series for f(x) = cos(x):
f̂(x) = cos x ≈ 1 − x^2/2 + x^4/24    (5.14)
We define the
computational error = f̂(x̃) − f(x̃)    (5.15)
Then for our example implementation,
computational error ≈ 1 − (3.1416/3)^2/2 + (3.1416/3)^4/24 − cos(3.1416/3)    (5.16)
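Evaluating this numerically in Mathematica (a quick check; the output below is computed, not quoted from the text):
In:=
xt = 3.1416/3;
1 - xt^2/2 + xt^4/24 - Cos[xt]
Out:=
0.00179623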
with x = 100 and y = 99 assuming an input error of (a) 0.1% and (b) 1.0% for x.
i.e., a 0.1% error in the input leads to a 21% error in the result. If we are off by as
much as a full percent, say x̃ = 101, then
f(101, 99) = 1/10000    (5.21)
hence the relative error is
[f(101, 99) − f(100, 99)]/f(100, 99) = (0.0001 − 0.00002525)/0.00002525 = 2.9601    (5.22)
Number Representation
Number representations in computers are limited because they only store a finite
number of digits. While the loss of information here is obvious for irrational numbers
such as π or √2, what is not obvious at first glance is that even simple integer
operations can be seriously affected. Before proceeding to a formal description of
number representations we present a simple example of a computer that can store 3
digit decimal numbers. What is the best way to represent this kind of number? Our
first guess might be to use a representation that contains 3 machine digits such as:
d d d
where each “d” represents a digit in the range 0, 1, . . . , 9. This is good for numbers
such as 547 or 612, but what about negative numbers such as -43? And what happens
if we try to add two numbers together, such as 547+612? There is no way to represent
1159 in this scheme. So when we try to add two numbers whose sum is larger than
999, we get an error called an "overflow."
There are two standard ways to represent negative numbers. One way is to use the
following mapping: '000' represents -500, '001' represents -499, '002' represents -498,
..., '998' represents 498, '999' represents 499. In this way we shift the representation
from one that represents all integers 0 ≤ z ≤ 999 to one which represents only integers
in the range −500 ≤ z ≤ 499. This method is called an "excess 500" representation:
the number actually stored in memory is 500 in excess of the number it represents.
A simpler method is to add a sign bit:
s d d d
where s is not a digit but a sign bit that is only allowed to take on two values: 1 or 0,
with 0 representing a positive number and 1 representing a negative number. This allows
us to represent everything in the range −999 ≤ z ≤ 999. So we represent 765 as
0 7 6 5
and -43 by
1 0 4 3
s d d d e
0 5 4 7 3
0 6 1 2 3
The answer is 1159, which cannot be represented in three digits, so some sort of
approximation scheme is needed. Two common schemes include:
chopping: drop the final digit, 1159 ≈ 0.115 × 10^4; and
rounding: round the final digit, 1159 ≈ 0.116 × 10^4.
0 1 1 5 4
0 1 1 6 4
Example 6.1. Calculate the average of two numbers x and y using the formula
average = (x + y)/2    (6.2)
for
average = 1120/2 = 560    (6.4)
which is not even between the two values 563 and 566. Had we used rounding, we
would have rounded 1129 to 1130 and obtained the answer 565, which is as close as
we can come to the exact answer 564.5.
For (b) we use rounding:
average = 1130/2 = 565    (6.6)
Again, the answer 565 is not between the two original numbers. We would have
obtained the same answer with truncation.
We will use the operator F l (for “Float” or “Floating Point”, the topic of next
section) to represent our “approximation.” When we rounded, for example, we have
One way to improve the situation we found in the previous example is to use the
revised formula
average = a + (b − a)/2    (6.11)
Mathematically, both equations 6.2 and 6.11 are identical, but they will give us
different results when implemented in a computer, because of how the Fl operator is
applied:
average = Fl( Fl(a) + Fl( Fl(Fl(b) − Fl(a)) / Fl(2) ) )    (6.12)
Solution. In (a) we used truncation to find the average of 563 and 566:
average = Fl( Fl(563) + Fl( Fl(Fl(566) − Fl(563)) / Fl(2) ) )    (6.13)
        = Fl( 563 + Fl( Fl(566 − 563)/2 ) )    (6.14)
        = Fl( 563 + Fl(3/2) )    (6.15)
        = Fl( 563 + Fl(1.5) )    (6.16)
        = Fl( 563 + 1 )    (6.17)
        = Fl(564)    (6.18)
        = 564    (6.19)
Fixed point representation: the sign and the radix point have a fixed loca-
tion:
sign digits
Most modern computers have the ability to store both fixed point and floating point
numbers; fixed point representations are typically used for integer and boolean vari-
ables. In some cases the representation will span many computer bytes. Floating
point representations may be implemented in either hardware or software or both.
For example, a typical “32 bit floating point” computer provides hardware (e.g., mem-
ory, registers, and arithmetic operations such as addition and multiplication) for a
floating point representation that uses a total of 32 bits. High level compilers such
as C or FORTRAN also provide additional representations, such as 64 bit “double
precision” and 128 bit “quadruple precision.” The details of how 8-bit bytes are
mapped to 32 bit long integers or 128 bit quadruple precision floating point reals are of
no concern to us here, only the ultimate representation. As one text says, “the details
of how numbers are represented do not concern us in numerical analysis; rather our
concern is whether a number is representable.”1
1
Skeel and Keiper, page 39.
Here “nan” is a special symbol used to mean “not a number.” Finally, there are two
different ways to represent zero, which we call 0 and −0:
x = 0    if e = m = s = 0
x = −0   if e = m = 0 and s = 1    (7.9)
The related IEEE 64-bit Standard representation can store numbers in the approx-
imate range
2.22 × 10^{-308} < x < 1.8 × 10^{308}    (7.10)
with a precision of around 15 to 16 digits (2^52 ≈ 4.5 × 10^{15}):
If e takes its maximum value (all ones, 2047 for the 64-bit format) then
x = nan   if m ≠ 0
x = −∞    if m = 0 and s = 1    (7.13)
x = ∞     if m = 0 and s = 0
Unbiased: rounds to the nearest value. If the number falls midway it is rounded
to the nearest value with an even (zero) least significant bit. This mode is
required to be the default.
The following overflow and underflow conditions that may occur as a result of an
operation are not representable and should generate error messages:
Negative overflow: negative numbers less than −(2 − 2^{-23}) × 2^{127} (32 bit) or
−(2 − 2^{-52}) × 2^{1023} (64 bit).
Negative underflow: negative numbers greater than −2^{-149} (32 bit denor-
malized (leading 0)), −2^{-126} (32 bit normalized (leading 1)), −2^{-1022} (64 bit
normalized), or −2^{-1074} (64 bit denormalized).
Positive underflow: positive numbers less than 2^{-149} (32 bit denormalized),
2^{-126} (32 bit normalized), 2^{-1022} (64 bit normalized), or 2^{-1074} (64 bit denor-
malized).
Positive overflow: positive numbers greater than (2 − 2^{-23}) × 2^{127} (32 bit)
or (2 − 2^{-52}) × 2^{1023} (64 bit).
The first numerical problem we will face is root finding: given a function f (x), find a
number r such that f (x) = 0 at x = r. The bisection algorithm uses a binary search
strategy. It assumes we already know two points a and b, one to the right of the root
and one to the left of the root. Since the two points are on opposite sides of the root,
they must be on opposite sides of the x−axis; hence either f (a) > 0 and f (b) < 0,
if the function is decreasing through the root; or f (a) < 0 and f (b) > 0, when the
function is increasing as it passes through the root. In either case,
Then we simply split the interval [a, b] in half: pick a new point
c = a + (b − a)/2    (8.2)
and calculate the product f (a)f (c). If f (a)f (c) > 0 then a and c are on the same side
of the root, so we replace a with c. If f (a)f (c) < 0 then a and c are on different sides
of the root, so we replace b with c. Then we repeat the process until our interval size
∆ = b − a < ε    (8.3)
1. It might take a long time to reach the desired ε, so it is always a good idea to
include a counter and terminate after some number N of steps regardless of how
close you've gotten. This is especially important when you are debugging the
program.
2. As you get closer and closer to the root the product f(a)f(c) will get smaller
and smaller, and could run into the level of machine accuracy. Thus it's better
to check the product Sign(f(a))Sign(f(c)) rather than the product f(a)f(c).
Algorithm Bisection
Input a, b, f, ε, N;
Let ∆ = (b − a)/2; i = 0;
If f(a)f(b) > 0, Print error message and stop;
While ∆ > ε and i < N,
    p = a + ∆;
    If f(p) = 0, Return(p);
    If Sign(f(a))Sign(f(p)) < 0,
        Let b = p;
    Otherwise,
        Let a = p;
    End If;
    ∆ = (b − a)/2;
    i = i + 1;
End While;
If i = N, Print a message saying that tolerance not reached.
Return (a + ∆).
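A direct Mathematica translation of this algorithm might look like the sketch below (the function name bisection and its argument conventions are my own, not from the text):
In:=
bisection[f_, a0_, b0_, eps_, nmax_] := Module[{a = a0, b = b0, d, p, i = 0},
  If[Sign[f[a]] Sign[f[b]] > 0, Return[$Failed]];  (* no sign change: root not bracketed *)
  d = (b - a)/2;
  While[d > eps && i < nmax,
    p = a + d;
    If[f[p] == 0, Return[p]];                      (* landed exactly on the root *)
    If[Sign[f[a]] Sign[f[p]] < 0, b = p, a = p];   (* keep the half that brackets the root *)
    d = (b - a)/2; i++];
  a + d]
bisection[#^2 - 2 &, 1.0, 2.0, 10^-8, 100]
Out:=
1.41421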
Of all the algorithms we will discuss for root finding, bisection is the slowest.
In fact, we can predict precisely the number of iterations it will take to converge.
Because the size of the interval is halved each time, it will be the smallest integer n
such that
(1/2)^n |b − a| < ε    (8.4)
Hence
n log(1/2) < log( ε/|b − a| )    (8.5)
Since log(1/2) = − log 2, n is the smallest integer for which
n > −(1/log 2) log( ε/|b − a| ) = log_2( |b − a|/ε )    (8.6)
Thus we could add a test at the beginning of our program, and just iterate n times.
This is actually more efficient, because we don’t need to do a comparison to check
the size of the interval each iteration. Here is the revised algorithm.
    p = a + (b − a)/2;
    If f(p) = 0, Return(p);
    If Sign(f(a))Sign(f(p)) < 0,
        Let b = p;
    Otherwise,
        Let a = p;
    End If;
    i = i + 1;
End While;
Return (a + (b − a)/2).
Proof. Either the algorithm reaches the exact root at some step of the iteration or it
does not. Let L = |b − a|. Let the value of a, b, and p after the ith iteration be ai , bi ,
and pi , respectively. If for some n we have
Furthermore, by construction, f (ai )f (bi ) < 0, so the root is in each interval. Therefore
each pi is a distance no larger than |ai − bi | from the root r. Hence
Therefore
0 ≤ lim_{n→∞} |p_n − r| ≤ lim_{n→∞} L/2^n = 0    (8.18)
Hence
lim_{n→∞} p_n = r    (8.19)
Example. Use the bisection method to find the root of f(x) = x^2 − 2 on [1, 2].
Solution.
Step 1.
a = 1, b = 2, f(a) = −1, f(b) = 2    (8.20)
p = (a + b)/2 = 1.5    (8.21)
f(p) = (1.5)^2 − 2 = 0.25    (8.22)
f(a)f(p) = (−)(+) < 0    (8.23)
so the root is between a and p. So we set
b = p = 1.5    (8.24)
Step 2.
a = 1, b = 1.5, f(a) = −1, f(b) = 0.25    (8.25)
p = (1 + 1.5)/2 = 1.25    (8.26)
f(p) = (1.25)^2 − 2 = −0.4375    (8.27)
f(a)f(p) = (−)(−) > 0    (8.28)
The root is between p and b, so set
a = p = 1.25    (8.29)
Step 3.
a = 1.25, b = 1.5, f(a) = −0.4375, f(b) = 0.25    (8.30)
p = (1.25 + 1.5)/2 = 1.375    (8.31)
f(p) = 1.375^2 − 2 = −0.109    (8.32)
f(p)f(a) = (−0.4375)(−0.109) > 0    (8.33)
so set
a = p = 1.375    (8.34)
The root is between a = 1.375 and b = 1.5. As we continue the process we compute the
sequence 1.5, 1.25, 1.375, 1.4375, 1.40625, 1.42188, 1.41406, ...
[Figure: the nested bisection intervals [a_0, b_0] ⊃ [a_1, b_1] ⊃ · · · ⊃ [a_5, b_5], each with its midpoint p_i, closing in on the root.]
Anyone who has ever played with their calculator by typing in a number and then
hitting the same function key repeatedly has used fixed point iteration. For example,
if you type the number 16 and then start pressing the √ key you will generate the
following sequence (this was generated with a TI-36, which has 10 digit accuracy):
x_0 = 16    (9.1)
x_1 = √x_0 = √16 = 4    (9.2)
x_2 = √x_1 = √4 = 2    (9.3)
x_3 = √x_2 = √2 = 1.414213562    (9.4)
x_4 = √x_3 = √1.414213562 = 1.189207115    (9.5)
x_5 = √x_4 = √1.189207115 = 1.090507733    (9.6)
...
Eventually, after around 30 iterations, the calculator will display something like
1.0000000000 (9.7)
In fact, this iteration has found the fixed point of the square root function
f(x) = √x    (9.9)
to within the machine epsilon of the calculator (1 part in 10^{10}), namely, the point
where
x = f(x) = √x    (9.10)
Equation 9.10 has only two solutions: x = 1 and x = 0. We have converged on the
first of these solutions. Had we started with any positive number, we still would have
converged on the solution x = 1, regardless of which number we typed in for x0 . Had
we started with x = 0 we would have converged on the other root, x = 0, and had
we started with a negative number, we would have gotten an error message.
What we are doing during this iteration is computing a sequence of function
applications:
x_1 = g(x_0)    (9.11)
x_2 = g(g(x_0)) = g^2(x_0)    (9.12)
x_3 = g(g(g(x_0))) = g^3(x_0)    (9.13)
...
x_n = g^n(x_0)    (9.14)
where we have used the notation g^k(x) to denote the repeated application of the
function g(x) k times.
Definition 9.1 (Fixed Point). A number p is called a fixed point of the function
f (x) if p = f (p).
Example 9.1. Find the fixed points of the function f (x) = x4 + 2x2 + x − 3.
Solution. We need to solve f(x) = x, i.e., x^4 + 2x^2 + x − 3 = x, for x. Hence
0 = x^4 + 2x^2 − 3    (9.16)
  = (x^2 − 1)(x^2 + 3)    (9.17)
  = (x − 1)(x + 1)(x^2 + 3)    (9.18)
so the fixed points are x = 1 and x = −1 (the factor x^2 + 3 has no real roots).
Figure 9.1: A fixed point occurs at the intersection of the curve y = f (x) with the
line y = x. If there are multiple intersections then there are multiple fixed points.
the arrow lies directly over the value of f(x_0) on the x-axis, so that by projecting
vertically to the curve of y = f(x), we intersect at f(f(x_0)) (bottom plot). We then
repeat this process, generating successive iterations, approaching closer and closer to
the fixed point (see figure 9.3) at (1, 1) = (x, √x).
Example 9.2. Find the first 25 iterations of fixed point iteration for the function
f(x) = cos x, starting from x_0 = π:
f[x_]:=Cos[x];
c=N[NestList[f, Pi, 25], 10]
Out:=
Figure 9.2: Visualization of fixed point iteration on y = √x. See text for description.
Despite the success illustrated by the rapid convergence in example 9.2, fixed point
iteration does not always work. This is illustrated by the following example.
Example 9.3. Find the fixed points of the function
f(x) = x^2 − 2    (9.20)
and then, using Mathematica, compute the result of 100 iterations of the fixed point
algorithm using x_0 = 1.9 and plot the results as we did in the previous example.
Solution. A fixed point must satisfy
x = x^2 − 2    (9.21)
0 = x^2 − x − 2    (9.22)
  = (x − 2)(x + 1)    (9.23)
So fixed points occur at x = 2 and x = −1. To compute the iterations in Mathematica,
In:=
g[x_] := x^2 - 2;
N[NestList[g, 1.9, 100], 5]
Figure 9.3: Fixed point iteration on y = √x (continued from fig. 9.1).
Out:=
Theorem 9.2 (Sufficient Condition for a Fixed Point). Suppose that f(x) is a
continuous function that maps its domain onto a subset of itself, i.e., f(x) ∈ C[a, b]
such that¹
f(x) : [a, b] ↦ S ⊂ [a, b]    (9.24)
Then f(x) has a fixed point in [a, b].
¹By C[a, b] we mean the set of all continuous functions whose domain is the interval [a, b].
Let
h(x) = f (x) − x (9.27)
Since f(x) is continuous, so is h(x). Because f maps [a, b] into [a, b], we have
h(a) = f(a) − a ≥ 0 and h(b) = f(b) − b ≤ 0.
Hence by the intermediate value theorem, h(x) has a root r ∈ (a, b), such that
h(r) = 0. But at r we have
0 = h(r) = f (r) − r (9.30)
Thus since f (r) = r, r must be a fixed point of f .
Figure 9.5: Left: The first 5 fixed point iterations on g(x) = x^2 − 2 starting from x_0 =
1.5. Right: The first 100 iterations. There is no discernible pattern of convergence;
in fact, the iteration is chaotic.
Theorem 9.3. Every continuous bounded function on the real numbers has a fixed
point.
Proof. Let
f(x) : R ↦ R    (9.31)
be continuous and bounded. Then its range has a greatest lower bound a and a least
upper bound b. Hence
f(x) : R ↦ [a, b]    (9.32)
Thus the conditions of Theorem 9.2 are met and hence f(x) has a fixed point.
Theorem 9.4 (Condition for a Unique Fixed Point). Let f(x) be a continuous
and differentiable function that maps its domain onto a subset of itself,
f(x) : [a, b] ↦ S ⊂ [a, b], and suppose that there is some constant
0 < K < 1    (9.34)
such that
|f'(x)| ≤ K    (9.35)
for all x ∈ [a, b]. Then f(x) has a unique fixed point p ∈ [a, b].
Proof. By theorem 9.3 at least one fixed point exists; call it p. Then
f(p) = p    (9.36)
Suppose that a second fixed point q ≠ p exists. Since q is also a fixed point,
q = f(q)    (9.37)
By the Mean Value theorem, there exists some number c ∈ [min(p, q), max(p, q)] such
that
f'(c) = [f(p) − f(q)]/(p − q)    (9.38)
By equation 9.35, |f'(c)| ≤ K, hence
|[f(p) − f(q)]/(p − q)| ≤ K    (9.39)
i.e.,
|f(p) − f(q)| ≤ K|p − q| < |p − q|    (9.40)
because K < 1. But by equations 9.36 and 9.37 we have
|f(p) − f(q)| = |p − q|    (9.41)
and therefore
|p − q| < |p − q|    (9.42)
Since p ≠ q we know that |p − q| ≠ 0, hence we can cancel it on both sides of the
inequality to give 1 < 1, which is a contradiction. Hence our original assumption
p ≠ q must be wrong. Thus the fixed point is unique.
Example 9.4. Show that
g(x) = π + (1/2) sin(x/2)    (9.43)
has a unique fixed point.
Solution. We first observe that g(x) is continuous and differentiable, and that
Range(g) = [π − 1/2, π + 1/2] ⊂ (−∞, ∞) = Domain(g)    (9.44)
Hence by theorem 9.3 at least one fixed point exists. To verify uniqueness we calculate
|g'(x)| = |(1/4) cos(x/2)| ≤ 1/4 < 1    (9.45)
Hence the conditions of theorem 9.4 are met with K = 1/4, and the fixed point is
unique. (See figure 9.6.)
Figure 9.6: The fixed point of f (x) = π + (1/2) sin(x/2) is unique. See example 9.4.
Example 9.5. Calculate the first four fixed point iterates of the function in the pre-
vious example, starting with x0 = π, and then use NestList to calculate the first 10
iterations to 20 digits.
Solution. In Mathematica,
In:=
g[x_] := Pi + (1/2) Sin[x/2];
N[NestList[g, Pi, 10], 20]
To find the root of a function f(x) using the fixed point algorithm, we define
g(x) = x − f(x)
If p is a root of f(x), then f(p) = 0, so g(p) = p − f(p) = p. Hence p is a fixed point
of g(x) = x − f(x). This suggests that we use the following algorithm.
Algorithm FixedPointRoot
Input f(x), a first guess p_0, and an error tolerance ε;
Define g(x) = x − f(x);
Let p = p_0;
Define ∆ = ∞;
While ∆ > ε,
    Let p_old = p;
    Let p = g(p);
    ∆ = |p − p_old|;
End While
Return (p).
Example 9.6. Use the fixed point algorithm to find √(1/2) to 25 digits accuracy.
Solution. We know that √(1/2) is a root of f(x) = x^2 − 1/2, so we form the function
g(x) = x − f(x) = x − x^2 + 1/2
We can use
In:=
Out:=
which does not give us enough digits. We also want the computer to calculate the
error for us, so that it can automatically figure out when to stop the calculations. One
way to do this is by literally translating the iterative algorithm into Mathematica:
∆ = ∞;
p = 1.0`50;
n = 0;
While[∆ > 10^-25,
  pold = p;
  p = g[p];
  ∆ = Abs[p - pold];
  n++;
];
Print["The root is ", N[p, 25], " after ", n, " iterations."];
The initialization p = 1.0`50 ensures that the data starts with 50 digit accuracy.
This is a good general rule of thumb: your data should have at least twice the
digits that you need in your final answer, although in fact it will depend upon what
kind of calculation you are doing. The output is
The root is 0.7071067811865475244008444 after 64 iterations.
That the convergence to 25 digits does occur after 64 iterations can be verified by
including a statement
Print[p]
before the end of the While loop.
Example 9.7. Repeat the previous example with √2, starting with x_0 = 1.5.
Solution. As before we observe that √2 is the root of f(x) = x^2 − 2, and so we
iterate
g(x) = x − f(x) = x − x^2 + 2
So far there is no discernible pattern; in fact, the first 100 iterations are (from Math-
ematica):
In:=
g[x_] := x - x^2 + 2;
q = NestList[g, 1.5, 120]
Out:=
So why, when things worked so well with √(1/2), does fixed point iteration fail so
miserably when we calculate √2? For one thing,
g'(x) = 1 − 2x    (9.59)
Near the root, say at x = √2 + ε, we have
|g'(√2 + ε)| = |1 − 2(√2 + ε)| ≈ |1 − 2.83 − 2ε| ≈ |−1.83 − 2ε|    (9.60)
There is no way that we can bound this number by a constant that is smaller than
1, so theorem 9.3 does not even guarantee the existence of a fixed point (even
though we know that one does, in fact, exist at √2). The next theorem gives us an
idea.
Theorem 9.5. The fixed point iteration algorithm on a function g(x) will converge
to a fixed point of g(x) if the conditions of theorem 9.4 are satisfied. More precisely,
suppose that g(x) is a continuous function on [a, b] such that g : [a, b] 7→ S ⊂ [a, b],
and that there is a positive number K < 1 such that |g 0 (x)| ≤ K on [a, b]. Then for a
starting point p0 , the sequence pn = g(pn−1 ) converges to a unique fixed point of g(x).
Proof. By theorem 9.4 a unique fixed point exists. We need to show that
lim_{n→∞} p_n = p    (9.61)
Since g : [a, b] ↦ S ⊂ [a, b], all of the p_n = g(p_{n−1}) ∈ [a, b]. Furthermore, since
p is a fixed point, p = g(p), and
If there is some n such that p_{n−1} = p then the sequence has converged, and the
theorem has been proven. So we may assume that there is no n such that p_n = p.
Since p_{n−1} ≠ p, we know by the mean value theorem that there is some point c_n
between p_{n−1} and p such that
|g'(c_n)| = |[g(p_{n−1}) − g(p)]/(p_{n−1} − p)|    (9.63)
Hence |p_n − p| = |g(p_{n−1}) − g(p)| ≤ K|p_{n−1} − p|, and applying this repeatedly gives
|p_n − p| ≤ K^n|p_0 − p|. Therefore
0 ≤ lim_{n→∞} |p_n − p| ≤ lim_{n→∞} K^n|p_0 − p| = 0    (9.71)
Example 9.8. Prove that the fixed point algorithm for g(x) = x − x^2 + 1/2 converges
to √(1/2) ≈ 0.707.
Solution. First we observe that √(1/2) is a fixed point of g(x) since
g(√(1/2)) = √(1/2) − 1/2 + 1/2 = √(1/2)    (9.72)
Next we calculate
|g'(x)| = |1 − 2x|    (9.73)
We want to determine if there is some positive constant K < 1 such that |g'(x)| ≤ K,
which requires that
−K ≤ 1 − 2x ≤ K    (9.74)
−1 − K ≤ −2x ≤ −1 + K    (9.75)
(1 − K)/2 ≤ x ≤ (1 + K)/2    (9.76)
If we try K = 0.8 then
0.1 ≤ x ≤ 0.9    (9.77)
In other words, for all x ∈ [0.1, 0.9], we have |g'(x)| ≤ K < 1. Thus the conditions
of the theorem are met for any starting point in [0.1, 0.9]. If we start with, say,
x_0 = 1/2, which is clearly in this interval, the algorithm converges by theorem 9.5.
From equation 9.70 we could calculate an error estimate based on the size of the
original interval [a, b]. Since both p and p_0 are in the interval [a, b], if we stop the
iteration after n steps the error is limited by
|p_n − p| ≤ K^n|p_0 − p| ≤ K^n|b − a|    (9.78)
Thus each iteration reduces the error by a factor of K. While this is a significant
improvement, equation 9.78 is not very useful if the interval [a, b] is especially large,
such as the whole real line. Fortunately we can make an improved estimate based on
the values of the first guess and the first iteration.
Theorem 9.6 (Error Estimate for Fixed Point Iteration). If fixed point itera-
tion is terminated after n ≥ 1 steps then the error is limited by
|p_n − p| ≤ K^n|p_1 − p_0|/(1 − K)    (9.79)
where the last step follows because |g'(c)| ≤ K. Hence by the triangle inequality,
|p_{n+1} − p| ≤ K^{n+1}|p_1 − p_0|/(1 − K)    (9.86)
We again use the Mean Value Theorem: there is some number c between p_n and p
such that
|g'(c)| = |[g(p_n) − g(p)]/(p_n − p)| = |(p_{n+1} − p)/(p_n − p)| ≤ K    (9.87)
Hence
|p_{n+1} − p| ≤ K|p_n − p|    (9.88)
Substituting equation 9.79 on the right yields equation 9.86.
Example 9.9. Estimate the number of iterations required for fixed point iteration to
converge to the fixed point of
g(x) = π + (1/2) sin(x/2)    (9.89)
with (a) 4 digit accuracy and (b) 10 digit accuracy, using p_0 = π.
Solution. By theorem 9.6, we need K^n|p_1 − p_0|/(1 − K) < ε, i.e.,
K^n < ε(1 − K)/|p_1 − p_0|    (9.91)
n log K < log[ ε(1 − K)/|p_1 − p_0| ]    (9.92)
n > (1/log K) log[ ε(1 − K)/|p_1 − p_0| ]    (9.93)
where we reversed the direction of the less-than sign because K < 1 implies
log K < 0. To find K we calculate
|g'(x)| = |(1/4) cos(x/2)| ≤ 1/4    (9.94)
so that K = 1/4 and
n > −(1/log 4) log[ ε(3/4)/|p_1 − π| ]    (9.95)
Since
p_1 = π + (1/2) sin(π/2) = π + 1/2    (9.96)
we have |p_1 − π| = 1/2, hence
n > −(1/log 4) log[ ε(3/4)/(1/2) ] = −(1/log 4) log(1.5 ε)    (9.97)
For (a), with ε = 10^{-4},
n > −(1/log 4) log(1.5 × 10^{-4}) ≈ 6.4    (9.98)
and for (b), with ε = 10^{-10},
n > −(1/log 4) log(1.5 × 10^{-10}) ≈ 16.3    (9.99)
Appendix
The fixed point plots shown in this section can be generated with the following Math-
ematica program:
In:=
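(* The original listing is missing here; what follows is a sketch of a cobweb-style
   fixed point plot. The function name cobweb and all plotting choices are mine,
   not the author's. *)
cobweb[g_, x0_, n_, {a_, b_}] := Module[{xs, pts},
  xs = NestList[g, x0, n];    (* the fixed point iterates *)
  (* alternate vertical steps to the curve and horizontal steps to the line y = x *)
  pts = Flatten[Table[{{xs[[i]], xs[[i]]}, {xs[[i]], xs[[i + 1]]}}, {i, n}], 1];
  Show[Plot[{g[x], x}, {x, a, b}], ListLinePlot[pts], PlotRange -> All]]
cobweb[Sqrt, 16.0, 10, {0, 17}]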
Newton’s Method
Suppose we already have an estimate p_0 for the root of f(x). If we project the tangent
line to f(x) at the point (p_0, f(p_0)) down to where it intersects the x-axis, this
should give us a better guess for the root, as illustrated in figure 10.1.
[Figure 10.1: successive tangent lines to f(x) carry the guesses p_0, p_1, p_2, p_3 toward the root.]
The slope of the straight line connecting the points (p_0, f(p_0)) and (p_1, 0) is
f'(p_0) = slope = rise/run = [f(p_0) − 0]/(p_0 − p_1) = f(p_0)/(p_0 − p_1)    (10.1)
Solving for p_1,
p_1 = p_0 − f(p_0)/f'(p_0)    (10.2)
This gives us the well-known formula for Newton's Method:
p_{n+1} = p_n − f(p_n)/f'(p_n)    (10.3)
Example. Use Newton's method to find √2 as the root of f(x) = x^2 − 2, starting with p_0 = 2.
Solution. To get an iteration formula for p_n we need to know the derivative of f(x),
which is
f'(x) = 2x    (10.4)
Hence the iteration formula is
p_{n+1} = p_n − (p_n^2 − 2)/(2p_n)    (10.5)
Although it is possible to simplify this algebraically, convergence of the algorithm
(which we have not proven yet) ensures that the second term above approaches zero,
and hence it is computationally preferable to leave it in this form rather than placing
the sum over a common denominator. Thus
p_1 = 2 − (2^2 − 2)/(2 × 2) = 1.5    (10.6)
p_2 = 1.5 − (1.5^2 − 2)/(2 × 1.5) = 1.4167    (10.7)
p_3 = 1.4167 − (1.4167^2 − 2)/(2 × 1.4167) = 1.4142    (10.8)
and so forth.
In general Newton's method converges extremely rapidly. The only time it will
be slow to converge is when f'(p) = 0. As the following Mathematica example
illustrates, the method converges to √2 to 50 digits, starting with p_0 = 2, in only
6 iterations.
In:=
f[x_] := x^2 - 2;
g[x_] := x - f[x]/f'[x];
NestList[g, 2.0`50, 7]
Out:=
{2.0000000000000000000000000000000000000000000000000,
1.5000000000000000000000000000000000000000000000000,
1.4166666666666666666666666666666666666666666666667,
1.414215686274509803921568627450980392156862745098,
1.414213562374689910626295578890134910116559622116,
1.414213562373095048801689623502530243614981925776,
1.41421356237309504880168872420969807856967187538,
1.41421356237309504880168872420969807856967187538}
We first observe that Newton's method is nothing more than fixed point iteration on
the function
g(x) = x − f(x)/f'(x)    (10.11)
Furthermore, since p is a root of f, it is also a fixed point of g, because
g(p) = p − f(p)/f'(p) = p − 0/f'(p) = p
Since f'(p) ≠ 0, by continuity there must be some interval U = [p − ε, p + ε] ⊂ [a, b]
about p such that f'(x) ≠ 0 for all x ∈ U. Since f(x) and f'(x) are defined and
continuous on [a, b], they are defined and continuous on U ⊂ [a, b]. Since by
construction of U, f'(x) ≠ 0 on U, g(x) is also defined and continuous on U.
Therefore
g'(x) = d/dx [ x − f(x)/f'(x) ]    (10.13)
      = 1 − [f'(x)f'(x) − f(x)f''(x)]/(f'(x))^2    (10.14)
      = f(x)f''(x)/(f'(x))^2    (10.15)
[Figure 10.2: since g'(p) = 0 and g' is continuous, there is an interval p − δ < x < p + δ on which −K ≤ g'(x) ≤ K.]
|g'(x)| ≤ K, as we see in fig. 10.2. This proves that there is some K > 0 such that
|g'(x)| ≤ K < 1 in some interval about p.
To see that g : I ↦ S ⊂ I, let x ∈ I. Then by the mean value theorem there
is a point c ∈ I between p and x such that
|g'(c)| = |g(p) − g(x)|/|p − x|    (10.16)
or
|g(p) − g(x)| = |p − x||g'(c)|    (10.17)
Since the maximum distance between p and x in I is δ,
|g(p) − g(x)| ≤ δ|g'(c)| ≤ Kδ < δ    (10.18)
because |g'(x)| ≤ K < 1. But since p is a fixed point of g, we know that g(p) = p,
and therefore
|p − g(x)| < δ    (10.19)
or equivalently,
p − δ < g(x) < p + δ (10.20)
Thus g maps I into a subset of itself, and hence all of the hypotheses of theorem 9.5
are met. Therefore fixed point iteration on g converges to the fixed point of g, which
we have already shown is a root of f . Thus Newton’s method converges.
We can also do some error analysis for Newton's method. Recall that by Taylor's
theorem (theorem 4.6),
f(p + ε) ≈ f(p) + εf'(p) + (1/2)ε^2 f''(p) + · · ·    (10.21)
         ≈ εf'(p) + (1/2)ε^2 f''(p) + · · ·    (10.22)
¹By continuously differentiable we mean that the function is continuous and differentiable and its
first derivative is also continuous.
f[x_] := x^2 - 2;
NewtonsMethod[f, 1.5`53, 10^-50]
Out:=
{6, 1.414213562373095048801688724209698078569671875376948}
Under certain conditions Newton's method will not converge, even if a root does
exist. For example, by theorem 10.1, if the derivative is not continuous in the entire
interval then it will fail. The function f(x) = x/√|x| provides an example of this
situation. The derivative is everywhere continuous except at the origin, where it
becomes infinite. The plot of f(x) is also a mirror image of itself through the origin.
In this case Newton's method can lead to cyclic iteration. A similar case can occur
if the initial point is chosen on the edge of an open interval of convergence, as with
f(x) = x/(1 + x^2) at x = 1/√3. In both cases we have a situation where x_{n+1} = −x_n
and the function is a mirror image of itself. The same thing happens with f(x) = x^2
if x = √(5/3).
A variation on Newton's method, called the Damped Newton's Method, can fix
these situations by checking whether successive iterates decrease f in magnitude. If they
do not, the step is halved until they do. The damped Newton method will always
converge to either a root or to a local minimum.
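A sketch of one damped Newton step, under the interpretation that the step length is halved until |f| decreases; the function name and details are my own choices, not the author's:
In:=
dampedNewtonStep[f_, x_] := Module[{d = f[x]/f'[x], xn},
  xn = x - d;
  While[Abs[f[xn]] >= Abs[f[x]] && Abs[d] > $MachineEpsilon,
    d = d/2; xn = x - d];   (* halve the step until |f| decreases *)
  xn]
NestList[dampedNewtonStep[#^2 - 2 &, #] &, 2.0, 5]
Out:=
{2., 1.5, 1.41667, 1.41422, 1.41421, 1.41421}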
The formula that is now known as Newton's method was actually developed by the
British mathematician Thomas Simpson (1740), who is better known as the inventor
of Simpson's rule for numerical integration.
Figure 10.3: When Newton's method fails. Top: f(x) = x/√|x| has an infinite deriva-
tive at the origin. Middle: f(x) = x/(1 + x^2) is a mirror image of itself. Newton's
method converges on the interval (−1/√3, 1/√3), diverges outside this interval, and
oscillates right on the endpoints. Bottom: f(x) = x^2 has a local minimum, but no
root, at x = 0. Newton oscillation can become trapped if x_0 = √(5/3).
Secant Method
The main problem with Newton's method is that we need to know both the function
and its derivative. If the derivative is easy to calculate this is not a problem, but
sometimes it can be very expensive computationally to calculate the derivative. One
solution to this problem is to stop calculating the derivative after the first iteration
and instead approximate it by the slope of the line connecting the two most recent
guesses (see figure).
[Figure: the secant line through (p_0, f(p_0)) and (p_1, f(p_1)) crosses the x-axis at the next estimate p_2.]
The slope of the line through the points (p_n, f(p_n)) and (p_{n−1}, f(p_{n−1})) is used to
approximate the derivative at p_n:
f'(p_n) ≈ [f(p_n) − f(p_{n−1})]/(p_n − p_{n−1})    (11.1)
The derivation is similar to the derivation of Newton's method; we just use the slope
derived here in place of the derivative:
p_{n+1} = p_n − f(p_n)/f'(p_n)    (11.2)
        = p_n − f(p_n)(p_n − p_{n−1})/[f(p_n) − f(p_{n−1})]    (11.3)
This method converges at about the same rate as Newton’s method. Here is the
algorithm.
Algorithm SecantMethod
Input f(x), p_0, p_1, tolerance ε
Let q_0 = f(p_0), q_1 = f(p_1)
Let ∆ = q_1(p_1 − p_0)/(q_1 − q_0)
While |∆| > ε,
    p_0 = p_1;
    p_1 = p_1 − ∆;
    q_0 = q_1;
    q_1 = f(p_1);
    ∆ = q_1(p_1 − p_0)/(q_1 − q_0);
End While
Return p_1
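A direct Mathematica translation of this algorithm might look like the following sketch (the name secant and the argument order are my own, not from the text):
In:=
secant[f_, pa_, pb_, eps_] := Module[{p0 = pa, p1 = pb, q0, q1, d},
  q0 = f[p0]; q1 = f[p1];
  d = q1 (p1 - p0)/(q1 - q0);   (* the secant step *)
  While[Abs[d] > eps,
    p0 = p1; p1 = p1 - d;       (* shift to the two most recent points *)
    q0 = q1; q1 = f[p1];
    d = q1 (p1 - p0)/(q1 - q0)];
  p1]
secant[#^2 - 2 &, 1.0, 2.0, 10^-10]
Out:=
1.41421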
One complaint about both Newton's method and the Secant method is that it is
difficult to estimate an error bound. With bisection, on the other hand, we had
|ε| < |a_n − b_n|/2, because we know that the actual root always lies somewhere inside
the interval [a_n, b_n]. Since successive iterations of either Newton's Method or the
Secant method will not, in general, bracket the root, we cannot make this type of
simple error limit. The Method of Regula Falsi (Method of False Position) is a
modification of the Secant Method that ensures that successive iterations bracket the
root, at some (sometimes significant) cost of execution time.
Here is the idea behind the algorithm. We initially start with two guesses p_0, p_1
that are known to bracket the root, and then calculate p_2 using the secant method. If
the root is bracketed by p_1 and p_2 (i.e., f(p_1)f(p_2) < 0),
then the root is in the interval [p_1, p_2], so we use p_1 and p_2 to calculate p_3, and p_1 and
p_3 become our next initial values. Otherwise, we use p_0 and p_2 to calculate p_3, and
p_0 and p_3 become our next values. The algorithm is shown on the next page.
Versions of the method of false position were cited in the Vaishali Ganit, written
in India around the 3rd century BC, and The Nine Chapters on the Mathematical Art,
written in China a century or two later. It was well known by the middle ages and
was cited by Fibonacci in his text Liber Abaci, written in 1202.
lim_{n→∞} ε_{n+1}/ε_n^k = lim_{n→∞} |p_{n+1} − p|/|p_n − p|^k = λ    (12.1)
lim_{n→∞} ε_{n+1}/ε_n = lim_{n→∞} |p_{n+1} − p|/|p_n − p| = λ    (12.2)
lim_{n→∞} ε_{n+1}/ε_n^2 = lim_{n→∞} |p_{n+1} − p|/|p_n − p|^2 = λ    (12.3)
The following is a good general rule of thumb: the higher the order of conver-
gence, the faster the sequence converges. To see why this general rule is true,
suppose, for example, that a_n → p linearly with asymptotic error constant λ and that
b_n → p quadratically with the same asymptotic error constant λ. Then
lim_{n→∞} |a_{n+1} − p|/|a_n − p| = λ = lim_{n→∞} |b_{n+1} − p|/|b_n − p|^2    (12.4)
where ∆ = |a0 − p|. Now suppose that b0 = a0 . Then for sufficiently large n,
         Linear       Quadratic
λ        0.5          0.5            0.9           0.99
n = 1    0.25         0.125          0.729         0.97
n = 2    0.125        7.8 × 10^-3    0.478         0.932
n = 3    0.0625       3.1 × 10^-5    0.206         0.860
n = 4    0.0312       6.6 × 10^-10   0.0382        0.732
n = 5    0.0156       1.1 × 10^-19   0.00131       0.531
n = 6    0.0078       5.9 × 10^-39   1.5 × 10^-6   0.279
n = 7    0.0039       1.7 × 10^-77   2.1 × 10^-12  0.0771
n = 8    0.0019       1.5 × 10^-154  4.1 × 10^-24  5.9 × 10^-3
n = 9    9.7 × 10^-4  1.1 × 10^-308  1.5 × 10^-47  3.4 × 10^-5
n = 10   4.9 × 10^-4  6.2 × 10^-617  2.2 × 10^-94  1.2 × 10^-9
Example 12.1. Suppose we know two different algorithms to find √2, one of which is
linearly convergent with error constant λ = 1/2, and the other is quadratically conver-
gent with error constant λ = 1/2. Assuming our initial error is ∆ = 1, estimate the
number of iterations each algorithm will require to converge to 50 significant figures.
Solution. For the linearly convergent algorithm we have ε_n ≈ λ^n ∆, hence
10^{-50} > (1/2)^n × (1)    (12.18)
2^n > 10^{50}    (12.19)
n log 2 > 50 log 10    (12.20)
n > 50 log 10/log 2 ≈ 50/0.301 ≈ 166    (12.21)
For the quadratically convergent sequence ε_n ≈ (λ∆)^{2^n}/λ = λ^{2^n − 1} for ∆ = 1. Hence
10^{-50} > (1/2)^{2^n − 1}    (12.22)
2^{2^n − 1} > 10^{50}    (12.23)
(2^n − 1) log 2 > 50 log 10    (12.24)
2^n > 1 + 50 log 10/log 2 ≈ 167    (12.25)
n log 2 > log 167    (12.26)
n > log 167/log 2 ≈ 7.4    (12.27)
so 8 iterations will suffice.
Example 12.2. An iteration formula to find ∛7 as the root of f(x) = x^3 − 7, that
can be derived using Newton's method, is
g(x) = x − (x^3 − 7)/(3x^2)    (12.28)
Show that this algorithm converges quadratically.
Solution. Let x = p_n. Then
ε_{n+1}/ε_n^2 = [g(x) − 7^{1/3}]/(x − 7^{1/3})^2    (12.29)
            = [x − (x^3 − 7)/(3x^2) − 7^{1/3}]/[x^2 − 2(7^{1/3})x + 7^{2/3}]    (12.30)
            = [2x^3 − 3(7^{1/3})x^2 + 7]/[3x^4 − 6(7^{1/3})x^3 + 3(7^{2/3})x^2]    (12.31)
Hence, since we know that p_n → ∛7 as n → ∞,
lim_{n→∞} ε_{n+1}/ε_n^2 = 7^{-1/3} ≈ 0.522    (12.35)
which proves that the iteration converges quadratically with asymptotic error constant
λ ≈ 0.522.
Hence
lim_{n→∞} ε_{n+1}/ε_n^2 = f''(p)/(2f'(p))    (12.40)
Theorem 12.2. If all of the conditions of the fixed point theorem (theorem 9.5) are
met, and g'(p) ≠ 0, then the fixed point algorithm converges (at least) linearly.
We observe that this says that fixed point iteration converges at least linearly; this
does not mean that every fixed point algorithm only converges linearly. As we saw
above, Newton's method, which is a type of fixed point iteration, in fact converges
quadratically. So this theorem says that convergence is linear or better, i.e., k ≥ 1.
Proof. Since p is a fixed point, p = g(p). Let p_1, p_2, . . . be the sequence of fixed point
iterates p_{n+1} = g(p_n). Then by the mean value theorem, for each n there is a number
c_n between the fixed point p and the nth fixed-point iterate p_n such that
p_{n+1} − p = g(p_n) − g(p) = g'(c_n)(p_n − p)    (12.42)
Therefore
lim_{n→∞} |(p_{n+1} − p)/(p_n − p)| = lim_{n→∞} |g'(c_n)|    (12.43)
Furthermore, since the conditions of theorem 9.5 are met, we know that p_n → p and
therefore
0 ≤ lim_{n→∞} |c_n − p| ≤ lim_{n→∞} |p_n − p| = 0    (12.45)
hence
lim_{n→∞} c_n = p    (12.46)
Therefore
lim_{n→∞} g'(c_n) = g'(p)    (12.47)
Thus the sequence converges linearly with asymptotic error constant λ = |g'(p)|.
Theorem 12.3. Let I be an open interval and suppose that the following conditions
hold:
3. g'(p) = 0;
4. g''(p) ≠ 0;
5. |g'(x)| ≤ K < 1 on I;
[p − δ, p + δ] ⊂ I    (12.49)
Since |g''(p)| ≠ 0, the sequence converges quadratically with asymptotic error constant
|g''(p)|/2.
One could ask the following question: given any linearly convergent sequence, how
can we turn it into a quadratically convergent sequence? One way to do this is as
follows. Let p be a root of f(x); the goal is to find a method that converges to p
quadratically. Since f(p) = 0, we can form a function
g(x) = x − h(x)f(x)    (12.57)
where h(x) is any function. But now g(p) = p − h(p)f(p) = p, so p is a fixed point of
g. By theorem 12.3, we need g'(p) = 0 to get quadratic convergence:
0 = g'(p)    (12.58)
  = 1 − h'(p)f(p) − h(p)f'(p)    (12.59)
  = 1 − h(p)f'(p)    (12.60)
or
h(p) = 1/f'(p)    (12.61)
so long as f'(p) ≠ 0. Substituting equation 12.61 into equation 12.57,
g(x) = x − f(x)/f'(x)    (12.62)
which is precisely Newton's method.
A root p of f(x) is said to have multiplicity m if we can write
f(x) = (x − p)^m q(x)    (12.63)
and
lim_{x→p} q(x) ≠ 0    (12.64)
If q is continuous this also means that q(p) ≠ 0. A simple zero or simple root is
a zero of multiplicity 1. Roots of multiplicity m > 1 are called repeated roots.
Example 12.4. The function f(x) = (x − 2)^2(x − 3) has a simple root at x = 3 and
a root of multiplicity 2 at x = 2.
Theorem 12.4. Let f(x) be a continuously differentiable function on [a, b]. Then f
has a simple zero p ∈ (a, b) if and only if f(p) = 0 and f'(p) ≠ 0.
Proof. Since this is an "if-and-only-if" theorem we need to prove two things:
(a) If p is a simple root then f(p) = 0 and f'(p) ≠ 0; and
and
f(x) = (x − p)q(x)    (12.66)
Since f is continuously differentiable, so is q. In particular, q is continuous at p,
which means that
lim_{x→p} q(x) = q(p)    (12.67)
lim_{x→p} c = p    (12.73)
Let
q(x) = f'(c)    (12.74)
then
f(x) = q(x)(x − p)    (12.75)
where
lim_{x→p} q(x) = lim_{x→p} f'(c) = f'(p)    (12.78)
              ≠ 0    (12.79)
Theorem 12.6. Suppose that f(x) is continuously differentiable on [a, b] and has a
root of multiplicity m > 1 at p ∈ (a, b). Then p is a simple root of μ(x) = f(x)/f'(x).
Proof. Since f(x) has a root of multiplicity m > 1, there is some function g(x)
such that
f(x) = (x − p)^m g(x)    (12.81)
where g(p) ≠ 0. Differentiating,
f'(x) = m(x − p)^{m−1} g(x) + (x − p)^m g'(x)    (12.82)
Therefore
(x − p)m g(x)
µ(x) = (12.83)
m(x − p)m−1 g(x) + (x − p)m g 0 (x)
(x − p)m−1 (x − p)g(x)
= (12.84)
m(x − p)m−1 g(x) + (x − p)m−1 (x − p)g 0 (x)
(x − p)g(x)
= (12.85)
mg(x) + (x − p)g 0 (x)
= (x − p)q(x) (12.86)
where
g(x)
q(x) = (12.87)
mg(x) + (x − p)g 0 (x)
Since g(p) 6= 0,
g(p) 1
q(p) = 0
= 6= 0 because m > 1 (12.88)
mg(p) + (p − p)g (p) m
hence
µ(x) = (x − p)q(x) (12.89)
where q(p)] 6= 0. Since µ(p) = 0 then p is a root of µ; since q(p) 6= 0, it is a simple
root.
Therefore we know that Newton's method will converge quadratically to a root of μ(x) even though it will only converge linearly to a repeated root of f(x). Using Newton's method to find the simple root of μ(x) gives

g(x) = x − μ(x)/μ'(x)   (12.90)

The function g has a fixed point at any root of μ(x), and the iteration

x_{n+1} = x_n − μ(x_n)/μ'(x_n)   (12.91)

converges quadratically because μ'(p) ≠ 0. But since μ(x) = f(x)/f'(x), the quotient rule for differentiation gives

μ'(x) = (f'·f' − f·f'')/(f')^2   (12.92)
      = (f'^2 − f f'')/f'^2   (12.93)

Therefore,

g(x) = x − [f(x)/f'(x)] / {[f'(x)^2 − f(x)f''(x)]/f'(x)^2}   (12.94)
     = x − f'(x)f(x)/[f'(x)^2 − f(x)f''(x)]   (12.95)

This gives us the following quadratically convergent iteration formula:

x_{n+1} = x_n − f'(x_n)f(x_n)/[(f'(x_n))^2 − f(x_n)f''(x_n)]   (12.96)

The problem with this formula arises from the fact that both f(p) and f'(p) are zero, and therefore as the iteration approaches the root both (f'(x_n))^2 and f(x_n)f''(x_n) are very small numbers: taking the difference of two very small numbers can lead to round-off errors.
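A sketch of this comparison in Mathematica, using the repeated root at x = 2 of f(x) = (x − 2)^2(x − 3) and an arbitrary starting guess of 1.5 (both are assumptions of this illustration):

f[x_] := (x - 2)^2 (x - 3);
newton[x_] := x - f[x]/f'[x];
modified[x_] := x - f'[x] f[x]/(f'[x]^2 - f[x] f''[x]);   (* equation 12.96 *)
NestList[newton, 1.5, 8] - 2     (* errors shrink by roughly 1/2 per step *)
NestList[modified, 1.5, 5] - 2   (* errors are roughly squared at each step *)

Running the modified iteration much past machine precision eventually divides one tiny number by another, which is exactly the round-off hazard just described.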
Given a sequence p_0, p_1, p_2, . . ., define the forward differences Δp_n = p_{n+1} − p_n, Δ²p_n = Δ(Δp_n) = p_{n+2} − 2p_{n+1} + p_n, and so forth.
Aitken's method (named for its inventor, the New Zealand mathematician Alexander Aitken, 1895-1967) is based on the following observation. For any linearly convergent method,

lim_{n→∞} (p_{n+1} − p)/(p_n − p) = λ > 0   (13.9)
Hence

p = (p_{n+2} p_n − p_{n+1}^2)/Δ²p_n   (13.19)
  = (p_{n+2} p_n − p_{n+1}^2)/Δ²p_n + p_n − p_n   (13.20)
  = p_n + (p_{n+2} p_n − p_{n+1}^2 − p_n Δ²p_n)/Δ²p_n   (13.21)

Expanding the numerator,

p_{n+2} p_n − p_{n+1}^2 − p_n(p_{n+2} − 2p_{n+1} + p_n) = −(p_{n+1} − p_n)^2 = −(Δp_n)^2

Therefore,

p = p_n − (Δp_n)^2/Δ²p_n   (13.28)
Theorem 13.1 (Aitken). Suppose that p_n → p linearly, and that there is some number N such that (p_{n+1} − p)(p_n − p) > 0 for all n > N. Then the sequence

q_n = p_n − (Δp_n)^2/Δ²p_n   (13.31)

converges to p faster than p_n does, in the sense that, with δ_n = (q_n − p)/(p_n − p),

lim_{n→∞} δ_n = 0   (13.34)
Johan Frederik Steffensen (1873-1961) observed that the sequence would converge faster if we started each iteration with (q_i, g(q_i), g(g(q_i))) instead of (p_i, g(p_i), g(g(p_i))). The difference between the two methods (which is subtle) is illustrated in figure 13.1.

Figure 13.1: Top: In Aitken's method, at the end of each iteration, the next iteration begins by setting p_0 = p_1. Bottom: In Steffensen's method we set p_0 = q. In both methods, p_1 = f(p_0), p_2 = f(p_1), and q is computed from equation 13.31.
The following algorithm uses Aitken's method to find the fixed point of the function f(x).

Algorithm Aitken
  Input f(x), p_0, tolerance ϵ
  Let δ = ∞
  While δ > ϵ,
    p_1 = f(p_0);
    p_2 = f(p_1);
    Δp = p_1 − p_0;
    ΔΔp = (p_2 − p_1) − Δp;
    p = p_0 − (Δp)^2/ΔΔp;
    δ = |p − p_0|;
    p_0 = p_1 (this is where Steffensen's method differs)
  End While
  Return p

Algorithm Steffensen
  Input f(x), p_0, tolerance ϵ
  Let δ = ∞
  While δ > ϵ,
    p_1 = f(p_0);
    p_2 = f(p_1);
    Δp = p_1 − p_0;
    ΔΔp = (p_2 − p_1) − Δp;
    p = p_0 − (Δp)^2/ΔΔp;
    δ = |p − p_0|;
    p_0 = p (this is where Aitken's method differs)
  End While
  Return p
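A minimal Mathematica transcription of the Steffensen algorithm (the variable names follow the pseudocode; no guard against ΔΔp = 0 is included in this sketch):

steffensen[f_, p0_, eps_] := Module[{base = p0, p, p1, p2, dp, ddp, delta = Infinity},
  While[delta > eps,
    p1 = f[base]; p2 = f[p1];
    dp = p1 - base;
    ddp = (p2 - p1) - dp;
    p = base - dp^2/ddp;
    delta = Abs[p - base];
    base = p;   (* for Aitken's method this line would read base = p1 *)
  ];
  p
];
steffensen[Sqrt[1 + #] &, 1.0, 10.^-10]   (* fixed point of sqrt(1+x): the golden ratio, ≈ 1.61803 *)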
Both Aitken's method and Steffensen's method can be used to find the fixed point of a function. To find a root of a function we have the following algorithm, which significantly improves the rate of convergence of Newton's method when there are repeated roots (e.g., for functions such as f(x) = (x − 2)^2).

Algorithm Newton-Steffensen
  Input f(x), p_0, tolerance ϵ
  Define g(x) = x − f(x)/f'(x);
  p = Steffensen(g, p_0, ϵ);
  Return p.
Synthetic Division and Horner's Method

P(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n   (14.2)
     = a_0 + x(a_1 + a_2 x + a_3 x^2 + · · · + a_n x^{n−1})   (14.3)
     ⋮
     = a_0 + x(a_1 + x(a_2 + x(a_3 + · · · + x(a_{n−1} + a_n x))) · · · )   (14.4)
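The nested form 14.4 translates directly into a one-line Mathematica function (a sketch; coeffs lists the coefficients a_0, . . . , a_n in ascending order):

nested[coeffs_List, x_] := Fold[#1 x + #2 &, 0, Reverse[coeffs]];
nested[{-5, 0, -2, 1}, 1]   (* evaluates x^3 - 2x^2 - 5 at x = 1, giving -6 *)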
Some (or all) of the roots may be complex. Since the complex roots of a polynomial with real coefficients come in conjugate pairs, the total number of complex roots must be even. Thus a polynomial of odd degree always has at least one real root. If the unique roots are given as r_1, r_2, . . . , r_k, each with multiplicity m_1, m_2, . . . , m_k, then we can always write a polynomial as

P(x) = a_n (x − r_1)^{m_1} (x − r_2)^{m_2} · · · (x − r_k)^{m_k}
Descartes proposed in 1637 that one could imagine that there were n roots to a polynomial. Albert Girard (1629) proposed that an nth order polynomial has n roots, but that they may exist in a field larger than the complex numbers. The first published proof of the fundamental theorem of algebra was by d'Alembert in 1746, but his proof was based on an earlier theorem that itself used the theorem, and hence is unsatisfactory. At about the same time Euler proved it for polynomials with real coefficients up to degree 6. Between 1799 (in his doctoral dissertation) and 1816 Gauss published three different proofs for polynomials with real coefficients, and in 1849 he proved the general case for polynomials with complex coefficients.
For example, if two lines agree at two points, they are identical; if two parabolas match at three points, they are identical; and so on.
Theorem 14.3 (Horner's Method for Synthetic Division). Let P(x) be any polynomial of degree n, given by

P(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n   (14.6)

Then for any number x_0 there exists another polynomial Q(x) of degree n − 1, given by

Q(x) = b_1 + b_2 x + b_3 x^2 + · · · + b_n x^{n−1}   (14.7)

such that

P(x) = (x − x_0)Q(x) + b_0   (14.8)

where b_n = a_n and

b_k = a_k + b_{k+1} x_0   (14.9)

for k = n − 1, n − 2, . . . , 0. Furthermore, b_0 = P(x_0) and

P'(x_0) = Q(x_0)   (14.10)

Proof. Write Q(x) in the form of equation 14.7 for some undetermined numbers b_1, . . . , b_n. Then we ask what conditions will ensure that

P(x) = (x − x_0)Q(x) + b_0   (14.12)
Expanding,

(x − x_0)Q(x) + b_0 = b_0 + (x − x_0)(b_n x^{n−1} + b_{n−1} x^{n−2} + · · · + b_3 x^2 + b_2 x + b_1)   (14.13)
  = b_0 + x(b_n x^{n−1} + b_{n−1} x^{n−2} + · · · + b_3 x^2 + b_2 x + b_1) − x_0(b_n x^{n−1} + b_{n−1} x^{n−2} + · · · + b_3 x^2 + b_2 x + b_1)   (14.14)
  = b_n x^n + b_{n−1} x^{n−1} + · · · + b_2 x^2 + b_1 x − b_n x_0 x^{n−1} − x_0 b_{n−1} x^{n−2} − · · · − x_0 b_3 x^2 − x_0 b_2 x − x_0 b_1 + b_0   (14.15)
  = b_n x^n + (b_{n−1} − b_n x_0)x^{n−1} + (b_{n−2} − x_0 b_{n−1})x^{n−2} + · · · + (b_2 − x_0 b_3)x^2 + (b_1 − x_0 b_2)x + (b_0 − x_0 b_1)   (14.16)

Matching coefficients with

P(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n   (14.18)

gives

a_n = b_n   (14.19)
a_{n−1} = b_{n−1} − b_n x_0   (14.20)
a_{n−2} = b_{n−2} − b_{n−1} x_0   (14.21)
⋮
a_0 = b_0 − x_0 b_1   (14.22)

Rearranging,

b_n = a_n   (14.23)
b_{n−1} = a_{n−1} + b_n x_0   (14.24)
b_{n−2} = a_{n−2} + b_{n−1} x_0   (14.25)
⋮
b_0 = a_0 + b_1 x_0   (14.26)
Differentiating equation 14.8 gives P'(x) = Q(x) + (x − x_0)Q'(x). Setting x = x_0, hence

P'(x_0) = Q(x_0)   (14.29)

which gives us equation 14.10.
The following gives a recapitulation of the algorithm for Horner’s method to cal-
culate the numbers P (x0 ) and P 0 (x0 ) for a polynomial.
Algorithm Horner
Input a0 , . . . , an , x0 ;
Set y = an ; (y will give the bn for P )
Set z = an ; (z gives the bn−1 for Q)
For j = n − 1, n − 2, . . . , 1,
y = x0 y + aj ; (this gives bj for P (x0 ))
z = x0 z + y; (this gives bj−1 for the calculation of Q(x0 ) )
End For;
y = x0 y + a0 ; (this gives b0 )
Return y (which is P (x0 )) and z (which is P 0 (x0 ) = Q(x0 ))
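The algorithm transcribes directly into Mathematica. This sketch takes the coefficients a_0, . . . , a_n in ascending order and returns the pair {P(x_0), P'(x_0)}:

horner[coeffs_List, x0_] := Module[{n = Length[coeffs] - 1, a, y, z, j},
  a[j_] := coeffs[[j + 1]];
  y = a[n]; z = a[n];
  For[j = n - 1, j >= 1, j--,
    y = x0 y + a[j];   (* builds the b_j for P(x0) *)
    z = x0 z + y;      (* builds the coefficients for Q(x0) *)
  ];
  {x0 y + a[0], z}     (* {P(x0), P'(x0)} *)
];
horner[{-5, 0, -2, 1}, 1]   (* {-6, -1}; compare Example 14.1 below *)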
We can make two interesting observations about Horner’s method. First, it has
the same number of multiplications as nested multiplication, making it at least as
efficient as that algorithm. Secondly, it gives us a number for both P (x0 ) and P 0 (x0 )
for no extra cost. This becomes useful in operations where both numbers are needed,
such as in the calculation of Newton’s method (for the roots of a polynomial).
Let x0 be a root of P . Then we know that there exists a second polynomial Q(x)
such that P (x) = (x − x0 )Q(x) + P (x0 ) = (x − x0 )Q(x). So if P has any other
roots that are different from x0 then they are also roots of Q. Hence if we repeat the
process on Q iteratively we will find all the subsequent roots of P . Unfortunately
this leads to round-off error that can be avoided by using a different algorithm that
we will discuss subsequently.
Example 14.1. Find P(1) and P'(1) for P(x) = x^3 − 2x^2 − 5 using Horner's method.

Solution. Here a_3 = 1, a_2 = −2, a_1 = 0, a_0 = −5, and x_0 = 1. For P,

b_3 = a_3 = 1   (14.30)
b_2 = a_2 + b_3 x_0 = −2 + (1)(1) = −1   (14.31)
b_1 = a_1 + b_2 x_0 = 0 + (−1)(1) = −1   (14.32)
b_0 = a_0 + b_1 x_0 = −5 + (−1)(1) = −6   (14.33)

so P(1) = b_0 = −6. Repeating the process on Q(x) = b_1 + b_2 x + b_3 x^2,

c_2 = b_3 = 1   (14.34)
c_1 = b_2 + c_2 x_0 = (−1) + (1)(1) = 0   (14.35)
c_0 = b_1 + c_1 x_0 = (−1) + (0)(1) = −1   (14.36)

so P'(1) = Q(1) = c_0 = −1.
Müller's Method

Müller's method is based on the idea that if a straight line is good, then a parabola is better. It is really a modification of the secant method, replacing the projection of a secant line with the projection of a parabola, fit to three consecutive points on the curve, to find the next guess. Suppose we know the value of f at the three points x = p, x = q, and x = r on the curve of f(x). Then we need to find a parabola through the three points

(p, f(p)), (q, f(q)), (r, f(r))   (15.1)

Figure 15.1: Illustration of Müller's method. A parabola is fit to three points on the curve, and the intersection of the parabola with the x-axis is used for the next guess of the root.
Thus

a = [(r − p)(f(q) − f(p)) − (q − p)(f(r) − f(p))] / [(q − p)(r − p)(q − r)]   (15.25)

Next we multiply equation 15.16 by (r − p)^2 and equation 15.17 by (q − p)^2, which gives

(r − p)^2 (f(q) − f(p)) = a(q − p)^2 (r − p)^2 + b(q − p)(r − p)^2   (15.26)
(q − p)^2 (f(r) − f(p)) = a(r − p)^2 (q − p)^2 + b(r − p)(q − p)^2   (15.27)

Subtracting and solving for b,

b = [(q − p)^2 (f(r) − f(p)) − (r − p)^2 (f(q) − f(p))] / [(q − p)(r − p)(q − r)]   (15.33)

Müller's method uses the intersection of the parabola with the x-axis as the next guess. Given three guesses p, q, r, the parabola y = a(x − p)^2 + b(x − p) + c, where c = f(p), intersects the axis at the point s satisfying

0 = a(s − p)^2 + b(s − p) + c

and therefore

s = p + [−b ± sqrt(b^2 − 4ac)]/(2a)   (15.36)

where a and b are given by equations 15.25 and 15.33.
If b is a large positive number then the positive root

δ_+ = [−b + sqrt(b^2 − 4ac)]/(2a)   (15.37)
has two large and nearly equal numbers being subtracted in the numerator; this could lead to roundoff errors. To improve our accuracy we rearrange by rationalizing the numerator:

δ_+ = {[−b + sqrt(b^2 − 4ac)]/(2a)} × {[−b − sqrt(b^2 − 4ac)]/[−b − sqrt(b^2 − 4ac)]}   (15.38)
    = (b^2 − b^2 + 4ac)/(2a[−b − sqrt(b^2 − 4ac)])   (15.39)
    = −2c/(b + sqrt(b^2 − 4ac))   (15.40)

There is no roundoff error here because now we are adding two large positive numbers in the denominator, and not subtracting them. Thus if b is large and positive, our two intersection points are

s = p + [−b − sqrt(b^2 − 4ac)]/(2a)   (15.41)
s = p − 2c/(b + sqrt(b^2 − 4ac))   (15.42)

If b is a large negative number the same argument applies with the signs reversed. Since we don't know up front which, if either, special case occurs, we can do the following: choose the sign of the square root to agree with the sign of b. This will work in either case! Hence

s = p − 2c/(b + sign(b) sqrt(b^2 − 4ac))   (15.45)

This assures that of the two possible roots of the parabola, the one closest to p will be selected.
Müller's algorithm also uses Horner's method to evaluate the polynomial (it ignores the derivatives since they aren't really needed). The iteration to find a root of a polynomial with coefficients given by a_0, . . . , a_n repeatedly applies equation 15.45 to the three most recent guesses.
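A sketch of one step of the iteration in Mathematica, using equations 15.25, 15.33, and 15.45 (here f is evaluated directly rather than by Horner's method, and the sign convention assumes real b):

mullerStep[f_, {p_, q_, r_}] := Module[{fp = f[p], fq = f[q], fr = f[r], den, a, b, c},
  den = (q - p) (r - p) (q - r);
  a = ((r - p) (fq - fp) - (q - p) (fr - fp))/den;        (* equation 15.25 *)
  b = ((q - p)^2 (fr - fp) - (r - p)^2 (fq - fp))/den;    (* equation 15.33 *)
  c = fp;
  {p - 2 c/(b + Sign[b] Sqrt[b^2 - 4 a c]), p, q}         (* equation 15.45; newest guess first *)
];
Nest[mullerStep[#^3 - 7 &, #] &, {2., 1.5, 1.}, 6][[1]]   (* ≈ 1.91293, the real cube root of 7 *)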
Linear Systems
In this section we will study the solution of a linear system of n equations with n
unknowns. We cover it briefly here because some understanding of the problem will
be necessary in our study of interpolation. However, this subject is normally part of
the Math 481A curriculum and hence will not be covered in any detail here.
Given a square n × n matrix A and n numbers b1 , . . . , bn , we would like to solve
the linear system
Ax = b (16.1)
Since it is generally numerically inefficient to compute an inverse (it generally requires O(n^3) operations), we will not solve the system as

x = A^{−1} b   (16.2)
although this is technically correct. Instead we will use the process of Gaussian
elimination. We begin by observing that if we can transform equation 16.1 into a
form
T x = b0 (16.3)
where T is an upper triangular matrix, and b' is a modified version of b, then we can read the solution for x_n off the bottom row of the matrix, namely,

x_n = b'_n / T_nn   (16.4)

The matrix T is said to be in Row Echelon Form. The second to the last row of
the system 16.3 only depends on two variables, xn and xn−1 . Once we read off xn
then we can solve for xn−1 . This process of back substitution moves back up the
matrix one line at a time, solving for one variable at each step.
The idea is to keep repeating this process until there is only one equation in the reduced system. The result is an "upper triangular system." If the original matrix system is

[ a_11  a_12  a_13  · · ·  a_1n ] [ x_1 ]   [ b_1 ]
[ a_21  a_22  a_23  · · ·  a_2n ] [ x_2 ]   [ b_2 ]
[ a_31  a_32  a_33  · · ·  a_3n ] [ x_3 ] = [ b_3 ]
[  ⋮     ⋮     ⋮    ⋱     ⋮   ] [  ⋮  ]   [  ⋮  ]
[ a_n1  a_n2  a_n3  · · ·  a_nn ] [ x_n ]   [ b_n ]

then the reduced matrix system is

[ a_11  a_12   a_13   · · ·  a_1n  ] [ x_1 ]   [ b_1  ]
[  0    a'_22  a'_23  · · ·  a'_2n ] [ x_2 ]   [ b'_2 ]
[  0     0     a'_33  · · ·  a'_3n ] [ x_3 ] = [ b'_3 ]
[  ⋮                  ⋱      ⋮   ] [  ⋮  ]   [  ⋮   ]
[  0     0      0     · · ·  a'_nn ] [ x_n ]   [ b'_n ]
This process is called Gaussian Reduction. We can then solve the system by
starting on the bottom equation for xn , then the second from the bottom for xn−1 ,
and so forth, until we obtain x1 . This second step is called back substitution.
Example 16.1. Use Gaussian elimination and back substitution to solve the system

x + 2y + 3z = 5
4x + 5y + 2z = 10
2x + 8y + 5z = 15

Solution. The first step is to subtract multiples of the first row from each of the remaining two rows to make the coefficients of x zero in rows 2 and 3 of the system. Since the coefficient of x is 1 in the first row, 4 in the second row, and 2 in the third row, we subtract four times the first row from the second row, and twice the first row from the third row.
[ 1         2         3       ] [ x ]   [ 5         ]
[ 4−4(1)    5−4(2)    2−4(3)  ] [ y ] = [ 10−4(5)   ]
[ 2−2(1)    8−2(2)    5−2(3)  ] [ z ]   [ 15−2(5)   ]

[ 1    2    3   ] [ x ]   [  5  ]
[ 0   −3   −10  ] [ y ] = [ −10 ]
[ 0    4   −1   ] [ z ]   [  5  ]
Now the first column is all zeroes (except for the first row). The next step is to
subtract a multiple of the second row from the third row to get a zero in the second
entry of the third row. Since the coefficient of y is -3 in the second row and 4 in the
third row, we can add 4/3 times the second row to the third row.
[ 1        2               3              ] [ x ]   [ 5              ]
[ 0       −3              −10             ] [ y ] = [ −10            ]
[ 0    4+(4/3)(−3)    −1+(4/3)(−10)       ] [ z ]   [ 5+(4/3)(−10)   ]

[ 1    2     3     ] [ x ]   [   5   ]
[ 0   −3    −10    ] [ y ] = [  −10  ]
[ 0    0   −43/3   ] [ z ]   [ −25/3 ]
This completes the Gaussian elimination. We can then read off the solution by back-substitution. From the third row of the matrix,

z = (−25/3)/(−43/3) = 25/43

From the second row, −3y − 10z = −10, hence

y = −(1/3)(−10 + 10(25/43)) = 60/43

Finally, from the first row, we have x + 2y + 3z = 5, so

x = 5 − 2(60/43) − 3(25/43) = 20/43
We can write a simple recursive algorithm for Gaussian elimination as

Algorithm LinearSolve
  Input: A, b
  n = dimension(b)
  If n > 1,
    {A', b'} = Reduce(A, b)
    {x_2, . . . , x_n} = LinearSolve(A', b')
  End if
  x_1 = (b_1 − a_12 x_2 − a_13 x_3 − · · · − a_1n x_n)/a_11
  Return {x_1, x_2, . . . , x_n}
Algorithm Reduce
Input: A, b
n = dimension(b)
For k = 2, . . . , n,
m = ak1 /a11
For j = 2, . . . , n,
a0k−1,j−1 = akj − ma1j
End For
b0k−1 = bk − mb1
End For
Return {A0 , b0 }
The recursive algorithm can be almost literally translated into Mathematica:

reduce[A_, b_] := Module[{n, j, k, Aprime, bprime, m, row},
  n = Length[b];
  Aprime = {}; bprime = {};
  For[k = 2, k <= n, k++,
    m = A[[k, 1]]/A[[1, 1]];
    row = {};
    For[j = 2, j <= n, j++,
      AppendTo[row, A[[k, j]] - m*A[[1, j]]];
    ];
    AppendTo[Aprime, row];
    AppendTo[bprime, b[[k]] - m*b[[1]]];
  ];
  Return[{Aprime, bprime}];
];
In Mathematica we can also solve the system directly by using the built-in function LinearSolve[A,b].
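The recursion itself can be completed with a short companion function (a sketch that assumes the reduce function above; the lowercase name avoids colliding with the built-in LinearSolve):

linearSolve[A_, b_] := Module[{n = Length[b], Ap, bp, rest, x1},
  If[n == 1, Return[{b[[1]]/A[[1, 1]]}]];
  {Ap, bp} = reduce[A, b];
  rest = linearSolve[Ap, bp];                     (* {x2, ..., xn} *)
  x1 = (b[[1]] - A[[1, 2 ;;]].rest)/A[[1, 1]];    (* back substitution for x1 *)
  Prepend[rest, x1]
];
linearSolve[{{1, 2, 3}, {4, 5, 2}, {2, 8, 5}}, {5, 10, 15}]   (* {20/43, 60/43, 25/43} *)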
Gaussian elimination can fail if we divide by zero, and is susceptible to large
errors or possible overflow if we divide by a very small number (relative to the other
numbers in the matrix). Division occurs in two places in the algorithm: during the
row reduction phase where we define m = ak1 /a11 and during the back-substitution
step at the end of the algorithm, where we solve for x_1 (here we also divide by a_11, but it's usually a different a_11). These numbers are called pivots. The solution is to
rearrange the matrix (and the corresponding elements of b): if at any step along the
way the pivot is zero, then the entire row is exchanged with a row that does not have
zero in that column. If all of the remaining elements in that column are zero then
the matrix is singular and there is no unique solution (or no solution at all).
Lagrange Interpolation
Suppose we know the values of some function f (x) at n + 1 distinct grid points
a = x0 , x1 , x2 , ..., xn = b (17.1)
The simplest method is linear interpolation: draw line segments connecting each pair of consecutive grid points (x_k, f_k) and (x_{k+1}, f_{k+1}). For x_k ≤ x ≤ x_{k+1} we have:

y = f_k + m(x − x_k) = f_k + [(f_{k+1} − f_k)/(x_{k+1} − x_k)](x − x_k)   (17.3)
In general, unless the grid points are very close, linear interpolation does not give very accurate results. A better approximation would be given by a polynomial. The key is to find the right polynomial, not just any polynomial that goes through the points. As it turns out, it is possible to find a polynomial that approximates the function to any desired degree of accuracy. This result is called the Weierstrass Approximation Theorem. Furthermore, given any n + 1 points it is possible to find a unique polynomial of minimum degree that fits all the points. For example, any two points can be fit by a line; any three non-collinear points can be fit by a unique parabola; any four points that do not lie on the same line or on the same parabola can be fit by a unique cubic; and so forth.
Suppose that we are given the n + 1 points

(x_0, f_0), (x_1, f_1), . . . , (x_n, f_n)   (17.4)

where

x_0 < x_1 < · · · < x_n   (17.5)

and that we want to find the polynomial of lowest order

P(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n   (17.6)

that fits these points. We begin by substituting the points 17.4 into the polynomial to get n + 1 equations in the n + 1 unknowns a_0, a_1, . . . , a_n:

f_0 = a_0 + a_1 x_0 + a_2 x_0^2 + · · · + a_n x_0^n   (17.7)
f_1 = a_0 + a_1 x_1 + a_2 x_1^2 + · · · + a_n x_1^n   (17.8)
⋮
f_n = a_0 + a_1 x_n + a_2 x_n^2 + · · · + a_n x_n^n   (17.9)
which we can write as the matrix system

[ 1  x_0  x_0^2  · · ·  x_0^n ] [ a_0 ]   [ f_0 ]
[ 1  x_1  x_1^2  · · ·  x_1^n ] [ a_1 ]   [ f_1 ]
[ ⋮   ⋮    ⋮            ⋮   ] [  ⋮  ] = [  ⋮  ]
[ 1  x_n  x_n^2  · · ·  x_n^n ] [ a_n ]   [ f_n ]
   (17.10)
This equation has a solution if the matrix of coefficients is non-singular. But because the points are distinct, the rows of the matrix form a linearly independent set of vectors (proof left as an exercise). Hence the matrix is non-singular. To find the a_0, a_1, . . .
we could use Gaussian elimination or some other method as we have discussed. It
turns out that this is not necessary because the form of the matrix allows us to write
a much simpler iterative process for finding these coefficients.
We will actually present two different methods for constructing the polynomial:
the Lagrange method (in this section) and the Newton method (in the next section).
Because of uniqueness both polynomials will be identical; however, they are con-
structed differently. The Newton method is particularly useful when one needs to
calculate numbers by hand, as was done in the 19th century. The Lagrange method,
which we will discuss first, is somewhat more intuitive. Before providing the general
form, we will illustrate the technique with linear (n=1) and quadratic (n=2) interpo-
lation.
For n = 1 we start with two points (x0 , f0 ), (x1 , f1 ) that we want to fit a line to.
Of course we have already done this, but this time we will construct the line in such a way that the method can be easily extended to higher degree fits (with more points).
We define the functions
L_0(x) = (x − x_1)/(x_0 − x_1)   (17.11)
L_1(x) = (x − x_0)/(x_1 − x_0)   (17.12)
and we observe that

L_0(x_0) = 1,  L_0(x_1) = 0   (17.13)
L_1(x_0) = 0,  L_1(x_1) = 1   (17.14)

or more compactly L_i(x_j) = δ_ij, where

δ_ij = 1 if i = j, and 0 if i ≠ j   (17.15)

is known as the Kronecker delta function (for the German mathematician Leopold Kronecker, 1823-1891). Next, we define the function
P(x) = Σ_{i=0}^{1} L_i(x) f_i   (17.16)
     = L_0(x) f_0 + L_1(x) f_1   (17.17)
     = [(x − x_1)/(x_0 − x_1)] f_0 + [(x − x_0)/(x_1 − x_0)] f_1   (17.18)
We observe that P (xi ) = fi and that P is linear in x. Hence it is the equation of
a line that goes through both points (xi , fi ), i = 0, 1. A rearrangement of this gives
equation 17.3.
For n = 2 we have 3 points: (x0 , f0 ), (x1 , f1 ), and (x2 , f2 ). Again, we have already
solved for the equation of a parabola through three points in the previous section,
but we will do it this time by extending the Lagrange technique. We define the three
functions

L_0(x) = (x − x_1)(x − x_2) / [(x_0 − x_1)(x_0 − x_2)]   (17.19)
L_1(x) = (x − x_0)(x − x_2) / [(x_1 − x_0)(x_1 − x_2)]   (17.20)
L_2(x) = (x − x_0)(x − x_1) / [(x_2 − x_0)(x_2 − x_1)]   (17.21)
We observe that L_i(x_i) = 1 and L_i(x_j) = 0 for j ≠ i, or in general L_i(x_j) = δ_ij, as before with the linear functions. Then we define the function

P(x) = L_0(x) f_0 + L_1(x) f_1 + L_2(x) f_2 = Σ_{i=0}^{2} L_i(x) f_i   (17.25)
We observe now that P (x) is quadratic, and that P (xi ) = fi . Thus it goes through
all three points, and hence by uniqueness it is the only parabola that goes through
all three points.
In the general case it becomes more convenient to add a second index indicating
the order of the polynomials to the L functions. Thus we rename our linear functions
from L0 and L1 to L10 and L11 , and our quadratic functions L0 , L1 , and L2 become
L20 , L21 , and L22 . The general definition is
L_nk(x) = Π_{j=0, j≠k}^{n} (x − x_j)/(x_k − x_j)   (17.26)

for k = 0, . . . , n. It is easily observed that (a) each of the L_nk has degree n; and (b) L_nk(x_i) = δ_ik. Hence the polynomial

P(x) = Σ_{k=0}^{n} L_nk(x) f_k   (17.27)

is of degree at most n and satisfies P(x_j) = f_j. Thus P(x) is our interpolating polynomial, and we have derived the following result.
Example 17.1. Use quadratic Lagrange interpolation through the points x = 0, 0.6, 0.9 to estimate f(0.45) for f(x) = sqrt(1 + x), so that f_0 = 1, f_1 ≈ 1.27, f_2 ≈ 1.38.

Solution. The interpolating polynomial is P(x) = L_20(x)f_0 + L_21(x)f_1 + L_22(x)f_2, where

L_20(x) = (x − 0.6)(x − 0.9)/[(0 − 0.6)(0 − 0.9)] = 1.85(x − 0.6)(x − 0.9)   (17.34)
L_21(x) = (x − 0)(x − 0.9)/[(0.6 − 0)(0.6 − 0.9)] = −5.56x(x − 0.9)   (17.35)
L_22(x) = (x − 0)(x − 0.6)/[(0.9 − 0)(0.9 − 0.6)] = 3.70x(x − 0.6)   (17.36)
Thus
P (x) = L20 (x) + 1.27L21 (x) + 1.38L22 (x) (17.37)
= 1.85(x − 0.6)(x − 0.9) − 1.27(5.56)x(x − 0.9)
+ 1.38(3.70)x(x − 0.6) (17.38)
= 1.85(x − 0.6)(x − 0.9) − 7.03x(x − 0.9) + 5.11x(x − 0.6) (17.39)
= 0.999 + 0.486x − 0.07x2 (17.40)
Hence

P(0.45) ≈ 0.999 + (0.486)(0.45) − (0.07)(0.45)^2 = 1.20   (17.41)
We summarize the algorithm for Lagrange Interpolation here.1
Algorithm LagrangeInterpolatingFunctions
Input: x0 , . . . , xn , x
For i = 0, 1, . . . , n,
Define the set Ui ={x0 , . . . , xn } − {xi }
Define numerator = 1, denominator = 1
For j = 0, . . . , n − 1
numerator = numerator × (x − Uij )
denominator = denominator × (xi − Uij )
End For
Lni = numerator/denominator
End For
Return the list {L_n0, . . . , L_nn}
Algorithm LagrangeInterpolatingPolynomial
Input: x0 , . . . , xn , f0 , . . . , fn , x
Let L be the list LagrangeInterpolatingFunctions(x0 , . . . , xn , x)
P = f0 ∗ L0 + f1 ∗ L1 + · · · + fn ∗ Ln
Return P
1
The notation A − B, where A and B are sets, means the relative complement of the set B in
the set A, e.g., all of the elements of A that are not in B. For an ordered set Ui , the notation Uij
means the j th element of Ui . An ordered set is also called a List.
In:=
U = {x1, x2, x3, x4, x5}
Out:=
{x1, x2, x3, x4, x5}
Next, we observe that if U is a list such as the one defined above, then Map[f, U] returns the result of f[u] for every element u of U. Recall that f/@U is a shorthand for Map[f, U]:
In:=
f/@U
Out:=
{f[x1], f[x2], f[x3], f[x4], f[x5]}
Suppose that f[x] represents the function f (x) = x − 3. We can calculate some
value, say f (u) in two different ways. The first is the usual way,
In:=
f[x_]:=x-3;
f[u]
Out:=
u-3
The second way is to use a pure function:
In:=
(#-3)&[u]
Out:=
u-3
Pure functions allow us to define a function and use it in a single statement. In-
stead of saying f[x] we replace the f with the pure function (#-3)&. The symbol &
tells us where the function definition ends, and the symbol # is used in place of the
function’s argument x. We can also combine pure functions with the Map function.
This is convenient because it lets us map an expression that we are only going to use
once; otherwise we’d have to use an extra line of code to define an unnecessary extra
variable to hold the function. Thus
Thus
In:=
(#-3)&/@U
Out:=
{-3 + x1, -3 + x2, -3 + x3, -3 + x4, -3 + x5}
and we can assign the result to a name:
In:=
V=(#-3)&/@U
Out:=
{-3 + x1, -3 + x2, -3 + x3, -3 + x4, -3 + x5}
To multiply out the elements of V we need to take all the elements of V and place
them as arguments to Times. We do this with the Apply command, which has a
shorthand of @@. The following are:
In:=
Apply[Times, V]
In:=
Times@@V
and both return the same thing (recall the definition of V, above):
Out:=
(-3 + x1) (-3 + x2) (-3 + x3) (-3 + x4) (-3 + x5)
Now suppose we want to combine these two functions. We want to subtract 3 from every element of the list U, which we can do with Map, and then take the product of the results with Apply and Times:
In:=
Times@@((#-3)&/@U)
or
In:=
Apply[Times, Map[(#-3)&, U]]
Out:=
(-3 + x1) (-3 + x2) (-3 + x3) (-3 + x4) (-3 + x5)
With this we can define a function to calculate the Lagrange Interpolating Functions
in Mathematica.
LagrangeInterpolatingFunctions[{xj__}, x_] :=
Module[ {i, n, xi, xjc, L, xgrid, num, den},
xgrid = {xj};
n = Length[xgrid];
L = {};
For[i = 1, i <= n, i++,
xi = xgrid[[i]];
xjc = Complement[xgrid, {xi}];
den = Times @@ ((xi - #) & /@ xjc);
num = Times @@ ((x - #) & /@ xjc);
L = Append[L, num/den];
];
Return[L];
]
In:=
LagrangeInterpolatingFunctions[{x1, x2, x3}, x]
Out:=
{(x − x2)(x − x3)/[(x1 − x2)(x1 − x3)], (x − x1)(x − x3)/[(x2 − x1)(x2 − x3)], (x − x1)(x − x2)/[(x3 − x1)(x3 − x2)]}
Next we observe that the dot product of two lists A and B of the same length is
calculated with the dot operator, which is a period:
In:=
{f1, f2, f3, f4}.{L1, L2, L3, L4}
Out:=
f1 L1 + f2 L2 + f3 L3 + f4 L4
In:=
f[x_]:= Sqrt[1.0+x];
points = {0.0, 0.6, 0.9};
(f/@points).LagrangeInterpolatingFunctions[points, 0.45]
Out:=
1.20342
In:=
(f /@ points).LagrangeInterpolatingFunctions[points, x] // Expand
Out:=
1. + 0.483656 x - 0.0702286 x^2
Theorem 17.2 (Error Bounds for Lagrange Interpolation). Suppose that f(x) is n + 1 times continuously differentiable, and suppose that the points x_0, . . . , x_n ∈ [a, b] are distinct. Then for any x ∈ [a, b] there exists a number c ∈ [a, b] such that

f(x) = P(x) + [f^{(n+1)}(c)/(n + 1)!] (x − x_0)(x − x_1) · · · (x − x_n)   (17.42)

where

P(x) = Σ_{k=0}^{n} f_k L_nk(x) = Σ_{k=0}^{n} f_k Π_{j=0, j≠k}^{n} (x − x_j)/(x_k − x_j)   (17.43)
Proof. If x = x_k for some k, then since P(x_k) = f_k the second term in equation 17.42 is zero, regardless of the value of c, and the result holds identically.
So suppose that x ≠ x_k for all k, and define the function

g(t) = f(t) − P(t) − [f(x) − P(x)] Π_{i=0}^{n} (t − x_i)/(x − x_i)   (17.44)

Then

g(x_k) = f(x_k) − P(x_k) − [f(x) − P(x)] Π_{i=0}^{n} (x_k − x_i)/(x − x_i) = 0   (17.45)

The second equality follows because (a) by construction, f(x_k) = P(x_k), so the first term is zero; and (b) for some i we have i = k and hence there is a factor of x_k − x_k in the numerator of the second term, making it zero as well. Furthermore,

g(x) = f(x) − P(x) − [f(x) − P(x)] Π_{i=0}^{n} (x − x_i)/(x − x_i)   (17.46)
     = f(x) − P(x) − [f(x) − P(x)]   (17.47)
     = 0   (17.48)
Thus g(t) has n + 2 distinct zeroes: x, x_0, . . . , x_n. By the generalized Rolle's theorem there exists at least one number c ∈ (a, b) such that g^{(n+1)}(c) = 0. Differentiating g(t) a total of n + 1 times,

g^{(n+1)}(t) = f^{(n+1)}(t) − P^{(n+1)}(t) − [f(x) − P(x)] (d^{n+1}/dt^{n+1}) Π_{i=0}^{n} (t − x_i)/(x − x_i)   (17.49)

Now since

P(t) = a_0 + a_1 t + · · · + a_n t^n   (17.53)

then P^{(n+1)}(t) = 0 for all t, and hence P^{(n+1)}(c) = 0, so that

0 = f^{(n+1)}(c) − {[f(x) − P(x)] / Π_{i=0}^{n}(x − x_i)} (d^{n+1}/dt^{n+1}) Π_{i=0}^{n}(t − x_i) |_{t=c}   (17.54)

Furthermore,

(d^{n+1}/dt^{n+1}) Π_{i=0}^{n}(t − x_i) = (d^{n+1}/dt^{n+1}) (t − x_0)(t − x_1) · · · (t − x_n)   (17.55)
  = (d^{n+1}/dt^{n+1}) [t^{n+1} + (stuff)·t^n + (more stuff)·t^{n−1} + · · ·]   (17.56)
  = (n + 1)!   (17.57)

Therefore

0 = f^{(n+1)}(c) − {[f(x) − P(x)] / Π_{i=0}^{n}(x − x_i)} (n + 1)!   (17.58)

Solving for f(x) gives equation 17.42.
Example 17.2. Suppose you want to make a table of the natural logarithms over the range 1 ≤ x ≤ 100. What step size is sufficient to ensure that linear interpolation between each successive pair of points will be accurate to within 10^{−5}?
For linear interpolation we use n = 1 (there are two points, x_0 and x_1), so that

|f(x) − P(x)| = |f''(c)(x − x_0)(x − x_1)/2!|   (17.60)
             ≤ (1/2) max |f''(c)| × max |(x − x_0)(x − x_1)|   (17.61)

on each interval. Since f(x) = log x we have f'(x) = 1/x and f''(x) = −1/x^2. The maximum value of |−1/x^2| on [1, 100] is 1, so that

|f(x) − P(x)| ≤ (1/2) max |(x − x_0)(x − x_1)|   (17.62)
To find the extreme value of g(x) = (x − x_0)(x − x_1) = x^2 − (x_0 + x_1)x + x_0 x_1 on [x_0, x_1] we observe that it either occurs at an endpoint or at a point where g'(x) = 0. At the endpoints g(x) = 0. So first we differentiate:

0 = g'(x) = 2x − (x_0 + x_1)   (17.63)

which gives a possible extremum at x = (x_0 + x_1)/2. The value of g at this point is

g((x_0 + x_1)/2) = ((x_0 + x_1)/2 − x_0)((x_0 + x_1)/2 − x_1)   (17.64)
                = ((x_1 − x_0)/2)((x_0 − x_1)/2)   (17.65)

whose absolute value is

h^2/4   (17.66)

where h = x_1 − x_0 is the spacing between entries in the table (the number we are solving for). Substituting equation 17.66 into equation 17.62 gives

|f(x) − P(x)| ≤ h^2/8   (17.67)

Since we want to ensure that the error is no larger than 10^{−5} we set

h^2/8 < 10^{−5}   (17.68)

or

h < sqrt(8 × 10^{−5}) ≈ 0.0089   (17.69)

so if we choose any step size smaller than h ≈ 0.0089 we are guaranteed to have an error of no larger than 10^{−5}.
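A quick Mathematica check of this bound on a single table interval (the interval starting at x = 1 is the worst case, since |f''| is largest there; the step size is the one just derived):

h = 0.0089; x0 = 1.0; x1 = x0 + h;
p[x_] := Log[x0] + (Log[x1] - Log[x0])/h (x - x0);       (* linear interpolant *)
NMaximize[{Abs[Log[x] - p[x]], x0 <= x <= x1}, x][[1]]   (* ≈ 9.9*10^-6, just under 10^-5 *)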
Newton Interpolation

f_n = a_0   (18.10)
⋮
a_k = (1/(k! h^k)) ∇^k f_n   (18.18)

Q_k(x) = Π_{j=0}^{k−1} (x − x_{n−j})   (18.19)

x = x_n + sh   (18.23)

Newton's forward difference formula is

P(x_0 + sh) = Σ_{k=0}^{n} C(s, k) Δ^k f_0   (18.34)

where x = x_0 + hs.
Example 18.1. Find e^{1.2} using first-, second-, third-, and fourth-order differences, given the data e^1 = 2.71828, e^{1.5} = 4.48169, e^2 = 7.38906, e^{2.5} = 12.18249, e^3 = 20.08554, with Newton's forward difference formula.

Solution. We want to use equation 18.34 at x = 1.2 with x_0 = 1. Hence

1.2 = x = x_0 + hs = 1 + 0.5s   (18.39)

and so s = 0.4. We can construct the following table of forward differences based on the input data. The values that we will use lie along the top diagonal of the table.
xk fk ∆fk ∆2 fk ∆3 fk ∆4 fk
1 2.71828
1.76341
1.5 4.48169 1.14396
2.90737 0.74211
2 7.38906 1.88607 0.48142
4.79344 1.22353
2.5 12.18249 3.10961
7.90304
3 20.08554
We then calculate the following binomial coefficients, using s = 0.4:

C(s, 1) = 0.4   (18.40)
C(s, 2) = (0.4)(−0.6)/2! = −0.12   (18.41)
C(s, 3) = (0.4)(−0.6)(−1.6)/3! = 0.064   (18.42)
C(s, 4) = (0.4)(−0.6)(−1.6)(−2.6)/4! = −0.0416   (18.43)
For n = 1, the interpolated value is

P(x_0 + sh) = f_0 + C(s, 1)Δf_0   (18.44)
            = 2.71828 + (0.4)(1.76341)   (18.45)
            = 3.42364   (18.46)

For n = 2,

P(x_0 + sh) = f_0 + C(s, 1)Δf_0 + C(s, 2)Δ²f_0   (18.47)
            = 3.42364 + (−0.12)(1.14396)   (18.48)
            = 3.28636   (18.49)

For n = 3,

P(x_0 + sh) = f_0 + C(s, 1)Δf_0 + C(s, 2)Δ²f_0 + C(s, 3)Δ³f_0   (18.50)
            = 3.28636 + (0.064)(0.74211)   (18.51)
            = 3.33386   (18.52)

For n = 4,

P(x_0 + sh) = f_0 + C(s, 1)Δf_0 + C(s, 2)Δ²f_0 + C(s, 3)Δ³f_0 + C(s, 4)Δ⁴f_0   (18.53)
            = 3.33386 + (−0.0416)(0.48142)   (18.54)
            = 3.31383   (18.55)
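The whole computation can be reproduced in a few lines of Mathematica (a sketch using the tabulated values above; Binomial accepts the non-integer argument s directly):

fk = {2.71828, 4.48169, 7.38906, 12.18249, 20.08554};
top = First /@ NestList[Differences, fk, 4];   (* f0, Δf0, Δ²f0, Δ³f0, Δ⁴f0 *)
s = 0.4;
Accumulate[Table[Binomial[s, k] top[[k + 1]], {k, 0, 4}]]
(* ≈ {2.71828, 3.42364, 3.28637, 3.33386, 3.31383} *)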
Example 18.2. Using the same data as the previous example, calculate e2.7 using
backward differences.
Solution. We have
2.7 = x = xn + sh = 3 + (0.5)s (18.56)
Hence s = −0.6. The backwards difference formula gives

P(2.7) = f_n + (−1)C(0.6, 1)∇f_n + (−1)^2 C(0.6, 2)∇²f_n + (−1)^3 C(0.6, 3)∇³f_n + (−1)^4 C(0.6, 4)∇⁴f_n + · · ·   (18.57)
       = f_n − 0.6∇f_n + (−1)^2 [(0.6)(−0.4)/2!]∇²f_n + (−1)^3 [(0.6)(−0.4)(−1.4)/3!]∇³f_n + (−1)^4 [(0.6)(−0.4)(−1.4)(−2.4)/4!]∇⁴f_n + · · ·   (18.58)
       = f_n − 0.6∇f_n − 0.12∇²f_n − 0.056∇³f_n − 0.0336∇⁴f_n + · · ·   (18.59)
We now can read the data off the lower diagonal of the same difference table:

x_k    f_k        ∆f_k      ∆²f_k     ∆³f_k     ∆⁴f_k
1      2.71828
                  1.76341
1.5    4.48169              1.14396
                  2.90737             0.74211
2      7.38906              1.88607             0.48142
                  4.79344             1.22353
2.5    12.18249             3.10961
                  7.90304
3      20.08554

Reading off the bottom diagonal, f_n = 20.08554, ∇f_n = 7.90304, ∇²f_n = 3.10961, ∇³f_n = 1.22353, and ∇⁴f_n = 0.48142, so that

P(2.7) ≈ 20.08554 − (0.6)(7.90304) − (0.12)(3.10961) − (0.056)(1.22353) − (0.0336)(0.48142) ≈ 14.8859

compared with the true value e^{2.7} ≈ 14.8797.
Hermite Interpolation

One of the problems with polynomial interpolation is that although it fits the points, the shape of the curve doesn't always match very well. One approach to this problem is to try to match the derivatives as well as the points. Suppose that we know the function f(x) at n + 1 points, given by (x_0, f_0), . . . , (x_n, f_n), and that we also know the derivatives at these same n + 1 points,

f'_0 = f'(x_0), . . . , f'_n = f'(x_n)   (19.1)

Then our approach will be to try to find a polynomial that matches both the function and the derivative at these points. Our conditions are then:

P(x_i) = f_i   (19.2)
P'(x_i) = f'_i   (19.3)
Theorem 19.1. Suppose that f(t) is continuously differentiable on [a, b], that the numbers x_0, . . . , x_n ∈ [a, b] are distinct, and let L_nj(x) be the Lagrange interpolating functions. Then

P(x) = H_{2n+1}(x) = Σ_{j=0}^{n} f_j H_nj(x) + Σ_{j=0}^{n} f'_j Ĥ_nj(x)   (19.4)

where

H_nj(x) = [1 − 2(x − x_j)L'_nj(x_j)] (L_nj(x))^2
Ĥ_nj(x) = (x − x_j)(L_nj(x))^2

satisfies equations 19.2 and 19.3. Equation 19.4 is called the Hermite Interpolating Polynomial.
Hence

H_nj(x_i) = δ_ij   (19.10)

Similarly,

Ĥ_nj(x_i) = (x_i − x_j)δ_ij = 0   (19.11)

for all i and j. Substituting into equation 19.4,

P(x_i) = Σ_{j=0}^{n} f_j H_nj(x_i) + Σ_{j=0}^{n} f'_j Ĥ_nj(x_i)   (19.12)
       = Σ_{j=0}^{n} f_j δ_ij   (19.13)
       = f_i   (19.14)
and therefore

P'(x_i) = Σ_{j=0}^{n} f_j H'_nj(x_i) + Σ_{j=0}^{n} f'_j Ĥ'_nj(x_i)   (19.16)

A similar computation shows that H'_nj(x_i) = 0 for all i and j. Differentiating Ĥ_nj,

Ĥ'_nj(x) = (d/dx)[(x − x_j)(L_nj(x))^2]   (19.25)
         = 2(x − x_j)L_nj(x)L'_nj(x) + (L_nj(x))^2   (19.26)

so that

Ĥ'_nj(x_i) = 2(x_i − x_j)L_nj(x_i)L'_nj(x_i) + (L_nj(x_i))^2   (19.27)
           = 2(x_i − x_j)δ_ij L'_nj(x_i) + δ_ij   (19.28)
           = δ_ij   (19.29)

Therefore

P'(x_i) = Σ_{j=0}^{n} f'_j δ_ij = f'_i   (19.30)
To show uniqueness, suppose that g(x) is any polynomial of degree at most 2n + 1 satisfying conditions 19.2 and 19.3, and let Δ(x) = g(x) − H_{2n+1}(x) be the difference between these two polynomials. Since Δ(x) is the difference of two polynomials of degree at most 2n + 1, Δ(x) is also a polynomial of degree at most 2n + 1. Furthermore, Δ(x_i) = 0 and Δ'(x_i) = 0 at each of the n + 1 nodes, so each x_i is a double root and Δ(x) = (x − x_0)^2 · · · (x − x_n)^2 q(x) for some polynomial q(x). This says that either Δ(x) has 2(n + 1) = 2n + 2 zeroes counting multiplicity, which contradicts the fact that a nonzero polynomial of degree at most 2n + 1 has at most 2n + 1 zeroes, or that q(x) = 0 identically. But if q(x) = 0 identically, then Δ(x) = 0 identically, which implies that g(x) = H_{2n+1}(x) for all x. In other words, H_{2n+1} is unique.
Example 19.1. Find a Hermite interpolating polynomial for the following data, which is based on f(x) = sqrt(x).

x_0 = 1:  f_0 = 1,  f'_0 = 1/2
x_1 = 4:  f_1 = 2,  f'_1 = 1/4
Solution. Since n = 1 there are 2n + 2 = 4 conditions that must be met, and therefore the order of the polynomial will be 2n + 1 = 3. The interpolating polynomial is

P(x) = f_0 H_10(x) + f_1 H_11(x) + f'_0 Ĥ_10(x) + f'_1 Ĥ_11(x)   (19.37)
     = H_10(x) + 2H_11(x) + (1/2)Ĥ_10(x) + (1/4)Ĥ_11(x)   (19.38)
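We can check the polynomial this example constructs against Mathematica's built-in InterpolatingPolynomial, which accepts derivative values in the form {{x_i}, f_i, f'_i}:

H = InterpolatingPolynomial[{{{1}, 1, 1/2}, {{4}, 2, 1/4}}, x];
{H /. x -> 1, H /. x -> 4, D[H, x] /. x -> 1, D[H, x] /. x -> 4}
(* {1, 2, 1/2, 1/4}: all four Hermite conditions are satisfied *)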
Theorem 19.2 (Error Bound for Hermite Interpolation). Suppose that f is 2n + 2 times continuously differentiable on [a, b] and that H_{2n+1} is the Hermite interpolating polynomial on the distinct nodes x_0, . . . , x_n ∈ [a, b]. Then for each x ∈ [a, b] there is a number c ∈ (a, b) such that

f(x) = H_{2n+1}(x) + [(x − x_0)^2 · · · (x − x_n)^2 / (2n + 2)!] f^{(2n+2)}(c)   (19.54)

Proof. First, suppose that x = x_k for some k. Then the second term in equation 19.54 is zero and it becomes f(x) = H_{2n+1}(x), which is the interpolation condition, because x = x_k. This condition is known to hold true because of theorem 19.1.
Now suppose that x ≠ x_k for all k. Then define the function g(t) by

g(t) = f(t) − H_{2n+1}(t) − [f(x) − H_{2n+1}(x)] (t − x_0)^2 · · · (t − x_n)^2 / [(x − x_0)^2 · · · (x − x_n)^2]   (19.55)
Then g(x_k) = 0 for k = 0, . . . , n, and g(x) = 0. Differentiating,

g'(t) = f'(t) − H'_{2n+1}(t) − [f(x) − H_{2n+1}(x)] (d/dt){(t − x_0)^2 · · · (t − x_n)^2 / [(x − x_0)^2 · · · (x − x_n)^2]}   (19.62)
      = f'(t) − H'_{2n+1}(t) − {[f(x) − H_{2n+1}(x)] / [(x − x_0)^2 · · · (x − x_n)^2]} (d/dt) Π_{k=0}^{n} (t − x_k)^2   (19.63)

Recall the product rule for differentiation:

(d/dt)(a_1 a_2 a_3 · · · a_n) = a'_1 a_2 a_3 · · · a_n + a_1 a'_2 a_3 · · · a_n + · · · + a_1 · · · a_{n−1} a'_n   (19.64)
Hence

(d/dt) Π_{k=0}^{n} (t − x_k)^2 = 2(t − x_0) Π_{k=0, k≠0}^{n} (t − x_k)^2 + 2(t − x_1) Π_{k=0, k≠1}^{n} (t − x_k)^2 + 2(t − x_2) Π_{k=0, k≠2}^{n} (t − x_k)^2 + · · · + 2(t − x_n) Π_{k=0, k≠n}^{n} (t − x_k)^2   (19.65)
  = 2 Π_{k=0}^{n} (t − x_k) Σ_{i=0}^{n} Π_{j=0, j≠i}^{n} (t − x_j)   (19.66)
  = P(t)Q(t)   (19.67)

where

P(t) = 2 Π_{k=0}^{n} (t − x_k)   (19.68)
Q(t) = Σ_{i=0}^{n} Π_{j=0, j≠i}^{n} (t − x_j)   (19.69)

Consequently

g'(t) = f'(t) − H'_{2n+1}(t) − {[f(x) − H_{2n+1}(x)] / [(x − x_0)^2 · · · (x − x_n)^2]} P(t)Q(t)   (19.70)
Since

g'(x_k) = f'(x_k) − H'_{2n+1}(x_k) − {[f(x) − H_{2n+1}(x)] / [(x − x_0)^2 · · · (x − x_n)^2]} P(x_k)Q(x_k) = 0   (19.71)

by the two facts that f'_k = H'_{2n+1}(x_k) and P(x_k) = 0 (from equation 19.68), we see that g' has roots at x_0, . . . , x_n. By Rolle's theorem g' also has a root c_k between each consecutive pair of the n + 2 zeroes of g, giving n + 1 additional roots c_0, . . . , c_n. Therefore g'(t) has 2n + 2 distinct zeroes, and by the generalized Rolle's theorem there is a number c ∈ (a, b) such that g^{(2n+2)}(c) = 0.
Next we calculate

(d^{2n+2}/dt^{2n+2}) Π_{k=0}^{n} (t − x_k)^2 = (d^{2n+2}/dt^{2n+2}) [t^{2n+2} + (stuff)·t^{2n+1} + · · · + (stuff)]   (19.74)
  = (2n + 2)!   (19.75)

and therefore

g^{(2n+2)}(t) = f^{(2n+2)}(t) − {[f(x) − H_{2n+1}(x)] / [(x − x_0)^2 · · · (x − x_n)^2]} (2n + 2)!   (19.76)

Setting t = c, where g^{(2n+2)}(c) = 0, and solving for f(x) gives equation 19.54.
Example 19.2. Estimate the error in using the Hermite interpolating polynomial based on the nodes x = 5, 10, 15, 20, 25 to approximate f(16) for f(x) = sqrt(x).

Solution. Since there are 5 data points x_0, . . . , x_4, we have n = 4, so we need the 10th derivative of f(x). Using Mathematica, we find that

f^{(10)}(x) = −34,459,425/(1024 x^{19/2})   (19.78)

At x = 16, equation 19.54 gives

|error| = [(16 − 5)^2 (16 − 10)^2 (16 − 15)^2 (16 − 20)^2 (16 − 25)^2 / 10!] × [34,459,425/(1024 c^{19/2})]   (19.79)
        = 52352.6/c^{19/2}   (19.80)

The maximum of 1/c^{19/2} on (5, 25) occurs at the minimum of c^{19/2} on (5, 25), which occurs at c = 5. Hence

|error| ≤ 52352.6/5^{19/2} ≈ 0.012   (19.81)
Of course this is just a theoretical limit, because we do not know the actual value of
c. In this case, the actual Hermite approximation gives a much smaller error, of only
9.3 × 10−7 .
f[x_] := Sqrt[x];
xdata = Range[5, 25, 5]; (* list of x values *)
fdata = f /@ xdata; (* list of f values *)
fpdata = f’[#] & /@ xdata; (* list of f’ values *)
Hermite[xdata, fdata, fpdata, x]
To get the actual error value quoted in the example we use Hermite[xdata, fdata,
fpdata, 16] - 4.0, because we know that the correct answer is 4.
Cubic Splines

As before we are trying to find an interpolating function for a function that we know at n + 1 points a = x_0 < x_1 < · · · < x_n = b. Instead of fitting a single polynomial to all n + 1 points, an alternative strategy is to fit a different polynomial to each successive pair of points. We will define a set of functions S_i(x), one on each interval [x_i, x_{i+1}]. To keep the solution smooth we would like to match the first and second derivatives, as well as the function itself, at each grid point. Our conditions are

S_i(x_i) = f_i   (20.1)
S_i(x_{i+1}) = S_{i+1}(x_{i+1})   (20.2)
S'_i(x_{i+1}) = S'_{i+1}(x_{i+1})   (20.3)
S''_i(x_{i+1}) = S''_{i+1}(x_{i+1})   (20.4)
Equations 20.1 through 20.4 give us a total of 4n − 2 conditions. Since there are
n spline functions, we need 3n parameters if the functions are quadratic and 4n
parameters if the functions are cubic. Since 4n−2 > 3n the system is over-determined
for a quadratic to work, and since 4n − 2 < 4n the system is under-determined for
a cubic to work. By adding two additional conditions, however, we can uniquely
determine a set of cubic spline functions. These are typically either free (natural)
boundary conditions,
S''_0(x_0) = S''_{n−1}(x_n) = 0   (20.5)

or clamped boundary conditions,

S'_0(x_0) = f'(a),  S'_{n−1}(x_n) = f'(b)   (20.6)
We look for spline functions of the form

S_i(x) = a_i + b_i(x − x_i) + c_i(x − x_i)^2 + d_i(x − x_i)^3   (20.8)

on each interval [x_i, x_{i+1}]. Substituting equation 20.1 into 20.8 gives

a_i = f_i   (20.9)
Rearranging,

a_i + b_i h_i + (h_i^2/3)(2c_i + c_{i+1}) = a_{i+1}   (20.20)

Solving for b_i,

b_i h_i = a_{i+1} − a_i − (h_i^2/3)(2c_i + c_{i+1})   (20.21)

or

b_i = (a_{i+1} − a_i)/h_i − (h_i/3)(2c_i + c_{i+1})   (20.22)
b_{i−1} = (a_i − a_{i−1})/h_{i−1} − (h_{i−1}/3)(2c_{i−1} + c_i)   (20.23)
Using equation 20.22 for the bi on the left hand side of equation 20.26, and equation
20.23 for the bi−1 on the right hand side of equation 20.26,
Rearranging a bit,

3(a_{i+1} − a_i)/h_i − 3(a_i − a_{i−1})/h_{i−1} = h_i c_{i+1} + 2c_i(h_i + h_{i−1}) + c_{i−1} h_{i−1}   (20.29)
In matrix form,

[ 1      0            0           · · ·                    0       ] [ c_0 ]   [ 0                                                          ]
[ h_0   2(h_0+h_1)   h_1                                           ] [ c_1 ]   [ (3/h_1)(a_2 − a_1) − (3/h_0)(a_1 − a_0)                    ]
[ 0     h_1          2(h_1+h_2)  h_2                               ] [ c_2 ] = [ ⋮                                                          ]
[ ⋮         ⋱            ⋱          ⋱                             ] [  ⋮  ]   [ (3/h_{n−1})(a_n − a_{n−1}) − (3/h_{n−2})(a_{n−1} − a_{n−2}) ]
[ 0     · · ·   0    h_{n−2}     2(h_{n−2}+h_{n−1})   h_{n−1}      ] [     ]   [                                                            ]
[ 0     · · ·                     0                        1       ] [ c_n ]   [ 0                                                          ]
   (20.30)
If the grid points are equally spaced with h_i = h for some number h, then

[ 1   0   0   · · ·          0 ] [ c_0 ]   [ 0                              ]
[ h   4h  h                    ] [ c_1 ]   [ (3/h)(a_0 − 2a_1 + a_2)        ]
[ 0   h   4h  h                ] [ c_2 ] = [ ⋮                              ]
[ ⋮      ⋱   ⋱   ⋱            ] [  ⋮  ]   [ (3/h)(a_{n−2} − 2a_{n−1} + a_n) ]
[ 0   · · ·   0   h   4h   h   ] [     ]   [                                ]
[ 0   · · ·            0    1  ] [ c_n ]   [ 0                              ]
   (20.31)
Denoting the square matrix by A and the vectors by c and w, this can be written
concisely as Ac = w. Since all of the ai are already known, the only unknowns in
this equation are the c, for which we can solve as c = A−1 w.
For clamped cubic splines, the corresponding equations are

[ 2h_0   h_0         0          · · ·            0       ] [ c_0 ]   [ (3/h_0)(a_1 − a_0) − 3f'(a)                                ]
[ h_0   2(h_0+h_1)  h_1                                  ] [ c_1 ]   [ (3/h_1)(a_2 − a_1) − (3/h_0)(a_1 − a_0)                    ]
[ 0     h_1         2(h_1+h_2)  h_2                      ] [ c_2 ] = [ ⋮                                                          ]
[ ⋮        ⋱           ⋱          ⋱                     ] [  ⋮  ]   [ (3/h_{n−1})(a_n − a_{n−1}) − (3/h_{n−2})(a_{n−1} − a_{n−2}) ]
[ 0     · · ·        0   h_{n−1}     2h_{n−1}            ] [ c_n ]   [ 3f'(b) − (3/h_{n−1})(a_n − a_{n−1})                        ]
   (20.32)
Example 20.1. Find the natural cubic spline through the points (0, 0), (1, 0.5), (2, 0.8), (3, 0.9).

Solution. Here a_0 = 0, a_1 = 0.5, a_2 = 0.8, a_3 = 0.9. Since the points are equally spaced with h = 1, we can use 20.31. The right hand side is given by

w_0 = 0   (20.34)
w_1 = 3(0 − 2(.5) + .8) = −0.6   (20.35)
w_2 = 3(.5 − 2(.8) + .9) = −0.6   (20.36)
w_3 = 0   (20.37)
Multiplying the matrices on the left and setting like components equal gives the equivalent system of equations:

c_0 = 0   (20.39)
c_0 + 4c_1 + c_2 = −0.6   (20.40)
c_1 + 4c_2 + c_3 = −0.6   (20.41)
c_3 = 0   (20.42)

Substituting the first and last result into the middle two equations gives 4c_1 + c_2 = −0.6 and c_1 + 4c_2 = −0.6, whose solution is c_1 = c_2 = −0.12. Summarizing, we have

c_0 = 0;  c_1 = −0.12;  c_2 = −0.12;  c_3 = 0   (20.47)
From 20.22, b_i = (a_{i+1} − a_i)/h_i − (h_i/3)(2c_i + c_{i+1}), and therefore

b_0 = (a_1 − a_0)/h − (h/3)(2c_0 + c_1)   (20.48)
    = (0.5 − 0) − (1/3)(2(0) − 0.12)   (20.49)
    = 0.54   (20.50)
b_1 = (a_2 − a_1)/h − (h/3)(2c_1 + c_2)   (20.51)
    = (0.8 − 0.5) − (1/3)(2(−0.12) + (−0.12))   (20.52)
    = 0.42   (20.53)
b_2 = (a_3 − a_2)/h − (h/3)(2c_2 + c_3)   (20.54)
    = (0.9 − 0.8) − (1/3)(2(−0.12) + 0)   (20.55)
    = 0.18   (20.56)

From equation 20.18, d_i = (c_{i+1} − c_i)/(3h_i) = (c_{i+1} − c_i)/3 (since h = 1), so that

d_0 = (c_1 − c_0)/3 = (−.12 − 0)/3 = −0.04   (20.57)
d_1 = (c_2 − c_1)/3 = (−.12 − (−.12))/3 = 0   (20.58)
d_2 = (c_3 − c_2)/3 = (0 − (−.12))/3 = 0.04   (20.59)
Combining equations 20.33, 20.48, 20.47 and 20.57,
S0 = a0 + b0 (x − x0 ) + c0 (x − x0 )2 + d0 (x − x0 )3 (20.60)
= 0 + (0.54)(x) + (0)(x)2 + (−0.04)(x)3 (20.61)
= 0.54x − 0.04x3 (20.62)
S1 = a1 + b1 (x − x1 ) + c1 (x − x1 )2 + d1 (x − x1 )3 (20.63)
= 0.5 + (0.42)(x − 1) + (−0.12)(x − 1)2 + (0)(x − 1)3 (20.64)
= 0.5 + 0.42x − 0.42 − 0.12x2 + 0.24x − 0.12 (20.65)
= −0.04 + 0.66x − 0.12x2 (20.66)
S2 = a2 + b2 (x − x2 ) + c2 (x − x2 )2 + d2 (x − x2 )3 (20.67)
= 0.8 + (0.18)(x − 2) + (−0.12)(x − 2)2 + (0.04)(x − 2)3 (20.68)
= 0.8 + 0.18x − 0.36 − 0.12x2 + 0.48x − 0.48
+ 0.04x3 − 0.24x2 + 0.48x − 0.32 (20.69)
= −0.36 + 1.14x − 0.36x2 + 0.04x3 (20.70)
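These coefficients are easy to verify in Mathematica by solving the system 20.31 directly (a sketch; the matrix is written out by hand for this 4-point example):

a = {0, 0.5, 0.8, 0.9}; h = 1;
mat = {{1, 0, 0, 0}, {h, 4 h, h, 0}, {0, h, 4 h, h}, {0, 0, 0, 1}};
w = {0, 3/h (a[[1]] - 2 a[[2]] + a[[3]]), 3/h (a[[2]] - 2 a[[3]] + a[[4]]), 0};
c = LinearSolve[mat, w]                                                      (* {0., -0.12, -0.12, 0.} *)
b = Table[(a[[i + 1]] - a[[i]])/h - h/3 (2 c[[i]] + c[[i + 1]]), {i, 3}]     (* {0.54, 0.42, 0.18} *)
d = Table[(c[[i + 1]] - c[[i]])/(3 h), {i, 3}]                               (* {-0.04, 0., 0.04} *)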
Theorem 20.1 (Error Bounds for Clamped Cubic Splines). Let f(x) be 4-times continuously differentiable on [a, b], and define

M = sup_{[a,b]} |f^{(4)}(x)|   (20.72)

If S(x) is the unique clamped cubic spline interpolant to f(x) on the nodes x_0, . . . , x_n, where a = x_0 and b = x_n, then

|f(x) − S(x)| ≤ (5M/384) max_{0≤j≤n−1} h_j^4   (20.73)

where h_j = x_{j+1} − x_j.
Bezier Curves
The Bezier quadratic is the curve traced out by P(t) as t goes from t = 0 to t = 1. Substituting the expressions for P_01 and P_12 gives

P(t) = (1 − t)[(1 − t)P_0 + tP_1] + t[(1 − t)P_1 + tP_2]   (21.7)
     = (1 − t)^2 P_0 + 2t(1 − t)P_1 + t^2 P_2   (21.8)

In terms of the x and y coordinates, the quadratic Bezier interpolants are:

x(t) = (1 − t)^2 x_0 + 2t(1 − t)x_1 + t^2 x_2   (21.9)
y(t) = (1 − t)^2 y_0 + 2t(1 − t)y_1 + t^2 y_2   (21.10)

Bezier quadratics are used, for example, to describe TrueType fonts. The curve constructed in this way is tangent to the moving line segment P_01P_12 at the point P(t).
Now suppose that we have two control points, P1 and P2 , and that we want to
draw a curve connecting P0 and P3 using points P1 and P2 to control our movement.
See figure 21.2. As before we construct the line segments P0 P1 , P1 P2 and P2 P3 , and
at any time t ∈ [0, 1] define points on these three segments:
P01 (t) = (1 − t)P0 + tP1 (21.11)
P12 (t) = (1 − t)P1 + tP2 (21.12)
P23 (t) = (1 − t)P2 + tP3 (21.13)
Next, construct line segments P01 P12 and P12 P23 and define their parameterization
on [0, 1] as follows:
P012 (t) = (1 − t)P01 (t) + tP12 (t) (21.14)
P123 (t) = (1 − t)P12 (t) + tP23 (t) (21.15)
Finally we construct a line segment P012 P123 with parameterization
P (t) = (1 − t)P012 (t) + tP123 (t) (21.16)
= (1 − t)[(1 − t)P01 (t) + tP12 (t)] + t[(1 − t)P12 (t) + tP23 (t)] (21.17)
= (1 − t)2 P01 (t) + 2t(1 − t)P12 (t) + t2 P23 (t) (21.18)
= (1 − t)2 [(1 − t)P0 + tP1 ] + 2t(1 − t)[(1 − t)P1 + tP2 ] (21.19)
+ t2 [(1 − t)P2 + tP3 ]
= (1 − t)3 P0 + 3t(1 − t)2 P1 + 3t2 (1 − t)P2 + t3 P3 (21.20)
The cartesian coordinates of the point P constructed in this way are

x(t) = (1 − t)^3 x_0 + 3t(1 − t)^2 x_1 + 3t^2(1 − t)x_2 + t^3 x_3   (21.21)
y(t) = (1 − t)^3 y_0 + 3t(1 − t)^2 y_1 + 3t^2(1 − t)y_2 + t^3 y_3   (21.22)

Bezier cubics in this form are used to describe Postscript fonts. It is left as an exercise to verify that the splines formed in this way approach the fixed endpoints P_0 and P_3 with tangent lines P_0P_1 and P_2P_3. Because they are described by a cubic parameterization there are a total of eight coefficients (4 for the x and 4 for the y), which we have described by the coordinates of the points P_0, P_1, P_2, P_3. By uniqueness, there can be only one curve that matches our restrictions, and so the following derivation of the Bezier cubics must, of necessity, give the same curve. We present the derivation because the notation, which is different from the derivation given above, is commonly used to describe Bezier curves in various graphics applications. Suppose we want to join the two points shown in figure 21.3, given by
P0 = (x0 , y0 ) (21.23)
P1 = (x1 , y1 ) (21.24)
in such a way that the slopes at P0 and P1 are defined in terms of the points Q0 and
Q1 by the vectors P0 Q0 and P1 Q1 , where
Q0 = (x0 + 3α0 , y0 + 3β0 ) (21.25)
Q1 = (x1 − 3α1 , y1 − 3β1 ) (21.26)
As before we find a parametric representation of the curve (x(t), y(t)) on the interval t ∈ [0, 1], where

x(0) = x_0,  x(1) = x_1,  x'(0) = 3α_0,  x'(1) = 3α_1   (21.27)
y(0) = y_0,  y(1) = y_1,  y'(0) = 3β_0,  y'(1) = 3β_1   (21.28)

The factor of 3 in the definitions of the numbers α_0, α_1, β_0 and β_1 is not used in all textbooks but is standard in the implementation used in most graphics programs, so we will abide by it. Let us write the parametric equation for x(t) as

x(t) = A + Bt + Ct^2 + Dt^3   (21.29)

Differentiating equation 21.29,

x'(t) = B + 2Ct + 3Dt^2   (21.30)

The boundary conditions at t = 0 give

x(0) = A = x_0   (21.31)
x'(0) = B = 3α_0   (21.32)
Substituting 21.31 and 21.32 back into 21.29 and 21.30 and then setting t = 1 gives
x(1) = x0 + 3α0 + C + D = x1 (21.33)
x0 (1) = 3α0 + 2C + 3D = 3α1 (21.34)
Multiplying equation 21.33 by 3,
3x0 + 9α0 + 3C + 3D = 3x1 (21.35)
Subtracting equation 21.34 from 21.35,
3x0 + 6α0 + C = 3x1 − 3α1 (21.36)
Hence
C = 3(x1 − x0 ) − 3(2α0 + α1 ) (21.37)
Multiplying equation 21.33 by 2 and subtracting equation 21.34 gives 2x_0 + 3α_0 − D = 2x_1 − 3α_1, hence

D = 2(x_0 − x_1) + 3(α_0 + α_1)

A variation of Bezier cubics is used for Postscript fonts, which are defined in terms of the positions of the two endpoints (x_0, y_0) and (x_3, y_3) and their handles (x_1, y_1) and (x_2, y_2) rather than the derivatives, so it has a slightly different form.
Bezier curves can be defined of any order, using any number of points. The points
define a sequence of line segments that “pull” the curve towards them, with the Bezier
curve parallel to the first and last segment. The general formula is
x(t) = Σ_{i=0}^{n} C(n, i) x_i (1 − t)^{n−i} t^i   (21.45)
y(t) = Σ_{i=0}^{n} C(n, i) y_i (1 − t)^{n−i} t^i   (21.46)
We can generate the Bezier Curve equations for a set of points in Mathematica as
follows.
bezier[points_?ListQ, t_] := Module[{n, bezx, bezy, x, y, i},
  n = Length[points] - 1;
  bezx = 0; bezy = 0;
  x[i_] := points[[i + 1, 1]];
  y[i_] := points[[i + 1, 2]];
  For[i = 0, i <= n, i++,
    bezx = bezx + Binomial[n, i]*x[i] (1 - t)^(n - i) t^i;
    bezy = bezy + Binomial[n, i]*y[i] (1 - t)^(n - i) t^i;
  ];
  Return[{bezx, bezy}]
];

Figure 21.4: A. A typical Bezier curve generated with the 6 points (1, 1.52), (2, 1.94), (3, 1.39), (4, 1.0), (5, 1.54), (6, 1.55). B, C: Rearrangements of the points give different curves. C: The curve is closed because the first and last point are the same.
We can plot the points, the line segments with their handles, and the Bezier curve with a Mathematica function bezierPlot, which combines a point plot, a line plot, and a parametric plot of bezier[points, t].
Standard options for Plot, such as PlotRange, Axes, TextStyle, etc, can be used by
bezierPlot. A generalization to higher dimensions is given by Bezier Surfaces, which
were also invented by Pierre Bezier in 1972. The general form of a Bezier Surface is
given in terms of (m + 1)(n + 1) points (x0,0 , y0,0 , z0,0 ), . . . , (xm,n , ym,n , zm,n ) as
x(s, t) = Σ_{i=0}^{n} Σ_{j=0}^{m} C(n, i) C(m, j) s^i (1 − s)^{n−i} t^j (1 − t)^{m−j} x_{i,j}   (21.47)
y(s, t) = Σ_{i=0}^{n} Σ_{j=0}^{m} C(n, i) C(m, j) s^i (1 − s)^{n−i} t^j (1 − t)^{m−j} y_{i,j}   (21.48)
z(s, t) = Σ_{i=0}^{n} Σ_{j=0}^{m} C(n, i) C(m, j) s^i (1 − s)^{n−i} t^j (1 − t)^{m−j} z_{i,j}   (21.49)
where s, t ∈ [0, 1]. This can be implemented in Mathematica by the following function.

bezierSurface[points_?ListQ, {s_, t_}] := Module[
  {n, m, i, j, x, y, z, bezx, bezy, bezz, bezcoef},
  n = Length[points] - 1;
  m = Length[points[[1]]] - 1;
  x[i_, j_] := points[[i + 1, j + 1, 1]];
  y[i_, j_] := points[[i + 1, j + 1, 2]];
  z[i_, j_] := points[[i + 1, j + 1, 3]];
  bezx = 0; bezy = 0; bezz = 0;
  For[i = 0, i <= n, i++,
    For[j = 0, j <= m, j++,
      bezcoef = Binomial[n, i] Binomial[m, j] (s^i) ((1 - s)^(n - i)) (t^j) ((1 - t)^(m - j));
      bezx = bezx + bezcoef*x[i, j];
      bezy = bezy + bezcoef*y[i, j];
      bezz = bezz + bezcoef*z[i, j];
    ];
  ];
  Return[{bezx, bezy, bezz}];
];
Consider the set of points on six of the corners of a cube given by

(0, 0, 0)  (1, 0, 0)  (1, 1, 0)
(0, 0, 1)  (1, 0, 1)  (1, 1, 1)

The Bezier surface calculated with this algorithm is

x(s, t) = 2(1 − s)(1 − t)t + 2s(1 − t)t + (1 − s)t^2 + st^2   (21.50)
y(s, t) = (1 − s)t^2 + st^2   (21.51)
z(s, t) = s(1 − t)^2 + 2s(1 − t)t + st^2   (21.52)
which can be found in Mathematica via
In:=
data = {{{0,0,0}, {1,0,0}, {1,1,0}},
{{0,0,1}, {1,0,1}, {1,1,1}}};
surface = bezierSurface[data, {s, t}]
Out:=
{2*(1 - s)*(1 - t)*t + 2*s*(1 - t)*t + (1 - s)*t^2 + s*t^2,
(1 - s)*t^2 + s*t^2,
s*(1 - t)^2 + 2*s*(1 - t)*t + s*t^2}
The surface and its generating points are illustrated below. They are produced with the following commands:

<<Graphics‘Graphics3D‘
dataPlot = ScatterPlot3D[Partition[Flatten[data], 3],
  PlotStyle -> PointSize[.03]];
surfacepoints = Table[surface, {s, 0, 1, .05}, {t, 0, 1, .05}];
surfacePlot = ListSurfacePlot3D[surfacepoints,
  DisplayFunction -> Identity];
Show[dataPlot, surfacePlot, DisplayFunction -> $DisplayFunction]
The following data was generated on a fixed (x, y) grid with z−values determined by
a random number generator. It produces a more complicated surface.
(1, 1, 4.63) (1, 2, 4.41) (1, 3, 3.05) (1, 4, 3.76) (1, 5, 2.87) (1, 6, 4.05) (1, 7, 2.81)
(2, 1, 3.31) (2, 2, 2.61) (2, 3, 3.17) (2, 4, 2.47) (2, 5, 4.55) (2, 6, 2.35) (2, 7, 3.)
(3, 1, 3.63) (3, 2, 4.99) (3, 3, 2.21) (3, 4, 3.46) (3, 5, 3.74) (3, 6, 4.62) (3, 7, 2.24)
(4, 1, 3.18) (4, 2, 4.33) (4, 3, 3.98) (4, 4, 2.62) (4, 5, 3.76) (4, 6, 3.28) (4, 7, 2.22)
(5, 1, 4.75) (5, 2, 4.71) (5, 3, 2.47) (5, 4, 3.91) (5, 5, 4.14) (5, 6, 3.54) (5, 7, 5.)
Three different views of the resulting Bezier surface are shown in the following figure. Points that are blocked by the surface are not shown. The figures on the top left and on the bottom show only the Bezier surface and the points, from different angles. The figure on the top right also shows a triangulated surface formed by connecting the points.
Least Squares
Suppose that we are given a collection of n data points

(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)   (22.1)

and we want to find the "best fit" straight line to our data, namely, we want to find numbers m and b such that

y = mx + b   (22.2)

is the "best" possible line in the sense that it minimizes the total sum-squared vertical distance between the data points and the line. This process is known as the linear least-squares problem or linear regression.
The vertical distance between any point (xi , yi ) and the line (see figure 22.1),
which we will denote by di , is
di = |mxi + b − yi | (22.3)
Since this distance is also minimized when its square is minimized, we instead minimize

f(m, b) = Σ_{i=1}^{n} d_i^2 = Σ_{i=1}^{n} (mx_i + b − y_i)^2   (22.5)
The only unknowns in this expression are the slope m and y-intercept b. Thus we
have written the expression as a function f (m, b). Our goal is to find the values of m
and b that correspond to the global minimum of f (m, b).
Figure 22.1: The least-squares linear fit minimizes the sum of the squares of the vertical distances between the data points and the line.
At the minimum, ∂f/∂b = 0:

0 = ∂f/∂b = (∂/∂b) Σ_{i=1}^{n} (mx_i + b − y_i)^2   (22.6)
  = Σ_{i=1}^{n} 2(mx_i + b − y_i)   (22.7)
  = 2 Σ_{i=1}^{n} (mx_i + b − y_i)   (22.8)

Dividing by 2,

0 = Σ_{i=1}^{n} (mx_i + b − y_i)   (22.9)
  = Σ_{i=1}^{n} mx_i + Σ_{i=1}^{n} b − Σ_{i=1}^{n} y_i   (22.10)
  = m Σ_{i=1}^{n} x_i + nb − Σ_{i=1}^{n} y_i   (22.11)
Defining

X = Σ_{i=1}^{n} x_i   (22.12)
Y = Σ_{i=1}^{n} y_i   (22.13)

then we have

0 = mX + nb − Y   (22.14)
Next, we set ∂f/∂m = 0, which gives

0 = ∂f/∂m = (∂/∂m) Σ_{i=1}^{n} (mx_i + b − y_i)^2   (22.15)
  = Σ_{i=1}^{n} 2x_i(mx_i + b − y_i)   (22.16)
  = 2 Σ_{i=1}^{n} x_i(mx_i + b − y_i)   (22.17)
  = 2 (m Σ_{i=1}^{n} x_i^2 + b Σ_{i=1}^{n} x_i − Σ_{i=1}^{n} x_i y_i)

Defining A = Σ_{i=1}^{n} x_i^2 and C = Σ_{i=1}^{n} x_i y_i, this becomes

0 = mA + bX − C   (22.23)
Equations 22.14 and 22.23 give us a system of two linear equations in the two variables m and b. Multiplying equation 22.14 by A and equation 22.23 by X gives

0 = A(mX + nb − Y) = AXm + Anb − AY   (22.24)
0 = X(mA + bX − C) = AXm + X^2 b − CX   (22.25)

and therefore, subtracting,

b = (AY − CX)/(An − X^2) = (Σx_i^2 Σy_i − Σx_i y_i Σx_i)/(n Σx_i^2 − (Σx_i)^2)   (22.27)
Similarly, multiplying equation 22.14 by X and equation 22.23 by n and subtracting,

0 = m(X^2 − nA) − (YX − nC)   (22.30)

hence m = (XY − nC)/(X^2 − nA). For example, for a data set with n = 5, X = 25, Y = 18, A = 135, and C = 97,

m = (XY − nC)/(X^2 − nA) = [(25)(18) − (5)(97)]/[(25)^2 − 5(135)] = (450 − 485)/(625 − 675) = −35/−50 = 0.7   (22.42)

and

b = (AY − CX)/(An − X^2) = [(135)(18) − (97)(25)]/[(135)(5) − 25^2] = (2430 − 2425)/50 = 5/50 = 0.1   (22.43)
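In Mathematica the built-in Fit performs this computation directly; the data below is hypothetical, chosen only to illustrate the call:

data = {{3, 2.0}, {4, 3.1}, {5, 3.6}, {6, 4.2}, {7, 5.1}};
Fit[data, {1, x}, x]   (* returns the least-squares line b + m x *)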
More generally, suppose we want to fit a polynomial of degree n − 1,

y = c_1 + c_2 x + · · · + c_n x^{n−1}

to m data points (x_1, y_1), . . . , (x_m, y_m). The conditions are

c_1 + c_2 x_1 + · · · + c_n x_1^{n−1} = y_1   (22.45)
⋮   (22.46)
c_1 + c_2 x_m + · · · + c_n x_m^{n−1} = y_m   (22.47)

If m = n, which is usually not the case, we could solve this equation exactly, by writing

[ 1  x_1  x_1^2  · · ·  x_1^{n−1} ] [ c_1 ]   [ y_1 ]
[ 1  x_2  x_2^2  · · ·  x_2^{n−1} ] [ c_2 ]   [ y_2 ]
[ ⋮   ⋮    ⋮             ⋮     ] [  ⋮  ] = [  ⋮  ]
[ 1  x_m  x_m^2  · · ·  x_m^{n−1} ] [ c_n ]   [ y_m ]
   (22.48)

or more simply,

Ac = y   (22.49)

The solution is of course

c = A^{−1} y   (22.50)
However, this is not usually a good idea. Even if we had m = n, when the number
of data points is relatively large such a curve would be extremely over-fit, giving a
“kink” or “bump” for each point. This is not usually a good approximation to the
data. Furthermore, when we have m > n the equation is not even solvable. In general
it is better to look for some sort of solution to
Ac ≈ y (22.51)
in the sense that the two sides of the equation match one-another in some sort of
minimized least squares sense. In other words, we want to find the “best fit” linear
combination
Ac = c1 a1 + c2 a2 + · · · + cn an (22.52)
where each a_j is the jth column of the matrix A,

a_j = [x_1^{j−1}  x_2^{j−1}  · · ·  x_m^{j−1}]^T   (22.53)

We denote the pth component of a_i as

a_ip = A_pi = A^T_ip   (22.54)
We define the best fit as the set of values c1 , . . . , cn that minimizes the distance
min |y − Ac| (22.55)
c1 ,...,cn
As with linear least squares, we will minimize the sum-square residual error

E = Σ_i [Σ_j A^T_ji c_j − y_i]^2   (22.57)

Since

(∂/∂c_p)(c_j c_k) = c_j (∂c_k/∂c_p) + c_k (∂c_j/∂c_p)   (22.61)
                 = c_j δ_pk + c_k δ_pj   (22.62)

setting ∂E/∂c_p = 0 in equation 22.60 gives

0 = Σ_i Σ_j Σ_k A^T_ji A^T_ki (c_j δ_pk + c_k δ_pj) − 2 Σ_i y_i A^T_pi   (22.63)
  = Σ_i Σ_j Σ_k A^T_ji A^T_ki c_j δ_pk + Σ_i Σ_j Σ_k A^T_ji A^T_ki c_k δ_pj − 2(A^T y)_p   (22.64)
  = Σ_i Σ_j A^T_ji A^T_pi c_j + Σ_i Σ_k A^T_pi A^T_ki c_k − 2(A^T y)_p   (22.65)
  = Σ_i A^T_pi Σ_j A^T_ji c_j + Σ_i A^T_pi Σ_k A^T_ki c_k − 2(A^T y)_p   (22.66)
  = Σ_i A^T_pi Σ_j A_ij c_j + Σ_i A^T_pi Σ_k A_ik c_k − 2(A^T y)_p   (22.67)
  = Σ_i A^T_pi (Ac)_i + Σ_i A^T_pi (Ac)_i − 2(A^T y)_p   (22.68)
  = 2(A^T A c)_p − 2(A^T y)_p   (22.69)

Hence c is the solution of the linear equation

A^T A c = A^T y   (22.70)
Formally, the least squares solution is given exactly by

c = (A^T A)^{−1} A^T y   (22.71)
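A sketch of the normal equations 22.70 in Mathematica, fitting a quadratic to hypothetical data:

xd = {1., 2., 3., 4., 5.}; yd = {1.8, 4.1, 9.2, 15.9, 25.3};
A = Table[x^j, {x, xd}, {j, 0, 2}];                (* columns 1, x, x^2 *)
c = LinearSolve[Transpose[A].A, Transpose[A].yd]   (* solves A^T A c = A^T y *)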
As an example to illustrate that our earlier calculation of the linear least squares
fit falls out of equation 22.71 we note that for linear data,
A =
[ 1  x_1 ]
[ 1  x_2 ]
[ ⋮   ⋮ ]
[ 1  x_n ]
   (22.72)

A^T A =
[ 1    1    · · ·  1   ] [ 1  x_1 ]   [ n    Σx   ]
[ x_1  x_2  · · ·  x_n ] [ ⋮   ⋮ ] = [ Σx   Σx^2 ]
                         [ 1  x_n ]
   (22.73)

Define

Δ = n Σx^2 − (Σx)^2   (22.74)

Then

(A^T A)^{−1} = (1/Δ) [ Σx^2  −Σx ]
                     [ −Σx    n  ]
   (22.75)

Hence eq. 22.71 gives

c = (1/Δ) [ Σx^2  −Σx ] [ 1    1    · · ·  1   ] [ y_1 ]
          [ −Σx    n  ] [ x_1  x_2  · · ·  x_n ] [ ⋮  ]
                                                 [ y_n ]
   (22.76)

  = (1/Δ) [ Σx^2  −Σx ] [ Σy  ]
          [ −Σx    n  ] [ Σxy ]
   (22.77)

  = (1/Δ) [ Σx^2 Σy − Σx Σxy ]
          [ −Σx Σy + n Σxy   ]
   (22.78)

and therefore

b = c_1 = (Σx^2 Σy − Σx Σxy)/(n Σx^2 − (Σx)^2)   (22.79)
m = c_2 = (n Σxy − Σx Σy)/(n Σx^2 − (Σx)^2)   (22.80)
Numerical Differentiation
The derivative of a function f(x) is defined by the limit

f'(x) = df(x)/dx = lim_{h→0} [f(x + h) − f(x)]/h   (23.1)

If we choose h sufficiently small, then we can approximate the derivative by

f'(x) ≈ [f(x + h) − f(x)]/h   (23.2)

If we represent the function by a table of numbers f_0 = f(x_0), . . . , f_n = f(x_n), then for a fixed value of h,

f'(x_i) ≈ [f(x_i + h) − f(x_i)]/h = (f_{i+1} − f_i)/h,   i = 0, 1, . . . , n − 1   (23.3)
To find an upper bound on the error, we use Taylor's theorem:

f(x + h) = f(x) + hf'(x) + (1/2)h^2 f''(x) + · · · + (1/n!)h^n f^{(n)}(x) + (1/(n+1)!)h^{n+1} f^{(n+1)}(c)   (23.4)

where c is some unknown number between x and x + h. The Taylor formula for n = 1 is

f(x + h) = f(x) + hf'(x) + (1/2)h^2 f''(c)   (23.5)

Thus

hf'(x) = f(x + h) − f(x) − (1/2)h^2 f''(c)   (23.6)

and dividing by h,

f'(x) = [f(x + h) − f(x)]/h − (1/2)h f''(c)   (23.7)

The first term gives precisely the same thing as equation 23.2; the second term gives the error. Thus we have the following approximation formulas depending upon whether h > 0 or h < 0. The Forward Difference Formula is

f'_i = (1/h)(f_{i+1} − f_i) − (1/2)h f''(c)   (23.8)
Example 23.1. Compare the forward difference, central difference, and backward
difference methods for the following data.
x = 1.1 x = 1.2 x = 1.3 x = 1.4
f (x) = 9.025 f (x) = 11.023 f (x) = 13.464 f (x) = 16.645
Solution. According to the forward difference formula:

f'(1.1) ≈ [f(1.2) − f(1.1)]/0.1 = (11.023 − 9.025)/0.1 = 19.98   (23.16)
f'(1.2) ≈ [f(1.3) − f(1.2)]/0.1 = (13.464 − 11.023)/0.1 = 24.41   (23.17)
f'(1.3) ≈ [f(1.4) − f(1.3)]/0.1 = (16.645 − 13.464)/0.1 = 31.81   (23.18)
Figure 23.1: Points used in the calculation of the derivative at (x_1, f(x_1)) for the forward difference formula (top left); the backward difference formula (top right); and the central difference formula (bottom). The slope of the line joining the points shown is used to approximate the derivative. The actual tangent is indicated by a dashed line.
There is no forward difference value for the right endpoint, no backward difference
value for the left endpoint, and no central difference value for either endpoint.
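As a quick check of these formulas, here is a short Mathematica sketch using the data of Example 23.1 above (the helper function names are ours); the central difference uses the formula f'(x_i) ≈ (f_{i+1} − f_{i−1})/(2h) that reappears below in equation 24.41:

fdata = {9.025, 11.023, 13.464, 16.645};  (* f at x = 1.1, 1.2, 1.3, 1.4 *)
h = 0.1;
forward[i_]  := (fdata[[i + 1]] - fdata[[i]])/h;
backward[i_] := (fdata[[i]] - fdata[[i - 1]])/h;
central[i_]  := (fdata[[i + 1]] - fdata[[i - 1]])/(2 h);
{forward[2], backward[2], central[2]}  (* three estimates of f'(1.2) *)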
To get a more accurate number for the derivative, we can find the derivative of an
interpolating polynomial. We start with the Lagrange interpolating polynomial with
error term:

f(x) = Σ_{k=0}^{n} f_k L_k(x) + (1/(n+1)!) ∏_{k=0}^{n} (x − x_k) f⁽ⁿ⁺¹⁾(c)    (23.24)

where

L_k = ∏_{j=0, j≠k}^{n} (x − x_j)/(x_k − x_j)    (23.25)
Notice that the product in the middle term has a factor of (xi − xi ) and is therefore
zero. Thus
f'(x_i) = Σ_{k=0}^{n} f_k L'_k(x_i) + (1/(n+1)!) f⁽ⁿ⁺¹⁾(c(x_i)) [d/dx ∏_{k=0}^{n} (x − x_k)]_{x=x_i}    (23.30)
Every term in the summation except for the k = i term will have a factor xi − xi in
the product; therefore the only nonzero term is for k = i.
[d/dx ∏_{k=0}^{n} (x − x_k)]_{x=x_i} = ∏_{j=0, j≠i}^{n} (x_i − x_j)    (23.33)
Substitution back into equation 23.30 yields the (n+1)-point approximation formula,

f'(x_i) = Σ_{k=0}^{n} f_k L'_k(x_i) + (1/(n+1)!) f⁽ⁿ⁺¹⁾(c(x_i)) ∏_{k=0, k≠i}^{n} (x_i − x_k)    (23.35)
The first term gives the approximation and the second gives an error formula.
A two-point approximation is obtained when n = 1. The Lagrange polynomials are

L_0 = (x − x_1)/(x_0 − x_1),   L_1 = (x − x_0)/(x_1 − x_0)    (23.36)

Hence

L'_0 = 1/(x_0 − x_1),   L'_1 = 1/(x_1 − x_0)    (23.37)

Therefore

f'(x_i) ≈ f_0 L'_0(x_i) + f_1 L'_1(x_i) = f_0/(x_0 − x_1) + f_1/(x_1 − x_0) = (f_1 − f_0)/(x_1 − x_0)    (23.38)
This is precisely the forward difference formula.
When n = 2 we obtain the 3-point formulas. The Lagrange functions are

L_0(x) = (x − x_1)(x − x_2)/[(x_0 − x_1)(x_0 − x_2)] = [x² − (x_1 + x_2)x + x_1 x_2]/[(x_0 − x_1)(x_0 − x_2)]    (23.39)
L_1(x) = (x − x_0)(x − x_2)/[(x_1 − x_0)(x_1 − x_2)] = [x² − (x_0 + x_2)x + x_0 x_2]/[(x_1 − x_0)(x_1 − x_2)]    (23.40)
L_2(x) = (x − x_0)(x − x_1)/[(x_2 − x_0)(x_2 − x_1)] = [x² − (x_0 + x_1)x + x_0 x_1]/[(x_2 − x_0)(x_2 − x_1)]    (23.41)

Hence

L'_0(x) = [2x − (x_1 + x_2)]/[(x_0 − x_1)(x_0 − x_2)]    (23.42)
L'_1(x) = [2x − (x_0 + x_2)]/[(x_1 − x_0)(x_1 − x_2)]    (23.43)
L'_2(x) = [2x − (x_0 + x_1)]/[(x_2 − x_0)(x_2 − x_1)]    (23.44)
and therefore
for some c_{−1} ∈ [x_0 − h, x_0]. Adding equations 23.64 and 23.65 gives

f_1 + f_{−1} = 2[f_0 + (1/2!)h² f''_0 + (1/4!)h⁴ f_0⁽⁴⁾ + ··· + (1/n!)hⁿ f_0⁽ⁿ⁾]
             + [h^(n+1)/(n+1)!][f⁽ⁿ⁺¹⁾(c_1) + (−1)^(n+1) f⁽ⁿ⁺¹⁾(c_{−1})]    (23.66)

where n is even. If n is odd the term in the square brackets terminates at the h^(n−1) term instead of the hⁿ term. For example, if n = 3,

f_1 + f_{−1} = 2[f_0 + (1/2!)h² f''_0] + (h⁴/4!)[f⁽⁴⁾(c_1) + (−1)⁴ f⁽⁴⁾(c_{−1})]    (23.67)
             = 2f_0 + h² f''_0 + (1/24)h⁴[f⁽⁴⁾(c_1) + f⁽⁴⁾(c_{−1})]    (23.68)

Solving for f''_0,

f''_0 = (1/h²)[f_1 − 2f_0 + f_{−1}] − (1/24)h²[f⁽⁴⁾(c_1) + f⁽⁴⁾(c_{−1})]    (23.69)

By the intermediate value theorem, since [f⁽⁴⁾(c_1) + f⁽⁴⁾(c_{−1})]/2 lies between f⁽⁴⁾(c_1) and f⁽⁴⁾(c_{−1}), then (assuming f⁽⁴⁾ is continuous on [a, b]) there is some number c_0 ∈ [c_{−1}, c_1] such that f⁽⁴⁾(c_0) = [f⁽⁴⁾(c_1) + f⁽⁴⁾(c_{−1})]/2, and hence

f''_0 = (1/h²)[f_1 − 2f_0 + f_{−1}] − (1/12)h² f⁽⁴⁾(c_0)
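This second derivative formula is easy to test numerically; here is a minimal Mathematica sketch, with a test function and evaluation point of our own choosing:

f[x_] := Sin[x];
x0 = 1.0; h = 0.01;
approx = (f[x0 + h] - 2 f[x0] + f[x0 - h])/h^2;  (* (f1 - 2 f0 + f-1)/h^2 *)
{approx, -Sin[x0]}  (* compare with the exact value f''(x0) = -Sin[x0] *)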
Richardson Extrapolation
then

x = N_2(h) − (1/2)c_2 h² − (3/4)c_3 h³ + ···    (24.7)

which tells us that N_2 approximates x to O(h²). Repeating the process,

x = N_2(h/2) − (1/2)c_2 (h/2)² − (3/4)c_3 (h/2)³ + ···    (24.8)
  = N_2(h/2) − (1/8)c_2 h² − (3/32)c_3 h³ − ···    (24.9)

Multiplying by 4,

4x = 4N_2(h/2) − (1/2)c_2 h² − (3/8)c_3 h³ − ···    (24.10)

Subtracting equation 24.7 from 24.10 gives

3x = [4N_2(h/2) − (1/2)c_2 h² − (3/8)c_3 h³ − ···] − [N_2(h) − (1/2)c_2 h² − (3/4)c_3 h³ − ···]    (24.11)
   = 4N_2(h/2) − N_2(h) + (3/8)c_3 h³ + ···    (24.12)

Solving for x,

x = [4N_2(h/2) − N_2(h)]/3 + (1/8)c_3 h³ + ···    (24.13)

It is convenient to rewrite equation 24.13 as

x = N_2(h/2) + [N_2(h/2) − N_2(h)]/3 + (1/8)c_3 h³ + ···    (24.14)

Let us define

N_3(h) = N_2(h/2) + [N_2(h/2) − N_2(h)]/(2^(3−1) − 1)    (24.15)

Then

x = N_3(h) + (1/8)c_3 h³ + ···    (24.16)

Hence N_3(h) is accurate to O(h³). In general we define

N_j(h) = N_{j−1}(h/2) + [N_{j−1}(h/2) − N_{j−1}(h)]/(2^(j−1) − 1)    (24.17)

which gives

x = N_j(h) + O(h^j)    (24.18)
f'_0 ≈ (1/h)(f_1 − f_0) + O(h)    (24.20)

Doubling the step size,

f'_0 = (1/2h)(f_2 − f_0) + O(h)    (24.21)

Define

N(h) = (1/2h)(f_2 − f_0) = (1/2h)[f(x_0 + 2h) − f(x_0)]    (24.22)
N(h/2) = (1/h)[f(x_0 + h) − f(x_0)] = (1/h)(f_1 − f_0)    (24.23)

and therefore,

N_2(h) = N(h/2) + [N(h/2) − N(h)]/(2^(2−1) − 1)    (24.24)
       = (1/h)(f_1 − f_0) + [(1/h)(f_1 − f_0) − (1/2h)(f_2 − f_0)]    (24.25)
       = (1/2h)(2f_1 − 2f_0 + 2f_1 − 2f_0 − f_2 + f_0)    (24.26)
       = (1/2h)(−3f_0 + 4f_1 − f_2)    (24.27)

This is the same formula we found by using Taylor series.

To get a higher order approximation we would have to split the step size again. Since we don't have any data more finely grained than at intervals of h, we use the following trick: go back to the beginning and start with 4h, then 2h, then h. Starting with a forward difference with a step size of 4h,

N(h) = (1/4h)(f_4 − f_0) = (1/4h)[f(x_0 + 4h) − f_0]    (24.28)
N(h/2) = (1/2h)[f(x_0 + 2h) − f_0] = (1/2h)(f_2 − f_0)    (24.29)
N_2(h/2) = (1/2h)(−3f_0 + 4f_1 − f_2)    (24.34)

hence

N_3 = N_2(h/2) + (1/3)[N_2(h/2) − N_2(h)]    (24.35)
    = (1/2h)(−3f_0 + 4f_1 − f_2) + (1/3)[(1/2h)(−3f_0 + 4f_1 − f_2) − (1/4h)(−3f_0 + 4f_2 − f_4)]    (24.36)
    = (1/2h)(−3f_0 + 4f_1 − f_2) + (1/6h)(−3f_0 + 4f_1 − f_2) − (1/12h)(−3f_0 + 4f_2 − f_4)    (24.37)
    = (1/12h)[6(−3f_0 + 4f_1 − f_2) + 2(−3f_0 + 4f_1 − f_2) − (−3f_0 + 4f_2 − f_4)]    (24.38)
    = (1/12h)(−21f_0 + 32f_1 − 12f_2 + f_4)    (24.39)

f'_0 = (1/12h)(−21f_0 + 32f_1 − 12f_2 + f_4) + O(h³)    (24.40)

f'(x) = [f(x + h) − f(x − h)]/2h = N_1(h)    (24.41)
Observe in this example that we had to obtain information in the following table:

N_1(.4)
N_1(.2)    N_2(.4)
N_1(.1)    N_2(.2)    N_3(.4)
N_1(.05)   N_2(.1)    N_3(.2)    N_4(.4)
⋮          ⋮          ⋮          ⋮
In other words, to get any item in the table requires knowledge of everything above and
to the left of it in the table. This is true in general in using Richardson Extrapolation.
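The triangular structure of this table translates directly into code. Here is a Mathematica sketch of the general construction of equation 24.17 (the function name RichardsonTable is ours); tab[[j, i]] holds N_j computed with step h/2^(i−1):

RichardsonTable[N1_, h_, m_] := Module[{tab},
  tab = {Table[N1[h/2^(i - 1)], {i, 1, m}]};
  Do[
   AppendTo[tab,
    Table[tab[[j, i + 1]] + (tab[[j, i + 1]] - tab[[j, i]])/(2^j - 1),
     {i, 1, m - j}]],
   {j, 1, m - 1}];
  tab]

(* example: extrapolating the forward difference of eq. 24.20 for Exp at x0 = 1 *)
N1[h_] := (Exp[1 + h] - Exp[1])/h;
RichardsonTable[N1, 0.4, 4] // TableForm

Each successive row gains one order of accuracy, exactly as in the triangular table above.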
Numerical Integration
i.e., the areas of rectangles whose upper left hand corners touch the curve of f(x). We can write a similar formula down using the right hand corners:

∫_a^b f(x)dx ≈ Σ_{i=1}^{n} f(x_i)(x_i − x_{i−1})    (25.2)

x_i = x_0 + ih    (25.3)

so that we have

∫_a^b f(x)dx ≈ h Σ_{i=0}^{n−1} f(x_i) ≈ h Σ_{i=1}^{n} f(x_i)    (25.4)
Alternatively, we can calculate the area using boxes that cross the curve. For example, if we know the function at three points (x_0, f_0), (x_1, f_1), and (x_2, f_2), where x_1 = x_0 + h and x_2 = x_0 + 2h, then we can approximate the area under the curve of f(x) on [x_0, x_2] by a box whose width is x_2 − x_0 = 2h and whose height is f(x_1):

∫_{x_0}^{x_2} f(x)dx ≈ 2h f(x_1)    (25.5)
Figure 25.1: Calculation of an integral as the area under the curve can be approximated with vertical rectangles. Top row, left: upper left hand corner of rectangles fit to curve. Right: upper right hand corner of each rectangle is fit to the curve. Bottom row, left: midpoint of top of each rectangle is fit to the curve. Right: in the trapezoidal rule, the rectangles are replaced by trapezoids whose tops fit the function at both upper corners.
To get the area over the entire interval [a, b], where a = x_0 < x_1 < x_2 < ··· < x_n = b, and n is assumed to be even, we obtain the Composite Midpoint Rule,

∫_a^b f(x)dx = ∫_{x_0}^{x_2} f(x)dx + ∫_{x_2}^{x_4} f(x)dx + ··· + ∫_{x_{n−4}}^{x_{n−2}} f(x)dx + ∫_{x_{n−2}}^{x_n} f(x)dx    (25.6)
≈ 2h f_1 + 2h f_3 + 2h f_5 + ··· + 2h f_{n−3} + 2h f_{n−1}    (25.7)
= 2h(f_1 + f_3 + ··· + f_{n−1})    (25.8)
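A direct Mathematica sketch of equation 25.8 (the function name is ours; n must be even):

CompositeMidpoint[f_, a_, b_, n_?EvenQ] := Module[{h = (b - a)/n},
  2 h Sum[f[a + i h], {i, 1, n - 1, 2}]]

(* for instance, the integral of Example 25.1 below with n = 4: *)
CompositeMidpoint[Function[x, x^2 Exp[-x]], 0., 10., 4]  (* ≈ 2.72071 *)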
Example 25.1. Find ∫_0^10 x²e^(−x) dx using n = 4 and n = 10. Compare your result with the exact integral.

Solution. For n = 4 we have h = 10/4 = 2.5, so that f_1 = f(2.5) = 0.513031 and f_3 = f(7.5) = 0.031111, hence

∫_0^10 x²e^(−x)dx ≈ 2h[f_1 + f_3]    (25.13)
≈ 2(2.5)(0.513031 + 0.031111)    (25.14)
≈ 2.72072    (25.15)
Similarly, for n = 10 we have h = 1, hence

∫_0^10 f(x)dx ≈ 2h[f_1 + f_3 + f_5 + f_7 + f_9]    (25.22)
≈ 2(0.367879 + 0.448084 + 0.168449 + 0.044682 + 0.0099996)    (25.23)
≈ 2.07818    (25.24)
Figure 25.3: The relative error as a function of number of intervals for the integral
solved in example 25.1.
Since the midpoint rule does not use all the information that we know about the function, we could modify it by interpreting the x_i as the centers of rectangles of width h rather than treating the odd-numbered x_i as centers of rectangles of width 2h. The first and last points x_0 and x_n become the left- and right-hand ends of rectangles of width h/2 in this scheme,

∫_a^b f(x)dx = ∫_{x_0}^{x_0+h/2} f(x)dx + Σ_{i=1}^{n−1} ∫_{x_i−h/2}^{x_i+h/2} f(x)dx + ∫_{x_n−h/2}^{x_n} f(x)dx    (25.26)
≈ (h/2)f_0 + Σ_{i=1}^{n−1} h f_i + (h/2)f_n    (25.27)

∫_a^b f(x)dx ≈ (h/2)[f_0 + 2f_1 + 2f_2 + ··· + 2f_{n−2} + 2f_{n−1} + f_n]    (25.28)
Example 25.2. Repeat example 25.1 using the composite Trapezoidal rule with h =
2.5.
∫_0^10 f(x)dx ≈ (h/2)[f_0 + 2f_1 + 2f_2 + 2f_3 + f_4]    (25.29)
= (2.5/2)[0²e^0 + 2(2.5²e^(−2.5)) + 2(5²e^(−5)) + 2(7.5²e^(−7.5)) + 10²e^(−10)]    (25.30)
= 1.25[0 + 2(0.513031) + 2(0.168449) + 2(0.031111) + 0.00454]    (25.31)
= 1.78715    (25.32)
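The corresponding Mathematica sketch of equation 25.28 (function name ours) reproduces this value:

CompositeTrapezoid[f_, a_, b_, n_] := Module[{h = (b - a)/n},
  h/2 (f[a] + f[b] + 2 Sum[f[a + i h], {i, 1, n - 1}])]

CompositeTrapezoid[Function[x, x^2 Exp[-x]], 0., 10., 4]  (* ≈ 1.78715 *)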
Figure 25.5: Relative error for trapezoidal method (orange); midpoint method (red);
and Simpson’s method (blue) for various step sizes.
Simpson's rule is derived by fitting a quadratic to three equally spaced points. Let

P(x) = A(x − x_0)² + B(x − x_1) + C

be a quadratic that passes through the three points (x_0, f_0), (x_1, f_1), and (x_2, f_2), where x_1 = x_0 + h and x_2 = x_0 + 2h. Then

f_0 + f_2 = 4Ah² + 2C    (25.41)

Equations 25.43, 25.44 and 25.45 give us the values of A, B, and C. As it turns out, we will only need to know A and C but not B. The integral is

I = ∫_{x_0}^{x_2} [A(x − x_0)² + B(x − x_1) + C]dx    (25.46)
  = [(A/3)(x − x_0)³ + (B/2)(x − x_1)² + Cx]_{x_0}^{x_2}    (25.47)
  = (A/3)[(x_2 − x_0)³ − (x_0 − x_0)³] + (B/2)[(x_2 − x_1)² − (x_0 − x_1)²] + C(x_2 − x_0)    (25.48)
  = (A/3)(2h)³ + (B/2)[h² − h²] + C(2h)    (25.49)
  = (8/3)Ah³ + 2Ch    (25.50)

Substituting equations 25.43 and 25.44,

I = (h/3)[8Ah² + 6C] = (h/3)[4(2Ah²) + 3(2C)]    (25.51)
  = (h/3)[4(f_0 − 2f_1 + f_2) + 3(−f_0 + 4f_1 − f_2)]    (25.52)
  = (h/3)[f_0 + 4f_1 + f_2]    (25.53)

This gives us Simpson's Rule:

∫_{x_0}^{x_2} f(x)dx ≈ (h/3)[f(x_0) + 4f(x_1) + f(x_2)]    (25.54)
If n is even then

∫_{x_0}^{x_n} f(x)dx = ∫_{x_0}^{x_2} f(x)dx + ∫_{x_2}^{x_4} f(x)dx + ∫_{x_4}^{x_6} f(x)dx + ··· + ∫_{x_{n−2}}^{x_n} f(x)dx    (25.57)
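Applying equation 25.54 on each pair of subintervals yields the usual composite Simpson weights (h/3)[f_0 + 4f_1 + 2f_2 + 4f_3 + ··· + 4f_{n−1} + f_n]; a Mathematica sketch (function name ours, n even):

CompositeSimpson[f_, a_, b_, n_?EvenQ] := Module[{h = (b - a)/n},
  h/3 (f[a] + f[b] + 4 Sum[f[a + i h], {i, 1, n - 1, 2}] +
     2 Sum[f[a + i h], {i, 2, n - 2, 2}])]

CompositeSimpson[Function[x, x^2 Exp[-x]], 0., 10., 10]  (* ≈ 2.0170 *)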
Then

∫_a^b f(x)dx ≈ ∫_a^b Σ_{i=0}^{n} c_i x^i dx = Σ_{i=0}^{n} c_i ∫_a^b x^i dx = Σ_{i=0}^{n} [c_i/(i+1)](b^(i+1) − a^(i+1))    (25.65)
For example, we know how to fit n + 1 points with the nth Lagrange polynomial. For n = 1 the points are x_0 = a and x_1 = b = a + h, and so we write

L_0 = (x − x_1)/(x_0 − x_1) = (x − b)/(a − b) = (1/h)(b − x)    (25.67)

and

L_1 = (x − x_0)/(x_1 − x_0) = (x − a)/(b − a) = (1/h)(x − a)    (25.68)

The error is

R = [f''(c)/2!](x − a)(x − b)    (25.69)

Hence

f(x) ≈ P(x) = L_0(x)f_0 + L_1(x)f_1 = (f_0/h)(b − x) + (f_1/h)(x − a)    (25.70)
from which we can calculate the integral

∫_a^b f(x)dx ≈ ∫_a^b [P(x) + R(x)]dx    (25.71)
= (f_0/h)∫_a^b (b − x)dx + (f_1/h)∫_a^b (x − a)dx + [f''(c)/2]∫_a^b (x² − (a + b)x + ab)dx    (25.72)
= (f_0/h)[bx − x²/2]_a^b + (f_1/h)[x²/2 − ax]_a^b + [f''(c)/2][x³/3 − (a + b)x²/2 + abx]_a^b    (25.73)
= (f_0/h)[b(b − a) − (b² − a²)/2] + (f_1/h)[(b² − a²)/2 − a(b − a)]
  + [f''(c)/2][(b³ − a³)/3 − (a + b)(b² − a²)/2 + ab(b − a)]    (25.74)

Applying the algebraic relations

b − a = h    (25.75)
b² − a² = (b − a)(b + a) = h(b + a)    (25.76)
b³ − a³ = (b − a)(b² + ab + a²) = h(b² + ab + a²)    (25.77)

gives

∫_a^b f(x)dx = (f_0/h)[bh − (h/2)(b + a)] + (f_1/h)[(h/2)(b + a) − ah]    (25.78)
  + [f''(c)/2][(h/3)(b² + ab + a²) − (h/2)(a + b)² + abh]    (25.79)
= [(b − a)/2]f_0 + [(b − a)/2]f_1
  + [h f''(c)/12][2b² + 2ab + 2a² − 3a² − 6ab − 3b² + 6ab]    (25.80)
= (h/2)[f_0 + f_1] + [h f''(c)/12][−b² + 2ab − a²]    (25.81)
= (h/2)[f_0 + f_1] − h³f''(c)/12    (25.82)
The resulting Trapezoidal Rule with Remainder is

∫_a^b f(x)dx = (h/2)[f_0 + f_1] − h³f''(c)/12    (25.83)

We then can obtain a composite quadrature rule by applying this formula at each pair of successive grid points [x_i, x_{i+1}]:

∫_{x_i}^{x_{i+1}} f(x)dx = (h/2)[f_i + f_{i+1}] − h³f''(c_i)/12    (25.84)
we get

∫_a^b f(x)dx = Σ_{i=0}^{n−1} ∫_{x_i}^{x_{i+1}} f(x)dx    (25.85)
= Σ_{i=0}^{n−1} [(h/2)(f_i + f_{i+1}) − h³f''(c_i)/12]    (25.86)
= (h/2)Σ_{i=0}^{n−1}(f_i + f_{i+1}) − (h³/12)Σ_{i=0}^{n−1} f''(c_i)    (25.87)

By the intermediate value theorem there is some number µ ∈ [a, b] such that f''(µ) is the average

f''(µ) = (1/n)[f''(c_0) + f''(c_1) + ··· + f''(c_{n−1})]    (25.88)

and therefore

∫_a^b f(x)dx = (h/2)[(f_0 + f_1 + ··· + f_{n−1}) + (f_1 + f_2 + ··· + f_n)] − (h³/12)n f''(µ)    (25.89)
Substituting nh = b − a we arrive at the Composite Trapezoidal Rule with Remainder

∫_a^b f(x)dx = (h/2)[f_0 + 2f_1 + 2f_2 + ··· + 2f_{n−1} + f_n] − [h²(b − a)/12] f''(µ)    (25.90)
where

a_i = ∫_a^b L_i(x)dx = ∫_a^b ∏_{k=0, k≠i}^{n} [(x − x_k)/(x_i − x_k)] dx    (25.96)
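Equation 25.96 can be evaluated symbolically. Here is a hypothetical Mathematica sketch that computes the closed Newton-Cotes weights for n = 2 on an equally spaced grid; it should recover the Simpson weights h/3, 4h/3, h/3 of equation 25.54:

n = 2;
xs = Table[x0 + k h, {k, 0, n}];
weights = Table[
   Integrate[
    Product[If[k == i, 1, (x - xs[[k + 1]])/(xs[[i + 1]] - xs[[k + 1]])],
     {k, 0, n}], {x, xs[[1]], xs[[-1]]}],
   {i, 0, n}];
Simplify[weights]  (* -> {h/3, 4 h/3, h/3} *)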
A similar technique, called the Open Newton-Cotes Technique, does not include the endpoints in the polynomial interpolation. We renumber the grid points so that we have n + 3 grid points at a = x_{−1} < x_0 < x_1 < ··· < x_n < x_{n+1} = b. Then equations 25.95 and 25.96 still hold; for example, for n = 2 the rule with its modified remainder is

∫_{x_{−1}}^{x_3} f(x)dx = (4h/3)[2f_0 − f_1 + 2f_2] + (14h⁵/45) f⁽⁴⁾(c)    (25.101)
We will use the terms "Ordinary Differential Equation" and "Differential Equation", as well as the abbreviations ODE and DE, interchangeably. More generally, one can include partial derivatives in the definition, in which case one must distinguish between "Partial" DEs (PDEs) and "Ordinary" DEs (ODEs). We will leave the study of PDEs to another class.

Equation 26.1 is, in general, very difficult, and often impossible, to solve, either analytically (e.g., by finding a formula that describes y) or numerically (e.g., by using a computer to draw a picture of the graph of the solution). Often it is possible to solve 26.1 explicitly for the derivatives:

y' = f(t, y)    (26.3)

Many important problems can be put into this form, and solutions are known to exist for a wide class of functions, in particular as a result of theorem 26.2. The class of problems in which equation 26.1 can be converted to the form 26.3, at least locally,
¹In general, this theory can be extended to higher dimensions, where y ∈ Rⁿ; all of the same results hold.
is not seriously restrictive from a practical point of view. The only requirements are that F be sufficiently smooth² and that the matrix of partial derivatives ∂F/∂y' (the Jacobian matrix) be nonsingular³ (∂F/∂y' ≠ 0 for a scalar equation). Then by the implicit function theorem we can solve for y' locally. An equation of the form 26.1 for which the Jacobian is nonsingular is thus called an ordinary differential equation, and we will focus on equations of this form in the first several chapters of these notes. It turns out that an equation for which the Jacobian is singular actually has hidden constraints: it is really a combination of differential equations and algebraic constraints, and is called a differential algebraic equation.
Theorem 26.1. Implicit Function Theorem on R.⁴ Let F(t, y) have continuous derivatives ∂F/∂t and ∂F/∂y in the neighborhood of a point (t_0, y_0), where

F(t_0, y_0) = 0,   ∂F(t_0, y_0)/∂y ≠ 0    (26.4)

Then there are intervals I and J where

I = [t_0 − a, t_0 + a]    (26.5)
J = [y_0 − b, y_0 + b]    (26.6)

and a rectangle R = I × J, such that the equation F(t, y) = 0 has precisely one solution y = f(t) lying in the rectangle R, such that

F(t, f(t)) = 0    (26.7)
f(t) ∈ J    (26.8)
F_y(t, f(t)) ≠ 0    (26.9)

for all t ∈ I.
Example 26.1. Solve y' = y.

Solution. Writing y' = dy/dt and integrating we find

∫(1/y)dy = ∫dt    (26.10)
ln|y| = t + C    (26.11)
y = Ke^t    (26.12)

where K = ±e^C. There is no restriction on the values of either C or K.
²Throughout these notes we will assume that F is sufficiently smooth without explicitly stating so. By "sufficiently smooth" we mean that F is continuously differentiable enough times to give us the results we want.
³Strictly speaking, nonsingularity is not really a requirement. Nonsingularity is sufficient to ensure that a solution exists, but is not required. There are examples of functions with singularities at points but for which solutions may exist.
⁴For a proof, see Richard Courant and Fritz John, Introduction to Calculus and Analysis, Volume II/1, Springer Classics in Mathematics, 1998, page 225.
Thus we see that equation 26.1 (or 26.3) will often admit an infinite number of solutions owing to arbitrary constants of integration that arise during its solution. For example 26.1 this is illustrated in figure 26.1, which shows the one-parameter family of solutions to the example. A particular physical problem may only correspond to one member of this family. To pin down this constant, the problem must be further constrained. Such a constraint can take various forms. The nature of the constraint can have an enormous impact on our ability to solve the equation.
Figure 26.1: One parameter family of solutions to y 0 = y, showing the solutions for
various values of the constant of integration.
is called an initial value problem. The constraint 26.14 is called an initial con-
dition.
Figure 26.3: The one-parameter family of solutions for y' = (3 − y)/2 for different values of the constant of integration, and the solution to the initial value problem through (t_0, y_0) = (2π, 4) (heavy line). The initial condition is indicated by the large gray dot.
We will say that an initial value problem is well posed if it meets the following
criteria:
A solution exists;
If a problem is not well posed then there is no point in trying to solve it numerically,
so we begin our study of initial value problems by looking at what it takes to make
a problem well posed. We will find that a Lipschitz Condition, defined below in definition 26.4, is sufficient to ensure that the problem is well posed.
The importance (and usefulness) of initial value problems is enhanced by a general
existence theorem and the fact that under appropriate conditions (namely, a Lipschitz
Condition) the solution is unique. While we will defer the proof of this statement
until later, we will present one of many different versions of the fundamental existence
theorem.
The existence theorem is illustrated in figure 26.4. Given any initial value, there
is some solution that passes through the point. Observe that the existence of the
solution is not guaranteed globally, only within some open neighborhood of the initial
condition.
Theorem 26.3 (Continuous dependence on IC). Under the same conditions, the
solution depends continuously on the initial data, i.e., if ỹ is a solution satisfying the
same ODE with ỹ(t0 ) = ỹ0 , then
Theorem 26.4 (Perturbed Equation). Under the same conditions, suppose that ỹ is
a solution of the perturbed ODE,
where r is bounded on D, i.e., there exists some M > 0 such that |r(t)| ≤ M on D. Then

|y(t) − ỹ(t)| ≤ e^(Kt)|y_0 − ỹ_0| + (M/K)(e^(Kt) − 1)    (26.24)
Proving that a function is Lipschitz is considerably eased by the following theorem.
Theorem 26.5. Suppose that |∂f /∂y| is bounded by K on a set D. Then f (t, y) ∈
L(y, K)(D).
Proof. The result follows immediately from the mean value theorem. Let (t, y1 ), (t, y2 ) ∈
D. Then there is some number c between y1 and y2 such that
Example 26.3. Show that a unique solution exists to the initial value problem

Solution. We have f(t, y) = sin(ty), hence f_y = t cos(ty). Thus |f_y| ≤ |t|, which is bounded for any finite range of t. Let R be a bounded, convex set enclosing (0, 0.5), and let

K = 1 + sup_{t∈R} |t|    (26.27)

Since R is bounded we know that the supremum exists. By adding 1 we ensure that we have a number that is strictly larger than the maximum value of |t|. Then K is a Lipschitz constant for f and hence a unique solution exists in some neighborhood N of (0, 0.5). See figure 26.5.
Solution. Finding a "solution" is easy enough. We can separate the variables and integrate. It is easily verified (by direct substitution) that

y = 2 sin(t + π/2)    (26.30)
Figure 26.5: A solution exists in some neighborhood N of (0, 0.5). See Example 26.3
satisfies both the differential equation and the initial condition, hence it is a solution. Since the solution is not unique, any condition that guarantees uniqueness must be violated. We have two such conditions: the boundedness of the partial derivative, and the Lipschitz condition. The first implies the second, and the second implies uniqueness. But

∂f/∂y = −y/√(4 − y²)    (26.32)

which is unbounded at y = 2. So the first condition is violated. Of course, a violation of the condition does not ensure non-uniqueness; all it tells us is that uniqueness is not ensured.
Figure 26.6: There are several solutions to y' = √(4 − y²) that pass through the point (0, 2). See Example 26.4.
What about the Lipschitz condition? Suppose that the function f(y) = √(4 − y²) is Lipschitz with Lipschitz constant K > 0 on some domain D. Then for any y_1, y_2 in D,

K|y_1 − y_2| ≥ |f(y_1) − f(y_2)| = |√(4 − y_1²) − √(4 − y_2²)|    (26.33)

Setting y_1 = 2, this requires K|2 − y_2| ≥ √(4 − y_2²), which fails as y_2 → 2. So f(t, y) is not Lipschitz, either. Again, this does not guarantee non-uniqueness; it just tells us that uniqueness is not guaranteed.
Method of Successive Approximations
y' = f(t, y)    (27.1)
y(t_0) = y_0    (27.2)

If we substitute φ_0 from equation 27.4 for φ(s) in the integral equation 27.3, we get a second approximation

φ_1(t) = y_0 + ∫_{t_0}^{t} f(s, φ_0)ds    (27.5)
4. output: φi (t)
We will show in this chapter that when f is Lipschitz in y, algorithm 27.1 converges to the unique solution of equation 27.1. Technically speaking, however, Picard Iteration¹ does not guarantee a solution to any specific accuracy except in the limit as n → ∞. Thus it is usually impractical. Nevertheless it has the advantage that it is easily implemented in a computer algebra system, and will sometimes yield useful results.
Example 27.1. Solve y' = y, y(0) = 1 using Picard Iteration.

Solution. Since f(t, y) = y, t_0 = 0, y_0 = 1, we have the following:

φ_0 = 1 + ∫_0^t ds = 1 + t    (27.8)
φ_1 = 1 + ∫_0^t (1 + s)ds = 1 + t + t²/2    (27.9)
φ_2 = 1 + ∫_0^t (1 + s + s²/2)ds    (27.10)
    = 1 + t + t²/2 + t³/3!    (27.11)
¹The method bears the name of Charles Émile Picard (1856-1941), who popularized the technique, and published it in 1890, but gave credit to Hermann Schwarz. Giuseppe Peano in 1887, Ernst Leonard Lindelöf in 1890, and G. von Escherich in 1899 also published existence proofs based on this technique. Hartman claims that both Liouville and Cauchy were aware of this method. Schwarz, for his part, outlined the technique in a Festschrift honoring Karl Weierstrass' 70th birthday in 1885.
We can check this out by induction. It certainly holds for n = 1. For the inductive step, assume equation 27.12 and solve for φ_{n+1}:

φ_{n+1} = 1 + ∫_0^t Σ_{k=0}^{n+1} (s^k/k!) ds    (27.13)
        = 1 + Σ_{k=0}^{n+1} t^(k+1)/(k + 1)!    (27.14)

which is exactly what equation 27.12 gives for φ_{n+1}. Hence by the convergence theorem (Theorem 27.3), the corresponding infinite series converges to the actual solution of the IVP:

φ(t) = Σ_{k=0}^{∞} t^k/k! = e^t    (27.16)
where the last step follows from Taylor’s theorem.
Picard iteration is quite easy to implement in Mathematica; here is one possible implementation that will print out the first n iterations of the algorithm.

Picard[f_, t_, t0_, y0_, n_] :=
  Module[{i, y = y0},
    Print[Subscript["φ", 0], "=", y0];
    For[i = 0, i < n, i++,
      ynext = y0 + Integrate[f[s, y /. {t -> s}], {s, t0, t}];
      y = ynext;
      Print[Subscript["φ", i + 1], "=", y];
    ];
    Return[Expand[y]]
  ]
Function Picard has five arguments (f, t, t0, y0, n) and two local variables (i, y):

Picard[f_, t_, t0_, y0_, n_] :=
  Module[{i, y = y0},
    ...
  ]
The local variable y is initialized to the value of the parameter y0 in the list of
variable declarations. This is equivalent to initializing the value of the variable in the
first line of the program. The first line of the program prints the initial iteration as
φ0 =value of parameter y0 ,
Print[Subscript["φ", 0], "=", y0];
The output will be displayed on the console in an "output cell." The next line of the program is a For loop. A For statement takes four arguments:

For[initialization,
  test,
  increment,
  statement;
  ...
  statement;
]
The For loop takes the following actions:
1. The initialization statement (or sequence of statements) is executed;
2. The test is evaluated. If it evaluates to False then the rest of the For is
ignored.
3. Each of the statements is evaluated in sequence.
4. The increment statement is evaluated.
5. Steps (2) through (4) are repeated until test is False.
In our program, we have a counter i that is initially set equal to zero; then the con-
tents of the For are executed only so long as i < n; and the value of i is incremented
by 1 on each iteration. Hence the loop will execute n times. Within the loop three
statements are executed on each iteration:
For[i=0, i<n, i++,
Z t
ynext=y0+ (f[s, y/.{t->s}]) ds;
t0
y=ynext;
Print[Subscript["φ", i+1], "=", y];
];
There are two important variables used in this loop: y and ynext. At the start of
each iteration, y refers to the value of the previous iteration φi−1 , while at the end of
each iteration (because of the statement y=ynext) it refers to the current iteration φi .
In the first line of the iteration the next iteration after φi−1 , namely, φi , is calculated
and saved in ynext. The value depends on the integral
∫_{t_0}^{t} f(s, φ_{i−1}(s))ds

But φ_{i−1}(s) is represented by the value of y at this point. Unfortunately, the expression for y depends upon t, and we need to integrate over s and not t. So to get the right variable in the expression for f(s, φ_{i−1}(s)) we need to replace t everywhere by s. We do that with the expression

y /. {t -> s}

which means, quite literally: take the expression for y, and everywhere that a t appears in it, replace the t with an s. To perform this substitution inside the integral only, we do the following:

ynext = y0 + Integrate[f[s, y /. {t -> s}], {s, t0, t}];
So then ynext (φi ) is calculated and saved as y, and the results of the current iteration
are printed on the console. The final line of the program returns the value of the final
iteration in expanded form, namely, with all multiplications and factoring expanded
out:
Return[Expand[y]]
To print the first 5 iterations of y' = y cos t, y(0) = 1 using this function, one enters

g[tvariable_, yvariable_] := yvariable*Cos[tvariable];
Picard[g, t, 0, 1, 5];

which prints

φ0 = 1
φ1 = 1 + Sin[t]
φ2 = 1 + Sin[t] + (1/2)Sin[t]²
φ3 = 1 + Sin[t] + (1/2)Sin[t]² + (1/6)Sin[t]³
φ4 = 1 + Sin[t] + (1/2)Sin[t]² + (1/6)Sin[t]³ + (1/24)Sin[t]⁴
φ5 = 1 + Sin[t] + (1/2)Sin[t]² + (1/6)Sin[t]³ + (1/24)Sin[t]⁴ + (1/120)Sin[t]⁵
Definition 27.2 (Vector Space). A vector space V is a set that is closed under
two operations that we call addition and scalar multiplication such that the following
properties hold:
Closure For all vectors u, v ∈ V, and for all a ∈ R,
u+v ∈V (27.19)
av ∈ V (27.20)
u + (v + w) = (u + v) + w (27.22)
Identity for Addition There is some element 0 ∈ V such that for all v ∈ V
(a + b)v = av + bv (27.26)
a(u + v) = au + av (27.27)
1v = v (27.28)
Example 27.2. The usual Cartesian vector space to which we are accustomed is a
vector space with vectors being defined as ordered triples of coordinates hx, y, zi.
Example 27.3. Show that the set F[a, b] of all integrable functions f : [a, b] 7→ R is
a vector space.
V is closed: Let p(t) = f (t) + g(t) and q(t) = ch(t). Then p, q : [a, b] 7→ R hence
p, q ∈ F[a, b]
(f (t) + g(t)) + h(t) = f (t) + (g(t) + h(t)) and c(df (t)) = (cd)f (t) so both
associative properties hold.
(c + d)f (t) = cf (t) + df (t) and c(f (t) + g(t)) = cf (t) + cg(t) so both distributive
properties hold.
Definition 27.4 (Normed Vector Space). A vector space on which a norm has
been defined is a normed vector space.
Taxicab (Manhattan, City Block) Norm. The L1 norm is ‖v‖_1 = |x| + |y| + |z|.

Euclidean Distance Function. The L2 norm is ‖v‖_2 = √(x² + y² + z²).

Example 27.5. The following norms can be defined on the vector space F[a, b] of integrable functions on [a, b]:

L2-norm: ‖f‖_2 = √(∫_a^b |f(x)|² dx)

Lp-norm: ‖f‖_p = (∫_a^b |f(x)|^p dx)^(1/p)
for some K ∈ R, 0 < K < 1, for all f, g ∈ S. We will call the number K the contraction constant.
‖Tⁿg − g‖ ≤ [(1 − Kⁿ)/(1 − K)]‖Tg − g‖    (27.31)

Proof. Use induction. For n = 1, we have

‖Tg − g‖ ≤ [(1 − K)/(1 − K)]‖Tg − g‖    (27.32)

As our inductive hypothesis choose any n > 1 and suppose that equation 27.31 holds. Then by the triangle inequality

< Kⁿ‖Tg − g‖/(1 − K)    (27.38)
Pick any two integers m ≥ n ≥ N, and define the sequence g_0 = g, g_n = Tg_{n−1}. Then

‖g_m − g_n‖ = ‖T^m g − T^n g‖    (27.39)
            ≤ Kⁿ‖T^(m−n)g − g‖    (27.40)
            ≤ Kⁿ[(1 − K^(m−n))/(1 − K)]‖Tg − g‖    (27.41)

by Lemma 27.1. Hence

‖g_m − g_n‖ ≤ [(Kⁿ − K^m)/(1 − K)]‖Tg − g‖ ≤ [Kⁿ/(1 − K)]‖Tg − g‖ < ε    (27.42)
Therefore g_n is a Cauchy sequence, and every Cauchy sequence on a complete normed vector space converges. Define f = lim_{n→∞} g_n. Then either f is a fixed point of T or it is not a fixed point of T. Suppose that it is not a fixed point of T. Then Tf ≠ f and hence there exists some δ > 0 such that

‖Tf − f‖ > δ    (27.43)

On the other hand, because g_n → f, there exists an integer N such that for all n > N,

Hence

‖h − f‖ = ‖Th − Tf‖    (27.50)
        ≤ K‖h − f‖    (27.51)
        < ‖h − f‖    (27.52)

which is impossible, and hence a contradiction. Thus f is the unique fixed point of T.
We restate the fundamental existence theorem here for reference. While it is stated in terms of the scalar problem, the vector problem is not fundamentally different, and the proof is completely analogous.

has a unique solution φ(t) in the sense that φ'(t) = f(t, φ(t)), φ(t_0) = y_0.
Let S be the set of all continuous integrable functions on an interval (a, b) that contains t_0. Corresponding to any function φ ∈ S we can define the mapping T : S ↦ S as

T[φ] = y_0 + ∫_{t_0}^{t} f(x, φ(x))dx    (27.55)
We will assume t > t_0. The proof for t < t_0 is completely analogous. Using the sup-norm on (a, b), we calculate that for any two functions g, h ∈ S,

‖T[g] − T[h]‖ = ‖∫_{t_0}^{t} [f(x, g(x)) − f(x, h(x))] dx‖    (27.56)
              ≤ sup_{a≤t≤b} ∫_{t_0}^{t} |f(x, g(x)) − f(x, h(x))| dx    (27.57)
              ≤ K(b − a)‖g − h‖    (27.60)

where K is any number larger than sup_{(a,b)} |f_y|. If we choose the endpoints a and b such that |b − a| < 1/K we have K|b − a| < 1. Thus T is a contraction. By the contraction mapping theorem it has a fixed point; call this point φ. Equation 27.54 follows immediately.
Theorem 27.7 (Error Bounds on Picard Iteration). Under the same conditions as before, let φ_n be the nth Picard iterate, and let φ be the solution of the IVP. Then

|φ(t) − φ_n(t)| ≤ [M|K(t − t_0)|^(n+1)/(K(n + 1)!)] e^(K|t−t_0|)    (27.61)

where M = sup_D |f(t, y)| and K is a Lipschitz constant. Furthermore, if L = |b − a| then

‖φ(t) − φ_n(t)‖ ≤ M(KL)^(n+1)e^(KL)/[K(n + 1)!]    (27.62)

where ‖·‖ denotes the sup-norm.
Proof. We begin by proving the conjecture

|φ_n − φ_{n−1}| ≤ [K^(n−1)M/n!]|t − t_0|^n    (27.63)

For n = 1, equation 27.63 says that

|φ_1 − y_0| ≤ M|t − t_0|    (27.64)

which follows immediately from equation 27.54. Next, make the inductive hypothesis 27.63 and calculate

|φ_{n+1} − φ_n| = |∫_{t_0}^{t} [f(s, φ_n(s)) − f(s, φ_{n−1}(s))] ds|    (27.65)
               ≤ K ∫_{t_0}^{t} |φ_n(s) − φ_{n−1}(s)| ds    (27.66)

by the definition of φ_n and the Lipschitz condition. Applying the inductive hypothesis and then integrating,

|φ_{n+1} − φ_n| ≤ [K^n M/n!] ∫_{t_0}^{t} |s − t_0|^n ds    (27.67)
               ≤ [K^n M/(n + 1)!]|t − t_0|^(n+1)    (27.68)
The following example shows that this bound is not very useful in practice.

Example 27.6. Estimate the number of iterations required to obtain a solution to y' = t, y(0) = 1 on [0, 10] with a precision of no more than 10⁻⁷.

Solution. Since f(t, y) = t we have f_y = 0 and hence a Lipschitz constant is K = 1 (or any positive number), and we can use M = 10 on [0, 10]. The precision in the error is bounded by

M(KL)^(n+1)e^(KL)/[K(n + 1)!] ≤ 10(10)^(n+1)e^10/(n + 1)!    (27.80)
We can determine the minimum value of n by using Mathematica. The following will print a list of values of equation 27.80 for n ranging from 1 to 50.

errs = Table[{n, 10 (10)^(n + 1) (E^10.)/(n + 1)!}, {n, 1, 50}]

The output is a list of number pairs, which can be plotted with ListPlot or

<<Graphics`Graphics`
LogListPlot[errs];

The output of LogListPlot is shown below; we have annotated the plot with an additional line at the desired tolerance of 10⁻⁷ showing that it occurs at the 47th iteration.
This example shows that Picard iteration will produce the desired accuracy if we perform 47 iterations. For this particular problem, this suggestion is absurd, because Picard iteration converges to the exact solution after 1 iteration. The calculated solution does not change upon further iterations. Hence the method vastly overestimates the potential error (at least for this example).
Euler’s Method
Figure 28.1: Illustration of Euler's Method. A tangent line with slope f(t_0, y_0) is constructed from (t_0, y_0) forward a distance h = t_1 − t_0 in the t-direction to determine y_1. Then a line with slope f(t_1, y_1) is constructed forward from (t_1, y_1) to determine y_2, and so forth. Only the first line is tangent to the actual solution; the subsequent lines are only approximately tangent.
Since y'(t) = f(t, y), we can approximate the left hand side of (28.7) by

and hence

y_{n+1} = y_n + h_n f(t_n, y_n)    (28.9)

It is often the case that we use a fixed step size h = t_{j+1} − t_j, in which case we have

t_j = t_0 + jh    (28.10)

The Forward Euler's method is sometimes just called Euler's Method. The application of Euler's method is summarized below.

y(t_{n+1}) = y(t_n + h_n) = y(t_n) + h_n y'(t_n) + (h_n²/2)y''(t_n) + ···    (28.12)
           = y(t_n) + h_n f(t_n, y(t_n)) + ···    (28.13)

We then observe that since y_n ≈ y(t_n) and y_{n+1} ≈ y(t_{n+1}), then (28.9) follows immediately from (28.13).
Solution. The exact solution is y = e^t. We compute the values using Euler's method. For any given time point t_k, the value y_k depends purely on the values of t_{k−1} and y_{k−1}. This is often a source of confusion for students: although the formula y_{k+1} = y_k + h f(t_k, y_k) only depends on t_k and not on t_{k+1}, it gives the value of y_{k+1}.
We are given the following information:
y2 = y1 + hf (t1 , y1 ) (28.22)
= 1.25 + (0.25)(1.25) = 1.5625 (28.23)
t2 = t1 + h = 0.25 + 0.25 = 0.5 (28.24)
(t2 , y2 ) = (0.5, 1.5625) (28.25)
y3 = y2 + hf (t2 , y2 ) (28.26)
= 1.5625 + (0.25)(1.5625) = 1.953125 (28.27)
t3 = t2 + h = 0.5 + 0.25 = 0.75 (28.28)
(t3 , y3 ) = (0.75, 1.953125) (28.29)
y4 = y3 + hf (t3 , y3 ) (28.30)
= 1.953125 + (0.25)(1.953125) = 2.44140625 (28.31)
t4 = t3 + 0.25 = 1.0 (28.32)
(t4 , y4 ) = (1.0, 2.44140625) (28.33)
Since t_4 = 1 we are done. The solutions are tabulated below for this and other step sizes.
t h = 1/2 h = 1/4 h = 1/8 h = 1/16 exact solution
0.0000 1.0000 1.0000 1.0000 1.0000 1.0000
0.0625 1.0625 1.0645
0.1250 1.1250 1.1289 1.1331
0.1875 1.1995 1.2062
0.2500 1.2500 1.2656 1.2744 1.2840
0.3125 1.3541 1.3668
0.3750 1.4238 1.4387 1.4550
0.4375 1.5286 1.5488
0.5000 1.5000 1.5625 1.6018 1.6242 1.6487
0.5625 1.7257 1.7551
0.6250 1.8020 1.8335 1.8682
0.6875 1.9481 1.9887
0.7500 1.9531 2.0273 2.0699 2.1170
0.8125 2.1993 2.2535
0.8750 2.2807 2.3367 2.3989
0.9375 2.4828 2.5536
1.0000 2.2500 2.4414 2.5658 2.6379 2.7183
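A compact Mathematica sketch of Euler's method (the function name is ours) generates the h = 1/4 column of this table:

EulerMethod[f_, t0_, y0_, h_, n_] :=
  Module[{t = t0, y = y0, pts = {{t0, y0}}},
    Do[
      y = y + h f[t, y];  (* eq. 28.9 *)
      t = t + h;
      AppendTo[pts, {t, y}],
      {n}];
    pts]

EulerMethod[Function[{t, y}, y], 0, 1, 0.25, 4]
(* -> {{0, 1}, {0.25, 1.25}, {0.5, 1.5625}, {0.75, 1.953125}, {1., 2.44140625}} *)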
This example illustrates a problem that occurs in the solution of differential equations, known as stiffness. Stiffness occurs when the numerical method becomes unstable. An exploration of this phenomenon is beyond the scope of Math 481A (the topic is covered in great detail in Math 582B). One solution is to modify Euler's method as illustrated in figure 29.1 to give the Backward's Euler Method:

y_n = y_{n−1} + h_n f(t_n, y_n)    (29.2)
The problem with the Backward’s Euler method is that we need to know the
answer to compute the solution: yn exists on both sides of the equation, and in general,
we can not solve explicitly for it. The Backwards Euler Method is an example of an
implicit method, because it contains yn implicitly. In general it is not possible
to solve for yn explicitly as a function of yn−1 in equation 29.2, even though it is
sometimes possible to do so for specific differential equations. Thus at each mesh
point one needs to make some first guess to the value of yn and then perform some
additional refinement to improve the calculation of yn before moving on to the next
mesh point. A common method is to use fixed point iteration on the equation
y = k + hf (t, y) (29.3)
where k = yn−1 . The technique is summarized here:
1. Make a first guess at y_n and use that in the right hand side of 29.2. A common first guess that works reasonably well is

y_n^(0) = y_{n−1}    (29.4)

2. Use the improved estimate of y_n produced by 29.2, and then evaluate 29.2 again to get a third guess, e.g.,

y_n^(ν+1) = y_{n−1} + h f(t_n, y_n^(ν))    (29.5)

3. Repeat the process until the difference between two successive guesses is smaller than the desired tolerance.
Of course we know that Fixed Point iteration will only converge if there is some number K < 1 such that |∂g/∂y| < K, where g(t, y) = k + h f(t, y). An implementation of Backward's Euler method with Fixed Point iteration in Mathematica is as follows:
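One possible such implementation, sketched here with our own function name and argument order, and with the default values tol = 0.003 and nmax = 5 that are described below:

BackwardEuler[f_, t0_, y0_, h_, n_, tol_: 0.003, nmax_: 5] :=
  Module[{t = t0, y = y0, k, ynew, yold, j, pts = {{t0, y0}}},
    Do[
      k = y;        (* k = y at the previous mesh point *)
      t = t + h;
      ynew = k;     (* first guess, eq. 29.4 *)
      For[j = 0, j < nmax, j++,
        yold = ynew;
        ynew = k + h f[t, ynew];  (* fixed point iteration, eq. 29.5 *)
        If[Abs[ynew - yold] < tol, Break[]]
      ];
      y = ynew;
      AppendTo[pts, {t, y}],
      {n}];
    pts]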
Figure 29.2: Result of the forward Euler method to solve y 0 = −100(y−sin t), y(0) = 1
with h = 0.001 (top), h = 0.019 (middle), and h = 0.02 (third). The bottom figure
shows the same equation solved with the backward Euler method for step sizes of
h = 0.001, 0.02, 0.1, 0.3, left to right curves, respectively
The input parameter tol gives the tolerance for the fixed point iteration: when
two successive guesses are this close together, the iteration stops. The default value
is 0.003. Similarly, we have added a parameter nmax which is an emergency cut-off
for the fixed point iteration. This number prevents infinite loops in the event the
tolerance is never reached. The number of iterations is counted, and if they reach the
value of nmax the fixed point iteration stops. Because there is always the possibility
(either through a program bug or some sort of bizarre input) that the algorithm will
not terminate, it is generally a good programming practice to always include this type
of counter and cut-off value. In the implementation shown above nmax has a default
value of 5. Since both nmax and tol have default values, they are considered optional
parameters by Mathematica: if you are happy with the values of the defaults, you do
not have to supply them when you call the program.
However, to avoid ill-conditioned equations it is usually better to use a root-finding algorithm such as Newton's method to find the root y of y = k + h f(t, y), e.g., use Newton's method to find the root of F(y) = y − k − h f(t, y). To solve the initial value problem y' = −50(y − sin t), y(0) = 1 on the interval [0, 3] using a step size of h = 0.3, one might enter:
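Using the fixed point sketch above (the argument order is the one we chose there):

sol = BackwardEuler[Function[{t, y}, -50 (y - Sin[t])], 0, 1, 0.3, 10];
ListPlot[sol]

The returned {t, y} pairs can then be compared against the curves of figure 29.2.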
Here we introduce the Local Truncation Error, one measure of the "goodness" of a numerical method. The local truncation error tells us the error in the calculation of y, in units of h, at each step t_n, assuming that we know y_{n−1} exactly. Suppose we have a numerical estimate y_n of the correct solution at y(t_n). Then the Local Truncation Error is defined as

LTE = (1/h)(y(t_n) − y_n)    (30.4)
    = (1/h)(y(t_n) − y(t_{n−1}) + y(t_{n−1}) − y_n)    (30.5)

Assuming we know the answer exactly at t_{n−1}, we have

y_n = y_{n−1} + φ(t_n, y_n, ...)    (30.6)
so that

LTE = [y(t_n) − y(t_{n−1})]/h + (y_{n−1} − y_n)/h    (30.7)
    = [y(t_n) − y(t_{n−1})]/h − (1/h)φ(t_n, y_n, ...)    (30.8)

For Euler's method,

φ = h f(t, y)    (30.9)

hence

LTE(Euler) = [y(t_n) − y(t_{n−1})]/h − f(t_{n−1}, y_{n−1})    (30.10)
If we expand y in a Taylor series about t_{n−1},

y(t_n) = y(t_{n−1}) + h y'(t_{n−1}) + (h²/2)y''(t_{n−1}) + ···    (30.11)
       = y(t_{n−1}) + h f(t_{n−1}, y_{n−1}) + (h²/2)y''(t_{n−1}) + ···    (30.12)

Thus

LTE(Euler) = (h/2)y''(t_{n−1}) + c_2 h² + c_3 h³ + ···    (30.13)

for some constants c_2, c_3, .... Because the lowest order term in powers of h is proportional to h, we say that

LTE(Euler) = O(h)    (30.14)
and say that Euler’s method is a First Order Method. In general, to improve
accuracy for a given step size, we look for higher order methods, which are O(hn );
the larger the value of n, the better the method in general.
The Trapezoidal Method averages the values of f at the two end points. It has an iteration formula given by

y_n = y_{n−1} + (h_n/2)[f(t_n, y_n) + f(t_{n−1}, y_{n−1})]    (30.15)
We can find the LTE as follows by expanding the Taylor series,

LTE(Trapezoidal) = [y(t_n) − y(t_{n−1})]/h − (1/2)[f(t_n, y_n) + f(t_{n−1}, y_{n−1})]    (30.16)
= (1/h)[y(t_{n−1}) + h y'(t_{n−1}) + (h²/2)y''(t_{n−1}) + (h³/3!)y'''(t_{n−1}) + ··· − y(t_{n−1})]
  − (1/2)[f(t_n, y_n) + f(t_{n−1}, y_{n−1})]    (30.17)
= (1/2)f(t_{n−1}, y_{n−1}) + (h/2)y''(t_{n−1}) + (h²/6)y'''(t_{n−1}) + ··· − (1/2)f(t_n, y_n)    (30.18)

LTE(Trapezoidal) = (1/2)f_{n−1} + (h/2)y''_{n−1} + (h²/6)y'''_{n−1} + ···
                 − (1/2)f_{n−1} − (h/2)y''_{n−1} − (h²/4)y'''_{n−1} + ···    (30.22)
= −(1/12)h² y'''_{n−1} + ···    (30.23)
= O(h²)    (30.24)
The theta method is implicit except when θ = 1, where it reduces to Euler's method, and is first order unless θ = 1/2. For θ = 1/2 it becomes the trapezoidal method. The usefulness of the method comes from the ability to remove the error for specific high order terms. For example, when θ = 2/3, there is no h³ term even though there is still an h² term. This can help if the coefficient of the h³ term is so large that it overwhelms the h² term for some values of h.
Heun's Method is

y_n = y_{n−1} + (h_n/4)[f(t_{n−1}, y_{n−1}) + 3 f(t_{n−1} + (2/3)h, y_{n−1} + (2/3)h f(t_{n−1}, y_{n−1}))]    (30.28)

Both Heun's method and the modified Euler method are second order and are examples of two-stage Runge-Kutta methods. It is clearer to implement these in two "stages," as in the sketch below.
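A two-stage Mathematica sketch of Heun's method (30.28); we use Heun's method here since its formula appears above, and the modified Euler method follows the same pattern with different stage coefficients (the function name and stage variables k1, k2 are ours):

Heun[f_, t0_, y0_, h_, n_] :=
  Module[{t = t0, y = y0, k1, k2, pts = {{t0, y0}}},
    Do[
      k1 = f[t, y];                     (* stage 1: slope at the left end *)
      k2 = f[t + 2 h/3, y + 2 h/3 k1];  (* stage 2: slope 2/3 of the way across *)
      y = y + h/4 (k1 + 3 k2);          (* combine per eq. 30.28 *)
      t = t + h;
      AppendTo[pts, {t, y}],
      {n}];
    pts]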