
CHAPTER 0 Fundamentals

The goal of this book is to present and discuss methods of solving mathematical problems with computers. The most fundamental operations of arithmetic are addition and multiplication. These are also the operations needed to evaluate a polynomial P(x) at a particular value x. It is no coincidence that polynomials are the basic building blocks for many computational techniques we will construct.

Because of this, it is important to know how to evaluate a polynomial. The reader probably already knows how, and may consider spending time on such an easy problem slightly ridiculous! But the more basic an operation is, the more we stand to gain by doing it right. So we will think about how to implement polynomial evaluation as efficiently as possible.
0.1 EVALUATING A POLYNOMIAL
What is the best way to evaluate

P(x) = 2x^4 + 3x^3 - 3x^2 + 5x - 1,

say, at x = 1/2? Assume that the coefficients of the polynomial and the number 1/2 are stored in memory, and try to minimize the number of additions and multiplications required to get P(1/2). To simplify matters, we will not count time spent storing and fetching numbers to and from memory.
METHOD 1  The first and most straightforward approach is

P(1/2) = 2 * (1/2) * (1/2) * (1/2) * (1/2) + 3 * (1/2) * (1/2) * (1/2) - 3 * (1/2) * (1/2) + 5 * (1/2) - 1 = 5/4.   (0.1)

The number of multiplications required is 10, together with 4 additions. Two of the additions are actually subtractions, but because subtraction can be viewed as adding a negative stored number, we will not worry about the difference.
METHOD 2  There is surely a better way than (0.1). Effort is being duplicated; operations can be saved by eliminating the repeated multiplications by the input 1/2. A better strategy is to first compute (1/2)^4, storing partial products as we go. That leads to the following method: Find the powers of the input number x = 1/2 first, and store them for future use:

(1/2) * (1/2) = (1/2)^2
(1/2)^2 * (1/2) = (1/2)^3
(1/2)^3 * (1/2) = (1/2)^4.

Now we can add up the terms:

P(1/2) = 2 * (1/2)^4 + 3 * (1/2)^3 - 3 * (1/2)^2 + 5 * (1/2) - 1 = 5/4.

There are now 3 multiplications of 1/2, along with 4 other multiplications. Counting up, we have reduced to 7 multiplications, with the same 4 additions. Is the reduction from 14 to 11 operations a significant improvement? If there is only one evaluation to be done, then probably not. Whether Method 1 or Method 2 is used, the answer will be available before you can lift your fingers from the computer keyboard. However, suppose the polynomial needs to be evaluated at different inputs x several times per second. Then the difference may be crucial to getting the answer when it is needed.
Is this the best we can do for a degree 4 polynomial? It may be hard to imagine that we can eliminate three more operations, but we can. The best elementary method is the following one:
METHOD 3  (Nested Multiplication) Rewrite the polynomial so that it can be evaluated from the inside out:

P(x) = -1 + x(5 - 3x + 3x^2 + 2x^3)
     = -1 + x(5 + x(-3 + 3x + 2x^2))
     = -1 + x(5 + x(-3 + x(3 + 2x)))
     = -1 + x * (5 + x * (-3 + x * (3 + x * 2))).   (0.2)
Here the polynomial is written backwards, and powers of x are factored out of the rest of the polynomial. Once you see how to write it this way, no computation is required to do the rewriting; the coefficients are unchanged. Now evaluate from the inside out:
multiply (1/2) * 2,      add + 3  ->  4,
multiply (1/2) * 4,      add - 3  ->  -1,
multiply (1/2) * (-1),   add + 5  ->  9/2,
multiply (1/2) * (9/2),  add - 1  ->  5/4.   (0.3)
This method, called nested multiplication or Horner's method, evaluates the polynomial in 4 multiplications and 4 additions. A general degree d polynomial can be evaluated in d multiplications and d additions. Nested multiplication is closely related to synthetic division of polynomial arithmetic.
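Since the book's own programs are written in MATLAB, here is only a minimal Python sketch of the same idea (the function name is ours); it evaluates the degree 4 example from (0.2) with exactly 4 multiplications and 4 additions:

```python
def horner(c, x):
    """Evaluate c[0] + c[1]*x + ... + c[d]*x**d by nested multiplication.

    Uses exactly d multiplications and d additions for a degree d polynomial."""
    y = c[-1]                      # innermost coefficient
    for coeff in reversed(c[:-1]):
        y = y * x + coeff          # one multiply, one add per step
    return y

# P(x) = 2x^4 + 3x^3 - 3x^2 + 5x - 1, coefficients with the constant term first
print(horner([-1, 5, -3, 3, 2], 0.5))   # 1.25, matching (0.3)
```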
The example of polynomial evaluation is characteristic of the entire topic of computational methods for scientific computing. First, computers are very fast at doing very simple things. Second, it is important to do even simple tasks as efficiently as possible, since they may be executed many times. Third, the best way may not be the obvious way. Over the last half-century, the fields of numerical analysis and scientific computing, hand-in-hand with computer hardware technology, have developed efficient solution techniques to attack common problems.
While the standard form for a polynomial c1 + c2*x + c3*x^2 + c4*x^3 + c5*x^4 can be written in nested form as

c1 + x(c2 + x(c3 + x(c4 + x(c5)))),   (0.4)

some applications require a more general form. In particular, interpolation calculations in Chapter 3 will require the form

c1 + (x - r1)(c2 + (x - r2)(c3 + (x - r3)(c4 + (x - r4)(c5)))),   (0.5)

where we call r1, r2, r3, and r4 the base points. Note that setting r1 = r2 = r3 = r4 = 0 in (0.5) recovers the original nested form (0.4).
The following MATLAB code implements the general form of nested multiplication (compare with (0.3)):

%Program 0.1 Nested multiplication
%Evaluates polynomial from nested form using Horner's Method
%Input: degree d of polynomial,
%       array of d+1 coefficients c (constant term first),
%       x-coordinate x at which to evaluate, and
%       array of d base points b, if needed
%Output: value y of polynomial at x
function y=nest(d,c,x,b)
if nargin<4, b=zeros(d,1); end
y=c(d+1);
for i=d:-1:1
  y = y.*(x-b(i))+c(i);
end
Running this function is a matter of substituting the input data, which consist of the degree, coefficients, evaluation point, and base points. For example, polynomial (0.2) can be evaluated at x = 1/2 by the MATLAB command
>> nest(4,[-1 5 -3 3 2],1/2,[0 0 0 0])
ans =
    1.2500
as we found earlier by hand. The file nest.m, like the rest of the MATLAB code shown in this book, must be accessible from the MATLAB path (or in the current directory) when executing the command.
If the nest command is to be used with all base points 0, as in (0.2), the abbreviated form

nest(4,[-1 5 -3 3 2],1/2)

may be used with the same result. This is due to the nargin statement in nest.m. If the number of input arguments is less than 4, the base points are automatically set to zero.
Because of MATLAB's seamless treatment of vector notation, the nest command can evaluate an array of x values at once. The following code is illustrative:

>> nest(4,[-1 5 -3 3 2],[-2 -1 0 1 2])
ans =
   -15   -10    -1     6    53
Finally, the degree 3 interpolating polynomial

P(x) = 1 + x(1/2 + (x - 2)(1/2 + (x - 3)(-1/2)))

from Chapter 3 has base points r1 = 0, r2 = 2, r3 = 3. It can be evaluated at x = 1 by

>> nest(3,[1 1/2 1/2 -1/2],1,[0 2 3])
ans =
     0
EXAMPLE 0.1  Find an efficient method for evaluating the polynomial

P(x) = 4x^5 + 7x^8 - 3x^11 + 2x^14.

Some rewriting of the polynomial may help reduce the computational effort required for evaluation. The idea is to factor x^5 from each term and write the rest as a polynomial in the quantity x^3:

P(x) = x^5(4 + 7x^3 - 3x^6 + 2x^9)
     = x^5 * (4 + x^3 * (7 + x^3 * (-3 + x^3 * (2)))).

For each input x, we first need to calculate x * x = x^2, x * x^2 = x^3, and x^2 * x^3 = x^5. These three multiplications, combined with the multiplication by x^5 and the three multiplications and three additions from the degree 3 polynomial in the quantity x^3, give a total operation count of 7 multiplies and 3 adds per evaluation.
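As a quick check of the operation-saving scheme, here is a Python sketch (the helper name is ours, not the book's) that evaluates P with the 7 multiplications counted above and agrees with a direct term-by-term evaluation:

```python
def p_factored(x):
    """Evaluate P(x) = 4x^5 + 7x^8 - 3x^11 + 2x^14 in 7 multiplies and 3 adds."""
    x2 = x * x                                 # multiply 1
    x3 = x * x2                                # multiply 2
    x5 = x2 * x3                               # multiply 3
    inner = 4 + x3 * (7 + x3 * (-3 + x3 * 2))  # multiplies 4-6, all 3 adds
    return x5 * inner                          # multiply 7

x = 1.1
direct = 4*x**5 + 7*x**8 - 3*x**11 + 2*x**14
print(abs(p_factored(x) - direct) < 1e-12)     # True
```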

0.1 Exercises

1. Rewrite the following polynomials in nested form. Evaluate with and without nested form at x = 1/3.
(a) P(x) = 6x^4 + x^3 + 5x^2 + x + 1
(b) P(x) = -3x^4 + 4x^3 + 5x^2 - 5x + 1
(c) P(x) = 2x^4 + x^3 - x^2 + 1
2. Rewrite the following polynomials in nested form and evaluate at x = -1/2:
(a) P(x) = 6x^3 - 2x^2 - 3x + 7
(b) P(x) = 8x^5 - x^4 - 3x^3 + x^2 - 3x + 1
(c) P(x) = 4x^6 - 2x^4 - 2x + 4
3. Evaluate P(x) = x^6 - 4x^4 + 2x^2 + 1 at x = 1/2 by considering P(x) as a polynomial in x^2 and using nested multiplication.
4. Evaluate the nested polynomial with base points P(x) = 1 + x(1/2 + (x - 2)(1/2 + (x - 3)(-1/2))) at (a) x = 5 and (b) x = -1.
5. Evaluate the nested polynomial with base points P(x) = 4 + x(4 + (x - 1)(1 + (x - 2)(3 + (x - 3)(2)))) at (a) x = 1/2 and (b) x = -1/2.
6. Explain how to evaluate the given polynomial for an input x, using as few operations as possible. How many multiplications and how many additions are required?
(a) P(x) = a0 + a5*x^5 + a10*x^10 + a15*x^15
(b) P(x) = a7*x^7 + a12*x^12 + a17*x^17 + a22*x^22 + a27*x^27
7. How many additions and multiplications are required to evaluate a degree n polynomial with base points, using the general nested multiplication algorithm?
0.1 Computer Problems

1. Use the function nest to evaluate P(x) = 1 + x + ... + x^50 at x = 1.0001. (Use the MATLAB ones command to save typing.) Find the error of the computation by comparing with the equivalent expression Q(x) = (x^51 - 1)/(x - 1).
2. Use nest.m to evaluate P(x) = 1 - x + x^2 - x^3 + ... + x^98 - x^99 at x = 1.001. Find a simpler, equivalent expression, and use it to estimate the error of the nested multiplication.
0.2 BINARY NUMBERS

In preparation for the detailed study of computer arithmetic in the next section, we need to understand the binary number system. Decimal numbers are converted from base 10 to base 2 in order to store numbers on a computer and to simplify computer operations like addition and multiplication. To give output in decimal notation, the process is reversed. In this section, we discuss ways to convert between decimal and binary numbers.
Binary numbers are expressed as

... b2 b1 b0 . b-1 b-2 ...,

where each binary digit, or bit, is 0 or 1. The base 10 equivalent to the number is

... b2*2^2 + b1*2^1 + b0*2^0 + b-1*2^-1 + b-2*2^-2 ....

For example, the decimal number 4 is expressed as (100.)2 in base 2, and 3/4 is represented as (0.11)2.
0.2.1 Decimal to binary

The decimal number 53 will be represented as (53)10 to emphasize that it is to be interpreted as base 10. To convert to binary, it is simplest to break the number into integer and fractional parts and convert each part separately. For the number (53.7)10 = (53)10 + (0.7)10, we will convert each part to binary and combine the results.

Integer part. Convert decimal integers to binary by dividing by 2 successively and recording the remainders. The remainders, 0 or 1, are recorded by starting at the decimal point (or more accurately, the radix) and moving away to the left. For (53)10, we would have
53 ÷ 2 = 26 R 1
26 ÷ 2 = 13 R 0
13 ÷ 2 =  6 R 1
 6 ÷ 2 =  3 R 0
 3 ÷ 2 =  1 R 1
 1 ÷ 2 =  0 R 1.
Therefore, the base 10 number 53 can be written in bits as 110101, denoted as (53)10 = (110101.)2. Checking the result, we have 110101 = 2^5 + 2^4 + 2^2 + 2^0 = 32 + 16 + 4 + 1 = 53.
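The successive-division procedure translates directly into a short Python sketch (the function name is ours):

```python
def int_to_bits(n):
    """Binary digits of a positive integer, by repeated division by 2.

    Remainders are recorded starting at the radix point and moving left."""
    bits = ""
    while n > 0:
        n, r = divmod(n, 2)     # quotient and remainder
        bits = str(r) + bits    # each new remainder goes to the left
    return bits

print(int_to_bits(53))   # 110101
```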
Fractional part. Convert (0.7)10 to binary by reversing the preceding steps. Multiply by 2 successively and record the integer parts, moving away from the decimal point to the right:

.7 × 2 = .4 + 1
.4 × 2 = .8 + 0
.8 × 2 = .6 + 1
.6 × 2 = .2 + 1
.2 × 2 = .4 + 0
.4 × 2 = .8 + 0
...

Notice that the process repeats after four steps and will repeat indefinitely in exactly the same way. Therefore,

(0.7)10 = (.1011001100110...)2 = (.10110)2,

where the overbar notation (understood here over the final four bits, 0110) denotes infinitely repeated bits. Putting the two parts together, we conclude that

(53.7)10 = (110101.10110)2,

again with the block 0110 repeating.
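The fractional-part procedure can be sketched the same way. Exact rational arithmetic (Python's fractions module) keeps the repetition visible instead of letting double precision rounding obscure it; the helper name is ours:

```python
from fractions import Fraction

def frac_to_bits(x, nbits):
    """First nbits binary digits of a fraction 0 <= x < 1,
    by repeated multiplication by 2, recording integer parts."""
    bits = ""
    for _ in range(nbits):
        x *= 2
        b = int(x)            # the integer part is the next bit
        bits += str(b)
        x -= b
    return bits

print(frac_to_bits(Fraction(7, 10), 13))   # 1011001100110: the block 0110 repeats
```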
0.2.2 Binary to decimal

To convert a binary number to decimal, it is again best to separate it into integer and fractional parts.

Integer part. Simply add up powers of 2 as we did before. The binary number (10101)2 is simply 1*2^4 + 0*2^3 + 1*2^2 + 0*2^1 + 1*2^0 = (21)10.

Fractional part. If the fractional part is finite (a terminating base 2 expansion), proceed the same way. For example,

(.1011)2 = 1/2 + 1/8 + 1/16 = (11/16)10.
The only complication arises when the fractional part is not a finite base 2 expansion. Converting an infinitely repeating binary expansion to a decimal fraction can be done in several ways. Perhaps the simplest way is to use the shift property of multiplication by 2.

For example, suppose x = (0.1011)2, with the block 1011 repeating, is to be converted to decimal. Multiply x by 2^4, which shifts the number 4 places to the left in binary. Then subtract the original x:

2^4 x = 1011.1011...
    x = 0000.1011...

Subtracting yields

(2^4 - 1)x = (1011)2 = (11)10.

Then solve for x to find x = 11/15 in base 10.
As another example, assume that the fractional part does not immediately repeat, as in x = .10101, where the final block 101 repeats. Multiplying by 2^2 shifts to y = 2^2 x = 10.101, with 101 repeating. The fractional part of y, call it z = .101 repeating, is calculated as before:

2^3 z = 101.101...
    z = 000.101...

Therefore, 7z = 5, and y = 2 + 5/7 = 19/7. Then x = 2^-2 y = 19/28 in base 10. It is a good exercise to check this result by converting 19/28 to binary and comparing to the original x.
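The shift property amounts to a closed formula: shifting by the block length and subtracting cancels the repeating tail. A Python sketch with exact rationals (the function name is ours):

```python
from fractions import Fraction

def repeating_binary(prefix, block):
    """Exact value of the binary fraction .prefix block block block ...

    For example, repeating_binary("10", "101") is (.10 101 101 ...)_2."""
    p, b = len(prefix), len(block)
    head = int(prefix, 2) if prefix else 0
    # shift by 2^b and subtract: (2^b - 1) * tail = block read as an integer
    return Fraction(head, 2**p) + Fraction(int(block, 2), (2**b - 1) * 2**p)

print(repeating_binary("", "1011"))    # 11/15
print(repeating_binary("10", "101"))   # 19/28
```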
Binary numbers are the building blocks of machine computations, but they turn out to be long and unwieldy for humans to interpret. It is useful to use base 16 at times just to present numbers more easily. Hexadecimal numbers are represented by the 16 numerals 0, 1, 2, ..., 9, a, b, c, d, e, f. Each hex numeral can be represented by 4 bits. Thus (1)16 = (0001)2, (8)16 = (1000)2, and (f)16 = (1111)2 = (15)10. In the next section, MATLAB's format hex for representing machine numbers will be described.

0.2 Exercises
1. Find the binary representation of the base 10 integers (a) 6, (b) 17, (c) 79, (d) 227.
2. Convert the following base 10 numbers to binary. Use overbar notation for nonterminating binary numbers. (a) 10.5 (b) 1/3 (c) 5/7 (d) 12.8 (e) 55.4 (f) 0.1
3. Find the first 15 bits in the binary representation of π.
4. Convert the following binary numbers to base 10, where brackets enclose infinitely repeating bits (an overbar in print): (a) 1010101 (b) 1011.101 (c) 10111.01 (d) 110.[1] (e) 10.[10] (f) 110.1[1] (g) 10.0101101 (h) 111.[1]
0.3 FLOATING POINT REPRESENTATION OF REAL NUMBERS

In this section we present a model for computer arithmetic of floating point numbers. There are several models, but to simplify matters we will choose one particular model and describe it in detail. The model we choose is the so-called IEEE 754 Floating Point Standard. The Institute of Electrical and Electronics Engineers (IEEE) takes an active interest in establishing standards for the industry. Their floating point arithmetic format has become the common standard for single-precision and double-precision arithmetic throughout the computer industry.
Rounding errors are inevitable when finite-precision computer memory locations are used to represent real, infinite precision numbers. Although we would hope that small errors made during a long calculation have only a minor effect on the answer, this turns out to be wishful thinking in many cases. Simple algorithms, such as Gaussian elimination or methods for solving differential equations, can magnify microscopic errors to macroscopic size. In fact, a main theme of this book is to help the reader to recognize when a calculation is at risk of being unreliable due to magnification of the small errors made by digital computers, and to know how to avoid or minimize the risk.
0.3.1 Floating point formats

The IEEE standard consists of a set of binary representations of real numbers. A floating point number consists of three parts: the sign (+ or -), a mantissa, which contains the string of significant bits, and an exponent. The three parts are stored together in a single computer word.

There are three commonly used levels of precision for floating point numbers: single precision, double precision, and extended precision, also known as long-double precision. The number of bits allocated for each floating point number in the three formats is 32, 64, and 80, respectively. The bits are divided among the parts as follows:

precision   | sign | exponent | mantissa
single      |  1   |    8     |    23
double      |  1   |    11    |    52
long double |  1   |    15    |    64
All three formats work essentially the same way. The form of a normalized IEEE floating point number is

±1.bbb...b × 2^p,   (0.6)

where each of the N b's is 0 or 1, and p is an M-bit binary number representing the exponent. Normalization means that, as shown in (0.6), the leading (leftmost) bit must be 1.

When a binary number is stored as a normalized floating point number, it is "left-justified," meaning that the leftmost 1 is shifted just to the left of the radix point. The shift is compensated by a change in the exponent. For example, the decimal number 9, which is 1001 in binary, would be stored as

+1.001 × 2^3,

because a shift of 3 bits, or multiplication by 2^3, is necessary to move the leftmost 1 to the correct position.

For concreteness, we will specialize to the double precision format for most of the discussion. Single and long-double precision are handled in the same way, with the exception of different exponent and mantissa lengths M and N. In double precision, used by many compilers and by MATLAB, M = 11 and N = 52.
The double precision number 1 is

+1.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 × 2^0,

where the 52 bits after the binary point (shown in groups of four) form the mantissa. The next floating point number greater than 1 is

+1.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 × 2^0.

The number machine epsilon, denoted ε_mach, is the distance between 1 and the smallest floating point number greater than 1. For the IEEE double precision floating point standard,

ε_mach = 2^-52.
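Python floats are IEEE doubles as well, so these quantities can be inspected directly (math.nextafter requires Python 3.9 or later):

```python
import math
import sys

# machine epsilon: the gap between 1 and the next floating point number above it
print(sys.float_info.epsilon == 2.0**-52)            # True
print(math.nextafter(1.0, 2.0) == 1.0 + 2.0**-52)    # True
```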
The decimal number 9.4 = (1001.0110)2, with the block 0110 repeating, is left-justified as

+1.0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 | 110... × 2^3,

where the first 52 bits of the mantissa are shown, followed by the start of the infinite tail. A new question arises: How do we fit the infinite binary number representing 9.4 into a finite number of bits?

We must truncate the number in some way, and in so doing we necessarily make a small error. One method, called chopping, is to simply throw away the bits that fall outside the format, that is, those beyond the 52nd bit to the right of the decimal point. This protocol is simple, but it is biased, in that it always moves the result toward zero.

The alternative method is rounding. In base 10, numbers are customarily rounded up if the next digit is 5 or higher, and rounded down otherwise. In binary, this corresponds to
rounding up if the bit is 1. Specifically, the important bit in the double precision format is the 53rd bit to the right of the radix point, the first one lying outside the stored mantissa. The default rounding technique, implemented by the IEEE standard, is to add 1 to bit 52 (round up) if bit 53 is 1, and to do nothing (round down) to bit 52 if bit 53 is 0, with one exception: If the bits following bit 52 are 1000..., exactly halfway between up and down, we round up or round down according to which choice makes the final bit 52 equal to 0. (Here we are dealing with the mantissa only, since the sign does not play a role.)

Why is there the strange exceptional case? Except for this case, the rule means rounding to the normalized floating point number closest to the original number, hence its name, the Rounding to Nearest Rule. The error made in rounding will be equally likely to be up or down. Therefore, the exceptional case, the case where there are two equally distant floating point numbers to round to, should be decided in a way that doesn't prefer up or down systematically. This is to try to avoid the possibility of an unwanted slow drift in long calculations due simply to a biased rounding. The choice to make the final bit 52 equal to 0 in the case of a tie is somewhat arbitrary, but at least it does not display a preference up or down. Exercise 0.3.6 sheds some light on why the arbitrary choice of 0 is made in case of a tie.
DEFINITION 0.2  IEEE Rounding to Nearest Rule
For double precision, if the 53rd bit to the right of the binary point is 0, then round down (truncate after the 52nd bit). If the 53rd bit is 1, then round up (add 1 to the 52nd bit), unless all known bits to the right of the 1 are 0's, in which case 1 is added to bit 52 if and only if bit 52 is 1.
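The rule can be imitated in software. The sketch below (our own helper, not from the book) rounds a positive rational to a 52-bit mantissa with ties going to an even last bit, then checks itself against Python's float conversion, which follows the same IEEE rule:

```python
from fractions import Fraction

def fl(x):
    """Round a positive rational to double precision (52 mantissa bits)
    by the IEEE Rounding to Nearest Rule; ignores exponent range limits."""
    x = Fraction(x)
    e = 0
    while x >= 2:               # left-justify into [1, 2)
        x /= 2
        e += 1
    while x < 1:
        x *= 2
        e -= 1
    scaled = x * 2**52          # bits 1..52 become the integer part
    q, r = divmod(scaled.numerator, scaled.denominator)
    tail = Fraction(r, scaled.denominator)      # bits 53, 54, ... as a fraction
    if tail > Fraction(1, 2) or (tail == Fraction(1, 2) and q % 2 == 1):
        q += 1                  # round up; exact ties go to the even bit 52
    return Fraction(q, 2**52) * Fraction(2)**e

print(fl(Fraction(47, 5)) == Fraction(9.4))   # True: matches the stored double
print(fl(Fraction(2, 5)) == Fraction(0.4))    # True
```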
For the number 9.4 discussed previously, the 53rd bit to the right of the binary point is a 1 and is followed by other nonzero bits. The Rounding to Nearest Rule says to round up, or add 1 to bit 52. Therefore, the floating point number that represents 9.4 is

+1.0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1101 × 2^3.   (0.7)

Denote the IEEE double precision floating point number associated to x, using the Rounding to Nearest Rule, by fl(x).

In computer arithmetic, the real number x is replaced with the string of bits fl(x). According to this definition, fl(9.4) is the number with binary representation (0.7). We arrived at the floating point representation by discarding the infinite tail .1100... × 2^-52 × 2^3 = .0110... × 2^-51 × 2^3 = .4 × 2^-48 from the right end of the number, and then adding 2^-52 × 2^3 = 2^-49 in the rounding step. Therefore,
fl(9.4) = 9.4 - 0.4 × 2^-48 + 2^-49
        = 9.4 + 2^-49 (1 - 0.8)
        = 9.4 + 0.2 × 2^-49.   (0.8)

In other words, a computer using double precision representation and the Rounding to Nearest Rule makes an error of 0.2 × 2^-49 when storing 9.4. We call 0.2 × 2^-49 the rounding error.
The important message is that the floating point number representing 9.4 is not equal to 9.4, although it is very close. To quantify that closeness, we use the standard definition of error.

DEFINITION 0.3  Let x_c be a computed version of the exact quantity x. Then

absolute error = |x_c - x|,

and

relative error = |x_c - x| / |x|,

if the latter quantity exists.
In the IEEE machine arithmetic model, the relative rounding error of fl(x) is no more than one-half machine epsilon:

|fl(x) - x| / |x| <= (1/2) ε_mach.   (0.9)

In the case of the number x = 9.4, we worked out the rounding error in (0.8), which must satisfy (0.9):

|fl(9.4) - 9.4| / 9.4 = (0.2 × 2^-49) / 9.4 = (8/47) × 2^-52 < (1/2) ε_mach.
EXAMPLE 0.2  Find the double precision representation fl(x) and rounding error for x = 0.4.

Since (0.4)10 = (.0110)2 with the block 0110 repeating, left-justifying gives

0.4 = +1.1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 | 1001... × 2^-2,

where the first 52 bits of the mantissa are shown. Therefore, according to the rounding rule, fl(0.4) is

+1.1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010 × 2^-2.

Here, 1 has been added to bit 52, which caused bit 51 to change also, due to carrying in the binary addition.

Analyzing carefully, we discarded .1001... × 2^-52 × 2^-2 = .6 × 2^-54 in the truncation (since (.1001)2 repeating equals 3/5) and added 2^-52 × 2^-2 = 2^-54 by rounding up. Therefore,

fl(0.4) = 0.4 - 0.6 × 2^-54 + 2^-54
        = 0.4 + 2^-54 (1 - 0.6)
        = 0.4 + 0.4 × 2^-54
        = 0.4 + 0.1 × 2^-52.

Notice that the relative error in rounding for 0.4 is (0.1 × 2^-52)/0.4 = (1/4) × 2^-52 <= (1/2) ε_mach, obeying (0.9).
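Both computed rounding errors can be confirmed exactly in Python, since Fraction(x) applied to a float recovers the machine number with no error:

```python
from fractions import Fraction

err94 = Fraction(9.4) - Fraction(47, 5)     # fl(9.4) - 9.4
print(err94 == Fraction(1, 5 * 2**49))      # True: 0.2 x 2^-49, as in (0.8)

err04 = Fraction(0.4) - Fraction(2, 5)      # fl(0.4) - 0.4
print(err04 == Fraction(1, 10 * 2**52))     # True: 0.1 x 2^-52

# both relative errors obey the bound (0.9)
print(err94 / Fraction(47, 5) <= Fraction(1, 2**53))   # True
print(err04 / Fraction(2, 5) <= Fraction(1, 2**53))    # True
```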
0.3.2 Machine representation

So far, we have described a floating point representation in the abstract. Here are a few more details about how this representation is implemented on a computer. Again, in this section we will discuss the double precision format; the other formats are very similar.

Each double precision floating point number is assigned an 8-byte word, or 64 bits, to store its three parts. Each such word has the form

s e1e2...e11 b1b2...b52,   (0.10)

where the sign is stored, followed by 11 bits representing the exponent and the 52 bits following the decimal point, representing the mantissa. The sign bit s is 0 for a positive number and 1 for a negative number. The 11 bits representing the exponent come from the positive binary integer resulting from adding 2^10 - 1 = 1023 to the exponent, at least for exponents between -1022 and 1023. This covers values of e1...e11 from 1 to 2046, leaving 0 and 2047 for special purposes, which we will return to later.

The number 1023 is called the exponent bias of the double precision format. It is used to convert both positive and negative exponents to positive binary numbers for storage in the exponent bits. For single and long-double precision, the exponent bias values are 127 and 16383, respectively.
MATLAB's format hex consists simply of expressing the 64 bits of the machine number (0.10) as 16 successive hexadecimal, or base 16, numerals. Thus, the first 3 hex numerals represent the sign and exponent combined, while the last 13 contain the mantissa.

For example, the number 1, or

+1.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 × 2^0,

has double precision machine number form

0 01111111111 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

once the usual 1023 is added to the exponent. The first three hex digits correspond to

001111111111 = 3ff,

so the format hex representation of the floating point number 1 will be 3ff0000000000000. You can check this by typing format hex into MATLAB and entering the number 1.
EXAMPLE 0.3  Find the hex machine number representation of the real number 9.4.

From (0.7), we find that the sign is s = 0, the exponent is 3, and the 52 bits of the mantissa after the decimal point are

0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1101 = (2cccccccccccd)16.

Adding 1023 to the exponent gives 1026 = (10000000010)2. The sign and exponent combination is (0100 0000 0010)2 = (402)16, making the hex format

4022cccccccccccd.
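Outside MATLAB, the same 16 hex digits can be produced with Python's struct module by packing the double big-endian (the helper name is ours):

```python
import struct

def hex_double(x):
    """The 16 hex digits of the IEEE double precision word for x."""
    return struct.pack(">d", x).hex()

print(hex_double(1.0))   # 3ff0000000000000
print(hex_double(9.4))   # 4022cccccccccccd
```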
U.3 Floating Point Representation of Real Numbers I 13
Now we rt to the speial exponent values 0 and +!. The Jatte 2+!. is used t
represent o Hthe mantssa bit string U3l DDand NaN. which stands for Not a Number.
otherwise. Te fanner e1 ... e1 1 = 0, is interprte- as the non-normalized foatng point
focm
(0.11)
That is, the left-most bit is no longer assumed to be . Tese non-nomed numbers a
called SuDu0Ofoating point numbers. They extend the gcof very small numbers by a
few more orders of magnitude; Therefore, Z` x
~
=
_

74
is the smallest nonzero
representable number in double precision. U machine word is
I U I JUJJ I U| JUJJI1.
Be sure to understand the difference between the smallest representable number 2^-1074 and ε_mach = 2^-52. Many numbers below ε_mach are machine representable, even though adding them to 1 may have no effect. On the other hand, numbers below 2^-1074 cannot be represented at all.

The subnormal numbers include the most important number, 0. In fact, the subnormal representation includes two different floating point numbers, +0 and -0, that are treated as the same real number. The machine representation of +0 has sign bit s = 0, exponent bits e1...e11 = 00000000000, and a mantissa of 52 zeros; in short, all bits are 0. The hex format for +0 is 0000000000000000. For the number -0, all is exactly the same, except for the sign bit s = 1. The hex format for -0 is 8000000000000000.
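These limits and the two zeros are easy to probe in Python, which exposes the same IEEE doubles (math.ldexp builds 2^-1074 exactly):

```python
import math
import struct

tiny = math.ldexp(1.0, -1074)          # smallest positive subnormal double
print(tiny > 0)                        # True
print(tiny / 2 == 0.0)                 # True: nothing representable below 2^-1074
print(struct.pack(">d", 0.0).hex())    # 0000000000000000
print(struct.pack(">d", -0.0).hex())   # 8000000000000000
print(0.0 == -0.0)                     # True: +0 and -0 compare as equal
```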
0.3.3 Addition of floating point numbers

Machine addition consists of lining up the radix points of the two numbers to be added, adding them, and then storing the result again as a floating point number. The addition itself can be done in higher precision (with more than 52 bits) since it takes place in a register dedicated just to that purpose. Following the addition, the result must be rounded back to 52 bits beyond the binary point for storage as a machine number.

For example, adding 1 + 2^-53 would appear as follows:

1 + 2^-53 = 1.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 × 2^0
          + 0.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 1 × 2^0
          = 1.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 1 × 2^0.
This is saved as 1.000...0 × 2^0 = 1, according to the rounding rule (the tie goes to the choice that makes bit 52 equal to 0). Therefore, 1 + 2^-53 is equal to 1 in double precision IEEE arithmetic. Note that 2^-53 is the largest floating point number with this property; anything larger added to 1 would result in a sum greater than 1 under computer arithmetic.

The fact that ε_mach = 2^-52 does not mean that numbers smaller than ε_mach are negligible in the IEEE model. As long as they are representable in the model, computations with numbers of this size are just as accurate, assuming that they are not added or subtracted to numbers of unit size.
It is important to realize that computer arithmetic, because of the truncation and rounding that it carries out, can sometimes give surprising results. For example, if a double precision computer with IEEE rounding to nearest is asked to store 9.4, then subtract 9, and then subtract 0.4, the result will be something other than zero! What happens is the following: First, 9.4 is stored as 9.4 + 0.2 × 2^-49, as shown previously. When 9 is subtracted (note that 9 can be represented with no error), the result is 0.4 + 0.2 × 2^-49. Now, asking the computer to subtract 0.4 results in subtracting (as we found in Example 0.2) the machine number fl(0.4) = 0.4 + 0.1 × 2^-52, which will leave

0.2 × 2^-49 - 0.1 × 2^-52 = .1 × 2^-52 (2^4 - 1) = 3 × 2^-53

instead of zero. This is a small number, on the order of ε_mach, but it is not zero. Since MATLAB's basic data type is the IEEE double precision number, we can illustrate this finding in a MATLAB session:
>> format long
>> x=9.4
x =
   9.40000000000000
>> y=x-9
y =
   0.40000000000000
>> z=y-0.4
z =
     3.330669073875470e-16
>> 3*2^(-53)
ans =
     3.330669073875470e-16
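The same experiment runs in any IEEE double environment; a Python version of the session above:

```python
x = 9.4
y = x - 9      # exact: 9 is representable, and the subtraction loses no bits
z = y - 0.4    # subtracts the machine number fl(0.4) = 0.4 + 0.1 x 2^-52
print(z)                   # 3.3306690738754696e-16
print(z == 3 * 2**-53)     # True, as derived above
```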
EXAMPLE 0.4  Find the double precision floating point sum (1 + 3 × 2^-53) - 1.

Of course, in real arithmetic the answer is 3 × 2^-53. However, floating point arithmetic may differ. Note that 3 × 2^-53 = 2^-52 + 2^-53. The first addition is

1 + 2^-52 + 2^-53 = 1.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 × 2^0
                  + 0.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 1 × 2^0
                  = 1.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 1 × 2^0.

This is the exceptional case of the rounding rule. Since bit 52 in the sum is 1, we must round up, which means adding 1 to bit 52. After carrying, we get

1.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0010 × 2^0,

which is the representation of 1 + 2^-51. Therefore, after subtracting 1, the result will be 2^-51, which is equal to 2 ε_mach = 4 × 2^-53. Once again, note the difference between computer arithmetic and correct arithmetic. Check this result by using MATLAB.
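The check suggested at the end of the example takes one line in MATLAB, or in Python:

```python
result = (1 + (2**-52 + 2**-53)) - 1
print(result == 2**-51)          # True: the tie rounded up to 1 + 2^-51
print(result == 4 * 2**-53)      # True: equals 2 * epsilon_mach
```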
Calculations in MATLAB, or in any compiler performing floating point calculation under the IEEE standard, follow the precise rules described in this section. Although floating point calculation can give surprising results because it differs from correct arithmetic, it is always predictable. The Rounding to Nearest Rule is the typical default, although, if desired, it is possible to change to other rounding rules by using compiler flags. The comparison of results from different rounding protocols is sometimes useful as an informal way to assess the stability of a calculation.

It may be surprising that small rounding errors alone, of relative size ε_mach, are capable of derailing meaningful calculations. One mechanism for this is introduced in the next section. More generally, the study of error magnification and conditioning is a recurring theme in Chapters 1, 2, and beyond.

0.3 Exercises
1. Convert the following base 10 numbers to binary and express each as a floating point number fl(x) by using the Rounding to Nearest Rule: (a) 1/4 (b) 1/3 (c) 2/3 (d) 0.9
2. Convert the following base 10 numbers to binary and express each as a floating point number fl(x) by using the Rounding to Nearest Rule: (a) 9.5 (b) 9.6 (c) 1.2 (d) 4/7
3. Do the following sums by hand in IEEE double precision computer arithmetic, using the Rounding to Nearest Rule. (Check your answers, using MATLAB.)
(a) (1 + (2^-51 + 2^-53)) - 1
(b) (1 + (2^-51 + 2^-52 + 2^-53)) - 1
4. Do the following sums by hand in IEEE double precision computer arithmetic, using the Rounding to Nearest Rule:
(a) (1 + (2^-51 + 2^-52 + 2^-54)) - 1
(b) (1 + (2^-51 + 2^-52 + 2^-60)) - 1
5. Write each of the given numbers in MATLAB's format hex. Show your work. Then check your answers with MATLAB. (a) 8 (b) 21 (c) 1/8 (d) fl(1/3) (e) fl(2/3) (f) fl(0.1) (g) fl(-0.1) (h) fl(-0.2)
6. Is 1/3 + 2/3 exactly equal to 1 in double precision floating point arithmetic, using the IEEE Rounding to Nearest Rule? You will need to use fl(1/3) and fl(2/3) from Exercise 1. Does this help explain why the rule is expressed as it is? Would the sum be the same if chopping after bit 52 were used instead of IEEE rounding?
7. (a) Explain why you can determine machine epsilon on a computer using IEEE double precision and the IEEE Rounding to Nearest Rule by calculating (7/3 - 4/3) - 1. (b) Does (4/3 - 1/3) - 1 also give ε_mach? Explain by converting to floating point numbers and carrying out the machine arithmetic.
8. Decide whether 1 + x > 1 in double precision floating point arithmetic, with Rounding to Nearest: (a) x = 2^-53 (b) x = 2^-53 + 2^-60.
9. Does the associative law hold for IEEE computer addition?
10. Find the IEEE double precision representation fl(x), and find the exact difference fl(x) - x for the given real numbers. Check that the relative rounding error is no more than ε_mach/2.
(a) x = 2.75 (b) x = 2.7 (c) x = 10/3
11. How many correct digits can you guarantee in calculating sin(10^30) by calculator and by MATLAB? How about sin(10^30 + 1)? Explain your reasoning. Is there any way to calculate this number more precisely on a double precision computer?
0.4 LOSS OF SIGNIFICANCE
An advantage of knowing the details of computer arithmetic is that we are in a better position to understand potential pitfalls in computer calculations. One major problem that arises in many forms is the loss of significant digits that results from subtracting nearly equal numbers. In its simplest form, this is an obvious statement. For example, in the subtraction problem

  123.4567
- 123.4566
  --------
    0.0001

we started with two input numbers that we knew to seven-digit accuracy, and ended with a result that has only one-digit accuracy. Although this example is quite straightforward, there are other examples of loss of significance that are more subtle, and in many cases the loss can be avoided by restructuring the calculation.

EXAMPLE 0.5  Calculate √9.01 - 3 on a 3-decimal-digit computer.
This example is still fairly simple and is presented only for illustrative purposes.
Instead of using a computer with a 52-bit mantissa, as in the double precision IEEE standard
format, we assume that we are using a three-decimal-digit computer. Using a three-digit
computer means that storing each intermediate calculation along the way implies storing
into a floating point number with a three-digit mantissa. The problem data (the 9.01 and
3.0) are given to three-digit accuracy. Since we are going to use a three-digit computer,
being optimistic, we might hope to get an answer that is good to three digits. (Of course,
we can't expect more than that, because we only carry along three digits during the
calculation.) Checking on a hand calculator, we see that the correct answer is
approximately 0.0016662 = 1.6662 × 10^-3. How many correct digits do we get with the
three-digit computer?

None, as it turns out. Since √9.01 ≈ 3.0016662, when we store this intermediate
result to three significant digits we get 3.00. Subtracting 3.00, we get a final answer of
0.00. No significant digits in our answer are correct.
Surprisingly, there is a way to save this computation, even on a three-digit computer.
What is causing the loss of significance is the fact that we are explicitly subtracting nearly
equal numbers, √9.01 and 3. We can avoid this problem by using algebra to rewrite the
expression:

√9.01 - 3 = (√9.01 - 3)(√9.01 + 3) / (√9.01 + 3)
          = (9.01 - 3^2) / (√9.01 + 3)
          = 0.01 / (3.00 + 3.00)
          = 0.01 / 6.00 ≈ 0.00167 = 1.67 × 10^-3.

Here, we have rounded the last digit of the mantissa up to 7 since the next digit is 6.
Notice that we got all three digits correct this way, at least the three digits that the correct
answer rounds to. The lesson is that it is important to find ways to avoid subtracting nearly
equal numbers in calculations, if possible.
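The three-digit machine is easy to simulate on a real computer. The following Python sketch (the rounding helper `fl` is our own construction, standing in for the book's MATLAB setting) reproduces both the naive and the restructured computation:

```python
import math

def fl(x, digits=3):
    """Simulate storage on a three-decimal-digit computer by rounding
    x to the given number of significant decimal digits."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - e)

# Naive evaluation: every intermediate result is stored to three digits.
naive = fl(fl(math.sqrt(9.01)) - 3.0)        # sqrt(9.01) stored as 3.00

# Restructured evaluation using the conjugate: 0.01 / (sqrt(9.01) + 3)
rescued = fl(0.01 / fl(fl(math.sqrt(9.01)) + 3.0))

print(naive)    # 0.0: no correct digits
print(rescued)  # 0.00167: all three digits correct
```

The restructured form gives 1.67 × 10^-3 even though every intermediate quantity is still stored to only three digits.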
The method that worked in the preceding example was essentially a trick. Multiplying
by the "conjugate expression" is one trick that can help restructure the calculation. Often,
specific identities can be used, as with trigonometric expressions. For example, calculation
of 1 - cos x when x is close to zero is subject to loss of significance. Let's compare the
calculation of the expressions

E1 = (1 - cos x)/sin^2 x   and   E2 = 1/(1 + cos x)

for a range of input numbers x. We arrived at E2 by multiplying the numerator and denominator
of E1 by 1 + cos x, and using the trig identity sin^2 x + cos^2 x = 1. In infinite precision,
the two expressions are equal. Using the double precision of MATLAB computations, we get
the following table:
x               E1                E2
1.000000000000  0.64922320520476  0.64922320520476
0.100000000000  0.50125208628858  0.50125208628857
0.010000000000  0.50001250020848  0.50001250020834
0.001000000000  0.50000012499219  0.50000012500002
0.000100000000  0.49999999862793  0.50000000125000
0.000010000000  0.50000004138685  0.50000000001250
0.000001000000  0.50004445029134  0.50000000000013
0.000000100000  0.49960036108132  0.50000000000000
0.000000010000  0.00000000000000  0.50000000000000
0.000000001000  0.00000000000000  0.50000000000000
0.000000000100  0.00000000000000  0.50000000000000
0.000000000010  0.00000000000000  0.50000000000000
0.000000000001  0.00000000000000  0.50000000000000
The right column, E2, is correct up to the digits shown. The E1 computation, due to the
subtraction of nearly equal numbers, is having major problems below x = 10^-5 and has no
correct significant digits for inputs x = 10^-8 and below.

The expression E1 already has several incorrect digits for x = 10^-4 and gets worse as
x decreases. The equivalent expression E2 does not subtract nearly equal numbers and has
no such problems.
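The table is easy to regenerate. Here is a Python version of the experiment (the book runs it in MATLAB):

```python
import math

# Compare two algebraically equal expressions in double precision.
# E1 subtracts nearly equal numbers near x = 0; E2 does not.
for p in range(13):
    x = 10.0 ** (-p)
    e1 = (1 - math.cos(x)) / math.sin(x) ** 2
    e2 = 1 / (1 + math.cos(x))
    print(f"{x:.12f}  {e1:.14f}  {e2:.14f}")
```

For x = 10^-8 and below, cos x rounds to exactly 1 in double precision, so the numerator of E1 is exactly zero.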
The quadratic formula is often subject to loss of significance. Again, it is easy to avoid
as long as you know it is there and how to restructure the expression.

EXAMPLE 0.6
Find both roots of the quadratic equation x^2 + 9^12 x = 3.
Try this one in double precision arithmetic, for example, using MATLAB. Neither root
will come out right unless you are aware of loss of significance and know how to
counteract it. The problem is to find both roots, let's say, with four-digit accuracy. So far it
looks like an easy problem. The roots of a quadratic equation of the form ax^2 + bx + c = 0
are given by the quadratic formula

x = (-b ± √(b^2 - 4ac)) / (2a).   (0.12)

For our problem, this translates to

x = (-9^12 ± √(9^24 + 4(3))) / 2.
Using the minus sign gives the root

x1 = -2.824 × 10^11,

correct to four significant digits. For the plus sign root,

x2 = (-9^12 + √(9^24 + 4(3))) / 2,
MATLAB calculates x2 = 0. Although the correct answer is close to 0, this answer has no correct
significant digits, even though the numbers defining the problem were specified exactly
(essentially, with infinitely many correct digits) and despite the fact that MATLAB
computes with approximately 16 significant digits (an interpretation of the fact that the
machine epsilon of MATLAB is 2^-52 ≈ 2.2 × 10^-16). How do we explain the total failure
to get accurate digits for x2?

The answer is loss of significance. It is clear that 9^12 and √(9^24 + 4(3)) are nearly
equal, relatively speaking. More precisely, as stored floating point numbers, their
mantissas not only start off similarly, but are actually identical. When they are
subtracted, as directed by the quadratic formula, of course the result is zero.
Can this calculation be saved? We must fix the loss of significance problem. The
correct way to compute x2 is by restructuring the quadratic formula:

x2 = (-b + √(b^2 - 4ac)) / (2a)
   = (-b + √(b^2 - 4ac))(b + √(b^2 - 4ac)) / (2a(b + √(b^2 - 4ac)))
   = -4ac / (2a(b + √(b^2 - 4ac)))
   = -2c / (b + √(b^2 - 4ac)).

Substituting a, b, c for our example yields, according to MATLAB, x2 = 1.062 × 10^-11,
which is correct to four significant digits of accuracy, as required.
This example shows us that the quadratic formula (0.12) must be used with care in
cases where a and/or c are small compared with b. More precisely, if 4|ac| ≪ b^2, then b
and √(b^2 - 4ac) are nearly equal in magnitude, and one of the roots is subject to loss of
significance. If b is positive in this situation, then the two roots should be calculated as

x1 = (-b - √(b^2 - 4ac)) / (2a)   and   x2 = -2c / (b + √(b^2 - 4ac)).   (0.13)

Note that neither formula suffers from subtracting nearly equal numbers. On the other hand,
if b is negative and 4|ac| ≪ b^2, then the two roots are best calculated as

x1 = (-b + √(b^2 - 4ac)) / (2a)   and   x2 = 2c / (-b + √(b^2 - 4ac)).   (0.14)
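These case distinctions fit naturally into a small routine. The following Python sketch (our own illustration, assuming real roots and a ≠ 0) applies the restructured formulas:

```python
import math

def stable_quadratic_roots(a, b, c):
    """Roots of a*x^2 + b*x + c = 0, computed so that neither root
    suffers from subtraction of nearly equal numbers when 4|ac| << b^2.
    Assumes the roots are real (b^2 - 4ac >= 0)."""
    d = math.sqrt(b * b - 4 * a * c)
    if b >= 0:
        return (-b - d) / (2 * a), -2 * c / (b + d)
    else:
        return (-b + d) / (2 * a), 2 * c / (-b + d)

# Example 0.6: x^2 + 9^12 x - 3 = 0
x1, x2 = stable_quadratic_roots(1.0, 9.0 ** 12, -3.0)
print(x1)  # approximately -2.824e11
print(x2)  # approximately 1.062e-11; the naive formula returns 0 here
```

The small root comes out accurate because it is produced by a division rather than by a catastrophic subtraction.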
0.4 Exercises

1. Identify for which values of x there is subtraction of nearly equal numbers, and find an
alternate form that avoids the problem.
(a) (1 - sec x)/tan^2 x   (b) (1 - (1 - x)^3)/x   (c) 1/(1 + x) - 1/(1 - x)
2. Find the roots of the equation x^2 + 3x - 8^-14 = 0 with three-digit accuracy.
3. Explain how to most accurately compute the two roots of the equation x^2 + bx - 10^-12 = 0,
where b is a number greater than 100.
4. Prove formula (0.14).
0.4 Computer Problems
1. Calculate the expressions that follow in double precision arithmetic (using MATLAB, for
example) for x = 10^-1, ..., 10^-14. Then, using an alternative form of the expression that
doesn't suffer from subtracting nearly equal numbers, repeat the calculation and make a table
of results. Report the number of correct digits in the original expression for each x.
(a) (1 - sec x)/tan^2 x   (b) (1 - (1 - x)^3)/x
2. Find the smallest value of p for which the expression calculated in double precision arithmetic
at x = 10^-p has no correct significant digits. (Hint: First find the limit of the expression as
x → 0.)
(a) (tan x - x)/x^3   (b) (e^x + cos x - sin x - 2)/x^3
3. Consider a right triangle whose legs are of length 33455660 and 1.2222222. How much
longer is the hypotenuse than the longer leg? Give your answer with at least four correct
digits.
0.5 REVIEW OF CALCULUS
Some important basic facts from calculus will be necessary later. The Intermediate Value
Theorem and the Mean Value Theorem are important for solving equations in Chapter 1.
Taylor's Theorem is important for understanding interpolation in Chapter 3 and becomes
of paramount importance for solving differential equations in Chapters 6, 7, and 8.

The graph of a continuous function has no gaps. For example, if the function is positive
for one x-value and negative for another, it must pass through zero somewhere. This fact is
basic for getting equation solvers to work in the next chapter. The first theorem formalizes
this notion.
THEOREM 0.4
(Intermediate Value Theorem) Let f be a continuous function on the interval [a, b]. Then
f realizes every value between f(a) and f(b). More precisely, if y is a number between
f(a) and f(b), then there exists a number c with a ≤ c ≤ b such that f(c) = y. ∎
EXAMPLE 0.7
Show that f(x) = x^2 - 3 on the interval [1, 3] must take on the values 0 and 1.

Because f(1) = -2 and f(3) = 6, all values between -2 and 6, including 0 and 1,
must be taken on by f. For example, setting c = √3, note that f(c) = f(√3) = 0; and
secondly, f(2) = 1.
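The sign change guaranteed here is exactly what the equation solvers of the next chapter exploit. A minimal bisection sketch (our own illustration) locates the root that the theorem promises for f(x) = x^2 - 3 on [1, 3]:

```python
def bisect(f, a, b, tol=1e-12):
    """Find a root of continuous f in [a, b], assuming f(a) and f(b)
    have opposite signs, by repeatedly halving the interval."""
    fa = f(a)
    while b - a > tol:
        m = (a + b) / 2
        fm = f(m)
        if fa * fm <= 0:
            b = m            # sign change lies in [a, m]
        else:
            a, fa = m, fm    # sign change lies in [m, b]
    return (a + b) / 2

root = bisect(lambda x: x * x - 3, 1.0, 3.0)
print(root)  # approximately 1.732050808, i.e. sqrt(3)
```

At every step the Intermediate Value Theorem guarantees a root inside the current interval, which is why the method cannot fail for continuous functions with a sign change.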
THEOREM 0.5
(Continuous Limits) Let f be a continuous function in a neighborhood of x0, and assume
lim_{n→∞} xn = x0. Then

lim_{n→∞} f(xn) = f(lim_{n→∞} xn) = f(x0).

In other words, limits may be brought inside continuous functions. ∎
THEOREM 0.6
(Mean Value Theorem) Let f be a continuously differentiable function on the interval
[a, b]. Then there exists a number c between a and b such that f'(c) = (f(b) - f(a))/(b - a). ∎
EXAMPLE 0.8
Apply the Mean Value Theorem to f(x) = x^2 - 3 on the interval [1, 3].

The content of the theorem is that because f(1) = -2 and f(3) = 6, there must exist
a number c in the interval (1, 3) satisfying f'(c) = (6 - (-2))/(3 - 1) = 4. It is easy to
find such a c. Since f'(x) = 2x, the correct c = 2.
The next statement is a simple corollary of the Mean Value Theorem.

THEOREM 0.7
(Rolle's Theorem) Let f be a continuously differentiable function on the interval [a, b],
and assume that f(a) = f(b). Then there exists a number c between a and b such that
f'(c) = 0. ∎
If a function f is known well at a point x, then a lot of information can be extracted
about nearby points. For example, if f'(x) > 0, then f has greater values for nearby points
to the right, and lesser values for points to the left. Taylor's Theorem uses the derivatives at
x to give a full accounting of the function values in a small neighborhood of x.
THEOREM 0.8
(Taylor's Theorem with Remainder) Let x and x0 be real numbers, and let f be k + 1 times
continuously differentiable on the interval between x and x0. Then there exists a number c
between x and x0 such that

f(x) = f(x0) + (x - x0) f'(x0) + ((x - x0)^2/2!) f''(x0) + ((x - x0)^3/3!) f'''(x0) + ···
       + ((x - x0)^k/k!) f^(k)(x0) + ((x - x0)^(k+1)/(k + 1)!) f^(k+1)(c). ∎
The polynomial part of the result, the terms up to degree k in x - x0, is called the
degree k Taylor polynomial for f centered at x0. The final term is called the Taylor
remainder. To the extent that the Taylor remainder term is small, Taylor's Theorem gives a
way to approximate a general, smooth function with a polynomial. This is very convenient
in solving problems with a computer, which, as mentioned earlier, can evaluate polynomials
very efficiently.
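For instance, once the derivative values at x0 are known, the Taylor polynomial can be evaluated with the nested multiplication of Section 0.1. A Python sketch (our own illustration):

```python
import math

def taylor_poly(derivs, x0, x):
    """Evaluate the Taylor polynomial with coefficients f^(i)(x0)/i!,
    given derivs = [f(x0), f'(x0), ..., f^(k)(x0)], by nested
    multiplication (Horner's method) in the variable x - x0."""
    h = x - x0
    coeffs = []
    fact = 1
    for i, d in enumerate(derivs):
        if i > 0:
            fact *= i
        coeffs.append(d / fact)   # c_i = f^(i)(x0) / i!
    result = 0.0
    for c in reversed(coeffs):    # highest degree first
        result = result * h + c
    return result

# Degree 5 Taylor polynomial of e^x at x0 = 0 (every derivative is 1)
approx = taylor_poly([1.0] * 6, 0.0, 0.5)
print(approx, math.exp(0.5))  # agree to about four decimal places
```

The nested loop uses k multiplications and k additions, matching the efficient evaluation methods discussed at the start of the chapter.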
EXAMPLE 0.9
Find the degree 4 Taylor polynomial P4(x) for f(x) = sin x centered at the point x0 = 0.
Estimate the maximum possible error when using P4(x) to estimate sin x for |x| ≤ 0.0001.

The polynomial is easily calculated to be P4(x) = x - x^3/6. Note that the degree 4
term is absent, since its coefficient is zero. The remainder term is

(x^5/5!) cos c,

which in absolute value cannot be larger than |x|^5/120. For |x| ≤ 0.0001, the remainder is
at most 10^-20/120 and will be invisible when, for example, x - x^3/6 is used in double
precision to approximate sin 0.0001. Check this by computing both in MATLAB.
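The check takes only a few lines in any double precision environment; here is a Python version in place of the MATLAB suggestion:

```python
import math

x = 0.0001
p4 = x - x ** 3 / 6     # degree 4 Taylor polynomial for sin at 0
exact = math.sin(x)
print(abs(p4 - exact))  # far below double precision resolution at x
```

The remainder bound 10^-20/120 is well under the spacing of double precision numbers near 0.0001, so the polynomial and the built-in sine are indistinguishable here.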
Finally, the integral version of the Mean Value Theorem follows:

THEOREM 0.9
(Mean Value Theorem for Integrals) Let f be a continuous function on the interval [a, b],
and let g be an integrable function that does not change sign on [a, b]. Then there exists a
number c between a and b such that

∫_a^b f(x)g(x) dx = f(c) ∫_a^b g(x) dx.
0.5 Exercises

1. Use the Intermediate Value Theorem to prove that f(c) = 0 for some 0 < c < 1.
(a) f(x) = x^3 - 4x + 1   (b) f(x) = 5 cos πx - 4   (c) f(x) = 8x^4 - 8x^2 + 1
2. Find c satisfying the Mean Value Theorem for f(x) on the interval [0, 1]. (a) f(x) = e^x
(b) f(x) = …   (c) f(x) = 1/(x + 1)
3. Find c satisfying the Mean Value Theorem for Integrals with f(x), g(x) in the interval [0, 1].
(a) f(x) = x, g(x) = x   (b) f(x) = x^2, g(x) = x   (c) f(x) = x, g(x) = e^x
4. Find the Taylor polynomial of degree 2 about the point x = 0 for the following functions:
(a) f(x) = e^(x^2)   (b) f(x) = cos 5x   (c) f(x) = 1/(x + 1)
5. Find the Taylor polynomial of degree 5 about the point x = 0 for the following functions:
(a) f(x) = e^(2x)   (b) f(x) = cos 2x   (c) f(x) = ln(1 + x)   (d) f(x) = sin^2 x
6. (a) Find the Taylor polynomial of degree 4 for f(x) = x^-2 about the point x = 1.
(b) Use the result of (a) to approximate f(0.9) and f(1.1).
(c) Use the Taylor remainder to find an error formula for the Taylor polynomial. Give error
bounds for each of the two approximations made in part (b). Which of the two
approximations in part (b) do you expect to be closer to the correct value?
(d) Use a calculator to compare the actual error in each case with your error bound from
part (c).
7. Carry out Exercise 6 (a)-(d) for f(x) = ln x.
8. (a) Find the degree 5 Taylor polynomial P5(x) centered at x = 0 for f(x) = cos x. (b) Find an
upper bound for the error in approximating f(x) = cos x for x in [-π/4, π/4] by P5(x).
9. A common approximation for √(1 + x) is 1 + x/2, when x is small. Use the degree 1 Taylor
polynomial of f(x) = √(1 + x) with remainder to determine a formula of the form
√(1 + x) = 1 + x/2 ± E. Evaluate E for the case of approximating √… . Use a calculator to
compare the actual error to your error bound E.
Software and Further Reading

The IEEE standard for floating point computation is found in [3]. The sources [1, 6] discuss
floating point arithmetic in great detail, and the recent [5] emphasizes the IEEE 754 standard.
The texts [7] by Wilkinson and [4] by Knuth had great influence on the development of both
hardware and software.
There are several software packages that specialize in general-purpose scientific
computing, the bulk of it done in floating point arithmetic. Netlib (http://www.netlib.org) is
a collection of free software maintained by AT&T Bell Laboratories, the University of
Tennessee, and Oak Ridge National Laboratory. The collection consists of high-quality
programs available in Fortran, C, and Java, but it comes with little support. The comments
in the code are meant to be sufficiently instructive for the user to operate the program.

The Numerical Algorithms Group (NAG) (http://www.nag.co.uk) markets a library
containing over 1,400 user-callable subroutines for solving general applied math problems.
The programs are available in Fortran and C and are callable from Java programs. NAG
includes libraries for shared memory and distributed memory computing.

The IMSL numerical library is a product of Visual Numerics, Inc.
(http://www.vni.com), and covers areas similar to those covered by the NAG library.
Fortran, C, and Java programs are available. It also provides PV-WAVE, a powerful
programming language with data analysis and visualization capabilities.
The computing environments Mathematica, Maple, and MATLAB have grown to encompass
many of the same computational methods previously described and have built-in editing
and graphical interfaces. Mathematica (http://www.wolframresearch.com) and Maple
(www.maplesoft.com) came to prominence due to novel symbolic computing engines.
MATLAB has grown to serve many science and engineering applications through "toolboxes,"
which leverage the basic high-quality software into diverse directions.

In this text, we frequently illustrate basic algorithms with MATLAB implementations.
The MATLAB code given is meant to be instructional only. Quite often, speed and reliability
are sacrificed for clarity and readability. Readers who are new to MATLAB should begin
with the tutorial in Appendix B; they will soon be doing their own implementations.