
Chapter 1

NUMERICAL ANALYSIS
When a mathematical problem can be solved analytically, its solution may be exact; more frequently, however, there is no known method of obtaining an exact solution.
e.g., the integral

∫_0^t  e^x / (1 − x^2) dx,    −1 < t < 1,
is difficult to solve. There are many examples whose solutions by analytical method
are either impossible or may be so complex that they are quite unsuitable for practical purposes. In this situation, the only way of obtaining an idea of the behavior
of a solution is to approximate the problem in such a manner that the numbers
representing the solution can be produced. The process of obtaining a solution is to
reduce the original problem to a repetition of the same step or series of steps so that
computations become automatic. Such a process is called a numerical method
and a numerical method, which can be used to solve a problem, will be called an
algorithm. An algorithm is a complete and unambiguous set of procedures leading to the solution of a mathematical problem. The selection or construction of
appropriate algorithms properly falls within the discipline of numerical analysis.
Having decided on a specific algorithm or set of algorithms for solving the problem,
numerical analysts should consider all the sources of error that may affect the results. They must consider how much accuracy is required, estimate the magnitude
of the round-off and discretization errors, determine an appropriate step size or the
number of iterations required, provide for adequate checks on accuracy, and make
allowance for corrective action in case of non-convergence.
Numerical analysis is a way to do higher mathematics problems on a computer, a technique widely used by scientists and engineers to solve their problems.
Before starting, we consider methods for representing numbers on computers and the errors introduced by these representations.

1.0.1 The Representation of Integers

In everyday life, we use numbers based on the decimal system. Thus the number 257, for example, is expressible as

257 = 2 × 100 + 5 × 10 + 7 × 1
    = 2 × 10^2 + 5 × 10^1 + 7 × 10^0.

We call 10 the base of this system. Any integer is expressible as a polynomial in the base 10 with integral coefficients between 0 and 9. We use the notation

N = (a_n a_{n-1} a_{n-2} ... a_0)_10
  = a_n × 10^n + a_{n-1} × 10^{n-1} + a_{n-2} × 10^{n-2} + ... + a_0 × 10^0.
Modern computers read pulses sent by electrical components. The state of an electrical impulse is either on or off. It is therefore convenient to represent numbers in computers in the binary system. Here the base is 2, and the integer coefficients may take the values 0 and 1. A nonnegative integer N will be represented in the binary system as

N = (a_n a_{n-1} a_{n-2} ... a_0)_2
  = a_n × 2^n + a_{n-1} × 2^{n-1} + a_{n-2} × 2^{n-2} + ... + a_0 × 2^0,
where the coefficients a_k are either 0 or 1. Note that N is again represented as a polynomial, but now in the base 2. Many computers used in scientific work operate internally in the binary system. Users of computers, however, prefer to work in the more familiar decimal system. The computer converts their inputs to base 2 (or perhaps base 16), then performs base-2 arithmetic, and finally translates the answer into base 10 before it prints it out to them. It is therefore necessary to have some means of converting from decimal to binary when submitting information to the computer, and from binary to decimal for output purposes. Conversion of a binary number to decimal may be accomplished from the above definition as

(11)_2 = 1 × 2^1 + 1 × 2^0 = 3
(1101)_2 = 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0 = 13
and decimal number to binary as
187 = (187)_10 = 1 × 10^2 + 8 × 10^1 + 7 × 10^0
               = (1)_2 × (1010)_2^2 + (1000)_2 × (1010)_2^1 + (111)_2 × (1010)_2^0
               = (1010)_2 × ((1010)_2 + (1000)_2) + (111)_2
               = (1010)_2 × (10010)_2 + (111)_2
               = (10110100)_2 + (111)_2
               = (10111011)_2
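The nested evaluation above is just Horner's rule applied to the decimal digits. As a rough illustration (the helper name below is ours, not from the text), the same idea can be written in Python, with bin() used to display the machine's binary result:

def decimal_string_to_int(s):
    """Horner evaluation of a decimal digit string: value = (...((d1*10 + d2)*10 + d3)...)."""
    value = 0
    for ch in s:
        value = value * 10 + int(ch)   # the machine carries this arithmetic out in base 2 internally
    return value

n = decimal_string_to_int("187")
print(bin(n))   # 0b10111011, i.e. 187 = (10111011)_2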

However, if we look into the machine languages, we soon realize that other number
systems, particularly the octal and hexadecimal systems, are also used. The octal
and hexadecimal systems are close relatives of the binary and can be translated to
and from binary easily. Expressions in octal and hexadecimal are shorter than in binary, so they are easier for humans to read and understand. Hexadecimal also
provides more efficient use of memory space for real numbers.
The octal number system using the base 8 presents a kind of compromise between
the computer-preferred binary and the people-preferred decimal system. It is easy
to convert from octal to binary and back since three binary digits make one octal
digit. To convert from octal to binary one merely replaces all octal digits by their
binary equivalent; thus
187 = (187)_10 = 1 × 10^2 + 8 × 10^1 + 7 × 10^0
               = (1)_8 × (12)_8^2 + (10)_8 × (12)_8^1 + (7)_8 × (12)_8^0
               = (12)_8 × ((12)_8 + (10)_8) + (7)_8
               = (12)_8 × (22)_8 + (7)_8
               = (264)_8 + (7)_8
               = (273)_8
               = (2 7 3)_8
               = (010 111 011)_2

1.0.2 The Representation of Fractions

If x is a positive real number, then its integral part x_I is the largest integer less than or equal to x, while

x_F = x − x_I

is its fractional part. The fractional part can always be written as a decimal fraction:

x_F = Σ_k b_k 10^{-k},    k = 1, 2, 3, ...

where each b_k is a nonnegative integer less than 10. If b_k = 0 for all k greater than a certain integer, then the fraction is said to terminate. Thus

1/4 = 0.25 = 2 × 10^{-1} + 5 × 10^{-2}

is a terminating decimal fraction since b_k = 0 for all k ≥ 3, while

1/3 = 0.3333... = 3 × 10^{-1} + 3 × 10^{-2} + 3 × 10^{-3} + ...

is not. Here the trailing dots indicate that the digit 3 is repeated forever. If the integral part of x is given as a decimal integer by

x_I = (a_n a_{n-1} a_{n-2} ... a_0)_10

and the fractional part is given by

x_F = Σ_k b_k 10^{-k},

then x is written by listing the digits of the two parts one after the other, separated by a point, the decimal point:

x = (a_n a_{n-1} a_{n-2} ... a_0 . b_1 b_2 b_3 ...)_10.

Completely analogously, one can write the fractional part of x as a binary fraction:

x_F = Σ_k b_k 2^{-k},    k = 1, 2, 3, ...

where each b_k is a nonnegative integer less than 2, i.e., either 0 or 1. If the integral part of x is given by the binary integer

x_I = (a_n a_{n-1} a_{n-2} ... a_0)_2,

then we write

x = (a_n a_{n-1} a_{n-2} ... a_0 . b_1 b_2 b_3 ...)_2

using a binary point.
The binary fraction (.b_1 b_2 b_3 ...)_2 for a given number x_F between zero and one can be calculated as follows. If

x_F = Σ_k b_k 2^{-k},    k = 1, 2, 3, ...

then

2 x_F = Σ_k b_k 2^{-k+1} = b_1 + Σ_k b_{k+1} 2^{-k},    k = 1, 2, 3, ...

Hence b_1 is the integral part of 2 x_F, while

2 x_F − b_1 = Σ_k b_{k+1} 2^{-k} = (2 x_F)_F,    k = 1, 2, 3, ...

2 (2 x_F)_F = b_2 + Σ_k b_{k+2} 2^{-k},    k = 1, 2, 3, ...

so b_2 is the integral part of 2 (2 x_F)_F, and

(2 x_F)_F − b_2 = Σ_k b_{k+2} 2^{-k} = (2 (2 x_F)_F)_F,    k = 1, 2, 3, ...

Therefore, repeating this procedure, we find that b_3 is the integral part of 2 (2 (2 x_F)_F)_F, and so on.
Example:-
If x = 0.625 = x_F, then

2(0.625) = 1.25    so b_1 = 1
2(0.25)  = 0.5     so b_2 = 0
2(0.5)   = 1.0     so b_3 = 1

and all further b_k's are zero. Hence

0.625 = (.101)_2
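To make the repeated-doubling procedure concrete, here is a short Python sketch (the function name frac_to_binary is ours, for illustration only):

def frac_to_binary(x_f, max_digits=20):
    """Return the binary digits b_1, b_2, ... of a fraction 0 <= x_f < 1 by repeated doubling."""
    digits = []
    for _ in range(max_digits):
        x_f *= 2
        b = int(x_f)        # the integral part is the next binary digit
        digits.append(b)
        x_f -= b            # keep only the fractional part
        if x_f == 0:        # terminating binary fraction
            break
    return digits

print(frac_to_binary(0.625))   # [1, 0, 1], i.e. 0.625 = (.101)_2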
Inversely, if x = (.101)_2, then the decimal fraction (.b_1 b_2 b_3 ...)_10 for a given number x_F between zero and one,

x_F = Σ_k b_k 10^{-k},    k = 1, 2, 3, ...,

can be calculated in the same way, multiplying by 10 = (1010)_2 instead of 2. Here x_F = (.101)_2, so

10 × x_F = (1010)_2 × (.101)_2 = (110.010)_2.

The integral part of 10 x_F is (110)_2 = 6, i.e., b_1 = 6, and

(10 x_F)_F = (.010)_2.
10 × (10 x_F)_F = (1010)_2 × (.010)_2 = (10.10)_2.

Here the integral part of 10 (10 x_F)_F is (10)_2 = 2, i.e., b_2 = 2, and

(10 (10 x_F)_F)_F = (.10)_2.
10 × (10 (10 x_F)_F)_F = (1010)_2 × (.10)_2 = (101.0)_2.

The integral part of this product is (101)_2 = 5, i.e., b_3 = 5, and the remaining fractional part is (.0)_2 = 0, showing that b_4 = 0. Hence all subsequent b_k's are zero. This shows that

(.101)_2 = 0.625

Note that if x_F is a terminating binary fraction with n digits, then it is also a terminating decimal fraction with n digits, since (.1)_2 = 0.5.
We shall not go further into the messy field of binary representation arithmetic and its pitfalls, because much depends on the machine used, on the programs supplied by the computer manufacturer and on the computer center; but it should be clear that the system of binary representation of numbers is going to affect our answers in many ways.

1.1 The Three Number Systems

Besides the various bases for representing numbers (decimal, binary, and octal), there are also three distinct number systems that are used in computing machines.
First, there are the integers, or counting numbers (0, 1, 2, 3, ...), which are used to index and count and have limited usage in numerical analysis. Usually, they have the range from 0 to the largest number that can be contained in the machine's index registers.
Second there are the fixed-point numbers. For example
367.143 258 765
593,245.678 953
0.001 236 754 56
The fixed-point number system is the one that the programmer has implicitly used
during much of his own calculation, and it is the one with which he is most familiar.
Perhaps the only feature that is different in hand and machine calculations is that
the machine always carries the same number of digits, whereas in hand calculation
the user often changes the number of figures he carries to fit the current needs of
the problem.
Third, there is the floating-point number system, which is the one used in
almost all practical scientific and engineering computations. This number system
differs in significant ways from the fixed-point number system, and we must be aware
of these differences at many stages of a long computation. Typically, the computer word length includes both the mantissa and the exponent; thus the number of digits in the mantissa of a floating-point number is less than in that of a fixed-point number.

1.1.1 Floating-Point Arithmetic

Scientific and engineering calculations are usually carried out in floating-point arithmetic. To examine round-off error in detail, we need to understand how numeric quantities are represented in computers. In nearly all cases, numbers are stored as floating-point quantities: the computer has a finite set of values from which it chooses one to store as an approximation to the real number. The term real numbers refers to the continuous (and infinite) set of numbers on the number line. When printed with a decimal point, a number is either fixed-point or floating-point, in contrast to integers.
Floating-point numbers have three parts:
1. the sign (which requires one bit);
2. the fraction part, often called the mantissa but better characterized by the name significand;
3. the exponent part, often called the characteristic.
The three parts of the numbers have a fixed total length that is often 32 or 64 bits
(sometimes even more). The fraction part uses most of these bits, perhaps 23 to as
many as 52 bits, and that number determines the precision of the representation.
The exponent part uses 7 to as many as 11 bits, and this number determines the
range of the values.
The general form of a floating-point number is

.a_1 a_2 a_3 ... a_p × B^e

where a_1 ≠ 0, the a_i's are digits or bits with values from zero to B − 1, and
B = the number base that is used, usually 2, 16, or 10;
p = the number of significand bits (digits), that is, the precision;
e = an integer exponent, ranging from Emin to Emax, where Emin is negative and Emax is positive.
The significand bits (digits) constitute the fractional part of the number. In almost all cases, numbers are normalized, meaning that the fraction digits are shifted and the exponent adjusted so that a_1 is nonzero, e.g.,

27.39 → +.2739 × 10^2;
−0.00124 → −.1240 × 10^{-2};
37000 → +.3700 × 10^5.
Observe that we have normalized the fractions: the first fraction digit is nonzero. Zero is a special case; it usually has a fraction part with all zeros and a zero exponent. This kind of zero is not normalized and never can be. In hand calculators the base is usually 10; in computers the base is often 2, but sometimes a base of 16 is used.
Most computers permit two or even three types of numbers:
1. single precision, which uses the letter E in the exponent and is usually equivalent to seven to nine significant decimal digits;
2. double precision, which uses the letter D in the exponent instead of E and varies from 14 to 29 significant decimal digits, but is typically about 16 or 17;
3. extended precision, which may be equivalent to 19 to 20 significant decimal digits.
Calculation in double precision usually doubles the storage requirements and more than doubles the running time as compared with single precision.
Method            Largest Number     Smallest Number
IEEE
  single          1.701E38           1.755E-38
  double          8.988E307          2.225E-308
  extended        6E4931             3E-4931
VAX
  single          1.701E38           5.877E-39
  double-1        1.701E38           5.877E-39
  double-2        8.988E307          1.123E-308
  extended        6E4931             1E-4931
IBM
  single          7.237E75           8.636E-78
  double          7.237E75           8.636E-78
  extended        7.237E75           8.636E-78

The finite range of the exponent also is a source of trouble, namely, what are called overflow and underflow, which refer respectively to numbers exceeding the largest-sized and the smallest-sized (non-zero) numbers that can be represented within the system.
It should be evident that we can replace an underflow by a zero and often not go far wrong. It is less safe to replace a positive overflow by the largest number that the system has (to prevent some subsequent overflows due to future additions).
We may wonder how, in actual practice, with a range of 10^{-38} to 10^{38} or more, we can have trouble with overflow and underflow.
Numerical methods provide estimates that are very close to the exact analytical solutions; obviously, an error is introduced into the computation. This error is not a human error, such as a blunder, mistake, or oversight, but rather a discrepancy between the exact and approximate (computed) values. In fact, numerical analysis is a vehicle to study errors in computations. It is not a static discipline: the continuous change in this field is to devise algorithms which are both fast and accurate. These algorithms may become obsolete and may be replaced by algorithms that are more powerful. In the practice of numerical analysis it is important to be aware that computed solutions are not exact mathematical solutions. Numerical methods should be sufficiently accurate (accuracy is the number of digits to which an answer is correct), or unbiased, to meet the requirements of a particular scientific problem, and they should also be precise enough (precision is the number of digits in which a number is expressed, irrespective of the correctness of those digits). The precision of a numerical solution can be diminished in several subtle ways. Understanding these difficulties can often guide the practitioner in the proper implementation and/or development of numerical algorithms.

1.2 Error Analysis

Error analysis is the study and evaluation of error. The accuracy of any computation is always of great importance. Every floating-point operation in a computational process may give rise to an error which, once generated, may then be amplified or reduced in subsequent operations. An error in a numerical computation is simply the difference between the actual (true) value of a quantity and its computed (approximate) value. There are three common ways to express the size of the error in a computed result: Absolute error, Relative error and Percentage error.
Suppose that x* is an approximation (computed value) to x. The error is

ε = x − x*.

1.2.1 Absolute Error:-

The absolute error of a given result is frequently used as a measure of accuracy; the conventional definition is

absolute error = |true value − approximate value|
Ea = |x − x*|

However, a given error is usually much more serious when the magnitude of the true value is small. For example, 1036.52 ± 0.010 is accurate to five significant digits and is frequently of more than adequate precision, while 0.005 ± 0.010 is a clear disaster.

1.2.2 Relative Error:-

The relative error is defined as

relative error = |true value − approximate value| / |true value|,

i.e.,

Er = Ea / |x|,    x ≠ 0.

If the actual value is not known, then

Er = Ea / |x*|,    x* ≠ 0,
is often a better indicator of the accuracy. Relative error is more independent of


the scale of the value, a desirable attribute. This is particularly so when the actual
value is either very small or very large. When the true value is zero, the relative
error is undefined. It follows that the round-off error due to finite-fraction length
in floating-point numbers is more nearly constant when expressed as relative error
than when expressed as absolute error. Observe that the loss of significant digits
when nearly equal floating-point numbers are subtracted produces a particularly
severe relative error.
Examples:-
1. If x = 0.3000 × 10^1 and x* = 0.3100 × 10^1, the absolute error is 0.1 and the relative error is 0.3333 × 10^{-1}.
2. If x = 0.3000 × 10^{-3} and x* = 0.3100 × 10^{-3}, the absolute error is 0.1 × 10^{-4} and the relative error is 0.3333 × 10^{-1}.
3. If x = 0.3000 × 10^4 and x* = 0.3100 × 10^4, the absolute error is 0.1 × 10^3 and the relative error is 0.3333 × 10^{-1}.
This example shows that the same relative error, 0.3333 × 10^{-1}, occurs for widely varying absolute errors. As a measure of accuracy, the absolute error may be misleading and the relative error more meaningful.

4. Consider the following three cases.
(a) Let x = 3.141592 and x* = 3.14; then the absolute error is
Eax = |x − x*| = |3.141592 − 3.14| = 0.001592
and the relative error is
Erx = |0.001592| / |3.141592| = 0.000507.
(b) Let y = 1,000,000 and y* = 999,996; then the absolute error is
Eay = |y − y*| = |1,000,000 − 999,996| = 4
and the relative error is
Ery = |4| / |1,000,000| = 0.000004.
(c) Let z = 0.000012 and z* = 0.000009; then the absolute error is
Eaz = |z − z*| = |0.000012 − 0.000009| = 0.000003
and the relative error is
Erz = |0.000003| / |0.000012| = 0.25.

In case (a) there is not too much difference between Eax and Erx, and either could be used to determine the accuracy of x*. In case (b) the value of y is of magnitude 10^6, the error Eay is large, and the relative error Ery is small. We would call y* a good approximation to y. In case (c), z is of magnitude 10^{-6} and the error Eaz is the smallest of the three cases, but the relative error Erz is the largest. In terms of percentage, it amounts to 25%, and thus z* is a bad approximation to z.
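The three cases are easy to reproduce; the following short Python sketch (our own illustration, values as in the example) prints the absolute and relative errors:

cases = {"a": (3.141592, 3.14),
         "b": (1_000_000, 999_996),
         "c": (0.000012, 0.000009)}

for name, (true, approx) in cases.items():
    ea = abs(true - approx)     # absolute error
    er = ea / abs(true)         # relative error
    print(name, ea, er)
# a: ea ~ 0.001592,  er ~ 0.000507
# b: ea = 4,         er = 4e-06
# c: ea ~ 3e-06,     er ~ 0.25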

1.2.3 Percentage Error:-

Relative error expressed in percentage is called the percentage error, defined by

PE = 100 × Er

In order to investigate the effect of total error in a method, we often compute an error bound, which is a limit on how large the error can be.

1.2.4 Significant Digits

In considering rounding errors, it is necessary to be precise in the usage of approximate digits. A significant digit in an approximate number is a digit, which gives
reliable information about the size of the number. In other words, a significant digit
is used to express accuracy, i.e., how many digits in the number have meaning. The
significant digits of a (measured or calculated) quantity are the meaningful digits
in it. There are conventions which you should learn and follow for how to express
numbers so as to properly indicate their significant digits.
Any digit that is not zero is significant. Thus 549 has three significant digits and 1.892 has four significant digits.
Zeros between nonzero digits are significant. Thus 4023 has four significant digits.
Zeros to the left of the first nonzero digit are not significant. Thus 0.000034 has only two significant digits. This is more easily seen if it is written as 3.4 × 10^{-5}.
For numbers with decimal points, zeros to the right of a nonzero digit are significant. Thus 2.00 has three significant digits and 0.050 has two significant digits. For this reason it is important to keep the trailing zeros to indicate the actual number of significant digits.
For numbers without decimal points, trailing zeros may or may not be significant. Thus, 400 indicates only one significant digit. To indicate that the trailing zeros are significant, a decimal point must be added. For example, 400. has three significant digits, and 4 × 10^2 has one significant digit.
Exact numbers have an infinite number of significant digits. For example, if there are two oranges on a table, then the number of oranges is 2.000... . Defined numbers are also like this. For example, the number of centimeters per inch (2.54) has an infinite number of significant digits, as does the speed of light (299792458 m/s).
There are also specific rules for how to consistently express the uncertainty associated with a number. In general, the last significant digit in any result should be of
the same order of magnitude (i.e. in the same decimal position) as the uncertainty.
Also, the uncertainty should be rounded to one or two significant digits. Always
work out the uncertainty after finding the number of significant digits for the actual
measurement. For example,
9.82 ± 0.02
10.0 ± 1.5
4 ± 1
The following numbers are all incorrect:
9.82 ± 0.02385 is wrong, but 9.82 ± 0.02 is fine;
10.0 ± 2 is wrong, but 10.0 ± 2.0 is fine;
4 ± 0.5 is wrong, but 4.0 ± 0.5 is fine.
In practice, when doing mathematical calculations, it is a good idea to keep one
more digit than is significant to reduce rounding errors. But in the end, the answer
must be expressed with only the proper number of significant digits. After addition
or subtraction, the result is significant only to the place determined by the largest
last significant place in the original numbers. For example,
89.332 + 1.1 = 90.432

should be rounded to get 90.4 (the tenths place is the last significant place in 1.1).
After multiplication or division, the number of significant digits in the result is
determined by the original number with the smallest number of significant digits.
For example,
(2.80)(4.5039) = 12.61092
should be rounded off to 12.6 (three significant digits like 2.80).
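Rounding to a given number of significant digits (as opposed to decimal places) can be written compactly; the helper below is our own sketch of the rule illustrated above:

from math import floor, log10

def round_sig(x, n):
    """Round x to n significant decimal digits."""
    if x == 0:
        return 0.0
    return round(x, n - 1 - floor(log10(abs(x))))

print(round_sig(89.332 + 1.1, 3))    # 90.4  (the tenths place, set by 1.1)
print(round_sig(2.80 * 4.5039, 3))   # 12.6  (three significant digits, like 2.80)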

1.2.5 Loss of Significance and Error Propagation: Condition and Instability

One of the most common (and often avoidable) ways of increasing the importance of an error is commonly called loss of significant digits. If X is an approximation to x, then we say that X approximates x to r significant β-digits provided the absolute error |x − X| is at most 1/2 in the rth significant β-digit of x. This can be expressed as

|x − X| ≤ (1/2) β^{s−r+1}

with s the largest integer such that β^s ≤ |x|. For instance, X = 3 agrees with x = π to one significant (decimal) digit, while X = 22/7 = 3.1428... is correct to three significant digits (as an approximation to π).
Once an error is committed, it contaminates subsequent results. This error
propagation through subsequent calculations is conveniently studied in terms of
the two related concepts of condition and instability.
The word condition is used to describe the sensitivity of the function value f (x)
to changes in the argument x. The condition is usually measured by the maximum
relative change in the function value f (x) caused by a unit relative change in the
argument x.
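A standard way to make this quantitative (not spelled out in the text above, but consistent with it) is the relative condition number

K(x) ≈ |x f'(x) / f(x)|,

which estimates the factor by which a small relative error in x is magnified in the computed value f(x); a large K(x) signals an ill-conditioned evaluation.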
An example to illustrate the avoidance of loss of significance is the following.
Example:- Compare the results of computing f(500) and g(500) using six digits and rounding, where

f(x) = x [√(x + 1) − √x]
g(x) = x / [√(x + 1) + √x]

f(500) = 500 [√501 − √500]
       = 500 [22.3830 − 22.3607]
       = 500 × 0.0223
       = 11.1500

g(500) = 500 / [√501 + √500]
       = 500 / [22.3830 + 22.3607]
       = 500 / 44.7437
       = 11.1748

The function g(x) is algebraically equivalent to f(x), since

f(x) = x [√(x + 1) − √x] × [√(x + 1) + √x] / [√(x + 1) + √x]
     = x / [√(x + 1) + √x].

The answer g(500) = 11.1748 involves less error and is the same as that obtained by rounding the true answer 11.174753 to six digits.
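The effect can be reproduced in Python with the decimal module, which lets us force six-significant-digit rounded arithmetic (this snippet is our illustration, not part of the text):

from decimal import Decimal, getcontext

getcontext().prec = 6                       # six significant digits, rounded
x = Decimal(500)

f = x * ((x + 1).sqrt() - x.sqrt())         # subtraction of nearly equal numbers
g = x / ((x + 1).sqrt() + x.sqrt())         # algebraically equivalent, no cancellation

print(f)   # 11.1500  (only three correct digits)
print(g)   # 11.1748  (agrees with 11.174753... to six digits)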
All that is required is that the person computing should use his imagination and
foresee what might happen before he writes the program for a machine. As a simple
rule, try to avoid subtractions ( even if they appear as a sum but with the sign of
one of the terms negative and the other positive).
Example:-

(x + ε)^{2/3} − x^{2/3} = [((x + ε)^{2/3})^3 − (x^{2/3})^3] / [(x + ε)^{4/3} + (x + ε)^{2/3} x^{2/3} + x^{4/3}]

                        = [(x + ε)^2 − x^2] / [(x + ε)^{4/3} + (x + ε)^{2/3} x^{2/3} + x^{4/3}]

                        = (2xε + ε^2) / [(x + ε)^{4/3} + (x + ε)^{2/3} x^{2/3} + x^{4/3}]

Other methods for rearranging an expression


Simple rearrangements will not always produce a satisfactory expression for computer evaluation, and it is necessary to use other devices that occur in the calculus
course.
For small positive x:

1 − e^{-x} = x − x^2/2! + x^3/3! − ...

ln(1 − x) = −(x + x^2/2 + x^3/3 + ...)

(tan x − sin x)/x^3 = [(x + x^3/3 + 2x^5/15 + ...) − (x − x^3/6 + x^5/120 − ...)] / x^3
                    = [(1/3 + 1/6) x^3 + (2/15 − 1/120) x^5 + ...] / x^3
                    = 1/2 + x^2/8 + ...

Another technical device of less practical use but of great theoretical value is the mean-value theorem

f(b) − f(a) = (b − a) f'(ξ),    a < ξ < b.

Whereas the value of ξ is not known and in principle can be anywhere inside the interval (a, b), it is reasonable to suspect that the choice of the midvalue is as good as any other value if nothing else is known about the function.
Example:- For x small with respect to a,

ln(a + x) − ln a = ln(1 + x/a).

Also, by using the mean-value theorem,

ln(a + x) − ln a = x/ξ ≈ x/(a + x/2),    a < ξ < a + x.

The main sources of error are
Gross errors,
Errors in original data,
Round-off errors,
Truncation errors.
They all cause the same effect: diversion from the exact answer. Some errors are small and may be neglected, while others may be devastating if overlooked.

1.2.6 Gross Errors

When humans are involved in programming, operations, preparing the input, and
interpreting the output, blunders or gross errors do occur rather more frequently
than we like to admit. A few examples of these errors are
Poor definition of the problem,
Choice of an inappropriate model,
Approximation made in representing physical processes by mathematical operations,
Misreading or misquoting the digits, particularly in the interchange of adjacent
digits.
Use of an inaccurate formula (algorithm) to solve a particular problem, and
Use of inaccurate data.
These can be avoided by taking enough care, coupled with a careful examination of the results for reasonableness. Sometimes a test run with known results is
worthwhile, but this is no guarantee of freedom from foolish error.

1.2.7 Errors in Original Data

Real world problems, in which an existing or proposed physical situation is modeled by a mathematical equation, will nearly always have coefficients that are imperfectly
known. The reason is that the problems often depend on measurements of doubtful
accuracy. Further, the model itself may not reflect the behavior of the situation
perfectly. We can do nothing to overcome such errors by any choice of method, but
we need to be aware of such uncertainties; in particular, we may need to perform
tests to see how sensitive the results are to changes in the input information. Since
the reason for performing the computation is to reach some decision with validity
in the real world, sensitivity analysis is of extreme importance. As Hamming says,
the purpose of computing is insight, not numbers.
There are errors, which arise after a mathematical formulation is obtained. They
include not only computational errors in the strict sense but also those errors, which
arise because we substitute finite mathematical processes for infinite mathematical
processes. An example of this is the substitution of the sum of a finite series for the
value of a function. These are the errors of mathematical approximation. Computational errors might more appropriately be named the errors of numerical methods.
The finite representation of numbers in the machine leads to roundoff errors,
whereas the finite representation of processes leads to truncation errors.

1.2.8 Truncation Error

The term truncation error refers to those errors caused by the method itself, e.g.,
caused by the approximations used in the mathematical formula of the scheme, when
a more complicated mathematical expression is replaced with a more elementary
formula. The error arising from this approximation is called the truncation error.
This terminology originates from the technique of replacing a complicated function
with a truncated Taylor/Maclaurin series, Binomial expansion, Infinite geometric
progression or any other approximation. For example, the infinite Taylor series
e^{x^2} = 1 + x^2/1! + x^4/2! + x^6/3! + ... + x^{2n}/n! + ...

might be replaced with just the five terms

e^{x^2} ≈ 1 + x^2/1! + x^4/2! + x^6/3! + x^8/4!

This might be done when approximating an integral numerically.


Example:-
Given that I = ∫_0^{1/2} e^{x^2} dx = 0.544987104184, determine the accuracy of the approximation obtained by replacing the integrand f(x) = e^{x^2} with the truncated Taylor series

P_4(x) = 1 + x^2/1! + x^4/2! + x^6/3! + x^8/4!.

Solution:-

I* = ∫_0^{1/2} (1 + x^2/1! + x^4/2! + x^6/3! + x^8/4!) dx
   = [ x + x^3/3 + x^5/(5 · 2!) + x^7/(7 · 3!) + x^9/(9 · 4!) ]_0^{1/2}
   = 1/2 + 1/24 + 1/320 + 1/5376 + 1/110592
   = 2109491/3870720
   = 0.544986720817 = I*

Er = |I − I*| / |I| = 7.03442 × 10^{-7}

The approximation I* agrees with the true answer I to five significant digits.
Exercise
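As a quick numerical check of the worked example above (our own illustration, not part of the text), the truncated-series integral can be evaluated exactly with Python's fractions module:

from fractions import Fraction
from math import factorial

# Integrating P4(x) = sum_{k=0..4} x^(2k)/k! term by term from 0 to 1/2:
# each term contributes (1/2)^(2k+1) / ((2k+1) * k!)
I_star = sum(Fraction(1, 2)**(2*k + 1) / ((2*k + 1) * factorial(k)) for k in range(5))

print(I_star)          # 2109491/3870720
print(float(I_star))   # 0.54498672...
I = 0.544987104184     # value of the exact integral quoted in the example
print(abs(I - float(I_star)) / I)   # ~7.03e-07, matching Er above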

1.2.9 Rounding Errors

This is the most basic source of errors in a computer. All computing devices represent numbers, except for integers, with some imprecision. Digital computers will
nearly always use floating-point numbers of fixed word length; the true values are
not expressed exactly by such representations. Round-off error occurs when a calculator or computer is used to perform real number calculations. This error arises
because the arithmetic performed in a machine involves numbers with only a finite
number of digits, say, n significant digits by rounding off the (n + 1)th place and
dropping all digits after the nth with the result that calculations are performed with
approximate representations of the actual numbers. That is, the error introduced by
rounding-off numbers to a limited number of decimal places is called the rounding
error, or the error that results from replacing a number with its floating-point form
is called the rounding error.
For example,

π = 0.314159265... × 10^1.

The five-digit floating-point form of π using chopping is

π ≈ 0.31415 × 10^1 = 3.1415,

called the chopped floating-point representation of π. Since the sixth digit of the decimal expansion of π is a 9, the floating-point form of π using five-digit rounding is

π ≈ (0.31415 + 0.00001) × 10^1 = 3.1416,

called the rounded floating-point representation of π. The error that results from replacing a number with its floating-point form is called round-off error (regardless of whether the rounding or chopping method is used).
Consider another example:- When two 3-digit numbers are multiplied together, their product has either five or six places.

    0.236 × 10^1
  × 0.127 × 10^1
  --------------
           1652
           472
           236
  --------------
  0.0299|72 × 10^2

Rounding: the digits to the right of the bar exceed half a unit in the last retained place, so the retained digits are rounded up and the rest dropped:

Answer 0.300 × 10^1,    |roundoff error| = 0.28 × 10^{-2}

Chopping: the digits beyond the third are simply dropped:

Answer 0.299 × 10^1,    |roundoff error| = 0.72 × 10^{-2}


When the machine drops digits without rounding, which is called chopping, serious trouble can result.
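The two answers above can be reproduced with Python's decimal module by choosing the rounding mode (our own sketch, mirroring the example):

from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

p = Decimal("0.236") * Decimal("0.127") * 100    # the product 2.9972 from the example
print(p.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 3.00  (roundoff error 0.0028)
print(p.quantize(Decimal("0.01"), rounding=ROUND_DOWN))     # 2.99  (roundoff error 0.0072)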
Round-off causes trouble mainly when two numbers of about the same size are subtracted. As a result of the cancellation of the leading digits, the number is shifted (normalized) to the left until the first digit is not zero. This shifting can bring the round-off errors that were in the extreme right part of the number well into the middle, if not to the extreme left. In later steps we shall think that we have an accurate number when we do not.
A second, more insidious trouble with round-off, especially with chopping, is the presence of internal correlations between numbers in the computation so that, step after step, the small error is always in the same direction, and one is therefore not under the protective umbrella of the statistical average behavior.
1.2.9.1 Accumulated Round-off Error & Local Round-off Errors

The round-off error present in the final result of a numerical computation is usually termed the accumulated round-off error, while the errors resulting from individual rounding or truncating operations are called local round-off errors.

1.2.10 Error Accumulation in Computations

To investigate how error might be accumulated in computations, we proceed as follows.

1. Error Accumulation in Addition
Consider the addition of two numbers p and q (the true values) with approximate values p* and q*, and errors ε_p and ε_q respectively, i.e.,

p = p* + ε_p    and    q = q* + ε_q.

Let z = p + q, with error ε_z, and z* = p* + q*. Then the sum is

z* + ε_z = p* + ε_p + q* + ε_q
         = p* + q* + ε_p + ε_q,
so
ε_z = ε_p + ε_q.

Hence, for addition, the error in the sum is the sum of the errors of the addends. The absolute error of the sum of two numbers is the sum of the absolute errors of the given numbers, i.e.,

Ea = |ε_z| ≤ |ε_p| + |ε_q|.

This formula can be extended to any number of terms:

Ea = |ε_z| ≤ |ε_1| + |ε_2| + |ε_3| + ... + |ε_n|.

The relative error is calculated as

Er = Ea / |z| = Absolute Error / |Sum of the given numbers|.

2. Error Accumulation in Subtraction
Let z = p − q, where p > q, and z* = p* − q*. Then

z = z* + ε_z = (p* + ε_p) − (q* + ε_q)
             = (p* − q*) + (ε_p − ε_q),
which implies
ε_z = ε_p − ε_q,
Ea = |ε_z| ≤ |ε_p| + |ε_q|,

which is the same as above. Hence, the absolute error of a difference between two numbers is the sum of the absolute errors of the given numbers. This formula can be extended to any number of terms:

Ea = |ε_z| ≤ |ε_1| + |ε_2| + |ε_3| + ... + |ε_n|.

The relative error is calculated as

Er = Ea / |z| = Absolute Error / |Difference of the given numbers|.

3. Error Accumulation in Multiplication
The propagation of error in multiplication is more complicated. Let z = p·q and z* = p*·q*. Then the product is

z = p·q = (p* + ε_p)(q* + ε_q) = p* q* + p* ε_q + q* ε_p + ε_p ε_q.

Hence, if p* and q* are larger than 1 in absolute value, the terms p* ε_q and q* ε_p show that there is a possibility of magnification of the original errors ε_p and ε_q. Insight is gained if we look at the relative error. Rearranging the terms above gives

ε_z = p·q − p*·q* = p* ε_q + q* ε_p + ε_p ε_q.

Suppose p ≠ 0 and q ≠ 0; then dividing by p·q, we obtain

ε_z/(pq) = (p* ε_q)/(pq) + (q* ε_p)/(pq) + (ε_p ε_q)/(pq).

Furthermore, suppose that

p*/p ≈ 1,    q*/q ≈ 1,    (ε_p ε_q)/(pq) ≈ 0.

Then the relative error

|ε_z| / |pq| ≈ |ε_q|/|q| + |ε_p|/|p|.

This shows that the relative error in the product p·q is approximately the sum of the relative errors in the approximations p* and q*; that is, the relative error modulus of the product of two numbers does not exceed the sum of the relative error moduli of the given numbers. For the product of n numbers,

Er = |ε_z/z| ≤ |ε_1/x_1| + |ε_2/x_2| + ... + |ε_n/x_n|.

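A quick numerical confirmation of the product rule (our own illustration; the numbers are arbitrary):

p, q = 3.141592, 2.718281          # true values
p_s, q_s = 3.14, 2.718             # approximations p*, q*

rel_p = abs(p - p_s) / abs(p)
rel_q = abs(q - q_s) / abs(q)
rel_pq = abs(p*q - p_s*q_s) / abs(p*q)

print(rel_pq)            # ~0.000610
print(rel_p + rel_q)     # ~0.000610  (they differ only at the eps_p*eps_q level)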

4. Error Accumulation in Division
Let z = p/q, q ≠ 0, and z* = p*/q*, q* ≠ 0. Then

z = z* + ε_z = (p* + ε_p) / (q* + ε_q)
             = [p* (1 + ε_p/p*)] / [q* (1 + ε_q/q*)]
             = (p*/q*) (1 + ε_p/p*) (1 + ε_q/q*)^{-1}.

Expanding with the help of the binomial theorem and ignoring the product of the errors, being small, we have

z* + ε_z ≈ (p*/q*) (1 + ε_p/p*) (1 − ε_q/q*)
         ≈ p*/q* + ε_p/q* − (p* ε_q)/(q*)^2,
so
ε_z ≈ ε_p/q* − (p* ε_q)/(q*)^2.

Then

ε_z/z* = [ε_p/q* − (p* ε_q)/(q*)^2] / (p*/q*)
       = (q*/p*) [ε_p/q* − (p* ε_q)/(q*)^2]
       = ε_p/p* − ε_q/q*.

We have already supposed that ε_p/p ≪ 1 and ε_q/q ≪ 1; using this,

Er = |ε_z/z| ≈ |ε_p/p − ε_q/q| ≤ |ε_p/p| + |ε_q/q|.

Thus, the relative error of a quotient of two numbers is at most (approximately) the sum of the relative error moduli of the dividend and the divisor, and for n terms

Er = |ε_z/z| ≤ |ε_1/x_1| + |ε_2/x_2| + |ε_3/x_3| + ... + |ε_n/x_n|.

5. Errors of Powers and Roots
Let z = x^n, where n is the power and denotes an integral or a fractional quantity; then

z + ε_z = (x + ε_x)^n = x^n (1 + ε_x/x)^n.

Expanding with the help of the binomial theorem and neglecting the higher powers of ε_x/x, we get

z + ε_z ≈ x^n (1 + n ε_x/x) = x^n + n ε_x x^{n-1},

therefore,

ε_z ≈ n ε_x x^{n-1}.

Then

ε_z/z = n ε_x x^{n-1} / x^n = n ε_x/x,

and

Er = |ε_z/z| = |n| |ε_x| / |x|.

Thus, the relative error modulus of a factor raised to a power is the product of the modulus of the power and the relative error of the factor.
6. Error in Function Evaluation
Let z = f(x); then

z + ε_z = f(x + ε).

Using the Taylor series expansion and neglecting the higher powers of ε, being small, we have

z + ε_z = f(x) + ε f'(x),
or
ε_z = ε f'(x).

Therefore,

Ea = |ε_z| ≈ |ε f'(x)|
Er = |ε_z/z| ≈ |ε f'(x) / f(x)|.

The formula can be extended to any number of terms, e.g., if

z = f(x_1) + f(x_2) + f(x_3) + ... + f(x_n),

then

z + ε_z = f(x_1 + ε_1) + f(x_2 + ε_2) + ... + f(x_n + ε_n),
ε_z = ε_1 f'(x_1) + ε_2 f'(x_2) + ... + ε_n f'(x_n),
Ea = |ε_z| = |ε_1 f'(x_1) + ε_2 f'(x_2) + ... + ε_n f'(x_n)|
   ≤ |ε_1 f'(x_1)| + |ε_2 f'(x_2)| + ... + |ε_n f'(x_n)|,

and

Er = |ε_z/z| = |ε_1 f'(x_1) + ε_2 f'(x_2) + ... + ε_n f'(x_n)| / |f(x_1) + f(x_2) + ... + f(x_n)|
   ≤ [ |ε_1 f'(x_1)| + |ε_2 f'(x_2)| + ... + |ε_n f'(x_n)| ] / |Σ f(x_k)|

or

Er ≤ |ε_1 f'(x_1)| / |Σ f(x_k)| + |ε_2 f'(x_2)| / |Σ f(x_k)| + ... + |ε_n f'(x_n)| / |Σ f(x_k)|.
1.2.10.1 Propagated Error

The local error at any stage of the calculation is propagated throughout the remaining part of the computation, i.e., it appears as error in the succeeding steps of the process due to the occurrence of an earlier error. Propagated error is more subtle than the other errors; such errors are in addition to the local errors. Propagated error is of critical importance. If errors are magnified continuously as the method continues, eventually they will overshadow the true value, destroying its validity; we call such a method unstable. For a stable method (the desirable kind), errors made at early points die out as the method continues. Whenever possible we shall choose methods that are stable. The following definition is used to describe the propagation of error.
Definition:-
Suppose that E(n) represents the growth of an initial error ε after n steps. If |E(n)| ≈ nε, the growth of error is said to be linear. If |E(n)| ≈ k^n ε, the growth of error is called exponential. If k > 1, the exponential error grows without bound as n → ∞, and if 0 < k < 1, the exponential error diminishes to zero as n → ∞.
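A tiny illustration of this definition (our own, with an arbitrary initial error and growth factor):

eps = 1e-6          # initial error
k = 1.5             # growth factor for the exponential case
for n in (1, 10, 20, 30):
    print(n, n * eps, k**n * eps)
# at n = 30: linear growth gives 3.0e-05, while exponential growth has reached ~1.9e-01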

1.2.11 Numerical Cancellation

Accuracy is lost when two nearly equal numbers are subtracted. For example, the
two numbers 9.4157233 and 9.4157227 are each accurate to 8 significant digits, yet
their difference 0.0000006 is accurate to only 1 significant digit. Thus care should
be taken to avoid such subtraction where possible, because this is the major source
of error in floating point operations. This phenomenon is also called subtractive
cancellation.
In multiplying two n-digit numbers on a computer, a product with 2n digits results. Internally, double-length registers are used. The result is truncated to the length of a single register.

1.2.12 Errors in Converting Values

The numbers that are input to a computer are ordinarily base-10 values. Thus the input must be converted to the computer's internal number base, normally base 2. This conversion itself causes some errors.

1.2.13 Machine eps

One important measure in computer arithmetic is how small a difference between two values the computer can recognize. This quantity is termed the machine eps, where eps stands for the Greek letter epsilon. This measure of machine accuracy is standardized by finding the smallest floating-point number that, when added to floating-point 1.000, produces a result different from 1.000. Numbers smaller than eps are effectively zero in the computer.
Peculiar things happen in floating-point arithmetic. For example, adding 0.001 one thousand times may not equal 1.0 exactly. In some instances, multiplying a number by unity does not reproduce the number.
In many computations, changing the order of calculations will produce different
results.
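A common way to estimate the machine eps, together with the repeated-addition surprise mentioned above, is shown in this Python sketch (our own illustration; the values are for IEEE double precision):

eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2
print(eps)              # 2.220446049250313e-16

s = 0.0
for _ in range(1000):
    s += 0.001
print(s == 1.0, s)      # False 1.0000000000000007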

1.2.14 Evaluation of Functions by Series Expansion and Estimation of Errors

Taylor's series is considered a foundation of numerical analysis. It is the most important tool for deriving numerical methods and analyzing errors.

If f(x) is analytic about x = x_0, then f(x) in the neighbourhood of x = x_0 can be exactly represented by the Taylor series, which is the power series

f(x) = f(x_0) + (x − x_0) f'(x_0) + (x − x_0)^2/2! f''(x_0) + (x − x_0)^3/3! f'''(x_0) + (x − x_0)^4/4! f^(4)(x_0) + ...

This series is unique; that is, there is no other power series in (x − x_0) that represents f(x).
In practical applications, the Taylor series has to be truncated after a certain order term because it is impossible to include an infinite number of terms. If the Taylor series is truncated after the Nth term, it is expressed as

f(x) = f(x_0) + h f'(x_0) + h^2/2! f''(x_0) + h^3/3! f'''(x_0) + ... + h^m/m! f^(m)(x_0) + ... + h^N/N! f^(N)(x_0) + O(h^{N+1}),

where h = x − x_0 and O(h^{N+1}) represents the error caused by truncating the terms of order N + 1 and higher. We can also write the above expression as

f(x) = P_N(x) + R_N(x),

where

P_N(x) = f(x_0) + h f'(x_0) + h^2/2! f''(x_0) + h^3/3! f'''(x_0) + ... + h^N/N! f^(N)(x_0)
       = Σ_{k=0}^{N} h^k/k! f^(k)(x_0)

and

R_N(x) = h^{N+1}/(N+1)! f^(N+1)(ξ(x)),    where ξ = x_0 + θh,

and P_N(x) is called the Nth Taylor polynomial for f about x_0 and R_N(x) is called the remainder term (or truncation error) associated with P_N(x).
However, the whole error can be expressed by

O(h^{N+1}) = h^{N+1}/(N+1)! f^(N+1)(x_0 + θh),    0 < θ < 1.

Since θ cannot be found exactly, the error term is often approximated by setting θ = 0:

O(h^{N+1}) ≈ h^{N+1}/(N+1)! f^(N+1)(x_0),

which is the leading term of the truncation terms.
If N = 1, for example, the truncated Taylor series is

f(x) = f(x_0) + h f'(x_0),    h = x − x_0.

Including the effect of the error, it can also be expressed as

f(x) = f(x_0) + h f'(x_0) + O(h^2),

where

O(h^2) ≈ h^2/2! f''(x_0 + θh),    0 < θ < 1.

Example:-
Determine (a) the second and (b) the third Taylor polynomials for f(x) = cos(x) about x_0 = 0, and use these polynomials to approximate cos(0.01).

(a) For N = 2 and x_0 = 0,

P_2(x) = f(0) + (x − 0) f'(0) + (x − 0)^2/2! f''(0) + (x − 0)^3/3! f'''(ξ(x))
       = f(0) + x f'(0) + x^2/2! f''(0) + x^3/3! f'''(ξ(x));

therefore,

cos(x) = cos(0) − x sin(0) − x^2/2! cos(0) + x^3/3! sin(ξ(x))
       = 1 − x^2/2! + x^3/3! sin(ξ(x)),    ξ(x) ∈ (0, x).

With x = 0.01,

cos(0.01) = 1 − (0.01)^2/2! + (0.01)^3/3! sin(ξ(x)),    ξ(x) ∈ (0, 0.01)
          = 1 − 0.0001/2 + (0.000001/6) sin(ξ(x))
          = 1 − 0.00005 + 0.000000166 sin(ξ(x))
          = 0.99995 + 0.166 × 10^{-6} sin(ξ(x)).

Since |sin(ξ(x))| < 1, we have

|cos(0.01) − 0.99995| < 0.166 × 10^{-6}.

From tables we get cos(0.01) = 0.99995000042.

(b) The third Taylor polynomial about x_0 = 0 is

P_3(x) = f(0) + (x − 0) f'(0) + (x − 0)^2/2! f''(0) + (x − 0)^3/3! f'''(0) + (x − 0)^4/4! f^(4)(ξ(x))
       = f(0) + x f'(0) + x^2/2! f''(0) + x^3/3! f'''(0) + x^4/4! f^(4)(ξ(x));

therefore,

cos(x) = cos(0) − x sin(0) − x^2/2! cos(0) + x^3/3! sin(0) + x^4/4! cos(ξ(x))
       = 1 − x^2/2! + x^4/4! cos(ξ(x)),    ξ(x) ∈ (0, x).

With x = 0.01,

cos(0.01) = 1 − (0.01)^2/2! + (0.01)^4/4! cos(ξ(x)),    ξ(x) ∈ (0, 0.01)
          = 1 − 0.0001/2 + (0.00000001/24) cos(ξ(x))
          = 0.99995 + 4.2 × 10^{-10} cos(ξ(x)).

Since −1 < cos(ξ(x)) < 1, we have

|cos(0.01) − 0.99995| < 4.2 × 10^{-10},

which is a better accuracy assurance.
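The bounds above are easy to confirm numerically (our own check, not part of the text):

import math

x = 0.01
p = 1 - x**2 / 2                    # both P2 and P3 give 0.99995 at x = 0.01
print(abs(math.cos(x) - p))         # ~4.17e-10, within the bound 4.2e-10 from part (b)
print(0.166e-6, 4.2e-10)            # the two error bounds derived above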

