
General Preface

During the past decades, giant needs for ever more sophisticated mathematical
models and increasingly complex and extensive computer simulations have arisen.
In this fashion, two indissociable activities, mathematical modeling and computer
simulation, have gained a major status in all aspects of science, technology, and
industry.
In order that these two sciences be established on the safest possible grounds,
mathematical rigor is indispensable. For this reason, two companion sciences,
Numerical Analysis and Scientific Software, have emerged as essential steps for
validating the mathematical models and the computer simulations that are based on
them.
Numerical Analysis is here understood as the part of Mathematics that describes
and analyzes all the numerical schemes that are used on computers; its objective
consists in obtaining a clear, precise, and faithful representation of all the
"information" contained in a mathematical model; as such, it is the natural
extension of more classical tools, such as analytic solutions, special transforms,
functional analysis, as well as stability and asymptotic analysis.

The various volumes comprising the Handbook of Numerical Analysis will
thoroughly cover all the major aspects of Numerical Analysis, by presenting
accessible and in-depth surveys, which include the most recent trends.
More precisely, the Handbook will cover the basic methods of Numerical Analysis,
gathered under the following general headings:
- Solution of Equations in R^n,
- Finite Difference Methods,
- Finite Element Methods,
- Techniques of Scientific Computing,
- Optimization Theory and Systems Science.
It will also cover the numerical solution of actual problems of contemporary interest in
Applied Mathematics, gathered under the following general headings:
- Numerical Methods for Fluids,
- Numerical Methods for Solids,
- Specific Applications.
"Specific Applications" include: Meteorology, Seismology, Petroleum Mechanics,
Celestial Mechanics, etc.


Each heading is covered by several articles, each of which is devoted to
a specialized, but to some extent "independent", topic. Each article contains
a thorough description and a mathematical analysis of the various methods in
actual use, whose practical performances may be illustrated by significant numerical
examples.
Since the Handbook is basically expository in nature, only the most basic results
are usually proved in detail, while less important, or technical, results may be only
stated or commented upon (in which case specific references for their proofs are
systematically provided). In the same spirit, only a "selective" bibliography is
appended whenever the roughest counts indicate that the reference list of an article
should comprise several thousand items if it were to be exhaustive.

Volumes are numbered by capital Roman numerals (as Vol. I, Vol. II, etc.),
according to their chronological appearance.
Since all the articles pertaining to a given heading may not be simultaneously
available at a given time, a given heading usually appears in more than one volume;
for instance, if articles devoted to the heading "Solution of Equations in R^n" appear
in Volumes I and III, these volumes will include "Solution of Equations in R^n (Part
1)" and "Solution of Equations in R^n (Part 2)" in their respective titles. Naturally, all
the headings dealt with within a given volume appear in its title; for instance, the
complete title of Volume I is "Finite Difference Methods (Part 1) - Solution of
Equations in R^n (Part 1)".
Each article is subdivided into sections, which are numbered consecutively
throughout the article by Arabic numerals, as Section 1, Section 2,..., Section 14,
etc. Within a given section,formulas,theorems, remarks, andfigures, have their own
independent numberings; for instance, within Section 14, formulas are numbered
consecutively as (14.1), (14.2), etc., theorems are numbered consecutively as Theorem
14.1, Theorem 14.2, etc. For the sake of clarity, the article is also subdivided into
chapters, numbered consecutively throughout the article by capital Roman numerals;
for instance, Chapter I comprises Sections 1 to 9, Chapter II comprises Sections 10
to 16, etc.

P.G. CIARLET
J.L. LIONS
May 1989
Introduction
G.I. Marchuk

The finite difference method is a universal and efficient numerical method for solving
differential equations. Its intensive development, which began at the end of the 1940s
and the beginning of the 1950s, was stimulated by the need to cope with a number of
complex problems of science and technology. Powerful computers provided an
impetus of paramount importance for the development and application of the finite
difference method which in itself is sufficiently simple in utilization and can be
conveniently realized using computers of different architecture. A large number of
complicated multidimensional problems in electrodynamics, elasticity theory, fluid
mechanics, gas dynamics, theory of particle and radiation transfer, atmosphere and
ocean dynamics, and plasma physics were solved employing the finite difference
techniques.
Numerous spectacular results have been obtained in the theory of finite difference
methods during the last four decades.
In ordinary differential equations, the stability of the main classical finite
difference methods was investigated and the relevant accuracy estimates were
constructed, a large number of new versions of these methods were constructed, and
efficient algorithms were suggested for their realization in a wide field of
applications-oriented problems. The needs of electronics, kinetics, and catalysis
stimulated the development of a broad class of methods for solving stiff systems of
equations. Problems in control theory, biology, and medicine were important for the
progress in finite difference methods of solving delay ordinary differential equations.
In partial differential equations, the achievements of the finite difference method
are even more impressive. Finite difference counterparts of the main differential
operators of mathematical physics were constructed, including those with conser-
vation properties, that is, those obeying the discrete counterparts of the laws of
conservation. An elegant theory of approximation, stability, and convergence of the
finite difference method was constructed.

HANDBOOK OF NUMERICAL ANALYSIS, VOL. I
Finite Difference Methods (Part 1) - Solution of Equations in R^n (Part 1)
Edited by P.G. Ciarlet and J.L. Lions
© 1990, Elsevier Science Publishers B.V. (North-Holland)

The efforts of the specialists in differential equations and numerical mathematics
yielded a convenient and efficient apparatus of the finite difference method,
including spectral analysis, discrete maximum principle, and energy method.
Considerable progress was achieved in the methods on a sequence of grids, including
extrapolation methods. A flood of new results in the theory of the finite element
method greatly stimulated new achievements in the theory of the finite difference
method.
An important stage in the progress of finite difference methods was the
development of the alternating direction implicit method, the fractional steps
method, and the splitting method. The realization of these methods consists in
solving a large number of one-dimensional problems. A considerable number of
versions of this class of methods have been suggested, having high approximation
accuracy and absolute stability. Numerous multidimensional problems in physics,
mechanics, and geophysical hydrodynamics have been solved using these methods.
The theory of the finite difference method is far from having been completed,
especially in the field of nonlinear partial differential equations. Life never ceases to
offer new complex problems, and the method of finite differences remains a powerful
approach to solving them.
It would be impossible in the Handbook of Numerical Analysis to cover even the
basic achievements of the theory. Nevertheless, I do not doubt that its publication
will help make the finite difference method interesting to new people and will
attract new researchers to solving its problems.
Finite Difference
Methods for Linear
Parabolic Equations
Vidar Thomée
Department of Mathematics
Chalmers University of Technology
S-41296 Göteborg, Sweden

Contents

PREFACE

CHAPTER I. Introduction
1. The pure initial value problem
2. The mixed initial boundary value problem

CHAPTER II. The Pure Initial Value Problem
3. Finite difference schemes
4. L2 theory for finite difference schemes with constant coefficients
5. Particular single-step schemes for constant coefficients problems
6. Some multistep difference schemes with constant coefficients
7. John's approach to maximum-norm estimates
8. Stability of difference schemes for general parabolic equations
9. Convergence estimates

CHAPTER III. The Mixed Initial Boundary Value Problem
10. The energy method
11. Monotonicity and maximum principle type arguments
12. Stability analysis by spectral methods
13. Various additional topics

REFERENCES

LIST OF SYMBOLS

SUBJECT INDEX


Preface

This article is devoted to the numerical solution of linear partial differential
equations of parabolic type by means of finite difference methods. The emphasis in
our presentation will be on concepts and basic principles of the analysis, and we shall
therefore often restrict our considerations to model problems insofar as the
choice of the parabolic equation, the regularity of its coefficients, the underlying
geometry, and the boundary conditions are concerned. In the beginning of the article
proofs are provided for the principal results, but as the situation under investigation
requires more technical machinery, the analysis will become more sketchy. The
reader will then be referred to the literature quoted for fuller accounts of both results
and theoretical foundations.
The article is divided into three chapters. The first of these is of an introductory
nature and presents some of the basic problems and concepts of the theory for the
model heat equation in one space dimension. This chapter is subdivided into two
sections, devoted to the pure initial value problem and the mixed initial boundary
value problem, respectively. This division is then the basis for the plan of the rest of
the article, where Chapters II and III treat these two classes of problems in greater
depth and generality. Whereas most of the theory in Chapter II depends on Fourier
analysis, that of Chapter III relies heavily on energy and monotonicity type
arguments.
The finite difference method for partial differential equations has a relatively short
history. After the fundamental theoretical paper by COURANT, FRIEDRICHS and LEWY
[1928] on the solution of the problems of mathematical physics by means of finite
differences, the subject lay dormant till the period of, and immediately following, the
Second World War, when considerable theoretical progress was made, and large
scale practical applications became possible with the aid of computers. In this
context a major role was played by the work of von Neumann, partly reported in
O'BRIEN, HYMAN and KAPLAN [1951]. For parabolic equations a highlight of the
early theory was the celebrated paper by JOHN [1952], which had a great influence
on the subsequent research. The field then had its golden age during the 1950s and
1960s, and major contributions were given by Douglas, Kreiss, Lees, Samarskii,
Widlund and others.
At the end of this period the theory for the pure initial value problem had become
reasonably well developed and complete, and this was essentially also true for mixed
initial boundary value problems in one space dimension. For multidimensional
problems in general domains the situation was less satisfactory, partly because the
finite difference method employs the values of the solution at the points of a uniform


mesh, which does not necessarily fit the domain. This led the development into
a different direction based on variational formulations of the boundary value
problems, and using piecewise polynomial approximating functions on more
flexible partitions of the domain. This approach, the finite element method, was
better suited for complex geometries and many numerical analysts (including the
author of this article) abandoned the classical finite difference method to work with
finite elements. The papers on parabolic equations using finite differences after 1970
are few, particularly in the West.
It should be said, however, that finite elements and finite differences have many
points in common, and that it may be more appropriate to think of the new
development as a continuation of the established theory rather than a break away
from it. In the Russian literature, for instance, finite element methods are often
referred to as variational difference schemes, and variational thinking was, in fact,
used already in the paper by Courant, Friedrichs and Lewy quoted above. The finite
element theory owes much of its present level of development and sophistication to
the foundation provided by the finite difference theory. However, in the present
Handbook, the two subjects are separated, and we shall only very briefly touch upon
their interrelation below.
In our presentation here we shall not discuss techniques for solving the algebraic
linear systems of equations that result from the discretization of the initial boundary
value problems, but refer to other articles of this Handbook concerning such
matters. Neither shall we treat the related area of alternating direction implicit
methods, or fractional step methods, which are designed to reduce the amount of
computation needed in multidimensional, particularly rectangular, domains, and to
which a special article of this volume is devoted.
Several textbooks exist which treat finite difference methods for parabolic
problems, and we refer, in particular, to RICHTMYER and MORTON [1967] and
SAMARSKII and GULIN [1973] for thorough accounts of the field, but also (in
chronological order of publication) to COLLATZ [1955], FORSYTHE and WASOW
[1960], RJABENKI and FILIPPOW [1960], FOX [1962], GODUNOV and RYABENKII
[1964], SAUL'EV [1964], SMITH [1965], BABUSKA, PRAGER and VITASEK [1966],
MITCHELL [1969], and SAMARSKII [1971]. In addition we would like to mention the
survey papers by DOUGLAS [1961a] and THOMÉE [1969]. We have included in our
list of references a large number of original papers, not all of which are quoted in our
text. For treatises on the theory of parabolic differential equations, covering
existence, uniqueness and regularity results such as needed here, we refer to
FRIEDMAN [1964] and LADYZENSKAJA, SOLONNIKOV and URAL'CEVA [1968].
I would like to take this opportunity to thank Chalmers University of Technology
for granting me a reduction of my teaching load while writing this article,
Ann-Britt Karlsson and Yumi Karlsson for typing the manuscript, and
Mohammad Asadzadeh for proofreading the entire work.
CHAPTER I

Introduction

In this first introductory chapter our purpose is to use the simplest possible model
problems to present some basic concepts which are important for the understanding
of the formulation and analysis of finite difference methods for parabolic partial
differential equations. The chapter is subdivided into two sections corresponding to
the two basic problems discussed in the rest of this article, namely the pure initial
value problem and the mixed initial boundary value problem.
In the first section we thus consider the pure initial value problem for the heat
equation in one space dimension. We begin with the simplest example of an explicit
one-step, or two-level, finite difference approximation, discuss its stability with
respect to the maximum norm and relate its formal accuracy to its rate of
convergence to the exact solution. We also present an example of the construction of
a more accurate explicit scheme. We then introduce the application of Fourier
techniques in the analysis of stability, now with respect to the L2-norm, as well as
of accuracy and convergence. We finally touch upon the possibility of using more than
two time levels in our approximations.
Section 2 is devoted to the mixed initial boundary value problem for the same
basic parabolic equation, with Dirichlet type boundary conditions at the endpoints
of a finite interval in the space variable. Here we discuss the possibility and
advantage of using implicit methods, requiring the solution of a linear system of
equations at each time level. Stability and error analysis is carried out for the
simplest such methods, the backward Euler method and the more accurate
Crank-Nicolson method. Again both maximum-norm estimates based on positivity
properties of the difference scheme and l2-norm estimates derived by Fourier
analysis are treated. A brief mention is made of the possibility of extending some
initial boundary value problems to periodic pure initial value problems.
The material in this chapter is standard and we refer to the basic textbooks quoted
in our preface for further details and references.

1. The pure initial value problem


In this first section of our introduction to the solution of parabolic problems by
means of finite difference methods we shall discuss several such methods for the pure
initial value problem for the simple homogeneous heat equation in one space
dimension.

We thus wish to find the solution of the pure initial value problem

    \partial u/\partial t = \partial^2 u/\partial x^2   for x \in R, t \ge 0,
    u(x, 0) = v(x)   for x \in R,                                            (1.1)

where R denotes the real axis and v is a given smooth bounded function. It is well
known that this problem admits a unique solution, many properties of which may be
deduced, for instance, from the representation

    u(x, t) = (4\pi t)^{-1/2} \int_R e^{-(x-y)^2/(4t)} v(y) dy = (E(t)v)(x).

Here we think of the right-hand side as defining the solution operator E(t) of (1.1). In
particular, we may note that this solution operator is bounded, or, more precisely,

    sup_{x \in R} |E(t)v(x)| = sup_{x \in R} |u(x, t)| \le sup_{x \in R} |v(x)|   for t \ge 0.   (1.2)

For the numerical solution of the problem (1.1) by finite differences we introduce
a grid of mesh points (x, t) = (jh, nk), where h and k are mesh parameters which are
small and thought of as tending to zero, and where j and n are integers, n \ge 0. We then
look for an approximate solution of (1.1) at these mesh points, which will be denoted
by U_j^n, by solving a problem in which the derivatives in (1.1) have been replaced by
finite difference quotients. Define thus for functions defined on the grid the forward
and backward difference quotients

    \partial_x U_j = h^{-1}(U_{j+1} - U_j),
    \bar\partial_x U_j = h^{-1}(U_j - U_{j-1}),

and similarly, for instance,

    \partial_t U_j^n = k^{-1}(U_j^{n+1} - U_j^n).

The simplest finite difference equation corresponding to (1.1) is then

    \partial_t U_j^n = \partial_x \bar\partial_x U_j^n   for -\infty < j < \infty, n \ge 0,
    U_j^0 = v_j = v(jh)   for -\infty < j < \infty.

This difference equation may also be written as

    (U_j^{n+1} - U_j^n)/k = (U_{j+1}^n - 2U_j^n + U_{j-1}^n)/h^2,

or, if we set \lambda = k/h^2,

    U_j^{n+1} = \lambda U_{j-1}^n + (1 - 2\lambda)U_j^n + \lambda U_{j+1}^n = (E_{kh} U^n)_j,   (1.3)

where the identity defines a linear operator E_{kh}, the local discrete solution operator.
This scheme is called explicit since it expresses the solution at t = (n + 1)k explicitly in
terms of the values at t = nk. Iterating the operator we find that the solution of the

discrete problem is

    U_j^n = (E_{kh}^n U^0)_j = (E_{kh}^n v)_j.

Assume now that \lambda \le 1/2. The coefficients of the operator E_{kh} in (1.3) are then all
nonnegative, and since their sum is 1 we find

    sup_j |U_j^{n+1}| \le sup_j |U_j^n|.

Setting for mesh functions V = (V_j)

    ||V||_\infty = sup_j |V_j|,

this implies

    ||U^{n+1}||_\infty = ||E_{kh} U^n||_\infty \le ||U^n||_\infty,

and hence by repeated application

    ||U^n||_\infty = ||E_{kh}^n v||_\infty \le ||v||_\infty,   (1.4)

which is a discrete analogue of the estimate (1.2) for the continuous problem.
The boundedness of the discrete solution in terms of the discrete initial data is
referred to as stability. We shall see now that if \lambda is chosen as a constant > 1/2 the
method is unstable. To see this, we may choose v_j = \epsilon(-1)^j, where \epsilon is a small positive
number, so that ||v||_\infty = \epsilon. Then

    U_j^1 = \epsilon[\lambda(-1)^{j-1} + (1 - 2\lambda)(-1)^j + \lambda(-1)^{j+1}]
          = \epsilon(1 - 4\lambda)(-1)^j,

or, more generally,

    U_j^n = \epsilon(1 - 4\lambda)^n (-1)^j,

whence

    ||U^n||_\infty = \epsilon(4\lambda - 1)^n \to \infty   as n \to \infty.

We thus find that in this case, even though the initial data are very small, the discrete
solution tends to infinity in norm as n \to \infty, that is, for instance, at t = nk = 1, when
k = 1/n \to 0. This may be interpreted to mean that very small perturbations of the
initial data (for instance by roundoff errors) may cause such big changes in the
discrete solution at later times that it becomes useless.
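The contrast between \lambda \le 1/2 and \lambda > 1/2 is easy to observe numerically. The following sketch applies the operator E_{kh} of (1.3) to the oscillating data \epsilon(-1)^j; a periodic grid stands in for the infinite mesh, and the grid size, \lambda values, and step count are illustrative choices, not from the text.

```python
import math

def step(U, lam):
    # One application of the local solution operator E_kh from (1.3):
    # U_j^{n+1} = lam*U_{j-1} + (1 - 2*lam)*U_j + lam*U_{j+1},
    # on a periodic grid standing in for the infinite mesh.
    n = len(U)
    return [lam * U[j - 1] + (1 - 2 * lam) * U[j] + lam * U[(j + 1) % n]
            for j in range(n)]

def max_norm(U):
    return max(abs(u) for u in U)

M = 40          # even, so the alternating mode is periodic
eps = 1e-3
V = [eps * (-1) ** j for j in range(M)]   # the oscillating data eps*(-1)^j

for lam, label in [(0.5, "stable"), (0.6, "unstable")]:
    U = V
    for _ in range(50):
        U = step(U, lam)
    # For lam <= 1/2 the maximum norm never grows (estimate (1.4));
    # for lam > 1/2 it grows like (4*lam - 1)^n.
    print(label, max_norm(U))
```

For lam = 0.6 the factor per step is |1 - 4\lambda| = 1.4, so after 50 steps the tiny initial perturbation has grown by several orders of magnitude.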
We now restrict the considerations to the stable case, \lambda \le 1/2, and we shall show
that, provided the initial data and thus the exact solution of (1.1) are smooth
enough, the discrete solution converges to the exact solution as the mesh parameters
tend to zero. In order to demonstrate this, we need to notice that the exact solution
satisfies the difference equation with a small error, which tends to zero with h and k.
In fact, setting u_j^n = u(jh, nk) we have by Taylor series expansion

    \tau_j^n = \partial_t u_j^n - \partial_x \bar\partial_x u_j^n = O(k + h^2) = O(h^2),   if \lambda \le 1/2,

where the constants involved in the order relations will depend on upper bounds for
\partial^2 u/\partial t^2 and \partial^4 u/\partial x^4. The expression \tau_j^n is referred to as the truncation or local
discretization error. We have now the following result. Here and below we denote by
C^{m,n} the set of functions of (x, t) with bounded derivatives of orders at most m and
n with respect to x and t, respectively.

THEOREM 1.1. Assume that u \in C^{4,2} and k/h^2 = \lambda \le 1/2. Then there is a C = C(u, T) such
that

    ||U^n - u^n||_\infty \le Ch^2   for nk \le T.

PROOF. Set z_j^n = U_j^n - u_j^n. Then

    \partial_t z_j^n - \partial_x \bar\partial_x z_j^n = -\tau_j^n,

and hence

    z_j^{n+1} = (E_{kh} z^n)_j - k\tau_j^n.

By repeated application this yields

    z^n = E_{kh}^n z^0 - k \sum_{l=0}^{n-1} E_{kh}^{n-1-l} \tau^l.

Since z_j^0 = U_j^0 - u_j^0 = v_j - v_j = 0 we find, using the stability estimate (1.4),

    ||z^n||_\infty \le k \sum_{l=0}^{n-1} ||\tau^l||_\infty \le Cnkh^2 \le CTh^2,

which is the desired result. []

The method described has first-order accuracy in time and second-order accuracy
in space; since k and h are tied by k/h^2 = \lambda, the total effect is second-order
accuracy in space. We shall discuss how this accuracy may be increased, so as to
obtain a higher-order rate of convergence as h \to 0.
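The second-order rate of Theorem 1.1 can be observed in practice. The sketch below runs the explicit scheme (1.3) with v(x) = sin x on a 2\pi-periodic grid (a convenient stand-in for the pure initial value problem, with exact solution e^{-t} sin x); the choices \lambda = 0.4 and T = 1 are illustrative, not from the text.

```python
import math

def heat_error(M, lam, T):
    """Run the explicit scheme (1.3) for u_t = u_xx with v(x) = sin x on
    a 2*pi-periodic grid of M points and return the maximum-norm error
    against the exact solution e^{-t} sin x at the final time step."""
    h = 2 * math.pi / M
    k = lam * h * h          # lambda = k/h^2 held fixed
    n_steps = round(T / k)
    U = [math.sin(j * h) for j in range(M)]
    for _ in range(n_steps):
        U = [lam * U[j - 1] + (1 - 2 * lam) * U[j] + lam * U[(j + 1) % M]
             for j in range(M)]
    t = n_steps * k
    return max(abs(U[j] - math.exp(-t) * math.sin(j * h)) for j in range(M))

# Halving h (with lambda fixed) should divide the error by about 4.
e1 = heat_error(20, 0.4, 1.0)
e2 = heat_error(40, 0.4, 1.0)
print(e1 / e2)  # ratio close to 4: second-order convergence
```

Note that k is refined together with h through the fixed ratio \lambda = k/h^2, exactly as in the theorem.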
Our abovementioned method may be written in the form

    U_j^{n+1} = U_j^n + k \partial_x \bar\partial_x U_j^n,

which may be thought of as resulting from the expansion

    u_j^{n+1} = u(jh, (n+1)k) = u_j^n + k(\partial u/\partial t)_j^n + O(k^2)
              = u_j^n + k(\partial^2 u/\partial x^2)_j^n + O(k^2)
              = u_j^n + k \partial_x \bar\partial_x u_j^n + kO(k + h^2).

Using one more term in the Taylor series, we find (omitting the subscript j)

    u^{n+1} = u^n + k(\partial u/\partial t)^n + \tfrac12 k^2 (\partial^2 u/\partial t^2)^n + O(k^3)
            = u^n + k(\partial^2 u/\partial x^2)^n + \tfrac12 k^2 (\partial^4 u/\partial x^4)^n + O(k^3).

We shall now approximate \partial^2 u/\partial x^2 to fourth order and \partial^4 u/\partial x^4 to second order. We
have

    \partial^4 u/\partial x^4 = (\partial_x \bar\partial_x)^2 u + O(h^2),

and noting that (now with the superscript n left out)

    \partial_x \bar\partial_x u_j = (u_{j+1} - 2u_j + u_{j-1})/h^2
                                  = (\partial^2 u/\partial x^2)_j + \tfrac{1}{12} h^2 (\partial^4 u/\partial x^4)_j + O(h^4),

also

    \partial^2 u/\partial x^2 = \partial_x \bar\partial_x u - \tfrac{1}{12} h^2 (\partial_x \bar\partial_x)^2 u + O(h^4).

These considerations suggest the method

    U_j^{n+1} = U_j^n + k(\partial_x \bar\partial_x U_j^n - \tfrac{1}{12} h^2 (\partial_x \bar\partial_x)^2 U_j^n) + \tfrac12 k^2 (\partial_x \bar\partial_x)^2 U_j^n,

or, after simple calculations,

    U_j^{n+1} = \lambda(\tfrac12 \lambda - \tfrac{1}{12})(U_{j-2}^n + U_{j+2}^n) + \tfrac23 \lambda(2 - 3\lambda)(U_{j-1}^n + U_{j+1}^n)
                + (1 - \tfrac52 \lambda + 3\lambda^2)U_j^n
              = (E_{kh} U^n)_j.   (1.5)
We see that the present operator E_{kh} has nonnegative coefficients which add up to
1 if 1/6 \le \lambda \le 1/3, and as above we may conclude that the method is then stable, or

    ||U^n||_\infty \le ||v||_\infty.
The above considerations show

    u_j^{n+1} = (E_{kh} u^n)_j + k\tau_j^n,

where now, if \lambda = k/h^2 = constant,

    ||\tau^n||_\infty \le C(u)(h^4 + kh^2 + k^2) \le C(u)h^4,

with C(u) depending on bounds for \partial^3 u/\partial t^3 and \partial^6 u/\partial x^6. Using the method of proof
of Theorem 1.1 we may conclude

    ||U^n - u^n||_\infty \le C(u, T)h^4   for nk \le T.
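The fourth-order rate can likewise be checked by experiment. The sketch below runs scheme (1.5), again with v(x) = sin x on a 2\pi-periodic grid as a stand-in for the pure initial value problem; \lambda = 0.25 and T = 1 are arbitrary illustrative values (chosen inside the stability range discussed later).

```python
import math

def fourth_order_error(M, lam, T):
    """Run scheme (1.5) for u_t = u_xx with v(x) = sin x on a 2*pi-periodic
    grid and return the maximum-norm error against the exact solution
    e^{-t} sin x at the final time step."""
    h = 2 * math.pi / M
    k = lam * h * h
    a2 = lam * (0.5 * lam - 1.0 / 12)        # multiplies U_{j-2} and U_{j+2}
    a1 = (2.0 / 3.0) * lam * (2 - 3 * lam)   # multiplies U_{j-1} and U_{j+1}
    a0 = 1 - 2.5 * lam + 3 * lam * lam       # multiplies U_j
    n_steps = round(T / k)
    U = [math.sin(j * h) for j in range(M)]
    for _ in range(n_steps):
        U = [a2 * (U[j - 2] + U[(j + 2) % M])
             + a1 * (U[j - 1] + U[(j + 1) % M]) + a0 * U[j]
             for j in range(M)]
    t = n_steps * k
    return max(abs(U[j] - math.exp(-t) * math.sin(j * h)) for j in range(M))

# Halving h (with lambda fixed) should divide the error by about 2^4 = 16.
e1 = fourth_order_error(20, 0.25, 1.0)
e2 = fourth_order_error(40, 0.25, 1.0)
print(e1 / e2)  # ratio close to 16: fourth-order convergence
```

The errors themselves are several orders of magnitude smaller than for the second-order scheme on the same grids.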
More generally we may consider finite difference operators of the form

    U^{n+1} = E_k U^n,   (E_k U)_j = \sum_p a_p U_{j-p},   (1.6)

where a_p = a_p(\lambda), \lambda = k/h^2, and where the sum is finite. Note that since we consider
h and k tied together by the relation k/h^2 = \lambda = constant, we have now omitted the
dependence on h in the notation. One may associate with this operator the
trigonometric polynomial

    E(\xi) = \sum_p a_p e^{-ip\xi}.   (1.7)

This polynomial is relevant to the stability analysis and is called the symbol or
characteristic polynomial of E_k. We find at once the following result.

THEOREM 1.2. A necessary condition for stability of the operator E_k in (1.6) in the above
sense is that

    |E(\xi)| \le 1   for all \xi \in R.   (1.8)

PROOF. Assume that |E(\xi_0)| > 1 for some \xi_0 \in R. Take v_j = e^{ij\xi_0}. Then

    U_j^1 = \sum_p a_p e^{i(j-p)\xi_0} = E(\xi_0)v_j,

and hence, by repeated application,

    ||U^n||_\infty = |E(\xi_0)|^n \to \infty   as n \to \infty,

which proves the theorem. []

The condition (1.8) is a special case of what is referred to as von Neumann's
condition, and it was first proposed in O'BRIEN, HYMAN and KAPLAN [1951].
We shall now see that in a slightly different setting this is also sufficient for stability.
The symbol is particularly suited for investigating stability in the framework of
Fourier analysis. It is then most convenient to use the l2-norm to measure the mesh
functions. Let thus V = (V_j)_{j=-\infty}^{\infty} be a mesh function in the space variable and set

    ||V||_{2,h} = (h \sum_j |V_j|^2)^{1/2}.

The set of mesh functions thus normed will be denoted l_{2,h} below. Let us also define,
for such a mesh function, its discrete Fourier transform

    \hat V(\xi) = h \sum_j V_j e^{-ij\xi},

where we assume that the sum is absolutely convergent. Recall the Parseval relation

    ||V||_{2,h}^2 = (2\pi h)^{-1} \int_{-\pi}^{\pi} |\hat V(\xi)|^2 d\xi
                  = (2\pi)^{-1} \int_{-\pi/h}^{\pi/h} |\hat V(h\xi)|^2 d\xi.   (1.9)

We may now define stability with respect to the norm ||.||_{2,h}, or in l_{2,h}, to mean,
analogously to above,

    ||E_k^n V||_{2,h} \le C ||V||_{2,h}   for 0 \le nk \le T,   (1.10)

and find the following.

THEOREM 1.3. The von Neumann condition |E(\xi)| \le 1 is a necessary and sufficient
condition for stability of the operator E_k in l_{2,h}.

PROOF. We note that

    (E_k V)^(\xi) = h \sum_{j,p} a_p V_{j-p} e^{-ij\xi}
                  = \sum_p a_p e^{-ip\xi} \cdot h \sum_j V_{j-p} e^{-i(j-p)\xi}
                  = E(\xi)\hat V(\xi).

Hence

    (E_k^n V)^(\xi) = E(\xi)^n \hat V(\xi),

and using the Parseval relation (1.9), the stability of E_k in l_{2,h} is equivalent to

    \int_{-\pi}^{\pi} |E(\xi)^n|^2 |\hat V(\xi)|^2 d\xi \le C^2 \int_{-\pi}^{\pi} |\hat V(\xi)|^2 d\xi,   nk \le T,

for all permissible V. But this is easily seen to hold exactly if

    |E(\xi)^n| \le C,   n \ge 0,

which is equivalent to (1.8) (and thus with C = 1 in (1.10)). []
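For the simple scheme (1.3) the symbol works out to E(\xi) = \lambda e^{i\xi} + (1 - 2\lambda) + \lambda e^{-i\xi} = 1 - 4\lambda sin^2(\xi/2), and the von Neumann condition can be checked mechanically. A minimal sketch (the grid-based check is an illustration, not a proof):

```python
import math

def symbol_13(lam, xi):
    # Symbol (1.7) of the explicit scheme (1.3); real since the scheme
    # is symmetric: E(xi) = 1 - 4*lam*sin^2(xi/2).
    return 1 - 4 * lam * math.sin(xi / 2) ** 2

def satisfies_von_neumann(lam, n=1000):
    # Check condition (1.8), |E(xi)| <= 1, on a fine grid of xi in [-pi, pi]
    # (2*pi-periodicity makes this interval sufficient).
    return all(abs(symbol_13(lam, -math.pi + 2 * math.pi * i / n)) <= 1 + 1e-12
               for i in range(n + 1))

print(satisfies_von_neumann(0.5))   # True:  lam <= 1/2
print(satisfies_von_neumann(0.51))  # False: |E(pi)| = |1 - 4*lam| > 1
```

The worst frequency is \xi = \pi, where E(\pi) = 1 - 4\lambda, recovering the familiar threshold \lambda \le 1/2.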

The convergence analysis may also be carried out in ||.||_{2,h}. We introduce as
before the truncation error

    \tau_j^n = k^{-1}(u_j^{n+1} - (E_k u^n)_j),

and say that the difference scheme is accurate of order \mu (assuming \lambda = k/h^2 =
constant) if, formally, \tau^n = O(h^\mu) as h \to 0. Under the appropriate assumptions
concerning the smoothness and asymptotic behavior for large |x| of the exact
solution, this assumption may be translated into an estimate of the form

    ||\tau^n||_{2,h} \le C(u)h^\mu,   (1.11)

and we may show the following error estimate.

THEOREM 1.4. Assume that E_k is accurate of order \mu and stable in l_{2,h}. Then under the
appropriate regularity assumptions on the exact solution of (1.1) we have

    ||U^n - u^n||_{2,h} \le C(u, T)h^\mu   for nk \le T.

PROOF. As above we have, for z_j^n = U_j^n - u_j^n,

    z^n = -k \sum_{l=0}^{n-1} E_k^{n-1-l} \tau^l,

so that by (1.10) and (1.11),

    ||z^n||_{2,h} \le k \sum_{l=0}^{n-1} ||E_k^{n-1-l} \tau^l||_{2,h} \le Ck \sum_{l=0}^{n-1} ||\tau^l||_{2,h} \le Cnkh^\mu \le CTh^\mu,

which proves the theorem. []

Consider again the fourth-order scheme (1.5). Recall that the stability with respect
to the maximum norm was deduced above for 1/6 \le \lambda \le 1/3. Here

    E(\xi) = 2(\tfrac12\lambda^2 - \tfrac{1}{12}\lambda)\cos 2\xi + 2(\tfrac43\lambda - 2\lambda^2)\cos\xi + 1 - \tfrac52\lambda + 3\lambda^2.

We shall show that von Neumann's condition is satisfied for all \lambda with \lambda \le 2/3. First, for
1/6 \le \lambda \le 1/3 we have, by the positivity of the coefficients,

    |E(\xi)| \le 2(\tfrac12\lambda^2 - \tfrac{1}{12}\lambda) + 2(\tfrac43\lambda - 2\lambda^2) + 1 - \tfrac52\lambda + 3\lambda^2 = 1.

Consider now the remaining \lambda \le 2/3. We have

    E'(\xi) = -2(2\lambda^2 - \tfrac13\lambda)\sin\xi\cos\xi - 2(\tfrac43\lambda - 2\lambda^2)\sin\xi
            = 4\lambda\{(\tfrac16 - \lambda)\cos\xi + \lambda - \tfrac23\}\sin\xi,

and for \lambda \le 5/12 the bracket is negative for 0 < \xi < \pi, so that E(\xi) has a maximum at
\xi = 0 with E(0) = 1 and a minimum at \xi = \pi with E(\pi) = 8\lambda^2 - \tfrac{16}{3}\lambda + 1 > -1; for
5/12 < \lambda \le 2/3 an elementary calculation shows that the interior minimum of E(\xi) also lies
above -1. It follows that |E(\xi)| \le 1 for all real \xi if \lambda \le 2/3. Since E(\pi) = 1 + 8\lambda(\lambda - \tfrac23) > 1
if \lambda > 2/3, the von Neumann condition is not satisfied then.

Thus, in particular, the present method is stable in l_{2,h} for \lambda \le 2/3 and unstable in
both l_{2,h} and the maximum norm if \lambda > 2/3.

We note that if \lambda < 2/3, we have strict inequality for \xi \ne 2n\pi, or

    |E(\xi)| < 1   for 0 < |\xi| \le \pi.

Schemes with this property are called parabolic (in the sense of John) and will be
studied below. In particular they are maximum-norm stable, and the case \lambda < 1/6 of
our present scheme will thus provide an example of a maximum-norm stable scheme
for which not all coefficients are positive.
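The threshold \lambda = 2/3 can be confirmed by evaluating the symbol of (1.5) on a fine grid of frequencies. A sketch (the grid resolution and the sampled \lambda values are arbitrary choices):

```python
import math

def symbol_15(lam, xi):
    # Symbol of the fourth-order scheme (1.5); real since the scheme is symmetric
    a2 = 0.5 * lam * lam - lam / 12
    a1 = (4.0 / 3.0) * lam - 2 * lam * lam
    a0 = 1 - 2.5 * lam + 3 * lam * lam
    return 2 * a2 * math.cos(2 * xi) + 2 * a1 * math.cos(xi) + a0

def max_abs_symbol(lam, n=2000):
    # By symmetry and 2*pi-periodicity it suffices to sample xi in [0, pi]
    return max(abs(symbol_15(lam, math.pi * i / n)) for i in range(n + 1))

# von Neumann's condition holds up to lam = 2/3 and fails beyond it:
for lam in (0.1, 1 / 3, 0.5, 2 / 3, 0.7):
    print(lam, max_abs_symbol(lam))
```

For \lambda \le 2/3 the printed maximum is 1 (attained at \xi = 0, since E(0) = 1 always), while for \lambda = 0.7 the value E(\pi) = 1 + 8\lambda(\lambda - 2/3) exceeds 1.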
Let us return to the model problem (1.1) and an explicit finite difference method

    U_j^{n+1} = (E_k U^n)_j = \sum_p a_p U_{j-p}^n.

It is sometimes convenient to consider the functions of the space variable x to be
defined not just at the mesh points, but for all x, so that we are given an initial
function U^0(x) = v(x) and seek an approximate solution U^n(x) at t = nk, n = 1, 2, ...,
from

    U^{n+1}(x) = (E_k U^n)(x) = \sum_p a_p U^n(x - ph),   a_p = a_p(\lambda),   \lambda = k/h^2.   (1.12)

One advantage of this point of view, which is taken in a large part of the literature, is
that all U^n then lie in the same function space, independently of h, for instance L_2(R),
L_\infty(R) or the space C(R) of bounded continuous functions on R.
Let us briefly consider the situation in which the analysis takes place in L_2 = L_2(R)
and set

    ||v||_2 = (\int_R |v(x)|^2 dx)^{1/2}.

We shall then use the Fourier transform defined by

    \hat v(\xi) = \int_R v(x) e^{-ix\xi} dx,

and note that here, with E(\xi) defined by (1.7),

    (E_k v)^(\xi) = E(h\xi)\hat v(\xi).

Recalling Parseval's relation,

    ||v||_2^2 = (2\pi)^{-1} ||\hat v||_2^2,

we thus find

    ||U^n||_2 = (2\pi)^{-1/2} ||E(h.)^n \hat v||_2 \le \sup_{\xi \in R} |E(h\xi)|^n ||v||_2,

and that stability with respect to L_2 holds if and only if

    \sup_{\xi \in R} |E(h\xi)^n| \le C,

which is equivalent to von Neumann's condition (1.8).


The convergence may now be expressed in L_2 and we have, for instance, the
following result:

THEOREM 1.5. Assume E_k is defined by (1.12) with \lambda = k/h^2 constant and is accurate of
order \mu and stable in L_2. Then, if the exact solution is appropriately regular, we have,
with C = C(u, T),

    ||U^n - u^n||_2 \le Ch^\mu   for nk \le T.

PROOF. The fact that E_k is accurate of order \mu may be expressed by saying that for
any choice of a smooth solution of (1.1),

    u(x, t + k) - \sum_p a_p u(x - ph, t) = kO(h^\mu) = O(h^{\mu+2})   as k \to 0.

By Taylor series expansion this is seen to be equivalent to the following relations
between the coefficients:

    \sum_p p^{2j+1} a_p = 0   for 0 \le 2j + 1 < \mu + 2,

and

    \sum_p p^{2j} a_p = \frac{(2j)!}{j!} \lambda^j   for 0 \le 2j < \mu + 2.

On the other hand, these relations are, again by Taylor series expansions, easily seen
to be equivalent to

    E(\xi) = e^{-\lambda\xi^2} + O(|\xi|^{\mu+2})   as \xi \to 0,

or, since E(\xi) is bounded on R,

    |E(\xi) - e^{-\lambda\xi^2}| \le C|\xi|^{\mu+2}   for \xi \in R.
By stability it follows that

    |E(\xi)^n - e^{-n\lambda\xi^2}| = |\sum_{j=0}^{n-1} E(\xi)^{n-1-j}(E(\xi) - e^{-\lambda\xi^2}) e^{-j\lambda\xi^2}|
                                   \le Cn|\xi|^{\mu+2}.   (1.13)

Now, by Fourier transformation of (1.1), we have for

    \hat u(\xi, t) = \int_R u(x, t) e^{-ix\xi} dx

that

    d\hat u(\xi, t)/dt = -\xi^2 \hat u(\xi, t)   for t \ge 0,
    \hat u(\xi, 0) = \hat v(\xi),

and hence

    \hat u(\xi, t) = e^{-t\xi^2} \hat v(\xi).

We conclude

    (U^n - u^n)^(\xi) = (E(h\xi)^n - e^{-nk\xi^2})\hat v(\xi),

and hence

    ||U^n - u^n||_2^2 = (2\pi)^{-1} ||(E(h\xi)^n - e^{-nk\xi^2})\hat v||_2^2.

Now by (1.13),

    |E(h\xi)^n - e^{-nk\xi^2}| \le Cnh^{\mu+2}|\xi|^{\mu+2},

so that, using the fact that (dv/dx)^ = i\xi\hat v,

    ||U^n - u^n||_2 \le Cnh^{\mu+2} ||(d/dx)^{\mu+2} v||_2 \le Cnkh^\mu ||(d/dx)^{\mu+2} v||_2.

This shows the conclusion of the theorem under the assumption that the initial data
are such that (d/dx)^{\mu+2} v belongs to L_2. In fact, by a more precise computation one
may reduce this regularity requirement by almost two derivatives, as we shall
describe in Section 4 below. []
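The coefficient relations used in this proof — odd moments of the coefficients vanish, and the even moments satisfy \sum_p p^{2j} a_p = (2j)! \lambda^j / j! up to the required order — can be verified numerically for the schemes (1.3) and (1.5). In the sketch below the coefficient values come from those schemes and \lambda = 0.4 is an arbitrary test value.

```python
import math

def moment(a, m):
    # m-th moment sum_p p^m * a_p of a difference operator's coefficients
    return sum(p ** m * c for p, c in a.items())

lam = 0.4
scheme_13 = {-1: lam, 0: 1 - 2 * lam, 1: lam}            # scheme (1.3)
a2 = 0.5 * lam * lam - lam / 12
a1 = (4.0 / 3.0) * lam - 2 * lam * lam
scheme_15 = {-2: a2, -1: a1, 0: 1 - 2.5 * lam + 3 * lam * lam,
             1: a1, 2: a2}                               # scheme (1.5)

def accurate_of_order(a, mu):
    # Accuracy of order mu: every moment of order m < mu + 2 must vanish
    # for odd m and equal (2j)! * lam^j / j! for even m = 2j.
    for m in range(mu + 2):
        want = (math.factorial(m) * lam ** (m // 2) / math.factorial(m // 2)
                if m % 2 == 0 else 0.0)
        if abs(moment(a, m) - want) > 1e-12:
            return False
    return True

print(accurate_of_order(scheme_13, 2))  # True:  (1.3) is second-order
print(accurate_of_order(scheme_15, 4))  # True:  (1.5) is fourth-order
print(accurate_of_order(scheme_13, 4))  # False: (1.3) is not fourth-order
```

The failing fourth moment of (1.3) is exactly the defect removed by the extra terms in (1.5).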

In the above discussion we have only considered finite difference schemes of
one-step (or two-level) type, i.e. schemes that use the values at time t = nk to
compute the approximate solution at t = (n+1)k. It would also be natural to replace
the derivatives in the model heat equation (1.1) by difference quotients in a
symmetric fashion around (x, nk), which would result in the equation

    (2k)^{-1}(U^{n+1}(x) - U^{n-1}(x)) = \partial_x \bar\partial_x U^n(x).   (1.14)

In this case, with U^0 = v, we also need to prescribe U^1 (presumably by some
approximation of u(., k)) in order to be able to use (1.14) to find U^n for n > 1. Such
a two-step or three-level scheme would formally be accurate of second order in both
x and t. Although, as we shall see in Section 3 below, the particular scheme (1.14)
turns out to be unstable for any combination of h and k, other multistep schemes are
useful in applications.

2. The mixed initial boundary value problem

In physical situations our above pure initial value model problem (1.1) is generally
not adequate, and instead it is required to solve the parabolic equation on a finite
interval with boundary values given at the endpoints of the interval for positive time.
We shall therefore have reason to consider the following model problem:

∂u/∂t = ∂²u/∂x² for x ∈ [0,1], t ≥ 0,
u(0, t) = u(1, t) = 0 for t > 0,   (2.1)
u(x, 0) = v(x) for x ∈ [0,1].
For the approximate solution we may again cover the domain with a grid of
mesh points, this time by dividing [0, 1] into subintervals of equal length h = 1/M,
where M is a positive integer, and consider the mesh points (jh, nk) with j = 0, ..., M
and n = 0, 1, .... Letting as above U_j^n denote the approximate solution at (jh, nk), the
22 V. Thomée CHAPTER I

natural explicit finite difference scheme is now


∂_t U_j^n = ∂_x∂̄_x U_j^n for j = 1, ..., M−1, n ≥ 0,
U_0^n = U_M^n = 0 for n ≥ 0,   (2.2)
U_j^0 = V_j = v(jh) for j = 0, ..., M,

or, for U_j^n, j = 0, ..., M, given,

U_j^{n+1} = λ(U_{j−1}^n + U_{j+1}^n) + (1 − 2λ)U_j^n, j = 1, ..., M−1,

U_0^{n+1} = U_M^{n+1} = 0.
This scheme is referred to as the forward Euler scheme.
This time we are looking for a sequence of vectors U^n = (U_0^n, ..., U_M^n)^T with
U_0^n = U_M^n = 0, which we first norm by

‖U‖_{∞,h} = max_j |U_j|.

When λ = k/h² ≤ ½ we conclude, as before, that

‖U^{n+1}‖_{∞,h} ≤ ‖U^n‖_{∞,h},

or, defining the local solution operator E_{kh} in the obvious way,

‖E_{kh}^n v‖_{∞,h} ≤ ‖v‖_{∞,h}, n ≥ 0,

so that the scheme is maximum-norm stable for λ ≤ ½.
Here, in order to see that this condition is also necessary for stability, we modify
our above example so as to incorporate the boundary conditions and set

U_j^0 = V_j = (−1)^j sin πjh, j = 0, ..., M.

Then, by a simple calculation,

U_j^n = (1 − 2λ − 2λ cos πh)^n V_j, j = 0, ..., M.

If λ > ½ we have for h sufficiently small

|1 − 2λ − 2λ cos πh| ≥ γ > 1,

and hence, if nk = 1, say,

‖U^n‖_{∞,h} ≥ γ^n‖V‖_{∞,h} → ∞ as h, k → 0.
In the presence of stability we may define the local discretization or truncation
error τ and show convergence in exactly the same way as before, and obtain the
following theorem:

THEOREM 2.1. Assume that u ∈ C^{4,2} and that U^n is the solution of the forward Euler
scheme (2.2) with λ ≤ ½. Then there is a constant C = C(u, T) such that

‖U^n − u^n‖_{∞,h} ≤ Ch² for nk ≤ T.
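The stability threshold λ ≤ ½ and the O(h²) error bound of Theorem 2.1 can both be observed in a small experiment. The Python sketch below (the mesh sizes and final time T = 0.1 are arbitrary choices) uses the smooth data v(x) = sin πx, for which the exact solution is u(x, t) = e^{−π²t} sin πx, and then feeds the oscillating data V_j = (−1)^j sin πjh from the instability example above into the same routine with λ > ½.

```python
import numpy as np

def forward_euler(v, lam, nsteps):
    # Forward Euler for u_t = u_xx on [0,1] with u(0,t) = u(1,t) = 0.
    # v holds the initial values at the mesh points jh, j = 0,...,M.
    U = v.copy()
    for _ in range(nsteps):
        U[1:-1] = lam * (U[:-2] + U[2:]) + (1 - 2 * lam) * U[1:-1]
    return U

M, T = 40, 0.1
h = 1.0 / M
x = np.linspace(0.0, 1.0, M + 1)

# Stable case lam = 1/2: O(h^2) error against u = exp(-pi^2 t) sin(pi x).
lam = 0.5
n = round(T / (lam * h * h))
U = forward_euler(np.sin(np.pi * x), lam, n)
err = np.max(np.abs(U - np.exp(-np.pi**2 * T) * np.sin(np.pi * x)))

# Unstable case lam > 1/2, with the data V_j = (-1)^j sin(pi j h).
lam = 0.55
V = (-1.0) ** np.arange(M + 1) * np.sin(np.pi * x)
growth = np.max(np.abs(forward_euler(V, lam, 200)))
print(err, growth)   # err is small; growth is enormous
```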



Let us note that a method of the form

U_j^{n+1} = Σ_p a_p U_{j−p}^n, j = 1, ..., M−1,

is not suitable for the present problem if a_p ≠ 0 for some |p| > 1, since then for some
interior mesh point of [0,1] the equation uses mesh points outside this interval. In
such a case the equation has to be modified near the endpoints, which significantly
complicates the analysis.
The stability requirement k ≤ ½h² used in the forward Euler method above is quite
restrictive in practice and it would be desirable to be able to use h and k, for instance,
of the same order of magnitude. For this purpose one may define an approximate
solution, instead as above, implicitly by the following equation, which is referred to
as the backward Euler scheme and which was first proposed in LAASONEN [1949]:

∂̄_t U_j^n = ∂_x∂̄_x U_j^n, j = 1, ..., M−1, n > 0,
U_0^{n+1} = U_M^{n+1} = 0, n ≥ 0,   (2.3)
U_j^0 = V_j = v(jh), j = 0, ..., M.
For U^n given this defines U^{n+1} by means of the linear system of equations

(1 + 2λ)U_j^{n+1} − λ(U_{j−1}^{n+1} + U_{j+1}^{n+1}) = U_j^n, j = 1, ..., M−1,

U_0^{n+1} = U_M^{n+1} = 0.
In matrix notation it may be written as

BU^{n+1} = U^n,

where U^{n+1} and U^n are thought of as vectors with (M−1) components and B is the
diagonally dominant, symmetric, tridiagonal matrix

B = [ 1+2λ   −λ     0    ...    0  ]
    [  −λ   1+2λ   −λ    ...    0  ]
    [   .     .     .           .  ]
    [   0    ...   −λ   1+2λ ].
" +
Clearly this system may easily be solved for U '. Introducing the finite-
dimensional space I° of(M + 1)-vectors {Vj} with Vo = Vm = 0 and the operator Bkh
on 1h° defined by

Bkhj=(l+2A)Vj-A(Vj +Vj+l)=Vj-kxxVj, j=l ... M-1,


the above system may also be written
n
BkhUn+l = U ,

or

U^{n+1} = B_{kh}^{−1}U^n = E_{kh}U^n.

We shall show now that this method is stable in maximum-norm without any
restrictions on k and h. In fact, with λ arbitrary, we have

‖U^{n+1}‖_{∞,h} ≤ ‖U^n‖_{∞,h}, n ≥ 0.   (2.4)


For, with suitable j₀,

‖U^{n+1}‖_{∞,h} = |U_{j₀}^{n+1}|
≤ (1/(1+2λ))(λ(|U_{j₀−1}^{n+1}| + |U_{j₀+1}^{n+1}|) + |U_{j₀}^n|)
≤ (2λ/(1+2λ))‖U^{n+1}‖_{∞,h} + (1/(1+2λ))‖U^n‖_{∞,h},

from which (2.4) follows at once. This may also be expressed by saying that

‖E_{kh}V‖_{∞,h} = ‖B_{kh}^{−1}V‖_{∞,h} ≤ ‖V‖_{∞,h}.
The corresponding local solution operator E_{kh} is thus stable in maximum norm
and convergence may also be proved. This time

τ_j^n = ∂̄_t u_j^{n+1} − ∂_x∂̄_x u_j^{n+1} = O(k + h²) as k, h → 0, for j = 1, ..., M−1,

where, since h and k are unrelated, the latter expression does not reduce to O(h²). As
a consequence the convergence result now reads as follows:

THEOREM 2.2. Let u ∈ C^{4,2} and let U^n be the solution of the backward Euler scheme
(2.3). Then, with C = C(u, T), we have

‖U^n − u^n‖_{∞,h} ≤ C(k + h²) for nk ≤ T.

PROOF. We may write, for z^n = U^n − u^n,

B_{kh}z^{n+1} = B_{kh}U^{n+1} − B_{kh}u^{n+1} = U^n − (u^{n+1} − k∂_x∂̄_x u^{n+1})
= U^n − u^n − kτ^n = z^n − kτ^n,

where we consider τ^n to be an element of l_h⁰, or

z^{n+1} = E_{kh}z^n − kE_{kh}τ^n.

Hence,

z^n = −k Σ_{l=0}^{n−1} E_{kh}^{n−l} τ^l,

and

‖z^n‖_{∞,h} ≤ k Σ_{l=0}^{n−1} ‖τ^l‖_{∞,h} ≤ Cnk(k + h²),

which concludes the proof. □



The above convergence result for the backward Euler method is satisfactory in
that it requires no restriction on the mesh ratio λ = k/h². On the other hand, since it is
only first-order accurate in time, the error in the time discretization will dominate
unless k is chosen much smaller than h. It would thus be desirable to find a method
which is second-order accurate also in time. Such a method is the following, which is
credited to CRANK and NICOLSON [1947].
In order to have second-order accuracy in both space and time we will use
symmetry around the point (jh, (n+½)k) and pose thus, for U^n given in l_h⁰, the
problem

∂_t U_j^n = ∂_x∂̄_x(½(U^{n+1} + U^n))_j for j = 1, ..., M−1,
(2.5)
U_0^{n+1} = U_M^{n+1} = 0.
The first equation may also be written

(I − ½k∂_x∂̄_x)U_j^{n+1} = (I + ½k∂_x∂̄_x)U_j^n,

or

(1 + λ)U_j^{n+1} − ½λ(U_{j−1}^{n+1} + U_{j+1}^{n+1}) = (1 − λ)U_j^n + ½λ(U_{j−1}^n + U_{j+1}^n),

and, in matrix form,

BU^{n+1} = AU^n,

where now both A and B are symmetric tridiagonal matrices, with B diagonally
dominant,

B = [ 1+λ   −½λ    0    ...    0  ]
    [ −½λ   1+λ   −½λ   ...    0  ]
    [   .     .     .           .  ]
    [   0    ...   −½λ   1+λ ],

A = [ 1−λ    ½λ    0    ...    0  ]
    [  ½λ   1−λ    ½λ   ...    0  ]
    [   .     .     .           .  ]
    [   0    ...    ½λ   1−λ ].
With the obvious notation we also have

B_{kh}U^{n+1} = A_{kh}U^n,

or

U^{n+1} = B_{kh}^{−1}A_{kh}U^n = E_{kh}U^n,

where, similarly to above,

‖B_{kh}^{−1}V‖_{∞,h} ≤ ‖V‖_{∞,h}.


The same approach to the stability as for the backward Euler method gives, for
λ ≤ 1, as the coefficients on the right are then nonnegative,

(1 + λ)‖U^{n+1}‖_{∞,h} ≤ λ‖U^{n+1}‖_{∞,h} + ‖U^n‖_{∞,h},

or

‖U^{n+1}‖_{∞,h} ≤ ‖U^n‖_{∞,h},

which is stability. However, if λ > 1, which is the interesting case if we want to be able
to take h and k of the same order, one obtains instead

(1 + λ)‖U^{n+1}‖_{∞,h} ≤ λ‖U^{n+1}‖_{∞,h} + (2λ − 1)‖U^n‖_{∞,h},

which does not show maximum-norm stability, since 2λ − 1 > 1. For λ ≤ 1 we have
immediately as before an O(k² + h²) = O(h²) convergence estimate. We shall return
later to the question of maximum-norm stability and convergence for λ > 1.
We turn now instead to an analysis in a real l₂-type space. We introduce thus for
vectors V = (V₀, ..., V_M)^T the inner product

(V, W)_h = h Σ_{j=0}^M V_j W_j,

and the corresponding norm

‖V‖_{2,h} = (V, V)_h^{1/2} = (h Σ_{j=0}^M V_j²)^{1/2}.

We denote by l⁰_{2,h} the space l_h⁰ equipped with this inner product and norm and note
that this space is spanned by the (M−1) vectors φ_p, p = 1, ..., M−1, with
components

φ_{p,j} = √2 sin πpjh, j = 0, ..., M,

and that these form an orthonormal basis with respect to the above inner product,
or

(φ_p, φ_q)_h = δ_{pq} = 1 if p = q, 0 if p ≠ q.

We also observe that the φ_p are eigenfunctions of the finite difference operator
−∂_x∂̄_x, in the sense that

−∂_x∂̄_x φ_{p,j} = (2/h²)(1 − cos πph)φ_{p,j}, j = 1, ..., M−1.
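Both the orthonormality of the φ_p and the eigenfunction relation are finite trigonometric identities, and can be verified directly on a small mesh (a Python sketch; the mesh size M = 8 and the mode p = 3 are arbitrary choices):

```python
import numpy as np

M = 8
h = 1.0 / M
j = np.arange(M + 1)
# phi_{p,j} = sqrt(2) sin(pi p j h) for p = 1,...,M-1, j = 0,...,M.
phi = np.sqrt(2.0) * np.sin(np.pi * h * np.outer(np.arange(1, M), j))

# Orthonormality in the inner product (V, W)_h = h * sum_j V_j W_j.
G = h * phi @ phi.T
print(np.allclose(G, np.eye(M - 1)))     # True

# Eigenfunction relation: -dx dxbar phi_p = (2/h^2)(1 - cos(pi p h)) phi_p.
p = 3
v = phi[p - 1]
lhs = -(v[2:] - 2 * v[1:-1] + v[:-2]) / h**2
rhs = (2 / h**2) * (1 - np.cos(np.pi * p * h)) * v[1:-1]
print(np.allclose(lhs, rhs))             # True
```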
We shall now use these notions to discuss the stability of the three difference
schemes considered above. Let V be given initial data in l⁰_{2,h}. Then

V = Σ_{p=1}^{M−1} c_p φ_p, where c_p = (V, φ_p)_h.

The forward Euler method associates with these initial data

U_j^1 = V_j + k∂_x∂̄_x V_j = Σ_{p=1}^{M−1} c_p(1 − 2λ(1 − cos πph))φ_{p,j}, j = 1, ..., M−1,

U_0^1 = U_M^1 = 0,

or, more generally,

U_j^n = Σ_{p=1}^{M−1} c_p E(πph)^n φ_{p,j}, j = 0, ..., M,   (2.6)

where

E(ξ) = 1 − 2λ + 2λ cos ξ.
By Parseval's relation we have thus

‖U^n‖_{2,h} = (Σ_{p=1}^{M−1} c_p² E(πph)^{2n})^{1/2} ≤ max_p |E(πph)^n| ‖V‖_{2,h},

with equality for the appropriate V. Now


max_p |E(πph)| = max(|1 − 2λ(1 − cos πh)|, |1 − 2λ(1 − cos π(M−1)h)|)
= max(|1 − 2λ(1 − cos πh)|, |1 − 2λ(1 + cos πh)|).

Thus we have max_p |E(πph)| ≤ 1 for small h if and only if 4λ − 1 ≤ 1, or λ ≤ ½, and it
follows in this case that

‖U^n‖_{2,h} ≤ ‖V‖_{2,h}.   (2.7)

Consequently, the forward Euler scheme is stable in l_{2,h} if and only if λ ≤ ½, that is,
under the same condition as for the maximum norm.
The corresponding analysis for the backward Euler scheme shows (2.6) where
now

E(ξ) = 1/(1 + 2λ(1 − cos ξ)).

In this case 0 < E(πph) ≤ 1 for all p and λ, and hence (2.7) is valid for any value of λ.
Similarly, for the Crank–Nicolson scheme (2.5), the representation (2.6) holds with

E(ξ) = (1 − λ(1 − cos ξ))/(1 + λ(1 − cos ξ)),

and we now note that |E(ξ)| ≤ 1 for all ξ and for any λ > 0. Thus, the Fourier analysis
method here shows stability in l⁰_{2,h} for any λ. The convergence follows again by the
standard method and gives thus in this case the following:

THEOREM 2.3. Let u ∈ C^{4,3} be the solution of (2.1) and U^n that of the Crank–Nicolson
scheme (2.5). Then, with C = C(u, T), we have

‖U^n − u^n‖_{2,h} ≤ C(h² + k²) for nk ≤ T.

PROOF. By Taylor series expansion we have for the truncation error

τ^n = ∂_t u^n − ½∂_x∂̄_x(u^{n+1} + u^n)
= k^{−1}(B_{kh}u^{n+1} − A_{kh}u^n)
= O(k² + h²) as k, h → 0,

uniformly in (jh, nk), and hence for sufficiently smooth u,

‖τ^n‖_{2,h} ≤ C(u)(h² + k²).

As earlier we have, for the error z^n = U^n − u^n,

z^{n+1} = E_{kh}U^n − E_{kh}u^n − kB_{kh}^{−1}τ^n = E_{kh}z^n − kB_{kh}^{−1}τ^n,

or

z^n = −k Σ_{l=0}^{n−1} E_{kh}^{n−1−l} B_{kh}^{−1} τ^l,

from which the result follows by using the stability of the Crank–Nicolson
operator E_{kh} and the boundedness of B_{kh}^{−1}. □
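The three symbols occurring in this analysis can be compared numerically over 0 < ξ < π (a Python sketch; the grid of ξ values is an arbitrary choice). As the analysis predicts, only the forward Euler symbol can exceed 1 in modulus, and only when λ > ½:

```python
import numpy as np

def E_forward(xi, lam):
    return 1 - 2 * lam * (1 - np.cos(xi))

def E_backward(xi, lam):
    return 1 / (1 + 2 * lam * (1 - np.cos(xi)))

def E_cn(xi, lam):
    return (1 - lam * (1 - np.cos(xi))) / (1 + lam * (1 - np.cos(xi)))

xi = np.linspace(0.01, np.pi - 0.01, 1000)    # xi = pi*p*h, p = 1,...,M-1
for lam in (0.4, 0.5, 2.0, 50.0):
    print(lam,
          np.max(np.abs(E_forward(xi, lam))) <= 1,    # only for lam <= 1/2
          np.max(np.abs(E_backward(xi, lam))) <= 1,   # for every lam
          np.max(np.abs(E_cn(xi, lam))) <= 1)         # for every lam
```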

Let us return to the model problem (2.1) and note that because of the simplicity of
the boundary conditions it may be replaced by a pure initial value problem with
periodic initial data. In fact, extending the function v defined on [0,1] with
v(0) = v(1) = 0 to [−1,1] by setting v(−x) = −v(x), and then extending the function
thus obtained by the periodicity requirement v(x+2) = v(x) for all x ∈ ℝ, we have a
2-periodic odd function, which we may still call v. The problem may now be posed as
to find a 2-periodic solution of the pure initial value problem

∂u/∂t = ∂²u/∂x², x ∈ ℝ, t ≥ 0,
u(x, 0) = v(x), x ∈ ℝ.

The solution of this problem is then easily seen to be 2-periodic and odd in x and
(x−1), and its restriction to [0,1] is thus a solution of our original problem (2.1).
For this problem we may again apply the forward Euler method

U_j^{n+1} = λU_{j−1}^n + (1 − 2λ)U_j^n + λU_{j+1}^n, j = 0, ±1, ±2, ...,   (2.8)

where U_j^n is the approximation to u(jh, nk). If, as above, h = 1/M for some integer M,
then if U^0 is periodic in j with period 2M, then so is U^n for any n ≥ 0. The discrete
problem (2.8) thus reduces to 2M equations at each time level.
Similarly we may apply the backward Euler equations

(1 + 2λ)U_j^{n+1} − λ(U_{j−1}^{n+1} + U_{j+1}^{n+1}) = U_j^n, j = 0, ±1, ±2, ...,   (2.9)

and look for a solution which is periodic with period 2M in j, thus reducing (2.9) to
a system of 2M equations for U_j^{n+1}, j = −M, ..., M−1, say, with the matrix

B = [ 1+2λ   −λ     0    ...   −λ  ]
    [  −λ   1+2λ   −λ    ...    0  ]
    [   0    −λ   1+2λ   ...    0  ]
    [   .     .     .           .  ]
    [  −λ     0    ...   −λ   1+2λ ].
This matrix is clearly invertible since it is diagonally dominant. Similar
considerations hold for the Crank–Nicolson method.
In the case that V is odd and satisfies V_M = 0, the same holds for U^n, and the systems
reduce to our old finite-dimensional systems of order (M−1).
Certain other problems may as well be restated in terms of periodicity require-
ments. For instance, consider the problem

∂u/∂t = ∂²u/∂x² in [0,1] for t > 0,

∂u/∂x(0, t) = ∂u/∂x(1, t) = 0 for t > 0, u(x, 0) = v(x) in [0,1].

In this case we may extend v first to [0,2] by requiring v(x) = v(2−x) and then extend
this function to a 2-periodic (even) function on ℝ. Looking for a solution with these
properties will then lead to a solution of our original problem. Again, the finite
difference methods suggested by the above may be applied to the periodic pure
initial value problem and yield an approximate solution of our boundary value
problem.
Let us therefore, more generally, consider the periodic initial value problem,
which we normalize to have period 2π,

∂u/∂t = ∂²u/∂x², x ∈ ℝ, t > 0,
u(x, 0) = v(x), v(x + 2π) = v(x) for x ∈ ℝ,

and let us also consider a finite difference approximation defined by

Σ_p b_p U^{n+1}(x − ph) = Σ_p a_p U^n(x − ph) for n ≥ 0,
(2.10)
U^0(x) = v(x),

where h = 2π/M for some integer M and where the sums are finite.
For the analysis of this situation it is again convenient to use Fourier analysis and
to work in the space L₂ of 2π-periodic functions with norm

‖v‖_{2,#} = (∫₀^{2π} v(x)² dx)^{1/2}.

For such functions we have the Fourier series representation

v(x) = Σ_{j=−∞}^{∞} v̂_j e^{ijx}, where v̂_j = (1/2π) ∫₀^{2π} v(x)e^{−ijx} dx,

and we recall Parseval's relation

‖v‖²_{2,#} = 2π Σ_{j=−∞}^{∞} |v̂_j|².

Setting now

U^n(x) = Σ_j Û_j^n e^{ijx},

we find, by (2.10),

Σ_{p,j} b_p Û_j^{n+1} e^{ij(x−ph)} = Σ_{p,j} a_p Û_j^n e^{ij(x−ph)},

or, with a(ξ) = Σ_p a_p e^{−ipξ} and similarly for b(ξ),

Σ_j b(jh)Û_j^{n+1} e^{ijx} = Σ_j a(jh)Û_j^n e^{ijx}.

Assuming b(ξ) ≠ 0 for all real ξ we thus conclude

Û_j^{n+1} = E(jh)Û_j^n, where E(ξ) = b(ξ)^{−1}a(ξ),

and hence

U^n(x) = Σ_j E(jh)^n v̂_j e^{ijx}.

By Parseval's relation we find for the operator E_k defined by U^{n+1} = E_k U^n that

‖E_k^n v‖_{2,#} = (2π Σ_j |E(jh)^n v̂_j|²)^{1/2} ≤ sup_j |E(jh)^n| ‖v‖_{2,#},

and we may again conclude that stability (now with respect to L₂) holds if and only
if von Neumann's condition |E(ξ)| ≤ 1 is satisfied.
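The symbol calculus just described is entirely mechanical: from the coefficient sets {a_p} and {b_p} of (2.10) one forms a(ξ) and b(ξ) and samples |E(ξ)| = |a(ξ)/b(ξ)| for real ξ. A Python sketch (the grid size and rounding tolerance are arbitrary choices), applied to the Crank–Nicolson and forward Euler coefficients:

```python
import numpy as np

def symbol(coeffs, xi):
    # Trigonometric polynomial sum_p c_p e^{-i p xi} from {p: c_p}.
    return sum(c * np.exp(-1j * p * xi) for p, c in coeffs.items())

def von_neumann(a, b, npts=721):
    # Check |E(xi)| = |a(xi)/b(xi)| <= 1 on a grid of real xi.
    xi = np.linspace(-np.pi, np.pi, npts)
    return np.all(np.abs(symbol(a, xi) / symbol(b, xi)) <= 1 + 1e-12)

lam = 5.0
# Crank-Nicolson: b_p from I - (k/2) dx dxbar, a_p from I + (k/2) dx dxbar.
b = {-1: -lam / 2, 0: 1 + lam, 1: -lam / 2}
a = {-1: lam / 2, 0: 1 - lam, 1: lam / 2}
print(von_neumann(a, b))                       # True for every lam > 0

# Forward Euler: b = identity; fails von Neumann when lam > 1/2.
print(von_neumann({-1: lam, 0: 1 - 2 * lam, 1: lam}, {0: 1.0}))
```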
CHAPTER II

The Pure Initial Value Problem


In this chapter we shall give an account of the present state of the theory of finite
difference methods for the pure initial value problem for linear parabolic partial
differential equations. We are then seeking a solution of a parabolic equation (or
system of equations) for positive time when its initial values are prescribed at the
initial time on the unrestricted Euclidean space ℝ^d with d ≥ 1.
The approximation by finite differences may be explicit or implicit. In the latter
case the algebraic system of equations to be solved at a specific time level generally
results in a finite algorithm only when the problem is periodic in space. We shall
allow the differential equations and the approximating difference schemes to have
variable but smooth coefficients.
Our main concerns are the stability of the finite difference scheme with respect to
various norms and the rate of convergence of the approximate, or discrete, solution
to the exact, or continuous, solution of the given problem.
In Section 3 we introduce the basic concepts of the theory, such as explicit and
implicit schemes, one-step and multistep schemes, truncation errors, orders of
accuracy, and various notions of stability and convergence. We also give a brief
account of the Lax-Richtmyer theory of the relation between stability and
convergence.
Section 4 is devoted to the Fourier method of analysis of one-step constant
coefficient methods in the space L₂(ℝ^d). In this case, using Parseval's relation, the
stability may be translated to the boundedness of a family of powers of
trigonometric rational functions, which are matrix-valued in the systems case. The
well-known von Neumann necessary condition for stability then relates to the
eigenvalues of these matrix-valued functions. A somewhat stronger condition,
introduced in a special case in JOHN [1952], is shown to be sufficient for stability and
to imply regularity properties of the difference methods similar to those of the
continuous problem. The Fourier method is also applied to derive estimates for the
rate of convergence for parabolic difference schemes.
In Section 5 several examples of specific standard finite difference methods are
investigated, in light of the theory, with respect to stability and accuracy. In Section
6 multistep schemes with constant coefficients are brought into the framework of
one-step methods and particular examples are analyzed.
Section 7 develops some of the material from the fundamental paper by John
quoted above, in which Fourier methods, somewhat more refined than those of the


earlier sections, are used to derive stability bounds in the maximum-norm for single-
step explicit schemes in the case of a scalar second-order equation (which may have
variable coefficients) in one space dimension. In Section 8 we describe generali-
zations of these results, principally by Widlund, to explicit and implicit, one- and
multistep schemes for parabolic systems of equations in an arbitrary number of
space dimensions.
In the final Section 9 we discuss the rate of convergence of finite difference schemes
for parabolic equations and systems. Here more attention than before is paid to the
relation between the regularity of the data of the continuous problem and the
convergence properties of the approximating scheme. In order to describe this, the
Besov spaces B^s_{p,q} are introduced into the analysis. Direct results are presented which
show that a specific degree of regularity together with a given order of accuracy and
parabolicity of the method implies a certain order of convergence. Often this
convergence is of lower order for nonregular data even when the method has high
accuracy, but we also indicate how a preliminary smoothing of the data can restore
the convergence rate lost through low regularity. We finally present examples of inverse
results which make it possible to draw conclusions about the data from the observed
rate of convergence.

3. Finite difference schemes

In this section we shall, in a more systematic and general way than in Chapter I,
introduce the basic concepts relating to finite difference schemes for the numerical
solution of the pure initial value problem

∂u/∂t = P(x, t, D)u + f
= Σ_α P_α(x, t)D^α u + f(x, t) in ℝ^d × ℝ₊,   (3.1)

u(x, 0) = v(x) on ℝ^d,

where α = (α₁, ..., α_d) is a multi-index, |α| = α₁ + ⋯ + α_d, and

D^α v = (∂/∂x₁)^{α₁} ⋯ (∂/∂x_d)^{α_d} v.
Although our purpose is the solution of parabolic problems we shall not specify the
type of the linear differential operator P(x, t, D) at present, but simply assume, to
begin with, that the initial value problem (3.1) has a sufficiently regular solution for
our purposes. We shall return in Sections 4 and 8 below to more precise descriptions
of parabolic equations.
We begin our discussion here by considering the case of the homogeneous
equation, so that f = 0, and we first consider an explicit finite difference scheme

U^{n+1}(x) = Σ_β a_β(x, nk, h)U^n(x − βh)
= A^n_{k,h}U^n(x), n = 0, 1, ...,


where U^n stands for the approximate solution at time nk, where k is the time step and
h is the mesh size in space, β = (β₁, ..., β_d) is a multi-index with integer
components, and the summation is finite. The space variable x may range either over
ℝ^d or over the mesh points hℤ^d. We shall also allow the scheme to be implicit, i.e.
defined by the equation

B^n_{k,h}U^{n+1} = A^n_{k,h}U^n, n = 0, 1, ...,

where B^n_{k,h} is an operator of the same form as A^n_{k,h}, but where we also assume that
B^n_{k,h} has a bounded inverse (in the space of functions under consideration) so that
the latter equation may be solved for U^{n+1} and written

U^{n+1} = E_{k,h,n}U^n.

The solution of this discrete problem may then, in both cases, be written

U^n = E_{k,h,n−1}E_{k,h,n−2} ⋯ E_{k,h,0}U^0,

or, if we set U^0 = v, and introduce the product

E^{n,m}_{k,h} = E_{k,h,n−1}E_{k,h,n−2} ⋯ E_{k,h,m},   (3.2)

as

U^n = E^{n,0}_{k,h}v.

In many cases we shall assume that k and h are tied together by a relation such as
k/h^q = λ = constant, most often with q = M, the order of the operator P(x, t, D). We
shall then omit h in the notation and write, for instance, E_{k,n} instead of E_{k,h,n}. When
the coefficients are independent of t, we may write simply E_k, and the solution in such
a case is simply U^n = E_k^n v.
For the nonhomogeneous equation we will use difference approximations of the
form

B_{k,h}U^{n+1} = A_{k,h}U^n + kM_{k,h}f, n = 0, 1, ...,   (3.3)

where M_{k,h} is an operator which could be of the same type as A_{k,h} and B_{k,h}, or, e.g.,
an integral average such as

M_{k,h}f(x) = k^{−1} ∫_{nk}^{(n+1)k} f(x, t) dt.

We find then, with M̃_{k,h} = (B_{k,h})^{−1}M_{k,h},

U^{n+1} = E_{k,h,n}U^n + kM̃_{k,h}f,

and the solution of the discrete problem takes the form

U^n = E^{n,0}_{k,h}U^0 + k Σ_{j=0}^{n−1} E^{n,j+1}_{k,h}M̃_{k,h,j}f.   (3.4)

In the simple case that the coefficients are independent of t, that k and h are related as
above, that M̃_{k,h,j}f = f^j and U^0 = v, this reduces to

U^n = E_k^n v + k Σ_{j=0}^{n−1} E_k^{n−1−j}f^j.

The difference equation is said to be consistent with the differential equation if,
formally, for smooth solutions of the differential equation, the difference equation is
satisfied with an error which is o(k) as k, h → 0. More precisely, introducing the
truncation error

τ_{k,h,n} = k^{−1}(B^n_{k,h}u^{n+1} − A^n_{k,h}u^n) − M_{k,h}f,

the scheme is said to be accurate of order μ in x and ν in t if, uniformly for 0 ≤ nk ≤ T,
say,

τ_{k,h,n} = O(h^μ + k^ν) as k, h → 0.   (3.5)

When k and h are tied to each other as above we simply say that the scheme is
accurate of order μ (with respect to x) if

τ_{k,n} = O(h^μ) as k → 0.   (3.6)
Note that these concepts are local properties of the operators which may be
checked formally by Taylor expansions and are not dependent on any specific space
of functions. It is clear, however, that the constants in (3.5) and (3.6) will depend on
certain derivatives of the solution considered. For instance, in the case of (3.5) it is
always possible to find an M₁ such that, for nk ≤ T,

|τ_{k,h,n}(x)| = |τ_{k,h,n}(u; x)|
≤ C(h^μ + k^ν) max_{|α|+β ≤ M₁} ‖D_x^α D_t^β u‖_{L_∞(ℝ^d × [0,T])}.

By the definition of the truncation error we have for the exact solution

u^{n+1} = E_{k,h,n}u^n + kM̃_{k,h}f + kτ̃_{k,h,n}, where τ̃_{k,h,n} = (B^n_{k,h})^{−1}τ_{k,h,n}.

This will be utilized in the error analysis below.
We consider two examples for the model heat equation

∂u/∂t = ∂²u/∂x² + f.

Our first example is the implicit Crank–Nicolson scheme

(U^{n+1}(x) − U^n(x))/k = ½∂_x∂̄_x(U^{n+1}(x) + U^n(x)) + f^{n+1/2}(x),

where f^{n+1/2}(x) = f(x, (n+½)k).
With ∂_x∂̄_x U(x) = h^{−2}(U(x+h) − 2U(x) + U(x−h)) it may be written in the basic form
(3.3) as

(I − ½k∂_x∂̄_x)U^{n+1}(x) = (I + ½k∂_x∂̄_x)U^n(x) + kf^{n+1/2}(x).
Omitting the variable x, we find by Taylor series expansion

τ_{k,h,n} = k^{−1}(u^{n+1} − u^n) − ½∂_x∂̄_x(u^{n+1} + u^n) − f^{n+1/2}
= (u_t − u_{xx} − f)^{n+1/2} + (1/24)k²u_{ttt}^{n+1/2} − (1/8)k²u_{xxtt}^{n+1/2} − (1/12)h²u_{xxxx}^{n+1/2}
+ O(h³ + k³) as h, k → 0,

and hence, using the differential equation for the first term,

τ_{k,h,n} = O(h² + k²) as h, k → 0,

so that the scheme is second-order accurate in both x and t, and second-order in x if
k/h² = λ is kept constant, or even if k/h = λ is constant.
As our second example we consider the explicit scheme

U^{n+1}(x) = (1/6)U^n(x+h) + (2/3)U^n(x) + (1/6)U^n(x−h) + kM_k f(x),

where k/h² = 1/6, with M_k not yet specified. By Taylor expansion we obtain, at x,

τ_{k,n}(x) = k^{−1}(u^{n+1}(x) − (1/6)u^n(x+h) − (2/3)u^n(x) − (1/6)u^n(x−h)) − M_k f(x)
= (u_t − u_{xx})(x) + (1/12)h²(u_{tt} − u_{xxxx})(x) − M_k f(x) + O(h⁴) as h → 0.

For the homogeneous equation this gives, since u_t = u_{xx} and u_{tt} = u_{xxxx}, that
τ_{k,n} = O(h⁴), so that the scheme is of fourth order. We now want to choose M_k in such
a way that the fourth-order accuracy is preserved for the nonhomogeneous
equation. Using the differential equation we now find
τ_{k,n}(x) = (f^n + (1/12)h²(f_{xx}^n + f_t^n))(x) − M_k f(x) + O(h⁴) as h → 0,

so that the requirement for M_k is

M_k f(x) = (f^n + (1/12)h²(f_{xx}^n + f_t^n))(x) + O(h⁴) as h → 0.

One choice satisfying these conditions is, as is easily seen,

M_k f(x) = ½f^{n+1}(x) + (1/12)f^n(x+h) + (1/3)f^n(x) + (1/12)f^n(x−h).
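The fourth-order accuracy of this choice can be checked numerically with a manufactured solution (a Python sketch; the solution u = e^{−t} sin 2x and the sample point are arbitrary choices, and for it f = u_t − u_{xx} = 3e^{−t} sin 2x):

```python
import numpy as np

# Manufactured solution u = exp(-t) sin(2x) of u_t = u_xx + f,
# so that f = u_t - u_xx = 3 exp(-t) sin(2x).
u = lambda x, t: np.exp(-t) * np.sin(2 * x)
f = lambda x, t: 3 * np.exp(-t) * np.sin(2 * x)

def truncation(h, x=0.7, t=0.3):
    k = h * h / 6                       # the mesh ratio of the scheme
    Mkf = (0.5 * f(x, t + k)
           + f(x + h, t) / 12 + f(x, t) / 3 + f(x - h, t) / 12)
    return (u(x, t + k)
            - u(x + h, t) / 6 - 2 * u(x, t) / 3 - u(x - h, t) / 6) / k - Mkf

# Halving h should divide the truncation error by about 2^4 = 16.
t1, t2 = truncation(0.02), truncation(0.01)
print(abs(t1 / t2))   # close to 16, consistent with fourth-order accuracy
```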


We shall now introduce the concept of stability of a difference scheme. For this
purpose we assume that the functions under consideration belong to a normed
linear space 𝒩, with norm ‖·‖. The difference scheme introduced above is said to
be stable with respect to 𝒩 if, with C = C_T,

‖E^{n,m}_{k,h}v‖ ≤ C‖v‖ for 0 ≤ mk ≤ nk ≤ T,

where E^{n,m}_{k,h} is the product defined in (3.2). If t is permitted to be unbounded (T = ∞),

this inequality may also be expressed

‖E^{n,m}_{k,h}v‖ ≤ Ce^{c(n−m)k}‖v‖, 0 ≤ m ≤ n,

thus permitting an exponential growth of the bound in time. Note that the definition
of stability is independent of the approximation of the inhomogeneity f. For the
time-independent case, with k and h tied together, our stability condition is
equivalent to

‖E_k^n v‖ ≤ Ce^{cnk}‖v‖, n = 0, 1, ....

Let us also note that if (B^n_{k,h})^{−1} and M_{k,h} are bounded in 𝒩 and in L_∞(0, T; 𝒩),
respectively, then the stability implies, by (3.4),

‖U^n‖ ≤ C(‖U^0‖ + T sup_{t≤T} ‖f(t)‖) for nk ≤ T.   (3.7)

The concept of stability depends on 𝒩; one and the same difference scheme may
be stable with respect to one normed space and unstable with respect to another. The
spaces that occur below are the L_p-type spaces with 1 ≤ p ≤ ∞. It will turn out that
often stability is easier to show in L₂ than in other L_p spaces.
The importance of stability for numerical calculations is clear from the following
result:

THEOREM 3.1. Assume that E_{k,h,n} is stable in 𝒩 and that U^n and V^n satisfy

U^{n+1} = E_{k,h,n}U^n + kF^n

and

V^{n+1} = E_{k,h,n}V^n + kG^n for n = 0, 1, ....

Then

‖U^n − V^n‖ ≤ C(‖U^0 − V^0‖ + T max_{0≤jk≤T} ‖F^j − G^j‖) for 0 ≤ nk ≤ T.

This follows at once as in (3.7) upon noting that, by linearity, the difference
Z^n = U^n − V^n satisfies

Z^{n+1} = E_{k,h,n}Z^n + k(F^n − G^n) for n ≥ 0.

Thus, if the G^j and V^0 are close to the F^j and U^0, respectively, then V^n is close to U^n in the
sense of the above inequality. As a special case one may prove the following
convergence result:

THEOREM 3.2. Assume that the difference scheme is accurate of order μ in x and ν in t,
and that E_{k,h,n} is stable with respect to 𝒩. Then, if U^0 = v, and if the exact solution u is
sufficiently smooth, we have

‖U^n − u^n‖ ≤ C(u, T)(h^μ + k^ν) for 0 ≤ nk ≤ T.
This may be demonstrated as follows. By our above definitions of the approx-
imate solution and of the truncation error we have

U^{n+1} − u^{n+1} = E_{k,h,n}(U^n − u^n) − kτ̃_{k,h,n}, n = 0, 1, ....

Since U^0 − u^0 = v − v = 0, and since (B^n_{k,h})^{−1} is assumed bounded in 𝒩, we have, as in
Theorem 3.1,

‖U^n − u^n‖ ≤ CT max_{j ≤ n−1} ‖τ̃_{k,h,j}‖ ≤ CT max_{j ≤ n−1} ‖τ_{k,h,j}‖.

If we now interpret the assumption that the exact solution is "sufficiently smooth" to
mean that

‖τ_{k,h,n}‖ ≤ C(u, T)(h^μ + k^ν) for nk ≤ T,

the result follows at once.
Clearly, the result as stated is somewhat imprecise in that it does not specify the
regularity assumptions on the exact solution. The correct precise assumptions will
depend on the choice of the normed space 𝒩 and can often be expressed in terms of
the data of the problem by means of a priori estimates. For instance, for the
homogeneous equation (f = 0) a natural requirement is that a certain number of
derivatives of the initial data v have finite norm in 𝒩.
It is often possible to obtain convergence results under less stringent assumptions
on the data. For instance, considering again the homogeneous equation, let v be such
that it may be approximated arbitrarily well by data ṽ sufficiently smooth for the
conclusion of Theorem 3.2 to be valid. Assume further that the initial value problem
(3.1) is well posed with respect to 𝒩, so that for the solution

‖u(t)‖ = ‖E(t, 0)v‖ ≤ C‖v‖ for 0 ≤ t ≤ T,

where E(t, 0) denotes the solution operator of (3.1), i.e. the linear operator that takes
the initial data into the solution at time t. Then U^n converges to u(t) = u(nk) = u^n as
n → ∞, nk = t. For, with ε > 0 given, we may choose ṽ such that ‖v − ṽ‖ ≤ ε
and such that it is smooth enough for Theorem 3.2 to apply. We have then, with
obvious notation,

‖U^n − u^n‖ = ‖(E^{n,0}_{k,h} − E(t, 0))v‖
≤ ‖(E^{n,0}_{k,h} − E(nk, 0))ṽ‖ + ‖(E^{n,0}_{k,h} − E(nk, 0))(v − ṽ)‖
≤ C(ε, T)(h^μ + k^ν) + Cε ≤ 2Cε for k, h small.
In the case that k and h are tied together as above by a mesh ratio condition and
the order of accuracy is μ (with respect to x), the result of the theorem is modified to
read

‖U^n − u^n‖ ≤ C(u, T)h^μ, 0 ≤ nk ≤ T.
We shall return in Section 9 to discuss more precise results concerning the relation

between the regularity of the exact solution and the rate of convergence.
It should be mentioned that, theoretically, convergence is possible without
stability if the initial data are sufficiently smooth, cf. e.g. THOMÉE [1969].
The relation between the concepts of consistency, stability and convergence was
examined in an abstract Banach space setting in a celebrated paper by LAX and
RICHTMYER [1956]. We give a brief account of their theory which is concerned with
the case of a time-independent homogeneous equation.
Let ℬ be a Banach space (a normed linear space which is complete with respect to
its norm) and consider the initial value problem

du/dt = Pu for t > 0,
(3.8)
u(0) = v,

where P is a linear operator defined on a dense subset of ℬ. It is assumed that this
problem is correctly posed so that, in particular, there exists a solution operator E(t)
which takes the initial data v into the value u(t) of the solution at time t, and which is
bounded in ℬ, or

‖u(t)‖ = ‖E(t)v‖ ≤ C_T‖v‖ for 0 ≤ t ≤ T.
The discussion is further restricted to the case of a one-parameter family of
"difference" operators E_k, approximating E(k), which is assumed to be consistent
with (3.8) in the sense that for each solution u(t) = E(t)v with initial data v in some
dense subset of ℬ one has

sup_{t ∈ [0,T]} ‖k^{−1}(E_k u(t) − u(t)) − Pu(t)‖ → 0 as k → 0.

Such an operator is stable if, with C = C_T,

‖E_k^n v‖ ≤ C‖v‖ for 0 ≤ nk ≤ T.


The difference scheme under investigation thus consists in approximating E(t)v by
E_k^n v for nk close to t.
The following is then the famous Lax equivalence theorem:

THEOREM 3.3. Assume E_k is consistent with the correctly posed problem (3.8). Then
stability of E_k is the necessary and sufficient condition for convergence of the difference
scheme to the solution of (3.8), in the sense that, for any v ∈ ℬ, with u(t) = E(t)v, one has

‖E_k^n v − u(t)‖ → 0 as n → ∞, nk → t.

The proof of the sufficiency of stability for convergence is essentially the same as
our proof of Theorem 3.2 above. The proof of its necessity is based on one of the
fundamental theorems of functional analysis, the Banach–Steinhaus theorem (or the
principle of uniform boundedness).
The Lax equivalence theorem has been generalized to cover a variety of situations
such as time-dependent, nonhomogeneous or even nonlinear equations, see e.g.
STETTER [1959], THOMPSON [1964], ANSORGE and HASS [1970].
We shall now briefly discuss the case of a multistep finite difference scheme. Such
a scheme is based on a difference equation of the form

B_{k,h}U^{n+1} = A_{k,h,1}U^n + A_{k,h,2}U^{n−1} + ⋯ + A_{k,h,m}U^{n−m+1} + kM_{k,h}f,

where m ≥ 2. This formula may only be used for n ≥ m−1 and thus requires, in
addition to the natural initial condition U^0 = v, that U^1, ..., U^{m−1} be prescribed.
A common way to analyze such a scheme is to reduce it to a two-level or one-step
scheme of the form discussed above. For this purpose one introduces the compound
vector-valued unknown function

Ũ^n = (U^n, U^{n−1}, ..., U^{n−m+1})^T,

and the matrices of operators

           [ A^n_{k,h,1}  A^n_{k,h,2}  ...  A^n_{k,h,m} ]
           [      I            0       ...       0      ]
Ã_{k,h} =  [      0            I       ...       0      ]
           [      .            .                 .      ]
           [      0           ...       I        0      ]

and

           [ B^n_{k,h}   0   ...   0 ]
           [     0       I   ...   0 ]
B̃_{k,h} =  [     .       .         . ]
           [     0       0   ...   I ].

With this notation the scheme can be written as

B̃_{k,h}Ũ^{n+1} = Ã_{k,h}Ũ^n + kF̃^n for n ≥ m−1, with Ũ^{m−1} given,

where F̃^n = (M_{k,h}f, 0, ..., 0)^T.
Again we may introduce the discrete local solution operator of the homogeneous
equation,

Ẽ_{k,h,n} = (B̃_{k,h})^{−1}Ã_{k,h},

and define its stability and consistency in the obvious way. In particular, if 𝒩 is
a normed linear space in which we consider our functions, then Ũ^n is sought in the
product space 𝒩 × ⋯ × 𝒩 with m factors, and the stability is measured with respect
to the corresponding norm. In the discussion of the accuracy of the scheme, special
attention has to be given to the choice of the starting values U^1, ..., U^{m−1}.
Similarly as before, with ũ^n = (u^n, u^{n−1}, ..., u^{n−m+1})^T for the exact solution, we now
have, for n ≥ m−1,

Ũ^n − ũ^n = Ẽ^{n,m−1}_{k,h}(Ũ^{m−1} − ũ^{m−1}) − k Σ_{j=m−1}^{n−1} Ẽ^{n,j+1}_{k,h}τ̃_{k,h,j},   (3.9)

where the truncation error only stems from the first components of the compound
solutions. It follows that, in the presence of stability, if the truncation error is
O(h^μ + k^ν), say, then the global error is of the same order provided Ũ^{m−1} − ũ^{m−1}
matches this. Note that, since the initial error is a local error, a lower order
approximation may be used in the initial steps, because an extra factor k is available
in this term, which is not needed to compensate for the summation in (3.9); see the
example below.
As an illustration, consider for the solution of the model homogeneous one-dimensional heat equation the three-level equation

    (U^{n+1}(x) − U^{n−1}(x)) / (2k) = (U^n(x+h) − 2U^n(x) + U^n(x−h)) / h²,   (3.10)

or, if λ = k/h²,

    U^{n+1}(x) = 2λU^n(x+h) − 4λU^n(x) + 2λU^n(x−h) + U^{n−1}(x).

In system form this may be expressed in terms of Ũ^n = (U^n, U^{n−1})^T = (U^n, V^n)^T as

    U^{n+1}(x) = 2λU^n(x+h) − 4λU^n(x) + 2λU^n(x−h) + V^n(x),
    V^{n+1}(x) = U^n(x),

or

    Ũ^{n+1}(x) = (Ẽ_{k,h} Ũ^n)(x)
               = [2λ 0; 0 0] Ũ^n(x+h) + [−4λ 1; 1 0] Ũ^n(x) + [2λ 0; 0 0] Ũ^n(x−h)

for n ≥ 1.
By the symmetry around (x, nk) the exact solution satisfies (3.10) with an error of O(h² + k²), which translates into

    ũ^{n+1} = Ẽ_{k,h} ũ^n + k·O(h² + k²)  as h, k → 0.

If we assume λ = k/h² constant, the order term reduces to O(h²) as k tends to 0. With U^0 = U^1 = v the initial accuracy is also O(k) = O(h²) in this case, but if k and h are independent, a more natural choice is

    U^1(x) = v(x) + k ∂_x∂̄_x v(x) = u(x, k) + O(k² + kh²)  as h, k → 0.   (3.11)

The definition of stability is now interpreted to mean the boundedness of Ũ^n in terms of Ũ^1 = (U^1, U^0)^T, and this would imply convergence of order O(h²) or O(h² + k²), respectively, in the two cases, for sufficiently smooth initial data.
However, it turns out that the present scheme is unstable for any choice of λ. For example, if the scheme is considered at the mesh points jh and we set U_j^n = U^n(jh) and Ũ_j^n = Ũ^n(jh), we may choose the initial values

    Ũ_j^1 = (−1)^j a = ((−1)^j a_1, (−1)^j a_2)^T,

where a is a fixed 2-vector, and then find

    Ũ_j^n = (−1)^j [−8λ 1; 1 0]^{n−1} a.

For any λ the 2×2 matrix entering here has two distinct real eigenvalues, one of which is less than −1. If we choose for a the corresponding eigenvector, it is clear that Ũ^n is unbounded as n grows and thus the scheme is unstable.
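This growth can be checked directly; the following is a minimal numerical sketch (the function name is our own) computing the eigenvalues of the 2×2 matrix above for several mesh ratios.

```python
import math

# Eigenvalues of the amplification matrix [ -8*lam  1 ; 1  0 ] governing the
# (-1)^j mode of the three-level scheme (3.10): mu = -4*lam +/- sqrt(16*lam**2 + 1).
def mode_eigenvalues(lam):
    r = math.sqrt(16.0 * lam ** 2 + 1.0)
    return -4.0 * lam - r, -4.0 * lam + r

for lam in (0.1, 0.5, 2.0):
    mu_minus, mu_plus = mode_eigenvalues(lam)
    # the product of the two roots is -1, and mu_minus < -1 for every lam > 0,
    # so this mode grows like |mu_minus|**n no matter how small k is chosen
    print(lam, mu_minus, mu_plus)
```

Since the characteristic equation μ² + 8λμ − 1 = 0 has root product −1, one root always lies outside the unit interval, confirming the instability for every λ.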
As we shall see in a later chapter this scheme may be stabilized, for any constant λ, by replacing U^n(x) by the average ½(U^{n+1}(x) + U^{n−1}(x)), so that the scheme becomes

    (U^{n+1}(x) − U^{n−1}(x)) / (2k) = (U^n(x+h) − U^{n+1}(x) − U^{n−1}(x) + U^n(x−h)) / h²,   (3.12)

which was proposed in DU FORT and FRANKEL [1953].
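A small experiment (a sketch; the grid size and the use of exact starting values are our own choices) illustrates that the averaged scheme stays bounded even for λ = k/h² = 1, well beyond the explicit limit λ ≤ ½:

```python
import numpy as np

# Du Fort-Frankel (3.12) on a periodic grid with lam = k/h**2 = 1; the starting
# values U^0, U^1 are taken from the exact solution u = exp(-t)*sin(x) of u_t = u_xx.
N, lam = 64, 1.0
h = 2 * np.pi / N
k = lam * h * h
x = h * np.arange(N)
exact = lambda t: np.exp(-t) * np.sin(x)
U_prev, U = exact(0.0), exact(k)
nsteps = 400
for n in range(1, nsteps):
    S = np.roll(U, -1) + np.roll(U, 1)                  # U^n(x+h) + U^n(x-h)
    U_prev, U = U, ((1 - 2 * lam) * U_prev + 2 * lam * S) / (1 + 2 * lam)
err = np.max(np.abs(U - exact(nsteps * k)))
print(err)
```

The update formula is obtained by solving (3.12) for U^{n+1}; the computed error remains small because k/h = λh → 0 here, in line with the consistency discussion at the end of this section.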


As another example we could consider the implicit three-level method derived by selecting a second-order accurate backward difference approximation to ∂u/∂t at t = (n+1)k. This requirement defines the scheme uniquely as

    ((3/2)U^{n+1}(x) − 2U^n(x) + ½U^{n−1}(x)) / k = ∂_x∂̄_x U^{n+1}(x),   (3.13)

and, as we shall see in Section 6 below, it is unconditionally stable. The truncation error here is O(h² + k²) and, in order to match this order, we may set U^0 = v and choose U^1 by (3.11). Note that thus the first-order accurate forward Euler method is accurate enough to match the second-order method (3.13). Note also that no stability restriction needs to be imposed for this first step because the formula is only used once.
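As a sketch (periodic grid, solved mode-by-mode with the FFT; the parameters are our own choices), scheme (3.13) indeed runs stably with a mesh ratio far beyond any explicit restriction:

```python
import numpy as np

# The implicit three-level scheme (3.13) for u_t = u_xx on a periodic grid with
# lam = k/h**2 = 10; each Fourier mode satisfies
#     (1.5 - k*d2) * Vhat^{n+1} = 2*Vhat^n - 0.5*Vhat^{n-1}.
N, lam = 64, 10.0
h = 2 * np.pi / N
k = lam * h * h
x = h * np.arange(N)
exact = lambda t: np.exp(-t) * np.sin(x)
theta = 2 * np.pi * np.fft.fftfreq(N)          # h*xi for the discrete modes
d2 = -2.0 * (1.0 - np.cos(theta)) / h ** 2     # symbol of the second difference quotient
V_prev, V = np.fft.fft(exact(0.0)), np.fft.fft(exact(k))
nsteps = 50
for n in range(1, nsteps):
    V_prev, V = V, (2.0 * V - 0.5 * V_prev) / (1.5 - k * d2)
err = np.max(np.abs(np.fft.ifft(V).real - exact(nsteps * k)))
print(err)
```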
We shall end this section by making an observation concerning the accuracy of the Du Fort–Frankel scheme (3.12). Let u be a sufficiently smooth function. With ∂_x, ∂̄_x as before and correspondingly for ∂_t, ∂̄_t, and with ∂̂_t the symmetric difference quotient, defined by

    ∂̂_t u(x, t) = (u(x, t+k) − u(x, t−k)) / (2k) = ½(∂_t + ∂̄_t)u(x, t),

we have for the truncation error of (3.12)

    τ_{h,k}(x) = (u^{n+1}(x) − u^{n−1}(x)) / (2k)
                 − (u^n(x+h) − u^{n+1}(x) − u^{n−1}(x) + u^n(x−h)) / h²
               = ∂̂_t u(x, nk) − ∂_x∂̄_x u(x, nk) + (k/h)² ∂_t∂̄_t u(x, nk)
               = [∂u/∂t − ∂²u/∂x²](x, nk) + (k/h)² (∂²u/∂t²)(x, nk)
                 + O(k²) + O(h²) + O(k⁴/h²).
Consistency with the heat equation therefore requires that k/h tends to zero, which is the case e.g. if k/h² = λ = constant. However, if instead k/h = λ̃ = constant, we obtain

    τ_{h,k}(x) = [∂u/∂t − ∂²u/∂x² + λ̃² ∂²u/∂t²](x, nk) + O(h²)  as h → 0,

which shows that the scheme is then consistent, not with the heat equation, but with the second-order hyperbolic equation

    λ̃² ∂²u/∂t² + ∂u/∂t − ∂²u/∂x² = 0.
We shall return in Section 6 to discuss the stability of the Du Fort-Frankel scheme.

4. L2 theory for finite difference schemes with constant coefficients

In this section we shall use the Fourier transform systematically to express and
analyze the notions of consistency, stability and convergence for constant coefficient
single-step finite difference methods applied to the pure initial value problem for
a homogeneous parabolic equation, or system of equations, in d space dimensions.
As we have seen earlier such material is also relevant to the study of initial boundary
value problems when the boundary conditions may be interpreted as periodicity
conditions.
First, we define the Fourier transform over R^d by

    v̂(ξ) = ∫_{R^d} v(x) e^{−i x·ξ} dx,   x·ξ = Σ_{j=1}^d x_j ξ_j,

where, as always below, we assume that v is small enough for large |x| that the definitions and subsequent calculations are justified. We recall Fourier's inversion
formula,

    v(x) = (2π)^{−d} (v̂)^(−x) = (2π)^{−d} ∫_{R^d} v̂(ξ) e^{i x·ξ} dξ,

and Parseval's relation,


    ‖v̂‖ = (2π)^{d/2} ‖v‖,

where, as for the rest of this section, ‖·‖ denotes the norm in L²(R^d).
We introduce as above, for α = (α_1, ..., α_d) a multi-index and |α| = α_1 + ⋯ + α_d, the mixed derivative of order |α|,

    D^α = (∂/∂x_1)^{α_1} ⋯ (∂/∂x_d)^{α_d},

and recall that

    (D^α v)^(ξ) = (iξ)^α v̂(ξ) = i^{|α|} ξ^α v̂(ξ),  where ξ^α = ξ_1^{α_1} ⋯ ξ_d^{α_d}.   (4.1)
Consider now the initial value problem

    ∂u/∂t = P(D)u ≡ Σ_{|α|≤M} P_α D^α u  for x ∈ R^d, t > 0,   (4.2)
    u(x, 0) = v(x)  in R^d,
where v is sufficiently smooth and small for large |x|, in a way that we shall not make precise at present. We shall focus our attention on equations which are parabolic, and begin by a definition of this concept which generalizes the heat equation

    ∂u/∂t = Δu,

which we have considered above, and the more general case when P(D) is a second-order elliptic operator,

    P(D)u = Σ_{j,k=1}^d p_{jk} ∂²u/∂x_j∂x_k + Σ_{j=1}^d p_j ∂u/∂x_j + p_0 u,   (4.3)

where (p_{jk}) is a symmetric positive definite constant d×d matrix with real elements.
For this purpose we introduce the characteristic polynomial of P(D),

    P(ξ) = Σ_{|α|≤M} P_α ξ^α,

and say (admitting the possibility of complex-valued coefficients) that (4.2) is parabolic (of order M) if

    Re P(iξ) ≤ −c|ξ|^M + C,  c > 0, ξ ∈ R^d,   (4.4)

where |ξ| = (Σ_{j=1}^d ξ_j²)^{1/2}.

We recall that this is the same as saying that −P(D) is strongly elliptic. Sometimes, we shall allow (4.2) to be an N×N system. In this case the coefficients P_α and the polynomial P(ξ) are N×N matrices and the condition (4.4) is replaced by

    Λ(P(iξ)) ≤ −c|ξ|^M + C,  c > 0, ξ ∈ R^d,   (4.5)

where

    Λ(A) = max_j Re λ_j,

if {λ_j}_{j=1}^N are the eigenvalues of A. Clearly (4.4) is satisfied with M = 2 for the second-order operator in (4.3) since then

    Re P(iξ) = −Σ_{j,k=1}^d p_{jk} ξ_j ξ_k − Σ_{j=1}^d (Im p_j) ξ_j + Re p_0
             ≤ −c|ξ|² + C(|ξ| + 1) ≤ −c_1|ξ|² + C_1,  ξ ∈ R^d.
Equations for which (4.4) or (4.5) are satisfied are referred to below as parabolic in
the sense of PETROVSKII [1937].

Letting û(ξ, t) denote the Fourier transform with respect to x of the solution at time t, we obtain by virtue of (4.1) the ordinary differential equation

    dû/dt (ξ, t) = P(iξ) û(ξ, t)  for t > 0,
    û(ξ, 0) = v̂(ξ),

so that, formally,

    û(ξ, t) = exp(tP(iξ)) v̂(ξ),
from which u(x, t) may be found by means of the inverse Fourier transform. Using
this together with Parseval's relation we find at once for the solution E(t)v of (4.2),

    ‖E(t)v‖ ≤ sup_{ξ∈R^d} |exp(tP(iξ))| ‖v‖,   (4.6)

and hence, by (4.4),

    ‖E(t)v‖ ≤ C sup_ξ e^{−ct|ξ|^M + Ct} ‖v‖ ≤ C e^{Ct} ‖v‖  for t > 0,
so that, in particular, the initial value problem (4.2) is correctly posed in L2 . This
estimate is valid also in the case of a system if we interpret the modulus in (4.6) as the
matrix norm subordinate to the vector norm used. In fact, for an N x N matrix A we
have with the above notation (cf. e.g. GELFAND and SCHILOW [1964])
    |exp A| ≤ exp(Λ(A)) Σ_{j=0}^{N−1} (2|A|)^j / j!,   (4.7)

from which, with A = tP(iξ),

    |exp(tP(iξ))| ≤ C e^{−ct|ξ|^M + Ct} (1 + (t(1 + |ξ|^M))^{N−1})
                  ≤ C_1 e^{−c_1 t|ξ|^M + C_1 t} ≤ C_1 e^{C_1 t}.   (4.8)


We note that the latter estimate also implies the smoothing property

    ‖D^α E(t)v‖ ≤ sup_{ξ∈R^d} |ξ^α exp(tP(iξ))| ‖v‖ ≤ C t^{−|α|/M} ‖v‖,   (4.9)

which is an important feature of parabolic equations to which we will return on several occasions below.
For the numerical solution of (4.2) we introduce mesh sizes h and k in space and time, respectively, which in this section we shall assume tied together by the requirement that the mesh ratio λ = kh^{−M} is kept constant. We may then consider an explicit single-step scheme

    U^{n+1}(x) = Σ_β a_β U^n(x − βh) = A_k U^n(x),

where β = (β_1, ..., β_d) has integer components and the sum is finite, and where the a_β = a_β(k, h) = a_β(λh^M, h) are N×N matrices in the system case. In accordance with our previous discussion we shall consider that x varies over all of R^d and not just over mesh points βh. With B_k a similar operator one may define an implicit scheme by

    B_k U^{n+1}(x) ≡ Σ_β b_β U^{n+1}(x − βh) = A_k U^n(x),  n ≥ 0,
    U^0 = v.   (4.10)

Including both possibilities we write

    U^{n+1} = E_k U^n = B_k^{−1} A_k U^n,

where B_k is the identity operator in the explicit case and assumed invertible in the implicit case.
Introducing the symbols of A_k and B_k by

    A_k(ξ) = Σ_β a_β e^{−iβ·ξh}  and  B_k(ξ) = Σ_β b_β e^{−iβ·ξh},

we may set

    E_k(ξ) = B_k(ξ)^{−1} A_k(ξ) = Σ_β e_β e^{−iβ·ξh},   (4.11)

and find from (4.10) for the Fourier transform of the discrete solution

    Û^n(ξ) = (E_k^n v)^(ξ) = E_k(ξ)^n v̂(ξ).   (4.12)

In the implicit case we assume, in order that B_k be invertible on L², that B_k(ξ) ≠ 0 (or, in the case of a system, that det B_k(ξ) ≠ 0), uniformly for small k. The series in (4.11) then in general has infinitely many terms.
We recall that the difference scheme (4.10) is consistent with (4.2) if the difference equation (4.10) is satisfied by the exact solution to order o(k), and that it is accurate of order μ if this error is of order O(kh^μ) as k tends to zero.

THEOREM 4.1. The operator E_k = B_k^{−1} A_k is consistent with (4.2) if and only if

    E_k(h^{−1}ξ) = exp(kP(ih^{−1}ξ)) + o(k + |ξ|^M)  as k, ξ → 0.

It is accurate of order μ if and only if

    E_k(h^{−1}ξ) = exp(kP(ih^{−1}ξ)) + O(h^{M+μ} + |ξ|^{M+μ})  as k, ξ → 0.   (4.13)

For certain extremely smooth initial data it turns out that, theoretically, consistency is all that is needed for convergence. For instance, if v is such that v̂ ∈ C_0^∞(R^d), then we have by Fourier's inversion formula, with u^n = u(·, nk),

    U^n(x) − u^n(x) = E_k^n v(x) − E(nk)v(x)
                    = (2π)^{−d} ∫_{R^d} (E_k(ξ)^n − exp(nkP(iξ))) v̂(ξ) e^{ix·ξ} dξ,

so that

    ‖U^n − u^n‖ ≤ C(v) sup_{ξ∈supp v̂} |E_k(ξ)^n − exp(nkP(iξ))|.   (4.14)

Now, by Theorem 4.1, we have, uniformly on the support of v̂(ξ), that

    |E_k(ξ) − exp(kP(iξ))| ≤ Ckh^μ.

In particular, we have for small k,

    |E_k(ξ)| ≤ |exp(kP(iξ))| + Ck ≤ e^{Ck},

so that, using also (4.8),

    |E_k(ξ)^n − exp(nkP(iξ))|
      = |Σ_{j=0}^{n−1} E_k(ξ)^{n−1−j} (E_k(ξ) − exp(kP(iξ))) exp(jkP(iξ))|
      ≤ Cnkh^μ e^{Cnk}.

Together with (4.14) this shows

    ‖U^n − u^n‖ ≤ Ch^μ  for nk ≤ T.
For numerical computation, however, as we know, stability of E_k is required, as otherwise perturbations such as those caused by roundoff errors or imprecise knowledge of data will make the computed solution useless. We shall therefore now consider stability with respect to the L²-norm, and begin by showing the following:

THEOREM 4.2. The operator E_k = B_k^{−1} A_k is stable in L² if and only if for some positive constants C, K and k_0 we have

    |E_k(ξ)^n| ≤ C e^{Knk}  for ξ ∈ R^d, n ≥ 0, k ≤ k_0.   (4.15)

PROOF. We have at once, using (4.12) and Parseval's relation,

    ‖E_k^n v‖ ≤ sup_ξ |E_k(ξ)^n| ‖v‖,

with equality for appropriately chosen v, from which the result follows. □

Note that (4.15) may also be written

    |(E_k(ξ) e^{−Kk})^n| ≤ C  for ξ ∈ R^d, n ≥ 0, k ≤ k_0,   (4.16)

so that stability is translated into the uniform boundedness of the powers of a family of matrices depending on the two parameters ξ and k.
In the event that P(D) does not contain any lower-order terms, it is often possible to choose the coefficients of E_k independent of h and k. In this case E_k(h^{−1}ξ) is independent of h, and (4.13) reduces to

    E(ξ) ≡ E_k(h^{−1}ξ) = exp(λP(iξ)) + O(|ξ|^{M+μ})  as ξ → 0,

and the stability condition (4.15) to

    |E(ξ)^n| ≤ C  for ξ ∈ R^d, n ≥ 0.   (4.17)
In the scalar case, (4.15) is equivalent to

    |E_k(ξ)| ≤ 1 + Ck,  ξ ∈ R^d, k ≤ k_0,

which we refer to as von Neumann's condition for stability; in the special case (4.17) it reduces to

    |E(ξ)| ≤ 1,  ξ ∈ R^d,

which we recognize from Section 1.
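For the forward Euler symbol of Section 1, E(ξ) = 1 − 2λ(1 − cos ξ), the condition can be checked by direct sampling; a sketch:

```python
import numpy as np

# Von Neumann check: sample |E(xi)| = |1 - 2*lam*(1 - cos(xi))| over [-pi, pi]
# and compare its supremum with 1 on the two sides of the threshold lam = 1/2.
xi = np.linspace(-np.pi, np.pi, 2001)

def sup_E(lam):
    return np.max(np.abs(1.0 - 2.0 * lam * (1.0 - np.cos(xi))))

print(sup_E(0.4))    # stable:   sup |E| = 1
print(sup_E(0.6))    # unstable: sup |E| = 1.4 > 1, attained at xi = pi
```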
In order to discuss the matrix case, we introduce for an N×N matrix A its spectral radius

    ρ(A) = max_j |λ_j|,

where {λ_j}_{j=1}^N are the eigenvalues of A. We then immediately find, since ρ(A) ≤ |A| for any A, that a necessary condition for (4.15) to hold is that for some positive C and k_0,

    ρ(E_k(ξ)) ≤ 1 + Ck  for ξ ∈ R^d, k ≤ k_0.   (4.18)

This condition is again referred to as von Neumann's condition for stability; in the special case (4.17) it reduces to

    ρ(E(ξ)) ≤ 1  for ξ ∈ R^d.
Von Neumann's condition is, however, not sufficient for stability in the matrix case. Conditions which are both necessary and sufficient for inequalities such as (4.17) (or (4.16)) to hold have been given in KREISS [1962] and in subsequent literature, mainly with applications to hyperbolic problems in mind. These conditions require, in addition to von Neumann's condition, that the behavior of the powers of E_k(ξ) is determined, in some sense, by its eigenvalues. A sufficient condition is that E_k(ξ) be a normal matrix (i.e. that it commutes with its adjoint E_k(ξ)*), so that it is unitarily equivalent to a diagonal matrix with the eigenvalues as entries. Other sufficient conditions assume that the matrix is uniformly equivalent to a triangular matrix and that certain conditions on the off-diagonal elements of the latter are satisfied. We shall not describe the details here but refer to e.g. RICHTMYER and MORTON [1967] for an account of this work.
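The gap between the spectral radius and the powers of a non-normal matrix can be seen in a few lines with a Jordan-type block (a sketch):

```python
import numpy as np

# rho(A) = 1 does not control |A^n| for a non-normal matrix: for the Jordan
# block below, A^n = [[1, n], [0, 1]], so the powers grow linearly in n.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
rho = np.max(np.abs(np.linalg.eigvals(A)))
growth = [np.linalg.norm(np.linalg.matrix_power(A, n), 2) for n in (1, 10, 100)]
print(rho, growth)
```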

In the parabolic case, as we shall see in the following theorem, a sufficient condition for stability is that the spectral radius of E_k(ξ) satisfies, with positive constants δ and C,

    ρ(E_k(h^{−1}ξ)) ≤ 1 − δ|ξ|^M + Ck  for |ξ_j| ≤ π, j = 1, ..., d   (4.19)

(or without the term Ck in the case that E_k(h^{−1}ξ) is independent of h). This condition, which is more restrictive than the von Neumann condition (4.18), is a generalization to the system case of a condition introduced by JOHN [1952], whose work we shall describe in more detail in Section 7. We shall therefore call a difference operator whose symbol satisfies the condition (4.19) parabolic in the sense of John.

THEOREM 4.3. Let the operator E_k = B_k^{−1} A_k be consistent with (4.2) and parabolic in the sense of John. Then it is stable in L².

PROOF. In the scalar case this is obvious since then

    |E_k(h^{−1}ξ)^n| ≤ (1 − δ|ξ|^M + Ck)^n ≤ e^{Cnk},  n ≥ 0.

In the case of a system one may use the fact that for an N×N matrix A with spectral radius ρ = ρ(A) one has (cf. THOMEE [1966])

    |A^n| ≤ C_N ρ^{n−N+1} [ρ^{N−1} + (n|A − I|)^{N−1}]  for n ≥ N.

(This may be thought of as a discrete analogue of (4.7).) Applied to E_k(h^{−1}ξ), using that consistency implies

    |E_k(h^{−1}ξ) − I| ≤ Ck(|h^{−1}ξ|^M + 1)  for |ξ_j| ≤ π, j = 1, ..., d,

this yields

    |E_k(h^{−1}ξ)^n| ≤ C exp(−nδ|ξ|^M) (nλ|ξ|^M + nk + 1)^{N−1} e^{Cnk}
                     ≤ C_1 e^{−cnδ|ξ|^M + Cnk} ≤ C_1 e^{Cnk}   (4.20)
    for |ξ_j| ≤ π, j = 1, ..., d,

and thus shows stability. □
In addition to being stable in L², parabolic difference operators enjoy a smoothing property analogous to the property of the continuous problem mentioned above in (4.9): Set, for any multi-index α, with e_j the unit vector in the direction of x_j,

    ∂^α v(x) = ∂_{x_1}^{α_1} ⋯ ∂_{x_d}^{α_d} v(x),  where ∂_{x_j} v(x) = (v(x + he_j) − v(x)) / h.

Then we have the following result.

THEOREM 4.4. Let E_k be consistent with (4.2) and parabolic in the sense of John. Then

    ‖∂^α E_k^n v‖ ≤ C(nk)^{−|α|/M} ‖v‖  for nk ≤ T.   (4.21)

PROOF. This follows by Fourier transformation upon observing that

    (∂^α v)^(ξ) = Π_{j=1}^d ((e^{ihξ_j} − 1)/h)^{α_j} v̂(ξ),

and hence, for U^n = E_k^n v,

    (∂^α U^n)^(ξ) = Π_{j=1}^d ((e^{ihξ_j} − 1)/h)^{α_j} E_k(ξ)^n v̂(ξ),

so that, using (4.20),

    ‖∂^α U^n‖ ≤ C sup_ξ (|ξ|^{|α|} e^{−cnk|ξ|^M}) ‖v‖ ≤ C(nk)^{−|α|/M} ‖v‖  for nk ≤ T. □

As we know from the Lax equivalence theorem, consistency and stability together imply convergence. We shall now give a more precise result about the rate of convergence, and consider first the case that the difference scheme is parabolic. Here we shall employ the Sobolev space H^μ = H^μ(R^d), with norm

    ‖v‖_{H^μ} = (Σ_{|α|≤μ} ‖D^α v‖²)^{1/2}.

THEOREM 4.5. Let E_k be consistent with (4.2), accurate of order μ, and parabolic in John's sense. Then

    ‖U^n − u^n‖ ≤ Ch^μ ‖v‖_{H^μ}  for nk ≤ T.

PROOF. By Theorem 4.1 we have, for h|ξ_j| ≤ π,

    |E_k(ξ) − exp(kP(iξ))| ≤ Ckh^μ (1 + |ξ|^{M+μ}).

Further, for these ξ, we have by (4.4) or (4.5),

    |exp(tP(iξ))| ≤ C e^{−ct|ξ|^M},  t ≤ T, c > 0,

and by (4.20),

    |E_k(ξ)^n| ≤ C e^{−cnk|ξ|^M}  for h|ξ_j| ≤ π, nk ≤ T, c > 0.   (4.22)

Writing

    E_k(ξ)^n − exp(nkP(iξ))
      = Σ_{j=0}^{n−1} E_k(ξ)^{n−1−j} (E_k(ξ) − exp(kP(iξ))) exp(jkP(iξ)),   (4.23)

we hence have

    |E_k(ξ)^n − exp(nkP(iξ))| ≤ Cnkh^μ (1 + |ξ|^{M+μ}) e^{−cnk|ξ|^M}
                              ≤ Ch^μ (1 + |ξ|^μ)
    for h|ξ_j| ≤ π, j = 1, ..., d, nk ≤ T.

Further, for the remaining ξ,

    |E_k(ξ)^n − exp(nkP(iξ))| ≤ C ≤ Ch^μ |ξ|^μ,  nk ≤ T,   (4.24)

and hence, for all ξ ∈ R^d,

    |E_k(ξ)^n − exp(nkP(iξ))| ≤ Ch^μ (1 + |ξ|^μ),  nk ≤ T.

Therefore, using Parseval's relation, we have

    ‖U^n − u^n‖² ≤ Ch^{2μ} ∫_{R^d} (1 + |ξ|^μ)² |v̂(ξ)|² dξ ≤ Ch^{2μ} ‖v‖²_{H^μ},

which is the desired result. □
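The O(h^μ) rate can be observed numerically; the following sketch (parameters are our own choices) applies the forward Euler scheme (μ = 2, λ = 0.4) to u_t = u_xx with periodic data v = sin x and compares the error on two grids:

```python
import numpy as np

# Convergence-rate check in the spirit of Theorem 4.5: forward Euler with
# lam = k/h**2 = 0.4 for u_t = u_xx and v = sin x; halving h (and so quartering k)
# should reduce the discrete L2 error at a fixed time by a factor of about 4.
lam, T = 0.4, 0.5

def l2_error(N):
    h = 2 * np.pi / N
    k = lam * h * h
    x = h * np.arange(N)
    U = np.sin(x)
    n = int(round(T / k))
    for _ in range(n):
        U = U + lam * (np.roll(U, -1) - 2 * U + np.roll(U, 1))
    return np.sqrt(h * np.sum((U - np.exp(-n * k) * np.sin(x)) ** 2))

ratio = l2_error(32) / l2_error(64)
print(ratio)     # close to 4
```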

If E_k is only stable in L² and not parabolic we obtain a slightly weaker result.

THEOREM 4.6. Let E_k be consistent with (4.2), accurate of order μ and L²-stable. Then for any ε > 0 we have, with C = C_ε,

    ‖U^n − u^n‖ ≤ Ch^μ ‖v‖_{H^{μ+ε}},  nk ≤ T.

PROOF. If E_k is only stable and not parabolic, we have to replace the use of the estimate (4.22) in (4.23) by simply the boundedness of E_k(ξ)^n and now obtain

    |E_k(ξ)^n − exp(nkP(iξ))| ≤ Ckh^μ Σ_{j=0}^{n−1} (1 + |ξ|^{M+μ}) e^{−cjk|ξ|^M}.

For the first term (j = 0) we have, since k = λh^M,

    Ckh^μ (1 + |ξ|^{M+μ}) ≤ Ch^μ (1 + |ξ|^μ)  for h|ξ_j| ≤ π, j = 1, ..., d,

and for j > 0 we have

    k|ξ|^{M+μ} e^{−cjk|ξ|^M} ≤ Ck|ξ|^{μ+ε} (jk)^{−(M−ε)/M}.

Hence,

    |E_k(ξ)^n − exp(nkP(iξ))|
      ≤ Ch^μ (1 + |ξ|^{μ+ε}) (1 + k Σ_{j=1}^{n−1} (jk)^{−(M−ε)/M})
      ≤ Ch^μ (1 + |ξ|^{μ+ε})  for h|ξ_j| ≤ π, j = 1, ..., d, nk ≤ T.

Using again (4.24) for the remaining ξ, this completes the proof by Parseval's relation. □

We shall briefly discuss the convergence of difference quotients of the discrete solution to derivatives of the exact solution of (4.2). Consider a finite difference operator

    Q_h v(x) = h^{−q} Σ_β q_β(h) v(x − βh),

where the sum is finite and the q_β(h) are polynomials in h. Assume Q_h is consistent with the differential operator

    Q(D)v = Σ_{|α|≤q} Q_α D^α v,

and accurate of order μ, so that, for smooth v,

    Q_h v(x) = Q(D)v(x) + O(h^μ)  as h → 0.

Introducing the symbol of Q_h,

    Q_h(ξ) = h^{−q} Σ_β q_β(h) e^{−iβ·ξh},

this condition implies

    |Q_h(ξ) − Q(iξ)| ≤ Ch^μ (1 + |ξ|^{q+μ}),  ξ ∈ R^d,

and we also have

    |Q_h(ξ)| ≤ C(1 + |ξ|^q),  h|ξ_j| ≤ π.
Using these properties it is now easy to prove the following result, which is related to property (4.21) and assumes the same regularity of v as Theorem 4.5:

THEOREM 4.7. Assume E_k is consistent with (4.2), accurate of order μ and parabolic in John's sense, and that Q_h is consistent with Q(D) and accurate of order μ. Then

    ‖Q_h U^n − Q(D)u^n‖ ≤ Ch^μ (nk)^{−q/M} ‖v‖_{H^μ}.

For even more regular initial data, v ∈ H^{q+μ}, say, the negative power of t = nk disappears in the above error estimate.
We want to consider briefly the case that the initial data of (4.2) are periodic of period 1, say. In this case, rather than working with Fourier integrals as above, it is natural to work with Fourier series. Thus, in this case v may be developed as

    v(x) = Σ_γ v̂_γ e^{2πiγ·x},

where the γ are multi-indices and where, with Q the unit cube [0, 1]^d,

    v̂_γ = ∫_Q v(x) e^{−2πiγ·x} dx.

In this case Parseval's identity reads

    ‖v‖²_{L²(Q)} = Σ_γ |v̂_γ|².

Applying now a finite difference method of the form (4.10), say, with h = 1/ν for some positive integer ν and k = λh^M, we find, with our old notation,

    U^n(x) = Σ_γ v̂_γ E_k(2πγ)^n e^{2πiγ·x},

and hence

    ‖U^n‖²_{L²(Q)} = Σ_γ |E_k(2πγ)^n v̂_γ|² ≤ max_γ |E_k(2πγ)^n|² ‖v‖²_{L²(Q)}.

Stability therefore requires

    max_γ |E_k(2πγ)^n| ≤ C e^{Knk},

for all sufficiently small k. In particular, our previous stability criterion (4.15) shows stability also in the present case.

5. Particular single-step schemes for constant coefficient problems

In this section we shall present several examples of single-step finite difference schemes for constant coefficient problems and discuss their stability and accuracy by the techniques of Section 4. Many of the examples are standard and we refer to
the techniques of Section 4. Many of the examples are standard and we refer to
a basic text such as RICHTMYER and MORTON [1967] for details and further
references.
In Section 1 we already considered several finite difference schemes for the model
one-dimensional heat equation
    ∂u/∂t = ∂²u/∂x²  for x ∈ R, t > 0,   (5.1)
with initial values given on the real axis. The basic examples were the forward and
backward Euler methods and the Crank-Nicolson method (the latter two
applicable in practice only to periodic problems). These may all be considered to be
special cases of the so-called θ-method defined by

    ∂_t U^n(x) = θ ∂_x∂̄_x U^{n+1}(x) + (1 − θ) ∂_x∂̄_x U^n(x),

with θ = 0 for the forward Euler, θ = 1 for the backward Euler and θ = ½ for the Crank–Nicolson method.
Written in the basic form (4.10) we have

    (I − θk ∂_x∂̄_x) U^{n+1} = (I + (1 − θ)k ∂_x∂̄_x) U^n,

and we find for the symbol of the corresponding operator E_k

    E(ξ) = (1 − 2(1 − θ)λ(1 − cos ξ)) / (1 + 2θλ(1 − cos ξ)),

where λ = k/h² is considered to be fixed. Assuming 0 ≤ θ ≤ 1 we have


    E(ξ) ≤ 1  for ξ ∈ R,

and the stability requirement reduces to

    min_ξ E(ξ) = (1 − 4(1 − θ)λ) / (1 + 4θλ) ≥ −1,

or

    (1 − 2θ)λ ≤ ½.   (5.2)

Hence the θ-method is unconditionally L²-stable, i.e. stable for all λ, if θ ≥ ½, whereas for θ < ½ stability holds if and only if

    λ ≤ 1 / (2(1 − 2θ)).
We note that if strict inequality holds in (5.2), then the method is parabolic, since in this case

    |E(ξ)| < 1  for 0 < |ξ| ≤ π.

In particular, the Crank–Nicolson method is parabolic in the sense of John for any fixed λ.
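These stability statements are easy to probe numerically; a sketch (the sampling grid is our own choice):

```python
import numpy as np

# Sample the theta-method symbol E(xi) and compare sup |E| <= 1 with the
# predicted criterion (1 - 2*theta)*lam <= 1/2 from (5.2).
xi = np.linspace(-np.pi, np.pi, 4001)

def E(theta, lam):
    s = 2.0 * lam * (1.0 - np.cos(xi))
    return (1.0 - (1.0 - theta) * s) / (1.0 + theta * s)

for theta, lam in ((0.0, 0.5), (0.0, 0.51), (1.0, 50.0), (0.25, 1.0)):
    observed = np.max(np.abs(E(theta, lam))) <= 1.0 + 1e-12
    predicted = (1.0 - 2.0 * theta) * lam <= 0.5 + 1e-12
    print(theta, lam, observed, predicted)
```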
We shall discuss the accuracy of the θ-method. We have

    2(1 − cos ξ) = ξ² − (1/12)ξ⁴ + (1/360)ξ⁶ + O(ξ⁸)  as ξ → 0,

and hence

    E(ξ) = (1 − (1 − θ)λ(ξ² − (1/12)ξ⁴ + (1/360)ξ⁶)) / (1 + θλ(ξ² − (1/12)ξ⁴ + (1/360)ξ⁶)) + O(ξ⁸)
         = 1 − λξ² + ((1/12)λ + θλ²)ξ⁴ − (1/360)λ(1 + 60θλ + 360θ²λ²)ξ⁶ + O(ξ⁸)  as ξ → 0.

For equation (5.1) we have P(iξ) = −ξ² and thus

    e^{λP(iξ)} = e^{−λξ²} = 1 − λξ² + ½λ²ξ⁴ − (1/6)λ³ξ⁶ + O(ξ⁸).

Hence, for any choice of θ and λ,

    E(ξ) = e^{−λξ²} + O(ξ⁴)  as ξ → 0,

so that the method is always at least second-order accurate.


Furthermore, the scheme is accurate of order 4, or

    E(ξ) = e^{−λξ²} + O(ξ⁶)  as ξ → 0,

if (and only if)

    λ((1/12) + θλ) = ½λ²,

or

    θ = ½ − 1/(12λ).

Note that this is in the stable range as now

    (1 − 2θ)λ = 1/6 < ½.
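That this choice really raises the order can be verified by halving ξ and watching the symbol error shrink by about 2⁶; a numerical sketch:

```python
import numpy as np

# With theta = 1/2 - 1/(12*lam) the symbol error E(xi) - exp(-lam*xi**2)
# is O(xi**6): halving xi should divide the error by roughly 2**6 = 64.
lam = 1.0
theta = 0.5 - 1.0 / (12.0 * lam)

def symbol_error(xi):
    s = 2.0 * lam * (1.0 - np.cos(xi))
    E = (1.0 - (1.0 - theta) * s) / (1.0 + theta * s)
    return abs(E - np.exp(-lam * xi ** 2))

ratio = symbol_error(0.2) / symbol_error(0.1)
print(ratio)     # close to 64
```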
Finally, the scheme is of order 6, or

    E(ξ) = e^{−λξ²} + O(ξ⁸)  as ξ → 0,

if, in addition,

    (1/6)λ³ = (1/360)λ(1 + 60θλ + 360θ²λ²),

or

    1 + 60θλ + 360θ²λ² = 60λ²,

or, with θλ = ½λ − 1/12, if λ² = 1/20, i.e. λ = √5/10 ≈ 0.224, θ ≈ 0.127. Note that this value of λ is relatively small, so that the scheme then requires many time steps. This choice is, however, in the stable range as above. We emphasize once more that the investigation presupposes that λ = k/h² is kept constant.
The stability considerations above may be generalized to several space dimensions, to apply to the equation

    ∂u/∂t = Σ_{j=1}^d ∂²u/∂x_j².

Setting

    Δ_h V(x) = Σ_{j=1}^d ∂_{x_j}∂̄_{x_j} V(x),

where

    ∂_{x_j} V(x) = (V(x + he_j) − V(x)) / h,   ∂̄_{x_j} V(x) = (V(x) − V(x − he_j)) / h,

we may consider the θ-method defined by

    ∂_t U^n(x) = θ Δ_h U^{n+1}(x) + (1 − θ) Δ_h U^n(x),  0 ≤ θ ≤ 1,

or

    (I − θk Δ_h) U^{n+1} = (I + (1 − θ)k Δ_h) U^n.

The corresponding operator E_k has its symbol defined by

    E(ξ) = (1 − (1 − θ)λD(ξ)) / (1 + θλD(ξ)),  where D(ξ) = 2 Σ_{j=1}^d (1 − cos ξ_j).

For θ = 0, 1 and ½ these are again referred to as the forward Euler, the backward Euler and the Crank–Nicolson methods. Here

    min_ξ E(ξ) = (1 − 4(1 − θ)λd) / (1 + 4θλd),

and hence the method is now L²-stable if and only if

    2λd(1 − 2θ) ≤ 1,

and parabolic in John's sense if strict inequality holds. For θ ≥ ½ this is always satisfied, and for θ < ½ the stability condition reads

    λ ≤ 1 / (2d(1 − 2θ)).
To consider the accuracy of these methods we note that

    D(ξ) = 2 Σ_{j=1}^d (1 − cos ξ_j) = |ξ|² − (1/12) Σ_{j=1}^d ξ_j⁴ + O(|ξ|⁶)  as ξ → 0,

and hence, by a simple calculation,

    E(ξ) = 1 − λ|ξ|² + (λ/12) Σ_{j=1}^d ξ_j⁴ + θλ²|ξ|⁴ + O(|ξ|⁶),

and, since now P(iξ) = −|ξ|²,

    exp(λP(iξ)) = 1 − λ|ξ|² + ½λ²|ξ|⁴ + O(|ξ|⁶).

We conclude that the difference scheme is always second-order accurate and never of higher order if d > 1.
Returning to the one-dimensional equation (5.1), we shall show now that for any given positive integer ν there exists a unique explicit method for (5.1) of accuracy μ = 2ν using the (2ν + 1) points x + jh, j = −ν, ..., ν. This method will be shown to be parabolic for λ sufficiently small.
We shall prove this by constructing an even trigonometric polynomial

    E(ξ) = E(ξ, λ) = Σ_{j=0}^ν a_j(λ) cos jξ

such that

    E(ξ, λ) = e^{−λξ²} + O(ξ^{2ν+2})  as ξ → 0,   (5.3)

and such that, for small λ,

    |E(ξ, λ)| < 1  for 0 < |ξ| ≤ π.   (5.4)

Clearly an even trigonometric polynomial of order ν is uniquely determined by (5.3).
We start the construction by noting that in the Maclaurin expansion

    arcsin z = z + (1/6)z³ + (3/40)z⁵ + ⋯ = Σ_{j=0}^∞ b_{2j+1} z^{2j+1}

all the coefficients b_{2j+1} are positive. Hence we have

    z = arcsin(sin z) = Σ_{j=0}^∞ b_{2j+1} (sin z)^{2j+1},  with b_{2j+1} > 0,

and, by taking the (2l)th power and replacing z by ξ/2,

    ξ^{2l} = Σ_{j=l}^∞ β_{l,2j} (2 sin²(ξ/2))^j = Σ_{j=l}^∞ β_{l,2j} (1 − cos ξ)^j,  with β_{l,2j} > 0.

Hence, if l ≤ ν, we have

    ξ^{2l} = T_{l,2ν}(ξ) + O(ξ^{2ν+2})  as ξ → 0,   (5.5)

where thus

    T_{l,2ν}(ξ) = Σ_{j=l}^ν β_{l,2j} (1 − cos ξ)^j > 0  for 0 < |ξ| ≤ π.   (5.6)

Now

    e^{−λξ²} = Σ_{l=0}^ν ((−λ)^l / l!) ξ^{2l} + O(ξ^{2ν+2})  as ξ → 0,

and hence (5.3) is satisfied if we set

    E(ξ, λ) = Σ_{l=0}^ν ((−λ)^l / l!) T_{l,2ν}(ξ),  with T_{0,2ν} = 1.

Further, by (5.6), we find that E(ξ, λ) may be written

    E(ξ, λ) = 1 − λ T_{1,2ν}(ξ) (1 + Σ_{l=1}^{ν−1} λ^l Q_l(ξ)),

with the Q_l(ξ) bounded. This implies (5.4) for small λ and thus completes the proof. □

For ν = 2 we easily find

    T_{1,4}(ξ) = 2(1 − cos ξ) + (1/3)(1 − cos ξ)² = 5/2 − (8/3) cos ξ + (1/6) cos 2ξ,
    T_{2,4}(ξ) = 4(1 − cos ξ)² = 6 − 8 cos ξ + 2 cos 2ξ,

and hence

    E(ξ, λ) = 1 − (5/2)λ + 3λ² + ((8/3)λ − 4λ²) cos ξ + (λ² − (1/6)λ) cos 2ξ,

which corresponds to the fourth-order operator discussed in Section 1.
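A quick numerical sketch confirming the two defining properties of this symbol for a small λ:

```python
import numpy as np

# The fourth-order explicit symbol for nu = 2: check E - exp(-lam*xi**2) = O(xi**6)
# (error ratio ~ 2**6 when xi is halved) and |E| < 1 on 0 < |xi| <= pi.
def E(xi, lam):
    return (1.0 - 2.5 * lam + 3.0 * lam ** 2
            + (8.0 * lam / 3.0 - 4.0 * lam ** 2) * np.cos(xi)
            + (lam ** 2 - lam / 6.0) * np.cos(2.0 * xi))

lam = 0.1
ratio = (abs(E(0.2, lam) - np.exp(-lam * 0.2 ** 2))
         / abs(E(0.1, lam) - np.exp(-lam * 0.1 ** 2)))
xi = np.linspace(1e-3, np.pi, 2000)
print(ratio, np.max(np.abs(E(xi, lam))))
```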


A similar argument was used in KREISS [1959a] to construct L²-stable explicit finite difference schemes for general parabolic equations, even in the case of variable coefficients. See also STRANG [1963].
One other way of defining finite difference methods for the model equation (5.1)
with prescribed accuracy and stability properties is to proceed as follows. As the
purpose is to find a trigonometric rational function E(ξ) with the property

    E(ξ) = e^{−λξ²} + O(ξ^{μ+2})  as ξ → 0,   (5.7)

it is natural to determine first a trigonometric polynomial A(ξ) such that

    A(ξ) = Σ_β a_β e^{−iβξ} = −ξ² + O(ξ^{μ+2})  as ξ → 0,

and then choose a rational function r(z) such that

    r(z) = e^z + O(z^{q+1})  as z → 0.

Then, if 2q ≥ μ, we have that

    E(ξ) = r(λA(ξ)) = e^{λA(ξ)} + O(ξ^{2(q+1)})
         = e^{−λξ²} + O(ξ^{μ+2}) + O(ξ^{2q+2}) = e^{−λξ²} + O(ξ^{μ+2}),

which is the desired property (5.7).


Now, if

    −Λ ≤ A(ξ) ≤ 0  for ξ ∈ R,   (5.8)

and

    |r(z)| ≤ 1  for −τ ≤ z ≤ 0,   (5.9)

then

    |E(ξ)| = |r(λA(ξ))| ≤ 1  for ξ ∈ R,  if λΛ ≤ τ,

so that the corresponding finite difference operator is L²-stable. If A and r satisfy the stronger conditions

    −Λ ≤ A(ξ) < 0  for 0 < |ξ| ≤ π,   (5.10)

and

    |r(z)| < 1  for −τ ≤ z < 0,   (5.11)

then

    |E(ξ)| < 1  for 0 < |ξ| ≤ π,  if λΛ ≤ τ,

so that the method is then parabolic.
For instance, the θ-method considered above is of this form with

    A(ξ) = −2(1 − cos ξ) = −ξ² + O(ξ⁴)  as ξ → 0,

and

    r(z) = (1 + (1 − θ)z) / (1 − θz) = e^z + O(z^{q+1})  as z → 0,

with q = 1 for θ ≠ ½, q = 2 for θ = ½. Here (5.8) and (5.10) hold with Λ = 4, and (5.9) and (5.11) with τ = ∞ if θ ≥ ½, τ = 2/(1 − 2θ) if θ < ½, and we recognize our above conditions for accuracy and stability for this method.
This procedure for constructing a finite difference operator may be thought of, in the following way, as a discretization in the space variable, followed by a separate discretization in time: Replace first the differential operator ∂²/∂x² in (5.1) by a finite difference operator

    A_h V(x) = h^{−2} Σ_β a_β V(x − βh),

and consider thus the initial value problem

    ∂U/∂t = A_h U  for t > 0.

Introducing the Fourier transform Û(ξ, t) of U(x, t) with respect to x, we find for this function the ordinary differential equation

    dÛ/dt (ξ, t) = h^{−2} A(hξ) Û(ξ, t)  for t > 0,   (5.12)

or, by integration between t_n = nk and t_{n+1} = (n + 1)k,

    Û(ξ, t_{n+1}) = exp(λA(hξ)) Û(ξ, t_n).   (5.13)

This may now be approximated, with r as above, by

    Û^{n+1}(ξ) = r(λA(hξ)) Û^n(ξ) = E(hξ) Û^n(ξ).

The property (5.10) is referred to as ellipticity of the finite difference operator A_h. The use of special rational functions r(z) to replace e^z in (5.13) may be interpreted as the use of a Runge–Kutta type method for the integration of the ordinary differential equation (5.12) (cf. e.g. HENRICI [1962]).

The discretization in space only, leading to the above system of ordinary differential equations with respect to time, with the values at the mesh points as the unknowns, is sometimes referred to in the literature as the method of lines (cf. e.g. FRANKLIN [1969], LEES [1961], VARGA [1962]). Discretization in time only, or the method of ROTHE [1931], can sometimes also be used as a preliminary step.
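A minimal method-of-lines sketch (grid sizes and the Runge–Kutta integrator are our own choices) for the model problem:

```python
import numpy as np

# Method of lines for u_t = u_xx: discretize in space by the second difference
# quotient (periodic), then integrate the ODE system U' = A_h U by classical RK4.
N = 32
h = 2 * np.pi / N
x = h * np.arange(N)

def Ah(U):
    return (np.roll(U, -1) - 2 * U + np.roll(U, 1)) / h ** 2

k = 0.25 * h * h                     # well inside the explicit stability range
U, t = np.sin(x), 0.0
while t < 1.0 - 1e-12:
    s1 = Ah(U)
    s2 = Ah(U + 0.5 * k * s1)
    s3 = Ah(U + 0.5 * k * s2)
    s4 = Ah(U + k * s3)
    U = U + (k / 6.0) * (s1 + 2 * s2 + 2 * s3 + s4)
    t += k
err = np.max(np.abs(U - np.exp(-t) * np.sin(x)))
print(err)
```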
These considerations may be generalized to a scalar parabolic equation of the form

    ∂u/∂t = P(D)u ≡ Σ_{|α|=M} P_α D^α u,   (5.14)

where, for simplicity, the elliptic operator P(D) only includes derivatives of the highest order M and has real coefficients. We may then first approximate P(D) by a finite difference operator

    Q_h V(x) = h^{−M} Σ_β q_β V(x − βh),   (5.15)

for which we introduce the symbol Q(ξ) satisfying

    Q(ξ) = Σ_β q_β e^{−iβ·ξ} = P(iξ) + O(|ξ|^{M+μ})  as ξ → 0,

where μ is the order of accuracy. The finite difference operator (5.15) is said to be elliptic (cf. THOMEE [1964]) if

    Q(ξ) < 0  for |ξ_j| ≤ π, j = 1, ..., d, ξ ≠ 0.

We then again take a rational function r(z) as above and form E(ξ) = r(λQ(ξ)) with λ = k/h^M, and now find, if qM ≥ μ,

    E(ξ) = e^{λQ(ξ)} + O(|ξ|^{M(q+1)}) = e^{λP(iξ)} + O(|ξ|^{M+μ}).

The discussion of the stability is analogous to the above.


In particular, if we can construct an elliptic finite difference operator consistent with P(D), then the θ-method with θ = 0, 1 or ½ gives an explicit forward Euler, an implicit backward Euler or a Crank–Nicolson method for (5.14), which is L²-stable or parabolic under the appropriate assumption.
Recalling that −P(D) is elliptic if

    −P(iξ) = −(−1)^{M/2} Σ_{|α|=M} P_α ξ^α ≥ c|ξ|^M  for ξ ∈ R^d, c > 0,
it is natural to try to construct an elliptic finite difference operator consistent with P(D) simply by replacing the derivatives in P(D) by symmetric finite difference quotients, which corresponds to taking

    Q(ξ) = P(i sin ξ) = P(iξ) + O(|ξ|^{M+2})  as ξ → 0,

where sin ξ = (sin ξ_1, ..., sin ξ_d). The polynomial Q(ξ) thus defined is nonpositive and may be used to construct an L²-stable finite difference operator by setting E(ξ) = r(λQ(ξ)) with r(z) as above. However, as Q(ξ) = 0 if, for instance, ξ = (π, ..., π), this operator is not elliptic in the above sense, and the corresponding finite difference scheme for (5.14) is thus not parabolic. In order to make Q_h elliptic one may modify the definition by setting

    Q(ξ) = P(i sin ξ) − (Σ_{j=1}^d (1 − cos ξ_j))^M,

which does not change the consistency since the additional term is of order O(|ξ|^{2M}) as ξ → 0.
The operator thus defined uses other than the closest possible neighbors of the mesh point x. For instance, for the second-order operator

    P(D) = Σ_{j,k=1}^d p_{jk} ∂²/∂x_j∂x_k,

the term p_{11} ∂²u/∂x_1² is replaced by

    p_{11} (u(x + 2he_1) − 2u(x) + u(x − 2he_1)) / (4h²).
Another possible elliptic finite difference operator which does not have this disadvantage is

    Q_h = Σ_{j,k=1}^d p_{jk} ∂_{x_j} ∂̄_{x_k},

with symbol

    Q(ξ) = Σ_{j,k=1}^d p_{jk} (e^{iξ_j} − 1)(1 − e^{−iξ_k}),

which is elliptic as

    −Q(ξ) = Σ_{j,k=1}^d p_{jk} (1 − cos ξ_j − i sin ξ_j)(1 − cos ξ_k + i sin ξ_k)
          = Σ_{j,k} p_{jk} (1 − cos ξ_j)(1 − cos ξ_k) + Σ_{j,k} p_{jk} sin ξ_j sin ξ_k
          ≥ c Σ_{j=1}^d ((1 − cos ξ_j)² + sin² ξ_j) = 2c Σ_{j=1}^d (1 − cos ξ_j).

Another common choice is

    Q_h V(x) = Σ_j p_{jj} ∂_{x_j} ∂̄_{x_j} V(x) + Σ_{j≠k} p_{jk} ∂̂_{x_j} ∂̂_{x_k} V(x),

where ∂̂_{x_j} = ½(∂_{x_j} + ∂̄_{x_j}), or

    −Q(ξ) = 2 Σ_j p_{jj} (1 − cos ξ_j) + Σ_{j≠k} p_{jk} sin ξ_j sin ξ_k,
which is easily seen to imply that Q_h is elliptic. For d = 2 the latter choice corresponds to

    Q_h V(x) = Σ_{j=1}^2 p_{jj} (V(x + he_j) − 2V(x) + V(x − he_j)) / h²
             + p_{12} (V(x + he_1 + he_2) − V(x + he_1 − he_2)
                       − V(x − he_1 + he_2) + V(x − he_1 − he_2)) / (2h²),

and thus uses the nine points x, x ± he_j, x ± he_1 ± he_2.
Let us remark about the simple model problem (5.1) that, by our earlier discussion, we know that there exists a trigonometric polynomial of order ν such that (cf. (5.5))

    A(ξ) = −T_{1,2ν}(ξ) = −Σ_{j=1}^ν β_{1,2j} (1 − cos ξ)^j = −ξ² + O(ξ^{2ν+2})  as ξ → 0,

and such that by (5.6) the corresponding difference operator,

    Q_h = −h^{−2} Σ_{j=1}^ν β_{1,2j} (−½h² ∂_x∂̄_x)^j,

is elliptic.
The above makes it natural to ask for rational functions $r(z)$ which approximate $e^z$ near $z=0$ and which satisfy the appropriate boundedness conditions. The most commonly used functions of this type are the Padé approximants (cf. e.g. VARGA [1961, 1962]), which are defined by

$$r(z)=r_{p,q}(z)=\frac{n_{p,q}(z)}{d_{p,q}(z)}=e^z+O(z^{p+q+1})\quad\text{as }z\to0,$$

where $d_{p,q}$ and $n_{p,q}$ are polynomials of degree $p$ and $q$, respectively. One may show that these polynomials are uniquely determined by

$$n_{p,q}(z)=\sum_{j=0}^q\frac{(p+q-j)!\,q!}{(p+q)!\,j!\,(q-j)!}\,z^j$$

and

$$d_{p,q}(z)=\sum_{j=0}^p\frac{(p+q-j)!\,p!}{(p+q)!\,j!\,(p-j)!}\,(-z)^j.$$

For $p,q\le1$ we recognize the rational functions $r_{0,1}$, $r_{1,0}$ and $r_{1,1}$ corresponding to the forward and backward Euler methods and to Crank-Nicolson's method.
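The two coefficient formulas can be turned into a small routine; the function name `pade_exp` is ours, not from the text.

```python
import math

def pade_exp(p, q):
    """Coefficients (ascending powers) of the numerator n_{p,q} and the
    denominator d_{p,q} of the Pade approximant r_{p,q}(z) to e^z."""
    n = [math.factorial(p + q - j) * math.factorial(q)
         / (math.factorial(p + q) * math.factorial(j) * math.factorial(q - j))
         for j in range(q + 1)]
    d = [math.factorial(p + q - j) * math.factorial(p)
         / (math.factorial(p + q) * math.factorial(j) * math.factorial(p - j))
         * (-1) ** j
         for j in range(p + 1)]
    return n, d

# r_{0,1}: forward Euler, r_{1,0}: backward Euler, r_{1,1}: Crank-Nicolson.
print(pade_exp(0, 1))   # ([1.0, 1.0], [1.0])        -> 1 + z
print(pade_exp(1, 0))   # ([1.0], [1.0, -1.0])       -> 1/(1 - z)
print(pade_exp(1, 1))   # ([1.0, 0.5], [1.0, -0.5])  -> (1 + z/2)/(1 - z/2)
```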
It is known that

$$|r_{p,q}(z)|<1\quad\text{for }z<0,\text{ if }p\ge q.$$

For $p<q$, $r_{p,q}(z)$ is not bounded, but there is a positive $\tau_{p,q}$ such that

$$|r_{p,q}(z)|\le1\quad\text{for }-\tau_{p,q}\le z\le0.$$

In particular, the functions $r_{0,q}(z)$ give rise to explicit methods for which, in general, the stability interval is rather short. We have, for instance,

$$r_{0,1}(z)=1+z,\quad\text{with }\tau_{0,1}=2,$$
$$r_{0,2}(z)=1+z+\tfrac12z^2,\quad\text{with }\tau_{0,2}=2,$$
$$r_{0,3}(z)=1+z+\tfrac12z^2+\tfrac16z^3,\quad\text{with }\tau_{0,3}\approx2.51.$$
The above raises the problem of finding the polynomial $r_q(z)$ of degree at most $q$ for which

$$r_q(z)=e^z+O(z^2)\quad\text{as }z\to0,$$

and for which the stability interval is as long as possible. It is shown in FRANKLIN [1959] that the best such polynomial is

$$r_q(z)=(-1)^q2^{2q-1}\,T_{2q}\bigl(q^{-1}(-\tfrac12z)^{1/2}\bigr),$$

where $T_{2q}$ is the normalized Chebyshev polynomial

$$T_{2q}(z)=2^{1-2q}\cos(2q\arccos z),$$

and that the corresponding stability interval is $[-2q^2,0]$. The first of these polynomials are

$$r_1(z)=1+z,\quad\text{with }\tau_1=2,$$
$$r_2(z)=1+z+\tfrac18z^2,\quad\text{with }\tau_2=8,$$
$$r_3(z)=1+z+\tfrac4{27}z^2+\tfrac4{729}z^3,\quad\text{with }\tau_3=18.$$

For the model equation (5.1), the second of these corresponds to the explicit scheme given by

$$E(\xi)=\tfrac14\lambda^2\cos2\xi+\lambda(2-\lambda)\cos\xi+1-2\lambda+\tfrac34\lambda^2,$$

which is stable if $\lambda\le2$, whereas the corresponding Padé scheme has

$$E(\xi)=\lambda^2\cos2\xi+2\lambda(1-2\lambda)\cos\xi+1-2\lambda+3\lambda^2$$

and is stable only for $\lambda\le\tfrac12$.
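Both stability claims can be confirmed directly from these symbols; the grid resolution and tolerances below are arbitrary choices of ours.

```python
import numpy as np

xi = np.linspace(-np.pi, np.pi, 2001)

def symbol(r_coeffs, lam):
    # Scheme symbol E(xi) = r(z) with z = lam*Q(xi), Q(xi) = 2*(cos xi - 1).
    z = 2.0 * lam * (np.cos(xi) - 1.0)
    return sum(c * z**j for j, c in enumerate(r_coeffs))

E_cheb = symbol([1.0, 1.0, 1.0 / 8.0], 2.0)          # Chebyshev r_2 at lam = 2
# agrees with the closed form of E(xi) displayed above (here lam = 2)
E_form = (0.25 * 4.0 * np.cos(2 * xi) + 2.0 * (2.0 - 2.0) * np.cos(xi)
          + 1.0 - 4.0 + 0.75 * 4.0)
assert np.allclose(E_cheb, E_form)

assert np.max(np.abs(E_cheb)) <= 1.0 + 1e-12                        # stable at lam = 2
assert np.max(np.abs(symbol([1.0, 1.0, 0.5], 0.5))) <= 1.0 + 1e-12  # Pade, lam = 1/2
assert np.max(np.abs(symbol([1.0, 1.0, 0.5], 0.6))) > 1.0           # Pade, lam > 1/2
```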


As our next example we shall consider the so-called alternating direction scheme of PEACEMAN and RACHFORD [1955] for the heat equation in two space dimensions,

$$\frac{\partial u}{\partial t}=\frac{\partial^2u}{\partial x_1^2}+\frac{\partial^2u}{\partial x_2^2},\quad x\in\mathbb R^2,\ t>0.$$

We recall that the Crank-Nicolson scheme for this equation is defined by

$$(I-\tfrac12k\Delta_h)U^{n+1}=(I+\tfrac12k\Delta_h)U^n,\quad n\ge0,$$

where

$$\Delta_hV=\partial_{x_1}\bar\partial_{x_1}V+\partial_{x_2}\bar\partial_{x_2}V,$$
and observe that the strictly two-dimensional discrete elliptic operator $I-\tfrac12k\Delta_h$ is involved in solving for $U^{n+1}$. The purpose of the alternating direction method is to reduce the computational labor by solving instead two one-dimensional equations. For this purpose we introduce an intermediate value $U^{n+1/2}$ for the solution at $t=(n+\tfrac12)k$ by the equations

$$\frac{U^{n+1/2}-U^n}{\tfrac12k}=\partial_{x_1}\bar\partial_{x_1}U^{n+1/2}+\partial_{x_2}\bar\partial_{x_2}U^n,$$
$$\frac{U^{n+1}-U^{n+1/2}}{\tfrac12k}=\partial_{x_1}\bar\partial_{x_1}U^{n+1/2}+\partial_{x_2}\bar\partial_{x_2}U^{n+1},$$

or

$$(I-\tfrac12k\partial_{x_1}\bar\partial_{x_1})U^{n+1/2}=(I+\tfrac12k\partial_{x_2}\bar\partial_{x_2})U^n,$$
$$(I-\tfrac12k\partial_{x_2}\bar\partial_{x_2})U^{n+1}=(I+\tfrac12k\partial_{x_1}\bar\partial_{x_1})U^{n+1/2}.$$

The first step is thus implicit with respect to the $x_1$ variable and explicit with respect to $x_2$, and in the second step the roles of the variables are reversed. Elimination of $U^{n+1/2}$ gives, since the various operators commute,

$$U^{n+1}=E_kU^n=(I-\tfrac12k\partial_{x_1}\bar\partial_{x_1})^{-1}(I+\tfrac12k\partial_{x_1}\bar\partial_{x_1})(I-\tfrac12k\partial_{x_2}\bar\partial_{x_2})^{-1}(I+\tfrac12k\partial_{x_2}\bar\partial_{x_2})U^n,$$

and we find for the symbol of $E_k$, with $\lambda=k/h^2$,

$$E(\xi)=\prod_{j=1}^2\frac{1-\lambda(1-\cos\xi_j)}{1+\lambda(1-\cos\xi_j)}.$$
It is clear from this representation that $E_k$ is unconditionally stable in $L_2$. By an obvious calculation we have

$$E(\xi)=\bigl(e^{-\lambda\xi_1^2}+O(\xi_1^4)\bigr)\bigl(e^{-\lambda\xi_2^2}+O(\xi_2^4)\bigr)=e^{-\lambda|\xi|^2}+O(|\xi|^4)\quad\text{as }\xi\to0,$$

so that the scheme is second-order accurate. (In fact, the scheme is easily seen to be second-order accurate in both space and time when $h$ and $k$ are allowed to vary independently of each other.)
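The unconditional stability is immediate from the product form of the symbol, since each factor is of the form $(1-a)/(1+a)$ with $a\ge0$; the values of $\lambda$ used below are arbitrary samples.

```python
import numpy as np

def E_pr(lam, xi1, xi2):
    """Symbol of the Peaceman-Rachford operator, lam = k/h^2."""
    f1 = (1.0 - lam * (1.0 - np.cos(xi1))) / (1.0 + lam * (1.0 - np.cos(xi1)))
    f2 = (1.0 - lam * (1.0 - np.cos(xi2))) / (1.0 + lam * (1.0 - np.cos(xi2)))
    return f1 * f2

xi = np.linspace(-np.pi, np.pi, 301)
X1, X2 = np.meshgrid(xi, xi)
for lam in (0.1, 1.0, 100.0):          # unconditional stability: any lam works
    assert np.max(np.abs(E_pr(lam, X1, X2))) <= 1.0
```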
We shall derive this method in a slightly different way, which immediately generalizes to the heat equation in several space dimensions,

$$\frac{\partial u}{\partial t}=\Delta u=\sum_{j=1}^d\frac{\partial^2u}{\partial x_j^2}.\quad(5.16)$$

The method is now referred to as the fractional step method and uses the auxiliary values $U^{n+j/d}$, $j=1,\dots,d$, with the final value $U^{n+1}$, defined by

$$\frac{U^{n+j/d}-U^{n+(j-1)/d}}k=\partial_{x_j}\bar\partial_{x_j}\,\frac{U^{n+j/d}+U^{n+(j-1)/d}}2,$$

or

$$U^{n+j/d}=(I-\tfrac12k\partial_{x_j}\bar\partial_{x_j})^{-1}(I+\tfrac12k\partial_{x_j}\bar\partial_{x_j})U^{n+(j-1)/d}\quad\text{for }j=1,\dots,d,$$

and thus, again with commuting operators,

$$U^{n+1}=E_kU^n=\prod_{j=1}^d(I-\tfrac12k\partial_{x_j}\bar\partial_{x_j})^{-1}(I+\tfrac12k\partial_{x_j}\bar\partial_{x_j})U^n.$$

In this approach one thus approximates the operators in the sum in (5.16) separately, and obtains $U^{n+1}$ as a product of one-dimensional $L_2$-stable operators acting on $U^n$. We have again for the symbol

$$E(\xi)=\prod_{j=1}^d\frac{1-\lambda(1-\cos\xi_j)}{1+\lambda(1-\cos\xi_j)}=e^{-\lambda|\xi|^2}+O(|\xi|^4)\quad\text{as }\xi\to0,$$

so that the operator $E_k$ is of second order.
We refer to the special article by Marchuk on fractional step methods in this volume for a more thorough discussion of these questions.

6. Some multistep difference schemes with constant coefficients

We shall now consider the stability in $L_2$ of some specific multistep difference schemes for a simple constant coefficient scalar parabolic problem of the form

$$\frac{\partial u}{\partial t}=Pu=\sum_{|\alpha|=M}p_\alpha D^\alpha u\quad\text{for }x\in\mathbb R^d,\ t>0,\quad(6.1)$$
$$u(x,0)=v(x)\quad\text{for }x\in\mathbb R^d.$$

Let thus $h$ and $k$ be the mesh sizes in space and time and let $kh^{-M}=\lambda=\text{constant}$. Recall that a multistep scheme is then defined by a relation of the form

$$B_kU^{n+1}=A_{k,1}U^n+\cdots+A_{k,m}U^{n-m+1}\quad\text{for }n\ge m-1,\quad(6.2)$$

where $m\ge2$, together with the initial conditions

$$U^j=v_j\quad\text{for }j=0,\dots,m-1,$$

where the $v_j$ are determined in some fashion from the initial data $v$. We shall assume that the $A_{k,l}$ and $B_k$ are finite difference operators with constant coefficients,

$$A_{k,l}v(x)=\sum_\beta a_{l\beta}v(x-\beta h)\quad\text{for }l=1,\dots,m,$$
$$B_kv(x)=\sum_\beta b_\beta v(x-\beta h),$$

where the $a_{l\beta}$ and $b_\beta$ are constant and the sums finite.
Introducing as in Section 3 the vectors $\mathbf U^n=(U^n,U^{n-1},\dots,U^{n-m+1})^{\rm T}$ and the corresponding matrices $\mathbf A_k$ and $\mathbf B_k$, equation (6.2) takes the form

$$\mathbf B_k\mathbf U^{n+1}=\mathbf A_k\mathbf U^n\quad\text{for }n\ge m-1,$$

or

$$\mathbf U^{n+1}=\tilde E_k\mathbf U^n\quad\text{for }n\ge m-1,\quad(6.3)$$

where

$$\tilde E_k=\begin{bmatrix}B_k^{-1}A_{k,1}&B_k^{-1}A_{k,2}&\cdots&B_k^{-1}A_{k,m}\\ I&0&\cdots&0\\ &\ddots&&\vdots\\ 0&&I&0\end{bmatrix}.$$

We say that (6.2) is stable in $L_2$ if (6.3) is stable in the product space $L_2^m$, so that the powers of the operator $\tilde E_k$ are bounded on $L_2^m$. Introducing the symbols or characteristic polynomials of the operators $B_k$ and $A_{k,l}$,

$$b(\xi)=\sum_\beta b_\beta e^{i\beta\cdot\xi}$$

and

$$a_l(\xi)=\sum_\beta a_{l\beta}e^{i\beta\cdot\xi},\quad l=1,\dots,m,\ \xi\in\mathbb R^d,$$

the symbol, or amplification matrix, of (6.3) is

$$\tilde E(\xi)=\begin{bmatrix}a_1(\xi)/b(\xi)&\cdots&a_m(\xi)/b(\xi)\\ 1&0&\cdots&0\\ &\ddots&&\vdots\\ 0&&1&0\end{bmatrix}.$$

(We assume that $B_k$ is invertible in $L_2$, so that $b(\xi)\ne0$.) The $L_2$-stability of our multistep scheme therefore reduces to the uniform boundedness in $\xi$ of the powers of the matrix $\tilde E(\xi)$. The von Neumann condition is

$$\rho(\tilde E(\xi))\le1\quad\text{for }\xi\in\mathbb R^d,$$

where $\rho$ denotes the spectral radius, and we say that (6.2) is parabolic (in the sense of John) if, with $c>0$,

$$\rho(\tilde E(\xi))\le\exp(-c|\xi|^M)\quad\text{for }|\xi_j|\le\tfrac12\pi,\ j=1,\dots,d.$$

We note that $\mu$ is an eigenvalue of $\tilde E(\xi)$ if and only if $\mu$ satisfies

$$\mu^mb(\xi)-\mu^{m-1}a_1(\xi)-\cdots-a_m(\xi)=0.\quad(6.4)$$

It is well known that, for fixed $\xi$, $\tilde E(\xi)$ has bounded powers if and only if all eigenvalues of $\tilde E(\xi)$ are in the closed unit disk and the ones on the unit circle are simple. In particular, von Neumann's condition is necessary for stability in $L_2$.
We shall consider two simple examples for the one-dimensional heat equation,

$$\frac{\partial u}{\partial t}=\frac{\partial^2u}{\partial x^2},\quad x\in\mathbb R,\ t>0.$$

In the first example, we replace the derivatives in a symmetric way by difference quotients to obtain

$$\hat\partial_tU^n(x)=\partial_x\bar\partial_xU^n(x),\quad n\ge1,\quad(6.5)$$

where $\partial_x$, $\bar\partial_x$ and $\hat\partial_t$ denote forward, backward and centered difference quotients, or, with $\lambda=k/h^2$,

$$U^{n+1}(x)=2\lambda\bigl(U^n(x+h)-2U^n(x)+U^n(x-h)\bigr)+U^{n-1}(x).\quad(6.6)$$

This method was mentioned briefly already in Section 1, and we showed in Section 3 that it is unstable in the maximum norm. To consider it in the present framework we note that here

$$b(\xi)=a_2(\xi)=1,\qquad a_1(\xi)=4\lambda(\cos\xi-1)\quad\text{for }\xi\in\mathbb R,$$

so that (6.4) becomes

$$\mu^2-4\lambda(\cos\xi-1)\mu-1=0.$$

For $\xi\ne2n\pi$ this equation has two distinct real roots whose product is $-1$, so that one of them is larger than one in modulus, and hence von Neumann's condition is violated for any $\lambda$. This method is thus, as expected, unconditionally unstable in $L_2$.
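The violation of von Neumann's condition is easy to exhibit numerically from this characteristic equation; the sample values of $\lambda$ below are ours.

```python
import numpy as np

def max_root_modulus(lam, xi):
    # Roots of mu^2 - 4*lam*(cos xi - 1)*mu - 1 = 0.
    roots = np.roots([1.0, -4.0 * lam * (np.cos(xi) - 1.0), -1.0])
    return np.max(np.abs(roots))

# For every lam > 0 there are frequencies with a root outside the unit disk.
for lam in (0.01, 0.1, 0.5):
    worst = max(max_root_modulus(lam, xi) for xi in np.linspace(0.1, np.pi, 50))
    assert worst > 1.0
```

Since the two real roots have product $-1$ and nonzero sum, one of them must exceed one in modulus, which is what the check confirms.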
We recall from Section 3 also the Du Fort-Frankel scheme, which may be obtained by replacing $2U^n(x)$ in (6.5) or (6.6) by $U^{n+1}(x)+U^{n-1}(x)$, so that the method is

$$(1+2\lambda)U^{n+1}(x)=2\lambda\bigl(U^n(x+h)+U^n(x-h)\bigr)+(1-2\lambda)U^{n-1}(x).$$

Here the amplification matrix is

$$\tilde E(\xi)=\begin{bmatrix}\dfrac{4\lambda\cos\xi}{1+2\lambda}&\dfrac{1-2\lambda}{1+2\lambda}\\ 1&0\end{bmatrix},$$

and its characteristic equation is

$$\mu^2-\frac{4\lambda\cos\xi}{1+2\lambda}\,\mu-\frac{1-2\lambda}{1+2\lambda}=0.\quad(6.7)$$

The eigenvalues of $\tilde E(\xi)$ are therefore

$$\mu_{1,2}=\begin{cases}\dfrac{2\lambda\cos\xi\pm\sqrt{1-4\lambda^2\sin^2\xi}}{1+2\lambda},&\text{if }2\lambda|\sin\xi|\le1,\\[2mm]\dfrac{2\lambda\cos\xi\pm i\sqrt{4\lambda^2\sin^2\xi-1}}{1+2\lambda},&\text{if }2\lambda|\sin\xi|>1.\end{cases}\quad(6.8)$$

In the former case, we have immediately

$$|\mu_{1,2}|\le\frac{2\lambda+1}{1+2\lambda}=1,$$

and since the modulus of the product of the roots of (6.7) equals

$$\gamma^2=\frac{|1-2\lambda|}{1+2\lambda}<1,$$

we have, with the proper ordering,

$$|\mu_1|\le1,\qquad|\mu_2|\le\gamma<1\quad\text{for }\xi\in\mathbb R.\quad(6.9)$$

In the latter case of (6.8), $\mu_1$ and $\mu_2$ are complex conjugate and thus

$$|\mu_j|=\gamma<1,\quad j=1,2,\ \xi\in\mathbb R.\quad(6.10)$$

In both cases, thus, von Neumann's necessary condition for stability is satisfied.
As is easily seen from (6.8) we have

$$\mu_1=1-\lambda\xi^2+O(\xi^4)\quad\text{as }\xi\to0,$$

and

$$\rho(\tilde E(\xi))<1\quad\text{for }0<|\xi|<\pi,$$

so that

$$\rho(\tilde E(\xi))\le e^{-c\xi^2}\quad\text{for }|\xi|\le\tfrac12\pi,\ \text{with }c>0,$$

i.e. the Du Fort-Frankel scheme is parabolic.
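The eigenvalue bounds above can be confirmed by computing the spectral radius of the amplification matrix (6.7) over a frequency grid; the sample values of $\lambda$ below are ours. Note that $\rho=1$ is attained at $\xi=0$ and at the undamped mode $\xi=\pm\pi$, so strict decay is tested only on an interior window.

```python
import numpy as np

def dff_matrix(lam, xi):
    """Amplification matrix of the Du Fort-Frankel scheme, cf. (6.7)."""
    return np.array([[4.0 * lam * np.cos(xi) / (1.0 + 2.0 * lam),
                      (1.0 - 2.0 * lam) / (1.0 + 2.0 * lam)],
                     [1.0, 0.0]])

def rho(lam, xi):
    return np.max(np.abs(np.linalg.eigvals(dff_matrix(lam, xi))))

for lam in (0.25, 1.0, 10.0):
    # von Neumann's condition holds for every lam ...
    assert all(rho(lam, x) <= 1.0 + 1e-12 for x in np.linspace(-np.pi, np.pi, 201))
    # ... with strict decay on an interior frequency window
    assert all(rho(lam, x) < 1.0 for x in np.linspace(0.2, 0.5 * np.pi, 50))
```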
We shall now show that the method is actually stable in $L_2$. We recall that for any $2\times2$ matrix $A=(a_{jk})$ there is a unitary matrix $U=(u_{jk})$ which transforms $A$ to triangular form,

$$U^*AU=\begin{bmatrix}\mu_1&m\\0&\mu_2\end{bmatrix},$$

where $\mu_{1,2}$ are the eigenvalues of $A$ and

$$m=\bar u_{11}a_{11}u_{12}+\bar u_{21}a_{21}u_{12}+\bar u_{11}a_{12}u_{22}+\bar u_{21}a_{22}u_{22},$$

so that, in particular,

$$|m|\le\sum_{j,k=1}^2|a_{jk}|.$$

Applied to our matrix $\tilde E(\xi)$ this shows

$$|m|\le\frac{4\lambda}{1+2\lambda}|\cos\xi|+\frac{|1-2\lambda|}{1+2\lambda}+1.$$

The $L_2$-stability of the method is now equivalent to the uniform boundedness of the powers of the triangular matrix

$$\begin{bmatrix}\mu_1&m\\0&\mu_2\end{bmatrix}^n=\begin{bmatrix}\mu_1^n&m_n(\mu_1,\mu_2)\\0&\mu_2^n\end{bmatrix},\quad n\ge1,$$

where

$$m_n(\mu_1,\mu_2)=m\sum_{j=0}^{n-1}\mu_1^j\mu_2^{n-1-j},$$

and, since $|\mu_j|\le1$ for $j=1,2$, to the boundedness of $m_n(\mu_1,\mu_2)$. In the case (6.9) we have

$$|m_n(\mu_1,\mu_2)|\le|m|\sum_{j=0}^{n-1}\gamma^j\le\frac{|m|}{1-\gamma}\quad\text{for }\xi\in\mathbb R,$$

and, in the case (6.10),

$$|m_n(\mu_1,\mu_2)|\le|m|\,n\gamma^{n-1}\le C\quad\text{for }n\ge1,\ \xi\in\mathbb R.$$

Together these estimates complete the proof of the $L_2$-stability of the Du Fort-Frankel scheme for any fixed $\lambda<\infty$.
More generally, let us consider the initial value problem (6.1) where $P=P(D)$ is an elliptic operator of principal type,

$$P(i\xi)=(-1)^{M/2}\sum_{|\alpha|=M}p_\alpha\xi^\alpha<0\quad\text{for }\xi\ne0.$$

One way to define a multistep finite difference approximation is to proceed as in Section 5 and discretize first in space, by replacing $P$ by a finite difference operator, and then apply a known multistep discretization method in time to the resulting ordinary differential equation. Let thus $Q_h$ be a finite difference operator of the form

$$Q_hv(x)=h^{-M}\sum_\beta q_\beta v(x-\beta h),$$

which we assume to be elliptic, so that

$$Q(\xi)=\sum_\beta q_\beta e^{i\beta\cdot\xi}<0\quad\text{for }|\xi_j|\le\pi,\ \xi\ne0.\quad(6.11)$$

Replacing $P$ in (6.1) by $Q_h$ yields the ordinary differential equation

$$\frac{dU}{dt}=Q_hU\quad\text{for }t\ge0,\quad(6.12)$$

which we now want to discretize in time.


One general class of methods for the time discretization of ordinary differential equations which has been extensively investigated is the class of linear multistep methods, which for equation (6.12) take the form

$$\sum_{j=0}^m\alpha_jU^{n+1-j}=k\sum_{j=0}^m\beta_jQ_hU^{n+1-j}\quad\text{for }n\ge m-1,\quad(6.13)$$

where $m$ is the number of steps. Assuming that $\alpha_0>0$, $\beta_0\ge0$, it is clear by (6.11) that $(\alpha_0-k\beta_0Q_h)^{-1}$ exists on $L_2$, and we may solve (6.13) for $U^{n+1}$ to obtain

$$U^{n+1}=-\sum_{j=1}^m(\alpha_0-k\beta_0Q_h)^{-1}(\alpha_j-k\beta_jQ_h)U^{n+1-j}.$$

For the investigation of the stability we consider the characteristic equation

$$\sum_{j=0}^m\bigl(\alpha_j-\lambda\beta_jQ(\xi)\bigr)\mu^{m-j}=0,$$

which is thus the characteristic equation of the one-step system formulation of (6.13) as well as of the scalar problem. For the Fourier transform of the solution we have

$$\hat U^{n+1}=-\sum_{j=1}^m\bigl(\alpha_0-\lambda\beta_0Q(h\xi)\bigr)^{-1}\bigl(\alpha_j-\lambda\beta_jQ(h\xi)\bigr)\hat U^{n+1-j},$$

where $\lambda=kh^{-M}$.
We shall consider some specific choices of linear multistep methods which are used for ordinary differential equations.
Let us begin with the Adams methods. They consist in writing (6.12) in the form

$$U(t_{n+1})=U(t_n)+\int_{t_n}^{t_{n+1}}Q_hU(s)\,ds,$$

and then replacing $U(s)$ in the integrand by a Lagrange interpolation polynomial. Using thus the polynomial determined by the values at $t_n,\dots,t_{n+1-m}$, one obtains the $m$-step Adams-Bashforth method

$$U^{n+1}=U^n+k\sum_{j=1}^mb_{mj}Q_hU^{n+1-j},\quad(6.14)$$

where the coefficients $b_{mj}$ may be found in e.g. HENRICI [1962]. If instead we use the polynomial interpolating also at $t_{n+1}$, we obtain the $m$-step Adams-Moulton method

$$(I-kb^*_{m0}Q_h)U^{n+1}=U^n+k\sum_{j=1}^mb^*_{mj}Q_hU^{n+1-j},$$

where the coefficients $b^*_{mj}$ are also easy to determine.
By their construction these methods are of order $O(k^m)$ and $O(k^{m+1})$, respectively, in time, to which the error of the discretization in space, $O(h^q)$, say, has to be added. In view of the relation $kh^{-M}=\lambda=\text{constant}$, the total order is thus $O(h^{\min(mM,q)})$ and $O(h^{\min((m+1)M,q)})$, respectively.
For $m=2$ we have the Adams-Bashforth method

$$U^{n+1}=(I+\tfrac32kQ_h)U^n-\tfrac12kQ_hU^{n-1},$$

and the Adams-Moulton method

$$(I-\tfrac5{12}kQ_h)U^{n+1}=(I+\tfrac23kQ_h)U^n-\tfrac1{12}kQ_hU^{n-1},\quad(6.15)$$

with the characteristic equations

$$\mu^2-\bigl(1+\tfrac32\lambda Q(\xi)\bigr)\mu+\tfrac12\lambda Q(\xi)=0,$$
$$\bigl(1-\tfrac5{12}\lambda Q(\xi)\bigr)\mu^2-\bigl(1+\tfrac23\lambda Q(\xi)\bigr)\mu+\tfrac1{12}\lambda Q(\xi)=0,\quad(6.16)$$

respectively.
By direct computation it follows that von Neumann's condition is satisfied for the Adams-Bashforth method if and only if

$$-\lambda Q(\xi)\le1\quad\text{for }\xi\in\mathbb R^d.\quad(6.17)$$

For the standard approximation to the one-dimensional heat equation we have $Q(\xi)=2(\cos\xi-1)$, and the condition (6.17) holds if and only if $\lambda\le\tfrac14$. This requirement is thus more severe than for our previous methods, and the method is therefore not competitive.
Turning now to the Adams-Moulton formula (6.15), we easily see that this method, although implicit, cannot be unconditionally stable. In fact, this would demand that both roots of (6.16) are in the unit disk for all values of $p=\lambda Q(\xi)\le0$. However, as $p\to-\infty$ the equation becomes, in the limit,

$$5\mu^2+8\mu-1=0,$$

which has the root $\mu=-\tfrac15(4+\sqrt{21})$ outside the unit disk. The method is thus unstable for large $\lambda$ and hence also inferior to our previous methods.
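These root locations can be confirmed numerically from (6.16); the helper below is ours, with $p=\lambda Q(\xi)\le0$.

```python
import numpy as np

def am2_roots(p):
    # Roots of (1 - 5p/12) mu^2 - (1 + 2p/3) mu + p/12 = 0, cf. (6.16).
    return np.roots([1.0 - 5.0 * p / 12.0, -(1.0 + 2.0 * p / 3.0), p / 12.0])

# Limit equation 5 mu^2 + 8 mu - 1 = 0 has the root -(4 + sqrt(21))/5 ...
assert abs((-4.0 - np.sqrt(21.0)) / 5.0) > 1.0
# ... and already at moderately large |p| a root of (6.16) leaves the unit disk,
assert np.max(np.abs(am2_roots(-100.0))) > 1.0
# while for small |p| both roots are inside the closed unit disk.
assert np.max(np.abs(am2_roots(-0.5))) <= 1.0
```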
We now turn to the method of backward differencing to construct a finite difference scheme from (6.12). It consists in replacing the time derivative in (6.12) by the derivative of the interpolating polynomial based on the values at $t_{n+1},t_n,\dots,t_{n+1-m}$, evaluated at $t_{n+1}$, to obtain an implicit scheme. This yields a method of the form

$$k^{-1}\Bigl[\alpha_{m0}U^{n+1}-\sum_{j=1}^m\alpha_{mj}U^{n+1-j}\Bigr]=Q_hU^{n+1},$$

or

$$(\alpha_{m0}-kQ_h)U^{n+1}=\sum_{j=1}^m\alpha_{mj}U^{n+1-j}.\quad(6.18)$$

It would also be possible to construct in this way an explicit scheme, by evaluating the derivative at $t_n$ instead, i.e.

$$\beta_{m0}U^{n+1}=(\beta_{m1}+kQ_h)U^n+\sum_{j=2}^m\beta_{mj}U^{n+1-j},\quad(6.19)$$

or at some earlier point $t_{n+1-j}$, $j=2,\dots,m$.
Since the order of accuracy in the replacement of the time derivative is $O(k^m)$, we find that the total error, as in (6.14), is $O(h^q+k^m)$, or, with $\lambda=kh^{-M}$ constant, $O(h^{\min(q,Mm)})$.
For $m=1$ these schemes reduce to the backward and forward Euler methods discussed above, but since we are interested now in multistep methods we consider only $m\ge2$.
We briefly discuss the stability of these schemes and begin with (6.18). The characteristic equation is

$$\bigl(\alpha_{m0}-\lambda Q(\xi)\bigr)\mu^m-\sum_{j=1}^m\alpha_{mj}\mu^{m-j}=0.\quad(6.20)$$

For $m=2$ and $m=3$ we have, in particular, for (6.18),

$$(\tfrac32-kQ_h)U^{n+1}=2U^n-\tfrac12U^{n-1},\quad(6.21)$$
$$(\tfrac{11}6-kQ_h)U^{n+1}=3U^n-\tfrac32U^{n-1}+\tfrac13U^{n-2}.$$

In order to decide whether the roots of the characteristic polynomials corresponding to these methods are in the unit disk, we transform the disk to the left half-plane by setting $\mu=(1+\eta)/(1-\eta)$ and apply the Hurwitz criterion: In order that the equation $\sum_{j=0}^m\gamma_j\eta^j=0$ with $\gamma_0>0$ have all its roots in the left half-plane, it suffices that

$$D_1=\gamma_1>0,$$
$$D_k=\begin{vmatrix}\gamma_1&\gamma_3&\gamma_5&\cdots&\gamma_{2k-1}\\ \gamma_0&\gamma_2&\gamma_4&\cdots&\gamma_{2k-2}\\ 0&\gamma_1&\gamma_3&\cdots&\gamma_{2k-3}\\ \vdots&&&&\vdots\\ 0&\cdots&&&\gamma_k\end{vmatrix}>0\quad\text{for }k=2,\dots,m,$$

where $\gamma_j=0$ for $j>m$.

We now apply this criterion to show that the method (6.21) always satisfies von Neumann's condition. In fact, setting $p=\lambda Q(\xi)$ we obtain for the transformed equation

$$(4-p)\eta^2+(2-2p)\eta-p=0.\quad(6.22)$$

Here, for $p<0$, we have

$$D_1=2-2p>0,$$

and

$$D_2=\begin{vmatrix}2-2p&0\\-p&4-p\end{vmatrix}=2p^2-10p+8>0,$$

and hence for $0<|\xi|\le\pi$ all roots of (6.22) are in the left half-plane, and thus those of the corresponding equation (6.20) in the open unit disk.
For $m=3$ the transformed equation is

$$(\tfrac{20}3-p)\eta^3+(6-3p)\eta^2+(2-3p)\eta-p=0,$$

and, for $p<0$,

$$D_1=2-3p>0,$$
$$D_2=\begin{vmatrix}2-3p&\tfrac{20}3-p\\-p&6-3p\end{vmatrix}=8p^2-\tfrac{52}3p+12>0,$$

and

$$D_3=\begin{vmatrix}2-3p&\tfrac{20}3-p&0\\-p&6-3p&0\\0&2-3p&\tfrac{20}3-p\end{vmatrix}=(\tfrac{20}3-p)D_2>0.$$

Both methods thus satisfy von Neumann's condition for any $\lambda$ and are parabolic.
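Instead of the Hurwitz criterion one can also simply compute the roots of (6.20) numerically over a range of $p=\lambda Q(\xi)\le0$; the sample range below is ours.

```python
import numpy as np

def bdf2_roots(p):
    # Roots of (3/2 - p) mu^2 - 2 mu + 1/2 = 0, cf. (6.21).
    return np.roots([1.5 - p, -2.0, 0.5])

def bdf3_roots(p):
    # Roots of (11/6 - p) mu^3 - 3 mu^2 + (3/2) mu - 1/3 = 0.
    return np.roots([11.0 / 6.0 - p, -3.0, 1.5, -1.0 / 3.0])

# Both methods keep all characteristic roots in the closed unit disk for p <= 0.
for p in -np.logspace(-3, 6, 50):
    assert np.max(np.abs(bdf2_roots(p))) <= 1.0 + 1e-10
    assert np.max(np.abs(bdf3_roots(p))) <= 1.0 + 1e-10
```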
For $m=2$ and $3$ the explicit schemes (6.19) are

$$U^{n+1}=2kQ_hU^n+U^{n-1}\quad\text{for }n\ge1,$$

and

$$U^{n+1}=(-\tfrac32+3kQ_h)U^n+3U^{n-1}-\tfrac12U^{n-2}\quad\text{for }n\ge2.$$

For the first the characteristic equation is

$$\mu^2-2p\mu-1=0,$$

with the roots $\mu=p\pm\sqrt{p^2+1}$, one of which is always outside the unit disk if $p<0$. Hence this method is unstable for any choice of $\lambda$; it contains (6.5) as a special case. For the second method the characteristic equation is

$$\mu^3+\tfrac32(1-2p)\mu^2-3\mu+\tfrac12=0,$$

and with $p=0$ the roots are $\mu=1$ and $\mu=\tfrac14(-5\pm\sqrt{33})$, so that this method is also unconditionally unstable.
We shall now describe two higher-order three-level methods from DOUGLAS and GUNN [1963] for the model heat equation in $d$ space dimensions (with $d\le4$),

$$\frac{\partial u}{\partial t}=\Delta u=\sum_{j=1}^d\frac{\partial^2u}{\partial x_j^2},\quad x\in\mathbb R^d,\ t>0,\quad(6.23)$$

with initial data

$$u(x,0)=v(x).$$

These methods have the special feature that they only use the immediate neighbors of a given mesh point, which is, of course, particularly desirable when applying them to a mixed initial boundary value problem. However, as we shall see, the stability results we shall present do not have exactly the same character as the ones discussed above.
Let us first note that, for $u$ smooth, with

$$\partial_{x_j}u(x)=\frac{u(x+he_j)-u(x)}h,\qquad\bar\partial_{x_j}u(x)=\frac{u(x)-u(x-he_j)}h,$$

and $u^n(\cdot)=u(\cdot,nk)$, we have

$$\frac{\partial^2u^n}{\partial x_j^2}=\partial_{x_j}\bar\partial_{x_j}u^n-\frac{h^2}{12}\frac{\partial^4u^n}{\partial x_j^4}+O(h^4)$$
$$=\tfrac13\partial_{x_j}\bar\partial_{x_j}(u^{n+1}+u^n+u^{n-1})-\frac{h^2}{12}\frac{\partial^4u^n}{\partial x_j^4}+O(h^4+k^2)\quad\text{as }h,k\to0.$$

Setting, as in Section 5,

$$\Delta_h=\sum_{j=1}^d\partial_{x_j}\bar\partial_{x_j},$$

we then have

$$\frac{u^{n+1}-u^{n-1}}{2k}=\tfrac13\Delta_h(u^{n+1}+u^n+u^{n-1})-\frac{h^2}{12}\sum_{j=1}^d\frac{\partial^4u^n}{\partial x_j^4}+O(h^4+k^2).\quad(6.24)$$

Thus, if we can approximate the sum appearing as the coefficient of $h^2$ on the right to second order in $h$, then we will have produced an $O(h^4+k^2)$ finite difference approximation of (6.23). Note that, for the exact solution of (6.23),

$$\Delta^2u=\sum_j\frac{\partial^4u}{\partial x_j^4}+2\sum_{i<j}\frac{\partial^4u}{\partial x_i^2\partial x_j^2}=\Delta u_t=u_{tt},$$

and hence, using the fact that $u_{tt}=\Delta^2u$,

$$\sum_{j=1}^d\frac{\partial^4u^n}{\partial x_j^4}=\frac{u^{n+1}-2u^n+u^{n-1}}{k^2}-2\sum_{i<j}\partial_{x_i}\bar\partial_{x_i}\partial_{x_j}\bar\partial_{x_j}u^n+O(h^2+k^2)\quad\text{as }h,k\to0,\quad(6.25)$$

and also, since $\Delta^2u=\Delta u_t$,

$$\sum_{j=1}^d\frac{\partial^4u^n}{\partial x_j^4}=\Delta_h\Bigl(\frac{u^{n+1}-u^{n-1}}{2k}\Bigr)-2\sum_{i<j}\partial_{x_i}\bar\partial_{x_i}\partial_{x_j}\bar\partial_{x_j}u^n+O(h^2+k^2+h^4/k)\quad\text{as }h,k\to0.\quad(6.26)$$

The relation (6.24) together with (6.25) and (6.26) thus suggests the following two difference analogues of (6.23), namely

$$\frac{U^{n+1}-U^{n-1}}{2k}=\tfrac13\Delta_h(U^{n+1}+U^n+U^{n-1})-\frac{h^2}{12}\,\frac{U^{n+1}-2U^n+U^{n-1}}{k^2}+\frac{h^2}6\sum_{i<j}\partial_{x_i}\bar\partial_{x_i}\partial_{x_j}\bar\partial_{x_j}U^n\quad(6.27)$$

and, with $\lambda=k/h^2$,

$$\frac{U^{n+1}-U^{n-1}}{2k}=\tfrac13\Delta_h\bigl((1-1/(8\lambda))U^{n+1}+U^n+(1+1/(8\lambda))U^{n-1}\bigr)+\frac{h^2}6\sum_{i<j}\partial_{x_i}\bar\partial_{x_i}\partial_{x_j}\bar\partial_{x_j}U^n.\quad(6.28)$$

The crucial matter is then the stability of these schemes. In this regard we have the following two results from DOUGLAS and GUNN [1963] (where they were expressed for a cubic domain and in the discrete $l_{2,h}$-norm).

THEOREM 6.1. The difference scheme (6.27) is conditionally stable in $L_2(\mathbb R^d)$ for $d\le4$, in the sense that if $U^0=0$ and $\lambda=k/h^2$ is bounded away from $0$ and $\infty$ (or bounded away from $0$ if $d\le3$), then, with $\|\cdot\|=\|\cdot\|_{L_2(\mathbb R^d)}$,

$$\|U^n\|\le C\|U^1\|\quad\text{for }n\ge1.\quad(6.29)$$

THEOREM 6.2. The difference scheme (6.28) is unconditionally stable in $L_2(\mathbb R^d)$ for $d\le3$, in the sense that (6.29) holds if $U^0=0$.

We note that these results do not show stability in the sense of boundedness for all $U^0$ and $U^1$, but are restricted to data with $U^0=0$.
After application of the Fourier transform, equations (6.27) and (6.28) may both be written in the form

$$b(h\xi)\hat U^{n+1}=a_1(h\xi)\hat U^n+a_2(h\xi)\hat U^{n-1},$$

and the corresponding characteristic equation is

$$b(\xi)\mu^2-a_1(\xi)\mu-a_2(\xi)=0,$$

with roots $\mu_j=\mu_j(\xi)$, $j=1,2$. The solution then satisfies (assuming distinct roots; the case of coinciding roots has to be treated separately)

$$\hat U^n(\xi)=c_1(\xi)\mu_1(h\xi)^n+c_2(\xi)\mu_2(h\xi)^n,$$

with initial conditions

$$c_1(\xi)+c_2(\xi)=\hat U^0(\xi)=0,$$
$$c_1(\xi)\mu_1(h\xi)+c_2(\xi)\mu_2(h\xi)=\hat U^1(\xi),$$

which implies

$$\hat U^n(\xi)=\frac{\mu_1(h\xi)^n-\mu_2(h\xi)^n}{\mu_1(h\xi)-\mu_2(h\xi)}\,\hat U^1(\xi).$$

The stability stated then requires the boundedness of the ratio on the right, uniformly in $n$ and $h\xi\in\mathbb R^d$. One may show that for the method (6.27) this ratio is bounded by $3(64\lambda^2+\lambda^{-1}+6)$ if $d\le4$, and for (6.28) it is bounded by $4$ if $d\le3$.
Although this stability concept is more restrictive than the earlier one, it suffices nevertheless to derive convergence estimates if we choose $U^0=u^0=v$, as then the initial error $z^0=U^0-u^0=0$. We obtain for this choice, in both cases, if $U^1$ is chosen so that

$$\|U^1-u^1\|\le C(h^4+k^2),\quad(6.30)$$

that

$$\|U^n-u^n\|\le C(h^4+k^2)\quad\text{as }h,k\to0$$

(for (6.28) under the assumption that $\lambda$ is bounded below). A possible choice for $U^1$ that guarantees (6.30) is

$$U^1=v+k\Delta v.$$
Because of the implicit character of the above two schemes, it is of interest to associate with them alternating direction type schemes which reduce the work in the solution of the algebraic equations for $U^{n+1}$. In both cases we may write the scheme in the form

$$(I+B)U^{n+1}-A_1U^n-A_2U^{n-1}=0,$$

where $B$ may be decomposed into a sum of one-dimensional operators,

$$B=\sum_{j=1}^dB_j.$$

The alternating direction type scheme is then to define the approximate solution $W^n$ at $t=nk$ from $W^0=v$ and $W^1$ as above, and then to use the following equations to determine $W^{n+1}$ for $n\ge1$ from $W^n$ and $W^{n-1}$, namely by defining intermediate values $W^{n+1,j}$, $j=1,\dots,d$,

$$(I+B_1)W^{n+1,1}+(B-B_1-A_1)W^n-A_2W^{n-1}=0,$$
$$(I+B_j)W^{n+1,j}-W^{n+1,j-1}-B_jW^n=0,\quad j=2,\dots,d,$$

and then setting

$$W^{n+1}=W^{n+1,d}.$$

For (6.27) and $d=3$ we may take

$$B_1=-\tfrac23k\,\partial_{x_1}\bar\partial_{x_1}+\frac1{6\lambda}I,$$
$$B_j=-\tfrac23k\,\partial_{x_j}\bar\partial_{x_j},\quad j=2,3,$$

and for (6.28),

$$B_j=-\tfrac23k\bigl(1-1/(8\lambda)\bigr)\partial_{x_j}\bar\partial_{x_j},\quad j=1,2,3.$$

In both cases (for $d=3$) it is shown in DOUGLAS and GUNN [1963, 1964] that

$$\|W^n-u^n\|_{2,h}\le C(h^4+k^2),$$

in the former case for $\lambda$ bounded away from $0$, in the latter for $\lambda>\tfrac18$.

7. John's approach to maximum-norm estimates

In this section our main purpose is to describe some results and techniques developed in the important paper of JOHN [1952] concerning the maximum-norm stability of explicit finite difference schemes for general second-order parabolic equations in one space variable. We shall also include some related material based on Fourier analysis, such as discussions of the unconditional maximum-norm stability of certain implicit methods and of the use of Fourier multipliers in stability analysis.
We consider thus first the general nonhomogeneous equation

$$\frac{\partial u}{\partial t}=p_2(x,t)\frac{\partial^2u}{\partial x^2}+p_1(x,t)\frac{\partial u}{\partial x}+p_0(x,t)u+f(x,t)\quad\text{in }\mathbb R\times[0,T],\quad(7.1)$$

where $p_2(x,t)\ge\delta>0$, under the initial condition

$$u(x,0)=v(x)\quad\text{for }x\in\mathbb R,\quad(7.2)$$

and an explicit finite difference approximation of the form

$$U^{n+1}(x)=\sum_la_l(x,t,h)U^n(x-lh)+kf(x,t)\quad\text{for }n\ge0,$$
$$U^0(x)=v(x),\quad x\in\mathbb R,\quad(7.3)$$

where $(x,t)=(jh,nk)$ varies over the mesh points in $\mathbb R\times[0,T]$. The summation is over a finite set of points, $|l|\le M$, say, and we assume that $h$ and $k$ are sufficiently small and that $k/h^2=\lambda$ is kept constant as $h$ and $k$ tend to zero. We shall assume for simplicity that the coefficients $a_l$ are defined everywhere in $\mathbb R\times[0,T]$ and have certain smoothness and boundedness properties there, although in the calculations they only enter at mesh points. With a change of notation the difference equation (7.3) may also be written as

$$U_j^{n+1}=\sum_la_{jl}^n(h)U_{j-l}^n+kf_j^n,\quad j=0,\pm1,\pm2,\dots.$$

In John's paper the main purpose is to use the finite difference scheme to analyze the differential equation under weak conditions on the coefficients and data. He considers the case that $p_2$, $\partial p_2/\partial x$, $\partial^2p_2/\partial x^2$, $p_1$, $\partial p_1/\partial x$, $p_0$ and $f$ are uniformly continuous and bounded and that $v$ is bounded and locally Riemann integrable. He further assumes that the coefficients of the difference equation have analogous properties. Under the compatibility condition to be described presently he shows convergence to a "generalized" solution of (7.1), (7.2), which is a classical solution if $\partial p_0/\partial x$ and $\partial f/\partial x$ are uniformly continuous and bounded. Since our interest is in the numerical solution, we shall not insist on the details of the regularity aspects, and assume simply that the coefficients $p_0$, $p_1$, and $p_2$ and the data $f$ and $v$ are sufficiently smooth for our purposes below. For results with reduced regularity assumptions, see also ARONSON [1963a].
Let us first discuss consistency. Writing equation (7.3) in the form

$$\frac{U^{n+1}(x)-U^n(x)}k=k^{-1}\Bigl(\sum_la_l(x,t,h)U^n(x-lh)-U^n(x)\Bigr)+f(x,t),$$

we find easily by Taylor expansion that this equation, with the exact solution substituted for $U^n$, is satisfied in the limit as $h\to0$ if and only if

$$\lim_{h\to0}k^{-1}\Bigl(\sum_la_l(x,t,h)-1\Bigr)=p_0(x,t),$$
$$\lim_{h\to0}\frac hk\sum_ll\,a_l(x,t,h)=-p_1(x,t),$$
$$\lim_{h\to0}\frac{h^2}{2k}\sum_ll^2a_l(x,t,h)=p_2(x,t),$$

and we thus say that (7.3) is consistent with (7.1) if these equations hold.
Assuming that the $a_l$ are of the form

$$a_l(x,t,h)=a_{l,0}(x,t)+ha_{l,1}(x,t)+h^2a_{l,2}(x,t,h),$$

where the $a_{l,j}$ are uniformly bounded and

$$a_{l,2}(x,t,h)\to a_{l,2}(x,t)\quad\text{as }h\to0,$$

these relations may be written as

$$\sum_la_{l,0}(x,t)=1,\qquad\sum_ll\,a_{l,0}(x,t)=0,\qquad\sum_la_{l,1}(x,t)=0,$$
$$\sum_ll^2a_{l,0}(x,t)=2\lambda p_2(x,t),\qquad\sum_ll\,a_{l,1}(x,t)=-\lambda p_1(x,t),\qquad\sum_la_{l,2}(x,t)=\lambda p_0(x,t).$$

Note that if all coefficients $a_l(x,t,h)$ are independent of $h$ (which is only possible if $p_0=p_1=0$), then these relations reduce to

$$\sum_la_l(x,t)=1,\qquad\sum_ll\,a_l(x,t)=0,\qquad\sum_ll^2a_l(x,t)=2\lambda p_2(x,t).\quad(7.4)$$

For the case of the standard heat equation,

$$\frac{\partial u}{\partial t}=\frac{\partial^2u}{\partial x^2},$$

we have $p_2=1$ and, if the $a_l$ are constant, equations (7.4) may be written compactly as

$$E(\xi)=\sum_la_le^{-il\xi}=e^{-\lambda\xi^2}+o(\xi^2)\quad\text{as }\xi\to0,\quad(7.5)$$

which is a special case of Theorem 4.1 above.


An explicit forward Euler scheme may be defined here by

$$\partial_tU^n(x)=p_2(x,t)\partial_x\bar\partial_xU^n(x)+p_1(x,t)\hat\partial_xU^n(x)+p_0(x,t)U^n(x)+f(x,t),$$

where $\partial_x$, $\bar\partial_x$ and $\hat\partial_x$ are the standard forward, backward and centered difference quotients, or

$$U^{n+1}(x)=\bigl(\lambda p_2(x,t)+\tfrac12\lambda hp_1(x,t)\bigr)U^n(x+h)+\bigl(\lambda p_2(x,t)-\tfrac12\lambda hp_1(x,t)\bigr)U^n(x-h)$$
$$\qquad+\bigl(1-2\lambda p_2(x,t)+\lambda h^2p_0(x,t)\bigr)U^n(x)+kf(x,t);$$

it is obviously consistent with (7.1).
We shall now turn to stability, which in this section will be interpreted with respect to the maximum norm. We note that in this case the operators are allowed to depend on $t$, which will somewhat complicate the notation and the analysis.
The solution of the inhomogeneous equation may be written by superposition of solutions of the homogeneous equation as

$$U^n=E_{k,n,0}v+k\sum_{m=0}^{n-1}E_{k,n,m+1}f^m,\quad n\ge1,\quad(7.6)$$

where thus $E_{k,n,m}$ is the operator that defines the solution of the homogeneous equation at $t=nk$ by means of the initial values at $t=mk$.
We say that the finite difference scheme (7.3) is maximum-norm stable if, with $C=C_T$ independent of $h$,

$$\|E_{k,n,m}v\|\le C\|v\|\quad\text{for }0\le mk\le nk\le T.$$

Here, as in the rest of this section, we use $\|\cdot\|$ to denote the maximum norm over the mesh points or over all of $\mathbb R$, depending on the case considered. If the scheme is maximum-norm stable we have from (7.6), for the solution of (7.3),

$$\|U^n\|\le C\Bigl(\|v\|+t_n\max_{m<n}\|f^m\|\Bigr),\quad\text{where }t_n=nk.$$

Note that in the time-independent case we have $E_{k,n,m}=E_k^{n-m}$.
As usual, stability and consistency imply convergence, as is stated in the following theorem.

THEOREM 7.1. Let $u\in C^{2,1}$ and assume that the scheme is consistent and maximum-norm stable. Then

$$\max_{nk\le T}\|U^n-u(nk)\|\to0\quad\text{as }h\to0.$$

Setting

$$E(x,t,\xi)=\sum_la_{l,0}(x,t)e^{-il\xi},$$

we may extend the von Neumann condition to the present case to read

$$|E(x,t,\xi)|\le1\quad\text{for }\xi\in\mathbb R,\ (x,t)\in\mathbb R\times[0,T].$$

It is easy to show that this is a necessary condition for stability.
The basic result here is now that a slightly stronger condition, which generalizes our above concept of a parabolic finite difference operator, is also sufficient for stability.

THEOREM 7.2. Assume that the $a_l$ satisfy the appropriate regularity conditions and that (7.3) is consistent with (7.1). Then a sufficient condition for maximum-norm stability is that there exists a positive $c$ such that

$$|E(x,t,\xi)|\le e^{-c\xi^2}\quad\text{for }|\xi|\le\pi,\ (x,t)\in\mathbb R\times[0,T].\quad(7.7)$$

We remark that this condition is satisfied in the constant coefficient case for the standard heat equation, as a result of the consistency condition (7.5), if

$$|E(\xi)|<1\quad\text{for }0<|\xi|\le\pi.\quad(7.8)$$

In fact, (7.5) implies at once that (7.7) holds for small $\xi$, and by (7.8) $c$ may then be adjusted so that (7.7) holds for all $|\xi|\le\pi$. The condition (7.7) may also be shown to hold if the $a_{l,0}(x,t)$ are nonnegative and $a_{0,0}(x,t)$ and $a_{\pm1,0}(x,t)$ are bounded away from zero. For instance, the forward Euler method introduced above satisfies this latter condition provided

$$\lambda<\frac1{2\sup_{x,t}p_2(x,t)},$$

since in this case

$$a_{0,0}(x,t)=1-2\lambda p_2(x,t),\qquad a_{1,0}(x,t)=a_{-1,0}(x,t)=\lambda p_2(x,t).$$
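For frozen coefficients this is easy to test numerically; the values $\lambda=0.2$, $p_2=2$ and the decay rate $c=0.05$ below are sample choices of ours.

```python
import numpy as np

xi = np.linspace(-np.pi, np.pi, 4001)

def E(lam, p2):
    # Symbol of the forward Euler step for u_t = p2*u_xx (frozen coefficients).
    return 1.0 - 2.0 * lam * p2 * (1.0 - np.cos(xi))

# lam*p2 = 0.4 < 1/2: condition (7.7) holds, e.g. with c = 0.05.
assert np.all(np.abs(E(0.2, 2.0)) <= np.exp(-0.05 * xi**2) + 1e-12)
# lam*p2 = 0.6 > 1/2: von Neumann's condition, and hence (7.7), fails.
assert np.max(np.abs(E(0.3, 2.0))) > 1.0
```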
To give a hint of the techniques used in the proof of Theorem 7.2 we shall first sketch it for the case that the coefficients are independent of $x$, $t$ and $h$. We then have

$$E_kv(x)=\sum_la_lv(x-lh)$$

and

$$E_{k,n,0}v(x)=E_k^nv(x)=\sum_la_{nl}v(x-lh).$$

We find at once by Fourier transformation that, with $E(\xi)$ the symbol of $E_k$, we have

$$E(h\xi)^n\hat v(\xi)=\sum_la_{nl}e^{-ilh\xi}\hat v(\xi),$$

and hence that the $a_{nl}$ are the Fourier coefficients of the periodic function $E(\xi)^n$, so that

$$a_{nl}=\frac1{2\pi}\int_{-\pi}^\pi E(\xi)^ne^{il\xi}\,d\xi.$$

We note that

$$|E_k^nv(x)|\le\sum_l|a_{nl}|\,\|v\|,$$

and also that equality holds for suitable $v$, so that the operator norm of $E_k^n$ is

$$\|E_k^n\|=\sum_l|a_{nl}|.\quad(7.9)$$

We observe that, in fact, this relation holds also if we consider the operator to be defined on the space $C(\mathbb R)$ of bounded continuous functions on $\mathbb R$, and not only on the mesh functions. The stability problem is thus the same in the two cases and reduces to bounding the sum on the right in (7.9).
We now estimate the $a_{nl}$. First, we have directly, from the definition of $a_{nl}$ and the assumption (7.7),

$$|a_{nl}|\le\frac1{2\pi}\int_{-\pi}^\pi e^{-nc\xi^2}\,d\xi\le Cn^{-1/2}.$$

Further, by integrating by parts twice we obtain

$$a_{nl}=-\frac1{2\pi l^2}\int_{-\pi}^\pi\frac{d^2}{d\xi^2}\bigl(E(\xi)^n\bigr)e^{il\xi}\,d\xi=-\frac1{2\pi l^2}\int_{-\pi}^\pi\bigl(nE(\xi)^{n-1}E''(\xi)+n(n-1)E(\xi)^{n-2}E'(\xi)^2\bigr)e^{il\xi}\,d\xi.$$

By (7.5), $E'(\xi)=O(\xi)$ as $\xi\to0$, and hence we have

$$|a_{nl}|\le\frac C{l^2}\int_{-\pi}^\pi\bigl(ne^{-nc\xi^2}+n^2\xi^2e^{-nc\xi^2}\bigr)\,d\xi\le\frac C{l^2}\,n^{1/2}.$$

We conclude that

$$\sum_l|a_{nl}|\le C\Bigl(\sum_{|l|\le\sqrt n}n^{-1/2}+\sum_{|l|>\sqrt n}n^{1/2}l^{-2}\Bigr)\le C,$$

which proves the result in the case considered.
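The sum in (7.9) can be evaluated numerically via the FFT. For the forward Euler symbol with $\lambda\le\tfrac12$ the coefficients are nonnegative, so the sum equals $E(0)^n=1$ exactly, while for $\lambda>\tfrac12$ the powers blow up; the grid size and sample values below are ours.

```python
import numpy as np

def coeff_l1_norm(lam, n, N=4096):
    """l1 norm of the Fourier coefficients of E(xi)^n for the forward Euler
    symbol E(xi) = 1 - 2*lam*(1 - cos xi), i.e. the max-norm of E_k^n, cf. (7.9)."""
    xi = 2.0 * np.pi * np.arange(N) / N
    a = np.fft.ifft((1.0 - 2.0 * lam * (1.0 - np.cos(xi))) ** n)
    return np.sum(np.abs(a))

# lam = 0.4: nonnegative coefficients, so the norm is exactly E(0)^n = 1.
assert all(abs(coeff_l1_norm(0.4, n) - 1.0) < 1e-8 for n in (1, 10, 100, 1000))
# lam = 0.6: von Neumann fails (E(pi) = -1.4) and the powers blow up.
assert coeff_l1_norm(0.6, 20) > 100.0
```

Here $N$ must exceed $2n+1$ so that the coefficients of the trigonometric polynomial $E(\xi)^n$ are recovered without aliasing.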


The case of variable coefficients may be considered as a perturbation of the constant coefficient case. We sketch the proof of the stability in the case that the coefficients depend on $x$ but not on $t$ or $h$, so that

$$E_kv(x)=\sum_la_l(x)v(x-lh).$$

Letting $U^n=E_k^nv$ we shall want to estimate $U^n(x_0)$, for $x_0$ arbitrary, by the maximum norm of $v$. We then fix the coefficients of $E_k$ at $x_0$ to obtain the representation

$$E_kv(x)=\sum_l\bar a_lv(x-lh)+\sum_lb_l(x)v(x-lh),$$

where $\bar a_l=a_l(x_0)$ and $b_l(x)=a_l(x)-\bar a_l$. Thus

$$U^{n+1}(x)=\sum_l\bar a_lU^n(x-lh)+\sum_lb_l(x)U^n(x-lh)=\bar E_kU^n(x)+f^n(x),$$

where the latter equality defines $\bar E_k$ and $f^n$, and hence

$$U^n(x_0)=\bar E_k^nv(x_0)+\sum_{m=0}^{n-1}\bar E_k^mf^{n-1-m}(x_0).\quad(7.10)$$

Setting

$$\bar E_k^mv(x)=\sum_p\bar a_{mp}v(x-ph),$$

we may write

$$\bar E_k^mf^{n-1-m}(x_0)=\sum_p\bar a_{mp}f^{n-1-m}(x_0-ph)=\sum_p\sum_l\bar a_{mp}b_l(x_0-ph)U^{n-1-m}\bigl(x_0-(p+l)h\bigr)$$
$$=\sum_j\gamma_{mj}(x_0)U^{n-1-m}(x_0-jh),\quad(7.11)$$

where

$$\gamma_{mj}(x_0)=\sum_l\bar a_{m,j-l}\,b_l(x_0-jh+lh).$$

Setting now

$$\beta_j(x_0,\xi)=\sum_lb_l(x_0-jh+lh)e^{-il\xi},$$

we have

$$\gamma_{mj}(x_0)=\frac1{2\pi}\int_{-\pi}^\pi E(x_0,\xi)^m\beta_j(x_0,\xi)e^{ij\xi}\,d\xi.$$

Using the consistency relations one may show, by arguments similar to the ones used to estimate the $a_{nl}$ above, for $\lambda$ fixed and $nk\le T$, that, with $C$ independent of $x_0$,

$$\sum_j|\gamma_{mj}(x_0)|\le Ch(m+1)^{-1/2},$$

and hence

$$\sum_{m=0}^{n-1}\sum_j|\gamma_{mj}(x_0)|\le C\sum_{m=0}^{n-1}\frac h{\sqrt{m+1}}\le Ch\sqrt n=C\sqrt{nk}.$$
Using the stability of $\bar E_k$ it follows from (7.10) and (7.11) that, for $nk\le\delta$,

$$|U^n(x_0)|\le C\|v\|+C\sqrt\delta\max_{m<n}\|U^m\|,$$

and hence, since $x_0$ is arbitrary, for $\delta$ sufficiently small,

$$\|U^n\|\le2C\|v\|\quad\text{for }nk\le\delta,$$

or

$$\|U^n\|\le2C\|U^m\|\quad\text{for }0\le(n-m)k\le\delta.$$

The stability now follows by repeated application of this estimate.
John also proved a discrete analogue of a well-known smoothing property of the parabolic differential equation, which in the case of the homogeneous equation reads

$$\|\partial_xU^n\|_{L_\infty}\le C(nk)^{-1/2}\|v\|_{L_\infty}\quad\text{for }n>0.\quad(7.12)$$

This is relevant for the analysis of the convergence of difference quotients of the approximate solution to derivatives of the exact solution of (7.1), and was used in JOHN [1952] in the study of smoothness properties of the exact solution.
The solution U" of the difference scheme defined by (7.3) may also be written in the
form
U^n(x) = h \sum_l g_{n,0,l}(x, h) v(x - lh) + h \sum_{m=0}^{n-1} \sum_l g_{n,m+1,l}(x, h) f^m(x - lh),

or

U_j^n = h \sum_l g_{n,0,j,l}(h) v_{j-l} + h \sum_{m=0}^{n-1} \sum_l g_{n,m+1,j,l}(h) f_{j-l}^m,

and the coefficients g_{n,m,j,l}(h) may be thought of as determining a discrete
fundamental solution. For the particular case that the coefficients of (7.3) are
independent of x, t and h, we are in the situation discussed in detail in the proof of
Theorem 7.2 above, with

g_{n,m,j,l}(h) = h^{-1} a_{n-m, j-l}.

The maximum-norm stability may be expressed in terms of the fundamental
solution as

\sup_j h \sum_l |g_{n,m,j,l}(h)| \le C   for 0 \le mk < nk \le T.

For the same range of parameters one also has the estimate

|g_{n,m,j,l}(h)| \le C ((n-m)k)^{-1/2},

and, for difference quotients,

|\partial_x g_{n,m,j,l}(h)| \le C ((n-m)k)^{-1}

and (cf. (7.12))

h \sum_l |\partial_x g_{n,m,j,l}(h)| \le C ((n-m)k)^{-1/2},
84 V. Thomee CHAPTER II

which are all analogues of estimates for the fundamental solution of the continuous
problem.
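In the constant-coefficient case, where g_{n,m,j,l}(h) = h^{-1} a_{n-m,j-l}, these bounds reduce to bounds on the Fourier coefficients a_{n,j} of E(\xi)^n, and they are easy to observe numerically. The following sketch (in Python with NumPy, taking the simple symbol E(\xi) = \cos\xi as an illustrative choice) checks the stability bound and the kernel decay; with \lambda = k/h^2 fixed, a factor n^{-1/2} corresponds to ((n-m)k)^{-1/2}:

```python
import numpy as np

def kernel(n, m=2**12):
    # Fourier coefficients a_{n,j} of E(xi)^n for the symbol E(xi) = cos(xi),
    # computed by trapezoidal quadrature of (1/2pi) int E^n e^{ij xi} d xi.
    xi = 2 * np.pi * np.arange(m) / m
    return np.fft.ifft(np.cos(xi) ** n).real

for n in (16, 64, 256):
    a = kernel(n)
    # stability: the l1 norm of the kernel is bounded (here it equals 1,
    # since for even n the coefficients are nonnegative binomial weights)
    assert abs(np.abs(a).sum() - 1.0) < 1e-12
    # smoothing: max_j |a_{n,j}| decays like n^{-1/2}
    assert np.abs(a).max() <= 1.0 / np.sqrt(n)
```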
We shall now turn to a brief discussion of the maximum-norm stability of general
difference operators of the form
E_k v(x) = \sum_j a_j v(x - jh),    (7.13)

which we now study without assuming any relation to any particular differential
equation, and where we permit the summation to be over infinitely many j, but such
that

\sum_j |a_j| < \infty.

When the summation is infinite, we refer to the operator as implicit, in contrast to


the explicit case treated above. We associate with the operator Ek its symbol, the
absolutely convergent trigonometric series
E(\xi) = \sum_j a_j e^{-ij\xi},

and note that, as above, the coefficients a_j may be retrieved from E(\xi) by

a_j = \frac{1}{2\pi} \int_{-\pi}^{\pi} E(\xi) e^{ij\xi} \, d\xi.

Implicit operators of the type studied in Section 4, such as the Crank-Nicolson


operator, have symbols which are rational trigonometric functions and are thus
special cases of the class of operators studied here.
The following result (THOMEE [1965]) gives necessary and sufficient conditions for
maximum-norm stability in the case E(\xi) is real analytic.

THEOREM 7.3. Assume that the symbol E(\xi) of E_k is analytic on the real axis. Then E_k is
maximum-norm stable if and only if one of the following two conditions is satisfied:
(a) E(\xi) = c e^{ij\xi} with |c| = 1, j an integer;
(b) |E(\xi)| < 1 except for at most a finite number of points \xi_q, q = 1, ..., N, in |\xi| \le \pi,
where |E(\xi_q)| = 1. For q = 1, ..., N there are constants \alpha_q, \beta_q, \mu_q, where \alpha_q is real,
Re \beta_q > 0 and \mu_q is an even natural number, such that

E(\xi_q + \xi) = E(\xi_q) \exp(i\alpha_q \xi - \beta_q \xi^{\mu_q} + o(\xi^{\mu_q}))   as \xi \to 0.    (7.14)

The proof of this result may be carried out by the above technique of John.
It follows, of course, in particular, that von Neumann's condition is necessary for
stability. For an operator E_k which is consistent with the heat equation, condition
(7.14) is satisfied for \xi_1 = 0, with \alpha_1 = 0, \beta_1 = \lambda and \mu_1 = 2. If there are no further
points \xi_q in |\xi| \le \pi with |E(\xi_q)| = 1, the operator is parabolic in John's sense and
maximum-norm stability is known by Theorem 7.2. If there are more points \xi_q with
E(\xi_q) on the unit circle, (7.14) describes the behaviour of E(\xi) near these points which
is required for maximum-norm stability. A simple example is the explicit forward
Euler scheme with \lambda = 1/2,

E_k v(x) = \tfrac12 v(x+h) + \tfrac12 v(x-h),

with

E(\xi) = \cos \xi.

Here |E(\xi)| = 1 at \xi = 0 but also at \xi = \pi, and

E(\pi + \xi) = -\cos \xi = -\exp(-\tfrac12 \xi^2 + O(\xi^4))   as \xi \to 0.

This operator is thus not parabolic, although it is trivially maximum-norm stable.
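The trivial maximum-norm stability claimed here is just the statement that E_k is an average with nonnegative weights summing to one, and can be observed in a few lines (a sketch using NumPy on a periodic grid; the periodicity is an assumption made only for the experiment):

```python
import numpy as np

def step(v):
    # One application of E_k v(x) = v(x+h)/2 + v(x-h)/2 on a periodic grid.
    return 0.5 * (np.roll(v, -1) + np.roll(v, 1))

rng = np.random.default_rng(0)
v = rng.standard_normal(200)
u = v.copy()
for _ in range(500):
    u = step(u)
# A weighted average with nonnegative weights summing to 1 cannot
# increase the maximum norm.
assert np.abs(u).max() <= np.abs(v).max() + 1e-12
```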
One example of an operator Ek of the form (7.13) in which the summation is
infinite is provided by the Crank-Nicolson method, for which
E(\xi) = \frac{1 - \lambda(1 - \cos \xi)}{1 + \lambda(1 - \cos \xi)}.

This real analytic 2\pi-periodic function has an infinite Fourier series, and the
operator is parabolic for any choice of \lambda. Although the coefficients are not
necessarily nonnegative, there thus exists, for each \lambda > 0, a constant C = C(\lambda) such
that, in the maximum norm,

\|E_k^n v\| \le C \|v\|.
Since it would be desirable, in practice, to take h and k of the same order, and thus
\lambda of order 1/h, it is of interest to ask whether the same constant can be chosen for all \lambda,
such as in the case of L_2-stability. This problem was considered in SERDYUKOVA
[1964, 1967] who showed

\|E_k^n v\| \le 23 \|v\|,   n \ge 1,  \lambda > 0
(for earlier related work, cf. also JUNCOSA and YOUNG [1957] and WASOW [1958]).
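For a constant-coefficient operator the maximum-norm operator norm of E_k^n equals the l_1 norm of the Fourier coefficients of E(\xi)^n, so Serdyukova's uniform bound can be probed numerically; an FFT gives the coefficients to high accuracy (a sketch; the grid size is an arbitrary choice of ours):

```python
import numpy as np

def cn_opnorm(lam, n, m=2**14):
    # l1 norm of the Fourier coefficients of E(xi)^n for the Crank-Nicolson
    # symbol E(xi) = (1 - lam(1-cos xi)) / (1 + lam(1-cos xi)); this equals
    # the maximum-norm operator norm of E_k^n.
    xi = 2 * np.pi * np.arange(m) / m
    E = (1 - lam * (1 - np.cos(xi))) / (1 + lam * (1 - np.cos(xi)))
    return np.abs(np.fft.ifft(E ** n)).sum()

for lam in (0.5, 5.0, 50.0):
    for n in (1, 10, 100):
        assert cn_opnorm(lam, n) <= 23.0   # Serdyukova's uniform bound
```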
Somewhat more general results were obtained in HAKBERG [1970] and NORDMARK
[1974]. We shall now briefly describe Hakberg's result.
Consider the initial value problem for the parabolic equation

\frac{\partial u}{\partial t} = (-1)^{m+1} \frac{\partial^M u}{\partial x^M},   M = 2m,  x \in R,  t > 0,    (7.15)

and consider a consistent finite difference operator of the form

E_k v = R(k A_h) v,    (7.16)

where R is a real rational function and

A_h v(x) = h^{-M} \sum_{|j| \le J} d_j v(x - jh),   d_{-j} = d_j \in R.

The operator thus defined is consistent with (7.15) if


R(y) = 1 - y + o(y)   as y \to 0,    (7.17)

and

A(\xi) = \xi^M + o(\xi^M)   as \xi \to 0,    (7.18)

where A(\xi) is the symbol of A_h.
The symbol of E_k is then

E(\xi) = R(\lambda A(\xi)),   where \lambda = k h^{-M},

and we say that E_k is uniformly maximum-norm stable if, with C independent of \lambda,

\|E_k^n v\| \le C \|v\|   for all n \ge 1,  0 < \lambda < \infty.
We have the following result.

THEOREM 7.4. Assume that (7.17) and (7.18) hold, and that

|R(y)| < 1   for 0 < y < \infty,

and

A(\xi) \ne 0   for 0 < |\xi| \le \pi.

Then the operator E_k defined in (7.16) is uniformly maximum-norm stable.

Setting

a_{n,j}(\lambda) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(\lambda A(\xi))^n e^{ij\xi} \, d\xi,

the proof is accomplished by showing that, under the hypotheses made,

\sum_j |a_{n,j}(\lambda)| \le C   for n \ge 1,  0 < \lambda < \infty.

Since A(\xi) is even we may write, using integration by parts in the second step, for j \ne 0,

a_{n,j}(\lambda) = \frac{1}{\pi} \int_0^{\pi} R(\lambda A(\xi))^n \cos j\xi \, d\xi
 = -\frac{1}{j\pi} \Big( \int_0^{\omega(\lambda)} + \int_{\omega(\lambda)}^{\pi} \Big) \frac{d}{d\xi} R(\lambda A(\xi))^n \sin j\xi \, d\xi
 = a'_{n,j}(\lambda) + a''_{n,j}(\lambda),

where \omega(\lambda) = \pi \min(\lambda^{-1/M}, 1). By our previous technique one may show

|a'_{n,j}(\lambda)| \le C \min((n\lambda)^{-1/M}, j^{-2} (n\lambda)^{1/M}).

Further, using the fact that for large y, R(y) has an expansion of the form

R(y) = r_1 + r_2 y^{-r} + o(y^{-r}),   r_2 \ne 0,  r \ge 1,

it is possible to demonstrate that

|a''_{n,j}(\lambda)| \le C \min((n\lambda^{-r})^{-1/M}, j^{-2} (n\lambda^{-r})^{1/M}).

Together these estimates complete the proof.
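Theorem 7.4 may be illustrated concretely: R(y) = 1/(1+y) (the backward Euler fraction, taken here only as an example) satisfies (7.17) and |R(y)| < 1 for y > 0, and the second-difference symbol A(\xi) = 2(1 - \cos\xi) satisfies (7.18) and does not vanish for 0 < |\xi| \le \pi. The l_1 norm of the coefficients of R(\lambda A(\xi))^n, and hence the maximum-norm operator norm, is then bounded uniformly in \lambda; in this particular case it equals 1, the coefficients being nonnegative:

```python
import numpy as np

def opnorm(lam, n, m=2**14):
    # l1 norm of the Fourier coefficients of R(lam*A(xi))^n with
    # R(y) = 1/(1+y) and A(xi) = 2(1 - cos xi).
    xi = 2 * np.pi * np.arange(m) / m
    E = 1.0 / (1.0 + lam * 2.0 * (1 - np.cos(xi)))
    return np.abs(np.fft.ifft(E ** n)).sum()

for lam in (0.1, 1.0, 100.0):
    # uniform maximum-norm stability: the norm does not grow with lam or n
    assert opnorm(lam, 20) <= 1.0 + 1e-8
```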

We shall end this section by briefly describing a method for deriving estimates
such as the above, in the case of constant coefficient problems. This is the method of
Fourier multipliers which has been used systematically for stability and convergence
analysis of finite difference methods in a sequence of papers, see BRENNER, THOMEE
and WAHLBIN [1975].
We introduce as earlier the Fourier transform on Rd,

\hat v(\xi) = (\mathcal{F} v)(\xi) = \int_{R^d} v(x) e^{-i\xi \cdot x} \, dx,   \xi \in R^d,

and its inverse,

v(x) = (\mathcal{F}^{-1} \hat v)(x) = (2\pi)^{-d} \int_{R^d} \hat v(\xi) e^{i\xi \cdot x} \, d\xi,   x \in R^d.

We recall the definition of the convolution of two functions on R^d,

(u * v)(x) = \int_{R^d} u(x - y) v(y) \, dy,

and the fact that, under appropriate conditions,

\widehat{u * v} = \hat u \hat v.
We shall denote by W_p the space L_p(R^d) for 1 \le p < \infty, and by W_\infty the functions in
C(R^d) which vanish at infinity. For 1 \le p \le \infty the space W_p is then the closure with
respect to the L_p-norm of C_0^\infty(R^d), of \mathcal{S}, the functions which, together with all their
derivatives, tend to zero faster than any negative power of r = |x| as r tends to
infinity, or of \mathcal{C}_0, the functions v for which \hat v \in C_0^\infty(R^d).
Let now, for a \in C^\infty(R^d), the operator A from \mathcal{C}_0 into itself be defined by

A v = \mathcal{F}^{-1}(a \hat v).    (7.19)

We say that a is a Fourier multiplier on L_p, or that a \in M_p = M_p(R^d), if

M_p(a) = \sup\{ \|A v\|_{L_p} : v \in \mathcal{C}_0,  \|v\|_{L_p} \le 1 \} < \infty.

The operator A on \mathcal{C}_0 defined by (7.19) may then be extended by continuity from
\mathcal{C}_0 to W_p. Our definition thus says that a \in M_p if the effect of multiplication by a on
the Fourier transform side is a bounded operator on W_p.
By Parseval's relation we see at once that a \in M_2 if and only if a is bounded, and then

M_2(a) = \sup_\xi |a(\xi)|.

Further, it may be shown that a \in M_1 if and only if a is the Fourier transform of
a bounded measure,

a(\xi) = \int_{R^d} e^{-i\xi \cdot x} \, d\mu(x),

and in this case

M_1(a) = \int_{R^d} |d\mu(x)|.

In particular, if

a(\xi) = \sum_j a_j e^{-ij \cdot \xi},

we have

M_1(a) = \sum_j |a_j|.

One may show that M_p is symmetric around p = 2 in the sense that

M_p(a) = M_q(a)   if p^{-1} + q^{-1} = 1,  1 \le p, q \le \infty.    (7.20)

Further,

M_q(a) \le M_p(a)   for 1 \le p \le q \le 2,

and the convexity inequality

M_p(a) \le M_2(a)^{2/p} M_\infty(a)^{1-2/p}   for p \ge 2    (7.21)

holds. Also, M_p(\cdot) is a submultiplicative norm, so that

M_p(ab) \le M_p(a) M_p(b).
We note that our finite difference operator Ek defined in (7.13) may be written
E_k v = \mathcal{F}^{-1}(E(h \cdot) \hat v),

where E(\xi) is the symbol of E_k, and hence E_k is stable in W_p exactly if

\|E_k^n\|_{W_p} = M_p(E(h \cdot)^n) \le C   for n \ge 0.

It is easy to see that the second quantity is independent of h, so that the condition
reduces to

M_p(E(\cdot)^n) \le C   for n \ge 0.    (7.22)

It follows from (7.20) and (7.21) that stability in maximum norm implies stability in
L_p for all other p. We remark that it follows from our above discussion that the
condition for stability in C(R) or l_{\infty,h} is the same as in W_\infty.
In order to use these notions to prove stability it is thus needed to have access to
some method to bound an expression such as the left-hand side in (7.22). One may
then first show that, in order to estimate a 2\pi-periodic multiplier a, it suffices to
estimate \eta a, where \eta is a function in C_0^\infty(R^d) which is identically 1 for |\xi_j| \le \pi,
j = 1, ..., d, or

M_p(a) \le C_\eta M_p(\eta a).    (7.23)

The basic result that may be used to estimate the latter quantity is then an inequality
which generalizes a well-known inequality by Carlson and Beurling, which may be
formulated, in the case d = 1, as

M_\infty(a) \le C \|a\|_{L_2(R)}^{1/2} \|a'\|_{L_2(R)}^{1/2},    (7.24)

or, in the case of an arbitrary d, for 2\nu > d,

M_\infty(a) \le C \|a\|_{L_2(R^d)}^{1 - d/(2\nu)} \Big( \sum_{|\alpha| = \nu} \|D^\alpha a\|_{L_2(R^d)} \Big)^{d/(2\nu)}.

Let us show, in the one-dimensional case, how this implies maximum-norm
stability for an operator which is parabolic in John's sense. In view of (7.23) we have

M_\infty(E(\cdot)^n) \le C M_\infty(\eta(\xi) E(\xi)^n),

where \eta \in C_0^\infty([-\pi - \delta, \pi + \delta]) and \eta \equiv 1 in [-\pi, \pi]. Because E_k is parabolic we have

|\eta(\xi) E(\xi)^n| \le C e^{-cn\xi^2}

and

|(\eta(\xi) E(\xi)^n)'| \le C n |\xi| e^{-cn\xi^2}   for \xi \in R.

Hence, by obvious calculations,

\|\eta E^n\|_{L_2(R)} \le C n^{-1/4}

and

\|(\eta E^n)'\|_{L_2(R)} \le C n^{1/4},

so that, by (7.24),

M_\infty(E(\cdot)^n) \le C   for n > 0,

which is the desired result.
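The cancellation of the two powers of n in this argument can be seen numerically with the model symbol a(\xi) = e^{-n\xi^2}, standing in for \eta(\xi)E(\xi)^n (the cutoff \eta is dropped in this sketch, since the Gaussian is already negligible outside a fixed interval):

```python
import numpy as np

def cb_product(n, L=8.0, m=200001):
    # ||a||_{L2}^{1/2} * ||a'||_{L2}^{1/2} for a(xi) = exp(-n xi^2),
    # the quantity bounded in the Carlson-Beurling inequality (7.24).
    xi = np.linspace(-L, L, m)
    dxi = xi[1] - xi[0]
    a = np.exp(-n * xi ** 2)
    da = -2 * n * xi * a
    l2 = lambda f: np.sqrt(np.sum(f ** 2) * dxi)
    return np.sqrt(l2(a)) * np.sqrt(l2(da))

vals = [cb_product(n) for n in (10, 100, 1000)]
# The n^{-1/4} decay of ||a||_{L2} cancels the n^{1/4} growth of ||a'||_{L2},
# so the bound obtained from (7.24) is uniform in n.
assert max(vals) / min(vals) < 1.001
```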
The technique just described may be applied to give a simple proof of the
sufficiency of the conditions of Theorem 7.3, and is also useful in showing convergence
estimates, cf. Section 9 below.

8. Stability of difference schemes for general parabolic equations

In this section we shall consider the stability of finite difference schemes for general
parabolic equations and systems. We first discuss equations or systems which are
parabolic in the sense of Petrovskii and difference schemes which are parabolic in
a sense generalizing the one introduced by John and described in Section 7 above.
We also show analogues of some known smoothing properties of parabolic
problems and touch upon the possibility of using such properties as definitions of
parabolicity of difference schemes.
We shall thus first consider equations of the form

\frac{\partial u}{\partial t} = P(x, t, D) u = \sum_{|\alpha| \le M} P_\alpha(x, t) D^\alpha u,   x \in R^d,  t > 0,    (8.1)

where, with \alpha = (\alpha_1, ..., \alpha_d),

D^\alpha u = (\partial/\partial x_1)^{\alpha_1} \cdots (\partial/\partial x_d)^{\alpha_d} u,   |\alpha| = \alpha_1 + \cdots + \alpha_d.

The equation is considered, as usual, under the initial condition

u(x, 0) = v(x),   x \in R^d.

Such an equation is said to be parabolic if P(x, t, D) is elliptic, or more precisely, if the
real part of its characteristic polynomial,

\bar P(x, t, i\xi) = \sum_{|\alpha| = M} P_\alpha(x, t) (i\xi)^\alpha,

is negative, and uniformly parabolic in a domain in R^d \times R_+ if, for (x, t) in
this domain,

Re \bar P(x, t, i\xi) \le -c |\xi|^M   for \xi \in R^d,  with c > 0.    (8.2)

It is clear that in this case M has to be an even positive number.


We shall also allow below that (8.1) is a system of N equations in N unknowns,
which is included by letting u be an N-vector, u = (u_1, ..., u_N)^T, and the P_\alpha (N \times N)
matrices, which we always assume to be sufficiently smooth for our purposes. In the
system case we replace condition (8.2) by a corresponding condition for the
eigenvalues. Setting, as in Section 4, for any (N \times N) matrix A with eigenvalues \{\lambda_j\}_1^N,

\Lambda(A) = \max_j Re \lambda_j,

we thus say that (8.1) is parabolic in Petrovskii's sense if

\Lambda(\bar P(x, t, i\xi)) \le -c |\xi|^M,   \xi \in R^d,  x \in R^d,  t > 0,

where now \xi = (\xi_1, ..., \xi_d)^T and \xi^\alpha = \xi_1^{\alpha_1} \cdots \xi_d^{\alpha_d}.
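Petrovskii's condition is a condition on the spectral abscissa \Lambda of the principal symbol, and for a given system it is straightforward to test numerically. The sketch below does so for a hypothetical 2 x 2 example whose principal part is the Laplacian in each component (the matrix used here is an illustrative choice of ours, not taken from the text):

```python
import numpy as np

def Lambda(A):
    # spectral abscissa: max_j Re lambda_j(A)
    return np.linalg.eigvals(A).real.max()

def Pbar(xi):
    # principal symbol Pbar(i xi) = [[-|xi|^2, 1], [0, -|xi|^2]] of a
    # hypothetical system u_t = Laplace(u) + lower-order coupling
    s = -np.dot(xi, xi)
    return np.array([[s, 1.0], [0.0, s]])

rng = np.random.default_rng(1)
for _ in range(100):
    xi = rng.standard_normal(2)
    xi /= np.linalg.norm(xi)
    r = rng.uniform(0.1, 10.0)
    # Petrovskii's condition with M = 2 and c = 1
    assert Lambda(Pbar(r * xi)) <= -r ** 2 + 1e-9
```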
It is known that associated with a system of the above kind there is a fundamental

solution \Gamma(x, t, y, s), defined for x, y \in R^d, 0 \le s < t, such that

u(x, t) = \int_{R^d} \Gamma(x, t, y, s) u(y, s) \, dy,

and such that

|D_x^\alpha D_y^\beta \Gamma(x, t, y, s)|
 \le C (t-s)^{-(d + |\alpha| + |\beta|)/M} \exp(-c (|x-y|^M (t-s)^{-1})^{1/(M-1)}),   t - s > 0.    (8.3)

The fundamental solution also has the property

|D_x^\alpha D_z^\beta \Gamma(x, t, x+z, s)|
 \le C (t-s)^{-(d + |\beta|)/M} \exp(-c (|z|^M (t-s)^{-1})^{1/(M-1)}),

with the power of t-s on the right independent of \alpha. (In the constant coefficient case
\Gamma depends only on z = y - x and t-s, and thus \Gamma(x, t, x+z, s) is independent of x.)
It follows, in particular, from the above representation and (8.3) that the initial
value problem in (8.1) is well posed in L_p = L_p(R^d) and W_p^m = W_p^m(R^d), the Sobolev
space defined by the norm

\|v\|_{W_p^m} = \sum_{|\alpha| \le m} \|D^\alpha v\|_{L_p}.

Further, the solution has the regularity property corresponding to our previous
result in L_2 for the constant coefficient case (cf. (4.9)): with u(x, t) = E(t) v we have, for
1 \le p \le \infty,

\|D^\alpha E(t) v\|_{L_p} \le C t^{-|\alpha|/M} \|v\|_{L_p}    (8.4)

and

\|E(t) v\|_{W_p^m} \le C \|v\|_{W_p^m}.


We consider now a one-step (or two-level) difference scheme defined by

B_{h,n} U^{n+1} = A_{h,n} U^n   for n \ge 0,    (8.5)

where A_{h,n} = A_h(nk) and B_{h,n} = B_h(nk) are difference operators of the form

A_h(t) v(x) = \sum_\beta a_\beta(x, t, h) v(x - \beta h),

B_h(t) v(x) = \sum_\beta b_\beta(x, t, h) v(x - \beta h),

with a_\beta, b_\beta smooth functions of x, t and h, and where \lambda = k/h^M = constant. We assume
that B_h(t) is invertible (in the appropriate L_p-space) so that (8.5) may be written

U^{n+1} = E_{k,n} U^n   with E_{k,n} = B_{h,n}^{-1} A_{h,n},

or

U^n = E_k^n U^0 = E_{k,n-1} E_{k,n-2} \cdots E_{k,0} U^0.

The scheme is stable in L_p if

\|U^n\|_{L_p} \le C \|U^m\|_{L_p}   for 0 \le mk \le nk \le T,

and consistent with (8.1) if, for any smooth solution u of (8.1),

B_{h,n} u(x, (n+1)k) = A_{h,n} u(x, nk) + o(k)   as k \to 0.

We introduce the symbol (or amplification matrix) of the difference scheme,

E(x, t, \xi) = \Big( \sum_\beta b_\beta(x, t, 0) e^{-i\beta \cdot \xi} \Big)^{-1} \Big( \sum_\beta a_\beta(x, t, 0) e^{-i\beta \cdot \xi} \Big),

where we assume that the first factor is uniformly bounded.


We first note that certain conditions suggested from the corresponding constant
coefficient problems are necessary for stability.

THEOREM 8.1. If 1 \le p \le \infty and the scheme is stable in L_p, then

|E(x, t, \xi)^n| \le C   for x, \xi \in R^d,  0 \le t \le T,  n \ge 0.    (8.6)

In particular, von Neumann's condition,

\rho(E(x, t, \xi)) \le 1   for x, \xi \in R^d,  0 \le t \le T,

is necessary for stability in L_p for any p with 1 \le p \le \infty.

It was shown by WIDLUND [1966] that neither in L_2 nor L_\infty is (8.6) a sufficient
condition for stability, in the case of variable coefficients.
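Condition (8.6), and in particular von Neumann's condition, is easy to check numerically for a concrete scheme by scanning \xi. A minimal illustration (forward Euler for the one-dimensional heat equation, treated as a 1 x 1 system; the grid resolution is an arbitrary choice of ours):

```python
import numpy as np

def spectral_radius_bound(E_symbol, m=4001):
    # max over a xi-grid of the spectral radius of the amplification matrix
    xi = np.linspace(-np.pi, np.pi, m)
    return max(np.abs(np.linalg.eigvals(E_symbol(x))).max() for x in xi)

# forward Euler for u_t = u_xx, with lam = k/h^2
make_E = lambda lam: (lambda x: np.array([[1 - 2 * lam * (1 - np.cos(x))]]))

assert spectral_radius_bound(make_E(0.5)) <= 1 + 1e-12   # von Neumann holds
assert spectral_radius_bound(make_E(0.6)) > 1            # violated for lam > 1/2
```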
Following WIDLUND [1965] we now introduce a generalization of the previous
concept of a difference scheme which is parabolic in the sense of John, and say that
our difference scheme is parabolic (in the sense of John) if

\rho(E(x, t, \xi)) \le 1 - c |\xi|^M   for |\xi_j| \le \pi,  j = 1, ..., d,  c > 0.    (8.7)

The following important result was then proved by WIDLUND [1965, 1966] (cf. also
ARONSON [1963b]).

THEOREM 8.2. If the difference scheme is parabolic in the sense of John, then it is stable
in L_p for 1 \le p \le \infty.

We also have the following regularity estimate corresponding to the estimate (8.4)
for the continuous problem, where \partial^\alpha as earlier denotes mixed forward difference
quotients.

THEOREM 8.3. If the difference scheme is parabolic in John's sense, then

\|\partial^\alpha U^n\|_{L_p} \le C ((n-m+1)k)^{-|\alpha|/M} \|U^m\|_{L_p}   for 0 \le mk \le nk \le T.    (8.8)

The results depend on estimates for a discrete fundamental solution: One may
write

U^n(x) = h^d \sum_y g_{n,m,y}(x) U^m(x - yh),

and it is possible to show that, for |\beta_i| h \le \delta, i = 1, ..., d, with some \delta > 0 (\delta arbitrary
in the explicit case),

|\partial^\alpha g_{n,m,y}(x)|
 \le C ((n-m+1)k)^{-(|\alpha|+d)/M} \exp(C |\beta|^{M/(M-1)} (n-m+1)k - \beta \cdot yh).

This is done by first freezing the coefficients at an arbitrary point (x_0, t_0) and
obtaining estimates for the corresponding constant coefficient problem by the
Fourier method, whereby the analyticity of the symbol is used to move the path of
integration into the complex. After this the frozen coefficient fundamental solution is
employed as a parametrix to obtain the final estimate by a perturbation argument.
With suitable choice of \beta this yields estimates of the form, e.g.,

|\partial^\alpha g_{n,m,y}(x)|
 \le C ((n-m+1)k)^{-(|\alpha|+d)/M} \exp(-c (|yh|^M ((n-m+1)k)^{-1})^{1/(M-1)})

and

|\partial^\alpha g_{n,m,y}(x)| \le C ((n-m+1)k)^{-(|\alpha|+d)/M} \exp(-c |y|).

The first of these is analogous to the corresponding estimates (8.3) for the
continuous problem. The second estimate, which is not needed for explicit schemes
as \delta is then unrestricted, shows that the contributions for large y are exponentially
small.
The above results may also be generalized to certain multistep schemes.
Following WIDLUND [1966] we thus consider schemes of the form

B_{h,n} U^{n+1} = A_{h,1,n} U^n + \cdots + A_{h,m,n} U^{n-m+1},    (8.9)

where the operators A_{h,j,n}, j = 1, ..., m, are of the form A_{h,j,n} = A_{h,j}(nk), where

A_{h,j}(t) v(x) = \alpha_j v(x) + \sum_\beta a_{j,\beta}(x, t, h) v(x - \beta h),

and similarly

B_h(t) v(x) = A_{h,0}(t) v(x) = v(x) + \sum_\beta a_{0,\beta}(x, t, h) v(x - \beta h),

with the sums extending only over finite sets of multi-indices \beta. We note that the

natural generalizations to variable coefficients of the multistep schemes discussed in


Section 6 may be written in this form. As earlier, in addition to the initial condition
U° = v, it is necessary here to supply methods for the determination of U', ... , Um -',
and we shall assume that this is done in some appropriate way, in particular so that
these values are suitably bounded by the initial data v.
In the same way as previously discussed, such a scheme may be represented as
a two-level scheme for an associated system in the variable \hat U^n = (U^n, U^{n-1}, ..., U^{n-m+1})^T,
namely

\hat B_{h,n} \hat U^{n+1} = \hat A_{h,n} \hat U^n,

or \hat U^{n+1} = \hat E_{h,n} \hat U^n, where

\hat E_{h,n} = \begin{pmatrix}
B_{h,n}^{-1} A_{h,1,n} & B_{h,n}^{-1} A_{h,2,n} & \cdots & B_{h,n}^{-1} A_{h,m,n} \\
I & 0 & \cdots & 0 \\
  & \ddots & & \vdots \\
0 & \cdots & I & 0
\end{pmatrix}.

Similarly to the above we may introduce the principal symbol of the operator
\hat E_{h,n} = \hat E_h(nk),

\hat E(x, t, \xi) = \begin{pmatrix}
\tilde B^{-1} \tilde A_1 & \tilde B^{-1} \tilde A_2 & \cdots & \tilde B^{-1} \tilde A_m \\
I & 0 & \cdots & 0 \\
  & \ddots & & \vdots \\
0 & \cdots & I & 0
\end{pmatrix},

where

\tilde A_j = \tilde A_j(x, t, \xi) = \alpha_j + \sum_\beta a_{j,\beta}(x, t, 0) e^{-i\beta \cdot \xi},   j = 1, ..., m,

and correspondingly for \tilde B = \tilde B(x, t, \xi).


For a consistent scheme we have, in particular,
\hat E(x, t, 0) = A,

where

A = \begin{pmatrix}
\alpha_1 & \alpha_2 & \cdots & \alpha_m \\
1 & 0 & \cdots & 0 \\
  & \ddots & & \vdots \\
0 & \cdots & 1 & 0
\end{pmatrix}.

It follows that the boundedness of A^n, n = 0, 1, ..., which we shall refer to as stability
of the matrix A, is a necessary condition for stability of the scheme (8.9), or of \hat E_{h,n}. As
is well known, A is a stable matrix if and only if all eigenvalues of A are in the closed
unit disk and all eigenvalues on the unit circle are simple. Recall also that these
eigenvalues are the roots of the equation

\mu^m - \alpha_1 \mu^{m-1} - \cdots - \alpha_m = 0.
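This root condition can be tested mechanically: compute the roots of \mu^m - \alpha_1 \mu^{m-1} - \cdots - \alpha_m and check that they lie in the closed unit disk, with any root on the unit circle simple. A sketch (the tolerance and the example coefficient sets are choices of ours):

```python
import numpy as np

def is_stable(alphas, tol=1e-6):
    # roots of mu^m - alpha_1 mu^{m-1} - ... - alpha_m = 0
    mu = np.roots(np.concatenate(([1.0], -np.asarray(alphas, float))))
    if np.abs(mu).max() > 1 + tol:          # a root outside the unit disk
        return False
    on_circle = mu[np.abs(np.abs(mu) - 1) <= tol]
    # roots on the unit circle must be simple, i.e. pairwise distinct
    for i in range(len(on_circle)):
        for j in range(i + 1, len(on_circle)):
            if abs(on_circle[i] - on_circle[j]) <= tol:
                return False
    return True

assert is_stable([1.0])             # one-step case: mu = 1
assert is_stable([4/3, -1/3])       # roots 1 and 1/3
assert not is_stable([2.0, -1.0])   # double root mu = 1: not stable
```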
We say that the multistep scheme (8.9) is parabolic (in the sense of John) if

\rho(\hat E(x, t, \xi)) \le 1 - c |\xi|^M   for |\xi_j| \le \pi,  j = 1, ..., d,

uniformly in x and t.
The following is now the main result of WIDLUND [1966] about the stability of the
multistep parabolic scheme.

THEOREM 8.4. Assume that the multistep scheme (8.9) is consistent with (8.1) and that
A is a stable matrix. Then the scheme is stable in L_p, and

\|\partial^\alpha U^n\|_{L_p} \le C ((n-m+1)k)^{-|\alpha|/M} \|\hat U^{m-1}\|_{L_p},   1 \le p \le \infty.

In order to determine if a scheme satisfies the parabolicity condition, one may
employ the following criterion, see WIDLUND [1966].

THEOREM 8.5. Assume that the difference scheme (8.9) is consistent with (8.1), which is
parabolic in Petrovskii's sense. Assume further that all eigenvalues of A except one are
in the open unit disk, and that for \xi \ne 0, |\xi_j| \le \pi, j = 1, ..., d, all eigenvalues of \hat E(x, t, \xi)
are in the open unit disk. Then the scheme is parabolic.

We note that the requirement on A here is stronger than the stability of this
matrix. It may be shown that the theorem does not hold in general when only the
stability of A is assumed.
We shall briefly consider a different parabolicity concept and take smoothing
properties like (8.4) and (8.8) as our new definitions (cf. THOMEE [1966]). We restrict
the discussion to the case of constant coefficients and with the basic space as
L =L2(lRd). We consider thus the pure initial value problem for the equation

au d
a = P(D)u_ E P2Dau, xe , t, (8.10)
6Mlal
where the P, are (N x N) constant matrices. We say that (8.10) is weakly parabolic if
the initial value problem is weakly correctly posed, that is, if

A(P(ir)) < C for Ed,

and if for any positive integer m and 0 < T < T, with C = C(z, T),

IIE(t)vlI[H<CIIvllL. for O<rzt•T,


96 V. Thome CHAPTER 11

where H m denotes the Sobolev space W'. If in addition the initial value problem is
correctly posed in L2, so that, with C=C(T),

IIE(t)vllL2,<CiIIvL2 for 0<t<T,


we say that (8.10) is strongly parabolic in L 2. One may show the following
characterizations.

THEOREM 8.6. Equation (8.10) is weakly parabolic if and only if there are positive
constants c, C and \nu such that

\Lambda(P(i\xi)) \le -c |\xi|^\nu + C   for \xi \in R^d.    (8.11)

It is strongly parabolic in L_2 if and only if there are positive constants c, C, C_1 and
\nu and, for each \xi \in R^d, a positive-definite Hermitian matrix H(\xi) such that

c I \le H(\xi) \le C_1 I,

and

Re(H(\xi) P(i\xi)) = \tfrac12 (H(\xi) P(i\xi) + P(i\xi)^* H(\xi)) \le (-c |\xi|^\nu + C) I.    (8.12)

Systems satisfying (8.11) are also said to be parabolic in Shilov's sense (cf. SHILOV
[1955]). The largest possible \nu in (8.11) and (8.12) is referred to as the order of
parabolicity.
It follows at once from (8.3) that if (8.10) is parabolic in Petrovskii's sense, then it is
both weakly and strongly parabolic of order M; the scalar equation

\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^3 u}{\partial x^3}

provides an example of an equation which is weakly and strongly parabolic of order
2, but not parabolic in Petrovskii's sense.
It follows easily from Theorem 8.6 that if (8.10) is strongly parabolic in L_2 of order
\nu, then the corresponding solution operator satisfies

\|E(t) v\|_{H^m} \le C t^{-(m-j)/\nu} \|v\|_{H^j}   for 0 \le j \le m,  0 < t \le T.

This property was the basis for a generalization in PEETRE and THOMEE [1967] where
a system with time-independent coefficients is said to be strongly parabolic of order
\nu in W_p (where W_p denotes L_p(R^d) if 1 \le p < \infty, and W_\infty is the closure of C_0(R^d) in
L_\infty(R^d)) if the solution operator satisfies

\|E(t) v\|_{W_p^m} \le C t^{-(m-j)/\nu} \|v\|_{W_p^j}   for 0 \le j \le m,  0 < t \le T.
This concept will reappear in Section 9 below.
In analogy with the situation in the continuous case, we say that the finite
difference operator E_k with constant coefficients is weakly parabolic if, for any \alpha and
0 < \tau \le T, with C = C(\alpha, \tau, T),

\|\partial^\alpha E_k^n v\|_{L_2} \le C \|v\|_{L_2}   for \tau \le nk \le T,

and strongly parabolic (in L_2) if in addition E_k is stable in L_2.
We have the following analogue of Theorem 8.6.

THEOREM 8.7. The finite difference operator E_k is weakly parabolic if and only if there
are positive constants c, C and \nu such that

\rho(E_k(\xi)) \le 1 - c k |\xi|^\nu + C k   for h |\xi_j| \le \pi,  k \le k_0.

It is strongly parabolic if and only if there are positive constants c, C, C_1 and \nu and, for
each k \le k_0, \xi \in R^d, a positive-definite matrix H_k(\xi) with

c I \le H_k(\xi) \le C_1 I

and such that

|E_k(\xi)|_{H_k(\xi)} \le 1 - c k |\xi|^\nu + C k   for h |\xi_j| \le \pi,  k \le k_0,

where |\cdot|_H denotes the operator norm subordinate to the vector norm (H v, v)^{1/2} on R^N.

If Ek is parabolic in the sense of John, so that (8.7) holds, then it is weakly parabolic
and strongly parabolic in L 2 of order M. For any system which is strongly parabolic
in L 2 one can construct Ek with the same property.
Motivation for introducing these concepts is provided by the following analogue
of the Lax equivalence theorem.

THEOREM 8.8. Assume that E_k is stable in L_2 and consistent with (8.10). Then E_k is
strongly parabolic in L_2 if and only if, for any \alpha, any v \in L_2 and any t > 0, we have

\|\partial^\alpha U^n - D^\alpha u(\cdot, t)\|_{L_2} \to 0   as k \to 0,  nk = t.

For another concept of a parabolic difference operator, of single-step or multistep
type, see IONKIN and MOKIN [1974] and MOKIN [1975]. This concept is associated
with the use of a discrete Fourier transform in space and a discrete Laplace
transform in time, and results in a priori estimates in discrete analogues of the
Sobolev spaces W_2^m.

9. Convergence estimates

In the preceding sections we have shown a number of convergence results (cf. e.g.
Theorem 3.2) to the effect that a stable finite difference scheme which is accurate of
order \mu is convergent to that same order provided the exact solution is smooth

enough. We also know by the Lax equivalence theorem (Theorem 3.3) that
convergence follows from stability and consistency without any regularity hypo-
theses on the data other than that they belong to the space of functions under
consideration. Our purpose in this section is to give more precise information about
the relation between the rate of convergence and the smoothness of the exact
solution.
Consider thus an initial value problem

\frac{\partial u}{\partial t} = P(x, t, D) u = \sum_{|\alpha| \le M} P_\alpha(x, t) D^\alpha u,   x \in R^d,  t > 0,    (9.1)

u(x, 0) = v(x),   x \in R^d,

where the equation is parabolic (in the sense of Petrovskii), and a consistent finite
difference scheme of the form

U^{n+1} = E_{k,n} U^n = B_{h,n}^{-1} A_{h,n} U^n,   n \ge 0,  \lambda = k/h^M,    (9.2)

where A_{h,n} = A_h(nk) and B_{h,n} = B_h(nk) are finite difference operators of the form used
in (8.5). We shall also assume that the scheme is accurate of order \mu. In Sections 5 and
6 we have become acquainted with several examples of difference schemes of various
orders of accuracy, and we also quoted a result in KREISS [1959a] where it is shown
that schemes of arbitrarily high order of accuracy exist in great generality.
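For smooth data, a scheme's order of accuracy is directly observable as a convergence rate. The sketch below applies forward Euler with \lambda = 1/4 to u_t = u_xx with v(x) = \sin x on a periodic interval (a second-order accurate choice used only as an illustration, not a scheme singled out by the text) and checks that halving h cuts the maximum-norm error by roughly a factor 4:

```python
import numpy as np

def solve_heat(mgrid, t=1.0, lam=0.25):
    # forward Euler for u_t = u_xx on [0, 2 pi) with u(x, 0) = sin(x);
    # the exact solution is exp(-t) sin(x)
    h = 2 * np.pi / mgrid
    k = lam * h * h
    n = int(round(t / k))
    x = h * np.arange(mgrid)
    U = np.sin(x)
    for _ in range(n):
        U = U + lam * (np.roll(U, -1) - 2 * U + np.roll(U, 1))
    # compare at the time nk actually reached
    return np.abs(U - np.exp(-n * k) * np.sin(x)).max()

e1, e2 = solve_heat(32), solve_heat(64)
# second-order accuracy: error ratio close to 4 when h is halved
assert 3.0 < e1 / e2 < 5.0
```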
In order to express our results we need to introduce certain Banach spaces, known
as the Besov spaces B_p^{s,q}. Recall from Section 7 that W_p = W_p(R^d) denotes L_p(R^d) for
1 \le p < \infty and that W_\infty is the closure of C_0(R^d) with respect to the norm in L_\infty(R^d).
For v \in W_p, t > 0, we introduce the modulus of continuity in W_p of order j, j = 1, 2, by

\omega_{p,j}(t, v) = \sup_{|y| \le t} \|(T_y - I)^j v\|_{W_p},

where T_y v(x) = v(x + y). Let now s > 0 and 1 \le p, q \le \infty, and let us write s in the form
s = S + s_0, where S is a nonnegative integer and 0 < s_0 \le 1. Then B_p^{s,q} is the subspace of
W_p^S defined for 1 \le q < \infty by the norm

\|v\|_{B_p^{s,q}} = \|v\|_{W_p^S} + \sum_{|\alpha| = S} \Big( \int_0^\infty (t^{-s_0} \omega_{p,j}(t, D^\alpha v))^q \frac{dt}{t} \Big)^{1/q},

with j = 1 if 0 < s_0 < 1, j = 2 if s_0 = 1, and with the obvious modification for q = \infty,
namely,

\|v\|_{B_p^{s,\infty}} = \|v\|_{W_p^S} + \sum_{|\alpha| = S} \sup_{t > 0} (t^{-s_0} \omega_{p,j}(t, D^\alpha v)).

We shall often write B_p^s = B_p^{s,\infty} for brevity. This space is then defined by a Lipschitz
condition in W_p for the derivatives of order S (or, for s_0 = 1, i.e. for s integer,
a so-called Zygmund condition for the derivatives of order S = s - 1).

As an example, for the Heaviside function in one dimension,

H(x) = 1 for x > 0,   H(x) = 0 for x \le 0,

we have for 1 \le p < \infty

\|(T_y - I) H\|_{W_p} = |y|^{1/p},

and it follows that if \varphi \in C_0^\infty then \varphi H \in B_p^{1/p}.
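The modulus of continuity of the Heaviside function, and hence this Besov regularity, can be checked directly: the difference H(x+y) - H(x) is the indicator of an interval of length y, so its L_p norm is y^{1/p} exactly (a numerical sketch; the grid is a choice of ours):

```python
import numpy as np

def modulus(p, y, L=2.0, m=400001):
    # ||(T_y - I)H||_{L_p} for the Heaviside function, on a fine grid
    x = np.linspace(-L, L, m)
    dx = x[1] - x[0]
    diff = ((x + y) > 0).astype(float) - (x > 0).astype(float)
    return (np.sum(np.abs(diff) ** p) * dx) ** (1.0 / p)

for p in (1.0, 2.0, 4.0):
    for y in (0.01, 0.1, 0.5):
        # the text's identity ||(T_y - I)H||_{W_p} = |y|^{1/p}
        assert abs(modulus(p, y) - y ** (1.0 / p)) < 1e-2
```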
The Besov spaces form a scale of spaces between the Sobolev spaces in the sense
that B_p^{s_1,q_1} \subset B_p^{s_2,q_2} if s_1 > s_2, or s_1 = s_2, q_1 \le q_2, and B_p^{s,1} \subset W_p^s \subset B_p^{s,\infty} if s is a natural
number, and we may think of a function in B_p^{s,q} as one with s derivatives in W_p. Here
inclusion stands for continuous embedding, so that in each case a corresponding
inequality between norms holds. The Besov spaces are also intermediate spaces
between the Sobolev spaces in the sense of the theory of interpolation of Banach
spaces. For instance, if m is a natural number and 0 < s < m, then there is a constant
C such that for any bounded linear operator A in W_p with

\|A v\|_{W_p} \le C_0 \|v\|_{W_p}

and

\|A v\|_{W_p} \le C_1 \|v\|_{W_p^m}    (9.3)

we have

\|A v\|_{W_p} \le C C_0^{1-\theta} C_1^\theta \|v\|_{B_p^s},   \theta = s/m.

In fact, estimate (9.3) may be replaced by the weaker estimate

\|A v\|_{W_p} \le C_1 \|v\|_{B_p^{m,1}}

to yield the same conclusion.
We return now to the initial value problem (9.1) and the associated finite
difference scheme (9.2). In terms of the spaces just introduced one may show the
following convergence result. Here we denote by u^n the value of the exact solution at
t = nk. Recall that if E_{k,n} is parabolic in John's sense then it is stable in W_p for any
p with 1 \le p \le \infty (Theorem 8.2).

THEOREM 9.1. Let 1 \le p \le \infty and assume that (9.1) is parabolic in Petrovskii's sense
and that the scheme (9.2) is stable in W_p and accurate of order \mu. Then, for 1 \le q \le \infty,

\|U^n - u^n\|_{W_p} \le C h^\mu (\log(1/h))^{1-1/q} \|v\|_{B_p^{\mu,q}}    (9.4)

for nk \le T.

The proof was given by PEETRE and THOMEE [1967] for the case of time-
independent coefficients, and is valid also in the general case. It is based on
estimating the different terms in the identity

U^n - u(\cdot, nk) = E_{k,n-1} \cdots E_{k,0} v - E(nk, 0) v
 = \sum_{j=0}^{n-1} E_{k,n-1} \cdots E_{k,j+1} (E_{k,j} - E((j+1)k, jk)) E(jk, 0) v,

using the uniform boundedness of the operator products E_{k,n-1} \cdots E_{k,j+1}, the
definition of accuracy and a smoothing property for the exact solution which follows
from the estimate (8.3) for the fundamental solution (cf. (8.4)).
For initial data that are less regular one may show a corresponding lower rate of
convergence.

THEOREM 9.2. Under the assumptions of Theorem 9.1 we have

\|U^n - u^n\|_{W_p} \le C_s h^s \|v\|_{B_p^s}    (9.5)

for 0 < s < \mu,  nk \le T.

This result follows by interpolation between the result (9.4) with q = 1 and the
obvious inequality

\|U^n - u^n\|_{W_p} \le C \|v\|_{W_p}.

Note that, from our above remark about how the interpolation theory works, the
norm on the right in (9.5) is that in B_p^{s,\infty} = B_p^s and not the stronger norm in B_p^{s,1} as
might have been expected.
In the particular case that (9.2) is parabolic in John's sense, the factor \log(1/h) can
be removed from (9.4) for q = \infty, as was shown by WIDLUND [1970a, 1970b] (cf. also
results in this direction in special cases by HEDSTROM [1968] and LOFSTROM [1970]):

THEOREM 9.3. Assume that (9.1) is parabolic in the sense of Petrovskii and that (9.2) is
parabolic in the sense of John and accurate of order \mu. Then, if 1 \le p \le \infty, we have

\|U^n - u^n\|_{W_p} \le C h^\mu \|v\|_{B_p^\mu}   for nk \le T.

The proof depends on estimates for the discrete fundamental solution. These
results may be thought of as generalizations to 1 \le p \le \infty and variable coefficients of
the corresponding results of Section 4. In fact, even in the case p = 2 and constant
coefficients, Theorem 9.3 is sharper than Theorem 4.5, as B_2^\mu strictly contains H^\mu.
We shall briefly discuss the sharpness of the above results. We first present the
following saturation result of LOFSTROM [1970] in the constant coefficient case.

THEOREM 9.4. Assume that (9.1) is parabolic in the sense of Petrovskii and has constant
coefficients and no lower-order terms, and that (9.2) also has constant coefficients, is
stable in W_p for some p with 1 \le p \le \infty, and accurate of order exactly \mu. Let v be such
that, for some s with 0 < s \le \mu,

\sup_{nk \le T} \|U^n - u^n\|_{W_p} \le C h^s.    (9.6)

Then v \in B_p^s. If (9.6) holds for some s > \mu, then v \equiv 0.

This shows that if the rate of convergence in W_p is O(h^s), where s \le \mu,
uniformly on [0, T], then v has to belong to B_p^s. In the proof of this result it is
essential that the estimate is valid also for small t = nk. If a convergence rate of O(h^s)
is observed for a fixed positive time only, less can be said about v. We quote some
simple results in this direction from THOMEE and WAHLBIN [1974] concerning
maximum-norm convergence in the case of the one-dimensional heat equation

\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2},   x \in R,  t > 0,    (9.7)

and a corresponding constant coefficient scheme defined by

E_k v(x) = \sum_j a_j v(x - jh),   k/h^2 = \lambda = constant.    (9.8)

THEOREM 9.5. Consider the initial value problem for the parabolic equation (9.7) and
a corresponding scheme (9.8) of order \mu which is parabolic in the sense of John. Let
1 \le s \le \mu and t > 0, and assume that v \in L_\infty is such that

\|U^n - u^n\|_{L_\infty} \le C h^s   as nk = t,  k \to 0.

Then v \in B_\infty^{s-1}.

The result of Theorem 9.4 is still best possible, however, in the sense that for some
v \in B_\infty^s the O(h^s) convergence rate is best possible.

THEOREM 9.6. Let the assumptions of Theorem 9.5 hold and let 0 < s \le \mu, t > 0. Then
there exists a v \in B_\infty^s such that

\limsup_{k \to 0, nk = t} h^{-s} \|U^n - u^n\|_{L_\infty} > 0.

In spite of the above, it is, however, possible to attain an O(h^s) convergence rate in
the maximum norm under weaker assumptions than v \in B_\infty^s (s \le \mu), namely if the
regularity of the initial data is measured in L_1. In this regard we have the following
result, proved in THOMEE and WAHLBIN [1974] for the case of time-independent
coefficients. In a particular case a similar result was proved in JUNCOSA and YOUNG
[1957].

THEOREM 9.7. Under the assumptions of Theorem 9.3, let d \le s \le \mu and 0 < \tau < T. Then
there is a constant C such that, for v \in B_1^s,

\|U^n - u^n\|_{L_\infty} \le C h^s \|v\|_{B_1^s},   \tau \le nk \le T.

In the special one-dimensional situation treated in Theorems 9.5 and 9.6, the
estimate takes the more precise form

\|U^n - u^n\|_{L_\infty} \le C h^s (nk)^{-1/2} \|v\|_{B_1^s},   s = 1, 2, ....    (9.9)
The proofs of Theorems 9.5 and 9.6 and of (9.9) depend on Fourier analysis
whereas the more general result of Theorem 9.7 uses the estimates for discrete
fundamental solutions of Widlund.
As an example of a function which illustrates the difference between the spaces
occurring above, let us take \chi \in C_0^\infty(R) with \chi(0) \ne 0 and set, with a > 0,

\chi_a(x) = x^a \chi(x) for x > 0,   \chi_a(x) = 0 for x \le 0.    (9.10)

Clearly many functions of one variable occurring in applications are linear
combinations of functions of this form. As is easily seen, we have \chi_a \in B_p^s if and only if
s < a + p^{-1}, so that for a in the appropriate range, Theorem 9.2 shows a maximum-
norm convergence rate of O(h^a), uniformly down to t = 0, whereas Theorem 9.7
shows an O(h^{a+1}) error estimate for t positive. A function of the type described in
Theorem 9.6 will have to have its roughness more distributed over the real axis and is
not likely to be of practical interest.
The above results all refer to equations which are parabolic in Petrovskii's sense. We note that the result of Theorem 9.1 was shown in PEETRE and THOMEE [1967] for equations which are strongly parabolic, in the sense introduced in Section 8, so that the following holds:

THEOREM 9.8. Assume that (9.1) is strongly parabolic of order v (v \le M) in W_p, i.e. such that the initial value problem is correctly posed in W_p and such that

\|u(\cdot, t)\|_{W^m_p} \le C t^{-(m-j)/v} \|v\|_{W^j_p}

for j \le m, 0 < t \le T.
Assume further that the difference scheme (9.2) is stable in W_p and accurate of order \mu. Then

\|U^n - u^n\|_{W_p} \le C_q T h^\mu (\log(1/h))^{1-1/q} \|v\|_{B^{M+\mu}_q}   for 1 \le q < \infty,

and

\|U^n - u^n\|_{W_p} \le C_s T h^{s\mu/(M+\mu-v)} \|v\|_{B^s_p}   for 0 < s < M + \mu - v.

For initial data which are not smooth enough to result in optimal order convergence, it is possible to remedy the situation by first smoothing the data in an appropriate manner. We shall now describe a result in this direction by KREISS, THOMEE and WIDLUND [1970].
For this purpose we shall introduce a smoothing operator M_h by setting, for a given function \phi on R^d and \phi_h(x) = h^{-d} \phi(h^{-1}x),

M_h v(x) = (\phi_h * v)(x) = \int_{R^d} \phi_h(y) v(x - y) dy,   (9.11)

or, in terms of Fourier transforms,

\widehat{M_h v}(\xi) = \hat\phi(h\xi) \hat v(\xi).   (9.12)
We call M_h a smoothing operator of order \mu on L_p if

\hat\phi(\xi) = 1 + \sum_{|\alpha|=\mu} \xi^\alpha b^{(0)}_\alpha(\xi)   (9.13)

and

\hat\phi(\xi) = \sum_{|\alpha|=\mu} (\sin \tfrac12 \xi)^\alpha b^{(1)}_\alpha(\xi),   (9.14)

where b^{(j)}_\alpha, j = 0, 1, are such that, for some \delta > 0, b^{(0)}_\alpha and b^{(1)}_\alpha coincide with multipliers on F L_p for |\xi| < 2\delta and |\xi| > \delta, respectively. Since the multipliers on F L_2 are simply the functions in L_\infty, the above conditions are seen to be satisfied for p = 2 if

\hat\phi(\xi) = 1 + O(|\xi|^\mu)   as \xi \to 0,

and, for any multi-index \beta \ne 0,

\hat\phi(\xi) = O(|\xi - 2\pi\beta|^\mu)   as \xi \to 2\pi\beta,

uniformly in \beta.
Special examples of smoothing operators of orders 1 and 2, respectively, in the case d = 1 are

M_h^{(1)} v(x) = h^{-1} \int_{-h/2}^{h/2} v(x - y) dy   (9.15)

and

M_h^{(2)} v(x) = h^{-1} \int_{-h}^{h} (1 - h^{-1}|y|) v(x - y) dy,   (9.16)

and, for general \mu, a smoothing operator of order \mu can easily be constructed in the form

M_h^{(\mu)} v(x) = h^{-1} \int \psi_\mu(h^{-1} y) v(x - y) dy,

where \psi_\mu is a function which is piecewise a polynomial of degree \mu - 1 and which vanishes outside (-\mu + \tfrac12, \mu - \tfrac12) for \mu odd and (-\mu + 1, \mu - 1) for \mu even. In fact, M_h^{(j)}, j = 1, 2, corresponds to

\hat\phi^{(j)}(\xi) = \Big( \frac{\sin \tfrac12\xi}{\tfrac12\xi} \Big)^j,   j = 1, 2,

and, generally,

\hat\phi^{(\mu)}(\xi) = \Big( \frac{\sin \tfrac12\xi}{\tfrac12\xi} \Big)^\mu p_\mu(\sin \tfrac12\xi),

where p_\mu is the polynomial of lowest degree such that

\hat\phi^{(\mu)}(\xi) = 1 + O(\xi^\mu)   as \xi \to 0.
For p = 2, the operator M_h corresponding to

\hat\phi(\xi) = 1 for |\xi| \le \delta,   \hat\phi(\xi) = 0 for |\xi| > \delta,

with 0 < \delta < \pi is a smoothing operator of arbitrarily high order; in this case

\phi(x) = \frac{\sin \delta x}{\pi x}.
Smoothing operators in higher dimensions can be obtained by taking products of one-dimensional operators,

M_h v = \Big( \prod_{j=1}^{d} M_{h,j} \Big) v,

where M_{h,j} is a smoothing operator with respect to x_j.
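The one-dimensional operators (9.15) and (9.16) are easy to experiment with. The following sketch (our own function names; a simple midpoint-rule quadrature stands in for the exact integrals, and none of this is prescribed by the text) applies both operators to sample functions: each reproduces constants exactly, and each perturbs the quadratic x^2 at the origin by h^2/12 and h^2/6, respectively.

```python
def smooth_box(v, x, h, nquad=200):
    # M_h^(1) v(x) = h^{-1} * integral over (-h/2, h/2) of v(x - y) dy, cf. (9.15);
    # the integral is approximated by a composite midpoint rule.
    dy = h / nquad
    s = sum(v(x - (-h / 2 + (i + 0.5) * dy)) for i in range(nquad))
    return s * dy / h

def smooth_triangle(v, x, h, nquad=200):
    # M_h^(2) v(x) = h^{-1} * integral over (-h, h) of (1 - |y|/h) v(x - y) dy, cf. (9.16)
    dy = 2 * h / nquad
    s = 0.0
    for i in range(nquad):
        y = -h + (i + 0.5) * dy
        s += (1 - abs(y) / h) * v(x - y)
    return s * dy / h
```

Both averages also reproduce linear functions, since the kernels are even; the quadratic case shows the leading perturbation terms.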


We shall now state the following result, which shows that by using a preliminary smoothing it is possible to obtain optimal order of convergence for t positive even if the data are nonsmooth:

THEOREM 9.9. Let 1 \le p \le \infty and assume that (9.1) is parabolic in Petrovskii's sense and that (9.2) is parabolic in John's sense and accurate of order \mu. Let further M_h be a smoothing operator of order \mu and let U^n be the discrete solution with initial data U^0 = M_h v. Then

\|U^n - u^n\|_{W_p} \le Ch^\mu (nk)^{-\mu/2} \|v\|_{W_p}   for n \ge 1.   (9.17)

PROOF. In order to shed some light on the mechanisms that are involved we shall present a proof for the one-dimensional heat equation in the case p = 2. Recalling that then

\hat u(\xi, t) = e^{-t\xi^2} \hat v(\xi)

and

\hat U^n(\xi) = E(h\xi)^n \hat\phi(h\xi) \hat v(\xi),

we see at once by Parseval's relation that the proof of (9.17) reduces to showing

|E(h\xi)^n \hat\phi(h\xi) - e^{-nk\xi^2}| \le Cn^{-\mu/2}   for \xi \in R,   (9.18)

or

|E(\xi)^n \hat\phi(\xi) - e^{-n\lambda\xi^2}| \le Cn^{-\mu/2}   for \xi \in R.

We write

E(\xi)^n \hat\phi(\xi) - e^{-n\lambda\xi^2} = (E(\xi)^n - e^{-n\lambda\xi^2}) \hat\phi(\xi) + e^{-n\lambda\xi^2} (\hat\phi(\xi) - 1) = I + II.

Here, by the proof of Theorem 4.5, say,

|E(\xi)^n - e^{-n\lambda\xi^2}| \le Cn|\xi|^{\mu+2} e^{-cn\xi^2} \le Cn^{-\mu/2}   for |\xi| \le \pi,

and hence the same bound is valid after multiplication by \hat\phi for these \xi. For |\xi| > \pi we have, by (9.14),

|E(\xi)^n \hat\phi(\xi)| \le C |E(\xi)^n| |\sin \tfrac12\xi|^\mu,

and since the right-hand side is periodic it suffices to estimate it for |\xi| \le \pi. We have therefore, using the parabolicity of E_k,

|E(\xi)^n \hat\phi(\xi)| \le C e^{-cn\xi^2} |\xi|^\mu \le Cn^{-\mu/2}   for |\xi| > \pi.

Since

|e^{-n\lambda\xi^2} \hat\phi(\xi)| \le C e^{-cn} \le Cn^{-\mu/2}   for |\xi| > \pi,

we conclude that

|I| \le Cn^{-\mu/2}   for \xi \in R.

For the remaining term we have at once, by (9.13),

|II| \le C|\xi|^\mu e^{-n\lambda\xi^2} \le Cn^{-\mu/2}   for \xi \in R,

which completes the proof of (9.18) and thus of (9.17) in the case considered. □

We shall now describe a result by THOMEE and WAHLBIN [1974] which shows that in certain cases optimal order convergence for positive time may be attained with slightly simpler smoothing operators than in Theorem 9.9.
We shall say that the smoothing operator defined by (9.11) or (9.12) is of orders (\mu, \nu) if (9.13) holds together with

\hat\phi(\xi) = \sum_{|\alpha|=\nu} (\sin \tfrac12\xi)^\alpha b^{(1)}_\alpha(\xi),

where b^{(j)}_\alpha, j = 0, 1, now coincide with multipliers on F L_p for |\xi| < 2\delta and |\xi| > \delta, respectively, so that the parameter \mu in (9.14) is replaced by \nu. Our previous smoothing operators are then of orders (\mu, \mu), and it is easy to see, for instance, that the simple operator (9.15) has orders (2, 1). The following result then holds.

THEOREM 9.10. Assume that (9.1) and (9.2) are parabolic in the sense of Petrovskii and John, respectively, and that (9.2) is accurate of order \mu. Let d < s \le \mu, 0 < \tau < T, and assume that M_h has orders (\mu, \nu) with \nu \ge \mu - s. Then for U^n the discrete solution with U^0 = M_h v we have

\|U^n - u^n\|_{L_\infty} \le Ch^\mu \|v\|_{B^s_1},   \tau \le nk \le T.

As an example, consider the one-dimensional function \chi_a defined in (9.10), with 0 < a < 1, and let the difference operator have order of accuracy \mu = 2. Then Theorem 9.9 shows second-order convergence for positive t by application of a smoothing operator of order 2, such as (9.16), whereas Theorem 9.10 gives the same conclusion already for the simpler operator (9.15) of orders (2, 1).
We shall close this section with some results concerning the convergence of finite difference quotients of the approximate solution to derivatives of the exact solution of (9.1), cf. WIDLUND [1970a, 1970b] and KREISS, THOMEE and WIDLUND [1970]. Let thus Q_h be a finite difference operator of the form

Q_h v(x) = h^{-q} \sum_\beta q_\beta v(x - \beta h),

which is consistent with the differential operator

Qv = \sum_{|\alpha|=q} q_\alpha D^\alpha v,

which, for simplicity only, we take to have constant coefficients and no lower-order terms, and assume that the order of the approximation is \mu, so that, for smooth v,

Q_h v = Qv + O(h^\mu)   as h \to 0.   (9.19)
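For instance, with Q = d/dx (so q = 1) and the symmetric quotient Q_h v(x) = (v(x + h) - v(x - h))/(2h), the order in (9.19) is \mu = 2. A small sketch (ours, not from the text) makes the rate visible by successive halving of h:

```python
import math

def q_h(v, x, h):
    # symmetric first difference quotient; consistent with Qv = v', accuracy order mu = 2
    return (v(x + h) - v(x - h)) / (2 * h)

# observe Q_h v = Qv + O(h^2) for the smooth function v = sin at x = 1
exact = math.cos(1.0)
errs = [abs(q_h(math.sin, 1.0, h) - exact) for h in (0.1, 0.05, 0.025)]
rates = [math.log(errs[i] / errs[i + 1], 2) for i in range(2)]
```

Each halving of h divides the error by about 4, so the observed rates cluster around 2.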
We begin with a smooth data result:

THEOREM 9.11. Assume that (9.1) is parabolic in Petrovskii's sense and that (9.2) is parabolic in John's sense and accurate of order \mu. Then, if 1 \le p \le \infty, we have

\|Q_h U^n - Q u^n\|_{W_p} \le Ch^\mu \|v\|_{W^{\mu+q}_p}   for nk \le T.

One may also show the following estimates for less smooth data:

THEOREM 9.12. Under the assumptions of Theorem 9.11 we have, for 0 < s \le \mu, 0 < nk \le T,

\|Q_h U^n - Q u^n\|_{W_p} \le Ch^\mu (nk)^{-(\mu + q - s)/M} \|v\|_{W^s_p}.

THEOREM 9.13. Under the assumptions of Theorem 9.11 we have, for 0 < s \le \mu, 0 < nk \le T,

\|Q_h U^n - Q u^n\|_{W_p} \le Ch^\mu (nk)^{-(\mu + q - s)/M} \|v\|_{B^s_p}.

Finally, application of a smoothing operator may again be used to retrieve the accuracy:

THEOREM 9.14. Under the assumptions of Theorem 9.11, let M_h be a smoothing operator of order \mu on W_p and let U^0 = M_h v. Then, for 0 < nk \le T,

\|Q_h U^n - Q u^n\|_{W_p} \le Ch^\mu (nk)^{-(\mu + q)/M} \|v\|_{W_p}.

In the proofs of these estimates one observes that

\|Q_h U^n - Q u^n\|_{W_p} \le \|Q_h (U^n - u^n)\|_{W_p} + \|(Q_h - Q) u^n\|_{W_p}.   (9.20)

The second term on the right is easily bounded, using (9.19) and the smoothness properties of u^n, so that e.g.

\|(Q_h - Q) u^n\|_{W_p} \le Ch^\mu \|u^n\|_{W^{\mu+q}_p} \le Ch^\mu \|v\|_{W^{\mu+q}_p}

or

\|(Q_h - Q) u^n\|_{W_p} \le Ch^\mu (nk)^{-(\mu + q)/M} \|v\|_{W_p},

depending on the case considered. For the first term in (9.20) it suffices to bound h^{-q}(U^n - u^n), which may be done using the appropriate estimates for the discrete fundamental solutions. For the case of constant coefficients and p = 2, see also the proof of Theorem 4.3.
Results of similar type may be derived in strongly parabolic situations.
In addition to the work described above we would like to quote also some recent work by LAZAROV [1982] and WEINELT, LAZAROV and STREIT [1984]. In these papers convergence estimates are obtained for finite difference schemes for second-order parabolic equations with periodic multidimensional boundary conditions. The particular emphasis is on weak solutions with low regularity, and averages of the data are used rather than point values in order to extract sufficient information. The analysis is close to that associated with the finite element method. In GODEV and LAZAROV [1984] discrete Fourier and Laplace transforms are used to derive convergence estimates in L_p for the same problem under weak assumptions.
CHAPTER III

The Mixed Initial Boundary Value Problem
This chapter is concerned with the numerical solution of mixed initial boundary value problems for parabolic equations. The solution sought is then required to satisfy the differential equation in a bounded domain \Omega in space, to be subject to boundary conditions on the boundary \partial\Omega of \Omega for positive time, and also to assume given initial values.
Here the theory of finite difference methods is less complete and satisfactory
than for the pure initial value problem. One reason for this is that in the case of
more than one space dimension only very special domains, such as unions of
rectangles with sides parallel to the coordinate axes, may be well represented by
mesh domains. Further, even in one space dimension the transition between the
finite difference equations in the interior and the approximation of the boundary
conditions is more complex both to define and to analyze. This is the reason why
finite element methods, which are based on variational formulations of the
boundary value problems, have been more successful and caused the develop-
ment of the classical finite difference method to stagnate.
It should be reiterated that the finite element method is, in some sense,
a development and refinement of the finite difference method and that the latter
has contributed to the former in a variety of ways. As an illustration of this fact
we have mentioned earlier that in the Russian literature a class of finite element
methods is referred to as variational finite difference methods. However, in this
article we restrict ourselves to classical finite difference methods and refer to the
article of Fujita and Suzuki in Volume II for the development along the lines just
described.
We have divided this chapter into four sections, of which the first three are devoted to the three major approaches to methods of analysis of finite difference
schemes for initial boundary value problems, namely energy methods, methods
based on maximum principles or monotonicity, and spectral methods. In the fourth
section we collect some additional topics, not covered in the earlier sections.
In Section 10 we thus discuss stability and convergence results derived by discrete
analogues of the classical energy arguments for the differential equations. We begin
by considering a one-dimensional heat equation with Dirichlet type boundary
conditions and show stability and error estimates in discrete L 2 -norms for the
standard explicit and implicit finite difference methods. We permit variable

coefficients and discuss the effect of lower-order terms and proceed to treat
boundary conditions which involve derivatives. We then consider the extension to
several space dimensions, including the case of domains with curved boundaries.
When the domain is such that its boundary falls on mesh planes we demonstrate
stability in the discrete L 2-norm and convergence to the natural orders for the
standard finite difference equations with Dirichlet boundary conditions. In the case
of a curved boundary we describe a crude variant of the backward Euler method
which has a low-order convergence rate and we end the section with a brief
discussion of the relation of standard finite difference methods to special cases of the
finite element method.
In Section 11 we consider the same types of problems as in Section 10, but now the
methods of analysis are based on discrete analogues of the maximum principle or
related monotonicity arguments, depending on positivity of the coefficients of the
finite difference operators. This restricts the generality of the methods but gives
somewhat more satisfactory results in cases when they apply.
In Section 12 we describe a variety of spectral methods and begin by discussing
a concept of spectrum relating to a family of operators, which was introduced in
GODUNOV and RYABENKII [1964] to deal with stability of finite difference schemes in
the presence of boundary conditions. Their approach shows that the stability in the
case of one space dimension is dependent on the stability of the interior finite
difference equations, and the left-sided and right-sided boundary value problems
separately. This makes it natural to discuss the stability of a quarter-plane problem,
with the space variable ranging over the positive axis, and we describe the analysis of
such a situation by means of Fourier analysis in the vein of the school of Kreiss. We
close the section with some remarks concerning the stability of initial value type
difference operators which are suited for the quarter-plane problem, and concerning
the use of eigenfunctions to derive maximum-norm error estimates.
Finally, in Section 13, we collect some material which has not naturally fallen into
any of the earlier sections. We thus discuss some results relating to the possibility of
using a variable time step, we consider a parabolic problem with a singularity caused
by transformation by means of polar coordinates of spherically symmetric
problems, and touch upon the case of a discontinuous coefficient in the parabolic
equation. We further examine the possibility of treating the initial boundary value
problem as a pure boundary value problem and then present some interior estimates
for standard parabolic difference schemes which may be used to draw strong
conclusions about convergence away from the boundary from global results.
Finally, we present an example of the application of finite difference schemes in
existence theory.

10. The energy method

In this section we shall consider the application of discrete variants of the energy
method, commonly used in the study of boundary value problems for partial
differential equations, to derive stability and convergence estimates for finite

difference schemes. In contrast to the Fourier method which was the basis for the
analysis described, for instance, in Sections 7 and 8, such methods are suitable for
equations with variable coefficients and also for problems with boundary condi-
tions which are not of Dirichlet type.
This approach was developed first by LEES [1959, 1960a,b], KREISS [1959a,c,
1960] and SAMARSKII [1961a,b, 1962a] (cf. also e.g. BABUSKA, PRAGER and VITASEK
[1966]).
We shall first consider the model problem of the one-dimensional heat equation in divergence form,

\partial u/\partial t = \partial/\partial x (a \partial u/\partial x)   for 0 < x < 1, t > 0,

u(0, t) = u(1, t) = 0   for t \ge 0,   (10.1)

u(x, 0) = v(x)   for 0 \le x \le 1,

where a = a(x, t) is bounded away from zero and infinity,

0 < \kappa \le a(x, t) \le K < \infty   for 0 \le x \le 1, t \ge 0.   (10.2)
The energy method for estimating the solution of the continuous problem (10.1) in terms of the initial data with respect to the norm in L_2(0, 1) may be described as follows: We multiply the differential equation by u(x, t), integrate with respect to x over [0, 1], and obtain, after an integration by parts in the second term,

\int_0^1 \frac{\partial u}{\partial t} u dx = \int_0^1 \frac{\partial}{\partial x}\Big( a \frac{\partial u}{\partial x} \Big) u dx = -\int_0^1 a \Big( \frac{\partial u}{\partial x} \Big)^2 dx,

which, with

\|u\|_{L_2} = \Big( \int_0^1 u^2 dx \Big)^{1/2},

may be written as

\tfrac12 \frac{d}{dt} \|u(t)\|^2_{L_2} + \int_0^1 a \Big( \frac{\partial u}{\partial x} \Big)^2 dx = 0.

Integration in t and using (10.2) yields

\|u(t)\|^2_{L_2} + 2\kappa \int_0^t \Big\| \frac{\partial u}{\partial x} \Big\|^2_{L_2} ds \le \|v\|^2_{L_2},   (10.3)

so that, in particular, the solution operator defined by u(t) = E(t)v is bounded in L_2(0, 1) for t \ge 0.

We shall now be interested in a variety of finite difference analogues of this


estimate and similar estimates, involving other types of boundary conditions as well.
We shall also apply such estimates to show convergence of finite difference schemes.
As earlier we divide [0, 1] into M subintervals of length h = 1/M and consider the mesh function U^n_j approximating u(jh, nk), where k is the time step and j = 0, ..., M, n \ge 0. For mesh functions V = (V_0, ..., V_M)^T and W = (W_0, ..., W_M)^T we introduce the discrete inner product

(V, W) = h \sum_{j=0}^{M} V_j W_j

and the corresponding norm

\|V\| = \Big( h \sum_{j=0}^{M} V_j^2 \Big)^{1/2}.   (10.4)

We shall employ our previous notation for the forward and backward difference quotient operators \partial_x, \bar\partial_x, \partial_t and \bar\partial_t. For mesh functions V and W as above, which satisfy V_0 = V_M = 0, and correspondingly for W, we may define, e.g.,

(\partial_x V, W) = h \sum_{j=0}^{M-1} \partial_x V_j \cdot W_j,

where we note that the summation ends at j = M - 1. With the obvious corresponding expression involving \bar\partial_x we have, by summation by parts,

(\partial_x V, W) = -(V, \bar\partial_x W).   (10.5)
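The identity (10.5) is purely algebraic and can be checked directly; the sketch below (our own code and data, not from the text) evaluates both sides for mesh functions with V_0 = V_M = 0:

```python
M = 10
h = 1.0 / M
# arbitrary mesh data; V vanishes at j = 0 and j = M, as required for (10.5)
V = [0.0, 0.3, -1.2, 2.0, 0.7, -0.4, 1.1, -2.2, 0.9, 0.5, 0.0]
W = [1.0, -0.6, 0.8, 2.2, -1.4, 0.3, 0.9, -0.7, 1.6, 0.2, -1.1]

dxV = [(V[j + 1] - V[j]) / h for j in range(M)]           # forward quotient, j = 0..M-1
dbxW = [(W[j] - W[j - 1]) / h for j in range(1, M + 1)]   # backward quotient, j = 1..M

lhs = h * sum(dxV[j] * W[j] for j in range(M))            # (d_x V, W)
rhs = -h * sum(V[j] * dbxW[j - 1] for j in range(1, M))   # -(V, dbar_x W)
```

The two sums agree to rounding error, which is exactly the Abel-summation argument behind (10.5).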
We now consider the backward Euler method for (10.1), defined by

\bar\partial_t U^n_j = \partial_x (a^n_{-1/2} \bar\partial_x U^n)_j   for j = 1, ..., M - 1, n \ge 1,

U^n_0 = U^n_M = 0   for n \ge 1,   (10.6)

U^0_j = V_j = v(jh)   for j = 0, ..., M,

where we have set (a_{-1/2})^n_j = a(jh - \tfrac12 h, nk). Explicitly, the difference equation may be written

\frac{U^n_j - U^{n-1}_j}{k} = \frac{a^n_{j+1/2}(U^n_{j+1} - U^n_j)/h - a^n_{j-1/2}(U^n_j - U^n_{j-1})/h}{h}.
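In practice each step of (10.6) requires the solution of a tridiagonal linear system. The sketch below (our own code; the Thomas algorithm is simply the standard tridiagonal solver and is not prescribed by the text) performs one backward Euler step for a given coefficient a(x, t), and a small run with a = 1 exhibits a nonincreasing discrete l_2-norm.

```python
import math

def backward_euler_step(U, a, t, h, k):
    # One step of (10.6): solve, for j = 1, ..., M-1,
    #   -lam*a_{j-1/2} U_{j-1} + (1 + lam*(a_{j-1/2} + a_{j+1/2})) U_j - lam*a_{j+1/2} U_{j+1}
    #       = U_j^{n-1},   lam = k/h^2,  U_0 = U_M = 0,
    # with coefficients evaluated at the new time level t = nk (Thomas algorithm).
    M = len(U) - 1
    lam = k / h ** 2
    am = [a((j - 0.5) * h, t) for j in range(M + 1)]   # am[j] = a_{j-1/2}
    N = M - 1                                          # number of interior unknowns
    lo = [-lam * am[i + 1] for i in range(N)]
    di = [1 + lam * (am[i + 1] + am[i + 2]) for i in range(N)]
    up = [-lam * am[i + 2] for i in range(N)]
    rhs = [U[i + 1] for i in range(N)]
    for i in range(1, N):                              # forward elimination
        w = lo[i] / di[i - 1]
        di[i] -= w * up[i - 1]
        rhs[i] -= w * rhs[i - 1]
    x = [0.0] * N                                      # back substitution
    x[N - 1] = rhs[N - 1] / di[N - 1]
    for i in range(N - 2, -1, -1):
        x[i] = (rhs[i] - up[i] * x[i + 1]) / di[i]
    return [0.0] + x + [0.0]

# demo: a = 1, v(x) = sin(pi x); the discrete l2-norm must not grow for any h, k
M, h, k = 20, 0.05, 0.01
U = [math.sin(math.pi * j * h) for j in range(M + 1)]
norms = [math.sqrt(h * sum(u * u for u in U))]
for n in range(1, 11):
    U = backward_euler_step(U, lambda x, t: 1.0, n * k, h, k)
    norms.append(math.sqrt(h * sum(u * u for u in U)))
```

The monotone decay of `norms` is the unconditional l_2-stability of the scheme, derived by energy arguments in the text that follows.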
It is easy to see, by the arguments employed in Section 2, that this finite difference scheme is stable in the maximum-norm, but we shall now use energy arguments to demonstrate its stability with respect to the discrete l_2-norm introduced in (10.4).
We thus multiply (10.6) by U^n_j, sum over j = 1, ..., M - 1, and obtain, noting that U^n_0 = U^n_M = 0, that

(\bar\partial_t U^n, U^n) = (\partial_x (a^n_{-1/2} \bar\partial_x U^n), U^n).   (10.7)

By the summation by parts formula (10.5) we have

(\partial_x (a^n_{-1/2} \bar\partial_x U^n), U^n) = -(a^n_{-1/2} \bar\partial_x U^n, \bar\partial_x U^n).

Further, we note that

\bar\partial_t U^n_j \cdot U^n_j = k^{-1}(U^n_j - U^{n-1}_j) U^n_j
= \tfrac12 k^{-1} [(U^n_j)^2 - (U^{n-1}_j)^2 + (U^n_j - U^{n-1}_j)^2]
= \tfrac12 [\bar\partial_t (U^n_j)^2 + k (\bar\partial_t U^n_j)^2].   (10.8)

Consequently (10.7) yields

\tfrac12 \bar\partial_t \|U^n\|^2 + \tfrac12 k \|\bar\partial_t U^n\|^2 + (a^n_{-1/2} \bar\partial_x U^n, \bar\partial_x U^n) = 0   for n \ge 1,   (10.9)

from which we conclude at once, for instance,

\bar\partial_t \|U^n\|^2 \le 0,

or

\|U^n\| \le \|U^{n-1}\|,

whence, by summation,

\|U^n\| \le \|U^0\| = \|V\|,

which shows the l_2-stability for any choice of h and k.


Let us now consider instead the forward Euler method,

\partial_t U^n_j = \partial_x (a^n_{-1/2} \bar\partial_x U^n)_j   for j = 1, ..., M - 1, n \ge 0,

U^n_0 = U^n_M = 0   for n \ge 0,

U^0_j = V_j = v(jh)   for j = 0, ..., M.

The analysis is similar, but (10.8) is replaced by

\partial_t U^n_j \cdot U^n_j = \tfrac12 [\partial_t (U^n_j)^2 - k (\partial_t U^n_j)^2],

and hence the energy identity (10.9) by

\tfrac12 \partial_t \|U^n\|^2 + (a^n_{-1/2} \bar\partial_x U^n, \bar\partial_x U^n) = \tfrac12 k \|\partial_t U^n\|^2.   (10.10)

Since obviously

\|\partial_x V\| \le 2h^{-1} \|V\|,   (10.11)

we have

\|\partial_t U^n\| = \|\partial_x (a_{-1/2} \bar\partial_x U)^n\| \le 2h^{-1} \|a^n_{-1/2} \bar\partial_x U^n\|,

and hence from (10.10), using (10.2), with \lambda = k/h^2,

\tfrac12 \partial_t \|U^n\|^2 + \|(a^n_{-1/2})^{1/2} \bar\partial_x U^n\|^2 \le 2\lambda K \|(a^n_{-1/2})^{1/2} \bar\partial_x U^n\|^2.

Therefore, if \lambda is chosen small enough that

2\lambda K \le 1,   (10.12)

where K is the constant in (10.2), we may conclude that

\partial_t \|U^n\|^2 \le 0,

or

\|U^{n+1}\| \le \|U^n\|,

which is the l_2-stability desired.
The symbol of the difference operator is in this case

\tilde E(x, t, \xi) = 1 - 2\lambda a(x, t)(1 - \cos \xi),   (10.13)

and we recognize in (10.12) the von Neumann stability condition.
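For constant a the von Neumann condition can be checked mechanically from the symbol (10.13): the sketch below (ours, not from the text) computes the maximal amplification factor over \xi and confirms that it exceeds 1 precisely when 2\lambda a > 1.

```python
import math

def max_amplification(a, lam, nxi=1000):
    # symbol (10.13) of the forward Euler operator for constant a:
    #   E(xi) = 1 - 2*a*lam*(1 - cos(xi)); sample |E| on [0, pi]
    return max(abs(1.0 - 2.0 * a * lam * (1.0 - math.cos(math.pi * i / nxi)))
               for i in range(nxi + 1))

K = 1.0
stable = max_amplification(K, 0.45)     # 2*lam*K = 0.9 <= 1: no growing mode
unstable = max_amplification(K, 0.55)   # 2*lam*K = 1.1 > 1: growth at xi = pi
```

The worst mode is always \xi = \pi, where E = 1 - 4a\lambda, which is why the threshold is exactly 2\lambda a = 1.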
In order to consider similarly the Crank-Nicolson method

\partial_t U^n_j = \partial_x (a^{n+1/2}_{-1/2} \bar\partial_x U^{n+1/2})_j,   j = 1, ..., M - 1, n \ge 0,

where a^{n+1/2} = a(\cdot, nk + \tfrac12 k), we set

U^{n+1/2}_j = \tfrac12 (U^n_j + U^{n+1}_j),

and obtain

(\partial_t U^n, U^{n+1/2}) + (a^{n+1/2}_{-1/2} \bar\partial_x U^{n+1/2}, \bar\partial_x U^{n+1/2}) = 0.   (10.14)

Here

(\partial_t U^n, U^{n+1/2}) = ((U^{n+1} - U^n)/k, \tfrac12 (U^{n+1} + U^n)) = \tfrac12 \partial_t \|U^n\|^2,

and (10.14) thus immediately yields

\partial_t \|U^n\|^2 \le 0,

which shows the unconditional stability in this case.


For the \theta-method,

\partial_t U^n_j = \partial_x (a^{n+\theta}_{-1/2} \bar\partial_x U^{n+\theta})_j,   j = 1, ..., M - 1, n \ge 0,

U^n_0 = U^n_M = 0,   n \ge 0,   (10.15)

U^0_j = v(jh),   j = 0, ..., M,

where

a^{n+\theta} = a(\cdot, (n + \theta)k)

and

U^{n+\theta} = \theta U^{n+1} + (1 - \theta) U^n,

we have similarly

(\partial_t U^n, U^{n+\theta}) + (a^{n+\theta}_{-1/2} \bar\partial_x U^{n+\theta}, \bar\partial_x U^{n+\theta}) = 0.   (10.16)

Here

(\partial_t U^n, U^{n+\theta}) = \tfrac12 \partial_t \|U^n\|^2 + (\theta - \tfrac12) k \|\partial_t U^n\|^2,   (10.17)

and hence, if \theta \ge \tfrac12, we obtain at once

\partial_t \|U^n\|^2 \le 0,

which shows unconditional stability. For \theta < \tfrac12 we may use the difference equation (10.15) and (10.11) to obtain

\|\partial_t U^n\| \le 2h^{-1} \|a^{n+\theta}_{-1/2} \bar\partial_x U^{n+\theta}\|,

so that, using also (10.2),

\tfrac12 \partial_t \|U^n\|^2 + \|(a^{n+\theta}_{-1/2})^{1/2} \bar\partial_x U^{n+\theta}\|^2 \le (\tfrac12 - \theta) k \|\partial_t U^n\|^2
\le (1 - 2\theta) 2\lambda K \|(a^{n+\theta}_{-1/2})^{1/2} \bar\partial_x U^{n+\theta}\|^2.   (10.18)

The stability now follows under the condition

2\lambda (1 - 2\theta) K \le 1.   (10.19)
If we look at the proofs above we see that, by being slightly more careful, we obtain for the backward Euler method

\|U^n\|^2 + k \sum_{m=1}^{n} \|\bar\partial_x U^m\|^2 \le C \|U^0\|^2 = C \|V\|^2.

For the forward Euler method we obtain similarly

\|U^n\|^2 + k \sum_{m=1}^{n} \|\bar\partial_x U^m\|^2 \le C \|V\|^2,

under the assumption 2\lambda K < 1 (with strict inequality). Comparing with (10.13) we see that this condition implies that the difference operator is parabolic in John's sense. For the Crank-Nicolson method we find instead

\|U^n\|^2 + k \sum_{m=1}^{n} \|\bar\partial_x U^{m-1/2}\|^2 \le C \|V\|^2.

These estimates are discrete analogues of (10.3) and show, in addition to l_2-stability, a certain smoothness of the solution in a sense defined by the terms in \bar\partial_x U.
The energy method may be used to derive stability also in other norms. For instance, we may show stability with respect to a discrete analogue of the H^1_0 norm for the \theta-method, under the same condition (10.19) as above, or

\|\bar\partial_x U^n\| \le C \|\bar\partial_x U^0\|   for nk \le T.   (10.20)

To indicate the proof of this we assume for simplicity that a is independent of t. We

then multiply the finite difference equation in (10.15) by \partial_t U^n_j and sum over j to obtain

\|\partial_t U^n\|^2 + (a_{-1/2} \bar\partial_x U^{n+\theta}, \partial_t \bar\partial_x U^n) = 0,

or, by an analogue of (10.17),

\|\partial_t U^n\|^2 + \tfrac12 \partial_t \|(a_{-1/2})^{1/2} \bar\partial_x U^n\|^2 + (\theta - \tfrac12) k \|(a_{-1/2})^{1/2} \partial_t \bar\partial_x U^n\|^2 = 0.   (10.21)

In particular, for \theta \ge \tfrac12 we find at once

\partial_t \|(a_{-1/2})^{1/2} \bar\partial_x U^n\|^2 \le 0,   (10.22)

from which (10.20) follows, in view of the bounds (10.2) for a. For \theta < \tfrac12 we have instead

\|\partial_t U^n\|^2 + \tfrac12 \partial_t \|(a_{-1/2})^{1/2} \bar\partial_x U^n\|^2 = (\tfrac12 - \theta) k \|(a_{-1/2})^{1/2} \partial_t \bar\partial_x U^n\|^2
\le (\tfrac12 - \theta) k \cdot 4h^{-2} K \|\partial_t U^n\|^2 = 2(1 - 2\theta) \lambda K \|\partial_t U^n\|^2,

and hence, since we now assume the mesh ratio condition (10.19), that (10.22) still holds, which again implies (10.20).
Our above l_2-stability analysis may be generalized to the nonhomogeneous equation

\partial u/\partial t = \partial/\partial x (a \partial u/\partial x) + f,   0 < x < 1, t > 0,   (10.23)

with the same boundary and initial conditions as before. For instance, let us consider the \theta-method

\partial_t U^n_j = \partial_x (a^{n+\theta}_{-1/2} \bar\partial_x U^{n+\theta})_j + F^n_j,   j = 1, ..., M - 1, n \ge 0,   (10.24)

where F^n could be chosen, for instance, as f(\cdot, (n + \theta)k). Multiplication by U^{n+\theta}_j and summation over j gives, similarly to (10.16), that

(\partial_t U^n, U^{n+\theta}) + (a^{n+\theta}_{-1/2} \bar\partial_x U^{n+\theta}, \bar\partial_x U^{n+\theta}) = (F^n, U^{n+\theta}).

For \theta \ge \tfrac12 we conclude, as above, for \epsilon > 0 arbitrary,

\tfrac12 \partial_t \|U^n\|^2 + \kappa \|\bar\partial_x U^{n+\theta}\|^2 \le \epsilon \|U^{n+\theta}\|^2 + C_\epsilon \|F^n\|^2.

We now appeal to the inequality

\|U^{n+\theta}\| \le C \|\bar\partial_x U^{n+\theta}\|,

valid since U^{n+\theta} satisfies homogeneous boundary conditions, and we may conclude, by choosing \epsilon small enough, that

\partial_t \|U^n\|^2 \le C \|F^n\|^2,   (10.25)

and hence, by summation,

\|U^n\|^2 \le \|U^0\|^2 + Ck \sum_{j=0}^{n-1} \|F^j\|^2.   (10.26)

For \theta < \tfrac12 we obtain instead of (10.18)

\tfrac12 \partial_t \|U^n\|^2 + \|(a^{n+\theta}_{-1/2})^{1/2} \bar\partial_x U^{n+\theta}\|^2
\le (\tfrac12 - \theta) k \|\partial_t U^n\|^2 + \|F^n\| \|U^{n+\theta}\|
\le (\tfrac12 - \theta) k (2h^{-1} \|a^{n+\theta}_{-1/2} \bar\partial_x U^{n+\theta}\| + \|F^n\|)^2 + \|F^n\| \|U^{n+\theta}\|
\le (1 - 2\theta) 2\lambda K (1 + \epsilon) \|(a^{n+\theta}_{-1/2})^{1/2} \bar\partial_x U^{n+\theta}\|^2 + \epsilon \|U^{n+\theta}\|^2 + C_\epsilon \|F^n\|^2.

Hence, under the condition

2\lambda (1 - 2\theta) K < 1,   (10.27)

we may conclude as before, by taking \epsilon small, that (10.25) holds and hence also (10.26). Note that (10.27) is slightly stronger than (10.19) in that strict inequality is required.
The stability estimate (10.26) may now be used to derive a convergence result for the solution given by the \theta-method to the solution of (10.1) or (10.23). To see this, recall that the truncation error is defined by means of the exact solution as

\tau^n_j = \partial_t u^n_j - \partial_x (a^{n+\theta}_{-1/2} \bar\partial_x u^{n+\theta})_j - F^n_j
= O(h^2 + k)   if 0 \le \theta \le 1, \theta \ne \tfrac12,   (10.28)
= O(h^2 + k^2)   if \theta = \tfrac12,

and that the error z^n_j = U^n_j - u^n_j satisfies the nonhomogeneous finite difference equation

\partial_t z^n_j = \partial_x (a^{n+\theta}_{-1/2} \bar\partial_x z^{n+\theta})_j - \tau^n_j,   j = 1, ..., M - 1, n \ge 0,   (10.29)

with homogeneous initial and boundary conditions. From (10.26) we now conclude, since z^0 = 0, that

\|z^n\|^2 \le Ck \sum_{m=0}^{n-1} \|\tau^m\|^2,

which yields the following convergence result.

THEOREM 10.1. Let u^n and U^n be the solutions of (10.23) and (10.24), respectively, and assume that (10.27) holds for \theta < \tfrac12. Then we have, with C = C(u, T), for nk \le T,

\|U^n - u^n\| \le C(h^2 + k)   if \tfrac12 < \theta \le 1,
\|U^n - u^n\| \le C(h^2 + k^2)   if \theta = \tfrac12,
\|U^n - u^n\| \le Ch^2   if 0 \le \theta < \tfrac12.
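The rates in Theorem 10.1 can be observed in the model case a = 1. The sketch below (our own code; it takes v(x) = \sin \pi x, T = 0.5 and k proportional to h, choices made here only so that the O(k) and O(k^2) temporal errors dominate visibly) runs the \theta-method for \theta = 1 (backward Euler) and \theta = \tfrac12 (Crank-Nicolson) and measures the maximum nodal error against the exact solution e^{-\pi^2 t} \sin \pi x.

```python
import math

def theta_step(U, lam, theta):
    # one step of the theta-method (10.15) with a = 1:
    #   -theta*lam*U_{j-1}^{n+1} + (1 + 2*theta*lam)*U_j^{n+1} - theta*lam*U_{j+1}^{n+1}
    #     = U_j^n + (1 - theta)*lam*(U_{j+1}^n - 2*U_j^n + U_{j-1}^n)
    M = len(U) - 1
    N = M - 1
    rhs = [U[j] + (1 - theta) * lam * (U[j + 1] - 2 * U[j] + U[j - 1])
           for j in range(1, M)]
    lo, di, up = -theta * lam, 1 + 2 * theta * lam, -theta * lam
    c = [0.0] * N                      # Thomas algorithm, constant coefficients
    d = [0.0] * N
    c[0], d[0] = up / di, rhs[0] / di
    for i in range(1, N):
        m = di - lo * c[i - 1]
        c[i] = up / m
        d[i] = (rhs[i] - lo * d[i - 1]) / m
    x = [0.0] * N
    x[N - 1] = d[N - 1]
    for i in range(N - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return [0.0] + x + [0.0]

def max_error(M, theta, T=0.5):
    h = 1.0 / M
    k = h                              # k proportional to h
    U = [math.sin(math.pi * j * h) for j in range(M + 1)]
    for _ in range(round(T / k)):
        U = theta_step(U, k / h ** 2, theta)
    decay = math.exp(-math.pi ** 2 * T)
    return max(abs(U[j] - decay * math.sin(math.pi * j * h)) for j in range(M + 1))

be = [max_error(M, 1.0) for M in (10, 20, 40)]   # backward Euler: O(h^2 + k) = O(k)
cn = [max_error(M, 0.5) for M in (10, 20, 40)]   # Crank-Nicolson: O(h^2 + k^2)
rate_be = [math.log(be[i] / be[i + 1], 2) for i in range(2)]
rate_cn = [math.log(cn[i] / cn[i + 1], 2) for i in range(2)]
```

On these coarse meshes the backward Euler rates hover somewhat above 1 (the asymptotic O(k) regime is approached slowly), while the Crank-Nicolson rates are already close to 2.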

We shall see that the energy method may even be used to derive a maximum-norm error estimate. This will follow from a stability estimate in the discrete H^1_0 norm together with the easily proven discrete Sobolev type inequality

\|V\|_{\infty, h} = \sup_j |V_j| \le C \|\bar\partial_x V\|,   (10.30)

valid for V = (V_0, ..., V_M)^T with V_0 = 0.
For the purpose of deriving such a maximum-norm error estimate we consider again the nonhomogeneous finite difference equation (10.24) (for simplicity with a independent of t), which we now multiply by \partial_t U^n_j and sum over j = 1, ..., M - 1 to obtain (cf. (10.21))

\|\partial_t U^n\|^2 + \tfrac12 \partial_t \|(a_{-1/2})^{1/2} \bar\partial_x U^n\|^2 + (\theta - \tfrac12) k \|(a_{-1/2})^{1/2} \partial_t \bar\partial_x U^n\|^2
= (F^n, \partial_t U^n) \le \tfrac12 \|F^n\|^2 + \tfrac12 \|\partial_t U^n\|^2.

For \theta \ge \tfrac12 we conclude at once

\partial_t \|(a_{-1/2})^{1/2} \bar\partial_x U^n\|^2 \le \|F^n\|^2,

and hence easily

\|\bar\partial_x U^n\|^2 \le C \|\bar\partial_x U^0\|^2 + Ck \sum_{m=0}^{n-1} \|F^m\|^2   for nk \le T.   (10.31)

For \theta < \tfrac12 we obtain similarly

\|\partial_t U^n\|^2 + \tfrac12 \partial_t \|(a_{-1/2})^{1/2} \bar\partial_x U^n\|^2
\le 2(1 - 2\theta) \lambda K (1 + \epsilon) \|\partial_t U^n\|^2 + C_\epsilon \|F^n\|^2.

Hence, assuming the strict inequality (10.27), we find for \epsilon small enough

\partial_t \|(a_{-1/2})^{1/2} \bar\partial_x U^n\|^2 \le C \|F^n\|^2,

which implies (10.31).
In order to show the maximum-norm convergence estimate we apply (10.31) to the error z^n = U^n - u^n, and recall that z^n satisfies (10.29), with \tau^n as in (10.28), as well as homogeneous initial and boundary conditions. Applying also (10.30) we now find the following result.

THEOREM 10.2. Let u^n and U^n be the solutions of (10.23) and (10.24), respectively, and assume that (10.27) holds for \theta < \tfrac12. Then we have, with C = C(u, T), for nk \le T,

\|U^n - u^n\|_{\infty, h} \le C(h^2 + k)   if \tfrac12 < \theta \le 1,
\|U^n - u^n\|_{\infty, h} \le C(h^2 + k^2)   if \theta = \tfrac12,
\|U^n - u^n\|_{\infty, h} \le Ch^2   if 0 \le \theta < \tfrac12.

We shall briefly discuss the effect of lower-order terms in the differential and difference equations in the above stability analysis, and consider therefore the initial boundary value problem

\partial u/\partial t = \partial/\partial x (a \partial u/\partial x) + b_1 \partial u/\partial x + b_0 u   for 0 < x < 1, t > 0,

u(0, t) = u(1, t) = 0   for t \ge 0,

u(x, 0) = v(x)   for 0 \le x \le 1,

where a, b_1 and b_0 are smooth bounded functions of x and t, with a satisfying (10.2).
Consider first the backward Euler equation defined by

\bar\partial_t U^n_j = \partial_x (a^n_{-1/2} \bar\partial_x U^n)_j + b^n_1 \bar\partial_x U^n_j + b^n_0 U^n_j   for j = 1, ..., M - 1, n \ge 1,
with the boundary and initial conditions used above in (10.6). Employing the same approach as before we obtain now

\tfrac12 \bar\partial_t \|U^n\|^2 + \|(a^n_{-1/2})^{1/2} \bar\partial_x U^n\|^2
= (b^n_1 \bar\partial_x U^n + b^n_0 U^n, U^n)
\le C(\|\bar\partial_x U^n\| + \|U^n\|) \|U^n\|
\le \tfrac12 \|(a^n_{-1/2})^{1/2} \bar\partial_x U^n\|^2 + C \|U^n\|^2,

where in the last step we have used (10.2) and the geometric-arithmetic mean value inequality. We conclude that

\bar\partial_t \|U^n\|^2 \le C \|U^n\|^2,

or

\|U^n\|^2 \le \|U^{n-1}\|^2 + 2Ck \|U^n\|^2.

Hence, for small k, with different constants C,

\|U^n\|^2 \le \frac{1}{1 - 2Ck} \|U^{n-1}\|^2 \le e^{Ck} \|U^{n-1}\|^2,

and, by repeated application,

\|U^n\| \le e^{Cnk} \|v\| \le e^{CT} \|v\|   for nk \le T,   (10.32)

which shows the stability.
which shows the stability.
For the Crank-Nicolson method,

\partial_t U^n_j = \partial_x (a^{n+1/2}_{-1/2} \bar\partial_x U^{n+1/2})_j + b^{n+1/2}_1 \bar\partial_x U^{n+1/2}_j + b^{n+1/2}_0 U^{n+1/2}_j,

we obtain similarly

\tfrac12 \partial_t \|U^n\|^2 + \|(a^{n+1/2}_{-1/2})^{1/2} \bar\partial_x U^{n+1/2}\|^2
\le \tfrac12 \|(a^{n+1/2}_{-1/2})^{1/2} \bar\partial_x U^{n+1/2}\|^2 + C \|U^{n+1/2}\|^2,

and hence

\|U^{n+1}\|^2 \le \|U^n\|^2 + Ck (\|U^{n+1}\|^2 + \|U^n\|^2),

and again, for small k,

\|U^{n+1}\|^2 \le (1 + Ck) \|U^n\|^2,

from which (10.32) follows as above.
For the forward Euler method,

\partial_t U^n_j = \partial_x (a^n_{-1/2} \bar\partial_x U^n)_j + b^n_1 \bar\partial_x U^n_j + b^n_0 U^n_j,

we have analogously

\tfrac12 \partial_t \|U^n\|^2 + \|(a^n_{-1/2})^{1/2} \bar\partial_x U^n\|^2
\le \tfrac12 k \|\partial_t U^n\|^2 + \epsilon \|(a^n_{-1/2})^{1/2} \bar\partial_x U^n\|^2 + C_\epsilon \|U^n\|^2.

Here

\|\partial_t U^n\| \le \|\partial_x (a_{-1/2} \bar\partial_x U)^n\| + C(\|\bar\partial_x U^n\| + \|U^n\|),

so that, for any \epsilon > 0,

\tfrac12 k \|\partial_t U^n\|^2 \le \tfrac12 k ((1 + \epsilon) \|\partial_x (a_{-1/2} \bar\partial_x U)^n\|^2 + C_\epsilon (\|\bar\partial_x U^n\|^2 + \|U^n\|^2))
\le (2\lambda K (1 + \epsilon) + C_\epsilon k) \|(a^n_{-1/2})^{1/2} \bar\partial_x U^n\|^2 + C_\epsilon k \|U^n\|^2.

Altogether we thus obtain

\tfrac12 \partial_t \|U^n\|^2 + \|(a^n_{-1/2})^{1/2} \bar\partial_x U^n\|^2
\le (2\lambda K (1 + \epsilon) + C_\epsilon k + \epsilon) \|(a^n_{-1/2})^{1/2} \bar\partial_x U^n\|^2 + C_\epsilon \|U^n\|^2.

Hence, if we assume the strict inequality 2\lambda K < 1, we may choose first \epsilon and then k so small that

2\lambda K (1 + \epsilon) + C_\epsilon k + \epsilon \le 1,

and thus conclude that

\partial_t \|U^n\|^2 \le C \|U^n\|^2,

which implies stability as before.
In all the above cases we have chosen to replace the first derivative by an asymmetric difference quotient. If we choose instead the symmetric difference quotient \hat\partial_x = \tfrac12 (\partial_x + \bar\partial_x), which is desirable for accuracy reasons, the calculations simplify somewhat, since, by summation by parts, for V vanishing at j = 0 and M,

(b_1 \hat\partial_x V, V) = -\tfrac12 (V, (\partial_x + \bar\partial_x)(b_1 V))
= -(V, b_1 \hat\partial_x V) - \tfrac12 (V, V_{+1} \partial_x b_1 + V_{-1} \bar\partial_x b_1),

where (V_{\pm 1})_j = V_{j \pm 1}, so that

|(b_1 \hat\partial_x V, V)| \le C \|V\|^2.

In the particular case that b_1 is independent of x, the inner product vanishes and thus the first-order derivative term does not contribute to the growth of \|U^n\|.
It is clear that convergence estimates may also be obtained along these lines for equations with lower-order terms.
We shall now discuss some other choices of boundary conditions than those of Dirichlet type and permit Neumann and third-kind boundary conditions. We consider thus the initial boundary value problem

\partial u/\partial t = \partial/\partial x (a \partial u/\partial x)   for 0 < x < 1, t > 0,

(-\partial u/\partial x + \alpha u)(0, t) = (\partial u/\partial x + \beta u)(1, t) = 0   for t \ge 0,   (10.33)

u(x, 0) = v(x),

where a is a smooth function satisfying (10.2) and \alpha and \beta are smooth bounded functions of t.
In this case it is convenient to use a mesh which does not contain the points x = 0 and x = 1, and we set, for M a positive integer greater than 1, h = 1/(M - 1) and x_j = -\tfrac12 h + jh, j = 0, ..., M. We then have x_0 = -\tfrac12 h and x_M = 1 + \tfrac12 h, and these points are thus outside the interval [0, 1], whereas the remaining points x_1, ..., x_{M-1} are interior mesh points.
We now pose the backward Euler type difference equations at the interior mesh points

\bar\partial_t U^n_j = \partial_x (a^n_{-1/2} \bar\partial_x U^n)_j   for j = 1, ..., M - 1, n \ge 1,   (10.34)

and supplement these equations with the boundary and initial conditions

-\partial_x U^n_0 + \tfrac12 \alpha^n (U^n_1 + U^n_0) = \partial_x U^n_{M-1} + \tfrac12 \beta^n (U^n_M + U^n_{M-1}) = 0   for n \ge 1,   (10.35)

U^0_j = v(x_j),   j = 1, ..., M - 1.

The approximations of the boundary conditions have thus been chosen symmetrically around the points x = 0 and x = 1 and are therefore second-order accurate.
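The outside values U_0 and U_M can be eliminated from (10.35) before each step: the two conditions give U_0 = U_1 (2 - \alpha h)/(2 + \alpha h) and U_M = U_{M-1} (2 - \beta h)/(2 + \beta h). The sketch below (our own code; it uses an explicit forward Euler step with a = 1 instead of the backward Euler equations (10.34), simply to avoid a linear solve) illustrates that with \alpha = \beta = 0 the resulting pure Neumann scheme conserves the discrete integral of the interior values exactly.

```python
def neumann_step(U, h, k, alpha=0.0, beta=0.0):
    # explicit step for u_t = u_xx on the mesh x_j = -h/2 + j*h, with the ghost
    # values eliminated from the second-order boundary approximations (10.35):
    #   -dx U_0 + (alpha/2)(U_1 + U_0) = 0,  dx U_{M-1} + (beta/2)(U_M + U_{M-1}) = 0
    M = len(U) - 1
    lam = k / h ** 2
    U = U[:]
    U[0] = U[1] * (2 - alpha * h) / (2 + alpha * h)      # outside point x_0 = -h/2
    U[M] = U[M - 1] * (2 - beta * h) / (2 + beta * h)    # outside point x_M = 1 + h/2
    return [U[0]] + [U[j] + lam * (U[j + 1] - 2 * U[j] + U[j - 1])
                     for j in range(1, M)] + [U[M]]

M = 21
h = 1.0 / (M - 1)
k = 0.4 * h ** 2                       # within the explicit stability limit
U = [((j - 0.5) * h) ** 2 for j in range(M + 1)]   # initial data v(x) = x^2
mass0 = h * sum(U[1:M])                # discrete integral over the interior points
for _ in range(200):
    U = neumann_step(U, h, k)          # alpha = beta = 0: pure Neumann
mass1 = h * sum(U[1:M])
spread = max(U[1:M]) - min(U[1:M])
```

With homogeneous Neumann conditions the flux differences telescope, so the discrete integral is preserved to rounding error while diffusion flattens the profile toward its mean.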
In analyzing the stability of this scheme by the energy method it is convenient to indicate the points included in the summation in the notation for the discrete inner products, and we set

(V, W)_{(r,s)} = h \sum_{j=r}^{s} V_j W_j,

where r = 0, 1, 2 and s = M - 1, M, depending on the case at hand, and similarly for

the discrete norms. Let us recall the summation by parts formula

(\partial_x V, W)_{(r,s)} = h \sum_{j=r}^{s} h^{-1}(V_{j+1} - V_j) W_j
= -(V, \bar\partial_x W)_{(r+1, s+1)} + V_{s+1} W_{s+1} - V_r W_r.

In particular, considering a mesh function U^n satisfying the Neumann boundary conditions

\bar\partial_x U^n_1 = \bar\partial_x U^n_M = 0,

we have

(\partial_x (a^n_{-1/2} \bar\partial_x U^n), U^n)_{(1, M-1)}
= -(a^n_{-1/2} \bar\partial_x U^n, \bar\partial_x U^n)_{(2, M)} + a^n_{M-1/2} \bar\partial_x U^n_M U^n_M - a^n_{1/2} \bar\partial_x U^n_1 U^n_1
= -(a^n_{-1/2} \bar\partial_x U^n, \bar\partial_x U^n)_{(2, M-1)}.   (10.36)
We now apply this to a solution of (10.34), (10.35) with \alpha = \beta = 0. We multiply (10.34) by U^n_j and sum over the interior mesh points to obtain

(\bar\partial_t U^n, U^n)_{(1, M-1)} = (\partial_x (a^n_{-1/2} \bar\partial_x U^n), U^n)_{(1, M-1)},

and hence, in view of the boundary conditions and (10.8),

\tfrac12 \bar\partial_t \|U^n\|^2_{(1, M-1)} + \tfrac12 k \|\bar\partial_t U^n\|^2_{(1, M-1)} + (a^n_{-1/2} \bar\partial_x U^n, \bar\partial_x U^n)_{(2, M-1)} = 0   for n \ge 1.

We conclude as earlier that

\bar\partial_t \|U^n\|^2_{(1, M-1)} \le 0,

or

\|U^n\|_{(1, M-1)} \le \|U^{n-1}\|_{(1, M-1)},

which shows the stability in the case of Neumann boundary conditions. Note that the argument automatically shows also the existence of a solution of (10.34) and (10.35), since U^{n-1} = 0 implies U^n = 0.
Consider now the general case of the boundary conditions (10.35) with \alpha and \beta not necessarily vanishing. In view of (10.35), the identity (10.36) may now be replaced by

(\partial_x (a^n_{-1/2} \bar\partial_x U^n), U^n)_{(1, M-1)}
\le -(a^n_{-1/2} \bar\partial_x U^n, \bar\partial_x U^n)_{(2, M-1)} + C(|U^n_0|^2 + |U^n_1|^2 + |U^n_{M-1}|^2 + |U^n_M|^2).

It is easy to prove that, for \epsilon > 0 arbitrarily small, the a priori inequalities

|U_1|^2 + |U_{M-1}|^2 \le \epsilon \|\bar\partial_x U\|^2_{(2, M-1)} + C_\epsilon \|U\|^2_{(2, M-1)},

and

$$|U_0^n|^2 \leq 2|U_1^n|^2 + 2h^2|\bar\partial_x U_1^n|^2 \leq 2(\varepsilon + h)\|\bar\partial_x U^n\|_{(1,M)}^2 + C_\varepsilon\|U^n\|_{(1,M-1)}^2,$$

hold, and similarly for $U_M^n$.
Multiplication of (10.34) by $U^n$ and summation over the interior mesh points therefore implies, in view of the general boundary conditions (10.35), that

$$\tfrac12\bar\partial_t\|U^n\|_{(1,M-1)}^2 + \tfrac12 k\|\bar\partial_t U^n\|_{(1,M-1)}^2 + c\|\bar\partial_x U^n\|_{(1,M)}^2 \leq C(\varepsilon + h)\|\bar\partial_x U^n\|_{(1,M)}^2 + C_\varepsilon\|U^n\|_{(1,M-1)}^2. \tag{10.37}$$

Hence, for $\varepsilon$ and $h$ sufficiently small,

$$\bar\partial_t\|U^n\|_{(1,M-1)}^2 \leq C\|U^n\|_{(1,M-1)}^2,$$

or

$$(1 - Ck)\|U^n\|_{(1,M-1)}^2 \leq \|U^{n-1}\|_{(1,M-1)}^2,$$

whence, for small $k$,

$$\|U^n\|_{(1,M-1)}^2 \leq (1 + Ck)\|U^{n-1}\|_{(1,M-1)}^2,$$

from which the stability follows.
The analysis just described extends to the case $0 \leq \theta \leq 1$. Further, it applies to arbitrary boundary value problems for parabolic systems of second order in one space dimension which are correctly posed in the sense that the corresponding elliptic operator is semibounded (or maximally dissipative) in the appropriate sense, see e.g. KREISS [1960, 1963].
The arguments may also be carried over to the case of a nonhomogeneous
equation with nonhomogeneous boundary conditions, and the resulting estimate
may be used, in particular, to derive error bounds for (10.34) and (10.35). Consider
thus the discrete problem

$$\bar\partial_t U_j^n = \partial_x(a_{-1/2}\bar\partial_x U^n)_j + F_j^n, \quad j = 1, \dots, M-1,\ n \geq 1,$$
$$\bar\partial_x U_1^n - \tfrac12\alpha(U_0^n + U_1^n) = G_0^n, \quad n \geq 1, \tag{10.38}$$
$$\bar\partial_x U_M^n + \tfrac12\beta(U_M^n + U_{M-1}^n) = G_1^n, \quad n \geq 1,$$
$$U_j^0 = v(jh), \quad j = 1, \dots, M-1.$$
Proceeding as in the above stability analysis we now obtain, instead of (10.37),

$$\tfrac12\bar\partial_t\|U^n\|_{(1,M-1)}^2 + \tfrac12 k\|\bar\partial_t U^n\|_{(1,M-1)}^2 + c\|\bar\partial_x U^n\|_{(1,M)}^2$$
$$\leq C(\varepsilon + h)\|\bar\partial_x U^n\|_{(1,M)}^2 + C\|U^n\|_{(1,M-1)}^2 + C(\|F^n\|_{(1,M-1)}^2 + |G_0^n|^2 + |G_1^n|^2),$$



from which we now conclude, for $\varepsilon$ and $h$ sufficiently small,

$$\bar\partial_t\|U^n\|_{(1,M-1)}^2 \leq C(\|U^n\|_{(1,M-1)}^2 + \|F^n\|_{(1,M-1)}^2 + |G_0^n|^2 + |G_1^n|^2).$$

By iteration this yields

$$\|U^n\|_{(1,M-1)}^2 \leq C\Big(\|U^0\|_{(1,M-1)}^2 + k\sum_{m=1}^{n}\big(\|F^m\|_{(1,M-1)}^2 + |G_0^m|^2 + |G_1^m|^2\big)\Big)$$

for $nk \leq T$.

Since the exact solution of (10.33), and hence also the error $z^n = U^n - u^n$, satisfies (10.38) with

$$F_j^n = O(h^2 + k) \text{ as } h, k \to 0, \qquad z_j^0 = 0,$$
$$G_0^n = O(h^2), \quad G_1^n = O(h^2) \text{ as } h \to 0,$$

we may conclude the following error estimate, which, for simplicity only, we state for
the homogeneous differential equation with homogeneous boundary conditions.

THEOREM 10.3. Let $u$ and $U^n$ be the solutions of (10.33) and (10.34), (10.35), respectively. Then we have, with $C = C(u, T)$,

$$\|U^n - u^n\|_{(1,M-1)} \leq C(h^2 + k) \quad \text{for } nk \leq T.$$

In the approximation of the derivative boundary conditions above we followed KREISS [1960] in imposing the mesh in such a way that mesh points fall at distance $\tfrac12 h$ from the boundary but not on the boundary. It is also possible to use the standard partition by points $jh$, $j = 0, \dots, M$, and to choose the approximation accordingly. For instance, in the case of problem (10.33), for the backward Euler method, we may set, at the left-hand boundary,

$$\bar\partial_t U_0^{n+1} - 2h^{-1}a_{1/2}(\partial_x U_0^{n+1} - \alpha U_0^{n+1}) = 0,$$

and similarly at the right-hand boundary. This equation may be interpreted as resulting from elimination of $U_{-1}^{n+1}$ between the discrete parabolic equation at $j = 0$ (with $a_{-1/2} = a_{1/2}$) and the equation

$$\hat\partial_x U_0^{n+1} - \alpha U_0^{n+1} = 0,$$

where $\hat\partial_x U = \tfrac12(\partial_x + \bar\partial_x)U$ denotes the symmetric difference quotient. In the case that $\alpha$ is negative and $\beta$ positive, results are obtained in e.g. SAMARSKII [1961a] for this approximation
by the energy method. We shall return to such methods in Sections 11 and 12 below.
The energy arguments employed in the above analysis may be extended to several
space dimensions and to nonrectangular domains. We shall demonstrate this by
considering the initial boundary value problem

$$\frac{\partial u}{\partial t} = Au \equiv \sum_{i,j=1}^{d}\frac{\partial}{\partial x_i}\Big(a_{ij}\frac{\partial u}{\partial x_j}\Big) \quad \text{for } x \in \Omega,\ t \geq 0, \tag{10.39}$$
$$u(x,t) = 0 \quad \text{on } \partial\Omega,\ t \geq 0,$$
$$u(x,0) = v(x) \quad \text{in } \Omega,$$
where $\Omega \subset \mathbf{R}^d$ will first be assumed to be a union of axes-parallel cubes with vertices having integer components. The symmetric positive-definite matrix $(a_{ij})$ will be assumed to have constant coefficients, for simplicity of presentation only.

For the numerical solution we select a mesh width $h = 1/M$, where $M$ is a positive integer, and cover $\bar\Omega$ by a cubic grid of mesh points $x = \beta h = (\beta_1 h, \dots, \beta_d h)$ with the $\beta_l$ integers, and let $\Omega_h$ denote the mesh points in the interior of $\Omega$ and $\Gamma_h$ those on $\partial\Omega$. We note that all "neighbors" $x + \delta h$ of a mesh point $x$ in $\Omega_h$, where $\delta = (\delta_1, \dots, \delta_d)$ with $|\delta_i| \leq 1$, are in $\Omega_h \cup \Gamma_h$.
Let now $A_h$ be the finite difference operator approximating $A$ in (10.39) and defined by

$$A_h U(x) = \sum_{i,j=1}^{d} \bar\partial_{x_i}(a_{ij}\partial_{x_j}U)(x), \quad x \in \Omega_h, \tag{10.40}$$

where $\partial_{x_i}$ and $\bar\partial_{x_i}$ denote the forward and backward difference quotients in the direction of $x_i$,

$$\partial_{x_i}U(x) = (U(x + he_i) - U(x))/h$$

and

$$\bar\partial_{x_i}U(x) = (U(x) - U(x - he_i))/h.$$


Introducing the discrete inner product

$$(V, W) = h^d \sum_{x = \beta h \in \Omega_h \cup \Gamma_h} V(x)W(x),$$

and the corresponding norm,

$$\|V\| = (V, V)^{1/2},$$
we find, by summation by parts, for mesh functions vanishing on $\Gamma_h$, that

$$(\partial_{x_i}V, W) = -(V, \bar\partial_{x_i}W),$$

and, in particular, for $U$ vanishing on $\Gamma_h$, we have, with $\gamma > 0$,

$$-(A_h U, U) = \sum_{i,j=1}^{d}(a_{ij}\partial_{x_j}U, \partial_{x_i}U) \geq \gamma\sum_{j=1}^{d}\|\partial_{x_j}U\|^2. \tag{10.41}$$
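The lower bound (10.41) can be checked numerically. The sketch below (my own illustration; the coefficient matrix and mesh function are arbitrary test data) computes both sides in two dimensions for a mesh function vanishing on the boundary, with $\gamma$ the smallest eigenvalue of $(a_{ij})$; the trimmed difference entries vanish on the boundary, so the two gradient sums coincide:

```python
import numpy as np

# Numerical check (my own illustration, not from the text) of (10.41) in two
# dimensions: for U vanishing on the boundary mesh points,
#   -(A_h U, U) = sum_{i,j} (a_ij d_xj U, d_xi U) >= gamma sum_j ||d_xj U||^2,
# with gamma the smallest eigenvalue of the constant SPD matrix (a_ij).

M = 8
h = 1.0 / M
a = np.array([[2.0, 0.5], [0.5, 1.0]])           # SPD coefficients (test data)
gamma = np.linalg.eigvalsh(a).min()

rng = np.random.default_rng(3)
U = rng.standard_normal((M + 1, M + 1))
U[0, :] = U[-1, :] = U[:, 0] = U[:, -1] = 0.0    # U = 0 on Gamma_h

dx1 = (U[1:, :] - U[:-1, :]) / h                 # forward difference in x1
dx2 = (U[:, 1:] - U[:, :-1]) / h                 # forward difference in x2
g = [dx1[:, :-1], dx2[:-1, :]]                   # common (M x M) index range;
                                                 # dropped entries are zero

quad = h * h * sum(a[i, j] * np.sum(g[j] * g[i]) for i in (0, 1) for j in (0, 1))
lower = gamma * h * h * (np.sum(dx1 ** 2) + np.sum(dx2 ** 2))
assert quad >= lower - 1e-10
```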

We shall now apply the $\theta$-method to the present situation and thus pose the discrete problem to find an approximation to the solution $u$ of (10.39) at $t = nk$, where

$k$ is the time step, as a mesh function $U^n$ on $\Omega_h \cup \Gamma_h$ for $n \geq 0$, defined by

$$\partial_t U^n = \theta A_h U^{n+1} + (1 - \theta)A_h U^n \quad \text{in } \Omega_h,\ n \geq 0,$$
$$U^n = 0 \quad \text{on } \Gamma_h,\ n \geq 0, \tag{10.42}$$
$$U^0(\beta h) = v(\beta h) \quad \text{for } \beta h \in \Omega_h,$$

where $\partial_t$ and $\theta$, as usual, denote a forward difference quotient and a parameter with $0 \leq \theta \leq 1$, respectively. Given $U^n$, equation (10.42) requires us to find $U^{n+1}$ vanishing on $\Gamma_h$ such that

$$(I - \theta k A_h)U^{n+1} = (I + (1 - \theta)k A_h)U^n,$$

and it follows immediately from the positivity of $-A_h$, see (10.41), that this problem has a unique solution.
We now turn to the stability. Multiplication of (10.42) by

$$U^{n+\theta} = \theta U^{n+1} + (1 - \theta)U^n$$

gives

$$(\partial_t U^n, U^{n+\theta}) = (A_h U^{n+\theta}, U^{n+\theta}), \quad n \geq 0, \tag{10.43}$$

where the right-hand side is nonpositive by (10.41). As above in (10.17) we have

$$(\partial_t U^n, U^{n+\theta}) = \tfrac12\partial_t\|U^n\|^2 + (\theta - \tfrac12)k\|\partial_t U^n\|^2. \tag{10.44}$$

Hence, if $\theta \geq \tfrac12$ we obtain at once

$$\partial_t\|U^n\|^2 \leq 0, \tag{10.45}$$

which shows the stability estimate

$$\|U^n\| \leq \|U^0\| = \|v\| \quad \text{for } n \geq 0,$$
for any choice of $k$ and $h$. For $\theta < \tfrac12$ we may use the equation (10.42) to obtain

$$\|\partial_t U^n\| = \|A_h U^{n+\theta}\| \leq Ch^{-1}\sum_{j=1}^{d}\|\partial_{x_j}U^{n+\theta}\|,$$

and hence, by (10.41), for some $\Lambda$,

$$\|\partial_t U^n\|^2 \leq -\Lambda h^{-2}(A_h U^{n+\theta}, U^{n+\theta}).$$

We may therefore conclude from (10.43) and (10.44) that

$$\tfrac12\partial_t\|U^n\|^2 - (A_h U^{n+\theta}, U^{n+\theta}) = (\tfrac12 - \theta)k\|\partial_t U^n\|^2 \leq -(\tfrac12 - \theta)\Lambda k h^{-2}(A_h U^{n+\theta}, U^{n+\theta}).$$

Hence, if $\lambda = k/h^2$ satisfies

$$(\tfrac12 - \theta)\Lambda\lambda \leq 1, \tag{10.46}$$

then (10.45) holds as before so that stability is shown under the condition (10.46).
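The role of the mesh-ratio restriction can be illustrated through amplification factors. The sketch below is my own illustration, not from the text: it uses the one-dimensional discrete Laplacian with Dirichlet conditions, whose eigenvalues $\mu_j = (4/h^2)\sin^2(j\pi h/2)$ are known in closed form, and checks that the $\theta$-method factors $(1-(1-\theta)k\mu)/(1+\theta k\mu)$ have modulus at most 1 precisely under the expected condition:

```python
import numpy as np

# A sketch (my own illustration) of l2-stability of the theta-method
#   (I - theta*k*A_h) U^{n+1} = (I + (1-theta)*k*A_h) U^n
# for the 1D discrete Laplacian with Dirichlet conditions: the scheme is
# stable iff every amplification factor has modulus <= 1, which for the
# eigenvalues mu of -A_h reads (1 - 2*theta)*k*mu <= 2.

def amplification_factors(M, k, theta):
    h = 1.0 / M
    j = np.arange(1, M)                              # interior modes
    mu = 4.0 / h**2 * np.sin(j * np.pi * h / 2)**2   # eigenvalues of -A_h
    return (1 - (1 - theta) * k * mu) / (1 + theta * k * mu)

M = 20
h = 1.0 / M
# theta = 1/2 (Crank-Nicolson): stable for any mesh ratio lambda = k/h^2
assert np.all(np.abs(amplification_factors(M, 10 * h**2, 0.5)) <= 1 + 1e-14)
# theta = 0 (forward Euler): stable for lambda = 1/2, unstable for lambda = 1
assert np.all(np.abs(amplification_factors(M, 0.5 * h**2, 0.0)) <= 1 + 1e-14)
assert np.any(np.abs(amplification_factors(M, 1.0 * h**2, 0.0)) > 1)
```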
The above analysis assumes that the domain $\Omega$ is such that its boundary falls on the mesh planes. In this case the exact solution of (10.39) is, in general, nonsmooth even for smooth data, and even though the method is formally $O(h^2 + k)$ for $\theta \neq \tfrac12$ and $O(h^2 + k^2)$ for $\theta = \tfrac12$, the corresponding convergence estimates may not apply for lack of regularity. On the other hand, if $\Omega$ has a smooth boundary, difficulties arise in the construction of the difference equations near $\partial\Omega$.
We shall briefly describe and analyze a crude scheme for the case of a curved
boundary, based on an approximation in the elliptic case discussed in THOMEE
[1964]. Consider thus the parabolic problem

$$Lu \equiv \frac{\partial u}{\partial t} - \sum_{i,j=1}^{d}\frac{\partial}{\partial x_i}\Big(a_{ij}\frac{\partial u}{\partial x_j}\Big) = f \quad \text{in } \Omega\times[0,T], \tag{10.47}$$
$$u = g \quad \text{on } \Gamma\times[0,T],$$
$$u(x,0) = v(x) \quad \text{for } x \in \Omega,$$

where $(a_{ij})$ is a positive-definite symmetric $(d\times d)$ matrix which, again for simplicity of presentation only, we take to be constant, and where $\Gamma = \partial\Omega$.
Consider the finite difference approximation of the elliptic operator defined as above by (10.40). In the two-dimensional case the operator $A_h$ takes the form

$$A_h V(x) = \sum_{i,j=1}^{2} \bar\partial_{x_i}(a_{ij}\partial_{x_j}V(x))$$
$$= h^{-2}\{a_{11}(V(x + he_1) - 2V(x) + V(x - he_1)) + a_{22}(V(x + he_2) - 2V(x) + V(x - he_2))$$
$$\quad + a_{12}(V(x + he_1) - V(x + he_1 - he_2) - V(x) + V(x - he_2) + V(x + he_2) - V(x - he_1 + he_2) - V(x) + V(x - he_1))\}.$$

For the given mesh point $x$ this involves the six neighbors $x \pm he_l$, $l = 1, 2$, and $x \pm h(e_1 - e_2)$, and the coefficients at these points are $h^{-2}$ times $a_{11} + a_{12}$, $a_{22} + a_{12}$ and $-a_{12}$, respectively. If these numbers are all positive the operator $A_h$ is said to be of positive type, and we remark that the method may then be analyzed in the maximum norm as will be described in Section 11 below. However, positivity is not a necessary consequence of the ellipticity requirement $a_{12}^2 < a_{11}a_{22}$, even if $a_{12}$ is negative (take e.g. $a_{11} = 5$, $a_{22} = 1$, $a_{12} = -2$). We shall therefore now present a finite difference method based on the backward Euler discretization in time together with an analysis by the energy method which does not require the operator to have this positivity property. In the $d$-dimensional case the number of neighbors is $2d + 2\binom{d}{2} = d^2 + d$.
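The stencil coefficients and the counterexample can be checked mechanically. The sketch below is my own illustration; it takes "positive type" to mean all three coefficient combinations strictly positive:

```python
# A sketch (my own illustration, not from the text) of the positive-type test
# for the six-point stencil derived above: the neighbors x±h*e1, x±h*e2 and
# x±h*(e1-e2) carry h^{-2} times a11+a12, a22+a12 and -a12, respectively.
# Here "positive type" is taken to mean all three numbers strictly positive.

def stencil_is_positive_type(a11, a22, a12):
    elliptic = a12 ** 2 < a11 * a22
    positive = (a11 + a12 > 0) and (a22 + a12 > 0) and (-a12 > 0)
    return elliptic, positive

# The counterexample from the text: a11 = 5, a22 = 1, a12 = -2 is elliptic
# (4 < 5), yet a22 + a12 = -1 < 0, so A_h is not of positive type.
assert stencil_is_positive_type(5, 1, -2) == (True, False)
assert stencil_is_positive_type(1, 1, -0.5) == (True, True)
```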
Let $\Omega_h^0$ be the mesh points $x$ of $\Omega$ such that the neighbors thus described belong to $\bar\Omega$, and let $\Gamma_h^0$ be the remaining mesh points of $\Omega$. With each point $x$ of $\Gamma_h^0$ we associate a point $\bar x \in \Gamma$ of distance less than $Ch$ from $x$. Note that in the case considered above, in which the domain was a union of cubes, we have $\Omega_h^0 = \Omega_h$ and $\Gamma_h^0 = \Gamma_h \subset \Gamma$.

In the present case we pose the discrete problem

$$(L_{kh}U)^{n+1} \equiv \bar\partial_t U^{n+1} - A_h U^{n+1} = f^{n+1} \quad \text{for } x \in \Omega_h^0,$$
$$U^{n+1}(x) = g^{n+1}(\bar x) \quad \text{for } x \in \Gamma_h^0, \tag{10.48}$$
$$U^0(x) = v(x) \quad \text{for } x \in \Omega_h^0.$$
We first note that this problem has a unique solution. As usual this follows by the uniqueness of the corresponding homogeneous problem. Let thus $U^n = f^{n+1} = g^{n+1} = 0$, so that we may consider $U^{n+1}$ to be in the set $\mathscr{D}_h(\Omega_h^0)$ of mesh functions on $\mathbf{R}^d$ which vanish outside $\Omega_h^0$. Setting here, for any $V, W \in \mathscr{D}_h(\Omega_h^0)$,

$$(V, W) = h^d\sum_{x = \beta h \in \mathbf{R}^d} V(x)W(x),$$

and

$$\|V\| = (V, V)^{1/2},$$

we have

$$(U^{n+1} - kA_h U^{n+1}, U^{n+1}) = 0, \tag{10.49}$$

and it follows by summation by parts that

$$\|U^{n+1}\|^2 + k\sum_{i,j=1}^{d}(a_{ij}\partial_{x_j}U^{n+1}, \partial_{x_i}U^{n+1}) = 0.$$

Since $(a_{ij})$ is positive-definite, the second term is nonnegative and we conclude that the first term, and hence $U^{n+1}$, vanishes, which completes the proof.

We shall prove the following convergence estimates where, as throughout this discussion, we denote by $\|\cdot\|_S$ the $l_2$-norm over a set of mesh points $S$,

$$\|V\|_S = \Big(h^d\sum_{x \in S} V(x)^2\Big)^{1/2}.$$

THEOREM 10.4. Assume that $u \in C^{3,2}$. Then, for the solutions of (10.48) and (10.47), we have

$$\|U^n - u^n\|_{\Omega_h^0} \leq C(u, T)(k + h^{1/2}) \quad \text{for } nk \leq T. \tag{10.50}$$

In the case that $\Gamma_h^0 \subset \Gamma$ we have, if $u \in C^{4,2}$, that

$$\|U^n - u^n\|_{\Omega_h^0} \leq C(u, T)(k + h^2) \quad \text{for } nk \leq T. \tag{10.51}$$
PROOF. Let $V^n \in \mathscr{D}_h(\Omega_h^0)$ for $n \geq 0$. We have, by the definition of $L_{kh}$ in (10.48),

$$V^{n+1} - kA_h V^{n+1} = V^n + k(L_{kh}V)^{n+1} \quad \text{on } \Omega_h^0.$$

We multiply by $V^{n+1}$ and sum to obtain

$$\|V^{n+1}\|^2 + \gamma k\sum_{i=1}^{d}\|\partial_{x_i}V^{n+1}\|^2 \leq (V^n, V^{n+1}) + k((L_{kh}V)^{n+1}, V^{n+1})$$
$$\leq \tfrac12\|V^n\|^2 + \tfrac12\|V^{n+1}\|^2 + k((L_{kh}V)^{n+1}, V^{n+1}),$$

or

$$\|V^{n+1}\|^2 + 2\gamma k\sum_{i=1}^{d}\|\partial_{x_i}V^{n+1}\|^2 \leq \|V^n\|^2 + 2k((L_{kh}V)^{n+1}, V^{n+1}). \tag{10.52}$$

Let now $\Gamma_h^1$ be the set of $x \in \Omega_h^0$ with neighbors in $\Gamma_h^0$, and $\Omega_h^1 = \Omega_h^0\setminus\Gamma_h^1$. Set

$$|||V|||^2 = h^{d-2}\sum_{x\in\Gamma_h^1}V(x)^2 + h^d\sum_{x\in\Omega_h^1}V(x)^2,$$

and

$$\tilde L_{kh}V(x) = \begin{cases} hL_{kh}V(x) & \text{on } \Gamma_h^1,\\ L_{kh}V(x) & \text{on } \Omega_h^1,\\ 0 & \text{for } x \notin \Omega_h^0.\end{cases}$$

Then

$$((L_{kh}V)^{n+1}, V^{n+1}) \leq \|(\tilde L_{kh}V)^{n+1}\|\, |||V^{n+1}|||. \tag{10.53}$$

Now we have easily, for $V \in \mathscr{D}_h(\Omega_h^0)$,

$$|||V|||^2 \leq C\sum_{i=1}^{d}\|\bar\partial_{x_i}V\|^2 + C\|V\|^2. \tag{10.54}$$

In fact, if $x \in \Gamma_h^1$ is such that, e.g., $x - he_1$ is in $\Gamma_h^0$, then, for $V \in \mathscr{D}_h(\Omega_h^0)$,

$$h^{-1}V(x) = h^{-1}(V(x) - V(x - he_1)) = \bar\partial_{x_1}V(x),$$

and hence, with summation on the left over only such points,

$$h^{d-2}\sum_{\Gamma_h^1}V(x)^2 \leq h^d\sum(\bar\partial_{x_1}V(x))^2 \leq \|\bar\partial_{x_1}V\|^2.$$
We conclude from (10.52), (10.53) and (10.54) that

$$\|V^{n+1}\|^2 \leq \|V^n\|^2 + Ck\|(\tilde L_{kh}V)^{n+1}\|^2 \quad \text{for } V \in \mathscr{D}_h(\Omega_h^0),$$

and hence, by summation, the $l_2$-stability estimate

$$\|V^n\|^2 \leq \|V^0\|^2 + Ck\sum_{l=1}^{n}\|(\tilde L_{kh}V)^l\|^2. \tag{10.55}$$
We now apply this estimate to $z^n = U^n - u^n$ on $\Omega_h^0$, extended by zero outside $\Omega_h^0$.

Then clearly, for $x \in \Omega_h^1$,

$$\tilde L_{kh}z^l = L_{kh}U^l - L_{kh}u^l = f^l - Lu^l + O(k + h) = O(k + h) \quad \text{as } h, k \to 0$$

(or $O(k + h^2)$ if $u \in C^{4,2}$), and for $x \in \Gamma_h^1$, with $\tilde u(x) = u(x)$ if $x \in \Omega_h^0$ and $\tilde u(x) = u(\bar x)$ if $x \in \Gamma_h^0$,

$$\tilde L_{kh}z^l = h\{f^l(x) - \bar\partial_t\tilde u^l + A_h\tilde u^l\} \leq Ch\{O(k + h) + h^{-2}\|\tilde u - u\|_{\infty,\Gamma_h^0}\} \leq C, \tag{10.56}$$

since $\tilde u - u = O(h)$. It follows that, since the number of points of $\Gamma_h^1$ is of the order $O(h^{-(d-1)})$,

$$\|(\tilde L_{kh}z)^l\|^2 \leq Ch^d h^{-(d-1)} + Ch^d\sum_{\Omega_h^1}(k + h)^2 \leq C(k^2 + h),$$

so that, by (10.55), since $z^0 = 0$ on $\Omega_h^0$,

$$\|z^n\|^2 \leq Cnk(k^2 + h) \leq C(k^2 + h) \quad \text{for } nk \leq T,$$

which shows the desired result (10.50).
In the case that $\Gamma_h^0 \subset \Gamma$ we have $\tilde u = u$ on $\Gamma_h^0$, and the above argument easily yields the $O(k + h^2)$ error estimate (10.51) for $u \in C^{4,2}$. $\square$

The above analysis may be extended to the Crank-Nicolson method using instead of (10.48) the equation

$$(L_{kh}U)^{n+1/2} \equiv \bar\partial_t U^{n+1} - A_h\tfrac12(U^{n+1} + U^n) = f^{n+1/2} \quad \text{in } \Omega_h^0.$$

The existence of a solution could then be shown as before, replacing $k$ by $\tfrac12 k$ in (10.49), and the estimate analogous to (10.55) is obtained by writing the definition of $L_{kh}$ in the form

$$V^{n+1} - \tfrac12 kA_h(V^n + V^{n+1}) = V^n + k(L_{kh}V)^{n+1/2},$$

then multiplying by $V^n + V^{n+1}$ to obtain, with obvious notation, for $V^n \in \mathscr{D}_h(\Omega_h^0)$,

$$\|V^{n+1}\|^2 - \|V^n\|^2 - \tfrac12 k(A_h(V^n + V^{n+1}), V^n + V^{n+1}) = k((L_{kh}V)^{n+1/2}, V^n + V^{n+1})$$
$$\leq k\|(\tilde L_{kh}V)^{n+1/2}\|\, |||V^n + V^{n+1}|||,$$

and hence, as earlier,

$$\|V^{n+1}\|^2 \leq \|V^n\|^2 + Ck\|(\tilde L_{kh}V)^{n+1/2}\|^2,$$

and, by summation,

$$\|V^n\|^2 \leq \|V^0\|^2 + Ck\sum_{l=0}^{n-1}\|(\tilde L_{kh}V)^{l+1/2}\|^2.$$

The conclusion in this case is an error estimate of the form

$$\|U^n - u^n\|_{\Omega_h^0} \leq C(u, T)(k^2 + h^{1/2}) \quad \text{for } nk \leq T,\ u \in C^{3,3},$$

without any mesh ratio restrictions. For domains with $\Gamma_h^0 \subset \Gamma$ the result is

$$\|U^n - u^n\|_{\Omega_h^0} \leq C(u, T)(k^2 + h^2) \quad \text{for } nk \leq T,\ u \in C^{4,3}.$$
Results may also be demonstrated for the explicit forward Euler method using this
boundary value approximation.
We emphasize again that except for very special types of domains the methods just
described are of low accuracy with respect to the mesh size in space.
The finite difference method is thus difficult to apply successfully in the case of
a domain in more than one space dimension with a curved boundary. As we have
stated earlier, this is one of the main reasons for the recent rapid growth of the finite
element method, which is better suited for application to complicated geometries.
This method is, in fact, very closely associated with the finite difference method and
sometimes coincides with it. For this reason we shall briefly discuss a specific finite
element method for the heat equation and see how it relates to our well-known finite
difference equations.
We consider thus the initial boundary value problem

$$u_t = \Delta u + f \quad \text{in } \Omega,\ t > 0,$$
$$u = 0 \quad \text{on } \partial\Omega,\ t > 0, \tag{10.57}$$
$$u(x,0) = v(x) \quad \text{in } \Omega,$$

where, for simplicity, $\Omega$ is a convex plane domain with smooth boundary.

We let $h$ be a small positive number and impose a triangulation of the plane by the mesh lines $x_1 = jh$, $x_2 = jh$, $x_1 + x_2 = jh$. We now modify the triangulation near the boundary in such a way that $\Omega$ is divided into disjoint triangles such that no vertex of any triangle is on the interior of a side of another triangle and such that the union of the triangles determines a polygonal domain $\Omega_h$ whose boundary vertices lie on $\partial\Omega$ (cf. Fig. 10.1). Note that the triangles near $\partial\Omega$ have straight edges and are not curvilinear. We assume that the modifications near $\partial\Omega$ are done in such a way that the angles of the triangulations are bounded below, independently of $h$. The triangulations are denoted by $\mathscr{T}_h$.

Let now $S_h$ denote the continuous functions on $\bar\Omega$ which are linear in each of the triangles of $\mathscr{T}_h$ and which vanish outside $\Omega_h$, and let $\{P_j\}_1^{N_h}$ be the interior vertices of $\mathscr{T}_h$. A function $\chi \in S_h$ may then be written as

$$\chi(x) = \sum_{j=1}^{N_h} \chi(P_j)\varphi_j(x), \tag{10.58}$$

where $\{\varphi_j\}_1^{N_h}$ are the basis functions in $S_h$ defined by $\varphi_j(P_j) = 1$, $\varphi_j(P_l) = 0$ for $l \neq j$.
For the purpose of defining an approximate solution in $S_h$ of the initial boundary

FIG. 10.1. (Triangulation of $\Omega$ near the curved boundary; figure not reproduced.)

value problem (10.57) we first write this problem in a weak form: We multiply the heat equation by a smooth function $\varphi$ which vanishes on $\partial\Omega$, integrate over $\Omega$, and apply Green's formula to the second term to obtain for all such $\varphi$, with $(v, w)$ now denoting the standard inner product in $L_2(\Omega)$,

$$(u_t, \varphi) + (\nabla u, \nabla\varphi) = (f, \varphi) \quad \text{for } t > 0.$$

We may now pose the approximate problem to find $u_h(t)$, belonging to $S_h$ for each $t$, such that

$$(u_{h,t}, \chi) + (\nabla u_h, \nabla\chi) = (f, \chi) \quad \text{for } \chi \in S_h,\ t > 0, \tag{10.59}$$

together with the initial condition

$$u_h(0) = v_h,$$

where $v_h$ is some approximation of $v$ in $S_h$, such as the interpolant

$$v_h = \sum_{j=1}^{N_h} v(P_j)\varphi_j.$$

Since we have only discretized in the space variables, this is referred to as a semidiscrete problem.
In terms of the basis $\{\varphi_j\}_1^{N_h}$ for $S_h$ our semidiscrete problem may be stated: Find the coefficients $U_j(t)$ in

$$u_h(x, t) = \sum_{j=1}^{N_h} U_j(t)\varphi_j(x)$$

such that

$$\sum_{j=1}^{N_h} U_j'(t)(\varphi_j, \varphi_l) + \sum_{j=1}^{N_h} U_j(t)(\nabla\varphi_j, \nabla\varphi_l) = (f, \varphi_l), \quad l = 1, \dots, N_h. \tag{10.60}$$

Recall from (10.58) that the $U_j(t)$ are then the values of the approximate solution at

the mesh points $P_j$. In matrix formulation (10.60) may be expressed as

$$BU'(t) + AU(t) = f(t) \quad \text{for } t \geq 0, \tag{10.61}$$

with $U(0) = V_h$ given, where $B = (b_{jl})$ is the mass matrix with elements $b_{jl} = (\varphi_j, \varphi_l)$, $A = (a_{jl})$ the stiffness matrix with $a_{jl} = (\nabla\varphi_j, \nabla\varphi_l)$, $f = (f_l)$ the vector with entries $f_l = (f, \varphi_l)$, and $U(t)$ the vector of the unknown coefficients $U_j(t)$, $j = 1, \dots, N_h$.
We may now discretize in the time variable, for instance by means of the $\theta$-method, to obtain

$$B\,\frac{U^{n+1} - U^n}{k} + AU^{n+\theta} = f^{n+\theta}, \quad n \geq 0,$$

with $U^0 = V_h$ given, where $U^{n+\theta} = \theta U^{n+1} + (1 - \theta)U^n$, $0 \leq \theta \leq 1$.
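One time step of this method amounts to a linear solve. The sketch below is my own illustration: $B$ and $A$ are arbitrary symmetric positive-definite stand-ins for the mass and stiffness matrices, not assembled from an actual triangulation, and the final assertion checks the expected energy decay for $\theta = \tfrac12$ with $f = 0$:

```python
import numpy as np

# A sketch (my own illustration) of one theta-method step for the Galerkin
# system B U'(t) + A U(t) = f(t) above:
#   (B + theta*k*A) U^{n+1} = (B - (1-theta)*k*A) U^n + k*f^{n+theta}.
# B and A below are arbitrary SPD stand-ins for the mass and stiffness
# matrices, not assembled from an actual triangulation.

def theta_step(B, A, U, f, k, theta):
    return np.linalg.solve(B + theta * k * A,
                           (B - (1 - theta) * k * A) @ U + k * f)

rng = np.random.default_rng(2)
N = 5
Q = rng.standard_normal((N, N))
B = Q @ Q.T + N * np.eye(N)          # SPD "mass" matrix
A = np.diag(np.arange(1.0, N + 1))   # SPD "stiffness" matrix
U0 = rng.standard_normal(N)
U1 = theta_step(B, A, U0, np.zeros(N), 0.1, 0.5)
# with f = 0 and theta = 1/2 the B-energy U^T B U does not increase
assert U1 @ B @ U1 <= U0 @ B @ U0 + 1e-12
```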
Let us now note that for $P_l = (l_1h, l_2h)$ an interior mesh point which together with its immediate neighbors belongs to the original square mesh, the equation in (10.60) corresponding to that point may be written

$$\sum_j \mu_j U_{l-j}' - \Delta_h U_l = h^{-2}(f, \varphi_l),$$

and the corresponding time discrete equation

$$\sum_j \mu_j \partial_t U_{l-j}^n - \Delta_h U_l^{n+\theta} = h^{-2}(f^{n+\theta}, \varphi_l), \tag{10.62}$$

where $\Delta_h = \partial_{x_1}\bar\partial_{x_1} + \partial_{x_2}\bar\partial_{x_2}$ is the standard discrete Laplacian, $\partial_t$ the forward difference quotient, and $j$ ranges over $(0,0)$, $(\pm1, 0)$, $(0, \pm1)$, $(\pm1, \mp1)$, with

$$\mu_{(0,0)} = \tfrac12, \qquad \mu_{(\pm1,0)} = \mu_{(0,\pm1)} = \mu_{(\pm1,\mp1)} = \tfrac1{12}.$$

Note that even if $\theta = 0$ the scheme is implicit, as $B$ is not diagonal and has to be inverted. However, it is possible to modify the scheme slightly to replace $B$ by a diagonal matrix. This modification is accomplished by evaluating the inner product in the first term in (10.59), and thus in (10.60), by means of numerical quadrature as follows: We set

$$(v, w)_h = \sum_{\tau\in\mathscr{T}_h} Q_\tau(vw),$$

where for each triangle $\tau \in \mathscr{T}_h$ (with vertices $\{P_{\tau,j}\}_{j=1}^3$),

$$Q_\tau(f) = \tfrac13\,\mathrm{area}(\tau)\sum_{j=1}^{3} f(P_{\tau,j}).$$

The method is then

$$(u_{h,t}, \chi)_h + (\nabla u_h, \nabla\chi) = (f, \chi), \quad \chi \in S_h,\ t > 0.$$

Since by our definition

$$(\varphi_j, \varphi_l)_h = 0 \quad \text{for } j \neq l,$$

the matrix $B$ in (10.61) is now replaced by a diagonal matrix. The procedure is referred to as the lumped mass method since it may be interpreted as having been

obtained by adding all the elements in each row of the mass matrix B into the
corresponding diagonal element. For strictly interior mesh points we also have

$$(\varphi_j, \varphi_j)_h = h^2,$$

and hence equation (10.62) is now replaced by

$$\partial_t U_l^n - \Delta_h U_l^{n+\theta} = h^{-2}(f^{n+\theta}, \varphi_l).$$


In particular, in the case of a homogeneous heat equation this approximation is the
standard finite difference equation of the $\theta$-method treated earlier. Our above
method may thus be thought of as a method of adjusting these equations near the
boundary to obtain a viable method for a general domain.
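The row weights quoted in (10.62) and the lumped diagonal value can be recovered from the standard exact integrals of linear basis functions on a triangle. The sketch below is my own illustration of this bookkeeping (exact rational arithmetic, with $h^2$ normalized to 1):

```python
from fractions import Fraction as F

# A sketch (my own illustration) of the interior mass matrix row on the
# uniform three-direction triangulation, using the standard exact integrals
# for linear elements on a triangle tau of area |tau|:
#   int_tau phi_i phi_j = |tau|/12 (i != j),  |tau|/6 (i == j).
# An interior vertex belongs to 6 triangles of area h^2/2 and shares an edge
# with each of its 6 neighbors through exactly 2 of them.

h2 = F(1)                    # compute with h^2 = 1, exactly
area = h2 / 2
diag = 6 * area / 6          # (phi_l, phi_l): six triangles, |tau|/6 each
offd = 2 * area / 12         # (phi_l, phi_j), j a neighbor: two triangles

assert diag == h2 / 2 and offd == h2 / 12   # mu_(0,0) = 1/2, mu_j = 1/12
assert diag + 6 * offd == h2                # row sum = lumped mass h^2
```

The last assertion is exactly the lumped mass interpretation: adding the whole row of $B$ into the diagonal gives $h^2$.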
One may show that the methods just described have the natural stability and
convergence properties. In particular, the backward Euler method has a O(h 2 + k)
error in the 12-norm (and essentially also in the maximum norm) and the
Crank-Nicolson method is correspondingly O(h2 + k2 ).
The approach sketched above is suitable for many problems which may be put in
variational form and thus generalizes to general parabolic equations with variable
coefficients and to other types of boundary conditions, cf. e.g. RAVIART [ 1967]. In the
Russian literature they are referred to as "variational difference schemes", cf. e.g.
ASTRAKHANTSEV [1971], AKOPYAN and OGANESYAN [1975]. These types of methods
are examples of Galerkin finite element methods, and the analysis will not be carried
out here. For references, see the article on "evolution problems" by Fujita and
Suzuki in Volume II or THOMEE [1984].

11. Monotonicity and maximum principle type arguments

In this section we shall consider the application of monotonicity and maximum


principle type arguments, such as those used in Sections 1 and 2 to derive
maximum-norm stability from the positivity of the coefficients of the difference
schemes. We shall start by formulating a maximum principle in the one-dimensional
case and then apply this to derive stability and convergence results in the case that
the boundary conditions involve derivatives. We shall then turn to problems in two
space dimensions with Dirichlet boundary conditions and analyze some difference
schemes based on the backward Euler method in which the difference equations are
adjusted to fit the possibly curved boundary.
Let us start by recalling some facts about the maximum-norm stability of the
solution of the initial boundary value problem
$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}, \quad 0 < x < 1,\ t > 0,$$
$$u(0,t) = u(1,t) = 0, \quad t > 0, \tag{11.1}$$
$$u(x,0) = v(x), \quad 0 \leq x \leq 1.$$
We considered already in Sections 1 and 2 the forward and backward Euler methods

and the Crank-Nicolson method, the special cases $\theta = 0$, $1$ and $\tfrac12$ of the $\theta$-method, i.e.

$$\partial_t U_j^n = \partial_x\bar\partial_x U_j^{n+\theta}, \quad j = 1, \dots, M-1,\ n \geq 0,$$
$$U_0^{n+1} = U_M^{n+1} = 0, \quad n \geq 0, \tag{11.2}$$
$$U_j^0 = v_j = v(jh), \quad 0 \leq j \leq M,$$

where $h = 1/M$ is the mesh width in space and $U^{n+\theta} = \theta U^{n+1} + (1 - \theta)U^n$. The system of equations for the components of $U^{n+1}$ in (11.2) may be written

$$(1 + 2\theta\lambda)U_j^{n+1} - \theta\lambda(U_{j+1}^{n+1} + U_{j-1}^{n+1})$$
$$= (1 - 2(1 - \theta)\lambda)U_j^n + (1 - \theta)\lambda(U_{j+1}^n + U_{j-1}^n), \quad j = 1, \dots, M-1,$$
$$U_0^{n+1} = U_M^{n+1} = 0.$$

We see that if

$$2(1 - \theta)\lambda \leq 1, \tag{11.3}$$

then the coefficients on the right are nonnegative and an obvious argument shows as earlier that

$$\|U^{n+1}\|_{\infty,h} \leq \|U^n\|_{\infty,h}, \tag{11.4}$$

where the norm is the maximum norm over the mesh points,

$$\|V\|_{\infty,h} = \max_j|V_j|.$$

Thus, in particular, maximum-norm stability holds with a stability constant 1.


Except for the case $\theta = 1$ this imposes a condition of the type $k = O(h^2)$ on the time step, which is not desirable. We noted, however, in Section 7 that for the Crank-Nicolson scheme ($\theta = \tfrac12$) uniform maximum-norm stability holds (with the stability constant 23) for all choices of $h$ and $k$.
We also recall that the stability estimate (11.4) implies a convergence result of the form

$$\|U^n - u^n\|_{\infty,h} \leq \begin{cases} C(u)h^2, & \text{if } \theta \neq 1,\\ C(u)(k + h^2), & \text{if } \theta = 1,\end{cases} \tag{11.5}$$

provided the solution of (11.1) is smooth enough.


We shall reconsider the above stability result in a somewhat different way. For this purpose we introduce the finite difference operator

$$L_{kh}U_j^n = \partial_t U_j^n - \partial_x\bar\partial_x U_j^{n+\theta}, \quad j = 1, \dots, M-1,\ n \geq 0.$$

We may now state the following maximum-minimum principle.

THEOREM 11.1. Assume that (11.3) holds and that

$$L_{kh}U_j^n \leq 0 \quad \text{for } j = 1, \dots, M-1,\ n = 0, \dots, N-1.$$

Then the maximum of $U_j^n$ over the set $\{(j,n)\colon j = 0, \dots, M,\ n = 0, \dots, N\}$ is attained for $n = 0$ or for $j = 0$ or $j = M$. Similarly, if

$$L_{kh}U_j^n \geq 0 \quad \text{for } j = 1, \dots, M-1,\ n = 0, \dots, N-1,$$

then the minimum of $U_j^n$ over the above set of points is attained for $n = 0$ or for $j = 0$ or $j = M$.

PROOF. Assume that the maximum is attained at the point $(j, n+1)$ with $0 < j < M$. Then at this point

$$(1 + 2\theta\lambda)U_j^{n+1} \leq \theta\lambda(U_{j+1}^{n+1} + U_{j-1}^{n+1}) + (1 - 2(1 - \theta)\lambda)U_j^n + (1 - \theta)\lambda(U_{j+1}^n + U_{j-1}^n),$$

and hence, since the coefficients on the right are all nonnegative and add up to $1 + 2\theta\lambda$, i.e. the coefficient on the left, and since the values of $U$ occurring on the right are at most $U_j^{n+1}$, they all have to be equal to this number. Repeating this argument we may conclude in a finite number of steps that the same value has to be attained at either a point on the initial line or on the left or right boundary. This shows the first part of the theorem. The second part follows from the first by application to $-U$. $\square$

It follows, in particular, that if $U$ satisfies (11.2), then since $L_{kh}U_j^n = 0$ and since $U$ vanishes on the left and right boundaries, we have

$$\min(0, \min_{0\leq j\leq M} U_j^0) \leq U_j^n \leq \max(0, \max_{0\leq j\leq M} U_j^0),$$

which is a sharper result than (11.4).
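This two-sided bound is easy to observe numerically. The sketch below is my own illustration: for simplicity it uses the explicit case $\theta = 0$ at the stability limit $\lambda = \tfrac12$ (so that each update is a convex combination of neighboring values), with arbitrary initial data:

```python
import numpy as np

# A sketch (my own illustration) of the bound just derived: for the scheme
# (11.2) with 2(1-theta)*lambda <= 1, every U^n stays between
# min(0, min_j U_j^0) and max(0, max_j U_j^0).  For simplicity the explicit
# case theta = 0 is used, at the stability limit lambda = 1/2.

M, lam = 20, 0.5
h = 1.0 / M
U = np.sin(3 * np.pi * np.arange(M + 1) * h) + 0.3   # arbitrary initial data
U[0] = U[M] = 0.0
lo, hi = min(0.0, U.min()), max(0.0, U.max())

for n in range(200):
    Unew = U.copy()
    Unew[1:M] = (1 - 2 * lam) * U[1:M] + lam * (U[2:] + U[:M - 1])
    U = Unew
    assert lo - 1e-12 <= U.min() and U.max() <= hi + 1e-12
```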

Our above theorem may also be used to discuss monotonicity properties in the case of the nonhomogeneous initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + f, \quad 0 < x < 1,\ t > 0,$$
$$u(0, t) = g_0(t), \quad t > 0,$$
$$u(1, t) = g_1(t), \quad t > 0,$$
$$u(x, 0) = v(x), \quad 0 \leq x \leq 1,$$

and the corresponding finite difference scheme

$$\partial_t U_j^n - \partial_x\bar\partial_x U_j^{n+\theta} = F_j^{n+\theta}, \quad j = 1, \dots, M-1,\ n \geq 0,$$
$$U_0^{n+1} = G_0^{n+1}, \quad n \geq 0,$$
$$U_M^{n+1} = G_1^{n+1}, \quad n \geq 0, \tag{11.6}$$
$$U_j^0 = V_j, \quad j = 0, \dots, M,$$

where

$$F_j^{n+\theta} = f(jh, (n + \theta)k),$$
$$G_0^{n+1} = g_0((n + 1)k), \quad G_1^{n+1} = g_1((n + 1)k),$$
$$V_j = v(jh).$$

Let (11.3) be satisfied. We may then assert that if $U$ is the solution of (11.6) and $\tilde U$ the solution with $F$, $G_0$, $G_1$ and $V$ replaced by $\tilde F$, $\tilde G_0$, $\tilde G_1$ and $\tilde V$, respectively, and if $F \leq \tilde F$, $G_0 \leq \tilde G_0$, $G_1 \leq \tilde G_1$ and $V \leq \tilde V$, then $U \leq \tilde U$. For

$$L_{kh}(U - \tilde U)_j^n = F_j^{n+\theta} - \tilde F_j^{n+\theta} \leq 0 \quad \text{for } j = 1, \dots, M-1,\ n \geq 0,$$

and hence, by Theorem 11.1, $U - \tilde U$ attains its maximum on the initial line or the left- or right-hand boundaries. But there $U - \tilde U \leq 0$, so that this inequality holds everywhere. As a special case, if the data of (11.6) are nonpositive (or nonnegative) so is the solution.
This argument may also be used in error estimation. To demonstrate this, consider for simplicity the case $\theta \neq 1$ so that $k = O(h^2)$ when (11.3) is satisfied. Setting as usual $z^n = U^n - u^n$ we now have

$$L_{kh}z_j^n = \partial_t z_j^n - \partial_x\bar\partial_x z_j^{n+\theta} = \tau_j^n, \quad j = 1, \dots, M-1,\ n \geq 0,$$
$$z_0^n = z_M^n = 0, \quad n \geq 0,$$
$$z_j^0 = 0, \quad j = 0, \dots, M,$$

where $\tau_j^n = O(h^2)$. Let

$$w_j^n = z_j^n - \mu jh(1 - jh)h^2, \quad j = 1, \dots, M-1,\ n \geq 0.$$

Then, with $\mu$ chosen large enough, we have

$$L_{kh}w_j^n = L_{kh}z_j^n - 2\mu h^2 = \tau_j^n - 2\mu h^2 \leq 0,$$

and hence, since $w$ vanishes on the left- and right-hand boundaries and is nonpositive on the initial line, $w$ is nonpositive everywhere, and hence

$$z_j^n \leq \mu jh(1 - jh)h^2.$$
Since the analogous estimate holds for -z we may state the following theorem:

THEOREM 11.2. Let $U^n$ and $u$ be the solutions of (11.2) and (11.1), respectively, and assume that (11.3) holds. Then we have

$$|U_j^n - u_j^n| \leq C(u)jh(1 - jh)h^2, \quad j = 0, \dots, M,\ n \geq 0.$$

This result includes (11.5) as a special case but also shows that the error is smaller near the endpoints of the interval.
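The key step in the comparison argument is that the second difference of the quadratic barrier is exact. A small sketch (my own illustration) confirms it:

```python
# A small check (my own illustration) of the comparison-function computation
# above: for the quadratic mesh function w_j = jh(1 - jh) the second
# difference d_x dbar_x w_j equals -2 exactly, since x(1-x) is quadratic;
# this is why L_kh applied to mu*jh(1-jh)*h^2 produces the term -2*mu*h^2.

M = 10
h = 1.0 / M
w = [j * h * (1 - j * h) for j in range(M + 1)]
for j in range(1, M):
    assert abs((w[j + 1] - 2 * w[j] + w[j - 1]) / h ** 2 + 2.0) < 1e-10
```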
Arguments of the above type have been used, also for nonlinear problems, e.g. in
ROSE [1956], ISAACSON [1961], BATTEN [1963] and KRAWCZYK [1963].
We shall now discuss some maximum-norm stability and convergence results in
the case that the Dirichlet boundary conditions in (11.1) are replaced by boundary
conditions of Neumann or the third kind,

$$\Big(\frac{\partial u}{\partial x} - \alpha u\Big)(0, t) = \Big(\frac{\partial u}{\partial x} + \beta u\Big)(1, t) = 0 \quad \text{for } t > 0, \tag{11.7}$$



where $\alpha$ and $\beta$ are nonnegative constants. We consider, for simplicity of presentation only, the backward Euler scheme

$$\bar\partial_t U_j^{n+1} - \partial_x\bar\partial_x U_j^{n+1} = 0, \quad j = 1, \dots, M-1,\ n \geq 0, \tag{11.8}$$

and we shall approximate the boundary conditions (11.7) by

$$\partial_x U_0^{n+1} - \alpha U_0^{n+1} - \tfrac12 h\bar\partial_t U_0^{n+1} = 0 \tag{11.9}$$

and

$$\bar\partial_x U_M^{n+1} + \beta U_M^{n+1} + \tfrac12 h\bar\partial_t U_M^{n+1} = 0, \tag{11.10}$$
lU u " =0, (11.10)
respectively. As we pointed out in Section 10 above, the first of these may be thought
of as having been obtained by eliminating an artificial value UI from the
equations
+
U 'O-+aUO + l =0,
and

a,un+ I -aa,xU l =0,

U
where 8x =(a + g)U denotes the symmetric difference quotient, and similarly
for the second boundary condition.
The equations for the components of $U^{n+1}$ may be written in the form

$$(1 + 2\lambda)U_j^{n+1} - \lambda(U_{j+1}^{n+1} + U_{j-1}^{n+1}) = U_j^n, \quad j = 1, \dots, M-1, \tag{11.11}$$
$$(1 + 2\alpha\lambda h + 2\lambda)U_0^{n+1} - 2\lambda U_1^{n+1} = U_0^n, \tag{11.12}$$
$$(1 + 2\beta\lambda h + 2\lambda)U_M^{n+1} - 2\lambda U_{M-1}^{n+1} = U_M^n, \tag{11.13}$$

and we conclude in the standard way, since $\alpha$ and $\beta$ are nonnegative, that

$$\|U^{n+1}\|_{\infty,h} \leq \|U^n\|_{\infty,h},$$
that is, maximum-norm stability.
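This stability can be exercised directly. The sketch below is my own illustration: it assembles the tridiagonal system in the form written above (with arbitrary test values of $\lambda$, $\alpha$, $\beta$) and checks that the maximum norm never grows over repeated backward Euler steps:

```python
import numpy as np

# A sketch (my own illustration) of backward Euler steps with the system
# (11.11)-(11.13) in the form written above, verifying the maximum-norm
# bound ||U^{n+1}||_inf <= ||U^n||_inf for nonnegative alpha, beta.

M, lam, alpha, beta = 16, 2.0, 1.0, 0.5       # arbitrary test parameters
h = 1.0 / M

B = np.zeros((M + 1, M + 1))
for j in range(1, M):                          # interior rows (11.11)
    B[j, j - 1], B[j, j], B[j, j + 1] = -lam, 1 + 2 * lam, -lam
B[0, 0], B[0, 1] = 1 + 2 * alpha * lam * h + 2 * lam, -2 * lam    # (11.12)
B[M, M], B[M, M - 1] = 1 + 2 * beta * lam * h + 2 * lam, -2 * lam  # (11.13)

U = np.random.default_rng(1).standard_normal(M + 1)
for n in range(50):
    Unew = np.linalg.solve(B, U)
    assert np.abs(Unew).max() <= np.abs(U).max() + 1e-12
    U = Unew
```

The bound follows from diagonal dominance: at the index where $|U^{n+1}|$ is largest, the diagonal term dominates the two off-diagonal contributions.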
We recall that the truncation error in (11.8) is $O(h^2 + k)$, which is of the order $O(h^2)$ if $\lambda = k/h^2$ is fixed, as we shall now assume. Similarly, simple calculations confirm that the truncation errors in both (11.9) and (11.10) are second-order in $h$, in the sense that a smooth solution of the continuous problem satisfies these equations with an $O(h^2)$ error. This may also be expressed by saying that the exact solution satisfies (11.11) with an $O(h^4) = kO(h^2)$ error and (11.12) and (11.13) with an $O(h^3) = kO(h)$ error. Introducing the error $z^n = U^n - u^n$ we thus have

$$(1 + 2\lambda)z_j^{n+1} - \lambda(z_{j+1}^{n+1} + z_{j-1}^{n+1}) = z_j^n + kO(h^2) \quad \text{for } j = 1, \dots, M-1, \tag{11.14}$$
$$(1 + 2\alpha\lambda h + 2\lambda)z_0^{n+1} - 2\lambda z_1^{n+1} = z_0^n + kO(h), \tag{11.15}$$

$$(1 + 2\beta\lambda h + 2\lambda)z_M^{n+1} - 2\lambda z_{M-1}^{n+1} = z_M^n + kO(h). \tag{11.16}$$

From this we easily conclude that

$$\|z^{n+1}\|_{\infty,h} \leq \|z^n\|_{\infty,h} + C(u)kh,$$

and hence, by summation, since $z^0 = 0$,

$$\|z^n\|_{\infty,h} \leq \|z^0\|_{\infty,h} + C(u)nkh \leq C(u)Th.$$
We note thus that, because the accuracy at the endpoints is of lower order in equations (11.15) and (11.16), the above argument only makes it possible for us to show first-order convergence.
However, by a slightly more refined analysis we shall now show that if at least one of $\alpha$ or $\beta$ is nonzero, then the error in our procedure is $O(h^2)$. For this purpose we write the equations (11.14), (11.15) and (11.16) in the form

$$\bar\partial_t z_j^{n+1} - \partial_x\bar\partial_x z_j^{n+1} = O(h^2), \quad j = 1, \dots, M-1,$$
$$\bar\partial_t z_0^{n+1} - \frac{2}{h}(\partial_x z_0^{n+1} - \alpha z_0^{n+1}) = O(h), \tag{11.17}$$
$$\bar\partial_t z_M^{n+1} + \frac{2}{h}(\bar\partial_x z_M^{n+1} + \beta z_M^{n+1}) = O(h).$$

We now introduce an auxiliary function $G_j = G(jh)$, $j = 0, \dots, M$, where $G(x)$ will be specified below, and set

$$w_j^n = z_j^n + h^2 G_j, \quad j = 0, \dots, M.$$

We then have

$$\bar\partial_t w_j^{n+1} - \partial_x\bar\partial_x w_j^{n+1} = O(h^2) - h^2\partial_x\bar\partial_x G_j, \quad j = 1, \dots, M-1,$$
$$\bar\partial_t w_0^{n+1} - \frac{2}{h}(\partial_x w_0^{n+1} - \alpha w_0^{n+1}) = O(h) - 2h(\partial_x G_0 - \alpha G_0),$$
$$\bar\partial_t w_M^{n+1} + \frac{2}{h}(\bar\partial_x w_M^{n+1} + \beta w_M^{n+1}) = O(h) + 2h(\bar\partial_x G_M + \beta G_M).$$

Now, if $G$ is chosen such that, for some appropriately large positive number $\gamma$,

$$\partial_x\bar\partial_x G_j \geq \gamma,$$
$$\partial_x G_0 - \alpha G_0 \geq \gamma, \tag{11.18}$$
$$\bar\partial_x G_M + \beta G_M \leq -\gamma,$$
then we may conclude that $w_j^n$ satisfies

$$L_{kh}w_j^n = \bar\partial_t w_j^{n+1} - \partial_x\bar\partial_x w_j^{n+1} \leq 0, \quad j = 1, \dots, M-1, \tag{11.19}$$
$$\bar\partial_t w_0^{n+1} - \frac{2}{h}(\partial_x w_0^{n+1} - \alpha w_0^{n+1}) \leq 0, \tag{11.20}$$
$$\bar\partial_t w_M^{n+1} + \frac{2}{h}(\bar\partial_x w_M^{n+1} + \beta w_M^{n+1}) \leq 0. \tag{11.21}$$

To accomplish this we choose

$$G(x) = \tfrac12\gamma x^2 + c_1 x + c_2,$$

which gives

$$\partial_x\bar\partial_x G_j = \gamma,$$
$$\partial_x G_0 - \alpha G_0 = \tfrac12\gamma h + c_1 - \alpha c_2,$$
$$\bar\partial_x G_M + \beta G_M = \gamma - \tfrac12\gamma h + c_1 + \beta(\tfrac12\gamma + c_1 + c_2).$$

Hence the inequalities (11.18) are satisfied, for instance, if $c_1$ and $c_2$ solve the system

$$c_1 - \alpha c_2 = \gamma,$$
$$(1 + \beta)c_1 + \beta c_2 = -2\gamma - \tfrac12\gamma\beta,$$

which has the determinant $\beta + \alpha(1 + \beta) \neq 0$ when $\alpha$ and $\beta$ are nonnegative, provided at least one of them is positive.
We now claim that as a result of (11.19), (11.20) and (11.21), either $\max_{n,j} w_j^n \leq 0$ or the maximum is attained at $n = 0$. For assume that, contrary to this claim, $\max_{n,j} w_j^n > 0$ and that the maximum for $n + 1 \leq N$, say, is not attained at $n = 0$. Then by the maximum principle of Theorem 11.1 it is attained at the left- or right-hand boundary. If it is attained at the left-hand boundary point $(0, n+1)$, then $\bar\partial_t w_0^{n+1} \geq 0$, $\partial_x w_0^{n+1} \leq 0$, and $w_0^{n+1} > 0$, so that (11.20) is not satisfied. Similarly, if it is attained at the right-hand boundary point $(M, n+1)$ we have $\bar\partial_t w_M^{n+1} \geq 0$, $\bar\partial_x w_M^{n+1} \geq 0$, $w_M^{n+1} > 0$, contradicting (11.21).
Altogether, this shows

$$\max_{n,j} z_j^n \leq \max_{n,j} w_j^n + C(u)h^2 \leq \max_j w_j^0 + C(u)h^2 \leq C(u)h^2,$$

as $z^0 = 0$. Since $-z$ also satisfies (11.17) we may conclude that

$$\max_{n,j}(-z_j^n) \leq C(u)h^2,$$
and hence we may finally state the following error estimate.

THEOREM 11.3. Let $U^n$ be the solution of (11.8), (11.9) and (11.10) and $u$ that of (11.1) with the boundary conditions replaced by (11.7), and assume that (11.3) holds. Then

$$\|U^n - u^n\|_{\infty,h} \leq C(u)h^2 \quad \text{for } n \geq 0.$$

The above argument is given by ISAACSON [1961], who also considered the explicit forward Euler scheme. The earlier stability argument is carried out also for the general $\theta$-method in SAMARSKII and GULIN [1973]. See also GORENFLO [1971a,b,c].

We shall now leave the case of one space dimension and consider, for the rest of this section, the initial boundary value problem

$$\frac{\partial u}{\partial t} = \Delta u + f \quad \text{in } \Omega\times[0,T],$$
$$u = g \quad \text{on } \partial\Omega \text{ for } t \in [0,T], \tag{11.22}$$
$$u(x,0) = v(x) \quad \text{in } \Omega,$$

where $\Omega$ is a bounded domain in $\mathbf{R}^2$. We shall mainly be interested in the case that $\Omega$ is not a union of rectangles, so that neither the possibility of reducing the considerations to a periodic problem nor application of the simplest energy method analyzed in Section 10 is open.
As earlier, we cover $\mathbf{R}^2$ by a square grid of mesh points $x = \beta h = (\beta_1 h, \beta_2 h)$, where $h$ is the mesh width, and we let $\Omega_h$ denote the set of mesh points of $\Omega$. We also introduce the set $\Omega_h^0$ of interior mesh points for which the four neighbors $x \pm he_l$, $l = 1, 2$, belong to $\Omega_h$.
Let us consider the possibility of using the backward Euler scheme for (11.22). It is then natural to apply the standard such equation for interior mesh points, or

$$L_{kh}U^{n+1}(x) \equiv \bar\partial_t U^{n+1}(x) - \Delta_h U^{n+1}(x) = f^{n+1}(x), \quad x \in \Omega_h^0, \tag{11.23}$$

where $\bar\partial_t$ is the backward difference quotient with stepsize $k$ and

$$\Delta_h V(x) = (\partial_{x_1}\bar\partial_{x_1} + \partial_{x_2}\bar\partial_{x_2})V(x)$$
$$= h^{-2}(V(x + he_1) + V(x - he_1) + V(x + he_2) + V(x - he_2) - 4V(x)). \tag{11.24}$$
However, for some mesh points of $\Omega_h\setminus\Omega_h^0 = \Gamma_h^0$ this definition would employ mesh points outside $\bar\Omega$, and hence the equation (11.23) cannot be used. For such points it is natural to modify the definition of $\Delta_h V(x)$ and base it on the boundary crossings of the mesh lines. Let these points of $\partial\Omega$ be $\Gamma_h$, and assume, e.g., that we have the situation illustrated in Fig. 11.1, where, with $x \in \Gamma_h^0$, the neighboring points $x + \beta_2^+he_2$ and $x - \beta_1^-he_1$ are in $\Gamma_h$, and $x + \beta_1^+he_1 = x + he_1$ and $x - \beta_2^-he_2 = x - he_2$ are in $\Omega_h$.
Generally, if $x \pm \beta_l^\pm he_l$, $l = 1, 2$ (with $0 < \beta_l^\pm \leq 1$) are the neighbors of $x$ in $\Omega_h\cup\Gamma_h$, we set

$$\Delta_h V(x) = \sum_{l=1}^{2}\frac{2}{(\beta_l^+ + \beta_l^-)h}\left(\frac{V(x + \beta_l^+he_l) - V(x)}{\beta_l^+h} - \frac{V(x) - V(x - \beta_l^-he_l)}{\beta_l^-h}\right).$$

Note that for $x \in \Omega_h^0$, and also if $x \in \Gamma_h^0$ and all its original neighbors are in $\bar\Omega$, this definition reduces to our old definition (11.24). Note also that by Taylor expansion we have

$$\Delta_h v(x) = \Delta v(x) + O(h^2) \quad \text{for } x \in \Omega_h^0, \text{ if } v \in C^4(\bar\Omega), \qquad (11.25)$$
142 V. Thomée CHAPTER III

FIG. 11.1.

and

$$\Delta_h v(x) = \Delta v(x) + O(h) \quad \text{for } x \in \Gamma_h^0, \text{ if } v \in C^3(\bar\Omega), \qquad (11.26)$$

so that, in particular, the accuracy can only be guaranteed to be of first order at the irregular points of $\Gamma_h^0$.
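The two accuracy claims (11.25) and (11.26) are easy to check numerically. The following minimal sketch (not from the text; the test function and the arm lengths $\beta^\pm$ are our own choices) applies the modified difference quotient to $v = e^x\sin y$, whose Laplacian vanishes, so the computed value is itself the error; halving $h$ divides the error by about 4 at a regular point but only by about 2 at an irregular one.

```python
import math

# Modified five-point operator with unequal arms beta^+ h, beta^- h in each
# coordinate direction (illustrative sketch; parameters are arbitrary).
def laplace_h(v, x, y, h, bx_p, bx_m, by_p, by_m):
    def d2(f, s, bp, bm):
        # one-dimensional modified second difference quotient
        return (2.0 / ((bp + bm) * h)) * (
            (f(s + bp * h) - f(s)) / (bp * h)
            - (f(s) - f(s - bm * h)) / (bm * h))
    return (d2(lambda s: v(s, y), x, bx_p, bx_m)
            + d2(lambda s: v(x, s), y, by_p, by_m))

v = lambda x, y: math.exp(x) * math.sin(y)   # harmonic: Delta v = 0

errs_reg, errs_irr = [], []
for h in (0.1, 0.05):
    errs_reg.append(abs(laplace_h(v, 0.3, 0.4, h, 1.0, 1.0, 1.0, 1.0)))
    errs_irr.append(abs(laplace_h(v, 0.3, 0.4, h, 0.6, 1.0, 1.0, 0.7)))

print(errs_reg[0] / errs_reg[1])   # about 4: second order
print(errs_irr[0] / errs_irr[1])   # about 2: first order
```

The first ratio reflects (11.25), the second (11.26).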
We may now pose the discrete problem

$$L_{kh} U^{n+1}(x) = \bar\partial_t U^{n+1}(x) - \Delta_h U^{n+1}(x) = f^{n+1}(x) \quad \text{for } x \in \Omega_h, \ n \ge 0,$$
$$U^{n+1}(x) = g^{n+1}(x) = g(x,(n+1)k) \quad \text{for } x \in \Gamma_h, \ n \ge 0, \qquad (11.27)$$
$$U^0(x) = v(x) \quad \text{for } x \in \Omega_h.$$
We shall show that this problem has a unique solution and that, in spite of the lower accuracy at the boundary, its order of convergence in the maximum norm is $O(k+h^2)$.
Let us first note that (11.27) may be written in matrix form as

$$B_{kh} U^{n+1} = U^n + k(f^{n+1} + \gamma_h g^{n+1}) \quad \text{for } n \ge 0, \qquad (11.28)$$

where now $U^n$ denotes the vector composed of the point values $\{U^n(x): x \in \Omega_h\}$ and similarly for $f^n$, where $g^n$ denotes the vector with elements $\{g^n(x): x \in \Gamma_h\}$, and where

$B_{kh}$ and $\gamma_h$ are the appropriate matrices. In particular, the matrix $B_{kh}$ has, in a row corresponding to an interior mesh point $x \in \Omega_h^0$, the diagonal element $1 + 4\lambda = 1 + 4k/h^2$, four off-diagonal elements $-\lambda$, and the remaining elements 0. For $x \in \Gamma_h^0$ the diagonal element is instead

$$1 + 2\lambda\sum_{l=1}^2 \frac{1}{\beta_l^+\beta_l^-},$$

with the corresponding at most three nonzero off-diagonal elements.


The matrix $B_{kh}$ is thus diagonally dominant, and hence, by a well-known property of such matrices, $E_{kh} = B_{kh}^{-1}$ exists and has all elements nonnegative. We may write (11.28) as

$$U^{n+1} = E_{kh}U^n + kE_{kh}(f^{n+1} + \gamma_h g^{n+1}) = E_{kh}U^n + kE_{kh}\big((L_{kh}U)^{n+1} + \gamma_h U^{n+1}_{\Gamma_h}\big),$$

or, by iteration,

$$U^n = E_{kh}^n U^0 + k\sum_{l=0}^{n-1} E_{kh}^{n-l}\big((L_{kh}U)^{l+1} + \gamma_h U^{l+1}_{\Gamma_h}\big). \qquad (11.29)$$
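The nonnegativity of $E_{kh} = B_{kh}^{-1}$ is the standard M-matrix property: strict diagonal dominance with positive diagonal and nonpositive off-diagonal entries forces an entrywise nonnegative inverse. A small self-contained illustration (a one-dimensional tridiagonal stand-in for $B_{kh}$, not code from the text):

```python
# Build a tridiagonal stand-in for B_kh: diagonal 1 + 4*lam, off-diagonal
# -lam, and invert it by Gauss-Jordan elimination; all entries of the
# inverse come out nonnegative (in fact positive).
lam = 0.7
n = 8
B = [[1 + 4 * lam if i == j else (-lam if abs(i - j) == 1 else 0.0)
     for j in range(n)] for i in range(n)]

A = [row[:] for row in B]
E = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for col in range(n):                      # Gauss-Jordan, no pivoting needed:
    piv = A[col][col]                     # diagonal dominance keeps piv != 0
    A[col] = [a / piv for a in A[col]]
    E[col] = [e / piv for e in E[col]]
    for r in range(n):
        if r != col and A[r][col] != 0.0:
            fac = A[r][col]
            A[r] = [a - fac * b for a, b in zip(A[r], A[col])]
            E[r] = [e - fac * b for e, b in zip(E[r], E[col])]

min_entry = min(min(row) for row in E)
print(min_entry > 0)   # True: the inverse is entrywise positive
```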

We are now in a position to prove the following error estimate, part of which also shows maximum-norm stability of our method. Here and in the rest of this section we write

$$\|V\|_S = \sup_{x\in S}|V(x)|.$$

THEOREM 11.4. Let $U^n$ and $u$ be the solutions of (11.27) and (11.22), respectively, and assume that $u \in C^{4,2}$. Then

$$\|U^n - u^n\|_{\Omega_h} \le C(u,T)(k+h^2) \quad \text{for } nk \le T.$$

PROOF. We shall use the representation (11.29) and recall that the matrices $E_{kh}$ and $\gamma_h$ have nonnegative elements. With $\le$ denoting elementwise inequality for matrices and vectors, and $1_S$ denoting vectors of the appropriate dimensions with all elements $=1$ on $S$, we find by setting $U^n(x) = 1$ for $x \in \Omega_h \cup \Gamma_h$, $n \ge 0$, into (11.29) that

$$E_{kh}^n 1_{\Omega_h} + k\sum_{l=0}^{n-1} E_{kh}^{n-l}\gamma_h 1_{\Gamma_h} = 1_{\Omega_h}. \qquad (11.30)$$

Further, setting, for $n \ge 0$,

$$U^n(x) = \begin{cases} 1 & \text{on } \Omega_h,\\ 0 & \text{on } \Gamma_h,\end{cases}$$

we have $(L_{kh}U)^n(x) = 0$ for $x \in \Omega_h^0$, whereas for $x \in \Gamma_h^0$ we have (in the case of the above figure, say)

$$(L_{kh}U)^n(x) = -\Delta_h U^n(x) = \frac{2}{\beta_1^-(1+\beta_1^-)h^2} + \frac{2}{\beta_2^+(1+\beta_2^+)h^2} \ge \frac{1}{h^2},$$

and hence, from (11.29), with

$$V(x) = \begin{cases} 1 & \text{on } \Gamma_h^0,\\ 0 & \text{on } \Omega_h\setminus\Gamma_h^0,\end{cases}$$

that

$$k\sum_{l=0}^{n-1} E_{kh}^{n-l}V \le h^2\, 1_{\Omega_h}. \qquad (11.31)$$

From (11.29), (11.30) and (11.31) and the positivity of the coefficients we now conclude the a priori estimate

$$\|U^n\|_{\Omega_h} \le \|U^0\|_{\Omega_h} + k\sum_{l=1}^n \|(L_{kh}U)^l\|_{\Omega_h^0} + h^2\max_{l\le n}\|(L_{kh}U)^l\|_{\Gamma_h^0} + \max_{l\le n}\|U^l\|_{\Gamma_h}, \qquad (11.32)$$

which may be interpreted as a maximum-norm stability result.


We shall now apply this inequality to the error $z^n = U^n - u^n$. Since $z^n = 0$ on $\Gamma_h$ and $z^0 = 0$ and, since by (11.25) and (11.26),

$$L_{kh}z^l(x) = \begin{cases} O(k+h^2) & \text{for } x \in \Omega_h^0,\\ O(k+h) & \text{for } x \in \Gamma_h^0,\end{cases}$$

it follows that

$$\|z^n\|_{\Omega_h} \le Ck\sum_{l=1}^n (k+h^2) + h^2C(k+h) \le C(k+h^2) \quad \text{for } nk \le T,$$

which proves the theorem.

We remark that it follows from the above analysis that even if the local error on $\Gamma_h^0$ had been only bounded, and not $O(k+h)$ as here, the global error would still have been $O(k+h^2)$, as a result of the factor $h^2$ occurring in the corresponding term of the a priori estimate (11.32).
We shall now see that even a very crude approximation of the boundary values may be used to obtain an $O(k+h)$ global approximation. If $k$ and $h$ are of the same order of magnitude, such an approximation is balanced in the two mesh parameters.
With a slight modification of our above notation, let $\Omega_h^0$ now be the mesh points whose four neighbors are in $\Omega$, and let $\Gamma_h^0$ be the remaining mesh points of $\Omega$. With each point $x \in \Gamma_h^0$ we associate a point $\bar x \in \Gamma = \partial\Omega$ of distance at most $h$ from $x$. We shall then consider the discrete problem

$$L_{kh}U^{n+1}(x) = \bar\partial_t U^{n+1}(x) - \Delta_h U^{n+1}(x) = f^{n+1}(x) \quad \text{for } x \in \Omega_h^0,\ n \ge 0,$$
$$U^{n+1}(x) = g^{n+1}(\bar x) \quad \text{for } x \in \Gamma_h^0,\ n \ge 0, \qquad (11.33)$$
$$U^0(x) = v(x) \quad \text{for } x \in \Omega_h.$$
For the vectors $U^n$ with components $\{U^n(x): x \in \Omega_h^0\}$ this is a linear system of the form

$$B_{kh}U^{n+1} = U^n + k(f^{n+1} + h^{-2}G^{n+1}) = U^n + k\big((L_{kh}U)^{n+1} + h^{-2}(l_hU)^{n+1}\big),$$

with $G$ and $l_hU$ to be defined presently. Let $\Gamma_h'$ be the mesh points in $\Omega_h^0$ with neighbors in $\Gamma_h^0$, and let, for $x \in \Gamma_h'$, $y_l(x)$, $l = 1,\dots,s(x)$, be these neighbors. Then

$$(l_hU)(x) = \begin{cases} \sum_{l=1}^{s(x)} U(y_l(x)) & \text{for } x \in \Gamma_h',\\ 0 & \text{for } x \in \Omega_h^0\setminus\Gamma_h',\end{cases}$$

and

$$G^{n+1}(x) = \begin{cases} \sum_{l=1}^{s(x)} g^{n+1}(\bar y_l(x)) & \text{for } x \in \Gamma_h',\\ 0 & \text{for } x \in \Omega_h^0\setminus\Gamma_h'.\end{cases}$$

As before, $B_{kh}$ is diagonally dominant, now with elements $1+4\lambda$, $-\lambda$ or 0, and thus again $E_{kh} = B_{kh}^{-1} \ge 0$. Further,

$$U^{n+1} = E_{kh}U^n + kE_{kh}\{(L_{kh}U)^{n+1} + h^{-2}(l_hU)^{n+1}\},$$

or, by iteration,

$$U^n = E_{kh}^nU^0 + k\sum_{l=0}^{n-1} E_{kh}^{n-l}\{(L_{kh}U)^{l+1} + h^{-2}(l_hU)^{l+1}\}, \qquad (11.34)$$

and we may demonstrate the following result.

THEOREM 11.5. Let $U^n$ and $u$ be the solutions of (11.33) and (11.22), respectively, and assume that $u \in C^{3,2}$. Then we have

$$\|U^n - u^n\|_{\Omega_h} \le C(u,T)(k+h) \quad \text{for } nk \le T.$$
PROOF. Setting $U^n \equiv 1$ on $\Omega_h^0 \cup \Gamma_h^0$ for $n \ge 0$ in (11.34) and

$$V(x) = \begin{cases} 1 & \text{on } \Gamma_h',\\ 0 & \text{on } \Omega_h^0\setminus\Gamma_h',\end{cases}$$

we have

$$E_{kh}^n 1_{\Omega_h^0} + kh^{-2}\sum_{l=0}^{n-1} E_{kh}^{n-l}V \le 1_{\Omega_h^0}.$$

We conclude from this and (11.34) that

$$\|U^n\|_{\Omega_h^0} \le \|U^0\|_{\Omega_h^0} + nk\max_{l\le n}\|(L_{kh}U)^l\|_{\Omega_h^0} + \max_{l\le n}\|U^l\|_{\Gamma_h^0}. \qquad (11.35)$$

We now apply this to $z^n = U^n - u^n$, observing that $z^0 = 0$, that

$$z^l(x) = u^l(\bar x) - u^l(x) = O(h) \quad \text{for } x \in \Gamma_h^0,$$

and that

$$(L_{kh}z)^l(x) = O(k+h) \quad \text{on } \Omega_h^0$$

(in fact, $O(k+h^2)$ if $u$ is smooth enough). This shows the desired result.
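As a concrete, and entirely illustrative, experiment, the following sketch implements the boundary-transfer scheme of Theorem 11.5 on the unit disk, with the manufactured exact solution $u = e^{-t}\sin x\sin y$, for which $f = u_t - \Delta u = u$. The grid, tolerances and the final error bound asserted below are our own choices, not the text's.

```python
import math

h = 0.1
k = h * h              # so that lambda = k/h^2 = 1
T = 0.1
lam = k / h ** 2

def exact(x, y, t):
    return math.exp(-t) * math.sin(x) * math.sin(y)

R = int(1.2 / h)
mesh = [(i, j) for i in range(-R, R + 1) for j in range(-R, R + 1)
        if (i * h) ** 2 + (j * h) ** 2 < 1.0]
mesh_set = set(mesh)
interior = [p for p in mesh
            if all(q in mesh_set for q in
                   [(p[0] + 1, p[1]), (p[0] - 1, p[1]),
                    (p[0], p[1] + 1), (p[0], p[1] - 1)])]
interior_set = set(interior)
irregular = [p for p in mesh if p not in interior_set]

def transfer(p, t):
    # boundary value g at the radial projection of p onto the unit circle
    x, y = p[0] * h, p[1] * h
    r = math.hypot(x, y)
    return exact(x / r, y / r, t)

U = {p: exact(p[0] * h, p[1] * h, 0.0) for p in mesh}
for n in range(1, int(round(T / k)) + 1):
    t = n * k
    Uold = dict(U)
    for p in irregular:
        U[p] = transfer(p, t)
    # Gauss-Seidel sweeps for the diagonally dominant backward Euler system
    for sweep in range(500):
        delta = 0.0
        for (i, j) in interior:
            nb = U[(i + 1, j)] + U[(i - 1, j)] + U[(i, j + 1)] + U[(i, j - 1)]
            f = exact(i * h, j * h, t)      # f = u for this exact solution
            new = (Uold[(i, j)] + k * f + lam * nb) / (1 + 4 * lam)
            delta = max(delta, abs(new - U[(i, j)]))
            U[(i, j)] = new
        if delta < 1e-12:
            break

err = max(abs(U[p] - exact(p[0] * h, p[1] * h, T)) for p in interior)
print("max error:", err)
```

The observed error is governed by the $O(h)$ transfer of the boundary values, in line with Theorem 11.5.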

We shall now show that interpolation of the boundary values at the irregular boundary points may be used to improve the accuracy to second order in $h$. To define this procedure, let $x \in \Gamma_h^0$. Then either $x \in \Gamma$ or $x$ is a mesh point in $\Omega$ with at least one neighbor outside $\bar\Omega$, but also with at least one neighbor $y$ in $\Omega_h^0$, so that (cf. Fig. 11.2) there is a boundary crossing $z$, at distance $\alpha h$ from $x$, with $x$ on the segment defined by $y$ and $z$.
We now define a system for determining $U^{n+1}(x)$ for $x \in \Omega_h$ by equation (11.33) for $x \in \Omega_h^0$, and, with the notation of Fig. 11.2, by linear interpolation

$$U^{n+1}(x) = \frac{\alpha}{1+\alpha}U^{n+1}(y) + \frac{1}{1+\alpha}U^{n+1}(z) \quad \text{for } x \in \Gamma_h^0, \qquad (11.36)$$

FIG. 11.2.

where, with $\bar\Gamma_h$ denoting the boundary crossings,

$$U^{n+1}(z) = g^{n+1}(z) \quad \text{for } z \in \bar\Gamma_h. \qquad (11.37)$$

Note that, since $0 < \alpha \le 1$ in (11.36),

$$\|U^{n+1}\|_{\Gamma_h^0} \le \tfrac12\|U^{n+1}\|_{\Omega_h^0} + \|U^{n+1}\|_{\bar\Gamma_h}.$$

Together with (11.35) this implies that

$$\|U^n\|_{\Omega_h^0\cup\Gamma_h^0} \le 2\|U^0\|_{\Omega_h^0} + 2nk\max_{l\le n}\|(L_{kh}U)^l\|_{\Omega_h^0} + 2\max_{l\le n}\|U^l\|_{\bar\Gamma_h}. \qquad (11.38)$$

In particular, this implies that the system of equations for $U^n(x)$, $x \in \Omega_h^0\cup\Gamma_h^0$, has a unique solution, as the corresponding homogeneous system has only the trivial solution.
We shall now show the following error estimate.

THEOREM 11.6. Let $U^n$ be the solution of (11.33) with the modifications (11.36) and (11.37), and let $u$ be the solution of (11.22). If $u \in C^{4,2}$ we have

$$\|U^n - u^n\|_{\Omega_h} \le C(u,T)(k+h^2) \quad \text{for } nk \le T.$$

PROOF. This follows at once by application of (11.38) to $z^n = U^n - u^n$, since $z^l = 0$ on $\bar\Gamma_h$, $z^0 = 0$, and $(L_{kh}z)^l = O(k+h^2)$ on $\Omega_h^0$.

In the above discussion, the explicit forward Euler method or the Crank-Nicolson method might also have been proposed. Consider first the case that the modified five-point operator is employed at the irregular boundary mesh points. The forward Euler method would then be written as

$$U^{n+1}(x) = A_{kh}U^n(x) + k\big(f^n(x) + \gamma_h g^n(x)\big) \quad \text{for } x \in \Omega_h,$$

where now, with the above notation, the matrix $A_{kh}$ has diagonal elements

$$1 - 2\lambda\sum_{l=1}^2\frac{1}{\beta_l^+\beta_l^-}$$

at the irregular mesh points. The standard analysis in the maximum norm would now require these to be nonnegative. However, in the case of a curved boundary the $\beta_l^\pm$ would not have positive lower bounds independently of $h$, so that no positive value of $\lambda = k/h^2$, however small, could be employed to guarantee the nonnegativity of the diagonal elements. The corresponding statement is valid for the Crank-Nicolson scheme.
Consider now the method of transferring the boundary values to the irregular boundary mesh points of $\Gamma_h^0$. In this case the diagonal elements of the matrix $A_{kh}$ are simply $1-4\lambda$, and the condition $\lambda \le \frac14$ is therefore sufficient for maximum-norm

stability. This also applies to the method of interpolated boundary values.
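The sharpness of the condition $\lambda \le \frac14$ for the two-dimensional forward Euler scheme is easy to observe numerically. In this small illustration (our own, on the unit square with homogeneous Dirichlet data and seeded random initial values) the scheme remains bounded in the maximum norm at $\lambda = 0.25$ but blows up at $\lambda = 0.3$:

```python
import random

def run(lam, steps=200, M=10):
    # forward Euler for the five-point scheme on an M x M square grid,
    # zero boundary values, random initial data in [-1, 1]
    random.seed(1)
    U = [[random.uniform(-1, 1) if 0 < i < M and 0 < j < M else 0.0
          for j in range(M + 1)] for i in range(M + 1)]
    for _ in range(steps):
        V = [[0.0] * (M + 1) for _ in range(M + 1)]
        for i in range(1, M):
            for j in range(1, M):
                V[i][j] = (U[i][j] + lam * (U[i + 1][j] + U[i - 1][j]
                           + U[i][j + 1] + U[i][j - 1] - 4 * U[i][j]))
        U = V
    return max(abs(x) for row in U for x in row)

print(run(0.25))   # stays below 1: all coefficients nonnegative and sum to 1
print(run(0.30))   # grows without bound
```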


The above discussion relating to approximation in the case of a curved boundary is based on similar approximations for elliptic equations; see, e.g., FORSYTHE and WASOW [1960] or BABUSKA, PRAGER and VITASEK [1966]. Such ideas have been applied also in the context of alternating direction schemes in SAMARSKII [1962b] and HUBBARD [1966], cf. also SAMARSKII and GULIN [1973].

12. Stability analysis by spectral methods

This section is devoted to the application of spectral methods to the stability analysis
in the context of the mixed initial boundary value problem, as outlined in the
introduction of this chapter.
We shall begin by considering the simple initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \quad \text{for } 0 \le x \le 1,\ t \ge 0,$$
$$u(0,t) = u(1,t) = 0 \quad \text{for } t \ge 0, \qquad (12.1)$$
$$u(x,0) = v(x) \quad \text{for } 0 \le x \le 1,$$
and, with the notation used earlier, e.g. in Section 2, the forward Euler method

$$U_j^{n+1} = \lambda(U_{j+1}^n + U_{j-1}^n) + (1-2\lambda)U_j^n, \quad j = 1,\dots,M-1,\ n \ge 0,$$
$$U_0^{n+1} = U_M^{n+1} = 0, \quad n \ge 0, \qquad (12.2)$$
$$U_j^0 = v_j = v(jh), \quad j = 0,\dots,M,$$

where thus, with $h = 1/M$ the mesh width in space and $k$ the time step, $U_j^n$ denotes the approximation to $u(jh,nk)$ and $\lambda = k/h^2$ is a fixed number with $\lambda \le \frac12$.
Introducing the vectors $U^n = (U_0^n,\dots,U_M^n)^T$, this may be written in operator form as

$$U^{n+1} = E_kU^n.$$
The operator $E_k$ here acts, for different $k$, in different normed spaces $\mathcal{X}_k$, such as, e.g., in

$$l_{\infty,h}^0 = \{V = (V_0,\dots,V_M)^T:\ V_0 = V_M = 0\},$$

where we set

$$\|V\|_k = \|V\|_{\infty,h} = \max_j |V_j|,$$

or in $l_{2,h}^0$, where

$$\|V\|_k = \|V\|_{2,h} = \Big(h\sum_{j=0}^M |V_j|^2\Big)^{1/2}.$$

Considered on the above $(M+1)$-dimensional spaces $\mathcal{X}_k$, the operator $E_k$ has the matrix representation

$$E_k = \begin{pmatrix} 0 & 0 & & & & 0\\ \lambda & 1-2\lambda & \lambda & & & \\ & \lambda & 1-2\lambda & \lambda & & \\ & & \ddots & \ddots & \ddots & \\ & & & \lambda & 1-2\lambda & \lambda\\ 0 & & & & 0 & 0\end{pmatrix}. \qquad (12.3)$$

In order to deal with the stability problem in situations such as this, GODUNOV and RYABENKII [1963, 1964] introduced a concept of spectrum of a family of operators $\{E_k\}$, where $E_k$ is defined on a normed space $\mathcal{X}_k$ with norm $\|\cdot\|_k$ and where, in our case, $k$ is a small positive parameter. According to this definition the spectrum $\sigma(\{E_k\})$ consists of the complex numbers $z$ such that for any $\varepsilon > 0$ and sufficiently small $k$ there is a $U_k \in \mathcal{X}_k$, $U_k \ne 0$, such that

$$\|E_kU_k - zU_k\|_k \le \varepsilon\|U_k\|_k. \qquad (12.4)$$

One may then show the following variant of von Neumann's criterion:

THEOREM 12.1. A necessary condition for the family $\{E_k\}$ to be stable in the sense that

$$\|E_k^nV\|_k \le C\|V\|_k \quad \text{for } nk \le T$$

is that $\sigma(\{E_k\})$ is contained in the closed unit disk.

PROOF. Assume that $z \in \sigma(\{E_k\})$, $|z| > 1$, and let $K$ be such that $\|E_k\|_k \le K$ for small $k$. Let $\omega$ be arbitrary, choose $n$ so large that

$$|z|^n \ge 2\omega,$$

and take $\varepsilon$ so small that

$$\varepsilon\sum_{j=0}^{n-1}|z|^{n-1-j}K^j \le \tfrac12|z|^n.$$

Let $U_k$ be a unit vector in $\mathcal{X}_k$ satisfying (12.4) and set

$$\varphi_k = E_kU_k - zU_k, \quad \|\varphi_k\|_k \le \varepsilon.$$

Then

$$E_k^nU_k = z^nU_k + \sum_{j=0}^{n-1} z^{n-1-j}E_k^j\varphi_k,$$

and hence

$$\|E_k^n\|_k \ge \|E_k^nU_k\|_k \ge |z|^n - \varepsilon\sum_{j=0}^{n-1}|z|^{n-1-j}K^j \ge \tfrac12|z|^n \ge \omega.$$

Since $\omega$ is arbitrary, this shows that $\{E_k\}$ cannot be stable, which proves the theorem.

Let us return to our above example to determine the spectrum of the associated family $\{E_k\}$ on $l_{\infty,h}^0$. Here, as we shall indicate, the spectrum consists of the eigenvalues of the operator $\tilde E_k$ associated with the pure initial value problem corresponding to (12.2), i.e. defined by

$$(\tilde E_kV)_j = \lambda(V_{j+1} + V_{j-1}) + (1-2\lambda)V_j, \quad j = 0,\pm1,\dots,$$

and considered with respect to the norm in $l_\infty$.
Before we show this, let us determine these eigenvalues. We note that the defining equation for the eigenvectors corresponding to an eigenvalue $z$, namely,

$$\lambda(V_{j+1} + V_{j-1}) + (1-2\lambda)V_j = zV_j, \quad j = 0,\pm1,\dots, \qquad (12.5)$$

is a second-order homogeneous difference equation and that the general solution of this equation is

$$V_j = \begin{cases} c_1\tau_1^j + c_2\tau_2^j & \text{if } \tau_1 \ne \tau_2,\\ (c_1 + c_2j)\tau_1^j & \text{if } \tau_1 = \tau_2,\end{cases} \quad j = 0,\pm1,\dots, \qquad (12.6)$$

where $\tau_{1,2} = \tau_{1,2}(z)$ are the roots of the equation

$$\lambda(\tau + \tau^{-1}) + 1 - 2\lambda - z = 0.$$

The condition for the existence of a bounded $V$ is that one root, $\tau = e^{i\xi}$ say, has modulus 1, and we find that $z$ then has to be of the form

$$z = 1 - 2\lambda + \lambda(\tau + \tau^{-1}) = 1 - 2\lambda + 2\lambda\cos\xi. \qquad (12.7)$$

Conversely, any $z$ of this form corresponds to a bounded solution of (12.5), namely, e.g., $V_j = e^{ij\xi}$.
When $\xi$ varies over the real numbers, these points $z$ cover the interval $[1-4\lambda, 1]$ on the real axis, and, if we already know, as we shall show presently, that this interval constitutes the spectrum $\sigma(\{E_k\})$, then Theorem 12.1 asserts that a necessary condition for stability is the well-known condition $\lambda \le \frac12$. We remark that the spectrum $\sigma(E_k)$ of the fixed operator $E_k$ consists of the eigenvalues of the matrix in (12.3), which are $z = 1-2\lambda+2\lambda\cos(\pi ph)$, $p = 1,\dots,M-1$, and also $z = 0$ (with multiplicity two), which latter shows the spectrum of $E_k$ to have a different behavior from $\sigma(\{E_k\})$.
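The stated eigenvalues of the matrix (12.3) are easily verified directly: $V_j = \sin(\pi pjh)$ vanishes at $j = 0$ and $j = M$ and satisfies the interior three-point relation with $z = 1-2\lambda+2\lambda\cos(\pi ph)$. A quick check, with arbitrary sample parameters of our own:

```python
import math

lam, M, p = 0.4, 16, 3
h = 1.0 / M
z = 1 - 2 * lam + 2 * lam * math.cos(math.pi * p * h)
V = [math.sin(math.pi * p * j * h) for j in range(M + 1)]
# apply the matrix (12.3): rows 0 and M give 0, interior rows the 3-point rule
EV = ([0.0]
      + [lam * (V[j + 1] + V[j - 1]) + (1 - 2 * lam) * V[j]
         for j in range(1, M)]
      + [0.0])
err = max(abs(EV[j] - z * V[j]) for j in range(M + 1))
print(err)   # near machine precision: V is an eigenvector with eigenvalue z
```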

Assume now that $z$ is of the form (12.7). We shall show that $z \in \sigma(\{E_k\})$. Consider the vector $V = (V_0,\dots,V_M)^T$ with $V_j = e^{ij\xi}$. Then

$$\lambda(V_{j+1} + V_{j-1}) + (1-2\lambda)V_j = zV_j, \quad j = 1,\dots,M-1,$$

but the boundary conditions in the definition of $l_{\infty,h}^0$ are not satisfied. Set therefore $W = (W_0,\dots,W_M)^T$, where $W_j = x(1-x)V_j$, $x = j/M$. We now obviously have $W_0 = W_M = 0$, and a simple computation shows that

$$|\lambda(W_{j+1} + W_{j-1}) + (1-2\lambda-z)W_j| \le Ch \le Ch\|W\|_{\infty,h}.$$

Thus, for small $h$, $W$ satisfies (12.4), so that $z \in \sigma(\{E_k\})$.
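The "simple computation" can also be observed numerically; in this illustrative check (the parameters $\lambda$ and $\xi$ are our own) the residual of $W_j = x(1-x)e^{ij\xi}$ is indeed $O(h)$:

```python
import math
import cmath

lam, xi = 0.5, 1.0
z = 1 - 2 * lam + 2 * lam * math.cos(xi)   # of the form (12.7)
res_list = []
for M in (50, 100, 200):
    h = 1.0 / M
    W = [(j * h) * (1 - j * h) * cmath.exp(1j * j * xi) for j in range(M + 1)]
    res = max(abs(lam * (W[j + 1] + W[j - 1]) + (1 - 2 * lam - z) * W[j])
              for j in range(1, M))
    res_list.append(res)
    print(M, res / h)   # roughly constant: the residual is O(h)
```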
We also need to show that if $z$ is not of the form (12.7) then it does not belong to $\sigma(\{E_k\})$. For this purpose one proves that, for some positive $C$ and small $h$,

$$\|U\|_{\infty,h} \le C\|E_kU - zU\|_{\infty,h} \quad \text{for } U \in l_{\infty,h}^0, \qquad (12.8)$$

showing that (12.4) cannot hold for any $U_k \in l_{\infty,h}^0$, $U_k \ne 0$. To see this, we set

$$E_kU - zU = f, \qquad (12.9)$$

and show that this equation has a unique solution, satisfying

$$\|U\|_{\infty,h} \le C\|f\|_{\infty,h}. \qquad (12.10)$$

The difference equation (12.9) may be written

$$\lambda(U_{j+1} + U_{j-1}) + (1-2\lambda-z)U_j = f_j, \quad j = 1,\dots,M-1,\quad U_0 = U_M = 0.$$

We first extend $f_j$, $j = 1,\dots,M-1$, to $\tilde f_j$, $j = 0,\pm1,\dots$, without increasing its norm in $l_{\infty,h}$ (e.g. by setting the missing components of $\tilde f$ equal to zero) and then solve

$$\lambda(\tilde U_{j+1} + \tilde U_{j-1}) + (1-2\lambda-z)\tilde U_j = \tilde f_j \quad \text{for } j = 0,\pm1,\dots. \qquad (12.11)$$
Since $z$ is not of the form (12.7) we have

$$a(\xi) = 1 - 2\lambda - z + 2\lambda\cos\xi \ne 0, \quad \xi \in \mathbb{R},$$

and the Fourier series for $a(\xi)^{-1}$,

$$a(\xi)^{-1} = \sum_{j=-\infty}^\infty b_je^{ij\xi},$$

is absolutely convergent. Hence the solution of (12.11) is

$$\tilde U_j = \sum_{l=-\infty}^\infty b_{j-l}\tilde f_l,$$

and we conclude

$$\|\tilde U\|_{\infty,h} \le \Big(\sum_j |b_j|\Big)\|\tilde f\|_{\infty,h} = C\|f\|_{\infty,h}. \qquad (12.12)$$

We now introduce $W_j = U_j - \tilde U_j$, $j = 0,\dots,M$, and find that

$$\lambda(W_{j+1} + W_{j-1}) + (1-2\lambda-z)W_j = 0, \quad j = 1,\dots,M-1,$$
$$W_0 = -\tilde U_0, \quad W_M = -\tilde U_M.$$

With $\tau_1 = \tau_1(z)$ and $\tau_2 = \tau_2(z)$ as above, we find from (12.6) and the boundary conditions that

$$W_j = \frac{W_0\tau_2^M - W_M}{\tau_2^M - \tau_1^M}\,\tau_1^j + \frac{W_M - W_0\tau_1^M}{\tau_2^M - \tau_1^M}\,\tau_2^j \quad \text{for } j = 0,\dots,M. \qquad (12.13)$$

Now since $\tau_1\tau_2 = 1$, and since neither of the $\tau_l$ is on the unit circle if $z$ is not of the form (12.7), one of the roots, say $\tau_1$, has to be inside the unit circle and the other outside. We may therefore conclude from (12.13) and (12.12) that

$$\|W\|_{\infty,h} \le C(|\tilde U_0| + |\tilde U_M|) \le C\|f\|_{\infty,h}.$$

Together with (12.12) this completes the proof of (12.10) and thus of (12.8).
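The resolvent bound (12.10) can be observed numerically by solving the tridiagonal system directly. In this illustrative sketch (the sample $z$, $\lambda$ and data are our own choices) the Thomas algorithm is used, and the ratio $\|U\|_{\infty,h}/\|f\|_{\infty,h}$ stays bounded as $M$ grows:

```python
import cmath

lam = 0.5
z = 0.5 + 0.5j                    # not of the form (12.7)
d = 1 - 2 * lam - z
ratios, residuals = [], []
for M in (20, 80, 320):
    f = [cmath.exp(0.1j * j * j) for j in range(M + 1)]  # bounded, |f_j| = 1
    # Thomas algorithm for lam*U[j+1] + d*U[j] + lam*U[j-1] = f[j],
    # with U[0] = U[M] = 0
    cp = [0j] * M
    gp = [0j] * M
    cp[1] = lam / d
    gp[1] = f[1] / d
    for j in range(2, M):
        m = d - lam * cp[j - 1]
        cp[j] = lam / m
        gp[j] = (f[j] - lam * gp[j - 1]) / m
    U = [0j] * (M + 1)
    for j in range(M - 1, 0, -1):
        U[j] = gp[j] - cp[j] * U[j + 1]
    residuals.append(max(abs(lam * (U[j + 1] + U[j - 1]) + d * U[j] - f[j])
                         for j in range(1, M)))
    ratios.append(max(abs(u) for u in U))  # = ||U|| / ||f|| since |f_j| = 1
print(ratios)   # bounded independently of M, as (12.10) predicts
```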

In the situation just described the spectrum $\sigma(\{E_k\})$ associated with the family of discrete boundary value problems happens to coincide with the spectrum associated with the corresponding pure initial value problem, and therefore the condition of Theorem 12.1 gives no new restriction for stability. In general, however, the stability may be affected by the choice of the boundary value approximation. Assume, for instance, that instead of the boundary condition $U_0 = 0$ at the left end of the interval we choose the somewhat artificial condition

$$U_0 - \gamma U_1 = 0, \quad \gamma \ne 0,$$

which is also satisfied by the exact solution of (12.1). Then a computation similar to the above gives, for $W_j = U_j - \tilde U_j$,

$$W_j = \frac{(\tilde U_0 - \gamma\tilde U_1)\tau_2^M - (1-\gamma\tau_2)\tilde U_M}{\tau_1^M(1-\gamma\tau_2) - \tau_2^M(1-\gamma\tau_1)}\,\tau_1^j + \frac{(1-\gamma\tau_1)\tilde U_M - (\tilde U_0 - \gamma\tilde U_1)\tau_1^M}{\tau_1^M(1-\gamma\tau_2) - \tau_2^M(1-\gamma\tau_1)}\,\tau_2^j, \quad j = 0,\dots,M.$$

When $1-\gamma\tau_1 \ne 0$ we conclude as before that $W$ is bounded. However, for $\gamma$ such that $1-\gamma\tau_1 = 0$ this conclusion does not hold and, in fact, $z$ belongs to $\sigma(\{E_k\})$.
In general, it was shown in the work of Godunov and Ryabenkii that the spectrum of a family such as the above is the union of three sets, one corresponding to the pure initial value problem and one to each of the one-sided boundary value problems for the differential equation in $\{x > 0, t > 0\}$ and in $\{x < 1, t > 0\}$, with boundary conditions given at $x = 0$ and $x = 1$, respectively.
We shall now describe some work of VARAH [1970] and OSHER [1972] concerning the effect on the stability of the choice of discrete boundary conditions such as those indicated above, and, in particular, the formulation of sufficient conditions for stability in the maximum norm. The methods used in this work are based on techniques developed by Kreiss for hyperbolic equations (cf. KREISS [1968], GUSTAFSSON, KREISS and SUNDSTROM [1972]).
We shall follow Varah's discussion in a simple case and then present the results in more generality. He considers the problem with derivative boundary conditions

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \quad \text{for } 0 \le x \le 1,\ 0 \le t \le T,$$
$$\Big(\alpha\frac{\partial u}{\partial x} + \beta u\Big)(0,t) = \Big(\bar\alpha\frac{\partial u}{\partial x} + \bar\beta u\Big)(1,t) = 0 \quad \text{for } 0 \le t \le T, \qquad (12.14)$$
$$u(x,0) = v(x) \quad \text{for } 0 \le x \le 1,$$
and the associated explicit finite difference equation

$$U_j^{n+1} = \sum_{l=-q}^{r} a_lU_{j-l}^n = (E_kU^n)_j, \qquad (12.15)$$

where $r$ and $q$ are given positive integers and where $\lambda = k/h^2 = \text{constant}$. If this equation is used for $j = 1,\dots,M-1$, we need to supply also the values of $U_j^{n+1}$ for $j = -r+1,\dots,0$ and $j = M,\dots,M+q-1$, and we shall take these to be of the form

$$U_j^{n+1} = \sum_{l=1}^{s} b_{jl}(h)U_l^{n+1} \quad \text{for } j = -r+1,\dots,0, \qquad (12.16)$$

and similarly for $j = M,\dots,M+q-1$.


In discussing the maximum-norm stability of such schemes it is natural to assume
that the operator defined by (12.15) is a parabolic finite difference operator when
applied to the pure initial value problem. The question which arises is then the effect
of the boundary approximations (12.16) on the stability.
For an introductory discussion we restrict ourselves to the situation with one lateral boundary and with a Neumann boundary condition, so that the problem under consideration is the case $\beta = 0$ of the quarter-plane problem

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \quad \text{for } x \ge 0,\ 0 \le t \le T,$$
$$\Big(\alpha\frac{\partial u}{\partial x} + \beta u\Big)(0,t) = 0 \quad \text{for } 0 \le t \le T, \qquad (12.17)$$
$$u(x,0) = v(x) \quad \text{for } x \ge 0.$$

We also return to the simple explicit finite difference analogue of the heat equation defined by

$$(E_kv)_j = \lambda(v_{j+1} + v_{j-1}) + (1-2\lambda)v_j, \qquad (12.18)$$

where we assume that $\lambda = k/h^2 < \frac12$, so that the scheme is parabolic in the sense of John, and hence maximum-norm stable, when considered for the unrestricted initial value problem. Using (12.18) for $j \ge 1$, we complete the definition of $E_k$ by prescribing the relation

$$(E_kv)_0 = \sum_{l=1}^{s} b_l(E_kv)_l, \qquad (12.19)$$

where the $b_l$ are constants. We assume that the boundary approximation (12.19) is consistent with the boundary condition in (12.14) (for $\beta = 0$), or that, for smooth $u$,

$$u(0) - \sum_{l=1}^s b_lu(lh) = -\gamma hu'(0) + o(h) \quad \text{as } h \to 0, \text{ with } \gamma \ne 0.$$

This is equivalent to the relations

$$\sum_{l=1}^s b_l = 1, \qquad \sum_{l=1}^s lb_l = \gamma \ne 0. \qquad (12.20)$$

We shall now discuss the stability of the operator $E_k$ thus defined on

$$l_{\infty,h}^+ = \Big\{V = (V_j)_{j\ge0}:\ \|V\|_{\infty,h} = \sup_j |V_j| < \infty\Big\}.$$

It is clear that a necessary condition for stability is that this operator has all its eigenvalues in the closed unit disk, for if $z$ were an eigenvalue with $|z| > 1$, then, with $U^0 = v \in l_{\infty,h}^+$ the corresponding eigenfunction, we would have

$$\|U^n\|_{\infty,h} = \|E_k^nU^0\|_{\infty,h} = |z|^n\|v\|_{\infty,h} \to \infty \quad \text{as } n \to \infty,$$

contrary to maximum-norm stability. We shall see that a sufficient condition for stability is the slightly stronger condition that all eigenvalues $z$ of $E_k$, except $z = 1$, lie in the open unit disk $\{z: |z| < 1\}$. That $z = 1$ is an eigenvalue follows at once from the consistency condition (12.20) by taking $v_j \equiv 1$ for $j \ge 0$. To demand that the eigenvalues $z \ne 1$ of $E_k$ are in the open unit disk is a stronger condition than asking that they are in the closed unit disk, in the same way as the definition of a parabolic operator is stronger than that of one which satisfies only the von Neumann condition for stability in $L_2$.
We shall now reformulate our above property in terms of the coefficients of the discrete boundary condition (12.19). For this purpose we assume that $z$ is an eigenvalue of $E_k$ and that $v = (v_j)$ is the corresponding eigenfunction. Then

$$\lambda(v_{j+1} + v_{j-1}) + (1-2\lambda)v_j = zv_j, \quad j = 1,2,\dots,$$
$$v_0 = \sum_{l=1}^s b_lv_l.$$

The first of these relations is a second-order difference equation, and we know that if $\tau_l = \tau_l(z)$, $l = 1,2$, are simple roots of the corresponding characteristic equation,

$$\lambda(\tau + \tau^{-1}) + 1 - 2\lambda = z, \qquad (12.21)$$

then the general solution of the difference equation is

$$v_j = c_1\tau_1^j + c_2\tau_2^j \quad \text{for } j \ge 0. \qquad (12.22)$$
Let now $|z| \ge 1$ and $z \ne 1$, and let us note that then none of the roots $\tau_1$ and $\tau_2$ of (12.21) is on the unit circle. For, if $\tau = e^{i\xi}$ is such a root, then

$$z = \lambda(e^{i\xi} + e^{-i\xi}) + 1 - 2\lambda = 1 - 2\lambda + 2\lambda\cos\xi,$$

which is real and has modulus less than 1 except if $\xi = 0$, which corresponds to $z = 1$. Since $\tau_1\tau_2 = 1$ it follows that exactly one root, say $\tau_1$, is inside the unit circle and the other is outside. Because $(v_j)_{j\ge0}$ in (12.22) has to be in $l_{\infty,h}^+$ we conclude that the second term must vanish, so that

$$v_j = c(z)\tau_1(z)^j \quad \text{for } j = 0,1,\dots.$$
In order for this $v$ to satisfy the boundary condition we must have

$$c(z)\Big(1 - \sum_{l=1}^s b_l\tau_1(z)^l\Big) = 0,$$

and hence a necessary and sufficient condition for $z$ to be an eigenvalue is that

$$b(z) = 1 - \sum_{l=1}^s b_l\tau_1(z)^l = 0.$$

The hypothesis that the eigenvalues $z \ne 1$ must satisfy $|z| < 1$ may therefore be expressed by saying that

$$b(z) = 1 - \sum_{l=1}^s b_l\tau_1(z)^l \ne 0 \quad \text{for } |z| \ge 1,\ z \ne 1. \qquad (12.23)$$

This is Varah's sufficient condition for the scheme to be maximum-norm stable.


We continue to sketch Varah's analysis in the special case before stating his result in a more general situation. The proof of the maximum-norm stability uses the Dunford spectral representation,

$$E_k^nv = \frac{1}{2\pi i}\int_\Gamma z^n(z - E_k)^{-1}v\,dz,$$

where $\Gamma$ is a contour enclosing the spectrum of $E_k$.


The first step in the proof is to show that if $|z| \ge 1$, $z \ne 1$ and (12.23) is satisfied, then $z$ is not in the spectrum of $E_k$ (earlier we only knew that such a $z$ is not an eigenvalue). For this purpose, let $z$ be a fixed such complex number. We shall show that the equation

$$(E_k - z)U = f \qquad (12.24)$$

has a unique bounded solution $U$ for $f$ bounded, and that, for this $U$,

$$\|U\|_{\infty,h} \le C\|f\|_{\infty,h}. \qquad (12.25)$$

Equation (12.24) is equivalent to

$$\lambda(U_{j+1} + U_{j-1}) + (1-2\lambda-z)U_j = f_j, \quad j \ge 1,$$
$$U_0 - \sum_{l=1}^s b_lU_l = 0.$$

In order to accomplish this, we follow the lines of our discussion in the beginning of this section of the two-point boundary value problem and first extend $f$ without increasing its norm to $\tilde f_j$, $j = 0,\pm1,\dots$ (e.g. by setting $\tilde f_j = 0$ for $j \le 0$). We then determine $\tilde U$ by (12.11) and conclude that the inequality (12.12) holds. We now set $W_j = U_j - \tilde U_j$, $j \ge 0$, and find for $W$ the equations

$$\lambda(W_{j+1} + W_{j-1}) + (1-2\lambda-z)W_j = 0 \quad \text{for } j \ge 1,$$
$$W_0 - \sum_{l=1}^s b_lW_l = -\Big(\tilde U_0 - \sum_{l=1}^s b_l\tilde U_l\Big).$$

Since $W_j$ has to be bounded for $j \ge 1$ it follows as before that

$$W_j = c\tau_1^j, \quad \text{with } c = c(z) = -\frac{\tilde U_0 - \sum_{l=1}^s b_l\tilde U_l}{b(z)},$$

whence

$$\|W\|_{\infty,h} \le |c(z)| \le \frac{C}{|b(z)|}\|\tilde U\|_{\infty,h} \le C(z)\|f\|_{\infty,h}. \qquad (12.26)$$

Together (12.12), (12.23) and (12.26) show the desired estimate (12.25).
In the next step of the proof of the maximum-norm stability we write $U^n = E_k^nU^0$ in the form

$$U_j^n = \sum_{l=1}^\infty a_{jl}^nU_l^0 \quad \text{for } j \ge 0,\ n \ge 0.$$

One needs to show, with $C$ independent of $n$ and $j$, that, for the operator norm of $E_k^n$,

$$\|E_k^n\|_{\infty,h} = \sup_{j\ge0}\sum_{l=1}^\infty |a_{jl}^n| \le C \quad \text{for } n \ge 0.$$

Introducing the mesh function $\delta_l$ with elements $\delta_{ls} = 0$ for $s \ne l$, $\delta_{ll} = 1$, we have, by the Dunford representation,

$$a_{jl}^n = \frac{1}{2\pi i}\int_\Gamma z^n\big((z - E_k)^{-1}\delta_l\big)_j\,dz,$$

where $\Gamma$ is a contour enclosing the spectrum of $E_k$. We set

$$\big((z - E_k)^{-1}\delta_l\big)_j = \beta_{jl}(z) + d_{jl}(z),$$

where $\beta_{jl}$ is the solution corresponding to the unrestricted problem

$$\lambda(\beta_{j+1,l} + \beta_{j-1,l}) + (1-2\lambda-z)\beta_{jl} = \delta_{jl}, \quad j = 0,\pm1,\dots.$$

One finds easily that the unique bounded solution is

$$\beta_{jl} = c\,\tau_1^{|j-l|}, \quad \text{with } c = \frac{1}{\lambda(\tau_1(z) - \tau_2(z))}.$$

It follows that

$$\lambda(d_{j+1,l} + d_{j-1,l}) + (1-2\lambda-z)d_{jl} = 0 \quad \text{for } j \ge 1,$$
$$d_{0l} - \sum_{p=1}^s b_pd_{pl} = -\beta_{0l} + \sum_{p=1}^s b_p\beta_{pl},$$

whence

$$d_{jl} = c_l\tau_1^j,$$

with

$$c_l = c_l(z) = \frac{-\beta_{0l} + \sum_{p=1}^s b_p\beta_{pl}}{b(z)}.$$
In particular, for $l > s$, say, we have $\beta_{pl} = c\,\tau_1^{l-p}$ for $0 \le p \le s$, so that

$$c_l = -c\,\frac{1 - \sum_{p=1}^s b_p\tau_2(z)^p}{b(z)}\,\tau_1^l = -c\,\frac{\tilde b(z)}{b(z)}\,\tau_1^l$$

(where the latter equality defines $\tilde b(z)$), so that then

$$a_{jl}^n = \frac{1}{2\pi i}\int_\Gamma \frac{z^n}{\lambda(\tau_1(z) - \tau_2(z))}\Big\{\tau_1(z)^{|j-l|} - \frac{\tilde b(z)}{b(z)}\tau_1(z)^{l+j}\Big\}\,dz.$$

In order to complete the proof, delicate estimates are needed for this integral, using an appropriate choice of the contour $\Gamma$, which may a priori be selected as any closed contour around the origin in $\{|z| \ge 1, z \ne 1\}$. In these estimates one also needs the fact that, in a neighborhood of $z = 1$,

$$|b(z)^{-1}| \le C|z - 1|^{-1/2}. \qquad (12.27)$$
To see that such an inequality holds in the present case, we note that (12.21) has a double root $\tau = 1$ at $z = 1$. In fact, we have

$$\tau_{1,2} = \frac{z - 1 + 2\lambda \pm \sqrt{(z - 1 + 2\lambda)^2 - 4\lambda^2}}{2\lambda},$$

and one sees that, for $z$ near 1,

$$\tau_1(z) = 1 - \sqrt{\frac{z-1}{\lambda}} + O(z-1) \quad \text{as } z \to 1,$$

with the square root appropriately determined. Inserting this in the definition of $b(z)$ and using the consistency conditions, this yields

$$b(z) = \gamma\sqrt{\frac{z-1}{\lambda}} + O(z-1) \quad \text{as } z \to 1,$$

which implies (12.27).


Let us remark that, for any given $z$ with $|z| > 1$, it is possible to find a consistent boundary condition such that $z$ is an eigenvalue of $E_k$, thus showing that the result we are discussing is nonvacuous. In fact, to see this, we only have to choose $b_j$, $j = 1,\dots,s$, such that

$$\sum_{l=1}^s b_l\tau_1(z)^l = 1, \qquad \sum_{l=1}^s b_l = 1, \qquad \sum_{l=1}^s lb_l \ne 0.$$

These equations are satisfied, e.g., with $s = 2$, $b_1 = (1 + \tau_1(z))/\tau_1(z)$ and $b_2 = -1/\tau_1(z)$.
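The construction can be checked directly: with these $b_1, b_2$ the sequence $v_j = \tau_1(z)^j$ is a bounded eigenvector for the chosen $z$, and the coefficients sum to 1. A short verification (the sample values of $z$ and $\lambda$ are our own):

```python
import math

lam = 0.4
z = 1.5                      # any fixed real z with |z| > 1
# roots of lam*t^2 + (1 - 2*lam - z)*t + lam = 0; tau is the one inside
# the unit circle (the product of the roots is 1)
b = 1 - 2 * lam - z
disc = math.sqrt(b * b - 4 * lam * lam)
t1, t2 = (-b - disc) / (2 * lam), (-b + disc) / (2 * lam)
tau = t1 if abs(t1) < abs(t2) else t2

b1 = (1 + tau) / tau
b2 = -1 / tau
v = [tau ** j for j in range(12)]

bc_err = abs(v[0] - (b1 * v[1] + b2 * v[2]))
int_err = max(abs(lam * (v[j + 1] + v[j - 1]) + (1 - 2 * lam) * v[j] - z * v[j])
              for j in range(1, 11))
print(bc_err, int_err, b1 + b2)   # boundary and interior relations hold
```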
On the other hand, for more reasonable choices of discrete boundary conditions, this does not happen. Thus, for instance, for $s = 2$ let us select $b_1$ and $b_2$ so that the order of accuracy of the boundary approximation is two, or such that

$$u(0) - \sum_{l=1}^2 b_lu(lh) = -\gamma hu'(0) + O(h^3) \quad \text{as } h \to 0.$$

This equation is equivalent to the system

$$b_1 + b_2 = 1, \qquad b_1 + 2b_2 = \gamma \ne 0, \qquad b_1 + 4b_2 = 0,$$

which has the unique solution $b_1 = \frac43$, $b_2 = -\frac13$, $\gamma = \frac23$. Hence, in this case,

$$b(z) = 1 - \tfrac43\tau_1(z) + \tfrac13\tau_1(z)^2 = \tfrac13(\tau_1(z) - 1)(\tau_1(z) - 3) \ne 0, \quad \text{since } |\tau_1(z)| < 1,$$

so that (12.23) is satisfied. Similarly, taking $s = 3$ and the order of accuracy three, we have

$$b_1 = \tfrac{18}{11}, \qquad b_2 = -\tfrac{9}{11}, \qquad b_3 = \tfrac{2}{11}, \qquad \gamma = \tfrac{6}{11},$$

and

$$b(z) = 1 - \tfrac{18}{11}\tau_1 + \tfrac{9}{11}\tau_1^2 - \tfrac{2}{11}\tau_1^3 = -\tfrac{1}{11}(\tau_1 - 1)(2\tau_1^2 - 7\tau_1 + 11) \ne 0 \quad \text{for } |\tau_1| < 1,$$

since the roots of $2\tau^2 - 7\tau + 11$ have modulus $\sqrt{11/2} > 1$.
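Both nonvanishing claims are easy to spot-check numerically. The following illustrative sketch (the sampling density and $\lambda$ are our own choices) traces $\tau_1(z)$ around the unit circle, away from $z = 1$, and evaluates the two polynomials $b(z)$:

```python
import cmath

lam = 0.4
def tau1(z):
    # root of lam*t^2 + (1 - 2*lam - z)*t + lam = 0 with the smaller modulus
    b = 1 - 2 * lam - z
    disc = cmath.sqrt(b * b - 4 * lam * lam)
    r1 = (-b + disc) / (2 * lam)
    r2 = (-b - disc) / (2 * lam)
    return r1 if abs(r1) < abs(r2) else r2

b2_vals, b3_vals = [], []
for m in range(1, 200):                      # unit circle, z != 1
    z = cmath.exp(2j * cmath.pi * m / 200)
    t = tau1(z)
    assert abs(t) < 1.0                      # tau1 strictly inside the circle
    b2_vals.append(abs(1 - (4 / 3) * t + (1 / 3) * t * t))
    b3_vals.append(abs(1 - (18 / 11) * t + (9 / 11) * t * t
                       - (2 / 11) * t ** 3))
print(min(b2_vals), min(b3_vals))   # both bounded away from 0
```

The minima occur near $z = 1$, where $b(z) \to 0$ at the rate $|z-1|^{1/2}$ described by (12.27).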

It was shown by Varah that also high-order approximations of this type satisfy our condition (12.23) for $\lambda$ small enough.
We now turn to a description of Varah's more general result for the one-sided problem (12.17) and consider the finite difference operator (12.15) with the boundary approximations (12.16) at $x = 0$, where the coefficients $b_{jl}(h)$ are smooth functions of $h$. The operator $E_k = E_k(h)$ defined by $U^{n+1} = E_kU^n$ now depends on $h$. One can show that if $z$ belongs to the spectrum of $E_k(h)$ and $|z| \ge 1$, $z \ne 1$, then $z$ is actually an eigenvalue of $E_k(h)$, and for a certain matrix $B(z,h)$, which plays the role of our above function $b(z)$, we have $\det(B(z,h)) = 0$. In fact, if $z$ is such an eigenvalue, we have

$$\sum_{l=-q}^{r} a_lv_{j-l} = zv_j, \quad j = 1,2,\dots, \qquad (12.28)$$

and its bounded solutions are of the form

$$v_j = \sum_{|\tau_l(z)| < 1} P_l(j)\tau_l(z)^j, \qquad (12.29)$$

where the $\tau_l$ are the roots of the characteristic equation of (12.28),

$$\sum_{l=-q}^{r} a_l\tau^{-l} - z = 0,$$

and where $P_l$ is a polynomial of degree less than the multiplicity of $\tau_l(z)$. One may show that there are $r$ independent parameters $\delta = (\delta_1,\dots,\delta_r)^T$ in (12.29) and that these may be determined from the matrix equation

$$B(z,h)\delta = 0.$$
Hence, similarly to the situation in the particular case treated above, $z$ is an eigenvalue if and only if $B(z,h)$ is singular. We may formulate the following theorem from VARAH [1970].

THEOREM 12.2. The operator $E_k(h)$ defined by (12.15) and (12.16) is maximum-norm stable if
(i) the operator (12.15) is consistent with the heat equation and parabolic in the sense of John, and the boundary approximations (12.16) are consistent with the continuous boundary condition in (12.17),
(ii) $E_k(0)$ has no eigenvalues $z$ with $|z| \ge 1$, $z \ne 1$,
(iii) $B(z,0)$ is nonsingular and $|B(z,0)^{-1}| \le C|z-1|^{-1/2}$ for $|z| \ge 1$, $z$ close to 1.

We recall that (iii) was automatically satisfied in the simple case treated above.
Let us now return briefly to the problem (12.14) on the finite interval $0 \le x \le 1$, using the finite difference equation (12.15) for $j = 1,\dots,M-1$, where $Mh = 1$, and using the equations (12.16) for the boundary approximations at $x = 0$ and similar equations at $x = 1$. Associated with this problem are two one-sided problems for the infinite intervals $0 \le x < \infty$ and $-\infty < x \le 1$, and for each of these our above analysis may be used. VARAH [1970] shows the following:

THEOREM 12.3. Assume that the finite difference scheme just defined for the initial boundary value problem (12.14) is such that the conditions of Theorem 12.2 are satisfied for the problems corresponding to each of the two boundaries at $x = 0$ and $x = 1$. Then the scheme is stable in the maximum norm.

The results of Varah have been extended by OSHER [1972], in a number of ways which we shall briefly describe below, to more general parabolic equations and more general finite difference schemes. Thus let us now consider the differential equation

$$\frac{\partial u}{\partial t} = a(x,t)\frac{\partial^2 u}{\partial x^2} - b(x,t)\frac{\partial u}{\partial x} - c(x,t)u + f(x,t) \quad \text{for } 0 \le x \le 1,\ 0 < t \le T, \qquad (12.30)$$
with boundary conditions

$$\alpha u(0,t) + \bar\alpha\frac{\partial u}{\partial x}(0,t) = g_0(t) \quad \text{for } 0 < t \le T, \qquad (12.31)$$
$$\beta u(1,t) + \bar\beta\frac{\partial u}{\partial x}(1,t) = g_1(t) \quad \text{for } 0 < t \le T,$$

where $\alpha^2 + \bar\alpha^2 = \beta^2 + \bar\beta^2 = 1$, so that boundary conditions of Dirichlet type are now also allowed, and finally with the standard initial condition

$$u(x,0) = v(x) \quad \text{for } 0 \le x \le 1.$$
For the numerical solution we use implicit multistep finite difference operators of the form described in Sections 3, 6 and 8, so that

$$A_{h,0}U^{n+1} = A_{h,1}U^n + \cdots + A_{h,m}U^{n-m+1} + kf^n, \qquad (12.32)$$

where the $A_{h,j} = A_{h,j}(nk,h)$ are defined by

$$A_{h,j}(t,h)v(x) = \alpha_jv(x) + \sum_{l=-q}^{r} a_{jl}(x,t,h)v(x - lh), \qquad (12.33)$$

with $\alpha_0 = 1$. We assume that the coefficients are sufficiently smooth and that (12.32) is consistent with the differential equation (12.30). Further, we assume that

$$A_0(x,t,\xi) = 1 + \sum_{l} a_{0l}(x,t,0)e^{-il\xi} \ne 0,$$

so that $A_{h,0}$ is invertible, that (12.32) is parabolic in the sense of John, and that the coefficients $\alpha_j$ in (12.33) are such that the associated matrix

$$\mathscr{A} = \begin{pmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_m\\ 1 & 0 & \cdots & 0\\ & \ddots & & \vdots\\ 0 & & 1 & 0\end{pmatrix}$$

is stable (cf. Section 8).


We are thus making all the natural assumptions to ensure that (12.32) is a viable finite difference method for the pure initial value problem. This operator will now be used at the mesh points $jh$ (with $h = 1/M$) of $[0,1]$ if the boundary conditions at $x = 0$ and $x = 1$ are of Dirichlet type ($\bar\alpha = \bar\beta = 0$), with the modification that the point $x = 0$ is not included if the boundary condition at $x = 0$ is of differential type (i.e. with $\bar\alpha \ne 0$), and correspondingly at $x = 1$.
In addition to demanding that $A_0(x,t,\xi) \ne 0$ we now also require that the change in the argument of $A_0(x,t,\xi)$ as $\xi$ varies from $-\pi$ to $\pi$ (or the index of $A_0$) is zero. This is a well-known condition for equations of the Wiener-Hopf type (cf. STRANG [1964b, 1966]) and constitutes the condition for the one-sided operator $A_{h,0}$ to be invertible. For instance, for the backward Euler scheme for the standard heat equation we have

$$A_0(x,t,\xi) = 1 + 2\lambda(1 - \cos\xi),$$

and the change in argument is zero, so that the condition is satisfied.
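For a general symbol the index can be computed numerically by accumulating the change of argument over one period; for the backward Euler symbol above it is of course zero, since the symbol is real and positive. An illustrative sketch (discretization density is our own choice):

```python
import cmath
import math

lam = 0.4
def symbol(xi):
    return 1 + 2 * lam * (1 - math.cos(xi))   # backward Euler symbol A_0

# accumulate the change of argument of A_0 as xi runs from -pi to pi
N = 2000
total = 0.0
prev = cmath.phase(symbol(-math.pi))
for n in range(1, N + 1):
    cur = cmath.phase(symbol(-math.pi + 2 * math.pi * n / N))
    d = cur - prev
    while d > math.pi:
        d -= 2 * math.pi
    while d < -math.pi:
        d += 2 * math.pi
    total += d
    prev = cur
index = total / (2 * math.pi)
print(index)   # 0.0: the index condition is satisfied
```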


We now turn to the conditions on our scheme imposed by the presence of the boundary conditions (12.31), and as above restrict the discussion to the left-hand boundary at $x = 0$. We first consider boundary conditions of Dirichlet type, so that $\bar\alpha = 0$. In this case, we use the equations defined by (12.32) for $j = 0,1,2,\dots$, and supplement the definition by discrete boundary conditions of the form

$$U_j^{n+1} = \sum_{l=1}^s b_{jl}(nk,h)U_l^{n+1} + \Big(1 - \sum_{l=1}^s b_{jl}(nk,h)\Big)g_0(nk) \quad \text{for } j = -r,\dots,-1.$$

For boundary conditions in (12.31) with $\bar\alpha \ne 0$, we use the equations from (12.32) at the interior points $jh$, i.e. for $j = 1,2,\dots$, and set

$$U_j^{n+1} = \sum_{l=1}^s b_{jl}(nk,h)U_l^{n+1} + \frac{h}{\bar\alpha}g_0(nk)\Big[j - \sum_{l=1}^s lb_{jl}(nk,h)\Big] \quad \text{for } j = -r+1,\dots,0.$$

The form of the boundary conditions employed at the right-hand boundary is
analogous.
We assume that the discrete boundary conditions are consistent with the
continuous boundary conditions in the natural sense.
We now demand that the boundary approximations are such that there is no
solution in l_2 of the semi-infinite problem (with ν = 0 or 1 depending on the type of
boundary condition)

(A_{h,0}(t, 0) v)_j = 0,   j = ν, ν + 1, ...,

v_j = Σ_{l=ν}^{s} b_{jl}(t, 0) v_l,   j = −r + ν, ..., −1 + ν,

and the corresponding condition for the right-hand boundary. Further, we consider
162 V. Thomée CHAPTER III

the equation

((z A_{h,0}(t, 0) − A_{h,1}(t, 0) − z^{−1} A_{h,2}(t, 0) − ⋯ − z^{−m+1} A_{h,m}(t, 0)) v)_j = 0
    for j = ν, ν + 1, ...,
                                                                         (12.34)
v_j = Σ_{l=ν}^{s} b_{jl}(t, 0) v_l   for j = −r + ν, ..., −1 + ν.

Let {e^{iφ_κ}} be those eigenvalues of the matrix S introduced above which are on the
unit circle. We then assume that for |z| ≥ 1, z ≠ e^{iφ_κ}, the unique solution of (12.34) in
l_∞ is zero. If α = 0, we make the same assumption for z = e^{iφ_κ} as well. For α ≠ 0 and
z = e^{iφ_κ} we have that v_j ≡ 1, j = 0, 1, ..., is a solution, but we assume that there is no
other linearly independent solution with |v_j| ≤ c(j + 1).
We make the analogous assumption for the right-hand boundary.
Under the above conditions Osher shows that the difference scheme has a unique
solution (the conditions made to ensure this are in fact also necessary). Further, the
scheme is maximum-norm stable. In fact, the following more precise estimates hold
under the following different assumptions. First, for the nonhomogeneous equation
with homogeneous boundary conditions (g_0(t) ≡ g_1(t) ≡ 0) we have

‖U^n‖_{∞,h} ≤ C ( nk max_{m≤n} ‖f^m‖_{∞,h} + ‖U^0‖_{∞,h} ).

In this case we also have the smoothing property

‖∂_x U^n‖_{∞,h} ≤ C ((n + 1)k)^{−1/2} ( nk max_{m≤n} ‖f^m‖_{∞,h} + ‖U^0‖_{∞,h} ).

Consider now instead the case that f ≡ 0 and v = 0 but that the boundary conditions
are nonhomogeneous. Then, if α ≠ 0, we have for the one-sided problem defined by
the left-hand boundary, with c > 0, that

|U_j^n| ≤ C √(nk) sup_{τ≤nk} |g_0(τ)| max( e^{−cj}, e^{−cj²/n} ),

and

|∂_x U_j^n| ≤ C sup_{τ≤nk} |g_0(τ)| max( log(j + 1) e^{−cj}, e^{−cj}/√j, (log n) e^{−cj²/n} ).

For Dirichlet boundary conditions, α = 0, we have for j ≥ 1,

|U_j^n| ≤ C ( |g_0(0)| + |g_0(nk)| + nk sup_{τ≤nk} |g_0(τ)|
    + Σ_{l=m}^{n−1} | g_0((l + 1)k) − Σ_{p=0}^{m} ω_p g_0((l − p)k) | )

and

IaUl< (( 1/ 2 nk suplgo(T)l + Igo(O)


((n k) , r<nk ik)

n-i g 0((1+1)k)- Z go((l-p)k)


+ IC>
l=m
p=O
(I- /n) 12

The corresponding estimates for the problem with the boundary to the right, and for the
two-sided problem, are also valid. Clearly the general situation with f, g_0, g_1 and v ≠ 0
may be analyzed by combination of the above estimates.
Before leaving the discussion of construction of discrete boundary conditions we
shall quote some simple results from STRANG [1964a] and MILLER [1969]
concerning a specific type of difference schemes designed for the solution of the
quarter-plane mixed initial boundary value problem

∂u/∂t = ∂²u/∂x²   for x > 0, t > 0,

u(0, t) = 0   for t > 0,

u(x, 0) = v(x)   for x > 0.
For such a problem it is particularly easy to apply an explicit operator of the form
(12.15) if r = 1, so that (with our standard notation) the scheme reduces to

U_j^{n+1} = Σ_{l=−q}^{1} a_l U_{j−l}^n   for j ≥ 1,  n ≥ 0,

U_0^{n+1} = 0   for n ≥ 0,

U_j^0 = v(jh)   for j ≥ 0.
Strang used the term "unbalanced" to describe methods of this form, and he posed
the problem of determining such methods that are both L_2-stable and highly
accurate. By our discussions in Sections 1 and 4 this problem reduces to choosing the
coefficients a_l in such a way that the symbol of the operator, the trigonometric
polynomial

E(ξ) = Σ_{l=−q}^{1} a_l e^{−ilξ},   (12.35)

satisfies

|E(ξ)| ≤ 1   for ξ ∈ R,   (12.36)

and also, with μ the specified order of accuracy, with λ = k/h² (which is assumed kept
constant),

E(ξ) = e^{−λξ²} + O(|ξ|^{μ+2})   as ξ → 0.   (12.37)
164 V. Thomee CHAPTER III

In this regard Strang proved that for given μ it is indeed possible to find a
trigonometric polynomial of the form (12.35), for some q, which satisfies (12.37) and
also the stability condition (12.36) for small λ.
Condition (12.37) may also be expressed as the μ + 2 linear equations

Σ_{l=−q}^{1} l^n a_l = { n! λ^{n/2}/(n/2)!   if n is even,
                       0                  if n is odd,        0 ≤ n ≤ μ + 1.

Strang's construction gives no information about the minimal order q of the
trigonometric polynomial satisfying the desired conditions, and Miller therefore
raised the question of stability for the uniquely determined scheme for which q = μ,
and showed that this method is stable (for small λ) if and only if μ ≤ 4. For μ = 2 we
have the standard symmetric operator
U_j^{n+1} = λU_{j−1}^n + (1 − 2λ)U_j^n + λU_{j+1}^n

(here the coefficient a_{−2} happens to vanish), and for μ = 3 and μ = 4,

U_j^{n+1} = (λ/12)(11 + 6λ)U_{j−1}^n + (1/3)(3 − 5λ − 6λ²)U_j^n + (λ/2)(1 + 6λ)U_{j+1}^n
    + (λ/3)(1 − 6λ)U_{j+2}^n − (λ/12)(1 − 6λ)U_{j+3}^n,

and

U_j^{n+1} = (λ/6)(5 + 6λ)U_{j−1}^n + (1/4)(4 − 5λ − 18λ²)U_j^n + (λ/3)(−1 + 24λ)U_{j+1}^n
    + (7λ/6)(1 − 6λ)U_{j+2}^n − (λ/2)(1 − 6λ)U_{j+3}^n + (λ/12)(1 − 6λ)U_{j+4}^n,

respectively. These operators are thus all stable in the sense of (12.36) for small λ.
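Since the moment equations determine the q = μ scheme uniquely, its coefficients can be cross-checked by solving the linear system numerically. The Python sketch below (function names are ours) does this for μ = 3 and also evaluates the symbol (12.35) on a grid so that the stability condition (12.36) can be inspected for a given small λ:

```python
import math
import numpy as np

def miller_coeffs(lam, mu=3):
    """Coefficients a_l, l = 1, 0, -1, ..., -mu, of the scheme with q = mu,
    from the mu+2 moment equations:
    sum_l l^n a_l = n! lam^(n/2)/(n/2)! (n even), 0 (n odd), 0 <= n <= mu+1."""
    ls = np.array([1] + list(range(0, -mu - 1, -1)), dtype=float)
    A = np.vstack([ls**n for n in range(mu + 2)])
    b = np.array([math.factorial(n) * lam**(n // 2) / math.factorial(n // 2)
                  if n % 2 == 0 else 0.0 for n in range(mu + 2)])
    return ls, np.linalg.solve(A, b)

def symbol_max(lam, mu=3, npts=2001):
    # max over xi in [-pi, pi] of |E(xi)| = |sum_l a_l exp(-i*l*xi)|
    ls, a = miller_coeffs(lam, mu)
    xi = np.linspace(-np.pi, np.pi, npts)
    E = sum(al * np.exp(-1j * l * xi) for l, al in zip(ls, a))
    return np.abs(E).max()
```

For λ = 0.05 the computed coefficients agree with the closed-form μ = 3 expressions quoted above, and the maximum of |E| over the grid is 1, attained at ξ = 0, consistent with stability for small λ.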
So far in our discussion in this section we have only treated the question of
stability of the finite difference methods under investigation. It is clear from the
general type of considerations, presented in Section 3 within the framework of pure
initial value problems, that also in the case of mixed initial boundary value
problems, stability together with a specified order of accuracy will imply con-
vergence of that same order. We also saw in Section 11 that sometimes the formal
accuracy may be of lower order near the boundary without affecting the rate of
convergence. These results all require that the exact solution is sufficiently smooth,
which is not always the case in practice. For the pure initial value problem we
presented a thorough discussion in Section 9 of the relation between the regularity
and the actual rate of convergence. Some of these results were originally derived by
means of eigenfunction expansions for the model heat equation with Dirichlet
boundary conditions, see e.g. JUNCOSA and YOUNG [1953, 1954, 1957] and WASOW
[1958]. A more recent contribution is MONIEN [1970], who uses eigenfunction
expansions to specify the precise regularity requirements in cases which include the
possibility of derivative boundary conditions.
We shall end this section on the use of spectral techniques with an example by
GEKELER [1975] of an argument employing discrete eigenfunctions in a somewhat
more refined way than in Section 2. It concerns the maximum-norm convergence of

the Crank-Nicolson method for the initial boundary value problem


∂u/∂t = ∂/∂x ( a(x) ∂u/∂x ),   0 < x < 1,  t > 0,

u(0, t) = u(1, t) = 0,   t > 0,   (12.38)

u(x, 0) = v(x),   0 ≤ x ≤ 1,
where a(x) is a positive smooth function.
We recall from Section 10 that the Crank–Nicolson method is stable in l_{2,h} and
that a convergence estimate of the form

‖U^n − u^n‖_{2,h} ≤ C(u)(h² + k²)

holds, independently of the mesh ratio λ = k/h², and also that the energy method
may be used to show the corresponding error estimate in the maximum norm. It is
for this latter result that we would now like to give an alternative proof using
a spectral method, and this will be done for h and k of the same order of magnitude.
For comparison we recall that problem (12.38) may be continued to a periodic
problem, and that the Crank–Nicolson scheme is parabolic in John's sense for
λ = k/h² fixed and hence, by the results of Section 9, we have the maximum-norm
error estimate

‖U^n − u^n‖_{∞,h} ≤ C(u)h².

In the case that k and h are independent we mentioned in Section 7 that if a(x) ≡ 1 the
scheme is uniformly maximum-norm stable (with the stability constant 23) and
hence in this case

‖U^n − u^n‖_{∞,h} ≤ C(u)(h² + k²).


With our standard notation the method under scrutiny is

∂_t U_j^n = ∂_x ( a_{j−1/2} ∂̄_x (½(U_j^n + U_j^{n+1})) ),   j = 1, ..., M − 1,  n ≥ 0,

U_0^{n+1} = U_M^{n+1} = 0,   n ≥ 0,   (12.39)

U_j^0 = v(jh),   j = 0, ..., M.
We recall the notation for the discrete inner product

(V, W)_h = h Σ_{j=0}^{M} V_j W_j,

and the corresponding norm in l_{2,h},

‖V‖_{2,h} = (V, V)_h^{1/2} = ( h Σ_{j=0}^{M} V_j² )^{1/2}.

It is clear that the symmetric operator −A_h on l_{2,h} has M − 1 positive eigenvalues
{μ_p}_{p=1}^{M−1} and a corresponding orthonormal system of eigenfunctions {φ_p}_{p=1}^{M−1}, and

that for each V °2,h,

M-1

V= E (V,P)h(P.
p=1

It is then easy to see that the solution of (12.39) is

U^n = Σ_{p=1}^{M−1} (v, φ_p)_h E(kμ_p)^n φ_p,

where

E(z) = (1 − ½z)/(1 + ½z).

It can be shown (cf. e.g. CARASSO [1969]) that there are positive constants c_0, c_1 and
C independent of M such that

c_0 p² ≤ μ_p ≤ c_1 p²,   (12.40)

and

‖φ_p‖_{∞,h} ≤ C.   (12.41)

(From Section 2 we know that in the case a(x) ≡ 1 we have φ_{p,j} = √2 sin pπjh and
μ_p = 2h^{−2}(1 − cos pπh).)
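This eigenfunction representation is easy to realize numerically. In the sketch below (our own, with NumPy's symmetric eigensolver standing in for the abstract spectral decomposition), −A_h is assembled for a variable coefficient a(x) and the solution is evaluated as U^n = Σ_p E(kμ_p)^n (v, φ_p) φ_p:

```python
import numpy as np

def crank_nicolson_eig(a, v, M, k, nsteps):
    """U^n for u_t = (a(x) u_x)_x, u(0,t) = u(1,t) = 0, via the eigenpairs
    (mu_p, phi_p) of -A_h and the rational function E(z) = (1 - z/2)/(1 + z/2)."""
    h = 1.0 / M
    x = np.linspace(0.0, 1.0, M + 1)
    am = a(x[1:] - h / 2)                    # a_{j-1/2}, j = 1, ..., M
    # -A_h restricted to the interior points: symmetric positive definite
    main = (am[:-1] + am[1:]) / h**2
    off = -am[1:-1] / h**2
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    mu, phi = np.linalg.eigh(A)              # mu_p > 0, orthonormal columns
    E = (1 - k * mu / 2) / (1 + k * mu / 2)
    c = phi.T @ v(x[1:-1])                   # coefficients (v, phi_p), up to h
    return x[1:-1], phi @ (E**nsteps * c)    # interior values at t = nsteps*k
```

With a ≡ 1 and v(x) = sin πx this reproduces e^{−π²t} sin πx to the expected O(h² + k²) accuracy.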
Let us now introduce the truncation error τ^n by

(I − ½kA_h)u^{n+1} = (I + ½kA_h)u^n − kτ^n,

where u is the exact solution of (12.38), and recall that

‖τ^n‖_{∞,h} ≤ C(u)(h² + k²).   (12.42)

Setting

E_{kh} = (I − ½kA_h)^{−1}(I + ½kA_h)

and

G_{kh} = (I − ½kA_h)^{−1},

we have for the error z^n = U^n − u^n

z^{n+1} = E_{kh} z^n + kG_{kh} τ^n,   n ≥ 0,

whence

z^n = k Σ_{j=0}^{n−1} E_{kh}^{n−1−j} G_{kh} τ^j.
Here

τ^j = Σ_p (τ^j, φ_p)_h φ_p,
p

so that, with G(z)=(l + z)'


M-1 n-
zn: E k E(kup)"- -jG(kjup)( T j, (p)h00p,
p=l j=o

and hence, by (12.41),


M-1 n-1

II"lll,h <C Z k E IE(kp)ljG(kpp) max IIzTllo',h


p=l j=O j<n-1
We need to estimate the expression

S = k Σ_{j=0}^{n−1} |E(kμ)|^j G(kμ) ≤ k G(kμ)/(1 − |E(kμ)|)

for the various μ = μ_p. We have

S ≤ k G(kμ)/(1 − E(kμ)) = 1/μ   for ½kμ ≤ 1,

S ≤ k G(kμ)/(1 + E(kμ)) = ½k   for ½kμ ≥ 1,
and we conclude that

‖z^n‖_{∞,h} ≤ C ( Σ_p μ_p^{−1} + Mk ) max_{j≤n−1} ‖τ^j‖_{∞,h}.

Using (12.40) and (12.42) this yields

‖z^n‖_{∞,h} ≤ C(u)(1 + k/h)(h² + k²) ≤ C(u)h²   if k/h ≤ C,

which is the desired result.
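The two-case bound on the kernel sum S is easily verified numerically; in the short sketch below (ours) the geometric sum is evaluated directly and compared against max(1/μ, ½k):

```python
def kernel_sum(k, mu, n=10**5):
    # S = k * sum_{j=0}^{n-1} |E(k*mu)|^j * G(k*mu), with E(z) = (1 - z/2)/(1 + z/2)
    # and G(z) = 1/(1 + z/2); the text bounds S by 1/mu (k*mu <= 2) or k/2 (k*mu >= 2)
    z = k * mu
    E = (1 - z / 2) / (1 + z / 2)
    G = 1 / (1 + z / 2)
    r = abs(E)
    return k * G * n if r == 1 else k * G * (1 - r**n) / (1 - r)
```

Over a range of eigenvalue sizes the computed sums stay below the bound, including near the crossover kμ = 2.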
In closing we remark that this section has dealt with problems in one space
variable only. As indicated earlier, the spectral techniques used in the analysis of the
pure initial value problem are easily adapted to apply also to axis-parallel
rectangular domains in several space dimensions with Dirichlet boundary con-
ditions when the elliptic operator occurring in the parabolic equation is a sum of
one-dimensional operators. For more complicated domains, equations or boundary
conditions, less has been accomplished. We refer to STRANG [1960], KEAST and
MITCHELL [1967] and OSHER [1962] for some work in this direction based on
spectral techniques.

13. Various additional topics

As specified in the introduction to this chapter we shall now touch upon a few topics
not covered in our earlier presentation. These are concerned with variable time
steps, problems with singular or discontinuous coefficients, the use of boundary
value problem techniques, interior estimates and finally the application of finite
differences in existence theory.

In the analysis in the preceding sections we have always employed a constant time
step in our finite difference schemes. We shall now briefly describe some work by
DOUGLAS and GALLIE [1955] which suggests longer time steps as the distance to the
initial time increases.
We consider the model problem

∂u/∂t = ∂²u/∂x²,   0 < x < 1,  t > 0,

u(0, t) = u(1, t) = 0,   t ≥ 0,   (13.1)

u(x, 0) = v(x),   0 ≤ x ≤ 1.
For the numerical solution we select as usual a mesh width h = 1/M, where M is
a positive integer, but choose now time steps k_n, n = 0, 1, ..., which may vary with n.
The corresponding discrete time levels are then t_n = Σ_{j=0}^{n−1} k_j. For the discrete solution
U_j^n approximating u(jh, t_n) we take now the result of the backward Euler method,

(U_j^{n+1} − U_j^n)/k_n = ∂_x ∂̄_x U_j^{n+1},   j = 1, ..., M − 1,  n ≥ 0,

U_0^{n+1} = U_M^{n+1} = 0,

U_j^0 = v(jh),   j = 0, ..., M.
In operator form this may be written

U^{n+1} = E_{k_n h} U^n,   n ≥ 0,

and the error z^n = U^n − u^n satisfies

z^{n+1} = E_{k_n h}(z^n + k_n τ^n),

where the truncation error τ_j^n now has the form

τ_j^n = ( (1/12)h² + ½k_n ) u_tt(x*, t*)

with 0 < x* < 1, t_n ≤ t* ≤ t_{n+1}.
The standard argument then shows, for instance in the maximum norm, that

‖z^n‖_{∞,h} ≤ Σ_{l=0}^{n−1} k_l ( (1/12)h² + ½k_l ) sup_{t_l≤t≤t_{l+1}} ‖u_tt(·, t)‖_∞.   (13.2)

Recalling that the exact solution of (13.1) is

u(x, t) = Σ_{m=1}^{∞} a_m e^{−m²π²t} sin mπx,

where

a_m = 2 ∫_0^1 v(x) sin mπx dx,

we find easily, for sufficiently smooth initial data, that

IIut 11 < Ce - 2',


and, introducing the variable mesh ratio ,, = k,/h 2, we have then, by (13.2),
n-1
1z"llh <
Ch2 E k(l + - )e 2
t (13.3)
1=0

Let us first remark that even for k and λ fixed the sum in (13.3) is uniformly
bounded for n ≥ 0, so that the error is bounded by C(v)h², uniformly for t ≥ 0, and not
only on bounded intervals in t. This is due to the exponentially decreasing factors
in the sum, which we did not keep track of properly in the simple analyses of Sections
1 and 2.
More generally, the exponential factors allow a certain growth in k_l and λ_l without
sacrificing the O(h²) convergence. For instance, if

λ_l ≤ C e^{γπ²t_l}   for l ≥ 0, with γ < 1,

and

k_{l+1} ≤ C k_l,   l ≥ 0,

then obvious estimates show for the sum in (13.3)

Σ_{l=0}^{n−1} k_l (1 + λ_l) e^{−π²t_l} ≤ C { 1 + ∫_0^∞ e^{−(1−γ)π²t} dt } ≤ C.

Douglas and Gallie considered, in particular, the choices λ_n = α + βt_n and λ_n = e^{π²t_n/2}.
It is clear that these remarks admit vast generalizations.
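The practical gain can be seen in a small experiment. The backward Euler sketch below (our own; the linearly growing ratio λ_n = α + βt_n is one of the choices attributed to Douglas and Gallie above) integrates the model problem (13.1) to T = 1 with steps k_n = λ_n h²:

```python
import numpy as np

def backward_euler_growing(M, lam0, beta, T):
    """u_t = u_xx, u(0,t) = u(1,t) = 0, v(x) = sin(pi x), with growing time
    steps k_n = (lam0 + beta*t_n) * h^2 (Douglas-Gallie style)."""
    h = 1.0 / M
    x = np.linspace(0.0, 1.0, M + 1)
    U = np.sin(np.pi * x)
    D = (np.diag(-2.0 * np.ones(M - 1)) + np.diag(np.ones(M - 2), 1)
         + np.diag(np.ones(M - 2), -1)) / h**2
    I = np.eye(M - 1)
    t, steps = 0.0, 0
    while t < T - 1e-12:
        k = min((lam0 + beta * t) * h**2, T - t)
        U[1:-1] = np.linalg.solve(I - k * D, U[1:-1])  # backward Euler step
        t += k
        steps += 1
    return x, U, steps

x, U, steps = backward_euler_growing(M=40, lam0=1.0, beta=50.0, T=1.0)
err = np.max(np.abs(U - np.exp(-np.pi**2) * np.sin(np.pi * x)))
```

Here roughly 130 steps suffice instead of the 1600 that the fixed step k = h² would take, with the maximum error still well below h², in line with the estimate (13.3).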

The use of nonuniform meshes in space has been suggested and analyzed in one
space dimension in SAMARSKII [1963].

We shall now consider an initial boundary value problem for a parabolic equation
with the singular elliptic operator,

∂u/∂t = (1/x²) ∂/∂x ( x² ∂u/∂x ),   0 < x < 1,  t > 0,

u(1, t) = 0,   t > 0,   (13.4)

u(x, 0) = v(x),   0 < x < 1.

Such a problem arises after transformation by polar coordinates of the heat
equation

∂u/∂t = Δu   (13.5)

in three space dimensions in the case of spherical symmetry. In interesting situations



the solution of (13.4) will be regular at x = 0, since this corresponds to an interior
point for (13.5), and as a result of (13.4), the boundary condition (∂/∂x)u(0, t) = 0 is
satisfied for the solution.
We shall describe a finite difference scheme analyzed in FRYAZINOV and BAKIROVA
[1972]. We use a mesh width h = 1/M in space and a time step k, and define the
discrete elliptic operator

A_h U_j = (1/x_j²) ∂_x ( x̄²_{j−1/2} ∂̄_x U_j ),

where x_j = jh and

x̄_{j−1/2} = x_{j−1/2} (1 − h²/(4x²_{j−1/2}))^{1/2} = (x_j x_{j−1})^{1/2}.   (13.6)


We then employ this operator together with the θ-method and pose thus, with
U^{n+θ} = θU^{n+1} + (1 − θ)U^n, the discrete problem

∂_t U_j^n = A_h U_j^{n+θ},   j = 1, ..., M − 1,  n ≥ 0,

U_M^{n+1} = 0,   n ≥ 0,   (13.7)

U_j^0 = v(jh),   j = 1, ..., M.

We note that for j = 1 the coefficient x̄²_{1/2} of ∂̄_x U_1^{n+θ} vanishes, so that, in fact, neither
U_0^n nor U_0^{n+1} appears in this system.
The equations may be written in the form

(1 + θλ(x̄²_{j+1/2} + x̄²_{j−1/2})/x_j²) U_j^{n+1}
    − θλ(x̄²_{j+1/2}/x_j²) U_{j+1}^{n+1} − θλ(x̄²_{j−1/2}/x_j²) U_{j−1}^{n+1}
  = (1 − (1 − θ)λ(x̄²_{j+1/2} + x̄²_{j−1/2})/x_j²) U_j^n
    + (1 − θ)λ(x̄²_{j+1/2}/x_j²) U_{j+1}^n + (1 − θ)λ(x̄²_{j−1/2}/x_j²) U_{j−1}^n,

U_M^{n+1} = 0.

We now note that

(x̄²_{j+1/2} + x̄²_{j−1/2})/x_j² = (x_j x_{j+1} + x_{j−1} x_j)/x_j² = (x_{j+1} + x_{j−1})/x_j = 2,

so that the coefficient of U_j^n equals 1 − 2(1 − θ)λ. Hence all coefficients on the right are
nonnegative and add up to 1 provided the stability condition

2(1 − θ)λ ≤ 1   (13.8)

is satisfied, and we then conclude in the standard manner the maximum-norm
stability estimate

‖U^{n+1}‖_{∞,h} ≤ ‖U^n‖_{∞,h}.
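The exact cancellation (x̄²_{j+1/2} + x̄²_{j−1/2})/x_j² = 2, which makes the right-hand side a convex combination, can be checked directly, together with the identity behind (13.6); a small Python sketch (ours):

```python
import numpy as np

# Weights (13.6) for the spherically symmetric operator (1/x^2)(x^2 u_x)_x:
M = 16
h = 1.0 / M
x = h * np.arange(M + 1)                       # x_j = j*h
xbar2 = x[1:] * x[:-1]                         # xbar_{j-1/2}^2 = x_j * x_{j-1}
j = np.arange(1, M)                            # interior indices 1, ..., M-1
coef = (xbar2[j] + xbar2[j - 1]) / x[j]**2     # (xbar_{j+1/2}^2 + xbar_{j-1/2}^2)/x_j^2
```

The array coef is identically 2, the first weight x̄²_{1/2} = x_1 x_0 vanishes (so U_0 indeed drops out of the scheme), and x²_{j−1/2} − h²/4 = x_j x_{j−1} as claimed in (13.6).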

One may also show convergence at the rate O(h² + k) for the backward Euler
method, and, with λ satisfying (13.8), O(h²) when θ < 1. These results apply as well to
the nonhomogeneous equation with nonhomogeneous boundary conditions.
Similar results are shown in FRYAZINOV and BAKIROVA [1972] for the equation

∂u/∂t = (1/x) ∂/∂x ( x ∂u/∂x )   for 0 < x < 1,  t > 0,

which arises from transformation by polar coordinates of the two-dimensional
Laplacian under the assumption that the functions involved depend only on the
distance from the origin. In this case it turns out to be convenient to use mesh points
of the form x_j = (j + α)h, where α = ½(1 + √3), and the weights (13.6) are replaced by

x̄_{j−1/2} = x_{j−1/2}(1 − h²/(4x²_{j−1/2})).

Problems of the above type may also be analyzed by the energy method, now in
general with respect to weighted norms (cf. e.g. SAMARSKII and GULIN [1973]). We
consider only the backward Euler version (θ = 1) of (13.7) and introduce the inner
product

(V, W) = h Σ_{j=0}^{M−1} V_j W_j,

and the corresponding norm

‖V‖ = (V, V)^{1/2}.

Multiplication of (13.7) (with θ = 1) by h x_j² U_j^{n+1} and summation over j = 1, ...,
M − 1 (noting that x_0 = 0) yields

(x_j² ∂_t U_j^n, U_j^{n+1}) − (∂_x(x̄²_{j−1/2} ∂̄_x U_j^{n+1}), U_j^{n+1}) = 0   for n ≥ 0.

Using the identity (10.8) and summation by parts in the second term we find easily

½ ∂_t ‖x_j U_j^n‖² + ½ k ‖x_j ∂_t U_j^n‖² + ‖x̄_{j−1/2} ∂̄_x U_j^{n+1}‖² = 0,

from which we conclude that

∂_t ‖x_j U_j^n‖² ≤ 0,

and hence

‖x_j U_j^{n+1}‖ ≤ ‖x_j U_j^n‖   for n ≥ 0,

which shows stability in the weighted l_2-norm ‖x_j V_j‖.
For further work on singular problems, see e.g. FRANKLIN [1958], SMITH [1965]
and EISEN [1966, 1967a, 1967b].

We shall briefly discuss the case of a parabolic equation with a discontinuous



coefficient. We consider thus the initial boundary value problem

∂u/∂t = ∂/∂x ( a(x) ∂u/∂x )   for 0 < x < 1,  t > 0,

u(0, t) = u(1, t) = 0   for t > 0,   (13.9)

u(x, 0) = v(x),   0 ≤ x ≤ 1,

where the positive function a = a(x) may have a finite number of simple discon-
tinuities but is smooth on the closures of the intervals of the partition of [0, 1]
defined by the discontinuities. The solution of this problem is then smooth away
from the discontinuities of a, where, in fact, u and a ∂u/∂x are continuous, so that, in
particular, ∂u/∂x is discontinuous.
We consider first the standard θ-method (with U^{n+θ} = θU^{n+1} + (1 − θ)U^n)

∂_t U_j^n = ∂_x ( a_{j−1/2} ∂̄_x U_j^{n+θ} ),   j = 1, ..., M − 1,  n ≥ 0,

U_0^{n+1} = U_M^{n+1} = 0,   n ≥ 0,   (13.10)

U_j^0 = v(jh),   j = 0, ..., M,

where h = 1/M and where we restrict ourselves to the case ½ ≤ θ ≤ 1. We recall from
Section 10 that if a is smooth we have unconditional stability in l_2 and convergence
in both l_2 and the maximum norm to the natural orders, provided the exact solution
of (13.9) is sufficiently smooth, e.g.

‖U^n − u^n‖_{∞,h} ≤ C(u)(h² + k)    if ½ < θ ≤ 1,
‖U^n − u^n‖_{∞,h} ≤ C(u)(h² + k²)   if θ = ½.

For the case of discontinuous coefficients it was shown by the energy method in
SAMARSKII and FRYAZINOV [1961] that the following weaker convergence estimates
hold, namely

‖U^n − u^n‖_{∞,h} ≤ C(u)(h + k)          if θ = 1,
‖U^n − u^n‖_{∞,h} ≤ C(u)(h^{1/2} + k)    if ½ < θ < 1,
‖U^n − u^n‖_{∞,h} ≤ C(u)(h^{1/2} + k²)   if θ = ½.

It was also shown by Samarskii and Fryazinov that it is possible to modify the
difference scheme so that a higher rate of convergence may be attained in the case of
discontinuities. The modification uses the harmonic mean over intervals of length h,

a_h(x) = ( ∫_{−1/2}^{1/2} a(x + sh)^{−1} ds )^{−1},

and replaces a by a_h in the finite difference equation in (13.10), so that it thus reads

∂_t U_j^n = ∂_x ( a_{h,j−1/2} ∂̄_x U_j^{n+θ} ),   j = 1, ..., M − 1,  n ≥ 0,   (13.11)

with the same initial and boundary conditions as before. (This is a homogeneous
difference scheme in the sense of TIKHONOV and SAMARSKII [1961].) For this modified
method it was proved that

‖U^n − u^n‖_{∞,h} ≤ C(u)(h² + k)         if θ = 1,
‖U^n − u^n‖_{∞,h} ≤ C(u)(h^{3/2} + k)    if ½ < θ < 1,
‖U^n − u^n‖_{∞,h} ≤ C(u)(h^{3/2} + k²)   if θ = ½.

Note that in the particular case that a discontinuity falls at a mesh point x_j but a is
constant on both sides with values a_+ and a_−, then a_{h,j±1/2} = a_±, so that the new
difference equation (13.11) reduces to the old one in (13.10).
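The effect of the harmonic mean is already visible in the stationary problem. The Python sketch below (our own construction, not from the text) solves the steady equation −(a u′)′ = 0, u(0) = 0, u(1) = 1, with a piecewise constant a jumping strictly inside a mesh cell, once with pointwise sampled coefficients a_{j−1/2} = a(x_{j−1/2}) and once with the cellwise harmonic mean:

```python
import numpy as np

def solve_steady(coef):
    """Solve -(a u')' = 0, u(0) = 0, u(1) = 1 on a uniform mesh h = 1/M,
    given the cell coefficients a_{j-1/2}, j = 1, ..., M; returns interior values."""
    M = len(coef)
    A = (np.diag(coef[:-1] + coef[1:]) - np.diag(coef[1:-1], 1)
         - np.diag(coef[1:-1], -1))
    b = np.zeros(M - 1)
    b[-1] = coef[-1]                      # from the boundary value u(1) = 1
    return np.linalg.solve(A, b)

M, am, ap = 20, 1.0, 100.0
h = 1.0 / M
xi = 0.5 + h / 3                          # interface strictly inside a cell
xmid = h * (np.arange(1, M + 1) - 0.5)    # cell midpoints x_{j-1/2}
pointwise = np.where(xmid < xi, am, ap)
frac = np.clip(xi - (xmid - h / 2), 0.0, h)     # length of the a_- part of each cell
harmonic = h / (frac / am + (h - frac) / ap)    # cellwise harmonic mean of a

q = 1.0 / (xi / am + (1 - xi) / ap)       # exact (constant) flux of the interface problem
xin = h * np.arange(1, M)
exact = np.where(xin < xi, q * xin / am, q * xi / am + q * (xin - xi) / ap)
err_pt = np.max(np.abs(solve_steady(pointwise) - exact))
err_hm = np.max(np.abs(solve_steady(harmonic) - exact))
```

Here err_hm is at rounding level while err_pt is a few percent: pointwise sampling misrepresents the cell containing the interface, which is the mechanism behind the h^{3/2}-versus-h^{1/2} gap above.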
For the backward Euler method the investigation by Samarskii and Fryazinov
was pursued to the case, with a in (13.9) depending on both x and t, that the
discontinuities are oblique, that is, the discontinuities of a appear along curves
which are not necessarily of the form x = constant. For the standard method (13.10)
with θ = 1 the result is then

‖U^n − u^n‖_{∞,h} ≤ C(u)( h^{(1/2)−ρ₁(h)} + k^{1−ρ₂(k)} ),

where ρ_j(s) → 0 as s → 0, for j = 1, 2. The corresponding result for the modified scheme
using (13.11) with θ = 1 is

‖U^n − u^n‖_{∞,h} ≤ C(u)( h^{1−ρ₁(h)} + k^{1−ρ₂(k)} ),

again an improvement over the standard method.

We shall briefly describe some work by CARASSO and PARTER [1970], based on an
idea and numerical experiments by GREENSPAN [1968], concerning the use of
"boundary value techniques" for a parabolic problem over a long time interval. We
consider the nonhomogeneous equation

∂u/∂t = ∂/∂x ( a(x) ∂u/∂x ) + f(x),   0 < x < 1,  t > 0,

u(0, t) = u(1, t) = 0,   t > 0,   (13.12)

u(x, 0) = v(x),   0 ≤ x ≤ 1,

where a is a positive smooth function. The solution then tends to a steady state
solution w which solves the two-point boundary value problem

−(d/dx)( a (dw/dx) ) = f   in 0 < x < 1,

w(0) = w(1) = 0.
Normally we have applied above a finite difference method to (13.12) in which the
approximation is successively determined at time t_{n+1} = (n + 1)k when it is known at
t = nk (or possibly at one or more additional earlier time levels). The purpose is now

to describe a method which uses the assumed knowledge of the stationary solution
w by using this function as an approximation to u(·, T) for a sufficiently large T, and
then interpreting the problem as a boundary value problem, solving in the whole
domain at once.
We consider thus, with Mh = 1, Nk = T, the finite difference equations

∂̂_t U_j^n = ∂_x ( a_{j−1/2} ∂̄_x U_j^n ) + f_j,
    j = 1, ..., M − 1,  n = 1, ..., N − 1,

U_0^n = U_M^n = 0,   n = 1, ..., N − 1,   (13.13)

U_j^0 = V_j = v(jh),   U_j^N = W_j = w(jh),   j = 0, ..., M.

Here

∂̂_t U^n = ½(∂_t + ∂̄_t)U^n = (U^{n+1} − U^{n−1})/(2k)

is a symmetric difference quotient with respect to t, and we recall from Section 3 that
the three-level scheme thus defined is unconditionally unstable when used in
a normal marching procedure.
Introducing the vectors U^n = (U_1^n, ..., U_{M−1}^n)^T, and similarly for V and W, the
system (13.13) may be written in the form

U^{n+1} − U^{n−1} + 2λA U^n = 2kF^n,   n = 1, ..., N − 1,

U^0 = V,   U^N = W,

where λ = k/h² and A is a tridiagonal matrix.
It may be shown by expanding the U^n in eigenvectors of A that this system has
a unique solution for V, W, and F given. Further, assuming that T is selected large
enough so that the inequality

‖u(·, T) − w‖_{2,h} ≤ C k³

holds, Carasso and Parter show the error estimate

‖U^n − u(t_n)‖_{∞,h} ≤ C(u)(h² + k²)   for n = 1, ..., N − 1.
The key step is to show that the (N − 1) × (N − 1) Toeplitz matrix

T(a) = [  1   a   0   …   0
         −a   1   a   …   0
          0  −a   1   ⋱   ⋮
          ⋮        ⋱   ⋱   a
          0   …   0  −a   1 ]

has an inverse which is uniformly bounded in the maximum norm, independently of
a and N, when T = Nk is large and a/k is bounded below.
The result generalizes to nonselfadjoint operators and to f depending also on t,

provided the solution tends to a stationary solution and sufficient regularity is
present.
The method requires a system of (M- 1)(N- 1) equations to be solved at once
rather than, as is normally the case, a sequence of systems of order (M - 1). Parter
and Carasso also discuss the iterative solution of this large linear system.
One advantageous feature of this method is that the matrix of the algebraic system
has an inverse which is bounded independently of h and k so that, in particular, the
roundoff errors do not grow in time, in contrast to the situation with some marching
procedures.
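As an illustration, the following sketch (ours; dense linear algebra for brevity, where a real code would use a sparse or iterative solver as Carasso and Parter discuss) sets up the unstable leapfrog scheme (13.13) for u_t = u_xx + π² sin πx as one global linear system, with the steady state w = sin πx imposed at t = T:

```python
import numpy as np

def bvt_solve(M, N, T):
    """'Boundary value technique' for u_t = u_xx + f, u(0,t) = u(1,t) = 0,
    f = pi^2 sin(pi x), v = 0, w = sin(pi x): solve all N-1 time levels of the
    leapfrog scheme U^{n+1} - U^{n-1} - 2k D U^n = 2k F at once."""
    h, k = 1.0 / M, T / N
    m = M - 1
    x = np.linspace(0.0, 1.0, M + 1)[1:-1]
    D = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
         + np.diag(np.ones(m - 1), -1)) / h**2
    F, V, W = np.pi**2 * np.sin(np.pi * x), np.zeros(m), np.sin(np.pi * x)
    A = np.zeros(((N - 1) * m, (N - 1) * m))
    b = np.tile(2 * k * F, N - 1)
    for n in range(1, N):                        # block row for time level n
        r = (n - 1) * m
        A[r:r + m, r:r + m] = -2 * k * D
        if n > 1:
            A[r:r + m, r - m:r] = -np.eye(m)     # -U^{n-1}
        else:
            b[r:r + m] += V                      # known U^0 = V
        if n < N - 1:
            A[r:r + m, r + m:r + 2 * m] = np.eye(m)  # +U^{n+1}
        else:
            b[r:r + m] -= W                      # known U^N = W
    return x, np.linalg.solve(A, b).reshape(N - 1, m)
```

With M = 20, N = 40, T = 2 the computed level at t = 1 agrees with the exact solution u = (1 − e^{−π²t}) sin πx to the expected O(h² + k²) accuracy, even though marching with the same leapfrog scheme would blow up.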

We shall now turn to a discussion of some interior estimates for parabolic


difference equations. Consider a parabolic differential equation of order M,

Lu = ∂u/∂t − Σ_{|α|≤M} a_α(x, t) D_x^α u = f   for (x, t) ∈ Ω,   (13.14)

where Ω is some bounded domain in R^{d+1}. It is well known that if f is smooth in
Ω then so is each solution, and this fact may be expressed in the form of an a priori
inequality such as

Σ_{Mα₀+|α|≤M+s} ‖D_t^{α₀} D_x^α u‖_{L_p(Ω₁^T)}
    ≤ C { Σ_{Mα₀+|α|≤s} ‖D_t^{α₀} D_x^α Lu‖_{L_p(Ω₂^T)} + ‖u‖_{L_p(Ω₂^T)} },   (13.15)

where Ω₁ ⊂⊂ Ω₂ ⊂⊂ Ω and Ω_j^T = Ω_j ∩ {t < T}. Here Ω₁ ⊂⊂ Ω₂ signifies that the closure
of the open domain Ω₁ is contained in Ω₂, so that Ω₁ has a positive distance to ∂Ω₂.
Note that this inequality is independent of possible boundary conditions imposed
on u.
Our purpose is now to describe some work by BONDESSON [1971, 1973] which
shows a similar result for a parabolic difference operator, and uses this to discuss the
convergence of solutions of parabolic difference equations in the interior of the
domain.
Let thus h and k be mesh sizes in x and t and assume that k/h^M = λ = constant. On
the corresponding mesh we consider a parabolic finite difference operator

(L_h U)^n = k^{−1}(B_h U^{n+1} − A_h U^n),

where A_h and B_h are operators of the form considered in Section 8, say. If, in
particular,

B_h v(x) = B_h(nk)v(x) = Σ_β b_β(x, nk, h) v(x − βh),

we require that the symbol of B_h(t),

B(x, t, ξ) = Σ_β b_β(x, t, 0) e^{−iβ·ξ},

is nonvanishing and that the difference operator is parabolic in the sense of John, or,

with c > 0,

|E(x, t, ξ)| = |B(x, t, ξ)^{−1} A(x, t, ξ)| ≤ 1 − c|ξ|^M
    for |ξ_j| ≤ π,  j = 1, ..., d.   (13.16)

The discussion below remains valid in the case of a system which is parabolic in
Petrovskii's sense, provided we demand that B(x, t, ξ), which is then a matrix, is
invertible and that in (13.16) the modulus of E(x, t, ξ) = B(x, t, ξ)^{−1}A(x, t, ξ) is
replaced by the spectral radius of this matrix.
For mesh functions we then define the norms

‖U‖_{h,p,Ω} = ( k h^d Σ_{(jh,nk)∈Ω} |U_j^n|^p )^{1/p}   for 1 ≤ p < ∞,

and

‖U‖_{h,p,s,Ω} = Σ_{Mα₀+|α|≤s} ‖∂_t^{α₀} ∂_x^α U‖_{h,p,Ω},   s ≥ 0,

where ∂_x^α = ∂_{x_1}^{α_1} ⋯ ∂_{x_d}^{α_d} is defined by means of forward difference quotients as in
Section 4.
We are now ready to state the a priori estimate analogous to (13.15).

THEOREM 13.1. Let L_h be a parabolic difference operator (in the sense of John) in
Ω. Then for any p, r with 1 ≤ r ≤ p ≤ ∞, any Ω₁ ⊂⊂ Ω₂ ⊂⊂ Ω, and any nonnegative integer
s there exists a constant C such that for any positive T, sufficiently small h, and any
mesh function U,

‖U‖_{h,p,M+s,Ω₁^T} ≤ C { ‖L_h U‖_{h,p,s,Ω₂^T} + ‖U‖_{h,r,Ω₂^T} }.


PROOF. The proof proceeds in several steps. In the first one shows, in the case of an
operator L_h with constant coefficients and no lower-order terms, that for mesh
functions vanishing outside a compact set one has, for 1 < p < ∞,

Σ_{|α|=M} ‖∂_x^α U‖_{h,p,Ω^T} ≤ C ‖L_h U‖_{h,p,Ω^T}.

This is done by Fourier methods, in the case p = 2 by Parseval's relation and in the
case p ≠ 2 using a variant of Michlin's multiplier theorem. One then concludes for
such L_h, by a discrete Friedrichs type inequality, that

Σ_{Mj+|α|≤M} ‖∂_t^j ∂_x^α U‖_{h,p,Ω^T} ≤ C ‖L_h U‖_{h,p,Ω^T},

and then, using a partition of unity with functions of small support, for a general L_h
with variable coefficients and lower-order terms, that

Σ_{Mj+|α|≤M} ‖∂_t^j ∂_x^α U‖_{h,p,Ω₁^T} ≤ C ( ‖L_h U‖_{h,p,Ω₂^T} + ‖U‖_{h,p,M−1,Ω₂^T} ).

A discrete interpolation type argument employing weighted norms then makes it
possible to reduce the order (M − 1) of the last term to zero, thus showing the desired

result for s = 0, r = p. The proof in the case r = p is then completed by induction over
s by applying the inequality obtained to difference quotients, and the result for r < p
then follows using a discrete Sobolev type inequality. □

We shall now turn to the application of this result to the convergence of solutions
of finite difference equations. We consider thus solutions of equations of the form

(L_h U)_j^n = (M_h f)_j^n,   (13.17)

where M_h is some operator of the form described in Section 3. We shall assume that
this equation approximates the differential equation (13.14) with order of accuracy
μ, so that for smooth solutions of (13.14),

L_h u − M_h f = O(h^μ)   as h → 0.   (13.18)

We shall only consider the difference equation (13.17) in some interior subdomain of
Ω. Our discussion therefore applies to solutions of any discrete boundary value
problem which uses equation (13.17) in the interior.
We shall also consider a differential operator

Q = Σ_{|α|≤q} q_α D^α

and a finite difference operator

Q_h U(x) = Σ_β q_β(h) U(x − βh),

which is consistent with Q and accurate of order μ, so that for any smooth function u,

Q_h u − Qu = O(h^μ)   as h → 0.
The following is now the main result of this part of the section.

THEOREM 13.2. Let L_h be a parabolic difference operator in Ω which is consistent with
the parabolic partial differential operator L there. Assume that equation (13.17)
approximates (13.14) with order of accuracy μ and that Q_h is a finite difference operator
which approximates the qth-order differential operator Q, also with order of accuracy
μ. Then, if 1 ≤ r < ∞, Ω₁ ⊂⊂ Ω₂ ⊂⊂ Ω, and if U and u are solutions of (13.17) and (13.14),
we have, for small h and real T,

‖Q_h U − Qu‖_{h,∞,Ω₁^T} ≤ C(u) h^μ + C ‖U − u‖_{h,r,Ω₂^T}.

PROOF. We sketch the steps of the proof. Since

‖(Q_h − Q)u‖_{h,∞,Ω₁^T} ≤ C(u) h^μ,

it suffices to estimate Q_h(U − u). By a discrete Sobolev type inequality one may show
that if p is large enough and Ω₁ ⊂⊂ Ω₃ ⊂⊂ Ω₄ ⊂⊂ Ω₂, then

‖Q_h(U − u)‖_{h,∞,Ω₁^T} ≤ C ‖U − u‖_{h,p,q+1,Ω₃^T}.

Now by Theorem 13.1, if s ≥ 0 is chosen so that M + s ≥ q + 1, then

‖U − u‖_{h,p,q+1,Ω₃^T} ≤ C { ‖L_h(U − u)‖_{h,p,s,Ω₄^T} + ‖U − u‖_{h,r,Ω₄^T} }.

Here, since (13.17) approximates (13.14) with order of accuracy μ we have, by (13.18),

L_h(U − u) = M_h f − L_h u = O(h^μ)   as h → 0,

and this holds true, in particular, in the norm ‖·‖_{h,p,s,Ω₄^T}. Together these estimates
show the result. □

The theorem thus asserts that if u is a solution in Ω of any boundary value problem
for (13.14) and U is a solution of a corresponding discrete problem which uses (13.17)
in the interior, and if it is known that for some r and some Ω₂ ⊂⊂ Ω,

‖U − u‖_{h,r,Ω₂} = O(h^μ)   as h → 0,

then with Q_h and Q as stated we may conclude that Q_h U tends uniformly to Qu with
rate O(h^μ) as h → 0. In a typical case we might have obtained an interior L_2 type
convergence result, for example by the energy method, and may then conclude
maximum-norm convergence of Q_h U to Qu. The particular case when Q_h and Q are
both the identity operator shows uniform convergence in the interior.

We cannot leave our survey of the finite difference method in the context of
parabolic problems without mentioning its use in existence theory. We shall
therefore close this section by sketching an example of this given in PETROWSKI
[1955], and consider an initial boundary value problem in one space dimension with
nonvertical lateral boundaries.
Let thus φ_j ∈ C[0, T], j = 1, 2, be two given functions with φ₁(t) < φ₂(t) for
t ∈ [0, T], and consider the initial boundary value problem

∂u/∂t = ∂²u/∂x²   for φ₁(t) < x < φ₂(t),  0 < t ≤ T,

u(φ_j(t), t) = g_j(t)   for 0 < t ≤ T,  j = 1, 2,   (13.19)

u(x, 0) = v(x)   for φ₁(0) ≤ x ≤ φ₂(0),

where g_j, j = 1, 2, and v are the given data of the problem.
We denote by Q the domain under consideration, i.e. Q = {(x, t); φ₁(t) < x < φ₂(t),
0 < t < T}, and by Γ its parabolic boundary, Γ = ∂Q \ {(x, T); φ₁(T) < x < φ₂(T)}. We
now impose a mesh (jh, nk), where h and k are the mesh widths in space and time, with
k/h² = λ kept constant. Let Q̄_h denote the union of the closed mesh squares which
belong to Q̄, Γ_h those mesh squares which have at least one point on ∂Q̄_h, excluding,
however, squares with one side on t = T and with vertical sides not on ∂Q̄_h, and
finally Q_h = Q̄_h \ Γ_h (see Fig. 13.1).
For each mesh point P of Γ_h we choose a point P̄ on Γ of minimal distance from P.
We may then pose the discrete problem

∂_t U_j^n = ∂_x ∂̄_x U_j^n   for (jh, nk) ∈ Q_h,   (13.20)

U_j^n = u(P̄)   for (jh, nk) ∈ Γ_h,

FIG. 13.1. Γ_h = shaded squares; Q_h = white squares.

where the value of u(P̄) equals the appropriate value of g₁, g₂ or v, depending on the
location of P̄.
One notices immediately, as in the proof of Theorem 11.1, that the maximum and
minimum of a solution of (13.20) are attained on Γ_h. Hence, in particular, this
problem has at most one solution, since the difference between two solutions
vanishes on Γ_h and hence in Q_h. Because (13.20) has as many equations as unknowns,
the uniqueness implies the existence of a solution for any choice of the u(P̄). It follows
also that |U| is bounded in Q_h, independently of h, by the maximum of |u| on Γ.
We shall now extend U to a function u_h defined on all of Q̄. For this purpose we
divide each of the mesh squares into two triangles by means of the straight lines
x/h + t/k = j, and define u_h in each of the triangles by linear interpolation from the
values at the mesh points. The definition is completed by extending u_h to Q̄ \ Q̄_h as
a continuous function without increasing the maximum of |u_h|.
Let Q⁰ ⊂⊂ Q, i.e. let Q⁰ be such that Q̄⁰ ⊂ Q. Since U is uniformly bounded in Q_h
we find by Theorem 13.1, together with a discrete Sobolev inequality, that all
difference quotients of U are bounded in Q⁰ for small h. Using this fact for the first-
order difference quotients one finds easily that there is a constant C independent of
h such that

|u_h(x₁, t₁) − u_h(x₂, t₂)| ≤ C(|x₁ − x₂| + |t₁ − t₂|)   for (x_j, t_j) ∈ Q⁰,  j = 1, 2.

In particular, the family {u_h} is equicontinuous on Q⁰. The Arzelà–Ascoli theorem
therefore permits us to extract a subsequence which converges uniformly in Q⁰.
Applying a similar reasoning to {∂_t U} and {∂_x U} we find that these may also be
extended to uniformly bounded equicontinuous families. Hence it is possible to find
a subsequence of {u_h} such that the corresponding three families converge to ũ, v and
180 V. Thomee CHAPTER III

w and one may easily show that

v=- w=-
at' ax
Let now {Q^j} be a sequence of domains with Q^j ⊂⊂ Q^{j+1} and ⋃_j Q^j = Q. The above argument may be applied to each of the Q^j, and by a diagonal procedure it is then possible to show that there exists a sequence {h_m}_{m=1}^∞, with h_m → 0 as m → ∞, and a function ū ∈ C(Q) such that the corresponding extensions of {U}, {∂_t U} and {∂_x U} converge to ū, ∂ū/∂t and ∂ū/∂x, uniformly on each compact subset of Q. Using the finite difference equation in (13.20) one also shows that these functions satisfy

    ∂ū/∂x (x₂, t) − ∂ū/∂x (x₁, t) = ∫_{x₁}^{x₂} ∂ū/∂t (x, t) dx,

which implies that ∂²ū/∂x² exists in Q and that ū is a solution of the heat equation in (13.19).
It remains to consider the behavior of ū near Γ. Consider first an inner point (x₀, 0) on Γ₀ = Γ ∩ {t = 0}. We want to show

    lim_{(x,t)→(x₀,0)} ū(x, t) = v(x₀).     (13.21)

For this purpose, set

    ω(x, t) = (x − x₀)² + 3t.

This function is positive in Q̄ except at (x₀, 0) and

    L_h ω = ∂_t ω − ∂_x ∂̄_x ω = 3 − 2 = 1 > 0.


Let ε > 0 be arbitrarily small and let V_ε be a neighborhood of (x₀, 0) such that for h small enough

    |U_j^n − v(x₀)| < ε   for (jh, nk) ∈ V_ε ∩ Γ_h.

(Recall that, for these points, U_j^n = v(P̄) for a suitable P̄ ∈ Γ₀.) Let further K be a constant such that

    2 sup_Γ |u| ≤ K ω(x, t)   for (x, t) ∈ Q̄ \ V_ε,

which is possible since ω is positive on the closed set Q̄ \ V_ε. We shall show that
    v(x₀) − ε − Kω(x, t) ≤ u_h(x, t) ≤ v(x₀) + ε + Kω(x, t)     (13.22)

for (x, t) = (jh, nk) ∈ Q̄_h.
Assuming this for a moment, we have, since the mesh points are dense in Q̄ as h and k tend to zero, that the same inequality holds with u_h(x, t) replaced by ū(x, t), and
hence

    v(x₀) − ε ≤ lim inf_{(x,t)→(x₀,0)} ū(x, t) ≤ lim sup_{(x,t)→(x₀,0)} ū(x, t) ≤ v(x₀) + ε,

whence (13.21) follows.


To show the left-hand side inequality in (13.22) we set φ = u_h − v(x₀) + ε + Kω and note that L_h φ = K L_h ω > 0 in Q_h, and hence that the minimum of φ is attained on the boundary Γ_h. Now at the points of both V_ε ∩ Γ_h and Γ_h \ V_ε we see by our definitions that φ ≥ 0, and hence we find that φ ≥ 0 in Q̄_h, which is the desired conclusion. The right-hand side inequality is shown analogously.
In order to demonstrate the corresponding result for the left- and right-hand boundaries of Q one shows that if, for P̄ = (x̄, t̄) a point on this part of the boundary, there exists a function ω_P̄(x, t), defined in a neighborhood W of P̄, which vanishes at P̄ but is otherwise positive, and such that L_h ω_P̄ ≥ 0 at the points of Q_h ∩ W for small h, then, with j = 1 or 2 depending on the location of P̄,

    lim_{(x,t)→P̄} ū(x, t) = g_j(P̄).

The proof of this fact uses the same ideas as above, with ω_P̄ playing the role of ω.
The result thus depends on the existence of a function ω_P̄ with the above properties, a so-called barrier. This in turn will depend on the regularity of the functions φ₁(t) and φ₂(t) defining the boundary. One may show that Lipschitz continuity is a sufficient condition for this and that, for instance, for the left-hand boundary,

    ω_P̄(x, t) = |(x − x̄) − κ(t − t̄)|^θ,   0 < θ < 1,

is a barrier, if κ is larger than the Lipschitz constant of φ₁(t).


Under the assumptions made we have thus shown that the family of solutions {u_h} contains a subsequence {u_{h_m}}, where h_m → 0 as m → ∞, which converges to a solution ū of (13.19) in the sense described. Since the solution of (13.19) is unique, it may be seen that, in fact, any such sequence {u_{h_m}} converges to ū without the extraction of a subsequence.
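The convergence statement can be illustrated numerically; the following sketch assumes a rectangular domain with the exact separated solution u(x, t) = e^{−π²t} sin πx, which is not an example from the text, and uses the fixed mesh ratio k = h²/4.

```python
import numpy as np

# Hedged illustration: forward Euler solutions u_h on (0,1) x (0,T]
# approach u(x,t) = exp(-pi^2 t) sin(pi x) as h -> 0.
def solve(h, T=0.1):
    lam = 0.25                            # k = lam * h^2, well within stability
    k = lam * h * h
    x = np.linspace(0.0, 1.0, round(1/h) + 1)
    U = np.sin(np.pi * x)
    nsteps = round(T / k)
    for _ in range(nsteps):
        U[1:-1] += lam * (U[2:] - 2*U[1:-1] + U[:-2])
        U[0] = U[-1] = 0.0
    t = nsteps * k
    return np.abs(U - np.exp(-np.pi**2 * t) * np.sin(np.pi * x)).max()

errors = [solve(h) for h in (0.1, 0.05, 0.025)]
assert errors[0] > errors[1] > errors[2]  # errors decrease as h is refined
```

Here the maximum error behaves like O(h²) at fixed mesh ratio, consistent with convergence of the whole family {u_h} rather than only of a subsequence.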
References

AKOPYAN, Yu.R. and L.A. OGANESYAN (1977), A variational-difference method for solving two-dimensional linear parabolic equations, Zh. Vychisl. Mat. i Mat. Fiz. 17, 109-118 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 17, 101-111.)
ALBRECHT, J. (1957), Zum Differenzenverfahren bei parabolischen Differentialgleichungen, Z. Angew.
Math. Mech. 37, 202-212.
ANDERSSEN, R.S. (1968), On the reliability of finite difference representations, Austral. Comput. J. 1 (3).
ANSORGE, R. and R. HASS (1970), Konvergenz von Differenzenverfahren fur lineare und nichtlineare
Anfangswertaufgaben, Lecture Notes in Mathematics 159 (Springer, Berlin).
ARONSON, D.G. (1963a), The stability of finite difference approximations to second order linear parabolic
differential equations, Duke Math. J. 30, 117-128.
ARONSON, D.G. (1963b), On the stability of certain finite difference approximations to parabolic systems
of differential equations, Numer. Math. 5, 118-137.
ASCHER, M. (1960), Explicit solutions of the one-dimensional heat equation for a composite wall, Math.
Comp. 14, 346-353.
ASTRAKHANTSEV, G.P. (1971), A finite difference method for solving the third boundary value problem for
elliptic and parabolic equations in an arbitrary domain. Iterative solution of difference equations, 2,
Zh. Vychisl. Mat. i Mat. Fiz. 11, 677-687 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 11,
168-182.)
BABUSKA, I., M. PRAGER and E. VITASEK (1966), Numerical Processes in Differential Equations (Wiley,
London).
BATTEN, G.W. (1963), Second order correct boundary conditions for the numerical solution of the mixed
boundary problem for parabolic equations, Math. Comp. 17, 405-413.
BIRTWISTLE, G.M. (1968), The explicit solution of the equation of heat conduction, Comput. J. 11,
317-323.
BONDESSON, M. (1971), An interior a priori estimate for parabolic difference operators and an
application, Math. Comp. 25, 43-58.
BONDESSON, M. (1973), Interior a priori estimates in discrete LP norms for solutions of parabolic and
elliptic difference equations, Ann. Mat. Pura Appl. 95, 1-43.
BRENNER, P., V. THOMÉE and L.B. WAHLBIN (1975), Besov Spaces and Applications to Difference Methods for Initial Value Problems, Lecture Notes in Mathematics 434 (Springer, Berlin).
CAMPBELL, C.M. and P. KEAST (1968), The stability of difference approximations to a self-adjoint
parabolic equation under derivative boundary conditions, Math. Comp. 22, 336-346.
CARASSO, A. (1969), Finite difference methods and the eigenvalue problem for nonselfadjoint Sturm-
Liouville operators, Math. Comp. 23, 717-729.
CARASSO, A. (1970), A posteriori bounds in the numerical solution of mildly nonlinear parabolic
equations, Math. Comp. 24, 785-792.
CARASSO, A. (1971), Long-range numerical solution of mildly non-linear parabolic equations, Numer.
Math. 16, 304-321.
CARASSO, A. and S.V. PARTER (1970), An analysis of "boundary-value techniques" for parabolic problems, Math. Comp. 24, 315-340.
CIMENT, M., S.H. LEVENTHAL and B.C. WEINBERG (1978), The operator compact implicit method for
parabolic equations, J. Comput. Phys. 28, 138-166.
COLLATZ, L. (1955), Numerische Behandlung von Differentialgleichungen (Springer, Berlin, 2nd ed.).


COURANT, R., K.O. FRIEDRICHS and H. LEWY (1928), Über die partiellen Differenzengleichungen der mathematischen Physik, Math. Ann. 100, 32-74.
CRANDALL, S.H. (1955), An optimum implicit recurrence formula for the heat conduction equation,
Quart. Appl. Math. 13, 318-320.
CRANK, J. and P. NICOLSON (1947), A practical method for numerical integration of solution of partial
differential equations of heat-conduction type, Proc. Cambridge Philos. Soc. 43, 50-67.
DOUGLAS Jr, J. (1955), On the numerical integration of ∂²u/∂x² + ∂²u/∂y² = ∂u/∂t by implicit methods, J. SIAM 3, 42-65.
DOUGLAS Jr, J. (1956a), On the errors in analogue solutions of heat conduction problems, Quart. Appl.
Math. 14, 333-335.
DOUGLAS Jr, J. (1956b), On the numerical integration of quasi-linear parabolic differential equations,
Pacific J. Math. 6, 35-42.
DOUGLAS Jr, J. (1956c), On the relation between stability and convergence in the numerical solution of
linear parabolic and hyperbolic equations, J. SIAM 4, 20-37.
DOUGLAS Jr, J. (1956d), The solution of the diffusion equation by a high order correct difference equation,
J. Math. Phys.35, 145-151.
DOUGLAS Jr, J. (1957), A note on the alternating direction implicit method for the numerical solution of
heat flow problems, Proc. Amer. Math. Soc. 8, 409-412.
DOUGLAS Jr, J. (1958), The application of stability analysis in the numerical solution of quasi-linear
parabolic differential equations, Trans. Amer. Math. Soc. 89, 484-518.
DOUGLAS Jr, J. (1959), The effect of round-off error in the numerical solution of the heat equation, J.
Math. Phys. 31, 35-41.
DOUGLAS Jr, J. (1960), A numerical method for a parabolic system, Numer. Math. 2, 91-98.
DOUGLAS Jr, J. (1961a), A survey of numerical methods for parabolic differential equations, in: F.C. ALT,
ed., Advances in Computers 2 (Academic Press, New York) 1-54.
DOUGLAS Jr, J. (1961b), On incomplete iteration for implicit parabolic difference equations, J. SIAM 9,
433-439.
DOUGLAS Jr, J. and T.M. GALLIE Jr(1955), Variable time steps in the solution of the heat flow equation by
a difference equation, Proc. Amer. Math. Soc. 6, 787-793.
DOUGLAS Jr, J. and J.E. GUNN (1962), Alternating direction methods for parabolic systems in m space variables, J. Assoc. Comput. Mach. 9, 450-456.
DOUGLAS Jr, J. and J.E. GUNN (1963), Two high-order correct difference analogues for the equation of
multidimensional heat flow, Math. Comp. 17, 71-80.
DOUGLAS Jr, J. and J.E. GUNN (1964), A general formulation of alternating direction methods, Part I.
Parabolic and hyperbolic problems, Numer. Math. 6, 428-453.
DOUGLAS Jr, J. and B.F. JONES (1963), On predictor-corrector methods for nonlinear parabolic
differential equations, J. SIAM 11, 195-204.
DOUGLAS Jr, J. and C.M. PEARCY (1963), On convergence of alternating direction procedures in the
presence of singular operators, Numer. Math. 5, 175-184.
DOUGLAS Jr, J. and H.H. RACHFORD (1956), On the numerical solution of heat conduction problems in
two and three space variables, Trans. Amer. Math. Soc. 82, 421-439.
Du FORT, E.C. and S.P. FRANKEL (1953), Stability conditions in the numerical treatment of parabolic
differential equations, Math. Tables Aids Comput. 7, 135-152.
EISEN, D. (1966), Stability and convergence of finite difference schemes with singular coefficients, SIAM J.
Numer. Anal. 3, 545-552.
EISEN, D. (1967a), The equivalence of stability and convergence for finite difference schemes with singular
coefficients, Numer. Math. 10, 20-29.
EISEN, D. (1967b), On the numerical solution of u_t = u_rr + (2/r)u_r, Numer. Math. 10, 397-409.
EVANS, D.J. (1965), A stable explicit method for the finite-difference solution of a fourth-order parabolic
partial differential equation, Computer J. 8, 280-287.
FAIRWEATHER, G., A.R. GOURLAY and A.R. MITCHELL (1967), Some high accuracy difference schemes
with a splitting operator for equations of parabolic and elliptic type, Numer. Math. 10, 56-66.
FORSYTHE, G.E. and W.R. WASOW (1960), Finite Difference Methods for Partial Differential Equations (Wiley, New York).

FOX, L. (1962), Numerical Solution of Ordinary and Partial Differential Equations (Pergamon, Oxford).


FRANKLIN, J.N. (1958), Numerical stability in digital and analog computation for diffusion problems, J.
Math. Phys. 37, 305-315.
FRIEDMAN, A. (1964), Partial Differential Equations of Parabolic Type (Prentice-Hall, Englewood Cliffs, NJ).
FRYAZINOV, I.V. and M.I. BAKIROVA (1972), Economical difference schemes for solving the heat
conduction equation in polar, cylindrical and spherical coordinates, Zh. Vychisl. Mat. i Mat. Fiz. 12,
352-363 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 12, 87-100.)
GEKELER, E. (1975a), A-convergence of finite difference approximations of parabolic initial-boundary
value problems, SIAM J. Numer. Anal. 12, 1-12.
GEKELER, E. (1975b), Entwicklung nach Eigenvektoren beim Verfahren von Du Fort und Frankel, Z. Angew. Math. Mech. 55, T238-T240.
GELFAND, I.M. and G.E. SCHILOW (1964), Verallgemeinerte Funktionen III (Deutscher Verlag der
Wissenschaften, Berlin).
GODEV, K.N. and R.D. LAZAROV (1984), Error estimates of finite difference schemes in L,-metrics for
parabolic boundary value problems, C.R. Acad. Bulgare Sci. 37, 565-568.
GODUNOV, S.K. and V.S. RYABENKII (1963), Special criteria of stability of boundary value problems for
non-selfadjoint difference equations, Uspekhi Mat. Nauk. 18, 3-14 (in Russian). (Russian Math. Surveys
18 (3), 1-12.)
GODUNOV, S.K. and V.S. RYABENKII (1964), Introduction to the Theory of Difference Schemes
(Interscience, New York).
GORENFLO, R. (1971a), Differenzenschemata monotoner Art für schwach gekoppelte Systeme parabolischer Differentialgleichungen mit gemischten Randbedingungen, Computing 8, 343-362.
GORENFLO, R. (1971b), Differenzenschemata monotoner Art für lineare parabolische Randwertaufgaben, Z. Angew. Math. Mech. 51, 595-610.
GORENFLO, R. (1971c), On difference schemes for parabolic differential equations with derivative
boundary conditions, in: J. LI. Morris, ed., Conference on Applications of Numerical Analysis, Lecture
Notes in Mathematics 228, (Springer, Berlin) 57-69.
GREENSPAN, D. (1968), Lectures on the Numerical Solution of Linear,Singular, and Nonlinear Differential
Equations(Prentice-Hall, Englewood Cliffs, NJ).
GUSTAFSSON, B., H.O. KREISS and A. SUNDSTRÖM (1972), Stability theory for difference approximations of mixed initial boundary value problems, II, Math. Comp. 26, 649-686.
HAKBERG, B. (1970), Uniformly maximumnorm stable difference schemes, BIT 10, 266-276.
HANSSON, P.M. and J.E. WALSH (1984), Asymptotic theory of the global error and some technique of
error estimation, Numer. Math. 45, 51-74.
HEDSTROM, G.W. (1968), The rate of convergence of some difference schemes, SIAM J. Numer. Anal. 5,
363-406.
HENRICI, P. (1962), Discrete Variable Methods in Ordinary Differential Equations(Wiley, New York).
HILDEBRAND, F.B. (1952), On the convergence of numerical solutions of the heat-flow equation, J. Math.
Phys. 31, 35-41.
HUBBARD, B. (1966), Some locally one-dimensional difference schemes for parabolic equations in an
arbitrary region, Math. Comp. 20, 53-59.
IONKIN, N.I. and Yu.I. MOKIN (1974), The parabolicity of difference schemes, Zh. Vychisl. Mat. i Mat. Fiz.
14 (2), 402-417 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 14 (2), 125-140.)
ISAACSON, E. (1961), Error estimates for parabolic equations, Comm. Pure Appl. Math. 14, 381-389.
JANENKO, N.N. (1969), Die Zwischenschrittmethode zur Lösung mehrdimensionaler Probleme der mathematischen Physik (Springer, Berlin).
JOHN, F. (1952), On the integration of parabolic equations by difference methods, I: Linear and quasi-
linear equations for the infinite interval, Comm. Pure Appl. Math. 5, 155-211.
JOUBERT, G.R. (1971), Explicit difference approximations of Du Fort-Frankel type of the one-
dimensional diffusion equation, Numer. Math. 18, 18-25.
JUNCOSA, M.L. and D.M. YOUNG (1953), On the order of convergence of a difference equation to a
solution of the diffusion equation, J. SIAM 1, 111-135.
JUNCOSA, M.L. and D.M. YOUNG (1954), On the convergence of a solution of a difference equation to a

solution of the equation of diffusion, Proc. Amer. Math. Soc. 5,168-174.


JUNCOSA, M.L. and D.M. YOUNG (1957), On the Crank-Nicolson procedure for solving parabolic partial
differential equations, Proc. CambridgePhilos. Soc. 53, 448-461.
KEAST, P. and A.R. MITCHELL (1966), On the instability of the Crank-Nicholson formula under
derivative boundary conditions, Computer J. 9, 110-114.
KEAST, P. and A.R. MITCHELL (1967), Finite difference solution of the third boundary problem in elliptic
and parabolic equations, Numer. Math. 10, 67-75.
KRAWCZYK, R. (1963), Über Differenzenverfahren bei parabolischen Differentialgleichungen, Arch. Rational Mech. Anal. 13, 81-121.
KREISS, H.O. (1959a), Über die Differenzapproximation hoher Genauigkeit bei Anfangswertproblemen für partielle Differentialgleichungen, Numer. Math. 1, 186-202.
KREISS, H.O. (1959b), Über Matrizen die beschränkte Halbgruppen erzeugen, Math. Scand. 7, 71-80.
KREISS, H.O. (1959c), Über die Lösung des Cauchyproblems für lineare partielle Differentialgleichungen mit Hilfe von Differenzengleichungen, Acta Math. 101, 179-199.
KREISS, H.O. (1960), Über die Lösung von Anfangsrandwertaufgaben für partielle Differentialgleichungen mit Hilfe von Differenzengleichungen, Trans. Roy. Inst. Technol. Stockholm 166.
KREISS, H.O. (1962), Über die Stabilitätsdefinition für Differenzengleichungen die partielle Differentialgleichungen approximieren, Nordisk Tidskr. Informationsbehandling 2, 153-181.
KREISS, H.O. (1963), Über implizite Differenzmethoden für partielle Differentialgleichungen, Numer. Math. 5, 24-47.
KREISS, H.O. (1968), Stability theory of difference approximations for mixed initial boundary value problems, I, Math. Comp. 22, 703-714.
KREISS, H.O., V. THOMÉE and O.B. WIDLUND (1970), Smoothing of initial data and rates of convergence for parabolic difference equations, Comm. Pure Appl. Math. 23, 241-259.
LAASONEN, P. (1949), Über eine Methode zur Lösung der Wärmeleitungsgleichung, Acta Math. 81, 309-317.
LADYZENSKAJA, O.A., V.A. SOLONNIKOV and N.N. URAL'CEVA (1968), Linear and Quasilinear Equations of Parabolic Type (American Mathematical Society, Providence, RI; Nauka, Moscow, 1967).
LAX, P.D. and R.D. RICHTMYER (1956), Survey of the stability of linear finite difference equations, Comm.
Pure Appl. Math. 9, 267-293.
LAZAROV, R.D. (1982), Convergence estimates for difference schemes for parabolic equations to generalized solutions, C.R. Acad. Bulgare Sci. 35 (1), 7-10 (in Russian).
LEES, M. (1959), Approximate solution of parabolic equations, J. SIAM 7, 167-183.
LEES, M. (1960a), A priori estimates for the solution of difference approximations to parabolic differential
equations, Duke Math. J. 27, 297-311.
LEES, M. (1960b), Energy inequalities for the solution of differential equations, Trans Amer. Math.Soc. 94,
58-73.
LEES, M. (1961), Alternating direction and semi-explicit difference methods for parabolic partial
differential equations, Numer. Math. 3, 398-412.
LEUTERT, W. (1951), On the convergence of approximate solutions of the heat equation to the exact
solution, Proc. Amer. Math. Soc. 2, 433-439.
LEUTERT, W. (1952), On the convergence of unstable approximate solutions of the heat equation to the
exact solution, J. Math. Phys. 30, 245-251.
LOFSTROM, J. (1970), Besov spaces in theory of approximation, Ann. Mat. Pura Appl. 85, 93-184.
LOTKIN, M. (1958), The numerical integration of heat conduction equations, J. Math. Phys. 37, 178-187.
MIGNOT, N. (1953), Sur les solutions numériques du problème de la chaleur, C.R. Acad. Sci. Paris Sér. I Math. 236 (25), 2375-2377.
MILLER, J. (1969), The construction of unbalanced difference operators for parabolic initial boundary
value problems, SIAM J. Numer. Anal. 6, 476-479.
MITCHELL, A.R. (1969), Computational Methods in Partial Differential Equations (Wiley, London).
MITCHELL, A.R. and D.F. GRIFFITHS (1980), The Finite Difference Method in Partial Differential Equations (Wiley, London).
MOKIN, Yu.I. (1975), Two-layer parabolic and weakly parabolic difference schemes, Zh. Vychisl. Mat. i Mat. Fiz. 15 (3), 661-671 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 15 (3), 111-121.)

MONIEN, B. (1970), Über die Konvergenzordnung von Differenzenverfahren, die parabolische Anfangswertaufgaben approximieren, Computing 5, 221-245.
NORDMARK, S. (1974), Uniform stability of a class of parabolic difference operators, BIT 14, 314-325.
O'BRIEN, G.G., M.A. HYMAN and S. KAPLAN (1951), A study of the numerical solution of partial
differential equations, J. Math. Phys. 29, 223-251.
OSBORNE, M.R. (1969), The numerical solution of the heat conduction equation subject to separated
boundary conditions, Comput. J. 12, 82-87.
OSHER, S. (1969), Maximum norm stability for parabolic difference schemes in half-space, in: Hyperbolic
Equationsand Waves (Springer, New York) 61-75.
OSHER, S. (1970), Mesh refinements for the heat equation, SIAM J. Numer. Anal. 7, 199-205.
OSHER, S. (1972), Stability of parabolic difference approximations to certain mixed initial boundary value problems, Math. Comp. 26, 13-39.
PARKER, I.B. and J. CRANK (1964), Persistent discretization errors in partial differential equations of
parabolic type, Comput. J. 7, 163-167.
PEACEMAN, D.W. and H.H. RACHFORD Jr (1955), The numerical solution of parabolic and elliptic
differential equations, J. SIAM 3, 28-41.
PEETRE, J. and V. THOMEE (1967), On the rate of convergence for discrete initial value problems, Math.
Scand. 21, 159-176.
PETROVSKII, I.G. (1937), Über das Cauchysche Problem für Systeme von partiellen Differentialgleichungen, Rec. Math. (Math. Sb.) 2, 814-868.
PETROWSKI, I.G. (1953), Vorlesungen über partielle Differentialgleichungen (Teubner, Leipzig).
POLITCHKA, A.E. and P.E. SOBOLEVSKII (1976), New Lp estimates for parabolic difference problems, Zh.
Vychisl. Mat. i Mat. Fiz. 16 (5), 65-74 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 16 (5),
1155-1163.)
RAVIART, P.A. (1967), Sur l'approximation de certaines équations d'évolution linéaires et non linéaires, J. Math. Pures Appl. 46 (1), 11-107; 46 (2), 109-183.
RICHTMYER, R.D. and K.W. MORTON (1967), Difference Methods for Initial-Value Problems (Interscience, New York).
ROSE, M.E. (1956), On the integration of non-linear parabolic equations by implicit difference methods,
Quart. Appl. Math. 14, 237-248.
ROTHE, E. (1931), Wärmeleitungsgleichung mit nichtkonstanten Koeffizienten, Math. Ann. 104, 340-362.
RYABENKII, V.S. and A.F. FILIPPOV (1960), Über die Stabilität von Differenzengleichungen (Deutscher Verlag der Wissenschaften, Berlin).
SAMARSKII, A.A. (1961a), A priori estimates for the solution of the difference analogue of a parabolic
differential equation, Zh. Vychisl. Mat. i Mat. Fiz. 1, 441-460 (in Russian). (U.S.S.R. Comput. Math. and
Math. Phys. 1, 487-512.)
SAMARSKII, A.A. (1961b), A priori estimates for difference equations, Zh. Vychisl. Mat. i Mat. Fiz. 1,
972-1000 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 1, 1138-1167.)
SAMARSKII, A.A. (1962a), On the convergence and accuracy of homogeneous difference schemes for one-
dimensional and multidimensional parabolic equations, Zh. Vychisl. Mat. i Mat. Fiz. 2, 603-634 (in
Russian). (U.S.S.R. Comput. Math. and Math. Phys. 2, 654-696.)
SAMARSKII, A.A. (1962b), On an economical difference method for the solution of a multidimensional
parabolic equation in an arbitrary region, Zh. Vychisl. Mat. i Mat. Fiz. 2, 787-811 (in Russian).
(U.S.S.R. Comput. Math. and Math. Phys. 2, 894-926.)
SAMARSKII, A.A. (1963), Homogeneous difference schemes on non-uniform meshes for parabolic
equations, Zh. Vychisl. Mat. i Mat. Fiz. 3, 351-393 (in Russian). (U.S.S.R. Comput. Math. and Math.
Phys. 3, 266-298.)
SAMARSKII, A.A. (1964a), An accurate high-order difference system for a heat conductivity equation with
several space variables, Zh. Vychisl. Mat. i Mat. Fiz. 4, 161-165 (in Russian). (U.S.S.R. Comput. Math.
and Math. Phys. 4, 222-228.)
SAMARSKII, A.A. (1964b), Economical difference schemes for parabolic equations with mixed derivatives,
Zh. Vychisl. Mat. i Mat. Fiz. 4, 182-191 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 4, 753-
759.)
SAMARSKII, A.A. (1971), Introduction to the Theory of Difference Schemes (Nauka, Moscow) (in Russian).

SAMARSKII, A.A. and I.V. FRYAZINOV (1961), On the convergence of homogeneous difference schemes for
a heat-conduction equation with discontinuous coefficients, Zh. Vychisl. Mat. i Mat. Fiz. 1, 806-824 (in
Russian). (U.S.S.R. Comput. Math. and Math. Phys. 1, 962-982.)
SAMARSKII, A.A. and A.V. GULIN (1973), Stability of Difference Schemes (Nauka, Moscow) (in Russian).
SAUL'EV, V.K. (1964), Integrationof Equations of ParabolicType by the Method of Nets (Pergamon Press,
Oxford).
SERDJUKOVA, S.J. (1964), The uniform stability with respect to the initial data of a sixpoint symmetrical
scheme for the heat conduction equation, in: Numerical Methods for the Solution of Differential
Equations and Quadrature Formulae(Nauka, Moscow) 212-216.
SERDJUKOVA, S.J. (1967), The uniform stability of a sixpoint scheme of increased order of accuracy for the
heat equation, Zh. Vychisl. Mat. i Mat. Fiz. 7, 214-218 (in Russian). (U.S.S.R. Comput. Math. and
Math. Phys. 7, 297-303.)
SHILOV, G.E. (1955), On the correctness of Cauchy problems for systems of partial differential equations
with constant coefficients, Uspekhi Mat. Nauk. 10, 89-100 (in Russian).
SMITH, G.D. (1965), Numerical Solution of Partial Differential Equations (Oxford University Press,
London).
STETTER, H.J. (1959), Anwendung des Äquivalenzsatzes von P. Lax auf inhomogene Probleme, Z. Angew. Math. Mech. 39, 396-397.
STRANG, W.G. (1959), On the order of convergence of the Crank-Nicolson procedure, J. Math. Phys. 38,
141-144.
STRANG, W.G. (1960), Difference methods for mixed boundary value problems, Duke Math. J. 27, 221-232.
STRANG, G. (1963), Accurate partial difference methods I: Linear Cauchy problems, Arch. Rational Mech. Anal. 13, 392-402.
STRANG, G. (1964a), Unbalanced polynomials and difference methods for mixed problems, SIAM J. Numer. Anal. 2, 46-51.
STRANG, G. (1964b), Wiener Hopf difference equations, J. Math. Mech. 13, 85-96.
STRANG, G. (1966), Implicit difference methods for initial boundary value problems, J. Math. Anal. Appl.
16, 188-198.
SUNAUCHI, H. (1968), Perturbation theory of difference schemes, Numer. Math. 12, 454-458.
TAYLOR, P.J. (1970), The stability of the Du Fort-Frankel method for the diffusion equation with boundary conditions involving space derivatives, Computer J. 13, 92-97.
THOMÉE, V. (1964), Elliptic difference operators and Dirichlet's problem, Contribut. Differential Equations 3, 301-324.
THOMÉE, V. (1965), Stability of difference schemes in the maximum-norm, J. Differential Equations 1, 273-292.
THOMÉE, V. (1966a), On maximum-norm stable difference operators, in: J.H. Bramble, ed., Numerical Solution of Partial Differential Equations (Academic Press, New York) 125-151.
THOMÉE, V. (1966b), Parabolic difference operators, Math. Scand. 19, 77-107.
THOMÉE, V. (1967), Generally unconditionally stable difference operators, SIAM J. Numer. Anal. 4, 55-69.
THOMÉE, V. (1969), Stability theory for partial difference operators, SIAM Rev. 11, 152-195.
THOMÉE, V. (1984), Galerkin Finite Element Methods for Parabolic Problems, Lecture Notes in Mathematics 1054 (Springer, Berlin).
THOMÉE, V. and L.B. WAHLBIN (1974), Convergence rates of parabolic difference schemes for non-smooth data, Math. Comp. 28, 1-13.
THOMPSON, R.J. (1964), Difference approximations for inhomogeneous and quasi-linear equations, J.
SIAM 12, 189-199.
TIKHONOV, A.N. and A.A. SAMARSKII (1961), Homogeneous difference schemes, Zh. Vychisl. Mat. i Mat.
Fiz. 1, 5-63 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 1, 5-67.)
VARAH, J.M. (1970), Maximum norm stability of difference approximations to the mixed initial
boundary-value problem for the heat equation, Math. Comp. 24, 31-44.
VARGA, R.S. (1961), On high order stable implicit methods for solving parabolic partial differential
equations, J. Math. and Phys. 40, 220-231.

VARGA, R.S. (1962), Matrix Iterative Analysis (Prentice-Hall, Englewood Cliffs, NJ).
WASOW, W. (1958), On the accuracy of implicit difference approximations to equation of heat flow, Math.
Tables Aids Comput. 12, 43-55.
WEINELT, W., R.D. LAZAROV and U. STREIT (1984), On the order of convergence of difference schemes for weak solutions of the heat conduction equation in nonisotropic nonhomogeneous media, Differencial'nye Uravnenija 20, 1144-1151 (in Russian).
WIDLUND, O.B. (1965), On the stability of parabolic difference schemes, Math. Comp. 19, 1-13.
WIDLUND, O.B. (1966), Stability of parabolic difference schemes in the maximum norm, Numer. Math. 8,
186-202.
WIDLUND, O.B. (1970a), On the rate of convergence for parabolic difference schemes, I, in: Numerical Solution of Field Problems in Continuum Physics, SIAM-AMS Proceedings II (American Mathematical Society, Providence, RI) 60-73.
WIDLUND, O.B. (1970b), On the rate of convergence for parabolic difference schemes II, Comm. PureAppl.
Math. 23, 79-96.
List of Symbols

‖·‖ = norm in Banach space B,

‖v‖_{W_p^m} = Σ_{|α|≤m} ‖D^α v‖_{L_p},

W_p^0 = L_p(R^d),  1 ≤ p ≤ ∞,

W̊_∞^0 = closure of C_0^∞(R^d) in L_∞(R^d).

Besov space B_p^{s,q},  1 ≤ p, q ≤ ∞,  s > 0:

ω_{p,j}(t, v) = sup_{|y|≤t} ‖(T_y − I)^j v‖_{W_p^0},  T_y v(x) = v(x + y),

s = s̄ + s₀,  s̄ a nonnegative integer,  0 < s₀ ≤ 1,

‖v‖_{B_p^{s,q}} = ‖v‖_{W_p^{s̄}} + Σ_{|α|=s̄} [∫₀^∞ (t^{−s₀} ω_{p,j}(t, D^α v))^q dt/t]^{1/q}   if q < ∞,
  with j = 1 if 0 < s₀ < 1, j = 2 if s₀ = 1,

‖v‖_{B_p^{s,∞}} = ‖v‖_{W_p^{s̄}} + Σ_{|α|=s̄} sup_{t>0} (t^{−s₀} ω_{p,j}(t, D^α v)),

B_p^s = B_p^{s,∞}.

(V, W)_h = h^d Σ_{x∈R_h^d} V(x) W(x),

‖V‖_{2,h} = (V, V)_h^{1/2} = (h^d Σ_{x∈R_h^d} V(x)²)^{1/2},

‖V‖_{∞,h} = max_{x∈R_h^d} |V(x)|.

On [0, 1],  h = 1/M,  M positive integer:

l_h = {V = (V₀, ..., V_M)^T : V₀ = V_M = 0},

‖V‖_{2,h} = (h Σ_j V_j²)^{1/2},  ‖V‖_{∞,h} = max_j |V_j|.

Matrix A = (a_ij), eigenvalues λ_j:

ρ(A) = max_j |λ_j|,  Λ(A) = max_j Re λ_j,

Re A = ½(A + A*),

H Hermitian positive-definite:

|v|_H = (Hv, v)^{1/2},  |A|_H = sup{|Av|_H : |v|_H ≤ 1}.

D^α = (∂/∂x₁)^{α₁} ⋯ (∂/∂x_d)^{α_d}.

h = mesh size in space,
k = time step,
λ = k/h^M = mesh ratio (M = order of equation),
U_j^n = value at (jh, nk),
E_k V(x) = Σ_p a_p V(x − ph),  Ẽ(ξ) = Σ_p a_p e^{−ipξ}.

Truncation error:

τ^n(x) = (u(x, (n+1)k) − E_k u(x, nk))/k,  τ_j^n = τ^n(jh).

∂_x V_j = (V_{j+1} − V_j)/h,
∂̄_x V_j = (V_j − V_{j−1})/h,
∂̂_x V_j = ½(V_{j+1} − V_{j−1})/h = ½(∂_x + ∂̄_x)V_j,
∂_x ∂̄_x V_j = (V_{j+1} − 2V_j + V_{j−1})/h²,
∂_t V^n = (V^{n+1} − V^n)/k,
∂̄_t V^n = (V^n − V^{n−1})/k,
∂_{x_j} V(x) = (V(x + he_j) − V(x))/h,
∂̄_{x_j} V(x) = (V(x) − V(x − he_j))/h,
e_j = unit vector in direction of x_j,
∂_x^α = ∂_{x_1}^{α₁} ⋯ ∂_{x_d}^{α_d}.

Fourier transform:

v̂(ξ) = (Fv)(ξ) = ∫_{R^d} v(x) e^{−i⟨x,ξ⟩} dx,

v(x) = (2π)^{−d} (F v̂)(−x) = (2π)^{−d} ∫_{R^d} v̂(ξ) e^{i⟨x,ξ⟩} dξ,

‖v‖_{L₂(R^d)} = (2π)^{−d/2} ‖v̂‖_{L₂(R^d)},

(D^α v)^∧(ξ) = (iξ)^α v̂(ξ),  ξ^α = ξ₁^{α₁} ⋯ ξ_d^{α_d}.

Av = F^{−1}(a v̂),  v ∈ C_0^∞(R^d),  a ∈ C^∞(R^d),

M_p(a) = sup{‖Av‖_{L_p} : v ∈ C_0^∞(R^d), ‖v‖_{L_p} ≤ 1},

M_p = {a : M_p(a) < ∞}.
Subject Index

Adams-Bashforth method, 69
Adams-Moulton method, 70
Alternating direction scheme, 62
Amplification matrix, 65, 67
Backward differencing, 70
Backward Euler scheme, 23, 52, 112, 119, 138
Banach space, 38, 99
Besov space, 98, 99
Boundary value technique, 173
Carlson-Beurling inequality, 89
Characteristic polynomial, 16, 43
Consistency, 34, 77
Crank-Nicolson scheme, 25, 34, 52, 85, 114, 119
Difference operator, 15, 45
Dirichlet boundary condition, 161
Discontinuous coefficient, 161
Discrete elliptic operator, 63
Discrete fundamental solution, 83
Douglas-Gunn method, 72
Du Fort-Frankel scheme, 41, 66
Dunford spectral representation, 155
Ellipticity, 58
Energy method, 110
Explicit scheme, 12, 32, 35
Finite difference scheme, 32
Finite element method, 131
Forward Euler scheme, 22, 52, 78, 113, 119
Fourier's inversion formula, 42, 87
Fourier transform, 42, 87
Fractional step method, 63
Green's formula, 132
Hakberg's result, 85
Homogeneous difference scheme, 173
Hurwitz criterion, 71
Implicit scheme, 23, 33, 41
Interior estimate, 175
Interior mesh point, 141
L₂-norm, 46
Lax equivalence theorem, 38, 98
Local discrete solution operator, 12
Local discretization error, 14, 22, 24
Lumped mass method, 133
MacLaurin expansion, 56
Maximum principle, 135
Mesh ratio, 25
Method of Fourier multipliers, 87
Method of lines, 59
Minimum principle, 135
Mixed initial boundary value problem, 21
Modulus of continuity, 98
Monotonicity, 136
Multistep scheme, 21, 39, 64, 93
Neumann boundary condition, 122, 137, 153
Order of accuracy, 17, 34
Order of parabolicity, 96
Padé approximant, 61
Parabolic in the sense of
  John, 18, 48, 66, 95
  Petrovskii, 43, 90
  Shilov, 96
  strongly in L₂, 96
Parseval's relation, 16, 30, 42
Periodic initial value problem, 29
Pure initial value problem, 11
Single-step scheme, 44, 52
Singular problem, 169
Smoothing operator, 102
Smoothing property, 162
Sobolev space, 49, 91, 99
Solution operator, 12, 39
Spectral method, 148
Spectrum of family, 149
Spherical symmetry, 169
Stability, 13, 15, 22, 24, 35, 76, 84, 90, 95, 134, 144, 153
Theta method, 114, 125
Third-kind boundary condition, 121, 137
Toeplitz matrix, 174
Truncation error, 14, 22, 24, 34, 117
Unbalanced method, 163
Variable time step, 167
Variational difference scheme, 134
von Neumann condition, 16, 18, 47, 66, 149
Weighted norm, 171
Zygmund condition, 98
Splitting and Alternating Direction Methods

G.I. Marchuk
Department of Numerical Mathematics
USSR Academy of Sciences
Ryleev Street 29
119034 Moscow, USSR

HANDBOOK OF NUMERICAL ANALYSIS, VOL. I


Finite Difference Methods (Part 1) - Solution of Equations in R^n (Part 1)
Edited by P.G. Ciarlet and J.L. Lions
© 1990, Elsevier Science Publishers B.V. (North-Holland)
Contents

PREFACE 203

CHAPTER I. Introduction 205


1. Approximation 206
2. Stability 216
3. Convergence 222
4. The Crank-Nicolson scheme 224

PART 1. ALGORITHMS FOR THE SPLITTING METHODS AND THE ALTERNATING DIRECTION
METHODS 229

CHAPTER II. Componentwise Splitting (Fractional Steps) Methods 231


5. The splitting method based on implicit schemes of first-order accuracy 231
6. The componentwise splitting method based on Crank-Nicolson schemes:
The case A = A1 + A2 232
7. Multicomponent splitting of problems based on Crank-Nicolson schemes 234
8. A general approach to componentwise splitting based on elementary Crank-
Nicolson schemes 235
9. A general formulation for the splitting method based on multilevel schemes 236
10. The two-level splitting scheme with weight coefficients 237
11. The splitting method for systems which do not belong to the Cauchy-Kovalevskaya
class 238
12. Splitting schemes for the heat conduction equation: Local one-dimensional schemes 239
12.1. The splitting scheme for the heat conduction equation in an orthogonal
coordinate system 239
12.2. The splitting scheme for the heat conduction equation in an arbitrary
coordinate system 240
12.3. Local one-dimensional schemes 242

CHAPTER III. Two-Cycle Componentwise Splitting Methods 245

13. The two-cycle componentwise splitting methods: The case A = A1 + A2 245
14. The two-cycle multicomponent splitting method 247
15. The two-cycle componentwise splitting method for quasi-linear problems 248
16. A general approach to the two-cycle componentwise splitting method 249
17. The two-cycle componentwise splitting scheme for the heat conduction equation 252

CHAPTER IV. Splitting Schemes with Factorization of the Operators 255

18. Schemes factorizing the operators 255


19. The implicit splitting scheme with approximate factorization of the operator 258
20. The stabilization method (explicit-implicit scheme with approximate factorization
of the operator) 259


21. A general scheme for the method of approximate factorization of the operator 263
22. The scheme of approximate factorization for the parabolic equation 265
CHAPTER V. The Predictor-Corrector Method 269
23. The predictor-corrector method: The case A= A, + A2 269
24. The predictor-corrector method: The case A = Σα Aα 272
25. The predictor-corrector method for the parabolic equation 273
CHAPTER VI. The Alternating Direction and the Stabilizing Correction Methods 277
26. The alternating direction method 277
27. The stabilizing correction method 278
28. A general formulation for the stabilizing correction method 279
29. Application of the alternating direction scheme to the parabolic equation 280
CHAPTER VII. Methods of Splitting with Respect to Physical Processes 283
30. The method of splitting with respect to physical processes 283
31. The method of particles in a cell 285
32. The method of large particles 286
CHAPTER VIII. The Alternating Triangular Method and the Alternating Operator Method 289
33. The alternating triangular method 289
34. The alternating operator method 291
35. The generalized alternating operator method 292
36. The scheme of the alternating triangular method for the parabolic equation 292

CHAPTER IX. Splitting Methods and Alternating Direction Methods as Iterative Methods
for Stationary Problems 295
37. The stationing method: General concepts of the theory of iterative methods 295
38. Iterative algorithms 297
39. Acceleration of the convergence of iterative methods 299
PART 2. METHODS FOR STUDYING THE CONVERGENCE OF SPLITTING AND ALTERNATING
DIRECTION SCHEMES 301
CHAPTER X. Convergence Studies of the Splitting Schemes by Use of the Fourier Method
(Spectral Method) 303
40. General statement of the Fourier method 303
41. The Fourier method and the convergence studies of splitting schemes for
stationary problems 307
42. The Fourier method and the grounding of the splitting schemes for
nonstationary problems 310

CHAPTER XI. The A Priori Estimates Method and the Convergence Studies of the
Splitting Schemes 315
43. The simplest a priori estimates 315
44. A priori estimates for splitting schemes of the type A_j φ^(j+1) = B_j φ^j + τ_j f^j 319
45. The energy inequalities method for constructing a priori estimates 322
CHAPTER XII. The Splitting of the Evolutionary Problem for a System of
Differential Equations 327
46. The splitting of problems defined on fractional intervals and the weak
approximation method 327

47. The splitting of problems defined on the whole interval 331


48. The two-cycle splitting of the problem 332
49. Some results on convergence and stability 335

CHAPTER XIII. Convergence Studies and Optimization of Iterative Methods 339


50. Sufficient conditions for convergence 339
51. The choice of parameters in the commutative alternating direction method 344
52. The choice of parameters in the noncommutative alternating direction method 348
53. The convergence acceleration procedures for the alternating direction method 350
54. Generalizations 352

CHAPTER XIV. Splitting and Decomposition Methods for Variational Problems 355
55. Splitting and decomposition methods for classical variational problems 355
56. Decomposition of a general variational problem 356
57. A variational problem with restrictions 357
58. The convergence of decomposition algorithms 358

PART 3. APPLICATIONS OF SPLITTING METHODS TO PROBLEMS OF MATHEMATICAL PHYSICS 361

CHAPTER XV. The Heat Conduction Equation 363


59. The two-cycle componentwise splitting scheme for a parabolic equation with
three spatial variables 363
60. Schemes of second-order accuracy for p-dimensional parabolic equations
without mixed derivatives 366
61. Schemes for equations with mixed derivatives 368
62. Alternating direction schemes 370
63. Schemes of increased order of accuracy 371
64. Finite element method schemes and splitting schemes for two-dimensional
parabolic equations 374

CHAPTER XVI. Equations of Hyperbolic Type 377


65. The stabilization scheme for the multidimensional equation of oscillations 377
66. Approximate factorization schemes for equations of oscillations 378
67. Local one-dimensional schemes for multidimensional hyperbolic equations 381
68. The splitting scheme for multidimensional hyperbolic systems of equations 383

CHAPTER XVII. Integro-Differential Transport Equations 389


69. Statement of the problem and the scheme of incomplete splitting 389
70. The scheme of complete splitting 392
71. The scheme of approximate factorization of the operator 392
72. The method of integral identities and the splitting method 394
73. The numerical scheme for the nonstationary transport equation in (x, y) geometry 399
73.1. Approximation in spatial variables 400
73.2. Approximation in the time variable 405
73.3. Approximation in angular variables 406
73.4. Numerical realization of the algorithm 407
74. Splitting methods as iterative algorithms for stationary transport equations 410

CHAPTER XVIII. The Splitting Method for Problems of Hydrodynamics 413


75. Splitting schemes for Navier-Stokes equations with ε-perturbation of
incompressible fluid equations 413

76. Splitting schemes restoring divergence for incompressible fluid equations 416
77. The general principle for constructing splitting schemes for Navier-Stokes equations 419

CHAPTER XIX. Problems in Meteorology 427


78. Equations of atmosphere hydrothermodynamics 427
79. The general splitting method with respect to physical processes based
on the separation of characteristic times 432
80. The approximation of equations in spatial variables and the discrete analogues of
conservation laws 435
81. The method of splitting with respect to geometric variables and
numerical realization 437

CHAPTER XX. Problems in Oceanology 441


82. Statement of the problem and the splitting of equations with
respect to physical processes 441
83. The splitting of adaptation equations in planes and generalization
for the nonhydrostatic case 444
84. The splitting of adaptation equations with respect to "topography" 445
85. The splitting of "shallow water" equations with respect to coordinates 447

REFERENCES 449

SUBJECT INDEX 461


Preface

We witness the advent of the age of parallel large-capacity computers and
processors. Parallel computers enter our life as the realization of an objective
process: the aspiration to create more and more powerful computers capable of
solving the most complicated computational and informational problems arising in
society. Since at each level of technology we approach the limits of computer
capacity, this drive toward ever more powerful computers pushes us steadily
toward computational processes based on parallel and asynchronous data
processing. Rapid progress in the electronic element base makes it possible to
construct such multiprocessor computer systems on VLSI crystals.
However, these qualitatively new parallel computers require a completely new
mathematical base for solving complicated problems on parallel computer systems.
What is this base? No doubt it is the adequate representation of problems in a form
accessible to parallel processing and suited to the structure of the computer.
Numerical mathematics has already done much in this direction. First of all there
are the methods of splitting complicated problems of mathematical physics into
simpler systems realizable on parallel processors.
This fact inspired the author to give a survey of the splitting methods and to
demonstrate the results achieved in this direction. The author tried to analyze all
basic splitting algorithms, naturally focussing special attention on the Soviet school
of numerical mathematics which has achieved important results both in formulating
algorithms and in applying them to the solution of complicated problems. As for the
rest of the world's experience, in our opinion it is reflected sufficiently both in the
treatment of the algorithms and in the detailed bibliography presented in this
work.
In preparing material for this article the author was assisted greatly by V.I.
Agoshkov, G.R. Kontarev, V.I. Kuzin, Y.A. Kuznetsov, V.N. Lykossov, V.P.
Shutyaev, V.B. Zalesny. The author is very much obliged to all these colleagues.

CHAPTER I

Introduction

The intensive development of the methods for solving linear algebraic equations
with three-diagonal and block three-diagonal matrices in the fifties and sixties led to
the creation of effective numerical algorithms for solving stationary problems based
on the factorization of the difference operator. A special place among the methods of
factorization (sweep methods) belongs to various versions of noniterative methods of
factorization developed by KELDYSH [1942], GELFAND and LOKUTSIEVSKY [1962],
RUSANOV [1960], GODUNOV [1962], ABRAMOV and ANDREEV [1963] and others.
Experience accumulated in solving one-dimensional problems by methods based
on factorization had prepared the basis for the construction of the algorithms for
solving more complicated problems in mathematical physics. The beginning of
the sixties was marked by a great contribution to numerical mathematics: the
development of such algorithms. This contribution is connected with the names of
Douglas, Peaceman and Rachford, who suggested the alternating direction method
(see DOUGLAS and RACHFORD [1956], PEACEMAN and RACHFORD [1955]). The success
of the method was ensured by a simple reduction of the multidimensional problem
to a sequence of one-dimensional problems with three-diagonal matrices easily
invertible by computers. This method affected significantly the construction of
algorithms in various fields of applied mathematics. The theoretical studies of this
and related methods are presented in the works by DOUGLAS [1962], DYAKONOV
[1961-1967, 1970-1972], SAMARSKII [1961-1967, 1970-1971, 1977], BIRKHOFF,
VARGA and YOUNG [1962], WACHSPRESS [1962], KELLOGG [1964], GUNN [1965],
VOROBYOV [1968] and others.
The methods were developed based on homogeneous and inhomogeneous
approximations. In the case of inhomogeneous approximation each of the auxiliary
(intermediate) problems does not necessarily approximate the original problem, but
taken as a whole, in special norms, the approximation does take place. These methods have
been named the splitting methods; they have been developed in the works of the
Soviet mathematicians YANENKO [1966, 1967], DYAKONOV [1963a, 1966], SAMARSKII
[1962d, 1963a], SAUL'EV [1960], MARCHUK [1971] and others.
The splitting methods were widely used in problems of various kinds and they
stimulated the more general approach to the solution of problems in mathematical
physics based on the method of weak approximation developed by YANENKO
[1964b, 1967] and SAMARSKII [1963a, 1965b]. It appeared that the splitting method
may be understood as the method of weak approximation of the original equation


by a more simple equation. The convergence conditions for the method of weak
approximation were formulated in the theorem by YANENKO and DEMIDOV [1966]
and in the works by LEBEDEV [1977] and DYAKONOV [1962f, 1971d]. The method of
weak approximation has found natural applications in the problems of hydro-
dynamics, meteorology, oceanology, in the theory of radiation transport, etc.
(MARCHUK [1967, 1974a], YANENKO [1967]).
The original scheme of the predictor-corrector type by Lax and Wendroff found
wide applications in problems of hydrodynamics, meteorology, oceanology, where
the predictor was suggested in the form of an explicit difference scheme. This scheme
is conditionally stable, easy to realize, and gives a second-order approximation
in all variables. A detailed study of the scheme is presented in the book by
RICHTMYER and MORTON [1972].
Various versions of the predictor-corrector method based on implicit difference
approximations were proposed by BRIAN [1966], DOUGLAS [1961], SOFRONOV
[1965] and MARCHUK and YANENKO [1966]. All these schemes proved to be
equivalent in a certain sense and differed only in the realization technique. In the last
of these works the implicit splitting scheme with factorized operator is used as
a predictor, which has first-order accuracy. For the problems in hydrodynamics the
implicit majorant schemes are used as predictors.
Of particular interest is the method of decomposition and decentralization
formulated by LIONS and TEMAM [1966] and also by BENSOUSSAN, LIONS and TEMAM
[1975] which also is close to the splitting methods and methods of weak
approximation.
In the sixties the method for solving multidimensional problems of mathematical
physics was developed intensively which is connected with the name of HARLOW
[1967]. This method was named the method of large particles. Today it is also con-
sidered as a splitting method. In the works by DYACHENKO [1965], BELOTSERKOVSKY
and DAVYDOV [1978], and YANENKO, ANUCHINA, PETRENKO and SHOKIN [1971]
various modifications of this method are given and schemes of their realization are
considered.
In the present work many of the splitting (fractional steps) and alternating direction
methods are considered, some facts from the theory of convergence of these methods are
presented, and algorithms for numerically solving a number of concrete applied
problems are formulated based on the applications of the methods discussed further.
Main attention in this work is paid to nonstationary problems. In this connection
the splitting and alternating direction methods are formulated as classes of finite
difference algorithms for solving these problems. Nevertheless, in some sections they
will be considered as iterative methods for solving stationary equations.

1. Approximation

This section introduces the general concepts of the theory of finite difference
methods which will be used in the present work.

Consider a stationary problem of mathematical physics in the operator form:

    Aφ = f in Ω,
                                                (1.1)
    aφ = g on ∂Ω,

where A is a linear operator, φ ∈ Φ, f ∈ F. Here Φ and F are Hilbert spaces, whose
elements are defined respectively on Ω + ∂Ω = Ω̄ (∂Ω is the boundary of the domain
Ω) and Ω; a is the linear boundary condition operator, g ∈ G, where G is the Hilbert
space of the functions defined on ∂Ω. For the sake of definiteness and simplicity of
notation we will consider the functions from Φ, F, and G to be dependent only upon
two variables x and y (which may be regarded as spatial).
Construct the finite-dimensional approximation of problem (1.1), for example
by the finite difference method. To this end consider a set of points (x_k, y_l), where
k and l are arbitrary integers. The set of these points will be called a grid and the
points will be called grid points. The distance between grid points will be described
by the value h > 0 called the step of the grid (naturally, this distance may also be
described by two parameters h_x and h_y, the grid steps in the x- and y-direction
respectively). Denote by Ω_h the set of the grid points approximating in some sense
the set of the points of the domain Ω, and by ∂Ω_h the set of the grid points
approximating the boundary ∂Ω. Further, the functions whose definition domain is
a grid will be called grid functions. The set of the grid functions φ^h with definition
domain Ω_h will be denoted by Φ_h. Each function φ ∈ Φ may be connected with the
grid function (φ)_h as follows: the value of (φ)_h at the grid point (x_k, y_l) equals φ(x_k, y_l)
(of course, if the values {φ(x_k, y_l)} have a meaning). This correspondence presents
a linear operator acting from Φ to Φ_h; this operator is called a projection of the
function φ on the grid. The function ψ = Aφ can also be projected on the grid by
taking (ψ)_h = (Aφ)_h. The correspondence between (φ)_h and (Aφ)_h is a linear operator
defined on the grid functions.
By now applying approaches developed in the theory of the finite difference
methods we construct the problem in the finite-dimensional grid function space:

    A_h φ^h = f^h in Ω_h,
                                                (1.2)
    a_h φ^h = g^h on ∂Ω_h,

which is the finite difference analogue of problem (1.1). Here A_h and a_h are linear
operators depending on the grid step h, φ^h ∈ Φ_h, f^h ∈ F_h, g^h ∈ G_h, and Φ_h, F_h and G_h
are grid function spaces.
We introduce in Φ_h, F_h and G_h respectively the norms ||·||_Φh, ||·||_Fh and ||·||_Gh. Let
(·)_h denote the linear operator which puts the element φ ∈ Φ in correspondence with
the element (φ)_h ∈ Φ_h in such a way that

    lim_{h→0} ||(φ)_h||_Φh = ||φ||_Φ.

We shall say that the problem (1.2) approximates the problem (1.1) with order n on
the solution φ, if there exist positive constants h̄, M1, M2 such that for all h < h̄ the

following inequalities hold:

    ||A_h(φ)_h − f^h||_Fh ≤ M1·h^n1,    ||a_h(φ)_h − g^h||_Gh ≤ M2·h^n2,    (1.3)

and n = min(n1, n2).
In cases where the solution φ of problem (1.1) is sufficiently smooth it is
convenient to find the order of approximation using the norm which is natural for
the space of continuous and differentiable functions. To that end the expansion into
a Taylor series of the solution and the other functions contained in the problem is
usually applied.
Further we shall suppose that the reduction of problem (1.1) to problem (1.2)
has been accomplished and, moreover, that the boundary condition of (1.2) has been
used for excluding the values of the solution at the boundary points of the domain
Ω_h + ∂Ω_h. As a result we have the equivalent problem

    Λ^h φ^h = f^h.    (1.4)

The values of the solution at the boundary points can be found from (1.2) after
solving (1.4). In some cases it is more convenient to write the approximating problem
in the form (1.4) and in some other cases in the form (1.2). Thus, as a result of the
reduction and with the required approximation taken into account, problem (1.1)
with a continuous argument has been reduced to the problem of linear algebra (1.4),
i.e., to solving a system of algebraic equations.
Further we shall mainly use Hilbert spaces of grid functions and, unless otherwise
stated, the corresponding norm ||φ^h||_Φh will be defined by the relationship

    ||φ^h|| = (φ^h, φ^h)^(1/2).

It must be noted, however, that many of the introduced concepts (approximation,
etc.) also apply to the case of Banach spaces, and in some statements and illustrating
examples grid function norms will be introduced which are not connected with the
inner product defined by the above relationship.
We shall illustrate the above concepts with an example of the following concrete
problem:

    −Δφ = f in Ω,
                                (1.5)
    φ = 0 on ∂Ω,

where

    Ω = {(x, y): 0 < x < 1, 0 < y < 1},

f = f(x, y) is a smooth function, and

    Δ = ∂²/∂x² + ∂²/∂y².

Let F be the Hilbert space of real functions L2(Ω) with the inner product

    (u, v) = ∫_Ω uv dx dy



and the norm

    ||u|| = (u, u)^(1/2).

Denote by Φ the set of functions that are continuous in Ω̄ = Ω + ∂Ω and that have
continuous first and second derivatives in Ω. We take the same norm in Φ as in F, i.e.
||·||_Φ = ||·||_F. Choose as G the Hilbert space L2(∂Ω) of functions defined on ∂Ω with
the norm

    ||g||_L2(∂Ω) = ( ∫_∂Ω |g|² dΓ )^(1/2).

If we introduce the operators Aφ = −Δφ and aφ = φ|_∂Ω, then problem (1.5) may be
represented in the form (1.1) for g ≡ 0.
Now, we introduce the finite-dimensional approximation of problem (1.5). To this
end, we cover the square Ω̄ = Ω + ∂Ω by a uniform grid with step h in the x- and
y-direction. The grid points of the domain will be identified by a pair of indices (k, l),
where the first index k (0 ≤ k ≤ Nx) corresponds to discretization of the x-coordinate
and the second index l (0 ≤ l ≤ Ny) corresponds to that of the y-coordinate. Consider
the following approximations:

    ∂²φ/∂x² ≈ Δx∇x(φ)_h,    ∂²φ/∂y² ≈ Δy∇y(φ)_h,

where Δx, Δy, ∇x and ∇y are difference operators defined on grid functions φ^h with
components φ_{k,l} as follows:

    (∇x φ^h)_{k,l} = (1/h)(φ_{k,l} − φ_{k−1,l}),

    (∇y φ^h)_{k,l} = (1/h)(φ_{k,l} − φ_{k,l−1}),

and Δx, Δy are the analogous forward differences. Then problem (1.5) can be
approximated by the following problem:

    Λ^h φ^h ≡ −[Δx∇x φ^h + Δy∇y φ^h] = f^h in Ω_h,
                                                (1.6)
    φ^h = 0 on ∂Ω_h,

where ∂Ω_h is the set of grid points that belong to the boundary ∂Ω, and Ω_h is the set of
internal grid points in Ω. Here the operator Λ^h is given by

    Λ^h = Λx + Λy,

i.e., A h is a difference analogue of the operator -A: Ah -Ah. Now (1.6) can be
rewritten in the form
-Ahph=fh in h,
_2hzph to in QhX (1.7)
ph=O0 on a',
where h and fh are vectors with the components kpl and f, and

(L5 h)k,
f I h 2 (p + 1,+
I L - 1p +( h I 1 + (p I -4? AhI

Xh+1 2 Yl +1/2
1
fki=j2
k f f dx dy,
X'- ,/2 Y - 1/2

Xk+l/2 =Xk,'2h, Y,+l,2 =yl+h.


Here and in the sequel we take as fhI some averagef (x, y), calculated by the above
formula. (This allows us to consider the difference schemes with a function f(x, y)
which, in general, is not assumed to be sufficiently smooth.)
We introduce the space Φ_h. Let the elements of Φ_h be defined in the domain

    Ω̄_h = Ω_h + ∂Ω_h = {(x_k, y_l): 0 ≤ k ≤ Nx = 1/h, 0 ≤ l ≤ Ny = 1/h}.

Define the inner product and the norm as follows:

    (φ^h, ψ^h)_Φh = Σ_{k=0}^{Nx} Σ_{l=0}^{Ny} h² φ_{k,l} ψ_{k,l},

    ||φ^h||_Φh = ( Σ_{k=0}^{Nx} Σ_{l=0}^{Ny} h² (φ_{k,l})² )^(1/2).

Choose as F_h the space of grid functions defined on

    Ω_h = {(x_k, y_l): 1 ≤ k ≤ Nx − 1, 1 ≤ l ≤ Ny − 1}

with the inner product and the norm given by:

    (φ^h, ψ^h)_Fh = Σ_{k=1}^{Nx−1} Σ_{l=1}^{Ny−1} h² φ_{k,l} ψ_{k,l},

    ||φ^h||_Fh = ( Σ_{k=1}^{Nx−1} Σ_{l=1}^{Ny−1} h² (φ_{k,l})² )^(1/2).

The space G_h of grid functions defined on ∂Ω_h is defined in a similar way.


Take as (), a vector whose components are the values of the function p at
corresponding grid points. Then using the expansions of the functions q'(x, y) and
f(x, y) into Taylor series we obtain
11- A~h(cp)h _fh lllhhQ M, ·h 2, (1.8)

where M1 = const < ∞. The approximation of the boundary conditions in this
example is exact. From this fact and from the estimate (1.8) it follows that
problem (1.7) approximates problem (1.5) with second-order accuracy on those
solutions of problem (1.5) having bounded fourth-order derivatives.
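The second-order estimate (1.8) is easy to probe numerically: apply the five-point operator to a smooth function and watch the error fall by a factor of four when h is halved. A minimal sketch in Python (the test function φ = sin πx · sin πy and the grid sizes are our own choices, not from the text):

```python
import numpy as np

def five_point_laplacian(u, h):
    """Discrete Laplacian applied to the interior points of a square grid."""
    return (u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
            - 4.0 * u[1:-1, 1:-1]) / h**2

def truncation_error(n):
    """Max norm of the difference between the discrete and the exact Laplacian
    of phi = sin(pi x) sin(pi y) on an n-by-n grid of the unit square."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    X, Y = np.meshgrid(x, x, indexing="ij")
    phi = np.sin(np.pi * X) * np.sin(np.pi * Y)
    exact = -2.0 * np.pi**2 * phi[1:-1, 1:-1]   # the exact Laplacian of this phi
    return np.abs(five_point_laplacian(phi, h) - exact).max()

# Halving h should divide the error by about 4 (order h^2).
e1, e2 = truncation_error(16), truncation_error(32)
print(e1 / e2)   # close to 4
```

For this separable φ the discrete operator even has φ as an exact eigenvector, so the observed ratio tracks the O(h²) term of the eigenvalue expansion closely.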
Note that if we require the grid functions of Φ_h to satisfy the condition φ^h|_∂Ωh = 0,
then for such functions the inner products in Φ_h and F_h (and therefore the generated
norms) will coincide. Further, using the following identities, analogous to the first
and the second Green formulae:

    −Σ_{k=1}^{Nx−1} (Δx∇x φ^h)_{k,l} ψ_{k,l} = Σ_{k=1}^{Nx} (∇x φ^h)_{k,l} (∇x ψ^h)_{k,l},

    −Σ_{k=1}^{Nx−1} (Δx∇x φ^h)_{k,l} ψ_{k,l} = −Σ_{k=1}^{Nx−1} φ_{k,l} (Δx∇x ψ^h)_{k,l},

and similar identities for the sums over the index l, it is not difficult to show that

    (Λ^h φ^h, ψ^h)_Fh = (φ^h, Λ^h ψ^h)_Fh,

    (Λ^h φ^h, φ^h)_Fh = Σ_{k=1}^{Nx} Σ_{l=1}^{Ny} h² [((∇x φ^h)_{k,l})² + ((∇y φ^h)_{k,l})²] > 0
    for φ^h ≠ 0,

i.e., the operator Λ^h on the functions of Φ_h satisfying the condition φ^h|_∂Ωh = 0 is a
self-adjoint positive-definite operator. It is of interest to note also that here Λ^h is
representable as the sum of the symmetric positive-definite commuting operators Λx,
Λy:

    Λ^h = Λx + Λy,    Λx Λy = Λy Λx.
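These properties can be confirmed on the matrix form of Λ^h: with zero boundary values, Λ^h is the Kronecker sum of two one-dimensional second-difference matrices, which makes the symmetry and the commutativity of Λx and Λy visible directly. A small sketch (grid size is our own choice):

```python
import numpy as np

# One-dimensional second-difference factor (tridiagonal 2, -1, scaled by 1/h^2)
# on the N-1 interior points of the unit interval.
N = 8
h = 1.0 / N
n = N - 1
L1 = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
I = np.eye(n)

Lx = np.kron(L1, I)      # Lambda_x, acting on the x-index
Ly = np.kron(I, L1)      # Lambda_y, acting on the y-index
Lh = Lx + Ly             # Lambda^h

print(np.allclose(Lh, Lh.T))                # self-adjoint
print(np.allclose(Lx @ Ly, Ly @ Lx))        # Lambda_x and Lambda_y commute
print(np.linalg.eigvalsh(Lh).min() > 0.0)   # positive definite
```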
So far the stationary problem and its approximation in spatial variables were
considered. But we may consider in a similar way the problem of approximating
the evolutionary equation (i.e., the equation which has no time derivatives in A and
may be resolved explicitly with respect to the first derivative in time):

    ∂φ/∂t + Aφ = f in Ω × Ω_t,    Ω_t = (0, T),

    aφ = g on ∂Ω × Ω_t,    (1.9)

    φ = φ⁰ in Ω for t = 0.

Problem (1.9) will be approximated in two stages. First we shall approximate this
problem in the domain (Ω + ∂Ω) × Ω_t in spatial variables. As a result we obtain a
differential equation in time which is a difference equation in spatial variables.
In some cases it is easy to exclude from the obtained differential-difference
problem the values of the solution at the boundary points of the domain
(Ω_h + ∂Ω_h) × Ω_t using the difference boundary conditions. Assuming this to be
carried out we come to an evolutionary equation of the kind

    dφ^h/dt + Λ^h φ^h = f^h,    (1.10)

where Λ^h, f^h and φ^h are functions of time t. Further we shall omit the index h in
problem (1.10), assuming that we deal with the difference analogue of the original
problem of mathematical physics in spatial variables.
Equation (1.10) presents a system of ordinary differential equations for the
components of the vector φ^h.
Consider the following Cauchy problem:

    dφ/dt + Λφ = f,
                            (1.11)
    φ = g for t = 0.

Suppose that the operator Λ does not depend on time. Consider the simplest
methods of approximation to problem (1.11) in time. Currently the difference
schemes of the first and second order of approximation in τ are used in most cases.
First, consider the simplest explicit scheme of the first order of approximation on
the grid Ω_τ:

    (φ^(j+1) − φ^j)/τ + Λφ^j = f^j,    φ⁰ = g,    (1.12)

where τ = t_{j+1} − t_j and f^j is some projection of the function f. For the sake of
simplicity we may now take f^j = f(t_j).
If we consider the simplest implicit scheme, then we have

    (φ^(j+1) − φ^j)/τ + Λφ^(j+1) = f^j,    φ⁰ = g,    (1.13)

and we choose f^j as f(t_{j+1}). The schemes (1.12) and (1.13) are first-order
approximations in time. We may check this easily using an expansion into Taylor
series in time and assuming, for example, the existence of bounded second-order
derivatives of the solution in time.
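The first-order behaviour of (1.12) and (1.13) can be seen already on the scalar model dφ/dt + λφ = 0, φ(0) = 1, with exact solution e^(−λt): halving τ should halve the error at t = T. A sketch with parameters of our own choosing:

```python
import math

def euler_error(n_steps, implicit, lam=2.0, T=1.0):
    """Error at t = T of explicit/implicit Euler for dphi/dt + lam*phi = 0, phi(0) = 1."""
    tau = T / n_steps
    phi = 1.0
    for _ in range(n_steps):
        if implicit:
            phi = phi / (1.0 + tau * lam)      # one step of (1.13): (E + tau*Lambda)^(-1)
        else:
            phi = (1.0 - tau * lam) * phi      # one step of (1.12): E - tau*Lambda
    return abs(phi - math.exp(-lam * T))

# Halving tau should halve the error (first order in tau).
for implicit in (False, True):
    e_coarse = euler_error(100, implicit)
    e_fine = euler_error(200, implicit)
    print(round(e_coarse / e_fine, 2))   # close to 2
```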
Resolving (1.12) and (1.13) with respect to φ^(j+1) we come to the recursive relation

    φ^(j+1) = Tφ^j + τSf^j,    (1.14)

where the step operator T and the source operator S are defined as follows: for scheme
(1.12)

    T = E − τΛ,    S = E;

for scheme (1.13)

    T = (E + τΛ)^(−1),    S = T.

The difference schemes of type (1.14) for evolutionary equations will be called
two-level schemes.
In applications the second-order approximation Crank-Nicolson scheme is
widely used:

    (φ^(j+1) − φ^j)/τ + Λ(φ^(j+1) + φ^j)/2 = f^j,    φ⁰ = g,    (1.15)

where f^j = f(t_{j+1/2}). Scheme (1.15) can also be represented in the form (1.14) with

    T = (E + ½τΛ)^(−1)(E − ½τΛ),    S = (E + ½τΛ)^(−1).
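For a concrete Λ the three step operators can be written down as matrices and compared with the exact propagator e^(−τΛ): the explicit and implicit operators match it to O(τ²) per step, the Crank-Nicolson operator to O(τ³). A sketch using a small symmetric positive-definite Λ of our own choosing:

```python
import numpy as np

# A small SPD matrix standing in for the spatial difference operator (our example).
Lam = np.array([[ 2.0, -1.0,  0.0],
                [-1.0,  2.0, -1.0],
                [ 0.0, -1.0,  2.0]])
E = np.eye(3)
tau = 0.01

def step_ops(scheme):
    """Step operator T and source operator S of the two-level scheme (1.14)."""
    if scheme == "explicit":            # scheme (1.12)
        return E - tau * Lam, E
    if scheme == "implicit":            # scheme (1.13)
        T = np.linalg.inv(E + tau * Lam)
        return T, T
    # Crank-Nicolson, scheme (1.15)
    S = np.linalg.inv(E + 0.5 * tau * Lam)
    return S @ (E - 0.5 * tau * Lam), S

# Exact propagator exp(-tau*Lam) via the eigendecomposition of the symmetric Lam.
w, V = np.linalg.eigh(Lam)
exact = V @ np.diag(np.exp(-tau * w)) @ V.T

for name in ("explicit", "implicit", "crank-nicolson"):
    T, _ = step_ops(name)
    print(name, np.abs(T - exact).max())
```

The printed errors show the Crank-Nicolson operator tracking e^(−τΛ) far more closely than the two first-order operators for the same τ.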
In some cases it is convenient to write the difference equations (1.12), (1.13) and
(1.15) as a system of two equations, the first approximating in Ω_h the equation itself,
and the second approximating the boundary condition on ∂Ω_h. In this case the
difference analogue of problem (1.9) has the form

    L^(hτ) φ^(hτ) = f^(hτ) in Ω^(hτ),
                                        (1.16)
    l^(hτ) φ^(hτ) = g^(hτ) on ∂Ω^(hτ),

where

    Ω^(hτ) = Ω_h × Ω_τ,    ∂Ω^(hτ) = Ω̄_h × {0} ∪ ∂Ω_h × Ω_τ,

and the approximation conditions are

    ||L^(hτ)(φ)^(hτ) − f^(hτ)||_F(hτ) ≤ M1·h^n + N1·τ^p,
                                                        (1.17)
    ||l^(hτ)(φ)^(hτ) − g^(hτ)||_G(hτ) ≤ M2·h^n + N2·τ^p.

In these inequalities, just as in (1.3), (·)^(hτ) is the operator of projection on the
corresponding grid space.
By introducing vector functions and new operators acting in the domains Ω_h × Ω_τ,
where Ω_τ is the set {t_j}, the difference equation in canonical form (1.14) can also be
written in the form

    Λ^(hτ) φ^(hτ) = f^(hτ).    (1.18)

Thus, the evolutionary equation with boundary conditions and initial data can be
reduced to problem (1.18) of linear algebra.
In particular a boundary value problem of elliptic type, an integral equation, etc.,
may be reduced to equation (1.18). Here the approximation condition can again be
written in the form (1.17) with the only approximation index h, which is the
maximum of the set {Δx_i} of steps in the spatial variables.
Consider the problem

    Lφ ≡ ∂φ/∂t − Δφ = f in Ω × Ω_t,

    φ = 0 on ∂Ω × Ω_t,    (1.19)

    φ = g in Ω for t = 0.

Solutions are assumed to be defined on the domain (Ω + ∂Ω) × Ω̄_t, where Ω is a square
as before, and Ω_t = {0 < t < T}. From Ω, ∂Ω, Ω_t we come to Ω_h, ∂Ω_h, Ω_τ respectively.
Let Ω_τ be the set of points t_j and let t_{j+1} − t_j = τ. Then problem (1.19) can be
approximated as follows:

    L^(hτ) φ^(hτ) = f^(hτ) in Ω_h × Ω_τ,

    φ^(hτ) = 0 on ∂Ω_h × Ω_τ,    (1.20)

    φ^(hτ) = g in Ω_h × {0}.

Consider the simplest explicit approximation:

    (L^(hτ) φ^(hτ))^j_{k,l} = (φ^(j+1)_{k,l} − φ^j_{k,l})/τ + (Λ^h φ^j)_{k,l},    (1.21)

    f^j_{k,l} = (1/h²) ∫_{x_{k−1/2}}^{x_{k+1/2}} ∫_{y_{l−1/2}}^{y_{l+1/2}} f(x, y, t_j) dx dy,    (1.22)

    g_{k,l} = (1/h²) ∫_{x_{k−1/2}}^{x_{k+1/2}} ∫_{y_{l−1/2}}^{y_{l+1/2}} g(x, y) dx dy,    (1.23)

where φ^j_{k,l} = φ^(hτ)(x_k, y_l, t_j). Then

    φ^(j+1)_{k,l} = φ^j_{k,l} − τ(Λ^h φ^j)_{k,l} + τ f^j_{k,l} in Ω_h × Ω_τ.    (1.24)

Besides,

    φ^j_{k,l} = 0 on ∂Ω_h × Ω̄_τ,
                                (1.25)
    φ⁰_{k,l} = g_{k,l} in Ω̄_h × {0}.

The recursive relation (1.24) can be written as

    φ^(j+1) = Tφ^j + τf^j,    (1.26)

where φ^j is the grid function with the definition domain Ω_h + ∂Ω_h and

    φ^j(x_k, y_l) = φ^(hτ)(x_k, y_l, t_j),

    T = E − τΛ^h = E − τ(Λx + Λy)

is the step operator, and Λx and Λy are the same as in (1.6). For the sake of simplicity
let here

    F_h = Φ_h and ||φ^h|| = ( Σ_{k,l} h² |φ_{k,l}|² )^(1/2).

We estimate the norm of the operator T. To that end, we want to find its largest
eigenvalue:

    Tu = λ(T)u in Ω_h,
                            (1.27)
    u = 0 on ∂Ω_h.

The following relation is obviously valid:

    λ(T) = 1 − τλ(Λ^h).

Noting that the orthonormal eigenvectors {u^{mp}} of the operator Λ^h have the
components

    u^{mp}_{k,l} = 2 sin(mπkh) sin(pπlh),
    m = 1, 2, ..., Nx − 1,    p = 1, 2, ..., Ny − 1,

and the corresponding eigenvalues are

    λ_{mp}(Λ^h) = (4/h²)(sin²(½mπh) + sin²(½pπh)),

we easily find λ(T) and the norm of the operator T:

    ||T|| = max{ |1 − (8τ/h²) sin²(½πh)|, |1 − (8τ/h²) cos²(½πh)| },    (1.28)

and if τ/h² ≤ ¼, then ||T|| ≤ 1.
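The threshold in (1.28) can be checked by forming Λ^h as a matrix, taking T = E − τΛ^h, and evaluating ||T|| on either side of τ/h² = ¼. A sketch (the grid size and the two test ratios are our own choices):

```python
import numpy as np

def lambda_h(N):
    """Matrix of the five-point operator Lambda^h on the (N-1) x (N-1) interior grid
    of the unit square with step h = 1/N."""
    h = 1.0 / N
    n = N - 1
    I = np.eye(n)
    # 1-D second-difference factor: tridiagonal (2, -1) divided by h^2.
    L1 = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
    return np.kron(L1, I) + np.kron(I, L1)

N = 8
h = 1.0 / N
Lam = lambda_h(N)
E = np.eye(Lam.shape[0])

def op_norm(M):
    """Spectral norm; T is symmetric here, so this is the largest |eigenvalue|."""
    return np.abs(np.linalg.eigvalsh(M)).max()

for ratio in (0.2, 0.3):                 # below and above tau/h^2 = 1/4
    tau = ratio * h**2
    print(ratio, op_norm(E - tau * Lam) <= 1.0)
```

Below the threshold the norm stays at most 1; just above it the most oscillatory eigenvector is already amplified.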


Along with the explicit approximation of the first order in τ we may consider the
implicit approximation of the first order in τ and of the second order in h. Then
instead of (1.21) we will have:

    (L^(hτ) φ^(hτ))^j_{k,l} = (φ^(j+1)_{k,l} − φ^j_{k,l})/τ + (Λ^h φ^(j+1))_{k,l},    (1.29)

where f^j_{k,l} and g_{k,l} are defined by (1.22) and (1.23) respectively. In this case (1.20)
can no longer be solved explicitly and we come to the operator equation

    ((E + τΛ^h)φ^(j+1))_{k,l} = φ^j_{k,l} + τf^j_{k,l} in Ω_h × Ω_τ,    (1.30)

whose solution must satisfy the following conditions:

    φ^j_{k,l} = 0 on ∂Ω_h × Ω̄_τ,
                                (1.31)
    φ⁰_{k,l} = g_{k,l} in Ω̄_h × {0}.

We write (1.30) in the form

    φ^(j+1)_{k,l} = (T(φ^j + τf^j))_{k,l},    (1.32)

where T = (E + τΛ^h)^(−1). In this case the norm of the operator T becomes

    ||T|| = max{ 1/(1 + (8τ/h²) cos²(½πh)), 1/(1 + (8τ/h²) sin²(½πh)) },    (1.33)

and therefore ||T|| < 1 for any τ and h.
Finally, we consider the approximation in the Crank-Nicolson scheme. Define
the operators and functions in problem (1.20) as follows:

    (L^(hτ) φ^(hτ))^j_{k,l} = (φ^(j+1)_{k,l} − φ^j_{k,l})/τ + (Λ^h (φ^(j+1) + φ^j)/2)_{k,l},    (1.34)

    f^j_{k,l} = (1/h²) ∫_{x_{k−1/2}}^{x_{k+1/2}} ∫_{y_{l−1/2}}^{y_{l+1/2}} f(x, y, t_{j+1/2}) dx dy,
                                                                        (1.35)
    g_{k,l} = (1/h²) ∫_{x_{k−1/2}}^{x_{k+1/2}} ∫_{y_{l−1/2}}^{y_{l+1/2}} g(x, y) dx dy.
216 G.l. Marchuk CHAPTER I

Then we obtain the problem
(E − ½τΛ_h)φ^{j+1}_{k,l} = ((E + ½τΛ_h)φ^j)_{k,l} + τf^j_{k,l}   in Ω_h × Ω_τ,   (1.36)
φ^j_{k,l} = 0   on ∂Ω_h × Ω_τ,
φ^0_{k,l} = g_{k,l}   in Ω_h × {0}.   (1.37)
In this case (1.36) can be formally solved with respect to the unknown φ^{j+1} in the form
φ^{j+1}_{k,l} = (Tφ^j)_{k,l} + τ(Sf^j)_{k,l},   (1.38)
where
T = (E − ½τΛ_h)^{−1}(E + ½τΛ_h),   S = (E − ½τΛ_h)^{−1}.
The norm of the step operator is
‖T‖ = max{ |1 − (4τ/h²) cos²(½πh)| / (1 + (4τ/h²) cos²(½πh)), |1 − (4τ/h²) sin²(½πh)| / (1 + (4τ/h²) sin²(½πh)) }.   (1.39)
Hence ‖T‖ < 1.

2. Stability

Consider the next important concept of the theory of finite difference methods:
stability.
To clarify basic definitions and concepts of the theory of stability consider first the
explicit difference scheme (1.12):
φ^{j+1} = (E − τA)φ^j + τf^j,   φ^0 = g.   (2.1)
The solution φ^j is found for 0 ≤ jτ ≤ T.
Assume that the operator A is positive and has a complete system of eigenfunctions {u_n} and a set of eigenvalues {λ_n > 0} corresponding to the spectral problem
Au = λu.
We introduce the following Fourier series:
φ^j = Σ_n φ^j_n u_n,   f^j = Σ_n f^j_n u_n,   g = Σ_n g_n u_n,   (2.2)
where
φ^j_n = (φ^j, u*_n),   f^j_n = (f^j, u*_n),   g_n = (g, u*_n);
u*_n are the eigenfunctions of the adjoint spectral problem. Substituting (2.2) into (2.1) and taking the inner product of the result with the vectors u*_n we obtain the following expression for the Fourier coefficients:
φ^{j+1}_n = (1 − τλ_n)φ^j_n + τf^j_n.   (2.3)

Since
φ^0 = Σ_n g_n u_n,
the initial condition has the form
φ^0_n = g_n.   (2.4)
By successively eliminating the unknowns we obtain
φ^j_n = r_n^j g_n + τ Σ_{i=1}^{j} r_n^{j−i} f^{i−1}_n,   (2.5)
where
r_n = 1 − τλ_n.   (2.6)
From (2.5) it follows that for τ > 0
|φ^j_n| ≤ |r_n|^j |g_n| + τ Σ_{i=1}^{j} |r_n|^{j−i} |f^{i−1}_n|.
We strengthen the latter inequality by substituting |f_n| = max_j |f^j_n| for |f^{i−1}_n| under the summation symbol, and we obtain
|φ^j_n| ≤ |r_n|^j |g_n| + τ (1 − |r_n|^j)/(1 − |r_n|) |f_n|.   (2.7)

Von Neumann introduced the so-called spectral criterion of stability, the essence of which is as follows. If for each Fourier coefficient φ^j_n of (2.2) the relation
|φ^j_n| ≤ C_{1n}|g_n| + C_{2n}|f_n|,   n = 1, 2, ...,   (2.8)
holds, where C_{1n}, C_{2n} are constants bounded uniformly for 0 ≤ jτ ≤ T, then the difference scheme (2.1) is computationally stable. Consider conditions on the parameters of the difference scheme (1.12) sufficient for relation (2.8) to be valid. An analysis of (2.7) shows that the stability criterion (2.8) is satisfied if the following restriction is imposed on the parameter r_n:
|r_n| ≤ 1,   n = 1, 2, ....   (2.9)
Assume that the spectrum of the operator A is contained in the interval
0 < α(A) ≤ λ_n(A) ≤ β(A).
Then, according to (2.6), if
τ ≤ 2/β(A),   (2.10)
then relation (2.9) will hold. Thus inequality (2.10) becomes a constructive stability condition for the difference scheme (2.1). Note that condition (2.10) is a sufficient stability condition; the scheme remains stable, for instance, when
τ = 2/β(A).

In this case relation (2.7) obviously becomes
|φ^j_n| ≤ |g_n| + jτ|f_n|.   (2.11)
But jτ ≤ T, where T is fixed. This means that for small τ a large number of steps j is required, and j → ∞ as τ → 0, but such that the upper bound T of the time interval remains fixed. Again we come to schemes which are stable in the sense of von Neumann.
We consider now some other difference schemes based on implicit difference approximations. In the case of the implicit scheme of first-order approximation (1.13) we obtain an expression similar to (2.7):
|φ^j_n| ≤ |r_n|^j |g_n| + τ (1 − |r_n|^j)/(1 − |r_n|) |f_n|,   (2.12)
where
r_n = 1/(1 + τλ_n(A)).
For λ_n(A) > 0 this difference scheme is stable for any τ > 0, since
|r_n| < 1,   n = 1, 2, ....
Stability of this kind will be called absolute stability.
In the case of the Crank-Nicolson scheme (1.15) the estimate for the Fourier coefficients of the solution has the form
|φ^j_n| ≤ |r_n|^j |g_n| + τ (1 − |r_n|^j)/(1 − |r_n|) q_n |f_n|,   (2.13)
where
r_n = (1 − ½τλ_n(A))/(1 + ½τλ_n(A)),   q_n = 1/(1 + ½τλ_n(A)).
Therefore |r_n| < 1 for any τ > 0 provided that λ_n(A) > 0.
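The behaviour of the three amplification factors can be illustrated with a short numerical sketch (Python; the function names are illustrative, not from the text). For a single harmonic with eigenvalue λ > 0 it confirms absolute stability of the implicit and Crank-Nicolson factors and the conditional stability of the explicit one:

```python
# Amplification factors r_n for one Fourier harmonic with eigenvalue lam > 0:
# explicit: 1 - tau*lam; implicit: 1/(1 + tau*lam);
# Crank-Nicolson: (1 - tau*lam/2)/(1 + tau*lam/2)
def r_explicit(tau, lam): return 1.0 - tau * lam
def r_implicit(tau, lam): return 1.0 / (1.0 + tau * lam)
def r_cn(tau, lam):       return (1.0 - 0.5 * tau * lam) / (1.0 + 0.5 * tau * lam)

lam = 100.0
for tau in (1e-3, 1e-1, 10.0):
    assert abs(r_implicit(tau, lam)) < 1.0     # absolutely stable
    assert abs(r_cn(tau, lam)) < 1.0           # absolutely stable
assert abs(r_explicit(2.0 / lam, lam)) <= 1.0  # tau = 2/beta(A): boundary case
assert abs(r_explicit(0.03, lam)) > 1.0        # tau > 2/lam: unstable harmonic
```

Note that at the boundary value τ = 2/β(A) the explicit factor equals −1, which is exactly the borderline case mentioned above.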
It should be noted, firstly, that stability in von Neumann's sense is based on the spectral analysis of the operator of the problem. This means that for this approach the computation of the largest eigenvalue, or the estimation of its upper bound, is a necessary part of the algorithm. Secondly, the spectral stability criterion ascertains the stability of the solution with respect to each of the harmonics of the Fourier series, but it says nothing about the stability of the solution in terms of energy norms. At the same time, the norm of the solution φ^j often happens to be the only available characteristic of the problem's solution. All of this led researchers to propose other definitions of stability related to the norms of the operators. It should be emphasized, however, that stability analysis of von Neumann type continues to play a prominent role in applications.
We now come to a more general definition of the concept of computational stability. To that end we consider the problem
∂φ/∂t + Aφ = f   in Ω × Ω_t,   (2.14)
φ = g   for t = 0,

which is approximated by the difference problem
φ^{j+1} = Tφ^j + τSf^j   in Ω_h × Ω_τ,   (2.15)
φ^0 = g.
We shall say that the difference scheme (2.15) is stable if, for any parameter h characterizing the difference approximation and for jτ ≤ T, the following relation holds:
‖φ^j‖_{Φ_h} ≤ C₁‖g‖_{G_h} + C₂‖f‖_{F_h},   (2.16)
where the constants C₁ and C₂ are uniformly bounded on 0 ≤ t ≤ T and do not depend on τ, h, g and f; G_h denotes the space to which g of (2.15) belongs.
The definition of computational stability is closely related to the concept of well-posedness for problems with a continuous argument. One may say that computational stability means continuous dependence of the solution on the input data in the case of problems with a discrete argument. It is easy to see that the definition of stability in the form (2.16) already relates the solution itself to the a priori knowledge concerning the input data of the problem. This definition is often more suitable for analyzing the stability of many problems, and more informative, than von Neumann's definition. Consider the stability of the scheme (1.12). To this end rewrite the recursive relation (2.1) as
φ^{j+1} = Tφ^j + τf^j,   φ^0 = g,   (2.17)
where
T = E − τA,   (2.18)
and A is the operator approximating the operator of problem (2.14). The formal solution of problem (2.17) has the form
φ^{j+1} = T^{j+1}g + τ Σ_{i=1}^{j+1} T^{j+1−i} f^{i−1}.   (2.19)

We assume¹ Φ_h = G_h = F_h, and by estimating the norm of the solution of (2.19) we obtain
‖φ^{j+1}‖_Φ ≤ ‖T‖^{j+1}‖g‖_G + τ Σ_{i=1}^{j+1} ‖T‖^{j+1−i}‖f^{i−1}‖_F.   (2.20)
We substitute for ‖f^{i−1}‖ under the summation symbol its maximal value over the fixed time interval. Let
‖f‖_F = max_j ‖f^j‖_F;
then
‖φ^{j+1}‖_Φ ≤ ‖T‖^{j+1}‖g‖_G + τ (1 − ‖T‖^{j+1})/(1 − ‖T‖) ‖f‖_F.   (2.21)

¹This assumption is made for the sake of simplicity. Otherwise, instead of one norm ‖T‖ we should introduce two norms: ‖T‖_{F_h→Φ_h} = sup_{φ∈F_h}(‖Tφ‖_{Φ_h}/‖φ‖_{F_h}) and ‖T‖_{G_h→Φ_h} = sup_{φ∈G_h}(‖Tφ‖_{Φ_h}/‖φ‖_{G_h}).

If we assume that
‖T‖ ≤ 1,   (2.22)
then scheme (1.12) will be stable in the sense of definition (2.16). Naturally, (2.22) is a sufficient condition for stability. More sophisticated conditions could be obtained using the norms of powers of the step operator, ‖T^i‖ (i = 1, 2, ..., j). Weakening the condition, however, brings additional difficulties into the constructive procedure of ascertaining the stability criterion. As a rule, the sufficient condition (2.22) is used in practice.
The stability of the implicit difference schemes (1.13) and (1.15) may be considered in a similar way. In these cases we have
‖φ^j‖_Φ ≤ ‖T‖^j ‖g‖_G + τ Σ_{i=1}^{j} ‖T‖^{j−i}‖S‖‖f^{i−1}‖_F,
where for scheme (1.13)
T = (E + τA)^{−1},   S = (E + τA)^{−1},
and for scheme (1.15)
T = (E + ½τA)^{−1}(E − ½τA),   S = (E + ½τA)^{−1}.
It is easy to show that these difference schemes will be absolutely stable in the sense of definition (2.16) provided that A ≥ 0 and ‖φ‖ = (φ, φ)^{1/2}. This may be done using the estimate
‖(E + τA)^{−1}‖ ≤ 1   for A ≥ 0, τ > 0,   (2.23)
resulting from the relations
‖(E + τA)^{−1}φ‖ ≤ ‖φ‖   ⇔   ‖φ‖² ≤ ‖(E + τA)φ‖² = ‖φ‖² + τ((A + A*)φ, φ) + τ²‖Aφ‖²,
and from the following theorem (which is very useful in the stability analysis of many schemes).

THEOREM 2.1 (KELLOGG [1963]). If the operator A acting in a real Hilbert space is positive-semidefinite and the numerical parameter σ is nonnegative, then
‖(E − σA)(E + σA)^{−1}‖ ≤ 1.   (2.24)

Indeed, introduce the notation T = (E − σA)(E + σA)^{−1}. Then we have
‖T‖² = sup_φ ((E − σA)(E + σA)^{−1}φ, (E − σA)(E + σA)^{−1}φ)/(φ, φ)
= sup_ψ ((E − σA)ψ, (E − σA)ψ)/((E + σA)ψ, (E + σA)ψ)
= sup_ψ [(ψ, ψ) − 2σ(Aψ, ψ) + σ²(Aψ, Aψ)]/[(ψ, ψ) + 2σ(Aψ, ψ) + σ²(Aψ, Aψ)] ≤ 1.
REMARK 2.1. In the case where A > 0 and σ > 0 we would have, instead of (2.24), the strict inequality
‖(E − σA)(E + σA)^{−1}‖ < 1.   (2.25)
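Kellogg's inequality is easy to probe numerically. The sketch below (Python with NumPy; the matrix sizes and the number of trials are arbitrary choices for illustration) checks the bound for random symmetric positive-semidefinite matrices and several values of σ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
E = np.eye(n)
# Check ||(E - s*A)(E + s*A)^{-1}|| <= 1 for random positive-semidefinite A
# (built as A = B^T B) and nonnegative parameters s.
for _ in range(20):
    B = rng.standard_normal((n, n))
    A = B.T @ B                                  # positive-semidefinite
    for s in (0.0, 0.1, 1.0, 50.0):
        T = (E - s * A) @ np.linalg.inv(E + s * A)
        assert np.linalg.norm(T, 2) <= 1.0 + 1e-10
```

The spectral norm is used here because the theorem is stated in the Hilbert-space (Euclidean) norm; for A > 0 and s > 0 the computed norms are in fact strictly below one, in agreement with Remark 2.1.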

Let us briefly discuss limit transitions. In solving the difference analogues of evolutionary problems of mathematical physics we have to consider approximations in time with step τ as well as in space with characteristic step h. This means that the transition operator T = T(τ, h) depends both on τ and on h.
The construction of a stable algorithm for a given approximation method usually reduces to establishing the relation between τ and h which ensures computational stability. If the difference scheme happens to be stable for any τ > 0 and h > 0, then it is declared absolutely stable. And if the scheme happens to be stable only when a certain relation between τ and h holds, then such a scheme is called conditionally stable.
Assume that τ and h are related according to the inequality
τ ≤ C̄hᵖ,   (2.26)
where C̄ and p are given constants independent of τ and h. It is worth noting that such relations are usually established by considering the amplitudes of the "shortest" perturbations. As a rule they reflect the connection between the minimal spatial and time scales of the phenomena simulated by the difference scheme. Of course, large perturbations (say, of the order of several h) will then be described more precisely.
Suppose that we need to increase the accuracy of the solution formally by decreasing the grid step h. Then we must simultaneously decrease the time step so that the above inequality is again satisfied with the new parameters. This means that we may even allow the limit transition as τ → 0 and h → 0 if we satisfy condition (2.26), for example, in the form
τ/hᵖ = const ≤ C̄.
Along with the above definitions of computational stability, other definitions are used in the literature which expand the class of difference schemes of interest in applications. For example, the scheme is called stable if
‖T‖ ≤ 1 + O(τ).   (2.27)
For small τ such a definition allows for the exponential growth of round-off errors in time.
Note that if the approximation of the evolutionary equation is studied in spaces of grid functions defined on Ω_h × Ω_τ, then it is useful to define stability in terms of the same spaces. Indeed, let the difference problem have the form
L_{hτ}φ_{hτ} = f_{hτ}   in Ω_h × Ω_τ,   (2.28)
l_{hτ}φ_{hτ} = g_{hτ}   on ∂Ω_h × Ω_τ.
We introduce the stability criterion in the form
‖φ_{hτ}‖_Φ ≤ C₁‖f_{hτ}‖_F + C₂‖g_{hτ}‖_G,   (2.29)
where C₁ and C₂ are constants on the interval 0 ≤ t ≤ T independent of h, τ, f_{hτ} and g_{hτ}.
Assume that the original problem is approximated by a difference equation with the boundary conditions already taken into account. Then it is convenient to introduce the stability criterion in the form
‖φ_{hτ}‖_Φ ≤ C‖f_{hτ}‖_F,   (2.30)
where C is bounded on the interval 0 ≤ t ≤ T.

3. Convergence
We turn now to the formulation of the major result in the theory of finite difference algorithms: the convergence theorem. The study of the convergence of the difference solution to the solution of the original problem is based on the same principles for stationary and evolutionary problems of mathematical physics. This allows us to follow the main idea of the proof on the example of the stationary problem (1.1), which is approximated by the difference scheme (1.2) (i.e., by the system of equations approximating both the equation and the boundary condition of (1.1)). The following convergence theorem is valid (GODUNOV and RYABENKII [1970], FILIPPOV [1955]).

THEOREM 3.1. Let the following conditions hold:
(1) The difference scheme (1.2) approximates the original problem (1.1) on the solution φ with order n.
(2) A^h and a^h are linear operators.
(3) The difference scheme (1.2) is stable in the sense of (2.29), i.e. there exist positive constants h̄, C₁, C₂ such that for all h ≤ h̄, f^h ∈ F_h, g^h ∈ G_h there exists a unique solution φ^h of problem (1.2) which satisfies the inequality
‖φ^h‖_Φ ≤ C₁‖f^h‖_F + C₂‖g^h‖_G.   (3.1)
Then the solution φ^h of the difference problem converges to the solution φ of the original problem, i.e.,
lim_{h→0} ‖(φ)^h − φ^h‖_Φ = 0,
and the following estimate for the convergence rate is valid:
‖(φ)^h − φ^h‖_Φ ≤ (C₁M₁ + C₂M₂)hⁿ,   (3.2)
where M₁ and M₂ are the same constants as in (1.3).

PROOF. Let h̄ be the constant introduced in the definitions of approximation and stability. The stability implies that for any right-hand sides f^h and g^h with h ≤ h̄ there exists a unique solution φ^h, i.e. for h ≤ h̄ we may consider the difference (φ)^h − φ^h. Since A^h is linear we have
A^h[(φ)^h − φ^h] = A^h(φ)^h − A^hφ^h = A^h(φ)^h − f^h.
Similarly,
a^h[(φ)^h − φ^h] = a^h(φ)^h − g^h.
Since h ≤ h̄, owing to the stability and the approximation of the relations
A^h[(φ)^h − φ^h] = A^h(φ)^h − f^h,   a^h[(φ)^h − φ^h] = a^h(φ)^h − g^h,   (3.3)
it follows that
‖(φ)^h − φ^h‖_Φ ≤ C₁‖A^h(φ)^h − f^h‖_F + C₂‖a^h(φ)^h − g^h‖_G
≤ C₁M₁h^{n₁} + C₂M₂h^{n₂} ≤ (C₁M₁ + C₂M₂)hⁿ.
In obtaining the last inequality we may suppose without loss of generality that h ≤ 1.
In the case of the evolutionary problem consider
δf^{hτ} ≡ L^{hτ}[(φ)^{hτ} − φ^{hτ}] = L^{hτ}(φ)^{hτ} − f^{hτ},
δg^{hτ} ≡ l^{hτ}[(φ)^{hτ} − φ^{hτ}] = l^{hτ}(φ)^{hτ} − g^{hτ}.   (3.4)
From (3.4) and from the stability condition (2.29) we have
‖(φ)^{hτ} − φ^{hτ}‖_Φ ≤ C₁‖δf^{hτ}‖_F + C₂‖δg^{hτ}‖_G,
or, taking (1.17) into account,
‖(φ)^{hτ} − φ^{hτ}‖_Φ ≤ K₁hᵏ + K₂τᵖ,   (3.5)
where
K₁ = C₁M₁ + C₂M₂,   K₂ = C₁N₁ + C₂N₂.
Estimate (3.5) proves the convergence of the difference solution to the exact one and gives a clear picture of the convergence with respect to the spatial grid step h as well as to the time step τ.

The assumptions of the theorem include the rather rigid requirement that C₁ and C₂ be independent of h and τ. The condition that C₁ and C₂ must be independent of h is especially unpleasant, since in some cases C₁ and C₂ may tend to infinity as h → 0. Let
C₁ = C̄₁/hᵐ,   C₂ = C̄₂/hᵐ,
where m ≥ 0. The rate of convergence of the approximate solution to the exact one will then be estimated as follows:
‖(φ)^{hτ} − φ^{hτ}‖_Φ ≤ Mh^{k−m} + Nτᵖh^{−m}.
If k > m and τᵖh^{−m} → 0 as τ → 0, h → 0, then convergence takes place. Of course, the convergence theorem can be formulated even in the case when C₁ and C₂ depend both on h and τ.
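The content of the convergence theorem — approximation plus stability yields convergence at the approximation order — can be observed on the simplest example. The following sketch (Python; the test equation dφ/dt + λφ = 0 and its parameters are illustrative choices) integrates with the absolutely stable implicit scheme of first order and checks that the error at a fixed time falls linearly in τ:

```python
import math

def implicit_euler_error(tau, lam=2.0, T=1.0):
    # dphi/dt + lam*phi = 0, phi(0) = 1, integrated to t = T
    phi = 1.0
    for _ in range(int(round(T / tau))):
        phi /= 1.0 + tau * lam
    return abs(phi - math.exp(-lam * T))

ratio = implicit_euler_error(0.01) / implicit_euler_error(0.005)
assert 1.7 < ratio < 2.3   # first order: halving tau roughly halves the error
```

The same experiment with the Crank-Nicolson scheme would give a ratio near 4, in accordance with its second-order approximation.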

4. The Crank-Nicolson scheme

The Crank-Nicolson scheme will play a significant role in the next chapters, so we shall consider this scheme in more detail.
Consider the evolutionary equation
∂φ/∂t + Aφ = f   in Ω × Ω_t,   (4.1)
φ = g   in Ω   for t = 0,
where A ≥ 0 (i.e. (Aφ, φ) ≥ 0) and the solution φ is sufficiently smooth. We shall assume that the solution satisfies certain boundary conditions on ∂Ω. We shall assume as well that (4.1) already presents the finite difference approximation to the corresponding original evolutionary problem in all variables excluding t (i.e. A is a matrix, φ is a grid function depending on t, etc.; as specified above, the index "h" of A, φ, Ω and other symbols will often be omitted for the sake of simplicity).
Suppose first that A does not depend on t. Then it is easy to check that the difference equation
(φ^{j+1} − φ^j)/τ + A(φ^{j+1} + φ^j)/2 = 0,   φ^0 = g,   (4.2)
approximates problem (4.1) with second-order accuracy in τ. This difference scheme, as was already noted, is usually referred to as the Crank-Nicolson scheme. It is of interest to note that scheme (4.2) is the result of alternately applying explicit and implicit first-order schemes on the intervals t_j ≤ t ≤ t_{j+1/2} and t_{j+1/2} ≤ t ≤ t_{j+1}, respectively (provided that A is a linear operator independent of t):
(φ^{j+1/2} − φ^j)/(½τ) + Aφ^j = 0,
(φ^{j+1} − φ^{j+1/2})/(½τ) + Aφ^{j+1} = 0.   (4.3)
Eliminating the unknowns φ^{j+1/2} from this system of difference equations we arrive at the Crank-Nicolson scheme.
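This composition of half steps can be verified directly on matrices. The sketch below (Python with NumPy; the random symmetric matrix stands in for an arbitrary time-independent A) checks that the explicit half step followed by the implicit half step reproduces the Crank-Nicolson step operator:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
A = B.T @ B                    # symmetric positive-semidefinite stand-in for A
tau = 0.05
E = np.eye(n)

half_explicit = E - 0.5 * tau * A                 # phi^{j+1/2} = (E - tau/2 A) phi^j
half_implicit = np.linalg.inv(E + 0.5 * tau * A)  # (E + tau/2 A) phi^{j+1} = phi^{j+1/2}
T_composed = half_implicit @ half_explicit
T_cn = np.linalg.inv(E + 0.5 * tau * A) @ (E - 0.5 * tau * A)
assert np.allclose(T_composed, T_cn)
```

Note that the two factors commute here because A is the same in both half steps; for a time-dependent A this elimination is no longer exact, which motivates the approximations (4.21)-(4.24) discussed below.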
Assume that the operator A is independent of time and is approximated in problem (4.1) by a difference operator, which will be denoted by Λ. Then we deal with a problem of linear algebra of the form
(φ^{j+1} − φ^j)/τ + Λ^j(φ^{j+1} + φ^j)/2 = 0,   φ^0 = g,   (4.4)
where
(Λ^jφ, φ) ≥ 0   (4.5)
for all functions from the subspace Φ.
Resolving (4.4) with respect to φ^{j+1} we obtain
φ^{j+1} = (E + ½τΛ^j)^{−1}(E − ½τΛ^j)φ^j,   (4.6)
or
φ^{j+1} = T^jφ^j,   (4.7)
where T^j is the step operator:
T^j = (E + ½τΛ^j)^{−1}(E − ½τΛ^j).   (4.8)
To prove computational stability we need not estimate the norm of the step operator T^j directly. Taking the inner product of (4.4) with ½(φ^{j+1} + φ^j) we obtain
(‖φ^{j+1}‖² − ‖φ^j‖²)/(2τ) + (Λ^jφ^{j+1/2}, φ^{j+1/2}) = 0,   φ^{j+1/2} = ½(φ^{j+1} + φ^j).   (4.9)
If Λ^j is positive-semidefinite, then
‖T^j‖ ≤ 1   (4.10)
and, therefore,
‖φ^{j+1}‖ ≤ ‖φ^j‖,   (4.11)
i.e., the scheme is stable.
If the operator Λ^j is skew-symmetric, i.e., if the equality
(Λ^jφ, φ) = 0
holds, then we have instead of (4.10) the equality
‖φ^{j+1}‖ = ‖φ^j‖.   (4.12)
And so here
‖T^j‖ = 1.   (4.13)
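The norm-preservation property (4.12)-(4.13) is a consequence of the fact that for skew-symmetric Λ the step operator is the Cayley transform, an orthogonal matrix. A short numerical sketch (Python with NumPy; the random skew-symmetric matrix is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
B = rng.standard_normal((n, n))
L = B - B.T                                  # skew-symmetric: (L v, v) = 0
tau = 0.1
E = np.eye(n)
# Crank-Nicolson step operator T = (E + tau/2 L)^{-1}(E - tau/2 L)
T = np.linalg.solve(E + 0.5 * tau * L, E - 0.5 * tau * L)

v = rng.standard_normal(n)
assert np.isclose(np.linalg.norm(T @ v), np.linalg.norm(v))  # norm conserved
assert np.isclose(np.linalg.norm(T, 2), 1.0)                 # ||T|| = 1
```

This exact conservation is one of the reasons the Crank-Nicolson scheme is attractive for problems with a conserved energy norm.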
We now consider the approximation of the Crank-Nicolson scheme in the case where the operator A is time-dependent. We define the operator L by the equality
Lφ ≡ ∂φ/∂t + Aφ   (4.14)
and the operator L_τ by the equality
(L_τφ)^j = ((φ)^{j+1} − (φ)^j)/τ + Λ^j((φ)^{j+1} + (φ)^j)/2,   (4.15)

where (φ)^j is the projection of the function φ on the grid Ω_τ. Further, we introduce the norm
‖L_τφ‖_{C_τ} = max_j ‖(L_τφ)^j‖,   (4.16)
where ‖·‖ is some norm in the space to which (L_τφ)^j belongs. In order to estimate the norm (4.16) we expand the solution of the original equation (4.1) into a Taylor series:
(φ)^{j+1} = (φ)^j + τ(φ_t)^j + ½τ²(φ_tt)^j + ....   (4.17)
Taking into account the obvious relations
φ_t = −Aφ,   φ_tt = −A_tφ + A²φ,   (4.18)
where A_t = ∂A/∂t, we transform the Taylor series (4.17) into the form
(φ)^{j+1} = (φ)^j − τA^j(φ)^j + ½τ²[(A^j)²(φ)^j − A_t^j(φ)^j] + ....   (4.19)
Substituting (4.19) into (4.16) and taking into account (4.15) we obtain
‖f − L_τφ‖_{C_τ} = max_j ‖(A^j − Λ^j)(φ)^j − ½τ{(A^j)² − A_t^j − Λ^jA^j}(φ)^j + O(τ²)‖,   (4.20)
where f^j is the right-hand side of (4.4) which, in this case, equals zero.
If we choose as the approximation of the operator A the operator
Λ^j = A^j = A(t_j),   (4.21)
then it follows from (4.20) that
‖f − L_τφ‖_{C_τ} = max_j ½τ‖A_t^j(φ)^j‖ + O(τ²)
and we have a first-order approximation. Note that in the special case where A is independent of t the approximation in the form (4.21) ensures second-order accuracy in τ.
Suppose now that the approximating operator has been chosen in the form
Λ^j = A^j + ½τA_t^j.   (4.22)
In this case we have
‖f − L_τφ‖_{C_τ} = O(τ²).
Note that the approximation in the Crank-Nicolson scheme will also have second-order accuracy in τ if the operator Λ^j is chosen in the form
Λ^j = A^{j+1/2}   (4.23)
or
Λ^j = ½(A^{j+1} + A^j).   (4.24)

The three forms (4.22), (4.23) and (4.24) of the approximation of the operator A ensuring second-order accuracy are used in various applications, in particular in the numerical solution of quasi-linear equations.
We now consider the inhomogeneous equation
∂φ/∂t + Aφ = f   in Ω × Ω_t,   (4.25)
φ = g   in Ω   for t = 0.
The difference approximation to problem (4.25) based on the Crank-Nicolson scheme has, under the assumptions formulated above, the form
(φ^{j+1} − φ^j)/τ + Λ^j(φ^{j+1} + φ^j)/2 = f^j,   (4.26)
where
f^j = f(t_{j+1/2}).
It is not difficult to check that the difference problem (4.26) approximates (4.25) with second-order accuracy in τ. Write the solution of problem (4.26) in the form
φ^{j+1} = T^jφ^j + τ(E + ½τΛ^j)^{−1}f^j.   (4.27)
From (4.27) it follows that
‖φ^{j+1}‖_Φ ≤ ‖T^j‖‖φ^j‖_Φ + τ‖(E + ½τΛ^j)^{−1}‖‖f^j‖_F.   (4.28)
Since τ > 0 and
(Λ^jφ^j, φ^j) ≥ 0,   (4.29)
we obtain
‖T^j‖ ≤ 1,   ‖(E + ½τΛ^j)^{−1}‖ ≤ 1.   (4.30)
Therefore, (4.28) leads to
‖φ^{j+1}‖_Φ ≤ ‖φ^j‖_Φ + τ‖f^j‖_F.   (4.31)
Taking
‖φ^0‖ = ‖g‖,   ‖f‖ = max_j ‖f^j‖_F
and using the recursive relation (4.31) we obtain
‖φ^j‖_Φ ≤ ‖g‖ + jτ‖f‖,   jτ ≤ const.   (4.32)
Thus relation (4.32) proves the stability of the difference scheme. Besides, this relation presents an a priori estimate for the norm of the solution.
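The a priori estimate can be checked step by step in a direct computation. The sketch below (Python with NumPy; the random symmetric positive-semidefinite matrix and the steady source are illustrative assumptions) integrates the inhomogeneous problem with the Crank-Nicolson scheme and verifies the bound at every step:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
B = rng.standard_normal((n, n))
A = B.T @ B                     # symmetric positive-semidefinite
tau, steps = 0.01, 200
E = np.eye(n)
g = rng.standard_normal(n)
f = rng.standard_normal(n)      # steady source, f^j = f for all j

phi = g.copy()
fmax = np.linalg.norm(f)
for j in range(steps):
    # (E + tau/2 A) phi^{j+1} = (E - tau/2 A) phi^j + tau f^j
    rhs = (E - 0.5 * tau * A) @ phi + tau * f
    phi = np.linalg.solve(E + 0.5 * tau * A, rhs)
    bound = np.linalg.norm(g) + (j + 1) * tau * fmax   # a priori estimate
    assert np.linalg.norm(phi) <= bound + 1e-10
```

In practice the bound is quite conservative: the positive-definite part of A damps the solution, so the computed norm stays well below the estimate.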
Thus we have introduced some concepts which will be used in considering the splitting and alternating direction methods. Besides, we have shown how to reduce the original evolutionary problem, using the finite difference method and approximating in all variables excluding t, to the solution of a system of ordinary differential equations of the form
∂φ/∂t + Aφ = f   in Ω_t,   (4.33)
φ = g   for t = 0,
which may be solved using splitting or alternating direction methods. (Note that it is this approach of successive approximation to the problem that is often used in practice!)
Further, the finite difference method for the approximation of problems has been chosen here since the present work is devoted mainly to this method. However, systems of difference equations may currently be obtained in various ways. Thus, the finite element method also leads to a certain system which may be considered a finite difference system (STRANG and FIX [1977], BABUSKA [1970], MARCHUK and AGOSHKOV [1981]). But in this case, as a rule, we obtain instead of (4.33) the system
B∂φ/∂t + Aφ = f   in Ω_t,   (4.34)
Bφ = g   for t = 0,
where the matrix B is symmetric positive-definite and presents a Gram matrix (if the original problem has in the main equation a distinguished derivative "∂φ/∂t"). But here it is often possible to approximately replace this matrix in (4.34) by a certain diagonal matrix B̃ (with positive elements on the diagonal). As a result, from (4.34) one comes to the problem
B̃∂φ/∂t + Aφ = f   in Ω_t,   (4.35)
B̃φ = g   for t = 0,
and the order of approximation of the original problem by problem (4.35) remains the same as for the system (4.34) (see STRANG and FIX [1977]). Now it is easy to notice that by introducing the vector ψ = B̃^{1/2}φ and the matrix Ã = B̃^{−1/2}AB̃^{−1/2} we may obtain from (4.35) a system of the form (4.33). Therefore it is possible to extend the splitting and alternating direction algorithms described below to problems like (4.34) arising in the finite element method. Note also that we may arrive at systems of type (4.33) (or (4.34)) if we approximate the problem in spatial variables using the Bubnov-Galerkin or Galerkin-Petrov methods or some other methods (see MARCHUK and AGOSHKOV [1981]). In this context it is of interest to formulate the splitting and alternating direction algorithms for direct application to (4.33), as will be done in the sequel.
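The change of variables reducing the lumped finite element system to the standard form can be sketched as follows (Python with NumPy; the 4×4 matrices are hypothetical illustrations, and B̃ is taken diagonal with positive entries as assumed above):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
b = rng.uniform(0.5, 2.0, n)               # diagonal of the lumped matrix B~
M = rng.standard_normal((n, n))
A = M.T @ M                                # symmetric positive-semidefinite A
Bh = np.diag(np.sqrt(b))                   # B~^{1/2}
Bhi = np.diag(1.0 / np.sqrt(b))            # B~^{-1/2}
At = Bhi @ A @ Bhi                         # A~ = B~^{-1/2} A B~^{-1/2}

phi = rng.standard_normal(n)
psi = Bh @ phi                             # psi = B~^{1/2} phi
# B~ dphi/dt = -A phi  <=>  dpsi/dt = -A~ psi (homogeneous case, f = 0):
dpsi_from_phi = Bh @ (-(A @ phi) / b)      # B~^{1/2} * B~^{-1} * (-A phi)
dpsi_from_At = -At @ psi
assert np.allclose(dpsi_from_phi, dpsi_from_At)
assert np.allclose(At, At.T)               # symmetry of A is preserved
```

Besides matching the two time derivatives, the transform preserves symmetry and nonnegativity of the operator, so all the stability results for (4.33) apply unchanged to the transformed system.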
PART 1

Algorithms for the Splitting Methods and the Alternating Direction Methods
In Part 1 of this paper some algorithms will be considered which are currently widely used for the solution of complex multidimensional problems of mathematical physics. In order to concentrate on the formulations of the algorithms we will consider, as the basic object of study, the evolutionary problem of mathematical physics
∂φ/∂t + Aφ = f   in Ω × Ω_t,
φ = g   in Ω   for t = 0,
with a positive-semidefinite operator A (i.e., A ≥ 0 or (Aφ, φ) ≥ 0). We suppose here that the approximation of the original problem in all variables except t has already been carried out, Ω is a grid domain, A is a matrix, and φ, f and g are grid functions (see Section 1). For the sake of simplicity the index h of the grid parameter is omitted. We suppose that the solution of the problem satisfies the given boundary conditions on ∂Ω, and that A and f are constructed with this fact taken into account. Let also the spaces Φ and F of grid functions coincide and, if not otherwise specified, have the same norm ‖φ‖ = (φ, φ)^{1/2}.
This problem may also be written in the form
∂φ/∂t + Aφ = f   in Ω_t,
φ = g   for t = 0,
implying that the first equation is considered in Ω × Ω_t and the second in Ω × {0}. Let us agree to use mostly the latter form.
Note that, if desired, we may consider the above equations as the equations of the original problem with no preliminary approximation, the algorithms formulated below being applied directly to this problem. However, the description of such algorithms should then often be considered as formal, and their theoretical grounding in such cases is far more complicated.
CHAPTER II

Componentwise Splitting
(Fractional Steps) Methods

The splitting (fractional steps) methods are based on the idea of splitting a complex operator into a sequence of simpler ones, thus reducing the integration of the initial equation to the successive integration of equations of simpler structure. Since the splitting schemes must satisfy the conditions of approximation and stability only as a whole, we have the possibility of a flexible construction of such schemes for essentially all basic equations of mathematical physics.

5. The splitting method based on implicit schemes of first-order accuracy

Consider the problem
∂φ/∂t + Aφ = 0   in Ω_t,   (5.1)
φ = g   for t = 0,
where
A = Σ_{α=1}^{n} A_α,   A_α ≥ 0,   n ≥ 2.   (5.2)
Suppose that the operators A_α are time-independent. The splitting algorithm based on implicit schemes of first-order accuracy in τ (MARCHUK [1980], p. 269) has the form:
(φ^{j+1/n} − φ^j)/τ + A₁φ^{j+1/n} = 0,
..........
(φ^{j+1} − φ^{j+(n−1)/n})/τ + A_nφ^{j+1} = 0,   (5.3)
j = 0, 1, ...,
φ^0 = g.
This algorithm is absolutely stable, and the system (5.3) approximates problem (5.1) with first-order accuracy in τ (MARCHUK [1980], p. 270).

REMARK 5.1. Let us agree that, if there are no special notes, the approximation of the schemes is studied on solutions which are sufficiently smooth.

For the inhomogeneous problem
∂φ/∂t + Aφ = f   in Ω_t,   (5.4)
φ = g   for t = 0,
the splitting scheme of type (5.3) is
(φ^{j+1/n} − φ^j)/τ + A₁φ^{j+1/n} = 0,
..........
(φ^{j+1} − φ^{j+(n−1)/n})/τ + A_nφ^{j+1} = f^j,   (5.5)
j = 0, 1, ...,
φ^0 = g.
This scheme is absolutely stable and has first-order accuracy in τ, and the following estimate holds:
‖φ^{j+1}‖ ≤ ‖g‖_G + jτ max_j ‖f^j‖_F.   (5.6)

The realization of algorithms (5.3) and (5.5) consists of successively solving the equations of (5.3) and (5.5). If A is split into the A_α (α = 1, ..., n) in such a way that it is easy to invert the operators (E + τA_α) (for example, if the A_α and (E + τA_α) are tridiagonal or triangular matrices), then it is easy to find the approximate solution φ^{j+1} of the problem corresponding to t = t_{j+1}.
Algorithms (5.3) and (5.5) permit obvious generalizations to the case where A is time-dependent. Then, in the cycle of computations of the splitting scheme, a suitable approximation A_α^j of A_α should be taken instead of A_α on each interval t_j ≤ t ≤ t_{j+1}.
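The structure of one time step of (5.5) can be sketched as follows (Python with NumPy; the matrices A₁, A₂ and the source are hypothetical illustrations with A_α ≥ 0, and dense solves stand in for the cheap tridiagonal or triangular inversions mentioned above). The loop also checks the estimate (5.6) at every step:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
A1 = np.diag(rng.uniform(1.0, 2.0, n))             # positive, trivially invertible
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A2 = Q @ np.diag(rng.uniform(1.0, 2.0, n)) @ Q.T   # symmetric positive-definite
tau = 0.05
E = np.eye(n)
g = rng.standard_normal(n)
f = rng.standard_normal(n)

phi = g.copy()
for j in range(50):
    phi = np.linalg.solve(E + tau * A1, phi)            # first fractional step
    phi = np.linalg.solve(E + tau * A2, phi + tau * f)  # last step carries the source
    bound = np.linalg.norm(g) + (j + 1) * tau * np.linalg.norm(f)
    assert np.linalg.norm(phi) <= bound + 1e-9          # estimate (5.6)
```

Each fractional step is a contraction because ‖(E + τA_α)^{−1}‖ ≤ 1 for A_α ≥ 0, which is exactly the source of the absolute stability claimed above.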

6. The componentwise splitting method based on Crank-Nicolson schemes: The case A = A₁ + A₂

We suppose that the operator A = A(t) in (5.1) is representable as the sum of two positive-semidefinite matrices,
A(t) = A₁(t) + A₂(t),   A₁(t) ≥ 0,   A₂(t) ≥ 0,   (6.1)
whose elements are sufficiently smooth in t. We consider the approximation of these matrices on the interval t_j ≤ t ≤ t_{j+1} in the form
A_α^j = A_α(t_{j+1/2}),   α = 1, 2.   (6.2)

The componentwise splitting scheme for problem (5.1) (MARCHUK [1980], p. 256) has the form
(φ^{j+1/2} − φ^j)/τ + A₁^j(φ^{j+1/2} + φ^j)/2 = 0,
(φ^{j+1} − φ^{j+1/2})/τ + A₂^j(φ^{j+1} + φ^{j+1/2})/2 = 0,   (6.3)
j = 0, 1, ...,
φ^0 = g.
By excluding the auxiliary functions φ^{j+1/2}, the system of difference equations (6.3) may be reduced to the single equation
φ^{j+1} = T^jφ^j,   (6.4)
where
T^j = (E + ½τA₂^j)^{−1}(E − ½τA₂^j)(E + ½τA₁^j)^{−1}(E − ½τA₁^j).   (6.5)
Assume that
½τ‖A_α^j‖ < 1,   α = 1, 2.   (6.6)
Then scheme (6.3) is absolutely stable. It has a second-order approximation in τ if A₁^j and A₂^j commute, and a first-order approximation in τ if they do not.
Indeed, expanding the operator T^j into a power series in τ we obtain
T^j = E − τA^j + ½τ²((A₁^j)² + 2A₂^jA₁^j + (A₂^j)²) − ....
If the operators A_α^j commute, that is if A₁^jA₂^j = A₂^jA₁^j, then this expansion may be written as
T^j = E − τA^j + ½τ²(A^j)² − ....
Then (6.4) may be represented as follows:
φ^{j+1} = φ^j − τA^jφ^j + ½τ²(A^j)²φ^j − ...,   (6.7)
or
(φ^{j+1} − φ^j)/τ + A^j(φ^j − ½τA^jφ^j + ...) = 0.
Substituting the expression for A^jφ^j from (6.7) into the last equality we have
(φ^{j+1} − φ^j)/τ + A^j(φ^{j+1} + φ^j)/2 + O(τ²) = 0.
Comparing this scheme with the Crank-Nicolson scheme (1.15) we conclude that the order of approximation of (6.3) differs from the order of approximation of (1.15) by a quantity O(τ²). Hence, the scheme (6.3) has a second-order approximation in τ in the case A₁^jA₂^j = A₂^jA₁^j, and only a first-order approximation in τ in the case A₁^jA₂^j ≠ A₂^jA₁^j.
The absolute stability of scheme (6.3) follows from the inequality
‖T^j‖ ≤ ‖(E + ½τA₂^j)^{−1}(E − ½τA₂^j)‖ · ‖(E + ½τA₁^j)^{−1}(E − ½τA₁^j)‖
and from the estimates
‖(E + ½τA_α^j)^{−1}(E − ½τA_α^j)‖ = ‖(E − ½τA_α^j)(E + ½τA_α^j)^{−1}‖ ≤ 1,
resulting from Theorem 2.1.
The realization of scheme (6.3) can be carried out in the following way:
φ^{j+1/4} = (E − ½τA₁^j)φ^j,
(E + ½τA₁^j)φ^{j+2/4} = φ^{j+1/4},
φ^{j+3/4} = (E − ½τA₂^j)φ^{j+2/4},
(E + ½τA₂^j)φ^{j+1} = φ^{j+3/4},   (6.8)
j = 0, 1, ...,
φ^0 = g.

If A is split into A₁ and A₂ in such a way that the efficient solution of the equations with the matrices (E + ½τA_α^j) is possible, then the realization of the whole algorithm will also be efficient.
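The claimed second-order accuracy for commuting operators is easy to observe numerically. The sketch below (Python with NumPy; the diagonal matrices are an illustrative commuting pair, so the exact solution exp(−τA)g is available componentwise) measures the one-step error of (6.3)/(6.8), which should shrink like τ³ when τ is halved:

```python
import numpy as np

d = np.array([1.0, 2.0, 3.0])
e = np.array([0.5, 1.5, 2.5])
A1, A2 = np.diag(d), np.diag(e)      # commuting, positive-definite pair
E = np.eye(3)
g = np.array([1.0, -1.0, 2.0])

def split_step(tau):
    # one step of the componentwise Crank-Nicolson splitting (6.3)
    s = np.linalg.solve(E + 0.5 * tau * A1, (E - 0.5 * tau * A1) @ g)
    return np.linalg.solve(E + 0.5 * tau * A2, (E - 0.5 * tau * A2) @ s)

def err(tau):
    exact = np.exp(-tau * (d + e)) * g   # exp(-tau*A)g, A = A1 + A2 diagonal
    return np.linalg.norm(split_step(tau) - exact)

ratio = err(0.1) / err(0.05)
assert 6.0 < ratio < 10.0   # local error O(tau^3): halving tau gives ~1/8
```

A local error of order τ³ per step accumulates to the global second-order accuracy in τ stated above.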

REMARK 6.1. To formulate scheme (6.3) with a varying time step it is enough to substitute τ_j = t_{j+1} − t_j for τ in (6.3) and (6.8).

7. Multicomponent splitting of problems based on Crank-Nicolson schemes

We suppose that in (5.1) we have
A = Σ_{α=1}^{n} A_α,   A_α ≥ 0,   n ≥ 2.   (7.1)
We introduce on the interval t_j ≤ t ≤ t_{j+1} approximations A_α^j of the operators A_α so that
A^j = Σ_{α=1}^{n} A_α^j,   A_α^j ≥ 0.   (7.2)
The multicomponent splitting scheme based on the elementary Crank-Nicolson scheme is the system of equations
(E + ½τA_α^j)φ^{j+α/n} = (E − ½τA_α^j)φ^{j+(α−1)/n},
α = 1, 2, ..., n,   j = 0, 1, ...,   (7.3)
φ^0 = g.

In the case where the A_α^j ≥ 0 are commutative and
A_α^j = A_α(t_{j+1/2})   or   A_α^j = ½(A_α(t_{j+1}) + A_α(t_j)),   (7.4)
the given scheme is absolutely stable and has a second-order approximation in τ. For noncommutative operators A_α^j scheme (7.3) will be, generally speaking, a scheme of first-order accuracy in τ (MARCHUK [1980], pp. 266-267).
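The loss of one order for noncommuting operators can be observed directly. The sketch below (Python with NumPy; the two symmetric positive-definite 2×2 matrices are a hypothetical noncommuting pair) measures the one-step error of the product of Crank-Nicolson factors against the exact matrix exponential; it now shrinks only like τ², corresponding to first-order global accuracy:

```python
import numpy as np

A1 = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 1 and 3
A2 = np.diag([1.0, 3.0])                  # [A1, A2] != 0
A = A1 + A2
E = np.eye(2)
g = np.array([1.0, 1.0])

w, V = np.linalg.eigh(A)                  # A symmetric: exact exponential via eigh

def exact(tau):
    return V @ (np.exp(-tau * w) * (V.T @ g))

def split_step(tau):
    s = np.linalg.solve(E + 0.5 * tau * A1, (E - 0.5 * tau * A1) @ g)
    return np.linalg.solve(E + 0.5 * tau * A2, (E - 0.5 * tau * A2) @ s)

def err(tau):
    return np.linalg.norm(split_step(tau) - exact(tau))

ratio = err(0.02) / err(0.01)
assert 3.0 < ratio < 5.0   # local error O(tau^2): halving tau gives ~1/4
```

The leading term of the one-step error is ½τ²[A₂, A₁]φ, which vanishes exactly when the components commute, recovering the second-order case of the previous experiment.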

8. A general approach to componentwise splitting based on elementary Crank-Nicolson schemes
Let the assumptions (7.1) hold for problem (5.1), where the A_α are time-independent and the solution φ is sufficiently smooth in time. Suppose that (5.1) is approximated in a weak sense (weak approximation will be studied in Part 2 of this paper) on each interval Ω_j = {t_j ≤ t ≤ t_{j+1}} by the problems
∂φ_α/∂t + A_αφ_α = 0   in Ω_j,   α = 1, ..., n,   (8.1)
φ_α = φ_{α−1}   for t = t_j.
We introduce the notations
φ^{j+(α−1)/n} ≡ φ_α(t_j),   φ^{j+α/n} ≡ φ_α(t_{j+1}).   (8.2)
Applying a Crank-Nicolson scheme to each of the equations in (8.1) we obtain the system of difference equations
(φ^{j+α/n} − φ^{j+(α−1)/n})/τ + A_α(φ^{j+α/n} + φ^{j+(α−1)/n})/2 = 0,   α = 1, ..., n,   (8.3)
where
φ^j = φ^{j+0/n},   φ^{j+1} = φ^{j+n/n}.   (8.4)
We suppose that each operator A_α is in its turn representable in the form
A_α = Σ_{β=1}^{m_α} A_{αβ},   (8.5)
where A_{αβ} ≥ 0.

REMARK 8.1. The question arises: Is it worthwhile to split first the operator A into the A_α and then, in turn, the operators A_α into the A_{αβ}? Is it not easier to represent the operator A as a set of operators A_{αβ} right away? In this context it should be remarked that though these two approaches seem equivalent, in many cases it is more convenient first to decompose a complex problem of mathematical physics into simpler ones which can then be independently reduced to elementary problems.

Taking into account (8.5) we state for system (8.1) the following splitting scheme:
(φ_α^{j+β/m_α} − φ_α^{j+(β−1)/m_α})/τ + A_{αβ}(φ_α^{j+β/m_α} + φ_α^{j+(β−1)/m_α})/2 = 0,   (8.6)
α = 1, 2, ..., n,   β = 1, 2, ..., m_α,
where
φ₁^j = φ^j,   φ_α^j = φ_{α−1}^{j+1}   (α > 1),   φ^{j+1} = φ_n^{j+1}.
This scheme approximates system (8.1) with second-order accuracy in τ if the operators A_{αβ} commute (MARCHUK [1980], p. 274). This result remains true in the case where the operators A_{αβ} are time-dependent. In that case it is necessary to construct second-order approximations in τ of the operators A_{αβ} on each interval t_j ≤ t ≤ t_{j+1}.

9. A general formulation for the splitting method based on multi-level schemes

Consider the problem

∂φ/∂t + Aφ = f in Ω_t,   (9.1)
φ = g for t = 0,

where

A(t) = Σ_{α=1}^{n} A_α(t)

is a representation of the operator A(t) as the sum of n operators A_1, …, A_n.

Approximate the operators A_1, …, A_n by operators A_{αβ}^j on the interval t_j ≤ t ≤ t_{j+1} in such a way that the operator Σ_{β=0}^{α} A_{αβ}^j approximates the operator A_α. We write the following splitting scheme:

(φ^{j+1/n} − φ^j)/τ + A_{10}^j φ^j + A_{11}^j φ^{j+1/n} = F_1^j,

(φ^{j+2/n} − φ^{j+1/n})/τ + A_{20}^j φ^j + A_{21}^j φ^{j+1/n} + A_{22}^j φ^{j+2/n} = F_2^j,   (9.2)

. . . . . . . . . . . . . . . . . . . .

(φ^{j+1} − φ^{j+(n−1)/n})/τ + A_{n0}^j φ^j + A_{n1}^j φ^{j+1/n} + ⋯ + A_{nn}^j φ^{j+1} = F_n^j,

where

F_α^j = B_α f^j,   α = 1, 2, …, n,

and the sum Σ_{α=1}^{n} B_α approximates the identity operator E. Notice that if

A_{αβ} = 0 for β < α − 1,   (9.3)

then we obtain a two-level splitting scheme. In particular, the homogeneous two-level scheme (f ≡ 0, F_α = 0, A_{αβ} = 0 for β < α − 1) has the form

(φ^{j+k/n} − φ^{j+(k−1)/n})/τ + A_{k,k−1} φ^{j+(k−1)/n} + A_{kk} φ^{j+k/n} = 0,   (9.4)

k = 1, 2, …, n.

If the operators A_{kk} and A_{k,k−1} are commutative, then the scheme (9.4) approximates problem (9.1) and is stable provided that

‖C_n C_{n−1} ⋯ C_1‖ ≤ 1 + const·τ,   (9.5)

where

C_k = (E + τA_{kk})^{−1}(E − τA_{k,k−1}),   k = 1, 2, …, n

(see YANENKO [1967], p. 151).

REMARK 9.1. It follows from (9.5) that the stability of the whole scheme does not require the stability of each elementary scheme in (9.4).

REMARK 9.2. In the case where all operators A_{αβ} and B_α are commutative, the statements about the approximation of scheme (9.2) and its stability (and therefore the convergence conditions; see Theorem 3.1 in the Introduction) may be formulated as well.

REMARK 9.3. In scheme (9.2) the operators A_{αβ} may have an arbitrary structure, including difference, differential and integro-differential operators. Schemes of type (9.2) were developed in the works by YANENKO [1960, 1967] and by MARCHUK and YANENKO [1966], in which the theory and applications of such schemes are studied in more detail.

REMARK 9.4. If we take in (9.2) A_{kk}^j = A_k^j, the approximation of the operator A_k(t) (and A_{αβ}^j = 0 for α ≠ β), and B_α = 0 for α < n, B_n = E, then we obtain scheme (5.5).

10. The two-level splitting scheme with weight coefficients

Let the operators A_{k,k−1}^j, A_{kk}^j in scheme (9.2) have the form

A_{kk}^j = σ A_k^j,
A_{k,k−1}^j = (1 − σ) A_k^j,   (10.1)
0 ≤ σ = const ≤ 1,

where A_k^j is the approximation of the operator A_k(t) on t_j ≤ t ≤ t_{j+1} and the other operators A_{αβ}^j are zero operators. Then we obtain the splitting scheme with weight coefficients:

(φ^{j+1/n} − φ^j)/τ + A_1^j((1 − σ)φ^j + σφ^{j+1/n}) = F_1^j,

(φ^{j+2/n} − φ^{j+1/n})/τ + A_2^j((1 − σ)φ^{j+1/n} + σφ^{j+2/n}) = F_2^j,

. . . . . . . . . . . . . . . . . . . .   (10.2)

(φ^{j+1} − φ^{j+(n−1)/n})/τ + A_n^j((1 − σ)φ^{j+(n−1)/n} + σφ^{j+1}) = F_n^j,

j = 0, 1, …,
φ^0 = g.

If we take in (10.2) σ = ½, then in the case where the operators A_k are commutative and A_k^j approximates A_k with the order O(τ²), the whole scheme (10.2) will have a second-order approximation in τ as well. Besides, if A_k^j ≥ 0, then the scheme will be absolutely stable.
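The effect of the weight σ in (10.2) is easy to observe numerically. The following sketch is illustrative only (the matrices and step counts are invented, and f = 0): with σ = 1 (purely implicit) the error falls by about a factor of two when τ is halved, while with σ = ½ it falls by about a factor of four.

```python
import numpy as np

def weighted_splitting_step(phi, ops, tau, sigma):
    # One step of scheme (10.2) with f = 0: for each A_k solve
    # (phi_new - phi)/tau + A_k((1 - sigma) phi + sigma phi_new) = 0.
    I = np.eye(len(phi))
    for A in ops:
        rhs = (I - (1.0 - sigma) * tau * A) @ phi
        phi = np.linalg.solve(I + sigma * tau * A, rhs)
    return phi

A1 = np.diag([1.0, 3.0])
A2 = np.diag([0.5, 2.0])
g = np.array([1.0, 2.0])
T = 1.0
exact = np.exp(-np.diag(A1 + A2) * T) * g

def error(steps, sigma):
    tau = T / steps
    phi = g.copy()
    for _ in range(steps):
        phi = weighted_splitting_step(phi, [A1, A2], tau, sigma)
    return np.max(np.abs(phi - exact))

r_implicit = error(20, 1.0) / error(40, 1.0)   # ~2: first order
r_weighted = error(20, 0.5) / error(40, 0.5)   # ~4: second order
```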

11. The splitting method for systems which do not belong to the
Cauchy-Kovalevskaya class

The splitting method allows a special formulation for systems which do not belong to the Cauchy-Kovalevskaya class (irregular systems) (see MARCHUK [1964, 1967], YANENKO [1967], MARCHUK and YANENKO [1966]). Let

∂φ/∂t + Aφ + Lψ = f,   A = Σ_{α=1}^{n} A_α,   (11.1)

and

K_1 φ + K_2 ψ + g = 0   (11.2)

be an irregular system for which

∂φ/∂t + Aφ = F,   F = f − Lψ,   (11.3)

is regular for a given function F. For integrating (11.1) and (11.2) the following scheme may be used:
(φ^{j+1/n} − φ^j)/τ + A_{10}^j φ^j + A_{11}^j φ^{j+1/n} = B_1 F^j,

(φ^{j+2/n} − φ^{j+1/n})/τ + A_{20}^j φ^j + A_{21}^j φ^{j+1/n} + A_{22}^j φ^{j+2/n} = B_2 F^j,   (11.4)

. . . . . . . . . . . . . . . . . . . .

(φ^{j+1} − φ^{j+(n−1)/n})/τ + A_{n0}^j φ^j + A_{n1}^j φ^{j+1/n} + ⋯ + A_{nn}^j φ^{j+1} = B_n F^{j+1},

where the operators A_{αβ}^j and B_α were defined in the formulation of (9.2). While solving system (11.4) on the last fractional step (i.e., the last equation from (11.4)), it is necessary to use, for definiteness of the algorithm, the difference analogue of the relationship (11.2):

K_1^{j+1} φ^{j+1} + K_2^{j+1} ψ^{j+1} + g^{j+1} = 0,   (11.5)

where K_i^{j+1} is the approximation of the operator K_i, g^{j+1} approximates g, and F^{j+1} (from (11.4)) is the approximation of (f − Lψ).

12. Splitting schemes for the heat conduction equation: Local one-dimensional
schemes

We consider some of the splitting schemes that are applied to the heat conduction
equation.

12.1. The splitting scheme for the heat conduction equation in an orthogonal coordinate system
We consider the three-dimensional problem for the heat conduction equation:

∂φ/∂t − Δφ = 0 in Ω × Ω_t,
φ = 0 on ∂Ω,   (12.1)
φ = g in Ω for t = 0,

where

Ω = {(x, y, z): 0 < x, y, z < 1},   Ω_t = (0, T),

g is a given function, and

Δ = ∂²/∂x² + ∂²/∂y² + ∂²/∂z².

Approximating (12.1) by the finite difference method in the variables x, y and z in the same way as was done in problems (1.5) and (1.19), and taking into account the given boundary conditions from (12.1), we arrive at the matrix evolutionary equation

∂φ/∂t + Aφ = 0 in Ω_t,   (12.2)
φ = g for t = 0,

where

A = A_x + A_y + A_z,
A_x = −(Δ_x̄∇_x)/h_x² ≡ A_1,   A_y = −(Δ_ȳ∇_y)/h_y² ≡ A_2,   A_z = −(Δ_z̄∇_z)/h_z² ≡ A_3,

and g and φ are vectors (φ = φ(t) here takes into account the given boundary conditions). The operator A is considered in the Hilbert space Φ = F with the norm ‖φ‖_Φ = (φ, φ)^{1/2} defined as

‖φ‖_Φ = (Σ_{k=1}^{N_x−1} Σ_{l=1}^{N_y−1} Σ_{p=1}^{N_z−1} h_x h_y h_z φ²_{k,l,p})^{1/2}.

Scheme (5.5) applied to problem (12.1) has the form (YANENKO [1959a]):

(φ^{j+1/3} − φ^j)/τ + A_1 φ^{j+1/3} = 0,

(φ^{j+2/3} − φ^{j+1/3})/τ + A_2 φ^{j+2/3} = 0,
  (12.3)
(φ^{j+1} − φ^{j+2/3})/τ + A_3 φ^{j+1} = 0,

φ^0 = g.

Each equation from (12.3) may be easily solved by the factorization (sweep) method. Scheme (12.3) is absolutely stable and has a first-order approximation in τ, and therefore a convergence theorem (of the type of Theorem 3.1) holds.
To increase the accuracy of scheme (12.3), the weighted scheme (10.2) may be applied (YANENKO [1967], p. 33):

(φ^{j+1/3} − φ^j)/τ + A_1((1 − σ)φ^j + σφ^{j+1/3}) = 0,

(φ^{j+2/3} − φ^{j+1/3})/τ + A_2((1 − σ)φ^{j+1/3} + σφ^{j+2/3}) = 0,
  (12.4)
(φ^{j+1} − φ^{j+2/3})/τ + A_3((1 − σ)φ^{j+2/3} + σφ^{j+1}) = 0,

j = 0, 1, …,
φ^0 = g.

With σ = ½ the scheme has a second-order approximation in τ, and the whole order of accuracy of this scheme is O(τ² + h²), where h = max(h_x, h_y, h_z). It is absolutely stable. Equations (12.4) again may be solved by a one-dimensional sweep method.
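Each fractional step of (12.3) or (12.4) reduces to a tridiagonal system along one grid line, which the factorization (sweep) method solves in O(N) operations. Below is a minimal sketch of that sweep (Thomas) algorithm; the coefficient arrays and the test data are illustrative, not taken from the text.

```python
import numpy as np

def sweep(a, b, c, f):
    # Sweep (Thomas) method for a[i] x[i-1] + b[i] x[i] + c[i] x[i+1] = f[i],
    # with a[0] = 0 and c[-1] = 0: forward elimination, then back substitution.
    n = len(f)
    alpha = np.zeros(n)
    beta = np.zeros(n)
    alpha[0] = -c[0] / b[0]
    beta[0] = f[0] / b[0]
    for i in range(1, n):
        denom = b[i] + a[i] * alpha[i - 1]
        alpha[i] = -c[i] / denom
        beta[i] = (f[i] - a[i] * beta[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = beta[-1]
    for i in range(n - 2, -1, -1):
        x[i] = alpha[i] * x[i + 1] + beta[i]
    return x

# One implicit fractional step (E + tau*A_1) phi_new = phi on a 1-D grid line:
# A_1 = -(second difference)/h^2 gives the tridiagonal coefficients below.
N, tau, h = 8, 0.01, 0.1
a = np.full(N, -tau / h**2); a[0] = 0.0
c = np.full(N, -tau / h**2); c[-1] = 0.0
b = np.full(N, 1.0 + 2.0 * tau / h**2)
phi = np.sin(np.linspace(0.3, 2.8, N))
phi_new = sweep(a, b, c, phi)
```

The elimination is stable here because each row is diagonally dominant (1 + 2τ/h² > 2τ/h²), which is exactly the situation produced by the implicit fractional steps.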

12.2. The splitting scheme for the heat conduction equation in an arbitrary
coordinate system
We consider a problem

∂φ/∂t + Aφ = 0 in Ω × Ω_t,
φ = 0 on ∂Ω,   (12.5)
φ = g in Ω for t = 0,

where Ω is a bounded two-dimensional domain and the differential operator A has the form

A = −Σ_{i,j=1}^{2} a_{ij} ∂²/∂x_i∂x_j,   a_{ij} = const,   (12.6)

a_{11}a_{22} − a_{12}² > 0.

We introduce in Ω a rectangular grid with step h_i in the x_i-direction (i = 1, 2) and the following difference approximations (taking into account the boundary conditions):

A_{11} = −a_{11}(Δ_{x̄_1}∇_{x_1})/h_1², approximating −a_{11} ∂²/∂x_1²,

A_{12} = A_{21} = −a_{12}(Δ_{x_1} + ∇_{x_1})(Δ_{x_2} + ∇_{x_2})/(4h_1h_2), approximating −a_{12} ∂²/∂x_1∂x_2,

A_{22} = −a_{22}(Δ_{x̄_2}∇_{x_2})/h_2², approximating −a_{22} ∂²/∂x_2².

We write the following scheme (YANENKO, SUCHOV and POGODIN [1959]):

(φ^{j+1/2} − φ^j)/τ + A_{11}φ^{j+1/2} + A_{12}φ^j = 0,
  (12.7)
(φ^{j+1} − φ^{j+1/2})/τ + A_{21}φ^{j+1/2} + A_{22}φ^{j+1} = 0,
which is a special case of the splitting scheme (9.2). It is not difficult to notice that here in the first fractional step "half" of the operator A, i.e. A_{11} + A_{12}, is approximated, where A_{11} is taken on the time layer j+½ and A_{12} on the layer j; in the second fractional step the second "half" of A, i.e. A_{21} + A_{22}, is approximated, where A_{21} = A_{12} again is taken on the layer j+½ and A_{22} on the layer j+1. Scheme (12.7) approximates problem (12.5) and is stable; therefore the convergence theorem holds.
Under stronger conditions than the ellipticity condition, for the three-dimensional heat conduction equation

∂φ/∂t − Σ_{i,j=1}^{3} a_{ij} ∂²φ/∂x_i∂x_j = 0,   (12.8)

YANENKO [1959b, 1964] suggested the scheme

(φ^{j+1/6} − φ^j)/τ + ½A_{11}φ^{j+1/6} + A_{12}φ^j = 0,

(φ^{j+2/6} − φ^{j+1/6})/τ + A_{21}φ^{j+1/6} + ½A_{22}φ^{j+2/6} = 0,

(φ^{j+3/6} − φ^{j+2/6})/τ + ½A_{11}φ^{j+3/6} + A_{13}φ^{j+2/6} = 0,
  (12.9)
(φ^{j+4/6} − φ^{j+3/6})/τ + A_{31}φ^{j+3/6} + ½A_{33}φ^{j+4/6} = 0,

(φ^{j+5/6} − φ^{j+4/6})/τ + ½A_{22}φ^{j+5/6} + A_{23}φ^{j+4/6} = 0,

(φ^{j+1} − φ^{j+5/6})/τ + A_{32}φ^{j+5/6} + ½A_{33}φ^{j+1} = 0.

This scheme approximates (12.8) and is stable if the matrix with elements b_{ij} = a_{ij} for i ≠ j and b_{ii} = ½a_{ii} is positive-definite.

12.3. Local one-dimensional schemes

We consider problem (5.1) and the system (8.1) which approximates (5.1). If in (8.1) the operators A_α are one-dimensional differential operators (or their approximations), then the corresponding difference scheme is also called local one-dimensional (SAMARSKII [1971], p. 407). Thus, if in (8.3) the operators A_α are one-dimensional, then this scheme may be called local one-dimensional.
The theory of local one-dimensional schemes for some differential equations is
presented in the works by SAMARSKII [1971] and SAMARSKII and NIKOLAEV [1978].
Here we will consider only one such scheme in application to the following problem
of the heat conduction equation:
∂φ/∂t + Aφ = f in Ω × Ω_t,
φ|_{∂Ω} = φ^{(Γ)}(x, t),   (12.10)
φ = g(x) in Ω for t = 0,

where

A = Σ_{α=1}^{n} A_α,   A_α = −∂²/∂x_α²,

x = (x_1, …, x_n) ∈ Ω = {0 < x_α < 1: α = 1, …, n}.

We suppose that problem (12.10) has a unique, sufficiently smooth solution.
In the layer t_j ≤ t ≤ t_{j+1} we will solve, instead of (12.10), the sequence of equations

∂φ_α/∂t + A_α φ_α = f_α,   x ∈ Ω,   t ∈ Ω_j,   (12.11)
α = 1, …, n,   Ω_j = {t_j < t < t_{j+1}},

with the initial conditions

φ_1(x, 0) = g,
φ_1(x, t_j) = φ(x, t_j),   j = 1, 2, …,
φ_α(x, t_j) = φ_{α−1}(x, t_{j+1}),   j = 0, 1, …,   α = 2, 3, …, n,

where we set

φ(x, t_{j+1}) = φ_n(x, t_{j+1}).

Here the f_α are arbitrary functions such that Σ_{α=1}^{n} f_α = f. Boundary conditions for φ_α are set only on the part ∂Ω_α of the boundary ∂Ω consisting of the sides x_α = 0 and x_α = 1. We introduce in Ω a uniform grid with step h along each variable. Define the difference approximation of the operator A_α as follows:

A_α = −(Δ_{x̄_α}∇_{x_α})/h²,   α = 1, …, n.
Passing from (12.11) to difference approximations of the problems in the spatial variables (when φ_α and f_α are vectors and A_α are matrices, Ω is a grid domain and ∂Ω_α is the grid on the sides x_α = 0 and x_α = 1), and approximating in t with the help of a two-layer implicit scheme of first-order accuracy in τ as well, we obtain the following local one-dimensional scheme:

(φ_α^{j+1} − φ_α^j)/τ + A_α φ_α^{j+1} = f_α(t_{j+1/2}),   (12.12)
α = 1, …, n,   j = 0, 1, …,

with the initial conditions

φ_1^0 = g,   φ_1^j = φ^j,   j = 1, 2, …,   (12.13)
φ_α^j = φ_{α−1}^{j+1},   α = 2, 3, …, n,   j = 0, 1, 2, …,

and the boundary conditions

φ_α|_{∂Ω_α} = φ^{(Γ)}(t_{j+1}).   (12.14)
Each problem in (12.12)-(12.14) (for each fixed α) is a one-dimensional first boundary value problem and may be solved by the sweep method.

In order to find the approximate values of the solution of the initial problem on the time layer t_{j+1} using data from the layer t_j, it is necessary to solve successively a sequence of one-dimensional problems. The local one-dimensional scheme (12.12)-(12.14) is stable in the metric

‖φ‖_C = max_{x_i ∈ Ω} |φ_i|

(i.e. uniformly stable) with respect to the initial and boundary data and to the right-hand side.
If problem (12.10) has a unique solution φ = φ(x, t) continuous in Ω × Ω_t, and there exist in Ω × Ω_t the derivatives

∂²φ/∂t²,   ∂⁴φ/∂x_α²∂x_β²,   ∂³φ/∂t∂x_α²,   ∂²f/∂x_α²,   α, β = 1, …, n,

then scheme (12.12)-(12.14) converges uniformly with rate O(h² + τ) (it has first-order accuracy in τ and second-order accuracy in h), so that

‖φ^j − φ(t_j)‖_C ≤ M(h² + τ),   j = 1, 2, …,

where M = const does not depend on τ and h (SAMARSKII [1971], p. 423).

REMARK 12.1. SAMARSKII ([1971], p. 425) presented and studied local one-dimensional schemes for the parabolic equation with variable coefficients in a domain of complex shape. Local one-dimensional schemes for the quasi-linear parabolic equation were considered by FRYAZINOV [1964]. SAMARSKII [1970, 1971] gives a bibliography of the works devoted to this class of schemes.
CHAPTER III

Two-Cycle Componentwise
Splitting Methods

Let us construct a class of splitting methods in which the requirement of commutativity of the operators {A_α} in the representation A = Σ_{α=1}^{n} A_α is removed.

13. Two-cycle componentwise splitting methods: The case A = A1 + A 2

Consider the problem

∂φ/∂t + Aφ = 0 in Ω_t,   (13.1)
φ = g for t = 0,

under the assumptions (6.1). We will approximate the operators A_1(t) and A_2(t) not on the interval t_j ≤ t ≤ t_{j+1} (as in (6.3)) but on the interval t_{j−1} ≤ t ≤ t_{j+1}. Let A_α^j = A_α(t_j). Write the following two systems of difference equations:

(φ^{j−1/2} − φ^{j−1})/τ + A_1^j (φ^{j−1/2} + φ^{j−1})/2 = 0,
  (13.2)
(φ^j − φ^{j−1/2})/τ + A_2^j (φ^j + φ^{j−1/2})/2 = 0,

(φ^{j+1/2} − φ^j)/τ + A_2^j (φ^{j+1/2} + φ^j)/2 = 0,
  (13.3)
(φ^{j+1} − φ^{j+1/2})/τ + A_1^j (φ^{j+1} + φ^{j+1/2})/2 = 0.
The cycle of computations consists of the successive application of the schemes (13.2) and (13.3). Eliminating the intermediate values φ^{j−1/2}, φ^j and φ^{j+1/2} from these schemes, we get on the whole step of computations

φ^{j+1} = T^j φ^{j−1},   (13.4)

where

T^j = (E + ½τA_1^j)^{−1}(E − ½τA_1^j)(E + ½τA_2^j)^{−1}(E − ½τA_2^j)
   × (E + ½τA_2^j)^{−1}(E − ½τA_2^j)(E + ½τA_1^j)^{−1}(E − ½τA_1^j)   (13.5)
 = E − 2τA^j + ½(2τ)²(A^j)² + O(τ³).

Here and further we suppose that restrictions (6.6) are satisfied.


If A,(t)> O,then the system of the difference equations (13.2)-(13.3) is absolutely
stable and scheme (13.4) approximates the initial problem (13.1) with second-order
accuracy in z, provided that the solution (pof problem (5.1) and the elements of the
matrices {A(t)} are sufficiently smooth.
Indeed, from the expansion (13.5) of the step operator T we conclude that it
coincides up to a quantity of order O(T2 ) with the step operator of the
Crank-Nicolson scheme applied to the doubled time interval. Hence, independently
of commutativeness of the operators {Aj} scheme (13.4) has a second-order
approximation in .
The stability of the scheme follows from the estimates

‖φ^{j+1}‖ ≤ ‖T^j‖ ‖φ^{j−1}‖ ≤ ‖φ^{j−1}‖ ≤ ⋯ ≤ ‖g‖,

since ‖T^j‖ ≤ 1, as follows from Theorem 2.1.
For the inhomogeneous problem

∂φ/∂t + Aφ = f in Ω_t,   (13.6)
φ = g for t = 0,

the two-cycle componentwise splitting scheme has the form:

(E + ½τA_1^j)φ^{j−1/2} = (E − ½τA_1^j)φ^{j−1},
(E + ½τA_2^j)(φ^j − τf^j) = (E − ½τA_2^j)φ^{j−1/2},
  (13.7)
(E + ½τA_2^j)φ^{j+1/2} = (E − ½τA_2^j)(φ^j + τf^j),
(E + ½τA_1^j)φ^{j+1} = (E − ½τA_1^j)φ^{j+1/2},

where f^j = f(t_j). Solving these equations with respect to φ^{j+1} we obtain

φ^{j+1} = T^j φ^{j−1} + 2τ T_1^j T_2^j f^j,   (13.8)

where

T^j = T_1^j T_2^j T_2^j T_1^j,
T_α^j = (E + ½τA_α^j)^{−1}(E − ½τA_α^j).

Scheme (13.7) approximates problem (13.6) with second-order accuracy in τ and is absolutely stable under the assumptions of Theorem 2.1 (MARCHUK [1980], p. 259). The system of equations (13.7) may also be written in the following equivalent form:

(E + ½τA_1^j)φ^{j−2/3} = (E − ½τA_1^j)φ^{j−1},
(E + ½τA_2^j)φ^{j−1/3} = (E − ½τA_2^j)φ^{j−2/3},
φ^{j+1/3} = φ^{j−1/3} + 2τf^j,   (13.9)
(E + ½τA_2^j)φ^{j+2/3} = (E − ½τA_2^j)φ^{j+1/3},
(E + ½τA_1^j)φ^{j+1} = (E − ½τA_1^j)φ^{j+2/3}.

Eliminating the unknown quantities with fractional indices we come to the resolved equation in the form

φ^{j+1} = T_1^j T_2^j T_2^j T_1^j φ^{j−1} + 2τ T_1^j T_2^j f^j,   (13.10)

which coincides with (13.8). In some cases the representation of the equations in the form (13.9) is preferable to the form (13.7).
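A small numerical sketch of the two-cycle scheme (13.7) is given below. The matrices, source term and step counts are invented for illustration, and the reference solution is computed with a plain Taylor series for the matrix exponential. The operators A1 and A2 deliberately do not commute; the observed error nevertheless falls by about a factor of four when τ is halved, as the second-order claim predicts.

```python
import numpy as np

def two_cycle_step(phi, A1, A2, f, tau):
    # One double step t_{j-1} -> t_{j+1} of scheme (13.7).
    I = np.eye(len(phi))
    def cn(A, v):  # (E + tau/2 A)^{-1} (E - tau/2 A) v
        return np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ v)
    phi = cn(A1, phi)
    phi = cn(A2, phi) + tau * f
    phi = cn(A2, phi + tau * f)
    return cn(A1, phi)

def expm_series(M, terms=40):
    # Plain Taylor series for exp(M); adequate for the small matrices here.
    E, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        E = E + term
    return E

A1 = np.array([[2.0, 1.0], [0.0, 1.0]])   # noncommuting pair
A2 = np.array([[1.0, 0.0], [1.0, 3.0]])
A = A1 + A2
f = np.array([1.0, -1.0])
g = np.array([0.5, 2.0])
T = 1.0
phi_inf = np.linalg.solve(A, f)            # steady state of dphi/dt + A phi = f
exact = phi_inf + expm_series(-A * T) @ (g - phi_inf)

def error(cycles):
    tau = T / (2 * cycles)                 # each cycle advances 2*tau
    phi = g.copy()
    for _ in range(cycles):
        phi = two_cycle_step(phi, A1, A2, f, tau)
    return np.max(np.abs(phi - exact))

ratio = error(16) / error(32)              # ~4: second order despite [A1, A2] != 0
```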

14. The two-cycle multicomponent splitting method

Suppose that in (13.1) the operator A(t) has the form

A(t) = Σ_{α=1}^{n} A_α(t),

where all operators A_α are positive-semidefinite: A_α(t) ≥ 0.


Consider the special construction of the splitting method which solves the Cauchy
problem for the positive-semidefinite and noncommutative operators A and has
a second-order approximation. It appears to be in some sense the complete solution
of the splitting problem. To construct such a scheme it is necessary to replace (7.3) by
the scheme

rpj= H Tp-', Ji+l H=


-TJ J (14.1)

where
TJ = (E + Ai,) '(E - rA,).

This means algorithmically that first we solve the system of equations (7.3) on the interval t_{j−1} ≤ t ≤ t_j for α = 1, 2, …, n, and then the same system on the interval t_j ≤ t ≤ t_{j+1}, but in the backward sequence (α = n, n−1, …, 1):

(E + ½τA_α^j)φ^{j−1+α/n} = (E − ½τA_α^j)φ^{j−1+(α−1)/n},   α = 1, 2, …, n,
  (14.2)
(E + ½τA_α^j)φ^{j+1−(α−1)/n} = (E − ½τA_α^j)φ^{j+1−α/n},   α = n, n−1, …, 1.
Obviously, for the whole cycle (14.2) we have

φ^{j+1} = T^j φ^{j−1},

where

T^j = T_1^j ⋯ T_n^j T_n^j ⋯ T_1^j = E − 2τA^j + ½(2τ)²(A^j)² + O(τ³).

Thus, on the interval t_{j−1} ≤ t ≤ t_{j+1} scheme (14.2) has second-order accuracy in τ, provided that we take as A_α^j one of the next two operators:

A_α^j = A_α(t_j),   (14.3)

A_α^j = ½(A_α(t_{j+1}) + A_α(t_{j−1})).   (14.4)

Besides, scheme (14.2) is absolutely stable for A_α^j ≥ 0 (MARCHUK [1980], p. 268).
For the inhomogeneous problem (13.6), where A(t) ≥ 0 and

A(t) = Σ_{α=1}^{n} A_α(t),   A_α(t) ≥ 0,

on the interval t_{j−1} ≤ t ≤ t_{j+1} the two-cycle multicomponent splitting scheme has the form:

(E + ½τA_1^j)φ^{j−(n−1)/n} = (E − ½τA_1^j)φ^{j−1},
. . . . . . . . . . . . . . . . . . . .
(E + ½τA_n^j)(φ^j − τf^j) = (E − ½τA_n^j)φ^{j−1/n},
  (14.5)
(E + ½τA_n^j)φ^{j+1/n} = (E − ½τA_n^j)(φ^j + τf^j),
. . . . . . . . . . . . . . . . . . . .
(E + ½τA_1^j)φ^{j+1} = (E − ½τA_1^j)φ^{j+(n−1)/n},

where

A_α^j = A_α(t_j).
It is not difficult to see that this scheme, under the necessary smoothness assumptions, has a second-order approximation in τ and is absolutely stable (MARCHUK [1980], p. 269).
In the same way as in the case n = 2, the multicomponent system of equations (14.5) may be written in the equivalent form

(E + ½τA_α^j)φ^{j−(n+1−α)/(n+1)} = (E − ½τA_α^j)φ^{j−(n+2−α)/(n+1)},   α = 1, 2, …, n,

φ^{j+1/(n+1)} = φ^{j−1/(n+1)} + 2τf^j,   (14.6)

(E + ½τA_{n−α+2}^j)φ^{j+α/(n+1)} = (E − ½τA_{n−α+2}^j)φ^{j+(α−1)/(n+1)},   α = 2, 3, …, n+1.
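The cycle (14.2) is simply a palindromic product of the elementary transition operators T_α. The sketch below (an illustrative three-operator example, not from the text) applies the forward sweep α = 1, …, n and then the backward sweep α = n, …, 1, and checks second-order accuracy against a series-computed matrix exponential.

```python
import numpy as np

def two_cycle_multicomponent(phi, ops, tau):
    # One double step of scheme (14.2): forward sweep alpha = 1..n over
    # [t_{j-1}, t_j], then backward sweep alpha = n..1 over [t_j, t_{j+1}].
    I = np.eye(len(phi))
    for A in list(ops) + list(ops)[::-1]:
        phi = np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ phi)
    return phi

def expm_series(M, terms=40):
    # Plain Taylor series for exp(M); adequate for the small matrices here.
    E, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        E = E + term
    return E

ops = [np.array([[2.0, 1.0], [0.0, 1.0]]),
       np.array([[1.0, 0.0], [1.0, 3.0]]),
       np.array([[1.0, 0.5], [0.5, 1.0]])]   # mutually noncommuting
A = sum(ops)
g = np.array([1.0, -2.0])
T = 1.0
exact = expm_series(-A * T) @ g

def error(cycles):
    tau = T / (2 * cycles)   # each palindromic cycle advances 2*tau
    phi = g.copy()
    for _ in range(cycles):
        phi = two_cycle_multicomponent(phi, ops, tau)
    return np.max(np.abs(phi - exact))

ratio = error(16) / error(32)   # ~4: second-order accuracy of the two-cycle method
```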

15. The two-cycle componentwise splitting method for quasi-linear problems


Consider an evolutionary problem with an operator A which depends upon the time and upon the solution of the problem:

∂φ/∂t + A(t, φ)φ = 0 in Ω × Ω_t,   (15.1)
φ = g in Ω for t = 0.

Suppose that A(t, φ) is nonnegative, sufficiently smooth and has the form

A(t, φ) = Σ_{α=1}^{n} A_α(t, φ),   A_α(t, φ) ≥ 0.   (15.2)

Suppose further that the solution φ is also a sufficiently smooth function of time.
Consider on the interval t_{j−1} ≤ t ≤ t_{j+1} the splitting scheme

(φ^{j−(n−1)/n} − φ^{j−1})/τ + A_1^j (φ^{j−(n−1)/n} + φ^{j−1})/2 = 0,
. . . . . . . . . . . . . . . . . . . .
(φ^j − φ^{j−1/n})/τ + A_n^j (φ^j + φ^{j−1/n})/2 = 0,
  (15.3)
(φ^{j+1/n} − φ^j)/τ + A_n^j (φ^{j+1/n} + φ^j)/2 = 0,
. . . . . . . . . . . . . . . . . . . .
(φ^{j+1} − φ^{j+(n−1)/n})/τ + A_1^j (φ^{j+1} + φ^{j+(n−1)/n})/2 = 0,

where

A_α^j = A_α(t_j, φ̃^j),
φ̃^j = φ^{j−1} − τA(t_{j−1}, φ^{j−1})φ^{j−1},   (15.4)
τ = t_j − t_{j−1}.
It may be proved, by the methods used earlier for linear operators depending only upon time, that the splitting scheme (15.3) under conditions (15.4) has a second-order approximation in τ and is absolutely stable. The splitting method for inhomogeneous quasi-linear equations is defined in a similar way. This opens up broad vistas for applications of the componentwise splitting schemes to nonstationary quasi-linear problems in hydrodynamics, meteorology, oceanology and other important fields of the natural sciences.

16. A general approach to the two-cycle componentwise splitting method

We saw in Section 8 that, if the evolutionary problem (5.1), where A = Σ_{α=1}^{n} A_α, is reduced to the partial problems (8.1), and these are then regarded as the set of new evolutionary problems, then the approximation to the initial problem (5.1) will be of first-order accuracy in τ if at least one of the elementary evolutionary problems is reduced to a difference scheme of first-order accuracy. But if each such problem has a second-order approximation, then in the framework of the two-cycle procedure on α and β we obtain an approximation of second-order accuracy in τ. Note that if the operators A_{αβ} are noncommutative, then without the two-cycle procedure we obtain a first-order approximation to problem (5.1). In this case the initial problem

will be the following:

∂φ/∂t + Σ_{α=1}^{n} A_α φ = 0 in Ω × Ω_j,
  (16.1)
φ = φ^j in Ω for t = t_j.

Reduce problem (16.1) to the system

∂φ_α/∂t + A_α φ_α = 0,   φ_α^j = φ_{α−1}^{j+1},   α = 1, 2, …, n.   (16.2)

Let A_α = Σ_{β=1}^{m_α} A_{αβ}. Then we will solve each problem in (16.2) with the two-cycle method:

(φ_α^{j+β/2m_α} − φ_α^{j+(β−1)/2m_α})/τ + A_{αβ} (φ_α^{j+β/2m_α} + φ_α^{j+(β−1)/2m_α})/2 = 0,
β = 1, 2, …, m_α,
  (16.3)
(φ_α^{j+β/2m_α} − φ_α^{j+(β−1)/2m_α})/τ + A_{α,2m_α−β+1} (φ_α^{j+β/2m_α} + φ_α^{j+(β−1)/2m_α})/2 = 0,
β = m_α + 1, m_α + 2, …, 2m_α.
Initial conditions for each system in (16.3) will be respectively taken in the form

φ_1^j = φ^j,   φ_α^j = φ_{α−1}^{j+1},   α = 2, …, n.   (16.4)

It is not difficult to check that on the interval t_j ≤ t ≤ t_{j+1} problem (16.3) approximates problem (16.2) with accuracy up to τ².
In order to obtain the solution of problem (5.1) with accuracy up to τ², it is necessary to alternate the basic cycles as well. So instead of (16.2), on the interval t_{j−1} ≤ t ≤ t_j there should be

∂φ_α/∂t + A_α φ_α = 0,   α = 1, 2, …, n,
φ_1^{j−1} = φ^{j−1},   φ_α^{j−1} = φ_{α−1}^j,   α > 1,   (16.5)
φ^j = φ_n^j;

and on the interval t_j ≤ t ≤ t_{j+1}

∂φ_α/∂t + A_{n−α+1} φ_α = 0,   α = 1, 2, …, n,
φ_1^j = φ^j,   φ_α^j = φ_{α−1}^{j+1},   α > 1,   (16.6)
φ^{j+1} = φ_n^{j+1}.

It is supposed that each of the problems (16.5) and (16.6) is solved with the two-cycle method of the kind (16.3). Note that with A_{αβ} ≥ 0 the componentwise splitting method is absolutely stable.
Let us give the general splitting scheme for the inhomogeneous equation

∂φ/∂t + Σ_{α=1}^{n} A_α φ = f,
  (16.7)
φ = g for t = 0,

on the interval t_{j−1} ≤ t ≤ t_{j+1}, based on the two-cycle method.

On the interval t_{j−1} ≤ t ≤ t_j we set

∂φ_α/∂t + A_α φ_α = 0,   α = 1, 2, …, n−1,
  (16.8)
∂φ_n/∂t + A_n φ_n = f + ½τA_n f;
and on the interval t_j ≤ t ≤ t_{j+1}

∂φ_{n+1}/∂t + A_n φ_{n+1} = f − ½τA_n f,
  (16.9)
∂φ_{n+α}/∂t + A_{n−α+1} φ_{n+α} = 0,   α = 2, 3, …, n,

provided that

φ_1(t_{j−1}) = φ(t_{j−1}),
  (16.10)
φ_{α+1}(t_{j−1}) = φ_α(t_j),   α = 1, 2, …, n−1,

and, respectively,

φ_{n+1}(t_j) = φ_n(t_j),
φ_α(t_j) = φ_{α−1}(t_{j+1}),   α = n+2, n+3, …, 2n,   (16.11)
φ(t_{j+1}) = φ_{2n}(t_{j+1}).
Using now Crank-Nicolson schemes for solving equations (16.8)-(16.11) on the interval t_{j−1} ≤ t ≤ t_{j+1} with f = f^j, we obtain the system (14.5).
Along with the system of equations (16.8) and (16.9) let us consider the following. On the interval t_{j−1} ≤ t ≤ t_j:

∂φ_1/∂t + A_1 φ_1 = 0,
. . . . . . . . . .   (16.12)
∂φ_n/∂t + A_n φ_n = 0;

on the interval t_{j−1} ≤ t ≤ t_{j+1}:

∂φ_{n+1}/∂t = f^j;   (16.13)

and on the interval t_j ≤ t ≤ t_{j+1}:

∂φ_{n+2}/∂t + A_n φ_{n+2} = 0,
. . . . . . . . . .   (16.14)
∂φ_{2n+1}/∂t + A_1 φ_{2n+1} = 0.
The initial conditions for system (16.12) will be

φ_1(t_{j−1}) = φ(t_{j−1}),
  (16.15)
φ_α(t_{j−1}) = φ_{α−1}(t_j),   α = 2, 3, …, n;

for equation (16.13)

φ_{n+1}(t_{j−1}) = φ_n(t_j);   (16.16)

and for the system of equations (16.14)

φ_α(t_j) = φ_{α−1}(t_{j+1}),   α = n+2, n+3, …, 2n+1.   (16.17)


The approximation and the stability of these schemes ensure the convergence of
the approximate solution to the exact one (see Introduction).

17. The two-cycle componentwise splitting scheme for the heat conduction equation

Consider the two-dimensional heat conduction problem

∂φ/∂t − Δφ = f in Ω × Ω_t,
φ = 0 on ∂Ω,   (17.1)
φ = g in Ω for t = 0,

where

Ω = {0 < x, y < 1},   Ω_t = {0 < t ≤ T < ∞}.

Introducing the grid with steps h_x = 1/N_x, h_y = 1/N_y along the x- and y-directions respectively, and approximating problem (17.1) as described in the Introduction, we obtain the matrix evolutionary problem

∂φ/∂t + Aφ = f in Ω × Ω_t,   (17.2)
φ = g in Ω for t = 0,

where φ, f, g are vectors and A is a matrix:

A = A_1 + A_2,
A_1 ≡ A_x = −(Δ_x̄∇_x)/h_x²,   A_2 ≡ A_y = −(Δ_ȳ∇_y)/h_y².
We suppose that the boundary conditions are taken into account in (17.2) and that Ω is the grid domain consisting of the internal grid points of the initial domain.

The operators A_α, α = 1, 2, in (17.2) are positive-definite, A_1 > 0, A_2 > 0, so with A_1^j ≡ A_1, A_2^j ≡ A_2 the two-cycle componentwise splitting scheme (on the interval t_{j−1} ≤ t ≤ t_{j+1})

(E + ½τA_1)φ^{j−1/2} = (E − ½τA_1)φ^{j−1},
(E + ½τA_2)(φ^j − τf^j) = (E − ½τA_2)φ^{j−1/2},
(E + ½τA_2)φ^{j+1/2} = (E − ½τA_2)(φ^j + τf^j),   (17.3)
(E + ½τA_1)φ^{j+1} = (E − ½τA_1)φ^{j+1/2},

j = 0, 1, …,
φ^0 = g,

is absolutely stable and has a second-order approximation in τ for smooth solutions.

Equations (17.3) may be written in componentwise form. The first equation of (17.3), for instance, becomes the three-point system

−(τ/2h_x²) φ_{k−1,l}^{j−1/2} + (1 + τ/h_x²) φ_{k,l}^{j−1/2} − (τ/2h_x²) φ_{k+1,l}^{j−1/2} = F_{k,l}^{j−1},   (17.4)

where

F_{k,l}^{j−1} = φ_{k,l}^{j−1} + (τ/2h_x²)(φ_{k+1,l}^{j−1} − 2φ_{k,l}^{j−1} + φ_{k−1,l}^{j−1}),

and the remaining equations of (17.3) take the analogous form along the y-direction, with the terms ∓τf_{k,l}^j entering the right-hand sides of the second and third equations. Note that the right-hand sides in (17.4) are calculated explicitly. Thus we deal here with a system of equations represented in the form of three-point schemes, and the solution of these equations may be easily and effectively obtained by the sweep method.
The two-cycle componentwise splitting method is used for solving parabolic
equations in the papers by MARCHUK [1967], MARCHUK and SARKISYAN [1980] and
others.
Note that in the paper by MARCHUK and KUZIN [1983] (see also Chapter XIV)
this method is used for the realization of the finite element approximations of
a two-dimensional parabolic equation with mixed derivatives. It is proved that the
scheme is stable and has a second-order approximation.
CHAPTER IV

Splitting Schemes with Factorization of the Operators

18. Schemes factorizing the operators

Consider the evolutionary problem

∂φ/∂t + Aφ = f in Ω × Ω_t,
φ = g in Ω for t = 0,   (18.1)
φ ∈ Φ,   f ∈ F,

where Φ and F are Hilbert spaces and the difference operator A does not depend on time. Suppose that problem (18.1) is solved using a time difference scheme which may be written in the form

Bφ^j = F^j,   j = 1, 2, …,   (18.2)

where F^j = F^j(φ^{j−1}, φ^{j−2}, …; τ) is a given function of τ, φ^{j−1}, φ^{j−2}, … (in the case of a two-level scheme F^j = F^j(τ, φ^{j−1})), and B is some operator constructed in such a way that it allows a factorized representation

B = B_1 B_2 ⋯ B_p.   (18.3)

The operators B_α have a simpler structure than B and may be easily inverted. Difference schemes allowing the representation (18.3) are called factorized schemes (BAKER and OLIPHANT [1960], SAMARSKII [1971], p. 368) or factorization schemes of the difference operator (YANENKO [1967], p. 37).
Numerical solution of equation (18.2) on each time level may be carried out by way of successive solution of the simpler equations

B_1 φ^{j+1/p} = F^j,
B_2 φ^{j+2/p} = φ^{j+1/p},
. . . . . . . . . .   (18.4)
B_p φ^{j+1} = φ^{j+(p−1)/p}.

If p = 2 and B_1, B_2 are triangular matrices, then the relationships (18.4) represent an explicit ("running") calculation scheme (see SAMARSKII [1971], p. 368).
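In code, the successive solution (18.4) is just a loop of simple solves. The following sketch uses small random matrices standing in for the factors B_α (none of this data is from the original text) and checks that solving with the factors one after another reproduces the solution of the full system Bφ = F.

```python
import numpy as np

def solve_factorized(factors, F):
    # Successive solution (18.4): B1 phi1 = F, B2 phi2 = phi1, ...
    phi = F
    for B in factors:
        phi = np.linalg.solve(B, phi)
    return phi

rng = np.random.default_rng(0)
n = 5
B1 = np.eye(n) + 0.1 * rng.standard_normal((n, n))
B2 = np.eye(n) + 0.1 * rng.standard_normal((n, n))
F = rng.standard_normal(n)

phi = solve_factorized([B1, B2], F)
# phi solves (B1 @ B2) phi = F, i.e. the loop inverts B = B1 B2 factor by factor.
```

The point of the factorized representation is that, in the applications below, each B_α is a cheap (three-point or triangular) operator, so each solve in the loop costs O(N) rather than the cost of inverting B directly.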


We now consider some ways of constructing factorized operators and corresponding factorized schemes applied to problems of parabolic type.

BAKER and OLIPHANT [1960] consider for the problem

∂φ/∂t − a²(∂²φ/∂x² + ∂²φ/∂y²) = 0,
φ = φ^{(Γ)} on ∂Ω,   (18.5)
φ = g for t = 0,

where Ω = {0 < x, y < 1} and a = const, the three-layer scheme

((3/2)φ^{j+1} − 2φ^j + ½φ^{j−1})/τ + Aφ^{j+1} = 0,   (18.6)

in which the operator A was constructed to approximate the operator A = −a²(∂²/∂x² + ∂²/∂y²) with second-order accuracy; at the same time the operator

B = (3/2)E + τA   (18.7)

was represented as the product of two three-point operators B_1 and B_2. It was shown by BAKER and OLIPHANT [1960] that such an operator may be constructed; as a result the realization of scheme (18.6) is reduced to solving (18.2) with F^j = F^j(φ^{j−1}, φ^j; τ) = 2φ^j − ½φ^{j−1}. For p = 2, (18.2) also decomposes into the two equations (18.4), which may be easily solved using the three-point sweep method.
BAKER [1960] describes the factorization schemes in the case of a multidimensional heat conduction equation with constant coefficients.

But if a two-dimensional parabolic equation with variable coefficients (a diffusion equation) is considered, then it is impossible to represent B exactly as the product of two three-point operators. In this case additional iterations are needed (BULEYEV [1970], OLIPHANT [1961]). For such problems BULEYEV [1970] (see also MARCHUK [1980], pp. 235-237) formulated a scheme of incomplete factorization. Thus, let (18.2) be represented as a two-dimensional difference equation

a_{i,k}φ_{i−1,k} + c_{i,k}φ_{i+1,k} + b_{i,k}φ_{i,k−1} + d_{i,k}φ_{i,k+1} − p_{i,k}φ_{i,k} = −f_{i,k},

i = 1, 2, …, m,   k = 1, 2, …, n,   (18.8)
a_{1,k} = c_{m,k} = b_{i,1} = d_{i,n} = 0,
p_{i,k} ≥ a_{i,k} + c_{i,k} + b_{i,k} + d_{i,k}

(the index j is omitted for simplicity). Having added the vector Cφ to the left-hand and right-hand sides of the equation we obtain

(B + C)φ = Cφ + F.   (18.9)

We now choose the matrix C in such a way that the matrix B + C can be represented as the product of two simple triangular matrices S_1, S_2 and a diagonal matrix K:

B + C = K^{−1} S_1 S_2.

Then equation (18.9) is replaced by the system

S_1 Z = K(Cφ + F),   S_2 φ = Z,   (18.10)

which will be solved by the method of successive approximations. The realization formulae of this iteration algorithm applied to (18.10) can be found in the form

Z_{i,k} = α_{i,k}Z_{i−1,k} + β_{i,k}Z_{i,k−1} + γ_{i,k}[f_{i,k} + D_{i,k}(φ) − q_{i,k}φ_{i,k}],
  (18.11)
φ_{i,k} = ε_{i,k}φ_{i+1,k} + η_{i,k}φ_{i,k+1} + Z_{i,k}

(the index of the iteration step is omitted; D_{i,k}(φ) = (Dφ)_{i,k}, where the matrix D = KC + diag q_{i,k} and diag q_{i,k} is a diagonal matrix with elements q_{i,k}).
Comparing (18.11) and (18.8) we obtain for the coefficients α_{i,k}, β_{i,k}, ε_{i,k}, η_{i,k}, γ_{i,k} formulae which are analogous to the formulae of the sweep method:

α_{i,k} = γ_{i,k}a_{i,k},   β_{i,k} = γ_{i,k}b_{i,k},

ε_{i,k} = γ_{i,k}c_{i,k},   η_{i,k} = γ_{i,k}d_{i,k},

γ_{i,k} = (p_{i,k} − q_{i,k} − a_{i,k}c_{i−1,k}γ_{i−1,k} − b_{i,k}d_{i,k−1}γ_{i,k−1})^{−1}.

For the iteration operator D_{i,k}(φ) we obtain the expression

D_{i,k}(φ) = a_{i,k}η_{i−1,k}φ_{i−1,k+1} + b_{i,k}ε_{i,k−1}φ_{i+1,k−1}.

The coefficient q_{i,k} is taken as

q_{i,k} = θ_{i,k}(a_{i,k}η_{i−1,k} + b_{i,k}ε_{i,k−1}),   0 ≤ θ_{i,k} ≤ 1.

To ensure the convergence of the iterative process (18.11) for θ > 0, these equations are supplemented with the simple iteration

φ̃_{i,k} = α_{i,k}φ_{i−1,k} + β_{i,k}φ_{i,k−1} + ε_{i,k}φ_{i+1,k} + η_{i,k}φ_{i,k+1} + γ_{i,k}f_{i,k},   (18.12)

so that for obtaining the (l+1)th approximation of the function φ_{i,k} the lth approximation of the function φ̃_{i,k} is to be substituted into the equation for Z^{l+1}. If the problem is ill-conditioned, then the expression in square brackets in (18.11) for the grid points belonging to the right and upper boundaries of the domain should be replaced by

f_{i,k} + D_{i,k}(φ) − q_{i,k}φ_{i,k} + κ_{i,k}(a_{i,k} + b_{i,k})φ_{i,k},   (18.13)

where 0 ≤ κ_{i,k} ≤ 1.


It is not difficult to show, using the induction method, that the coefficients α_{i,k}, β_{i,k}, ε_{i,k}, η_{i,k}, γ_{i,k} satisfy the conditions

ε_{i,k} + η_{i,k} ≤ 1;

α_{i,k} + β_{i,k} ≤ (a_{i,k} + b_{i,k}) {c_{i,k} + d_{i,k} + (1 − θ_{i,k})(a_{i,k}η_{i−1,k} + b_{i,k}ε_{i,k−1}) + κ_{i,k}(a_{i,k} + b_{i,k})}^{−1}.

Choosing the parameters θ_{i,k} and κ_{i,k}, it is always possible to satisfy the condition α_{i,k} + β_{i,k} < 1, i.e. to achieve computational stability of the scheme (18.11). Note that MARCHUK ([1980], pp. 236-242) presented another scheme of incomplete factorization.

19. The implicit splitting scheme with approximate factorization of the operator

We consider for problem (18.1), with an operator A ≥ 0, the implicit scheme of the form

(φ^{j+1} − φ^j)/τ + Aφ^{j+1} = 0,   j = 0, 1, 2, …,   φ^0 = g,   (19.1)

where

A = Σ_{α=1}^{n} A_α,   A_α ≥ 0.   (19.2)

We rewrite (19.1) in the form

(E + τA)φ^{j+1} = φ^j   (19.3)

and factorize the operator (E + τA) approximately, up to quantities of order O(τ²). To that end we replace the operator (E + τA) in (19.3) with the factorized operator

(E + τA_1)(E + τA_2) ⋯ (E + τA_n) = E + τA + τ²R,   (19.4)

where

R = Σ_{i<j} A_i A_j + τ Σ_{i<j<k} A_i A_j A_k + ⋯ + τ^{n−2} A_1 ⋯ A_n.

As a result we obtain an implicit scheme with an approximate factorization of the operator B (YANENKO [1967], p. 38):

Bφ^{j+1} = φ^j,   j = 0, 1, 2, …,   φ^0 = g,   (19.5)

where

B = Π_{α=1}^{n} B_α,   B_α = E + τA_α.

The solution of equations (19.5) may be carried out by successively solving the equations

(E + τA_1)φ^{j+1/n} = φ^j,

(E + τA_2)φ^{j+2/n} = φ^{j+1/n},
. . . . . . . . . .   (19.6)
(E + τA_n)φ^{j+1} = φ^{j+(n−1)/n},

j = 0, 1, 2, …,   φ^0 = g.

If the operators A_α in the representation (19.2) have a simple structure and allow cheap inversion, then system (19.6) may also be solved rather simply.

It is easy to see that scheme (19.6) has a first-order approximation in τ. It is absolutely stable for A_α ≥ 0 in the metric of the grid function space Φ^h (‖φ‖_{Φ^h} = (φ, φ)_h^{1/2}), since in that case ‖(E + τA_α)^{−1}‖_h ≤ 1 (see (2.23) and Theorem 2.1).
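The factorization identity (19.4) can be verified directly in a couple of lines. In the sketch below (two illustrative noncommuting operators, n = 2; the matrices are invented) the defect of the factorized operator is exactly τ²A_1A_2, so it shrinks by a factor of four when τ is halved: this is the O(τ²) perturbation responsible for the first-order accuracy of scheme (19.6).

```python
import numpy as np

A1 = np.array([[2.0, 1.0], [0.0, 1.0]])
A2 = np.array([[1.0, 0.0], [1.0, 3.0]])
A = A1 + A2
I = np.eye(2)

def factorization_defect(tau):
    # (E + tau A1)(E + tau A2) - (E + tau A); by (19.4) this equals tau^2 A1 A2.
    return (I + tau * A1) @ (I + tau * A2) - (I + tau * A)

d1 = factorization_defect(0.1)
d2 = factorization_defect(0.05)
ratio = np.linalg.norm(d1) / np.linalg.norm(d2)   # exactly 4: defect is O(tau^2)
```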
We will illustrate an application of the scheme with approximate factorization of the operator to scheme (18.6), which was suggested by BAKER and OLIPHANT [1960] for the heat conduction equation (18.5).

We take A = A_1 + A_2, where A_α is a difference approximation of the one-dimensional differential operator A_α = −a² ∂²/∂x_α² (x_1 = x, x_2 = y). We rewrite scheme (18.6) in the form

((3/2)E + τA)φ^{j+1} = F^j,   F^j = 2φ^j − ½φ^{j−1},   (19.7)

and replace the operator ((3/2)E + τA) with the factorized one obtained by the implicit scheme with approximate factorization of the operator (YANENKO [1967], p. 39). We obtain the scheme

Bφ^{j+1} ≡ (3/2)(E + (2/3)τA_1)(E + (2/3)τA_2)φ^{j+1} = F^j,   (19.8)

where B is a nine-point factorized operator. (Note that scheme (19.8) coincides with the scheme suggested by BAKER and OLIPHANT [1960].)

20. The stabilization method (explicit-implicit scheme with approximate factorization of the operator)

We consider the homogeneous problem (18.1) with A = A_1 + A_2, A_α ≥ 0, f ≡ 0, and the following two-layer scheme:

(E + ½τA_1)(E + ½τA_2)(φ^{j+1} − φ^j)/τ + Aφ^j = 0,

j = 0, 1, 2, …,   (20.1)

φ^0 = g.

If A₁ ≥ 0, A₂ ≥ 0, and the solution of problem (18.1) is smooth enough, then the
difference scheme (20.1) has a second-order approximation in τ and is absolutely
stable, the following estimate being valid:
‖φ^{j+1}‖_{C₂} ≤ ‖g‖_{C₂},  j = 0, 1, 2, …,  (20.2)
where
‖φ‖_{C₂} = (C₂φ, φ)^{1/2},  C₂ = (E + ½τA₂)*(E + ½τA₂).
Indeed, using simple transformations, equation (20.1) is reduced to the form
(E + ¼τ²A₁A₂)(φ^{j+1} − φ^j)/τ + A(φ^{j+1} + φ^j)/2 = 0,
φ⁰ = g.  (20.3)
260 G.I. Marchuk CHAPTER IV

It is easy to notice that if the solution is sufficiently smooth then the order of
approximation in τ of the difference equation (20.3) coincides with that of the
Crank–Nicolson scheme
(φ^{j+1} − φ^j)/τ + A(φ^{j+1} + φ^j)/2 = 0,  φ⁰ = g,  (20.4)
i.e. second-order in τ.
We now move to the stability analysis of the difference equation (20.1). To this end
we transform (20.1) to the form
(E + ½τA₁)(E + ½τA₂)φ^{j+1} = (E − ½τA₁)(E − ½τA₂)φ^j,  (20.5)
φ⁰ = g.

Resolving the difference equation (20.5) with respect to φ^{j+1} we get
φ^{j+1} = (E + ½τA₂)^{−1}(E + ½τA₁)^{−1}(E − ½τA₁)(E − ½τA₂)φ^j.
From the unknowns φ^j we compute ψ^j using the following formula:
ψ^j = (E + ½τA₂)φ^j.
Then for the new unknown ψ^j we obtain the relationship
ψ^{j+1} = Tψ^j,
where the transition operator T is given by
T = (E + ½τA₁)^{−1}(E − ½τA₁)(E − ½τA₂)(E + ½τA₂)^{−1} = ∏_{α=1}^{2} T_α,
T_α = (E − ½τA_α)(E + ½τA_α)^{−1}.
T = (E - tA,)(E + 2A)-
It follows from Theorem 2.1 that
‖T_α‖ = sup_φ ‖T_αφ‖/‖φ‖ ≤ 1,  ‖φ‖ = (φ, φ)^{1/2}.
Therefore ‖T‖ ≤ 1, which means that
‖ψ^{j+1}‖ ≤ ‖ψ^j‖,  i.e.  ‖φ^{j+1}‖_{C₂} ≤ ‖φ^j‖_{C₂} ≤ ⋯ ≤ ‖g‖_{C₂}.

The difference scheme (20.1) of the stabilization method allows a convenient
realization on the computer. Indeed, the difference equation (20.1) may be written in
the form
F^j = −Aφ^j,
(E + ½τA₁)ξ^{j+1/2} = F^j,
(E + ½τA₂)ξ^{j+1} = ξ^{j+1/2},  (20.6)
φ^{j+1} = φ^j + τξ^{j+1}.
Here ξ^{j+1/2} and ξ^{j+1} are auxiliary quantities ensuring the reduction of problem
(20.1) to the sequence of elementary problems (20.6). Note that the first and the last

equations in (20.6) are explicit relationships. This means that we have to invert
operators only in the second and third equations, where only the elementary
operators A₁ and A₂ are present.
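Realization (20.6) is easy to exercise numerically. A minimal sketch, assuming symmetric positive semidefinite stand-in matrices A₁, A₂ (an assumption of this example, not data from the text): one step consists of an explicit evaluation, two elementary implicit solves, and an explicit update, and the C₂-norm of estimate (20.2) does not grow.

```python
# Sketch of one step of the stabilization scheme (20.1) via realization (20.6).
import numpy as np

rng = np.random.default_rng(1)
n = 8
M1 = rng.standard_normal((n, n))
M2 = rng.standard_normal((n, n))
A1 = M1 @ M1.T          # A1 >= 0 (illustrative)
A2 = M2 @ M2.T          # A2 >= 0 (illustrative)
A = A1 + A2
E = np.eye(n)
tau = 1e-3
g = rng.standard_normal(n)

def step(phi):
    F = -A @ phi                                  # explicit
    xi_half = np.linalg.solve(E + 0.5*tau*A1, F)  # elementary implicit solve
    xi = np.linalg.solve(E + 0.5*tau*A2, xi_half) # elementary implicit solve
    return phi + tau * xi                         # explicit update

# The step is algebraically identical to scheme (20.1).
phi1 = step(g)
lhs = (E + 0.5*tau*A1) @ (E + 0.5*tau*A2) @ (phi1 - g) / tau + A @ g
assert np.allclose(lhs, 0, atol=1e-8)

# Estimate (20.2): the norm generated by C2 = (E + tau/2 A2)^T (E + tau/2 A2)
# is non-increasing from step to step.
def c2_norm(phi):
    return np.linalg.norm((E + 0.5*tau*A2) @ phi)

phi = g
for _ in range(50):
    nxt = step(phi)
    assert c2_norm(nxt) <= c2_norm(phi) + 1e-12
    phi = nxt
```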
We now consider the inhomogeneous problem (18.1), where A = A₁ + A₂, A₁ ≥ 0,
A₂ ≥ 0, f ≢ 0. In this case the stabilization scheme is written as follows:

(E + ½τA₁)(E + ½τA₂)(φ^{j+1} − φ^j)/τ + Aφ^j = f^j,
φ⁰ = g,  (20.7)
where
f^j = f(t_{j+1/2}).  (20.8)
If the elements of the matrices A_α do not depend on time and the solution and the
function f of problem (18.1) are sufficiently smooth, then scheme (20.7) is absolutely
stable and approximates the original problem with second-order accuracy in
τ (MARCHUK [1980], pp. 251–252).

REMARK 20.1. As we have seen, the scheme of the stabilization method may be
obtained by introducing into the Crank–Nicolson scheme (i.e. the explicit-implicit
scheme) the operator B = E + ¼τ²A₁A₂. After that the operator of the problem
becomes factorizable. Therefore, the stabilization scheme may also be called the
explicit-implicit scheme with approximate factorization of the operator.

Let us formulate the stabilization scheme for the inhomogeneous problem (18.1)
where
A = Σ_{α=1}^n A_α,  n > 2,  A_α ≥ 0  (20.9)
and the operators A_α are time-independent. With assumptions (20.9) the stabilization
method may be represented in the form
∏_{α=1}^n (E + ½τA_α) (φ^{j+1} − φ^j)/τ + Aφ^j = f^j,  φ⁰ = g,  (20.10)
where
f^j = f(t_{j+1/2}).
The scheme of realization of the algorithm has the following form:
F^j = −Aφ^j + f^j,
(E + ½τA₁)ξ^{j+1/n} = F^j,
(E + ½τA₂)ξ^{j+2/n} = ξ^{j+1/n},
. . . . . . . . . . . . . . .  (20.11)
(E + ½τAₙ)ξ^{j+1} = ξ^{j+(n−1)/n},
φ^{j+1} = φ^j + τξ^{j+1}.

With the assumption of sufficient smoothness the stabilization method (20.10) has
a second-order approximation in τ. Computational stability will be ensured
provided that
‖T‖ ≤ 1,  (20.12)
where the transition operator T is defined by the formula
T = E − τ ∏_{α=n}^{1} (E + ½τA_α)^{−1} A.  (20.13)
Unfortunately, the condition A_α ≥ 0 does not imply here the stability in some norm,
as was the case for n = 2.
To ensure stability the following simple algorithmic tool is sometimes used. Set f^j
equal to zero and reduce equation (20.10), resolved with respect to φ^{j+1}, to the
following form:
φ^{j+1} = Tφ^j.  (20.14)
Since the operator T is assumed to be time-independent (i.e. it does not depend on j),
we keep an eye only on the norm ‖φ^j‖ when solving problem (20.14) with the initial
condition
φ⁰ = g  (20.15)
and a fixed parameter τ ensuring the necessary approximation. If this norm is not
increasing then we may be sure that computational stability takes place. After this
we may move to the solution of the inhomogeneous problem. We rewrite equation
(20.10) as follows:
φ^{j+1} = Tφ^j + τ ∏_{α=n}^{1} (E + ½τA_α)^{−1} f^j.  (20.16)

Hence,
‖φ^{j+1}‖ ≤ ‖T‖ ‖φ^j‖ + τ ∏_{α=1}^{n} ‖(E + ½τA_α)^{−1}‖ ‖f^j‖_F,
or, due to inequality (20.12) and the inequality ‖(E + ½τA_α)^{−1}‖ ≤ 1,
‖φ^{j+1}‖ ≤ ‖φ^j‖ + τ‖f^j‖_F.  (20.17)
Using the recursive connection we obtain the stability estimate
‖φ^j‖ ≤ ‖g‖ + jτ‖f‖,  (20.18)
where
‖f‖ = max_j ‖f^j‖_F.  (20.19)

REMARK 20.2. While checking the stability condition ‖T‖ ≤ 1 we used the initial
condition (20.15). This was not necessary at all. We might have chosen as initial
condition any function from the same smoothness class as g and f.
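The "algorithmic tool" around (20.14)-(20.15) amounts to running the homogeneous iteration and watching the norm. A sketch under the assumption of symmetric positive semidefinite stand-in matrices A_α (illustrative only; the sizes and the time step are choices of this example):

```python
# Monitoring the homogeneous iteration (20.14)-(20.15) for n = 3 components.
import numpy as np

rng = np.random.default_rng(2)
m, n_ops, tau = 7, 3, 1e-2
As = []
for _ in range(n_ops):
    M = rng.standard_normal((m, m))
    As.append(M @ M.T)           # A_alpha >= 0 (illustrative matrices)
A = sum(As)
E = np.eye(m)

# Transition operator (20.13): applying the inverses successively, as in the
# realization (20.11), builds (E + tau/2 A_n)^{-1}...(E + tau/2 A_1)^{-1} A.
T = A.copy()
for Aa in As:
    T = np.linalg.solve(E + 0.5*tau*Aa, T)
T = E - tau * T

# Watch the norm along the homogeneous iteration with a fixed tau.
phi = rng.standard_normal(m)
norms = [np.linalg.norm(phi)]
for _ in range(100):
    phi = T @ phi
    norms.append(np.linalg.norm(phi))

# A non-increasing sequence signals computational stability for this tau.
growth = max(b / a for a, b in zip(norms, norms[1:]))
print("max one-step growth:", growth)
```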

21. A general scheme for the method of approximate factorization of the operator

We will give a general formulation of the approximate factorization method
(MARCHUK and YANENKO [1966], YANENKO [1967], p. 155).
We write for problem (18.1) a two-level approximation of the kind:
(φ^{j+1} − φ^j)/τ = A₁φ^{j+1} + A₀φ^j + F^j,  j = 0, 1, 2, …,
(21.1)
φ⁰ = g,

where the function F^j approximates f. In the case of multilayer schemes this
approximation contains the results of applying the difference operators to φ^{j−1},
φ^{j−2}, etc. Let the operators A₁, A₀,
A₀ = Σ_{α=1}^q A_{0,α},  A₁ = Σ_{α=1}^p A_{1,α},  (21.2)

be represented as sums of operators of, generally speaking, a simpler
structure; then the following approximate relationships are valid:
∏_{α=1}^p (E − τA_{1,α}) ≈ E − τA₁,
(21.3)
∏_{α=1}^q (E + τA_{0,α}) ≈ E + τA₀
(i.e. the operators on the left-hand sides are approximations of the operators on the
right-hand sides). The relationships (21.3) allow the replacement of (21.1) by the
factorized scheme
∏_{α=1}^p (E − τA_{1,α}) φ^{j+1} = ∏_{α=1}^q (E + τA_{0,α}) φ^j + τF^j.  (21.4)

The given scheme approximates problem (18.1), and if A₁ = A₀ = −½A and F^j =
f(t_{j+1/2}), then it has a second-order approximation in τ. The stability analysis of
scheme (21.4) is often difficult in comparison with the componentwise splitting
schemes. For the sake of simplicity let f ≡ 0 in (18.1). We rewrite this relationship in
the form
φ^{j+1} = Tφ^j,  (21.5)
where
T = (E − τA_{1,p})^{−1} ⋯ (E − τA_{1,1})^{−1}(E + τA_{0,1}) ⋯ (E + τA_{0,q}).
If for the operator T the estimate
‖T‖ ≤ 1 + const·τ  (21.6)
holds, then scheme (21.4) is stable. Estimate (21.6) is rather difficult to obtain. In the

case when p = q and the operators are commutative we have:
T = ∏_{α=1}^p T_α,  T_α = (E − τA_{1,α})^{−1}(E + τA_{0,α}).
If at the same time A_{1,α} = A_{0,α} ≤ 0, then ‖T_α‖ ≤ 1 and the scheme is absolutely stable
in the norm ‖φ‖ = (φ, φ)^{1/2}.
From scheme (21.4) we may obtain the scheme of the splitting operator method
(DYAKONOV [1971d]).
We suppose that in (18.1) we have
A = Σ_{α=1}^n A_α,  A_α ≥ 0,  (21.7)
where the operators A_α are difference approximations of the corresponding
one-dimensional differential operators. Let p = q = n, and let the operators A_{0,α}, A_{1,α}
be given by
A_{0,α} = −(1 − θ)A_α,  A_{1,α} = −θA_α,  (21.8)
0 ≤ θ = const ≤ 1.
Then scheme (21.4) takes the form:
∏_{α=1}^n (E + τθA_α) φ^{j+1} = ∏_{α=1}^n (E − τ(1 − θ)A_α) φ^j + τF^j,
j = 0, 1, 2, …,  (21.9)
φ⁰ = g.
It is easy to notice that the realization of scheme (21.9) consists of solving a
sequence of one-dimensional problems (with the sweep method).
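Each of the one-dimensional problems mentioned above is a tridiagonal system, and the sweep method is the standard double-sweep (Thomas) elimination. A minimal sketch; the matrix E + τθΛ built from the three-point Laplacian, and all grid parameters, are assumptions of this example:

```python
# Sweep (Thomas) algorithm for the tridiagonal solves arising in (21.9).
import numpy as np

def sweep(a, b, c, f):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = f[i] (a[0] = c[-1] = 0)."""
    n = len(b)
    alpha = np.zeros(n)
    beta = np.zeros(n)
    alpha[0] = -c[0] / b[0]
    beta[0] = f[0] / b[0]
    for i in range(1, n):                  # forward sweep
        den = b[i] + a[i] * alpha[i - 1]
        alpha[i] = -c[i] / den
        beta[i] = (f[i] - a[i] * beta[i - 1]) / den
    x = np.zeros(n)
    x[-1] = beta[-1]
    for i in range(n - 2, -1, -1):         # back substitution
        x[i] = alpha[i] * x[i + 1] + beta[i]
    return x

# E + tau*theta*Lambda with Lambda the three-point operator for -d^2/dx^2.
N, h, tau, theta = 20, 1.0 / 20, 1e-3, 0.5
b = np.full(N, 1 + tau * theta * 2 / h**2)
a = np.full(N, -tau * theta / h**2); a[0] = 0.0
c = np.full(N, -tau * theta / h**2); c[-1] = 0.0
rng = np.random.default_rng(3)
f = rng.standard_normal(N)

x = sweep(a, b, c, f)
# Cross-check against a dense solve of the same system.
M = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
assert np.allclose(x, np.linalg.solve(M, f))
```

The forward sweep costs O(N) operations per line of the grid, which is what makes factorized schemes such as (21.9) economical.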

REMARK 21.1. Scheme (21.1) and representations (21.2) do not determine the
corresponding factorized scheme uniquely. For example, the scheme
(E − τA_{1,p}) ⋯ (E − τA_{1,1})φ^{j+1} = (E + τA_{0,q}) ⋯ (E + τA_{0,1})φ^j + τF^j  (21.10)
is not equivalent to scheme (21.4) in the general case of noncommutative operators
A_{1,α}, A_{0,α}, but only in the case of commutative operators.

REMARK 21.2. In a paper by YANENKO ([1967], pp. 158–159) schemes of approximate
factorization arising from multi-level schemes are presented:
A₁φ^{j+1} + A₀φ^j + A₋₁φ^{j−1} + ⋯ + A_{−p+1}φ^{j−p+1} + f^j = 0.  (21.11)
These schemes may appear in approximations of equations of the type
B₁ ∂^pφ/∂t^p + B₂ ∂^{p−1}φ/∂t^{p−1} + ⋯ + B_p ∂φ/∂t + Aφ = f,  (21.12)
where B₁, B₂, …, B_p are linear operators.


22. The scheme of approximate factorization for the parabolic equation

In papers by DYAKONOV [1962d–f, 1963–1965, 1971b–d, 1972] the method of
approximate factorization (splitting operator method) has been developed and
grounded for a wide class of equations of parabolic and hyperbolic types. We
consider some of these results for the parabolic equation of the form:
∂φ/∂t − Σ_{α=1}^n ∂²φ/∂x_α² = f  in Ω × Ω_t  (22.1)
under the conditions
φ|_{∂Ω} = φ^{(r)},  φ = g for t = 0.
Here
Ω = {0 < x_α < 1: α = 1, …, n},  Ω_t = {0 < t < T}.
Let
τ = T/K,  h_α = 1/N_α,  α = 1, …, n,
(22.2)
h = (h₁, …, hₙ),  |h|² = Σ_{α=1}^n h_α².
The spatial grid Ω_h is the set of points x_l = (l₁h₁, …, lₙhₙ), where 1 ≤ l_α ≤ N_α − 1,
l = (l₁, …, lₙ); ∂Ω_h is the set of boundary grid points, Ω̄_h = Ω_h + ∂Ω_h. The temporal
grid is given by t_j = jτ, j = 0, 1, …, K.
We introduce the difference operators
Δ_α(φ)_l = (1/h_α)(φ(t_j, x_l + h_α1_α) − φ(t_j, x_l)),
∇_α(φ)_l = (1/h_α)(φ(t_j, x_l) − φ(t_j, x_l − h_α1_α)),

where 1_α is the n-dimensional unit vector in the direction x_α. We consider a set of
difference schemes of the kind
(φ^{j+1}_l − φ^j_l)/τ + Σ_{α=1}^n Λ_α(θφ^{j+1}_l + (1 − θ)φ^j_l) = f^{j+1}_l  in Ω_h,  0 ≤ j ≤ K − 1,
φ⁰_l = g_l,  x_l ∈ Ω̄_h,  (22.3)
φ^j_l = φ^{(r)}(t_j)_l,  x_l ∈ ∂Ω_h,
where
Λ_α = −Δ_α∇_α,  0 ≤ θ = const ≤ 1,
f^{j+1} = f(t_{j+1}, x_l) for θ ≠ ½,  f^{j+1} = f(t_{j+1/2}, x_l) for θ = ½,
and the notation φ^j_l = φ(t_j, x_l) is used. When the solutions are sufficiently smooth,

scheme (22.3) approximates (22.1) with a local approximation error of order
O(τ + |h|²) for θ ≠ ½ and O(τ² + |h|²) for θ = ½. For θ = 0 scheme (22.3) is explicit; for
θ = 1 it is purely implicit; for θ = ½ it is the Crank–Nicolson scheme.
Let Φ and F be the Hilbert spaces of the grid functions φ ∈ Φ and f ∈ F, defined on
Ω_h with inner product
(u, v) = h₁⋯hₙ Σ_{x_l ∈ Ω_h} u_l v_l
and norm ‖u‖ = (u, u)^{1/2}. Then in this norm the schemes (22.3) are absolutely stable
independently of τ and h for θ ≥ ½, and for θ < ½ they are stable provided that the
following relationship holds:
Σ_{α=1}^n τ/h_α² ≤ 1/(2(1 − 2θ)),  (22.4)
i.e. they are conditionally stable. From the approximation and stability of these schemes
corresponding statements about the convergence follow (see Theorem 3.1).
We now introduce the operator
Bφ^{j+1} ≡ ∏_{α=1}^n (E + θτΛ_α)φ^{j+1}
= φ^{j+1} + θτ Σ_{α=1}^n Λ_αφ^{j+1} + (θτ)² Σ_{α₁<α₂} Λ_{α₁}Λ_{α₂}φ^{j+1} + ⋯
+ (θτ)ⁿ Λ₁Λ₂⋯Λₙφ^{j+1},  x_l ∈ Ω_h,  (22.5)
and consider the scheme
B(φ^{j+1}_l − φ^j_l)/τ + Λφ^j_l = f^{j+1}_l,
φ⁰_l = g_l,  x_l ∈ Ω̄_h,  (22.6)
φ^j_l = φ^{(r)}(t_j)_l,  x_l ∈ ∂Ω_h,
where Λ = Σ_{α=1}^n Λ_α. Taking (22.5) into account we may rewrite (22.6) in the form
(φ^{j+1} − φ^j)/τ + Σ_{α=1}^n Λ_α(θφ^{j+1} + (1 − θ)φ^j) + Q(φ^{j+1} − φ^j)/τ = f^{j+1},  (22.7)
where
Q = Σ_{a=2}^n (θτ)^a Σ_{α₁<α₂<⋯<α_a ≤ n} Λ_{α₁}Λ_{α₂}⋯Λ_{α_a}.  (22.8)

Thus, we see that (22.6) differs from (22.3) by the presence of the additional term
Q(φ^{j+1} − φ^j)/τ on the left-hand side. If the solution of the original problem has
derivatives (bounded in Ω̄ × Ω̄_t)
D^{β₁}_{α₁} D^{β₂}_{α₂} ⋯ D^{β_a}_{α_a} ∂φ/∂t,  α₁ < α₂ < ⋯ < α_a ≤ n,
where
β_i ≤ 2,  D^β_α ≡ ∂^β/∂x_α^β,
then
‖Q(φ^{j+1} − φ^j)/τ‖ ≤ Mτ²,  M = const.  (22.9)
(Note that the smoothness restrictions on φ(t, x) may be weakened; see DYAKONOV
[1971d], p. 23.) Thus, if (22.9) holds then scheme (22.6) approximates the original
problem with the same order as scheme (22.3). Here, naturally, for θ ≠ ½ the inequality
‖Q(φ^{j+1} − φ^j)/τ‖ ≤ Mτ  (22.10)
would do instead of (22.9), and, also, weakened conditions on the smoothness of
φ(t, x) suffice.

We consider how the transition from the time level t_j to the level t_{j+1} in scheme
(22.6) is realized. Note that if we exclude, with the help of the boundary conditions, all
boundary values of φ from the left-hand side at the expense of the corresponding
modifications of f^{j+1} in the near-boundary grid points, then we may rewrite (22.6) in
the form:
∏_{α=1}^n (E + τθΛ_α)φ^{j+1} = ∏_{α=1}^n (E + τθΛ_α)φ^j − τΛφ^j + τf̃^{j+1},  (22.11)
where Λ = Σ_{α=1}^n Λ_α and the vector f̃^{j+1} is defined through f^{j+1}, φ^{(r)}(t_{j+1}) and φ^{(r)}(t_j).
This is (18.2) with splitting operator B = ∏_{α=1}^n (E + τθΛ_α). The solution of (22.11) on
each time step is reduced to the solution of (18.4) for p = n and B_α = E + τθΛ_α by the
sweep method.
Stability of scheme (22.6) follows from the a priori estimate
‖φ^{j+1}‖ ≤ ‖g‖ + τ Σ_{i=1}^{j+1} ‖f̃^i‖_F  (22.12)
(DYAKONOV [1971d], p. 26).

REMARK 22.1. Along with scheme (22.6) one may consider further schemes with
approximate factorization of the operator, obtained by replacing the basic
equation from (22.6) with factorized analogues admitting a nonconstant grid step
τ_j = t_j − t_{j−1} in t and 0 ≤ θ = const ≤ 1, for example
(1/τ_j) ∏_{α=1}^n (E + θτ_jΛ_α) φ^{j+1} = (1/τ_j) ∏_{α=1}^n (E − (1 − θ)τ_jΛ_α) φ^j + f^{j+1}
(DYAKONOV [1971d], p. 26).
CHAPTER V

The Predictor-Corrector Method

The next class of the splitting methods, the predictor-corrector method (approximation-correction scheme), will, as was the scheme with a factorization of the
operator, be considered in application to the matrix evolutionary equation (18.1)
with a time-independent operator.

23. The predictor-corrector method: The case A = A₁ + A₂

The idea of the predictor-corrector method is as follows. The whole interval 0 ≤ t ≤ T
is split into a number of partial intervals, and within each elementary interval
t_j ≤ t ≤ t_{j+1} the problem (18.1) is solved in two steps. First, one finds an approximate
solution of the problem at the time t_{j+1/2} = t_j + ½τ using a first-order accuracy
scheme which has a considerable "amount" of stability. This step is usually referred to
as the predictor. Then, on the whole interval (t_j, t_{j+1}), the original equation is
approximated with a second-order scheme which serves as the corrector. It is essential
that, while constructing the corrector, the "rough" solution at t_{j+1/2} found in the
predictor step is used.
The predictor-corrector scheme may be written in the form:
(φ^{j+1/4} − φ^j)/(½τ) + A₁φ^{j+1/4} = 0,
(φ^{j+1/2} − φ^{j+1/4})/(½τ) + A₂φ^{j+1/2} = 0,
(φ^{j+1} − φ^j)/τ + Aφ^{j+1/2} = 0,  (23.1)
φ⁰ = g.
If we exclude the auxiliary function φ^{j+1/4} from the first two equations of (23.1),
the system reduces to the following one:
(E + ½τA₁)(E + ½τA₂)φ^{j+1/2} = φ^j,
(23.2)
(φ^{j+1} − φ^j)/τ + Aφ^{j+1/2} = 0.

Excluding φ^{j+1/2} from (23.2) we get
(φ^{j+1} − φ^j)/τ + A(E + ½τA₂)^{−1}(E + ½τA₁)^{−1}φ^j = 0,
φ⁰ = g.  (23.3)
We now suppose that the following restriction holds:
½τ‖A_α‖ < 1,  α = 1, 2.  (23.4)
If A₁ ≥ 0, A₂ ≥ 0, the elements of the matrices A₁, A₂ do not depend on t, and the
solution φ of (18.1) is sufficiently smooth, then the difference scheme (23.1) is
absolutely stable and approximates the original problem with second-order
accuracy in τ. At the same time the estimate
‖φ^{j+1}‖_{C₁} ≤ ‖g‖_{C₁}  (23.5)
holds, where
C₁ = (E + ½τA₁*)^{−1}(E + ½τA₁)^{−1},
‖φ‖_{C₁} = (C₁φ, φ)^{1/2}.
To prove this statement let us rewrite (23.3) in the form
(E + ½τA₁)(E + ½τA₂)(φ^{j+1} − φ^j)/τ + Ãφ^j = 0,
where
Ã = (E + ½τA₁)(E + ½τA₂)A(E + ½τA₂)^{−1}(E + ½τA₁)^{−1}.
Expanding the right-hand side of the last relationship into a Taylor series in τ and
assuming that
½τ‖A_α‖ < 1,
we get
Ã = A + O(τ²).
With the help of the estimate used in the stabilization method we come to the
conclusion that the predictor-corrector method has a second-order approximation
in τ.
Let us consider the stability of this method. To this end we rewrite (23.3) in the
form
(E + ½τA₁)(E + ½τA₂)(ψ^{j+1} − ψ^j)/τ + Aψ^j = 0,  (23.6)
where
ψ^j = (E + ½τA₂)^{−1}(E + ½τA₁)^{−1}φ^j.  (23.7)
The difference scheme (23.6) is stable, since
‖ψ^{j+1}‖_{C₂} ≤ ‖ψ^j‖_{C₂}.  (23.8)

Then taking (23.7) and (23.8) into account we get
‖(E + ½τA₁)^{−1}φ^{j+1}‖ ≤ ‖(E + ½τA₁)^{−1}φ^j‖,
or
‖φ^{j+1}‖_{C₁} ≤ ‖φ^j‖_{C₁} ≤ ⋯ ≤ ‖g‖_{C₁}.
In the case of an inhomogeneous problem we formulate the predictor-corrector
method as follows:
(φ^{j+1/4} − φ^j)/(½τ) + A₁φ^{j+1/4} = f^j,
(φ^{j+1/2} − φ^{j+1/4})/(½τ) + A₂φ^{j+1/2} = 0,  (23.9)
(φ^{j+1} − φ^j)/τ + Aφ^{j+1/2} = f^j,
where
f^j = f(t_{j+1/2}).
If f^j is chosen in this form, then (23.9) approximates the original problem with
second-order accuracy in τ and the estimate
‖φ^j + ½τf^j‖_{C₁} ≤ ‖g + ½τf⁰‖_{C₁} + jτ‖f‖_{C₁},  (23.10)
where
‖f‖_{C₁} = max_j ‖f^j‖_{C₁},
is valid, i.e., for 0 < t_j ≤ T we again get stability of the difference scheme (see MARCHUK
[1980], pp. 254–255).
Thus, if A₁ ≥ 0, A₂ ≥ 0 and the elements of the matrices A₁, A₂ do not depend on time,
then the difference scheme (23.9) is absolutely stable and allows one to obtain a solution
of second-order accuracy in τ, provided that the solution and the right-hand side f of
(18.1) are sufficiently smooth.
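The second-order accuracy in τ of the predictor-corrector step can be observed numerically. A sketch with small illustrative matrices A₁, A₂ ≥ 0 and a constant right-hand side (all values here are assumptions of this example); halving τ should reduce the error at a fixed time by roughly a factor of four.

```python
# Convergence check for a predictor-corrector step of the (23.9) type.
import numpy as np

A1 = np.array([[2.0, 1.0], [1.0, 2.0]])   # A1 >= 0 (illustrative)
A2 = np.array([[1.0, 0.0], [0.0, 3.0]])   # A2 >= 0 (illustrative)
A = A1 + A2
E = np.eye(2)
g = np.array([1.0, -0.5])
f = np.array([0.3, 0.7])                  # constant right-hand side

def run(tau, steps):
    phi = g.copy()
    for _ in range(steps):
        # predictor: two cheap implicit fractional steps
        q = np.linalg.solve(E + 0.5*tau*A1, phi + 0.5*tau*f)   # phi^{j+1/4}
        half = np.linalg.solve(E + 0.5*tau*A2, q)              # phi^{j+1/2}
        # corrector: second-order step over the whole interval
        phi = phi + tau * (f - A @ half)
    return phi

# Exact solution phi(t) = exp(-At)(g - A^{-1}f) + A^{-1}f (A is symmetric here).
T = 0.5
w, V = np.linalg.eigh(A)
Ainv_f = np.linalg.solve(A, f)
exact = V @ (np.exp(-w * T) * (V.T @ (g - Ainv_f))) + Ainv_f

err1 = np.linalg.norm(run(0.05, 10) - exact)
err2 = np.linalg.norm(run(0.025, 20) - exact)
ratio = err1 / err2
print(ratio)  # close to 4, consistent with second-order accuracy in tau
```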
In conclusion we shall pay attention to the fact that, although the difference scheme
(23.1) is absolutely stable, the difference scheme for the corrector incorporated here
as a part may be absolutely unstable if considered separately. Let us show this. For
the sake of simplicity we will consider the case when A is the difference analogue of
the two-dimensional Laplace operator, Ω is the unit square, and the solution equals
zero on the boundaries of Ω.
In this case the corrector has the form
(φ^{j+1}_{kl} − φ^{j−1}_{kl})/(2τ) = (φ^j_{k+1,l} − 2φ^j_{kl} + φ^j_{k−1,l})/h² + (φ^j_{k,l+1} − 2φ^j_{kl} + φ^j_{k,l−1})/h²
(since now the predictor is not used in the computations, only integer values of j are
considered).
For this difference problem a solution φ^j will be chosen for which the inequality
‖φ^j‖ ≤ ‖φ^{j−1}‖ from the definition of stability does not hold. We will find such

a solution in the form
φ^j_{kl} = λ^j sin(mπkh) sin(pπlh),
where j serves both as an index on the left-hand side and as a power on the
right-hand side. Using this expression in the difference equation we obtain the
following characteristic equation:
λ² + 8a_{mp}λ − 1 = 0,
where
a_{mp} = (τ/h²)(sin²(½mπh) + sin²(½pπh)).
Taking as λ the quantity
λ = −4a_{mp} − √(1 + 16a²_{mp}),
we see that
‖φ^j‖/‖φ⁰‖ = |λ|^j → ∞ as τ → 0
(so that τ/h² = const), i.e. the scheme is absolutely unstable.
Thus, in spite of the fact that the difference scheme used as the corrector is absolutely
unstable, the "amount" of stability the predictor possesses is sufficient for the absolute
stability of the whole scheme.
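The growth factor of the uncoupled corrector is easy to evaluate. A short sketch, with τ/h² and the mode numbers chosen purely for illustration: the root λ = −4a_{mp} − √(1 + 16a²_{mp}) of the characteristic equation has |λ| > 1 for every a_{mp} > 0, so each mode grows from step to step.

```python
# Growth factor of the leapfrog corrector considered separately.
import numpy as np

tau_over_h2 = 0.25                        # tau/h^2 held constant as tau -> 0
h = 1.0 / 64
m = p = 32                                # a high-frequency mode
a_mp = tau_over_h2 * (np.sin(0.5*np.pi*m*h)**2 + np.sin(0.5*np.pi*p*h)**2)
lam = -4*a_mp - np.sqrt(1 + 16*a_mp**2)   # root of lambda^2 + 8a*lambda - 1 = 0

assert abs(lam) > 1                       # growth per step: absolute instability
print(abs(lam))                           # approximately 2.414 for this mode
```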

24. The predictor-corrector method: The case A = Σ_{α=1}^n A_α

We suppose that in (18.1)
A = Σ_{α=1}^n A_α,  A_α ≥ 0,  n > 2.  (24.1)

In this case the splitting scheme has the form:
(E + ½τA₁)φ^{j+1/2n} = φ^j + ½τf^j,
(E + ½τA₂)φ^{j+2/2n} = φ^{j+1/2n},
. . . . . . . . . . . . . . .  (24.2)
(E + ½τAₙ)φ^{j+1/2} = φ^{j+(n−1)/2n},
(φ^{j+1} − φ^j)/τ + Aφ^{j+1/2} = f^j,
where we suppose again that A_α ≥ 0 and f^j = f(t_{j+1/2}). The system of equations (24.2)
is reduced to the one equation
(φ^{j+1} − φ^j)/τ + A ∏_{α=n}^{1} (E + ½τA_α)^{−1}(φ^j + ½τf^j) = f^j,
φ⁰ = g.  (24.3)

The predictor-corrector method in this case has second-order accuracy in τ provided
that the solution is sufficiently smooth. Write (24.3) in the form:
φ^{j+1} = Tφ^j + ½τ(E + T)f^j,  (24.4)
where the transition operator T is given by
T = E − τA ∏_{α=n}^{1} (E + ½τA_α)^{−1}.  (24.5)
The requirements for numerical stability have been reduced, eventually, to
estimating the norm of the operator T. Unfortunately, in this case the constructive
condition A_α ≥ 0 does not allow us to prove the stability of the scheme.
In the case when the operators A_α are commutative and have a common basis, the
stability of the above schemes follows from the condition A_α ≥ 0 (MARCHUK [1980],
p. 265).

REMARK 24.1. The stabilization method, as well as the predictor-corrector method
with splitting of the operator into n components, may also be applied when the
operator A is time-dependent. However, in this situation the a priori formulation of
the stability condition appears to be a more complicated problem. Therefore it is
difficult to say to what extent the application of the two above schemes is justified in
general situations.

25. The predictor-corrector method for the parabolic equation

Let
Ω = {0 < x_α < 1: α = 1, 2, 3},  Ω_t = {0 < t < T}.
For the problem
∂φ/∂t − Σ_{α=1}^3 ∂²φ/∂x_α² = 0  in Ω × Ω_t,
φ = g for t = 0,  (25.1)
φ = φ^{(r)} on ∂Ω
we introduce the following difference approximation:
∂φ/∂t + Aφ = 0  in Ω × Ω_t,
φ = φ^{(r)} on ∂Ω,  (25.2)
φ = g for t = 0.
Here Ω and ∂Ω are already grid domains, and A has the form:
A = Σ_{α=1}^3 A_α,  A_α = −(Δ_{x_α}∇_{x_α})/h_{x_α}².  (25.3)

BRAYN [1966] (see also YANENKO [1967], p. 40) suggested the following scheme
based on the idea of the predictor-corrector method:
(φ^{j+1/6} − φ^j)/(½τ) + A₁φ^{j+1/6} + A₂φ^j + A₃φ^j = 0,
(φ^{j+1/3} − φ^{j+1/6})/(½τ) + A₂(φ^{j+1/3} − φ^j) = 0,
(φ^{j+1/2} − φ^{j+1/3})/(½τ) + A₃(φ^{j+1/2} − φ^j) = 0,  (25.4)
(φ^{j+1} − φ^j)/τ + A₁φ^{j+1/6} + A₂φ^{j+1/3} + A₃φ^{j+1/2} = 0.
Here, the first three equations form the predictor (a stabilizing correction scheme),
bringing φ to the time level t = (j + ½)τ; the fourth is the corrector.
Having excluded the fractional steps from (25.4) we get
(φ^{j+1} − φ^j)/τ + A(φ^{j+1} + φ^j)/2 + ¼τ²(A₁A₂ + A₁A₃ + A₂A₃)(φ^{j+1} − φ^j)/τ
+ ⅛τ³A₁A₂A₃(φ^{j+1} − φ^j)/τ = 0.  (25.5)
From this we conclude that scheme (25.4) has a second-order approximation in
τ and in h_α. Taking into account that the operators A_α are commutative, we also
conclude that this scheme is absolutely stable.
DOUGLAS [1962] suggested the following absolutely stable scheme which has
second-order accuracy:
(φ^{j+1/3} − φ^j)/τ + ½A₁(φ^{j+1/3} + φ^j) + A₂φ^j + A₃φ^j = 0,
(φ^{j+2/3} − φ^j)/τ + ½A₁(φ^{j+1/3} + φ^j) + ½A₂(φ^{j+2/3} + φ^j) + A₃φ^j = 0,  (25.6)
(φ^{j+1} − φ^j)/τ + ½A₁(φ^{j+1/3} + φ^j) + ½A₂(φ^{j+2/3} + φ^j) + ½A₃(φ^{j+1} + φ^j) = 0,
which may be rewritten in the form:
(φ^{j+1/3} − φ^j)/τ + ½A₁(φ^{j+1/3} + φ^j) + A₂φ^j + A₃φ^j = 0,
(φ^{j+2/3} − φ^{j+1/3})/τ + ½A₂(φ^{j+2/3} − φ^j) = 0,  (25.7)
(φ^{j+1} − φ^{j+2/3})/τ + ½A₃(φ^{j+1} − φ^j) = 0.

Excluding φ^{j+1/3}, φ^{j+2/3} from (25.7) we again come to (25.5). Therefore, schemes
(25.4) and (25.6) are equivalent (YANENKO [1967], p. 42).
We write the predictor-corrector scheme as
(φ^{j+1/6} − φ^j)/(½τ) + A₁φ^{j+1/6} = 0,
(φ^{j+1/3} − φ^{j+1/6})/(½τ) + A₂φ^{j+1/3} = 0,
(φ^{j+1/2} − φ^{j+1/3})/(½τ) + A₃φ^{j+1/2} = 0,  (25.8)
(φ^{j+1} − φ^j)/τ + Aφ^{j+1/2} = 0,
which is a special realization of scheme (24.2). Excluding φ^{j+1/6}, φ^{j+1/3}, φ^{j+1/2} from
(25.8) we also come to (25.5). Therefore, this scheme is equivalent to the schemes
(25.4) and (25.6).
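The claimed equivalence can be checked in one step, under the commutativity assumption, by taking the A_α to be polynomials of a single symmetric matrix (an artificial choice made here only so that they commute): schemes (25.4) and (25.8) then produce identical results.

```python
# One-step equivalence of schemes (25.4) and (25.8) for commuting operators.
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))
S = M @ M.T                                    # symmetric, positive semidefinite
A1, A2, A3 = S, 0.5 * (S @ S), S + np.eye(5)   # pairwise commuting, >= 0
A = A1 + A2 + A3
E = np.eye(5)
tau = 1e-2
phi = rng.standard_normal(5)

# Scheme (25.4): stabilizing-correction predictor to level j + 1/2, then corrector.
q16 = np.linalg.solve(E + 0.5*tau*A1, phi - 0.5*tau*(A2 + A3) @ phi)
q13 = np.linalg.solve(E + 0.5*tau*A2, q16 + 0.5*tau*A2 @ phi)
q12 = np.linalg.solve(E + 0.5*tau*A3, q13 + 0.5*tau*A3 @ phi)
phi_254 = phi - tau * (A1 @ q16 + A2 @ q13 + A3 @ q12)

# Scheme (25.8): pure predictor-corrector splitting.
p16 = np.linalg.solve(E + 0.5*tau*A1, phi)
p13 = np.linalg.solve(E + 0.5*tau*A2, p16)
p12 = np.linalg.solve(E + 0.5*tau*A3, p13)
phi_258 = phi - tau * (A @ p12)

assert np.allclose(phi_254, phi_258)   # both satisfy (25.5)
```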
CHAPTER VI

The Alternating Direction and the Stabilizing Correction Methods

26. The alternating direction method

Consider the matrix homogeneous evolutionary problem (18.1):
∂φ/∂t + Aφ = 0  in Ω × Ω_t,
(26.1)
φ = g  in Ω for t = 0,
where
A = A₁ + A₂.
A scheme of the alternating direction method for (26.1) has the form
(φ^{j+1/2} − φ^j)/τ + ½(A₁φ^{j+1/2} + A₂φ^j) = 0,
(φ^{j+1} − φ^{j+1/2})/τ + ½(A₁φ^{j+1/2} + A₂φ^{j+1}) = 0,  (26.2)
j = 0, 1, 2, …,
φ⁰ = g.

In this form it was applied by PEACEMAN and RACHFORD [1955] and DOUGLAS [1955]
to a parabolic problem with two spatial variables. In this case the operator A_α is
a difference approximation of the one-dimensional operator −a²∂²/∂x_α². Note that
in this case scheme (26.2) is symmetric, i.e. x₁ and x₂ change roles from the first
fractional step to the second (this fact gives the method its name). It is easy to find the
solution of each equation in the parabolic problem by the sweep method (therefore,
scheme (26.2) is also called the alternating direction sweep method (YANENKO [1967],
p. 26)).
Eliminating φ^{j+1/2} from (26.2) we obtain
(φ^{j+1} − φ^j)/τ + A(φ^{j+1} + φ^j)/2 + ¼τ²A₁A₂(φ^{j+1} − φ^j)/τ = 0.  (26.3)
Comparing (26.3) with the Crank–Nicolson scheme we conclude that it has a
second-order approximation in τ. Further, if we consider (26.2) when A_α is a three-point
approximation of the operator −a²∂²/∂x_α²,
A_α = −a²(Δ_{x_α}∇_{x_α})/h_α²,
then it is easy to establish that this scheme is absolutely stable (DOUGLAS and
RACHFORD [1956], YANENKO [1967], p. 28). If we analyze in this case the behaviour of
the error in each direction x_α using Neumann's method, then it appears that in the
first semistep the error in the direction x₁ decreases and the error in the direction x₂
increases, then in the second semistep the error in the direction x₁ increases and the
error in the direction x₂ decreases. It does not matter how much the error increases
in any direction at a given semistep, because it certainly will decrease at the next one,
so that on the whole, after two semisteps, it will not increase in absolute value
(YANENKO [1967], pp. 28–29).
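The passage from (26.2) to (26.3) is an exact algebraic identity, requiring only that the half-step operators be invertible; a minimal check with arbitrary illustrative matrices (an assumption of this example, not data from the text):

```python
# One alternating direction step (26.2) satisfies whole-step equation (26.3).
import numpy as np

rng = np.random.default_rng(5)
n, tau = 6, 0.1
A1 = rng.standard_normal((n, n))
A2 = rng.standard_normal((n, n))
A = A1 + A2
E = np.eye(n)
phi = rng.standard_normal(n)

half = np.linalg.solve(E + 0.5*tau*A1, (E - 0.5*tau*A2) @ phi)   # sweep in x1
nxt = np.linalg.solve(E + 0.5*tau*A2, (E - 0.5*tau*A1) @ half)   # sweep in x2

# Residual of (26.3); it vanishes identically, not just to O(tau^2).
res = (nxt - phi)/tau + A @ (nxt + phi)/2 + 0.25*tau * (A1 @ A2) @ (nxt - phi)
assert np.allclose(res, 0, atol=1e-10)
```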
YANENKO ([1967], p. 28) noted that a scheme of the alternating direction method is
useless for the three-dimensional parabolic problem. It has been shown that in that
case the following scheme (with A_α = −a²(Δ_{x_α}∇_{x_α})/h_α²)
(φ^{j+1/3} − φ^j)/τ + ⅓(A₁φ^{j+1/3} + A₂φ^j + A₃φ^j) = 0,
(φ^{j+2/3} − φ^{j+1/3})/τ + ⅓(A₁φ^{j+1/3} + A₂φ^{j+2/3} + A₃φ^{j+1/3}) = 0,  (26.4)
(φ^{j+1} − φ^{j+2/3})/τ + ⅓(A₁φ^{j+2/3} + A₂φ^{j+2/3} + A₃φ^{j+1}) = 0
is not absolutely stable. Therefore, for many problems schemes of the stabilizing
correction method are preferable (along with schemes of the alternating direction
method they are sometimes also called implicit schemes of alternating direction).

27. The stabilizing correction method


Stabilizing correction schemes were proposed by DOUGLAS and RACHFORD [1956]
for solving the three-dimensional heat conduction equation. So, if we have in (26.1)
A = A₁ + A₂ + A₃,
this scheme has the form
(φ^{j+1/3} − φ^j)/τ + A₁φ^{j+1/3} + A₂φ^j + A₃φ^j = 0,
(φ^{j+2/3} − φ^{j+1/3})/τ + A₂(φ^{j+2/3} − φ^j) = 0,
(φ^{j+1} − φ^{j+2/3})/τ + A₃(φ^{j+1} − φ^j) = 0,  (27.1)
j = 0, 1, 2, …,
φ⁰ = g.

By eliminating φ^{j+1/3} and φ^{j+2/3} from (27.1) we obtain the equation
(φ^{j+1} − φ^j)/τ + Aφ^{j+1} + τ²(A₁A₂ + A₁A₃ + A₂A₃)(φ^{j+1} − φ^j)/τ
+ τ³A₁A₂A₃(φ^{j+1} − φ^j)/τ = 0.  (27.2)
It follows that the scheme has first-order accuracy in τ. When we consider it for the
heat conduction equation it is easy to establish absolute stability. Besides, the
structure of the scheme is as follows: the first fractional step gives the whole
approximation of the heat conduction equation, while the next fractional steps are
corrective and serve to improve the stability. Therefore such schemes are called
stabilizing correction schemes or schemes with stability correction.
DOUGLAS [1962] proposed scheme (25.6) of second-order accuracy, which may
also be considered as a stabilizing correction scheme.
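The elimination leading to (27.2) likewise holds exactly, with no commutativity assumption; a minimal check with arbitrary illustrative matrices (again assumptions of this example):

```python
# One stabilizing correction step (27.1) satisfies whole-step equation (27.2).
import numpy as np

rng = np.random.default_rng(6)
n, tau = 5, 0.1
A1, A2, A3 = (rng.standard_normal((n, n)) for _ in range(3))
A = A1 + A2 + A3
E = np.eye(n)
phi = rng.standard_normal(n)

p13 = np.linalg.solve(E + tau*A1, phi - tau*(A2 + A3) @ phi)  # whole approximation
p23 = np.linalg.solve(E + tau*A2, p13 + tau*A2 @ phi)         # stability correction
nxt = np.linalg.solve(E + tau*A3, p23 + tau*A3 @ phi)         # stability correction

pair = A1 @ A2 + A1 @ A3 + A2 @ A3
res = ((nxt - phi)/tau + A @ nxt
       + tau * pair @ (nxt - phi) + tau**2 * (A1 @ A2 @ A3) @ (nxt - phi))
assert np.allclose(res, 0, atol=1e-10)
```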

28. A general formulation for the stabilizing correction method

The stabilizing correction method was formulated in a general form by DOUGLAS
and GUNN [1964] (see also YANENKO [1967], p. 159). So, let us consider the
evolutionary problem
∂φ/∂t + Aφ = f  in Ω × Ω_t,  (28.1)
φ = g  in Ω for t = 0,
where A = Σ_{α=1}^n A_α. We write the difference multi-layer scheme
(φ^{j+1} − φ^j)/τ + Aφ^{j+1} = F^j,  (28.2)
where
F^j = B₀φ^j + B₋₁φ^{j−1} + ⋯ + B_{−q+1}φ^{j−q+1} + f^j,  (28.3)
A = A₁ + ⋯ + Aₙ,
which approximates problem (28.1). Taking (28.3) into account we can write the
corresponding scheme in fractional steps:
(φ^{j+1/n} − φ^j)/τ + A₁(φ^{j+1/n} − φ^j) + Aφ^j = F^j,
(φ^{j+2/n} − φ^{j+1/n})/τ + A₂(φ^{j+2/n} − φ^j) = 0,
. . . . . . . . . . . . . . .  (28.4)
(φ^{j+1} − φ^{j+(n−1)/n})/τ + Aₙ(φ^{j+1} − φ^j) = 0.

By eliminating φ^{j+1/n}, …, φ^{j+(n−1)/n} from (28.4) we obtain, after simple
transformations, an equation "on the whole steps":
(φ^{j+1} − φ^j)/τ + Aφ^{j+1} + τ²R(φ^{j+1} − φ^j)/τ = F^j,  (28.5)
where
R = Σ_{i<j} AᵢAⱼ + τ Σ_{i<j<k} AᵢAⱼAₖ + ⋯ + τ^{n−2} A₁⋯Aₙ.  (28.6)
By comparing (28.5) with (28.2) we conclude that, with accuracy up to quantities of
order O(τ²), the general stabilizing correction scheme keeps the order of approximation
of scheme (28.2)–(28.3). The spectral stability analysis for scheme (28.4) is given by
YANENKO ([1967], pp. 160–161).
Note that for two-layer schemes, under the condition that the operators are
commutative and for A_α ≥ 0, the stability of scheme (28.4) follows from the stability of
scheme (28.2) (DOUGLAS and GUNN [1964]).

29. Application of the alternating direction scheme to the parabolic equation

Let us solve the following boundary value problem for the second-order parabolic
equation without mixed derivatives:
∂φ/∂t + Aφ = f  in Ω × Ω_t,
φ|_{∂Ω} = φ^{(r)}(x, t),  (29.1)
φ = g(x)  in Ω for t = 0,
where
A = Σ_{α=1}^2 A_α,  A_αφ = −(∂/∂x_α)(k_α ∂φ/∂x_α),
x = (x₁, x₂) ∈ Ω = {0 < x_α < 1: α = 1, 2},
k_α > 0.
We assume that problem (29.1) has a unique and sufficiently smooth solution.
THOMPSON [1964] proved the existence and uniqueness of a generalized solution for
(29.1) provided that the function f(x, u) satisfies a Lipschitz condition in the variable
u. He also showed that the convergence of the solution of the difference problem to the
solution of the differential one follows from the approximation and stability of the
corresponding difference operator.
We construct in Ω̄ = Ω + ∂Ω a grid Ω_h with steps h_α that is uniform in x_α and replace
the operators A_α by the difference operators:
A_αφ = −(Δ_{x_α}k_α∇_{x_α}φ)/h_α².  (29.2)
Contrary to the case of constant coefficients, here the operators A_α are positive and
self-adjoint but not commutative. Instead of (29.1) we consider the following

problem approximating (29.1):
∂φ/∂t + Aφ = f  in Ω_h,  A = A₁ + A₂,
φ|_{∂Ω_h} = φ^{(r)},  (29.3)
φ = g_h  in Ω_h for t = 0.
Now φ and f are vectors and A₁ and A₂ are matrices. Suppose the solution φ belongs
to the grid function space Φ.
For solving problem (29.3) in the layer t_j ≤ t ≤ t_{j+1} we write the alternating
direction scheme in the form (SAMARSKII [1971], pp. 360–365):
(φ^{j+1/2} − φ^j)/(½τ) + A₁φ^{j+1/2} + A₂φ^j = f^j,
(29.4)
(φ^{j+1} − φ^{j+1/2})/(½τ) + A₁φ^{j+1/2} + A₂φ^{j+1} = f^j,
j = 0, 1, 2, …,
with the initial condition
φ⁰ = g.  (29.5)
Equations (29.4) and (29.5) must be supplemented by difference boundary conditions,
which may be written, for example, in the form (see KRYAKVIN [1966]):
φ^{j+1}|_{∂Ω_h} = φ^{(r)}(t_{j+1}),  φ^{j+1/2}|_{∂Ω_h¹} = φ̄^{(r)},  (29.6)
where ∂Ω_h¹ is the grid on the sides x₁ = 0 and x₁ = 1,
φ̄^{(r)} = ½[φ^{(r)}(t_{j+1}) + φ^{(r)}(t_j)] + ¼τA₂[φ^{(r)}(t_{j+1}) − φ^{(r)}(t_j)].  (29.7)

We rewrite (29.4) in the form:
(2/τ)φ^{j+1/2} + A₁φ^{j+1/2} = F^j,  F^j = (2/τ)φ^j − A₂φ^j + f^j,
(29.8)
(2/τ)φ^{j+1} + A₂φ^{j+1} = F^{j+1/2},  F^{j+1/2} = (2/τ)φ^{j+1/2} − A₁φ^{j+1/2} + f^j.
Each problem in (29.8) with the corresponding boundary conditions from (29.6) is
a one-dimensional first boundary value problem and it can be solved by the sweep
method.
If the value of k_α at the grid point i_α is calculated, for example, by the formula
(k_α)_{i_α} = ½[k_α(x_{i_α}) + k_α(x_{i_α+1})],  1 ≤ i_α ≤ I_α,
then the difference operator A_α of (29.2) approximates the differential operator A_α of
(29.1) with second-order accuracy, i.e. with an error O(h_α²).

Let k_α = const. The operator A is self-adjoint and positive in Φ. Introduce the metric
‖φ‖²_A = Σ_{i₁=1}^{I₁−1} Σ_{i₂=1}^{I₂} (Δ_{x₁}φ)²_{i₁i₂} h₁h₂ + Σ_{i₁=1}^{I₁} Σ_{i₂=1}^{I₂−1} (Δ_{x₂}φ)²_{i₁i₂} h₁h₂.  (29.9)
In this metric, scheme (29.4)–(29.6) is stable with respect to the initial and boundary
conditions and to the right-hand side.
Assume that, in Ω̄ × Ω̄_t, (29.1) has a unique solution φ = φ(x, t) which is continuous
and has bounded derivatives
∂³φ/∂t³,  ∂⁵φ/(∂x₁²∂x₂²∂t),  ∂⁴φ/∂x_α⁴.  (29.10)
Then scheme (29.4)–(29.6) converges in the grid norm (29.9) with rate O(τ² + |h|²), so
that
‖φ^j − φ(t_j)‖_A ≤ M(|h|² + τ²),
where M = const does not depend on τ and |h|.

REMARK 29.1. The alternating direction scheme with varying k_α has an
approximation of order O(|h|² + τ²) for the solution φ = φ(x, t), provided that, in
addition to (29.10), the obvious requirements of smoothness of k_α(x, t) in x_α and t are
satisfied. The distinction from the case of constant coefficients reveals itself in the
stability study of the scheme, since, though the operators A₁, A₂ are positive and
self-adjoint, they are not commutative. However, if, in addition to (29.10), there also
exists in Ω̄ × Ω̄_t a continuous derivative
(∂/∂x₁)(k₁ (∂/∂x₂)(k₂ ∂φ/∂x₂)),
then scheme (29.4)–(29.6) is stable and converges with rate O(τ² + |h|²) in the case of
varying coefficients k_α.
CHAPTER VII

Methods of Splitting with Respect to Physical Processes

30. The method of splitting with respect to physical processes

The method of splitting with respect to physical processes consists of reducing the
original evolutionary problem describing complex physical processes to a sequence
of problems describing processes of a more simple structure. To explain the idea of
the method we will consider the equation describing the process of advection-
diffusion of some substance (MARCHUK [1982], pp. 82-90).
The two-dimensional advection–diffusion problem consists of finding the solution of the equation

$$\frac{\partial\varphi}{\partial t} + \frac{\partial u\varphi}{\partial x} + \frac{\partial v\varphi}{\partial y} - \mu\Delta\varphi + \sigma\varphi = f, \tag{30.1}$$

$$\varphi = g \quad\text{for } t = 0, \tag{30.2}$$

$$\sigma \ge 0,\qquad \mu \ge 0.$$
Assume for simplicity that we find the solution φ(x, y, t) of (30.1)–(30.2) on the whole plane and that the functions u and v satisfy the continuity equation

$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0. \tag{30.3}$$
From the physical point of view, problem (30.1)–(30.2) describes, in principle, two different physical processes. The first one is the process of advection of a substance along the trajectory, and it is described by

$$\frac{\partial\varphi_1}{\partial t} + \frac{\partial u\varphi_1}{\partial x} + \frac{\partial v\varphi_1}{\partial y} = 0, \tag{30.4}$$

$$\varphi_1 = g_1 \quad\text{for } t = 0. \tag{30.5}$$


The second one is connected with diffusion and absorption of a substance and is given by

$$\frac{\partial\varphi_2}{\partial t} - \mu\Delta\varphi_2 + \sigma\varphi_2 = f, \tag{30.6}$$

$$\varphi_2 = g_2 \quad\text{for } t = 0. \tag{30.7}$$



These processes represent two extreme cases of problem (30.1)–(30.2). Indeed, taking μ = 0, σ = 0 and f = 0 in (30.1)–(30.2) we obtain problem (30.4)–(30.5), and taking u = v = 0 we obtain problem (30.6)–(30.7).
Let us now try to put (30.4)–(30.5) and (30.6)–(30.7) together. This appears to be possible because the processes of advection and diffusion are locally additive.
Assume that u and v are constants and that g(x, y) decreases quickly enough at infinity so that the following representations take place:

$$\varphi = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \tilde\varphi(m, n, t)\,e^{imx + iny}\,dm\,dn, \tag{30.8}$$

$$g = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \tilde A(m, n)\,e^{imx + iny}\,dm\,dn.$$

Then the exact solution of the original problem (30.1)–(30.2) has the form

$$\varphi = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \tilde A(m, n)\exp\bigl\{im(x + ut) + in(y + vt) - \bigl[\sigma + \mu(m^2 + n^2)\bigr]t\bigr\}\,dm\,dn.$$


Let us now split the problem into two. To that end we consider the time interval 0 ≤ t ≤ τ, on which we will solve the two problems successively: first (30.4)–(30.5) with the initial condition

$$\varphi_1 = g,$$

and next (30.6)–(30.7) with the initial condition

$$\varphi_2 = \varphi_1\big|_{t=\tau}.$$

We assume as before that φ₁, φ₂, g are represented as Fourier integrals as in (30.8).
For the Fourier components of the solution of (30.4)–(30.5) we find

$$\tilde\varphi_1(t) = \tilde A \exp\bigl\{(imu + inv)t\bigr\} \tag{30.10}$$

and for those of (30.6)–(30.7),

$$\tilde\varphi_2(t) = \tilde\varphi_1(\tau)\exp\bigl\{-\bigl[\sigma + \mu(m^2 + n^2)\bigr]t\bigr\}. \tag{30.11}$$

When we take t = τ in (30.10) and substitute φ̃₁(τ) in (30.11), we obtain

$$\tilde\varphi_2(\tau) = \tilde A \exp\bigl\{(imu + inv)\tau - \bigl[\sigma + \mu(m^2 + n^2)\bigr]\tau\bigr\}.$$
Therefore

$$\varphi_2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \tilde A(m, n)\exp\bigl\{im(x + u\tau) + in(y + v\tau) - \bigl[\sigma + \mu(m^2 + n^2)\bigr]\tau\bigr\}\,dm\,dn.$$

Thus the solution obtained by the splitting method for t = τ is identical to the solution of the original problem (30.1)–(30.2). This fact lies at the basis of the method of splitting with respect to physical processes.
Note that in reality u and v are not constant; in this case the splitting algorithm does not give the exact solution for t = t_j, j = 1, 2, .... To correct the result in this case, the time interval should be taken small enough to ensure a good approximation when the coefficients u and v vary considerably.
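In the constant-coefficient case, the agreement of the split and exact solutions can be checked directly on a single Fourier component. The following sketch (illustrative wave numbers and coefficients, not taken from the text) multiplies the amplitude factors (30.10) and (30.11) and compares the result with the exact amplitude of (30.1):

```python
import cmath

# one Fourier component exp(i(mx + ny)) of (30.1) with constant coefficients
m, n = 2.0, 3.0            # wave numbers (illustrative)
u, v = 1.0, -0.5           # advection velocities
mu, sigma = 0.1, 0.2       # diffusion and absorption coefficients
tau, A = 0.4, 1.0 + 0.0j   # time step and initial amplitude

# exact amplitude of the full problem at t = tau
exact = A * cmath.exp((1j*(m*u + n*v) - (sigma + mu*(m*m + n*n))) * tau)

# splitting: advection step (30.10), then diffusion-absorption step (30.11)
phi1 = A * cmath.exp(1j*(m*u + n*v) * tau)
phi2 = phi1 * cmath.exp(-(sigma + mu*(m*m + n*n)) * tau)

assert abs(phi2 - exact) < 1e-12   # the split solution is exact here
```

Because the two exponential factors commute, the product reproduces the exact amplitude; for variable u, v this identity is only approximate, which is why the step τ must then be kept small.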
The method of splitting with respect to physical processes is widely used for solving problems in hydrodynamics and gas dynamics (KOVENYA and YANENKO [1981], MIGUAL, PINSKY and TAYLOR [1983], GUSHCHIN [1981], TEMAM [1981]), meteorology and oceanology (MARCHUK [1967, 1974a, 1982], MARCHUK, DYMNIKOV et al. [1984]), etc.
CHORIN [1968] proposed an original method for finding the approximate solution of the equations for a viscous incompressible fluid. This method is based on Helmholtz's theorem on the decomposition of a vector field. (A vector field may be decomposed in a unique way into a solenoidal part and a potential part, provided that the normal component of the solenoidal part equals zero on the boundary.) The theoretical grounding of the method is given by TEMAM [1981]. YANENKO, ANUCHINA, PETRENKO and SHOKIN [1970] interpreted this method from the standpoint of splitting with respect to physical processes. A modification of this method was used by KUZNETSOV, MOSHKIN and SMAGULOV [1983] to solve the nontraditional Navier–Stokes problem where the pressure and the tangential component of the velocity are given on some part of the boundary.

31. The method of particles in a cell

Methods for solving multidimensional problems in mathematical physics were suggested by HARLOW [1967] and HARLOW and WELCH [1965] and named methods of particles in a cell or methods of markers and cells (the MAC method). These methods have been developed intensively in recent years. They are also based on the idea of splitting the physical process into simpler ones and are applied to calculations of multidimensional hydrodynamic flows with strong liquid deformations, large relative displacements and colliding interfaces.
The essence of the method of particles in a cell is as follows. Using weak approximation, the hydrodynamic equations are reduced on each small time interval to two simpler systems. The first describes the adaptation of the hydrodynamic fields omitting the advective terms and is integrated by the usual methods on a fixed Euler grid; the second describes the advection of substances in a Lagrange coordinate system. Only in solving the second system does one use the phenomenological simplification of the continuous medium model, replacing it with a system of particles in each cell of the Euler grid in such a way that the total mass, momentum and energy of the particles in a cell are identified with the corresponding characteristics of the continuous medium.
When some particle "carrying" a certain mass along its trajectory crosses the boundary of a cell, the mass, momentum and energy of this particle are subtracted from its "old" cell and added to the "new" cell. Harlow's scheme is based on explicit methods for solving the equations of the first and second stages, and the whole scheme is conditionally stable.
It is especially fruitful to use implicit schemes in the calculations of the first stage. In this case the stability criterion of the whole scheme coincides with the well-known Courant condition. Some improvements of the MAC method connected with the introduction of fractional cells and calculations of the locations of markers were suggested by CHEN with co-authors [1970, 1991] and NICOLS [1973].
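The bookkeeping described above — subtracting a crossing particle's mass and momentum from the "old" cell and adding them to the "new" one — can be sketched in one dimension as follows. This is a hypothetical minimal setup, not Harlow's actual scheme; the particle data and grid spacing are invented for illustration:

```python
def cell_of(x, h):
    """Index of the Euler cell containing position x."""
    return int(x // h)

# particles: (position, mass, velocity); Euler cells accumulate mass and momentum
particles = [(0.4, 1.0, 2.0), (0.9, 0.5, -1.0)]
h, dt, ncells = 0.5, 0.1, 4

mass = [0.0]*ncells
mom  = [0.0]*ncells
for x, mq, vq in particles:
    c = cell_of(x, h)
    mass[c] += mq
    mom[c]  += mq*vq

# move the particles; when one crosses a cell boundary, its mass and momentum
# are subtracted from the "old" cell and added to the "new" one
for i, (x, mq, vq) in enumerate(particles):
    xn = x + vq*dt
    old, new = cell_of(x, h), cell_of(xn, h)
    if new != old:
        mass[old] -= mq; mom[old] -= mq*vq
        mass[new] += mq; mom[new] += mq*vq
    particles[i] = (xn, mq, vq)

assert abs(sum(mass) - 1.5) < 1e-12   # totals are conserved by construction
```

The redistribution only moves quantities between cells, so the grid totals are, by construction, identical to the particle totals at every step.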

32. The method of large particles

DYACHENKO [1969], BELOTSERKOVSKY and DAVYDOV [1970], and YANENKO, ANUCHINA, PETRENKO and SHOKIN [1970, 1971] give various modifications of the method of particles in a cell which diminish substantially the inherent fluctuations of density and pressure and extend the "stability reserve". Various realization schemes are considered. These modifications are currently referred to as the method of large particles.
The main idea of the nonstationary method of large particles consists of splitting the original system of equations with respect to physical processes. We will describe this method "on the physical level" as applied to the problem of motion of an ideal compressible gas.
The model medium is replaced by a system of fluid particles coinciding at a given moment with the cells of an Euler grid. The stationary solution of the problem (if any exists) is obtained as a result of stationing; therefore, the whole computational process consists of repeated time steps. The calculations on each time step proceed in three stages:
In the first stage, the variation of momentum and energy of the Lagrange elementary liquid volume (large particle) contained in an Euler cell over a time interval τ is considered (in this stage the boundary of the volume is displaced with respect to the initial position).
In the second stage, the movement of the particles through the boundaries of the Euler cells is modelled and the particles are redistributed in space.
In the third stage, the redistribution of mass, momentum and energy in space takes place (the variation of the flux parameters in the elementary Euler cell, obtained by returning the Lagrange volume to the initial position over the time interval τ, is determined).
Thus the evolution of the whole system over a time interval τ is calculated using the following splitting: first the variation of the internal state of the subsystems in the cells/large particles is considered assuming that they are frozen (Euler stage), and next the displacements of all particles, proportional to their velocities and to the time interval τ, without variation of the internal state of the subsystems, are considered, with subsequent recomputation of the computational grid in the initial position (Lagrange and final stages).

Consider the system of equations describing the motion of an ideal compressible gas:

$$\frac{\partial \rho u}{\partial t} + \operatorname{div}(\rho u\,\mathbf{u}) + \frac{\partial p}{\partial x} = 0,$$

$$\frac{\partial \rho v}{\partial t} + \operatorname{div}(\rho v\,\mathbf{u}) + \frac{\partial p}{\partial y} = 0,$$

$$\frac{\partial \rho}{\partial t} + \operatorname{div}(\rho\mathbf{u}) = 0, \tag{32.1}$$

$$\frac{\partial \rho E}{\partial t} + \operatorname{div}(\rho E\,\mathbf{u}) + \operatorname{div}(p\mathbf{u}) = 0,$$

$$p = p(\rho, J),\qquad J = E - \tfrac12|\mathbf{u}|^2.$$
Here u and v are the components of the velocity vector u, p is the pressure, ρ is the density, and J and E are the specific internal and total energies respectively.
Let the domain where the solution of (32.1) is to be found be covered by an Euler (fixed in space) computational grid consisting of rectangular cells with sides h₁ and h₂.
The solution of problem (32.1) consists of the following steps:
In the first (Eulerian) stage, only quantities related to the whole cell are varied, while the liquid is considered to be "instantaneously frozen". We find the solution of the system

$$\rho\frac{\partial u}{\partial t} + \frac{\partial p}{\partial x} = 0,\qquad \rho\frac{\partial v}{\partial t} + \frac{\partial p}{\partial y} = 0,$$

$$\rho\frac{\partial E}{\partial t} + \operatorname{div}(p\mathbf{u}) = 0,\qquad \frac{\partial\rho}{\partial t} = 0, \tag{32.2}$$

$$p = p(\rho, J),\qquad J = E - \tfrac12|\mathbf{u}|^2.$$
In the next (Lagrangian) stages, the motion of the mass flux (particles) ΔM through the boundaries of the Euler cells is modelled, the redistribution of mass, momentum and energy in space takes place, and the final fields of the Euler parameters are determined.
The equations of these stages may be combined formally and represented in the form

$$\frac{\partial \rho u}{\partial t} + \operatorname{div}(\rho u\,\mathbf{u}) = 0,$$

$$\frac{\partial \rho v}{\partial t} + \operatorname{div}(\rho v\,\mathbf{u}) = 0, \tag{32.3}$$

$$\frac{\partial \rho}{\partial t} + \operatorname{div}(\rho\mathbf{u}) = 0,$$

$$\frac{\partial \rho E}{\partial t} + \operatorname{div}(\rho E\,\mathbf{u}) = 0.$$

Various versions of the numerical schemes of the large particle method, along with some aspects of approximation and stability, are presented by BELOTSERKOVSKY and DAVYDOV [1978], BELOTSERKOVSKY [1984], DAVYDOV and SKOTNIKOV [1978] and others.
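The conservative redistribution performed in the transport stages (32.3) can be illustrated in one dimension by a donor-cell flux: whatever leaves a cell through a face enters its neighbour, so the cell totals are preserved exactly. This is a schematic sketch with invented data, not the scheme of the cited papers:

```python
def transport_step(q, u, dt, h):
    """Donor-cell (upwind) transport of a conserved cell quantity q.
    u holds face velocities, len(u) == len(q) + 1; periodic boundaries."""
    n = len(q)
    flux = [0.0]*(n + 1)
    for i in range(n + 1):
        ui = u[i]
        donor = (i - 1) % n if ui > 0 else i % n
        flux[i] = ui * q[donor]          # amount leaving the donor cell
    return [q[i] - dt/h*(flux[i+1] - flux[i]) for i in range(n)]

# cell masses; a uniform velocity shifts the profile while conserving the total
rho = [0.0, 1.0, 2.0, 1.0, 0.0]
u = [0.5]*6
rho1 = transport_step(rho, u, dt=0.1, h=1.0)
assert abs(sum(rho1) - sum(rho)) < 1e-12   # total mass is conserved exactly
```

The same flux accounting applied simultaneously to ρ, ρu, ρv and ρE gives the conservative redistribution of mass, momentum and energy described in the three-stage algorithm above.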
CHAPTER VIII

The Alternating Triangular Method and the Alternating Operator Method

The alternating direction method proposed by Peaceman, Douglas and Rachford is usually connected with the decomposition of the operator A into "one-dimensional" operators A_α, the sweep method being used for the solution of the equations at each fractional step. However (as was already noted in Section 26), the generalization of this method to problems with three spatial variables faces some difficulties. On the other hand, the requirement that the operators A_α be "one-dimensional" is not compulsory. Therefore it is of interest to split the operator A into operators that would allow an effective solution of the problem at each step and would keep the main advantages of the alternating direction method. Such a splitting may be the decomposition of the matrix operator A = A* into two matrices A₁ and A₂ with A₁ = A₂*, A₁ + A₂ = A. Writing the alternating direction schemes formally for this splitting, we obtain the schemes of the alternating triangular method, which was proposed and grounded for some problems of mathematical physics by SAMARSKII [1964a] and IL'IN [1965]. IL'IN [1966] also presented generalizations of the alternating direction methods where A₁ and A₂ are arbitrary matrices (in particular, triangular). These generalizations are called schemes of the alternating operator method.

33. The alternating triangular method

Let us consider the evolutionary problem

$$\partial\varphi/\partial t + A\varphi = f \quad\text{in } \Omega_t, \tag{33.1}$$

$$\varphi = g \quad\text{for } t = 0,$$

where Ω_t = {t: 0 < t < T} and A = A(t) is a square matrix representing a finite-dimensional approximation of the corresponding operator of the original problem in all variables except t. Operator A from (33.1) acts in a Hilbert space Φ with inner product (u, v) and norm ‖u‖ = (u, u)^{1/2}. It is assumed that A can be represented as the sum of two triangular positive-definite matrices:

$$A = A_1 + A_2,\qquad A_1 = \bigl(a^{(1)}_{ik}\bigr),\qquad A_2 = \bigl(a^{(2)}_{ik}\bigr),$$

$$a^{(1)}_{ik} = 0,\ k > i,\qquad a^{(2)}_{ik} = 0,\ k < i,\qquad a^{(1)}_{ii} + a^{(2)}_{ii} = a_{ii}, \tag{33.2}$$

$$(A_\alpha u, u) \ge C\|u\|^2,\qquad C = \text{const} > 0,\qquad \alpha = 1, 2. \tag{33.3}$$

If A is a symmetric matrix, then A₂ = A₁*, and we may take a^{(1)}_{ii} = a^{(2)}_{ii} = ½a_{ii}.


Let us write the scheme of the alternating triangular method (SAMARSKII [1964a]) as

$$\frac{\varphi^{j+1/2} - \varphi^j}{\tau} + \tfrac12\bigl(A_1^{j+1/2}\varphi^{j+1/2} + A_2^j\varphi^j\bigr) = \tfrac12 f^j,$$

$$\frac{\varphi^{j+1} - \varphi^{j+1/2}}{\tau} + \tfrac12\bigl(A_1^{j+1/2}\varphi^{j+1/2} + A_2^{j+1}\varphi^{j+1}\bigr) = \tfrac12 f^j, \tag{33.4}$$

$$\varphi^0 = g,$$

where

$$A_1^{j+1/2} = A_1(t_{j+1/2}),\qquad A_2^j = A_2(t_j),\qquad f^j = f(t_{j+1/2}),\qquad \tau = t_{j+1} - t_j.$$

It is easy to notice that to solve the first equation from (33.4) we need to invert the triangular matrix E + ½τA₁^{j+1/2}, and to solve the second equation we need to invert the triangular matrix E + ½τA₂^{j+1}. SAMARSKII [1964a] noted that this inversion, when the order of the matrix is more than five, will require fewer operations than solving problem (33.1) by the explicit two-layer scheme.
If conditions (33.3) are satisfied and A₂(t)φ(t) and dφ/dt ∈ C^{(1)}[0, T], then (33.4) has second-order accuracy in τ:

$$\|\varphi^j - \varphi(t_j)\| \le C\tau^2,\qquad j = 1, 2, \ldots,$$

where C is a positive constant independent of τ (SAMARSKII [1964a]).

REMARK 33.1. Scheme (33.4) and the above statement remain valid when A 1 and A 2
are arbitrary linear operators acting in a Hilbert space and satisfying conditions
(33.3) (SAMARSKII [1964a]).

REMARK 33.2. If the A_α are self-adjoint operators, then scheme (33.4) is a generalization of the alternating direction method proposed by Peaceman, Douglas and Rachford, and it may be considered as an alternating operator scheme.

REMARK 33.3. For the problem

$$\partial^2\varphi/\partial t^2 + A(t)\varphi = f \quad\text{in } \Omega_t,$$

$$\varphi = g_0 \quad\text{for } t = 0, \tag{33.5}$$

$$\partial\varphi/\partial t = g_1 \quad\text{for } t = 0,$$

SAMARSKII [1964a] also formulated a scheme of the alternating triangular method which has second-order accuracy.

REMARK 33.4. Since the solution of (33.4) may be carried out with explicit formulae,
schemes of the alternating triangular method are also called explicit alternating
direction schemes (IL'IN [1970], p. 109).
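The explicit character of the half-steps can be seen in a small numerical sketch (an illustrative 2×2 matrix, not from the text): A is split into a lower triangular A₁ and A₂ = A₁ᵀ, and each half-step of a scheme of the form (33.4) reduces to a forward or backward substitution, i.e. to explicit running formulas.

```python
# Alternating triangular scheme, sketched for A = A1 + A2, A1 lower
# triangular, A2 = A1^T (half of the diagonal in each part); marching in t
# approaches the steady state A*phi = f.
A = [[2.0, -1.0], [-1.0, 2.0]]
n = len(A)
A1 = [[(A[i][j] if j < i else (0.5*A[i][i] if j == i else 0.0))
       for j in range(n)] for i in range(n)]
A2 = [[A1[j][i] for j in range(n)] for i in range(n)]     # A2 = A1^T

f = [1.0, 1.0]        # steady state: A*phi = f  =>  phi = [1, 1]
tau = 0.1
phi = [0.0]*n

def mv(M, x):
    return [sum(M[i][k]*x[k] for k in range(n)) for i in range(n)]

for _ in range(400):
    # (E + tau/2 A1) phi_half = phi - tau/2 (A2 phi) + tau/2 f : forward sweep
    b = [phi[i] - 0.5*tau*mv(A2, phi)[i] + 0.5*tau*f[i] for i in range(n)]
    half = [0.0]*n
    for i in range(n):
        s = sum(0.5*tau*A1[i][k]*half[k] for k in range(i))
        half[i] = (b[i] - s)/(1.0 + 0.5*tau*A1[i][i])
    # (E + tau/2 A2) phi_new = phi_half - tau/2 (A1 phi_half) + tau/2 f : back sweep
    b = [half[i] - 0.5*tau*mv(A1, half)[i] + 0.5*tau*f[i] for i in range(n)]
    new = [0.0]*n
    for i in reversed(range(n)):
        s = sum(0.5*tau*A2[i][k]*new[k] for k in range(i+1, n))
        new[i] = (b[i] - s)/(1.0 + 0.5*tau*A2[i][i])
    phi = new

assert max(abs(phi[i] - 1.0) for i in range(n)) < 1e-6
```

No matrix is ever inverted explicitly: the triangular structure makes each half-step a single sweep over the unknowns, which is the point of Remark 33.4.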

34. The alternating operator method

We assume that operator A in (33.1) does not depend on t and can be represented as

$$A = A_1 + A_2,$$

where the only requirement on the A_α is positive semidefiniteness:

$$(A_\alpha u, u) \ge 0.$$
We write for (33.1) a scheme of the following kind (IL'IN [1966]):

$$\frac{\varphi^{j+1/2} - \varphi^j}{\tau_j} + \tfrac12\bigl(A_1\varphi^{j+1/2} + A_2\varphi^j\bigr) = \tfrac12 f^j,$$

$$\frac{\varphi^{j+1} - \varphi^{j+1/2}}{\tau_j} + \tfrac12 A_2\bigl(\varphi^{j+1} - \varphi^j\bigr) = \tfrac12 k_j\bigl(f^j - A_1\varphi^{j+1/2} - A_2\varphi^j\bigr), \tag{34.1}$$

$$\varphi^0 = g,\qquad j = 0, 1, \ldots,$$

where τ_j and k_j are parameters, f^j = f(t_{j+1/2}), and operator A and vector f^j are constructed taking into account the boundary conditions of the original problem approximated by (33.1). Formally, scheme (34.1) coincides for k_j = 0 with the Douglas–Rachford scheme and for k_j = 1 with the Peaceman–Rachford scheme.
After excluding φ^{j+1/2}, (34.1) is transformed to

$$\frac{\varphi^{j+1} - \varphi^j}{\tfrac12\tau_j(1 + k_j)} + A\,\frac{\varphi^{j+1} + k_j\varphi^j}{1 + k_j} + \tfrac14\tau_j^2\,A_1A_2\,\frac{\varphi^{j+1} - \varphi^j}{\tfrac12\tau_j(1 + k_j)} = f^j. \tag{34.2}$$

If, depending on the parameters τ_j and k_j, the time step for the original problem is chosen as

$$\Delta t_j = \tfrac12\tau_j(1 + k_j), \tag{34.3}$$

the order of approximation of (34.1) is equal to O(τ_j^{1+k_j}). In the case k_j = 0 it has a first-order approximation in Δt_j = ½τ_j, and in the case k_j = 1 it has a second-order approximation (with respect to Δt_j = τ_j).
Let us assume that the approximation error of both equations in (34.1) on the smooth solution of (33.1) tends to zero in the norm ‖·‖ for τ = max_j τ_j → 0. Then the solution φ^j of the difference scheme (34.1) converges to φ(t_j) in the norm

$$\|\varphi\|_{A_2} = \bigl(\|\varphi\|^2 + \|A_2\varphi\|^2\bigr)^{1/2}$$

for τ → 0 (IL'IN [1966]).
Note that we have not yet made any assumption concerning the structure of the operators A₁ and A₂. Further, the restrictions on the A_α in (34.1) are proved to be satisfied provided that they can be represented as in (33.2)–(33.3) (and so we obtain the schemes of the alternating triangular method).

REMARK 34.1. IL'IN [1966] noted the possibility of generalizing (34.1) for k_j = 1 to quasi-linear problems, the approximation error reaching the order O(τ²).

35. The generalized alternating operator method

Let the operator A in (33.1) be represented in the form

$$A = \sum_{\alpha=1}^n A_\alpha,$$

where the operators A_α are not necessarily related to differential operators with respect to some single variable (if, of course, any belong to the original problem).
The following scheme is a generalization of (34.1) (IL'IN [1966]):

$$\frac{\varphi^{j+1/n} - \varphi^j}{\tau_j} + A_1\varphi^{j+1/n} + \sum_{\alpha=2}^n A_\alpha\varphi^j = f^j,$$

$$\frac{\varphi^{j+k/n} - \varphi^{j+(k-1)/n}}{\tau_j} + A_k\bigl(\varphi^{j+k/n} - \varphi^j\bigr) = \beta_k\,\frac{\varphi^{j+(k-1)/n} - \varphi^j}{\tau_j}, \tag{35.1}$$

$$k = 2, \ldots, n,$$

where n is the number of fractional steps on the interval (t_j, t_{j+1}). After elimination of the intermediate values φ^{j+k/n}, we have

$$B\,\frac{\varphi^{j+1} - \varphi^j}{\tau_j} + A\varphi^j = f^j, \tag{35.2}$$

where

$$B = \prod_{\alpha=1}^{n}\bigl(E + \tau_jA_\alpha/\omega_j\bigr),\qquad \omega_j = \prod_{\alpha=1}^{n-1}\bigl(1 + \beta_\alpha\bigr).$$

With ω_j = 2, scheme (35.1) approximates (33.1) with error O(τ²). With β_α = 0 it coincides with the Douglas–Rachford stabilizing correction scheme, which has error O(τ).

36. The scheme of the alternating triangular method for the parabolic equation

Let us consider one of the concrete realizations of the alternating triangular method
(IL'IN [1965]) for the two-dimensional heat conduction equation
apl/at - div(D(x, y) grad o) = 0. (36.1)

We write the following five-point difference scheme:

$$\frac{\varphi^{j+1}_{k,l} - \varphi^j_{k,l}}{\Delta t_j} + \Lambda\varphi^{j+1}_{k,l} = 0. \tag{36.2}$$

Here and further

$$h_x = h_y = h,\qquad \Delta t_j = t_{j+1} - t_j,$$

$$\Lambda\varphi_{k,l} = \frac{1}{h^2}\bigl(2p_{k,l}\varphi_{k,l} - a_{k,l}\varphi_{k-1,l} - b_{k,l}\varphi_{k,l-1} - c_{k,l}\varphi_{k+1,l} - d_{k,l}\varphi_{k,l+1}\bigr), \tag{36.3}$$

where the coefficients a, b, c, d, p may be variable. We also suppose that the boundary conditions for (36.1) are taken into account in the coefficients of the difference equation.
A scheme of the alternating triangular method has the form

$$\frac{\varphi^{j+1/2}_{k,l} - \varphi^j_{k,l}}{\Delta t_j} + \frac{1}{2h^2}\bigl(p_{k,l}\varphi^{j+1/2}_{k,l} - a_{k,l}\varphi^{j+1/2}_{k-1,l} - b_{k,l}\varphi^{j+1/2}_{k,l-1}\bigr) + \frac{1}{2h^2}\bigl(p_{k,l}\varphi^{j}_{k,l} - c_{k,l}\varphi^{j}_{k+1,l} - d_{k,l}\varphi^{j}_{k,l+1}\bigr) = 0,$$

$$\frac{\varphi^{j+1}_{k,l} - \varphi^{j+1/2}_{k,l}}{\Delta t_j} + \frac{1}{2h^2}\bigl(p_{k,l}\varphi^{j+1/2}_{k,l} - a_{k,l}\varphi^{j+1/2}_{k-1,l} - b_{k,l}\varphi^{j+1/2}_{k,l-1}\bigr) + \frac{1}{2h^2}\bigl(p_{k,l}\varphi^{j+1}_{k,l} - c_{k,l}\varphi^{j+1}_{k+1,l} - d_{k,l}\varphi^{j+1}_{k,l+1}\bigr) = 0. \tag{36.4}$$

The condition of "spatial" stability with respect to the accumulation of errors on each time step has the form of the inequalities

$$1 + \frac{\Delta t_j}{4h^2}\,p_{k,l} \ge \frac{\Delta t_j}{2h^2}\bigl(|a_{k,l}| + |b_{k,l}|\bigr),\qquad 1 + \frac{\Delta t_j}{4h^2}\,p_{k,l} \ge \frac{\Delta t_j}{2h^2}\bigl(|c_{k,l}| + |d_{k,l}|\bigr), \tag{36.5}$$

which must hold at all grid points except, maybe, at some of them. If we introduce the operators A₁ and A₂,

$$A_1\varphi_{k,l} = \frac{1}{h^2}\bigl(p_{k,l}\varphi_{k,l} - a_{k,l}\varphi_{k-1,l} - b_{k,l}\varphi_{k,l-1}\bigr),\qquad A_2\varphi_{k,l} = \frac{1}{h^2}\bigl(p_{k,l}\varphi_{k,l} - c_{k,l}\varphi_{k+1,l} - d_{k,l}\varphi_{k,l+1}\bigr),$$

then (36.4) can be written as follows:


$$\frac{\varphi^{j+1/2}_{k,l} - \varphi^j_{k,l}}{\Delta t_j} + \tfrac12\bigl(A_1\varphi^{j+1/2}_{k,l} + A_2\varphi^j_{k,l}\bigr) = 0,$$

$$\frac{\varphi^{j+1}_{k,l} - \varphi^{j+1/2}_{k,l}}{\Delta t_j} + \tfrac12\bigl(A_1\varphi^{j+1/2}_{k,l} + A_2\varphi^{j+1}_{k,l}\bigr) = 0, \tag{36.6}$$

or

$$\bigl(E + \tfrac12\Delta t_j A_1\bigr)\varphi^{j+1/2} = \bigl(E - \tfrac12\Delta t_j A_2\bigr)\varphi^j,$$

$$\bigl(E + \tfrac12\Delta t_j A_2\bigr)\varphi^{j+1} = \bigl(E - \tfrac12\Delta t_j A_1\bigr)\varphi^{j+1/2}. \tag{36.7}$$

On "whole steps" scheme (36.6) has the form

$$\bigl(E + \tfrac12\Delta t_j A_1\bigr)\bigl(E + \tfrac12\Delta t_j A_2\bigr)\varphi^{j+1} = \bigl(E - \tfrac12\Delta t_j A_1\bigr)\bigl(E - \tfrac12\Delta t_j A_2\bigr)\varphi^j \tag{36.8}$$

or

$$\frac{\varphi^{j+1} - \varphi^j}{\Delta t_j} + \tfrac12 A\bigl(\varphi^{j+1} + \varphi^j\bigr) + \tfrac14\Delta t_j^2\,A_1A_2\,\frac{\varphi^{j+1} - \varphi^j}{\Delta t_j} = 0. \tag{36.9}$$
From (36.9) we conclude that (36.4) has second-order accuracy in τ (provided that the solution and the initial data are sufficiently smooth). If A and A₁A₂ are positive-semidefinite operators, then it follows from a theorem by DOUGLAS and PIERCY [1963] that all eigenvalues of the operator

$$T_j = \bigl(E + \tfrac12\Delta t_j A + \tfrac14\Delta t_j^2 A_1A_2\bigr)^{-1}\bigl(E - \tfrac12\Delta t_j A + \tfrac14\Delta t_j^2 A_1A_2\bigr),$$

which is the transition operator from φ^j to φ^{j+1}, are less than 1. This implies the stability and convergence of the scheme.
Note the simplicity of the numerical realization of (36.4): the computations on each half-step constitute an explicit scheme. The number of operations saved on one step is about 50 per cent. Scheme (36.4) may be generalized to multidimensional problems. In this case the operator Λ is split into two suitable operators A₁, A₂ in such a way that the computations on each half-step can be executed using an explicit scheme. The advantage of scheme (36.4) is more evident when the dimension of the problem increases.

REMARK 36.1. Besides the papers mentioned in this section, alternating triangular schemes (or "explicit alternating direction schemes") are also considered in papers by SAUL'EV [1960], IL'IN [1967], LEES [1961] and others. The alternating triangular method as an iterative method for solving systems of algebraic equations is thoroughly treated by SAMARSKII and NIKOLAEV [1978] and IL'IN [1970].
CHAPTER IX

Splitting Methods and Alternating Direction Methods as Iterative Methods for Stationary Problems

37. The stationing method: General concepts of the theory of iterative methods

The solution of many stationary problems with positive operators may be regarded as the limiting solution of a nonstationary problem for t → ∞. While solving a stationary problem by methods of asymptotic stationing we pay no attention to the intermediate values of the solution, as they are of no interest. In the case of a nonstationary problem these values do have a physical meaning. In general, this is precisely what unites and divides these two classes of problems.
Assume that we have a system of linear algebraic equations (representing, for example, the result of the approximation of some stationary problem in mathematical physics by the finite difference method)

$$A\varphi = f, \tag{37.1}$$

where

$$A > 0,\qquad \varphi \in \Phi,\qquad f \in F.$$
We consider instead of (37.1) the nonstationary problem

$$\partial\tilde\varphi/\partial t + A\tilde\varphi = f,$$

and represent φ, φ̃ and f in the form

$$\varphi = \sum_n \varphi_n u_n,\qquad \tilde\varphi = \sum_n \tilde\varphi_n u_n,\qquad f = \sum_n f_n u_n,$$

where

$$Au_n = \lambda_n u_n,\qquad A^*u_n^* = \lambda_n u_n^*,$$

$$\varphi_n = (\varphi, u_n^*),\qquad \tilde\varphi_n = (\tilde\varphi, u_n^*),\qquad f_n = (f, u_n^*),$$

and {u_n} and {u_n*} represent biorthogonal bases. Then, using well-known tools, we obtain, on the one hand, the problems for the Fourier coefficients

$$\lambda_n\varphi_n = f_n$$

and, on the other hand, the equations

$$\partial\tilde\varphi_n/\partial t + \lambda_n\tilde\varphi_n = f_n,\qquad \tilde\varphi_n(0) = 0.$$

Solving these problems we obtain

$$\varphi = \sum_n (f_n/\lambda_n)u_n,$$

$$\tilde\varphi = \sum_n (f_n/\lambda_n)\bigl(1 - e^{-\lambda_n t}\bigr)u_n.$$

Assuming that the spectrum of the operator A is real, we have λ_n > 0 (n = 1, 2, ...). This implies that

$$\lim_{t\to\infty} \tilde\varphi = \varphi.$$

When the operator of the stationary problem has a spectrum of arbitrary structure, such a simple and evident relation between the solutions of the two problems may not exist.
Of course, the nonstationary problem for the new function φ̃ may be solved using difference methods in t, for example,

$$\frac{\tilde\varphi^{j+1} - \tilde\varphi^j}{\tau} + A\tilde\varphi^j = f.$$

Then

$$\tilde\varphi^{j+1} = \tilde\varphi^j - \tau\bigl(A\tilde\varphi^j - f\bigr).$$

If it is our aim to solve the stationary problem, then, with a certain relationship between τ and the spectrum of A, we have

$$\lim_{j\to\infty} \tilde\varphi^j = \varphi.$$
The parameter τ may either depend on j or not.
While solving the stationary problem it is convenient to regard j not as the index of the time step but as that of the iterative step.
Here we come across one more peculiarity: in nonstationary problems the values of τ must be small enough to ensure the accuracy of the solution, while in stationary problems the optimal iterative parameters are chosen on the condition that the number of iterations should be minimal, and they may be relatively large.
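The distinction can be made concrete in a minimal sketch (an illustrative 2×2 symmetric positive-definite system, not from the text): the explicit iteration φ^{j+1} = φ^j − τ(Aφ^j − f) converges for 0 < τ < 2/λ_max, and the choice τ = 2/(λ_min + λ_max) minimizes the contraction factor. Here τ is chosen purely to reduce the iteration count, not for time accuracy:

```python
# stationing: iterate the explicit scheme until the residual of A*phi = f dies out
A = [[3.0, 1.0], [1.0, 2.0]]      # eigenvalues (5 +- sqrt(5))/2
f = [5.0, 5.0]
exact = [1.0, 2.0]                # since A.[1, 2] = [5, 5]

def matvec(M, x):
    return [sum(M[i][k]*x[k] for k in range(len(x))) for i in range(len(M))]

phi = [0.0, 0.0]
tau = 0.4                         # = 2/(lambda_min + lambda_max) = 2/5: optimal
for _ in range(200):
    r = [matvec(A, phi)[i] - f[i] for i in range(2)]
    phi = [phi[i] - tau*r[i] for i in range(2)]

assert max(abs(phi[i] - exact[i]) for i in range(2)) < 1e-8
```

With this τ the error is multiplied by max|1 − τλ_n| ≈ 0.447 per iteration; the intermediate iterates have no physical meaning and are simply discarded.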
The majority of iterative methods applied to solving linear systems may be united by the formula

$$B_j\,\frac{\varphi^{j+1} - \varphi^j}{\tau_j} = -\alpha\bigl(A\varphi^j - f\bigr), \tag{37.2}$$

where α is some positive number, {B_j} is a sequence of nonsingular matrices, and {τ_j} is a sequence of real parameters. If we introduce the notation H_j = ατ_jB_j^{-1}, then the process (37.2) may be written in the following equivalent form:

$$\varphi^{j+1} = \varphi^j - H_j\bigl(A\varphi^j - f\bigr). \tag{37.3}$$

Vectors ξ^j = Aφ^j − f are usually called the residual vectors of the iterative method (37.2), and vectors ε^j = φ^j − φ* (where φ* = A^{-1}f is the exact solution of (37.1)) are called the error vectors of this method. Subtracting the vector φ* from both sides of relationship (37.3) and replacing f with Aφ*, we obtain a relationship for the sequence of error vectors:

$$\varepsilon^{j+1} = T_j\varepsilon^j, \tag{37.4}$$

where the matrix

$$T_j = E - H_jA \tag{37.5}$$

is called the jth step operator of the iterative method (37.2). Multiplying (37.4) by the matrix A, we obtain a relationship for the sequence of residual vectors:

$$\xi^{j+1} = \bigl(E - AH_j\bigr)\xi^j. \tag{37.6}$$

We will call the iterative method (37.2) convergent if the sequence {φ^j} converges to the exact solution φ* of (37.1) for any initial vector. It is obvious that the convergence of the sequences {ε^j} and {ξ^j} to the zero vector for any ε⁰ and ξ⁰ (ξ⁰ = Aε⁰) is a necessary and sufficient convergence condition for the iterative method (37.2).
The iterative method (37.2) is called stationary if the matrix H_j does not depend on the iteration number (the operator T_j is then a constant matrix); otherwise it is called nonstationary. We will distinguish specially the class of cyclic iterative methods, which may be regarded both as stationary and as nonstationary iterative methods. A method is called cyclic if it has the property H_j = H_{j+S} for any j ≥ 0 and for some fixed S ≥ 1. It is easy to see that, combining each S successive iterations into one, we come to a stationary iterative process of the kind

$$\varphi^{j+1} = \varphi^j - H\bigl(A\varphi^j - f\bigr), \tag{37.7}$$

where H is determined by the equation

$$E - HA = \prod_{i=0}^{S-1}\bigl(E - H_iA\bigr). \tag{37.8}$$

On the other hand, the cyclic iterative methods, as originally stated, belong to the nonstationary class.
One of our major tasks in studying iterative methods will be the problem of optimization, i.e. the choice of the sequence of matrices {H_j} from a given class in order to obtain a more effective computational process.

38. Iterative algorithms

For the effective realization of method (37.2), the operator B_j should have a simpler structure as compared to A. In many cases of practical interest the operator B_j has the form

$$B_j = \prod_{i=1}^n \bigl(E + \tau_jB_i\bigr), \tag{38.1}$$

where the B_i are some matrices of the same order N as the matrix A. These matrices are chosen in such a way that the matrices (E + τ_jB_i) may be easily inverted, i.e. the inversion of the whole matrix B_j may be carried out more easily than the inversion of A. The matrices B_i are often chosen taking into account the representation (splitting) of the matrix A as the sum

$$A = \sum_{k=1}^m A_k. \tag{38.2}$$
First we take n = m = 2, τ_j = τ, and give the corresponding iterative algorithms for solving the system (37.1), which are constructed using the splitting and alternating direction schemes of the previous sections.
(1) We get the alternating direction method (PEACEMAN and RACHFORD [1955]) for α = 2, B_i = A_i. We obtain from (37.2) an algorithm of the kind

$$\bigl(E + \tau_jA_1\bigr)\bigl(E + \tau_jA_2\bigr)\frac{\varphi^{j+1} - \varphi^j}{\tau_j} = -2\bigl(A\varphi^j - f\bigr), \tag{38.3}$$

which may also be written, after simple transformations, in "fractional steps":

$$\frac{\varphi^{j+1/2} - \varphi^j}{\tau_j} + A_1\varphi^{j+1/2} + A_2\varphi^j = f,$$

$$\frac{\varphi^{j+1} - \varphi^{j+1/2}}{\tau_j} + A_1\varphi^{j+1/2} + A_2\varphi^{j+1} = f. \tag{38.4}$$
(2) We get the stabilizing correction scheme when α = 1, B_i = A_i (DOUGLAS and RACHFORD [1956]):

$$\bigl(E + \tau_jA_1\bigr)\bigl(E + \tau_jA_2\bigr)\frac{\varphi^{j+1} - \varphi^j}{\tau_j} = -\bigl(A\varphi^j - f\bigr), \tag{38.5}$$

which may be rewritten in the form

$$\frac{\varphi^{j+1/2} - \varphi^j}{\tau_j} + A_1\varphi^{j+1/2} + A_2\varphi^j = f,$$

$$\frac{\varphi^{j+1} - \varphi^{j+1/2}}{\tau_j} + A_2\bigl(\varphi^{j+1} - \varphi^j\bigr) = 0. \tag{38.6}$$
(3) The splitting (fractional steps) method for an arbitrary value m = n ≥ 2 may be represented in the form (MARCHUK and YANENKO [1966])

$$\frac{\varphi^{j+1/n} - \varphi^j}{\tau_j} + A_1\bigl(\varphi^{j+1/n} - \varphi^j\bigr) = -\Bigl(\sum_{k=1}^n A_k\varphi^j - f\Bigr),$$

$$\frac{\varphi^{j+k/n} - \varphi^{j+(k-1)/n}}{\tau_j} + A_k\bigl(\varphi^{j+k/n} - \varphi^j\bigr) = 0, \tag{38.7}$$

$$k = 2, \ldots, n,$$

or, equivalently,

$$B_j\,\frac{\varphi^{j+1} - \varphi^j}{\tau_j} = -\sigma_j\bigl(A\varphi^j - f\bigr), \tag{38.8}$$

where

$$B_j = \prod_{k=1}^n \bigl(E + \tau_jA_k\bigr), \tag{38.9}$$

and τ_j and σ_j are some iterative parameters.
The other iterative algorithms may be formulated in a similar way.
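Algorithm (38.4) can be sketched for the 2D Poisson problem, where A₁ and A₂ are the one-dimensional second-difference operators and each fractional step is a set of tridiagonal sweeps. This is an illustrative implementation with a manufactured right-hand side and an ad hoc parameter choice, not the tuned algorithm of the cited papers:

```python
import math

def thomas(a, b, c, d):
    """Solve a tridiagonal system; a: sub-, b: main-, c: super-diagonal."""
    n = len(d)
    cp, dp = [0.0]*n, [0.0]*n
    cp[0], dp[0] = c[0]/b[0], d[0]/b[0]
    for i in range(1, n):
        m = b[i] - a[i]*cp[i-1]
        cp[i] = c[i]/m
        dp[i] = (d[i] - a[i]*dp[i-1])/m
    x = [0.0]*n
    x[-1] = dp[-1]
    for i in range(n-2, -1, -1):
        x[i] = dp[i] - cp[i]*x[i+1]
    return x

N = 7; h = 1.0/(N + 1)

def apply_Ax(phi):   # A1: -d^2/dx^2, zero Dirichlet boundaries
    return [[(2*phi[i][j] - (phi[i-1][j] if i > 0 else 0.0)
              - (phi[i+1][j] if i < N-1 else 0.0))/h**2
             for j in range(N)] for i in range(N)]

def apply_Ay(phi):   # A2: -d^2/dy^2
    return [[(2*phi[i][j] - (phi[i][j-1] if j > 0 else 0.0)
              - (phi[i][j+1] if j < N-1 else 0.0))/h**2
             for j in range(N)] for i in range(N)]

# manufacture f = A*phi_star so the discrete solution is known exactly
phi_star = [[math.sin(math.pi*(i+1)*h)*math.sin(math.pi*(j+1)*h)
             for j in range(N)] for i in range(N)]
ax, ay = apply_Ax(phi_star), apply_Ay(phi_star)
f = [[ax[i][j] + ay[i][j] for j in range(N)] for i in range(N)]

tau = h/math.pi        # a reasonable single iteration parameter (illustrative)
sub = [-tau/h**2]*N; main = [1 + 2*tau/h**2]*N; sup = [-tau/h**2]*N

phi = [[0.0]*N for _ in range(N)]
for _ in range(100):
    a2p = apply_Ay(phi)
    half = [[0.0]*N for _ in range(N)]
    for j in range(N):   # x-sweeps: (E + tau*A1) phi_half = phi - tau*A2 phi + tau*f
        d = [phi[i][j] - tau*a2p[i][j] + tau*f[i][j] for i in range(N)]
        col = thomas(sub, main, sup, d)
        for i in range(N):
            half[i][j] = col[i]
    a1p = apply_Ax(half)
    for i in range(N):   # y-sweeps: (E + tau*A2) phi_new = half - tau*A1 half + tau*f
        d = [half[i][j] - tau*a1p[i][j] + tau*f[i][j] for j in range(N)]
        phi[i] = thomas(sub, main, sup, d)

err = max(abs(phi[i][j] - phi_star[i][j]) for i in range(N) for j in range(N))
assert err < 1e-8
```

Because the fixed point of the two sweeps is exactly the solution of Aφ = f, the iteration converges to the discrete solution; the cost per iteration is only a handful of tridiagonal solves.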

39. Acceleration of the convergence of iterative methods

The acceleration of the convergence of the formulated iterative algorithms is achieved either by a special choice of the values of the parameters τ_j and σ_j or by applying to these algorithms some adaptive accelerating procedure of the conjugate gradient type.
We consider, for example, the case n = m = 2 under the additional restriction A₁ = A₂*. Let

$$\inf_{z\in R^N}\frac{(A_iz, z)}{(z, z)} \ge \delta,\qquad \sup_{z\in R^N}\frac{\|A_iz\|^2}{(A_iz, z)} \le \Delta,\qquad i = 1, 2,$$

where δ and Δ are some positive constants. Then the effective value of the parameter τ in the stationary (τ_j ≡ τ) version of method (38.3) is taken from the formula τ = (δΔ)^{-1/2}. For this value of τ the eigenvalues of the matrix

$$B_\tau^{-1}A = \bigl(E + \tau A_2\bigr)^{-1}\bigl(E + \tau A_1\bigr)^{-1}A \tag{39.1}$$

are real, nonnegative and belong to the interval [a, b] with the boundaries

$$a = \frac{\delta}{1 + \sqrt{\delta/\Delta}},\qquad b = \tfrac12\sqrt{\delta\Delta}. \tag{39.2}$$
Chebyshev's semi-iterative method may therefore be applied for the acceleration of method (38.3) (VARGA [1962]):

$$B_\tau\bigl(u^{k+1} - u^k\bigr) = -\alpha_{k+1}\bigl(Au^k - f\bigr) + \beta_{k+1}B_\tau\bigl(u^k - u^{k-1}\bigr),\qquad k = 0, 1, 2, \ldots,$$

$$\beta_{k+1} = \begin{cases} 0, & k = 0,\\ C_{k-1}(\rho)/C_{k+1}(\rho), & k > 0,\end{cases}\qquad \alpha_{k+1} = \frac{2}{a+b}\bigl(1 + \beta_{k+1}\bigr), \tag{39.3}$$

where ρ = (b + a)/(b − a) and C_k is the Chebyshev polynomial of degree k.
For an arbitrary τ > 0 it is advisable to use the generalized conjugate gradient method to accelerate method (38.3) (MARCHUK and KUZNETSOV [1972], CONCUS, GOLUB and O'LEARY [1976]):

$$u^{k+1} = u^k - \frac{1}{q_k}\Bigl[B_\tau^{-1}\xi^k - e_{k-1}\bigl(u^k - u^{k-1}\bigr)\Bigr],\qquad \xi^k = Au^k - f,\qquad e_{-1} = 0,$$

where the iterative parameters q_k and e_{k-1} are computed, in the conjugate gradient manner, from inner products of the preconditioned residuals B_τ^{-1}ξ^k.
Other approaches to the acceleration of splitting methods are possible, for example various versions of the descent method or the generalized conjugate direction method.
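As a sketch of this idea, the factorized operator B_τ = (E + τA₁)(E + τA₂) with A₂ = A₁ᵀ is symmetric positive definite, so it can serve as a preconditioner inside a standard preconditioned conjugate gradient loop; the solve with B_τ is two triangular substitutions. The 2×2 system and the value of τ are illustrative, and this is the textbook PCG recurrence rather than the cited algorithm verbatim:

```python
# PCG with the splitting preconditioner B = (E + tau*A1)(E + tau*A2), A2 = A1^T
tau = 0.5
A = [[3.0, 1.0], [1.0, 2.0]]
f = [5.0, 5.0]                    # exact solution is [1, 2]
L = [[1.75, 0.0], [0.5, 1.5]]     # E + tau*A1 (A1 = strict lower + half diagonal)
U = [[1.75, 0.5], [0.0, 1.5]]     # E + tau*A2 = L^T

def matvec(M, x):
    return [M[0][0]*x[0] + M[0][1]*x[1], M[1][0]*x[0] + M[1][1]*x[1]]

def dot(x, y):
    return x[0]*y[0] + x[1]*y[1]

def apply_Binv(r):
    y0 = r[0]/L[0][0]                      # forward substitution with E + tau*A1
    y1 = (r[1] - L[1][0]*y0)/L[1][1]
    z1 = y1/U[1][1]                        # back substitution with E + tau*A2
    z0 = (y0 - U[0][1]*z1)/U[0][0]
    return [z0, z1]

phi = [0.0, 0.0]
r = [f[i] - matvec(A, phi)[i] for i in range(2)]
z = apply_Binv(r)
p = z[:]
rz = dot(r, z)
for _ in range(2):                # CG solves an n x n SPD system in <= n steps
    Ap = matvec(A, p)
    alpha = rz / dot(p, Ap)
    phi = [phi[i] + alpha*p[i] for i in range(2)]
    r = [r[i] - alpha*Ap[i] for i in range(2)]
    z = apply_Binv(r)
    rz_new = dot(r, z)
    p = [z[i] + (rz_new/rz)*p[i] for i in range(2)]
    rz = rz_new

assert abs(phi[0] - 1.0) < 1e-10 and abs(phi[1] - 2.0) < 1e-10
```

Since B = L Lᵀ is symmetric positive definite, the standard PCG convergence theory applies, and the splitting structure keeps each preconditioning solve as cheap as one explicit step.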
PART 2

Methods for Studying the Convergence of Splitting and Alternating Direction Schemes
Before we discuss the theory of convergence of splitting and alternating direction methods in detail, let us recall once more the sequence of stages we adhere to in this paper in solving approximately the problems of mathematical physics.
First of all, we approximate the initial problem in all independent variables (to be definite, we will call them "spatial") except for the time variable t. Finite difference methods, finite element methods, projection methods, etc. can be used here. In the case of a stationary problem we obtain a general system of linear algebraic equations

$$A\varphi = f, \tag{1}$$

and in the nonstationary case we get the following system of equations:

$$\partial\varphi/\partial t + A\varphi = f \quad\text{in } \Omega_t,\qquad \varphi = g \quad\text{for } t = 0. \tag{2}$$
System (1) can be solved by use of the well-known methods for solving linear algebraic equations, including iterative methods based on splitting the operator A. At this stage we must prove the convergence of the applied iterative algorithms.
For the nonstationary problem (2) we can proceed as follows: either discretize the whole problem, or preliminarily split the system ∂φ/∂t + Aφ = f into a system of differential equations of the type ∂φ_α/∂t + A_αφ_α = f_α (α = 1, ..., n), and then obtain a completely discretized problem using a difference scheme (possibly with additional splitting of the operators A_α). At the next stage of solving the problem we solve the systems of algebraic equations thus obtained (of type (1)). Iterative methods based on the splitting of the problem can be used at this stage. Note that it is possible to approximate the nonstationary problem (2) with the help of the finite element method to get a matrix problem of type (1); to solve it, the splitting methods are applicable.
As we see, the first stage of solving the problem belongs to a different part of applied mathematics, and it will therefore not be discussed in this paper. We will study the convergence of iterative methods for solving system (1) by splitting the problem. Besides, we are going to consider the aspects of grounding the transition from the matrix evolutionary problem (2) to the systems of algebraic equations. Suppose we make a direct transition from this problem to the systems of difference equations by splitting methods. To ground this stage we must study two problems: the approximation of the obtained scheme and its stability. When these are established, the convergence of the difference scheme's solution to the exact solution of problem (2) takes place (see Convergence Theorem 3.1 in the Introduction).
If we preliminarily split the matrix evolutionary equation (2) into the system

$$\partial\varphi_\alpha/\partial t + A_\alpha\varphi_\alpha = f_\alpha,\qquad \alpha = 1, \ldots, n, \tag{3}$$

it is necessary to find out in what sense the solutions {φ_α} approximate the exact solution φ, i.e. to ground this stage of solving the problem. The study of the further stages of solving each of the equations (3) is based, as mentioned above, on the studies of approximation and stability. Hence, to prove the convergence of splitting methods it is necessary to know how to study the following problems:
(1) the transition from system (2) to the splitting system (3);
(2) the approximation of splitting schemes;
(3) the stability of splitting schemes;
(4) the convergence of iterative methods for solving system (1), using the splitting technique.
Note, first of all, the methods to study approximation. These are, on the whole,
methods which are well developed in the theory of finite difference methods
(expansion in Taylor series, etc.). It should be taken into account here that in the
splitting methods it is not necessary that the solution of the scheme approximate
the solution of problem (2) on the fractional steps. Therefore, the approximation is
often checked for the "integer" equations of the scheme (i.e. after excluding the
"intermediate" equations of the scheme). Note one particular method that is
frequently used to establish approximation in the splitting methods: after the scheme is
obtained at the integer steps (and after a possible expansion of the transition
operator in powers of the time step), it is compared with one of the well-studied
schemes, e.g. the implicit scheme of first-order accuracy, the Crank-Nicolson
second-order scheme, etc. Frequently, this comparison allows one to prove that the
order of accuracy of the splitting scheme is equal to that of one of these schemes,
which means that the approximation of the scheme under study is established.
This method has been used in Part I for establishing the approximation orders of the
schemes considered there. Taking the above notes into account we come to the
conclusion that the first, the third, and the fourth of the abovementioned problems
are specific in the theory of splitting and alternating direction methods. In this part,
we will focus our attention on the description of approaches to the solution
of these problems.
CHAPTER X

Convergence Studies of the Splitting Schemes by Use of the Fourier Method (Spectral Method)
One of the general approaches to convergence studies of the splitting schemes is the
use of the Fourier method (spectral method). Despite the number of restrictions of
this method, it is widely used both for analyzing the stability and convergence of
various difference schemes and for convergence studies of iterative algorithms.
Consider the main concepts of this method.

40. General statement of the Fourier method

Assume that we solve the stationary problem (1) or the nonstationary problem (2)
with the help of one of the splitting schemes described in Part 1, which can be
reduced to the form:
$\varphi^{j+1} = T\varphi^j + \tau F^j$, $\varphi^0 = g$. (40.1)
According to Chapter IX, scheme (40.1) should be treated as an iterative solution
process for the stationary problem (1), and as a difference scheme for the
nonstationary problem (2). Many of the schemes described in Chapters II-IX
belong to this class of schemes. (The transition from schemes with fractional steps to
schemes of type (40.1) is usually carried out by exclusion of the intermediate
equations.)
The most important problem here is the convergence of the approximate solution,
obtained by scheme (40.1), to the exact solution of problem (1) or (2). The con-
vergence of pJ to the exact solution of the stationary problem (1) is considered as
the convergence of the iterative process (40.1), while the convergence to the exact
solution of the nonstationary problem (2) is to be regarded as the convergence of the
approximate solution of the difference scheme (40.1) when the time step T is being
decreased and, in general, may be included in operator T. To prove the convergence
in the latter case, it is necessary to study the approximation and stability of scheme
(40.1).
Assume that the approximation of (2) by scheme (40.1) is established. As
mentioned above, this can be done by classical methods that are developed in the

theory of finite difference methods. To prove the convergence of the approximate


solution (40.1) to the exact solution qp it is necessary to establish the stability of
scheme (40.1). One of the methods for studying the stability of scheme (40.1) is the
Fourier method or the spectral method based on the analysis of the spectrum of the
transition operator T. At the same time this method allows to study the convergence
of the iterative process (40.1) to the solution of the stationary problem (1). Let us
formulate this method here.
Assume that the transition operator $T$ is a linear operator acting in a certain
Hilbert space $\Phi$ with domain $D(T) = \Phi$, and $F^j, g, \varphi^j \in \Phi$.
Along with the operator $T$, we introduce the adjoint operator $T^*$ and consider two
eigenvalue problems:
$T\psi = \lambda\psi$, (40.2)
$T^*\psi^* = \bar\lambda\psi^*$. (40.3)
Assume that (40.2) and (40.3) determine two systems of eigenvalues $\{\lambda_n\}$ and $\{\lambda_n^*\}$
and two corresponding systems of eigenfunctions $\{\psi_n\}$, $\{\psi_n^*\}$, complete in $\Phi$ and
orthonormalized in such a way that
$(\psi_n, \psi_m^*) = \delta_{nm}$. (40.4)
Here, $(\cdot,\cdot)$ denotes the inner product defined in $\Phi$.


Represent the functions $\varphi^j$, $F^j$ and $g$ in the form of Fourier series:
$\varphi^j = \sum_n \varphi_n^j \psi_n$, $F^j = \sum_n F_n^j \psi_n$, $g = \sum_n g_n \psi_n$, (40.5)
where $\varphi_n^j$, $F_n^j$, $g_n$ are the Fourier coefficients determined by
$\varphi_n^j = (\varphi^j, \psi_n^*)$, $F_n^j = (F^j, \psi_n^*)$, $g_n = (g, \psi_n^*)$.
From (40.1) and (40.2) we obtain in the usual way the equations for the Fourier
coefficients:
$\varphi_n^{j+1} = \lambda_n\varphi_n^j + \tau F_n^j$, $\varphi_n^0 = g_n$. (40.6)
In order to study the stability of difference schemes of type (40.1), von Neumann
introduced the so-called spectral criterion, the essence of which is given in the
Introduction. If for each Fourier coefficient $\varphi_n^j$ the following relationship holds:
$|\varphi_n^j| \le C_1|g_n| + C_2|F_n|$, $n = 1, 2, \ldots$, (40.7)
where $C_1$, $C_2$ are positive constants, then the difference scheme (40.1) is declared to
be stable (or computationally stable).
Let us see what conditions must be imposed on the parameters of the difference
scheme (40.1) to satisfy inequality (40.7). Successively excluding unknown
quantities from (40.6) we obtain
$\varphi_n^j = \lambda_n^j g_n + \tau \sum_{i=1}^{j} \lambda_n^{j-i} F_n^{i-1}$.
Hence it follows that
$|\varphi_n^j| \le |\lambda_n|^j |g_n| + \tau \sum_{i=1}^{j} |\lambda_n|^{j-i} |F_n|$, (40.8)
where $|F_n| = \max_i |F_n^i|$.
The analysis of relationship (40.8) shows that the stability criterion will be
satisfied if
$|\lambda_n| \le 1$, $n = 1, 2, \ldots$. (40.9)
Obviously, inequality (40.8) transforms, in this case, into
$|\varphi_n^j| \le |g_n| + \tau j |F_n|$. (40.10)
If we solve the nonstationary problem (2) with the help of scheme (40.1) on a finite
time interval, then, as a rule, $\tau j \le \mathrm{const}$. Hence, it follows from (40.10) that inequality
(40.7) holds.
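As a quick numerical illustration, the recursion (40.6) and the bound (40.10) can be checked directly; the eigenvalues, time step and data below are hypothetical, chosen only so that (40.9) is satisfied:

```python
import numpy as np

# Iterate the Fourier-coefficient recursion (40.6),
#   phi^{j+1}_n = lam_n * phi^j_n + tau * F_n,
# and verify estimate (40.10): |phi^j_n| <= |g_n| + tau*j*|F_n| when |lam_n| <= 1.
tau = 0.01
lam = np.array([1.0, 0.95, -0.5, 0.2])   # spectrum of T, all |lam_n| <= 1
g   = np.array([1.0, -2.0, 0.5, 3.0])    # Fourier coefficients of the initial data
F   = np.array([0.3, 0.1, -0.7, 0.2])    # Fourier coefficients of the source

phi = g.copy()
for j in range(1, 201):
    phi = lam * phi + tau * F                 # recursion (40.6)
    bound = np.abs(g) + tau * j * np.abs(F)   # right-hand side of (40.10)
    assert np.all(np.abs(phi) <= bound + 1e-12)
print("estimate (40.10) holds on all steps")
```

Flipping one eigenvalue to a modulus larger than one makes the assertion fail after enough steps, which is exactly the content of the criterion (40.9).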
Thus, under condition (40.9) the difference scheme (40.1) is stable according to von
Neumann. It follows from the study of approximation and stability that the
approximate solution (40.1) converges to the exact solution (p of the nonstationary
problem (2) (see the Introduction for Convergence Theorem 3.1).
It is easily seen that the condition $|\lambda_n| \le 1$ is also the condition for the convergence of
the iterative process (40.1) to the solution of the stationary problem (1).
Note that, in solving nonstationary problems (2) with the help of scheme (40.1), the
stability in the sense of (40.7) may also be established under the condition that
$|\lambda_n| \le 1 + C_0\tau$, $C_0 = \mathrm{const} > 0$, (40.11)
which is usually called von Neumann's necessary condition (see GODUNOV and
RYABENKII [1967]). This condition allows a growth of the round-off errors which
is exponential in time.
The spectral criterion was already used to ground splitting schemes in the
very first papers on splitting methods (see PEACEMAN and RACHFORD [1955],
DOUGLAS [1955, 1962], DOUGLAS and RACHFORD [1956]). It was also used in papers
by BIRKHOFF and VARGA [1959], PEARCY [1962], DOUGLAS and PEARCY [1963],
DOUGLAS and GUNN [1964], MARCHUK and YANENKO [1966], DYAKONOV [1962e],
MARCHUK [1968, 1969] and others.
The spectral criterion is sometimes called harmonic. This term was introduced for
studying the stability of the splitting schemes by YANENKO ([1967], p. 16), who
formulated it in a somewhat different form though the essence remained the same.
Let us illustrate this with an example of a homogeneous equation with constant
coefficients. Let
$\partial u/\partial t = P(D)u$ (40.12)
be an equation with constant coefficients, where $u$ is a scalar function of $x_1, \ldots, x_m$
and $t$, and
$P(D) = \sum a_{i_1 \ldots i_m} D_1^{i_1} \cdots D_m^{i_m}$ (40.13)
is a polynomial in the differentiation operators $D_i = \partial/\partial x_i$.
Consider a harmonic function
$u = u_0 e^{\omega t + \mathrm{i}(k,x)}$, $\omega = \mathrm{const}$, $k \in \mathbb{R}^m$. (40.14)
For (40.14) to be a solution it is necessary and sufficient that $\omega$ and $k$ satisfy the
relationship
$\omega = P(\mathrm{i}k)$ (40.15)
(the so-called dispersion or characteristic equation).
Let
$(u^{j+1} - u^j)/\tau = P(\Lambda_h)u^j$ (40.16)
be an explicit homogeneous approximation to (40.12). Here $\Lambda_h$ is a certain
approximation to $D$, for example,
$\Lambda_i = (T_i - E)/h_i$,
where
$T_i u(x_1, \ldots, x_m) = u(x_1, \ldots, x_i + h_i, \ldots, x_m)$, $Eu = u$.
Then (40.14) can be a solution of (40.16) if the difference dispersion equation
$(e^{\omega\tau} - 1)/\tau = P(\lambda_h(\mathrm{i}k))$, $\lambda_i = (e^{\mathrm{i}k_i h_i} - 1)/h_i$, (40.17)
holds.
Let $\rho = e^{\omega\tau}$, where $\omega$ is determined by the dispersion equation (40.17). Then the
condition for the stability of scheme (40.16) will be the following:
$|\rho| \le 1$. (40.18)
As we can see, this value precisely determines the stability of scheme (40.16).
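By way of an illustrative instance of this criterion (not an example taken from the text), consider the explicit scheme for the heat equation $u_t = u_{xx}$ with the standard second difference; its growth coefficient is the textbook expression $\rho(k) = 1 - 4(\tau/h^2)\sin^2(kh/2)$, so $|\rho| \le 1$ exactly when $\tau \le h^2/2$:

```python
import numpy as np

# Growth coefficient rho(k) for the explicit heat scheme, checked over all
# wave numbers resolvable on a grid of spacing h; stability boundary is
# tau = h**2/2 (here 0.005).
h = 0.1
k = np.linspace(-np.pi / h, np.pi / h, 1001)   # resolvable wave numbers

for tau, expected_stable in [(0.004, True), (0.008, False)]:
    rho = 1.0 - 4.0 * (tau / h**2) * np.sin(k * h / 2.0)**2
    stable = np.all(np.abs(rho) <= 1.0)         # criterion (40.18)
    assert stable == expected_stable
print("explicit heat scheme stable iff tau <= h^2/2")
```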
Such a method for studying the stability has been applied in YANENKO'S
[1967] monograph not only to (40.16) but also to three-layer schemes and to
fractional steps schemes, where the difference dispersion equation is written in terms
of the coefficient p.
Let us show that this method for analyzing the stability, applied to scheme (40.1)
with $F^j = 0$, brings us to the earlier formulated spectral criterion.
Let problems (40.2) and (40.3) determine, as before, two complete systems of
eigenfunctions $\{\psi_k\}$, $\{\psi_k^*\}$, orthonormalized according to (40.4). Consider the
harmonic function
$\varphi^j = \varphi_0 e^{\omega\tau j}\psi_k$, (40.19)
where $k$ is a fixed integer and $\varphi_0 = \mathrm{const}$. Then (40.19) may be the solution of scheme
(40.1) with $F^j = 0$ under the conditions
$g = \varphi_0\psi_k$ and $e^{\omega\tau} = \lambda_k$,
where $\lambda_k$ is the eigenvalue of the operator $T$ corresponding to $\psi_k$.
If $\rho = e^{\omega\tau}$, then the stability condition (40.18) transforms to $|\lambda_k| \le 1$, i.e. coincides
with (40.9).
Hence the harmonic stability analysis is nothing but the spectral analysis
described above. As YANENKO ([1967], p. 16) notes, the harmonic stability analysis
is very convenient in practice.
It may be useful to note that there exist a number of strong restrictions on applying
the Fourier method to ground splitting schemes. First of all, it is required that the
operators $T$ and $T^*$ have a complete system of eigenfunctions. Besides, in using the
spectral approach for the convergence studies of the splitting schemes it is necessary
to find the eigenvalue of the operator $T$ with maximum absolute value, or its
upper estimate, as part of the numerical algorithm for solving (40.1). Despite these
restrictions, however, the Fourier method still plays an exceptionally important
role in applications.
In the following two sections we will demonstrate the method's application for
grounding splitting schemes in both stationary and nonstationary problems.

41. The Fourier method and the convergence studies of splitting schemes for
stationary problems

Consider the stationary problem in the form:
$A\varphi = f$, (41.1)
where
$A = \sum_{\alpha=1}^{n} A_\alpha$, (41.2)
the $A_\alpha$ are operators of a simple structure acting in a Hilbert space $\Phi$, $f$ is
a given element of $\Phi$ and $\varphi$ is the solution to be found.
To solve problem (41.1) consider the method of successive approximations in the
form:
$\prod_{\alpha=1}^{n}(E + \tfrac{1}{2}\tau_\alpha A_\alpha)\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\varphi^j = f$, (41.3)

where $\tau$ and $\tau_\alpha$ are arbitrary relaxation parameters, $\tau, \tau_\alpha > 0$. The difference equation
(41.3) was formulated by DYAKONOV [1971d], DOUGLAS and GUNN [1964] and
SAMARSKII [1977] (assuming that $\tau_\alpha = \tau$). MARCHUK and YANENKO [1966] considered
(41.3) as the scheme for constructing various difference methods.
Obviously, scheme (41.3) is a special case of the general universal algorithm (see
MARCHUK et al. [1981]):
$B\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\varphi^j = f$, (41.4)
where $B$ is some "stabilizing" operator; in the given case:
$B = \prod_{\alpha=1}^{n}(E + \tfrac{1}{2}\tau_\alpha A_\alpha)$.

The realization scheme for each step of the iterative process (41.3) may be
represented as
$(E + \tfrac{1}{2}\tau_1 A_1)\xi^{j+1/n} = -F^j$,
$(E + \tfrac{1}{2}\tau_2 A_2)\xi^{j+2/n} = \xi^{j+1/n}$,
$\ldots$ (41.5)
$(E + \tfrac{1}{2}\tau_n A_n)\xi^{j+1} = \xi^{j+(n-1)/n}$,
$\varphi^{j+1} = \varphi^j + \tau\xi^{j+1}$,
where $F^j$ is the residual of the relaxation process, determined by the formula
$F^j = A\varphi^j - f$. (41.6)
It follows from the analysis of system (41.5) that each equation has a simple structure
which implies the inversion of "simple" operators $(E + \tfrac{1}{2}\tau_\alpha A_\alpha)$.
Let us demonstrate the application of the Fourier method to study the
convergence of (41.3).
Assume that the operators $A_\alpha$, $\alpha = 1, \ldots, n$, commute and generate a common basis
of eigenfunctions. Then we consider $n$ eigenvalue problems:
$A_\alpha u = \lambda^{(\alpha)}u$, $\alpha = 1, 2, \ldots, n$, (41.7)
and the corresponding adjoint problems
$A_\alpha^* u^* = \lambda^{(\alpha)}u^*$, $\alpha = 1, 2, \ldots, n$. (41.8)
Let the spectral problems (41.7) and (41.8) determine $n$ complete orthonormal
systems of functions $\{u_{\alpha k}\}$ and $\{u_{\alpha k}^*\}$. Assume that the function $f$ in (41.3) can be
expanded into a Fourier series in the eigenfunctions of problem (41.7):
$f = \sum_{k=1}^{\infty} f_k\, u_{1k}u_{2k}\cdots u_{nk}$. (41.9)
We will also find the solution of (41.3) in the form of the series
$\varphi^j = \sum_{k=1}^{\infty} \varphi_k^j\, u_{1k}u_{2k}\cdots u_{nk}$. (41.10)
The following statement is valid: The iterative process (41.3) converges if
(1) the operators $A_\alpha$, $\alpha = 1, \ldots, n$, commute and generate a common basis of
eigenfunctions;
(2) the eigenvalues $\{\lambda_k^{(\alpha)}\}$ of problem (41.7) are positive;
(3) $\tau, \tau_\alpha > 0$.
Indeed, substitute (41.9) and (41.10) into (41.3) and take the inner product of the
result with $u_{1k}^* u_{2k}^* \cdots u_{nk}^*$. Then we get the simplest equation for the Fourier
coefficients $\varphi_k^j$:
$\prod_{\alpha=1}^{n}(1 + \tfrac{1}{2}\tau_\alpha\lambda_k^{(\alpha)})\,\dfrac{\varphi_k^{j+1} - \varphi_k^j}{\tau} + \lambda_k\varphi_k^j = f_k$, $k = 1, 2, \ldots$, (41.11)
where
$\lambda_k = \sum_{\alpha=1}^{n} \lambda_k^{(\alpha)}$, (41.12)
and $\lambda_k^{(\alpha)}$ is the eigenvalue of the operator $A_\alpha$ corresponding to $u_{\alpha k}$.
Solving equation (41.11) with respect to the unknown $\varphi_k^{j+1}$, we have
$\varphi_k^{j+1} = \dfrac{\prod_{\alpha=1}^{n}(1 + \tfrac{1}{2}\tau_\alpha\lambda_k^{(\alpha)}) - \tau\lambda_k}{\prod_{\alpha=1}^{n}(1 + \tfrac{1}{2}\tau_\alpha\lambda_k^{(\alpha)})}\,\varphi_k^j + \dfrac{\tau f_k}{\prod_{\alpha=1}^{n}(1 + \tfrac{1}{2}\tau_\alpha\lambda_k^{(\alpha)})}$. (41.13)

Assume that the spectrum $\{\lambda_k^{(\alpha)}\}$ is positive. Then it is easy to formulate the
convergence criterion for the iterative process (41.13). To this end, consider the value
$q_k = \dfrac{\prod_{\alpha=1}^{n}(1 + \tfrac{1}{2}\tau_\alpha\lambda_k^{(\alpha)}) - \tau\lambda_k}{\prod_{\alpha=1}^{n}(1 + \tfrac{1}{2}\tau_\alpha\lambda_k^{(\alpha)})}$. (41.14)
As follows from Section 40, the convergence criterion for the iterative process (41.13)
(and therefore for (41.3) as well) is the condition
$|q_k| < 1$. (41.15)
This means that the following inequality must hold:
$\prod_{\alpha=1}^{n}(1 + \tfrac{1}{2}\tau_\alpha\lambda_k^{(\alpha)}) > \tfrac{1}{2}\tau\lambda_k$. (41.16)
Obviously, the condition
$\tau_\alpha \ge \tau > 0$ (41.17)
is sufficient for (41.16) to hold. Thus, it follows that the iterative process (41.3)
converges under the above conditions.
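A small numerical sketch of this criterion: for hypothetical positive spectra $\lambda_k^{(\alpha)}$ and parameters satisfying (41.17), the convergence factors $q_k$ of (41.14) indeed lie strictly inside the unit interval:

```python
import numpy as np

# Check the convergence factor q_k of (41.14) for the iterative process (41.3),
# under illustrative positive spectra lam[alpha, k] and parameters satisfying
# (41.17): tau_alpha >= tau > 0.
rng = np.random.default_rng(0)
n, K = 3, 50
lam = rng.uniform(0.1, 40.0, size=(n, K))    # positive eigenvalues of A_alpha
tau = 0.05
tau_alpha = np.array([0.05, 0.1, 0.2])       # tau_alpha >= tau

prod = np.prod(1.0 + 0.5 * tau_alpha[:, None] * lam, axis=0)
lam_k = lam.sum(axis=0)                       # lambda_k of (41.12)
q = (prod - tau * lam_k) / prod               # q_k of (41.14)
assert np.all(np.abs(q) < 1.0)                # criterion (41.15)
print("max |q_k| =", np.abs(q).max())
```

The product in the denominator dominates $\tfrac{1}{2}\tau\lambda_k$ precisely because $\tau_\alpha \ge \tau$, which is the content of the sufficient condition (41.17).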
Analogous problems involving the convergence study of splitting schemes by use
of the Fourier method have been considered in papers by SAMARSKII and ANDREEV
[1964], SAMARSKII and NIKOLAEV ([1978], Chapters X-XI) and SAMARSKII ([1977],
Chapter X, Sections 3-4).
The problem of choosing the parameters $\tau_\alpha$ and $\tau$ of the relaxation process (41.3) will
be considered separately in Chapter XIII.

42. The Fourier method and the grounding of the splitting schemes for
nonstationary problems

Consider now nonstationary problems. In choosing a suitable splitting method for


such problems we keep in mind, above all, the requirement of effective realization
and absolute stability of the schemes.
The approximation problem is likewise important, since it is tightly connected
with the effectiveness of the numerical algorithm (see SAMARSKII [1977], p. 476).
Indeed, if first-order schemes are used to approximate the solution of the
nonstationary equation, small steps in time and space should be used in
computations to obtain the necessary precision. However, it is often reasonable to
apply schemes of second and higher orders. Large steps in time and space may be
chosen in this case, essentially saving computer time. It is clear, though, that the use
of second-order schemes sometimes leads to insurmountable difficulties, since
difference schemes of second order turn out to be unstable in some cases.
In this section we will demonstrate three applications of the Fourier method for
grounding known splitting schemes for the nonstationary problem
$\partial\varphi/\partial t + A\varphi = f$ in $\Omega_t$,
$\varphi = g$ for $t = 0$. (42.1)
(1) Divide the domain $\Omega_t$ into subintervals $\tau = \Delta t$ and replace (42.1) with the
difference equation:
$\prod_{\alpha=1}^{n}(E + \tfrac{1}{2}\tau A_\alpha)\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\varphi^j = f^{j+1/2}$, $\varphi^0 = g$. (42.2)
Here the $A_\alpha$ are operators with $\sum_{\alpha=1}^{n} A_\alpha = A$. For the sake of simplicity consider
$A$ and $A_\alpha$ to be independent of $t$.
Scheme (42.2) approximates problem (42.1) with second-order accuracy in $\tau$ when
the solution of (42.1) is smooth enough. If the operators $A_\alpha$ commute and generate a
common basis of eigenfunctions, the eigenvalues $\{\lambda_k^{(\alpha)}\}$ of which are nonnegative,
then scheme (42.2) is absolutely stable according to von Neumann.
Indeed, let us show that problem (42.1) is approximated by scheme (42.2).
Consider the Taylor series
$\varphi^{j+1} = \varphi^{j+1/2} + \tfrac{1}{2}\tau\varphi_t^{j+1/2} + \tfrac{1}{8}\tau^2\varphi_{tt}^{j+1/2} + \cdots$,
$\varphi^j = \varphi^{j+1/2} - \tfrac{1}{2}\tau\varphi_t^{j+1/2} + \tfrac{1}{8}\tau^2\varphi_{tt}^{j+1/2} - \cdots$. (42.3)
Substituting (42.3) into (42.2), we obtain
$\partial\varphi/\partial t + A\varphi = f + O(\tau^2)$ for $t = t_{j+1/2}$. (42.4)
Hence the difference scheme (42.2) approximates problem (42.1) with order $O(\tau^2)$.
To study its stability, reduce scheme (42.2) to the form
$\varphi^{j+1} = T\varphi^j + \tau F^j$, (42.5)

where
$T = E - \tau B^{-1}A$,
$B = \prod_{\alpha=1}^{n}(E + \tfrac{1}{2}\tau A_\alpha)$,
$F^j = B^{-1}f^{j+1/2}$.
As follows from Section 40, the following condition is sufficient for the stability of
scheme (42.5):
$|\lambda_k(T)| \le 1$, $k = 1, 2, \ldots$, (42.6)
where the $\lambda_k(T)$ are the eigenvalues of the operator $T$, which, as can easily be seen,
under the above conditions coincide with the values $q_k$ of (41.14) for $\tau_\alpha = \tau$. Since
$\lambda_k^{(\alpha)} \ge 0$ and $\tau > 0$, inequality (42.6) holds.
Hence, scheme (42.2) is stable.
The realization scheme of (42.2) coincides with (41.5).
(2) Consider now another scheme of the splitting method, introduced by
DYAKONOV [1962f] for the solution of problem (42.1):
$\prod_{\alpha=1}^{n}(E + \tfrac{1}{2}\tau A_\alpha)\varphi^{j+1} = \prod_{\alpha=1}^{n}(E - \tfrac{1}{2}\tau A_\alpha)\varphi^j + \tau f^{j+1/2}$. (42.7)
The realization scheme of this algorithm is as follows:
$\varphi^{j+\alpha/2n} = (E - \tfrac{1}{2}\tau A_\alpha)\varphi^{j+(\alpha-1)/2n}$, $\alpha = 1, \ldots, n-1$,
$\varphi^{j+1/2} = (E - \tfrac{1}{2}\tau A_n)\varphi^{j+(n-1)/2n} + \tau f^{j+1/2}$,
$(E + \tfrac{1}{2}\tau A_\alpha)\varphi^{j+(n+\alpha)/2n} = \varphi^{j+(n+\alpha-1)/2n}$, $\alpha = 1, \ldots, n$.
The Fourier method may also be used to ground this splitting scheme.
The Fourier method may be also used to ground this splitting scheme.
Scheme (42.7) is a second-order approximation in $\tau$ of problem (42.1) on
sufficiently smooth solutions. If the operators $A_\alpha$ commute and generate a common
basis of eigenfunctions, the eigenvalues $\{\lambda_n^{(\alpha)}\}$ of which are nonnegative, then scheme
(42.7) is absolutely stable.
To prove this statement consider the Taylor series (42.3) and substitute them into
(42.7). As a result we get (42.4). Therefore, scheme (42.7) approximates (42.1)
with order $O(\tau^2)$.
To study the stability of (42.7) let us introduce the following notations:
$B = \prod_{\alpha=1}^{n}(E + \tfrac{1}{2}\tau A_\alpha)$, $C = \prod_{\alpha=1}^{n}(E - \tfrac{1}{2}\tau A_\alpha)$.
Then (42.7) can be reduced to the form (40.1):
$\varphi^{j+1} = T\varphi^j + F^j$, (42.8)
where
$T = B^{-1}C$, $F^j = \tau B^{-1}f^{j+1/2}$.
According to Section 40, the following condition is sufficient for the stability of
scheme (42.8):
$|\lambda_n(T)| \le 1$, (42.9)
where the $\lambda_n(T)$ are the eigenvalues of the transition operator $T$.
Because of the restrictions imposed on the $A_\alpha$ it is easy to prove that
$\lambda_n(T) = \prod_{\alpha=1}^{n}\dfrac{1 - \tfrac{1}{2}\tau\lambda_n^{(\alpha)}}{1 + \tfrac{1}{2}\tau\lambda_n^{(\alpha)}}$. (42.10)
Hence, condition (42.9) holds if $\lambda_n^{(\alpha)} \ge 0$, i.e. the splitting scheme (42.7) is absolutely
stable.
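This can be checked on a minimal hypothetical instance with diagonal (hence commuting) $A_1$, $A_2$; each factor of (42.10) has modulus at most one whenever the spectra are nonnegative, for any $\tau > 0$:

```python
import numpy as np

# Verify (42.10) for the Dyakonov scheme (42.7): with diagonal A_1, A_2 the
# transition operator T = B^{-1} C has the predicted product eigenvalues,
# all of modulus <= 1. Spectra and tau below are illustrative.
tau = 0.7
lam1 = np.array([0.0, 1.0, 10.0, 100.0])     # nonnegative spectrum of A_1
lam2 = np.array([0.5, 2.0, 5.0, 50.0])       # nonnegative spectrum of A_2

B = np.diag((1 + 0.5 * tau * lam1) * (1 + 0.5 * tau * lam2))
C = np.diag((1 - 0.5 * tau * lam1) * (1 - 0.5 * tau * lam2))
T = np.linalg.solve(B, C)                     # T = B^{-1} C

mu = np.linalg.eigvals(T)
predicted = ((1 - 0.5 * tau * lam1) / (1 + 0.5 * tau * lam1)
             * (1 - 0.5 * tau * lam2) / (1 + 0.5 * tau * lam2))
assert np.allclose(np.sort(mu.real), np.sort(predicted))
assert np.all(np.abs(mu) <= 1.0 + 1e-12)      # condition (42.9)
```

Note that the factors approach $-1$ for large $\tau\lambda_n^{(\alpha)}$, so stiff components are damped only weakly; the scheme is absolutely stable but not strongly damping.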

(3) Let us demonstrate the Fourier method's application for grounding the
splitting scheme considered by MARCHUK [1968] for the solution of the non-
stationary problem (42.1).
First of all find the solution of (42.1) in the interval $t_j < t \le t_{j+1/2}$, where
$t_{j+1/2} = t_j + \tfrac{1}{2}\tau$, by the splitting method
$\dfrac{\varphi^{j+1/2n} - \varphi^j}{\tfrac{1}{2}\tau} + A_1\varphi^{j+1/2n} = f^{j+1/2}$,
$\dfrac{\varphi^{j+2/2n} - \varphi^{j+1/2n}}{\tfrac{1}{2}\tau} + A_2\varphi^{j+2/2n} = 0$,
$\ldots$ (42.11)
$\dfrac{\varphi^{j+1/2} - \varphi^{j+(n-1)/2n}}{\tfrac{1}{2}\tau} + A_n\varphi^{j+1/2} = 0$.
If the solution of (42.11) is obtained, we find the solution of (42.1) for $t = t_{j+1}$, using
a difference scheme of second-order approximation:
$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\varphi^{j+1/2} = f^{j+1/2}$. (42.12)
The realization scheme of (42.11)-(42.12) has the form:
$(E + \tfrac{1}{2}\tau A_1)(E + \tfrac{1}{2}\tau A_2)\cdots(E + \tfrac{1}{2}\tau A_n)\varphi^{j+1/2} = \varphi^j + \tfrac{1}{2}\tau f^{j+1/2}$, (42.13)
$\varphi^{j+1} = \varphi^j + \tau(f^{j+1/2} - A\varphi^{j+1/2})$.
Scheme (42.13) approximates (42.1) with second-order accuracy in $\tau$ when the
solution of (42.1) is smooth enough. If the operators $A_\alpha$ commute and generate a
common basis of eigenfunctions, the eigenvalues of which are nonnegative, then
scheme (42.13) is absolutely stable.

To show this consider the first equation of (42.13). This is a first-order scheme for
(42.1) in the interval $t_j < t \le t_{j+1/2}$ for the class of smooth solutions.
Indeed,
$(E + \tfrac{1}{2}\tau A_1)(E + \tfrac{1}{2}\tau A_2)\cdots(E + \tfrac{1}{2}\tau A_n) = E + \tfrac{1}{2}\tau A + \tau^2 R$, (42.14)
where
$R = \tfrac{1}{4}(A_1A_2 + \cdots) + \tfrac{1}{8}\tau(A_1A_2A_3 + \cdots) + \cdots + (\tfrac{1}{2})^n\tau^{n-2}A_1A_2\cdots A_n$.
It follows from (42.13) and (42.14) that
$\dfrac{\varphi^{j+1/2} - \varphi^j}{\tfrac{1}{2}\tau} + A\varphi^{j+1/2} = f^{j+1/2} + O(\tau)$. (42.15)
Correcting the solution with the help of (42.12) we get:
$\varphi^{j+1} = \varphi^j - \tau(A\varphi^{j+1/2} - f^{j+1/2})$. (42.16)
Substituting $\varphi^{j+1/2}$ of (42.15) into (42.16) we get the following equation:
$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\,\dfrac{\varphi^{j+1} + \varphi^j}{2} = f^{j+1/2} + O(\tau^2)$. (42.17)
Hence, scheme (42.13) coincides with the well-known Crank-Nicolson scheme to
within $O(\tau^2)$. It thus follows that (42.13) approximates (42.1) with second-
order accuracy in $\tau$.
To study the stability we exclude the intermediate steps and express $\varphi^{j+1}$ in terms
of $\varphi^j$. Denote
$B = \prod_{\alpha=1}^{n}(E + \tfrac{1}{2}\tau A_\alpha)$.
Under the above conditions $B$ is invertible and from (42.13) we obtain
$\varphi^{j+1/2} = B^{-1}\varphi^j + \tfrac{1}{2}\tau B^{-1}f^{j+1/2}$. (42.18)
Substituting (42.18) into (42.16) we obtain:
$\varphi^{j+1} = \Pi\varphi^j + F^j$, (42.19)
where
$\Pi = E - \tau AB^{-1}$, $F^j = \tau(E - \tfrac{1}{2}\tau AB^{-1})f^{j+1/2}$.
It follows from Section 40 that the following condition is sufficient for the stability
of scheme (42.19):
$|\lambda_k(\Pi)| \le 1$, $k = 1, 2, \ldots$, (42.20)
where the $\lambda_k(\Pi)$ are the eigenvalues of the operator $\Pi$.
When the operators $A_\alpha$ commute and generate a common basis of eigenfunctions,
the operator $\Pi$ coincides with the operator $T$ of (42.5) and its eigenvalues coincide
with the values $q_k$ defined in (41.14) for $\tau_\alpha = \tau$. Then, if $\lambda_k^{(\alpha)} \ge 0$, inequality (42.20)
holds and (42.13) is absolutely stable.
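A sketch of scheme (42.13) on a small hypothetical diagonal system (commuting $A_1$, $A_2$ with nonnegative spectra, $f = 0$; all data illustrative) confirms the second order: halving $\tau$ reduces the error roughly by a factor of four:

```python
import numpy as np

# Predictor-corrector scheme (42.13) for d(phi)/dt + A*phi = 0, A = A1 + A2,
# compared against the exact solution exp(-A*t)*g on [0, 1].
A1 = np.diag([0.5, 1.0, 2.0])
A2 = np.diag([0.3, 0.7, 1.5])
A = A1 + A2
g = np.array([1.0, -1.0, 2.0])
T_final = 1.0

def run(tau):
    B = (np.eye(3) + 0.5 * tau * A1) @ (np.eye(3) + 0.5 * tau * A2)
    phi = g.copy()
    for _ in range(int(round(T_final / tau))):
        half = np.linalg.solve(B, phi)          # predictor: phi^{j+1/2}, f = 0
        phi = phi - tau * (A @ half)            # corrector (42.12) with f = 0
    return phi

exact = np.exp(-np.diag(A) * T_final) * g
e1 = np.linalg.norm(run(0.1) - exact)
e2 = np.linalg.norm(run(0.05) - exact)
ratio = e1 / e2
assert 3.0 < ratio < 5.0                        # consistent with O(tau^2)
```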

It is evident that the Fourier method plays an important role in grounding the
splitting schemes. It should be noted, however, that the use of this method not only
presupposes the general requirements (for example, that the eigenfunctions of the
operator be complete) which have been considered in Section 40, but also some
additional restrictions in each of the cases. For example, in Sections 41 and 42 it was
required that the operators $A_\alpha$ be commutative and have a nonnegative spectrum.
These and other restrictions make it difficult to use the Fourier method for studying
the convergence of the splitting schemes.
In the following chapter, we will consider the a priori estimates method which
allows, in some cases, to ground a whole class of splitting schemes under weaker
restrictions.
CHAPTER XI

The A Priori Estimates Method and the Convergence Studies of the Splitting Schemes
In studying the convergence of the splitting schemes, it is important to know in what


way the definition of a scheme's stability is chosen. The a priori estimates method is
based on the stability of the scheme's solution in an energy norm (see the
Introduction). The idea of this method is to obtain an a priori estimate for the
approximate solution, which helps to prove the scheme's stability and, if there
is an approximation, its convergence as well.
In this chapter we will consider the application of this method for grounding
various splitting schemes.

43. The simplest a priori estimates

Let us solve the nonstationary problem:
$\partial\varphi/\partial t + A\varphi = f$ in $\Omega_t$,
$\varphi = g$ for $t = 0$,
using one of the splitting schemes (see Part 1). We write such a scheme in the form:
$L^\tau(\varphi^\tau) = f^\tau$, (43.1)
where $L^\tau$ is an operator acting in the grid space $\Phi_\tau$, $f^\tau \in \Phi_\tau$, and $\varphi^\tau$ is the solution
to be found. (In the general case, the right-hand side $f^\tau$ in (43.1) may belong to
another grid space $F_\tau$.)
We assume that the approximation of (43.1) is established. Following the
Introduction, we will regard the difference scheme (43.1) as stable if for any
parameter $\tau$ characterizing the approximation the following inequality holds:
$\|\varphi^\tau\|_{\Phi_\tau} \le c\|f^\tau\|_{F_\tau}$, (43.2)
where $c$ is a positive constant independent of $\tau$ and $\|\cdot\|$ denotes the norm in the
corresponding space.
The definition of stability in this form is in close connection with the notion of
correctness of problems with a continuous argument. It may be noted that the

stability establishes the continuous dependence of the solution on the initial data in
the case of a problem, (43.1), with a discrete argument. Indeed, as follows from
estimate (43.2), small variations of the input data $f^\tau$ lead to small variations of the
solution $\varphi^\tau$.
Thus, the definition of stability in the form (43.2) ties the solution itself down to
a priori information about the input data of the problem. To analyze the stability of
many of the splitting schemes this definition is more convenient and informative
than Neumann's definition of stability.
First of all, we consider the simplest a priori estimates obtained on the basis of this
definition and show their application for studying the convergence of various
splitting schemes.
While solving nonstationary problems by the splitting method (fractional steps
method) the difference scheme may often be reduced to the form (40.1):
$\varphi^{j+1} = T\varphi^j + \tau F^j$, $\varphi^0 = g$, (43.3)
where $T$ is the transition operator from one time layer to another, acting in some
Hilbert space $\Phi$, and $g, F^j \in \Phi$.
To analyze approximation and stability it is convenient to write the scheme in the
form (43.3) (especially when approximation and even stability do not occur on each
of the intermediate steps).
It was already noted in Section 40 that many splitting schemes are reduced to the
form (43.3) by excluding fractional (intermediate) steps.
Assume that the approximation of the exact problem by scheme (43.3) has been
established. Let us see what conditions should be imposed on the operator T to
obtain the stability estimate (43.2).
The formal solution of (43.3) has the form
$\varphi^j = T^j g + \tau\sum_{i=1}^{j} T^{j-i}F^{i-1}$, $0 \le j\tau \le \mathrm{const}$, (43.4)
which implies that
$\|\varphi^j\| \le \|T\|^j\|g\| + \tau(1 + \|T\| + \cdots + \|T\|^{j-1})\max_i\|F^i\|$. (43.5)
Assuming that
$\|T\| \le 1$, (43.6)
we obtain an estimate in the form (43.2) which follows from (43.5):
$\|\varphi^\tau\| \le \|g\| + \mathrm{const}\cdot\max_i\|F^i\|$, (43.7)
where
$\|\varphi^\tau\| = \max_j\|\varphi^j\|$.

Hence the following statement is valid.


If (43.3) approximates the evolutionary problem
$\partial\varphi/\partial t + A\varphi = f$ in $\Omega_t$,
$\varphi = g$ for $t = 0$,
and $\|T\| \le 1$, then (43.3) is absolutely stable and the approximate solution $\varphi^\tau$
converges to the exact solution $\varphi$ of the problem.
(This follows directly from the above considerations as well as from Convergence
Theorem 3.1.)
Naturally, condition (43.6) is a sufficient stability condition. More sophisticated
criteria may be found via the norms of powers of the transition operator, $\|T^j\|$. But
a weakened condition makes a constructive procedure of establishing the stability
criteria more difficult. It is the sufficient condition (43.6) which is used, as a rule, in
practical calculations.
Consider the case where the operator $T$ is self-adjoint, i.e. $T^* = T$, and the problem
$T\psi = \lambda\psi$ determines a complete system of eigenfunctions $\{\psi_n\}$ corresponding to
the eigenvalues $\{\lambda_n\}$. Then any element $\varphi$ of $\Phi$ may be represented as a Fourier
series.
In this case
$\|T\|^2 = \sup_{\varphi\ne 0}\dfrac{(T\varphi, T\varphi)}{(\varphi, \varphi)} = \max_n|\lambda_n(T)|^2$
and the stability condition (43.6) converts into:
$|\lambda_n(T)|^2 \le 1$, $n = 1, 2, \ldots$. (43.8)
Thus we get condition (40.9). This means that, in the case of a self-adjoint operator
$T$, the sufficient stability conditions according to von Neumann and those according
to definition (43.2) coincide.
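For finite-dimensional self-adjoint $T$ this coincidence is immediate to verify numerically (the symmetric matrix below is an illustrative random instance):

```python
import numpy as np

# For a self-adjoint T the operator norm equals the spectral radius,
# so the norm criterion (43.6) and the Neumann criterion (43.8) coincide.
rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
T = 0.5 * (M + M.T)                      # T* = T

op_norm = np.linalg.norm(T, 2)           # sup ||T phi|| / ||phi||
spec_radius = np.abs(np.linalg.eigvalsh(T)).max()
assert np.isclose(op_norm, spec_radius)
```

For non-self-adjoint $T$ the operator norm can strictly exceed the spectral radius, which is why the spectral criterion alone is only a necessary condition in general.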
For example, let us formulate the convergence conditions for the homogeneous
two-level splitting scheme:
$\dfrac{\varphi^{j+k/p} - \varphi^{j+(k-1)/p}}{\tau} = A_{k,k-1}\varphi^{j+(k-1)/p} + A_{k,k}\varphi^{j+k/p}$, $k = 1, \ldots, p$. (43.9)
This scheme was applied by YANENKO ([1967], p. 149) for solving the evolutionary
problem for $f = 0$. The operator $A$ is assumed to be split as follows:
$A = \sum_{k=1}^{p} A^{(k)}$, $A^{(k)} = A_{k,k-1} + A_{k,k}$, $k = 1, \ldots, p$, (43.10)
where the $A_{i,j}$ are operators acting in $\Phi$.
Assume that the operators $A_{i,j}$ are commutative and that the following conditions
hold:
$\tau\|A_{i,i}\| < 1$, $i = 1, \ldots, p$,
$\Bigl\|\prod_{i=1}^{p}(E - \tau A_{i,i})^{-1}(E + \tau A_{i,i-1})\Bigr\| \le 1$.
Then (43.9) approximates the problem
$\partial\varphi/\partial t + A\varphi = 0$ in $\Omega_t$,
$\varphi = g$ for $t = 0$
with first-order accuracy in $\tau$ and is absolutely stable. The approximate solution of
scheme (43.9) converges to the exact solution of the problem with the order $O(\tau)$.
To prove this we construct the corresponding scheme with integer steps.
Equations (43.9) may be rewritten as follows:
$A_k\varphi^{j+k/p} = B_k\varphi^{j+(k-1)/p}$, (43.11)
where
$A_k = E - \tau A_{k,k}$, $B_k = E + \tau A_{k,k-1}$, $k = 1, \ldots, p$.
Excluding the intermediate equations and using the fact that the operators $A_k$ and
$B_k$ are commutative we obtain the scheme with integer steps
$A_O\varphi^{j+1} = B_O\varphi^j$, (43.12)
where
$A_O = A_1A_2\cdots A_p$, $B_O = B_1B_2\cdots B_p$.
Under the conditions $\tau\|A_{i,i}\| < 1$, $i = 1, \ldots, p$, the operator $A_O$ is invertible and it
follows from (43.12) that
$\varphi^{j+1} = A_O^{-1}B_O\varphi^j$. (43.13)
Thus, by excluding the fractional steps the splitting scheme (43.9) is reduced to
(43.13) with integer steps, which has the form (43.3).
Consider the scheme's approximation. Expanding the operators $A_O$ and $B_O$ in
powers of $\tau$ we obtain
$A_O = E - \tau(A_{1,1} + A_{2,2} + \cdots + A_{p,p})$
$\qquad + \tau^2(A_{1,1}A_{2,2} + \cdots + A_{p-1,p-1}A_{p,p})$
$\qquad - \cdots + (-1)^p\tau^p A_{1,1}A_{2,2}\cdots A_{p,p}$, (43.14)
$B_O = E + \tau(A_{1,0} + A_{2,1} + \cdots + A_{p,p-1})$
$\qquad + \tau^2(A_{1,0}A_{2,1} + \cdots + A_{p-1,p-2}A_{p,p-1}) + \cdots + \tau^p A_{1,0}A_{2,1}\cdots A_{p,p-1}$.
In view of (43.14), scheme (43.12) can be transformed into
$\varphi^{j+1} = \varphi^j + \tau\sum_{k=1}^{p}(A_{k,k-1}\varphi^j + A_{k,k}\varphi^j) + \tau^2 F^j$, (43.15)
where
$F^j = \sum_{k<l}(A_{k,k-1}A_{l,l-1}\varphi^j - A_{k,k}A_{l,l}\varphi^j)$
$\qquad + \tau\sum_{i<k<l}(A_{i,i-1}A_{k,k-1}A_{l,l-1}\varphi^j + A_{i,i}A_{k,k}A_{l,l}\varphi^j)$
$\qquad + \cdots + \tau^{p-2}(A_{1,0}A_{2,1}\cdots A_{p,p-1}\varphi^j + (-1)^{p-1}A_{1,1}A_{2,2}\cdots A_{p,p}\varphi^j)$.

Substituting the Taylor series
$\varphi^{j+1} = \varphi^j + \tau\varphi_t^j + \tfrac{1}{2}\tau^2\varphi_{tt}^j + \cdots$
into (43.15) and taking (43.10) into account, we obtain:
$\partial\varphi/\partial t + A\varphi = O(\tau)$ for $t = t_j$.
This means that the difference scheme (43.9) approximates the problem
$\partial\varphi/\partial t + A\varphi = 0$ in $\Omega_t$,
$\varphi = g$ for $t = 0$
with first-order accuracy in $\tau$.
To study the stability of (43.9) we write it in the form (43.13). It follows from (43.6)
that the sufficient stability condition for (43.9) will be
$\|A_O^{-1}B_O\| \le 1$, (43.16)
which holds under the above assumptions.
From the study of approximation and stability the convergence of the two-level
fractional steps scheme (43.9) follows, with the order $O(\tau)$.
In the general case of noncommutative operators $A_{i,j}$ the convergence criteria for
(43.9) are analogous, although the proof is more complicated. In order to prove the
approximation we have to modify the exclusion method in the case of non-
commutative operators. The idea of such a modification belongs to YANENKO
[1964a] and was worked out in detail by YANENKO [1962] and BOYARINTSEV [1966].
YANENKO ([1967], p. 151) proved the convergence of the splitting method (43.9) in
the case of commutative operators $A_{i,j}$ under weak conditions on the operators
$A_O$ and $B_O$.

44. A priori estimates for splitting schemes of the type $A_j\varphi^{j+1} = B_j\varphi^j + \tau_j f^j$

Consider a more general splitting scheme for the evolutionary problem
$\partial\varphi/\partial t + A\varphi = f$ in $\Omega_t$,
$\varphi = g$ for $t = 0$,
which can be written in the form:
$A_j\varphi^{j+1} = B_j\varphi^j + \tau_j f^j$, $\varphi^0 = g$, (44.1)
where $A_j$ and $B_j$ are operators acting in $\Phi$, depending, in general, on the number $j$
of the time step $\tau_j$, and $g, f^j \in \Phi$.
Excluding fractional steps (see, for example, (43.11)), we often come to schemes of
type (44.1).
Assume that (44.1) approximates the evolutionary problem
$\partial\varphi/\partial t + A\varphi = f$ in $\Omega_t$,
$\varphi = g$ for $t = 0$
with some order of accuracy in $\tau$, that the operators $A_j$ are invertible for each $j$ and
that the following conditions are satisfied:
$\|A_j^{-1}\| \le M$, $\|A_j^{-1}B_j\| \le 1 + N\tau$, (44.2)
where $M$ and $N$ are positive constants and
$\tau = \max_j\tau_j$, $0 \le j\tau \le T_0 < \infty$.

Then (44.1) is absolutely stable, and the following a priori estimate holds for its
solution φ^j:

‖φ^j‖_Φ ≤ e^{N T_0} ( ‖g‖_Φ + M Σ_{i=1}^{j} τ_{i−1} ‖f^{i−1}‖_Φ ).   (44.3)

The approximate solution of scheme (44.1) converges to the exact solution of the
problem.
Indeed, the operators A_j being invertible, (44.1) may be written in the form

φ^{j+1} = A_j^{-1} B_j φ^j + τ_j A_j^{-1} f^j,   φ^0 = g.   (44.4)

Thus, under conditions (44.2) we obtain successively

‖φ^1‖_Φ ≤ (1 + Nτ) ‖g‖_Φ + τ_0 M ‖f^0‖_Φ,

‖φ^2‖_Φ ≤ (1 + Nτ) ‖φ^1‖_Φ + τ_1 M ‖f^1‖_Φ
        ≤ (1 + Nτ)² ‖g‖_Φ + τ_0 M (1 + Nτ) ‖f^0‖_Φ + τ_1 M ‖f^1‖_Φ,

etc., i.e.

‖φ^j‖_Φ ≤ (1 + Nτ)^j ‖g‖_Φ + M Σ_{i=1}^{j} τ_{i−1} (1 + Nτ)^{j−i} ‖f^{i−1}‖_Φ

for j = 1, 2, .... Since (1 + Nτ)^j ≤ e^{Nτj} ≤ e^{N T_0}, we obtain

‖φ^j‖_Φ ≤ e^{N T_0} ( ‖g‖_Φ + M Σ_{i=1}^{j} τ_{i−1} ‖f^{i−1}‖_Φ ).

Scheme (44.1) is, therefore, absolutely stable and the a priori estimate (44.3) is valid.
The scheme's convergence results from its approximation and stability.
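The recursion above is easy to check numerically. The sketch below (an illustrative setup, not from the original text) runs scheme (44.1) with A_j = E + τA and B_j = E for a symmetric positive-definite matrix A, so that conditions (44.2) hold with M = 1 and N = 0, and verifies the bound (44.3) at every step:

```python
import numpy as np

# Symmetric positive-definite A: then ||(E + tau*A)^{-1}|| <= 1, so the
# conditions (44.2) hold with M = 1 and N = 0 (hence e^{N*T0} = 1).
rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.standard_normal((5, 5)))[0]
A = Q @ np.diag([0.5, 1.0, 2.0, 3.0, 5.0]) @ Q.T

tau, steps = 0.01, 200
g = rng.standard_normal(5)
f = [np.sin(0.1 * j) * np.ones(5) for j in range(steps)]

Aj = np.eye(5) + tau * A              # A_j of scheme (44.1); B_j = E
phi, rhs_sum = g.copy(), 0.0
for j in range(steps):
    # A_j phi^{j+1} = phi^j + tau * f^j
    phi = np.linalg.solve(Aj, phi + tau * f[j])
    rhs_sum += tau * np.linalg.norm(f[j])
    # a priori estimate (44.3) with M = 1, N = 0
    assert np.linalg.norm(phi) <= np.linalg.norm(g) + rhs_sum + 1e-12
```

The bound never fails here because each step contracts in the Euclidean norm, which is exactly what (44.2) guarantees in the abstract setting.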
It follows that the homogeneous two-layer scheme (43.12) converges under the
conditions

‖A_0^{-1}‖_Φ ≤ M,   ‖A_0^{-1} B_0‖_Φ ≤ 1 + Nτ.   (44.5)

If in (44.1)

A_j = E,   B_j = T_j,   f^j = F^j,   τ_j = τ,

we obtain the classical scheme (43.3).
If the approximation conditions are satisfied, (43.3) converges when

‖T_j‖_Φ ≤ 1 + Nτ,   N = const > 0.   (44.6)

This follows from the above considerations.
At the same time the a priori estimate (44.3) holds with M = 1 and τ_{i−1} = τ.
The condition

‖T_j‖_Φ ≤ c,   c = const > 0,   (44.7)

is more general than (44.6).
Replacing (44.6) with (44.7) we easily get the a priori estimate

‖φ^j‖_Φ ≤ c^j ‖g‖_Φ + τ Σ_{i=1}^{j} c^{j−i} ‖f^{i−1}‖_Φ.   (44.8)

Note that in the case where the self-adjoint operators A_j and T_j = A_j^{-1} B_j have a
complete basis of eigenfunctions, the convergence conditions (44.2) are equivalent to
the following conditions:

max_n |λ_n(A_j^{-1})| ≤ M,

max_n |λ_n(T_j)| ≤ 1 + Nτ,   (44.9)

which are called Neumann's stability conditions (see DYAKONOV [1971d],
RICHTMYER and MORTON [1972]). Here λ_n(A) denotes, as earlier, an eigenvalue of
the operator A.
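As a concrete illustration of (44.9) (an added sketch, not part of the original text), take the explicit scheme for the one-dimensional heat equation, T_j = E − τΛ with Λ the three-point difference Laplacian. Since Λ is symmetric, condition (44.9) with N = 0 reduces to max_n |λ_n(E − τΛ)| ≤ 1, which for fine grids requires τ ≤ h²/2:

```python
import numpy as np

def step_matrix_spectral_radius(n, h, tau):
    # Three-point difference Laplacian (zero Dirichlet data), T = E - tau*Lambda
    Lam = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
    T = np.eye(n) - tau * Lam
    # T is symmetric, so condition (44.9) is a bound on max |lambda_n(T)|
    return np.max(np.abs(np.linalg.eigvalsh(T)))

n, h = 49, 1.0 / 50
assert step_matrix_spectral_radius(n, h, 0.4 * h**2) <= 1.0   # tau < h^2/2: stable
assert step_matrix_spectral_radius(n, h, 0.6 * h**2) > 1.0    # tau > h^2/2: unstable
```

The two assertions reproduce the classical stability threshold of the explicit heat scheme as a special case of the Neumann conditions.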
Estimates of types (43.7), (44.3) and (44.8) are the simplest a priori estimates. They
have been constructed using information about the norms of the problem's
operators.
Using a priori estimates constructed in the same way, the splitting schemes have
been grounded by many authors, for example by DYAKONOV [1962c, 1962f, 1971d],

SAMARSKII [1977] and MARCHUK [1973, 1980]. In particular, the a priori estimates
method has been used for grounding the two-cycle componentwise splitting schemes
considered in Chapter III.
More sophisticated estimates may be obtained using the energy inequalities
method which is considered in the next section.

45. The energy inequalities method for constructing a priori estimates

The energy inequalities method is one of the most general and effective methods for
obtaining a priori estimates. Its essence consists of the following.
Consider scheme (43.1) in a Hilbert space Φ. Taking the inner product of (43.1) with φ^{j+1}
in Φ gives

(Λφ^{j+1}, φ^{j+1})_Φ = (f^j, φ^{j+1})_Φ,   (45.1)

where Λ denotes the operator of the scheme. Equality (45.1) is called an energy identity (see, for instance, SAMARSKII [1977], p.
359). For linear problems the left-hand side of (45.1) is a quadratic form in φ^{j+1}, and
the right-hand side is a bilinear form in f^j and φ^{j+1}. After some
transformations one obtains lower estimates of (Λφ^{j+1}, φ^{j+1})_Φ, upper estimates of
(f^j, φ^{j+1})_Φ, and, then, a priori estimates of type (43.2). In these transformations the
Green difference formulae, the formulae of summation by parts, and grid analogues of
imbedding theorems are often used. Sometimes one obtains systems of difference
inequalities; estimating their solutions, one again arrives at (43.2). Estimates of the type
(43.2), obtained from (45.1), are called energy inequalities.
Energy inequalities for the simplest difference schemes were first obtained in the
well-known article by COURANT, FRIEDRICHS and LEWY [1928]. For a broad class of
problems the estimates of this kind were studied by LADYZHENSKAYA [1956, 1968],
LEES [1960a-b], SAMARSKII [1961b, 1977], DYAKONOV [1962e, 1971a, 1971d], etc.
Such a method for obtaining energy inequalities has been applied for grounding
the splitting and alternating direction methods. Some of the first papers in this
direction were by LEES [1960c, 1961], where the energy inequalities method has been
applied for grounding the simplest splitting schemes. This approach has been further
developed by DYAKONOV [1962c, 1962e] and SAMARSKII [1962c-d, 1963b]. In their
papers the splitting schemes have been grounded with the help of a priori estimates
for various problems of mathematical physics.
Let us briefly summarize the results obtained in these papers.
LEES [1961] considered the implicit alternating direction methods of Douglas,
Peaceman and Rachford for solving the first boundary value problem for the
parabolic equation

∂φ/∂t = Σ_{i=1}^{N} ∂²φ/∂x_i²,   x ∈ Ω,  t > 0,   (45.2)

with the initial condition

φ(x, t = 0) = φ_0(x).   (45.3)

The exact solution φ(x, t) is assumed to be a sufficiently smooth function whose support
for fixed t lies in the domain Ω. Excluding the intermediate steps from the splitting
schemes, Lees obtains equalities which lead to energy identities of the type (45.1). On
the basis of the difference Green formulae and grid analogues of the imbedding
theorems he arrives at the stability estimate for the approximate solution φ_h:

‖φ_h(t)‖_{Ω_h} ≤ c ‖φ_h(0)‖_{Ω_h},   (45.4)

where ‖·‖_{Ω_h} is the usual vector norm in the grid space Ω_h, and c is a positive constant
depending on N and on the grid coefficient

λ = τ Σ_i h_i^{-2};

here τ and the h_i are the steps in time and in space, respectively.
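The flavor of estimate (45.4) can be reproduced numerically. The sketch below (an assumed model setup, not Lees's construction) advances the two-dimensional heat equation by the Peaceman-Rachford alternating direction scheme on a separable grid and checks that the grid norm of the solution does not grow, i.e. that a bound of type (45.4) holds with c = 1 for this problem:

```python
import numpy as np

n, h, tau = 31, 1.0 / 32, 1e-3
# 1D three-point difference Laplacian with zero Dirichlet boundary data
L1 = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
I = np.eye(n)
Aplus, Aminus = I + 0.5 * tau * L1, I - 0.5 * tau * L1

x = np.linspace(h, 1 - h, n)
phi = np.outer(np.sin(np.pi * x), np.sin(2 * np.pi * x))   # initial grid data
norm0 = np.linalg.norm(phi)

for _ in range(50):
    # Peaceman-Rachford step: implicit in x / explicit in y, then the reverse
    phi = np.linalg.solve(Aplus, phi @ Aminus)
    phi = np.linalg.solve(Aplus, (Aminus @ phi).T).T
    assert np.linalg.norm(phi) <= norm0 + 1e-12   # stability of type (45.4), c = 1
```

Because the two one-dimensional operators commute on this separable grid, every Fourier mode is damped by a factor of modulus less than one, so the norm is monotonically nonincreasing.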


Besides the estimate (45.4) ("initial data" stability), LEES [1961] obtained
estimates for the convergence speed of the splitting schemes.
As mentioned by LEES [1961], the energy inequalities method also allows one to ground
implicit alternating direction methods for the parabolic equation

∂φ/∂t = Σ_i ∂/∂x_i ( a_i(x, t) ∂φ/∂x_i ),   x ∈ Ω,  t > 0,

with variable coefficients a_i(x, t).


In another paper, LEES [1960c] considered two difference methods for solving the
mixed problem for the simplest two-dimensional wave equation

∂²φ/∂t² = ∂²φ/∂x² + ∂²φ/∂y²   in Ω × [0, T].   (45.5)

These methods were obtained by the author from the simplest approximations of
problem (45.5) using DOUGLAS and RACHFORD'S [1956] splitting procedure. In the
same way the energy inequalities method was used for grounding these methods. As
mentioned by the author, the results may be generalized to hyperbolic equations of
the type

∂²φ/∂t² = a(x, y, t) ∂²φ/∂x² + b(x, y, t) ∂²φ/∂y² + F(x, y, t, φ, ∂φ/∂x, ∂φ/∂y).

It was shown by DYAKONOV [1962c] that a somewhat modified YANENKO [1959a,
1960] method of fractional steps is, under zero boundary conditions, an
absolutely stable and convergent difference scheme for equations of the type

∂φ/∂t = Lφ + f,   (45.6)

where L is a self-adjoint, negative-definite elliptic operator of order 2m with separable
variables. Here for m > 1 the method's convergence is understood as
convergence to a generalized solution. Difference energy inequalities are the main
tool of the proof. The analogous result for the classical solution of problem (45.6)
for m = 1 was obtained by DYAKONOV [1962e], where a priori estimates like those of
LEES [1960c, 1961] were constructed.
SAMARSKII [1962c] considered the alternating direction method (see Section 12)
for solving linear and quasi-linear equations of parabolic type:

∂φ/∂t = Σ_{α=1}^{p} L_α φ.   (45.7)

This method admits any number p of space variables and is applicable to an arbitrary
domain G.
Its essence consists of the following. On each layer

t_{j+(α−1)/p} ≤ t ≤ t_{j+α/p} = t_j + (α/p) τ,   α = 1, ..., p,

a one-dimensional difference equation

(1/p) ∂φ/∂t = L_α φ

is solved. To this end homogeneous difference schemes are used:

(φ^{j+α/p} − φ^{j+(α−1)/p}) / (τ/p) − Λ_α φ^{j+α/p} = 0.   (45.8)

These have been studied by TIKHONOV and SAMARSKII [1961], SAMARSKII [1961a,
1962a-b] and SAMARSKII and FRYAZINOV [1961].
In the paper by SAMARSKII [1962c] the uniform stability of the method was proved
with respect to the right-hand side, the boundary and the initial data. It was shown that
locally one-dimensional schemes and multidimensional implicit difference schemes
(see SAMARSKII [1962b]) have accuracy O(h² + τ). In establishing the a priori stability
estimate the author used the maximum principle for difference parabolic
equations.
SAMARSKII [1962d] grounded the locally one-dimensional alternating direction
method for (45.7) in the case where at each stage (unlike SAMARSKII [1962c]) six-point
one-dimensional schemes are considered. The method's convergence is
proved for grids with arbitrary steps. Using the energy ("integral") inequalities
method and a special summation method for the local errors, an a priori estimate for the
approximate solution is obtained.
In another paper, SAMARSKII [1963b] considered two-layer schemes of
accuracy O(h⁴ + τ²) for the multidimensional heat conduction equation
(45.7), applicable for p ≤ 3. For their realization a number of splitting algorithms were used,
requiring the same number of operations as the corresponding algorithms of accuracy
O(τ² + h⁴) (see DOUGLAS and RACHFORD [1956], DYAKONOV [1962e],
YANENKO [1959a, 1961]). It was shown that these schemes are absolutely stable
and converge in the mean with the rate O(h⁴ + τ²) for any value of γ = τ/h⁴. For
the stability studies, a priori estimates obtained from an energy identity on the basis of
the Green difference formulae and grid analogues of the imbedding theorems were
constructed.

The use of the energy inequalities for grounding the splitting schemes for
stationary problems was dealt with in papers by SAMARSKII and ANDREEV [1964],
SAMARSKII ([1977], Chapter X), etc.
As we see from these papers the energy inequalities method is widely used for
grounding the splitting and alternating direction schemes for various problems of
mathematical physics.
To conclude this section let us demonstrate the use of this method for grounding
the splitting schemes in the case of an implicit difference approximation (see MARCHUK
[1980]). To this end consider the problem

∂φ/∂t + Aφ = f in Ω_t,
φ = g for t = 0.   (45.9)

Assume that

A = Σ_{α=1}^{n} A_α,

where all operators A_α are time-independent and positive-semidefinite in the Hilbert
space Φ (A_α ≥ 0), i.e. (A_α φ, φ)_Φ ≥ 0 for all φ ∈ D(A_α). Consider the splitting algorithm
in the form

(φ^{j+α/n} − φ^{j+(α−1)/n}) / τ + A_α φ^{j+α/n} = 0,   α = 1, ..., n−1,

(φ^{j+1} − φ^{j+(n−1)/n}) / τ + A_n φ^{j+1} = f^j.   (45.10)

With the help of the energy inequalities method let us prove the following
statement.
Scheme (45.10) approximates problem (45.9) with first-order accuracy in τ on
sufficiently smooth solutions and is absolutely stable under the condition that
A_α ≥ 0. The approximate solution of (45.10) converges to the exact solution of (45.9).
To prove this, note first that the approximation property of scheme
(45.10) is obvious; it is verified by excluding the fractional steps and using a Taylor expansion
of φ^{j+1}.
Using the energy inequalities method we prove the stability of (45.10). Take the
inner product of each of the equations (45.10) with φ^{j+1/n}, ..., φ^{j+1}, respectively.
Using the positive-semidefiniteness of the operators A_α, we obtain

‖φ^{j+α/n}‖_Φ ≤ ‖φ^{j+(α−1)/n}‖_Φ,   α = 1, 2, ..., n−1.   (45.11)

Consider the last equation of (45.10) in more detail. We have

(φ^{j+1}, φ^{j+1})_Φ = (φ^{j+(n−1)/n}, φ^{j+1})_Φ − τ (A_n φ^{j+1}, φ^{j+1})_Φ + τ (f^j, φ^{j+1})_Φ.

Taking into account that A_n ≥ 0 we obtain

(φ^{j+1}, φ^{j+1})_Φ ≤ (φ^{j+(n−1)/n}, φ^{j+1})_Φ + τ (f^j, φ^{j+1})_Φ.

Since

(φ^{j+(n−1)/n}, φ^{j+1})_Φ ≤ ‖φ^{j+(n−1)/n}‖_Φ ‖φ^{j+1}‖_Φ,

|(f^j, φ^{j+1})_Φ| ≤ ‖f^j‖_Φ ‖φ^{j+1}‖_Φ,

we have

‖φ^{j+1}‖²_Φ ≤ ‖φ^{j+(n−1)/n}‖_Φ ‖φ^{j+1}‖_Φ + τ ‖f^j‖_Φ ‖φ^{j+1}‖_Φ.

Dividing by ‖φ^{j+1}‖_Φ, we get the following inequality:

‖φ^{j+1}‖_Φ ≤ ‖φ^{j+(n−1)/n}‖_Φ + τ ‖f^j‖_Φ.

Excluding the solutions with fractional indices with the help of (45.11) we obtain

‖φ^{j+1}‖_Φ ≤ ‖φ^j‖_Φ + τ ‖f^j‖_Φ.   (45.12)

Taking into account that ‖φ^0‖_Φ = ‖g‖_Φ and excluding the intermediate values of the
solution we obtain the a priori estimate

‖φ^{j+1}‖_Φ ≤ ‖g‖_Φ + (j+1) τ ‖f‖_Φ,

where

‖f‖_Φ = max_j ‖f^j‖_Φ.

It thus follows that the difference scheme (45.10) is absolutely stable.
The convergence follows from the approximation and stability. Thus, the
statement is proved.
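The statement just proved can be checked directly. In the sketch below (an illustrative setup, not from the text), scheme (45.10) is run with n = 2 randomly generated positive-semidefinite operators, and inequality (45.12) is verified at every step:

```python
import numpy as np

rng = np.random.default_rng(1)
M1, M2 = rng.standard_normal((6, 6)), rng.standard_normal((6, 6))
A1, A2 = M1 @ M1.T, M2 @ M2.T       # positive-semidefinite operators A_alpha
I = np.eye(6)

tau, g = 0.05, rng.standard_normal(6)
f = [np.cos(0.2 * j) * rng.standard_normal(6) for j in range(100)]

phi = g.copy()
for j in range(100):
    # (phi^{j+1/2} - phi^j)/tau + A1 phi^{j+1/2} = 0
    half = np.linalg.solve(I + tau * A1, phi)
    # (phi^{j+1} - phi^{j+1/2})/tau + A2 phi^{j+1} = f^j
    new = np.linalg.solve(I + tau * A2, half + tau * f[j])
    # a priori estimate (45.12)
    assert np.linalg.norm(new) <= np.linalg.norm(phi) + tau * np.linalg.norm(f[j]) + 1e-12
    phi = new
```

Each implicit fractional step applies (E + τA_α)^{-1}, whose norm does not exceed one for A_α ≥ 0; this is the discrete counterpart of the energy argument above.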
We thus come to the conclusion of this chapter: the a priori estimates method
is an effective technique for the convergence studies of splitting and alternating
direction schemes.
CHAPTER XII

The Splitting of the Evolutionary Problem for a System of Differential Equations

As noted above, for a system of difference equations with properly chosen operators
A_α, the initial evolutionary problem is preliminarily split in a number of splitting
algorithms. Then each equation of the system is approximated by a suitable
difference scheme, and an additional splitting of the A_α into A_{αβ} is possible. In this chapter,
we consider some of these splitting methods and their grounding for a system of
difference equations.
Note that the operator A in the initial problem is not necessarily a matrix. It may
be some abstract operator acting in a Banach space Φ. Let us agree to write the
partial derivative in t as ∂φ/∂t in all equations under consideration, though on some
occasions it would be more correct to use the conventional notation for the derivative
with respect to t.

46. The splitting of problems defined on fractional intervals and the weak
approximation method

Assume that in a Banach space Φ the abstract Cauchy problem

∂φ/∂t + Aφ = f in Ω_t = {0 < t ≤ T},
φ = g for t = 0,  g ∈ Φ,   (46.1)

is considered, where A = A(t) is a linear operator that has a domain of definition dense in Φ and
a range in Φ. The functions φ(t) and f(t) are abstract functions of t ∈ [0, T] with
values in Φ.
Let the operator A be representable as the sum

A = Σ_{α=1}^{n} A_α   (46.2)

of linear operators A_α(t) with the same domain as A. Then solving problem (46.1)
may be approximately reduced to successively solving Cauchy problems of type
(46.1), where instead of A we have the operators A_α, α = 1, 2, ..., n. Let us consider the

first of those reduction methods (SAMARSKII [1962c], YANENKO [1964b], YANENKO
and DEMIDOV [1966]).
Introduce the grid t_j = jτ, j = 0, 1, ..., J, with the step τ. Represent f in the form

f = Σ_{α=1}^{n} f_α   (46.3)

and rewrite (46.1) in the form

∂φ/∂t + Σ_{α=1}^{n} A_α φ = Σ_{α=1}^{n} f_α.   (46.4)

Introduce the fractional values t_{j+α/n}, α = 1, ..., n−1, on the interval θ_j = [t_j, t_{j+1}]
and consider the following system of equations on fractional intervals:

(1/n) ∂φ_α/∂t + A_α φ_α = f_α,   t ∈ [t_{j+(α−1)/n}, t_{j+α/n}],   α = 1, ..., n,   (46.5)

with the conditions

φ_1(0) = g,
φ_1(t_j) = φ_n(t_j),   j = 1, 2, ...,   (46.6)
φ_α(t_{j+(α−1)/n}) = φ_{α−1}(t_{j+(α−1)/n}),   j = 0, 1, ...;  α = 2, 3, ..., n.

As the solution of problem (46.5)-(46.6) we take the element

φ^{j+1} = φ_n(t_{j+1}).   (46.7)

The next stage in the solution of the initial problem consists in solving
problem (46.5)-(46.6) numerically. To that end, the corresponding difference schemes in t (and
possibly an additional splitting of the A_α into A_{αβ}, α = 1, 2, ..., n) can be used. If the
structure of the operators A_α is simple enough, we obtain an effectively realizable
splitting scheme for problem (46.1).
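For matrix operators, each fractional-interval problem in (46.5)-(46.6) with f = 0 is solved exactly by a matrix exponential, and the transition from (46.1) can be tested numerically. The sketch below (an assumed setup, not from the text) checks that the splitting error over a fixed interval behaves like O(τ) for a noncommuting pair A_1, A_2:

```python
import numpy as np

def sym_expm(S, t):
    # exp(t*S) for symmetric S via eigendecomposition
    w, V = np.linalg.eigh(S)
    return (V * np.exp(t * w)) @ V.T

rng = np.random.default_rng(2)
A1 = rng.standard_normal((4, 4)); A1 = A1 @ A1.T
A2 = rng.standard_normal((4, 4)); A2 = A2 @ A2.T
g, T = rng.standard_normal(4), 1.0

def split_solve(J):
    # one pass of (46.5)-(46.6) per step: exact solve with A1, then with A2
    tau, phi = T / J, g.copy()
    for _ in range(J):
        phi = sym_expm(A2, -tau) @ (sym_expm(A1, -tau) @ phi)
    return phi

exact = sym_expm(A1 + A2, -T) @ g
e1 = np.linalg.norm(split_solve(50) - exact)
e2 = np.linalg.norm(split_solve(100) - exact)
assert 1.5 < e1 / e2 < 2.5   # halving tau roughly halves the error: O(tau)
```

The first-order error is driven by the commutator of A_1 and A_2; for commuting operators the product of exponentials would be exact, which is the content of the commutative results quoted below.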
Let us formulate some results to ground the transition from (46.1) to (46.5)-(46.6).
To that end, we introduce concepts connected with the weak approximation method
(YANENKO and DEMIDOV [1966], YANENKO [1964b]). We will say that a family of
functions Ψ_τ(t) weakly approximates the function Ψ(t) in t on the interval (0, T) if

∫_{t'}^{t''} (Ψ_τ(t) − Ψ(t)) dt = η(t', t'', τ),   (46.8)

where ‖η‖ → 0 for τ → 0+ and t', t'' ∈ (0, T). Further, we say that a family of linear
differential operators A_τ(t) weakly approximates the operator A(t) in t if the weak
approximation takes place for the coefficients of A(t). Now, taking (46.2) into account, we

introduce the operator

A_τ = Σ_{α=1}^{n} β_α(τ, t) A_α,   (46.9)

where the functions β_α(τ, t) are defined in the following way:

β_α(τ, t) = n δ_{αβ}   if t ∈ [t_{j+(β−1)/n}, t_{j+β/n}).

(Here δ_{αβ} is the Kronecker delta.) It is easy to notice that the operator A_τ weakly
approximates the operator A. Together with (46.1) we consider the following
Cauchy problem:

∂φ/∂t + A_τ φ = f_τ in Ω_t,   (46.10)
φ = g for t = 0.

Its solution will be denoted by φ = φ_τ. It is easy to see that for f_τ = n f_α on the
corresponding fractional intervals this problem may also be written in the form (46.5)-(46.6).
Assuming f_τ = f and f_α = f/n, both (46.3)
and the conditions for the weak approximation of f by f_τ hold, and the problems
(46.5)-(46.6) and (46.10) coincide. Thus, problem (46.10) constitutes a splitting
system of equations, and to ground the transition from (46.1) to (46.5)-(46.6), it is
sufficient to study the convergence of the problem's solution φ_τ to φ(t) for τ → 0.
Now assume that the operator A has the form

A = Σ_{k_1,...,k_m} a_{k_1...k_m}(x_1, ..., x_m, t) ∂^{k_1+···+k_m} / ∂x_1^{k_1} ··· ∂x_m^{k_m},   (46.11)

where x_1, ..., x_m are real variables on which the functions in Φ depend. All
coefficients a_{k_1...k_m} are assumed to be real, bounded and continuous in t in a uniform
metric. The order of the differential operators A_α in the representation (46.2) is
considered to be finite. Further, if all the derivatives of the function a(x_1, ..., x_m, t)
that occur in A exist and are bounded and continuous in a uniform metric, we say that
a(x_1, ..., x_m, t) has derivatives up to order "A". When this differentiation procedure
can be repeated j times, we say that a(x_1, ..., x_m, t) has derivatives up to order "A^j".
The expressions "up to order A" and "up to order A^1" are equivalent.
Assume that the a_{k_1...k_m}(x_1, ..., x_m, t) and f have derivatives in the variables x_1, ..., x_m up to
order "A^j". We formally differentiate (46.1) to obtain the vector φ^{(1)} with the components

v^{(k)} = v^{(k_1...k_m)} = D_1^{k_1} ··· D_m^{k_m} φ,   D_i^{k_i} = ∂^{k_i}/∂x_i^{k_i},

where the combinations k = (k_1, ..., k_m) are taken from the sum (46.11) (necessarily
including the combination (0, ..., 0), corresponding to φ = v^{(0...0)}). Then, we get
the system

∂φ^{(1)}/∂t + A^{(1)} φ^{(1)} = f^{(1)}.   (46.12)

This system is the first extended system, with operator A^{(1)} and vector f^{(1)}. In the
same way the vectors φ^{(2)} = {v^{(k+l)}}, φ^{(3)} = {v^{(k+l+q)}}, ... are constructed, where the
indices l, q, ... range over the same combinations as k. For φ^{(2)}, φ^{(3)}, ... we obtain the

second, third, ... extended systems, respectively:

∂φ^{(j)}/∂t + A^{(j)} φ^{(j)} = f^{(j)},   j = 2, 3, ....   (46.13)

It is easy to verify that the representations

A^{(j)} = Σ_{α=1}^{n} A_α^{(j)}   and   A_τ^{(j)} = Σ_{α=1}^{n} β_α A_α^{(j)}

correspond to the representation (46.2). Therefore, it is possible to consider the
extended systems

∂φ_τ^{(j)}/∂t + A_τ^{(j)} φ_τ^{(j)} = f^{(j)}   (46.14)

along with (46.10).


Let us further agree to denote the Cauchy problems (46.1) and (46.10) and their
extensions by I, I_τ, I^{(j)} and I_τ^{(j)}, respectively. The problems I and I_τ are defined in the same
space Φ and the problems I^{(j)} and I_τ^{(j)} in the corresponding Banach space Φ^{(j)}.
If φ is the solution of the problem

∂φ/∂t + Aφ = 0 in Ω_t,
φ = g for t = 0

and the initial condition is φ(t_1), then the transition operator from φ(t_1) to φ(t_2) will
be denoted by S(t_2, t_1), i.e.

φ(t_2) = S(t_2, t_1) φ(t_1),   0 ≤ t_1 ≤ t_2 ≤ T.

System (46.1) is called correct if

‖S(t_2, t_1)‖_Φ ≤ M(T),   0 ≤ t_1 ≤ t_2 ≤ T,

and uniformly correct if

‖S(t_2, t_1)‖_Φ ≤ e^{C(t_2 − t_1)}

with the constant C dependent only on T. For the problems I_τ, I^{(j)} and I_τ^{(j)} the notions of
transition operator, correctness and uniform correctness are defined in a similar way.
YANENKO and DEMIDOV [1966] and YANENKO [1964b] formulated the following
results. In Statements 46.1 and 46.2, f ≡ 0 is assumed.

STATEMENT 46.1. If the problems I_τ, I_τ^{(1)} and I_τ^{(2)} are uniformly correct, then φ_τ(t)
converges uniformly in t to φ(t) = S(t, 0)g for τ → 0. The transition operator S(t_2, t_1)
satisfies the conditions of uniform correctness for system (46.1).

STATEMENT 46.2. If the problems I_τ and I_τ^{(j)}, j = 1, 2, 3, are uniformly correct, then for
any g ∈ Φ the function φ_τ(t) converges uniformly in t to φ(t) for τ → 0, along with its
x_k-derivatives up to order "A". The limiting function φ(t) has a derivative in t and is the
solution of problem (46.1).

STATEMENT 46.3. If (a) g ∈ Φ, (b) the A_α φ are uniformly continuous in t, (c) the problems I_τ
and I_τ^{(1)} are uniformly correct, and (d) the problems I_τ are correct on the right-hand side (i.e.
their solutions depend uniformly on the right-hand sides), then (46.1) has
a unique solution and φ_τ(t) converges uniformly to φ(t) in t.

STATEMENT 46.4. Let Φ = L_q(Ω), q > 0, Ω = Ω(x_1, ..., x_m). If (a) the problems I_τ and
I_τ^{(1)} are uniformly correct for f ≡ 0, and (b) the problems I_τ are correct on the right-hand
side, then φ_τ(t) is fundamental (Cauchy) uniformly in t for τ → 0, and any smooth limiting function
φ(t) is a solution of the problem I. If g^{(1)} = φ^{(1)}|_{t=0} ∈ L_q(Ω), then φ(t) is a smooth and
unique solution of problem I.

REMARK 46.1. Here, the function φ(t) is called smooth if all its derivatives
present in equation (46.1) exist and belong to L_q(Ω × Ω_t).

STATEMENT 46.5. Let A_α have the form

A_α = A_α(x, t) ∂/∂x_α,   α = 1, ..., m,   x = (x_1, ..., x_m),

where the A_α(x, t) are symmetric matrices, continuous in Ω × Ω_t along with their first
derivatives in the space variables. Then the problems I and I_τ are uniformly correct for
f ≡ 0. The function φ_τ(t) converges uniformly in t for τ → 0 to the solution of problem I.

The proofs of Statements 46.1-46.5 are given in the work by YANENKO and
DEMIDOV [1966]. Different aspects connected with grounding the transition from
(46.1) to (46.10) (and to (46.5)-(46.6) in particular) can be found in the papers by
YANENKO [1987] and SAMARSKII [1970]. Note that SAMARSKII [1970] estimated the
convergence rate of φ_τ as well.

47. The splitting of problems defined on the whole interval

We now consider a splitting method for problem (46.1) that is often used in practice
for obtaining splitting schemes (SAMARSKII [1965b, 1970], BAKER and OLIPHANT
[1960], MARCHUK [1967]).
We replace problem (46.1) by the following system:

∂φ_1/∂t + A_1 φ_1 = f_1(t),   φ_1(t_j) = v(t_j),

∂φ_2/∂t + A_2 φ_2 = f_2(t),   φ_2(t_j) = φ_1(t_{j+1}),

. . . . . . . . . .

∂φ_n/∂t + A_n φ_n = f_n(t),   φ_n(t_j) = φ_{n−1}(t_{j+1}),   (47.1)

t ∈ θ_j = [t_j, t_{j+1}],

and assume

v(t_{j+1}) = φ_n(t_{j+1}),   j = 0, 1, ..., J−1,   (47.2)
v(0) = g.
We formulate the statements with regard to the proximity of problems (46.1) and
(47.1) (SAMARSKII [1970, 1971]).
If A_α(t') and A_β(t''), α, β = 1, ..., n, commute for any t', t'' ∈ [0, T], i.e.

A_α(t') A_β(t'') = A_β(t'') A_α(t'),

then for f ≡ 0 the following equalities hold:

v(t_j) = φ(t_j),   j = 0, 1, ..., J.   (47.3)

If f ≢ 0, then we can choose the f_α in (46.3) so that (47.3) holds.
If the A_α and A_β commute and the exact solution of problem (46.1) satisfies the
restriction that

‖A_α A_β φ‖ ≤ M = const < ∞,

then under conditions (46.3) we have:

‖v(t_j) − φ(t_j)‖ ≤ O(τ),   j = 0, 1, ..., J.   (47.4)

Problem (47.1) approximates (46.1) in the summarized sense, i.e.

‖ Σ_{α=1}^{n} ψ_α ‖ = O(τ),

where

ψ_α = f_α(t) − A_α(t) φ(t_{j+1}),   α > 1,
ψ_1 = f_1(t) − A_1(t) φ(t) − dφ/dt

(see SAMARSKII [1970], p. 219).
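Property (47.3) is easy to verify numerically: for commuting A_α and f ≡ 0, each sub-problem of (47.1) is solved by a matrix exponential, and the product of the exponentials reproduces the exact transition operator. A minimal sketch (assumed matrix setting, not from the text):

```python
import numpy as np

def sym_expm(S, t):
    # exp(t*S) for symmetric S via eigendecomposition
    w, V = np.linalg.eigh(S)
    return (V * np.exp(t * w)) @ V.T

# A commuting pair: polynomials in one symmetric matrix commute
rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5)); M = M @ M.T
A1, A2 = M, M @ M / 10.0

g, tau = rng.standard_normal(5), 0.3
v_split = sym_expm(A2, -tau) @ (sym_expm(A1, -tau) @ g)   # one pass of (47.1), f = 0
v_exact = sym_expm(A1 + A2, -tau) @ g
assert np.linalg.norm(v_split - v_exact) < 1e-10           # equality (47.3)
```

For a noncommuting pair the same comparison would leave an O(τ) residual, which is the content of estimate (47.4).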

48. The two-cycle splitting of the problem

Problem (46.1) can be split into a system of Cauchy problems that approximates
problem (46.1) with order O(τ²) by using the two-cycle method, i.e. by symmetrizing the
sequence of problems of type (47.1) (BAKER and OLIPHANT [1960], MARCHUK [1967,
1971], FRYAZINOV [1968]). Let us demonstrate several examples of systems of
Cauchy problems obtained by this method, each of these systems having second-order
accuracy (under the corresponding additional restrictions on the smoothness of
the initial data and of the solution, and with appropriate choices of the f_α in (46.3)).

EXAMPLE 48.1 (FRYAZINOV [1968]). The system has the form:

∂φ_1/∂t + ½A_1 φ_1 = f_1,   φ_1(t_j) = v(t_j),

∂φ_2/∂t + ½A_2 φ_2 = f_2,   φ_2(t_j) = φ_1(t_{j+1}),

. . . . . . . . . .

∂φ_n/∂t + ½A_n φ_n = f_n,   φ_n(t_j) = φ_{n−1}(t_{j+1}),

∂φ_{n+1}/∂t + ½A_n φ_{n+1} = f_{n+1},   φ_{n+1}(t_j) = φ_n(t_{j+1}),   (48.1)

. . . . . . . . . .

∂φ_{2n}/∂t + ½A_1 φ_{2n} = f_{2n},   φ_{2n}(t_j) = φ_{2n−1}(t_{j+1}),

t ∈ θ_j,

where

f = Σ_{α=1}^{2n} f_α.

It is assumed that

v(t_{j+1}) = φ_{2n}(t_{j+1}).

EXAMPLE 48.2 (BAKER and OLIPHANT [1960]). The system has the form:

∂φ_1/∂t + ½A_1 φ_1 = f_1,   φ_1(t_j) = v(t_j),

. . . . . . . . . .

∂φ_{n−1}/∂t + ½A_{n−1} φ_{n−1} = f_{n−1},   φ_{n−1}(t_j) = φ_{n−2}(t_{j+1}),

∂φ_n/∂t + ½A_n φ_n = f_n,   φ_n(t_j) = φ_{n−1}(t_{j+1}),

∂φ_{n+1}/∂t + ½A_n φ_{n+1} = f_{n+1},   φ_{n+1}(t_j) = φ_n(t_{j+1}),   (48.2)

. . . . . . . . . .

∂φ_{2n}/∂t + ½A_1 φ_{2n} = f_{2n},   φ_{2n}(t_j) = φ_{2n−1}(t_{j+1}),

t ∈ θ_j,

for

f = Σ_{α=1}^{2n} f_α,   v(t_{j+1}) = φ_{2n}(t_{j+1}).

EXAMPLE 48.3 (MARCHUK [1980], p. 276). The system is assumed to have the
following form. On the interval t_{j−1} ≤ t ≤ t_j:

∂φ_α/∂t + A_α φ_α = 0,   α = 1, 2, ..., n−1,

∂φ_n/∂t + A_n φ_n = f + ½τ A_n f;   (48.3)

and on the interval t_j ≤ t ≤ t_{j+1}:

∂φ_{n+1}/∂t + A_n φ_{n+1} = f − ½τ A_n f,

∂φ_{n+α}/∂t + A_{n−α+1} φ_{n+α} = 0,   α = 2, 3, ..., n,   (48.4)

under the conditions

φ_1(t_{j−1}) = v(t_{j−1}),
φ_{α+1}(t_{j−1}) = φ_α(t_j),   α = 1, 2, ..., n−1,   (48.5)
φ_{n+1}(t_j) = φ_n(t_j),
φ_{β+1}(t_j) = φ_β(t_{j+1}),   β = n+1, ..., 2n−1.

The approximation of problem (46.1) by (48.3)-(48.5) is considered on the double
interval (t_{j−1}, t_{j+1}) under the assumption that

v(t_{j+1}) = φ_{2n}(t_{j+1}).

EXAMPLE 48.4 (MARCHUK [1980], p. 276). The system is assumed to have the
following form. On the interval t_{j−1} ≤ t ≤ t_j:

∂φ_1/∂t + A_1 φ_1 = 0,

. . . . . . . . . .

∂φ_n/∂t + A_n φ_n = 0,   (48.6)

φ_1(t_{j−1}) = v(t_{j−1}),
φ_α(t_{j−1}) = φ_{α−1}(t_j),   α = 2, 3, ..., n;

on the interval t_{j−1} ≤ t ≤ t_{j+1}:

∂φ_{n+1}/∂t = f,   φ_{n+1}(t_{j−1}) = φ_n(t_j);   (48.7)

and on the interval t_j ≤ t ≤ t_{j+1}:

∂φ_{n+2}/∂t + A_n φ_{n+2} = 0,

. . . . . . . . . .

∂φ_{2n+1}/∂t + A_1 φ_{2n+1} = 0,   (48.8)

φ_α(t_j) = φ_{α−1}(t_{j+1}),   α = n+2, ..., 2n+1.

Problem (48.6)-(48.8) approximates (46.1) on (t_{j−1}, t_{j+1}) under the assumption
that

v(t_{j+1}) = φ_{2n+1}(t_{j+1}).

All the systems in this section have second-order accuracy, and the operators A_α and A_β,
α ≠ β, do not have to be commutative. Theoretical aspects concerning these systems are
considered in the above-cited literature.
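The second-order accuracy of the two-cycle arrangement can be observed numerically. In the sketch below (assumed matrix setting, f ≡ 0, n = 2, not from the text), one step follows the operator ordering of Example 48.1, half-weight A_1, A_2, A_2, A_1, and the error against the exact transition operator decays as O(τ²):

```python
import numpy as np

def sym_expm(S, t):
    # exp(t*S) for symmetric S via eigendecomposition
    w, V = np.linalg.eigh(S)
    return (V * np.exp(t * w)) @ V.T

rng = np.random.default_rng(4)
A1 = rng.standard_normal((4, 4)); A1 = A1 @ A1.T
A2 = rng.standard_normal((4, 4)); A2 = A2 @ A2.T
g, T = rng.standard_normal(4), 1.0

def two_cycle(J):
    # one step of (48.1) with n = 2: half-weight operators A1, A2, A2, A1
    tau, phi = T / J, g.copy()
    S = (sym_expm(A1, -tau / 2) @ sym_expm(A2, -tau / 2)
         @ sym_expm(A2, -tau / 2) @ sym_expm(A1, -tau / 2))
    for _ in range(J):
        phi = S @ phi
    return phi

exact = sym_expm(A1 + A2, -T) @ g
e1 = np.linalg.norm(two_cycle(25) - exact)
e2 = np.linalg.norm(two_cycle(50) - exact)
assert 3.0 < e1 / e2 < 5.0   # halving tau quarters the error: O(tau^2)
```

The symmetric (palindromic) ordering cancels the first-order commutator term that limits the one-cycle splitting of Section 47 to O(τ).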

49. Some results on convergence and stability

We assume that appropriate finite difference approximations in all variables
(including t) are applied to the systems of Cauchy problems considered in Sections
46-48 and that the possible splitting of the difference approximations of the
operators A_α into A_{αβ} is carried out. Many of the splitting schemes thus obtained can
be written in a canonical form:

B (φ^{j+α/m} − φ^{j+(α−1)/m}) / τ + Σ_{β=0}^{m} A_{αβ} φ^{j+β/m} = f^{j+α/m},

α = 1, ..., m,   j = 0, 1, ..., J,   (49.1)

φ^0 = g,

where all grid functions are assumed to belong to a Hilbert space Φ = H_h, and the A_{αβ}
and B are linear operators from H_h into H_h. The approximation error of (49.1) is
defined as the sum of the approximation errors of the separate equations:

ψ = ψ_1 + ··· + ψ_m,   (49.2)

ψ_α being represented in the form ψ_α = ψ_α° + ψ_α* so that Σ_{α=1}^{m} ψ_α° = 0 (see SAMARSKII
[1970]). We assume further that (49.1) is solvable and that the inverse operator B^{-1}
exists.
Let us formulate the statements establishing the connection between the stability,
approximation and convergence of the schemes of type (49.1), using norms ‖·‖_{(1)} and
‖·‖_{(2)} in the space H_h.
Let the scheme (49.1) be stable, i.e. the following a priori estimate hold:

‖φ^{j+1}‖_{(1)} ≤ M_1 ‖g‖_{(1)} + M_2 max_{0≤j'≤j} Σ_{α=1}^{m} ‖f^{j'+α/m}‖_{(2)},   (49.3)

where the constants M_1 and M_2 do not depend on τ, on the grid parameter h in the other
variables, on g or on the f^{j+α/m}, and let the "smoothness" condition for the solution φ(t) hold:

Σ_{β=1}^{m} ‖ Σ_{k=β+1}^{m} A_{βk} (φ(t_{j+1}) − φ(t_j)) ‖_{(2)} ≤ M_0 τ,   (49.4)

where M_0 = const > 0 does not depend on τ and h. Then (49.1) converges and the
following estimate is valid:

‖φ^{j+1} − φ(t_{j+1})‖_{(1)}
  ≤ M_2 max_{0≤j'≤j} [ Σ_{α=1}^{m} ‖ψ_α*(t_{j'})‖_{(2)}
  + Σ_{β=1}^{m} ‖ Σ_{k=β+1}^{m} A_{βk} (φ(t_{j'+1}) − φ(t_{j'})) ‖_{(2)} ]   (49.5)

(see SAMARSKII [1970]).

Let us now suppose that A_{α0} = 0, that B = B* > 0 is a constant operator, and that

Σ_{α,β=1}^{m} (A_{αβ} ξ_β, ξ_α) ≥ 0

for any ξ_α ∈ H_h, α = 1, ..., m. Then the following estimate holds for the solution of
problem (49.1):

‖φ^{j+1}‖_B ≤ ‖g‖_B + M_2 max_{0≤j'≤j} Σ_{α=1}^{m} ‖f^{j'+α/m}‖_{B^{-1}}
  + ( Σ_{j'=0}^{j} τ Σ_{α=1}^{m} ‖f^{j'+α/m}‖²_{B^{-1}} )^{1/2},   (49.6)

where

‖φ‖_B = (Bφ, φ)^{1/2},   ‖φ‖_{B^{-1}} = (B^{-1}φ, φ)^{1/2}

(SAMARSKII [1970]).
Scheme (49.1) converges in the norm ‖·‖_B if φ(0) = φ^0 and the conditions of the
above statement hold, as well as the summarized approximation conditions

max_{0≤j≤J} ‖ Σ_{α=1}^{m} ψ_α^{j+α/m} ‖_{B^{-1}} → 0   for |h| → 0, τ → 0,

max_{0≤j≤J} Σ_{α=1}^{m} ‖ψ_α^{j+α/m}‖_{B^{-1}} ≤ M_0 = const > 0,

where M_0 does not depend on the grid parameters h and τ (SAMARSKII [1970]).
If φ(0) = φ^0 and

‖ Σ_{α=1}^{m} ψ_α ‖ = O(|h|^l + τ^k),

where k > 0, l > 0, and the conditions of the above statement hold, then scheme
(49.1) converges in the metric ‖·‖_B with rate O(|h|^l + τ^{k_1}), where k_1 = min(k, ½). If,
besides, the restriction (49.4) holds for ‖·‖_{(2)} = ‖·‖_{B^{-1}}, then the scheme converges
with rate O(|h|^l + τ^{k_2}), where k_2 = min(k, 1) (SAMARSKII [1970], p. 215).
In conclusion, we mention the works dealing with the problems connected with
the splitting of the initial problem into a system of simpler problems. General
principles of such splitting are given by YANENKO and DEMIDOV [1966], MARCHUK
([1971]; [1980], p. 275) and SAMARSKII ([1971], p. 400). Theoretical grounding of such
a splitting may be found in the works by YANENKO and DEMIDOV [1966], SAMARSKII
([1970]; [1971], p. 403) and YANENKO ([1967], p. 170).
The work by MARCHUK ([1980], Chapter V) is dedicated to two-cycle (symmetrized)
splitting schemes; theoretical grounding and some modifications based on symmetrized
splitting are given by SAMARSKII [1971]. FRYAZINOV [1968, 1969] constructed
such schemes for equations of parabolic type in domains
consisting of p-dimensional parallelepipeds. A somewhat different symmetrization
method for the difference schemes is used by GODUNOV and ZABRODIN [1962] for the
acoustics equations. A large bibliography on all these problems is available in the works
by MARCHUK [1980] and SAMARSKII [1970, 1971].
CHAPTER XIII

Convergence Studies and Optimization of Iterative Methods

50. Sufficient conditions for convergence

In this chapter we will consider some variants of the alternating direction method for
solving a system of linear algebraic equations

Au = f   (50.1)

with a nonsingular (N × N) matrix A and a vector f ∈ R^N. Let us take the following
formula as the general form of a stationary method:

B_τ (u^{k+1} − u^k) = −ατ (Au^k − f),   k = 0, 1, ...,   (50.2)

where α and τ are some positive parameters and

B_τ = Π_{i=1}^{m} (E + τB_i)   (50.3)

is a nonsingular matrix. Here m is a fixed positive integer, and B_i, i = 1, ..., m, is an
arbitrary (N × N) matrix. Here and in the sequel the unit matrix is denoted by E.
Further, we will also consider nonstationary iterative methods, for which the parameters
used depend on the step number k.
Method (50.2) is called convergent when, for any initial approximation u^0 ∈ R^N,
the sequence of vectors u^1, u^2, ... converges to the solution u* = A^{-1} f of system
(50.1).
Let the value m and the matrices B_1, ..., B_m be given. Then the following statement is
valid: if the real parts of the eigenvalues of the matrix A are positive, then, for any
α > 0, there exists a τ̄ > 0 such that method (50.2) converges for any positive τ ∈ (0, τ̄).
The converse result is also valid: if the value α > 0 is given, then a necessary condition
for the method to converge for any sufficiently small τ > 0 is that the real parts of the
eigenvalues of the matrix A be positive.
The uniform convergence for τ → 0 of the matrix B_τ to the unit
matrix is used to prove the above statements. These proofs are given in the book by
MARCHUK and KUZNETSOV [1972].

We will consider further only two cases. Let us call the first case commutative. It
supposes the existence of a nonsingular matrix Q which, by a similarity
transformation, simultaneously reduces the matrices A, B_1, ..., B_m to diagonal form, i.e.

Q^{-1} A Q = diag{λ_1, ..., λ_N},   (50.4)

Q^{-1} B_i Q = diag{λ_1^{(i)}, ..., λ_N^{(i)}},   i = 1, ..., m.

Obviously, the matrices A, B_1, ..., B_m, as well as the matrix B_τ, then commute. In the
commutative case we will further assume that the eigenvalues λ_j of the matrix A are
real and positive and that the eigenvalues λ_j^{(i)} of the matrices B_i, i = 1, ..., m, are real
and nonnegative.
The second case, called noncommutative, supposes that m = 2, the matrix A is
positive definite in R^N, A = B_1 + B_2, and the matrices B_1 and B_2 are at least
positive-semidefinite and, generally speaking, noncommutative. For the sake of
simplicity we will use the notations B_1 = A_1 and B_2 = A_2 in the
noncommutative case.
Let us, first, determine sufficient conditions for the convergence of the commutative
alternating direction method. To this end, we introduce the function

g_ατ(λ; λ^{(1)}, ..., λ^{(m)}) = 1 − ατλ / Π_{i=1}^{m} (1 + τλ^{(i)}).   (50.5)

Then the spectral radius p(Ta) of the step matrix


T,,. = I - xB1 A (50.6)
of method (50.2) can be computed by the formula:
p(Tr,,)= max Ig,,(Aj, ?,.... LCm)l- (50.7)
I <jN

Since the inequality ρ(T_{α,τ}) < 1 is a necessary and sufficient condition for the
convergence of method (50.2), we have to determine the conditions under which
-1 < g_{α,τ}(λ_j, λ_j^{(1)}, ..., λ_j^{(m)}) < 1 for all j.

Let the matrix B = Σ_{i=1}^m B_i be nonsingular. Then there exists ᾱ > 0 such that
ρ(T_{α,τ}) < 1 for any α ∈ (0, ᾱ) and τ > 0, i.e. the commutative alternating
direction method (50.2) converges for all such values of the parameters α and τ.

If we assume additionally that A = B, then the commutative alternating direction
method (50.2) converges for any α ∈ (0, 2) and τ > 0. To prove this statement it is
sufficient to note that the assumption A = B = Σ_{i=1}^m B_i implies
λ_j = Σ_{i=1}^m λ_j^{(i)}. Therefore,

   0 < τλ_j / ∏_{i=1}^m (1 + τλ_j^{(i)}) < 1,   j = 1, ..., N,

for any τ > 0.
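The convergence criterion above can be exercised numerically. The following minimal sketch evaluates g_{α,τ} from (50.5) for a few commuting spectra with λ_j = Σ_i λ_j^{(i)} (the eigenvalue tuples are illustrative values, not data from the text) and confirms that the spectral radius (50.7) stays below one for α ∈ (0, 2):

```python
def g(alpha, tau, lam, lams):
    # g_{alpha,tau} from (50.5): 1 - alpha*tau*lam / prod_i (1 + tau*lam_i)
    prod = 1.0
    for l in lams:
        prod *= 1.0 + tau * l
    return 1.0 - alpha * tau * lam / prod

# Illustrative eigenvalue tuples lambda_j^{(i)}, i = 1..m, for two indices j.
spectra = [(0.5, 1.0, 2.5), (1.0, 3.0, 0.2)]

checks = []
for alpha in (0.5, 1.0, 1.9):          # alpha in (0, 2)
    for tau in (0.01, 0.1, 1.0, 10.0): # arbitrary tau > 0
        # Case A = B: lambda_j is the sum of the lambda_j^{(i)}.
        rho = max(abs(g(alpha, tau, sum(ls), ls)) for ls in spectra)
        checks.append(rho)

assert max(checks) < 1.0               # the step matrix is a contraction
```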


SECTION 50 Optimization of iterative methods 341

We consider one important example. Let there be given m matrices S_i of simple
structure, i.e. nonsingular (n_i × n_i) matrices Q_i exist such that

   Q_i^{-1}S_iQ_i = diag{μ_1^{(i)}, ..., μ_{n_i}^{(i)}},

where the eigenvalues μ_1^{(i)}, ..., μ_{n_i}^{(i)} are nonnegative for i = 1, ..., m.
Define the (N × N) matrices (N = n_1 ··· n_m)

   B_i = E_1 ⊗ ··· ⊗ E_{i-1} ⊗ S_i ⊗ E_{i+1} ⊗ ··· ⊗ E_m,   i = 1, ..., m,    (50.8)

where E_s denotes the unit (n_s × n_s) matrix for s = 1, ..., m, and ⊗ denotes the
tensor product of matrices. Assume also that

   A = Σ_{i=1}^m B_i.                                                         (50.9)

Such matrices occur when elliptic boundary value problems with separable variables in
rectangular domains are approximated by the grid method.
Obviously, with the additional assumption that det A ≠ 0, i.e.

   λ_j = Σ_{i=1}^m μ_{j_i}^{(i)} ≠ 0,   j = 1, ..., N,

all requirements of the commutative alternating direction method hold. Then, from
the above considerations, method (50.2) for solving system (50.1), with the matrix
A from (50.9) and the matrices B_1, ..., B_m from (50.8), converges for any
α ∈ (0, 2) and any τ > 0.
For a more detailed description of the commutative alternating direction method
see Section 51.
Now we consider the noncommutative alternating direction method. Taking the above
assumptions into account, we write the method in a different, more convenient form:

   (E + τA_1)(E + τA_2)(u^{k+1} - u^k) = -ατ(Au^k - f).                       (50.10)

For α = 1 we obtain the well-known DOUGLAS-RACHFORD [1956] method and for α = 2 the
PEACEMAN-RACHFORD [1955] method:

   (E + τA_1)(E + τA_2)(u^{k+1} - u^k) = -2τ(Au^k - f),   k = 0, 1, ....      (50.11)

We will consider the latter in more detail. Our results will be based on papers by
BIRKHOFF and VARGA [1959], WACHSPRESS and HABETLER [1960], KELLOGG [1963],
SAMARSKII [1964a] and others.
We introduce the matrix D_τ = (E + τA_2)^T(E + τA_2), the error vectors
z^k = u^k - u* of (50.11), and the vectors y^k = (E + τA_1)^{-1}(E - τA_2)z^k. Then,
after simple transformations, we immediately obtain the equalities

   (E + τA_2)z^{k+1}
      = [(E - τA_1)(E + τA_1)^{-1}][(E - τA_2)(E + τA_2)^{-1}](E + τA_2)z^k,  (50.12)

   ||z^{k+1}||_{D_τ} / ||z^k||_{D_τ}
      = (||(E - τA_1)y^k|| / ||(E + τA_1)y^k||) · (||(E - τA_2)z^k|| / ||(E + τA_2)z^k||),
and the inequality

   ||z^{k+1}||_{D_τ} ≤ ||T_{1,τ}|| ||T_{2,τ}|| ||z^k||_{D_τ}.                 (50.13)

Here ||v|| = (v, v)^{1/2} is the usual Euclidean norm of a vector v ∈ R^N, ||T|| is
the corresponding norm of a matrix T, and

   ||v||_{D_τ} = (D_τv, v)^{1/2} = ||(E + τA_2)v||,
   T_{i,τ} = (E - τA_i)(E + τA_i)^{-1},   i = 1, 2.

It is obvious that

   ||T_τ||_{D_τ} ≤ ||T_{1,τ}|| ||T_{2,τ}||,                                   (50.14)

where

   T_τ = E - 2τB_τ^{-1}A = (E + τA_2)^{-1}(E + τA_1)^{-1}(E - τA_1)(E - τA_2).
It follows from BIRKHOFF and VARGA [1959] that, if at least one of the matrices
A_1, A_2 is positive-definite, then the alternating direction method (50.11)
converges for any τ > 0.
Assuming in addition that the matrices A_1 and A_2 are symmetric (naturally, the
conditions of positive-semidefiniteness of A_1 and A_2 and of nonsingularity of the
matrix A = A_1 + A_2 remain), this statement can be reinforced: the alternating
direction method (50.11) converges for any τ > 0. To prove this latter statement it
is, according to MARCHUK and KUZNETSOV [1972], sufficient to show that for any
nonzero z ∈ R^N the following inequality holds:

   ||T_τz||_{D_τ} < ||z||_{D_τ}.                                              (50.15)

Hence the inequality ||T_τ||_{D_τ} < 1 holds, since R^N is finite-dimensional and
T_τ is a linear operator, which means that (50.11) converges. Let us use (50.12) to
establish inequality (50.15).

It is easily seen that, under the above assumptions, the inequality
||z^{k+1}||_{D_τ} ≤ ||z^k||_{D_τ} follows from (50.12), and equality holds only when
A_1y^k = 0 and A_2z^k = 0. But the latter is possible only if
z^k ∈ Ker A_1 ∩ Ker A_2 = Ker A, which contradicts the nonsingularity of the matrix
A and the condition z^k ≠ 0. Thus, inequality (50.15) has been proved.
We consider one interesting variant of the noncommutative alternating direction
method. Namely, we assume that the matrix A is symmetric (we have assumed earlier
that A is positive-definite) and A_1 = A_2^T. This means that the matrices A_1 and
A_2 are positive-definite, since the equalities

   (A_iz, z) = ½(Az, z),   i = 1, 2,   z ∈ R^N,

hold in real space. We will call this variant of the alternating direction method
(50.11) symmetric. Since the matrices A_1 and A_2 are positive-definite, the
method's convergence for any τ > 0 follows from the above considerations.
The important properties of the symmetric alternating direction method are the
symmetry and positive-definiteness of the matrix

   B_τ = (E + τA_1)(E + τA_1^T) = (E + τA_2)^T(E + τA_2);                     (50.16)

this means that the step matrix T_τ can be symmetrized in the inner product
generated by the matrix A. Indeed,

   (T_τv, w)_A = (v, w)_A - 2τ(B_τ^{-1}Av, w)_A
               = (v, w)_A - 2τ(B_τ^{-1}Av, Aw)
               = (v, w)_A - 2τ(v, B_τ^{-1}Aw)_A
               = (v, T_τw)_A

for any v, w ∈ R^N. Moreover, it follows from this that the matrix B_τ^{-1}A is not
only A-self-adjoint but also A-positive-definite. Thus, for any τ > 0 all
eigenvalues of the matrix T_τ are real and belong to the interval (-1, 1), and the
system of its eigenvectors constitutes an A-orthonormal basis of R^N.
Let us briefly discuss the influence of the parameter α on the convergence of the
noncommutative alternating direction method. To this end, we consider the
right-hand side of the equality

   ||z^{k+1}||²_{D_τ} = ||z^k||²_{D_τ} - 2ατ(B_τ^{-1}Az^k, z^k)_{D_τ}
                        + α²τ²||B_τ^{-1}Az^k||²_{D_τ}

as a quadratic polynomial in α. First, we note that, due to the nonsingularity of
the matrix B_τ^{-1}A, we have, for any nonzero z^k ∈ R^N,

   ||B_τ^{-1}Az^k||_{D_τ} ≠ 0.

Further, for α = 0 we have

   ||z^{k+1}||_{D_τ} = ||z^k||_{D_τ},

and for α = 2, according to (50.13), we have the inequality

   ||z^{k+1}||_{D_τ} < ||z^k||_{D_τ}.

It thus follows that for any τ > 0

   (d/dα) ||z^{k+1}||²_{D_τ} |_{α=0} = -2τ(B_τ^{-1}Az^k, z^k)_{D_τ} < 0

and, hence, for any α ∈ (0, 2) the following inequality, analogous to inequality
(50.15), holds:

   ||z^{k+1}||_{D_τ} < ||z^k||_{D_τ}.

Therefore, we conclude that

   ||T_{α,τ}||_{D_τ} < 1

for any τ > 0 and α ∈ (0, 2). Hence, under the above assumptions the noncommutative
alternating direction method converges for any τ > 0 and α ∈ (0, 2). From this
statement it follows in particular that the Douglas-Rachford method, i.e. (50.10)
with parameter α = 1, converges for all positive-semidefinite matrices A_1 and A_2.
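As an illustration of iteration (50.10), the sketch below runs the method on a tiny two-dimensional system. The matrices are illustrative stand-ins (symmetric positive-definite summands), not data from the text, and the direct 2×2 solver replaces whatever fast solver would be used in practice:

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def solve2(M, b):
    # Direct solve of a 2x2 system by Cramer's rule.
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(b[0] * M[1][1] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - b[0] * M[1][0]) / det]

def adi_step(u, A1, A2, A, f, alpha, tau):
    # One step of (50.10): (E + tau*A1)(E + tau*A2)(u_{k+1} - u_k) = -alpha*tau*(A u_k - f)
    E = [[1.0, 0.0], [0.0, 1.0]]
    B1 = [[E[i][j] + tau * A1[i][j] for j in range(2)] for i in range(2)]
    B2 = [[E[i][j] + tau * A2[i][j] for j in range(2)] for i in range(2)]
    rhs = [-alpha * tau * (ri - fi) for ri, fi in zip(matvec(A, u), f)]
    w = solve2(B1, rhs)            # first fractional solve
    du = solve2(B2, w)             # second fractional solve
    return [ui + di for ui, di in zip(u, du)]

A1 = [[2.0, 0.0], [0.0, 1.0]]      # illustrative symmetric positive-definite parts
A2 = [[1.0, 0.5], [0.5, 2.0]]
A  = [[3.0, 0.5], [0.5, 3.0]]      # A = A1 + A2
f  = [1.0, 2.0]
u  = [0.0, 0.0]
for _ in range(200):
    u = adi_step(u, A1, A2, A, f, alpha=1.0, tau=0.5)   # alpha = 1: Douglas-Rachford

resid = max(abs(ri - fi) for ri, fi in zip(matvec(A, u), f))
assert resid < 1e-8
```

With α = 1 and any τ > 0 the iteration converges here, in line with the statement above for positive-semidefinite A_1 and A_2.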

51. The choice of parameters in the commutative alternating direction method

By optimization of an iterative method we often mean a way of increasing its
convergence rate, either by a special choice of parameters or with the help of
acceleration procedures. For the stationary commutative alternating direction
method considered above, complete optimization presupposes minimizing the spectral
radius of the step matrix T_{α,τ}, i.e. solving the problem

   min_{α,τ} ρ(T_{α,τ}).                                                      (51.1)

This is, however, not always a practicable approach, since only in exceptional cases
can we find ρ(T_{α,τ}) explicitly as a function of the variables α and τ. That is
why in practice one actually introduces a set G ⊂ R^{m+1} containing all the
(m+1)-dimensional vectors Λ_j = (λ_j, λ_j^{(1)}, ..., λ_j^{(m)}), j = 1, ..., N,
whose components are the eigenvalues of the diagonal matrices of (50.4), as well as
a function ψ_{α,τ} such that for any Λ ∈ G the following inequality holds:

   |g_{α,τ}(Λ)| ≤ ψ_{α,τ}(Λ).                                                 (51.2)

If the function

   Ψ(α, τ) = max_{Λ∈G} ψ_{α,τ}(Λ)                                             (51.3)

approximates ρ(T_{α,τ}) sufficiently well for the values (α, τ) of a set H ⊂ R², in
which the best values of the parameters are sought, then instead of (51.1) we may
consider the approximate optimization problem for the iterative method, i.e. solve
the extremum problem

   min_{(α,τ)∈H} Ψ(α, τ).                                                     (51.4)

Of course, the required function Ψ can be constructed in several stages, passing to
new majorants ψ_{α,τ} by expanding the set G and narrowing the set H, i.e. by
imposing additional restrictions on the values of the parameters. An analogous
transition procedure from the extremum problem (51.1) to the approximating problem
(51.4) can be used in other cases as well, in particular for multiparametric
alternating direction methods.
In considering concrete ways of choosing the parameters in the commutative
alternating direction method, we assume additionally that m = 2, α = 2 and
A = A_1 + A_2, where A_1 = B_1 and A_2 = B_2. Then

   g_{2,τ}(Λ) = (1 - τλ^{(1)})(1 - τλ^{(2)}) / [(1 + τλ^{(1)})(1 + τλ^{(2)})],

the set G ⊂ R², and the set H consists of the positive values of τ. Choose

   G = [δ_1, Δ_1] × [δ_2, Δ_2],                                               (51.5)

where

   δ_i = min_{1≤j≤N} λ_j^{(i)},   Δ_i = max_{1≤j≤N} λ_j^{(i)},   i = 1, 2;

take

   Ψ(α, τ) ≡ Ψ(τ) = max_{Λ∈G} |g_{2,τ}(Λ)|;

and assume, in addition, that

   δ_1 = δ_2 = δ > 0,   Δ_1 = Δ_2 = Δ ≥ δ.

Then

   Ψ(τ) = [max_{δ≤λ≤Δ} |1 - τλ| / (1 + τλ)]²,

and the value τ_opt minimizing the function Ψ(τ) is computed according to the
formula

   τ_opt = (δΔ)^{-1/2}.                                                       (51.6)

Hence, we arrive at the estimate

   ρ(T_{2,τ_opt}) ≤ [(√Δ - √δ) / (√Δ + √δ)]².                                 (51.7)

However, if δ_1 ≠ δ_2 or Δ_1 ≠ Δ_2, but δ_1δ_2 ≠ 0, then with the choice
G = [δ, Δ] × [δ, Δ], where δ = min{δ_1, δ_2} and Δ = max{Δ_1, Δ_2}, we again obtain
(51.6) and the estimate (51.7). But if, for example, δ_2 = 0, then, choosing

   Ψ(τ) = max_{δ_1≤λ≤Δ_1} |1 - τλ| / (1 + τλ),                                (51.8)

we obtain τ_opt = (δ_1Δ_1)^{-1/2} and arrive at the estimate

   ρ(T_{2,τ_opt}) ≤ (√Δ_1 - √δ_1) / (√Δ_1 + √δ_1).
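The optimal one-parameter choice (51.6) and the estimate (51.7) are easy to verify numerically. In this sketch δ and Δ are illustrative spectral bounds; q(τ) is the convergence factor |1 - τλ|/(1 + τλ) maximized over the interval (the maximum is attained at an endpoint):

```python
import math

delta, Delta = 0.1, 10.0          # illustrative spectral bounds

def q(tau):
    # max over [delta, Delta] of |1 - tau*lam|/(1 + tau*lam): attained at an endpoint
    return max(abs(1 - tau * l) / (1 + tau * l) for l in (delta, Delta))

tau_opt = 1.0 / math.sqrt(delta * Delta)                        # (51.6)
bound = (math.sqrt(Delta) - math.sqrt(delta)) / (math.sqrt(Delta) + math.sqrt(delta))

# At tau_opt the two endpoint values coincide and equal the bound in (51.7):
assert abs(q(tau_opt) - bound) < 1e-12
# tau_opt is no worse than a sample of other parameter values:
for tau in (0.01, 0.1, 0.5, 2.0, 10.0):
    assert q(tau_opt) <= q(tau) + 1e-12
```

The squared bound in (51.7) then corresponds to the product of the two factors of g_{2,τ}.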

We now consider the simplest multiparametric alternating direction method,

   (E + τ_kA_1)(E + τ_kA_2)(u^{k+1} - u^k) = -2τ_k(Au^k - f),   k = 0, 1, ..., (51.9)

where the parameters τ_k have period p ≥ 1, i.e. τ_k = τ_{k+p} for all k ≥ 0. If we
introduce the matrix D = (Q^{-1})^TQ^{-1}, where the matrix Q is defined by (50.4),
then we get the inequality

   ||z^k||_D ≤ max_{1≤j≤N} ∏_{i=0}^{k-1} (|1 - τ_iλ_j^{(1)}| / (1 + τ_iλ_j^{(1)}))
                                        · (|1 - τ_iλ_j^{(2)}| / (1 + τ_iλ_j^{(2)})) ||z^0||_D.   (51.10)
Hence, for all k = tp, with t a positive integer, we have

   ||z^k||_D ≤ [max_{δ≤λ≤Δ} ∏_{i=0}^{p-1} |1 - τ_iλ| / (1 + τ_iλ)]^{2t} ||z^0||_D,

where

   δ = min_{1≤j≤N, i=1,2} λ_j^{(i)},   Δ = max_{1≤j≤N, i=1,2} λ_j^{(i)}

(δ is assumed to be positive).
We introduce the function

   Ψ(τ) = Ψ(τ_0, ..., τ_{p-1})
        = max_{0≤i≤p-1} max_{m_i≤λ≤m_{i+1}} |1 - τ_iλ| / (1 + τ_iλ),          (51.11)

where δ = m_0 < m_1 < ··· < m_p = Δ are some positive numbers. Obviously,

   ||z^k||_D ≤ [Ψ(τ)]^{2t} ||z^0||_D,

with Ψ(τ) < 1 for all positive τ_0, τ_1, ..., τ_{p-1}. Therefore, the approximate
optimization problem for method (51.9) can be formulated as the problem of
minimizing the function Ψ(τ) on the set H ⊂ R^p, which consists of the vectors with
positive components.
Since

   min_{τ∈H} Ψ(τ) = max_{0≤i≤p-1} min_{τ_i>0} max_{m_i≤λ≤m_{i+1}} |1 - τ_iλ| / (1 + τ_iλ),   (51.12)

it is sufficient to solve the minimax problems

   min_{τ_i>0} max_{m_i≤λ≤m_{i+1}} |1 - τ_iλ| / (1 + τ_iλ),   i = 0, 1, ..., p-1,

in order to minimize the function Ψ(τ) with respect to τ. According to the above,
the solutions of the minimax problems are obtained by choosing

   τ_i = τ_{i,opt} = (m_im_{i+1})^{-1/2},   i = 0, 1, ..., p-1.               (51.13)

Now we have to choose the values m_1, ..., m_{p-1} so that the right-hand side of
(51.12) is minimal. Obviously, it is enough to take

   m_i = δ(Δ/δ)^{i/p},   i = 0, 1, ..., p.

Hence, for τ = τ_opt = (τ_{0,opt}, ..., τ_{p-1,opt})^T we obtain the estimate

   ||z^k||_D ≤ [(1 - h^{1/p}) / (1 + h^{1/p})]^{2t} ||z^0||_D,                (51.14)

where h = (δ/Δ)^{1/2}.
Thus, the average factor by which the D-norm of the error vector decreases per step
of the alternating direction method with the optimized cyclic parameters of period p
is equal to

   q = [(1 - h^{1/p}) / (1 + h^{1/p})]^{2/p}.                                 (51.15)

Assume that h << 1 and let us find for what value of p (51.15) is minimal. To find
p, taking h << 1 into account, we get the equation

   (d/dp)(h^{1/p}/p) = 0;

its solution is the value

   p = -ln h.

With this value of p, (51.14) takes the form

   ||z^k||_D ≤ [(1 - e^{-1}) / (1 + e^{-1})]^{2k/p} ||z^0||_D.                (51.16)

Thus, if, under the assumption h << 1, the problem is to decrease the D-norm of the
initial error vector by a factor 1/ε (ε < 1), then with the values of the parameters
found the method does this in

   k_ε ~ ln(1/h) ln(1/ε)                                                      (51.17)

steps, while the one-parameter method requires

   k_ε ~ h^{-1} ln(1/ε)                                                       (51.18)

steps.
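The gain promised by (51.17) over (51.18) can be observed directly. The sketch below builds the cyclic parameters (51.13) with the geometric break points m_i = δ(Δ/δ)^{i/p} and compares the two step counts; all numerical values are illustrative:

```python
import math

delta, Delta, eps = 1e-6, 1.0, 1e-6      # illustrative bounds and target reduction
h = math.sqrt(delta / Delta)
p = max(1, round(-math.log(h)))          # near-optimal period p ~ -ln h
m = [delta * (Delta / delta) ** (i / p) for i in range(p + 1)]
taus = [1.0 / math.sqrt(m[i] * m[i + 1]) for i in range(p)]    # (51.13)

# Per-cycle contraction of the bound (51.14), and the resulting step counts:
factor = ((1 - h ** (1 / p)) / (1 + h ** (1 / p))) ** 2
k_cyclic = p * math.log(1 / eps) / -math.log(factor)
k_one = math.log(1 / eps) / -math.log(((1 - h) / (1 + h)) ** 2)

assert all(t > 0 for t in taus)
assert k_cyclic < k_one / 10             # cyclic parameters are far cheaper for h << 1
```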
The above method is one of the simplest for the approximate solution of the
optimization problem for the multiparametric alternating direction method. It was
first used by PEACEMAN and RACHFORD [1955], DOUGLAS and RACHFORD [1956] and
DYAKONOV [1961, 1962b]. Further studies on the choice of optimal parameters can be
found in the works by WACHSPRESS [1962], TODD [1967], LEBEDEV [1977] and others.
These results have been thoroughly considered by SAMARSKII and NIKOLAEV [1978].

52. The choice of parameters in the noncommutative alternating direction method


Consider, first, the case when the matrices A_1 and A_2 are symmetric and one of
them, for example the matrix A_1, is positive-definite. Then, according to (50.14),
the estimate

   ρ(T_τ) ≤ ||T_{1,τ}|| = max_{δ_1≤λ≤Δ_1} |1 - τλ| / (1 + τλ)                 (52.1)

holds, where δ_1 and Δ_1 are the minimal and maximal eigenvalues of the matrix A_1,
respectively. Now, if we take

   Ψ(τ) = ||T_{1,τ}|| = max{|1 - τδ_1|/(1 + τδ_1), |1 - τΔ_1|/(1 + τΔ_1)},

then with the choice

   τ_opt = (δ_1Δ_1)^{-1/2}                                                    (52.2)

we come to the solution of the approximate optimization problem for the alternating
direction method (50.11). As a result we get the estimate

   ρ(T_{τ_opt}) ≤ (√Δ_1 - √δ_1) / (√Δ_1 + √δ_1).                              (52.3)
Assume in addition that the matrix A_2 is also positive-definite and that δ_2 and
Δ_2 are its minimal and maximal eigenvalues, respectively. Then we obtain the
estimates

   ||T_{i,τ}|| = max_{δ_i≤λ≤Δ_i} |1 - τλ| / (1 + τλ)
              ≤ max_{δ≤λ≤Δ} |1 - τλ| / (1 + τλ),   i = 1, 2,                  (52.4)

where

   δ = min{δ_1, δ_2},   Δ = max{Δ_1, Δ_2}.

Now, if we take

   Ψ(τ) = [max_{δ≤λ≤Δ} |1 - τλ| / (1 + τλ)]²,

then, according to (50.14) and (52.4), we have

   ρ(T_τ) ≤ Ψ(τ) < 1

for any τ > 0. In this case we get the solution of the approximate optimization
problem for (50.11) by the choice

   τ_opt = (δΔ)^{-1/2}.                                                       (52.5)

As a result we get the estimate

   ρ(T_{τ_opt}) ≤ ||T_{τ_opt}||_{D_τ} ≤ [(√Δ - √δ) / (√Δ + √δ)]².             (52.6)

We now consider the non-self-adjoint case. To construct the majorant Ψ(τ) let us use
SAMARSKII's [1964a] method. Since

   ρ(T_τ) ≤ ||T_τ||_{D_τ} ≤ ||T_{1,τ}|| ||T_{2,τ}||,                          (52.7)

it is sufficient to derive the majorant, for example, for ||T_{1,τ}||, assuming that
A_1 is positive-definite. We introduce the values

   δ_1 = inf_{z∈R^N} (A_1z, z)/(z, z),   Δ_1 = sup_{z∈R^N} ||A_1z||²/(A_1z, z).   (52.8)

Obviously, both values exist and are positive, with δ_1 being the minimal eigenvalue
of the matrix S_1 = ½(A_1 + A_1^T). It is easy to see that

   ||T_{1,τ}||² = sup_{z∈R^N} [||z||² - 2τ(A_1z, z) + τ²||A_1z||²]
                             / [||z||² + 2τ(A_1z, z) + τ²||A_1z||²]
               ≤ sup_{z∈R^N} [||z||² - (2τ - τ²Δ_1)(A_1z, z)]
                             / [||z||² + (2τ + τ²Δ_1)(A_1z, z)].
Hence, using the inequality (A_1z, z) ≤ Δ_1||z||², we obtain ||T_{1,τ}||² ≤ Ψ²(τ),
where

   Ψ²(τ) = max_{δ_1≤λ≤Δ_1} [1 - (2τ - τ²Δ_1)λ] / [1 + (2τ + τ²Δ_1)λ].

The analysis of the function Ψ(τ) shows that for τ = (δ_1Δ_1)^{-1/2} it reaches its
minimal value

   Ψ_min = [(√Δ_1 - √δ_1) / (√Δ_1 + √δ_1)]^{1/2}.

Thus, if the matrix A_1 is positive-definite, then as the solution of the
approximate optimization problem we can choose the parameter τ_opt = (δ_1Δ_1)^{-1/2},
for which the estimate

   ||T_{1,τ_opt}|| ≤ [(√Δ_1 - √δ_1) / (√Δ_1 + √δ_1)]^{1/2}

holds. Assuming that the matrix A_2 is also positive-definite in the sense of (52.8),
then for τ_opt = (δΔ)^{-1/2} we obtain the estimate

   ||T_{τ_opt}||_{D_τ} ≤ (√Δ - √δ) / (√Δ + √δ),                               (52.9)

where δ = min(δ_1, δ_2) and Δ = max(Δ_1, Δ_2), and the values δ_2 and Δ_2 are
defined as in (52.8) but for the matrix A_2.

53. The convergence acceleration procedures for the alternating direction method

We introduce a new matrix

   D_τ = A^T[(E + τA_1)^{-1}]^T(E + τA_1)^{-1}A.                              (53.1)

Then, by simple transformations analogous to those of Section 50, we can show that

   ||z^{k+1}||²_{D_τ} / ||z^k||²_{D_τ}
      = [||x^k||² - 2τ(A_1x^k, x^k) + τ²||A_1x^k||²]
        / [||x^k||² + 2τ(A_1x^k, x^k) + τ²||A_1x^k||²]
      · [||y^k||² - 2τ(A_2y^k, y^k) + τ²||A_2y^k||²]
        / [||y^k||² + 2τ(A_2y^k, y^k) + τ²||A_2y^k||²],

where

   x^k = (E + τA_1)^{-1}(E - τA_2)y^k,
   y^k = (E + τA_2)^{-1}(E + τA_1)^{-1}Az^k,

and, therefore, for arbitrary positive-semidefinite matrices A_1 and A_2 we have

   ||z^{k+1}||_{D_τ} ≤ ||T_{1,τ}|| ||T_{2,τ}|| ||z^k||_{D_τ} ≤ ||z^k||_{D_τ}.

From this and from the equality

   ||z^{k+1}||²_{D_τ} = ||z^k||²_{D_τ} - 2ατ(B_τ^{-1}Az^k, z^k)_{D_τ}
                        + α²τ²||B_τ^{-1}Az^k||²_{D_τ}

it follows that the matrix B_τ^{-1}A is D_τ-positive-definite.


In view of this fact we can solve system (50.1) not by (50.10) or (50.11) but by the
following iterative method:

   (E + τA_1)(E + τA_2)(u^{k+1} - u^k) = -τβ_k(Au^k - f),                     (53.2)

in which the parameter β_k is chosen according to the formula

   β_k = (B_τ^{-1}Az^k, z^k)_{D_τ} / (τ||B_τ^{-1}Az^k||²_{D_τ})
       = (AB_τ^{-1}ξ^k, ξ^k)_{D̃_τ} / (τ||AB_τ^{-1}ξ^k||²_{D̃_τ}),            (53.3)

where ξ^k = Au^k - f is the residual vector and

   D̃_τ = [(E + τA_1)^{-1}]^T(E + τA_1)^{-1}.

The parameter β_k is chosen, in correspondence with the descent method, from the
condition that the D_τ-norm of the error vector z^{k+1} be minimal. As is known
(see, for instance, MARCHUK and KUZNETSOV [1972]), this method converges, with

   L(T) ≤ ||T_τ||_{D_τ},                                                      (53.4)

where L(T) is the Lipschitz constant of the step operator T of method (53.2)-(53.3).

Note that a similar procedure with the inner product generated by the matrix D_τ of
Section 50 is not suitable here.

Among the different versions of the alternating direction method, the case
A_1 = A_2^T considered in Section 50 stands out as rather efficient. The step matrix
T_τ of method (50.10) is then A-self-adjoint and the matrix B_τ of (50.16) is
symmetric and positive-definite, which is the reason for using acceleration
procedures based on the Chebyshev methods and on the generalized conjugate gradient
method.

Let it be known that the eigenvalues of the matrix B_τ^{-1}A of the symmetric
alternating direction method belong to an interval [a, b] with 0 < a < b. For
example, if we use the results of the previous section, then it follows from the
estimate (52.9) that the interval boundaries can be computed according to the
formulae

   a = δ√Δ / (√δ + √Δ),   b = Δ√δ / (√δ + √Δ).

Under this assumption we can use the following Chebyshev semi-iterative method (see
VARGA [1962]) with the preconditioning matrix B_τ for the solution of system (50.1):

   B_τ(u^1 - u^0) = -α_1(Au^0 - f),
   B_τ(u^{k+1} - u^k) = -α_{k+1}(Au^k - f) + β_{k+1}B_τ(u^k - u^{k-1}),       (53.5)
   k = 1, 2, ...,

where α_1 = 2/(b + a) and, for k ≥ 1,

   α_{k+1} = (4/(b - a)) C_k(η)/C_{k+1}(η),   β_{k+1} = C_{k-1}(η)/C_{k+1}(η),
   η = (b + a)/(b - a),                                                       (53.6)

and C_k(t) is the Chebyshev polynomial of the first kind of order k.
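The recurrences (53.5)-(53.6) can be exercised on a small example. In this sketch the preconditioner is simplified to B_τ = E (an assumption made only to keep the example tiny), so [a, b] must enclose the spectrum of A itself; the 2×2 system with known eigenvalues 1 and 3 is illustrative:

```python
def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]

A = [[2.0, 1.0], [1.0, 2.0]]          # eigenvalues 1 and 3
f = [1.0, 0.0]
a, b = 1.0, 3.0
eta = (b + a) / (b - a)

u_prev = [0.0, 0.0]                                   # u^0
r = [ri - fi for ri, fi in zip(matvec(A, u_prev), f)]
u = [ui - (2.0 / (b + a)) * ri for ui, ri in zip(u_prev, r)]   # u^1, alpha_1 = 2/(b+a)
C_prev, C_cur = 1.0, eta                              # C_0(eta), C_1(eta)
for k in range(1, 30):
    C_next = 2.0 * eta * C_cur - C_prev               # Chebyshev recurrence
    alpha = (4.0 / (b - a)) * C_cur / C_next          # (53.6)
    beta = C_prev / C_next
    r = [ri - fi for ri, fi in zip(matvec(A, u), f)]
    u_next = [ui - alpha * ri + beta * (ui - upi)
              for ui, ri, upi in zip(u, r, u_prev)]   # (53.5) with B = E
    u_prev, u = u, u_next
    C_prev, C_cur = C_cur, C_next

resid = max(abs(ri - fi) for ri, fi in zip(matvec(A, u), f))
assert resid < 1e-10
```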
According to MARCHUK and KUZNETSOV [1972] and GOLUB, CONCUS and O'LEARY [1976] we
can also use the two-term and three-term generalized conjugate gradient methods with
the preconditioning matrix B_τ^{-1} for the solution of system (50.1). The two-term
formulae of the method have the form:

   g^1 = B_τ^{-1}ξ^0,
   g^k = B_τ^{-1}ξ^{k-1} + γ_kg^{k-1},   k > 1,
   u^k = u^{k-1} - α_kg^k,   k = 1, 2, ...,                                   (53.7)
   α_k = (B_τ^{-1}ξ^{k-1}, ξ^{k-1}) / (Ag^k, g^k),
   γ_k = (B_τ^{-1}ξ^{k-1}, ξ^{k-1}) / (B_τ^{-1}ξ^{k-2}, ξ^{k-2}),   k = 2, 3, ...;

and the three-term formulae (see REID [1971]) have the form:

   u^{k+1} = u^k - (1/q_k)[B_τ^{-1}ξ^k - e_{k-1}(u^k - u^{k-1})],
   e_{-1} = 0,   q_k = (AB_τ^{-1}ξ^k, B_τ^{-1}ξ^k)/(B_τ^{-1}ξ^k, ξ^k) - e_{k-1},   (53.8)
   e_k = q_k (B_τ^{-1}ξ^{k+1}, ξ^{k+1})/(B_τ^{-1}ξ^k, ξ^k),   k = 0, 1, 2, ....

Here ξ^k = Au^k - f are the residual vectors.
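A compact runnable sketch of a two-term preconditioned conjugate gradient iteration of the general shape of (53.7); the 2×2 data and the diagonal matrix standing in for B_τ^{-1} are illustrative assumptions, and for a 2×2 symmetric positive-definite system the method terminates in at most two steps:

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]

A = [[3.0, 1.0], [1.0, 2.0]]              # illustrative SPD matrix
f = [1.0, 1.0]
Binv = [[1 / 3.0, 0.0], [0.0, 1 / 2.0]]   # diagonal stand-in for B_tau^{-1}

u = [0.0, 0.0]
xi = [ri - fi for ri, fi in zip(matvec(A, u), f)]    # residual xi^0 = A u - f
g = matvec(Binv, xi)                                  # g^1 = B^{-1} xi^0
rho = sum(zi * xii for zi, xii in zip(g, xi))
for k in range(2):                                    # exact in <= 2 steps for 2x2
    Ag = matvec(A, g)
    alpha = rho / sum(agi * gi for agi, gi in zip(Ag, g))
    u = [ui - alpha * gi for ui, gi in zip(u, g)]
    xi = [xii - alpha * agi for xii, agi in zip(xi, Ag)]
    z = matvec(Binv, xi)
    rho_new = sum(zi * xii for zi, xii in zip(z, xi))
    g = [zi + (rho_new / rho) * gi for zi, gi in zip(z, g)]
    rho = rho_new

resid = max(abs(ri - fi) for ri, fi in zip(matvec(A, u), f))
assert resid < 1e-12
```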

54. Generalizations

The best known generalization of the original alternating direction method is the
following one:

   (R + τA_1)R^{-1}(R + τA_2)(u^{k+1} - u^k) = -ατ(Au^k - f),
   k = 0, 1, 2, ...,                                                          (54.1)

where R is, generally speaking, an arbitrary symmetric and positive-definite matrix
(see WACHSPRESS and HABETLER [1960]). It is easy to see that this method is
completely equivalent to the alternating direction method

   (E + τÃ_1)(E + τÃ_2)(ũ^{k+1} - ũ^k) = -ατ(Ãũ^k - f̃),
   k = 0, 1, ...,                                                             (54.2)

where

   Ã_1 = R^{-1/2}A_1R^{-1/2},   Ã_2 = R^{-1/2}A_2R^{-1/2},
   Ã = Ã_1 + Ã_2,   ũ = R^{1/2}u,   f̃ = R^{-1/2}f.                           (54.3)

Hence, all the abovementioned facts for methods (50.10) and (50.11) apply to method
(54.1) as well. Of course, the role of the matrix R should be taken into account,
i.e. in deriving the formulae for the optimal parameters the matrices Ã_1 and Ã_2
of (54.3) should be used instead of A_1 and A_2 of (54.1).
We now consider the case A_1 = A_2^T. To this end, we first rewrite (54.1) with
α = 2 in the following equivalent form:

   (R + τA_1)u^{k+1/2} = (R - τA_2)u^k + τf,
   (R + τA_2)u^{k+1} = (R - τA_1)u^{k+1/2} + τf,                              (54.4)

and then, using

   L = ½R - A_1,   L^T = ½R - A_2,   ω = 2τ/(2 + τ),                          (54.5)

we rewrite (54.4) in the form

   ((1/ω)R - L)(u^{k+1/2} - u^k) = -(Au^k - f),
   ((1/ω)R - L^T)(u^{k+1} - u^{k+1/2}) = -(Au^{k+1/2} - f).                   (54.6)

If the above assumptions hold, and

   A = R - L - L^T,                                                           (54.7)

with τ ∈ (0, +∞) corresponding to ω ∈ (0, 2), then it is easy to see that (54.6) is
a generalization of the symmetric successive over-relaxation method corresponding to
the splitting (54.7) of the matrix A. Hence, the theory developed by LYNN [1964]
(see also YOUNG [1971]) can also be used for optimizing method (54.6) by the choice
of the parameter ω.
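Under the reconstruction of (54.5) used here, the change of variables sends τ ∈ (0, +∞) to ω = 2τ/(2 + τ) ∈ (0, 2), monotonically; this is what makes the correspondence with the over-relaxation parameter work. A quick numerical check (the τ samples are arbitrary):

```python
# tau in (0, +inf) maps to omega = 2*tau/(2 + tau) in (0, 2), monotonically.
taus = [0.01, 0.1, 1.0, 10.0, 1000.0]
omegas = [2 * t / (2 + t) for t in taus]
assert all(0 < w < 2 for w in omegas)
assert all(x < y for x, y in zip(omegas, omegas[1:]))   # monotone increasing
```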
CHAPTER XIV

Splitting and Decomposition Methods for Variational Problems

Many problems of economics, optimal control, hydrodynamics, etc. lead to variational
problems with restrictions (DUVAUT and LIONS [1972], BENSOUSSAN, LIONS and TEMAM
[1975], GLOWINSKI, LIONS and TREMOLIERES [1979]). To solve these problems
BENSOUSSAN et al. [1975] developed methods, some of which (for example,
decomposition methods) are based on splitting methods. In this chapter we will
consider the general scheme of applying the splitting and decomposition methods to
variational problems. We will do this following BENSOUSSAN et al. [1975] and
without specifying the type of problems to be solved.

55. Splitting and decomposition methods for classical variational problems

Let the initial stationary problem, after approximation in all its variables, be
reduced to the equation

   Aφ = f,                                                                    (55.1)

considered in a finite-dimensional Hilbert space H with inner product (·,·)_H and
norm ||·||_H = (·,·)_H^{1/2}. The matrix A is assumed to be symmetric,
positive-definite and representable in the form

   A = Σ_{α=1}^n A_α,                                                         (55.2)

where the A_α are symmetric, positive-definite matrices. It is known that solving
equation (55.1) is equivalent to solving the following classical problem of the
calculus of variations:

   J(φ) = inf_{v∈H} J(v),                                                     (55.3)

where

   J(v) = Σ_{α=1}^n J_α(v),   J_α(v) = ½(A_αv, v)_H - (f_α, v)_H,

and the {f_α} are chosen so that Σ_{α=1}^n f_α = f. According to the method of
stationarization (see Section 37) we can consider the evolutionary problem

   ∂φ/∂t + Aφ = f,   t ∈ Ω_t = (0, T),
   φ = g   for t = 0,   g ∈ H,                                                (55.4)

and, for large values of T, approximate the solution of (55.1) by the solution φ(T)
of problem (55.4). The numerical solution of (55.4), in turn, can be realized by
the splitting method. To this end, we represent f as

   f = Σ_{α=1}^n f_α,   f_α ∈ H,                                              (55.5)

divide Ω_t into N intervals of lengths τ_0, τ_1, ..., τ_{N-1} such that

   τ_0 + τ_1 + ··· + τ_{N-1} = σ_N ≥ T,

and use a splitting scheme of type


   (φ^{j+α/n} - φ^{j+(α-1)/n}) / τ_j + A_αφ^{j+α/n} = f_α,
   0 ≤ j ≤ N-1,   1 ≤ α ≤ n,                                                  (55.6)
   φ^0 = g.

Each realization step of scheme (55.6) is equivalent to the minimization in H of the
functional

   (1/(2τ_j)) ||v - φ^{j+(α-1)/n}||²_H + J_α(v),
   0 ≤ j ≤ N-1,   1 ≤ α ≤ n,                                                  (55.7)
   φ^0 = g.

The algorithm constructed by using splitting scheme (55.6) is, in fact, the
decomposition method for solving the variational problem (55.3): it reduces (55.3)
to a sequence of problems of minimizing the functionals (55.7). Note that the
convergence of the decomposition algorithm (55.7) follows directly from the
convergence theorems for scheme (55.6), which have been considered in the foregoing
chapters.

56. Decomposition of a general variational problem

Based on algorithm (55.7), we now formulate the decomposition method for the
following general variational problem in a Hilbert space H:

   J(φ) = inf_{v∈H} J(v),                                                     (56.1)

where J: H → R¹ is a continuous function with a lower bound. We assume that J(v)
can be represented as

   J = Σ_{α=1}^n J_α,                                                         (56.2)

where the J_α: H → R¹ are also continuous functions with a lower bound. We
introduce a sequence of parameters τ_j, j = 0, 1, ..., N-1, and define elements

   φ^{j+α/n},   j = 0, 1, ..., N-1,   α = 1, ..., n,                          (56.3)

where φ^0 = g is an arbitrary element of H. If φ^{j+(α-1)/n} is known, then
φ^{j+α/n} is determined as a solution of the variational problem

   inf_{v∈H} { (1/(2τ_j)) ||v - φ^{j+(α-1)/n}||²_H + J_α(v) }.                (56.4)

This problem has at least one solution. We denote by qpj+a/n one of these solutions
and compute further pJ+(+ 1)/n, (pj+(+2)/n...
It is obvious that the proposed algorithm for solving problem (56.1) is efficient
when it is easier to solve a sequence of problems (56.4) than to solve one of (56.1)
which depends on the explicit form of J and J. It is also easily seen that the given
algorithm is of great interest for problems of type (55.3), obtained by discretizing
elliptic boundary value problems, etc. Note that the splitting method theory does
not allow us to conclude that algorithm (56.4) converges (see Section 58).
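Each step of (56.4) is a proximal minimization of a single J_α anchored at the previous fractional iterate. The following minimal sketch runs this iteration for a sum of two one-dimensional quadratics J_α(v) = ½c_α(v - t_α)²; the step sizes are taken with τ_j → 0 and Σ τ_j → ∞, in the spirit of the convergence conditions of Section 58, and all numbers are illustrative:

```python
def prox(c, t, tau, anchor):
    # argmin_v (1/(2*tau))*(v - anchor)**2 + 0.5*c*(v - t)**2
    return (anchor + tau * c * t) / (1.0 + tau * c)

c, t = [1.0, 3.0], [0.0, 4.0]   # minimizer of J_1 + J_2: (c1*t1 + c2*t2)/(c1 + c2) = 3
phi = 10.0                      # phi^0 = g
for j in range(4000):
    tau_j = 1.0 / (j + 1) ** 0.5        # sum tau_j -> infinity, tau_j -> 0
    for a in range(2):                  # the two fractional steps of sweep j
        phi = prox(c[a], t[a], tau_j, phi)

assert abs(phi - 3.0) < 0.05
```

With a fixed step size the iterates would converge to an O(τ)-biased point; the decreasing steps remove that splitting bias.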

57. A variational problem with restrictions

Let K be a closed convex set in H (the set of restrictions) and, as in Section 56,
let J: H → R be a continuous function satisfying the assumptions of Section 56.
Instead of (56.1) we consider the problem

   inf_{v∈K} J(v),                                                            (57.1)

that is, a variational problem with restrictions. We also assume that

   K = ∩_{α=1}^n K_α,                                                         (57.2)

where the K_α ⊂ H are closed sets which correspond to the individual restrictions.
The following algorithm is, in fact, a generalization of algorithm (56.4) for
problem (57.1): starting with some element φ^0 = g ∈ K, find φ^{j+α/n} ∈ K_α as
a solution of the problem

   inf_{v∈K_α} { (1/(2τ_j)) ||v - φ^{j+(α-1)/n}||²_H + J_α(v) }.              (57.3)

Algorithm (57.3) is a general form of the decomposition method for a variational
problem with restrictions.

58. The convergence of decomposition algorithms

We now consider a decomposition algorithm in an infinite-dimensional space and
formulate the conditions for its convergence.

Let Φ be a real reflexive Banach space with norm ||·||_Φ and let K be a convex,
closed, non-empty subset of Φ. Let J be a real functional defined on K such that
J is strictly convex and lower semicontinuous on K (in the weak or strong topology
of Φ). We also assume that either the set K is bounded or lim J(u) = +∞ when
||u||_Φ → ∞, u ∈ K. It is known that under these conditions the functional J has
a lower bound on K and there exists a unique element φ ∈ K on which J(v) reaches
its minimum:

   J(φ) ≤ J(v),   v ∈ K.                                                      (58.1)

Further, let another norm |·|_H be defined on Φ. Completing Φ in this norm we
obtain a Hilbert space H. We assume that

   Φ = ∩_{α=1}^n Φ_α,

where the Φ_α are reflexive Banach spaces with norms ||·||_{Φ_α} and the inclusions
Φ ⊂ Φ_α ⊂ H are continuous (α = 1, ..., n). We introduce the following assumption:

   K = ∩_{α=1}^n K_α,                                                         (58.2)

where K_α is a convex subset of Φ_α. We also assume that J(v) can be represented
as

   J(v) = Σ_{α=1}^n J_α(v),   v ∈ K,                                          (58.3)

where J_α(v) is a real, convex, lower semicontinuous functional defined on K_α (in
the topology of Φ_α) and lim J_α(v) = +∞ when ||v||_{Φ_α} → ∞, v ∈ K_α (if K_α is
not bounded in Φ_α).
We formulate the decomposition algorithm for solving (58.1). Let τ_0, τ_1, ... be
a sequence of strictly positive numbers. If φ^0 is an arbitrary element of K, we
define a family of elements φ^{j+α/n}, α = 1, ..., n, j = 0, 1, ..., where the
element φ^{j+α/n} is to be found (after φ^0, ..., φ^{j+(α-1)/n}) as an element of
K_α satisfying the inequality

   (1/(2τ_j)) |φ^{j+α/n} - φ^{j+(α-1)/n}|²_H + J_α(φ^{j+α/n})
      ≤ (1/(2τ_j)) |v - φ^{j+(α-1)/n}|²_H + J_α(v)                            (58.4)

for any element v ∈ K_α. Problem (58.4) has a unique solution φ^{j+α/n} ∈ K_α.
From φ^0, ..., φ^{j+α/n} we form the elements

   φ_N^{α/n} = (1/σ_N) Σ_{j=0}^N τ_jφ^{j+α/n},                                (58.5)

where

   σ_N = Σ_{j=0}^N τ_j,   α = 1, ..., n,   N = 0, 1, ....
If

   lim_{N→∞} σ_N = +∞,   lim_{N→∞} max_{0≤j≤N} τ_j = 0,

then

   lim_{N→∞} φ_N^{α/n} = φ

in the weak topology of Φ_α for 1 ≤ α ≤ n, where φ is the solution of problem
(58.1). In the work of BENSOUSSAN, LIONS and TEMAM ([1975], p. 200) one can find
the proof of this statement, as well as a number of concrete realizations of the
above decomposition algorithms for solving variational problems of mathematical
physics, along with the corresponding splitting methods.
PART 3

Applications of Splitting Methods to Problems of Mathematical Physics

In this part we will consider the realization of the general statements of the
splitting methods from the previous chapters for some concrete problems of
mathematical physics. The means of approximating the problems in the space, angular
and other variables are regarded as already available (see Part 1). All this is
supposed to give a complete notion of the methods used for approximating the
solutions of problems by splitting.
CHAPTER XV

The Heat Conduction Equation

We considered the major algorithms of the splitting and alternating direction
methods in Part 1 and used the parabolic equation for illustration (Sections 12, 17,
22, 25, 26, 29, 36). In this chapter we will consider schemes for a more general
form of the parabolic equation.

59. The two-cycle componentwise splitting scheme for a parabolic equation with three spatial variables

We consider the two-cycle componentwise splitting scheme as applied to a diffusion
model for the propagation of pollutants in the atmosphere from pollution sources
(MARCHUK [1982]). We write the problem in the following form:

   ∂φ/∂t + Σ_{α=1}^3 A_αφ = f   in Ω × Ω_t,                                   (59.1)

where

   A_1φ = ∂(uφ)/∂x - μ ∂²φ/∂x² + ⅓σφ,
   A_2φ = ∂(vφ)/∂y - μ ∂²φ/∂y² + ⅓σφ,                                         (59.2)
   A_3φ = ∂(wφ)/∂z - ∂/∂z (ν ∂φ/∂z) + ⅓σφ.
The following notations are used in (59.1) and (59.2): x, y, z are the components of
the Cartesian system of coordinates (the x-axis is directed to the east, the y-axis
to the north, and the z-axis vertically upward); Ω = {0 < x < a, 0 < y < b,
0 < z < H} is a domain with boundary ∂Ω consisting of the cylinder's side surface
∂Ω_s together with the lower base ∂Ω_0 (for z = 0) and the upper base ∂Ω_H (for
z = H); t ∈ Ω_t = (0, T) denotes time; φ is the intensity of the aerosol substance
migrating together with the air flow in the atmosphere; u, v, w are the components
of the wind velocity along the x-, y- and z-axes, respectively; μ = const > 0 and
ν(z) > 0 are the horizontal and vertical turbulence coefficients; σ = const > 0 is
the coefficient of absorption of the substance by the environment; f(x, y, z, t) is
the function describing the sources of the polluting substance φ under
consideration.
It is assumed that, for the components of the air flow velocity, the mass
conservation law represented by the continuity equation holds:

   ∂u/∂x + ∂v/∂y + ∂w/∂z = 0   in Ω × Ω_t,                                    (59.3)

and, besides, the conditions

   w = 0   for z = 0, z = H                                                   (59.4)

hold. The values of the functions u, v and w are considered to be known in
Ω × Ω_t. The initial condition

   φ = g   for t = 0                                                          (59.5)

is added to equation (59.1), and the boundary conditions on ∂Ω are:

   φ = 0         on ∂Ω_s,
   ∂φ/∂z = αφ    on ∂Ω_0,                                                     (59.6)
   ∂φ/∂z = 0     on ∂Ω_H,

where g is a given function and α ≥ 0 is a function characterizing the interaction
of the pollutants with the Earth's surface. MARCHUK [1982] noted that more general
boundary conditions can be used instead of (59.6). The solution φ and the
right-hand side f belong, respectively, to Hilbert spaces Φ and F of sufficiently
smooth functions.
A direct verification allowing for the boundary conditions (59.6) shows that the
operators A_1, A_2 and A_3 are positive-definite, i.e. A_α > 0 (α = 1, 2, 3).
A uniform grid in x, y and z with steps Δx, Δy and Δz, respectively, is introduced
in Ω. The operators A_α are approximated with second-order accuracy by their
difference analogues Λ_α of the form (see MARCHUK [1982])

   Λ_1φ = (u_{i+1/2,j,k}φ_{i+1,j,k} - u_{i-1/2,j,k}φ_{i-1,j,k}) / (2Δx)
          - (μ/Δx²)(φ_{i+1,j,k} - 2φ_{i,j,k} + φ_{i-1,j,k}) + ⅓σφ_{i,j,k},

   Λ_2φ = (v_{i,j+1/2,k}φ_{i,j+1,k} - v_{i,j-1/2,k}φ_{i,j-1,k}) / (2Δy)
          - (μ/Δy²)(φ_{i,j+1,k} - 2φ_{i,j,k} + φ_{i,j-1,k}) + ⅓σφ_{i,j,k},    (59.7)

   Λ_3φ = (w_{i,j,k+1/2}φ_{i,j,k+1} - w_{i,j,k-1/2}φ_{i,j,k-1}) / (2Δz)
          - (1/Δz²)[ν_{k+1/2}(φ_{i,j,k+1} - φ_{i,j,k}) - ν_{k-1/2}(φ_{i,j,k} - φ_{i,j,k-1})]
          + ⅓σφ_{i,j,k},

   i = 1, 2, ..., I,   j = 1, 2, ..., J,   k = 1, 2, ..., K.


It is assumed that the difference analogues of u, v and w satisfy the difference
analogue of the continuity equation,

   (u_{i+1/2,j,k} - u_{i-1/2,j,k})/Δx + (v_{i,j+1/2,k} - v_{i,j-1/2,k})/Δy
      + (w_{i,j,k+1/2} - w_{i,j,k-1/2})/Δz = 0,                               (59.8)

with

   w_{i,j,1/2} = w_{i,j,K+1/2} = 0.                                           (59.9)
Due to the approximation of the boundary conditions (59.6), and allowing for (59.9),
the expressions (59.7) are somewhat changed in the vicinity of the boundary points:

   φ_{i,j,k} = 0   for i = 0, i = I+1,
   φ_{i,j,k} = 0   for j = 0, j = J+1,
   (φ_{i,j,1} - φ_{i,j,0})/Δz = αφ_{i,j,0}   for k = 1,                       (59.10)
   (φ_{i,j,K+1} - φ_{i,j,K})/Δz = 0          for k = K.

It is easily shown that the difference operators Λ_1, Λ_2, Λ_3, acting in the
Hilbert space Φ_h of grid functions, are positive-definite on the functions
φ_{i,j,k} that satisfy the boundary conditions (59.10). We write the obtained
system of ordinary differential equations with respect to φ = {φ_{i,j,k}} in the
form

   dφ/dt + Λφ = f   in Ω_h × Ω_t,   Λ = Σ_{α=1}^3 Λ_α,                        (59.11)
   φ = g   for t = 0.
To solve (59.11) we use the two-cycle componentwise splitting method (τ is the time
step):

   (φ^{n+1/6} - φ^n)/(τ/2) + Λ_1(φ^{n+1/6} + φ^n)/2 = 0,
   (φ^{n+2/6} - φ^{n+1/6})/(τ/2) + Λ_2(φ^{n+2/6} + φ^{n+1/6})/2 = 0,
   (φ^{n+3/6} - φ^{n+2/6})/(τ/2) + Λ_3(φ^{n+3/6} + φ^{n+2/6})/2 = f^{n+1/2},
   (φ^{n+4/6} - φ^{n+3/6})/(τ/2) + Λ_3(φ^{n+4/6} + φ^{n+3/6})/2 = f^{n+1/2},  (59.12)
   (φ^{n+5/6} - φ^{n+4/6})/(τ/2) + Λ_2(φ^{n+5/6} + φ^{n+4/6})/2 = 0,
   (φ^{n+1} - φ^{n+5/6})/(τ/2) + Λ_1(φ^{n+1} + φ^{n+5/6})/2 = 0,
   n = 0, 1, 2, ....
Scheme (59.12) has second-order accuracy in τ and is absolutely stable. Each
splitting step is realized by the scalar sweep method (GODUNOV and RYABENKII [1977],
pp. 51-53).
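Each fractional step of (59.12) reduces to tridiagonal systems along grid lines, which the scalar sweep (Thomas) algorithm solves in O(K) operations per line. A generic, self-contained sketch, with an illustrative test system:

```python
def sweep(a, b, c, d):
    # Solves the tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
    # with a[0] = c[-1] = 0 (forward elimination followed by back substitution).
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Check on a small diagonally dominant system with known solution [1, 2, 3, 4]:
a = [0.0, -1.0, -1.0, -1.0]
b = [4.0, 4.0, 4.0, 4.0]
c = [-1.0, -1.0, -1.0, 0.0]
x_true = [1.0, 2.0, 3.0, 4.0]
d = [2.0, 4.0, 6.0, 13.0]        # the tridiagonal matrix applied to x_true
x = sweep(a, b, c, d)
err = max(abs(xi - ti) for xi, ti in zip(x, x_true))
assert err < 1e-12
```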

60. Schemes of second-order accuracy for p-dimensional parabolic equations without mixed derivatives
We consider the p-dimensional heat conduction equation with constant coefficients
written in nondimensional form:

   ∂φ/∂t + Aφ = f   in Ω × Ω_t,
   A = Σ_{α=1}^p A_α,   A_αφ = -∂²φ/∂x_α²,                                    (60.1)

where f = f(x, t) is a known function of the coordinates x = (x_1, x_2, ..., x_p)
and the time t. Additionally, the initial condition

   φ = g   for t = 0                                                          (60.2)

is set, along with the boundary condition

   φ = 0   on ∂Ω × Ω_t,                                                       (60.3)

which does not restrict generality; the function g(x) is considered to be known.
We assume that $\Omega$ is a p-dimensional parallelepiped

$$
\Omega = \{0 < x_\alpha < l_\alpha,\ \alpha = 1, \ldots, p\},
$$

and we obtain the grid domain $\Omega_h = \{x_{i_\alpha}\}$, where $x_{i_\alpha} = i_\alpha h_\alpha$ ($i_\alpha = 0, 1, \ldots, I_\alpha + 1$), and the step $h_\alpha$ for the variable $x_\alpha$ does not depend on the number $i_\alpha$ of the grid point. On smooth solutions we approximate the operator $A_\alpha$ with second-order accuracy by its discrete analogue $\Lambda_\alpha$ with the help of the relationship

$$
\Lambda_\alpha\varphi = -\frac{1}{h_\alpha^2}\bigl(\varphi_{i_\alpha+1} - 2\varphi_{i_\alpha} + \varphi_{i_\alpha-1}\bigr), \qquad 1 \le i_\alpha \le I_\alpha. \tag{60.4}
$$
The differential operators $A_\alpha$ and their difference analogues $\Lambda_\alpha$ are, on the functions satisfying the condition (60.3), positive-definite in the corresponding spaces $\Phi = H$ and $\Phi = H_h$. Thus, the initial problem has been reduced to a system of ordinary differential equations (the index h is everywhere omitted)

$$
\frac{d\varphi}{dt} + \Lambda\varphi = f \quad \text{in } \Omega_h \times \Omega_t, \qquad \Lambda = \sum_{\alpha=1}^{p}\Lambda_\alpha, \qquad \varphi = 0 \quad \text{in } \partial\Omega_h \times \Omega_t, \tag{60.5}
$$

with the initial condition (60.2).


To solve this problem we use the scheme with the splitting (factorized) operator (DYAKONOV [1971d]). This scheme has the form

$$
B\,\frac{\varphi^{n+1} - \varphi^n}{\tau} + \Lambda\varphi^n = f^{n+1/2}, \qquad \varphi^0 = g, \qquad \varphi_{i_\alpha} = 0 \quad \text{for } i_\alpha = 0,\ i_\alpha = I_\alpha + 1, \tag{60.6}
$$

where $f^{n+1/2}$ is a second-order approximation in time of the right-hand side $f$, and the operator $B$ has the form

$$
B = \prod_{\alpha=1}^{p}(E + \sigma\tau\Lambda_\alpha) = E + \sigma\tau\Lambda + (\sigma\tau)^2\sum_{\alpha<\beta}\Lambda_\alpha\Lambda_\beta + \cdots + (\sigma\tau)^p\,\Lambda_1\Lambda_2\cdots\Lambda_p, \tag{60.7}
$$

with $0 \le \sigma \le 1$ arbitrary. In particular, for $p = 2$ we have

$$
B = E + \sigma\tau\Lambda + \sigma^2\tau^2\Lambda_1\Lambda_2. \tag{60.8}
$$

In view of (60.7), equation (60.6) can be rewritten in the equivalent form

$$
\frac{\varphi^{n+1} - \varphi^n}{\tau} + \Lambda\bigl(\sigma\varphi^{n+1} + (1-\sigma)\varphi^n\bigr) + Q\,\frac{\varphi^{n+1} - \varphi^n}{\tau} = f^{n+1/2}, \tag{60.9}
$$

where

$$
Q = \sum_{k=2}^{p}(\sigma\tau)^k \sum_{\alpha_1<\alpha_2<\cdots<\alpha_k}\Lambda_{\alpha_1}\Lambda_{\alpha_2}\cdots\Lambda_{\alpha_k}. \tag{60.10}
$$

Thus, if the inequality

$$
\Bigl\| Q\,\frac{\varphi^{n+1} - \varphi^n}{\tau}\Bigr\| \le M\tau^2, \qquad M \text{ constant}, \tag{60.11}
$$

holds, then scheme (60.6) approximates the initial problem (60.5) with the same order of accuracy as the scheme

$$
\frac{\varphi^{n+1} - \varphi^n}{\tau} + \Lambda\bigl(\sigma\varphi^{n+1} + (1-\sigma)\varphi^n\bigr) = f^{n+1/2}. \tag{60.12}
$$

As is known (DYAKONOV [1971d]), scheme (60.12) has a second-order approximation in $\tau$ for $\sigma = \frac{1}{2}$ and a first-order approximation otherwise. For $\sigma = \frac{1}{2}$ the following inequality is sufficient:

$$
\Bigl\| Q\,\frac{\varphi^{n+1} - \varphi^n}{\tau}\Bigr\| \le M\tau.
$$

Scheme (60.6) is absolutely stable for $\sigma \ge \frac{1}{2}$, and for the solution the following estimate (DYAKONOV [1971d]) is valid:

$$
\|\varphi^{n+1}\| \le \|g\| + \sum_{j=0}^{n}\tau\,\|B^{-1}f^{j+1/2}\| \le \|g\| + \sum_{j=0}^{n}\tau\,\|f^{j+1/2}\|.
$$

Taking into account the values of $\varphi$ at the boundary points and the form of the operator $B$, the scheme can be rewritten in the form

$$
\prod_{\alpha=1}^{p}(E + \sigma\tau\Lambda_\alpha)\,\varphi^{n+1} = \Phi^n, \tag{60.13}
$$

where the right-hand side $\Phi^n$ is known. The realization of (60.13) consists in consecutively solving the problems

$$
(E + \sigma\tau\Lambda_1)\varphi^{n+1/p} = \Phi^n,
$$
$$
(E + \sigma\tau\Lambda_2)\varphi^{n+2/p} = \varphi^{n+1/p},
$$
$$
\cdots \tag{60.14}
$$
$$
(E + \sigma\tau\Lambda_p)\varphi^{n+1} = \varphi^{n+(p-1)/p}
$$

by the sweep method.

The DOUGLAS-RACHFORD [1956] scheme is a special case of this scheme for $\sigma = 1$ and $f \equiv 0$.
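One time step of the factorized scheme (60.6), realized as consecutive one-dimensional solves as in (60.14), can be sketched as follows for $p = 2$ on a square grid with homogeneous Dirichlet conditions. This is a minimal illustration, not the book's code: the function names are hypothetical, dense solves stand in for the sweep method, and the operator $\Lambda_2$ acts along the second array index.

```python
import numpy as np

def lam1d(m, h):
    """Three-point analogue of -d^2/dx^2 on m interior nodes (Dirichlet)."""
    L = np.zeros((m, m))
    for i in range(m):
        L[i, i] = 2.0 / h**2
        if i > 0:
            L[i, i - 1] = -1.0 / h**2
        if i < m - 1:
            L[i, i + 1] = -1.0 / h**2
    return L

def dyakonov_step(phi, f, tau, h, sigma=0.5):
    """One step of the factorized scheme:
    B*(phi^{n+1} - phi^n)/tau + Lam*phi^n = f,
    B = (E + sigma*tau*Lam1)(E + sigma*tau*Lam2),
    realized by consecutive one-dimensional solves."""
    m = phi.shape[0]
    L = lam1d(m, h)
    E = np.eye(m)
    # explicit residual: tau*(f - Lam*phi), Lam = Lam1 + Lam2
    rhs = tau * (f - L @ phi - phi @ L.T)
    # solve along x1 (first index), then along x2 (second index)
    xi = np.linalg.solve(E + sigma * tau * L, rhs)
    delta = np.linalg.solve(E + sigma * tau * L, xi.T).T
    return phi + delta
```

For $\sigma = \frac{1}{2}$ and $f = 0$ every Fourier mode is damped, so the norm of the grid function decreases monotonically, in agreement with the absolute stability of the scheme.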

61. Schemes for equations with mixed derivatives

We consider the first boundary value problem for a parabolic equation with mixed derivatives in the parallelepiped

$$
\Omega = \{0 < x_1 < l_1, \ldots, 0 < x_p < l_p\}
$$

in the following form (SAMARSKII [1971], p. 372; YANENKO [1967], p. 34):

$$
\frac{\partial\varphi}{\partial t} + \sum_{\alpha,\beta=1}^{p} k_{\alpha\beta}(t)A_{\alpha\beta}\varphi = f(x, t), \qquad A_{\alpha\beta}\varphi = -\frac{\partial^2\varphi}{\partial x_\alpha\,\partial x_\beta},
$$
$$
0 < c_1\sum_{\alpha=1}^{p}\xi_\alpha^2 \le \sum_{\alpha,\beta=1}^{p}k_{\alpha\beta}(t)\,\xi_\alpha\xi_\beta \le c_2\sum_{\alpha=1}^{p}\xi_\alpha^2, \tag{61.1}
$$
$$
\varphi = g \quad \text{for } t = 0, \qquad \varphi = 0 \quad \text{in } \partial\Omega \times \Omega_t.
$$

As in Section 60 we introduce a uniform grid for the variables $x_\alpha$ with the steps $h_\alpha = l_\alpha/(I_\alpha + 1)$: $x_{i_\alpha} = i_\alpha h_\alpha$, $i_\alpha = 0, 1, \ldots, I_\alpha + 1$. We approximate the differential operators $A_{\alpha\beta}$ with second-order accuracy by their finite difference analogues $\Lambda_{\alpha\beta}$. The operator $\Lambda_{\alpha\alpha}$ is written in the traditional form (see (60.4)),

$$
\Lambda_{\alpha\alpha}\varphi = -\frac{1}{h_\alpha^2}\bigl(\varphi_{i_\alpha+1} - 2\varphi_{i_\alpha} + \varphi_{i_\alpha-1}\bigr), \qquad 1 \le i_\alpha \le I_\alpha, \tag{61.2}
$$

and, according to YANENKO [1967], $\Lambda_{\alpha\beta}$ for $\alpha \ne \beta$ can be written in the form

$$
\Lambda_{\alpha\beta}\varphi = -\frac{1}{4h_\alpha h_\beta}\bigl[\varphi_{i_\alpha+1,i_\beta+1} - \varphi_{i_\alpha+1,i_\beta-1} - \varphi_{i_\alpha-1,i_\beta+1} + \varphi_{i_\alpha-1,i_\beta-1}\bigr]. \tag{61.3}
$$

To solve (61.1) we take the analogous scheme (60.6) with a factorized difference operator $B = \prod_{\alpha=1}^{p}B_\alpha$:

$$
B\,\frac{\varphi^{n+1} - \varphi^n}{\tau} + \sum_{\alpha,\beta=1}^{p}k_{\alpha\beta}(t_{n+1/2})\Lambda_{\alpha\beta}\varphi^n = f^{n+1/2},
$$
$$
\varphi^0 = g, \tag{61.4}
$$
$$
\varphi_{i_\alpha} = 0 \quad \text{for } i_\alpha = 0 \text{ and } i_\alpha = I_\alpha + 1, \quad \alpha = 1, 2, \ldots, p.
$$

We choose the operators $B_\alpha$ in the form (SAMARSKII [1971])

$$
B_\alpha = E + \tau\sigma c_2\Lambda_{\alpha\alpha}. \tag{61.5}
$$

In $\Phi = H_h$ these operators are self-adjoint, positive (for $\sigma > 0$) and commutative (because the domain is a parallelepiped). The factorized scheme (61.4) has at least first-order accuracy in $\tau$ and is stable for $\sigma \ge \frac{1}{2}$. With the $B_\alpha$, $\alpha = 1, 2, \ldots, p$, being three-point difference operators with constant coefficients, the algorithm for computing $\varphi^{n+1}$ for a given $\varphi^n$ coincides with that of Section 60.
For the special case $f \equiv 0$, $k_{\alpha\beta} = \text{const}$, $p = 2$, YANENKO [1967] considered the scheme based on the splitting method

$$
\frac{\varphi^{n+1/2} - \varphi^n}{\tau} + k_{11}\Lambda_{11}\varphi^{n+1/2} + k_{12}\Lambda_{12}\varphi^n = 0,
$$
$$
\frac{\varphi^{n+1} - \varphi^{n+1/2}}{\tau} + k_{21}\Lambda_{21}\varphi^{n+1/2} + k_{22}\Lambda_{22}\varphi^{n+1} = 0. \tag{61.6}
$$

This scheme has a first-order approximation in time and is absolutely stable.

For $p = 3$ the following scheme, generalizing (61.6), has been introduced by YANENKO [1959b, 1961]:

$$
\frac{\varphi^{n+1/6} - \varphi^n}{\tau} + \tfrac{1}{2}k_{11}\Lambda_{11}\varphi^{n+1/6} + k_{12}\Lambda_{12}\varphi^n = 0,
$$
$$
\frac{\varphi^{n+2/6} - \varphi^{n+1/6}}{\tau} + k_{21}\Lambda_{21}\varphi^{n+1/6} + \tfrac{1}{2}k_{22}\Lambda_{22}\varphi^{n+2/6} = 0,
$$
$$
\frac{\varphi^{n+3/6} - \varphi^{n+2/6}}{\tau} + \tfrac{1}{2}k_{33}\Lambda_{33}\varphi^{n+3/6} + k_{13}\Lambda_{13}\varphi^{n+2/6} = 0,
$$
$$
\frac{\varphi^{n+4/6} - \varphi^{n+3/6}}{\tau} + k_{31}\Lambda_{31}\varphi^{n+3/6} + \tfrac{1}{2}k_{33}\Lambda_{33}\varphi^{n+4/6} = 0, \tag{61.7}
$$
$$
\frac{\varphi^{n+5/6} - \varphi^{n+4/6}}{\tau} + \tfrac{1}{2}k_{22}\Lambda_{22}\varphi^{n+5/6} + k_{23}\Lambda_{23}\varphi^{n+4/6} = 0,
$$
$$
\frac{\varphi^{n+1} - \varphi^{n+5/6}}{\tau} + k_{32}\Lambda_{32}\varphi^{n+5/6} + \tfrac{1}{2}k_{11}\Lambda_{11}\varphi^{n+1} = 0.
$$

Scheme (61.7) approximates the original problem with first-order accuracy in $\tau$ and is stable if the matrix with the elements $k_{\alpha\alpha}$ on the main diagonal and $k_{\alpha\beta}$ on the other diagonals is positive-definite.

Schemes (61.6) and (61.7) are realized on each splitting step by the scalar sweep method.

62. Alternating direction schemes

In the framework of problem (61.1) we demonstrate the alternating direction type scheme proposed by SAMARSKII [1964a] for the case when $k_{\alpha\beta} = k_{\beta\alpha}$. For $p = 3$ we have

$$
\frac{\varphi^{n+1/6} - \varphi^n}{\tau} + k_{11}\Lambda_{11}\varphi^{n+1/6} = f(x, t_{n+1/6}, \varphi^n),
$$
$$
\frac{\varphi^{n+2/6} - \varphi^{n+1/6}}{\tau} + k_{22}\Lambda_{22}\varphi^{n+2/6} + k_{21}\Lambda_{21}\varphi^{n+1/6} = f(x, t_{n+2/6}, \varphi^{n+1/6}),
$$
$$
\frac{\varphi^{n+3/6} - \varphi^{n+2/6}}{\tau} + k_{33}\Lambda_{33}\varphi^{n+3/6} + k_{31}\Lambda_{31}\varphi^{n+2/6} + k_{32}\Lambda_{32}\varphi^{n+2/6} = f(x, t_{n+3/6}, \varphi^{n+2/6}), \tag{62.1}
$$
$$
\frac{\varphi^{n+4/6} - \varphi^{n+3/6}}{\tau} + k_{33}\Lambda_{33}\varphi^{n+3/6} = f(x, t_{n+4/6}, \varphi^{n+3/6}),
$$
$$
\frac{\varphi^{n+5/6} - \varphi^{n+4/6}}{\tau} + k_{22}\Lambda_{22}\varphi^{n+2/6} + k_{23}\Lambda_{23}\varphi^{n+4/6} = f(x, t_{n+5/6}, \varphi^{n+4/6}),
$$
$$
\frac{\varphi^{n+1} - \varphi^{n+5/6}}{\tau} + k_{11}\Lambda_{11}\varphi^{n+1/6} + k_{12}\Lambda_{12}\varphi^{n+5/6} + k_{13}\Lambda_{13}\varphi^{n+4/6} = f(x, t_{n+1}, \varphi^{n+5/6}).
$$

On smooth solutions this scheme has a second-order approximation in $x_\alpha$ and a first-order one in $\tau$, and it converges for any $\tau \to 0$. However, there exists a restriction on $|h| = \bigl(\sum_{\alpha=1}^{p}h_\alpha^2\bigr)^{1/2}$: $|h| \le h_0$, following from the requirement that the matrix $(k_{\alpha\beta})$ be positive-definite. The constant $h_0$ depends on the constants $\gamma_1$ and $\gamma_2$ defined by the relationship

$$
\sum_{\alpha,\beta=1}^{p}k_{\alpha\beta}\,\xi_\alpha\xi_\beta \ge \gamma_1\sum_{\alpha=1}^{p}\xi_\alpha^2 \tag{62.2}
$$

in the case of the differential problem, and by the relationship

$$
\sum_{\alpha,\beta=1}^{p}k_{\alpha\beta}\,(\Lambda_{\alpha\beta}\varphi, \varphi) \ge \gamma_2\sum_{\alpha=1}^{p}(\Lambda_{\alpha\alpha}\varphi, \varphi) \tag{62.3}
$$

in the case of its difference analogue. Scheme (62.1) is realized by the sweep method on the first three steps and by computations with explicit formulae on the final three stages.
For the case when

$$
k_{\alpha\beta} = 0 \quad \text{for } \beta > \alpha, \qquad k_{\alpha\alpha} \ge \delta > 0, \tag{62.4}
$$

SAMARSKII [1964a] proposed the following difference scheme:

$$
\frac{\varphi^{n+1/3} - \varphi^n}{\tau} + k_{11}\Lambda_{11}\varphi^{n+1/3} = f(x, t_{n+1/3}, \varphi^n),
$$
$$
\frac{\varphi^{n+2/3} - \varphi^{n+1/3}}{\tau} + k_{22}\Lambda_{22}\varphi^{n+2/3} + k_{21}\Lambda_{21}\varphi^{n+1/3} = f(x, t_{n+2/3}, \varphi^{n+1/3}), \tag{62.5}
$$
$$
\frac{\varphi^{n+1} - \varphi^{n+2/3}}{\tau} + k_{33}\Lambda_{33}\varphi^{n+1} + k_{31}\Lambda_{31}\varphi^{n+1/3} + k_{32}\Lambda_{32}\varphi^{n+2/3} = f(x, t_{n+1}, \varphi^{n+2/3}).
$$

This scheme converges for $\tau \to 0$; its realization is analogous to that in Section 61.

63. Schemes of increased order of accuracy

All the difference schemes above for solving parabolic equations had no more than second-order accuracy in the space variables and time. The splitting (fractional steps) method can be used for obtaining simple schemes of increased order of accuracy. Assume here that the grid is uniform and has the same step h in all variables $x_\alpha$.

The schemes of increased accuracy for the simplest parabolic equations, constructed by factorizing the operator on the upper time layer, are considered in works by DOUGLAS and GUNN [1963], SAMARSKII [1963b], SAMARSKII and GULIN [1973], YANENKO [1967], SAMARSKII and ANDREEV [1963]. These schemes are constructed on the basis of an m-layer homogeneous scheme of increased accuracy ($m \ge 2$) in the form:

$$
\frac{\varphi^{n+1} - \varphi^n}{\tau} + L\varphi^{n+1} = f^n, \tag{63.1}
$$

where $f^n$ is the result of applying difference spatial operators to the functions $\varphi^n, \varphi^{n-1}, \ldots$.
We consider the method proposed by DOUGLAS and GUNN [1963] and assume that the following two-dimensional heat conduction equation has to be solved:

$$
\frac{\partial\varphi}{\partial t} + A\varphi = f, \tag{63.2}
$$

where

$$
A = A_1 + A_2, \qquad A_\alpha = -\frac{\partial^2}{\partial x_\alpha^2}, \quad \alpha = 1, 2.
$$

To solve (63.2) we use a scheme of the stabilizing correction type:

$$
\frac{\varphi^{n+1/2} - \varphi^n}{\tau} + L_1(\varphi^{n+1/2} - \varphi^n) + L\varphi^n = f^n,
$$
$$
\frac{\varphi^{n+1} - \varphi^{n+1/2}}{\tau} + L_2(\varphi^{n+1} - \varphi^n) = 0. \tag{63.3}
$$

We choose the following difference operators $L$ and $L_\alpha$ as approximating operators:

$$
L = \Lambda - \frac{h^2}{12\tau}E, \qquad L_1 = \tfrac{1}{2}\Lambda_1 - \frac{h^2}{6\tau}E, \qquad L_2 = \tfrac{1}{2}\Lambda_2 - \frac{h^2}{6\tau}E,
$$

where $\Lambda$, $\Lambda_1$ and $\Lambda_2$ are approximations of the differential operators $-\Delta$, $-\partial^2/\partial x_1^2$, $-\partial^2/\partial x_2^2$. Then, under the condition $\tau/h = \text{const}$, scheme (63.3) has accuracy of order $O(\tau^2 + h^4)$ and is absolutely stable.

If we choose for the operators $L$, $L_1$ and $L_2$ the following,

$$
L = \Bigl(1 - \frac{h^2}{12\tau}\Bigr)\Lambda, \qquad L_\alpha = \tfrac{1}{2}\Bigl(1 - \frac{h^2}{6\tau}\Bigr)\Lambda_\alpha, \quad \alpha = 1, 2,
$$

then scheme (63.3) will be absolutely stable and have accuracy of order $O(\tau^2 + h^4)$ if the condition $\tau/h^2 = \text{const}$ holds.
Another method of constructing a factorized operator is proposed by SAMARSKII [1963b]. It is based on replacing (63.1) with the scheme

$$
\frac{\varphi^{n+1} - \varphi^n}{\tau} + L\varphi^{n+1} + \frac{1}{\tau}Q(\varphi^{n+1} - \varphi^n) = f^n, \tag{63.4}
$$

where the operator $Q$ is chosen in such a way that the operator of the scheme (63.4) is factorized. If we split $L$ according to the expansion

$$
L = L_1 + L_2 + \cdots + L_p,
$$

then

$$
E + \tau L + Q = (E + \tau L_1)\cdots(E + \tau L_p). \tag{63.5}
$$

Hence, $Q$ can be represented as

$$
Q = (E + \tau L_1)\cdots(E + \tau L_p) - (E + \tau L) = \tau^2\sum_{\alpha<\beta}L_\alpha L_\beta + \tau^3\sum_{\alpha<\beta<\gamma}L_\alpha L_\beta L_\gamma + \cdots + \tau^p L_1L_2\cdots L_p. \tag{63.6}
$$
In the work by SAMARSKII [1963b] the following absolutely stable two-layer scheme of increased order of accuracy,

$$
\frac{\varphi^{n+1} - \varphi^n}{\tau} + \Lambda\bigl(\theta\varphi^{n+1} + (1-\theta)\varphi^n\bigr) + \frac{h^2}{12}\sum_{\alpha<\beta}\Lambda_\alpha\Lambda_\beta\,\varphi^n = 0,
$$
$$
\theta = \tfrac{1}{2}\Bigl(1 - \frac{h^2}{6\tau}\Bigr), \qquad \Lambda = \sum_{\alpha=1}^{p}\Lambda_\alpha, \tag{63.7}
$$

approximating the equation

$$
\frac{\partial\varphi}{\partial t} + A\varphi = 0, \qquad A = -\sum_{\alpha=1}^{p}\frac{\partial^2\varphi}{\partial x_\alpha^2}, \tag{63.8}
$$

with accuracy of order $O(\tau^2 + h^4)$, is taken instead of the initial scheme (63.1). In this case the factorized scheme has the form

$$
(E + \theta\tau\Lambda_1)(E + \theta\tau\Lambda_2)\cdots(E + \theta\tau\Lambda_p)\,\varphi^{n+1} = \Phi\varphi^n + \tau f^n, \tag{63.9}
$$

where

$$
\Phi = E - \tau\Bigl[\frac{h^2}{12}\sum_{\alpha<\beta}\Lambda_\alpha\Lambda_\beta + (1-\theta)\Lambda\Bigr] - (\theta\tau)^2\sum_{\alpha<\beta}\Lambda_\alpha\Lambda_\beta + (\theta\tau)^3\sum_{\alpha<\beta<\gamma}\Lambda_\alpha\Lambda_\beta\Lambda_\gamma - \cdots + (-1)^{p-1}(\theta\tau)^p\,\Lambda_1\Lambda_2\cdots\Lambda_p. \tag{63.10}
$$

A scheme of increased order of accuracy is likewise constructed for the equation

$$
\frac{\partial\varphi}{\partial t} - \sum_{\alpha,\beta=1}^{2}k_{\alpha\beta}\frac{\partial^2\varphi}{\partial x_\alpha\,\partial x_\beta} = 0, \qquad k_{\alpha\beta} = \text{const}, \quad k_{11} = k_{22} = 1. \tag{63.11}
$$

The initial homogeneous scheme has the form

$$
\frac{\varphi^{n+1} - \varphi^{n-1}}{2\tau} + \Lambda_{11}\frac{\varphi^{n+1} + \varphi^{n-1}}{2} + 2k_{12}\Lambda_{12}\varphi^n + \Lambda_{22}\frac{\varphi^{n+1} + \varphi^{n-1}}{2} + \frac{h^2}{12}\,\frac{\varphi^{n+1} - 2\varphi^n + \varphi^{n-1}}{\tau} - \frac{h^2}{12}\,b\,\Lambda_{11}\Lambda_{22}\varphi^n = 0, \tag{63.12}
$$

where $\Lambda_{11}$, $\Lambda_{22}$ and $\Lambda_{12}$ are given by formulae (61.2) and (61.3), and the parameter $b$ is defined by

$$
b = 1 + 2k_{12}^2 - 3|k_{12}|.
$$

The factorized scheme has the form

$$
L_1L_2\,\frac{\varphi^{n+1} - \varphi^n}{\tau} = (1 - 2\theta)\frac{\varphi^n - \varphi^{n-1}}{\tau} - \theta\Lambda(\varphi^n + \varphi^{n-1}) - 4\theta k_{12}\Lambda_{12}\varphi^n + 2(1-\theta)\tau b\,\Lambda_{11}\Lambda_{22}\varphi^n, \tag{63.13}
$$

where

$$
L_i = E + \theta\tau\Lambda_{ii}.
$$

All schemes from this section are realized on the basis of the scalar sweep method in all directions, which is fairly simple.

64. Finite element method schemes and splitting schemes for two-dimensional parabolic equations

We consider the diffusion equation

$$
\frac{\partial\varphi}{\partial t} - \sum_{\alpha,\beta=1}^{2}\frac{\partial}{\partial x_\alpha}\Bigl(k_{\alpha\beta}\frac{\partial\varphi}{\partial x_\beta}\Bigr) = f(x_1, x_2, t) \tag{64.1}
$$

in the two-dimensional domain $\Omega$ with boundary $\partial\Omega$ and choose homogeneous boundary and initial conditions:

$$
\varphi = 0 \quad \text{on } \partial\Omega, \qquad \varphi = 0 \quad \text{for } t = 0. \tag{64.2}
$$

We assume that the coefficients $k_{\alpha\beta}(x_1, x_2)$ satisfy the uniform ellipticity condition in $\Omega$ and that $f(x_1, x_2, t)$ is a sufficiently smooth function. We introduce a rectangular uniform grid with steps $h_1$ and $h_2$ in the domain $\Omega$ and triangulate each grid cell by diagonals of positive or negative direction depending on the sign of the integral

$$
a_{mn} = \tfrac{1}{2}\int_{x_{1,m}}^{x_{1,m}+h_1}\int_{x_{2,n}}^{x_{2,n}+h_2}(k_{12} + k_{21})\,dx_1\,dx_2.
$$

We denote by $\Omega_h$ the largest union of triangles belonging to $\Omega$, and we denote the subdomains of $\Omega_h$ with positive and negative triangulations of the cells by $\Omega_h^+$ and $\Omega_h^-$ respectively.

We introduce a piecewise linear prolongation of the grid functions on the triangulated domain $\Omega_h$ with the help of the coordinate functions $\omega_{m,n}(x_1, x_2)$:

$$
\omega_{m,n}(x_{1,k}, x_{2,l}) = \begin{cases} 1 & \text{for } (m, n) = (k, l), \\ 0 & \text{for } (m, n) \ne (k, l), \end{cases} \tag{64.3}
$$
$$
\varphi^h = \sum \varphi_{m,n}(t)\,\omega_{m,n}(x_1, x_2), \qquad (x_{1,m}, x_{2,n}) \in \Omega_h \setminus \partial\Omega_h.
$$

The semidiscrete Galerkin equations for $\varphi_{m,n}(t)$, using the "lumping" method for the evolutionary term, can be written in the form

$$
\Bigl(\frac{\partial\varphi^h}{\partial t}, \omega_{m,n}\Bigr) + a(\varphi^h, \omega_{m,n}) = (f, \omega_{m,n}),
$$
$$
(\varphi^h, \omega_{m,n}) = 0 \quad \text{for } t = 0, \tag{64.4}
$$
$$
a(\varphi^h, \omega) = \int_{\Omega_h}\sum_{\alpha,\beta}k_{\alpha\beta}\,\frac{\partial\varphi^h}{\partial x_\beta}\,\frac{\partial\omega}{\partial x_\alpha}\,dx.
$$

In operator form equation (64.4) is given by

$$
\Theta\,\frac{\partial\varphi}{\partial t} + A\varphi = \bar f. \tag{64.5}
$$

Here $\Theta > 0$ and $A \ge 0$ are grid operators, and $\varphi = \{\varphi_{m,n}(t)\}$ and $\bar f = \{(f, \omega_{m,n})\}$ are grid functions.
It is shown by MARCHUK and KUZIN [1983] that the grid operator $A$ can be represented as a sum of one-dimensional three-point operators,

$$
A = A_{11} + A_{22} + A_{12} + A_{21},
$$

where the operator $A_{11}$ acts on the grid lines parallel to the axis $x_1$, the operator $A_{22}$ on the lines parallel to the axis $x_2$, the operator $A_{12}$ on the diagonals of positive direction in the subdomain $\Omega_h^+$, and $A_{21}$ on the negative diagonals in the subdomain $\Omega_h^-$. It is proved that the operators are positive-semidefinite ($A_{\alpha\beta} \ge 0$, $\alpha, \beta = 1, 2$) if the following conditions hold:

$$
k_{11} \ge \tfrac{1}{2}|k_{12} + k_{21}|, \qquad k_{22} \ge \tfrac{1}{2}|k_{12} + k_{21}|.
$$

Based on this, we can use one of the splitting methods for solving (64.5), in particular the two-cycle method of type (59.12) with Crank-Nicolson schemes on each elementary step (MARCHUK [1980]).
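The two-cycle method with Crank-Nicolson elementary steps can be sketched as follows, for the homogeneous problem $d\varphi/dt + \sum_\alpha A_\alpha\varphi = 0$ with the $A_\alpha$ given as symmetric positive-semidefinite matrices. This is a minimal illustration with hypothetical names, not the book's implementation:

```python
import numpy as np

def cn_factor(A, tau):
    """Crank-Nicolson transfer operator (E + tau/4*A)^{-1}(E - tau/4*A)
    for one elementary splitting step of length tau/2."""
    E = np.eye(A.shape[0])
    return np.linalg.solve(E + 0.25 * tau * A, E - 0.25 * tau * A)

def two_cycle_step(phi, ops, tau):
    """One step of two-cycle componentwise splitting: the elementary
    Crank-Nicolson steps are applied in the order A1, ..., Ap, Ap, ..., A1,
    as in scheme (59.12)."""
    for A in list(ops) + list(reversed(ops)):
        phi = cn_factor(A, tau) @ phi
    return phi
```

Since each Crank-Nicolson factor is a contraction whenever $A_\alpha \ge 0$, the composite step does not increase the norm of the grid function for any $\tau$, which is the source of the unconditional stability of the two-cycle method.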
Similar results are obtained for the generalized Neumann boundary conditions.

The application of the piecewise linear prolongation to equation (64.1), under the condition that $k_{12} + k_{21} = 0$, on a nonuniform triangulated grid which is topologically equivalent to a rectangular one also leads to the splitting of the grid operator into four one-dimensional operators. In this case the conditions of positive-semidefiniteness of the one-dimensional operators are reduced to certain geometrical conditions on the grid parameters.

This method has been used in ocean dynamics models for computing the stream function (KUZIN [1980]) and for solving the three-dimensional equation of heat flux in the ocean, which is reduced by special splitting along the sections to a series of equations of type (64.1) (KUZIN [1984]).
CHAPTER XVI

Equations of Hyperbolic Type

At present there exist a number of effective algorithms for solving problems connected with equations of hyperbolic type. This chapter features methods based on special splitting algorithms for solving such problems.

65. The stabilization scheme for the multidimensional equation of oscillations

We consider an N-dimensional equation of oscillations

$$
\frac{\partial^2\varphi}{\partial t^2} + A\varphi = f \quad \text{in } \Omega \times \Omega_t, \tag{65.1}
$$
$$
\varphi = 0 \quad \text{on } \partial\Omega, \tag{65.2}
$$
$$
\frac{\partial\varphi}{\partial t} = q \quad \text{and} \quad \varphi = p \quad \text{in } \Omega \text{ for } t = 0, \tag{65.3}
$$

where $\Omega$ is an N-dimensional parallelepiped and $A = -\sum_{j=1}^{N}\partial^2/\partial x_j^2$. We approximate (65.1)-(65.2) in space by a finite difference problem on a uniform grid $\Omega_h$ with step $h_j$ along the axis $x_j$:

$$
\frac{d^2\varphi^h}{dt^2} + \Lambda\varphi^h = f^h \quad \text{in } \Omega_h \times \Omega_t, \tag{65.4}
$$
$$
\varphi^h = p^h \quad \text{and} \quad \frac{d\varphi^h}{dt} = q^h \quad \text{in } \Omega_h \text{ for } t = 0. \tag{65.5}
$$

Here

$$
\Lambda = \sum_{j=1}^{N}\Lambda_j, \qquad
\Lambda_1\varphi = \begin{cases}
\dfrac{-\varphi_{i_1+1} + 2\varphi_{i_1}}{h_1^2}, & i_1 = 1, \\[2mm]
\dfrac{-\varphi_{i_1+1} + 2\varphi_{i_1} - \varphi_{i_1-1}}{h_1^2}, & 2 \le i_1 \le M_1 - 1, \\[2mm]
\dfrac{2\varphi_{i_1} - \varphi_{i_1-1}}{h_1^2}, & i_1 = M_1,
\end{cases}
$$

where $i_1$ and $M_1$ are the index and the number of points along the axis $x_1$; the operators $\Lambda_2, \Lambda_3, \ldots, \Lambda_N$ are defined in the same way. Note that all operators $\Lambda_j > 0$.

For the approximate solution of problem (65.4)-(65.5) in time we use the following scheme (from now on the index h is omitted):

$$
B\,\frac{\varphi^{n+1} - 2\varphi^n + \varphi^{n-1}}{\tau^2} + \Lambda\varphi^n = f^n, \tag{65.6}
$$

$$
B = \prod_{j=1}^{N}\bigl(E + \tfrac{1}{2}\tau^2\Lambda_j\bigr). \tag{65.7}
$$

Equation (65.6) approximates (65.4) with order $O(\tau^2)$. Equation (65.6) can be written in the form

$$
\frac{\varphi^{n+1} - 2\varphi^n + \varphi^{n-1}}{\tau^2} + B^{-1}\Lambda\varphi^n = B^{-1}f^n. \tag{65.8}
$$

A Fourier spectrum analysis of the components of the solution vector of (65.8) leads to the following necessary stability condition for scheme (65.6)-(65.7):

$$
\tau^2\beta(B^{-1}\Lambda) \le 4,
$$

where $\beta(B^{-1}\Lambda)$ is the upper bound of the spectrum of the operator $B^{-1}\Lambda$. Thus, the problem of choosing the parameter $\tau$ satisfying the stability condition is reduced to computing the maximal eigenvalue of the problem

$$
\Lambda u = \lambda Bu,
$$

under the assumption that all eigenvalues of $B^{-1}\Lambda$ are positive. This problem is solved, for example, with the help of the Lusternik iterative process (see MARCHUK [1964a]).
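The maximal eigenvalue of $\Lambda u = \lambda Bu$ can be estimated by plain power iteration on $B^{-1}\Lambda$; the sketch below is a simple stand-in for the Lusternik process, with hypothetical names and the operators supplied as matrices:

```python
import numpy as np

def max_eigenvalue(Lam, B, iters=200):
    """Estimate the maximal eigenvalue of Lam u = lambda B u by power
    iteration on B^{-1} Lam (all eigenvalues assumed positive)."""
    u = np.ones(Lam.shape[0])
    lam = 0.0
    for _ in range(iters):
        w = np.linalg.solve(B, Lam @ u)  # w = B^{-1} Lam u
        lam = np.linalg.norm(w) / np.linalg.norm(u)
        u = w / np.linalg.norm(w)
    return lam

# The estimate then restricts the time step through tau**2 * lam <= 4.
```

In practice $B^{-1}$ is never formed: applying it amounts to the same sequence of one-dimensional sweeps used in the time stepping itself.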
The realization of difference scheme (65.6) can be written as

$$
\bigl(E + \tfrac{1}{2}\tau^2\Lambda_1\bigr)\xi^{n+1/N} = f^n - \Lambda\varphi^n,
$$
$$
\bigl(E + \tfrac{1}{2}\tau^2\Lambda_j\bigr)\xi^{n+j/N} = \xi^{n+(j-1)/N}, \qquad j = 2, \ldots, N, \tag{65.9}
$$
$$
\varphi^{n+1} = 2\varphi^n - \varphi^{n-1} + \tau^2\xi^{n+1}.
$$

This problem is solved consecutively for $n = 1, 2, \ldots$ using the initial data (65.3).

66. Approximate factorization schemes for equations of oscillations

We consider the following problem for the two-dimensional homogeneous equation of oscillations:

$$
\frac{\partial^2\varphi}{\partial t^2} - a^2\Bigl(\frac{\partial^2\varphi}{\partial x_1^2} + \frac{\partial^2\varphi}{\partial x_2^2}\Bigr) = 0, \qquad -\infty < x_1, x_2 < \infty, \tag{66.1}
$$
$$
\varphi = p \quad \text{and} \quad \frac{\partial\varphi}{\partial t} = q \quad \text{for } t = 0, \qquad a^2 = \text{const}. \tag{66.2}
$$

We use a scheme of second-order accuracy for the approximate solution of the problem:

$$
\frac{\varphi^{n+1} - 2\varphi^n + \varphi^{n-1}}{\tau^2} + \Lambda\,\frac{\varphi^{n+1} + \varphi^{n-1}}{2} = 0, \qquad \Lambda = \Lambda_1 + \Lambda_2, \tag{66.3}
$$
$$
\Lambda_1\varphi = -a^2\,\frac{\varphi_{i+1,j} - 2\varphi_{i,j} + \varphi_{i-1,j}}{h_1^2}, \qquad \Lambda_2\varphi = -a^2\,\frac{\varphi_{i,j+1} - 2\varphi_{i,j} + \varphi_{i,j-1}}{h_2^2}.
$$

We rewrite (66.3) in the form

$$
\bigl(E + \tfrac{1}{2}\tau^2\Lambda\bigr)\frac{\varphi^{n+1} + \varphi^{n-1}}{2} = \varphi^n \tag{66.4}
$$

and approximately factorize the operator on the left-hand side of (66.4):

$$
E + \tfrac{1}{2}\tau^2\Lambda \approx \bigl(E + \tfrac{1}{2}\tau^2\Lambda_1\bigr)\bigl(E + \tfrac{1}{2}\tau^2\Lambda_2\bigr).
$$

Finally, we replace scheme (66.3) by the following factorized one:

$$
\bigl(E + \tfrac{1}{2}\tau^2\Lambda_1\bigr)\bigl(E + \tfrac{1}{2}\tau^2\Lambda_2\bigr)\frac{\varphi^{n+1} + \varphi^{n-1}}{2} = \varphi^n. \tag{66.5}
$$

Scheme (66.5) is stable and has second-order accuracy. Scheme (66.5) is realized as follows:

$$
\bigl(E + \tfrac{1}{2}\tau^2\Lambda_1\bigr)\varphi^{n+1/2} = \varphi^n,
$$
$$
\bigl(E + \tfrac{1}{2}\tau^2\Lambda_2\bigr)\frac{\varphi^{n+1} + \varphi^{n-1}}{2} = \varphi^{n+1/2}, \qquad n = 1, 2, \ldots, \tag{66.6}
$$
$$
\varphi^0 = p^h, \qquad \varphi^1 = p^h + \tau q^h.
$$
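One step of the realization (66.6) can be sketched as follows, in a simplified setting where the grid function is flattened into a vector and $\Lambda_1$, $\Lambda_2$ are supplied as matrices acting on it (the function name is hypothetical):

```python
import numpy as np

def wave_step(phi_n, phi_nm1, L1, L2, tau):
    """One step of the factorized scheme (66.5)-(66.6):
    (E + tau^2/2*L1)(E + tau^2/2*L2) * (phi^{n+1} + phi^{n-1})/2 = phi^n.
    Returns phi^{n+1}."""
    E = np.eye(L1.shape[0])
    half = np.linalg.solve(E + 0.5 * tau**2 * L1, phi_n)   # phi^{n+1/2}
    avg = np.linalg.solve(E + 0.5 * tau**2 * L2, half)     # (phi^{n+1}+phi^{n-1})/2
    return 2.0 * avg - phi_nm1
```

With $\Lambda_1 = \Lambda_2 = 0$ the step reduces to the free three-level recursion $\varphi^{n+1} = 2\varphi^n - \varphi^{n-1}$, which is a convenient sanity check. On a grid, each of the two solves decomposes into independent three-point sweeps along one coordinate direction.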

Now we consider the scheme of approximate factorization of the operator on the upper level for solving the multidimensional equation of oscillations (65.1), where

$$
A = -\sum_{i=1}^{N}\frac{\partial^2}{\partial x_i^2}, \qquad f \equiv 0, \qquad \Omega = \{x: -\infty < x_i < \infty\}, \qquad x = (x_1, x_2, \ldots, x_N)
$$

(KONOVALOV [1962]). To solve it we use an implicit difference scheme,

$$
\frac{\varphi^{n+1} - 2\varphi^n + \varphi^{n-1}}{\tau^2} + \Lambda\varphi^{n+1} = 0,
$$

and rewrite it in the form

$$
\Bigl(E + \tau^2\sum_{i=1}^{N}\Lambda_i\Bigr)\varphi^{n+1} = 2\varphi^n - \varphi^{n-1}. \tag{66.7}
$$

Using the relationship

$$
\Bigl(E + \tau^2\sum_{i=1}^{N}\Lambda_i\Bigr)\varphi^{n+1} = \prod_{i=1}^{N}(E + \tau^2\Lambda_i)\varphi^{n+1} + O(\tau^4),
$$

we can replace scheme (66.7) by the following factorized one:

$$
\prod_{i=1}^{N}(E + \tau^2\Lambda_i)\varphi^{n+1} = 2\varphi^n - \varphi^{n-1}. \tag{66.8}
$$

The realization of scheme (66.8) has the form

$$
\frac{\varphi^{n+1/N} - 2\varphi^n + \varphi^{n-1}}{\tau^2} + \Lambda_1\varphi^{n+1/N} = 0,
$$
$$
\frac{\varphi^{n+i/N} - \varphi^{n+(i-1)/N}}{\tau^2} + \Lambda_i\varphi^{n+i/N} = 0, \qquad i = 2, \ldots, N, \tag{66.9}
$$
$$
\varphi^0 = p^h, \qquad \varphi^1 = p^h + \tau q^h.
$$

A scheme similar to the DOUGLAS-RACHFORD [1956] scheme, suggested for solving the multidimensional heat conduction equation, can be used for solving the multidimensional equation of oscillations. It has the form

$$
\frac{\varphi^{n+1/N} - 2\varphi^n + \varphi^{n-1}}{\tau^2} + \Lambda_1\varphi^{n+1/N} + \sum_{i=2}^{N}\Lambda_i\varphi^n = 0,
$$
$$
\frac{\varphi^{n+i/N} - \varphi^{n+(i-1)/N}}{\tau^2} + \Lambda_i(\varphi^{n+i/N} - \varphi^n) = 0, \tag{66.10}
$$
$$
i = 2, \ldots, N, \qquad n = 1, 2, \ldots,
$$
$$
\varphi^0 = p^h, \qquad \varphi^1 = p^h + \tau q^h.
$$

Schemes (66.9) and (66.10) approximate the differential problem (65.1)-(65.3) for the N-dimensional equation of oscillations. The approximations are consistent for $\tau/h_i = \text{const}$, $i = 1, \ldots, N$. A Fourier analysis shows that schemes (66.9) and (66.10) are unconditionally stable, independently of the value of $\tau/h_i = \text{const}$.
Note also that, in the case $N = 2$, scheme (66.10) can be used for a more general equation as well (KONOVALOV [1962]):

$$
\frac{\partial^2\varphi}{\partial t^2} = a_{11}\frac{\partial^2\varphi}{\partial x_1^2} + 2a_{12}\frac{\partial^2\varphi}{\partial x_1\,\partial x_2} + a_{22}\frac{\partial^2\varphi}{\partial x_2^2},
$$
$$
a_{ij} = \text{const}, \qquad a_{11} > 0, \qquad a_{11}a_{22} - a_{12}^2 > 0.
$$

It has the form

$$
\frac{\varphi^{n+1/2} - 2\varphi^n + \varphi^{n-1}}{\tau^2} + a_{11}\Lambda_{11}\varphi^{n+1/2} + 2a_{12}\Lambda_{12}\varphi^n + a_{22}\Lambda_{22}\varphi^n = 0,
$$
$$
\frac{\varphi^{n+1} - \varphi^{n+1/2}}{\tau^2} + a_{22}\Lambda_{22}(\varphi^{n+1} - \varphi^n) = 0. \tag{66.11}
$$

67. Local one-dimensional schemes for the multidimensional hyperbolic equations

Consider the following problem for an N-dimensional hyperbolic equation with variable coefficients:

$$
\frac{\partial^2\varphi}{\partial t^2} + A\varphi = f(x, t) \quad \text{in } Q_T, \qquad A = \sum_{i=1}^{N}A_i, \qquad A_i\varphi = -\frac{\partial}{\partial x_i}\Bigl(k_i(x, t)\frac{\partial\varphi}{\partial x_i}\Bigr), \tag{67.1}
$$
$$
\varphi = \mu(x, t) \quad \text{on } S_T,
$$
$$
\varphi = p \quad \text{and} \quad \frac{\partial\varphi}{\partial t} = q \quad \text{in } \Omega \text{ for } t = 0. \tag{67.2}
$$

Here

$$
k_i(x, t) \ge \delta > 0, \qquad x = (x_1, x_2, \ldots, x_N),
$$

$\Omega$ is an arbitrary N-dimensional domain with the boundary $\partial\Omega$,

$$
\bar\Omega = \Omega + \partial\Omega, \qquad Q_T = \Omega \times (0 < t \le T], \qquad S_T = \partial\Omega \times [0 \le t \le T].
$$

It is assumed that this problem has a unique solution $\varphi = \varphi(x, t)$ which is continuous in $\bar Q_T$ and has all required derivatives (i.e., is sufficiently smooth).

Two assumptions are made with respect to the domain $\Omega$:
(1) A section of the domain $\Omega$ by any straight line parallel to the axis $Ox_i$ may consist only of a finite number of intervals.
(2) It is possible to construct in $\bar\Omega$ a bounded grid $\bar\Omega_h$ with steps $h_i$, $i = 1, 2, \ldots, N$.

In order to construct a local one-dimensional scheme for solving (67.1)-(67.2) we proceed as follows (SAMARSKII [1964b]). We approximate successively, with the step $\tau/N$, the equations

$$
\frac{1}{N}\frac{\partial^2\varphi}{\partial t^2} + A_i\varphi = f_i, \qquad i = 1, 2, \ldots, N,
$$

where the $f_i$ satisfy the condition

$$
\sum_{i=1}^{N}f_i = f.
$$

In order to approximate the derivative $\partial^2\varphi/\partial t^2$ with the step $\tau/N$ we use for $N = 2$ the expression

$$
\varphi_{\bar t t, i} = \frac{\varphi_i - 2\varphi_{i-1} + \varphi_{i-2}}{\tau^2}, \qquad i = 1, 2, \tag{67.3a}
$$

where

$$
\varphi_i = \varphi^{n+i/2}, \qquad \varphi_0 = \varphi^n, \qquad \varphi_{-1} = \varphi^{n-1/2}, \qquad \varphi_{-2} = \varphi^{n-1},
$$

and for $N = 3$,

$$
\varphi_{\bar t t, i} = \frac{\varphi_i - \varphi_{i-1} - \varphi_{i-2} + \varphi_{i-3}}{\tau^2}, \qquad i = 1, 2, 3, \tag{67.3b}
$$

where

$$
\varphi_i = \varphi^{n+i/3}, \qquad \varphi_{-1} = \varphi^{n-1/3}, \qquad \varphi_{-2} = \varphi^{n-2/3}.
$$

To approximate $A_i\varphi + f_i$ on the grid $\Omega_h$ we use the one-dimensional difference operator of second-order accuracy,

$$
\Lambda_i\varphi^h + f_i.
$$

The coefficients of the operator $\Lambda_i$ and of $f_i$ are fixed at the moment

$$
t_i^* = 0.5\,(t_{n+i/N} + t_{n-1+i/N}) = t_{n+i/N-1/2} = t_n + (i/N - \tfrac{1}{2})\tau,
$$

so that

$$
\Lambda_i = \Lambda_i(t_i^*), \qquad f_i = f_i(x, t_i^*).
$$

The local one-dimensional schemes for hyperbolic equations have the form (from now on h is omitted)

$$
\varphi_{\bar t t, i} + \alpha_N\Lambda_i(\varphi_i + \varphi_{i-N}) = 2\alpha_N f_i, \qquad i = 1, \ldots, N, \quad N = 2, 3, \tag{67.4}
$$

where

$$
\alpha_N = \begin{cases} \tfrac{1}{4} & \text{for } N = 2, \\ \tfrac{1}{3} & \text{for } N = 3. \end{cases}
$$

If $N = 2$ we obtain a three-layer additive scheme, and if $N = 3$ a four-layer one. Herein lies the difference from the parabolic equations, where the form of the local one-dimensional schemes does not depend on the number of dimensions N. Equation (67.4) can be written in the form

$$
(E + \alpha_N\tau^2\Lambda_i)(\varphi_i + \varphi_{i-N}) = \begin{cases} 2(\varphi_{i-1} + \alpha_2\tau^2 f_i) & \text{for } N = 2, \\ \varphi_{i-1} + \varphi_{i-2} + 2\alpha_3\tau^2 f_i & \text{for } N = 3. \end{cases} \tag{67.5}
$$

Finding $\varphi_i$ is reduced to solving the three-point equation

$$
(E + \alpha_N\tau^2\Lambda_i)\varphi_i = F_i \tag{67.6}
$$

along the lines parallel to the axis $Ox_i$ by the sweep method, using the boundary condition

$$
\varphi_i = \mu(x, t_{n+i/N}) \quad \text{for } x \in \partial\Omega_h. \tag{67.7}
$$

If the operator $A_i\varphi$ contains lower-order terms, then the use of the sweep method requires that the steps of the grid $\bar\Omega_h$ be sufficiently small. To get rid of these restrictions on the grid $\bar\Omega_h$, the lower-order terms should be taken on the intermediate layers. The first of the initial conditions, $\varphi(x, 0) = p(x)$, is approximated exactly:

$$
\varphi^h(x, 0) = p(x). \tag{67.8}
$$
To compute the intermediate values

$$
\varphi^{1/2} = \varphi(x, \tfrac{1}{2}\tau) \quad \text{for } N = 2
$$

and

$$
\varphi^{1/3} = \varphi(x, \tfrac{1}{3}\tau), \qquad \varphi^{2/3} = \varphi(x, \tfrac{2}{3}\tau) \quad \text{for } N = 3,
$$

we use the following equations: for $N = 2$,

$$
(E + \tfrac{1}{4}\tau^2\Lambda_1)\varphi^{1/2} = F_1,
$$
$$
F_1 = p + \tfrac{1}{2}\tau q - \tfrac{1}{4}\tau^2\Lambda_1 p + \tfrac{1}{4}\tau^2\bigl(2f_1 - (-\Lambda p + f)\bigr)_{t=0}; \tag{67.9}
$$

for $N = 3$,

$$
(E + \tfrac{1}{3}\tau^2\Lambda_1)\varphi^{1/3} = F_1,
$$
$$
(E + \tfrac{1}{3}\tau^2\Lambda_2)(\varphi^{2/3} + p) = 2\varphi^{1/3} + F_2, \tag{67.10}
$$
$$
F_1 = p + \tfrac{1}{3}\tau q - \tfrac{1}{3}\tau^2\Lambda_1 p + \tfrac{1}{9}\tau^2\bigl(3f_1 - (-\Lambda p + f)\bigr)_{t=0},
$$
$$
F_2 = \tfrac{1}{9}\tau^2\bigl(2f_2 - (-\Lambda p + f)\bigr)_{t=0}.
$$

The local one-dimensional schemes (67.4), with the corresponding approximations of the boundary and initial conditions, converge with rate $O(\tau + h^2)$ if certain smoothness conditions for $\varphi(x, t)$ and $f(x, t)$ hold (SAMARSKII [1964b]).

68. The splitting scheme for multidimensional hyperbolic systems of equations

The splitting method for multidimensional hyperbolic systems was first suggested by BAGRINOVSKY and GODUNOV [1957]. These authors considered only the explicit approximation, for which the splitting scheme has few advantages as compared to conventional explicit schemes.

ANUCHINA and YANENKO [1959] suggested an implicit splitting scheme for a multidimensional hyperbolic system. We demonstrate the idea of this scheme for the case of the two-dimensional system of acoustics equations. We consider the following system of equations:

$$
\frac{\partial u_1}{\partial t} - a^2\frac{\partial v}{\partial x_1} = 0, \qquad \frac{\partial u_2}{\partial t} - a^2\frac{\partial v}{\partial x_2} = 0, \qquad \frac{\partial v}{\partial t} - \Bigl(\frac{\partial u_1}{\partial x_1} + \frac{\partial u_2}{\partial x_2}\Bigr) = 0. \tag{68.1}
$$

In order to solve (68.1) we use a weighted splitting scheme along the coordinates $x_1$ and $x_2$. On the first stage of splitting we have

$$
\frac{u_1^{n+1/2} - u_1^n}{\tau} - a^2\Delta_1(\alpha v^{n+1/2} + \beta v^n) = 0,
$$
$$
\frac{u_2^{n+1/2} - u_2^n}{\tau} = 0, \tag{68.2}
$$
$$
\frac{v^{n+1/2} - v^n}{\tau} - \Delta_1(\alpha u_1^{n+1/2} + \beta u_1^n) = 0,
$$

and on the second stage we have

$$
\frac{u_1^{n+1} - u_1^{n+1/2}}{\tau} = 0,
$$
$$
\frac{u_2^{n+1} - u_2^{n+1/2}}{\tau} - a^2\Delta_2(\alpha v^{n+1} + \beta v^{n+1/2}) = 0, \tag{68.3}
$$
$$
\frac{v^{n+1} - v^{n+1/2}}{\tau} - \Delta_2(\alpha u_2^{n+1} + \beta u_2^{n+1/2}) = 0,
$$
$$
\alpha + \beta = 1,
$$

where $\Delta_i$ denotes the first-difference approximation of $\partial/\partial x_i$,

$$
\Delta_i\varphi = \frac{\varphi_{+1_i} - \varphi}{h_i},
$$

$\varphi_{+1_i}$ being the value of $\varphi$ shifted by one node along $x_i$.

Scheme (68.2)-(68.3) is absolutely stable for $\alpha \ge \frac{1}{2}$ and approximates the original system (68.1) with first-order accuracy in $\tau$. For $\alpha = \beta = \frac{1}{2}$ the approximation accuracy is $O(\tau^2 + h^2)$ (YANENKO [1967], p. 63).

Scheme (68.2)-(68.3) is realized on each stage of splitting by simple three-point sweeps.

The approximate factorization method (66.5), used for solving the two-dimensional wave equation, is generalized to the case of N-dimensional hyperbolic systems of second order (DYAKONOV [1971d]).
We assume that it is necessary to solve the following problem:

$$
\frac{\partial^2\varphi}{\partial t^2} + A\varphi = f \quad \text{in } \Omega \times \Omega_t, \tag{68.4}
$$
$$
\varphi = p \quad \text{and} \quad \frac{\partial\varphi}{\partial t} = q \quad \text{in } \Omega \text{ for } t = 0, \qquad \varphi = 0 \quad \text{on } \partial\Omega. \tag{68.5}
$$

Here

$$
\varphi = (\varphi_1(x), \varphi_2(x), \ldots, \varphi_N(x)),
$$
$$
A\varphi = A_1\varphi + A_2\varphi,
$$
$$
A_1\varphi = -\sum_{r,l=1}^{d}\frac{\partial}{\partial x_r}\Bigl(a_{rl}\frac{\partial\varphi}{\partial x_l}\Bigr) + \sum_{r=1}^{d}b_r\frac{\partial\varphi}{\partial x_r} + C\varphi,
$$
$$
A_2\varphi = -a_0(x, D\varphi), \qquad D = \Bigl(\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_d}\Bigr);
$$

$a_{rl}$, $b_r$ and $C$ are $(N \times N)$ square matrices depending on $x$, $f$, $p$ and $q$; $a_0(x, 0) = 0$.
difference scheme (DYAKONOV [1965a, 1967b, 1971d])

i- (E-r204) qh 2 2

d
+ A"(pn)i + 2
20 A,(p" - p"- 1) i = f
r=1

for xi E 2h, (68.6)

o°=Pi, oi=Pi'=Pi+Tqi+'2[2,2a=o], ]

(p =0 for x i eh,

where a2p/at2 for t=0 is defined by (68.4)-(68.5). Scheme (68.6) approximates


(68.4)-(68.5) with the error O(T2) in time. If the conditions

aS.>=(),
,, -)_-(A(In- ))U IIVIR,,
y) ZT1xIl RII
,/f(n) ),

0R, 1 2A
t oR1 ( ") <)
I lRU1,,2 ,TtO-12
o=61/2>0,>0

V,
I(A">(u), v)< a [ I1 RU+ a3 Illl

0>(po+# I <T*(4a3)- ,
z,
d
=
Rlui -- E arUi
r=l

R,ui =(-1)d A,lAr2 ... AU i, ai=const>0,


rl <r2< < <Fr
386 G.l. Marchuk CHAPTER XVI

hold, then for the solution of (68.6) the following a priori estimate is valid:

IlT11Hl-.constj -9{ +q,l fj +|II·


f ±lll+ll? 1 If
+1f2-2

k-1l . +r rfll

n=2 Rt a2 )=1
Applied to multidimensional hyperbolic systems, the splitting method allows one to obtain majorant approximations. Let

$$
\frac{\partial\varphi}{\partial t} + \sum_{i=1}^{N}A_i\frac{\partial\varphi}{\partial x_i} + B\varphi = 0 \tag{68.7}
$$

be a linear hyperbolic system which is symmetric in the sense of Friedrichs. Here $\varphi = \{\varphi_1, \ldots, \varphi_N\}$ is a vector function and the $A_i$ and $B$ are symmetric $(N \times N)$ matrices. YANENKO [1964a] showed the possibility of constructing the majorant schemes by the splitting method. Let us show a possible way of constructing majorant approximations for equations of type (68.7) (see YANENKO [1967]). If the $A_i$ are positive matrices, then, using the approximation

$$
\frac{\varphi^{n+1} - \varphi^n}{\tau} + \sum_{i=1}^{N}A_i\frac{E - T_{-i}}{h_i}\varphi^n + B\varphi^n = 0, \tag{68.8}
$$

we obtain the scheme

$$
\varphi^{n+1} = C_0\varphi^n + \sum_{i=1}^{N}C_{-i}T_{-i}\varphi^n, \tag{68.9}
$$

where

$$
C_0 = E - \sum_{i=1}^{N}(\tau/h_i)A_i - \tau B,
$$
$$
C_{-1} = (\tau/h_1)A_1, \quad C_{-2} = (\tau/h_2)A_2, \ldots, \quad C_{-N} = (\tau/h_N)A_N,
$$
$$
T_{-i}f(x_1, \ldots, x_N) = f(x_1, \ldots, x_i - h_i, \ldots, x_N).
$$

If $\tau/h_i$ is sufficiently small, then the matrix $C_0$ is positive; the matrices $C_{-i}$ are identically positive.

If the $A_i$ are negative matrices, then $E - T_{-i}$ should be replaced by $T_i - E$ in (68.8). If the $A_i$ are matrices with an arbitrary sign, then the following representation is possible:

$$
A_i = A_i^1 + A_i^2,
$$

where the $A_i^1$ are nonnegative and the $A_i^2$ are nonpositive. In this case the scheme has the form

$$
\varphi^{n+1} = C_0\varphi^n + \sum_{i=1}^{N}C_{-i}T_{-i}\varphi^n + \sum_{i=1}^{N}C_iT_i\varphi^n,
$$

where

$$
C_0 = E - \sum_{i=1}^{N}(\tau/h_i)A_i^1 + \sum_{i=1}^{N}(\tau/h_i)A_i^2 - \tau B,
$$
$$
C_{-i} = (\tau/h_i)A_i^1, \qquad C_i = -(\tau/h_i)A_i^2, \qquad i = 1, \ldots, N.
$$

The matrices $C_{-i}$ and $C_i$ are always positive, and the matrix $C_0$ is positive if $\tau/h_i$ is sufficiently small.
small.
The splitting method allows one to construct easily realizable positive schemes. We consider an explicit majorant splitting scheme:

$$
\frac{\varphi^{n+s/N} - \varphi^{n+(s-1)/N}}{\tau} + \Bigl[A_s^1\frac{E - T_{-s}}{h_s} + A_s^2\frac{T_s - E}{h_s} + \frac{B}{N}\Bigr]\varphi^{n+(s-1)/N} = 0, \qquad s = 1, \ldots, N. \tag{68.10}
$$

Scheme (68.10) is reduced to

$$
\varphi^{n+s/N} = C_s\varphi^{n+(s-1)/N},
$$

where

$$
C_s = C_{0,s} + C_{-s}T_{-s} + \tilde C_sT_s,
$$
$$
C_{0,s} = E - (\tau/h_s)A_s^1 + (\tau/h_s)A_s^2 - \tau B/N,
$$
$$
C_{-s} = (\tau/h_s)A_s^1, \qquad \tilde C_s = -(\tau/h_s)A_s^2, \qquad s = 1, \ldots, N.
$$

The operators $C_{-s}$ and $\tilde C_s$ are always positive, and the operators $C_{0,s}$ are positive if $\tau/h_s$ is small. Hence, the scheme

$$
\varphi^{n+1} = C_NC_{N-1}\cdots C_2C_1\varphi^n
$$

is positive.

Majorant schemes with low accuracy are, as a rule, maximally stable. They often lead to convergence in the space C and are easily realizable.
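The positivity mechanism is easiest to see in the scalar analogue of one splitting stage: for $u_t + au_x = 0$ with $a > 0$ the stage reduces to $\varphi^{n+1} = C_0\varphi^n + C_{-1}T_{-1}\varphi^n$ with scalar coefficients $C_0 = 1 - a\tau/h$, $C_{-1} = a\tau/h$. A minimal sketch (the function name is hypothetical):

```python
def majorant_step(phi, a, tau, h):
    """One step of the explicit positive (majorant) scheme for the scalar
    equation u_t + a*u_x = 0, a > 0:
        phi^{n+1} = C0*phi + C_{-1}*T_{-1}phi,
    with C0 = 1 - a*tau/h and C_{-1} = a*tau/h; both coefficients are
    nonnegative when a*tau/h <= 1, which gives positivity."""
    c = a * tau / h
    assert 0.0 <= c <= 1.0, "positivity requires a*tau/h <= 1"
    new = phi[:]  # inflow value at the first node is kept fixed
    for i in range(1, len(phi)):
        new[i] = (1.0 - c) * phi[i] + c * phi[i - 1]
    return new
```

Because the step is a convex combination of neighbouring values, nonnegative data stay nonnegative and the maximum principle holds, which is exactly what positivity of the matrices $C_{0,s}$, $C_{-s}$, $\tilde C_s$ delivers in the system case.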
The splitting method for multidimensional hyperbolic systems was first suggested by BAGRINOVSKY and GODUNOV [1957].

The general concepts of splitting schemes (their approximation and convergence) and their use for solving hyperbolic equations are given by MARCHUK [1970, 1980], DYAKONOV [1963b], SAMARSKII [1964b], LEES [1960c], KONOVALOV [1962], YANENKO [1967], ANUCHINA and YANENKO [1959] and WANG [1984].

Numerical splitting algorithms developed for applied problems (nonlinear problems included) described by hyperbolic equations are considered by BELOTSERKOVSKY [1984], SCOTT and ROGER [1983] and LAVAL [1983].
CHAPTER XVII

Integro-Differential
Transport Equations

69. Statement of the problem and the scheme of incomplete splitting

Let $\Omega$ be a convex bounded domain in $R^3$ with piecewise smooth boundary $\partial\Omega$ and variable point $x = (x_1, x_2, x_3)$. Let $\omega$ be the sphere of unit radius in $R^3$ with the center at the origin, and let $s = (s_1, s_2, s_3)$ be the unit vector ($|s| = 1$) taking values on $\omega$. The $s_i$, $i = 1, 2, 3$, can be represented by

$$
s_1 = \sin\theta\cos\psi, \qquad s_2 = \sin\theta\sin\psi, \qquad s_3 = \cos\theta, \qquad 0 \le \theta \le \pi, \quad 0 \le \psi < 2\pi.
$$

We denote the integral over $\omega$ by

$$
S\varphi = \int_\omega\varphi\,ds = \int_0^\pi\sin\theta\,d\theta\int_0^{2\pi}\varphi\,d\psi
$$

and the time variable by $t \in \Omega_t = (0, T)$, $T < \infty$.


The integro-differential transport equation for particles in a medium with isotropic scattering has the form

$$
\frac{1}{v}\frac{\partial\varphi}{\partial t} + (s, \nabla)\varphi + \sigma\varphi = \frac{\sigma_s}{4\pi}S\varphi + f(s, x, t), \tag{69.1}
$$

where $v = \text{const} > 0$ is the velocity of the particles,

$$
(s, \nabla) = \sum_{i=1}^{3}s_i\frac{\partial}{\partial x_i},
$$

$f(s, x, t) \equiv f(\theta, \psi, x, t)$ is the function of the sources, $\sigma = \sigma(x)$, $\sigma_s = \sigma_s(x)$, and the restrictions

$$
0 < \sigma_0 \le \sigma(x) \le \sigma_1 < \infty, \qquad 0 \le \sigma_s(x) \le \sigma_{s,1}, \qquad 0 < \sigma_{c,0} \le \sigma_c(x) = \sigma(x) - \sigma_s(x), \tag{69.2}
$$

for constants $\sigma_0$, $\sigma_1$, $\sigma_{s,1}$, $\sigma_{c,0}$, are assumed to hold.

For (69.1) we will take boundary and initial conditions of the type

$$
\varphi(s, x, t) = 0 \quad \text{for } (s, n) < 0, \quad x \in \partial\Omega, \quad t \in \Omega_t, \quad s \in \omega, \tag{69.3}
$$
$$
\varphi(s, x, 0) = g(s, x) \quad \text{for } x \in \Omega, \quad s \in \omega, \tag{69.4}
$$

where $n = (n_1, n_2, n_3)$ is the unit vector of the external normal to $\partial\Omega$, $(s, n) = \sum_{i=1}^{3}s_in_i$, and $g(s, x)$ is the function of the particles' initial distribution.

If the spatial domain in which the process of particle transport is considered is the slab, where $x_3 = z \in \Omega = (0, H)$, $H < \infty$, $-\infty < x_1, x_2 < \infty$, and the properties of the medium and those of $f$ and $g$ do not depend on $x_1$, $x_2$ and $\psi$, then problem (69.1), (69.3)-(69.4) takes the form

$$
\frac{1}{v}\frac{\partial\varphi}{\partial t} + \mu\frac{\partial\varphi}{\partial z} + \sigma(z)\varphi(\mu, z, t) = \tfrac{1}{2}\sigma_s(z)\int_{-1}^{1}\varphi(\mu', z, t)\,d\mu' + f,
$$
$$
\varphi(\mu, 0, t) = 0 \quad \text{for } \mu > 0,
$$
$$
\varphi(\mu, H, t) = 0 \quad \text{for } \mu < 0, \tag{69.5}
$$
$$
\varphi(\mu, z, 0) = g(\mu, z),
$$

where $\mu = \cos\theta \in [-1, 1]$.

We will consider the splitting schemes on examples of the problems stated above. Note that the schemes are written just as simply for problems with anisotropic scattering (when $S\varphi = \int_\omega\theta(s, s')\varphi(s', x, t)\,ds'$) and also for the more complex problems of transport theory (in particular, in curvilinear geometries).

For the approximation of the problems, let us introduce grids in all the variables, denote the grid step in $t$ by $\Delta t$, and define difference approximations for the operators in (69.1): $S_h$, $A_1 = \sigma_h - (1/4\pi)\sigma_{s,h}S_h$ and $A_2 = (s, \nabla)_h$ are difference approximations of the operators $S$, $\sigma - (\sigma_s/4\pi)S$ and $(s, \nabla)$, respectively. Here $\sigma_h$, $\sigma_{s,h}$ are diagonal matrices of the corresponding dimension. We assume that the approximating operators $S_h$, $A_1$ and $A_2$ are constructed so that the relationships

$$
(A_1\varphi_h, \varphi_h) \ge \gamma(\varphi_h, \varphi_h), \qquad \gamma = \text{const} > 0, \tag{69.6}
$$
$$
(A_2\varphi_h, \varphi_h) \ge 0, \qquad S_he = 4\pi e
$$

hold, where $e = (1, 1, \ldots, 1)$ and the operators act in the Hilbert space $\Phi = H$ with the inner product $(\cdot\,, \cdot)$. Now we formulate the scheme of incomplete splitting for the solution of these problems (MARCHUK and YANENKO [1964]):

$$
\frac{\varphi^{n+1/2} - \varphi^n}{\tau} + A_1(\alpha\varphi^{n+1/2} + \beta\varphi^n) = f_h, \tag{69.7a}
$$
$$
\frac{\varphi^{n+1} - \varphi^{n+1/2}}{\tau} + A_2(\alpha\varphi^{n+1} + \beta\varphi^{n+1/2}) = 0, \tag{69.7b}
$$

where $\alpha \ge 0$, $\beta \ge 0$, $\alpha + \beta = 1$, $\tau = v\Delta t$.

For $\alpha = \beta = \frac{1}{2}$ it has the form

$$
\frac{\varphi^{n+1/2} - \varphi^n}{\tau} + A_1\frac{\varphi^{n+1/2} + \varphi^n}{2} = f_h,
$$
$$
\frac{\varphi^{n+1} - \varphi^{n+1/2}}{\tau} + A_2\frac{\varphi^{n+1} + \varphi^{n+1/2}}{2} = 0. \tag{69.8}
$$

Scheme (69.8) is absolutely stable, but since the operators $A_1$ and $A_2$ do not commute it has only a first-order approximation in $\tau$.
Let us describe the realization of scheme (69.7), which is valid for scheme (69.8)
as well. Assuming the vector φ^n to be known, let us apply the operator S_h to (69.7a).
As a result we obtain the following equation for the functions φ0^{n+1/2} = S_h φ^{n+1/2},
φ0^n = S_h φ^n:

    (φ0^{n+1/2} − φ0^n)/τ + σ_{c,h}(α φ0^{n+1/2} + β φ0^n) = S_h f_h,

where σ_{c,h} = σ_h − σ_{s,h}. From this result we obtain

    φ0^{n+1/2} = (E + ατσ_{c,h})^{−1}[(E − βτσ_{c,h}) φ0^n + τ S_h f_h].

(Note that here (E + ατσ_{c,h})^{−1} and (E − βτσ_{c,h}) are diagonal matrices and hence
the scheme is easily realizable at this stage.) Then we rewrite (69.7a) in the form

    (φ^{n+1/2} − φ^n)/τ + σ_h(α φ^{n+1/2} + β φ^n) = σ_{s,h}(α φ0^{n+1/2} + β φ0^n) + f_h.

Since φ0^{n+1/2}, φ0^n and φ^n are already known, we invert the diagonal matrix and find

    φ^{n+1/2} = (E + ατσ_h)^{−1}[(E − βτσ_h) φ^n + τσ_{s,h}(α φ0^{n+1/2} + β φ0^n) + τ f_h].
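Since every inversion above is a diagonal (elementwise) one, this half-step costs only a few array operations. The following sketch is our own illustration of it, under the stated assumptions: a slab grid of K cells and J angles, and S_h realized as an angular quadrature with weights summing to one (the function name and array layout are hypothetical, not the text's).

```python
import numpy as np

def incomplete_splitting_half_step(phi, sigma, sigma_s, f, w, tau, alpha=0.5, beta=0.5):
    """Realization of (69.7a): phi has shape (K, J) (space x angle);
    sigma, sigma_s are per-cell cross sections of shape (K,); w are angular
    quadrature weights with sum(w) = 1, so (S_h phi)_k = sum_j w_j phi_kj."""
    sigma_c = sigma - sigma_s                       # sigma_c,h = sigma_h - sigma_s,h
    phi0 = phi @ w                                  # phi0^n = S_h phi^n, shape (K,)
    Sf = f @ w                                      # S_h f_h
    # diagonal solve for phi0^{n+1/2} (a scalar division per cell)
    phi0_half = ((1.0 - beta * tau * sigma_c) * phi0 + tau * Sf) \
                / (1.0 + alpha * tau * sigma_c)
    # recover phi^{n+1/2} from the rewritten form of (69.7a)
    rhs = ((1.0 - beta * tau * sigma)[:, None] * phi
           + tau * (sigma_s * (alpha * phi0_half + beta * phi0))[:, None]
           + tau * f)
    return rhs / (1.0 + alpha * tau * sigma)[:, None]
```

With the scattering switched off (sigma_s = 0, f = 0) the step degenerates to the pure absorption update φ^{n+1/2} = (1 − βτσ)φ^n/(1 + ατσ), which is a quick consistency check.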

The realization of (69.7b) depends on the approximation chosen for the operator (s, ∇),
and so does the realization of the boundary condition (69.3).
Setting α = β = ½ in (69.7) and writing the two-cycle scheme of incomplete
splitting in the form

    (E + ½τA1) φ^{n−1/2} = (E − ½τA1) φ^{n−1},

    (E + ½τA2)(φ^n − τf^n) = (E − ½τA2) φ^{n−1/2},
                                                                     (69.9)
    (E + ½τA2) φ^{n+1/2} = (E − ½τA2)(φ^n + τf^n),

    (E + ½τA1) φ^{n+1} = (E − ½τA1) φ^{n+1/2},

    n = 1, 2, …,

where f^n = f_h|_{t=t_n}, we obtain an absolutely stable scheme with a second-order
approximation in τ (on smooth solutions!) on the interval t_{n−1} ≤ t ≤ t_{n+1}. Equations
(69.9) are solved in the same way as in the case of scheme (69.7).
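To see the second-order behaviour of the two-cycle scheme concretely, the sketch below (our own illustration, not from the text) applies the step (69.9) to the model system dφ/dt + (A1 + A2)φ = 0 with small noncommuting matrices and compares against the exact matrix exponential; halving the step should roughly quarter the error.

```python
import numpy as np

def expm_taylor(M, terms=40):
    """Plain Taylor-series matrix exponential (adequate for small test matrices)."""
    E, T = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        T = T @ M / k
        E = E + T
    return E

def two_cycle_step(phi, A1, A2, tau, f):
    """Advance phi from t_{n-1} to t_{n+1} by the two-cycle scheme (69.9)."""
    E = np.eye(len(A1)); solve = np.linalg.solve
    phi = solve(E + 0.5*tau*A1, (E - 0.5*tau*A1) @ phi)
    phi = solve(E + 0.5*tau*A2, (E - 0.5*tau*A2) @ phi) + tau*f
    phi = solve(E + 0.5*tau*A2, (E - 0.5*tau*A2) @ (phi + tau*f))
    phi = solve(E + 0.5*tau*A1, (E - 0.5*tau*A1) @ phi)
    return phi

def integrate(n_steps, A1, A2, phi0, T=1.0):
    tau = T / (2 * n_steps)          # each two-cycle step covers 2*tau
    phi = phi0.copy()
    for _ in range(n_steps):
        phi = two_cycle_step(phi, A1, A2, tau, np.zeros_like(phi0))
    return phi

A1 = np.array([[2.0, 1.0], [0.0, 1.0]])   # deliberately noncommuting pair
A2 = np.array([[1.0, 0.0], [1.0, 3.0]])
phi0 = np.array([1.0, 1.0])
exact = expm_taylor(-1.0 * (A1 + A2)) @ phi0
err = lambda n: np.linalg.norm(integrate(n, A1, A2, phi0) - exact)
# for a second-order scheme err(n)/err(2n) approaches 4 as n grows
```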
392 G.I. Marchuk CHAPTER XVII

70. The scheme of complete splitting

We represent the operator A2 as a sum A2 = Σ_{i=1}^3 A_{2,i}, where A_{2,i} is an
approximation of the one-dimensional operator s_i ∂/∂x_i. By splitting the operator A2
of (69.7b) we obtain the scheme of complete splitting (MARCHUK and YANENKO [1964])

    (φ^{n+1/4} − φ^n)/τ + A1 (φ^{n+1/4} + φ^n)/2 = f^n,              (70.1a)

    (φ^{n+(i+1)/4} − φ^{n+i/4})/τ + A_{2,i} (φ^{n+(i+1)/4} + φ^{n+i/4})/2 = 0,    (70.1b)

    i = 1, 2, 3.

This scheme is absolutely stable but it has only first-order accuracy in τ (since the
operators A1 and A_{2,i} do not commute). Equation (70.1a) is realized as in the
scheme of incomplete splitting.
We now consider the solution of equations (70.1b). Let Ω be a parallelepiped.
Then these equations are easily solvable if the operators s_i ∂/∂x_i are approximated by
schemes of the running count. The realization of the boundary conditions is obvious in
this case. In the case of an arbitrary convex domain Ω we can proceed as follows. Let Ω̃
be a parallelepiped containing Ω. We extend the definitions of σ_s, g and f by zeros on
Ω̃\Ω and take σ(x) on Ω̃\Ω to be an arbitrary positive constant. We thus obtain new
functions σ̃, σ̃_s, f̃ and g̃ and can consider the problem

    (1/v) ∂φ̃/∂t + (s, ∇)φ̃ + σ̃φ̃ = σ̃_s Sφ̃ + f̃,

    x ∈ Ω̃, t ∈ Ω_t, s ∈ ω,

    φ̃ = 0 for x ∈ ∂Ω̃, (s, n) < 0, s ∈ ω,                            (70.2)

    φ̃ = g̃ for t = 0.

The solution of this problem coincides with that of the initial problem in Ω.
Solving (70.2) by means of the splitting schemes considered above, we thus also obtain
an approximate solution of the initial problem.

71. The scheme of approximate factorization of the operator

The scheme has the form

    (E + τA1)(E + τA2)(φ^{n+1} − φ^n) = −2τ(Aφ^n − f^n),             (71.1)

where

    f^n = f_h|_{t=t_{n+1/2}},  A = A1 + A2.

It approximates the initial problem (69.1), (69.3)–(69.4) or (69.5) with second-order
accuracy in τ and is absolutely stable. It can be realized through the solution of the
following equations (like the scheme of incomplete splitting):

    (E + τA1) φ^{n+1/3} = 2τF^n,

    (E + τA2) φ^{n+2/3} = φ^{n+1/3},                                 (71.2)

    φ^{n+1} = φ^n + φ^{n+2/3},

where

    F^n = −(Aφ^n − f^n),  n = 0, 1, 2, ….
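A direct transcription of the realization (71.2) reads as follows. This is an illustrative sketch with dense matrices and generic solves; in the concrete transport setting the factor (E + τA1) is inverted by the explicit formulas of (71.4) and (E + τA2) by running counts.

```python
import numpy as np

def factorized_step(phi, A1, A2, f, tau):
    """One step of the approximately factorized scheme (71.1), realized as (71.2)."""
    E = np.eye(len(A1))
    F = f - (A1 + A2) @ phi                          # F^n = -(A phi^n - f^n)
    eta = np.linalg.solve(E + tau * A1, 2.0 * tau * F)   # phi^{n+1/3}
    delta = np.linalg.solve(E + tau * A2, eta)           # phi^{n+2/3}
    return phi + delta                                   # phi^{n+1} = phi^n + phi^{n+2/3}
```

Note that if Aφ^n = f^n the correction vanishes, so the scheme preserves stationary solutions; with a fixed τ this is exactly the iterative process studied in Section 74.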
We consider scheme (71.1) in more detail by applying it to problem (69.5) for
concrete difference approximations of the operators (see SMELOV [1978]). To this
end divide the segment [−1, 1] into 2M equal intervals and take the center of each
one as a grid point in the angular variable. Let us number these grid points as
follows:

    μ_{−M} < μ_{−M+1} < ⋯ < μ_{−1} < 0 < μ_1 < ⋯ < μ_M.

Hence μ_j = −μ_{−j}. On the segment [0, H] we introduce an arbitrary system of grid
points

    0 = z_0 < z_1 < ⋯ < z_N = H,

choosing them in such a way that the points of discontinuity of the equations'
coefficients coincide with some of these grid points. We assume that the indices k, j of
any grid function a_{kj} denote the following correspondence to the grid points:

    (k, j) ↔ (z_k, μ_j) for j > 0,
    (k, j) ↔ (z_{k−1}, μ_j) for j < 0.

We define the difference approximations A1 and A2 as follows:

    (A1 φ)_{kj} = σ_k φ_{kj} − ½ Δμ σ_{s,k} Σ_{m=−M}^{M} φ_{km}  (m ≠ 0),
                                                                     (71.3)
    (A2 φ)_{kj} = μ_j (φ_{kj} − φ_{k−1,j})/h_k,   j > 0,
                  μ_j (φ_{k+1,j} − φ_{kj})/h_k,   j < 0,

where

    k = 1, …, N,  j = ±1, …, ±M,
    h_k = Δz_k = z_k − z_{k−1},
    Δμ = 1/M,
    σ_k = σ(z_{k−1/2}),  σ_{s,k} = σ_s(z_{k−1/2}),
    z_{k−1/2} = ½(z_k + z_{k−1}).

We assume that in (71.3)

    φ_{0,j} = φ_{N+1,j} = 0

(i.e., the boundary conditions of the problem are satisfied). With such difference
approximations of the transport equation operators, system (71.2) can be written in
component form as follows:
    F^n_{kj} = −(A1 φ^n)_{kj} − (A2 φ^n)_{kj} + f^n_{kj},

    F^n_k = ½ Δμ Σ_{m=−M}^{M} F^n_{km}  (m ≠ 0),

    φ^{n+1/3}_{kj} = (2τ/(1 + τσ_k)) [ (τσ_{s,k}/(1 + τσ_{c,k})) F^n_k + F^n_{kj} ],
                                                                     (71.4)
    φ^{n+2/3}_{kj} = (φ^{n+1/3}_{kj} + (τ|μ_j|/h_k) φ^{n+2/3}_{k−1,j})/(1 + τ|μ_j|/h_k),   j > 0,

    φ^{n+2/3}_{kj} = (φ^{n+1/3}_{kj} + (τ|μ_j|/h_k) φ^{n+2/3}_{k+1,j})/(1 + τ|μ_j|/h_k),   j < 0,

    φ^{n+1} = φ^n + φ^{n+2/3},

    k = 1, 2, …, N,  j = ±1, …, ±M,

where

    f^n_{kj} = f(z_{k−1/2}, μ_j, t_{n+1/2}),  σ_{c,k} = σ_k − σ_{s,k},

    φ^{n+2/3}_{0,j} = φ^{n+2/3}_{N+1,j} = 0.

Scheme (71.4) has an approximation of order O(τ² + h + Δμ²), h = max_k h_k, on
smooth solutions and is absolutely stable. It also preserves the major properties of
the operators of the original problem.
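The properties (69.6) that underlie the stability of these schemes can be checked numerically for the approximations (71.3). The sketch below is our own construction (the function name and array layout are assumptions): it assembles A1 and A2 as dense matrices on a uniform grid and verifies that the symmetric part of A1 is bounded below by σ − σ_s and that of A2 is nonnegative.

```python
import numpy as np

def build_slab_operators(z, mu, sigma, sigma_s):
    """Assemble the operators A1 (collision + scattering) and A2 (upwind
    streaming) of (71.3) as dense matrices over the unknowns phi_{kj},
    k = 1..N (cells), j over the 2M direction cosines.
    z: nodes z_0..z_N; mu: positive cosines mu_1..mu_M;
    sigma, sigma_s: cell values sigma(z_{k-1/2}), sigma_s(z_{k-1/2})."""
    N, M = len(z) - 1, len(mu)
    h = np.diff(z)
    dmu = 1.0 / M
    J = 2 * M
    mus = np.concatenate([-mu[::-1], mu])   # mu_{-M} < ... < mu_{-1} < mu_1 < ... < mu_M
    A1 = np.zeros((N * J, N * J))
    A2 = np.zeros((N * J, N * J))
    idx = lambda k, j: (k - 1) * J + j      # k = 1..N, j = 0..2M-1
    for k in range(1, N + 1):
        for j in range(J):
            r = idx(k, j)
            A1[r, r] += sigma[k - 1]
            # scattering sum couples all angles within the cell
            A1[r, idx(k, 0):idx(k, 0) + J] -= 0.5 * dmu * sigma_s[k - 1]
            if mus[j] > 0:                  # sweep toward increasing z
                A2[r, r] += mus[j] / h[k - 1]
                if k > 1:
                    A2[r, idx(k - 1, j)] -= mus[j] / h[k - 1]
            else:                           # sweep toward decreasing z
                A2[r, r] -= mus[j] / h[k - 1]
                if k < N:
                    A2[r, idx(k + 1, j)] += mus[j] / h[k - 1]
    return A1, A2
```

On a uniform grid the smallest eigenvalue of the symmetric part of A1 is exactly σ − σ_s (the blockwise rank-one structure of the scattering term), which is the constant γ in (69.6).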

72. The method of integral identities and the splitting method

Let us now construct the scheme for a problem in a slab on the basis of the method of
integral identities and the two-cycle splitting method (MARCHUK [1980], p. 464). We
denote by φ⁺ the solution of the problem for μ > 0, and by φ⁻ the solution for μ < 0.
Then the transport equation can be written as a system of two equations for μ > 0:

    (1/v) ∂φ⁺/∂t + μ ∂φ⁺/∂z + σφ⁺ = ½ σ_s ∫_0^1 (φ⁺ + φ⁻) dμ′ + f⁺,
                                                                     (72.1)
    (1/v) ∂φ⁻/∂t − μ ∂φ⁻/∂z + σφ⁻ = ½ σ_s ∫_0^1 (φ⁺ + φ⁻) dμ′ + f⁻,

with boundary conditions

    φ⁺(0, μ, t) = 0,  φ⁻(H, μ, t) = 0.                               (72.2)
We sum and subtract the two equations (72.1). As a result we obtain two new
equations,

    (1/v) ∂u/∂t + μ ∂v/∂z + σu = σ_s ∫_0^1 u dμ′ + g,                (72.3a)

    (1/v) ∂v/∂t + μ ∂u/∂z + σv = r,                                  (72.3b)

where

    u = ½(φ⁺ + φ⁻),  v = ½(φ⁺ − φ⁻),

    g = ½(f⁺ + f⁻),  r = ½(f⁺ − f⁻).

The boundary conditions (72.2) transform into

    u + v = 0 for z = 0,
    u − v = 0 for z = H,                                             (72.4)

and the initial conditions are

    u = u⁰,  v = v⁰ for t = 0.                                       (72.5)
We write (72.3)–(72.5) in operator form. To this end we introduce the vector functions
w, w⁰ and F and the operator A:

    w = (u, v)ᵀ,  w⁰ = (u⁰, v⁰)ᵀ,  F = (g, r)ᵀ,

    A = [ σ − σ_s ∫_0^1 · dμ′    μ ∂/∂z ]
        [ μ ∂/∂z                 σ      ].                           (72.6)

We consider them in a Hilbert space L2(D), D = (0, H) × (0, 1), of vector functions with
the inner product

    (a, b) = ∫_0^1 dμ ∫_0^H Σ_i a_i b_i dz,

where a_i and b_i are the components of the vector functions a and b. Now we can write
the problem under consideration as follows:

    (1/v) ∂w/∂t + Aw = F in D × (0, T),

    w = w⁰ in D for t = 0,                                           (72.7)

where the function w for any t belongs to the domain of the operator A (in particular, it
satisfies conditions (72.4)).
We construct a difference approximation of the problem in the variable z. To
this end we introduce two systems of grid points: the main system {z_k}_{k=0}^N, z_0 = 0,
z_N = H, and the auxiliary system {z_{k+1/2}}_{k=0}^{N−1}. The points of these two systems
alternate, i.e. z_{k−1/2} < z_k < z_{k+1/2}. We integrate (72.3a) in z with limits (z_0, z_{1/2}),
(z_{k−1/2}, z_{k+1/2}), k = 1, …, N−1, and (z_{N−1/2}, z_N), and (72.3b) with limits (z_{k−1}, z_k),
k = 1, …, N. In the elementary integrals we approximately replace the function u by
its value at the corresponding grid point z_k and the function v by its value at z_{k−1/2},
take the boundary conditions into account and omit the approximation errors. Then we
obtain the following system of equations:

    (1/v) ∂φ/∂t + Aφ = F,  t ∈ Ω_t,
                                                                     (72.8)
    φ = φ⁰ for t = 0,

where

    φ = (u_0, v_{1/2}, u_1, …, u_{N−1}, v_{N−1/2}, u_N),

    F = (g_0, r_{1/2}, g_1, …, g_{N−1}, r_{N−1/2}, g_N),

    φ⁰ = (u_0⁰, v_{1/2}⁰, u_1⁰, …, u_{N−1}⁰, v_{N−1/2}⁰, u_N⁰),

    σ_k = (1/Δz_k) ∫_{z_{k−1/2}}^{z_{k+1/2}} σ dz,  σ_{s,k} = (1/Δz_k) ∫_{z_{k−1/2}}^{z_{k+1/2}} σ_s dz,

    σ_{k−1/2} = (1/Δz_{k−1/2}) ∫_{z_{k−1}}^{z_k} σ dz,

    g_k = (1/Δz_k) ∫_{z_{k−1/2}}^{z_{k+1/2}} g dz,  r_{k−1/2} = (1/Δz_{k−1/2}) ∫_{z_{k−1}}^{z_k} r dz,

    Δz_0 = z_{1/2} − z_0,
    Δz_k = z_{k+1/2} − z_{k−1/2},  k = 1, …, N−1,
    Δz_N = z_N − z_{N−1/2},
    Δz_{k−1/2} = z_k − z_{k−1},  k = 1, …, N,
    h = max(Δz_0, Δz_N, Δz_k, Δz_{k−1/2}),

    A = A2 + A1,

    A1 = diag(σ_{i/2} − γ_{i/2} σ_{s,i/2} ∫_0^1 · dμ′),  i = 0, 1, …, 2N,

    A2 = [  μ/Δz_0       μ/Δz_0        0         ⋯        0         ]
         [ −μ/Δz_{1/2}    0          μ/Δz_{1/2}   ⋯        0         ]
         [      ⋱            ⋱            ⋱                          ]
         [  0    ⋯   −μ/Δz_{N−1/2}     0       μ/Δz_{N−1/2}          ]
         [  0    ⋯        0         −μ/Δz_N     μ/Δz_N               ],

    γ_{i/2} = 1 for i even,  0 for i odd.

Let H(0, 2N) be the Hilbert space of vector functions a = (a_0, a_{1/2}, a_1, …,
a_{N−1}, a_{N−1/2}, a_N) with inner product and norm

    (a, b) = Σ_{i=0}^{2N} Δz_{i/2} ∫_0^1 a_{i/2} b_{i/2} dμ,  ‖a‖ = (a, a)^{1/2},  a, b ∈ H(0, 2N).

In obtaining (72.8), the approximation errors in the case of a uniform grid are of first
order in h in the first and the last equations and O(h²) in the other equations, on
smooth solutions. Based on this fact we can prove that the estimate (MARCHUK [1980],
p. 472)

    max_t ‖φ_T − φ(t)‖ ≤ O(h² ln^{1/2}(1/h))

holds, where φ_T is the vector of the values of the exact solution of the problem:

    φ_T = (u(z_0, μ, t), v(z_{1/2}, μ, t), …, v(z_{N−1/2}, μ, t), u(z_N, μ, t)).
Note also that the operators A1 and A2 satisfy the relationships

    (A2 a, a) ≥ 0,

    (A1 a, a) ≥ γ‖a‖²,  γ = const > 0.

This allows us to formulate the following algorithm for solving (72.8) using the
two-cycle splitting method:

    (E + ½τA2) φ^{j−2/3} = (E − ½τA2) φ^{j−1},

    (E + ½τA1) φ^{j−1/3} = (E − ½τA1) φ^{j−2/3},

    φ^{j+1/3} = φ^{j−1/3} + 2τF^j,                                   (72.9)

    (E + ½τA1) φ^{j+2/3} = (E − ½τA1) φ^{j+1/3},

    (E + ½τA2) φ^{j+1} = (E − ½τA2) φ^{j+2/3},

where F^j is the vector with the components

    F^j = (1/(t_{j+1} − t_{j−1})) ∫_{t_{j−1}}^{t_{j+1}} F dt,  t_j = jΔt,  τ = vΔt.

It follows from the properties of A1 and A2 that, if the solution is sufficiently smooth,
scheme (72.9) approximates (72.8) with order O(Δt²) and is stable:

    max_j ‖φ^j‖ ≤ C(‖φ⁰‖ + max_j ‖F^j‖).

It follows from this inequality and the approximation that

    max_j ‖φ(t_j) − φ^j‖ ≤ O(τ²).

Hence

    max_j ‖φ_T(t_j) − φ^j‖ ≤ O(τ² + h² ln^{1/2}(1/h)).               (72.10)
We now consider the solution of the system of equations (72.9). Since the matrix
(E + ½τA2) is tridiagonal, the first and the last equations of (72.9) are easily
solvable for any fixed μ. Since the matrix operator (E + ½τA1) is diagonal in the
spatial index, the solution of the second equation of (72.9) is reduced to the following
computations:

    φ^{j−1/3}_i = [ (1 − ½τσ_i) φ^{j−2/3}_i + (τσ_{s,i}/(1 + ½τσ_{c,i})) ∫_0^1 φ^{j−2/3}_i dμ′ ] / (1 + ½τσ_i),

    i = 0, 1, …, N,                                                  (72.11)

    φ^{j−1/3}_{i−1/2} = κ_{i−1/2} φ^{j−2/3}_{i−1/2},  κ_{i−1/2} = (1 − ½τσ_{i−1/2})/(1 + ½τσ_{i−1/2}),  i = 1, …, N,

where σ_{c,i} = σ_i − σ_{s,i}. Analogous formulae can be written for solving the fourth
equation of (72.9).
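The key point in (72.11) is that (E + ½τA1) couples the angular values only through the integral ∫_0^1 · dμ′, so it can be inverted in closed form without a linear solver. A sketch of this step follows, with our own (assumed) layout: u-values on the main nodes and v-values on the auxiliary nodes, angles along the second axis, and quadrature weights w standing in for the integral.

```python
import numpy as np

def solve_A1_halfstep(phi_u, phi_v, sigma_u, sigma_v, sigma_s, w, tau):
    """Solve (E + tau/2 A1) x = (E - tau/2 A1) phi for the diagonal operator A1
    of (72.8): on u-components A1 u = sigma*u - sigma_s * int_0^1 u dmu',
    on v-components A1 v = sigma*v.  w: quadrature weights with sum(w) = 1."""
    sigma_c = sigma_u - sigma_s
    integral = phi_u @ w                               # int phi_u dmu' per node
    u_new = ((1.0 - 0.5*tau*sigma_u)[:, None] * phi_u
             + (tau*sigma_s/(1.0 + 0.5*tau*sigma_c))[:, None] * integral[:, None]
             ) / (1.0 + 0.5*tau*sigma_u)[:, None]
    kappa = (1.0 - 0.5*tau*sigma_v) / (1.0 + 0.5*tau*sigma_v)
    v_new = kappa[:, None] * phi_v
    return u_new, v_new
```

The closed form is obtained by first integrating the u-equation in μ (which yields a scalar equation for the integral at each node) and then substituting back, which is exactly the manipulation behind (72.11).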
Let us now formulate the difference approximation of the equations in μ. To this end
we divide the interval 0 ≤ μ ≤ 1 into partial intervals Δμ_l by grid points chosen so
that the best approximation of the integrals in (72.11) on the given class of solutions is
ensured:

    ∫_0^1 φ(μ′) dμ′ ≈ Σ_{l=1}^m s_l φ(μ_l),

where s_l is the weight of the chosen quadrature formula. Replacing the integrals in
(72.9) with this quadrature formula and considering the system at μ = μ_l, l = 1, …, m,
we come to the system of linear algebraic equations approximating the original problem:

    (E + ½τA_{2,l}) φ_l^{j−2/3} = (E − ½τA_{2,l}) φ_l^{j−1},

    (E + ½τA_{1,l}) φ_l^{j−1/3} = (E − ½τA_{1,l}) φ_l^{j−2/3},

    φ_l^{j+1/3} = φ_l^{j−1/3} + 2τF_l^j,                             (72.12)

    (E + ½τA_{1,l}) φ_l^{j+2/3} = (E − ½τA_{1,l}) φ_l^{j+1/3},

    (E + ½τA_{2,l}) φ_l^{j+1} = (E − ½τA_{2,l}) φ_l^{j+2/3},

    φ_l^0 = φ^{(0)}(μ_l),

where φ_l^{j−1}, …, φ_l^{j+1} are vectors of dimension (2N + 1), l = 1, …, m,
A_{2,l} = A2(μ_l), and the operator A_{1,l} acts according to the formula

    (A_{1,l} φ_l)_{i/2} = σ_{i/2} φ_{l,i/2} − γ_{i/2} σ_{s,i/2} Σ_{k=1}^m s_k φ_{k,i/2},

    i = 0, …, 2N.
As was noted above, the first and the last equations of (72.12) are easily solvable, for
example, by the sweep method. The second equation is solved by the formulae

    φ^{j−1/3}_{l,i} = [ (1 − ½τσ_i) φ^{j−2/3}_{l,i} + (τσ_{s,i}/(1 + ½τσ_{c,i})) Σ_{k=1}^m s_k φ^{j−2/3}_{k,i} ] / (1 + ½τσ_i),

    i = 0, 1, …, N,                                                  (72.13)

    φ^{j−1/3}_{l,i−1/2} = κ_{i−1/2} φ^{j−2/3}_{l,i−1/2},  i = 1, …, N.

The fourth equation is solved using these same formulae with j + 1 instead of j.
Thus, the algorithm for the numerical solution of the nonstationary transport
equation is completely determined. Hence, we come to an absolutely stable scheme
with accuracy of order O(τ² + h² ln^{1/2}(1/h)) on smooth solutions. The order of
accuracy in μ depends on the choice of the quadrature formula. (Note that μ = 0 is also
allowed as a grid point in the scheme under consideration.)
We have not yet imposed restrictions on the grid steps. Note, however, that if the
sweep method is used in the scheme for solving the tridiagonal systems, then for its
stability it is sufficient to require that τ ≤ min_i Δz_{i/2}. For solving practical problems
this condition imposes some restrictions on the choice of the time step.
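The sweep method referred to here is the standard three-point elimination for tridiagonal systems (also known as the Thomas algorithm). A minimal sketch, with variable names of our own choosing:

```python
import numpy as np

def sweep(a, b, c, d):
    """Sweep (Thomas) method for a tridiagonal system
    a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] = c[-1] = 0."""
    n = len(d)
    p, q = np.zeros(n), np.zeros(n)
    p[0], q[0] = -c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                     # forward elimination
        den = b[i] + a[i] * p[i - 1]
        p[i] = -c[i] / den
        q[i] = (d[i] - a[i] * q[i - 1]) / den
    x = np.zeros(n)
    x[-1] = q[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        x[i] = p[i] * x[i + 1] + q[i]
    return x
```

Strict diagonal dominance of the matrix (which the time-step restriction above guarantees for (E + ½τA2)) keeps the denominators away from zero and makes the elimination stable.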
By using the method of integral identities together with the splitting and quadrature
methods in the same way, we can construct difference approximations for problems of
particle transport theory that depend on several spatial variables (see MARCHUK and
AGOSHKOV [1981], pp. 405–410).

73. The numerical scheme for the nonstationary transport equation in (x, y) geometry

We consider the nonstationary problem for the transport equation in (x, y)
geometry:

    (1/v) ∂φ/∂t + μ ∂φ/∂x + η ∂φ/∂y + σφ
        = (σ_s/2π) ∫_0^1 dγ′ ∫_0^{2π} φ(γ′, ψ′, x, y, t) dψ′ + f,    (73.1)

    φ = 0 for (x, y) ∈ ∂Ω, (μn_x + ηn_y) < 0,                        (73.2)

    φ = g for t = 0,                                                 (73.3)

where

    Ω = {x, y: 0 < x < a, 0 < y < b},

    φ = φ(γ, ψ, x, y, t),  σ = σ(x, y),  σ_s = σ_s(x, y),

    s = (μ, η) ≡ (s1, s2),

    s1 = √(1 − γ²) cos ψ,  s2 = √(1 − γ²) sin ψ,

    γ = s3 = cos θ,  0 ≤ ψ < 2π,  0 ≤ θ ≤ π,

n = (n_x, n_y) is the external normal vector to ∂Ω and the restrictions (69.2) are assumed
to hold.
In this section we consider an algorithm for the numerical solution of problem
(73.1)–(73.3) which is absolutely stable and excludes any iterative processes. We
construct such an algorithm using the following methods of applied mathematics: the
method of integral identities (to approximate the problem in the spatial variables); the
splitting method (to approximate the problem in the time variable); and the quadrature
method (to approximate in the angular variables). We assume further that the solution
of problem (73.1)–(73.3) is sufficiently smooth and focus on the algorithmic side of the
numerical solution of the problem. We give the major statements without proof, since
most of them are discussed in the corresponding sections of the book by MARCHUK and
AGOSHKOV [1981].

73.1. Approximation in spatial variables

Let us first perform some transformations of the problem. Since its solution not
infrequently has singularities where the vector s = (μ, η) approaches directions
parallel to the coordinate axes, it is expedient to transform the original problem so
that the vector s = (μ, η) takes values strictly between these directions. Such
transformations allow us in many cases to increase the approximation accuracy of the
problem. Let us perform such transformations for problem (73.1)–(73.3).

We write the integral term in (73.1) in the form

    Sφ = Σ_{k=1}^4 Sφ^{(k)},  Sφ^{(k)} = (1/2π) ∫_0^1 dγ′ ∫_0^{π/2} φ^{(k)}(γ′, ψ′, x, y, t) dψ′,    (73.4)

where

    φ^{(k)}(γ, ψ, x, y, t) = φ(γ, ψ + ½(k − 1)π, x, y, t),

and consider problem (73.1)–(73.3) consecutively on the subdomains Ω × ω_k, where

    ω_k = {γ, ψ: 0 < γ < 1, ½(k − 1)π < ψ < ½kπ},  k = 1, …, 4.

Then replacing q by q + 2(k - 1)i, we obtain the following system of four equations:
I a(k)
1 0apt +
~ k) aP(k)aP
q+
a(k(k)
+ (k)= a,
a(k) E4 S(k') +f(k) (73.5)

pk =0 on aQ for (#(k)n + (k)ny)<O, (73.6)

((k) = g(k) for t =0,


k= 1, 2, 3, 4, (73.7)
where
0<<½~, O<y<l,
u(k) = 1y2 cos( + (k - 1)),
'(k) = /1 - sin( + (k- 1)t),
f(k) =f(y, '+ (k - 1), x, y, t),
g(k) = g(y, + ½(k- 1)7;, x, y).
As is easily seen, directions s parallel to the coordinate axes are already excluded
from problem (73.5)-(73.7).
We take the inner product in L2(Ω) of (73.5) and (73.7) with an arbitrary function
v(x, y) ∈ W2¹(Ω) (i.e., having first derivatives from L2(Ω) and being equal to zero on
∂Ω). Then after integrating by parts we get the equalities

    (1/v)(∂φ^{(k)}/∂t, v) − (φ^{(k)}, μ^{(k)} ∂v/∂x + η^{(k)} ∂v/∂y) + (σφ^{(k)}, v)

        = Σ_{k′=1}^4 (σ_s Sφ^{(k′)}, v) + (f^{(k)}, v),              (73.8)

    (φ^{(k)}, v) = (g^{(k)}, v) for t = 0,

    k = 1, 2, 3, 4,                                                  (73.9)

    (u, v) = ∫_Ω uv dx dy.

We write conditions (73.6) in a generalized form

    ∫_{∂Ω∖∂Ω^{(k)}} |μ^{(k)} n_x + η^{(k)} n_y| φ^{(k)} v^{(k)} ds = 0,

    k = 1, 2, 3, 4,                                                  (73.10)

where v^{(k)} ∈ W2¹(Ω), v^{(k)}|_{∂Ω^{(k)}} = 0, and ∂Ω^{(k)} is that part of the boundary ∂Ω for
which (μ^{(k)} n_x + η^{(k)} n_y) > 0 for the given value of k. Thus the exact solution of the
original problem satisfies the equalities (73.8)–(73.10), which are used below for
constructing integral identities and for approximating the problem in the spatial
variables.


We introduce on Ω a uniform grid

    x_i = ih,  y_j = jh,  a = b,  N = M,  h = a/N

(only for the sake of simplicity; the algorithm for a grid nonuniform both in x and in
y has been considered by MARCHUK and AGOSHKOV [1981], p. 306) and define the
functions

    φ_i(x) = (1/γ_{x,i}) · { (x − x_{i−1})/h,  x ∈ (x_{i−1}, x_i),
                             (x_{i+1} − x)/h,  x ∈ (x_i, x_{i+1}),
                             0,                x ∉ (x_{i−1}, x_{i+1});

    φ_j(y) = (1/γ_{y,j}) · { (y − y_{j−1})/h,  y ∈ (y_{j−1}, y_j),
                             (y_{j+1} − y)/h,  y ∈ (y_j, y_{j+1}),
                             0,                y ∉ (y_{j−1}, y_{j+1});

where γ_{x,i} and γ_{y,j} are normalizing coefficients, defined from the conditions

    ∫_0^a φ_i(x) dx = 1,  ∫_0^b φ_j(y) dy = 1;                       (73.11)

in the case of a uniform grid (73.11) gives the values γ_{x,i} = γ_{y,j} = h.
To construct the systems of integral identities we set in (73.8) and (73.9)

    v(x, y) = φ_i(x) φ_j(y),  i, j = 1, …, N − 1.

As a result we come to systems of integral identities whose elements are arranged in
vectors of dimension (N − 1)² and matrices of order (N − 1)², numbered as follows: for
k = 1 we start from x = 0, y = 0; for k = 2 from x = a, y = 0; for k = 3 from x = a,
y = b; for k = 4 from x = 0, y = b. By assuming simple approximations in these
systems we come to a problem of the type

    (1/v) ∂φ^{(k)}/∂t + (|μ^{(k)}|(Λ⊗I) + |η^{(k)}|(I⊗Λ) + σ̄^{(k)}) φ^{(k)}

        = σ̄_s^{(k)} Σ_{k′=1}^4 P^{(k,k′)} S φ^{(k′)} + f^{(k)},      (73.12)

    φ^{(k)} = g^{(k)} for t = 0,  k = 1, 2, 3, 4,                    (73.13)
where

    G^{(k)} = (G^{(k)}_{1,1}, G^{(k)}_{2,1}, …, G^{(k)}_{N−1,1}, …, G^{(k)}_{1,M−1}, …, G^{(k)}_{N−1,M−1}),

    G^{(k)}_{n,m} = (G^{(k)}, φ_n(x) φ_m(y)),         k = 1,
                    (G^{(k)}, φ_{N−n}(x) φ_m(y)),     k = 2,
                    (G^{(k)}, φ_{N−n}(x) φ_{M−m}(y)), k = 3,
                    (G^{(k)}, φ_n(x) φ_{M−m}(y)),     k = 4,

    G = φ, g, f,

Λ is (1/h) times the lower bidiagonal matrix of order N − 1 with units on the diagonal
and −1 on the subdiagonal, ⊗ denotes the Kronecker product

    A⊗B = [ A_{1,1}B     ⋯  A_{1,N−1}B   ]
          [    ⋮                ⋮        ]
          [ A_{N−1,1}B   ⋯  A_{N−1,N−1}B ],

σ̄^{(k)} and σ̄_s^{(k)} are diagonal matrices of order (N − 1) × (M − 1) with the elements

    σ̄^{(k)}_{n,m} = σ_{n,m},      k = 1,      σ̄^{(k)}_{s,n,m} = σ_{s,n,m},      k = 1,
                    σ_{N−n,m},    k = 2,                        σ_{s,N−n,m},    k = 2,
                    σ_{N−n,M−m},  k = 3,                        σ_{s,N−n,M−m},  k = 3,
                    σ_{n,M−m},    k = 4,                        σ_{s,n,M−m},    k = 4,

    σ_{n,m} = (σ, φ_n(x) φ_m(y)),  σ_{s,n,m} = (σ_s, φ_n(x) φ_m(y)),

    P^{(k,k)} = I⊗I,  k = 1, 2, 3, 4,

    P^{(1,2)} = P^{(2,1)} = P^{(3,4)} = P^{(4,3)} = I⊗P,

    P^{(1,3)} = P^{(3,1)} = P^{(2,4)} = P^{(4,2)} = P⊗P,

    P^{(1,4)} = P^{(4,1)} = P^{(2,3)} = P^{(3,2)} = P⊗I,

where I is the identity matrix of order N − 1 and P is the matrix of order N − 1 with
units on the secondary diagonal and zeros elsewhere.
Note that the numbering of the vector and matrix elements is chosen in such a way
that the values {φ^{(1)}_{n,m}, φ^{(2)}_{N−n,m}, φ^{(3)}_{N−n,M−m}, φ^{(4)}_{n,M−m}}, i.e., the approximations of the
averaged values of the exact solution

    φ^{(1)}_{T,n,m},  φ^{(2)}_{T,N−n,m},  φ^{(3)}_{T,N−n,M−m},  φ^{(4)}_{T,n,M−m},
    φ^{(k)}_{T,n,m} = (φ^{(k)}_T, φ_n(x) φ_m(y)),

are in correspondence with the grid point (x_n, y_m). Analogous relationships exist for
the grid points and the elements of the vectors {g^{(k)}} and {f^{(k)}}.
Let H_h be a Hilbert space of vector functions with inner product and norm defined
by

    (u, v)_h = Σ_{k=1}^4 Σ_{n=1}^{N−1} Σ_{m=1}^{M−1} ∫_0^1 dγ ∫_0^{π/2} u^{(k)}_{n,m}(γ, ψ) v^{(k)}_{n,m}(γ, ψ) dψ,

    ‖u‖_h = (u, u)_h^{1/2}.

We define in H_h the vector functions

    G = (G^{(1)}, G^{(2)}, G^{(3)}, G^{(4)})ᵀ,  G = φ, f, g,

and the matrix operators

    A1 = diag(A_{1,kk}),  σ̄ = diag(σ̄_{kk}),  σ̄_s = diag(σ̄_{s,kk}),

    A_{1,kk} = |μ^{(k)}|(Λ⊗I) + |η^{(k)}|(I⊗Λ),

    σ̄_{kk} = σ̄^{(k)},  σ̄_{s,kk} = σ̄_s^{(k)},  k = 1, 2, 3, 4,

    A2 = σ̄ − σ̄_s [ P^{(1,1)}  ⋯  P^{(1,4)} ]
                  [    ⋮              ⋮    ] S.
                  [ P^{(4,1)}  ⋯  P^{(4,4)} ]

Then problem (73.12)–(73.13) takes the form

    (1/v) ∂φ/∂t + (A1 + A2)φ = f in Ω_t,                             (73.14)

    φ = g for t = 0.                                                 (73.15)


Note the following properties of the operators A1 and A2 (see MARCHUK and
AGOSHKOV [1981]):

(a) The matrix A1 is a lower triangular matrix with positive elements on the diagonal
and nonpositive elements outside the diagonal.

(b) A1 is positive-semidefinite, with

    (A1 v, v)_h ≥ (1/(2h)) Σ_{k=1}^4 Σ_{n,m=1}^{N−1} ∫_0^1 dγ ∫_0^{π/2} (|μ^{(k)}| + |η^{(k)}|) |v^{(k)}_{n,m}|² dψ.    (73.16)

(c) The operator A2 is symmetric in H_h and positive-definite:

    (A2 v, v)_h ≥ α_0 ‖v‖²_h,                                        (73.17)

where

    α_0 = min_{n,m} (σ − σ_s, φ_n(x) φ_m(y)) > 0.

These properties guarantee the existence and uniqueness of the solution φ(γ, ψ, t)
of problem (73.14)–(73.15), with

    ‖φ‖_h ≤ c ( ‖g‖_h + ( ∫_0^T ‖f‖²_h dt )^{1/2} ).                 (73.18)

Besides, according to MARCHUK and AGOSHKOV [1981], the following error estimate
holds:

    max_{t∈[0,T]} ‖φ_T − φ‖_h(t) + ( ∫_0^T ‖φ_T − φ‖²_h dt )^{1/2} ≤ const · h,    (73.19)

where the vector φ_T = φ_T(γ, ψ, t) has a structure similar to that of the vector φ(γ, ψ, t),
but with the elements (φ, φ_n(x) φ_m(y)) equal to the averaged values of the exact
solution of problem (73.1)–(73.3).

73.2. Approximation in the time variable

The properties of A1 and A2 allow us to formulate the following algorithm for
solving (73.14)–(73.15), based on the two-cycle componentwise splitting method:

    (E + ½τA1) φ^{j−2/3} = (E − ½τA1) φ^{j−1},

    (E + ½τA2) φ^{j−1/3} = (E − ½τA2) φ^{j−2/3},

    φ^{j+1/3} = φ^{j−1/3} + 2τf^j,                                   (73.20)

    (E + ½τA2) φ^{j+2/3} = (E − ½τA2) φ^{j+1/3},

    (E + ½τA1) φ^{j+1} = (E − ½τA1) φ^{j+2/3},

where E is the unit matrix of the same dimension as A1,

    τ = vΔt,  Δt = T/J,  t_j = jΔt,  φ⁰ = g,

    f^j = ∫_{t_{j−1}}^{t_{j+1}} f dt / (t_{j+1} − t_{j−1}).

It also follows from the properties of A1 and A2 that scheme (73.20) is absolutely
stable, with

    max_j ‖φ^j‖_h ≤ C(‖g‖_h + max_j ‖f^j‖_h),  j = 2k,               (73.21)

    k = 0, 1, 2, …,  C = const.

When the solutions of the approximated problems are sufficiently smooth, scheme
(73.20) has a second-order approximation in τ at the points t_{2j}, j = 1, 2, …, J.
Consequently,

    max_j ‖φ_T(t_{2j}) − φ^{2j}‖_h ≤ const (τ² + h).                 (73.22)

73.3. Approximation in angular variables

We approximate the equations in the angular variables γ and ψ. To that end we
introduce an approximation of the integral Sφ by means of a suitable quadrature
formula,

    Sφ ≈ Σ_{l=1}^L Σ_{d=1}^D A_{ld} φ(γ_l, ψ_d),

where the A_{ld} are weights and the (γ_l, ψ_d) are the values at the grid points. By
replacing the integrals in (73.20) with quadrature formulae and considering the
equations from (73.20) at γ = γ_l, l = 1, …, L, ψ = ψ_d, d = 1, …, D, we come to
a system of linear algebraic equations that approximates the original problem:

    (E + ½τA_{1,ld}) φ^{j−2/3}_{ld} = (E − ½τA_{1,ld}) φ^{j−1}_{ld},     (73.23)

    (E + ½τA_{2,ld}) φ^{j−1/3}_{ld} = (E − ½τA_{2,ld}) φ^{j−2/3}_{ld},   (73.24)

    φ^{j+1/3}_{ld} = φ^{j−1/3}_{ld} + 2τf^j_{ld},                        (73.25)

    (E + ½τA_{2,ld}) φ^{j+2/3}_{ld} = (E − ½τA_{2,ld}) φ^{j+1/3}_{ld},   (73.26)

    (E + ½τA_{1,ld}) φ^{j+1}_{ld} = (E − ½τA_{1,ld}) φ^{j+2/3}_{ld},     (73.27)

where the vectors φ_{ld} have the same structure as φ^j in (73.20) (they are regarded as
approximations to φ^j(γ_l, ψ_d)), φ⁰_{ld} = g(γ_l, ψ_d), and the operator A_{2,ld} is defined as
follows:

    A_{2,ld} φ_{ld} = σ̄ φ_{ld} − σ̄_s [ P^{(1,1)}  ⋯  P^{(1,4)} ]
                                      [    ⋮              ⋮    ]  Σ_{l′=1}^L Σ_{d′=1}^D A_{l′d′} φ_{l′d′}.
                                      [ P^{(4,1)}  ⋯  P^{(4,4)} ]

Thus the algorithm for the numerical solution of the nonstationary transport
equation has been defined. As a result we have obtained an absolutely stable scheme
of first-order accuracy in h and second-order accuracy in τ on smooth solutions. The
order of approximation in the angular variables is determined by the quadrature
formula chosen (LEBEDEV [1971, 1976]).
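The text leaves the concrete choice of angular quadrature to LEBEDEV [1971, 1976]. As a simple stand-in with the required structure, one can take a product formula — Gauss–Legendre in γ times the midpoint rule in ψ — normalized so that Sφ applied to φ ≡ 1 over all four quadrants gives 1. The sketch below is such an assumed choice, not the one from the cited works:

```python
import numpy as np

def angular_weights(L, D):
    """Product quadrature for S phi = (1/2pi) int_0^1 dgamma int_0^{pi/2} dpsi:
    Gauss-Legendre in gamma on (0, 1) times a uniform midpoint rule in psi.
    Returns nodes (gamma_l, psi_d) and weights A_ld with sum(A) = 1/4, i.e.
    one quadrant's share of the full angular average."""
    x, w = np.polynomial.legendre.leggauss(L)   # nodes/weights on (-1, 1)
    gamma = 0.5 * (x + 1.0)                     # map to (0, 1)
    wg = 0.5 * w                                # weights now sum to 1
    psi = (np.arange(D) + 0.5) * (np.pi / 2) / D
    wp = np.full(D, (np.pi / 2) / D)
    A = np.outer(wg, wp) / (2.0 * np.pi)
    return gamma, psi, A
```

Positivity of all weights matters here: it is what keeps the discrete scattering operator consistent with the positivity properties (73.16)–(73.17).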

73.4. Numerical realization of the algorithm

Let

    ξ^{j−1}_{ld} = (E − ½τA_{1,ld}) φ^{j−1}_{ld},  ξ^{j−2/3}_{ld} = (E − ½τA_{2,ld}) φ^{j−2/3}_{ld},

    ξ^{j+1/3}_{ld} = (E − ½τA_{2,ld}) φ^{j+1/3}_{ld},  ξ^{j+2/3}_{ld} = (E − ½τA_{1,ld}) φ^{j+2/3}_{ld}.

Then (73.23)–(73.27) can be written in component form as follows.

(a) Equations (73.23):

    [1 + (τ/2h)(|μ^{(k)}_{ld}| + |η^{(k)}_{ld}|)] φ^{k,j−2/3}_{l,d,n,m}

        − (τ/2h)|μ^{(k)}_{ld}| φ^{k,j−2/3}_{l,d,n−1,m} − (τ/2h)|η^{(k)}_{ld}| φ^{k,j−2/3}_{l,d,n,m−1} = ξ^{k,j−1}_{l,d,n,m},    (73.23′)

where

    μ_{ld} = μ(γ_l, ψ_d),  η_{ld} = η(γ_l, ψ_d),

    φ^{k,j−2/3}_{l,d,0,m} = φ^{k,j−2/3}_{l,d,n,0} = 0,

    ξ^{k,j−1}_{l,d,n,m} = [1 − (τ/2h)(|μ^{(k)}_{ld}| + |η^{(k)}_{ld}|)] φ^{k,j−1}_{l,d,n,m}

        + (τ/2h)|μ^{(k)}_{ld}| φ^{k,j−1}_{l,d,n−1,m} + (τ/2h)|η^{(k)}_{ld}| φ^{k,j−1}_{l,d,n,m−1},

    φ^{k,j−1}_{l,d,0,m} = φ^{k,j−1}_{l,d,n,0} = 0,

    k = 1, 2, 3, 4,  n = 1, …, N−1,  m = 1, …, M−1,

    l = 1, …, L,  d = 1, …, D.

Equations (73.23) are solved according to the formulae

    φ^{k,j−2/3}_{l,d,n,m} = [ ξ^{k,j−1}_{l,d,n,m} + (τ/2h)|μ^{(k)}_{ld}| φ^{k,j−2/3}_{l,d,n−1,m}

        + (τ/2h)|η^{(k)}_{ld}| φ^{k,j−2/3}_{l,d,n,m−1} ] / [ 1 + (τ/2h)(|μ^{(k)}_{ld}| + |η^{(k)}_{ld}|) ].    (73.28)

(b) Equations (73.24):

    (1 + ½τσ^{(1)}_{n,m}) φ^{1,j−1/3}_{l,d,n,m}

        − ½τσ^{(1)}_{s,n,m} Σ_{l′=1}^L Σ_{d′=1}^D A_{l′d′} (φ^{1,j−1/3}_{l′,d′,n,m} + φ^{2,j−1/3}_{l′,d′,N−n,m}

        + φ^{3,j−1/3}_{l′,d′,N−n,M−m} + φ^{4,j−1/3}_{l′,d′,n,M−m}) = ξ^{1,j−2/3}_{l,d,n,m},    (73.24′)

and the analogous equations for k = 2, 3, 4,

    l = 1, …, L,  d = 1, …, D,  n = 1, …, N−1,  m = 1, …, M−1,

where

    ξ^{1,j−2/3}_{l,d,n,m} = (1 − ½τσ^{(1)}_{n,m}) φ^{1,j−2/3}_{l,d,n,m}

        + ½τσ^{(1)}_{s,n,m} Σ_{l′=1}^L Σ_{d′=1}^D A_{l′d′} (φ^{1,j−2/3}_{l′,d′,n,m} + φ^{2,j−2/3}_{l′,d′,N−n,m}

        + φ^{3,j−2/3}_{l′,d′,N−n,M−m} + φ^{4,j−2/3}_{l′,d′,n,M−m}),    (73.29)

and analogously for ξ^{k,j−2/3}, k = 2, 3, 4.

The solution of (73.24′) is reduced to solving a succession of systems of order 4LD,
each of which is defined by the equations (73.24′) with fixed n and m. In the case of
isotropic scattering, i.e. in considering problems (73.1)–(73.3) (and also in the case of an
anisotropic indicatrix of the type

    θ = Σ_{p=1}^P θ_p P_p(μ_0),  μ_0 = (s, s′),

with P finite and P_p(μ) the Legendre polynomials), equations (73.24) are solvable in an
explicit form and their solutions are

    φ^{1,j−1/3}_{l,d,n,m} = (2/(2 + τσ^{(1)}_{n,m})) [ ξ^{1,j−2/3}_{l,d,n,m}

        + (τσ^{(1)}_{s,n,m}/(2 + τσ^{(1)}_{n,m} − τσ^{(1)}_{s,n,m})) Σ_{l′=1}^L Σ_{d′=1}^D A_{l′d′} (ξ^{1,j−2/3}_{l′,d′,n,m}

        + ξ^{2,j−2/3}_{l′,d′,N−n,m} + ξ^{3,j−2/3}_{l′,d′,N−n,M−m} + ξ^{4,j−2/3}_{l′,d′,n,M−m}) ],    (73.30)

and analogously for φ^{k,j−1/3}, k = 2, 3, 4,

    l = 1, …, L,  d = 1, …, D.


(c) Equations (73.25):

    φ^{k,j+1/3}_{l,d,n,m} = φ^{k,j−1/3}_{l,d,n,m} + 2τ f^{k,j}_{l,d,n,m},    (73.31)

    k = 1, 2, 3, 4,  l = 1, …, L,  d = 1, …, D,

    n = 1, …, N−1,  m = 1, …, M−1.

(d) Equations (73.26) have the same form as (73.24) and are solved likewise. The
elements of the vectors φ^{j+2/3}_{ld}, ξ^{j+1/3}_{ld} are found correspondingly according to
formulae (73.30), or (73.29) with j replaced by j + 1.

(e) Equations (73.27) are solved according to

    φ^{k,j+1}_{l,d,n,m} = [ ξ^{k,j+2/3}_{l,d,n,m} + (τ/2h)|μ^{(k)}_{ld}| φ^{k,j+1}_{l,d,n−1,m}

        + (τ/2h)|η^{(k)}_{ld}| φ^{k,j+1}_{l,d,n,m−1} ] / [ 1 + (τ/2h)(|μ^{(k)}_{ld}| + |η^{(k)}_{ld}|) ],    (73.32)

    k = 1, …, 4,  l = 1, …, L,  d = 1, …, D,

    n = 1, …, N−1,  m = 1, …, M−1,

    φ^{k,j+1}_{l,d,0,m} = φ^{k,j+1}_{l,d,n,0} = 0,

where

    ξ^{k,j+2/3}_{l,d,n,m} = [1 − (τ/2h)(|μ^{(k)}_{ld}| + |η^{(k)}_{ld}|)] φ^{k,j+2/3}_{l,d,n,m}

        + (τ/2h)|μ^{(k)}_{ld}| φ^{k,j+2/3}_{l,d,n−1,m} + (τ/2h)|η^{(k)}_{ld}| φ^{k,j+2/3}_{l,d,n,m−1},

    φ^{k,j+2/3}_{l,d,0,m} = φ^{k,j+2/3}_{l,d,n,0} = 0.

Thus, the algorithm of the numerical solution of problem (73.1)–(73.3) has been
completely defined. Note in conclusion that a scheme with a second order of
approximation in the spatial variables is given by MARCHUK and AGOSHKOV ([1981],
p. 317). Further, we remark that, formally, δ-sources in x, y and t are admissible in the
above algorithm, and that the algorithm itself can be generalized to the problem of
anisotropic scattering.

74. Splitting methods as iterative algorithms for stationary transport equations

We consider scheme (71.1) as an iterative algorithm for solving the stationary
problem for the transport equation in the slab:

    μ ∂φ/∂z + σφ = ½ σ_s ∫_{−1}^{1} φ(μ′, z) dμ′ + f(μ, z),

    φ(μ, 0) = 0,  μ > 0,                                             (74.1)

    φ(μ, H) = 0,  μ < 0,

assuming that the parameter τ_n is positive and depends on the step number of the
iterative process. The operators A1 and A2 are defined according to (71.3). We have,
therefore, the following iterative process:

    (E + τ_n A1)(E + τ_n A2)(φ^{n+1} − φ^n) = −2τ_n (Aφ^n − f),      (74.2)

which is realized according to formulae (71.4). To estimate the rate of convergence of
the process we write (74.2) for the errors ε^n = φ^n − φ as

    (E + τ_n A1)(E + τ_n A2)(ε^{n+1} − ε^n) = −2τ_n Aε^n,

or

    (E + τ_n A1)(E + τ_n A2) ε^{n+1} = (E − τ_n A1)(E − τ_n A2) ε^n,

or

    (E + τ_n A2) ε^{n+1} = T_n (E + τ_n A2) ε^n,                     (74.3)

where

    T_n = (E + τ_n A1)^{−1}(E − τ_n A1)(E − τ_n A2)(E + τ_n A2)^{−1}.
When we consider equation (74.3) in Euclidean space, we have

    ‖(E + τ_n A2) ε^{n+1}‖ ≤ ‖T_n‖ · ‖(E + τ_n A2) ε^n‖.

Since

    (A1 φ, φ) ≥ α_0 ‖φ‖²,  (A2 φ, φ) ≥ 0,

we have

    ‖(E − τ_n A2)(E + τ_n A2)^{−1}‖²

        = sup_φ ( ‖φ‖² − 2τ_n (A2 φ, φ) + τ_n² ‖A2 φ‖² ) / ( ‖φ‖² + 2τ_n (A2 φ, φ) + τ_n² ‖A2 φ‖² ) ≤ 1,

    ‖(E + τ_n A1)^{−1}(E − τ_n A1)‖² = ‖(E − τ_n A1)(E + τ_n A1)^{−1}‖²

        = sup_φ ( 1 − 2τ_n (A1 φ, φ)/‖φ‖² + τ_n² ‖A1 φ‖²/‖φ‖² ) / ( 1 + 2τ_n (A1 φ, φ)/‖φ‖² + τ_n² ‖A1 φ‖²/‖φ‖² )

        ≤ ( 1 − 2τ_a α_0 + τ_b² ‖A1‖² ) / ( 1 + 2τ_a α_0 + τ_b² ‖A1‖² ) ≡ θ_0 < 1,

where 0 < τ_a ≤ τ_n ≤ τ_b < ∞ are the boundaries of the parameters τ_n. Hence
‖T_n‖ ≤ θ_0 < 1, and for (74.2) we obtain convergence and estimates of the kind

    ‖φ^n − φ‖ ≤ ‖(E + τ_n A2)(φ^n − φ)‖
                                                                     (74.4)
        ≤ θ_0^n ‖(E + τ_0 A2)(φ⁰ − φ)‖ → 0,  n → ∞.
We now take the parameter τ_n = τ independent of the step number of the iterative
process. Then

    θ = θ(τ) = ( 1 − 2τα_0 + τ² ‖A1‖² ) / ( 1 + 2τα_0 + τ² ‖A1‖² ).  (74.5)

From the condition θ′(τ) = 0 we find the value of τ providing the minimum value of the
function θ(τ):

    τ_opt = 1/‖A1‖                                                   (74.6)

(note also that θ″(τ_opt) > 0).

The choice of τ_n can also be optimized on the basis of a simplified approximate
problem, for example, the diffusion problem (PENENKO, SULTANGAZIN and BALASH
[1969]). Namely, by using this simplification of the problem we have τ_opt = 1/σ_c,
where

    σ_c = vrai sup σ_c(z).
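The minimization behind (74.5)–(74.6) is elementary to confirm numerically. The following sketch evaluates θ(τ) on a fine grid for illustrative (assumed) values of α_0 and ‖A1‖:

```python
import numpy as np

def theta(tau, alpha0, norm_a1):
    """Contraction-factor bound (74.5) of the stationary iteration for fixed tau."""
    q = (tau * norm_a1) ** 2
    return (1.0 - 2.0 * tau * alpha0 + q) / (1.0 + 2.0 * tau * alpha0 + q)

alpha0, norm_a1 = 0.7, 3.0            # sample values, not from the text
taus = np.linspace(1e-3, 2.0, 20001)
tau_best = taus[np.argmin(theta(taus, alpha0, norm_a1))]
```

tau_best reproduces τ_opt = 1/‖A1‖ up to the grid resolution, and the minimum value θ(τ_opt) = (1 − α_0/‖A1‖)/(1 + α_0/‖A1‖) makes explicit how the convergence rate degrades when α_0 ≪ ‖A1‖.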

Thus, we have considered an iterative process with an (approximately) factorized
operator for solving a stationary problem in the slab. It is easy to see that the above
conclusions remain valid for multidimensional problems. Besides, other splitting
schemes (see MARCHUK and YANENKO [1964], MARCHUK [1980], etc.) can be used for
solving stationary transport equations.
In conclusion, here is a concise review of the literature dealing with different
aspects of the theory of splitting methods for integro-differential transport equations.
Various splitting schemes for nonstationary transport equations are described by
MARCHUK and YANENKO [1964], MARCHUK [1980], MARCHUK and AGOSHKOV
[1981] and YANENKO [1967]; concrete ways of approximating the operators in all
variables are given by MARCHUK [1980] and MARCHUK and AGOSHKOV [1981].

Splitting methods as iterative algorithms for solving stationary transport
equations are considered by MARCHUK and YANENKO [1964], SMELOV [1978],
MARCHUK and SULTANGAZIN [1965], KOCHERGIN and KUZNETSOV [1969] and MARCHUK,
PENENKO and SULTANGAZIN [1969]. Iterative splitting schemes for systems of spherical
harmonics are discussed by BOYARINTSEV and UZNADZE [1967] and SULTANGAZIN
[1979], and the problem of optimizing the parameters of the iterative processes in
problems of transport theory is discussed by SMELOV [1978], PENENKO,
SULTANGAZIN and BALASH [1969] and MARCHUK and SULTANGAZIN [1965].

Some aspects of realizing boundary conditions in splitting schemes for the transport
equation are considered by MARCHUK and YANENKO [1964], MARCHUK and
AGOSHKOV [1981], YANENKO [1967], KOROBITSINA [1969] and MARCHUK, PENENKO and
SULTANGAZIN [1969]. Note also that this last work is in fact a survey of a certain
period of research into splitting algorithms and their theoretical grounding for
solving stationary problems for transport equations.
CHAPTER XVIII

The Splitting Method for Problems of Hydrodynamics

The approximation of the equations of hydro-, aero- and gas dynamics leads to systems
of nonlinear algebraic equations. The effective solution of these equations is
a complex problem, especially when solving multidimensional problems by means
of implicit schemes in time. On the other hand, implicit schemes are particularly
convenient for the gas dynamics and Navier–Stokes equations, since they essentially
weaken the restrictions on the value of the time step. In this case, as in many
applications, the splitting method is a constructive approach.

This chapter discusses various approaches to the construction of splitting schemes for
the equations of hydro- and gas dynamics. For the sake of simplicity we demonstrate
these methods by splitting the differential systems of equations themselves, on the same
level as in the weak approximation method (see Section 46).

75. Splitting schemes for Navier–Stokes equations with ε-perturbation of the incompressible fluid equations

The artificial compressibility method for solving problems of incompressible fluid


currents has been described by YANENKO [1967].
We consider the problem of a symmetric stationary flux with an incompressible
fluid obstacle in a two-dimensional plane.
The process is described by

uiDiu+(1/p)grad p=vAu, (75.1)


i=

div u=O. (75.2)

Here u_1 and u_2 are the velocity components in the directions x_1 and x_2 respectively,
\rho is the density, p is the pressure, \nu = \mu/\rho is the kinematic viscosity coefficient,
D_i = \partial/\partial x_i and \mathrm{grad} = (\partial/\partial x_1, \partial/\partial x_2).

414 G.I. Marchuk CHAPTER XVIII

To construct a relaxation process, the nonstationary system obtained from the weak
compressibility assumption is formulated in correspondence with (75.1) and (75.2).
This system has the form
\frac{\partial u}{\partial t} + \sum_{i=1}^{2} u_i D_i u + \frac{1}{\rho}\,\mathrm{grad}\,p = \nu\,\Delta u,   (75.3)

\varepsilon\,\frac{\partial p}{\partial t} + \varepsilon \sum_{i=1}^{2} u_i\,\frac{\partial p}{\partial x_i} + p\,\mathrm{div}\,u = 0.   (75.4)

In (75.3) and (75.4), \varepsilon > 0 is considered to be sufficiently small.


Introducing the time grid t_n = n\tau, t_{n+1/2} = (n + \tfrac{1}{2})\tau, on the first subinterval
[t_n, t_{n+1/2}] we can use a splitting scheme with respect to the coordinates x_1 and x_2
for equations (75.3) and (75.4):

\frac{1}{2}\frac{\partial u_1^\varepsilon}{\partial t} + u_1^\varepsilon\frac{\partial u_1^\varepsilon}{\partial x_1} + \frac{1}{\rho}\frac{\partial p^\varepsilon}{\partial x_1} = \nu\frac{\partial^2 u_1^\varepsilon}{\partial x_1^2},

\frac{1}{2}\frac{\partial u_2^\varepsilon}{\partial t} + u_1^\varepsilon\frac{\partial u_2^\varepsilon}{\partial x_1} = \nu\frac{\partial^2 u_2^\varepsilon}{\partial x_1^2},   (75.5)

\frac{\varepsilon}{2}\frac{\partial p^\varepsilon}{\partial t} + \varepsilon u_1^\varepsilon\frac{\partial p^\varepsilon}{\partial x_1} + p^\varepsilon\frac{\partial u_1^\varepsilon}{\partial x_1} = 0;

and on the subinterval [t_{n+1/2}, t_{n+1}] we can use

\frac{1}{2}\frac{\partial u_1^\varepsilon}{\partial t} + u_2^\varepsilon\frac{\partial u_1^\varepsilon}{\partial x_2} = \nu\frac{\partial^2 u_1^\varepsilon}{\partial x_2^2},

\frac{1}{2}\frac{\partial u_2^\varepsilon}{\partial t} + u_2^\varepsilon\frac{\partial u_2^\varepsilon}{\partial x_2} + \frac{1}{\rho}\frac{\partial p^\varepsilon}{\partial x_2} = \nu\frac{\partial^2 u_2^\varepsilon}{\partial x_2^2},   (75.6)

\frac{\varepsilon}{2}\frac{\partial p^\varepsilon}{\partial t} + \varepsilon u_2^\varepsilon\frac{\partial p^\varepsilon}{\partial x_2} + p^\varepsilon\frac{\partial u_2^\varepsilon}{\partial x_2} = 0.
By approximating systems (75.5) and (75.6) by finite differences on a rectangular
grid and by schemes implicit in time, we can solve them by three-point sweeps along
the grid lines. To solve problem (75.1)-(75.2) we can use a nonstationary system
which is different from (75.3)-(75.4). It has the following form:

\frac{\partial u}{\partial t} + \sum_{i=1}^{2} u_i D_i u + \mathrm{grad}\,p = \nu\,\Delta u,   (75.7)

\varepsilon\,\frac{\partial q}{\partial t} + \mathrm{div}\,u = 0.   (75.8)

Here

q = \tfrac{1}{2}(q_1 + q_2),\qquad q_i = p + u_i^2,\quad i = 1, 2.
The corresponding splitting schemes have the form:

\frac{1}{2}\frac{\partial u_1}{\partial t} + \frac{\partial q_1}{\partial x_1} = \nu\frac{\partial^2 u_1}{\partial x_1^2},

\frac{1}{2}\frac{\partial u_2}{\partial t} + u_1\frac{\partial u_2}{\partial x_1} = \nu\frac{\partial^2 u_2}{\partial x_1^2},   (75.9)

\frac{\varepsilon}{2}\frac{\partial q_1}{\partial t} + \frac{\partial u_1}{\partial x_1} = 0

on [t_n, t_{n+1/2}], and

\frac{1}{2}\frac{\partial u_1}{\partial t} + u_2\frac{\partial u_1}{\partial x_2} = \nu\frac{\partial^2 u_1}{\partial x_2^2},

\frac{1}{2}\frac{\partial u_2}{\partial t} + \frac{\partial q_2}{\partial x_2} = \nu\frac{\partial^2 u_2}{\partial x_2^2},   (75.10)

\frac{\varepsilon}{2}\frac{\partial q_2}{\partial t} + \frac{\partial u_2}{\partial x_2} = 0

on [t_{n+1/2}, t_{n+1}].
In difference approximations of systems (75.5), (75.6), (75.9) and (75.10) the time
step \tau plays the role of a relaxation parameter of the iterative process. It is worth
noting that boundary conditions for the system (75.1)-(75.2) have intentionally been
omitted in this discussion to focus on the major aspects of splitting. Splitting
boundary conditions requires a separate study in each particular case.
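Each implicit one-dimensional stage of schemes such as (75.5)-(75.6), once discretized, produces a tridiagonal system along every grid line, solvable by the three-point sweep mentioned above. A minimal sketch of the sweep (the Thomas algorithm) and of one illustrative implicit diffusion sub-step; all parameter values and variable names here are invented for the demonstration, not taken from the text:

```python
import numpy as np

def thomas_sweep(sub, diag, sup, rhs):
    """Solve a tridiagonal system by forward elimination and back substitution.

    sub[i], diag[i], sup[i] are the sub-, main and super-diagonal entries of
    row i (sub[0] and sup[-1] are unused)."""
    n = len(diag)
    c = np.zeros(n)
    d = np.zeros(n)
    c[0] = sup[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - sub[i] * c[i - 1]
        c[i] = sup[i] / m if i < n - 1 else 0.0
        d[i] = (rhs[i] - sub[i] * d[i - 1]) / m
    x = np.zeros(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# One implicit sub-step of (1/2) du/dt = nu d2u/dx2 along a grid line:
# (I - 2*nu*tau/dx^2 * D2) u^{n+1/2} = u^n, with u = 0 at the walls.
nu, tau, dx, n = 0.1, 0.01, 0.1, 20
r = 2 * nu * tau / dx**2
sub = np.full(n, -r)
sup = np.full(n, -r)
diag = np.full(n, 1 + 2 * r)
u = np.sin(np.pi * np.arange(1, n + 1) / (n + 1))   # illustrative grid-line data
u_new = thomas_sweep(sub, diag, sup, u)
```

The sweep costs O(n) per grid line, which is what makes the coordinate-wise implicit stages competitive with explicit schemes.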
The \varepsilon-perturbation introduced in (75.2) can also be used for solving nonstationary
equations. TEMAM [1981] uses the following perturbed equations:

\frac{\partial u_\varepsilon}{\partial t} + \sum_{i=1}^{2} u_{\varepsilon i} D_i u_\varepsilon + \frac{1}{2}(\mathrm{div}\,u_\varepsilon)\,u_\varepsilon + \mathrm{grad}\,p_\varepsilon = \nu\,\Delta u_\varepsilon,   (75.11)

\varepsilon\,\frac{\partial p_\varepsilon}{\partial t} + \mathrm{div}\,u_\varepsilon = 0.   (75.12)

In this case, it is easier to approximate (75.11) and (75.12) than the original
equations, because the condition div u = 0 is replaced by an evolutionary equation.
Here the following problems arise:
- existence and uniqueness of solutions of the perturbed equations;
- convergence of u_\varepsilon and p_\varepsilon to u and p as \varepsilon \to 0;
- discretization of (75.11) and (75.12) and convergence of the discrete approximations
to a solution of the original problem.
The existence of solutions of the perturbed problems and the convergence of exact
solutions as \varepsilon \to 0 are discussed by TEMAM [1981], where an alternating direction
scheme for the approximation in time was used, and where unconditional stability in
certain spaces and convergence of the discrete approximation to a solution of the
Navier-Stokes equations as \varepsilon, \tau, h \to 0 were proved.

76. Splitting schemes restoring divergence for incompressible fluid equations


A fractional steps scheme requiring no perturbed equations and based on the
projection method has been suggested by CHORIN [1968] and further studied by
TEMAM [1981]. The classical statement consists of finding a vector function u and
a scalar function p such that

\frac{\partial u}{\partial t} + \sum_{i=1}^{2} u_i\frac{\partial u}{\partial x_i} + \mathrm{grad}\,p = \nu\,\Delta u + f   (76.1)

in \Omega \times (0, T],

\mathrm{div}\,u = 0,   (76.2)

u = 0 on \partial\Omega \times [0, T],   (76.3)

u(x, 0) = u_0(x) in \Omega.   (76.4)

First, we introduce the function space H_0^1(\Omega) = \mathring W_2^1(\Omega), the space H — the closure
of \mathscr{V} in L_2(\Omega) — and V — the closure of \mathscr{V} in H_0^1(\Omega) — where

\mathscr{V} = \{u \in D(\Omega):\ \mathrm{div}\,u = 0\}.


Then, in the weak statement, it is required, for given u_0 \in H and f \in L^2(0, T; H), to find
a function u \in L^2(0, T; V) that satisfies the conditions

\frac{d}{dt}(u, v) + b(u, u, v) + \nu[u, v] = (f, v) \quad \forall v \in V,   (76.5)

u(0) = u_0,\qquad [u, v] = \sum_i \left(\frac{\partial u}{\partial x_i}, \frac{\partial v}{\partial x_i}\right).   (76.6)

Here (u, v) = \int_\Omega u \cdot v\, d\Omega, and

b(u, v, w) = \int_\Omega \sum_{i,j} u_i\,\frac{\partial v_j}{\partial x_i}\, w_j\, d\Omega

is a trilinear form, u_0 \in H, f \in L^2(0, T; H).


Let the interval [0, T] be divided into subintervals of length \tau. We assume that

f^n = \frac{1}{\tau}\int_{(n-1)\tau}^{n\tau} f(t)\, dt,\qquad n = 1, \ldots, N.

We introduce elements u^{n+(i+1)/2}, i = 0, 1, n = 0, \ldots, N-1, and, beginning with
u^0 = u_0, compute all u^{n+1/2}, u^{n+1} consecutively using the formulae

u^{n+1/2} \in H_0^1(\Omega),

\frac{1}{\tau}(u^{n+1/2} - u^n, v) + \tilde b(u^{n+1/2}, u^{n+1/2}, v) + \nu[u^{n+1/2}, v] = (f^n, v)   (76.7)

\forall v \in H_0^1(\Omega),
u^{n+1} \in H,

\frac{1}{\tau}(u^{n+1}, v) = \frac{1}{\tau}(u^{n+1/2}, v) \quad \forall v \in H.   (76.8)

Here

\tilde b(u, v, w) = \tfrac{1}{2}\bigl(b(u, v, w) - b(u, w, v)\bigr)

is a continuous trilinear form on H_0^1(\Omega) and

\tilde b(u, v, v) = 0,\qquad u, v \in H_0^1(\Omega).   (76.9)

Equation (76.8) means that u^{n+1} is the orthogonal projection of u^{n+1/2} on H in
L_2(\Omega). That is why u^{n+1} can be defined by the equality

u^{n+1} = P_H\, u^{n+1/2},

where P_H is the orthogonal projector in L_2(\Omega) onto the subspace H. TEMAM ([1981],
Theorem 1.1.4) shows that, in the orthogonal complement H^\perp, the difference
u^{n+1} - u^{n+1/2} is the gradient of some function of H^1(\Omega). Denoting this function
appropriately by \tau p^{n+1}, we obtain

\frac{u^{n+1} - u^{n+1/2}}{\tau} + \mathrm{grad}\,p^{n+1} = 0,\qquad p^{n+1} \in H^1(\Omega).   (76.10)

Relationship (76.10) is one of the consequences of (76.8). The second relationship,
equivalent to (76.8) and (76.10), is given by

\mathrm{div}\,u^{n+1} = 0 in \Omega,
u^{n+1} \cdot n = 0 on \partial\Omega,
u^{n+1} \in L_2(\Omega).

These expressions are equivalent to a certain Neumann problem with respect to
p^{n+1}. We apply the operator div to (76.10) and obtain

\Delta p^{n+1} = \frac{1}{\tau}\,\mathrm{div}\,u^{n+1/2}\quad(\mathrm{div}\,u^{n+1} = 0),   (76.11)

\frac{\partial p^{n+1}}{\partial n} = 0 on \partial\Omega.

Relationship (76.7) is a nonlinear Dirichlet problem. Writing (76.7) for v \in D(\Omega), we
obtain the classical form

\frac{u^{n+1/2} - u^n}{\tau} + \sum_{i=1}^{2} u_i^{n+1/2}\frac{\partial u^{n+1/2}}{\partial x_i} + \frac{1}{2}(\mathrm{div}\,u^{n+1/2})\,u^{n+1/2} = f^n + \nu\,\Delta u^{n+1/2}.   (76.12)

Using a weak form of the equations in this case helps to perform a separation of
operators which is not obvious in the classical statement. The last term on the left-
hand side of (76.12) is a stabilizing addition arising as a result of replacing b(u, u, v) by
\tilde b(u, u, v) in (76.7). A further solution of the obtained problems has been described by
TEMAM [1981]. It is also noteworthy that the generated function p^{n+1} is not a "real"
pressure.

The splitting idea connected with recalculation of the pressure given in a pro-
jection statement is, in fact, a consequence of the ideas underlying the Harlow method
of particles in cells. According to this scheme let us, first, compute an intermediate
velocity field from

\frac{u^{n+1/2} - u^n}{\tau} = -\sum_{i} u_i^n\frac{\partial u^n}{\partial x_i} + \nu\,\Delta u^n + f^n.   (76.13)

Then we correct this field by taking into account the pressure gradient,

u^{n+1} = u^{n+1/2} - \tau\,\mathrm{grad}\,p,   (76.14)

where p is a stationary solution of

\frac{\partial p}{\partial s} + \frac{1}{\tau}\,\mathrm{div}\,u^{n+1/2} = \Delta p.   (76.15)

As a result of the above stages both (76.1) and (76.2) are satisfied.
BELOTSERKOVSKY, GUSHCHIN and SHCHENNIKOV [1975] used another splitting
scheme, namely, the explicit scheme of splitting along the physical factors. We
introduce the notation rot u = \omega and consider the velocity field u and the pressure field
p at the time t_n = n\tau to be known; then we can write the scheme for finding the
unknown functions at time t_{n+1} as the three-stage splitting scheme (76.16)-(76.18):

\frac{u^{n+1/2} - u^n}{\tau} = -\sum_{i} u_i^n\frac{\partial u^n}{\partial x_i} + \nu\,\Delta u^n,   (76.16)

\Delta p = \frac{1}{\tau}\,\mathrm{div}\,u^{n+1/2},   (76.17)

\frac{u^{n+1} - u^{n+1/2}}{\tau} = -\mathrm{grad}\,p.   (76.18)

As noted earlier, equation (76.17) is obtained by applying the operation div to both
sides of (76.18) and taking the continuity equation into account (the vector u^{n+1} is
solenoidal).
BELOTSERKOVSKY [1984] suggested the following physical interpretation of the
above scheme. It is assumed at the first stage (76.16) that the momentum transport (the
impulse of a mass unit) is realized only via convection and diffusion. The field u^{n+1/2}
thereby obtained does not satisfy the incompressibility condition, giving nevertheless
a correct description of the turbulent characteristics, since it is possible to show that
rot u^{n+1/2} = \omega^{n+1}.
The obtained intermediate velocity field helps to find the pressure field at the
second stage (76.17), taking into account the fact that the field u^{n+1} is solenoidal. In
this approach the function p^{n+1} is a physical pressure, and boundary conditions for it
are not defined. On the contrary, in the projection method, p^{n+1} satisfies Neumann
boundary conditions.
It is assumed at the third stage (76.18) that the transport is realized due to pressure
gradients only.
The following variant (76.19)-(76.21) of realization has been considered for
increasing the reserve of stability, by using an implicit scheme at the first stage
(BELOTSERKOVSKY [1984]):

\frac{u^{n+1/2} - u^n}{\tau} = -\sum_{i} u_i^{n+1/2}\frac{\partial u^{n+1/2}}{\partial x_i} + \nu\,\Delta u^{n+1/2} - \mathrm{grad}\,p^n + f,   (76.19)

\Delta(\delta p) = \frac{1}{\tau}\,\mathrm{div}\,u^{n+1/2},\qquad \delta p = p^{n+1} - p^n,   (76.20)

\frac{u^{n+1} - u^{n+1/2}}{\tau} = -\mathrm{grad}(\delta p).   (76.21)

The proposed modification allows us to relax the hard restrictions on the time
step inherent in the first scheme. Another advantage of this approach is that the
pressure increment in time, and not the pressure itself, is found at the second stage.

77. The general principle for constructing splitting schemes for Navier-Stokes
equations

We consider the original system of Navier-Stokes equations in a Cartesian
coordinate system x_\alpha (\alpha = 1, 2, 3). In the absence of external forces this system can be
written in divergent form

\frac{\partial U}{\partial t} + \frac{\partial w_\alpha}{\partial x_\alpha} = 0.   (77.1)

Here U is the vector of the flow's state and w_\alpha is the vector of hydrodynamic fluxes:

U = \begin{pmatrix} U_0 \\ U_1 \\ U_2 \\ U_3 \\ U_4 \end{pmatrix}
  = \begin{pmatrix} \rho \\ \rho u_1 \\ \rho u_2 \\ \rho u_3 \\ \rho E \end{pmatrix},\qquad
w_\alpha = \begin{pmatrix} w_{\alpha 0} \\ w_{\alpha 1} \\ w_{\alpha 2} \\ w_{\alpha 3} \\ w_{\alpha 4} \end{pmatrix}
  = \begin{pmatrix} \rho u_\alpha \\ \rho u_1 u_\alpha + P_{1\alpha} \\ \rho u_2 u_\alpha + P_{2\alpha} \\ \rho u_3 u_\alpha + P_{3\alpha} \\ \rho u_\alpha E + u_\beta P_{\alpha\beta} - \kappa\,\partial T/\partial x_\alpha \end{pmatrix}.   (77.2)
In representation (77.2) we sum over \beta; P_{\alpha\beta} = p\,\delta_{\alpha\beta} - G_{\alpha\beta}, p is the pressure, \rho is the
density, \delta_{\alpha\beta} is the Kronecker symbol, E = e + \tfrac{1}{2}|u|^2 is the mass density of the full
energy,

G_{\alpha\beta} = \mu\left(\frac{\partial u_\alpha}{\partial x_\beta} + \frac{\partial u_\beta}{\partial x_\alpha} - \frac{2}{3}\,\delta_{\alpha\beta}\,\mathrm{div}\,u\right) + \mu'\,\delta_{\alpha\beta}\,\mathrm{div}\,u,   (77.3)

and \mu and \mu' are the coefficients of dissipative and volume viscosity. System (77.1) may be
represented in a nondivergent form as well. We introduce a new vector of state f,

f = f(U).   (77.4)

We consider (77.4) as invertible, equivalent to U = U(f), and obtain the nonsingular
matrices

A = \frac{\partial U}{\partial f} = \{A_{ij}\},   (77.5)

A^{-1} = \frac{\partial f}{\partial U} = \{A^{-1}_{ij}\}.   (77.6)

Taking this into account, system (77.1) can be rewritten in the form

\frac{\partial f}{\partial t} + A^{-1}\frac{\partial w_\alpha}{\partial x_\alpha} = 0.   (77.7)

It follows from relationships (77.2) and (77.3) that w_\alpha = w_\alpha(f, \partial f/\partial x). System (77.7)
can be rewritten in operator form:

(\partial/\partial t + \Omega)f = 0,\qquad \Omega = B_\alpha D_\alpha + C_{\alpha\beta} D_\alpha D_\beta,   (77.8)

B_\alpha = A^{-1}\,\partial w_\alpha/\partial f,\qquad C_{\alpha\beta} = A^{-1}\,\partial w_\alpha/\partial(\partial f/\partial x_\beta).

Then equation (77.8) takes the form

\frac{\partial f}{\partial t} + B_\alpha D_\alpha f + C_{\alpha\beta} D_\alpha D_\beta f = 0.   (77.9)

We introduce, further, the operators

\Omega_\alpha = B_\alpha D_\alpha,\qquad R_\alpha = -C_{\alpha\beta} D_\alpha D_\beta

(no summation over \alpha); then

\Omega = \sum_{\alpha=1}^{3}(\Omega_\alpha - R_\alpha).   (77.10)

System (77.9) is nondivergent in comparison with the equivalent system (77.1). Various
vectors of state are possible, e.g., f = (\rho, u_1, u_2, u_3, T), f = (\rho, u_1, u_2, u_3, e), and
f = (\rho, u_1, u_2, u_3, p). In the first case

\Omega_\alpha = \begin{pmatrix}
u_\alpha\dfrac{\partial}{\partial x_\alpha} & \delta_{1\alpha}\rho\dfrac{\partial}{\partial x_\alpha} & \delta_{2\alpha}\rho\dfrac{\partial}{\partial x_\alpha} & \delta_{3\alpha}\rho\dfrac{\partial}{\partial x_\alpha} & 0 \\
\delta_{1\alpha}a_1^2\dfrac{\partial}{\partial x_\alpha} & u_\alpha\dfrac{\partial}{\partial x_\alpha} & 0 & 0 & \delta_{1\alpha}b_1^2\dfrac{\partial}{\partial x_\alpha} \\
\delta_{2\alpha}a_1^2\dfrac{\partial}{\partial x_\alpha} & 0 & u_\alpha\dfrac{\partial}{\partial x_\alpha} & 0 & \delta_{2\alpha}b_1^2\dfrac{\partial}{\partial x_\alpha} \\
\delta_{3\alpha}a_1^2\dfrac{\partial}{\partial x_\alpha} & 0 & 0 & u_\alpha\dfrac{\partial}{\partial x_\alpha} & \delta_{3\alpha}b_1^2\dfrac{\partial}{\partial x_\alpha} \\
0 & \delta_{1\alpha}s^2\dfrac{\partial}{\partial x_\alpha} & \delta_{2\alpha}s^2\dfrac{\partial}{\partial x_\alpha} & \delta_{3\alpha}s^2\dfrac{\partial}{\partial x_\alpha} & u_\alpha\dfrac{\partial}{\partial x_\alpha}
\end{pmatrix},

with a_1^2 = \dfrac{1}{\rho}\dfrac{\partial p}{\partial\rho}, b_1^2 = \dfrac{1}{\rho}\dfrac{\partial p}{\partial T} and s^2 = \dfrac{p}{\rho c_v}.

Considering the vector f = (\rho, u, p) and a continuity equation in divergent form
simplifies the form of the operators \Omega_\alpha to

\Omega_\alpha = \begin{pmatrix}
\dfrac{\partial}{\partial x_\alpha}u_\alpha & 0 & 0 & 0 & 0 \\
0 & u_\alpha\dfrac{\partial}{\partial x_\alpha} & 0 & 0 & \delta_{1\alpha}\dfrac{1}{\rho}\dfrac{\partial}{\partial x_\alpha} \\
0 & 0 & u_\alpha\dfrac{\partial}{\partial x_\alpha} & 0 & \delta_{2\alpha}\dfrac{1}{\rho}\dfrac{\partial}{\partial x_\alpha} \\
0 & 0 & 0 & u_\alpha\dfrac{\partial}{\partial x_\alpha} & \delta_{3\alpha}\dfrac{1}{\rho}\dfrac{\partial}{\partial x_\alpha} \\
0 & \delta_{1\alpha}\rho c^2\dfrac{\partial}{\partial x_\alpha} & \delta_{2\alpha}\rho c^2\dfrac{\partial}{\partial x_\alpha} & \delta_{3\alpha}\rho c^2\dfrac{\partial}{\partial x_\alpha} & u_\alpha\dfrac{\partial}{\partial x_\alpha}
\end{pmatrix},

where c^2 = dp/d\rho is the square of the speed of sound.
The choice of the vector f for the Navier-Stokes equations determines the form of the
representation of R_\alpha. The simplest form can be written for f = (\rho, u, T):

R_\alpha f = \frac{1}{\rho}\begin{pmatrix}
0 \\
\partial G_{1\alpha}/\partial x_\alpha \\
\partial G_{2\alpha}/\partial x_\alpha \\
\partial G_{3\alpha}/\partial x_\alpha \\
\dfrac{1}{c_v}\left(\dfrac{\partial}{\partial x_\alpha}\kappa\dfrac{\partial T}{\partial x_\alpha} + \Phi_\alpha\right)
\end{pmatrix},

where \Phi_\alpha collects the dissipation terms, quadratic in the derivatives of the velocity
components.

Now we consider the splitting of the Navier-Stokes equations in divergent and
nondivergent form. Let us do this in Cartesian coordinates. The generalization of the
splitting method for the equations of gas dynamics and the Navier-Stokes equations
to the case of curvilinear coordinates is carried out as in the above case (KOVENYA and
YANENKO [1981]). We represent the vector w_\alpha of fluxes defined in equations (77.1)
and (77.2) as a sum of three terms,

w_\alpha = w_{1\alpha} + w_{2\alpha} + w_{3\alpha},   (77.11)

where w_{1\alpha} is the vector of convective fluxes, w_{2\alpha} the vector of fluxes due to pressure
forces, and w_{3\alpha} the vector of dissipative forces:

w_{1\alpha} = \begin{pmatrix} \rho u_\alpha \\ \rho u_1 u_\alpha \\ \rho u_2 u_\alpha \\ \rho u_3 u_\alpha \\ \rho u_\alpha E \end{pmatrix},\qquad
w_{2\alpha} = \begin{pmatrix} 0 \\ \delta_{1\alpha}p \\ \delta_{2\alpha}p \\ \delta_{3\alpha}p \\ u_\alpha p \end{pmatrix},\qquad
w_{3\alpha} = \begin{pmatrix} 0 \\ -G_{1\alpha} \\ -G_{2\alpha} \\ -G_{3\alpha} \\ -u_\beta G_{\beta\alpha} - \kappa\,\partial T/\partial x_\alpha \end{pmatrix}.

When we consider each vector w_\alpha as the sum (77.11) for every direction x_\alpha, we can
write the system of equations in the form of splitting into physical processes and
spatial directions,

\frac{\partial U}{\partial t} + \sum_{\alpha=1}^{3}\sum_{k=1}^{3}\frac{\partial w_{k\alpha}}{\partial x_\alpha} = 0.   (77.12)
Representing system (77.12) as

\frac{\partial U}{\partial t} + \sum_{m=1}^{N}\Xi_m U = 0,   (77.13)

where

\sum_{m=1}^{N}\Xi_m U = \sum_{\alpha=1}^{3}\sum_{k=1}^{3}\frac{\partial w_{k\alpha}}{\partial x_\alpha},

we have a system split on the differential level (approximating weakly) in the form

\frac{\partial U}{\partial t} + \sum_{j=1}^{N}\alpha_j\,\Xi_j U = 0,

where

\alpha_1 = N,\ \alpha_l = 0 for l \ne 1, t_n < t \le t_n + \tau/N,
\alpha_2 = N,\ \alpha_l = 0 for l \ne 2, t_n + \tau/N < t \le t_n + 2\tau/N,
\ldots
\alpha_N = N,\ \alpha_l = 0 for l \ne N, t_n + (N-1)\tau/N < t \le t_{n+1}.

Here N is the number of steps, and the \Xi_j are differential operators taking into account
the splitting into physical processes and spatial directions.
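The coefficient schedule above means that on the j-th fraction of the step only the operator \Xi_j acts, amplified by the factor N; for exact substeps this amounts to composing the solution operators of the subproblems over the full step \tau. A small numerical illustration of this weak-approximation splitting on a linear model problem (the 2x2 matrices are invented for the demonstration and split du/dt + (\Xi_1 + \Xi_2)u = 0 into two substeps):

```python
import numpy as np

def expm(M, terms=30):
    # Matrix exponential by Taylor series (adequate for the small norms here).
    E, term = np.eye(len(M)), np.eye(len(M))
    for j in range(1, terms):
        term = term @ M / j
        E = E + term
    return E

# Two non-commuting "process" operators (invented for the example).
Xi1 = np.array([[0.0, -1.0], [1.0, 0.0]])   # oscillatory part
Xi2 = np.array([[0.5, 0.0], [0.0, 0.2]])    # dissipative part
u0, T = np.array([1.0, 0.0]), 1.0

def split_solve(tau):
    # On (t_n, t_n + tau/2]: du/dt + 2*Xi1*u = 0; then du/dt + 2*Xi2*u = 0.
    # Exact substeps: exp(-2*Xi*tau/2) = exp(-Xi*tau).
    S = expm(-Xi2 * tau) @ expm(-Xi1 * tau)
    u = u0.copy()
    for _ in range(int(round(T / tau))):
        u = S @ u
    return u

exact = expm(-(Xi1 + Xi2) * T) @ u0
errs = [np.linalg.norm(split_solve(tau) - exact) for tau in (0.05, 0.025)]
```

Halving \tau roughly halves the error: the first-order behaviour of sequential weak-approximation splitting for non-commuting operators.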
When we write the system in the nondivergent form (77.7), the splitting into physical
processes and spatial directions is done as in (77.12) and (77.13): one separates the
convective flux matrices and the matrices determined by the pressure and dissipative
forces, with the subsequent splitting along the spatial directions, so that the operator
\Omega is represented as

\Omega = \sum_{\alpha}\sum_{k}\Omega_{k\alpha}.   (77.14)

Here the index \alpha is related to the splitting in the directions x_\alpha, and k to the splitting
into physical processes. Depending on the choice of the vector f of unknown
functions, the matrix differential operator \Omega has various structures. Hence, the form
of the operators \Omega_{k\alpha} depends on the choice of f. The arbitrary character of the choice
of unknown functions allows us to consider various forms of splitting and to choose
those that have the largest reserve of stability in difference form or the simplest
realization.
We illustrate some of the splitting forms with examples. First, we consider only
a system of equations of gas dynamics (\kappa = 0, G_{ij} = 0) resolved with respect to the
vector f = (\rho, u_1, u_2, u_3, p)^T. Since \kappa = 0 and G_{ij} = 0, the dissipative forces matrix
\Omega_{3\alpha} is absent, and consequently k = 1 or 2. Considering the continuity equation in
divergent form and the other equations in nondivergent form, the corresponding
matrices \Omega_{k\alpha} have the form

\Omega_{1\alpha} = \mathrm{diag}\left(\frac{\partial}{\partial x_\alpha}u_\alpha,\ u_\alpha\frac{\partial}{\partial x_\alpha},\ u_\alpha\frac{\partial}{\partial x_\alpha},\ u_\alpha\frac{\partial}{\partial x_\alpha},\ u_\alpha\frac{\partial}{\partial x_\alpha}\right),   (77.15)

\Omega_{2\alpha} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \delta_{1\alpha}\dfrac{1}{\rho}\dfrac{\partial}{\partial x_\alpha} \\
0 & 0 & 0 & 0 & \delta_{2\alpha}\dfrac{1}{\rho}\dfrac{\partial}{\partial x_\alpha} \\
0 & 0 & 0 & 0 & \delta_{3\alpha}\dfrac{1}{\rho}\dfrac{\partial}{\partial x_\alpha} \\
0 & \delta_{1\alpha}\rho c^2\dfrac{\partial}{\partial x_\alpha} & \delta_{2\alpha}\rho c^2\dfrac{\partial}{\partial x_\alpha} & \delta_{3\alpha}\rho c^2\dfrac{\partial}{\partial x_\alpha} & 0
\end{pmatrix},   (77.16)

where c is the speed of sound.


We now consider the splitting of equations (77.7) into physical processes and spatial
variables for the vector f = (\rho, u_1, u_2, u_3, T)^T. The convective fluxes matrix and the
matrix of pressure gradient forces are written as follows:

\Omega_{1\alpha} = u_\alpha\frac{\partial}{\partial x_\alpha}\,I,   (77.17)

\Omega_{2\alpha} = \begin{pmatrix}
0 & \delta_{1\alpha}\rho\dfrac{\partial}{\partial x_\alpha} & \delta_{2\alpha}\rho\dfrac{\partial}{\partial x_\alpha} & \delta_{3\alpha}\rho\dfrac{\partial}{\partial x_\alpha} & 0 \\
\delta_{1\alpha}a_1^2\dfrac{\partial}{\partial x_\alpha} & 0 & 0 & 0 & \delta_{1\alpha}b_1^2\dfrac{\partial}{\partial x_\alpha} \\
\delta_{2\alpha}a_1^2\dfrac{\partial}{\partial x_\alpha} & 0 & 0 & 0 & \delta_{2\alpha}b_1^2\dfrac{\partial}{\partial x_\alpha} \\
\delta_{3\alpha}a_1^2\dfrac{\partial}{\partial x_\alpha} & 0 & 0 & 0 & \delta_{3\alpha}b_1^2\dfrac{\partial}{\partial x_\alpha} \\
0 & \delta_{1\alpha}s^2\dfrac{\partial}{\partial x_\alpha} & \delta_{2\alpha}s^2\dfrac{\partial}{\partial x_\alpha} & \delta_{3\alpha}s^2\dfrac{\partial}{\partial x_\alpha} & 0
\end{pmatrix},   (77.18)

where

a_1^2 = \frac{1}{\rho}\frac{\partial p}{\partial\rho},\qquad b_1^2 = \frac{1}{\rho}\frac{\partial p}{\partial T},\qquad s^2 = \frac{p}{\rho c_v}.
It is obvious that realizations of schemes based on splitting (77.15)-(77.16) or
(77.17)-(77.18) are different. The above examples can be continued if we choose other
gas dynamics functions, or some combinations of them, as the functions to be found.
In considering difference splitting schemes for boundary value problems, the choice of
unknown functions is also determined by the form of the boundary conditions.
For Navier-Stokes equations written in the nondivergent form (77.7), the splitting of
the equations into physical processes and spatial directions is performed in the same
way as for the equations of gas dynamics. The dissipative forces matrix is in this case
added separately from the convective transport matrix and the matrix of fluxes
determined by the pressure forces. For the vector f = (\rho, u, T)^T the operator \Omega_{3\alpha} acts
as

\Omega_{3\alpha} f = -\frac{1}{\rho}\begin{pmatrix}
0 \\
\partial G_{1\alpha}/\partial x_\alpha \\
\partial G_{2\alpha}/\partial x_\alpha \\
\partial G_{3\alpha}/\partial x_\alpha \\
\dfrac{1}{c_v}\dfrac{\partial}{\partial x_\alpha}\kappa\dfrac{\partial T}{\partial x_\alpha}
\end{pmatrix}.   (77.19)

Some other examples of splitting the gas dynamics and Navier-Stokes equations are
possible. These examples will correspond to various realizations differing in the
reserve of stability, the effectiveness of realization and the convenience of calculating
boundary conditions. Thus, the approximation of equations in divergent form leads to
conservative difference schemes satisfying the conservation laws. This yields a physi-
cally better grounded result. For nonstationary solutions this way seems to be the
most justified. At the same time the numerical schemes of this type are nonlinear and
require either internal iterations or a complex linearization.
Although equations written in nondivergent form are more efficient in realization,
they possess the conservation property only asymptotically, that is, in the process of
approaching the stationary state. In this case, it is efficient to consider the predictor-
corrector method, which permits the construction of conservative difference schemes
suitable for solving nonstationary problems (see KOVENYA and LEBEDEV [1984]). At
the predictor stage a scheme of splitting into physical processes and coordinate
directions is constructed from the approximation of the equations in nondivergent
form. At the corrector stage the equation is approximated in divergent form.
The above schemes have been used in one way or another for performing a whole
range of computations for problems of gas dynamics and viscous gas flows in both the
two-dimensional and three-dimensional cases.
Papers by KOVENYA and YANENKO [1981], TEMAM [1981], BELOTSERKOVSKY
[1984], YANENKO [1967] and GUSHCHIN [1981] discuss the construction and the
convergence studies of splitting schemes for Navier-Stokes equations. Splitting
schemes and the results of computing viscous fluid flows by using the notion of
"artificial compressibility" are given by YANENKO [1967] and CHORIN [1968].
YANENKO [1967] also considered the coordinatewise splitting scheme. Splitting
methods for solving problems of stratified fluid dynamics are described by
LYTKIN and CHERNYKH [1975], YOUNG and HORT [1972], GUSHCHIN [1981] and
VASILYEV, KUZNETSOV, LYTKIN and CHERNYKH [1974]. Splitting methods for solving
a wide range of problems of gas and aerodynamics, along with some computational
results, are discussed by KUZNETSOV and STRELETS [1983], KOVENYA and LEBEDEV
[1984], KOVENYA and YANENKO [1981] and BELOTSERKOVSKY [1984]. Monographs
by MARCHUK [1974], KOVENYA and YANENKO [1981] and BELOTSERKOVSKY [1984]
are actually surveys of the results of constructing, studying and realizing splitting
schemes for problems of geophysical hydrodynamics, gas dynamics, hydrodynamics
and the mechanics of continuous media.
CHAPTER XIX

Problems in Meteorology

A number of implicit schemes for integrating the equations of dynamic meteorology
have been developed and grounded by MARCHUK [1964b, 1965, 1967, 1974a]. These
schemes are widely applied for solving problems of weather forecasting, the theory of
climate and environment protection (see the related publications by DYMNIKOV
and ISHIMOVA [1979], MARCHUK [1980], MARCHUK, DYMNIKOV et al. [1975, 1979,
1984], BURRIDGE and HAYES [1974], LEPAS et al. [1974], ZENG and ZHANG
[1982], DYMNIKOV and LYKOSOV [1983], DYMNIKOV and FILIN [1985], and
MARCHUK, DYMNIKOV et al. [1981]).

78. Equations of atmosphere hydrothermodynamics

The equations of atmosphere hydrothermodynamics with the quasistatic approximation
in spherical \sigma-coordinates have the following form (MARCHUK, DYMNIKOV et al.
[1984]):

\frac{du}{dt} - \left(l + \frac{u}{a}\tan\varphi\right)v + \frac{1}{a\cos\varphi}\left(\frac{\partial\Phi}{\partial\lambda} + RT\,\frac{\partial\ln p_s}{\partial\lambda}\right) = F_u,   (78.1)

\frac{dv}{dt} + \left(l + \frac{u}{a}\tan\varphi\right)u + \frac{1}{a}\left(\frac{\partial\Phi}{\partial\varphi} + RT\,\frac{\partial\ln p_s}{\partial\varphi}\right) = F_v,   (78.2)

\frac{dT}{dt} - \frac{RT}{c_p\,\sigma p_s}\,\omega = F_T + \varepsilon_T,   (78.3)

\frac{dq}{dt} = F_q - \mathcal{Q},   (78.4)

\frac{\partial p_s}{\partial t} + \frac{1}{a\cos\varphi}\left(\frac{\partial p_s u}{\partial\lambda} + \frac{\partial p_s v\cos\varphi}{\partial\varphi}\right) + p_s\frac{\partial\dot\sigma}{\partial\sigma} = 0,   (78.5)

\frac{\partial\Phi}{\partial\sigma} = -\frac{RT}{\sigma},   (78.6)

\omega = p_s\dot\sigma + \sigma\left(\frac{\partial p_s}{\partial t} + \frac{u}{a\cos\varphi}\frac{\partial p_s}{\partial\lambda} + \frac{v}{a}\frac{\partial p_s}{\partial\varphi}\right),   (78.7)

where

\frac{d}{dt} = \frac{\partial}{\partial t} + K,\qquad K = \frac{u}{a\cos\varphi}\frac{\partial}{\partial\lambda} + \frac{v}{a}\frac{\partial}{\partial\varphi} + \dot\sigma\frac{\partial}{\partial\sigma};
t is the time variable; \lambda is the longitude and \varphi the latitude; \sigma = p/p_s is the vertical
coordinate (p is the pressure, p_s is the pressure at the surface of the Earth); u, v and
\dot\sigma are the wind velocity components in the longitudinal, latitudinal and vertical
directions, respectively; \Phi = gz is the geopotential on a constant \sigma-surface (where g is
the gravitational acceleration and z is the altitude above sea level); T is the
temperature (K); q is the specific humidity; F_u and F_v are the rates of momentum
change due to Reynolds stresses; F_T and F_q are the rates of temperature
and specific humidity changes, respectively, caused by small-scale diffusion and
meso-scale convection; \varepsilon_T = \varepsilon_r + \varepsilon_f are the diabatic heat fluxes (\varepsilon_r is the radiation
heat flux and \varepsilon_f a heat flux due to condensation); \mathcal{Q} = C - E is the rate of humidity
transformations (C and E are terms describing condensation and evaporation
processes); l = 2\Omega\sin\varphi is the Coriolis parameter (\Omega is the angular speed of rotation
of the Earth); a is the radius of the Earth; R and c_p are the gas constant for dry air and
the specific heat of air at constant pressure.
The following initial conditions are considered to be given for system (78.1)-(78.6):

u = u_0(\lambda, \varphi, \sigma),\quad v = v_0(\lambda, \varphi, \sigma),\quad p_s = p_{s0}(\lambda, \varphi),
T = T_0(\lambda, \varphi, \sigma),\quad q = q_0(\lambda, \varphi, \sigma),   (78.8)

as well as the boundary conditions which assume that a solution is periodic in
longitude and bounded at the North and South Poles. Since the underlying
surface, being a solid body, is a \sigma-coordinate surface (\sigma = 1), the corresponding
kinematic condition

\dot\sigma = 0 for \sigma = 1   (78.9)

is used. An analogous condition can be stated for the upper boundary of the
atmosphere,

\dot\sigma = 0 for \sigma = 0,   (78.10)

and the distribution of the geopotential on the lower boundary is also given:

\Phi = gz_s = \Phi_s,   (78.11)

where z_s is the Earth's surface altitude above sea level.
Equations (78.1)-(78.4) can take a different form. At least three possible forms are
known. By multiplying any of these equations by p_s and using the continuity
equation (78.5) we obtain the divergent form

\frac{\partial p_s u}{\partial t} + K_d u - \left(l + \frac{u}{a}\tan\varphi\right)p_s v + \frac{p_s}{a\cos\varphi}\left(\frac{\partial\Phi}{\partial\lambda} + RT\,\frac{\partial\ln p_s}{\partial\lambda}\right) = p_s F_u,   (78.12)

\frac{\partial p_s v}{\partial t} + K_d v + \left(l + \frac{u}{a}\tan\varphi\right)p_s u + \frac{p_s}{a}\left(\frac{\partial\Phi}{\partial\varphi} + RT\,\frac{\partial\ln p_s}{\partial\varphi}\right) = p_s F_v,   (78.13)

\frac{\partial p_s T}{\partial t} + K_d T - \frac{RT}{c_p\,\sigma}\,\omega = p_s(F_T + \varepsilon_T),   (78.14)

\frac{\partial p_s q}{\partial t} + K_d q = p_s(F_q - \mathcal{Q}),   (78.15)

where

K_d\,\xi = \frac{1}{a\cos\varphi}\left(\frac{\partial}{\partial\lambda}(p_s u\,\xi) + \frac{\partial}{\partial\varphi}(p_s v\cos\varphi\,\xi)\right) + \frac{\partial}{\partial\sigma}(p_s\dot\sigma\,\xi).

We use the following substitutions,

\sqrt{p_s} = \pi,\qquad \sqrt{p_s}\,u = U,\qquad \sqrt{p_s}\,v = V,
\sqrt{p_s}\,T = \Theta,\qquad \sqrt{p_s}\,q = Q,   (78.16)

and we obtain a symmetrized form of the equations:

\frac{\partial U}{\partial t} + K_s U - \left(l + \frac{u}{a}\tan\varphi\right)V + \frac{1}{a\cos\varphi}\left(\pi\frac{\partial\Phi}{\partial\lambda} + 2R\Theta\,\frac{\partial\ln\pi}{\partial\lambda}\right) = \pi F_u,   (78.17)

\frac{\partial V}{\partial t} + K_s V + \left(l + \frac{u}{a}\tan\varphi\right)U + \frac{1}{a}\left(\pi\frac{\partial\Phi}{\partial\varphi} + 2R\Theta\,\frac{\partial\ln\pi}{\partial\varphi}\right) = \pi F_v,   (78.18)

\frac{\partial\Theta}{\partial t} + K_s\Theta - \frac{R\Theta}{c_p\,\sigma p_s}\,\omega = \pi(F_T + \varepsilon_T),   (78.19)

\frac{\partial Q}{\partial t} + K_s Q = \pi(F_q - \mathcal{Q}),   (78.20)

where

K_s\,\xi = \frac{1}{2}\left\{\frac{1}{a\cos\varphi}\left[u\frac{\partial\xi}{\partial\lambda} + \frac{\partial(u\xi)}{\partial\lambda} + v\cos\varphi\,\frac{\partial\xi}{\partial\varphi} + \frac{\partial(v\cos\varphi\,\xi)}{\partial\varphi}\right] + \dot\sigma\frac{\partial\xi}{\partial\sigma} + \frac{\partial(\dot\sigma\xi)}{\partial\sigma}\right\}.   (78.21)

We introduce the following notation for the absolute vorticity Z and the energy E:

Z = l + \frac{1}{a\cos\varphi}\left(\frac{\partial v}{\partial\lambda} - \frac{\partial u\cos\varphi}{\partial\varphi}\right),\qquad E = \tfrac{1}{2}(u^2 + v^2).

Now we can write the motion equations (78.1) and (78.2) in the so-called Lamb form:

\frac{\partial u}{\partial t} + \dot\sigma\frac{\partial u}{\partial\sigma} - Zv + \frac{1}{a\cos\varphi}\left[\frac{\partial}{\partial\lambda}(\Phi + E) + RT\,\frac{\partial\ln p_s}{\partial\lambda}\right] = F_u,   (78.22)

\frac{\partial v}{\partial t} + \dot\sigma\frac{\partial v}{\partial\sigma} + Zu + \frac{1}{a}\left[\frac{\partial}{\partial\varphi}(\Phi + E) + RT\,\frac{\partial\ln p_s}{\partial\varphi}\right] = F_v.   (78.23)

The system of equations of atmosphere hydrothermodynamics in the adiabatic
approximation has a number of integral properties. The conservation laws for the
total energy and the enstrophy (the squared vorticity Z^2) are the most interesting
among them. Let the right-hand sides in equations (78.1)-(78.4) be equal to zero, and
let \varepsilon_T be absent for some time. Multiplying (78.1) by u and (78.2) by v and adding
these expressions we obtain, after some transformations using the equations of
continuity (78.5) and quasi-statics (78.6), the following equation for the changes of
kinetic energy in time:

\frac{\partial p_s E}{\partial t} + K_d E + \frac{1}{a\cos\varphi}\left(\frac{\partial p_s u\Phi}{\partial\lambda} + \frac{\partial p_s v\cos\varphi\,\Phi}{\partial\varphi}\right) + \frac{\partial}{\partial\sigma}(p_s\dot\sigma\Phi) = \frac{RT}{\sigma}\,\omega.   (78.24)

Adding equations (78.24) and (78.3) and integrating the obtained expression in
\sigma from 0 to 1 and over the Earth's surface G, using the corresponding boundary
conditions, we obtain

\frac{\partial}{\partial t}\int_G\int_0^1 p_s(E + c_p T)\, d\sigma\, dG = 0,\qquad dG = a^2\cos\varphi\, d\varphi\, d\lambda.   (78.25)

Since it is known (LORENZ [1955]) that the total potential energy of a vertical air
column in hydrostatic systems (being the sum of two kinds of energy — the potential
energy gz and the internal energy c_v T, where c_v is the specific heat of air at constant
volume) coincides with its enthalpy c_p T, equation (78.25) is the conservation law for
the total energy of the atmospheric system.
If the term RT is replaced in (78.3) by R\bar T, where \bar T is the average temperature
(a value independent of the spatial coordinates and time), then the total energy
conservation law takes the quadratic form

\frac{\partial}{\partial t}\int_G\int_0^1 p_s\left(E + \frac{c_p T^2}{2\bar T}\right) d\sigma\, dG = 0.   (78.26)
The quadratic conservation law is a fundamental property of system (78.1)-
(78.6), which allows the construction of a stable computational process for obtaining
its solution.
In the case of nondivergent barotropic flow (p_s \equiv const, \dot\sigma \equiv 0) the Lamb form of
the motion equations (78.22) and (78.23), along with the continuity equation in the
form

\frac{1}{a\cos\varphi}\left(\frac{\partial u}{\partial\lambda} + \frac{\partial v\cos\varphi}{\partial\varphi}\right) = 0,   (78.27)

allows us to show the existence of the enstrophy conservation law

\frac{\partial}{\partial t}\int_G\int_0^1 p_s Z^2\, d\sigma\, dG = 0.   (78.28)

This law also has a fundamental meaning, because the real atmosphere is
quasi-barotropic and quasi-geostrophic, and therefore has a specific energy cascade
in the direction of the long waves and an enstrophy cascade in the direction of the
high wave numbers.
We assume that p_s \equiv 1 in relationships (78.1)-(78.8) and obtain, as a special case,
a system of equations in p-coordinates where, besides the quasi-statics equation

\frac{\partial\Phi}{\partial p} = -\frac{RT}{p},   (78.29)

the continuity equation

\frac{1}{a\cos\varphi}\left(\frac{\partial u}{\partial\lambda} + \frac{\partial v\cos\varphi}{\partial\varphi}\right) + \frac{\partial\omega}{\partial p} = 0   (78.30)

turns out to be non-evolutionary.
The system of equations of atmosphere hydrothermodynamics with the quasi-static
approximation is irregular, i.e. it does not belong to the Cauchy-Kovalevskaya type
of systems, but it can be reduced to the evolutionary type by integrating, for example,
the quasi-statics equation (78.6) in \sigma from \sigma to 1 and using condition (78.11):

\Phi = \Phi_s + \int_\sigma^1 \frac{RT}{\sigma'}\, d\sigma';   (78.31)

and by integrating the continuity equation (78.5) in \sigma from 0 to 1 and using
conditions (78.9) and (78.10):

\frac{\partial p_s}{\partial t} + \int_0^1 D\, d\sigma = 0,   (78.32)

with the vertical velocity analogue \dot\sigma being represented as follows:

\dot\sigma = -\frac{1}{p_s}\left(\int_0^\sigma D\, d\sigma' - \sigma\int_0^1 D\, d\sigma'\right),

D = \frac{1}{a\cos\varphi}\left(\frac{\partial p_s u}{\partial\lambda} + \frac{\partial p_s v\cos\varphi}{\partial\varphi}\right).   (78.33)

79. The general splitting method with respect to physical processes


based on the separation of characteristic times

Systems of equations of atmosphere hydrothermodynamics appear to be fairly


difficult to integrate (especially for long times) owing to the simultaneous presence of
a great many physical processes with various characteristic time scales varying from
a few seconds (gravity waves) to a few days and more (synoptic and planetary waves).
It is natural, therefore, to use various computational schemes to describe terms
related to the various physical processes.
Following MESINGER and ARAKAWA [1976] we consider, first, a linearized system
("shallow water") including advective terms and terms responsible for the description
of gravity waves:

\frac{\partial u}{\partial t} + c\frac{\partial u}{\partial x} + g\frac{\partial h}{\partial x} = 0,
   (79.1)
\frac{\partial h}{\partial t} + c\frac{\partial h}{\partial x} + H\frac{\partial u}{\partial x} = 0,

where h is the altitude of the free surface and H is its average value. The ratio of the
velocity \sqrt{gH} of gravity wave propagation to the velocity c of the advective
transport constitutes a value of order 10. It is expedient, therefore, to solve the
system of advection equations

\frac{\partial u}{\partial t} + c\frac{\partial u}{\partial x} = 0,
   (79.2)
\frac{\partial h}{\partial t} + c\frac{\partial h}{\partial x} = 0

first for system (79.1) in the limits of a given time step.
We denote the thereby obtained intermediate values of u^{n+1} and h^{n+1} by u^{n+1/2}
and h^{n+1/2} and use them as initial conditions for solving the remaining subsystem

\frac{\partial u}{\partial t} + g\frac{\partial h}{\partial x} = 0,
   (79.3)
\frac{\partial h}{\partial t} + H\frac{\partial u}{\partial x} = 0.

The values u^{n+1} and h^{n+1}, obtained after solving this subsystem, are taken to be the
final approximate values of the variables at the time level n + 1. This procedure is
repeated on each successive time step.
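A minimal numerical sketch of this two-stage procedure for (79.1): all parameter values and the explicit centred differencing are illustrative choices, not from the text, and the gravity-wave stage is advanced by a simple forward-backward update so that the two subsystems are clearly separated:

```python
import numpy as np

nx, dx, dt = 64, 1.0, 0.05
g, H, c = 9.8, 10.0, 1.0          # sqrt(g*H) ~ 9.9 is ~10x the advective speed c
x = np.arange(nx) * dx

def ddx(f):
    # Periodic centred difference.
    return (np.roll(f, -1) - np.roll(f, 1)) / (2 * dx)

u = np.zeros(nx)
h = H + np.exp(-((x - nx * dx / 2) / 5.0) ** 2)   # initial free-surface bump
mass0 = h.sum()

for n in range(50):
    # Stage 1, system (79.2): pure advection of u and h with speed c.
    u_half = u - dt * c * ddx(u)
    h_half = h - dt * c * ddx(h)
    # Stage 2, system (79.3): gravity waves, forward-backward in time.
    u = u_half - dt * g * ddx(h_half)
    h = h_half - dt * H * ddx(u)
```

With periodic centred differences the total mass \sum_i h_i is conserved exactly by both stages, which is easy to check after the time loop.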
We consider a vector function \phi = (u, h)^T and a matrix

A = \begin{pmatrix} c & g \\ H & c \end{pmatrix}

and write (79.1) in the form

\frac{\partial\phi}{\partial t} + A\frac{\partial\phi}{\partial x} = 0.   (79.4)

By using the following representation of the matrix A,

A = A_1 + A_2,\qquad A_1 = cE,\qquad A_2 = \begin{pmatrix} 0 & g \\ H & 0 \end{pmatrix},

we can represent the above procedure for solving the system of "shallow water"
equations as a splitting scheme,

\frac{\phi^{n+1/2} - \phi^n}{\Delta t} + A_1\Lambda\phi^{n+1/2} = F_1^n,
   (79.5)
\frac{\phi^{n+1} - \phi^{n+1/2}}{\Delta t} + A_2\Lambda\phi^{n+1} = F_2^{n+1/2},

where the difference operator \Lambda approximates the differential operator \partial/\partial x, and F_1^n
and F_2^{n+1/2} are the results of applying difference operators to the function \phi on
previous time steps.
The solution of the full diabatic problem in symmetrized form, which is based on
the method of splitting with respect to physical processes and uses the weak
approximation method symbolics, may be represented in the following stages
(MARCHUK, DYMNIKOV et al. [1984]):
(a) the computation of radiative heat fluxes,

\frac{\partial\Theta_1}{\partial t} = \pi\varepsilon_r,\qquad t \in (t_n, t_{n+1}],\qquad \Theta_1(t_n) = \Theta(t_n);   (79.6)

(b) the computation of heat fluxes due to condensation, dry and moist
convection,

\frac{\partial\Theta_2}{\partial t} = \pi\varepsilon_f,\qquad \frac{\partial Q_2}{\partial t} = -\pi\mathcal{Q},\qquad t \in (t_n, t_{n+1}],
\Theta_2(t_n) = \Theta_1(t_{n+1}),\qquad Q_2(t_n) = Q(t_n);   (79.7)

(c) the solution of the diffusion equations,

\frac{\partial U_3}{\partial t} = \pi F_u,\qquad \frac{\partial V_3}{\partial t} = \pi F_v,\qquad \frac{\partial\Theta_3}{\partial t} = \pi F_T,\qquad \frac{\partial Q_3}{\partial t} = \pi F_q,
U_3(t_n) = U(t_n),\quad V_3(t_n) = V(t_n),\quad \Theta_3(t_n) = \Theta_2(t_{n+1}),\quad Q_3(t_n) = Q_2(t_{n+1});   (79.8)

(d) the solution of the transport equations,

\frac{\partial U_4}{\partial t} + K_s U_4 = 0,\qquad \frac{\partial V_4}{\partial t} + K_s V_4 = 0,
\frac{\partial\Theta_4}{\partial t} + K_s\Theta_4 = 0,\qquad \frac{\partial Q_4}{\partial t} + K_s Q_4 = 0,   (79.9)
U_4(t_n) = U_3(t_{n+1}),\quad V_4(t_n) = V_3(t_{n+1}),\quad \Theta_4(t_n) = \Theta_3(t_{n+1}),\quad Q_4(t_n) = Q_3(t_{n+1});

(e) the solution of the adaptation equations,

\frac{\partial U_5}{\partial t} - \left(l + \frac{u}{a}\tan\varphi\right)V_5 + \frac{1}{a\cos\varphi}\left(\pi\frac{\partial\Phi_5}{\partial\lambda} + 2R\Theta_5\,\frac{\partial\ln\pi}{\partial\lambda}\right) = 0,

\frac{\partial V_5}{\partial t} + \left(l + \frac{u}{a}\tan\varphi\right)U_5 + \frac{1}{a}\left(\pi\frac{\partial\Phi_5}{\partial\varphi} + 2R\Theta_5\,\frac{\partial\ln\pi}{\partial\varphi}\right) = 0,   (79.10)

\frac{\partial\Theta_5}{\partial t} - \frac{R\Theta_5}{c_p\,\sigma p_s}\,\omega = 0,

U_5(t_n) = U_4(t_{n+1}),\quad V_5(t_n) = V_4(t_{n+1}),\quad \Theta_5(t_n) = \Theta_4(t_{n+1}),\quad \pi_5(t_n) = \pi(t_n).

Formulae (78.31), (78.7) and (78.33) are used for computing \Phi, \omega and \dot\sigma.
A formal analysis of the above splitting scheme yields a first-order approximation
in time, even in the case when the Crank-Nicolson scheme is used as the basic scheme
on each splitting stage. We shall consider stability questions in the following
sections.
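This first-order behaviour can be checked numerically. Below, a linear two-component model problem (the matrices are invented for the test) is advanced by sequential splitting with each substep integrated by the Crank-Nicolson scheme; the observed global error still behaves as O(\Delta t), because the splitting error dominates the second-order accuracy of the individual stages:

```python
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])    # "transport-like" part
B = np.array([[-0.3, 0.0], [0.0, -0.1]])   # "diffusion-like" part
I = np.eye(2)
u0, T = np.array([1.0, 0.5]), 1.0

def cn_step(M, h):
    # Crank-Nicolson step operator for du/dt = M u.
    return np.linalg.solve(I - 0.5 * h * M, I + 0.5 * h * M)

def reference(h=1e-4):
    # Fine-step unsplit Crank-Nicolson as the reference solution.
    S = cn_step(A + B, h)
    u = u0.copy()
    for _ in range(int(round(T / h))):
        u = S @ u
    return u

def split_cn(h):
    # One splitting step of size h: CN substep with A, then CN substep with B.
    S = cn_step(B, h) @ cn_step(A, h)
    u = u0.copy()
    for _ in range(int(round(T / h))):
        u = S @ u
    return u

ref = reference()
errs = [np.linalg.norm(split_cn(h) - ref) for h in (0.1, 0.05, 0.025)]
```

Successive halvings of the step roughly halve the error, i.e. first-order accuracy, even though each Crank-Nicolson stage is second order on its own.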

80. The approximation of equations in spatial variables and the discrete analogues
of conservation laws
In Section 78 we described the equations of atmosphere hydrothermodynamics
in symmetrized form. The purpose of such a symmetrization is, first of all, the
possibility of constructing absolutely stable difference schemes, whose stability is
proved in a quadratic norm equivalent to the system's energy. For
illustration, consider the one-dimensional transport problem

\frac{\partial\theta}{\partial t} + \frac{1}{2}\left(u\frac{\partial\theta}{\partial x} + \frac{\partial u\theta}{\partial x}\right) \equiv \frac{\partial\theta}{\partial t} + K_s\theta = 0.   (80.1)
It is easily seen that for (80.1) the relationship

\frac{\partial}{\partial t}\int_G \theta^2\, dG = 0

holds if u becomes zero at the boundary of the domain G or if \theta is periodic. This
follows from the skew-symmetry of the operator

K_s = \frac{1}{2}\left(u\frac{\partial}{\partial x} + \frac{\partial}{\partial x}u\right).
Finite difference approximations of the operator K_s will also retain the skew-
symmetry condition (skew-symmetric matrices will appear) if the same symmetric
uniform scheme is used for each of the terms u\,\partial\theta/\partial x and \partial(u\theta)/\partial x. Thus we can
write

\frac{\partial\theta^h}{\partial t} + K_s^h\theta^h = 0,

where

(K_s^h\theta^h, \theta^h)_h = 0,   (80.2)

and the inner product

(f, g)_h = \sum_i f_i g_i\,\Delta x

is chosen in a corresponding grid space.
The operator K_s^h retains its skew-symmetric form irrespective of the choice of u^h,
i.e. it is independent of the type of projector transforming u into u^h. The
independence of the skew-symmetry property from the type of projection operator
allows the construction of a family of schemes satisfying condition (80.2) for problem
(80.1). Using the Crank-Nicolson scheme in time we obtain (the index h is omitted)

\frac{\theta^{n+1} - \theta^n}{\Delta t} + K_s\frac{\theta^{n+1} + \theta^n}{2} = 0.   (80.3)

Taking the inner product of (80.3) with (\theta^{n+1} + \theta^n) we obtain

(\theta^{n+1}, \theta^{n+1}) = (\theta^n, \theta^n),

which means that the scheme for problem (80.1) is stable.
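The skew-symmetry argument is easy to verify directly. In the sketch below (with an invented variable velocity on a periodic grid), the matrix of K_s = \tfrac{1}{2}(uD + Du), built with the same centred-difference matrix D for both terms, is exactly skew-symmetric, and the Crank-Nicolson step — a Cayley transform of a skew-symmetric matrix — preserves the discrete norm to roundoff:

```python
import numpy as np

n, dx, dt = 50, 0.1, 0.05
u = 1.0 + 0.5 * np.sin(2 * np.pi * np.arange(n) / n)   # variable velocity field

# Periodic centred-difference matrix D ~ d/dx.
D = (np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / (2 * dx)
D[0, -1] = -1.0 / (2 * dx)
D[-1, 0] = 1.0 / (2 * dx)

U = np.diag(u)
K = 0.5 * (U @ D + D @ U)   # same scheme for u*dtheta/dx and d(u*theta)/dx

I = np.eye(n)
# Crank-Nicolson step (80.3): theta^{n+1} = (I + dt/2 K)^{-1}(I - dt/2 K) theta^n.
step = np.linalg.solve(I + 0.5 * dt * K, I - 0.5 * dt * K)

theta = np.exp(-((np.arange(n) - n / 2) ** 2) / 20.0)   # illustrative profile
norm0 = np.dot(theta, theta)
for _ in range(200):
    theta = step @ theta
```

Any other symmetric projection of u onto the grid would give a different skew-symmetric K, which is exactly the freedom noted in the text.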



If we choose staggered grids, where the function \theta is determined at integer grid points
and the function u at fractional grid points, then a scheme of second-order accuracy
in space for (80.1) can be represented as follows:

\frac{\theta_i^{n+1} - \theta_i^n}{\Delta t} + \frac{u_{i+1/2}\,\theta_{i+1}^{n+1/2} - u_{i-1/2}\,\theta_{i-1}^{n+1/2}}{2\Delta x} = 0,   (80.4)

and it will have the property that

\sum_i(\theta_i^{n+1})^2 = \sum_i(\theta_i^n)^2.   (80.5)
It is fairly simple to write the approximation in the space variables and the scheme for
the three-dimensional transport equation in the general case:
$$\begin{aligned}
\frac{\theta^{n+1}_{i,j,k}-\theta^{n}_{i,j,k}}{\Delta t}
&+\frac{1}{2a\cos\psi_j\,\Delta\lambda}\bigl[u_{i+1/2,j,k}\,\theta^{n+1/2}_{i+1,j,k}-u_{i-1/2,j,k}\,\theta^{n+1/2}_{i-1,j,k}\bigr]\\
&+\frac{1}{2a\cos\psi_j\,\Delta\psi}\bigl[(v\cos\psi)_{i,j+1/2,k}\,\theta^{n+1/2}_{i,j+1,k}-(v\cos\psi)_{i,j-1/2,k}\,\theta^{n+1/2}_{i,j-1,k}\bigr]\\
&+\frac{1}{2\Delta\sigma}\bigl[\dot\sigma_{i,j,k+1/2}\,\theta^{n+1/2}_{i,j,k+1}-\dot\sigma_{i,j,k-1/2}\,\theta^{n+1/2}_{i,j,k-1}\bigr]=0,
\end{aligned}\qquad(80.6)$$
where
$$u_{i+1/2,j,k}=\tfrac12\bigl(u_{i+1/2,j+1/2,k}+u_{i+1/2,j-1/2,k}\bigr),$$
$$(v\cos\psi)_{i,j+1/2,k}=\tfrac12\bigl((v\cos\psi)_{i+1/2,j+1/2,k}+(v\cos\psi)_{i-1/2,j+1/2,k}\bigr).$$
The grid points where the functions u and v are determined have indices i+1/2, j+1/2
and k; the function σ̇ is determined at the points with indices i, j and k+1/2.
It is easily seen that the matrices of each difference operator approximating the
derivatives along λ, ψ and σ are skew-symmetric, i.e. (80.6) can be written in vector
form,
$$\frac{\theta^{n+1}-\theta^{n}}{\Delta t}+\bigl(K_\lambda+K_\psi+K_\sigma\bigr)\theta^{n+1/2}=0,\qquad(80.7)$$
where
$$(K_\alpha\theta,\theta)=0,\qquad\alpha=\lambda,\psi,\sigma.\qquad(80.8)$$
Transport problems can be represented in the form (80.7)-(80.8) for other substances
as well: specific humidity and horizontal velocity components.
At the adaptation stage the scheme has the form:
$$\frac{U^{n+1}_{i+1/2,j+1/2,k}-U^{n}_{i+1/2,j+1/2,k}}{\Delta t}
-\Bigl(l_{j+1/2}+\frac{U^{n+1/2}_{i+1/2,j+1/2,k}}{a}\,\operatorname{tg}\psi_{j+1/2}\Bigr)V^{n+1/2}_{i+1/2,j+1/2,k}$$
$$+\frac{1}{2a\cos\psi_{j+1/2}\,\Delta\lambda}\bigl[\nabla_i\Phi^{n+1/2}_{j+1/2,k}
+2R\bar T_{i+1/2,j+1/2,k}\,\nabla_i(\ln\pi)^{n+1}_{j+1/2}\bigr]=0,\qquad(80.9\mathrm{a})$$
SECTION 81 Problems in meteorology 437

$$\frac{V^{n+1}_{i+1/2,j+1/2,k}-V^{n}_{i+1/2,j+1/2,k}}{\Delta t}
+\Bigl(l_{j+1/2}+\frac{U^{n+1/2}_{i+1/2,j+1/2,k}}{a}\,\operatorname{tg}\psi_{j+1/2}\Bigr)U^{n+1/2}_{i+1/2,j+1/2,k}$$
$$+\frac{1}{2a\,\Delta\psi}\bigl[\nabla_j\Phi^{n+1/2}_{i+1/2,k}
+2R\bar T_{i+1/2,j+1/2,k}\,\nabla_j(\ln\pi)^{n+1}_{i+1/2}\bigr]=0,\qquad(80.9\mathrm{b})$$

$$\frac{\ln\pi^{n+1}_{i,j}-\ln\pi^{n}_{i,j}}{\Delta t}+\sum_{k=1}^{K}D^{n+1/2}_{i,j,k}\,\Delta\sigma_k=0,\qquad(80.9\mathrm{c})$$
where
$$D_{i,j,k}=\frac{1}{a\cos\psi_j}\Bigl[\frac{1}{\Delta\lambda}\,\nabla_{i-1/2}\bar U^{\,j}+\frac{1}{\Delta\psi}\,\nabla_{j-1/2}\bigl(\cos\psi\,\bar V^{\,i}\bigr)\Bigr],\qquad(80.9\mathrm{d})$$
the bar denoting averaging over the index shown,

$$\frac{\theta^{n+1}_{i,j,k}-\theta^{n}_{i,j,k}}{\Delta t}
-\frac{R\bar T_k}{2c_p\,a\cos\psi_j}\sum_{v=-1,0}\Bigl[\frac{1}{\Delta\lambda}\,U^{n+1/2}_{i+1/2+v,j,k}\,\nabla_{i+v}(\ln\pi)^{n+1}_{j}
+\frac{1}{\Delta\psi}\,(V\cos\psi)^{n+1/2}_{i,j+1/2+v,k}\,\nabla_{j+v}(\ln\pi)^{n+1}_{i}\Bigr]$$
$$-\frac{R\bar T_k}{2c_p}\sum_{v=-1,0}\bar c_{k+v}\,\dot\sigma^{n+1/2}_{i,j,k+v+1/2}=0,\qquad(80.9\mathrm{e})$$
where σ̇ is expressed in terms of the divergence D through the difference analogue of
the continuity equation, the $\bar c_{k+v}$ are given coefficients, and the difference and
averaging operators are defined by
$$\nabla_r(\;)=(\;)_{r+1}-(\;)_r,\qquad(\;)^{r+1/2}=\tfrac12\bigl[(\;)^{r+1}+(\;)^r\bigr],$$
r being any of the indices i, j, k, or n.


Let γ = 0 and let the term RT in (78.3) be equal to R T̄. Hence, multiplying (80.9a)
by $U^{n+1/2}_{i+1/2,j+1/2,k}$, (80.9b) by $V^{n+1/2}_{i+1/2,j+1/2,k}$ and (80.9e) by $\theta^{n+1/2}_{i,j,k}$, and summing over
all i, j and k, we obtain the relationship
$$\sum_{i,j,k}\cos\psi_{j+1/2}\bigl[\bigl(U^{n+1}\bigr)^2+\bigl(V^{n+1}\bigr)^2+\bigl(\theta^{n+1}\bigr)^2\bigr]_{i+1/2,j+1/2,k}
=\sum_{i,j,k}\cos\psi_{j+1/2}\bigl[\bigl(U^{n}\bigr)^2+\bigl(V^{n}\bigr)^2+\bigl(\theta^{n}\bigr)^2\bigr]_{i+1/2,j+1/2,k},\qquad(80.10)$$
which is, in fact, the difference analogue of the conservation law (78.26).

81. The method of splitting with respect to geometric variables and
numerical realization

Since the transport operator K is the sum of three operators (transport in each
geometric direction), each being skew-symmetric, it is quite natural to perform
a further splitting of the problem. Hence, we obtain the following algorithm:
$$\frac{\varphi^{n+1/3}-\varphi^{n}}{\Delta t}+K_\lambda\,\frac{\varphi^{n+1/3}+\varphi^{n}}{2}=0,$$
$$\frac{\varphi^{n+2/3}-\varphi^{n+1/3}}{\Delta t}+K_\psi\,\frac{\varphi^{n+2/3}+\varphi^{n+1/3}}{2}=0,\qquad(81.1)$$
$$\frac{\varphi^{n+1}-\varphi^{n+2/3}}{\Delta t}+K_\sigma\,\frac{\varphi^{n+1}+\varphi^{n+2/3}}{2}=0,$$
where φ is any of the functions U, V, θ or q.
Due to (80.8) it is obvious that the above difference scheme is absolutely stable and
the quadratic functionals conserve their values exactly. In the general case, scheme (81.1)
has a first-order approximation in time. The method of cyclic permutation of the
operators allows, however, the construction of a second-order approximation scheme in the general
case of noncommuting operators as well (MARCHUK [1980]).
Thus, a three-dimensional transport problem is reduced to a set of one-dimen-
sional problems.
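Because every fractional step of (81.1) is itself a Crank-Nicolson step with a skew-symmetric operator, the quadratic invariant survives the full composite step whether or not the operators commute. A minimal numerical sketch, added for illustration, with 2×2 skew-symmetric matrices standing in for the transport operators:

```python
# Each fractional step of the splitting (81.1) is a Crank-Nicolson
# step with one skew-symmetric operator, so the quadratic invariant
# is conserved by the composite step.  (All 2x2 skew-symmetric
# matrices happen to commute; the conservation argument does not
# use this fact.)

def cn_substep(phi, a, dt):
    """Crank-Nicolson substep for K = [[0, a], [-a, 0]]."""
    x, y = phi
    h = 0.5 * dt * a
    rx, ry = x - h * y, y + h * x
    det = 1.0 + h * h
    return ((rx - h * ry) / det, (ry + h * rx) / det)

def split_step(phi, a, b, dt):
    """One splitting step: K1-substep followed by K2-substep."""
    return cn_substep(cn_substep(phi, a, dt), b, dt)

phi = (0.6, -0.8)
norm0 = phi[0] ** 2 + phi[1] ** 2
for _ in range(500):
    phi = split_step(phi, a=1.0, b=2.5, dt=0.02)
print(abs(phi[0] ** 2 + phi[1] ** 2 - norm0))  # ~ rounding error
```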
Now we consider methods for inverting these operators. Transport in the
λ-direction (along latitude circles) is realized by the cyclic sweep method. It is the
same for all unknown functions (the only difference is in determining the "transport"
velocities, which are parts of the expression for the operator K). The algorithm for
transport in the σ-direction is also the same for all unknown functions (scalar sweep).
Transport in the ψ-direction (along meridians) is realized in a more complex way.
A method of constructing cyclically closed rings of meridians shifted by 180° has
been used by MARCHUK, DYMNIKOV et al. [1984]. In this case the components of the
vector values (U and V) must change their sign in the transition over the Pole, while
the scalar values (θ and q) keep the same sign. Computation is realized via a cyclic
sweep. The skew-symmetric form of the transport operator guarantees, in this case,
conservation of the quadratic values.
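The cyclic sweep mentioned above solves tridiagonal systems with periodic closure. One common realization — a sketch under the assumption of diagonal dominance; the cited works may use a different variant — removes the two corner entries and restores their effect with the Sherman-Morrison formula, at the cost of two ordinary sweeps:

```python
def sweep(a, b, c, d):
    """Ordinary three-point (Thomas) sweep for
    a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] = c[-1] = 0."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def cyclic_sweep(a, b, c, d):
    """Cyclic sweep: a[0] couples x[0] to x[n-1] and c[-1] couples
    x[n-1] to x[0].  The corner entries are removed and their effect
    restored afterwards via the Sherman-Morrison formula."""
    n = len(b)
    beta, alpha = a[0], c[-1]          # corner entries of the matrix
    gamma = -b[0]
    aa, bb, cc = list(a), list(b), list(c)
    aa[0], cc[-1] = 0.0, 0.0
    bb[0] -= gamma
    bb[-1] -= alpha * beta / gamma
    x = sweep(aa, bb, cc, d)           # solve the modified system
    u = [0.0] * n
    u[0], u[-1] = gamma, alpha
    z = sweep(aa, bb, cc, u)           # correction direction
    fact = (x[0] + beta * x[-1] / gamma) / (1.0 + z[0] + beta * z[-1] / gamma)
    return [xi - fact * zi for xi, zi in zip(x, z)]
```

For example, for the periodic system $-x_{i-1}+3x_i-x_{i+1}=d_i$ on five points, `cyclic_sweep` reproduces a prescribed solution to rounding error.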
To solve the system of algebraic equations at the adaptation stage (80.9) the
following iterative process with respect to the values $U^{n+1/2}$, $V^{n+1/2}$, $\theta^{n+1/2}$ and $\pi^{n+1}$ is
used (some aspects of its convergence and the choice of optimal parameters are
discussed by MARCHUK, DYMNIKOV et al. [1984]):
$$\Phi^{(s)}=\Phi\bigl(\theta^{(s)},\pi^{(s)}\bigr),\qquad(81.2)$$

$$A^{(s)}_{i+1/2,j+1/2,k}=\tfrac12\Delta t\Bigl[l_{j+1/2}+\frac{U^{(s)}_{i+1/2,j+1/2,k}}{a}\,\operatorname{tg}\psi_{j+1/2}\Bigr],$$
$$(F_1)^{(s)}_{i+1/2,j+1/2,k}=U^{n}_{i+1/2,j+1/2,k}-\frac{\Delta t}{2a\cos\psi_{j+1/2}\,\Delta\lambda}\bigl[\nabla_i\Phi^{(s)}_{j+1/2,k}+2R\bar T_{i+1/2,j+1/2,k}\,\nabla_i(\ln\pi^{(s)})_{j+1/2}\bigr],\qquad(81.3)$$
$$(F_2)^{(s)}_{i+1/2,j+1/2,k}=V^{n}_{i+1/2,j+1/2,k}-\frac{\Delta t}{2a\,\Delta\psi}\bigl[\nabla_j\Phi^{(s)}_{i+1/2,k}+2R\bar T_{i+1/2,j+1/2,k}\,\nabla_j(\ln\pi^{(s)})_{i+1/2}\bigr],$$



$$U^{(s+1)}_{i+1/2,j+1/2,k}-A^{(s)}_{i+1/2,j+1/2,k}V^{(s+1)}_{i+1/2,j+1/2,k}=(F_1)^{(s)}_{i+1/2,j+1/2,k},$$
$$V^{(s+1)}_{i+1/2,j+1/2,k}+A^{(s)}_{i+1/2,j+1/2,k}U^{(s+1)}_{i+1/2,j+1/2,k}=(F_2)^{(s)}_{i+1/2,j+1/2,k},\qquad(81.4)$$
$$D^{(s+1)}_{i,j,k}=\frac{1}{a\cos\psi_j}\Bigl[\frac{1}{\Delta\lambda}\,\nabla_{i-1/2}\bar U^{(s+1)\,j}+\frac{1}{\Delta\psi}\,\nabla_{j-1/2}\bigl(\cos\psi\,\bar V^{(s+1)\,i}\bigr)\Bigr],\qquad(81.5)$$
$$\theta^{(s+1)}_{i,j,k}=(1-\tau_\theta)\theta^{(s)}_{i,j,k}+\tau_\theta\Bigl\{\theta^{n}_{i,j,k}-\frac{\Delta t\,R\bar T_k}{4c_p\,a\cos\psi_j}\sum_{v=-1,0}\Bigl[\frac{1}{\Delta\lambda}\,U^{(s+1)}_{i+1/2+v,j,k}\,\nabla_{i+v}(\ln\pi^{(s)})_{j}$$
$$+\frac{1}{\Delta\psi}\,\bigl(V\cos\psi\bigr)^{(s+1)}_{i,j+1/2+v,k}\,\nabla_{j+v}(\ln\pi^{(s)})_{i}\Bigr]-\frac{\Delta t\,R\bar T_k}{2c_p}\Bigl(\sigma_k\sum_{k'=1}^{K}D^{(s+1)}_{i,j,k'}\,\Delta\sigma_{k'}-D^{(s+1)}_{i,j,k}\Bigr)\Bigr\},\qquad(81.6)$$
$$(\ln\pi)^{(s+1)}_{i,j}=(1-\tau_\pi)(\ln\pi)^{(s)}_{i,j}+\tau_\pi\Bigl[(\ln\pi)^{n}_{i,j}-\frac{\Delta t}{2}\sum_{k=1}^{K}D^{(s+1)}_{i,j,k}\,\Delta\sigma_k\Bigr].\qquad(81.7)$$
The following notations are introduced in (81.2)-(81.7): s is the iteration number
(the indices (n+1) and (n+1/2) are omitted); τ_θ and τ_π are relaxation parameters in θ and
π (at the resolution Δλ = Δψ = 5° the optimal value of the parameters τ_θ and τ_π is
0.25).
The computational process is organized as follows. The distributions of U, V, θ
and π at the preceding step in time are taken as the initial approximation (when
s = 0). The distribution of the geopotential Φ is computed using the known "old" values
θ and π and equation (81.2). The auxiliary values A, F₁ and F₂ are computed using
formulae (81.3). Here, in computing F₁, the short waves in the zonal gradients
of the functions ln π and Φ are filtered in the polar regions by means of a fast Fourier
transform. By solving system (81.4) we find the values U and V by explicit
formulae. Then, substituting them into (81.5), we compute the flat divergence D.
Finally, using relationships (81.6) and (81.7), we compute the "new" values θ and π (in
this case, as above, we filter out the short waves in the zonal gradients of Φ and ln π).
Note that the accepted procedure of filtering short waves does not break the
conservation laws. We perform iterations at each step in time until either the
following relationship holds
$$\max_{i,j,k}\bigl|\theta^{(s+1)}_{i,j,k}-\theta^{(s)}_{i,j,k}\bigr|<\varepsilon,$$
where ε is a given value of the absolute error, or the number of iterations exceeds a given
limit value.
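Schematically, an iteration with relaxation and this stopping rule can be organized as follows (a sketch on a hypothetical contractive map, not the actual adaptation system; the relaxation parameter plays the role of τ_θ, τ_π):

```python
def relaxed_iteration(g, x0, tau=0.25, eps=1e-10, max_iter=200):
    """Iterate x^(s+1) = (1 - tau) x^(s) + tau g(x^(s)) until the
    maximum absolute change falls below eps, or until the number of
    iterations exceeds max_iter (the stopping rule of the text)."""
    x = list(x0)
    for s in range(1, max_iter + 1):
        gx = g(x)
        x_new = [(1.0 - tau) * xi + tau * gi for xi, gi in zip(x, gx)]
        if max(abs(xn - xi) for xn, xi in zip(x_new, x)) < eps:
            return x_new, s
        x = x_new
    return x, max_iter

# Hypothetical contractive map with fixed point (0.5, 0.25):
g = lambda x: (0.5 * x[1] + 0.375, 0.5 * x[0])
x, iters = relaxed_iteration(g, [0.0, 0.0], tau=0.5)
```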
CHAPTER XX

Problems in Oceanology

The problem of mathematically modeling the ocean's general circulation attracts
many scientists. The progress in the research into the physics of the ocean has stimulated
mathematicians to construct numerical models of ocean circulation, tide waves,
currents, etc. These models have significantly influenced the development of research into
the central problem of modern geophysical hydrodynamics: the interaction of the
atmosphere and the ocean.
This chapter describes the major aspects of the construction of numerical
algorithms for the solution of ocean dynamics problems based on the splitting
method (MARCHUK [1974a], MARCHUK, DYMNIKOV et al. [1984]).

82. Statement of the problem and the splitting of equations with respect to
physical processes

The problem of the general circulation of the ocean is described by the following
system of equations, written in Cartesian coordinates for the sake of simplicity:
$$\frac{\mathrm{d}u}{\mathrm{d}t}-lv=-\frac{1}{\rho_0}\frac{\partial p}{\partial x}+\mu\Delta u+\frac{\partial}{\partial z}K\frac{\partial u}{\partial z},$$
$$\frac{\mathrm{d}v}{\mathrm{d}t}+lu=-\frac{1}{\rho_0}\frac{\partial p}{\partial y}+\mu\Delta v+\frac{\partial}{\partial z}K\frac{\partial v}{\partial z},$$
$$\frac{\partial p}{\partial z}=g\rho,\qquad\frac{\partial u}{\partial x}+\frac{\partial v}{\partial y}+\frac{\partial w}{\partial z}=0,$$
$$\frac{\mathrm{d}T}{\mathrm{d}t}+\gamma_T w=\mu_T\Delta T+\frac{\partial}{\partial z}K_T\frac{\partial T}{\partial z},\qquad(82.1)$$
$$\frac{\mathrm{d}S}{\mathrm{d}t}+\gamma_S w=\mu_S\Delta S+\frac{\partial}{\partial z}K_S\frac{\partial S}{\partial z},$$
$$\rho=\rho(T,S),\qquad\frac{\mathrm{d}}{\mathrm{d}t}=\frac{\partial}{\partial t}+u\frac{\partial}{\partial x}+v\frac{\partial}{\partial y}+w\frac{\partial}{\partial z}.$$
System (82.1) is obtained from the general equations of the hydrodynamics of
a rotating fluid by using traditional approximations for problems of the large-scale


dynamics of the ocean, namely the Boussinesq approximation, hydrostatics, incompressibility of the
water, and a linear mechanism of macroturbulent viscosity and diffusion.
Equations (82.1) are written for the deviations of the pressure p, the density ρ, the
temperature T and the salinity S from their mean values p̄(z), ρ̄(z), T̄(z), S̄(z) on the vertical,
with
$$\gamma_T=\frac{\mathrm{d}\bar T}{\mathrm{d}z},\qquad\gamma_S=\frac{\mathrm{d}\bar S}{\mathrm{d}z}.$$
We assume that the solution domain Ω of the problem represents a closed basin,
bounded by the unperturbed surface of the ocean z = 0, the bottom z = H(x, y) and the
cylindrical lateral surface Σ. We add the corresponding boundary and initial
conditions to (82.1):
$$u=v=0,\qquad\frac{\partial T}{\partial\nu}=\frac{\partial S}{\partial\nu}=0\quad\text{on }\Sigma;$$
$$K\frac{\partial u}{\partial z}=f_1,\qquad K\frac{\partial v}{\partial z}=f_2,\qquad w=0,$$
$$a_1\frac{\partial T}{\partial z}+b_1T=c_1,\qquad a_2\frac{\partial S}{\partial z}+b_2S=c_2\quad\text{for }z=0;\qquad(82.2)$$
$$K\frac{\partial u}{\partial z}=0,\qquad K\frac{\partial v}{\partial z}=0,\qquad w=u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y},$$
$$\frac{\partial T}{\partial\nu}=\frac{\partial S}{\partial\nu}=0\quad\text{for }z=H;\qquad(82.3)$$
$$u=u^0,\quad v=v^0,\quad T=T^0,\quad S=S^0\quad\text{for }t=0,$$
where ν is the external normal vector to Σ, and aᵢ, bᵢ, cᵢ, fᵢ, u⁰, v⁰, T⁰ and S⁰ are given
coefficients and functions. The set of processes described by system (82.1) can
conditionally be divided into three main types: transport of momentum, heat and
salinity along the trajectories; their diffusion; and the adaptation of the mass fields and
currents. To solve problem (82.1)-(82.3) approximately, the splitting method with
respect to the corresponding physical processes can be used.
We pay no attention here to methods for approximating (82.1) in the spatial variables (see
MARCHUK [1974a], MARCHUK, DYMNIKOV et al. [1984] and ZALESNY [1984, 1985])
and write a splitting system on the time intervals $t_n\le t\le t_{n+1}$ in terms of differential
equations. We have to solve three subsystems consecutively:
$$\frac13\frac{\partial u}{\partial t}+u^n\frac{\partial u}{\partial x}+v^n\frac{\partial u}{\partial y}+w^n\frac{\partial u}{\partial z}=0,\qquad(82.4)$$
$$\frac13\frac{\partial v}{\partial t}+u^n\frac{\partial v}{\partial x}+v^n\frac{\partial v}{\partial y}+w^n\frac{\partial v}{\partial z}=0,$$
$$\frac13\frac{\partial T}{\partial t}+u^n\frac{\partial T}{\partial x}+v^n\frac{\partial T}{\partial y}+w^n\frac{\partial T}{\partial z}=0,$$
$$\frac13\frac{\partial S}{\partial t}+u^n\frac{\partial S}{\partial x}+v^n\frac{\partial S}{\partial y}+w^n\frac{\partial S}{\partial z}=0;$$
$$\frac13\frac{\partial u}{\partial t}=\mu\Delta u+\frac{\partial}{\partial z}K\frac{\partial u}{\partial z},\qquad
\frac13\frac{\partial v}{\partial t}=\mu\Delta v+\frac{\partial}{\partial z}K\frac{\partial v}{\partial z},$$
$$\frac13\frac{\partial T}{\partial t}=\mu_T\Delta T+\frac{\partial}{\partial z}K_T\frac{\partial T}{\partial z},\qquad
\frac13\frac{\partial S}{\partial t}=\mu_S\Delta S+\frac{\partial}{\partial z}K_S\frac{\partial S}{\partial z};\qquad(82.5)$$
$$\frac13\frac{\partial u}{\partial t}-lv=-\frac{1}{\rho_0}\frac{\partial p}{\partial x},\qquad
\frac13\frac{\partial v}{\partial t}+lu=-\frac{1}{\rho_0}\frac{\partial p}{\partial y},$$
$$\frac{\partial p}{\partial z}=g\rho,\qquad
\frac{\partial u}{\partial x}+\frac{\partial v}{\partial y}+\frac{\partial w}{\partial z}=0,\qquad(82.6)$$
$$\frac13\frac{\partial T}{\partial t}+\gamma_T w=0,\qquad
\frac13\frac{\partial S}{\partial t}+\gamma_S w=0,$$
$$\rho=\alpha_T T+\alpha_S S+\rho_1(T^n,S^n).$$

Note that in this case it is assumed, in splitting system (82.1), that equations (82.1) are
linearized on the interval $t_n\le t\le t_{n+1}$. This is indicated by the indices n in (82.4) and in
the equation of state in system (82.6). We add to the subsystems (82.4)-(82.6) the
corresponding initial conditions, and to (82.5)-(82.6) the boundary conditions
following from (82.2).
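The effect of solving the subsystems consecutively on each time interval can be seen on a small caricature — an added illustration with 2×2 matrices in place of the transport, diffusion and adaptation operators. Each subproblem is integrated exactly, yet for noncommuting operators the sequential solution approximates the full problem only to first order in Δt, in line with the first-order accuracy of scheme (81.1) noted in Section 81:

```python
import math

# d(phi)/dt = (A + B) phi with noncommuting 2x2 operators
# A = [[0, 1], [0, 0]] and B = [[0, 0], [1, 0]]; each subproblem
# is solved exactly on the interval dt, and the splitting error
# of the composite scheme is first order in dt.

def step_A(u, dt):          # exp(A*dt) u = (u0 + dt*u1, u1)
    return (u[0] + dt * u[1], u[1])

def step_B(u, dt):          # exp(B*dt) u = (u0, u1 + dt*u0)
    return (u[0], u[1] + dt * u[0])

def exact(u, t):            # exp((A+B)t) u with A+B = [[0,1],[1,0]]
    c, s = math.cosh(t), math.sinh(t)
    return (c * u[0] + s * u[1], s * u[0] + c * u[1])

def split_solve(u, t, n):   # n splitting steps of size t/n
    dt = t / n
    for _ in range(n):
        u = step_B(step_A(u, dt), dt)
    return u

u0, t = (1.0, 0.0), 1.0
err = lambda n: abs(split_solve(u0, t, n)[0] - exact(u0, t)[0])
# Halving the step roughly halves the error (first-order accuracy):
print(err(100), err(200))
```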
For system (82.5) we have the following boundary conditions:
$$u=v=0,\qquad\frac{\partial T}{\partial\nu}=\frac{\partial S}{\partial\nu}=0\quad\text{on }\Sigma;$$
$$K\frac{\partial u}{\partial z}=f_1,\qquad K\frac{\partial v}{\partial z}=f_2,$$
$$a_1\frac{\partial T}{\partial z}+b_1T=c_1,\qquad a_2\frac{\partial S}{\partial z}+b_2S=c_2\quad\text{for }z=0;\qquad(82.7)$$
$$K\frac{\partial u}{\partial z}=0,\qquad K\frac{\partial v}{\partial z}=0,\qquad
\frac{\partial T}{\partial z}=\frac{\partial S}{\partial z}=0\quad\text{for }z=H;$$

and for system (82.6)
$$(\mathbf u,\nu)=0\quad\text{on }\Sigma,$$
$$w=0\quad\text{for }z=0,\qquad(82.8)$$
$$w=u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y}\quad\text{for }z=H,$$
where
$$\mathbf u=(u,v).$$

83. The splitting of adaptation equations in planes and generalization
for the nonhydrostatic case

The equations solved by the methods considered above have been obtained at the
first two stages of the splitting with respect to physical processes, described by systems
(82.4) and (82.5).
Now we consider a method for solving system (82.6) based on a further splitting in
the coordinate planes (x,z) and (y,z) (MARCHUK [1974a]). Because this method
allows generalization, we assume that a more complete evolutionary equation is
used instead of the hydrostatic equation. Hence, we arrive at the following two
systems of equations, assuming for the sake of simplicity that the state equation is
linear (ρ₁ = 0):

$$\frac16\frac{\partial u}{\partial t}-\tfrac12 lv=-\frac{1}{\rho_0}\frac{\partial p}{\partial x},$$
$$\frac16\frac{\partial v}{\partial t}+\tfrac12 lu=0,$$
$$\frac16\frac{\partial w}{\partial t}-\frac{g\rho}{2\rho_0}=-\frac{1}{2\rho_0}\frac{\partial p}{\partial z},\qquad(83.1)$$
$$\frac{\partial u}{\partial x}+\frac12\frac{\partial w}{\partial z}=0,$$
$$\frac16\frac{\partial\rho}{\partial t}+\tfrac12\tilde\gamma w=0;$$
and
$$\frac16\frac{\partial u}{\partial t}-\tfrac12 lv=0,$$
$$\frac16\frac{\partial v}{\partial t}+\tfrac12 lu=-\frac{1}{\rho_0}\frac{\partial p}{\partial y},$$
$$\frac16\frac{\partial w}{\partial t}-\frac{g\rho}{2\rho_0}=-\frac{1}{2\rho_0}\frac{\partial p}{\partial z},\qquad(83.2)$$
$$\frac{\partial v}{\partial y}+\frac12\frac{\partial w}{\partial z}=0,$$
$$\frac16\frac{\partial\rho}{\partial t}+\tfrac12\tilde\gamma w=0;$$
where $\tilde\gamma=\alpha_T\gamma_T+\alpha_S\gamma_S$.
By using the continuity equations we can reduce each of the systems (83.1) and (83.2) to
problems for stream functions.
84. The splitting of adaptation equations with respect to "topography"

The above system of ocean dynamics equations (82.1) is non-evolutionary. That is
why its splitting with respect to physical processes has been performed on
a "physical level". To be able to use the results of the general theory of the splitting
method given in the previous chapters, we reduce (82.1) to an evolutionary form. We
will, however, not consider here the transformation of the full system (82.1) to its
evolutionary form (see MARCHUK, DYMNIKOV et al. [1984] and ZALESNY [1984]). Let
us illustrate this technique by transforming the adaptation equations (82.6) under the
assumption that the state equation (ρ₁ = 0) is linear.
We transform the solution domain Ω of the problem into a cylinder with unit base
by the substitution z₁ = z/H and represent the horizontal components of the velocity
vector in the form
$$u=\bar u+u',\qquad v=\bar v+v',\qquad\bar a=\int_0^1 a\,\mathrm{d}z_1.$$

Hence, we obtain the following system of equations
$$\frac13\frac{\partial u}{\partial t}-lv+\frac{1}{\rho_0H}\Bigl(H\frac{\partial p}{\partial x}-z_1\frac{\partial H}{\partial x}\frac{\partial p}{\partial z_1}\Bigr)=0,$$
$$\frac13\frac{\partial v}{\partial t}+lu+\frac{1}{\rho_0H}\Bigl(H\frac{\partial p}{\partial y}-z_1\frac{\partial H}{\partial y}\frac{\partial p}{\partial z_1}\Bigr)=0,$$
$$\frac{\partial p}{\partial z_1}=g\rho H,\qquad\frac{\partial uH}{\partial x}+\frac{\partial vH}{\partial y}+\frac{\partial w_1}{\partial z_1}=0,\qquad(84.1)$$
$$\frac13\frac{\partial\rho}{\partial t}+\tilde\gamma\Bigl[w_1+z_1\Bigl(u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y}\Bigr)\Bigr]=0,$$
$$\tilde\gamma=\alpha_T\gamma_T+\alpha_S\gamma_S,\qquad w_1=w-z_1\Bigl(u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y}\Bigr).$$

Excluding the functions ρ and w₁ from (84.1) we obtain, assuming γ̃ > 0,
$$\frac13\rho_0H\frac{\partial u}{\partial t}-\rho_0Hlv+\Bigl(H\frac{\partial p}{\partial x}-z_1\frac{\partial H}{\partial x}\frac{\partial p}{\partial z_1}\Bigr)=0,$$
$$\frac13\rho_0H\frac{\partial v}{\partial t}+\rho_0Hlu+\Bigl(H\frac{\partial p}{\partial y}-z_1\frac{\partial H}{\partial y}\frac{\partial p}{\partial z_1}\Bigr)=0,\qquad(84.2)$$
$$-\frac13\frac{\partial}{\partial t}\frac{\partial}{\partial z_1}\Bigl(\frac{1}{g\tilde\gamma H}\frac{\partial p}{\partial z_1}\Bigr)-\frac{\partial}{\partial z_1}\Bigl[z_1\Bigl(u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y}\Bigr)\Bigr]+\frac{\partial uH}{\partial x}+\frac{\partial vH}{\partial y}=0.$$
We rewrite system (84.2) in operator form
$$B\frac{\partial\varphi}{\partial t}+A\varphi=0,\qquad A=A_1+A_2,\qquad(84.3)$$
where φ = (u, v, p) and
$$B=\begin{pmatrix}\frac13\rho_0H&0&0\\[4pt]0&\frac13\rho_0H&0\\[4pt]0&0&-\frac13\dfrac{\partial}{\partial z_1}\Bigl(\dfrac{1}{g\tilde\gamma H}\dfrac{\partial}{\partial z_1}\Bigr)\end{pmatrix},$$
$$A_1=\begin{pmatrix}0&-\rho_0Hl&H\dfrac{\partial}{\partial x}\\[4pt]\rho_0Hl&0&H\dfrac{\partial}{\partial y}\\[4pt]\dfrac{\partial}{\partial x}H&\dfrac{\partial}{\partial y}H&0\end{pmatrix},\qquad
A_2=\begin{pmatrix}0&0&-z_1\dfrac{\partial H}{\partial x}\dfrac{\partial}{\partial z_1}\\[4pt]0&0&-z_1\dfrac{\partial H}{\partial y}\dfrac{\partial}{\partial z_1}\\[4pt]-\dfrac{\partial}{\partial z_1}z_1\dfrac{\partial H}{\partial x}&-\dfrac{\partial}{\partial z_1}z_1\dfrac{\partial H}{\partial y}&0\end{pmatrix}.$$

For the approximate solution of (84.3) we use a splitting scheme with respect to
"topography" (ZALESNY [1984, 1985]):
$$\tfrac12 B\frac{\partial\varphi_1}{\partial t}+A_1\varphi_1=0,\qquad
\tfrac12 B\frac{\partial\varphi_2}{\partial t}+A_2\varphi_2=0.\qquad(84.4)$$

Hence, at the first stage of splitting we come to the solution of
$$\frac16\rho_0H\frac{\partial u}{\partial t}-\rho_0Hlv+H\frac{\partial p}{\partial x}=0,$$
$$\frac16\rho_0H\frac{\partial v}{\partial t}+\rho_0Hlu+H\frac{\partial p}{\partial y}=0,\qquad(84.5)$$
$$-\frac16\frac{\partial}{\partial t}\frac{\partial}{\partial z_1}\Bigl(\frac{1}{g\tilde\gamma H}\frac{\partial p}{\partial z_1}\Bigr)+\frac{\partial uH}{\partial x}+\frac{\partial vH}{\partial y}=0.$$
We add to system (84.5) the boundary conditions following from (82.8):
w = 0 for z₁ = 0, 1; or, in terms of the function p:
$$\frac{\partial}{\partial t}\Bigl(\frac{1}{g\tilde\gamma H}\frac{\partial p}{\partial z_1}\Bigr)=0\quad\text{for }z_1=0,1,\qquad(84.6)$$
$$(\mathbf u,\nu)=0\quad\text{on }\Sigma.$$
At the second stage of splitting we have
$$\frac16\rho_0H\frac{\partial u}{\partial t}-z_1\frac{\partial H}{\partial x}\frac{\partial p}{\partial z_1}=0,$$
$$\frac16\rho_0H\frac{\partial v}{\partial t}-z_1\frac{\partial H}{\partial y}\frac{\partial p}{\partial z_1}=0,\qquad(84.7)$$
$$-\frac16\frac{\partial}{\partial t}\frac{\partial}{\partial z_1}\Bigl(\frac{1}{g\tilde\gamma H}\frac{\partial p}{\partial z_1}\Bigr)-\frac{\partial}{\partial z_1}\Bigl[z_1\Bigl(u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y}\Bigr)\Bigr]=0,$$
with boundary conditions
$$\frac{\partial}{\partial t}\Bigl(\frac{1}{g\tilde\gamma H}\frac{\partial p}{\partial z_1}\Bigr)=0\quad\text{for }z_1=0,\qquad(84.8)$$
$$\frac{\partial}{\partial t}\Bigl(\frac{1}{g\tilde\gamma H}\frac{\partial p}{\partial z_1}\Bigr)=u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y}\quad\text{for }z_1=1.$$
It is easily shown that the relationships
$$B>0,\qquad(A_i\varphi,\varphi)=0,\quad i=1,2,\quad\text{for }\tilde\gamma>0$$
hold for the operators B, A₁ and A₂ (with the corresponding boundary conditions).

85. The splitting of "shallow water" equations with respect to coordinates

System (84.5), obtained at the first splitting stage with respect to "topography", may
be transformed to a linear system of the "shallow water" equations:
$$\frac16\rho_0H\frac{\partial u_n}{\partial t}-\rho_0Hlv_n+H\frac{\partial p_n}{\partial x}=0,$$
$$\frac16\rho_0H\frac{\partial v_n}{\partial t}+\rho_0Hlu_n+H\frac{\partial p_n}{\partial y}=0,\qquad(85.1)$$
$$\frac{\lambda_n}{6gH}\frac{\partial p_n}{\partial t}+\frac{\partial u_nH}{\partial x}+\frac{\partial v_nH}{\partial y}=0.$$
Here the λₙ are the eigenvalues of the self-adjoint spectral problem
$$\frac{\partial}{\partial z_1}\Bigl(\frac{1}{\tilde\gamma}\frac{\partial Z_n}{\partial z_1}\Bigr)=-\lambda_nZ_n,\qquad
\frac{\partial Z_n}{\partial z_1}=0\quad\text{for }z_1=0,1,$$
with λ₀ = 0, λᵢ > 0, i = 1, 2, ….
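A three-point discretization of this vertical spectral problem inherits its key properties: the discrete operator annihilates constants (so λ₀ = 0) and is symmetric positive semidefinite (so the remaining eigenvalues are positive). A sketch added for illustration, with a hypothetical profile γ̃(z₁) > 0 chosen only as an example:

```python
import math

# Discrete analogue of  -(d/dz1)((1/gamma) dZ/dz1) = lambda*Z  with
# dZ/dz1 = 0 at z1 = 0, 1 on a uniform grid.  The matrix is assembled
# from the fluxes between neighbouring nodes, which makes it
# symmetric positive semidefinite with zero row sums by construction.

def build_operator(gamma_mid, dz):
    """Matrix of the three-point operator; gamma_mid[i] is the value
    of gamma at the half-integer point between nodes i and i+1."""
    n = len(gamma_mid) + 1
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):              # flux between nodes i and i+1
        w = 1.0 / (gamma_mid[i] * dz * dz)
        L[i][i] += w
        L[i][i + 1] -= w
        L[i + 1][i + 1] += w
        L[i + 1][i] -= w
    return L

n = 11
dz = 1.0 / (n - 1)
gamma_mid = [1.0 + 0.5 * (i + 0.5) * dz for i in range(n - 1)]  # hypothetical profile
L = build_operator(gamma_mid, dz)

matvec = lambda M, x: [sum(m * xj for m, xj in zip(row, x)) for row in M]
# Constants are annihilated, i.e. lambda_0 = 0:
print(max(abs(v) for v in matvec(L, [1.0] * n)))
# The quadratic form is nonnegative, so the other eigenvalues are positive:
x = [math.sin(3.0 * i) for i in range(n)]
print(sum(xi * yi for xi, yi in zip(x, matvec(L, x))) >= 0.0)
```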
System (85.1) can also be split along the coordinates x and y into two subsystems:
$$\frac{1}{12}\rho_0H\frac{\partial u_n}{\partial t}-\tfrac12\rho_0Hlv_n+H\frac{\partial p_n}{\partial x}=0,$$
$$\frac{1}{12}\rho_0H\frac{\partial v_n}{\partial t}+\tfrac12\rho_0Hlu_n=0,\qquad(85.2)$$
$$\frac{\lambda_n}{12gH}\frac{\partial p_n}{\partial t}+\frac{\partial u_nH}{\partial x}=0;$$
and
$$\frac{1}{12}\rho_0H\frac{\partial u_n}{\partial t}-\tfrac12\rho_0Hlv_n=0,$$
$$\frac{1}{12}\rho_0H\frac{\partial v_n}{\partial t}+\tfrac12\rho_0Hlu_n+H\frac{\partial p_n}{\partial y}=0,\qquad(85.3)$$
$$\frac{\lambda_n}{12gH}\frac{\partial p_n}{\partial t}+\frac{\partial v_nH}{\partial y}=0.$$
Systems (85.2) and (85.3), or, to be exact, their discrete analogues, are realized by
simple three-point sweeps.
Note that in the absence of the rotation effect (l = 0) the system coincides, up to
notation, with the acoustics equations in the two-dimensional case. A method of
splitting with respect to coordinates analogous to (85.2)-(85.3), with a concrete
discretization of the problem in the spatial variables, has been considered by YANENKO [1967].
One can find various versions of splitting methods for ocean dynamics problems,
as well as algorithms for their realization and results of numerical experiments, in
the works by MARCHUK [1974a], MARCHUK, BUBNOV, ZALESNY and KORDZADZE
[1983], MARCHUK, KORDZADZE and SKIBA [1974], MARCHUK et al. [1980], MARCHUK
and ZALESNY [1974], KUZIN [1980, 1984] and ZALESNY [1984, 1985].
References

ABRAMOV, A.A. and V.B. ANDREEV (1963), An application of the sweep method for finding periodic
solutions of differential and difference equations, Zh. Vychisl. Mat. i Mat. Fiz. 3 (2) (in Russian).
ALBRECHT, I. (1965), Eine Verallgemeinerung des Verfahrens der Alternierenden, Z. Angew. Math. Mech.
45 (3-6) (Sonderheft).
AMSDEN, A.A. and F.H. HARLOW (1970), A simplified MAC technique for incompressible fluid flow
calculations, J. Comput. Phys. 6 (2), 322-325.
ANDERSON, D.A. (1974), A comparison of numerical solutions to the inviscid equations of fluid motion,
J. Comput. Phys. 15 (1), 1-20.
ANDREEV, V.B. (1967), On the splitting difference schemes for general p-dimensional parabolic equations
of second order with mixed derivatives, Zh. Vychisl. Mat. i Mat. Fiz. 7 (2) (in Russian).
ANDREEV, V.B. (1969), On the convergence of the splitting difference schemes approximating the third
boundary value problem for parabolic equation, Zh. Vychisl. Mat. i Mat. Fiz. 9 (2) (in Russian).
ANUCHINA, N.N. (1970), On the methods for computing compressible liquid flows with large
deformations, Chisl. Metody Mekh. Sploshn. Sredy Inform. Bul. 1 (4) (in Russian).
ANUCHINA, N.N. and N.N. YANENKO (1959), The implicit splitting schemes for hyperbolic equations and
systems, Dokl. Acad. Sci. USSR 128 (6) (in Russian).
BABUSKA, I. (1970), Approximation by hill functions, Tech. Note BN-648, Institute for Fluid Dynamics
and Applied Mathematics, University of Maryland, College Park, MD.
BAGRINOVSKY, K.A. and S.K. GODUNOV (1957), Difference schemes for multidimensional problems, Dokl.
Acad. Sci. USSR 115 (3) (in Russian).
BAKER, G.A. (1960), An implicit numerical method for solving the n-dimensional heat equation, Quart.
Appl. Math. 17 (4), 440-443.
BAKER, G.A. and T.A. OLIPHANT (1960), An implicit numerical method for solving the two-dimensional
heat equation, Quart. Appl. Math. 17 (4).
BALDWIN, B.S. and R.W. MACCORMACK (1974), Numerical solution of the interaction of a strong shock
wave with a hypersonic turbulent boundary layer, AIAA Paper 558, New York.
BAUM, E. and E. NDEFO (1973), A temporal ADI computational technique, in: Proceedings AIAA
Computers and Fluid Dynamics Conference, Springs, CA (AIAA, New York) 133-140.
BELOTSERKOVSKY, O.M. (1984), Numerical Modelling in the Mechanics of Continuous Media (Nauka,
Moscow) (in Russian).
BELOTSERKOVSKY, O.M. and Yu.M. DAVYDOV (1970), The method of "large particles" for the problems in
dynamics, Chisl. Metody Mekh. Sploshn. Sredy Inform. Bul. 1 (3) (in Russian).
BELOTSERKOVSKY, O.M. and Yu.M. DAVYDOV (1978), The method of "large particles" (schemes and
applications), Moscow Physical Technical Institute, Moscow, USSR.
BELOTSERKOVSKY, O.M., Yu.P. GOLOVACHEV, V.G. GRUDNITSKY, Yu.M. DAVYDOV, V.K. DUSHIN, Yu.P.
LUNKIN, K.M. MAGOMEDOV, V.K. MOLODTSOV, F.D. POPOV, A.I. TOLSTYKH, V.N. FOMIN and A.S.
KHOLODOV (1974), Numerical Study of Modern Problems in Gas Dynamics (Nauka, Moscow).
BELOTSERKOVSKY, O.M., V.A. GUSHCHIN and V.V. SHCHENNIKOV (1975), The splitting method in
application to solving problems in the dynamics of viscous incompressible liquid, Zh. Vychisl. Mat. i
Mat. Fiz. 15 (1) (in Russian).


BELOTSERKOVSKY, O.M., F.D. POPOV, A.I. TOLSTYKH, V.N. FOMIN and A.S. KHOLODOV (1970),
Numerical solution of some problems in gas dynamics, Zh. Vychisl. Mat. i Mat. Fiz. 10 (2) (in Russian).
BELUKHINA, I.G. (1969), The difference schemes for solving two-dimensional dynamic elasticity problem
with mixed boundary conditions, Zh. Vychisl. Mat. i Mat. Fiz. 9 (2) (in Russian).
BENSOUSSAN, A. (1971), Pure decentralization for interrelated payoffs, in: Proceedings Symposium on
Optimization, Los Angeles, CA.
BENSOUSSAN, A., J.-L. LIONS and R. TEMAM (1975), Methods of Decomposition, Decentralization,
Coordination and Their Applications, Metody Vychisl. Mat. (Nauka, Novosibirsk).
BEREZIN, Yu.A., V.M. KOVENYA and N.N. YANENKO (1976), The Difference Methodfor Solving Round
Flow Problems, Aeromechanics (Nauka, Moscow) (in Russian).
BEREZIN, Yu.A. and V.A. VSHIVKOV (1980), The Particle Method in the Dynamics of Rarefied Plasma
(Nauka, Novosibirsk) (in Russian).
BEREZIN, Yu.A. and N.N. YANENKO (1984), The splitting method for the problem in physics of
semiconductors, Dokl. Acad. Sci. USSR 274 (6) (in Russian).
BIRKHOFF, G. and R. VARGA (1959), Implicit alternating direction methods, Trans. Amer. Math. Soc. 92 (1), 13-24.
BIRKHOFF, G., R. VARGA and D. YOUNG (1962), Alternating Direction Implicit Methods, Advances in
Computers 3 (Academic Press, New York).
BOYARINTSEV, Yu.Yu. (1966), On the Convergence of the Splitting Method and Local Correctness Criterion
for Difference Equations with Varying Coefficients, Nekotorye Voprosy Prikl. i Vychisl. Mat. (Nauka,
Novosibirsk) (in Russian).
BOYARINTSEV, Yu.Yu. and O.P. UZNADZE (1967). On the convergence of splitting the system of spherical
harmonics, Zh. Vychisl. Mat. i Mat Fiz. 7 (6) (in Russian).
BRAILOVSKAYA, I.Yu., T.V. KUSKOVA and L.A. CHUDOV (1968), The Difference Methods for Solving
Navier-Stokes Equations (Survey), Vychisl. Metody i Programmirovanie XI (Moskov. Gos. Univ.,
Moscow) (in Russian).
BRYAN, K. (1966), A scheme for numerical integration of the equations of motion on an irregular grid
free of nonlinear instability, Monthly Weather Rev. 94 (1).
BRIAN, P.L.I. (1961), A finite difference method of high order accuracy for the solution of three-
dimensional heat conduction problems, AIChE J. 7, 367-370.
BULEYEV, N.I. (1960), The numerical method for solving two- and three-dimensional diffusion equations,
Mat. Sb. 5 (2) (in Russian).
BULEYEV, N.I. (1970), The method of incomplete factorization for solving two- and three-dimensional
equation of diffusion type, Zh. Vychisl. Mat. i Mat. Fiz. 10 (4) (in Russian).
BURRIDGE, D.M. and F.R. HAYES (1974), Development of the British operational model, The GARP
Programme on Numerical Experimentation, Rept. 4, 102-104.
BUTLER, T.D. (1974), Recent Advances in Computational Fluid Dynamics, Lecture Notes in Computer
Science 11 (Springer, Berlin) 1-21.
CHAN, R.K.-C. and R.L. STREET (1970), A computer study of finite amplitude water waves, J. Comput.
Phys, 6, 68-97.
CHAN, R.K.-C., R.L. STREET and J. FROMM (1971), Numerical modelling of the water waves: The
development of Summac method, in: Proceedings Second International Conference on Numerical
Methods in FluidDynamics, Lecture Notes in Physics (Springer, Berlin).
CHORIN, A.J. (1968), Numerical solution of the Navier-Stokes equations, Math. Comp. 22(104), 745-762.
CLINE, M.C. (1974), Computation of steady nozzle flow by a time dependent method, AIAA J. 12 (4),
419-420.
CONCUS, P., G.H. GOLUB and D.P. O'LEARY (1976), A generalized conjugate gradient method for the
numerical solution of elliptic partial differential equations, in: J.R. BUNCH and D.J. ROSE, eds., Sparse
Matrix Computations (Academic Press, New York) 309-332.
COURANT, R., K.O. FRIEDRICHS and H. LEWY (1928), Über die partiellen Differenzengleichungen der
mathematischen Physik, Math. Ann. 100, 32-74.
CRANE, C.M. (1974), A new method for the numerical solution of time dependent viscous flow, Appl. Sci.
Res. 30 (4), 47-77.
DAVIs, R.T., U. GHIA and K.N. GHIA (1974), Laminar incompressible flow past a class of blunted wedges
using the Navier-Stokes equations, Comput. & Fluids 2 (2), 211-223.

DAVYDOV, Yu. and V.P. SKOTNIKOV (1978), The Method of "Large Particles": Aspects of Approximation,
Scheme Viscosity and Stability (VC Acad. Sci. USSR, Moscow) (in Russian).
DOUGLAS Jr, J. (1955), On the numerical integration of $\partial^2u/\partial x^2+\partial^2u/\partial y^2=\partial u/\partial t$ by implicit methods, J.
SIAM 3, 42-65.
DOUGLAS Jr, J. (1961), Alternating direction iteration for mildly nonlinear elliptic difference equations,
Numer. Math. 3, 92-99.
DOUGLAS Jr, J. (1962), Alternating direction methods for three space variables, Numer. Math. 4, 41-63.
DOUGLAS Jr, J. and T. DUPONT (1971), Alternating direction Galerkin methods on rectangles, in:
Numerical Solution of Partial Differential Equations, II, SYNSPADE (Academic Press, New York).
DOUGLAS Jr, J. and J.E. GUNN (1962), Alternating direction methods for parabolic system in m-space
variables, J. Assoc. Comput. Mach. 9, 450-456.
DOUGLAS Jr, J. and J.E. GUNN (1963), Two high-order correct difference analogues for the equation of
multidimensional heat flow, Math. Comp. 17, 71-80.
DOUGLAS Jr, J. and J.E. GUNN (1964), A general formulation of alternating direction methods, Part 1.
Parabolic and hyperbolic problems, Numer. Math. 6 (5), 428-453.
DOUGLAS Jr, J. and B.F. JONES (1963), On predictor-corrector methods for nonlinear parabolic
differential equations, J. SIAM 11, 195-204.
DOUGLAS Jr, J., R. KELLOGG and R. VARGA (1963), Alternating direction methods for n-space variables,
Math. Comp. 17.
DOUGLAS Jr, J. and C.M. PEARCY (1963), On convergence of alternating direction procedures in the
presence of singular operators, Numer. Math. 5, 175-184.
DOUGLAS Jr, J. and H.H. RACHFORD Jr (1956), On the numerical solution of heat conduction problems in
two- and three-space variables, Trans. Amer. Math. Soc. 82, 421-439.
DRYJA, M. (1967), On the stability of splitting schemes in C, Zh. Vychisl. Mat. i Mat. Fiz. 7 (2) (in Russian).
DRYJA, M. (1971a), The splitting difference schemes for systems of hyperbolic equations of the first order,
Zh. Vychisl. Mat. i Mat. Fiz. 11 (2) (in Russian).
DRYJA, M. (1971b), On the convergence of the splitting difference schemes for parabolic systems in C in the
internal points of the domain, Zh. Vychisl. Mat. i Mat. Fiz. 11 (3) (in Russian).
DUPONT, T. (1968), A factorization procedure for the solution of elliptic difference equations, SIAM J.
Numer. Anal. 5 (4).
DUVAUT, G. and J.-L. LIONS (1972), Les Inéquations en Mécanique et en Physique (Dunod, Paris).
DYACHENKO, V.F. (1965), On the new method for numerical solution of the non-stationary problems with
two spatial variables in gas dynamics, Zh. Vychisl. Mat. i Mat. Fiz. 5 (4) (in Russian).
DYAKONOV, Ye.G. (1961), The alternating direction method for solving systems of finite difference
equations, Dokl. Acad. Sci. USSR 138 (2) (in Russian).
DYAKONOV, Ye.G. (1962a), The grid method for the parabolic equations of the second order with
separable variables, Dokl. Acad. Sci. USSR 142 (6) (in Russian).
DYAKONOV, Ye.G. (1962b), On the one way to solve Poisson equation, Dokl. Acad. Sci. USSR 143 (1) (in
Russian).
DYAKONOV, Ye.G. (1962c), On some difference schemes for solving boundary value problems, Zh.
Vychisl. Mat. i Mat. Fiz. 2 (1) (in Russian).
DYAKONOV, Ye.G. (1962d), On the constructing of the splitting difference schemes for multidimensional
non-stationary problems, Uspekhi Mat. Nauk 17 (4) (in Russian).
DYAKONOV, Ye.G. (1962e), The splitting difference schemes for multidimensional stationary problems,
Zh. Vychisl. Mat. i Mat. Fiz. 2 (4) (in Russian).
DYAKONOV, Ye.G. (1962f), The splitting difference schemes for non-stationary equations, Dok. Acad.
Sci. USSR 144 (1) (in Russian).
DYAKONOV, Ye.G. (1963a), On the application of the splitting difference operators, Zh. Vychisl. Mat.
i Mat. Fiz. 3 (2) (in Russian).
DYAKONOV, Ye.G. (1963b), On the application of the splitting difference schemes for hyperbolic
equations with varying coefficients, Dokl. Acad. Sci. USSR 151 (4) (in Russian).
DYAKONOV, Ye.G. (1964a), On the application of the splitting difference schemes for some systems partial
equations, Uspekhi Mat. Nauk 14 (1) (in Russian).
DYAKONOV, Ye.G. (1964b), The splitting difference schemes of the second order of accuracy for parabolic

equations without mixed derivatives, Zh. Vychisl. Mat. i Mat. Fiz. 4 (5) (in Russian).
DYAKONOV, Ye.G. (1964c), The splitting difference schemes for general parabolic equations of the second
order with varying coefficients, Zh. Vychisl. Mat. i Mat. Fiz. 4 (2) (in Russian).
DYAKONOV, Ye.G. (1965a), On the application of the splitting difference schemes for some systems of
parabolic and hyperbolic equations, Sibirsk. Mat. Zh. 6 (3) (in Russian).
DYAKONOV, Ye.G. (1965b), The Splitting Difference Scheme of the Second Order of Accuracyfor Multi-
dimensional Parabolic Equations with Varying Coefficients, Numerical Methods and Programming
3 (Moskov. Gos. Univ., Moscow) (in Russian).
DYAKONOV, Ye.G. (1966), On the application of the splitting difference schemes for some systems of
integro-differential equations, Vestnik Moskov. Univ. Mat. 5 (in Russian).
DYAKONOV, Ye.G. (1967a), On the Class of PartialEquations Arising in Closing the Difference Methods
Based on Splitting the Operator, Vychisl. Metody i Programmirovanie. IV (Moskov. Gos. Univ.,
Moscow) (in Russian).
DYAKONOV, Ye.G. (1967b), The splitting difference schemes for the systems of equations of the kind
$L_0\,\partial u/\partial t+L_1u=f$, Dokl. Acad. Sci. USSR 176 (2) (in Russian).
DYAKONOV, Ye.G. (1967c), Economical Difference Methods Based on the Splitting Difference Operatorfor
Some Systems of PartialEquations, Numerical Methods and Programming 6 (Moskov. Gos. Univ.,
Moscow) (in Russian).
DYAKONOV, Ye.G. (1970), The iterative methods for solving difference analogues of the boundary value
problems for elliptic equations, Institute of Cybernetics Acad. Sci. USSR, Kiev (in Russian).
DYAKONOV, Ye.G. (1971a), On some operator inequalities and their applications, Dokl. Acad Sci. USSR
198 (5) (in Russian).
DYAKONOV, Ye.G. (1971b), On the Difference Methods for Solving Some NonstationarySystems, Applied
Mathematics and Programming 6 (Shtiintsa, Kishinev) (in Russian).
DYAKONOV, Ye.G. (1971c), The Difference Schemes of IncreasedAccuracyfor Systems ofMixed Equations,
Applied Mathematics and Programming (Shtiintsa, Kishinev) (in Russian).
DYAKONOV, Ye.G. (1971d), The Difference Methodsfor Solving Boundary Value Problems, I: Stationary
Problems (Moskov. Gos. Univ., Moscow) (in Russian).
DYAKONOV, Ye.G. (1972), The Difference Methods for Solving Boundary Value Problems, II: Non-
stationary Problems (Moskov. Gos. Univ., Moscow) (in Russian).
DYAKONOV, Ye.G. and V.I. LEBEDEV (1967), The Splitting Method for the Third Boundary Value
Problem, Numerical Methods and Programming 4 (Moskov. Gos. Univ., Moscow) (in Russian).
DYMNIKOV, V.P. and S.K. FILIN (1985), The numerical modelling of the atmospheric circulation response
to the temperature anomaly on the surface of the ocean in North Atlantic, Preprint No. 100, OVM
Acad. Sci. USSR, Moscow (in Russian).
DYMNIKOV, V.P. and A.V. ISHIMOVA (1979), Unadiabatic model for the short-term weather forecasting,
Meteorologia i Gidrologia6 (in Russian).
DYMNIKOV, V.P. and V.N. LYKOSOV (1983), Spectral analysis of the quasi-stationary atmospheric
circulation response to the temperature anomaly on the surface of the ocean, Preprint No. 61, OVM
Acad. Sci. USSR, Moscow (in Russian).
FAIRWEATHER, G. and A.R. MITCHELL (1966), Some computational results of an improved A.D.I. method
for the Dirichlet problem, Comput. J. 9 (3).
FAIRWEATHER, G. and A.R. MITCHELL (1967), A new computational procedure for A.D.I. methods, SIAM
J. Numer. Anal. 4, 163-170.
FAIRWEATHER, G., A.R. GOURLAY and A.R. MITCHELL (1967), Some high accuracy difference schemes
with a splitting operator for equations of parabolic and elliptic type, Numer. Math. 10, 56-66.
FILIPPOV, A.F. (1955), On the stability of the difference schemes, Dokl. Acad. Sci. USSR 100 (6) (in
Russian).
FORESTER, C. and A. EMERY (1972), A computational method for low Mach number unsteady
compressible free convective flows, J. Comput. Phys. 10, 487-502.
FRIEDRICHS, K.O. (1954), Symmetric hyperbolic linear differential equations, Comm. Pure Appl. Math. 7,
345-392.
FRYAZINOV, I.V. (1964), On the difference approximation of the boundary conditions for the third
boundary value problem, Zh. Vychisl. Mat. i. Mat. Fiz. 4 (6) (in Russian).

FRYAZINOV, I.V. (1966), On the solution of the third boundary value problem for two-dimensional heat
conduction equation in arbitrary domain by local one-dimensional method, Zh. VychisL Mat. i Mat.
Fiz. 6 (3) (in Russian).
FRYAZINOV, I.V. (1968), The economical symmetrized schemes for the solution of boundary value
problems for multidimensional parabolic equation, Zh. Vychisl. Mat. i Mat. Fiz. 8 (2) (in Russian).
FRYAZINOV, I.V. (1969a), The a priori estimates for some family of economical schemes, Zh. Vychisl. Mat.
i Mat. Fiz. 9 (3) (in Russian).
FRYAZINOV, I.V. (1969b), The economical schemes of increased order of accuracy for solving
multidimensional parabolic equation, Zh. Vychisl. Mat. i Mat. Fiz. 9 (6) (in Russian).
GELFAND, I.M. and O.V. LOKUTSIEVSKY (1962), The sweep method for solving difference equations, in:
S.K. GODUNOV and V.S. RYABENKII, eds., Introduction to the Theory of Difference Schemes (Fizmatgiz,
Moscow) (in Russian).
GLOWINSKI, R., J.-L. LIONS and R. TREMOLIERES (1979), Numerical Study of Variational Inequalities (Mir,
Moscow) (in Russian).
GODUNOV, S.K. (1959), The difference method for numerical computation of discontinuous solutions of
equations in hydrodynamics, Mat. Sb. 47 (3) (in Russian).
GODUNOV, S.K. (1962), The method of the orthogonal sweep for solving system of difference equations,
Zh. Vychisl. Mat. i Mat. Fiz. 2 (6) (in Russian).
GODUNOV, S.K. and V.S. RYABENKII (1977), The Difference Schemes: Introduction to the Theory (Nauka,
Moscow) (in Russian).
GODUNOV, S.K. and K.A. SEMENDYAEV (1962), The difference methods for numerical solution of the
problems in gas dynamics, Zh. Vychisl. Mat. i Mat. Fiz. 2 (7) (in Russian).
GODUNOV, S.K. and A.V. ZABRODIN (1962), On the difference schemes of the second order of accuracy for
multidimensional problems, Zh. Vychisl. Mat. i Mat. Fiz. 2 (4) (in Russian).
GODUNOV, S.K., A.V. ZABRODIN, M.Y. IVANOV et al. (1976), Numerical Solution of Multidimensional
Problems in Gas Dynamics (Nauka, Moscow) (in Russian).
GOURLAY, A. and A. MITCHELL (1961), Split operator methods for hyperbolic system in p-space variables,
Math. Comp. 21, 351-354.
GOURLAY, A. and A. MITCHELL (1966), Alternating direction methods for hyperbolic systems, Numer.
Math. 8, 137-149.
GOURLAY, A. and A. MITCHELL (1967), Intermediate boundary corrections for split operator methods in
three dimensions, BIT 7, 31-38.
GOURLAY, A. and A. MITCHELL (1968), High accuracy A.D.I. methods for parabolic equations with
variable coefficients, Numer. Math. 12, 180-185.
GOURLAY, A. and A. MITCHELL (1969), A classification of split difference methods for hyperbolic
equations in several space dimensions, SIAM J. Numer. Anal. 6, 62-71.
GUITTET, J. (1967), Une nouvelle méthode de directions alternées à q variables, J. Math. Anal. Appl. 17, 199-213.
GUNN, J. (1965), The solution of elliptic difference equations by semi-explicit iterative techniques, SIAM
J. Numer. Anal. 2 (1).
GUSHCHIN, V.A. (1981), The splitting method for solving the problem in dynamics of inhomogeneous
viscous incompressible liquid, Zh. Vychisl. Mat. i Mat. Fiz. 21 (4) (in Russian).
HUBBARD, B. (1966), Alternating direction schemes for the heat equation in a general domain, SIAM J.
Numer. Anal. 2 (3).
HARLOW, F. (1967), Numerical Method of Particles in Cells for the Problems in Hydrodynamics, in:
Numerical Methods in Hydrodynamics (Mir, Moscow) (in Russian).
HARLOW, F. and J. WELCH (1965), Numerical calculation of time-dependent viscous incompressible flow
of fluid with free surface, Phys. Fluids 8, 2182-2189.
IL'IN, V.P. (1965), On the splitting of the parabolic and elliptic difference equations, Sibirsk. Mat. Zh. 6 (1)
(in Russian).
IL'IN, V.P. (1966), On the application of the alternating direction method for solving quasi-linear
parabolic and elliptic equations, in: Some Aspects of Applied and Numerical Mathematics (Nauka,
Novosibirsk) (in Russian).
IL'IN, V.P. (1967), On the explicit alternating direction schemes, Izv. Sib. Otd. Acad. Sci. USSR Ser. Tekhn.
Nauk 13 (3) (in Russian).
454 G.I. Marchuk

IL'IN, V.P. (1970), The Difference Methods for Solving Elliptic Equations (Novosibirsk Gos. Univ.,
Novosibirsk) (in Russian).
KARCHEVSKY, M.M., A.V. LAPIN and A.D. LYASHKO (1972), Economical difference schemes for
quasi-linear parabolic equations, Izv. Vuzov Mat. 3 (118) (in Russian).
KELDYSH, M.V. (1942), On Galerkin's method for solving boundary value problems, Izv. Acad. Sci.
USSR, Mat. 6 (in Russian).
KELLOGG, R. (1963), Another alternating direction implicit method, J. SIAM 11, 976-979.
KELLOGG, R. (1964), An alternating direction method for operator equations, J. SIAM 12 (4).
KELLOGG, R. and J. SPANIER (1965), On optimal alternating direction parameters for singular matrices,
Math. Comp. 19 (1).
KOCHERGIN, V.P. and Yu.A. KUZNETSOV (1969), On the solution of the system of linear equation by the
splitting method, in: Numerical Methods in Transport Theory (Atomizdat, Moscow) (in Russian).
KONOVALOV, A.N. (1962), The method of fractional steps for solving Cauchy problem for multidimensional
oscillation equation, Dokl. Acad. Sci. USSR 147 (1) (in Russian).
KONOVALOV, A.N. (1964), The application of the splitting method to the numerical solution of dynamic
problems in the theory of elasticity, Zh. Vychisl. Mat. i Mat. Fiz. 4 (4) (in Russian).
KONOVALOV, A.N. (1972), The problem of the filtration of multiphase incompressible liquid (Novosibirsk
Gos. Univ., Novosibirsk) (in Russian).
KOROBITSINA, J.L. (1969), On the boundary conditions in the splitting scheme for kinetic equation, in:
Numerical Methods in Transport Theory (Atomizdat, Moscow) (in Russian).
KOVENYA, V.M. and A.S. LEBEDEV (1984), The application of the splitting method in predictor-corrector
schemes for solving problems in gas dynamics, Chysl. Metody Mekh. Sploshn. Sredy 15 (2) (in Russian).
KOVENYA, V.M. and N.N. YANENKO (1981), The Splitting Methods for Problems in Gas Dynamics (Nauka,
Novosibirsk) (in Russian).
KRYAKVIN, S.A. (1966), On the accuracy of the alternating direction schemes for heat conduction
equation, Zh. Vychisl. Mat. i Mat. Fiz. 6 (in Russian).
KUTLER, P. and H. LOMAX (1971), Shock-capturing, finite-difference approach to supersonic flows,
J. Spacecraft and Rockets 8, 1175-1182.
KUTLER, P., W. REINHARDT and R. WARMING (1973), Multishocked, three-dimensional supersonic
flowfields with real gas effects, AIAA J. 11, 657-664.
KUTLER, P., L. SAKELL and G. AIELLO (1974), On the shock-on-shock interaction problem, AIAA Paper
524, New York, 10pp.
KUZIN, V.I. (1980), On the solution of the equation for barotropic Rossby waves by the finite element
method with splitting, in: Mathematical Modelling of Ocean Dynamics (VC Sib. Otd. Acad. Sci. USSR,
Novosibirsk) (in Russian).
KUZIN, V.I. (1984), The numerical model of global ocean circulation based on finite element method with
splitting, in: Mathematical Modelling of the Dynamics of Ocean and Internal Reservoirs (VC Sib. Otd. Acad.
Sci. USSR, Novosibirsk) (in Russian).
KUZNETSOV, A.Yu. and M.K. STRELETS (1983), Numerical modelling of essentially subsonic stationary
non-isothermal flows of homogeneous viscous gas in channels, Chysl. Metody Mekh. Sploshn. Sredy 14
(6) (in Russian).
KUZNETSOV, B.P., N.P. MOSHKIN and S. SMAGULOV (1983), Numerical study of viscous incompressible
liquid flows in channels of complex geometry under given pressure jumps, Chysl. Metody Mekh.
Sploshn. Sredy 14 (5) (in Russian).
LADYZHENSKAYA, O.A. (1956), On the solution of non-stationary operator equations, Mat. Sb. 39 (4)
(in Russian).
LADYZHENSKAYA, O.A. (1958), On the non-stationary operator equations and their applications to linear
problems in mathematical physics, Mat. Sb. 45 (in Russian).
LADYZHENSKAYA, O.A. and V.Y. RIVKIND (1971), On the convergent difference schemes for Navier-Stokes
equations, Chysl. Metody Mekh. Sploshn. Sredy Inform. Bul. 2 (1) (in Russian).
LAVAL, P. (1983), Nouveaux schémas de désintégration pour la résolution des problèmes hyperboliques et
paraboliques non-linéaires: applications aux équations d'Euler et de Navier-Stokes, Rech. Aérospat. 4.
LAX, P. (1962), On the stability of finite difference approximations to the solutions of hyperbolic
equations with varying coefficients, Mathematics (Translations coll.) 6 (3) (in Russian).
LAX, P. and B. WENDROFF (1964), Difference schemes for hyperbolic equations with high order of accuracy,
Comm. Pure Appl. Math. 17, 381-398.
LEBEDEV, V.I. (1971), On the type of quadrature formulae of increased algebraic accuracy for sphere, Dokl.
Acad. Sci. USSR 231 (1) (in Russian).
LEBEDEV, V.I. (1976), On the quadratures on the sphere, Zh. Vychisl. Mat. i Mat. Fiz. 16 (2) (in Russian).
LEBEDEV, V.I. (1977), On Zolotarev's problem in alternating direction method, Zh. Vychisl. Mat. i Mat.
Fiz. 17 (2) (in Russian).
LEES, M. (1960a), A priori estimates for the solutions of difference approximations to parabolic partial
differential equations, Duke Math. J. 27, 297-311.
LEES, M. (1960b), Energy inequalities for the solution of differential equations, Trans. Amer. Math. Soc.
94, 58-73.
LEES, M. (1960c), Alternating direction methods for hyperbolic differential equations, J. SIAM 10,
610-616.
LEES, M. (1961), Alternating direction and semi-explicit difference methods for parabolic partial
differential equations, Numer. Math. 3, 398-412.
LEES, M. (1966), A linear three-level difference scheme for quasi-linear parabolic equations, Math. Comp.
20, 516-522.
LEPAS, J., M. BEKLARECHE, J. CAIFFIER, L. FINKE and A. TAGNIT-HAAON (1974), Primitive equations
model-implicit method for numerical integration, The GARP Programme on Numerical
Experimentation, Rept. 4, 65.
LERAT, A. and R. PEYRET (1975), Propriétés dispersives et dissipatives d'une classe de schémas aux
différences pour les systèmes hyperboliques non-linéaires, Rech. Aérospat. 2, 61-79.
LIONS, P. and B. MERCIER (1978), Splitting algorithms for the sum of two non-linear operators, Rapport
Interne 29, Centre de Mathématiques Appliquées, École Polytechnique, Palaiseau, France.
LIONS, J.-L. and R. TEMAM (1966), Une méthode d'éclatement des opérateurs et des contraintes en calcul
des variations, C.R. Acad. Sci. Paris 263.
LORENZ, E. (1955), Available potential energy and the maintenance of the general circulation, Tellus 7,
157-167.
LUNN, M. (1964), On the equivalence of SOR, SSOR and USSOR as applied to ordered systems of linear
equations, Comput. J. 7, 72-75.
LYTKIN, Yu.M. and G.G. CHERNYKH (1975), On the internal waves induced by the collapse of
displacement zone in stratified liquid, Dynamics of Continuous Media 22 (in Russian).
MACCORMACK, R. and B. BALDWIN (1975), A numerical method for solving Navier-Stokes equations
with application to shock-boundary layer interactions, AIAA Paper 1, New York, 8 pp.
MACCORMACK, R. and A. PAULLAY (1972), Computational efficiency achieved by time splitting of finite
difference operators, AIAA Paper 154, New York, 7 pp.
MARCHUK, G.I. (1964a), A new approach to the numerical solution of the equations of weather forecasting,
in: Proceedings Symposium on Long-Term Forecasting Methods, USA.
MARCHUK, G.I. (1964b), The numerical algorithm for solving the equations of weather forecasting, Dokl.
Acad. Sci. USSR 156 (2), 308-311 (in Russian).
MARCHUK, G.I. (1965), Numerical methods for the problems in weather forecasting and theory of climate,
in: Lectures on the Numerical Methods for Short-Term Weather Forecasting (Gidrometeoizdat,
Leningrad) (in Russian).
MARCHUK, G.I. (1967), Numerical Methods for Weather Forecasting (Gidrometeoizdat, Leningrad) (in
Russian).
MARCHUK, G.I. (1968), Some applications of splitting methods to the solution of mathematical physics
problems, Apl. Mat. 13, 103-132.
MARCHUK, G.I. (1969), The splitting method for problems in mathematical physics, in: Numerical
Methods for Problems in Mechanics of Continuous Media (in Russian).
MARCHUK, G.I. (1970), Methods and problems in numerical mathematics, Internat. Math. Congr., Nice;
also: (1972), Reports of Soviet Mathematicians (Nauka, Moscow) (in Russian).
MARCHUK, G.I. (1971), On the theory of splitting method, in: Numerical Solution of Partial Differential
Equations, II, SYNSPADE (Academic Press, New York).
MARCHUK, G.I. (1973), Introduction to the Methods of Numerical Analysis (Cremonese, Rome).
MARCHUK, G.I. (1974a), Numerical Solution of the Problems in Atmosphere and Ocean Dynamics
(Gidrometeoizdat, Leningrad) (in Russian).
MARCHUK, G.I. (1974b), Numerical Methods in the Computation of Ocean Currents (Vychisl. Centr. Sib.
Otd. Acad. Sci. USSR, Novosibirsk) (in Russian).
MARCHUK, G.I. (1980), Methods of Numerical Mathematics (Nauka, Moscow) (in Russian).
MARCHUK, G.I. (1982), Mathematical Modelling in the Problem of Environment (Nauka, Moscow)
(in Russian).
MARCHUK, G.I. and V.I. AGOSHKOV (1981), Introduction to the Projection Grid Methods (Nauka,
Moscow) (in Russian).
MARCHUK, G.I., M.A. BUBNOV, V.B. ZALESNY and A.A. KORDZADZE (1983), Mathematical modelling of
the sea currents, tide waves and elaboration of numerical algorithms, in: Actual Problems in Numerical
and Applied Mathematics (Nauka, Novosibirsk) (in Russian).
MARCHUK, G.I. and V.P. DYMNIKOV, eds. (1974), Dynamic Meteorology and Numerical Weather
Forecasting (Gidrometeoizdat, Moscow) (in Russian).
MARCHUK, G.I., V.P. DYMNIKOV, V.Ya. GALIN et al. (1975), The hydrodynamic model of the global
atmosphere and ocean circulation (in Russian).
MARCHUK, G.I., V.P. DYMNIKOV and V.B. LYKOSOV (1981), On the relation between index cycles of the
atmosphere circulation and spatial spectrum of the kinetic energy in the model of the general
circulation of the atmosphere, Tech. Memo. 31, ECMWF, 33 pp.
MARCHUK, G.I., V.P. DYMNIKOV, V.N. LYKOSOV, V.Ya. GALIN, I.M. BOBYLEVA and V.L. PEROV (1979),
The global atmosphere circulation model, Izv. Acad. Sci. USSR Ser. Fiz. Atm. Okeana 15 (in Russian).
MARCHUK, G.I., V.P. DYMNIKOV, V.B. ZALESNY, V.N. LYKOSOV, and V.Ya. GALIN (1984), Mathematical
Modelling of the Global Atmosphere and Ocean Circulation (Gidrometeoizdat, Leningrad) (in Russian).
MARCHUK, G.I., G.R. KONTAREV and G.S. RIVIN (1967), The short-term weather forecasting based on the
whole equations in bounded territory, Izv. Acad. Sci. USSR Ser. Fiz. Atm. Okeana 3 (in Russian).
MARCHUK, G.I., A.A. KORDZADZE and Yu.N. SKIBA (1975), The computation of main hydrological fields
of the Black Sea, Izv. Acad. Sci. USSR Ser. Fiz. Atm. Okeana 11 (4) (in Russian).
MARCHUK, G.I. and V. KUZIN (1983), On the combination of finite element and splitting methods in the
solution of parabolic equations, J. Comput. Phys. 52, 237-272.
MARCHUK, G.I. and Yu.A. KUZNETSOV (1972), Iterative Methods and Quadratic Functionals (Nauka,
Novosibirsk) (in Russian).
MARCHUK, G.I. and Yu.A. KUZNETSOV (1974), Méthodes itératives et fonctionnelles quadratiques, in: Sur
les Méthodes Numériques en Sciences Physiques et Économiques, Méthodes Mathématiques de
l'Informatique 4 (Dunod, Paris) 3-132.
MARCHUK, G.I. and V.I. LEBEDEV (1981), Numerical Methods in the Theory of Neutron Transport
(Atomizdat, Moscow) (in Russian).
MARCHUK, G.I. et al. (1980), Mathematical Models of Ocean Circulation (Nauka, Novosibirsk) (in
Russian).
MARCHUK, G.I., V.V. PENENKO and U.M. SULTANGAZIN (1969), On the solution of the kinetic equation by
the splitting method, in: Numerical Methods in Transport Theory (Atomizdat, Moscow) (in Russian).
MARCHUK, G.I. and A.S. SARKISYAN, eds. (1980), Mathematical Models of Ocean Circulation (Nauka,
Novosibirsk) (in Russian).
MARCHUK, G.I. and Yu.N. SKIBA (1976), The numerical computation of the conjugate problem in the
model of thermal interaction between atmosphere, oceans and continents, Izv. Acad. Sci. USSR Ser.
Fiz. Atm. Okeana 12 (5) (in Russian).
MARCHUK, G.I. and U.M. SULTANGAZIN (1965a), On the convergence of the splitting method for the
equations of radiation transport, Dokl. Acad. Sci. USSR 161 (1) (in Russian).
MARCHUK, G.I. and U.M. SULTANGAZIN (1965b), On the solution of the kinetic transport equation by the
splitting method, Dokl. Acad. Sci. USSR 163 (4) (in Russian).
MARCHUK, G.I. and U.M. SULTANGAZIN (1965c), On the grounding of the splitting method for the
equation of radiation transport, Zh. Vychisl. Mat. i Mat. Fiz. 5 (5) (in Russian).
MARCHUK, G.I. and N.N. YANENKO (1964), The solution of the multidimensional kinetic equation by the
splitting method, Dokl. Acad. Sci. USSR 157 (6) (in Russian).
MARCHUK, G.I. and N.N. YANENKO (1966), Application of the splitting (fractional steps) method to the
solving of problems in mathematical physics, in: Some Aspects of Numerical and Applied Mathematics
(Nauka, Novosibirsk) (in Russian).
MARCHUK, G.I. and V.B. ZALESNY (1974), The numerical model of the large-scale circulation in the World
Ocean, in: Numerical Methods for Computation of Ocean Currents (in Russian).
MELADZE, G. (1970), The schemes of increased order of accuracy for systems of elliptic and parabolic
equations, Zh. Vychisl. Mat. i Mat. Fiz. 10 (2) (in Russian).
MESINGER, F. and A. ARAKAWA (1976), Numerical Methods Used in Atmospheric Models, GARP Publ.
Series 17, 135.
ORTIZ, M., P. PINSKY and R. TAYLOR (1983), Operator split methods for the numerical solution of the
elastoplastic dynamic problem, Comput. Methods Appl. Mech. Engrg. 39, 137-157.
MITCHELL, A. and G. FAIRWEATHER (1964), Improved forms of the alternating direction methods of
Douglas, Peaceman and Rachford for solving parabolic and elliptic equations, Numer. Math. 6,
285-292.
MORRIS, J. (1970), On the numerical solution of a heat equation associated with a thermal print-head, J.
Comput. Phys. 5, 208-228.
MOSKOLKOV, M.N. (1969), On the family of the difference schemes for quasi-linear parabolic equation,
Zh. Vychisl. Mat. i Mat. Fiz. 9 (6) (in Russian).
NICHOLS, B. (1971), Recent extensions to the MAC method for incompressible fluid flows, in: Proceedings
Second International Conference on Numerical Methods in Fluid Dynamics, Lecture Notes in Physics
(Springer, Berlin).
NICHOLS, B. (1973), Further development of the method of markers and cells for incompressible fluid flows,
in: Numerical Methods in Fluid Mechanics (Mir, Moscow) (in Russian).
OLIPHANT, T. (1961), An implicit, numerical method for solving two-dimensional time-dependent
diffusion problems, Quart. Appl. Math. 19 (3).
PEACEMAN, D.W. and H.H. RACHFORD Jr (1955), The numerical solution of parabolic and elliptic
differential equations, J. SIAM 3, 28-42.
PEARCY, C. (1962), On convergence of alternating direction procedures, Numer. Math. 4, 172-176.
PENENKO, V.V. and A.Ye. ALOYAN (1975), Some applications of the splitting method for the problems in
mesometeorology, Tr. Zap. Sib. Reg. N. 1. Gidrometeorol. In-t 14 (in Russian).
PENENKO, V.V., U.M. SULTANGAZIN and B.A. BALASH (1969), The solution of the kinetic equation by the
splitting method, in: Numerical Methods in Transport Theory (Atomizdat, Moscow) (in Russian).
POPOV, Yu.P. and A.A. SAMARSKII (1970), Completely conservative difference schemes for the equations
of magnetic hydrodynamics, Zh. Vychisl. Mat. i Mat. Fiz. 10 (4) (in Russian).
PRACHT, W. (1973), The implicit method for the computation of creeping flows with application to the problem
of continental drift, in: Numerical Methods in Fluid Mechanics (Moscow, Mir) (in Russian).
PYSTA, S. (1969), The splitting difference schemes for systems of mixed differential equations, Zh. Vychisl.
Mat. i Mat. Fiz. 9 (4) (in Russian).
RACHFORD Jr, H.H. (1968), Rounding errors in alternating direction methods for parabolic problems,
Apl. Mat. 13, 177-180.
RAVIART, P. (1967), Sur l'approximation de certaines équations d'évolution linéaires et non-linéaires,
J. Math. Pures Appl. 46, 11-107; 46, 109-183.
REID, J. (1971), On the method of conjugate gradients for the solution of large sparse systems of linear
equations, in: Large Sparse Sets of Linear Equations (Academic Press, London) 231-251.
RICHTMYER, R.D. and K.W. MORTON (1972), Difference Methods for Boundary Value Problems (Mir,
Moscow) (in Russian).
RIZZI, A.W. and M. INOUYE (1973a), A time-split finite volume technique for three-dimensional
blunt-body flow, AIAA Paper 133, New York, 14 pp.
RIZZI, A.W. and M. INOUYE (1973b), Time-split finite volume method for three-dimensional blunt-body flow,
AIAA J. 11, 1478-1485.
RIZZI, A.W., A. KLAVINS and R. MACCORMACK (1975), A generalized hyperbolic marching technique for
three-dimensional supersonic flow with shocks, in: Lecture Notes in Physics 35 (Springer, Berlin)
341-346.
ROZHDESTVENSKY, B.L. and N.N. YANENKO (1968), Systems of Quasi-Linear Equations (Nauka, Moscow)
(in Russian).
RUSANOV, V.V. (1960), On the stability of the matrix sweep method, in: Numerical Mathematics 6 (Mir,
Moscow) (in Russian).
RYABENKII, V.S. and A.F. FILIPPOV (1956), On the Stability of Difference Equations (Gostekhizdat,
Moscow) (in Russian).
SAMARSKII, A.A. (1961a), A priori estimates for the solution of the difference analogue of a parabolic
differential equation, Zh. Vychisl. Mat. i Mat. Fiz. 1, 441-460 (in Russian).
SAMARSKII, A.A. (1961b), A priori estimates for difference equations, Zh. Vychisl. Mat. i Mat. Fiz. 1,
972-1000 (in Russian).
SAMARSKII, A.A. (1962a), The homogeneous difference schemes for nonlinear parabolic equations, Zh.
Vychisl. Mat. i Mat. Fiz. 2 (1) (in Russian).
SAMARSKII, A.A. (1962b), On the convergence and accuracy of homogeneous difference schemes for one-
and multidimensional parabolic equations, Zh. Vychisl. Mat. i Mat. Fiz. 2, 603-634 (in Russian).
SAMARSKII, A.A. (1962c), On an economical difference method for the solution of a multidimensional
parabolic equation in an arbitrary region, Zh. Vychisl. Mat. i Mat. Fiz. 2, 787-811 (in Russian).
SAMARSKII, A.A. (1962d), On the convergence of the method of fractional steps for heat conduction
problems, Zh. Vychisl. Mat. i Mat. Fiz. 2 (6) (in Russian).
SAMARSKII, A.A. (1963a), Local one-dimensional difference schemes on non-uniform grids, Zh. Vychisl.
Mat. i Mat. Fiz. 3 (3) (in Russian).
SAMARSKII, A.A. (1963b), Schemes of increased order of accuracy for the multi-dimensional heat
conduction equation, Zh. Vychisl. Mat. i Mat. Fiz. 3 (5) (in Russian).
SAMARSKII, A.A. (1964a), On the economical algorithm for the numerical solution of a system of
differential and algebraic equations, Zh. Vychisl. Mat. i Mat. Fiz. 4 (3) (in Russian).
SAMARSKII, A.A. (1964b), Local one-dimensional difference schemes for multidimensional hyperbolic
equations in an arbitrary domain, Zh. Vychisl. Mat. i Mat. Fiz. 4 (4) (in Russian).
SAMARSKII, A.A. (1965a), Economical difference schemes for a hyperbolic system of equations with mixed
derivatives and their application to the equations of the theory of elasticity, Zh. Vychisl. Mat. i Mat.
Fiz. 5 (1) (in Russian).
SAMARSKII, A.A. (1965b), On the additivity principle in constructing economical difference schemes, Dokl.
Acad. Sci. USSR 165 (6) (in Russian).
SAMARSKII, A.A. (1966), The additive schemes, Thesis of the Reports on the International Mathematicians
Congress in Moscow (in Russian).
SAMARSKII, A.A. (1967), The classes of stable schemes, Zh. Vychisl. Mat. i Mat. Fiz. 7 (5) (in Russian).
SAMARSKII, A.A. (1970), Some Aspects of the General Theory of Difference Schemes: Partial Differential
Equations (Nauka, Moscow) (in Russian).
SAMARSKII, A.A. (1971), Introduction to the Theory of Difference Schemes (Nauka, Moscow) (in Russian).
SAMARSKII, A.A. (1977), The Theory of Difference Schemes (Nauka, Moscow) (in Russian).
SAMARSKII, A.A. and V.B. ANDREEV (1963), On the difference schemes of increased order of accuracy for
the elliptic equation with several spatial variables, Zh. Vychisl. Mat. i Mat. Fiz. 3 (6) (in Russian).
SAMARSKII, A.A. and V.B. ANDREEV (1964), Iterative alternating direction schemes for the numerical
solution of the Dirichlet problem, Zh. Vychisl. Mat. i Mat. Fiz. 4 (6) (in Russian).
SAMARSKII, A.A. and V.B. ANDREEV (1976), The Difference Schemes for Elliptic Equations (Nauka,
Moscow) (in Russian).
SAMARSKII, A.A. and I.V. FRYAZINOV (1961), On the convergence of homogeneous difference schemes for
a heat conduction equation with discontinuous coefficients, Zh. Vychisl. Mat. i Mat. Fiz. 1, 806-824 (in
Russian).
SAMARSKII, A.A. and I.V. FRYAZINOV (1971), On the convergence of the local one-dimensional scheme for
the multidimensional heat conduction equation on non-uniform grids, Zh. Vychisl. Mat. i Mat. Fiz. 11
(3) (in Russian).
SAMARSKII, A.A. and A.V. GULIN (1973), The Stability of Difference Schemes (Nauka, Moscow) (in
Russian).
SAMARSKII, A.A. and E.S. NIKOLAEV (1978), Methods for Solving Grid Equations (Nauka, Moscow) (in
Russian).
SAMARSKII, A.A. and Yu.P. POPOV (1975), Difference Schemes for Gas Dynamics (Nauka, Moscow) (in
Russian).
SAUL'EV, V.K. (1960), Integration of Parabolic Equations by the Grid Method (Fizmatgiz, Moscow) (in
Russian).
SCOTT, W. and P. ROGER (1983), A more accurate method for the numerical solution of non-linear partial
differential equations, J. Comput. Phys. 49, 342-348.
SMELOV, V.V. (1978), Lectures on the Neutron Transport Theory (Atomizdat, Moscow) (in Russian).
SOFRONOV, I.D. (1965), The difference scheme with diagonal directions of sweeps for a heat conduction
equation, Zh. Vychisl. Mat. i Mat. Fiz. 5 (in Russian).
STRANG, G. and G. FIX (1977), The Theory of the Finite Element Method (Mir, Moscow) (in Russian).
SULTANGAZIN, U.M. (1971), On the foundation of the method of weak approximation for the spherical
harmonics equation, Preprint, Sib. Otd. Acad. Sci. USSR, Novosibirsk (in Russian).
SULTANGAZIN, U.M. (1979), The Methods of Spherical Harmonics and Discrete Ordinates in Problems in
Kinetic Transport Theory (Nauka, Alma-Ata) (in Russian).
TEMAM, R. (1968), Sur la stabilité et la convergence de la méthode des pas fractionnaires, Ann. Mat. Pura
Appl. (4) 79.
TEMAM, R. (1969), Sur l'approximation de la solution des équations de Navier-Stokes par la méthode des
pas fractionnaires, Arch. Rational Mech. Anal. 32, 135-153.
TEMAM, R. (1970), Quelques méthodes de décomposition en analyse numérique, Actes du Congrès
International des Mathématiciens 3.
TEMAM, R. (1981), Navier-Stokes Equations and Numerical Analysis (Mir, Moscow) (in Russian).
THOMPSON, R.J. (1964), Difference approximations for inhomogeneous and quasi-linear equations, J.
SIAM 12, 189-199.
TIKHONOV, A.N. and A.A. SAMARSKII (1961), Homogeneous difference schemes, Zh. Vychisl. Mat. i Mat.
Fiz. 1, 5-63 (in Russian).
TODD, J. (1967), Inequalities of Chebyshev, Zolotareff, Cauer and W.B. Jordan, in: Proceedings Symposium
on Inequalities, Wright-Patterson Air Force Base, Dayton, OH (Academic Press, New York), 321-328.
VALIULIN, A.N. and N.N. YANENKO (1967), Economical difference schemes of increased accuracy for
a polyharmonic equation, Izv. Sib. Otd. Acad. Sci. USSR Ser. Tekhn. Nauk 13 (3) (in Russian).
VARGA, R.S. (1962), Matrix Iterative Analysis (Prentice-Hall, Englewood Cliffs, NJ).
VASILYEV, O.R., B.G. KUZNETSOV, Yu.M. LYTKIN and G.G. CHERNYKH (1974), The evolution of
a turbulent fluid domain in a stratified medium, Izv. Acad. Sci. USSR Mekh. Zhidk. Gasa 3 (in Russian).
VIECELLI, J. (1971), A computing method for incompressible flows bounded by moving walls, J. Comput.
Phys. 8, 119-143.
VOROBYEV, Yu.V. (1968), The stochastic iterative process in the alternating direction method, Zh. Vychisl.
Mat. i Mat. Fiz. 8 (3) (in Russian).
WACHSPRESS, E. (1962), Optimum alternating direction implicit iteration parameters for a model problem,
J. SIAM 10, 339-350.
WACHSPRESS, E. (1963), Extended application of alternating direction implicit iteration model theory, J.
SIAM 11, 991-1016.
WACHSPRESS, E. (1966), Iterative Solution of Elliptic Systems and Applications to the Neutron Diffusion
Equations of Reactor Physics (Prentice-Hall, Englewood Cliffs, NJ).
WACHSPRESS, E. and G. HABETLER (1960), An alternating direction implicit iteration technique, J. SIAM 8,
403-424.
WANG, G. (1984), The splitting schemes for solving the initial and boundary value problems of hyperbolic
partial differential equation, J. Nanjing Univ. Natur. Sci. Ed. 1, 21-38.
WARMING, R., P. KUTLER and H. LOMAX (1973), Second- and third-order non-centred difference schemes
for non-linear hyperbolic equations, AIAA J. 11, 189-196.
WASOW, W. and G. FORSYTHE (1963), Finite Difference Methods for Partial Differential Equations (in
Russian).
WIDLUND, O. (1966), On the rate of convergence of an alternating direction implicit method in a
non-commutative case, Math. Comp. 20.
WIDLUND, O. (1971), On the effects of scaling of the Peaceman-Rachford method, Math. Comp. 25.
WILKE, H. (1974), Zur Anwendung der SMAS-Methode bei der Behandlung hydrodynamischer
insbesondere grenzflachendynamischer Probleme, Chem. Tech. 26, 456-457.
YANENKO, N.N. (1959a), On the difference method for multidimensional heat conduction equation, Dokl.
Acad. Sci. USSR 125 (6) (in Russian).
YANENKO, N.N. (1959b), Simple implicit schemes for multidimensional problems, Dokl. Vsesoyuzn.
Soveshch. Vychisl. Mat: Vychisl. Tekhn., Moscow (in Russian).
YANENKO, N.N. (1960), On the economical implicit schemes (the method of fractional steps), Dokl. Acad.
Sci. USSR 134 (5) (in Russian).
YANENKO, N.N. (1961), On the implicit difference schemes for a multidimensional heat conduction
equation, Izv. Vuzov Mat. 4 (23) (in Russian).
YANENKO, N.N. (1962), On the convergence of the splitting method for a heat conduction equation with
varying coefficients, Zh. Vychisl. Mat. i Mat. Fiz. 2 (5) (in Russian).
YANENKO, N.N. (1964a), Some Aspects of the Theory of Convergence of Difference Schemes with Constant
and Varying Coefficients, Trudy IV Vsesoyuzn. Mat. S'ezda 2 (Nauka, Moscow) (in Russian).
YANENKO, N.N. (1964b), On the weak approximation to systems of differential equations, Sibirsk. Mat.
Zh. 5 (6) (in Russian).
YANENKO, N.N. (1966), The Method of Fractional Steps for Multidimensional Problems in Mathematical
Physics (Lectures for students of NGU) (Novosibirsk Gos. Univ., Novosibirsk) (in Russian).
YANENKO, N.N. (1967), The Method of Fractional Steps for Multidimensional Problems in Mathematical
Physics (Nauka, Novosibirsk) (in Russian).
YANENKO, N.N. (1972), Modern numerical methods for problems in mechanics of continuous media, in:
Proceedings International Mathematics Congress (Nauka, Moscow) (in Russian).
YANENKO, N.N. (1973), Difference Methods for Problems in Mathematical Physics, Trudy MIAN SSSR 22
(Nauka, Moscow) (in Russian).
YANENKO, N.N., N.N. ANUCHINA, V.Ye. PETRENKO and Yu.I. SHOKIN (1970), On the methods for gas
dynamics problems with large deformations, in: Numerical Methods in Mechanics of Continuous Media
(Nauka, Novosibirsk) (in Russian).
YANENKO, N.N., N.N. ANUCHINA, V.Ye. PETRENKO and Yu.I. SHOKIN (1971), On the methods for
computation of the fluid flows with large deformations and the approximation viscosity of difference
schemes, in: Proceedings 2nd International Colloquium on Gas Dynamics of Explosion and Reacting
Systems (in Russian).
YANENKO, N.N. and Yu.Ye. BOYARINTSEV (1961), On the convergence of the difference schemes for a heat
conduction equation with varying coefficients, Dokl. Acad. Sci. USSR 139 (6) (in Russian).
YANENKO, N.N. and G.V. DEMIDOV (1966), The method of weak approximation as the constructive
method for solving a Cauchy problem, in: Some Aspects of Numerical and Applied Mathematics (Nauka,
Novosibirsk) (in Russian).
YANENKO, N.N., V.A. SUCHKOV and Yu.Ya. POGODIN (1959), On the difference solution of the heat
conduction equation in curvilinear coordinates, Dokl. Acad. Sci. USSR 128 (5) (in Russian).
YANENKO, N.N. and I.K. YAUSHEV (1966), On the absolutely stable scheme for integrating hydrothermodynamic
equations, in: Difference Methods for Problems in Mathematical Physics, Trudy MIAN 74
(Nauka, Moscow) (in Russian).
YOUNG, D. (1971), Iterative Solution of Large Linear Systems (Academic Press, New York).
YOUNG, J.A. and C.W. HIRT (1972), Numerical calculation of internal wave motions, J. Fluid Mech. 56,
265-276.
ZALESNY, V.B. (1984), Modelling of Large-Scale Motions in World Ocean (OVM Acad. Sci. USSR,
Moscow) (in Russian).
ZALESNY, V.B. (1985), Numerical model of ocean dynamics based on the splitting method, Sov. J. Numer.
Anal. Math. Modelling 1 (2), 1-22.
ZENG, Q. and X. ZHANG(1982), Perfectly energy-conservative time-space finite difference schemes and the
consistent split method to solve the dynamical equations of compressible fluid, Sci. Sinica Ser. B 25,
866-880.
Subject Index

see also the contents for further details

Absolute stability, 218, 221
Adjoint operator T*, 304
Advection-diffusion, 283
Alternating direction method, 277, 298
Alternating direction sweep method, 277
Alternating operator method, 289
Alternating triangular method, 289, 292
Approximation, 206
Approximative relationship, 263
Auxiliary system, 396

Cauchy problem, 212, 327
Computational stability, 217
Conditional stability, 221
Convergence theorem, 222
Convergent iterative method, 297
Correct Cauchy system, 330
Crank-Nicolson scheme, 212, 224
Cyclic, 297
Cyclic sweep method, 438

Decomposition method, 356

Energy identity, 322
Energy inequalities method, 322
Error vector, 297
Evolutionary equation, 211
Explicit alternating direction scheme, 291

Factorization or factorized scheme, 254
Finite-dimensional approximation, 209
Fourier method, 304

Green formula, 211
Grid, 207
Grid function, 207
Grid point, 207

Heat conduction equation, 239, 240, 252

Implicit scheme of alternating direction, 278
Incomplete factorization scheme, 258

jth step operator, 297

Kronecker delta, 329

Lipschitz constant, 350

Main system, 396
Method of large particles, 286
Method of markers and cells, 285
Method of particles in a cell, 285

Neumann's stability condition, 321
Nonstationary, 297, 310
Norm ||·||, 239
Norm of operator T, 214

One-dimensional scheme, 242
Order, 207

Projection, 207

Residual vector, 297, 352

Scheme of complete splitting, 392
Scheme of incomplete splitting, 390
Scheme of increased order of accuracy, 371
Set of restrictions, 357
Source operator S, 212
Space P, 210
Spatial variable, 301
Spectral criterion of stability, 217
Spectral method, 304
Splitting method, 298
Splitting operator, 367
Stability, 216, 219
Stabilizing correction scheme, 274, 279, 298
Stationary iterative method, 297
Stationary problem, 307
Step of the grid, 207
Step operator T, 212
Symmetric alternating direction method, 343

Two-level scheme, 212, 332

Uniformly correct Cauchy system, 330

Weak approximation method, 328
Least Squares Methods

Åke Björck
Department of Mathematics
Linköping University
S-581 83 Linköping
Sweden

HANDBOOK OF NUMERICAL ANALYSIS, VOL. I
Finite Difference Methods (Part 1) - Solution of Equations in R^n (Part 1)
Edited by P.G. Ciarlet and J.L. Lions
© 1990, Elsevier Science Publishers B.V. (North-Holland)
Contents

CHAPTER I. Mathematical and Statistical Properties of Least Squares Solutions 469


1. Introduction 469
2. Characterization of least squares solutions 470
3. The singular value decomposition 472
4. Pseudoinverses and the least squares problem 476
5. The sensitivity of least squares solutions 478

CHAPTER II. Numerical Methods for Linear Least Squares Problems 485

6. The method of normal equations 485


7. The QR decomposition and least squares problems 491
8. QR decomposition by orthogonal transformations 498
9. Gram-Schmidt orthogonalization 504
10. Rank-deficient problems and the SVD 510
11. Rank-deficient problems and the QR decomposition 514
12. Iterative refinement of linear least squares solutions 520
13. Computing the variance-covariance matrix 524
14. Weighted and generalized linear least squares problems 526

CHAPTER III. Sparse Least Squares Problems 531

15. Storage schemes for sparse matrices 531


16. The method of normal equations for sparse problems 534
17. Orthogonalization methods for sparse problems 541
18. Sequential orthogonalization methods for banded problems 549
19. Block structured sparse least squares problems 553
20. Iterative methods 557

CHAPTER IV. Some Modified and Generalized Least Squares Problems 569

21. Updating QR decompositions and least squares solutions 569


21.1. Rank-one change 570
21.2. Appending or deleting a column 571
21.3. Appending or deleting a row 574
22. The CS decomposition and the generalized singular value decomposition 577
23. The general Gauss-Markoff linear model 581
24. The total least squares problem 584

CHAPTER V. Constrained Least Squares Problems 589


25. Linear equality constraints 589
25.1. Method of direct elimination 590
25.2. The nullspace method 591


25.3. Analysis of Problem LSE by generalized singular value decomposition 593


25.4. The method of weighting 594
26. Quadratic constraints and regularization 596
27. Linear inequality constraints 606

CHAPTER VI. Nonlinear Least Squares 617


28. The nonlinear least squares problem 617
29. Gauss-Newton-type methods 620
30. Trust region methods 624
31. Newton-type methods 626
32. Methods for separable problems 629
33. Constrained problems 631
34. Orthogonal distance regression 632

REFERENCES 637

SUBJECT INDEX 649


CHAPTER I

Mathematical and Statistical Properties of Least Squares Solutions

1. Introduction

The linear least squares problem is a computational problem of primary importance


in many applications. Assume for example that one wants to fit a linear
mathematical model to given data. In order to reduce the influence of errors in the
data one can then use a greater number of measurements than the number of
unknowns. The resulting problem is to "solve" an overdetermined linear system, i.e.,
to find a vector x ∈ R^n such that Ax is the "best" approximation to the known vector
b ∈ R^m, where A ∈ R^{m×n} and m > n.
There are many possible ways of defining the "best" solution. A choice which can
often be motivated for statistical reasons (see below) and also leads to a simple
computational problem is to let x be a solution to the minimization problem

min_x ||Ax − b||_2,  A ∈ R^{m×n}, b ∈ R^m,  (1.1)

where ||·||_2 denotes the Euclidean vector norm. We call this a linear least squares
problem and x a linear least squares solution of the system Ax = b.
One important source of least squares problems is linear statistical models. Here
one assumes that the m observations b = (b_1, b_2, ..., b_m)^T are related to n unknown
parameters x = (x_1, x_2, ..., x_n)^T by

Ax = b + ε,

where ε = (ε_1, ε_2, ..., ε_m)^T and the ε_i, i = 1, ..., m, are random errors. Let us assume that
A has rank n and ε has zero mean and covariance matrix σ²I (i.e. the ε_i are uncorrelated
random variables with equal variance). Then GAUSS showed in Theoria Combinationis
[1823] that in case A has full rank the least squares estimate x has the smallest
variance in the class of estimation methods which fulfill the two conditions:
(1) no systematic errors (no bias) in the estimates,
(2) the estimates are linear functions of b.
Note that this property of least squares estimates does not depend on any assumed
distributional form of the error. For an account of the historical development of


statistical methods for estimating parameters in linear models see FAREBROTHER


[1985].
If more generally the covariance matrix is V(ε) = W, where W ∈ R^{m×m} is
a positive-definite symmetric matrix, then the appropriate minimization problem is

min_x (Ax − b)^T W^{-1} (Ax − b).  (1.2)

Special methods for such generalized least squares problems are treated in Section
14, and in Section 23 we consider models where W is only assumed to be
nonnegative-definite.
We remark that in some applications it might be more adequate to minimize
||Ax − b||_p for some p ≠ 2. Here we define the norm ||·||_p by

||x||_p = (Σ_{i=1}^n |x_i|^p)^{1/p},  1 ≤ p < ∞,  (1.3)

and extend this definition to the limiting case p = ∞ by defining

||x||_∞ = max_{1≤i≤n} |x_i|.  (1.4)

Minimization in the 1-norm or ∞-norm is complicated by the fact that the function
f(x) = ||Ax − b||_p is not differentiable for these values of p. However, there are several
good computational methods available for these minimizations, see BARTELS, CONN
and SINCLAIR [1978] and BARTELS, CONN and CHARALAMBOUS [1978].
To illustrate the effect of using a different norm we consider the problem of
estimating the scalar β from m observations y ∈ R^m. This is equivalent to minimizing
||βA − y||_p with A = (1, 1, ..., 1)^T. It is easily verified that if y_1 ≥ y_2 ≥ ... ≥ y_m, then
the solutions for some different values of p are

β_1 = y_{(m+1)/2},  m odd,
β_2 = (y_1 + y_2 + ... + y_m)/m,
β_∞ = ½(y_1 + y_m).

These estimates correspond to the median, mean, and midrange, respectively. Note
that the estimate β_1 is insensitive to extreme values of y_i. This property carries over
to more general problems, and the choice p = 1 gives a more robust estimate than
p = 2. In some situations nonintegral values of p, 1 < p < 2, are of interest, see EKBLOM
[1973]. For a general treatment of robust statistical procedures see HUBER [1977].
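As a quick numerical sanity check, the three estimates above can be reproduced by brute force. The sketch below (not from the text; the data vector y is an assumed example) minimizes ||βA − y||_p by a crude grid search for p = 1, 2 and ∞:

```python
# Sketch (not from the text): minimize ||beta*A - y||_p for A = (1,...,1)^T
# by a crude grid search, to check that p = 1, 2, infinity give the median,
# mean and midrange of the observations y (an assumed example data set).
def lp_objective(beta, y, p):
    if p == float("inf"):
        return max(abs(beta - yi) for yi in y)
    return sum(abs(beta - yi) ** p for yi in y)

def grid_minimizer(y, p, steps=20000):
    # the minimizer always lies in [min(y), max(y)]
    lo, hi = min(y), max(y)
    betas = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(betas, key=lambda b: lp_objective(b, y, p))

y = [1.0, 2.0, 3.0, 4.0, 100.0]          # one extreme observation
b1 = grid_minimizer(y, 1)                # median: insensitive to the outlier
b2 = grid_minimizer(y, 2)                # mean = 22.0
binf = grid_minimizer(y, float("inf"))   # midrange = 50.5
```

Note how the outlier 100.0 drags the p = 2 and p = ∞ estimates away from the bulk of the data, while the p = 1 estimate stays at the median.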

2. Characterization of least squares solutions

The set of least squares solutions to a system Ax=b is defined by


X = {x ∈ R^n : ||Ax − b||_2 = min}  (2.1)

and is characterized by the following theorem.

THEOREM 2.1. We have

x ∈ X ⟺ A^T(b − Ax) = 0.

PROOF. Assume that x satisfies A^T r_x = 0, where r_x = b − Ax. Then for any y ∈ R^n,
r_y = b − Ay = r_x + A(x − y). Squaring this we obtain

||r_y||²_2 = ||r_x||²_2 + 2(x − y)^T A^T r_x + ||A(x − y)||²_2 ≥ ||r_x||²_2,

since the middle term vanishes. Conversely, assume that A^T r_x = z ≠ 0. Then, if x − y = −εz,

||r_y||²_2 = ||r_x||²_2 − 2ε||z||²_2 + ε²||Az||²_2 < ||r_x||²_2

for sufficiently small ε > 0. □

We refer to r = b − Ax as the residual of x. Theorem 2.1 asserts that the residual of
a least squares solution is orthogonal to the subspace

R(A) = {Ax : x ∈ R^n}.

Thus, the right-hand side b is decomposed into two orthogonal components

b = Ax + r,  r ⊥ Ax.

This geometric interpretation is illustrated for n = 2 in Fig. 2.1. Note that this
decomposition is always unique, even when the least squares solution x is not
unique.

FIG. 2.1.

From Theorem 2.1 it follows that a least squares solution satisfies the normal
equations

A^T A x = A^T b.  (2.2)

The matrix A^T A is symmetric and nonnegative-definite, and the normal equations
are always consistent. Furthermore we have:

THEOREM 2.2. The matrix A^T A is positive-definite if and only if the columns of A are
linearly independent.

PROOF. If the columns of A are linearly independent, then x ≠ 0 ⟹ Ax ≠ 0 and
therefore

x ≠ 0 ⟹ x^T A^T A x = ||Ax||²_2 > 0.

Hence A^T A is positive-definite.
On the other hand, if the columns are linearly dependent, then for some x_0 ≠ 0 we
have Ax_0 = 0, and so x_0^T A^T A x_0 = 0; therefore A^T A is not positive-definite. □

REMARK 2.1. The normal equations (2.2) and the defining equations for the residual
r = b − Ax combine to form an augmented system of m + n equations

( I    A ) ( r )   ( b )
( A^T  0 ) ( x ) = ( 0 ).  (2.3)

The system matrix in (2.3) is square and symmetric, but indefinite if A ≠ 0. The
augmented system (2.3) is used for iterative refinement of least squares solutions (see
Section 12) and in some methods for solving least squares problems where the
matrix A is sparse.
From Theorem 2.2 it follows that if rank(A) = n, then there is a unique least
squares solution, which can be written

x = (A^T A)^{-1} A^T b.  (2.4)

The corresponding residual is

r = b − Ax = (I − P_A)b,  P_A = A(A^T A)^{-1} A^T.  (2.5)

Here P_A is the orthogonal projector onto R(A), cf. Fig. 2.1.
If rank(A) < n, then the solution x to (2.1) is not unique. However, the solution to
the problem

min ||x||_2,  x ∈ X,

where X is the set (2.1), is unique, cf. Theorem 4.1 below.
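The characterization in Theorem 2.1 is easy to verify numerically. The following sketch (assumed example data; a straight-line fit with two unknowns) forms the normal equations (2.2) explicitly, solves the resulting 2×2 system by Cramer's rule, and checks that the residual satisfies A^T r = 0:

```python
# Sketch (assumed example data): fit c0 + c1*t by least squares. We form the
# normal equations A^T A x = A^T b explicitly for the two-unknown case, solve
# the 2x2 system by Cramer's rule, and verify Theorem 2.1: A^T (b - Ax) = 0.
t = [0.0, 1.0, 2.0, 3.0]
b = [1.0, 2.9, 5.1, 7.0]
A = [[1.0, ti] for ti in t]                       # 4 x 2 design matrix

M = [[sum(row[j] * row[k] for row in A) for k in range(2)] for j in range(2)]
d = [sum(row[j] * bi for row, bi in zip(A, b)) for j in range(2)]

det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
x = [(d[0] * M[1][1] - M[0][1] * d[1]) / det,
     (M[0][0] * d[1] - M[1][0] * d[0]) / det]

r = [bi - (row[0] * x[0] + row[1] * x[1]) for row, bi in zip(A, b)]
orth = [sum(row[j] * ri for row, ri in zip(A, r)) for j in range(2)]  # ~ (0, 0)
```

Both components of A^T r come out at roundoff level, confirming that the residual is orthogonal to the column space of A.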

3. The singular value decomposition

The singular value decomposition (SVD) of a matrix A ∈ R^{m×n} is of great theoretical


and practical importance for least squares problems. It provides a diagonal form of
A under an orthogonal equivalence transformation. The history of this matrix
decomposition goes back more than a century, but only recently has it been as much
used as it should be.

THEOREM 3.1 (Singular value decomposition). Let A ∈ R^{m×n} be a matrix of rank r.
Then there exist orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} such that

A = U Σ V^T,  Σ = ( Σ_r  0 )
                  ( 0    0 ),  (3.1)

where Σ ∈ R^{m×n}, Σ_r = diag(σ_1, σ_2, ..., σ_r) and

σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0.

The σ_i are called the singular values of A, and if we write

U = [u_1, ..., u_m],  V = [v_1, ..., v_n],  (3.2)

the u_i and v_i are, respectively, the left and right singular vectors corresponding to σ_i,
i = 1, ..., r.

PROOF (after GOLUB and VAN LOAN ([1983], p. 17)). Let x ∈ R^n and y ∈ R^m be such that

||x||_2 = ||y||_2 = 1,  Ax = σy,  σ = ||A||_2.

Here ||A||_2 is the matrix norm subordinate to the vector norm ||·||_2, see GOLUB and
VAN LOAN ([1983], pp. 14-15), and the existence of such vectors x and y follows from
this definition of ||A||_2. Let

V = [x, V_1] ∈ R^{n×n},  U = [y, U_1] ∈ R^{m×m}

be orthogonal. (Recall that it is always possible to extend an orthonormal set of
vectors to an orthonormal basis for the whole space.) Since U_1^T Ax = σ U_1^T y = 0 it
follows that U^T A V has the following structure:

A_1 ≡ U^T A V = ( σ  w^T )
                ( 0  B   ),

where B = U_1^T A V_1 ∈ R^{(m−1)×(n−1)}. Since

||A_1 (σ, w)^T||_2 ≥ σ² + w^T w = (σ² + w^T w)^{1/2} ||(σ, w)^T||_2,

it follows that ||A_1||_2 ≥ (σ² + w^T w)^{1/2}. But since U and V are orthogonal, ||A_1||_2 =
||A||_2 = σ, and thus w = 0. An induction argument now completes the proof. □
REMARK 3.1. From (3.1) it follows that

A^T A = V Σ^T Σ V^T and A A^T = U Σ Σ^T U^T.

Thus σ_j², j = 1, ..., r, are the nonzero eigenvalues of the symmetric and positive-
semidefinite matrices A^T A and A A^T, and v_j and u_j are the corresponding
eigenvectors. Hence, in principle, the SVD can be reduced to the eigenvalue
decomposition for symmetric matrices. This is not a suitable way to compute the
SVD. For a proof of the SVD using this relationship see STEWART ([1973], p. 319).

EXAMPLE 3.1. Consider the case n = 2 and take

A = (a_1, a_2),  a_1^T a_2 = cos γ,

and ||a_1||_2 = ||a_2||_2 = 1. Here γ is the angle between the vectors a_1 and a_2. The matrix

A^T A = ( 1      cos γ )
        ( cos γ  1     )

has eigenvalues λ_1 = 2cos²(½γ), λ_2 = 2sin²(½γ), and so

σ_1 = √2 cos(½γ),  σ_2 = √2 sin(½γ).

The eigenvectors of A^T A,

v_1 = (1/√2)(1, 1)^T,  v_2 = (1/√2)(1, −1)^T,

are the right singular vectors of A. The left singular vectors can be determined from
(3.3). If γ << 1, then σ_1 ≈ √2 and σ_2 ≈ γ/√2, and we get

u_1 ≈ ½(a_1 + a_2),  u_2 ≈ (a_1 − a_2)/γ.

Now assume that γ is less than the square root of the floating point precision. Then
the computed values of the elements cos γ in A^T A will equal 1. Thus, the computed
matrix A^T A will be singular with eigenvalues 2 and 0, and it is not possible to retrieve
the small singular value σ_2 ≈ γ/√2. This illustrates that information may be lost in
computing A^T A unless sufficient precision is used.
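The loss of information described above can be observed directly in double precision, where the unit roundoff is about 1.1·10^{-16}. The following sketch (an assumed illustration, not from the text) uses γ = 10^{-9}, which is below the square root of the roundoff:

```python
import math

# Sketch (assumed illustration): for gamma below the square root of the unit
# roundoff, cos(gamma) rounds to exactly 1.0 in double precision, so the
# computed cross-product matrix A^T A = [[1, c], [c, 1]] is exactly singular
# (eigenvalues 1 + c = 2 and 1 - c = 0) and the small singular value
# sigma_2 ~ gamma/sqrt(2) cannot be recovered from it.
gamma = 1.0e-9
c = math.cos(gamma)                    # computed element of A^T A; rounds to 1.0
lost_eigenvalue = 1.0 - c              # 0.0 instead of the true ~ gamma**2 / 2
true_sigma2 = gamma / math.sqrt(2.0)   # the small singular value that is lost
```

An SVD computed directly from A (rather than from A^T A) does not suffer this loss, which is one motivation for the orthogonalization methods of later sections.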

REMARK 3.2. The singular values of A are unique. The singular vector v_j, j ≤ r,
will be unique only when σ_j² is a simple eigenvalue of A^T A. For multiple singular
values, the corresponding singular vectors can be chosen as any orthonormal basis
for the unique subspace that they span. Once the singular vectors v_j, 1 ≤ j ≤ r, have
been chosen, the vectors u_j, 1 ≤ j ≤ r, are uniquely determined from

A v_j = σ_j u_j,  j = 1, ..., r.  (3.3)

Similarly, given u_j, 1 ≤ j ≤ r, the vectors v_j, 1 ≤ j ≤ r, are uniquely determined from

A^T u_j = σ_j v_j,  j = 1, ..., r.  (3.4)

REMARK 3.3. From the SVD theorem it follows that

A = Σ_{i=1}^r σ_i u_i v_i^T = U_r Σ_r V_r^T,  (3.5)

where

U_r = [u_1, ..., u_r],  V_r = [v_1, ..., v_r].

This is sometimes called the full rank singular value decomposition. By (3.5) the
matrix A of rank r is decomposed into a sum of r matrices of rank one.

The singular value decomposition plays an important role in a number of matrix


approximation problems. In the theorem below we consider the approximation of
one matrix by another of lower rank. Several other results can be found in GOLUB
[1968] and in GOLUB and VAN LOAN ([1983], Ch. 12.4).

THEOREM 3.2. Let M_k^{m×n} be the set of matrices in R^{m×n} of rank k. Assume that
A ∈ M_r^{m×n} and let B ∈ M_k^{m×n}, k < r, be such that

||A − B||_2 ≤ ||A − X||_2 for all X ∈ M_k^{m×n}.

Then if

A = U Σ V^T = Σ_{i=1}^r σ_i u_i v_i^T

is the SVD of A, we have

B = Σ_{i=1}^k σ_i u_i v_i^T,  ||A − B||_2 = σ_{k+1}.

PROOF. See MIRSKY [1960]. □

REMARK 3.4. The theorem was originally proved for the Euclidean norm

||X||_E = (Σ_{i,j} x_ij²)^{1/2}

by ECKART and YOUNG [1936]. For this norm the minimum distance is

||A − B||_E = (σ²_{k+1} + ... + σ²_r)^{1/2}

and the solution is unique. A generalization of the Eckart-Young theorem is given
in GOLUB and STEWART [1986].

Like the eigenvalues of a real symmetric matrix, the singular values of a general
matrix have a minmax characterization.

THEOREM 3.3. Let A ∈ R^{m×n} have singular values

σ_1 ≥ σ_2 ≥ ... ≥ σ_p ≥ 0,  p = min(m, n),

and let S be a linear subspace of R^n of dimension dim(S). Then

σ_i = min_{dim(S)=n−i+1} max_{x∈S, x≠0} ||Ax||_2 / ||x||_2.  (3.6)

PROOF. The result is established in almost the same way as for the corresponding
eigenvalue theorem, the Courant-Fischer theorem, see WILKINSON ([1965], pp.
99-101). □

The minmax characterization of the singular values may be used to establish


a relation between the singular values of two matrices A and B.

THEOREM 3.4. Let A, B ∈ R^{m×n} have singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_p and
τ_1 ≥ τ_2 ≥ ... ≥ τ_p, respectively, where p = min(m, n). Then

|σ_i − τ_i| ≤ ||A − B||_2 and Σ_{i=1}^p (σ_i − τ_i)² ≤ ||A − B||²_E.

PROOF. See STEWART ([1973], pp. 321-322). □

REMARK 3.5. This result shows that the singular values of a matrix A are insensitive
to perturbations of A, which is of great importance for the use of the SVD to
determine the "numerical rank" of a matrix (see Section 9). Perturbations of the
elements of a matrix produce perturbations of the same, or smaller, magnitude in the
singular values.

It can be proved that the eigenvalues of the leading principal minor of order n − 1
of a symmetric matrix A ∈ R^{n×n} separate the eigenvalues of A, see WILKINSON ([1965],
p. 103). A similar theorem holds for singular values.

THEOREM 3.5. Let A = (Ã, a_n) ∈ R^{m×n}, m ≥ n, a_n ∈ R^m. Then the ordered singular
values σ̃_i of Ã separate the ordered singular values σ_i of A as follows:

σ_1 ≥ σ̃_1 ≥ σ_2 ≥ σ̃_2 ≥ ... ≥ σ̃_{n−1} ≥ σ_n.

PROOF. The theorem is a consequence of the minmax characterization of the singular
values in Theorem 3.3. See also LAWSON and HANSON ([1974], p. 26). □

The SVD gives complete information about the four fundamental subspaces
associated with A. It is easy to verify that

N(A) = span[v_{r+1}, ..., v_n],  R(A) = span[u_1, ..., u_r].  (3.7)

Since A^T = V Σ^T U^T, it follows that also

R(A^T) = span[v_1, ..., v_r],  N(A^T) = span[u_{r+1}, ..., u_m],  (3.8)

and we find the well-known relations

N(A)^⊥ = R(A^T),  R(A)^⊥ = N(A^T).

4. Pseudoinverses and the least squares problem


The SVD is a powerful tool to solve the linear least squares problem. The reason for
this is that the orthogonal matrices that transform A to diagonal form (3.1) do not
change the l_2-norm of vectors. We have the following fundamental result.

THEOREM 4.1. Consider the linear least squares problem

min_x ||b − Ax||_2,  (4.1)

where A ∈ R^{m×n} and rank(A) = r ≤ min(m, n). Then

x = V ( Σ_r^{-1}  0 ) U^T b  (4.2)
      ( 0         0 )

is the unique solution of (4.1) for which ||x||_2 is minimum.


SECTION 4 Properties of least squares solutions 477

PROOF. Let

(z_1, z_2)^T = V^T x,  (c_1, c_2)^T = U^T b,

where z_1, c_1 ∈ R^r. Then

||b − Ax||_2 = ||U^T(b − A V V^T x)||_2 = ||(c_1 − Σ_r z_1, c_2)^T||_2.

Thus, the residual norm will be minimized for

z_1 = Σ_r^{-1} c_1,  z_2 arbitrary.

Obviously the choice z_2 = 0 minimizes ||z||_2 and therefore also ||x||_2 = ||Vz||_2. □

DEFINITION 4.1. We can write (4.2) as x = A^† b, where

A^† = V ( Σ_r^{-1}  0 ) U^T ∈ R^{n×m}  (4.3)
        ( 0         0 )

is called the pseudoinverse of A.

REMARK 4.1. It is easy to show that G = A^† satisfies the following four conditions:

AGA = A,  (4.4a)
GAG = G,  (4.4b)
(AG)^T = AG,  (4.4c)
(GA)^T = GA.  (4.4d)

PENROSE [1955] has shown that A^† is uniquely determined by these conditions. In
particular, A^† in (4.3) does not depend on the particular choice of U and V in the SVD.
Note that (A^T)^† = (A^†)^T but that in general (AB)^† ≠ B^† A^†.
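The four Penrose conditions are easy to check for a concrete small matrix. The sketch below (an assumed example) takes A = (1, 1)^T, whose pseudoinverse by (4.5) is (A^T A)^{-1} A^T = (½, ½), and verifies (4.4a)-(4.4d):

```python
# Sketch (assumed example): verify the four Penrose conditions (4.4) for the
# 2x1 matrix A = (1, 1)^T with pseudoinverse A_dag = (1/2, 1/2).
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

A = [[1.0], [1.0]]
A_dag = [[0.5, 0.5]]

AG, GA = matmul(A, A_dag), matmul(A_dag, A)
cond_a = matmul(AG, A)           # AGA = A       (4.4a)
cond_b = matmul(GA, A_dag)       # GAG = G       (4.4b)
cond_c = AG == transpose(AG)     # (AG)^T = AG   (4.4c)
cond_d = GA == transpose(GA)     # (GA)^T = GA   (4.4d)
```

Here AG is the 2×2 orthogonal projector onto the line spanned by (1, 1)^T, in agreement with the projection formulas given below in (4.8).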

REMARK 4.2. A geometric characterization of the pseudoinverse solution x = A^† b is
that it satisfies

x ⊥ N(A),  Ax = P_{R(A)} b,

where P_{R(A)} is the orthogonal projection onto R(A).

REMARK 4.3. Two particular cases of interest are:
(i) Assume that A ∈ R^{m×n}, m ≥ n and r = n. Then

A^† = (A^T A)^{-1} A^T.  (4.5)

(ii) Assume that A ∈ R^{m×n}, m ≤ n and r = m. Then

A^† = A^T (A A^T)^{-1}.  (4.6)

In case (i) we recognize the full rank least squares solution. Case (ii) gives the
minimum norm solution to an underdetermined system of full rank, i.e. solves the
problem

min_{x∈S} ||x||_2,  S = {x : Ax = b}.

If S ⊂ R^n is a subspace, then P_S ∈ R^{n×n} is the orthogonal projection onto S if

R(P_S) = S,  P_S² = P_S,  P_S^T = P_S.  (4.7)

If x ∈ R^n, then x is decomposed into two orthogonal components by

x = x_1 + x_2 = P_S x + (I − P_S)x,

where

x_1 ∈ S,  x_2 ⊥ x_1.

An important property of the pseudoinverse is that it gives formulas for the
orthogonal projections onto the four fundamental subspaces of A:

P_{R(A)} = A A^†,   P_{N(A^T)} = I − A A^†,
P_{R(A^T)} = A^† A,  P_{N(A)} = I − A^† A.  (4.8)

These expressions are easily verified using the Penrose conditions (4.4). If
S = span(u_1, ..., u_k), where the u_i, 1 ≤ i ≤ k, are orthonormal, then it follows immediately
that

P_S = U U^T,  U = [u_1, ..., u_k],  (4.9)

satisfies (4.7). Using (3.7) and (3.8) we can therefore express the projections (4.8) in
terms of the singular vectors of A as

P_{R(A)} = U_1 U_1^T,   P_{N(A^T)} = U_2 U_2^T,
P_{R(A^T)} = V_1 V_1^T,  P_{N(A)} = V_2 V_2^T,  (4.10)

where U and V are partitioned as

U = (U_1, U_2) ∈ R^{m×m},  V = (V_1, V_2) ∈ R^{n×n},  U_1 ∈ R^{m×r},  V_1 ∈ R^{n×r}.

5. The sensitivity of least squares solutions

In this section we give results on the sensitivity of pseudoinverses and least squares
solutions to perturbations in A and b. The most complete results for these problems
have been given by WEDIN [1973b] and here we mainly follow his exposition. For
a survey of perturbation theory for pseudoinverses see also STEWART [1977a]. In this
analysis the condition number of the matrix A ∈ R^{m×n} will play a significant role. The
following definition generalizes the condition number of a square nonsingular
matrix.

DEFINITION 5.1. Let A ∈ R^{m×n} have rank r and singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0.

The condition number of A is

κ(A) = ||A||_2 ||A^†||_2 = σ_1/σ_r.  (5.1)

The last equality in (5.1) follows from the relations

||A||_2 = σ_1,  ||A^†||_2 = σ_r^{-1}.

We now give some perturbation bounds for the pseudoinverse, which can be found
in WEDIN [1973b]. The following example shows that if rank(B) ≠ rank(A), then B^†
can be very different from A^† even when B − A is small. If we let

A = ( 1  0 ),  B = ( 1  0 ),
    ( 0  0 )       ( 0  ε )

then 1 = rank(A) ≠ rank(B) = 2 and

A^† = ( 1  0 ),  B^† = ( 1  0     ).
      ( 0  0 )         ( 0  ε^{-1} )

In this example

||B − A||_2 = ε,  ||B^† − A^†||_2 = 1/ε,

which is a special case of the following theorem.
which is a special case of the following theorem.

THEOREM 5.1. If rank(B) ≠ rank(A), then

||B^† − A^†||_2 ≥ 1/||B − A||_2.

PROOF. See WEDIN [1973b]. □

On the other hand, if rank(B) = rank(A) and B − A is small, then B^† is close to A^†.
We first give an estimate of ||B^†||_2.

THEOREM 5.2. If rank(B) = rank(A) = r and

η = ||A^†||_2 ||B − A||_2 < 1,

then

||B^†||_2 ≤ ||A^†||_2 / (1 − η).  (5.2)

PROOF. From the assumption and Theorem 3.4 it follows that

1/||B^†||_2 = σ_r(B) ≥ σ_r(A) − ||B − A||_2 = 1/||A^†||_2 − ||B − A||_2 > 0,

which is equivalent to (5.2). □

If A and B are square nonsingular matrices, then from the identity B^{-1} − A^{-1} =
A^{-1}(A − B)B^{-1} it follows that

||B^{-1} − A^{-1}||_2 ≤ ||A^{-1}||_2 ||B^{-1}||_2 ||B − A||_2.

The next theorem, due to WEDIN [1973b], generalizes this result.

THEOREM 5.3. If rank(B) = rank(A), then

||B^† − A^†||_2 ≤ μ ||B^†||_2 ||A^†||_2 ||B − A||_2,  (5.3)

where

μ = ½(1 + √5) if rank(A) < min(m, n),
μ = √2 if rank(A) = min(m, n).

PROOF. See WEDIN [1973b] for a slightly more general result. □

We now consider the effect of perturbations of A and b upon the minimum norm
least squares solution x = A^† b. The discussion again follows WEDIN [1973b] and
applies both to overdetermined and underdetermined systems. We denote the
perturbed A and b by

Ā = A + δA,  b̄ = b + δb  (5.4)

and the perturbed solution x̄ = Ā^† b̄, and put

ε_A = ||δA||_2/||A||_2,  ε_b = ||δb||_2/||b||_2.  (5.5)

To be able to prove any meaningful result we assume that rank(Ā) = rank(A) and
that the condition

η = ||A^†||_2 ||δA||_2 = κ ε_A < 1  (5.6)

holds, where κ = κ(A) is the condition number of A.
We now decompose the error δx as follows:

δx = x̄ − x = Ā^†(Ax + r + δb) − x

or

δx = Ā^†(−δA x + δb) + Ā^† r − P_{N(Ā)} x,  (5.7)

where we have used that

P_{N(Ā)} = I − Ā^† Ā.

We now separately estimate the three terms in this decomposition of δx. Using
Theorem 5.2 and the assumption (5.6) it follows that

||Ā^†(−δA x + δb)||_2 ≤ (||A^†||_2/(1 − η)) (||δA||_2 ||x||_2 + ||δb||_2)

  = (κ/(1 − η)) (ε_A ||x||_2 + ||δb||_2/||A||_2).

Now since rIl(A) we have r e (AT) and thus we can write the second term using
(4.4) and (4.8) as
Air = AI P(x)PX(AT) r. (5.8)
Now by definition
IIP1(1)PX(AT) 112= sin Omax ((A), 3(A)),
where Omax is the largest principal angle between the two subspaces R(A) and (A).
Similarly for the third term using xe (AT) we can write

PX(A)X = Px~(P_4(AT)X (5.9)


and
IIPX(A)P(AT) 112= sin Omax (.Ar(A),X(A)).
We have the following estimate for the largest principal angle between the perturbed
fundamental subspaces.

THEOREM 5.4. Let Ā = A + δA and let X(·) denote any of the four fundamental
subspaces. Then if rank(Ā) = rank(A) and the assumption (5.6) is satisfied, then

sin θ_max(X(Ā), X(A)) ≤ η = κ ε_A.  (5.10)

PROOF. The proof follows from WEDIN [1973b, Lemma 4.1]. □

Using Theorem 5.4 to estimate (5.8) and (5.9) we arrive at the bounds given in the
following theorem.

THEOREM 5.5. Assume that rank(Ā) = rank(A) and that η = κ ε_A < 1. Then

||δx||_2 ≤ (κ/(1 − η)) (ε_A ||x||_2 + ||δb||_2/||A||_2 + ε_A κ ||r||_2/||A||_2) + ε_A κ ||x||_2  (5.11)

and

||δr||_2 ≤ ε_A ||A||_2 ||x||_2 + ||δb||_2 + ε_A κ ||r||_2.  (5.12)

PROOF. The estimate (5.11) follows from above, and (5.12) is proved using
a decomposition of δr similar to that of δx in (5.7). □

REMARK 5.1. The last term in (5.7), and therefore also in (5.11), vanishes if rank(A) = n,
since then N(Ā) = {0}. If the system is compatible, e.g. if rank(A) = m, then r = 0 and
the term involving κ² in (5.11) vanishes.

REMARK 5.2. If rank(A) = min(m, n), then the condition η < 1 suffices to guarantee that
rank(A + δA) = rank(A).

REMARK 5.3. For the case rank(A) = n there are perturbations δA and δb such that
the estimates in Theorem 5.5 can almost be attained for an arbitrary matrix A and
vector b. This can be shown by considering first-order approximations of the terms
(see WEDIN [1973b]).

It should be stressed that in order for the perturbation analysis above to make
sense, the matrix A and vector b should be scaled so that the class of perturbations
defined by (5.5) is relevant. Assume e.g. that the columns in A = (a_1, a_2, ..., a_n) have
widely differing norms. Then often a more relevant class of perturbations is

ã_j = a_j + δa_j,  ||δa_j||_2 ≤ ε d_j,  j = 1, 2, ..., n,

where d_j = ||a_j||_2. We could use (5.11) with the estimate ε_A ≤ ε√n, since

||δA||²_2 ≤ Σ_{j=1}^n ||δa_j||²_2 ≤ ε² Σ_{j=1}^n d_j².

However, a much better estimate is often obtained by scaling the problem so that the
error bound is the same for all columns,

Ax = (A D^{-1})(Dx) = Ã x̃,

where D = diag(d_1, ..., d_n). Often κ(Ã) << κ(A), and we can then use the bound

ε_Ã ≤ ε√n

in (5.11) and (5.12). Note however that scaling the columns changes the norm in
which the error in x is measured.
Similarly, if the rows in A differ widely in norm, then (5.11) and (5.12) may
considerably overestimate the perturbations, cf. Section 14.
GOLUB and WILKINSON [1966] derived first-order perturbation bounds for the
least squares solution and were the first to note that a term proportional to κ²
occurs. In VAN DER SLUIS [1975] a geometrical explanation for this term is given,
and also lower bounds for the worst perturbation are derived. The following
example shows that the term proportional to κ² in (5.11) may indeed occur.

EXAMPLE 5.1 (VAN DER SLUIS [1975]). Consider the case n = 2 and let A = (a_1, a_2)
be the matrix in Example 3.1, where the angle γ is small (see Fig. 5.1). We now
take perturbations δa_1 and δa_2 of size ||δa_1||_2 = ||δa_2||_2 = ε, so that the plane

FIG. 5.1.

S̄ = span(a_1 + δa_1, a_2 + δa_2) is obtained by a rotation of the plane S = span(a_1, a_2)
around the bisectrix u_1 = ½(a_1 + a_2), which according to Example 3.1 is an
approximate left singular vector. If δa_1 and δa_2 are orthogonal to S and of opposite
direction, then the angle of rotation will be θ ≈ 2ε/γ. Now let c = P_S b be the
orthogonal projection of b onto S, and assume that the approximate direction of c is
along u_1. Then c̄ = P_S̄ b is obtained by rotating the residual vector r through the angle
θ, and hence

||c̄ − c||_2 ≈ sin θ ||r||_2 ≈ 2ε ||r||_2/γ.

Further, the direction of c̄ − c will be approximately along u_2 ≈ (a_1 − a_2)/γ. Since
δa_1 + δa_2 = 0 we have δA x ≈ 0, and hence

A δx ≈ c̄ − c,  ||δx||_2 ≈ ||c̄ − c||_2/σ_2.

It follows that

||δx||_2 ≈ 2ε ||r||_2/γ² ≈ ½ε ||r||_2 κ²,

which is what we wished to show. This example illustrates that the occurrence of κ²
is due to two coinciding events: rotation of the projection plane around a dominant
left singular vector produces a large change in r, and this change has the direction of the
minimal left singular vector.
CHAPTER II

Numerical Methods for Linear Least Squares Problems

6. The method of normal equations

The classical method of solving the least squares problem


min_x ||Ax − b||_2,  A ∈ R^{m×n},  (6.1)

which dates back to Gauss, is to form and solve the normal equations (cf. Section 2)

A^T A x = A^T b.  (6.2)
In this section we discuss numerical methods based on this approach. We assume
here that rank(A)=n, and defer treatment of rank deficient problems to later
sections. Then from Theorem 2.2 we know that the matrix ATA is positive-definite
and that (6.2) has a unique solution.
The first step in the method of normal equations is to form the matrix and vector

M = A^T A ∈ R^{n×n},  d = A^T b ∈ R^n.  (6.3)

Note that since the cross-product matrix M is symmetric, it is only necessary to
compute and store its upper triangular part. The relevant elements in M and d are
given by

m_jk = a_j^T a_k = Σ_{i=1}^m a_ij a_ik,  1 ≤ j ≤ k ≤ n,
                                                  (6.4)
d_j = a_j^T b = Σ_{i=1}^m a_ij b_i,  1 ≤ j ≤ n,

where we have partitioned A by columns

A = (a_1, a_2, ..., a_n).
In (6.4) it is possible to accumulate the inner products in double precision. Then the
only rounding errors in the computation of M and d will be when the double
precision results are rounded to single precision. However, even this might lead to
a loss of accuracy, cf. Remark 6.2.

By sequencing the operations in (6.4) differently we get a row oriented algorithm.
Partitioning A by rows

A^T = (ā_1, ā_2, ..., ā_m)

we can write

M = Σ_{i=1}^m ā_i ā_i^T,  d = Σ_{i=1}^m b_i ā_i,  (6.5)

where M is expressed as the sum of m matrices of rank one and d as a linear
combination of the rows of A. Row-wise accumulation of M and d using (6.5) is
advantageous to use if the data A and b are stored row by row on secondary storage.
The system of normal equations can then be formed using only one pass through the
data and using no more storage than that needed for M and d.

REMARK 6.1. If m >> n, then the number of elements in the upper triangular part of M,
which is ½n(n + 1), is much smaller than mn, the number of elements in A. In this
case the formation of M and d can be viewed as a data compression.
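A one-pass, row-oriented accumulation of M and d in the spirit of (6.5) can be sketched as follows (the data rows are an assumed example; only the upper triangle of M is accumulated, as discussed above):

```python
# Sketch (assumed example rows): one-pass, row-oriented accumulation of the
# normal equations in the spirit of (6.5). Each data row (a_i, b_i)
# contributes a rank-one update a_i a_i^T to M and b_i a_i to d, so M and d
# can be formed in a single pass through data stored row by row.
rows = [([1.0, 0.0], 1.0),
        ([1.0, 1.0], 2.0),
        ([1.0, 2.0], 2.9)]
n = 2
M = [[0.0] * n for _ in range(n)]
d = [0.0] * n
for a, bi in rows:
    for j in range(n):
        d[j] += a[j] * bi
        for k in range(j, n):       # accumulate only the upper triangle
            M[j][k] += a[j] * a[k]
for j in range(n):                  # mirror the upper triangle for checking
    for k in range(j):
        M[j][k] = M[k][j]
```

The result agrees with forming A^T A and A^T b column by column as in (6.4); only the order of the floating-point additions differs.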

REMARK 6.2. It is important to note that information in the data matrix A may be
lost when A^T A is formed unless sufficient precision is used. As an example consider

A = ( 1  1  1 ),  A^T A = ( 1+ε²  1     1    ).
    ( ε  0  0 )           ( 1     1+ε²  1    )
    ( 0  ε  0 )           ( 1     1     1+ε² )
    ( 0  0  ε )

Assume that ε = 10^{-3} and that six decimal digits are used for the elements of A^T A.
Then, since 1 + ε² = 1 + 10^{-6} is rounded to 1, we lose all information contained in
the last three rows of A.

REMARK 6.3. The number of multiplicative operations needed to form the matrix
A^T A is ½n(n + 1)m. In the following, to quantify the operation counts in matrix
algorithms we will use the concept of a flop (cf. GOLUB and VAN LOAN [1983], p. 32).
This is roughly the amount of work associated with the statement

s := s + a_ik · b_kj,

i.e. it comprises a floating-point add, a floating-point multiply and some sub-
scripting. Thus we will say that forming A^T A and A^T b requires ½n(n + 1)m + mn
flops, or approximately ½n²m flops if n >> 1. Note that for vector and parallel
computers this definition of a flop may not be adequate.
We now consider the solution of the symmetric positive-definite system (6.2). This
will be based on the following matrix decomposition.

THEOREM 6.1 (Cholesky decomposition). Let the matrix Me R"xn be symmetric and
positive-definite. Then there is a unique upper triangular matrix R with positive
SECTION 6 Linear least squares problems 487

diagonal elements such that

M = R^T R.   (6.6)

R is called the Cholesky factor of M and (6.6) is called the Cholesky decomposition.

PROOF. The proof is by induction on the order n of M. The result is trivial for n = 1.
Assume that (6.6) holds for all positive-definite matrices of order n and consider the
positive-definite matrix

M̃ = ( M    m )
    ( m^T  μ )

of order n + 1. We seek a factorization of M̃,

M̃ = R̃^T R̃ = ( R^T  0 )( R  r )
            ( r^T  ρ )( 0  ρ ).   (6.7)

By the induction hypothesis the factorization M = R^T R exists and thus (6.7) holds
provided r and ρ > 0 satisfy

R^T r = m,   ρ² = μ − r^T r.   (6.8)

Since R^T is lower triangular and has positive diagonal elements, r = R^{−T} m is
uniquely determined. Further, provided that μ − r^T r > 0, also ρ = (μ − r^T r)^{1/2} is
uniquely determined. Now, from the positive-definiteness of M̃ it follows that

0 < ( −r^T R^{−1}  1 ) ( M    m ) ( −R^{−T} r )
                       ( m^T  μ ) (  1        )
  = r^T R^{−T} M R^{−1} r − 2 r^T R^{−T} m + μ
  = r^T r − 2 r^T r + μ = μ − r^T r.

Hence, the theorem is established.
The Cholesky decomposition has long been used to solve least squares problems.
In statistical applications it is known as the square root method.
We note that the proof of Theorem 6.1 contains an algorithm for computing the
Cholesky factor R. We summarize this algorithm below.
ALGORITHM 6.1 (Cholesky decomposition). Given a symmetric positive-definite
matrix M ∈ R^{n×n} the following algorithm computes the unique upper triangular
matrix R with positive diagonal elements such that M = R^T R:

for j = 1, 2, ..., n
    for i = 1, 2, ..., j−1
        r_ij := (m_ij − Σ_{k=1}^{i−1} r_ki r_kj)/r_ii
    r_jj := (m_jj − Σ_{k=1}^{j−1} r_kj²)^{1/2}
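Algorithm 6.1 translates directly into code (a sketch; indices shifted to 0-based, and the positive-definiteness test made explicit):

```python
import numpy as np

def cholesky_upper(M):
    """Column-wise Cholesky factorization M = R^T R (Algorithm 6.1),
    with R upper triangular and positive on the diagonal."""
    n = M.shape[0]
    R = np.zeros_like(M, dtype=float)
    for j in range(n):
        for i in range(j):
            # r_ij := (m_ij - sum_k r_ki r_kj) / r_ii
            R[i, j] = (M[i, j] - R[:i, i] @ R[:i, j]) / R[i, i]
        s = M[j, j] - R[:j, j] @ R[:j, j]
        if s <= 0.0:
            raise ValueError("matrix is not positive-definite")
        R[j, j] = np.sqrt(s)
    return R

M = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])
R = cholesky_upper(M)
```

In practice one would call a library routine instead; NumPy's `np.linalg.cholesky` computes the same factorization, returning the lower triangular factor R^T.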
488 A. Bjorck CHAPTER II

The algorithm requires about ⅙n³ flops. Note that R is computed column by column
and that the elements r_ij can overwrite the elements m_ij. At any stage we have the
Cholesky factor of a leading principal submatrix of M. It is also possible to sequence the
algorithm so that R is computed row by row:
for i = 1, 2, ..., n
    r_ii := (m_ii − Σ_{k=1}^{i−1} r_ki²)^{1/2}
    for j = i+1, ..., n
        r_ij := (m_ij − Σ_{k=1}^{i−1} r_ki r_kj)/r_ii

This version has the advantage of allowing pivoting so that at each stage we search
for the maximum diagonal element rii. The different sequencing in these two
versions of the Cholesky decomposition is illustrated in Fig. 6.1.

FIG. 6.1. The figure shows the computed part of the Cholesky factor R in the ith step of the column-wise
and row-wise algorithm.

For a further discussion of different ways of sequencing the operations in the
Cholesky factorization see GEORGE and LIU ([1981], pp. 15-20).

REMARK 6.4. We have here followed STEWART ([1973], pp. 83-93) and expressed
our algorithm in a programming-like language, which is precise enough to express
important algorithmic concepts, but permits suppression of unimportant details.
The notations should be self-explanatory (see also GOLUB and VAN LOAN [1983], pp.
30-32).

REMARK 6.5. The Cholesky algorithm can be modified to produce a decomposition
of the form M = R^T D R, where D is diagonal and R is unit upper triangular, i.e. has
ones on its diagonal. The modified algorithm does not require square roots, and
hence is slightly faster. The unmodified algorithm has the advantage of retaining
compatibility with the QR decomposition of A, see Section 7.

When the Cholesky factor R of the matrix A^T A has been computed then the least
squares solution can be obtained by solving the two triangular systems of equations

R^T z = d,   R x = z.   (6.9)

The solution of (6.9) requires a forward and a backward substitution, which takes
about 2·½n² = n² flops. The total work required to solve (6.1) by the method of
normal equations is

½mn² + ⅙n³ + O(mn) flops.
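Putting (6.2), the Cholesky factorization and (6.9) together, the whole method of normal equations is a few lines (a sketch; the generic `np.linalg.solve` calls stand in for dedicated forward and back substitutions, which a production code would use to exploit triangularity):

```python
import numpy as np

def lstsq_normal_equations(A, b):
    """Solve min ||Ax - b||_2 by the method of normal equations:
    form M = A^T A and d = A^T b, factor M = R^T R, then solve
    R^T z = d and R x = z as in (6.9)."""
    M = A.T @ A
    d = A.T @ b
    R = np.linalg.cholesky(M).T          # numpy returns the lower factor R^T
    z = np.linalg.solve(R.T, d)          # forward substitution: R^T z = d
    return np.linalg.solve(R, z)         # back substitution:    R x = z

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)
x = lstsq_normal_equations(A, b)
```

For this well-conditioned random problem the result agrees with a QR-based solver; Remark 6.2 and (6.17) explain when it would not.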
The Cholesky decomposition can be used in a slightly different way to solve the least
squares problem.

THEOREM 6.2. Consider the augmented matrix

Ā = (A, b) ∈ R^{m×(n+1)},   (6.10)

and the corresponding cross-product matrix

M̄ = Ā^T Ā = ( A^T A  d     )
            ( d^T    b^T b ).

Let the Cholesky factor of M̄ be

R̄ = ( R  z )
    ( 0  ρ ).

Then the least squares solution and the residual norm are obtained from

R x = z,   ||b − A x||_2 = ρ.   (6.11)

PROOF. By equating Ā^T Ā and R̄^T R̄ it follows that R is the Cholesky factor of A^T A
and that

R^T z = d,   b^T b = z^T z + ρ².

From (6.9) it follows that x satisfies the first equation in (6.11). Further, since
r = b − A x is orthogonal to A x,

||A x||_2² = (r + A x)^T A x = b^T A x = b^T A R^{−1} R^{−T} A^T b = z^T z,

and hence

||r||_2² = b^T b − z^T z = ρ².
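Theorem 6.2 gives both x and the residual norm from a single Cholesky factorization of the bordered matrix (a sketch):

```python
import numpy as np

def lstsq_augmented_cholesky(A, b):
    """Least squares solution and residual norm via the Cholesky
    factor of the cross-product of the augmented matrix (Theorem 6.2)."""
    Abar = np.column_stack([A, b])
    Mbar = Abar.T @ Abar
    Rbar = np.linalg.cholesky(Mbar).T    # upper triangular Cholesky factor
    n = A.shape[1]
    R, z, rho = Rbar[:n, :n], Rbar[:n, n], Rbar[n, n]
    x = np.linalg.solve(R, z)            # R x = z
    return x, rho                        # rho = ||b - A x||_2

rng = np.random.default_rng(1)
A = rng.standard_normal((15, 3))
b = rng.standard_normal(15)
x, rho = lstsq_augmented_cholesky(A, b)
```

The residual norm comes out as the (n+1, n+1) element of the Cholesky factor, with no extra work.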

We remark that working with the augmented matrix (6.10) often also simplifies
other methods for solving least squares problems.
We now turn to a discussion of the accuracy of the computed least squares
solution. Rounding errors will arise because the computer can only represent
a subset of the real numbers. The elements of this set are referred to as floating-point
numbers. To derive error estimates of matrix computations a model of floating-
point arithmetic is used, see e.g. FORSYTHE, MALCOLM and MOLER ([1977],
pp. 10-29). We write the stored result of any calculation C as fl(C). Error estimates
can be expressed in terms of the unit roundoff, u, which is defined as the smallest
floating-point number such that

fl(1 + u) > 1.   (6.12)
A thorough and elementary presentation of error analysis is given by WILKINSON
[1963]. For a short introduction see GOLUB and VAN LOAN ([1983], Section 3.2).
A detailed error analysis for the Cholesky factorization is carried out by
WILKINSON [1968]. We state the main result below.

THEOREM 6.3. Let M ∈ R^{n×n} be a symmetric positive-definite matrix. Provided that

2n^{3/2} u κ(M) < 0.1   (6.13)

the Cholesky factor of M can be computed without breakdown and the computed R̄ will
satisfy

R̄^T R̄ = M + E,   ||E||_2 ≤ 2.5 n^{3/2} u ||M||_2.   (6.14)
PROOF. See WILKINSON [1968].

The fact that M is positive-definite makes the Cholesky algorithm stable, since the
elements cannot grow in the reduction. However, with M = A^T A we have
κ(M) = κ²(A). Hence (6.13) shows that the Cholesky algorithm may fail, i.e. square
roots of negative numbers may arise, already when κ(A) is of the order of u^{−1/2}.

REMARK 6.6. It is important to note that a scaling argument shows that in (6.13) it is
the minimum condition number under a diagonal scaling,

κ' = min_D κ(A D),   D > 0 diagonal matrix,   (6.15)

which is relevant. VAN DER SLUIS [1969] has shown that if D₁ is chosen so that the
columns in A D₁ have unit length then

κ' ≤ κ(A D₁) ≤ n^{1/2} κ'.   (6.16)

EXAMPLE 6.1. The matrix A ∈ R^{21×6} with elements

a_ij = (i−1)^{j−1},   1 ≤ i ≤ 21,   1 ≤ j ≤ 6,

arises when fitting a fifth-degree polynomial

p(t) = x₁ + x₂t + x₃t² + ··· + x₆t⁵

to observations at points t = 0, 1, ..., 20. The condition numbers are

κ(A) = 6.40·10⁶,   κ(A D₁) = 2.22·10³.

Thus, by scaling, the condition number is reduced by more than three orders of
magnitude.
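Example 6.1 is easy to check numerically (a sketch; the exact values depend on the norm used, here the 2-norm condition number):

```python
import numpy as np

# Vandermonde-type matrix a_ij = (i-1)^(j-1) for t = 0, 1, ..., 20
t = np.arange(21.0)
A = np.vander(t, 6, increasing=True)    # 21 x 6, columns 1, t, ..., t^5

kappa = np.linalg.cond(A)
D1 = np.diag(1.0 / np.linalg.norm(A, axis=0))   # scale columns to unit length
kappa_scaled = np.linalg.cond(A @ D1)
```

The unscaled condition number is in the millions while the column-scaled one is in the thousands, in line with the figures quoted in the example.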

To assess the errors in the least squares solution x̄ computed by the method of
normal equations we make the simplifying assumption that no rounding errors
occur during the formation of A^T A and A^T b. We also assume that the errors in
solving the triangular systems (6.9) can be neglected. If inner products are
accumulated in double precision this is a reasonable assumption. (For an analysis of
rounding errors in the solution of triangular systems see FORSYTHE and MOLER
([1967], pp. 104-105).) Then it follows that x̄ satisfies

(A^T A + E) x̄ = A^T b,

where a bound for ||E||_2 is given by (6.14). Then a perturbation analysis similar to
that in Theorem 5.5 shows that

||x̄ − x||_2/||x||_2 ≲ 2.5 n^{3/2} u κ²(A).   (6.17)
The accuracy of the computed normal equation solution thus depends on the square
of the condition number of A. In view of the perturbation result in Theorem 5.5 this
is not consistent with the mathematical sensitivity of small residual problems. We
conclude that the normal equations approach can introduce errors greater than
what would be expected of a stable algorithm. This is true in particular when the
rows of A have widely different norms, see Section 4. In the next several sections we
review methods for solving least squares problems based on orthogonalization.
These methods work directly with A and can be proved to have very satisfactory
stability properties.

7. The QR decomposition and least squares problems

Let A ∈ R^{m×n}, b ∈ R^m and let Q ∈ R^{m×m} be an orthogonal matrix. Since orthogonal
transformations preserve the Euclidean length it follows that

||A x − b||_2 = ||Q^T(A x − b)||_2,

and hence the linear least squares problem

min_x ||Q^T A x − Q^T b||_2   (7.1)

is equivalent to (6.1). We now show how to choose Q so that problem (7.1) becomes
simple to solve.

THEOREM 7.1 (QR decomposition). Let A ∈ R^{m×n}, m ≥ n. Then there is an orthogonal
matrix Q ∈ R^{m×m} such that

Q^T A = ( R )
        ( 0 ),   (7.2)

where R is upper triangular with nonnegative diagonal elements. The decomposition
(7.2) is called the QR decomposition of A and the matrix R will be called the R factor
of A.

PROOF. The proof is by induction on m. (Note that n ≤ m.) For m = 1, A is a scalar
and we can take Q = ±1 according as A is nonnegative or negative. Now let m > 1,
and let A be partitioned in the form

A = (a₁, A₂),   a₁ ∈ R^m.

Let U = (y, U₁) be an orthogonal matrix with

y = a₁/||a₁||_2 if a₁ ≠ 0,   y = e₁ if a₁ = 0.

(Here e₁ denotes the first column in the unit matrix I.) Since U₁^T y = 0 it follows that

U^T A = ( ρ  r^T )
        ( 0  B   ),

where

ρ = ||a₁||_2,   r = A₂^T y,   B = U₁^T A₂ ∈ R^{(m−1)×(n−1)}.

By the induction hypothesis there is an orthogonal matrix Q̃ such that Q̃^T B = (R̃; 0).
Then (7.2) will hold if we define

Q = U ( 1  0 ),   R = ( ρ  r^T )
      ( 0  Q̃ )        ( 0  R̃   ).

REMARK 7.1. The proof of Theorem 7.1 gives a way to compute Q and R, provided
we can construct an orthogonal matrix U = (y, U₁) given its first column. There are
several ways to perform this construction using elementary orthogonal transforma-
tions, see Sections 8 and 9.

THEOREM 7.2. Let A ∈ R^{m×n} have rank n. Then the R factor of A has positive diagonal
elements and equals the uniquely determined Cholesky factor of A^T A.

PROOF. If rank(A) = n, then by Theorem 6.1 the Cholesky factor of A^T A is unique.
Now from (7.2) it follows that

A^T A = (R^T, 0) Q^T Q ( R )
                       ( 0 ) = R^T R,

which concludes the proof.

Assume that rank(A) = n, and partition Q in the form

Q = (Q₁, Q₂),   Q₁ ∈ R^{m×n},   Q₂ ∈ R^{m×(m−n)}.   (7.3)

Then by (7.2) and the nonsingularity of R we have

A = (Q₁, Q₂) ( R )
             ( 0 ) = Q₁ R,   Q₁ = A R^{−1}.   (7.4)

Hence, we can express Q₁ uniquely in terms of A and R. The matrix Q₂ however will
not in general be uniquely determined.
From (7.4) it follows that

R(A) = R(Q₁),   N(A^T) = R(Q₂),   (7.5)

which shows that the columns of Q₁ and Q₂ form orthonormal bases for R(A) and its
orthogonal complement. It follows that the corresponding orthogonal projections are

P_{R(A)} = Q₁Q₁^T,   P_{N(A^T)} = Q₂Q₂^T.   (7.6)


We now show how to use the decomposition (7.2) to solve the linear least squares
problem.

THEOREM 7.3. Let A ∈ R^{m×n}, m ≥ n, and b ∈ R^m be given. Assume that rank(A) = n and
that an orthogonal matrix Q has been computed such that

Q^T A = ( R ),   Q^T b = ( c₁ )
        ( 0 )            ( c₂ ).   (7.7)

Then the least squares solution x and the corresponding residual r = b − A x satisfy

R x = c₁,   ||r||_2 = ||c₂||_2.   (7.8)

PROOF. Since Q is orthogonal we have

||A x − b||_2² = ||Q^T A x − Q^T b||_2² = ||R x − c₁||_2² + ||c₂||_2².

Obviously the residual norm is minimized by taking x = R^{−1} c₁ and its minimum
equals the norm of c₂.
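Theorem 7.3 is the basis of the standard QR solve (a sketch using NumPy's Householder-based `qr` with a full m-by-m Q):

```python
import numpy as np

def lstsq_qr(A, b):
    """Solve min ||Ax - b||_2 via the QR decomposition, (7.7)-(7.8)."""
    m, n = A.shape
    Q, R = np.linalg.qr(A, mode='complete')   # Q is m x m, R is m x n
    c = Q.T @ b
    c1, c2 = c[:n], c[n:]
    x = np.linalg.solve(R[:n, :n], c1)        # R x = c_1
    return x, np.linalg.norm(c2)              # ||r||_2 = ||c_2||_2

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
x, resid = lstsq_qr(A, b)
```

Note that the residual norm is obtained for free as ||c₂||₂, without ever forming r = b − Ax.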

REMARK 7.2. The systematic use of orthogonal transformations to reduce matrices
to simpler form was initiated by GIVENS [1958] and HOUSEHOLDER [1958]. The
application to linear least squares problems is due to GOLUB [1965]. Since
x = R^{−1}c₁ = R^{−1}Q₁^T b it follows that when A has full rank n the pseudoinverse of A is
given by the expression

A† = R^{−1}Q₁^T.   (7.9)

REMARK 7.3. Let A = Q₁R, where A ∈ R^{m×n} has rank n and Q is orthogonal. Let E be
a perturbation of A, which satisfies

κ(A) ||E||_2/||A||_2 < 1.

Then rank(A + E) = n, so that A + E has a unique QR decomposition

A + E = (Q + W)(R + F).

How large can the perturbations W and F be? This problem is analyzed by STEWART
[1977b]. The main result is that ||W||_2 and ||F||_2 are bounded by terms of order
κ(A) ||E||_2/||A||_2.

According to Theorem 7.1 any matrix A ∈ R^{m×n} has a QR decomposition.
However, if rank(A) < n, then the decomposition is not unique.

EXAMPLE 7.1. For any c and s such that c² + s² = 1 we have

A = ( 0  0 ) = ( c  −s )( 0  s )
    ( 0  1 )   ( s   c )( 0  c ) = QR.

Here rank(A) = 1 < 2 = n. Note that the columns of Q no longer provide orthogonal
bases for R(A) and its complement. Therefore the QR decomposition is not very
useful for rank-deficient matrices and has to be modified.

THEOREM 7.4. Given A ∈ R^{m×n} with rank(A) = r ≤ min(m, n), there is a permutation
matrix Π and an orthogonal matrix Q ∈ R^{m×m} such that

Q^T A Π = ( R₁₁  R₁₂ )
          ( 0    0   ),   (7.10)

where R₁₁ ∈ R^{r×r} is upper triangular with positive diagonal elements. (Note that we do
not require that m ≥ n.)

PROOF. Since rank(A) = r, we can always choose a permutation matrix Π such that

A Π = (A₁, A₂),

where A₁ ∈ R^{m×r} has linearly independent columns. Let

Q^T A₁ = ( R₁₁ )
         ( 0   )

be the QR decomposition of A₁, where Q = (Q₁, Q₂). By Theorem 7.2, Q₁ and R₁₁ are
uniquely determined and R₁₁ has positive diagonal elements. Put

Q^T A Π = (Q^T A₁, Q^T A₂) = ( R₁₁  R₁₂ )
                             ( 0    R₂₂ ).

Since

rank(Q^T A Π) = rank(A) = r,

it follows that R₂₂ = 0, since otherwise Q^T A Π would have more than r linearly
independent rows. Hence the decomposition must have the form (7.10).

REMARK 7.4. There may be several ways to choose the permutation matrix Π. An
algorithm for determining a suitable Π will be described in Section 11. When Π has
been chosen, Q₁, R₁₁ and R₁₂ are uniquely determined.

REMARK 7.5. From (7.10) it follows that

A Π ( −R₁₁^{−1}R₁₂ ) = Q ( R₁₁  R₁₂ )( −R₁₁^{−1}R₁₂ ) = 0.
    (  I_{n−r}     )     ( 0    0   )(  I_{n−r}     )

Hence, a dimensional argument shows that we have obtained an explicit basis for the
nullspace of A:

N(A) = R( Π ( −R₁₁^{−1}R₁₂ ) )
            (  I_{n−r}     ).   (7.11)

Using the decomposition (7.10) the linear least squares problem (7.1) becomes

min_y || ( R₁₁  R₁₂ ) y − ( c₁ ) ||
         ( 0    0   )     ( c₂ ) ||_2,   (7.12)

where

Q^T b = c = ( c₁ ),   Π^T x = y = ( y₁ ),   c₁, y₁ ∈ R^r.
            ( c₂ )                ( y₂ )

The general solution of (7.12) can be written

x = Π ( R₁₁^{−1}(c₁ − R₁₂y₂) )
      ( y₂                   ),   (7.13)

where y₂ is arbitrary. If we here take y₂ = 0 we obtain a solution

x_B = Π ( y_B ),   y_B = R₁₁^{−1}c₁,   (7.14)
        ( 0   )

which has at most r = rank(A) nonzero components. Any such solution, where A x
only involves at most r columns of A, is called a basic solution. In several applications
it is important to compute such a basic solution. One example is the case when the
columns of A represent factors in a linear model and one wants to fit a vector of
observations b using as few factors as possible.
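A basic solution can be sketched as follows (an illustration only: a production code would use the column-pivoted QR factorization described in Section 11, e.g. SciPy's `scipy.linalg.qr(A, pivoting=True)`, rather than the greedy column selection used here):

```python
import numpy as np

def basic_solution(A, b, r):
    """Basic solution in the sense of (7.14): select r linearly
    independent columns of A (greedy largest-remaining-norm pivoting,
    a stand-in for proper column pivoting), fit with them, zero the rest."""
    m, n = A.shape
    perm, work = [], A.astype(float).copy()
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(work, axis=0)))
        perm.append(j)
        q = work[:, j] / np.linalg.norm(work[:, j])
        work -= np.outer(q, q @ work)        # deflate the chosen direction
    A1 = A[:, perm]                          # m x r with full column rank
    y_B = np.linalg.lstsq(A1, b, rcond=None)[0]
    x = np.zeros(n)
    x[perm] = y_B                            # undo the permutation
    return x

# rank-2 example: the third column is the sum of the first two
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])
x_B = basic_solution(A, b, r=2)
```

The returned vector minimizes the residual while using at most r columns of A, i.e. it has at most r nonzero components.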
The QR decomposition (7.10) can also be used to compute the minimal norm
solution. If we let z = y₂, then by (7.13) minimizing ||x||_2 is equivalent to the linear
least squares problem

min_z || ( R₁₁^{−1}R₁₂ ) z − ( y_B ) ||
         ( −I_{n−r}    )     ( 0   ) ||_2.   (7.15)

The matrix in (7.15) always has full rank, and (7.15) can be solved by computing its
QR decomposition.

REMARK 7.6. To minimize ||x||_2 is not always a good way to resolve rank deficiency.
Therefore the following more general approach is often useful.
For a given matrix B ∈ R^{p×n} consider the problem

min_{x∈S} ||B x||_2,   S = {x: ||A x − b||_2 = min}.   (7.16)

Substituting the general solution (7.13) with y₂ = z we find that (7.16) is equivalent to

min_z || B Π ( R₁₁^{−1}R₁₂ ) z − B Π ( y_B ) ||
             ( −I_{n−r}    )         ( 0   ) ||_2,   (7.17)

which is a linear least squares problem of dimension p × (n − r). If this problem is not
rank-deficient, then (7.16) has a unique solution which can be computed by
substituting the solution y₂ = z of (7.17) in (7.13).
The special case B = I gives the minimal norm solution. Often one wants to choose
B so that ||B x||_2 is a measure of the smoothness of the solution x. For example we
can let B be a discrete approximation to the second-derivative operator,

B = ( 1  −2   1              )
    (     1  −2   1          )
    (          ···           )  ∈ R^{(n−2)×n}.   (7.18)
    (           1  −2   1    )

By applying further orthogonal transformations to (7.10) the submatrix
(R₁₁, R₁₂) can be reduced to triangular form.

THEOREM 7.5. Given A ∈ R^{m×n} with rank(A) = r. Then there are orthogonal matrices
Q ∈ R^{m×m} and W ∈ R^{n×n} such that

Q^T A W = ( T  0 )
          ( 0  0 ),   (7.19)

where T ∈ R^{r×r} is triangular with positive diagonal elements. The decomposition (7.19)
is called a "complete orthogonal decomposition" of A.

PROOF. Since (R₁₁, R₁₂)^T ∈ R^{n×r} has rank r, by Theorem 7.2 there exists an
orthogonal matrix V such that

V^T ( R₁₁^T ) = ( S )
    ( R₁₂^T )   ( 0 ),

where S is upper triangular with positive diagonal elements. Hence

Q^T A W = ( S^T  0 )
          ( 0    0 ),   W = Π V.   (7.20)

Here S^T is lower triangular. We can get the form (7.19) with T upper triangular by
pre- and postmultiplying (7.20) with permutation matrices reversing the order of the
rows and columns in S^T.

REMARK 7.7. Partitioning the orthogonal matrices in (7.19) as Q = (Q₁, Q₂) and
W = (W₁, W₂) we have

A = (Q₁, Q₂) ( T  0 ) (W₁, W₂)^T.
             ( 0  0 )

It follows that the complete orthogonal decomposition provides orthogonal bases
for the four fundamental subspaces of A:

N(A) = R(W₂),   R(A) = R(Q₁),
R(A^T) = R(W₁),   N(A^T) = R(Q₂),

cf. (3.7) and (3.8). Further the pseudoinverse A† of A is given by

A† = W ( T^{−1}  0 ) Q^T.   (7.21)
       ( 0      0 )

It is possible to reduce A even further, to bidiagonal form, by orthogonal
transformations.

THEOREM 7.6 (Bidiagonal decomposition). Let A ∈ R^{m×n}, m ≥ n. Then there are
orthogonal matrices Q ∈ R^{m×m} and P ∈ R^{n×n} such that

Q^T A P = ( B )
          ( 0 ),   (7.22)

where B is an upper bidiagonal matrix with nonnegative diagonal elements.

PROOF. The proof is similar to the proof of Theorem 7.1 and is by induction on m.
For m = 1, we can take Q = ±1 and P = 1. For m > 1, we let A = (a₁, A₂), and let
U = (y, U₁) be constructed as in the proof of Theorem 7.1 so that

U^T A = ( ρ  r^T )
        ( 0  B̃   ),   ρ ≥ 0.

We now let Ṽ = (z, V₁) be an orthogonal matrix with

z = r/||r||_2 if r ≠ 0,   z = e₁ if r = 0.

Since r^T V₁ = 0 it follows that

U^T A V = ( ρ  c^T )
          ( 0  C   ),   c^T = (σ, 0, ..., 0),

where

V = diag(1, Ṽ),   σ = ||r||_2,   C = B̃Ṽ ∈ R^{(m−1)×(n−1)}.

By the induction hypothesis there now exist orthogonal matrices Q̃ and P̃ which
reduce C to bidiagonal form. Then (7.22) holds if we take

Q = U ( 1  0 ),   P = V ( 1  0 ).
      ( 0  Q̃ )          ( 0  P̃ )

(Note that the matrix P constructed in this way always has e₁ as its first column, so
that the first row (ρ, σ, 0, ..., 0) is not destroyed in the recursion.)

It would seem from (7.21) that for computational purposes the complete
orthogonal factorization (7.19) is as useful as the SVD. To some extent this is true.
However, the SVD provides a more satisfactory way of determining the "numerical
rank" of A. This is in general a difficult question.
In the next sections we describe several different methods to compute the QR
decomposition and discuss their numerical properties.

8. QR decomposition by orthogonal transformations

As remarked earlier, the proof of Theorem 7.1 of the existence of the QR
decomposition almost gives an algorithm to compute Q and R. It only remains to
show how we can construct an orthogonal matrix

U = (y, U₁),

when its first column is given by

y = a/||a||_2 if a ≠ 0,   y = e₁ if a = 0.   (8.1)

For a = 0 we can take U = I. For a ≠ 0, the construction is equivalent to solving the
following standard task: Given a nonzero vector a ∈ R^m, find an orthogonal matrix
U such that

U^T a = αe₁,   α = ||a||_2.   (8.2)

From (8.2) it follows that the first column y in U satisfies y^T a = ||a||_2, and hence
y = a/||a||_2.
We now show how to solve the standard task using certain elementary orthogonal
matrices. A matrix P of the form

P = I − βuu^T,   β = 2/u^Tu,   (8.3)

is called a Householder transformation. The name derives from A.S. Householder,
who initiated their use in numerical linear algebra. It is easily verified that P is
orthogonal and symmetric,

P^TP = I,   P^T = P,

and hence P² = I. The product of P with a given vector can easily be found without
explicitly forming P itself since

Pa = a − β(u^Ta)u.   (8.4)

The effect of the transformation is to reflect the vector a in the hyperplane with
normal vector u, see Fig. 8.1. Therefore, P is also called a Householder reflector.
Note that Pu = −u, so that P reverses u, and that Pa ∈ span{a, u}.
We now show how to choose u to find a Householder transformation P which
solves the standard task. From Fig. 8.1 it follows immediately that taking

u = a + σe₁,   σ = ||a||_2,   (8.5)

FIG. 8.1.

we get Pa = −σe₁. Letting a = (α₁, ..., α_m)^T,

u^Tu = (a + σe₁)^T(a + σe₁) = a^Ta + 2σα₁ + σ² = 2σ(σ + α₁),

so that

β = 1/(σ(σ + α₁)).   (8.6)

If a is close to a multiple of e₁ then σ ≈ |α₁| and cancellation may occur in the formula
(8.6). This will lead to a large error in β, and consequently one usually chooses the
sign in (8.5) so that u = a + sign(α₁)σe₁, which gives

β = 1/(σ(σ + |α₁|)).

This choice has the small blemish that the vector a = e₁ will be mapped onto −e₁. As
pointed out by PARLETT [1971], the formula (8.6) can easily be rewritten so that the
other choice of sign does not give rise to numerical cancellation.
If A = (a₁, ..., a_n) ∈ R^{m×n} is a matrix the product PA is computed as

PA = (Pa₁, ..., Pa_n),   Pa_j = a_j − β(u^Ta_j)u.   (8.7)

Thus P need not be formed explicitly and the product can be computed in 2mn flops.
Writing the product as

PA = (I − βuu^T)A = A − βu(A^Tu)^T

shows that A is altered by a matrix of rank one when premultiplied by
a Householder transformation. An analogous algorithm exists for postmultiplying
A by a Householder matrix.
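The construction (8.5)-(8.6), with the safe sign choice, can be sketched as:

```python
import numpy as np

def householder(a):
    """Return (u, beta) with P = I - beta*u*u^T orthogonal and
    P a = -sign(a_1)*sigma*e_1, sigma = ||a||_2, as in (8.5)-(8.6)."""
    sigma = np.linalg.norm(a)
    u = a.astype(float).copy()
    sign = 1.0 if a[0] >= 0 else -1.0     # sign chosen to avoid cancellation
    beta = 1.0 / (sigma * (sigma + abs(a[0])))
    u[0] += sign * sigma
    return u, beta

def apply_householder(u, beta, A):
    """Compute P A = A - beta*u*(A^T u)^T without forming P, as in (8.7)."""
    return A - beta * np.outer(u, A.T @ u)

a = np.array([3.0, 4.0, 0.0])
u, beta = householder(a)
Pa = apply_householder(u, beta, a.reshape(-1, 1)).ravel()   # -> (-5, 0, 0)
```

Applying the reflector through u and β rather than through an explicit P is exactly the rank-one update described above.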
We next consider orthogonal matrices representing plane rotations. These are
also called Givens rotations after W. GIVENS [1958]. In two dimensions the matrix
representing a rotation clockwise through an angle θ is

R(θ) = (  c  s )
       ( −s  c ),   c = cos θ,   s = sin θ.   (8.8)

In n dimensions the matrix representing a rotation in the plane spanned by the unit
vectors e_i and e_j, i < j, is a rank-two modification of the unit matrix I, equal to the
identity except for the four elements in rows and columns i and j:

           ( 1                  )
           (    ·               )
R_ij(θ) =  (       c     s      )   (8.9)
           (          ·         )
           (      −s     c      )
           (                 1  ),

with c in positions (i, i) and (j, j), s in position (i, j) and −s in position (j, i).

We now consider the premultiplication of a vector a = (α₁, ..., α_n)^T with R_ij(θ). We
have R_ij(θ)a = b = (β₁, ..., β_n)^T, where β_k = α_k, k ≠ i, j, and

β_i = cα_i + sα_j,   β_j = −sα_i + cα_j.   (8.10)

Thus a plane rotation may be multiplied into a vector at a cost of two additions and
four multiplications. If in (8.10) we set

c = α_i/σ,   s = α_j/σ,   σ = (α_i² + α_j²)^{1/2},   (8.11)

then β_i = σ and β_j = 0, i.e. we have introduced a zero in the jth component of the
vector.
vector.
Premultiplication of a matrix A ∈ R^{m×n} with a Givens rotation R_ij will only affect
the two rows i and j in A, which are transformed according to

a_ik := ca_ik + sa_jk,   a_jk := −sa_ik + ca_jk,   k = 1, 2, ..., n.

The product requires 4n flops. An analogous algorithm, which will only affect
columns i and j, exists for postmultiplying A with R_ij.
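(8.10)-(8.11) in code (a sketch; for production use σ should be computed with scaling against overflow, which `np.hypot` already provides):

```python
import numpy as np

def givens(ai, aj):
    """Compute c, s as in (8.11) so the rotation zeros the j-component."""
    sigma = np.hypot(ai, aj)          # (ai^2 + aj^2)^(1/2), overflow-safe
    if sigma == 0.0:
        return 1.0, 0.0
    return ai / sigma, aj / sigma

def apply_givens_rows(A, i, j, c, s):
    """Premultiply A by R_ij(theta): only rows i and j change."""
    Ai, Aj = A[i].copy(), A[j].copy()
    A[i] = c * Ai + s * Aj
    A[j] = -s * Ai + c * Aj
    return A

a = np.array([[3.0], [4.0]])
c, s = givens(a[0, 0], a[1, 0])
a = apply_givens_rows(a, 0, 1, c, s)    # -> [[5], [0]]
```

The copies of rows i and j are needed because the two update formulas both use the old values.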
Givens rotations can be used in several different ways to solve the standard task
(8.2). We can let R_1k, k = 2, ..., m, be a sequence of Givens rotations, where R_1k is
determined to zero the kth component in the vector:

R_1m ··· R_13R_12 a = αe₁.

Note that, since R_1k only involves the components 1 and k, previously introduced
zeros will not be destroyed. Another solution is to use the sequence R_{k−1,k}, k = m,
m−1, ..., 2, where R_{k−1,k} is chosen to zero the kth component. This demonstrates
the greater flexibility of Givens rotations compared to Householder reflections.
It is essential to note that the matrix R_ij need never be explicitly formed, but can be
represented by the numbers c and s. When a large number of rotations need to be
stored it is more economical to store just a single number. This can be done in
a numerically stable way using a scheme devised by STEWART [1976]. The idea is to
save c or s, whichever is smaller. To distinguish between the two cases one stores the
reciprocal of c and treats c = 0 as a special case. Thus for the matrix (8.8) we define the
scalar

ρ = 1,           if c = 0,
ρ = sign(c)s,    if |s| < |c|,   (8.12)
ρ = sign(s)/c,   if |c| ≤ |s|.

Given ρ the numbers c and s can be retrieved up to a common factor ±1 as follows:

if ρ = 1:     c = 0,     s = 1;
if |ρ| < 1:   s = ρ,     c = (1 − s²)^{1/2};
if |ρ| > 1:   c = 1/ρ,   s = (1 − c²)^{1/2}.

The reason for using this scheme is that the formula (1 − x²)^{1/2} gives poor accuracy
when x is close to unity.
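Stewart's scheme (8.12) packs each rotation into one scalar; a pack/unpack sketch (as the text states, the retrieved pair may differ from the original by a common factor ±1):

```python
import numpy as np

def pack(c, s):
    """Store the rotation (c, s) as a single scalar rho, as in (8.12)."""
    if c == 0.0:
        return 1.0
    if abs(s) < abs(c):
        return np.sign(c) * s       # the smaller of the two is s
    return np.sign(s) / c           # else store the reciprocal of c

def unpack(rho):
    """Retrieve (c, s) from rho, up to a common factor +-1."""
    if rho == 1.0:
        return 0.0, 1.0
    if abs(rho) < 1.0:
        s = rho
        return np.sqrt(1.0 - s * s), s
    c = 1.0 / rho
    return c, np.sqrt(1.0 - c * c)

c, s = 0.6, 0.8                     # |c| <= |s|: sign(s)/c is stored
c2, s2 = unpack(pack(c, s))
```

The square root is always taken of 1 − x² with |x| ≤ 1/√2 (the smaller of |c|, |s| is recomputed), which avoids the accuracy loss near x = 1 mentioned above.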
We now describe how the QR decomposition of a matrix A ∈ R^{m×n} of rank n can
be computed using a sequence of Householder or Givens transformations. Starting
with A^(1) = A we compute the sequence of matrices A^(k), k = 2, ..., n+1. The matrix
A^(k) is already triangular in its first (k−1) columns, i.e. it has the form

A^(k) = ( R₁₁^(k)  R₁₂^(k) )
        ( 0        Ã^(k)   ),   (8.13)

where R₁₁^(k) ∈ R^{(k−1)×(k−1)} is upper triangular. We let

A^(k+1) = P_k A^(k),

where the orthogonal matrix P_k is chosen to zero the first column in the submatrix
Ã^(k). If we let

Ã^(k) = (a_k^(k), ..., a_n^(k)),

then P_k = diag(I_{k−1}, P̃_k), where

P̃_k a_k^(k) = σ_k e₁,   σ_k = r_kk = ||a_k^(k)||_2.   (8.14)

Note that in this step only the submatrix Ã^(k) is transformed and that (R₁₁^(k), R₁₂^(k))
are the first (k−1) rows in the final matrix R. It follows that

P_n ··· P₂P₁A = A^(n+1) = ( R )
                          ( 0 ),

and hence

Q = (P_n ··· P₂P₁)^T = P₁P₂ ··· P_n.

The key construction is to find an orthogonal matrix P̃_k which satisfies (8.14).
However, this is just the standard task (8.2) with a = a_k^(k). Using (8.5) and (8.6) to
construct P̃_k as a Householder matrix we get the following algorithm.

ALGORITHM 8.1 (QR decomposition by Householder transformations). Given a matrix
A^(1) = A ∈ R^{m×n} of rank n. The following algorithm computes R and Householder
matrices

P_k = diag(I_{k−1}, P̃_k),   P̃_k = I − u^(k)(u^(k))^T/γ_k,   k = 1, 2, ..., n,

so that Q = P₁P₂ ··· P_n:

for k = 1, 2, ..., n
    σ_k := ||a_k^(k)||_2;
    u^(k) := a_k^(k);
    u_k^(k) := u_k^(k) + sign(a_kk^(k))σ_k;
    γ_k := σ_k(σ_k + |a_kk^(k)|);
    r_kk := −sign(a_kk^(k))σ_k;
    for j = k+1, ..., n
        β_jk := (u^(k))^T a_j^(k)/γ_k;
        a_j^(k+1) := a_j^(k) − β_jk u^(k);

This algorithm requires (mn² − ⅓n³) flops.
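Algorithm 8.1, with Q kept in factored form and then applied to a right-hand side as in Golub's method below, can be sketched as:

```python
import numpy as np

def householder_qr(A):
    """QR decomposition by Householder transformations (Algorithm 8.1).
    Returns R and the factored form of Q as a list of (u, gamma) pairs."""
    A = A.astype(float).copy()
    m, n = A.shape
    factors = []
    for k in range(n):
        sigma = np.linalg.norm(A[k:, k])
        u = A[k:, k].copy()
        sign = 1.0 if u[0] >= 0 else -1.0
        gamma = sigma * (sigma + abs(u[0]))   # uses the original a_kk
        u[0] += sign * sigma
        factors.append((u, gamma))
        A[k, k] = -sign * sigma
        A[k+1:, k] = 0.0
        # transform remaining columns: a_j := a_j - (u^T a_j / gamma) u
        A[k:, k+1:] -= np.outer(u, (u @ A[k:, k+1:]) / gamma)
    return A[:n, :], factors

def apply_qt(factors, b):
    """Compute c = P_n ... P_2 P_1 b from the factored form of Q."""
    c = b.astype(float).copy()
    for k, (u, gamma) in enumerate(factors):
        c[k:] -= (u @ c[k:]) / gamma * u
    return c

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
R, factors = householder_qr(A)
c = apply_qt(factors, b)
x = np.linalg.solve(R, c[:4])     # R x = c_1; ||r||_2 = ||c_2||_2
```

Only the vectors u^(k) and scalars γ_k are stored, exactly as Remark 8.1 below suggests for the overwriting variant.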

REMARK 8.1. Note that the vectors u^(k), k = 1, 2, ..., n, can overwrite the elements on
and below the main diagonal of A. Thus, all information associated with the factors
Q and R can be kept in A and two extra vectors of length n for (r₁₁, ..., r_nn) and
(γ₁, ..., γ_n).
REMARK 8.2. Let R̄ denote the computed R. It can be shown that there exists an
exactly orthogonal matrix Q̄ (not the computed Q) such that

A + E = Q̄R̄,   ||E||_F ≤ c₁u||A||_F,   (8.15)

where the error constant c₁ = c₁(m, n) is a polynomial in m and n, and || · ||_F denotes
the Frobenius norm. GOLUB and WILKINSON [1966] show that c₁ = 12.5n^{3/2} if inner
products are accumulated in double precision.
Normally it is more economical to keep Q in factored form and access it through
γ_k and u^(k), k = 1, 2, ..., n, than to compute Q explicitly. If Q is explicitly required it
can be accumulated by taking Q^(1) = I and computing Q = Q^(n+1) by

Q^(k+1) := Q^(k)P_k,   k = 1, 2, ..., n.

This accumulation requires 2(m²n − mn² + ⅓n³) flops, if we take advantage of the
property that P_k = diag(I_{k−1}, P̃_k). Similarly we can accumulate

Q₁ = Q ( I_n ),   Q₂ = Q ( 0       )
       ( 0   )           ( I_{m−n} )

separately in (mn² − ⅓n³) and (2m²n − 3mn² + n³) flops respectively.

An algorithm for solving the linear least squares problem based on the QR
decomposition and using Householder transformations was first given by GOLUB
[1965].

ALGORITHM 8.2 (Linear least squares solution by Golub's method). Given A ∈ R^{m×n}
with rank(A) = n and b ∈ R^m, compute R and P₁, P₂, ..., P_n by Algorithm 8.1. Form
the vector c by

c = P_n ··· P₂P₁b = ( c₁ )
                    ( c₂ )   (8.16)

and then solve

Rx = c₁.   (8.17)

The residual vector satisfies

r = b − Ax = Q ( 0  )
               ( c₂ )   (8.18)

and hence ||r||_2 = ||c₂||_2.

REMARK 8.3. To compute c by (8.16) requires only (2mn − n²) flops; thus, with Q kept
in factored form, each additional right-hand side can be treated in about
(2mn − ½n²) flops, including the back substitution (8.17).

The numerical properties of Golub's method for solving the least squares problem
are very good. The computed solution x̄ can be shown to be the exact solution of
a slightly perturbed least squares problem

min_x ||(A + δA)x − (b + δb)||_2,

where the perturbation can be bounded in norm by

||δA||_2 ≤ c₂un^{1/2}||A||_2,   ||δb||_2 ≤ c₂u||b||_2,   c₂ = (6m − 3n + 41)n,   (8.19)

see LAWSON and HANSON ([1974], pp. 90ff). Using the bounds (8.19) the error in x̄ can
be estimated from the perturbation result (5.11). As remarked in Section 6, no such
backward error analysis is valid for the method of normal equations.
An algorithm similar to Algorithm 8.1 but using Givens rotations can easily be
developed. In this algorithm the matrix P̃_k in (8.14) is constructed as a product of
(m−k) Givens rotations.

ALGORITHM 8.3 (QR decomposition by Givens rotations). Given a matrix A ∈ R^{m×n} of
rank n the following algorithm overwrites A with

Q^TA = ( R )
       ( 0 ):

for k = 1, 2, ..., n
    for j = k+1, ..., m
        σ := (a_kk² + a_jk²)^{1/2};   c := a_kk/σ;   s := a_jk/σ;
        A := R_kj(θ)A;

The algorithm requires 2(mn² − ⅓n³) flops.

REMARK 8.4. Using Stewart's storage scheme (8.12) for the rotations R_kj(θ) we can
store the information defining Q in the zeroed part of the matrix A. As for the
Householder algorithm it is advantageous to keep Q in factored form.

REMARK 8.5. The error properties of Algorithm 8.3 are as good as for the algorithm
based on Householder transformations. WILKINSON ([1965], p. 240) showed that
for m = n the bound (8.15) holds with c₁ = 3n^{3/2}. GENTLEMAN [1975] has improved
this error bound to c₁ = 3(m + n − 2), m ≥ n, and notes that actual errors are
observed to grow even more slowly.

Givens rotations can be used to introduce zeros in a matrix more selectively than
is possible with Householder transformations. This flexibility is of importance for
solving sparse least squares problems, see Chapter III. For dense problems Givens'
method has the drawback of having twice the operation count of the Householder
method. Also O(mn) square roots are needed.
It is possible to rearrange the Givens rotation so that it uses only two instead of
four multiplications per element and no square root. These modified transformations,
called "fast" or "square root free" Givens transformations, were introduced by
GENTLEMAN [1973] and modified by HAMMARLING [1974]. An algorithm for solving
linear least squares problems by fast Givens transformations is given in GOLUB and
VAN LOAN ([1983], pp. 158-160). In principle the gain in speed by using the modified
transformations should be a factor of two. However, a nontrivial amount of monitoring
to avoid overflow is necessary to implement them, and the observed gain is about
1.4-1.6 for sufficiently large problems, see LAWSON et al. [1979].

9. Gram-Schmidt orthogonalization

The columns of Q₁ in the factorization A = Q₁R can be obtained by successively
orthogonalizing the columns of A. To see this we write the factorization in the form

                                    ( r₁₁  r₁₂  ···  r₁ₙ )
(a₁, a₂, ..., aₙ) = (q₁, q₂, ..., qₙ) (      r₂₂  ···  r₂ₙ )
                                    (            ···     )
                                    (                 rₙₙ ).

Equating columns we get

aₖ = Σ_{i=1}^k r_ik q_i,   k = 1, 2, ..., n,

where by the orthonormality of the q_i,

r_ik = q_i^T aₖ,   i = 1, 2, ..., k−1.   (9.1)
If rank(A) = n, then, by Theorem 7.2, r_kk > 0 and we can solve for q_k:

q_k = q̂_k/r_kk,   q̂_k = a_k − Σ_{i=1}^{k−1} r_ik q_i.   (9.2)

This shows that q_k is a unit vector in the direction of q̂_k and r_kk is determined as the
normalization constant

r_kk = (q̂_k^T q̂_k)^{1/2}.   (9.3)

The unnormalized vector q̂_k is the orthogonal projection of a_k onto the orthogonal
complement of span[a₁, a₂, ..., a_{k−1}].
By (9.1)-(9.3) we are led to the classical Gram-Schmidt algorithm, which
generates Q₁ and R column by column.

ALGORITHM 9.1 (Classical Gram-Schmidt (CGS)). Given A ∈ R^{m×n} with rank(A) = n
the following algorithm computes for k = 1, 2, ..., n the column q_k of Q₁ and the
elements r_1k, ..., r_kk of R in the factorization A = Q₁R:

for k = 1, 2, ..., n
    for i = 1, ..., k−1
        r_ik := q_i^T a_k;
    q̂_k := a_k − Σ_{i=1}^{k−1} r_ik q_i;
    r_kk := (q̂_k^T q̂_k)^{1/2};   q_k := q̂_k/r_kk;

The algorithm requires mn² flops.

The classical Gram-Schmidt algorithm explicitly computes the matrix Q₁, which
theoretically provides an orthonormal basis for R(A). This is in contrast to other
numerical methods for computing the QR decomposition, in which Q is implicitly
defined as a product of Householder or Givens matrices. However, the computed Q₁
will in general not be orthogonal to working accuracy. Indeed, the computed vectors
q_k may depart from orthogonality to an almost arbitrary extent. As pointed out by
GANDER [1980], even computing Q₁ via the Cholesky decomposition of A^TA seems
superior to CGS.
We now describe a modification of the Gram-Schmidt algorithm for which the
loss of orthogonality is less dramatic. This modified Gram-Schmidt algorithm
provides a stable method for solving the linear least squares problem.

ALGORITHM 9.2 (Modified Gram-Schmidt (MGS)). Given A^(1) = A ∈ R^{m×n} with
rank(A) = n the following algorithm computes Q₁ and R in the factorization
A = Q₁R:

for k = 1, 2, ..., n
    for i = 1, ..., k−1
        r_ik := q_i^T a_k^(i);   a_k^(i+1) := a_k^(i) − r_ik q_i;
    q̂_k := a_k^(k);   r_kk := (q̂_k^T q̂_k)^{1/2};   q_k := q̂_k/r_kk;

REMARK 9.1. The important difference is that in the modified algorithm the
projections r_ik q_i are subtracted from a_k as soon as they are computed. Note that for
n = 2 CGS and MGS are identical. For treating rank-deficient problems column
pivoting is necessary, see Section 10. Then it is convenient to interchange the two
loops in Algorithm 9.2 so that the elements of R are computed row by row.

ALGORITHM 9.3 (Modified Gram-Schmidt: Row-oriented version).

for i = 1, 2, ..., n
    q̃_i := a_i^{(i)};  r_ii := (q̃_i^T q̃_i)^{1/2};  q_i := q̃_i / r_ii;
    for k = i+1, ..., n
        r_ik := q_i^T a_k^{(i)};  a_k^{(i+1)} := a_k^{(i)} - r_ik q_i;

There is no numerical difference between these two versions of MGS since the
operations and rounding errors are the same.
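The row-oriented version translates directly into code (again a NumPy sketch with our own naming; a working copy of A is overwritten column by column):

```python
import numpy as np

def mgs(A):
    """Row-oriented modified Gram-Schmidt (cf. Algorithm 9.3): A = Q1 @ R.
    Each projection is subtracted from all remaining columns as soon as the
    corresponding element of R is computed."""
    W = np.array(A, dtype=float)          # working copy, columns a_k^(i)
    m, n = W.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = np.linalg.norm(W[:, i])
        Q[:, i] = W[:, i] / R[i, i]
        for k in range(i + 1, n):
            R[i, k] = Q[:, i] @ W[:, k]   # uses the UPDATED column a_k^(i)
            W[:, k] -= R[i, k] * Q[:, i]  # subtract the projection at once
    return Q, R

A = np.vander(np.linspace(0, 1, 10), 6)   # moderately ill-conditioned columns
Q, R = mgs(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(6), atol=1e-8)
```

On this mildly ill-conditioned Vandermonde matrix the MGS basis stays orthogonal to roughly u·κ(A), in line with the bounds quoted below.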

REMARK 9.2. There is a square root free version of Gram-Schmidt orthogonaliza-
tion, which results if the normalization of the vectors q̃_k is omitted. In this version
one computes Q̃_1 and R̃ so that A = Q̃_1 R̃ with R̃ unit upper triangular. The changes in
Algorithm 9.3 are to put

r̃_ii := 1;  d_i := q̃_i^T q̃_i;  r̃_ik := (q̃_i^T a_k^{(i)})/d_i

and to subtract r̃_ik q̃_i instead of r_ik q_i.


The reason why the computed columns of Q_1 may depart from orthogonality is
that cancellation may take place when the orthogonal projection on q_i is subtracted.
In Algorithm 9.2 cancellation occurs in the statement

a_k^{(i+1)} := a_k^{(i)} - r_ik q_i

if for some small constant α

||a_k^{(i+1)}||_2 < α ||a_k^{(1)}||_2.   (9.4)
To exhibit the loss of orthogonality we consider a case of orthogonalizing two
vectors. The following example is due to RUTISHAUSER ([1976], pp. 96-97).

EXAMPLE 9.1. For the matrix

A = (  8  21
      13  34
      21  55
      34  89 )

we get, using either of the Gram-Schmidt algorithms and 4-digit computation,

r_11 = (a_1^T a_1)^{1/2} = 42.78,  q_1 = (0.1870, 0.3039, 0.4909, 0.7948)^T.

Then

r_12 = q_1^T a_2 = 112.0,  a_2^{(2)} = (0.06, -0.04, 0.02, -0.02)^T.

Severe cancellation has taken place, since

||a_2^{(2)}||_2 = 0.07746 << ||a_2||_2 = 112.0.

This leads to a serious loss of orthogonality between q_1 and a_2^{(2)}:

q_1^T a_2^{(2)} / ||a_2^{(2)}||_2 = -0.007022/0.07746 = -0.09065.

We now consider the use of the modified Gram-Schmidt algorithm for solving
linear least squares problems. It is important to note that, because of the loss of
orthogonality in Q_1 that takes place also in MGS, we cannot compute c = Q_1^T b and
solve Rx = c. Instead we apply the MGS algorithm to the augmented matrix (A, b).

ALGORITHM 9.4 (Linear least squares solution by MGS). Given A ∈ R^{m×n} with
rank(A) = n and b ∈ R^m. Form Ā = (A, b) and apply Algorithm 9.2 to get the
factorization

(A, b) = (Q_1, q_{n+1}) ( R  z )
                        ( 0  ρ ).   (9.5)

Then the solution to the linear least squares problem min_x ||Ax - b||_2 is given by

Rx = z,  r = ρ q_{n+1},  ||r||_2 = ρ.   (9.6)

To show that (9.6) holds, we use (9.5) to get

||Ax - b||_2 = || (A, b) ( x  ) ||   = || Q_1(Rx - z) - ρ q_{n+1} ||_2.
                         ( -1 )   2

If q_{n+1} is orthogonal to Q_1, then the minimum of the last expression occurs when
Rx - z = 0, and the residual is ρ q_{n+1}. Note that it is not necessary to assume that Q_1 is
orthogonal for this conclusion to hold.
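Algorithm 9.4 can be sketched as follows (NumPy; names are ours, and we assume rank(A) = n with b not exactly in the range of A, so that ρ > 0):

```python
import numpy as np

def lstsq_mgs(A, b):
    """Least squares by MGS applied to the augmented matrix (A, b),
    cf. (9.5)-(9.6): the last column of R carries z and rho."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    W = np.column_stack([A, np.asarray(b, dtype=float)])  # (A, b)
    Q = np.zeros((m, n + 1))
    R = np.zeros((n + 1, n + 1))
    for i in range(n + 1):                # plain MGS on the augmented matrix
        R[i, i] = np.linalg.norm(W[:, i])
        Q[:, i] = W[:, i] / R[i, i]
        for k in range(i + 1, n + 1):
            R[i, k] = Q[:, i] @ W[:, k]
            W[:, k] -= R[i, k] * Q[:, i]
    z, rho = R[:n, n], R[n, n]
    x = np.linalg.solve(R[:n, :n], z)     # Rx = z, cf. (9.6)
    return x, rho                         # rho = ||b - Ax||_2

A = np.array([[1., 1.], [1., 2.], [1., 3.]])
b = np.array([1., 2., 2.])
x, rho = lstsq_mgs(A, b)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
assert np.isclose(rho, np.linalg.norm(b - A @ x))
```

Note that the solution never multiplies by Q_1^T explicitly, which is exactly the point made above.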
Even if for MGS the computed vectors q_k are not orthogonal to working
accuracy, the loss of orthogonality is more gradual and can be bounded in terms of
κ(A). This is illustrated in the example below from BJÖRCK [1967].

EXAMPLE 9.2. Let A be the matrix in Remark 6.2,

A = ( 1  1  1
      ε  0  0
      0  ε  0
      0  0  ε ),

and assume that ε is so small that fl(1 + ε^2) = 1. If no other rounding errors are made,
then the matrices computed by CGS and MGS respectively are

Q_CGS = ( 1   0    0  )        Q_MGS = ( 1   0     0   )
        ( ε  -ε   -ε  )                ( ε  -ε   -ε/2  )
        ( 0   ε    0  )                ( 0   ε   -ε/2  )
        ( 0   0    ε  ),               ( 0   0     ε   ).

For simplicity we have omitted the normalization of the columns of Q. It is easily
verified that the maximum deviations from orthogonality of the computed columns are

CGS: |q_3^T q_2| = 1/2,  MGS: |q_3^T q_1| = 6^{-1/2} ε.

The condition number of A is κ_2 = ε^{-1}(3 + ε^2)^{1/2} ≈ 3^{1/2} ε^{-1}, and our assumption above
implies that ε^2 ≤ u (u is the unit roundoff). It follows that for MGS

|q_3^T q_1| ≤ (3·2^{1/2})^{-1} κ(A) u,

but for CGS orthogonality has been completely lost.

The superiority of the modified Gram-Schmidt algorithm over the classical


version was experimentally established by RICE [1966]. The modified algorithm was
also described by RUTISHAUSER [1967] and can be shown to be equivalent to an
elimination method with weighted row combinations given by BAUER [1965].
A detailed error analysis for MGS and its application to the solution of least squares
problems was carried out by BJORCK [1967].

THEOREM 9.1. Let A ∈ R^{m×n} with rank n and b ∈ R^m. Let R, z, Q_1 and q_{n+1} be the
computed quantities using Algorithm 9.3. Then, provided inner products are accumulated
in double precision,

A + E = Q_1 R,  b + e = Q_1 z + ρ q_{n+1},

where

||E||_2 ≤ 1.5 n^{3/2} u ||A||_2,  ||e||_2 ≤ 1.5 n u ||b||_2.   (9.7)

If further

β = 2mnuκ < 1,

where κ is the condition number of A, then

||I - Q_1^T Q_1||_2 ≤ β/(1 - β) ≈ 2mnuκ.   (9.8)

PROOF. BJÖRCK [1967]. □


It is further shown by BJÖRCK [1967], using the estimates (9.7) and (9.8), that the
errors in the computed solution of Algorithm 9.3 are of the same order as those
resulting from perturbations δA and δb such that

||δA||_E/||A||_E ≈ ||δb||_2/||b||_2 ≈ 2n^{3/2} u.   (9.9)

Here ||A||_E denotes the Euclidean (or Frobenius) matrix norm

||A||_E = (Σ_{i,j} a_ij^2)^{1/2}.

This shows that Algorithm 9.3 is a stable method for solving the linear least squares
problem. Indeed, according to numerical experiments of JORDAN [1968] and
WAMPLER [1970] it seems to be slightly more accurate than other orthogonalization
methods. However, MGS is also somewhat more expensive in terms of operations
and storage.
In some applications it is important to compute Q_1 and R such that Q_1 R
accurately represents A and Q_1 is accurately orthogonal. This is the orthogonal bases
problem. To satisfy both these conditions it is necessary to reorthogonalize the
computed vectors q_k.

EXAMPLE 9.3. In Example 9.1 we compute, using 4-digit computation,

δr_12 = q_1^T a_2^{(2)} = -0.007022

and reorthogonalize by taking

q̃_2 := a_2^{(2)} - δr_12 q_1 = (0.06131, -0.03787, 0.02345, -0.01442)^T.

Note that the correction δr_12 is too small to affect r_12 = 112.0. The new vector q̃_2 is
now accurately orthogonal to q_1,

q_1^T q̃_2 / ||q̃_2||_2 = 6.866·10^{-6} / 7.713·10^{-2} = 0.8902·10^{-4},

and we normalize to get

q_2 = (0.7949, -0.4910, 0.3040, -0.1870)^T.

In the above example one reorthogonalization was sufficient. It can be shown, in
a sense made more precise below, that this is true in general. Thus, reorthogonaliza-
tion will at most double the cost of the Gram-Schmidt algorithm.
We follow here the discussion in PARLETT ([1980], pp. 105-110), which in turn is
based on unpublished notes of Kahan on the fact that "twice is enough".
Given vectors q_1, ||q_1||_2 = 1, and a_2, we want to compute

q̃_2 = a_2 - r_12 q_1,  r_12 = q_1^T a_2.   (9.10)

Assume that we can perform the computation (9.10) with an error e = fl(q̃_2) - q̃_2
which satisfies

||e||_2 ≤ ε ||a_2||_2

for some small positive ε independent of q_1 and a_2. Then we have the following
theorem:

THEOREM 9.2. Let α be a fixed value in the range 1.2ε ≤ α ≤ 0.83 - ε. The following
algorithm computes a vector q̄_2 which satisfies

||q̄_2 - q̃_2||_2 ≤ (1 + α^{-1}) ε ||a_2||_2,  |q_1^T q̄_2| ≤ ε α^{-1} ||q̄_2||_2.   (9.11)

First compute q̃_2 using (9.10), and put q_2 := fl(q̃_2). If

||q_2||_2 ≥ α ||a_2||_2,

then accept q̄_2 = q_2; else reorthogonalize q_2,

q̃_2 := q_2 - δr_12 q_1,  δr_12 := q_1^T q_2.

If ||fl(q̃_2)||_2 ≥ α ||q_2||_2, then accept q̄_2 = fl(q̃_2), else accept q̄_2 := 0.

PROOF. See PARLETT ([1980], pp. 107-108). □

REMARK 9.3. Note that when α is large, say α = 0.5, then the bounds (9.11) are very good
but reorthogonalization will occur frequently. If α is small, reorthogonalization will be
rarer, but the bound on orthogonality less good. RUTISHAUSER [1967] has given
a version of MGS where α = 0.1 is used.
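The "twice is enough" device in the simplest setting of Theorem 9.2 can be sketched as follows (NumPy; the function name and test vectors are ours):

```python
import numpy as np

def orthogonalize_twice(q1, a2, alpha=0.5):
    """Orthogonalize a2 against the unit vector q1, reorthogonalizing at most
    once (Kahan's 'twice is enough', cf. Theorem 9.2).
    Returns the accepted vector and the accumulated coefficient r12."""
    r12 = q1 @ a2
    q2 = a2 - r12 * q1
    if np.linalg.norm(q2) >= alpha * np.linalg.norm(a2):
        return q2, r12                          # accepted on the first pass
    dr12 = q1 @ q2                              # reorthogonalize once
    q2_new = q2 - dr12 * q1
    if np.linalg.norm(q2_new) >= alpha * np.linalg.norm(q2):
        return q2_new, r12 + dr12
    return np.zeros_like(q2), r12 + dr12        # a2 numerically in span(q1)

rng = np.random.default_rng(0)
q1 = rng.standard_normal(50)
q1 /= np.linalg.norm(q1)
a2 = q1 + 1e-9 * rng.standard_normal(50)        # almost parallel to q1
q2, r12 = orthogonalize_twice(q1, a2)
# After (at most) one reorthogonalization q2 is orthogonal to working accuracy.
assert abs(q1 @ q2) <= 1e-10 * np.linalg.norm(q2)
```

With α = 0.5 the severe cancellation in the first pass triggers exactly one reorthogonalization here, as the theorem predicts.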

REMARK 9.4. MOLINARI [1977] points out that there are special situations where even
better orthogonality is required than what can be obtained by one reorthogonaliza-
tion. He gives an ALGOL procedure for "superorthogonalization" which, depending
on a parameter, may carry out several reorthogonalizations.

A rounding error analysis for the classical Gram-Schmidt algorithm with


reorthogonalization has been given by ABDELMALEK [1971]. He proves that bounds
for the lack of orthogonality can be obtained that are independent of the condition
number of A.
RUHE [1983] discusses numerical aspects of several variants of Gram-Schmidt
orthogonalization. He also considers the case when one needs to orthogonalize
against a set of vectors which are not accurately orthogonal. He generalizes the
modified Gram-Schmidt algorithm to oblique projections, which has applications
to orthogonalization in elliptic norms and to biorthogonalization.

10. Rank-deficient problems and the SVD

In the previous two sections we have assumed that the matrix A ∈ R^{m×n} has rank n.
Inaccuracy of data and rounding errors made during the computation usually mean
that the matrix A is not exactly known. In this situation the mathematical notion of
rank is not appropriate. For example, suppose that A is a matrix that originally was
of rank r <n, but whose elements have been perturbed by e.g. rounding errors. Then
it is most likely that the perturbed matrix has full rank n. However, it will be very

close to a rank-deficient matrix, and it should be considered as numerically


rank-deficient.
From the above considerations it follows that the numerical rank assigned to
a matrix should depend on a tolerance which reflects the error level in the data. We
will say that a matrix A has numerical δ-rank equal to r if

r = min{rank(B): ||A - B||_2 ≤ δ}.   (10.1)

By Theorem 3.2 we have for r < n

inf_{rank(B)≤r} ||A - B||_2 = σ_{r+1},   (10.2)

where σ_i, i = 1, 2, ..., min(m, n), are the singular values of A. This infimum is actually
attained for the matrix

B = Σ_{i=1}^{r} σ_i u_i v_i^T.

Hence a matrix A has numerical δ-rank r if and only if

σ_1 ≥ ... ≥ σ_r > δ ≥ σ_{r+1} ≥ ... ≥ σ_n.   (10.3)

The definition (10.3) is satisfactory only when there is a well-defined gap between
σ_r and σ_{r+1}. If the exact matrix A is rank-deficient but well-conditioned, then this
should be the case. However, there may not exist a gap for any r; suppose e.g. that
σ_{i+1} = 0.9 σ_i, i = 1, 2, ..., n-1. In such a case the rank obviously is not well-defined.
Such difficult problems, where additional information usually is needed for
a meaningful solution, are treated in Section 26.
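In code, the definition (10.3) is simply a count of singular values above the tolerance (a NumPy sketch; names and the test matrix are ours):

```python
import numpy as np

def numerical_rank(A, delta):
    """Numerical delta-rank (10.3): the number of singular values > delta."""
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > delta))

# A matrix of exact rank 2, perturbed at level ~1e-10: its mathematical rank
# is almost surely 3, but its numerical rank is 2 for any delta chosen well
# above the noise level.
rng = np.random.default_rng(1)
A = np.outer([1., 2., 3.], [1., 0., 1.]) + np.outer([0., 1., 1.], [1., 1., 0.])
A_noisy = A + 1e-10 * rng.standard_normal((3, 3))
assert np.linalg.matrix_rank(A_noisy) == 3      # perturbation restores full rank
assert numerical_rank(A_noisy, delta=1e-6) == 2 # but the numerical rank is 2
```

This mirrors the discussion above: rounding-level perturbations give the matrix full mathematical rank while leaving it numerically rank-deficient.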
The choice of the parameter δ in (10.3) is not always an easy matter. Let the error
in the matrix A be E = (e_ij). Assume that the absolute sizes of the elements e_ij in the
error matrix are all about the same, and that |e_ij| ≤ ε for all i, j. Then a reasonable
choice in (10.3) is to take δ = (mn)^{1/2} ε. If the absolute sizes of the elements e_ij are not
about the same, one could try to scale rows and columns of A so that they become
nearly equal. (Note that any diagonal scaling D_1 A D_2 will induce the same scaling
D_1 E D_2 of the error matrix.) However, note that scaling the rows of A is normally not
allowed, since it would change the solution to the least squares problem.
In some cases the elements of A may be exactly known. Then the relative error in
the elements will be about equal to the unit roundoff u, and one can take δ = u||A||.
In this case one could scale the original matrix so that all its elements are roughly of
equal size. The above discussion mainly follows that in DONGARRA et al. ([1979],
pp. 1.10-1.12). The main point to remember is that definition (10.3) will not be
meaningful unless the matrix A is reasonably scaled.
The most reliable way to determine the numerical rank of a matrix A is by (10.3).
We now outline an algorithm by GOLUB and REINSCH [1970] for computing the
singular values and singular vectors. The first step is to reduce A to bidiagonal form,
see Theorem 7.6. The proof of this theorem gives a way to compute the bidiagonal
form using a sequence of Householder transformations, in the same way as the proof
of Theorem 7.1 leads to Algorithm 8.1.

We can write (assuming m ≥ n)

Q_B^T A P_B = ( B ),   B = ( q_1  e_2            )
              ( 0 )        (      q_2  ...       )   (10.4)
                           (           ...  e_n  )
                           (                q_n  ),

where

Q_B = Q_1 ... Q_n ∈ R^{m×m}  and  P_B = P_1 ... P_{n-2} ∈ R^{n×n}

are products of Householder matrices. Here Q_k is chosen to zero elements in the kth
column, k = 1, ..., n, and P_k to zero elements in the kth row, k = 1, ..., n-2. If m = n
then Q_n = I. This algorithm requires 2(mn^2 - n^3/3) flops and was first described by
GOLUB and KAHAN [1965]. The Householder vectors associated with Q_B can be
stored in the lower triangular part of A and those associated with P_B in the upper
triangular part of A. If Q_B and P_B are explicitly required they can be accumulated in
2(m^2 n - mn^2 + n^3/3) and 2n^3/3 flops, respectively. Note that the singular values of
B equal those of A.
Once the bidiagonal matrix B has been computed, the second phase is to zero the
superdiagonal elements in B. This is done by an iterative algorithm first outlined by
GOLUB and KAHAN [1965]. With B_1 = B one computes

B_{i+1} = S_i^T B_i T_i,  i = 1, 2, ...,

where S_i and T_i are products of Givens transformations. This step is mathematically
equivalent to the symmetric QR algorithm for computing the eigenvalues of B^T B.
However, forming B^T B would lead to a severe loss of accuracy in the small singular
values, and it is essential to work directly with the matrix B.
When the superdiagonal elements in B have converged to zero we have

Q_S^T B P_S = Σ = diag(σ_1, ..., σ_n).

Hence

U^T A V = ( Σ ),   (10.5)
          ( 0 )

where

U = Q_B diag(Q_S, I_{m-n}),  V = P_B P_S,

is the singular value decomposition of A. Usually fewer than 2n iterations are needed
in the second phase. For a detailed description of the SVD algorithm we refer to
GOLUB and REINSCH [1970] and GOLUB and VAN LOAN ([1983], pp. 285-294).
For solving the least squares problem the matrix U of left singular vectors need
not be explicitly formed. We only need to compute the vector

c = U^T b = diag(Q_S^T, I_{m-n}) Q_B^T b,

i.e. we apply the sequence of transformations in the first and second steps to b. The
numerical minimal norm least squares solution is then given by (cf. Theorem 4.1)

x = Σ_{i=1}^{r} (c_i/σ_i) v_i,   (10.6)

where r is the numerical rank of A. Here the matrix V is explicitly required. The
expression (10.6) shows that overestimating the numerical rank r of A can lead to
a solution of very large norm when σ_r is small.
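The truncated solution (10.6) in NumPy (a sketch; names and the test matrix are ours):

```python
import numpy as np

def tsvd_solve(A, b, delta):
    """Minimal norm least squares solution (10.6), truncated at the numerical
    delta-rank r; truncation keeps the solution norm under control."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > delta))            # numerical rank
    c = U.T @ b                           # c = U^T b
    return Vt[:r].T @ (c[:r] / s[:r])     # x = sum_{i=1}^r (c_i/sigma_i) v_i

# Nearly rank-one matrix: using all singular values would divide by the tiny
# sigma_2, but with truncation the solution stays moderate.
A = np.array([[1., 1.], [1., 1.], [1., 1.0000001]])
b = np.array([2., 2., 2.])
x = tsvd_solve(A, b, delta=1e-3)
assert np.linalg.norm(x) < 10             # no blow-up from a small sigma_r
assert np.linalg.norm(A @ x - b) < 1e-3   # residual is still small
```

This illustrates the remark above: overestimating r (here, keeping the second singular value) would produce a solution of enormous norm.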
A modified algorithm for computing the bidiagonal decomposition (10.4) of A,
which is more efficient when m >> n, has been analyzed by CHAN [1982]. The idea,
mentioned e.g. in LAWSON and HANSON ([1974], pp. 119-122), is to begin by
computing the QR decomposition of A,

A = Q ( R ).   (10.7)
      ( 0 )

One then applies the first step of the Golub-Reinsch algorithm to R to get

Q_2^T R P_B = B.   (10.8)

Combining (10.7) and (10.8) we obtain (10.4), where Q_B = Q diag(Q_2, I_{m-n}). This
modified algorithm uses mn^2 + n^3 flops and therefore should be faster than the
Golub-Kahan bidiagonalization whenever m > 5n/3, except for possible extra loop
overhead.
CHAN [1982] compares the operation counts for the two variants of the SVD
algorithm in four different cases, depending on whether U_1 and V are explicitly
required, where U = (U_1, U_2):

Required      Golub-Reinsch SVD   Chan SVD

Σ             2mn^2 - (2/3)n^3    mn^2 + n^3
Σ, V          2mn^2 + 4n^3        mn^2 + (11/2)n^3
Σ, U_1        7mn^2 - n^3         3mn^2 + (11/2)n^3
Σ, U_1, V     7mn^2 + 3n^3        3mn^2 + 10n^3

Both of the SVD algorithms can be shown to be backward stable, i.e. the computed
singular values σ̄_k are the exact singular values of a nearby matrix A + δA,
where

||δA||_2 ≤ c(m, n)·u·σ_1.

Here c(m, n) is a constant depending on m and n, and u is the machine unit. From
Theorem 3.4 it follows that

|σ̄_k - σ_k| ≤ c(m, n)·u·σ_1.

Thus, if A is nearly rank-deficient this will be revealed by the computed singular
values.

11. Rank-deficient problems and the QR decomposition


The singular value decomposition is in general the most reliable method to
determine the numerical rank of a matrix. However, in practice the QR decompo-
sition often works as well and requires less work. Further, in some applications
a basic solution (cf. (7.14)) is wanted rather than the minimal norm solution, and
then the QR decomposition is the relevant tool.
Let A ∈ R^{m×n}, and let r = rank(A) < min(m, n). Then according to Theorem 7.4
there is a column permutation Π such that the QR decomposition of AΠ has the
form

Q^T A Π = ( R_11  R_12 ),   R_11 ∈ R^{r×r},   (11.1)
          (  0     0   )

where R_11 is upper triangular and nonsingular. To compute this decomposition we
modify the algorithms given in the previous sections to include column pivoting. We
now describe a suitable pivoting strategy introduced by GOLUB [1965].
The pivoting strategy can be described in an algorithm-independent way as
follows. Let AΠ = (a_c1, ..., a_cn) and denote by

A_{k-1} = (a_c1, ..., a_{c,k-1})

the matrix consisting of the first k-1 selected columns. Let

s_j^{(k)} = min_y ||A_{k-1} y - a_cj||_2,  j = k, ..., n.   (11.2)

Then the column a_ck selected in the kth step is such that

s_k^{(k)} ≥ s_j^{(k)},  j = k+1, ..., n.   (11.3)

In other words, the selected column is the remaining column of largest distance to the
subspace R(A_{k-1}) spanned by the previously chosen columns.
We first consider how to incorporate this column pivoting strategy into
Algorithm 8.1, QR decomposition by Householder transformations. After (k-1)
steps in this algorithm we have computed

A^{(k)} = P_{k-1} ... P_1 A Π_1 ... Π_{k-1} = ( R_11^{(k)}  R_12^{(k)}  ),   R_11^{(k)} ∈ R^{(k-1)×(k-1)},   (11.4)
                                              (    0        A_22^{(k)} )

where P_1, ..., P_{k-1} are Householder matrices and Π_1, ..., Π_{k-1} are elementary
permutation matrices. Let

A_22^{(k)} = (ã_k^{(k)}, ..., ã_n^{(k)}).

The quantities s_j^{(k)} in (11.2) obviously are invariant under orthogonal transformations.

Hence, the pivoting strategy (11.3) is equivalent to searching for the column of
largest norm in the submatrix A_22^{(k)}, so the permutation matrix Π_k should be
chosen to interchange the columns p and k, where the index p satisfies

||ã_p^{(k)}||_2 ≥ ||ã_j^{(k)}||_2,  j = k, ..., n.

Note that, in the first step, (11.2) is equivalent to selecting the column of largest norm
in A. If r = k-1, then s_j^{(k)} = 0, j = k, ..., n, and we are finished.
If the column norms in A_22^{(k)} are recomputed at each stage, then column pivoting
will increase the flop count of Algorithm 8.1 by a factor of 1.5. An alternative is to
compute the norms of the columns of A initially and update these values as the
decomposition proceeds. We compute

s_j^{(1)} = ||a_j||_2^2,  j = 1, ..., n,   (11.5)

and for k = 1, 2, ..., r+1 compute

s_j^{(k+1)} = s_j^{(k)} - r_kj^2,  j = k+1, ..., n.   (11.6)

(Naturally, the s_j^{(k)} must be interchanged if the columns of A_22^{(k)} are interchanged.)
Using (11.6) will reduce the overhead of column pivoting to O(mn) operations.
However, some care must be taken to avoid numerical problems, see DONGARRA et
al. ([1979], pp. 9.16-9.18).
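A sketch of Householder QR with this pivoting strategy, carrying the squared column norms along by (11.5)-(11.6) (NumPy; names are ours, and we omit the safeguard against downdating cancellation mentioned above):

```python
import numpy as np

def qr_colpivot(A):
    """Householder QR with column pivoting. Returns Q, R, piv with
    Q @ R = A[:, piv]; squared column norms are downdated via (11.6)."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Q = np.eye(m)
    piv = np.arange(n)
    s = np.sum(R * R, axis=0)                 # (11.5): squared column norms
    for k in range(min(m, n)):
        p = k + int(np.argmax(s[k:]))         # remaining column of largest norm
        R[:, [k, p]] = R[:, [p, k]]
        s[[k, p]] = s[[p, k]]
        piv[[k, p]] = piv[[p, k]]
        x = R[k:, k]
        normx = np.linalg.norm(x)
        if normx == 0.0:
            break                             # exact rank deficiency
        v = x.copy()
        v[0] += normx if x[0] >= 0 else -normx  # Householder vector
        beta = 2.0 / (v @ v)
        R[k:, :] -= np.outer(beta * v, v @ R[k:, :])
        Q[:, k:] -= np.outer(Q[:, k:] @ v, beta * v)
        s[k + 1:] -= R[k, k + 1:] ** 2        # (11.6): downdate the norms
    return Q, np.triu(R), piv

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
Q, R, piv = qr_colpivot(A)
assert np.allclose(Q @ R, A[:, piv])
d = np.abs(np.diag(R))
assert np.all(d[:-1] >= d[1:] - 1e-12)        # non-increasing diagonal
```

A production version would periodically recompute the downdated norms when they have cancelled too far, which is the numerical problem the text warns about.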
From (8.14) it follows that the column pivoting strategy will maximize the
diagonal element r_kk, and hence the diagonal elements in R will form a non-increasing
sequence. It is easily shown that in fact the elements in R will satisfy the stronger
inequalities

r_kk^2 ≥ Σ_{i=k}^{j} r_ij^2,  j = k+1, ..., n.   (11.7)
We now consider how to incorporate column pivoting in Algorithm 6.1, the
Cholesky decomposition. We first note that

(AΠ)^T AΠ = Π^T A^T A Π = Π^T M Π,  M = A^T A,

i.e. a permutation of the columns of A is equivalent to a symmetric permutation of
rows and columns of the matrix M. Consider the row-wise version of the Cholesky
algorithm. After (k-1) steps we compute

s_j^{(k)} = m_jj - Σ_{i=1}^{k-1} r_ij^2,  j = k, ..., n,   (11.8)

select an index p such that

s_p^{(k)} ≥ s_j^{(k)},  j = k, ..., n,

and interchange rows and columns k and p. Obviously this will maximize the
diagonal element r_kk, so this choice of pivot will in exact computation produce the
same permutation as the Householder algorithm with column pivoting. The
quantities s_j^{(k)} can again be updated by (11.6), which gives an overhead for the
column pivoting of O(n^2) flops.

Finally we consider the implementation of pivoting in the row-wise version of
modified Gram-Schmidt, Algorithm 9.3. Here after (k-1) steps we have trans-
formed the nonpivotal columns according to

a_j^{(k)} = a_j - Σ_{i=1}^{k-1} r_ij q_i,  j = k, ..., n.

This shows that a_j^{(k)} is just the orthogonal projection of a_j onto the complement of
span[q_1, ..., q_{k-1}] = R(A_{k-1}). Hence, in the kth step we should maximize

s_j^{(k)} = ||a_j^{(k)}||_2^2,  j = k, ..., n.

These quantities can be updated by the same formula (11.6) as for the Householder
and Cholesky algorithms, but again some care is necessary to avoid numerical
problems.
We now consider how to terminate the Householder algorithm for the QR
decomposition. Because of rounding errors the submatrix A_22^{(k)} will usually not be
exactly zero when A has rank r = k-1. However, assume that A_22^{(k)} is small in norm
relative to A, say for some ε > 0 we have

||A_22^{(k)}||_2 ≤ ε σ_1(A).   (11.9)

If in (11.4) we perturb A^{(k)} by putting A_22^{(k)} = 0, then the perturbed matrix has rank
k-1, and hence from Theorem 3.2

σ_k(A^{(k)}) ≤ ||A_22^{(k)}||_2 ≤ ε σ_1(A).   (11.10)

Further, from the rounding error analysis of the Householder algorithm (cf.
Remark 8.2) we know that A^{(k)} is exactly orthogonally equivalent to A + E_k, where

||E_k||_2 ≤ c_1 u ||A||_F ≤ c_1 u n^{1/2} σ_1(A).   (11.11)

Using Theorem 3.4 we get from (11.9)-(11.11)

σ_k(A) ≤ σ_k(A^{(k)}) + c_1 u n^{1/2} σ_1(A) ≤ (ε + c_1 u n^{1/2}) σ_1(A).

Thus, if (11.9) is satisfied then A has numerical δ-rank at most equal to r = k-1, for
δ = (ε + c_1 u n^{1/2}) σ_1(A).
From the pivoting strategy it follows that

||A_22^{(k)}||_2 ≤ ||A_22^{(k)}||_F ≤ (n-k+1)^{1/2} |r_kk|,  k = 1, ..., n.

In particular we have

|r_11| ≤ σ_1(A) ≤ n^{1/2} |r_11|.   (11.12)

Thus, if

|r_kk| ≤ ε′ |r_11|,   (11.13)

then (11.9) holds with ε = (n-k+1)^{1/2} ε′, so we can use the simpler criterion (11.13) to
terminate the algorithm. Note that (11.13) can be used also for terminating the
Cholesky algorithm and the Gram-Schmidt algorithm. However, for the Cholesky

algorithm no rounding error analysis similar to (11.11) holds, and in (11.13) we
should not take ε′ smaller than about u^{1/2}, where u is the unit roundoff.
We have shown that if the algorithm with column pivoting for the QR
decomposition terminates with a small diagonal element r_kk, then A is close to
a matrix of rank k-1. However, the converse is not true. As the following example
shows, the matrix A can be nearly rank-deficient without any element r_kk being small.

EXAMPLE 11.1 (KAHAN [1966], pp. 791-792). Consider the upper triangular matrix

R_n = diag(1, s, ..., s^{n-1}) ( 1  -c  ...  -c )
                               (     1  ...  -c ),   c^2 + s^2 = 1,  c, s > 0.
                               (         ... ...)
                               (              1 )

The matrix R_n is upper triangular and it can be verified that it satisfies the
inequalities (11.7). Therefore R_n is invariant under the algorithm for QR decompo-
sition with column pivoting. For n = 100, c = 0.2, the smallest singular value is
σ_n = 0.368·10^{-8}, but r_nn = s^{n-1} = 0.133. Hence, the near singularity of R_n is not
revealed.

The inequalities (11.12) give upper and lower bounds for σ_1(R) in terms of r_11. For
the smallest singular value σ_n(R) we have, assuming that r_nn ≠ 0,

σ_n^{-1} = ||R^{-1}||_2 ≥ |r_nn|^{-1},   (11.14)

since the diagonal elements of R^{-1} equal r_ii^{-1}, i = 1, ..., n. This gives an upper
bound for σ_n(R). For an upper triangular matrix whose elements satisfy (11.7),
FADDEEV, KUBLANOVSKAJA and FADDEEVA [1968] have shown the lower bound

σ_n(R) ≥ 3(4^n + 6n - 1)^{-1/2} |r_nn| ≥ 2^{1-n} |r_nn|.   (11.15)

The matrices R_n in Example 11.1 show that this lower bound can almost be attained.
Combining (11.12) and (11.14) we obtain a lower bound for the condition number
κ(A) = κ(R):

κ(A) = σ_1/σ_n ≥ |r_11/r_nn|.   (11.16)
The above discussion shows that this may considerably underestimate the condition
number. However, in extensive numerical testing by STEWART [1980] on randomly
generated test matrices, (11.16) was a fairly reliable estimate of κ(A). Indeed, (11.16)
usually was an underestimate only by a factor of 2-3 and never by more than 10. GOLUB
and VAN LOAN ([1983], p. 167) remark that "the degree of unreliability of QR with
column pivoting is somewhat like that for Gaussian elimination with partial
pivoting, a method that works very well in practice."
One way to determine the condition number of R would be to compute R^{-1}.
However, this requires O(n^3) flops, and would significantly increase the work in

computing the QR decomposition. We now describe a condition estimator given by
CLINE, MOLER, STEWART and WILKINSON [1979], which is based on inverse iteration
and only requires O(n^2) flops. Let d be a random vector and compute y and z from

R^T y = d,  R z = y.   (11.17)

We have z = (R^T R)^{-1} d = (A^T A)^{-1} d, so (11.17) is equivalent to one step of inverse
iteration with A^T A. Let R = UΣV^T be the singular value decomposition of R. Note
that then

A = QR = (QU)ΣV^T,

so the SVDs of A and R are closely related. If we write

d = Σ_{i=1}^{n} a_i v_i,

then we have

y = Σ_{i=1}^{n} (a_i/σ_i) u_i,  z = Σ_{i=1}^{n} (a_i/σ_i^2) v_i.

Provided a_n, the component of d along v_n, is not very small, the vector z is likely to be
dominated by its component along v_n, and

σ_n^{-1} ≈ ||z||_2/||y||_2   (11.18)

will be a good estimate of σ_n^{-1}. To enhance the probability of obtaining a good
estimate the following strategy is suggested. We take

d = (±1, ±1, ..., ±1)^T,

the sign of d_s being determined at the stage when y_s is computed so as to obtain
a vector y of large norm. For details of this strategy see CLINE et al. [1979]. This
condition estimator has been implemented in the LINPACK set of subroutines and has
proved to be very reliable in practice.
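A simplified version of this estimator can be sketched as follows (NumPy; we take a random ±1 vector d directly, rather than choosing the signs adaptively as LINPACK does):

```python
import numpy as np

def smallest_sv_estimate(R, rng):
    """Estimate sigma_n of a nonsingular upper triangular R by one step of
    inverse iteration with A^T A = R^T R, cf. (11.17)-(11.18)."""
    n = R.shape[0]
    d = rng.choice([-1.0, 1.0], size=n)
    y = np.linalg.solve(R.T, d)      # R^T y = d
    z = np.linalg.solve(R, y)        # R z = y
    return np.linalg.norm(y) / np.linalg.norm(z)   # ~ sigma_n

# Matrix with known singular values. The estimate always lies between
# sigma_n and sigma_1, and is usually close to sigma_n.
rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((6, 6)))
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
sv = np.array([3., 2., 1., 0.5, 0.1, 1e-5])
A = U @ (sv[:, None] * V.T)
R = np.linalg.qr(A, mode='r')
est = smallest_sv_estimate(R, rng)
assert sv[-1] * 0.99 <= est <= sv[0] * 1.01
```

With dense triangular solves each costing about n^2/2 operations, the whole estimate is O(n^2), as stated above.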
The condition estimator will detect near rank-deficiency of the matrix A even in
the (unusual) case when this is not revealed by a small diagonal element in R. This is
important since failure to detect near rank-deficiency can lead to meaningless
solutions of very large norm, or even to failure of the algorithm. However, it still
remains to be shown how to compute a rank-revealing QR decomposition in this
case, i.e. a decomposition of the form (11.4) with A_22^{(r+1)} small.
We first consider the case when r = n-1, and show that we can always find
a column permutation of A such that the resulting R factor has a small element r_nn. It
turns out that such a permutation can be found by inspecting the elements of the
right singular vector of A corresponding to the smallest singular value σ_n. This
procedure was first pointed out in GOLUB, KLEMA and STEWART [1976], see also
GOLUB and VAN LOAN ([1983], Problem 6.4-4, p. 168). Let the vector v, ||v||_2 = 1,
satisfy ||Av||_2 = ε, and let Π be a permutation such that if Π^T v = w, then |w_n| = ||w||_∞.

Then if AΠ = QR is the QR decomposition of AΠ,

|r_nn| ≤ n^{1/2} ε.   (11.19)

To prove this estimate we follow CHAN [1985] and note that since |w_n| = ||w||_∞
and ||v||_2 = ||w||_2 = 1, we have |w_n| ≥ n^{-1/2}. Next we have

Q^T A v = Q^T A Π Π^T v = Rw.

Therefore, since the last component of the vector Rw is r_nn w_n, we have

ε = ||Av||_2 = ||Q^T Av||_2 = ||Rw||_2 ≥ |r_nn w_n|,

from which (11.19) follows.
If we let v = v_n, the right singular vector corresponding to the smallest singular
value σ_n(A), we have from (11.19)

σ_n(A) ≥ n^{-1/2} |r_nn|.

Chan suggests that an approximation to v_n is determined by one or two steps of
inverse iteration, from which the permutation can be determined. This leads to a two-
step procedure where first any QR decomposition of A is determined, then the
condition estimator (11.17) is used to determine an elementary permutation matrix
Π. We then compute the QR decomposition RΠ = Q̃R̃, from which we derive

A Π = Q diag(Q̃, I_{m-n}) ( R̃ ),
                          ( 0 )

the QR decomposition of AΠ. The QR decomposition of RΠ can be computed in
less than 2n^2 flops using the updating technique described in Section 21. (See also
DONGARRA et al. [1979], Chapter 10.)
CHAN [1985] has extended the above procedure to the case when A is nearly
rank-deficient with rank r < n-1, by repeatedly applying the one-dimensional
algorithm to smaller and smaller leading blocks of R. He shows that the resulting
algorithm is guaranteed to work for matrices of small rank deficiency and that it is
very likely to work also for large rank deficiency.
The number ρ_n = |r_nn| has a nice interpretation: it is the norm of the smallest
perturbation of the last column in AΠ that will make A exactly rank-deficient.
STEWART [1984a] has suggested computing the set of numbers {ρ_1, ..., ρ_n} for
a sequence of permutations which moves each column of A to the last position,
using the updating techniques referred to above.
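The one-column case can be illustrated on the Kahan matrix of Example 11.1 (a NumPy sketch; for clarity we take v_n from an explicit SVD instead of inverse iteration, and the function names are ours):

```python
import numpy as np

def kahan(n, c=0.2):
    """Kahan's upper triangular matrix (Example 11.1): invariant under QR
    with column pivoting, yet nearly singular."""
    s = np.sqrt(1.0 - c * c)
    T = np.eye(n) - c * np.triu(np.ones((n, n)), 1)
    return (s ** np.arange(n))[:, None] * T     # diag(1, s, ..., s^(n-1)) @ T

def rank_revealing_last_column(A):
    """Permute the column carrying the largest component of v_n to the last
    position; by (11.19) the new R factor has |r_nn| <= sqrt(n)*sigma_n."""
    n = A.shape[1]
    vn = np.linalg.svd(A)[2][-1]        # right singular vector for sigma_n
    perm = np.arange(n)
    j = int(np.argmax(np.abs(vn)))
    perm[[j, n - 1]] = perm[[n - 1, j]]
    return perm, np.linalg.qr(A[:, perm], mode='r')

A = kahan(100)
sigma_n = np.linalg.svd(A, compute_uv=False)[-1]
R0 = np.linalg.qr(A, mode='r')
assert abs(R0[-1, -1]) > 0.1            # near singularity is NOT revealed
perm, R1 = rank_revealing_last_column(A)
assert abs(R1[-1, -1]) <= np.sqrt(100) * sigma_n + 1e-10
```

One swap of columns, guided by v_n, shrinks |r_nn| from about 0.13 down to the order of sigma_n, exactly as the bound (11.19) promises.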
In the rank-revealing QR decomposition in the case when r = rank(A) = n-1 we
used the right singular vector corresponding to σ_n(A) to select the column of A to
permute to the last position in AΠ. If then a basic solution to a least squares problem is
computed from the QR decomposition of AΠ, this will be the column deleted from
the solution, cf. (7.14). We now consider a generalization of this procedure to the case
when r < n-1. We want to select r columns of A, which in some sense are maximally
independent, to be permuted into the first r positions in AΠ. This problem of subset
selection is studied by GOLUB, KLEMA and STEWART [1976]. They suggest the
following algorithm based on the singular value decomposition of A, see also GOLUB
and VAN LOAN ([1983], p. 418):

ALGORITHM 11.1 (Subset selection by SVD). Given A ∈ R^{m×n}, b ∈ R^m, and a method for
computing the numerical rank r of A. The following algorithm computes
a permutation matrix Π and a vector z ∈ R^r such that the first r columns B_1 of AΠ are
independent and such that

||B_1 z - b||_2

is minimized.
Step 1. Compute Σ and V in the SVD of A, A = UΣV^T, Σ = diag(σ_1, ..., σ_n), and
use it to determine the numerical rank r of A.
Step 2. Partition the matrix of right singular vectors according to

V = ( V_11  V_12 )  } r
    ( V_21  V_22 )  } n-r
       r     n-r

and use QR with column pivoting to compute

Q̃^T (V_11^T, V_21^T) Π = (R̃_11, R̃_12).

Step 3. Set AΠ = (B_1, B_2), where Π is the permutation matrix from Step 2 and
B_1 ∈ R^{m×r}. Compute the QR decomposition of B_1 and compute z to minimize

||B_1 z - b||_2.
If the Chan SVD algorithm is used in Step 1, Householder QR in Step 2 and
updating techniques in Step 3, then this algorithm requires a total of about mn^2 +
n^3 - n^2 r + (4/3)r^3 flops. The key step is Step 2, where the permutation Π will tend to
make Ṽ_22 well-conditioned, where

(Ṽ_12^T, Ṽ_22^T) = (V_12^T, V_22^T) Π.

It can be shown (see GOLUB and VAN LOAN [1983], p. 416) that the singular value
σ_r(B_1) is bounded by

σ_r(A)/||Ṽ_22^{-1}||_2 ≤ σ_r(B_1) ≤ σ_r(A),   (11.20)

which is the theoretical basis for this selection strategy.
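A sketch of the subset selection idea (NumPy; since np.linalg.qr has no column pivoting, we code the pivoting rule directly on the leading r rows of V^T; all names are ours):

```python
import numpy as np

def pivot_order(B):
    """Column order produced by QR with column pivoting (MGS form);
    only the permutation is returned."""
    B = np.array(B, dtype=float)
    r, n = B.shape
    perm = np.arange(n)
    for k in range(min(r, n)):
        p = k + int(np.argmax(np.linalg.norm(B[:, k:], axis=0)))
        B[:, [k, p]] = B[:, [p, k]]
        perm[[k, p]] = perm[[p, k]]
        q = B[:, k] / np.linalg.norm(B[:, k])
        B[:, k + 1:] -= np.outer(q, q @ B[:, k + 1:])
    return perm

def subset_select(A, b, r):
    """Pick r 'maximally independent' columns of A via the SVD and pivoted QR
    on the leading r right singular vectors (in the spirit of Algorithm 11.1),
    then solve the reduced least squares problem."""
    Vt = np.linalg.svd(A)[2]
    perm = pivot_order(Vt[:r])          # pivot on the r dominant singular vectors
    cols = np.sort(perm[:r])
    z = np.linalg.lstsq(A[:, cols], b, rcond=None)[0]
    return cols, z

# Third column duplicates the first: the selection keeps only one of them.
a1, a2 = np.array([1., 0., 1.]), np.array([0., 1., 1.])
A = np.column_stack([a1, a2, a1])
cols, z = subset_select(A, np.array([1., 2., 3.]), r=2)
assert set(cols.tolist()) in ({0, 1}, {1, 2})
```

The duplicated column has exactly the same footprint in the dominant right singular vectors, so after one projection step its residual norm vanishes and it can never be chosen twice.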

12. Iterative refinement of linear least squares solutions

Iterative refinement is a technique which can be used either for reducing the
rounding errors in a computed solution or for estimating errors due to rounding in
a computed solution. In iterative refinement the computed least squares solution is
regarded as an initial approximation to the true solution and is corrected in an
iterative process. The cost of this refinement is often quite small.
A method for iterative refinement developed by BJÖRCK [1967, 1968] is based on
computing accurate residuals of the augmented system of m + n equations

( I    A ) ( r )   ( b )
( A^T  0 ) ( x ) = ( 0 )   (12.1)

for the least squares solution x and the residual r = b - Ax. We assume that
rank(A) = n, in which case the system (12.1) is nonsingular.
The description of the iterative refinement becomes a bit more compact if we take
as initial approximations

r^{(0)} = 0,  x^{(0)} = 0.   (12.2)

We then compute a sequence of single precision approximations r^{(s+1)}, x^{(s+1)},
s = 0, 1, ..., where the sth iteration consists of three steps:
Step 1. Compute the residual vectors of the system (12.1),

f^{(s)} = fl_2(b - r^{(s)} - Ax^{(s)}),  g^{(s)} = fl_2(-A^T r^{(s)}),   (12.3)

where fl_2(E) denotes that the expression E is computed using double precision
accumulation of the inner products defining E.
Step 2. Solve for the corrections δr^{(s)} and δx^{(s)} from

( I    A ) ( δr^{(s)} )   ( f^{(s)} )
( A^T  0 ) ( δx^{(s)} ) = ( g^{(s)} )   (12.4)

using single precision.
Step 3. Compute the new approximations

r^{(s+1)} = r^{(s)} + δr^{(s)},  x^{(s+1)} = x^{(s)} + δx^{(s)}.   (12.5)
We now consider the solution of the system (12.4), where for convenience we drop
the superscript

(A T A (X _(9 (12.6)
Note that in (12.6) we have a more general right-hand side than in (12.1) and we have
to modify the algorithms given before to cope with that.
Assume that we have computed the QR decomposition of A by Householder or
Givens transformations. Then the system (12.6) can be transformed into

QTr (R)x=QTf (RT,O)QTr=g.

Here the first n components of QT6r can be solved from the last set of equations and
its last m-n components from the first set. Hence the solution to (12.6) can be
522 A. Bjorck CHAPTER II

computed from
T
R h=g, QTf= (d), 6r=Q(d) Rbx=c-h. (12.7)

Assuming that Q is stored as a product of Householder transformations this takes
4mn − n² flops.
Using the initial approximations (12.2) we have f^(0) = b and g^(0) = 0, and it follows
that r^(1) and x^(1) will just be the computed solution given by Golub's method,
Algorithm 8.2.
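The refinement loop (12.2)-(12.7) can be sketched in NumPy. This is an illustrative reconstruction, not the author's code: the name refine_ls and the loop organization are chosen here, float32 stands in for single precision, and the residuals of the augmented system are accumulated in float64 to mimic fl2(·).

```python
import numpy as np

def refine_ls(A, b, steps=3):
    """Iterative refinement for least squares, following (12.2)-(12.7).

    Working precision is float32 ("single precision"); the residuals of
    the augmented system (12.1) are accumulated in float64, mimicking
    the double precision accumulation fl2 of Step 1."""
    m, n = A.shape
    Q, R = np.linalg.qr(A.astype(np.float32), mode='complete')  # Householder QR
    R1 = R[:n, :n]
    x = np.zeros(n, dtype=np.float32)             # x^(0) = 0, eq. (12.2)
    r = np.zeros(m, dtype=np.float32)             # r^(0) = 0
    for _ in range(steps):
        # Step 1: residuals of (12.1), accumulated in double precision
        f = b - r.astype(np.float64) - A @ x.astype(np.float64)
        g = -A.T @ r.astype(np.float64)
        f32, g32 = f.astype(np.float32), g.astype(np.float32)
        # Step 2: solve for the corrections via (12.7), in single precision
        h = np.linalg.solve(R1.T, g32)            # R^T h = g
        cd = Q.T @ f32                            # Q^T f = (c; d)
        c, d = cd[:n], cd[n:]
        dr = Q @ np.concatenate([h, d])           # dr = Q (h; d)
        dx = np.linalg.solve(R1, c - h)           # R dx = c - h
        # Step 3: update, eq. (12.5)
        r, x = r + dr, x + dx
    return x, r
```

After the first pass x is the single precision QR solution; each further pass reduces the error roughly by the factor appearing in the error analysis below.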
Iterative refinement can equally well be based on the modified Gram-Schmidt
algorithm. Only Step 2 will differ. Assume that we have computed R and

    Q1 = (q1, q2, ..., qn)

by MGS. To solve a system of the form (12.6) we proceed as follows. We solve the two
triangular systems

    R^T h = g,   R δx = y,                                      (12.8)

where y = (y1, ..., yn)^T is computed by taking f^(1) = f and, for k = 1, 2, ..., n,

    y_k = q_k^T f^(k) − h_k,   f^(k+1) = f^(k) − q_k y_k.       (12.9)


It is shown by BJORCK ([1968], pp. 272-273) that this algorithm does not require that
Q1 is accurately orthogonal. We have that δr = f^(n+1) and the solution takes only
2mn + n² flops.
A complete error analysis of the refinement method (12.3)-(12.5) has been given
by BJORCK [1967, 1968]. It is shown that the initial rate of improvement of the
solution is linear with rate

    ρ = ||x^(s) − x||₂ / ||x^(s−1) − x||₂ ≤ c·u·κ(A),           (12.10)

s = 2, 3, ..., where c = c(m, n) is an error constant. If Householder transformations
with inner products accumulated in double precision are used then c < 39n^{3/2}. Note
that it is κ(A) and not κ²(A) that appears in (12.10), even though the condition of the
problem may depend on κ²(A) as shown in Section 5. It has been pointed out by
BJORCK [1978] that the estimate in (12.10) can be improved by substituting for κ(A)

    κ'(A) = min_D κ(AD),   D > 0 a diagonal matrix,

and that this rate is achieved even without actually carrying out the scaling of A by
the optimal D.

EXAMPLE 12.1 (BJORCK and GOLUB [1967]). To illustrate the method of iterative
refinement we consider the linear least squares problem where A is the last six
columns of the inverse of the Hilbert matrix H8 ∈ R^{8×8}, which has elements

    h_ij = 1/(i + j − 1),   1 ≤ i, j ≤ 8.

Two right-hand sides b1 and b2 are chosen so that the exact solution equals

    x = (1/3, 1/3, ..., 1/3)^T.

For b = b1 the system Ax = b is compatible; for b = b2 the norm of the residual
r = b − Ax equals 1.04·10^7. Hence for b2 the term proportional to κ²(A) in the
perturbation bound (5.11) dominates.
The refinement algorithm (12.2)-(12.5) was run on a computer with unit roundoff
u = 1.16·10^{-11}. The systems (12.4) were solved by the method (12.7) using a QR
decomposition computed by Householder transformations. We stress that it is
essential that double precision accumulation of inner products is used in Step 1,
but otherwise all computations can be performed in single precision. We give below
the first components of the successive approximations x^(s), r^(s), s = 1, 2, 3, ..., for the
right-hand sides b1 and b2.

    rhs b1                                rhs b2

    x1^(1) =  3.33323 25269·10^{-1}       x1^(1) = 5.56239 01547·10^{-1}
    x1^(2) =  3.33333 35247·10^{-1}       x1^(2) = 3.37777 18060·10^{-1}
    x1^(3) =  3.33333 33334·10^{-1}       x1^(3) = 3.33311 57908·10^{-1}
    x1^(4) =  3.33333 33334·10^{-1}       x1^(4) = 3.33333 33117·10^{-1}

    r1^(1) =  9.32626 24303·10^{-5}       r1^(1) = 2.80130 68864·10^{6}
    r1^(2) =  5.05114 03416·10^{-7}       r1^(2) = 2.79999 98248·10^{6}
    r1^(3) =  3.65217 71718·10^{-11}      r1^(3) = 2.79999 99995·10^{6}
    r1^(4) = −1.95300 70174·10^{-13}

We observe a gain of almost three digits of accuracy per step in the approximations to
x1 and r1 for both right-hand sides b1 and b2. This is consistent with the estimate
(12.10) since

    κ'(A) = 5.03·10^{8},   uκ'(A) = 5.84·10^{-3}.

For the right-hand side b1 the approximation x^(1) is correct to full single precision
accuracy. It is interesting to note that for the right-hand side b2 the effect of the error
term proportional to uκ²(A) is evident in that the computed solution x^(1) is in error
by a factor of 10³. However, x^(4) has eight correct digits and r^(3) is close to the true
value 2.8·10^{6}.

REMARK 12.1. The key to the success of the iterative refinement scheme is that
approximations to both x and r are simultaneously improved. GOLUB and
WILKINSON [1966] suggested a simpler scheme where only x is improved. Here one
takes x^(s+1) = x^(s) + δx^(s), where δx^(s) is the solution to

    min ||A δx^(s) − r^(s)||₂,   r^(s) = fl2(b − Ax^(s)).

This scheme results by taking g^(s) = 0 in the previous scheme. Unless the system
Ax = b is nearly compatible it does not work as well.

13. Computing the variance-covariance matrix

Consider the linear statistical model

    Ax = b + ε,                                                 (13.1)

relating the parameter vector x ∈ R^n in the model and the observation vector b ∈ R^m.
Assume that the random vector ε has zero mean and variance-covariance matrix
σ²I. Denote by x̂ a least squares estimate of x.
If rank(A) = n, then the error in the estimate x̂ can be written

    x − x̂ = (A^T A)^{-1} A^T ε = (R^{-1}, 0) Q^T ε,             (13.2)

where R and Q are the factors in the QR decomposition of A. Since Q is orthogonal
the random vector Q^T ε also has zero mean and variance-covariance matrix σ²I.
Hence, the error x − x̂ has variance-covariance matrix σ²C, where

    C = (A^T A)^{-1} = (R^T R)^{-1} = R^{-1} R^{-T}.            (13.3)

For the rank-deficient case, see Section 23.


If σ² is not known then it is commonly estimated by

    σ̂² = ||Ax̂ − b||²₂ / (m − r),   r = rank(A).                 (13.4)

In order to assess the accuracy of the computed estimate of x it is often required to
compute the matrix C, or part of it. In particular the variance of the component x̂_i is
given by σ²c_ii, where

    diag(C) = diag(c11, ..., cnn).                              (13.5)
To compute C in the nonsingular case we can first compute the inverse S = R^{-1}
and then form C = SS^T. The matrix S satisfies the triangular system RS = I. It follows
that S is also upper triangular and its elements can be computed by the following
algorithm:

    for j = n, ..., 1
        s_jj := 1/r_jj
        for i = j−1, ..., 1
            s_ij := −( Σ_{k=i+1}^{j} r_ik s_kj ) / r_ii

Here the elements of S can overwrite the corresponding elements of R in storage. The
algorithm requires about ⅙n³ flops.
The elements of diag(C) = diag(SS^T) are just the squared 2-norms of the rows of
S and can be computed in a further n² flops by

    c_ii = Σ_{j=i}^{n} s²_ij,   i = 1, 2, ..., n.               (13.6)
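The inversion algorithm and (13.6) can be sketched as follows. The function names are chosen here, and this is an unblocked reference version rather than the in-place variant that overwrites R.

```python
import numpy as np

def upper_tri_inverse(R):
    """Compute S = R^{-1} from RS = I by back substitution;
    S is again upper triangular."""
    n = R.shape[0]
    S = np.zeros_like(R)
    for j in range(n - 1, -1, -1):
        S[j, j] = 1.0 / R[j, j]
        for i in range(j - 1, -1, -1):
            # s_ij = -(sum_{k=i+1}^{j} r_ik s_kj) / r_ii
            S[i, j] = -(R[i, i+1:j+1] @ S[i+1:j+1, j]) / R[i, i]
    return S

def covariance_diagonal(R):
    """diag(C) via (13.6): squared 2-norms of the rows of S = R^{-1}."""
    S = upper_tri_inverse(R)
    return np.sum(S**2, axis=1)
```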

The matrix C is symmetric, and therefore we need only compute its upper triangular
part. This takes about ⅙n³ flops and can be sequenced so that the elements of C overwrite
those of S.
In many situations the matrix C only occurs as an intermediate quantity in
a formula. For example the variance of a linear functional q = f^T x is equal to

    v = f^T C f = f^T R^{-1} R^{-T} f = z^T z,   z = R^{-T} f.  (13.7)

Thus v may be computed by solving a single triangular system R^T z = f and forming
z^T z. This is a more stable and efficient approach than using the expression involving
C.
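In code, (13.7) is essentially a one-liner (a sketch; the function name and test data are chosen here):

```python
import numpy as np

def functional_variance(R, f):
    """v = f^T C f with C = (R^T R)^{-1}, computed via (13.7):
    solve R^T z = f, then form v = z^T z; C is never formed."""
    z = np.linalg.solve(R.T, f)
    return z @ z
```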
In case C is needed there is an alternative way of computing C without inverting
R. We have from (13.3), multiplying by R from the left,

    RC = R^{-T}.                                                (13.8)

The diagonal elements of R^{-T} are simply r_kk^{-1}, k = n, ..., 1, and since R^{-T} is lower
triangular it has ½n(n−1) zero elements. Hence, ½n(n+1) elements of R^{-T} are known
and the corresponding equations in (13.8) suffice to determine the elements in the
upper triangular part of the symmetric matrix C.
To compute the elements in the last column c_n of C we solve the system

    R c_n = r_nn^{-1} e_n,   e_n = (0, ..., 0, 1)^T,

by backsubstitution. This gives

    c_nn = r_nn^{-2},   c_in = −r_ii^{-1} Σ_{j=i+1}^{n} r_ij c_jn,   i = n−1, ..., 1.          (13.9)

By symmetry c_ni = c_in, i = n−1, ..., 1, so we also know the last row of C. Now assume
that we have computed the elements

    c_ij = c_ji,   j = n, ..., k+1,   i ≤ j.

We now determine the elements c_ik, i ≤ k. We have

    c_kk r_kk + Σ_{j=k+1}^{n} r_kj c_jk = r_kk^{-1}.

But the elements c_kj = c_jk, j = k+1, ..., n, have already been computed and thus

    c_kk = r_kk^{-1} ( r_kk^{-1} − Σ_{j=k+1}^{n} r_kj c_kj ).   (13.10)

Similarly, for i = k−1, ..., 1,

    c_ik = −r_ii^{-1} ( Σ_{j=i+1}^{k} r_ij c_jk + Σ_{j=k+1}^{n} r_ij c_kj ).          (13.11)

Using the formulas (13.9)-(13.11) all elements of C can be computed in about ⅓n³
flops.
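A dense sketch of (13.9)-(13.11) follows (the function name is mine; a sparse implementation would restrict all sums to the nonzero positions of R, as done in Section 16):

```python
import numpy as np

def covariance_from_R(R):
    """Compute C = (R^T R)^{-1} directly from R via (13.9)-(13.11),
    column by column from the last, without inverting R."""
    n = R.shape[0]
    C = np.zeros((n, n))
    for k in range(n - 1, -1, -1):
        # diagonal element, eq. (13.10); the c_kj, j > k, are already known
        C[k, k] = (1.0 / R[k, k] - R[k, k+1:] @ C[k, k+1:]) / R[k, k]
        # off-diagonal elements of column k, eq. (13.11)
        for i in range(k - 1, -1, -1):
            s = R[i, i+1:k+1] @ C[i+1:k+1, k] + R[i, k+1:] @ C[k, k+1:]
            C[i, k] = -s / R[i, i]
            C[k, i] = C[i, k]                      # symmetry
    return C
```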
For the case when the Cholesky factor R is sparse GOLUB and PLEMMONS [1980]
have shown how to use the above algorithm to compute all elements of C associated
with nonzero elements of R very efficiently. Note that since R is nonsingular these
include the diagonal elements of C. We discuss this algorithm in Section 16.

14. Weighted and generalized linear least squares problems

The generalized linear least squares problem is to find a vector x ∈ R^n that solves

    min_x (Ax − b)^T W^{-1} (Ax − b),                           (14.1)

where b ∈ R^m is a given vector, A ∈ R^{m×n} a given matrix and W ∈ R^{m×m} a known
symmetric and positive-definite matrix. This problem arises in finding the least
squares estimate of the vector x when we are given the linear model

    Ax = b + ε,                                                 (14.2)

with ε an unknown random vector of zero mean and covariance matrix σ²W.
Consider a factorization of W,

    W = BB^T,                                                   (14.3)

where B ∈ R^{m×m}. Often B itself is given rather than W, or B can be computed from
W by the Cholesky decomposition, see Algorithm 6.1. Then (14.1) is equivalent to

    min_x ||B^{-1}(Ax − b)||₂,                                  (14.4)

which can be written as the standard linear least squares problem

    min_x ||Āx − b̄||₂,   Ā = B^{-1}A,   b̄ = B^{-1}b.

However, this is not a stable computational approach when B is ill conditioned,
since then Ā and b̄ will be poorly determined.
When the components of the random vector ε are uncorrelated, the matrix
W is a diagonal matrix

    W = diag(w1, w2, ..., wm),   wi > 0,   i = 1, 2, ..., m.
If we introduce the diagonal weight matrix

    D = diag(d1, d2, ..., dm),   di = wi^{-1/2},   i = 1, 2, ..., m,          (14.5)

then (14.1) becomes a weighted linear least squares problem

    min_x ||D(Ax − b)||₂.                                       (14.6)

In some applications the weights di, i= 1, 2,..., m, may vary widely in size. If we
assume that the equations have been ordered so that

    d1 ≥ d2 ≥ ... ≥ dm,                                         (14.7)

then we have d1/dm >> 1. This corresponds to the case when some components of the
error vector have much smaller variance than the rest. We call such weighted
problems stiff, since in the limit, when some di tend to infinity, the corresponding
equations will become exactly satisfied, see below.
For stiff problems the condition number κ(DA) will be large and an upper bound
is given by

    κ(DA) ≤ κ(D)κ(A) = (d1/dm)·κ(A).
Note that this does not mean that the problem of determining x from given data D,
A and b is ill conditioned. For the weighted problem (14.6) the perturbations in DA
and Db will have a special form and therefore the perturbation analysis given in
Section 5 is not relevant in this case. Still, special care is needed in solving stiff
weighted linear least squares problems.
We first remark that the method of normal equations is in general not well suited
for solving stiff problems. To illustrate this we consider a case when only the first m1
equations are weighted. Let

    D(γ) = diag(γ I_{m1}, I_{m2}),   m1 + m2 = m,               (14.8)

and partition A and b conformally,

    A = ( A1 ),   b = ( b1 ).
        ( A2 )        ( b2 )

Then the matrix of normal equations can be written

    M = (DA)^T (DA) = γ² A1^T A1 + A2^T A2.

If γ is sufficiently large, then M will be completely dominated by the first term and we
might lose all information contained in A2. The matrix in Remark 6.2 is of this type.
We conclude that the method of normal equations is not well behaved when
κ(DA) >> κ(A).
Let x(γ) be the solution to the weighted problem (14.6), when D = D(γ) is given by
(14.8), and let r1(γ) = b1 − A1 x(γ) be the residual vector corresponding to the first set
of equations. We now relate x(γ) and r1(γ) to the corresponding quantities x and r1
for the unweighted problem. It can be shown that

    r1(γ) = (I_{m1} + γ̄² A1 (A^T A)^{-1} A1^T)^{-1} r1,         (14.9)
    x(γ) = x + γ̄² (A^T A)^{-1} A1^T r1(γ),   γ̄² = γ² − 1.

The formulas (14.9) can be used to compute the solution x(γ) to the weighted
problem by updating the solution to the unweighted problem. If m1 << m, then the
extra work in the updating step is not great and the method of normal equations
might be used to solve the unweighted problem. Note that if rank(A1) = m1, then for
large values of γ the residual r1(γ) is proportional to γ^{-2}.

We now consider the use of methods based on the QR decomposition of A for
solving weighted problems. We first show by an example that these algorithms can
give poor accuracy for stiff problems unless a proper ordering of the rows is used.

EXAMPLE 14.1 (POWELL and REID [1969]). Consider the matrix

    A = (  0      2      1   )
        ( 10^6   10^6    0   )
        ( 10^6    0    10^6  )
        (  0      1      1   ).

Using exact arithmetic we have after the first step of QR decomposition by
Householder transformations (Algorithm 8.1) the reduced matrix

    (  ½·10^6 − 2^{1/2}     −½·10^6 − 2^{-1/2} )
    ( −½·10^6 − 2^{1/2}      ½·10^6 − 2^{-1/2} )
    (        1                      1          ),

but if five-decimal floating-point computation is used the terms −2^{1/2} and −2^{-1/2}
in the first and second rows are lost. This is equivalent to the loss of all information
present in the first row of A. This loss is disastrous because the number of rows
containing large elements is less than the number of components in x, so there is
a substantial dependence of the solution x on the first row of A.

If in Example 14.1 the two large rows are permuted to the top of the matrix A, then
the Householder algorithm works well. POWELL and REID [1969] suggest that the
Householder algorithm be extended to include row interchanges, so that the element
of largest absolute value in the pivot column is permuted to the top. They give an
error analysis for this extended algorithm which shows that it is stable also for stiff
weighted problems. It can also be shown that QR decomposition by Givens
rotations is stable for stiff problems, if row interchanges are included.
Assume that the weights di, i = 1, ..., m, have been chosen so that the row norms of
the unweighted matrix A are about the same. Then in most cases it will be sufficient
to initially sort the rows in A and b by decreasing weights so that (14.7) is satisfied.
Gram-Schmidt orthogonalization is easily seen to be invariant under row
permutations. Therefore, the modified Gram-Schmidt method is the only ortho-
gonalization method that works satisfactorily for stiff problems without row
interchanges. We mention that another stable method for stiff problems is the
Peters-Wilkinson method, see PETERS and WILKINsON [1970] and BJORCK and
DUFF [1980].
Now, consider again the general case when the covariance matrix W is not
a diagonal matrix. Let the eigendecomposition of W be

    W = P^T Λ P = BB^T,

where P ∈ R^{m×m} is orthogonal and

    B = P^T Λ^{1/2},   Λ^{1/2} = diag(λ1^{1/2}, ..., λm^{1/2}).

Then B^{-1} = Λ^{-1/2} P and (14.4) becomes

    min_x ||Λ^{-1/2}(PAx − Pb)||₂.                              (14.10)

If the orthogonal transformation P is applied to A and b, then (14.10) is a weighted
linear least squares problem, which can be solved as outlined above. To compute Λ,
PA and Pb takes about 2m³ + 4m²n flops.
A stable and more efficient algorithm for the generalized linear least squares
problem has been given by PAIGE [1979a, 1979b]. The method of Paige is based on
the observation that (14.4) is equivalent to the problem
    min ||v||₂   s.t.   Ax = b + Bv.                            (14.11)
This is a more general formulation and even allows for a rank-deficient B. We
assume here that B is nonsingular and A has full rank n. For a treatment of the
general case see PAIGE [1979a] and Section 23.
In Paige's method we first compute the QR decomposition of A,

    Q^T A = ( R ),   Q = (Q1, Q2),                              (14.12)
            ( 0 )

and apply the transformation Q^T also to b and B:

    Q^T b = ( c1 ) }n          Q^T B = ( C1 ) }n
            ( c2 ) }m−n,               ( C2 ) }m−n.

The constraints in (14.11) can now be written

    Rx = c1 + C1 v,   0 = c2 + C2 v.                            (14.13)
For any vector v ∈ R^m we can determine x so that the first set of these equations is
satisfied. We now determine an orthogonal matrix P ∈ R^{m×m} such that

    P^T C2^T = (  0  ) }n                                       (14.14)
               ( S^T ) }m−n,

where the matrix S is upper triangular. By the nonsingularity of B the matrix S will
be nonsingular. Note that (14.14) after a permutation of rows is just the QR
decomposition of C2^T. Now the second set of constraints in (14.13) becomes

    0 = c2 + S u2,   where   P^T v = u = ( u1 ).                (14.15)
                                         ( u2 )

Since P is orthogonal we have ||v||₂ = ||u||₂ and so the minimum in (14.11) is found by
taking

    u1 = 0,   u2 = −S^{-1} c2,   v = P2 u2,
where P = (P1, P2), and solving the triangular system in (14.13) for x.
An algorithm for (14.11) based on (14.12)-(14.15) requires a total of about
⅔m³ + m²n − mn² + ⅓n³ flops. For large values of m the work in the QR decom-
position of C2 dominates.
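Paige's method can be sketched with dense NumPy factorizations. This is a reconstruction under the stated assumptions (B nonsingular, rank(A) = n), not Paige's implementation; the QR factorization of C2^T below realizes (14.14) up to a permutation.

```python
import numpy as np

def paige_gls(A, b, B):
    """Sketch of Paige's method for min ||v||_2 s.t. Ax = b + Bv,
    following (14.12)-(14.15); B^{-1} is never formed."""
    m, n = A.shape
    Q, Rfull = np.linalg.qr(A, mode='complete')   # (14.12)
    R = Rfull[:n, :]
    c = Q.T @ b
    c1, c2 = c[:n], c[n:]
    C = Q.T @ B
    C1, C2 = C[:n, :], C[n:, :]
    # The orthogonal P of (14.14) is, up to a permutation, the QR
    # factorization of C2^T: here C2 = (Rt^T, 0) Qt^T.
    Qt, Rt = np.linalg.qr(C2.T, mode='complete')
    w1 = np.linalg.solve(Rt[:m-n, :].T, -c2)      # from 0 = c2 + Rt^T w1
    v = Qt @ np.concatenate([w1, np.zeros(n)])    # minimum-norm v
    x = np.linalg.solve(R, c1 + C1 @ v)           # first block of (14.13)
    return x, v
```

The point of the construction is that B is only multiplied by orthogonal matrices, which is what makes the method stable when B is ill conditioned.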
PAIGE [1979a] obtains a perturbation analysis for the problem (14.4) by using the
formulation (14.11), and gives a rounding error analysis to show that the above
algorithm is numerically stable. The algorithm can be generalized in a straight-
forward way to rank-deficient A and B. For details see PAIGE [1979a].
In case the matrix B has been obtained from the Cholesky decomposition
W = BB^T of W it is of lower triangular form. Then it is advantageous to carry out the
two QR decompositions in (14.12) and (14.14) together, maintaining the lower
triangular form throughout. This algorithm, which requires careful sequencing of
Givens transformations, has been described by PAIGE [1979b]. It uses a total of
about ⅓m³ + m²n − ⅓n³ flops.
It can be shown that the computed solution x to (14.11) is an unbiased estimate of
x for the model (14.2). The covariance matrix of x is σ²C where

    C = R^{-1} L^T L R^{-T},   L^T = C1 P1.                     (14.16)
CHAPTER III

Sparse Least Squares Problems

In this chapter we will review methods for solving the linear least squares problem
    min_x ||Ax − b||₂

which are effective when the matrix A is sparse, i.e. when the matrix A has relatively
few nonzero elements.
In RICE [1983] sources of very large least squares problems are identified and
discussed. Note that very large problems are by necessity sparse. The following
sources are mentioned:
(a) the geodetic survey problem,
(b) the photogrammetry problem,
(c) the molecular structure problem,
(d) gravity field of the earth,
(e) tomography,
(f) force method in structural analysis,
(g) very long base line problem,
(h) surface fitting,
(i) cluster analysis and pattern matching.
A sparse least squares problem of spectacular size is described in KOLATA [1978].
This is the problem of least squares adjustment of coordinates of the geodetic
stations comprising the North American Datum. It consists of about six million
equations in 400,000 unknowns ( = twice the number of stations). The equations are
mildly nonlinear so two or three linearized problems of this size need to be solved.
We assume initially that A ∈ R^{m×n} with rank(A) = n < m. However, problems where
rank(A) = m < n or rank(A) < min(m, n) occur in practice. Other important variations
include weighted problems, and/or linearly constrained problems.
Solving sparse least squares problems is closely related to solving sparse positive
definite systems. An excellent introduction to theory and methods for the latter
class of problems is given in the monograph by GEORGE and Liu [1981].

15. Storage schemes for sparse matrices

In order to solve large sparse matrix problems efficiently it is important that we only
store and operate on the nonzero elements of the matrices. We must also try to
531

minimize fill-in as the computation proceeds, which is the term used to denote the
creation of new nonzeros.
We first consider some storage schemes for a sparse rectangular matrix A. In the
general sparse storage scheme the nonzero elements of A are stored row by row in
a vector AN together with two integer vectors JA and IA. The vector JA gives the
column subscripts of the nonzeros and the elements in IA point to the start of
nonzeros in each row in AN.

EXAMPLE 15.1. In a general sparse storage scheme the matrix

    A = ( a11   0    a13   0    0  )
        ( a21  a22    0    0    0  )
        (  0    0    a33   0    0  )
        (  0   a42    0   a44   0  )
        (  0    0     0   a54  a55 )
        (  0    0     0    0   a65 )

would be stored as

    AN = (a11, a13, a21, a22, a33, a42, a44, a54, a55, a65),
    JA = (1, 3, 1, 2, 3, 2, 4, 4, 5, 5),
    IA = (1, 3, 5, 6, 8, 10).
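The general sparse storage scheme is easy to state in code. This is a sketch with names chosen here; 0-based indices are used, while the example above is 1-based, and the values 1, ..., 10 stand in for the a_ij.

```python
import numpy as np

def to_general_sparse(A):
    """Row-wise general sparse storage: values AN, column subscripts JA,
    and row pointers IA (row i occupies AN[IA[i]:IA[i+1]])."""
    AN, JA, IA = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0.0:
                AN.append(a)
                JA.append(j)
        IA.append(len(AN))
    return AN, JA, IA

# The matrix of Example 15.1 with a_ij replaced by running values 1..10
A = np.array([[1., 0., 2., 0., 0.],
              [3., 4., 0., 0., 0.],
              [0., 0., 5., 0., 0.],
              [0., 6., 0., 7., 0.],
              [0., 0., 0., 8., 9.],
              [0., 0., 0., 0., 10.]])
AN, JA, IA = to_general_sparse(A)
```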

The storage can be divided into primary storage for AN and overhead storage for
JA and IA. We remark that the overhead storage can often be decreased by using
a compressed scheme due to Sherman, see GEORGE and Liu ([1981], pp. 139-142).
The simplest class of sparse rectangular matrices is the class of band matrices.
Such matrices have the property that in each row all nonzero elements are contained
in a narrow band. For a matrix A we define

    f_i = min{j: a_ij ≠ 0},   l_i = max{j: a_ij ≠ 0}.           (15.1)

The numbers f_i and l_i are simply the column subscripts of the first and last nonzero in the
ith row of A.

DEFINITION 15.1. The rectangular matrix A ∈ R^{m×n} is said to have row bandwidth w if

    w = max_{1≤i≤m} w_i,   w_i = l_i − f_i + 1,                 (15.2)

where w_i is the bandwidth of the ith row.

For this structure to have practical significance we need to have w << n. Matrices of
small row bandwidth often occur naturally, since they correspond to a situation
where only variables "close" to each other are coupled by observations. Note that
although the row bandwidth is independent of the row ordering it will depend on the
column ordering.
An alternative storage scheme for banded matrices is the envelope storage scheme.
In this all elements in the ith row with column indices j such that f_i ≤ j ≤ l_i are stored.
These elements have indices in the envelope of A, denoted by Env(A), which is
defined as

    Env(A) = {(i, j): f_i ≤ j ≤ l_i}.

EXAMPLE 15.2. In an envelope storage scheme, the matrix of Example 15.1 would be
stored as

    AN = (a11, 0, a13, a21, a22, a33, a42, 0, a44, a54, a55, a65),
    FA = (1, 1, 3, 2, 4, 5),
    IA = (1, 4, 6, 7, 10, 12).

Here FA contains the column index f_i for each row.

Note that by increasing the primary storage slightly we have reduced the storage
overhead. In the simplest case, when l_i − f_i + 1 = w, i = 1, 2, ..., m, we need not store the
vector IA. Alternatively we can then use a two-dimensional array of dimension m by
w to store AN in this case.
Storage schemes similar to the ones given above can be used for storing a sparse
symmetric positive-definite matrix B ∈ R^{n×n}. Obviously it is sufficient to store the
upper (or lower) triangular part of B, including the main diagonal. The general
scheme given in Example 15.1 can be used unchanged. However, since for a positive-
definite matrix all diagonal elements are positive it is sometimes convenient to store
these in a separate vector, see GEORGE and Liu [1981, pp. 79-80].
We now define the bandwidth of a symmetric matrix.

DEFINITION 15.2. The bandwidth (or half-bandwidth) of a symmetric matrix
B ∈ R^{n×n} is given by

    β = max_{1≤i≤n} β_i,   β_i = l_i(B) − i.

The number β_i is called the ith bandwidth of B.

Thus a symmetric matrix of small bandwidth has all its nonzeros clustered "near"
the main diagonal.
For symmetric band matrices we can use an envelope storage scheme, where all
elements in a row from the diagonal to the last nonzero are stored. The envelope
of a symmetric matrix B is defined by

    Env(B) = {(i, j): i ≤ j ≤ l_i(B)}.

This storage scheme is illustrated in the following example:



EXAMPLE 15.3. In an envelope storage scheme a symmetric matrix

    B = ( b11  b12  b13   0    0  )
        (      b22   0   b24   0  )
        (           b33   0    0  )
        ( symm           b44  b45 )
        (                     b55 )

would be stored as

    BN = (b11, b12, b13, b22, 0, b24, b33, b44, b45, b55),
    IB = (1, 4, 7, 8, 10).

A symmetric band matrix with constant bandwidth would be stored similarly


except that the vector IB is no longer needed since the number of elements stored in
each row is constant.

16. The method of normal equations for sparse problems

In the method of normal equations there are two steps where fill-in, i.e. creation of
new nonzeros, may occur. The first step is when the matrix ATA is formed and the
second step is in computing the Cholesky factor of ATA.
We first discuss the fill-in when the matrix A^T A is formed. Partitioning A by rows
we have (cf. (6.5))

    A^T A = Σ_{i=1}^{m} a_i a_i^T,                              (16.1)

where a_i^T now denotes the ith row of A. This expresses A^T A as the sum of m matrices
of rank one. We now make the important assumption that no numerical cancellation
occurs in the sum (16.1), that is, whenever two nonzero quantities are added or
subtracted, the result is nonzero. Then it follows that the nonzero structure of A^T A is
the direct sum of the nonzero structures of a_i a_i^T, i = 1, 2, ..., m. Another character-
ization is the following:

THEOREM 16.1. Assume that no numerical cancellation occurs in the computation of
A^T A. Then

    (A^T A)_jk ≠ 0   ⟺   a_ij ≠ 0 and a_ik ≠ 0

for at least one row i = 1, 2, ..., m.

REMARK 16.1. The no-cancellation assumption makes it possible to determine the


nonzero structure of ATA from that of A without any numerical computations. If the
assumption is not satisfied this may considerably overestimate the number of
nonzeros in A^T A. For example, if A is orthogonal then A^T A = I is sparse even
when A is dense.
We now prove a relation between the row bandwidth of the matrix A and the
bandwidth of the corresponding matrix of normal equations ATA.

THEOREM 16.2. Assume that the matrix A ∈ R^{m×n} has row bandwidth w. Then the
symmetric matrix A^T A has bandwidth β ≤ w − 1.

PROOF. From Definition 15.1 it follows that

    a_ij ≠ 0 and a_ik ≠ 0   ⟹   |j − k| < w.

The theorem now follows from the observation that

    (A^T A)_jk = Σ_{i=1}^{m} a_ij a_ik = 0   if |j − k| ≥ w.    □

From the no-cancellation assumption it follows that if A contains one full row
then A^T A will be full even if the rest of the matrix is sparse; for example

    A = ( ×  ×  ×  × )
        ( ×           )
        (    ×        )
        (       ×     )   ⟹   A^T A full.
Sparse problems with only a few dense rows can be treated by updating the solution
to the corresponding problem where the dense rows have been deleted (see the end of
this section).
We now show that there are problems where even though A is fairly sparse in all
columns, the matrix A^T A will be practically full. We consider a stochastic model
where each element a_ij is an independent random variable and

    P{a_ij ≠ 0} = p << 1.

Then we have P{a_ij a_ik = 0} = 1 − p², j ≠ k, and, since

    (A^T A)_jk = Σ_{i=1}^{m} a_ij a_ik,

it follows that

    P{(A^T A)_jk ≠ 0} = 1 − (1 − p²)^m ≈ 1 − e^{−mp²}.

Now assume that the expected number of nonzeros in a column is m^{1/2}. Then
p = m^{−1/2}, mp² = 1, and A^T A will be practically full. Large problems which have
these characteristics occur in image reconstruction and certain other inverse
problems. They must often be solved by iterative methods, see Section 20.
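A small Monte Carlo experiment (parameters chosen here) illustrates the estimate: with m = 400 and p = m^{-1/2}, the predicted off-diagonal density of A^T A is 1 − e^{−1} ≈ 0.63.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 400, 50
p = m ** -0.5                         # about sqrt(m) nonzeros expected per column
# Random sparsity pattern with independent entries, P{a_ij != 0} = p
A = np.where(rng.random((m, n)) < p, rng.standard_normal((m, n)), 0.0)
G = A.T @ A
offdiag = G[~np.eye(n, dtype=bool)]
density = np.mean(offdiag != 0)       # observed P{(A^T A)_jk != 0}
predicted = 1.0 - np.exp(-m * p * p)  # 1 - e^{-mp^2}
```

Random real entries make exact cancellation a probability-zero event, so the no-cancellation assumption holds here.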

Let Pr and Pc be permutation matrices of order m and n respectively and consider
the matrix Pr A Pc. Then

    (Pr A Pc)^T (Pr A Pc) = Pc^T A^T Pr^T Pr A Pc = Pc^T A^T A Pc.

Thus, the ordering of the rows of A has no effect on the matrix A^T A, which also
follows directly from (16.1). Reordering the columns of A corresponds to a
symmetric reordering of the rows and columns of A^T A, and will not affect the
number of nonzeros in A^T A, only their positions. However, as we will see later, it may
greatly affect the fill-in in the Cholesky factor R.
To represent the structure of the symmetric matrix A^T A it is convenient to
introduce the undirected graph of A^T A. A graph G = (X, E) consists of a finite set of
nodes X together with a set E of edges, which are unordered pairs of nodes. In an
ordered graph the nodes are labeled 1, 2, ..., n, where n is the number of nodes. The
ordered graph of A^T A is one for which the n nodes are numbered 1, 2, ..., n and
{x_i, x_j} ∈ E if and only if (A^T A)_ij = (A^T A)_ji ≠ 0, i ≠ j. Here x_i denotes the node of
X with label i. For any permutation matrix Pc ∈ R^{n×n} the unlabeled graphs of A^T A
and Pc^T A^T A Pc are the same but the labelings are different. Hence, the unlabeled
graph represents the structure of A^T A without any particular ordering. Finding a
good permutation for A^T A is equivalent to finding a good labeling for its graph.
The graph of A^T A can be constructed directly from the structure of the matrix A.
Appealing to (16.1) and the no-cancellation assumption, the graph of A^T A is the
direct sum of all the graphs of a_i a_i^T, i = 1, 2, ..., m. Note that the nonzeros in any row
a_i^T will generate a subgraph where all pairs of nodes are connected. Such a subgraph
is called a clique and corresponds to a full submatrix in A^T A. In Fig. 16.1 we give an
example of a matrix A and the labeled graph of A^T A, with nonzeros denoted by ×.

x x

x X

Matrix A Graph of ATA

FIG. 16.1. A matrix A and the labeled graph of ATA.

We now consider the second step in the method of normal equations: the
computation of the Cholesky factorization R^T R = A^T A. Before this is carried out
numerically it is important to find a permutation matrix Pc such that Pc^T A^T A Pc has
a sparse Cholesky factor R. A number of heuristic reorderings are known which can
substantially reduce the fill-in during the factorization.
The simplest reordering methods are those which try to minimize the bandwidth
or envelope of the matrix ATA. These are motivated by the following important
relation between ATA and its Cholesky factor.

THEOREM 16.3. Let R be the Cholesky factor of A^T A. Then it holds that

    Env(R + R^T) = Env(A^T A).

PROOF. See GEORGE and LIU ([1981], pp. 52-53). □

Hence, the Cholesky factor R inherits the envelope of A^T A and zeros outside the
envelope will not suffer fill-in during the Cholesky factorization. In particular, if A^T A
has bandwidth β then its Cholesky factor R will have upper bandwidth β.
Perhaps the most widely used envelope reduction ordering algorithm is the
reverse Cuthill-McKee ordering. For a description and analysis of this ordering, see
GEORGE and LIu ([1981], pp. 58-78). For this ordering we should use the envelope
or band storage scheme.
When A has constant bandwidth w it follows from Theorems 16.2 and 16.3 that
A^T A and R both have bandwidth w − 1. Then the formation of A^T A and A^T b
followed by the Cholesky factorization and solution of R^T Rx = A^T b requires

    ½mw(w + 1) + mw + ½n(w − 1)(w + 2) + n(2w − 1) + O(m + n) flops

if full advantage is taken of the band structure of the matrices involved.
In other ordering algorithms the object is to exploit all zeros in the Cholesky
factor R. The most important of these are the minimum degree algorithm and
various nested dissection schemes. For these the storage scheme for general sparse
matrices should be used. The minimum degree algorithm is briefly described below.
For nested dissection orderings see Section 19.
We now outline the structure of an algorithm based on the normal equations for
solving sparse linear least squares problems:

ALGORITHM 16.1.
Step 1. Determine symbolically the structure of A^T A.
Step 2. Determine a column permutation Pc such that Pc^T A^T A Pc has a sparse
Cholesky factor R.
Step 3. Perform the Cholesky factorization of Pc^T A^T A Pc symbolically to generate
a storage structure for R.
Step 4. Compute B = Pc^T A^T A Pc and c = Pc^T A^T b numerically, storing B in the data
structure of R.
Step 5. Compute the Cholesky factor R numerically and solve R^T z = c, Ry = z,
giving the solution x = Pc y.
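The pipeline can be sketched on a small problem. This is an illustration, not production code: reverse Cuthill-McKee (available as scipy.sparse.csgraph.reverse_cuthill_mckee) stands in for the ordering of Step 2, the symbolic steps are skipped, and a dense Cholesky is used. The test matrix couples every variable to variable 0, so in the natural order A^T A is an arrow matrix whose Cholesky factor fills in completely; after reordering, the hub variable is eliminated last and the factor stays sparse.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# A "star" problem: each observation couples variable 0 with one other
# variable, so A^T A is an arrow matrix with a dense first row and column.
n = 20
rows = [np.eye(n)[0]] + [np.eye(n)[0] + np.eye(n)[i] for i in range(1, n)]
A = np.array(rows)
B = A.T @ A                                    # Step 1: structure of A^T A

# Step 2: a fill-reducing ordering (reverse Cuthill-McKee here)
perm = reverse_cuthill_mckee(csr_matrix(B), symmetric_mode=True)
Bp = B[np.ix_(perm, perm)]

# Steps 3-5 collapsed: numeric Cholesky, counting nonzeros in the factor
def chol_nnz(M):
    L = np.linalg.cholesky(M)
    return int(np.sum(np.abs(L) > 1e-12))

nnz_orig, nnz_perm = chol_nnz(B), chol_nnz(Bp)
```

In the natural order the factor is completely full (n(n+1)/2 nonzeros); with the hub variable ordered last the factor keeps only about 2n nonzeros.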

It is important to emphasize that the reason why the ordering algorithm in Step
2 can be done symbolically, only working on the structure of ATA, is that pivoting is
not required for numerical stability in the Cholesky algorithm. (Note that we assume
that rank(A)= n; for modifications needed to treat the case when rank(A)< n see
Section 17.)

By far the most popular ordering algorithm for Step 2 is the minimum degree
algorithm, which is a special case of the Markowitz ordering algorithm (MARKOWITZ
[1957]) for unsymmetric matrices. The minimum degree algorithm is based on
a local minimization of fill-in and can be described as follows. Assume that (k - 1)
columns have been ordered, and the corresponding steps in the Cholesky
factorization carried out. This intermediate stage in the factorization is shown in
Fig. 16.2.

FIG. 16.2. The minimum degree algorithm (showing the computed part of R and the pivot row).

Let v_i be the number of nonzeros in the unreduced part of the ith row. We then
choose the next pivot column k so that

    v_k = min_i v_i.

Remarkably fast symbolic implementations of the minimum degree algorithm exist,


which use an elimination graph model of the Cholesky factorization. Similar
techniques are used in Step 3 to generate a storage structure for R. We remark that in
predicting the structure of R from that of B = Pc^T A^T A Pc by performing the Cholesky
factorization symbolically, R^T + R will be at least as full as B. This may overestimate the
structure of R and we comment more on this point in Section 17. For details of the
implementation of the minimum degree algorithm and the symbolic and numeric
factorizations in Steps 3 and 5 we refer to GEORGE and LIU ([1981], Chapter 5).
An advantage of the normal equations method for sparse least squares is that
excellent software packages are available for these steps in Algorithm 16.1. We
mention in particular MA27 (DUFF and REID [1982]), YSMP (EISENSTAT et al. [1981])
and SPARSPAK (GEORGE and Liu [1981]). For well-conditioned problems the normal
equation method using any of the above implementations can be very satisfactory
and often gives a solution of sufficient accuracy. However, for ill-conditioned or stiff
problems this method may lead to a substantial loss of accuracy in the computed
solution. In the next section we will show how the numerical steps in Algorithm 16.1,
Steps 4 and 5, can be performed by more stable orthogonalization methods.
In Section 13 we discussed methods for computing the variance-covariance
matrix σ^2 C, where C = (RTR)^-1, for the least squares solution x. When the matrix
R is sparse, GOLUB and PLEMMONS [1980] have shown that the algorithm (13.9)-(13.11)
can be used to compute very efficiently all elements in C which are associated
with nonzero elements in R. Since R has a nonzero diagonal this includes the
diagonal elements of C, giving the variances of x. If R has bandwidth w, then the
corresponding elements in C can be computed in only nw^2 flops by the algorithm
below.
We define the index set K by

r_ij ≠ 0  ⇔  (i, j) ∈ K,

and let

f_k = min {i: r_ik ≠ 0}.

We will compute all elements c_ij, (i, j) ∈ K, 1 ≤ j ≤ n, i ≤ j. We start with the last column
of C and compute

c_nn = r_nn^-2,
c_in = -r_ii^-1 Σ_{j=i+1, (i,j)∈K}^{n} r_ij c_jn,   i = n-1, ..., f_n,  (i, n) ∈ K.    (16.2)

Assume now that we have computed all elements c_ij, j = n, ..., k+1, i ≤ j, (i, j) ∈ K.
Then from (13.10)

c_kk = r_kk^-1 ( r_kk^-1 - Σ_{j=k+1, (k,j)∈K}^{n} r_kj c_kj )    (16.3)

and similarly from (13.11), for i = k-1, ..., f_k,

c_ik = -r_ii^-1 ( Σ_{j=i+1, (i,j)∈K}^{k} r_ij c_jk + Σ_{j=k+1, (i,j)∈K}^{n} r_ij c_kj ),   (i, k) ∈ K.    (16.4)

It can be shown that, since R is the Cholesky factor of ATA, its structure is such that
(i, j) ∈ K and (i, k) ∈ K implies that (j, k) ∈ K if j < k and (k, j) ∈ K if j > k. Hence all
elements needed in (16.2)-(16.4) have been computed.
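A dense-storage sketch of these recurrences follows; the function name is invented, and a sparse implementation would of course store only the elements with indices in K. The 3 x 3 bidiagonal example lets the result be checked against the explicit inverse of RTR.

```python
# Sketch of the recurrences (16.2)-(16.4): compute the elements of
# C = (RTR)^-1 associated with the nonzeros of an upper triangular R.
# C is filled column by column from the last; only positions in the
# structure K of R are referenced and stored (symmetrically).
import numpy as np

def covariance_subset(R):
    n = R.shape[0]
    K = [[j for j in range(i, n) if R[i, j] != 0.0] for i in range(n)]
    C = np.zeros((n, n))
    for k in range(n - 1, -1, -1):
        # diagonal element (16.3); for k = n-1 the sum is empty, giving (16.2)
        C[k, k] = (1.0 / R[k, k]) * (1.0 / R[k, k]
                   - sum(R[k, j] * C[k, j] for j in K[k] if j > k))
        # off-diagonal elements (16.4), for (i, k) in K only
        for i in range(k - 1, -1, -1):
            if R[i, k] == 0.0:
                continue
            s = sum(R[i, j] * C[j, k] for j in K[i] if i < j <= k)
            s += sum(R[i, j] * C[k, j] for j in K[i] if j > k)
            C[i, k] = -s / R[i, i]
            C[k, i] = C[i, k]
    return C

R = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 2.0]])
C = covariance_subset(R)
Cfull = np.linalg.inv(R.T @ R)
```

On the positions in K the computed elements agree exactly with the corresponding elements of the full inverse, as the closure property above guarantees; elements outside K (here c_13) are simply not computed.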
We remarked earlier that a single dense row in A will lead to a full matrix ATA and
therefore, invoking the no-cancellation assumption, a full Cholesky factor R.
Problems where the matrix A is sparse except for a few dense rows can be treated by
first solving the problem with the dense rows deleted. The effect of the dense rows
on the solution is then incorporated by updating this solution. We stress that we
only update the solution, not the Cholesky factor.
Consider the problem

min_x || (As)     (bs) ||
      || (Ad) x - (bd) ||_2,                          (16.5)

where As ∈ R^{m1×n} is sparse and Ad ∈ R^{m2×n}, m2 << n, contains the dense rows. We
assume for simplicity that rank(As) = n. Denote by xs the solution to the sparse
problem

min_x ||As x - bs||_2

and let the corresponding Cholesky factor be Rs. The residual vectors in (16.5)
corresponding to xs are

rs(xs) = bs - As xs,   rd(xs) = bd - Ad xs.

We now wish to compute the solution x = xs + z to the full problem (16.5), and
hence to minimize

||rs(x)||_2^2 + ||rd(x)||_2^2,

where

rs(x) = rs(xs) - As z,   rd(x) = rd(xs) - Ad z.

Since AsT rs(xs) = 0, this is equivalent to the problem

min_z { ||As z||_2^2 + ||Ad z - rd(xs)||_2^2 }.       (16.6)

Letting

u = Rs z,   Bd = Ad Rs^-1,

we have that ||As z||_2 = ||u||_2 and (16.6) reduces to

min_u { ||u||_2^2 + ||Bd u - rd(xs)||_2^2 }.          (16.7)

Introducing

v = rd(xs) - Bd u,   C = (Bd, I_m2),   w = (uT, vT)T,

we can write (16.7) as a minimum norm problem

min ||w||_2   s.t.   C w = rd(xs),                    (16.8)

with the solution w = C+ rd(xs). Since C has full row rank we can use the expression
(4.6) for the pseudoinverse of C to solve (16.8). Let Rd ∈ R^{m2×m2} be the Cholesky factor
of CCT. Then the solution to (16.8) is

w = CT(CCT)^-1 rd(xs) = CT(RdTRd)^-1 rd(xs).

When w, and hence u, has been computed we get z by solving the triangular system
Rs z = u, and then x = xs + z.
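The updating scheme can be sketched as follows, with small dense matrices standing in for the sparse block As and SciPy's triangular solvers playing the role of the sparse solves with Rs; the dimensions and data are invented for the illustration.

```python
# Sketch of the dense-row updating scheme (16.5)-(16.8), with dense
# matrices standing in for the sparse and dense blocks.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(3)
m1, m2, n = 30, 2, 6
As, bs = rng.standard_normal((m1, n)), rng.standard_normal(m1)
Ad, bd = rng.standard_normal((m2, n)), rng.standard_normal(m2)

# sparse subproblem: Cholesky factor Rs of AsTAs and its solution xs
Rs = cholesky(As.T @ As)                           # upper, AsTAs = RsTRs
xs = solve_triangular(Rs, solve_triangular(Rs.T, As.T @ bs, lower=True))
rd = bd - Ad @ xs                                  # rd(xs)

# minimum norm problem (16.8): w = CT(CCT)^-1 rd with C = (Bd, I)
Bd = solve_triangular(Rs.T, Ad.T, lower=True).T    # Bd = Ad Rs^-1
Rd = cholesky(Bd @ Bd.T + np.eye(m2))              # Cholesky factor of CCT
y = solve_triangular(Rd, solve_triangular(Rd.T, rd, lower=True))
u = Bd.T @ y                                       # u-block of w = CT y
z = solve_triangular(Rs, u)                        # Rs z = u
x = xs + z

x_ref, *_ = np.linalg.lstsq(np.vstack([As, Ad]), np.concatenate([bs, bd]),
                            rcond=None)
```

The updated x agrees with a least squares solve on the full stacked system, confirming that only the solution, never the factor Rs, needs to be touched.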
The updating scheme described above can be modified to use an orthogonali-
zation method for the solution of (16.8), see GEORGE and HEATH [1980]. It can be
generalized to the case where the sparse subproblem has rank less than n, see HEATH
[1982]. A general scheme for updating equality constrained linear least squares
solutions has been developed by BJÖRCK [1984].
It is important to point out that these updating algorithms cannot be expected to
be stable in all cases. Stability may be a problem whenever the sparse subproblem is
more ill-conditioned than the full problem.

17. Orthogonalization methods for sparse problems

The potential numerical instability of the method of normal equations is due to loss
of information in explicitly forming ATA and ATb, and to the fact that the condition
number of ATA is the square of that of A. Orthogonalization methods avoid both
these sources of trouble by working directly with A. An orthogonal matrix Q ∈ R^{m×m}
is used to reduce A ∈ R^{m×n} of rank n and b to the form

QTA = ( R ),    QTb = ( c1 ),    (17.1)
      ( 0 )           ( c2 )

where R ∈ R^{n×n} is upper triangular and c1 ∈ R^n. The solution is then obtained by
solving the triangular system Rx = c1, and the residual norm is ||c2||_2.
As pointed out before (see Theorem 7.2) the matrix R equals the Cholesky factor of
ATA. Since this is unique apart from possible sign differences in some rows, its
nonzero structure is unique. Thus we may still use the symbolic steps in Algorithm
16.1 i.e. Steps 2 and 3, to determine a good column permutation Pc and set up
a storage structure for the Cholesky factor associated with APc. However, this
method may be too generous in allocating space for nonzeros in R. To see this
consider the matrix in Fig. 17.1. For this matrix R=A since A is already upper
triangular, but since ATA is full we will predict R to be full. Note that this can occur
because we begin not with the structure of ATA, the matrix whose Cholesky factor
we want, but with the structure of A. Hence the elements in ATA are not independent.
We call this structural cancellation in contrast to numerical cancellation, which
occurs only for certain values of the nonzero elements in A.
Another way to predict the structure of R is to perform the Givens or Householder
algorithm symbolically working from the structure of A. GEORGE and HEATH [1980]
proved the following result:

THEOREM 17.1. The structure of R as predicted by symbolic factorization of ATA
includes the structure of R as predicted by the symbolic Givens method, which includes
the structure of R.

MANNEBACK [1985] has proved that the structure predicted by a symbolic
Householder algorithm is also strictly included in the structure predicted from ATA.
However, both the Givens and Householder rules can also overestimate the
structure of R. GENTLEMAN [1976] gives an example where structural cancellation
occurs for the Givens rule.

COLEMAN, EDENBRANDT and GILBERT [1986] exhibited a class of matrices for
which symbolic factorization of ATA correctly predicts the structure of R, excluding
numerical cancellations. From the above it follows that for this class also the Givens
and Householder rules will give the correct result.

THEOREM 17.2. Let A ∈ R^{m×n}, m ≥ n. If for all subsets of k columns of A, k = 1, 2, ..., n,
the corresponding submatrix has nonzeros in at least k + 1 rows, then A is said to have
the strong Hall property. If A has the strong Hall property, then the structure of ATA
will correctly predict that of R.

Obviously the matrix A in Fig. 17.1 does not have the strong Hall property, since
its first column has only one nonzero element. However, the matrix A'
obtained by deleting the first column has this property.

        A                 A'
x x x x x            x x x x
  x                  x
    x                  x
      x                  x
        x                  x

FIG. 17.1.

COLEMAN et al. [1986] also show that if A does not have the strong Hall property,
then by row and column permutations A can be brought into block upper triangular
form

        ( M1  U12 ... U1k  U1,k+1 )
        (     M2  ... U2k  U2,k+1 )
PAQ =   (          ...     ...    )    (17.2)
        (              Mk  Uk,k+1 )
        (                  Mk+1   )

where the diagonal blocks Mi, i = 1, 2, ..., k+1, have the strong Hall property and
moreover the blocks Mi, i = 1, 2, ..., k, are square. The reordering can be determined
by a simple generalization of the algorithm of GUSTAVSSON [1976]. The reordered
system now essentially reduces the original least squares problem to the solution of

min_{x̃k+1} ||Mk+1 x̃k+1 - b̃k+1||_2,    (17.3)

where x̃ = QTx and b̃ = Pb have been partitioned conformally with PAQ in (17.2). If
rank(A) = n, then the blocks Mi, i = 1, 2, ..., k, are nonsingular and x̃k, ..., x̃1 can be
determined by block backsubstitution

Mi x̃i = b̃i - Σ_{j=i+1}^{k+1} Uij x̃j,    i = k, ..., 2, 1.    (17.4)

In the following we will assume that if necessary such a preprocessing has taken
place and that therefore the structure of the matrix R is correctly predicted by that of
ATA.
For dense problems probably the most effective method to compute the QR
factorization is by using Householder reflections, see Algorithm 8.1. In this method
at the kth step all the subdiagonal elements in the kth column are annihilated. Then
each column in the remaining unreduced part of the matrix which has a nonzero
inner product with the column being reduced will take on the sparsity pattern of
their union. In this way, even though the final R may be sparse, a lot of intermediate
fill-in will take place with consequent cost in operations and storage. However, the
Householder method can be modified to work efficiently for banded systems, see
Section 18.
As shown by GEORGE and HEATH [1980], by using a row-oriented method
employing Givens rotations it is possible to avoid the problem with intermediate
fill-in in the orthogonalization method. We now describe this algorithm.

ALGORITHM 17.1 (Sequential row orthogonalization, GEORGE and HEATH [1980]).


The rows aiT of A are processed sequentially, i = 1, 2, ..., m. We denote by Ri-1 the
upper triangular matrix obtained after processing rows a1T, ..., ai-1T. Initially we
assume that R0 has all elements equal to zero. The ith row, aiT = (ai1, ai2, ..., ain), is
processed as follows: We scan the nonzeros in aiT from left to right. For each aij ≠ 0
a Givens rotation involving row j in Ri-1 is used to annihilate aij. This may create
new nonzeros both in Ri-1 and in the row aiT. We continue until the whole row
aiT has been annihilated. Note that if rjj = 0, this means that this row in Ri-1 has not
yet been touched by any rotation and hence the entire jth row must be zero. When this
occurs the remaining part of row i will simply be inserted as the jth row in Ri. We
illustrate this process in Fig. 17.2.

FIG. 17.2. Processing of a row in the George-Heath algorithm, where circled elements are involved in the elimination of aij. Nonzero elements created in Ri-1 and aiT during the elimination are denoted by ⊕.

Note that unlike the Householder method intermediate fill-in now only takes
place in the row that is being processed. It follows from Theorem 17.1 that if the
structure of R has been predicted from that of ATA, then any intermediate matrix
R _ will fit into the predicted structure.
For simplicity we have not included the right-hand side in Fig. 17.2, but it should
be processed simultaneously with the rows of A and carried along in parallel. In the
implementation of GEORGE and HEATH [1980] the Givens rotations are not stored
but discarded after use. Hence, only enough storage to hold the final R and a few
extra vectors for the current row and right-hand side(s) is needed in main memory.
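In outline, the George-Heath row processing can be sketched as follows. The function name is invented and R is held as a dense array; a sparse code would exploit the structure of R predicted from ATA, as described above.

```python
# Sketch of Algorithm 17.1: rows of A are rotated into an upper
# triangular R one at a time by Givens rotations.  The right-hand side
# is carried along, and the leftover component of each processed row
# accumulates the residual sum of squares.
import numpy as np

def row_orthogonalize(A, b):
    m, n = A.shape
    R = np.zeros((n, n))
    c = np.zeros(n)
    rss = 0.0                      # accumulates ||c2||^2
    for i in range(m):
        row, beta = A[i].copy(), b[i]
        for j in range(n):
            if row[j] == 0.0:
                continue
            if R[j, j] == 0.0:     # row j of R still empty: insert and stop
                R[j], c[j] = row, beta
                row, beta = np.zeros(n), 0.0
                break
            # Givens rotation annihilating row[j] against R[j, j]
            r = np.hypot(R[j, j], row[j])
            cs, sn = R[j, j] / r, row[j] / r
            R[j], row = cs * R[j] + sn * row, -sn * R[j] + cs * row
            c[j], beta = cs * c[j] + sn * beta, -sn * c[j] + cs * beta
        rss += beta ** 2
    x = np.linalg.solve(R, c)      # Rx = c1
    return x, np.sqrt(rss)

rng = np.random.default_rng(0)
A = rng.standard_normal((12, 4))
b = rng.standard_normal(12)
x, resid = row_orthogonalize(A, b)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Note that, as in the text, intermediate fill-in occurs only in the row currently being processed, and the rotations themselves are discarded after use.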
Although the final R is independent of the ordering of the rows in A it is known
that the number of operations needed in the sequential orthogonalization method
depends on the row ordering. This is illustrated by the following contrived example
due to GEORGE and HEATH [1980]:

A = ( ... ),    PrA = ( ... ).    (17.5)

(The original display shows a sparse matrix A whose rows appear in a favourable order, together with a row permutation PrA of the same matrix in an unfavourable order.) The cost for reducing A is O(n^2), and that for PrA is O(kn^2).
Assuming that the rows of A do not have widely differing norms, the row ordering
does not affect numerical stability and can be chosen based on sparsity conside-
rations only. We consider the following heuristic algorithm for determining an
a priori row ordering.

ROW ORDERING ALGORITHM. Denote the column index of the last nonzero element
in row aiT by li. Sort the rows so that the indices li, i = 1, ..., m, form a monotone
increasing sequence, i.e. li ≤ lk if i < k.

This rule does not in general determine a unique ordering. One way to resolve ties
is to use a strategy by DUFF [1974], and consider the cost of symbolically rotating
a row aT into all other rows with a nonzero element in column Ii . Here by cost we
mean the total number of new nonzero elements created. The rows are then ordered
according to ascending cost.
Ordering the rows after increasing li has been found to work well, see GEORGE and
HEATH [1980]. With this row ordering we note that when row aiT is being processed
only the columns fi to li of Ri-1 will be involved, since all the previous rows only have
nonzeros in columns up to at most li. Hence Ri-1 will have zeros in columns
li+1, ..., n and no fill will be generated in row aiT in these columns.

In some contexts sorting rows after increasing values of fi, the column index of the
first nonzero in row aiT, may be appropriate. For this ordering it follows that the
rows 1, ..., fi - 1 in Ri-1 will not be affected when the remaining rows are processed.
These rows are therefore the final first fi - 1 rows in R and may e.g. be transferred to
auxiliary storage.
We summarize the main steps of the sequential row orthogonalization algorithm
below.

ALGORITHM 17.2 (General sparse orthogonalization method).


Steps 1-3. Same as in Algorithm 16.1.
Step 4. Find a row permutation Pr such that the rows in PrAPC are ordered after
increasing li.
Step 5. Compute R and c numerically using Givens rotations, processing the rows
of (PrAPc, Prb) sequentially as described in Algorithm 17.1.
Step 6. Solve Ry = c and put x = Pcy.

There is a great deal of freedom in the way Givens rotations can be used to
transform A to upper triangular form. A more general way of merging the rows in
A has been suggested by Liu [1986]. This scheme can give a significant reduction in
factorization time at a modest increase in working storage. The idea is best described
by a small 12 x 9 example

I v v v
I - , I
x X X X
x X X X
X X X X

X X X X
A= (17.6)
X X X X

X X X X

X X X X

I' XX xx XX
We note that rows 1-3 can be merged into an upper triangular matrix without any
fill-in. The same holds true for rows 4-6, 7-9 and 10-12, and we can reduce these
four groups of rows independently into 4 (partially filled) upper triangular matrices,
R 1 ,..., R 4. In the second stage we could merge the rows of R1 and R 2 together and
simultaneously merge R 3 and R4 . In the final stage we would merge the two upper
triangular matrices to produce the final Cholesky factor.
This way of transforming A to upper triangular form may be described by a
strictly binary tree, which is called a row merge tree. The tree for the example above
is given in Fig. 17.3.

FIG. 17.3. Row merge tree for the matrix (17.6).

The row merge tree should be interpreted as follows. It has m leaves correspond-
ing to the m rows aT, i= 1, ... , m. Any node x in the tree defines a subtree rooted at x,
which in turn defines a set of rows. We associate with x an upper triangular matrix
obtained by the reduction of this set of rows. Thus, the root of the tree corresponds to
the Cholesky factor R to be determined. We now see that each node corresponds to
the merging of the two triangular matrices associated with its left and right son. To
find R we have to traverse the tree, i.e. visit all nodes. We cannot visit a node before
we have visited its two sons but otherwise the sequential order in which we visit the
nodes is not specified. In fact, this approach is well suited for doing several merges in
parallel: we can view it as performing the computation on multiple fronts, cf. DUFF
and REID [1983].
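The traversal can be sketched with dense QR factorizations standing in for the sparse Givens reductions; the function name is invented and the two-level tree below hard-codes the four-leaf example of Fig. 17.3.

```python
# Sketch of Liu's row merging scheme: groups of rows are reduced to
# small triangles, which are then merged pairwise up a binary tree.
# numpy's dense QR stands in for the sparse Givens reductions.
import numpy as np

def merge(upper1, upper2):
    """Reduce the stacked rows of two triangles to a single triangle."""
    return np.linalg.qr(np.vstack([upper1, upper2]), mode="r")

rng = np.random.default_rng(5)
A = rng.standard_normal((12, 9))
# leaves: reduce four groups of three rows each
leaves = [np.linalg.qr(A[i:i + 3], mode="r") for i in range(0, 12, 3)]
# interior nodes: merge pairwise (these two merges could run in parallel)
left = merge(leaves[0], leaves[1])
right = merge(leaves[2], leaves[3])
# root: the final triangular factor R
R = merge(left, right)

R_ref = np.linalg.qr(A, mode="r")
```

Since every reduction preserves the Gram matrix of its rows, the root triangle agrees with the R factor of A up to row signs, whichever order the independent merges are performed in.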
A way to generate a row merge tree from the structure of the matrix A is given by
LIu [1986]. Liu also suggests that for sequential computation the nodes in the merge
tree are visited in depth-first order.
The general row merge scheme described here can be viewed as a special type of
variable row pivot method as studied by GENTLEMAN [1973], DUFF [1974] and
ZLATEV [1982]. However, as observed by LIU [1986], variable row pivoting has not
been so popular because of the difficulty of generating good orderings for the
rotations and because of the complication of implementation. Also, the dynamic
storage structure needed tends to reduce the efficiency of these schemes.
Many large sparse linear least squares problems are so large that it is impossible
to store even the Cholesky factor R. We now briefly describe an automatic
partitioning scheme by GEORGE, HEATH and PLEMMONS [1981] for solving such
problems.
Assume that an appropriate ordering and partitioning of the columns of A has
been found, e.g. by the method of nested dissection in Section 19. Denote by Yi the set
of column indices in the ith partition, i = 1, ..., p. We now order first the rows having
nonzero elements with column indices in Y1, giving us a set of row indices Z1. Among
the unordered rows we now order the rows having nonzero elements with column
indices in Y2, and so on. This induces a partitioning of the row indices {Z1,
Z2, ..., Zp} of A, which can be defined as follows. Let Z0 = ∅ and

Zi = {k: ∃ j ∈ Yi with akj ≠ 0} \ ∪_{l=1}^{i-1} Zl,    i = 1, 2, ..., p.

( A11  A12  A13 )
(      A22  A23 )
(           A33 )

FIG. 17.4.

If the rows of A are permuted to appear in the order Z 1, Z 2 ... , Zp, then A will have
a block upper trapezoidal form, as depicted in Fig. 17.4 for p = 3.
For a matrix of block upper trapezoidal form the sequential orthogonalization
method can be applied to a block row at a time. In the first step only the blocks
A11, ..., A1p are processed, transforming A11 to upper triangular form. The first |Y1|
rows of the resulting matrix are the first |Y1| rows of the final R and can be stored
away. The rest of the rows are adjoined to the next block row A22, ..., A2p to give
Ã22, ..., Ã2p, and now Ã22 is transformed into upper triangular form, etc. We
assume that the rows of A are stored on auxiliary storage at all times. Then the only
main storage required is that for holding the |Yi| rows of R generated at step i,
i = 1, ..., p. A slightly more efficient way to carry out this process is described in
GEORGE, HEATH and PLEMMONS [1981] where also details in the data management
are outlined.
So far we have assumed that A is not rank-deficient. In the dense case rank-
deficient problems were handled by introducing column permutations in the QR
decomposition of A, see Section 11. In Algorithm 17.2 the column ordering is fixed in
advance of any numerical computation and chosen to produce a sparse R factor.
Therefore column permutations are not allowed in Step 5 of Algorithm 17.2 since
then the computed R will in general not fit into the previously generated fixed
storage structure.
If Algorithm 17.1 is applied to a matrix A of rank r < n, then using exact arithmetic
the resulting R factor must have n - r zero diagonal elements. In the algorithm a row
is only inserted into the data structure so that the diagonal entry is nonzero. Further
processing of this row can only increase the diagonal element, and it follows that if
a row has a zero diagonal element then all its elements are zero. Hence R will have
the form depicted in Fig. 17.5.

FIG. 17.5.
By permuting the zero rows of R to the bottom and the columns of R corres-
ponding to the zero diagonal elements to the right we get a matrix of the desired form
(7.10).
In finite precision we will usually end up with an R factor with no zero diagonal
elements. Although this is not always the case the rank is often revealed by the
presence of small diagonal elements. However, a small diagonal element does not
imply that the rest of the row is negligible. HEATH [1982] recommends that, starting
from the top, the diagonal of R is examined for small elements. In a row whose
diagonal element falls below a certain tolerance the diagonal element is put equal to
zero. The rest of the row is then reprocessed zeroing out all its other nonzero
elements. Note that this might increase some previously small diagonal elements in
rows below, which is why we have to start from the top. After this we end up with
a matrix of the form shown in Fig. 17.5.
In the test for small diagonal elements a relative tolerance can be used based on
the largest diagonal element in R. This way of determining rank is not as satisfactory
as algorithms using column pivoting (see Example 17.1). It might happen, although
it is perhaps unlikely to occur in practice, that R is almost rank-deficient and yet has
no small diagonal element. However, HEATH [1982] reports that on a typical test
batch the rank determined by Algorithm 17.1 agreed with the rank determined by
QR decomposition with column pivoting.

EXAMPLE 17.1. Consider the following matrix

     ( r  1            )
     (    r  1         )
Rn = (       .  .      ) ∈ R^{n×n}.
     (          r  1   )
     (             r   )

From the form of Rn, n - 1 of the singular values are close to unity, and since their
product equals det^{1/2}(RnTRn) = r^n the remaining singular value is approximately
equal to r^n. For r = 0.1 and n = 20 we thus have σmin ≈ 10^-20 and yet no diagonal
element is small. This ill-conditioning is more severe than that exhibited by the
matrix in Example 11.1.
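The claim is easy to check numerically; here n is reduced to 8 so that the smallest singular value stays well above the double-precision roundoff level.

```python
# Numerical check of Example 17.1 with r = 0.1 and n = 8: every diagonal
# element of Rn equals r, so none is small relative to the largest one,
# yet sigma_min is of order r^n = 1e-8.
import numpy as np

r, n = 0.1, 8
Rn = np.diag(np.full(n, r)) + np.diag(np.ones(n - 1), k=1)
sigma = np.linalg.svd(Rn, compute_uv=False)
# the product of the singular values equals |det(Rn)| = r^n
```

The computed singular values confirm that the diagonal test gives no warning of the near rank deficiency.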

We point out that several of the techniques given in Section 11 for determining the
rank and ill-conditioning in triangular matrices can be used in the sparse case, see
FOSTER [1986].
In implementations of Algorithm 17.1 the Givens rotations are also applied to any
right-hand side(s), but normally then discarded to save storage. This creates a
problem if we later wish to solve additional problems having the same matrix A but

different right-hand sides b. In this case the rotations could be saved on an external
file for later use.
An alternative possibility for handling multiple right-hand sides is to solve the
seminormal equations (SNE)
RTRx = ATb, (17.7)
using the computed R factor. This only requires that the original matrix A is saved to
transform subsequent right-hand sides. Unfortunately the numerical stability of
(17.7) is not better than that of the normal equations. This is somewhat surprising
since we are using an R factor computed by Givens' method and thus of better
"quality" than that obtained from a Cholesky factorization of ATA. It has been
shown by BJÖRCK [1986] that by adding a correction step to (17.7) we can obtain
a solution of much better accuracy.
In the method of corrected seminormal equations (CSNE) we correct the
computed solution x by
RTRw=b-Ax, xc=x+w. (17.8)
The correction step is similar to doing one step of iterative refinement on the
solution from (17.7). However, here the residual b-Ax may be computed in single
precision. A detailed error analysis of the method CSNE is given by BJÖRCK [1986].
This error analysis leads us to expect that CSNE is at least as accurate as the
orthogonalization method provided that the solution from the seminormal
equation has at least one correct digit. For problems with widely differing row
scalings (stiff problems) the method CSNE is however less satisfactory.
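The method can be sketched as follows; the function name is invented, SciPy's triangular solver stands in for the sparse solves with R, and the data are random for the illustration.

```python
# Sketch of the corrected seminormal equations (CSNE): solve the
# seminormal equations (17.7) using a stored R factor, then apply one
# correction step (17.8) with the single-precision-style residual b - Ax.
import numpy as np
from scipy.linalg import solve_triangular

def csne(A, R, b):
    """Solve min ||Ax - b||_2 given only A and its triangular factor R."""
    def sne(rhs):
        # seminormal equations: RTR x = AT rhs
        y = solve_triangular(R, A.T @ rhs, trans="T")
        return solve_triangular(R, y)
    x = sne(b)
    w = sne(b - A @ x)          # correction step (17.8)
    return x + w

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 8))
b = rng.standard_normal(50)
R = np.linalg.qr(A, mode="r")   # R from an orthogonalization of A
x = csne(A, R, b)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Only A and R are needed to process each new right-hand side; the orthogonal factor is never stored.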

18. Sequential orthogonalization methods for banded problems

In this section we consider orthogonalization methods for the special case when
A ∈ R^{m×n} is a band matrix of constant row bandwidth w (cf. Definition 15.1). Such
matrices could be treated using the algorithm given in Section 17 for general sparse
matrices. However, we can take advantage of the very simple structure of a band
matrix to save on space and time overheads.
We will assume that the rows of A have been sorted so that the column indices fi,
i = 1, 2, ..., m, of the first nonzero element in each row form a nondecreasing
sequence, i.e. fi ≤ fk if i < k. Since we assume a constant row bandwidth we have

li = fi + w - 1,

so we could equivalently sort after nondecreasing values of li. From Theorem 16.2
we know that the matrix ATA will also be a banded matrix with only the first
w - 1 superdiagonals nonzero. The upper triangular Cholesky factor R has the
same structure as the upper triangle of ATA, see Theorem 16.3. Thus R has again
w nonzeros in each row.
We now specialize the sequential row orthogonalization scheme of Section 17.

ALGORITHM 18.1 (Sequential row orthogonalization for banded matrices). Let A ∈ R^{m×n}
be a rectangular matrix of constant row bandwidth w and initialize R = R0 to be an
upper triangular band matrix of bandwidth w with all elements zero. The
orthogonalization takes place in m steps, i = 1, 2, ..., m. In the ith step, we read in
row aiT, and merge it with Ri-1 to get the triangle Ri according to the following
program:

for j = fi, ..., li do
    if aij ≠ 0 then
        {construct Givens rotation from (rjj, aij);
         apply Givens rotation to annihilate aij}.

In Fig. 18.1 we show the situation before the elimination of this row.

FIG. 18.1. The ith step of reduction of a banded matrix.

From Fig. 18.1 several things are apparent. First we note that only the shaded part
of Ri_1 is involved in this step. Therefore this step can be thought of as updating
a full upper triangular matrix of order w when a row of length w is added. The last
(n - li) columns of R have not been touched and are still zero as initialized. Further,
the first (fi- 1) rows of R are already finished at this stage and can be read out to
secondary storage. Thus, very large problems can be handled since the only primary
storage needed is for the shaded part in Fig. 18.1.
We remark that by initializing R to zero the description above is also valid for the
processing of the first rows of A. The first row a1T can just be inserted into R. This
permutation is just a special case of a rotation. After insertion, only zero elements in
the row are left and no more rotations are needed. Similarly, the number of rotations
needed to process row i is at most equal to min(i - 1, w).
It is clear from the above that the processing of row aiT requires at most 2w^2 flops if
4-multiply Givens rotations are used. Thus the complete orthogonalization requires
about 2mw^2 flops and can be performed in 2w(w + 3) locations of primary storage.
We remark that if the rows of A are processed in random order, then we can only
bound the operation count by 2mnw flops, which is a factor of n/w worse (see Cox
[1981]). Thus, it almost invariably pays to sort the rows as was assumed to be done
above.
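The banded scheme can be sketched as follows; the function name is invented, and R is held as a full n x n array, whereas a real implementation would keep only the w x w working triangle in primary storage.

```python
# Sketch of Algorithm 18.1: each row of bandwidth w engages only
# columns fi..fi+w-1, so at most w Givens rotations are needed per row
# (about 2w^2 flops with 4-multiply rotations).
import numpy as np

def banded_row_orthogonalize(A, first, w):
    m, n = A.shape
    R = np.zeros((n, n))
    rotations = 0
    for i in np.argsort(first, kind="stable"):    # process rows by increasing fi
        row = A[i].copy()
        for j in range(first[i], min(first[i] + w, n)):
            if row[j] == 0.0:
                continue
            if R[j, j] == 0.0:                    # row j of R still empty: insert
                R[j] = row
                break
            r = np.hypot(R[j, j], row[j])
            cs, sn = R[j, j] / r, row[j] / r
            R[j], row = cs * R[j] + sn * row, -sn * R[j] + cs * row
            rotations += 1
    return R, rotations

# a 10 x 6 band matrix with row bandwidth w = 3
rng = np.random.default_rng(2)
m, n, w = 10, 6, 3
first = np.minimum(np.sort(rng.integers(0, n, size=m)), n - w)
A = np.zeros((m, n))
for i in range(m):
    A[i, first[i]:first[i] + w] = 1.0 + rng.random(w)
R, rotations = banded_row_orthogonalize(A, first, w)
```

Because the rows are processed in order of increasing fi, no rotation ever reaches outside columns fi to li, and the resulting R is itself banded with bandwidth w.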
In Algorithm 18.1 the Givens rotations could also be applied to one or several
right-hand sides b to produce

C=QTb =(C, ceRn.

The least squares solution is then obtained from Rx = c, by backsubstitution. The


vector c2 is not stored but used to accumulate the residual sum of squares
Irll 2= Ilc2 112, If we have to treat right-hand sides, which are not available in
advance, then the Givens rotations should be saved. As described in Section 8 each
Givens rotation can be represented by a single floating-point number, see (8.12).
Since at most w rotations are needed to process each row it follows that Q can be
stored in no more space than that allocated for A.
In the example below we give an application illustrating one source of banded
least squares problems.

EXAMPLE 18.1 (Cox [1981]). Consider the least squares approximation of a discrete
set of data by a linear combination of cubic B-splines. The spline is represented over
[a, b] by

s(t) = Σ_{j=1}^{p} xj Bj(t),

where Bj(t), j = 1, 2, ..., p = N + 4, are the normalized cubic B-splines (see DE BOOR
[1978], Chapter 9) for the knots

λj = a, j ≤ 0,   a < λj < b, 1 ≤ j ≤ N,   λj = b, j > N.

As an example we take data (yi, ti), i = 1, 2, ..., m = 16, N = 6, and determine x to
minimize

Σ_i (s(ti) - yi)^2 = ||Sx - y||_2^2.

Since the only B-splines with nonzero values for t ∈ [λk-1, λk] are Bj, j = k, k+1,
k+2, k+3, the matrix S will be a band matrix with w = 4. In particular, if the number
of data points in the intervals [λi-1, λi], i = 1, ..., 7, is 3, 2, 3, 2, 3, 2 and 2 respectively,

the matrix will take the following form

x x xx
Ix xxx
Ix X X
I X X X X
X X X X

X X X X

X X X X
S= (18.1)
X X X X
X X X X

iXX X X

x x xX

X X X
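The band structure is easy to exhibit with SciPy's B-spline machinery (available as `BSpline.design_matrix` in SciPy 1.8 and later); the knot layout and data sites below are invented for the demonstration.

```python
# Illustration of Example 18.1: the design matrix of a cubic B-spline
# fit is banded with row bandwidth w = 4, because only four cubic
# B-splines are nonzero at any point t.
import numpy as np
from scipy.interpolate import BSpline

a, b, N, k = 0.0, 1.0, 6, 3                 # cubic splines, N interior knots
interior = np.linspace(a, b, N + 2)[1:-1]
knots = np.r_[[a] * (k + 1), interior, [b] * (k + 1)]
p = len(knots) - k - 1                      # p = N + 4 basis splines
t = np.sort(np.random.default_rng(4).random(16))
S = BSpline.design_matrix(t, knots, k).toarray()
```

Each row of S has at most four consecutive nonzeros, so the sequential banded orthogonalization of Algorithm 18.1 applies directly.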

Cox [1981] shows that periodic spline approximation leads to matrices of an
augmented band structure,

S = (S1, S2),
where S1 is a band matrix and S2 a generally full matrix with a small number of
columns. For example, in the fitting of cubic splines S2 will consist of three nonzero
columns. Cox shows how to extend the band matrix algorithm to matrices of this
structure.
There is often a natural row block structure in banded matrices, where the jth
block of rows consists of the set of rows {aiT: fi = j}, cf. matrix (18.1). It is possible to
give an efficient scheme for such banded systems, which is based on Householder
transformations, see REID [1967]. Here each block of rows is sequentially merged
into an upper triangle. The operation count for this method is approximately half as
much as for the Givens method, i.e. a similar gain as for full matrices. However, the
method may require a column-oriented storage mode to perform efficiently. LAWSON
and HANSON ([1974], Chapter 11) describe a similar method and also provide
FORTRAN subroutines implementing their algorithm.
Another case where a block band structure occurs is in the least squares
estimation of discrete linear systems, see PAIGE and SAUNDERS [1977]. This problem

can be formulated as a least squares problem with matrix

     ( R1             )
     ( C1             )
     ( F1  R2         )
A =  (     C2         )
     (     F2  .      )
     (          .  Rk )
     (             Ck )

Hence, A has a block band structure with block bandwidth w = 2. Using the
sequential orthogonalization method, no fill-in will occur outside the nonzero
blocks.

19. Block structured sparse least squares problems

As noted in RICE [1983] there is a substantial similarity in the structure of several


large sparse least squares problems. The matrices possess a block structure, perhaps
at several levels, which reflects a "local connection" structure in the underlying
physical problem. In particular the matrices can often be put in the following dual
block angular form:

     ( A1                 B1 )
     (     A2             B2 )
A =  (         A3         B3 ) ,    (19.1)
     (             ...    .. )
     (                AM  BM )

where A ∈ R^{m×n} and

Ai ∈ R^{mi×ni},   Bi ∈ R^{mi×nM+1},   i = 1, 2, ..., M.

A number of problems have two levels of sparsity structure. That is, the blocks Ai
and/or Bi are themselves large and sparse matrices, often of the same general sparsity
pattern as A. There may also be more than two levels of structure. There is a wide
variation in the number and sizes of blocks. Some problems have large blocks with
M of moderate size (10-100) while others have many more but smaller blocks.
We partition the solution x and right-hand side b conformally with (19.1),

xT = (x1T, x2T, ..., xM+1T),   bT = (b1T, b2T, ..., bMT).

From (19.1) we see that the sets of variables xi, i = 1, ..., M, are coupled only to the set
of variables xM+1. Some examples where the form (19.1) arises naturally are in
photogrammetry, see GOLUB, LUK and PAGANO [1979], Doppler radar positioning,
see MANNEBACK, MURIGAND and TOINT [1985] and geodetic survey problems, see
GOLUB and PLEMMONS [1980]. WEIL and KETTLER [1971] have given a heuristic
algorithm for permuting a general sparse matrix into this form.
Problems of dual block angular form can be solved efficiently either by the
method of normal equations or by an orthogonalization method. The reason for this
is that no fill-in outside the nonzero blocks in A will occur in the R factor. We
describe below a method based on orthogonalization.
ALGORITHM 19.1 (Dual block angular orthogonalization,GOLUB, LUK and PAGANO
[1979]). The algorithm proceeds in four steps:
Step 1. Reduce the diagonal block A_i to upper triangular form by a sequence of
orthogonal transformations, and apply these also to the block B_i and the
right-hand side b_i, i = 1, 2, ..., M, yielding

Q_i^T (A_i, B_i) = ( R_i  S_i ),   Q_i^T b_i = ( c_i ).        (19.2)
                   ( 0    T_i )                ( d_i )
We assume here that rank(A) = n, which implies that the upper triangular matrices
Ri, i = 1, 2 ... , M, are nonsingular. Any sparse structure in the blocks Ai should be
exploited.
Step 2. Reorder the equations in the reduced system by ordering first the rows
corresponding to R_i, i = 1, ..., M, and last the rows corresponding to T_i, i = 1, ..., M.
Form the reduced matrix T and the right-hand side d,

    ( T_1 )        ( d_1 )
T = ( T_2 ),   d = ( d_2 ).
    ( ... )        ( ... )
    ( T_M )        ( d_M )
Step 3. Compute the QR decomposition of T and transform the vector d,

Q_{M+1}^T T = ( R_{M+1} ),   Q_{M+1}^T d = ( c_{M+1} ).        (19.3)
              ( 0       )                  ( d_{M+1} )

Then the residual norm of the system is given by ||d_{M+1}||_2. Compute x_{M+1}, the
solution of the least squares problem

min_{x_{M+1}} ||T x_{M+1} - d||_2,

by solving the triangular system

R_{M+1} x_{M+1} = c_{M+1}.

Step 4. Compute x_M, ..., x_1 by back substitution in the sequence of triangular
systems, i = M, ..., 1,

R_i x_i = c_i - S_i x_{M+1}.        (19.4)
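The four steps can be sketched with NumPy and dense blocks (the function name and the block sizes below are illustrative; a real implementation would of course exploit any sparse structure in the blocks A_i, as the algorithm recommends):

```python
import numpy as np

def dual_block_angular_lstsq(As, Bs, bs):
    """Solve min ||Ax - b||_2 for the dual block angular form (19.1).

    As[i] is A_i (m_i x n_i), Bs[i] is B_i (m_i x n_{M+1}), bs[i] is b_i.
    Follows Steps 1-4 of Algorithm 19.1 with dense QR factorizations.
    """
    M = len(As)
    Rs, Ss, cs, Ts, ds = [], [], [], [], []
    # Step 1: reduce each A_i to triangular form, transforming B_i and b_i.
    for Ai, Bi, bi in zip(As, Bs, bs):
        ni = Ai.shape[1]
        Qi, Ri = np.linalg.qr(Ai, mode='complete')
        Ci, fi = Qi.T @ Bi, Qi.T @ bi
        Rs.append(Ri[:ni, :]); Ss.append(Ci[:ni, :]); cs.append(fi[:ni])
        Ts.append(Ci[ni:, :]); ds.append(fi[ni:])
    # Step 2: gather the reduced blocks T_i and right-hand sides d_i.
    T, d = np.vstack(Ts), np.concatenate(ds)
    # Step 3: solve min ||T x_{M+1} - d||_2 (lstsq uses an orthogonal method).
    x_last, *_ = np.linalg.lstsq(T, d, rcond=None)
    # Step 4: back-substitute R_i x_i = c_i - S_i x_{M+1}, block by block.
    xs = [np.linalg.solve(Rs[i], cs[i] - Ss[i] @ x_last) for i in range(M)]
    return np.concatenate(xs + [x_last])

# Example with M = 2 dense random blocks.
rng = np.random.default_rng(0)
As = [rng.standard_normal((8, 3)) for _ in range(2)]
Bs = [rng.standard_normal((8, 2)) for _ in range(2)]
bs = [rng.standard_normal(8) for _ in range(2)]
A = np.block([[As[0], np.zeros((8, 3)), Bs[0]],
              [np.zeros((8, 3)), As[1], Bs[1]]])
b = np.concatenate(bs)
x = dual_block_angular_lstsq(As, Bs, bs)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Note that, as in Remark 19.1, the loops in Steps 1 and 4 are independent over the M subsystems and could be executed in parallel.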
SECTION 19 Sparse least squares problems 555

REMARK 19.1. In Steps 1 and 4 the computations can be performed in parallel on the
M subsystems. It is often advantageous to continue the reduction in Step 1 so that
also the matrices T_i, i = 1, ..., M, are upper triangular. Then Step 2 can be performed
as a merging of the M triangular matrices T_1, ..., T_M, cf. the general row merging
scheme described in Section 17.

REMARK 19.2. Sometimes in Step 1 the matrices R_i will be sparse but a lot of fill-in
occurs in the blocks B_i. Then a block preconditioned iterative method may be used,
where one solves the problem

min_y ||E y - b||_2,   y = M x,   E = A M^{-1},

with an iterative method, where

M = diag(R_A, I),   R_A = diag(R_1, ..., R_M).

It may be possible to compute a sparse QR decomposition also of B, the last block
column in (19.1), B^T = (B_1^T, ..., B_M^T). Then it is advantageous to use the
preconditioner M = diag(R_A, R_B), see GOLUB, MANNEBACK and TOINT [1986] and
Section 20.
From Algorithm 19.1 we obtain the QR decomposition of A, with

    ( R_1                 S_1     )
R = (      ...            ...     )
    (           R_M       S_M     )
    (                     R_{M+1} ).

The inverse of R is readily seen to be

         ( R_1^{-1}                     V_1          )
R^{-1} = (           ...                ...          )
         (                   R_M^{-1}   V_M          )
         (                              R_{M+1}^{-1} ),

where

V_i = -R_i^{-1} S_i R_{M+1}^{-1},   i = 1, ..., M.
Hence, the diagonal blocks of the variance-covariance matrix C = (R^T R)^{-1} = R^{-1} R^{-T}
are

C_{M+1,M+1} = R_{M+1}^{-1} R_{M+1}^{-T},   C_{ii} = R_i^{-1} R_i^{-T} + V_i V_i^T,   i = 1, ..., M.        (19.5)



Note that we can write, see GOLUB, PLEMMONS and SAMEH [1986],

C_{ii} = R_i^{-1} (I + W_i^T W_i) R_i^{-T},   W_i = R_{M+1}^{-T} S_i^T.

Hence, if we compute the QR decompositions

Q_i^T ( W_i ) = ( U_i ),   i = 1, ..., M,
      ( I   )   ( 0   )

then since I + W_i^T W_i = U_i^T U_i we have

C_{ii} = (U_i R_i^{-T})^T (U_i R_i^{-T}),   i = 1, ..., M.
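As a check of this identity, the small NumPy sketch below (dense, illustrative triangular blocks for M = 1) compares C_ii obtained via the QR factor U_i of the stacked matrix with the block obtained by direct inversion of R:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 4, 3
# Illustrative triangular blocks R_i, R_{M+1} and coupling block S_i.
Ri = np.triu(rng.standard_normal((n1, n1))) + 5 * np.eye(n1)
Rlast = np.triu(rng.standard_normal((n2, n2))) + 5 * np.eye(n2)
Si = rng.standard_normal((n1, n2))

# Direct computation: C = R^{-1} R^{-T} for the block triangular R.
R = np.block([[Ri, Si], [np.zeros((n2, n1)), Rlast]])
C = np.linalg.inv(R) @ np.linalg.inv(R).T
C_ii_direct = C[:n1, :n1]

# Via W_i = R_{M+1}^{-T} S_i^T and the triangular QR factor U_i of (W_i; I):
Wi = np.linalg.solve(Rlast.T, Si.T)                  # n2 x n1
Ui = np.linalg.qr(np.vstack([Wi, np.eye(n1)]))[1]    # I + W^T W = U^T U
X = Ui @ np.linalg.inv(Ri).T                         # U_i R_i^{-T}
C_ii = X.T @ X
```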

We now discuss a general procedure, called substructuring or dissection, to obtain
a dual block angular form. As an example, consider a geodetic position network
consisting of geodetic stations connected through observations. To each station
corresponds a set of unknown coordinates. In geodetic problems the idea of
breaking down a problem into geographically defined subproblems connected in
a well-defined way has been applied for more than a century and dates back to
HELMERT [1880]. The idea is to choose a set of stations 𝒮, which separates the other
stations into two regional blocks 𝒜_1 and 𝒜_2 so that stations in 𝒜_1 are not
connected by observations to stations in 𝒜_2. We then order the station variables so
that those in 𝒜_1 appear first, those in 𝒜_2 second, and those in 𝒮 last. Finally, we
order the equations so that those including 𝒜_1 come first and those including 𝒜_2
come last. The blocking of the region and the corresponding structure of the
observation matrix is depicted in Fig. 19.1.

A = ( A_1        B_1 )
    (      A_2   B_2 )

FIG. 19.1. One level of dissection and the corresponding block structure of the matrix with variables in
𝒮 ordered last.

Usually there will be a finer structure in A because there are equations only
involving variables in 𝒜_1 or 𝒜_2 but not in 𝒮. In this discussion for simplicity we
ignore this finer partitioning. We note that the matrix A in Fig. 19.1 has dual block
angular structure with M = 2. The blocks corresponding to 𝒜_1 and 𝒜_2 can thus be
processed independently (possibly in parallel).
The dissection can be continued by dissecting the regions 𝒜_1 and 𝒜_2 each into
two subregions and so on in a recursive fashion. In Fig. 19.2 we show the block
structure induced by two levels of dissection. Again the matrix A is of dual block
angular form, but now with a lot of structure in the nondiagonal blocks.
For a detailed discussion of dissection and orthogonal decompositions in
geodetic survey problems see GOLUB and PLEMMONS [1980].
A = ( A_1              B_1        D_1 )
    (      A_2         B_2        D_2 )
    (           A_3         C_3   D_3 )
    (                A_4    C_4   D_4 )

FIG. 19.2. Two levels of dissection and block structure induced in A when ordering the variables of the
separators last.

AVILA and TOMLIN [1979] discuss the solution of very large least squares problems

by nested dissection on a parallel processor. They use the method of normal
equations based on indications that in surveying applications orthogonalization
methods require more than twice the number of operations and storage. As we have
remarked earlier, with an adequate word length on the computer the method of
normal equations gives sufficient accuracy for a large range of problems.
Nested dissection orderings are useful also for general sparse least squares
problems. The column ordering and partitioning of A is then obtained by a nested
dissection ordering on the graph of AT A, see GEORGE, POOLE and VOIGT [1978]. The
use of such orderings for sparse least squares problems is described by GEORGE,
HEATH and PLEMMONS [1981] and GEORGE and NG [1983].
The dissection procedure described above is a variation of nested dissection
orderings developed for general sparse positive-definite systems, see GEORGE and
LIU ([1981], Chapters 7-8). These have been developed for solving sparse systems
associated with finite element schemes for partial differential equations and for the
finite element analysis of structures.

20. Iterative methods

For some large sparse least squares problems iterative methods are a useful
alternative to direct methods. In iterative methods an initial approximate solution is
successively improved until an acceptable solution is obtained. Iterative methods
are especially attractive for problems in which the elements of the matrix A can be
easily generated on demand. In such cases the matrix A need not be stored at all, but
instead can be defined by its action on vectors.
In principle any iterative method for symmetric positive-definite (or, if rank(A) < n,
semidefinite) linear systems can be applied to the system of normal equations
A^T A x = A^T b. As we will see, explicit formation of the matrix A^T A can be avoided by
using the factored form of the normal equations

A^T (b - A x) = 0.        (20.1)
For a treatment of iterative methods for symmetric positive-definite linear systems
see VARGA [1962] and YOUNG [1971]. Surveys of iterative methods for least squares
problems are given by BJORCK [1976] and HEATH [1984].
We now consider some basic iterative methods for solving (20.1). In Jacobi's
558 A. Björck CHAPTER III

method a sequence of approximations x^(k), k = 1, 2, ..., is computed from

x_j^(k+1) = x_j^(k) + a_j^T (b - A x^(k)) / d_j,   j = 1, 2, ..., n,        (20.2)

where

A = (a_1, ..., a_n) ∈ R^(m×n),   d_j = ||a_j||_2^2.        (20.3)

Introducing the diagonal matrix D_A = diag(d_1, ..., d_n) we can write (20.2) in matrix
form

x^(k+1) = x^(k) + D_A^{-1} A^T (b - A x^(k)).        (20.4)
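The iteration (20.4) can be sketched directly in NumPy (dense A for illustration; as discussed below, convergence requires the iteration matrix I - D_A^{-1} A^T A to have spectral radius below one, which the test matrix here is built to satisfy):

```python
import numpy as np

def jacobi_normal_eq(A, b, iters=500):
    """Jacobi iteration (20.4) for min ||Ax - b||_2, without forming A^T A.

    x^(k+1) = x^(k) + D_A^{-1} A^T (b - A x^(k)),  d_j = ||a_j||_2^2.
    Converges when the spectral radius of I - D_A^{-1} A^T A is < 1.
    """
    d = np.sum(A * A, axis=0)          # squared column norms d_j
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x + (A.T @ (b - A @ x)) / d
    return x

# Example: columns close to orthonormal, so the iteration contracts fast.
rng = np.random.default_rng(2)
A = np.vstack([np.eye(3), 0.1 * rng.standard_normal((5, 3))])
b = rng.standard_normal(8)
x = jacobi_normal_eq(A, b)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```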

We note that there is no need to form A^T A in the iteration (20.4). This has two
advantages.
(i) A small perturbation in A^T A, e.g. by roundoff, may change the solution much
more than perturbations of similar size in A itself.
(ii) We avoid the fill-in which can result from the formation of A^T A. For some
problems A^T A may be much less sparse than A. For example, if A has approximately
n^(1/2) nonzeros randomly distributed in each row, then A^T A will be almost full,
cf. Section 16.
Often the Gauss-Seidel method has a better rate of convergence than Jacobi's
method. It has been shown by BJORCK and ELFVING [1979] how to implement the
Gauss-Seidel method working only with the matrix A. A major iteration step is
divided into n minor steps. We put z^(1) = x^(k) and compute

z^(j+1) = z^(j) + e_j a_j^T (b - A z^(j)) / d_j,   j = 1, 2, ..., n,        (20.5)

where e_j is the jth unit vector and d_j is defined by (20.3). We then have that
x^(k+1) = z^(n+1). Note that in the jth minor step only the jth component of z^(j) is
changed. Therefore the residual

r^(j) = b - A z^(j)

can be cheaply updated. With r^(1) = b - A x^(k) we can write (20.5)

z^(j+1) = z^(j) + δ_j e_j,   r^(j+1) = r^(j) - δ_j a_j,        (20.6)

where

δ_j = a_j^T r^(j) / d_j.
Hence, in the jth minor step only the jth column of A is accessed. In contrast to the
Jacobi method the ordering of the columns of A will influence the convergence.
We obtain the successive overrelaxation (SOR) method for the normal equations
from (20.6) simply by taking

δ_j = ω a_j^T r^(j) / d_j,   0 < ω < 2,        (20.7)

where ω is an acceleration parameter.
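A minimal NumPy sketch of the column sweeps (20.5)-(20.7), using the cheap residual update of (20.6); ω = 1 gives the Gauss-Seidel method (illustrative dense data):

```python
import numpy as np

def sor_sweeps(A, b, omega=1.0, sweeps=500):
    """SOR (20.7) for the normal equations, accessing one column of A per
    minor step; omega = 1 gives the Gauss-Seidel method (20.5)."""
    m, n = A.shape
    d = np.sum(A * A, axis=0)          # d_j = ||a_j||_2^2
    x = np.zeros(n)
    r = b.astype(float).copy()         # r = b - A x, maintained cheaply
    for _ in range(sweeps):
        for j in range(n):
            delta = omega * (A[:, j] @ r) / d[j]
            x[j] += delta
            r -= delta * A[:, j]       # residual update of (20.6)
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
x = sor_sweeps(A, b, omega=1.0)        # Gauss-Seidel
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Since A^T A is symmetric positive definite for full-rank A, the Gauss-Seidel sweep always converges here; the ordering of the columns affects the rate, as noted below.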
The Jacobi method has the advantage over Gauss-Seidel and SOR that it is better
adapted to parallel computation. Further, it does not require A to be stored (or
generated) column-wise, since products of the form A x and A^T r can conveniently be
computed also if A is accessed by rows. If A = (a_1, ..., a_m)^T, then we have

(A x)_i = a_i^T x,   i = 1, ..., m,   A^T r = Σ_{i=1}^{m} a_i r_i,

i.e. for A x we use an inner product and for A^T r an outer product formulation.
The methods discussed above are special cases of the general stationary iterative
method

x^(k+1) = G x^(k) + c,        (20.8)

which can be obtained by the splitting A^T A = M - N, where M is nonsingular, and
taking G = M^{-1} N, c = M^{-1} A^T b. For the case when rank(A) < n, convergence of the
iteration (20.8) has been investigated by KELLER [1965] and YOUNG [1972].
The concept of splitting has been extended to rectangular matrices by PLEMMONS
[1972]. BERMAN and PLEMMONS [1974] define A = M - N to be a proper splitting if
the ranges and nullspaces of A and M are equal. They show that for a proper
splitting the iteration

x^(k+1) = M^†(N x^(k) + b)        (20.9)

converges to the pseudoinverse solution x = A^† b for every x^(0) if and only if the
spectral radius ρ(M^† N) < 1. The iterative method (20.9) avoids the explicit recourse
to the normal equations. Splittings of rectangular matrices have also been investigated
by CHEN [1975].
A more powerful class of methods can be described by the recursion

x^(k+1) = x^(k-1) + ω_{k+1} [α A^T r^(k) + x^(k) - x^(k-1)],        (20.10)

where ω_{k+1} and α are parameters, see GOLUB and VARGA [1961]. Assume that the
singular values of A satisfy

a ≤ σ_i^2 ≤ b.

Then we get the Chebyshev semi-iterative method by taking ω_1 = 1,

α = 2/(a + b),   ω_{k+1} = (1 - ¼ρ^2 ω_k)^{-1},

where ρ = (b - a)/(b + a). The second-order Richardson method is obtained by taking
α as above and

ω_k ≡ ω = 2/(1 + (1 - ρ^2)^{1/2}).

For these methods it is necessary to know bounds on the singular values of A. The
rate of convergence is sensitive to the quality of these bounds.
We now describe the conjugate gradient method of HESTENES and STIEFEL [1952],
which does not have this drawback. For a general discussion of this method see
GOLUB and VAN LOAN ([1983], Sections 10.2-10.3). A slight algebraic rearrangement
given below is needed to make the method perform well for the least squares
problem.

Let x"° be an initial approximation. Take

r( °) = b - Ax ° ), p(O) = s() = ATr Y = I S) II2 (20.11)

and for k=0,1, 2,... compute

q(k)= Apk), (k = Yk/ q k) 2

S(k +)= AT r(k +1), Y+ = S( 11(20.12)

fk = Yk +I/Yk, p(k +I )= S(k +1)+ kp(k)P

We call this algorithm CGLS. It can also be used in the rank-deficient case. Provided
that x'° ) E (AT), which holds if e.g. x(° ) = 0, x(k) will converge to the pseudoinverse
solution A' b. In absence of rounding errors it will compute the exact solution in at
most t iterations, where t n equals the number of distinct nonzero singular values
of A. However, with roundoff many more than n iterations may be needed if A is
ill-conditioned and the method is best regarded as an iterative method. The error
measures 11 xk)- A'bll 2 and 11r(k) 12 will both decrease monotonically also under
roundoff. When A is well-conditioned and/or the singular values of A are clustered
the method may converge to a sufficiently accurate solution in far less than
n iterations.
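A minimal NumPy sketch of CGLS, touching A only through the products A p and A^T r (dense illustrative data; in exact arithmetic n iterations suffice here):

```python
import numpy as np

def cgls(A, b, iters):
    """CGLS (20.11)-(20.12): conjugate gradients on the normal equations,
    accessing A only via matrix-vector products with A and A^T."""
    x = np.zeros(A.shape[1])
    r = b - A @ x
    p = s = A.T @ r
    gamma = s @ s
    for _ in range(iters):
        q = A @ p
        alpha = gamma / (q @ q)
        x = x + alpha * p
        r = r - alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 4))
b = rng.standard_normal(10)
x = cgls(A, b, iters=8)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```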
The algorithm (20.12) requires the storage of two n-vectors x and p and two
m-vectors r and q. (Note that s can share storage with q.) Each iteration requires
about 2 nz(A) + 3n + 2m flops, where nz(A) is the number of nonzero elements in A.
ELFVING [1978] has compared the implementation (20.12) with several other
variants of the conjugate gradient algorithm, and found this to be the most accurate.
PAIGE and SAUNDERS [1982] have developed an algorithm LSQR based on the
Lanczos bidiagonalization algorithm of GOLUB and KAHAN [1965]. LSQR is
mathematically equivalent to CGLS but is more accurate when A is ill-conditioned.
However, in this case both LSQR and CGLS will converge very slowly.
The basic drawback with iterative methods is that the rate of convergence will
depend on the spectrum of A and can therefore be very slow. For the methods of
Jacobi and Gauss-Seidel the number of iterations needed for a fixed reduction in the
accuracy is proportional to κ^2(A), where κ(A) is the condition number of A. For the
Chebyshev semi-iteration and second-order Richardson method this is reduced to
order κ(A) iterations, provided that accurate bounds for the singular values of A are
known. For the conjugate gradient method no such bounds are needed, and for this
method an upper bound for the number of iterations k needed to reduce the relative
error by a factor of ε is given by

k ≤ ½κ(A) log(2/ε).
Because of generally slow convergence the main emphasis in the development of
iterative methods is on convergence acceleration. A general technique to improve
convergence is preconditioning, which for the linear least squares problem is
equivalent to a transformation of variables. Let M ∈ R^(n×n) be a nonsingular matrix
and consider the problem

min_y ||(A M^{-1}) y - b||_2,   M x = y.        (20.13)

If M is chosen so that A M^{-1} has a more favourable spectrum than A this will
improve convergence of an iterative method applied to (20.13).
It is important to note that the product A M^{-1} should not be explicitly formed but
treated as a product of two operators. In an iterative method preconditioned by the
matrix M, matrix-vector products of the form A M^{-1} y and M^{-T} A^T r will occur.
Thus the extra cost of preconditioning will be in solving linear systems of the form
M x = y and M^T q = s. Hence M has to be chosen so that such systems can be easily
solved.
Several different preconditioners have been suggested for the least squares
problem. We first note that if we take M = R, where R is the Cholesky factor of A^T A,
then κ(A M^{-1}) = κ(Q_1) = 1. This is the ideal choice for M and all the above iterative
methods will converge in only one step. Thus, a preconditioner should in some sense
approximate R.
The simplest preconditioner corresponds to a diagonal scaling of the columns of
A,

M = D_A^{1/2} = diag(d_1^{1/2}, ..., d_n^{1/2}),   d_j = ||a_j||_2^2.        (20.14)

Since A M^{-1} has columns of unit length, this approximately minimizes κ(A M^{-1})
over all diagonal scalings, see Remark 6.6.
A more general preconditioner is the SSOR preconditioner. Here we take

M = D^{-1/2}(D + ω L^T),   0 ≤ ω < 2,        (20.15)

where A^T A has been split so that

A^T A = L + D + L^T,

with L strictly lower triangular. BJORCK and ELFVING [1979] showed how to
implement this preconditioner without actually forming A^T A or L. Note that taking
ω = 0 in (20.15) gives (20.14).
Another approach is to take M = R, where R is an incomplete Cholesky factor of
A^T A, i.e.

A^T A = R^T R + E,

with E small and R sparse. One way to compute such a matrix R is to use a direct
method for sparse Cholesky factorization, but only keep those elements in R which
lie within a predetermined sparsity structure, cf. MANTEUFFEL [1980].
SAUNDERS [1979] suggests taking M = U P_c^T, where U is the upper triangular factor
in a factorization of the form

P_r A P_c = L U,

where P_r and P_c are permutation matrices and L ∈ R^(m×n) is unit lower trapezoidal. The
rationale for this choice is that any ill-conditioning in A is usually reflected in U, and
L tends to be well-conditioned.
In many large sparse least squares problems arising from multidimensional
models the matrix A has column block structure

A = (A_1, A_2, ..., A_N),        (20.16)

where

A_j ∈ R^(m×n_j),   n_1 + ... + n_N = n.

One example of such a block structure is the dual block angular form (19.1).
For problems of the structure (20.16) block versions of the preconditioners (20.14)
and (20.15) are particularly suitable. Let the QR decompositions of the blocks be

A_j = Q_j R_j,   Q_j ∈ R^(m×n_j),   j = 1, ..., N.        (20.17)

Then to (20.14) corresponds the block diagonal preconditioner

M = R_B = diag(R_1, R_2, ..., R_N).        (20.18)

For this choice we have A M^{-1} = (Q_1, Q_2, ..., Q_N), i.e. the columns of each block are
mutually orthogonal.
If we split A^T A according to

A^T A = L_B + D_B + L_B^T,

where L_B is strictly lower block triangular, then the block SSOR preconditioner
becomes

M = R_B^{-T}(D_B + ω L_B^T).        (20.19)

This preconditioner was introduced for least squares problems by BJORCK [1979].
As for the corresponding point preconditioner it can be implemented without
actually forming A^T A.
We now consider the use of the block diagonal preconditioner (20.18). We will
partition x and y = M x conformally with (20.16),

x^T = (x_1^T, x_2^T, ..., x_N^T),   y^T = (y_1^T, y_2^T, ..., y_N^T).

Jacobi's method (20.2) applied to the preconditioned problem (20.13) can then be
written

y_j^(k+1) = y_j^(k) + Q_j^T (b - A M^{-1} y^(k)),   j = 1, ..., N,

or in terms of the original variables

x_j^(k+1) = x_j^(k) + R_j^{-1} Q_j^T (b - A x^(k)),   j = 1, ..., N.        (20.20)

This is the block Jacobi method for the normal equations. Note that the correction
z_j = x_j^(k+1) - x_j^(k) equals the solution to the problem

min_{z_j} ||A_j z_j - r^(k)||_2,   r^(k) = b - A x^(k).        (20.21)

Often Q_j is not available and we have to use Q_j = A_j R_j^{-1}. This is equivalent to using
the method of seminormal equations (17.7) for solving (20.21). As discussed in
Section 17 this can lead to some loss of accuracy, and a correction step is
recommended near the solution.
Similarly we can derive a block SOR method for the normal equations. Let
r_1^(k) = b - A x^(k) and for j = 1, ..., N compute

x_j^(k+1) = x_j^(k) + ω z_j,   r_{j+1}^(k) = r_j^(k) - ω A_j z_j,        (20.22)

where z_j is the solution to

min_{z_j} ||A_j z_j - r_j^(k)||_2.        (20.23)

Taking ω = 1 in (20.22) gives the block Gauss-Seidel method.


The case N=2, A=(A1 ,A 2), is of special interest. For the block diagonal
preconditioner (20.18) we have AM- ' =(Q, Q 2) and the matrix of normal equations
for the preconditioned system becomes

(AM- )TAM - 1 = KT I , K=QT Q2 (20.24)

This matrix has "Property A", YOUNG [1971]. This means that it is possible to
reduce the work required per iteration to approximately half for many iterative
methods. This preconditioner is also called the cyclic Jacobi preconditioner.
For matrices with "Property A" the SOR theory holds. As shown by ELFVING
[1980] it follows that the optimal ω in the block SOR method (20.22) is given by (for
N = 2)

ω_opt = 2/(1 + sin θ_min),

where θ_min is the smallest principal angle between the subspaces R(A_1) and R(A_2).
For a definition of principal angles between subspaces see e.g. GOLUB and VAN LOAN
([1983], p. 428). We have

cos θ_min = σ_max(Q_1^T Q_2).

Using ω_opt in the block SOR method reduces the number of iterations by a factor of
2/sin θ_min compared to using ω = 1.
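A small NumPy sketch of how θ_min and ω_opt might be computed for two given blocks (dense and illustrative); cos θ_min is taken as the largest singular value of Q_1^T Q_2:

```python
import numpy as np

rng = np.random.default_rng(5)
A1 = rng.standard_normal((20, 3))
A2 = rng.standard_normal((20, 3))
Q1 = np.linalg.qr(A1)[0]            # orthonormal bases for R(A_1), R(A_2)
Q2 = np.linalg.qr(A2)[0]

# cos(theta_min) = sigma_max(Q_1^T Q_2); clip guards against roundoff > 1.
cos_theta = np.linalg.svd(Q1.T @ Q2, compute_uv=False)[0]
theta_min = np.arccos(min(cos_theta, 1.0))
omega_opt = 2.0 / (1.0 + np.sin(theta_min))
```

For random subspaces of R^20 the smallest principal angle is bounded well away from zero, so ω_opt lies strictly inside [1, 2).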

For N = 2, the preconditioner (20.19) with ω = 1 also has special properties, see
GOLUB, MANNEBACK and TOINT [1986]. We have

M = ( R_1  Q_1^T A_2 )
    ( 0    R_2       )

and it follows that for ω = 1

(A_1, A_2) M^{-1} = (Q_1, (I - Q_1 Q_1^T) Q_2).        (20.25)

Here P_1 = I - Q_1 Q_1^T is the orthogonal projector onto the orthogonal complement
of R(A_1). It follows that the two blocks in (20.25) are mutually orthogonal, and thus
the preconditioned problem (20.13) can be split into the two problems

min_{y_2} ||P_1 Q_2 y_2 - b||_2,   y_1 = Q_1^T b.        (20.26)

This effectively reduces the original system to a system of size n_2. Hence this
preconditioner is also called the reduced system preconditioner. The matrix of
normal equations becomes

(A M^{-1})^T A M^{-1} = ( I  0             ) = ( I  0         ),
                        ( 0  Q_2^T P_1 Q_2 )   ( 0  I - K^T K )

where K = Q_1^T Q_2.
We now consider preconditioning the conjugate gradient method (20.12). It is
convenient to formulate the preconditioned method in terms of the original
variables x. Let x^(0) be an initial approximation and r^(0) = b - A x^(0). Take

p^(0) = s^(0) = M^{-T} A^T r^(0),   γ_0 = ||s^(0)||_2^2,        (20.27)

and for k = 0, 1, 2, ... compute

q^(k) = A M^{-1} p^(k),               α_k = γ_k / ||q^(k)||_2^2,
x^(k+1) = x^(k) + α_k M^{-1} p^(k),   r^(k+1) = r^(k) - α_k q^(k),        (20.28)
s^(k+1) = M^{-T} A^T r^(k+1),         γ_{k+1} = ||s^(k+1)||_2^2,
β_k = γ_{k+1}/γ_k,                    p^(k+1) = s^(k+1) + β_k p^(k).

In order to use the block SSOR preconditioner (20.19) for the conjugate gradient
method we have to be able to compute the vectors q = A M^{-1} p and s = M^{-T} A^T r
efficiently, given p^T = (p_1^T, ..., p_N^T) and r. The following algorithms for this are derived
in BJORCK [1979].
Put q^(N) = 0 and for j = N, N-1, ..., 1 compute

z_j = R_j^{-1}(p_j - ω R_j^{-T} A_j^T q^(j)),   q^(j-1) = q^(j) + A_j z_j.

Then we have

q = q^(0),   M^{-1} p = (z_1^T, ..., z_N^T)^T.

Further put r^(1) = r, and for j = 1, 2, ..., N compute

s_j = R_j^{-T} A_j^T r^(j),   r^(j+1) = r^(j) - ω A_j R_j^{-1} s_j.

Then s^T = (s_1^T, ..., s_N^T).
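These two sweeps can be checked against a dense construction of the preconditioner. The NumPy sketch below (N = 2, dense blocks for illustration) builds M = R_B^{-T}(D_B + ω L_B^T) of (20.19) explicitly and compares; it is an illustration under that reading of the preconditioner, not a literal transcription of BJORCK [1979]:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n1, n2 = 12, 3, 3
A1, A2 = rng.standard_normal((m, n1)), rng.standard_normal((m, n2))
A = np.hstack([A1, A2])
R1, R2 = np.linalg.qr(A1)[1], np.linalg.qr(A2)[1]
omega = 1.3
p = rng.standard_normal(n1 + n2)
r = rng.standard_normal(m)

# Dense reference: M = R_B^{-T} (D_B + omega * L_B^T), cf. (20.19).
RB = np.block([[R1, np.zeros((n1, n2))], [np.zeros((n2, n1)), R2]])
DB = np.block([[A1.T @ A1, np.zeros((n1, n2))],
               [np.zeros((n2, n1)), A2.T @ A2]])
LBt = np.block([[np.zeros((n1, n1)), A1.T @ A2],
                [np.zeros((n2, n1)), np.zeros((n2, n2))]])
M = np.linalg.solve(RB.T, DB + omega * LBt)
q_ref = A @ np.linalg.solve(M, p)
s_ref = np.linalg.solve(M.T, A.T @ r)

# Recursive computation of q = A M^{-1} p (backward sweep over the blocks).
As, Rs, ps = [A1, A2], [R1, R2], [p[:n1], p[n1:]]
q, zs = np.zeros(m), [None, None]
for j in reversed(range(2)):
    zj = np.linalg.solve(
        Rs[j], ps[j] - omega * np.linalg.solve(Rs[j].T, As[j].T @ q))
    q = q + As[j] @ zj
    zs[j] = zj
Minv_p = np.concatenate(zs)            # M^{-1} p = (z_1, z_2)

# Recursive computation of s = M^{-T} A^T r (forward sweep over the blocks).
rj, ss = r.copy(), []
for j in range(2):
    sj = np.linalg.solve(Rs[j].T, As[j].T @ rj)
    rj = rj - omega * As[j] @ np.linalg.solve(Rs[j], sj)
    ss.append(sj)
s = np.concatenate(ss)
```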


For the case N = 2 the SSOR preconditioned conjugate gradient method has been
carefully studied by MANNEBACK [1985]. It is shown that the choice ω = 1, i.e. using
the reduced system preconditioning, is optimal with respect to the number of
iterations. Further, HAGEMAN, LUK and YOUNG [1980] have shown the equivalence
of reduced system preconditioning and cyclic Jacobi preconditioning (ω = 0) for
Chebyshev semi-iteration and the conjugate gradient method. The reduced system
preconditioning will essentially generate the same approximations in half the
number of iterations. Since the work per iteration is about doubled for ω ≠ 0, this
means that for N = 2 cyclic Jacobi preconditioning is optimal for the conjugate
gradient method in the class of SSOR preconditioners.
GOLUB, MANNEBACK and TOINT [1986] have applied the block SSOR precondi-
tioned conjugate gradient method to the Doppler positioning problem. Here the
matrix is of dual block angular form (19.1) but is partitioned into two blocks (A, B),
where A consists of all the diagonal blocks and B of the last block column in (19.1).
Some experimental results of block SSOR preconditionings for problems
partitioned into more than two blocks are given by BJORCK [1979]. In this case ω = 1
is not necessarily optimal. However, in the tests the optimal value of ω was close to
one and the number of iterations required varied slowly with ω around ω = 1.
We finally consider a method based on Lanczos bidiagonalization for computing
approximations to the large singular values and the corresponding singular vectors
of A ∈ R^(m×n), m > n. Let

U^T A V = ( B ),   U^T U = I_m,   V^T V = I_n,        (20.29)
          ( 0 )

where

    ( α_1  β_1                        )
    (      α_2  β_2                   )
B = (           ...  ...              )
    (                α_{n-1}  β_{n-1} )
    (                         α_n     )

is the bidiagonal factorization of A. Let

U = (u_1, ..., u_m),   V = (v_1, ..., v_n),

and equate columns in the two equations A V = U B and A^T U = V B^T. We find for
j = 1, 2, ..., n

A v_j = α_j u_j + β_{j-1} u_{j-1},   A^T u_j = α_j v_j + β_j v_{j+1},        (20.30)

where we take β_0 = β_n = 0. We can solve the equations (20.30) for u_j and v_{j+1} by using
the orthonormality. It follows that for j = 1, 2, ...

u_j = r_j/α_j,       α_j = ||r_j||_2,   r_j = A v_j - β_{j-1} u_{j-1},
v_{j+1} = p_j/β_j,   β_j = ||p_j||_2,   p_j = A^T u_j - α_j v_j.        (20.31)

Starting with a vector v_1 we can recursively compute u_1, v_2, u_2, ... and correspond-
ing elements in B. The process breaks down when α_j = 0 or β_j = 0.
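The recursion (20.31) can be sketched directly in NumPy; for a few steps the test checks the governing relation A V_k = U_k B_k and the orthonormality of the computed vectors (the function name is illustrative):

```python
import numpy as np

def bidiagonalize(A, v1, k):
    """k steps of the recursion (20.31): returns U_k (m x k), V_k (n x k),
    the diagonal alphas and the superdiagonal betas of the bidiagonal B_k."""
    m, n = A.shape
    U, V = np.zeros((m, k)), np.zeros((n, k))
    alphas, betas = np.zeros(k), np.zeros(k)
    V[:, 0] = v1 / np.linalg.norm(v1)
    beta_prev, u_prev = 0.0, np.zeros(m)
    for j in range(k):
        r = A @ V[:, j] - beta_prev * u_prev      # r_j = A v_j - b_{j-1} u_{j-1}
        alphas[j] = np.linalg.norm(r)
        U[:, j] = r / alphas[j]
        p = A.T @ U[:, j] - alphas[j] * V[:, j]   # p_j = A^T u_j - a_j v_j
        beta_prev = np.linalg.norm(p)
        if j + 1 < k:
            betas[j] = beta_prev
            V[:, j + 1] = p / beta_prev
        u_prev = U[:, j]
    return U, V, alphas, betas

rng = np.random.default_rng(7)
A = rng.standard_normal((20, 6))
v1 = rng.standard_normal(6)
k = 4
U, V, alphas, betas = bidiagonalize(A, v1, k)
B = np.diag(alphas) + np.diag(betas[:k - 1], 1)   # upper bidiagonal B_k
```

For such small k the loss of orthogonality mentioned below is not yet visible; in practice reorthogonalization or careful monitoring is needed.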
This algorithm is closely related to the Lanczos tridiagonalization scheme for
symmetric matrices. It is well known that in practice the computed vectors will lose
orthogonality, and usually only k << n steps are carried out. However, the large
singular values of the bidiagonal matrix

      ( α_1  β_1                        )
      (      α_2  β_2                   )
B_k = (           ...  ...              )
      (                α_{k-1}  β_{k-1} )
      (                         α_k     )

tend to be very good approximations to the large singular values of A. Approxi-
mations to the corresponding singular vectors of A can be found from U_k, V_k and the
singular vectors of B_k. A discussion of this method may be found in GOLUB, LUK and
OVERTON [1981].
A similar bidiagonalization algorithm forms the basis for the method LSQR of
PAIGE and SAUNDERS [1982] for solving least squares problems. For this purpose it
turns out that the transformation of A to lower bidiagonal form is more appropriate.
In (20.29) we now take

    ( α_1                 )
    ( β_1  α_2            )
B = (      β_2  ...       ) ∈ R^((n+1)×n).
    (           ...  α_n  )
    (                β_n  )

Formulas corresponding to (20.31) for computing U = (u_1, ..., u_m) and V = (v_1, ..., v_n)
are easily derived: for j = 1, 2, ..., n

v_j = r_j/α_j,       α_j = ||r_j||_2,   r_j = A^T u_j - β_{j-1} v_{j-1},        (20.32)
u_{j+1} = p_j/β_j,   β_j = ||p_j||_2,   p_j = A v_j - α_j u_j,

where we take v_0 = 0.
For solving min_x ||A x - b||_2 we start the recursion (20.32) with

u_1 = b/β_0,   β_0 = ||b||_2,        (20.33)

and compute v_1, u_2, v_2, ... and corresponding elements in B. After k steps we have
computed V_k = (v_1, ..., v_k), U_{k+1} = (u_1, ..., u_{k+1}) and

      ( α_1                 )
      ( β_1  α_2            )
B_k = (      β_2  ...       ).        (20.34)
      (           ...  α_k  )
      (                β_k  )

We now seek a solution x_k ∈ R(V_k) and put x_k = V_k y_k. Then

||A V_k y_k - b||_2 = ||U_{k+1}^T A V_k y_k - U_{k+1}^T b||_2 = ||B_k y_k - β_0 e_1||_2,

where we have used the relation A V_k = U_{k+1} B_k and (20.33). It follows that we have
x_k = V_k y_k, where y_k solves the least squares problem

min_{y_k} ||B_k y_k - β_0 e_1||_2.        (20.35)

Since B_k is bidiagonal this can be solved in O(n) operations.
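A NumPy sketch of the k-step process: run the recursion (20.32) from the starting vector (20.33), form B_k, and solve the small problem (20.35). (LSQR obtains the same x_k by a short recurrence without storing V_k; here V_k is kept for clarity, and the function name is illustrative.)

```python
import numpy as np

def lower_bidiag_lstsq(A, b, k):
    """k steps of the lower bidiagonalization (20.32)-(20.33), then solve
    min ||B_k y - beta_0 e_1||_2 of (20.35) and return x_k = V_k y_k."""
    m, n = A.shape
    U, V = np.zeros((m, k + 1)), np.zeros((n, k))
    alphas, betas = np.zeros(k), np.zeros(k + 1)
    betas[0] = np.linalg.norm(b)                  # beta_0 = ||b||_2, (20.33)
    U[:, 0] = b / betas[0]
    v_prev, beta_prev = np.zeros(n), 0.0
    for j in range(k):
        r = A.T @ U[:, j] - beta_prev * v_prev    # r_j of (20.32)
        alphas[j] = np.linalg.norm(r)
        V[:, j] = r / alphas[j]
        p = A @ V[:, j] - alphas[j] * U[:, j]     # p_j of (20.32)
        beta_prev = np.linalg.norm(p)
        betas[j + 1] = beta_prev
        U[:, j + 1] = p / beta_prev
        v_prev = V[:, j]
    # B_k is (k+1) x k lower bidiagonal, cf. (20.34).
    Bk = np.zeros((k + 1, k))
    Bk[np.arange(k), np.arange(k)] = alphas
    Bk[np.arange(1, k + 1), np.arange(k)] = betas[1:]
    rhs = np.zeros(k + 1)
    rhs[0] = betas[0]                             # beta_0 e_1
    y, *_ = np.linalg.lstsq(Bk, rhs, rcond=None)
    return V @ y

rng = np.random.default_rng(8)
A = rng.standard_normal((10, 4))
b = rng.standard_normal(10)
xk = lower_bidiag_lstsq(A, b, 4)    # k = n steps: exact in exact arithmetic
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```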


As formulated above we are required to save the vectors v_1, ..., v_k. PAIGE and
SAUNDERS [1982] show that in fact a simple recursion can be derived to compute x_k
from x_{k-1}, storing only one extra n-vector. In exact arithmetic LSQR generates the
same sequence of approximations x_k as CGLS, and it requires only access to A via
subroutines for the matrix-vector products A v_j and A^T u_j. Preconditioned versions of
LSQR can be derived in the same way as for CGLS.
CHAPTER IV

Some Modified and Generalized Least Squares Problems

21. Updating QR decompositions and least squares solutions

Assume that the QR decomposition of a given matrix A ∈ R^(m×n) is known. In many
applications A is modified in a simple manner to form a matrix Ā whose QR
decomposition is needed. To compute this from scratch using the methods of
Sections 8 or 9 will require at least mn^2 - (1/3)n^3 flops. However, in many situations it
will be possible to compute the new decomposition much more efficiently by
updating the known QR decomposition of A.
If the R factor of the augmented matrix (A, b) is

( R  z ),        (21.1)
( 0  ρ )

then by Theorem 6.2 the solution to the least squares problem min_x ||A x - b||_2 is
given by

R x = z,   ||A x - b||_2 = ρ.

Hence, by updating the factor (21.1), the updating algorithms given below can be
used to update least squares solutions.
The principal aim of the algorithms in this section is to update the complete QR
decomposition of A,

A = Q ( R ),        (21.2)
      ( 0 )

where Q ∈ R^(m×m) is either stored explicitly or as a product of elementary orthogonal
transformations. For some of the updating problems the more compact factorization
A = Q_1 R, Q_1 ∈ R^(m×n), can be updated.
In some applications the factor Q is not available and one wants to update only
the R factor. This may be the case when A is large and/or sparse. We discuss such
algorithms for some of the updating problems. These algorithms are not always as
reliable as the methods that update both Q and R.
We give in this section algorithms for updating the QR decomposition of A for
three important kinds of modifications:
(1) rank-one changes of A,

(2) appending (deleting) a column of A,


(3) appending (deleting) a row of A.
Numerous aspects of these updating algorithms are discussed by GILL, GOLUB,
MURRAY and SAUNDERS [1974] and they are similar also to those given by GOLUB
and VAN LOAN ([1983], pp. 437-443).

21.1. Rank-one change

Assume that we know the complete QR decomposition (21.2) of the matrix A ∈ R^(m×n).
We want to compute the decomposition

Ā = A + u v^T = Q̄ ( R̄ ),        (21.3)
                  ( 0 )

where u ∈ R^m and v ∈ R^n are given. For simplicity we assume that rank(A) = rank(Ā) = n,
so that R and R̄ are uniquely determined.
We first compute

w = Q^T u,

so that

A + u v^T = Q [ ( R ) + w v^T ].        (21.4)
                ( 0 )

Next we determine a sequence of Givens rotations J_k = R_{k,k+1}(θ_k), k = m-1, ..., 1,
such that

J_1^T ··· J_{m-1}^T w = α e_1,   α = ±||w||_2.

Note that these transformations zero out the last m-1 components of w from the
bottom up. (For details on how to compute J_k see Section 8.) The same
transformations are then applied to the R factor of A, i.e.

J_1^T ··· J_{m-1}^T ( R ) = J_1^T ··· J_n^T ( R ) = H,        (21.5)
                    ( 0 )                   ( 0 )

where we have used that J_{n+1}, ..., J_{m-1} have no effect on R.


Because of the structure of the Givens rotations the matrix H in (21.5) will be
upper triangular except for extra nonzero elements h_{k+1,k}, k = 1, 2, ..., n, e.g.

    ( x x x x )
    ( x x x x )
    ( 0 x x x )
H = ( 0 0 x x ),   m = 7,  n = 4.
    ( 0 0 0 x )
    ( 0 0 0 0 )
    ( 0 0 0 0 )

We now have that

J_1^T ··· J_{m-1}^T [ ( R ) + w v^T ] = H + α e_1 v^T = H̄,        (21.6)
                      ( 0 )

where H̄ has the same structure as H. We now determine Givens rotations
G_k = R_{k,k+1}(φ_k) such that

G_n^T ··· G_1^T H̄ = ( R̄ )
                    ( 0 )

is upper triangular. Here G_k, k = 1, ..., n, will zero the element in position (k+1, k).
Finally the transformations are accumulated into Q to get

Q̄ = Q J_{m-1} ··· J_1 G_1 ··· G_n.

Q̄ and R̄ now give the desired decomposition (21.3).
The work needed for this update is as follows. Computing w = Q^T u takes m^2 flops.
Computing H̄ and α takes 4n^2 flops, and accumulating the transformations J_k and
G_k into Q takes 4(m^2 + mn) flops. This gives a total of 5m^2 + 4mn + 4n^2 flops. If m = n
we have decreased the work from O(n^3) to O(n^2). However, if m >> n then this
updating may be more expensive than computing the decomposition from scratch.
The method described can be generalized to compute the QR decomposition of
A + U V^T, where rank(U V^T) > 1.
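The rank-one update can be sketched as follows in NumPy; the rotation conventions follow the description above, with small helper routines standing in for the Section 8 details (the function names are illustrative):

```python
import numpy as np

def _givens(a, b):
    """Return (c, s) with [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def _rot_rows(X, k, c, s):
    """Apply the rotation to rows (or entries) k and k+1 of X in place."""
    top = c * X[k] + s * X[k + 1]
    X[k + 1] = -s * X[k] + c * X[k + 1]
    X[k] = top

def qr_rank_one_update(Q, R, u, v):
    """Given the complete QR decomposition A = Q (R; 0), with Q m x m and
    R m x n (zero rows below row n), return Q1 and the m x n stacked factor
    (R1; 0) such that A + u v^T = Q1 (R1; 0), as in Section 21.1."""
    m, n = R.shape
    Q, H = Q.copy(), R.astype(float).copy()
    w = Q.T @ u
    # Zero w(m), ..., w(2) from the bottom up; H becomes upper Hessenberg.
    for k in range(m - 2, -1, -1):
        c, s = _givens(w[k], w[k + 1])
        _rot_rows(w, k, c, s)
        _rot_rows(H, k, c, s)
        Q[:, [k, k + 1]] = np.column_stack(
            (c * Q[:, k] + s * Q[:, k + 1], -s * Q[:, k] + c * Q[:, k + 1]))
    H[0] += w[0] * v                     # add alpha e_1 v^T, cf. (21.6)
    # Zero the subdiagonal of the Hessenberg matrix from the top down.
    for k in range(min(n, m - 1)):
        c, s = _givens(H[k, k], H[k + 1, k])
        _rot_rows(H, k, c, s)
        Q[:, [k, k + 1]] = np.column_stack(
            (c * Q[:, k] + s * Q[:, k + 1], -s * Q[:, k] + c * Q[:, k + 1]))
    return Q, H

rng = np.random.default_rng(9)
A = rng.standard_normal((7, 4))
u, v = rng.standard_normal(7), rng.standard_normal(4)
Q, R = np.linalg.qr(A, mode='complete')
Q1, R1 = qr_rank_one_update(Q, R, u, v)
```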

21.2. Appending or deleting a column

Given the QR decomposition of A ∈ R^(m×n),

A = [a_1, ..., a_n] = Q ( R ),        (21.7)
                        ( 0 )

it is often required to compute the QR decomposition of a matrix Ā obtained from
A by appending or deleting a column. More generally one may want the QR
decomposition of an arbitrary subset of columns from A. An important application
occurs in methods for solving least squares problems with inequality constraints.
We first observe that the QR decomposition behaves nicely under partitioning.
Assume that

(A_1, A_2) = Q ( R_11  R_12 ),
               ( 0     R_22 )
               ( 0     0    )

where

A_1 = [a_1, ..., a_p],   A_2 = [a_{p+1}, ..., a_n].

It follows immediately that

A_1 = Q ( R_11 )
        ( 0    )
        ( 0    )

gives the QR decomposition of A_1, i.e. it is trivial to delete the trailing columns
a_{p+1}, ..., a_n from the decomposition.
Suppose now that we want to compute the QR decomposition of the matrix
resulting from deleting the kth column in A,

Ā = [a_1, ..., a_{k-1}, a_{k+1}, ..., a_n].

From the above observation it follows that this can readily be obtained from the QR
decomposition of

A P_L = [a_1, ..., a_{k-1}, a_{k+1}, ..., a_n, a_k] = Q ( R ) P_L.        (21.8)
                                                        ( 0 )

Here P_L is a permutation matrix which performs a left circular shift of the columns
a_k, ..., a_n. We now write

R P_L = [r_1, ..., r_{k-1}, r_{k+1}, ..., r_n, r_k] = ( R_11  R_13  v    ),
                                                      ( 0     w^T   r_kk )
                                                      ( 0     R_33  0    )

where R_11 ∈ R^((k-1)×(k-1)) and R_33 ∈ R^((n-k)×(n-k)) are upper triangular. Hence, the
submatrix

( w^T   r_kk )
( R_33  0    )

is an upper Hessenberg matrix, and we can compute a sequence of Givens rotations
G_i so that

G_{n-1}^T ··· G_k^T ( w^T   r_kk ) = R̄_22
                    ( R_33  0    )

is upper triangular. Note that the last column will fill in. The updated R factor
becomes

R̄ = ( R_11  R̄_12 ),   R̄_12 = (R_13, v),
    ( 0     R̄_22 )

and the required factorization is obtained by deleting the last column in

A P_L = Q̄ ( R̄ ),   Q̄ = Q G_k ··· G_{n-1}.
          ( 0 )

It is obvious that in this case the R factor can be updated without Q being available.
We point out that more generally it is often required to compute the QR
factorization of

[a_1, ..., a_{k-1}, a_{k+1}, ..., a_p, a_k, a_{p+1}, ..., a_n],

i.e. of the matrix resulting from a left circular shift applied to the columns a_k, ..., a_p,
cf. DONGARRA et al. ([1979], p. 10.2). This can be done by an obvious extension of
the above algorithm.
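When only R is held, deleting a column can be sketched as follows in NumPy; removing the column directly produces the same upper Hessenberg pattern as the circular-shift formulation above, and the subdiagonal is then annihilated by Givens rotations (the function name is illustrative):

```python
import numpy as np

def qr_delete_col(R, k):
    """Given the triangular factor R (n x n) of A, return the triangular
    factor of A with column k deleted, using Givens rotations only (no Q).
    Deleting the column leaves an upper Hessenberg matrix from column k on,
    whose subdiagonal is zeroed one row pair at a time."""
    n = R.shape[0]
    H = np.delete(R, k, axis=1).astype(float)     # n x (n-1), Hessenberg
    for j in range(k, n - 1):
        a, b = H[j, j], H[j + 1, j]
        r = np.hypot(a, b)
        c, s = a / r, b / r                       # assumes r != 0 (generic R)
        top = c * H[j] + s * H[j + 1]
        H[j + 1] = -s * H[j] + c * H[j + 1]
        H[j] = top
    return H[: n - 1]                             # last row is now zero

rng = np.random.default_rng(10)
A = rng.standard_normal((9, 5))
R = np.linalg.qr(A)[1]
k = 2
R_new = qr_delete_col(R, k)
```

The result agrees with the factor of the column-deleted matrix up to row signs; a sign-independent check is R_new^T R_new = Ā^T Ā.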

We now consider the problem of computing the QR decomposition of
$$ \tilde A = [a_1, \ldots, a_{k-1}, a_{n+1}, a_k, \ldots, a_n], \tag{21.9} $$
i.e. the matrix where the column $a_{n+1}$ has been appended in the $k$th position. We have
$$ \tilde A = (A, a_{n+1})P_R, $$
where $P_R$ is a permutation matrix which performs a right circular shift on the columns $a_k, \ldots, a_{n+1}$.
We first compute the QR decomposition of $(A, a_{n+1})$ from that of $A$. This is straightforward since the algorithms in Sections 8 and 9 can be made to process $A$ a column at a time. Assume e.g. that we have determined the decomposition
$$ Q^TA = \begin{pmatrix} R \\ 0 \end{pmatrix} $$
by Householder's method. We then form the vector $w = Q^Ta_{n+1}$ and construct a Householder transformation $P$ such that
$$ P\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} u \\ \gamma e_1 \end{pmatrix}, \qquad w = \begin{pmatrix} u \\ v \end{pmatrix}, \quad u \in \mathbb{R}^n, \quad v \in \mathbb{R}^{m-n}. \tag{21.10} $$
Then the required decomposition is
$$ \tilde Q^T(A, a_{n+1}) = \begin{pmatrix} R & u \\ 0 & \gamma \\ 0 & 0 \end{pmatrix}, \qquad \tilde Q = QP. $$

We now write
$$ \tilde RP_R = [r_1, \ldots, r_{k-1}, r_{n+1}, r_k, \ldots, r_n] = \begin{pmatrix} R_{11} & u_1 & R_{12} \\ 0 & u_2 & R_{22} \\ 0 & \gamma & 0 \end{pmatrix}, $$
and determine Givens rotations $G_i$, $i = n, \ldots, k$, to zero the last $n-k+1$ elements in the $k$th column of $\tilde RP_R$. Then
$$ G_k^T \cdots G_n^T\begin{pmatrix} u_2 & R_{22} \\ \gamma & 0 \end{pmatrix} = \tilde R_{22} $$
is upper triangular and the updated $R$ factor is given by
$$ \bar R = \begin{pmatrix} R_{11} & \bar R_{12} \\ 0 & \tilde R_{22} \end{pmatrix}, \qquad \bar R_{12} = (u_1, R_{12}). \tag{21.11} $$
The QR decomposition of $\tilde A$ in (21.9) becomes
$$ \tilde A = \bar Q\begin{pmatrix} \bar R \\ 0 \end{pmatrix}, \qquad \bar Q = \tilde QG_n \cdots G_k. $$

Again, the above method easily generalizes to compute the QR decomposition of
$$ [a_1, \ldots, a_{k-1}, a_p, a_k, \ldots, a_{p-1}, a_{p+1}, \ldots, a_{n+1}], $$
i.e. of the matrix resulting from a right circular shift of the columns $a_k, \ldots, a_p$.
When $Q$ is not stored or is unavailable, the vector $u$ in (21.10) can be found by solving the system
$$ R^Tu = A^Ta_{n+1}, \tag{21.12} $$
and the scalar $\gamma$ is then given by
$$ \gamma^2 = \|a_{n+1}\|_2^2 - \|u\|_2^2. \tag{21.13} $$
We can then proceed to determine the factor $\bar R$ in (21.11) as above. However, rounding errors could cause this method to fail if the new column $a_{n+1}$ is nearly dependent on the columns of $A$. (In this case the expression (21.13) for $\gamma^2$ might even become negative!) As remarked in GILL et al. ([1974], p. 532), if $R$ is built up by a sequence of these modifications, in which the columns of $A$ are added one by one, the process is exactly that of computing $M = A^TA$ and finding its Cholesky factorization. As has been remarked in Section 6 this is numerically less satisfactory than computing $R$ by an orthogonalization method.
If $A$ has been saved, a more accurate method for appending a column in the case when $Q$ is not available has been suggested by BJORCK [1986]. Here one computes $z$ as the solution to the least squares problem
$$ \min_z \|Az - a_{n+1}\|_2 $$
and finds $u$ and $\gamma$ from
$$ u = Rz, \qquad \gamma = \|Az - a_{n+1}\|_2. $$
The advantage of this formulation is that now the method of seminormal equations with a correction step (17.8) can be used to solve for $z$. Note that $\gamma$ is now always defined.
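A minimal sketch of the update (21.12)–(21.13), appending a column when $Q$ is unavailable (the function name `append_col_R` is our own; no safeguard against the near-dependent case discussed above is included):

```python
import numpy as np

def append_col_R(R, A, a):
    """Given R from A = QR (A with full column rank), return the R factor of
    (A, a) without access to Q, via (21.12) and (21.13)."""
    u = np.linalg.solve(R.T, A.T @ a)      # R^T u = A^T a, cf. (21.12)
    gamma2 = a @ a - u @ u                 # (21.13); can go negative in ill-conditioned cases
    gamma = np.sqrt(max(gamma2, 0.0))
    n = R.shape[0]
    Rnew = np.zeros((n + 1, n + 1))
    Rnew[:n, :n] = R
    Rnew[:n, n] = u
    Rnew[n, n] = gamma
    return Rnew

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))
a = rng.standard_normal(8)
R = np.linalg.qr(A)[1]
Rnew = append_col_R(R, A, a)
Rref = np.linalg.qr(np.hstack([A, a[:, None]]))[1]
err = np.max(np.abs(np.abs(Rnew) - np.abs(Rref)))   # agreement up to row signs
```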

21.3. Appending or deleting a row

Given the QR decomposition of a matrix $A \in \mathbb{R}^{m\times n}$, we first consider the problem of computing the QR decomposition of
$$ \tilde A = \begin{pmatrix} A \\ w^T \end{pmatrix}, \qquad w \in \mathbb{R}^n. \tag{21.14} $$

One way to solve this problem has been described in Algorithm 17.1 (sequential row orthogonalization). If we have obtained the decomposition
$$ Q^TA = \begin{pmatrix} R \\ 0 \end{pmatrix}, $$
then
$$ \Pi_{n+1,m+1}\operatorname{diag}(Q^T, 1)\begin{pmatrix} A \\ w^T \end{pmatrix} = \Pi_{n+1,m+1}\begin{pmatrix} R \\ 0 \\ w^T \end{pmatrix} = \begin{pmatrix} R \\ w^T \\ 0 \end{pmatrix}, $$
where $\Pi_{n+1,m+1}$ is a permutation matrix interchanging the rows $n+1$ and $m+1$. The elements in $w$ can now be eliminated by a sequence of Givens rotations $G_k = R_{k,n+1}$, $k = 1, \ldots, n$, giving
$$ G_n^T \cdots G_1^T\begin{pmatrix} R \\ w^T \\ 0 \end{pmatrix} = \begin{pmatrix} \tilde R \\ 0 \end{pmatrix}. $$
The updated $Q$ factor becomes
$$ \tilde Q = \operatorname{diag}(Q, 1)\Pi_{n+1,m+1}G_1 \cdots G_n. $$
Note that appending a row can be performed without Q being available.
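The row-appending update can be coded compactly. The following NumPy sketch (the function name `qr_append_row` is our own) eliminates $w$ against the rows of $R$ with Givens rotations exactly as described, without forming $Q$:

```python
import numpy as np

def qr_append_row(R, w):
    """Given the n x n R factor of A, return the R factor of A with the row
    w^T appended, eliminating w by Givens rotations against the rows of R."""
    R = R.copy()
    w = np.asarray(w, dtype=float).copy()
    n = R.shape[0]
    for k in range(n):
        a, b = R[k, k], w[k]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        # rotate row k of R with w to zero out w[k]
        Rk = c * R[k, :] + s * w
        w = -s * R[k, :] + c * w
        R[k, :] = Rk
    return R

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
w = rng.standard_normal(3)
R = np.linalg.qr(A)[1]
R1 = qr_append_row(R, w)
R2 = np.linalg.qr(np.vstack([A, w]))[1]
err = np.max(np.abs(np.abs(R1) - np.abs(R2)))   # agreement up to row signs
```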
We now consider updating the QR decomposition when a row is deleted. This is usually called the downdating problem. There is no loss of generality in assuming that the first row of $A$ is to be deleted. To delete the $k$th row we merely apply the algorithm below with $A$ and $Q$ replaced by $\Pi_{1,k}A$ and $\Pi_{1,k}Q$, where $\Pi_{1,k}$ is a permutation matrix interchanging rows 1 and $k$.
We wish to obtain the QR decomposition of the matrix $\tilde A \in \mathbb{R}^{(m-1)\times n}$ in
$$ A = \begin{pmatrix} a_1^T \\ \tilde A \end{pmatrix} = Q\begin{pmatrix} R \\ 0 \end{pmatrix}. \tag{21.15} $$
Let $q^T \in \mathbb{R}^m$ be the first row of $Q$ and determine Givens rotations $J_k = R_{k,k+1}$, $k = m-1, \ldots, 1$, so that
$$ J_1^T \cdots J_{m-1}^Tq = \gamma e_1, \qquad \gamma = \pm 1. \tag{21.16} $$
The same transformations applied to $\binom{R}{0}$ will give
$$ J_1^T \cdots J_{m-1}^T\begin{pmatrix} R \\ 0 \end{pmatrix} = \begin{pmatrix} v^T \\ \tilde R \\ 0 \end{pmatrix}, \tag{21.17} $$
where the matrix $\tilde R$ is upper triangular. Note that the transformations $J_{n+1}, \ldots, J_{m-1}$ will not affect $R$. Further we compute
$$ QJ_{m-1} \cdots J_1 = \bar Q. $$
From (21.16) it follows that the first row of $\bar Q$ is $\gamma e_1^T$. Since $\bar Q$ is orthogonal it must then have the form
$$ \bar Q = \begin{pmatrix} \gamma & 0 \\ 0 & \tilde Q \end{pmatrix}, $$
with $\tilde Q \in \mathbb{R}^{(m-1)\times(m-1)}$ orthogonal. Hence, from (21.15),
$$ \begin{pmatrix} a_1^T \\ \tilde A \end{pmatrix} = \bar Q\begin{pmatrix} v^T \\ \tilde R \\ 0 \end{pmatrix} = \begin{pmatrix} \gamma v^T \\ \tilde Q\begin{pmatrix} \tilde R \\ 0 \end{pmatrix} \end{pmatrix}. $$
It follows that the desired decomposition is
$$ \tilde A = \tilde Q\begin{pmatrix} \tilde R \\ 0 \end{pmatrix}. $$
This algorithm for downdating is a special case of the rank-one change algorithm, and is obtained by taking $u = -e_1$, $v = a_1$ in (21.3).
In the downdating algorithm the first row of $Q$ played an essential role. If $Q$ is not available we can proceed as follows. Taking the first row in (21.15) we obtain
$$ a_1^T = q^T\begin{pmatrix} R \\ 0 \end{pmatrix} = (q_1^T, q_2^T)\begin{pmatrix} R \\ 0 \end{pmatrix} = q_1^TR, $$
where $q$ has been partitioned conformingly with the $R$ factor. Hence, we can obtain $q_1$ by solving
$$ R^Tq_1 = a_1, $$
and then, since $q$ is of unit length,
$$ \gamma = \|q_2\|_2 = (1 - \|q_1\|_2^2)^{1/2}. \tag{21.18} $$
The transformations $J_{n+1}, \ldots, J_{m-1}$ in (21.16) will only have the effect of reducing $q_2$ to $\gamma e_1$, and as remarked above will not affect $R$. Thus, we may determine the Givens transformations $J_k$, $k = n, \ldots, 1$, by
$$ J_1^T \cdots J_n^T\begin{pmatrix} q_1 \\ \gamma \end{pmatrix} = \tilde\gamma e_1, \qquad \tilde\gamma = \pm 1, $$
and obtain the updated factor $\tilde R$ as in (21.17). This method is implemented in the LINPACK package and is described in STEWART [1979b].
We note that if $\gamma \le u^{1/2}$, where $u$ is the unit roundoff, then $\gamma$ cannot be computed stably from (21.18) because of severe cancellation in the subtraction $1 - \|q_1\|_2^2$. Therefore this algorithm will not be as stable as that using information from $Q$.
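A sketch of this $Q$-free downdating scheme follows; the function name `qr_delete_first_row` and the ordering of the rotations (each acting on position $k$ and the extra bottom position) are our own implementation choices, and LINPACK's routine differs in detail:

```python
import numpy as np

def qr_delete_first_row(R, a1):
    """Return the R factor of A with its first row a1 deleted, given only R
    (Q unavailable). Unstable when gamma below is tiny, as noted in the text."""
    n = R.shape[0]
    q1 = np.linalg.solve(R.T, a1)                # R^T q1 = a1
    gamma = np.sqrt(max(1.0 - q1 @ q1, 0.0))     # (21.18)
    B = np.vstack([R, np.zeros(n)])              # extra row collects the deleted row
    p = np.append(q1, gamma)
    for k in range(n - 1, -1, -1):
        # rotate component k of p against its last component; apply the same
        # rotation to row k of B and the extra bottom row
        t = np.hypot(p[k], p[n])
        c, s = p[n] / t, p[k] / t
        rk, rlast = B[k, :].copy(), B[n, :].copy()
        B[k, :] = c * rk - s * rlast
        B[n, :] = s * rk + c * rlast
        p[k], p[n] = 0.0, t
    return B[:n, :]                              # bottom row of B now equals a1

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 3))
R = np.linalg.qr(A)[1]
Rd = qr_delete_first_row(R, A[0])
Rref = np.linalg.qr(A[1:])[1]
err = np.max(np.abs(np.abs(Rd) - np.abs(Rref)))   # agreement up to row signs
```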
It has been shown by STEWART [1977b] that $\tilde R$ can be an ill-conditioned function of $R$ and $a_1$, i.e. small perturbations in $R$ and $a_1$ can induce large changes in $\tilde R$. This means that the instability of the downdating problem is intrinsic and will be shared by all algorithms using only $R$ and $a_1$, such as the modified algorithm above.
Another downdating method, which is of interest because it is recursive, requires fewer operations, and is algorithmically very similar to the method for adding a row, has been discussed by GOLUB [1969]. It is based on the following observation. From (21.15) it follows that $\tilde A^T\tilde A = A^TA - a_1a_1^T$ and thus
$$ \tilde R^T\tilde R = R^TR - a_1a_1^T = R^TR + (\mathrm{i}a_1)(\mathrm{i}a_1)^T, $$
where $\mathrm{i}$ denotes the imaginary unit ($\mathrm{i}^2 = -1$). Hence, deleting the row $a_1$ is formally equivalent to adding the row $\mathrm{i}a_1$. We can apply the algorithm above for adding a row, which does not depend on $Q$. The resulting algorithm can be expressed entirely in real arithmetic, see LAWSON and HANSON ([1974], pp. 229-231). However, an error analysis given by BOJANCZYK, BRENT, VAN DOOREN and DE HOOG [1987] indicates that the stability properties of this algorithm are inferior to those of Stewart's algorithm. They give a modification of the algorithm which substantially improves its stability.
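Golub's observation can be realized in real arithmetic with hyperbolic rotations. The following is a minimal sketch (the name `hyperbolic_downdate` is ours), without the stabilizing modification mentioned above:

```python
import numpy as np

def hyperbolic_downdate(R, a):
    """Return Rt with Rt^T Rt = R^T R - a a^T: the deleted row a is treated
    like the 'imaginary' appended row i*a, via hyperbolic rotations."""
    R = R.astype(float).copy()
    a = np.asarray(a, dtype=float).copy()
    n = R.shape[0]
    for k in range(n):
        d = np.sqrt(R[k, k] ** 2 - a[k] ** 2)   # real when the downdate is well posed
        ch, sh = R[k, k] / d, a[k] / d          # ch^2 - sh^2 = 1
        row = ch * R[k, :] - sh * a
        a = -sh * R[k, :] + ch * a              # a[k] becomes exactly zero
        R[k, :] = row
    return R

rng = np.random.default_rng(4)
A = rng.standard_normal((7, 3))
R = np.linalg.qr(A)[1]
Rt = hyperbolic_downdate(R, A[0])
err = np.max(np.abs(Rt.T @ Rt - A[1:].T @ A[1:]))
```

Each hyperbolic rotation preserves $R^TR - aa^T$, which is why the final triangular factor satisfies $\tilde R^T\tilde R = \tilde A^T\tilde A$.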
We stress again that the algorithms for updating and downdating the QR
decomposition can be used to add and remove rows from a least squares problem by
applying them to the augmented matrix (A, b). More details on this use can be found
in DONGARRA et al. ([1979], Chapter 10).
Many of the above updating algorithms require the $m\times m$ orthogonal factor $Q$ and require $O(m^2)$ flops. In many applications, e.g. to nonlinear least squares problems, knowledge of the $m\times n$ factor $Q_1$ would suffice. DANIEL, GRAGG, KAUFMAN and STEWART [1976] have developed stable and relatively efficient methods for updating the Gram-Schmidt QR factorization. These methods only require $O(mn)$ storage and operations.
BUNCH and NIELSEN [1978] have developed methods for updating the singular value decomposition of $A$ when $A$ is modified by appending or deleting a row or column. Also algorithms for solving the correspondingly modified least squares problems are developed. These updating methods, however, all require $O(n^3)$ operations when $A \in \mathbb{R}^{m\times n}$, $m \ge n$.

22. The CS decomposition and the generalized singular value decomposition

In Section 3 we derived the singular value decomposition (SVD) of a matrix $A \in \mathbb{R}^{m\times n}$ of rank $r$:
$$ A = U\Sigma V^T, \qquad \Sigma = \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix}, \tag{22.1} $$
where $U \in \mathbb{R}^{m\times m}$, $V \in \mathbb{R}^{n\times n}$ are orthogonal matrices and $\Sigma_r = \operatorname{diag}(\sigma_1, \ldots, \sigma_r)$. We may further order the singular values so that $\sigma_1 \ge \cdots \ge \sigma_r > 0$.
In this section we consider a generalization of the SVD to any two matrices $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{p\times n}$ with the same number of columns. This generalized singular
value decomposition (GSVD) and its application to certain constrained least
squares problems was first studied by VAN LOAN [1976]. We will use a computa-
tionally more amenable form given by PAIGE and SAUNDERS [1981]. In the special
case that A and B are blocks of a partitioned matrix having orthonormal columns
the GSVD simplifies to the CS decomposition (CSD). Recently, stable algorithms for
computing the GSVD based on the CSD have been developed by STEWART [1982,
1983] and VAN LOAN [1984b]. PAIGE [1986] gives an algorithm which consists of an
iterative sequence of cycles where each cycle is made up of the serial application of
2 x 2 generalized singular value decompositions.
We first consider the CS decomposition, which is of interest in its own right.

THEOREM 22.1 (CS decomposition). Suppose the n columns of the real matrix
$$ Q = \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} \in \mathbb{R}^{(m+p)\times n}, \qquad Q_1 \in \mathbb{R}^{m\times n}, \quad Q_2 \in \mathbb{R}^{p\times n}, \quad m \ge n, \tag{22.2} $$
are orthonormal, i.e. $Q^TQ = Q_1^TQ_1 + Q_2^TQ_2 = I_n$. Then there are orthogonal matrices $U_1 \in \mathbb{R}^{m\times m}$, $U_2 \in \mathbb{R}^{p\times p}$ and $V \in \mathbb{R}^{n\times n}$ such that
$$ U_1^TQ_1V = \begin{pmatrix} C \\ 0 \end{pmatrix}, \qquad C = \operatorname{diag}(c_1, \ldots, c_n), \tag{22.3} $$
and
$$ U_2^TQ_2V = \begin{pmatrix} S & 0 \\ 0 & 0 \end{pmatrix}, \qquad S = \operatorname{diag}(s_1, \ldots, s_q), \quad q = \min(n, p). \tag{22.4} $$
The diagonal elements $c_i$ and $s_i$ satisfy
$$ c_i^2 + s_i^2 = 1, \quad i = 1, \ldots, q, \qquad c_i = 1, \quad i = q+1, \ldots, n. $$
Without loss of generality, we may assume that
$$ 0 \le c_1 \le \cdots \le c_r < c_{r+1} = \cdots = c_n = 1, \tag{22.5} $$
$$ 1 \ge s_1 \ge s_2 \ge \cdots \ge s_q \ge 0. $$

PROOF (Cf. STEWART [1982]). To construct $U_1$, $V$ and $C$, note that since $U_1$ and $V$ are orthogonal and $C$ is a nonnegative diagonal matrix, (22.3) is the singular value decomposition of $Q_1$. Hence the elements $c_i$ are the singular values of $Q_1$.
If we put $\bar Q_2 = Q_2V$, then the matrix
$$ \begin{pmatrix} U_1^T & 0 \\ 0 & I \end{pmatrix}\begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix}V = \begin{pmatrix} C \\ 0 \\ \bar Q_2 \end{pmatrix} $$
has orthonormal columns. Hence,
$$ C^2 + \bar Q_2^T\bar Q_2 = I, $$
which implies that $\bar Q_2^T\bar Q_2 = I_n - C^2$ is diagonal, and hence the matrix $\bar Q_2 = (\bar q_1^{(2)}, \ldots, \bar q_n^{(2)})$ has orthogonal columns.
We assume that the singular values $c_i$ of $Q_1$ have been ordered according to (22.5) and that $c_r < c_{r+1} = 1$. Then the matrix $U_2 = (u_1^{(2)}, \ldots, u_p^{(2)})$ is constructed as follows. Since $\|\bar q_j^{(2)}\|_2^2 = 1 - c_j^2 \ne 0$ for $j \le r$, we take
$$ u_j^{(2)} = \bar q_j^{(2)}/\|\bar q_j^{(2)}\|_2, \qquad j = 1, \ldots, r, $$
and fill the possibly remaining columns of $U_2$ with orthonormal vectors in the orthogonal complement of $\mathscr{R}(\bar Q_2)$. From the construction it follows that $U_2 \in \mathbb{R}^{p\times p}$ is orthogonal and that
$$ U_2^TQ_2V = U_2^T\bar Q_2 = \begin{pmatrix} S & 0 \\ 0 & 0 \end{pmatrix}, \qquad S = \operatorname{diag}(s_1, \ldots, s_q), $$
with
$$ s_j = (1 - c_j^2)^{1/2} > 0, \quad j = 1, \ldots, r, \qquad s_j = 0, \quad j = r+1, \ldots, q. \qquad \square $$
REMARK 22.1. The assumption $m \ge n$ in (22.2) is made for notational convenience only. STEWART [1982] treats the general case, which gives rise to four different forms corresponding to cases where $Q_1$ and/or $Q_2$ have too few rows to accommodate a full diagonal matrix of order n.

REMARK 22.2. The proof of the CS decomposition is constructive. In particular $U_1$, $V$ and $C$ can be computed by a standard SVD algorithm. However, the algorithm above for computing $U_2$ is unstable when some singular values $c_i$ are close to 1. STEWART [1982] and VAN LOAN [1984b] describe modified stable algorithms for computing the CSD.
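The constructive proof translates directly into a (potentially unstable, as Remark 22.2 warns) computation. A NumPy sketch for the generic case where all $c_i < 1$, so that no extra columns of $U_2$ need to be appended (all variable names are ours):

```python
import numpy as np

# Constructive CS decomposition sketch (generic case: all c_i < 1).
rng = np.random.default_rng(5)
m, p, n = 6, 5, 4
Q = np.linalg.qr(rng.standard_normal((m + p, n)))[0]   # (m+p) x n, orthonormal columns
Q1, Q2 = Q[:m], Q[m:]

U1, c, Vt = np.linalg.svd(Q1)      # (22.3): Q1 = U1 [C; 0] V^T, c_i = singular values
Q2bar = Q2 @ Vt.T                  # Q2bar^T Q2bar = I - C^2 is diagonal
s = np.linalg.norm(Q2bar, axis=0)  # s_j = (1 - c_j^2)^{1/2}
U2 = Q2bar / s                     # first n (orthonormal) columns of U2

err_cs = np.max(np.abs(c ** 2 + s ** 2 - 1.0))        # c_i^2 + s_i^2 = 1
err_orth = np.max(np.abs(U2.T @ U2 - np.eye(n)))      # columns of U2 orthonormal
```

When some $c_i \approx 1$ the corresponding column of `Q2bar` is tiny and the normalization loses accuracy; this is exactly the instability the stable algorithms of Stewart and Van Loan avoid.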

The CS decomposition can be stated in a related form which is often useful. Let $Q \in \mathbb{R}^{m\times m}$ be orthogonal and consider the partitioning
$$ Q = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}, \qquad Q_{11} \in \mathbb{R}^{j\times j}, \quad Q_{22} \in \mathbb{R}^{k\times k}, \quad j \ge k. $$
Then there exist orthogonal matrices $U_1, V_1 \in \mathbb{R}^{j\times j}$ and $U_2, V_2 \in \mathbb{R}^{k\times k}$ such that
$$ \begin{pmatrix} U_1^T & 0 \\ 0 & U_2^T \end{pmatrix}\begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}\begin{pmatrix} V_1 & 0 \\ 0 & V_2 \end{pmatrix} = \begin{pmatrix} C & 0 & -S \\ 0 & I_{j-k} & 0 \\ S & 0 & C \end{pmatrix}, \tag{22.6} $$
where
$$ C = \operatorname{diag}(c_1, \ldots, c_k), \qquad S = \operatorname{diag}(s_1, \ldots, s_k), $$
$$ c_i = \cos\theta_i, \quad s_i = \sin\theta_i, \quad 0 \le \theta_i \le \tfrac{1}{2}\pi, \quad i = 1, \ldots, k. $$
The decomposition (22.6) can be proved in a way similar to the proof of Theorem 22.1. A proof of a slightly more general decomposition, where $Q_{11}$ and $Q_{22}$ are not
required to be square matrices is given by PAIGE and SAUNDERS [1981].
Note that the decomposition (22.6) treats rows and columns of Q in a symmetric
way. The matrix on the right-hand side of (22.6) is a generalization of a Givens
rotation matrix, cf. (8.9) and its transpose is its inverse. The decomposition (22.6) was
first explicitly given by STEWART [1977a] who remarked that it "often enables one to
obtain routine computational proofs of geometric theorems that would otherwise
require considerable ingenuity to establish."
The CS decomposition now enables us to give a constructive development of the
GSVD of two matrices A and B with the same number of columns.

THEOREM 22.2 (Generalized singular value decomposition (GSVD)). Let $A \in \mathbb{R}^{m\times n}$, $m \ge n$, and $B \in \mathbb{R}^{p\times n}$ be given matrices. Assume that
$$ \operatorname{rank}(M) = k \le n, \qquad M = \begin{pmatrix} A \\ B \end{pmatrix}. $$
Then there exist orthogonal matrices $U_A \in \mathbb{R}^{m\times m}$, $U_B \in \mathbb{R}^{p\times p}$ and a matrix $Z \in \mathbb{R}^{k\times n}$ of rank $k$ such that
$$ A = U_A\begin{pmatrix} D_A \\ 0 \end{pmatrix}Z, \qquad B = U_B\begin{pmatrix} D_B & 0 \\ 0 & 0 \end{pmatrix}Z, \tag{22.7} $$
where
$$ D_A = \operatorname{diag}(\alpha_1, \ldots, \alpha_k), \qquad D_B = \operatorname{diag}(\beta_1, \ldots, \beta_q), \qquad q = \min(p, k). $$
Further we have
$$ \alpha_i^2 + \beta_i^2 = 1, \quad i = 1, \ldots, q, \qquad \alpha_i = 1, \quad i = q+1, \ldots, k, \tag{22.8} $$
and the singular values of $Z$ equal the nonzero singular values of $M$.

PROOF. Let the SVD of $M$ be
$$ M = Q\begin{pmatrix} \Sigma_k & 0 \\ 0 & 0 \end{pmatrix}P^T, $$
where $Q$ and $P$ are orthogonal matrices of order $m+p$ and $n$ respectively and
$$ \Sigma_k = \operatorname{diag}(\sigma_1, \ldots, \sigma_k), \qquad \sigma_1 \ge \cdots \ge \sigma_k > 0. $$
Set $t = m+p-k$ and partition $Q$ and $P$ as follows:
$$ Q = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}, \qquad P = (P_1, P_2), $$
where $Q_{11} \in \mathbb{R}^{m\times k}$, $Q_{21} \in \mathbb{R}^{p\times k}$ and $P_1 \in \mathbb{R}^{n\times k}$. Then the SVD of $M$ can be written
$$ MP = \begin{pmatrix} AP_1 & 0 \\ BP_1 & 0 \end{pmatrix} = \begin{pmatrix} Q_{11} \\ Q_{21} \end{pmatrix}(\Sigma_k, 0). \tag{22.9} $$
Now let
$$ Q_{11} = U_A\begin{pmatrix} C \\ 0 \end{pmatrix}V^T, \qquad Q_{21} = U_B\begin{pmatrix} S & 0 \\ 0 & 0 \end{pmatrix}V^T $$
be the CS decomposition of $Q_{11}$ and $Q_{21}$. Substituting this into (22.9) we obtain
$$ AP = U_A\begin{pmatrix} C \\ 0 \end{pmatrix}V^T(\Sigma_k, 0), \qquad BP = U_B\begin{pmatrix} S & 0 \\ 0 & 0 \end{pmatrix}V^T(\Sigma_k, 0). $$
Hence (22.7) follows with
$$ D_A = C, \qquad D_B = S, \qquad Z = V^T(\Sigma_k, 0)P^T, $$
and, since $V$ and $P$ are orthogonal, $\sigma_1 \ge \cdots \ge \sigma_k > 0$ are the singular values of $Z$. $\square$

REMARK 22.3. The assumption $m \ge n$ in the theorem is made for notational convenience only. For a general formulation and proof see PAIGE and SAUNDERS [1981].

The generalized singular value decomposition has several applications, e.g. in constrained least squares problems, see Sections 25 and 26. In the next section we show that the GSVD is the natural tool for analyzing the general Gauss-Markoff linear model.

23. The general Gauss-Markoff linear model

Let $A \in \mathbb{R}^{m\times n}$ be a known matrix, $b \in \mathbb{R}^m$ a known vector and $x \in \mathbb{R}^n$ an unknown parameter vector which is to be estimated. The general Gauss-Markoff linear model has the form
$$ Ax + \varepsilon = b, \qquad V(\varepsilon) = \sigma^2W, \tag{23.1} $$
where $\varepsilon$ is a random vector with zero mean and covariance matrix $\sigma^2W$, and $W$ is a symmetric nonnegative-definite matrix.
For W = I we get the special linear model discussed in Section 1. In Section 14 we
considered weighted linear models where W was a positive diagonal matrix and also
the model (23.1) for general positive-definite matrices W. In the following we will
treat the general case, when both matrices A and W may be rank-deficient.
We will assume that we are given $W$ in factored form
$$ W = BB^T, \qquad B \in \mathbb{R}^{m\times p}, \quad p \le m. \tag{23.2} $$

If $W$ is initially given, then $B$ can be computed as the Cholesky factor of $W$. We replace (23.1) by the equivalent model
$$ Ax + Bu = b, \qquad V(u) = \sigma^2I, \tag{23.3} $$
where the random vector $u \in \mathbb{R}^p$ has covariance matrix $\sigma^2I$. We now show how the generalized singular value decomposition can be used to analyze the model (23.3). The following analysis is based on the work by PAIGE [1985].
Since the matrices $A$ and $B$ have the same number of rows, a slightly modified version of the generalized singular value decomposition of Theorem 22.2 can be applied to $A^T \in \mathbb{R}^{n\times m}$ and $B^T \in \mathbb{R}^{p\times m}$. We state the resulting decomposition using slightly different notations.
Assume that
$$ r = \operatorname{rank}(A), \qquad s = \operatorname{rank}(B), \qquad k = \operatorname{rank}(A, B), $$
where $r \le n$, $s \le p$, $k \le r+s$. Then there exist orthogonal matrices $U \in \mathbb{R}^{n\times n}$ and $V \in \mathbb{R}^{p\times p}$ and a matrix $Z \in \mathbb{R}^{m\times k}$ of rank $k$ such that
$$ AU = Z\begin{pmatrix} 0 & 0 & 0 \\ 0 & D_A & 0 \\ 0 & 0 & I_{k-s} \end{pmatrix}, \qquad BV = Z\begin{pmatrix} I_{k-r} & 0 & 0 \\ 0 & D_B & 0 \\ 0 & 0 & 0 \end{pmatrix}, \tag{23.4} $$
where the row blocks have sizes $k-r$, $q$, $k-s$ in both factors, the column blocks have widths $n-r$, $q$, $k-s$ and $k-r$, $q$, $p-s$ respectively, $q = r+s-k$,
$$ D_A = \operatorname{diag}(\alpha_1, \ldots, \alpha_q), \qquad D_B = \operatorname{diag}(\beta_1, \ldots, \beta_q), $$
$$ 0 < \alpha_1 \le \cdots \le \alpha_q < 1, \qquad 1 > \beta_1 \ge \cdots \ge \beta_q > 0, $$
and $D_A^2 + D_B^2 = I$. Note that the row partitionings in (23.4) are the same.
If we partition the orthogonal matrices $U$ and $V$ conformingly with the column blocks on the right-hand sides in (23.4),
$$ U = (U_1, U_2, U_3), \qquad V = (V_1, V_2, V_3), $$
then we note that $AU_1 = 0$, $BV_3 = 0$, and hence $U_1$ and $V_3$ span the nullspaces of $A$ and $B$ respectively. The decomposition (23.4) separates out the common column space of $A$ and $B$. We have $AU_2D_B = BV_2D_A$, and since $D_A > 0$ and $D_B > 0$ it follows that
$$ \mathscr{R}(AU_2) = \mathscr{R}(BV_2) = \mathscr{R}(A) \cap \mathscr{R}(B), $$
which has dimension $q$. For the special case $B = I$ we have $s = k = m$ and then $q = \operatorname{rank}(A)$.
Now let the QR decomposition of the matrix $Z$ in (23.4) be
$$ Q^TZ = \begin{pmatrix} R \\ 0 \end{pmatrix}, \qquad Q = (Q_1, Q_2), \tag{23.5} $$

where $R \in \mathbb{R}^{k\times k}$ is upper triangular and nonsingular. In the model (23.3) we make the orthogonal transformations of variables
$$ \bar x = U^Tx, \qquad \bar u = V^Tu. \tag{23.6} $$
Then, using (23.4) and (23.5), the model (23.3) becomes
$$ R\left[\begin{pmatrix} 0 & 0 & 0 \\ 0 & D_A & 0 \\ 0 & 0 & I_{k-s} \end{pmatrix}\begin{pmatrix} \bar x_1 \\ \bar x_2 \\ \bar x_3 \end{pmatrix} + \begin{pmatrix} I_{k-r} & 0 & 0 \\ 0 & D_B & 0 \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} \bar u_1 \\ \bar u_2 \\ \bar u_3 \end{pmatrix}\right] = Q_1^Tb, \tag{23.7} $$
where $\bar x_i = U_i^Tx$, $\bar u_i = V_i^Tu$, $i = 1, 2, 3$.


It immediately follows that the model is correct only if $Q_2^Tb = 0$, which is equivalent to the condition $b \in \mathscr{R}(A, B)$. If this condition is not satisfied, then $b$ could not have come from the model.
The remaining part of (23.7) can now be written
$$ \begin{pmatrix} R_{11} & R_{12} & R_{13} \\ & R_{22} & R_{23} \\ & & R_{33} \end{pmatrix}\begin{pmatrix} \bar u_1 \\ D_A\bar x_2 + D_B\bar u_2 \\ \bar x_3 \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix}, \tag{23.8} $$
where we have partitioned $R$ and $c = Q_1^Tb$ conformingly (with blocks of sizes $k-r$, $q$, $k-s$) with the block rows of the two block diagonal matrices in (23.7).
We first note that $\bar x_1$ has no effect on $b$ and therefore cannot be estimated. The decomposition
$$ x = x^n + x^e, \qquad x^n = U_1\bar x_1, \quad x^e = U_2\bar x_2 + U_3\bar x_3, $$
splits $x$ into a nonestimable part $x^n$ and an estimable part $x^e$. Further, $\bar x_3$ can be determined exactly from
$$ R_{33}\bar x_3 = c_3. $$
Note that $\bar x_3$ has dimension $k-s = \operatorname{rank}(A, B) - \operatorname{rank}(B)$, so that this can only occur when $\operatorname{rank}(B) < m$.
The second block row in (23.8) gives the linear model
$$ D_A\bar x_2 + D_B\bar u_2 = R_{22}^{-1}(c_2 - R_{23}\bar x_3), $$
where from (23.6) we have that $V(\bar u_2) = \sigma^2I$. Here the right-hand side is known and the best linear unbiased estimate of $\bar x_2$ is
$$ \hat x_2 = D_A^{-1}R_{22}^{-1}(c_2 - R_{23}\bar x_3). \tag{23.9} $$
The error satisfies $D_A(\hat x_2 - \bar x_2) = D_B\bar u_2$ and hence the error covariance is
$$ V(\hat x_2 - \bar x_2) = \sigma^2(D_A^{-1}D_B)^2, $$
and so has uncorrelated components. Note that the covariance need not be large even if $D_A$ has small elements, provided that the corresponding elements in $D_B$ are also small.
The random vector $\bar u_3$ has no effect on $b$. The dimension of $\bar u_3$ is $p-s = p - \operatorname{rank}(B)$ and so is zero if $B$ has linearly independent columns. Finally, the vector $\bar u_1$ can be solved exactly from the system (23.8). Since $\bar u_1$ has zero mean and covariance matrix $\sigma^2I$, it can be used in estimating $\sigma^2$. Note that $\bar u_1$ has dimension $k-r = \operatorname{rank}(A, B) - \operatorname{rank}(A)$.
REMARK 23.1. It can be shown that the best linear unbiased estimate of any estimable function of $x$ in the model (23.3) is given by the solution to the constrained least squares problem
$$ \min_{v,x} \|v\|_2, \quad \text{s.t.}\ Ax + Bv = b. \tag{23.10} $$
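This constrained formulation suggests a direct computational route. The following sketch (the function name `general_gm_estimate` is ours) solves a problem of the form (23.10) under the simplifying assumptions that $A$ has full column rank and the constraint equations are consistent, which avoids the full GSVD machinery:

```python
import numpy as np

def general_gm_estimate(A, B, b):
    """Solve min ||v||_2 s.t. Ax + Bv = b, assuming A has full column rank
    and the constraints are consistent."""
    m, n = A.shape
    Q, R = np.linalg.qr(A, mode='complete')
    c = Q.T @ b
    Bt = Q.T @ B
    # rows n..m-1 of the transformed constraint no longer involve x:
    # pick the minimum norm v satisfying them (lstsq returns the min norm solution)
    v = np.linalg.lstsq(Bt[n:], c[n:], rcond=None)[0]
    x = np.linalg.solve(R[:n], c[:n] - Bt[:n] @ v)
    return x, v

rng = np.random.default_rng(6)
A = rng.standard_normal((6, 2))
B = rng.standard_normal((6, 6))
x0 = np.array([1.0, -2.0])
u0 = 0.1 * rng.standard_normal(6)
b = A @ x0 + B @ u0                     # data generated from the model (23.3)
x, v = general_gm_estimate(A, B, b)
res = np.max(np.abs(A @ x + B @ v - b))
```

Any feasible pair must satisfy the transformed constraints row by row, so the recovered $v$ can be no larger than the noise realization that generated the data.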

REMARK 23.2. The generalized singular value decomposition could be used to compute the estimate (23.9) and its error covariance matrix. However, a reliable and more efficient method has been given by PAIGE [1979b]. This method has been described for the case of a positive-definite $W$ in Section 14.

24. The total least squares problem


The least squares problem may be formulated as follows:
Given a matrix $A \in \mathbb{R}^{m\times n}$ with $m > n$ and a vector $b \in \mathbb{R}^m$, find $x \in \mathbb{R}^n$ as the solution to
$$ \min_x \|r\|_2, \quad \text{s.t.}\ Ax = b + r. \tag{24.1} $$
The underlying assumption here is that errors only occur in the vector b and that the
matrix A is exactly known. Often this assumption is not realistic and sampling or
modeling errors also affect the matrix A.
One way to take errors in $A$ into account is to introduce perturbations also in $A$ and consider the following total least squares (TLS) problem:
$$ \min_{E,r} \|(E, r)\|_E, \quad \text{s.t.}\ (A+E)x = b+r, \tag{24.2} $$
where $\|\cdot\|_E$ denotes the Euclidean or Frobenius matrix norm, i.e.
$$ \|B\|_E^2 = \sum_{i,j} |b_{ij}|^2. $$
If a minimizing pair $(E, r)$ has been found for the problem (24.2), then any $x$ satisfying $(A+E)x = b+r$ is said to solve the TLS problem.

In the statistical literature this problem is known as latent root regression. The
total least squares problem has been analyzed in terms of the singular value
decomposition by GOLUB and VAN LOAN [1980], see also GOLUB and VAN LOAN
([1983], Section 12.3). They consider the more general problem of minimizing
IID(E,r)TIE, where D and T are nonsingular diagonal scaling matrices. In the
following we assume for notational convenience that the row and column scalings
have been applied explicitly to (A, b) so that we can take D = Im and T =I,.
We note that the constraint in (24.2) implies that
$$ b + r \in \mathscr{R}(A+E). \tag{24.3} $$
It can also be written
$$ (\bar A, \bar b)\begin{pmatrix} x \\ -1 \end{pmatrix} = 0, \qquad \bar A = A+E, \quad \bar b = b+r, $$
which shows that the matrix $\bar C = C + \Delta C$, where
$$ C = (A, b), \qquad \Delta C = (E, r), \tag{24.4} $$
is rank-deficient and that $(x^T, -1)^T$ is a corresponding right singular vector. Hence the TLS problem involves finding a perturbation matrix having minimal Euclidean norm which lowers the rank of the matrix $(A, b)$.
The total least squares problem can be analyzed in terms of the singular value decomposition of $C$. Following GOLUB and VAN LOAN [1980] we let
$$ C = U\Sigma V^T, \qquad \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_{n+1}), $$
where $U^TU = I_m$, $V^TV = I_{n+1}$, and assume that
$$ \sigma_1 \ge \cdots \ge \sigma_k > \sigma_{k+1} = \cdots = \sigma_{n+1}. \tag{24.5} $$
Then from Theorem 3.2 and Remark 3.4 it follows that
$$ \sigma_{n+1} = \min_{\operatorname{rank}(C+\Delta C)\,\le\, n} \|\Delta C\|_E \tag{24.6} $$
and the minimum is attained for $\Delta C = -Cvv^T$, where $v$ is any unit vector in the subspace
$$ S_C = \operatorname{span}[v_{k+1}, \ldots, v_{n+1}]. $$
Assume that we can find a unit vector $v$ in $S_C$ whose $(n+1)$st component is nonzero. Then we can write
$$ v = \begin{pmatrix} y \\ \eta \end{pmatrix}, \qquad x = -y/\eta, $$
and the following holds:
$$ -\eta(A+E,\ b+r)\begin{pmatrix} x \\ -1 \end{pmatrix} = (C + \Delta C)v = C(I - vv^T)v = 0. $$
This shows that $x$ solves the TLS problem.
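The SVD-based solution just described is easily sketched (the function name `tls` is ours; it assumes $\sigma_{n+1}$ is simple and that $v_{n+1}$ has a nonzero last component):

```python
import numpy as np

def tls(A, b):
    """Basic TLS solution via the SVD of C = (A, b)."""
    n = A.shape[1]
    C = np.hstack([A, b[:, None]])
    v = np.linalg.svd(C)[2].T[:, n]    # right singular vector for sigma_{n+1}
    return -v[:n] / v[n]               # x = -y / eta

rng = np.random.default_rng(7)
x0 = np.array([2.0, -1.0])
A0 = rng.standard_normal((30, 2))
A = A0 + 0.01 * rng.standard_normal((30, 2))    # errors in A ...
b = A0 @ x0 + 0.01 * rng.standard_normal(30)    # ... and in b
x = tls(A, b)
sigma = np.linalg.svd(np.hstack([A, b[:, None]]))[1][-1]
# the TLS solution satisfies (A^T A - sigma_{n+1}^2 I) x = A^T b, cf. (24.7)
chk = np.max(np.abs((A.T @ A - sigma**2 * np.eye(2)) @ x - A.T @ b))
err = np.max(np.abs(x - x0))
```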


The TLS problem may fail to have a solution, as is illustrated by the following simple example. Let
$$ A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. $$
Here we have $b \in \mathscr{R}(A+E)$ for $E = \operatorname{diag}(0, \varepsilon)$ for any $\varepsilon \ne 0$, so there does not exist a smallest value of $\|(E, r)\|_E$.
The TLS problem has no solution if $e_{n+1} = (0, \ldots, 0, 1)^T$ is orthogonal to $S_C$. If $\sigma_{n+1}$ is a repeated singular value, i.e. $k < n$ in (24.5), then the TLS problem may lack a unique solution. In this case a unique minimum norm TLS solution can be determined as follows. Let $Q$ be an orthogonal matrix of order $n-k+1$ such that the last row of
$$ (v_{k+1}, \ldots, v_{n+1})Q = \begin{pmatrix} Y & y \\ 0 & \eta \end{pmatrix} $$
is zero except for its last element $\eta$. If we set $x = -y/\eta$, then it is easy to show that all other solutions to the TLS problem have larger norms.
Let the singular values of $A$ be
$$ \hat\sigma_1 \ge \hat\sigma_2 \ge \cdots \ge \hat\sigma_n. $$
Then the separation theorem for singular values, Theorem 3.5, implies that
$$ \sigma_1 \ge \hat\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge \hat\sigma_n \ge \sigma_{n+1}. $$
It follows that if the condition $\hat\sigma_n > \sigma_{n+1}$ is satisfied, then $\sigma_{n+1}$ is not a repeated singular value of $C$. It also follows that $v_{n+1}$ must have a nonzero last component, since otherwise
$$ v_{n+1} = \begin{pmatrix} y \\ 0 \end{pmatrix} \quad \text{and} \quad A^TAy = \sigma_{n+1}^2y, $$
so that $\hat\sigma_n = \sigma_{n+1}$. Hence the condition $\hat\sigma_n > \sigma_{n+1}$ is sufficient for the TLS problem to have a unique solution.
A geometrical interpretation of the total least squares problem can be obtained as follows. By the minmax characterization of singular values given in Theorem 3.3 it follows that the TLS solution, if it exists, satisfies
$$ \sigma_{n+1}^2 = \min_x \left\|(A, b)\begin{pmatrix} x \\ -1 \end{pmatrix}\right\|_2^2 \bigg/ \left\|\begin{pmatrix} x \\ -1 \end{pmatrix}\right\|_2^2. $$
We now make the observation that the quantity
$$ d^2(x) = \sum_{i=1}^m |a_i^Tx - b_i|^2/(\|x\|_2^2 + 1), $$
where $a_i^T$ is the $i$th row of $A$, sums the squares of the orthogonal distances from the points
$$ \begin{pmatrix} a_i \\ b_i \end{pmatrix} \in \mathbb{R}^{n+1} $$
to the plane through the origin
$$ P_x = \left\{\begin{pmatrix} a \\ b \end{pmatrix} : a \in \mathbb{R}^n,\ b \in \mathbb{R},\ b = a^Tx\right\}. $$
Hence, the TLS solution minimizes the sum of squares of orthogonal distances $d^2(x)$, and therefore is a special case of orthogonal regression, see Section 34.
We now consider the conditioning of the total least squares problem and its relation to the least squares problem. To ensure unique solutions to both the TLS and the least squares problems we assume that $\hat\sigma_n > \sigma_{n+1}$, and we denote those solutions by $x_{TLS}$ and $x_{LS}$ respectively. The vector $(x_{TLS}^T, -1)^T$ is a right singular vector of $(A, b)$ corresponding to $\sigma_{n+1}$, and hence satisfies
$$ \begin{pmatrix} A^TA & A^Tb \\ b^TA & b^Tb \end{pmatrix}\begin{pmatrix} x_{TLS} \\ -1 \end{pmatrix} = \sigma_{n+1}^2\begin{pmatrix} x_{TLS} \\ -1 \end{pmatrix}. $$
The first block row of this equation can be written
$$ (A^TA - \sigma_{n+1}^2I)x_{TLS} = A^Tb, \tag{24.7} $$
whereas the corresponding least squares solution satisfies
$$ A^TAx_{LS} = A^Tb. \tag{24.8} $$
Since in (24.7) a positive multiple of the unit matrix is subtracted from $A^TA$, total least squares is a deregularization of the least squares problem (24.8), cf. Section 26. Therefore the TLS problem is always worse conditioned than the LS problem. It is shown by GOLUB and VAN LOAN [1980] that a measure of the conditioning of $x_{TLS}$ is the difference $\hat\sigma_n - \sigma_{n+1}$.
Combining (24.7) and (24.8) we get
$$ x_{TLS} - x_{LS} = \sigma_{n+1}^2(A^TA - \sigma_{n+1}^2I)^{-1}x_{LS}, $$
and taking norms we arrive at the estimate
$$ \|x_{TLS} - x_{LS}\|_2 \le \sigma_{n+1}^2\|x_{LS}\|_2/(\hat\sigma_n^2 - \sigma_{n+1}^2). $$
Hence, by (24.6) it follows that $\|x_{TLS} - x_{LS}\|_2 = O(\|(E, r)\|_E^2)$, i.e. $x_{TLS}$ and $x_{LS}$ agree up to second-order terms in the error. This point has been stressed by STEWART [1984b], who argues that this makes it difficult to justify the use of total least squares estimates over the simpler least squares estimates. Stewart also proves that up to terms of second order in the error $(E, r)$ the estimate $x_{TLS}$ is insensitive to column scalings of $A$, or more generally to linear transformations of the variables, $\bar x = T^{-1}x$, $\bar A = AT$.
We end by pointing out some generalizations of the total least squares problem. GOLUB and VAN LOAN ([1983], Section 12.3) consider the TLS problem with multiple right-hand sides
$$ \min_{E,R} \|(E, R)\|_E, \quad \text{s.t.}\ (A+E)X = B+R, $$
where $B \in \mathbb{R}^{m\times k}$. Writing this as
$$ (A+E,\ B+R)\begin{pmatrix} X \\ -I_k \end{pmatrix} = 0, $$
it follows that we now seek a perturbation $\Delta C = (E, R)$ that reduces the rank of the matrix $C = (A, B)$ by $k$. Again, this problem can be solved using the singular value decomposition
$$ C = U\operatorname{diag}(\sigma_1, \ldots, \sigma_{n+k})V^T, $$
and the condition $\sigma_n > \sigma_{n+1}$ ensures that there exists a unique solution.
WATSON [1984] has considered the total $l_p$ approximation problem where $\|(E, r)\|_p$ is minimized for the class of matrix norms defined by
$$ \|B\|_p = \Bigl(\sum_{i,j} |b_{ij}|^p\Bigr)^{1/p}, \qquad 1 \le p < \infty. $$
GOLUB and STEWART [1986] have shown how to solve the TLS problem for the case when some of the columns of $A$ are known exactly. The problem can then be expressed
$$ \min_{E,r} \|(E, r)\|_E, \quad \text{s.t.}\ (A_1, A_2 + E)x = b + r, $$
where $A = (A_1, A_2)$ and $A_1$ is known exactly. The technique involves computing a QR factorization of $A_1$ and then solving a TLS problem of reduced dimension. DEMMEL [1987] considers the more general situation when an arbitrary submatrix of $A$ is subject to perturbations.
CHAPTER V

Constrained Least Squares Problems

25. Linear equality constraints


In this section we consider least squares problems in which the unknowns are
required to satisfy a system of linear equality constraints. An important source of
such problems is curve and surface fitting, where e.g. the curve may be required to
interpolate certain data points. Another source is least squares problems with
inequality constraints, which are usually solved by solving a sequence of problems
with equality constraints, see Section 27.
PROBLEM LSE (Least squares with equality constraints). Given matrices $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{p\times n}$, find a vector $x \in \mathbb{R}^n$ which solves
$$ \min_x \|Ax - b\|_2, \quad \text{s.t.}\ Bx = d. \tag{25.1} $$
Problem (25.1) obviously has a solution if and only if the linear system Bx =d is
consistent. If rank(B)= p, i.e. if B has linearly independent rows, then Bx = d is
consistent for any right-hand side d.
A solution to (25.1) is unique if and only if the nullspaces of $A$ and $B$ intersect only trivially,
$$ \mathscr{N}(A) \cap \mathscr{N}(B) = \{0\}. \tag{25.2} $$
If (25.2) is not satisfied, then there is a vector $z \ne 0$ such that $Az = Bz = 0$, and if $x$ solves (25.1), then $x+z$ is a different solution. A proof that (25.2) is sufficient to ensure that (25.1) has a unique solution is given in Section 25.3. We note that (25.2) is equivalent to the rank condition
$$ \operatorname{rank}\begin{pmatrix} A \\ B \end{pmatrix} = n. $$

If (25.2) is not satisfied, then we can seek a minimum norm solution to Problem
LSE.
A robust algorithm for Problem LSE should check for possible inconsistency of
the constraint equations Bx = d. If it is not known a priori that the constraints are
consistent, then (25.1) may be reformulated as a sequential least squares problem

$$ \min_{x \in S} \|Ax - b\|_2, \qquad S = \{x : \|Bx - d\|_2 = \min\}. \tag{25.3} $$

This problem always has a unique solution of minimum norm. Most of the methods described in the following for solving Problem LSE can be adapted to solve (25.3) with little modification.
The most natural way to solve Problem LSE is to derive an equivalent
unconstrained least squares problem of lower dimension. There are basically two
different ways to perform this reduction: direct elimination and the nullspace
method. We describe both these methods below.

25.1. Method of direct elimination


In the method of direct elimination we start by reducing the matrix B to upper
trapezoidal form. It is essential that column pivoting is used in this step. In order to
be able to solve also the more general problem (25.3) we will compute a QR decomposition of $B$. By Theorem 7.4 there is an orthogonal matrix $Q_B \in \mathbb{R}^{p\times p}$ and a permutation matrix $\Pi_B$ such that
$$ Q_B^TB\Pi_B = \begin{pmatrix} R_{11} & R_{12} \\ 0 & 0 \end{pmatrix}, \tag{25.4} $$
where $r = \operatorname{rank}(B) \le p$ and $R_{11} \in \mathbb{R}^{r\times r}$ is upper triangular and nonsingular.


If we apply $Q_B^T$ also to the vector $d$ the constraints become
$$ (R_{11}, R_{12})\bar x = \bar d_1, \qquad \bar d = Q_B^Td = \begin{pmatrix} \bar d_1 \\ \bar d_2 \end{pmatrix}, \tag{25.5} $$
where $\bar x = \Pi_B^Tx$ and $\bar d_2 = 0$ if and only if the constraints are consistent.
We apply the permutation $\Pi_B$ also to the columns of $A$ and partition the resulting matrix conformingly with (25.4):
$$ Ax - b = \bar A\bar x - b = (\bar A_1, \bar A_2)\begin{pmatrix} \bar x_1 \\ \bar x_2 \end{pmatrix} - b, \tag{25.6} $$
where $\bar A = A\Pi_B$. We now eliminate the variables $\bar x_1$ from (25.6) using (25.5). Substituting $\bar x_1 = R_{11}^{-1}(\bar d_1 - R_{12}\bar x_2)$ we get
$$ \bar A\bar x - b = \hat A_2\bar x_2 - \hat b, $$
where
$$ \hat A_2 = \bar A_2 - \bar A_1R_{11}^{-1}R_{12}, \qquad \hat b = b - \bar A_1R_{11}^{-1}\bar d_1. \tag{25.7} $$
Hence, the reduced unconstrained least squares problem
$$ \min_{\bar x_2} \|\hat A_2\bar x_2 - \hat b\|_2, \qquad \hat A_2 \in \mathbb{R}^{m\times(n-r)}, \tag{25.8} $$
is equivalent to the original Problem LSE.


The solution to the unconstrained problem (25.8) can be obtained from the QR
decomposition of A2. We now show that if the condition (25.2) holds then
rank(A 2) = n- r and (25.8) has a unique solution. If rank(A2) < n- r, then there is
SECTION 25 Constrained least squares problems 591

a vector v#0 such that


A 2 v= A,2 v-AR- R 2 2V=O.
If we let u= -Rl'R, 2 v, then
Ru+R2v ==O0, Au +A 2v=0.
Hence the vector

w=IB(u)0

is a nullvector to both B and A and (25.2) does not hold.


If (25.2) holds then we can compute the QR decomposition

QA2=( 0 ), Qb=(C21) (25.9)


where R 2 2 ER{f" - ' ) X( - ' ) is upper triangular and nonsingular. We then compute
x from the triangular system
(RI R> U),(25.10)
\ 0 R22/ \cl
and obtain x=HB5, the solution to Problem LSE.
The coding of the algorithm outlined above can be kept remarkably compact, as is
illustrated by the program given by BJORCK and GOLUB [1967]. Note that the
reduction in (25.7) can be interpreted as performing $r$ steps of Gaussian elimination on the system
$$ \begin{pmatrix} R_{11} & R_{12} \\ \bar A_1 & \bar A_2 \end{pmatrix}\begin{pmatrix} \bar x_1 \\ \bar x_2 \end{pmatrix} \cong \begin{pmatrix} \bar d_1 \\ b \end{pmatrix}. $$
REMARK 25.1. The set of vectors $x = \Pi_B\bar x$, where $\bar x$ satisfies (25.5), is exactly the set of vectors which minimize $\|Bx - d\|_2$. Thus, the algorithm outlined above actually solves the more general problem (25.3). If condition (25.2) is not satisfied, then the reduced problem (25.8) does not have a unique solution. Then column permutations are needed also in the QR decomposition of $\hat A_2$. In this case we can compute either a basic solution or a minimum norm solution to (25.3) as outlined in Section 7.
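A sketch of the direct elimination algorithm for the full-row-rank case follows (the function name `lse_direct` is ours; for brevity it omits the column pivoting that the text stresses, so it assumes the leading $p$ columns of $B$ are well conditioned):

```python
import numpy as np

def lse_direct(A, B, b, d):
    """Direct elimination for min ||Ax-b||_2 s.t. Bx = d, with rank(B) = p.
    Column pivoting (essential in general) is omitted for brevity."""
    p, n = B.shape
    QB, RB = np.linalg.qr(B)              # B = QB (R11, R12)
    dbar = QB.T @ d
    R11, R12 = RB[:, :p], RB[:, p:]
    A1, A2 = A[:, :p], A[:, p:]
    T = np.linalg.solve(R11, R12)         # R11^{-1} R12
    t = np.linalg.solve(R11, dbar)        # R11^{-1} dbar
    A2h = A2 - A1 @ T                     # reduced matrix, cf. (25.7)
    bh = b - A1 @ t
    x2 = np.linalg.lstsq(A2h, bh, rcond=None)[0]   # reduced problem (25.8)
    x1 = t - T @ x2
    return np.concatenate([x1, x2])

rng = np.random.default_rng(8)
A = rng.standard_normal((8, 4))
B = rng.standard_normal((2, 4))
b = rng.standard_normal(8)
d = rng.standard_normal(2)
x = lse_direct(A, B, b, d)
cons = np.max(np.abs(B @ x - d))          # constraint residual
```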

25.2. The nullspace method

We assume here that $\operatorname{rank}(B) = p$. First compute an orthogonal matrix $Q_B \in \mathbb{R}^{n\times n}$ such that
$$ Q_B^TB^T = \begin{pmatrix} R_B \\ 0 \end{pmatrix}, \tag{25.11} $$
where $R_B \in \mathbb{R}^{p\times p}$ is upper triangular and nonsingular. Let
$$ Q_B = (Q_1, Q_2), \qquad Q_1 \in \mathbb{R}^{n\times p}, \quad Q_2 \in \mathbb{R}^{n\times(n-p)}. $$

Then $\mathscr{N}(B) = \mathscr{R}(Q_2)$, i.e. $Q_2$ gives an orthogonal basis for the nullspace of $B$. Any vector $x \in \mathbb{R}^n$ which satisfies $Bx = d$ can then be represented as
$$ x = x_1 + Q_2y_2, \qquad x_1 = B^+d = Q_1R_B^{-T}d. \tag{25.12} $$
Hence,
$$ Ax - b = Ax_1 + AQ_2y_2 - b, \qquad y_2 \in \mathbb{R}^{n-p}, $$
and it remains to solve the reduced system
$$ \min_{y_2} \|(AQ_2)y_2 - (b - Ax_1)\|_2. \tag{25.13} $$
Let $y_2$ be the minimum length solution to (25.13),
$$ y_2 = (AQ_2)^+(b - Ax_1), $$
and let $x$ be defined by (25.12). Then since $x_1 \perp Q_2y_2$ it follows that
$$ \|x\|_2^2 = \|x_1\|_2^2 + \|Q_2y_2\|_2^2 = \|x_1\|_2^2 + \|y_2\|_2^2, $$
and $x$ is the minimum norm solution to Problem LSE.
Now assume that condition (25.2) is satisfied. Then the matrix

    C = (B; A)Q_B = (R_Bᵀ  0; AQ₁  AQ₂)

must have rank n. But then all columns of C must be linearly independent, and it follows that rank(AQ₂) = n − p. Then we can compute the QR decomposition

    Q_Aᵀ(AQ₂) = (R_A; 0),

where R_A is upper triangular and nonsingular. The unique solution to (25.13) can then be computed from

    R_A y₂ = c₁,  c = (c₁; c₂) = Q_Aᵀ(b − Ax₁),   (25.14)

and we finally obtain x = x₁ + Q₂y₂, the unique solution to Problem LSE.
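The nullspace method (25.11)-(25.14) can be sketched with NumPy. This is an illustrative sketch, not the text's implementation: the reduced problem (25.13) is solved here with `lstsq` instead of the explicit QR decomposition (25.14), and all variable names follow the notation above.

```python
# Nullspace method for Problem LSE:  min ||A x - b||_2  s.t.  B x = d.
import numpy as np

def lse_nullspace(A, B, b, d):
    p, n = B.shape
    # QR of B^T gives Q_B = (Q1, Q2); the columns of Q2 span the nullspace of B.
    QB, RB = np.linalg.qr(B.T, mode='complete')
    Q1, Q2 = QB[:, :p], QB[:, p:]
    # Particular solution x1 = B^+ d = Q1 R_B^{-T} d, eq. (25.12).
    x1 = Q1 @ np.linalg.solve(RB[:p, :].T, d)
    # Reduced problem (25.13): min || (A Q2) y2 - (b - A x1) ||_2.
    y2, *_ = np.linalg.lstsq(A @ Q2, b - A @ x1, rcond=None)
    return x1 + Q2 @ y2
```

The returned x satisfies Bx = d exactly (up to roundoff), since it is built from the orthogonal basis Q₂ of 𝒩(B).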
The representation (25.12) of the solution x has been used as a basis for a perturbation theory by LERINGE and WEDIN [1970]. This generalizes the results given in Section 5. The corresponding bounds for Problem LSE are more complicated, but show that Problem LSE is well-conditioned if κ(B) and κ(AQ₂) are small. It is important to note that these two condition numbers can be small even when κ(A) is large. Any method which starts with minimizing ‖Ax − b‖₂ will give bad results in such a case. ELDÉN [1980] has given a less complicated and more complete perturbation theory for Problem LSE, based on the concept of a weighted pseudoinverse.
The method of direct elimination and the nullspace method both have good
numerical stability. In a numerical comparison by LERINGE and WEDIN [1970] they

gave almost identical results. The operation count for the method of direct
elimination is slightly lower because Gaussian elimination is used to derive the
reduced unconstrained problem.

25.3. Analysis of Problem LSE by generalized singular value decomposition

Following VAN LOAN [1985] we now analyze Problem LSE in terms of the
generalized singular value decomposition (GSVD). For simplicity we assume that in
(25.1) we have rank(B)=p and that (25.2) holds. This ensures that the problem has
a unique solution, and the GSVD can be written (see Section 22):

    UᵀA = (D_A; 0)Z,  VᵀB = (D_B, 0)Z,   (25.15)

where

    D_A = diag(α_1, ..., α_n),  D_B = diag(β_1, ..., β_p),
    0 = α_1 = ⋯ = α_q < α_{q+1} ≤ ⋯ ≤ α_p < α_{p+1} = ⋯ = α_n = 1,   (25.16)
    β_1 ≥ ⋯ ≥ β_p > 0.
We can also without loss of generality assume that

    α_i ≥ 0,  β_i ≥ 0,  i = 1, ..., n,

and that

    α_i² + β_i² = 1,  i = 1, ..., n.
Using the GSVD, problem (25.1) is transformed into the diagonal form

    min_y ‖(D_A; 0)y − b̃‖₂,
    s.t. (D_B, 0)y = d̃,   (25.17)

where

    Zx = y,  b̃ = Uᵀb,  d̃ = Vᵀd.   (25.18)

It is easily verified that (25.17) has the solution

    y_i = d̃_i/β_i,  i = 1, ..., p,
    y_i = b̃_i,  i = p+1, ..., n,   (25.19)

where we have used that α_i = 1, i = p+1, ..., n. Introducing the matrix

    Z⁻¹ = X = (x_1, ..., x_n)

we can write the solution x_LSE to (25.1) as

    x_LSE = Σ_{i=1}^{p} (d̃_i/β_i)x_i + Σ_{i=p+1}^{n} b̃_i x_i.   (25.20)

A solution to the unconstrained problem to minimize ‖Ax − b‖₂ is

    x_LS = Σ_{i=q+1}^{p} (b̃_i/α_i)x_i + Σ_{i=p+1}^{n} b̃_i x_i.

It is interesting to compute how much the residual r_LS = b − Ax_LS increases as a result of the constraints Bx = d. We have

    r_LSE − r_LS = A(x_LS − x_LSE),

and using the relations Ax_i = α_i u_i, i = 1, ..., n, we can show that

    r_LSE − r_LS = Σ_{i=q+1}^{p} ρ_i u_i,

where

    ρ_i = u_iᵀb − μ_i v_iᵀd,  μ_i = α_i/β_i,  i = 1, ..., p.   (25.21)

Hence,

    ‖r_LSE‖₂² = ‖r_LS‖₂² + Σ_{i=q+1}^{p} ρ_i².

25.4. The method of weighting

The method of weighting for solving Problem LSE is based on the following simple
observation. Assume that in a least squares problem we want some equations to be
exactly satisfied. We can achieve that by giving these equations a large weight y and
solving the resulting unconstrained least squares problem. Hence to solve (25.1) we
would compute the solution x(γ) to the problem

    min_x ‖(γB; A)x − (γd; b)‖₂.   (25.22)

Note that if (25.2) holds then (25.22) is a full rank least squares problem.
Weighted problems were considered in Section 14. There it was shown (cf. (14.9)) that if rank(B) = p, then the residual d − Bx(γ) is proportional to γ⁻² for large values of γ, and hence lim_{γ→∞} x(γ) = x_LSE. A more general analysis is given below in terms of the GSVD.
The method of weighting is attractive for its simplicity. It allows a subroutine or program for unconstrained least squares problems to be used to solve Problem LSE. However, for large values of γ care must be exercised in the way (25.22) is solved, because the matrix in (25.22) is then poorly conditioned. In particular, the method of normal equations will fail for large γ.
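A minimal sketch of the weighting approach (25.22): the weighted rows γB are stacked over A and the unconstrained problem is solved by an orthogonalization-based routine (here NumPy's `lstsq`, which uses a stable SVD solver). The small example data are made up for illustration.

```python
# Method of weighting for Problem LSE, eq. (25.22).
import numpy as np

def lse_weighted(A, B, b, d, gamma):
    M = np.vstack([gamma * B, A])
    rhs = np.concatenate([gamma * d, b])
    x, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return x

# Hypothetical example data (not from the text).
A = np.array([[1., 2.], [3., 1.], [1., 1.]])
b = np.array([1., 2., 3.])
B = np.array([[1., 1.]])
d = np.array([1.])
for gamma in (1e2, 1e4, 1e6):
    x = lse_weighted(A, B, b, d, gamma)
    # the constraint residual d - Bx(gamma) shrinks like gamma^(-2), cf. (14.9)
    print(gamma, abs((B @ x - d)[0]))
```

As the text warns, forming the normal equations of the stacked system instead would square the condition number and fail for large γ.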
Accurate solutions to (25.22) for large values of γ can be computed from a QR decomposition of the matrix

    (γB; A),

provided that both row and column permutations are used. POWELL and REID [1969] recommend that the column pivoting strategy of Section 11 is used and that, before each Householder transformation is applied, the largest element in the pivot column is permuted to the top.
For some problems it may be sufficient to order the constraints first, as done in (25.22), and then compute a QR decomposition without column interchanges. The following example from VAN LOAN [1985] shows that this is not always sufficient.

EXAMPLE 25.1. Let


    A = (1  1)
        (1  3)
        (1 −1)
        (1  1),   B = (1 1 1),   d = ( ).

This example is well-conditioned and has the solution

    x_LSE = (46, −2, 12)ᵀ.

In VAX double precision arithmetic, u = 10⁻¹⁷, problem (25.22) with γ = 10⁹ ≈ u⁻¹ᐟ² was solved with and without column pivoting. With column pivoting one obtained full double precision accuracy 10⁻¹⁷, whereas without column pivoting the error was of the order 10⁻⁹. The trouble arises because for large γ the first two columns of the matrix in (25.22) are almost linearly dependent.
The GSVD analysis in Section 25.3 has been used to analyze the method of weighting by VAN LOAN [1985]. Note that x(γ) satisfies the normal equations

    (AᵀA + γ²BᵀB)x(γ) = Aᵀb + γ²Bᵀd.

Using (25.15)-(25.18) these transform to

    (D_AᵀD_A + γ²(D_B, 0)ᵀ(D_B, 0))y(γ) = (D_A; 0)ᵀb̃ + γ²(D_B, 0)ᵀd̃,   (25.23)

where we have put Zx(γ) = y(γ). From (25.23) we deduce that

    y_i(γ) = (α_i b̃_i + γ²β_i d̃_i)/(α_i² + γ²β_i²),  i = 1, ..., p,
    y_i(γ) = b̃_i,  i = p+1, ..., n.

Hence from (25.19) we find that y_i(γ) = y_i, i = 1, ..., q, p+1, ..., n, and with ρ_i and μ_i defined by (25.21)

    y_i(γ) − y_i = μ_i ρ_i/(β_i(μ_i² + γ²)),  i = q+1, ..., p.

This suggests that if μ_p is large, then a large weight γ may be necessary.

A detailed analysis of the method of weighting has also been given by LAWSON and HANSON ([1974], Chapter 22).

26. Quadratic constraints and regularization

In this section we consider methods for solving constrained least squares problems
of the following general type.

PROBLEM LSQI (Least squares with quadratic inequality constraints).


    min_x ‖Ax − b‖₂,
    s.t. ‖Bx − d‖₂ ≤ γ,   (26.1)

where A ∈ ℝ^(m×n), B ∈ ℝ^(p×n), γ > 0.

We first consider some applications where problems of this form arise. One
example is in smoothing of noisy data, see e.g. HUTCHINSON and DE HOOG [1985].
Here one wants to balance a good fit to the data points and a smooth solution.
Another application where problems of the form (26.1) arise is in the regularization
of ill-conditioned least squares problems resulting from the discretization of
ill-posed problems.
As an example, consider the integral equation of the first kind

    ∫₀ᵀ K(s, t)f(t) dt = g(s),   (26.2)

where the operator K is compact. It is well known that this problem is ill-posed in
the sense that the solution f does not depend continuously on the data g. For
example, there are rapidly oscillating functions f(t) which come arbitrarily close to
being annihilated by the integral operator.
If (26.2) is discretized into a corresponding least squares problem
    min_f ‖Kf − g‖₂,   (26.3)

then the singular values of K ∈ ℝ^(m×n) decay exponentially to zero. Therefore K will not have a well-defined numerical δ-rank r, since by (10.3) this requires that σ_r > δ ≥ σ_{r+1} holds with a distinct gap between the singular values σ_r and σ_{r+1}. This means that the methods for rank-deficient least squares problems in Sections 10 and 11 are not very useful here.
In general any attempt to solve (26.3) without any constraints on f will give a meaningless result. Perhaps the most successful method to solve ill-conditioned problems of this kind is to restrict the solution space by imposing some a priori bound on ‖Lf‖₂ for a suitably chosen matrix L ∈ ℝ^(p×n). Typically L is taken to be a discrete approximation to some derivative operator, e.g.

    L = (1 −1        )
        (   1 −1     )
        (      ⋱  ⋱  )  ∈ ℝ^((n−1)×n),   (26.4)
        (        1 −1)

which approximates the first-derivative operator except for a scaling factor.



The above approach leads us to take f as the solution to the problem

    min ‖Kf − g‖₂,
    s.t. ‖Lf‖₂ ≤ γ.   (26.5)

Here the parameter γ governs the balance between a small residual and a smooth solution. The determination of a suitable γ is often a main difficulty in the solution process. Alternatively we can consider the related problem

    min ‖Lf‖₂,
    s.t. ‖Kf − g‖₂ ≤ ρ.   (26.6)

Problems (26.5) and (26.6) are called regularization methods for the ill-conditioned problem (26.3), in the terminology of TIHONOV [1963]. They are obviously special cases of the general Problem LSQI in (26.1).
We now consider conditions for existence and uniqueness of solutions to Problem
LSQI. Clearly (26.1) has a solution if and only if
    min_x ‖Bx − d‖₂ ≤ γ,   (26.7)

and in the following we assume that this condition is satisfied.

We define a B-generalized solution x_{A,B} to the problem min_x ‖Ax − b‖₂ to be a solution to the problem (cf. Remark 7.6)

    min_{x∈S} ‖Bx − d‖₂,  S = {x ∈ ℝⁿ : ‖Ax − b‖₂ = min}.   (26.8)

Notice that for B = I and d = 0 we have x_{A,I} = A†b. Then the constraint in (26.1) is binding only if

    ‖Bx_{A,B} − d‖₂ > γ.   (26.9)
This observation gives rise to the following theorem.

THEOREM 26.1. Assume that Problem LSQI has a solution. Then either x_{A,B} is a solution, or (26.9) holds and the solution occurs on the boundary of the constraint region. In the latter case the solution x = x(λ) satisfies the generalized normal equations

    (AᵀA + λBᵀB)x(λ) = Aᵀb + λBᵀd,   (26.10)

where λ is determined by the secular equation

    ‖Bx(λ) − d‖₂ = γ.   (26.11)

PROOF. Using the method of Lagrange multipliers we minimize ψ(x), where

    ψ(x) = ‖Ax − b‖₂² + λ(‖Bx − d‖₂² − γ²).

A necessary condition for a minimum is that the gradient of ψ(x) equals zero, which gives (26.10). □

In the following we assume that (26.9) holds, so that the constraint is binding. Then there is a unique solution to Problem LSQI if and only if the nullspaces of A and B intersect only trivially, i.e. 𝒩(A) ∩ 𝒩(B) = {0}. This is equivalent to the rank condition

    rank (A; B) = n.   (26.12)

We note that (26.10) are the normal equations for the least squares problem

    min_x ‖(A; λ¹ᐟ²B)x − (b; λ¹ᐟ²d)‖₂,   (26.13)

where only positive values of λ are of interest. Therefore, to solve (26.10) for a given λ it is not necessary to form the cross-product matrices AᵀA and BᵀB.
A numerical method for solving Problem LSQI can be based on applying
a method for solving the nonlinear equation

    φ(λ) = γ,  where φ(λ) = ‖Bx(λ) − d‖₂,

and where x(λ) is computed from (26.13). However, this means that for every function value φ(λ) we have to compute a new QR decomposition in (26.13). Methods which avoid this have been given by ELDÉN [1977] and will be described later in this section.
We now consider the use of the generalized singular value decomposition for analyzing Problem LSQI, cf. GOLUB and VAN LOAN ([1983], Section 12.1). For ease of notation we assume that m ≥ n and put q = min(p, n). Then we have (see Section 22)

    UᵀA = (D_A; 0)Z,  VᵀB = (D_B, 0)Z,   (26.14)

where

    D_A = diag(α_1, ..., α_n),  D_B = diag(β_1, ..., β_q),   (26.15)
    α_i ≥ 0,  i = 1, ..., n,   β_1 ≥ ⋯ ≥ β_r > β_{r+1} = ⋯ = β_q = 0,

r = rank(B), U ∈ ℝ^(m×m) and V ∈ ℝ^(p×p) are orthogonal, and Z is nonsingular. The rank condition (26.12) implies that

    α_i² + β_i² > 0,  i = 1, ..., r,   α_i > 0,  i = r+1, ..., n.   (26.16)
Using the GSVD, Problem LSQI becomes

    min_y ‖D_A y − b̃₁‖₂,
    s.t. ‖D_B y₁ − d̃₁‖₂ ≤ γ,

where

    Zx = y = (y₁; y₂),  y₁ ∈ ℝ^q,  y₂ ∈ ℝ^(n−q),
    Uᵀb = b̃ = (b̃₁; b̃₂),  b̃₁ ∈ ℝⁿ,  b̃₂ ∈ ℝ^(m−n),
    Vᵀd = d̃ = (d̃₁; d̃₂),  d̃₁ ∈ ℝ^q,  d̃₂ ∈ ℝ^(p−q).

Clearly a solution exists if and only if ‖d̃₂‖₂ ≤ γ, which is condition (26.7). Further, the vector ȳ defined by

    ȳ_i = b̃_i/α_i,  α_i ≠ 0,  i = 1, ..., n,
    ȳ_i = d̃_i/β_i,  α_i = 0,  i = 1, ..., r,

is a solution to (26.8). Hence, the condition (26.9) becomes

    Σ_{i=1, α_i≠0}^{q} (β_i b̃_i/α_i − d̃_i)² + ‖d̃₂‖₂² > γ²,   (26.17)
and we assume that this condition is satisfied. The generalized normal equations
(26.10) can now be written

    (α_i² + λβ_i²)y_i(λ) = α_i b̃_i + λβ_i d̃_i,  i = 1, ..., q,
    α_i y_i(λ) = b̃_i,  i = q+1, ..., n.   (26.18)

A simple calculation shows that the secular equation becomes

    φ²(λ) = Σ_{i=1}^{q} [α_i(β_i b̃_i − α_i d̃_i)/(α_i² + λβ_i²)]² + ‖d̃₂‖₂² = γ².   (26.19)

From (26.17) it follows that φ(0) > γ. Since φ(λ) is monotone decreasing for λ > 0, there exists a unique positive root of (26.19). It can be shown that this is the desired root, see GANDER [1981].

From (26.19) we can cheaply compute function values φ(λ) for given numerical values of λ. By differentiating (26.19), derivatives of φ(λ) can also be computed, and thus a numerical method like Newton's method can be used to solve (26.19).
A particularly simple but important case is when in (26.1)

    B = I_n  and  d = 0.   (26.20)

We will call this the standard form of Problem LSQI. The above algorithm then simplifies as follows. Let the SVD of A be

    UᵀA = (D_A; 0)Vᵀ,

where U and V are orthogonal. We now have β_i = 1, i = 1, ..., n, and assume that the singular values σ_i = σ_i(A) are ordered so that

    σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_n ≥ 0.

For the problem in standard form the rank condition (26.12) is trivially satisfied. The condition (26.9) simplifies to

    ‖A†b‖₂² = Σ_{σ_i≠0} (b̃_i/σ_i)² > γ².

We assume that this condition is satisfied and determine λ by solving the secular equation

    φ²(λ) = ‖y(λ)‖₂² = Σ_{i=1}^{n} [σ_i b̃_i/(σ_i² + λ)]² = γ².   (26.21)

We finally obtain the solution from

    x = Vy(λ*) = Σ_{i=1}^{n} y_i(λ*)v_i,  V = (v_1, ..., v_n),   (26.22)

where λ* is the unique positive solution to (26.21). This algorithm requires on the order of mn² + n³ flops.
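The standard-form algorithm (26.21)-(26.22) can be sketched with NumPy. The root-finder below applies Newton's method to φ²(λ) − γ² starting at λ = 0; this particular iteration and stopping rule are choices of this sketch (the text only says "a numerical method like Newton's method"), and we assume the binding-constraint condition ‖A†b‖₂ > γ.

```python
# Standard-form LSQI (B = I, d = 0): SVD of A plus Newton on (26.21).
import numpy as np

def lsqi_standard(A, b, gamma):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    bt = U.T @ b                       # first n components of b~ = U^T b
    lam = 0.0
    for _ in range(100):
        y = s * bt / (s**2 + lam)      # y_i(lambda) = sigma_i b~_i/(sigma_i^2 + lambda)
        f = np.sum(y**2) - gamma**2    # phi^2(lambda) - gamma^2
        fp = -2.0 * np.sum(y**2 / (s**2 + lam))
        lam_new = lam - f / fp         # Newton step; phi^2 is convex decreasing
        if abs(lam_new - lam) <= 1e-12 * max(lam_new, 1.0):
            lam = lam_new
            break
        lam = lam_new
    y = s * bt / (s**2 + lam)
    return Vt.T @ y, lam               # x = V y(lambda*), eq. (26.22)
```

The returned x satisfies the generalized normal equations (AᵀA + λ*I)x = Aᵀb with ‖x‖₂ = γ.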
The GSVD and SVD respectively are the proper decompositions to analyze
Problem LSQI in general and standard form. These decompositions also lead to the
most stable computational algorithms for the numerical solution of these problems.
However, as for the unconstrained least squares problem, more efficient algorithms which are almost as satisfactory can be devised using simpler matrix decompositions. For Problem LSQI such methods have been given by ELDÉN [1977].
We first consider the problem in standard form with B = I and d = 0. In order to solve the secular equation ‖x(λ)‖₂ = γ, we have to solve the least squares problem

    min_x ‖(A; μI)x − (b; 0)‖₂   (26.23)

for a sequence of values of μ = λ¹ᐟ². To do this we first transform A to the bidiagonal form of Theorem 7.6,

    QᵀAP = (B; 0),   (26.24)

where B is upper bidiagonal,

    B = (γ_1 δ_1              )
        (    γ_2 δ_2          )
        (        ⋱    ⋱       )
        (          γ_{n−1} δ_{n−1})
        (                  γ_n).

This decomposition can be computed in only mn² + n³ flops using Householder transformations. If we put

    x = Py,  c = Qᵀb = (c₁; c₂),  c₁ ∈ ℝⁿ,  c₂ ∈ ℝ^(m−n),

(26.23) is transformed into

    min_y ‖(B; μI)y − (c₁; 0)‖₂,   (26.25)

and since P is orthogonal the secular equation becomes IIy()112=y. We now


determine a sequence of Givens transformations Gk=Rk,,+k, k=l,...,n, and
Jk=Rn+k,n+k+ , k = ,...,n-1, such that

GJn-I ... G 2 JG 1 (I oJ)=( '


Z) (26.26)

so that B, is again upper bidiagonal. Then y(#) is determined by solving the


bidiagonal system of equations By(#p)= z.
The construction of the Givens rotations G_k and J_k is sufficiently well demonstrated by the case n = 3 below. The transformation G_1 will zero the element in position (4,1) and create a nonzero element in position (4,2). This new nonzero element is then annihilated by J_1. This step reduces the dimension of the problem by one, and the transformations proceed recursively. The first step is shown below; transformed elements are denoted by a prime, and + denotes the fill-in element:

    (γ_1 δ_1    )        (γ_1′ δ_1′    )        (γ_1′ δ_1′    )
    (    γ_2 δ_2)        (     γ_2  δ_2)        (     γ_2  δ_2)
    (        γ_3)   G_1  (          γ_3)   J_1  (          γ_3)
    (μ          )  −→    (     +       )  −→    (             )
    (    μ      )        (     μ       )        (     μ′      )
    (        μ  )        (          μ  )        (          μ  )
The whole transformation in (26.26), including solving for y(μ), takes only about 11n flops. ELDÉN [1977] gives a detailed operation count and also shows how to compute derivatives of φ(μ).
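The Givens sweep (26.26) can be sketched as follows; the rotation and indexing conventions (diagonal γ_i, superdiagonal δ_i, as in the illustration above) are one concrete choice, and the routine is an illustrative sketch rather than Eldén's published implementation.

```python
# Solve min || (B; mu*I) y - (c; 0) ||_2 with B upper bidiagonal, O(n) per mu.
import numpy as np

def solve_bidiag_mu(g, e, c, mu):
    """g: diagonal of B (length n); e: superdiagonal (length n-1); c: rhs."""
    n = len(g)
    g = np.asarray(g, float).copy()
    e = np.append(np.asarray(e, float), 0.0)   # pad so e[n-1] exists
    c = np.asarray(c, float).copy()
    d = np.full(n, float(mu))                  # current diagonal of the mu*I block
    w = np.zeros(n)                            # its right-hand side (starts at zero)
    for k in range(n):
        # G_k: rotate top row k with lower row k to annihilate d[k]
        r = np.hypot(g[k], d[k])
        cs, sn = g[k] / r, d[k] / r
        g[k] = r
        fill = -sn * e[k]                      # fill-in at (k, k+1) of the lower block
        e[k] = cs * e[k]
        c[k], w[k] = cs * c[k] + sn * w[k], -sn * c[k] + cs * w[k]
        if k < n - 1:
            # J_k: rotate lower rows k and k+1 to annihilate the fill-in
            r2 = np.hypot(d[k + 1], fill)
            cs2, sn2 = d[k + 1] / r2, fill / r2
            d[k + 1] = r2
            w[k + 1], w[k] = cs2 * w[k + 1] + sn2 * w[k], -sn2 * w[k + 1] + cs2 * w[k]
    # top block is now the bidiagonal B_mu; back substitution gives y(mu)
    y = np.zeros(n)
    y[-1] = c[-1] / g[-1]
    for k in range(n - 2, -1, -1):
        y[k] = (c[k] - e[k] * y[k + 1]) / g[k]
    return y
```

Each pass costs O(n), so the secular equation can be evaluated cheaply for many values of μ, as the operation count above indicates.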
We now consider the more general form of Problem LSQI, where d = 0 and B = L ∈ ℝ^((n−t)×n) is a banded matrix. We now have to solve

    min_x ‖(A; μL)x − (b; 0)‖₂   (26.27)

for a sequence of values of μ. We can no longer start by transforming A to bidiagonal form, since that would destroy the sparsity of L. (Note that the transformation P would have to be applied to L from the right.) We compute instead the QR decomposition of A,

    QᵀA = (R₁; 0),  Qᵀb = (c₁; c₂).

Now (26.27) is equivalent to

    min_x ‖(R₁; μL)x − (c₁; 0)‖₂.   (26.28)
Some problems give rise to a matrix A which also has band structure. Then the matrix R₁ will be an upper triangular band matrix with the same bandwidth w₁ as A, see Theorem 15.1, and the complete matrix in (26.28) will be of banded form. In many cases the matrix L is also an upper triangular banded matrix. If not, it is convenient to reduce it to this form by computing the QR decomposition

    Q₂ᵀL = R₂,

where R₂ has bandwidth w₂. Since Q₂ is orthogonal, we have reduced (26.28) to the form

    min_x ‖(R₁; μR₂)x − (c₁; 0)‖₂.

This problem can be efficiently solved by the sequential orthogonalization method of Section 18. Note that this involves a reordering of the rows of the matrix

    (R₁; μR₂)

so that the column indices of the first nonzero element in each row form a nondecreasing sequence. The resulting algorithm has been described in detail by ELDÉN [1984b]. The number of operations for each value of μ is O(n(w₁ + w₂)).
We now describe an algorithm for the case when R₁ does not have band structure. The idea is to transform (26.27) into a regularization problem of standard form. Note that if L were nonsingular we could achieve this by the change of variables y = Lx. However, normally L ∈ ℝ^((n−t)×n) and is of rank n−t < n. The transformation to standard form can then be achieved using the pseudoinverse of L, by a technique due to ELDÉN [1977]. We compute the QR decomposition of Lᵀ,

    Lᵀ = (V₁, V₂)(R₂; 0),

where V₂ spans the nullspace of L. We now set y = Lx. Then

    x = L†y + V₂w,  L† = V₁R₂⁻ᵀ,   (26.29)

where L† is the pseudoinverse of L, and

    Ax − b = AL†y − b + AV₂w.

We form AV₂ and compute its QR decomposition

    AV₂ = (Q₁, Q₂)(U; 0),  U ∈ ℝ^(t×t).

Then

    Qᵀ(Ax − b) = (Q₁ᵀ(AL†y − b) + Uw; Q₂ᵀ(AL†y − b)) = (r₁; r₂).

Now, if A and L have no nullspace in common, then AV₂ has rank t and U is nonsingular. Thus we can always determine w so that r₁ = 0, and (26.27) is equivalent to

    min_y ‖(Ā; μI)y − (b̄; 0)‖₂,  Ā = Q₂ᵀAL†,  b̄ = Q₂ᵀb,   (26.30)

which is of standard form. We then retrieve x from (26.29).
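The reduction (26.29)-(26.30) can be sketched as follows, assuming L has full row rank n−t. The interface returning a back-transformation function alongside Ā and b̄ is a design choice of this sketch, not of the text.

```python
# Transformation of min ||A x - b||^2 + mu^2 ||L x||^2 to standard form (26.30).
import numpy as np

def to_standard_form(A, L, b):
    m, n = A.shape
    nt = L.shape[0]                              # n - t (full row rank assumed)
    t = n - nt
    V, R2 = np.linalg.qr(L.T, mode='complete')   # L^T = (V1, V2)(R2; 0)
    V1, V2 = V[:, :nt], V[:, nt:]                # V2 spans the nullspace of L
    Lpinv = V1 @ np.linalg.inv(R2[:nt, :]).T     # L^+ = V1 R2^{-T}, eq. (26.29)
    Q, S = np.linalg.qr(A @ V2, mode='complete') # A V2 = (Q1, Q2)(U; 0)
    Q1, Q2, U = Q[:, :t], Q[:, t:], S[:t, :]
    Abar, bbar = Q2.T @ A @ Lpinv, Q2.T @ b      # eq. (26.30)

    def back(y):
        # choose w so that r1 = 0, then x = L^+ y + V2 w as in (26.29)
        w = np.linalg.solve(U, Q1.T @ (b - A @ (Lpinv @ y)))
        return Lpinv @ y + V2 @ w

    return Abar, bbar, back
```

Solving the standard-form problem min ‖Āy − b̄‖₂² + μ²‖y‖₂² and mapping y back with `back` reproduces the solution of (26.27), provided A and L have no common nullspace.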


We remark that if m is substantially larger than n, it is better to apply the above reduction to (26.28) instead of (26.27). Since the reduction involves the pseudoinverse of L, it is numerically less stable than the GSVD or the direct solution of (26.27) if

    κ(L) ≫ κ(A).

However, in practice it seems to give very similar results, see VARAH ([1979], p. 104).
An important special case is when in LSQI we have A = K, B = L, and both K and L are upper triangular Toeplitz matrices, i.e.

    K = (k_1 k_2 ⋯ k_{n−1} k_n    )
        (    k_1 k_2 ⋯    k_{n−1})
        (        ⋱   ⋱      ⋮    )
        (            k_1    k_2  )
        (                   k_1  ),

and L is as in (26.4). Such systems arise when convolution-type Volterra integral equations of the first kind

    ∫₀ᵗ K(t − s)f(s) ds = g(t),  0 ≤ t ≤ T,

are discretized. ELDÉN [1984] has developed a method for solving problems of this kind which only uses 9n² flops for each value of μ. It can be modified to handle the case when K and L also have a few nonzero diagonals below the main diagonal. Although K can be represented by n numbers, this method uses n² storage locations. A modification of this algorithm, which uses only O(n) storage locations, is given by BOJANCZYK and BRENT [1986].
For computing a regularized solution to

    min_x ‖Ax − b‖₂,   (26.31)

where A is very large and sparse, iterative methods may be considered. For a survey of such methods see BJÖRCK and ELDÉN [1979]. The methods can be divided into two groups. In the first group the iterative method is applied directly to the problem (26.31). Regularization is then achieved by performing only a small number of iterations.

In the other group of methods one applies the iterative method to the regularized problem (26.23), which then usually is solved for several values of μ. BJÖRCK [1988] and O'LEARY and SIMMONS [1981] have given methods based on the bidiagonalization steps (20.32)-(20.35) in the method LSQR. The idea is to compute x_k(μ) = V_k y_k(μ), where y_k(μ) is the solution to

    min_y ‖(B_k; μI)y − (β₁e₁; 0)‖₂,

which is a regularized version of (20.35). This allows for the cheap computation of x_k(μ) for several different values of μ. It does require the vectors v_1, ..., v_k to be saved, but in most applications where these methods make sense k is not a very large number.
One application of the iterative methods above is when A = K is an upper triangular Toeplitz matrix, since then the matrix-vector products Ku and Kᵀv can be computed in O(n log₂ n) flops using an algorithm based on the fast Fourier transform, see O'LEARY and SIMMONS [1981].
So far we have assumed that the parameter λ = μ² is determined by solving the secular equation (26.11), where γ is known a priori or somehow determined from additional information about the solution. We now describe a method for determining the smoothing parameter μ directly from the data. The underlying statistical model is that the components of b are subject to random errors of zero mean and covariance matrix σ²I_m, where σ² may or may not be known. We take d = 0 in (26.10) and write the solution as a function of μ,

    x_μ = M_μ⁻¹Aᵀb,  M_μ = AᵀA + μ²BᵀB.   (26.32)

The predicted values of b can then be written

    Ax_μ = P_μb,  P_μ = AM_μ⁻¹Aᵀ,   (26.33)

where the symmetric matrix P_μ is called the influence matrix.


CRAVEN and WAHBA [1979] have suggested that when σ² is known, μ should be chosen to minimize an unbiased estimate of the expected true mean square error, given by

    T(μ) = (1/m)‖Ax_μ − b‖₂² − (2σ²/m) trace(I_m − P_μ) + σ².

Here trace(A) denotes the trace of the matrix A. When σ² is not known, μ may be chosen to minimize the generalized cross validation (GCV) function given by

    C(μ) = (1/m)‖(I_m − P_μ)b‖₂² / [(1/m) trace(I_m − P_μ)]²,   (26.34)

since the minimizer of C(μ) is asymptotically the same as the minimizer of T(μ) when m is large. For a discussion of generalized cross validation see also GOLUB, HEATH and WAHBA [1979].

EXAMPLE 26.1 (GOLUB and VAN LOAN [1983], Problem 12.1-5). Let A ∈ ℝ^(m×1) be given by

    A = (1, 1, ..., 1)ᵀ = e,

and put

    b̄ = (1/m) Σ_{i=1}^{m} b_i,  s² = (1/(m−1)) Σ_{i=1}^{m} (b_i − b̄)².

Then, with ν = μ²/(m + μ²), the cross validation function becomes

    C(μ) = (m(m−1)s² + ν²m²b̄²)/(m − 1 + ν)².

It is readily verified that C(μ) is minimized for ν = s²/(mb̄²), which leads to an optimal λ = μ² given by

    λ = (b̄²/s² − m⁻¹)⁻¹.
Minimization of either T(μ) or C(μ) requires that ‖Ax_μ − b‖₂ and trace(I_m − P_μ) can be accurately and efficiently computed. For a problem in standard form, i.e. when B = I, these quantities can be computed from the SVD of A,

    A = U(Σ; 0)Vᵀ,  Σ = diag(σ_1, ..., σ_n).

We get

    P_μ = A(AᵀA + μ²I_n)⁻¹Aᵀ = U(Ω 0; 0 0)Uᵀ,

where

    Ω = diag(ω_1, ..., ω_n),  ω_i = σ_i²/(σ_i² + μ²).

From an easy calculation it follows that, with c = Uᵀb,

    ‖(I_m − P_μ)b‖₂² = Σ_{i=1}^{n} [μ²c_i/(σ_i² + μ²)]² + Σ_{i=n+1}^{m} c_i².   (26.35)

Since the ω_i are the eigenvalues of P_μ, we further have

    trace(I_m − P_μ) = m − Σ_{i=1}^{n} ω_i = m − n + Σ_{i=1}^{n} μ²/(σ_i² + μ²).   (26.36)

Using the generalized SVD (26.14)-(26.15), formulas similar to (26.35) and (26.36) are easily derived for the general case B ≠ I_n.
ELDÉN [1984c] has given a method for computing C(μ) based on the bidiagonalization of A, which is more efficient than that based on the SVD.
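For a standard-form problem, formulas (26.34)-(26.36) give a cheap way to evaluate C(μ) once the SVD is available. A sketch follows; the grid-search interface is an illustrative choice (the SVD- and bidiagonalization-based methods cited in the text locate the minimizer more efficiently).

```python
# GCV function C(mu) of (26.34) for B = I, evaluated via (26.35)-(26.36).
import numpy as np

def gcv_choose(A, b, mu_grid):
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A)          # full SVD: U is m x m
    c = U.T @ b
    tail = np.sum(c[n:] ** 2)            # sum_{i>n} c_i^2 in (26.35)
    best = None
    for mu in mu_grid:
        f = mu**2 / (s**2 + mu**2)       # 1 - omega_i
        num = np.sum((f * c[:n]) ** 2) + tail    # ||(I - P_mu) b||_2^2, (26.35)
        tr = m - n + np.sum(f)                   # trace(I_m - P_mu), (26.36)
        C = m * num / tr**2                      # eq. (26.34)
        if best is None or C < best[1]:
            best = (mu, C)
    return best                          # (best mu on the grid, its C value)
```

Each grid point costs only O(n) once the SVD has been formed, in line with the remarks above.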
In many important applications the matrices AᵀA and BᵀB have band structure. For example, when fitting a polynomial smoothing spline of degree 2k−1 to m data points the half-bandwidths will be k and k+1, respectively, see REINSCH [1971]. Then computing the cross validation function using the singular value decomposition is not efficient and will require O(m³) operations. HUTCHINSON and DE HOOG [1985] give a method which requires only O(k²m) operations. This is based on the observation that to compute

    trace(P_μ) = trace(AM_μ⁻¹Aᵀ) = trace(M_μ⁻¹AᵀA),

only the central 2k+1 bands of the inverse M_μ⁻¹ are needed. These can be efficiently computed from the Cholesky factorization of M_μ using the algorithm (16.2)-(16.4).

27. Linear inequality constraints

This section is concerned with linear least squares problems with inequality
constraints:

PROBLEM LSI.
    min_x ‖Ax − b‖₂,
    s.t. l ≤ Cx ≤ u,   (27.1)

where A ∈ ℝ^(m×n) and C ∈ ℝ^(p×n). If c_iᵀ denotes the ith row of the constraint matrix C, then the constraints can also be written

    l_i ≤ c_iᵀx ≤ u_i,  i = 1, ..., p.

It is convenient to allow the elements l_i = −∞ and u_i = ∞, which corresponds to cases where the lower and upper bound, respectively, are not present.

Note that if linear equality constraints are present, then these can be eliminated
using one of the methods given in Section 25. Both the direct elimination method
and the nullspace method will reduce the problem to a lower-dimensional problem
without equality constraints. However, an equality constraint can also be specified
by setting li = ui.
An important special case is when the inequalities are simple bounds:

PROBLEM BLS.
    min_x ‖Ax − b‖₂,
    s.t. l ≤ x ≤ u.   (27.2)

For reasons of computational efficiency it is essential that such constraints be considered separately from the more general constraints in (27.1). If only one-sided bounds on x are specified, then it is no restriction to assume that these are nonnegativity constraints, and we have:

PROBLEM NNLS.

    min_{x≥0} ‖Ax − b‖₂.   (27.3)

Another special case of Problem LSI is the least distance problem:

PROBLEM LSD.

    min_x ‖x‖₂,
    s.t. g ≤ Gx ≤ h,   (27.4)

or more generally

    min ‖x₁‖₂,
    s.t. g ≤ G₁x₁ + G₂x₂ ≤ h,   (27.5)

where

    x = (x₁; x₂),  x₁ ∈ ℝᵏ,  x₂ ∈ ℝ^(n−k).

In some important applications Problem LSI can be defined with a triangular matrix A, see SCHITTKOWSKI [1983]. When this is not the case it is often advantageous to perform an initial transformation of A in (27.1) to triangular form. Using the pivoting strategy described in Section 11 we compute a QR decomposition of the form

    QᵀAP = (R₁₁ R₁₂; 0 0),   (27.6)

where Q is orthogonal, P is a permutation matrix, and R₁₁ ∈ ℝ^(r×r) is nonsingular. Here we assume that the numerical rank r of A is determined using some specified tolerance, as described in Section 11. The objective function in (27.1) then becomes

    ‖(R₁₁, R₁₂)Pᵀx − c₁‖₂,  c = (c₁; c₂) = Qᵀb,

since we can delete the last m−r rows in R and c.


By a further transformation discussed by CLINE [1975] Problem LSI can be brought into a least distance problem. By Theorem 7.5 we can perform further orthogonal transformations from the right in (27.6) so that

    (R₁₁, R₁₂)V = (T, 0),

where T is upper triangular and nonsingular. Then (27.1) can be written

    min ‖Ty₁ − c₁‖₂,
    s.t. l ≤ Ey ≤ u,

where

    E = (E₁, E₂) = CPV,  y = (y₁; y₂) = VᵀPᵀx,

E and y are conformally partitioned, and y₁ ∈ ℝʳ. We now make the further change of variables

    z₁ = Ty₁ − c₁,  z₂ = y₂.

Substituting y₁ = T⁻¹(z₁ + c₁) in the constraints, we arrive at an equivalent least distance problem

    min_z ‖z₁‖₂,
    s.t. l̃ ≤ G₁z₁ + G₂z₂ ≤ ũ,   (27.7)

where

    G₁ = E₁T⁻¹,  G₂ = E₂,  l̃ = l − G₁c₁,  ũ = u − G₁c₁.

We note that if A has full column rank, then r = n and z = z₁, so we get a least distance problem of the form (27.4).
Methods for solving Problem LSI based on the above transformation to a least
distance problem have been given by LAWSON and HANSON ([1974], Chapter 23),
HASKELL and HANSON [1981] and by SCHITTKOWSKI [1983]. The method proposed
by Schittkowski for solving the least distance problem is a primal method as
opposed to a dual approach used in the first two references. The dual approach is
based on the following equivalence between a least distance problem and
a nonnegativity constrained problem.

THEOREM 27.1. Consider the least distance problem with lower bounds

    min_x ‖x‖₂,
    s.t. g ≤ Gx,   (27.8)

where G ∈ ℝ^(m×n). Let u ∈ ℝᵐ be the solution to the nonnegativity constrained problem

    min_u ‖Eu − f‖₂,
    s.t. u ≥ 0,   (27.9)

where

    E = (Gᵀ; gᵀ) ∈ ℝ^((n+1)×m),  f = (0, ..., 0, 1)ᵀ = e_{n+1}.

Let the residual corresponding to the solution be

    r = (r_1, ..., r_{n+1})ᵀ = Eu − f,

and put σ = ‖r‖₂. If σ = 0, then the constraints g ≤ Gx are inconsistent and (27.8) has no solution. If σ ≠ 0, then the vector x defined by

    x = (x_1, ..., x_n)ᵀ,  x_j = −r_j/r_{n+1},  j = 1, ..., n,   (27.10)

is the unique solution to (27.8).

PROOF. See LAWSON and HANSON ([1974], pp. 165-167). □
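Theorem 27.1 maps directly onto an NNLS solver. A sketch using `scipy.optimize.nnls`, which implements the Lawson-Hanson NNLS algorithm; the function name `ldp` and the inconsistency check are choices of this sketch.

```python
# Least distance problem (27.8):  min ||x||_2  s.t.  g <= G x,  via Theorem 27.1.
import numpy as np
from scipy.optimize import nnls

def ldp(G, g):
    m, n = G.shape
    E = np.vstack([G.T, g.reshape(1, -1)])   # E = (G^T; g^T), (n+1) x m
    f = np.zeros(n + 1)
    f[-1] = 1.0                              # f = e_{n+1}
    u, _ = nnls(E, f)                        # min ||E u - f||_2 s.t. u >= 0
    r = E @ u - f                            # residual of (27.9)
    if np.linalg.norm(r) == 0.0:             # sigma = 0: constraints inconsistent
        raise ValueError("constraints g <= Gx are inconsistent")
    return -r[:n] / r[n]                     # x_j = -r_j / r_{n+1}, eq. (27.10)
```

For example, with G = I and g = (1, 2)ᵀ the nearest point to the origin satisfying x ≥ g is (1, 2)ᵀ, which the routine reproduces.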

CLINE [1975] and HASKELL and HANSON [1981] describe how the modified LSD problem (27.5) with only upper bounds can be solved in two steps, each of which requires the solution of a problem of type NNLS, the first of these having additional linear equality constraints.
We now consider methods for solving Problem LSI which do not use a transformation into a least distance problem. We first remark that (27.1) is equivalent to a quadratic programming problem

    min_x (xᵀBx + cᵀx),
    s.t. l ≤ Cx ≤ u,

where

    B = AᵀA,  c = −2Aᵀb,

which could be solved by one of the methods of quadratic programming. In general this is not numerically safe, for the same reason that orthogonalization methods are preferred to the method of normal equations for unconstrained problems.
In general, methods for problems with linear inequality constraints are iterative in nature. In each iteration the value of the objective function is decreased, and the optimum is reached in finitely many steps. We shall here consider only so-called active set algorithms, which are based on the following observation. At the solution to (27.1) a certain subset of constraints will be active, i.e. satisfied with equality. If this set were known a priori, the solution to the LSI problem would also be the solution to a problem with only equality constraints, for which efficient methods are known. In active set algorithms a sequence of equality constrained problems is solved, corresponding to a prediction of the correct active set, called the working set. The working set only includes constraints which are satisfied at the current approximation, but not necessarily all such constraints. A general description of active set algorithms for linear inequality constrained optimization is given by GILL, MURRAY and WRIGHT ([1981], Chapter 5.2).
Any point x which satisfies all constraints in (27.1) is called a feasible point. In general a feasible point from which to start the active set algorithm is not known. (A trivial exception is the case when all constraints are simple bounds, as in (27.2) and (27.3).) Therefore the active set algorithm consists of two phases, where in the first phase a feasible point is determined as follows. For any point x, denote by I = I(x) the set of indices of constraints violated at x. Introduce an artificial objective function as the sum of all infeasibilities,

    φ(x) = Σ_{i∈I} max{c_iᵀx − u_i, l_i − c_iᵀx}.

In the first phase φ(x) is minimized subject to the constraints

    l_i ≤ c_iᵀx ≤ u_i,  i ∉ I.

If the minimum of φ(x) is positive, then the inequalities are inconsistent; otherwise we will find a feasible point.
When a feasible point has been found, the objective function is changed to ‖Ax − b‖₂. Except for that, the computations in phase one and phase two use the same basic algorithm. In phase two all iterates remain feasible.
We now briefly outline an active set algorithm for solving the LSI problem. Let x_k, the iterate at the kth step, satisfy the working set of n_k linearly independent constraints with the associated matrix C_k. We take

    x_{k+1} = x_k + α_k p_k,

where p_k is a search direction and α_k a nonnegative step length. The search direction is constructed so that the working set of constraints remains satisfied for all values of α_k. This will be the case if C_k p_k = 0. In order to satisfy this condition we compute a decomposition

    C_k Q_k = (0, T_k),  T_k ∈ ℝ^(n_k×n_k),   (27.11)

where T_k is triangular and nonsingular, and Q_k is a product of orthogonal transformations. If we partition Q_k so that

    Q_k = (Z_k, Y_k),  Z_k ∈ ℝ^(n×(n−n_k)),  Y_k ∈ ℝ^(n×n_k),   (27.12)

then the n − n_k columns of Z_k form a basis for the nullspace of C_k. Hence, the condition is satisfied if we take

    p_k = Z_k q_k,  q_k ∈ ℝ^(n−n_k).   (27.13)

We now determine q_k so that x_k + Z_k q_k minimizes the objective function, i.e. in phase two q_k solves the unconstrained least squares problem

    min_{q_k} ‖AZ_k q_k − r_k‖₂,  r_k = b − Ax_k.   (27.14)

To simplify the discussion we assume in the following that the matrix A Z_k is of full
rank, so that (27.14) has a unique solution. To compute this solution we need the QR
decomposition of the matrix A Z_k. This is obtained from the QR decomposition of
the matrix A Q_k,

P_k^T A Q_k = P_k^T (A Z_k, A Y_k) = ( R_k  S_k )
                                      (  0   U_k ),          (27.15)

where R_k has n − n_k columns and U_k has n_k columns.
The advantage of computing this larger decomposition is that then the orthogonal
matrix P_k need not be saved and can be discarded after being applied also to the
residual vector r_k. The solution to (27.14) can now be computed from the triangular
SECTION 27 Constrained least squares problems 611

system

R_k q_k = c_k,  where  P_k^T r_k = ( c_k )
                                    ( d_k ),  c_k ∈ R^{n − n_k}.
We now determine ᾱ, the maximum nonnegative step length along p_k for which
x_k + ᾱ p_k remains feasible with respect to the constraints not in the working set. We take

x_{k+1} = x_k + ᾱ p_k,  if ᾱ < 1,

and then add the constraint which is hit to the working set for the next iteration. We
take

x_{k+1} = x_k + p_k,  if ᾱ ≥ 1.

In this case x_{k+1} will minimize the objective function when the constraints in the
working set are treated as equalities, and the orthogonal projection of the gradient
onto the subspace of feasible directions will be zero,

Z_k^T g_{k+1} = 0,  g_{k+1} = −A^T r_{k+1}.  (27.16)

In the case ᾱ ≥ 1 we check the optimality of x_{k+1} by computing Lagrange
multipliers for the constraints in the working set. At x_{k+1} these are defined by the
equation

C_k^T λ = g_{k+1} = −A^T r_{k+1}.  (27.17)

The residual vector r_{k+1} of (27.14) satisfies P_k^T r_{k+1} = (0; d_k). Hence, multiplying
(27.17) by Q_k^T we obtain, using (27.11),

Q_k^T C_k^T λ = ( 0   ) λ = −Q_k^T A^T P_k P_k^T r_{k+1},
                ( T_k^T )

so from (27.15)

T_k^T λ = −U_k^T d_k.

The Lagrange multiplier λ_i for the constraint l_i ≤ c_i^T x ≤ u_i in the working set is said to
be optimal if λ_i ≤ 0 at an upper bound and λ_i ≥ 0 at a lower bound. If all multipliers
are optimal then we have found an optimal point and are finished. If a multiplier is
not optimal, then the objective function can be decreased by deleting the
corresponding constraint from the working set. If more than one multiplier is
not optimal, then it is usual to delete the constraint whose multiplier deviates most
from optimality.
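The optimality test and the deletion rule just described can be sketched as follows; encoding the bound type of each working constraint as a boolean list is an illustrative choice, not notation from the text.

```python
# Multiplier test for the working set: at a lower bound lambda_i >= 0 is
# optimal, at an upper bound lambda_i <= 0 is optimal. If some multiplier is
# not optimal, the constraint deviating most from optimality is deleted.

def constraint_to_delete(lams, at_lower):
    """lams[i]: multiplier of working constraint i; at_lower[i]: True if it
    is active at its lower bound. Returns the index to delete, or None if
    the current point is optimal."""
    worst, worst_dev = None, 0.0
    for i, (lam, low) in enumerate(zip(lams, at_lower)):
        dev = -lam if low else lam       # positive deviation means non-optimal
        if dev > worst_dev:
            worst, worst_dev = i, dev
    return worst
```

With multipliers [0.5, -0.2] both at lower bounds, the first is optimal and the second is not, so constraint 1 would be deleted.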
At each iteration step the working set of constraints is changed, which leads to
a change in the matrix C_k. If a constraint is dropped a row of C_k is deleted; if a
constraint is added a new row of C_k is introduced. An important feature of an active
set algorithm is the efficient solution of the sequence of unconstrained problems
(27.14). Using techniques described in Section 21, methods can be developed to
update the matrix decompositions (27.11) and (27.15). In (27.11) the matrix Q_k is
modified by a sequence of orthogonal transformations from the right. These
transformations are then applied to Q_k in (27.15), and this decomposition together with the
vector P_k^T r_{k+1} is similarly updated. Since these updatings are quite intricate they will
not be described in detail here.
For the case when A has full rank the algorithm for Problem LSI described above
is essentially equivalent to the algorithm given by STOER [1971]. In this case
Problem LSI always has a unique solution. If A is rank-deficient there will still be
a unique solution if all active constraints at the solution have nonzero Lagrange
multipliers. Otherwise there is an infinite manifold M of optimal solutions with
a unique optimal value. In this case we can seek the unique solution of minimum
norm, which satisfies

min ||x||_2,  x ∈ M.

This is a least distance problem. A method for computing this minimal norm
solution when all constraints are simple bounds has been given by LÖTSTEDT [1984].
In the rank-deficient case it can happen that the matrix A Z_k in (27.14) is
rank-deficient and hence R_k singular. Note that if some R_k is nonsingular it can
become singular during later iterations only when a constraint is deleted from the
working set, in which case only its last diagonal element can become zero. This
simplifies the treatment of the rank-deficient case. To make the initial R_k
nonsingular one can add artificial constraints to ensure that the matrix A Z_k has full
rank. For a further discussion of the treatment of the rank-deficient case, see GILL et
al. [1986].
A possible further complication is that the working set of constraints can become
linearly dependent. This can cause cycling in the algorithm, so that its
convergence cannot be ensured. A simple remedy that is often used is to enlarge the
feasible region of the offending constraint by a small quantity; see also GILL,
MURRAY and WRIGHT ([1981], Chapter 5.8.2).
As remarked earlier, Problem LSI simplifies considerably when the only
constraints are simple bounds. This problem is important in its own right and also
serves as a good illustration of the general algorithm. Hence, we now consider the
algorithm for Problem BLS in more detail. We first note that feasibility of the
bounds is resolved by simply checking whether l_i ≤ u_i, i = 1, ..., n. Further, the
specification of the working set is equivalent to a partitioning of x into free and fixed
variables. During an iteration the fixed variables will effectively be removed from the
problem.
We divide the index set of x according to

{1, 2, ..., n} = F ∪ B,

where i ∈ F if x_i is a free variable and i ∈ B if x_i is fixed at its lower or upper bound.
The matrix C_k will now consist of the rows e_i^T, i ∈ B, of the unit matrix I. We let
C_k = E_B^T, and if E_F is similarly defined we can write

Q_k = (E_F, E_B),  T_k = I_{n_k}.


This shows that Q_k is simply a permutation matrix and the product

A Q_k = (A E_F, A E_B) = (A_F, A_B)

corresponds to a permutation of the columns of A. Assume now that the bound
corresponding to x_q is to be dropped. This can be achieved by

A Q_{k+1} = A Q_k P_R(k, q),

where P_R(k, q) is a permutation matrix which performs a right circular shift, in which
the columns

1, ..., k, k+1, ..., q−1, q, q+1, ..., n

are permuted to

1, ..., k, q, k+1, ..., q−1, q+1, ..., n.

Similarly, if the bound corresponding to x_q becomes active it can be added to the
working set by

A Q_{k+1} = A Q_k P_L(q, k),

where P_L(q, k) is a permutation matrix which performs a left circular shift, in which
the columns

1, ..., q−1, q, q+1, ..., k, k+1, ..., n

are permuted to

1, ..., q−1, q+1, ..., k, q, k+1, ..., n.
Subroutines for updating the QR decomposition after right or left circular shifts are
included in LINPACK and are described in DONGARRA et al. ([1979], Chapter 10).
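The index bookkeeping behind these column permutations can be sketched on plain lists. The code below uses 0-based Python indices, so "position k" here corresponds to position k+1 in the text above.

```python
# Right circular shift: move column q to position k (k <= q), shifting the
# block cols[k:q] one step right. The left circular shift is its inverse.

def right_circular_shift(cols, k, q):
    """Move cols[q] to position k, shifting cols[k:q] right."""
    return cols[:k] + [cols[q]] + cols[k:q] + cols[q+1:]

def left_circular_shift(cols, q, k):
    """Move cols[q] to position k (q <= k), shifting cols[q+1:k+1] left."""
    return cols[:q] + cols[q+1:k+1] + [cols[q]] + cols[k+1:]
```

Applying the right shift with k = 2, q = 5 to the index list [0, 1, 2, 3, 4, 5, 6, 7] gives [0, 1, 5, 2, 3, 4, 6, 7]; the corresponding left shift restores the original order, mirroring the add/drop operations on the working set.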
For Problem BLS the equation (27.17) for the Lagrange multipliers simplifies to

λ = E_B^T g_{k+1} = −E_B^T A^T r_{k+1} = −U_k^T d_k.

Hence, the Lagrange multipliers are simply equal to the corresponding components
of the gradient vector −A^T r_{k+1}.
Below we summarize the algorithm for Problem BLS:

ALGORITHM 27.1 (Active set algorithm for Problem BLS).

Initialization.
F := {1, 2, ..., n};  B := ∅;
x := ½(l + u);
Compute the QR decomposition

P^T A (E_F, E_B) = ( R  S )            ( c )
                   ( 0  U );  P^T b = ( d );

comment: It is assumed that rank(A) = n.

Main loop
begin repeat
    Compute unconstrained optimum in free variables:
    q := R^{−1}(c − S E_B^T x);  z := E_F q + E_B E_B^T x;
    if l_i ≤ z_i ≤ u_i for all i ∈ F then
        begin comment: Check for optimality.
        Compute Lagrange multipliers λ := U^T (d − U E_B^T x);
        if B = ∅ or sign(y_i) λ_i ≤ 0 for all i ∈ B then
            go to finished;
        Find index t such that sign(y_t) λ_t = max_{i∈B} sign(y_i) λ_i;
        Move index t from B to F, i.e. free the variable x_t;
        Update the decomposition

        P^T A (E_F, E_B) = ( R  S )            ( c )
                           ( 0  U )  and  P^T b = ( d )

        end
    else
        begin
        For all i ∈ F compute
            α_i := (x_i − l_i)/(x_i − z_i)  if z_i < l_i,
            α_i := (u_i − x_i)/(z_i − x_i)  if z_i > u_i;
        α := min_{i∈F} α_i;  x := x + α(z − x);
        comment: 0 ≤ α < 1
        Move from F to B all indices q for which x_q = l_q or x_q = u_q;
        Put y_q := 1 if x_q = l_q and y_q := −1 if x_q = u_q;
        Update the decomposition

        P^T A (E_F, E_B) = ( R  S )            ( c )
                           ( 0  U )  and  P^T b = ( d )

        end
end repeat
finished.
Several implementations of varying generality of active set methods have been
developed. LAWSON and HANSON [1974] give a FORTRAN implementation of an
algorithm for Problem NNLS, which is similar to Algorithm 27.1. They also give
a FORTRAN subroutine based on this algorithm for Problem LSD with lower bounds
only. HASKELL and HANSON [1981] give an algorithm for the NNLSE problem with
nonnegativity constraints on selected variables and equality constraints, where the
equality constraints are handled by the method of weighting, see Section 25.4. In this
algorithm no assumption on the rank of A is made. They describe several methods of
transforming problems of type LSEI, with linear equality and inequality constraints,
into the form NNLSE. In HANSON [1986] further developments of this algorithm are
described.
The algorithm of STOER [1971] for Problem LSI has been realized by PATZELT

[1973] and an English version of this code is given by EICHHORN and LAWSON
[1975]. SCHITTKOWSKI and STOER [1979] give an implementation of the same
method using Gram-Schmidt decompositions. An advantage of this implementa-
tion is that it is relatively easy to take data changes into account. The implementa-
tion described by CRANE, GARBOW, HILLSTROM and MINKOFF [1980] is based on this
work. There is a restrictive assumption in these realizations that A is of full column
rank. ZIMMERMANN [1977] gives a special implementation of Stoer's method for
Problem BLS also based on Gram-Schmidt decompositions.
A robust and general set of FORTRAN subroutines for Problem LSI and convex
quadratic programming is given by GILL et al. [1986]. The method is a two-phase
active set method. It allows also a linear term in the objective function and handles
a mixture of bounds and general linear constraints.
Often the matrices A and C in Problem LSI are sparse. It is difficult to take
advantage of this, since much fill results from the sequence of transformations
applied to A and C during the iterations. For problems with only simple bounds on
the variables, sparsity can be preserved. OREBORN [1986] describes an algorithm for
Problem NNLS which uses SPARSPAK by GEORGE and NG [1984] for obtaining an
initial sparse QR decomposition of A.
Iterative methods have also been developed for sparse problems, although slow
convergence is often a problem. LÖTSTEDT [1984] gives a method for Problem LSD
using a preconditioned conjugate gradient method.
CHAPTER VI

Nonlinear Least Squares

In this chapter we discuss the solution of nonlinear least squares problems. Methods
for solving such problems are iterative and each iteration step usually requires the
solution of a related linear least squares problem. The nonlinear least squares
problem is closely related to nonlinear systems of equations and is a special case of
the general optimization problem in R^n. We will here mainly emphasize those
aspects of the nonlinear least squares problem which derive from its special
structure. For a general background to the theory of iterative methods for nonlinear
equations we refer to ORTEGA and RHEINBOLDT [1970], FLETCHER [1980], GILL,
MURRAY and WRIGHT [1981] and DENNIS and SCHNABEL [1983]. An excellent survey
of algorithms for the nonlinear least squares problem is given by DENNIS [1977].

28. The nonlinear least squares problem

The unconstrained nonlinear least squares problem is to find a global minimizer of
the sum of squares of m nonlinear functions

φ(x) = ½ Σ_{i=1}^m r_i^2(x) = ½ ||r(x)||_2^2.  (28.1)

Here x ∈ R^n, r(x) = (r_1(x), ..., r_m(x))^T ∈ R^m is the residual vector, and each r_i(x),
i = 1, ..., m, m ≥ n, is a nonlinear functional defined on R^n. Clearly, if the r_i(x) were
linear in x then (28.1) would be a linear least squares problem. For m = n, (28.1)
includes as a special case the solution of a system of nonlinear equations.
One important area in which nonlinear least squares problems arise is data
fitting. Here one attempts to fit given data (y_i, t_i), i = 1, ..., m, to a model function
f(x, t). If we let r_i(x) represent the error in the model prediction for the ith
observation,

r_i(x) = y_i − f(x, t_i),  i = 1, ..., m,  (28.2)

we are led to a problem of the form (28.1). The choice of the least squares measure is
justified here, as for the linear case, by statistical considerations, see BARD [1974].
This assumes that only the y_i are subject to errors and the values t_i of the independent
variable t are exact. The case when there are errors in both y_i and t_i is discussed in
Section 34.

The basic methods for the nonlinear least squares problem require derivative
information about the components r_i(x), and we assume in the following that the r_i(x)
are twice continuously differentiable. The Jacobian of r(x) is J(x) ∈ R^{m×n}, where

J(x)_{ij} = ∂r_i(x)/∂x_j,

and the Hessian matrices of the r_i(x) are G_i(x) ∈ R^{n×n}, where

G_i(x)_{jk} = ∂^2 r_i(x)/∂x_j ∂x_k,  i = 1, ..., m.

It is easily shown that the first and second derivatives of φ(x) are given by

∇φ(x) = J(x)^T r(x),  (28.3)

and

∇^2 φ(x) = J(x)^T J(x) + Q(x),  Q(x) = Σ_{i=1}^m r_i(x) G_i(x).  (28.4)

The special forms of ∇φ(x) and ∇^2 φ(x) can be exploited by methods for the
nonlinear least squares problem.
A necessary condition for x* to be a local minimum of φ(x) is that ∇φ(x*) =
J(x*)^T r(x*) = 0. Any point which satisfies this condition will be called a critical point.
We now establish a condition for a critical point x* to be a local minimum
of φ(x), following a geometric approach by WEDIN [1974]. If we assume that
rank(J(x*)) = n, then it follows that J(x*)† J(x*) = I_n, where J(x*)† is the pseudo-
inverse of J(x*). We now rewrite (28.4) as

∇^2 φ = J^T J − γ G_w = J^T (I − γ (J†)^T G_w J†) J,  (28.5)

where

G_w = Σ_{i=1}^m w_i G_i,  w = −r/||r||_2,  γ = ||r||_2,  (28.6)

and all quantities are evaluated at the point x*. The symmetric matrix

K = (J†)^T G_w J†  (28.7)

is called the normal curvature matrix of the n-dimensional surface z = r(x) in R^m, with
respect to the normal vector w. Let the eigenvalues of K be

κ_1 ≥ κ_2 ≥ ... ≥ κ_n.

The quantities ρ_i = 1/κ_i, κ_i ≠ 0, are the principal radii of curvature of the surface with
respect to the normal w.
It follows that ∇^2 φ(x*) is positive-definite, and x* a local minimum, if and only if
I − γK is positive-definite, i.e. when

1 − γκ_1 > 0.

If 1 − γκ_1 < 0, then the least squares problem has a saddle point at x*, or, if also
1 − γκ_n < 0, even a local maximum at x*.
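The quantities above are easy to evaluate for a one-parameter problem. The sketch below uses the residuals r_1 = x + 1, r_2 = λx² + x − 1 (the problem appearing again as Example 29.1 in the next section); for n = 1 the matrix K = (J†)^T G_w J† has a single nonzero eigenvalue G_w/(J^T J), so the test 1 − γκ_1 > 0 becomes a scalar check.

```python
import math

# Normal curvature check at a critical point for an n = 1, m = 2 problem:
# r1 = x + 1, r2 = lam*x**2 + x - 1. Returns gamma*kappa_1 of (28.5)-(28.7).

def gamma_kappa(lam, x=0.0):
    r = [x + 1.0, lam * x * x + x - 1.0]      # residual vector
    J = [1.0, 2.0 * lam * x + 1.0]            # m-by-1 Jacobian
    G = [0.0, 2.0 * lam]                      # Hessians of r1, r2 (scalars)
    gamma = math.hypot(r[0], r[1])            # gamma = ||r||_2
    w = [-ri / gamma for ri in r]             # normal vector w = -r/||r||_2
    Gw = w[0] * G[0] + w[1] * G[1]            # G_w = sum_i w_i G_i
    kappa = Gw / (J[0] ** 2 + J[1] ** 2)      # nonzero eigenvalue of K
    return gamma * kappa
```

At the critical point x* = 0 this evaluates to γκ_1 = λ, so x* is a local minimum precisely when λ < 1, which agrees with the direct computation φ''(0) = 2(1 − λ).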
The geometrical interpretation of the nonlinear least squares problem (28.1), to
minimize ||r(x)||_2, is to find a point on the surface z = r(x) in R^m closest to the origin.
In the case of data fitting, when r_i(x) is given by (28.2), it is more illustrative to consider
the surface

z = (f(x, t_1), ..., f(x, t_m))^T ∈ R^m.

The problem is then to find the point on this surface closest to the observation vector
y ∈ R^m, cf. DRAPER and SMITH ([1981], pp. 500-501). This is illustrated in Fig. 28.1
for the simple case of m = 2 observations and only a single parameter x. Since in the
figure we have γ = ||r||_2 < ρ_1, it follows that 1 − γκ_1 > 0, which is consistent with the
fact that x* is a local minimum.

FIG. 28.1. Geometry of the data fitting problem for m = 2, n = 1.

There are basically two different ways to view problem (28.1). One could think of
the problem as arising from an overdetermined system of nonlinear equations

r_i(x) = 0,  i = 1, 2, ..., m.  (28.8)

It is then natural to approximate r(x) by a linear model around a given point x_c,

r̃(x) = r(x_c) + J(x_c)(x − x_c).  (28.9)

Then r̃(x) = 0 is an overdetermined system of linear equations. One can use the
corresponding linear least squares problem to derive an improved approximate
solution to (28.1). Note that (28.9) only uses first-order derivative information about
r(x). This approach, which leads to the Gauss-Newton-type methods and the
Levenberg-Marquardt method, is discussed in Sections 29 and 30.
In the second approach (28.1) is viewed as a special case of optimization in R^n, and
a quadratic model of φ(x) around a point x_c is used,

ψ(x) = φ(x_c) + ∇φ(x_c)^T (x − x_c) + ½(x − x_c)^T ∇^2 φ(x_c)(x − x_c),  (28.10)

where the derivatives of φ(x) are given by (28.3) and (28.4). The minimizer of ψ(x) is

given by

x_N = x_c − (J(x_c)^T J(x_c) + Q(x_c))^{−1} J(x_c)^T r(x_c).  (28.11)

Using (28.11) as an improved approximation is equivalent to Newton's method
applied to (28.1), which usually is locally quadratically convergent to the minimizer.
We note that the Gauss-Newton method can be thought of as arising from
neglecting the term Q(x_c) in (28.11), since then, if J(x_c) has full rank, x_N equals the
least squares solution of the linearized system (28.9).
From (28.4) it follows that Q(x_c) will be small if either the r_i(x) are only mildly
nonlinear or if the residuals r_i(x_c), i = 1, ..., m, are small. Then the behaviour of the
Gauss-Newton method can be expected to be similar to that of Newton's method. In
particular, for a consistent problem where r(x*) = 0 at the solution x*, the local
convergence rate will be the same for both methods.
For moderate to large residual problems the local convergence rate for the
Gauss-Newton method can be much inferior to that of Newton's method. However,
the cost of computing the mn^2 second derivatives in (28.11) needed for Q(x_c) can be
prohibitively large. Therefore methods which only use part of the second derivative
information are of interest.
For curve fitting problems the function values r_i(x) = y_i − f(x, t_i) and derivatives
can be obtained from the single function f(x, t). If f(x, t) is composed of e.g. simple
exponential and trigonometric functions, then second derivatives can sometimes be
computed cheaply. Also, if J is sparse, so often is Q, and the cost of computing Q(x)
with analytical derivatives may not be forbiddingly large.
Methods which explicitly or implicitly take second derivatives into account are
discussed in Section 31. Into this category also fall various types of quasi-Newton
methods for general unconstrained optimization.

29. Gauss-Newton-type methods

The Gauss-Newton method for problem (28.1) is based on a sequence of linear
approximations of r(x) of the form (28.9). If x_k denotes the current approximation,
then a correction p_k is computed as a solution to the linear least squares problem

min_p ||r(x_k) + J(x_k)p||_2,  p ∈ R^n,  (29.1)

and the new approximation is x_{k+1} = x_k + p_k. If J(x_k) is not rank-deficient, then the
solution to (29.1) is unique and can be written

p_k = −(J(x_k)^T J(x_k))^{−1} J(x_k)^T r(x_k).

However, p_k should be computed by a method which is stable also when J(x_k) is
ill-conditioned or singular, using the QR or SVD decomposition of J(x_k). When
J(x_k) is rank-deficient, p_k should be chosen as the minimum norm solution

p_k = −J(x_k)† r(x_k),  (29.2)

where J(x_k)† denotes the pseudoinverse of J(x_k).

The Gauss-Newton method as described above has the advantage that it solves
linear problems in just one iteration and has fast convergence on mildly nonlinear
problems. However, it may not be locally convergent on problems that are very
nonlinear or have large residuals. This is illustrated by the following example due to
Powell, see FLETCHER ([1980], p. 94).

EXAMPLE 29.1. Consider the problem with m = 2, n = 1 given by

r_1(x) = x + 1,  r_2(x) = λx^2 + x − 1,

where λ is a parameter. The minimizer of r_1^2(x) + r_2^2(x) is x* = 0. It is easily shown
that for the Gauss-Newton method

x_{k+1} = λ x_k + O(x_k^2),

and therefore this method is not locally convergent when |λ| > 1.
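For n = 1 the Gauss-Newton step reduces to the scalar formula p = −(J^T r)/(J^T J), so the behaviour in Example 29.1 can be checked in a few lines:

```python
# Undamped Gauss-Newton iteration for Powell's example (m = 2, n = 1).

def gn_step(x, lam):
    r1, r2 = x + 1.0, lam * x * x + x - 1.0   # residuals
    j1, j2 = 1.0, 2.0 * lam * x + 1.0         # Jacobian entries dr_i/dx
    g = j1 * r1 + j2 * r2                     # J^T r
    h = j1 * j1 + j2 * j2                     # J^T J
    return x - g / h

# |lam| < 1: linear convergence to x* = 0 with rate roughly |lam|
x = 0.1
for _ in range(50):
    x = gn_step(x, 0.5)
```

With λ = 0.5 the iterates shrink by roughly a factor 0.5 per step, while with λ = −2 a single step from x = 0.1 already roughly triples the distance to the critical point at 0.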

To get a more useful method we take instead

x_{k+1} = x_k + α_k p_k,

where p_k is the solution to (29.1) and α_k is a scalar. The resulting method, which uses
p_k as a search direction, is usually called the damped Gauss-Newton method. The
vector p_k is called the Gauss-Newton direction.
The Gauss-Newton direction has the following two important properties:
(i) The vector p_k is invariant under linear transformations of the independent
variable x.
(ii) Provided that J(x_k)^T r(x_k) ≠ 0, p_k is a descent direction, i.e. for sufficiently small
α > 0 we have

||r(x_k + α p_k)||_2 < ||r(x_k)||_2.

The first property is obviously desirable. The second property follows from the
relation

||r(x_k + α p_k)||_2^2 = ||r(x_k)||_2^2 − 2α ||P_{J_k} r(x_k)||_2^2 + O(α^2),  (29.3)

where we have introduced the orthogonal projection onto the range of J(x_k),

P_{J_k} = J(x_k) J(x_k)†.

It can be shown, using the singular value decomposition of J(x_k), that J(x_k)^T r(x_k) ≠ 0
implies that P_{J_k} r(x_k) ≠ 0, which together with (29.3) proves that p_k is a descent
direction. In the special case that J(x_k) has full column rank the property follows
from

P_{J_k} r(x_k) = J(x_k)(J(x_k)^T J(x_k))^{−1} J(x_k)^T r(x_k) ≠ 0  if  J(x_k)^T r(x_k) ≠ 0.

To make the damped Gauss-Newton method into a viable algorithm the step
length α_k must be chosen carefully. Two common ways of choosing α_k are:
(i) Take α_k to be the largest number in the sequence 1, ½, ¼, ... for which the
following inequality holds:

||r(x_k)||_2^2 − ||r(x_k + α_k p_k)||_2^2 ≥ ½ α_k ||J(x_k) p_k||_2^2.

This is essentially the Armijo-Goldstein step length principle (note that −J(x_k)p_k =
P_{J_k} r(x_k)), see ORTEGA and RHEINBOLDT ([1970], p. 491) and GILL, MURRAY and
WRIGHT ([1981], p. 100).
(ii) Take α_k as the solution to the one-dimensional minimization problem

min_α ||r(x_k + α p_k)||_2,  (29.4)

i.e. do an "exact" line search. This is computationally more expensive than (i).
A theoretical analysis of these two step length principles has been given by RUHE
[1979].
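Step length rule (i) can be sketched directly, again for Powell's example from Example 29.1; the 1e-10 floor on alpha is a safeguard added here and is not part of the rule in the text.

```python
# Damped Gauss-Newton with the Armijo-Goldstein rule (i): halve alpha,
# starting from 1, until
#   ||r(x)||^2 - ||r(x + alpha*p)||^2 >= 0.5 * alpha * ||J p||^2.

def phi2(x, lam):                             # ||r(x)||_2^2
    r1, r2 = x + 1.0, lam * x * x + x - 1.0
    return r1 * r1 + r2 * r2

def damped_gn_step(x, lam):
    r1, r2 = x + 1.0, lam * x * x + x - 1.0
    j1, j2 = 1.0, 2.0 * lam * x + 1.0
    g, h = j1 * r1 + j2 * r2, j1 * j1 + j2 * j2
    p = -g / h                                # Gauss-Newton direction
    jp2 = h * p * p                           # ||J p||_2^2
    alpha = 1.0
    while alpha > 1e-10 and \
          phi2(x, lam) - phi2(x + alpha * p, lam) < 0.5 * alpha * jp2:
        alpha *= 0.5                          # descent direction: loop exits
    return x + alpha * p, alpha
```

For λ = −2 and x = 0.1, where the full Gauss-Newton step would increase the residual, the rule settles on α = ¼ and the sum of squares decreases.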
Often a step length α_k is chosen to be an approximate solution of (29.4). Such an
algorithm, which takes into account the fact that φ(x) = ½||r(x)||_2^2 is a sum of squares,
has been given by LINDSTRÖM and WEDIN [1984]. They determine an approximation
p(α) of the curve f(α) = r(x_k + α p_k) in R^m, and then minimize ||p(α)||_2 as a function of α.
One alternative is to choose p(α) to be the unique circle (in the degenerate case
a straight line) which satisfies the conditions

p(0) = f(0),  p'(0) = f'(0),  p(α_0) = f(α_0),

where α_0 is a guess of the step length.
To improve the rate of convergence of the damped Gauss-Newton method one
should switch to another search direction, preferably the negative gradient
g_k = −J(x_k)^T r(x_k), when either the angle between p_k and g_k becomes large or the
reduction achieved in ||r(x)||_2 is small.
Since the damped Gauss-Newton method always takes descent steps, this method
is locally convergent on almost all nonlinear least squares problems, provided that
the line search is carried out appropriately. In fact it is usually globally convergent.
However, it may still be slowly convergent on large residual or very nonlinear
problems.
An implementation of the Gauss-Newton method that uses the minimum norm
solution (29.2) in case J(x_k) is rank-deficient must include some strategy to estimate
the rank of J(x_k). Such strategies have been discussed for the singular value
decomposition in Section 10 and for the QR decomposition in Section 11. That the
determination of rank is critical is illustrated by the following example.

EXAMPLE 29.2 (GILL, MURRAY and WRIGHT [1981], p. 136). Let J = J(x_k) and r = r(x_k)
be defined by

J = ( 1  0 ),   r = ( r_1 )
    ( 0  ε )        ( r_2 ),

where ε << 1 and r_1 and r_2 are of order unity. If J is considered to be of rank two, then
the search direction p_k = s_1 is given by

s_1 = −( r_1   )
       ( r_2/ε ),

whereas if the assigned rank is one, p_k = s_2, where

s_2 = −( r_1 )
       (  0  ).

Clearly the two directions s_1 and s_2 are almost orthogonal, and s_1 is almost
orthogonal to the gradient vector J^T r.

Usually an underestimate of the rank is preferable, except when φ(x) is actually
close to an ill-conditioned quadratic function.
We now discuss the local convergence of the Gauss-Newton method. This has
been analyzed by WEDIN [1974] and RAMSIN and WEDIN [1977]. One step of the
undamped method (i.e. taking α_k = 1) can be written as

x_{k+1} = F(x_k),  F(x) = x − J(x)† r(x).

The first derivative of F(x) equals

∇F(x) = −J(x)† (J(x)†)^T Σ_{i=1}^m r_i(x) G_i(x) = γ J(x)† (J(x)†)^T G_w,

using the notations of (28.6). The asymptotic rate of convergence is bounded by the
spectral radius of the matrix ∇F(x*) at the solution x*. But ∇F(x) has the same
nonzero eigenvalues, and thus the same spectral radius, as the matrix

γ (J(x)†)^T G_w J(x)† = γK,

where K is the normal curvature matrix (28.7). Hence,

ρ = ρ(∇F) = γ max(κ_1, −κ_n).  (29.5)

In general convergence is linear, but if γ = ||r(x*)||_2 = 0 we have superlinear
convergence. From (29.5) our earlier conjecture follows, that the local rate of
convergence of the undamped Gauss-Newton method is fast when either
(i) the residual norm γ = ||r(x*)||_2 is small, or
(ii) r(x) is mildly nonlinear, i.e. the κ_i, i = 1, ..., n, are small.
For a saddle point we have, as shown in Section 28, that γκ_1 > 1. Hence, one is in
general repelled (using undamped Gauss-Newton) from a saddle point. This is an
excellent property, since saddle points are not at all uncommon for nonlinear least
squares problems.
The rate of convergence for the Gauss-Newton method with exact line search
has been shown by RUHE [1979] to be

ρ̂ = γ(κ_1 − κ_n)/(2 − γ(κ_1 + κ_n)).  (29.6)

We have ρ̂ = ρ if κ_n = −κ_1 and ρ̂ < ρ otherwise. We also have that γκ_1 < 1 implies
ρ̂ < 1, i.e. we always get convergence close to a local minimum. This is in contrast to
the undamped Gauss-Newton method, which may fail to converge to a local
minimum.
The rate of convergence for the undamped Gauss-Newton method can be
estimated during the iterations from

||J(x_{k+1}) p_{k+1}||_2 / ||J(x_k) p_k||_2 ≈ ρ + O(||x_k − x*||_2).  (29.7)

When the estimated ρ is greater than 0.5 (say), then one should consider switching to
a method using second derivative information, or perhaps evaluate the quality of
the underlying model.
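The computable ratio in (29.7) is cheap to monitor during the iterations. The sketch below does this for Powell's example of Example 29.1, where the predicted asymptotic rate is ρ = |λ|:

```python
import math

# Estimating the Gauss-Newton rate via (29.7): the ratio
# ||J(x_{k+1}) p_{k+1}||_2 / ||J(x_k) p_k||_2 tends to rho.

def gn_data(x, lam):
    r1, r2 = x + 1.0, lam * x * x + x - 1.0
    j1, j2 = 1.0, 2.0 * lam * x + 1.0
    h = j1 * j1 + j2 * j2
    p = -(j1 * r1 + j2 * r2) / h              # Gauss-Newton step
    return p, math.sqrt(h) * abs(p)           # (p, ||J p||_2) for n = 1

lam, x = 0.5, 0.1
p, jp_old = gn_data(x, lam)
ratio = None
for _ in range(15):
    x += p
    p, jp = gn_data(x, lam)
    ratio, jp_old = jp / jp_old, jp           # approaches rho = 0.5 here
```

After a handful of iterations the observed ratio settles near 0.5, matching the local rate |λ| of the undamped method on this problem.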
In Section 28 the nonlinear least squares data fitting problem was interpreted in
geometrical terms as seeking the point on the n-dimensional surface z = (f(x, t_1), ...,
f(x, t_m)) ∈ R^m closest to the data vector y. An example of an ill-behaved least squares
problem is shown in Fig. 29.1.

FIG. 29.1.

For this problem the radius of curvature 1/κ_1 << ||r(x*)||_2, and many insignificant
local minima exist. For such problems it seems reasonable to demand that the
surface r(x) is smoothed before one attempts to solve the problem. WEDIN [1974] has
shown that the estimate (29.7) of the rate of convergence of the Gauss-Newton
method is often a good confirmation of the quality of the underlying model.
DEUFLHARD and APOSTOLESCU [1980] call problems for which divergence occurs
"inadequate problems". It seems important that algorithms for nonlinear least
squares problems also estimate the maximal curvature.

30. Trust region methods

There is still a possibility that the damped Gauss-Newton method can have
difficulties getting past an intermediate point where the Jacobian matrix does not
have full column rank. This can be avoided either by taking second derivatives into
account (see Section 31) or by further stabilizing the damped Gauss-Newton
method to overcome this possibility of failure. Methods using the latter approach
were first suggested by LEVENBERG [1944] and MARQUARDT [1963]. Here a search
direction p_k is computed by solving the problem

min_p { ||r(x_k) + J(x_k)p||_2^2 + μ_k ||p||_2^2 },  (30.1)

where the parameter μ_k ≥ 0 controls the iterations and limits the size of p_k. Note that
p_k is well-defined even when J(x_k) is rank-deficient. As μ_k → ∞, ||p_k||_2 → 0 and p_k
becomes parallel to the steepest descent direction. It follows from the discussion of
Problem LSQI in Section 26 that p_k is the solution to the least squares problem with
quadratic constraint

min_p ||r(x_k) + J(x_k)p||_2,
                                  (30.2)
s.t.  ||p||_2 ≤ δ_k,

where μ_k = 0 if the constraint in (30.2) is not binding and μ_k > 0 otherwise. The set of
feasible vectors p, ||p||_2 ≤ δ_k, can be thought of as a region of trust for the linear
model

r(x) ≈ r(x_k) + J(x_k)p,  p = x − x_k.

Thus, these methods can be thought of as a special case of trust region methods. For
a general description of trust region methods for nonlinear optimization, see MORÉ
[1983].
Many different strategies have been used to choose μ_k in (30.1). A careful
implementation of the Levenberg-Marquardt algorithm as a scaled trust region
algorithm has been described by MORÉ [1978], and has been implemented in
MINPACK by MORÉ et al. [1980]. Moré considers an iteration of the following form:
Let x_0, D_0, δ_0 and β ∈ (0, 1) be given. For k = 0, 1, 2, ...
Step 1. Compute ||r(x_k)||_2.
Step 2. Determine p_k as a solution to the subproblem

min_p ||r(x_k) + J(x_k)p||_2,
                                  (30.3)
s.t.  ||D_k p||_2 ≤ δ_k,

where D_k is a diagonal scaling matrix.
Step 3. Compute the model prediction of the decrease in ||r(x_k)||_2^2 as

ψ_k(p_k) = ||r(x_k)||_2^2 − ||r(x_k) + J(x_k)p_k||_2^2.

Step 4. Compute the ratio

ρ_k = ( ||r(x_k)||_2^2 − ||r(x_k + p_k)||_2^2 ) / ψ_k(p_k).

If ρ_k > β, then set x_{k+1} = x_k + p_k; otherwise set x_{k+1} = x_k.
Step 5. Update the scaling matrix D_k and δ_k.
The ratio ρ_k measures the agreement between the linear model and the nonlinear
function. An iteration with ρ_k > β is successful, and otherwise the iteration is
unsuccessful. After an unsuccessful iteration δ_k is reduced. MORÉ [1978] chooses the
scaling D_k such that the algorithm is scale invariant, i.e. the algorithm generates the
same iterations if applied to r(Dx) for any nonsingular diagonal matrix D.
Moré proves that if r(x) is continuously differentiable, r'(x) uniformly continuous
and J(x_k) bounded, then the algorithm will converge to a critical point. The
algorithm has also proven to be very successful in practice.
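A stripped-down Levenberg-Marquardt iteration in the spirit of (30.1) can be sketched as follows. The acceptance test here is cruder than Moré's ratio test (a step is accepted whenever the residual decreases), and the model f(x, t) = x_1 + x_2 t, the data, and the factors ½ and 2 for updating μ are illustrative choices, not those of MINPACK.

```python
# Simplified Levenberg-Marquardt: solve (J^T J + mu*I) p = -J^T r for the
# model f(x, t) = x1 + x2*t, whose Jacobian row for observation i is (1, t_i),
# accepting steps that reduce the residual and adapting mu.

def lm_fit(ts, ys, x1=0.0, x2=0.0, mu=1.0):
    for _ in range(50):
        r = [x1 + x2 * t - y for t, y in zip(ts, ys)]
        a11 = len(ts) + mu                    # (J^T J + mu I) entries
        a12 = sum(ts)
        a22 = sum(t * t for t in ts) + mu
        b1 = -sum(r)                          # -J^T r
        b2 = -sum(t * ri for t, ri in zip(ts, r))
        det = a11 * a22 - a12 * a12
        p1 = (a22 * b1 - a12 * b2) / det      # 2x2 solve by Cramer's rule
        p2 = (a11 * b2 - a12 * b1) / det
        rn = [x1 + p1 + (x2 + p2) * t - y for t, y in zip(ts, ys)]
        if sum(ri * ri for ri in rn) < sum(ri * ri for ri in r):
            x1, x2, mu = x1 + p1, x2 + p2, 0.5 * mu   # success: shrink mu
        else:
            mu = 2.0 * mu                              # failure: grow mu
    return x1, x2
```

Fitting data generated from y = 3 + 2t recovers the parameters (3, 2) to high accuracy, since on this (linear) test problem every accepted step contracts the error by a factor μ/(λ_min + μ).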
A trust region implementation of the Levenberg-Marquardt method will give
a Gauss-Newton step close to the solution of a regular problem. Its convergence will
therefore often be slow for large residual or very nonlinear problems. In the next
section we discuss methods using second derivative information. These are
somewhat more robust but also more complex than the Levenberg-Marquardt
methods.

31. Newton-type methods

The analysis in the previous section has shown that for large residual problems and
strongly nonlinear problems, methods of Gauss-Newton type may converge slowly.
Also, these methods can have problems at points where the Jacobian is rank-deficient.
When second derivatives of φ(x) are available, Newton's method can be used to
overcome these problems. This method uses the quadratic model (28.10) of
φ(x) = ½||r(x)||_2^2 at the current approximation x_c. The critical point x_N given by
(28.11) of this quadratic model is chosen as the next approximation.
It can be shown (see DENNIS and SCHNABEL [1983], p. 229) that Newton's method
is locally quadratically convergent as long as

∇^2 φ(x) = J(x)^T J(x) + Q(x),  Q(x) = Σ_{i=1}^m r_i(x) G_i(x),  (31.1)

is Lipschitz continuous around x_c and ∇^2 φ(x*) is positive-definite. To get global
convergence, Newton's method is used with a line search algorithm, x_{k+1} = x_k + α_k p_k,
where the search direction p_k is determined from

(J(x_k)^T J(x_k) + Q(x_k)) p_k = −J(x_k)^T r(x_k).  (31.2)

Note that the matrix J(x_k)^T J(x_k) + Q(x_k) must be positive-definite in order for p_k to be
a guaranteed descent direction. The linear system (31.2) should be solved by
a method which is stable also when J(x_k) is ill-conditioned or rank-deficient, see GILL
and MURRAY [1978].
Newton's method is not often used, since the second-derivative term Q(x_k) is rarely
available at a reasonable cost. However, a number of methods have been suggested
that partly take the second derivatives into account, either explicitly or implicitly.
GILL and MURRAY [1978] suggest regarding J(x_k)^T J(x_k) as a good estimate of the
Hessian in the invariant subspace corresponding to the large singular values of J(x_k).
In the complementary subspace the second derivative term Q(x_k) is taken into
account.

Let the singular value decomposition of J(x_k) be

J(x_k) = U [Σ; 0] V^T,   U ∈ R^{m×m},   V ∈ R^{n×n},

where Σ = diag(σ_1, ..., σ_n) is the matrix of singular values ordered so that
σ_1 ≥ σ_2 ≥ ... ≥ σ_n. Then the equations (31.2) for the Newton direction p_k can be
written

(Σ² + V^T Q_k V) q = -Σ r̄,   (31.3)

where Q_k = Q(x_k), q = V^T p_k and r̄ denotes the first n components of the vector
U^T r(x_k). We now split the singular values into two groups,

Σ = diag(Σ_1, Σ_2),

where Σ_1 = diag(σ_1, ..., σ_r) are the "large" singular values. If we partition V, q and
r̄ conformingly, then the first r equations in (31.3) can be written

(Σ_1² + V_1^T Q_k V_1) q_1 + V_1^T Q_k V_2 q_2 = -Σ_1 r̄_1.

If the terms involving Q_k are neglected compared to Σ_1² q_1, we get q_1 = -Σ_1^{-1} r̄_1. If
this is substituted into the last (n - r) equations, we can solve for q_2 from

(Σ_2² + V_2^T Q_k V_2) q_2 = -Σ_2 r̄_2 - V_2^T Q_k V_1 q_1.

The approximate Newton direction is then given by

p_k = Vq = V_1 q_1 + V_2 q_2.
The split of the singular values is updated at each iteration, and the idea is to
maintain r close to n as long as adequate progress is made. Three subroutines have
been developed based on the above method. In the first, Q_k is used explicitly; in the
second, a finite difference approximation to Q_k V_2 is obtained by differencing the
gradient along the columns of V_2; the third version uses a quasi-Newton
approximation to Q_k, similar to the methods described below.¹
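As a toy illustration of the dominant part of this split, the sketch below computes only the "large singular value" component q_1 = -Σ_1^{-1} r̄_1 and drops the Q_k terms entirely, so it is really just a truncated Gauss-Newton direction; the threshold tau for the split is an arbitrary choice, not taken from any of the codes above:

```python
import numpy as np

def truncated_gn_direction(Jk, rk, tau=1e-2):
    """Direction p = V_1 q_1 with q_1 = -Sigma_1^{-1} (U_1^T r), keeping only
    singular values above tau * sigma_max (the 'large' group Sigma_1) and
    neglecting the second-derivative terms Q_k."""
    U, s, Vt = np.linalg.svd(Jk, full_matrices=False)
    keep = s > tau * s[0]                 # boolean mask selecting Sigma_1
    q1 = -(U[:, keep].T @ rk) / s[keep]   # q_1 = -Sigma_1^{-1} r_bar_1
    return Vt[keep].T @ q1                # p = V_1 q_1

# A nearly rank-deficient 3x2 Jacobian: the tiny singular value is dropped
A = np.array([[1.0, 0.0], [0.0, 1e-8], [1.0, 1e-8]])
r = np.array([1.0, 1.0, 1.0])
p = truncated_gn_direction(A, r)
```

The returned direction lies in the subspace spanned by the dominant right singular vectors, so the nearly null direction of J contributes nothing to the step.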
A straightforward way to obtain second-derivative information would be to use
a general quasi-Newton optimization routine, which successively builds up an
approximation to the second-derivative matrix in (31.1). Many of those are known
to possess superlinear convergence. They compute the search directions from

S_k p_k = -J(x_k)^T r(x_k),

where S_k is a symmetric approximation to the Hessian chosen to satisfy the so-called
quasi-Newton relation

S_k(x_k - x_{k-1}) = y_k,   y_k = J(x_k)^T r(x_k) - J(x_{k-1})^T r(x_{k-1}),   (31.4)

and which differs from S_{k-1} by a matrix of small rank. Usually the choice of starting
value S_0 = J(x_0)^T J(x_0) is recommended. RAMSIN and WEDIN [1977] gave the

¹ These three subroutines are included in the NPL Algorithm Library, which is available from the National
Physical Laboratory, Teddington, England.

following recommendation on the choice between a Gauss-Newton and a quasi-Newton
method, based on the observed rate of convergence ρ (29.7) for the
Gauss-Newton method:
(i) for ρ < 0.5 Gauss-Newton is better;
(ii) for globally simple problems quasi-Newton is better for ρ > 0.5;
(iii) for globally difficult problems Gauss-Newton is much faster for ρ < 0.7, but
for larger values of ρ quasi-Newton is safer.
This could be the basis for a hybrid method with automatic switching between the
two methods.
The straightforward application of quasi-Newton methods to the nonlinear least
squares problem outlined above has not been very efficient in practice. One reason is
that these methods disregard the information in J(x_k), and often J(x_k)^T J(x_k) is the
dominant part of ∇²φ(x_k). A successful approach has been taken by DENNIS, GAY
and WELSCH [1981], who approximate ∇²φ(x_k) by J(x_k)^T J(x_k) + B_k, where B_k is
a quasi-Newton approximation of the term Q(x_k), and B_0 = 0. The quasi-Newton
relation now becomes

B_k(x_k - x_{k-1}) = z_k,   z_k = J(x_k)^T r(x_k) - J(x_{k-1})^T r(x_k),   (31.5)


where B_k is required to be symmetric. It can be shown (cf. DENNIS and SCHNABEL
[1983, pp. 231-232]) that a solution to (31.5) which minimizes the change from B_{k-1}
is given by the update formula

B_k = B_{k-1} + [(z_k - B_{k-1}s_k) y_k^T + y_k (z_k - B_{k-1}s_k)^T] / (y_k^T s_k)
      - [s_k^T (z_k - B_{k-1}s_k)] y_k y_k^T / (y_k^T s_k)²,   (31.6)

where s_k = x_k - x_{k-1}. This update was proposed by DENNIS, GAY and WELSCH
[1981] and is used in the subroutine NL2SOL. This code, which is available through
IMSL, has several interesting features. It maintains the approximation B_k and
adaptively decides whether to use it, i.e. it switches between a Gauss-Newton and
a quasi-Newton method. For both methods a trust region strategy is used to achieve
global convergence. In each iteration NL2SOL computes the reduction predicted by
both quadratic models and compares it with the actual reduction φ(x_{k-1}) - φ(x_k). For
the next step the model is used whose predicted reduction best approximated the
actual reduction. Usually this causes NL2SOL to initially use Gauss-Newton steps
until the information in B_k becomes useful.
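The symmetric rank-two update (31.6) can be transcribed directly; the function below is a plain sketch (not the NL2SOL implementation), and by construction its result is symmetric and satisfies the quasi-Newton relation (31.5), B_k s_k = z_k, which the example checks:

```python
import numpy as np

def dgw_update(B_prev, s, y, z):
    """Symmetric rank-two update of the second-order term B_k in the
    style of (31.6); the returned matrix satisfies B_new @ s == z."""
    w = z - B_prev @ s                    # residual in the secant relation
    ys = y @ s                            # curvature-like scalar y^T s
    return (B_prev
            + (np.outer(w, y) + np.outer(y, w)) / ys
            - (s @ w) * np.outer(y, y) / ys ** 2)

B0 = np.eye(2)                            # arbitrary symmetric start
s = np.array([1.0, 0.0])                  # s_k = x_k - x_{k-1}
y = np.array([1.0, 1.0])                  # y_k as in (31.4)
z = np.array([2.0, 1.0])                  # z_k as in (31.5)
B1 = dgw_update(B0, s, y, z)
```

Expanding B1 @ s term by term shows the residual w is exactly restored, which is why the secant relation holds after the update.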
A different way to obtain second-derivative information has been developed by
RUHE [1979], who uses a nonlinear conjugate gradient acceleration of the
Gauss-Newton method with exact line searches. This method achieves quadratic
convergence and thus gives much faster convergence than the Gauss-Newton
method on difficult problems. When exact line search is used, the conjugate
gradient acceleration adds only a negligible amount of extra work. However, for
small residual problems exact line search is a waste of time and then a simpler
damped Gauss-Newton method is superior.
Extending the quasi-Newton method to large sparse problems has proved to be
difficult. A promising approach has been suggested by TOINT [1986] for certain
types of large, "partially separable" nonlinear least squares problems. A typical case

included is when every function r_i(x) depends on only a small subset of the n variables.
Then the Jacobian J(x) and the element Hessian matrices G_i(x) will be
sparse, and it may even be feasible to store approximations to all G_i(x), i = 1, ..., m.

32. Methods for separable problems

A nonlinear least squares problem is said to be separable if the parameter vector
x can be partitioned as

x = (y, z),   y ∈ R^p,   z ∈ R^q,   p + q = n,

with the subproblem


min tIr(y, z) 112 (32.1)
y

easy to solve. In the following we restrict ourselves to the particular case when r(y, z)
is linear in y, i.e.

r(y, z) = F(z)y - g(z),   F(z) ∈ R^{m×p}.   (32.2)

Then the minimum norm solution to (32.1) is

y(z) = F⁺(z)g(z),

where F⁺(z) is the pseudoinverse of F(z). The original problem can be written

min_z ||g(z) - F(z)y(z)||_2 = min_z ||(I - P_F(z))g(z)||_2,   (32.3)

where P_F(z) = F(z)F⁺(z) is the orthogonal projector onto the range of F(z).
Algorithms based on (32.3) are often called variable projection algorithms.
Many practical nonlinear least squares problems are separable in this way.
A particularly simple case is when r(y, z) is linear in both y and z, so that we also have

r(y, z) = G(y)z - h(y),   G(y) ∈ R^{m×q}.   (32.4)

EXAMPLE 32.1. Consider the exponential fitting problem

min_{y,z} Σ_{i=1}^m (y_1 e^{z_1 t_i} + y_2 e^{z_2 t_i} - g_i)².

Here the model is nonlinear only in the parameters z_1 and z_2. Given values of z_1 and
z_2, the subproblem (32.1) is easily solved.
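Example 32.1 can be sketched numerically in the variable projection form (32.3): an inner least squares solve eliminates the linear parameters y, and a damped Gauss-Newton iteration with a finite-difference Jacobian (a crude stand-in for the analytic Golub-Pereyra derivative of the projected residual) runs on z alone. The data, starting values, and step constants are invented for the illustration:

```python
import numpy as np

# Exact data from y1*exp(z1*t) + y2*exp(z2*t) with (y1, y2) = (2, 0.5),
# (z1, z2) = (-1, -3); all values here are made up for the illustration.
t = np.linspace(0.0, 1.0, 20)
g = 2.0 * np.exp(-1.0 * t) + 0.5 * np.exp(-3.0 * t)

def F(z):
    # Columns of the model that is linear in y for fixed z, cf. (32.2)
    return np.column_stack([np.exp(z[0] * t), np.exp(z[1] * t)])

def vp_residual(z):
    # (I - P_F(z)) g: residual of (32.3) after eliminating y by least squares
    Fz = F(z)
    y, *_ = np.linalg.lstsq(Fz, g, rcond=None)
    return Fz @ y - g

z = np.array([-0.5, -2.0])                 # starting guess for z only
for _ in range(50):
    r0 = vp_residual(z)
    Jfd = np.column_stack([(vp_residual(z + 1e-6 * e) - r0) / 1e-6
                           for e in np.eye(2)])
    dz, *_ = np.linalg.lstsq(Jfd, -r0, rcond=None)
    a = 1.0                                # step halving for safety
    while np.sum(vp_residual(z + a * dz) ** 2) > np.sum(r0 ** 2) and a > 1e-8:
        a *= 0.5
    z = z + a * dz
z_hat = np.sort(z)
```

Note that no starting values for y_1, y_2 were needed, one of the practical advantages of the variable projection formulation discussed below.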

Special purpose algorithms for separable nonlinear least squares problems were
first considered by SCOLNIK [1972]. A variable projection algorithm using a Gauss-Newton
method applied to problem (32.3) was developed by GOLUB and PEREYRA
[1973]. KAUFMAN [1975] proposed a simplification of this algorithm, which is
computationally more efficient.

The Kaufman algorithm consists of two steps merged into one iteration. Let x_k = (y_k, z_k)^T be
the current approximation.

Step 1. Compute δy_k that solves the linear subproblem

min_{δy} ||F(z_k)δy - (g(z_k) - F(z_k)y_k)||_2.   (32.5)

Put x_{k+1/2} = (y_{k+1/2}, z_k),   y_{k+1/2} = y_k + δy_k.


Step 2. Compute p_k as the Gauss-Newton direction at x_{k+1/2}, i.e. p_k solves the
problem

min_{p_k} ||C(x_{k+1/2})p_k + r(y_{k+1/2}, z_k)||_2,   (32.6)

where the Jacobian is

C(x_{k+1/2}) = (F(z_k), r_z(y_{k+1/2}, z_k)).   (32.7)

Take x_{k+1} = x_{k+1/2} + p_k and go to Step 1.

In (32.7) we have used that by (32.2) r_y(y_{k+1/2}, z_k) = F(z_k). Further we have

r_z(y_{k+1/2}, z_k) = B(z_k)y_{k+1/2} - g'(z_k),

where

B(z)y = (∂F(z)/∂z_1 · y | ⋯ | ∂F(z)/∂z_q · y) ∈ R^{m×q}.

Note that in case r(y, z) is linear also in y, it follows from (32.4) that C(x_{k+1/2}) =
(F(z_k), G(y_{k+1/2})).
RUHE and WEDIN [1980] have given a general analysis of different algorithms for
separable problems. They show that the Gauss-Newton algorithm applied to (32.3)
and the original problem give the same asymptotic convergence rate. In particular
both converge quadratically for a zero-residual problem. This is in contrast to the
naive algorithm of alternately minimizing ||r(y, z)||_2 over y and z, which always
converges linearly. They also prove that the simplified algorithm of Kaufman has
roughly the same asymptotic convergence rate as the one proposed by Golub and
Pereyra.
To be robust the algorithms for separable problems must employ similar
stabilizing techniques for the Gauss-Newton steps as described in Sections 29 and
30. It is fairly straightforward to implement these for the Kaufman method.
We have mentioned the negative theoretical result that the special algorithms for
separable problems have the same local rate of convergence as ordinary Gauss-Newton.
One advantage is that no starting values for the linear parameters have to
be provided. In the Kaufman algorithm, e.g., we can take y_0 = 0 and determine
y_1 = δy_0 in the first step from (32.5). This seems to make a difference in the first steps of
the iterations. KROGH [1974] reports that the variable projection algorithm solved
several problems which methods not using separability could not solve.
GOLUB and LEVEQUE [1979] have extended the variable projection method for

solving problems in which it is desired to fit more than one data vector with the same
nonlinear parameter vector, though with different linear parameters for each
right-hand side. KAUFMAN and PEREYRA [1978] have extended the Golub-Pereyra
method to problems with separable nonlinear constraints. The Kaufman method
seems even easier to generalize to constrained problems.

33. Constrained problems

In a more general setting the solution to nonlinear least squares problems may be
subject to constraints. In case of nonlinear equality constraints the problem can be
stated as
min_x ||r(x)||_2,   (33.1)
s.t. h(x) = 0,

where r(x) ∈ R^m, h(x) ∈ R^p and x ∈ R^n.
The Gauss-Newton method can be generalized to problem (33.1) by considering
a linear model at a point Xk. A search direction Pk is computed as a solution to the
linear constrained problem
min_p ||r(x_k) + J(x_k)p||_2,   (33.2)
s.t. h(x_k) + C(x_k)p = 0,
where J and C are the Jacobian matrices for r(x) and h(x) respectively. This problem
can be solved by the methods described in Section 25. The search direction p_k
obtained from (33.2) can be shown to be a descent direction for the merit function

ψ_μ(x) = ||r(x)||_2² + μ||h(x)||_1

at the point x_k, provided that the penalty parameter μ is large enough. This makes it possible to stabilize the
Gauss-Newton method with a line search strategy or a trust region technique (cf.
Sections 29 and 30). With a suitable active set strategy such an algorithm can be
extended to handle also problems with nonlinear inequality constraints. An
algorithm based on this approach has been developed by LINDSTROM [1983].
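One way to compute the search direction of (33.2) is the nullspace method: a QR factorization of C(x_k)^T splits p into a particular part satisfying the linearized constraints and a free part chosen by least squares. The sketch below is an illustration of that idea, not LINDSTROM's code; the worked case values are invented:

```python
import numpy as np

def constrained_gn_step(J, r, C, h):
    """Solve min ||r + J p|| s.t. h + C p = 0 by the nullspace method."""
    p_ = C.shape[0]                              # number of constraints
    Q, R = np.linalg.qr(C.T, mode='complete')    # C^T = Q [R1; 0]
    Q1, Q2 = Q[:, :p_], Q[:, p_:]                # range / nullspace bases
    u = np.linalg.solve(R[:p_, :].T, -h)         # C (Q1 u) = -h
    v, *_ = np.linalg.lstsq(J @ Q2, -(r + J @ (Q1 @ u)), rcond=None)
    return Q1 @ u + Q2 @ v                       # constraints hold for any v

# Tiny worked case: J = I, r = (1,1,1)^T, one constraint 2 + p1 = 0
J = np.eye(3)
r = np.ones(3)
C = np.array([[1.0, 0.0, 0.0]])
h = np.array([2.0])
p = constrained_gn_step(J, r, C, h)
```

In the worked case the constraint forces p1 = -2, and the remaining components minimize the residual freely, giving p = (-2, -1, -1)^T.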
There are some algorithms specialized to solve the nonlinear least squares
problem subject to linear constraints. In HOLT and FLETCHER [1979] the unknowns
can be constrained by lower and upper bounds. LINDSTROM [1984] describes two
easy-to-use routines ENLSIP and ELSUNC for solving the general constrained or
the simple bound case. These algorithms are based on the Gauss-Newton method
with a specialized line search, see LINDSTROM and WEDIN [1984]. Far from the
solution the algorithm can be stabilized by a certain subspace minimization. Close
to the solution the algorithm switches to a second-order method (Newton's method
in the unconstrained case) when the Gauss-Newton method converges slowly. The
trust region approach for unconstrained problems is generalized to handle linear
inequality constraints by GAY [1984] and WRIGHT and HOLT [1985]. Popular

general nonlinear optimization algorithms have also been used to solve nonlinear
least squares problems with nonlinear inequality constraints (see SCHITTKOWSKI
[1985] and MAHDAVI-AMIRI [1981]).
We mention that implicit curve fitting problems, where a model h(y, x, t) =0 is to
be fitted to observations (yi, ti), i= 1,..., m, can be formulated as a special least
squares problem with nonlinear constraints:
min_{x,z} Σ_{i=1}^m (y_i - z_i)²,

s.t. h(z_i, x, t_i) = 0,   i = 1, ..., m.


This problem is a special case of (33.1). It has n+m unknowns x and z, but the
Jacobian matrices are sparse, which may be taken advantage of (see LINDSTROM
[1984]).
Another problem which may be formulated as a constrained nonlinear least
squares problem is the problem of orthogonal regression. We consider this problem
in more detail in Section 34.

34. Orthogonal distance regression

In Section 28 it was mentioned that nonlinear least squares problems often arise
from the fitting of observations (yi, ti), i= 1,..., m, to a mathematical model
y = f(x, t).   (34.1)
In the classical regression model the measurements ti of the independent variable are
assumed to be exact and only the observations yi are subject to random errors. In
this section we consider the more general situation, when also the measurements of
the independent variable t contain errors.
Assume that y_i and t_i are subject to errors ε_i and δ_i respectively, so that

y_i + ε_i = f(x, t_i + δ_i),   i = 1, ..., m.   (34.2)

If the errors ε_i and δ_i are independent random variables with zero mean and variance
σ², then it seems reasonable to choose the parameters x so that the sum of squares of
the orthogonal distances r_i from the observations (y_i, t_i) to the curve (34.1) is
minimized, cf. Fig. 34.1. We have that r_i = (ε_i² + δ_i²)^{1/2}, where ε_i and δ_i solve

min_{ε_i,δ_i} (ε_i² + δ_i²),
s.t. y_i + ε_i = f(x, t_i + δ_i).

Hence the parameters x should be chosen as the solution to

min_{x,ε,δ} Σ_{i=1}^m (ε_i² + δ_i²),

s.t. y_i + ε_i = f(x, t_i + δ_i),   i = 1, ..., m.


This is a constrained least squares problem of special form. Eliminating ε_i using the

FIG. 34.1.

constraints we arrive at the orthogonal distance problem

min_{x,δ} Σ_{i=1}^m [(f(x, t_i + δ_i) - y_i)² + δ_i²].   (34.3)

Note that (34.3) is a nonlinear least squares problem even if f(x, t) is linear in x.
So far we have implicitly assumed that y and t are scalar variables. More generally
y_i ∈ R^{n_y} and t_i ∈ R^{n_t}, and then we have the problem

min_{x,δ} Σ_{i=1}^m [||f(x, t_i + δ_i) - y_i||_2² + ||δ_i||_2²].   (34.4)

Finally, if δ_i and ε_i do not have constant covariance matrices, then weighted norms
should be substituted above.
Independently of statistical considerations, the orthogonal distance problem has
natural applications in curve fitting.

EXAMPLE 34.1. Consider the problem of fitting a half circle y = b + (r² - (t - a)²)^{1/2} to
a given set of points (y_i, t_i), i = 1, 2, ..., m (see Fig. 34.2).

FIG. 34.2.

It is obvious that minimizing

squares of either horizontal or vertical distances to the circle will normally not be
satisfactory.
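For a full circle (rather than the half circle of the example) the orthogonal distance from a point to the curve has the closed form ((t_i - a)² + (y_i - b)²)^{1/2} - r, so the orthogonal distance problem can be attacked by ordinary Gauss-Newton without introducing the δ_i variables. The sketch below, with invented data and an undamped iteration (adequate for this benign zero-residual case), illustrates the idea:

```python
import numpy as np

def fit_circle(t, y, p0, iters=50):
    """Orthogonal-distance circle fit by undamped Gauss-Newton on (a, b, r)."""
    a, b, r = p0
    for _ in range(iters):
        dt, dy = t - a, y - b
        rho = np.hypot(dt, dy)            # distance of each point to center
        res = rho - r                     # signed orthogonal distance
        J = np.column_stack([-dt / rho, -dy / rho, -np.ones_like(rho)])
        step, *_ = np.linalg.lstsq(J, -res, rcond=None)
        a, b, r = a + step[0], b + step[1], r + step[2]
    return a, b, r

# Invented data: exact points on the circle with center (1, 2), radius 3
ang = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
t_pts = 1.0 + 3.0 * np.cos(ang)
y_pts = 2.0 + 3.0 * np.sin(ang)
a, b, r = fit_circle(t_pts, y_pts, (0.0, 0.0, 1.0))
```

The orthogonal residual treats the two coordinates symmetrically, which is exactly what the horizontal-only or vertical-only formulations fail to do.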
The general orthogonal distance problem has not received the same attention as
the standard nonlinear regression problem, except for the case when f is linear in x.
One reason is that if the errors in the independent variables are small, then ignoring
these errors will not seriously degrade the estimates of x. For the special case when

y = x^T t,   y ∈ R,   x, t ∈ R^n,

the orthogonal distance problem is a total least squares problem, and an algorithm
based on the singular value decomposition has been described in Section 24.
However, recently algorithms for the nonlinear case, based on stabilized Gauss-Newton
methods, have been given by SCHWETLICK and TILLER [1986] and BOGGS,
BYRD and SCHNABEL [1987].
Problem (34.3) has (m + n) unknowns x and δ. In applications usually m ≫ n, and
accounting for the errors in t will considerably increase the size of the problem.
Therefore the use of standard software for nonlinear least squares to solve
orthogonal distance problems is not efficient or feasible. This is even more
accentuated for (34.4), which has m·n_t + n variables. We now show how the special
structure of (34.3) can be taken into account to reduce the work. Similar comments
apply to the general case (34.4).
If we define the residual vector r(δ, x) = (r_1(δ, x), r_2(δ))^T by

r_1(δ, x)_i = f(x, t_i + δ_i) - y_i,   r_2(δ)_i = δ_i,   i = 1, ..., m,

then (34.3) is the standard nonlinear least squares problem min_{x,δ} ||r(δ, x)||_2.
The Jacobian matrix corresponding to this problem can be written in block form
as

J = [ D_1   Ĵ ]
    [  I    0 ] ∈ R^{2m×(m+n)},   (34.5)

where

D_1 = diag(d_1, ..., d_m),   d_i = (∂f/∂t)|_{t = t_i + δ_i},   Ĵ_ij = ∂f(x, t_i + δ_i)/∂x_j.

Note that J is sparse and highly structured. In the Gauss-Newton method we
compute corrections Δδ_k and Δx_k to the current approximations which solve the
linear least squares problem

min_{Δδ,Δx} || J [Δδ; Δx] + [r_1; r_2] ||_2,   (34.6)

where J, r_1 and r_2 are evaluated at the current estimates of δ and x. To solve this
problem we need the QR decomposition of J. This can be computed in two steps.
First we apply a sequence of Givens rotations Q_1 = G_m ⋯ G_2 G_1, where G_i rotates
rows i and m + i, i = 1, 2, ..., m, to zero the (2,1)-block of J:

Q_1 J = [ D_2   K ]        Q_1 [ -r_1 ]   [ s_1 ]
        [  0    L ],           [ -r_2 ] = [ s_2 ],

where D_2 is again a diagonal matrix. The problem (34.6) now decouples, and Δx_k is
determined as the solution to

min_{Δx} || L Δx - s_2 ||_2.

Here L ∈ R^{m×n}, so this is a problem of the same size as the one which defines the
Gauss-Newton correction in the classical nonlinear least squares problem. We then
have

Δδ_k = D_2^{-1}(s_1 - K Δx_k).
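A sketch of this structured elimination for scalar t, in the notation of (34.5)-(34.6): the Givens rotations are applied analytically per row pair, and the result agrees with solving the full 2m × (m + n) least squares problem directly (which the example verifies). The small data set is invented:

```python
import numpy as np

def odr_gn_step(d, Jhat, r1, r2):
    """One Gauss-Newton correction for (34.6), using the structure of
    (34.5): a Givens rotation per row pair (i, m+i) zeros the identity
    block against D_1, after which dx comes from a small m x n problem
    and d_delta is recovered componentwise."""
    rho = np.hypot(d, 1.0)                # rotated diagonal D_2
    c, s = d / rho, 1.0 / rho             # Givens cosines and sines
    K = c[:, None] * Jhat                 # rotated (1,2) block
    L = -s[:, None] * Jhat                # rotated (2,2) block
    b1 = c * (-r1) + s * (-r2)            # rotated right-hand side s_1
    b2 = -s * (-r1) + c * (-r2)           # rotated right-hand side s_2
    dx, *_ = np.linalg.lstsq(L, b2, rcond=None)
    ddelta = (b1 - K @ dx) / rho
    return ddelta, dx

# Invented small example (m = 4, n = 2, scalar t)
d = np.array([1.0, 2.0, 0.5, 1.5])
Jhat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
r1 = np.array([0.3, -0.2, 0.5, 0.1])
r2 = np.array([0.1, 0.0, -0.3, 0.2])
ddelta, dx = odr_gn_step(d, Jhat, r1, r2)
```

Only an m × n least squares problem is ever factored, so the cost per iteration matches that of classical nonlinear least squares despite the m extra unknowns.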
To stabilize this Gauss-Newton method we can use the techniques described in
Sections 29 and 30. SCHWETLICK and TILLER [1985] use a partial Marquardt-type
regularization where only the Δx part of J is regularized. The algorithm by BOGGS,
BYRD and SCHNABEL [1987] incorporates a full trust region strategy.
References

ABDELMALEK, N.N. (1971), Roundoff error analysis for Gram-Schmidt method and solution of linear
least squares problems, BIT 11, 345-368.
AL-BAALI, M. and R. FLETCHER (1986), An efficient line search for nonlinear least-squares, J. Optim.
Theory Appl. 48, 359-377.
ANDERSON, N. and I. KARASALO (1975), On computing bounds for the least singular value of a triangular
matrix, BIT 15, 1-4.
ASHKENAZI, V. (1971), Geodetic normal equations, in: J.K. REID, ed., Large Sets of Linear Equations
(Academic Press, New York) 57-74.
AVILA, J.K. and J.A. TOMLIN (1979), Solution of very large least squares problems by nested dissection on
a parallel processor, in: Proceedings Computer Science and Statistics: Twelfth Annual Symposium on the
Interface, Waterloo, Ont.
BARD, Y. (1974), Nonlinear Parameter Estimation (Academic Press, New York).
BAREISS, E.H. (1983), Numerical solution of the weighted linear least squares problems by G-transfor-
mations, SIAM J. Algebraic Discrete Methods.
BARLOW, J.L. (1985), Stability analysis of the G-algorithm and a note on its application to sparse least
squares problems, BIT 25, 507-520.
BARRODALE, I. and C. PHILLIPS (1975), Algorithm 495: Solution of an overdetermined system of linear
equations in the Chebychev norm, ACM Trans. Math. Software 1, 264-270.
BARRODALE, I. and F.D.K. ROBERTS (1973), An improved algorithm for discrete L₁ linear approximation,
SIAM J. Numer. Anal. 10, 839-848.
BARTELS, R.H., A.R. CONN and C. CHARALAMBOUS (1978), On Cline's direct method for solving
overdetermined linear systems in the L∞ sense, SIAM J. Numer. Anal. 15, 255-270.
BARTELS, R.H., A.R. CONN and J.W. SINCLAIR (1978), Minimization techniques for piecewise differentiable
functions: The L₁ solution to an overdetermined linear system, SIAM J. Numer. Anal. 15, 224-241.
BAUER, F.L. (1965), Elimination with weighted row combinations for solving linear equations and least
squares problems, Numer. Math 7, 338-352.
BERMAN, A. and R.J. PLEMMONS (1974), Cones and iterative methods for best least squares solutions of
linear systems, SIAM J. Numer. Anal. 11, 145-154.
BJORCK, A. (1967), Solving linear least squares problems by Gram-Schmidt orthogonalization, BIT 7,
1-21.
BJORCK, A. (1967), Iterative refinement of linear least squares solutions. I, BIT 7, 257-278.
BJORCK, A. (1968), Iterative refinement of linear least squares solutions. II, BIT 8, 8-30.
BJORCK, A. (1976), Methods for sparse least squares problems, in: J.R. BUNCH and D.J. ROSE, eds., Sparse
Matrix Computations (Academic Press, New York) 177-199.
BJORCK, A. (1978), Comment on the iterative refinement of least squares solutions, J. Amer. Statist.Assoc.
73, 161-166.
BJORCK, A. (1979), SSOR preconditioning methods for sparse least squares problems, in: Proceedings
Computer Science and Statistics: Twelfth Annual Symposium on the Interface, Waterloo, Ont.
BJORCK, A. (1984), A general updating algorithm for constrained linear least squares problems, SIAM J.
Sci. Statist. Comput. 5, 394-402.
BJORCK, A. (1987), Stability analysis of the method of semi-normal equations for least squares problems,
Linear Algebra Appl. 88/89, 31-48.
BJORCK, A. (1988), A bidiagonalization algorithm for solving ill-posed systems of linear equations,
BIT 28, 659-670.


BJORCK, A. and I.S. DUFF (1980), A direct method for the solution of sparse linear least squares problems,
Linear Algebra Appl. 34, 43-67.
BJORCK, A. and L. ELDEN (1979), Methods in numerical algebra for ill-posed problems, Rept.
LiTH-MAT-R-33-79, Linköping University, Linköping, Sweden.
BJORCK, A. and T. ELFVING (1979), Accelerated projection methods for computing pseudoinverse
solutions of systems of linear equations, BIT 19, 145-163.
BJORCK, A. and G.H. GOLUB (1967), Iterative refinement of linear least square solutions by Householder
transformation, BIT 7, 322-337.
BJORCK, A. and G.H. GOLUB (1973), Numerical methods for computing angles between linear subspaces,
Math. Comp. 27, 579-594.
BJORCK, A., R.J. PLEMMONS and H. SCHNEIDER (1981), Large-Scale Matrix Problems (North-Holland,
New York).
BOGGS, P.T., R.H. BYRD and R.B. SCHNABEL (1987), A stable and efficient algorithm for nonlinear
orthogonal regression, SIAM J. Sci. Statist. Comput. 8, 1052-1078.
BOJANCZYK, A. and R.P. BRENT (1986), Parallel solution of certain Toeplitz least squares problems,
Linear Algebra Appl. 77, 43-60.
BOJANCZYK, A., R.P. BRENT, P. VAN DOOREN and F. DE HOOG (1987), A note on downdating the Cholesky
factorization, SIAM J. Sci. Statist. Comput. 8, 210-221.
BOLSTAD, J.H. et al. (1978), Numerical analysis program library user's guide, User Note 82, SLAC
Computer Services, Stanford Linear Accelerator Center, Menlo Park, CA.
BUNCH, J.R. and L. KAUFMAN (1977), Some stable methods for calculating inertia and solving symmetric
linear systems, Math. Comp. 31, 162-179.
BUNCH, J.R., L. KAUFMAN and B.N. PARLETT (1976), Decomposition of a symmetric matrix, Numer.
Math. 27, 95-109.
BUNCH, J.R. and C.P. NIELSEN (1978), Updating the singular value decomposition, Numer. Math. 31,
111-129.
BUNCH, J.R. and B.N. PARLETT (1971), Direct methods for solving symmetric indefinite systems of linear
equations, SIAM J. Numer. Anal. 8, 639-655.
BUNCH, J.R. and D.J. ROSE, eds. (1976), Sparse Matrix Computations (Academic Press, New
York).
BUSINGER, P. and G.H. GOLUB (1965), Linear least squares solutions by Householder transformations,
Numer. Math. 7, 269-276.
BUSINGER, P.A. and G.H. GOLUB (1969), Algorithm 358: Singular value decomposition of a complex
matrix, Comm. ACM 12, 564-565.
CHAN, T.F. (1982a), An improved algorithm for computing the singular value decomposition, ACM
Trans. Math. Software 8, 72-83.
CHAN, T.F. (1982b), Algorithm 581: An improved algorithm for computing the singular value
decomposition, ACM Trans. Math. Software 8, 84-88.
CHAN, T.F. (1987), Rank revealing QR-factorizations, Linear Algebra Appl. 88/89, 67-82.
CHEN, Y.T. (1975), Iterative methods for linear least squares problems, Rept. CS-75-04, University of
Waterloo, Ont.
CHEN, Y.T. and R.P. TEWARSON (1972), On the fill-in when sparse vectors are orthonormalized,
Computing 9, 53-56.
CLINE, A.K. (1973), An elimination method for the solution of linear least squares problems, SIAM J.
Numer. Anal. 10, 283-289.
CLINE, A.K. (1975), The transformation of a quadratic programming problem into solvable form, ICASE
Rept. 75-14, NASA, Langley Research Center, Hampton, VA.
CLINE, A.K., A.R. CONN and C. VAN LOAN (1982), Generalizing the LINPACK condition estimator, in:
J.P. HENNART, ed., Numerical Analysis, Lecture Notes in Mathematics 909 (Springer, New York).
CLINE, A.K., C.B. MOLER, G.W. STEWART and J.H. WILKINSON (1979), An estimate for the condition
number of a matrix, SIAM J. Numer. Anal. 16, 368-375.
CLINE, R.E. and R.J. PLEMMONS (1976), l₂-solutions to underdetermined linear systems, SIAM Rev. 18,
92-106.

COLEMAN, T.F., A. EDENBRANDT and J.R. GILBERT (1986), Predicting fill for sparse orthogonal
factorization, J. ACM 33, 517-532.
COX, M.G. (1981), The least squares solution of overdetermined linear equations having band or
augmented band structure, IMA J. Numer. Anal. 1, 3-22.
CRANE, R.L., B.S. GARBOW, K.E. HILLSTROM and M. MINKOFF (1980), LCLSQ: An implementation of an
algorithm for linearly constrained linear least squares problems, Rept. ANL-80-116, Argonne National
Laboratory, Argonne, IL.
CRAVEN, P. and G. WAHBA (1979), Smoothing noisy data with spline functions, Numer. Math. 31,
377-403.
CRYER, C. (1971), The solution of a quadratic programming problem using systematic overrelaxation,
SIAM J. Control 9, 385-392.
CUTHILL, E. (1972), Several strategies for reducing the bandwidth of matrices, in: D.J. ROSE and R.A.
WILLOUGHBY, eds., Sparse Matrices and Their Applications (Plenum Press, New York).
DANIEL, J., W.B. GRAGG, L. KAUFMAN and G.W. STEWART (1976), Reorthogonalization and stable
algorithms for updating the Gram-Schmidt QR factorization, Math. Comp. 30, 772-795.
DELVES, L.M. and I. BARRODALE (1979), A fast direct method for the least squares solution of slightly
overdetermined sets of linear equations, J. Inst. Math. Appl. 24, 149-156.
DEMMEL, J.W. (1987), The smallest perturbation of a submatrix which lowers the rank and constrained
total least squares, SIAM J. Numer. Anal. 24, 199-206.
DENNIS, J.E. (1977), Nonlinear least squares and equations, in: D.A.H. JACOBS, ed., The State of the Art in
Numerical Analysis (Academic Press, New York) 269-312.
DENNIS Jr, J.E., D.M. GAY and R.E. WELSCH (1981), An adaptive nonlinear least-squares algorithm,
ACM Trans. Math. Software 7, 348-368.
DENNIS, Jr, J.E. and R.B. SCHNABEL (1983), Numerical Methods for Unconstrained Optimization and
Nonlinear Equations (Prentice-Hall, Englewood Cliffs, NJ).
DENNIS, J.E. and T. STEIHAUG (1986), On the successive projection approach to least squares problems,
SIAM J. Numer. Anal. 23, 717-733.
DEUFLHARD, P. and V. APOSTOLESCU (1978), An underrelaxed Gauss-Newton method for equality
constrained nonlinear least squares, in: J. STOER, ed., Proceedings8th IFIP Conference on Optimization
Techniques, Lecture Notes in Control and Information Science 7 (Springer, Berlin) 22-32.
DEUFLHARD, P. and V. APOSTOLESCU (1980), A study of the Gauss-Newton algorithm for the solution of
nonlinear least squares problems, in: J. FREHSE, D. PALLASCHKE and U. TROTTENBERG, eds, Special
Topics of Applied Mathematics (North-Holland, Amsterdam).
DONGARRA, J., J.R. BUNCH, C.B. MOLER and G.W. STEWART (1979), LINPACK Users Guide (SIAM,
Philadelphia, PA).
DRAPER, N.R. and H. SMITH (1981), Applied Regression Analysis (Wiley, New York, 2nd ed.).
DUFF, I.S. (1974), Pivot selection and row orderings in Givens reduction on sparse matrices, Computing
13, 239-248.
DUFF, I.S. and J.K. REID (1976), A comparison of some methods for the solution of sparse overdetermined
systems of linear equations, J. Inst. Math. Appl. 17, 267-280.
DUFF, I.S. and J.K. REID (1982), MA27-A set of Fortran subroutines for solving sparse symmetric sets of
linear equations, Rept. R. 10533, AERE, Harwell, England.
DUFF, I.S. and J.K. REID (1983), The multifrontal solution of indefinite sparse symmetric linear systems,
ACM Trans. Math. Software 9, 302-325.
DUFF, I.S. and G.W. STEWART, eds. (1979), Sparse Matrix Proceedings, 1978 (SIAM, Philadelphia, PA).
DWYER, P.S. (1945), The square root method and its use in correlation and regression, J. Amer. Statist.
Assoc. 40, 493-503.
ECKART, C. and G. YOUNG (1936), The approximation of one matrix by another of lower rank,
Psychometrika 1, 211-218.
EICHHORN, E.L. and C.L. LAWSON (1975), An ALGOL procedure for solution of constrained least squares
problems, Computing Memorandum No. 374, JPL, Pasadena, CA.
EISENSTAT, S.C., M.H. SCHULTZ and A.H. SHERMAN (1981), Algorithms and data structures for sparse
symmetric Gaussian elimination, SIAM J. Sci. Statist. Comput. 2, 225-237.
EKBLOM, H. (1973), Calculation of linear best Lₚ-approximations, BIT 13, 292-300.

ELDEN, L. (1977), Algorithms for the regularization of ill-conditioned least squares problems, BIT 17,
134-145.
ELDEN, L. (1980), Perturbation theory for the least squares problem with linear equality constraints,
SIAM J. Numer. Anal. 17, 338-350.
ELDEN, L. (1984a), An efficient algorithm for the regularization of ill-conditioned least squares problems
with a triangular Toeplitz matrix, SIAM J. Sci. Statist. Comput. 5, 229-236.
ELDEN, L. (1984b), An algorithm for the regularization of ill-conditioned banded least squares problems,
SIAM J. Sci. Statist. Comput. 5, 237-254.
ELDEN, L. (1984c), A note on the computation of the generalized cross-validation function for
ill-conditioned least squares problems, BIT 24, 467-472.
ELFVING, T. (1978), Some numerical results obtained with two gradient methods for solving the linear
least squares problem, Rept. LiTH-MAT-R-75-5, Department of Mathematics, Linköping University,
Sweden.
ELFVING, T. (1980), Block-iterative methods for consistent and inconsistent linear equations, Numer.
Math. 35, 1-12.
ERISMAN, A.M. and W.F. TINNEY (1975), On computing certain elements of the inverse of a sparse matrix,
Comm. ACM 18, 177-179.
FADDEEV, D.K., V.N. KUBLANOVSKAYA and V.N. FADDEEVA (1968), Solution of linear algebraic systems
with rectangular matrices, Proc. Steklov Inst. Math. 96.
FAREBROTHER, R.W. (1985), The statistical estimation of the standard linear model, 1756-1853, in:
Proceedings First Tampere Seminar on Linear Models (1983), 77-99.
FLETCHER, R. (1980), Practical Methods of Optimization, 1: Unconstrained Optimization (Wiley, New
York).
FLETCHER, R. (1981), Practical Methods of Optimization, 2: Constrained Optimization (Wiley, New
York).
FORSYTHE, G.E., M.A. MALCOLM and C. MOLER (1977), Computer Methods for Mathematical
Computations (Prentice-Hall, Englewood Cliffs, NJ).
FORSYTHE, G.E. and C. MOLER (1967), Computer Solution of Linear Algebraic Systems (Prentice-Hall,
Englewood Cliffs, NJ).
FOSTER, L.V. (1986), Rank and nullspace calculations using matrix decompositions without column
interchanges, Linear Algebra Appl. 74, 47-71.
GANDER, W. (1980), Algorithms for the QR-decomposition, Rept. 80-02, Angewandte Mathematik, ETH,
Zürich.
GANDER, W. (1981), Least squares with a quadratic constraint, Numer. Math. 36, 291-307.
GAUSS, C.F. (1823), Theoria combinationis observationum erroribus minimis obnoxiae, Commentationes
Societatis Regiae Scientiarum Gottingensis Recentiores 5, 33-90.
GAY, D.M. (1983), Algorithm 611, Subroutines for unconstrained minimization using a model/trust-region
approach, ACM Trans. Math. Software 9, 503-524.
GAY, D.M. (1984), A trust-region approach to linearly constrained optimization, in: D.F. GRIFFITHS, ed.,
Proceedings 1983 Dundee Conference on Numerical Analysis, Lecture Notes in Mathematics 1066
(Springer, Berlin) 72-105.
GENTLEMAN, W.M. (1973), Least squares computations by Givens transformations without square roots,
J. Inst. Math. Appl. 12, 329-336.
GENTLEMAN, W.M. (1975), Error analysis of QR decompositions by Givens transformations, Linear
Algebra Appl. 10, 189-197.
GENTLEMAN, W.M. (1976), Row elimination for solving sparse linear systems and least squares problems
in: G.A. WATSON, ed., Proceedings 1975 Dundee Conference on Numerical Analysis, Lecture Notes in
Mathematics 506 (Springer, Berlin) 122-133.
GEORGE, J.A. and M.T. HEATH (1980), Solution of sparse linear least squares problems using Givens
rotations, Linear Algebra Appl. 34, 69-73.
GEORGE, J.A., M.T. HEATH and E.G.Y. NG (1983), A comparison of some methods for solving sparse
linear least squares problems, SIAM J. Sci. Statist. Comput. 4, 177-187.
GEORGE, J.A., M.T. HEATH and R.J. PLEMMONS (1981), Solution of large-scale least squares problems
using auxiliary storage, SIAM J. Sci. Statist. Comput. 2, 416-429.

GEORGE, J.A. and J.W.H. LIU (1981), Computer Solution of Large Sparse Positive Definite Systems
(Prentice-Hall, Englewood Cliffs, NJ).
GEORGE, J.A., J.W.H. LIU and E.G.Y. NG (1984), Row ordering schemes for sparse Givens transformations,
I. Bipartite graph model, Linear Algebra Appl. 81, 55-81.
GEORGE, J.A. and E.G.Y. NG (1983), On row and column orderings for sparse least squares problems,
SIAM J. Numer. Anal. 20, 326-344.
GEORGE, J.A. and E.G.Y. NG (1984), SPARSPAK: Waterloo sparse matrix package user's guide for
SPARSPAK-B, Research Rept. CS-84-37, Department of Computer Science, University of Waterloo,
Ont.
GEORGE, J.A., W.G. POOLE and R.G. VOIGT (1978), Incomplete nested dissection for solving n by n grid
problems, SIAM J. Numer. Anal. 15, 90-112.
GILL, P.E., G.H. GOLUB, W. MURRAY and M.A. SAUNDERS (1974), Methods for modifying matrix
factorizations, Math. Comp. 28, 505-535.
GILL, P.E., S.J. HAMMARLING, W. MURRAY, M.A. SAUNDERS and M.H. WRIGHT (1986), User's guide for
LSSOL (version 1.0): A Fortran package for constrained linear least-squares and convex quadratic
programming, Rept. SOL, Department of Operations Research, Stanford University, CA.
GILL, P.E. and W. MURRAY (1976a), Nonlinear least squares and nonlinearly constrained optimization,
in: G.A. WATSON, ed., Proceedings 1975 Dundee Conference on Numerical Analysis, Lecture Notes in
Mathematics 506 (Springer, Berlin).
GILL, P.E. and W. MURRAY (1976b), The orthogonal factorization of a large sparse matrix, in: J.R. BUNCH
and D.J. ROSE, eds., Sparse Matrix Computations (Academic Press, New York) 201-212.
GILL, P.E. and W. MURRAY (1978), Algorithms for the solution of the nonlinear least squares problem,
SIAM J. Numer. Anal. 15, 977-992.
GILL, P.E., W. MURRAY and M.H. WRIGHT (1981), Practical Optimization (Academic Press, New York).
GIVENS, W. (1958), Computation of plane unitary rotations transforming a general matrix to triangular
form, SIAM J. Appl. Math. 6, 26-50.
GOLUB, G.H. (1965), Numerical methods for solving linear least squares problems, Numer. Math. 7,
206-216.
GOLUB, G.H. (1968), Least squares, singular values and matrix approximations, Apl. Mat. 13, 44-51.
GOLUB, G.H. (1969), Matrix decompositions and statistical computation, in: R.C. MILTON and J.A.
NELDER, eds., Statistical Computation (Academic Press, New York) 365-397.
GOLUB, G.H. (1973), Some modified matrix eigenvalue problems, SIAM Rev. 15, 318-344.
GOLUB, G.H., M.T. HEATH and G. WAHBA (1979), Generalized cross-validation as a method for choosing
a good ridge parameter, Technometrics 21, 215-223.
GOLUB, G.H., A. HOFFMAN and G.W. STEWART (1987), A generalization of the Eckart-Young-
Mirsky matrix approximation theorem, Linear Algebra Appl. 88/89, 317-327.
GOLUB, G.H. and W. KAHAN (1965), Calculating the singular values and pseudo-inverse of a matrix,
SIAM J. Numer. Anal. Ser. B 2, 205-224.
GOLUB, G.H., V. KLEMA and G.W. STEWART (1976), Rank degeneracy and least squares problems, Tech.
Rept. TR-456, Department of Computer Science, University of Maryland, College Park, MD.
GOLUB, G.H. and R. LEVEQUE (1979), Extensions and uses of the variable projection algorithm for
solving nonlinear least squares problems, ARO Rept. 79-3, Proceedings of the 1979 Army Numerical
Analysis and Computers Conference.
GOLUB, G.H., F.T. LUK and M.L. OVERTON (1981), A block Lanczos method for computing the singular
values and corresponding singular vectors of a matrix, ACM Trans. Math. Software 7, 149-169.
GOLUB, G.H., F.T. LUK and M. PAGANO (1979), A sparse least squares problem in photogrammetry, in:
Proceedings Computer Science and Statistics: Twelfth Annual Conference on the Interface, Waterloo,
Ont.
GOLUB, G.H., P. MANNEBACK and P. TOINT (1986), A comparison between some direct and iterative
methods for large scale geodetic least squares problems, SIAM J. Sci. Statist. Comput. 7, 799-
816.
GOLUB, G.H. and V. PEREYRA (1973), The differentiation of pseudo-inverses and nonlinear least squares
problems whose variables separate, SIAM J. Numer. Anal. 10, 413-432.
GOLUB, G.H. and V. PEREYRA (1976), Differentiation of pseudo-inverses, separable nonlinear least
642 A. Björck

squares problems and other tales, in: M.Z. NASHED, ed., Generalized Inverses and Applications
(Academic Press, New York) 303-324.
GOLUB, G.H. and R.J. PLEMMONS (1980), Large scale geodetic least squares adjustment by dissection and
orthogonal decomposition, Linear Algebra Appl. 34, 3-27.
GOLUB, G.H., R.J. PLEMMONS and A. SAMEH (1986), Parallel block schemes for large scale least squares
computations, to appear.
GOLUB, G.H. and C. REINSCH (1970), Singular value decomposition and least squares solutions, Numer.
Math. 14, 403-420.
GOLUB, G.H. and G.P. STYAN (1973), Numerical computations for univariate linear models, J. Statist.
Comput. Simulation 2, 253-274.
GOLUB, G.H. and C.F. VAN LOAN (1980), An analysis of the total least squares problem, SIAM J. Numer.
Anal. 17, 883-893.
GOLUB, G.H. and C.F. VAN LOAN (1983), Matrix Computations (Johns Hopkins Press, Baltimore, MD).
GOLUB, G.H. and R.S. VARGA (1961), Chebyshev semi-iterative methods, successive overrelaxation
iterative methods and second order Richardson iterative methods, Parts I and II, Numer. Math. 3,
147-168.
GOLUB, G.H. and J.H. WILKINSON (1966), Note on the iterative refinement of least squares solutions,
Numer. Math. 9, 139-148.
GRIMES, R.G. and J.G. LEWIS (1981), Condition number estimation for sparse matrices, SIAM J. Sci.
Statist. Comput. 2, 384-388.
GUSTAVSON, F.G. (1976), Finding the block lower triangular form of a matrix, in: J.R. BUNCH and D.J.
ROSE, eds., Sparse Matrix Computations (Academic Press, New York), 275-289.
HAGEMAN, L.A., F.T. LUK and D.M. YOUNG (1980), On the equivalence of certain iterative acceleration
methods, SIAM J. Numer. Anal. 17, 852-873.
HAMMARLING, S. (1974), A note on modifications to the Givens plane rotation, J. Inst. Math. Appl. 13,
215-218.
HANSON, R.J. (1986), Linear least squares with bounds and linear constraints, SIAM J. Sci. Statist.
Comput. 7, 826-834.
HANSON, R.J. and J.L. PHILLIPS (1975), An adaptive numerical method for solving linear Fredholm
equations of the first kind, Numer. Math. 24, 291-307.
HASKELL, K.H. and R.J. HANSON (1979), Selected algorithms for the linearly constrained least squares
problem: A user's guide, Tech. Rept. SAND78-1290, Sandia National Laboratories, Albuquerque, NM.
HASKELL, K.H. and R.J. HANSON (1981), An algorithm for linear least squares problems with equality and
nonnegativity constraints, Math. Programming 21, 98-118.
HEATH, M.T. (1982), Some extensions of an algorithm for sparse linear least squares problems, SIAM J.
Sci. Statist. Comput. 3, 223-237.
HEATH, M.T. (1984), Numerical methods for large sparse linear least squares problems, SIAM J. Sci.
Statist. Comput. 5, 497-513.
HELMERT, F.R. (1880), Die mathematischen und physikalischen Theorien der höheren Geodäsie, I
(Teubner, Leipzig).
HESTENES, M.R. and E. STIEFEL (1952), Methods of conjugate gradients for solving linear systems, J. Res.
Nat. Bur. Standards 49, 409-436.
HIEBERT, K.L. (1981), An evaluation of mathematical software that solves nonlinear least squares
problems, ACM Trans. Math. Software 7, 1-16.
HOLT, J.N. and R. FLETCHER (1979), An algorithm for constrained non-linear least-squares, J. Inst. Math.
Appl. 23, 449-463.
HOUSEHOLDER, A.S. (1958), Unitary triangularization of a nonsymmetric matrix, J. ACM 5, 339-342.
HOUSEHOLDER, A.S. (1974), The Theory of Matrices in Numerical Analysis (Dover, New York).
HUBER, P.J. (1977), Robust Statistical Procedures, CBMS-NSF Regional Conference Series in Applied
Mathematics (SIAM, Philadelphia, PA).
HUTCHINSON, M.F. and F.R. DE HOOG (1985), Smoothing noisy data with spline functions, Numer. Math.
47, 99-106.
JENNINGS, L.S. and M.R. OSBORNE (1974), A direct error analysis for least squares, Numer. Math. 22,
322-332.
JORDAN, T.L. (1968), Experiments on error growth associated with some linear least-squares procedures,
Math. Comp. 22, 579-588.
KAHAN, W. (1966), Numerical linear algebra, Canad. Math. Bull. 9, 757-801.
KARASALO, I. (1974), A criterion for truncation of the QR decomposition algorithm for the singular linear
least squares problem, BIT 14, 156-166.
KAUFMAN, L. (1975), Variable projection methods for solving separable nonlinear least squares problems,
BIT 15, 49-57.
KAUFMAN, L. (1979), Application of dense Householder transformation to a sparse matrix, ACM Trans.
Math. Software 5, 442-450.
KAUFMAN, L. and V. PEREYRA (1978), A method for separable nonlinear least squares problems with
separable nonlinear equality constraints, SIAM J. Numer. Anal. 15, 12-20.
KELLER, J. (1965), On the solution of singular and semidefinite linear systems by iterations, J. SIAM Ser.
B 2, 281-290.
KOLATA, G.B. (1978), Geodesy: dealing with an enormous computer task, Science 200, 421-422.
KROGH, F.T. (1974), Efficient implementation of a variable projection algorithm for nonlinear least
squares, Comm. ACM 17, 167-169.
LAWSON, C.L. and R.J. HANSON (1969), Extensions and applications of the Householder algorithm for
solving linear least squares problems, Math. Comp. 23, 787-812.
LAWSON, C.L. and R.J. HANSON (1974), Solving Least Squares Problems (Prentice-Hall, Englewood Cliffs,
NJ).
LAWSON, C.L., R.J. HANSON, F.T. KROGH and D. KINCAID (1979), Basic linear algebra subprograms for
FORTRAN usage, ACM Trans. Math. Software 5, 308-323.
LERINGE, O. and P.-A. WEDIN (1970), A comparison between different methods to compute a vector
x which minimizes ||Ax-b|| when Gx=h, Rept., Department of Computer Science, Lund University.
LEVENBERG, K. (1944), A method for the solution of certain problems in least squares, Quart.Appl. Math.
2, 164-168.
LINDSTROM, P. (1982), A stabilized Gauss-Newton algorithm for unconstrained nonlinear least squares
problems, Rept. UMINF-102.82, Umeå University, Sweden.
LINDSTROM, P. (1983), A general purpose algorithm for nonlinear least squares problems with nonlinear
constraints, Rept. UMINF-103.83, Umeå University, Sweden.
LINDSTROM, P. (1984), Two user guides, one (ENLSIP) for constrained and one (ELSUNC) for unconstrained
nonlinear least squares problems, Rept. UMINF-109 and 110, Umeå University, Sweden.
LINDSTROM, P. and P.-A. WEDIN (1984), A new linesearch algorithm for unconstrained nonlinear least
squares problems, Math. Programming 29, 268-296.
LINDSTROM, P. and P.-A. WEDIN (1986), Methods and software for nonlinear least squares problems,
Rept. UMINF-133.87, Umeå University, Sweden.
LINNIK, I. (1961), Method of Least Squares and Principles of the Theory of Observations (Pergamon Press,
New York).
Liu, J.W.H. (1986), On general row merging schemes for sparse Givens transformations, SIAM J. Sci.
Statist. Comput. 7, 1190-1211.
LUK, F.T. (1980), Computing the singular value decomposition on the ILLIAC IV, ACM Trans. Math.
Software 6, 524-539.
LÄUCHLI, P. (1961), Jordan-Elimination und Ausgleichung nach kleinsten Quadraten, Numer. Math. 3,
226-240.
LÖTSTEDT, P. (1984), Solving the minimal least squares problem subject to bounds on the variables, BIT
24, 206-224.
MAHDAVI-AMIRI, N. (1981), Generally constrained nonlinear least squares and generating test problems:
Algorithmic approach, Ph.D. Thesis, Johns Hopkins University, Baltimore, MD.
MANNEBACK, P. (1985), On some numerical methods for solving large sparse linear least squares
problems, Ph.D. Thesis, Facultes Universitaires Notre-Dame de la Paix, Namur, Belgium.
MANNEBACK, P., C. MURIGANDE and P.L. TOINT (1985), A modification of an algorithm by Golub and
Plemmons for large linear least squares in the context of Doppler positioning, IMA J. Numer. Anal. 5,
221-234.
MANTEUFFEL, T. (1980), An incomplete factorization technique for positive definite linear systems, Math.
Comp. 34, 473-479.
MARKOWITZ, H.M. (1957), The elimination form of the inverse and its application to linear programming,
Management Sci. 3, 255-269.
MARQUARDT, D. (1963), An algorithm for least-squares estimation of nonlinear parameters, SIAM J.
Appl. Math. 11, 431-441.
MIRSKY, L. (1960), Symmetric gauge functions and unitarily invariant norms, Quart. J. Math. Oxford 11,
50-59.
MOLER, C.B. (1980), MATLAB user's guide, Tech. Rept. CS81-1, Department of Computer Science,
University of New Mexico, Albuquerque, NM.
MOLINARI, L. (1977), Gram-Schmidt'sches Orthogonalisierungsverfahren, in: W. GANDER, L. MOLINARI
and H. SVECOVA, eds., Numerische Prozeduren aus Nachlass und Lehre von Prof. Heinz Rutishauser
(Birkhäuser, Stuttgart) 77-93.
MORÉ, J.J. (1978), The Levenberg-Marquardt algorithm: Implementation and theory, in: G.A. WATSON,
ed., Numerical Analysis, Proceedings Biennial Conference Dundee, Lecture Notes in Mathematics 630
(Springer, Berlin) 105-116.
MORÉ, J.J. (1983), Recent developments in algorithms and software for trust region methods, in:
Mathematical Programming: The State of the Art (Springer, Berlin).
MORÉ, J.J., B.S. GARBOW and K.E. HILLSTROM (1980), Users' guide for MINPACK-1, Rept. ANL-80-74,
Applied Mathematics Division, Argonne National Laboratory, Argonne, IL.
MORÉ, J.J. and D.C. SORENSEN (1981), Computing a trust region step, SIAM J. Sci. Statist. Comput. 4,
553-572.
NASH, J.C. (1975), A one-sided transformation method for the singular value decomposition and
algebraic eigenproblem, Comput. J. 18, 74-76.
NAZARETH, L. (1980), Some recent approaches to solving large residual nonlinear least squares problems,
SIAM Rev. 22, 1-11.
O'LEARY, D.P. (1980a), A generalized conjugate gradient algorithm for solving a class of quadratic
programming problems, Linear Algebra Appl. 34, 371-399.
O'LEARY, D.P. (1980b), Estimating matrix condition numbers, SIAM J. Sci. Statist. Comput. 1, 205-209.
O'LEARY, D.P. and J.A. SIMMONS (1981), A bidiagonalization-regularization procedure for large scale
discretizations of ill-posed problems, SIAM J. Sci. Statist. Comput. 2, 474-489.
OREBORN, U. (1986), A direct method for sparse nonnegative least squares problems, Lic. Thesis No. 87,
Linköping Studies in Science and Technology, 1986:27, Linköping University, Sweden.
ORTEGA, J.M. and W.C. RHEINBOLDT (1970), Iterative Solution of Nonlinear Equations in Several
Variables (Academic Press, New York).
OSBORNE, E.E. (1961), On least squares solutions of linear equations, J. ACM 8, 628-636.
PAIGE, C.C. (1973), An error analysis of a method for solving matrix equations, Math. Comp. 27, 355-
359.
PAIGE, C.C. (1979a), Computer solution and perturbation analysis of generalized least squares problems,
Math. Comp. 33, 171-184.
PAIGE, C.C. (1979b), Fast numerically stable computations for generalized linear least squares problems,
SIAM J. Numer. Anal. 16, 165-171.
PAIGE, C.C. (1981), Properties of numerical algorithms related to computing controllability, IEEE Trans.
Automat. Control 26, 130-138.
PAIGE, C.C. (1985), The general linear model and the generalized singular value decomposition, Linear
Algebra Appl. 70, 269-284.
PAIGE, C.C. (1986), Computing the generalized singular value decomposition, SIAM J. Sci. Statist.
Comput. 7, 1126-1146.
PAIGE, C.C. and M.A. SAUNDERS (1977), Least squares estimation of discrete linear dynamic systems
using orthogonal transformations, SIAM J. Numer. Anal. 14, 180-193.
PAIGE, C.C. and M.A. SAUNDERS (1981), Towards a generalized singular value decomposition, SIAM J.
Numer. Anal. 18, 398-405.
PAIGE, C.C. and M.A. SAUNDERS (1982a), LSQR: An algorithm for sparse linear equations and sparse
least squares, ACM Trans. Math. Software 8, 43-71.
PAIGE, C.C. and M.A. SAUNDERS (1982b), Algorithm 583 LSQR: Sparse linear equations and least squares
problems, ACM Trans. Math. Software 8, 195-209.
PARLETT, B.N. (1971), Analysis of algorithms for reflections in bisectors, SIAM Rev. 13, 197-208.
PARLETT, B.N. (1980), The Symmetric Eigenvalue Problem (Prentice-Hall, Englewood Cliffs, NJ).
PARLETT, B.N. and J.K. REID (1970), On the solution of a system of linear equations whose matrix is
symmetric but not definite, BIT 10, 386-397.
PATZELT, P. (1973), Ein Algorithmus zur Lösung von Ausgleichproblemen mit Ungleichungen als
Nebenbedingungen, Diplomarbeit, University of Würzburg, F.R.G.
PENROSE, R. (1955), A generalized inverse for matrices, Proc. Cambridge Philos. Soc. 51, 506-513.
PETERS, G. and J.H. WILKINSON (1970), The least squares problem and pseudo-inverses, Comput. J. 13,
309-316.
PLEMMONS, R.J. (1972), Monotonicity and iterative approximations involving rectangular matrices,
Math. Comp. 26, 853-858.
PLEMMONS, R.J. (1974), Linear least squares by elimination and MGS, J. ACM 21, 581-585.
POWELL, M.J.D. and J.K. REID (1969), On applying Householder's method to linear least squares
problems, in: A.J.H. MORRELL, ed., Proceedings IFIP Congress 1968 (North-Holland, Amsterdam)
122-126.
RAMSIN, H. and P.-A. WEDIN (1977), A comparison of some algorithms for the nonlinear least squares
problem, BIT 17, 72-90.
REID, J.K. (1967), A note on the least squares solution of a band system of linear equations by
Householder reductions, Comput. J. 10, 188-189.
REID, J.K. (1972), On the use of conjugate gradients for systems of linear equations possessing "Property
A", SIAM J. Numer. Anal. 9, 325-332.
REINSCH, C.H. (1971), Smoothing by spline functions, Numer. Math. 16, 451-454.
RICE, J.R. (1966), Experiments on Gram-Schmidt orthogonalization, Math. Comp. 20, 325-328.
RICE, J.R. (1983), PARVEC workshop on very large least squares problems and supercomputers, Rept.
CSD-TR 464, Purdue University, West Lafayette, IN.
RUHE, A. (1979), Accelerated Gauss-Newton algorithms for nonlinear least squares problems, BIT 19,
356-367.
RUHE, A. (1983), Numerical aspects of Gram-Schmidt orthogonalization of vectors, Linear Algebra Appl.
52/53, 591-601.
RUHE, A. and P.-A. WEDIN (1980), Algorithms for separable nonlinear least squares problems, SIAM Rev.
22, 318-337.
RUTISHAUSER, H. (1967), Description of Algol 60, in: Handbook of Automatic Computation 1a (Springer,
Berlin).
RUTISHAUSER, H. (1976), Vorlesungen über numerische Mathematik I (Birkhäuser, Basel).
SAUNDERS, M.A. (1979), Sparse least squares by conjugate gradients: A comparison of preconditioning
methods, in: Proceedings Computer Science and Statistics: Twelfth Annual Symposium on the Interface,
Waterloo, Ont., 15-20.
SCHITTKOWSKI, K. (1983), The numerical solution of constrained linear least-squares problems, IMA J.
Numer. Anal. 3, 11-36.
SCHITTKOWSKI, K. (1985), Solving constrained nonlinear least squares problems by a general purpose
SQP-method, Rept., Institut für Informatik, Universität Stuttgart, F.R.G.
SCHITTKOWSKI, K. and J. STOER (1979), A factorization method for the solution of constrained linear least
squares problems allowing subsequent data changes, Numer. Math. 31, 431-463.
SCHITTKOWSKI, K. and P. ZIMMERMAN (1977), A factorization method for constrained least squares
problems with data changes, Part 2: Numerical tests, comparisons, and ALGOL codes, Preprint No.
30, Institut für Angewandte Mathematik und Statistik, Universität Würzburg, F.R.G.
SCHWETLICK, H. and V. TILLER (1985), Numerical methods for estimating parameters in nonlinear
models with errors in the variables, Technometrics 27, 17-24.
SCOLNIK, H.D. (1972), On the solution of non-linear least squares problems, in: H. FREEMAN, ed.,
Proceedings IFIP 71 (North-Holland, Amsterdam) 1258-1265.
SEBER, G.A.F. (1977), Linear Regression Analysis (Wiley, New York).
STEWART, G.W. (1973), Introduction to Matrix Computations (Academic Press, New York).
STEWART, G.W. (1976), The economical storage of plane rotations, Numer. Math. 25, 137-138.
STEWART, G.W. (1977a), On the perturbation of pseudo-inverses, projections, and linear least squares
problems, SIAM Rev. 19, 634-662.
STEWART, G.W. (1977b), Perturbation bounds for the QR factorization of a matrix, SIAM J. Numer.
Anal. 14, 509-518.
STEWART, G.W. (1977c), Research, development and LINPACK, in: J.R. RICE, ed., Mathematical Software
III (Academic Press, New York) 1-14.
STEWART, G.W. (1979a), A note on the perturbation of singular values, Linear Algebra Appl. 28, 213-216.
STEWART, G.W. (1979b), The effects of rounding error on an algorithm for downdating a Cholesky-
factorization, J. Inst. Math. Appl. 23, 203-213.
STEWART, G.W. (1980), The efficient generation of random orthogonal matrices with an application to
condition estimators, SIAM J. Numer. Anal. 17, 403-409.
STEWART, G.W. (1982), An algorithm for computing the CS decomposition of a partitioned orthonormal
matrix, Numer. Math. 40, 297-306.
STEWART, G.W. (1983), A method for computing the generalized singular value decomposition, in: B.
KÅGSTRÖM and A. RUHE, eds., Matrix Pencils, Proceedings, Pite Havsbad, 1982, Lecture Notes in
Mathematics 973 (Springer, Berlin) 207-220.
STEWART, G.W. (1984a), Rank degeneracy, SIAM J. Sci. Statist. Comput. 5, 403-413.
STEWART, G.W. (1984b), On the invariance of perturbed null vectors under column scaling, Numer. Math.
44, 61-65.
STIEFEL, E. (1952/53), Ausgleichung ohne Aufstellung der Gaußschen Normalgleichungen, Wiss. Z. Tech.
Hochsch. Dresden 2, 441-442.
STOER, J. (1971), On the numerical solution of constrained least squares problems, SIAM J. Numer. Anal.
8, 382-411.
TIHONOV, A.N. (1963), Regularization of incorrectly posed problems, Dokl. Akad. Nauk SSSR 153,
1035-1038.
TOINT, P.L. (1987), On large scale nonlinear least squares calculations, SIAM J. Sci. Statist. Comput. 8,
416-435.
VAN DER SLUIS, A. (1969), Condition numbers and equilibration of matrices, Numer. Math. 14, 14-23.
VAN DER SLUIS, A. (1975), Stability of the solutions of linear least squares problems, Numer. Math. 23,
241-254.
VAN DER SLUIS, A. and G.W. VELTKAMP (1979), Restoring rank and consistency by orthogonal projection,
Linear Algebra Appl. 28, 257-278.
VAN LOAN, C. (1976), Generalizing the singular value decomposition, SIAM J. Numer. Anal. 13, 76-83.
VAN LOAN, C. (1984a), Analysis of some matrix problems using the CS decomposition, Tech. Rept.
TR84-603, Cornell University, Ithaca, NY.
VAN LOAN, C. (1984b), Computing the CS and the generalized singular value decomposition, Rept. TR
84-614, Cornell University, Ithaca, NY.
VAN LOAN, C. (1985), On the method of weighting for equality-constrained least squares problems,
SIAM J. Numer. Anal. 22, 851-864.
VARAH, J.M. (1973), On the numerical solution of ill-conditioned linear systems with applications to
ill-posed problems, SIAM J. Numer. Anal. 10, 257-267.
VARAH, J.M. (1975), A lower bound for the smallest singular value of a matrix, Linear Algebra Appl. 11,
1-2.
VARAH, J.M. (1979), A practical examination of some numerical methods for linear discrete ill-posed
problems, SIAM Rev. 21, 100-111.
VARGA, R.S. (1962), Matrix Iterative Analysis (Prentice-Hall, Englewood Cliffs, NJ).
WAMPLER, R.H. (1970), A report on the accuracy of some widely used least squares computer programs, J.
Amer. Statist. Assoc. 65, 549-565.
WAMPLER, R.H. (1972), Some recent developments in linear least squares computations, in: M.E.
TARTER, ed., Proceedings Computer Science and Statistics: Sixth Annual Symposium on the Interface,
Berkeley, CA.
WAMPLER, R.H. (1979), Solutions to weighted least squares problems by modified Gram-Schmidt with
iterative refinement, ACM Trans. Math. Software 5, 457-465.
WATSON, G.A. (1984), The numerical solution of total l_p approximation problems, in: D.F. GRIFFITHS, ed.,
Proceedings 1983 Dundee Conference on Numerical Analysis, Lecture Notes in Mathematics 1066
(Springer, Berlin).
WEDIN, P.-A. (1972), Perturbation bounds in connection with the singular value decomposition, BIT 12,
99-111.
WEDIN, P.-A. (1973a), On the almost rank-deficient case of the least squares problem, BIT 13, 344-354.
WEDIN, P.-A. (1973b), Perturbation theory for pseudo-inverses, BIT 13, 217-232.
WEDIN, P.-A. (1974), On the Gauss-Newton method for the nonlinear least squares problem, ITM
Working Paper 24, Institutet för Tillämpad Matematik, Stockholm.
WEDIN, P.-A. (1985), Perturbation theory and condition numbers for generalized and constrained linear
least squares problems, Rept. UMINF 125.85, Umeå University, Sweden.
WEIL, R.L. and P.C. KETTLER (1971), Rearranging matrices to block-angular form for decomposition
(and other) algorithms, Management Sci. 18, 98-108.
WILKINSON, J.H. (1963), Rounding Errors in Algebraic Processes (Prentice-Hall, Englewood Cliffs, NJ).
WILKINSON, J.H. (1965), The Algebraic Eigenvalue Problem (Clarendon Press, Oxford).
WILKINSON, J.H. (1968), A priori error analysis of algebraic processes, in: Proceedings International
Congress on Mathematics (Izdat. Mir, Moscow) 629-639.
WILKINSON, J.H. (1971), Modern error analysis, SIAM Rev. 13, 548-568.
WILKINSON, J.H. (1977), Some recent advances in numerical linear algebra, in: D.A.H. JACOBS, ed., The
State of the Art in Numerical Analysis (Academic Press, New York) 1-53.
WILKINSON, J.H. and C. REINSCH, eds. (1971), Handbook for Automatic Computation 2, Linear Algebra
(Springer, New York).
WRIGHT, S.J. and J.N. HOLT (1985), Algorithms for nonlinear least squares with linear inequality
constraints, SIAM J. Sci. Statist. Comput. 6, 1033-1048.
YOUNG, D.M. (1971), Iterative Solution of Large Systems (Academic Press, New York).
ZIMMERMANN, P. (1977), Ein Algorithmus zur Lösung linearer Least Squares Probleme mit unteren und
oberen Schranken als Nebenbedingungen, Diplomarbeit, Institut für Angewandte Mathematik und
Statistik, Universität Würzburg, F.R.G.
ZLATEV, Z. (1982), Comparison of two pivotal strategies in sparse plane rotations, Comput. Math. Appl.
8, 119-135.
ZLATEV, Z. and H.B. NIELSEN (1979), LLSS01: A Fortran subroutine for solving least squares problems
(User's guide), Rept. No. 79-07, Institute of Numerical Analysis, Technical University of Denmark,
Lyngby, Denmark.
Subject Index

Active set algorithms, 609
Algorithm, 487, 503, 505
Appending, 571, 575
Approximate Newton direction, 627
Approximation, 630
Armijo-Goldstein step length principle, 622
Artificial objective function, 609
Augmented matrix, 489, 569
Augmented system, 472
Automatic partitioning, 546

B-splines, 551
Band matrices, 532
Band structure, 602, 605
Banded problems, 549
Basic solution, 495
Best linear unbiased estimate, 583
Bidiagonal decomposition, 497
Bidiagonal form, 511, 566, 601
Bidiagonalization, 605
Binding, 598
Block diagonal preconditioner, 562
Block SOR method, 563
Bound, 517

CGLS, 560
Chan SVD algorithm, 513
Chebyshev semi-iterative method, 559
Cholesky decomposition, 487
Cholesky factor, 581
Cholesky factorization, 490, 537
Circular shifts, 613
Classical Gram-Schmidt, 505
Column, 573
Column permutations, 594
Column pivoting, 590
Column pivoting strategy, 515
Complete orthogonal decomposition, 496
Condition estimator, 518
Condition number, 517
Conditioning, 587
Conjugate gradient method, 559
Constrained problems, 631
Constraint, 598
Constraint equations, 589
Convergence acceleration, 560
Covariance matrix, 528, 530, 581
Critical point, 618
CS decomposition, 579
Curve fitting, 633
Curve fitting problems, 620
Cyclic Jacobi preconditioner, 563

Data fitting, 619
Deleting, 572, 574
Deregularization, 587
Descent direction, 621, 626, 631
Determination rank, 622
Diagonal scaling, 561, 585
Direct elimination, 592, 593
Dissection, 556
Downdating, 575
Dual block angular form, 553

Edges, 536
Envelope reduction, 537
Envelope storage scheme, 533
Equality constraints, 589, 609
Error analysis, 490, 508
Error covariance, 584
Estimable function, 584
Estimable part, 583

Feasible point, 610
Fill-in, 534
First kind, 596
Fitting, 589, 633
Flop, 486
Fundamental subspaces, 476, 478

Gauss-Newton correction, 635
Gauss-Newton direction, 621

Gauss-Newton method, 620, 622, 634
Gauss-Seidel method, 558
Gaussian elimination, 591
General Gauss-Markoff, 581
Generalized cross validation, 604
Generalized linear least squares problem, 526
Generalized normal equations, 597
Generalized singular value decomposition, 580, 581, 584, 597, 600
Geodetic survey problems, 554, 556
Geometrical interpretation, 619
George-Heath algorithm, 543
Givens, 504
Givens rotation, 500, 543, 550
Global convergence, 628
Golub's method, 503
Golub-Reinsch SVD algorithm, 513
Gram-Schmidt, 577
Gram-Schmidt decomposition, 614
Gram-Schmidt orthogonalization, 506

Half circle, 633
Hall property, 542
Helmert, 556
Hessenberg matrix, 572
Hessian, 629
Hessian matrices, 618
Hestenes, 559
Householder transformation, 498
Hybrid method, 628

Ill-posed, 596
Implicit curve fitting, 632
IMSL, 628
Inadequate problems, 624
Incomplete Cholesky factorization, 561
Inconsistency, 589
Inequality constraints, 609
Influence matrix, 604
Instability, 577
Integral equation, 596
Inverse iteration, 519
Iterative methods, 557, 604
Iterative refinement, 522, 549

Jacobi's method, 557, 558
Jacobian, 618, 634

Lagrange multipliers, 597, 613
Lanczos bidiagonalization, 565
Large residual problems, 626
Latent root regression, 585
Least distance problem, 609, 612
Least squares problem, 469, 526
Left circular shift, 572
Levenberg-Marquardt algorithm, 625
Levenberg-Marquardt method, 619
Line search, 631
  "Exact", 623
Linear constraints, 631
Linear model, 619
Linear statistical model, 469
Line search algorithm, 626
LINPACK, 518, 576
Local convergence, 623
Local minimum, 619
Lower, 631
LSQI, 601
LSQR, 560, 567, 604

Markowitz ordering, 538
Matrix approximation, 474
Matrix decompositions, 611
Maximal curvature, 624
Merit function, 631
Method of weighting, 595
MGS, 509
Minimal norm solution, 496
Minimum degree algorithm, 538
Minimum norm problem, 540
Minimum norm TLS solution, 586
Minmax characterization, 475
Model, 583
Model prediction, 617
Modified Gram-Schmidt algorithm, 505
Multiple right-hand sides, 549, 587, 588

n-dimensional surface, 618
Negative gradient, 622
Nested dissection, 537
Nested dissection ordering, 557
Newton's method, 620
NL2SOL, 628
No-cancellation assumption, 534
Nodes, 536
Nonestimable part, 583
Nonlinear conjugate gradient acceleration, 628
Nonlinear equality constraints, 631
Nonlinear problems, 617
Nonnegativity constrained problem,
Nonnegativity constraints, 606
Normal curvature matrix, 618, 623
Normal equations, 471, 485, 536
NPL Algorithm Library, 628
Nullspace method, 591
Numerical cancellation, 534
Numerical δ-rank, 511

One-sided bounds, 606
Orthogonal bases problem, 509
Orthogonal distance, 586, 632
Orthogonal distance problem, 634
Orthogonal projection, 477, 493
Orthogonal projector, 472
Orthogonal regression, 587, 632
Orthogonalization, 554
Orthogonalization methods, 541

Paige, 529
Parallel, 544, 555
Partially separable, 628
Periodic spline, 552
Perturbation, 493
Perturbation analysis, 491
Perturbation bounds, 479
Photogrammetry, 554
Plane rotation, 500
Preconditioned iterative method, 555
Preconditioner, 555
Preconditioning, 560
Principal radius of curvature, 618
Problem BLS, 614
Problem LSB, 615
Problem LSD, 607
Problem LSE, 591
Problem LSEI, 614
Problem LSI, 609, 615
Problem LSQI, 598, 625
Problem NNLS, 607, 615
Problem NNLSE, 614
Problems, 630
Problems BLS, 606
Property A, 563
Pseudoinverse, 477, 603
Pseudoinverse solution, 477

QR decomposition, 491, 494, 502, 503, 555, 570
Quadratic constraint, 596, 625
Quadratic model, 619, 626
Quadratic programming problem, 609
Quasi-Newton approximation, 628
Quasi-Newton relation, 628

R factor, 492
Rank, 548
Rank deficiency, 519
Rank-deficient, 548
Rank-deficient case, 612
Rank-deficient problems, 510, 514, 547
Rank revealing QR, 519
Rank-one change, 570, 576
Rate of convergence, 559, 624
Reduced system preconditioner, 564
Reflector, 498
Regularization, 596
Regularization methods, 597
Reorthogonalization, 510
Richardson method, 559
Right circular shift, 574
Row, 575
Row bandwidth, 532, 535
Row block structure, 552
Row interchanges, 528
Row merge tree, 546
Row ordering, 544
Row orthogonalization, 543, 550
Rutishauser, 508

Saddle point, 618, 623
Scaling, 482, 490, 511
Search direction, 610
Second derivatives, 626
Secular equation, 599, 600, 601, 604
Seminormal equations, 549
Separable nonlinear constraints, 631
Separable problems, 630
Sequential orthogonalization method, 602
Sequential row orthogonalization, 543
Simple bounds, 606, 612
Singular value decomposition, 577, 585
Singular value decomposition (SVD), 472
Singular values, 473, 511
Singular vectors, 473, 511
SOR method, 558
Sparse least squares problem, 531
Sparse QR decomposition, 615
Sparse storage scheme, 532
Sparsity structure, 553
SPARSPAK, 615
Splitting, 559
Standard form, 600, 602, 605
Standard task, 498
Starting values, 630
Stationary iterative method, 559
Statistical model, 604
Step length, 610
Step length principles, 622
Stiff, 527
Strongly nonlinear problem, 626
Structural cancellation, 541
Subset selection, 519
Substructuring, 556
Superlinear convergence, 623, 627
Symmetric reordering, 536

Toeplitz matrices, 603
Toeplitz matrix, 604
Total l_p approximation problem, 588
Total least squares, 585
Total least squares problem, 587, 634
Transformation, 602
Trust region, 631
Trust region methods, 625
Trust region strategy, 628
Twice is enough, 509

Undirected graph, 536
Unit roundoff, 490
Unlabeled graphs, 536
Update, 611
Updating, 527, 539, 569, 577, 613
Upper bounds, 631

Variable projection algorithm, 630
Variance-covariance, 524
Variance-covariance matrix, 538
Volterra integral equations, 603

Weighted linear least squares problem, 526
Working set, 611
