During the past decades, the need for ever more sophisticated mathematical
models and for increasingly complex and extensive computer simulations has grown enormously.
In this fashion, two indissociable activities, mathematical modeling and computer
simulation, have gained a major status in all aspects of science, technology, and
industry.
In order that these two sciences be established on the safest possible grounds,
mathematical rigor is indispensable. For this reason, two companion sciences,
Numerical Analysis and Scientific Software, have emerged as essential steps for
validating the mathematical models and the computer simulations that are based on
them.
Numerical Analysis is here understood as the part of Mathematics that describes
and analyzes all the numerical schemes that are used on computers; its objective
consists in obtaining a clear, precise, and faithful representation of all the
"information" contained in a mathematical model; as such, it is the natural
extension of more classical tools, such as analytic solutions, special transforms,
functional analysis, as well as stability and asymptotic analysis.
General Preface
Volumes are numbered by capital Roman numerals (as Vol. I, Vol. II, etc.),
according to their chronological appearance.
Since all the articles pertaining to a given heading may not be simultaneously
available at a given time, a given heading usually appears in more than one volume;
for instance, if articles devoted to the heading "Solution of Equations in $\mathbb{R}^n$" appear
in Volumes I and III, these volumes will include "Solution of Equations in $\mathbb{R}^n$ (Part 1)"
and "Solution of Equations in $\mathbb{R}^n$ (Part 2)" in their respective titles. Naturally, all
the headings dealt with within a given volume appear in its title; for instance, the
complete title of Volume I is "Finite Difference Methods (Part 1), Solution of
Equations in $\mathbb{R}^n$ (Part 1)".
Each article is subdivided into sections, which are numbered consecutively
throughout the article by Arabic numerals, as Section 1, Section 2, ..., Section 14,
etc. Within a given section, formulas, theorems, remarks, and figures have their own
independent numberings; for instance, within Section 14, formulas are numbered
consecutively as (14.1), (14.2), etc., theorems are numbered consecutively as Theorem
14.1, Theorem 14.2, etc. For the sake of clarity, the article is also subdivided into
chapters, numbered consecutively throughout the article by capital Roman numerals;
for instance, Chapter I comprises Sections 1 to 9, Chapter II comprises Sections 10
to 16, etc.
P.G. CIARLET
J.L. LIONS
May 1989
Introduction
G.I. Marchuk
The finite difference method is a universal and efficient numerical method for solving
differential equations. Its intensive development, which began at the end of the 1940s
and the beginning of the 1950s, was stimulated by the need to cope with a number of
complex problems of science and technology. Powerful computers provided an
impetus of paramount importance for the development and application of the finite
difference method, which is itself simple to use and can be conveniently
implemented on computers of different architectures. A large number of
complicated multidimensional problems in electrodynamics, elasticity theory, fluid
mechanics, gas dynamics, the theory of particle and radiation transfer, atmosphere and
ocean dynamics, and plasma physics have been solved by finite difference
techniques.
Numerous spectacular results have been obtained in the theory of finite difference
methods during the last four decades.
In ordinary differential equations, the stability of the main classical finite
difference methods was investigated and the relevant accuracy estimates were
derived, a large number of new versions of these methods were constructed, and
efficient algorithms were suggested for their implementation in a wide range of
applied problems. The needs of electronics, kinetics, and catalysis
stimulated the development of a broad class of methods for solving stiff systems of
equations. Problems in control theory, biology, and medicine were important for the
progress in finite difference methods for delay ordinary differential equations.
In partial differential equations, the achievements of the finite difference method
are even more impressive. Finite difference counterparts of the main differential
operators of mathematical physics were constructed, including those with conser-
vation properties, that is, those obeying the discrete counterparts of the laws of
conservation. An elegant theory of approximation, stability, and convergence of the
finite difference method was constructed.
10 V. Thomee
mesh, which does not necessarily fit the domain. This led the development in
a different direction, based on variational formulations of the boundary value
problems, and using piecewise polynomial approximating functions on more
flexible partitions of the domain. This approach, the finite element method, was
better suited for complex geometries and many numerical analysts (including the
author of this article) abandoned the classical finite difference method to work with
finite elements. The papers on parabolic equations using finite differences after 1970
are few, particularly in the West.
It should be said, however, that finite elements and finite differences have many
points in common, and that it may be more appropriate to think of the new
development as a continuation of the established theory rather than a break away
from it. In the Russian literature, for instance, finite element methods are often
referred to as variational difference schemes, and variational thinking was, in fact,
used already in the paper by Courant, Friedrichs and Lewy quoted above. The finite
element theory owes much of its present level of development and sophistication to
the foundation provided by the finite difference theory. However, in the present
Handbook, the two subjects are separated, and we shall only very briefly touch upon
their interrelation below.
In our presentation here we shall not discuss techniques for solving the algebraic
linear systems of equations that result from the discretization of the initial boundary
value problems, but refer to other articles of this Handbook concerning such
matters. Neither shall we treat the related area of alternating direction implicit
methods, or fractional step methods, which are designed to reduce the amount of
computation needed in multidimensional, particularly rectangular, domains, and to
which a special article of this volume is devoted.
Several textbooks exist which treat finite difference methods for parabolic
problems, and we refer, in particular, to RICHTMYER and MORTON [1967] and
SAMARSKII and GULIN [1973] for thorough accounts of the field, but also (in
chronological order of publication) to COLLATZ [1955], FORSYTHE and WASOW
[1960], RJABENKI and FILIPPOW [1960], FOX [1962], GODUNOV and RYABENKII
[1964], SAUL'EV [1964], SMITH [1965], BABUŠKA, PRÁGER and VITÁSEK [1966],
MITCHELL [1969], and SAMARSKII [1971]. In addition we would like to mention the
survey papers by DOUGLAS [1961a] and THOMÉE [1969]. We have included in our
list of references a large number of original papers, not all of which are quoted in our
text. For treatises on the theory of parabolic differential equations, covering
existence, uniqueness, and regularity results such as are needed here, we refer to
FRIEDMAN [1964] and LADYZENSKAJA, SOLONNIKOV and URAL'CEVA [1968].
I would like to take this opportunity to thank Chalmers University of Technology
for granting me a reduction of my teaching load while writing this article, to
Ann-Britt Karlsson and Yumi Karlsson for typing the manuscript, and to
Mohammad Asadzadeh for proofreading the entire work.
CHAPTER I
Introduction
In this first introductory chapter our purpose is to use the simplest possible model
problems to present some basic concepts which are important for the understanding
of the formulation and analysis of finite difference methods for parabolic partial
differential equations. The chapter is subdivided into two sections corresponding to
the two basic problems discussed in the rest of this article, namely the pure initial
value problem and the mixed initial boundary value problem.
In the first section we thus consider the pure initial value problem for the heat
equation in one space dimension. We begin with the simplest example of an explicit
one-step, or two-level, finite difference approximation, discuss its stability with
respect to the maximum norm and relate its formal accuracy to its rate of
convergence to the exact solution. We also present an example of the construction of
a more accurate explicit scheme. We then introduce the application of Fourier
techniques in the analysis of stability, now with respect to the $L_2$-norm, and of
accuracy and convergence. We finally touch upon the possibility of using more than
two time levels in our approximations.
Section 2 is devoted to the mixed initial boundary value problem for the same
basic parabolic equation, with Dirichlet type boundary conditions at the endpoints
of a finite interval in the space variable. Here we discuss the possibility and
advantage of using implicit methods, requiring the solution of a linear system of
equations at each time level. Stability and error analysis is carried out for the
simplest such methods, the backward Euler method and the more accurate
Crank-Nicolson method. Again both maximum-norm estimates based on positivity
properties of the difference scheme and $l_2$-norm estimates derived by Fourier
analysis are treated. A brief mention is made of the possibility of extending some
initial boundary value problems to periodic pure initial value problems.
The material in this chapter is standard and we refer to the basic textbooks quoted
in our preface for further details and references.
We thus wish to find the solution of the pure initial value problem
$$\frac{\partial u}{\partial t}=\frac{\partial^2 u}{\partial x^2}\quad\text{for }x\in\mathbb{R},\ t>0,\tag{1.1}$$
$$u(x,0)=v(x)\quad\text{for }x\in\mathbb{R},$$
where $\mathbb{R}$ denotes the real axis and $v$ is a given smooth bounded function. It is well
known that this problem admits a unique solution, many properties of which may be
deduced, for instance, from the representation
$$u(x,t)=E(t)v(x)=\frac{1}{\sqrt{4\pi t}}\int_{\mathbb{R}}e^{-(x-y)^2/4t}\,v(y)\,dy.$$
Here we think of the right-hand side as defining the solution operator $E(t)$ of (1.1). In
particular, we may note that this solution operator is bounded, or, more precisely,
$$\sup_{x\in\mathbb{R}}|E(t)v(x)|=\sup_{x\in\mathbb{R}}|u(x,t)|\le\sup_{x\in\mathbb{R}}|v(x)|\quad\text{for }t>0.\tag{1.2}$$
For the numerical solution of the problem (1.1) by finite differences we introduce
a grid of mesh points $(x,t)=(jh,nk)$, where $h$ and $k$ are mesh parameters which are
small and thought of as tending to zero, and where $j$ and $n$ are integers, $n\ge 0$. We then
look for an approximate solution of (1.1) at these mesh points, which will be denoted
by $U_j^n$, by solving a problem in which the derivatives in (1.1) have been replaced by
finite difference quotients. Define thus for functions defined on the grid the forward
and backward difference quotients
$$\partial_x U_j=h^{-1}(U_{j+1}-U_j),\qquad \partial_{\bar x}U_j=h^{-1}(U_j-U_{j-1}),$$
and similarly, for instance,
$$\partial_t U_j^n=k^{-1}(U_j^{n+1}-U_j^n).$$
The simplest finite difference equation corresponding to (1.1) is then
$$\partial_t U_j^n=\partial_x\partial_{\bar x}U_j^n\quad\text{for }-\infty<j<\infty,\ n\ge 0,$$
$$U_j^0=v_j\equiv v(jh)\quad\text{for }-\infty<j<\infty.$$
This difference equation may also be written as
$$\frac{U_j^{n+1}-U_j^n}{k}=\frac{U_{j+1}^n-2U_j^n+U_{j-1}^n}{h^2},$$
or, if we set $\lambda=k/h^2$,
$$U_j^{n+1}=\lambda U_{j-1}^n+(1-2\lambda)U_j^n+\lambda U_{j+1}^n=(E_{kh}U^n)_j,\tag{1.3}$$
where the identity defines a linear operator $E_{kh}$, the local discrete solution operator.
This scheme is called explicit since it expresses the solution at $t=(n+1)k$ explicitly in
terms of the values at $t=nk$. Iterating the operator we find that the solution of the
discrete problem is
$$U_j^n=(E_{kh}^nU^0)_j=(E_{kh}^nv)_j.$$
Assume now that $\lambda\le\tfrac12$. The coefficients of the operator $E_{kh}$ in (1.3) are then all
nonnegative, and since their sum is 1 we find
$$\sup_j|U_j^{n+1}|\le\sup_j|U_j^n|.$$
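As a concrete illustration, the scheme (1.3) can be sketched in a few lines. The following sketch (Python with NumPy; the truncated grid and the Gaussian initial function are illustrative choices, not taken from the text) applies the explicit step and checks that the maxima never increase when $\lambda\le\tfrac12$:

```python
import numpy as np

def forward_euler_step(U, lam):
    """One step of the explicit scheme (1.3):
    U_j^{n+1} = lam*U_{j-1} + (1-2*lam)*U_j + lam*U_{j+1}.
    The infinite grid is truncated; the two end values are kept
    fixed, mimicking the pure initial value problem away from
    the ends (an illustrative simplification)."""
    V = U.copy()
    V[1:-1] = lam * U[:-2] + (1 - 2 * lam) * U[1:-1] + lam * U[2:]
    return V

h = 0.05
lam = 0.5                 # lam = k/h^2 <= 1/2: all coefficients nonnegative
x = np.arange(-10, 10 + h, h)
U = np.exp(-x**2)         # stand-in for a smooth bounded v(x)

sups = [np.abs(U).max()]
for n in range(200):
    U = forward_euler_step(U, lam)
    sups.append(np.abs(U).max())

# Maximum-norm stability: sup_j |U_j^{n+1}| <= sup_j |U_j^n|.
print(all(s1 <= s0 + 1e-14 for s0, s1 in zip(sups, sups[1:])))
```

For $\lambda\le\tfrac12$ the right-hand side of (1.3) is a convex combination of neighboring values, which is exactly what the monotone sequence of maxima reflects.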
discretization error. We have now the following result. Here and below we denote by
$C^{m,n}$ the set of functions of $(x,t)$ with bounded derivatives of orders at most $m$ and
$n$ with respect to $x$ and $t$, respectively.

THEOREM 1.1. Assume that $u\in C^{4,2}$ and $k/h^2=\lambda\le\tfrac12$. Then there is a $C=C(u,T)$ such that
$$\|U^n-u^n\|_\infty\le Ch^2\quad\text{for }nk\le T.$$
and hence
$$z_j^{n+1}=(E_{kh}z^n)_j-k\tau_j^n.$$
The method described has first-order accuracy in time and second-order accuracy
in space; since $k$ and $h$ are tied by $k/h^2=\lambda$, the total effect is second-order
accuracy in space. We shall discuss how this accuracy may be increased, so as to
obtain a higher-order rate of convergence as $h\to 0$.
Our abovementioned method may be written in the form
$$U^{n+1}=U^n+k\,\partial_x\partial_{\bar x}U^n,$$
which may be thought of as resulting from the expansion
$$u_j^{n+1}=u(jh,(n+1)k)=u_j^n+k\Bigl(\frac{\partial u}{\partial t}\Bigr)_j^n+O(k^2)
=u_j^n+k\Bigl(\frac{\partial^2u}{\partial x^2}\Bigr)_j^n+O(k^2)
=u_j^n+k\,\partial_x\partial_{\bar x}u_j^n+k\,O(k+h^2).$$
Using one more term in the Taylor series, we find (omitting the subscript $j$)
$$u^{n+1}=u^n+k\Bigl(\frac{\partial u}{\partial t}\Bigr)^n+\frac{k^2}{2}\Bigl(\frac{\partial^2u}{\partial t^2}\Bigr)^n+O(k^3)
=u^n+k\Bigl(\frac{\partial^2u}{\partial x^2}\Bigr)^n+\frac{k^2}{2}\Bigl(\frac{\partial^4u}{\partial x^4}\Bigr)^n+O(k^3).$$
We shall now approximate $\partial^2u/\partial x^2$ to fourth order and $\partial^4u/\partial x^4$ to second order. We have
$$\partial_x\partial_{\bar x}u_j=\frac{u_{j+1}-2u_j+u_{j-1}}{h^2}=\Bigl(\frac{\partial^2u}{\partial x^2}\Bigr)_j+\frac{h^2}{12}\Bigl(\frac{\partial^4u}{\partial x^4}\Bigr)_j+O(h^4),$$
also
$$(\partial_x\partial_{\bar x})^2u=\frac{\partial^4u}{\partial x^4}+O(h^2),$$
so that
$$\frac{\partial^2u}{\partial x^2}=\partial_x\partial_{\bar x}u-\frac{h^2}{12}(\partial_x\partial_{\bar x})^2u+O(h^4).$$
Inserting these expressions into the expansion above we are led to the scheme
$$U_j^{n+1}=\tfrac12\lambda(\lambda-\tfrac16)(U_{j-2}^n+U_{j+2}^n)+\tfrac23\lambda(2-3\lambda)(U_{j-1}^n+U_{j+1}^n)+(1-\tfrac52\lambda+3\lambda^2)U_j^n=(E_{kh}U^n)_j.\tag{1.5}$$
We see that the present operator $E_{kh}$ has nonnegative coefficients which add up to
1 if $\tfrac16\le\lambda\le\tfrac13$.
For the truncation error of this scheme one finds $\tau^n=O(k^2+h^4)=O(h^4)$,
with $C(u)$ depending on bounds for $\partial^3u/\partial t^3$ and $\partial^6u/\partial x^6$. Using the method of proof
of Theorem 1.1 we may conclude
$$\|U^n-u^n\|_\infty\le C(u,T)h^4\quad\text{for }nk\le T.$$
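The rates just stated can be checked numerically. The following sketch replaces the whole real line by a $2\pi$-periodic grid and uses the separated solution $u=e^{-t}\sin x$; both are convenient testing assumptions, not part of the text. Halving $h$ (with $\lambda=k/h^2$ fixed) should divide the error by about 4 for (1.3) and about 16 for (1.5):

```python
import numpy as np

def step2(U, lam):
    # Second-order scheme (1.3), on a periodic grid.
    return lam * np.roll(U, 1) + (1 - 2 * lam) * U + lam * np.roll(U, -1)

def step4(U, lam):
    # Fourth-order scheme (1.5), on a periodic grid.
    a2 = 0.5 * lam * (lam - 1.0 / 6.0)
    a1 = (2.0 / 3.0) * lam * (2.0 - 3.0 * lam)
    a0 = 1.0 - 2.5 * lam + 3.0 * lam**2
    return (a2 * (np.roll(U, 2) + np.roll(U, -2))
            + a1 * (np.roll(U, 1) + np.roll(U, -1)) + a0 * U)

def error(step, M, nsteps, lam=0.25):
    # lam = 1/4 lies in [1/6, 1/3], so (1.5) has nonnegative coefficients.
    h = 2 * np.pi / M
    k = lam * h * h
    x = h * np.arange(M)
    U = np.sin(x)                            # v(x) = sin x
    for _ in range(nsteps):
        U = step(U, lam)
    exact = np.exp(-nsteps * k) * np.sin(x)  # solution of the heat equation
    return np.abs(U - exact).max()

r2 = error(step2, 16, 16) / error(step2, 32, 64)
r4 = error(step4, 16, 16) / error(step4, 32, 64)
print(round(r2, 1), round(r4, 1))   # roughly 4 and roughly 16
```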
More generally we may consider finite difference operators of the form
$$U_j^{n+1}=(E_kU^n)_j=\sum_pa_pU_{j-p}^n,\tag{1.6}$$
where $a_p=a_p(\lambda)$, $\lambda=k/h^2$, and where the sum is finite. Note that since we consider
$h$ and $k$ tied together by the relation $k/h^2=\lambda=\text{constant}$, we have now omitted the
dependence on $h$ in the notation. One may associate with this operator the
trigonometric polynomial
$$E(\xi)=\sum_pa_pe^{-ip\xi}.\tag{1.7}$$
This polynomial is relevant to the stability analysis and is called the symbol or
characteristic polynomial of $E_k$. We find at once the following result.
$$\|U^n\|=|E(\xi_0)|^n\to\infty\quad\text{as }n\to\infty,$$
which proves the theorem. □
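For the scheme (1.3) the symbol is $E(\xi)=1-2\lambda(1-\cos\xi)$, and the condition $|E(\xi)|\le 1$ can be checked by direct evaluation. A small NumPy sketch (the sampling grid and the two values of $\lambda$ are arbitrary illustrative choices):

```python
import numpy as np

def symbol_forward_euler(xi, lam):
    # Symbol (1.7) of the scheme (1.3): E(xi) = 1 - 2*lam*(1 - cos xi).
    return 1 - 2 * lam * (1 - np.cos(xi))

xi = np.linspace(-np.pi, np.pi, 2001)
print(np.abs(symbol_forward_euler(xi, 0.4)).max() <= 1)   # lam <= 1/2: stable
print(np.abs(symbol_forward_euler(xi, 0.6)).max() > 1)    # lam >  1/2: unstable
```

The extreme value occurs at $\xi=\pi$, where $E(\pi)=1-4\lambda$, so $|E|\le 1$ precisely when $\lambda\le\tfrac12$.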
$$\|V\|_{2,h}=\Bigl(h\sum_j|V_j|^2\Bigr)^{1/2}.$$
The set of mesh functions thus normed will be denoted $l_{2,h}$ below. Let us also define,
for such a mesh function, its discrete Fourier transform
$$\hat V(\xi)=h\sum_jV_je^{-ijh\xi},$$
where we assume that the sum is absolutely convergent. Recall the Parseval relation
$$\|V\|_{2,h}^2=\frac{1}{2\pi}\int_{-\pi/h}^{\pi/h}|\hat V(\xi)|^2\,d\xi.\tag{1.9}$$
We may now define stability with respect to the norm $\|\cdot\|_{2,h}$, or in $l_{2,h}$, to mean,
analogously to above,
$$\|E_k^nV\|_{2,h}\le C\|V\|_{2,h}\quad\text{for }n\ge 0.$$

THEOREM 1.3. The von Neumann condition $|E(\xi)|\le 1$ is a necessary and sufficient
condition for stability of the operator $E_k$ in $l_{2,h}$.

PROOF. We have
$$(E_kV)\hat{\ }(\xi)=h\sum_{j,p}a_pV_{j-p}e^{-ijh\xi}=E(h\xi)\hat V(\xi).$$
Hence
$$(E_k^nV)\hat{\ }(\xi)=E(h\xi)^n\hat V(\xi),$$
and, using the Parseval relation (1.9), the stability of $E_k$ in $l_{2,h}$ is equivalent to
the uniform boundedness of $|E(\xi)|^n$, that is, to $|E(\xi)|\le 1$.
and say that the difference scheme is accurate of order $\mu$ (assuming $\lambda=k/h^2=\text{constant}$)
if, formally, $\tau^n=O(h^\mu)$ as $h\to 0$. Under the appropriate assumptions
concerning the smoothness and the asymptotic behavior for large $|x|$ of the exact
solution, this may be translated into an estimate of the form
$$\|\tau^n\|_{2,h}\le C(u)h^\mu,\tag{1.11}$$
and we may show the following error estimate.
THEOREM 1.4. Assume that $E_k$ is accurate of order $\mu$ and stable in $l_{2,h}$. Then under the
appropriate regularity assumptions on the exact solution of (1.1) we have
$$\|U^n-u^n\|_{2,h}\le C(u,T)h^\mu\quad\text{for }nk\le T.$$
$$z^n=-k\sum_{l=0}^{n-1}E_k^{n-1-l}\tau^l.$$
Consider again the fourth-order scheme (1.5). Recall that the stability with respect
to the maximum norm was deduced above for $\tfrac16\le\lambda\le\tfrac13$. Here
function $U^0(x)=v(x)$ and seek an approximate solution $U^n(x)$ at $t=nk$, $n=1,2,\dots$,
from
$$U^{n+1}(x)=(E_kU^n)(x)=\sum_pa_pU^n(x-ph),\qquad a_p=a_p(\lambda),\ \lambda=k/h^2.\tag{1.12}$$
One advantage of this point of view, which is taken in a large part of the literature, is
that all $U^n$ then lie in the same function space, independently of $h$, for instance $L_2(\mathbb{R})$,
$L_\infty(\mathbb{R})$, or the space $C(\mathbb{R})$ of bounded continuous functions on $\mathbb{R}$.
Let us briefly consider the situation in which the analysis takes place in $L_2=L_2(\mathbb{R})$
and set
$$\|u\|_2=\Bigl(\int_{\mathbb{R}}|u|^2\,dx\Bigr)^{1/2}.$$
By Parseval's relation we then have
$$\|U^n\|_2=(2\pi)^{-1/2}\|E(h\xi)^n\hat v\|_2\le\sup_\xi|E(h\xi)|^n\,\|v\|_2,$$
PROOF. The fact that $E_k$ is accurate of order $\mu$ may be expressed by saying that
certain relations hold for the coefficients $a_p$. On the other hand, these relations are, again by Taylor series expansions, easily seen
to be equivalent to
$$E(\xi)=e^{-\lambda\xi^2}+O(|\xi|^{\mu+2})\quad\text{as }\xi\to 0,$$
or, since $E(\xi)$ is bounded on $\mathbb{R}$,
$$|E(\xi)-e^{-\lambda\xi^2}|\le C|\xi|^{\mu+2}\quad\text{for }\xi\in\mathbb{R}.$$
By stability it follows that
$$|E(\xi)^n-e^{-n\lambda\xi^2}|\le Cn|\xi|^{\mu+2}.\tag{1.13}$$
Now, by Fourier transformation of (1.1), we have
$$\frac{d\hat u}{dt}(\xi,t)=-\xi^2\hat u(\xi,t)\quad\text{for }t\ge 0,\qquad \hat u(\xi,0)=\hat v(\xi),$$
and hence
$$\hat u(\xi,t)=e^{-\xi^2t}\hat v(\xi).$$
We conclude
$$(U^n-u^n)\hat{\ }(\xi)=(E(h\xi)^n-e^{-nk\xi^2})\hat v(\xi),$$
and hence
$$\|U^n-u^n\|_2^2=\frac{1}{2\pi}\|(E(h\xi)^n-e^{-nk\xi^2})\hat v\|_2^2.$$
Now by (1.13)
$$|E(h\xi)^n-e^{-nk\xi^2}|\le Cnh^{\mu+2}|\xi|^{\mu+2},$$
so that, using the fact that $(dv/dx)\hat{\ }(\xi)=i\xi\hat v(\xi)$,
$$\|U^n-u^n\|_2\le Cnh^{\mu+2}(2\pi)^{-1/2}\bigl\||\xi|^{\mu+2}\hat v\bigr\|_2\le Cnkh^\mu\Bigl\|\Bigl(\frac{d}{dx}\Bigr)^{\mu+2}v\Bigr\|_2.$$
This shows the conclusion of the theorem under the assumption that the initial data
are such that $(d/dx)^{\mu+2}v$ belongs to $L_2$. In fact, by a more precise computation one
may reduce this regularity requirement by almost two derivatives, as we shall
describe in Section 4 below. □
One may also use the computed solution to approximate derivatives of the exact
solution, for instance
$$\frac{\partial u}{\partial x}(x,nk)\approx\partial_xU^n(x)=h^{-1}(U^n(x+h)-U^n(x)).\tag{1.14}$$
In physical situations our above pure initial value model problem (1.1) is generally
not adequate; instead it is required to solve the parabolic equation on a finite
interval, with boundary values given at the endpoints of the interval for positive time.
We shall therefore have reason to consider the following model problem:
$$\frac{\partial u}{\partial t}=\frac{\partial^2u}{\partial x^2}\quad\text{in }[0,1]\text{ for }t>0,$$
$$u(0,t)=u(1,t)=0\quad\text{for }t>0,\tag{2.1}$$
$$u(x,0)=v(x)\quad\text{in }[0,1].$$
or, defining the local solution operator $E_{kh}$ in the obvious way,
$$\|E_{kh}^nv\|_{\infty,h}\le\|v\|_{\infty,h},\quad n\ge 0,$$
so that the scheme is maximum-norm stable for $\lambda\le\tfrac12$.
Here, in order to see that this condition is also necessary for stability, we modify
our above example so as to incorporate the boundary conditions and set
$$U_j^0=V_j=(-1)^j\sin\pi jh,\quad j=0,\dots,M.$$
Then, by a simple calculation,
$$U_j^n=(1-2\lambda-2\lambda\cos\pi h)^nV_j,\quad j=0,\dots,M.$$
If $\lambda>\tfrac12$ we have for $h$ sufficiently small
$$|1-2\lambda-2\lambda\cos\pi h|\ge\gamma>1,$$
and hence, if $nk=1$, say,
$$\|U^n\|_{\infty,h}\ge\gamma^n\|V\|_{\infty,h}\to\infty\quad\text{as }h,k\to 0.$$
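The calculation above is easy to reproduce. The following sketch (NumPy; $M=20$ and $\lambda=0.6$ are arbitrary illustrative values) performs one forward Euler step on the oscillating data and confirms both the amplification factor and that it lies below $-1$:

```python
import numpy as np

M = 20
h = 1.0 / M
lam = 0.6                               # violates lam <= 1/2
j = np.arange(M + 1)
V = (-1.0)**j * np.sin(np.pi * j * h)   # the oscillating initial data above

# One forward Euler step with U_0 = U_M = 0 (Dirichlet conditions).
U = np.zeros(M + 1)
U[1:-1] = lam * V[:-2] + (1 - 2 * lam) * V[1:-1] + lam * V[2:]

factor = 1 - 2 * lam - 2 * lam * np.cos(np.pi * h)
print(np.allclose(U[1:-1], factor * V[1:-1]), factor < -1)
```

Each step multiplies the data by the same factor, so for $\lambda>\tfrac12$ the iterates grow geometrically as $h,k\to 0$ with $nk$ fixed.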
In the presence of stability we may define the local discretization or truncation
error $\tau$ and show convergence in exactly the same way as before, and obtain the
following theorem:

THEOREM 2.1. Assume that $u\in C^{4,2}$ and that $U^n$ is the solution of the forward Euler
scheme (2.2) with $\lambda\le\tfrac12$. Then there is a constant $C=C(u,T)$ such that
$$\|U^n-u^n\|_{\infty,h}\le Ch^2\quad\text{for }nk\le T.$$
A more general explicit scheme
$$U_j^{n+1}=\sum_pa_pU_{j-p}^n,\quad j=1,\dots,M-1,$$
is not suitable for the present problem if $a_p\ne 0$ for some $|p|>1$, since then for some
interior mesh point of $[0,1]$ the equation uses mesh points outside this interval. In
such a case the equation has to be modified near the endpoints, which significantly
complicates the analysis.
The stability requirement $k\le\tfrac12h^2$ used in the forward Euler method above is quite
restrictive in practice, and it would be desirable to be able to use $h$ and $k$, for instance,
of the same order of magnitude. For this purpose one may define an approximate
solution, instead, implicitly by the following equation, which is referred to
as the backward Euler scheme and which was proposed first in LAASONEN [1949]:
$$-\lambda U_{j-1}^{n+1}+(1+2\lambda)U_j^{n+1}-\lambda U_{j+1}^{n+1}=U_j^n,\quad j=1,\dots,M-1,\qquad U_0^{n+1}=U_M^{n+1}=0,\tag{2.3}$$
or, in matrix form, $BU^{n+1}=U^n$,
where $U^{n+1}$ and $U^n$ are thought of as vectors with $(M-1)$ components and $B$ is the
diagonally dominant, symmetric, tridiagonal matrix
$$B=\begin{bmatrix}1+2\lambda&-\lambda&0&\cdots&0\\-\lambda&1+2\lambda&-\lambda&\cdots&0\\&\ddots&\ddots&\ddots&\\0&\cdots&-\lambda&1+2\lambda&-\lambda\\0&\cdots&0&-\lambda&1+2\lambda\end{bmatrix}.$$
Clearly this system may easily be solved for $U^{n+1}$. Introducing the finite-dimensional
space $l_h^0$ of $(M+1)$-vectors $\{V_j\}$ with $V_0=V_M=0$ and the operator $B_{kh}$
on $l_h^0$ defined by the left-hand side of (2.3), we may write the scheme as
$$B_{kh}U^{n+1}=U^n,$$
or
$$U^{n+1}=B_{kh}^{-1}U^n=E_{kh}U^n.$$
We shall show now that this method is stable in maximum norm without any
restrictions on $k$ and $h$. In fact, with $\lambda$ arbitrary, we have
$$\|U^{n+1}\|_{\infty,h}\le\|U^n\|_{\infty,h},$$
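A minimal sketch of the backward Euler step, assuming NumPy and a dense solve for simplicity (in practice one would use a tridiagonal solver), illustrates the unconditional maximum-norm stability even for a mesh ratio as large as $\lambda=10$:

```python
import numpy as np

M = 50
h = 1.0 / M
lam = 10.0                  # k = lam*h^2, far beyond the explicit limit 1/2
k = lam * h * h

# The (M-1) x (M-1) matrix B of the backward Euler scheme.
B = (np.diag((1 + 2 * lam) * np.ones(M - 1))
     + np.diag(-lam * np.ones(M - 2), 1)
     + np.diag(-lam * np.ones(M - 2), -1))

x = h * np.arange(1, M)
U = np.where(x < 0.5, 2 * x, 2 - 2 * x)    # a rough (hat) initial function

norms = [np.abs(U).max()]
for _ in range(20):
    U = np.linalg.solve(B, U)              # solve B U^{n+1} = U^n
    norms.append(np.abs(U).max())
print(all(b <= a + 1e-14 for a, b in zip(norms, norms[1:])))
```

The inequality reflects the discrete maximum principle: at a maximizing index $j_0$, $(1+2\lambda)|U_{j_0}^{n+1}|\le|U_{j_0}^n|+2\lambda\|U^{n+1}\|_{\infty,h}$.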
THEOREM 2.2. Let $u\in C^{4,2}$ and let $U^n$ be the solution of the backward Euler scheme
(2.3). Then, with $C=C(u,T)$, we have
$$\|U^n-u^n\|_{\infty,h}\le C(k+h^2)\quad\text{for }nk\le T.$$
The above convergence result for the backward Euler method is satisfactory in
that it requires no restriction on the mesh ratio $\lambda=k/h^2$. On the other hand, since it is
only first-order accurate in time, the error in the time discretization will dominate
unless $k$ is chosen much smaller than $h$. It would thus be desirable to find a method
which is second-order accurate also in time. Such a method is the following, which is
credited to CRANK and NICOLSON [1947].
In order to have second-order accuracy in both space and time we use
symmetry around the point $(jh,(n+\tfrac12)k)$ and pose thus, for $U^n$ given in $l_h^0$, the
problem
$$\frac{U_j^{n+1}-U_j^n}{k}=\tfrac12\,\partial_x\partial_{\bar x}(U_j^{n+1}+U_j^n),\quad j=1,\dots,M-1,\qquad U_0^{n+1}=U_M^{n+1}=0,\tag{2.5}$$
which may be written in matrix form as $BU^{n+1}=AU^n$,
where now both $A$ and $B$ are symmetric tridiagonal matrices, with $B$ diagonally
dominant,
$$B=\begin{bmatrix}1+\lambda&-\tfrac12\lambda&0&\cdots&0\\-\tfrac12\lambda&1+\lambda&-\tfrac12\lambda&\cdots&0\\&\ddots&\ddots&\ddots&\\0&\cdots&0&-\tfrac12\lambda&1+\lambda\end{bmatrix},\qquad
A=\begin{bmatrix}1-\lambda&\tfrac12\lambda&0&\cdots&0\\\tfrac12\lambda&1-\lambda&\tfrac12\lambda&\cdots&0\\&\ddots&\ddots&\ddots&\\0&\cdots&0&\tfrac12\lambda&1-\lambda\end{bmatrix}.$$
With the obvious notation we also have
$$B_{kh}U^{n+1}=A_{kh}U^n,$$
or
$$U^{n+1}=B_{kh}^{-1}A_{kh}U^n=E_{kh}U^n.$$
We denote by $l_{2,h}^0$ the space $l_h^0$ equipped with this inner product and norm, and note
that this space is spanned by the $(M-1)$ vectors $\varphi_p$, $p=1,\dots,M-1$, with
components
$$\varphi_{p,j}=\sqrt2\sin\pi pjh,$$
which satisfy
$$(\varphi_p,\varphi_q)_h=\delta_{pq}=\begin{cases}1&\text{if }p=q,\\0&\text{if }p\ne q.\end{cases}$$
We also observe that the $\varphi_p$ are eigenfunctions of the finite difference operator
$-\partial_x\partial_{\bar x}$, in the sense that
$$-\partial_x\partial_{\bar x}\varphi_{p,j}=\frac{2}{h^2}(1-\cos\pi ph)\varphi_{p,j},\quad j=1,\dots,M-1.$$
We shall now use these notions to discuss the stability of the three difference
schemes considered above. Let $V$ be given initial data in $l_{2,h}^0$. Then
$$V=\sum_{p=1}^{M-1}c_p\varphi_p,\quad\text{where }c_p=(V,\varphi_p)_h.$$
For the backward Euler scheme this gives
$$E(\xi)=\frac{1}{1+2\lambda(1-\cos\xi)}.$$
In this case $0<E(\pi ph)<1$ for all $p$ and $\lambda$, and hence (2.7) is valid for any value of $\lambda$.
Similarly, for the Crank-Nicolson scheme, (2.7) holds with
$$E(\xi)=\frac{1-\lambda(1-\cos\xi)}{1+\lambda(1-\cos\xi)},$$
and we now note that $|E(\xi)|\le 1$ for all $\xi$ and any $\lambda>0$. Thus, the Fourier analysis
method here shows stability in $l_{2,h}^0$ for any $\lambda$. The convergence follows again by the
THEOREM 2.3. Let $u\in C^{4,3}$ be the solution of (2.1) and $U^n$ that of the Crank-Nicolson
scheme (2.5). Then, with $C=C(u,T)$, we have
$$\|U^n-u^n\|_{2,h}\le C(h^2+k^2)\quad\text{for }nk\le T.$$
PROOF. The truncation error
$$\tau^n=k^{-1}(B_{kh}u^{n+1}-A_{kh}u^n)$$
satisfies
$$\|\tau^n\|_{2,h}\le C(u)(h^2+k^2).$$
For the error $z^n=U^n-u^n$ we obtain
$$z^n=-k\sum_{l=0}^{n-1}E_{kh}^{n-1-l}B_{kh}^{-1}\tau^l,$$
from which the result follows by using the stability of the Crank-Nicolson
operator $E_{kh}$ and the boundedness of $B_{kh}^{-1}$. □
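The difference between first- and second-order accuracy in time can be seen directly. The sketch below (NumPy; the mesh sizes and the test solution $e^{-\pi^2t}\sin\pi x$ are illustrative choices, not from the text) advances the backward Euler and Crank-Nicolson schemes with the same large mesh ratio $\lambda=20$ and compares the errors:

```python
import numpy as np

def heat_matrices(M, lam):
    I = np.eye(M - 1)
    D = (np.diag(-2.0 * np.ones(M - 1))
         + np.diag(np.ones(M - 2), 1) + np.diag(np.ones(M - 2), -1))
    # Backward Euler: (I - lam*D) U^{n+1} = U^n.
    # Crank-Nicolson: (I - lam/2*D) U^{n+1} = (I + lam/2*D) U^n.
    return I - lam * D, I + 0.5 * lam * D, I - 0.5 * lam * D

M, k = 20, 0.05
h = 1.0 / M
lam = k / h**2                     # lam = 20: far beyond the explicit limit
B_be, A_cn, B_cn = heat_matrices(M, lam)

x = h * np.arange(1, M)
v = np.sin(np.pi * x)              # exact solution: exp(-pi^2 t) sin(pi x)
U_be, U_cn = v.copy(), v.copy()
nsteps = 10
for _ in range(nsteps):
    U_be = np.linalg.solve(B_be, U_be)
    U_cn = np.linalg.solve(B_cn, A_cn @ U_cn)

exact = np.exp(-np.pi**2 * nsteps * k) * np.sin(np.pi * x)
err_be = np.abs(U_be - exact).max()
err_cn = np.abs(U_cn - exact).max()
print(err_cn < err_be)
```

With this choice of $k$ and $h$ the $O(k)$ time error of backward Euler dominates, while for Crank-Nicolson both contributions are $O(h^2+k^2)$.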
Let us return to the model problem (2.1) and note that because of the simplicity of
the boundary conditions it may be replaced by a pure initial value problem with
periodic initial data. In fact, extending the function v defined on [0, 1] with
v(O) = v(1)= 0 to [- 1, 1] by setting v(-x)= -v(x) and then extending the function
thus obtained by the periodicity requirement $v(x+2)=v(x)$ for all $x\in\mathbb{R}$ we have a
2-periodic odd function, which we may still call $v$. The problem may now be posed as
to find a 2-periodic solution of the pure initial value problem
$$\frac{\partial u}{\partial t}=\frac{\partial^2u}{\partial x^2},\quad x\in\mathbb{R},\ t>0,$$
$$u(x,0)=v(x),\quad x\in\mathbb{R}.$$
The solution of this problem is then easily seen to be 2-periodic and odd in x and
(x- 1) and its restriction to [0, 1] is thus a solution of our original problem (2.1).
For this problem we may again apply the forward Euler method
$$U_j^{n+1}=\lambda U_{j-1}^n+(1-2\lambda)U_j^n+\lambda U_{j+1}^n,\quad j=0,\pm1,\pm2,\dots,\tag{2.8}$$
where $U_j^n$ is the approximation to $u(jh,nk)$. If, as above, $h=1/M$ for some integer $M$,
then if $U^0$ is periodic in $j$ with period $2M$, so is $U^n$ for any $n\ge 0$. The discrete
problem (2.8) thus reduces to $2M$ equations at each time level.
Similarly we may apply the backward Euler equations
$$(1+2\lambda)U_j^{n+1}-\lambda(U_{j-1}^{n+1}+U_{j+1}^{n+1})=U_j^n,\quad j=0,\pm1,\pm2,\dots,\tag{2.9}$$
and look for a solution which is periodic with period $2M$ in $j$, thus reducing (2.9) to
a system of $2M$ equations for $U_j^{n+1}$, $j=-M,\dots,M-1$, say, with the matrix
$$B=\begin{bmatrix}1+2\lambda&-\lambda&0&\cdots&0&-\lambda\\-\lambda&1+2\lambda&-\lambda&\cdots&0&0\\&\ddots&\ddots&\ddots&&\\-\lambda&0&\cdots&0&-\lambda&1+2\lambda\end{bmatrix}.$$
This matrix is clearly invertible since it is diagonally dominant. Similar considerations hold for the Crank-Nicolson method.
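The wrap-around structure of this matrix and the invertibility argument can be sketched as follows (NumPy; $M=8$ and $\lambda=1$ are arbitrary illustrative values):

```python
import numpy as np

M = 8
lam = 1.0
N = 2 * M                       # period 2M in j
B = np.zeros((N, N))
for i in range(N):
    B[i, i] = 1 + 2 * lam
    B[i, (i - 1) % N] = -lam    # wrap-around couplings make B circulant
    B[i, (i + 1) % N] = -lam

# Strict diagonal dominance: 1 + 2*lam > lam + lam in every row,
# so B is invertible.
row_gap = np.abs(np.diag(B)) - (np.abs(B).sum(axis=1) - np.abs(np.diag(B)))
print(row_gap.min() > 0, np.linalg.matrix_rank(B) == N)
```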
In the case that V is odd and satisfies VM = 0 the same holds for U" and the systems
reduce to our old finite-dimensional systems of order (M - 1).
Certain other problems may as well be restated in terms of periodicity requirements.
For instance, consider the problem
$$\frac{\partial u}{\partial t}=\frac{\partial^2u}{\partial x^2}\quad\text{in }[0,1]\text{ for }t>0,$$
$$u(0,t)=0,\qquad\frac{\partial u}{\partial x}(1,t)=0\quad\text{for }t>0,$$
$$u(x,0)=v(x)\quad\text{in }[0,1].$$
In this case we may extend $v$ first to $[0,2]$ by requiring $v(x)=v(2-x)$ and then extend
this function to a 4-periodic odd function on $\mathbb{R}$. Looking for a solution with these
properties will then lead to a solution of our original problem. Again, the finite
difference methods suggested above may be applied to the periodic pure
initial value problem and yield an approximate solution of our boundary value
problem.
Let us therefore, more generally, consider the periodic initial value problem,
which we normalize to have period $2\pi$,
$$\frac{\partial u}{\partial t}=\frac{\partial^2u}{\partial x^2},\quad x\in\mathbb{R},\ t>0,$$
where $h=2\pi/M$ for some integer $M$ and where the sums are finite.
For the analysis of this situation it is again convenient to use Fourier analysis and
to work in the space $L_{2,\#}$ of $2\pi$-periodic functions with norm
$$\|v\|_{2,\#}=\Bigl(\int_0^{2\pi}|v(x)|^2\,dx\Bigr)^{1/2}.$$
For such functions we have the Fourier series representation
$$v(x)=\sum_{j=-\infty}^{\infty}\hat v_je^{ijx},\qquad\text{with }\|v\|_{2,\#}^2=2\pi\sum_{j=-\infty}^{\infty}|\hat v_j|^2.$$
Setting now
$$U^n(x)=\sum_j\hat U_j^ne^{ijx},$$
we find, by (2.10),
$$\sum_jb(jh)\hat U_j^{n+1}e^{ijx}=\sum_ja(jh)\hat U_j^ne^{ijx}.$$
Assuming $b(\xi)\ne 0$ for all real $\xi$ we thus conclude
$$\hat U_j^{n+1}=E(jh)\hat U_j^n,\quad\text{where }E(\xi)=b(\xi)^{-1}a(\xi),$$
and hence
$$U^n(x)=\sum_jE(jh)^n\hat v_je^{ijx}.$$
By Parseval's relation we find for the operator $E_k$ defined by $U^{n+1}=E_kU^n$ that
$$\|E_k^nv\|_{2,\#}=\Bigl(2\pi\sum_j|E(jh)^n\hat v_j|^2\Bigr)^{1/2}\le\sup_j|E(jh)^n|\,\|v\|_{2,\#},$$
and we may again conclude that stability (now with respect to $L_{2,\#}$) holds if and only
if von Neumann's condition $|E(\xi)|\le 1$ is satisfied.
CHAPTER II
earlier sections, are used to derive stability bounds in the maximum-norm for single-
step explicit schemes in the case of a scalar second-order equation (which may have
variable coefficients) in one space dimension. In Section 8 we describe generali-
zations of these results, principally by Widlund, to explicit and implicit, one- and
multistep schemes for parabolic systems of equations in an arbitrary number of
space dimensions.
In the final Section 9 we discuss the rate of convergence of finite difference schemes
for parabolic equations and systems. Here more attention than before is paid to the
relation between the regularity of the data of the continuous problem and the
convergence properties of the approximating scheme. In order to describe this the
Besov spaces $B_{p,q}^s$ are introduced into the analysis. Direct results are presented which
show that a specific degree of regularity together with a given order of accuracy and
parabolicity of the method implies a certain order of convergence. Often this
convergence is of lower order for nonregular data even when the method has high
accuracy, but we also indicate how a preliminary smoothing of the data can restore the
convergence rate lost through low regularity. We finally present examples of inverse
results which make it possible to draw conclusions about the data from the observed
rate of convergence.
In this section we shall, in a more systematic and general way than in Chapter I,
introduce the basic concepts relating to finite difference schemes for the numerical
solution of the pure initial value problem
$$\frac{\partial u}{\partial t}=P(x,t,D)u+f.$$
The solution of this discrete problem may then, in both cases, be written
In many cases we shall assume that $k$ and $h$ are tied together by a relation such as
$k/h^q=\lambda=\text{constant}$, most often with $q=M$, the order of the operator $P(x,t,D)$. We
shall then omit $h$ in the notation and write, for instance, $E_{k,n}$ instead of $E_{k,h,n}$. When
the coefficients are independent of $t$, we may write simply $E_k$, and the solution in such
a case is simply $U^n=E_k^nv$.
For the nonhomogeneous equation we will use difference approximations of the
form
$$B_{k,h}U^{n+1}=A_{k,h}U^n+kM_{k,h}f^n,\quad n=0,1,\dots,\tag{3.3}$$
where $M_{k,h}$ is an operator which could be of the same type as $A_{k,h}$ and $B_{k,h}$, or, e.g., an
integral average such as
In the simple case that the coefficients are independent of $t$, that $k$ and $h$ are related as
above, that $M_{k,h}f^n=f^n$ and $U^0=v$, this reduces to
$$U^n=E_k^nv+k\sum_{j=0}^{n-1}E_k^{n-1-j}f^j.$$
The difference equation is said to be consistent with the differential equation if,
formally, for smooth solutions of the differential equation, the difference equation is
satisfied with an error which is $o(k)$ as $k,h\to 0$. More precisely, introducing the
truncation error
$$\tau_{k,h,n}=k^{-1}(B_{k,h}u^{n+1}-A_{k,h}u^n)-M_{k,h}f^n,$$
the scheme is said to be accurate of order $\mu$ in $x$ and $\nu$ in $t$ if, uniformly for $0\le nk\le T$,
say,
$$\tau_{k,h,n}=O(h^\mu+k^\nu)\quad\text{as }k,h\to 0.\tag{3.5}$$
When $k$ and $h$ are tied to each other as above we simply say that the scheme is
accurate of order $\mu$ (with respect to $x$) if
$$\tau_{k,n}=O(h^\mu)\quad\text{as }k\to 0.\tag{3.6}$$
Note that these concepts are local properties of the operators which may be
checked formally by Taylor expansions and are not dependent on any specific space
of functions. It is clear, however, that the constants in (3.5) and (3.6) will depend on
certain derivatives of the solution considered. For instance, in the case of (3.5) it is
always possible to find an $M_1$ such that, for $nk\le T$,
$$|\tau_{k,h,n}(x)|=|\tau_{k,h,n}(u;x)|\le C(h^\mu+k^\nu)\max_{\alpha+\beta\le M_1}\|D_x^\alpha D_t^\beta u\|_{L_\infty(\mathbb{R}\times[0,T])}.$$
JIu+#<M1
By the definition of the truncation error we have for the exact solution
$$u^{n+1}=E_{k,h,n}u^n+kM_{k,h}f^n+k\tau_{k,h,n}.$$
For a scheme of this type one may write the truncation error of
(3.3) as
$$\tau_{k,h,n}=(u_t-u_{xx})^{n+1/2}-\tfrac1{24}k^2u_{ttt}^{n+1/2}-\tfrac1{12}h^2(u_{xxxx}-u_{tt})^{n+1/2}-M_{k,h}f^n+O(h^3+k^3)\quad\text{as }h,k\to 0,$$
and hence, using the differential equation for the first term,
$$\tau_{k,h,n}=O(h^2+k^2)\quad\text{as }h,k\to 0.$$
For the homogeneous equation this gives, since $u_t=u_{xx}$ and $u_{tt}=u_{xxxx}$, that
$\tau_{k,n}=O(h^4)$, so that the scheme is of fourth order. We now want to choose $M_{k,h}$ in such
a way that the fourth-order accuracy is preserved for the nonhomogeneous
equation. Using the differential equation we now find
$$\tau_{k,n}(x)=\bigl(f^{n+1/2}+\tfrac1{12}h^2(f_{xx}+f_t)^{n+1/2}\bigr)(x)-M_{k,h}f(x)+O(h^4)\quad\text{as }h\to 0,$$
so that the requirement for $M_{k,h}$ is
The concept of stability depends on the normed space $X$; one and the same difference scheme may
be stable with respect to one normed space and unstable with respect to another. The
spaces that occur below are the $L_p$-type spaces with $1\le p\le\infty$. It will turn out that
often stability is easier to show in $L_2$ than in other $L_p$ spaces.
The importance of stability for numerical calculations is clear from the following
result:
THEOREM 3.1. Assume that $E_{k,h,n}$ is stable in $X$ and that $U^n$ and $V^n$ satisfy
$$U^{n+1}=E_{k,h,n}U^n+kF^n$$
and
$$V^{n+1}=E_{k,h,n}V^n+kG^n\quad\text{for }n=0,1,\dots.$$
Then
$$\|U^n-V^n\|\le C\Bigl(\|U^0-V^0\|+k\sum_{j=0}^{n-1}\|F^j-G^j\|\Bigr)\quad\text{for }nk\le T.$$

This follows at once as in (3.7) upon noting that, by linearity, the difference
$Z^n=U^n-V^n$ satisfies
$$Z^{n+1}=E_{k,h,n}Z^n+k(F^n-G^n)\quad\text{for }n\ge 0.$$
Thus, if the $G^j$ and $V^0$ are close to the $F^j$ and $U^0$, respectively, then $V^n$ is close to $U^n$ in the
sense of the above inequality.
sense of the above inequality. As a special case one may prove the following
convergence result:
THEOREM 3.2. Assume that the difference scheme is accurate of order $\mu$ in $x$ and $\nu$ in $t$,
and that $E_{k,h,n}$ is stable with respect to $X$. Then, if $U^0=v$, and if the exact solution $u$ is
sufficiently smooth, we have
$$\|U^n-u^n\|\le C(u,T)(h^\mu+k^\nu)\quad\text{for }0\le nk\le T.$$
This may be demonstrated as follows. By our above definitions of the approximate
solution and of the truncation error we have
$$U^{n+1}-u^{n+1}=E_{k,h,n}(U^n-u^n)-k\tau_{k,h,n},\quad n=0,1,\dots.$$
Since $U^0-u^0=v-v=0$, and since $(B_{k,h})^{-1}$ is assumed bounded in $X$, we have, as in
Theorem 3.1,
$$\|U^n-u^n\|\le CT\max_{j\le n-1}\|\tau_{k,h,j}\|.$$
If we now interpret the assumption that the exact solution is "sufficiently smooth" to
mean that
$$\|\tau_{k,h,n}\|\le C(u,T)(h^\mu+k^\nu)\quad\text{for }nk\le T,$$
the result follows at once.
Clearly, the result as stated is somewhat imprecise in that it does not specify the
regularity assumptions on the exact solution. The correct precise assumptions will
depend on the choice of the normed space $X$ and can often be expressed in terms of
the data of the problem by means of a priori estimates. For instance, for the
homogeneous equation ($f=0$) a natural requirement is that a certain number of
derivatives of the initial data $v$ have finite norm in $X$.
It is often possible to obtain convergence results under less stringent assumptions
on the data. For instance, considering again the homogeneous equation, let $v$ be such
that it may be arbitrarily well approximated by functions $\tilde v$ which are sufficiently smooth for the
conclusion of Theorem 3.2 to be valid. Assume further that the initial value problem
(3.1) is well posed with respect to $X$, so that for the solution
$$\|u(t)\|=\|E(t,0)v\|\le C\|v\|\quad\text{for }0\le t\le T,$$
where $E(t,0)$ denotes the solution operator of (3.1), i.e. the linear operator that takes
the initial data into the solution at time $t$. Then $U^n$ converges to $u(t)=u(nk)=u^n$ as
$n\to\infty$, $nk=t$. For, with $\varepsilon>0$ given, we may choose $\tilde v$ such that $\|v-\tilde v\|\le\varepsilon$
and such that it is smooth enough for Theorem 3.2 to apply. We have then, with
obvious notation,
$$\|U^n-u^n\|=\|(E_{k,h}^n-E(t,0))v\|\le\|(E_{k,h}^n-E(t,0))\tilde v\|+\|(E_{k,h}^n-E(nk,0))(v-\tilde v)\|$$
$$\le C(\varepsilon,T)(h^\mu+k^\nu)+C\varepsilon\le 2C\varepsilon\quad\text{for }k,h\text{ small}.$$
In the case that k and h are tied together as above by a mesh ratio condition and the order of accuracy is μ (with respect to x), the result of the theorem is modified to read

||U^n − u^n|| ≤ C(u, T)h^μ,   0 ≤ nk ≤ T.
We shall return in Section 9 to discuss more precise results concerning the relation
38   V. Thomée   CHAPTER II
between the regularity of the exact solution and the rate of convergence.
It should be mentioned that, theoretically, convergence is possible without
stability if the initial data are sufficiently smooth, cf. e.g. THOMÉE [1969].
The relation between the concepts of consistency, stability and convergence was
examined in an abstract Banach space setting in a celebrated paper by LAX and
RICHTMYER [1956]. We give a brief account of their theory which is concerned with
the case of a time-independent homogeneous equation.
Let X be a Banach space (a normed linear space which is complete with respect to its norm) and consider the initial value problem

du/dt = Pu   for t ≥ 0,
u(0) = v,   (3.8)

where P is a linear operator defined on a dense subset of X. It is assumed that this problem is correctly posed so that, in particular, there exists a solution operator E(t) which takes the initial data v into the value u(t) of the solution at time t, and which is bounded in X, or

||u(t)|| = ||E(t)v|| ≤ C_T ||v||   for 0 ≤ t ≤ T.
The discussion is further restricted to the case of a one-parameter family of "difference" operators E_k, approximating E(k), which is assumed to be consistent with (3.8) in the sense that for each solution u(t) = E(t)v with initial data v in some dense subset S of X one has

k^{−1} ||(E_k − E(k))u(t)|| → 0   as k → 0, uniformly for 0 ≤ t ≤ T.
THEOREM 3.3. Assume E_k is consistent with the correctly posed problem (3.8). Then stability of E_k is the necessary and sufficient condition for convergence of the difference scheme to the solution of (3.8), in the sense that, for any v ∈ S, with u(t) = E(t)v, one has

||E_k^n v − u(t)|| → 0   as k → 0, nk → t, for 0 ≤ t ≤ T.
The proof of the sufficiency of stability for convergence is essentially the same as
our proof of Theorem 3.2 above. The proof of its necessity is based on one of the
fundamental theorems of functional analysis, the Banach-Steinhaus theorem (or the
principle of uniform boundedness).
The Lax equivalence theorem has been generalized to cover a variety of situations.

We now consider multistep (multilevel) schemes, where m ≥ 2. Such a formula may only be used for n ≥ m − 1 and thus requires, in addition to the natural initial condition U^0 = v, that U^1, ..., U^{m−1} be prescribed.
A common way to analyze such a scheme is to reduce it to a two-level or one-step scheme of the form discussed above. For this purpose one introduces the compound vector-valued unknown function

Ũ^n = (U^n, U^{n−1}, ..., U^{n−m+1})^T,

the block matrices

Ã_{k,h} = [ A_{k,h,1}  A_{k,h,2}  ...  A_{k,h,m} ;
            I          0          ...  0         ;
            ...        ...        ...  ...       ;
            0          ...        I    0         ],

B̃_{k,h} = diag(B_{k,h}, I, ..., I),

and the operator

Ẽ_{k,h} = (B̃_{k,h})^{−1} Ã_{k,h},
and define its stability and consistency in the obvious way. In particular, if X is a normed linear space in which we consider our functions, then Ũ^n is sought in the product space X × ... × X with m factors, and the stability is measured with respect to the corresponding norm. In the discussion of the accuracy of the scheme special attention has to be given to the choice of the starting values U^1, ..., U^{m−1}.
Similarly as before, with ũ^n = (u^n, u^{n−1}, ..., u^{n−m+1})^T for the exact solution we now have, for n ≥ m − 1,

Ũ^n − ũ^n = Ẽ^{n−m+1}_{k,h}(Ũ^{m−1} − ũ^{m−1}) − k Σ_{j=m−1}^{n−1} Ẽ^{n−1−j}_{k,h} τ̃_{k,h,j},   (3.9)

where the truncation error τ̃_{k,h,j} only stems from the first components of the compound solutions. It follows that, in the presence of stability, if the truncation error is O(h^μ + k^ν), say, then the global error is of the same order provided Ũ^{m−1} − ũ^{m−1}
matches this. Note that, since the initial error is a local error, a lower order
approximation may be used in the initial steps, because an extra factor k is available
in this term, which is not needed to compensate for the summation in (3.9), see the
example below.
As an illustration, consider for the solution of the model homogeneous one-dimensional heat equation the three-level equation

(U^{n+1}(x) − U^{n−1}(x)) / (2k) = (U^n(x + h) − 2U^n(x) + U^n(x − h)) / h²,   (3.10)

or, if λ = k/h²,

U^{n+1}(x) = 2λU^n(x + h) − 4λU^n(x) + 2λU^n(x − h) + U^{n−1}(x),

or, in compound form with Ũ^n = (U^n, U^{n−1})^T,

Ũ^{n+1}(x) = (Ẽ_{k,h} Ũ^n)(x)
= [ 2λ 0 ; 0 0 ] Ũ^n(x + h) + [ −4λ 1 ; 1 0 ] Ũ^n(x) + [ 2λ 0 ; 0 0 ] Ũ^n(x − h)
for n ≥ 1.
By the symmetry around (x, nk) the exact solution satisfies (3.10) with an error of O(h² + k²), which translates into a truncation error of the same order. After Fourier transformation the one-step form of the scheme amounts to multiplication by the matrix

Ẽ(ξ) = [ −8λ sin²(½ξ)  1 ; 1  0 ].

For any λ the 2×2 matrix entering here has two distinct real eigenvalues, one of which is less than −1. If we choose for the initial vector the corresponding eigenvector, it is clear that Ũ^n is unbounded as n grows and thus the scheme is unstable.
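This instability is easy to reproduce. In the sketch below (an illustration added here, not from the text) the three-level scheme (3.10) is run with smooth initial data plus a tiny perturbation standing in for rounding errors; the eigenvector component with eigenvalue below −1 is excited and the solution explodes.

```python
import numpy as np

def richardson_max(M=50, lam=0.4, nsteps=200):
    # Three-level scheme (3.10): U^{n+1} = U^{n-1} + 2*lam*(second difference
    # of U^n), lam = k/h^2, on a periodic mesh with M points in [0, 2*pi).
    h = 2 * np.pi / M
    k = lam * h * h
    x = h * np.arange(M)
    U_prev = np.sin(x)                 # U^0 = v ...
    U_prev[0] += 1e-12                 # ... plus a tiny perturbation
    U = np.exp(-k) * np.sin(x)         # U^1 taken from the exact solution
    for _ in range(nsteps):
        U_next = U_prev + 2 * lam * (np.roll(U, -1) - 2 * U + np.roll(U, 1))
        U_prev, U = U, U_next
    return np.max(np.abs(U))

blowup = richardson_max()              # grows without bound with nsteps
```

The perturbation of size 1e-12 is amplified by a factor greater than 3 per step in the worst mode, so after 200 steps the numerical solution is astronomically large.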
As we shall see in a later chapter this scheme may be stabilized, for any constant λ, by replacing U^n(x) by the average ½(U^{n+1}(x) + U^{n−1}(x)), so that the scheme becomes

(U^{n+1}(x) − U^{n−1}(x)) / (2k) = (U^n(x + h) − U^{n+1}(x) − U^{n−1}(x) + U^n(x − h)) / h².   (3.12)

Consistency with the heat equation therefore requires that k/h tends to zero, which is the case e.g. if k/h² = λ = constant. However, if instead k/h = λ̃ = constant, we obtain instead a scheme consistent with the hyperbolic equation λ̃² ∂²u/∂t² + ∂u/∂t = ∂²u/∂x².
In this section we shall use the Fourier transform systematically to express and
analyze the notions of consistency, stability and convergence for constant coefficient
single-step finite difference methods applied to the pure initial value problem for
a homogeneous parabolic equation, or system of equations, in d space dimensions.
As we have seen earlier such material is also relevant to the study of initial boundary
value problems when the boundary conditions may be interpreted as periodicity
conditions.
First, we define the Fourier transform over R^d by

v̂(ξ) = ∫_{R^d} v(x) e^{−i x·ξ} dx,

where, as always below, we assume that v is small enough for large |x| that the definitions and subsequent calculations are justified. We recall Fourier's inversion formula,

v(x) = (2π)^{−d} ∫_{R^d} v̂(ξ) e^{i x·ξ} dξ.
We consider the initial value problem

∂u/∂t = P(D)u ≡ Σ_{|α|≤M} p_α D^α u   for x ∈ R^d, t > 0,   (4.2)
u(x, 0) = v(x)   in R^d,
where v is sufficiently smooth and small for large |x|, in a way that we shall not make precise at present. We shall focus our attention on equations which are parabolic, and begin by a definition of this concept which generalizes the heat equation
∂u/∂t = Δu,

which we have considered above, and the more general case when P(D) is a second-order elliptic operator,

P(D)u = Σ_{j,k=1}^d p_{jk} ∂²u/∂x_j∂x_k + Σ_{j=1}^d p_j ∂u/∂x_j + p_0 u,   (4.3)

where (p_{jk}) is a symmetric positive definite constant d×d matrix with real elements. For this purpose we introduce the characteristic polynomial of P(D),

P(ξ) = Σ_{|α|≤M} p_α ξ^α,

and require, with c > 0,

Re P(iξ) ≤ −c|ξ|^M + C   for ξ ∈ R^d.   (4.4)
We recall that this is the same as saying that −P(D) is strongly elliptic. Sometimes, we shall allow (4.2) to be an N×N system. In this case the coefficients p_α and the polynomial P(ξ) are N×N matrices and the condition (4.4) is replaced by

Λ(P(iξ)) ≤ −c|ξ|^M + C,   c > 0, ξ ∈ R^d,   (4.5)

where

Λ(A) = max_j Re λ_j,

if {λ_j}_{j=1}^N are the eigenvalues of A. Clearly (4.4) is satisfied with M = 2 for the second-order operator in (4.3) since then

Re P(iξ) = − Σ_{j,k=1}^d p_{jk} ξ_j ξ_k − Σ_{j=1}^d (Im p_j) ξ_j + Re p_0
≤ −c|ξ|² + C(|ξ| + 1) ≤ −c₁|ξ|² + C₁,   ξ ∈ R^d.
Equations for which (4.4) or (4.5) are satisfied are referred to below as parabolic in
the sense of PETROVSKII [1937].
Letting û(ξ, t) denote the Fourier transform with respect to x of the solution at time t, we obtain by virtue of (4.1) the ordinary differential equation

dû(ξ, t)/dt = P(iξ) û(ξ, t),   û(ξ, 0) = v̂(ξ),

so that û(ξ, t) = e^{tP(iξ)} v̂(ξ) and, by the parabolicity condition,

|e^{tP(iξ)}| ≤ C   for ξ ∈ R^d, 0 ≤ t ≤ T,   (4.6)

so that, in particular, the initial value problem (4.2) is correctly posed in L₂. This estimate is valid also in the case of a system if we interpret the modulus in (4.6) as the matrix norm subordinate to the vector norm used. In fact, for an N×N matrix A we have with the above notation (cf. e.g. GELFAND and SCHILOW [1964])

|e^{tA}| ≤ e^{tΛ(A)} Σ_{j=0}^{N−1} (2t|A|)^j / j!.

We then consider finite difference operators of the form

A_k v(x) = Σ_β a_β v(x − βh),

where β = (β_1, ..., β_d) has integer components and the sum is finite, and where a_β = a_β(k, h) = a_β(λh^M, h) are N×N matrices in the system case. In accordance with
our previous discussion we shall consider that x varies over all of R^d and not just over mesh points βh. With B_k a similar operator one may define an implicit scheme by

B_k U^{n+1} = A_k U^n   for n ≥ 0,
U^0 = v,   (4.10)

or

U^{n+1} = E_k U^n = B_k^{−1} A_k U^n,

where B_k is the identity operator in the explicit case and assumed invertible in the implicit case.
Introducing the symbols of A_k and B_k by

A_k(ξ) = Σ_β a_β e^{−iβ·hξ}   and   B_k(ξ) = Σ_β b_β e^{−iβ·hξ},

we may set

E_k(ξ) = B_k(ξ)^{−1} A_k(ξ) = Σ_β e_β e^{−iβ·hξ},   (4.11)

and find from (4.10) for the Fourier transform of the discrete solution

Û^n(ξ) = (E_k^n v)^(ξ) = E_k(ξ)^n v̂(ξ).   (4.12)

In the implicit case we assume, in order that B_k be invertible on L₂, that B_k(ξ) ≠ 0 (or, in the case of a system, that det B_k(ξ) ≠ 0), uniformly for small k. The Fourier series in (4.11) then has infinitely many terms.
We recall that the difference scheme (4.10) is consistent with (4.2) if the difference equation (4.10) is satisfied by the exact solution to order o(k), and that it is accurate of order μ if this error is of order O(kh^μ) as k tends to zero.
The concepts of consistency and accuracy may be expressed in terms of the symbols in the following way, as may be seen by using Taylor expansions:

THEOREM 4.1. The operator E_k = B_k^{−1} A_k is consistent with (4.2) if and only if

E_k(h^{−1}ξ) = exp(kP(ih^{−1}ξ)) + o(h^M + |ξ|^M)   as k, ξ → 0,   (4.13)

and it is accurate of order μ if and only if the remainder is O(h^{M+μ} + |ξ|^{M+μ}).
For certain extremely smooth initial data it turns out that, theoretically, consistency is all that is needed for convergence. For instance, if v is such that v̂ has compact support, then for nk ≤ T,

||U^n − u^n|| ≤ C(v) sup_{ξ ∈ supp v̂} |E_k(ξ)^n − exp(nkP(iξ))|,   (4.14)

and the right-hand side may be analyzed by means of the identity

E_k(ξ)^n − exp(nkP(iξ)) = Σ_{j=0}^{n−1} E_k(ξ)^{n−1−j} (E_k(ξ) − exp(kP(iξ))) exp(jkP(iξ)).
THEOREM 4.2. The operator E_k = B_k^{−1} A_k is stable in L₂ if and only if for some positive constants C, κ and k₀ we have

|E_k(ξ)^n| ≤ C e^{κnk}   for ξ ∈ R^d, n ≥ 0, k ≤ k₀.   (4.15)

This follows by Parseval's relation, since the L₂ operator norm of E_k^n equals sup_ξ |E_k(ξ)^n|, with equality for the appropriate v, from which the result follows. In the system case the modulus is the subordinate matrix norm, so that stability is translated into the uniform boundedness of the powers of a family of matrices depending on the two parameters ξ and k.
In the event that P(D) does not contain any lower-order terms, it is often possible to choose the coefficients of E_k independent of h and k. In this case E_k(h^{−1}ξ) is independent of h, and (4.13) reduces to

E(ξ) = E_k(h^{−1}ξ) = exp(λP(iξ)) + O(|ξ|^{M+μ})   as ξ → 0,   (4.16)

and the stability condition (4.15) to

|E(ξ)^n| ≤ C   for ξ ∈ R^d, n ≥ 0.   (4.17)
In the scalar case, (4.15) is equivalent to

|E_k(ξ)| ≤ 1 + Ck,   ξ ∈ R^d, k ≤ k₀,

which we refer to as von Neumann's condition for stability; in the special case considered in (4.17) it reduces to

|E(ξ)| ≤ 1,   ξ ∈ R^d,

which we recognize from Section 1.
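As an illustration (a sketch added here; the θ-method symbol for u_t = u_xx is the assumed example), von Neumann's condition can be checked by sampling the symbol E(ξ) = (1 − 4(1−θ)λ sin²(½ξ))/(1 + 4θλ sin²(½ξ)) over one period: |E(ξ)| ≤ 1 for all ξ exactly when (1 − 2θ)λ ≤ ½.

```python
import numpy as np

def symbol_max(theta, lam, npts=2001):
    # Largest modulus of the theta-method symbol over one period.
    xi = np.linspace(-np.pi, np.pi, npts)
    s = np.sin(xi / 2) ** 2
    E = (1 - 4 * (1 - theta) * lam * s) / (1 + 4 * theta * lam * s)
    return np.max(np.abs(E))

m_euler_ok = symbol_max(0.0, 0.5)    # forward Euler at the stability limit
m_euler_bad = symbol_max(0.0, 0.6)   # violates (1 - 2*theta)*lam <= 1/2
m_cn = symbol_max(0.5, 10.0)         # Crank-Nicolson: stable for every lam
```

The violation shows up at the highest frequency ξ = π, where the symbol falls below −1.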
In order to discuss the matrix case, we introduce for an N×N matrix A its spectral radius

ρ(A) = max_j |λ_j|,

where {λ_j}_{j=1}^N are the eigenvalues of A. We then immediately find, since, for any A, ρ(A) ≤ |A|, that a necessary condition for (4.15) to hold is that for some positive C and k₀,

ρ(E_k(ξ)) ≤ 1 + Ck   for ξ ∈ R^d, k ≤ k₀.   (4.18)

This condition is again referred to as von Neumann's condition for stability; in the special case (4.17) it reduces to

ρ(E(ξ)) ≤ 1   for ξ ∈ R^d.
Von Neumann's condition is, however, not sufficient for stability in the matrix
case. Conditions which are both necessary and sufficient for inequalities such as
(4.17) (or (4.16)) to hold have been given in KREISS [1962] and in subsequent
literature, mainly with applications to hyperbolic problems in mind. These condi-
tions require, in addition to von Neumann's condition, that the behavior of the
powers of E_k(ξ) is determined, in some sense, by its eigenvalues. A sufficient condition is that E_k(ξ) be a normal matrix (i.e. that it commutes with its adjoint E_k(ξ)*) so that it is unitarily equivalent with a diagonal matrix with the eigenvalues
as entries. Other sufficient conditions assume that the matrix is uniformly equivalent
to a triangular matrix and that certain conditions for the off-diagonal elements of
the latter are satisfied. We shall not describe the details here but refer to e.g.
RICHTMYER and MORTON [1967] for an account of this work.
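A minimal numerical aside (added here) on why von Neumann's condition alone cannot suffice in the matrix case: the non-normal matrix below has spectral radius 1, yet its powers grow linearly, so a symbol family containing such blocks cannot be uniformly power-bounded.

```python
import numpy as np

# A Jordan-type block: rho(A) = 1 but A is not normal (A A* != A* A).
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
rho = max(abs(np.linalg.eigvals(A)))
norms = [np.linalg.norm(np.linalg.matrix_power(A, n), 2) for n in (1, 10, 100)]
# A^n = [[1, n], [0, 1]], so the spectral norms grow like n although rho(A) = 1.
```

This is exactly the phenomenon the Kreiss conditions rule out by controlling the off-diagonal elements of the triangular form.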
THEOREM 4.3. Let the operator E_k = B_k^{−1} A_k be consistent with (4.2) and parabolic in the sense of John. Then it is stable in L₂.

THEOREM 4.4. Let E_k be consistent with (4.2) and parabolic in the sense of John. Then, for any multi-index β,

||∂_h^β U^n|| ≤ C(nk)^{−|β|/M} ||v||   for 0 < nk ≤ T,

where the forward difference quotients ∂_h^β have symbols given by

(∂_h^β v)^(ξ) = Π_{j=1}^d ((e^{ihξ_j} − 1)/h)^{β_j} v̂(ξ).
As we know from the Lax equivalence theorem, consistency and stability together imply convergence. We shall now give a more precise result about the rate of convergence and consider first the case that the difference scheme is parabolic. Here we shall employ the Sobolev space H^μ = H^μ(R^d), with norm

||v||_{H^μ} = ( Σ_{|α|≤μ} ∫_{R^d} |D^α v|² dx )^{1/2}.

THEOREM 4.5. Let E_k be consistent with (4.2), accurate of order μ, and parabolic in John's sense. Then

||U^n − u^n|| ≤ C h^μ ||v||_{H^μ}   for nk ≤ T.
This follows since, for h|ξ_j| ≤ π, j = 1, ..., d, and nk ≤ T, we have

|E_k(ξ)^n − exp(nkP(iξ))| ≤ C nk h^μ (1 + |ξ|^{M+μ}) e^{−cnk|ξ|^M} ≤ C h^μ (1 + |ξ|^μ),

so that, by Parseval's relation,

||U^n − u^n||² ≤ C h^{2μ} ∫_{R^d} (1 + |ξ|^μ)² |v̂(ξ)|² dξ ≤ C h^{2μ} ||v||²_{H^μ},

which is the desired result.
THEOREM 4.6. Let E_k be consistent with (4.2), accurate of order μ and L₂-stable. Then for any ε > 0, we have, with C = C_ε,

||U^n − u^n|| ≤ C h^μ ||v||_{H^{μ+ε}}   for nk ≤ T.
PROOF. If E_k is only stable and not parabolic, we have to replace the use of the estimate (4.22) in (4.23) by simply the boundedness of E_k(ξ)^j, and now obtain

|E_k(ξ)^n − exp(nkP(iξ))| ≤ C k h^μ Σ_{j=0}^{n−1} (1 + |ξ|^{M+μ}) e^{−cjk|ξ|^M} ≤ C h^μ (1 + |ξ|^{μ+ε})

for h|ξ_j| ≤ π, j = 1, ..., d, nk ≤ T. Using again (4.24) for the remaining ξ, this completes the proof by Parseval's relation.
We shall also need difference approximations of differential operators. Let

Q_h v(x) = h^{−q} Σ_β q_β(h) v(x − βh),

where the sum is finite and the q_β(h) are polynomials in h. Assume Q_h is consistent with the differential operator

Q(D)v = Σ_{|α|≤q} Q_α D^α v,

and accurate of order μ, so that, for smooth v,

Q_h v(x) = Q(D)v(x) + O(h^μ)   as h → 0.

Introducing the symbol of Q_h,

Q_h(ξ) = h^{−q} Σ_β q_β(h) e^{−iβ·hξ},
Using these properties it is now easy to prove the following result, which is related to property (4.21) and assumes the same regularity of v as Theorem 4.5:

THEOREM 4.7. Assume E_k is consistent with (4.2), accurate of order μ and parabolic in John's sense, and that Q_h is consistent with Q(D) and accurate of order μ. Then

||Q_h U^n − Q(D)u^n|| ≤ C h^μ (nk)^{−q/M} ||v||_{H^μ}.

For even more regular initial data, v ∈ H^{q+μ}, say, the negative power of t = nk disappears in the above error estimate.
We want to consider briefly the case that the initial data of (4.2) are periodic of period 1, say. In this case, rather than working with Fourier integrals as above, it is natural to work with Fourier series. Thus, in this case v may be developed as

v(x) = Σ_γ v̂_γ e^{2πiγ·x},

where the γ are multi-indices with integer components and where, with Q the unit cube [0, 1]^d,

v̂_γ = ∫_Q v(x) e^{−2πiγ·x} dx,

so that, by Parseval's relation,

||v||²_{L₂(Q)} = Σ_γ |v̂_γ|².

Applying now a finite difference method of the form (4.9), say, with h = 1/ν for some positive integer ν and k = λh^M, we find, with our old notation,

U^n(x) = Σ_γ v̂_γ E_k(2πγ)^n e^{2πiγ·x},

and hence

||U^n||²_{L₂(Q)} = Σ_γ |E_k(2πγ)^n v̂_γ|² ≤ max_γ |E_k(2πγ)^n|² ||v||²_{L₂(Q)}

for all sufficiently small k. In particular, our previous stability criterion (4.15) shows stability also in the present case.
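In the periodic case the formula for U^n can be applied literally via the FFT. The sketch below (an added illustration; the Crank–Nicolson scheme for the one-dimensional heat equation is the assumed example) multiplies each discrete Fourier mode by the n-th power of the symbol and compares with the exact solution.

```python
import numpy as np

def cn_periodic(M=64, lam=1.0, nsteps=200):
    # u_t = u_xx on the unit interval with periodic data v(x) = sin(2*pi*x);
    # the exact solution is u(x, t) = exp(-4*pi^2*t) * sin(2*pi*x).
    h = 1.0 / M
    k = lam * h * h
    x = h * np.arange(M)
    v = np.sin(2 * np.pi * x)
    gamma = np.fft.fftfreq(M) * M              # integer wave numbers
    c = 1 - np.cos(2 * np.pi * gamma * h)
    E = (1 - lam * c) / (1 + lam * c)          # Crank-Nicolson symbol, |E| <= 1
    Uhat = np.fft.fft(v) * E ** nsteps         # apply E_k(2*pi*gamma)^n
    U = np.real(np.fft.ifft(Uhat))
    exact = np.exp(-4 * np.pi ** 2 * nsteps * k) * np.sin(2 * np.pi * x)
    return np.max(np.abs(U - exact)), np.max(np.abs(E))

err, stab = cn_periodic()
```

Since |E| ≤ 1 for every mode, the computation is stable even with λ = 1, and the error reflects only the discretization accuracy.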
As an example, consider again the θ-method for the one-dimensional heat equation, with λ = k/h² kept constant. Using 4 sin²(½ξ) = ξ² − (1/12)ξ⁴ + (1/360)ξ⁶ + O(ξ⁸), its symbol may be written

E(ξ) = (1 − (1 − θ)λ(ξ² − (1/12)ξ⁴ + (1/360)ξ⁶)) / (1 + θλ(ξ² − (1/12)ξ⁴ + (1/360)ξ⁶)) + O(ξ⁸).

The scheme is accurate of order 4, or

E(ξ) = e^{−λξ²} + O(ξ⁶)   as ξ → 0,

if

(1 − 2θ)λ = 1/6,

or

θ = 1/2 − 1/(12λ).

Note that this is in the stable range, as now

(1 − 2θ)λ = 1/6 ≤ 1/2.

Finally, the scheme is of order 6, or

E(ξ) = e^{−λξ²} + O(ξ⁸)   as ξ → 0,

if, in addition, the coefficients of ξ⁶ match, which happens precisely when λ² = 1/20, i.e. λ = 1/√20 ≈ 0.224, and then θ ≈ 0.127. Note that this value of λ is relatively small, so that the scheme then requires many time steps. This choice is, however, in the stable range as above. We emphasize once more that the investigation presupposes that λ = k/h² is kept constant.
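The order conditions above can be checked numerically (an added sketch, not from the text): with θ = ½ − 1/(12λ) the gap E(ξ) − e^{−λξ²} should shrink like ξ⁶ as ξ → 0, whereas for Crank–Nicolson (θ = ½) it shrinks only like ξ⁴.

```python
import numpy as np

def symbol_gap(xi, lam, theta):
    # |E(xi) - exp(-lam*xi^2)| for the theta-method symbol, lam = k/h^2.
    s = 4 * np.sin(xi / 2) ** 2
    E = (1 - (1 - theta) * lam * s) / (1 + theta * lam * s)
    return abs(E - np.exp(-lam * xi ** 2))

lam = 1.0
theta4 = 0.5 - 1 / (12 * lam)        # fourth-order choice of theta
order4 = np.log2(symbol_gap(0.2, lam, theta4) / symbol_gap(0.1, lam, theta4))
order_cn = np.log2(symbol_gap(0.2, lam, 0.5) / symbol_gap(0.1, lam, 0.5))
```

Halving ξ divides the gap by about 2⁶ in the first case and 2⁴ in the second, confirming the two expansions.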
The stability considerations above may be generalized to several space dimensions to apply to the equation

∂u/∂t = Σ_{j=1}^d ∂²u/∂x_j².

Setting

Δ_h V(x) = Σ_{j=1}^d ∂_{x_j}∂̄_{x_j} V(x),

where ∂_{x_j} and ∂̄_{x_j} denote the forward and backward difference quotients in the x_j direction, one may consider the θ-method

(U^{n+1} − U^n)/k = Δ_h(θU^{n+1} + (1 − θ)U^n).

For θ = 0, 1, and ½ these are again referred to as the forward Euler, the backward Euler and the Crank–Nicolson methods.
We conclude that the difference scheme is always second-order accurate and never
of higher order if d> 1.
Returning to the one-dimensional equation (5.1) we shall show now that for any given positive integer ν there exists a unique explicit method for (5.1) of accuracy μ = 2ν using the 2ν + 1 points x + jh, j = −ν, ..., ν. This method will be shown to be parabolic for λ sufficiently small.
We shall prove this by constructing an even trigonometric polynomial E(ξ, λ) of order 2ν such that

E(ξ, λ) = e^{−λξ²} + O(ξ^{2ν+2})   as ξ → 0,   (5.3)

and such that, for small λ,

|E(ξ, λ)| < 1   for 0 < |ξ| ≤ π.   (5.4)

Clearly a trigonometric polynomial of order 2ν is uniquely determined by (5.3).
We start the construction by noting that in the MacLaurin expansion

z = arcsin(sin z) = Σ_{j=0}^∞ b_{2j+1} (sin z)^{2j+1}   with b_{2j+1} ≥ 0,

and, by taking the (2l)th power and replacing z by ½ξ,

ξ^{2l} = Σ_{j≥l} β_{l,j} (1 − cos ξ)^j   with β_{l,j} ≥ 0.

Now E(ξ, λ) is obtained by inserting these expansions into the Taylor series

e^{−λξ²} = Σ_{l=0}^∞ ((−λ)^l / l!) ξ^{2l}

and retaining only the terms with (1 − cos ξ)^j, j ≤ ν.
Further, by (5.6), we find that E(ξ, λ) may be written as a function r of the variable z = λQ(ξ), and if

|r(z)| < 1   for −τ ≤ z < 0,   (5.11)

then

|E(ξ)| < 1   for 0 < |ξ| ≤ π,

provided λ is so small that λQ(ξ) ≥ −τ for all ξ, so that the method is then parabolic.
For instance, the θ-method considered above is of this form with

Q(ξ) = −2(1 − cos ξ) = −ξ² + O(ξ⁴)   as ξ → 0,

and

r(z) = (1 + (1 − θ)z)/(1 − θz).
Introducing the Fourier transform Û(ξ, t) of U(x, t) with respect to x, we find for this function the corresponding ordinary differential equation, where μ is the order of accuracy. The finite difference operator (5.15) is said to be elliptic (cf. THOMÉE [1964]) if its symbol Q̃(ξ) vanishes only at ξ = 0 in the period cube |ξ_j| ≤ π, j = 1, ..., d, and then E(ξ) = r(λQ̃(ξ)) with r(z) as above. However, as Q̃(ξ) = 0 if, for instance, ξ = (π, ..., π), this operator is not elliptic in the above sense, and the corresponding finite difference scheme for (5.14) is thus not parabolic. In order to make Q_h elliptic one may modify the definition by adding a difference operator whose symbol is of order O(|ξ|^{2M}) as ξ → 0, which does not change the consistency.
The operator thus defined uses other than the closest possible neighbors of the mesh point x. For instance, for the second-order operator

P(D)u = Σ_{j,k=1}^d p_{jk} ∂²u/∂x_j∂x_k,

the term p_{11} ∂²u/∂x₁² is replaced by

p_{11} (u(x + 2he₁) − 2u(x) + u(x − 2he₁)) / (4h²).
Another possible elliptic finite difference operator which does not have this disadvantage is

Q_h = Σ_{j,k=1}^d p_{jk} ∂_{x_j} ∂̄_{x_k},

with symbol Q̃(ξ), which is elliptic, as

−h² Re Q̃(ξ) = Σ_{j,k} p_{jk}(1 − cos ξ_j)(1 − cos ξ_k) + Σ_{j,k} p_{jk} sin ξ_j sin ξ_k
≥ c Σ_{j=1}^d ((1 − cos ξ_j)² + sin² ξ_j) = 2c Σ_{j=1}^d (1 − cos ξ_j),

which is easily seen to imply that Q_h is elliptic. For d = 2 the latter choice corresponds to

Q_h V(x) = Σ_{j=1}^2 p_{jj} (V(x + he_j) − 2V(x) + V(x − he_j))/h² + p_{12}(∂_{x₁}∂̄_{x₂} + ∂_{x₂}∂̄_{x₁})V(x).
Similarly one finds that the operator of order 2ν constructed above, with symbol Q̃(ξ) = −ξ² + O(ξ^{2ν+2}) as ξ → 0, is elliptic.
The above makes it natural to ask for rational functions r(z) which approximate e^z near z = 0 and which satisfy the appropriate boundedness conditions. The most commonly used functions of this type are the Padé approximants (cf. e.g. VARGA [1961, 1962]) which are defined by

r_{p,q}(z) = n_{p,q}(z)/d_{p,q}(z) = e^z + O(z^{p+q+1})   as z → 0,

where d_{p,q} and n_{p,q} are polynomials of degree p and q, respectively. One may show that these polynomials are uniquely determined by

n_{p,q}(z) = Σ_{j=0}^q ((p + q − j)! q!)/((p + q)! j! (q − j)!) z^j

and

d_{p,q}(z) = Σ_{j=0}^p ((p + q − j)! p!)/((p + q)! j! (p − j)!) (−z)^j.

For p < q, r_{p,q}(z) is not bounded for z ≤ 0, but there is a positive τ_{p,q} such that

|r_{p,q}(z)| ≤ 1   for −τ_{p,q} ≤ z ≤ 0.
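Taking the displayed coefficient formula at face value (so the closed forms below, including d_{p,q}(z) = n_{q,p}(−z), are assumptions of this sketch), the approximation order O(z^{p+q+1}) can be verified numerically:

```python
import math

def pade_num(p, q, z):
    # Assumed closed form:
    # n_{p,q}(z) = sum_{j=0}^{q} (p+q-j)! q! / ((p+q)! j! (q-j)!) * z^j
    return sum(math.factorial(p + q - j) * math.factorial(q)
               / (math.factorial(p + q) * math.factorial(j) * math.factorial(q - j))
               * z ** j for j in range(q + 1))

def pade(p, q, z):
    # r_{p,q}(z) = n_{p,q}(z) / d_{p,q}(z), with d_{p,q}(z) = n_{q,p}(-z).
    return pade_num(p, q, z) / pade_num(q, p, -z)

# p = q = 1 gives (1 + z/2)/(1 - z/2), the Crank-Nicolson rational function;
# the error against exp(z) is O(z^3), i.e. it drops by ~8 when z is halved.
e1 = abs(pade(1, 1, 0.1) - math.exp(0.1))
e2 = abs(pade(1, 1, 0.05) - math.exp(0.05))
order = math.log(e1 / e2, 2)
```

With p = 0, q = 1 the same formulas reproduce the explicit Euler factor 1 + z, and with p = 1, q = 0 the backward Euler factor 1/(1 − z).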
For the model equation (5.1), the second of these corresponds to the explicit scheme
given by
where

Δ_h V = ∂_{x₁}∂̄_{x₁} V + ∂_{x₂}∂̄_{x₂} V,
and observe that the strictly two-dimensional discrete elliptic operator I − ½kΔ_h is involved in solving for U^{n+1}. The purpose of the alternating direction method is to reduce the computational labor by solving instead two one-dimensional equations. For this purpose we introduce an intermediate value U^{n+1/2} for the solution at t = (n + ½)k by the equations

(U^{n+1/2} − U^n) / (½k) = ∂_{x₁}∂̄_{x₁} U^{n+1/2} + ∂_{x₂}∂̄_{x₂} U^n,

(U^{n+1} − U^{n+1/2}) / (½k) = ∂_{x₁}∂̄_{x₁} U^{n+1/2} + ∂_{x₂}∂̄_{x₂} U^{n+1}.

The first step is thus implicit with respect to the x₁ variable and explicit with respect to x₂, and in the second step the roles of the variables are reversed. Elimination of U^{n+1/2} gives, since the various operators commute,

U^{n+1} = E_k U^n
= (I − ½k ∂_{x₂}∂̄_{x₂})^{−1} (I + ½k ∂_{x₁}∂̄_{x₁}) (I − ½k ∂_{x₁}∂̄_{x₁})^{−1} (I + ½k ∂_{x₂}∂̄_{x₂}) U^n,

so that the scheme is second-order accurate. (In fact, the scheme is easily seen to be second-order accurate in both space and time when h and k are allowed to vary independently of each other.)
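A compact sketch of one alternating direction step of the kind just described (added here; the dense solves and homogeneous Dirichlet data are simplifying assumptions — in practice the one-dimensional systems are tridiagonal):

```python
import numpy as np

def adi_step(U, lam):
    # One alternating direction step for u_t = u_xx + u_yy on the unit square,
    # homogeneous Dirichlet data; U holds the m-by-m interior values, lam = k/h^2.
    m = U.shape[0]
    T = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
         + np.diag(np.ones(m - 1), -1))        # h^2 * (1D second difference)
    A = np.eye(m) - 0.5 * lam * T              # I - (k/2) * second difference
    B = np.eye(m) + 0.5 * lam * T              # I + (k/2) * second difference
    W = np.linalg.solve(A, U @ B)              # implicit in x, explicit in y
    return np.linalg.solve(A, (B @ W).T).T     # implicit in y, explicit in x

def adi_heat_error(m=31, lam=2.0, nsteps=40):
    h = 1.0 / (m + 1)
    k = lam * h * h
    g = h * np.arange(1, m + 1)
    X, Y = np.meshgrid(g, g, indexing="ij")
    U = np.sin(np.pi * X) * np.sin(np.pi * Y)  # eigenfunction initial data
    for _ in range(nsteps):
        U = adi_step(U, lam)
    exact = (np.exp(-2 * np.pi ** 2 * nsteps * k)
             * np.sin(np.pi * X) * np.sin(np.pi * Y))
    return np.max(np.abs(U - exact))

err = adi_heat_error()
```

Each step solves only one-dimensional systems in each direction, yet remains stable for λ = 2, well beyond the explicit limit.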
We shall derive this method in a slightly different way, which immediately generalizes to the heat equation in several space dimensions,

∂u/∂t = Δu = Σ_{j=1}^d ∂²u/∂x_j².   (5.16)

The method is now referred to as the fractional step method and uses the auxiliary values U^{n+j/d}, j = 1, ..., d, with the final value U^{n+1}, defined by

(U^{n+j/d} − U^{n+(j−1)/d}) / k = ∂_{x_j}∂̄_{x_j} ((U^{n+j/d} + U^{n+(j−1)/d}) / 2),

or

U^{n+j/d} = (I − ½k ∂_{x_j}∂̄_{x_j})^{−1} (I + ½k ∂_{x_j}∂̄_{x_j}) U^{n+(j−1)/d}   for j = 1, ..., d,

and thus, again with commuting operators,

U^{n+1} = E_k U^n = Π_{j=1}^d (I − ½k ∂_{x_j}∂̄_{x_j})^{−1} (I + ½k ∂_{x_j}∂̄_{x_j}) U^n.

In this approach one thus approximates the operators in the sum in (5.16) separately, and obtains U^{n+1} as a product of one-dimensional L₂-stable operators acting on U^n. We have again for the symbol
U^j = v_j   for j = 0, ..., m − 1,

where the v_j are determined in some fashion from the initial data v. We shall assume that the A_{k,l} and B_k are finite difference operators with constant coefficients,

A_{k,l} v(x) = Σ_β a_{lβ} v(x − βh)   for l = 1, ..., m,   B_k v(x) = Σ_β b_β v(x − βh),

where the a_{lβ} and b_β are constant and the sums finite.
Introducing as in Section 3 the vectors Ũ^n = (U^n, U^{n−1}, ..., U^{n−m+1})^T and the corresponding matrices Ã_k and B̃_k, equation (6.2) takes the form

B̃_k Ũ^{n+1} = Ã_k Ũ^n   for n ≥ m − 1,

or

Ũ^{n+1} = Ẽ_k Ũ^n   for n ≥ m − 1,   (6.3)

where Ẽ_k = B̃_k^{−1} Ã_k. We say that (6.2) is stable in L₂ if (6.3) is stable in the product space L₂^m, so that the powers of the operator Ẽ_k are bounded on L₂^m. Introducing the symbols or characteristic polynomials of the operators B_k and A_{k,l},

b(ξ) = Σ_β b_β e^{−iβ·ξ}   and   a_l(ξ) = Σ_β a_{lβ} e^{−iβ·ξ},   l = 1, ..., m, ξ ∈ R^d,

the symbol of Ẽ_k becomes the companion-type matrix

Ẽ(ξ) = [ a₁(ξ)/b(ξ)  a₂(ξ)/b(ξ)  ...  a_m(ξ)/b(ξ) ;
         1            0           ...  0            ;
         0            1           ...  0            ;
         ...                                        ;
         0            ...         1    0            ].

(We assume that B_k is invertible in L₂ so that b(ξ) ≠ 0.) The L₂-stability of our multistep scheme therefore reduces to the uniform boundedness in ξ of the powers of Ẽ(ξ),
and von Neumann's condition requires ρ(Ẽ(ξ)) ≤ 1, where ρ denotes the spectral radius; we say that (6.2) is parabolic (in the sense of John) if, with c > 0,

ρ(Ẽ(ξ)) ≤ exp(−c|ξ|^M)   for |ξ_j| ≤ π, j = 1, ..., d.

We note that μ is an eigenvalue of Ẽ(ξ) if and only if μ satisfies the characteristic equation

b(ξ) μ^m − Σ_{l=1}^m a_l(ξ) μ^{m−l} = 0.
μ₂ = 1 − 2λ/(1 + 2λ) < 1,
matrix, or

U*AU = [ μ₁  m ; 0  μ₂ ],

where μ_{1,2} are the eigenvalues of A and

m = a₁₁ū₁₁u₁₂ + a₂₁ū₂₁u₁₂ + a₁₂ū₁₁u₂₂ + a₂₂ū₂₁u₂₂,

so that, in particular,

|m| ≤ Σ_{j,k=1}^2 |a_{jk}|,
where the bar denotes complex conjugation. The powers of the triangular matrix satisfy

(U*AU)^n = [ μ₁^n  m · m_n(μ₁, μ₂) ; 0  μ₂^n ],   m_n(μ₁, μ₂) = Σ_{j=0}^{n−1} μ₁^j μ₂^{n−1−j},

and, since |μ_j| ≤ 1 for j = 1, 2, their boundedness reduces to the boundedness of m_n(μ₁, μ₂). In the case (6.9) one obtains an equation which is thus the characteristic equation of the one-step system formulation of (6.13) as well as of the scalar problem. For the Fourier transform of the solution one obtains the corresponding recursion, where λ = kh^{−M}.
We shall consider some specific choices of linear multistep methods which are used for ordinary differential equations.
Let us begin with the Adams methods. They consist in writing (6.12) as an integral over (t_n, t_{n+1}) and replacing the integrand by an interpolating polynomial, where the coefficients β_{mj} may be found in e.g. HENRICI [1962]. If instead we use the corresponding extrapolation, we obtain a characteristic equation which has the root μ = −(4 + √21) outside the unit disk. The method is thus unstable for large λ and hence also inferior to our previous methods.
We now turn to the method of backward differencing to construct a finite difference operator from (6.12). It consists in replacing the time derivative in (6.12) by the derivative of the interpolating polynomial, based on the values at t_{n+1}, t_n, ..., t_{n+1−m}, evaluated at t_{n+1}, to obtain an implicit scheme. This yields a method of the form

k^{−1} [ β_{m0} U^{n+1} − Σ_{j=1}^m β_{mj} U^{n+1−j} ] = Q_h U^{n+1},

or

(β_{m0} − kQ_h) U^{n+1} = Σ_{j=1}^m β_{mj} U^{n+1−j}.   (6.18)

It would also be possible to construct in this way an explicit scheme by evaluating at t_n, i.e.

β̃_{m0} U^{n+1} = (β̃_{m1} + kQ_h) U^n + Σ_{j=2}^m β̃_{mj} U^{n+1−j}.   (6.19)

The corresponding characteristic equation for (6.18) is

(β_{m0} − kQ_h(ξ)) μ^m − Σ_{j=1}^m β_{mj} μ^{m−j} = 0.   (6.20)

For m = 2 and m = 3 we have, in particular, for (6.18),

(3/2 − kQ_h) U^{n+1} = 2U^n − ½U^{n−1},   (6.21)

(11/6 − kQ_h) U^{n+1} = 3U^n − (3/2)U^{n−1} + (1/3)U^{n−2}.
In order to decide whether the roots of the characteristic polynomials corresponding to these methods are in the unit disk, we transform it to the left half-plane by setting μ = (1 + η)/(1 − η) and apply the Hurwitz criterion: in order that the equation Σ_{j=0}^m γ_j η^j = 0 with γ₀ > 0 have all roots in the left half-plane it suffices that

D₁ = γ₁ > 0

and

D_k = det [ γ₁  γ₃  γ₅  ...  γ_{2k−1} ;
            γ₀  γ₂  γ₄  ...  γ_{2k−2} ;
            0   γ₁  γ₃  ...  γ_{2k−3} ;
            ... ] > 0   for k = 2, ..., m,

where γ_j = 0 for j > m. For m = 2, with p denoting the (nonpositive) value of kQ_h(ξ), the transformed equation has γ₀ = −p, γ₁ = 2 − 2p, γ₂ = 4 − p, so that

D₁ = 2 − 2p > 0

and

D₂ = γ₁γ₂ = (2 − 2p)(4 − p) > 0,

and hence for 0 < |ξ| ≤ π all roots of (6.22) are in the left half-plane and thus those of the corresponding equation (6.20) in the open unit disk.
For m = 3 we have γ₀ = −p, γ₁ = 2 − 3p, γ₂ = 6 − 3p, γ₃ = 20/3 − p, and

D₁ = 2 − 3p > 0,

D₂ = det [ 6 − 3p  −p ; 20/3 − p  2 − 3p ] > 0,

and

D₃ = det [ 6 − 3p  −p  0 ; 20/3 − p  2 − 3p  0 ; 0  6 − 3p  −p ] = (−p)D₂ > 0.
Both methods thus satisfy von Neumann's condition for any λ and are parabolic.
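The root condition for the backward differencing method (6.21) can also be checked directly, without the Hurwitz transformation (an added sketch; p stands for the value of kQ_h(ξ), assumed real and nonpositive):

```python
import numpy as np

def bdf2_root_radius(p):
    # Roots of (3/2 - p) * mu^2 - 2*mu + 1/2 = 0, the characteristic
    # polynomial of (6.21), for a given value p = k*Q_h(xi) <= 0.
    return np.max(np.abs(np.roots([1.5 - p, -2.0, 0.5])))

worst = max(bdf2_root_radius(p) for p in (0.0, -0.5, -5.0, -500.0))
# The root moduli stay in the closed unit disk, with 1 attained only at p = 0.
```

At p = 0 the roots are 1 and 1/3, and as p → −∞ both roots tend to zero, in accordance with the parabolicity of the scheme.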
For m = 2 and 3 the explicit schemes (6.19) may be treated in the same way. We next describe two schemes considered by DOUGLAS and GUNN [1963] for the model heat equation in d space dimensions (with d ≤ 4),

∂u/∂t = Δu = Σ_{j=1}^d ∂²u/∂x_j²,   x ∈ R^d, t > 0,   (6.23)
with initial data u(x, 0) = v(x). With the forward and backward difference quotients

∂_{x_j} u(x) = (u(x + he_j) − u(x))/h,   ∂̄_{x_j} u(x) = (u(x) − u(x − he_j))/h,

and u^n(·) = u(·, nk), we have

∂_{x_j}∂̄_{x_j} u^n = ∂²u^n/∂x_j² + (h²/12) ∂⁴u^n/∂x_j⁴ + O(h⁴)

and

⅓ ∂_{x_j}∂̄_{x_j} (u^{n+1} + u^n + u^{n−1}) = ∂²u^n/∂x_j² + (h²/12) ∂⁴u^n/∂x_j⁴ + O(h⁴ + k²)   as h, k → 0.

Setting, as in Section 5,

Δ_h = Σ_{j=1}^d ∂_{x_j}∂̄_{x_j},

we have then

(u^{n+1} − u^{n−1})/(2k) = ⅓ Δ_h (u^{n+1} + u^n + u^{n−1}) − (h²/12) Σ_{j=1}^d ∂⁴u^n/∂x_j⁴ + O(h⁴ + k²),   (6.24)

and

Δ²u = Σ_{j=1}^d ∂⁴u/∂x_j⁴ + 2 Σ_{i<j} ∂⁴u/(∂x_i²∂x_j²) = Δu_t = u_tt.
The relation (6.24) together with (6.25) and (6.26) thus suggests the following two difference analogues of (6.23), namely

(U^{n+1} − U^{n−1})/(2k) = ⅓ Δ_h (U^{n+1} + U^n + U^{n−1})
− (h²/12) [ (U^{n+1} − 2U^n + U^{n−1})/k² − 2 Σ_{i<j} ∂_{x_i}∂̄_{x_i}∂_{x_j}∂̄_{x_j} U^n ]   (6.27)

and, with λ = k/h²,

(U^{n+1} − U^{n−1})/(2k) = ⅓ Δ_h ((1 − 1/(8λ)) U^{n+1} + U^n + (1 + 1/(8λ)) U^{n−1})
+ (h²/6) Σ_{i<j} ∂_{x_i}∂̄_{x_i}∂_{x_j}∂̄_{x_j} U^n.   (6.28)
The crucial matter is then the stability of these schemes. In this regard we have the following two results from DOUGLAS and GUNN [1963] (where they were expressed for a cubic domain and in the discrete l₂,h-norm).
We note that these results do not show stability in the sense of boundedness for all U^0 and U^1, but are restricted to data with U^0 = 0.
After application of the Fourier transform, equations (6.27) and (6.28) may both be written in the form

b(hξ) Û^{n+1} = a₁(hξ) Û^n + a₂(hξ) Û^{n−1},
and the corresponding characteristic equation is

b(ξ) μ² − a₁(ξ) μ − a₂(ξ) = 0,

with roots μ_j = μ_j(ξ), j = 1, 2. The solution then satisfies (assuming distinct roots; the case of coinciding roots has to be treated separately)

Û^n(ξ) = c₁(ξ) μ₁(hξ)^n + c₂(ξ) μ₂(hξ)^n,   n ≥ 0,

with c₁ and c₂ determined by the initial conditions Û^0 and Û^1. The stability stated then requires the boundedness, uniformly in ξ ∈ R^d, of the resulting amplification factors. One may show that for the method (6.27) these are bounded if d ≤ 4, and for (6.28) bounded by 4 if d ≤ 3.
Although this stability concept is more restrictive than the earlier one, it nevertheless suffices in order to derive convergence estimates if we choose U^0 = u^0, as then the initial error z^0 = U^0 − u^0 = 0. We obtain for this choice, in both cases, if U^1 is chosen so that

||U^1 − u^1|| ≤ C(h⁴ + k²),   (6.30)

that

||U^n − u^n|| ≤ C(h⁴ + k²)   as h, k → 0

(for (6.28) under the assumption that λ is bounded below). A possible choice for U^1 that guarantees (6.30) is

U^1 = v + kΔv.
Because of the implicit character of the above two schemes it is of interest to associate with them alternating direction type schemes which will reduce the work in the solution of the algebraic equations for U^{n+1}. In both cases we may write the scheme in a form with operators B_j acting in the separate coordinate directions. The alternating direction type scheme is then to define the approximate solution W^n at t = nk from W^0 = u^0 and W^1 as above, and then use the following equations to determine W^{n+1} for n ≥ 1 from W^n and W^{n−1}, namely by defining intermediate values W^{n+1,j} through

(I + B_j) W^{n+1,j} − W^{n+1,j−1} − B_j W^n = 0,   j = 2, ..., d,

and then setting

W^{n+1} = W^{n+1,d}.

Here, for (6.27),

B_j = −⅔ k ∂_{x_j}∂̄_{x_j},   j = 2, 3,

and for (6.28),

B_j = −⅔ k (1 − 1/(8λ)) ∂_{x_j}∂̄_{x_j},   j = 1, 2, 3.

In both cases (for d = 3), it is shown in DOUGLAS and GUNN [1963, 1964] that

||W^n − u^n||_{2,h} ≤ C(h⁴ + k²),

in the former case for λ bounded away from 0, in the latter for λ > 1/8.
In this section our main purpose is to describe some results and techniques
developed in the important paper of JOHN [1952] concerning the maximum-norm
stability of explicit finite difference schemes for general second-order parabolic
equations in one-space variable. We shall also include some material concerning
related work based on Fourier analysis, such as discussions of unconditional
maximum-norm stability of certain implicit methods and of the use of Fourier
multipliers in stability analysis.
We consider thus first the general nonhomogeneous equation

∂u/∂t = p₂(x, t) ∂²u/∂x² + p₁(x, t) ∂u/∂x + p₀(x, t)u + f(x, t)   in R × [0, T],   (7.1)

where p₂(x, t) ≥ δ > 0, under the initial condition

u(x, 0) = v(x)   for x ∈ R,   (7.2)

and an explicit finite difference approximation of the form

U^{n+1}(x) = Σ_l a_l(x, t, h) U^n(x − lh) + kf(x, t)   for n ≥ 0,
U^0(x) = v(x),   x ∈ R,   (7.3)
where (x, t) = (jh, nk) varies over the mesh points in R × [0, T]. The summation is over a finite set of points, |l| ≤ M, say, and we assume that h and k are sufficiently small and that k/h² = λ is kept constant as h and k tend to zero. We shall assume for simplicity that the coefficients a_l are defined everywhere in R × [0, T] and have certain smoothness and boundedness properties there, although in the calculations they only enter at mesh points. With a change of notation the difference equation (7.3) may also be written as

U_j^{n+1} = Σ_l a_{jl}^n(h) U_{j−l}^n + kf_j^n,   j = 0, ±1, ±2, ....
In John's paper the main purpose is to use the finite difference scheme to analyze
the differential equation under weak conditions on the coefficients and data. He
considers the case that p₂, ∂p₂/∂x, ∂²p₂/∂x², p₁, ∂p₁/∂x, p₀ and f are uniformly
continuous and bounded and that v is bounded and locally Riemann integrable. He
further assumes that the coefficients of the difference equation have analogous
properties. Under the assumption of the compatibility condition to be described
presently he shows convergence to a "generalized" solution of (7.1), (7.2) which is
a classical solution if ∂p₀/∂x and ∂f/∂x are uniformly continuous and bounded.
Since our interest is in the numerical solution, we shall not insist on the details of the
regularity aspects and assume simply that the coefficients po, Pl, and P2 and the data
f and v are sufficiently smooth for our purposes below. For results with reduced
regularity assumptions, see also ARONSON [1963a].
Let us first discuss consistency. Writing equation (7.3) in the form

(U^{n+1}(x) − U^n(x))/k = k^{−1} [ Σ_l a_l(x, t, h) U^n(x − lh) − U^n(x) ] + f(x, t),

we find easily by Taylor expansions that this equation, with the exact solution substituted for U^n, is satisfied in the limit as h → 0, if and only if

lim_{h→0} k^{−1} (Σ_l a_l(x, t, h) − 1) = p₀(x, t),

lim_{h→0} (h/k) Σ_l l · a_l(x, t, h) = −p₁(x, t),

lim_{h→0} (h²/2k) Σ_l l² · a_l(x, t, h) = p₂(x, t),

and we thus say that (7.3) is consistent with (7.1) if these equations hold.
Assuming that the a_l are of the form

a_l(x, t, h) = a_{l,0}(x, t) + h·a_{l,1}(x, t) + h²·a_{l,2}(x, t, h),

where the a_{l,j} are uniformly bounded and

a_{l,2}(x, t, h) → a_{l,2}(x, t)   as h → 0,

these conditions become

Σ_l a_{l,0}(x, t) = 1,
Σ_l l · a_{l,0}(x, t) = 0,
Σ_l a_{l,1}(x, t) = 0,
Σ_l l² · a_{l,0}(x, t) = 2λ p₂(x, t),

together with the corresponding relations involving p₀ and p₁. Note that if all coefficients a_l(x, t, h) are independent of h (which is only possible if p₀ = p₁ = 0), then these relations reduce to

Σ_l a_l(x, t) = 1,   Σ_l l · a_l(x, t) = 0,   Σ_l l² · a_l(x, t) = 2λ p₂(x, t).
We next consider stability with respect to the maximum norm. We note that in this case the operators are allowed to depend on t, which will somewhat complicate the notation and the analysis.
The solution of the inhomogeneous equation may be written by superposition of solutions of the homogeneous equation as

U^n = E_{k,n,0} v + k Σ_{m=0}^{n−1} E_{k,n,m+1} f^m,   n ≥ 1,   (7.6)

where thus E_{k,n,m} is the operator that defines the solution of the homogeneous equation at t = nk by means of the initial values at t = mk.
We say that the finite difference scheme (7.6) is maximum-norm stable if, with C = C_T independent of h,

||E_{k,n,m} v|| ≤ C ||v||   for 0 ≤ mk ≤ nk ≤ T.

Here, as in the rest of this section, we use || · || to denote the maximum norm over the mesh points or over all of R, depending on the case considered. If the scheme is maximum-norm stable we have from (7.6), for the solution of (7.3),
THEOREM 7.1. Let u ∈ C^{2,1} and assume that the scheme is consistent and maximum-norm stable. Then

max_{nk ≤ T} ||U^n − u(nk)|| → 0   as h → 0.
Setting

E(x, t, ξ) = Σ_l a_{l,0}(x, t) e^{−ilξ},

we may extend the von Neumann condition to the present case to read

|E(x, t, ξ)| ≤ 1   for ξ ∈ R, (x, t) ∈ R × [0, T].

It is easy to show that it is a necessary condition for stability.
The basic result here is now that a slightly stronger condition, which generalizes
our above concept of a parabolic finite difference operator, is also sufficient for
stability.
THEOREM 7.2. Assume that the a_l satisfy the appropriate regularity conditions and that (7.3) is consistent with (7.1). Then a sufficient condition for maximum-norm stability is that there exists a positive c such that

|E(x, t, ξ)| ≤ e^{−cξ²} for |ξ| ≤ π, (x, t) ∈ ℝ × [0, T]. (7.7)
We remark that this condition is satisfied in the constant coefficient case for the standard heat equation as a result of the consistency condition (7.5) if

|E(ξ)| < 1 for 0 < |ξ| ≤ π. (7.8)

In fact, (7.5) implies at once that (7.7) holds for small ξ, and by (7.8) c may then be adjusted so that (7.7) holds for all |ξ| ≤ π. The condition (7.7) may also be shown to hold if the a_{l0}(x, t) are nonnegative and a_{00}(x, t) and a_{10}(x, t) are bounded away from zero. For instance, the forward Euler method introduced above satisfies this latter condition provided

λ < 1/(2 sup_{x,t} p₂(x, t)).
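A minimal numerical sketch of this threshold (ours; we approximate the pure initial value problem for u_t = u_xx, i.e. p₂ = 1, on a periodic mesh): for the highest mode the amplification factor is 1 − 4λ, so the maximum norm decays for λ < 1/2 and grows for λ > 1/2.

```python
# Forward Euler on a periodic mesh for u_t = u_xx; the alternating initial
# data (-1)^j is the mode xi = pi, amplified by the factor 1 - 4*lam per step.
import math

def euler_step(u, lam):
    m = len(u)
    return [u[j] + lam * (u[(j - 1) % m] - 2.0 * u[j] + u[(j + 1) % m])
            for j in range(m)]

def max_norm_after(lam, steps=50, m=32):
    u = [(-1.0) ** j for j in range(m)]   # highest mode, ||v||_inf = 1
    for _ in range(steps):
        u = euler_step(u, lam)
    return max(abs(x) for x in u)
```

For λ = 0.4 the norm decays like 0.6^n; for λ = 0.6 it grows like 1.4^n.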
In the constant coefficient case the operators depend only on the difference of the time levels, E_{k,n,m} = E_k^{n−m}, and

E_{k,n,0} v(x) = E_k^n v(x) = Σ_l a_{nl} v(x − lh).

We find at once by Fourier transformation that, with E(ξ) the symbol of E_k, we have

(E_k^n v)^(ξ) = E(hξ)^n v̂(ξ),

and hence that the a_{nl} are the Fourier coefficients of the periodic function E(ξ)^n, so that

a_{nl} = (1/2π) ∫_{−π}^{π} E(ξ)^n e^{ilξ} dξ.
We note that

|E_k^n v(x)| ≤ Σ_l |a_{nl}| ‖v‖,

and also that equality holds for suitable v, so that the operator norm of E_k^n is

‖E_k^n‖ = Σ_l |a_{nl}|. (7.9)

We observe that, in fact, this relation holds also if we consider the operator to be defined on the space C(ℝ) of bounded continuous functions on ℝ and not only on the mesh functions of l_{∞,h}. The stability problem is thus the same in the two cases and reduces to bounding the sum on the right in (7.9).

SECTION 7 Pure initial value problem 81
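The quantity in (7.9) can be examined numerically. The following sketch (our construction) recovers the a_{nl} from the symbol by quadrature and sums their moduli, here for the forward Euler symbol E(ξ) = 1 − 2λ(1 − cos ξ); λ = 0.6 violates the von Neumann condition and the sum blows up.

```python
# Coefficients a_{nl} of E_k^n from the symbol, via the Fourier coefficient
# formula a_{nl} = (1/2pi) int_{-pi}^{pi} E(xi)^n e^{il xi} dxi, computed by
# the trapezoidal rule (exact for trigonometric polynomials of degree < npts).
import cmath, math

def symbol(xi, lam):
    return 1.0 - 2.0 * lam * (1.0 - math.cos(xi))

def coeff(n, l, lam, npts=512):
    s = 0.0 + 0.0j
    for i in range(npts):
        xi = -math.pi + 2.0 * math.pi * i / npts
        s += symbol(xi, lam) ** n * cmath.exp(1j * l * xi)
    return s / npts

def op_norm(n, lam, npts=512):
    # E(xi)^n has degree n, so the coefficients vanish for |l| > n
    return sum(abs(coeff(n, l, lam, npts)) for l in range(-n, n + 1))
```

For λ = 0.4 the coefficients are nonnegative and sum to E(0)^n = 1, so the norm is exactly 1; for λ = 0.6 it is at least |E(π)|^n = 1.4^n.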
We now estimate the a_{nl}. First, we have directly, from the definition of a_{nl} and the assumption (7.7),

|a_{nl}| ≤ (1/2π) ∫_{−π}^{π} e^{−ncξ²} dξ ≤ C n^{−1/2}.

Further, integrating by parts twice, for l ≠ 0,

a_{nl} = −(1/(2π l²)) ∫_{−π}^{π} (n E(ξ)^{n−1} E″(ξ) + n(n−1) E(ξ)^{n−2} E′(ξ)²) e^{ilξ} dξ,

so that, since E′(ξ) = O(ξ) by consistency,

|a_{nl}| ≤ (C/l²) ∫_{−π}^{π} (n e^{−ncξ²} + n² ξ² e^{−ncξ²}) dξ ≤ C n^{1/2} l^{−2}.

We conclude that

Σ_l |a_{nl}| ≤ C (Σ_{l²≤n} n^{−1/2} + Σ_{l²>n} n^{1/2} l^{−2}) ≤ C,

which shows the desired maximum-norm stability.
Letting U^n = E_k^n v we shall want to estimate U^n(x₀), for x₀ arbitrary, by the maximum norm of v. We then fix the coefficients of E_k at x₀ to obtain the representation

E_k U^n(x) = Σ_l a_l(x₀) U^n(x − lh) + Σ_l b_l(x) U^n(x − lh) = Ē_k U^n(x) + f^n(x),

where the latter equality defines Ē_k and f^n, and hence

U^n = Ē_k^n v + Σ_{m=0}^{n−1} Ē_k^{n−1−m} f^m. (7.10)

Setting

Ē_k^n v(x) = Σ_p ā_{np} v(x − ph),

we may write

Ē_k^{n−1−m} f^m(x₀) = Σ_p ā_{n−1−m,p} f^m(x₀ − ph) = Σ_j γ_{mj}(x₀) U^m(x₀ − jh), (7.11)
where, setting

β_j(x₀, ξ) = Σ_l b_l(x₀ − jh + lh) e^{−ilξ},

we have

γ_{mj}(x₀) = (1/2π) ∫_{−π}^{π} Ē(x₀, ξ)^{n−1−m} β_j(x₀, ξ) e^{ijξ} dξ.
Using the consistency relations one may show, by arguments similar to the ones used to estimate the a_{nl} above, for l fixed, nk ≤ T, that with C independent of x₀,

|γ_{mj}(x₀)| ≤ C h min((m+1)^{−1/2}, (m+1)^{1/2} j^{−2}),

and hence

Σ_{m=0}^{n−1} Σ_j |γ_{mj}(x₀)| ≤ C h Σ_{m=0}^{n−1} (m+1)^{−1/2} ≤ C h √n = C √(nk/λ).
Using the stability of Ē_k it follows from (7.10) and (7.11) that, for nk ≤ δ with δ sufficiently small, ‖U^n‖ ≤ C‖v‖; iterating this estimate over intervals of length δ then gives maximum-norm stability for nk ≤ T.
For the same range of parameters one also has, for the discrete fundamental solution g_{n,m,j}(h), the estimate

|g_{n,m,j}(h)| ≤ C h/√((n−m)k),

and, for difference quotients,

|∂_x g_{n,m,j}(h)| ≤ C h/((n−m)k)

and (cf. (7.12))

h Σ_j |∂_x g_{n,m,j}(h)| ≤ C/√((n−m)k),

which are all analogues of estimates for the fundamental solution of the continuous problem.
We shall now turn to a brief discussion of the maximum-norm stability of general difference operators of the form

E_k v(x) = Σ_j a_j v(x − jh), (7.13)

which we now study without assuming any relation with any particular differential equation, and where we permit the summation to be over infinitely many j, but such that

Σ_j |a_j| < ∞.

We introduce the symbol

E(ξ) = Σ_j a_j e^{−ijξ},

and note that, as above, the coefficients a_j may be retrieved from E(ξ) by

a_j = (1/2π) ∫_{−π}^{π} E(ξ) e^{ijξ} dξ.
THEOREM 7.3. Assume that the symbol E(ξ) of E_k is analytic on the real axis. Then E_k is maximum-norm stable if and only if one of the following two conditions is satisfied:
(a) E(ξ) = c e^{ijξ} with |c| = 1, j an integer;
(b) |E(ξ)| < 1 except for at most a finite number of points ξ_q, q = 1, …, N, in |ξ| ≤ π, where |E(ξ_q)| = 1. For q = 1, …, N there are constants α_q, β_q, μ_q, where α_q is real, Re β_q > 0 and μ_q is an even natural number, such that

E(ξ_q + η) = E(ξ_q) exp(iα_q η − β_q η^{μ_q} + o(η^{μ_q})) as η → 0. (7.14)
The proof of this result may be carried out by the above technique of John.
It follows, of course, in particular, that von Neumann's condition is necessary for stability. For an operator E_k which is consistent with the heat equation, condition (7.14) is satisfied for ξ₁ = 0, with α₁ = 0, β₁ = λ and μ₁ = 2. If there are no further points ξ_q in |ξ| ≤ π with |E(ξ_q)| = 1, the operator is parabolic in John's sense and maximum-norm stability is known by Theorem 7.2. If there are more points ξ_q with E(ξ_q) on the unit circle, (7.14) describes the behaviour of E(ξ) near these points which is required for maximum-norm stability. A simple example is the explicit forward Euler scheme with λ = 1/2,

E_k v(x) = ½ v(x + h) + ½ v(x − h),

with

E(ξ) = cos ξ.

Here |E(ξ)| = 1 at ξ = 0 but also at ξ = π, and

E(π + η) = −cos η = −exp(−½η² + O(η⁴)) as η → 0.

This operator is thus not parabolic, although it is trivially maximum-norm stable.
One example of an operator E_k of the form (7.13) in which the summation is infinite is provided by the Crank–Nicolson method, for which

E(ξ) = (1 − λ(1 − cos ξ))/(1 + λ(1 − cos ξ)).

This real analytic 2π-periodic function has an infinite Fourier series, and the operator is parabolic for any choice of λ. Although the coefficients are not necessarily nonnegative, there thus exists, for each λ > 0, a constant C = C(λ) such that, in the maximum norm,

‖E_k^n v‖ ≤ C ‖v‖.
Since it would be desirable, in practice, to take h and k of the same order, and thus λ of order 1/h, it is of interest to ask whether the same constant can be chosen for all λ, as in the case of L₂-stability. This problem was considered in SERDYUKOVA [1964, 1967], who showed

‖E_k^n v‖ ≤ 23 ‖v‖, n ≥ 1, λ > 0

(for earlier related work, cf. also JUNCOSA and YOUNG [1957] and WASOW [1958]).
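This uniformity can be illustrated numerically. The sketch below (our construction; the truncation length and quadrature size are our choices) estimates Σ_l |a_{nl}| for the Crank–Nicolson symbol and observes values consistent with Serdyukova's bound 23.

```python
# Truncated estimate of the operator norm sum_l |a_{nl}| for the
# Crank-Nicolson symbol E(xi) = (1 - lam(1 - cos xi))/(1 + lam(1 - cos xi)).
# The symbol is analytic, so the a_{nl} decay exponentially and a moderate
# truncation |l| <= lmax suffices.
import cmath, math

def cn_symbol(xi, lam):
    s = lam * (1.0 - math.cos(xi))
    return (1.0 - s) / (1.0 + s)

def cn_opnorm_estimate(n, lam, lmax=150, npts=1024):
    xs = [-math.pi + 2.0 * math.pi * i / npts for i in range(npts)]
    En = [cn_symbol(x, lam) ** n for x in xs]   # E(xi)^n on the grid
    total = 0.0
    for l in range(-lmax, lmax + 1):
        c = sum(e * cmath.exp(1j * l * x) for e, x in zip(En, xs)) / npts
        total += abs(c)
    return total
```

The estimate is bounded below by |E(0)^n| = 1, and for both moderate and large λ it stays well within the theoretical bound.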
Somewhat more general results were obtained in HAKBERG [1970] and NORDMARK
[1974]. We shall now briefly describe Hakberg's result.
Consider the initial value problem for the parabolic equation

∂u/∂t = (−1)^{m+1} ∂^M u/∂x^M, M = 2m, x ∈ ℝ, t > 0, (7.15)

and consider a consistent finite difference operator E_k of the form (7.16), with symbol E(ξ) = R(λA(ξ)), where λ = k/h^M and R and A are subject to the conditions (7.17) and (7.18).
THEOREM 7.4. Assume that (7.17) and (7.18) hold, and that

|R(y)| < 1 for 0 < y < ∞,

and

A(ξ) ≠ 0 for 0 < |ξ| ≤ π.

Then the operator E_k defined in (7.16) is uniformly maximum-norm stable.

Setting

a_{nj}(λ) = (1/2π) ∫_{−π}^{π} R(λA(ξ))^n e^{ijξ} dξ,

the proof is accomplished by showing that, under the hypotheses made,

sup_{n,λ} Σ_j |a_{nj}(λ)| < ∞.
Since A(ξ) is even we may write, using integration by parts in the second step, for j ≠ 0,

a_{nj}(λ) = (1/π) ∫₀^π R(λA(ξ))^n cos jξ dξ
 = −(1/(πj)) ( ∫₀^{σ(λ)} + ∫_{σ(λ)}^π ) (d/dξ) R(λA(ξ))^n · sin jξ dξ
 = a′_{nj}(λ) + a″_{nj}(λ),

where σ(λ) = min(λ^{−1/M}, 1). By our previous technique one may show

|a′_{nj}(λ)| ≤ C min((nλ)^{−1/M}, j^{−2}(nλ)^{1/M}).
Further, using the fact that for large y, R(y) has an expansion of the form

R(y) = r₁ + r₂ y^{−r} + o(y^{−r}), r₂ ≠ 0, r ≥ 1,

it is possible to demonstrate that

|a″_{nj}(λ)| ≤ C min((nλ^{−1/r})^{−1/M}, j^{−2}(nλ^{−1/r})^{1/M}).
Together these estimates complete the proof.
We shall end this section by briefly describing a method for deriving estimates
such as the above, in the case of constant coefficient problems. This is the method of
Fourier multipliers which has been used systematically for stability and convergence
analysis of finite difference methods in a sequence of papers, see BRENNER, THOMEE
and WAHLBIN [1975].
We introduce as earlier the Fourier transform on ℝ^d,

v̂(ξ) = ∫_{ℝ^d} v(x) e^{−i⟨ξ,x⟩} dx,

with the inversion formula

v(x) = (2π)^{−d} ∫_{ℝ^d} v̂(ξ) e^{i⟨ξ,x⟩} dξ, x ∈ ℝ^d.
The operator A on C₀^∞ defined by (7.19) may then be extended by continuity from C₀^∞ to W_p. Our definition thus says that a ∈ M_p if the effect of multiplication by a on the Fourier transform side is a bounded operator on W_p, and M_p(a) denotes its operator norm.
By Parseval's relation we discover at once that a ∈ M₂ if and only if a is bounded, and

M₂(a) = sup_ξ |a(ξ)|.
Further, a ∈ M_∞ if and only if a is the Fourier transform of a bounded measure μ,

a(ξ) = ∫_{ℝ^d} e^{−i⟨ξ,x⟩} dμ(x),

and then

M_∞(a) = ∫_{ℝ^d} |dμ(x)|.

In particular, if

a(ξ) = Σ_j a_j e^{−ijξ},

we have

M_∞(a) = Σ_j |a_j|.
In these terms, the stability in W_p of a finite difference operator E_k with symbol E(ξ) reduces to

M_p(E(·)^n) ≤ C for n ≥ 0. (7.22)

It follows from (7.20) and (7.21) that stability in maximum norm implies stability in L_p for other p. We remark that it follows from our above discussion that the condition for stability in C(ℝ) or l_{∞,h} is the same as in W_∞.
In order to use these notions to prove stability it is thus necessary to have access to some method to bound an expression such as the left-hand side in (7.22). One may then first show that in order to estimate a 2π-periodic multiplier a it suffices to estimate ηa, where η is a function in C₀^∞(ℝ^d) which is identically 1 for |ξ_j| ≤ π, j = 1, …, d, or

M_p(a) ≤ C_η M_p(ηa). (7.23)

The basic result that may be used to estimate the latter quantity is then an inequality which generalizes a well-known inequality by Carlson and Beurling, which may be formulated, in the case d = 1, as

M_∞(a) ≤ C ‖a‖_{L₂(ℝ)}^{1/2} ‖a′‖_{L₂(ℝ)}^{1/2}, (7.24)

or, in the case of an arbitrary d, for ν > ½d,

M_∞(a) ≤ C ‖a‖_{L₂(ℝ^d)}^{1−d/(2ν)} (Σ_{|α|=ν} ‖D^α a‖_{L₂(ℝ^d)})^{d/(2ν)}.
In this section we shall consider the stability of finite difference schemes for general
parabolic equations and systems. We first discuss equations or systems which are
parabolic in the sense of Petrovskii and difference schemes which are parabolic in
a sense generalizing the one introduced by John and described in Section 7 above.
We also show analogues of some known smoothing properties of parabolic
problems and touch upon the possibility of using such properties as definitions of
parabolicity of difference schemes.
We shall thus first consider equations of the form

∂u/∂t = Σ_{|α|≤M} P_α(x, t) D^α u, x ∈ ℝ^d, t > 0, (8.1)

and, with λ_j(A) denoting the eigenvalues of a matrix A, we set

Λ(A) = max_j Re λ_j(A).
with the power of t − s on the right independent of α. (In the constant coefficient case Γ depends only on z = y − x and t − s, and thus Γ(x, t, x + z, s) is independent of x.)
It follows, in particular, from the above representation and (8.3) that the initial value problem in (8.1) is well posed in L_p = L_p(ℝ^d) and W_p^s = W_p^s(ℝ^d), the Sobolev space defined by the norm

‖v‖_{W_p^s} = (Σ_{|α|≤s} ‖D^α v‖_{L_p}^p)^{1/p}.

Further, the solution has the regularity property corresponding to our previous result in L₂ for the constant coefficient case (cf. (4.9)): with u(x, t) = E(t)v we have, for 1 ≤ p ≤ ∞,

‖D^α u(·, t)‖_{L_p} ≤ C t^{−|α|/M} ‖v‖_{L_p}, 0 < t ≤ T. (8.4)
with a_β, b_β smooth functions of x, t and h, and where λ = k/h^M = constant. We assume that B_{k,n} is invertible (in the appropriate L_p-space) so that (8.5) may be written

U^{n+1} = E_{k,n} U^n with E_{k,n} = B_{k,n}^{−1} A_{k,n},

or

U^n = E_k^n U⁰ = E_{k,n−1} E_{k,n−2} ⋯ E_{k,0} U⁰.
It was shown by WIDLUND [1966] that neither in L₂ nor L_∞ is (8.6) a sufficient condition for stability in the case of variable coefficients.
Following WIDLUND [1965] we now introduce a generalization of the previous concept of a difference scheme which is parabolic in the sense of John, and say that our difference scheme is parabolic (in the sense of John) if

ρ(Ẽ(x, t, ξ)) ≤ 1 − c|ξ|^M for |ξ_j| ≤ π, j = 1, …, d, c > 0, (8.7)

where ρ denotes the spectral radius.
The following important result was then proved by WIDLUND [1965, 1966] (cf. also ARONSON [1963b]).
THEOREM 8.2. If the difference scheme is parabolic in the sense of John, then it is stable in L_p for 1 ≤ p ≤ ∞.
We also have the following regularity estimate corresponding to the estimate (8.4) for the continuous problem, where ∂^α as earlier denotes mixed forward difference quotients.
SECTION 8 Pure initial value problem 93

The results depend on estimates for a discrete fundamental solution: One may write

U^n(x) = h^d Σ_y g_{n,m,y}(x) U^m(x − yh),

and it is possible to show that, for |β_i h| ≤ δ, i = 1, …, d, with some δ > 0 (δ arbitrary in the explicit case),

|∂^α g_{n,m,y}(x)| ≤ C((n − m + 1)k)^{−(|α|+d)/M} exp(C|β|^{M/(M−1)}(n − m + 1)k − β·yh).
This is done by first freezing the coefficients at an arbitrary point (x₀, t₀) and obtaining estimates for the corresponding constant coefficient problem by the Fourier method, whereby the analyticity of the symbol is used to move the path of integration into the complex domain. After this the frozen coefficient fundamental solution is employed as a parametrix to obtain the final estimate by a perturbation argument. With suitable choice of β this yields estimates of the form, e.g.,

|∂^α g_{n,m,y}(x)| ≤ C((n−m+1)k)^{−(|α|+d)/M} exp(−c (|yh|/((n−m+1)k)^{1/M})^{M/(M−1)}) for |yh| ≤ 1,

and

|∂^α g_{n,m,y}(x)| ≤ C((n−m+1)k)^{−(|α|+d)/M} exp(−c|y|) for |yh| ≥ 1.

The first of these is analogous to the corresponding estimates (8.3) for the continuous problem. The second estimate, which is not needed for explicit schemes, as δ is then unrestricted, shows that the contributions for large y are exponentially small.
The above results may also be generalized to certain multistep schemes. Following WIDLUND [1966] we thus consider schemes of the form

U^{n+1} = Σ_{j=1}^{m} E_{k,n}^{(j)} U^{n+1−j}, (8.9)

which may be written as a one-step system for V^n = (U^n, …, U^{n−m+1})^T, with the block companion symbol

Ẽ(x, t, ξ) =
⎡ Ẽ₁ Ẽ₂ ⋯ Ẽ_{m−1} Ẽ_m ⎤
⎢ I  0  ⋯ 0       0   ⎥
⎢ 0  I  ⋯ 0       0   ⎥
⎣ 0  0  ⋯ I       0   ⎦,

where

Ẽ_j = Ẽ_j(x, t, ξ) = Σ_β a_{jβ}(x, t, h) e^{−i⟨β,ξ⟩}, j = 1, …, m.
For ξ = 0 consistency requires that the Ẽ_j reduce to constants α_j = Ẽ_j(x, t, 0); let A denote the corresponding companion matrix. It follows that the boundedness of A^n, n = 0, 1, …, which we shall refer to as stability of the matrix A, is a necessary condition for stability of the scheme (8.9), or of E_{k,n}. As is well known, A is a stable matrix if and only if all eigenvalues of A are in the closed unit disk and all eigenvalues on the unit circle are simple. Recall also that these eigenvalues are the roots of the equation

ρ^m − α₁ ρ^{m−1} − ⋯ − α_m = 0.
We say that the multistep scheme (8.9) is parabolic (in the sense of John) if

ρ(Ẽ(x, t, ξ)) ≤ 1 − c|ξ|^M for |ξ_j| ≤ π, j = 1, …, d,

uniformly in x and t.
The following is now the main result of WIDLUND [1966] about the stability of the
multistep parabolic scheme.
THEOREM 8.4. Assume that the multistep scheme (8.9) is consistent with (8.1) and that A is a stable matrix. Then the scheme is stable in L_p, and

‖∂^α U^n‖_{L_p} ≤ C((n − m + 1)k)^{−|α|/M} max_{0≤j<m} ‖U^j‖_{L_p}, 1 ≤ p ≤ ∞.
THEOREM 8.5. Assume that the difference scheme (8.9) is consistent with (8.1), which is parabolic in Petrovskii's sense. Assume further that all eigenvalues of A except one are in the open unit disk, and that for ξ ≠ 0, |ξ_j| ≤ π, j = 1, …, d, all eigenvalues of Ẽ(x, t, ξ) are in the open unit disk. Then the scheme is parabolic.
We note that the requirement on A here is stronger than the stability of this
matrix. It may be shown that the theorem does not hold in general when only the
stability of A is assumed.
We shall briefly consider a different parabolicity concept and take smoothing
properties like (8.4) and (8.8) as our new definitions (cf. THOMEE [1966]). We restrict
the discussion to the case of constant coefficients and with the basic space L₂ = L₂(ℝ^d). We consider thus the pure initial value problem for the equation

∂u/∂t = P(D)u = Σ_{|α|≤M} P_α D^α u, x ∈ ℝ^d, t > 0, (8.10)

where the P_α are N × N constant matrices. We say that (8.10) is weakly parabolic if the initial value problem is weakly correctly posed, and if for any positive integer m and 0 < τ ≤ T, with C = C(τ, T),

‖u(·, t)‖_{H^m} ≤ C ‖v‖_{L₂} for τ ≤ t ≤ T,

where H^m denotes the Sobolev space W₂^m. If in addition the initial value problem is correctly posed in L₂, so that, with C = C(T),

‖u(·, t)‖_{L₂} ≤ C ‖v‖_{L₂} for 0 ≤ t ≤ T,

we say that (8.10) is strongly parabolic in L₂.
THEOREM 8.6. Equation (8.10) is weakly parabolic if and only if there are positive constants c, C and ν such that

Λ(P(iξ)) ≤ −c|ξ|^ν + C for ξ ∈ ℝ^d. (8.11)

It is strongly parabolic in L₂ if and only if there are positive constants c, C, C₁ and ν and, for each ξ ∈ ℝ^d, a positive-definite Hermitian matrix H(ξ) such that

C₁^{−1} I ≤ H(ξ) ≤ C₁ I,

and

H(ξ)P(iξ) + P(iξ)* H(ξ) ≤ 2(−c|ξ|^ν + C) H(ξ). (8.12)
Systems satisfying (8.11) are also said to be parabolic in Shilov's sense (cf. SHILOV
[1955]). The largest possible ν in (8.11) and (8.12) is referred to as the order of
parabolicity.
It follows at once from (8.3) that if (8.10) is parabolic in Petrovskii's sense, then it is both weakly and strongly parabolic of order M; the scalar equation

∂u/∂t = ∂²u/∂x² + ∂³u/∂x³

provides an example of an equation which is weakly and strongly parabolic of order 2, but not parabolic in Petrovskii's sense.
It follows easily from Theorem 8.6 that if (8.10) is strongly parabolic in L₂ of order ν, then the corresponding solution operator satisfies

‖E(t)v‖_{H^m} ≤ C t^{−(m−j)/ν} ‖v‖_{H^j} for 0 ≤ j ≤ m, 0 < t ≤ T.
This property was the basis for a generalization in PEETRE and THOMEE [1967], where a system with time-independent coefficients is said to be strongly parabolic of order ν in W_p (where W_p denotes L_p(ℝ^d) if 1 ≤ p < ∞, and W_∞ is the closure of C₀^∞(ℝ^d) in L_∞(ℝ^d)) if the solution operator satisfies the corresponding estimate. Analogously, a finite difference operator E_k with constant coefficients is weakly parabolic if, for any α and 0 < τ ≤ T, with C = C(α, τ, T),

‖∂^α E_k^n v‖_{L₂} ≤ C ‖v‖_{L₂} for τ ≤ nk ≤ T,

and strongly parabolic (in L₂) if in addition E_k is stable in L₂.
We have the following analogue of Theorem 8.6.
THEOREM 8.7. The finite difference operator E_k is weakly parabolic if and only if there are positive constants c, C and ν such that

ρ(Ẽ_k(ξ)) ≤ 1 − ck|ξ|^ν + Ck for |hξ_j| ≤ π, k ≤ k₀.

It is strongly parabolic if and only if there are positive constants c, C, C₁ and ν and, for each k ≤ k₀ and ξ ∈ ℝ^d, a positive-definite matrix H_k(ξ) with

C₁^{−1} I ≤ H_k(ξ) ≤ C₁ I,

and such that

Ẽ_k(ξ)* H_k(ξ) Ẽ_k(ξ) ≤ (1 − ck|ξ|^ν + Ck)² H_k(ξ).
If E_k is parabolic in the sense of John, so that (8.7) holds, then it is weakly parabolic and strongly parabolic in L₂ of order M. For any system which is strongly parabolic in L₂ one can construct an E_k with the same property.
Motivation for introducing these concepts is provided by the following analogue
of the Lax equivalence theorem.
THEOREM 8.8. Assume that E_k is stable in L₂ and consistent with (8.10). Then E_k is strongly parabolic in L₂ if and only if, for any α, any v ∈ L₂ and any t > 0, we have

‖∂^α U^n − D^α u(·, t)‖_{L₂} → 0 as k → 0, nk = t.
9. Convergence estimates
In the preceding sections we have shown a number of convergence results (cf. e.g. Theorem 3.2) to the effect that a stable finite difference scheme which is accurate of order μ is convergent to that same order provided the exact solution is smooth
enough. We also know by the Lax equivalence theorem (Theorem 3.3) that
convergence follows from stability and consistency without any regularity hypo-
theses on the data other than that they belong to the space of functions under
consideration. Our purpose in this section is to give more precise information about
the relation between the rate of convergence and the smoothness of the exact
solution.
Consider thus an initial value problem

∂u/∂t = P(x, t, D)u = Σ_{|α|≤M} P_α(x, t) D^α u, x ∈ ℝ^d, 0 < t ≤ T, u(·, 0) = v. (9.1)
For s = s₀ + σ, with s₀ a nonnegative integer and 0 < σ ≤ 1, the Besov space B_p^{s,q} is defined by the norm

‖v‖_{B_p^{s,q}} = ‖v‖_{W_p^{s₀}} + Σ_{|α|=s₀} ( ∫₀^∞ (t^{−σ} ω_p²(t, D^α v))^q dt/t )^{1/q},

with the obvious modification for q = ∞, where ω_p²(t, w) = sup_{|y|≤t} ‖(T_y − I)² w‖_{L_p} is the second-order modulus of continuity, T_y denoting translation by y. For example, for the function H with

H(x) = 1, x ≤ 0, H(x) = 0, x > 0,

we have

‖(T_y − I)H‖_{L_p} = |y|^{1/p},

and it follows that if φ ∈ C₀^∞ then φH ∈ B_p^{1/p,∞}.
The Besov spaces form a scale of spaces between the Sobolev spaces in the sense that B_p^{s₁,q₁} ⊂ B_p^{s₂,q₂} if s₁ > s₂, or if s₁ = s₂ and q₁ ≤ q₂, and B_p^{s,1} ⊂ W_p^s ⊂ B_p^{s,∞} if s is a natural number, and we may think of a function in B_p^{s,q} as one with s derivatives in L_p. Here inclusion stands for continuous embedding, so that in each case a corresponding inequality between norms holds. The Besov spaces are also intermediate spaces between the Sobolev spaces in the sense of the theory of interpolation of Banach spaces. For instance, if m is a natural number and 0 < s < m, then there is a constant C such that for any bounded linear operator A in W_p with

‖Av‖_{W_p} ≤ C₀ ‖v‖_{W_p}

and

‖Av‖_{W_p} ≤ C₁ ‖v‖_{W_p^m},

we have

‖Av‖_{W_p} ≤ C C₀^{1−s/m} C₁^{s/m} ‖v‖_{B_p^{s,∞}}.
THEOREM 9.1. Let 1 ≤ p ≤ ∞ and assume that (9.1) is parabolic in Petrovskii's sense and that the scheme (9.2) is stable in W_p and accurate of order μ. Then, for 1 ≤ q ≤ ∞,

‖U^n − u^n‖_{W_p} ≤ C h^μ (log(1/h))^{1−1/q} ‖v‖_{B_p^{μ,q}} (9.4)

for nk ≤ T.
The proof was given by PEETRE and THOMEE [1967] for the case of time-independent coefficients, and is valid also in the general case. It is based on the identity

U^n − u^n = Σ_{j=0}^{n−1} E_k^{n,j+1} (E_{k,j} − E((j+1)k, jk)) E(jk, 0) v,

using the uniform boundedness of the operators E_k^{n,j+1}, the definition of accuracy, and a smoothing property of the exact solution which follows from the estimate (8.3) for the fundamental solution (cf. (8.4)).
For initial data that are less regular one may show a corresponding lower rate of
convergence.
This result follows by interpolation between the result (9.4) with q = 1 and the obvious inequality

‖U^n − u^n‖_{W_p} ≤ C ‖v‖_{W_p}.

Note that, from our above remark about how the interpolation theory works, the norm on the right in (9.5) is that in B_p^{s,∞} and not the stronger norm in B_p^{s,q}, as might have been expected.
In the particular case that (9.2) is parabolic in John's sense, the factor log(1/h) can be removed from (9.4) for q = ∞, as was shown by WIDLUND [1970a, 1970b] (cf. also results in this direction in special cases by HEDSTROM [1968] and LOFSTROM [1970]):

THEOREM 9.3. Assume that (9.1) is parabolic in the sense of Petrovskii and that (9.2) is parabolic in the sense of John and accurate of order μ. Then, if 1 ≤ p ≤ ∞, we have

‖U^n − u^n‖_{W_p} ≤ C h^μ ‖v‖_{B_p^{μ,∞}} for nk ≤ T.

The proof depends on estimates for the discrete fundamental solution. These results may be thought of as generalizations to 1 ≤ p ≤ ∞ and variable coefficients of the corresponding results of Section 4. In fact, even in the case p = 2 and constant coefficients, Theorem 9.3 is sharper than Theorem 4.5, as B₂^{μ,∞} strictly contains H^μ.
We shall briefly discuss the sharpness of the above results. We first present the
following saturation result of LOFSTROM [1970] in the constant coefficient case.
THEOREM 9.4. Assume that (9.1) is parabolic in the sense of Petrovskii and has constant coefficients and no lower-order terms, and that (9.2) also has constant coefficients, is stable in W_p for some p with 1 ≤ p ≤ ∞, and is accurate of order exactly μ. Let v be such

SECTION 9 Pure initial value problem 101
THEOREM 9.5. Consider the initial value problem for the parabolic equation (9.7) and a corresponding scheme (9.8) of order μ which is parabolic in the sense of John. Let 0 < s ≤ μ and t > 0, and assume that v ∈ L_∞ is such that

‖U^n − u^n‖_{L_∞} ≤ C h^s as nk = t, k → 0.

Then v ∈ B_∞^{s,∞}.
The result of Theorem 9.4 is still best possible, however, in the sense that for some v ∈ B_∞^{s,∞} the O(h^s) convergence rate is best possible.

THEOREM 9.6. Let the assumptions of Theorem 9.5 hold and let 0 < s ≤ μ, t > 0. Then there exists a v ∈ B_∞^{s,∞} such that

lim sup_{k→0, nk=t} h^{−s} ‖U^n − u^n‖_{L_∞} > 0.
In spite of the above it is, however, possible to attain an O(h^s) convergence rate in the maximum norm under weaker assumptions than v ∈ B_∞^{s,∞} (s ≤ μ), namely if the regularity of the initial data is measured in L₁. In this regard we have the following result, proved in THOMEE and WAHLBIN [1974] for the case of time-independent coefficients. In a particular case a similar result was proved in JUNCOSA and YOUNG [1957].

THEOREM 9.7. Under the assumptions of Theorem 9.3, let d < s ≤ μ and 0 < τ < T. Then there is a constant C such that, for v ∈ B₁^{s,∞},

‖U^n − u^n‖_{L_∞} ≤ C h^s ‖v‖_{B₁^{s,∞}}, τ ≤ nk ≤ T.
In the special one-dimensional situation treated in Theorems 9.5 and 9.6, the estimate takes the more precise form

‖U^n − u^n‖_{L_∞} ≤ C h^s (nk)^{−1/2} ‖v‖_{B₁^{s,∞}}, n = 1, 2, …. (9.9)

The proofs of Theorems 9.5 and 9.6 and of (9.9) depend on Fourier analysis, whereas the more general result of Theorem 9.7 uses the estimates for discrete fundamental solutions of Widlund.
As an example of a function which illustrates the difference between the spaces occurring above, let us take χ ∈ C₀^∞(ℝ) with χ(0) ≠ 0 and set, with a > 0,
THEOREM 9.8. Assume that (9.1) is strongly parabolic of order ν (≤ M) in W_p, i.e. such that the initial value problem is correctly posed in W_p and the solution operator satisfies the corresponding regularity estimate. Then

‖U^n − u^n‖_{W_p} ≤ C_{s,ν,T} h^{sμ/(M+μ−ν)} ‖v‖_{B_p^{s,∞}} for 0 < s ≤ M + μ − ν.
For initial data which are not smooth enough to result in optimal order
convergence, it is possible to remedy the situation by first smoothing the data in an
appropriate manner. We shall now describe a result in this direction by KREISS,
THOMEE and WIDLUND [1970].
For this purpose we shall introduce a concept of a smoothing operator M_h, defined on the Fourier transform side by

(M_h v)^(ξ) = ψ̂(hξ) v̂(ξ),

where we require that

1 − ψ̂(ξ) = Σ_{|α|=μ} (sin ½ξ)^α b_α^{(1)}(ξ) (9.13)

and

ψ̂(ξ) = Σ_{|α|=μ} (sin ½ξ)^α b_α^{(2)}(ξ), (9.14)

where the b^{(j)}, j = 1, 2, are such that for some δ > 0, b^{(1)} and b^{(2)} coincide with multipliers on FL_p for |ξ| ≤ 2δ and |ξ| ≥ δ, respectively. Since the multipliers on FL₂ are simply the functions in L_∞, the above conditions are seen to be satisfied for p = 2 if

ψ̂(ξ) = 1 + O(|ξ|^μ) as ξ → 0,

and, for any multi-index β ≠ 0,

ψ̂(ξ) = O(|ξ − 2πβ|^μ) as ξ → 2πβ.
For μ = 2 these conditions are satisfied by the mean value operator

M_h v(x) = h^{−1} ∫_{−h/2}^{h/2} v(x − y) dy, (9.15)

and, for general μ, a smoothing operator of order μ can easily be constructed in the form

M_h v(x) = h^{−1} ∫_ℝ φ(h^{−1} y) v(x − y) dy,
which in terms of the Fourier transform corresponds to

ψ̂(ξ) = φ̂(ξ).

More generally, the choice

φ̂(ξ) = 1 for |ξ| ≤ δ, φ̂(ξ) = 0 for |ξ| > δ,

with 0 < δ < π, defines a smoothing operator of arbitrarily high order; in this case

φ(x) = sin δx/(πx).
Smoothing operators in higher dimensions can be obtained by taking products of one-dimensional operators,

M_h v = (Π_{j=1}^{d} M_h^{(j)}) v.
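For the mean value operator (9.15) one has ψ̂(ξ) = sin(½ξ)/(½ξ) = 1 − ξ²/24 + O(ξ⁴), so that it has order μ = 2 and carries one factor sin ½ξ. A small numerical check of both facts (our sketch):

```python
# Fourier factor of the mean value operator (9.15): psi_hat(xi) = sin(xi/2)/(xi/2).
import math

def psi_hat(xi):
    return 1.0 if xi == 0.0 else math.sin(0.5 * xi) / (0.5 * xi)

# (1 - psi_hat(xi))/xi^2 should approach 1/24 as xi -> 0 (order mu = 2) ...
ratios = [(1.0 - psi_hat(xi)) / (xi * xi) for xi in (0.2, 0.1, 0.05)]

# ... and psi_hat vanishes at xi = 2*pi (one factor sin(xi/2))
vanish = psi_hat(2.0 * math.pi)
```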
THEOREM 9.9. Let 1 ≤ p ≤ ∞ and assume that (9.1) is parabolic in Petrovskii's sense and that (9.2) is parabolic in John's sense and accurate of order μ. Let further M_h be a smoothing operator of order μ, and let U^n be the discrete solution with initial data U⁰ = M_h v. Then

‖U^n − u^n‖_{W_p} ≤ C h^μ (nk)^{−μ/M} ‖v‖_{W_p} for nk ≤ T. (9.17)
PROOF. In order to shed some light on the mechanisms that are involved we shall present a proof for the one-dimensional heat equation in the case p = 2. Recalling that then

û(ξ, t) = e^{−tξ²} v̂(ξ)

and

Û^n(ξ) = E(hξ)^n ψ̂(hξ) v̂(ξ),

we see at once by Parseval's relation that the proof of (9.17) reduces to showing

|E(hξ)^n ψ̂(hξ) − e^{−nkξ²}| ≤ C n^{−μ/2} for ξ ∈ ℝ. (9.18)

We write

E(hξ)^n ψ̂(hξ) − e^{−nkξ²} = (E(hξ)^n − e^{−nkξ²}) ψ̂(hξ) + (ψ̂(hξ) − 1) e^{−nkξ²} = I + II.

Using the accuracy and parabolicity of the scheme we conclude that

|I| ≤ C n^{−μ/2} for ξ ∈ ℝ.

For the remaining term we have at once, by (9.13),

|II| ≤ C |hξ|^μ e^{−nkξ²} ≤ C n^{−μ/2} for ξ ∈ ℝ,

which completes the proof of (9.18) and thus of (9.17) in the case considered. □
We shall now describe a result by THOMEE and WAHLBIN [1974] which shows that in certain cases optimal order convergence for positive time may be attained with slightly simpler smoothing operators than in Theorem 9.9.
We shall say that the smoothing operator defined by (9.11) or (9.12) is of orders (μ, ν) if (9.13) holds together with

ψ̂(ξ) = Σ_{|α|=ν} (sin ½ξ)^α b_α^{(2)}(ξ),

where the b^{(j)}, j = 1, 2, now coincide with multipliers on FL_p for |ξ| ≤ 2δ and |ξ| ≥ δ, respectively, so that the parameter μ in (9.14) is replaced by ν. Our previous smoothing operators are then of orders (μ, μ), and it is easy to see, for instance, that the simple operator (9.15) has orders (2, 1). The following result then holds.
THEOREM 9.10. Assume that (9.1) and (9.2) are parabolic in the sense of Petrovskii and John, respectively, and that (9.2) is accurate of order μ. Let d < s ≤ μ, 0 < τ < T, and assume that M_h has orders (μ, ν) with ν ≥ μ − s. Then, for U^n the discrete solution with U⁰ = M_h v, we have

‖U^n − u^n‖_{L_∞} ≤ C h^μ ‖v‖_{B₁^{s,∞}} for τ ≤ nk ≤ T.
We finally consider the approximation Q_h of a differential operator

Qv = Σ_{|α|=q} Q_α D^α v,

which, for simplicity only, we take to have constant coefficients and no lower-order terms, and assume that the order of the approximation is μ, so that, for smooth v,

Q_h v = Qv + O(h^μ) as h → 0. (9.19)
We begin with a smooth data result:

THEOREM 9.11. Assume that (9.1) is parabolic in Petrovskii's sense and that (9.2) is parabolic in John's sense and accurate of order μ. Then, if 1 ≤ p ≤ ∞, we have, for nk ≤ T,

One may also show the following estimates for less smooth data:

THEOREM 9.12. Under the assumptions of Theorem 9.11 we have, for 0 < s ≤ μ, 0 < nk ≤ T,

THEOREM 9.13. Under the assumptions of Theorem 9.11 we have, for 0 < s ≤ μ, 0 < nk ≤ T,

‖Q_h U^n − Q u^n‖_{W_p} ≤ C h^s (nk)^{−(q+μ−s)/M} ‖v‖_{B_p^{s,∞}}.
coefficients and discuss the effect of lower-order terms and proceed to treat
boundary conditions which involve derivatives. We then consider the extension to
several space dimensions, including the case of domains with curved boundaries.
When the domain is such that its boundary falls on mesh planes we demonstrate
stability in the discrete L 2-norm and convergence to the natural orders for the
standard finite difference equations with Dirichlet boundary conditions. In the case
of a curved boundary we describe a crude variant of the backward Euler method
which has a low-order convergence rate and we end the section with a brief
discussion of the relation of standard finite difference methods to special cases of the
finite element method.
In Section 11 we consider the same types of problems as in Section 10, but now the
methods of analysis are based on discrete analogues of the maximum principle or
related monotonicity arguments, depending on positivity of the coefficients of the
finite difference operators. This restricts the generality of the methods but gives
somewhat more satisfactory results in cases when they apply.
In Section 12 we describe a variety of spectral methods and begin by discussing
a concept of spectrum relating to a family of operators, which was introduced in
GODUNOV and RYABENKII [1964] to deal with stability of finite difference schemes in
the presence of boundary conditions. Their approach shows that the stability in the
case of one space dimension is dependent on the stability of the interior finite
difference equations, and the left-sided and right-sided boundary value problems
separately. This makes it natural to discuss the stability of a quarter-plane problem,
with the space variable ranging over the positive axis, and we describe the analysis of
such a situation by means of Fourier analysis in the vein of the school of Kreiss. We
close the section with some remarks concerning the stability of initial value type
difference operators which are suited for the quarter-plane problem, and concerning
the use of eigenfunctions to derive maximum-norm error estimates.
Finally, in Section 13, we collect some material which has not naturally fallen into
any of the earlier sections. We thus discuss some results relating to the possibility of
using a variable time step, we consider a parabolic problem with a singularity caused
by transformation by means of polar coordinates of spherically symmetric
problems, and touch upon the case of a discontinuous coefficient in the parabolic
equation. We further examine the possibility of treating the initial boundary value
problem as a pure boundary value problem and then present some interior estimates
for standard parabolic difference schemes which may be used to draw strong
conclusions about convergence away from the boundary from global results.
Finally, we present an example of the application of finite difference schemes in
existence theory.
In this section we shall consider the application of discrete variants of the energy
method, commonly used in the study of boundary value problems for partial
differential equations, to derive stability and convergence estimates for finite
SECTION 10 Mixed initial boundary value problem 111
difference schemes. In contrast to the Fourier method which was the basis for the
analysis described, for instance, in Sections 7 and 8, such methods are suitable for
equations with variable coefficients and also for problems with boundary condi-
tions which are not of Dirichlet type.
This approach was developed first by LEES [1959, 1960a,b], KREISS [1959a,c,
1960] and SAMARSKII [1961a,b, 1962a] (cf. also e.g. BABUSKA, PRAGER and VITASEK
[1966]).
We shall first consider the model problem of the one-dimensional heat equation in divergence form,

∂u/∂t = ∂/∂x (a ∂u/∂x) for 0 < x < 1, t > 0, (10.1)

with homogeneous Dirichlet boundary conditions and initial values u(·, 0) = v, where the coefficient a = a(x, t) satisfies

0 < K₀ ≤ a(x, t) ≤ K₁. (10.2)

With

‖u‖_{L₂} = (∫₀¹ u² dx)^{1/2},

equation (10.1) yields, after multiplication by u and integration over (0, 1),

½ d/dt ‖u(t)‖² + ∫₀¹ a (∂u/∂x)² dx = 0,

whence, by integration in t,

‖u(t)‖² + 2K₀ ∫₀ᵗ ‖∂u/∂x‖²_{L₂} ds ≤ ‖v‖². (10.3)
For the discretization we let h = 1/M and, for mesh functions V = (V₀, …, V_M), set

‖V‖ = (h Σ_{j=0}^{M} V_j²)^{1/2}. (10.4)

We shall employ our previous notation for the forward and backward difference quotient operators ∂_x, ∂̄_x, ∂_t and ∂̄_t. For mesh functions V and W as above, which satisfy V₀ = V_M = 0, and correspondingly for W, we may define, e.g.,

(∂_x V, W) = h Σ_{j=0}^{M−1} ∂_x V_j · W_j,

where we note that the summation ends at j = M − 1. With the obvious corresponding expression involving ∂̄_x we have, by summation by parts,

(∂_x V, W) = −(V, ∂̄_x W). (10.5)
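The identity (10.5) is exact for any mesh functions vanishing at the endpoints; a small numerical check (our code, with arbitrary mesh data):

```python
# Check of the summation-by-parts identity (10.5), (d_x V, W) = -(V, dbar_x W),
# for mesh functions on {0,...,M} vanishing at j = 0 and j = M.
import math

M, h = 8, 1.0 / 8
V = [0.0] + [0.3 * j * j - j for j in range(1, M)] + [0.0]
W = [0.0] + [math.sin(1.7 * j) for j in range(1, M)] + [0.0]

lhs = h * sum((V[j + 1] - V[j]) / h * W[j] for j in range(M))           # (d_x V, W)
rhs = -h * sum(V[j] * (W[j] - W[j - 1]) / h for j in range(1, M + 1))   # -(V, dbar_x W)
```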
We now consider the backward Euler method for (10.1), defined by

∂̄_t U_j^n = ∂_x (a_{−1/2}^n ∂̄_x U^n)_j for j = 1, …, M−1, n ≥ 1,

U₀^n = U_M^n = 0 for n ≥ 1, (10.6)

U_j⁰ = V_j = v(jh) for j = 0, …, M,

where we have set (a_{−1/2})_j^n = a(jh − ½h, nk). Multiplying the difference equation by U_j^n, summing over j, and using the identity

∂̄_t U_j^n · U_j^n = ½ k^{−1} [(U_j^n)² − (U_j^{n−1})² + (U_j^n − U_j^{n−1})²] = ½ ∂̄_t (U_j^n)² + ½ k (∂̄_t U_j^n)²,
together with the summation by parts formula (10.5), we obtain

½ ∂̄_t ‖U^n‖² + ½ k ‖∂̄_t U^n‖² + (a_{−1/2}^n ∂̄_x U^n, ∂̄_x U^n) = 0 for n ≥ 1, (10.9)

from which we conclude at once, for instance,

∂̄_t ‖U^n‖² ≤ 0,

or

‖U^n‖ ≤ ‖U^{n−1}‖,

whence, by summation,

‖U^n‖ ≤ ‖U⁰‖ = ‖V‖,

which is the l₂-stability desired.
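A minimal implementation of the backward Euler scheme (10.6) (our sketch; the tridiagonal system is solved by the Thomas algorithm, and the coefficient a and the data are our arbitrary choices), confirming the monotone decay of ‖U^n‖:

```python
# Backward Euler for u_t = (a u_x)_x on (0,1) with U_0^n = U_M^n = 0.
import math

def backward_euler_step(u_prev, a_half, r):
    # Solve, for the interior unknowns U_1..U_{M-1}, the tridiagonal system
    # -r a_{j-1/2} U_{j-1} + (1 + r(a_{j-1/2}+a_{j+1/2})) U_j - r a_{j+1/2} U_{j+1}
    #   = U_j^{n-1},   r = k/h^2,  by the Thomas algorithm.
    m = len(u_prev)
    lo = [-r * a_half[i] for i in range(m)]                        # lo[0] unused
    di = [1.0 + r * (a_half[i] + a_half[i + 1]) for i in range(m)]
    up = [-r * a_half[i + 1] for i in range(m)]                    # up[m-1] unused
    c, d = [0.0] * m, [0.0] * m
    c[0], d[0] = up[0] / di[0], u_prev[0] / di[0]
    for i in range(1, m):
        den = di[i] - lo[i] * c[i - 1]
        c[i] = up[i] / den
        d[i] = (u_prev[i] - lo[i] * d[i - 1]) / den
    u = [0.0] * m
    u[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    return u

M = 40
h = 1.0 / M
k = 0.01
r = k / (h * h)
a_half = [1.0 + 0.5 * math.sin(math.pi * (i + 0.5) * h) for i in range(M)]  # a > 0
U = [math.sin(math.pi * j * h) + 0.3 * math.sin(5.0 * math.pi * j * h) for j in range(1, M)]

def l2_norm(w):
    return math.sqrt(h * sum(x * x for x in w))

norm_v = l2_norm(U)
norms = []
for _ in range(20):
    U = backward_euler_step(U, a_half, r)
    norms.append(l2_norm(U))
```

By (10.9) the computed norms should decrease monotonically from ‖V‖, for any k and h.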
The symbol of the difference operator is in this case

E(x, t, ξ) = 1 − 2λ a(x, t)(1 − cos ξ), (10.13)

and we recognize in (10.12) the von Neumann stability condition.
In order to consider similarly the Crank–Nicolson method we set U^{n+1/2} = ½(U^{n+1} + U^n), multiply by U^{n+1/2}, and obtain

(∂_t U^n, U^{n+1/2}) + (a^{n+1/2}_{−1/2} ∂̄_x U^{n+1/2}, ∂̄_x U^{n+1/2}) = 0. (10.14)

Here

(∂_t U^n, U^{n+1/2}) = ((U^{n+1} − U^n)/k, ½(U^{n+1} + U^n)) = ½ ∂_t ‖U^n‖²,

and (10.14) thus immediately yields

∂_t ‖U^n‖² ≤ 0,
whence again ‖U^n‖ ≤ ‖V‖. For the θ-method, with U^{n+θ} = θ U^{n+1} + (1 − θ) U^n, we have similarly

(∂_t U^n, U^{n+θ}) + (a^{n+θ}_{−1/2} ∂̄_x U^{n+θ}, ∂̄_x U^{n+θ}) = 0. (10.16)
Here
2
IU"nll2+k n Imll2,<CIVII
,
m=l
under the assumption 22K < 1(with strict inequality). Comparing with (10.13) we see
that this condition implies that the difference operator is parabolic in John's sense.
For the Crank-Nicolson method we find instead
n
[IU"ll2+k E HxaUn+ 2/2 <CIIVI12 .
m=l
These estimates are discrete analogues of (10.3) and show, in addition to 12- stability,
a certain smoothness of the solution in a sense defined by the terms in axU.
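For a time-independent coefficient the Crank–Nicolson energy identity even holds with equality, which makes it a convenient numerical check of the discrete smoothing estimate above. A minimal sketch, taking $a\equiv1$ for simplicity (an assumption made here, not in the text):

```python
import numpy as np

# Crank-Nicolson for u_t = u_xx with homogeneous Dirichlet conditions.
# With a constant coefficient the energy identity is exact:
#   ||U^n||^2 + 2k * sum_m ||d_x U^{m-1/2}||^2 = ||v||^2.
M, k, N = 40, 2e-3, 100
h = 1.0 / M
x = np.linspace(0, 1, M + 1)

n = M - 1
A = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / h**2        # discrete Laplacian d_x dbar_x
I = np.eye(n)
L = I - 0.5 * k * A                               # left-hand matrix
R = I + 0.5 * k * A                               # right-hand matrix

U = np.sin(np.pi * x[1:-1]) + 0.3 * np.sin(3 * np.pi * x[1:-1])
v2 = h * np.sum(U**2)
dissip = 0.0
for _ in range(N):
    Unew = np.linalg.solve(L, R @ U)
    Uhalf = 0.5 * (U + Unew)
    full = np.concatenate(([0.0], Uhalf, [0.0]))  # attach boundary zeros
    dxU = np.diff(full) / h                       # backward difference quotients, j = 1..M
    dissip += 2 * k * h * np.sum(dxU**2)
    U = Unew

lhs = h * np.sum(U**2) + dissip
assert abs(lhs - v2) < 1e-9 * max(1.0, v2)        # discrete energy balance
```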
The energy method may be used to derive stability also in other norms. For
instance, consider the $\theta$-method
$$\partial_tU_j^n=\partial_x\big(a_{-1/2}\bar\partial_xU^{n+\theta}\big)_j, \qquad(10.15)$$
with the same boundary and initial conditions as before. We may show stability with respect to a discrete analogue of the $H_0^1$ norm, under the same condition (10.19) as above: we multiply the finite difference equation in (10.15) by $\partial_tU_j^n$ and sum over $j$ to obtain
$$\|\partial_tU^n\|^2+\big(a_{-1/2}^{n+\theta}\bar\partial_xU^{n+\theta},\bar\partial_x\partial_tU^n\big)=0,$$
from which (10.20) follows, in view of the bounds (10.2) for $a$. For $\theta<\tfrac12$ a corresponding estimate holds under the appropriate mesh-ratio condition.
THEOREM 10.1. Let $u$ and $U^n$ be the solutions of (10.23) and (10.24), respectively, and
assume that (10.27) holds for $\theta<\tfrac12$. Then we have, with $C=C(u,T)$, for $nk\le T$,
$$\|U^n-u^n\|\le C(h^2+k).$$
We shall see that the energy method may even be used to derive a maximum-norm
error estimate. This will follow from a stability estimate in the discrete $H_0^1$ norm
together with the easily proven discrete Sobolev type inequality
$$\|V\|_{\infty,h}\le C\|\bar\partial_xV\|.$$

THEOREM 10.2. Let $u$ and $U^n$ be the solutions of (10.23) and (10.24), respectively, and
assume that (10.27) holds for $\theta<\tfrac12$. Then we have, with $C=C(u,T)$, for $nk\le T$,
$$\|U^n-u^n\|_{\infty,h}\le C(h^2+k).$$
We shall briefly discuss the effect of lower-order terms in the differential and
difference equations in the above stability analysis.

SECTION 10 Mixed initial boundary value problem 119
For the modified equation, multiplying by $U_j^n$ as before and estimating the lower-order terms, where in the last step we use (10.2) and the geometric–arithmetic mean value inequality, we obtain
$$\bar\partial_t\|U^n\|^2+\|(a_{-1/2}^n)^{1/2}\bar\partial_xU^n\|^2\le\tfrac12\|(a_{-1/2}^n)^{1/2}\bar\partial_xU^n\|^2+C\|U^n\|^2,$$

120 V. Thomée CHAPTER III

and hence
$$\|U^{n+1}\|^2\le\|U^n\|^2+Ck\big(\|U^{n+1}\|^2+\|U^n\|^2\big),$$
and again, for small $k$,
$$\|U^{n+1}\|^2\le(1+Ck)\|U^n\|^2,$$
from which (10.32) follows as above.
For the forward Euler method with lower-order terms,
$$\partial_tU_j^n=\partial_x\big(a_{-1/2}\bar\partial_xU\big)_j^n+b_1\bar\partial_xU_j^n+b_0U_j^n,$$
we have analogously
$$\partial_t\|U^n\|^2+\|(a_{-1/2}^n)^{1/2}\bar\partial_xU^n\|^2\le C\big(\|\bar\partial_xU^n\|\,\|U^n\|+\|U^n\|^2\big),$$
and the stability estimate follows as before.
We next consider the mixed initial boundary value problem
$$\frac{\partial u}{\partial t}=\frac{\partial}{\partial x}\Big(a\frac{\partial u}{\partial x}\Big) \quad\text{for } 0<x<1,\ t>0, \qquad(10.33)$$
with boundary conditions of the third kind,
$$\frac{\partial u}{\partial x}-\alpha u=0 \quad\text{at } x=0,\qquad \frac{\partial u}{\partial x}+\beta u=0 \quad\text{at } x=1,$$
$$u(x,0)=v(x),$$
where $a$ is a smooth function satisfying (10.2) and $\alpha$ and $\beta$ are smooth bounded
functions of $t$.

In this case it is convenient to use a mesh which does not contain the points $x=0$
and $x=1$, and we set, for $M$ a positive integer greater than 1, $h=1/(M-1)$ and
$x_j=-\tfrac12h+jh$, $j=0,\dots,M$. We then have $x_0=-\tfrac12h$ and $x_M=1+\tfrac12h$, and these points
are thus outside the interval $[0,1]$, whereas the remaining points $x_1,\dots,x_{M-1}$ are
interior mesh points.
We now pose the backward Euler type difference equations at the interior mesh
points,
$$\bar\partial_tU_j^n=\partial_x\big(a_{-1/2}\bar\partial_xU\big)_j^n, \quad j=1,\dots,M-1,\ n\ge1, \qquad(10.34)$$
together with discrete boundary conditions (10.35) approximating those in (10.33), and introduce the inner product and norm
$$(V,W)_{(r,s)}=h\sum_{j=r}^{s}V_jW_j,\qquad \|V\|_{(r,s)}=(V,V)_{(r,s)}^{1/2}.$$
In the Neumann case,
$$\bar\partial_xU_1^n=\bar\partial_xU_M^n=0,$$
we have, by summation by parts,
$$\big(\partial_x(a_{-1/2}\bar\partial_xU)^n,U^n\big)_{(1,M-1)}=-\big(a_{-1/2}\bar\partial_xU^n,\bar\partial_xU^n\big)_{(2,M-1)}+a_{M-1/2}^n\bar\partial_xU_M^n\,U_{M-1}^n-a_{1/2}^n\bar\partial_xU_1^n\,U_1^n$$
$$=-\big(a_{-1/2}\bar\partial_xU^n,\bar\partial_xU^n\big)_{(2,M-1)}\le0, \qquad(10.36)$$
which shows the stability in the case of Neumann boundary conditions. Note
that the argument automatically shows also the existence of a solution of (10.34) and
(10.35), since $U^{n-1}=0$ implies $U^n=0$.

Consider now the general case of the boundary conditions (10.35) with $\alpha$ and $\beta$ not
necessarily vanishing. In view of (10.35), the identity (10.36) may now be replaced by
$$\big(\partial_x(a_{-1/2}\bar\partial_xU)^n,U^n\big)_{(1,M-1)}=-\big(a_{-1/2}\bar\partial_xU^n,\bar\partial_xU^n\big)_{(2,M-1)}+a_{M-1/2}^n\bar\partial_xU_M^n\,U_{M-1}^n-a_{1/2}^n\bar\partial_xU_1^n\,U_1^n$$
$$\le-\big(a_{-1/2}\bar\partial_xU^n,\bar\partial_xU^n\big)_{(2,M-1)}+C\big(|U_1^n|^2+|U_{M-1}^n|^2\big).$$
It is easy to prove that for $\varepsilon>0$ arbitrarily small the a priori inequalities
$$|U_1|^2+|U_{M-1}|^2\le\varepsilon\|\bar\partial_xU\|_{(2,M-1)}^2+C_\varepsilon\|U\|_{(1,M-1)}^2$$
and
$$|U_0|^2\le2|U_1|^2+2h^2|\bar\partial_xU_1|^2$$
hold. We thus obtain
$$(1-Ck)\|U^n\|_{(1,M-1)}^2\le\|U^{n-1}\|_{(1,M-1)}^2,$$
whence, for small $k$,
$$\|U^n\|_{(1,M-1)}\le(1+Ck)\|U^{n-1}\|_{(1,M-1)},$$
from which the stability follows.
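The stability of the backward Euler scheme on the offset mesh can be illustrated numerically; the coefficient $a(x)$, the initial data, and the tolerances below are illustrative choices of this sketch, not taken from the text. In the pure Neumann case the zero-flux boundary conditions also conserve the discrete mass exactly, which the sketch checks alongside the $l_2$ bound.

```python
import numpy as np

# Backward Euler on the offset mesh x_j = -h/2 + j*h, h = 1/(M-1),
# with Neumann conditions imposed as vanishing backward difference
# quotients at j = 1 and j = M (zero flux at both ends).
M = 41
h = 1.0 / (M - 1)
k = 5e-4
x = -h / 2 + h * np.arange(M + 1)
a = lambda s: 1.0 + 0.25 * np.cos(np.pi * s)
af = a(x[1:] - h / 2)                        # af[j-1] = a_{j-1/2}, j = 1..M

n = M - 1                                    # unknowns U_1..U_{M-1}
lam = k / h**2
B = np.zeros((n, n))
for j in range(1, M):
    i = j - 1
    left = 0.0 if j == 1 else af[j - 1]      # zero flux through the left face
    right = 0.0 if j == M - 1 else af[j]     # zero flux through the right face
    B[i, i] = 1 + lam * (left + right)
    if j > 1:
        B[i, i - 1] = -lam * left
    if j < M - 1:
        B[i, i + 1] = -lam * right

U = np.exp(-20 * (x[1:M] - 0.3) ** 2)
mass0 = h * U.sum()
norm0 = np.sqrt(h * np.sum(U**2))
for _ in range(200):
    U = np.linalg.solve(B, U)

assert abs(h * U.sum() - mass0) < 1e-9       # zero-flux mass conservation
assert np.sqrt(h * np.sum(U**2)) <= norm0    # l2-stability
```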
The analysis just described extends to the case $0\le\theta\le1$. Further, it applies to
arbitrary boundary value problems for parabolic systems of second order in one
space dimension which are correctly posed in the sense that the corresponding
elliptic operator is minimally semibounded (or maximally dissipative) in the
appropriate sense; see e.g. KREISS [1960, 1963].
The arguments may also be carried over to the case of a nonhomogeneous
equation with nonhomogeneous boundary conditions, and the resulting estimate
may be used, in particular, to derive error bounds for (10.34) and (10.35). Consider
thus the discrete problem
$$\bar\partial_tU_j^n=\partial_x\big(a_{-1/2}\bar\partial_xU\big)_j^n+F_j^n, \quad j=1,\dots,M-1,\ n\ge1,$$
$$\bar\partial_xU_1^n-\tfrac12\alpha^n(U_0^n+U_1^n)=G_0^n, \quad n\ge1, \qquad(10.38)$$
$$\bar\partial_xU_M^n+\tfrac12\beta^n(U_M^n+U_{M-1}^n)=G_1^n, \quad n\ge1,$$
$$U_j^0=v(x_j), \quad j=1,\dots,M-1.$$
Proceeding as in the above stability analysis we now obtain, instead of (10.37), an estimate of the form
$$\bar\partial_t\|U^n\|_{(1,M-1)}^2+\kappa\|\bar\partial_xU^n\|_{(2,M-1)}^2\le C\big(\|U^n\|_{(1,M-1)}^2+\|F^n\|_{(1,M-1)}^2+|G_0^n|^2+|G_1^n|^2\big).$$
Since the exact solution of (10.33), and hence also the error $z^n=U^n-u^n$, satisfies
(10.38) with suitable $F$, $G_0$, and $G_1$ of optimal order, we may conclude the following error estimate, which, for simplicity only, we state for
the homogeneous differential equation with homogeneous boundary conditions.

THEOREM 10.3. Let $u$ and $U^n$ be the solutions of (10.33) and (10.34), (10.35),
respectively. Then we have, with $C=C(u,T)$,
$$\|U^n-u^n\|_{(1,M-1)}\le C(h^2+k) \quad\text{for } nk\le T.$$
An alternative approximation of the boundary condition at $x=0$ uses the symmetric difference quotient $\hat\partial_x=\tfrac12(\partial_x+\bar\partial_x)$, e.g.
$$\hat\partial_xU_0^n-\alpha U_0^n=0.$$
In the case that $\alpha$ is negative
and $\beta$ positive, results are obtained in e.g. SAMARSKII [1961a] for this approximation
by the energy method. We shall return to such methods in Sections 11 and 12 below.
The energy arguments employed in the above analysis may be extended to several
space dimensions and to nonrectangular domains. We shall demonstrate this by
considering the problem
$$\frac{\partial u}{\partial t}=Au=\sum_{i,j=1}^{d}\frac{\partial}{\partial x_i}\Big(a_{ij}\frac{\partial u}{\partial x_j}\Big) \quad\text{for } x\in\Omega,\ t>0, \qquad(10.39)$$
$$u(x,t)=0 \quad\text{on } \partial\Omega,\ t>0,$$
$$u(x,0)=v(x) \quad\text{in } \Omega,$$
where $\Omega\subset\mathbb{R}^d$ will first be assumed to be a union of axis-parallel cubes with vertices
having integer components. The symmetric positive-definite matrix $(a_{ij})$ will be
assumed to have constant coefficients, for simplicity of presentation only.
For the numerical solution we select a mesh width $h=1/M$, where $M$ is a positive
integer, and cover $\bar\Omega$ by a cubic grid of mesh points $x=\beta h=(\beta_1h,\dots,\beta_dh)$ with $\beta_i$
integers, and let $\Omega_h$ denote the mesh points in the interior of $\Omega$ and $\Gamma_h$ those on $\partial\Omega$.
We note that all "neighbors" $x+\delta h$ of a mesh point $x$ in $\Omega_h$, where $\delta=(\delta_1,\dots,\delta_d)$
with $|\delta_i|\le1$, are in $\Omega_h\cup\Gamma_h$.
Let now $A_h$ be the finite difference operator approximating $A$ in (10.39) and
defined by
$$A_hU(x)=\sum_{i,j=1}^{d}\bar\partial_{x_i}\big(a_{ij}\partial_{x_j}U\big)(x), \quad x\in\Omega_h, \qquad(10.40)$$
where $\partial_{x_i}$ and $\bar\partial_{x_i}$ denote the forward and backward difference quotients in the
direction of $x_i$,
$$\partial_{x_i}U(x)=\big(U(x+he_i)-U(x)\big)/h$$
and
$$\bar\partial_{x_i}U(x)=\big(U(x)-U(x-he_i)\big)/h.$$
We shall now apply the $\theta$-method to the present situation and thus pose the
discrete problem to find an approximation to the solution $u$ of (10.39) at $t=nk$. Arguing as before, (10.45) then holds, so that stability is shown under the condition (10.46).
The above analysis assumes that the domain $\Omega$ is such that its boundary falls on
the mesh planes. In this case the exact solution of (10.39) is, in general, nonsmooth
even for smooth data, and even though the method is formally $O(h^2+k)$ for $\theta\ne\tfrac12$ and
$O(h^2+k^2)$ for $\theta=\tfrac12$, the corresponding convergence estimates may not apply for lack
of regularity. On the other hand, if $\Omega$ has a smooth boundary, difficulties arise in the
construction of the difference equations near $\partial\Omega$.
We shall briefly describe and analyze a crude scheme for the case of a curved
boundary, based on an approximation in the elliptic case discussed in THOMEE
[1964]. Consider thus the parabolic problem
$$\frac{\partial u}{\partial t}=\sum_{i,j=1}^{d}\frac{\partial}{\partial x_i}\Big(a_{ij}\frac{\partial u}{\partial x_j}\Big)+f \quad\text{in } \Omega\times[0,T], \qquad(10.47)$$
$$u=g \quad\text{on } \Gamma\times[0,T],$$
$$u(x,0)=v(x) \quad\text{for } x\in\Omega,$$
where $(a_{ij})$ is a positive-definite symmetric $(d\times d)$ matrix which again for simplicity of
presentation only we take to be constant, and where $\Gamma=\partial\Omega$.

Consider the finite difference approximation of the elliptic operator defined as
above by (10.40). In the two-dimensional case the operator $A_h$ takes the form
$$A_hV(x)=\sum_{i,j=1}^{2}\bar\partial_{x_i}\big(a_{ij}\partial_{x_j}V(x)\big),$$
and we introduce the inner product
$$(V,W)=h^d\sum_{x=\beta h}V(x)W(x).$$
Since $(a_{ij})$ is positive-definite, the second term is nonnegative and we conclude that
the first term, and hence $U^{n+1}$, vanishes, which completes the proof.
THEOREM 10.4. Assume that $u\in C^{3,2}$. Then, for the solutions of (10.48) and (10.47), we
have
PROOF. Let $V^n\in l_2(\bar\Omega_h)$ for $n\ge0$. Let now $\Gamma_h^0$ be the set of $x\in\Omega_h$ with neighbors outside $\Omega_h$, and $\Omega_h^0=\Omega_h\setminus\Gamma_h^0$. Set
$$\|V\|_h^2=h^{d-2}\sum_{x\in\Gamma_h^0}V(x)^2+h^d\sum_{x\in\Omega_h^0}V(x)^2,$$
and define
$$\tilde L_{kh}V(x)=\begin{cases}L_{kh}V(x)&\text{for }x\in\Omega_h^0,\\ hL_{kh}V(x)&\text{for }x\in\Gamma_h^0,\\ 0&\text{for }x\notin\Omega_h.\end{cases}$$
Then, by the definition of $L_{kh}$ in (10.48), $\tilde L_{kh}u(x)=O(k+h)$ (or $O(k+h^2)$ if $u\in C^{4,2}$), and for $x\in\Gamma_h^0$ the boundary values are compared with $u$ at a nearby point of $\Gamma$. We then multiply by $V^n+V^{n+1}$ to obtain, with obvious notation, for $V^n\in l_2(\Omega_h)$,
$$\|V^{n+1}\|^2-\|V^n\|^2-k\big(A_h(V^n+V^{n+1}),V^n+V^{n+1}\big)=k\big((L_{kh}V)^{n+1/2},V^n+V^{n+1}\big)$$
$$\le k\|(L_{kh}V)^{n+1/2}\|\,\|V^n+V^{n+1}\|,$$
and, by summation,
$$\|V^n\|^2\le\|V^0\|^2+Ck\sum_{l=0}^{n-1}\|(L_{kh}V)^{l+1/2}\|^2.$$
Let now $S_h$ denote the continuous functions on $\bar\Omega$ which are linear in each of the
triangles of $\mathcal{T}_h$ and which vanish outside $\Omega_h$, and let $\{P_j\}_{j=1}^{N_h}$ be the interior vertices of
$\mathcal{T}_h$. A function $\chi\in S_h$ may then be written as
$$\chi(x)=\sum_{j=1}^{N_h}\chi(P_j)\varphi_j(x),$$
where $\{\varphi_j\}_{j=1}^{N_h}$ are the basis functions in $S_h$ defined by $\varphi_j(P_j)=1$, $\varphi_j(P_l)=0$ for $l\ne j$.
For the purpose of defining an approximate solution in $S_h$ of the initial boundary
FIG. 10.1.
value problem (10.57) we first write this problem in weak form: we multiply the
heat equation by a smooth function $\varphi$ which vanishes on $\partial\Omega$, integrate over $\Omega$, and
apply Green's formula to the second term to obtain, for all such $\varphi$, with $(v,w)$ now
denoting the standard inner product in $L_2(\Omega)$,
$$(u_t,\varphi)+(\nabla u,\nabla\varphi)=(f,\varphi) \quad\text{for } t>0.$$
We may now pose the approximate problem to find $u_h(t)$, belonging to $S_h$ for each $t$,
such that
$$(u_{h,t},\chi)+(\nabla u_h,\nabla\chi)=(f,\chi) \quad\text{for } \chi\in S_h,\ t>0, \qquad(10.59)$$
together with the initial condition
$$u_h(0)=v_h,\qquad v_h=\sum_{j=1}^{N_h}v(P_j)\varphi_j.$$
Expanding $u_h$ in the basis, $u_h(x,t)=\sum_{j=1}^{N_h}U_j(t)\varphi_j(x)$, we recall from (10.58) that the $U_j(t)$ are then the values of the approximate solution at the interior vertices. It is sometimes convenient to use instead the lumped mass inner product
$$(v,w)_h=\sum_{\tau}Q_\tau(vw),$$
obtained by adding all the elements in each row of the mass matrix $B$ into the
corresponding diagonal element. For strictly interior mesh points we also have
and the Crank–Nicolson method, the special cases $\theta=0$, $1$, and $\tfrac12$ of the $\theta$-method, i.e.
$$\partial_tU_j^n-\partial_x\bar\partial_xU_j^{n+\theta}=0, \qquad(11.2)$$
where $h=1/M$ is the mesh width in space and $U^{n+\theta}=\theta U^{n+1}+(1-\theta)U^n$. The system
of equations for the components of $U^{n+1}$ in (11.2) may be written
$$(1+2\theta\lambda)U_j^{n+1}-\theta\lambda\big(U_{j+1}^{n+1}+U_{j-1}^{n+1}\big)=\big(1-2(1-\theta)\lambda\big)U_j^n+(1-\theta)\lambda\big(U_{j+1}^n+U_{j-1}^n\big), \quad j=1,\dots,M-1,$$
$$U_0^{n+1}=U_M^{n+1}=0.$$
We see that if
$$2(1-\theta)\lambda\le1, \qquad(11.3)$$
then the coefficients on the right are nonnegative, and an obvious argument shows as
earlier that
$$\|U^{n+1}\|_{\infty,h}\le\|U^n\|_{\infty,h}, \qquad(11.4)$$
where the norm is the maximum norm over the mesh points,
$$\|v\|_{\infty,h}=\max_j|v_j|.$$
PROOF. Assume that the maximum is attained at the point $(j,n+1)$ with $0<j<M$.
Then at this point
$$(1+2\theta\lambda)U_j^{n+1}\le\theta\lambda\big(U_{j+1}^{n+1}+U_{j-1}^{n+1}\big)+\big(1-2(1-\theta)\lambda\big)U_j^n+(1-\theta)\lambda\big(U_{j+1}^n+U_{j-1}^n\big),$$
and hence, since the coefficients on the right are all nonnegative and add up to
$1+2\theta\lambda$, i.e. the coefficient on the left, and since the values of $U$ occurring on the right
are at most $U_j^{n+1}$, they all have to be equal to this number. Repeating this argument
we may conclude in a finite number of steps that the same value has to be attained at
either a point on the initial line or on the left or right boundary. This shows the first
part of the theorem. The second part follows from the first by application to $-U$.

It follows, in particular, that if $U$ satisfies (11.2), then since $L_{kh}U_j^n=0$ and since
$U$ vanishes on the left and right boundaries, we have
$$\min\Big(0,\min_{0\le j\le M}U_j^0\Big)\le U_j^n\le\max\Big(0,\max_{0\le j\le M}U_j^0\Big).$$
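The discrete maximum principle is easy to test numerically. The sketch below (parameters and initial data are illustrative) takes $\theta=\tfrac12$ and $\lambda=1$, for which condition (11.3) holds with equality, and checks that the maximum norm is nonincreasing, as (11.4) asserts.

```python
import numpy as np

def theta_step(U, lam, theta):
    # one step of the theta-method for u_t = u_xx with U_0 = U_M = 0
    n = len(U) - 2
    T = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
         + np.diag(np.ones(n - 1), -1))
    L = np.eye(n) - theta * lam * T          # diagonal 1 + 2*theta*lam
    R = np.eye(n) + (1 - theta) * lam * T    # diagonal 1 - 2*(1-theta)*lam
    Unew = np.zeros_like(U)
    Unew[1:-1] = np.linalg.solve(L, R @ U[1:-1])
    return Unew

M, theta, lam = 30, 0.5, 1.0                 # 2*(1-theta)*lam = 1: (11.3) holds
U = np.abs(np.random.default_rng(0).standard_normal(M + 1))
U[0] = U[-1] = 0.0
hist = [np.max(np.abs(U))]
for _ in range(50):
    U = theta_step(U, lam, theta)
    hist.append(np.max(np.abs(U)))

# the maximum norm never increases
assert all(hist[i + 1] <= hist[i] + 1e-12 for i in range(len(hist) - 1))
```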
Our above theorem may also be used to discuss monotonicity properties in the
case of the nonhomogeneous initial boundary value problem
$$\frac{\partial u}{\partial t}-\frac{\partial^2u}{\partial x^2}=f \quad\text{for } 0<x<1,\ t>0,$$
$$u(0,t)=g_0(t),\quad t>0,$$
$$u(1,t)=g_1(t),\quad t>0,$$
$$u(x,0)=v(x),\quad 0\le x\le1,$$
and the corresponding finite difference scheme
$$\partial_tU_j^n-\partial_x\bar\partial_xU_j^{n+\theta}=F_j^{n+\theta}, \quad j=1,\dots,M-1,\ n\ge0,$$
$$U_0^{n+1}=G_0^{n+1}, \quad n\ge0, \qquad(11.6)$$
$$U_M^{n+1}=G_1^{n+1}, \quad n\ge0,$$
$$U_j^0=V_j, \quad j=0,\dots,M.$$
Let (11.3) be satisfied. We may then assert that if $U$ is the solution of (11.6) and $\tilde U$ the
solution with $F$, $G_0$, $G_1$, and $V$ replaced by $\tilde F$, $\tilde G_0$, $\tilde G_1$, and $\tilde V$, respectively, and if $F\le\tilde F$,
$G_0\le\tilde G_0$, $G_1\le\tilde G_1$, and $V\le\tilde V$, then $U\le\tilde U$. For
$$L_{kh}(U-\tilde U)_j^n=F_j^{n+\theta}-\tilde F_j^{n+\theta}\le0 \quad\text{for } j=1,\dots,M-1,\ n\ge0,$$
and hence, by Theorem 11.1, $U-\tilde U$ attains its maximum on the initial line or the left-
or right-hand boundaries. But there $U-\tilde U\le0$, so that this inequality holds
everywhere. As a special case, if the data of (11.6) are nonpositive (or nonnegative), so
is the solution.
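The comparison argument can be observed numerically. The sketch below takes $\theta=1$ (backward Euler), for which (11.3) holds for every $\lambda$, and evolves two solutions with ordered forcing and ordered initial data (and equal homogeneous boundary data); the ordering persists at every step. All data are illustrative.

```python
import numpy as np

# Comparison principle for scheme (11.6) with theta = 1: ordered data
# give ordered solutions, since the backward Euler matrix is an M-matrix
# with nonnegative inverse.
M, k = 20, 1e-2
h = 1.0 / M
lam = k / h**2
n = M - 1
T = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1))
L = np.eye(n) - lam * T

x = np.linspace(0, 1, M + 1)[1:-1]
U = np.sin(np.pi * x)
Ut = U + 0.1                                 # ordered initial data: V <= V~
F = -np.ones(n)
Ft = F + 0.5                                 # ordered forcing: F <= F~
for _ in range(100):
    U = np.linalg.solve(L, U + k * F)        # homogeneous boundary values
    Ut = np.linalg.solve(L, Ut + k * Ft)
    assert np.all(U <= Ut + 1e-12)           # ordering is preserved
```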
This argument may also be used in error estimation. To demonstrate this,
consider for simplicity the case $\theta\ne1$, so that $k=O(h^2)$ when (11.3) is satisfied. Setting
as usual $z^n=U^n-u^n$ we now have
$$L_{kh}z_j^n=\partial_tz_j^n-\partial_x\bar\partial_xz_j^{n+\theta}=\tau_j^n, \quad j=1,\dots,M-1,\ n\ge0,$$
$$z_0^n=z_M^n=0, \quad n\ge0,$$
$$z_j^0=0, \quad j=0,\dots,M,$$
where $\tau_j^n=O(h^2)$. Let
$$w_j^n=z_j^n-\mu\,jh(1-jh)h^2, \quad j=1,\dots,M-1,\ n\ge0,$$
with $\mu$ sufficiently large. Then $L_{kh}w_j^n\le0$,
and hence, since $w$ vanishes on the left- and right-hand boundaries and is
nonpositive on the initial line, $w$ is nonpositive everywhere and hence
$$z_j^n\le\mu\,jh(1-jh)h^2.$$
Since the analogous estimate holds for $-z$ we may state the following theorem:
THEOREM 11.2. Let $U^n$ and $u$ be the solutions of (11.2) and (11.1), respectively, and
assume that (11.3) holds. Then we have
$$|U_j^n-u_j^n|\le\mu\,jh(1-jh)h^2 \quad\text{for } n\ge0.$$
This result includes (11.5) as a special case but also shows that the error is smaller
near the endpoints of the interval.
Arguments of the above type have been used, also for nonlinear problems, e.g. in
ROSE [1956], ISAACSON [1961], BATTEN [1963] and KRAWCZYK [1963].
We shall now discuss some maximum-norm stability and convergence results in
the case that the Dirichlet boundary conditions in (11.1) are replaced by boundary
conditions of Neumann or the third kind,
$$\frac{\partial u}{\partial x}-\alpha u=0 \quad\text{at } x=0,\qquad \frac{\partial u}{\partial x}+\beta u=0 \quad\text{at } x=1, \qquad(11.7)$$
which are discretized using the symmetric difference quotient $\hat\partial_x=\tfrac12(\partial_x+\bar\partial_x)$, and similarly
for the second boundary condition.
The equations for the components of $U^{n+1}$ may be written in the form
$$(1+2\lambda)U_j^{n+1}-\lambda\big(U_{j+1}^{n+1}+U_{j-1}^{n+1}\big)=U_j^n, \quad j=1,\dots,M-1, \qquad(11.11)$$
with corresponding equations at $j=0$ and $j=M$ incorporating the boundary conditions, e.g.
$$(1+\beta h+2\lambda)U_M^{n+1}-2\lambda U_{M-1}^{n+1}=U_M^n, \qquad(11.13)$$
and we conclude in the standard way, since $\alpha$ and $\beta$ are nonnegative, that the scheme is stable in the maximum norm. For the error $z^n=U^n-u^n$ one finds correspondingly
$$(1+\alpha h+2\lambda)z_0^{n+1}-2\lambda z_1^{n+1}=z_0^n+kO(h), \qquad(11.15)$$
$$(1+\beta h+2\lambda)z_M^{n+1}-2\lambda z_{M-1}^{n+1}=z_M^n+kO(h), \qquad(11.16)$$
so that
$$\|z^{n+1}\|_{\infty,h}\le\|z^n\|_{\infty,h}+C(u)kh,$$
and hence, by summation, since $z^0=0$, only a first-order error bound results from this argument.
To improve this to second order we compare $z$ with a correction function, setting $\omega_j^n=z_j^n-hG_j$ with $G$ smooth. Then
$$\partial_t\omega_j^{n+1}-\partial_x\bar\partial_x\omega_j^{n+1}=O(h^2)-h\,\partial_x\bar\partial_xG_j, \quad j=1,\dots,M-1,$$
with corresponding boundary relations involving $\hat\partial_xG_0-\alpha G_0$ and $\hat\partial_xG_M+\beta G_M$. Now, if $G$ is chosen such that, for some appropriately large positive number $\gamma$,
$$\partial_x\bar\partial_xG_j\le\gamma,\qquad \hat\partial_xG_0-\alpha G_0\ge\gamma,\qquad \hat\partial_xG_M+\beta G_M\le-\gamma, \qquad(11.18)$$
then we may conclude that $\omega_j^n$ attains its nonnegative maximum on the initial line, which leads to the bound (11.21) of second order in $h$.
THEOREM 11.3. Let $U^n$ be the solution of (11.8), (11.9), and (11.10) and $u$ that of
(11.1) with the boundary conditions replaced by (11.7), and assume that (11.3) holds.
Then
$$\|U^n-u^n\|_{\infty,h}\le C(u)h^2 \quad\text{for } n\ge0.$$

The above argument is given by ISAACSON [1961], who also considered the explicit
forward Euler scheme. The earlier stability argument is carried out also for the
general $\theta$-method in SAMARSKII and GULIN [1973]. See also GORENFLO [1971a,b,c].
We shall now leave the case of one space dimension and consider, for the rest of
this section, the initial boundary value problem
$$\frac{\partial u}{\partial t}=\Delta u+f \quad\text{in } \Omega\times[0,T],$$
$$u=g \quad\text{on } \partial\Omega,\ t\in[0,T], \qquad(11.22)$$
$$u(x,0)=v(x) \quad\text{in } \Omega.$$
Note that for $x\in\Omega_h^0$, and also if $x\in\Gamma_h^0$ and all its neighbors are in $\bar\Omega$, this
definition reduces to our old definition (11.24). Note also that by Taylor expansion
we have
$$A_hv(x)=\Delta v(x)+O(h^2) \qquad(11.25)$$
for $x\in\Omega_h^0$, if $v\in C^4(\bar\Omega)$,
FIG. 11.1.
and
$$A_hv(x)=\Delta v(x)+O(h) \qquad(11.26)$$
for $x\in\Gamma_h^0$, if $v\in C^3(\bar\Omega)$,
so that, in particular, the accuracy can only be guaranteed to be of first order at the
irregular points of $\Gamma_h^0$.
We may now pose the discrete problem
$$L_{kh}U^{n+1}(x)=\bar\partial_tU^{n+1}(x)-A_hU^{n+1}(x)=f^{n+1}(x) \quad\text{for } x\in\Omega_h,\ n\ge0, \qquad(11.27)$$
$$U^{n+1}(x)=g^{n+1}(x)=g(x,(n+1)k) \quad\text{for } x\in\Gamma_h.$$
In matrix form, $B_{kh}$ and $\gamma_h$ are the appropriate matrices. In particular, the matrix $B_{kh}$ has, in a row
corresponding to an interior mesh point $x\in\Omega_h^0$, the diagonal element $1+4\lambda=1+4k/h^2$, four off-diagonal elements $-\lambda$, and the remaining elements $0$. For $x\in\Gamma_h^0$ the
diagonal element is instead modified according to the distances from $x$ to the boundary.
We are now in a position to prove the following error estimate, part of which also
shows maximum-norm stability of our method. Here and in the rest of this section
we write
$$\|V\|_S=\sup_{x\in S}|V(x)|.$$

THEOREM 11.4. Let $U^n$ and $u$ be the solutions of (11.27) and (11.22), respectively, and
assume that $u\in C^{4,2}$. Then
$$\|U^n-u^n\|_{\bar\Omega_h}\le C(u,T)(k+h^2) \quad\text{for } nk\le T.$$
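The maximum-norm stability contained in Theorem 11.4 can be checked directly for the backward Euler five-point scheme on the unit square with homogeneous boundary data: the matrix $B_{kh}$ described above is an M-matrix with row sums at least 1, so each step is a contraction in the maximum norm. A small sketch with illustrative parameters:

```python
import numpy as np

# Backward Euler five-point scheme on the unit square, zero Dirichlet data.
# B = I - lam*Lap has diagonal 1 + 4*lam and off-diagonal -lam.
M = 15
h = 1.0 / M
k = 2e-3
lam = k / h**2
n = M - 1
I1 = np.eye(n)
T = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1))
Lap = np.kron(I1, T) + np.kron(T, I1)        # h^2 times the 5-point Laplacian
B = np.eye(n * n) - lam * Lap

U = np.random.default_rng(2).standard_normal(n * n)
hist = [np.max(np.abs(U))]
for _ in range(50):
    U = np.linalg.solve(B, U)
    hist.append(np.max(np.abs(U)))

# each backward Euler step contracts the maximum norm
assert all(hist[i + 1] <= hist[i] + 1e-12 for i in range(len(hist) - 1))
```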
PROOF. We shall use the representation (11.29) and recall that the matrices $E_{kh}$ and
$\Lambda_h$ have nonnegative elements. With $\le$ denoting elementwise inequality for matrices
and vectors, and $1_S$ denoting vectors of the appropriate dimensions with all
elements $=1$ on $S$, we find by setting $U^n(x)=1$ for $x\in\Omega_h\cup\Gamma_h$, $n\ge0$,
in (11.29) that the powers of $E_{kh}$ are bounded. Similarly, with
$$U^n(x)=\begin{cases}1&\text{on }\Omega_h,\\0&\text{on }\Gamma_h,\end{cases}$$
we have $(L_{kh}U)^n(x)=0$ for $x\in\Omega_h^0$, whereas for $x\in\Gamma_h^0$ we have (in the case of the above
figure, say)
$$(L_{kh}U)^n(x)=-(A_hU)(x)\ge\frac{c}{h^2},$$
so that, with $v^0=1$ on $\Gamma_h^0$,
$$k\sum_{l=0}^{n-1}E_{kh}^l\Lambda_hv^0\le h^2\,1. \qquad(11.31)$$
From (11.29), (11.30) and (11.31) and the positivity of the coefficients we now
conclude the a priori estimate
$$\|U^n\|_{\bar\Omega_h}\le\|U^0\|_{\bar\Omega_h}+k\sum_{l\le n}\|(L_{kh}U)^l\|_{\Omega_h^0}+h^2\max_{l\le n}\|(L_{kh}U)^l\|_{\Gamma_h^0}+\max_{l\le n}\|U^l\|_{\Gamma_h}. \qquad(11.32)$$
Applied to the error $z^n=U^n-u^n$ it follows that
$$\|z^n\|\le Ck\sum_{l=1}^{n}(k+h^2)+Ch^2(k+h)\le C(u,T)(k+h^2),$$
which completes the proof.
We remark that it follows from the above analysis that even if the local error on
$\Gamma_h^0$ had been only bounded, and not $O(h+k)$ as here, the global error would still have
been $O(k+h^2)$, as a result of the factor $h^2$ occurring in the corresponding term in the
a priori estimate (11.32).
We shall now see that even a very crude approximation of the boundary values
may be used to get an $O(k+h)$ global approximation. If $k$ and $h$ are of the same order
of magnitude, such an approximation is balanced in the two mesh parameters.
With a slight modification of our above notation, let $\Omega_h^0$ be the mesh points whose
neighbors all lie in $\bar\Omega$ and $\Gamma_h^0$ the remaining mesh points of $\Omega$. With each point $x\in\Gamma_h^0$
we associate a point $\bar x\in\Gamma=\partial\Omega$ of distance at most $h$ from $x$. We shall then consider the
discrete problem (11.33), with $G$ and $l_hU$ to be defined presently. Let $\Gamma_h'$ be the mesh points in $\Omega$ with
neighbors in $\Gamma_h^0$, and let, for $x\in\Gamma_h'$, $y_l(x)$, $l=1,\dots,s(x)$, be these neighbors. Then $(l_hU)(x)$ is defined in terms of the values $U(y_l(x))$ for $x\in\Gamma_h'$, and $(l_hU)(x)=0$ for $x\in\Omega_h\setminus\Gamma_h'$.
As before, $B_{kh}$ is diagonally dominant, now with elements $1+4\lambda$, $-\lambda$, or $0$, and thus
again $E_{kh}=B_{kh}^{-1}\ge0$. Further,
$$U^{n+1}=E_{kh}U^n+kE_{kh}\big\{(L_{kh}U)^{n+1}+h^{-2}(l_hU)^{n+1}\big\},$$
or, by iteration,
$$U^n=E_{kh}^nU^0+k\sum_{l=0}^{n-1}E_{kh}^{n-1-l}\big\{(L_{kh}U)^{l+1}+h^{-2}(l_hU)^{l+1}\big\}. \qquad(11.34)$$

THEOREM 11.5. Let $U^n$ and $u$ be the solutions of (11.33) and (11.22), respectively,
and assume that $u\in C^{3,2}$. Then we have
$$\|U^n-u^n\|\le C(u,T)(k+h) \quad\text{for } nk\le T.$$

PROOF. Setting $U^n=1$ on $\Omega_h\cup\Gamma_h^0$ for $n\ge0$ in (11.34) and
$$v(x)=\begin{cases}1&\text{on }\Gamma_h',\\0&\text{on }\Omega_h\setminus\Gamma_h',\end{cases}$$
we have, arguing as in the proof of Theorem 11.4, the desired estimate.
We shall now show that interpolation of the boundary values at the irregular
boundary points may be used to improve the accuracy to second order in $h$. To
define this procedure, let $x\in\Gamma_h^0$. Then either $x\in\Gamma$ or $x$ is a mesh point in $\Omega$ with at
least one neighbor outside $\bar\Omega$, but also with at least one neighbor in $\Omega_h$, so that (cf. Fig.
11.2) there is a boundary crossing $z$ with $x$ on the segment defined by $y$ and $z$.
We now define a system for determining $U^{n+1}(x)$ for $x\in\Omega_h$ by equation (11.33) for
$x\in\Omega_h^0$, and, with the notation of Fig. 11.2,
$$U^{n+1}(x)=\frac{\alpha}{1+\alpha}U^{n+1}(y)+\frac{1}{1+\alpha}U^{n+1}(z) \quad\text{for } x\in\Gamma_h^0. \qquad(11.36)$$
FIG. 11.2.
Since the interpolation coefficients in (11.36) are nonnegative and sum to 1, one finds that $\|U^{n+1}\|_{\Gamma_h^0}$ is bounded by the corresponding norms of $U^{n+1}$ over $\Omega_h^0$ and $\Gamma_h$.
In particular, this implies that the system of equations for $U^n(x)$, $x\in\Omega_h^0\cup\Gamma_h^0$, has
a unique solution, as the corresponding homogeneous system has only the trivial
solution.
We shall now show the following error estimate.

THEOREM 11.6. Let $U^n$ be the solution of (11.33) with the modifications (11.36) and
(11.37), and let $u$ be the solution of (11.22). If $u\in C^{4,2}$ we have
$$\|U^n-u^n\|_{\bar\Omega_h}\le C(u,T)(k+h^2) \quad\text{for } nk\le T.$$

PROOF. This follows at once by application of (11.38) to $z^n=U^n-u^n$, since $z^n=0$ on
$\Gamma_h$, $z^0=0$, and $(L_{kh}z)^n=O(k+h^2)$ on $\Omega_h^0$.
In the above discussion, the explicit forward Euler method or the Crank–
Nicolson method might also have been proposed. Consider first the case that the
modified five-point operator is employed at the irregular boundary mesh points.
The forward Euler method would then be written as
$$U^{n+1}(x)=A_{kh}U^n(x)+k\big(f^n(x)+\gamma_hg^n(x)\big) \quad\text{for } x\in\Omega_h,$$
where now, with the above notation, the matrix $A_{kh}$ has, at the irregular points, diagonal elements of the form
$$1-2\lambda\Big(\frac{1}{\alpha_1}+\frac{1}{\alpha_2}\Big).$$
The standard analysis in the maximum norm would now require these to be
nonnegative. However, in the case of a curved boundary the $\alpha_i$ would not have
positive lower bounds independently of $h$, so that no positive value of $\lambda=k/h^2$,
however small, could be employed to guarantee the nonnegativity of the diagonal
elements. The corresponding statement is valid for the Crank–Nicolson scheme.
Consider now the method of transferring the boundary values to the irregular
boundary mesh points of $\Gamma_h^0$. In this case the diagonal elements of the matrix $A_{kh}$ are
simply $1-4\lambda$, and the condition $\lambda\le\tfrac14$ is therefore sufficient for maximum-norm stability.
This section is devoted to the application of spectral methods to the stability analysis
in the context of the mixed initial boundary value problem, as outlined in the
introduction of this chapter.
We shall begin by considering the simple initial boundary value problem
$$\frac{\partial u}{\partial t}=\frac{\partial^2u}{\partial x^2} \quad\text{for } 0<x<1,\ t\ge0, \qquad(12.1)$$
with homogeneous Dirichlet boundary conditions, discretized by the forward Euler scheme (12.2). The operator $E_k$ here acts, for different $k$, in different normed spaces $\mathcal{X}_k$, such as e.g. $l_{2,h}$ with
$$\|V\|_{2,h}=\Big(h\sum_{j=0}^{M}|V_j|^2\Big)^{1/2},$$
or the corresponding maximum-norm space $l_{\infty,h}^0$ of mesh functions vanishing at $j=0$ and $j=M$.
Considered on the above $(M+1)$-dimensional spaces $\mathcal{X}_k$, the operator $E_k$ has the
matrix representation
$$E_k=\begin{pmatrix}0&0&0&\cdots&&&0\\ \lambda&1-2\lambda&\lambda&0&&&\\ 0&\lambda&1-2\lambda&\lambda&&&\\ &&\ddots&\ddots&\ddots&&\\ &&&\lambda&1-2\lambda&\lambda\\ 0&\cdots&&&0&0&0\end{pmatrix}. \qquad(12.3)$$
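Restricted to the interior points, the matrix (12.3) is tridiagonal with eigenvalues $1-2\lambda(1-\cos(l\pi h))$, $l=1,\dots,M-1$, which for small $h$ lie in the closed unit disk exactly under the von Neumann condition $\lambda\le\tfrac12$. A sketch verifying this (the mesh size and the two values of $\lambda$ are illustrative):

```python
import numpy as np

# Interior block of the forward Euler matrix (12.3) for U_0 = U_M = 0.
def Ek_matrix(M, lam):
    n = M - 1
    return ((1 - 2 * lam) * np.eye(n)
            + lam * np.diag(np.ones(n - 1), 1)
            + lam * np.diag(np.ones(n - 1), -1))

M = 20
for lam, stable in [(0.4, True), (0.6, False)]:
    E = Ek_matrix(M, lam)
    ew = np.linalg.eigvalsh(E)                     # E is symmetric
    exact = 1 - 2 * lam * (1 - np.cos(np.arange(1, M) * np.pi / M))
    assert np.allclose(np.sort(ew), np.sort(exact))
    # all eigenvalues in the closed unit disk iff the scheme is stable
    assert (np.max(np.abs(ew)) <= 1.0) == stable
```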
In order to deal with the stability problem in situations such as this, GODUNOV and
RYABENKII [1963, 1964] introduced a concept of spectrum of a family of operators
$\{E_k\}$, where $E_k$ is defined on a normed space $\mathcal{X}_k$ with norm $\|\cdot\|_k$ and where, in our case,
$k$ is a small positive parameter. According to this definition the spectrum $\sigma(\{E_k\})$
consists of the complex numbers $z$ such that for any $\varepsilon>0$ and sufficiently small
$k$ there is a $U_k\in\mathcal{X}_k$, $U_k\ne0$, such that
$$\|E_kU_k-zU_k\|_k\le\varepsilon\|U_k\|_k. \qquad(12.4)$$
One may then show the following variant of von Neumann's criterion:

THEOREM 12.1. A necessary condition for the family $\{E_k\}$ to be stable, in the sense that the operator norms $\|E_k^n\|_k$ are bounded uniformly in $n$ and $k$, is that the spectrum $\sigma(\{E_k\})$ is contained in the closed unit disk.

PROOF. Assume that $z\in\sigma(\{E_k\})$, $|z|>1$, and let $K$ be such that $\|E_k\|_k\le K$ for small $k$.
Let $\omega$ be arbitrary, choose $n$ so large that
$$|z|^n\ge2\omega,$$
and take $\varepsilon$ so small that
$$\varepsilon\sum_{j=0}^{n-1}|z|^{n-1-j}K^j\le\omega.$$
Then
$$E_k^nU_k=z^nU_k+\sum_{j=0}^{n-1}z^{n-1-j}E_k^j\big(E_kU_k-zU_k\big),$$
and hence
$$\|E_k^nU_k\|_k\ge|z|^n\|U_k\|_k-\varepsilon\sum_{j=0}^{n-1}|z|^{n-1-j}K^j\|U_k\|_k\ge\omega\|U_k\|_k,$$
so that $\|E_k^n\|_k\ge\omega$; since $\omega$ was arbitrary, the family cannot be stable.
Let us return to our above example to determine the spectrum of the associated
family $\{E_k\}$ on $l_{\infty,h}^0$. Here, as we shall indicate, the spectrum consists of the
eigenvalues of the operator $\tilde E_k$ associated with the pure initial value problem
corresponding to (12.2), i.e. defined by
$$(\tilde E_kV)_j=\lambda(V_{j+1}+V_{j-1})+(1-2\lambda)V_j, \quad j=0,\pm1,\dots,$$
and considered with respect to the norm in $l_\infty$.
Before we show this, let us determine these eigenvalues. We note that the defining
equation for the eigenvectors corresponding to an eigenvalue $z$, namely,
$$\lambda(V_{j+1}+V_{j-1})+(1-2\lambda)V_j=zV_j, \quad j=0,\pm1,\dots, \qquad(12.5)$$
is a second-order homogeneous difference equation and that the general solution of
this equation is
$$V_j=\begin{cases}c_1\tau_1^j+c_2\tau_2^j,&\text{if }\tau_1\ne\tau_2,\\(c_1+c_2j)\tau_1^j,&\text{if }\tau_1=\tau_2,\end{cases} \quad j=0,\pm1,\dots, \qquad(12.6)$$
where $\tau_1$ and $\tau_2$ are the roots of the characteristic equation
$$\lambda(\tau+\tau^{-1})+1-2\lambda-z=0.$$
The condition for the existence of a bounded $V$ is that one root has modulus 1, and
we find that $z$ then has to be of the form
$$z=1-2\lambda(1-\cos\xi),\quad \xi\in\mathbb{R}. \qquad(12.7)$$
Assume now that $z$ is of the form (12.7). We shall show that $z\in\sigma(\{E_k\})$. Consider
the vector $V=(V_0,\dots,V_M)^{\rm T}$ with $V_j=e^{ij\xi}$. Then
$$\lambda(V_{j+1}+V_{j-1})+(1-2\lambda)V_j=zV_j, \quad j=1,\dots,M-1,$$
but the boundary conditions in the definition of $l_{\infty,h}^0$ are not satisfied. Set therefore
$W=(W_0,\dots,W_M)^{\rm T}$ where $W_j=x(1-x)V_j$, $x=j/M$. We now obviously have $W_0=W_M=0$, and a simple computation shows that
$$\big|\lambda(W_{j+1}+W_{j-1})+(1-2\lambda-z)W_j\big|\le Ch\le Ch\|W\|_{\infty,h}.$$
Thus, for small $h$, $W$ satisfies (12.4), so that $z\in\sigma(\{E_k\})$.
We also need to show that if $z$ is not of the form (12.7), then it does not belong to
$\sigma(\{E_k\})$. For this purpose one proves that, for some positive $C$ and small $h$,
$$\|U\|_{\infty,h}\le C\|E_kU-zU\|_{\infty,h} \quad\text{for } U\in l_{\infty,h}^0, \qquad(12.8)$$
showing that (12.4) cannot hold for any $U_k\in l_{\infty,h}^0$, $U_k\ne0$. To see this, we set
$$E_kU-zU=f \qquad(12.9)$$
and show that this equation has a unique solution, satisfying
$$\|U\|_{\infty,h}\le C\|f\|_{\infty,h}. \qquad(12.10)$$
The difference equation (12.9) may be written
$$\lambda(U_{j+1}+U_{j-1})+(1-2\lambda-z)U_j=f_j, \quad j=1,\dots,M-1,\quad U_0=U_M=0.$$
We first extend $f_j$, $j=1,\dots,M-1$, to $\tilde f_j$, $j=0,\pm1,\dots$, without increasing its norm in
$l_{\infty,h}$ (e.g. by setting the missing components of $\tilde f$ equal to zero) and then solve
$$\lambda(\tilde U_{j+1}+\tilde U_{j-1})+(1-2\lambda-z)\tilde U_j=\tilde f_j \quad\text{for } j=0,\pm1,\dots. \qquad(12.11)$$
Since $z$ is not of the form (12.7) we have
$$a(\xi)=1-2\lambda-z+2\lambda\cos\xi\ne0, \quad \xi\in\mathbb{R},$$
and the Fourier series for $a(\xi)^{-1}$,
$$a(\xi)^{-1}=\sum_{j=-\infty}^{\infty}b_je^{ij\xi},$$
is absolutely convergent, and we conclude that
$$\|\tilde U\|_{\infty,h}\le C\|\tilde f\|_{\infty,h}\le C\|f\|_{\infty,h}. \qquad(12.12)$$
With $\tau_1=\tau_1(z)$ and $\tau_2=\tau_2(z)$ as above, we find from (12.6) and the boundary
conditions that the difference $W_j=U_j-\tilde U_j$ is a solution of the homogeneous difference equation with boundary values $W_0=-\tilde U_0$, $W_M=-\tilde U_M$, so that
$$W_j=c_1\tau_1^j+c_2\tau_2^j \quad\text{for } j=0,\dots,M, \qquad(12.13)$$
with $c_1$ and $c_2$ determined by these boundary values. Now since $\tau_1\tau_2=1$, and since neither of the $\tau_l$ is on the unit circle if $z$ is not of the form
(12.7), one of the roots, say $\tau_1$, has to be inside the unit circle and the other outside.
We may therefore conclude from (12.13) and (12.12) that
$$\|W\|_{\infty,h}\le C\big(|\tilde U_0|+|\tilde U_M|\big)\le C\|f\|_{\infty,h}.$$
Together with (12.12) this completes the proof of (12.10) and thus of (12.8).
In the situation just described the spectrum $\sigma(\{E_k\})$ associated with the family of
discrete boundary value problems happens to coincide with the spectrum associated
with the corresponding pure initial value problem, and therefore the condition of
Theorem 12.1 gives no new restriction for stability. In general, however, the stability
may be affected by the choice of the boundary value approximation. Assume, for
instance, that instead of the boundary condition $U_0=0$ at the left end of the interval
we choose the somewhat artificial condition
$$U_0-\gamma U_1=0, \quad \gamma\ne0,$$
which is also satisfied by the exact solution of (12.1). Then a computation similar to
the above gives a representation for $W_j=U_j-\tilde U_j$ analogous to (12.13).
For $|\tau_1|<1$ and $1-\gamma\tau_1\ne0$ we conclude as before that $W$ is bounded.
However, for $\gamma$ such that $1-\gamma\tau_1=0$ this conclusion does not hold and, in fact,
$z$ belongs to $\sigma(\{E_k\})$.
In general, it was shown in the work of Godunov and Ryabenkii that the spectrum
of a family such as the above is the union of three sets, one corresponding to the pure
initial value problem and one to each of the one-sided boundary value problems for
the differential equation in $\{x>0,\ t>0\}$ and in $\{x<1,\ t>0\}$, with boundary
conditions given at $x=0$ and $x=1$, respectively.
We shall now describe some work of VARAH [1970] and OSHER [1972] concerning
the effect on the stability of the choice of discrete boundary conditions such as those
indicated above and, in particular, the formulation of sufficient conditions for
stability in the maximum norm. The methods used in this work are based on
techniques developed by Kreiss for hyperbolic equations (cf. KREISS [1968],
GUSTAFSSON, KREISS and SUNDSTROM [1972]).
We shall follow Varah's discussion in a simple case and then present the results in general. Consider the problem
$$\frac{\partial u}{\partial t}=\frac{\partial^2u}{\partial x^2} \quad\text{for } 0\le x\le1,\ 0\le t\le T, \qquad(12.14)$$
discretized by the forward Euler scheme together with a discrete boundary condition of the form
$$(E_kv)_0=\sum_{l=1}^{s}b_l(E_kv)_l, \qquad(12.19)$$
where the $b_l$ are constant. We assume that the boundary approximation (12.19) is
consistent with the boundary condition in (12.14) (for $\alpha=0$), or that, for smooth $u$,
$$u(0)-\sum_{l=1}^{s}b_lu(lh)=\gamma hu'(0)+o(h) \quad\text{as } h\to0,\ \text{with } \gamma\ne0. \qquad(12.20)$$
We shall work in the space
$$l_{\infty,h}^+=\Big\{V=(V_j)_{j\ge0}:\ \|V\|_{\infty,h}=\sup_j|V_j|<\infty\Big\}.$$
It is clear that a necessary condition for stability is that this operator has all its
eigenvalues in the closed unit disk, for if $z$ were an eigenvalue with $|z|>1$, then, with
$U^0=v\in l_{\infty,h}^+$ the corresponding eigenfunction, we would have
$$\|U^n\|_{\infty,h}=\|E_k^nU^0\|_{\infty,h}=|z|^n\|v\|_{\infty,h}\to\infty \quad\text{as } n\to\infty,$$
contrary to maximum-norm stability. We shall see that a sufficient condition for
stability is the slightly stronger condition that all eigenvalues $z$ of $E_k$, except $z=1$, lie
in the open unit disk $\{z:|z|<1\}$. That $z=1$ is an eigenvalue follows at once from the
consistency condition (12.20) by taking $v_j\equiv1$ for $j\ge0$. To demand that the
eigenvalues $z\ne1$ of $E_k$ are in the open unit disk is a stronger condition than asking
that they are in the closed unit disk, in the same way as the definition of a parabolic
operator is stronger than that of one which satisfies only the von Neumann
condition for stability in $L_2$.
We shall now reformulate our above property in terms of the coefficients of the
discrete boundary condition (12.19). For this purpose we assume that $z$ is an
eigenvalue of $E_k$ and that $v=(v_j)$ is the corresponding eigenfunction. Then
$$\lambda(v_{j+1}+v_{j-1})+(1-2\lambda)v_j=zv_j \quad\text{for } j\ge1, \qquad v_0=\sum_{l=1}^{s}b_lv_l.$$
The first of these relations is a second-order difference equation, and we know that if
$\tau_l=\tau_l(z)$, $l=1,2$, are simple roots of the corresponding characteristic equation,
$$\lambda(\tau+\tau^{-1})+1-2\lambda=z, \qquad(12.21)$$
then the general solution of the difference equation is
$$v_j=c_1\tau_1^j+c_2\tau_2^j \quad\text{for } j\ge0. \qquad(12.22)$$
Let now $|z|\ge1$ and $z\ne1$, and let us note that then none of the roots $\tau_1$ and $\tau_2$ of
(12.21) is on the unit circle. For, if $\tau=e^{i\xi}$ is such a root, then
$$z=\lambda(e^{i\xi}+e^{-i\xi})+1-2\lambda=1-2\lambda+2\lambda\cos\xi,$$
which is real and has modulus less than 1, except if $\xi=0$, which corresponds to $z=1$.
Since $\tau_1\tau_2=1$ it follows that exactly one, say $\tau_1$, is inside the unit circle and the
other is outside. Because $(v_j)_{j\ge0}$ in (12.22) has to be in $l_{\infty,h}^+$ we conclude that the
second term must vanish, so that
$$v_j=c(z)\tau_1(z)^j \quad\text{for } j=0,1,\dots.$$
In order for this $v$ to satisfy the boundary condition we must have, unless $c(z)=0$,
$$b(z)=1-\sum_{l=1}^{s}b_l\tau_1(z)^l=0.$$
The hypothesis that the eigenvalues $z\ne1$ must satisfy $|z|<1$ may therefore be
expressed by saying that
$$b(z)\ne0 \quad\text{for } |z|\ge1,\ z\ne1. \qquad(12.23)$$
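Condition (12.23) can be tested numerically for a concrete boundary approximation. The sketch below takes the simplest consistent choice $s=1$, $b_1=1$ (an assumption made here for illustration), computes the root $\tau_1(z)$ of (12.21) inside the unit circle, and checks that $b(z)=1-\tau_1(z)$ stays away from zero on a contour $|z|=1.05$ avoiding a neighborhood of $z=1$:

```python
import numpy as np

# Characteristic equation (12.21): lam*tau^2 + (1 - 2*lam - z)*tau + lam = 0.
# Its two roots satisfy tau1*tau2 = 1; we take the one inside the unit circle.
lam = 0.4

def tau1(z):
    r = np.roots([lam, 1 - 2 * lam - z, lam])
    return r[np.argmin(np.abs(r))]

# b(z) = 1 - tau1(z) for the boundary condition U_0 = U_1 (s = 1, b_1 = 1);
# sample z on a circle of radius 1.05, excluding a neighborhood of z = 1
angles = np.linspace(0.2, 2 * np.pi - 0.2, 400)
vals = [abs(1 - tau1(1.05 * np.exp(1j * t))) for t in angles]
assert min(vals) > 0.01        # b(z) bounded away from zero, so (12.23) holds
```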
In order to accomplish this, we follow the lines of our discussion in the beginning
of this section of the two-point boundary value problem and first extend $f$ without
increasing its norm to $\tilde f_j$, $j=0,\pm1,\dots$ (e.g. by setting $\tilde f_j=0$ for $j<0$). We then
determine $\tilde U$ by (12.11) and conclude that the inequality (12.12) holds. We now set
$W_j=U_j-\tilde U_j$, $j\ge0$, and find for $W$ the equations
$$\lambda(W_{j+1}+W_{j-1})+(1-2\lambda-z)W_j=0 \quad\text{for } j\ge1,$$
so that
$$W_j=c\tau_1^j, \quad\text{with } c=c(z)=\frac{\tilde U_0-\sum_{l=1}^{s}b_l\tilde U_l}{b(z)},$$
whence
$$\|W\|_{\infty,h}\le\frac{C}{|b(z)|}\|\tilde U\|_{\infty,h}\le C(z)\|f\|_{\infty,h}. \qquad(12.26)$$
Together (12.12), (12.23) and (12.26) show the desired estimate (12.25).
In the next step of the proof of the maximum-norm stability we write $U^n=E_k^nU^0$ in
the form
$$U_j^n=\sum_{l\ge1}a_{njl}U_l^0 \quad\text{for } j\ge0,\ n\ge0.$$
One needs to show, with $C$ independent of $n$ and $j$, that, for the operator norm of $E_k^n$,
$$\|E_k^n\|_{\infty,h}=\sup_j\sum_{l\ge1}|a_{njl}|\le C \quad\text{for } n\ge0.$$
Introducing the mesh function $\delta_l$, with elements $\delta_{ls}=0$ for $s\ne l$ and $\delta_{ll}=1$, we have,
by the Dunford representation,
$$a_{njl}=\frac{1}{2\pi i}\int_\Gamma z^n\big((zI-E_k)^{-1}\delta_l\big)_j\,dz,$$
with $\Gamma$ a closed contour around the spectrum of $E_k$. Writing $d_{jl}=d_{jl}(z)=\big((zI-E_k)^{-1}\delta_l\big)_j$, it follows that
$$\lambda(d_{j+1,l}+d_{j-1,l})+(1-2\lambda-z)d_{jl}=0 \quad\text{for } j>l,$$
whence, for $j>l$,
$$d_{jl}=c_l\tau_1^j,$$
with
$$c_l=c_l(z)=\frac{\tilde d_{0l}-\sum_{p=1}^{s}b_p\tilde d_{pl}}{b(z)},$$
where $\tilde d$ denotes a particular solution of the inhomogeneous equation. In particular, for $l>s$, say, the coefficient $c_l(z)$ contains the factor $b(z)^{-1}$.
In order to complete the proof, delicate estimates are needed for this integral, using
an appropriate choice of the contour $\Gamma$, which may a priori be selected as any closed
contour around the origin in $\{|z|\ge1,\ z\ne1\}$. In these estimates one also needs the
fact that, in a neighborhood of $z=1$,
$$|b(z)^{-1}|\le C|z-1|^{-1/2}. \qquad(12.27)$$
To see that such an inequality holds in the present case, we note that (12.21) has
a double root $\tau=1$ at $z=1$. In fact, we have
$$\tau_{1,2}=\frac{z-1+2\lambda\pm\big((z-1+2\lambda)^2-4\lambda^2\big)^{1/2}}{2\lambda},$$
and one sees that, for $z$ near 1,
$$\tau_1(z)=1+\Big(\frac{z-1}{\lambda}\Big)^{1/2}+O(z-1) \quad\text{as } z\to1,$$
with the square root appropriately determined.
with the square root appropriately determined. Reporting this in the definition of
158 V. Thomde CHAPTER III
b(z), one sees that (12.27) holds unless
Σ_{l=1}^s b_l τ₁(z)^l = 1,
which happens when both
Σ_{l=1}^s b_l = 1 and Σ_{l=1}^s l b_l = 0.
These equations are satisfied, e.g., with s = 2, b₁ = (1 + τ₁(z))/τ₁(z) and b₂ = −1/τ₁(z).
On the other hand, for more reasonable choices of discrete boundary conditions,
this does not happen. Thus, for instance, for s=2 let us select b, and b2 so that the
order of accuracy of the boundary approximation is two, i.e. such that
u(0) − Σ_{l=1}^2 b_l u(lh) = γ h u′(0) + O(h³) as h → 0.
This gives b₁ = 4/3, b₂ = −1/3, and
b(z) = 1 − (4/3)τ₁(z) + (1/3)τ₁(z)² = (1/3)(τ₁(z) − 1)(τ₁(z) − 3) ≠ 0, since |τ₁(z)| < 1,
so that (12.23) is satisfied. Similarly, taking s = 3 and the order of accuracy three, we
have
b₁ = 18/11, b₂ = −9/11, b₃ = 2/11, γ = −6/11,
and
b(z) = 1 − (18/11)τ₁ + (9/11)τ₁² − (2/11)τ₁³
     = −(1/11)(τ₁ − 1)(2τ₁² − 7τ₁ + 11) ≠ 0 for |τ₁| < 1.
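These two factorizations are polynomial identities and can be verified numerically. The sketch below assumes the coefficients b₁ = 4/3, b₂ = −1/3 (for s = 2) and b₁ = 18/11, b₂ = −9/11, b₃ = 2/11 (for s = 3) read off from the factorizations in the text; it checks the identities at random complex points and confirms that the quadratic factor 2τ² − 7τ + 11 has no roots in the closed unit disk:

```python
import numpy as np

def b2(t):   # s = 2: b1 = 4/3, b2 = -1/3
    return 1 - (4/3) * t + (1/3) * t**2

def b3(t):   # s = 3: b1 = 18/11, b2 = -9/11, b3 = 2/11
    return 1 - (18/11) * t + (9/11) * t**2 - (2/11) * t**3

rng = np.random.default_rng(0)
for t in rng.uniform(-2, 2, 20) + 1j * rng.uniform(-2, 2, 20):
    assert abs(b2(t) - (1/3) * (t - 1) * (t - 3)) < 1e-12
    assert abs(b3(t) + (1/11) * (t - 1) * (2 * t**2 - 7 * t + 11)) < 1e-12

# the quadratic factor vanishes only at (7 +- i*sqrt(39))/4, of modulus
# sqrt(11/2) > 1, so b3 has no zero tau with |tau| <= 1 except tau = 1
roots = np.roots([2, -7, 11])
assert np.all(np.abs(roots) > 1)
```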
It was shown by Varah that also high-order approximations of this type satisfy our
condition (12.23) for λ small enough.
We now turn to a description of Varah's more general result for the one-sided
problem (12.17) and consider the finite difference operator (12.15) with the boundary
approximations (12.16) at x = 0, where coefficients bj(h) are smooth functions of h.
The operator E_k = E_k(h) defined by U^{n+1} = E_k U^n now depends on h. One can show
that if z belongs to the spectrum of E_k(h) and |z| ≥ 1, z ≠ 1, then z is actually an
eigenvalue of E_k(h), and for a certain matrix B(z, h), which plays the role of our above
function b(z), we have det B(z, h) = 0. In fact, if z is such an eigenvalue, we have
for the corresponding eigenfunction V an expansion
V_j = Σ_ν P_ν(j) τ_ν(z)^j,   (12.29)
where the τ_ν(z) are the roots with |τ_ν(z)| < 1 of the characteristic equation
Σ_{l=−q}^p a_l τ^l − z = 0,
and where P_ν is a polynomial of degree less than the multiplicity of τ_ν(z). One may
show that there are r independent parameters δ = (δ₁, ..., δ_r)ᵀ in (12.29) and these
may be determined from the matrix equation
B(z, h)δ = 0.
Hence, similarly to the situation in the particular case treated above, z is an
eigenvalue if and only if B(z, h) is singular. We may formulate the following
theorem from VARAH [1970].
THEOREM 12.2. The operator E_k(h) defined by (12.15) and (12.16) is maximum-norm
stable if
(i) the operator (12.15) is consistent with the heat equation and parabolic in the
sense of John, and the boundary approximations (12.16) are consistent with the
continuous boundary condition in (12.17),
(ii) E_k(0) has no eigenvalues z with |z| ≥ 1, z ≠ 1,
(iii) B(z, 0) is nonsingular and |B(z, 0)^{−1}| ≤ C|z − 1|^{−1/2} for |z| ≥ 1, z close to 1.
We recall that (iii) was automatically satisfied in the simple case treated above.
Let us now return briefly to the problem (12.14) on the finite interval 0 ≤ x ≤ 1,
using the finite difference equation (12.15) for j = 1, ..., M − 1, where Mh = 1, and
using the equations (12.16) for the boundary approximations at x = 0 and similar
equations at x = 1. Associated with this problem are two one-sided problems on the
infinite intervals 0 < x < ∞ and −∞ < x < 1, and for each of these our above analysis
may be used. VARAH [1970] shows the following:
THEOREM 12.3. Assume that the finite difference scheme just defined for the initial
boundary value problem (12.14) is such that the conditions of Theorem 12.2 are
satisfied for the problems corresponding to each of the two boundaries at x = 0 and
x = 1. Then the scheme is stable in the maximum norm.
The results of Varah have been extended by OSHER [1972] in a number of ways
which we shall briefly describe below, to more general parabolic equations and more
general finite difference schemes. Thus let us now consider the differential equation
∂u/∂t = a(x, t) ∂²u/∂x² − b(x, t) ∂u/∂x − c(x, t)u + f(x, t) for 0 < x < 1, 0 < t ≤ T,   (12.30)
with boundary conditions
α u(0, t) + ᾱ (∂u/∂x)(0, t) = g₀(t) for 0 < t ≤ T,
β u(1, t) + β̄ (∂u/∂x)(1, t) = g₁(t) for 0 < t ≤ T,   (12.31)
where α² + ᾱ² = β² + β̄² = 1, so that Dirichlet type boundary conditions are now
also allowed, and finally with the standard initial condition
u(x, 0) = v(x) for 0 ≤ x ≤ 1.
For the numerical solution we use implicit multistep finite difference operators of
the form described in Sections 3, 6 and 8, so that
A_{h,0} U^{n+1} = A_{h,1} U^n + ⋯ + A_{h,m} U^{n−m+1} + k f^n,   (12.32)
where the A_{h,μ} = A_{h,μ}(nk, h) are defined by
A_{h,μ} U_j = Σ_{l=−q}^q a_{μl}(x_j, t_n, h) U_{j+l}.   (12.33)
We assume that the coefficients are sufficiently smooth and that (12.32) is
consistent with the differential equation (12.30). Further, we assume that
the symbol of A_{h,0} satisfies
Ã₀(x, t, ξ) = Σ_l a_{0l}(x, t, 0) e^{−ilξ} ≠ 0,
so that A_{h,0} is invertible, that (12.32) is parabolic in the sense of John, and that the
coefficients a_{μl} in (12.33) are such that the associated companion matrix
S = ( α₁ α₂ ⋯ α_m ;
       1  0  ⋯  0 ;
       0  1  ⋯  0 ;
       ⋯ ;
       0  ⋯  1  0 )
has no eigenvalues outside the closed unit disk.
For boundary conditions in (12.31) with ᾱ ≠ 0, we use the equations from (12.32) at
the interior points jh, i.e. for j = 1, 2, ..., and impose discrete boundary conditions
for j = −r + 1, ..., 0.
The form of the boundary conditions employed at the right-hand boundary is
analogous.
We assume that the discrete boundary conditions are consistent with the
continuous boundary conditions in the natural sense.
We now demand that the boundary approximations are such that there is no
nontrivial solution in l₂ of the semi-infinite problem (with ν = 0 or 1, depending on
the type of boundary condition)
and the corresponding condition for the right-hand boundary. Further, we consider
the equation (12.34).
Let {e^{iθ_ν k}} be those eigenvalues of the matrix S introduced above which lie on the
unit circle. We then assume that for |z| ≥ 1, z ≠ e^{iθ_ν k}, the unique solution of (12.34) in
l_∞ is zero. If ᾱ = 0, we make the same assumption for z = e^{iθ_ν k} as well. For ᾱ ≠ 0 and
z = e^{iθ_ν k} we have that v_j ≡ 1, j = 0, 1, ..., is a solution, but we assume that there is no
other linearly independent solution with |v_j| ≤ C(j + 1).
We make the analogous assumption for the right-hand boundary.
Under the above conditions Osher shows that the difference scheme has a unique
solution (the conditions made to ensure this are in fact also necessary). Further, the
scheme is maximum-norm stable. In fact, the following more precise estimates hold
under various additional assumptions. First, for the nonhomogeneous equation
with homogeneous boundary conditions (g₀(t) ≡ g₁(t) ≡ 0) we have
Consider now instead the case that f ≡ 0 and v ≡ 0 but that the boundary conditions
are nonhomogeneous. Then, if ᾱ ≠ 0, we have for the one-sided problem defined by
the left-hand boundary, with c > 0, pointwise estimates for U_j^n in terms of
sup_t |g₀(t)|, with factors decaying exponentially in j.
The corresponding estimate for the problem with the boundary to the right, and the
two-sided problem are also valid. Clearly the general situation with f, g₀, g₁ and v all nonzero
may be analyzed by combination of the above estimates.
Before leaving the discussion of construction of discrete boundary conditions we
shall quote some simple results from STRANG [1964a] and MILLER [1969]
concerning a specific type of difference schemes designed for the solution of the
quarter-plane mixed initial boundary value problem
∂u/∂t = ∂²u/∂x² for x > 0, t > 0,
u(0, t) = 0 for t > 0,
u(x, 0) = v(x) for x > 0.
For such a problem it is particularly easy to apply an explicit operator of the form
(12.15) if r = 1, so that (with our standard notation) the scheme reduces to
U_j^{n+1} = E_k U_j^n = Σ_{l=−q}^q a_l U_{j+l}^n.   (12.35)
Its symbol E(ξ) = Σ_{l=−q}^q a_l e^{ilξ} satisfies
|E(ξ)| ≤ 1 for ξ ∈ ℝ,   (12.36)
and also, with μ the specified order of accuracy and with λ = k/h² (which is assumed
kept constant),
E(ξ) = e^{−λξ²} + O(ξ^{μ+2}) as ξ → 0.   (12.37)
In this regard Strang proved that for given μ it is indeed possible to find a
trigonometric polynomial of the form (12.35), for some q, which satisfies (12.37) and
also the stability condition (12.36) for small λ.
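As a concrete instance of (12.36) and (12.37), consider the standard explicit three-point scheme with symbol E(ξ) = 1 − 2λ(1 − cos ξ) (an illustration of the accuracy condition, not necessarily one of the schemes Strang constructs). For generic λ the error against e^{−λξ²} is O(ξ⁴) (order μ = 2), while matching the ξ⁴ terms shows the error improves to O(ξ⁶) exactly when λ = 1/6. A numerical check of the observed order (the sample point ξ = 0.2 is arbitrary):

```python
import numpy as np

def E(xi, lam):
    # symbol of the explicit scheme U_j^{n+1} = lam*U_{j-1}^n
    #                                + (1 - 2*lam)*U_j^n + lam*U_{j+1}^n
    return 1 - 2 * lam * (1 - np.cos(xi))

def observed_order(lam, xi=0.2):
    # estimate the exponent nu in E(xi) - exp(-lam*xi^2) = O(xi^nu)
    e1 = abs(E(xi, lam) - np.exp(-lam * xi**2))
    e2 = abs(E(xi / 2, lam) - np.exp(-lam * (xi / 2)**2))
    return np.log2(e1 / e2)

assert abs(observed_order(0.4) - 4) < 0.1    # generic lam: mu = 2, error O(xi^4)
assert abs(observed_order(1/6) - 6) < 0.1    # lam = 1/6: error improves to O(xi^6)
```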
Condition (12.37) may also be expressed as μ + 2 linear equations for the
coefficients a_l.
the error estimate
‖U^n − u^n‖_{2,h} ≤ C(u)(h² + k²)
holds, independently of the mesh ratio λ = k/h², and also that the energy method
may be used to show the corresponding error estimate in the maximum norm. It is
for this latter result that we would now like to give an alternative proof using
a spectral method, and this will be done for h and k of the same order of magnitude.
For comparison we recall that problem (12.38) may be extended to a periodic
problem, and that the Crank–Nicolson scheme is parabolic in John's sense for
λ = k/h² fixed and hence, by the results of Section 9, we have the maximum-norm
error estimate
‖U^n − u^n‖_{∞,h} ≤ C(u)h².
In the case that k and h are independent we mentioned in Section 7 that if a(x) ≡ 1 the
scheme is uniformly maximum-norm stable (with stability constant 23), and hence
a corresponding error estimate holds in this case as well. The discrete initial values are
U_j^0 = v(jh), j = 0, ..., M.
We recall the notation for the discrete inner product
(V, W)_h = h Σ_{j=0}^M V_j W_j,
and the corresponding norm in l_{2,h},
‖V‖_{2,h} = (V, V)_h^{1/2} = ( h Σ_j V_j² )^{1/2}.
It is clear that the symmetric operator A_h on l_{2,h} has M − 1 positive eigenvalues
{μ_p}_{p=1}^{M−1} and a corresponding orthonormal system of eigenfunctions
{φ_p}_{p=1}^{M−1}, and
V = Σ_{p=1}^{M−1} (V, φ_p)_h φ_p,
where
E(z) = (1 − ½z)/(1 + ½z).
It can be shown (cf. e.g. CARASSO [1969]) that there are positive constants c₀ and c₁,
independent of M, such that
c₀ p² ≤ μ_p ≤ c₁ p².   (12.40)
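The bound (12.40) can be made explicit when A_h is the standard discrete Dirichlet Laplacian −∂_x∂_x̄ on (0, 1), whose eigenvalues are μ_p = (4/h²) sin²(pπh/2); the elementary inequality (2/π)x ≤ sin x ≤ x on [0, π/2] then gives c₀ = 4 and c₁ = π². A numerical sketch (assuming this standard choice of A_h):

```python
import numpy as np

M = 64
h = 1.0 / M
# discrete Dirichlet Laplacian A_h = -d_x d_xbar on the M-1 interior points
A = (np.diag(2.0 * np.ones(M - 1)) - np.diag(np.ones(M - 2), 1)
     - np.diag(np.ones(M - 2), -1)) / h**2

mu = np.sort(np.linalg.eigvalsh(A))
p = np.arange(1, M)
# closed form mu_p = (4/h^2) sin^2(p*pi*h/2)
assert np.allclose(mu, (4 / h**2) * np.sin(p * np.pi * h / 2)**2)
# two-sided bound (12.40) with the explicit constants c0 = 4, c1 = pi^2
assert np.all(4 * p**2 <= mu + 1e-8)
assert np.all(mu <= np.pi**2 * p**2 + 1e-8)
```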
As specified in the introduction to this chapter we shall now touch upon a few topics
not covered in our earlier presentation. These are concerned with variable time
steps, problems with singular or discontinuous coefficients, the use of boundary
value problem techniques, interior estimates and finally the application of finite
differences in existence theory.
In the analysis in the preceding sections we have always employed a constant time
step in our finite difference schemes. We shall now briefly describe some work by
DOUGLAS and GALLIE [1955] which suggests longer time steps as the distance to the
initial time increases.
We consider the model problem
∂u/∂t = ∂²u/∂x² for 0 < x < 1, t ≥ 0,
u(0, t) = u(1, t) = 0, t ≥ 0,   (13.1)
u(x, 0) = v(x), 0 ≤ x ≤ 1.
For the numerical solution we select as usual a mesh width h = 1/M where M is
a positive integer, but choose now time steps k_n, n = 0, 1, ..., which may vary with n.
The corresponding discrete time levels are then t_n = Σ_{l=0}^{n−1} k_l. For the discrete solution
U_j^n approximating u(jh, t_n) we take now the result of the backward Euler method,
(U_j^{n+1} − U_j^n)/k_n = ∂_x ∂_x̄ U_j^{n+1}, j = 1, ..., M − 1, n ≥ 0,
U_0^{n+1} = U_M^{n+1} = 0,
U_j^0 = v(jh), j = 0, ..., M.
In operator form this may be written
U^{n+1} = E_{k_n,h} U^n, n ≥ 0,
where E_{k,h} denotes the corresponding backward Euler solution operator.
Let us first remark that even for k and λ fixed the sum in (13.3) is uniformly
bounded for n > 0 so that the error is bounded, uniformly for t > 0, and not only on
bounded intervals in t, by C(v)h². This is due to the exponentially decreasing factors
in the sum, which we did not keep track of properly in the simple analyses of Sections
1 and 2.
More generally, the exponential factors allow a certain growth in k_l and λ_l without
sacrificing the O(h²) convergence. For instance, if
λ_l ≤ C e^{γπ²t_l} for l ≥ 0, with γ < 1,
and
k_{l+1} ≤ C k_l, l ≥ 0,
then obvious estimates show for the sum in (13.3)
Σ_{l=0}^{n−1} k_l (1 + λ_l) e^{−π²t_l} ≤ C ( 1 + ∫_0^∞ e^{−(1−γ)π²t} dt ) ≤ C.
Douglas and Gallie considered, in particular, the choices λ_l = α + β t_l and λ_l = e^{π²t_l/2}.
It is clear that these remarks admit vast generalizations.
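The observation that backward Euler tolerates large, even geometrically growing, time steps can be illustrated numerically: since I + k_n A_h is diagonally dominant with nonpositive off-diagonal entries, the maximum norm of the discrete solution never increases, whatever the k_n. A sketch (the parameter values are arbitrary):

```python
import numpy as np

M = 40
h = 1.0 / M
# interior matrix of the discrete Laplacian -d_x d_xbar with Dirichlet conditions
A = (np.diag(2.0 * np.ones(M - 1)) - np.diag(np.ones(M - 2), 1)
     - np.diag(np.ones(M - 2), -1)) / h**2

x = np.linspace(0.0, 1.0, M + 1)
U = np.sin(np.pi * x[1:-1])           # interior values of the initial data
k = 1e-3
for n in range(60):
    k *= 1.2                          # geometrically growing time steps k_n
    Unew = np.linalg.solve(np.eye(M - 1) + k * A, U)
    # I + k*A is an M-matrix: the max norm never increases, however large k_n
    assert np.max(np.abs(Unew)) <= np.max(np.abs(U)) + 1e-12
    U = Unew
```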
The use of nonuniform meshes in space has been suggested and analyzed in one
space dimension in SAMARSKII [1963].
We shall now consider an initial boundary value problem for a parabolic equation
with a singular elliptic operator. The solution of (13.4) will be regular at x = 0, since
this corresponds to an interior point for (13.5), and, as a result of (13.4), the boundary
condition (∂u/∂x)(0, t) = 0 is satisfied by the solution.
We shall describe a finite difference scheme analyzed in FRYAZINOV and BAKIROVA
[1972]. We use a mesh width h = 1/M in space and a time step k, and define the
discrete elliptic operator
A_h U_j = x_j^{−2} ∂_x( x_{j−1} x_j ∂_x̄ U_j ).
We note that for j = 1 the coefficient x₀x₁ = 0 of U_0^n and U_0^{n+1} vanishes, so that,
in fact, neither U_0^n nor U_0^{n+1} appears in this system.
The equations may be written in the form
U_j^{n+1} − λθ( x_j^{−1}x_{j+1} U_{j+1}^{n+1} + x_j^{−1}x_{j−1} U_{j−1}^{n+1} − 2U_j^{n+1} )
  = λ(1 − θ) x_j^{−1}x_{j+1} U_{j+1}^n + λ(1 − θ) x_j^{−1}x_{j−1} U_{j−1}^n + (1 − 2λ(1 − θ))U_j^n,
with U_M^{n+1} = 0. We now note that
x_j x_{j+1} + x_j x_{j−1} = 2x_j²,
so that the coefficient of U_j^n equals 1 − 2λ(1 − θ). Hence all coefficients on the right are
nonnegative and add up to 1, provided the stability condition
2λ(1 − θ) ≤ 1   (13.8)
is satisfied, and we then conclude in the standard manner the maximum-norm
stability estimate
‖U^{n+1}‖_{∞,h} ≤ ‖U^n‖_{∞,h}.
SECTION 13 Mixed initial boundary value problem 171
One may also show convergence at the rate O(h² + k) for the backward Euler
method and, with λ and θ satisfying (13.8), also for θ < 1. These results apply as well to
the nonhomogeneous equation with nonhomogeneous boundary conditions.
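The maximum-norm argument above, with right-hand-side weights λ(1 − θ)x_{j±1}/x_j and 1 − 2λ(1 − θ) that are nonnegative and sum to one under (13.8), can be checked numerically for the explicit case θ = 0 (a sketch; the weights follow the identity x_{j+1} + x_{j−1} = 2x_j used in the text, and λ = 0.4 is an arbitrary admissible choice):

```python
import numpy as np

M, lam = 50, 0.4                      # 2*lam*(1 - theta) <= 1 holds for theta = 0
h = 1.0 / M
x = np.arange(M + 1) * h              # x_0 = 0 is the singular point
rng = np.random.default_rng(1)
U = rng.uniform(-1, 1, M + 1)
U[M] = 0.0                            # boundary condition at x = 1

j = np.arange(1, M)
for n in range(100):
    V = U.copy()
    # explicit step; the weight x_{j-1}/x_j vanishes for j = 1, so U_0 never enters
    V[j] = (lam * (x[j + 1] / x[j]) * U[j + 1]
            + (1 - 2 * lam) * U[j]
            + lam * (x[j - 1] / x[j]) * U[j - 1])
    # coefficients are nonnegative and sum to 1: the max norm cannot grow
    assert np.max(np.abs(V[1:])) <= np.max(np.abs(U[1:])) + 1e-12
    U = V
```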
Similar results are shown in FRYAZINOV and BAKIROVA [1972] for the equation
∂u/∂t = x^{−1} (∂/∂x)( x a ∂u/∂x ) for 0 < x < 1, t > 0.   (13.9)
Problems of the above type may also be analyzed by the energy method, now in
general with respect to weighted norms (cf. e.g. SAMARSKII and GULIN [1973]). We
consider only the backward Euler version (θ = 1) of (13.7) and introduce the inner
product
(V, W)_h = h Σ_{j=0}^{M−1} V_j W_j,
and the corresponding norm ‖V‖_h = (V, V)_h^{1/2}. The standard energy argument
then yields a stability estimate in this norm.
For the case of discontinuous coefficients it was shown by the energy method in
SAMARSKII and FRYAZINOV [1961] that the following weaker convergence estimates
hold, namely
‖U^n − u^n‖_{∞,h} ≤ C(u)(h + k) if θ = 1,
‖U^n − u^n‖_{∞,h} ≤ C(u)(h^{1/2} + k) if ½ < θ < 1,
‖U^n − u^n‖_{∞,h} ≤ C(u)(h^{1/2} + k²) if θ = ½.
It was also shown by Samarskii and Fryazinov that it is possible to modify the
difference scheme so that a higher rate of convergence may be attained in the case of
discontinuities. The modification uses the harmonic mean over intervals of length h,
a_{h,j−1/2} = ( h^{−1} ∫_{x_{j−1}}^{x_j} a(x)^{−1} dx )^{−1},
and replaces a_{j−1/2} by a_{h,j−1/2} in the finite difference equation in (13.10), so that it
thus reads
∂_t U_j^n = ∂_x( a_{h,j−1/2} ∂_x̄ ( θ U_j^{n+1} + (1 − θ) U_j^n ) ), j = 1, ..., M − 1, n ≥ 0,   (13.11)
with the same initial and boundary conditions as before. (This is a homogeneous
difference scheme in the sense of TIKHONOV and SAMARSKII [1961].) For this modified
method it was proved that
‖U^n − u^n‖_{∞,h} ≤ C(u)(h² + k) if θ = 1,
‖U^n − u^n‖_{∞,h} ≤ C(u)(h^{3/2} + k) if ½ < θ < 1,
‖U^n − u^n‖_{∞,h} ≤ C(u)(h^{3/2} + k²) if θ = ½.
Note that in the particular case that a discontinuity falls at a mesh point x_j, but a is
constant on both sides with values a₊ and a₋, the harmonic means reduce to
a_{h,j+1/2} = a₊ and a_{h,j−1/2} = a₋, so that the new difference equation (13.11)
reduces to the old one in (13.10).
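The harmonic-mean coefficient and the reduction just described can be checked directly: for a piecewise constant a with the jump at a mesh point, each cell average reduces to the one-sided constant, while a jump interior to a cell produces the harmonic average 2a₊a₋/(a₊ + a₋). A sketch using midpoint quadrature (the values of a₋, a₊ are arbitrary):

```python
import numpy as np

def a_h(xl, xr, a, n=1000):
    """Harmonic mean of a over [xl, xr] via midpoint quadrature."""
    xs = xl + (np.arange(n) + 0.5) * (xr - xl) / n
    return 1.0 / np.mean(1.0 / a(xs))

a_minus, a_plus, xj = 1.0, 5.0, 0.5
a = lambda x: np.where(x < xj, a_minus, a_plus)
h = 0.1

# discontinuity at the mesh point x_j: each adjacent cell sees a constant
assert abs(a_h(xj - h, xj, a) - a_minus) < 1e-9
assert abs(a_h(xj, xj + h, a) - a_plus) < 1e-9
# jump at the midpoint of a cell: harmonic average of the one-sided values
mid = a_h(xj - h / 2, xj + h / 2, a)
assert abs(mid - 2 * a_minus * a_plus / (a_minus + a_plus)) < 1e-9
```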
For the backward Euler method the investigation by Samarskii and Fryazinov
was pursued to the case, with a in (13.9) depending on both x and t, that the
discontinuities are oblique, that is, the discontinuities of a appear along curves
which are not necessarily of the form x = constant. For the standard method (13.10)
with θ = 1 the result is then
‖U^n − u^n‖_{∞,h} ≤ C(u)( h^{1/2 − ρ₁(h)} + k^{1 − ρ₂(k)} ),
where ρ_j(s) → 0 as s → 0, for j = 1, 2. The corresponding result for the modified
scheme using (13.11) with θ = 1 is
‖U^n − u^n‖_{∞,h} ≤ C(u)( h^{1 − ρ₁(h)} + k^{1 − ρ₂(k)} ),
again an improvement over the standard method.
We shall briefly describe some work by CARASSO and PARTER [1970], based on an
idea and numerical experiments by GREENSPAN [1968], concerning the use of
"boundary value techniques" for a parabolic problem over a long time interval. We
consider the nonhomogeneous equation
∂u/∂t = (∂/∂x)( a(x) ∂u/∂x ) + f(x), 0 < x < 1, t > 0,
u(0, t) = u(1, t) = 0, t > 0,   (13.12)
u(x, 0) = v(x), 0 ≤ x ≤ 1,
where a is a positive smooth function. The solution then tends to a steady state
solution w which solves the two-point boundary value problem
−(d/dx)( a(x) dw/dx ) = f in 0 < x < 1,
w(0) = w(1) = 0.
Normally we have applied above a finite difference method to (13.12) in which the
approximation is successively determined at time t_{n+1} = (n + 1)k when it is known at
t = nk (or possibly at one or more additional earlier time levels). The purpose is now
to describe a method which uses the assumed knowledge of the stationary solution
w by using this function as an approximation to u(·, T) for a sufficiently large T, and
then interpreting the problem as a boundary value problem, solving in the whole
domain at once.
We consider thus, with Mh = 1, Nk = T, the finite difference equations
∂̂_t U_j^n = ∂_x( a_{j−1/2} ∂_x̄ U_j^n ) + f_j, j = 1, ..., M − 1, n = 1, ..., N − 1.   (13.13)
Here
∂̂_t U^n = ½(∂_t + ∂̄_t)U^n = (U^{n+1} − U^{n−1})/(2k)
is a symmetric difference quotient with respect to t, and we recall from Section 3 that
the three-level scheme thus defined is unconditionally unstable when used in
a normal marching procedure.
Introducing the vectors U^n = (U_1^n, ..., U_{M−1}^n)ᵀ and similarly V and W, the
system (13.13) may be written in the form
U^{n+1} − U^{n−1} + 2λ A U^n = 2k F^n, n = 1, ..., N − 1,
U^0 = V, U^N = W,
where λ = k/h² and A is a tridiagonal matrix.
It may be shown by expanding the U" in eigenvectors of A that this system has
a unique solution for V, W, and F given. Further, assuming that T is selected large
enough so that the inequality
‖u(·, T) − w‖_{∞,h} ≤ C k³
holds, Carasso and Parter show the error estimate
‖U^n − u(t_n)‖_{∞,h} ≤ C(u)(h² + k²) for n = 1, ..., N − 1.
The key step is to establish bounds for the (N − 1) × (N − 1) tridiagonal Toeplitz matrix
T(a) = (  1   a ;
         −a   1   a ;
              ⋱   ⋱   ⋱ ;
                  −a   1 ),
with 1 on the diagonal, a on the superdiagonal and −a on the subdiagonal.
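One useful property of T(a) follows from the splitting T(a) = I + aS with S skew-symmetric: then TᵀT = I + a²SᵀS ⪰ I, so the smallest singular value of T(a) is at least 1 and ‖T(a)^{−1}‖₂ ≤ 1 uniformly in the dimension and in a. A numerical check of this elementary observation (our own remark, not a bound quoted from Carasso and Parter):

```python
import numpy as np

def T(a, n):
    # Toeplitz tridiagonal: 1 on the diagonal, a above, -a below
    return (np.eye(n) + a * np.diag(np.ones(n - 1), 1)
            - a * np.diag(np.ones(n - 1), -1))

for n in (10, 100, 400):
    for a in (0.1, 1.0, 25.0):
        # T = I + a*S with S skew-symmetric  =>  T^T T = I + a^2 S^T S >= I,
        # hence the smallest singular value is >= 1, uniformly in n and a
        smin = np.linalg.svd(T(a, n), compute_uv=False).min()
        assert smin >= 1.0 - 1e-9
```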
Σ_{2α₀+|α| ≤ M} ‖D_t^{α₀} D_x^α u‖_{L_p(Ω'_T)} ≤ C { ‖Lu‖_{L_p(Ω_T)} + ‖u‖_{L_p(Ω_T)} },   (13.15)
where Ω' ⊂⊂ Ω and Ω_T = Ω × {0 < t < T}. Here Ω' ⊂⊂ Ω signifies that the closure
of Ω' is contained in Ω. We assume that the symbol of the difference operator
is nonvanishing and that the difference operator is parabolic in the sense of John, i.e.,
with c > 0,
|E(x, t, ξ)| = |B(x, t, ξ)^{−1} A(x, t, ξ)| ≤ 1 − c|ξ|^M
for |ξ_j| ≤ π, j = 1, ..., d.   (13.16)
The discussion below remains valid in the case of a system which is parabolic in
Petrovskii's sense, provided we demand that B(x, t, ξ), which is then a matrix, is
invertible and that in (13.16) the modulus of E(x, t, ξ) = B(x, t, ξ)^{−1} A(x, t, ξ) is
replaced by the spectral radius of this matrix.
For mesh functions we then define the norms ‖U‖_{h,p,Ω} and
‖U‖_{h,p,s,Ω} = Σ_{2α₀+|α| ≤ s} ‖∂_t^{α₀} ∂_x^α U‖_{h,p,Ω}, s ≥ 0,
with the difference quotients defined as in Section 4.
We are now ready to state the a priori estimate analogous to (13.15).
THEOREM 13.1. Let L_h be a parabolic difference operator (in the sense of John) in
Ω_T. Then for any p, r with 1 ≤ r ≤ p ≤ ∞, any Ω' ⊂⊂ Ω'' ⊂⊂ Ω, and any nonnegative
integer s there exists a constant C such that, for any positive T, sufficiently small h, and
any mesh function U,
Σ_{|α|+j ≤ M} ‖∂_t^j ∂_x^α U‖_{h,p,Ω'_T} ≤ C ( ‖L_h U‖_{h,p,s,Ω''_T} + ‖U‖_{h,r,Ω_T} ).
and then, using a partition of unity with functions of small support, for a general L,
with variable coefficients and lower order terms, that
Σ_{|α|+j ≤ M} ‖∂_t^j ∂_x^α U‖_{h,p,Ω'_T} ≤ C ( ‖L_h U‖_{h,p,Ω''_T} + ‖U‖_{h,p,M−1,Ω''_{2T−k}} ).
result for s = 0, r = p. The proof in the case r = p is then completed by induction over
s by applying the inequality obtained to difference quotients, and the result for r <p
then follows using a discrete Sobolev type inequality. □
We shall now turn to the application of this result to the convergence of solutions
of finite difference operators. We consider thus solutions of equations of the form
(13.17). For suitable operators Q_h and Q one then has an estimate of the form
‖Q_h U − Qu‖_{h,∞,Ω'} ≤ C(u)h^σ + C ‖U − u‖_{h,r,Ω''}.
The theorem thus asserts that if u is a solution in Ω of any boundary value problem
for (13.14) and U is a solution of a corresponding discrete problem which uses (13.17)
in the interior, and if it is known that for some r and some Ω'' ⊂⊂ Ω,
‖U − u‖_{h,r,Ω''} = O(h^σ) as h → 0,
then with Q_h and Q as stated we may conclude that Q_h U tends uniformly to Qu with
rate O(h^σ) as h → 0. In a typical case we might have obtained an interior L₂ type
convergence result, for example by the energy method, and may then conclude
maximum-norm convergence of QhU to Qu. The particular case when Qh and Q are
both the identity operator shows uniform convergence in the interior.
We cannot leave our survey of the finite difference method in the context of
parabolic problems without mentioning its use in existence theory. We shall
therefore close this section by sketching an example of this given in PETROWSKI
[1955], and consider an initial boundary value problem in one space dimension with
nonvertical lateral boundaries.
Let thus φ_j ∈ C[0, T], j = 1, 2, be two given functions with φ₁(t) < φ₂(t) for
t ∈ [0, T], and consider the initial boundary value problem
∂u/∂t = ∂²u/∂x² for φ₁(t) < x < φ₂(t), 0 < t ≤ T,
u(φ_j(t), t) = g_j(t) for 0 < t ≤ T, j = 1, 2,   (13.19)
u(x, 0) = v(x) for φ₁(0) < x < φ₂(0),
where g_j, j = 1, 2, and v are the given data of the problem.
We denote by Q the domain under consideration, i.e. Q = {(x, t); φ₁(t) < x < φ₂(t),
0 < t < T}, and by Γ its parabolic boundary, Γ = ∂Q \ {(x, T); φ₁(T) < x < φ₂(T)}. We
now impose a mesh (jh, nk), where h and k are the mesh widths in space and time, with
k/h² = λ kept constant. Let Q̄_h denote the union of the closed mesh squares which
belong to Q̄, Γ_h those mesh squares which have at least one point on ∂Q̄_h, excluding,
however, squares with one side on t = T and with vertical sides not on ∂Q̄_h, and
finally Q_h = Q̄_h \ Γ_h (see Fig. 13.1).
For each mesh point P of Γ_h we choose a point P̄ on Γ of minimal distance from P.
We may then pose the discrete problem
∂_t U_j^n = ∂_x ∂_x̄ U_j^n for (jh, nk) ∈ Q_h,   (13.20)
U_j^n = u(P̄) for (jh, nk) ∈ Γ_h,
FIG. 13.1. The domain Q bounded by the curves x = φ₁(t) and x = φ₂(t); Γ_h = shaded squares, Q_h = white squares.
where the value of u(P̄) equals the appropriate value of g₁, g₂ or v, depending on the
location of P̄.
One notices immediately, as in the proof of Theorem 11.1, that the maximum and
minimum of a solution of (13.20) are attained on Γ_h. Hence, in particular, this
problem has at most one solution, since the difference between two solutions
vanishes on Γ_h and hence in Q_h. Because (13.20) has as many equations as unknowns,
the uniqueness implies the existence of a solution for any choice of the u(P̄). It follows
also that |U| is bounded in Q̄_h, independently of h, by the maximum of |u| on Γ.
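The discrete maximum principle invoked here is easy to exhibit on a rectangle, where the forward Euler scheme with λ ≤ 1/2 writes each interior value as a convex combination of values at the previous level, so the extrema over all mesh points are attained on the parabolic boundary. A sketch with random data (parameter values arbitrary):

```python
import numpy as np

M, N, lam = 30, 200, 0.4              # forward Euler with lam = k/h^2 <= 1/2
rng = np.random.default_rng(2)
U = np.empty((N + 1, M + 1))
U[0] = rng.uniform(-1, 1, M + 1)      # initial data on t = 0
U[:, 0] = rng.uniform(-1, 1, N + 1)   # boundary data on the left side
U[:, M] = rng.uniform(-1, 1, N + 1)   # boundary data on the right side

for n in range(N):
    # each interior value is a convex combination of previous-level values
    U[n + 1, 1:M] = (lam * U[n, 2:] + (1 - 2 * lam) * U[n, 1:M]
                     + lam * U[n, :M - 1])

# extrema over all mesh points are attained on the parabolic boundary
boundary = np.concatenate([U[0], U[:, 0], U[:, M]])
assert U.max() <= boundary.max() + 1e-12
assert U.min() >= boundary.min() - 1e-12
```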
We shall now extend U to a function u_h defined on all of Q̄. For this purpose we
divide each of the mesh squares into two triangles by means of the straight lines
x/h + t/k = j, and define u_h in each of the triangles by linear interpolation from the
values at the mesh points. The definition is completed by extending U to Q̄ \ Q̄_h as
a continuous function without increasing the maximum of |u_h|.
Let Q⁰ ⊂⊂ Q, i.e. let Q⁰ be such that Q̄⁰ ⊂ Q. Since U is uniformly bounded in Q̄_h,
we find by Theorem 13.1, together with a discrete Sobolev inequality, that all
difference quotients of U are bounded in Q⁰ for small h. Using this fact for the first
order difference quotients, one finds easily that there is a constant C, independent of
h, such that u_h is uniformly Lipschitz continuous in Q⁰.
Let now {Q^j} be a sequence of domains with Q̄^j ⊂ Q^{j+1} and ⋃_j Q^j = Q. The above
argument may be applied to each of the Q^j, and by a diagonal procedure it is then
possible to show that there exists a sequence {h_m}, with h_m → 0 as m → ∞, and
a function ū ∈ C(Q) such that the corresponding extensions of {U}, {∂_t U} and {∂_x U}
converge to ū, ∂ū/∂t and ∂ū/∂x, uniformly on each compact subset of Q. Using the
finite difference equation in (13.20) one also shows that these functions satisfy the
heat equation, ∂ū/∂t = ∂²ū/∂x², in Q.
(Recall that, for these points, U_j^n = v(P̄) for a suitable P̄ ∈ Γ₀.) Let further K be
a constant such that
sup_{Γ₀} |v| ≤ 2Kω(x, t) for (x, t) ∈ Q̄ \ V_ε,
which is possible since ω is positive on the closed set Q̄ \ V_ε. We shall show that
v(x₀) − ε − Kω(x, t) ≤ u_h(x, t) ≤ v(x₀) + ε + Kω(x, t)   (13.22)
for (x, t) = (jh, nk) ∈ Q̄_h.
Assuming this for a moment, we have, since the mesh points are dense in Q as h and
k tend to zero, that the same inequality holds with u_h(x, t) replaced by ū(x, t), and
hence
v(x₀) − ε ≤ lim inf_{(x,t)→(x₀,0)} ū(x, t) ≤ lim sup_{(x,t)→(x₀,0)} ū(x, t) ≤ v(x₀) + ε,
The proof of this fact uses the same ideas as above, with ω_P playing the role of ω.
The result thus depends on the existence of a function ω_P with the above
properties, a so-called barrier. This in turn will depend on the regularity of the
functions φ₁(t) and φ₂(t) defining the boundary. One may show that Lipschitz
continuity is a sufficient condition for this and that, for instance, for the left-hand
boundary, a function of the form
ω_P(x, t) = |(x − x̄) − (t − t̄)|^α, 0 < α < 1,
may serve as such a barrier.
AKOPYAN, Yu.R. and L.A. OGANESYAN (1977), A variational-difference method for solving two-
dimensional linear parabolic equations, Zh. Vychisl. Mat. i Mat. Fiz. 17, 109-118 (in Russian). (U.S.S.R.
Comput. Math. and Math. Phys. 17, 101-111.)
ALBRECHT, J. (1957), Zum Differenzenverfahren bei parabolischen Differentialgleichungen, Z. Angew.
Math. Mech. 37, 202-212.
ANDERSSEN, R.S. (1968), On the reliability of finite difference representations, Austral. Comput. J. 1 (3).
ANSORGE, R. and R. HASS (1970), Konvergenz von Differenzenverfahren für lineare und nichtlineare
Anfangswertaufgaben, Lecture Notes in Mathematics 159 (Springer, Berlin).
ARONSON, D.G. (1963a), The stability of finite difference approximations to second order linear parabolic
differential equations, Duke Math. J. 30, 117-128.
ARONSON, D.G. (1963b), On the stability of certain finite difference approximations to parabolic systems
of differential equations, Numer. Math. 5, 118-137.
ASCHER, M. (1960), Explicit solutions of the one-dimensional heat equation for a composite wall, Math.
Comp. 14, 346-353.
ASTRAKHANTSEV, G.P. (1971), A finite difference method for solving the third boundary value problem for
elliptic and parabolic equations in an arbitrary domain. Iterative solution of difference equations, 2,
Zh. Vychisl. Mat. i Mat. Fiz. 11, 677-687 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 11,
168-182.)
BABUSKA, I., M. PRAGER and E. VITASEK (1966), Numerical Processes in Differential Equations (Wiley,
London).
BATTEN, G.W. (1963), Second order correct boundary conditions for the numerical solution of the mixed
boundary problem for parabolic equations, Math. Comp. 17, 405-413.
BIRTWISTLE, G.M. (1968), The explicit solution of the equation of heat conduction, Comput. J. 11,
317-323.
BONDESSON, M. (1971), An interior a priori estimate for parabolic difference operators and an
application, Math. Comp. 25, 43-58.
BONDESSON, M. (1973), Interior a priori estimates in discrete LP norms for solutions of parabolic and
elliptic difference equations, Ann. Mat. Pura Appl. 95, 1-43.
BRENNER, P., V. THOMEE and L.B. WAHLBIN (1975), Besov Spaces and Applications to Difference Methods
for Initial Value Problems, Lecture Notes in Mathematics 434 (Springer, Berlin).
CAMPBELL, C.M. and P. KEAST (1968), The stability of difference approximations to a self-adjoint
parabolic equation under derivative boundary conditions, Math. Comp. 22, 336-346.
CARASSO, A. (1969), Finite difference methods and the eigenvalue problem for nonselfadjoint Sturm-
Liouville operators, Math. Comp. 23, 717-729.
CARASSO, A. (1970), A posteriori bounds in the numerical solution of mildly nonlinear parabolic
equations, Math. Comp. 24, 785-792.
CARASSO, A. (1971), Long-range numerical solution of mildly non-linear parabolic equations, Numer.
Math. 16, 304-321.
CARASSO, A. and S.V. PARTER (1970), An analysis of "boundary-value techniques" for parabolic problems,
Math. Comp. 24, 315-340.
CIMENT, M., S.H. LEVENTHAL and B.C. WEINBERG (1978), The operator compact implicit method for
parabolic equations, J. Comput. Phys. 28, 138-166.
COLLATZ, L. (1955), Numerische Behandlung von Differentialgleichungen (Springer, Berlin, 2nd ed.).
COURANT, R., K.O. FRIEDRICHS and H. LEWY (1928), Über die partiellen Differenzengleichungen der
mathematischen Physik, Math. Ann. 100, 32-74.
CRANDALL, S.H. (1955), An optimum implicit recurrence formula for the heat conduction equation,
Quart. Appl. Math. 13, 318-320.
CRANK, J. and P. NICOLSON (1947), A practical method for numerical integration of solution of partial
differential equations of heat-conduction type, Proc. Cambridge Philos. Soc. 43, 50-67.
DOUGLAS Jr, J. (1955), On the numerical integration of ∂²u/∂x² + ∂²u/∂y² = ∂u/∂t by implicit methods, J.
SIAM 3, 42-65.
DOUGLAS Jr, J. (1956a), On the errors in analogue solutions of heat conduction problems, Quart. Appl.
Math. 14, 333-335.
DOUGLAS Jr, J. (1956b), On the numerical integration of quasi-linear parabolic differential equations,
Pacific J. Math. 6, 35-42.
DOUGLAS Jr, J. (1956c), On the relation between stability and convergence in the numerical solution of
linear parabolic and hyperbolic equations, J. SIAM 4, 20-37.
DOUGLAS Jr, J. (1956d), The solution of the diffusion equation by a high order correct difference equation,
J. Math. Phys. 35, 145-151.
DOUGLAS Jr, J. (1957), A note on the alternating direction implicit method for the numerical solution of
heat flow problems, Proc. Amer. Math. Soc. 8, 409-412.
DOUGLAS Jr, J. (1958), The application of stability analysis in the numerical solution of quasi-linear
parabolic differential equations, Trans. Amer. Math. Soc. 89, 484-518.
DOUGLAS Jr, J. (1959), The effect of round-off error in the numerical solution of the heat equation, J.
Math. Phys. 31, 35-41.
DOUGLAS Jr, J. (1960), A numerical method for a parabolic system, Numer. Math. 2, 91-98.
DOUGLAS Jr, J. (1961a), A survey of numerical methods for parabolic differential equations, in: F.C. ALT,
ed., Advances in Computers 2 (Academic Press, New York) 1-54.
DOUGLAS Jr, J. (1961b), On incomplete iteration for implicit parabolic difference equations, J. SIAM 9,
433-439.
DOUGLAS Jr, J. and T.M. GALLIE Jr (1955), Variable time steps in the solution of the heat flow equation by
a difference equation, Proc. Amer. Math. Soc. 6, 787-793.
DOUGLAS Jr, J. and J.E. GUNN (1962), Alternating direction methods for parabolic systems in m space
variables, J. Assoc. Comput. Mach. 9, 450-456.
DOUGLAS Jr, J. and J.E. GUNN (1963), Two high-order correct difference analogues for the equation of
multidimensional heat flow, Math. Comp. 17, 71-80.
DOUGLAS Jr, J. and J.E. GUNN (1964), A general formulation of alternating direction methods, Part I.
Parabolic and hyperbolic problems, Numer. Math. 6, 428-453.
DOUGLAS Jr, J. and B.F. JONES (1963), On predictor-corrector methods for nonlinear parabolic
differential equations, J. SIAM 11, 195-204.
DOUGLAS Jr, J. and C.M. PEARCY (1963), On convergence of alternating direction procedures in the
presence of singular operators, Numer. Math. 5, 175-184.
DOUGLAS Jr, J. and H.H. RACHFORD (1956), On the numerical solution of heat conduction problems in
two and three space variables, Trans. Amer. Math. Soc. 82, 421-439.
Du FORT, E.C. and S.P. FRANKEL (1953), Stability conditions in the numerical treatment of parabolic
differential equations, Math. Tables Aids Comput. 7, 135-152.
EISEN, D. (1966), Stability and convergence of finite difference schemes with singular coefficients, SIAM J.
Numer. Anal. 3, 545-552.
EISEN, D. (1967a), The equivalence of stability and convergence for finite difference schemes with singular
coefficients, Numer. Math. 10, 20-29.
EISEN, D. (1967b), On the numerical solution of u_t = u_rr + (2/r)u_r, Numer. Math. 10, 397-409.
EVANS, D.J. (1965), A stable explicit method for the finite-difference solution of a fourth-order parabolic
partial differential equation, Computer J. 8, 280-287.
FAIRWEATHER, G., A.R. GOURLAY and A.R. MITCHELL (1967), Some high accuracy difference schemes
with a splitting operator for equations of parabolic and elliptic type, Numer. Math. 10, 56-66.
FORSYTHE, G.E. and W.R. WASOW (1960), Finite Difference Methods for Partial Differential Equations
(Wiley, New York).
MONIEN, B. (1970), Über die Konvergenzordnung von Differenzenverfahren, die parabolische Anfangs-
wertaufgaben approximieren, Computing 5, 221-245.
NORDMARK, S. (1974), Uniform stability of a class of parabolic difference operators, BIT 14, 314-325.
O'BRIEN, G.G., M.A. HYMAN and S. KAPLAN (1951), A study of the numerical solution of partial
differential equations, J. Math. Phys. 29, 223-251.
OSBORNE, M.R. (1969), The numerical solution of the heat conduction equation subject to separated
boundary conditions, Comput. J. 12, 82-87.
OSHER, S. (1969), Maximum norm stability for parabolic difference schemes in half-space, in: Hyperbolic
Equations and Waves (Springer, New York) 61-75.
OSHER, S. (1970), Mesh refinements for the heat equation, SIAM J. Numer. Anal. 7, 199-205.
OSHER, S. (1972), Stability of parabolic difference approximations to certain mixed initial boundary value
problems, Math. Comp. 26, 13-39.
PARKER, I.B. and J. CRANK (1964), Persistent discretization errors in partial differential equations of
parabolic type, Comput. J. 7, 163-167.
PEACEMAN, D.W. and H.H. RACHFORD Jr (1955), The numerical solution of parabolic and elliptic
differential equations, J. SIAM 3, 28-41.
PEETRE, J. and V. THOMEE (1967), On the rate of convergence for discrete initial value problems, Math.
Scand. 21, 159-176.
PETROVSKII, I.G. (1937), Über das Cauchysche Problem für Systeme von partiellen Differentialgleichungen,
Rec. Math. (Math. Sb.) 2, 814-868.
PETROWSKI, I.G. (1953), Vorlesungen über partielle Differentialgleichungen (Teubner, Leipzig).
POLITCHKA, A.E. and P.E. SOBOLEVSKII (1976), New Lp estimates for parabolic difference problems, Zh.
Vychisl. Mat. i Mat. Fiz. 16 (5), 65-74 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 16 (5),
1155-1163.)
RAVIART, P.A. (1967), Sur l'approximation de certaines équations d'évolution linéaires et non-linéaires, J.
Math. Pures Appl. 46 (1), 11-107; 46 (2), 109-183.
RICHTMYER, R.D. and K.W. MORTON (1967), Difference Methods for Initial-Value Problems (Interscience,
New York).
ROSE, M.E. (1956), On the integration of non-linear parabolic equations by implicit difference methods,
Quart. Appl. Math. 14, 237-248.
ROTHE, E. (1931), Wärmeleitungsgleichung mit nichtkonstanten Koeffizienten, Math. Ann. 104, 340-362.
RYABENKII, V.S. and A.F. FILIPPOV (1960), Über die Stabilität von Differenzengleichungen (Deutscher
Verlag der Wissenschaften, Berlin).
SAMARSKII, A.A. (1961a), A priori estimates for the solution of the difference analogue of a parabolic
differential equation, Zh. Vychisl. Mat. i Mat. Fiz. 1, 441-460 (in Russian). (U.S.S.R. Comput. Math. and
Math. Phys. 1, 487-512.)
SAMARSKII, A.A. (1961b), A priori estimates for difference equations, Zh. Vychisl. Mat. i Mat. Fiz. 1,
972-1000 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 1, 1138-1167.)
SAMARSKII, A.A. (1962a), On the convergence and accuracy of homogeneous difference schemes for one-
dimensional and multidimensional parabolic equations, Zh. Vychisl. Mat. i Mat. Fiz. 2, 603-634 (in
Russian). (U.S.S.R. Comput. Math. and Math. Phys. 2, 654-696.)
SAMARSKII, A.A. (1962b), On an economical difference method for the solution of a multidimensional
parabolic equation in an arbitrary region, Zh. Vychisl. Mat. i Mat. Fiz. 2, 787-811 (in Russian).
(U.S.S.R. Comput. Math. and Math. Phys. 2, 894-926.)
SAMARSKII, A.A. (1963), Homogeneous difference schemes on non-uniform meshes for parabolic
equations, Zh. Vychisl. Mat. i Mat. Fiz. 3, 351-393 (in Russian). (U.S.S.R. Comput. Math. and Math.
Phys. 3, 266-298.)
SAMARSKII, A.A. (1964a), An accurate high-order difference system for a heat conductivity equation with
several space variables, Zh. Vychisl. Mat. i Mat. Fiz. 4, 161-165 (in Russian). (U.S.S.R. Comput. Math.
and Math. Phys. 4, 222-228.)
SAMARSKII, A.A. (1964b), Economical difference schemes for parabolic equations with mixed derivatives,
Zh. Vychisl. Mat. i Mat. Fiz. 4, 182-191 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 4, 753-
759.)
SAMARSKII, A.A. (1971), Introduction to the Theory of Difference Schemes (Nauka, Moscow) (in Russian).
SAMARSKII, A.A. and I.V. FRYAZINOV (1961), On the convergence of homogeneous difference schemes for
a heat-conduction equation with discontinuous coefficients, Zh. Vychisl. Mat. i Mat. Fiz. 1, 806-824 (in
Russian). (U.S.S.R. Comput. Math. and Math. Phys. 1, 962-982.)
SAMARSKII, A.A. and A.V. GULIN (1973), Stability of Difference Schemes (Nauka, Moscow) (in Russian).
SAUL'EV, V.K. (1964), Integration of Equations of Parabolic Type by the Method of Nets (Pergamon Press,
Oxford).
SERDJUKOVA, S.J. (1964), The uniform stability with respect to the initial data of a six-point symmetrical
scheme for the heat conduction equation, in: Numerical Methods for the Solution of Differential
Equations and Quadrature Formulae (Nauka, Moscow) 212-216.
SERDJUKOVA, S.J. (1967), The uniform stability of a six-point scheme of increased order of accuracy for the
heat equation, Zh. Vychisl. Mat. i Mat. Fiz. 7, 214-218 (in Russian). (U.S.S.R. Comput. Math. and
Math. Phys. 7, 297-303.)
SHILOV, G.E. (1955), On the correctness of Cauchy problems for systems of partial differential equations
with constant coefficients, Uspekhi Mat. Nauk. 10, 89-100 (in Russian).
SMITH, G.D. (1965), Numerical Solution of Partial Differential Equations (Oxford University Press,
London).
STETTER, H.J. (1959), Anwendung des Äquivalenzsatzes von P. Lax auf inhomogene Probleme, Z. Angew.
Math. Mech. 39, 396-397.
STRANG, W.G. (1959), On the order of convergence of the Crank-Nicolson procedure, J. Math. Phys. 38,
141-144.
STRANG, W.G. (1960), Difference methods for mixed boundary value problems, Duke Math. J. 27, 221-232.
STRANG, G. (1963), Accurate partial difference methods I: Linear Cauchy problems, Arch. Rational Mech.
Anal. 13, 392-402.
STRANG, G. (1964a), Unbalanced polynomials and difference methods for mixed problems, SIAM J. Numer.
Anal. 2, 46-51.
STRANG, G. (1964b), Wiener-Hopf difference equations, J. Math. Mech. 13, 85-96.
STRANG, G. (1966), Implicit difference methods for initial boundary value problems, J. Math. Anal. Appl.
16, 188-198.
SUNAUCHI, H. (1968), Perturbation theory of difference schemes, Numer. Math. 12, 454-458.
TAYLOR, P.J. (1970), The stability of the Du Fort-Frankel method for the diffusion equation with
boundary conditions involving space derivatives, Computer J. 13, 92-97.
THOMÉE, V. (1964), Elliptic difference operators and Dirichlet's problem, Contribut. Differential
Equations 3, 301-324.
THOMÉE, V. (1965), Stability of difference schemes in the maximum-norm, J. Differential Equations 1, 273-
292.
THOMÉE, V. (1966a), On maximum-norm stable difference operators, in: J.H. Bramble, ed., Numerical
Solution of Partial Differential Equations (Academic Press, New York) 125-151.
THOMÉE, V. (1966b), Parabolic difference operators, Math. Scand. 19, 77-107.
THOMÉE, V. (1967), Generally unconditionally stable difference operators, SIAM J. Numer. Anal. 4, 55-
69.
THOMÉE, V. (1969), Stability theory for partial difference operators, SIAM Rev. 11, 152-195.
THOMÉE, V. (1984), Galerkin Finite Element Methods for Parabolic Problems, Lecture Notes in
Mathematics 1054 (Springer, Berlin).
THOMÉE, V. and L.B. WAHLBIN (1974), Convergence rates of parabolic difference schemes for non-
smooth data, Math. Comp. 28, 1-13.
THOMPSON, R.J. (1964), Difference approximations for inhomogeneous and quasi-linear equations, J.
SIAM 12, 189-199.
TIKHONOV, A.N. and A.A. SAMARSKII (1961), Homogeneous difference schemes, Zh. Vychisl. Mat. i Mat.
Fiz. 1, 5-63 (in Russian). (U.S.S.R. Comput. Math. and Math. Phys. 1, 5-67.)
VARAH, J.M. (1970), Maximum norm stability of difference approximations to the mixed initial
boundary-value problem for the heat equation, Math. Comp. 24, 31-44.
VARGA, R.S. (1961), On high order stable implicit methods for solving parabolic partial differential
equations, J. Math. and Phys. 40, 220-231.
VARGA, R.S. (1962), Matrix Iterative Analysis (Prentice-Hall, Englewood Cliffs, NJ).
WASOW, W. (1958), On the accuracy of implicit difference approximations to the equation of heat flow, Math.
Tables Aids Comput. 12, 43-55.
WEINELT, W., R.D. LAZAROV and U. STREIT (1984), On the order of convergence of difference schemes for
weak solutions of the heat conduction equation in nonisotropic nonhomogeneous media, Differencial'nye
Uravnenija 20, 1144-1151 (in Russian).
WIDLUND, O.B. (1965), On the stability of parabolic difference schemes, Math. Comp. 19, 1-13.
WIDLUND, O.B. (1966), Stability of parabolic difference schemes in the maximum norm, Numer. Math. 8,
186-202.
WIDLUND, O.B. (1970a), On the rate of convergence for parabolic difference schemes, I, in: Numerical
Solution of Field Problems in Continuum Physics, SIAM-AMS Proceedings II (American Mathematical
Society, Providence, RI) 60-73.
WIDLUND, O.B. (1970b), On the rate of convergence for parabolic difference schemes II, Comm. Pure Appl.
Math. 23, 79-96.
List of Symbols
  ‖v‖_{W_p^m} = ( Σ_{|α| ≤ m} ‖D^α v‖_{L_p}^p )^{1/p},

  ‖v‖_{W_∞^m} = Σ_{|α| ≤ m} ‖D^α v‖_{L_∞},

  (v, w)_h = h^d Σ_x v(x) w(x),

  ‖v‖²_{2,h} = (v, v)_h = h^d Σ_x v(x)²,

the sums in the discrete inner products being taken over the grid points x.
Truncation error:

  τⁿ(x) = (u(x, (n+1)k) − E_k u(x, nk))/k,  x_j = jh.

  ∂_x v_j = (v_{j+1} − v_j)/h,
  ∂̄_x v_j = (v_j − v_{j−1})/h,
  ∂̂_x v_j = (v_{j+1} − v_{j−1})/2h = ½(∂_x + ∂̄_x)v_j,
  ∂_x ∂̄_x v_j = (v_{j+1} − 2v_j + v_{j−1})/h²,
  ∂_t vⁿ = (vⁿ⁺¹ − vⁿ)/k,
  ∂̄_t vⁿ = (vⁿ − vⁿ⁻¹)/k.
Fourier transform:

  ‖v̂‖_{L₂(R^d)} = (2π)^{d/2} ‖v‖_{L₂(R^d)},

  (D^α v)^(ξ) = (iξ)^α v̂(ξ),  ξ^α = ξ₁^{α₁} ⋯ ξ_d^{α_d}.
Zygmund condition, 98
Theta method, 114, 125
Splitting and
Alternating
Direction Methods
G.I. Marchuk
Department of Numerical Mathematics
USSR Academy of Sciences
Ryleev Street 29
119034 Moscow, USSR
PREFACE 203
PART 1. ALGORITHMS FOR THE SPLITTING METHODS AND THE ALTERNATING DIRECTION
METHODS 229
13. The two-cycle componentwise splitting methods: The case A = A1 + A2 245
14. The two-cycle multicomponent splitting method 247
15. The two-cycle componentwise splitting method for quasi-linear problems 248
16. A general approach to the two-cycle componentwise splitting method 249
17. The two-cycle componentwise splitting scheme for the heat conduction equation 252
199
200 G.I. Marchuk
21. A general scheme for the method of approximate factorization of the operator 263
22. The scheme of approximate factorization for the parabolic equation 265
CHAPTER V. The Predictor-Corrector Method 269
23. The predictor-corrector method: The case A = A1 + A2 269
24. The predictor-corrector method: The case A = Σ_i A_i 272
25. The predictor-corrector method for the parabolic equation 273
CHAPTER VI. The Alternating Direction and the Stabilizing Correction Methods 277
26. The alternating direction method 277
27. The stabilizing correction method 278
28. A general formulation for the stabilizing correction method 279
29. Application of the alternating direction scheme to the parabolic equation 280
CHAPTER VII. Methods of Splitting with Respect to Physical Processes 283
30. The method of splitting with respect to physical processes 283
31. The method of particles in a cell 285
32. The method of large particles 286
CHAPTER VIII. The Alternating Triangular Method and the Alternating Operator Method 289
33. The alternating triangular method 289
34. The alternating operator method 291
35. The generalized alternating operator method 292
36. The scheme of the alternating triangular method for the parabolic equation 292
CHAPTER IX. Splitting Methods and Alternating Direction Methods as Iterative Methods
for Stationary Problems 295
37. The stationing method: General concepts of the theory of iterative methods 295
38. Iterative algorithms 297
39. Acceleration of the convergence of iterative methods 299
PART 2. METHODS FOR STUDYING THE CONVERGENCE OF SPLITTING AND ALTERNATING
DIRECTION SCHEMES 301
CHAPTER X. Convergence Studies of the Splitting Schemes by Use of the Fourier Method
(Spectral Method) 303
40. General statement of the Fourier method 303
41. The Fourier method and the convergence studies of splitting schemes for
stationary problems 307
42. The Fourier method and the grounding of the splitting schemes for
nonstationary problems 310
CHAPTER XI. The A Priori Estimates Method and the Convergence Studies of the
Splitting Schemes 315
43. The simplest a priori estimates 315
44. A priori estimates for the splitting scheme of type A_j φ^{j+1} = B_j φ^j + τ_j f^j 319
45. The energy inequalities method for constructing a priori estimates 322
CHAPTER XII. The Splitting of the Evolutionary Problem for a System of
Differential Equations 327
46. The splitting of problems defined on fractional intervals and the weak
approximation method 327
CHAPTER XIV. Splitting and Decomposition Methods for Variational Problems 355
55. Splitting and decomposition methods for classical variational problems 355
56. Decomposition of a general variational problem 356
57. A variational problem with restrictions 357
58. The convergence of decomposition algorithms 358
76. Splitting schemes restoring divergence for incompressible fluid equations 416
77. The general principle for constructing splitting schemes for Navier-Stokes equations 419
REFERENCES 449
CHAPTER I
Introduction
The intensive development of the methods for solving linear algebraic equations
with three-diagonal and block three-diagonal matrices in the fifties and sixties led to
the creation of effective numerical algorithms for solving stationary problems based
on the factorization of the difference operator. A special place among the methods of
factorization (sweep methods) belongs to various versions of noniterative methods of
factorization developed by KELDYSH [1942], GELFAND and LOKUTSIEVSKY [1962],
RUSANOV [1960], GODUNOV [1962], ABRAMOV and ANDREEV [1963] and others.
Experience accumulated in solving one-dimensional problems by methods based
on factorization had prepared the basis for the construction of the algorithms for
solving more complicated problems in mathematical physics. The beginning of the
sixties was marked by a major contribution to numerical mathematics: the
development of such algorithms. This contribution is connected with the names of
Douglas, Peaceman and Rachford, who suggested the alternating direction method
(see DOUGLAS and RACHFORD [1956], PEACEMAN and RACHFORD [1955]). The success
of the method was ensured by a simple reduction of the multidimensional problem
to a sequence of one-dimensional problems with three-diagonal matrices easily
invertible by computers. This method significantly influenced the construction of
algorithms in various fields of applied mathematics. The theoretical studies of this
and related methods are presented in the works by DOUGLAS [1962], DYAKONOV
[1961-1967, 1970-1972], SAMARSKII [1961-1967, 1970-1971, 1977], BIRKHOFF,
VARGA and YOUNG [1962], WACHSPRESS [1962], KELLOGG [1964], GUNN [1965],
VOROBYOV [1968] and others.
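The reduction to easily invertible three-diagonal systems that underlies these methods can be sketched with the classical double-sweep (Thomas) elimination. The function name and the small test system below are illustrative, not taken from the text:

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with subdiagonal a (a[0] unused),
    diagonal b, superdiagonal c (c[-1] unused) and right-hand side d
    by a forward elimination sweep and back substitution.  O(n) work."""
    n = len(b)
    cp = [0.0] * n          # modified superdiagonal coefficients
    dp = [0.0] * n          # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):   # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

For the model matrix with 2 on the diagonal and −1 off it, the solve costs a single pass in each direction, which is what makes the one-dimensional subproblems of the alternating direction method cheap.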
The methods were developed based on homogeneous and inhomogeneous
approximations. In the case of inhomogeneous approximation, each of the auxiliary
(intermediate) problems does not necessarily approximate the original problem by
itself; taken as a whole, however, in special norms the approximation does take place. These methods have
been named the splitting methods; they have been developed in the works of the
Soviet mathematicians YANENKO [1966, 1967], DYAKONOV [1963a, 1966], SAMARSKII
[1962d, 1963a], SAUL'EV [1960], MARCHUK [1971] and others.
The splitting methods were widely used in problems of various kinds and they
stimulated the more general approach to the solution of problems in mathematical
physics based on the method of weak approximation developed by YANENKO
[1964b, 1967] and SAMARSKII [1963a, 1965b]. It appeared that the splitting method
may be understood as the method of weak approximation of the original equation
by a more simple equation. The convergence conditions for the method of weak
approximation were formulated in the theorem by YANENKO and DEMIDOV [1966]
and in the works by LEBEDEV [1977] and DYAKONOV [1962f, 1971d]. The method of
weak approximation has found natural applications in the problems of hydro-
dynamics, meteorology, oceanology, in the theory of radiation transport, etc.
(MARCHUK [1967, 1974a], YANENKO [1967]).
The original scheme of the predictor-corrector type by Lax and Wendroff found
wide applications in problems of hydrodynamics, meteorology, oceanology, where
the predictor was suggested in the form of an explicit difference scheme. This scheme
is conditionally stable, it is easy in realization and has a second-order approx-
imation in all variables. A detailed study of the scheme is presented in the book by
RICHTMYER and MORTON [1972].
Various versions of the predictor-corrector method based on implicit difference
approximations were proposed by BRAYN [1966], DOUGLAS [1961], SOFRONOV
[1965] and MARCHUK and YANENKO [1966]. All these schemes proved to be
equivalent in a certain sense and differed only in the realization technique. In the last
of these works the implicit splitting scheme with factorized operator is used as
a predictor, which has first-order accuracy. For the problems in hydrodynamics the
implicit majorant schemes are used as predictors.
Of particular interest is the method of decomposition and decentralization
formulated by LIONS and TEMAM [1966] and also by BENSOUSSAN, LIONS and TEMAM
[1975] which also is close to the splitting methods and methods of weak
approximation.
In the sixties the method for solving multidimensional problems of mathematical
physics was developed intensively which is connected with the name of HARLOW
[1967]. This method was named the method of large particles. Today it is also con-
sidered as a splitting method. In the works by DYACHENKO [1965], BELOTSERKOVSKY
and DAVYDOV [1978], and YANENKO, ANUCHINA, PETRENKO and SHOKIN [1971]
various modifications of this method are given and schemes of their realization are
considered.
In the present work many of the splitting (fractional steps) and alternating direction
methods are considered, some facts from the theory of convergence of these methods are
presented, and algorithms for the numerical solution of a number of concrete applied
problems are formulated based on the methods discussed further.
The main attention in this work is paid to nonstationary problems. In this connection
the splitting and alternating direction methods are formulated as classes of finite
difference algorithms for solving these problems. Nevertheless, in some sections they
will be considered as iterative methods for solving stationary equations.
1. Approximation
This section introduces the general concepts of the theory of finite difference
methods which will be used in the present work.
which is the finite difference analogue of problem (1.1). Here A_h and a_h are linear
operators depending on the grid step h, φ_h ∈ Φ_h, f_h ∈ F_h, g_h ∈ G_h, and Φ_h, F_h and G_h
are grid function spaces.
We introduce in Φ_h, F_h and G_h respectively the norms ‖·‖_{Φ_h}, ‖·‖_{F_h} and ‖·‖_{G_h}. Let
(·)_h denote the linear operator which puts the element φ ∈ Φ in correspondence with
the element (φ)_h ∈ Φ_h in such a way that

  lim_{h→0} ‖(φ)_h‖_{Φ_h} = ‖φ‖_Φ.

We shall say that problem (1.2) approximates problem (1.1) with order n on
the solution φ if there exist positive constants h̄, M₁, M₂ such that for all h ≤ h̄
The values of the solution at the boundary points can be found from (1.2) after
solving (1.4). In some cases it is more convenient to write the approximating problem
in the form (1.4) and in some other cases in the form (1.2). Thus, as a result of the
reduction and with the required approximation taken into account, problem (1.1)
with a continuous argument has been reduced to the problem of linear algebra (1.4),
i.e., of solving the system of algebraic equations.
Further we shall mainly use Hilbert spaces of grid functions and, unless otherwise
stated, the corresponding norm ‖φ_h‖_{Φ_h} will be defined by the relationship

  ‖φ_h‖ = (φ_h, φ_h)^{1/2}.
It must be noted, however, that many of the introduced concepts (approximation,
etc.) also apply to the case of Banach spaces and in some statements and illustrating
examples grid function norms will be introduced which are not connected with the
inner product defined by the above relationship.
We shall illustrate the above concepts with an example of the following concrete
problem:

  −Δφ = f  in Ω,
  φ = 0  on ∂Ω,     (1.5)

where

  Ω = {(x, y): 0 < x < 1, 0 < y < 1},

f = f(x, y) is a smooth function, and

  Δ = ∂²/∂x² + ∂²/∂y².
Let F be the Hilbert space of real functions L₂(Ω) with the inner product
If we introduce the operators Aφ = −Δφ and aφ = φ|_{∂Ω}, then problem (1.5) may be
represented in the form (1.1) with g ≡ 0.
Now, we introduce the finite-dimensional approximation of problem (1.5). To this
end, we cover the square Ω̄ = Ω + ∂Ω by a uniform grid with step h in the x- and
y-directions. The grid points of the domain will be identified by a pair of indices (k, l),
where the first index k (0 ≤ k ≤ N_x) corresponds to the discretization of the x-coordinate
and the second index l (0 ≤ l ≤ N_y) corresponds to that of the y-coordinate. Consider
the following approximations:
  ∂²φ/∂x² ≈ Δ_x∇_x(φ_h),  ∂²φ/∂y² ≈ Δ_y∇_y(φ_h),

where Δ_x, Δ_y, ∇_x and ∇_y are difference operators defined on grid functions φ_h with
components φ_{k,l}, as follows:
i.e., Λ_h is a difference analogue of the operator −Δ. Now (1.6) can be
rewritten in the form

  Λ_h φ^h = f^h  in Ω_h,     (1.7)
  φ^h = 0  on ∂Ω_h,
where φ^h and f^h are vectors with the components φ_{k,l} and f_{k,l}, and

  (Λ_h φ^h)_{k,l} = −(φ_{k+1,l} + φ_{k−1,l} + φ_{k,l+1} + φ_{k,l−1} − 4φ_{k,l})/h²,

  f_{k,l} = (1/h²) ∫_{x_{k−1/2}}^{x_{k+1/2}} ∫_{y_{l−1/2}}^{y_{l+1/2}} f dx dy,

  (φ^h, ψ^h)_{F_h} = h² Σ_{k=1}^{N_x−1} Σ_{l=1}^{N_y−1} φ_{k,l} ψ_{k,l},

  ‖φ^h‖_{F_h} = ( h² Σ_{k=1}^{N_x−1} Σ_{l=1}^{N_y−1} (φ_{k,l})² )^{1/2},
where M₁ = const < ∞. The approximation of the boundary conditions in this
example is exact. From this fact and from the estimate (1.8) it follows that
problem (1.7) approximates problem (1.5) with second-order accuracy on the
solutions of problem (1.5) having bounded fourth-order derivatives.
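By way of illustration, here is a minimal numerical check (ours, not the authors') that the five-point analogue of −Δ used in (1.7) is second-order accurate on a smooth function; the function names and the test function sin πx sin πy are our choices:

```python
import numpy as np

def apply_lambda_h(phi, h):
    """Apply the five-point difference analogue of -Delta to the interior
    points of a grid function phi (boundary rows/columns are read only)."""
    return -(phi[2:, 1:-1] + phi[:-2, 1:-1] +
             phi[1:-1, 2:] + phi[1:-1, :-2] -
             4.0 * phi[1:-1, 1:-1]) / h**2

def truncation_error(N):
    """Max-norm difference between Lambda_h u and -Delta u on the unit
    square with step h = 1/N, for u = sin(pi x) sin(pi y)."""
    h = 1.0 / N
    x = np.linspace(0.0, 1.0, N + 1)
    X, Y = np.meshgrid(x, x, indexing="ij")
    u = np.sin(np.pi * X) * np.sin(np.pi * Y)
    exact = 2.0 * np.pi**2 * u[1:-1, 1:-1]      # -Delta u on the interior
    return np.max(np.abs(apply_lambda_h(u, h) - exact))
```

Halving h should reduce the error by a factor of about four, in agreement with the second-order statement above.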
Note that if we require the grid functions of Φ_h to satisfy the condition φ^h|_{∂Ω_h} = 0,
then for such functions the inner products in Φ_h and F_h (and therefore the generated
norms) will coincide. Further, using the following identities, analogous to the first
and the second Green formulae,

  − Σ_{k=1}^{N_x−1} (Δ_x∇_x φ^h)_{k,l} ψ_{k,l} = Σ_{k=1}^{N_x} (∇_x φ^h)_{k,l} (∇_x ψ^h)_{k,l},

and similar identities for the sums over the index l, it is not difficult to show that

  (Λ_h φ^h, ψ^h)_{F_h} = (φ^h, Λ_h ψ^h)_{F_h},

  (Λ_h φ^h, φ^h)_{F_h} = h² Σ_{k=1}^{N_x} Σ_{l=1}^{N_y} [((∇_x φ^h)_{k,l})² + ((∇_y φ^h)_{k,l})²] > 0  for φ^h ≠ 0,
where Λ_h, f^h and φ^h are functions of time t. Further we shall omit the index h in
problem (1.10) assuming that we deal with the difference analogue of the original
problem of mathematical physics in spatial variables.
Equation (1.10) presents a system of ordinary differential equations for the
components of the vector φ^h.
Consider the following Cauchy problem:

  dφ/dt + Aφ = f,
  φ = g  for t = 0.     (1.11)
Suppose that the operator A does not depend on time. Consider the simplest
methods of approximation to problem (1.11) in time. Currently the difference
schemes of the first and second order of approximation in t are used in most cases.
First, consider the simplest explicit scheme of the first order of approximation on
the grid Ω_τ:

  (φ^{j+1} − φ^j)/τ + Aφ^j = f^j,  φ^0 = g,     (1.12)

where τ = t_{j+1} − t_j and f^j is some projection of the function f. For the sake of
simplicity we may now take f^j = f(t_j).
If we consider the simplest implicit scheme, then we have

  (φ^{j+1} − φ^j)/τ + Aφ^{j+1} = f^j,  φ^0 = g,     (1.13)

and we choose f^j as f(t_{j+1}). The schemes (1.12) and (1.13) are first-order
approximations in time. This is easily checked using an expansion into Taylor series in
time and assuming, for example, the existence of bounded second-order derivatives
of the solution in time.
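The two schemes can be sketched as one time step each; here A stands for any matrix approximation of the spatial operator, and the helper names are ours, not the text's:

```python
import numpy as np

def explicit_step(phi, A, f, tau):
    """One step of the explicit scheme (1.12):
    phi^{j+1} = phi^j - tau*A phi^j + tau*f^j."""
    return phi - tau * (A @ phi) + tau * f

def implicit_step(phi, A, f, tau):
    """One step of the implicit scheme (1.13): solve
    (E + tau*A) phi^{j+1} = phi^j + tau*f^j."""
    n = len(phi)
    return np.linalg.solve(np.eye(n) + tau * A, phi + tau * f)
```

The explicit step costs only a matrix-vector product, while the implicit step requires solving a linear system at every step; the splitting methods of this work aim at making that solve cheap.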
Resolving (1.12) and (1.13) with respect to pi + we come to the recursive relation
where f^j = f(t_{j+1/2}). Scheme (1.15) can also be represented in the form (1.14) with

  T = (E + ½τA)^{-1}(E − ½τA),  S = (E + ½τA)^{-1}.
In some cases it is convenient to write the difference equations (1.12), (1.13) and
(1.15) as a system of two equations, the first approximating, in Ω_h, the equation itself,
and the second approximating the boundary condition on ∂Ω_h. In this case the
difference analogue of problem (1.9) has the form

  L^{hτ} φ^{hτ} = f^{hτ}  in Ω_{hτ},     (1.16)
  l^{hτ} φ^{hτ} = g^{hτ}  on ∂Ω_{hτ},

where

  Ω_{hτ} = Ω_h × Ω_τ,  ∂Ω_{hτ} = Ω_h × {0} ∪ ∂Ω_h × Ω̄_τ,
Thus, the evolutionary equation with boundary conditions and initial data can be
reduced to problem (1.18) of linear algebra.
In particular a boundary value problem of elliptic type, an integral equation, etc.,
may be reduced to equation (1.18). Here the approximation condition can again be
written in the form (1.17) with the only approximation index h, which is the
maximum of the set {Δx_i} of steps in the spatial variables.
Consider the problem

  (φ^{j+1}_{k,l} − φ^j_{k,l})/τ + (Λ_h φ^j)_{k,l} = f^j_{k,l}  in Ω_h × Ω_τ,
  φ^j_{k,l} = 0  on ∂Ω_h × Ω̄_τ,     (1.25)
  φ^0_{k,l} = g_{k,l}  in Ω_h × {0}.
The recursive relation (1.24) can be written as

  φ^{j+1} = Tφ^j + τf^j,     (1.26)

where φ^j is the grid function with the definition domain Ω_h + ∂Ω_h and
We estimate the norm of the operator T. To that end, we want to find its largest
eigenvalue:

  Tu = λ(T)u  in Ω_h,     (1.27)
  u = 0  on ∂Ω_h.

The following relation is obviously valid:

  λ(T) = 1 − τλ(Λ_h).

Noting that the orthonormal eigenvectors {u^{mp}} of the operator Λ_h have the
components

  u^{mp}_{k,l} = 2 sin mπkh sin pπlh,
  m = 1, 2, ..., N_x − 1,  p = 1, 2, ..., N_y − 1,
where f^j_{k,l} and g_{k,l} are defined by (1.22) and (1.23), respectively. In this case (1.20) can
no longer be solved explicitly and we come to the operator equation

  ((E + τΛ_h)φ^{j+1})_{k,l} = φ^j_{k,l} + τf^j_{k,l}  in Ω_h × Ω_τ,     (1.30)

whose solution must satisfy the following conditions:

  φ^j_{k,l} = 0  on ∂Ω_h × Ω̄_τ,
  φ^0_{k,l} = g_{k,l}  in Ω_h × {0}.     (1.31)

The norm of the step operator T = (E + τΛ_h)^{-1} is

  ‖T‖ = max{ 1/(1 + (8τ/h²) cos² ½πh), 1/(1 + (8τ/h²) sin² ½πh) }     (1.33)

and therefore ‖T‖ < 1 for any τ and h.
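This bound is easy to check from the eigenvalues of the five-point operator; the following sketch (ours) assumes the unit square with h = 1/N:

```python
import numpy as np

def step_operator_norm(N, tau):
    """Spectral norm of T = (E + tau*Lambda_h)^{-1} for the five-point
    Laplacian on the (N-1) x (N-1) interior grid, h = 1/N, computed
    from the known eigenvalues of Lambda_h."""
    h = 1.0 / N
    m = np.arange(1, N)
    # eigenvalues of the 1-D second-difference operator
    lam1d = (4.0 / h**2) * np.sin(np.pi * m * h / 2.0) ** 2
    lam2d = lam1d[:, None] + lam1d[None, :]   # eigenvalues of Lambda_h
    return np.max(1.0 / (1.0 + tau * lam2d))
```

The largest value 1/(1 + τλ_min) reproduces the closed-form expression with the sin² ½πh term, and it stays below 1 for every τ and h.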
Finally, we consider the approximation in the Crank-Nicolson scheme. Define
the operators and functions in problem (1.20) as follows:

  (L^{hτ}φ^{hτ})^j_{k,l} = (φ^{j+1}_{k,l} − φ^j_{k,l})/τ + ½(Λ_h(φ^j + φ^{j+1}))_{k,l},     (1.34)

  f^j_{k,l} = (1/h²) ∫_{x_{k−1/2}}^{x_{k+1/2}} ∫_{y_{l−1/2}}^{y_{l+1/2}} f(x, y, t_{j+1/2}) dx dy,     (1.35)

with the conditions

  φ^j_{k,l} = 0  on ∂Ω_h × Ω̄_τ,
  φ^0_{k,l} = g_{k,l}  in Ω_h × {0}.     (1.37)

In this case (1.36) can be formally solved with respect to the unknown φ^{j+1} in the
form

  φ^{j+1}_{k,l} = (Tφ^j)_{k,l} + τ(Sf^j)_{k,l},     (1.38)

where

  T = (E + ½τΛ_h)^{-1}(E − ½τΛ_h),  S = (E + ½τΛ_h)^{-1}.
The norm of the step operator is

  ‖T‖ = max{ |1 − (4τ/h²) cos² ½πh| / (1 + (4τ/h²) cos² ½πh),
             |1 − (4τ/h²) sin² ½πh| / (1 + (4τ/h²) sin² ½πh) }.     (1.39)

Hence ‖T‖ < 1.
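A matrix-based check of (1.39) (this sketch, its names, and the Kronecker-sum construction are ours): build Λ_h for the unit square and compare the computed spectral norm of the Crank-Nicolson step operator with the closed-form expression:

```python
import numpy as np

def cn_step_norm(N, tau):
    """Spectral norm of T = (E + tau/2 Lambda_h)^{-1}(E - tau/2 Lambda_h),
    with Lambda_h the five-point Laplacian on the unit square, h = 1/N,
    assembled as a Kronecker sum of 1-D second-difference matrices."""
    h = 1.0 / N
    n = N - 1
    L1 = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
          - np.diag(np.ones(n - 1), -1)) / h**2
    Lh = np.kron(L1, np.eye(n)) + np.kron(np.eye(n), L1)
    E = np.eye(n * n)
    T = np.linalg.solve(E + 0.5 * tau * Lh, E - 0.5 * tau * Lh)
    return np.linalg.norm(T, 2)
```

The computed norm agrees with (1.39) and remains below 1 however large τ is taken, which is the unconditional stability exploited in Section 2.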
2. Stability
Consider the next important concept of the theory of finite difference methods:
stability.
To clarify basic definitions and concepts of the theory of stability consider first the
explicit difference scheme (1.12)
  φ^{j+1} = (E − τA)φ^j + τf^j,  φ^0 = g.     (2.1)

The solution φ^j is found for 0 ≤ jτ ≤ T.
Assume that the operator A is positive and has a complete system of eigen-
functions {u_n} and a set of eigenvalues {λ_n > 0} corresponding to the spectral
problem

  Au = λu.
We introduce the following Fourier series:
where
u*_n are the eigenfunctions of the adjoint spectral problem. Substituting (2.2) into (2.1)
and taking the inner product of the result with the vectors u*_n, we obtain the following
expression for the Fourier coefficients:

  φ^{j+1}_n = (1 − τλ_n)φ^j_n + τf^j_n.     (2.3)
Since φ^0_n = g_n, we obtain from (2.3)

  φ^j_n = r_n^j g_n + τ Σ_{i=1}^{j} r_n^{j−i} f^{i−1}_n,     (2.5)

where

  r_n = 1 − τλ_n.     (2.6)
From (2.5) it follows that for τ > 0

  |φ^j_n| ≤ |r_n|^j |g_n| + τ Σ_{i=1}^{j} |r_n|^{j−i} |f^{i−1}_n|.
We strengthen the latter inequality by replacing |f^{i−1}_n| under the summation symbol
by f̄_n = max_j |f^j_n|, and we obtain

  |φ^j_n| ≤ |r_n|^j |g_n| + τ ((1 − |r_n|^j)/(1 − |r_n|)) f̄_n.     (2.7)
Von Neumann has introduced the so-called spectral criterion of stability, the
essence of which is as follows. If for each Fourier coefficient φ^j_n of (2.2) the following
relation holds,

  |φ^j_n| ≤ C_{1n}|g_n| + C_{2n}|f̄_n|,  n = 1, 2, ...,     (2.8)

where C_{1n}, C_{2n} are constants bounded uniformly for 0 ≤ jτ ≤ T, then the difference
scheme (2.1) is computationally stable. Consider conditions on the parameters of the
difference scheme (1.12) sufficient for relation (2.8) to be valid. An analysis of (2.7)
shows that the stability criterion (2.8) is satisfied if the following restriction is
imposed on the parameter r_n:

  |r_n| ≤ 1,  n = 1, 2, ....     (2.9)
Assume that the spectrum of the operator A is contained in the interval

  0 < α(A) ≤ λ(A) ≤ β(A).

Then, according to (2.6), if

  τ ≤ 2/β(A),     (2.10)

then relation (2.9) will hold. Thus inequality (2.10) becomes a constructive stability
condition for the difference scheme (2.1). Note that condition (2.10) is a sufficient
stability condition; the scheme remains stable, for instance, when τ = 2/β(A).
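The condition τ ≤ 2/β(A) can be tested directly on the eigenvalues of A; a small sketch (the function name and the diagonal test matrix are ours):

```python
import numpy as np

def explicit_is_stable(A, tau):
    """Von Neumann check for the explicit scheme (2.1): verify that
    |1 - tau*lambda_n| <= 1 for every eigenvalue of a symmetric
    positive matrix A (a small tolerance absorbs round-off)."""
    lam = np.linalg.eigvalsh(A)
    return bool(np.all(np.abs(1.0 - tau * lam) <= 1.0 + 1e-12))
```

For A with eigenvalues 1 and 4, the bound gives the threshold τ = 2/4 = 0.5: the scheme is stable at the threshold and unstable just beyond it.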
Similarly, for the implicit schemes (1.13) and (1.15) we obtain

  |φ^j_n| ≤ |r_n|^j |g_n| + τ Σ_{i=1}^{j} |r_n|^{j−i} |f^{i−1}_n|,     (2.13)

where, respectively,

  r_n = 1/(1 + τλ_n(A)),  r_n = (1 − ½τλ_n(A))/(1 + ½τλ_n(A)).

Therefore |r_n| < 1 for any τ > 0 provided that λ_n(A) > 0.
It should be noted that, firstly, the stability in von Neumann's sense is based on
the spectral analysis of the operator of the problem. This means that for this
approach the computation of the largest eigenvalue, or the estimation of its upper
bound, is a necessary part of the algorithm. Secondly, the spectral stability criterion
ascertains stability of the solution with respect to each of the harmonics of the
Fourier series, but it says nothing about the stability of the solution in terms of
energy norms. At the same time, the norm of the solution φ^j often happens to be the
only available characteristic of the problem's solution. All of this led researchers to
propose other definitions of stability related to the norms of the
operators. It should be emphasized, however, that stability analysis of von
Neumann type continues to play a prominent role in applications.
We now come to a more general definition of the concept of computational
stability. To that end we consider the problem
  ∂φ/∂t + Aφ = f  in Ω × Ω_t,     (2.14)
  φ = g  for t = 0,
We shall say that the difference scheme (2.15) is stable if, for any parameter
h characterizing the difference approximation and for jτ ≤ T, the following relation
holds:

¹This assumption is made for the sake of simplicity. Otherwise, instead of the one norm
‖T‖_{F_h→F_h} = sup_{φ∈F_h} (‖Tφ‖_{F_h}/‖φ‖_{F_h}) we should introduce two norms: ‖T‖_{F_h→F_h} and
‖T‖_{G_h→F_h} = sup_{g∈G_h} (‖Tg‖_{F_h}/‖g‖_{G_h}).
then
If we assume that

  ‖T‖ ≤ 1,     (2.22)

then scheme (1.12) will be stable in the sense of definition (2.16). Naturally, (2.22) is
a sufficient condition for stability. More sophisticated conditions could be obtained
using the norms of the powers of the step operator, ‖T^i‖ (i = 1, 2, ..., j). Weakening the
condition, however, brings additional difficulties into the constructive procedure of
ascertaining the stability criterion. As a rule, the sufficient condition (2.22) is used in
practice.
The stability of the implicit difference equations (1.13) and (1.15) may be
considered in a similar way. In these cases we obtain stability in the sense
of definition (2.16) provided that A ≥ 0; this follows from the form of the
corresponding step operators
and from the following theorem (which is very useful in the stability analysis of many
schemes).
THEOREM 2.1 (KELLOGG [1963]). If the operator A acting in a real Hilbert space is
positive-semidefinite and the numerical parameter σ is nonnegative, then

  ‖(E − σA)(E + σA)^{-1}‖ ≤ 1.     (2.24)
Indeed,

  ‖(E − σA)(E + σA)^{-1}‖² = sup_ψ [ (ψ, ψ) − 2σ(Aψ, ψ) + σ²(Aψ, Aψ) ] / [ (ψ, ψ) + 2σ(Aψ, ψ) + σ²(Aψ, Aψ) ] ≤ 1,

since (Aψ, ψ) ≥ 0.
REMARK 2.1. In the case when A > 0 and σ > 0 we would have, instead of (2.24), an
inequality of the kind

  ‖(E − σA)(E + σA)^{-1}‖ < 1.     (2.25)
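Kellogg's inequality (2.24) is easy to verify numerically; the sketch below (ours) assumes a symmetric positive-semidefinite A, one admissible case of the theorem:

```python
import numpy as np

def kellogg_norm(A, sigma):
    """Spectral norm of (E - sigma*A)(E + sigma*A)^{-1}."""
    n = A.shape[0]
    E = np.eye(n)
    return np.linalg.norm((E - sigma * A) @ np.linalg.inv(E + sigma * A), 2)
```

For a random A = B Bᵀ the computed norm never exceeds 1, for any nonnegative σ, in line with the theorem.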
Let us briefly discuss the limit transitions. In solving the difference analogues of the
evolutionary problems of mathematical physics we have to consider approxi-
mations in time with step z as well as in space with characteristic step h. This means
that the transition operator T= T(z, h) depends both on z and on h.
The construction of a stable algorithm for a given approximation method is
usually reduced to establishing the relation between τ and h which ensures the
computational stability. If the difference scheme happens to be stable for any T > 0
and h >0, then it is declared absolutely stable. And if the scheme happens to be stable
only in the case where there is a certain relation between τ and h, then such a scheme
is called conditionally stable.
Assume that τ and h are related according to the inequality

  τ ≤ C hᵖ,     (2.26)
where C and p are given constants independent of τ and h. It is worth noting that
such relations are usually established while considering the amplitudes of the
"shortest" perturbation. As a rule they reflect the connection between the minimal
spatial and time scales of the phenomena simulated by the difference scheme.
Of course, large perturbations (say, of the order of several h) will then be described
more precisely.
Suppose that we need to increase the accuracy of the solution formally by
decreasing the grid step h. Then we must decrease simultaneously the time step so
that the above inequality is again satisfied with these new parameters. This means
that we may even allow the limit transition as τ → 0 and h → 0 if we satisfy the
condition (2.26), for example, in the form

  τ/hᵖ = const ≤ C.
Along with the above definitions of the computational stability other definitions
are used in the literature allowing the expansion of the class of difference schemes
which are of interest in applications. For example, the scheme is called stable if
  ‖T‖ ≤ 1 + O(τ).     (2.27)
For small τ such a definition allows for the exponential growth of round-off errors in
time.
Note that if the approximation of the evolutionary equation is studied in the
spaces of grid functions defined on Ω_h × Ω_τ, then it is useful to give the definition of the
stability in terms of the same spaces. Indeed, let the difference problem have the form

  L^{hτ} φ^{hτ} = f^{hτ}  in Ω_h × Ω_τ,     (2.28)
  l^{hτ} φ^{hτ} = g^{hτ}  on ∂Ω_h × Ω_τ.
We introduce the stability criterion in the form:
I1(phr Ih C 1 Ifh
'F + C2 |ghr IG, (2.29)
where C 1 and C2 are constants on the interval 0 < t < T independent of h, T,fh' and
ghT
Assume that the original problem is approximated by the difference equation with
the boundary conditions already taken into account. Then it is convenient to
introduce the stability criterion in the form:
‖φ^{hτ}‖_Φ ≤ C₁‖f^{hτ}‖_F. (2.30)
3. Convergence
We turn now to the formulation of the major result in the theory of the finite
difference algorithms: the convergence theorem. The study of the convergence of the
difference solution to the solution of the original problem for the stationary and
evolutionary problems of mathematical physics is based on the same principles.
This allows us to follow the main idea of the proof on the example of the stationary
problem (1.1), which is approximated by the difference scheme (1.2) (i.e., by the system
of equations approximating both the equation and the boundary condition of (1.1)).
The following convergence theorem is valid (GODUNOV and RYABENKII [1970],
FILIPPOV [1955]).
Similarly,
a_h[(φ)_h − φ_h] = a_h(φ)_h − g_h.
Owing to the stability and to the approximation properties of the relations
Λ_h[(φ)_h − φ_h] = Λ_h(φ)_h − f_h,  a_h[(φ)_h − φ_h] = a_h(φ)_h − g_h, (3.3)
it follows that
‖(φ)_h − φ_h‖_Φ ≤ C₁‖Λ_h(φ)_h − f_h‖_F + C₂‖a_h(φ)_h − g_h‖_G
≤ C₁M₁h^{n₁} + C₂M₂h^{n₂} ≤ (C₁M₁ + C₂M₂)h^n, n = min(n₁, n₂).
While obtaining the last inequality we may suppose without loss of generality that
h ≤ 1.
In the case of the evolutionary problem consider
δf^{hτ} ≡ L^{hτ}[(φ)^{hτ} − φ^{hτ}] = L^{hτ}(φ)^{hτ} − f^{hτ},
δg^{hτ} ≡ l^{hτ}[(φ)^{hτ} − φ^{hτ}] = l^{hτ}(φ)^{hτ} − g^{hτ}. (3.4)
From (3.4) and from the stability condition (2.29) we have
‖(φ)^{hτ} − φ^{hτ}‖_Φ ≤ C₁‖δf^{hτ}‖_F + C₂‖δg^{hτ}‖_G,
or, taking (1.17) into account,
‖(φ)^{hτ} − φ^{hτ}‖_Φ ≤ K₁h^k + K₂τ^p, (3.5)
where
K₁ = C₁M₁ + C₂M₂, K₂ = C₁N₁ + C₂N₂.
Estimate (3.5) proves the convergence of the difference solution to the exact one
and gives a clear picture of the convergence with respect to the spatial grid
step h as well as to the time step τ.
The assumptions of the theorem include the rather rigid requirement that C₁ and
C₂ be independent of h and τ. The condition that C₁ and C₂ must be independent of
h is especially unpleasant since in some cases C₁ and C₂ may tend to infinity as h → 0.
Let
C₁ = C̄₁/h^m, C₂ = C̄₂/h^m,
where m ≥ 0. The rate of convergence of the approximate solution to the exact one
will then be estimated as follows:
‖(φ)^{hτ} − φ^{hτ}‖_Φ ≤ M h^{k−m} + N τ^p h^{−m}.
If k > m and τ^p h^{−m} → 0 as τ → 0, h → 0, then the convergence takes place. Of course,
the convergence theorem can be formulated even in the case when C₁ and C₂ depend
both on h and τ.
4. The Crank–Nicolson scheme

The Crank–Nicolson scheme will play a significant role in the next chapters, so we
shall consider this scheme in more detail.
Consider the evolutionary equation
∂φ/∂t + Aφ = f in Ω × Ω_t,
φ = g in Ω for t = 0, (4.1)
where A ≥ 0 (i.e. (Aφ, φ) ≥ 0) and the solution φ is sufficiently smooth. We shall
assume that the solution satisfies certain boundary conditions on ∂Ω. We shall
assume as well that (4.1) already presents the finite difference approximation to the
corresponding original evolutionary problem in all variables except t (i.e. A is
a matrix, φ is a grid function depending on t, etc.; as we specified above, the index
"h" of A, φ, Ω and other symbols will often be omitted for the sake of simplicity).
Suppose first that A does not depend on t. Then it is easy to check that the pair of
difference equations
(φ^{j+1/2} − φ^j)/(½τ) + Aφ^j = 0, (4.2)
(φ^{j+1} − φ^{j+1/2})/(½τ) + Aφ^{j+1} = 0 (4.3)
may be considered. Excluding the unknowns φ^{j+1/2} from this system of difference
equations we come to the Crank–Nicolson scheme.
Assume that the operator A is independent of time and is approximated in
problem (4.1) by a difference operator, which will be denoted by Λ. Then we will
deal with the problem in linear algebra of the form
(φ^{j+1} − φ^j)/τ + Λ(φ^{j+1} + φ^j)/2 = 0, φ⁰ = g, (4.9)
whose solution may be written as φ^{j+1} = T^jφ^j with
T^j = (E + ½τΛ)^{-1}(E − ½τΛ).
If Λ is positive-semidefinite, then
‖T^j‖ ≤ 1 (4.10)
and, therefore,
‖φ^{j+1}‖ ≤ ‖φ^j‖, (4.11)
i.e., the scheme is stable.
If the operator Λ is skew-symmetric, i.e., the following equality holds:
(Λφ, φ) = 0,
then we have instead of (4.10) the equality
‖φ^{j+1}‖ = ‖φ^j‖, (4.12)
and so here
‖T^j‖ = 1. (4.13)
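A small numerical check of these bounds may be useful (an illustration added here, with randomly generated operators that are assumptions of the example): for the Crank–Nicolson transition operator T = (E + ½τΛ)^{-1}(E − ½τΛ), one observes ‖T‖ ≤ 1 when Λ is positive-semidefinite, as in (4.10), and exact norm conservation when Λ is skew-symmetric, as in (4.12).

```python
import numpy as np

def cn_transition(L, tau):
    """Crank-Nicolson transition operator T = (E + tau/2 L)^{-1} (E - tau/2 L)."""
    E = np.eye(L.shape[0])
    return np.linalg.solve(E + 0.5 * tau * L, E - 0.5 * tau * L)

rng = np.random.default_rng(0)
tau = 0.1

M = rng.standard_normal((5, 5))
L_pos = M.T @ M                      # positive-semidefinite operator
print(np.linalg.norm(cn_transition(L_pos, tau), 2))  # <= 1, cf. (4.10)

S = rng.standard_normal((5, 5))
L_skew = S - S.T                     # skew-symmetric: (L phi, phi) = 0
T = cn_transition(L_skew, tau)
phi = rng.standard_normal(5)
print(np.linalg.norm(T @ phi) / np.linalg.norm(phi))  # = 1 up to round-off, cf. (4.12)
```

In the skew-symmetric case T is a Cayley transform and hence orthogonal, which is the discrete counterpart of the conservation expressed by (4.12).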
We now consider the approximation with the Crank–Nicolson scheme in the case
where the operator A is time-dependent. We define the operator L by the equality
Lφ ≡ ∂φ/∂t + Aφ (4.14)
and the operator L_τ by the equality
(L_τφ)^j = [(φ)^{j+1} − (φ)^j]/τ + Λ^j[(φ)^{j+1} + (φ)^j]/2, (4.15)
where (φ)^j is the projection of the function φ on the grid Ω_τ. Further, we introduce the
norm
‖L_τφ‖_C = max_j ‖(L_τφ)^j‖, (4.16)
where ‖·‖ is some norm in the space to which (L_τφ)^j belongs. In order to estimate the
norm (4.16) we expand the solution of the original equation (4.1) into a Taylor series:
(φ)^{j+1} = (φ)^j + τ(φ_t)^j + ½τ²(φ_tt)^j + ⋯ (4.17)
Taking into account the obvious relations
φ_t = −Aφ, φ_tt = −A_tφ + A²φ, (4.18)
where A_t = ∂A/∂t, we transform the Taylor series (4.17) into the form
(φ)^{j+1} = (φ)^j − τA^j(φ)^j + ½τ²[(A^j)²(φ)^j − A_t^j(φ)^j] + ⋯ (4.19)
Substituting (4.19) into (4.16) and taking into account (4.15) we obtain
(f − L_τφ)^j = (A^j − Λ^j)(φ)^j + ½τ[A_t^j + Λ^jA^j − (A^j)²](φ)^j + O(τ²), (4.20)
where f^j is the right-hand side of (4.4) which, in this case, equals zero.
If we choose as the approximation of the operator A the operator
Λ^j = A^j ≡ A(t_j), (4.21)
then it follows from (4.20) that
‖f − L_τφ‖_C = O(τ),
and we have first-order approximation. Note that in the special case where A is
independent of t the approximation in the form (4.21) ensures second-order accuracy
in τ.
Suppose now that the approximating operator has been chosen in the form
Λ^j = A^j + ½τA_t^j. (4.22)
In this case we have
‖f − L_τφ‖_C = O(τ²).
Note that the approximation with the Crank–Nicolson scheme will also have
second-order accuracy in τ if the operator Λ is chosen in the form
Λ^j = A^{j+1/2} (4.23)
or
Λ^j = ½(A^j + A^{j+1}). (4.24)
The three forms (4.22), (4.23) and (4.24) of the approximation of the operator
A ensuring second-order accuracy are used in various applications, in particular in
numerically solving quasi-linear equations.
We now consider the inhomogeneous equation
∂φ/∂t + Aφ = f in Ω × Ω_t,
φ = g in Ω for t = 0. (4.25)
The difference approximation to problem (4.25) based on the Crank–Nicolson
scheme has, under the assumptions formulated above, the form
(φ^{j+1} − φ^j)/τ + A(φ^{j+1} + φ^j)/2 = f^j, (4.26)
where
f^j = f(t_{j+1/2}).
It is not difficult to check that the difference problem (4.26) approximates (4.25)
with second-order accuracy in τ. Write the solution of problem (4.26) in the form
φ^{j+1} = T^jφ^j + τ(E + ½τA^j)^{-1}f^j. (4.27)
From (4.27) it follows, since A ≥ 0 and hence ‖T^j‖ ≤ 1 and ‖(E + ½τA^j)^{-1}‖ ≤ 1, that
‖φ^{j+1}‖ ≤ ‖φ^j‖ + τ‖f^j‖.
Consider the problem
∂φ/∂t + Aφ = f in Ω × Ω_t,
φ = g in Ω for t = 0
with a positive-semidefinite operator A (i.e., A ≥ 0, or (Aφ, φ) ≥ 0). We suppose here
that the approximation to the original problem in all variables except t has already
been carried out, Ω is a grid domain, A is a matrix, φ, f and g are grid functions (see
Section 1). For the sake of simplicity the index h of the grid parameter is omitted. We
suppose that the solution of the problem satisfies the given boundary conditions
on ∂Ω, and A and f are constructed with this fact taken into account. Let also the
spaces Φ and F of grid functions coincide and, if not otherwise specified, have the
same norm ‖φ‖ = (φ, φ)^{1/2}.
This problem may also be written in the form
∂φ/∂t + Aφ = f in Ω_t,
φ = g for t = 0,
implying that the first equation is considered in Ω × Ω_t and the second in Ω × {0}. Let
us agree to use mostly the latter form.
Note that if it is desirable we may consider the above equations as the equations
of the original problem with no preliminary approximation, and that the algorithms
formulated below are applied directly to this problem. However, the description
of such algorithms should often be considered as formal and their theoretical
grounding in such cases is far more complicated.
CHAPTER II
Componentwise Splitting
(Fractional Steps) Methods
The splitting (fractional steps) methods are based on the idea of splitting a complex
operator into a sequence of the simplest ones and, as a result, reducing the integration
of the initial equation to the successive integration of equations of a simpler
structure. Since the splitting schemes must satisfy the conditions of approximation
and stability only as a whole, we have the possibility of flexibly constructing
schemes for essentially all basic equations of mathematical physics.
(φ^{j+1/n} − φ^j)/τ + A₁φ^{j+1/n} = 0,
⋮
(φ^{j+1} − φ^{j+(n−1)/n})/τ + A_nφ^{j+1} = 0, (5.3)
j = 0, 1, …,
φ⁰ = g.
This algorithm is absolutely stable, and the system (5.3) approximates problem (5.1)
with first-order accuracy in τ (MARCHUK [1980], p. 270).
REMARK 5.1. Let us agree that, unless otherwise noted, the approximation of the
schemes is studied on sufficiently smooth solutions.
For the inhomogeneous problem
∂φ/∂t + Aφ = f in Ω_t, (5.4)
φ = g for t = 0,
the corresponding splitting scheme is
(φ^{j+α/n} − φ^{j+(α−1)/n})/τ + A_αφ^{j+α/n} = f_α^j, α = 1, 2, …, n, (5.5)
j = 0, 1, …,
φ⁰ = g,
where the f_α are chosen so that Σ_{α=1}^n f_α = f.
This scheme is absolutely stable and has first-order accuracy in τ, and the
corresponding stability estimate holds.
The realization of algorithms (5.3) and (5.5) consists of successively solving the
equations of (5.3) and (5.5). And if A is split into A_α (α = 1, …, n) so that it is easy to
invert the operators (E + τA_α) (for example, if the A_α and (E + τA_α) are tridiagonal or
triangular matrices), then it is easy to find an approximate solution φ^{j+1} of the
problem corresponding to t = t_{j+1}.
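The realization just described can be sketched as follows (an illustration added here; the diagonal operators and step sizes are invented so that the exact solution exp(−t(A₁+A₂))g is available for comparison): one time step of (5.3) is the successive solution of (E + τA_α)φ^{j+α/n} = φ^{j+(α−1)/n}.

```python
import numpy as np

def split_step(phi, ops, tau):
    """One step of the componentwise splitting scheme (5.3):
    successively solve (E + tau*A_alpha) phi_new = phi_old."""
    E = np.eye(len(phi))
    for A in ops:
        phi = np.linalg.solve(E + tau * A, phi)
    return phi

# Commuting positive-semidefinite parts (diagonal for simplicity).
D1 = np.diag([1.0, 2.0, 3.0, 4.0])
D2 = np.diag([0.5, 0.5, 1.0, 1.0])
tau, g = 0.01, np.ones(4)

phi = g
for j in range(100):                    # integrate to t = 1
    phi = split_step(phi, [D1, D2], tau)

exact = np.exp(-np.diag(D1 + D2)) * g   # exact solution at t = 1
print(np.max(np.abs(phi - exact)))      # small O(tau) error
```

Each substep only requires inverting E + τA_α, which is cheap when the A_α are tridiagonal or triangular, exactly as noted above.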
Algorithms (5.3) and (5.5) permit obvious generalizations to the case where A is
time-dependent. Then, in the cycle of computations of the splitting scheme, instead
of A_α a suitable approximation A_α^j of A_α should be taken on each interval t_j ≤ t ≤ t_{j+1}.
We suppose that the operator A = A(t) in (5.1) is representable as the sum of two
positive-semidefinite matrices:
A(t) = A₁(t) + A₂(t), A₁(t) ≥ 0, A₂(t) ≥ 0, (6.1)
whose elements are sufficiently smooth functions of t. We consider the approximation of
these matrices on the interval t_j ≤ t ≤ t_{j+1} in the form
A_α^j = A_α(t_{j+1/2}), α = 1, 2. (6.2)
The componentwise splitting scheme for problem (5.1) (MARCHUK [1980], p. 256) has
the form
(φ^{j+1/2} − φ^j)/τ + A₁^j(φ^{j+1/2} + φ^j)/2 = 0,
(φ^{j+1} − φ^{j+1/2})/τ + A₂^j(φ^{j+1} + φ^{j+1/2})/2 = 0, (6.3)
j = 0, 1, …,
φ⁰ = g.
By excluding the auxiliary functions φ^{j+1/2}, the system of difference equations (6.3) may
be reduced to one equation,
φ^{j+1} = T^jφ^j, (6.4)
where
T^j = (E + ½τA₂^j)^{-1}(E − ½τA₂^j)(E + ½τA₁^j)^{-1}(E − ½τA₁^j). (6.5)
Assume that
½τ‖A_α^j‖ ≤ 1, α = 1, 2. (6.6)
Then scheme (6.3) is absolutely stable. It has a second-order approximation in τ if
A₁^j and A₂^j commute, and a first-order approximation in τ if they do not.
Indeed, expanding the operator T^j into a power series in τ we obtain
T^j = E − τA^j + ½τ²[(A₁^j)² + 2A₂^jA₁^j + (A₂^j)²] − ⋯
If the operators A_α^j commute, that is if A₁^jA₂^j = A₂^jA₁^j, then this expansion may
be written as
T^j = E − τA^j + ½τ²(A^j)² − ⋯
Then (6.4) may be represented as follows:
φ^{j+1} = φ^j − τA^jφ^j + ½τ²(A^j)²φ^j − ⋯ (6.7)
or
(φ^{j+1} − φ^j)/τ + A^jφ^j − ½τ(A^j)²φ^j + ⋯ = 0.
Substituting the expression for A^jφ^j from (6.7) into the last equality we have
(φ^{j+1} − φ^j)/τ + A^j(φ^{j+1} + φ^j)/2 + O(τ²) = 0.
Comparing this scheme with the Crank–Nicolson scheme (1.15) we conclude that
the order of approximation of (6.3) differs from the order of approximation of (1.15)
by a quantity O(τ²). Hence, the scheme (6.3) has a second-order approximation in τ.
The realization of scheme (6.3) may be written in the form
φ^{j+1/4} = (E − ½τA₁^j)φ^j,
(E + ½τA₁^j)φ^{j+1/2} = φ^{j+1/4},
φ^{j+3/4} = (E − ½τA₂^j)φ^{j+1/2},
(E + ½τA₂^j)φ^{j+1} = φ^{j+3/4}, (6.8)
j = 0, 1, …,
φ⁰ = g.
If A is split into A₁ and A₂ in such a way that the equations with the matrices
(E + ½τA_α^j) can be solved efficiently, then the realization of the whole algorithm will
also be efficient.
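The four-stage realization (6.8) may be sketched as follows (an added illustration; the commuting operators A₁ = 0.3A and A₂ = 0.7A are assumptions chosen so that second-order accuracy in τ is expected, and the exact solution is computed by an eigendecomposition):

```python
import numpy as np

def step_68(phi, A1, A2, tau):
    """One whole step phi^j -> phi^{j+1} by the realization (6.8) of scheme (6.3)."""
    E = np.eye(len(phi))
    phi = (E - 0.5 * tau * A1) @ phi                  # phi^{j+1/4}
    phi = np.linalg.solve(E + 0.5 * tau * A1, phi)    # phi^{j+1/2}
    phi = (E - 0.5 * tau * A2) @ phi                  # phi^{j+3/4}
    return np.linalg.solve(E + 0.5 * tau * A2, phi)   # phi^{j+1}

def error_at_t1(nsteps, A1, A2, g):
    """Max error against exp(-(A1+A2)) g after integrating to t = 1."""
    tau, phi = 1.0 / nsteps, g.copy()
    for _ in range(nsteps):
        phi = step_68(phi, A1, A2, tau)
    w, V = np.linalg.eigh(A1 + A2)
    return np.max(np.abs(phi - V @ (np.exp(-w) * (V.T @ g))))

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M.T @ M                                           # symmetric, A >= 0
A1, A2 = 0.3 * A, 0.7 * A                             # commuting parts
g = rng.standard_normal(5)
print(error_at_t1(20, A1, A2, g) / error_at_t1(40, A1, A2, g))
```

Halving τ reduces the error by roughly a factor of four, in agreement with the second-order approximation established above for commuting operators.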
REMARK 6.1. To formulate scheme (6.3) with a varying time step it is enough to
substitute τ_j = t_{j+1} − t_j for τ in (6.3) and (6.8).
A = Σ_{α=1}^n A_α, A_α = Σ_{β=1}^m A_{αβ}. (7.2)
REMARK 8.1. The question arises: Is it worthwhile to split first the operator A into the A_α
and then in turn the operators A_α into the A_{αβ}? Is it not easier to represent the operator
A as a set of operators A_{αβ} right away? In this context it should be remarked that
though these two approaches seem equivalent, in many cases it is more convenient
first to decompose a complex problem of mathematical physics into simpler ones
which further can be independently reduced to elementary problems.
Taking into account (8.5) we state for system (8.1) the following splitting scheme:
(φ^{j+(β+(α−1)m)/(mn)} − φ^{j+(β−1+(α−1)m)/(mn)})/τ
+ A_{αβ}[φ^{j+(β+(α−1)m)/(mn)} + φ^{j+(β−1+(α−1)m)/(mn)}]/2 = 0, (8.6)
α = 1, 2, …, n, β = 1, 2, …, m,
where
A(t) = Σ_{α=1}^n A_α(t).
(φ^{j+1/n} − φ^j)/τ + A_{10}φ^j + A_{11}φ^{j+1/n} = F₁,
(φ^{j+2/n} − φ^{j+1/n})/τ + A_{20}φ^j + A_{21}φ^{j+1/n} + A_{22}φ^{j+2/n} = F₂,
⋮
(φ^{j+1} − φ^{j+(n−1)/n})/τ + A_{n0}φ^j + A_{n1}φ^{j+1/n} + ⋯ + A_{nn}φ^{j+1} = F_n,
where F_α = B_αf.
If the operators A_{αα} and A_{α,α−1} are commutative, then the scheme (9.4)
approximates problem (9.1) and is stable provided that condition (9.5) holds for
α = 1, 2, …, n.
REMARK 9.1. It follows from (9.5) that for the stability of the whole scheme the
stability of each elementary scheme from (9.4) is not necessary at all.
REMARK 9.2. In the case where all operators A_{αβ} and B_α are commutative, the
statements about the approximation of scheme (9.2) and its stability (and therefore
the convergence conditions; see Theorem 3.1 in the Introduction) may be formulated
as well.
REMARK 9.3. In scheme (9.2) the operators A_{αβ} may have an arbitrary structure,
including difference, differential and integro-differential operators. The schemes of
type (9.2) were developed in the works by YANENKO [1960, 1967] and by MARCHUK
and YANENKO [1966], in which the theory and applications of such schemes are
studied in more detail.
REMARK 9.4. If we take in (9.2) A_{αα} = A_α^j as the approximation of the operator A_α(t)
(and A_{αβ} = 0 for α ≠ β), and B_α = 0 for α < n, B_n = E, then we obtain scheme (5.5).
Let the operators A_{αα}^j, A_{α,α−1}^j in scheme (9.2) have the form
A_{αα}^j = σA_α^j,
A_{α,α−1}^j = (1 − σ)A_α^j, (10.1)
0 ≤ σ = const ≤ 1.
With these coefficients scheme (9.2) takes the form:
(φ^{j+1/n} − φ^j)/τ + A₁^j[(1 − σ)φ^j + σφ^{j+1/n}] = F₁,
(φ^{j+2/n} − φ^{j+1/n})/τ + A₂^j[(1 − σ)φ^{j+1/n} + σφ^{j+2/n}] = F₂,
⋮ (10.2)
(φ^{j+1} − φ^{j+(n−1)/n})/τ + A_n^j[(1 − σ)φ^{j+(n−1)/n} + σφ^{j+1}] = F_n,
j = 0, 1, …,
φ⁰ = g.
If we take σ = ½ in (10.2), then in the case where the operators A_α^j are commutative
and A_α^j approximates A_α with the order O(τ²), the whole scheme (10.2) will have
a second-order approximation in τ as well. Besides, if A_α^j ≥ 0 then the scheme will be
absolutely stable.
11. The splitting method for systems which do not belong to the
Cauchy-Kovalevskaya class
The splitting method allows a special formulation for systems which do not belong
to the Cauchy-Kovalevskaya class (irregular systems) (see MARCHUK [1964, 1967],
YANENKO [1967], MARCHUK and YANENKO [1966]). Let
∂φ/∂t + Aφ + Lψ = f (11.1)
and
K₁φ + K₂ψ + g = 0 (11.2)
be an irregular system for which the subsystem
∂φ/∂t + Aφ = F, F = f − Lψ, (11.3)
is regular for given ψ. For integrating (11.1) and (11.2) the following
scheme may be used:
scheme may be used:
(P j+ 11 9 J+An_ +AiJ±1/n"B F,
where the operators A_α^j and B_α were defined in the formulation of (9.2). While
solving system (11.4), on the last fractional step (i.e., the last equation from (11.4)) it is
necessary to use, for definiteness of the algorithm, the difference analogue of
relationship (11.2),
K₁^jφ^{j+1} + K₂^jψ^{j+1} + g^{j+1} = 0, (11.5)
where K_i^j is the approximation of the operator K_i, g^{j+1} approximates g, and F^{j+1} (from
(11.4)) is the approximation of (f − Lψ).
12. Splitting schemes for the heat conduction equation: Local one-dimensional
schemes
We consider some of the splitting schemes that are applied to the heat conduction
equation.
(u, v) = Σ_{k=1} Σ_{l=1} Σ_{p=1} u_{k,l,p} v_{k,l,p} h₁h₂h₃.
Scheme (5.5) applied to problem (12.1) has the form (YANENKO [1969a])
(φ^{j+1/3} − φ^j)/τ + A₁φ^{j+1/3} = 0,
(φ^{j+2/3} − φ^{j+1/3})/τ + A₂φ^{j+2/3} = 0,
(φ^{j+1} − φ^{j+2/3})/τ + A₃φ^{j+1} = 0, (12.3)
φ⁰ = g.
Each equation from (12.3) may be easily solved by the factorization (sweep) method.
Scheme (12.3) is absolutely stable and has a first-order approximation in τ, and
therefore the convergence theorem holds (of the type of Theorem 3.1).
For increasing the accuracy of scheme (12.3) the weighted scheme (10.2) may be
applied (YANENKO [1967], p. 33):
(φ^{j+1/3} − φ^j)/τ + A₁[(1 − σ)φ^j + σφ^{j+1/3}] = 0,
(φ^{j+2/3} − φ^{j+1/3})/τ + A₂[(1 − σ)φ^{j+1/3} + σφ^{j+2/3}] = 0,
(φ^{j+1} − φ^{j+2/3})/τ + A₃[(1 − σ)φ^{j+2/3} + σφ^{j+1}] = 0, (12.4)
j = 0, 1, …,
φ⁰ = g.
With σ = ½ the scheme has a second-order approximation in τ, and the whole order
of accuracy of this scheme is O(τ² + h²), where h = max(h_x, h_y, h_z). It is absolutely
stable. Equations (12.4) again may be solved by the one-dimensional sweep method.
12.2. The splitting scheme for the heat conduction equation in an arbitrary
coordinate system

We consider the problem
∂φ/∂t + Aφ = 0 in Ω × Ω_t,
φ = 0 on ∂Ω, (12.5)
φ = g in Ω for t = 0,
where Ω is a bounded two-dimensional domain and the differential operator A has
the form
A = −Σ_{i,j=1}^2 a_{ij} ∂²/∂x_i∂x_j, a_{ij} = const, (12.6)
a₁₁a₂₂ − a₁₂² > 0.
We define the difference operators
A₁₁ = −a₁₁(Δ_{x₁}∇_{x₁})/h₁² for A₁₁ = −a₁₁ ∂²/∂x₁²,
A₁₂ = A₂₁ = −a₁₂(Δ_{x₁} + ∇_{x₁})(Δ_{x₂} + ∇_{x₂})/(4h₁h₂) for A₁₂ = −a₁₂ ∂²/∂x₁∂x₂,
A₂₂ = −a₂₂(Δ_{x₂}∇_{x₂})/h₂² for A₂₂ = −a₂₂ ∂²/∂x₂²
(taking into account the boundary conditions). We write the following scheme
(YANENKO, SUCHOV and POGODIN [1959]):
(φ^{j+1/2} − φ^j)/τ + A₁₁φ^{j+1/2} + A₁₂φ^j = 0,
(φ^{j+1} − φ^{j+1/2})/τ + A₂₁φ^{j+1/2} + A₂₂φ^{j+1} = 0, (12.7)
which is a special case of the splitting scheme (9.2). It is not difficult to notice that
here in the first fractional step "half" of the operator A, i.e. A₁₁ + A₁₂, is
approximated, where A₁₁ is approximated on the time layer j + ½ and A₁₂ is
approximated on the layer j; in the second fractional step the second "half" of A, i.e.,
A₂₁ + A₂₂, is approximated, where A₂₁ = A₁₂ again is approximated on the layer
j + ½ and A₂₂ is approximated on the layer j + 1. Scheme (12.7) approximates
problem (12.5) and is stable; therefore the convergence theorem holds.
Under stronger conditions than the ellipticity condition, for the three-dimensional
heat conduction equation given by
∂φ/∂t − Σ_{i,j=1}^3 a_{ij} ∂²φ/∂x_i∂x_j = 0, (12.8)
YANENKO [1959b, 1964] suggested the scheme of the form
(φ^{j+1/6} − φ^j)/τ + ½A₁₁φ^{j+1/6} + A₁₂φ^j = 0,
(φ^{j+2/6} − φ^{j+1/6})/τ + A₂₁φ^{j+1/6} + ½A₂₂φ^{j+2/6} = 0,
(φ^{j+3/6} − φ^{j+2/6})/τ + ½A₁₁φ^{j+3/6} + A₁₃φ^{j+2/6} = 0,
(φ^{j+4/6} − φ^{j+3/6})/τ + A₃₁φ^{j+3/6} + ½A₃₃φ^{j+4/6} = 0, (12.9)
(φ^{j+5/6} − φ^{j+4/6})/τ + ½A₂₂φ^{j+5/6} + A₂₃φ^{j+4/6} = 0,
(φ^{j+1} − φ^{j+5/6})/τ + A₃₂φ^{j+5/6} + ½A₃₃φ^{j+1} = 0.
This scheme approximates (12.8) and is stable if the matrix with elements b_{ij} = a_{ij}
for i ≠ j and b_{ii} = ½a_{ii} is positive-definite.
We consider problem (5.1) and system (8.1) which approximates (5.1). If in (8.1) the
operators A_α are one-dimensional differential operators (or their approximations),
then the corresponding difference scheme is also called local one-dimensional
(SAMARSKII [1971], p. 407). Thus, if in (8.3) the operators A_α are one-dimensional, then
this scheme may be called local one-dimensional.
The theory of local one-dimensional schemes for some differential equations is
presented in the works by SAMARSKII [1971] and SAMARSKII and NIKOLAEV [1978].
Here we will consider only one such scheme in application to the following problem
of the heat conduction equation:
∂φ/∂t + Aφ = f in Ω × Ω_t,
φ|_{∂Ω} = φ^{(Γ)}(x, t), (12.10)
φ = g(x) in Ω for t = 0,
where
A = Σ_{α=1}^n A_α, A_α = −∂²/∂x_α².
On each interval t_j ≤ t ≤ t_{j+1} we solve successively the one-dimensional problems
∂φ_α/∂t + A_αφ_α = f_α, α = 1, 2, …, n, (12.11)
φ₁(x, 0) = g,
φ₁(x, t_j) = φ(x, t_j), j = 1, 2, …,
φ_α(x, t_j) = φ_{α−1}(x, t_{j+1}), j = 0, 1, …, α = 2, 3, …, n,
where we set
φ(x, t_{j+1}) = φ_n(x, t_{j+1}).
Here the f_α are arbitrary functions such that Σ_α f_α = f. Boundary conditions for
φ_α are set only on the part ∂Ω_α of the boundary ∂Ω consisting of the faces x_α = 0
and x_α = 1. We introduce in Ω a uniform grid with step h along each variable. Define
the difference approximation of the operator A_α as follows:
A_α = −(Δ_{x_α}∇_{x_α})/h², α = 1, …, n.
Passing from (12.11) to difference approximations of the problems in the spatial
variables (when the φ_α and f_α are vectors and the A_α are matrices, Ω is a grid domain and ∂Ω_α
is the grid on the faces x_α = 0 and x_α = 1) and approximating in t with the help of
a two-layer implicit scheme of first-order accuracy in τ as well, we obtain the
following local one-dimensional scheme:
(φ_α^{j+1} − φ_α^j)/τ + A_αφ_α^{j+1} = f_α(t_{j+1/2}), (12.12)
α = 1, …, n, j = 0, 1, …,
with the initial conditions
φ₁⁰ = g, φ₁^j = φ^j, j = 1, 2, …,
φ_α^j = φ_{α−1}^{j+1}, α = 2, 3, …, n, j = 0, 1, 2, …, (12.13)
and the boundary conditions
φ_α|_{∂Ω_α} = φ^{(Γ)}(t_{j+1}). (12.14)
Each problem in (12.12)–(12.14) (for each fixed α) is a one-dimensional first boundary
value problem and may be solved by the sweep method.
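The sweep (factorization) method mentioned above is, in matrix terms, the Thomas algorithm for tridiagonal systems. A minimal sketch (added here as an illustration; `a`, `b`, `c` denote the sub-, main and super-diagonals, and the sample fractional-step matrix is an assumption of the example):

```python
import numpy as np

def sweep(a, b, c, f):
    """Solve the tridiagonal system a_i x_{i-1} + b_i x_i + c_i x_{i+1} = f_i
    by forward elimination and back substitution (the sweep method)."""
    n = len(f)
    alpha, beta = np.zeros(n), np.zeros(n)
    alpha[0], beta[0] = -c[0] / b[0], f[0] / b[0]
    for i in range(1, n):
        denom = b[i] + a[i] * alpha[i - 1]
        alpha[i] = -c[i] / denom if i < n - 1 else 0.0
        beta[i] = (f[i] - a[i] * beta[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = beta[-1]
    for i in range(n - 2, -1, -1):
        x[i] = alpha[i] * x[i + 1] + beta[i]
    return x

# One fractional step (E + tau*A_alpha) phi = rhs with the three-point operator:
n, h, tau = 20, 1.0 / 21, 1e-3
b = np.full(n, 1.0 + 2.0 * tau / h**2)        # main diagonal
a = c = np.full(n, -tau / h**2)               # off-diagonals
rhs = np.random.default_rng(2).random(n)
phi = sweep(a, b, c, rhs)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
print(np.max(np.abs(A @ phi - rhs)))          # residual ~ machine precision
```

The cost of each fractional step is O(n), which is what makes the local one-dimensional scheme cheap per time step.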
In order to find the approximate values of the solution of the initial problem on
the time layer t_{j+1} using data from the layer t_j, it is necessary to solve successively
a sequence of one-dimensional problems. The local one-dimensional scheme
(12.12)–(12.14) is stable in the metric
‖φ‖_C = max_i |φ_i|
(i.e. uniformly stable) with respect to the initial and boundary data and to the
right-hand side.
If problem (12.10) has a unique solution φ = φ(x, t) continuous in Ω̄ × Ω̄_t, and there
exist in Ω̄ × Ω̄_t the derivatives
∂²φ/∂t², ∂⁴φ/∂x_α²∂x_β², ∂³φ/∂t∂x_α², ∂²f/∂x_α²,
then scheme (12.12)–(12.14) converges uniformly with rate O(h² + τ) (it has
first-order accuracy in τ and second-order accuracy in h), so that
‖φ^j − φ(t_j)‖_C ≤ M(h² + τ), j = 1, 2, …,
where M = const does not depend on τ and h (SAMARSKII [1971], p. 423).
REMARK 12.1. SAMARSKII ([1971], p. 425) presented and studied local one-dimensional
schemes for the parabolic equation with variable coefficients in a domain of
complex shape. Local one-dimensional schemes for the quasi-linear parabolic
equation were considered by FRYAZINOV [1964]. SAMARSKII [1970, 1971] gives
a bibliography of the works devoted to this class of schemes.
CHAPTER III
Two-Cycle Componentwise
Splitting Methods
∂φ/∂t + Aφ = 0 in Ω_t, (13.1)
φ = g for t = 0
under the assumptions (6.1). We will approximate the operators A₁(t) and A₂(t) not on
the interval t_j ≤ t ≤ t_{j+1} (as in (6.3)) but on the interval t_{j−1} ≤ t ≤ t_{j+1}. Let A_α^j = A_α(t_j).
Write the following two systems of difference equations:
(φ^{j−1/2} − φ^{j−1})/τ + A₁^j(φ^{j−1/2} + φ^{j−1})/2 = 0,
(φ^j − φ^{j−1/2})/τ + A₂^j(φ^j + φ^{j−1/2})/2 = 0; (13.2)
(φ^{j+1/2} − φ^j)/τ + A₂^j(φ^{j+1/2} + φ^j)/2 = 0,
(φ^{j+1} − φ^{j+1/2})/τ + A₁^j(φ^{j+1} + φ^{j+1/2})/2 = 0. (13.3)
The cycle of computations consists of the alternate application of the schemes
(13.2) and (13.3). Excluding the intermediate functions from these schemes we get on the whole step of
computations
φ^{j+1} = T^jφ^{j−1}, (13.4)
where
T^j = (E + ½τA₁^j)^{-1}(E − ½τA₁^j)(E + ½τA₂^j)^{-1}(E − ½τA₂^j)
× (E + ½τA₂^j)^{-1}(E − ½τA₂^j)(E + ½τA₁^j)^{-1}(E − ½τA₁^j) (13.5)
= E − 2τA^j + ½(2τ)²(A^j)² − ⋯
Suppose that in (13.1) the operator A has the form A(t) = Σ_{α=1}^n A_α(t), A_α(t) ≥ 0, and let
A_α^j = A_α(t_j). The two-cycle scheme then takes the form
φ^{j+1} = T₁^j ⋯ T_n^j T_n^j ⋯ T₁^j φ^{j−1},
where
T_α^j = (E + ½τA_α^j)^{-1}(E − ½τA_α^j).
This means algorithmically that first we solve the system of equations (7.3) on the
interval t_{j−1} ≤ t ≤ t_j for α = 1, 2, …, n and then the same system on the interval
t_j ≤ t ≤ t_{j+1}, but in the backward sequence (α = n, n−1, …, 1).
Thus, on the interval t_{j−1} ≤ t ≤ t_{j+1} scheme (14.2) has second-order accuracy in τ,
provided that we take as A_α^j one of the next two operators:
A_α^j = A_α(t_j), (14.3)
or any operator approximating A_α(t_j) to accuracy O(τ²). (14.4)
On the interval t_{j−1} ≤ t ≤ t_{j+1} the two-cycle multicomponent splitting scheme has
the form:
(E + ½τA₁^j)φ^{j−(n−1)/n} = (E − ½τA₁^j)φ^{j−1},
⋮
(E + ½τA_n^j)(φ^j − ½τf^j) = (E − ½τA_n^j)φ^{j−1/n},
(E + ½τA_n^j)φ^{j+1/n} = (E − ½τA_n^j)(φ^j + ½τf^j), (14.5)
⋮
(E + ½τA₁^j)φ^{j+1} = (E − ½τA₁^j)φ^{j+(n−1)/n},
where
A_α^j = A_α(t_j).
It is not difficult to see that this scheme, under the necessary smoothness assumptions,
has a second-order approximation in τ and is absolutely stable (MARCHUK [1980], p.
269).
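The effect of the two-cycle arrangement can be sketched numerically (an added illustration with invented, noncommuting operators and f = 0): the elementary transitions are applied in the order 1, …, n over the first half of the cycle and n, …, 1 over the second, and the resulting palindromic product is second-order accurate even though a one-directional product would be only first-order.

```python
import numpy as np

def cayley(A, tau):
    """Elementary transition (E + tau/2 A)^{-1} (E - tau/2 A)."""
    E = np.eye(A.shape[0])
    return np.linalg.solve(E + 0.5 * tau * A, E - 0.5 * tau * A)

def error_at_t1(nsteps, ops, g, two_cycle):
    """Integrate dphi/dt + (A1+...+An) phi = 0 to t = 1 and compare with the
    exact solution; one cycle covers two steps [t_{j-1}, t_{j+1}]."""
    tau, phi = 1.0 / nsteps, g.copy()
    w, V = np.linalg.eigh(sum(ops))
    exact = V @ (np.exp(-w) * (V.T @ g))
    Ts = [cayley(A, tau) for A in ops]
    for _ in range(nsteps // 2):
        for T in Ts:                                # forward sequence alpha = 1,...,n
            phi = T @ phi
        for T in (Ts[::-1] if two_cycle else Ts):   # backward on the second half
            phi = T @ phi
    return np.max(np.abs(phi - exact))

rng = np.random.default_rng(3)
M1, M2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
ops = [M1.T @ M1, M2.T @ M2]        # noncommuting, positive-semidefinite
g = rng.standard_normal(4)
e2a, e2b = error_at_t1(20, ops, g, True), error_at_t1(40, ops, g, True)
e1b = error_at_t1(40, ops, g, False)
print(e2a / e2b)   # roughly 4: second order for the two-cycle arrangement
print(e2b, e1b)    # the one-cycle (first-order) error is markedly larger
```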
In the same way as in the case n = 2, the multicomponent system of equations
(14.5) may be written in the equivalent form
(E + ½τA_α^j)φ^{j−(n−α)/(n+1)} = (E − ½τA_α^j)φ^{j−(n−α+1)/(n+1)},
α = 1, 2, …, n,
φ^{j+1/(n+1)} = φ^j + τf^j, (14.6)
(E + ½τA_{n−α+2}^j)φ^{j+α/(n+1)} = (E − ½τA_{n−α+2}^j)φ^{j+(α−1)/(n+1)},
α = 2, 3, …, n+1.
Suppose that A(t, φ) is nonnegative and sufficiently smooth.
Suppose further that the solution φ is also a sufficiently smooth function of time.
Consider on the interval t_{j−1} ≤ t ≤ t_{j+1} the splitting scheme (15.3), the two-cycle
scheme of the form (14.5), with
A^j = A(t_j, φ̃^j),
φ̃^j = φ^{j−1} − τ̃ A(t_{j−1}, φ^{j−1})φ^{j−1}, (15.4)
τ̃ = t_j − t_{j−1}.
It may be proved, by the methods used earlier for linear operators depending only upon
time, that splitting scheme (15.3) under conditions (15.4) has a second-order
approximation in τ and is absolutely stable. The splitting method for inhomo-
geneous quasi-linear equations is defined in a similar way. This opens up broad
vistas for applications of the componentwise splitting schemes to nonstationary
quasi-linear problems in hydrodynamics, meteorology, oceanology and other
important fields of natural science.
Let A_α = Σ_{β=1}^{m_α} A_{αβ}. Then we will solve each problem in (16.2) with the two-cycle
method:
(φ^{j+β/(2m_α)} − φ^{j+(β−1)/(2m_α)})/τ
+ A_{αβ}(φ^{j+β/(2m_α)} + φ^{j+(β−1)/(2m_α)})/2 = 0, (16.3)
β = 1, 2, …, m_α,
with the operators then applied in the reverse order for β = m_α + 1, m_α + 2, …, 2m_α.
Initial conditions for each system in (16.3) will be taken respectively in the form (16.4).
It is supposed that each of the problems (16.5) and (16.6) is solved with the two-cycle
method of the kind (16.3). Note that with A_{αβ} ≥ 0 the componentwise splitting
method is absolutely stable.
Let us give the general splitting scheme for the inhomogeneous equation
∂φ/∂t + Σ_{α=1}^n A_αφ = f, (16.7)
φ = g for t = 0,
on the interval t_{j−1} ≤ t ≤ t_{j+1}, based on the two-cycle method.
On the interval t_{j−1} ≤ t ≤ t_j we set
∂φ_α/∂t + A_αφ_α = 0, α = 1, 2, …, n−1, (16.8)
∂φ_n/∂t + A_nφ_n = f + ½τA_nf,
and on the interval t_j ≤ t ≤ t_{j+1}
∂φ_{n+1}/∂t + A_nφ_{n+1} = f − ½τA_nf, (16.9)
∂φ_{n+α}/∂t + A_{n−α+1}φ_{n+α} = 0, α = 2, 3, …, n,
provided that on the interval t_{j−1} ≤ t ≤ t_j
∂φ_α/∂t + A_αφ_α = 0, α = 1, 2, …, n, (16.12)
∂φ_{n+1}/∂t = f, (16.13)
and on the interval t_j ≤ t ≤ t_{j+1}
∂φ_{n+2}/∂t + A_nφ_{n+2} = 0,
⋮ (16.14)
∂φ_{2n+1}/∂t + A₁φ_{2n+1} = 0.
The initial conditions for system (16.12) will be
φ_{n+1}(t_{j−1}) = φ(t_{j−1}), (16.15)
and for the system of equations (16.14)
17. The two-cycle componentwise splitting scheme for the heat conduction equation

where F₁^{j+1/2} and F₂^{j+1/2} denote the grid approximations of the right-hand side f at
t = t_{j+1/2} entering the corresponding fractional steps.
B_αφ^{j+α/p} = φ^{j+(α−1)/p}, α = 1, 2, …, p. (18.4)
If p = 2 and B₁, B₂ are triangular matrices, then the relationships (18.4) represent an
explicit ("running") calculation scheme (see SAMARSKII [1971], p. 368).
∂φ/∂t − (∂²φ/∂x₁² + ∂²φ/∂x₂²) = 0,
φ = φ^{(Γ)} on ∂Ω, (18.5)
φ = g for t = 0,
In scheme (18.6) the operator
B = (3/2)E + τA (18.7)
was represented as the product of two three-point operators B₁ and B₂. It was
shown by BAKER and OLIPHANT [1960] that such an operator may be constructed; as
a result the realization of scheme (18.6) is reduced to solving (18.2) with F^j =
F^j(φ^{j−1}, φ^j; τ) = 2φ^j − ½φ^{j−1}. For p = 2, (18.2) also decomposes into the two equations
(18.4), which may be easily solved using the three-point sweep method.
BAKER [1960] describes the factorization schemes in the case of a multidimen-
sional heat conduction equation with constant coefficients.
But if a two-dimensional parabolic equation with variable coefficients (the diffusion
equation) is considered, then it is impossible to represent B exactly as the product of
two three-point operators. In this case additional iterations are needed (BULEYEV
[1970], OLIPHANT [1961]). For such problems BULEYEV [1970] (see also MARCHUK
[1980], pp. 235-237) formulated a scheme of incomplete factorization. Thus, let
(18.2) be represented as a two-dimensional difference equation
(the index j is omitted for simplicity). Having added the vector Cφ to the left-hand
and right-hand sides of the equation we obtain the iterative scheme (18.11)
(the index of the iteration step is omitted; D_{i,k}(φ) = (Dφ)_{i,k}, where the matrix
D = KC + diag q_{i,k} and diag q_{i,k} is a diagonal matrix with elements q_{i,k}).
Comparing (18.11) and (18.8) we obtain for the coefficients α_{i,k}, β_{i,k}, ε_{i,k}, δ_{i,k}, γ_{i,k}
formulae which are analogous to the formulae of the sweep method.
To ensure the convergence of the iterative process (18.11) for θ > 0 these equations
are supplemented with a simple iteration.
Choosing the parameters θ_{i,k} and κ_{i,k} it is always possible to satisfy the condition
α_{i,k} + β_{i,k} < 1, i.e. to achieve computational stability of the scheme (18.11). Note that
MARCHUK ([1980], pp. 236-242) presented another scheme of incomplete factor-
ization.
19. The implicit splitting scheme with approximate factorization of the operator

We consider for problem (18.1), with operator A ≥ 0, the implicit scheme of the
form (19.1), where
A = Σ_{α=1}^n A_α, A_α ≥ 0, (19.2)
B = Π_{α=1}^n B_α, B_α = E + τA_α.
The solution of equations (19.5) may be carried out by successively solving the equations
(E + τA₁)φ^{j+1/n} = φ^j,
(E + τA₂)φ^{j+2/n} = φ^{j+1/n},
⋮ (19.6)
(E + τA_n)φ^{j+1} = φ^{j+(n−1)/n},
j = 0, 1, 2, …,
φ⁰ = g.
If the operators A_α in the representation (19.2) have a simple structure and allow cheap
inversion, then system (19.6) may also be solved rather simply.
It is easy to see that scheme (19.6) has a first-order approximation in τ. It is
absolutely stable for A_α ≥ 0 in the metric of the grid function space Φ with the norm
‖φ‖ = (φ, φ)^{1/2}, since, in that case, ‖(E + τA_α)^{-1}‖ ≤ 1 (see (2.23) and Theorem 2.1).
We will illustrate an application of the scheme with approximate factorization of
the operator on scheme (18.6), which was suggested by BAKER and OLIPHANT [1960]
for the heat conduction equation (18.5).
We take A = A₁ + A₂, where A_α is a difference approximation of the one-
dimensional differential operator A_α = −∂²/∂x_α² (x₁ = x, x₂ = y). We rewrite
scheme (18.6) in the form
((3/2)E + τA)φ^{j+1} = F^j, F^j = 2φ^j − ½φ^{j−1}, (19.7)
and replace the operator ((3/2)E + τA) by the factorized operator obtained from the implicit
scheme with approximate factorization of the operator (YANENKO [1967], p. 39). We
obtain the scheme
Bφ^{j+1} ≡ (3/2)(E + ⅔τA₁)(E + ⅔τA₂)φ^{j+1} = F^j, (19.8)
where B is a nine-point factorized operator. (Note that scheme (19.8) coincides with
the scheme suggested by BAKER and OLIPHANT [1960].)
If A₁ ≥ 0, A₂ ≥ 0, and the solution of problem (18.1) is smooth enough, then the
difference scheme (20.1) has a second-order approximation in τ and is absolutely
stable; a corresponding a priori estimate is valid.
It is easy to notice that if the solution is sufficiently smooth, then the order of
approximation in τ of the difference equation (20.3) coincides with that of the
Crank–Nicolson scheme
(φ^{j+1} − φ^j)/τ + A(φ^{j+1} + φ^j)/2 = 0, φ⁰ = g, (20.4)
i.e. is second order in τ.
We now move to the stability analysis of the difference equation (20.1). To this end
we transform (20.1) to the form
(E + ½τA₁)(E + ½τA₂)φ^{j+1} = (E − ½τA₁)(E − ½τA₂)φ^j, (20.5)
φ⁰ = g.
We set
ψ^j = (E + ½τA₂)φ^j.
Then for the new unknown ψ^j we obtain the relationship
ψ^{j+1} = Tψ^j,
where the transition operator T is given by
T = (E + ½τA₁)^{-1}(E − ½τA₁)(E − ½τA₂)(E + ½τA₂)^{-1}.
It follows from Theorem 2.1 that
‖T‖ ≤ 1,
and hence
‖ψ^{j+1}‖ ≤ ‖ψ^j‖ ≤ ⋯ ≤ ‖ψ⁰‖.
Here ξ^{j+1/2} and ξ^{j+1} are some auxiliary quantities ensuring the reduction of problem
(20.1) to the sequence of elementary problems (20.6). Note that the first and the last
equations in (20.6) are explicit relationships. This means that we have to invert
operators only in the second and third equations, where only the elementary
operators A₁ and A₂ are present.
We now consider the inhomogeneous problem (18.1), where A = A₁ + A₂, A₁ ≥ 0,
A₂ ≥ 0, f ≠ 0. In this case the stabilization scheme is written as follows:
REMARK 20.1. As we have seen, the scheme of the stabilization method may be
obtained by introducing into the Crank–Nicolson scheme (i.e. the explicit-implicit
scheme) the operator B = E + ¼τ²A₁A₂. After that the operator of the problem
becomes factorizable. Therefore, the stabilization scheme may also be called the
explicit-implicit scheme with approximate factorization of the operator.
Let us formulate the stabilization scheme for the inhomogeneous problem (18.1),
where
A = Σ_{α=1}^n A_α, A_α ≥ 0, (20.9)
and the operators A_α are time-independent. With assumptions (20.9) the stabiliza-
tion method may be represented in the form
Π_{α=1}^n (E + ½τA_α) (φ^{j+1} − φ^j)/τ + Aφ^j = f^j, φ⁰ = g, (20.10)
where
f^j = f(t_{j+1/2}).
The scheme of realization of the algorithm has the following form:
F^j = −Aφ^j + f^j,
(E + ½τA₁)ξ^{j+1/n} = F^j,
(E + ½τA₂)ξ^{j+2/n} = ξ^{j+1/n},
⋮ (20.11)
(E + ½τA_n)ξ^{j+1} = ξ^{j+(n−1)/n},
φ^{j+1} = φ^j + τξ^{j+1}.
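The realization (20.11) can be sketched as follows (an added illustration with invented symmetric operators and f = 0, so that the exact solution exp(−t(A₁+A₂))g is available; for n = 2 second-order accuracy in τ is expected):

```python
import numpy as np

def stab_step(phi, ops, tau, f=0.0):
    """One step of the stabilization scheme (20.10) via the realization (20.11)."""
    E = np.eye(len(phi))
    xi = -sum(ops) @ phi + f                       # F^j = -A phi^j + f^j
    for Aa in ops:                                 # successive inversions
        xi = np.linalg.solve(E + 0.5 * tau * Aa, xi)
    return phi + tau * xi                          # phi^{j+1} = phi^j + tau*xi^{j+1}

def error_at_t1(nsteps, ops, g):
    """Max error against the exact solution exp(-(A1+A2)) g at t = 1."""
    tau, phi = 1.0 / nsteps, g.copy()
    for _ in range(nsteps):
        phi = stab_step(phi, ops, tau)
    w, V = np.linalg.eigh(sum(ops))
    return np.max(np.abs(phi - V @ (np.exp(-w) * (V.T @ g))))

rng = np.random.default_rng(4)
M1, M2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
ops = [M1.T @ M1, M2.T @ M2]                       # A1 >= 0, A2 >= 0
g = rng.standard_normal(4)
print(error_at_t1(20, ops, g) / error_at_t1(40, ops, g))  # roughly 4: O(tau^2)
```

Note that only the operators E + ½τA_α ever need to be inverted, which is what makes the factorized realization cheap when the A_α are one-dimensional.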
With the assumption of sufficient smoothness the stabilization method (20.10) has
a second-order approximation in τ. Computational stability will be ensured
provided that
‖T‖ ≤ 1, (20.12)
where the transition operator T is defined by the formula
T = E − τ[Π_{α=1}^n (E + ½τA_α)]^{-1}A. (20.13)
Unfortunately, the condition A_α ≥ 0 does not imply here stability in some norm,
as was the case for n = 2.
To ensure stability the following simple algorithmic tool is sometimes used. Set f^j
equal to zero and reduce equation (20.10), resolved with respect to φ^{j+1}, to the
following form:
φ^{j+1} = Tφ^j. (20.14)
Since the operator T is assumed to be time-independent (i.e. it does not depend on j),
we keep an eye only on the norm ‖φ^j‖ when solving problem (20.14) with the initial
condition
φ⁰ = g (20.15)
and a fixed parameter τ ensuring the necessary approximation. If this norm does not
increase, then we may be sure that computational stability takes place. After this
we may move to the solution of the inhomogeneous problem. We rewrite equation
(20.10) as follows:
Hence,
j +l
pII < IT l 1 pi II, + r
11 II(E+ ,TA)- II I f IIF
x=l
REMARK 20.2. While checking the stability condition $\|T\| \le 1$ we used the initial
condition (20.15). This was not necessary at all. We might have chosen as initial
condition any function from the same smoothness class as $g$ and $f$.
21. A general scheme for the method of approximate factorization of the operator
$(E - \tau A_1)\varphi^{j+1} = (E + \tau A_0)\varphi^j + \tau F^j, \qquad j = 0, 1, 2, \ldots,$
(21.1)
$\varphi^0 = g,$

where the operators admit the factorized approximations

$\prod_{\alpha=1}^{p}(E - \tau A_{1\alpha}) \approx E - \tau A_1,$
(21.3)
$\prod_{\alpha=1}^{q}(E + \tau A_{0\alpha}) \approx E + \tau A_0$

(i.e. the operators on the left-hand sides are approximations of the operators on the
right-hand sides). The relationships (21.3) allow the replacement of (21.1) by the
factorized scheme

$\prod_{\alpha=1}^{p}(E - \tau A_{1\alpha})\,\varphi^{j+1} = \prod_{\alpha=1}^{q}(E + \tau A_{0\alpha})\,\varphi^j + \tau F^j.$ (21.4)
The given scheme approximates problem (18.1), and if $A_{1\alpha} = A_{0\alpha} = -\tfrac12 A_\alpha$ and
$F^j = f(t_{j+1/2})$ it has a second-order approximation in $\tau$. The stability analysis of
scheme (21.4) is often difficult in comparison with the componentwise splitting
schemes. For the sake of simplicity let $f \equiv 0$ in (18.1). With the splitting

$A = \sum_{\alpha=1}^{n} A_\alpha, \qquad A_\alpha \ge 0,$ (21.7)

one arrives at the factorized scheme

$\prod_{\alpha=1}^{n}(E + \tau\theta A_\alpha)\,\varphi^{j+1} = \prod_{\alpha=1}^{n}(E - \tau(1 - \theta)A_\alpha)\,\varphi^j + \tau F^j.$ (21.9)
It is easy to notice that the realization of scheme (21.9) consists of solving the
sequence of one-dimensional problems (with the sweep method).
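The sweep method mentioned here is the Thomas algorithm for tridiagonal systems; a minimal sketch follows, in which the coefficient layout ($a$: sub-diagonal, $b$: main diagonal, $c$: super-diagonal) is an assumption of this illustration.

```python
def sweep(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] = c[-1] = 0.

    Assumes diagonal dominance, as holds for the factorized heat-equation
    operators (E + tau*theta*Lambda_alpha)."""
    n = len(b)
    alpha = [0.0] * n
    beta = [0.0] * n
    alpha[0] = -c[0] / b[0]
    beta[0] = d[0] / b[0]
    for i in range(1, n):                     # forward elimination
        denom = b[i] + a[i] * alpha[i - 1]
        alpha[i] = -c[i] / denom
        beta[i] = (d[i] - a[i] * beta[i - 1]) / denom
    x = [0.0] * n
    x[-1] = beta[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        x[i] = alpha[i] * x[i + 1] + beta[i]
    return x
```

The cost is $O(n)$ per line, which is what makes the factorized schemes economical.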
REMARK 21.1. Scheme (21.1) and representations (21.2) do not always allow a simple
determination of the corresponding scheme. Consider, for example, the scheme for the problem
$\dfrac{\partial\varphi}{\partial t} - \sum_{\alpha=1}^{n}\dfrac{\partial^2\varphi}{\partial x_\alpha^2} = f$ in $\Omega \times \Omega_t$ (22.1)

under conditions

$\varphi|_{\partial\Omega} = \varphi^{(\Gamma)}, \qquad \varphi = g$ for $t = 0$.

Here

$\Omega = \{0 < x_\alpha < 1\colon \alpha = 1, \ldots, n\}, \qquad \Omega_t = \{0 < t < T\}.$

Let

$\tau = T/K, \qquad h_\alpha = 1/N_\alpha, \quad \alpha = 1, \ldots, n,$
(22.2)
$h = (h_1, \ldots, h_n), \qquad |h|^2 = \sum_{\alpha=1}^{n} h_\alpha^2.$

The spatial grid $\Omega_h$ is the set of points $x_l = (l_1h_1, \ldots, l_nh_n)$, where $1 \le l_\alpha \le N_\alpha - 1$,
$l = (l_1, \ldots, l_n)$; $\partial\Omega_h$ is the set of boundary grid points, $\bar\Omega_h = \Omega_h + \partial\Omega_h$. The temporal
grid is given by $t_j = j\tau$, $j = 0, 1, \ldots, K$.
We introduce the difference operators

$\Delta_\alpha\varphi(x) = \varphi(x + 1_\alpha h_\alpha) - \varphi(x), \qquad \nabla_\alpha\varphi(x) = \varphi(x) - \varphi(x - 1_\alpha h_\alpha),$

where $1_\alpha$ is the $n$-dimensional unit vector in the direction $x_\alpha$. We consider a set of
difference schemes of the kind

$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + \sum_{\alpha=1}^{n}\Lambda_\alpha\bigl(\theta\varphi^{j+1} + (1 - \theta)\varphi^j\bigr) = f^{j+\theta}$ in $\Omega_h$, $0 \le j \le K - 1$,

$\varphi^0 = g, \quad x \in \bar\Omega_h,$ (22.3)

where

$\Lambda_\alpha = -(\Delta_\alpha\nabla_\alpha)/h_\alpha^2, \qquad 0 \le \theta = \mathrm{const} \le 1,$

$f^{j+\theta} = f(t_{j+1}, x)$ for $\theta \ne \tfrac12$, $\quad f^{j+\theta} = f(t_{j+1/2}, x)$ for $\theta = \tfrac12$,

and the notation $\varphi_l^j \approx \varphi(t_j, x_l)$ is used. When the solutions are sufficiently smooth,
the schemes (22.3) approximate problem (22.1). We introduce the norm $\|u\| = (u, u)^{1/2}$.
Then in this norm the schemes (22.3) are absolutely stable independently of $\tau$ and $h$
for $\theta \ge \tfrac12$, and for $\theta < \tfrac12$ they are stable provided that the
following relationship holds:

$\tau\sum_{\alpha=1}^{n}\dfrac{1}{h_\alpha^2} \le \dfrac{1}{2(1 - 2\theta)},$ (22.4)

i.e. they are conditionally stable. From the approximation and stability of these schemes
the corresponding statements about convergence follow (see Theorem 3.1).
We now introduce the operator

$B = \prod_{\alpha=1}^{n}(E + \tau\theta\Lambda_\alpha) = E + \tau\theta\Lambda + \tau^2\theta^2\sum_{\alpha<\beta}\Lambda_\alpha\Lambda_\beta + \cdots + (\tau\theta)^n\Lambda_1\Lambda_2\cdots\Lambda_n$ (22.5)

and consider the scheme with approximate factorization of the operator

$\prod_{\alpha=1}^{n}(E + \tau\theta\Lambda_\alpha)\,\dfrac{\varphi_l^{j+1} - \varphi_l^j}{\tau} + \Lambda\varphi_l^j = f_l^{j+\theta}, \quad x_l \in \Omega_h,$

$\varphi_l^0 = g_l, \quad x_l \in \bar\Omega_h,$ (22.6)

$\varphi_l^j = \varphi_l^{(\Gamma)j}, \quad x_l \in \partial\Omega_h.$
Taking (22.5) into account we may rewrite (22.6) in the form

$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + \Lambda\bigl(\theta\varphi^{j+1} + (1 - \theta)\varphi^j\bigr) + Q\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau} = f^{j+\theta},$

where

$Q = \tau^2\theta^2\sum_{\alpha<\beta}\Lambda_\alpha\Lambda_\beta + \cdots + (\tau\theta)^n\Lambda_1\Lambda_2\cdots\Lambda_n.$

Thus, we see that (22.6) differs from (22.3) by the presence of the additional term
$Q(\varphi^{j+1} - \varphi^j)/\tau$ on the left-hand side. If the solution of the original problem has
derivatives (bounded in $\Omega \times \Omega_t$)

$D^\beta\varphi, \qquad \beta_\alpha \le 2, \quad D^\beta \equiv \partial^{|\beta|}/\partial x^\beta,$
then scheme (22.6) may be written in the form

$\prod_{\alpha=1}^{n}(E + \tau\theta\Lambda_\alpha)\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + \Lambda\varphi^j = \tilde f^{\,j+1},$ (22.11)

where $\Lambda = \sum_{\alpha=1}^{n}\Lambda_\alpha$ and the vector $\tilde f^{\,j+1}$ is defined through $f^{j+\theta}$, $\varphi^{(\Gamma)}(t_{j+1})$ and $\varphi^{(\Gamma)}(t_j)$. This
is (18.2) with splitting operator $B = \prod_{\alpha=1}^{n}(E + \tau\theta\Lambda_\alpha)$. The solution of (22.11) on each
time step is reduced to the solution of (18.4) for $p = n$ and $B_\alpha = E + \tau\theta\Lambda_\alpha$ by the sweep
method.
Stability of scheme (22.6) follows from the a priori estimate
REMARK 22.1. Along with scheme (22.6) one may consider schemes with approximate
factorization of the operator which are obtained by replacing the basic scheme by
factorized schemes with a nonconstant time step, of the kind

$\dfrac{1}{\tau_j}\prod_{\alpha=1}^{n}(E + \theta\tau_j\Lambda_\alpha)\,\varphi^{j+1} - \dfrac{1}{\tau_j}\prod_{\alpha=1}^{n}(E - (1 - \theta)\tau_j\Lambda_\alpha)\,\varphi^j = f^j,$

where $\tau_j = t_j - t_{j-1}$ is a nonconstant grid step in $t$ and $0 < \theta = \mathrm{const} \le 1$ (DYAKONOV
[1971d], p. 26).
CHAPTER V
The idea of the predictor-corrector method is as follows. The whole interval $0 \le t \le T$
is split into a number of partial intervals, and within each elementary interval
$t_j \le t \le t_{j+1}$ the problem (18.1) is solved in two steps. First, one finds an approximate
solution of the problem at time $t_{j+1/2} = t_j + \tfrac12\tau$ using a first-order accuracy scheme
which has a considerable "amount" of stability. This step is usually referred to as the
predictor. Then, on the whole interval $(t_j, t_{j+1})$ the original equation is approximated
with a second-order scheme which serves as the corrector. It is essential that,
while constructing the corrector, the "rough" solution at $t_{j+1/2}$ found in the predictor
step is used.
The predictor-corrector scheme may be written in the form:

$\dfrac{\varphi^{j+1/4} - \varphi^j}{\tfrac12\tau} + A_1\varphi^{j+1/4} = 0,$

$\dfrac{\varphi^{j+1/2} - \varphi^{j+1/4}}{\tfrac12\tau} + A_2\varphi^{j+1/2} = 0,$ (23.1)

$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\varphi^{j+1/2} = 0,$

$\varphi^0 = g.$

If we exclude the auxiliary function $\varphi^{j+1/4}$ from the first two equations of (23.1),
the system reduces to the following one:

$(E + \tfrac12\tau A_1)(E + \tfrac12\tau A_2)\varphi^{j+1/2} = \varphi^j,$
(23.2)
$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\varphi^{j+1/2} = 0.$
$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A(E + \tfrac12\tau A_2)^{-1}(E + \tfrac12\tau A_1)^{-1}\varphi^j = 0,$
(23.3)
$\varphi^0 = g.$
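One step of the scheme in the eliminated form (23.2) can be sketched as follows; the matrices $A_1$, $A_2$ supplied to it are illustrative assumptions (the homogeneous case $f = 0$ is taken).

```python
import numpy as np

def predictor_corrector_step(phi, A1, A2, tau):
    """One predictor-corrector step (23.2), homogeneous case f = 0."""
    I = np.eye(len(phi))
    # predictor: (E + tau/2 A1)(E + tau/2 A2) phi^{j+1/2} = phi^j
    half = np.linalg.solve(I + 0.5 * tau * A1, phi)
    half = np.linalg.solve(I + 0.5 * tau * A2, half)
    # corrector: (phi^{j+1} - phi^j)/tau + (A1 + A2) phi^{j+1/2} = 0
    return phi - tau * (A1 + A2) @ half
```

For commuting positive definite $A_1$, $A_2$ the step reproduces $\mathrm{e}^{-\tau A}$ with an $O(\tau^3)$ local error, in line with the second-order approximation claimed below.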
We now suppose that the following restriction holds:

$\tfrac12\tau\|A_\alpha\| \le 1, \qquad \alpha = 1, 2.$ (23.4)

If $A_1 \ge 0$, $A_2 \ge 0$, the elements of the matrices $A_1$, $A_2$ do not depend on $t$, and the
solution $\varphi$ of (18.1) is sufficiently smooth, then the difference scheme (23.1) is
absolutely stable and approximates the original problem with second-order
accuracy in $\tau$. At the same time the estimate

$\|\varphi^j\|_{C'} \le \|g\|_{C'}$ (23.5)

holds, where

$C' = (E + \tfrac12\tau A_2)^{-1}(E + \tfrac12\tau A_1), \qquad \|\varphi\|_{C'} = (C'\varphi, \varphi)^{1/2}.$
To prove this statement let us rewrite (23.3) in the form

$(E + \tfrac12\tau A_1)(E + \tfrac12\tau A_2)\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + \tilde A\varphi^j = 0,$

where

$\tilde A = (E + \tfrac12\tau A_1)(E + \tfrac12\tau A_2)\,A\,(E + \tfrac12\tau A_2)^{-1}(E + \tfrac12\tau A_1)^{-1}.$

Expanding the right-hand side of the last relationship into a Taylor series in $\tau$ and
assuming that

$\tfrac12\tau\|A_\alpha\| < 1,$

we get

$\tilde A = A + O(\tau^2).$

With the help of the estimate used in the stabilization method we come to the
conclusion that the predictor-corrector method has a second order of approximation
in $\tau$.
Let us consider the stability of this method. To this end we rewrite (23.3) in the
form

$(E + \tfrac12\tau A_1)(E + \tfrac12\tau A_2)\,\dfrac{\psi^{j+1} - \psi^j}{\tau} + A\psi^j = 0,$ (23.6)

where

$\psi^j = (E + \tfrac12\tau A_2)^{-1}(E + \tfrac12\tau A_1)^{-1}\varphi^j.$ (23.7)

Difference scheme (23.6) is stable since

$\|\psi^{j+1}\|_{C'} \le \|\psi^j\|_{C'}.$ (23.8)
$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\varphi^{j+1/2} = f^j,$ (23.9)

where

$f^j = f(t_{j+1/2}).$

If $f^j$ is chosen in this form, then (23.9) approximates the original problem with
second-order accuracy in $\tau$ and the estimate

$\|\varphi^{j+1} + \tfrac12\tau f^j\|_{C'} \le \|g + \tfrac12\tau f^0\|_{C'} + j\tau\|f\|_{C'},$ (23.10)

where

$\|f\|_{C'} = \max_j \|f^j\|_{C'},$

is valid, i.e., if $0 \le t_j \le T$ we again get stability of the difference scheme (see MARCHUK
[1980], pp. 254-255).
Thus if $A_1 \ge 0$, $A_2 \ge 0$ and the elements of the matrices $A_1$, $A_2$ do not depend on time,
then the difference scheme (23.9) is absolutely stable and allows one to obtain a solution of
second-order accuracy in $\tau$ provided that the solution and the right-hand side $f$ of
(18.1) are sufficiently smooth.
In conclusion we shall pay attention to the fact that, although difference scheme
(23.1) is absolutely stable, the difference scheme for the corrector incorporated here
as a part may be absolutely unstable if considered separately. Let us show this. For
the sake of simplicity we will consider the case when $A$ is the difference analogue of
the two-dimensional Laplace operator, $\Omega$ is the unit square, and the solution equals
zero on the boundary of $\Omega$.

In this case the corrector has the form

$\dfrac{\varphi_{k,l}^{j+1} - \varphi_{k,l}^{j-1}}{2\tau} - \dfrac{\varphi_{k+1,l}^j - 2\varphi_{k,l}^j + \varphi_{k-1,l}^j}{h^2} - \dfrac{\varphi_{k,l+1}^j - 2\varphi_{k,l}^j + \varphi_{k,l-1}^j}{h^2} = 0$

(since now the predictor is not used in the computations, only integer values of $j$ are
considered).
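The instability is easy to observe numerically. Below is a one-dimensional analogue of the corrector taken alone (the two-dimensional case from the text behaves the same way); the grid sizes and step are illustrative assumptions.

```python
import numpy as np

# Corrector alone: the three-layer "leapfrog" scheme
#   (phi^{j+1} - phi^{j-1})/(2*tau) + A*phi^j = 0,
# which is absolutely unstable for the heat operator.

N = 50
h = 1.0 / N
tau = 1.0e-4
A = (np.diag(2.0 * np.ones(N - 1))
     - np.diag(np.ones(N - 2), 1)
     - np.diag(np.ones(N - 2), -1)) / h**2   # 1D discrete Laplacian

x = np.arange(1, N) * h
prev = np.sin(np.pi * x)                     # phi^0: smooth initial mode
curr = prev - tau * (A @ prev)               # phi^1 by one explicit Euler step
for _ in range(400):                         # leapfrog steps
    prev, curr = curr, prev - 2.0 * tau * (A @ curr)
# round-off components in the stiff modes are amplified at every step,
# so the norm of curr grows without bound however small tau is
```

The smooth initial mode itself decays slowly, but the parasitic root of the three-layer scheme amplifies the stiffest round-off components, which is exactly the violation of $\|\varphi^j\| \le \|\varphi^{j-1}\|$ discussed next.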
For this difference problem a solution $\varphi^j$ will be chosen for which the inequality
$\|\varphi^j\| \le \|\varphi^{j-1}\|$ from the definition of stability does not hold. We will find such
$(E + \tfrac12\tau A_1)\varphi^{j+1/2n} = \varphi^j,$

$\cdots$ (24.2)

$(E + \tfrac12\tau A_n)\varphi^{j+1/2} = \varphi^{j+(n-1)/2n},$

$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\varphi^{j+1/2} = f^j,$

where we suppose again that $A_\alpha \ge 0$ and $f^j = f(t_{j+1/2})$. The system of equations (24.2)
is reduced to one equation

$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A(E + \tfrac12\tau A_n)^{-1}\cdots(E + \tfrac12\tau A_1)^{-1}\varphi^j = f^j,$
(24.3)
$\varphi^0 = g.$
Let

$\Omega = \{0 < x_\alpha < 1\colon \alpha = 1, 2, 3\}, \qquad \Omega_t = \{0 < t < T\}.$

For the problem

$\dfrac{\partial\varphi}{\partial t} - \sum_{\alpha=1}^{3}\dfrac{\partial^2\varphi}{\partial x_\alpha^2} = 0$ in $\Omega \times \Omega_t$,

$\varphi = g$ for $t = 0$, (25.1)

$\varphi = \varphi^{(\Gamma)}$ on $\partial\Omega$

we introduce the following difference approximation:

$\partial\varphi/\partial t + \Lambda\varphi = 0$ in $\Omega_h \times \Omega_t$,

$\varphi = \varphi^{(\Gamma)}$ on $\partial\Omega_h$, (25.2)

$\varphi = g$ for $t = 0$.

Here $\Omega_h$ and $\partial\Omega_h$ are already grid domains, and $\Lambda$ has the form:

$\Lambda = \sum_{\alpha=1}^{3}\Lambda_\alpha, \qquad \Lambda_\alpha = -(\Delta_\alpha\nabla_\alpha)/h_\alpha^2.$ (25.3)
BRIAN [1966] (see also YANENKO [1967], p. 40) suggested the following scheme:
$\dfrac{\varphi^{j+1/6} - \varphi^j}{\tfrac12\tau} + \Lambda_1\varphi^{j+1/6} + \Lambda_2\varphi^j + \Lambda_3\varphi^j = 0,$

$\dfrac{\varphi^{j+1/3} - \varphi^{j+1/6}}{\tfrac12\tau} + \Lambda_2(\varphi^{j+1/3} - \varphi^j) = 0,$
(25.4)
$\dfrac{\varphi^{j+1/2} - \varphi^{j+1/3}}{\tfrac12\tau} + \Lambda_3(\varphi^{j+1/2} - \varphi^j) = 0,$

$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + \Lambda_1\varphi^{j+1/6} + \Lambda_2\varphi^{j+1/3} + \Lambda_3\varphi^{j+1/2} = 0.$

Here, the first three equations form the predictor (stabilizing correction scheme),
bringing $\varphi$ to the time level $t = (j + \tfrac12)\tau$; the fourth is the corrector.
Having excluded the fractional steps from (25.4) we get

$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + \Lambda\,\dfrac{\varphi^{j+1} + \varphi^j}{2} + \tfrac14\tau^2(\Lambda_1\Lambda_2 + \Lambda_1\Lambda_3 + \Lambda_2\Lambda_3)\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + \tfrac18\tau^3\Lambda_1\Lambda_2\Lambda_3\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau} = 0.$ (25.5)
$\dfrac{\varphi^{j+1/3} - \varphi^j}{\tau} + \tfrac12\Lambda_1(\varphi^{j+1/3} + \varphi^j) + \Lambda_2\varphi^j + \Lambda_3\varphi^j = 0,$

$\dfrac{\varphi^{j+2/3} - \varphi^{j+1/3}}{\tau} + \tfrac12\Lambda_2(\varphi^{j+2/3} - \varphi^j) = 0,$ (25.7)

$\dfrac{\varphi^{j+1} - \varphi^{j+2/3}}{\tau} + \tfrac12\Lambda_3(\varphi^{j+1} - \varphi^j) = 0.$
Excluding $\varphi^{j+1/3}$, $\varphi^{j+2/3}$ from (25.7) we again come to (25.5). Therefore, schemes
(25.4) and (25.6) are equivalent (YANENKO [1967], p. 42).
We write the predictor-corrector scheme as

$\dfrac{\varphi^{j+1/6} - \varphi^j}{\tfrac12\tau} + \Lambda_1\varphi^{j+1/6} = 0,$

$\dfrac{\varphi^{j+1/3} - \varphi^{j+1/6}}{\tfrac12\tau} + \Lambda_2\varphi^{j+1/3} = 0,$
(25.8)
$\dfrac{\varphi^{j+1/2} - \varphi^{j+1/3}}{\tfrac12\tau} + \Lambda_3\varphi^{j+1/2} = 0,$

$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + \Lambda\varphi^{j+1/2} = 0,$

which is a special realization of scheme (24.2). Excluding $\varphi^{j+1/6}$, $\varphi^{j+1/3}$, $\varphi^{j+1/2}$ from
(25.8) we also come to (25.5). Therefore, this scheme is equivalent to the schemes
(25.4) and (25.6).
CHAPTER VI
$\dfrac{\varphi^{j+1/2} - \varphi^j}{\tfrac12\tau} + A_1\varphi^{j+1/2} + A_2\varphi^j = 0,$

$\dfrac{\varphi^{j+1} - \varphi^{j+1/2}}{\tfrac12\tau} + A_1\varphi^{j+1/2} + A_2\varphi^{j+1} = 0,$ (26.2)

$j = 0, 1, 2, \ldots,$

$\varphi^0 = g.$

In this form it was applied by PEACEMAN and RACHFORD [1955] and DOUGLAS [1955]
to a parabolic problem with two spatial variables. In this case the operator $A_\alpha$ is
a difference approximation of the one-dimensional operator $-\partial^2/\partial x_\alpha^2$. Note that
in this case scheme (26.2) is symmetric, i.e. $x_1$ and $x_2$ change roles from the first fractional
step to the second (this fact gives the method its name). It is easy to find the solution
of each equation in the parabolic problem by the sweep method (therefore, scheme
(26.2) is also called the alternating direction sweep method (YANENKO [1967], p. 26)).
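A compact sketch of one Peaceman-Rachford step for the two-dimensional heat equation follows; the one-dimensional discrete Laplacians and grid sizes are illustrative assumptions, and each line solve stands in for a tridiagonal sweep.

```python
import numpy as np

def laplacian_1d(n, h):
    """1D discrete Laplacian with Dirichlet boundary conditions."""
    return (np.diag(2.0 * np.ones(n))
            - np.diag(np.ones(n - 1), 1)
            - np.diag(np.ones(n - 1), -1)) / h**2

def pr_adi_step(phi, L1, L2, tau):
    """One step of scheme (26.2); L1 acts along axis 0, L2 along axis 1."""
    n1, n2 = phi.shape
    I1, I2 = np.eye(n1), np.eye(n2)
    # first half step: implicit in x1, explicit in x2
    rhs = phi - 0.5 * tau * phi @ L2.T
    half = np.linalg.solve(I1 + 0.5 * tau * L1, rhs)
    # second half step: implicit in x2, explicit in x1
    rhs = half - 0.5 * tau * L1 @ half
    return np.linalg.solve(I2 + 0.5 * tau * L2, rhs.T).T
```

The two directions exchange their implicit and explicit roles between the half steps, which is exactly the symmetry the method is named for.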
Eliminating $\varphi^{j+1/2}$ from (26.2) we obtain

$\varphi^{j+1} = (E + \tfrac12\tau A_2)^{-1}(E - \tfrac12\tau A_1)(E + \tfrac12\tau A_1)^{-1}(E - \tfrac12\tau A_2)\varphi^j.$ (26.3)

The analogous alternating direction scheme for three operators,

$\dfrac{\varphi^{j+1/3} - \varphi^j}{\tau} + \tfrac13(A_1\varphi^{j+1/3} + A_2\varphi^j + A_3\varphi^j) = 0,$

$\dfrac{\varphi^{j+2/3} - \varphi^{j+1/3}}{\tau} + \tfrac13(A_1\varphi^{j+1/3} + A_2\varphi^{j+2/3} + A_3\varphi^{j+1/3}) = 0,$ (26.4)

$\dfrac{\varphi^{j+1} - \varphi^{j+2/3}}{\tau} + \tfrac13(A_1\varphi^{j+2/3} + A_2\varphi^{j+2/3} + A_3\varphi^{j+1}) = 0,$
is not absolutely stable. Therefore, for many problems schemes of the stabilizing
correction method are preferable (along with schemes of the alternating direction
method they are sometimes also called implicit schemes of alternating direction).
$\dfrac{\varphi^{j+1/3} - \varphi^j}{\tau} + A_1\varphi^{j+1/3} + A_2\varphi^j + A_3\varphi^j = 0,$

$\dfrac{\varphi^{j+2/3} - \varphi^{j+1/3}}{\tau} + A_2(\varphi^{j+2/3} - \varphi^j) = 0,$ (27.1)

$\dfrac{\varphi^{j+1} - \varphi^{j+2/3}}{\tau} + A_3(\varphi^{j+1} - \varphi^j) = 0.$

By eliminating $\varphi^{j+1/3}$ and $\varphi^{j+2/3}$ from (27.1) we obtain the equation

$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\varphi^{j+1} + \tau^2(A_1A_2 + A_1A_3 + A_2A_3)\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + \tau^3A_1A_2A_3\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau} = 0.$
It follows that the scheme has first-order accuracy in $\tau$. When we consider it for the
heat conduction equation it is easy to establish absolute stability. Besides, the
structure of the scheme is as follows: the first fractional step gives the whole
approximation of the heat conduction equation, the next fractional steps are
corrective and serve to improve the stability. Therefore such schemes are called
stabilizing correction schemes or schemes with stability correction.
DOUGLAS [1962] proposed scheme (25.6) of second-order accuracy, which may
also be considered as a stabilizing correction scheme.
$\dfrac{\varphi^{j+1} - \varphi^j}{\tau} + A\varphi^{j+1} = F^j,$ (28.2)

where

$F^j = B_0\varphi^j + B_{-1}\varphi^{j-1} + \cdots + B_{-q+1}\varphi^{j-q+1} + f^j,$ (28.3)

$A = A_1 + \cdots + A_n,$

which approximates problem (28.1). Taking (28.3) into account we can write the
corresponding scheme in fractional steps:
$\dfrac{\varphi^{j+1/n} - \varphi^j}{\tau} + A_1\varphi^{j+1/n} + \sum_{\alpha=2}^{n}A_\alpha\varphi^j = F^j,$

$\dfrac{\varphi^{j+\alpha/n} - \varphi^{j+(\alpha-1)/n}}{\tau} + A_\alpha(\varphi^{j+\alpha/n} - \varphi^j) = 0, \qquad \alpha = 2, \ldots, n.$ (28.4)
where

$Q = \sum_{i<j}A_iA_j + \tau\sum_{i<j<k}A_iA_jA_k + \cdots + \tau^{n-2}A_1\cdots A_n.$ (28.6)
By comparing (28.5) and (28.6) we conclude that, with accuracy up to order $O(\tau^2)$,
the general stabilizing correction scheme keeps the order of approximation of
scheme (28.2)-(28.3). The spectral stability analysis for scheme (28.4) is given by
YANENKO ([1967], pp. 160-161).

Note that for two-layer schemes, under the condition that the operators are
commutative and $A_\alpha \ge 0$, the stability of scheme (28.4) follows from the stability of
scheme (28.2) (DOUGLAS and GUNN [1964]).
Let us solve the following boundary value problem for the second-order parabolic
equation without mixed derivatives:

$\partial\varphi/\partial t + A\varphi = f$ in $\Omega \times \Omega_t$,

$\varphi|_{\partial\Omega} = \varphi^{(\Gamma)}(x, t),$ (29.1)

$\varphi = g(x)$ in $\Omega$ for $t = 0$,

where

$A = \sum_{\alpha=1}^{2}A_\alpha, \qquad A_\alpha\varphi = -\dfrac{\partial}{\partial x_\alpha}\Bigl(k_\alpha\dfrac{\partial\varphi}{\partial x_\alpha}\Bigr),$

$x = (x_1, x_2) \in \Omega = \{0 < x_\alpha < 1\colon \alpha = 1, 2\}, \qquad k_\alpha > 0.$
We assume that problem (29.1) has a unique and sufficiently smooth solution.
THOMPSON [1964] proved the existence and uniqueness of a generalized solution for
(29.1) provided that the function $f(x, u)$ satisfies a Lipschitz condition in the variable
$u$. He also showed that the convergence of the solution of the difference problem to the
solution of the differential one follows from the approximation and stability of the
corresponding difference operator.
We construct in $\bar\Omega = \Omega + \partial\Omega$ a grid $\bar\Omega_h$ with steps $h_\alpha$ that is uniform in $x_\alpha$, and replace
the operators $A_\alpha$ by the difference operators $\Lambda_\alpha$:

$\Lambda_\alpha\varphi = -(\Delta_\alpha k_\alpha\nabla_\alpha\varphi)/h_\alpha^2.$ (29.2)

Contrary to the case of constant coefficients, here the operators $\Lambda_\alpha$ are positive and
self-adjoint but not commutative. Instead of (29.1) we consider the following
Now $\varphi$ and $f$ are vectors and $\Lambda_1$ and $\Lambda_2$ are matrices. Suppose the solution $\varphi$ belongs to
the grid function space.
For solving problem (29.3) in the layer $t_j \le t \le t_{j+1}$ we write the alternating
direction scheme in the form (SAMARSKII [1971], pp. 360-365):

$\dfrac{\varphi^{j+1/2} - \varphi^j}{\tfrac12\tau} + \Lambda_1\varphi^{j+1/2} + \Lambda_2\varphi^j = f^j,$
(29.4)
$\dfrac{\varphi^{j+1} - \varphi^{j+1/2}}{\tfrac12\tau} + \Lambda_1\varphi^{j+1/2} + \Lambda_2\varphi^{j+1} = f^j.$
Each problem in (29.8) with the corresponding boundary conditions from (29.6) is
a one-dimensional first boundary value problem and it can be solved by the sweep
method.
If the value $k_\alpha$ in the grid point $i$ is calculated, for example, by the formula

$(k_\alpha)_i = \tfrac12[k_\alpha(x_i) + k_\alpha(x_{i+1_\alpha})], \qquad 1 \le i \le I,$

then the operator $\Lambda_\alpha$ approximates the operator $A_\alpha$ with second-order accuracy, i.e.

$\Lambda_\alpha - A_\alpha = O(h_\alpha^2).$
In this metric scheme (29.4)-(29.6) is stable with respect to the initial and boundary
conditions and to the right-hand side.
Assume that, in $\Omega \times \Omega_t$, (29.1) has a unique solution $\varphi = \varphi(x, t)$ which is continuous
and has bounded derivatives

$\dfrac{\partial^3\varphi}{\partial t^3}, \qquad \dfrac{\partial^5\varphi}{\partial x_\alpha^4\,\partial t}, \qquad \dfrac{\partial^4\varphi}{\partial x_\alpha^4}.$ (29.10)

Then scheme (29.4)-(29.6) converges in the grid norm (29.9) with rate $O(\tau^2 + |h|^2)$, so
that

$\|\varphi^j - \varphi(t_j)\|_\Lambda \le M(|h|^2 + \tau^2),$

where $M = \mathrm{const}$ does not depend on $\tau$ and $|h|$.
REMARK 29.1. The alternating direction scheme with varying $k_\alpha$ has an approximation
of order $O(|h|^2 + \tau^2)$ for the solution $\varphi = \varphi(x, t)$, provided that, in addition
to (29.10), the obvious requirements of smoothness of $k_\alpha(x, t)$ in $x_\alpha$ and $t$ are satisfied.
The distinction from the case of constant coefficients reveals itself in the stability
study of the scheme, since though the operators $\Lambda_1$, $\Lambda_2$ are positive and self-adjoint,
they are not commutative. However if, in addition to (29.10), there also exists in
$\Omega \times \Omega_t$ a continuous derivative

$\dfrac{\partial}{\partial x_1}\Bigl(k_2\dfrac{\partial\varphi}{\partial x_2}\Bigr),$

then scheme (29.4)-(29.6) is stable and converges with rate $O(\tau^2 + |h|^2)$ in the case of
varying coefficients $k_\alpha$.
CHAPTER VII
The method of splitting with respect to physical processes consists of reducing the
original evolutionary problem describing complex physical processes to a sequence
of problems describing processes of a more simple structure. To explain the idea of
the method we will consider the equation describing the process of advection-
diffusion of some substance (MARCHUK [1982], pp. 82-90).
The two-dimensional advection-diffusion problem consists of finding the solution
of the equation

$\dfrac{\partial\varphi}{\partial t} + \dfrac{\partial u\varphi}{\partial x} + \dfrac{\partial v\varphi}{\partial y} - \mu\Delta\varphi + \sigma\varphi = f,$ (30.1)

$\varphi = g$ for $t = 0$, (30.2)

$\sigma \ge 0, \qquad \mu \ge 0.$

Assume for simplicity that we find the solution $\varphi(x, y, t)$ of (30.1)-(30.2) on the whole
plane and that the functions $u$ and $v$ satisfy the continuity equation

$\dfrac{\partial u}{\partial x} + \dfrac{\partial v}{\partial y} = 0.$ (30.3)

From the physical point of view problem (30.1)-(30.2) describes, in principle, two
different physical processes. The first one is the process of advection of a substance
along the trajectory and it is described by

$\dfrac{\partial\varphi_1}{\partial t} + \dfrac{\partial u\varphi_1}{\partial x} + \dfrac{\partial v\varphi_1}{\partial y} = 0.$ (30.4)
These processes represent two extreme cases of the problem (30.1)-(30.2). Indeed,
taking in (30.1)-(30.2) $\sigma = 0$ and $\mu = 0$ we obtain problem (30.4)-(30.5), and taking
$u = v = 0$ we obtain problem (30.6)-(30.7).

Let us now try to put (30.4)-(30.5) and (30.6)-(30.7) together. This appears to be
possible because of the local additivity of the processes of advection and diffusion.
Assume that $u$ and $v$ are constants and $g(x, y)$ decreases quickly enough at infinity
so that the representations (30.8) as Fourier integrals take place. Then the exact
solution of the original problem (30.1)-(30.2) has the form (30.9). We assume as before
that $\varphi_1$, $\varphi_2$, $g$ are represented as Fourier integrals as in (30.8). For the Fourier
components of the solution of (30.4)-(30.5) we find

$\hat\varphi_1(t) = A\exp\{(\mathrm{i}mu + \mathrm{i}nv)t\}$ (30.10)

and for those of (30.6)-(30.7),

$\hat\varphi_2(t) = \hat\varphi_1(\tau)\exp\{-[\sigma + \mu(m^2 + n^2)]t\}.$ (30.11)

When we take $t = \tau$ in (30.10) and substitute $\hat\varphi_1(\tau)$ in (30.11), we obtain

$\hat\varphi_2(\tau) = A\exp\{(\mathrm{i}mu + \mathrm{i}nv)\tau - [\sigma + \mu(m^2 + n^2)]\tau\},$

which coincides with the corresponding Fourier component of the exact solution
(30.9). Thus the solution obtained by the splitting method for $t = \tau$ is identical to the
solution of the original problem (30.1)-(30.2). This fact lies at the basis of the method
of splitting with respect to physical processes.
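The identity can be checked directly on the Fourier symbols: for constant coefficients the advection symbol and the diffusion-decay symbol commute, so applying the two stages in sequence over a step $\tau$ reproduces the unsplit component exactly. All numerical values below are illustrative assumptions.

```python
import cmath

u, v = 1.0, -0.5                     # advection velocities
sigma, mu = 0.3, 0.1                 # decay and diffusion coefficients
m, n = 2, 3                          # wave numbers of one Fourier mode
tau, A0 = 0.25, 1.0                  # splitting step and initial amplitude

adv = 1j * (m * u + n * v)           # symbol of the advection stage, cf. (30.10)
dif = -(sigma + mu * (m**2 + n**2))  # symbol of the diffusion-decay stage

split = A0 * cmath.exp(adv * tau) * cmath.exp(dif * tau)  # two stages in turn
exact = A0 * cmath.exp((adv + dif) * tau)                 # unsplit problem
```

With variable $u$, $v$ the symbols no longer commute and the agreement holds only up to the splitting error, which is the point made in the next paragraph.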
Note that in reality $u$ and $v$ are not constant; in this case the splitting algorithm
does not give the exact solution for $t = t_j$, $j = 1, 2, \ldots$. To correct the result in this case
the time interval should be taken small enough to ensure good approximation in
cases when the coefficients $u$ and $v$ vary considerably.
The method of splitting with respect to physical processes is widely used for
solving problems in hydrodynamics, gas dynamics (KOVENYA and YANENKO [1981],
MIGUAL, PINSKY and TAYLOR [1983], GUSHCHIN [1981], TEMAM [1981]), meteor-
ology and oceanology (MARCHUK [1967, 1974a, 1982], MARCHUK, DYMNIKOV et al.
[1984]) etc.
CHORIN [1968] proposed the original method for finding the approximate
solution of the equations for viscous incompressible liquid. This method is based on
Helmholtz's theorem on the decomposition of a vector field. (A vector field may be
decomposed in a unique way into a solenoidal part and a potential part,
provided that the normal component of the solenoidal part equals zero on the boundary.) The
theoretical grounding of the method is given by TEMAM [1981]. YANENKO,
ANUCHINA, PETRENKO and SHOKIN [1970] interpreted this method from the
standpoint of splitting with respect to physical processes. A modification of this
method was used by KUZNETSOV, MOSHKIN and SMAGULOV [1983] to solve the
nontraditional Navier-Stokes problem when the pressure and tangential com-
ponent of the velocity on some part of boundary are given.
boundary of the cell, the mass, momentum and energy of this particle are subtracted
from its "old" cell and are added to the "new" cell. Harlow's scheme is based on the
explicit methods for solving equations of the first and the second stage and is
conditionally stable in the whole scheme.
It is especially fruitful to use implicit schemes in calculations of the first stage. In
this case the criterion of stability of the whole scheme coincides with the well-known
Courant condition. Some improvements of the MAC method in connection with the
introduction of fractional cells and calculations of the locations of markers were
suggested by CHEN with co-authors [1970, 1991] and NICOLS [1973].
$\dfrac{\partial\rho u}{\partial t} + \operatorname{div}(\rho u\mathbf{u}) + \dfrac{\partial p}{\partial x} = 0,$

$\dfrac{\partial\rho}{\partial t} + \operatorname{div}(\rho\mathbf{u}) = 0,$ (32.1)

$p = p(\rho, J), \qquad J = E - \tfrac12\mathbf{u}^2.$
In the next (Lagrangian) stages, the motion of the mass flux (particles) $\Delta M$
through the boundaries of the Euler cells is modelled, the redistribution of mass,
momentum and energy in space takes place, and the final fields of the Euler parameters
are determined.

The equations of these stages may be combined formally and represented in the
form

$\partial\rho u/\partial t + \operatorname{div}(\rho u\mathbf{u}) = 0,$

$\partial\rho v/\partial t + \operatorname{div}(\rho v\mathbf{u}) = 0,$
(32.3)
$\partial\rho/\partial t + \operatorname{div}(\rho\mathbf{u}) = 0,$

$\partial\rho E/\partial t + \operatorname{div}(\rho E\mathbf{u}) = 0.$
Various versions of the numerical schemes of the large particle method along with
some aspects of approximation and stability are presented by BELOTSERKOVSKY and
DAVYDOV [1978], BELOTSERKOVSKY [1984], DAVYDOV and SKOTNIKOV [1978] and
others.
CHAPTER VIII
$\dfrac{\varphi^{j+1/2} - \varphi^j}{\tfrac12\tau} + A_1^{j+1/2}\varphi^{j+1/2} + A_2^j\varphi^j = f^j,$

$\dfrac{\varphi^{j+1} - \varphi^{j+1/2}}{\tfrac12\tau} + A_1^{j+1/2}\varphi^{j+1/2} + A_2^{j+1}\varphi^{j+1} = f^j,$ (33.4)

$\varphi^0 = g,$

where

$A_1^{j+1/2} = A_1(t_{j+1/2}), \qquad A_2^j = A_2(t_j), \qquad f^j = f(t_{j+1/2}), \qquad \tau = t_{j+1} - t_j.$
It is easy to notice that to solve the first equation from (33.4) we need to invert the
triangular matrix $E + \tfrac12\tau A_1^{j+1/2}$ and to solve the second equation we need to invert
the triangular matrix $E + \tfrac12\tau A_2^{j+1}$. SAMARSKII [1964a] noted that this inversion,
when the order of the matrix is more than five, requires fewer operations than
solving problem (33.1) by the explicit two-layer scheme.
If conditions (33.3) are satisfied and $A_2(t)\varphi(t)$ and $\mathrm{d}\varphi/\mathrm{d}t \in C^{(1)}[0, T]$, then (33.4)
has second-order accuracy in $\tau$:

$\|\varphi^j - \varphi(t_j)\| \le C\tau^2, \qquad j = 1, 2, \ldots,$

where $C$ is a positive constant independent of $\tau$ (SAMARSKII [1964a]).
REMARK 33.1. Scheme (33.4) and the above statement remain valid when $A_1$ and $A_2$
are arbitrary linear operators acting in a Hilbert space and satisfying conditions
(33.3) (SAMARSKII [1964a]).
REMARK 33.4. Since the solution of (33.4) may be carried out with explicit formulae,
schemes of the alternating triangular method are also called explicit alternating
direction schemes (IL'IN [1970], p. 109).
We assume that operator $A$ in (33.1) does not depend on $t$ and can be represented as

$A = A_1 + A_2,$

where the only requirement on the $A_\alpha$ is positive semidefiniteness:

$(A_\alpha u, u) \ge 0.$

We write for (33.1) a scheme of the following kind (IL'IN [1966]):

$\dfrac{\varphi^{j+1/2} - \varphi^j}{\tau_j} + A_1\varphi^{j+1/2} + A_2\varphi^j = f^j,$

$\dfrac{\varphi^{j+1} - \varphi^{j+1/2}}{\tau_j} + k_j(A_1\varphi^{j+1/2} - f^j) + A_2(\varphi^{j+1} - (1 - k_j)\varphi^j) = 0,$ (34.1)

$j = 0, 1, \ldots, \qquad \varphi^0 = g,$

where $\tau_j$ and $k_j$ are parameters, $f^j = f(t_{j+1/2})$, and the operator $A$ and vector $f$ are
constructed taking into account the boundary conditions of the original problem
approximated by (33.1). Formally scheme (34.1) coincides for $k_j = 0$ with the
Douglas-Rachford scheme and for $k_j = 1$ with the Peaceman-Rachford scheme.
After excluding $\varphi^{j+1/2}$, (34.1) is transformed to

$\dfrac{\varphi^{j+1} - \varphi^j}{\tau_j(1 + k_j)} + A\,\dfrac{\varphi^{j+1} + k_j\varphi^j}{1 + k_j} + \dfrac{\tau_j}{1 + k_j}A_1A_2(\varphi^{j+1} - \varphi^j) = f^j.$ (34.2)

If, depending on the parameters $\tau_j$ and $k_j$, the time step for the original problem is chosen
as

$\Delta t_j = \tau_j(1 + k_j),$ (34.3)

the order of approximation of (34.1) is equal to $O(\tau_j^2 + (1 - k_j)\tau_j)$. In the case $k_j = 0$ it has
a first-order approximation in $\Delta t_j = \tau_j$ and, in the case $k_j = 1$, it has a second-order
approximation (with respect to $\Delta t_j = 2\tau_j$).
If the approximation error of both equations in (34.1) on the
smooth solution of (33.1) tends to zero in the norm $\|\cdot\|$ for $\tau = \max_j \tau_j \to 0$, then the
solution $\varphi^j$ of the difference scheme (34.1) converges to $\varphi(t_j)$ in the norm

$\|\varphi\|_{A_2} = (\|\varphi\|^2 + \|A_2\varphi\|^2)^{1/2}$ for $\tau \to 0$

(IL'IN [1966]).
Note that we still did not make any assumption concerning the structure of the
operators $A_1$ and $A_2$. Further, the restrictions on $A_\alpha$ in (34.1) are proved to be
REMARK 34.1. IL'IN [1966] noted the possibility of generalizing (34.1) for $k_j = 1$ to
quasi-linear problems, where the error of approximation retains the order $O(\tau_j^2)$.
where the operators $A_\alpha$ are not necessarily related to differential operators with
respect to some single variable (if, of course, they belong to the original problem).
The following scheme is a generalization of (34.1) (IL'IN [1966]):

$\dfrac{\varphi^{j+1/n} - \varphi^j}{\tau_j} + A_1\varphi^{j+1/n} + \sum_{\alpha=2}^{n}A_\alpha\varphi^j = f^j,$ (35.1)

$\ldots$

or, in the equivalent form,

$B\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau_j} + A\varphi^j = f^j,$ (35.2)

where

$B = \dfrac{1}{\omega_j}\prod_{\alpha=1}^{n}(E + \tau_jA_\alpha), \qquad \omega_j = \prod_{\alpha=1}^{n-1}(1 + k_\alpha).$

With $k_\alpha = 1$ scheme (35.1) approximates (33.1) with error $O(\tau^2)$. With $k_\alpha = 0$ it
coincides with the Douglas-Rachford stabilizing correction scheme, which has error
$O(\tau)$.
36. The scheme of the alternating triangular method for the parabolic equation

Let us consider one of the concrete realizations of the alternating triangular method
(IL'IN [1965]) for the two-dimensional heat conduction equation

$\partial\varphi/\partial t - \operatorname{div}(D(x, y)\operatorname{grad}\varphi) = 0,$ (36.1)

where, after difference approximation, the coefficients $a$, $b$, $c$, $d$, $p$ of the resulting
five-point difference equation may be variable. We also suppose that the boundary
conditions for (36.1) are taken into account in the coefficients of the difference
equation.
A scheme of the alternating triangular method has the form

$\dfrac{\varphi_{k,l}^{j+1/2} - \varphi_{k,l}^j}{\Delta t_j} + \dfrac{1}{h^2}\bigl(\tfrac12 p_{k,l}\varphi_{k,l}^{j+1/2} - a_{k,l}\varphi_{k-1,l}^{j+1/2} - b_{k,l}\varphi_{k,l-1}^{j+1/2}\bigr) + \dfrac{1}{h^2}\bigl(\tfrac12 p_{k,l}\varphi_{k,l}^j - c_{k,l}\varphi_{k+1,l}^j - d_{k,l}\varphi_{k,l+1}^j\bigr) = 0,$

$\dfrac{\varphi_{k,l}^{j+1} - \varphi_{k,l}^{j+1/2}}{\Delta t_j} + \dfrac{1}{h^2}\bigl(\tfrac12 p_{k,l}\varphi_{k,l}^{j+1/2} - a_{k,l}\varphi_{k-1,l}^{j+1/2} - b_{k,l}\varphi_{k,l-1}^{j+1/2}\bigr) + \dfrac{1}{h^2}\bigl(\tfrac12 p_{k,l}\varphi_{k,l}^{j+1} - c_{k,l}\varphi_{k+1,l}^{j+1} - d_{k,l}\varphi_{k,l+1}^{j+1}\bigr) = 0.$

Computational stability is ensured by the conditions

$1 + \dfrac{\Delta t_j}{4h^2}p_{k,l} \ge \dfrac{\Delta t_j}{2h^2}(|a_{k,l}| + |b_{k,l}|),$
(36.5)
$1 + \dfrac{\Delta t_j}{4h^2}p_{k,l} \ge \dfrac{\Delta t_j}{2h^2}(|c_{k,l}| + |d_{k,l}|),$
which must hold for all grid points except, maybe, for some of them. If we introduce
the operators $A_1$ and $A_2$,

$A_1\varphi_{k,l} = \dfrac{1}{h^2}(\tfrac12 p_{k,l}\varphi_{k,l} - a_{k,l}\varphi_{k-1,l} - b_{k,l}\varphi_{k,l-1}),$

$A_2\varphi_{k,l} = \dfrac{1}{h^2}(\tfrac12 p_{k,l}\varphi_{k,l} - c_{k,l}\varphi_{k+1,l} - d_{k,l}\varphi_{k,l+1}),$

the scheme may be written in the factorized form

$(E + \Delta t_jA_1)\varphi^{j+1/2} = (E - \Delta t_jA_2)\varphi^j,$

$(E + \Delta t_jA_2)\varphi^{j+1} = (E - \Delta t_jA_1)\varphi^{j+1/2}.$
REMARK 36.1. Besides the papers mentioned in this section the alternating
triangular schemes (or "explicit alternating direction schemes") are also considered
in papers by SAUL'EV [1960], IL'IN [1967], LEES [1961] and others. The alternating
triangular method as an iterative method for solving systems of algebraic equations
is thoroughly treated by SAMARSKII and NIKOLAEV [1978] and IL'IN [1970].
CHAPTER IX
37. The stationing method: General concepts of the theory of iterative methods
The solution of many stationary problems with positive operators may be regarded
as a limiting solution of a nonstationary problem for t-oo. While solving
a stationary problem by methods of asymptotic stationing we pay no attention to
intermediate values of the solution as they are of no interest. In the case of
a nonstationary problem these values do have a physical meaning. In general, this is
precisely what unites and divides these classes of problems.
Assume that we have a system of linear algebraic equations (representing, for
example, the result of the approximation of some stationary problem in mathematical
physics by the finite difference method)

$A\varphi = f,$ (37.1)

where

$A > 0, \qquad \varphi \in \Phi, \qquad f \in F.$

We consider instead of (37.1) the nonstationary problem

$\partial\psi/\partial t + A\psi = f,$

and represent $\psi$, $\varphi$ and $f$ in the form of expansions in the basis $\{u_n\}$, where

$Au_n = \lambda_nu_n, \qquad A^*u_n^* = \lambda_nu_n^*,$

$\psi_n = (\psi, u_n^*), \qquad \varphi_n = (\varphi, u_n^*), \qquad f_n = (f, u_n^*),$

and $\{u_n\}$ and $\{u_n^*\}$ represent biorthogonal bases. Then using well-known tools we
obtain, on the one hand, problems for the Fourier coefficients

$\lambda_n\varphi_n = f_n, \qquad \dfrac{\mathrm{d}\psi_n}{\mathrm{d}t} + \lambda_n\psi_n = f_n,$

so that

$\varphi = \sum_n (f_n/\lambda_n)u_n,$

$\psi = \sum_n (f_n/\lambda_n)(1 - \mathrm{e}^{-\lambda_nt})u_n,$

whence $\psi \to \varphi$ as $t \to \infty$.
Consider now the simplest difference scheme

$\dfrac{\psi^{j+1} - \psi^j}{\tau} + A\psi^j = f.$

Then

$\psi^{j+1} = \psi^j - \tau(A\psi^j - f).$

If it is our aim to solve the stationary problem, then, with a certain relationship
between $\tau$ and the spectral bounds of $A$, we have

$\lim_{j\to\infty}\psi^j = \varphi.$
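The simplest stationing iteration can be tried out directly; the matrix, right-hand side and parameter below are illustrative assumptions.

```python
import numpy as np

# Stationing: psi^{j+1} = psi^j - tau*(A psi^j - f). For positive definite A
# and suitable tau the iterates approach the solution of A phi = f.

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
f = np.array([1.0, 0.0])
tau = 0.3                        # tau < 2/lambda_max = 2/3 guarantees decay
psi = np.zeros(2)                # arbitrary initial vector
for _ in range(200):
    psi = psi - tau * (A @ psi - f)
```

The intermediate iterates have no physical meaning here; only the limit matters, which is exactly the distinction between stationary and nonstationary problems drawn above.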
$B_j\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau_j} = -(A\varphi^j - f),$ (37.2)

or, in resolved form,

$\varphi^{j+1} = \varphi^j - H_j(A\varphi^j - f), \qquad H_j = \tau_jB_j^{-1}.$ (37.3)

Vectors $\xi^j = A\varphi^j - f$ are usually called the residual vectors of the iterative method
(37.2), and vectors $\varepsilon^j = \varphi^j - \varphi^*$ (where $\varphi^* = A^{-1}f$ is the exact solution of (37.1)) are called
the error vectors of this method. Subtracting the vector $\varphi^*$ from both sides of
relationship (37.3) and replacing $f$ with $A\varphi^*$ we obtain a relationship for the
sequence of error vectors:

$\varepsilon^{j+1} = T_j\varepsilon^j,$ (37.4)

where the matrix

$T_j = E - H_jA$ (37.5)

is called the $j$th step operator of the iterative method (37.2). Multiplying (37.4) by the
matrix $A$ we obtain a relationship for the sequence of residual vectors:

$\xi^{j+1} = (E - AH_j)\xi^j.$ (37.6)
We will call the iterative method (37.2) convergent if the sequence $\{\varphi^j\}$ converges
to the exact solution $\varphi^*$ of (37.1) for any initial vector. It is obvious that the
convergence of the sequences $\{\varepsilon^j\}$ and $\{\xi^j\}$ to the zero vector for any $\varepsilon^0$ and $\xi^0$
($\xi^0 = A\varepsilon^0$) is a necessary and sufficient convergence condition for the iterative method
(37.2).
The iterative method (37.2) is called stationary if the matrix $H_j$ does not depend on
the iteration number (the operator $T_j$ is then a constant matrix); otherwise it is called
nonstationary. We will specially distinguish the class of cyclic iterative methods,
which may be regarded both as stationary and as nonstationary iterative methods.
A method is called cyclic if it has the property $H_j = H_{j+S}$ for any $j \ge 0$ and for some
fixed $S \ge 1$. It is easy to see that, combining each $S$ successive iterations into one, we
come to a stationary iterative process of the kind

$\varphi^{j+1} = \varphi^j - H(A\varphi^j - f),$ (37.7)

where $H$ is determined by the equation

$E - HA = \prod_{i=0}^{S-1}(E - H_iA).$ (37.8)
On the other hand the cyclic iterative methods, as originally stated, belong to the
nonstationary class.
One of our major tasks in studying iterative methods will be the problem of
optimization, i.e. the choice of the sequence of matrices $\{H_j\}$ from a given class in
order to obtain a more effective computational process.

For the effective realization of method (37.2) the operator $B_j$ should have a simpler
structure compared to $A$. In many cases of practical interest the operator
$B_j$ has the form

$B_j = \prod_{i=1}^{m}(E + \tau_jB_i),$ (38.1)

where the $B_i$ are some matrices of the same order $N$ as the matrix $A$. These matrices
are chosen in such a way that the matrices $(E + \tau_jB_i)$ might be easily inverted, i.e. the
inversion of the whole matrix $B_j$ might be carried out more easily than the inversion
of $A$. The matrices $B_i$ are often chosen with the representation (splitting) of the
matrix $A$ as the sum

$A = \sum_{k=1}^{n}A_k$ (38.2)

taken into account.
First we take $n = m = 2$, $\tau_j = \tau$, and give the corresponding iterative algorithms for
solving the system (37.1) which are constructed using the splitting and alternating
direction schemes of the previous sections.

(1) We get the alternating direction method (PEACEMAN and RACHFORD [1955]) for
$B_i = A_i$. We obtain from (37.2) an algorithm of the kind

$\dfrac{\varphi^{j+1/2} - \varphi^j}{\tau_j} + A_1\varphi^{j+1/2} + A_2\varphi^j = f,$
(38.4)
$\dfrac{\varphi^{j+1} - \varphi^{j+1/2}}{\tau_j} + A_1\varphi^{j+1/2} + A_2\varphi^{j+1} = f.$

(2) We get the stabilizing correction scheme when $B_i = A_i$ (DOUGLAS and
RACHFORD [1956]):

$(E + \tau_jA_1)(E + \tau_jA_2)\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau_j} = -(A\varphi^j - f).$ (38.5)
(3) The splitting (fractional steps) method for an arbitrary value $m = n \ge 2$ may be
represented in the form (MARCHUK and YANENKO [1966])

$\dfrac{\varphi^{j+1/n} - \varphi^j}{\tau_j} + A_1(\varphi^{j+1/n} - \varphi^j) = -\Bigl(\sum_{k=1}^{n}A_k\varphi^j - f\Bigr),$

$\dfrac{\varphi^{j+k/n} - \varphi^{j+(k-1)/n}}{\tau_j} + A_k(\varphi^{j+k/n} - \varphi^j) = 0,$ (38.7)

$k = 2, \ldots, n,$
or, equivalently,

$B_j\,\dfrac{\varphi^{j+1} - \varphi^j}{\tau_j} = -(A\varphi^j - f),$ (38.8)

where

$B_j = \prod_{k=1}^{n}(E + \tau_jA_k).$
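The splitting iteration in the form (38.8) can be sketched as follows; the matrices, step and iteration count are illustrative assumptions.

```python
import numpy as np

def splitting_iteration(A_list, f, tau, steps):
    """Iterate B (phi^{j+1}-phi^j)/tau = -(A phi^j - f), B = prod (E + tau*A_k)."""
    A = sum(A_list)
    I = np.eye(len(f))
    phi = np.zeros_like(f)
    for _ in range(steps):
        corr = -(A @ phi - f)                 # residual -(A phi^j - f)
        for A_k in A_list:                    # invert the factors one by one
            corr = np.linalg.solve(I + tau * A_k, corr)
        phi = phi + tau * corr
    return phi
```

Only the easily invertible factors $E + \tau A_k$ are ever solved with, never $A$ itself, which is the whole point of constructing $B_j$ in the factorized form (38.1).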
Suppose that

$\delta E \le A_\alpha \le \Delta E, \qquad \alpha = 1, 2,$

where $\delta$ and $\Delta$ are some positive constants. Then the effective value of the parameter
$\tau$ in the stationary ($\tau_j \equiv \tau$) version of method (38.3) is taken from the formula $\tau = (\delta\Delta)^{-1/2}$,
since for this value of $\tau$ the eigenvalues of the matrix

$B^{-1}A = (E + \tau A_2)^{-1}(E + \tau A_1)^{-1}A$ (39.1)

are real, nonnegative and belong to the interval $[a, b]$ with the boundaries

$a = \dfrac{\delta\Delta}{(\sqrt\delta + \sqrt\Delta)^2}, \qquad b = \tfrac14\sqrt{\delta\Delta}.$ (39.2)

The Chebyshev acceleration parameters are then expressed through the Chebyshev
polynomials $C_k(\xi)$:

$\omega_{k+1} = 2\xi\,C_k(\xi)/C_{k+1}(\xi), \qquad k = 0, 1, 2, \ldots,$ (39.3)

where $\xi = (b + a)/(b - a)$.
For an arbitrary T >O0 it is advisable to use the generalized conjugate gradient
method to accelerate method (38.3) (MARCHUK and KUZNETSOV [1972], CONCUS,
problem. Besides, we are going to consider the aspects of grounding the transitions
from a matrix evolutionary problem (2) to the systems of algebraic equations.
Suppose we make a direct transition from this problem to the systems of difference
equations by splitting methods. To ground this stage we must study two problems:
the approximation of the obtained scheme and its stability. When these are
established, the convergence of the difference scheme's solution to the exact solution
of problem (2) takes place (see the Introduction for Convergence Theorem 3.1).
If we preliminarily split the matrix evolutionary equation (2) into the system

$\partial\varphi_\alpha/\partial t + A_\alpha\varphi_\alpha = f_\alpha, \qquad \alpha = 1, \ldots, n,$ (3)

it is necessary to find out in what sense the solutions $\{\varphi_\alpha\}$ approximate the exact
solution $\varphi$, i.e. to ground this stage of solving the problem. The study of further
stages of solving each of the equations (3) is based, as mentioned above, on the studies of
approximation and stability. Hence, to prove the convergence of splitting methods it
is necessary to know how to study the following problems:
(1) the transition from system (2) to the splitting system (3);
(2) the approximation of splitting schemes;
(3) the stability of splitting schemes;
(4) the convergence of iterative methods for solving system (1), using the splitting
technique.
Consider, first of all, the methods used to study approximation. On the whole,
these are methods which are well developed in the theory of finite difference methods
(expansion in Taylor series, etc.). It should be taken into account here that in the
splitting methods it is not necessary that the solution of the scheme approximate the
solution of problem (2) on the fractional steps. Therefore, the approximation is
often checked for the "integer" equations of the scheme (i.e. after excluding the
"intermediate" equations of the scheme). Note one particular method that is
frequently used for establishing approximation in the splitting methods: after the
scheme at the integer steps is obtained (and after a possible expansion of the
transition operator in the step of the time grid), it is compared with one of the
well-studied schemes, e.g. the implicit scheme of first-order accuracy, the
Crank-Nicolson second-order scheme, etc. Frequently this comparison makes it
possible to prove that the order of accuracy of the splitting scheme is equivalent to
that of one of these schemes, which means that the approximation of the scheme
under study is established.
This method has been used in Part I for establishing the approximation orders of the
schemes considered there. Taking the above notes into account, we come to the
conclusion that the first, the third, and the fourth of the abovementioned problems
are specific to the theory of splitting and alternating direction methods. In this part
we will focus our attention on the description of approaches to the solution
of these problems.
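The comparison technique just described can be tried out numerically. The sketch below (with toy diagonal operators of our own choosing, not taken from the text) measures how fast the transition operator of a factorized splitting scheme approaches the Crank-Nicolson operator as the step decreases; for commuting parts the per-step difference is O(τ³), so halving τ divides it by roughly eight.

```python
import numpy as np

# Toy commuting (diagonal) splitting parts; these are our own illustrative choices.
I = np.eye(3)
A1 = np.diag([0.5, 1.0, 2.0])
A2 = np.diag([1.5, 0.7, 1.1])
A = A1 + A2

def T_split(tau):
    # Transition operator (E + tau/2 A2)^{-1}(E + tau/2 A1)^{-1}(E - tau/2 A1)(E - tau/2 A2)
    return np.linalg.inv(I + tau/2*A2) @ np.linalg.inv(I + tau/2*A1) \
         @ (I - tau/2*A1) @ (I - tau/2*A2)

def T_cn(tau):
    # Crank-Nicolson transition operator (E + tau/2 A)^{-1}(E - tau/2 A)
    return np.linalg.inv(I + tau/2*A) @ (I - tau/2*A)

d1 = np.linalg.norm(T_split(0.1) - T_cn(0.1))
d2 = np.linalg.norm(T_split(0.05) - T_cn(0.05))
print(d1 / d2)   # the ratio tends to 8 as tau -> 0 (O(tau^3) agreement per step)
```

Since the two operators agree to third order locally, the splitting scheme inherits the second-order global accuracy of Crank-Nicolson, which is exactly the comparison argument used in the text.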
CHAPTER X
Convergence Studies of Splitting Schemes
One of the general approaches to convergence studies of the splitting schemes is the
use of the Fourier method (spectral method). Despite a number of restrictions of
this method, it is widely used both for analyzing the stability and convergence of
various difference schemes and for convergence studies of iterative algorithms.
Consider the main concepts of this method.
Assume that we solve the stationary problem (1) or the nonstationary problem (2)
with the help of one of the splitting schemes described in Part I, which can be
reduced to the form
φ^{j+1} = Tφ^j + τF^j,  φ⁰ = g.  (40.1)
According to Chapter IX, scheme (40.1) should be treated as an iterative solution
process for the stationary problem (1), and as a difference scheme for the
nonstationary problem (2). Many of the schemes described in Chapters II-IX
belong to this class of schemes. (The transition from schemes with fractional steps to
schemes of type (40.1) is usually carried out by exclusion of the intermediate
equations.)
The most important problem here is the convergence of the approximate solution,
obtained by scheme (40.1), to the exact solution of problem (1) or (2). The con-
vergence of pJ to the exact solution of the stationary problem (1) is considered as
the convergence of the iterative process (40.1), while the convergence to the exact
solution of the nonstationary problem (2) is to be regarded as the convergence of the
approximate solution of the difference scheme (40.1) when the time step T is being
decreased and, in general, may be included in operator T. To prove the convergence
in the latter case, it is necessary to study the approximation and stability of scheme
(40.1).
Assume that the approximation of (2) by scheme (40.1) is established. As
mentioned above, this can be done by classical methods that are developed in the
(φ_k, φ*_m) = δ_{km} = { 1, k = m;  0, k ≠ m }.  (40.4)
Assume that the difference scheme under study is written in the form (40.16). Then
the harmonic (40.14) can be a solution of (40.16) if the corresponding difference
dispersion equation (40.17), which relates ω and k, holds.
Let ρ = e^{−iωτ}, where ω is determined by the dispersion equation (40.17). Then the
condition for the stability of scheme (40.16) is the following:
|ρ| ≤ 1.  (40.18)
The value ρ = ρ(k) is called the growth coefficient of the harmonic (40.14).
As we can see, this value precisely determines the stability of scheme (40.16).
Such a method for studying stability was applied in YANENKO's [1967]
monograph not only to (40.16) but also to three-layer schemes and to
fractional-step schemes, where the difference dispersion equation is written in terms
of the coefficient ρ.
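As a concrete illustration of the growth coefficient, the following sketch (our own example, not from the text) evaluates ρ(k) for the Crank-Nicolson approximation of the heat equation u_t = u_xx; substituting the harmonic u_m^j = ρ^j e^{ikmh} into that scheme gives ρ(k) = (1 − 2r sin²(kh/2))/(1 + 2r sin²(kh/2)) with r = τ/h², so |ρ| ≤ 1 for every harmonic and every r > 0.

```python
import numpy as np

# Growth coefficient of Crank-Nicolson for u_t = u_xx on a grid with step h.
h, tau = 0.01, 0.5            # deliberately large ratio r = tau/h^2
r = tau / h**2
k = np.arange(1, 100)         # a range of harmonics
s = np.sin(k * h / 2.0)**2
rho = (1 - 2*r*s) / (1 + 2*r*s)
print(np.abs(rho).max())      # at most 1 for every k: absolute stability
```

Repeating the same computation for the explicit scheme (ρ = 1 − 4rs) would show |ρ| > 1 once r > 1/2, which is how the growth coefficient separates conditionally and absolutely stable schemes.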
Let us show that this method for analyzing stability, applied to scheme (40.1)
with F^j = 0, leads to the spectral criterion formulated earlier.
Let problems (40.2) and (40.3) determine, as before, two complete systems of
eigenfunctions {φ_k}, {φ*_k}, orthonormalized according to (40.4). Consider the
harmonic
φ^j = φ₀ e^{−iωτj} φ_k,  (40.19)
where k is a fixed integer and φ₀ = const. Then (40.19) may be a solution of scheme
(40.1) with F^j = 0 under the conditions
41. The Fourier method and the convergence studies of splitting schemes for
stationary problems
A = Σ_{α=1}^n A_α,  (41.2)
and consider the iterative scheme
∏_{α=1}^n (E + (τ_α/2) A_α) (φ^{j+1} − φ^j)/τ + Aφ^j = f,  (41.3)
where τ and the τ_α are arbitrary relaxation parameters, τ, τ_α > 0. The difference equation
(41.3) was formulated by DYAKONOV [1971d], DOUGLAS and GUNN [1964] and
SAMARSKII [1977] (assuming that τ_α = τ). MARCHUK and YANENKO [1966] considered
(41.3) as a scheme for constructing various difference methods.
Obviously, scheme (41.3) is a special case of the general universal algorithm (see
MARCHUK et al. [1981]):
B (φ^{j+1} − φ^j)/τ + Aφ^j = f,  (41.4)
where
B = ∏_{α=1}^n (E + (τ_α/2) A_α).
The realization of each step of the iterative process (41.3) may be
represented as
(E + (τ₁/2)A₁) ξ^{j+1/n} = τ(f − Aφ^j),
(E + (τ₂/2)A₂) ξ^{j+2/n} = ξ^{j+1/n},
⋯⋯⋯  (41.5)
(E + (τ_n/2)A_n) ξ^{j+1} = ξ^{j+(n−1)/n},
φ^{j+1} = φ^j + ξ^{j+1},
where the ξ^{j+α/n} are auxiliary correction vectors.
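To make the realization concrete, here is a minimal sketch of the universal algorithm (41.4) with a two-operator factorization of B, solved factor by factor; the sizes, operators and parameters below are our own toy choices, not the book's.

```python
import numpy as np

# Iterate B (phi^{j+1} - phi^j)/tau + A phi^j = f with
# B = (E + tau1/2 A1)(E + tau2/2 A2), realized as two successive solves.
n = 4
E = np.eye(n)
A1 = np.diag([1.0, 2.0, 3.0, 4.0])   # splitting A = A1 + A2, A_alpha >= 0
A2 = 2.0 * E
A = A1 + A2
f = np.ones(n)
tau = tau1 = tau2 = 0.4              # relaxation parameters (toy values)

phi = np.zeros(n)
for j in range(200):
    resid = f - A @ phi
    xi = np.linalg.solve(E + tau1/2*A1, tau * resid)   # first fractional step
    xi = np.linalg.solve(E + tau2/2*A2, xi)            # second fractional step
    phi = phi + xi          # phi^{j+1} = phi^j + tau * B^{-1} (f - A phi^j)
print(np.linalg.norm(A @ phi - f))   # residual tends to zero: convergence to A^{-1} f
```

Each factor is cheap to invert on its own (in practice a tridiagonal solve per direction), which is the point of the factorized form of B.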
f = Σ_k f_k u_{1k} u_{2k} ⋯ u_{nk}.  (41.9)
We will also seek the solution of (41.3) in the form of the series
φ^j = Σ_k φ^j_k u_{1k} u_{2k} ⋯ u_{nk}.  (41.10)
Substituting (41.9) and (41.10) into (41.3) and taking the inner product of the
result with u*_{1k}, u*_{2k}, …, u*_{nk}, we get the simplest equations for the Fourier
coefficients φ^j_k:
∏_{α=1}^n (1 + (τ_α/2) λ^{(α)}_k) (φ^{j+1}_k − φ^j_k)/τ + Λ_k φ^j_k = f_k,  k = 1, 2, …,  (41.11)
where
Λ_k = Σ_{α=1}^n λ^{(α)}_k,  (41.12)
whence
φ^{j+1}_k = [∏_{α=1}^n (1 + (τ_α/2) λ^{(α)}_k) − τΛ_k] / ∏_{α=1}^n (1 + (τ_α/2) λ^{(α)}_k) · φ^j_k
  + τ f_k / ∏_{α=1}^n (1 + (τ_α/2) λ^{(α)}_k).  (41.13)
Assume that the spectrum {λ^{(α)}_k} is positive. Then it is easy to formulate the
convergence criterion for the iterative process (41.13). To this end, consider the value
q_k = [∏_{α=1}^n (1 + (τ_α/2) λ^{(α)}_k) − τΛ_k] / ∏_{α=1}^n (1 + (τ_α/2) λ^{(α)}_k).  (41.14)
As follows from Section 40, the convergence criterion for the iterative process (41.13)
(and therefore for (41.3) as well) is the condition
|q_k| < 1.  (41.15)
This means that the following inequality must hold:
0 < τΛ_k < 2 ∏_{α=1}^n (1 + (τ_α/2) λ^{(α)}_k).
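A quick numerical check of criterion (41.15) on a sample positive spectrum can be sketched as follows; the spectra and parameters are our own illustrative choices.

```python
import numpy as np

# Evaluate q_k of (41.14) for many harmonics k and verify |q_k| < 1.
rng = np.random.default_rng(1)
n = 3                                          # number of splitting operators
lam = rng.uniform(0.1, 10.0, size=(n, 1000))   # lambda_k^(alpha) > 0
tau_a = np.full(n, 0.5)                        # relaxation parameters tau_alpha
tau = 0.5

prod = np.prod(1 + 0.5 * tau_a[:, None] * lam, axis=0)
Lam = lam.sum(axis=0)                          # Lambda_k = sum over alpha
q = (prod - tau * Lam) / prod                  # the values q_k of (41.14)
print(np.abs(q).max())                         # strictly below 1: convergence
```

Since the product of factors (1 + τ_α λ/2) always exceeds τΛ_k/2 for positive spectra, the inequality |q_k| < 1 holds automatically here for any τ, τ_α > 0.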
42. The Fourier method and the grounding of the splitting schemes for
nonstationary problems
∏_{α=1}^n (E + (τ/2) A_α) (φ^{j+1} − φ^j)/τ + Aφ^j = f^{j+1/2}.  (42.2)
Here the A_α are operators with Σ_{α=1}^n A_α = A. For the sake of simplicity, consider A and the A_α
to be independent of t.
Scheme (42.2) approximates problem (42.1) with second-order accuracy in τ when
the solution of (42.1) is smooth enough. If the operators A_α commute and generate a
common basis of eigenfunctions, the eigenvalues {λ^{(α)}_k} of which are nonnegative,
then scheme (42.2) is absolutely stable according to von Neumann.
Indeed, let us show that problem (42.1) is approximated by scheme (42.2).
Consider the Taylor series
φ^{j+1} = φ^{j+1/2} + (τ/2) φ_t^{j+1/2} + (τ²/8) φ_tt^{j+1/2} + ⋯,
φ^j = φ^{j+1/2} − (τ/2) φ_t^{j+1/2} + (τ²/8) φ_tt^{j+1/2} − ⋯.  (42.3)
Substituting (42.3) into (42.2), we get
φ_t^{j+1/2} + Aφ^{j+1/2} = f^{j+1/2} + O(τ²),  (42.4)
i.e. scheme (42.2) has second-order approximation in τ. To study the stability,
write (42.2) in the form
φ^{j+1} = Tφ^j + τF^j,  (42.5)
where
T = E − τB^{−1}A,
B = ∏_{α=1}^n (E + (τ/2) A_α),
F^j = B^{−1} f^{j+1/2}.
As follows from Section 40, the following condition is sufficient for the stability of
scheme (42.5):
|λ_k(T)| ≤ 1,  k = 1, 2, …,  (42.6)
where the λ_k(T) are the eigenvalues of the operator T, which, as can easily be seen, under the
above conditions coincide with the values q_k of (41.14) for τ_α = τ. Since λ^{(α)}_k ≥ 0 and
τ > 0, inequality (42.6) holds.
Hence, scheme (42.2) is stable.
The realization scheme of (42.2) coincides with (41.5).
(2) Consider now another scheme of the splitting method, introduced by
DYAKONOV [1962f] for the solution of problem (42.1):
∏_{α=1}^n (E + (τ/2) A_α) φ^{j+1} = ∏_{α=1}^n (E − (τ/2) A_α) φ^j + τ f^{j+1/2}.  (42.7)
The Fourier method may also be used to ground this splitting scheme.
Scheme (42.7) is a second-order approximation in τ of problem (42.1) on
sufficiently smooth solutions. If the operators A_α commute and generate a common
basis of eigenfunctions, the eigenvalues {λ^{(α)}_k} of which are nonnegative, then scheme
(42.7) is absolutely stable.
To prove this statement, consider the Taylor series (42.3) and substitute them into
(42.7). As a result we get (42.4). Therefore, scheme (42.7) approximates (42.1)
with order O(τ²).
To study the stability of (42.7) let us introduce the notations
B = ∏_{α=1}^n (E + (τ/2) A_α),  C = ∏_{α=1}^n (E − (τ/2) A_α).
Then (42.7) can be reduced to the form (42.5):
φ^{j+1} = Tφ^j + τF^j,  (42.8)
where
T = B^{−1}C,  F^j = B^{−1} f^{j+1/2}.
According to Section 40, the following condition is sufficient for the stability of
scheme (42.8):
|λ_k(T)| ≤ 1,  (42.9)
where the λ_k(T) are the eigenvalues of the transition operator T.
Because of the restrictions imposed on the A_α it is easy to prove that
λ_k(T) = ∏_{α=1}^n (1 − (τ/2) λ^{(α)}_k) / (1 + (τ/2) λ^{(α)}_k).  (42.10)
Hence, condition (42.9) holds if λ^{(α)}_k ≥ 0, i.e. the splitting scheme (42.7) is absolutely
stable.
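The absolute stability expressed by (42.10) is easy to confirm numerically: for nonnegative λ each factor (1 − τλ/2)/(1 + τλ/2) has modulus at most 1, and so does their product. The sample spectra and step sizes below are our own choices.

```python
import numpy as np

# Check that the eigenvalues (42.10) of T = B^{-1}C stay inside the unit disc
# for any tau > 0 once all lambda_k^(alpha) are nonnegative.
rng = np.random.default_rng(2)
lam = rng.uniform(0.0, 50.0, size=(4, 500))    # lambda_k^(alpha) >= 0
for tau in (0.01, 0.1, 1.0, 10.0):
    factors = (1 - 0.5*tau*lam) / (1 + 0.5*tau*lam)
    lamT = np.prod(factors, axis=0)            # eigenvalues of the transition operator
    assert np.abs(lamT).max() <= 1.0
print("absolutely stable for all tested tau")
```

Note that for large τλ the factors approach −1, so the scheme, though stable, damps stiff components only weakly; this is a known practical caveat of Crank-Nicolson-type factorized schemes.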
(3) Let us demonstrate the application of the Fourier method for grounding the
splitting scheme considered by MARCHUK [1968] for the solution of the
nonstationary problem (42.1).
First of all, find the solution of (42.1) on the interval t_j ≤ t ≤ t_{j+1/2}, where
t_{j+1/2} = t_j + τ/2, by the splitting method
(φ^{j+α/2n} − φ^{j+(α−1)/2n})/(τ/2) + A_α φ^{j+α/2n} = f_α^{j+1/2},  α = 1, …, n.  (42.11)
To show this, consider the first equation of (42.13). This is a first-order scheme for
(42.1) on the interval t_j < t ≤ t_{j+1/2} for the class of smooth solutions.
Indeed,
(E + (τ/2)A₁)(E + (τ/2)A₂) ⋯ (E + (τ/2)A_n) = E + (τ/2)A + τ²R,  (42.14)
where
R = ¼(A₁A₂ + ⋯) + (τ/8)(A₁A₂A₃ + ⋯) + ⋯ + ¼(τ/2)^{n−2} A₁A₂ ⋯ A_n.
It follows from (42.13) and (42.14) that
(φ^{j+1} − φ^j)/τ + A(φ^{j+1} + φ^j)/2 = f^{j+1/2} + O(τ²).  (42.17)
Hence, scheme (42.13) coincides with the well-known Crank-Nicolson scheme to
within O(τ²). It thus follows that (42.13) approximates (42.1) with second-order
accuracy in τ.
To study the stability we exclude the intermediate steps and express φ^{j+1} in terms of φ^j.
Denote
B = ∏_{α=1}^n (E + (τ/2) A_α).
It is evident that the Fourier method plays an important role in grounding the
splitting schemes. It should be noted, however, that the use of this method
presupposes not only the general requirements (for example, completeness of the
eigenfunctions of the operator) which have been considered in Section 40, but also
some additional restrictions in each particular case. For example, in Sections 41 and
42 it was required that the operators A_α be commutative and have a nonnegative
spectrum. These and other restrictions make it difficult to use the Fourier method
for studying the convergence of the splitting schemes.
In the following chapter, we will consider the a priori estimates method which,
in some cases, allows one to ground a whole class of splitting schemes under weaker
restrictions.
CHAPTER XI
The A Priori Estimates Method
stability establishes the continuous dependence of the solution on the initial data in
the case of a problem, (43.1), with a discrete argument. Indeed, as follows from
estimate (43.2), small variations of the input data f lead to small variations of the
solution φ.
Thus, the definition of stability in the form (43.2) ties the solution itself to
a priori information about the input data of the problem. For analyzing the stability of
many of the splitting schemes this definition is more convenient and informative
than von Neumann's definition of stability.
First of all, we consider the simplest a priori estimates obtained on the basis of this
definition and show their application for studying the convergence of various
splitting schemes.
While solving nonstationary problems by the splitting method (fractional steps
method) the difference scheme may often be reduced to the form (40.1):
φ^{j+1} = Tφ^j + τF^j,  φ⁰ = g,  (43.3)
where T is the transition operator from one time layer to another, acting in some
Hilbert space Φ, and g, F^j ∈ Φ.
To analyze approximation and stability it is convenient to write the scheme in the
form (43.3) (especially when approximation and even stability do not occur on each
of the intermediate steps).
It was already noted in Section 40 that many splitting schemes are reduced to the
form (43.3) by excluding fractional (intermediate) steps.
Assume that the approximation of the exact problem by scheme (43.3) has been
established. Let us see what conditions should be imposed on the operator T to
obtain the stability estimate (43.2).
The formal solution of (43.3) has the form
φ^j = T^j g + τ Σ_{i=1}^{j} T^{j−i} F^{i−1},  0 < jτ ≤ const,  (43.4)
whence
‖φ^j‖_Φ ≤ ‖T‖^j ‖g‖_Φ  (43.5)
 + τ(1 + ‖T‖ + ⋯ + ‖T‖^{j−1}) max_i ‖F^i‖_Φ.
Assuming that
‖T‖ ≤ 1,  (43.6)
we obtain an estimate of the form (43.2), which follows from (43.5):
‖φ‖_{Φ̄} ≤ ‖g‖_Φ + const · max_i ‖F^i‖_Φ,  (43.7)
where
‖φ‖_{Φ̄} = max_j ‖φ^j‖_Φ.
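The estimate (43.7) is easy to observe experimentally. The sketch below (our own random example) evolves φ^{j+1} = Tφ^j + τF^j with a contraction T (so that (43.6) holds) and checks that the solution stays under the bound ‖g‖ + Jτ·max_i ‖F^i‖.

```python
import numpy as np

# Verify the a priori bound for a random T with ||T||_2 = 0.9 <= 1.
rng = np.random.default_rng(3)
n = 5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
T = 0.9 * Q                          # orthogonal matrix scaled: ||T|| = 0.9
tau, J = 0.01, 100                   # 0 < J*tau <= const = 1
g = rng.standard_normal(n)
F = rng.standard_normal((J, n))      # arbitrary bounded right-hand sides

phi = g.copy()
for j in range(J):
    phi = T @ phi + tau * F[j]
bound = np.linalg.norm(g) + J * tau * np.linalg.norm(F, axis=1).max()
print(np.linalg.norm(phi) <= bound + 1e-12)
```

Replacing 0.9 by any factor above 1 makes the bound fail for large J, which is why condition (43.6) on the transition operator is essential.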
Consider now a splitting scheme of the form (43.9), written on fractional steps
k = 1, …, p.
This scheme was applied by YANENKO ([1967], p. 149) for solving the evolutionary
problem for f = 0. The operator A is assumed to be split as
A = Σ_{k=1}^p A_k,
and for the evolutionary problem
∂φ/∂t + Aφ = 0 in Ω_t,
φ = g for t = 0,
the scheme is written on the fractional steps as
A_k φ^{j+k/p} = B_k φ^{j+(k−1)/p},  (43.11)
where
A_k = E − τA_{k,k},  B_k = E + τA_{k,k−1},  k = 1, …, p.
Excluding the intermediate equations and using the fact that the operators A_k and
B_k are commutative, we obtain the two-layer scheme (43.12) with integer steps, whose
right-hand side F^j collects the terms with products of the operators A_{k,k} and A_{k,k−1}.
44. A priori estimates for splitting schemes of the type A_j φ^{j+1} = B_j φ^j + τ_j f^j
Then (44.1) is absolutely stable, and the following a priori estimate holds for its
solution:
‖φ^{j+1}‖ ≤ e^{NT} (‖g‖ + Σ_{i=1}^{j+1} τ_i ‖f^{i−1}‖).  (44.3)
The approximate solution of scheme (44.1) converges to the exact solution of the
problem.
Indeed, the operators A_j being invertible, (44.1) may be written in the form
φ^{j+1} = A_j^{−1}B_j φ^j + τ_j A_j^{−1} f^j,  φ⁰ = g.  (44.4)
Thus, under conditions (44.2) we obtain successively
‖φ^{j+1}‖ ≤ ‖A_j^{−1}B_j‖ ‖φ^j‖ + τ_j ‖A_j^{−1} f^j‖ ≤ ⋯,
etc., i.e.
Scheme (44.1) is, therefore, absolutely stable and the a priori estimate (44.3) is valid.
The scheme's convergence results from its approximation and stability.
It follows that the homogeneous two-layer scheme (43.12) converges under the same
conditions.
If in (44.1)
‖A_j^{−1}B_j‖ ≤ 1 + Nτ,  N = const > 0,  (44.6)
then the scheme is likewise absolutely stable; this follows from the above considerations.
At the same time, the a priori estimate (44.3) holds with M = 1, τ_i = τ.
The condition
‖A_j^{−1}B_j‖ ≤ c,  c = const > 0,  (44.7)
is more general than (44.6). Replacing (44.6) with (44.7), we easily get a
corresponding a priori estimate of the form (44.8).
Note that in the case where the self-adjoint operators A_j and T_j = A_j^{−1}B_j have a
complete basis of eigenfunctions, the convergence conditions (44.2) are equivalent to
the condition
max_k |λ_k(A_j^{−1}B_j)| ≤ M.
SAMARSKII [1977] and MARCHUK [1973, 1980]. In particular, the a priori estimates
method has been used for grounding the two-cycle componentwise splitting schemes
considered in Chapter III.
More sophisticated estimates may be obtained using the energy inequalities
method which is considered in the next section.
45. The energy inequalities method

The energy inequalities method is one of the general and rather effective methods for
obtaining a priori estimates. Its essence consists of the following.
Consider scheme (43.1) in a Hilbert space Φ. The inner product of (43.1) with φ^{(τ)} in
Φ is
(L(φ^{(τ)}), φ^{(τ)})_Φ = (f^{(τ)}, φ^{(τ)})_Φ.  (45.1)
Equality (45.1) is called an energy identity (see, for instance, SAMARSKII [1977], p.
359). For linear problems the left-hand side of (45.1) is a quadratic form of φ^{(τ)}, and
the right-hand side is a bilinear form with respect to f^{(τ)} and φ^{(τ)}. After some
transformations one obtains lower estimates of (L(φ^{(τ)}), φ^{(τ)})_Φ, upper estimates of
(f^{(τ)}, φ^{(τ)})_Φ, and, then, a priori estimates of type (43.2). In these transformations the
Green difference formulae, formulae of summation by parts, and grid analogues of
imbedding theorems are often used. Sometimes one obtains systems of difference
inequalities. Estimating their solutions, one obtains (43.2). Estimates of the type
(43.2), obtained from (45.1), are called energy inequalities.
Energy inequalities for the simplest difference schemes were first obtained in the
well-known article by COURANT, FRIEDRICHS and LEWY [1928]. For a broad class of
problems the estimates of this kind were studied by LADYZHENSKAYA [1956, 1968],
LEES [1960a-b], SAMARSKII [1961b, 1977], DYAKONOV [1962e, 1971a, 1971d], etc.
Such a method for obtaining energy inequalities has been applied for grounding
the splitting and alternating direction methods. Some of the first papers in this
direction were those by LEES [1960c, 1961], where the energy inequalities method was
applied for grounding the simplest splitting schemes. This approach has been further
developed by DYAKONOV [1962c, 1962e] and SAMARSKII [1962c-d, 1963b]. In their
papers the splitting schemes have been grounded with the help of a priori estimates
for various problems of mathematical physics.
Let us briefly summarize the results obtained in these papers.
LEES [1961] considered the implicit alternating direction methods of Douglas,
Peaceman and Rachford for solving the first boundary value problem for the
parabolic equation
∂φ/∂t = Σ_{i=1}^N ∂²φ/∂x_i²,  x ∈ Ω,  t > 0,  (45.2)
with boundary condition (45.3) and a given initial condition.
The exact solution φ(x, t) is assumed to be a sufficiently smooth function whose support
for fixed t lies in the domain Ω. Excluding intermediate steps from the splitting
schemes, Lees obtains equalities which lead to energy identities of the type (45.1). On
the basis of difference Green formulae and grid analogues of the imbedding
theorems he arrives at the stability estimate for the approximate solution φ_h:
‖φ_h(t)‖_{Ω_h} ≤ c ‖φ_h(0)‖_{Ω_h},  t > 0,  (45.4)
where ‖·‖_{Ω_h} is the usual vector norm in the grid space Ω_h, and c is a positive constant
depending on N and on the grid coefficient
λ = τ Σ_i h_i^{−2}.
∂²φ/∂t² = a₁ ∂²φ/∂x₁² + a₂ ∂²φ/∂x₂²  in Ω × [0, T].  (45.5)
These methods were obtained by the author from the simplest approximations of
problem (45.5) using DOUGLAS and RACHFORD'S [1956] splitting procedure. In the
same way the energy inequalities method was used for grounding these methods. As
mentioned by the author, the results may be generalized for hyperbolic equations of
the type:
∂²φ/∂t² = a(x, y, t) ∂²φ/∂x² + b(x, y, t) ∂²φ/∂y² + F(x, y, t, φ, ∂φ/∂x, ∂φ/∂y),
for m = 1 was obtained by DYAKONOV [1962e] where a priori estimates like those of
LEES [1960c, 1961] were constructed.
SAMARSKII [1962c] considered the alternating direction method (see Section 12)
for solving linear and quasi-linear equations of a parabolic type:
∂φ/∂t = Lφ,  L = Σ_{α=1}^p L_α.  (45.7)
This method admits any number p of space variables and is applicable in an arbitrary
domain G. Its essence consists of the following: in each layer the problem is reduced
to a chain of one-dimensional problems.
These have been studied by TIKHONOV and SAMARSKII [1961], SAMARSKII [1961a,
1962a-b] and SAMARSKII and FRYAZINOV [1961].
In the paper by SAMARSKII [1962c] the uniform stability of the method was proved
with respect to the right-hand side, the boundary and the initial data. It was shown that
locally one-dimensional schemes and multidimensional implicit difference schemes
(see SAMARSKII [1962b]) have accuracy O(h² + τ). In establishing the a priori estimate
of stability the author used the maximum principle for difference parabolic
equations.
SAMARSKII [1962d] grounded the locally one-dimensional alternating direction
method for (45.7) in the case where at each stage (unlike SAMARSKII [1962c])
six-point one-dimensional schemes were considered. The method's convergence is
proved for grids with arbitrary steps. Using the energy ("integral") inequalities
method and a special summation method for local errors, the a priori estimate for an
approximate solution is obtained.
In another paper by SAMARSKII [1963b], two-layer schemes of accuracy of order
O(h⁴ + τ²) were considered for the multidimensional heat conduction equation
(45.7) with p ≤ 3. For the realization, a number of splitting algorithms
requiring the same number of operations as the corresponding algorithms of
accuracy O(τ² + h⁴) were used (see DOUGLAS and RACHFORD [1956], DYAKONOV [1962e],
YANENKO [1959a, 1961]). It was shown that these schemes are absolutely stable
and converge in the mean with the rate O(h⁴ + τ²) for any value γ = τ/h⁴. For
the stability studies, a priori estimates obtained from an energy identity on the basis of
Green difference formulae and grid analogues of the imbedding theorems were
constructed.
The use of the energy inequalities for grounding the splitting schemes for
stationary problems was dealt with in papers by SAMARSKII and ANDREEV [1964],
SAMARSKII ([1977], Chapter X), etc.
As we see from these papers the energy inequalities method is widely used for
grounding the splitting and alternating direction schemes for various problems of
mathematical physics.
To conclude this section, let us demonstrate the use of this method for grounding
the splitting schemes in the case of an implicit difference approximation (see MARCHUK
[1980]). To this end consider the problem
∂φ/∂t + Aφ = f in Ω_t,  (45.9)
φ = g for t = 0.
Assume that
A = Σ_{α=1}^n A_α,
and consider the splitting scheme
(φ^{j+α/n} − φ^{j+(α−1)/n})/τ + A_α φ^{j+α/n} = 0,  α = 1, …, n−1,  (45.10)
(φ^{j+1} − φ^{j+(n−1)/n})/τ + A_n φ^{j+1} = f^j.
With the help of the energy inequalities method let us prove the following
statement.
Scheme (45.10) approximates problem (45.9) with first-order accuracy in τ on
sufficiently smooth solutions and is absolutely stable under the condition that
A_α ≥ 0. The approximate solution of (45.10) converges to the exact solution of (45.9).
The approximation of scheme (45.10) is obvious; it is proved by excluding the
fractional steps and using a Taylor series for φ^{j+α/n}.
Using the energy inequalities method, we prove the stability of (45.10). Take the
inner product of each of the equations (45.10) with φ^{j+1/n}, …, φ^{j+1}, respectively.
Using the positive semidefiniteness of the operators A_α, we obtain
‖φ^{j+α/n}‖ ≤ ‖φ^{j+(α−1)/n}‖,  α = 1, 2, …, n−1.  (45.11)
Consider the last equation obtained from (45.10) in more detail. We have
(φ^{j+1}, φ^{j+1}) = (φ^{j+(n−1)/n}, φ^{j+1}) − τ(A_n φ^{j+1}, φ^{j+1}) + τ(f^j, φ^{j+1}).
Since
(φ^{j+(n−1)/n}, φ^{j+1}) ≤ ‖φ^{j+(n−1)/n}‖ ‖φ^{j+1}‖
and (A_n φ^{j+1}, φ^{j+1}) ≥ 0, dividing by ‖φ^{j+1}‖ we get the following inequality:
‖φ^{j+1}‖ ≤ ‖φ^{j+(n−1)/n}‖ + τ ‖f^j‖.
Excluding the solution with fractional indices with the help of (45.11), we obtain
‖φ^{j+1}‖ ≤ ‖φ^j‖ + τ ‖f^j‖.  (45.12)
Taking into account that ‖φ⁰‖ = ‖g‖ and excluding intermediate values of the
solution, we obtain the a priori estimate
‖φ^{j+1}‖ ≤ ‖g‖ + (j+1)τ ‖f‖,
where
‖f‖ = max_j ‖f^j‖.
It thus follows that the difference scheme (45.10) is absolutely stable.
The convergence follows from the approximation and stability. Thus, the
statement is proved.
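The scheme and the a priori bound just derived can be exercised numerically. The following sketch (with our own randomly generated nonnegative operators, a toy stand-in for a split grid operator) applies scheme (45.10) with n = 2 and checks the resulting estimate.

```python
import numpy as np

# Scheme (45.10) with two symmetric positive semidefinite parts A1, A2:
# each fractional step is implicit in one operator only.
rng = np.random.default_rng(4)
m = 6
M1 = rng.standard_normal((m, m)); A1 = M1 @ M1.T   # A1 >= 0
M2 = rng.standard_normal((m, m)); A2 = M2 @ M2.T   # A2 >= 0
E = np.eye(m)
tau, J = 0.05, 40
g = rng.standard_normal(m)
fj = rng.standard_normal(m)                        # constant right-hand side

phi = g.copy()
for j in range(J):
    half = np.linalg.solve(E + tau*A1, phi)             # (E + tau A1) phi^{j+1/2} = phi^j
    phi = np.linalg.solve(E + tau*A2, half + tau*fj)    # (E + tau A2) phi^{j+1} = phi^{j+1/2} + tau f^j
print(np.linalg.norm(phi) <= np.linalg.norm(g) + J*tau*np.linalg.norm(fj) + 1e-10)
```

Because (E + τA_α)^{−1} has norm at most 1 for A_α ≥ 0, each step obeys ‖φ^{j+1}‖ ≤ ‖φ^j‖ + τ‖f^j‖, which is exactly inequality (45.12).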
In concluding this chapter, we note that the a priori estimates method
is an effective technique for the convergence studies of splitting and alternating
direction schemes.
CHAPTER XII
Splitting of the Evolutionary Problem
As noted above, in a number of splitting algorithms the initial evolutionary problem is
preliminarily split into a system of difference equations with properly chosen operators
A_α. Then each equation of the system is approximated by a suitable
difference scheme, and an additional splitting of the A_α into operators A_{αβ} is possible.
In this chapter we consider some of these splitting methods and their grounding for
a system of difference equations.
Note that the operator A in the initial problem is not necessarily a matrix. It may
be some abstract operator acting in a Banach space Φ. Let us agree to write the
partial derivative in t as ∂φ/∂t in all equations under consideration, though on some
occasions it would be more correct to use the conventional notation of the derivative
with respect to t.
46. The splitting of problems defined on fractional intervals and the weak
approximation method
A = Σ_{α=1}^n A_α  (46.2)
of linear operators A_α(t) with the same domain as A. Then solving problem (46.1)
may be approximately reduced to the successive solution of Cauchy problems of type
(46.1), where instead of A we have the operators A_α, α = 1, 2, …, n. Let us consider the
f = Σ_{α=1}^n f_α.  (46.3)
On each interval t_j ≤ t ≤ t_{j+1} we then solve successively the problems
(1/n) ∂φ_α/∂t + A_α φ_α = f_α,  t ∈ [t_{j+(α−1)/n}, t_{j+α/n}],  α = 1, …, n,  (46.5)
with the conditions
φ₁(0) = g,
φ₁(t_j) = φ_n(t_j),  j = 1, 2, …,  (46.6)
φ_α(t_{j+(α−1)/n}) = φ_{α−1}(t_{j+(α−1)/n}),  j = 0, 1, …;  α = 2, 3, …, n.
As the solution of problem (46.5)-(46.6) we take the element
φ^{j+1} = φ_n(t_{j+1}).  (46.7)
The next stage in the solution of the initial problem consists in solving problem
(46.5)-(46.6) numerically. To that end, the corresponding difference schemes in t (and
possibly an additional splitting of the A_α into A_{αβ}, β = 1, 2, …, n_α) can be used. If the
structure of the operators A_α is simple enough, we obtain an effectively realizable
splitting scheme for problem (46.1).
Let us formulate some results to ground the transition from (46.1) to (46.5)-(46.6).
To that end, we introduce concepts connected with the weak approximation method
(YANENKO and DEMIDOV [1966], YANENKO [1964b]). We will say that the family of
functions F_τ(t) weakly approximates the function F(t) in t on the interval (0, T) if
‖∫_{t′}^{t″} [F_τ(t) − F(t)] dt‖ ≤ ε(τ),
where ε(τ) → 0 for τ → 0+ and t′, t″ ∈ (0, T). Further, we say that the family of linear
differential operators A_τ(t) weakly approximates the operator A(t) in t if the weak
approximation for the coefficients of A(t) takes place. Now, taking (46.2) into account,
we define on each interval t_j ≤ t ≤ t_{j+1} the operator
A_τ(t) = n Σ_{β=1}^n δ_{αβ} A_β(t),  t ∈ [t_{j+(α−1)/n}, t_{j+α/n}),
together with the analogously defined function f_τ(t).
(Here δ_{αβ} is the Kronecker delta.) It is easy to notice that the operator A_τ weakly
approximates the operator A. Together with (46.1) we consider the following
approximates the operator A. Together with (46.1) we consider the following
Cauchy problem:
∂φ/∂t + A_τ φ = f_τ in Ω_t,  (46.10)
φ = g for t = 0.
Its solution will be denoted by φ = φ_τ. It is easy to see that for f_α = f/n this problem
may also be written in the form (46.5)-(46.6). Assuming f_τ = f and f_α = f/n, both (46.3)
and the conditions for the weak approximation of f by f_τ hold, and the problems
(46.5)-(46.6) and (46.10) coincide. Thus, problem (46.10) constitutes a splitting
system of equations, and to ground the transition from (46.1) to (46.5)-(46.6) it is
sufficient to study the convergence of its solution φ_τ to φ(t) for τ → 0.
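A toy illustration of the weak approximation property (our own example, not from the text): take n = 2 with f₁ = f, f₂ = 0, so that f_τ(t) equals 2f(t) on the first half of each step of length τ and 0 on the second half. Pointwise f_τ does not converge to f at all, yet the running integral of (f_τ − f) is O(τ), i.e. f_τ weakly approximates f.

```python
import numpy as np

f = lambda t: np.sin(3.0 * t) + 2.0

def defect(tau):
    # Maximal running-integral defect of f_tau against f on (0, 1).
    t, dt = np.linspace(0.0, 1.0, 200001, retstep=True)
    first_half = (t % tau) < tau / 2
    f_tau = np.where(first_half, 2.0 * f(t), 0.0)   # 2f then 0 on each step
    return np.abs(np.cumsum((f_tau - f(t)) * dt)).max()

d1, d2 = defect(0.1), defect(0.05)
print(d1, d2)   # the maximal defect roughly halves when tau is halved
```

This is precisely the mechanism by which the piecewise operator A_τ, which differs strongly from A at every instant, can still reproduce the correct evolution in the limit τ → 0.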
Now assume that the operator A has the form
A = Σ_{k₁,…,k_m} a_{k₁⋯k_m}(x₁, …, x_m, t) ∂^{k₁+⋯+k_m}/(∂x₁^{k₁} ⋯ ∂x_m^{k_m}),  (46.11)
where x₁, …, x_m are the real variables on which the functions in Φ depend. All
coefficients a_{k₁⋯k_m} are assumed to be real, bounded and continuous in t in a uniform
metric. The order of the differential operators A_α in the representation (46.2) is
considered to be finite. Further, if all the derivatives of the function a(x₁, …, x_m, t)
occurring in A exist and are bounded and continuous in a uniform metric, we say that
a(x₁, …, x_m, t) has derivatives up to order "A". When this differentiation procedure
can be repeated j times, we say that a(x₁, …, x_m, t) has derivatives up to order "A^j".
The expressions "up to order A" and "up to order A¹" are equivalent.
Assume that a(x₁, …, x_m, t) and f have derivatives in the variables x₁, …, x_m up to
order "A^j". We formally differentiate (46.1); for the vector φ^{(1)} with the components
v^{(k)} = v^{(k₁⋯k_m)} = D₁^{k₁} ⋯ D_m^{k_m} φ,  D_i^{k_i} = ∂^{k_i}/∂x_i^{k_i},
where the combinations k = (k₁, …, k_m) are taken from the sum (46.11) (necessarily
including the combination (0, …, 0) corresponding to φ = v^{(0⋯0)}), we then get
the system
∂φ^{(1)}/∂t + A^{(1)} φ^{(1)} = f^{(1)}.  (46.12)
This system is the first extended system, with operator A^{(1)} and vector f^{(1)}. In the
same way the vectors φ^{(2)} = {v^{(k+l)}}, φ^{(3)} = {v^{(k+l+q)}}, … are constructed, where the
indices l, q, … range over the same combinations as k. For φ^{(2)}, φ^{(3)}, … we obtain the
extended systems
∂φ^{(i)}/∂t + A^{(i)} φ^{(i)} = f^{(i)}.  (46.14)
STATEMENT 46.1. If the problems I_τ, I^{(1)} and I^{(2)} are uniformly correct, then φ_τ(t)
converges uniformly in t to φ(t) = S(t, 0)g for τ → 0. The transition operator S(t₂, t₁)
satisfies the conditions of uniform correctness for system (46.1).
STATEMENT 46.2. If the problems I_τ and I^{(j)}, j = 1, 2, 3, are uniformly correct, then for
any g ∈ Φ the function φ_τ(t) converges uniformly in t to φ(t) for τ → 0, along with its
x_k-derivatives up to order "A". The limiting function φ(t) has a derivative in t and is the
solution of problem (46.1).
STATEMENT 46.3. If (a) g ∈ Φ, (b) A_τ, φ are uniformly continuous in t, (c) the problems I_τ
and I^{(1)} are uniformly correct, and (d) the problems I_τ are correct with respect to the
right-hand side (i.e. their solutions depend uniformly on the right-hand sides), then (46.1) has
a unique solution and φ_τ(t) converges uniformly to φ(t) in t.
STATEMENT 46.4. Let Φ = L_q(Ω), q > 0, Ω = Ω(x₁, …, x_m). If (a) the problems I and
I^{(1)} are uniformly correct for f ≡ 0, and (b) the problems I_τ are correct with respect to the
right-hand side, then φ_τ(t) is fundamental (Cauchy) uniformly in t for τ → 0, and any smooth
limiting function φ(t) is a solution of the problem I. If g^{(1)} = φ^{(1)}|_{t=0} ∈ L_q(Ω), then φ(t)
is a smooth and unique solution of problem I.
REMARK 46.1. Here, the function φ(t) is called smooth if all its derivatives
belonging to L_q(Ω × Ω_t) which are present in equation (46.1) exist.
where the A_α(x, t) are symmetric matrices, continuous in Ω × Ω_t along with their first
derivatives in the space variables. Then the problems I and I_τ are uniformly correct for
f ≡ 0. The function φ_τ(t) converges uniformly in t for τ → 0 to the solution of problem I.
The proofs of Statements 46.1-46.5 are given in the work by YANENKO and
DEMIDOV [1966]. Different aspects connected with grounding the transition from
(46.1) to (46.10) (and to (46.5)-(46.6) in particular) can be found in the papers by
YANENKO [1967] and SAMARSKII [1970]. Note that SAMARSKII [1970] estimated the
convergence rate of φ_τ as well.
We now consider a splitting method for problem (46.1) that is often used in practice
for obtaining splitting schemes (SAMARSKII [1965b, 1970], BAKER and OLIPHANT
[1960], MARCHUK [1967]).
We replace problem (46.1) with a system of type (47.1) and assume that
‖ψ_α‖ = O(τ),
where
ψ_α = f_α(t) − A_α(t) φ(t_{j+1}),  α > 1,
ψ₁ = f₁(t) − A₁(t) φ(t) − dφ/dt
(see SAMARSKII [1970], p. 219).
Problem (46.1) can be split into a system of Cauchy problems that approximates
problem (46.1) with order O(τ²) by using the two-cycle method, i.e. by symmetrizing the
sequence of problems of type (47.1) (BAKER and OLIPHANT [1960], MARCHUK [1967,
1971], FRYAZINOV [1968]). Let us give several examples of systems of
Cauchy problems obtained by this method, each of these systems having
second-order accuracy (under the corresponding additional restrictions on the smoothness of
the initial data and of the solution, and appropriate choices of the f_α in (46.3)).
EXAMPLE 48.1. Here
f = Σ_{α=1}^{2n} f_α,
and it is assumed that
φ(t_{j+1}) = φ_{2n}(t_{j+1}).
EXAMPLE 48.2 (BAKER and OLIPHANT [1960]). The system has an analogous form,
with α running from 1 to 2n.
EXAMPLE 48.3 (MARCHUK [1980], p. 276). The system is assumed to have the
following form. On the interval t_{j−1} ≤ t ≤ t_j:
∂φ_α/∂t + A_α φ_α = 0,  α = 1, 2, …, n−1,  (48.3)
∂φ_n/∂t + A_n φ_n = f + (τ/2) A_n f;
on the interval t_j ≤ t ≤ t_{j+1} the analogous equations are taken with the operators
A_α in the reverse order (48.4); the equations are solved under the conditions
φ₁(t_{j−1}) = φ(t_{j−1}),
φ_{α+1}(t_{j−1}) = φ_α(t_j),  α = 1, 2, …, n,  (48.5)
φ_{α+1}(t_j) = φ_α(t_{j+1}),  α = n, …, 2n−1.
The approximation of problem (46.1) by (48.3)-(48.5) is considered on the double
interval (t_{j−1}, t_{j+1}) under the corresponding smoothness assumptions.
EXAMPLE 48.4 (MARCHUK [1980], p. 276). The system is assumed to have the
following form. On the interval t_{j−1} ≤ t ≤ t_j:
∂φ_α/∂t + A_α φ_α = 0,  α = 1, …, n,  (48.6)
∂φ_{n+1}/∂t = f,  φ_{α+1}(t_{j−1}) = φ_α(t_j);  (48.7)
and on the interval t_j ≤ t ≤ t_{j+1}:
∂φ_{n+2}/∂t + A_n φ_{n+2} = 0,
⋯⋯⋯  (48.8)
∂φ_{2n+1}/∂t + A₁ φ_{2n+1} = 0,
φ_α(t_j) = φ_{α−1}(t_{j+1}),  α = n+2, …, 2n+1.
Problem (48.6)-(48.8) approximates (46.1) on (t_{j−1}, t_{j+1}) under analogous
assumptions.
All the systems in this section have second-order accuracy, and the operators A_α and
A_β, α ≠ β, do not have to be commutative. Theoretical aspects concerning these systems
are considered in the above-cited literature.
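The gain from symmetrization can be seen on a toy example (the noncommuting matrices below are our own choices): for linear autonomous problems, plain sequential splitting gives a global error O(τ), while the two-cycle arrangement, with the operators traversed in reverse order on the second half of the double interval, gives O(τ²) even though A₁ and A₂ do not commute.

```python
import numpy as np

def expm(M, terms=30):
    # Taylor-series matrix exponential; adequate for the small norms used here.
    R, P = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        P = P @ M / k
        R = R + P
    return R

A1 = np.array([[1.0, 0.7], [0.0, 0.5]])
A2 = np.array([[0.8, 0.0], [0.6, 1.2]])   # A1 @ A2 != A2 @ A1
A = A1 + A2

def global_err(tau, symmetrized):
    if symmetrized:  # A1, A2 on the first half step, A2, A1 on the second
        step = expm(-tau/2*A1) @ expm(-tau/2*A2) @ expm(-tau/2*A2) @ expm(-tau/2*A1)
    else:            # plain sequential (first-order) splitting
        step = expm(-tau*A2) @ expm(-tau*A1)
    n = int(round(1.0 / tau))
    return np.linalg.norm(expm(-A) - np.linalg.matrix_power(step, n))

r_plain = global_err(0.1, False) / global_err(0.05, False)
r_sym = global_err(0.1, True) / global_err(0.05, True)
print(r_plain, r_sym)   # near 2 (first order) and near 4 (second order)
```

The symmetric arrangement cancels the leading commutator term of the error, which is exactly why the two-cycle systems of this section reach second-order accuracy without any commutativity assumption.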
‖φ^{j+1}‖_{(1)} ≤ M₁ ‖g‖_{(1)} + M₂ max_{0≤j′≤j} Σ_{α=1}^m ‖f^{j′+α/m}‖_{(2)},  (49.3)
where the constants M₁ and M₂ do not depend on τ, on the grid parameter h in the other
variables, or on g and f^{j+α/m}, and the "smoothness" condition for the solution φ(t) holds:
‖Σ_{α=1}^m A_α φ(t_{j+1}) − f^{j+1}‖_{(2)} ≤ M₀,  (49.4)
where M₀ = const > 0 does not depend on τ and h. Then (49.1) converges, and a
corresponding estimate for the error ‖φ^{j+1} − φ(t_{j+1})‖_{(1)} is valid.
Let us now suppose that the A_α are independent of t, that B = B* > 0 is a constant
operator, and that
Σ_{α=1}^m (A_α ξ_α, ξ_α) ≥ 0
for any ξ_α ∈ H_α, α = 1, …, m. Then the following estimate holds for the solution of
problem (49.1):
‖φ^{j+1}‖_B ≤ ‖g‖_B + Σ_{j′=0}^{j} τ ‖f^{j′}‖_{B^{−1}},
where
‖φ‖_B = (Bφ, φ)^{1/2},  ‖φ‖_{B^{−1}} = (B^{−1}φ, φ)^{1/2}
(SAMARSKII [1970]).
Scheme (49.1) converges in the norm ‖·‖_B if φ(0) = φ^0 and the conditions of the
above statement hold, as well as the summarized approximation conditions, where
M_0 does not depend on the grid parameters h and τ (SAMARSKII [1970]).
If φ(0) = φ^0 and the approximation error satisfies

‖ψ‖ = O(|h|^l + τ^k),

where k > 0, l > 0, and the conditions of the above statement hold, then scheme
(49.1) converges in the metric ‖·‖_B with rate O(|h|^l + τ^{k_1}), where k_1 = min(k, ½). If,
besides, the restriction (49.4) holds for ‖·‖_{(2)} = ‖·‖_{B^{-1}}, then the scheme converges
with rate O(|h|^l + τ^{k_2}), where k_2 = min(k, 1) (SAMARSKII [1970], p. 215).
In conclusion, we mention the works dealing with the problems connected with
the splitting of the initial problem into a system of simpler problems. General
principles of such splitting are given by YANENKO and DEMIDOV [1966], MARCHUK
([1971]; [1980], p. 275) and SAMARSKII ([1971], p. 400). Theoretical grounding of such
a splitting may be found in the works by YANENKO and DEMIDOV [1966], SAMARSKII
([1970]; [1971], p. 403) and YANENKO ([1967], p. 170).
The work by MARCHUK ([1980], Chapter V) is dedicated to two-cycle (symmetrized)
splitting schemes; theoretical grounding and some modifications are
given by SAMARSKII [1971] on the basis of symmetrized splitting. FRYAZINOV [1968,
1969] constructed schemes for the equations of parabolic type in domains
consisting of p-dimensional parallelepipeds. A somewhat different symmetrization
method for the difference schemes is used by GODUNOV and ZABRODIN [1962] for the
acoustics equations. A large bibliography on all these problems is available in the
works by MARCHUK [1980] and SAMARSKII [1970, 1971].
CHAPTER XIII
In this chapter we will consider some variants of the alternating direction method for
solving a system of linear algebraic equations
Au = f  (50.1)

with a nonsingular (N × N) matrix A and vector f ∈ R^N. Let us take the following
formula as the general form of a stationary method:

B_τ(u^{k+1} − u^k) = −ατ(Au^k − f),  k = 0, 1, ...,  (50.2)

where α and τ are some positive parameters and

B_τ = ∏_{i=1}^{m} (E + τB_i)  (50.3)

is a nonsingular matrix. Here m is a fixed positive integer and B_i, i = 1, ..., m, is an
arbitrary (N × N) matrix. Here and in the sequel the unit matrix is denoted by E.
Further, we will also consider nonstationary iterative methods, for which the
parameters used depend on the step number k.
Method (50.2) is called convergent when, for any initial approximation u^0 ∈ R^N,
the sequence of vectors u^1, u^2, ... converges to the solution u* = A^{−1}f of system
(50.1).
Let the value m and the matrices B_1, ..., B_m be given. Then the following statement is
valid: If the real parts of the eigenvalues of the matrix A are positive, then, for any
α > 0, there exists a τ̄ > 0 such that method (50.2) converges for any τ ∈ (0, τ̄).
The inverse result is also valid. If a value α > 0 is given, then a necessary condition
for the method to converge for any sufficiently small τ > 0 is that the real parts of the
eigenvalues of the matrix A be positive.
The uniform convergence, as τ → 0, of the matrix B_τ to the unit matrix is used to
prove the above statements. The proofs are given in the book by
MARCHUK and KUZNETSOV [1972].
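The stationary method (50.2)-(50.3) can be illustrated numerically. In the following minimal Python sketch the matrices, α and τ are invented for the example only; the factorized operator B_τ is applied exactly as its product structure suggests, one factor solve at a time:

```python
import numpy as np

def adi_stationary(A, Bs, f, alpha, tau, iters=400):
    """Iterate B_tau (u^{k+1} - u^k) = -alpha*tau*(A u^k - f),
    with B_tau = prod_i (E + tau*B_i) applied factor by factor."""
    n = A.shape[0]
    E = np.eye(n)
    u = np.zeros(n)
    for _ in range(iters):
        r = -alpha * tau * (A @ u - f)   # right-hand side of the step
        for B in Bs:                     # solve B_tau * du = r factorwise
            r = np.linalg.solve(E + tau * B, r)
        u = u + r
    return u

# toy splitting: A = B1 + B2 with positive-(semi)definite pieces
B1 = np.array([[2.0, -1.0], [-1.0, 2.0]])
B2 = np.array([[1.0, 0.0], [0.0, 3.0]])
A = B1 + B2
f = np.array([1.0, 2.0])
u = adi_stationary(A, [B1, B2], f, alpha=1.0, tau=0.5)
exact = np.linalg.solve(A, f)
```

The fixed point of the iteration is u* = A^{−1}f regardless of the splitting, which the run above confirms numerically.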
We will consider further only two cases. Let us call the first case commutative. It
supposes the existence of a nonsingular matrix Q which, by a similarity
transformation, simultaneously reduces the matrices A, B_1, ..., B_m to diagonal form
(50.4). Obviously, the matrices A, B_1, ..., B_m, as well as the matrix B_τ, then commute
pairwise. In the commutative case we will further assume that the eigenvalues λ_j of
the matrix A are real and positive and that the eigenvalues λ_j^{(i)} of the matrices B_i,
i = 1, ..., m, are real and nonnegative.
The second case, called noncommutative, supposes that m = 2, the matrix A is
positive-definite in R^N, A = B_1 + B_2, and the matrices B_1 and B_2 are at least
positive-semidefinite and, generally speaking, noncommutative. For the sake of
simplicity we will use the notations B_1 = A_1 and B_2 = A_2 in the noncommutative
case.
Let us first determine sufficient conditions for the convergence of the commutative
alternating direction method. To this end, we introduce the function

g_{α,τ}(λ; λ^{(1)}, ..., λ^{(m)}) = 1 − ατλ / ∏_{i=1}^{m} (1 + τλ^{(i)}).  (50.5)

Since the requirement that the inequality ρ(T_{α,τ}) < 1 holds is a necessary and
sufficient condition for convergence of method (50.2), we have to determine the
conditions under which −1 < g_{α,τ}(λ_j; λ_j^{(1)}, ..., λ_j^{(m)}) < 1 for all j.
Let the matrix B = ∑_{i=1}^{m} B_i be nonsingular. Then there exists an ᾱ > 0 such that
ρ(T_{α,τ}) < 1 for any α ∈ (0, ᾱ) and τ > 0, i.e. the commutative alternating direction
method (50.2) converges for all such values of the parameters.
If we assume additionally that A = B, then the commutative alternating direction
method (50.2) converges for any α ∈ (0, 2) and τ > 0.
To prove this statement it is sufficient to note that, by assuming A = B = ∑_{i=1}^{m} B_i,
we have λ_j = ∑_{i=1}^{m} λ_j^{(i)}. Therefore, since ∏_{i=1}^{m} (1 + τλ_j^{(i)}) ≥ 1 + τλ_j, we obtain
1 − α < g_{α,τ} < 1 for any α ∈ (0, 2) and τ > 0.
where E_s denotes the unit (n_s × n_s) matrix for s = 1, ..., m, and ⊗ denotes the tensor
product of matrices. Assume also that

A = ∑_{i=1}^{m} B_i.  (50.9)

Such matrices occur when elliptic boundary value problems with separated variables
in rectangular domains are approximated by the grid method.
Obviously, with the additional assumption that det A ≠ 0, i.e.

λ_j = ∑_{i=1}^{m} λ_j^{(i)} ≠ 0,  j = 1, ..., N,

all requirements of the commutative alternating direction method hold. Then, from
the above considerations, method (50.2) for solving system (50.1) (with matrix
A from (50.9) and matrices B_1, ..., B_m from (50.8)) converges for any α ∈ (0, 2) and
for any τ > 0.
For a more detailed description of the commutative alternating direction method
see Section 51.
Now we consider the noncommutative alternating direction method. Taking the
above assumptions into account, we write the method in the different, more
convenient form

(E + τA_1)(E + τA_2)(u^{k+1} − u^k) = −ατ(Au^k − f),  k = 0, 1, ....  (50.10)

For α = 1 we obtain the well-known DOUGLAS-RACHFORD [1956] method, and for
α = 2 the PEACEMAN-RACHFORD [1955] method (50.11).
We will consider the latter in more detail. Our results will be based on the papers by
BIRKHOFF and VARGA [1959], WACHSPRESS and HABETLER [1960], KELLOGG [1963],
SAMARSKII [1964a] and others.
We introduce the matrix D_τ = (E + τA_2)^T(E + τA_2), the error vectors z_k = u^k − u*
of (50.11), and the vectors y_k = (E + τA_1)^{−1}(E − τA_2)z_k. Then, after simple
transformations, one arrives at the estimates (50.14) and (50.15) used below.
Let us now suppose, in addition to the assumption that A is positive-definite, that
A_1 = A_2^T. This means that the matrices A_1 and A_2 are positive-definite, since the
equalities

(A_i z, z) = ½(Az, z),  i = 1, 2,  z ∈ R^N,

hold in real space. We will call this variant of the alternating direction method
(50.11) symmetric. Since the matrices A_1 and A_2 are positive-definite, the method's
convergence for any τ > 0 follows from the above considerations.
An important property of the symmetric alternating direction method is the
symmetry and positive-definiteness of the matrix

B_τ = (E + τA_1)(E + τA_2) = (E + τA_2)^T(E + τA_2);  (50.16)

this means that the step matrix T_τ can be symmetrized in the inner product generated
by the matrix A. Indeed,

(T_τ v, w)_A = (v, w)_A − 2τ(B_τ^{−1}Av, w)_A
            = (v, w)_A − 2τ(B_τ^{−1}Av, Aw)
            = (v, T_τ w)_A

for any v, w ∈ R^N. Moreover, it follows that the matrix B_τ^{−1}A is not only
A-self-adjoint but also A-positive-definite. Thus, for any τ > 0 all eigenvalues of the
matrix T_τ are real and belong to the interval (−1, 1), and the system of its
eigenvectors constitutes an A-orthonormal basis of R^N.
Let us briefly discuss the influence of the parameter α on the convergence of the
noncommutative alternating direction method. To this end, we consider the
right-hand side of the corresponding equality and conclude that the following
inequality, analogous to inequality (50.15), holds:

‖T_{α,τ}‖_{D_τ} < 1

for any τ > 0 and α ∈ (0, 2).
Hence, under the above assumptions the noncommutative alternating direction
method converges for any τ > 0 and α ∈ (0, 2).
From this statement it follows, in particular, that the Douglas-Rachford method,
i.e. (50.10) with parameter α = 1, converges for all positive-semidefinite matrices A_1
and A_2.
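For the Peaceman-Rachford case α = 2, the iteration is commonly realized in two half steps. The sketch below assumes the classical half-step form (E + τA_1)u^{k+1/2} = (E − τA_2)u^k + τf, (E + τA_2)u^{k+1} = (E − τA_1)u^{k+1/2} + τf, with small invented positive-definite matrices:

```python
import numpy as np

def peaceman_rachford(A1, A2, f, tau, iters=100):
    """Classical half-step form of the Peaceman-Rachford iteration."""
    n = A1.shape[0]
    E = np.eye(n)
    u = np.zeros(n)
    for _ in range(iters):
        # first half step: implicit in A1, explicit in A2
        uh = np.linalg.solve(E + tau * A1, (E - tau * A2) @ u + tau * f)
        # second half step: explicit in A1, implicit in A2
        u = np.linalg.solve(E + tau * A2, (E - tau * A1) @ uh + tau * f)
    return u

A1 = np.array([[2.0, -1.0], [-1.0, 2.0]])   # positive-definite
A2 = np.array([[1.0, 0.5], [0.5, 2.0]])     # positive-definite
A = A1 + A2
f = np.array([1.0, 0.0])
u = peaceman_rachford(A1, A2, f, tau=0.7)
```

A short calculation shows u* = A^{−1}f is a fixed point of the two half steps, and for positive-definite A_1, A_2 the iteration contracts for any τ > 0, consistent with the statement above.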
This is, however, not an optimal approach, since only in exceptional cases can we find
ρ(T_{α,τ}) explicitly as a function of the variables α and τ. That is why in practice one
introduces a set G ⊂ R^{m+1} such that all (m+1)-dimensional vectors
λ_j = (λ_j, λ_j^{(1)}, ..., λ_j^{(m)}) ∈ G, j = 1, ..., N, whose components are the eigenvalues of
the diagonal matrices Λ, B_1, ..., B_m of (50.4), together with a function ḡ_{α,τ}(λ) such
that for any λ ∈ G the following inequality holds:

|g_{α,τ}(λ)| ≤ ḡ_{α,τ}(λ).  (51.2)
If the function

Ψ(α, τ) = max_{λ∈G} ḡ_{α,τ}(λ)  (51.3)

approximates ρ(T_{α,τ}) sufficiently well for the values (α, τ) of a set H ⊂ R², in which the
best values of the parameters are sought, then instead of (51.1) we may consider the
approximate optimization problem for the iterative method, i.e. solving the
extremum problem

min_{(α,τ)∈H} Ψ(α, τ).  (51.4)
coming to new majorants ḡ_{α,τ} by expanding the set G and narrowing the set H, i.e.
by imposing additional restrictions on the values of the parameters. The analogous
procedure of transition from the extremum problem (51.1) to the approximating
problem (51.4) can be used in other cases as well, in particular for multiparametric
alternating direction methods.
In considering concrete ways of choosing the parameters in the commutative
alternating direction method, we assume additionally that m = α = 2 and
A = A_1 + A_2, where A_1 = B_1 and A_2 = B_2. Then

g_τ(λ) = (1 − τλ^{(1)})(1 − τλ^{(2)}) / ((1 + τλ^{(1)})(1 + τλ^{(2)})),

the set G ⊂ R², and the set H consists of the positive values of τ. Choose

G = [δ_1, Δ_1] × [δ_2, Δ_2],  (51.5)

where

δ_i = min_{1≤j≤N} λ_j^{(i)},  Δ_i = max_{1≤j≤N} λ_j^{(i)},  i = 1, 2;

take

Ψ(α, τ) = Ψ(τ) = max_{λ∈G} |g_τ(λ)|.
Then

Ψ(τ) = max_{i=1,2} max{ |1 − τδ_i|/(1 + τδ_i), |1 − τΔ_i|/(1 + τΔ_i) },

and the value τ_opt minimizing the function Ψ(τ) is computed according to the
formula

τ_opt = (δΔ)^{−1/2}.  (51.6)

Hence we arrive at the estimate

ρ(T_{τ_opt}) ≤ [ (1 − (δ/Δ)^{1/2}) / (1 + (δ/Δ)^{1/2}) ]²,  (51.8)

where

δ = min_{1≤j≤N, i=1,2} λ_j^{(i)},  Δ = max_{1≤j≤N, i=1,2} λ_j^{(i)}

(δ is assumed to be positive).
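The optimal-parameter formula τ_opt = (δΔ)^{−1/2} can be checked by a direct scan. In this sketch the spectral bounds δ and Δ are invented, and the majorant max over λ ∈ [δ, Δ] of |(1 − τλ)/(1 + τλ)| is minimized by brute force:

```python
import math

delta, Delta = 0.5, 50.0          # assumed (invented) spectral bounds

def psi(tau, samples=2000):
    """max over lambda in [delta, Delta] of |(1 - tau*l)/(1 + tau*l)|;
    the maximum sits at an endpoint, but we scan to illustrate."""
    worst = 0.0
    for k in range(samples + 1):
        lam = delta + (Delta - delta) * k / samples
        worst = max(worst, abs((1 - tau * lam) / (1 + tau * lam)))
    return worst

tau_opt = 1.0 / math.sqrt(delta * Delta)       # formula tau_opt = (delta*Delta)^(-1/2)
# crude grid search over tau around the predicted optimum
taus = [tau_opt * (0.2 + 0.01 * i) for i in range(300)]
best_tau = min(taus, key=psi)
```

The grid minimizer coincides (up to grid resolution) with the analytic value, because τ_opt equalizes the two endpoint values of the majorant.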
We introduce the corresponding majorant function in order to solve the problem of
minimizing the function Ψ(τ) with respect to the cyclic set of parameters τ_0, ..., τ_{p−1}.
According to the above, the solutions of the minimax problems can be found by
choosing

τ_i = τ_{i,opt} = (m_i m_{i+1})^{−1/2},  i = 0, 1, ..., p−1.  (51.13)

Now we have to choose the values m_0, ..., m_p so that the right-hand side of (51.12) is
minimal. Obviously, it is enough to take

m_i = δ(Δ/δ)^{i/p},  i = 0, 1, ..., p.
Hence, for τ = τ_opt = (τ_{0,opt}, ..., τ_{p−1,opt})^T we obtain the estimate

‖z_k‖_D ≤ [ (1 − h^{1/p}) / (1 + h^{1/p}) ]^{2k/p} ‖z_0‖_D,  (51.14)

where h = (δ/Δ)^{1/2}.
So, the average factor of decrease of the D-norm of the error vector per step of the
alternating direction method with the optimized cyclic parameters of period p is
equal to

q = [ (1 − h^{1/p}) / (1 + h^{1/p}) ]^{2/p}.  (51.15)
Assume that h ≪ 1 and let us find for what value of p the quantity (51.15) is minimal.
Taking h ≪ 1 into account, we get the equation

d/dp ( h^{1/p} / p ) = 0;

its solution is the value p = −ln h.
With this value of p, (51.14) takes the form

‖z_k‖_D ≤ [ (e − 1)/(e + 1) ]^{2k/p} ‖z_0‖_D.  (51.16)
Thus, if under the assumption h ≪ 1 the problem is to decrease the D-norm of the
initial error vector by a factor 1/ε (ε < 1), then with the values of the parameters found
the method allows one to do this in

k_1 ~ ln(1/h) ln(1/ε)  (51.17)

steps, while the one-parameter method requires

k_2 ~ h^{−1} ln(1/ε)  (51.18)

steps.
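The choice p = −ln h can be illustrated numerically; in the sketch below h is an invented small value, and q(p) is the average reduction factor (51.15):

```python
import math

h = 1e-4                      # invented small ratio h = (delta/Delta)^(1/2)

def q(p):
    """average reduction factor per step for cycle length p"""
    x = h ** (1.0 / p)
    return ((1 - x) / (1 + x)) ** (2.0 / p)

p_star = -math.log(h)         # predicted near-optimal cycle length
ps = [1 + 0.1 * i for i in range(400)]   # scan p over [1, 40.9]
p_best = min(ps, key=q)
```

The scanned minimizer lies close to p = −ln h (the formula is an asymptotic for h ≪ 1, so agreement is approximate rather than exact), and the optimized cycle beats both very short and very long cycles.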
The above method is one of the simplest for the approximate solution of the
optimization problem for the multiparametric alternating direction method. It was
first used by PEACEMAN and RACHFORD [1955], DOUGLAS and RACHFORD [1956]
and DYAKONOV [1961, 1962b]. Further studies on the choice of optimal parameters
can be found in the works by WACHSPRESS [1962], TODD [1967], LEBEDEV [1977] and
others. These results have been thoroughly considered by SAMARSKII and NIKOLAEV
[1978].
the equality

ρ(T_τ) = ‖T_τ‖ = max{ |1 − τδ_1|, |1 − τΔ_1| }  (52.1)

occurs, where δ_1 and Δ_1 are the minimal and maximal eigenvalues of the matrix A_1,
respectively. Now, if we take

Ψ(τ) = ‖T_τ‖ = max{ |1 − τδ_1|, |1 − τΔ_1| },

then the solution of the approximate optimization problem is τ_opt = 2/(δ_1 + Δ_1),
for which

ρ(T_{τ_opt}) ≤ (1 − δ_1/Δ_1) / (1 + δ_1/Δ_1).  (52.3)
Assume in addition that the matrix A_2 is also positive-definite and that δ_2 and Δ_2 are
its minimal and maximal eigenvalues, respectively. Then we obtain the estimates
(52.4), where

δ = min{δ_1, δ_2},  Δ = max{Δ_1, Δ_2}.

Now, if we take

Ψ(τ) = max_{δ≤λ≤Δ} [ (1 − τλ)/(1 + τλ) ]²,

then, according to (50.14) and (52.4), we have

ρ(T_τ) ≤ Ψ(τ) < 1

for any τ > 0. In this case we get the solution of the approximate optimization
problem for (50.11) by the choice

τ_opt = (δΔ)^{−1/2}.  (52.5)
ρ(T_{τ_opt}) ≤ ‖T_{τ_opt}‖_{D_τ} ≤ [ (1 − (δ/Δ)^{1/2}) / (1 + (δ/Δ)^{1/2}) ]².  (52.6)
We now consider the non-self-adjoint case. To construct the majorant Ψ(τ) let us use
SAMARSKII's [1964a] method. Since

ρ(T_τ) ≤ ‖T_τ‖_{D_τ} ≤ ‖T_τ^{(1)}‖ ‖T_τ^{(2)}‖,  (52.7)

it will be sufficient to derive the majorant, for example, for ‖T_τ^{(1)}‖, assuming that A_1
is positive-definite. We introduce the values

δ_1 = inf_{z≠0} (A_1 z, z)/‖z‖²,  Δ_1 = sup_{z≠0} ‖A_1 z‖²/(A_1 z, z).  (52.8)

Obviously, both values exist and are positive, with δ_1 being the minimal eigenvalue
of the matrix S_1 = ½(A_1 + A_1^T). It is easy to see that

‖T_τ^{(1)}‖² = sup_{z≠0} ( ‖z‖² − 2τ(A_1 z, z) + τ²‖A_1 z‖² ) / ( ‖z‖² + 2τ(A_1 z, z) + τ²‖A_1 z‖² ),

and the minimum of the right-hand side over τ is attained at τ_min = (δ_1 Δ_1)^{−1/2}.
Thus, if the matrix A_1 is positive-definite, then as the solution of the approximate
optimization problem we can choose the parameter τ_opt = (δΔ)^{−1/2}, for which the
estimate (52.9) holds, where δ = min(δ_1, δ_2), Δ = max(Δ_1, Δ_2), and the values δ_2 and
Δ_2 are defined as in (52.8) but for the matrix A_2.
53. The convergence acceleration procedures for the alternating direction method
Note that the analogous procedure in the inner product generated by the matrix D_τ
of Section 50 is not suitable here.
Among the different versions of the alternating direction method, the case A_1 = A_2^T
considered in Section 50 stands out as rather efficient. The transition matrix of
method (50.10) with the factorized operator (50.16) is A-self-adjoint, which is the
reason for using acceleration procedures based on the Chebyshev methods and on
the generalized conjugate gradient method.
Let it be known that the eigenvalues of the matrix B_τ^{−1}A of the symmetric
alternating direction method belong to an interval [a, b] with 0 < a < b. For example,
if we use the results of the previous section, then it follows from the estimate (52.9)
that, since T_τ = E − 2τB_τ^{−1}A, the interval boundaries can be computed from the
bound q of (52.9) as

a = (1 − q)/(2τ),  b = (1 + q)/(2τ).
Under this assumption we can use the Chebyshev semi-iterative method (see VARGA
[1962]) with the preconditioning matrix B_τ for the solution of system (50.1); its
iteration parameters are computed from the boundaries a and b of the spectrum.
Alternatively, since the matrix B_τ^{−1}A is A-self-adjoint and A-positive-definite, the
generalized conjugate gradient method with the same preconditioning matrix B_τ can
be applied; each iteration requires one multiplication by A and one inversion of the
two factors of B_τ.
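Since the Chebyshev parameter recurrences depend on details not reproduced here, we sketch instead the conjugate gradient acceleration with the factorized preconditioner B_τ = (E + τA_2)^T(E + τA_2) of the symmetric case A_1 = A_2^T; the matrix A_2, the value of τ and the right-hand side are invented for the illustration:

```python
import numpy as np

def pcg_factorized(A2, f, tau, iters=50, tol=1e-10):
    """Conjugate gradients for A u = f with A = A2.T + A2 (the case
    A1 = A2^T), preconditioned by B = (E + tau*A2).T @ (E + tau*A2);
    each preconditioner application is two factor solves."""
    n = A2.shape[0]
    F = np.eye(n) + tau * A2
    A = A2.T + A2

    def solve_B(r):
        y = np.linalg.solve(F.T, r)   # (E + tau*A2)^T y = r
        return np.linalg.solve(F, y)  # (E + tau*A2)  z = y

    u = np.zeros(n)
    r = f - A @ u
    z = solve_B(r)
    p = z.copy()
    rz = r @ z
    for _ in range(iters):
        Ap = A @ p
        a = rz / (p @ Ap)
        u += a * p
        r -= a * Ap
        if np.linalg.norm(r) < tol:
            break
        z = solve_B(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return u

A2 = np.array([[2.0, 1.0, 0.0], [0.0, 2.0, 1.0], [0.0, 0.0, 2.0]])
A = A2.T + A2                    # symmetric positive-definite
f = np.array([1.0, 2.0, 3.0])
u = pcg_factorized(A2, f, tau=0.25)
```

Because B is symmetric positive-definite by construction, the preconditioned recurrence is well defined, and on this small example the iteration terminates at the exact solution within n steps up to rounding.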
54. Generalizations
The best known generalization of the original alternating direction method is the
following one:
(R + τA_1) R^{−1} (R + τA_2)(u^{k+1} − u^k) = −ατ(Au^k − f),  (54.1)
k = 0, 1, 2, ...,

where R is, generally speaking, an arbitrary symmetric and positive-definite matrix
(see WACHSPRESS and HABETLER [1960]). It is easy to see that this method is
completely equivalent to the alternating direction method

(E + τÃ_1)(E + τÃ_2)(ũ^{k+1} − ũ^k) = −ατ(Ãũ^k − f̃),  (54.2)
k = 0, 1, ...,

where

Ã_1 = R^{−1/2} A_1 R^{−1/2},  Ã_2 = R^{−1/2} A_2 R^{−1/2},  (54.3)
Ã = Ã_1 + Ã_2,  ũ^k = R^{1/2} u^k,  f̃ = R^{−1/2} f.
Hence, all the abovementioned facts for methods (50.10) and (50.11) refer to method
(54.1) as well. Of course, the role of the matrix R should be taken into account, i.e. in
deriving the formulae for the optimal parameters the matrices Ã_1 and Ã_2 of (54.3)
should be used instead of A_1 and A_2 of (54.1).
We consider the case A_1 = A_2^T. To this end, we first rewrite (54.1) with α = 2 in the
following equivalent form:

(R + τA_1) u^{k+1/2} = (R − τA_2) u^k + τf,  (54.4)
(R + τA_2) u^{k+1} = (R − τA_1) u^{k+1/2} + τf,

and then use the notations

L = ½R − A_2,  L^T = ½R − A_1,  ω = 2τ/(2 + τ).  (54.5)
Let the initial stationary problem, after approximation in all its variables, be
reduced to the equation

Aφ = f,  (55.1)

considered in a finite-dimensional Hilbert space H with inner product (·, ·)_H and
norm ‖·‖_H = (·, ·)_H^{1/2}. The matrix A is assumed to be symmetric, positive-definite
and representable in the form

A = ∑_{α=1}^{n} A_α,  (55.2)

where A_α = A_α^* ≥ 0. Problem (55.1) is then equivalent to the variational problem

J(φ) = min_{v∈H} J(v),  J(v) = ½(Av, v)_H − (f, v)_H,  (55.3)

and the elements f_α are chosen so that ∑_{α=1}^{n} f_α = f. According to the method of
stationing (see Section 37) we can consider the evolutionary problem

∂φ/∂t + Aφ = f,  t ∈ Ω_t = (0, T),
φ = g for t = 0,  g ∈ H,  (55.4)

and, in the case of large values of T, we can approximate the solution of (55.1) by the
solution φ(T) of problem (55.4). The numerical solution of (55.4), in turn, can be
realized by the splitting method. To this end, we represent f as

f = ∑_{α=1}^{n} f_α,  f_α ∈ H,  (55.5)

and use the splitting scheme

(φ^{j+α/n} − φ^{j+(α−1)/n})/τ + A_α φ^{j+α/n} = f_α,
0 ≤ j ≤ N−1,  1 ≤ α ≤ n,  (55.6)
φ^0 = g.
Each realization step of scheme (55.6) is equivalent to the minimization in H of the
following functional:

(1/(2τ)) ‖v − φ^{j+(α−1)/n}‖²_H + J_α(v),
0 ≤ j ≤ N−1,  1 ≤ α ≤ n,  (55.7)
φ^0 = g,

where J_α(v) = ½(A_α v, v)_H − (f_α, v)_H.
The algorithm constructed by using splitting scheme (55.6) is, in fact, the
decomposition method for solving the variational problem (55.3). This method
reduces (55.3) to a sequence of problems of minimizing the functionals (55.7). Note
that the convergence of the decomposition algorithm (55.7) follows directly from the
convergence theorems for scheme (55.6). These theorems have been considered in the
foregoing chapters.
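For quadratic functionals J_α(v) = ½(A_α v, v) − (f_α, v) the partial minimization (55.7) has the closed-form solution (E + τA_α)v = φ + τf_α, i.e. exactly one fractional step of scheme (55.6). The sketch below (the matrices, τ and the number of sweeps are invented; with a fixed τ the stationary limit carries a small splitting bias of order τ) drives the iteration toward the minimizer of J = J_1 + J_2:

```python
import numpy as np

def decompose_minimize(As, fs, g, tau, steps):
    """Decomposition method (55.7) for quadratic pieces: each partial
    step minimizes (1/(2*tau))*||v - phi||^2 + J_a(v), whose minimizer
    solves (E + tau*A_a) v = phi + tau*f_a (one step of scheme (55.6))."""
    n = g.size
    E = np.eye(n)
    phi = g.copy()
    for _ in range(steps):
        for Aa, fa in zip(As, fs):
            phi = np.linalg.solve(E + tau * Aa, phi + tau * fa)
    return phi

A1 = np.array([[3.0, -1.0], [-1.0, 2.0]])
A2 = np.array([[1.0, 0.0], [0.0, 2.0]])
f = np.array([1.0, 1.0])
f1, f2 = 0.5 * f, 0.5 * f
phi = decompose_minimize([A1, A2], [f1, f2], np.zeros(2), tau=0.01, steps=5000)
# the limit approximates the minimizer of J, i.e. the solution of A v = f
exact = np.linalg.solve(A1 + A2, f)
```

With τ small the stationary value of the sweep approximates A^{−1}f, illustrating how the sequence of proximal-type problems (55.7) realizes the minimization of the full functional.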
Based on algorithm (55.7), we now formulate the decomposition method for the
following general variational problem in a Hilbert space H:

J(φ) = inf_{v∈H} J(v),  (56.1)

where J: H → R is a continuous function with a lower bound. We assume that
J = ∑_{α=1}^{n} J_α,  (56.2)

where the J_α: H → R are also continuous functions with a lower bound. We introduce
a sequence of parameters τ_j, j = 0, 1, ..., N−1, and define elements

φ^{j+α/n},  j = 0, 1, ..., N−1,  α = 1, ..., n,  (56.3)

where φ^0 = g is an arbitrary element of H. If φ^{j+(α−1)/n} is known, then φ^{j+α/n} is
determined as a solution of a variational problem of the type

inf_{v∈H} [ (1/(2τ_j)) ‖v − φ^{j+(α−1)/n}‖²_H + J_α(v) ].  (56.4)

This problem has at least one solution. We denote by φ^{j+α/n} one of these solutions
and compute further φ^{j+(α+1)/n}, φ^{j+(α+2)/n}, ....
It is obvious that the proposed algorithm for solving problem (56.1) is efficient
when it is easier to solve the sequence of problems (56.4) than to solve problem (56.1)
itself, which depends on the explicit form of J and the J_α. It is also easily seen that the
given algorithm is of great interest for problems of type (55.3), obtained by
discretizing elliptic boundary value problems, etc. Note that the splitting method
theory alone does not allow us to conclude that algorithm (56.4) converges (see
Section 58).
Let K be a closed convex set in H (the set of restrictions) and, as in Section 56, let
J: H → R be a continuous function satisfying the assumptions of Section 56.
Instead of (56.1) we consider the problem

inf_{v∈K} J(v),  (57.1)

K = ∩_{α=1}^{n} K_α,  (57.2)

where the Φ_α are reflexive Banach spaces with norms ‖·‖_α and the inclusions
Φ ⊂ Φ_α ⊂ H are continuous (α = 1, ..., n). We introduce the following assumption:
K = ∩_{α=1}^{n} K_α,  (58.2)

where K_α is a convex subset of Φ_α. We also assume that J(v) can be represented as
a sum of the form (56.2), and define φ^{j+α/n} ∈ K_α by the variational inequality

(1/(2τ_j)) ‖φ^{j+α/n} − φ^{j+(α−1)/n}‖² + J_α(φ^{j+α/n})
   ≤ (1/(2τ_j)) ‖v − φ^{j+(α−1)/n}‖² + J_α(v)  (58.4)

for any element v ∈ K_α. This problem has a unique solution φ^{j+α/n} ∈ K_α. From
φ^0, ..., φ^{j+α/n} we form the elements

φ̄_N^{α/n} = (1/σ_N) ∑_{j=0}^{N} τ_j φ^{j+α/n},  (58.5)
where

σ_N = ∑_{j=0}^{N} τ_j;

then

lim_{N→∞} φ̄_N^{α/n} = φ

in the weak topology of Φ_α for 1 ≤ α ≤ n, where φ is the solution of problem (58.1).
In the work by BENSOUSSAN, LIONS and TEMAM ([1975], p. 200) one can find the
proof of this statement, as well as a number of concrete realizations of the above
decomposition algorithms for solving variational problems of mathematical physics,
along with the corresponding splitting methods.
PART 3
Applications of Splitting
Methods to Problems of
Mathematical Physics
In this part we will consider the realization of the general statements of the splitting
methods from the previous chapters for some concrete problems of mathematical
physics. The means of approximating the problems in the space, angular and other
variables will be regarded as already established (see Part 1). All this is supposed to
give a complete notion of the methods used for approximating the solutions of
problems by splitting.
CHAPTER XV
where

A_1 φ = ∂(uφ)/∂x − μ ∂²φ/∂x²,
A_2 φ = ∂(vφ)/∂y − μ ∂²φ/∂y²,  (59.2)
A_3 φ = ∂(wφ)/∂z − ∂/∂z ( ν ∂φ/∂z ).

The following notations are used in (59.1) and (59.2): x, y, z are the components of the
Cartesian coordinate system (the x-axis is directed to the east, the y-axis to the
north, and the z-axis vertically upward); Ω = {0 < x < a, 0 < y < b,
0 < z < H} is a domain with boundary ∂Ω consisting of the cylinder's side surface ∂Ω_σ
with lower basis ∂Ω_0 (for z = 0) and upper basis ∂Ω_H (for z = H); t ∈ Ω_t = (0, T) denotes
time; φ is the intensity of the aerosol substance migrating along with the air flow in
the atmosphere; u, v, w are the components of the wind velocity along the x-, y- and
z-axes, respectively; μ = const > 0 and ν(z) > 0 are the horizontal and vertical
turbulence coefficients; σ = const > 0 is the coefficient of the substance's absorption.
Λ_2 φ = (v_{i,j+1/2,k} φ_{i,j+1,k} − v_{i,j−1/2,k} φ_{i,j−1,k})/(2Δy)
        − μ (φ_{i,j+1,k} − 2φ_{i,j,k} + φ_{i,j−1,k})/Δy²,  (60.4)

with Λ_1 φ and Λ_3 φ defined analogously by the central differences in x and z.
The differential operators A_α and their difference analogues Λ_α, on the functions
satisfying the boundary condition, are positive-definite in the corresponding spaces
Φ = H and Φ_h = H_h. Thus, the initial problem has been reduced to a system of
ordinary differential equations (the index h is everywhere omitted):

dφ/dt + Λφ = f in Ω_h × Ω_t,  φ = g for t = 0,  Λ = ∑_{α=1}^{p} Λ_α.  (60.5)

For its solution we take the scheme (60.6) with the factorized operator

B = ∏_{α=1}^{p} (E + θτΛ_α)
  = E + θτΛ + (θτ)² ∑_{α_1<α_2} Λ_{α_1}Λ_{α_2} + ⋯ + (θτ)^p Λ_1 Λ_2 ⋯ Λ_p,  (60.7)

and denote by Q the sum of the terms of (60.7) of second and higher orders in θτ:

Q = (θτ)² ∑_{α_1<α_2} Λ_{α_1}Λ_{α_2} + ⋯ + (θτ)^p Λ_1 Λ_2 ⋯ Λ_p.  (60.10)

If the estimate

‖Q(φ^{n+1} − φ^n)‖ ≤ Mτ²,  M = const,  (60.11)

holds, then scheme (60.6) approximates the initial problem (60.5) with the same
order of accuracy as the scheme with B = E + θτΛ.
Scheme (60.6) is absolutely stable for θ ≥ ½, and for the solution the following
estimate (DYAKONOV [1971d]) is valid:

‖φ^{j+1}‖ ≤ ‖g‖ + ∑_{j'=0}^{j} τ ‖f^{j'}‖.
Taking into account the values of φ at the boundary points and the form of the
operator B, the scheme can be rewritten in the form (60.13) with a known right-hand
side Φ^n. The realization of (60.13) consists in consecutively solving the problems

(E + τθΛ_1) φ^{n+1/p} = Φ^n,
(E + τθΛ_2) φ^{n+2/p} = φ^{n+1/p},  (60.14)
. . .
(E + τθΛ_p) φ^{n+1} = φ^{n+(p−1)/p}.
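Each factor in (60.14) is a three-point (tridiagonal) operator, so every fractional step is computed by the scalar sweep method. A minimal sketch follows; the grid size, coefficients and right-hand side are invented, and Λ is taken as the one-dimensional three-point Laplacian:

```python
def sweep(a, b, c, d):
    """Scalar sweep (Thomas) algorithm for a tridiagonal system
    a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], a[0] = c[-1] = 0."""
    n = len(d)
    p, q = [0.0] * n, [0.0] * n
    p[0] = -c[0] / b[0]
    q[0] = d[0] / b[0]
    for i in range(1, n):                      # forward sweep
        den = b[i] + a[i] * p[i - 1]
        p[i] = -c[i] / den if i < n - 1 else 0.0
        q[i] = (d[i] - a[i] * q[i - 1]) / den
    x = [0.0] * n
    x[-1] = q[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = p[i] * x[i + 1] + q[i]
    return x

# factor (E + tau*theta*Lambda) with Lambda the 1-D three-point Laplacian
tau, theta, h = 0.1, 0.5, 0.2
n = 9
coef = tau * theta / h**2
a = [-coef] * n; c = [-coef] * n
a[0] = 0.0; c[-1] = 0.0
b = [1.0 + 2.0 * coef] * n
d = [1.0] * n
x = sweep(a, b, c, d)
```

The factor matrix is diagonally dominant for θ > 0, so the sweep is stable; one such solve per direction realizes one line of (60.14).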
We consider the first boundary value problem for a parabolic equation with mixed
derivatives in the parallelepiped Ω:

∂φ/∂t − ∑_{α,β=1}^{p} ∂/∂x_α ( k_{αβ} ∂φ/∂x_β ) = f in Ω × Ω_t,  (61.1)
φ = g for t = 0,
φ = 0 on ∂Ω × Ω_t.

As in Section 60 we introduce a uniform grid in the variables x_α with the steps
h_α = l_α/(I_α + 1): x_{i_α} = i_α h_α, i_α = 0, 1, ..., I_α + 1. We approximate the differential
operators A_{αα} with second-order accuracy by their finite difference analogues Λ_{αα}.
The operator Λ_{αα} is written in the traditional form (see (60.4))

Λ_{αα} φ = −(φ_{i_α+1} − 2φ_{i_α} + φ_{i_α−1})/h_α²,  1 ≤ i_α ≤ I_α,  (61.2)

and, according to YANENKO [1967], the operators Λ_{αβ}, α ≠ β, are approximated by
the corresponding central differences (61.3).
To solve (61.1) we take the analogue of scheme (60.6) with a factorized difference
operator B = ∏_{α=1}^{p} B_α:

φ^0 = g,  (61.4)
φ_i = 0 for i_α = 0 and i_α = I_α + 1,  α = 1, 2, ..., p.

We choose the operators B_α in the form (SAMARSKII [1971])

B_α = E + τθΛ_{αα}.  (61.5)

In Φ_h = H_h these operators are self-adjoint, positive (for θ > 0) and commutative
(because the domain is a parallelepiped). The factorized scheme (61.4) has at least
first-order accuracy in τ and is stable for θ ≥ ½. With the B_α, α = 1, 2, ..., p, being
three-point difference operators with constant coefficients, the algorithm for
computing φ^{n+1} for a given φ^n coincides with that of Section 60.
For the special case f ≡ 0, k_{αβ} = const, p = 2, YANENKO [1967] considered the
scheme based on the splitting method

(φ^{n+1/2} − φ^n)/τ + k_{11}Λ_{11}φ^{n+1/2} + k_{12}Λ_{12}φ^n = 0,
  (61.6)
(φ^{n+1} − φ^{n+1/2})/τ + k_{21}Λ_{21}φ^{n+1/2} + k_{22}Λ_{22}φ^{n+1} = 0.
For p = 3 the analogous scheme consists of six fractional steps:

(φ^{n+1/6} − φ^n)/τ + ½k_{11}Λ_{11}φ^{n+1/6} + k_{12}Λ_{12}φ^n = 0,
(φ^{n+2/6} − φ^{n+1/6})/τ + ½k_{22}Λ_{22}φ^{n+2/6} + k_{21}Λ_{21}φ^{n+1/6} = 0,
(φ^{n+3/6} − φ^{n+2/6})/τ + ½k_{33}Λ_{33}φ^{n+3/6} + k_{13}Λ_{13}φ^{n+2/6} = 0,
(φ^{n+4/6} − φ^{n+3/6})/τ + k_{31}Λ_{31}φ^{n+3/6} + ½k_{33}Λ_{33}φ^{n+4/6} = 0,  (61.7)
(φ^{n+5/6} − φ^{n+4/6})/τ + ½k_{22}Λ_{22}φ^{n+5/6} + k_{23}Λ_{23}φ^{n+4/6} = 0,
(φ^{n+1} − φ^{n+5/6})/τ + k_{32}Λ_{32}φ^{n+5/6} + ½k_{11}Λ_{11}φ^{n+1} = 0.
Scheme (61.7) approximates the original problem with first-order accuracy in τ and
is stable if the matrix with the elements k_{αα} on the main diagonal and k_{αβ} off the
diagonal is positive-definite.
Schemes (61.6) and (61.7) are realized on each splitting step by the scalar sweep
method.
The analogue of scheme (61.7) for the inhomogeneous equation contains on the
right-hand sides of the fractional steps the terms f(x, t^{n+i/6}, φ^{n+(i−1)/6}); the
positive-definiteness condition (62.3) is imposed on the quadratic form with the
coefficients k_{αβ} and on its difference analogue. Scheme (62.1) is realized by the sweep method
on the first three steps and by computations with explicit formulae on the final three
stages.
For a certain class of coefficients k_{αβ} one can use the scheme with three fractional
steps:

(φ^{n+1/3} − φ^n)/τ + k_{11}Λ_{11}φ^{n+1/3} + k_{12}Λ_{12}φ^n + k_{13}Λ_{13}φ^n = f(x, t^{n+1/3}, φ^n),
(φ^{n+2/3} − φ^{n+1/3})/τ + k_{22}Λ_{22}φ^{n+2/3} + k_{21}Λ_{21}φ^{n+1/3} + k_{23}Λ_{23}φ^{n+1/3}
   = f(x, t^{n+2/3}, φ^{n+1/3}),  (62.5)
(φ^{n+1} − φ^{n+2/3})/τ + k_{33}Λ_{33}φ^{n+1} + k_{31}Λ_{31}φ^{n+1/3} + k_{32}Λ_{32}φ^{n+2/3}
   = f(x, t^{n+1}, φ^{n+2/3}).

This scheme converges for τ → 0; its realization is analogous to that in Section 61.
All the above difference schemes for solving parabolic equations have no more than
second-order accuracy in the space variables and in time. The splitting (fractional
steps) method can be used for obtaining simple schemes of increased order of accuracy.
Assume here that the grid is uniform and has the same step h in all the variables x_α.
The schemes of increased accuracy for the simplest parabolic equations,
constructed by factorizing the operator on the upper time layer, are considered in the
works by DOUGLAS and GUNN [1963], SAMARSKII [1963b], SAMARSKII and GULIN
[1973], YANENKO [1967], SAMARSKII and ANDREEV [1963]. These schemes are
constructed on the basis of an m-layer homogeneous scheme of increased accuracy
(m ≥ 2) of the form

L φ^{n+1} = F^n,  (63.1)

where F^n is the result of applying difference spatial operators to the functions
φ^n, φ^{n−1}, ....
We consider the method proposed by DOUGLAS and GUNN [1963] and assume
that the following two-dimensional heat conduction equation has to be solved:

∂φ/∂t + Aφ = f,  (63.2)

where

A = A_1 + A_2,  A_α = −∂²/∂x_α²,  α = 1, 2.
(φ^{n+1} − φ^n)/τ + θΛφ^{n+1} + (1 − θ)Λφ^n = f^n,
θ = ½ − h²/(12τ),  Λ = ∑_{α=1}^{p} Λ_α,  (63.7)
with accuracy of order O(τ² + h⁴) is taken instead of the initial scheme (63.1).
In this case the factorized scheme has the form

(E + θτΛ_1)(E + θτΛ_2) ⋯ (E + θτΛ_p) φ^{n+1} = Θφ^n + τf^n,  (63.9)

where the operator Θ collects the terms acting on the lower layer, the expansion
ending with the term (−1)^{p−1}(θτ)^{p−1}Λ_1Λ_2⋯Λ_p.  (63.10)
where Λ_{11}, Λ_{22} and Λ_{12} are given by formulae (61.2) and (61.3), the parameter b in
(63.13) is defined by

b = 1 + 2k_{22} − 3|k_{12}|,

and L_i = E + θτΛ_{ii}.
All the schemes of this section are realized on the basis of the scalar sweep method
in all directions, which is fairly simple.
The balance relations are obtained by integrating the equation over the grid cell
(x_1, x_1 + h_1) × (x_2, x_2 + h_2).
where the operator Λ_{11} acts on the grid lines parallel to the axis x_1, the operator
Λ_{22} on the lines parallel to the axis x_2, the operator Λ_{12} on the diagonals of positive
direction in the subdomain Ω′, and Λ_{21} on the negative diagonals in the subdomain Ω″.
It is proved that the operators are positive-semidefinite (Λ_{αβ} ≥ 0, α, β = 1, 2) if the
corresponding conditions on the coefficients hold.
Based on this we can use one of the splitting methods, in particular the two-cycle
method of type (59.12) with Crank-Nicolson schemes on each elementary step, for
solving (64.5) (MARCHUK [1980]).
Similar results are obtained for the generalized Neumann boundary
conditions.
The application of piecewise linear prolongation to equation (64.1), under the
condition that k_{12} + k_{21} = 0, on a nonuniform triangulated grid which is
topologically equivalent to a rectangular one, also leads to the splitting of the grid
operator into four one-dimensional operators. In this case the conditions of
positive-semidefiniteness of the one-dimensional operators reduce to some
geometrical conditions on the grid parameters.
This method has been used in ocean dynamics models for computing the
stream function (KUZIN [1980]) and for solving the three-dimensional equation of
heat flux in the ocean, which is reduced by a special splitting along the sections to
a series of equations of type (64.1) (KUZIN [1984]).
CHAPTER XVI
At present there exist a number of effective algorithms for solving problems that are
connected with equations of hyperbolic type. This chapter features methods based
on special splitting algorithms for solving such problems.

Λ_1 φ = −(φ_{i_1+1} − 2φ_{i_1} + φ_{i_1−1})/h_1²,  2 ≤ i_1 ≤ M − 1,
Λ_1 φ = −(φ_2 − 2φ_1)/h_1²,  i_1 = 1,
B (φ^{n+1} − 2φ^n + φ^{n−1})/τ² + Λφ^n = f^n,  (65.6)

B = ∏_{j=1}^{N} (E + ½τ²Λ_j).  (65.7)

Equation (65.6) approximates (65.4) with order O(τ²). Equation (65.6) can be written
in the form

φ^{n+1} = 2φ^n − φ^{n−1} − τ² B^{−1}(Λφ^n − f^n).  (65.8)
A Fourier spectrum analysis of the components of the solution vector of (65.8) leads
to the following necessary stability condition for scheme (65.6)-(65.7):

τ² β(B^{−1}Λ) ≤ 4,

where β(B^{−1}Λ) is the upper bound of the spectrum of the operator B^{−1}Λ. Thus, the
problem of choosing a parameter τ satisfying the stability condition is reduced to
computing the maximal eigenvalue of the problem

Λu = λBu,

under the assumption that all the eigenvalues of B^{−1}Λ are positive. This problem is
solved, for example, with the help of the Lusternik iterative process (see MARCHUK
[1964a]).
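The maximal eigenvalue of Λu = λBu can be approximated by a simple power iteration on B^{−1}Λ, used here as a stand-in for the Lusternik process mentioned above; the matrices and τ below are invented for the illustration:

```python
import numpy as np

def max_eig_generalized(L, B, iters=500):
    """Power iteration for the largest eigenvalue of B^{-1} L,
    i.e. of the problem L u = lambda * B u (dominant eigenvalue
    assumed real, positive and simple)."""
    u = np.ones(L.shape[0])
    lam = 0.0
    for _ in range(iters):
        w = np.linalg.solve(B, L @ u)     # w = B^{-1} L u
        lam = np.linalg.norm(w) / np.linalg.norm(u)
        u = w / np.linalg.norm(w)
    return lam

L1 = np.array([[2.0, -1.0], [-1.0, 2.0]])
L2 = np.array([[1.0, 0.0], [0.0, 2.0]])
Lam = L1 + L2
tau = 0.1
E = np.eye(2)
B = (E + 0.5 * tau**2 * L1) @ (E + 0.5 * tau**2 * L2)   # factorized operator
beta = max_eig_generalized(Lam, B)
# the necessary stability condition then reads tau**2 * beta <= 4
```

In practice one would apply B^{−1} by the two factor solves rather than forming B; the dense form above only keeps the sketch short.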
The realization of difference scheme (65.6) reduces to the consecutive factor solves

(E + ½τ²Λ_1) ξ^{n+1/N} = −τ²(Λφ^n − f^n),
(E + ½τ²Λ_2) ξ^{n+2/N} = ξ^{n+1/N},
. . .  (65.9)
(E + ½τ²Λ_N) ξ^{n+1} = ξ^{n+(N−1)/N},
φ^{n+1} = 2φ^n − φ^{n−1} + ξ^{n+1}.

This problem is solved consecutively for n = 2, 3, ... using the initial data (65.3).
Λ_1 φ = −a²(φ_{i+1,j} − 2φ_{i,j} + φ_{i−1,j})/h_1²,
Λ_2 φ = −a²(φ_{i,j+1} − 2φ_{i,j} + φ_{i,j−1})/h_2².
E + ½τ²Λ ≈ (E + ½τ²Λ_1)(E + ½τ²Λ_2).

Finally, we replace scheme (66.3) by the factorized scheme (66.5). Scheme (66.5) is
stable and has second-order accuracy. It is realized as follows:

(E + ½τ²Λ_1) φ^{n+1/2} = Φ^n,
(E + ½τ²Λ_2) φ^{n+1} = φ^{n+1/2},  n = 2, 3, ...,  (66.6)

with the initial data

φ^0 = p_h,  φ^1 = p_h + τ q_h,

and, for the first step,

(E + ½τ²Λ_1)(E + ½τ²Λ_2) φ^{n+1} = Φ^n − φ^{n−1}.  (66.7)
Schemes (66.9) and (66.10) approximate the differential problem (65.1)-(65.3) for the
N-dimensional hyperbolic equation. The approximations are consistent for
τ/h_i = const, i = 1, ..., N. A Fourier analysis shows that schemes (66.9) and (66.10)
are unconditionally stable, independently of the value of τ/h_i = const.
Note also that, in the case N = 2, scheme (66.10) can be used for the more general
equation (KONOVALOV [1962])

∂²φ/∂t² = a_{11} ∂²φ/∂x_1² − 2a_{12} ∂²φ/∂x_1∂x_2 + a_{22} ∂²φ/∂x_2²,  (66.11)

split so that the first half-step treats a_{11}Λ_{11} and a_{12}Λ_{12} and the second half-step
treats a_{22}Λ_{22}.
∂²φ/∂t² + ∑_i A_i φ = f(x, t) in Ω × Ω_t,  (67.1)

A_i φ = −∂/∂x_i ( k_i(x, t) ∂φ/∂x_i ),  φ = μ(x, t) on ∂Ω,

φ = p and ∂φ/∂t = q in Ω for t = 0.  (67.2)

Here

k_i(x, t) ≥ δ > 0,  x = (x_1, x_2, ..., x_N).
In order to approximate the derivative ∂²φ/∂t² with the step τ/N we use, for N = 2,
the expression

(∂²φ/∂t²)^{n+(i−1)/2} ≈ (φ̄_i − 2φ̄_{i−1} + φ̄_{i−2})/(τ/2)²,  i = 1, 2,  (67.3a)

where the fractional values are

φ̄_i = φ^{n+i/2},  φ̄_0 = φ^n,  φ̄_{−2} = φ^{n−1};

for N = 3 the analogous expression with the step τ/3 uses

φ̄_{−1} = φ^{n−1/3},  φ̄_{−2} = φ^{n−2/3}.
To approximate A_i φ + f_i on the grid Ω_h we use the one-dimensional difference
operator of second-order accuracy Λ_i φ_h + f_i. If the operator A_i φ contains
lower-order terms, then, in using the sweep method, the requirement arises that the
step sizes of the grid Ω_h be sufficiently small. To get rid of these restrictions, the
lower-order terms should be taken on the intermediate layers. The first of the initial
conditions, φ(x, 0) = p(x), is approximated exactly:

φ_h(x, 0) = p(x).  (67.8)
To compute the intermediate values

φ^{1/2} = φ(x, τ/2) for N = 2

and

φ^{1/3} = φ(x, τ/3),  φ^{2/3} = φ(x, 2τ/3) for N = 3,

we use the following equations: for N = 2,

(E + ½τ̄²Λ_1) φ^{1/2} = F_1,  τ̄ = τ/2,
F_1 = p + τ̄ q + ½τ̄²(f − Aφ)|_{t=0};  (67.9)

for N = 3,

(E + ½τ̄²Λ_1) φ^{1/3} = F_1,  τ̄ = τ/3,
(E + ½τ̄²Λ_2) φ^{2/3} = φ^{1/3} + F_2,  (67.10)

where F_1 and F_2 are formed from p, q and the values of f and Aφ at t = 0 in the same
manner.
The splitting method for multidimensional hyperbolic systems was first suggested
by BAGRINOVSKY and GODUNOV [1957]. These authors considered only the explicit
approximation, for which the splitting scheme has few advantages over
conventional explicit schemes.
ANUCHINA and YANENKO [1959] suggested an implicit splitting scheme for a
multidimensional hyperbolic system. We demonstrate the idea of this scheme for the
case of the two-dimensional system of acoustics equations.
We consider the following system of equations:

∂u₁/∂t + a² ∂v/∂x₁ = 0,  ∂u₂/∂t + a² ∂v/∂x₂ = 0,
∂v/∂t + ∂u₁/∂x₁ + ∂u₂/∂x₂ = 0.    (68.1)

In order to solve (68.1) we use a weighted splitting scheme along the coordinates:
(u₁ⁿ⁺¹ᐟ² − u₁ⁿ)/τ + a²Λ₁(αvⁿ⁺¹ᐟ² + βvⁿ) = 0,
(u₂ⁿ⁺¹ᐟ² − u₂ⁿ)/τ = 0,    (68.2)
(vⁿ⁺¹ᐟ² − vⁿ)/τ + Λ₁(αu₁ⁿ⁺¹ᐟ² + βu₁ⁿ) = 0;

(u₁ⁿ⁺¹ − u₁ⁿ⁺¹ᐟ²)/τ = 0,
(u₂ⁿ⁺¹ − u₂ⁿ⁺¹ᐟ²)/τ + a²Λ₂(αvⁿ⁺¹ + βvⁿ⁺¹ᐟ²) = 0,    (68.3)
(vⁿ⁺¹ − vⁿ⁺¹ᐟ²)/τ + Λ₂(αu₂ⁿ⁺¹ + βu₂ⁿ⁺¹ᐟ²) = 0,
α + β = 1,

where Λᵢ is the central difference derivative along xᵢ,

Λᵢφ = (φ₊₁ᵢ − φ₋₁ᵢ)/(2hᵢ),

the subscripts ±1ᵢ denoting a shift by one grid point in the xᵢ-direction.
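The structure of (68.2)-(68.3) can be sketched in code. Below is a minimal sketch for the explicit choice of weights α = 0, β = 1 on a periodic grid with central differences; the grid sizes, helper names and the periodic boundary are assumptions for illustration, not from the text:

```python
import numpy as np

# Explicit variant (alpha = 0, beta = 1) of the coordinate splitting scheme
# (68.2)-(68.3) for the 2-D acoustics system (68.1); periodic grid assumed.
def split_step(u1, u2, v, a2, tau, h1, h2):
    L1 = lambda w: (np.roll(w, -1, 0) - np.roll(w, 1, 0)) / (2 * h1)
    L2 = lambda w: (np.roll(w, -1, 1) - np.roll(w, 1, 1)) / (2 * h2)
    # first fractional step: only the x1-derivatives act, u2 is frozen
    u1_h = u1 - tau * a2 * L1(v)
    v_h = v - tau * L1(u1)
    # second fractional step: only the x2-derivatives act, u1 is frozen
    u2_n = u2 - tau * a2 * L2(v_h)
    v_n = v_h - tau * L2(u2)
    return u1_h, u2_n, v_n
```

With α > 0 each half-step instead requires the solution of tridiagonal systems along grid lines, which is exactly where the sweep method enters.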
Here

Λφ = −∂/∂x (a ∂φ/∂x) + b ∂φ/∂x + cφ,

and the scheme takes the form

(E − ¼τ²Λ)(φⁿ⁺¹ − 2φⁿ + φⁿ⁻¹)/τ² + Σᵣ₌₁ᵈ Aᵣφⁿ = fⁿ,    (68.6)

with

φ⁰ = p,  φ¹ = p + τq + ½τ²[∂²φ/∂t²]ₜ₌₀,
φ = 0 for x ∈ ∂Ω_h.

If the conditions

(A⁽ⁿ⁾u, u) ≥ δ‖u‖²_R,  δ = δ₁/2 > 0,
(A⁽ⁿ⁾(u), v) ≤ a₃‖u‖_R ‖v‖_R,
0 < τ ≤ T*(4a₃)⁻¹,
Ru = Σᵣ₌₁ᵈ aᵣuᵣ

hold, then for the solution of (68.6) the following a priori estimate is valid:

‖φᵏ‖_R ≤ C (‖φ¹‖_R + Σₙ₌₂ᵏ⁻¹ τ‖fⁿ‖).
Applied to multidimensional hyperbolic systems, the splitting method allows one to
obtain majorant approximations. Let

∂φ/∂t + Σᵢ₌₁ᴺ Aᵢ ∂φ/∂xᵢ + Bφ = 0,    (68.7)

and consider the scheme

φⁿ⁺¹ = C₀φⁿ + Σᵢ₌₁ᴺ C₋ᵢ T₋ᵢ φⁿ,    (68.8)

where T₋ᵢ is the backward shift operator in the xᵢ-direction and

C₀ = E − Σᵢ₌₁ᴺ (τ/hᵢ)Aᵢ − τB,
C₋₁ = (τ/h₁)A₁,  C₋₂ = (τ/h₂)A₂, …,  C₋N = (τ/h_N)A_N.

If τ/hᵢ is sufficiently small, then the matrix C₀ is positive, and the matrices C₋ᵢ are identically
positive.
If the Aᵢ are negative matrices, then E − T₋ᵢ should be replaced by Tᵢ − E in (68.8).
If the Aᵢ are matrices of arbitrary sign, then the following representation is
possible:

Aᵢ = Aᵢ⁺ + Aᵢ⁻,

where the Aᵢ⁺ are nonnegative and the Aᵢ⁻ are nonpositive. In this case the scheme has
the form

φⁿ⁺¹ = C₀φⁿ + Σᵢ₌₁ᴺ C₋ᵢT₋ᵢφⁿ + Σᵢ₌₁ᴺ CᵢTᵢφⁿ,    (68.9)
where

C₀ = E − Σᵢ₌₁ᴺ (τ/hᵢ)Aᵢ⁺ + Σᵢ₌₁ᴺ (τ/hᵢ)Aᵢ⁻ − τB.

Scheme (68.10) is reduced to

φⁿ⁺ˢᐟᴺ = Cₛφⁿ⁺⁽ˢ⁻¹⁾ᐟᴺ,

where

Cₛ = C₀ₛ + C₋ₛT₋ₛ + CₛTₛ,
C₀ₛ = E − (τ/hₛ)Aₛ⁺ + (τ/hₛ)Aₛ⁻ − τB/N,
C₋ₛ = (τ/hₛ)Aₛ⁺,  Cₛ = −(τ/hₛ)Aₛ⁻,
s = 1, …, N.

The operators C₋ₛ and Cₛ are always positive, and the operators C₀ₛ are positive if τ/hₛ is
small.
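For the scalar model problem ∂φ/∂t + a ∂φ/∂x = 0 the reduced scheme above collapses to a positive-coefficient (majorant) upwind update. A minimal sketch, assuming a periodic grid and B = 0 (the parameters are illustrative):

```python
import numpy as np

# Majorant scheme (68.9) for the scalar model du/dt + a du/dx = 0 with a of
# arbitrary sign: split a = a_plus + a_minus and shift backward (T_{-1}) for
# a_plus, forward (T_{+1}) for a_minus.  Periodic grid; B = 0 assumed.
def majorant_step(phi, a, tau, h):
    a_plus, a_minus = max(a, 0.0), min(a, 0.0)
    c0 = 1.0 - (tau / h) * a_plus + (tau / h) * a_minus  # coefficient of phi^n
    cm = (tau / h) * a_plus       # coefficient of the backward shift
    cp = -(tau / h) * a_minus     # coefficient of the forward shift
    # c0, cm, cp are all nonnegative when tau*|a|/h <= 1: the scheme is
    # monotone (majorant), and c0 + cm + cp = 1 so constants are preserved
    return c0 * phi + cm * np.roll(phi, 1) + cp * np.roll(phi, -1)
```

At the Courant limit τ|a|/h = 1 the update degenerates into a pure shift, i.e. exact transport along the characteristic.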
Hence, the scheme
CHAPTER XVII

Integro-Differential
Transport Equations

f(s, x, t) is the function of sources, σ = σ(x), σₛ = σₛ(x), and the restrictions

0 < σ₀ ≤ σ(x) ≤ σ₁ < ∞,  0 ≤ σₛ(x) ≤ σ_{s1},    (69.2)
0 < σ_{c0} ≤ σ_c(x) = σ(x) − σₛ(x),

for constants σ₀, σ₁, σ_{s1}, σ_{c0}, are assumed to hold.
For (69.1) we will take boundary and initial conditions of the type

φ(s, x, t) = 0 for (s, n) < 0,  x ∈ ∂Ω,  t ≥ 0,  s ∈ ω,    (69.3)
φ(s, x, 0) = g(s, x) for x ∈ Ω,  s ∈ ω,    (69.4)

where n = (n₁, n₂, n₃) is the unit vector of the external normal to ∂Ω, (s, n) = Σᵢ₌₁³ sᵢnᵢ,
and g(s, x) is the function of the particles' initial distribution.
If the spatial domain in which the process of particle transport is considered is
the slab, where x₃ = z ∈ Ω = (0, H), H < ∞, −∞ < x₁, x₂ < ∞, and the properties of the
medium and those of f and g do not depend on x₁, x₂ and ψ, then problems (69.1),
(69.3)-(69.4) take the form

(1/v) ∂φ/∂t + μ ∂φ/∂z + σ(z)φ(μ, z, t) = ½σₛ(z) ∫₋₁¹ φ(μ′, z, t) dμ′ + f.    (69.5)
hold, where e = (1, 1, …, 1) and the operators act in the Hilbert space Φ ≡ H with the
inner product (·, ·). Now we formulate the scheme of incomplete splitting for the
solution of these problems (MARCHUK and YANENKO [1964]):
(Note that here (E + ¼τσ_c)⁻¹ and (E − ¼τΛ_{s,h}) are diagonal matrices and hence
the scheme is easily realizable at this stage.) Then we rewrite (69.7a) in the form

(φⁿ⁺¹ᐟ² − φⁿ)/τ + ¼σ_c(φⁿ⁺¹ᐟ² + φⁿ) = ¼Λ_{s,h}(φⁿ⁺¹ᐟ² + φⁿ) + f_h.

Since the quantities on the right-hand side are already known, we invert the diagonal matrix and find φⁿ⁺¹ᐟ².
Realization of (69.7b) depends on the approximation chosen for the operator (s, ∇).
So does realization of the boundary condition (69.3).
Assuming α = β = ½ in (69.7) and writing the two-cycle scheme of incomplete
splitting in the form
i = 1, 2, …,
this scheme is absolutely stable, but it has only first-order accuracy in τ (since the
operators Λ₁ and Λ_{2,i} do not commute). Equation (70.1a) is realized as in the
scheme of incomplete splitting.
We now consider the solution of equations (70.1b). Let Ω be a parallelepiped.
Then these equations are easily solvable if the operators sᵢ∂/∂xᵢ are approximated by
schemes of running count. Realization of the boundary conditions is obvious in
this case. In the case of an arbitrary convex domain Ω we can proceed as follows. Let Q be
a parallelepiped containing Ω. We extend the definitions of σₛ, g and f by zeros in Q\Ω,
and take for σ(x) in Q\Ω an arbitrary positive constant. Hence, we obtain new
functions σ̃, σ̃ₛ, f̃ and g̃ and can consider the problem

(1/v) ∂φ̃/∂t + (s, ∇)φ̃ + σ̃φ̃ = S̃φ̃ + f̃,

where

Fⁿ = −(Λφⁿ − f),  n = 0, 1, 2, ….
We consider scheme (71.1) in more detail by applying it to problem (69.5) with
concrete difference approximations of the operators (see SMELOV [1978]). To this
end divide the segment [−1, 1] into 2M equal intervals and regard the center of each
one as a grid point along the angular variable. Let us number these grid points as
follows:

μ₋M < μ₋M₊₁ < ⋯ < μ₋₁ < 0 < μ₁ < ⋯ < μ_M.

Hence μⱼ = −μ₋ⱼ. On the segment [0, H] we introduce an arbitrary system of grid
points

0 = z₀ < z₁ < ⋯ < z_N = H,

choosing them in such a way that points of discontinuity of the equations'
coefficients coincide with some of these grid points. We assume that the indices k, j of
any grid function φ_{k,j} denote the following correspondence to the grid points:

(k, j) → (z_k, μⱼ) for j > 0,
(k, j) → (z_{k−1}, μⱼ) for j < 0.

We define the difference approximations Λ₁ and Λ₂ as follows:

(Λ₁φ)_{k,j} = σ_k φ_{k,j} − ½Δμ σ_{s,k} Σₘ₌₋M^{M} φ_{k,m},    (71.3)
(Λ₂φ)_{k,j} = μⱼ(φ_{k,j} − φ_{k−1,j})/h_k,  j > 0,
(Λ₂φ)_{k,j} = μⱼ(φ_{k+1,j} − φ_{k,j})/h_k,  j < 0,

where

k = 1, …, N,  j = ±1, …, ±M,
h_k = Δz_k = z_k − z_{k−1},
Δμ = 1/M,
σ_k = σ(z_{k−1/2}),  σ_{s,k} = σₛ(z_{k−1/2})
(i.e., the boundary conditions of the problem are satisfied). With such difference
approximations we arrive at scheme (71.4), where

f_{k,j}^{n+1/2} = f(z_{k−1/2}, μⱼ, t_{n+1/2}),  σ_{c,k} = σ_k − σ_{s,k},
φ_{0,j}^{n+2/3} = φ_{N+1,j}^{n+2/3} = 0.

Scheme (71.4) has an approximation of order O(τ² + h + Δμ²), h = max_k h_k, on
smooth solutions and is absolutely stable. It also preserves the major properties of
the operators of the original problem.
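The split action of the two operators can be sketched in code. Below, Λ₁ acts locally in z through absorption and the angular scattering sum, and Λ₂ is upwind transport in z with zero inflow boundaries, cf. (69.3). This is an illustrative explicit sketch only: the actual scheme (71.4) treats the stages implicitly, and the weight ½Δμ on the scattering sum follows my reading of (71.3):

```python
import numpy as np

# One explicit splitting step modeled on the operators (71.3).
# phi[k, j]: grid function over z-layers k = 0..K-1 and angles mu[j] != 0.
def transport_split_step(phi, mu, sigma, sigma_s, h, tau, f):
    dmu = 2.0 / len(mu)                      # uniform angular weight on [-1, 1]
    scat = 0.5 * sigma_s[:, None] * dmu * phi.sum(axis=1, keepdims=True)
    phi1 = phi - tau * (sigma[:, None] * phi - scat) + tau * f  # Lambda_1 stage
    phi2 = phi1.copy()
    for j, m in enumerate(mu):               # Lambda_2 stage: upwind in z
        if m > 0:                            # particles travel toward larger z
            phi2[1:, j] -= tau * m * (phi1[1:, j] - phi1[:-1, j]) / h
            phi2[0, j] = 0.0                 # vacuum inflow at z = 0
        else:                                # particles travel toward smaller z
            phi2[:-1, j] -= tau * m * (phi1[1:, j] - phi1[:-1, j]) / h
            phi2[-1, j] = 0.0                # vacuum inflow at z = H
    return phi2
```

The one-sided differences are what make each angular direction solvable by a single marching pass, mirroring the running-count realization discussed below.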
Let us now construct the scheme for a problem in a slab on the basis of the method of
integral identities and the two-cycle splitting method (MARCHUK [1980], p. 464). We
denote by φ⁺ the solution of the problem for μ > 0, and by φ⁻ the solution for μ < 0.
Then the transport equation can be written as a system of two equations for μ > 0:

(1/v) ∂φ⁺/∂t + μ ∂φ⁺/∂z + σφ⁺ = ½σₛ ∫₀¹ (φ⁺ + φ⁻) dμ′ + f⁺,    (72.1)
equations

(1/v) ∂u/∂t + μ ∂v/∂z + σu = σₛ ∫₀¹ u dμ′ + g,    (72.3a)
(1/v) ∂v/∂t + μ ∂u/∂z + σv = r,    (72.3b)

where u and v are the even and odd combinations of φ⁺ and φ⁻, and g and r the
corresponding combinations of the sources.
We consider them in the Hilbert space L₂(D), D = (0, H) × (0, 1), of vector functions with
the inner product

(a, b) = ∫₀¹ dμ ∫₀ᴴ Σᵢ aᵢbᵢ dz,
where aᵢ and bᵢ are the components of the vector functions a and b. Now we can write the
problem under consideration as follows:

(1/v) ∂w/∂t + Aw = F in D × (0, T),
w = w⁰ in D for t = 0,    (72.7)
where the function w for any t belongs to the domain of the operator A (in particular, it
satisfies conditions (72.4)).
We construct a difference approximation of the problem along the variable z. To
this end we introduce two systems of grid points: the main system {z_k}, k = 0, …, N,
z₀ = 0, z_N = H, and the auxiliary system {z_{k+1/2}}, k = 0, …, N − 1. The points of these two systems
alternate, i.e. z_{k−1/2} < z_k < z_{k+1/2}. We integrate (72.3a) in z with limits (z₀, z_{1/2}),
(z_{k−1/2}, z_{k+1/2}), k = 1, …, N − 1, and (z_{N−1/2}, z_N), and (72.3b) with limits (z_{k−1}, z_k),
k = 1, …, N. In the elementary integrals we approximately replace the function u by
its value at the corresponding grid point z_k and the function v by its value at z_{k−1/2},
take the boundary conditions into account and omit the approximation errors. Then we
obtain the following system of equations:

(1/v) ∂φ/∂t + Aφ = F,  t ∈ Ω_t,    (72.8)
φ = φ⁰ for t = 0,

where

φ = (u₀, v_{1/2}, u₁, …, v_{N−1/2}, u_N),
g_k = (1/Δz_k) ∫_{z_{k−1/2}}^{z_{k+1/2}} g dz,  Δz₀ = z_{1/2} − z₀,
A₁ = diag(σ − σₛ ∫₀¹ · dμ, σ_{1/2}, …),
and A₂ is the bidiagonal transport matrix

A₂ = (  μ/Δz₀       μ/Δz₀        0            …       0
       −μ/Δz_{1/2}    0         μ/Δz_{1/2}     …       0
        ⋯
        0        −μ/Δz_{N−1/2}    0           μ/Δz_{N−1/2}
        0            0        −μ/Δz_N        μ/Δz_N  ).
In obtaining (72.8), the approximation errors in the case of a uniform grid are first-order
in h in the first and last equations and O(h²) in the others on smooth solutions.
On the basis of this fact we can prove that the estimate (MARCHUK [1980], p. 472)

maxₜ ‖φ_T − φ(t)‖ ≤ O(h² ln¹ᐟ²(1/h))

holds, where φ_T is the vector of the values of the exact solution of the problem:

φ_T = (u(z₀, μ, t), v(z_{1/2}, μ, t), …, v(z_{N−1/2}, μ, t), u(z_N, μ, t)).

Note, also, that the operators A₁ and A₂ satisfy the relationships

(A₂a, a) ≥ 0,
(A₁a, a) ≥ γ|a|²,  γ = const > 0.

This allows us to formulate the algorithm for solving (72.8) using the two-cycle
splitting method:

(E + ½τA₂)φ^{j−2/3} = (E − ½τA₂)φ^{j−1},
(E + ½τA₁)φ^{j−1/3} = (E − ½τA₁)φ^{j−2/3},
φ^{j+1/3} = φ^{j−1/3} + 2τF^j,    (72.9)
(E + ½τA₁)φ^{j+2/3} = (E − ½τA₁)φ^{j+1/3},
(E + ½τA₂)φ^{j+1} = (E − ½τA₂)φ^{j+2/3},

where F^j is the vector with components

F^j = (1/(t_{j+1} − t_{j−1})) ∫_{t_{j−1}}^{t_{j+1}} F dt,  t_j = jΔt,  τ = vΔt.
It follows from the properties of A₁ and A₂ that, if the solution is sufficiently smooth,
Analogous formulae can be written for solving the fourth equation of (72.9).
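The symmetric structure of (72.9) is easy to check numerically. The sketch below applies the two-cycle sequence of Crank-Nicolson factors A₂, A₁, A₁, A₂ to a small model system with F = 0 and compares one double step with the exact matrix exponential; the matrices and step size are illustrative assumptions:

```python
import numpy as np

def cn_factor(A, tau):
    # Crank-Nicolson transition factor (E + tau/2 A)^{-1} (E - tau/2 A)
    E = np.eye(A.shape[0])
    return np.linalg.solve(E + 0.5 * tau * A, E - 0.5 * tau * A)

def two_cycle_step(A1, A2, x, tau):
    # One double step t_{j-1} -> t_{j+1} of (72.9) with F = 0: the factors are
    # applied in the symmetric order A2, A1, A1, A2, giving second order in tau.
    for A in (A2, A1, A1, A2):
        x = cn_factor(A, tau) @ x
    return x
```

The symmetric ordering is what cancels the first-order commutator error of a plain Lie splitting.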
Let us now formulate the difference approximation of the equations in μ. To this end
we divide the interval 0 ≤ μ ≤ 1 into partial intervals Δμ_l by grid points chosen so that the best
approximation of the integrals in (72.11) on the given class of solutions is ensured,
where s_l is the weight of the chosen quadrature formula. Replacing the integrals in (72.9) with
this quadrature formula and considering the system for μ = μ_l, l = 1, …, m, we come to
the system of linear algebraic equations approximating the original problem,

φ⁰_l = φ⁰(μ_l),

where φ_l^{j−2/3}, …, φ_l^{j+1} are vectors of dimension (2N + 1), l = 1, …, m,
i = 0, …, 2N.
As was noted above, the first and last equations of (72.12) are easily solvable, for
example, by the sweep method. The second equation is solved by explicit
running-count formulae in i, i = 0, 1, …, N.    (72.13)
The fourth equation is solved using these same formulae with j + 1 instead of j.
Thus, the algorithm for the numerical solution of the nonstationary transport
equation is completely determined. Hence, we come to an absolutely stable scheme
with accuracy of order O(τ² + h² ln¹ᐟ²(1/h)) on smooth solutions. The order of
accuracy in μ depends on the choice of the quadrature formula. (Note that μ = 0 is also
allowed as a grid point in the scheme under consideration.)
We have not yet imposed restrictions on the grid steps. Note, however, that if the
sweep method is used in the scheme for solving tridiagonal systems, then for its
stability it is sufficient to require that τ < max(Δzᵢ/2). For solving practical problems
this condition imposes some restrictions on the choice of the time step.
By using the method of integral identities together with the splitting and quadrature
methods, we can in the same way construct difference approximations for problems
in the theory of particle transport that depend on several spatial variables (see
MARCHUK and AGOSHKOV [1981], pp. 405-410).
(1/v) ∂φ/∂t + μ ∂φ/∂x + η ∂φ/∂y + σφ = (σₛ/2π) ∫₀¹ dγ′ ∫₀^{2π} φ dψ′ + f,    (73.1)

where

Ω = {x, y: 0 < x < a, 0 < y < b},
φ = φ(ψ, γ, x, y, t),  σ = σ(x, y),  σₛ = σₛ(x, y),
s = (μ, η) = (s₁, s₂),
Let us first perform some transformations of the problem. Since its solution not
infrequently has singularities where the vector s = (μ, η) approaches directions
parallel to the coordinate axes, it is expedient to transform the original problem so
that the vector s = (μ, η) takes values between these directions. Such transformations
allow us in many cases to increase the approximation accuracy of the problem. Let us
perform such transformations for problem (73.1)-(73.3).
We write the integral term in (73.1) in the form

∫₀^{2π} φ dψ′ = Σₖ₌₁⁴ ∫₀^{π/2} φ⁽ᵏ⁾(γ, ψ′, x, y, t) dψ′,

where

φ⁽ᵏ⁾(γ, ψ, x, y, t) = φ(γ, ψ + ½(k − 1)π, x, y, t),

and consider problem (73.1)-(73.3) consecutively on the subdomains Ω × ω_k, where

ω_k = {γ, ψ: 0 < γ < 1, ½(k − 1)π < ψ < ½kπ},  k = 1, …, 4.
Then, replacing ψ by ψ + ½(k − 1)π, we obtain the following system of four equations:

(1/v) ∂φ⁽ᵏ⁾/∂t + μ⁽ᵏ⁾ ∂φ⁽ᵏ⁾/∂x + η⁽ᵏ⁾ ∂φ⁽ᵏ⁾/∂y + σφ⁽ᵏ⁾
  = (σₛ/2π) Σ_{k′=1}⁴ S⁽ᵏ'ᵏ′⁾φ⁽ᵏ′⁾ + f⁽ᵏ⁾,    (73.5)
k = 1, 2, 3, 4,

together with the corresponding weak formulations,

k = 1, 2, 3, 4,    (73.10)

where v⁽ᵏ⁾ ∈ W₂¹(Ω) and v⁽ᵏ⁾|_{∂Ω⁽ᵏ⁾} = 0, ∂Ω⁽ᵏ⁾ being that part of the boundary ∂Ω for which
(μ⁽ᵏ⁾n_x + η⁽ᵏ⁾n_y) > 0 for the given value of k. Thus the exact solution of the original
problem satisfies equalities (73.8)-(73.10), which are used further for constructing
(only for the sake of simplicity; the algorithm for a nonuniform grid both in x and in
y has been considered by MARCHUK and AGOSHKOV [1981], p. 306) and define the
functions

φᵢ(x) = (x − x_{i−1})/h,  x ∈ (x_{i−1}, xᵢ),
φᵢ(x) = (x_{i+1} − x)/h,  x ∈ (xᵢ, x_{i+1}),
φᵢ(x) = 0,  x ∉ (x_{i−1}, x_{i+1}),

where γ_{x,i} and γ_{y,j} are normalizing coefficients, which can be defined from the
conditions

(1/γ_{x,i}) ∫₀ᵃ φᵢ(x) dx = 1,  (1/γ_{y,j}) ∫₀ᵇ φⱼ(y) dy = 1;    (73.11)

in the case of a uniform grid, (73.11) gives the values γ_{x,i} = γ_{y,j} = h.
To construct systems of integral identities we take in (73.8) and (73.9)

v⁽ᵏ⁾ = φₙ(x)φₘ(y)

and obtain

(1/v) ∂φ⁽ᵏ⁾/∂t + Aφ⁽ᵏ⁾ = (σₛ/2π) Σ_{k′=1}⁴ S⁽ᵏ'ᵏ′⁾φ⁽ᵏ′⁾ + f⁽ᵏ⁾,    (73.12)

where the vector G (standing for any of φ, g, f) has the components

G⁽ᵏ⁾ = (G⁽ᵏ⁾₁,₁, G⁽ᵏ⁾₂,₁, …, G⁽ᵏ⁾₁,M−1, G⁽ᵏ⁾₂,M−1, …, G⁽ᵏ⁾N−1,M−1),

G⁽ᵏ⁾ₙ,ₘ = (G⁽ᵏ⁾, φₙ(x)φₘ(y)),  k = 1,
G⁽ᵏ⁾ₙ,ₘ = (G⁽ᵏ⁾, φ_{N−n}(x)φₘ(y)),  k = 2,
G⁽ᵏ⁾ₙ,ₘ = (G⁽ᵏ⁾, φ_{N−n}(x)φ_{M−m}(y)),  k = 3,
G⁽ᵏ⁾ₙ,ₘ = (G⁽ᵏ⁾, φₙ(x)φ_{M−m}(y)),  k = 4,

A is the block matrix

A = (1/h) ( A₁,₁B … A₁,N−1B ; … ; A_{N−1,1}B … A_{N−1,N−1}B ),

and σ̂⁽ᵏ⁾ and σ̂ₛ⁽ᵏ⁾ are diagonal matrices of order (N − 1) × (M − 1) with elements

σₙ,ₘ, k = 1;  σ_{N−n,m}, k = 2;  σ_{N−n,M−m}, k = 3;  σ_{n,M−m}, k = 4,

and similarly with σₛ in place of σ.
Note that the numbering of the vector and matrix elements is chosen in such a way
that the values {φ⁽¹⁾ₙ,ₘ, φ⁽²⁾_{N−n,m}, φ⁽³⁾_{N−n,M−m}, φ⁽⁴⁾_{n,M−m}}, i.e., the approximations of the
four angular components,
are in correspondence with the grid point (xₙ, yₘ). Analogous relationships hold for
the grid points and the elements of the vectors {g⁽ᵏ⁾} and {f⁽ᵏ⁾}.
Let H_h be a Hilbert space of vector functions with inner product and norm defined
by

(u, v)_h = Σₖ₌₁⁴ Σₙ₌₁^{N−1} Σₘ₌₁^{M−1} ∫₀¹ dγ ∫₀^{π/2} u⁽ᵏ⁾ₙ,ₘ(γ, ψ) v⁽ᵏ⁾ₙ,ₘ(γ, ψ) dψ,

‖u‖_h = (u, u)_h^{1/2}.

We define in H_h the vector functions

G = (G⁽¹⁾, G⁽²⁾, G⁽³⁾, G⁽⁴⁾)ᵀ,  G = φ, f, g,

and the matrix operators

A₁ = diag(A₁⁽ᵏᵏ⁾),
σ̂ = diag(σ̂⁽ᵏᵏ⁾),  σ̂ₛ = diag(σ̂ₛ⁽ᵏᵏ⁾),

together with the full matrix operator A₂ with blocks A₂⁽ᵏ'ᵏ′⁾, k, k′ = 1, …, 4.
These properties guarantee the existence and uniqueness of the solution φ(γ, ψ, t)
of problem (73.14)-(73.15) for t ∈ (0, T).
Besides, according to MARCHUK and AGOSHKOV [1981], the following error estimate
holds for t ∈ (0, T),
where the vector φ_T = φ_T(γ, ψ, t) has a structure similar to that of the vector φ(γ, ψ, t) but
with elements (φ, φₙ(x)φₘ(y)) equal to the averaged values of the exact solution of
problem (73.1)-(73.3).
It also follows from the properties of A₁ and A₂ that scheme (73.20) is absolutely
stable, with

maxⱼ ‖φʲ‖_h ≤ C(‖g‖_h + maxⱼ ‖fʲ‖_h),  j = 2k,    (73.21)
k = 0, 1, 2, …,  C = const.

When the solutions of the approximated problems are sufficiently smooth, scheme
(73.20) has second-order approximation in τ at the points tⱼ, j = 1, 2, …, J.
Consequently,

maxⱼ ‖φ_T(t₂ⱼ) − φ²ʲ‖_h ≤ const(τ² + h).    (73.22)
Σₗ₌₁ᴸ Σ_d₌₁ᴰ A_{ld} φ(γₗ, ψ_d),

where the A_{ld} are weights and (γₗ, ψ_d) are the values at the grid points. By replacing the
integrals in (73.20) with quadrature formulae and considering the equations from (73.20) with
γ = γₗ, l = 1, …, L, ψ = ψ_d, d = 1, …, D, we come to a system of linear algebraic
equations that approximates the original problem:

(E + ½τA₁,ₗ_d)φ_{ld}^{j−2/3} = (E − ½τA₁,ₗ_d)φ_{ld}^{j−1},    (73.23)
(E + ½τA₂,ₗ_d)φ_{ld}^{j−1/3} = (E − ½τA₂,ₗ_d)φ_{ld}^{j−2/3},    (73.24)

where the vectors φ_{ld}^j have the same structure as φ^j in (73.20) (they are regarded as
approximations to φ^j(γₗ, ψ_d)), φ⁰_{ld} = g(γₗ, ψ_d), and the operator A₂,ₗ_d is defined by
replacing the integrals in A₂ with the quadrature sum Σₗ′ Σ_d′ A_{l′d′} φ_{l′d′}.
Thus the algorithm for the numerical solution of the nonstationary transport
equation has been defined. As a result we have obtained an absolutely stable scheme
of first-order accuracy in h and second-order accuracy in τ on smooth solutions. The
order of approximation in the angular variables is defined by the quadrature
formula chosen (LEBEDEV [1971, 1976]).
Written out componentwise, equations (73.23) and (73.24) give explicit
running-count formulae for the fractional steps (formulae (73.23′)-(73.30)): each value
φ_{l,d,n,m}^{k,j−1/3} is expressed through the already computed values at the neighboring grid
points and through the quadrature sums

Σₗ′₌₁ᴸ Σ_d′₌₁ᴰ A_{l′d′} (φ_{l′,d′,n,m}^{1,j−2/3} + φ_{l′,d′,N−n,m}^{2,j−2/3} + φ_{l′,d′,N−n,M−m}^{3,j−2/3} + φ_{l′,d′,n,M−m}^{4,j−2/3})

of the previous fractional step, with the boundary values

φ_{l,d,0,m}^{k,j−1} = 0,  φ_{l,d,n,0}^{k,j−1} = 0.
Thus, the algorithm of numerical solution for problem (73.1)-(73.3) has been
completely defined. Note in conclusion that a scheme with second-order
approximation in the spatial variables is given by MARCHUK and AGOSHKOV ([1981],
p. 317). Further, we remark that, formally, δ-sources in x, y and t are admissible in the
above algorithm and that the algorithm itself can be generalized to the problem of
anisotropic scattering.
assuming that the parameter τₙ is positive and depends on the step number of the
iterative process. Operators A₁ and A₂ are defined according to (71.3). We have,
therefore, the following iterative process for the error ξⁿ = φⁿ − φ:

(E + τₙA₁)(E + τₙA₂)ξⁿ⁺¹ = (E − τₙA₁)(E − τₙA₂)ξⁿ,

or

(E + τₙA₂)ξⁿ⁺¹ = Tₙ(E + τₙA₂)ξⁿ,    (74.3)

where

Tₙ = (E + τₙA₁)⁻¹(E − τₙA₁)(E − τₙA₂)(E + τₙA₂)⁻¹.

When we consider equation (74.3) in Euclidean space, we have

‖(E + τₙA₂)ξⁿ⁺¹‖ ≤ ‖Tₙ‖ ‖(E + τₙA₂)ξⁿ‖.

Since

(A₁φ, φ) ≥ σ_{c0}‖φ‖²,  (A₂φ, φ) ≥ 0,

we have ‖(E − τₙA₂)(E + τₙA₂)⁻¹‖ ≤ 1 and

‖(E + τₙA₁)⁻¹(E − τₙA₁)‖² ≤ (1 − 2τₐσ_{c0} + τ_b²‖A₁‖²)/(1 + 2τₐσ_{c0} + τ_b²‖A₁‖²) = ρ < 1,

where 0 < τₐ ≤ τₙ ≤ τ_b < ∞ are the bounds of the parameter τₙ. Hence, ‖Tₙ‖ ≤ ρ < 1 and for
(74.2) we obtain convergence and estimates of the kind

‖(E + τₙA₂)(φⁿ⁺¹ − φ)‖ ≤ ρ ‖(E + τₙA₂)(φⁿ − φ)‖    (74.4)
  ≤ ρⁿ⁺¹ ‖(E + τ₀A₂)(φ⁰ − φ)‖ → 0,  n → ∞.
We now take the parameter τ independent of the step number of the iterative process.
Then

ρ = ρ(τ) = (1 − 2τσ_{c0} + τ²‖A₁‖²)/(1 + 2τσ_{c0} + τ²‖A₁‖²).    (74.5)

From the condition ρ′(τ) = 0 we find the value of τ providing the minimum of the
function ρ(τ):

τ_opt = 1/‖A₁‖    (74.6)

(note also that ρ″(τ_opt) > 0).
The choice of τₙ can also be optimized on the basis of a simplified approximate problem,
for example, the diffusion problem (PENENKO, SULTANGAZIN and BALASH
[1969]). Namely, by using this simplification of the problem we have τ_opt = 1/σ_{c1},
where

σ_{c1} = vrai sup σ_c.
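The optimality relation (74.6) is easy to verify numerically. In the sketch below s plays the role of σ_{c0} and a the role of ‖A₁‖; the concrete values are illustrative assumptions:

```python
import numpy as np

# Numerical check of (74.5)-(74.6): for
#   rho(tau) = (1 - 2 tau s + tau^2 a^2) / (1 + 2 tau s + tau^2 a^2)
# the minimizer is tau_opt = 1/a, independently of s.
def rho(tau, s, a):
    return (1 - 2 * tau * s + tau**2 * a**2) / (1 + 2 * tau * s + tau**2 * a**2)

s, a = 0.7, 3.0
taus = np.linspace(1e-3, 2.0, 200001)
tau_star = taus[np.argmin(rho(taus, s, a))]   # lands at tau = 1/a
```

Differentiating (74.5) gives ρ′(τ) proportional to 4σ_{c0}(τ²‖A₁‖² − 1), so the scan must bottom out at τ = 1/‖A₁‖, which the assertion below confirms.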
The approximation of the equations of hydro-, aero- and gas dynamics leads to systems
of nonlinear algebraic equations. The effective solution of these equations is
a complex problem, especially when solving multidimensional problems by means
of implicit schemes in time. On the other hand, implicit schemes are particularly
convenient for the gas dynamics and Navier-Stokes equations since they essentially
weaken the restrictions on the size of the time step. In this case, as in many
applications, the splitting method is a constructive approach.
This chapter discusses various approaches to constructing splitting schemes for
the equations of hydro- and gas dynamics. For the sake of simplicity we demonstrate
these methods by splitting the differential systems of equations on the same level as in
the weak approximation method (see Section 46).
pondence with (75.1) and (75.2). This system has the form (75.3)-(75.4);
on the first subinterval [tⁿ, tⁿ⁺¹ᐟ²] we can use a splitting scheme in the coordinates x₁
and x₂ for equations (75.3) and (75.4):

½ ∂u₁ε/∂t + u₁ε ∂u₁ε/∂x₁ = ν ∂²u₁ε/∂x₁²,
½ ∂u₂ε/∂t + u₂ε ∂u₂ε/∂x₂ = ν ∂²u₂ε/∂x₂²,    (75.6)
½ ∂p_ε/∂t + u₂ε ∂p_ε/∂x₂ + p_ε ∂u₂ε/∂x₂ = 0.

By approximating systems (75.5) and (75.6) by finite differences on a rectangular
grid and by implicit schemes in time, we can solve them by three-point sweeps along
the grid lines. To solve problem (75.1)-(75.2) we can use a nonstationary system
which is different from (75.3)-(75.4). It has the following form:

∂u/∂t + Σᵢ₌₁² uᵢDᵢu + grad p = νΔu,    (75.7)
In this case, it is easier to approximate (75.11) and (75.12) than the original
equations, because the condition div u = 0 is replaced by an evolutionary equation.
Here the following problems arise:
- existence and uniqueness of solutions of the perturbed equations;
- convergence of u_ε and p_ε to u and p for ε → 0;
- discretization of (75.11) and (75.12) and convergence of the discrete approximations
to a solution of the original problem.
The existence of solutions of the perturbed problems and the convergence of exact
solutions for ε → 0 are discussed by TEMAM [1981], where an alternating direction
scheme for the approximation in time was used and where unconditional stability in
certain spaces and convergence of the discrete approximation to a solution of the
Navier-Stokes equations when τ, h → 0 were proved.
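The artificial-compressibility idea itself is compact enough to sketch: the constraint div u = 0 is replaced by the evolutionary equation ε ∂p/∂t + div u = 0. Below is a minimal explicit step on a periodic grid with central differences; the discretization and all parameters are illustrative assumptions, not the scheme analyzed by Temam:

```python
import numpy as np

# One explicit step of the artificial-compressibility regularization of the
# 2-D Navier-Stokes equations; periodic grid, central differences.
def ac_step(u, v, p, eps, nu, tau, h):
    dx = lambda w: (np.roll(w, -1, 0) - np.roll(w, 1, 0)) / (2 * h)
    dy = lambda w: (np.roll(w, -1, 1) - np.roll(w, 1, 1)) / (2 * h)
    lap = lambda w: (np.roll(w, -1, 0) + np.roll(w, 1, 0) +
                     np.roll(w, -1, 1) + np.roll(w, 1, 1) - 4 * w) / h**2
    un = u + tau * (-u * dx(u) - v * dy(u) - dx(p) + nu * lap(u))
    vn = v + tau * (-u * dx(v) - v * dy(v) - dy(p) + nu * lap(v))
    pn = p - (tau / eps) * (dx(u) + dy(v))   # pressure evolution equation
    return un, vn, pn
```

As ε → 0 the pressure update enforces the divergence constraint ever more stiffly, which is precisely why implicit or split treatments become attractive.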
fⁿ = (1/τ) ∫_{(n−1)τ}^{nτ} f(t) dt,  n = 1, …, N,
∀v ∈ H₀¹(Ω),
uⁿ⁺¹ ∈ H,
(uⁿ⁺¹, v) = (uⁿ⁺¹ᐟ², v)  ∀v ∈ H.    (76.8)

Here

b̃(u, v, w) = ½(b(u, v, w) − b(u, w, v))

is a continuous trilinear form on H₀¹(Ω) and

b̃(u, v, v) = 0,  u, v ∈ H₀¹(Ω).    (76.9)

Equation (76.8) means that uⁿ⁺¹ is the orthogonal projection of uⁿ⁺¹ᐟ² onto H in
L₂(Ω). That is why uⁿ⁺¹ can be defined by the equality

uⁿ⁺¹ = P_H uⁿ⁺¹ᐟ²,
The splitting idea connected with recalculation of the pressure, given here in a
projection statement, is, in fact, a consequence of the ideas underlying the Harlow
method of particles in cells. According to this scheme we first compute an intermediate
velocity field from

(uⁿ⁺¹ᐟ² − uⁿ)/τ = −Σᵢ uᵢⁿ ∂uⁿ/∂xᵢ + νΔuⁿ + fⁿ.    (76.13)

Then we correct this field by taking the pressure gradient into account,

uⁿ⁺¹ = uⁿ⁺¹ᐟ² − τ grad p,    (76.14)

where p is a stationary solution of the associated pressure equation.
As a result of the above stages both (76.1) and (76.2) are satisfied.
BELOTSERKOVSKY, GUSHCHIN and SHCHENNIKOV [1975] used another splitting
scheme, namely, the explicit scheme of splitting along the physical factors. We
introduce the notation rot u = ω and consider the velocity field u and pressure field
p at the time tₙ = nτ to be known; then we can write the scheme for finding the unknown
functions at time tₙ₊₁ as the three-stage splitting scheme (76.16)-(76.18):

(uⁿ⁺¹ᐟ² − uⁿ)/τ = −Σᵢ uᵢⁿ ∂uⁿ/∂xᵢ + νΔuⁿ,    (76.16)

(uⁿ⁺¹ − uⁿ⁺¹ᐟ²)/τ = −grad pⁿ⁺¹.    (76.18)

As noted earlier, equation (76.17) is obtained by applying the operation div to both sides
of (76.18) and taking the continuity equation into account (the vector uⁿ⁺¹ is solenoidal).
BELOTSERKOVSKY [1984] suggested the following physical interpretation of the
above scheme. It is assumed at the first stage (76.16) that the momentum transport (the
impulse of a unit mass) is realized only via convection and diffusion. The field uⁿ⁺¹ᐟ²
thereby obtained does not satisfy the incompressibility condition, giving never-
theless a correct description of the turbulent characteristics, since it is possible to
show that rot uⁿ⁺¹ᐟ² = ωⁿ⁺¹.
The intermediate velocity field obtained helps to find the pressure field at the
second stage (76.17), taking into account the fact that the field uⁿ⁺¹ is solenoidal. In
this approach the function pⁿ⁺¹ is a physical pressure and boundary conditions for it
are not defined. On the contrary, in the projection method, pⁿ⁺¹ satisfies Neumann
boundary conditions.
The following variant (76.19)-(76.21) of realization has been considered for
increasing the reserve of stability by using an implicit scheme at the first stage
(BELOTSERKOVSKY [1984]):

(uⁿ⁺¹ᐟ² − uⁿ)/τ = −Σᵢ uᵢⁿ ∂uⁿ⁺¹ᐟ²/∂xᵢ + νΔuⁿ⁺¹ᐟ² − grad pⁿ + f,    (76.19)

Δ(δp) = (1/τ) div uⁿ⁺¹ᐟ²,  δp = pⁿ⁺¹ − pⁿ,    (76.20)

(uⁿ⁺¹ − uⁿ⁺¹ᐟ²)/τ = −grad(δp).    (76.21)

The proposed modification allows us to relax the hard restrictions on the time
step inherent in the first scheme. Another advantage of this approach is that the
pressure increment in time, and not the pressure itself, is found in the second stage.
77. The general principle for constructing splitting schemes for Navier-Stokes
equations
U = (U₀, U₁, U₂, U₃, U₄)ᵀ = (ρ, ρu₁, ρu₂, ρu₃, ρE)ᵀ,

W_α = (W_{α,0}, W_{α,1}, W_{α,2}, W_{α,3}, W_{α,4})ᵀ
    = (ρu_α, ρu₁u_α + P₁_α, ρu₂u_α + P₂_α, ρu₃u_α + P₃_α, ρu_αE + u_βP_{αβ} − κ ∂T/∂x_α)ᵀ.    (77.2)

In representation (77.2) we sum over β; P_{αβ} = pδ_{αβ} − G_{αβ}, p is the pressure, ρ is the
density, δ_{αβ} is the Kronecker symbol, E = e + ½|u|² is the mass density of the full energy,

G_{αβ} = μ(∂u_α/∂x_β + ∂u_β/∂x_α − ⅔δ_{αβ} div u) + μ′δ_{αβ} div u,    (77.3)

μ′ and μ are the coefficients of volume and dissipative viscosity. System (77.1) may be
represented in a nondivergent form as well. We introduce a new vector of state f,

f = f(U).    (77.4)

We consider (77.4) as invertible, equivalent to U = U(f), and obtain nonsingular
matrices.
Thus, taking this into account, system (77.1) can be rewritten in the form

∂f/∂t + Ωf = 0,    (77.7)

where Ω is a matrix differential operator. It follows from relationships (77.2) and (77.3)
that w_α = w_α(f, ∂f/∂x). System (77.7) can be rewritten in operator form with

Ω_α = B_αD_α,  R_α = −C_αD_β.

System (77.7) is nondivergent, in contrast to the equivalent system (77.1). Various
vectors of state, e.g., (ρ, u₁, u₂, u₃, T), (ρ, u₁, u₂, u₃, e), and f = (ρ, u₁, u₂, u₃, p), are
commonly used. For a chosen vector of state f, the matrix Ω_α carries the convective
transport u_α ∂/∂x_α in each diagonal position, while its off-diagonal entries couple the
density, the velocity components and the temperature through the pressure derivatives
(for instance, through terms of the form (a²/ρ)δ_{βα} ∂/∂x_α with a² = ∂p/∂ρ).
We now represent each flux vector as a sum,

w_α = w¹_α + w²_α + w³_α,    (77.11)

where w¹_α is the vector of convective fluxes, w²_α the vector of fluxes due to pressure
forces, and w³_α the vector of dissipative fluxes:

w¹_α = (ρu_α, ρu₁u_α, ρu₂u_α, ρu₃u_α, ρu_αE)ᵀ,
w²_α = (0, δ₁_αp, δ₂_αp, δ₃_αp, u_αp)ᵀ,
w³_α = (0, −G₁_α, −G₂_α, −G₃_α, −u_βG_{αβ} − κ ∂T/∂x_α)ᵀ.

When we consider each vector w_α as the sum (77.11) for every direction x_α, we can write
the system of equations in the form of splitting in physical processes and spatial
directions:

∂U/∂t + Σ_α₌₁³ Σₘ₌₁³ ∂wᵐ_α/∂x_α = 0.    (77.12)
∂U/∂t + Σⱼ₌₁ᴺ Xⱼ U = 0,    (77.13)

where N is the number of steps and the Xⱼ are differential operators taking into account
the splitting in physical processes and spatial directions.
When we write the system in the nondivergent form (77.7), the splitting in physical
processes and spatial directions is done as in (77.12) and (77.13): one separates the
convective flux matrices and the matrices determined by the pressure and dissipative
forces, with subsequent splitting along the spatial directions, so that the operator Ω is
represented as

Ω = Σ_α (Ω₁_α + Ω₂_α).    (77.14)
Here Ω₁_α is the diagonal matrix of convective transport,

Ω₁_α = u_α ∂/∂x_α · E,    (77.15)

while Ω₂_α collects the terms generated by the pressure forces: its nonzero entries are
of the form ρδ_{βα} ∂/∂x_α in the continuity row, (1/ρ)δ_{βα} ∂/∂x_α, (a²/ρ)δ_{βα} ∂/∂x_α and
(b²/ρ)δ_{βα} ∂/∂x_α in the momentum rows, and the corresponding entries in the energy
row; formulae (77.16)-(77.18) give its explicit form for the various choices of the state
vector, where

a² = ∂p/∂ρ,  b² = (1/ρ) ∂p/∂e.
It is obvious that realizations of schemes based on the splitting (77.15)-(77.16) or
(77.17)-(77.18) are different. The above examples can be continued if we choose gas
dynamics functions, or some combinations of them, as the functions to be found. In
considering difference splitting schemes for boundary value problems, the choice of
the unknown functions is also determined by the form of the boundary conditions.
For Navier-Stokes equations written in the nondivergent form (77.7), the splitting of
the equations in physical processes and spatial directions is performed in the same way
as for the equations of gas dynamics. The matrix of dissipative forces is in this case added
separately from the convective transport matrix and from the matrix of fluxes determined
by the pressure forces. For the vector f = (ρ, u_α, T)ᵀ the matrix Ω₃_α is built from the
viscous stresses and heat conduction: its momentum rows contain the terms
−(1/ρ)G₁_α, −(1/ρ)G₂_α, −(1/ρ)G₃_α, and its energy row the heat-flux term with the
coefficient κ.    (77.19)
Some other examples of splitting the gas dynamics and Navier-Stokes equations are
possible. These examples correspond to various realizations differing in the reserve of
stability, the effectiveness of realization and the convenience of calculating the boundary
conditions. Thus, the approximation of the equations in divergent form leads to
conservative difference schemes satisfying the conservation laws. This yields a physi-
cally better grounded result. For nonstationary solutions this way seems to be the most
justified. At the same time, numerical schemes of this type are nonlinear and
require either internal iterations or a complex linearization.
Although equations written in nondivergent form are more efficient in realization,
they possess the conservation property only asymptotically, that is, as the solution
approaches a steady state. In this case, it is efficient to consider the predictor-corrector
method, which permits the construction of conservative difference schemes suitable for
solving nonstationary problems (see KOVENYA and LEBEDEV [1984]). At the predictor
stage a scheme of splitting in physical processes and coordinate directions is constructed
for the equations in nondivergent form. At the corrector stage the equations are
approximated in divergent form.
The above schemes have been used in one form or another for performing a whole range of
computations for problems of gas dynamics and viscous gas flows in both the two-
dimensional and the three-dimensional case.
Papers by KOVENYA and YANENKO [1981], TEMAM [1981], BELOTSERKOVSKY
[1984], YANENKO [1967] and GUSHCHIN [1981] discuss the construction and the
convergence studies of splitting schemes for Navier-Stokes equations. Splitting
schemes and the results of computing viscous fluid flows by using the notion of
"artificial compressibility" are given by YANENKO [1967] and CHORIN [1968].
YANENKO [1967] also considered the coordinatewise splitting scheme. Splitting
methods for solving problems of stratified fluid dynamics are described by
LYTKIN and CHERNYKH [1975], YOUNG and HORT [1972], GUSHCHIN [1981], and
VASILYEV, KUZNETSOV, LYTKIN and CHERNYKH [1974]. Splitting methods for solving
a wide range of problems of gas and aerodynamics, along with some computational
results, are discussed by KUZNETSOV and STRELETS [1983], KOVENYA and LEBEDEV
[1984], KOVENYA and YANENKO [1981] and BELOTSERKOVSKY [1984]. Monographs
by MARCHUK [1974], KOVENYA and YANENKO [1981] and BELOTSERKOVSKY [1984]
are in fact surveys of the results of constructing, studying and realizing splitting
schemes for problems of geophysical hydrodynamics, gas dynamics, hydrodynamics
and mechanics of continuous media.
CHAPTER XIX
Problems in Meteorology
du/dt − (l + (u/a) tan φ)v + (1/(a cos φ))(∂Φ/∂λ + RT ∂ln pₛ/∂λ) = F_u,    (78.1)
dv/dt + (l + (u/a) tan φ)u + (1/a)(∂Φ/∂φ + RT ∂ln pₛ/∂φ) = F_v,    (78.2)
dT/dt − (RT/(c_p σ)) σ̇ = F_T + ε_T,    (78.3)
dq/dt = F_q − ε_q,    (78.4)
∂pₛ/∂t + (1/(a cos φ)) [∂(pₛu)/∂λ + ∂(pₛv cos φ)/∂φ] + pₛ ∂σ̇/∂σ = 0,    (78.5)
∂Φ/∂σ = −RT/σ,    (78.6)

where

d/dt = ∂/∂t + K,  K = (u/(a cos φ)) ∂/∂λ + (v/a) ∂/∂φ + σ̇ ∂/∂σ;

t is the time variable; λ is the longitude and φ the latitude; σ = p/pₛ is the vertical
coordinate (p is the pressure, pₛ is the pressure at the surface of the Earth); u, v and
σ̇ are the wind velocity components in the longitudinal, latitudinal and vertical
directions, respectively; Φ = gz is the geopotential on a constant σ-surface (where g is
the gravitational acceleration and z is the altitude above sea level); T is the
temperature (K); q is the specific humidity; F_u and F_v are the rates of momentum
change due to Reynolds stresses; F_T and F_q are the rates of temperature
and specific humidity change, respectively, caused by small-scale diffusion and
meso-scale convection; ε_T = ε_r + ε_c are the diabatic heat fluxes (ε_r is the radiation heat
flux and ε_c a heat flux due to condensation); ε_q = C − E is the rate of humidity
transformation (C and E are terms describing the condensation and evaporation
processes); l = 2Ω sin φ is the Coriolis parameter (Ω is the angular speed of rotation of
the Earth); a is the radius of the Earth; R and c_p are the gas constant for dry air and the
specific heat of air at constant pressure.
The following initial conditions are considered to be given for system (78.1)–(78.6):

$$u=u_0(\lambda,\varphi,\sigma),\quad v=v_0(\lambda,\varphi,\sigma),\quad p_s=p_{s0}(\lambda,\varphi),\quad T=T_0(\lambda,\varphi,\sigma),\quad q=q_0(\lambda,\varphi,\sigma),\tag{78.8}$$
as well as the boundary conditions, which assume that the solution is periodic in longitude and bounded at the North and South Poles. Since the underlying surface, being a solid body, is a σ-coordinate surface (σ = 1), the corresponding kinematic condition

$$\dot\sigma=0\quad\text{for }\sigma=1\tag{78.9}$$

is used. An analogous condition can be stated for the upper boundary of the atmosphere,

$$\dot\sigma=0\quad\text{for }\sigma=0,\tag{78.10}$$

and the distribution of the geopotential on the lower boundary is also given:

$$\Phi=gz_s\quad\text{for }\sigma=1,\tag{78.11}$$

where z_s is the Earth's surface altitude above sea level.
Equations (78.1)–(78.4) can take a different form; at least three possible forms are known. By multiplying any of these equations by p_s and using the continuity equation (78.5) we obtain the divergent form

$$\frac{\partial p_s u}{\partial t}+K_d u-\Bigl(l+\frac{u}{a}\operatorname{tg}\varphi\Bigr)p_s v+\frac{p_s}{a\cos\varphi}\Bigl(\frac{\partial\Phi}{\partial\lambda}+RT\,\frac{\partial\ln p_s}{\partial\lambda}\Bigr)=p_sF_u,\tag{78.12}$$

$$\frac{\partial p_s v}{\partial t}+K_d v+\Bigl(l+\frac{u}{a}\operatorname{tg}\varphi\Bigr)p_s u+\frac{p_s}{a}\Bigl(\frac{\partial\Phi}{\partial\varphi}+RT\,\frac{\partial\ln p_s}{\partial\varphi}\Bigr)=p_sF_v,\tag{78.13}$$

$$\frac{\partial p_s T}{\partial t}+K_d T-\frac{RT}{c_p\,\sigma}\,\frac{dp}{dt}=p_s(F_T+\varepsilon_T),\tag{78.14}$$

$$\frac{\partial p_s q}{\partial t}+K_d q=p_s(F_q-\varphi_q),\tag{78.15}$$

where

$$K_d\psi=\frac{1}{a\cos\varphi}\Bigl(\frac{\partial p_s u\psi}{\partial\lambda}+\frac{\partial p_s v\cos\varphi\,\psi}{\partial\varphi}\Bigr)+\frac{\partial p_s\dot\sigma\psi}{\partial\sigma}.$$

Introducing the new unknown functions

$$p_s=\pi,\qquad\sqrt{\pi}\,u=U,\qquad\sqrt{\pi}\,v=V,\qquad\sqrt{\pi}\,T=\Theta,\qquad\sqrt{\pi}\,q=Q,\tag{78.16}$$

we obtain a symmetrized form of the equations:
$$\frac{\partial U}{\partial t}+K_sU-\Bigl(l+\frac{u}{a}\operatorname{tg}\varphi\Bigr)V+\frac{1}{a\cos\varphi}\Bigl(\sqrt{\pi}\,\frac{\partial\Phi}{\partial\lambda}+2R\Theta\,\frac{\partial\ln\sqrt{\pi}}{\partial\lambda}\Bigr)=\sqrt{\pi}\,F_u,\tag{78.17}$$

$$\frac{\partial V}{\partial t}+K_sV+\Bigl(l+\frac{u}{a}\operatorname{tg}\varphi\Bigr)U+\frac{1}{a}\Bigl(\sqrt{\pi}\,\frac{\partial\Phi}{\partial\varphi}+2R\Theta\,\frac{\partial\ln\sqrt{\pi}}{\partial\varphi}\Bigr)=\sqrt{\pi}\,F_v,\tag{78.18}$$

$$\frac{\partial\Theta}{\partial t}+K_s\Theta-\frac{R\Theta}{c_p\,\sigma\pi}\,\frac{dp}{dt}=\sqrt{\pi}\,(F_T+\varepsilon_T),\tag{78.19}$$

$$\frac{\partial Q}{\partial t}+K_sQ=\sqrt{\pi}\,(F_q-\varphi_q),\tag{78.20}$$

where

$$K_s\psi=\frac{1}{2a\cos\varphi}\Bigl[u\frac{\partial\psi}{\partial\lambda}+\frac{\partial u\psi}{\partial\lambda}+v\cos\varphi\,\frac{\partial\psi}{\partial\varphi}+\frac{\partial v\cos\varphi\,\psi}{\partial\varphi}\Bigr]+\frac12\Bigl[\dot\sigma\frac{\partial\psi}{\partial\sigma}+\frac{\partial\dot\sigma\psi}{\partial\sigma}\Bigr],\tag{78.21}$$

and we set

$$\zeta=\frac{1}{a\cos\varphi}\Bigl(\frac{\partial v}{\partial\lambda}-\frac{\partial u\cos\varphi}{\partial\varphi}\Bigr),\qquad E=\tfrac12(u^2+v^2).$$
Now we can write the motion equations (78.1) and (78.2) in the so-called Lamb form:

$$\frac{\partial u}{\partial t}-(\zeta+l)v+\dot\sigma\frac{\partial u}{\partial\sigma}+\frac{1}{a\cos\varphi}\Bigl[\frac{\partial(\Phi+E)}{\partial\lambda}+RT\,\frac{\partial\ln p_s}{\partial\lambda}\Bigr]=F_u,\tag{78.22}$$

$$\frac{\partial v}{\partial t}+(\zeta+l)u+\dot\sigma\frac{\partial v}{\partial\sigma}+\frac{1}{a}\Bigl[\frac{\partial(\Phi+E)}{\partial\varphi}+RT\,\frac{\partial\ln p_s}{\partial\varphi}\Bigr]=F_v.\tag{78.23}$$
Since it is known (LORENZ [1955]) that the total potential energy of a vertical air column in hydrostatic systems (being the sum of the potential energy gz and the internal energy c_vT, where c_v is the specific heat of air at constant volume) coincides with its enthalpy c_pT, equation (78.25) is the conservation law for the total energy of an atmospheric system.
If the term RT is replaced in (78.3) by RT̄, where T̄ is the average temperature (a value independent of the spatial coordinates and time), then the total energy conservation law takes the quadratic form

$$\frac{\partial}{\partial t}\int_G\int_0^1 p_s\Bigl(E+\frac{c_p}{2\bar T}\,T^2\Bigr)\,d\sigma\,dG=0.\tag{78.26}$$
The additional assumption that the flow is non-divergent,

$$\frac{1}{a\cos\varphi}\Bigl(\frac{\partial u}{\partial\lambda}+\frac{\partial v\cos\varphi}{\partial\varphi}\Bigr)=0,\tag{78.27}$$

allows one to show the existence of the enstrophy conservation law

$$\frac{\partial}{\partial t}\int_G\tfrac12(\zeta+l)^2\,dG=0.\tag{78.28}$$

This law also has a fundamental meaning, because the real atmosphere is quasi-barotropic and quasi-geostrophic and therefore has a specific energy cascade in the direction of long waves and an enstrophy cascade in the direction of high wave numbers.
We assume that p_s = 1 in relationships (78.1)–(78.8) and obtain, as a special case, a system of equations in p-coordinates where, besides the quasi-statics equation

$$\frac{\partial\Phi}{\partial p}=-\frac{RT}{p},\tag{78.29}$$

the continuity equation

$$\frac{1}{a\cos\varphi}\Bigl(\frac{\partial u}{\partial\lambda}+\frac{\partial v\cos\varphi}{\partial\varphi}\Bigr)+\frac{\partial\tau}{\partial p}=0,\qquad \tau=\frac{dp}{dt},\tag{78.30}$$

turns out to be non-evolutionary.
The system of equations of atmosphere hydrothermodynamics with the quasi-static approximation is irregular, i.e. it does not belong to the Cauchy–Kovalevskaya type of systems, but it can be reduced to the evolutionary type by integrating, for example, the quasi-statics equation (78.6) in σ from σ to 1 and using condition (78.11):

$$\Phi=\Phi_s+\int_\sigma^1\frac{RT}{\sigma'}\,d\sigma';\tag{78.31}$$

integration of the continuity equation (78.5) gives, in addition,

$$p_s\dot\sigma=\sigma\int_0^1 D\,d\sigma'-\int_0^\sigma D\,d\sigma',\tag{78.32}$$

$$D=\frac{1}{a\cos\varphi}\Bigl(\frac{\partial p_s u}{\partial\lambda}+\frac{\partial p_s v\cos\varphi}{\partial\varphi}\Bigr),\tag{78.33}$$
where h is the altitude of the free surface, and H is its average value. The ratio of the velocity √(gH) of the gravity wave propagation to the velocity c of the advective transport constitutes a value of order 10. It is expedient, therefore, to solve first, within the limits of a given time step, the system of advection equations

$$\frac{\partial u}{\partial t}+u\frac{\partial u}{\partial x}=0,\qquad \frac{\partial h}{\partial t}+u\frac{\partial h}{\partial x}=0\tag{79.2}$$

for system (79.1). We denote the thereby obtained intermediate values of u^{n+1} and h^{n+1} by u^{n+1/2} and h^{n+1/2} and use them as initial conditions for solving the remaining subsystem

$$\frac{\partial u}{\partial t}+g\frac{\partial h}{\partial x}=0,\qquad \frac{\partial h}{\partial t}+H\frac{\partial u}{\partial x}=0.\tag{79.3}$$
The values u"'+ and h" + I, obtained after solving this subsystem, are taken to be the
final approximate values of the variables at the time level n + 1. This procedure is
repeated on each successive time step.
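The two-stage procedure just described can be sketched numerically. In the following sketch the grid, the coefficients g and H, the upwind discretization of the advection substep (79.2) and the exact Fourier integration of the gravity-wave substep (79.3) are all our own illustrative choices, not the discretization used in the text:

```python
import numpy as np

g, H = 9.8, 1.0                # illustrative values
N, L = 128, 1.0
dx = L / N
x = np.arange(N) * dx
ik = 2j * np.pi * np.fft.fftfreq(N, d=dx)   # spectral d/dx symbol
c = np.sqrt(g * H)                          # gravity-wave speed

def upwind_dx(f, u):
    # first-order upwind difference, direction chosen by the sign of u
    return np.where(u >= 0, f - np.roll(f, 1), np.roll(f, -1) - f) / dx

def step(u, h, dt):
    # Stage 1 (79.2): u_t + u u_x = 0, h_t + u h_x = 0 (explicit upwind)
    u_half = u - dt * u * upwind_dx(u, u)
    h_half = h - dt * u * upwind_dx(h, u)
    # Stage 2 (79.3): u_t + g h_x = 0, h_t + H u_x = 0, integrated exactly
    # mode by mode in Fourier space (each mode oscillates with omega = c|k|)
    uh, hh = np.fft.fft(u_half), np.fft.fft(h_half)
    w = c * np.abs(ik.imag)
    sinc = np.where(w > 0, np.sin(w * dt) / np.where(w > 0, w, 1.0), dt)
    u_new = np.fft.ifft(np.cos(w * dt) * uh - g * ik * sinc * hh).real
    h_new = np.fft.ifft(np.cos(w * dt) * hh - H * ik * sinc * uh).real
    return u_new, h_new

u = np.zeros(N)
h = 1.0 + 0.01 * np.sin(2 * np.pi * x)      # small free-surface perturbation
for _ in range(200):
    u, h = step(u, h, dt=1e-3)
```

The point of the splitting is visible in the sketch: the slow advection is treated by a cheap explicit formula, while the fast gravity waves are handled by a substep that is unconditionally stable.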
We consider a vector function φ = {u, h} and a matrix

$$A=\begin{pmatrix} u & g\\ H & u\end{pmatrix},$$

so that the system takes the form

$$\frac{\partial\varphi}{\partial t}+A\frac{\partial\varphi}{\partial x}=0.\tag{79.4}$$

By using the representation of the matrix A as the sum of its advective and gravity-wave parts,

$$A=A_1+A_2,\qquad A_1=\begin{pmatrix} u & 0\\ 0 & u\end{pmatrix},\qquad A_2=\begin{pmatrix} 0 & g\\ H & 0\end{pmatrix},$$

we can represent the above procedure for solving systems of "shallow water" equations as a splitting scheme,

$$\frac{\varphi^{n+1/2}-\varphi^n}{\Delta t}+A_1\Lambda\varphi^{n+1/2}=0,\qquad \frac{\varphi^{n+1}-\varphi^{n+1/2}}{\Delta t}+A_2\Lambda\varphi^{n+1}=0,$$

where Λ is a difference approximation of the derivative ∂/∂x.
The splitting of the full model (78.17)–(78.20) with respect to physical processes leads to the following stages: (a) for the computation of radiation heat fluxes,

$$\frac{\partial\Theta_1}{\partial t}=\sqrt{\pi}\,\varepsilon_r,\qquad \varepsilon_r=\varepsilon_r(t_n);$$

(b) for the computation of heat fluxes due to condensation, dry and moist convection,

$$\frac{\partial\Theta_2}{\partial t}=\sqrt{\pi}\,\varepsilon_c,\qquad \frac{\partial Q_2}{\partial t}=-\sqrt{\pi}\,\varphi_q,\qquad \varphi_q=\varphi_q(t_n),\tag{79.7}$$

$$\Theta_2(t_n)=\Theta_1(t_{n+1}),\qquad Q_2(t_n)=Q(t_n);$$
(c) for the computation of the momentum, heat and moisture changes due to small-scale diffusion,

$$\frac{\partial U_3}{\partial t}=\sqrt{\pi}\,F_u,\qquad \frac{\partial V_3}{\partial t}=\sqrt{\pi}\,F_v,\qquad \frac{\partial\Theta_3}{\partial t}=\sqrt{\pi}\,F_T,\qquad \frac{\partial Q_3}{\partial t}=\sqrt{\pi}\,F_q;\tag{79.8}$$

(d) for the transport of the substances along the trajectories,

$$\frac{\partial U_4}{\partial t}+K_sU_4=0,\qquad \frac{\partial V_4}{\partial t}+K_sV_4=0,\qquad \frac{\partial\Theta_4}{\partial t}+K_s\Theta_4=0,\qquad \frac{\partial Q_4}{\partial t}+K_sQ_4=0;\tag{79.9}$$

(e) for the adaptation of the meteorological fields: the remaining terms of system (78.17)–(78.20) together with the continuity equation, i.e. the Coriolis and pressure-gradient terms in the equations for U_5 and V_5 and the term with RΘ_5 in the heat equation.
Formulae (78.31)–(78.33) are used for computing Φ, D and σ̇. A formal analysis of the above splitting scheme yields a first-order approximation in time, even in the case when the Crank–Nicolson scheme is used as the basic scheme on each splitting stage. We shall consider stability questions in the following sections.
80. The approximation of equations in spatial variables and the discrete analogues
of conservation laws
In Section 78 we have described the equations of atmosphere hydrothermodynamics in symmetrized form. The purpose of such a symmetrization is, first of all, the possibility of constructing absolutely stable difference schemes, whose stability is proved in a quadratic norm equivalent to the energy of the system. For illustration, consider a one-dimensional transport problem written in the symmetrized form

$$\frac{\partial\varphi}{\partial t}+\frac12\Bigl(u\frac{\partial\varphi}{\partial x}+\frac{\partial u\varphi}{\partial x}\Bigr)=0.\tag{80.1}$$

It is easily seen that for (80.1) the relationship

$$\frac{\partial}{\partial t}\int_G\varphi^2\,dG=0$$

holds. If the spatial difference operator K_h approximating ½(u ∂/∂x + ∂/∂x u) is constructed so that it is skew-symmetric,

$$(K_h\varphi,\varphi)=0,\tag{80.2}$$

in the inner product $(f,g)=\sum_i f_ig_i\,\Delta x$, then the Crank–Nicolson scheme

$$\frac{\varphi^{n+1}-\varphi^n}{\Delta t}+K_h\,\frac{\varphi^{n+1}+\varphi^n}{2}=0\tag{80.3}$$

conserves the quadratic functional exactly: $\|\varphi^{n+1}\|=\|\varphi^n\|$.
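The exact conservation of the quadratic norm by the Crank–Nicolson scheme with a skew-symmetric operator can be checked directly: the transition operator (I + ½ΔtK)⁻¹(I − ½ΔtK) is a Cayley transform and hence orthogonal. In this sketch a random skew-symmetric matrix stands in for the difference transport operator, which is our own illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = rng.standard_normal((n, n))
K = X - X.T                      # skew-symmetric operator: K^T = -K
I = np.eye(n)
dt = 0.1
phi = rng.standard_normal(n)
norm0 = phi @ phi

for _ in range(100):
    # Crank-Nicolson step: (phi^{n+1} - phi^n)/dt + K (phi^{n+1} + phi^n)/2 = 0
    phi = np.linalg.solve(I + 0.5 * dt * K, (I - 0.5 * dt * K) @ phi)
```

After 100 steps the quantity (φ, φ) agrees with its initial value to rounding error, for any Δt, which is what "absolute stability in the quadratic norm" means here.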
It is fairly simple to write the approximation in spatial variables and the scheme for the three-dimensional transport equation in the general case:

$$\frac{\varphi^{n+1}_{i,j,k}-\varphi^n_{i,j,k}}{\Delta t}
+\frac{1}{2a\cos\varphi_j\,\Delta\lambda}\bigl[u_{i+1/2,j,k}\,\varphi^{n+1/2}_{i+1,j,k}-u_{i-1/2,j,k}\,\varphi^{n+1/2}_{i-1,j,k}\bigr]$$
$$+\frac{1}{2a\cos\varphi_j\,\Delta\varphi}\bigl[(v\cos\varphi)_{i,j+1/2,k}\,\varphi^{n+1/2}_{i,j+1,k}-(v\cos\varphi)_{i,j-1/2,k}\,\varphi^{n+1/2}_{i,j-1,k}\bigr]$$
$$+\frac{1}{2\Delta\sigma}\bigl[\dot\sigma_{i,j,k+1/2}\,\varphi^{n+1/2}_{i,j,k+1}-\dot\sigma_{i,j,k-1/2}\,\varphi^{n+1/2}_{i,j,k-1}\bigr]=0,\tag{80.6}$$

where

$$u_{i+1/2,j,k}=\tfrac12\bigl(u_{i+1/2,j+1/2,k}+u_{i+1/2,j-1/2,k}\bigr),$$
$$(v\cos\varphi)_{i,j+1/2,k}=\tfrac12\bigl((v\cos\varphi)_{i+1/2,j+1/2,k}+(v\cos\varphi)_{i-1/2,j+1/2,k}\bigr).$$

The grid points where the functions u and v are determined have indices i+½, j+½ and k; the grid points where the function σ̇ is determined have indices i, j and k+½. It is easily seen that the matrices of the difference operators approximating the derivatives along λ, φ and σ are each skew-symmetric, i.e. (80.6) can be written in vector form,

$$\frac{\varphi^{n+1}-\varphi^n}{\Delta t}+(K_\lambda+K_\varphi+K_\sigma)\varphi^{n+1/2}=0,\qquad \varphi^{n+1/2}=\frac{\varphi^{n+1}+\varphi^n}{2},\tag{80.7}$$

where

$$(K_\alpha\varphi,\varphi)=0,\quad\alpha=\lambda,\varphi,\sigma,\qquad (f,g)=\sum_{i,j,k}f_{ijk}\,g_{ijk}\cos\varphi_j\,\Delta\lambda\,\Delta\varphi\,\Delta\sigma.\tag{80.8}$$
Transport problems can be represented in the form (80.7)–(80.8) for other substances as well: the specific humidity and the horizontal velocity components.
At the adaptation stage the scheme has the form:

$$\frac{U^{n+1}-U^{n+1/2}}{\Delta t}-\Bigl(l+\frac{u^n}{a}\operatorname{tg}\varphi\Bigr)V^{n+1/2}+\frac{1}{a\cos\varphi}\Bigl(\sqrt{\pi}\,\frac{\partial\Phi}{\partial\lambda}+2R\Theta\,\frac{\partial\ln\sqrt{\pi}}{\partial\lambda}\Bigr)^{n+1/2}_h=0,\tag{80.9a}$$

$$\frac{V^{n+1}-V^{n+1/2}}{\Delta t}+\Bigl(l+\frac{u^n}{a}\operatorname{tg}\varphi\Bigr)U^{n+1/2}+\frac{1}{a}\Bigl(\sqrt{\pi}\,\frac{\partial\Phi}{\partial\varphi}+2R\Theta\,\frac{\partial\ln\sqrt{\pi}}{\partial\varphi}\Bigr)^{n+1/2}_h=0,\tag{80.9b}$$

$$\frac{\pi^{n+1}_{ij}-\pi^n_{ij}}{\Delta t}+\sum_k D^{n+1/2}_{ijk}\,\Delta\sigma_k=0,\tag{80.9c}$$

where the subscript h denotes the corresponding spatial difference approximation and

$$D_{ijk}=\frac{1}{a\cos\varphi_j}\Bigl[\frac{\Delta_\lambda(\sqrt{\pi}\,U)_{ijk}}{\Delta\lambda}+\frac{\Delta_\varphi(\sqrt{\pi}\cos\varphi\,V)_{ijk}}{\Delta\varphi}\Bigr],\tag{80.9d}$$

while the heat equation is approximated consistently with the pressure work term,

$$\frac{\Theta^{n+1}-\Theta^n}{\Delta t}-\Bigl(\frac{R\Theta}{c_p\,\sigma\pi}\,\frac{dp}{dt}\Bigr)^{n+1/2}_h=0.\tag{80.9e}$$

For scheme (80.9) the quadratic functional

$$E_h=\sum_{i,j,k}\cos\varphi_j\Bigl(\tfrac12\bigl(U^2+V^2\bigr)_{i+1/2,j+1/2,k}+\frac{c_p}{2\bar T}\,\Theta^2_{i,j,k}\Bigr)\Delta\lambda\,\Delta\varphi\,\Delta\sigma_k\tag{80.10}$$

is conserved, which is, in fact, the difference analogue of the conservation law (78.26).
Since the transport operator K_s is the sum of three operators (transport in each geometric direction), each being skew-symmetric, it is quite natural to perform the splitting with respect to these directions:

$$\frac{\varphi^{n+\alpha/3}-\varphi^{n+(\alpha-1)/3}}{\Delta t}+K_\alpha\,\frac{\varphi^{n+\alpha/3}+\varphi^{n+(\alpha-1)/3}}{2}=0,\qquad \alpha=1,2,3,\tag{81.1}$$

where K₁ = K_λ, K₂ = K_φ, K₃ = K_σ, and φ is any of the functions U, V, Θ or Q.
Due to (80.8) it is obvious that the above difference scheme is absolutely stable and the quadratic functionals conserve their values exactly. In the general case, scheme (81.1) has a first-order approximation in time. The cyclic commutation method allows, however, the construction of a second-order approximation scheme in the general case of noncommuting operators as well (MARCHUK [1980]).
Thus, a three-dimensional transport problem is reduced to a set of one-dimen-
sional problems.
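The effect of cyclic commutation on the approximation order can be illustrated on a small linear system dφ/dt = −(A + B)φ with noncommuting operators. The matrices, the Crank–Nicolson substeps and the step counts below are our own test choices, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

def rand_sym():
    # random symmetric matrix, shifted to make the dynamics decaying
    X = rng.standard_normal((n, n))
    return 0.5 * (X + X.T) + 3.0 * np.eye(n)

A, B = rand_sym(), rand_sym()          # noncommuting test operators
phi0 = rng.standard_normal(n)
I = np.eye(n)

def cn_substep(M, dt, phi):
    # Crank-Nicolson substep for d(phi)/dt = -M phi
    return np.linalg.solve(I + 0.5 * dt * M, (I - 0.5 * dt * M) @ phi)

def integrate(nsteps, cyclic):
    dt, phi = 1.0 / nsteps, phi0.copy()
    for s in range(nsteps):
        # cyclic commutation: reverse the substep order on alternate steps
        pair = (A, B) if (not cyclic or s % 2 == 0) else (B, A)
        for M in pair:
            phi = cn_substep(M, dt, phi)
    return phi

# reference solution exp(-(A+B)) phi0 via the eigendecomposition of A+B
w, V = np.linalg.eigh(A + B)
exact = V @ (np.exp(-w) * (V.T @ phi0))

def err(phi):
    return np.linalg.norm(phi - exact)

ratio_fixed = err(integrate(20, cyclic=False)) / err(integrate(40, cyclic=False))
ratio_cyclic = err(integrate(20, cyclic=True)) / err(integrate(40, cyclic=True))
```

Halving the time step divides the error by roughly 2 for the fixed substep order (first order) and by roughly 4 for the cyclically commuted order (second order), even though A and B do not commute.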
Now we consider methods for inverting these operators. Transport in the λ-direction (along latitude circles) is realized by the cyclic sweep method. It is the same for all unknown functions (the only difference is in determining the "transport" velocities, which are parts of the expression for the operator K_s). The algorithm for transport in the σ-direction is also the same for all unknown functions (scalar sweep). Transport in the φ-direction (along meridians) is realized in a more complex way. A method of constructing cyclically closed rings of meridians shifted by 180° has been used by MARCHUK, DYMNIKOV et al. [1984]. In this case the components of the vector values (U and V) must change their sign in the transition over the Pole, while the scalar values (Θ and Q) keep the same sign. The computation is realized via a cyclic sweep. The skew-symmetric form of the transport operator guarantees, in this case, conservation of the quadratic values.
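A minimal sketch of the two sweep algorithms mentioned above: the scalar three-point sweep (the Thomas algorithm) and the cyclic sweep for a periodic system, the latter obtained from the former by a Sherman–Morrison correction of the two periodic corner entries. The coefficient values are illustrative only:

```python
import numpy as np

def sweep(a, b, c, d):
    """Scalar three-point sweep for a[i]x[i-1] + b[i]x[i] + c[i]x[i+1] = d[i]."""
    n = len(d)
    cp, dp = np.zeros(n), np.zeros(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                       # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def cyclic_sweep(a, b, c, d):
    """Cyclic sweep: the same system with corner entries A[0,n-1] = a[0],
    A[n-1,0] = c[n-1], as on a closed latitude circle."""
    n = len(d)
    gamma = -b[0]
    bb = b.copy()
    bb[0] = b[0] - gamma                        # modified diagonal removes corners
    bb[-1] = b[-1] - a[0] * c[-1] / gamma
    u = np.zeros(n); u[0], u[-1] = gamma, c[-1]
    v = np.zeros(n); v[0], v[-1] = 1.0, a[0] / gamma
    y = sweep(a, bb, c, d)                      # solve A' y = d
    z = sweep(a, bb, c, u)                      # solve A' z = u
    # Sherman-Morrison: x = y - z (v.y)/(1 + v.z), since A = A' + u v^T
    return y - z * (v @ y) / (1.0 + v @ z)

n = 8
a = np.full(n, 1.0); b = np.full(n, 4.0); c = np.full(n, 1.0)
d = np.arange(1.0, n + 1)
x = cyclic_sweep(a, b, c, d)
```

The cyclic variant thus costs two ordinary sweeps, which is why the skew-symmetric one-dimensional transport problems along closed rings remain cheap to invert.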
To solve the system of algebraic equations at the adaptation stage (80.9), the following iterative process with respect to the values U^{n+1/2}, V^{n+1/2}, Θ^{n+1/2} and π^{n+1} is used (some aspects of its convergence and the choice of the optimal parameters are discussed by MARCHUK, DYMNIKOV et al. [1984]). The iterations are started from the values of the preceding splitting stage,

$$\varphi^{(0)}=\varphi^{n},\qquad \varphi=\{U,V,\Theta,\pi\};\tag{81.2}$$

on each iteration s the momentum equations are solved for U^{(s+1)} and V^{(s+1)} with the pressure-gradient terms computed from Θ^{(s)} and π^{(s)}, after which π^{(s+1)} and Θ^{(s+1)} are recomputed from the surface-pressure and heat equations (formulae (81.3)–(81.5)). The process is terminated when

$$\max\bigl|\varphi^{(s+1)}-\varphi^{(s)}\bigr|\le\varepsilon,$$

where ε is a given value of the absolute error, or when the number of iterations exceeds a given limit value.
CHAPTER XX
Problems in Oceanology
82. Statement of the problem and the splitting of equations with respect to
physical processes
The problem of the general circulation of the ocean is described by the following
system of equations written in Cartesian coordinates for the sake of simplicity:
$$\frac{du}{dt}-lv=-\frac{1}{\rho_0}\frac{\partial p}{\partial x}+\mu\Delta u+\frac{\partial}{\partial z}K\frac{\partial u}{\partial z},$$
$$\frac{dv}{dt}+lu=-\frac{1}{\rho_0}\frac{\partial p}{\partial y}+\mu\Delta v+\frac{\partial}{\partial z}K\frac{\partial v}{\partial z},$$
$$\frac{\partial p}{\partial z}=g\rho,\qquad \frac{\partial u}{\partial x}+\frac{\partial v}{\partial y}+\frac{\partial w}{\partial z}=0,$$
$$\frac{dT}{dt}+\gamma_Tw=\mu_T\Delta T+\frac{\partial}{\partial z}K_T\frac{\partial T}{\partial z},\tag{82.1}$$
$$\frac{dS}{dt}+\gamma_Sw=\mu_S\Delta S+\frac{\partial}{\partial z}K_S\frac{\partial S}{\partial z},$$
$$\rho=\rho(T,S),\qquad \frac{d}{dt}=\frac{\partial}{\partial t}+u\frac{\partial}{\partial x}+v\frac{\partial}{\partial y}+w\frac{\partial}{\partial z}.$$
System (82.1) is obtained from the general equations of the hydrodynamics of a rotating fluid by using the traditional approximations for problems of the large-scale circulation of the ocean. At the bottom z = H the boundary conditions are

$$K\frac{\partial u}{\partial z}=0,\quad K\frac{\partial v}{\partial z}=0,\quad \frac{\partial T}{\partial z}=0,\quad \frac{\partial S}{\partial z}=0,\quad w=u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y}\quad\text{for }z=H,\tag{82.3}$$

and the initial conditions are

$$u=u^0,\qquad v=v^0,\qquad T=T^0,\qquad S=S^0\qquad\text{for }t=0,$$
where ν is an external normal vector to the lateral boundary Σ, and a_i, b_i, c_i, f_i, u⁰, v⁰, T⁰ and S⁰ are given coefficients and functions. The set of processes described by system (82.1) can conditionally be divided into three main types: the transport of momentum, heat and salinity along the trajectories, their diffusion, and the adaptation of the mass and current fields. To solve problem (82.1)–(82.3) approximately, the splitting method with respect to the corresponding physical processes can be used.
We pay no attention here to the methods for approximating (82.1) in the spatial variables (see MARCHUK [1974a], MARCHUK, DYMNIKOV et al. [1984] and ZALESNY [1984, 1985]) and write a splitting system on the time interval t_n < t ≤ t_{n+1} in terms of differential equations. We have to solve three subsystems consecutively:

$$\tfrac13\frac{\partial u}{\partial t}+u^n\frac{\partial u}{\partial x}+v^n\frac{\partial u}{\partial y}+w^n\frac{\partial u}{\partial z}=0,$$
$$\tfrac13\frac{\partial v}{\partial t}+u^n\frac{\partial v}{\partial x}+v^n\frac{\partial v}{\partial y}+w^n\frac{\partial v}{\partial z}=0,\tag{82.4}$$
$$\tfrac13\frac{\partial T}{\partial t}+u^n\frac{\partial T}{\partial x}+v^n\frac{\partial T}{\partial y}+w^n\frac{\partial T}{\partial z}=0,$$
$$\tfrac13\frac{\partial S}{\partial t}+u^n\frac{\partial S}{\partial x}+v^n\frac{\partial S}{\partial y}+w^n\frac{\partial S}{\partial z}=0;$$
$$\tfrac13\frac{\partial u}{\partial t}=\mu\Delta u+\frac{\partial}{\partial z}K\frac{\partial u}{\partial z},\qquad \tfrac13\frac{\partial v}{\partial t}=\mu\Delta v+\frac{\partial}{\partial z}K\frac{\partial v}{\partial z},$$
$$\tfrac13\frac{\partial T}{\partial t}=\mu_T\Delta T+\frac{\partial}{\partial z}K_T\frac{\partial T}{\partial z},\tag{82.5}$$
$$\tfrac13\frac{\partial S}{\partial t}=\mu_S\Delta S+\frac{\partial}{\partial z}K_S\frac{\partial S}{\partial z};$$

$$\tfrac13\frac{\partial u}{\partial t}-lv=-\frac{1}{\rho_0}\frac{\partial p}{\partial x},\qquad \tfrac13\frac{\partial v}{\partial t}+lu=-\frac{1}{\rho_0}\frac{\partial p}{\partial y},$$
$$\frac{\partial p}{\partial z}=g\rho,\qquad \frac{\partial u}{\partial x}+\frac{\partial v}{\partial y}+\frac{\partial w}{\partial z}=0,\tag{82.6}$$
$$\tfrac13\frac{\partial T}{\partial t}+\gamma_Tw=0,\qquad \tfrac13\frac{\partial S}{\partial t}+\gamma_Sw=0,$$
$$\rho=\alpha_TT+\alpha_SS+\tilde\rho(T^n,S^n).$$
Note that in this case it is assumed, in splitting system (82.1), that equations (82.1) are linearized on the interval t_n < t ≤ t_{n+1}. This is indicated by the indices n in (82.4) and in the equation of state in system (82.6). We add to the subsystems (82.4)–(82.6) the corresponding initial conditions, and to (82.5)–(82.6) the boundary conditions following from (82.2).
For system (82.5) we have the following boundary conditions:

$$u=v=0,\qquad \frac{\partial T}{\partial\nu}=\frac{\partial S}{\partial\nu}=0\qquad\text{on }\Sigma;$$

$$K\frac{\partial u}{\partial z}=f_1,\qquad K\frac{\partial v}{\partial z}=f_2,\qquad a_1\frac{\partial T}{\partial z}+b_1T=c_1,\qquad a_2\frac{\partial S}{\partial z}+b_2S=c_2\qquad\text{for }z=0,\tag{82.7}$$

$$K\frac{\partial u}{\partial z}=0,\qquad K\frac{\partial v}{\partial z}=0,\qquad \frac{\partial T}{\partial z}=0,\qquad \frac{\partial S}{\partial z}=0\qquad\text{for }z=H.$$

For system (82.6),

$$(u,v)=0\quad\text{on }\Sigma,\qquad w=0\quad\text{for }z=0,\qquad w=u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y}\quad\text{for }z=H.\tag{82.8}$$
The equations obtained at the first two stages of the splitting with respect to physical processes, i.e. systems (82.4) and (82.5), are solved by the methods considered above.
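The idea of these first two stages can be illustrated on a one-dimensional tracer equation: an explicit upwind transport substep (analogous to stage (82.4)) is followed by an implicit diffusion substep (analogous to stage (82.5)). The velocity, diffusivity and grid below are our own illustrative choices:

```python
import numpy as np

N, dx, dt = 100, 0.01, 0.005
u, kappa = 0.5, 1e-3             # illustrative transport velocity, diffusivity
x = np.arange(N) * dx

# implicit diffusion operator (I - dt*kappa*L) on a periodic grid
L = (np.roll(np.eye(N), 1, axis=1) - 2 * np.eye(N)
     + np.roll(np.eye(N), -1, axis=1)) / dx**2
M = np.eye(N) - dt * kappa * L

C = np.exp(-((x - 0.5) / 0.05) ** 2)        # initial tracer blob
mass0 = C.sum()
for _ in range(100):
    # stage 1: transport along the trajectories (first-order upwind, u > 0)
    C = C - dt * u * (C - np.roll(C, 1)) / dx
    # stage 2: implicit (absolutely stable) diffusion
    C = np.linalg.solve(M, C)
```

Each substep here conserves the total tracer content exactly, so the split step does too; only the diffusion substep requires a linear solve, and it is unconditionally stable.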
Now we consider a method for solving system (82.6) based on a further splitting in the coordinate planes (x, z) and (y, z) (MARCHUK [1974a]). Because this method allows generalization, we assume that a more complete evolutionary equation is used instead of the hydrostatics equation. Hence, we arrive at the following two systems of equations, assuming for the sake of simplicity that the state equation is linear (ρ₁ = 0):
$$\tfrac16\frac{\partial u}{\partial t}-\tfrac12lv+\frac{1}{\rho_0}\frac{\partial p}{\partial x}=0,\qquad \tfrac16\frac{\partial v}{\partial t}+\tfrac12lu=0,$$
$$\tfrac16\frac{\partial w}{\partial t}+\frac{1}{2\rho_0}\Bigl(\frac{\partial p}{\partial z}-g\rho\Bigr)=0,\tag{83.1}$$
$$\frac{\partial u}{\partial x}+\tfrac12\frac{\partial w}{\partial z}=0,\qquad \tfrac16\frac{\partial\rho}{\partial t}+\tfrac12\gamma_\rho w=0;$$

$$\tfrac16\frac{\partial u}{\partial t}-\tfrac12lv=0,\qquad \tfrac16\frac{\partial v}{\partial t}+\tfrac12lu+\frac{1}{\rho_0}\frac{\partial p}{\partial y}=0,$$
$$\tfrac16\frac{\partial w}{\partial t}+\frac{1}{2\rho_0}\Bigl(\frac{\partial p}{\partial z}-g\rho\Bigr)=0,\tag{83.2}$$
$$\frac{\partial v}{\partial y}+\tfrac12\frac{\partial w}{\partial z}=0,\qquad \tfrac16\frac{\partial\rho}{\partial t}+\tfrac12\gamma_\rho w=0,$$

where γ_ρ = α_Tγ_T + α_Sγ_S.
By using continuity equations we can reduce each of systems (83.1) and (83.2) to
problems for stream functions.
After the transformation to the coordinate z₁ = z/H, which maps the ocean with bottom topography H(x, y) onto a domain of constant depth, the adaptation system takes the form

$$\tfrac13\frac{\partial u}{\partial t}-lv+\frac{1}{\rho_0H}\Bigl(H\frac{\partial p}{\partial x}-z_1\frac{\partial H}{\partial x}\frac{\partial p}{\partial z_1}\Bigr)=0,$$
$$\tfrac13\frac{\partial v}{\partial t}+lu+\frac{1}{\rho_0H}\Bigl(H\frac{\partial p}{\partial y}-z_1\frac{\partial H}{\partial y}\frac{\partial p}{\partial z_1}\Bigr)=0,$$
$$\frac{\partial p}{\partial z_1}=g\rho H,\qquad \frac{\partial uH}{\partial x}+\frac{\partial vH}{\partial y}+\frac{\partial w_1}{\partial z_1}=0,\tag{84.1}$$
$$\tfrac13\frac{\partial\rho}{\partial t}+\gamma_\rho\Bigl[w_1+z_1\Bigl(u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y}\Bigr)\Bigr]=0,$$

where

$$\gamma_\rho=\alpha_T\gamma_T+\alpha_S\gamma_S,\qquad z_1=\frac{z}{H},\qquad w_1=w-z_1\Bigl(u\frac{\partial H}{\partial x}+v\frac{\partial H}{\partial y}\Bigr).$$
This system can be written in the operator form

$$B\frac{\partial\varphi}{\partial t}+(A_1+A_2)\varphi=0,\tag{84.3}$$

where φ = (u, v, p)ᵀ, B is a positive diagonal operator (with the entries ⅓ρ₀H for the velocity components and the vertical operator −⅓ ∂/∂z₁ (1/(gHγ_ρ)) ∂/∂z₁ for p), the operator A₁ contains the Coriolis terms and the "flat-bottom" pressure and divergence terms H∂/∂x and H∂/∂y, while A₂ collects the terms proportional to the bottom slopes ∂H/∂x and ∂H/∂y.
For the approximate solution of (84.3) we use a splitting scheme with respect to the "topography" (ZALESNY [1984, 1985]):

$$\tfrac12B\frac{\partial\varphi_1}{\partial t}+A_1\varphi_1=0,\qquad \tfrac12B\frac{\partial\varphi_2}{\partial t}+A_2\varphi_2=0.\tag{84.4}$$

At the first stage we have

$$\tfrac16\rho_0H\frac{\partial u}{\partial t}-\rho_0Hlv+H\frac{\partial p}{\partial x}=0,$$
$$\tfrac16\rho_0H\frac{\partial v}{\partial t}+\rho_0Hlu+H\frac{\partial p}{\partial y}=0,\tag{84.5}$$
$$-\tfrac16\frac{\partial}{\partial t}\,\frac{\partial}{\partial z_1}\Bigl(\frac{1}{gH\gamma_\rho}\frac{\partial p}{\partial z_1}\Bigr)+\frac{\partial uH}{\partial x}+\frac{\partial vH}{\partial y}=0.$$

We add to system (84.5) the boundary conditions following from (82.8):

$$w=0\quad\text{for }z_1=0,1;$$

or, in terms of the function p,

$$\frac{\partial}{\partial t}\Bigl(\frac{1}{gH\gamma_\rho}\frac{\partial p}{\partial z_1}\Bigr)=0\quad\text{for }z_1=0,1,\tag{84.6}$$

$$(u,v)=0\quad\text{on }\Sigma.$$
At the second stage of splitting we have

$$\tfrac16\rho_0H\frac{\partial u}{\partial t}-z_1\frac{\partial H}{\partial x}\frac{\partial p}{\partial z_1}=0,$$
$$\tfrac16\rho_0H\frac{\partial v}{\partial t}-z_1\frac{\partial H}{\partial y}\frac{\partial p}{\partial z_1}=0,\tag{84.7}$$
$$-\tfrac16\frac{\partial}{\partial t}\,\frac{\partial}{\partial z_1}\Bigl(\frac{1}{gH\gamma_\rho}\frac{\partial p}{\partial z_1}\Bigr)+\frac{\partial}{\partial z_1}\Bigl[z_1\Bigl(\frac{\partial H}{\partial x}u+\frac{\partial H}{\partial y}v\Bigr)\Bigr]=0,$$

with the corresponding boundary conditions.
System (84.5), obtained at the first splitting stage with respect to the "topography", may be transformed into a set of linear systems of the "shallow water" equations:

$$\tfrac16\rho_0H\frac{\partial u_n}{\partial t}-\rho_0Hlv_n+H\frac{\partial p_n}{\partial x}=0,$$
$$\tfrac16\rho_0H\frac{\partial v_n}{\partial t}+\rho_0Hlu_n+H\frac{\partial p_n}{\partial y}=0,\tag{85.1}$$
$$\frac{\lambda_n}{6}\frac{\partial p_n}{\partial t}+\frac{\partial u_nH}{\partial x}+\frac{\partial v_nH}{\partial y}=0,$$

where u_n, v_n and p_n are the coefficients of the expansion of the solution in the eigenfunctions Z_n of the vertical spectral problem

$$-\frac{\partial}{\partial z_1}\Bigl(\frac{1}{gH\gamma_\rho}\frac{\partial Z_n}{\partial z_1}\Bigr)=\lambda_nZ_n,\qquad \frac{\partial Z_n}{\partial z_1}=0\quad\text{for }z_1=0,1,$$

with λ₀ = 0, λ_i > 0, i = 1, 2, ....
System (85.1) can also be split along the coordinates x and y into two subsystems:

$$\tfrac1{12}\rho_0H\frac{\partial u_n}{\partial t}-\tfrac12\rho_0Hlv_n+H\frac{\partial p_n}{\partial x}=0,$$
$$\tfrac1{12}\rho_0H\frac{\partial v_n}{\partial t}+\tfrac12\rho_0Hlu_n=0,\tag{85.2}$$
$$\frac{\lambda_n}{12}\frac{\partial p_n}{\partial t}+\frac{\partial u_nH}{\partial x}=0;$$

$$\tfrac1{12}\rho_0H\frac{\partial u_n}{\partial t}-\tfrac12\rho_0Hlv_n=0,$$
$$\tfrac1{12}\rho_0H\frac{\partial v_n}{\partial t}+\tfrac12\rho_0Hlu_n+H\frac{\partial p_n}{\partial y}=0,\tag{85.3}$$
$$\frac{\lambda_n}{12}\frac{\partial p_n}{\partial t}+\frac{\partial v_nH}{\partial y}=0.$$
Systems (85.2) and (85.3), or, to be exact, their discrete analogues, are realized by simple three-point sweeps. Note that in the absence of the rotation effect (l = 0) the system coincides, up to notation, with the acoustics equations in the two-dimensional case. A method of splitting with respect to the coordinates analogous to (85.2)–(85.3), with a concrete discretization of the problem in the spatial variables, has been considered by YANENKO [1967].
One can find various versions of splitting methods for ocean dynamics problems, as well as algorithms for their realization and results of numerical experiments, in the works by MARCHUK [1974a], MARCHUK, BUBNOV, ZALESNY and KORDZADZE [1983], MARCHUK, KORDZADZE and SKIBA [1974], MARCHUK et al. [1980], MARCHUK and ZALESNY [1974], KUZIN [1980, 1984] and ZALESNY [1984, 1985].
References
ABRAMOV, A.A. and V.B. ANDREEV (1963), An application of the sweep method for finding periodic
solutions of differential and difference equations, Zh. Vychisl. Mat. i Mat. Fiz. 3 (2) (in Russian).
ALBRECHT, I. (1965), Eine Verallgemeinerung des Verfahrens der Alternierenden, Z. Angew. Math. Mech.
45 (3-6) (Sonderheft).
AMSDEN, A.A. and F.H. HARLOW (1970), A simplified MAC technique for incompressible fluid flow
calculations, J. Comput. Phys. 6 (2), 322-325.
ANDERSON, D.A. (1974), A comparison of numerical solutions to the inviscid equations of fluid motion,
J. Comput. Phys. 15 (1), 1-20.
ANDREEV, V.B. (1967), On the splitting difference schemes for general p-dimensional parabolic equations
of second order with mixed derivatives, Zh. Vychisl. Mat. i Mat. Fiz. 7 (2) (in Russian).
ANDREEV, V.B. (1969), On the convergence of the splitting difference schemes approximating the third
boundary value problem for parabolic equation, Zh. Vychisl. Mat. i Mat. Fiz. 9 (2) (in Russian).
ANUCHINA, N.N. (1970), On the methods for computing compressible liquid flows with large
deformations, Chisl. Metody Mekh. Sploshn. Sredy Inform. Bul. 1 (4) (in Russian).
ANUCHINA, N.N. and N.N. YANENKO (1959), The implicit splitting schemes for hyperbolic equations and
systems, Dokl. Acad. Sci. USSR 128 (6) (in Russian).
BABUSKA, I. (1970), Approximation by hill functions, Tech. Note BN-648, Institute for Fluid Dynamics
and Applied Mathematics, University of Maryland, College Park, MD.
BAGRINOVSKY, K.A. and S.K. GODUNOV (1957), Difference schemes for multidimensional problems, Dokl.
Acad. Sci. USSR 115 (3) (in Russian).
BAKER, G.A. (1960), An implicit numerical method for solving the n-dimensional heat equation, Quart.
Appl. Math. 17 (4), 440-443.
BAKER, G.A. and T.A. OLIPHANT (1960), An implicit numerical method for solving the two-dimensional
heat equation, Quart. Appl. Math. 17 (4).
BALDWIN, B.S. and R.W. MACCORMACK (1974), Numerical solution of the interaction of a strong shock
wave with a hypersonic turbulent boundary layer, AIAA Paper 558, New York.
BAUM, E. and E. NDEFO (1973), A temporal ADI computational technique, in: Proceedings AIAA
Computers and Fluid Dynamics Conference, Springs, CA (AIAA, New York) 133-140.
BELOTSERKOVSKY, O.M. (1984), Numerical Modelling in the Mechanics of Continuous Media (Nauka,
Moscow) (in Russian).
BELOTSERKOVSKY, O.M. and Yu.M. DAVYDOV (1970), The method of "large particles" for the problems in
dynamics, Chysl. Metody Mekh. Sploshn. Sredy Inform. Bul. 1 (3) (in Russian).
BELOTSERKOVSKY, O.M. and Yu.M. DAVYDOV (1978), The method of "large particles" (schemes and
applications), Moscow Physical Technical Institute, Moscow, USSR.
BELOTSERKOVSKY, O.M., Yu.P. GOLOVACHEV, V.G. GRUDNITSKY, Yu.M. DAVYDOV, V.K. DUSHIN, Yu.P.
LUNKIN, K.M. MAGOMEDOV, V.K. MOLODTSOV, F.D. POPOV, A.I. TOLSTYKH, V.N. FOMIN and A.S.
KHOLODOV (1974), Numerical Study of Modern Problems in Gas Dynamics (Nauka, Moscow).
BELOTSERKOVSKY, O.M., V.A. GUSHCHIN and V.V. SHCHENNIKOV (1975), The splitting method in
application to solving problems in the dynamics of viscous incompressible liquid, Zh. Vychisl. Mat. i
Mat. Fiz. 15 (1) (in Russian).
BELOTSERKOVSKY, O.M., F.D. POPOV, A.I. TOLSTYKH, V.N. FOMIN and A.S. KHOLODOV (1970),
Numerical solution of some problems in gas dynamics, Zh. Vychisl. Mat. i Mat. Fiz. 10 (2) (in Russian).
BELUKHINA, I.G. (1969), The difference schemes for solving two-dimensional dynamic elasticity problem
with mixed boundary conditions, Zh. Vychisl. Mat. i Mat. Fiz. 9 (2) (in Russian).
BENSOUSSAN, A. (1971), Pure decentralization for interrelated payoffs, in: Proceedings Symposium on
Optimization, Los Angeles, CA.
BENSOUSSAN, A., J.-L. LIONS and R. TEMAM (1975), Methods of Decomposition, Decentralization,
Coordination and Their Applications, Metody Vychisl. Mat. (Nauka, Novosibirsk).
BEREZIN, Yu.A., V.M. KOVENYA and N.N. YANENKO (1976), The Difference Method for Solving Round
Flow Problems, Aeromechanics (Nauka, Moscow) (in Russian).
BEREZIN, Yu.A. and V.A. VSHIVKOV (1980), The Particle Method in the Dynamics of Rarefied Plasma
(Nauka, Novosibirsk) (in Russian).
BEREZIN, Yu.A. and N.N. YANENKO (1984), The splitting method for the problem in physics of
semiconductors, Dokl. Acad. Sci. USSR 274 (6) (in Russian).
BIRKHOFF, G. and R. VARGA (1959), Implicit alternating direction methods, Trans. Amer. Math. Soc. 92 (1), 13-24.
BIRKHOFF, G., R. VARGA and D. YOUNG (1962), Alternating Direction Implicit Methods, Advances in
Computers 3 (Academic Press, New York).
BOYARINTSEV, Yu.Yu. (1966), On the Convergence of the Splitting Method and Local Correctness Criterion
for Difference Equations with Varying Coefficients, Nekotorye Voprosy Prikl. i Vychisl. Mat. (Nauka,
Novosibirsk) (in Russian).
BOYARINTSEV, Yu.Yu. and O.P. UZNADZE (1967), On the convergence of splitting the system of spherical
harmonics, Zh. Vychisl. Mat. i Mat. Fiz. 7 (6) (in Russian).
BRAILOVSKAYA, I.Yu., T.V. KUSKOVA and L.A. CHUDOV (1968), The Difference Methods for Solving
Navier-Stokes Equations (Survey), Vychisl. Metody i Programmirovanie XI (Moskov. Gos. Univ.,
Moscow) (in Russian).
BRYAN, K. (1966), A scheme for numerical integration of the equations for motion on an irregular grid
free of nonlinear instability, Monthly Weather Rev. 94 (1).
BRIAN, P.L.I. (1961), A finite difference method of high order accuracy for the solution of three-
dimensional heat conduction problems, AIChE J. 7, 367-370.
BULEYEV, N.I. (1960), The numerical method for solving two- and three-dimensional diffusion equations,
Mat. Sb. 5 (2) (in Russian).
BULEYEV, N.I. (1970), The method of incomplete factorization for solving two- and three-dimensional
equation of diffusion type, Zh. Vychisl. Mat. i Mat. Fiz. 10 (4) (in Russian).
BURRIDGE, D.M. and F.R. HAYES (1974), Development of the British operational model, The GARP
Programme on Numerical Experimentation, Rept. 4, 102-104.
BUTLER, T.D. (1974), Recent Advances in Computational Fluid Dynamics, Lecture Notes in Computer
Science 11 (Springer, Berlin) 1-21.
CHAN, R.K.-C. and R.L. STREET (1970), A computer study of finite amplitude water waves, J. Comput.
Phys. 6, 68-97.
CHAN, R.K.-C., R.L. STREET and J. FROMM (1971), Numerical modelling of the water waves: The
development of Summac method, in: Proceedings Second International Conference on Numerical
Methods in FluidDynamics, Lecture Notes in Physics (Springer, Berlin).
CHORIN, A.J. (1968), Numerical solution of the Navier-Stokes equations, Math. Comp. 22(104), 745-762.
CLINE, M.C. (1974), Computation of steady nozzle flow by a time dependent method, AIAA J. 12 (4),
419-420.
CONCUS, P., G.H. GOLUB and D.P. O'LEARY (1976), A generalized conjugate gradient method for the
numerical solution of elliptic partial differential equations, in: J.R. BUNCH and D.J. ROSE, eds., Sparse
Matrix Computations (Academic Press, New York) 309-332.
COURANT, R., K.O. FRIEDRICHS and H. LEWY (1928), Über die partiellen Differenzengleichungen der
mathematischen Physik, Math. Ann. 100, 32-74.
CRANE, C.M. (1974), A new method for the numerical solution of time dependent viscous flow, Appl. Sci.
Res. 30 (4), 47-77.
DAVIS, R.T., U. GHIA and K.N. GHIA (1974), Laminar incompressible flow past a class of blunted wedges
using the Navier-Stokes equations, Comput. & Fluids 2 (2), 211-223.
DAVYDOV, Yu. and V.P. SKOTNIKOV (1978), The Method of "Large Particles": Aspects of Approximation,
Scheme Viscosity and Stability (VC Acad. Sci. USSR, Moscow) (in Russian).
DOUGLAS Jr, J. (1955), On the numerical integration of ∂²u/∂x² + ∂²u/∂y² = ∂u/∂t by implicit methods, J.
SIAM 3, 42-65.
DOUGLAS Jr, J. (1961), Alternating direction iteration for mildly nonlinear elliptic difference equations,
Numer. Math. 3, 92-99.
DOUGLAS Jr, J. (1962), Alternating direction methods for three space variables, Numer. Math. 4, 41-63.
DOUGLAS Jr, J. and T. DUPONT (1971), Alternating direction Galerkin methods on rectangles, in:
Numerical Solution of Partial Differential Equations, II, SYNSPADE (Academic Press, New York).
DOUGLAS Jr, J. and J.E. GUNN (1962), Alternating direction methods for parabolic system in m-space
variables, J. Assoc. Comput. Mach. 9, 450-456.
DOUGLAS Jr, J. and J.E. GUNN (1963), Two high-order correct difference analogues for the equation of
multidimensional heat flow, Math. Comp. 17, 71-80.
DOUGLAS Jr, J. and J.E. GUNN (1964), A general formulation of alternating direction methods, Part 1.
Parabolic and hyperbolic problems, Numer. Math. 6 (5), 428-453.
DOUGLAS Jr, J. and B.F. JONES (1963), On predictor-corrector methods for nonlinear parabolic
differential equations, J. SIAM 11, 195-204.
DOUGLAS Jr, J., R. KELLOGG and R. VARGA (1963), Alternating direction methods for n-space variables,
Math. Comp. 17.
DOUGLAS Jr, J. and C.M. PEARCY (1963), On convergence of alternating direction procedures in the
presence of singular operators, Numer. Math. 5, 175-184.
DOUGLAS Jr, J. and H.H. RACHFORD Jr (1956), On the numerical solution of heat conduction problems in
two- and three-space variables, Trans. Amer. Math. Soc. 82, 421-439.
DRYJA, M. (1967), On the stability of splitting schemes in C, Zh. Vychisl. Mat. i Mat. Fiz. 7 (2) (in Russian).
DRYJA, M. (1971a), The splitting difference schemes for systems of hyperbolic equations of the first order,
Zh. Vychisl. Mat. i Mat. Fiz. 11 (2) (in Russian).
DRYJA, M. (1971b), On the convergence of the splitting difference schemes for parabolic systems in C in the
internal points of the domain, Zh. Vychisl. Mat. i Mat. Fiz. 11 (3) (in Russian).
DUPONT, T. (1968), A factorization procedure for the solution of elliptic difference equations, SIAM J.
Numer. Anal. 5 (4).
DUVAUT, G. and J.-L. LIONS (1972), Les Inéquations en Mécanique et en Physique (Dunod, Paris).
DYACHENKO, V.F. (1965), On the new method for numerical solution of the non-stationary problems with
two spatial variables in gas dynamics, Zh. Vychisl. Mat. i Mat. Fiz. 5 (4) (in Russian).
DYAKONOV, Ye.G. (1961), The alternating direction method for solving systems of finite difference
equations, Dokl. Acad. Sci. USSR 138 (2) (in Russian).
DYAKONOV, Ye.G. (1962a), The grid method for the parabolic equations of the second order with
separable variables, Dokl. Acad. Sci. USSR 142 (6) (in Russian).
DYAKONOV, Ye.G. (1962b), On the one way to solve Poisson equation, Dokl. Acad. Sci. USSR 143 (1) (in
Russian).
DYAKONOV, Ye.G. (1962c), On some difference schemes for solving boundary value problems, Zh.
Vychisl. Mat. i Mat. Fiz. 2 (1) (in Russian).
DYAKONOV, Ye.G. (1962d), On the constructing of the splitting difference schemes for multidimensional
non-stationary problems, Uspekhi Mat. Nauk 17 (4) (in Russian).
DYAKONOV, Ye.G. (1962e), The splitting difference schemes for multidimensional stationary problems,
Zh. Vychisl. Mat. i Mat. Fiz. 2 (4) (in Russian).
DYAKONOV, Ye.G. (1962f), The splitting difference schemes for non-stationary equations, Dokl. Acad.
Sci. USSR 144 (1) (in Russian).
DYAKONOV, Ye.G. (1963a), On the application of the splitting difference operators, Zh. Vychisl. Mat.
i Mat. Fiz. 3 (2) (in Russian).
DYAKONOV, Ye.G. (1963b), On the application of the splitting difference schemes for hyperbolic
equations with varying coefficients, Dokl. Acad. Sci. USSR 151 (4) (in Russian).
DYAKONOV, Ye.G. (1964a), On the application of the splitting difference schemes for some systems of
partial equations, Uspekhi Mat. Nauk 14 (1) (in Russian).
DYAKONOV, Ye.G. (1964b), The splitting difference schemes of the second order of accuracy for parabolic
equations without mixed derivatives, Zh. Vychisl. Mat. i Mat. Fiz. 4 (5) (in Russian).
DYAKONOV, Ye.G. (1964c), The splitting difference schemes for general parabolic equations of the second
order with varying coefficients, Zh. Vychisl. Mat. i Mat. Fiz. 4 (2) (in Russian).
DYAKONOV, Ye.G. (1965a), On the application of the splitting difference schemes for some systems of
parabolic and hyperbolic equations, Sibirsk. Mat. Zh. 6 (3) (in Russian).
DYAKONOV, Ye.G. (1965b), The Splitting Difference Scheme of the Second Order of Accuracy for
Multidimensional Parabolic Equations with Varying Coefficients, Numerical Methods and Programming
3 (Moskov. Gos. Univ., Moscow) (in Russian).
DYAKONOV, Ye.G. (1966), On the application of the splitting difference schemes for some systems of
integro-differential equations, Vestnik Moskov. Univ. Mat. 5 (in Russian).
DYAKONOV, Ye.G. (1967a), On the Class of Partial Equations Arising in Closing the Difference Methods
Based on Splitting the Operator, Vychisl. Metody i Programmirovanie. IV (Moskov. Gos. Univ.,
Moscow) (in Russian).
DYAKONOV, Ye.G. (1967b), The splitting difference schemes for the systems of equations of the kind
L₀ ∂u/∂t + L₁u = f, Dokl. Acad. Sci. USSR 176 (2) (in Russian).
DYAKONOV, Ye.G. (1967c), Economical Difference Methods Based on the Splitting Difference Operator for
Some Systems of PartialEquations, Numerical Methods and Programming 6 (Moskov. Gos. Univ.,
Moscow) (in Russian).
DYAKONOV, Ye.G. (1970), The iterative methods for solving difference analogues of the boundary value
problems for elliptic equations, Institute of Cybernetics Acad. Sci. USSR, Kiev (in Russian).
DYAKONOV, Ye.G. (1971a), On some operator inequalities and their applications, Dokl. Acad Sci. USSR
198 (5) (in Russian).
DYAKONOV, Ye.G. (1971b), On the Difference Methods for Solving Some NonstationarySystems, Applied
Mathematics and Programming 6 (Shtiintsa, Kishinev) (in Russian).
DYAKONOV, Ye.G. (1971c), The Difference Schemes of Increased Accuracy for Systems of Mixed Equations,
Applied Mathematics and Programming (Shtiintsa, Kishinev) (in Russian).
DYAKONOV, Ye.G. (1971d), The Difference Methods for Solving Boundary Value Problems, I: Stationary
Problems (Moskov. Gos. Univ., Moscow) (in Russian).
DYAKONOV, Ye.G. (1972), The Difference Methods for Solving Boundary Value Problems, II: Non-
stationary Problems (Moskov. Gos. Univ., Moscow) (in Russian).
DYAKONOV, Ye.G. and V.I. LEBEDEV (1967), The Splitting Method for the Third Boundary Value
Problem, Numerical Methods and Programming 4 (Moskov. Gos. Univ., Moscow) (in Russian).
DYMNIKOV, V.P. and S.K. FILIN (1985), The numerical modelling of the atmospheric circulation response
to the temperature anomaly on the surface of the ocean in North Atlantic, Preprint No. 100, OVM
Acad. Sci. USSR, Moscow (in Russian).
DYMNIKOV, V.P. and A.V. ISHIMOVA (1979), Unadiabatic model for the short-term weather forecasting,
Meteorologia i Gidrologia 6 (in Russian).
DYMNIKOV, V.P. and V.N. LYKOSOV (1983), Spectral analysis of the quasi-stationary atmospheric
circulation response to the temperature anomaly on the surface of the ocean, Preprint No. 61, OVM
Acad. Sci. USSR, Moscow (in Russian).
FAIRWEATHER, G. and A.R. MITCHELL (1966), Some computational results of an improved A.D.I. method
for the Dirichlet problem, Comput. J. 9 (3).
FAIRWEATHER, G. and A.R. MITCHELL (1967), A new computational procedure for A.D.I. methods, SIAM
J. Numer. Anal. 4, 163-170.
FAIRWEATHER, G., A.R. GOURLAY and A.R. MITCHELL (1967), Some high accuracy difference schemes
with a splitting operator for equations of parabolic and elliptic type, Numer. Math. 10, 56-66.
FILIPPOV, A.F. (1955), On the stability of the difference schemes, Dokl. Acad. Sci. USSR 100 (6) (in
Russian).
FORESTER, C. and A. EMERY (1972), A computational method for low Mach number unsteady
compressible free convective flows, J. Comput. Phys. 10, 487-502.
FRIEDRICHS, K.O. (1954), Symmetric hyperbolic linear differential equations, Comm. Pure Appl. Math. 7,
345-392.
FRYAZINOV, I.V. (1964), On the difference approximation of the boundary conditions for the third
boundary value problem, Zh. Vychisl. Mat. i. Mat. Fiz. 4 (6) (in Russian).
FRYAZINOV, I.V. (1966), On the solution of the third boundary value problem for two-dimensional heat
conduction equation in arbitrary domain by local one-dimensional method, Zh. Vychisl. Mat. i Mat.
Fiz. 6 (3) (in Russian).
FRYAZINOV, I.V. (1968), The economical symmetrized schemes for the solution of boundary value
problems for multidimensional parabolic equation, Zh. Vychisl. Mat. i Mat. Fiz. 8 (2) (in Russian).
FRYAZINOV, I.V. (1969a), The a priori estimates for some family of economical schemes, Zh. Vychisl. Mat.
i Mat. Fiz. 9 (3) (in Russian).
FRYAZINOV, I.V. (1969b), The economical schemes of increased order of accuracy for solving
multidimensional parabolic equation, Zh. Vychisl. Mat. i Mat. Fiz. 9 (6) (in Russian).
GELFAND, I.M. and O.V. LOKUTSIEVSKY (1962), The sweep method for solving difference equations, in:
S.K. GODUNOV and V.S. RYABENKII, eds., Introductionto the Theory of Difference Schemes (Fizmatgiz,
Moscow) (in Russian).
GLOWINSKI, R., J.-L. LIONS and R. TREMOLIERES (1979), Numerical Study of Variational Inequalities (Mir,
Moscow) (in Russian).
GODUNOV, S.K. (1959), The difference method for numerical computation of discontinuous solutions of
equations in hydrodynamics, Mat. Sb. 47 (3) (in Russian).
GODUNOV, S.K. (1962), The method of the orthogonal sweep for solving system of difference equations,
Zh. Vychisl. Mat. i Mat. Fiz. 2 (6) (in Russian).
GODUNOV, S.K. and V.S. RYABENKII (1977), The Difference Schemes: Introduction to the Theory (Nauka,
Moscow) (in Russian).
GODUNOV, S.K. and K.A. SEMENDYAEV (1962), The difference methods for numerical solution of the
problems in gas dynamics, Zh. Vychisl. Mat. i Mat. Fiz. 2 (7) (in Russian).
GODUNOV, S.K. and A.V. ZABRODIN (1962), On the difference schemes of the second order of accuracy for
multidimensional problems, Zh. Vychisl. Mat. i Mat. Fiz. 2 (4) (in Russian).
GODUNOV, S.K., A.V. ZABRODIN, M.Y. IVANOV et al. (1976), Numerical Solution of Multidimensional
Problems in Gas Dynamics (Nauka, Moscow) (in Russian).
GOURLAY, A. and A. MITCHELL (1961), Split operator methods for hyperbolic system in p-space variables,
Math. Comp. 21, 351-354.
GOURLAY, A. and A. MITCHELL (1966), Alternating direction methods for hyperbolic systems, Numer.
Math. 8, 137-149.
GOURLAY, A. and A. MITCHELL (1967), Intermediate boundary corrections for split operator methods in
three dimensions, BIT 7, 31-38.
GOURLAY, A. and A. MITCHELL (1968), High accuracy A.D.I. methods for parabolic equations with
variable coefficients, Numer. Math. 12, 180-185.
GOURLAY, A. and A. MITCHELL (1969), A classification of split difference methods for hyperbolic
equations in several space dimensions, SIAM J. Numer. Anal. 6, 62-71.
GUITTET, J. (1967), Une nouvelle méthode de directions alternées à q variables, J. Math. Anal. Appl. 17, 199-213.
GUNN, J. (1965), The solution of elliptic difference equations by semi-explicit iterative techniques, SIAM
J. Numer. Anal. 2 (1).
GUSHCHIN, V.A. (1981), The splitting method for solving the problem in dynamics of inhomogeneous
viscous incompressible liquid, Zh. Vyschisl. Mat. i Mat. Fiz. 21 (4) (in Russian).
HUBBARD, B. (1966), Alternating direction schemes for the heat equation in a general domain, SIAM J.
Numer. Anal. 2 (3).
HARLOW, F. (1967), Numerical Method of Particles in Cells for the Problems in Hydrodynamics, in: Numerical
Methods in Hydrodynamics (Mir, Moscow) (in Russian).
HARLOW, F. and J. WELCH (1965), Numerical calculation of time-dependent viscous incompressible flow
of fluid with free surface, Phys. Fluids 8, 2182-2189.
IL'IN, V.P. (1965), On the splitting of the parabolic and elliptic difference equations, Sibirsk. Mat. Zh. 6 (1)
(in Russian).
IL'IN, V.P. (1966), On the application of the alternating direction method for solving quasi-linear
parabolic and elliptic equations, in: Some Aspects of Applied and Numerical Mathematics (Nauka,
Novosibirsk) (in Russian).
IL'IN, V.P. (1967), On the explicit alternating direction schemes, Izv. Sib. Otd. Acad. Sci. USSR Ser. Tekhn.
Nauk 13 (3) (in Russian).
IL'IN, V.P. (1970), The Difference Methods for Solving Elliptic Equations (Novosibirsk Gos. Univ.,
Novosibirsk) (in Russian).
KARCHEVSKY, M.M., A.V. LAPIN and A.D. LYASHKO (1972), Economical difference schemes for
quasi-linear parabolic equations, Izv. Vuzov Mat. 3 (118) (in Russian).
KELDYSH, M.V. (1942), On Galerkin's method for solving boundary value problems, Izv. Acad. Sci.
USSR, Mat. 6 (in Russian).
KELLOGG, R. (1963), Another alternating direction implicit method, J. SIAM 11, 976-979.
KELLOGG, R. (1964), An alternating direction method for operator equations, J. SIAM 12 (4).
KELLOGG, R. and J. SPANIER (1965), On optimal alternating direction parameters for singular matrices,
Math. Comp. 19 (1).
KOCHERGIN, V.P. and Yu.A. KUZNETSOV (1969), On the solution of the system of linear equations by the
splitting method, in: Numerical Methods in Transport Theory (Atomizdat, Moscow) (in Russian).
KONOVALOV, A.N. (1962), The method of fractional steps for solving Cauchy problem for multidimen-
sional oscillation equation, Dokl. Acad. Sci. USSR 147 (1) (in Russian).
KONOVALOV, A.N. (1964), The application of the splitting method to the numerical solution of dynamic
problems in the theory of elasticity, Zh. Vychisl. Mat. i Mat. Fiz. 4 (4) (in Russian).
KONOVALOV, A.N. (1972), The problem of the filtration of multiphase incompressible liquid (Novosibirsk
Gos. Univ., Novosibirsk) (in Russian).
KOROBITSINA, J.L. (1969), On the boundary conditions in the splitting scheme for kinetic equation, in:
Numerical Methods in Transport Theory (Atomizdat, Moscow) (in Russian).
KOVENYA, V.M. and A.S. LEBEDEV (1984), The application of the splitting method in predictor-corrector
schemes for solving problems in gas dynamics, Chysl. Metody Mekh. Sploshn. Sredy 15 (2) (in Russian).
KOVENYA, V.M. and N.N. YANENKO (1981), The Splitting Methods for Problems in Gas Dynamics (Nauka,
Novosibirsk) (in Russian).
KRYAKVIN, S.A. (1966), On the accuracy of the alternating direction schemes for heat conduction
equation, Zh. Vychisl. Mat. i Mat. Fiz. 6 (in Russian).
KUTLER, P. and H. LOMAX (1971), Shock-capturing, finite-difference approach to supersonic flows,
J. Spacecraft and Rockets 8, 1175-1182.
KUTLER, P., W. REINHARDT and R. WARMING (1973), Multishocked, three-dimensional supersonic
flowfields with real gas effects, AIAA J. 11, 657-664.
KUTLER, P., L. SAKELL and G. AIELLO (1974), On the shock-on-shock interaction problem, AIAA Paper
524, New York, 10pp.
KUZIN, V.I. (1980), On the solution of the equation for barotropic Rossby waves by the finite element
method with splitting, in: Mathematical Modelling of Ocean Dynamics (VC Sib. Otd. Acad. Sci. USSR,
Novosibirsk) (in Russian).
KUZIN, V.I. (1984), The numerical model of global ocean circulation based on finite element method with
splitting, in: Mathematical Modelling of the Dynamics of Ocean and Internal Reservoirs (VC Sib. Otd. Acad.
Sci. USSR, Novosibirsk) (in Russian).
KUZNETSOV, A.Yu. and M.K. STRELETS (1983), Numerical modelling of essentially subsonic stationary
non-isothermal flows of homogeneous viscous gas in channels, Chysl. Metody Mekh. Sploshn. Sredy 14
(6) (in Russian).
KUZNETSOV, B.P., N.P. MOSHKIN and S. SMAGULOV (1983), Numerical study of viscous incompressible
liquid flows in channels of complex geometry under given pressure jumps, Chysl. Metody Mekh.
Sploshn. Sredy 14 (5) (in Russian).
LADYZHENSKAYA, O.A. (1956), On the solution of non-stationary operator equations, Mat. Sb. 39 (4)
(in Russian).
LADYZHENSKAYA, O.A. (1958), On the non-stationary operator equations and their applications to linear
problems in mathematical physics, Mat. Sb. 45 (in Russian).
LADYZHENSKAYA, O.A. and V.Y. RIVKIND (1971), On the convergent difference schemes for Navier-Stokes
equations, Chysl. Metody Mekh. Sploshn. Sredy Inform. Bul. 2 (1) (in Russian).
LAVAL, P. (1983), Nouveaux schémas de désintégration pour la résolution des problèmes hyperboliques et
paraboliques non-linéaires; applications aux équations d'Euler et de Navier-Stokes, Rech. Aérospat. 4.
LAX, P. (1962), On the stability of finite difference approximations to the solutions of hyperbolic
equations with varying coefficients, Mathematics (Translations coll.) 6 (3) (in Russian).
LAX, P. and B. WENDROFF (1964), Difference schemes for hyperbolic equations with high order of accuracy,
Comm. Pure Appl. Math. 17, 381-398.
LEBEDEV, V.I. (1971), On the type of quadrature formulae of increased algebraic accuracy for sphere, Dokl.
Acad. Sci. USSR 231 (1) (in Russian).
LEBEDEV, V.I. (1976), On the quadratures on the sphere, Zh. Vychisl. Mat. i Mat. Fiz. 16 (2) (in Russian).
LEBEDEV, V.I. (1977), On Zolotarev's problem in alternating direction method, Zh. Vychisl. Mat. i Mat.
Fiz. 17 (2) (in Russian).
LEES, M. (1960a), A priori estimates for the solutions of difference approximations to parabolic partial
differential equations, Duke Math. J. 27, 297-311.
LEES, M. (1960b), Energy inequalities for the solution of differential equations, Trans. Amer. Math. Soc.
94, 58-73.
LEES, M. (1960c), Alternating direction methods for hyperbolic differential equations, J. SIAM 10,
610-616.
LEES, M. (1961), Alternating direction and semi-explicit difference methods for parabolic partial
differential equations, Numer. Math. 3, 398-412.
LEES, M. (1966), A linear three-level difference scheme for quasi-linear parabolic equations, Math. Comp.
20, 516-522.
LEPAS, J., M. BEKLARECHE, J. CAIFFIER, L. FINKE and A. TAGNIT-HAAON (1974), Primitive equations
model-implicit method for numerical integration, The GARP Programme on Numerical Experi-
mentation, Rept. 4, 65.
LERAT, A. and R. PEYRET (1975), Propriétés dispersives et dissipatives d'une classe de schémas aux
différences pour les systèmes hyperboliques non-linéaires, Rech. Aérospat. 2, 61-79.
LIONS, P. and B. MERCIER (1978), Splitting algorithms for the sum of two non-linear operators, Rapport
Interne 29, Centre de Mathématiques Appliquées, École Polytechnique, Palaiseau, France.
LIONS, J.-L. and R. TEMAM (1966), Une méthode d'éclatement des opérateurs et des contraintes en calcul
des variations, C.R. Acad. Sci. Paris 263.
LORENZ, E. (1955), Available potential energy and the maintenance of the general circulation, Tellus 7,
157-167.
LUNN, M. (1964), On the equivalence of SOR, SSOR and USSOR as applied to ordered systems of linear
equations, Comput. J. 7, 72-75.
LYTKIN, Yu.M. and G.G. CHERNYKH (1975), On the internal waves induced by the collapse of
displacement zone in stratified liquid, Dynamics of Continuous Media 22 (in Russian).
MACCORMACK, R. and B. BALDWIN (1975), A numerical method for solving Navier-Stokes equations
with application to shock-boundary layer interactions, AIAA Paper 1, New York, 8 pp.
MACCORMACK, R. and A. PAULLAY (1972), Computational efficiency achieved by time splitting of finite
difference operators, AIAA Paper 154, New York, 7 pp.
MARCHUK, G.I. (1964a), The new approach to numerical solving the equations of weather forecasting,
in: Proceedings Symposium on Long-Term Forecasting Methods, USA.
MARCHUK, G.I. (1964b), The numerical algorithm for solving the equations of weather forecasting, Dokl.
Acad. Sci. USSR 156 (2), 308-311 (in Russian).
MARCHUK, G.I. (1965), Numerical methods for the problems in weather forecasting and theory of climate,
in: Lectures on the Numerical Methods for Short-Term Weather Forecasting (Gidrometeoizdat,
Leningrad) (in Russian).
MARCHUK, G.I. (1967), Numerical Methods for Weather Forecasting (Gidrometeoizdat, Leningrad) (in
Russian).
MARCHUK, G.I. (1968), Some application of splitting methods to the solution of mathematical physics
problems, Apl. Mat. 13, 103-132.
MARCHUK, G.I. (1969), The splitting method for problems in mathematical physics, in: Numerical
Methods for Problems in Mechanics of Continuous Media (in Russian).
MARCHUK, G.I. (1970), Methods and problems in numerical mathematics, Internat. Math. Congr. Nizza;
also: (1972), Reports of Soviet Mathematicians (Nauka, Moscow) (in Russian).
MARCHUK, G.I. (1971), On the theory of splitting method, in: Numerical Solution of Partial Differential
Equations, II, SYNSPADE (Academic Press, New York).
MARCHUK, G.I. (1973), Introduction into the Methods of Numerical Analysis (Cremonese, Rome).
MARCHUK, G.I. (1974a), Numerical Solution of the Problems in Atmosphere and Ocean Dynamics
(Gidrometeoizdat, Leningrad) (in Russian).
MARCHUK, G.I. (1974b), Numerical Methods in the Computation of Ocean Currents (Vychisl. Centr. Sib.
Otd. Acad. Sci. USSR, Novosibirsk) (in Russian).
MARCHUK, G.I. (1980), Methods of Numerical Mathematics (Nauka, Moscow) (in Russian).
MARCHUK, G.I. (1982), Mathematical Modelling in the Problem of Environment (Nauka, Moscow)
(in Russian).
MARCHUK, G.I. and V.I. AGOSHKOV (1981), Introduction to the Projection Grid Methods (Nauka,
Moscow) (in Russian).
MARCHUK, G.I., M.A. BUBNOV, V.B. ZALESNY and A.A. KORDZADZE (1983), Mathematical modelling of
the sea currents, tide waves and elaboration of numerical algorithms, in: Actual Problems in Numerical
and Applied Mathematics (Nauka, Novosibirsk) (in Russian).
MARCHUK, G.I. and V.P. DYMNIKOV, eds. (1974), Dynamic Meteorology and Numerical Weather
Forecasting (Gidrometeoizdat, Moscow) (in Russian).
MARCHUK, G.I., V.P. DYMNIKOV, V.Ya. GALIN et al. (1975), The hydrodynamic model of the global
atmosphere and ocean circulation (in Russian).
MARCHUK, G.I., V.P. DYMNIKOV and V.B. LYKOSOV (1981), On the relation between index cycles of the
atmosphere circulation and spatial spectrum of the kinetic energy in the model of the general
circulation of the atmosphere, Tech. Memo. 31, ECMWF, 33 pp.
MARCHUK, G.I., V.P. DYMNIKOV, V.N. LYKOSOV, V.Ya. GALIN, I.M. BOBYLEVA and V.L. PEROV (1979),
The global atmosphere circulation model, Izv. Acad. Sci. USSR Ser. Fiz. Atm. Okeana 15 (in Russian).
MARCHUK, G.I., V.P. DYMNIKOV, V.B. ZALESNY, V.N. LYKOSOV, and V.Ya. GALIN (1984), Mathematical
Modelling of the GlobalAtmosphere and Ocean Circulation(Gidrometeoizdat, Leningrad) (in Russian).
MARCHUK, G.I., G.R. KONTAREV and G.S. RIVIN (1967), The short-term weather forecasting based on the
whole equations in bounded territory, Izv. Acad. Sci. USSR Ser. Fiz. Atm. Okeana 3 (in Russian).
MARCHUK, G.I., A.A. KORDZADZE and Yu.N. SKIBA (1975), The computation of main hydrological fields
of the Black Sea, Izv. Acad. Sci. USSR Ser. Fiz. Atm. Okeana 11 (4) (in Russian).
MARCHUK, G.I. and V. KUZIN (1983), On the combination of finite element and splitting methods in the
solution of parabolic equations, J. Comput. Phys. 52, 237-272.
MARCHUK, G.I. and Yu.A. KUZNETSOV (1972), Iterative Methods and Quadratic Functionals (Nauka,
Novosibirsk) (in Russian).
MARCHUK, G.I. and Yu.A. KUZNETSOV (1974), Méthodes itératives et fonctionnelles quadratiques, in: Sur
les Méthodes Numériques en Sciences Physiques et Économiques, Méthodes Mathématiques de
l'Informatique 4 (Dunod, Paris) 3-132.
MARCHUK, G.I. and V.I. LEBEDEV (1981), Numerical Methods in the Theory of Neutron Transport
(Atomizdat, Moscow) (in Russian).
MARCHUK, G.I. et al. (1980), Mathematical Models of Ocean Circulation (Nauka, Novosibirsk) (in
Russian).
MARCHUK, G.I., V.V. PENENKO and U.M. SULTANGAZIN (1969), On the solution of the kinetic equation by
the splitting method, in: Numerical Methods in Transport Theory (Atomizdat, Moscow) (in Russian).
MARCHUK, G.I. and A.S. SARKISYAN, eds. (1980), Mathematical Models of Ocean Circulation (Nauka,
Novosibirsk) (in Russian).
MARCHUK, G.I. and Yu.N. SKIBA (1976), The numerical computation of the conjugate problem in the
model of theoretical interaction between atmosphere, oceans and continents, Izv. Acad. Sci. USSR Ser.
Fiz. Atm. Okeana 12 (5) (in Russian).
MARCHUK, G.I. and U.M. SULTANGAZIN (1965a), On the convergence of the splitting method for the
equations of radiation transport, Dokl. Acad. Sci. USSR 161 (1) (in Russian).
MARCHUK, G.I. and U.M. SULTANGAZIN (1965b), On the solution of the kinetic transport equation by the
splitting method, Dokl. Acad. Sci. USSR 163 (4) (in Russian).
MARCHUK, G.I. and U.M. SULTANGAZIN (1965c), On the grounding of the splitting method for the
equation of radiation transport, Zh. Vychisl. Mat. i Mat. Fiz. 5 (5) (in Russian).
MARCHUK, G.I. and N.N. YANENKO (1964), The solution of the multidimensional kinetic equation by the
splitting method, Dokl. Acad. Sci. USSR 157 (6) (in Russian).
MARCHUK, G.I. and N.N. YANENKO (1966), Application of the splitting (fractional steps) method to the
solving of problems in mathematical physics, in: Some Aspects of Numerical and Applied Mathematics
(Nauka, Novosibirsk) (in Russian).
MARCHUK, G.I. and V.B. ZALESNY (1974), The numerical model of the large-scale circulation in the World
Ocean, in: Numerical Methods for Computation of Ocean Currents (in Russian).
MELADZE, G. (1970), The schemes of increased order of accuracy for systems of elliptic and parabolic
equations, Zh. Vychisl. Mat. i Mat. Fiz. 10 (2) (in Russian).
MESINGER, F. and A. ARAKAWA (1976), Numerical Methods Used in Atmospheric Models, GARP Publ.
Series 17, 135.
MIGUAL, O., P. PINSKY and R. TAYLOR (1983), Operator split methods for the numerical solution of the
elastoplastic dynamic problem, Comput. Methods Appl. Mech. Engrg. 39, 137-157.
MITCHELL, A. and G. FAIRWEATHER (1964), Improved forms of the alternating direction methods of
Douglas, Peaceman and Rachford for solving parabolic and elliptic equations, Numer. Math. 6,
285-292.
MORRIS, J. (1970), On the numerical solution of a heat equation associated with a thermal print-head, J.
Comput. Phys. 5, 208-228.
MOSKOLKOV, M.N. (1969), On the family of the difference schemes for quasi-linear parabolic equation,
Zh. Vychisl. Mat. i Mat. Fiz. 9 (6) (in Russian).
NICHOLS, B. (1971), Recent extensions to the MAC method for incompressible fluid flows, in: Proceedings
Second International Conference on Numerical Methods in Fluid Dynamics, Lecture Notes in Physics
(Springer, Berlin).
NICHOLS, B. (1973), Further development of the method of markers and cells for incompressible fluid flows,
in: Numerical Methods in Fluid Mechanics (Mir, Moscow) (in Russian).
OLIPHANT, T. (1961), An implicit, numerical method for solving two-dimensional time-dependent
diffusion problems, Quart. Appl. Math. 19 (3).
PEACEMAN, D.W. and H.H. RACHFORD Jr (1955), The numerical solution of parabolic and elliptic
differential equations, J. SIAM 3, 28-42.
PEARCY, C. (1962), On convergence of alternating direction procedures, Numer. Math. 4, 172-176.
PENENKO, V.V. and A.Ye. ALOYAN (1975), Some applications of the splitting method for the problems in
mesometeorology, Tr. Zap. Sib. Reg. N. 1. Gidrometeorol. In-t 14 (in Russian).
PENENKO, V.V., U.M. SULTANGAZIN and B.A. BALASH (1969), The solution of the kinetic equation by the
splitting method, in: Numerical Methods in Transport Theory (Atomizdat, Moscow) (in Russian).
POPOV, Yu.P. and A.A. SAMARSKII (1970), Completely conservative difference schemes for the equations
of magnetic hydrodynamics, Zh. Vychisl. Mat. i Mat. Fiz. 10 (4) (in Russian).
PRACHT, W. (1973), The implicit method for the computation of creeping flows with application to the problem
of continental drift, in: Numerical Methods in Fluid Mechanics (Moscow, Mir) (in Russian).
PYSTA, S. (1969), The splitting difference schemes for systems of mixed differential equations, Zh. Vychisl.
Mat. i Mat. Fiz. 9 (4) (in Russian).
RACHFORD Jr, H.H. (1968), Rounding errors in alternating direction methods for parabolic problems,
Apl. Mat. 13, 177-180.
RAVIART, P. (1967), Sur l'approximation de certaines équations d'évolution linéaires et non-linéaires,
J. Math. Pures Appl. 46, 11-107; 46, 109-183.
REID, J. (1971), On the method of conjugate gradients for the solution of large sparse systems of linear
equations, in: Large Sparse Sets of Linear Equations (Academic Press, London) 231-251.
RICHTMYER, R.D. and K.W. MORTON (1972), Difference Methods for Boundary Value Problems (Mir,
Moscow) (in Russian).
RIZZI, A.W. and M. INOUYE (1973a), A time-split finite volume technique for three-dimensional
blunt-body flow, AIAA Paper 133, New York, 14 pp.
RIZZI, A.W. and M. INOUYE (1973b), Time-split finite volume method for three-dimensional blunt-body flow, AIAA
J. 11, 1478-1485.
RIZZI, A.W., A. KLAVINS and R. MACCORMACK (1975), A generalized hyperbolic marching technique for
three-dimensional supersonic flow with shocks, in: Lecture Notes in Physics 35 (Springer, Berlin)
341-346.
ROZHDESTVENSKY, B.L. and N.N. YANENKO (1968), Systems of Quasi-Linear Equations (Nauka, Moscow)
(in Russian).
RUSANOV, V.V. (1960), On the stability of the matrix sweep method, in: Numerical Mathematics 6 (Mir,
Moscow) (in Russian).
RYABENKII, V.S. and A.F. FILIPPOV (1956), On the Stability of Difference Equations (Gostekhizdat,
Moscow) (in Russian).
SAMARSKII, A.A. (1961a), A priori estimates for the solution of the difference analogue of a parabolic
differential equation, Zh. Vychisl. Mat. i Mat. Fiz. 1, 441-460 (in Russian).
SAMARSKII, A.A. (1961b), A priori estimates for difference equations, Zh. Vychisl. Mat. i Mat. Fiz. 1,
972-1000 (in Russian).
SAMARSKII, A.A. (1962a), The homogeneous difference schemes for nonlinear parabolic equations, Zh.
Vychisl. Mat. i Mat. Fiz. 2 (1) (in Russian).
SAMARSKII, A.A. (1962b), On the convergence and accuracy of homogeneous difference schemes for one-
and multidimensional parabolic equations, Zh. Vychisl. Mat. i Mat. Fiz. 2, 603-634 (in Russian).
SAMARSKII, A.A. (1962c), On an economical difference method for the solution of a multidimensional
parabolic equation in an arbitrary region, Zh. Vychisl. Mat. i Mat. Fiz. 2, 787-811 (in Russian).
SAMARSKII, A.A. (1962d), On the convergence of the method of fractional steps for heat conduction
problems, Zh. Vychisl. Mat. i Mat. Fiz. 2 (6) (in Russian).
SAMARSKII, A.A. (1963a), Local one-dimensional difference schemes on non-uniform grids, Zh. Vychisl.
Mat. i Mat. Fiz. 3 (3) (in Russian).
SAMARSKII, A.A. (1963b), Schemes of increased order of accuracy for the multi-dimensional heat
conduction equation, Zh. Vychisl. Mat. i Mat. Fiz. 3 (5) (in Russian).
SAMARSKII, A.A. (1964a), On the economical algorithm for the numerical solution of a system of
differential and algebraic equations, Zh. Vychisl. Mat. i Mat. Fiz. 4 (3) (in Russian).
SAMARSKII, A.A. (1964b), Local one-dimensional difference schemes for multidimensional hyperbolic
equations in an arbitrary domain, Zh. Vychisl. Mat. i Mat. Fiz. 4 (4) (in Russian).
SAMARSKII, A.A. (1965a), Economical difference schemes for a hyperbolic system of equations with mixed
derivatives and their application to the equations of the theory of elasticity, Zh. Vychisl. Mat. i Mat.
Fiz. 5 (1) (in Russian).
SAMARSKII, A.A. (1965b), On the additivity principle in constructing economical difference schemes, Dokl.
Acad. Sci. USSR 165 (6) (in Russian).
SAMARSKII, A.A. (1966), The additive schemes, Thesis of the Reports on the International Mathematicians
Congress in Moscow (in Russian).
SAMARSKII, A.A. (1967), The classes of stable schemes, Zh. Vychisl. Mat. i Mat. Fiz. 7 (5) (in Russian).
SAMARSKII, A.A. (1970), Some Aspects of the General Theory of Difference Schemes: Partial Differential
Equations (Nauka, Moscow) (in Russian).
SAMARSKII, A.A. (1971), Introduction to the Theory of Difference Schemes (Nauka, Moscow) (in Russian).
SAMARSKII, A.A. (1977), The Theory of Difference Schemes (Nauka, Moscow) (in Russian).
SAMARSKII, A.A. and V.B. ANDREEV (1963), On the difference schemes of increased order of accuracy for
the elliptic equation with several spatial variables, Zh. Vychisl. Mat. i Mat. Fiz. 3 (6) (in Russian).
SAMARSKII, A.A. and V.B. ANDREEV (1964), Iterative alternating direction schemes for the numerical
solution of the Dirichlet problem, Zh. Vychisl. Mat. i Mat. Fiz. 4 (6) (in Russian).
SAMARSKII, A.A. and V.B. ANDREEV (1976), The Difference Schemes for Elliptic Equations (Nauka,
Moscow) (in Russian).
SAMARSKII, A.A. and I.V. FRYAZINOV (1961), On the convergence of homogeneous difference schemes for
a heat conduction equation with discontinuous coefficients, Zh. Vychisl. Mat. i Mat. Fiz. 1, 806-824 (in
Russian).
SAMARSKII, A.A. and I.V. FRYAZINOV (1971), On the convergence of the local one-dimensional scheme for
the multidimensional heat conduction equation on non-uniform grids, Zh. Vychisl. Mat. i Mat. Fiz. 11
(3) (in Russian).
SAMARSKII, A.A. and A.V. GULIN (1973), The Stability of Difference Schemes (Nauka, Moscow) (in
Russian).
SAMARSKII, A.A. and E.S. NIKOLAEV (1978), Methods for Solving Grid Equations (Nauka, Moscow) (in
Russian).
SAMARSKII, A.A. and Yu.P. POPOV (1975), Difference Schemes for Gas Dynamics (Nauka, Moscow) (in
Russian).
SAUL'EV, V.K. (1960), Integration of Parabolic Equations by the Grid Method (Fizmatgiz, Moscow) (in
Russian).
SCOTT, W. and P. ROGER (1983), A more accurate method for the numerical solution of non-linear partial
differential equations, J. Comput. Phys. 49, 342-348.
SMELOV, V.V. (1978), Lectures on the Neutron Transport Theory (Atomizdat, Moscow) (in Russian).
SOFRONOV, I.D. (1965), The difference scheme with diagonal directions of sweeps for a heat conduction
equation, Zh. Vychisl. Mat. i Mat. Fiz. 5 (in Russian).
STRANG, G. and G. FIX (1977), The Theory of the Finite Element Method (Mir, Moscow) (in Russian).
SULTANGAZIN, U.M. (1971), On the foundation of the method of weak approximation for the spherical
harmonics equation, Preprint, Sib. Otd. Acad. Sci. USSR, Novosibirsk (in Russian).
SULTANGAZIN, U.M. (1979), The Methods of Spherical Harmonics and Discrete Ordinates in Problems in
Kinetic Transport Theory (Nauka, Alma-Ata) (in Russian).
TEMAM, R. (1968), Sur la stabilité et la convergence de la méthode des pas fractionnaires, Ann. Mat. Pura
Appl. (4) 79.
TEMAM, R. (1969), Sur l'approximation de la solution des équations de Navier-Stokes par la méthode des
pas fractionnaires, Arch. Rational Mech. Anal. 32, 135-153.
TEMAM, R. (1970), Quelques méthodes de décomposition en analyse numérique, Actes du Congrès
International des Mathématiciens 3.
TEMAM, R. (1981), Navier-Stokes Equation and Numerical Analysis (Mir, Moscow) (in Russian).
THOMPSON, R.J. (1964), Difference approximations for inhomogeneous and quasi-linear equations, J.
SIAM 12, 189-199.
TIKHONOV, A.N. and A.A. SAMARSKII (1961), Homogeneous difference schemes, Zh. Vychisl. Mat. i Mat.
Fiz. 1, 5-63 (in Russian).
TODD, J. (1967), Inequalities of Chebyshev, Zolotareff, Cauer and W.B. Jordan, in: Proceedings Symposium
on Inequalities, Wright-Patterson Air Force Base, Dayton, OH (Academic Press, New York), 321-
328.
VALIULIN, A.N. and N.N. YANENKO (1967), Economical difference schemes of increased accuracy for
a polyharmonic equation, Izv. Sib. Otd. Acad. Sci. USSR Ser. Tekhn. Nauk 13 (3) (in Russian).
VARGA, R.S. (1962), Matrix Iterative Analysis (Prentice-Hall, Englewood Cliffs, NJ).
VASILYEV, O.R., B.G. KUZNETSOV, Yu.M. LYTKIN and G.P. CHERNYKH (1974), The evolution of
a turbulent fluid domain in a stratified medium, Izv. Acad. Sci. USSR Mekh. Zhidk. Gasa 3 (in Russian).
VIECELLI, J. (1971), A computing method for incompressible flows bounded by moving walls, J. Comput.
Phys. 8, 119-143.
VOROBYEV, Yu.V. (1968), The stochastic iterative process in the alternating direction method, Zh. Vychisl.
Mat. i Mat. Fiz. 8 (3) (in Russian).
WACHSPRESS, E. (1962), Optimum alternating direction implicit iteration parameters for a model problem,
J. SIAM 10, 339-350.
WACHSPRESS, E. (1963), Extended application of alternating direction implicit iteration model theory, J.
SIAM 11, 991-1016.
WACHSPRESS, E. (1966), Iterative Solution of Elliptic Systems and Applications to the Neutron Diffusion
Equations of Reactor Physics (Prentice-Hall, Englewood Cliffs, NJ).
WACHSPRESS, E. and G. HABETLER (1960), An alternating direction implicit iteration technique, J. SIAM 8,
403-424.
WANG, G. (1984), The splitting schemes for solving the initial and boundary value problems of hyperbolic
partial differential equation, J. Nanjing Univ. Natur. Sci. Ed. 1, 21-38.
WARMING, R., P. KUTLER and H. LOMAX (1973), Second- and third-order non-centred difference schemes
for non-linear hyperbolic equations, AIAA J. 11, 189-196.
WASOW, W. and G. FORSYTHE (1963), Finite Difference Methods for Partial Differential Equations (in
Russian).
WIDLUND, O. (1966), On the rate of convergence of an alternating direction implicit method in a non-commutative case,
Math. Comp. 20.
WIDLUND, O. (1971), On the effects of scaling of the Peaceman-Rachford method, Math. Comp. 25.
WILKE, H. (1974), Zur Anwendung der SMAS-Methode bei der Behandlung hydrodynamischer
insbesondere grenzflachendynamischer Probleme, Chem. Tech. 26, 456-457.
YANENKO, N.N. (1959a), On the difference method for multidimensional heat conduction equation, Dokl.
Acad. Sci. USSR 125 (6) (in Russian).
YANENKO, N.N. (1959b), Simple implicit schemes for multidimensional problems, Dokl. Vsesoyuzn.
Soveshch. Vychisl. Mat: Vychisl. Tekhn., Moscow (in Russian).
YANENKO, N.N. (1960), On the economical implicit schemes (the method of fractional steps), Dokl. Acad.
Sci. USSR 134 (5) (in Russian).
YANENKO, N.N. (1961), On the implicit difference schemes for a multidimensional heat conduction
equation, Izv. Vuzov Mat. 4 (23) (in Russian).
YANENKO, N.N. (1962), On the convergence of the splitting method for a heat conduction equation with
varying coefficients, Zh. Vychisl. Mat. i Mat. Fiz. 2 (5) (in Russian).
YANENKO, N.N. (1964a), Some Aspects of the Theory of Convergence of Difference Schemes with Constant
and Varying Coefficients, Trudy IV Vsesoyuzn. Mat. S'ezda 2 (Nauka, Moscow) (in Russian).
YANENKO, N.N. (1964b), On the weak approximation to systems of differential equations, Sibirsk Mat.
Zh. 5 (6) (in Russian).
YANENKO, N.N. (1966), The Method of Fractional Steps for Multidimensional Problems in Mathematical
Physics (Lectures for students of NGU) (Novosibirsk Gos. Univ., Novosibirsk) (in Russian).
YANENKO, N.N. (1967), The Method of Fractional Steps for Multidimensional Problems in Mathematical
Physics (Nauka, Novosibirsk) (in Russian).
YANENKO, N.N. (1972), Modern numerical methods for problems in mechanics of continuous media, in:
Proceedings International Mathematics Congress (Nauka, Moscow) (in Russian).
YANENKO, N.N. (1973), Difference Methods for Problems in Mathematical Physics, Trudy MIAN SSSR 22
(Nauka, Moscow) (in Russian).
YANENKO, N.N., N.N. ANUCHINA, V.Ye. PETRENKO and Yu.I. SHOKIN (1970), On the methods for gas
dynamics problems with large deformations, in: Numerical Methods in Mechanics of Continuous Media
(Nauka, Novosibirsk) (in Russian).
YANENKO, N.N., N.N. ANUCHINA, V.Ye. PETRENKO and Yu.I. SHOKIN (1971), On the methods for
computation of the fluid flows with large deformations and the approximation viscosity of difference
schemes, in: Proceedings2nd International Colloquium on Gas Dynamics of Explosion and Reacting
Systems (in Russian).
YANENKO, N.N. and Yu.Ye. BOYARINTSEV (1961), On the convergence of the difference schemes for a heat
conduction equation with varying coefficients, Dokl. Acad. Sci. USSR 139 (6) (in Russian).
YANENKO, N.N. and G.V. DEMIDOV (1966), The method of weak approximation as the constructive
method for solving a Cauchy problem, in: Some Aspects of Numerical and Applied Mathematics (Nauka,
Novosibirsk) (in Russian).
YANENKO, N.N., V.A. SUCHKOV and Yu.Ya. POGODIN (1959), On the difference solution of the heat
conduction equation in curvilinear coordinates, Dokl. Acad. Sci. USSR 128 (5) (in Russian).
YANENKO, N.N. and .K. YAUSHEV (1966), On the absolutely stable scheme for integrating hydrothermo-
dynamic equations, in: Difference Methods for Problems in MathematicalPhysics, Trudy MIAN 74
(Nauka, Moscow) (in Russian).
YOUNG, D. (1971), Iterative Solution of Large Linear Systems (Academic Press, New York).
YOUNG, J.A, and C.W. HORT (1972), Numerical calculation of internal wave motions, J. Fluid Mech. 56,
265-276.
ZALESNY, V.B. (1984), Modelling of Large-Scale Motions in World Ocean (OVM Acad. Sci. USSR,
Moscow) (in Russian).
ZALESNY, V.B. (1985), Numerical model of ocean dynamics based on the splitting method, Sov. J. Numer.
Anal. Math. Modelling 1 (2), 1-22.
ZENG, Q. and X. ZHANG(1982), Perfectly energy-conservative time-space finite difference schemes and the
consistent split method to solve the dynamical equations of compressible fluid, Sci. Sinica Ser. B 25,
866-880.
Åke Björck
Department of Mathematics
Linköping University
S-581 83 Linköping
Sweden

CHAPTER II. Numerical Methods for Linear Least Squares Problems 485
CHAPTER IV. Some Modified and Generalized Least Squares Problems 569
REFERENCES 637
1. Introduction
min_x ||Ax - b||_2,   A ∈ R^{m×n},   b ∈ R^m,   (1.1)

where ||·||_2 denotes the Euclidean vector norm. We call this a linear least squares
problem and x a linear least squares solution of the system Ax = b.
One important source of least squares problems is linear statistical models. Here
one assumes that the m observations b = (b_1, b_2, ..., b_m)^T are related to n unknown
parameters x = (x_1, x_2, ..., x_n)^T by

Ax = b + e,

where e = (e_1, e_2, ..., e_m)^T and e_i, i = 1, ..., m, are random errors. Let us assume that
A has rank n and e has zero mean and covariance matrix σ²I (i.e. the e_i are uncorrelated
random variables with equal variance). Then GAUSS showed in Theoria Combinationis
[1823] that in case A has full rank the least squares estimate x has the smallest
variance in the class of estimation methods which fulfill the two conditions:
(1) no systematic errors (no bias) in the estimates,
(2) the estimates are linear functions of b.
Note that this property of least squares estimates does not depend on any assumed
distributional form of the error. For an account of the historical development of
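Problem (1.1) is easy to experiment with numerically. The following sketch (NumPy; the data are invented for illustration) fits a straight line in the least squares sense and checks that the residual is orthogonal to the columns of A, i.e. that A^T r = 0, a property derived in Section 2.

```python
import numpy as np

# Hypothetical data: m = 5 observations of a line y = x1 + x2 * t
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
A = np.column_stack([np.ones_like(t), t])    # design matrix, m x n
b = np.array([0.1, 1.1, 1.9, 3.2, 3.9])      # noisy observations

# Solve min ||Ax - b||_2 (1.1); lstsq uses an orthogonalization method
x, res, rank, sv = np.linalg.lstsq(A, b, rcond=None)

r = b - A @ x
print(np.allclose(A.T @ r, 0.0))             # residual orthogonal to R(A)
```

The routine names (`lstsq`) are NumPy's, not the chapter's; the chapter develops the underlying algorithms in Sections 6-9.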
Minimization in the 1-norm or ∞-norm is complicated by the fact that the function
f(x) = ||Ax - b||_p is not differentiable for these values of p. However, there are several
good computational methods available for these minimizations, see BARTELS, CONN
and SINCLAIR [1978] and BARTELS, CONN and CHARALAMBOUS [1978].

To illustrate the effect of using a different norm we consider the problem of
estimating the scalar β from m observations y ∈ R^m. This is equivalent to minimizing
||βA - y||_p with A = (1, 1, ..., 1)^T. It is easily verified that if y_1 ≥ y_2 ≥ ... ≥ y_m, then
the solutions for some different values of p are

p = 1:   β_1 = y_{(m+1)/2},   m odd,
p = 2:   β_2 = (y_1 + y_2 + ... + y_m)/m,
p = ∞:   β_∞ = ½(y_1 + y_m).

These estimates correspond to the median, mean, and midrange respectively. Note
that the estimate β_1 is insensitive to extreme values of y_i. This property carries over
to more general problems, and the choice p = 1 gives a more robust estimate than
p = 2. In some situations nonintegral values of p, 1 < p < 2, are of interest, see EKBLOM
[1973]. For a general treatment of robust statistical procedures see HUBER [1977].
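The three estimates above can be checked numerically. The sketch below (NumPy; the sample values are made up) computes them and verifies on a grid that the median indeed minimizes the 1-norm objective.

```python
import numpy as np

# Hypothetical sample with one extreme value; y1 >= y2 >= ... >= ym, m odd
y = np.array([10.0, 3.0, 2.0, 1.0, 0.0])

beta1 = np.median(y)                 # p = 1: minimizes sum |beta - y_i|
beta2 = np.mean(y)                   # p = 2: minimizes sum (beta - y_i)^2
betai = 0.5 * (y.max() + y.min())    # p = inf: minimizes max |beta - y_i|

# Check the p = 1 claim on a grid: the median attains the minimum
grid = np.linspace(-1.0, 11.0, 2401)
l1 = np.abs(grid[:, None] - y[None, :]).sum(axis=1)
best = grid[np.argmin(l1)]
print(beta1, beta2, betai, best)
```

Note how the extreme observation 10 pulls the mean and midrange up, while the median is unaffected, illustrating the robustness of p = 1.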
PROOF. Assume that x satisfies A^T r_x = 0, where r_x = b - Ax. Then for any y ∈ R^n,
r_y = b - Ay = r_x + A(x - y). Squaring this we obtain

||r_y||_2² = ||r_x||_2² + ||A(x - y)||_2² ≥ ||r_x||_2²,

since the cross term vanishes by A^T r_x = 0. Hence x minimizes ||b - Ay||_2. □

FIG. 2.1. (The figure shows the residual r = b - Ax orthogonal to the range R(A).)
From Theorem 2.1 it follows that a least squares solution satisfies the normal
equations
ATAx = ATb. (2.2)
The matrix ATA is symmetric and nonnegative-definite and the normal equations
are always consistent. Furthermore we have:
THEOREM 2.2. The matrix ATA is positive-definite if and only if the columns of A are
linearly independent.
PROOF. The columns of A are linearly independent if and only if x ≠ 0 implies Ax ≠ 0, and
therefore

x ≠ 0  ⟹  x^T A^T A x = ||Ax||_2² > 0.  □
REMARK 2.1. The normal equations (2.2) and the defining equations for the residual
r = b - Ax combine to form an augmented system of m + n equations

( I    A ) ( r )   ( b )
( A^T  0 ) ( x ) = ( 0 ).   (2.3)

The system matrix in (2.3) is square and symmetric but indefinite if A ≠ 0. The
augmented system (2.3) is used for iterative refinement of least squares solutions (see
Section 12) and in some methods for solving least squares problems where the
matrix A is sparse.
From Theorem 2.2 it follows that if rank(A)=n, then there is a unique least
squares solution, which can be written
x = (A^T A)^{-1} A^T b.   (2.4)

The corresponding residual is

r = b - Ax = (I - P_A) b,   P_A = A (A^T A)^{-1} A^T.   (2.5)

Here P_A is the orthogonal projector onto R(A), cf. Fig. 2.1.
If rank(A) < n, then the solution x to (2.1) is not unique. However, the solution to
the problem
min ||x||_2,   x ∈ X,

where X is the set of solutions of (2.1), is unique, cf. Theorem 4.1 below.
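A small numerical check of (2.4) and (2.5) follows (NumPy, with a random full-rank matrix; in practice one solves linear systems rather than forming inverses explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))      # assumed full-rank for illustration
b = rng.standard_normal(6)

# x = (A^T A)^{-1} A^T b (2.4), computed by solving the normal equations
x = np.linalg.solve(A.T @ A, A.T @ b)

# P_A = A (A^T A)^{-1} A^T (2.5): orthogonal projector onto R(A)
P = A @ np.linalg.solve(A.T @ A, A.T)
r = b - A @ x

# P is idempotent and symmetric, and r = (I - P_A) b
print(np.allclose(P @ P, P), np.allclose(P, P.T),
      np.allclose(r, (np.eye(6) - P) @ b))
```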
PROOF (after GOLUB and VAN LOAN ([1983], p. 17)). Let x ∈ R^n and y ∈ R^m be such that

||x||_2 = ||y||_2 = 1,   Ax = σy,   σ = ||A||_2.

Here ||A||_2 is the matrix norm subordinate to the vector norm ||·||_2, see GOLUB and
VAN LOAN ([1983], pp. 14-15), and the existence of such vectors x and y follows from
this definition of ||A||_2. Let

V = [x, V_1] ∈ R^{n×n},   U = [y, U_1] ∈ R^{m×m}

be orthogonal. (Recall that it is always possible to extend an orthogonal set of
vectors to an orthonormal basis for the whole space.) Since U_1^T Ax = σ U_1^T y = 0 it
follows that U^T A V has the following structure:

A_1 = U^T A V = ( σ   w^T )
                ( 0   B   ).

Since

|| A_1 ( σ ) ||   ≥ σ² + w^T w = (σ² + w^T w)^{1/2} || ( σ ) ||  ,
||     ( w ) ||_2                                     ( w ) ||_2

it follows that ||A_1||_2 ≥ (σ² + w^T w)^{1/2}. But since U and V are orthogonal, ||A_1||_2 =
||A||_2 = σ, and thus w = 0. An induction argument now completes the proof. □
REMARK 3.1. From (3.1) it follows that

A^T A = V Σ^T Σ V^T   and   A A^T = U Σ Σ^T U^T.

Thus σ_j², j = 1, ..., r, are the nonzero eigenvalues of the symmetric and positive-
semidefinite matrices A^T A and A A^T, and v_j and u_j are the corresponding
eigenvectors. Hence, in principle, the SVD can be reduced to the eigenvalue
decomposition for symmetric matrices. This is, however, not a suitable way to compute the
SVD. For a proof of the SVD using this relationship see STEWART ([1973], p. 319).
EXAMPLE 3.1. Let A = (a_1, a_2) have two columns of unit length making an acute
angle γ with each other. Then

A^T A = ( 1      cos γ )
        ( cos γ  1     ),

with eigenvalues 1 + cos γ and 1 - cos γ, and hence the singular values of A are

σ_1 = √2 cos(½γ),   σ_2 = √2 sin(½γ).

The eigenvectors of A^T A,

v_1 = (1/√2) ( 1 )    v_2 = (1/√2) (  1 )
             ( 1 ),                ( -1 ),

are the right singular vectors of A. The left singular vectors can be determined from
(3.3). If γ << 1, then σ_1 ≈ √2 and σ_2 ≈ γ/√2, and we get

u_1 ≈ (a_1 + a_2)/2,   u_2 ≈ (a_1 - a_2)/γ.

Now assume that γ is less than the square root of the floating point precision. Then
the computed values of the elements cos γ in A^T A will equal 1. Thus, the computed
matrix A^T A will be singular with eigenvalues 2 and 0, and it is not possible to retrieve
the small singular value σ_2 ≈ γ/√2. This illustrates that information may be lost in
computing A^T A unless sufficient precision is used.
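Example 3.1 can be reproduced in IEEE double precision (a NumPy sketch; γ = 10⁻⁹ is below the square root of the double precision unit roundoff, about 1.5·10⁻⁸):

```python
import numpy as np

gamma = 1e-9                  # angle below sqrt(unit roundoff) of double precision
a1 = np.array([1.0, 0.0])
a2 = np.array([np.cos(gamma), np.sin(gamma)])   # unit column at angle gamma
A = np.column_stack([a1, a2])

# Working directly with A, the SVD still recovers sigma_2 ~ gamma/sqrt(2)
s = np.linalg.svd(A, compute_uv=False)

# Forming A^T A loses the information: cos(gamma) rounds to exactly 1
G = A.T @ A
eigs = np.linalg.eigvalsh(G)          # ascending order
print(s[1], eigs)                     # s[1] tiny but nonzero; eigs near (0, 2)
```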
REMARK 3.2. The singular values of A are unique. The singular vector v_j, j ≤ r,
will be unique only when σ_j² is a simple eigenvalue of A^T A. For multiple singular
values, the corresponding singular vectors can be chosen as any orthonormal basis
for the unique subspace that they span. Once the singular vectors v_j, 1 ≤ j ≤ r, have
been chosen, the vectors u_j, 1 ≤ j ≤ r, are uniquely determined from

A v_j = σ_j u_j,   j = 1, ..., r.   (3.3)

Similarly, given u_j, 1 ≤ j ≤ r, the vectors v_j, 1 ≤ j ≤ r, are uniquely determined from

A^T u_j = σ_j v_j,   j = 1, ..., r.   (3.4)
A = Σ_{i=1}^r σ_i u_i v_i^T = U_r Σ_r V_r^T,   (3.5)

where

U_r = (u_1, ..., u_r),   V_r = (v_1, ..., v_r),   Σ_r = diag(σ_1, ..., σ_r).

This is sometimes called the full rank singular value decomposition. By (3.5) the
matrix A of rank r is decomposed into a sum of r matrices of rank one.
THEOREM 3.2. Let M_k denote the set of matrices in R^{m×n} of rank k. Assume that
A ∈ M_r and consider the problem of approximating A by a matrix B ∈ M_k, k < r.
Then the minimum of ||A - B||_2 is attained for

B = A_k = Σ_{i=1}^k σ_i u_i v_i^T,

and the minimum distance is ||A - A_k||_2 = σ_{k+1}.
REMARK 3.4. The theorem was originally proved for the Euclidean matrix norm

||X||_E = ( Σ_{i,j} x_ij² )^{1/2}

by ECKART and YOUNG [1936]. For this norm the minimum distance is

||A - B||_E = (σ²_{k+1} + ... + σ²_r)^{1/2}.
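The two minimum distances above can be verified numerically by forming the truncated SVD (a NumPy sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                   # rank of the approximation
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]     # A_k = sum_{i<=k} sigma_i u_i v_i^T

# Theorem 3.2 / Remark 3.4: minimum distances sigma_{k+1} (2-norm)
# and (sigma_{k+1}^2 + ... + sigma_r^2)^{1/2} (Euclidean norm)
err2 = np.linalg.norm(A - Ak, 2)
errF = np.linalg.norm(A - Ak, 'fro')
print(err2, s[k], errF, np.sqrt((s[k:] ** 2).sum()))
```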
Like the eigenvalues of a real symmetric matrix, the singular values of a general
matrix have a minmax characterization.
PROOF. The result is established in almost the same way as for the corresponding
eigenvalue theorem, the Courant-Fischer theorem, see WILKINSON ([1965], pp.
99-101). □
THEOREM 3.4. Let A, B ∈ R^{m×n} have singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_p and
τ_1 ≥ τ_2 ≥ ... ≥ τ_p, respectively, where p = min(m, n). Then

|σ_i - τ_i| ≤ ||A - B||_2,   i = 1, ..., p.
REMARK 3.5. This result shows that the singular values of a matrix A are insensitive
to perturbations of A, which is of great importance for the use of the SVD to
determine the "numerical rank" of a matrix (see Section 9). Perturbations of the
elements of a matrix produce perturbations of the same, or smaller, magnitude in the
singular values.
It can be proved that the eigenvalues of the leading principal minor of order n - 1
of a symmetric matrix A ∈ R^{n×n} separate the eigenvalues of A, see WILKINSON ([1965],
p. 103). A similar theorem holds for singular values.

THEOREM 3.5. Let A = (Ã, a_n) ∈ R^{m×n}, m ≥ n, a_n ∈ R^m. Then the ordered singular
values σ̃_i of Ã separate the ordered singular values σ_i of A as follows:

σ_1 ≥ σ̃_1 ≥ σ_2 ≥ ... ≥ σ_{n-1} ≥ σ̃_{n-1} ≥ σ_n.
The SVD gives complete information about the four fundamental subspaces
associated with A. It is easy to verify that
R(A) = span[u_1, ..., u_r],   N(A) = span[v_{r+1}, ..., v_n].   (3.7)

Since A^T = V Σ^T U^T, it follows that also

R(A^T) = span[v_1, ..., v_r],   N(A^T) = span[u_{r+1}, ..., u_m],   (3.8)

and we find the well-known relations

N(A)^⊥ = R(A^T),   R(A)^⊥ = N(A^T).
THEOREM 4.1. Consider the least squares problem

min_x ||b - Ax||_2.   (4.1)

Its minimum norm solution is given by

x = V ( Σ_r^{-1}  0 ) U^T b.   (4.2)
      ( 0         0 )

PROOF. Let

( z_1 )           ( c_1 )
( z_2 ) = V^T x,  ( c_2 ) = U^T b,

where z_1, c_1 ∈ R^r. Then

||b - Ax||_2 = ||U^T (b - A V V^T x)||_2 = || ( c_1 - Σ_r z_1 ) ||
                                              ( c_2           ) ||_2,

which is minimized by z_1 = Σ_r^{-1} c_1; the minimum norm solution also has z_2 = 0,
which gives (4.2). □
REMARK 4.1. It is easy to show that G = A† satisfies the following four conditions:

AGA = A,   (4.4a)
GAG = G,   (4.4b)
(AG)^T = AG,   (4.4c)
(GA)^T = GA.   (4.4d)

PENROSE [1955] has shown that A† is uniquely determined by these conditions. In
particular A† in (4.3) does not depend on the particular choice of U and V in the SVD.
Note that (A^T)† = (A†)^T but that in general (AB)† ≠ B† A†.

In case (i) we recognize the full rank least squares solution. Case (ii) gives the
minimum norm solution to an underdetermined system of full rank, i.e. solves the
problem

min ||x||_2,   S = {x: Ax = b}.
x ∈ S
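The Penrose conditions (4.4a)-(4.4d) are easy to check numerically for a pseudoinverse computed from the SVD (a NumPy sketch; the rank-deficient matrix is constructed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# A 5 x 4 matrix of rank 2, built as a product of rank-2 factors
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))

G = np.linalg.pinv(A)        # pseudoinverse A^{dagger}, computed via the SVD

# The four Penrose conditions (4.4a)-(4.4d)
print(np.allclose(A @ G @ A, A),
      np.allclose(G @ A @ G, G),
      np.allclose((A @ G).T, A @ G),
      np.allclose((G @ A).T, G @ A))
```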
P_{R(A)} = U_1 U_1^T,    P_{N(A^T)} = U_2 U_2^T,
P_{R(A^T)} = V_1 V_1^T,  P_{N(A)} = V_2 V_2^T,   (4.10)

where U and V are partitioned as

U = (U_1, U_2) ∈ R^{m×m},   V = (V_1, V_2) ∈ R^{n×n},   U_1 ∈ R^{m×r},   V_1 ∈ R^{n×r}.
In this section we give results on the sensitivity of pseudoinverses and least squares
solutions to perturbations in A and b. The most complete results for these problems
have been given by WEDIN [1973b] and here we mainly follow his exposition. For
a survey of perturbation theory for pseudoinverses see also STEWART [1977a]. In this
analysis the condition number of the matrix A ∈ R^{m×n} will play a significant role. The
following definition generalizes the condition number of a square nonsingular
matrix.

DEFINITION 5.1. Let A ∈ R^{m×n} have rank r and singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0.
Then the condition number of A is

κ(A) = ||A||_2 ||A†||_2 = σ_1/σ_r.
Consider, for example, the matrices

A = ( 1  0 )      B = ( 1  0 )
    ( 0  0 ),         ( 0  ε ),   ε > 0.

In this example

||B - A||_2 = ε,   ||B† - A†||_2 = 1/ε,

which is a special case of the following theorem.

On the other hand, if rank(B) = rank(A) and B - A is small, then B† is close to A†.
We first give an estimate of ||B†||_2. If rank(B) = rank(A) and

η = ||A†||_2 ||B - A||_2 < 1,

then

||B†||_2 ≤ ||A†||_2 / (1 - η).

If A and B are square nonsingular matrices, then this follows directly from the identity

B^{-1} - A^{-1} = -B^{-1} (B - A) A^{-1}.
We now consider the effect of perturbations of A and b upon the minimum norm
least squares solution x = A†b. The discussion again follows WEDIN [1973b] and
applies both to overdetermined and underdetermined systems. We denote the
perturbed A and b by
Ã = A + δA,   b̃ = b + δb,   (5.4)

and the perturbed solution x̃ = Ã†b̃, and put

ε_A = ||δA||_2 / ||A||_2,   ε_b = ||δb||_2 / ||b||_2.   (5.5)

To be able to prove any meaningful result we assume that rank(Ã) = rank(A) and
that the condition

η = ||A†||_2 ||δA||_2 = κ ε_A < 1   (5.6)

holds, where κ = κ(A) is the condition number of A.

We now decompose the error δx as follows:

δx = x̃ - x = Ã†(δb - δA x) + Ã†r - P_{N(Ã)} x,   (5.7)

since b̃ = Ãx - δA x + r + δb. The first term is bounded in norm by

||Ã†||_2 ( ||δA||_2 ||x||_2 + ||δb||_2 ).
Now since r ⊥ R(A) we have r ∈ N(A^T), and thus we can write the second term, using
(4.4) and (4.8), as

Ã†r = Ã† P_{R(Ã)} P_{N(A^T)} r.   (5.8)

Now by definition

||P_{R(Ã)} P_{N(A^T)}||_2 = sin θ_max(R(A), R(Ã)),

where θ_max is the largest principal angle between the two subspaces R(A) and R(Ã).
Similarly, for the third term, using x ∈ R(A^T), we can write

P_{N(Ã)} x = P_{N(Ã)} P_{R(A^T)} x.   (5.9)
THEOREM 5.4. Let Ã = A + δA and let X(·) denote any of the four fundamental
subspaces. Then if rank(Ã) = rank(A) and the assumption (5.6) is satisfied, then

sin θ_max(X(A), X(Ã)) ≤ η̄ = κ ε_A.   (5.10)

Using Theorem 5.4 to estimate (5.8) and (5.9) we arrive at the bounds given in the
following theorem.

THEOREM 5.5. Assume that rank(Ã) = rank(A) and that η̄ = κ ε_A < 1. Then
||δx||_2 ≤ (κ/(1 - η̄)) ( ε_A ||x||_2 + ε_b ||b||_2/||A||_2 + ε_A κ ||r||_2/||A||_2 ) + ε_A κ ||x||_2   (5.11)

and

||δr||_2 ≤ ε_A ||A||_2 ||x||_2 + ε_b ||b||_2 + ε_A κ ||r||_2.   (5.12)
PROOF. The estimate (5.11) follows from above, and (5.12) is proved using
a decomposition of δr similar to that of δx in (5.7). □

REMARK 5.1. The last term in (5.7), and therefore also in (5.11), vanishes if rank(A) = n,
since then N(A) = {0}. If the system is compatible, e.g. if rank(A) = m, then r = 0 and
the term involving κ² in (5.11) vanishes.

REMARK 5.2. If rank(A) = min(m, n), then the condition η̄ < 1 suffices to guarantee that
rank(A + δA) = rank(A).
REMARK 5.3. For the case rank(A) = n there are perturbations δA and δb such that
the estimates in Theorem 5.5 can almost be attained for an arbitrary matrix A and
vector b. This can be shown by considering first-order approximations of the terms
(see WEDIN [1973b]).
It should be stressed that in order for the perturbation analysis above to make
sense the matrix A and vector b should be scaled so that the class of perturbations
defined by (5.5) is relevant. Assume e.g. that the columns in A = (a_1, a_2, ..., a_n) have
widely differing norms. Then often a more relevant class of perturbations is

ã_j = a_j + δa_j,   ||δa_j||_2 ≤ ε d_j,   j = 1, 2, ..., n,

where d_j = ||a_j||_2. We could use (5.11) with the estimate ε_A ≤ ε √n, since

||δA||_2 ≤ ( Σ_{j=1}^n ε² d_j² )^{1/2} ≤ ε √n ||A||_2.

However, a much better estimate is often obtained by scaling the problem so that the
error bound is the same for all columns,

Ãx̃ = (A D^{-1})(D x) = Ax,

where D = diag(d_1, ..., d_n). Often κ(Ã) << κ(A), and we can then use the bound

ε_Ã ≤ ε √n

in (5.11) and (5.12). Note however that scaling the columns changes the norm in
which the error in x is measured.
Similarly, if the rows in A differ widely in norm, then (5.11) and (5.12) may
considerably overestimate the perturbations, cf. Section 14.
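The effect of column scaling on the condition number is easy to observe (a NumPy sketch; the matrix with wildly different column norms is hypothetical):

```python
import numpy as np

# Columns with widely differing norms
A = np.array([[1.0, 1e6],
              [1.0, 0.0],
              [0.0, 1e6]])

d = np.linalg.norm(A, axis=0)    # column norms d_j
At = A / d                       # A~ = A D^{-1}: unit-length columns

kappa = np.linalg.cond(A)
kappat = np.linalg.cond(At)
print(kappa, kappat)             # kappa(A~) is far smaller than kappa(A)
```

Here κ(Ã) is about √3, while κ(A) is of order 10⁶, so the perturbation bounds (5.11)-(5.12) applied to the scaled problem are much sharper.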
GOLUB and WILKINSON [1966] derived first-order perturbation bounds for the
least squares solution and were the first to note that a term proportional to κ²
occurs. In VAN DER SLUIS [1975] a geometrical explanation for this term is given,
and also lower bounds for the worst perturbation are derived. The following
example shows that the term proportional to κ² in (5.11) may indeed occur.

EXAMPLE 5.1 (VAN DER SLUIS [1975]). Consider the case n = 2 and let A = (a_1, a_2)
be the matrix in Example 3.1, where the angle γ is small (see Fig. 5.1). We now
take perturbations δa_1 and δa_2 of size ||δa_1||_2 = ||δa_2||_2 = ε, so that the plane
R(Ã) is rotated relative to R(A) about the dominant left singular vector u_1 (see Fig. 5.1).
The residual then changes by a vector of norm roughly 2ε||r||_2/γ in the direction of the
minimal left singular vector u_2, and since the corresponding change in Ax must
compensate for it,

||δx||_2 ≈ 2ε ||r||_2 / γ² = ½ ε ||r||_2 κ²,

which is what we wished to show. This example illustrates that the occurrence of κ²
is due to two coinciding events: rotation of the projection plane around a dominant
left singular vector produces a large change in r, and this change has the direction of
the minimal left singular vector.

FIG. 5.1.
CHAPTER II
which dates back to Gauss, is to form and solve the normal equations (cf. Section 2)

A^T A x = A^T b.   (6.2)

In this section we discuss numerical methods based on this approach. We assume
here that rank(A) = n, and defer treatment of rank deficient problems to later
sections. Then from Theorem 2.2 we know that the matrix A^T A is positive-definite
and that (6.2) has a unique solution.

The first step in the method of normal equations is to form the matrix and vector

M = A^T A ∈ R^{n×n},   d = A^T b ∈ R^n.   (6.3)

Note that since the cross-product matrix M is symmetric it is only necessary to
compute and store its upper triangular part. The relevant elements in M and d are
given by

m_jk = Σ_{i=1}^m a_ij a_ik,   d_j = Σ_{i=1}^m a_ij b_i.   (6.4)

Equivalently, if a_i^T denotes the ith row of A, we can write

M = Σ_{i=1}^m a_i a_i^T,   d = Σ_{i=1}^m b_i a_i.   (6.5)
REMARK 6.1. If m >> n, then the number of elements in the upper triangular part of M,
which is ½n(n + 1), is much smaller than mn, the number of elements in A. In this
case the formation of M and d can be viewed as a data compression.
REMARK 6.2. It is important to note that information in the data matrix A may be
lost when A^T A is formed unless sufficient precision is used. As an example consider

    ( 1  1  1 )
A = ( ε  0  0 )            ( 1+ε²  1     1    )
    ( 0  ε  0 ),   A^T A = ( 1     1+ε²  1    )
    ( 0  0  ε )            ( 1     1     1+ε² ).

Assume that ε = 10^{-3} and that six decimal digits are used for the elements of A^T A.
Then, since 1 + ε² = 1 + 10^{-6} is rounded to 1, we lose all information contained in
the last three rows of A.
REMARK 6.3. The number of multiplicative operations needed to form the matrix
A^T A is ½n(n + 1)m. In the following, to quantify the operation counts in matrix
algorithms we will use the concept of a flop (cf. GOLUB and VAN LOAN [1983], p. 32).
This is roughly the amount of work associated with the statement

s := s + a_ik · b_kj,

i.e. it comprises a floating-point add, a floating-point multiply and some sub-
scripting. Thus we will say that forming A^T A and A^T b requires ½n(n + 1)m + mn
flops, or approximately ½n²m flops if n >> 1. Note that for vector and parallel
computers this definition of a flop may not be adequate.
We now consider the solution of the symmetric positive-definite system (6.2). This
will be based on the following matrix decomposition.
THEOREM 6.1 (Cholesky decomposition). Let the matrix M ∈ R^{n×n} be symmetric and
positive-definite. Then there is a unique upper triangular matrix R with positive
diagonal elements such that

M = R^T R.   (6.6)
PROOF. The proof is by induction on the order n of M. The result is trivial for n = 1.
Assume that (6.6) holds for all positive-definite matrices of order n, and consider the
positive-definite matrix of order n + 1,

M̄ = ( M    m )           with the corresponding factor  R̄ = ( R  r )
     ( m^T  μ ),                                             ( 0  ρ ).   (6.7)

By the induction hypothesis the factorization M = R^T R exists, and thus (6.7) holds
provided r and ρ > 0 satisfy

R^T r = m,   ρ² = μ - r^T r.   (6.8)

Since R^T has positive diagonal elements and is lower triangular, r = R^{-T} m is uniquely
determined. Further, provided that μ - r^T r > 0, also ρ = (μ - r^T r)^{1/2} is uniquely
determined. Now, from the positive-definiteness of M̄ it follows that μ - r^T r > 0,
which completes the proof. □

The factor R can be computed column by column from

r_jj := ( m_jj - Σ_{k=1}^{j-1} r_kj² )^{1/2}.
The algorithm requires about ⅙n³ flops. Note that R is computed column by column
and that the elements r_ij can overwrite the elements m_ij. At any stage we have the
Cholesky factor of a leading principal submatrix of M. It is also possible to sequence the
algorithm so that R is computed row by row:

for i := 1, 2, ..., n
  for j := i + 1, ..., n
This version has the advantage of allowing pivoting, so that at each stage we search
for the maximum diagonal element r_ii. The different sequencing in these two
versions of the Cholesky decomposition is illustrated in Fig. 6.1.

FIG. 6.1. The figure shows the computed part of the Cholesky factor R in the ith step of the column-wise
and row-wise algorithms.
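The column-wise version can be sketched as follows (a Python rendering of the algorithm described above, without pivoting; M is assumed symmetric positive-definite):

```python
import numpy as np

def cholesky_upper(M):
    """Column-wise Cholesky factorization M = R^T R, R upper triangular.

    A sketch of the algorithm described in the text; no pivoting, and M
    is assumed symmetric positive-definite.
    """
    n = M.shape[0]
    R = np.zeros_like(M, dtype=float)
    for j in range(n):                  # compute column j of R
        for i in range(j):
            R[i, j] = (M[i, j] - R[:i, i] @ R[:i, j]) / R[i, i]
        d = M[j, j] - R[:j, j] @ R[:j, j]
        if d <= 0.0:                    # cf. the failure mode in (6.13)
            raise ValueError("matrix is not positive-definite")
        R[j, j] = np.sqrt(d)
    return R

A = np.array([[2.0, 1.0], [1.0, 3.0], [1.0, 1.0]])
M = A.T @ A
R = cholesky_upper(M)
print(np.allclose(R.T @ R, M))
```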
REMARK 6.4. We have here followed STEWART ([1973], pp. 83-93) and expressed
our algorithm in a programming-like language, which is precise enough to express
important algorithmic concepts, but permits suppression of unimportant details.
The notations should be self-explanatory (see also GOLUB and VAN LOAN [1983], pp.
30-32).
When the Cholesky factor R of the matrix A^T A has been computed, the least
squares solution can be obtained by solving the two triangular systems of equations

R^T z = d,   R x = z.   (6.9)

The solution of (6.9) requires a forward and a backward substitution, which takes
about 2 · ½n² = n² flops. The total work required to solve (6.1) by the method of
normal equations is

½mn² + ⅙n³ + O(mn) flops.
The Cholesky decomposition can be used in a slightly different way to solve the least
squares problem.
Let R̄ denote the Cholesky factor of the augmented matrix

Ā^T Ā = ( A^T A   A^T b )
        ( b^T A   b^T b ),   Ā = (A, b),

partitioned as

R̄ = ( R  z )
    ( 0  ρ ).   (6.10)

Then the least squares solution and the residual norm are obtained from

R x = z,   ||b - Ax||_2 = ρ.   (6.11)
PROOF. By equating Ā^T Ā and R̄^T R̄ it follows that R is the Cholesky factor of A^T A
and that

R^T z = d,   b^T b = z^T z + ρ².

From (6.9) it follows that x satisfies the first equation in (6.11). Further, since
r = b - Ax is orthogonal to Ax,

||Ax||_2² = (r + Ax)^T Ax = b^T Ax = b^T A R^{-1} R^{-T} A^T b = z^T z,

and hence

||r||_2² = b^T b - z^T z = ρ². □
We remark that working with the augmented matrix (6.10) often also simplifies
other methods for solving least squares problems.
We now turn to a discussion of the accuracy of the computed least squares
solution. Rounding errors will arise because the computer can only represent
a subset of the real numbers. The elements of this set are referred to as floating-point
numbers. To derive error estimates of matrix computations a model of floating-
point arithmetic is used, see e.g. FORSYTHE, MALCOLM and MOLER ([1977],
pp. 10-29). We write the stored result of any calculation C as fl(C). Error estimates
can be expressed in terms of the unit roundoff, u, which is defined as the smallest
floating-point number such that

fl(1 + u) > 1.   (6.12)

A thorough and elementary presentation of error analysis is given by WILKINSON
[1963]. For a short introduction see GOLUB and VAN LOAN ([1983], Section 3.2).
A detailed error analysis for the Cholesky factorization is carried out by
WILKINSON [1968]. We state the main result below.
The fact that M is positive-definite makes the Cholesky algorithm stable, since the
elements cannot grow in the reduction. However, with M = A^T A we have
κ(M) = κ²(A). Hence (6.13) shows that the Cholesky algorithm may fail, i.e. square
roots of negative numbers may arise, already when κ(A) is of the order of u^{-1/2}.
REMARK 6.6. It is important to note that a scaling argument shows that in (6.13) it is
the minimum condition number under a diagonal scaling,

κ' = min_D κ(A D),   D > 0 diagonal matrix,   (6.15)

which is relevant. VAN DER SLUIS [1969] has shown that if D_1 is chosen so that the
columns in A D_1 have unit length, then

κ(A D_1) ≤ √n κ'.
To assess the errors in the least squares solution x̄ computed by the method of
normal equations we make the simplifying assumption that no rounding errors
occur during the formation of A^T A and A^T b. We also assume that the errors in
solving the triangular systems (6.9) can be neglected. If inner products are
accumulated in double precision this is a reasonable assumption. (For an analysis of
rounding errors in the solution of triangular systems see FORSYTHE and MOLER
([1967], pp. 104-105).) Then it follows that x̄ satisfies

(A^T A + E) x̄ = A^T b,

where a bound for ||E||_2 is given by (6.14). Then a perturbation analysis similar to
that in Theorem 5.5 shows that

||x - x̄||_2 / ||x||_2 ≤ 2.5 n^{3/2} u κ²(A).   (6.17)
The accuracy of the computed normal equation solution thus depends on the square
of the condition number of A. In view of the perturbation result in Theorem 5.5 this
is not consistent with the mathematical sensitivity of small residual problems. We
conclude that the normal equations approach can introduce errors greater than
what would be expected of a stable algorithm. This is true in particular when the
rows of A have widely different norms, see Section 4. In the next several sections we
review methods for solving least squares problems based on orthogonalization.
These methods work directly with A and can be proved to have very satisfactory
stability properties.
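The breakdown predicted above can be provoked deliberately (a NumPy sketch using a Läuchli-type matrix; here κ(A) ≈ 10⁸ exceeds u^{-1/2}, so the cross-product matrix is exactly singular in double precision while an orthogonalization-based solver still succeeds):

```python
import numpy as np

eps = 1e-8                                  # kappa(A) ~ 1/eps > u^{-1/2}
A = np.array([[1.0, 1.0], [eps, 0.0], [0.0, eps]])
b = np.array([2.0, eps, eps])               # exact solution x = (1, 1)

M = A.T @ A          # fl(1 + eps^2) = 1: M is exactly singular in doubles
try:
    x_ne = np.linalg.solve(M, A.T @ b)      # normal equations
except np.linalg.LinAlgError:
    x_ne = None                             # the method breaks down

x_qr, *_ = np.linalg.lstsq(A, b, rcond=None)   # SVD/QR-based solver
print(x_ne, x_qr)
```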
is equivalent to (6.1). We now show how to choose Q so that problem (7.1) becomes
simple to solve.
THEOREM 7.1 (QR decomposition). Let A ∈ R^{m×n}, m ≥ n. Then there is an orthogonal
matrix Q ∈ R^{m×m} such that

Q^T A = ( R )
        ( 0 ),   (7.2)

where R ∈ R^{n×n} is upper triangular.

PROOF. The proof is by induction on m. (Note that n ≤ m.) For m = 1, A is a scalar
and we can take Q = ±1 according as A is nonnegative or negative. Now let m > 1,
and let A be partitioned in the form

A = (a_1, A_2),   a_1 ∈ R^m.

Put α = ||a_1||_2 and let

y = a_1/α  if a_1 ≠ 0,   y = e_1  if a_1 = 0.

(Here e_1 denotes the first column in the unit matrix I.) Let U = (y, U_1) be orthogonal.
Since U_1^T y = 0 it follows that

U^T A = ( α  r^T )
        ( 0  B   ),

where B ∈ R^{(m-1)×(n-1)}, and the proof is completed by applying the induction
hypothesis to B. □
REMARK 7.1. The proof of Theorem 7.1 gives a way to compute Q and R, provided
we can construct an orthogonal matrix U = (y, U1 ) given its first column. There are
several ways to perform this construction using elementary orthogonal transforma-
tions, see Sections 8 and 9.
THEOREM 7.2. Let A ∈ R^{m×n} have rank n. Then the R factor of A has positive diagonal
elements and equals the uniquely determined Cholesky factor of A^T A.

PROOF. If rank(A) = n, then by Theorem 6.1 the Cholesky factor of A^T A is unique.
Now from (7.2) it follows that

A^T A = (R^T, 0) Q^T Q ( R ) = R^T R.  □
                       ( 0 )

Hence, we can express Q_1 uniquely in terms of A and R. The matrix Q_2, however, will
not in general be uniquely determined.

From (7.4) it follows that

R(A) = R(Q_1),   R(A)^⊥ = R(Q_2),   (7.5)

which shows that the columns of Q_1 and Q_2 form orthonormal bases for R(A) and its
orthogonal complement. It follows that the corresponding orthogonal projections are

P_{R(A)} = Q_1 Q_1^T,   P_{R(A)^⊥} = Q_2 Q_2^T.
THEOREM 7.3. Let A ∈ R^{m×n}, m ≥ n, and b ∈ R^m be given. Assume that rank(A) = n and
that an orthogonal matrix Q has been computed such that (7.2) holds, and let

Q^T b = ( c_1 )
        ( c_2 ),   c_1 ∈ R^n.

Then the least squares solution x and the corresponding residual r = b - Ax satisfy

R x = c_1,   ||r||_2 = ||c_2||_2.   (7.8)
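Theorem 7.3 translates directly into a solver (a NumPy sketch; `np.linalg.qr` returns the thin factorization A = Q_1 R):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 3))
b = rng.standard_normal(7)

# Thin QR: A = Q1 R with Q1 (m x n), R (n x n) upper triangular
Q1, R = np.linalg.qr(A)

# (7.8): solve R x = c1 with c1 = Q1^T b by back substitution
x = np.linalg.solve(R, Q1.T @ b)

r = b - A @ x
print(np.allclose(A.T @ r, 0.0))     # residual orthogonal to R(A)
```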
REMARK 7.3. Let A = Q_1 R, where A ∈ R^{m×n} has rank n and Q is orthogonal. Let E be
a perturbation of A which satisfies

κ(A) ||E||_2 / ||A||_2 < 1.

Then rank(A + E) = n, so that A + E has a unique QR decomposition

A + E = (Q + W)(R + F).

How large can the perturbations W and F be? This problem is analyzed by STEWART
[1977b]. The main result is that ||W||_2 and ||F||_2 are bounded by terms of order
κ(A) ||E||_2 / ||A||_2.
Here rank(A) = 1 < 2 = n. Note that the columns of Q no longer provide orthogonal
bases for R(A) and its complement. Therefore the QR decomposition is not very
useful for rank deficient matrices and has to be modified.
THEOREM 7.4. Given A ∈ R^{m×n} with rank(A) = r ≤ min(m, n), there is a permutation
matrix Π and an orthogonal matrix Q ∈ R^{m×m} such that

Q^T A Π = ( R_11  R_12 )
          ( 0     0    ),   (7.10)

where R_11 ∈ R^{r×r} is upper triangular with positive diagonal elements.

PROOF. Since rank(A) = r, we can always choose a permutation matrix Π such that

A Π = (A_1, A_2),

where A_1 ∈ R^{m×r} has linearly independent columns. Let

Q^T A_1 = ( R_11 )
          ( 0    )

be the QR decomposition of A_1, and write

Q^T A Π = ( R_11  R_12 )
          ( 0     R_22 ).

Since

rank(Q^T A Π) = rank(A) = r,

it follows that R_22 = 0, since otherwise Q^T A Π would have more than r linearly
independent rows. Hence the decomposition must have the form (7.10). □
REMARK 7.4. There may be several ways to choose the permutation matrix Π. An
algorithm for determining a suitable Π will be described in Section 11. When Π has
been chosen, Q_1, R_11, and R_12 are uniquely determined.
From (7.10) it follows that

A Π ( -R_11^{-1} R_12 ) = 0.
    (  I_{n-r}        )

Hence, a dimensional argument shows that we have obtained an explicit basis for the
nullspace of A:

N(A) = R( Π ( -R_11^{-1} R_12 ) ).
            (  I_{n-r}        )    (7.11)
Using the decomposition (7.10) the linear least squares problem (7.1) becomes

min_y || ( R_11  R_12 ) y - ( c_1 ) ||
      || ( 0     0    )     ( c_2 ) ||_2,   (7.12)

where y = Π^T x and Q^T b = (c_1; c_2). The general solution can be written

x = Π ( R_11^{-1}(c_1 - R_12 y_2) )
      ( y_2                       ),   (7.13)

where y_2 is arbitrary. Setting y_2 = 0 gives a solution which has at most
r = rank(A) nonzero components. Any such solution, where Ax only involves at
most r columns of A, is called a basic solution. In several applications it is important
to compute such a basic solution. One example is the case when the columns of A
represent factors in a linear model and one wants to fit a vector of observations b
using as few factors as possible.
The QR decomposition (7.10) can also be used to compute the minimal norm
solution. If we let z = y_2, then by (7.13) minimizing ||x||_2 is equivalent to the linear
least squares problem

min_z || ( R_11^{-1} R_12 ) z - ( R_11^{-1} c_1 ) ||
      || ( -I_{n-r}       )     ( 0             ) ||_2.   (7.15)

The matrix in (7.15) always has full rank, and (7.15) can be solved by computing its
QR decomposition.
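A basic solution via a column-pivoted QR decomposition can be sketched as follows (using SciPy's `scipy.linalg.qr` with `pivoting=True`, which is one way to realize the permutation Π of Theorem 7.4; the rank-deficient data are constructed for illustration):

```python
import numpy as np
from scipy.linalg import qr

# Rank-deficient example: the third column is the sum of the first two
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0]])
b = np.array([1.0, 2.0, 3.0, 5.0])      # consistent: b = A (1, 2, 0)^T
r = 2                                   # rank(A)

Q, R, piv = qr(A, pivoting=True)        # A[:, piv] = Q R
c = Q.T @ b

# Basic solution, (7.13) with y2 = 0: uses only r columns of A
y = np.zeros(3)
y[:r] = np.linalg.solve(R[:r, :r], c[:r])
x_basic = np.zeros(3)
x_basic[piv] = y                        # undo the permutation

x_min = np.linalg.pinv(A) @ b           # minimum norm solution, for comparison
print(np.linalg.norm(A @ x_basic - b), np.linalg.norm(x_basic), np.linalg.norm(x_min))
```

As expected, the basic solution has at most r nonzero components, while the minimum norm solution generally uses all columns but has smaller norm.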
REMARK 7.6. To minimize ||x||_2 is not always a good way to resolve rank deficiency,
and the following more general approach is therefore often useful. For a given matrix
B ∈ R^{p×n} consider the problem

min ||B x||_2,   S = {x: ||Ax - b||_2 = min}.   (7.16)
x ∈ S

Substituting the general solution (7.13) with y_2 = z we find that (7.16) is equivalent to

min_z || B Π ( R_11^{-1} R_12 ) z - B Π ( R_11^{-1} c_1 ) ||
      ||     ( -I_{n-r}       )         ( 0             ) ||_2.   (7.18)
THEOREM 7.5. Given A ∈ R^{m×n} with rank(A) = r, there are orthogonal matrices
Q ∈ R^{m×m} and W ∈ R^{n×n} such that

Q^T A W = ( T  0 )
          ( 0  0 ),   (7.19)

where T ∈ R^{r×r} is triangular with positive diagonal elements. The decomposition (7.19)
is called a "complete orthogonal decomposition" of A.

PROOF. Since (R_11, R_12)^T ∈ R^{n×r} has rank r, by Theorem 7.1 there exists an
orthogonal matrix V such that

V^T ( R_11^T ) = ( S^T )
    ( R_12^T )   ( 0   ).

Here S is lower triangular. We can get the form (7.19) with T upper triangular by
pre- and postmultiplying S with permutation matrices reversing the order of its
rows and columns. □

From (7.19) we obtain

A = (Q_1, Q_2) ( T  0 ) W^T,   A† = W ( T^{-1}  0 ) Q^T.   (7.21)
               ( 0  0 )               ( 0       0 )
SECTION 7 Linear least squares problems 497
THEOREM 7.6 (Bidiagonal decomposition). Let A ∈ R^{m×n}, m ≥ n. Then there are
orthogonal matrices Q ∈ R^{m×m} and P ∈ R^{n×n} such that Q^T A P has upper
bidiagonal form.   (7.22)

PROOF. The proof is similar to the proof of Theorem 7.1 and is by induction on m.
For m = 1, we can take Q = ±1 and P = 1. For m > 1, we let A = (a_1, A_2), and let
U = (y, U_1) be constructed as in the proof of Theorem 7.1 so that

U^T A = ( α  r^T )
        ( 0  B̃  ),   α ≥ 0.

Let V = (z, V_1) be orthogonal, where

z = r/||r||_2,   r ≠ 0,
z = e_1,         r = 0.

Since r^T V_1 = 0 it follows that

U^T A V̄ = ( α  β  0 )
          ( 0  C    ),

where

V̄ = diag(1, V),   β = ||r||_2,   C = B̃ V ∈ R^{(m-1)×(n-1)}.

By the induction hypothesis there now exist orthogonal matrices Q̃ and P̃ which
reduce C to bidiagonal form. Then (7.22) holds if we take

Q = U ( 1  0 )     P = V̄ ( 1  0 )
      ( 0  Q̃ ),          ( 0  P̃ ).  □
It would seem from (7.21) that for computational purposes the complete
orthogonal factorization (7.19) is as useful as the SVD. To some extent this is true.
However, the SVD provides a more satisfactory way of determining the "numerical
rank" of A. This is in general a difficult question.
In the next sections we describe several different methods to compute the QR
decomposition and discuss their numerical properties.
A Householder transformation is a matrix of the form

P = I - β u u^T,   β = 2/(u^T u),   u ≠ 0,

and hence P^T = P and P² = I. The product of P with a given vector can easily be found
without explicitly forming P itself, since

P a = a - β (u^T a) u.   (8.4)

The effect of the transformation is to reflect the vector a in the hyperplane with
normal vector u, see Fig. 8.1. Therefore, P is also called a Householder reflector.
Note that P u = -u, so that P reverses u, and that P a ∈ span{a, u}.
We now show how to choose u to find a Householder transformation P which
solves the standard task. From Fig. 8.1 it follows immediately that taking

u = a + σ e_1,   σ = ||a||_2,   (8.5)

gives P a = -σ e_1.

FIG. 8.1.
Thus P need not be formed explicitly, and the product P A can be computed in 2mn flops.
Writing the product as

P A = (I - β u u^T) A = A - β u (A^T u)^T

shows that A is altered by a matrix of rank one when premultiplied by
a Householder transformation. An analogous algorithm exists for postmultiplying
A by a Householder matrix.
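Formulas (8.4)-(8.5) can be sketched in a few lines (NumPy; the sign of σ is chosen as in common practice to avoid cancellation in the first component, a detail not spelled out in (8.5)):

```python
import numpy as np

def householder_vector(a):
    """Return u and beta with (I - beta u u^T) a = -(sign a_1) sigma e_1.

    A sketch of (8.5), with the sign of sigma chosen to avoid cancellation.
    """
    sigma = np.linalg.norm(a)
    u = a.astype(float).copy()
    u[0] += np.sign(a[0]) * sigma if a[0] != 0 else sigma
    beta = 2.0 / (u @ u)
    return u, beta

a = np.array([3.0, 4.0, 0.0])
u, beta = householder_vector(a)
Pa = a - beta * (u @ a) * u        # apply P without forming it, cf. (8.4)
print(Pa)                          # approximately (-5, 0, 0)
```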
We next consider orthogonal matrices representing plane rotations. These are
also called Givens rotations after W. GIVENS [1958]. In two dimensions the matrix
representing a rotation clockwise through an angle θ is

( c   s )
( -s  c ),   c = cos θ,   s = sin θ.

In m dimensions the rotation R_ij(θ) in the plane of coordinates i and j equals the
unit matrix, except that its elements in rows and columns i and j are

r_ii = c,   r_ij = s,   r_ji = -s,   r_jj = c.   (8.8)
We now consider the premultiplication of a vector a = (α_1, ..., α_m)^T with R_ij(θ). We
have R_ij(θ) a = b = (β_1, ..., β_m)^T, where β_k = α_k, k ≠ i, j, and

β_i = c α_i + s α_j,   β_j = -s α_i + c α_j.   (8.10)

Thus a plane rotation may be multiplied into a vector at a cost of two additions and
four multiplications. If in (8.10) we set

c = α_i/σ,   s = α_j/σ,   σ = (α_i² + α_j²)^{1/2},   (8.11)

then β_i = σ and β_j = 0, i.e. we have introduced a zero in the jth component of the
vector.
Premultiplication of a matrix A ∈ R^{m×n} with a Givens rotation R_ij will only affect
the two rows i and j in A, which are transformed according to

a_ik := c a_ik + s a_jk,   a_jk := -s a_ik + c a_jk,   k = 1, 2, ..., n.

The product requires 4n flops. An analogous algorithm, which will only affect
columns i and j, exists for postmultiplying A with R_ij.
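A minimal sketch of (8.10)-(8.11) (NumPy; the sample vector is arbitrary):

```python
import numpy as np

def givens(ai, aj):
    """c, s of (8.11): the rotation zeros the j-th component of (ai, aj)."""
    sigma = np.hypot(ai, aj)            # (ai^2 + aj^2)^(1/2), overflow-safe
    if sigma == 0.0:
        return 1.0, 0.0
    return ai / sigma, aj / sigma

a = np.array([1.0, 2.0, 2.0])
c, s = givens(a[0], a[2])               # rotation in the (1, 3) plane
b = a.copy()
b[0] = c * a[0] + s * a[2]              # (8.10)
b[2] = -s * a[0] + c * a[2]
print(b)                                # third component is now zero
```

Only the two components in the rotation plane change, and the Euclidean norm of the vector is preserved.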
Givens rotations can be used in several different ways to solve the standard task
(8.2). We can let R_1k, k = 2, ..., n, be a sequence of Givens rotations, where R_1k is
determined to zero the kth component in the vector a:

R_1n ⋯ R_13 R_12 a = σ e_1.

Note that, since R_1k only involves the components 1 and k, previously introduced
zeros will not be destroyed. Another solution is to use the sequence R_{k-1,k}, k = n,
n - 1, ..., 2, where R_{k-1,k} is chosen to zero the kth component. This demonstrates
the greater flexibility of Givens rotations compared to Householder reflections.
It is essential to note that the matrix Rij need never be explicitly formed, but can be
represented by the numbers c and s. When a large number of rotations need to be
stored it is more economical to store just a single number. This can be done in
a numerically stable way using a scheme devised by STEWART [1976]. The idea is to
save c or s, whichever is smaller. To distinguish between the two cases one stores the
reciprocal of c and treats c = 0 as a special case. Thus for the matrix (8.8) we define the
scalar

ρ = { 1,          if c = 0,
    { sign(c)·s,  if |s| < |c|, (8.12)
    { sign(s)/c,  if |c| ≤ |s|.

Given ρ, the numbers c and s can be retrieved up to a common factor ±1 as follows:

if ρ = 1:    c = 0,    s = 1;
if |ρ| < 1:  s = ρ,    c = (1 − s^2)^{1/2};
if |ρ| > 1:  c = 1/ρ,  s = (1 − c^2)^{1/2}.

The reason for using this scheme is that the formula (1 − x^2)^{1/2} gives poor accuracy
when x is close to unity.
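Stewart's scheme (8.12) and its inverse are easily expressed in code. The following sketch (ours, for illustration) encodes a rotation into one scalar and recovers it up to a common factor ±1:

```python
import math

def encode_rotation(c, s):
    """Compress (c, s) with c^2 + s^2 = 1 into one scalar rho, as in (8.12)."""
    if c == 0.0:
        return 1.0
    if abs(s) < abs(c):
        return math.copysign(1.0, c) * s   # rho = sign(c)*s, |rho| < 1
    return math.copysign(1.0, s) / c       # rho = sign(s)/c, |rho| >= 1

def decode_rotation(rho):
    """Recover (c, s) up to a common factor +/-1; sqrt(1 - x^2) is only
    evaluated for the smaller of |c|, |s|, which preserves accuracy."""
    if rho == 1.0:
        return 0.0, 1.0
    if abs(rho) < 1.0:
        s = rho
        return math.sqrt(1.0 - s * s), s
    c = 1.0 / rho
    return c, math.sqrt(1.0 - c * c)
```

A round trip encode/decode reproduces (c, s) up to the common sign ambiguity, which is harmless when the rotations are reapplied consistently.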
We now describe how the QR decomposition of a matrix A ∈ R^{m×n} of rank n can
be computed using a sequence of Householder or Givens transformations. Starting
with A^(1) = A we compute the sequence of matrices A^(k), k = 2, ..., n+1. The matrix
A^(k) is already triangular in its first (k − 1) columns, i.e. it has the form

A^(k) = ( R_11^(k)  R_12^(k)
          0         A_22^(k) ),   A^(k+1) = P_k A^(k), (8.13)

where the orthogonal matrix P_k is chosen to zero the first column in the submatrix
A_22^(k). If we let

A_22^(k) = (a_k^(k), ..., a_n^(k)),

then A^(n+1) = (R over 0) is upper trapezoidal, and hence

Q = (P_n ⋯ P_2 P_1)^T = P_1 P_2 ⋯ P_n.
The key construction is to find an orthogonal matrix Pk which satisfies (8.14).
However, this is just the standard task (8.1) with a = a_k^(k). Using (8.5) and (8.6) to
construct Pk as a Householder matrix we get the following algorithm.
ALGORITHM 8.1 (QR decomposition by Householder matrices).

for k = 1, 2, ..., n
    σ_k := ||a_k^(k)||_2;  u^(k) := a_k^(k) + sign(a_kk^(k)) σ_k e_k;
    γ_k := σ_k(σ_k + |a_kk^(k)|);
    r_kk := −sign(a_kk^(k)) σ_k;
    for j = k+1, ..., n
        β_jk := (u^(k))^T a_j^(k) / γ_k;
        a_j^(k+1) := a_j^(k) − β_jk u^(k);
REMARK 8.1. Note that the vectors u^(k), k = 1, 2, ..., n, can overwrite the elements on
and below the main diagonal of A. Thus, all information associated with the factors
Q and R can be kept in A and two extra vectors of length n for (r_11, ..., r_nn) and
(γ_1, ..., γ_n).
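Algorithm 8.1 can be sketched as follows (illustrative numpy code, ours; for clarity the vectors u^(k) are returned in a list instead of overwriting A as in Remark 8.1):

```python
import numpy as np

def householder_qr(A):
    """QR decomposition in the pattern of Algorithm 8.1.
    Returns the triangular factor R and the Householder vectors u^(k)."""
    A = A.astype(float).copy()
    m, n = A.shape
    us = []
    for k in range(n):
        x = A[k:, k]
        sigma = np.linalg.norm(x)
        u = x.copy()
        if sigma > 0.0:
            u[0] += np.copysign(sigma, x[0])     # u = a_k + sign(a_kk)*sigma*e_k
            gamma = sigma * (sigma + abs(x[0]))  # gamma_k = ||u||^2 / 2
            A[k:, k:] -= np.outer(u, (u @ A[k:, k:]) / gamma)
        us.append(u)
    return np.triu(A), us

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
R, us = householder_qr(A)
# R is upper triangular and A = QR for the orthogonal Q built from the u^(k)
```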
REMARK 8.2. Let R̄ denote the computed R. It can be shown that there exists an
exactly orthogonal matrix Q̂ (not the computed Q) such that

A + E = Q̂R̄,  ||E||_F ≤ c_1 u ||A||_F, (8.15)

where the error constant c_1 = c_1(m, n) is a polynomial in m and n, and ||·||_F denotes
the Frobenius norm. GOLUB and WILKINSON [1966] show that c_1 = 12.5n^{3/2} if inner
products are accumulated in double precision.
Normally it is more economical to keep Q in factored form and access it through
β_jk and u^(k), k = 1, 2, ..., n, than to compute Q explicitly. If Q is explicitly required it
can be accumulated by taking Q^(1) = I and computing Q = Q^(n+1) by

Q^(k+1) = P_k Q^(k),  k = 1, 2, ..., n.

This accumulation requires 2(m^2 n − mn^2 + ⅓n^3) flops, if we take advantage of the
property that P_k = diag(I_{k−1}, P̃_k). Similarly we can accumulate the first n columns
of Q,

Q_1 = Q ( I_n
          0   ).
An algorithm for solving the linear least squares problem based on the QR
decomposition and using Householder transformations was first given by GOLUB
[1965].
ALGORITHM 8.2 (Linear least squares solution by Golub's method). Given A ∈ R^{m×n}
with rank(A) = n and b ∈ R^m, compute R and P_1, P_2, ..., P_n by Algorithm 8.1. Form
the vector c by

c = P_n ⋯ P_2 P_1 b = ( c_1
                        c_2 ). (8.16)

Then the solution x is obtained from the triangular system

Rx = c_1, (8.17)

and the residual is

r = b − Ax = Q ( 0
                 c_2 ). (8.18)
REMARK 8.3. To compute c by (8.16) requires only (2mn − n^2) flops, and thus each
additional right-hand side takes only (2mn − n^2) flops.
The numerical properties of Golub's method for solving the least squares problem
are very good. The computed solution x can be shown to be the exact solution of
a slightly perturbed least squares problem
min_x ||(A + δA)x − (b + δb)||_2.
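Golub's method (Algorithm 8.2) can be sketched as follows (an illustrative sketch, ours; for brevity a library QR factorization stands in for Algorithm 8.1):

```python
import numpy as np

def lstsq_by_qr(A, b):
    """Least squares in the spirit of Algorithm 8.2: c = Q^T b = (c_1; c_2),
    solve R x = c_1, and recover the residual r = Q (0; c_2), cf. (8.16)-(8.18).
    np.linalg.qr stands in for the Householder factorization."""
    m, n = A.shape
    Q, R = np.linalg.qr(A, mode='complete')       # Q: m x m, R: m x n
    c = Q.T @ b
    x = np.linalg.solve(R[:n, :n], c[:n])         # R x = c_1
    r = Q @ np.concatenate([np.zeros(n), c[n:]])  # r = b - A x
    return x, r

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])
x, r = lstsq_by_qr(A, b)
# x = (0.5, 0.5); r is orthogonal to the columns of A
```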
ALGORITHM 8.3 (QR decomposition by Givens rotations). Given A ∈ R^{m×n}, the
following algorithm uses plane rotations to compute the decomposition

Q^T A = ( R
          0 ).

for k = 1, 2, ..., n
    for j = k+1, ..., m
        σ := (a_kk^2 + a_jk^2)^{1/2};  c := a_kk/σ;  s := a_jk/σ;
        A := R_kj(θ) A;

The algorithm requires 2(mn^2 − ⅓n^3) flops.
504 A. Björck CHAPTER II
REMARK 8.4. Using Stewart's storage scheme (8.12) for the rotations R_ij(θ) we can
store the information defining Q in the zeroed part of the matrix A. As for the
Householder algorithm it is advantageous to keep Q in factored form.

REMARK 8.5. The error properties of Algorithm 8.3 are as good as for the algorithm
based on Householder transformations. WILKINSON ([1965], p. 240) showed that
for m = n the bound (8.15) holds with c_1 = 3n^{3/2}. GENTLEMAN [1975] has improved
this error bound to c_1 = 3(m + n − 2), m ≥ n, and notes that actual errors are
observed to grow even more slowly.
Givens rotations can be used to introduce zeros in a matrix more selectively than
is possible with Householder transformations. This flexibility is of importance for
solving sparse least squares problems, see Chapter III. For dense problems Givens'
method has the drawback of having twice the operation count of the Householder
method. Also O(nm) square roots are needed.
It is possible to rearrange the Givens rotation so that it uses only two instead of
four multiplications per element and no square root. These modified transformations,
called "fast" or "square root free" Givens transformations were introduced by
GENTLEMAN [1973] and modified by HAMMARLING [1974]. An algorithm for solving
linear least squares problems by fast Givens transformations is given in GOLUB and
VAN LOAN ([1983], pp. 158-160). In principle the gain in speed from using the modified
transformations should be a factor of 2. However, a nontrivial amount of monitoring to avoid
overflow is necessary to implement them and the observed gain is about 1.4-1.6 for
sufficiently large problems, see LAWSON et al. [1979].
9. Gram-Schmidt orthogonalization
q̂_k := a_k − Σ_{i=1}^{k−1} r_ik q_i.

This shows that q_k is a unit vector in the direction of q̂_k and r_kk is determined as the
normalization constant. In the classical Gram-Schmidt (CGS) algorithm the kth step
thus reads:

    r_ik := q_i^T a_k,  i = 1, ..., k−1;
    q̂_k := a_k − Σ_{i=1}^{k−1} r_ik q_i;
    r_kk := (q̂_k^T q̂_k)^{1/2};  q_k := q̂_k / r_kk;
REMARK 9.1. The important difference is that in the modified algorithm the
projections r_ik q_i are subtracted from a_k as soon as they are computed. Note that for
n = 2 CGS and MGS are identical. For treating rank-deficient problems column
pivoting is necessary, see Section 10. Then it is convenient to interchange the two
loops in Algorithm 9.2 so that the elements of R are computed row by row.
for i = 1, 2, ..., n
    q̂_i := a_i^(i);  r_ii := (q̂_i^T q̂_i)^{1/2};  q_i := q̂_i / r_ii;
    for k = i+1, ..., n
        r_ik := q_i^T a_k^(i);  a_k^(i+1) := a_k^(i) − r_ik q_i;
There is no numerical difference between these two versions of MGS since the
operations and rounding errors are the same.
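A direct transcription of the modified algorithm (column version) in numpy, for illustration (the code and names are ours):

```python
import numpy as np

def mgs(A):
    """Modified Gram-Schmidt: each projection r_ik * q_i is subtracted
    from the current working vector as soon as it is computed."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        v = A[:, k]
        for i in range(k):
            R[i, k] = Q[:, i] @ v          # projects the *updated* vector v
            v = v - R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(v)
        Q[:, k] = v / R[k, k]
    return Q, R

A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [0.0, 1.0],
              [1.0, 0.0]])
Q, R = mgs(A)
```

In CGS the inner products q_i^T a_k would all use the original column a_k; here they use the updated v, which makes no difference in exact arithmetic but is the crucial difference in floating point.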
Consider, for example, the matrix with two nearly parallel columns

A = ( 13  34
      21  55
      34  89
      ⋯   ⋯ ).

Then,

r_12 = q_1^T a_2 = 112.0,  a_2^(2) = (0.06, −0.04, 0.02, −0.02)^T.

Severe cancellation has taken place, since

||a_2^(2)||_2 = 0.07746 ≪ ||a_2||_2 = 112.0.

This leads to a serious loss of orthogonality between q_1 and a_2^(2):

q_1^T a_2^(2) / ||a_2^(2)||_2 = −0.007022/0.07746 = −0.09065.
We now consider the use of the modified Gram-Schmidt algorithm for solving
linear least squares problems. It is important to note that, because of the loss of
orthogonality in Q_1 that takes place also in MGS, we cannot simply compute c = Q_1^T b and
solve Rx = c. Instead we apply the MGS algorithm to the augmented matrix (A, b).
ALGORITHM 9.4 (Linear least squares solution by MGS). Given A ∈ R^{m×n} with
rank(A) = n and b ∈ R^m. Form Ā = (A, b) and apply Algorithm 9.2 to get the
factorization

(A, b) = (Q_1, q_{n+1}) ( R  z
                          0  ρ ).

Then the solution to the linear least squares problem min_x ||Ax − b||_2 is given by

Rx = z,  r = ρq_{n+1},  ||r||_2 = ρ. (9.6)
To see this, note that

||Ax − b||_2 = ||Q_1(Rx − z) − ρq_{n+1}||_2.
If q_{n+1} is orthogonal to Q_1, then the minimum of the last expression occurs when
Rx − z = 0 and the residual is ρq_{n+1}. Note that it is not necessary to assume that Q_1 is
orthogonal for this conclusion to hold.
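Algorithm 9.4 can be sketched as follows (illustrative code, ours), applying MGS to the augmented matrix (A, b) and reading off R, z and ρ:

```python
import numpy as np

def mgs_lstsq(A, b):
    """Least squares via MGS on (A, b): the last column of the triangular
    factor holds z and rho, so that R x = z and ||r||_2 = rho, cf. (9.6)."""
    m, n = A.shape
    W = np.column_stack([A, b]).astype(float)
    Q = np.zeros((m, n + 1))
    R = np.zeros((n + 1, n + 1))
    for k in range(n + 1):
        v = W[:, k].copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ v
            v -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(v)
        if R[k, k] > 0.0:
            Q[:, k] = v / R[k, k]
    x = np.linalg.solve(R[:n, :n], R[:n, n])   # R x = z
    rho = R[n, n]                              # ||r||_2
    return x, rho

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])
x, rho = mgs_lstsq(A, b)
```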
Even if for MGS the computed vectors q_k will not be orthogonal to working
accuracy, the loss of orthogonality is more gradual and can be bounded in terms of
κ(A). This is illustrated in the example below from BJÖRCK [1967]. Let
A = ( 1  1  1
      ε  0  0
      0  ε  0
      0  0  ε )

and assume that ε is so small that fl(1 + ε^2) = 1. If no other rounding errors are made,
then the matrices computed by CGS and MGS respectively are

Q_CGS = ( 1  0   0          Q_MGS = ( 1  0   0
          ε  −ε  −ε                   ε  −ε  −½ε
          0  ε   0                    0  ε   −½ε
          0  0   ε ),                 0  0   ε ).

For simplicity we have omitted the normalization of the columns of Q. It is easily
verified that after normalization the maximum deviation from orthogonality of the
computed columns is

CGS: |q_3^T q_2| = ½,   MGS: |q_3^T q_1| = (⅙)^{1/2} ε,

i.e. for MGS the loss of orthogonality is proportional to ε, but for CGS orthogonality
has been completely lost.
THEOREM 9.1. Let A ∈ R^{m×n} with rank n and b ∈ R^m. Let R̄, z̄, Q̄_1 and q̄_{n+1} be the
computed quantities using Algorithm 9.4. Then, provided inner products are accumulated
in double precision,

A + E = Q̄_1 R̄,  b + e = Q̄_1 z̄ + ρ̄ q̄_{n+1},

where

||E||_2 ≤ 1.5n^{3/2} u ||A||_2,  ||e||_2 ≤ 1.5n^{1/2} u ||b||_2. (9.7)

If further

η = 12mnuκ < 1,

where κ is the condition number of A, then

||I − Q̄_1^T Q̄_1||_2 ≤ 2(1 − η)^{−1/2} n^2 uκ. (9.8)
This shows that Algorithm 9.4 is a stable method for solving the linear least squares
problem. Indeed, according to numerical experiments of JORDAN [1968] and
WAMPLER [1970] it seems to be slightly more accurate than other orthogonalization
methods. However, MGS is also somewhat more expensive in terms of operations
and storage.
In some applications it is important to compute Q_1 and R such that Q_1 R
accurately represents A and Q_1 is accurately orthogonal. This is the orthogonal bases
problem. To satisfy both these conditions it is necessary to reorthogonalize the
computed vectors q_k.
for some small positive ε̄ independent of q_1 and a_2. Then we have the following
theorem:

THEOREM 9.2. Let α be a fixed value in the range 1.2ε̄ ≤ α ≤ 0.83 − ε̄. The following
algorithm computes a vector q̂_2 which satisfies

||q̂_2 − q_2||_2 ≤ (α + ε̄)||a_2||_2,  |q_1^T q̂_2| ≤ ε̄α^{−1}||a_2||_2. (9.11)

First compute q̂_2 using (9.10), and put q_2 := fl(q̂_2). If

||q̂_2||_2 ≥ α||a_2||_2,

then accept q̂_2; else reorthogonalize q_2:

q̂_2 := q_2 − r̃_12 q_1,  r̃_12 := q_1^T q_2.
REMARK 9.3. Note that when α is large, say α = 0.5, then the bounds (9.11) are very good
but reorthogonalization will occur frequently. If α is small, reorthogonalization will be
rarer, but the bound on orthogonality less good. RUTISHAUSER [1967] has given
a version of MGS where α = 0.1 is used.
REMARK 9.4. MOLINARI [1977] points out that there are special situations where even
better orthogonality is required than what can be obtained by one reorthogonaliza-
tion. He gives an ALGOL procedure for "superorthogonalization" which, depending
on a parameter, may carry out several reorthogonalizations.
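A single orthogonalization step with an acceptance test of the kind used in Theorem 9.2 might look as follows (a sketch under our own naming; the threshold alpha = 0.5 is the conservative choice mentioned in Remark 9.3):

```python
import numpy as np

def orthogonalize(Q, a, alpha=0.5):
    """Orthogonalize a against the orthonormal columns of Q; if the result
    has lost more than a factor alpha of its length (severe cancellation),
    reorthogonalize once. Returns the new unit vector and the coefficients."""
    r = Q.T @ a
    q = a - Q @ r
    if np.linalg.norm(q) < alpha * np.linalg.norm(a):
        dr = Q.T @ q                 # one step of reorthogonalization
        q = q - Q @ dr
        r = r + dr
    return q / np.linalg.norm(q), r

Q = np.array([[1.0], [0.0], [0.0]])
a = np.array([1.0, 1e-10, 0.0])      # almost parallel to the first column
q, r = orthogonalize(Q, a)
# q is orthogonal to Q to working accuracy despite the cancellation
```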
In the previous two sections we have assumed that the matrix A ∈ R^{m×n} has rank n.
Inaccuracy of data and rounding errors made during the computation usually mean
that the matrix A is not exactly known. In this situation the mathematical notion of
rank is not appropriate. For example, suppose that A is a matrix that originally was
of rank r <n, but whose elements have been perturbed by e.g. rounding errors. Then
it is most likely that the perturbed matrix has full rank n. However, it will be very
where σ_i, i = 1, 2, ..., min(m, n), are the singular values of A. This infimum is actually
attained for the matrix

B = Σ_{i=1}^{k} σ_i u_i v_i^T.

In the first phase A is reduced to bidiagonal form,

Q_B^T A P_B = ( B
                0 ),   B = ( q_1  e_2
                                  q_2  ⋱
                                       ⋱   e_n
                                           q_n ), (10.4)
where B is upper bidiagonal with diagonal elements q_1, ..., q_n and superdiagonal
elements e_2, ..., e_n. In the second phase one computes a sequence of bidiagonal
matrices

B_{i+1} = S_i^T B_i T_i,  i = 1, 2, ...,

where S_i and T_i are orthogonal and B_i converges to a diagonal matrix of singular
values. Then A = UΣV^T, where

U = Q_B diag(Q_S, I_{m−n}),  V = P_B P_S,

is the singular value decomposition of A. Usually less than 2n iterations are needed
in the second phase. For a detailed description of the SVD algorithm we refer to
GOLUB and REINSCH [1970] and GOLUB and VAN LOAN ([1983], pp. 285-294).
For solving the least squares problem the matrix U of left singular vectors need
not be formed explicitly; it suffices to apply the transformations to b to obtain
c = U^T b. The solution is then

x = Σ_{i=1}^{r} (c_i/σ_i) v_i, (10.6)

where r is the numerical rank of A. Here the matrix V is explicitly required. The
expression (10.6) shows that overestimating the numerical rank r of A can lead to
a solution of very large norm when σ_r is small.
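The truncated solution (10.6) is easy to express in code. The following numpy sketch (ours) drops singular values below a relative tolerance when determining the numerical rank r:

```python
import numpy as np

def svd_lstsq(A, b, tol=1e-10):
    """x = sum_{i=1}^{r} (c_i / sigma_i) v_i as in (10.6), with c = U^T b and
    r the numerical rank; truncation avoids solutions of very large norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    c = U.T @ b
    r = int(np.sum(s > tol * s[0]))   # numerical rank of A
    return Vt[:r].T @ (c[:r] / s[:r])

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x = svd_lstsq(A, np.array([1.0, 0.0, 2.0]))
# for this well-conditioned A, x agrees with the QR-based solution
```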
A modified algorithm for computing the bidiagonal decomposition (10.4) of A,
which is more efficient when m ≫ n, has been analyzed by CHAN [1982]. The idea,
mentioned e.g. in LAWSON and HANSON ([1974], pp. 119-122), is to begin by
computing the QR decomposition of A,

A = Q_1 ( R
          0 ). (10.7)

One then applies the first step of the Golub-Reinsch algorithm to R to get

Q_2^T R P_B = B. (10.8)

Combining (10.7) and (10.8) we obtain (10.4), where Q_B = Q_1 diag(Q_2, I_{m−n}). This
modified algorithm uses mn^2 + n^3 flops and therefore should be faster than the
Golub-Kahan bidiagonalization whenever m ≥ 5n/3, except for possible extra loop
overhead.
CHAN [1982] compares the operation counts for the two variants of the SVD
algorithm in four different cases, depending on whether U_1 and V are explicitly
required, where U = (U_1, U_2):

Required      Golub-Reinsch SVD    Chan SVD
Σ             2mn^2 − ⅔n^3         mn^2 + n^3
Σ, V          2mn^2 + 4n^3         mn^2 + ¹¹⁄₂n^3
Σ, U_1        7mn^2 − n^3          3mn^2 + ⁸⁄₃n^3
Σ, U_1, V     7mn^2 + 4n^3         3mn^2 + 10n^3
Both of the SVD algorithms can be shown to be backward stable, i.e. the computed
singular values Σ̄ = diag(σ̄_k) are the exact singular values of a nearby matrix A + ΔA,
where

||ΔA||_2 ≤ c(m, n)·u·σ_1.

Here c(m, n) is a constant depending on m and n, and u is the machine unit. From
Theorem 3.4 it follows that

|σ̄_k − σ_k| ≤ c(m, n)·u·σ_1.
where P_1, ..., P_{k−1} are Householder matrices and Π_1, ..., Π_{k−1} are elementary
permutation matrices. Let

Ã_22^(k) = (ã_k^(k), ..., ã_n^(k)).

The quantities s_j^(k) in (11.2) obviously are invariant under orthogonal transformations.
Hence, the pivoting strategy (11.3) is equivalent to searching for the column of
largest norm in the submatrix Ã_22^(k). Thus the permutation matrix Π_k should be
chosen to interchange the columns p and k, where the index p satisfies

s_p^(k) ≥ s_j^(k),  j = k, ..., n.

Note that, in the first step, this is equivalent to selecting the column of largest norm
in A. If r = k − 1, then s_j^(k) = 0, j = k, ..., n, and we are finished.
If the column norms in Ã_22^(k) are recomputed at each stage, then column pivoting
will increase the flop count of Algorithm 8.1 by a factor of 1.5. An alternative is to
compute the norms of the columns of A initially and update these values as the
decomposition proceeds. We compute

s_j^(1) = ||a_j||_2^2,  j = 1, ..., n, (11.5)

and for k = 1, 2, ..., r + 1 compute

s_j^(k+1) = s_j^(k) − r_kj^2,  j = k+1, ..., n. (11.6)

(Naturally, the s_j^(k) must be interchanged if the columns of Ã_22^(k) are interchanged.)
Using (11.6) will reduce the overhead of column pivoting to O(mn) operations.
However, some care must be taken to avoid numerical problems, see DONGARRA et
al. ([1979], pp. 9.16-9.18).
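A compact pivoted Householder QR with the norm updating (11.6) might look as follows (illustrative numpy code, ours; the recurrence s_j^(k+1) = s_j^(k) − r_kj^2 replaces recomputation of the column norms):

```python
import numpy as np

def qr_column_pivoting(A):
    """Householder QR with column pivoting; returns R and the permutation.
    Column norms (squared) are downdated via (11.6) instead of recomputed."""
    A = A.astype(float).copy()
    m, n = A.shape
    perm = np.arange(n)
    s = (A ** 2).sum(axis=0)                 # s_j^(1) = ||a_j||_2^2
    for k in range(n):
        p = k + int(np.argmax(s[k:]))        # pivot: largest remaining norm
        A[:, [k, p]] = A[:, [p, k]]
        s[[k, p]] = s[[p, k]]
        perm[[k, p]] = perm[[p, k]]
        x = A[k:, k]
        sigma = np.linalg.norm(x)
        if sigma > 0.0:                      # Householder step as in Alg. 8.1
            u = x.copy()
            u[0] += np.copysign(sigma, x[0])
            gamma = sigma * (sigma + abs(x[0]))
            A[k:, k:] -= np.outer(u, (u @ A[k:, k:]) / gamma)
        s[k + 1:] -= A[k, k + 1:] ** 2       # downdate, cf. (11.6)
    return np.triu(A), perm

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0]])
R, perm = qr_column_pivoting(A)
# |r_11| >= |r_22| >= |r_33|, and R has the same singular values as A
```

In production code the downdated s_j should occasionally be recomputed, for the reasons of numerical safety noted above.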
From (8.14) it follows that the column pivoting strategy will maximize the
diagonal element r_kk, and hence the diagonal elements in R will form a non-increasing
sequence. It is easily shown that in fact the elements in R will satisfy the stronger
inequalities

r_kk^2 ≥ Σ_{i=k}^{j} r_ij^2,  j = k+1, ..., n. (11.7)

In the Cholesky algorithm the corresponding strategy is to choose as pivot the largest
remaining diagonal element and interchange rows and columns k and p. Obviously this will maximize the
diagonal element r_kk, so this choice of pivot will in exact computation produce the
same permutation as the Householder algorithm with column pivoting. The
quantities s_j^(k) can again be updated by (11.6), which gives an overhead for the
column pivoting of O(n^2) flops.
This shows that ã_j^(k) is just the orthogonal projection of a_j onto the complement of
span[q_1, ..., q_{k−1}] = R(A_{k−1}). Hence, in the kth step we should maximize

s_j^(k) = ||ã_j^(k)||_2^2,  j = k, ..., n.

These quantities can be updated by the same formula (11.6) as for the Householder
and Cholesky algorithms, but again some care is necessary to avoid numerical
problems.
We now consider how to terminate the Householder algorithm for the QR
decomposition. Because of rounding errors the submatrix Ã_22^(k) will usually not be
exactly zero when A has rank r = k − 1. However, assume that Ã_22^(k) is small in norm
relative to A, say for some ε > 0 we have

||Ã_22^(k)||_2 ≤ ε||A||_2. (11.9)

Thus, if (11.9) is satisfied then A has numerical δ-rank at most equal to r = k − 1, for

δ = (ε + c·u·n^{1/2})||A||_2.

From the pivoting strategy it follows that

||Ã_22^(k)||_2 ≤ ||Ã_22^(k)||_F ≤ (n − k + 1)^{1/2} |r_kk|,  k = 1, ..., n.
In particular we have
EXAMPLE 11.1 (KAHAN [1966], pp. 791-792). Consider the upper triangular matrix

R_n = diag(1, s, ..., s^{n−1}) ( 1  −c  ⋯  −c
                                     1  ⋱   ⋮
                                         ⋱  −c
                                             1 ),   s^2 + c^2 = 1.

The matrix R_n is upper triangular and it can be verified that it satisfies the
inequalities (11.7). Therefore R_n is invariant under the algorithm for QR decomposition
with column pivoting. For n = 100, c = 0.2, the smallest singular value is
σ_n = 0.368·10^{−8}, but r_nn = s^{n−1} = 0.133. Hence, the near singularity of R_n is not
revealed.
The inequalities (11.12) give upper and lower bounds for σ_1(R) in terms of r_11. For
the smallest singular value σ_n(R) we have, assuming that R is nonsingular,

σ_n^{−1} = ||R^{−1}||_2 ≥ |r_nn^{−1}|, (11.14)

since the diagonal elements of R^{−1} equal r_ii^{−1}, i = 1, ..., n. This gives an upper
bound for σ_n(R). For an upper triangular matrix whose elements satisfy (11.7),
FADDEEV, KUBLANOVSKAJA and FADDEEVA [1968] have shown the lower bound

σ_n(R) ≥ 3(4^n + 6n − 1)^{−1/2} |r_nn| > 2^{1−n} |r_nn|. (11.15)
The matrices R_n in Example 11.1 show that this lower bound can almost be attained.
Combining (11.12) and (11.14) we obtain a lower bound for the condition number
κ(A) = κ(R):

κ(A) = σ_1/σ_n ≥ |r_11/r_nn|. (11.16)

The above discussion shows that this may considerably underestimate the condition
number. However, in extensive numerical testing by STEWART [1980] on randomly
generated test matrices, (11.16) was a fairly reliable estimate of κ(A). Indeed, (11.16)
usually was an underestimate only by a factor of 2-3 and never by more than 10. GOLUB
and VAN LOAN ([1983], p. 167) remark that "the degree of unreliability of QR with
column pivoting is somewhat like that for Gaussian elimination with partial
pivoting, a method that works very well in practice."
One way to determine the condition number of R would be to compute R^{−1}
explicitly. However, this requires about ⅙n^3 flops, and would significantly increase the work in
We have z = (R^T R)^{−1} d = (A^T A)^{−1} d, so (11.17) is equivalent to one step of inverse
iteration with A^T A. Let R = UΣV^T be the singular value decomposition of R. Note
that then

A = QR = (QU)ΣV^T,

so A and R have the same singular values and right singular vectors. If we expand

d = Σ_{i=1}^{n} α_i v_i,

then we have

y = Σ_{i=1}^{n} (α_i/σ_i) u_i,  z = Σ_{i=1}^{n} (α_i/σ_i^2) v_i.

Provided α_n, the component of d along v_n, is not very small, the vector z is likely to be
dominated by its component along v_n, and ||z||_2/||y||_2 will then give a good estimate of
σ_n^{−1}. The vector d is chosen with components

d = (±1, ±1, ..., ±1)^T,

the sign of d_i being determined at the stage when y_i is computed so as to obtain
a vector y of large norm. For details of this strategy see CLINE et al. [1979]. This
condition estimator has been implemented in the LINPACK set of subroutines and has
proved to be very reliable in practice.
The condition estimator will detect near rank deficiency of the matrix A even in
the (unusual) case when this is not revealed by a small diagonal element in R. This is
important since failure to detect near rank deficiency can lead to meaningless
solutions of very large norm, or even to failure of the algorithm. However, it still
remains to be shown how to compute a rank revealing QR decomposition in this
case, i.e. a decomposition of the form (11.4) with A 2(' 1) small.
We first consider the case when r=n-1, and show that we can always find
a column permutation of A such that the resulting R factor has a small element rn,. It
turns out that such a permutation can be found by inspecting the elements of the
right singular vector of A corresponding to the smallest singular value a. This
procedure was first pointed out in GOLUB, KLEMA and STEWART [1976], see also
GOLUB and VAN LOAN ([1983], Problem 6.4-4, p. 168). Let the vector v, ||v||_2 = 1,
satisfy ||Av||_2 = ε, and let Π be a permutation such that if Π^T v = w, then |w_n| = ||w||_∞.
Then, if AΠ = QR is the QR decomposition of AΠ, we have

ε = ||Av||_2 = ||Q^T AΠ(Π^T v)||_2 = ||Rw||_2 ≥ |r_nn| |w_n| ≥ n^{−1/2} |r_nn|,

since |w_n| = ||w||_∞ ≥ n^{−1/2}||w||_2 = n^{−1/2}. Hence |r_nn| ≤ n^{1/2}ε. For the general
rank-deficient case one can determine a suitable set of columns by the
following algorithm based on the singular value decomposition of A, see also GOLUB
and VAN LOAN ([1983], p. 418):
ALGORITHM 11.1 (Subset selection by SVD). Given A ∈ R^{m×n}, b ∈ R^m, and a method for
computing the numerical rank r of A. The following algorithm computes
a permutation matrix Π and a vector z ∈ R^r such that the first r columns of AΠ are
independent and such that

||(AΠ) ( z
         0 ) − b||_2

is minimized.
Step 1. Compute Σ and V in the SVD of A, A = UΣV^T, Σ = diag(σ_1, ..., σ_n), and
use it to determine the numerical rank r of A.
Step 2. Partition the matrix of right singular vectors according to

V = ( V_11  V_12 ) }r
    ( V_21  V_22 ) }n−r
       r    n−r
Iterative refinement is a technique which can be used either for reducing the
rounding errors in a computed solution or for estimating errors due to rounding in
for the least squares solution x and the residual r=b-Ax. We assume that
rank(A) = n, in which case the system (12.1) is nonsingular.
The description of the iterative refinement becomes a bit more compact if we take
as initial approximations
r^(0) = 0,  x^(0) = 0. (12.2)

We then compute a sequence of single precision approximations r^(s+1), x^(s+1),
s = 0, 1, ..., where the sth iteration consists of three steps:
Step 1. Compute the residual vectors to the system (12.1),

f^(s) = fl_2(b − r^(s) − Ax^(s)),  g^(s) = fl_2(−A^T r^(s)), (12.3)

where fl_2(E) denotes that the expression E is computed using double precision
accumulation of the inner products defining E.
Step 2. Solve for the corrections δr^(s) and δx^(s) from

( I    A ) ( δr^(s) )   ( f^(s) )
( A^T  0 ) ( δx^(s) ) = ( g^(s) ). (12.4)

Step 3. Update the approximations,

r^(s+1) = r^(s) + δr^(s),  x^(s+1) = x^(s) + δx^(s). (12.5)

We are thus led to consider systems of the general form

( I    A ) ( r )   ( f )
( A^T  0 ) ( x ) = ( g ). (12.6)

Note that in (12.6) we have a more general right-hand side than in (12.1) and we have
to modify the algorithms given before to cope with that.
Assume that we have computed the QR decomposition of A by Householder or
Givens transformations. Then the system (12.6) can be transformed into

( I        (R; 0) ) ( Q^T δr )   ( Q^T f )
( (R^T 0)  0      ) ( δx     ) = ( g     ).

Here the first n components of Q^T δr can be solved from the last set of equations and
its last m − n components from the first set. Hence the solution to (12.6) can be
computed from

R^T h = g,  Q^T f = ( c
                      d ),  δr = Q ( h
                                     d ),  Rδx = c − h. (12.7)
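One refinement step based on (12.7) can be sketched as follows (illustrative code, ours; in the text the residuals f and g would be accumulated in double precision, which this sketch does not model):

```python
import numpy as np

def refine_step(A, Q, R, b, x, r):
    """One iteration of (12.2)-(12.5), solving the correction system (12.4)
    via (12.7): R^T h = g, Q^T f = (c; d), dr = Q(h; d), R dx = c - h."""
    m, n = A.shape
    f = b - r - A @ x
    g = -A.T @ r
    h = np.linalg.solve(R[:n, :n].T, g)
    cd = Q.T @ f
    dr = Q @ np.concatenate([h, cd[n:]])
    dx = np.linalg.solve(R[:n, :n], cd[:n] - h)
    return x + dx, r + dr

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])
Q, R = np.linalg.qr(A, mode='complete')
x, r = np.zeros(2), np.zeros(3)          # initial values as in (12.2)
for _ in range(2):
    x, r = refine_step(A, Q, R, b, x, r)
# x and r converge to the least squares solution and its residual
```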
by MGS. To solve a system of the form (12.6) we proceed as follows. We solve the two
triangular systems
R^T h = g,  Rδx = y, (12.8)

where y = (y_1, ..., y_n)^T is computed by taking f^(1) = f and for k = 1, 2, ..., n
and that this rate is achieved even without actually carrying out the scaling of A by
the optimal D.
EXAMPLE 12.1 (BJORCK and GOLUB [1967]). To illustrate the method of iterative
refinement we consider the linear least squares problem where A is the last six
columns of the inverse of the Hilbert matrix H_8 ∈ R^{8×8}, which has elements

h_ij = 1/(i + j − 1),  1 ≤ i, j ≤ 8.
Two right-hand sides b_1 and b_2 are chosen so that the exact solution equals

x = (1, 1, ..., 1)^T.

For b = b_1 the system Ax = b is compatible; for b = b_2 the norm of the residual
r = b − Ax equals 1.04·10^7. Hence for b_2 the term proportional to κ^2(A) in the
perturbation bound (5.11) dominates.
The refinement algorithm (12.2)-(12.5) was run on a computer with unit roundoff
u = 1.46·10^{−11}. The systems (12.4) were solved by the method (12.7) using a QR
decomposition computed with Householder transformations. We stress that it is
essential that double precision accumulation of inner products is used in Step 1,
but otherwise all computations can be performed in single precision. We give below
the first component of the successive approximations x^(s), r^(s), s = 1, 2, 3, ..., for the
right-hand sides b_1 and b_2.
rhs b_1        rhs b_2
We observe a gain of almost three digits of accuracy per step in the approximations to
x_1 and r_1 for both right-hand sides b_1 and b_2. This is consistent with the estimate
(12.10), since

κ(A) = 5.03·10^8,  uκ(A) = 5.84·10^{−3}.

For the right-hand side b_1 the approximation x^(1) is correct to full single precision
accuracy. It is interesting to note that for the right-hand side b_2 the effect of the error
term proportional to uκ^2(A) is evident in that the computed solution x^(1) is in error
by a factor of 10^3. However, x^(4) has eight correct digits and r_1^(4) is close to the true
value 2.8·10^6.
REMARK 12.1. The key to the success of the iterative refinement scheme is that
approximations to both x and r are simultaneously improved. GOLUB and
WILKINSON [1966] suggested a simpler scheme where only x is improved. Here one
takes x^(s+1) = x^(s) + δx^(s), where δx^(s) is the solution to

min ||Aδx^(s) − r^(s)||_2,  r^(s) = fl_2(b − Ax^(s)).

This scheme results by taking g^(s) = 0 in the previous scheme. Unless the system
Ax = b is nearly compatible it does not work as well.
Ax = b + ε, (13.1)

relating the parameter vector x ∈ R^n in the model and the observation vector b ∈ R^m.
Assume that the random vector ε has zero mean and variance-covariance matrix
σ^2 I. Denote by x̂ a least squares estimate of x.
If rank(A) = n, then the error in the estimate x̂ can be written
s_ij = −( Σ_{k=i+1}^{j} r_ik s_kj ) / r_ii.
Here the elements of S can overwrite the corresponding elements of R in storage. The
algorithm requires ⅙n^3 flops.
The elements of diag(C) = diag(SS^T) are just the squared 2-norms of the rows of
S and can be computed in a further n^2 flops.
The matrix C is symmetric, and therefore we need only compute its upper triangular
part. This takes ⅙n^3 flops and can be sequenced so that the elements of C overwrite
those of S.
In many situations the matrix C only occurs as an intermediate quantity in
a formula. For example, the variance of a linear functional φ = f^T x̂ is equal to

v = σ^2 f^T Cf = σ^2 f^T (R^T R)^{−1} f = σ^2 z^T z,  R^T z = f.

Thus v may be computed by solving a single triangular system R^T z = f and forming
z^T z. This is a more stable and efficient approach than using the expression involving
C.
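In code, this amounts to a single triangular solve (an illustrative sketch, ours):

```python
import numpy as np

def functional_variance(R, f, sigma2=1.0):
    """Variance of phi = f^T x_hat: solve R^T z = f and return sigma^2 z^T z,
    without forming C = (R^T R)^{-1} explicitly."""
    z = np.linalg.solve(R.T, f)
    return sigma2 * (z @ z)

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
R = np.linalg.qr(A)[1]                   # R^T R = A^T A
f = np.array([1.0, 2.0])
v = functional_variance(R, f)
# equals sigma^2 * f^T (A^T A)^{-1} f
```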
In case C is needed there is an alternative way of computing C without inverting
R. From (13.3), multiplying by R from the left, we have

RC = R^{−T}. (13.8)

The diagonal elements of R^{−T} are simply r_kk^{−1}, k = n, ..., 1, and since R^{−T} is lower
triangular it has ½n(n − 1) zero elements. Hence, ½n(n + 1) elements of R^{−T} are known,
and the corresponding equations in (13.8) suffice to determine the elements in the
upper triangular part of the symmetric matrix C.
To compute the elements in the last column c_n of C we solve the system

Rc_n = r_nn^{−1} e_n. (13.9)

By symmetry c_ni = c_in, i = n − 1, ..., 1, so we also know the last row of C. Now assume
that the elements in columns k + 1, ..., n of the upper triangular part of C have been
computed. The elements c_kj = c_jk, j = k + 1, ..., n, have then already been computed,
and thus the remaining elements of column k can be determined.
Using the formulas (13.9)-(13.11) all elements of C can be computed in about ⅔n^3
flops.
For the case when the Cholesky factor R is sparse GOLUB and PLEMMONS [1980]
have shown how to use the above algorithm to compute all elements of C associated
with nonzero elements of R very efficiently. Note that since R is nonsingular these
include the diagonal elements of C. We discuss this algorithm in Section 16.
The generalized linear least squares problem is to find a vector x ∈ R^n that solves

min_x (Ax − b)^T W^{−1} (Ax − b), (14.1)

where b ∈ R^m is a given vector, A ∈ R^{m×n} a given matrix and W ∈ R^{m×m} a known
symmetric and positive-definite matrix. This problem arises in finding the least
squares estimate of the vector x when we are given the linear model

Ax = b + ε, (14.2)

with ε an unknown random vector of zero mean and covariance matrix σ^2 W.
Consider a factorization of W,

W = BB^T, (14.3)

where B ∈ R^{m×m}. Often B itself is given rather than W, or B can be computed from
W by the Cholesky decomposition, see Algorithm 6.1. Then (14.1) is equivalent to

min_x ||B^{−1}(Ax − b)||_2, (14.4)
In some applications the weights d_i, i = 1, 2, ..., m, may vary widely in size. For
such stiff weighted problems the Householder algorithm can fail through severe
cancellation.

EXAMPLE 14.1. Consider the matrix

A = ( 0     2     1
      10^6  10^6  0
      10^6  0     10^6
      0     1     1 ).

If the first step of Householder QR decomposition is carried out in five-decimal
floating-point arithmetic, the terms −2^{1/2} and −2^{−1/2}
in the first and second rows are lost. This is equivalent to the loss of all information
present in the first row of A. This loss is disastrous because the number of rows
containing large elements is less than the number of components in x, so there is
a substantial dependence of the solution x on the first row of A.
If in Example 14.1 the two large rows are permuted to the top of the matrix A, then
the Householder algorithm works well. POWELL and REID [1969] suggest that the
Householder algorithm be extended to include row interchanges, so that the element
of largest absolute value in the pivot column is permuted to the top. They give an
error analysis for this extended algorithm which shows that it is stable also for stiff
weighted problems. It can also be shown that QR decomposition by Givens
rotations is stable for stiff problems, if row interchanges are included.
Assume that the weights d_i, i = 1, ..., m, have been chosen so that the row norms of
the unweighted matrix A are about the same. Then in most cases it will be sufficient
to initially sort the rows in A and b by decreasing weight, so that (14.7) is satisfied.
Gram-Schmidt orthogonalization is easily seen to be invariant under row
permutations. Therefore, the modified Gram-Schmidt method is the only orthogonalization
method that works satisfactorily for stiff problems without row
interchanges. We mention that another stable method for stiff problems is the
Peters-Wilkinson method, see PETERS and WILKINSON [1970] and BJÖRCK and
DUFF [1980].
Now, consider again the general case when the covariance matrix W is not
a diagonal matrix. Let the eigendecomposition of W be
W= PTAP = BBT,
Q^T b = ( c_1 ) }n        Q^T B = ( C_1^T ) }n
        ( c_2 ) }m−n              ( C_2^T ) }m−n,

and an orthogonal matrix P = (P_1, P_2) is determined such that

C_2^T (P_1, P_2) = (0, S), (14.14)

where the matrix S is upper triangular. By the nonsingularity of B the matrix S will
be nonsingular. Note that (14.14), after a permutation of rows, is just the QR
decomposition of C_2. Now the second set of constraints in (14.13) becomes

c_2 + C_2^T v = 0,  v = Pu,  i.e.  c_2 + Su_2 = 0.

Since P is orthogonal we have ||v||_2 = ||u||_2, and so the minimum in (14.11) is found by
taking

u_1 = 0,  u_2 = −S^{−1} c_2,  v = P_2 u_2,
where P=(P, P2) and solving the triangular system in (14.13) for x.
An algorithm for (14.11) based on (14.12)-(14.15) requires a total of about
⅓m^3 + m^2 n − mn^2 + ⅓n^3 flops. For large values of m the work in the QR decomposition
of C_2 dominates.
PAIGE [1979a] obtains a perturbation analysis for the problem (14.4) by using the
formulation (14.11), and gives a rounding error analysis to show that the above
algorithm is numerically stable. The algorithm can be generalized in a straight-
forward way to rank-deficient A and B. For details see PAIGE [1979a].
In case the matrix B has been obtained from the Cholesky decomposition
W= BBT of W it is of lower triangular form. Then it is advantageous to carry out the
two QR decompositions in (14.12) and (14.14) together, and maintaining the lower
triangular form throughout. This algorithm which requires careful sequencing of
Givens transformations has been described by PAIGE [1979b]. It uses a total of
about ⅓m^3 + m^2 n − n^3 flops.
It can be shown that the computed solution x to (14.11) is an unbiased estimate of
x for the model (14.2). The covariance matrix of x is σ^2 C, where

C = R^{−1} L^T L R^{−T},  L^T = C_1^T P_1. (14.16)
CHAPTER III
In this chapter we will review methods for solving the linear least squares problem

min_x ||Ax − b||_2

which are effective when the matrix A is sparse, i.e. when the matrix A has relatively
few nonzero elements.
In RICE [1983] sources of very large least squares problems are identified and
discussed. Note that very large problems are by necessity sparse. The following
sources are mentioned:
(a) the geodetic survey problem,
(b) the photogrammetry problem,
(c) the molecular structure problem,
(d) gravity field of the earth,
(e) tomography,
(f) force method in structural analysis,
(g) very long base line problem,
(h) surface fitting,
(i) cluster analysis and pattern matching.
A sparse least squares problem of spectacular size is described in KOLATA [1978].
This is the problem of least squares adjustment of coordinates of the geodetic
stations comprising the North American Datum. It consists of about six million
equations in 400,000 unknowns ( = twice the number of stations). The equations are
mildly nonlinear so two or three linearized problems of this size need to be solved.
We assume initially that A ∈ R^{m×n} with rank(A) = n < m. However, problems where
rank(A) = m < n or rank(A) < min(m, n) occur in practice. Other important variations
include weighted problems, and/or linearly constrained problems.
Solving sparse least squares problems is closely related to solving sparse positive
definite systems. An excellent introduction to theory and methods for the latter
class of problems is given in the monograph by GEORGE and Liu [1981].
532 A. Björck CHAPTER III

In order to solve large sparse matrix problems efficiently it is important that we only store and operate on the nonzero elements of the matrices. We must also try to minimize fill-in as the computation proceeds, which is the term used to denote the creation of new nonzeros.
We first consider some storage schemes for a sparse rectangular matrix A. In the general sparse storage scheme the nonzero elements of A are stored row by row in a vector AN together with two integer vectors JA and IA. The vector JA gives the column subscripts of the nonzeros and the elements in IA point to the start of the nonzeros in each row in AN.

EXAMPLE 15.1. The matrix

A = ( a11   0     a13   0     0
      a21   a22   0     0     0
      0     0     a33   0     0
      0     a42   0     a44   0
      0     0     0     a54   a55
      0     0     0     0     a65 )

would be stored as

AN = (a11, a13, a21, a22, a33, a42, a44, a54, a55, a65),
JA = (1, 3, 1, 2, 3, 2, 4, 4, 5, 5),
IA = (1, 3, 5, 6, 8, 10).
The storage can be divided into primary storage for AN and overhead storage for
JA and IA. We remark that the overhead storage can often be decreased by using
a compressed scheme due to Sherman, see GEORGE and Liu ([1981], pp. 139-142).
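As an aside (not part of the original text), the general row-wise scheme above is what is nowadays called compressed sparse row (CSR) storage. A minimal Python sketch, using 1-based subscripts as in Example 15.1:

```python
def to_sparse_rowwise(A):
    """Store the nonzeros of a dense matrix A row by row:
    AN - the nonzero values,
    JA - their column subscripts (1-based, as in Example 15.1),
    IA - the position in AN where each row starts."""
    AN, JA, IA = [], [], []
    for row in A:
        IA.append(len(AN) + 1)          # start of this row in AN
        for j, a in enumerate(row, start=1):
            if a != 0:
                AN.append(a)
                JA.append(j)
    return AN, JA, IA

# The 6 x 5 matrix of Example 15.1 (a_ij encoded as 10*i + j):
A = [[11,  0, 13,  0,  0],
     [21, 22,  0,  0,  0],
     [ 0,  0, 33,  0,  0],
     [ 0, 42,  0, 44,  0],
     [ 0,  0,  0, 54, 55],
     [ 0,  0,  0,  0, 65]]
AN, JA, IA = to_sparse_rowwise(A)
# JA = [1, 3, 1, 2, 3, 2, 4, 4, 5, 5], IA = [1, 3, 5, 6, 8, 10]
```

Production codes store AN, JA, IA as flat arrays and often add a final sentinel entry IA(m+1) so that the length of every row can be read off directly.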
The simplest class of sparse rectangular matrices is the class of band matrices.
Such matrices have the property that in each row all nonzero elements are contained
in a narrow band. For a matrix A we define
f_i = min{j: a_ij ≠ 0},   l_i = max{j: a_ij ≠ 0}.   (15.1)

The numbers f_i and l_i are simply the column subscripts of the first and last nonzero in the ith row of A.
DEFINITION 15.1. The rectangular matrix A ∈ R^{m×n} is said to have row bandwidth w if

w = max_{1≤i≤m} (l_i − f_i + 1).

For this structure to have practical significance we need to have w << n. Matrices of
small row bandwidth often occur naturally, since they correspond to a situation
where only variables "close" to each other are coupled by observations. Note that
although the row bandwidth is independent of the row ordering it will depend on the
column ordering.
An alternative storage scheme for banded matrices is the envelope storage scheme. In this all elements in the ith row with column indices j such that f_i ≤ j ≤ l_i are stored. These elements have indices in the envelope of A, denoted by Env(A), which is defined as

Env(A) = {(i, j): f_i ≤ j ≤ l_i}.
EXAMPLE 15.2. In an envelope storage scheme, the matrix of Example 15.1 would be stored as

AN = (a11, 0, a13, a21, a22, a33, a42, 0, a44, a54, a55, a65),
FA = (1, 1, 3, 2, 4, 5),
IA = (1, 4, 6, 7, 10, 12).

Here FA contains the column indices f_i for each row.
Note that by increasing the primary storage slightly we have reduced the storage overhead. In the simplest case, when l_i − f_i + 1 = w, i = 1, 2, ..., m, we need not store the vector IA. Alternatively we can then use a two-dimensional array of dimension m by w to store AN.
Storage schemes similar to the ones given above can be used for storing a sparse symmetric positive-definite matrix B ∈ R^{n×n}. Obviously it is sufficient to store the
upper (or lower) triangular part of B, including the main diagonal. The general
scheme given in Example 15.1 can be used unchanged. However, since for a positive-
definite matrix all diagonal elements are positive it is sometimes convenient to store
these in a separate vector, see GEORGE and Liu [1981, pp. 79-80].
We now define the bandwidth of a symmetric matrix B ∈ R^{n×n} as

β = max{|i − j|: b_ij ≠ 0}.

Thus a symmetric matrix of small bandwidth has all its nonzeros clustered "near" the main diagonal.

For symmetric band matrices we can use an envelope storage scheme, where all elements in a row from the diagonal to the last nonzero are stored. The envelope of a symmetric matrix B is defined by

Env(B) = {(i, j): i ≤ j ≤ l_i}.
In the method of normal equations there are two steps where fill-in, i.e. creation of
new nonzeros, may occur. The first step is when the matrix ATA is formed and the
second step is in computing the Cholesky factor of ATA.
We first discuss the fill-in when the matrix ATA is formed. Partitioning A by rows we have (cf. (6.5))

A^T A = Σ_{i=1}^m a_i a_i^T,   (16.1)

where a_i^T now denotes the ith row of A. This expresses ATA as the sum of m matrices
of rank one. We now make the important assumption that no numerical cancellation occurs in the sum (16.1), that is, whenever two nonzero quantities are added or subtracted, the result is nonzero. Then it follows that the nonzero structure of ATA is the direct sum of the nonzero structures of a_i a_i^T, i = 1, 2, ..., m. Another characterization is the following: under the no-cancellation assumption, (A^T A)_{jk} ≠ 0 if and only if a_ij ≠ 0 and a_ik ≠ 0 for some row i. When numerical cancellation does occur, this prediction overestimates the number of nonzeros in ATA. For example, if A is orthogonal then ATA = I and is sparse even when A is dense.
We now prove a relation between the row bandwidth of the matrix A and the
bandwidth of the corresponding matrix of normal equations ATA.
THEOREM 16.2. Assume that the matrix A ∈ R^{m×n} has row bandwidth w. Then the symmetric matrix ATA has bandwidth β ≤ w − 1.
From the no-cancellation assumption it follows that if A contains one full row then ATA will be full even if the rest of the matrix is sparse; for example, a matrix A with a full first row and otherwise only diagonal nonzeros gives a full ATA.
Sparse problems with only a few dense rows can be treated by updating the solution
to the corresponding problem where the dense rows have been deleted (see the end of
this section).
We now show that there are problems where even though A is fairly sparse in all columns, the matrix ATA will be practically full. We consider a stochastic model where each element a_ij is an independent random variable and

P{a_ij ≠ 0} = p << 1.

Then we have P{a_ij a_ik = 0} = 1 − p², j ≠ k, and, since

(A^T A)_{jk} = Σ_{i=1}^m a_ij a_ik,

it follows that

P{(A^T A)_{jk} ≠ 0} = 1 − (1 − p²)^m.

Hence, unless mp² << 1, the matrix ATA will have few zero elements even though A itself is sparse.
We now consider the second step in the method of normal equations; the
computation of the Cholesky factorization RTR =ATA. Before this is carried out
numerically it is important to find a permutation matrix Pc such that PCATAP, has
a sparse Cholesky factor R. A number of heuristic reorderings are known which can
substantially reduce the fill-in during the factorization.
The simplest reordering methods are those which try to minimize the bandwidth
or envelope of the matrix ATA. These are motivated by the following important
relation between ATA and its Cholesky factor.
ALGORITHM 16.1.
Step 1. Determine symbolically the structure of ATA.
Step 2. Determine a column permutation Pc such that Pc^T ATA Pc has a sparse Cholesky factor R.
Step 3. Perform the Cholesky factorization of Pc^T ATA Pc symbolically to generate a storage structure for R.
Step 4. Compute B = Pc^T ATA Pc and c = Pc^T ATb numerically, storing B in the data structure of R.
Step 5. Compute the Cholesky factor R numerically and solve R^T z = c, Ry = z, giving the solution x = Pc y.
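The numerical Steps 4 and 5 of Algorithm 16.1 can be sketched in dense arithmetic as follows (the symbolic Steps 1-3 and the sparse storage structure are omitted, the permutation Pc is represented simply as a column ordering, and the function name is mine):

```python
import numpy as np

def normal_equations_solve(A, b, perm):
    """Steps 4-5 of Algorithm 16.1 in dense arithmetic:
    form B = Pc^T A^T A Pc and c = Pc^T A^T b, factor B = R^T R,
    solve R^T z = c, R y = z, and undo the permutation x = Pc y."""
    Ap = A[:, perm]                      # A Pc: reorder the columns
    B = Ap.T @ Ap                        # Pc^T (A^T A) Pc
    c = Ap.T @ b                         # Pc^T A^T b
    L = np.linalg.cholesky(B)            # B = L L^T, so R = L^T
    z = np.linalg.solve(L, c)            # R^T z = c
    y = np.linalg.solve(L.T, z)          # R y = z
    x = np.empty_like(y)
    x[perm] = y                          # x = Pc y
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
x = normal_equations_solve(A, b, perm=np.array([2, 0, 3, 1]))
# x agrees with the least squares solution of min ||Ax - b||
```

In exact arithmetic any permutation gives the same x; the point of Step 2 is that a good Pc keeps the Cholesky factor sparse.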
It is important to emphasize that the reason why the ordering algorithm in Step
2 can be done symbolically, only working on the structure of ATA, is that pivoting is
not required for numerical stability in the Cholesky algorithm. (Note that we assume
that rank(A)= n; for modifications needed to treat the case when rank(A)< n see
Section 17.)
By far the most popular ordering algorithm for Step 2, is the minimum degree
algorithm, which is a special case of the Markowitz ordering algorithm (MARKOWITZ
[1957]) for unsymmetric matrices. The minimum degree algorithm is based on
a local minimization of fill-in and can be described as follows. Assume that (k - 1)
columns have been ordered, and the corresponding steps in the Cholesky
factorization carried out. This intermediate stage in the factorization is shown in
Fig. 16.2.
FIG. 16.2. Intermediate stage of the Cholesky factorization: the computed part of R, the pivot row, and the ith row of the unreduced part.
Let vi be the number of nonzeros in the unreduced part of the ith row. We then
choose the next pivot column k so that
vk= min vi.
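The minimum degree rule is easy to state on the adjacency graph of A^T A. The following toy implementation (my sketch, not the implementation referenced in the text) repeatedly picks a node of minimum degree and connects its neighbours into a clique, which models the fill-in of the corresponding elimination step:

```python
def minimum_degree_order(adj):
    """adj: dict node -> set of neighbours (the graph of A^T A).
    Returns an elimination order chosen by the minimum degree rule."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        # pick a node of minimum current degree (ties: smallest label)
        v = min(adj, key=lambda u: (len(adj[u]), u))
        order.append(v)
        nbrs = adj.pop(v)
        for u in nbrs:
            adj[u].discard(v)
        # eliminating v connects its neighbours pairwise (fill-in)
        for u in nbrs:
            adj[u] |= nbrs - {u}
    return order

# an "arrow" graph: node 0 coupled to everyone, 1..4 only to 0
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
order = minimum_degree_order(adj)
# the dense node 0 is passed over while its degree is large
```

Eliminating the high-degree node 0 first would fill the whole graph; the minimum degree rule avoids exactly this.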
(13.11) can be used to compute very efficiently all elements in C, which are associated
with nonzero elements in R. Since R has a nonzero diagonal this includes the
diagonal elements of C giving the variance of x. If R has bandwidth w, then the
corresponding elements in C can be computed in only nw2 flops by the algorithm
below.
We define the index set K by

K = {(i, j): r_ij ≠ 0},

and let

f_k = min{i: r_ik ≠ 0, 1 ≤ i ≤ k − 1}.
We will compute all elements c_ij, (i, j) ∈ K, 1 ≤ j ≤ n, i ≤ j. We start with the last column of C and compute

c_nn = r_nn^{-2}.

Assume now that we have computed all elements c_ij, j = n, ..., k + 1, i ≤ j, (i, j) ∈ K. Then from (13.10)
It can be shown that, since R is the Cholesky factor of ATA, its structure is such that (i, j) ∈ K and (i, k) ∈ K implies that (j, k) ∈ K if j < k and (k, j) ∈ K if j > k. Hence all elements needed in (16.2)-(16.4) have been computed.
We remarked earlier that a single dense row in A will lead to a full matrix ATA and
therefore, invoking the no-cancellation assumption, a full Cholesky factor R.
Problems where the matrix A is sparse except for a few dense rows can be treated by
first solving the problem with the dense rows deleted. The effect of the dense rows on the solution is then incorporated by updating this solution. We stress that we only update the solution, not the Cholesky factor.
Consider the problem

min_x ‖(A_s; A_d) x − (b_s; b_d)‖₂,   (16.5)

where A_s ∈ R^{m1×n} is sparse and A_d ∈ R^{m2×n}, m2 << n, contains the dense rows. We
assume for simplicity that rank(A_s) = n. Denote by x_s the solution to the sparse problem

min_x ‖A_s x − b_s‖₂,

and let the corresponding Cholesky factor be R_s. The residual vectors in (16.5) corresponding to x_s are

r_s(x_s) = b_s − A_s x_s,   r_d(x_s) = b_d − A_d x_s.
We now wish to compute the solution x = x_s + z to the full problem (16.5), and hence to minimize

‖r_s(x)‖₂² + ‖r_d(x)‖₂²,   (16.6)

where

r_s(x) = r_s(x_s) − A_s z,   r_d(x) = r_d(x_s) − A_d z.

Letting

u = R_s z,   B_d = A_d R_s^{-1},

we have that ‖A_s z‖₂ = ‖u‖₂ and (16.6) reduces to

min_u {‖u‖₂² + ‖B_d u − r_d(x_s)‖₂²}.   (16.7)
Introducing
When w, and hence u, has been computed we get z by solving the triangular system R_s z = u, and then x = x_s + z.
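The whole updating scheme can be sketched in dense arithmetic as follows (my sketch; problem (16.7) is solved here as an ordinary least squares problem with the stacked matrix (B_d; I), rather than by the factorization indicated in the text):

```python
import numpy as np

def solve_with_dense_rows(As, bs, Ad, bd):
    """Solve min ||(As;Ad)x - (bs;bd)|| by first solving the sparse
    subproblem, then updating the solution for the dense rows Ad."""
    # sparse subproblem: As = Q Rs, xs = Rs^{-1} Q^T bs
    Q, Rs = np.linalg.qr(As)
    xs = np.linalg.solve(Rs, Q.T @ bs)
    rd = bd - Ad @ xs                    # residual in the dense rows
    Bd = Ad @ np.linalg.inv(Rs)          # Bd = Ad Rs^{-1}
    # (16.7): min ||u||^2 + ||Bd u - rd||^2, i.e. a least squares
    # problem with matrix (Bd; I) and right-hand side (rd; 0)
    m2, n = Bd.shape
    E = np.vstack([Bd, np.eye(n)])
    f = np.concatenate([rd, np.zeros(n)])
    u = np.linalg.lstsq(E, f, rcond=None)[0]
    z = np.linalg.solve(Rs, u)           # Rs z = u
    return xs + z                        # x = xs + z

rng = np.random.default_rng(1)
As = rng.standard_normal((10, 5))
Ad = rng.standard_normal((2, 5))
bs = rng.standard_normal(10)
bd = rng.standard_normal(2)
x = solve_with_dense_rows(As, bs, Ad, bd)
```

Because r_s(x_s) is orthogonal to the range of A_s, the correction z obtained this way gives exactly the minimizer of the full problem.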
The updating scheme described above can be modified to use an orthogonali-
zation method for the solution of (16.8), see GEORGE and HEATH [1980]. It can be
generalized to the case where the sparse subproblem has rank less than n, see HEATH
[1982]. A general scheme for updating equality constrained linear least squares solutions has been developed by BJÖRCK [1984].
It is important to point out that these updating algorithms cannot be expected to
be stable in all cases. Stability may be a problem whenever the sparse subproblem is
more ill-conditioned than the full problem.
The potential numerical instability of the method of normal equations is due to loss
of information in explicitly forming ATA and ATb, and to the fact that the condition
number of ATA is the square of that of A. Orthogonalization methods avoid both
these sources of trouble by working directly with A. An orthogonal matrix Q ∈ R^{m×m} is used to reduce A ∈ R^{m×n} of rank n and b to the form

Q^T A = (R; 0),   Q^T b = (c_1; c_2),

where R ∈ R^{n×n} is upper triangular and c_1 ∈ R^n. The solution is then obtained by solving the triangular system Rx = c_1, and the residual norm is ‖c_2‖₂.
As pointed out before (see Theorem 7.2) the matrix R equals the Cholesky factor of
ATA. Since this is unique apart from possible sign differences in some rows, its
nonzero structure is unique. Thus we may still use the symbolic steps in Algorithm
16.1 i.e. Steps 2 and 3, to determine a good column permutation Pc and set up
a storage structure for the Cholesky factor associated with APc. However, this
method may be too generous in allocating space for nonzeros in R. To see this
consider the matrix in Fig. 17.1. For this matrix R=A since A is already upper
triangular, but since ATA is full we will predict R to be full. Note that this can occur
because we begin not with the structure of ATA, the matrix whose Cholesky factor
we want, but with the structure of A. Hence the elements in ATA are not independent.
We call this structural cancellation in contrast to numerical cancellation, which
occurs only for certain values of the nonzero elements in A.
Another way to predict the structure of R is to perform the Givens or Householder
algorithm symbolically working from the structure of A. GEORGE and HEATH [1980]
proved the following result:
MANNEBACK [1985] has proved that also the structure predicted by a symbolic
Householder algorithm is strictly included in the structure predicted from ATA.
However, both the Givens and Householder rules can also overestimate the
structure of R. GENTLEMAN [1976] gives an example where structural cancellation
occurs for the Givens rule.
THEOREM 17.2. Let A ∈ R^{m×n}, m > n. If for all subsets of k columns of A, k = 1, 2, ..., n, the corresponding submatrix has nonzeros in at least k + 1 rows, then A is said to have the strong Hall property. If A has the strong Hall property, then the structure of ATA will correctly predict that of R.
Obviously the matrix A in Fig. 17.1 does not have the strong Hall property since the first column has only one nonzero element. However, the matrix A' obtained by deleting the first column has this property.
A =  x x x x x        A' =  x x x x
       x                    x
         x                    x
           x                    x
             x                    x

FIG. 17.1.
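For small matrices the strong Hall property can be tested directly from its definition in Theorem 17.2 (an illustration only, since it enumerates all column subsets; the patterns below are those of A and A' in Fig. 17.1):

```python
from itertools import combinations

def has_strong_hall(pattern):
    """pattern: list of rows, each a set of column indices of nonzeros.
    Every set of k columns must have nonzeros in at least k+1 rows."""
    ncols = max((j for row in pattern for j in row), default=-1) + 1
    for k in range(1, ncols + 1):
        for cols in combinations(range(ncols), k):
            rows_touched = sum(1 for row in pattern if row & set(cols))
            if rows_touched < k + 1:
                return False
    return True

# A of Fig. 17.1: full first row, then one diagonal element per row
A = [{0, 1, 2, 3, 4}, {1}, {2}, {3}, {4}]
# A': the first column of A deleted
A1 = [{0, 1, 2, 3}, {0}, {1}, {2}, {3}]
# A fails already for k = 1 (column 0 touches only one row);
# in A' every k columns touch the full first row plus k others
```
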
COLEMAN et al. [1986] also show that if A does not have the strong Hall property, then by row and column permutations A can be brought into block upper triangular form

PAQ = ( M_1  U_12  ...  U_{1,k+1}
             M_2   ...  U_{2,k+1}
                   ...
                   M_k  U_{k,k+1}
                        M_{k+1}   ),   (17.2)

where the diagonal blocks M_i, i = 1, 2, ..., k + 1, have the strong Hall property and moreover the blocks M_i, i = 1, 2, ..., k, are square. The reordering can be determined
by a simple generalization of the algorithm of GUSTAVSSON [1976]. The reordered system now essentially reduces the original least squares problem to the solution of

min_{x̃_{k+1}} ‖M_{k+1} x̃_{k+1} − b̃_{k+1}‖₂,   (17.3)

where x̃ = Q^T x and b̃ = Pb have been partitioned conformally with PAQ in (17.2). If rank(A) = n, then the blocks M_i, i = 1, 2, ..., k, are nonsingular and x̃_k, ..., x̃_1 can then be computed by back substitution.
In the following we will assume that if necessary such a preprocessing has taken
place and that therefore the structure of the matrix R is correctly predicted by that of
ATA.
For dense problems probably the most effective method to compute the QR
factorization is by using Householder reflections, see Algorithm 8.1. In this method
at the kth step all the subdiagonal elements in the kth column are annihilated. Then
each column in the remaining unreduced part of the matrix which has a nonzero
inner product with the column being reduced will take on the sparsity pattern of
their union. In this way, even though the final R may be sparse, a lot of intermediate
fill-in will take place with consequent cost in operations and storage. However, the
Householder method can be modified to work efficiently for banded systems, see
Section 18.
As shown by GEORGE and HEATH [1980], by using a row-oriented method
employing Givens rotations it is possible to avoid the problem with intermediate
fill-in in the orthogonalization method. We now describe this algorithm.
FIG. 17.2. Processing of a row in the George-Heath algorithm, where circled elements are involved in the elimination of a_i^T. Nonzero elements created in R_{i-1} and a_i^T during the elimination are denoted by ⊕.
Note that unlike the Householder method intermediate fill-in now only takes
place in the row that is being processed. It follows from Theorem 17.1 that if the
structure of R has been predicted from that of ATA, then any intermediate matrix
R _ will fit into the predicted structure.
For simplicity we have not included the right-hand side in Fig. 17.2, but it should
be processed simultaneously with the rows of A and carried along in parallel. In the
implementation of GEORGE and HEATH [1980] the Givens rotations are not stored
but discarded after use. Hence, only enough storage to hold the final R and a few extra vectors for the current row and right-hand side(s) is needed in main memory.
Although the final R is independent of the ordering of the rows in A it is known
that the number of operations needed in the sequential orthogonalization method
depends on the row ordering. This is illustrated by the following contrived example
due to GEORGE and HEATH [1980]:
A, P_r A:  the same set of rows in two different orderings.   (17.5)
The cost for reducing A is O(n2 ), and that for PrA is O(kn2 ).
Assuming that the rows of A do not have widely differing norms, the row ordering
does not affect numerical stability and can be chosen based on sparsity conside-
rations only. We consider the following heuristic algorithm for determining an
a priori row ordering.
ROW ORDERING ALGORITHM. Denote the column index of the last nonzero element in row a_i^T by l_i. Sort the rows so that the indices l_i, i = 1, ..., m, form a monotonically increasing sequence, i.e. l_i ≤ l_k if i < k.
This rule does not in general determine a unique ordering. One way to resolve ties
is to use a strategy by DUFF [1974], and consider the cost of symbolically rotating
a row aT into all other rows with a nonzero element in column Ii . Here by cost we
mean the total number of new nonzero elements created. The rows are then ordered
according to ascending cost.
Ordering the rows after increasing l_i has been found to work well, see GEORGE and HEATH [1980]. With this row ordering we note that when row a_i^T is being processed only the columns f_i to l_i of R_{i-1} will be involved, since all the previous rows only have nonzeros in columns up to at most l_i. Hence R_{i-1} will have zeros in columns l_i + 1, ..., n and no fill will be generated in row a_i^T in these columns.
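The a priori row ordering rule amounts to a single sort of the rows by their last-nonzero index l_i; a small sketch (Duff's tie-breaking rule is omitted):

```python
def sort_rows_by_last_nonzero(pattern):
    """pattern: list of rows, each a sorted list of 1-based column
    subscripts of the nonzeros.  Returns a row permutation such that
    the last-nonzero indices l_i form a nondecreasing sequence."""
    last = [row[-1] for row in pattern]          # l_i for each row
    return sorted(range(len(pattern)), key=lambda i: last[i])

pattern = [[2, 5], [1, 2], [4, 6], [1, 3], [5, 6]]
order = sort_rows_by_last_nonzero(pattern)
# l = (5, 2, 6, 3, 6)  ->  order [1, 3, 0, 2, 4]
```

Python's sort is stable, so rows with equal l_i keep their relative order; a cost-based tie-break as in DUFF [1974] would refine this.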
In some contexts sorting the rows after increasing values of f_i, the column index of the first nonzero in row a_i^T, may be appropriate. For this ordering it follows that the rows 1, ..., f_i − 1 in R_{i-1} will not be affected when the remaining rows are processed. These rows therefore are the final first f_i − 1 rows in R and may e.g. be transferred to auxiliary storage.
We summarize the main steps of the sequential row orthogonalization algorithm
below.
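Since the formal algorithm statement is not reproduced here, the following is a dense-arithmetic sketch of its core: rows of A are rotated one at a time into an initially zero R with Givens rotations, and the rotations are discarded after use (the sparsity bookkeeping of the actual implementation is omitted):

```python
import numpy as np

def sequential_row_qr(A):
    """Compute the R factor of A by processing one row at a time
    with Givens rotations; the rotations themselves are discarded."""
    m, n = A.shape
    R = np.zeros((n, n))
    for i in range(m):
        row = A[i].copy()
        for k in range(n):
            if row[k] == 0.0:
                continue                  # nothing to annihilate
            if R[k, k] == 0.0:
                R[k, k:] = row[k:]        # insert the row, no rotation
                break
            # Givens rotation annihilating row[k] against R[k, k]
            r = np.hypot(R[k, k], row[k])
            c, s = R[k, k] / r, row[k] / r
            Rk = R[k, k:].copy()
            R[k, k:] = c * Rk + s * row[k:]
            row[k:] = -s * Rk + c * row[k:]
    return R

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 4))
R = sequential_row_qr(A)
# R is upper triangular and R^T R = A^T A up to rounding
```

Any right-hand side would be carried along as an extra column of the working row, exactly as described in the text.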
There is a great deal of freedom in the way Givens rotations can be used to
transform A to upper triangular form. A more general way of merging the rows in
A has been suggested by Liu [1986]. This scheme can give a significant reduction in
factorization time at a modest increase in working storage. The idea is best described
by a small 12 x 9 example
A = [a 12 × 9 matrix in which every row has four nonzeros, and the rows in each of the groups 1-3, 4-6, 7-9 and 10-12 occupy a common set of columns].   (17.6)
We note that rows 1-3 can be merged into an upper triangular matrix without any
fill-in. The same holds true for rows 4-6, 7-9 and 10-12, and we can reduce these
four groups of rows independently into 4 (partially filled) upper triangular matrices,
R 1 ,..., R 4. In the second stage we could merge the rows of R1 and R 2 together and
simultaneously merge R 3 and R4 . In the final stage we would merge the two upper
triangular matrices to produce the final Cholesky factor.
This way of transforming A to upper triangular form may be described by a
strictly binary tree, which is called a row merge tree. The tree for the example above
is given in Fig. 17.3.
FIG. 17.3. Row merge tree for the matrix (17.6).
The row merge tree should be interpreted as follows. It has m leaves correspond-
ing to the m rows aT, i= 1, ... , m. Any node x in the tree defines a subtree rooted at x,
which in turn defines a set rows. We associate with x an upper triangular matrix
obtained by the reduction of this set of rows. Thus, the root of the tree corresponds to
the Cholesky factor R to be determined. We now see that each node corresponds to
the merging of the two triangular matrices associated with its left and right son. To
find R we have to traverse the tree, i.e. visit all nodes. We cannot visit a node before
we have visited its two sons but otherwise the sequential order in which we visit the
nodes is not specified. In fact, this approach is well suited for doing several merges in
parallel: we can view it as performing the computation on multiple fronts, cf. DUFF
and REID [1983].
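The merge operation at an interior node of the row merge tree combines two upper triangular factors into one. As an illustration (my sketch; Liu merges the rows with Givens rotations), the merge can be expressed as the QR factorization of the stacked pair:

```python
import numpy as np

def merge_factors(R1, R2):
    """Merge two upper triangular factors into one: the R factor of
    the stacked matrix (R1; R2)."""
    return np.linalg.qr(np.vstack([R1, R2]), mode="r")

rng = np.random.default_rng(3)
A1 = rng.standard_normal((6, 4))
A2 = rng.standard_normal((5, 4))
R1 = np.linalg.qr(A1, mode="r")   # factor of the first row group
R2 = np.linalg.qr(A2, mode="r")   # factor of the second row group
R = merge_factors(R1, R2)
# R^T R equals A^T A for A = (A1; A2), up to rounding
```

Merges at different nodes of the tree are independent, which is what makes the multifrontal, parallel interpretation possible.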
A way to generate a row merge tree from the structure of the matrix A is given by LIU [1986]. Liu also suggests that for sequential computation the nodes in the merge tree are visited in depth-first order.
The general row merge scheme described here can be viewed as a special type of
variable row pivot method as studied by GENTLEMAN [1973], DUFF [1974] and
ZLATEV [1982]. However, as observed by LIU [1986], variable row pivoting has not
been so popular because of the difficulty of generating good orderings for the
rotations and because of the complication of implementation. Also, the dynamic
storage structure needed tends to reduce the efficiency of these schemes.
Many large sparse linear least squares problems are so large that it is impossible
to store even the Cholesky factor R. We now briefly describe an automatic
partitioning scheme by GEORGE, HEATH and PLEMMONS [1981] for solving such
problems.
Assume that an appropriate ordering and partitioning of the columns of A has
been found, e.g. by the method of nested dissection in Section 19. Denote by Yi the set
of column indices in the ith partition, i = 1, ..., p. We now order first the rows having nonzero elements with column indices in Y_1, giving us a set of row indices Z_1. Among the unordered rows we now order the rows having nonzero elements with column
indices in Y_2, and so on. This induces a partitioning of the row indices {Z_1, Z_2, ..., Z_p} of A which can be defined as follows. Let Z_0 = ∅ and

Z_i = {k: there exists j ∈ Y_i with a_kj ≠ 0} \ (Z_1 ∪ ... ∪ Z_{i-1}),   i = 1, 2, ..., p.
FIG. 17.4. Block upper trapezoidal form of A for p = 3, with blocks A11, A12, A13; A22, A23; A33.
If the rows of A are permuted to appear in the order Z 1, Z 2 ... , Zp, then A will have
a block upper trapezoidal form, as depicted in Fig. 17.4 for p = 3.
For a matrix of block upper trapezoidal form the sequential orthogonalization method can be applied to a block row at a time. In the first step only the blocks A11, ..., A1p are processed, transforming A11 to upper triangular form. The first |Y1| rows of the resulting matrix are the first |Y1| rows of the final R and can be stored away. The rest of the rows are adjoined to the next block row A22, ..., A2p to give Ã22, ..., Ã2p, and now Ã22 is transformed into upper triangular form, etc. We assume that the rows of A are stored on auxiliary storage at all times. Then the only main storage required is that for holding the |Yi| rows of R generated at step i, i = 1, ..., p. A slightly more efficient way to carry out this process is described in GEORGE, HEATH and PLEMMONS [1981], where also details of the data management are outlined.
So far we have assumed that A is not rank-deficient. In the dense case rank-deficient problems were handled by introducing column permutations in the QR decomposition of A, see Section 11. In Algorithm 17.2 the column ordering is fixed in advance of any numerical computation and chosen to produce a sparse R factor. Therefore column permutations are not allowed in Step 5 of Algorithm 17.2, since then the computed R will in general not fit into the previously generated fixed storage structure.
If Algorithm 17.1 is applied to a matrix A of rank r < n, then using exact arithmetic
the resulting R factor must have n - r zero diagonal elements. In the algorithm a row
is only inserted into the data structure so that the diagonal entry is nonzero. Further
processing of this row can only increase the diagonal element, and it follows that if
a row has a zero diagonal element then all its elements are zero. Hence R will have
the form depicted in Fig. 17.5.
By permuting the zero rows of R to the bottom and the columns of R corres-
ponding to the zero diagonal elements to the right we get a matrix of the desired form
(7.10).
In finite precision we will usually end up with an R factor with no zero diagonal
elements. Although this is not always the case the rank is often revealed by the
presence of small diagonal elements. However, a small diagonal element does not
imply that the rest of the row is negligible. HEATH [1982] recommends that, starting
from the top, the diagonal of R is examined for small elements. In a row whose
diagonal element falls below a certain tolerance the diagonal element is put equal to
zero. The rest of the row is then reprocessed zeroing out all its other nonzero
elements. Note that this might increase some previously small diagonal elements in
rows below, which is why we have to start from the top. After this we end up with
a matrix of the form shown in Fig. 17.5.
In the test for small diagonal elements a relative tolerance can be used based on
the largest diagonal element in R. This way of determining rank is not as satisfactory
as algorithms using column pivoting (see Example 17.1). It might happen, although
it is perhaps unlikely to occur in practice, that R is almost rank-deficient and yet has
no small diagonal element. However, HEATH [1982] reports that on a typical test
batch the rank determined by Algorithm 17.1 agreed with the rank determined by
QR decomposition with column pivoting.
EXAMPLE 17.1. Consider the upper bidiagonal matrix

R_n = ( r  1
           r  1
              .  .
                 .  1
                    r )  ∈ R^{n×n}.

From the form of R_n, n − 1 of the singular values are close to unity, and since their product equals |det(R_n)| = r^n the remaining singular value is approximately equal to r^n. For r = 0.1 and n = 20 we thus have σ_min ≈ 10^{-20} and yet no diagonal element is small. This ill-conditioning is more severe than that exhibited by the matrix in Example 11.1.
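The claim can be checked numerically for a smaller n, so that σ_min stays above the rounding level (R_n is taken to be upper bidiagonal with diagonal r and superdiagonal 1, as above):

```python
import numpy as np

n, r = 8, 0.1
# upper bidiagonal: diagonal r, superdiagonal 1
Rn = np.diag(np.full(n, r)) + np.diag(np.ones(n - 1), k=1)
sigma = np.linalg.svd(Rn, compute_uv=False)
smin = sigma[-1]
# no diagonal element of Rn is small (all equal 0.1),
# yet smin is close to r**n = 1e-8
```

For n = 20 the quantity r^n = 10^{-20} lies below the double precision rounding level relative to ‖R_n‖, which is why a smaller n is used here.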
We point out that several of the techniques given in Section 11 for determining the
rank and ill-conditioning in triangular matrices can be used in the sparse case, see
FOSTER [1986].
In implementations of Algorithm 17.1 the Givens rotations are also applied to any
right-hand side(s), but normally then discarded to save storage. This creates a
problem if we later wish to solve additional problems having the same matrix A but
different right-hand sides b. In this case the rotations could be saved on an external
file for later use.
An alternative possibility for handling multiple right-hand sides is to solve the
seminormal equations (SNE)
RTRx = ATb, (17.7)
using the computed R factor. This only requires that the original matrix A is saved to
transform subsequent right-hand sides. Unfortunately the numerical stability of
(17.7) is not better than that of the normal equations. This is somewhat surprising
since we are using an R factor computed by Givens' method and thus of better
"quality" than that obtained from a Cholesky factorization of ATA. It has been
shown by BJORCK [1986] that by adding a correction step to (17.7) we can obtain
a solution of much better accuracy.
In the method of corrected seminormal equations (CSNE) we correct the computed solution x by

R^T R w = A^T (b − Ax),   x_c = x + w.   (17.8)
The correction step is similar to doing one step of iterative refinement on the
solution from (17.7). However, here the residual b-Ax may be computed in single
precision. A detailed error analysis of the method CSNE is given by BJORCK [1986].
This error analysis leads us to expect that CSNE is at least as accurate as the
orthogonalization method provided that the solution from the seminormal
equation has at least one correct digit. For problems with widely differing row
scalings (stiff problems) the method CSNE is however less satisfactory.
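A dense sketch of SNE with the correction step (17.8) (here R is taken from a QR factorization of A, and the residual b − Ax is formed in working precision):

```python
import numpy as np

def csne_solve(A, b, R):
    """Corrected seminormal equations: solve R^T R x = A^T b,
    then one correction step R^T R w = A^T (b - A x)."""
    def sne(rhs):
        z = np.linalg.solve(R.T, rhs)     # R^T z = rhs
        return np.linalg.solve(R, z)      # R x = z
    x = sne(A.T @ b)                      # seminormal equations (17.7)
    w = sne(A.T @ (b - A @ x))            # correction step (17.8)
    return x + w

rng = np.random.default_rng(4)
A = rng.standard_normal((9, 4))
b = rng.standard_normal(9)
R = np.linalg.qr(A, mode="r")             # the saved R factor
x = csne_solve(A, b, R)
```

Only A, b and the saved R factor are needed, which is the point of the method: no rotations have to be stored for later right-hand sides.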
In this section we consider orthogonalization methods for the special case when
AEIRmXn is a band matrix of constant row bandwidth w (cf. Definition 15.1). Such
matrices could be treated using the algorithm given in Section 17 for general sparse
matrices. However, we can take advantage of the very simple structure of a band
matrix to save on space and time overheads.
We will assume that the rows of A have been sorted so that the column indices f_i, i = 1, 2, ..., m, of the first nonzero element in each row form a nondecreasing sequence, i.e. f_i ≤ f_k if i < k. Since we assume a constant row bandwidth we have

l_i = f_i + w − 1,

so we could equivalently sort after nondecreasing values of l_i. From Theorem 16.2
we know that the matrix ATA will also be a band matrix with only the first β = w − 1 superdiagonals nonzero. The upper triangular Cholesky factor R has the same structure as the upper triangle of ATA, see Theorem 16.3. Thus R has again w nonzeros in each row.
We now specialize the sequential row orthogonalization scheme of Section 17, in which at step i the row a_i^T is rotated into R_{i-1}. In Fig. 18.1 we show the situation before the elimination of this row.
FIG. 18.1. The situation before row a_i^T is eliminated: the finished rows of R, and the active part of R_{i-1} between columns f_i and l_i.
From Fig. 18.1 several things are apparent. First we note that only the shaded part
of Ri_1 is involved in this step. Therefore this step can be thought of as updating
a full upper triangular matrix of order w when a row of length w is added. The last
(n - li) columns of R have not been touched and are still zero as initialized. Further,
the first (fi- 1) rows of R are already finished at this stage and can be read out to
secondary storage. Thus, very large problems can be handled since the only primary
storage needed is for the shaded part in Fig. 18.1.
We remark that by initializing R to zero the description above is also valid for the processing of the first rows of A. The first row a_1^T can just be inserted into R; this insertion is a special case of a rotation. After the insertion only zero elements are left in the row being processed, and no more rotations are needed. Similarly, the number of rotations needed to process row i is at most equal to min(i − 1, w).
It is clear from the above that the processing of row a_i^T requires at most 2w² flops if 4-multiply Givens rotations are used. Thus the complete orthogonalization requires about 2mw² flops and can be performed in ½w(w + 3) locations of primary storage.
We remark that if the rows of A are processed in random order, then we can only
bound the operation count by 2mnw flops, which is a factor of n/w worse (see Cox
[1981]). Thus, it almost invariably pays to sort the rows as was assumed to be done
above.
In Algorithm 18.1 the Givens rotations could also be applied to one or several
right-hand sides b to produce
EXAMPLE 18.1 (Cox [1981]). Consider the least squares approximation of a discrete
set of data by a linear combination of cubic B-splines. The spline is represented over
[a, b] by

s(t) = Σ_{j=1}^p x_j B_j(t),

where B_j(t), j = 1, 2, ..., p = N + 4, are the normalized cubic B-splines (see DE BOOR [1978], Chapter 9) for the knots
As an example we take data (y_i, t_i), i = 1, 2, ..., m = 16, N = 6 and determine x to minimize

Σ_{i=1}^m (s(t_i) − y_i)² = ‖Sx − y‖₂².

Since the only B-splines with nonzero values for t ∈ [λ_{k-1}, λ_k] are B_j, j = k, k + 1, k + 2, k + 3, the matrix S will be a band matrix with w = 4. In particular if the number of data points in the interval [λ_{i-1}, λ_i], i = 1, ..., 7, is 3, 2, 3, 2, 3, 2 and 2 respectively
S = [a band matrix with row bandwidth w = 4, the rows grouped according to the knot intervals].   (18.1)
Hence, A has a block band structure with block bandwidth w=2. Using the
sequential orthogonalization method, no fill-in will occur outside the nonzero
blocks.
In this section we consider least squares problems min ‖Ax − b‖₂ where A has the dual block angular form

A = ( A_1             B_1
           A_2        B_2
                ...   ...
                A_M   B_M ),   (19.1)

where A ∈ R^{m×n} and

A_i ∈ R^{m_i × n_i},   B_i ∈ R^{m_i × n_{M+1}},   i = 1, 2, ..., M.
A number of problems have two levels of sparsity structure. That is the blocks Ai
and/or B i are themselves large and sparse matrices often of the same general sparsity
pattern as A. There may also be more than two levels of structure. There is a wide
variation in the number and sizes of blocks. Some problems have large blocks with
M of moderate size (10-100) while others have many more but smaller blocks.
We partition the solution x and right-hand side b conformally with (19.1),

x^T = (x_1^T, ..., x_M^T, x_{M+1}^T),   b^T = (b_1^T, b_2^T, ..., b_M^T).

From (19.1) we see that the set of variables x_i, i = 1, ..., M, are coupled only to the set of variables x_{M+1}. Some examples where the form (19.1) arises naturally are in
photogrammetry, see GOLUB, LUK and PAGANO [1979], Doppler radar positioning,
see MANNEBACK, MURIGAND and TOINT [1985] and geodetic survey problems, see
GOLUB and PLEMMONS [1980]. WEIL and KETTLER [1971] have given a heuristic
algorithm for permuting a general sparse matrix into this form.
Problems of dual block angular form can be solved efficiently either by the
method of normal equations or by an orthogonalization method. The reason for this
is that no fill-in outside the nonzero blocks in A will occur in the R factor. We
describe below a method based on orthogonalization.
ALGORITHM 19.1 (Dual block angular orthogonalization, GOLUB, LUK and PAGANO
[1979]). The algorithm proceeds in four steps:
Step 1. Reduce the diagonal block A_i to upper triangular form by a sequence of
orthogonal transformations, and apply these also to the blocks B_i and the
right-hand side b_i, i = 1, 2, ..., M, yielding

Q_i^T (A_i, B_i, b_i) = ( R_i  S_i  c_i )
                        (  0   T_i  d_i ),   i = 1, 2, ..., M.    (19.2)

Step 2. Form

    ( T_1 )        ( d_1 )
T = ( T_2 ),   d = ( d_2 ).
    ( ... )        ( ... )
    ( T_M )        ( d_M )

Step 3. Compute the QR decomposition of T and transform the vector d,

Q_{M+1}^T T = ( R_{M+1} ),   Q_{M+1}^T d = ( c_{M+1} ).    (19.3)
              (    0    )                  ( d_{M+1} )

Then the residual norm of the system is given by ||d_{M+1}||_2. Compute x_{M+1}, the
solution of the least squares problem

min_{x_{M+1}} ||T x_{M+1} - d||_2.

Step 4. Compute x_i, i = 1, 2, ..., M, by back substitution in the triangular systems

R_i x_i = c_i - S_i x_{M+1}.
REMARK 19.1. In Steps 1 and 4 the computations can be performed in parallel on the
M subsystems. It is often advantageous to continue the reduction in Step 1 so that
also the matrices T_i, i = 1, ..., M, are upper triangular. Then Step 2 can be performed
as a merging of the M triangular matrices T_1, ..., T_M, cf. the general row merging
scheme described in Section 17.
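The four steps of Algorithm 19.1 can be sketched in dense numpy as follows; this is an illustrative toy version (the function name and the use of dense QR and `lstsq` for Step 3 are our own choices, not from the text), checked against a direct solve of the assembled least squares problem.

```python
import numpy as np

def dual_block_angular_lstsq(As, Bs, bs):
    """Toy version of Algorithm 19.1 for min ||Ax - b||_2, where A has
    diagonal blocks As[i] and coupling column blocks Bs[i] as in (19.1)."""
    Rs, Ss, cs, Ts, ds = [], [], [], [], []
    for Ai, Bi, bi in zip(As, Bs, bs):
        ni = Ai.shape[1]
        Qi, Ri = np.linalg.qr(Ai, mode='complete')   # Step 1: QR of each A_i
        G = Qi.T @ np.column_stack([Bi, bi])         # apply Q_i^T to (B_i, b_i)
        Rs.append(Ri[:ni]); Ss.append(G[:ni, :-1]); cs.append(G[:ni, -1])
        Ts.append(G[ni:, :-1]); ds.append(G[ni:, -1])
    T, d = np.vstack(Ts), np.concatenate(ds)         # Step 2: stack T_i, d_i
    x_last = np.linalg.lstsq(T, d, rcond=None)[0]    # Step 3: min ||T x_{M+1} - d||
    xs = [np.linalg.solve(R, c - S @ x_last)         # Step 4: back substitution
          for R, S, c in zip(Rs, Ss, cs)]
    return np.concatenate(xs + [x_last])

rng = np.random.default_rng(1)
M, mi, ni, nlast = 3, 6, 2, 2
As = [rng.standard_normal((mi, ni)) for _ in range(M)]
Bs = [rng.standard_normal((mi, nlast)) for _ in range(M)]
bs = [rng.standard_normal(mi) for _ in range(M)]
x = dual_block_angular_lstsq(As, Bs, bs)

A_full = np.block([[As[i] if j == i else np.zeros((mi, ni)) for j in range(M)]
                   + [Bs[i]] for i in range(M)])
x_ref = np.linalg.lstsq(A_full, np.concatenate(bs), rcond=None)[0]
```

As Remark 19.1 notes, the loop over the M subsystems in Steps 1 and 4 could be run in parallel.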
REMARK 19.2. Sometimes in Step 1 the matrices Ri will be sparse but a lot of fill-in
occurs in the blocks Bi . Then a block preconditioned iterative method may be used,
where one solves the problem
min_z ||Ez - b||_2,   z = Mx,   E = AM^{-1}.
The R factor of A has the block structure

    ( R_1                 S_1     )
    (      R_2            S_2     )
R = (           ...       ...     )
    (               R_M   S_M     )
    (                     R_{M+1} )

where the diagonal blocks R_i, i = 1, ..., M, and R_{M+1} are upper triangular.
Note that we can write, see GOLUB, PLEMMONS and SAMEH [1986], the diagonal blocks
of the variance-covariance matrix C = (R^T R)^{-1} = R^{-1} R^{-T} as

C_ii = R_i^{-1}(I + W_i^T W_i) R_i^{-T},   W_i = R_{M+1}^{-T} S_i^T,   i = 1, ..., M.

Hence, the blocks C_ii can be computed via QR decompositions of the matrices W_i,
i = 1, ..., M.
    ( A_1       B_1 )
    (      A_2  B_2 )

FIG. 19.1. One level of dissection and the corresponding block structure of the matrix, with the
variables in the separator ordered last.
Usually there will be a finer structure in A because there are equations only
involving variables in Ω_1 or Ω_2 but not in the separator. In this discussion we
ignore this finer partitioning for simplicity. We note that the matrix A in Fig. 19.1 has
dual block angular structure with M = 2. The blocks corresponding to Ω_1 and Ω_2
can thus be processed independently (possibly in parallel).
The dissection can be continued by dissecting the regions Ω_1 and Ω_2 each into
two subregions, and so on in a recursive fashion. In Fig. 19.2 we show the block
structure induced by two levels of dissection. Again the matrix A is of dual block
angular form, but now with a lot of structure in the nondiagonal blocks.
For a detailed discussion of dissection and orthogonal decompositions in
geodetic survey problems see GOLUB and PLEMMONS [1980].
AVILA and TOMLIN [1979] discuss the solution of very large least squares problems.
SECTION 20 Sparse least squares problems 557
    ( A_1            B_1       D_1 )
    (      A_2       B_2       D_2 )
    (           A_3       C_3  D_3 )
    (                A_4  C_4  D_4 )

FIG. 19.2. Two levels of dissection and the block structure induced in A when the variables of the
separators are ordered last.
For some large sparse least squares problems iterative methods are a useful
alternative to direct methods. In iterative methods an initial approximate solution is
successively improved until an acceptable solution is obtained. Iterative methods
are especially attractive for problems in which the elements of the matrix A can be
easily generated on demand. In such cases the matrix A need not be stored at all, but
instead can be defined by its action on vectors.
In principle any iterative method for symmetric positive-definite (or if rank(A) < n,
semidefinite) linear systems can be applied to the system of normal equations
ATAx = AT b. As we will see, explicit formation of the matrix ATA can be avoided by
using the factored form of the normal equations
AT(b - Ax) = 0. (20.1)
For a treatment of iterative methods for symmetric positive-definite linear systems
see VARGA [1962] and YOUNG [1971]. Surveys of iterative methods for least squares
problems are given by BJORCK [1976] and HEATH [1984].
We now consider some basic iterative methods for solving (20.1). In Jacobi's
method the approximations are computed from

x_j^{(k+1)} = x_j^{(k)} + a_j^T (b - A x^{(k)}) / d_j,   j = 1, 2, ..., n,    (20.2)

where a_j is the jth column of A and

d_j = ||a_j||_2^2.    (20.3)

In matrix form this iteration can be written

x^{(k+1)} = x^{(k)} + D^{-1} A^T (b - A x^{(k)}),   D = diag(d_1, ..., d_n).    (20.4)
We note that there is no need to form ATA in the iteration (20.4). This has two
advantages.
(i) A small perturbation in ATA, e.g. by roundoff, may change the solution much
more than perturbations of similar size in A itself.
(ii) We avoid the fill-in which can result from the formation of A^T A. For some
problems A^T A may be much less sparse than A. For example, if A has approximately
n^{1/2} nonzeros randomly distributed in each row, then A^T A will be almost full,
cf. Section 16.
Often the Gauss-Seidel method has a better rate of convergence than Jacobi's
method. It has been shown by BJORCK and ELFVING [1979] how to implement the
Gauss-Seidel method working only with the matrix A. A major iteration step is
divided into n minor steps. We put z^{(1)} = x^{(k)} and compute

z^{(j+1)} = z^{(j)} + e_j a_j^T (b - A z^{(j)}) / d_j,   j = 1, 2, ..., n,    (20.5)

where e_j is the jth unit vector and d_j is defined by (20.3). We then have that
x^{(k+1)} = z^{(n+1)}. Note that in the jth minor step only the jth component of z^{(j)} is
changed. Therefore the residual

r^{(j)} = b - A z^{(j)}

can be updated cheaply in each minor step; for the product with the residual we use an
inner product with a_j, and for the residual update an outer product formulation.
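This one-column-at-a-time organization can be sketched as follows; the sketch (ours, not from the text) keeps the residual r = b - Az up to date so that only the matrix A is ever touched, and checks the result against a direct least squares solve.

```python
import numpy as np

def gauss_seidel_ne(A, b, x0, sweeps):
    """Sketch of the Gauss-Seidel iteration (20.5) for the normal equations,
    working only with A: each major step is n minor steps, each updating one
    component of the iterate and the residual r = b - A z."""
    x = x0.astype(float).copy()
    d = (A * A).sum(axis=0)              # d_j = ||a_j||_2^2, cf. (20.3)
    r = b - A @ x
    for _ in range(sweeps):
        for j in range(A.shape[1]):
            delta = A[:, j] @ r / d[j]   # inner product with the residual
            x[j] += delta
            r -= delta * A[:, j]         # cheap residual update
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)
x = gauss_seidel_ne(A, b, np.zeros(4), sweeps=500)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
```

For this small well-conditioned example a few hundred sweeps reach the direct solution to high accuracy; for ill-conditioned A convergence is much slower, as discussed below.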
The methods discussed above are special cases of the general stationary iterative
method
x^{(k+1)} = G x^{(k)} + c,    (20.8)

which can be obtained from the splitting A^T A = M - N, where M is nonsingular, by
taking G = M^{-1} N, c = M^{-1} A^T b. For the case when rank(A) < n, convergence of the
iteration (20.8) has been investigated by KELLER [1965] and YOUNG [1972].
The concept of splitting has been extended to rectangular matrices by PLEMMONS
[1972]. BERMAN and PLEMMONS [1974] define A = M-N to be a proper splitting if
the ranges and nullspaces of A and M are equal. They show that for a proper
splitting the iteration
x^{(k+1)} = M^†(N x^{(k)} + b)    (20.9)

converges to the pseudoinverse solution x = A^† b for every x^{(0)} if and only if the
spectral radius ρ(M^† N) < 1. The iterative method (20.9) avoids the explicit recourse
to the normal equations. Splittings of rectangular matrices have also been investigated
by CHEN [1975].
A more powerful class of methods can be described by the recursion

x^{(k+1)} = x^{(k-1)} + ω_{k+1}[α A^T r^{(k)} + x^{(k)} - x^{(k-1)}],    (20.10)

where ω_{k+1} and α are parameters, see GOLUB and VARGA [1961]. Assume that the
singular values of A satisfy

a ≤ σ_i^2 ≤ b.

For these methods it is necessary to know bounds on the singular values of A. The
rate of convergence is sensitive to the quality of these bounds.
We now describe the conjugate gradient method of HESTENES and STIEFEL [1952],
which does not have this drawback. For a general discussion of this method see
GOLUB and VAN LOAN ([1983], Sections 10.2-10.3). A slight algebraic rearrangement
given below is needed to make the method perform well for the least squares
problem.
We call this algorithm CGLS. It can also be used in the rank-deficient case. Provided
that x^{(0)} ∈ R(A^T), which holds e.g. if x^{(0)} = 0, x^{(k)} will converge to the pseudoinverse
solution A^† b. In the absence of rounding errors it will compute the exact solution in at
most t iterations, where t ≤ n equals the number of distinct nonzero singular values
of A. However, with roundoff many more than n iterations may be needed if A is
ill-conditioned, and the method is best regarded as an iterative method. The error
measures ||x^{(k)} - A^† b||_2 and ||r^{(k)}||_2 will both decrease monotonically also under
roundoff. When A is well-conditioned and/or the singular values of A are clustered
the method may converge to a sufficiently accurate solution in far fewer than
n iterations.
The algorithm (20.12) requires the storage of two n-vectors x and p and two
m-vectors r and q. (Note that s can share storage with q.) Each iteration requires
about 2nz(A) + 3n + 2m flops, where nz(A) is the number of nonzero elements in A.
ELFVING [1978] has compared the implementation (20.12) with several other
variants of the conjugate gradient algorithm, and found this to be the most accurate.
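A minimal dense sketch of CGLS (our transcription of the standard recurrences; the text's (20.12) is not reproduced verbatim here) looks as follows. Note that A^T A is never formed; only products with A and A^T appear.

```python
import numpy as np

def cgls(A, b, iters):
    """Sketch of CGLS: conjugate gradients on A^T A x = A^T b
    using only matrix-vector products with A and A^T."""
    x = np.zeros(A.shape[1])
    r = b.copy()                 # r = b - A x
    s = A.T @ r                  # s = A^T r
    p = s.copy()
    gamma = s @ s
    for _ in range(iters):
        q = A @ p
        alpha = gamma / (q @ q)
        x += alpha * p
        r -= alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 6))
b = rng.standard_normal(30)
x = cgls(A, b, iters=10)         # exact after at most n steps in exact arithmetic
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
```

For this well-conditioned example a few more than n = 6 iterations already reproduce the direct solution to roundoff level.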
PAIGE and SAUNDERS [1982] have developed an algorithm LSQR based on the
Lanczos bidiagonalization algorithm of GOLUB and KAHAN [1965]. LSQR is
mathematically equivalent to CGLS but is more accurate when A is ill-conditioned.
However, in this case both LSQR and CGLS will converge very slowly.
The basic drawback with iterative methods is that the rate of convergence will
depend on the spectrum of A and can therefore be very slow. For the methods of
Jacobi and Gauss-Seidel the number of iterations needed for a fixed reduction in the
accuracy is proportional to κ^2(A), where κ(A) is the condition number of A. For the
Chebyshev semi-iteration and second-order Richardson method this is reduced to
order κ(A) iterations, provided that accurate bounds for the singular values of A are
known. For the conjugate gradient method no such bounds are needed, and for this
method an upper bound for the number of iterations k needed to reduce the relative
error by a factor of ε is given by

k ≤ (1/2) κ(A) log(2/ε).
Because of generally slow convergence the main emphasis in the development of
iterative methods is on convergence acceleration. A general technique to improve
convergence is by preconditioning, which for the linear least squares problem is
equivalent to a transformation of variables. Let M ∈ R^{n×n} be a nonsingular matrix
and consider the transformed problem

min_y ||A M^{-1} y - b||_2,   y = M x.    (20.13)

If M is chosen so that A M^{-1} has a more favourable spectrum than A, this will
improve the convergence of an iterative method applied to (20.13).
It is important to note that the product A M^{-1} should not be explicitly formed but
treated as a product of two operators. In an iterative method preconditioned by the
matrix M, matrix-vector products of the form A M^{-1} y and M^{-T} A^T r will occur.
Thus the extra cost of preconditioning will be in solving linear systems of the form
Mx = y and MT q = s. Hence M has to be chosen so that such systems can be easily
solved.
Several different preconditioners have been suggested for the least squares
problem. We first note that if we take M = R, where R is the Cholesky factor of A^T A,
then κ(AM^{-1}) = κ(Q_1) = 1. This is the ideal choice for M and all the above iterative
methods will converge in only one step. Thus, a preconditioner should in some sense
approximate R.
The simplest preconditioner corresponds to a diagonal scaling of the columns of
A,
M = D^{1/2} = diag(d_1^{1/2}, ..., d_n^{1/2}),   d_j = ||a_j||_2^2.    (20.14)
Since A M^{-1} has columns of unit length, this approximately minimizes κ(A M^{-1})
over all diagonal scalings, see Remark 6.6.
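The effect of the diagonal scaling (20.14) is easy to demonstrate numerically; the following sketch (ours) scales the columns of a badly column-scaled matrix to unit length and compares condition numbers.

```python
import numpy as np

# Sketch of the column scaling (20.14): M = D^{1/2}, d_j = ||a_j||_2^2,
# so that A M^{-1} has columns of unit 2-norm.
rng = np.random.default_rng(4)
A = rng.standard_normal((50, 5)) * np.array([1e4, 1.0, 1e-3, 10.0, 1e-2])
d = (A * A).sum(axis=0)
AM = A / np.sqrt(d)                      # A M^{-1}: unit-length columns

kappa_before = np.linalg.cond(A)
kappa_after = np.linalg.cond(AM)
```

Here κ(A) is inflated by the artificial column scaling, while κ(AM^{-1}) is of moderate size.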
A more general preconditioner is the SSOR preconditioner. Here we take
M = D^{-1/2}(D + ω L^T),   0 ≤ ω < 2,    (20.15)

where A^T A has been split as

A^T A = L + D + L^T,

with L strictly lower triangular. BJORCK and ELFVING [1979] showed how to
implement this preconditioner without actually forming A^T A or L. Note that taking
ω = 0 in (20.15) gives (20.14).
Another approach is to take M = R, where R is an incomplete Cholesky factor of
ATA, i.e.
ATA =RTR+E,
with E small and R sparse. One way to compute such a matrix R is to use a direct
method for sparse Cholesky factorization, but only keep those elements in R which
lie within a predetermined sparsity structure, cf. MANTEUFFEL [1980].
SAUNDERS [1979] suggests taking M = U P_c^T, where U is the upper triangular factor
in a factorization of the form

P_r A P_c = L U,

where P_r and P_c are permutation matrices and L ∈ R^{m×n} is unit lower trapezoidal. The
rationale for this choice is that any ill-conditioning in A is usually reflected in U, and
L tends to be well-conditioned.
In many large sparse least squares problems arising from multidimensional
models the matrix A has column block structure
A = (A_1, A_2, ..., A_N),    (20.16)

where

A_j ∈ R^{m×n_j},   n_1 + ··· + n_N = n.
One example of such a block structure is the dual block angular form (19.1).
For problems of the structure (20.16) block versions of the preconditioners (20.14)
and (20.15) are particularly suitable. Let the QR decompositions of the blocks be
A_j = Q_j R_j,   Q_j ∈ R^{m×n_j},   j = 1, ..., N.    (20.17)

Then to (20.14) corresponds the block diagonal preconditioner

M = R_B = diag(R_1, R_2, ..., R_N).    (20.18)

For this choice we have A M^{-1} = (Q_1, Q_2, ..., Q_N), i.e. the columns of each block are
mutually orthogonal.
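A small numpy sketch (ours, not from the text) of the block diagonal preconditioner (20.18): QR-factorize each column block, assemble M = diag(R_1, ..., R_N), and check that AM^{-1} = (Q_1, ..., Q_N) with orthonormal columns inside each block.

```python
import numpy as np

rng = np.random.default_rng(5)
blocks = [rng.standard_normal((40, k)) for k in (3, 2, 4)]   # A = (A_1, A_2, A_3)
QRs = [np.linalg.qr(Aj) for Aj in blocks]                    # A_j = Q_j R_j, (20.17)
A = np.hstack(blocks)

M = np.zeros((9, 9))                    # M = diag(R_1, R_2, R_3), cf. (20.18)
ofs = 0
for _, Rj in QRs:
    k = Rj.shape[0]
    M[ofs:ofs + k, ofs:ofs + k] = Rj
    ofs += k

AMinv = np.hstack([Qj for Qj, _ in QRs])   # A M^{-1} = (Q_1, Q_2, Q_3)
```

In an actual sparse implementation M would of course be applied by block triangular solves, not assembled.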
If we split AT A according to
A^T A = L_B + D_B + L_B^T,

where L_B is strictly lower block triangular, then the block SSOR preconditioner
becomes

M = R_B^{-T}(D_B + ω L_B^T).    (20.19)
This preconditioner was introduced for least squares problems by BJORCK [1979].
As for the corresponding point preconditioner it can be implemented without
actually forming AT A.
We now consider the use of the block diagonal preconditioner (20.18). We will
partition x and y = M x conformingly with (20.16),

x = (x_1^T, x_2^T, ..., x_N^T)^T,   y = (y_1^T, y_2^T, ..., y_N^T)^T.

Jacobi's method (20.2) applied to the preconditioned problem (20.13) can then be
written

y_j^{(k+1)} = y_j^{(k)} + Q_j^T (b - A M^{-1} y^{(k)}),   j = 1, ..., N.
This matrix has "Property A", YOUNG [1971]. This means that it is possible to
reduce the work required per iteration to approximately half for many iterative
methods. This preconditioner is also called the cyclic Jacobi preconditioner.
For matrices with "Property A" the SOR theory holds. As shown by ELFVING
[1980] it follows that the optimal ω in the block SOR method (20.22) is given by (for
N = 2)

ω_opt = 2/(1 + sin θ_min),

where θ_min is the smallest principal angle between the subspaces R(A_1) and R(A_2).
For a definition of principal angles between subspaces see e.g. GOLUB and VAN LOAN
([1983], p. 428).
For N = 2, the preconditioner (20.19) with ω = 1 also has special properties, see
GOLUB, MANNEBACK and TOINT [1986]. In this case the two blocks of A M^{-1} in
(20.25) are mutually orthogonal, and thus the preconditioned problem (20.13) can be
split into two problems, the second of which is

min_{y_2} ||P Q_2 y_2 - b̃||_2,   b̃ = P b,    (20.26)

where P is the orthogonal projector onto the orthogonal complement of R(A_1).
This effectively reduces the original system to a system of size n_2. Hence this
preconditioner is also called the reduced system preconditioner. The matrix of the
normal equations of (20.26) becomes

I - K^T K,   where K = Q_1^T Q_2.
We now consider preconditioning the conjugate gradient method (20.12). It is
convenient to formulate the preconditioned method in terms of the original
variables x. Let x^{(0)} be an initial approximation and r^{(0)} = b - A x^{(0)}. Take

q^{(k)} = A M^{-1} p^{(k)}

and proceed as in (20.12) with A replaced by A M^{-1}.
In order to use the block SSOR preconditioner (20.19) for the conjugate gradient
method we have to be able to compute vectors q = A M^{-1} p and s = M^{-T} A^T r
efficiently, given p = (p_1^T, ..., p_N^T)^T and r. The following algorithms for this are derived
in BJORCK [1979].
Put q^{(N)} = 0 and for j = N, N-1, ..., 1 compute

z_j = R_j^{-1}(p_j - ω R_j^{-T} A_j^T q^{(j)}),   q^{(j-1)} = q^{(j)} + A_j z_j.

Then we have

q = q^{(0)},   M^{-1} p = (z_1^T, ..., z_N^T)^T.
Chebyshev semi-iteration and the conjugate gradient method. The reduced system
preconditioning will essentially generate the same approximations in half the
number of iterations. Since the work per iteration is about doubled for ω ≠ 0, this
means that for N = 2 cyclic Jacobi preconditioning is optimal for the conjugate
gradient method in the class of SSOR preconditioners.
GOLUB, MANNEBACK and TOINT [1986] have applied the block SSOR precondi-
tioned conjugate gradient method to the Doppler positioning problem. Here the
matrix is of dual block angular form (19.1) but is partitioned into two blocks (A, B)
where A consists of all the diagonal blocks and B of the last block column in (19.1).
Some experimental results of block SSOR preconditioning for problems
partitioned into more than two blocks are given by BJORCK [1979]. In this case ω = 1
is not necessarily optimal. However, in the tests the optimal value of ω was close to
one and the number of iterations required varied slowly with ω around ω = 1.
We finally consider a method based on Lanczos bidiagonalization for computing
approximations to the large singular values and the corresponding singular vectors
of A ∈ R^{m×n}, m ≥ n. Let

U^T A V = ( B ),    (20.29)
          ( 0 )

where

    ( α_1  β_1                  )
    (      α_2  β_2             )
B = (           ...   ...       )
    (                 β_{n-1}   )
    (                 α_n       )

is upper bidiagonal. After k steps of the bidiagonalization algorithm we have
computed U_k, V_k and the leading k×k part of B,

      ( α_1  β_1                  )
      (      α_2  β_2             )
B_k = (           ...   ...       )
      (                 β_{k-1}   )
      (                 α_k       ),

whose singular values tend to be very good approximations to the large singular
values of A. Approximations to the corresponding singular vectors of A can be
found from U_k, V_k and the singular vectors of B_k. A discussion of this method may
be found in GOLUB, LUK and OVERTON [1981].
A similar bidiagonalization algorithm forms the basis for the method LSQR of
PAIGE and SAUNDERS [1982] for solving least squares problems. For this purpose it
turns out that the transformation of A to lower bidiagonal form is more appropriate.
In (20.29) we now take B lower bidiagonal,

    ( α_1              )
    ( β_1  α_2         )
B = (      ...   ...   )
    (            α_n   )
    (            β_n   ).

Formulas corresponding to (20.31) for computing U = (u_1, ..., u_m) and V = (v_1, ..., v_n)
are easily derived: for j = 1, 2, ..., n

v_j = r_j/α_j,   α_j = ||r_j||_2,   r_j = A^T u_j - β_{j-1} v_{j-1},    (20.32)
u_{j+1} = p_j/β_j,   β_j = ||p_j||_2,   p_j = A v_j - α_j u_j,

where we take v_0 = 0.
For solving min_x ||Ax - b||_2 we start the recursion (20.32) with

u_1 = b/β_0,   β_0 = ||b||_2,    (20.33)
and compute v_1, u_2, v_2, ... and the corresponding elements in B. After k steps we have
computed V_k = (v_1, ..., v_k), U_{k+1} = (u_1, ..., u_{k+1}) and

      ( α_1                )
      ( β_1  α_2           )
B_k = (      β_2  ...      )    (20.34)
      (           ...  α_k )
      (                β_k ).

We now seek a solution x_k ∈ R(V_k) and put x_k = V_k y_k. Then

||A V_k y_k - b||_2 = ||U_{k+1}^T (A V_k y_k - b)||_2 = ||B_k y_k - β_0 e_1||_2,

where we have used the relation A V_k = U_{k+1} B_k and (20.33). It follows that we have
x_k = V_k y_k, where y_k solves the least squares problem

min_{y_k} ||B_k y_k - β_0 e_1||_2.
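The lower bidiagonalization recursion is short enough to sketch directly; the following numpy version (ours, without the reorthogonalization a robust implementation would need) runs the recursion and checks the fundamental relation A V_k = U_{k+1} B_k.

```python
import numpy as np

def lower_bidiag(A, b, k):
    """Sketch of the recursion (20.32)-(20.33): k steps of lower
    bidiagonalization started from u_1 = b/beta_0."""
    m, n = A.shape
    beta0 = np.linalg.norm(b)
    U = [b / beta0]
    V, alphas, betas = [], [], []
    v_old, beta_old = np.zeros(n), 0.0
    for j in range(k):
        r = A.T @ U[j] - beta_old * v_old      # r_j = A^T u_j - beta_{j-1} v_{j-1}
        alpha = np.linalg.norm(r)
        v = r / alpha
        p = A @ v - alpha * U[j]               # p_j = A v_j - alpha_j u_j
        beta = np.linalg.norm(p)
        U.append(p / beta)
        V.append(v); alphas.append(alpha); betas.append(beta)
        v_old, beta_old = v, beta
    Bk = np.zeros((k + 1, k))                  # lower bidiagonal, cf. (20.34)
    for j in range(k):
        Bk[j, j] = alphas[j]
        Bk[j + 1, j] = betas[j]
    return np.column_stack(V), np.column_stack(U), Bk, beta0

rng = np.random.default_rng(6)
A = rng.standard_normal((12, 7))
b = rng.standard_normal(12)
V, U, Bk, beta0 = lower_bidiag(A, b, k=4)
```

With only matrix-vector products, this is exactly the data LSQR needs: the projected problem min ||B_k y_k - β_0 e_1||_2 replaces the original one.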
then by Theorem 6.2 the solution to the least squares problem min_x ||Ax - b||_2 is
given by

R x = z,   ||Ax - b||_2 = ρ.
Hence, by updating the factor (21.1) the updating algorithms given below can be
used to update least squares solutions.
The principal aim of the algorithms in this section is to update the complete QR
decomposition of A,

A = Q ( R ),    (21.2)
      ( 0 )

where Q ∈ R^{m×m} is either stored explicitly or as a product of elementary orthogonal
transformations. For some of the updating problems the more compact factorization
A = Q_1 R, Q_1 ∈ R^{m×n}, can be updated.
In some applications the factor Q is not available and one wants to update only
the R factor. This may be the case when A is large and/or sparse. We discuss such
algorithms for some of the updating problems. These algorithms are not always as
reliable as the methods that update both Q and R.
We give in this section algorithms for updating the QR decomposition of A for
three important kinds of modifications:
(1) rank-one changes of A,
(2) adding or deleting a column of A,
(3) adding or deleting a row of A.
570 A. Björck CHAPTER IV
Ā = A + u v^T = Q̄ ( R̄ ),    (21.3)
                  ( 0 )

where u ∈ R^m and v ∈ R^n are given. For simplicity we assume that rank(A) = rank(Ā) = n,
so that R and R̄ are uniquely determined.
We first compute

w = Q^T u,

so that

A + u v^T = Q [ ( R ) + w v^T ].    (21.4)
                ( 0 )

We next determine Givens rotations J_k = R_{k,k+1}, k = m-1, ..., 1, such that

J_1^T ··· J_{m-1}^T w = γ e_1,   γ = ±||w||_2.

Note that these transformations zero out the last m-1 components of w from the
bottom up. (For details on how to compute J_k see Section 8.) The same
transformations are then applied to the R factor of A, i.e.

J_1^T ··· J_{m-1}^T ( R ) = H,    (21.5)
                    ( 0 )

where H is upper Hessenberg. Then H̄ = H + γ e_1 v^T is also upper Hessenberg, and
we determine Givens rotations G_k such that

G_n^T ··· G_1^T H̄ = ( R̄ )
                    ( 0 )

is upper triangular. Here G_k, k = 1, ..., n, will zero the element in position (k+1, k).
Finally the transformations are accumulated into Q to get

Q̄ = Q J_{m-1} ··· J_1 G_1 ··· G_n.

Q̄ and R̄ now give the desired decomposition (21.3).
The work needed for this update is as follows: Computing w = Q^T u takes m^2 flops.
Computing H̄ and R̄ takes 4n^2 flops, and accumulating the transformations J_k and
G_k into Q takes 4(m^2 + mn) flops. This gives a total of 5m^2 + 4mn + 4n^2 flops. If m = n
we have decreased the work from O(n^3) to O(n^2). However, if m ≫ n then this
updating may be more expensive than computing the decomposition from scratch.
The method described can be generalized to compute the QR decomposition of
A + U V^T, where rank(U V^T) > 1.
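The two sweeps of Givens rotations in the rank-one update can be sketched as follows; this dense numpy version (ours; the helper names are illustrative) applies the bottom-up sweep to w and (R; 0), adds γ e_1 v^T, restores triangular form, and checks the result against A + uv^T.

```python
import numpy as np

def rot(a, b):
    """c, s such that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

def qr_rank_one_update(Q, R, u, v):
    """Sketch of (21.3)-(21.5): QR factors of A + u v^T from A = Q [[R],[0]]."""
    m, n = Q.shape[0], R.shape[1]
    Q = Q.copy()
    H = np.zeros((m, n))
    H[:n] = R
    w = Q.T @ u
    for k in range(m - 2, -1, -1):          # J_k: zero w[k+1], bottom up
        c, s = rot(w[k], w[k + 1])
        G = np.array([[c, s], [-s, c]])
        w[k:k + 2] = G @ w[k:k + 2]
        H[k:k + 2] = G @ H[k:k + 2]         # H becomes upper Hessenberg
        Q[:, k:k + 2] = Q[:, k:k + 2] @ G.T
    H[0] += w[0] * v                        # add gamma * e_1 v^T
    for k in range(n):                      # G_k: zero H[k+1, k]
        c, s = rot(H[k, k], H[k + 1, k])
        G = np.array([[c, s], [-s, c]])
        H[k:k + 2] = G @ H[k:k + 2]
        Q[:, k:k + 2] = Q[:, k:k + 2] @ G.T
    return Q, H[:n]

rng = np.random.default_rng(7)
m, n = 6, 4
A = rng.standard_normal((m, n))
u, v = rng.standard_normal(m), rng.standard_normal(n)
Q, Rfull = np.linalg.qr(A, mode='complete')
Q1, R1 = qr_rank_one_update(Q, Rfull[:n], u, v)
```

The O(m^2) cost quoted above comes from the accumulation of the rotations into the dense Q, visible in the two column updates.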
where
A_1 = [a_1, ..., a_p],   A_2 = [a_{p+1}, ..., a_n].
It follows immediately that
Here P_L is a permutation matrix which performs a left circular shift of the columns
a_k, ..., a_n. We now write

R P_L = [r_1, ..., r_{k-1}, r_{k+1}, ..., r_n, r_k] = ( R_11  R_13  v    )
                                                      (  0    w^T  r_kk )
                                                      (  0    R_33  0   ),

where R_11 ∈ R^{(k-1)×(k-1)} and R_33 ∈ R^{(n-k)×(n-k)} are upper triangular. Hence, the
submatrix

( w^T   r_kk )
( R_33   0  )

is upper Hessenberg, and we can determine Givens rotations G_i such that

G_{n-k}^T ··· G_1^T ( w^T   r_kk ) = R̃_22
                    ( R_33   0  )

is upper triangular. Note that the last column will fill in. The updated R factor
becomes

R̄ = ( R_11  R_13  v )
    (  0     R̃_22   ).
It is obvious that in this case the R factor can be updated without Q being available.
We point out that more generally it is often required to compute the QR
factorization of
[a_1, ..., a_{k-1}, a_{k+1}, ..., a_p, a_k, a_{p+1}, ..., a_n],

i.e. of the matrix resulting from a left circular shift applied to the columns a_k, ..., a_p,
cf. DONGARRA et al. ([1979], p. 10.2). This can be done by an obvious extension of
the above algorithm.
SECTION 21 Modified and generalized least squares problems 573
Q^T A = ( R )
        ( 0 )

by Householder's method. We then form the vector

w = Q^T a_{n+1} = ( u ) } n
                  ( v ) } m-n,    (21.10)

and construct a Householder transformation P such that P v = γ e_1, γ = ±||v||_2. Then
the updated factor is

R̄ = ( R  u ).    (21.11)
    ( 0  γ )
We now write

R P_R = [r_1, ..., r_{k-1}, r_p, r_k, ..., r_{p-1}, r_{p+1}, ..., r_n] = ( R_11  u_1  R_12 )
                                                                        (  0    u_2  R_22 ),

and determine Givens rotations G_i, i = p-1, ..., k, to zero the elements below the
diagonal in the kth column of R P_R. Then

G_k^T ··· G_{p-1}^T R P_R = ( R̄ )

is upper triangular, and

Ā = Q̄ ( R̄ ),   Q̄ = Q G_{p-1} ··· G_k.
      ( 0 )
574 A. Bjorck CHAPTER IV
i.e. of the matrix resulting from a right circular shift of the columns ak,..., ap.
When Q is not stored or is unavailable, the vector u in (21.10) can be found by
solving the system
R^T u = A^T a_{n+1},    (21.12)

γ^2 = ||a_{n+1}||_2^2 - ||u||_2^2.    (21.13)
We can then proceed to determine the factor R̄ in (21.11) as above. However,
rounding errors could cause this method to fail if the new column a_{n+1} is nearly
dependent on the columns of A. (In this case the expression (21.13) for γ^2 might even
become negative!) As remarked in GILL et al. ([1974], p. 532), if R is built up by
a sequence of these modifications, in which the columns of A are added one by one,
the process is exactly that of computing M=ATA and finding its Cholesky
factorization. As has been remarked in Section 6 this is numerically less satisfactory
than computing R by an orthogonalization method.
If A has been saved a more accurate method for appending a column in the case
when Q is not available has been suggested by BJORCK [1986]. Here one computes
w as the solution to the least squares problem

min_z ||A z - a_{n+1}||_2.
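The cheap (but, as warned above, less stable) column-appending formulas (21.12)-(21.13) are easy to sketch; the following numpy fragment (ours) appends one column using only R and the new column, and checks the Cholesky identity R̄^T R̄ = (A, a_{n+1})^T (A, a_{n+1}).

```python
import numpy as np

# Sketch of appending a column via (21.12)-(21.13), Q not available.
rng = np.random.default_rng(8)
A = rng.standard_normal((10, 3))
a_new = rng.standard_normal(10)

R = np.linalg.qr(A, mode='r')
u = np.linalg.solve(R.T, A.T @ a_new)          # R^T u = A^T a_new, (21.12)
gamma2 = a_new @ a_new - u @ u                 # (21.13); can go negative with roundoff
gamma = np.sqrt(gamma2)                        # unstable if a_new nearly in range(A)
R_new = np.block([[R, u[:, None]],
                  [np.zeros((1, 3)), np.array([[gamma]])]])

A_new = np.column_stack([A, a_new])
```

As the text notes, building R this way column by column is numerically equivalent to Cholesky factorization of A^T A, which is why the cancellation in (21.13) can be harmful.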
Given the QR decomposition of a matrix A ∈ R^{m×n} we first consider the problem of
computing the QR decomposition of A with a row w^T appended,

Ā = ( A  ).
    ( w^T )
One way to solve this problem has been described in Algorithm 17.1. Sequential row
orthogonalization. If we have obtained the decomposition
Q^T A = ( R ),
        ( 0 )

then

Π_{n+1,m+1} diag(Q^T, 1) Ā = ( R  ),
                             ( w^T )
                             ( 0  )

where Π_{n+1,m+1} is a permutation matrix interchanging the rows n+1 and m+1. The
elements in w^T can now be eliminated by a sequence of Givens rotations G_k = R_{k,n+1},
k = 1, ..., n, giving

G_n^T ··· G_1^T Π_{n+1,m+1} diag(Q^T, 1) Ā = ( R̄ ).
                                             ( 0 )
We next consider deleting the first row of A: given

A = ( a_1^T ) = Q ( R ),    (21.15)
    (  Ã   )      ( 0 )

we want the QR decomposition of Ã.
Let q^T ∈ R^m be the first row of Q and determine Givens rotations J_k = R_{k,k+1},
k = m-1, ..., 1, so that

J_1^T ··· J_{m-1}^T q = e_1.    (21.16)

Then

J_1^T ··· J_{m-1}^T ( R ) = ( v^T ),    (21.17)
                    ( 0 )   ( R̄  )
                            ( 0  )

where the matrix R̄ is upper triangular. Note that the transformations J_{n+1}, ...,
J_{m-1} will not affect R. Further we compute

Q J_{m-1} ··· J_1 = Q̄.

From (21.16) it follows that the first row of Q̄ is e_1^T. Since Q̄ is orthogonal it must
have the form

Q̄ = ( 1  0 ),
    ( 0  Q̃ )

and it follows that v^T = a_1^T and

Ã = Q̃ ( R̄ ).
      ( 0 )
This algorithm for downdating is a special case of the rank-one change algorithm,
and is obtained by taking u = -e_1, v = a_1 in (21.3).
In the downdating algorithm the first row of Q played an essential role. If Q is not
available we can proceed as follows. Taking the first row in (21.15) we obtain

a_1^T = q^T ( R ) = (q_1^T, q_2^T) ( R ) = q_1^T R,
            ( 0 )                  ( 0 )

where q has been partitioned conformingly with the R factor. Hence, we can obtain
q_1 by solving

R^T q_1 = a_1,
and the transformations J_{m-1}, ..., J_{n+1} involve only q_2 and, as remarked above,
will not affect R. Thus, we may determine the Givens transformations J_k from the
vector (q_1^T, γ)^T, where

γ = (1 - ||q_1||_2^2)^{1/2} = ||q_2||_2,    (21.18)

and obtain the updated factor R̄ as in (21.17). This method is implemented in the
LINPACK package and is described in STEWART [1979b].
LINPACK package and is described in STEWART [1979b].
We note that if γ ≤ u^{1/2}, where u is the unit roundoff, then γ cannot be computed
stably from (21.18) because of severe cancellation in the subtraction 1 - ||q_1||_2^2.
Therefore this algorithm will not be as stable as that using information from Q.
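A dense sketch of this LINPACK-style downdating (ours; variable names are illustrative) rotates the vector (q_1, γ) to e_{n+1} while applying the same rotations to R with an appended zero row; the deleted row reappears in that extra row, and the remaining block is the downdated factor.

```python
import numpy as np

def downdate_r(R, a1):
    """Sketch of downdating without Q: R factor of A with row a1^T removed,
    computed from R and a1 only (cf. Stewart [1979b])."""
    n = R.shape[0]
    q = np.linalg.solve(R.T, a1)            # R^T q_1 = a_1
    gamma2 = 1.0 - q @ q
    if gamma2 <= 0.0:                       # cancellation danger, cf. (21.18)
        raise ValueError("downdate unstable: gamma^2 <= 0")
    gamma = np.sqrt(gamma2)
    Rb = np.vstack([R, np.zeros((1, n))])   # extra row receives a1^T
    w = np.append(q, gamma)
    for k in range(n - 1, -1, -1):          # rotations in planes (k, n)
        r = np.hypot(w[k], w[n])
        c, s = w[n] / r, w[k] / r           # zero w[k] against w[n]
        w[k], w[n] = 0.0, r
        row_k, row_n = Rb[k].copy(), Rb[n].copy()
        Rb[k] = c * row_k - s * row_n
        Rb[n] = s * row_k + c * row_n
    return Rb[:n], Rb[n]                    # downdated R, recovered row a1^T

rng = np.random.default_rng(9)
A = rng.standard_normal((8, 3))
R = np.linalg.qr(A, mode='r')
Rd, row = downdate_r(R, A[0])
```

The accumulated rotation maps (q_1, γ) to e_{n+1}, so the last row of the rotated factor equals q_1^T R = a_1^T, and R̄^T R̄ = R^T R - a_1 a_1^T as required.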
where i denotes the imaginary unit (i2 = - 1). Hence, deleting the row a1 is formally
equivalent to adding the row ia1. We can apply the algorithm above for adding
a row, which does not depend on Q. The resulting algorithm can be expressed
entirely in real arithmetic, see LAWSON and HANSON ([1974], pp. 229-231). However,
an error analysis given by BOJANCZYK, BRENT, VAN DOOREN and DE HOOG [1987]
indicates that the stability properties of this algorithm are inferior to those of
Stewart's algorithm. They give a modification of the algorithm which substantially
improves its stability.
We stress again that the algorithms for updating and downdating the QR
decomposition can be used to add and remove rows from a least squares problem by
applying them to the augmented matrix (A, b). More details on this use can be found
in DONGARRA et al. ([1979], Chapter 10).
Many of the above updating algorithms require the m×m orthogonal factor
Q and require O(m^2) flops. In many applications, e.g. to nonlinear least squares
problems, knowledge of the m×n factor Q_1 would suffice. DANIEL, GRAGG, KAUFMAN
and STEWART [1976] have developed stable and relatively efficient methods for
updating the Gram-Schmidt QR factorization. These methods only require O(mn)
storage and operations.
BUNCH and NIELSEN [1978] have developed methods for updating the singular
value decomposition of A, when A is modified by appending or deleting a row or
column. Also algorithms for solving the correspondingly modified least squares
problems are developed. These updating methods however all require O(n^3)
operations when A ∈ R^{m×n}, m ≥ n.
In Section 3 we derived the singular value decomposition (SVD) of a matrix A ∈ R^{m×n}
of rank r. Here we consider a generalization to a pair of matrices
A ∈ R^{m×n} and B ∈ R^{p×n} with the same number of columns. This generalized singular
value decomposition (GSVD) and its application to certain constrained least
squares problems was first studied by VAN LOAN [1976]. We will use a computationally
more amenable form given by PAIGE and SAUNDERS [1981]. In the special
case that A and B are blocks of a partitioned matrix having orthonormal columns
the GSVD simplifies to the CS decomposition (CSD). Recently, stable algorithms for
computing the GSVD based on the CSD have been developed by STEWART [1982,
1983] and VAN LOAN [1984b]. PAIGE [1986] gives an algorithm which consists of an
iterative sequence of cycles where each cycle is made up of the serial application of
2 x 2 generalized singular value decompositions.
We first consider the CS decomposition, which is of interest in its own right.
THEOREM 22.1 (CS decomposition). Suppose the n columns of the real matrix

Q = ( Q_1 ) } m    ∈ R^{(m+p)×n},   m ≥ n,    (22.2)
    ( Q_2 ) } p

are orthonormal, i.e. Q^T Q = Q_1^T Q_1 + Q_2^T Q_2 = I_n. Then there are orthogonal
matrices U_1 ∈ R^{m×m}, U_2 ∈ R^{p×p} and V ∈ R^{n×n} such that

U_1^T Q_1 V = ( C )    (22.3)
              ( 0 )

and

U_2^T Q_2 V = S,    (22.4)

where C = diag(c_1, ..., c_n) and S ∈ R^{p×n} is diagonal with nonnegative diagonal
elements s_i satisfying c_i^2 + s_i^2 = 1, and the c_i are ordered so that

0 ≤ c_1 ≤ c_2 ≤ ··· ≤ c_n ≤ 1.    (22.5)
PROOF (cf. STEWART [1982]). To construct U_1, V and C, note that since U_1 and V are
orthogonal and C is a nonnegative diagonal matrix, (22.3) is the singular value
decomposition of Q_1. Hence the elements c_i are the singular values of Q_1.
If we put Q̃_2 = Q_2 V, then the matrix

( U_1^T Q_1 V )   ( C  )
(    Q_2 V    ) = ( 0  )
                  ( Q̃_2 )

has orthonormal columns, which implies that Q̃_2^T Q̃_2 = I_n - C^2 is diagonal, and
hence the matrix Q̃_2 = (q̃_1^{(2)}, ..., q̃_n^{(2)}) has orthogonal columns.
We assume that the singular values c_i of Q_1 have been ordered according to (22.5)
and that c_r < c_{r+1} = 1. Then the matrix U_2 = (u_1^{(2)}, ..., u_p^{(2)}) is constructed as
follows. Since ||q̃_j^{(2)}||_2^2 = 1 - c_j^2 ≠ 0, j ≤ r, we take

u_j^{(2)} = q̃_j^{(2)} / ||q̃_j^{(2)}||_2,   j = 1, ..., r,

and fill the possibly remaining columns of U_2 with orthonormal vectors in the
orthogonal complement of R(Q̃_2). From the construction it follows that U_2 ∈ R^{p×p} is
orthogonal and that (22.4) holds.
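The constructive proof translates directly into a few lines of numpy; the following sketch (ours) builds U_1, V, C from the SVD of Q_1 and obtains the columns of U_2 by normalizing Q_2 V, assuming for simplicity that all c_i < 1 (numpy orders the c_i decreasingly rather than as in (22.5)).

```python
import numpy as np

rng = np.random.default_rng(10)
m, p, n = 5, 4, 3
Z = np.linalg.qr(rng.standard_normal((m + p, n)))[0]   # orthonormal columns
Q1, Q2 = Z[:m], Z[m:]

U1, c, Vt = np.linalg.svd(Q1)          # (22.3): U1^T Q1 V = [[C], [0]]
V = Vt.T
Q2t = Q2 @ V                           # columns are orthogonal
s = np.linalg.norm(Q2t, axis=0)        # s_i = (1 - c_i^2)^{1/2}
U2cols = Q2t / s                       # first columns of U2 (all c_i < 1 here)
```

For a full U_2 one would append an orthonormal basis of the complement of the span of these columns, exactly as in the proof.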
The CS decomposition can be stated in a related form which is often useful. Let
Q ∈ R^{m×m} be orthogonal and consider the partitioning

Q = ( Q_11  Q_12 ) } j
    ( Q_21  Q_22 ) } k,   j + k = m,   k ≤ j.

Then there exist orthogonal matrices U_1, V_1 ∈ R^{j×j} and U_2, V_2 ∈ R^{k×k} such that

( U_1^T        ) ( Q_11  Q_12 ) ( V_1      )   ( C  0  -S ) } k
(        U_2^T ) ( Q_21  Q_22 ) (      V_2 ) = ( 0  I   0 ) } j-k    (22.6)
                                               ( S  0   C ) } k

where

C = diag(c_1, ..., c_k),   S = diag(s_1, ..., s_k),
c_i = cos θ_i,   s_i = sin θ_i,   0 ≤ θ_i ≤ π/2,   i = 1, ..., k.
The decomposition (22.6) can be proved in a way similar to the proof of Theorem
22.1. A proof of a slightly more general decomposition, where Q_11 and Q_22 are not
required to be square matrices is given by PAIGE and SAUNDERS [1981].
Note that the decomposition (22.6) treats rows and columns of Q in a symmetric
way. The matrix on the right-hand side of (22.6) is a generalization of a Givens
rotation matrix, cf. (8.9) and its transpose is its inverse. The decomposition (22.6) was
first explicitly given by STEWART [1977a] who remarked that it "often enables one to
obtain routine computational proofs of geometric theorems that would otherwise
require considerable ingenuity to establish."
The CS decomposition now enables us to give a constructive development of the
GSVD of two matrices A and B with the same number of columns.
THEOREM 22.2 (Generalized singular value decomposition (GSVD)). Let A ∈ R^{m×n},
m ≥ n, and B ∈ R^{p×n} be given matrices. Assume that

rank(M) = k ≤ n,   M = ( A ).
                       ( B )

Then there are orthogonal matrices U_A ∈ R^{m×m}, U_B ∈ R^{p×p} and a matrix Z ∈ R^{k×n}
of rank k such that

A = U_A ( D_A ) Z,   B = U_B ( D_B ) Z,
        (  0  )              (  0  )

where D_A = diag(c_1, ..., c_k) and D_B = diag(s_1, ..., s_k) have nonnegative diagonal
elements satisfying c_i^2 + s_i^2 = 1.

PROOF. The SVD of M can be written

M = Q ( Σ  0 ) P^T,   Σ = diag(σ_1, ..., σ_k),   σ_i > 0,    (22.9)
      ( 0  0 )

where Q and P are orthogonal matrices of order (m+p) and n respectively, partitioned
as

Q = ( Q_11  Q_12 ) } m    P = (P_1, P_2),
    ( Q_21  Q_22 ) } p,

with Q_11 ∈ R^{m×k}, Q_21 ∈ R^{p×k} and P_1 ∈ R^{n×k}. Let

Q_11 = U_A ( C ) V^T,   Q_21 = U_B ( S ) V^T
           ( 0 )                   ( 0 )

be the CS decomposition of Q_11 and Q_21. Substituting this into (22.9) we obtain

A P = U_A ( C ) V^T (Σ, 0),   B P = U_B ( S ) V^T (Σ, 0),
          ( 0 )                         ( 0 )

and the theorem follows with D_A = C, D_B = S and Z = V^T (Σ, 0) P^T.
Let A ∈ R^{m×n} be a known matrix, b ∈ R^m a known vector, and x ∈ R^n an unknown
parameter vector which is to be estimated. The general Gauss-Markoff linear model
has the form
Ax + ε = b,   V(ε) = σ^2 W,    (23.1)

where ε is a random vector with zero mean and covariance matrix σ^2 W, and W is
a symmetric nonnegative-definite matrix.
For W = I we get the special linear model discussed in Section 1. In Section 14 we
considered weighted linear models where W was a positive diagonal matrix and also
the model (23.1) for general positive-definite matrices W. In the following we will
treat the general case, when both matrices A and W may be rank-deficient.
We will assume that we are given W in factored form
W = B B^T,   B ∈ R^{m×p},   p ≤ m.    (23.2)
where q = r + s - k,

D_A = diag(α_1, ..., α_q),   D_B = diag(β_1, ..., β_q),
0 < α_1 ≤ ··· ≤ α_q < 1,   1 > β_1 ≥ ··· ≥ β_q > 0,

and D_A^2 + D_B^2 = I. Note that the row partitionings in (23.4) are the same.
If we partition the orthogonal matrices U and V conformingly with the column
blocks on the right-hand sides in (23.4)
where R ∈ R^{k×k} is upper triangular and nonsingular. In the model (23.3) we make the
orthogonal transformations of variables

x̃ = U^T x,   ũ = V^T u.    (23.6)
Then, using (23.4) and (23.5), the model (23.3) is transformed into a block form (23.7),
where we have partitioned R and c = Q^T b conformingly with the block rows of the
two block diagonal matrices in (23.7).
We first note that x̃_1 has no effect on b and therefore cannot be estimated. The
decomposition

x = x^n + x^e,   x^n = U_1 x̃_1,   x^e = U_2 x̃_2 + U_3 x̃_3,

splits x into a nonestimable part x^n and an estimable part x^e. Further, x̃_3 can be
determined exactly from
R_33 x̃_3 = c_3.
Note that x̃_3 has dimension k - s = rank(A, B) - rank(B), so that this case can only
occur when rank(B) < m.
The second block row in (23.8) gives the linear model

D_A x̃_2 + D_B ũ_2 = R_22^{-1}(c_2 - R_23 x̃_3),

where from (23.6) we have that V(ũ_2) = σ^2 I. Here the right-hand side is known, and
the best linear unbiased estimate of x̃_2 is

x̂_2 = D_A^{-1} R_22^{-1}(c_2 - R_23 x̃_3).    (23.9)
The error satisfies D_A(x̂_2 - x̃_2) = D_B ũ_2 and hence the error covariance is

V(x̂_2 - x̃_2) = σ^2 (D_A^{-1} D_B)^2,

and so has uncorrelated components. Note that the covariance need not be large
even if D_A has small elements, provided that the corresponding elements in D_B are
also small.
The random vector ũ_3 has no effect on b. The dimension of ũ_3 is p - s = p - rank(B)
and so is zero if B has linearly independent columns. Finally, the vector ũ_1 can be
determined exactly from the system (23.8). Since ũ_1 has zero mean and covariance
matrix σ^2 I it can be used in estimating σ^2. Note that ũ_1 has dimension
k - r = rank(A, B) - rank(A).
REMARK 23.1. It can be shown that the best linear unbiased estimate of any estimable
function of x in the model (23.3) is given by the solution to the constrained least
squares problem
min ||v||₂,   s.t.  Ax + Bv = b.
v,x
In the total least squares (TLS) problem we allow errors in both A and b and solve

min ||(E, r)||_F,   s.t.  b + r ∈ R(A + E).  (24.2)
E,r

If a minimizing pair (E, r) has been found for the problem (24.2), then any x satisfying
(A + E)x = b + r is said to solve the TLS problem.
SECTION 24 Modified and generalized least squares problems 585
In the statistical literature this problem is known as latent root regression. The
total least squares problem has been analyzed in terms of the singular value
decomposition by GOLUB and VAN LOAN [1980], see also GOLUB and VAN LOAN
([1983], Section 12.3). They consider the more general problem of minimizing
||D(E, r)T||_F, where D and T are nonsingular diagonal scaling matrices. In the
following we assume for notational convenience that the row and column scalings
have been applied explicitly to (A, b), so that we can take D = I_m and T = I_{n+1}.
We note that the constraint in (24.2) implies that
b + r ∈ R(A + E).  (24.3)
It can also be written

(A + E, b + r) ( x ; −1 ) = 0,  (24.4)

so the TLS problem is to find a perturbation ΔC = (E, r) of smallest Frobenius norm which makes the matrix C = (A, b) rank-deficient. Let C have the singular value decomposition

C = U diag(σ1, ..., σ_{n+1}) V^T,   σ1 ≥ ··· ≥ σ_k > σ_{k+1} = ··· = σ_{n+1},  (24.5)

and the minimum is attained for ΔC = −C v v^T, where v is any unit vector in the
subspace

S_C = span[v_{k+1}, ..., v_{n+1}].
Assume that we can find a unit vector v in S_C whose (n + 1)st component is nonzero.
Then we can write

v = ( y ; η ),   η ≠ 0,   x = −y/η,

so that (x^T, −1)^T is proportional to v.
The TLS problem may fail to have a solution, as is illustrated by the following
simple example. Let

A = ( 1  0 ),   b = ( 1 ).
    ( 0  0 )        ( 1 )

Here we have b ∈ R(A + E) for E = diag(0, ε) and any ε ≠ 0, so there does not exist
a smallest value of ||(E, r)||_F.
The TLS problem has no solution if e_{n+1} = (0, ..., 0, 1)^T is orthogonal to S_C. If
σ_{n+1} is a repeated singular value, i.e. k < n in (24.5), then the TLS problem may lack
a unique solution. In this case a unique minimum norm TLS solution can be
determined as follows. Let Q be an orthogonal matrix of order n − k + 1 such that

(v_{k+1}, ..., v_{n+1}) Q = ( Y  y ),
                            ( 0  η )

where η ≠ 0 is a scalar.
If we set x = −y/η, then it is easy to show that all other solutions to the TLS
problem have larger norms.
Let the singular values of A be σ̂1 ≥ σ̂2 ≥ ··· ≥ σ̂n. Then the separation theorem for singular values, Theorem 3.5, implies that

σ1 ≥ σ̂1 ≥ σ2 ≥ ··· ≥ σ̂n ≥ σ_{n+1}.

It follows that if the condition σ̂n > σ_{n+1} is satisfied then σ_{n+1} is not a repeated
singular value of C. It also follows that v_{n+1} must have a nonzero last component,
since otherwise v_{n+1} = (y^T, 0)^T with ||Ay||₂ = σ_{n+1}||y||₂, so that σ̂n = σ_{n+1}. Hence the condition σ̂n > σ_{n+1} is sufficient for the TLS problem to
have a unique solution.
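As a small illustration, in the unique-solution case the TLS solution can be computed directly from the SVD of C = (A, b). The following sketch is our own (the function name and tolerance are not from the text) and assumes the genericity condition σ̂n > σ_{n+1}:

```python
import numpy as np

def tls_solve(A, b):
    # Total least squares via the SVD of C = (A, b): the TLS solution is
    # x = -v[:n]/v[n], where v is the right singular vector of C belonging
    # to the smallest singular value sigma_{n+1}.
    m, n = A.shape
    C = np.column_stack([A, b])
    _, _, Vt = np.linalg.svd(C)
    v = Vt[-1]                      # right singular vector for sigma_{n+1}
    if abs(v[-1]) < 1e-12:          # last component zero: no TLS solution
        raise np.linalg.LinAlgError("TLS solution does not exist")
    return -v[:n] / v[-1]
```

For a consistent system the TLS and LS solutions coincide; in general the result satisfies (A^T A − σ_{n+1}² I) x = A^T b (equation (24.7) below), which gives a convenient numerical check.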
A geometrical interpretation of the total least squares problem can be obtained as
follows. By the minmax characterization of singular values given in Theorem 3.3 it
follows that the TLS solution, if it exists, satisfies
σ_{n+1}² = min_x ||(A, b)(x ; −1)||₂² / ||(x ; −1)||₂² = min_x Σ_{i=1}^m (a_i^T x − b_i)²/(x^T x + 1),  (24.6)

where a_i^T is the ith row of A. Here each term

d_i²(x) = (a_i^T x − b_i)²/(x^T x + 1)

is the square of the orthogonal distance from the point (a_i, b_i) ∈ R^{n+1}
to the plane through the origin

P_x = { (a ; β) ∈ R^{n+1} : x^T a − β = 0 }.

Hence, the TLS solution minimizes the sum of squares of orthogonal distances
Σ_{i=1}^m d_i²(x), and therefore is a special case of orthogonal regression, see Section 34.
We now consider the conditioning of the total least squares problem and its
relation to the least squares problem. To ensure unique solutions to both the TLS
and the least squares problems we assume that σ̂n > σ_{n+1}, and we denote these
solutions by x_TLS and x_LS respectively. The vector (x_TLS^T, −1)^T satisfies

(A^T A − σ_{n+1}² I) x_TLS = A^T b,  (24.7)
whereas the corresponding least squares solution satisfies
A^T A x_LS = A^T b.  (24.8)
Since in (24.7) a positive multiple of the unit matrix is subtracted from A^T A, total least
squares is a deregularization of the least squares problem (24.8), cf. Section 26.
Therefore the TLS problem is always worse conditioned than the LS problem. It is
shown by GOLUB and VAN LOAN [1980] that a measure of the conditioning of x_TLS
is the difference σ̂n − σ_{n+1}.
Combining (24.7) and (24.8) we get

x_TLS − x_LS = σ_{n+1}² (A^T A − σ_{n+1}² I)^{-1} x_LS.
The TLS problem can be generalized to problems AX ≈ B with several right-hand sides,

min ||(E, R)||_F,   s.t.  (A + E)X = B + R.
E,R

Since the constraint can be written

(A + E, B + R) ( X ; −I_k ) = 0,
it follows that we now seek a perturbation AC = (E, R) that reduces the rank of the
matrix C = (A, B) by k. Again, this problem can be solved using the singular value
decomposition
C = U diag(σ1, ..., σ_{n+k}) V^T,
and the condition σ̂n > σ_{n+1} ensures that there exists a unique solution.
WATSON [1984] has considered the total l_p approximation problem, where
||(E, r)||_p is minimized for a class of matrix norms defined by
GOLUB and STEWART [1986] have shown how to solve the TLS problem for the case
when some of the columns of A are known exactly. The problem can then be
expressed as

min ||(E, r)||_F,   s.t.  (A1, A2 + E)x = b + r,
E,r

where the columns of A1 are known exactly.
In Problem LSE we minimize ||Ax − b||₂ subject to the linear equality constraints Bx = d, B ∈ R^{p×n} (25.1); a unique solution exists provided that the rank condition

rank ( B ; A ) = n  (25.2)

is satisfied. If (25.2) is not satisfied, then we can seek a minimum norm solution to Problem
LSE.
A robust algorithm for Problem LSE should check for possible inconsistency of
the constraint equations Bx = d. If it is not known a priori that the constraints are
consistent, then (25.1) may be reformulated as a sequential least squares problem

min ||Ax − b||₂,   x ∈ S = { x ∈ R^n : ||Bx − d||₂ = min }.  (25.3)
x∈S
590 A. Björck CHAPTER V
This problem always has a unique solution of minimum norm. Most of the methods
described in the following for solving Problem LSE can be adapted to solve (25.3)
with little modification.
The most natural way to solve Problem LSE is to derive an equivalent
unconstrained least squares problem of lower dimension. There are basically two
different ways to perform this reduction: direct elimination and the nullspace
method. We describe both these methods below.
where Ā = A Π_B = (A1, A2). We now eliminate the variables x̄1 from (25.6) using (25.5).
Substituting x̄1 = R11^{-1}(d̄1 − R12 x̄2) we get

Āx̄ − b = Ā2 x̄2 − b̄,

where

Ā2 = A2 − A1 R11^{-1} R12,   b̄ = b − A1 R11^{-1} d̄1.  (25.7)
Hence, the reduced unconstrained least squares problem

min ||Ā2 x̄2 − b̄||₂,   Ā2 ∈ R^{m×(n−r)},  (25.8)
x̄2
is obtained. The computations can be arranged so that the eliminations operate on the compound matrix

( B   d )       ( R11  R12  d̄1 )
( A   b )   →   ( A1   A2   b  ).
REMARK 25.1. The set of vectors x = Π_B x̄, where x̄ satisfies (25.5), is exactly the set of
vectors which minimize ||Bx − d||₂. Thus, the algorithm outlined above actually
solves the more general problem (25.3). If condition (25.2) is not satisfied, then the
reduced problem (25.8) does not have a unique solution. Then column permutations
are needed also in the QR decomposition of Ā2. In this case we can compute either
a basic solution or a minimum norm solution to (25.3) as outlined in Section 7.
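The direct elimination method can be sketched as follows. This is our own illustration, with names of our own choosing; for simplicity no column pivoting is used, so the leading p×p block of the QR factor of B is assumed nonsingular (i.e. Π_B = I):

```python
import numpy as np

def lse_direct(A, B, b, d):
    # Direct elimination for min ||Ax - b||_2  s.t.  Bx = d.
    # Sketch: assumes rank(B) = p and that no column pivoting is needed.
    p, n = B.shape
    Q, R = np.linalg.qr(B)            # B = Q (R11, R12)
    dbar = Q.T @ d
    R11, R12 = R[:, :p], R[:, p:]
    A1, A2 = A[:, :p], A[:, p:]
    T = np.linalg.solve(R11, R12)     # R11^{-1} R12
    u = np.linalg.solve(R11, dbar)    # R11^{-1} dbar
    A2bar = A2 - A1 @ T               # reduced matrix, cf. (25.7)
    bbar = b - A1 @ u
    x2 = np.linalg.lstsq(A2bar, bbar, rcond=None)[0]   # solves (25.8)
    x1 = u - T @ x2                   # back-substitute the constraints
    return np.concatenate([x1, x2])
```

A check of optimality: the computed x must satisfy Bx = d exactly (up to roundoff) and the residual A^T(Ax − b) must be orthogonal to the nullspace of B.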
In the nullspace method we start by computing the QR decomposition of B^T,

Q_B^T B^T = ( R_B ; 0 ),   Q_B = (Q1, Q2).  (25.11)

Then N(B) = R(Q2), i.e. Q2 gives an orthogonal basis for the nullspace of B. Any
vector x ∈ R^n which satisfies Bx = d can then be represented as

x = x1 + Q2 y2,   x1 = B†d = Q1 R_B^{−T} d.  (25.12)
Hence,

Ax − b = A x1 + A Q2 y2 − b,   y2 ∈ R^{n−p},

and it remains to solve the reduced problem

min ||A Q2 y2 − (b − A x1)||₂.  (25.13)
y2
If (25.2) holds, then the matrix

C = ( B ),   C Q_B = ( R_B^T   0    )
    ( A )            ( A Q1   A Q2 ),

must have rank n. But then all columns of C Q_B must be linearly independent, and it
follows that rank(A Q2) = n − p. Then we can compute the QR decomposition
Q_A^T (A Q2) = ( R_A ; 0 ),

where R_A is upper triangular and nonsingular. The unique solution to (25.13) can
then be computed from

R_A y2 = c1,   ( c1 ; c2 ) = Q_A^T (b − A x1),

and we finally obtain x = x1 + Q2 y2, the unique solution to Problem LSE.
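The nullspace method can be sketched as follows (again our own illustration, assuming rank(B) = p):

```python
import numpy as np

def lse_nullspace(A, B, b, d):
    # Nullspace method for min ||Ax - b||_2  s.t.  Bx = d,
    # assuming B (p x n) has full row rank p.
    p, n = B.shape
    Q, R = np.linalg.qr(B.T, mode="complete")   # B^T = Q (R_B ; 0)
    Q1, Q2 = Q[:, :p], Q[:, p:]                 # Q2 spans null(B)
    RB = R[:p, :]
    x1 = Q1 @ np.linalg.solve(RB.T, d)          # minimum norm solution of Bx = d
    y2 = np.linalg.lstsq(A @ Q2, b - A @ x1, rcond=None)[0]   # solves (25.13)
    return x1 + Q2 @ y2
```

The same optimality checks apply as for direct elimination: the constraints are satisfied exactly and the gradient A^T(Ax − b) has no component in the nullspace of B.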
The representation (25.12) of the solution x has been used as a basis for
a perturbation theory by LERINGE and WEDIN [1970]. This generalizes the results
given in Section 5. The corresponding bounds for Problem LSE are more
complicated, but show that Problem LSE is well-conditioned if κ(B) and κ(A Q2) are
small. It is important to note that these two condition numbers can be small even
when κ(A) is large. Any method which starts with minimizing ||Ax − b||₂ will give
bad results in such a case. ELDÉN [1980] has given a less complicated and more
complete perturbation theory for Problem LSE based on the concept of a weighted
pseudoinverse.
The method of direct elimination and the nullspace method both have good
numerical stability. In a numerical comparison by LERINGE and WEDIN [1970] they
SECTION 25 Constrained least squares problems 593
gave almost identical results. The operation count for the method of direct
elimination is slightly lower because Gaussian elimination is used to derive the
reduced unconstrained problem.
Following VAN LOAN [1985] we now analyze Problem LSE in terms of the
generalized singular value decomposition (GSVD). For simplicity we assume that in
(25.1) we have rank(B)=p and that (25.2) holds. This ensures that the problem has
a unique solution, and the GSVD can be written (see Section 22)

A = U_A D_A Z,   B = U_B (D_B, 0) Z,  (25.15)

where

D_A = diag(α1, ..., αn),   D_B = diag(β1, ..., βp),
0 = α1 = ··· = αq < α_{q+1} ≤ ··· ≤ α_p < α_{p+1} = ··· = α_n = 1,  (25.16)
β1 ≥ ··· ≥ β_p > 0.

We can also without loss of generality assume that

α_i ≥ 0,   β_i ≥ 0,   i = 1, ..., n,

and that

α_i² + β_i² = 1,   i = 1, ..., n.
Using the GSVD, problem (25.1) is transformed into the diagonal form

min ||D_A y − b̃||₂,   s.t.  (D_B, 0) y = d̃,   y = Zx,   b̃ = U_A^T b,   d̃ = U_B^T d,
y

with solution

y_i = d̃_i/β_i,   i = 1, ..., p,  (25.19)
y_i = b̃_i,   i = p+1, ..., n,

where we have used that α_i = 1, i = p+1, ..., n.
Introducing the matrix

Z^{-1} = X = (x1, ..., xn),

we can write the solution x_LSE to (25.1) as

x_LSE = Σ_{i=1}^p (d̃_i/β_i) x_i + Σ_{i=p+1}^n b̃_i x_i.  (25.20)
It is interesting to compute how much the residual rLS = b - AXLs increases as a result
of the constraints Bx = d. We have
r_LSE − r_LS = A(x_LS − x_LSE),

where, with u_i and v_i the columns of U_A and U_B,

ρ_i = u_i^T b − μ_i v_i^T d,   μ_i = α_i/β_i,   i = 1, ..., p.  (25.21)

Hence,

||r_LSE||₂² = ||r_LS||₂² + Σ_{i=q+1}^p ρ_i².
The method of weighting for solving Problem LSE is based on the following simple
observation. Assume that in a least squares problem we want some equations to be
exactly satisfied. We can achieve that by giving these equations a large weight γ and
solving the resulting unconstrained least squares problem. Hence to solve (25.1) we
would compute the solution x(γ) to the problem

min || ( γB ) x − ( γd ) ||₂.  (25.22)
x      ( A  )     ( b  )

Note that if (25.2) holds, then (25.22) is a full rank least squares problem.
Weighted problems were considered in Section 14. There it was shown (cf. (14.9))
that if rank(B) = p, then the residual d − Bx(γ) is proportional to γ^{-2} for large values
of γ, and hence lim_{γ→∞} x(γ) = x_LSE. A more general analysis is given below in terms of
the GSVD.
the GSVD.
The method of weighting is attractive for its simplicity. It allows a subroutine or
program for unconstrained least squares problems to be used to solve Problem LSE.
However, for large values of γ care must be exercised in the way (25.22) is solved,
because then the matrix in (25.22) is poorly conditioned. In particular, the method of
normal equations will fail for large γ.
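A minimal sketch of the method of weighting (the weight γ and the function name are our own choices; the stacked system is solved by an orthogonalization method via `lstsq`, never via the normal equations):

```python
import numpy as np

def lse_weighted(A, B, b, d, gamma=1e8):
    # Method of weighting: approximate min ||Ax-b||_2 s.t. Bx = d by the
    # unconstrained problem min ||(gamma*B; A)x - (gamma*d; b)||_2, cf. (25.22).
    # The constraint residual d - Bx(gamma) decays like gamma^{-2}.
    M = np.vstack([gamma * B, A])
    f = np.concatenate([gamma * d, b])
    return np.linalg.lstsq(M, f, rcond=None)[0]
```

For example, with A = I₂, b = (2, 0)^T and the single constraint x₁ = x₂, the weighted solution approaches the exact LSE solution (1, 1)^T as γ grows.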
Accurate solutions to (25.22) for large values of γ can be computed from a QR
decomposition of the matrix

( γB )
( A  ),

provided that both row and column permutations are used. POWELL and REID [1969]
recommend that the column pivoting strategy of Section 11 is used and that, before
a Householder transformation is applied, the largest element in the pivot column is
permuted to the top.
For some problems it may be sufficient to order the constraints first, as done in
(25.22), and then compute a QR decomposition without column interchanges. The
following example from VAN LOAN [1985] shows that this is not always sufficient.
x_LSE = (46, −2, 12)^T.
Using the GSVD (25.15) we obtain for the solution of (25.22)

y_i(γ) = (α_i b̃_i + γ² β_i d̃_i)/(α_i² + γ² β_i²),   i = 1, ..., p,
y_i(γ) = b̃_i,   i = p+1, ..., n.

Hence from (25.19) we find that y_i(γ) = y_i, i = 1, ..., q, p+1, ..., n, and with ρ_i and μ_i
defined by (25.21)

y_i(γ) − y_i = μ_i ρ_i/(β_i(μ_i² + γ²)),   i = q+1, ..., p.
In this section we consider methods for solving constrained least squares problems
of the following general type.

PROBLEM LSQI.

min ||Ax − b||₂,   s.t.  ||Bx − d||₂ ≤ γ,  (26.1)
x

where A ∈ R^{m×n} and B ∈ R^{p×n}.
We first consider some applications where problems of this form arise. One
example is in smoothing of noisy data, see e.g. HUTCHINSON and DE HOOG [1985].
Here one wants to balance a good fit to the data points and a smooth solution.
Another application where problems of the form (26.1) arise is in the regularization
of ill-conditioned least squares problems resulting from the discretization of
ill-posed problems.
As an example, consider the integral equation of the first kind

∫ K(s, t) f(t) dt = g(s),  (26.2)
where the operator K is compact. It is well known that this problem is ill-posed in
the sense that the solution f does not depend continuously on the data g. For
example, there are rapidly oscillating functions f(t) which come arbitrarily close to
being annihilated by the integral operator.
If (26.2) is discretized into a corresponding least squares problem

min ||Kf − g||₂,  (26.3)
f

then the singular values of K ∈ R^{m×n} decay exponentially to zero. Therefore K will
not have a well-defined numerical rank r, since by (10.3) this requires that
σ_r > δ > σ_{r+1} holds with a distinct gap between the singular values σ_r and σ_{r+1}. This
means that the methods for rank-deficient least squares problems in Sections 10 and
11 are not very useful here.
In general any attempt to solve (26.3) without any constraints on f will give
a meaningless result. Perhaps the most successful method to solve ill-conditioned
problems of this kind is to restrict the solution space by imposing some a priori
bound on ||Lf||₂ for a suitably chosen matrix L ∈ R^{p×n}. Typically L is taken to be
a discrete approximation to some derivative operator, e.g. the first-difference operator

L = ( 1  −1             )
    (     1  −1         )   ∈ R^{(n−1)×n},  (26.4)
    (         ⋱    ⋱   )
    (             1  −1 )

and one solves

min ||Kf − g||₂,   s.t.  ||Lf||₂ ≤ γ.  (26.5)
f
Here the parameter y governs the balance between a small residual and a smooth
solution. The determination of a suitable y is often a main difficulty in the solution
process. Alternatively we can consider the related problem
min ||Lf||₂,   s.t.  ||Kf − g||₂ ≤ ρ.  (26.6)
f
Problems (26.5) and (26.6) are called regularization methods for the ill-conditioned
problem (26.3) in the terminology of TIHONOV [1963]. They are obviously special
cases of the general Problem LSQI in (26.1).
We now consider conditions for existence and uniqueness of solutions to Problem
LSQI. Clearly (26.1) has a solution if and only if
min ||Bx − d||₂ ≤ γ.  (26.7)
x
Let x_{A,B} denote the solution of min_x ||Ax − b||₂ for which ||Bx − d||₂ is smallest. Notice that for B = I and d = 0 we have x_{A,I} = A†b. Then the constraint in (26.1) is
binding only if

||B x_{A,B} − d||₂ > γ.  (26.9)
This observation gives rise to the following theorem.
THEOREM 26.1. Assume that Problem LSQI has a solution. Then either x_{A,B} is
a solution, or (26.9) holds and the solution occurs on the boundary of the constraint
region. In the latter case the solution x = x(λ) satisfies the generalized normal equations

(A^T A + λ B^T B) x(λ) = A^T b + λ B^T d,  (26.10)

where λ is determined by the secular equation

||B x(λ) − d||₂ = γ.  (26.11)
In the following we assume that (26.9) holds, so that the constraint is binding. Then
there is a unique solution to Problem LSQI if and only if the nullspaces of A and
B intersect only trivially, i.e. N(A) ∩ N(B) = {0}. This is equivalent to the rank
condition

rank ( A ; B ) = n.  (26.12)
We note that (26.10) are the normal equations for the least squares problem

min || (  A   ) x − (  b   ) ||₂,  (26.13)
x      ( √λ B )     ( √λ d )

where only positive values of λ are of interest. Therefore, to solve (26.10) for a given
λ it is not necessary to form the cross-product matrices A^T A and B^T B.
A numerical method for solving Problem LSQI can be based on applying
a method for solving the nonlinear equation

φ(λ) = γ,   where   φ(λ) = ||B x(λ) − d||₂,

and where x(λ) is computed from (26.13). However, this means that for every
function value φ(λ) we have to compute a new QR decomposition in (26.13).
Methods which avoid this have been given by ELDÉN [1977] and will be described
later in this section.
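The naive approach just described can be sketched as follows (our own illustration; bisection is used instead of Newton's method, the bracket `lam_hi` is an assumed parameter, and a fresh stacked least squares problem (26.13) is indeed solved for every λ):

```python
import numpy as np

def lsqi(A, B, b, d, gamma, lam_hi=1e8, iters=200):
    # Solve min ||Ax-b||_2  s.t. ||Bx-d||_2 <= gamma by bisection on the
    # secular equation phi(lam) = ||B x(lam) - d||_2 = gamma, where x(lam)
    # solves the stacked problem (26.13).
    def x_of(lam):
        M = np.vstack([A, np.sqrt(lam) * B])
        f = np.concatenate([b, np.sqrt(lam) * d])
        return np.linalg.lstsq(M, f, rcond=None)[0]
    x = x_of(0.0)
    if np.linalg.norm(B @ x - d) <= gamma:
        return x                      # constraint not binding
    lo, hi = 0.0, lam_hi              # phi is monotone decreasing in lam
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(B @ x_of(mid) - d) > gamma:
            lo = mid
        else:
            hi = mid
    return x_of(hi)                   # feasible side of the root
```

For the standard form B = I, d = 0 with A = I₃, the solution is simply the radial projection of b onto the ball of radius γ, which gives a convenient check.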
We now consider the use of the generalized singular value decomposition for
analyzing Problem LSQI, cf. GOLUB and VAN LOAN ([1983], Section 12.1). For ease
of notation we assume that m > n and put q = min(p, n). Then we have (see Section 22)
U^T A = ( D_A ; 0 ) Z,   V^T B = ( D_B, 0 ) Z,  (26.14)

where

D_A = diag(α1, ..., αn),   D_B = diag(β1, ..., βq),  (26.15)
α_i ≥ 0,  i = 1, ..., n,   β1 ≥ ··· ≥ β_r > β_{r+1} = ··· = β_q = 0,

r = rank(B), U ∈ R^{m×m} and V ∈ R^{p×p} are orthogonal, and Z is nonsingular. The rank
condition (26.12) implies that

α_i² + β_i² > 0,  i = 1, ..., r,   α_i > 0,  i = r+1, ..., n.  (26.16)
Using the GSVD, Problem LSQI becomes

min ||D_A y − b̃||₂,   s.t.  ||(D_B, 0) y − d̃||₂ ≤ γ,  (26.17)
y

where y = Zx, b̃ = U^T b and d̃ = V^T d, and we assume that the constraint is binding. The generalized normal equations (26.10) can now be written

(α_i² + λ β_i²) y_i(λ) = α_i b̃_i + λ β_i d̃_i,   i = 1, ..., n,  (26.18)

and the secular equation (26.11) takes the form

φ(λ)² = Σ_{i=1}^r ( α_i(β_i b̃_i − α_i d̃_i)/(α_i² + λ β_i²) )² + Σ_{i=r+1}^q d̃_i² = γ².  (26.19)

From (26.17) it follows that φ(0) > γ. Since φ(λ) is monotone decreasing for λ > 0,
there exists a unique positive root of (26.19). It can be shown that this is the desired
root, see GANDER [1981].

From (26.19) we can cheaply compute function values φ(λ) for given numerical
values of λ. By differentiating (26.19), derivatives of φ(λ) can also be computed, and
thus a numerical method like Newton's method can be used to solve (26.19).
A particularly simple but important case is when in (26.1)

B = I_n   and   d = 0.  (26.20)

We will call this the standard form of LSQI. The above algorithm then simplifies as
follows. Let the SVD of A be

A = U ( Σ ; 0 ) V^T,   Σ = diag(σ1, ..., σn),  (26.21)

where U and V are orthogonal. We now have β_i = 1, i = 1, ..., n, and assume that the
singular values σ_i = σ_i(A) are ordered so that

σ1 ≥ σ2 ≥ ··· ≥ σn ≥ 0.
For the problem in standard form the rank condition (26.12) is trivially satisfied. The
condition (26.9) simplifies to

||A†b||₂ > γ.  (26.22)

We assume that this condition is satisfied and determine μ by solving the secular
equation

φ(μ) = ||x(μ)||₂ = γ,

where x(μ) is the solution of the regularized least squares problem

min || ( A  ) x − ( b ) ||₂.  (26.23)
x      ( μI )     ( 0 )

In terms of the SVD, with c = U^T b,

φ(μ)² = Σ_{i=1}^n ( σ_i c_i/(σ_i² + μ²) )² = γ².  (26.24)
To solve (26.23) efficiently for many values of μ, A is first transformed to upper bidiagonal form,

A = Q ( B ; 0 ) P^T,   B ∈ R^{n×n} upper bidiagonal.

This decomposition can be computed in only mn² + n³ flops using Householder
transformations. If we put

x = P y,   Q^T b = ( c1 ) }n
                   ( c2 ) }m−n,

then (26.23) becomes

min || ( B  ) y − ( c1 ) ||₂.  (26.25)
y      ( μI )     ( 0  )
SECTION 26 Constrained least squares problems 601
For each value of μ the matrix in (26.25) is then reduced to upper bidiagonal form by a sequence of Givens rotations which annihilate the elements of the block μI against those of B.  (26.26)
The whole transformation in (26.26), including solving for y(μ), takes only about 11n
flops. ELDÉN [1977] gives a detailed operation count and also shows how to
compute derivatives of φ(μ).
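For moderate-sized dense problems the standard-form secular equation can also be solved directly from the SVD. The sketch below is our own illustration (Newton's method on ψ(μ) = ‖x(μ)‖₂² − γ², not Eldén's O(n) bidiagonal scheme) and assumes the constraint is binding, i.e. ‖A†b‖₂ > γ:

```python
import numpy as np

def tikhonov_for_norm(A, b, gamma, iters=50):
    # Find mu so that x(mu) = V diag(s_i/(s_i^2 + mu^2)) U^T b, the solution
    # of (26.23), satisfies ||x(mu)||_2 = gamma.  Newton's method is applied
    # to psi(mu) = ||x(mu)||_2^2 - gamma^2, which is decreasing for mu > 0.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    c = U.T @ b
    mu = 1.0
    for _ in range(iters):
        w = s * c / (s**2 + mu**2)                 # coefficients of x(mu)
        psi = w @ w - gamma**2
        dpsi = -4.0 * mu * np.sum((s * c)**2 / (s**2 + mu**2)**3)
        mu = abs(mu - psi / dpsi)                  # keep iterate positive
    return Vt.T @ (s * c / (s**2 + mu**2)), mu
```

With A = I₂ and b = (3, 4)^T, the secular equation ‖b‖₂/(1 + μ²) = γ gives μ = 2 for γ = 1, which the iteration reproduces.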
We now consider the more general form of Problem LSQI, where d = 0 and
B = L ∈ R^{(n−t)×n} is a banded matrix. We now have to solve

min || ( A  ) x − ( b ) ||₂.  (26.27)
x      ( μL )     ( 0 )

After an initial QR decomposition of A this reduces to

min || ( R1 ) x − ( c1 ) ||₂.  (26.28)
x      ( μL )     ( 0  )
Some problems give rise to a matrix A which also has band structure. Then the
matrix R1 will be an upper triangular band matrix with the same bandwidth w1 as A,
see Theorem 15.1, and the complete matrix in (26.28) will be of banded form. In many
cases the matrix L is also an upper triangular banded matrix. If not, it is
convenient to reduce it to this form by computing the QR decomposition

Q2^T L = R2,

where R2 has bandwidth w2. Since Q2 is orthogonal we have reduced (26.28) to the
form

min || ( R1  ) x − ( c1 ) ||₂.
x      ( μR2 )     ( 0  )
This problem can be efficiently solved by the sequential orthogonalization method
of Section 18. Note that this involves a reordering of the rows of the matrix

( R1  )
( μR2 )

so that the column indices of the first nonzero element in each row form a
nondecreasing sequence. The resulting algorithm has been described in detail by
ELDÉN [1984b]. The number of operations for each value of μ is O(n(w1 + w2)).
We now describe an algorithm for the case when R1 does not have band structure.
The idea is to transform (26.27) to a regularization problem of standard form. Note
that if L were nonsingular we could achieve this by the change of variables y = Lx.
However, normally L ∈ R^{(n−t)×n} and is of rank n − t < n. The transformation to
standard form can then be achieved using the pseudoinverse of L by a technique due
to ELDÉN [1977]. We compute the QR decomposition of L^T,

L^T = (V1, V2) ( R_L ; 0 ),

so that L† = V1 R_L^{−T}. One might suspect this transformation to be numerically less satisfactory when κ(L) is large.
However, in practice it seems to give very similar results, see VARAH ([1979], p. 104).
An important special case is when in LSQI we have A = K, B = L, and both K and
L are upper triangular Toeplitz matrices, i.e.

K = ( k1  k2  ···  k_{n−1}  k_n     )
    (     k1  k2   ···      k_{n−1} )
    (         ⋱    ⋱        ⋮       )
    (              k1       k2      )
    (                       k1      )
and L as in (26.4). Such systems arise when convolution-type Volterra integral
equations of the first kind
∫₀^t K(t − s) f(s) ds = g(t),   t ≥ 0,
are discretized. ELDÉN [1984a] has developed a method for solving problems of this
kind which only uses 9n² flops for each value of μ. It can be modified to handle the
case when K and L also have a few nonzero diagonals below the main diagonal.
Although K can be represented by n numbers, this method uses n² storage
locations. A modification of this algorithm, which uses only O(n) storage locations,
is given by BOJANCZYK and BRENT [1986].
For computing a regularized solution to

min ||Ax − b||₂,  (26.31)
x

where A is very large and sparse, iterative methods may be considered. For a survey
of such methods see BJÖRCK and ELDÉN [1979]. The methods can be divided into two
groups. In the first group the iterative method is applied directly to the problem
(26.31). Regularization is then achieved by only performing a small number of
iterations.
In the other group of methods one applies the iterative method to the regularized
problem (26.23), which then usually is solved for several values of μ. BJÖRCK [1988]
and O'LEARY and SIMMONS [1981] have given methods based on the bidiagonali-
zation steps (20.32)-(20.35) in the method LSQR. The idea is to compute x_k(μ) =
V_k y_k(μ), where y_k(μ) is the solution to

min || ( B_k ) y − ( β1 e1 ) ||₂,
y      ( μI  )     ( 0     )

which is a regularized version of (20.35). This allows for the cheap computation of
x_k(μ) for several different values of μ. It does require the vectors v1, ..., vk to be
saved, but in most applications where these methods make sense k is not a very large
number.
One application of the iterative methods above is when A = K is an upper
triangular Toeplitz matrix, since then the matrix-vector products Ku and K^T v can be
computed in O(n log₂ n) flops using an algorithm based on the fast Fourier
transform, see O'LEARY and SIMMONS [1981].
So far we have assumed that the parameter λ = μ² is determined by solving the
secular equation (26.11), where γ is known a priori or somehow determined from
additional information about the solution. We now describe a method for
determining the smoothing parameter μ directly from the data. The underlying
statistical model is that the components of b are subject to random errors of zero
mean and covariance matrix σ²I_m, where σ² may or may not be known. We take
d = 0 in (26.10) and write the solution as a function of μ,

x_μ = M_μ^{-1} A^T b,   M_μ = A^T A + μ² B^T B.  (26.32)
Here trace(A) denotes the trace of the matrix A. When σ² is not known, then μ may be
chosen to minimize the generalized cross-validation (GCV) function

C(μ) = (1/m) ||(I_m − P_μ) b||₂² / [ (1/m) trace(I_m − P_μ) ]²,   P_μ = A M_μ^{-1} A^T,  (26.34)

since the minimizer of C(μ) is asymptotically the same as the minimizer of T(μ) when
m is large. For a discussion of generalized cross-validation see also GOLUB, HEATH
and WAHBA [1979].
EXAMPLE 26.1 (GOLUB and VAN LOAN [1983], p. 12.1-5). Let A ∈ R^{m×1} be given by A = (1, 1, ..., 1)^T = e.
Minimization of either T(μ) or C(μ) requires that ||(I_m − P_μ)b||₂ and trace(I_m − P_μ)
can be accurately and efficiently computed. For a problem in standard form, i.e.
when B = I_n, these quantities can be computed from the SVD of A,

A = U ( Σ ; 0 ) V^T,   Σ = diag(σ1, ..., σn).

We get

P_μ = U ( Ω  0 ; 0  0 ) U^T,

where

Ω = diag(ω1, ..., ωn),   ω_i = σ_i²/(σ_i² + μ²).

From an easy calculation it follows that with c = U^T b

||(I_m − P_μ) b||₂² = Σ_{i=1}^n ( μ² c_i/(σ_i² + μ²) )² + Σ_{i=n+1}^m c_i²,  (26.35)

trace(I_m − P_μ) = (m − n) + Σ_{i=1}^n μ²/(σ_i² + μ²).  (26.36)
Using the generalized SVD (26.14)-(26.15), formulas similar to (26.35) and (26.36) are
easily derived for the general case B ≠ I_n.
ELDÉN [1984c] has given a method for computing C(μ) based on the bidiagonali-
zation of A, which is more efficient than that based on the SVD.
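The SVD-based evaluation of C(μ) via (26.35)-(26.36) can be sketched as follows (our own illustration; the function name and interface are assumptions):

```python
import numpy as np

def gcv(A, b, mus):
    # Evaluate the GCV function C(mu) of (26.34) for standard-form
    # regularization (B = I) using the SVD of A and (26.35)-(26.36).
    m, n = A.shape
    U, s, _ = np.linalg.svd(A)            # full SVD: U is m x m
    c = U.T @ b
    vals = []
    for mu in mus:
        f = mu**2 / (s**2 + mu**2)        # f_i = 1 - omega_i
        rss = np.sum((f * c[:n])**2) + np.sum(c[n:]**2)   # (26.35)
        tr = (m - n) + np.sum(f)                          # (26.36)
        vals.append((rss / m) / (tr / m)**2)
    return np.array(vals)
```

Once the SVD is available, each function value costs only O(n) operations; a check against the direct definition C(μ) = (‖(I − P_μ)b‖₂²/m)/(trace(I − P_μ)/m)² confirms the formulas.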
In many important applications the matrices A^T A and B^T B have band structure.
For example, when fitting a polynomial smoothing spline of degree 2k − 1 to m data
points the half-bandwidths will be k and k + 1 respectively, see REINSCH [1971]. Then
computing the cross-validation function using the singular value decomposition is
not efficient and will require O(m³) operations. HUTCHINSON and DE HOOG [1985]
give a method which requires only O(k²m) operations. This is based on the
observation that to compute the trace in (26.34)
only the central 2k + 1 bands of the inverse M_μ^{-1} are needed. These can be efficiently
computed from the Cholesky factorization of M_μ using the algorithm (16.2)-(16.4).
This section is concerned with linear least squares problems with inequality
constraints:

PROBLEM LSI.

min ||Ax − b||₂,   s.t.  l ≤ Cx ≤ u,  (27.1)
x

where A ∈ R^{m×n} and C ∈ R^{p×n}. If c_i^T denotes the ith row of the constraint matrix C, then
the constraints can also be written

l_i ≤ c_i^T x ≤ u_i,   i = 1, ..., p.
Note that if linear equality constraints are present, then these can be eliminated
using one of the methods given in Section 25. Both the direct elimination method
and the nullspace method will reduce the problem to a lower-dimensional problem
without equality constraints. However, an equality constraint can also be specified
by setting li = ui.
An important special case is when the inequalities are simple bounds:
PROBLEM BLS.
min ||Ax − b||₂,   s.t.  l ≤ x ≤ u.  (27.2)
x
For reasons of computational efficiency it is essential that such constraints are
considered separately from more general constraints in (27.1). If only one-sided
bounds on x are specified, then it is no restriction to assume that these are
nonnegativity constraints and we have:
SECTION 27 Constrained least squares problems 607
PROBLEM NNLS.

min ||Ax − b||₂,   s.t.  x ≥ 0.  (27.3)
x

A closely related problem is the least distance problem:

PROBLEM LSD.

min ||x||₂,   s.t.  g ≤ Gx ≤ h,  (27.4)
x
or more generally

min ||x1||₂,   s.t.  g ≤ G1 x1 + G2 x2 ≤ h,  (27.5)
x

where

x = ( x1 ; x2 ),   x1 ∈ R^k,   x2 ∈ R^{n−k}.
To transform Problem LSI to a least distance problem, let A, of rank r, have the complete orthogonal decomposition

A = Q ( T  0 ) P^T,
      ( 0  0 )

where T ∈ R^{r×r} is upper triangular and nonsingular. Then (27.1) can be written

min ||T y1 − c1||₂,   s.t.  l ≤ E y ≤ u,
y
608 A. Bjbrck CHAPTER V
where E = (E1, E2) and y = (y1 ; y2) are conformally partitioned and y1 ∈ R^r. We now make the further change of
variables
z1 = T y1 − c1,   z2 = y2.

Substituting y1 = T^{-1}(z1 + c1) in the constraints we arrive at an equivalent least
distance problem

min ||z1||₂,   s.t.  l̄ ≤ Ḡ1 z1 + Ḡ2 z2 ≤ ū,  (27.7)
z

where

Ḡ1 = E1 T^{-1},   Ḡ2 = E2,   l̄ = l − Ḡ1 c1,   ū = u − Ḡ1 c1.
We note that if A has full column rank, then r = n and z = z 1, so we get a least distance
problem of the form (27.4).
Methods for solving Problem LSI based on the above transformation to a least
distance problem have been given by LAWSON and HANSON ([1974], Chapter 23),
HASKELL and HANSON [1981] and by SCHITTKOWSKI [1983]. The method proposed
by Schittkowski for solving the least distance problem is a primal method as
opposed to a dual approach used in the first two references. The dual approach is
based on the following equivalence between a least distance problem and
a nonnegativity constrained problem.
THEOREM 27.1. Consider the least distance problem with lower bounds

min ||x||₂,   s.t.  g ≤ Gx,  (27.8)
x

where G ∈ R^{m×n}. Let û ∈ R^m be the solution to the nonnegativity constrained problem

min ||Eu − f||₂,   s.t.  u ≥ 0,  (27.9)
u

where

E = ( G^T ; g^T ) ∈ R^{(n+1)×m},   f = e_{n+1} = (0, ..., 0, 1)^T.

If the residual r̂ = Eû − f is nonzero, then the solution to (27.8) is given by
x_j = −r̂_j/r̂_{n+1}, j = 1, ..., n.
CLINE [1975] and HASKELL and HANSON [1981] describe how the modified LSD
problem (27.5) with only upper bounds can be solved in two steps each of which
requires a solution of a problem of type NNLS, the first of these having additional
linear equality constraints.
We now consider methods for solving Problem LSI which do not use a transformation into a least distance problem. We first remark that (27.1) is equivalent to
a quadratic programming problem

min (x^T B x + c^T x),   s.t.  l ≤ Cx ≤ u,
x

where

B = A^T A,   c = −2 A^T b.
In an active set method a sequence of approximations is computed from

x_{k+1} = x_k + α_k p_k,

where p_k is a search direction and α_k a nonnegative step length. The search direction
is constructed so that the working set of constraints remains satisfied for all values of
α_k. This will be the case if C_k p_k = 0, where C_k consists of the rows of C in the working set. In order to satisfy this condition we compute
a decomposition
C_k Q_k = (0, T_k),   T_k ∈ R^{n_k×n_k},  (27.11)
where T_k is triangular and nonsingular and Q_k a product of orthogonal transformations. If we partition Q_k so that

Q_k = (Z_k, Y_k),   Z_k ∈ R^{n×(n−n_k)},  (27.12)

then the n − n_k columns of Z_k form a basis for the nullspace of C_k. Hence, the
condition is satisfied if we take
p_k = Z_k q_k,   q_k ∈ R^{n−n_k}.  (27.13)
We now determine q_k so that x_k + Z_k q_k minimizes the objective function, i.e. in phase
two q_k solves the unconstrained least squares problem

min ||A Z_k q_k − r_k||₂,   r_k = b − A x_k.  (27.14)
q_k
To simplify the discussion we assume in the following that the matrix A Z_k is of full
rank, so that (27.14) has a unique solution. To compute this solution we need the QR
decomposition of the matrix A Z_k. This is obtained from the QR decomposition of
the matrix A Q_k,

P_k^T A Q_k = ( R_k  S_k )  }n−n_k  (27.15)
              ( 0    U_k ),

where R_k is upper triangular and nonsingular. The advantage of computing this larger decomposition is that then the orthogonal
matrix P_k need not be saved and can be discarded after being applied also to the
residual vector r_k. The solution to (27.14) can now be computed from the triangular
system

R_k q_k = d_k^{(1)},

where d_k^{(1)} denotes the first n − n_k components of d_k = P_k^T r_k.
We now determine ᾱ, the maximum nonnegative step length along p_k for which
x_{k+1} remains feasible with respect to the constraints not in the working set. We take

x_{k+1} = x_k + ᾱ p_k,   if ᾱ < 1,

and then add the constraint which is hit to the working set for the next iteration. We
take

x_{k+1} = x_k + p_k,   if ᾱ ≥ 1.
In this case x_{k+1} will minimize the objective function when the constraints in the
working set are treated as equalities, and the orthogonal projection of the gradient
onto the subspace of feasible directions will be zero,

Z_k^T g_{k+1} = 0,   g_{k+1} = −A^T r_{k+1}.  (27.16)

The Lagrange multipliers λ_k for the constraints in the working set are then defined by

C_k^T λ_k = g_{k+1},  (27.17)

which by (27.11) reduces to a triangular system with matrix T_k^T, whose right-hand side is available from the quantities in (27.15).
The Lagrange multiplier λ_i for the constraint l_i ≤ c_i^T x ≤ u_i in the working set is said to
be optimal if λ_i ≤ 0 at an upper bound and λ_i ≥ 0 at a lower bound. If all multipliers
are optimal then we have found an optimal point and are finished. If a multiplier is
not optimal, then the objective function can be decreased by deleting the
corresponding constraint from the working set. If there is more than one multiplier
that is not optimal, then it is usual to delete the constraint whose multiplier deviates most
from optimality.
At each iteration step the working set of constraints is changed, which leads to
a change in the matrix C_k. If a constraint is dropped, a row of C_k is deleted; if a
constraint is added, a new row of C_k is introduced. An important feature of an active
set algorithm is the efficient solution of the sequence of unconstrained problems
(27.14). Using techniques described in Section 21 methods can be developed to
update the matrix decompositions (27.11) and (27.15). In (27.11) the matrix Qk is
modified by a sequence of orthogonal transformations from the right. These trans-
formations are then applied to Q_k in (27.15), and this decomposition, together with the
vector P_k^T r_{k+1}, is similarly updated. Since these updatings are quite intricate they will
not be described in detail here.
For the case when A has full rank the algorithm for Problem LSI described above
is essentially equivalent to the algorithm given by STOER [1971]. In this case
Problem LSI always has a unique solution. If A is rank-deficient there will still be
a unique solution if all active constraints at the solution have nonzero Lagrange
multipliers. Otherwise there is an infinite manifold M of optimal solutions with
a unique optimal value. In this case we can seek the unique solution of minimum
norm, which satisfies

min ||x||₂,   x ∈ M.
This is a least distance problem. A method for computing this minimum norm
solution when all constraints are simple bounds has been given by LÖTSTEDT [1984].
In the rank-deficient case it can happen that the matrix AZk in (27.14) is
rank-deficient and hence Rk singular. Note that if some Rk is nonsingular it can
become singular during later iterations only when a constraint is deleted from the
working set, in which case only its last diagonal element can become zero. This
simplifies the treatment of the rank-deficient case. To make the initial R k
nonsingular one can add artificial constraints to ensure that the matrix AZk has full
rank. For a further discussion of the treatment of the rank-deficient case, see GILL et
al. [1986].
A possible further complication is that the working set of constraints can become
linearly dependent. This can cause cycling in the algorithm, so that its
convergence cannot be ensured. A simple remedy that is often used is to enlarge the
feasible region of the offending constraint by a small quantity, see also GILL,
MURRAY and WRIGHT ([1981], Chapter 5.8.2).
As remarked earlier Problem LSI simplifies considerably when the only
constraints are simple bounds. This problem is important in its own right and also
serves as a good illustration of the general algorithm. Hence, we now consider the
algorithm for Problem BLS in more detail. We first note that feasibility of the
bounds is resolved by simply checking whether l_i ≤ u_i for all i. Further, the
specification of the working set is equivalent to a partitioning of x into free and fixed
variables. During an iteration the fixed variables will effectively be removed from the
problem.
We divide the index set of x according to

{1, ..., n} = F ∪ B,

where i ∈ F if x_i is a free variable and i ∈ B if x_i is fixed at its lower or upper bound.
The matrix C_k will now consist of the rows e_i^T, i ∈ B, of the unit matrix I. We let
C_k = E_B^T, and if E_F is similarly defined, we can write
AQ_{k+1} = AQ_k P_R(k, q),
where P_R(k, q) is a permutation matrix which performs a right circular shift, in which
the columns
1, ..., k, k+1, ..., q-1, q, q+1, ..., n
are permuted to
1, ..., k, q, k+1, ..., q-1, q+1, ..., n.
Similarly, if the bound corresponding to x_q becomes active it can be added to the
working set by
AQ_{k+1} = AQ_k P_L(q, k),
where P_L(q, k) is a permutation matrix which performs a left circular shift, in which
the columns
1, ..., q-1, q, q+1, ..., k, k+1, ..., n
are permuted to
1, ..., q-1, q+1, ..., k, q, k+1, ..., n.
Subroutines for updating the QR decomposition after right or left circular shifts are
included in LINPACK and are described in DONGARRA et al. ([1979], Chapter 10).
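The effect of these two permutations on the column ordering can be sketched in a few lines. The helper names below are illustrative only (they are not the LINPACK routines), and indices are 1-based as in the text:

```python
# Sketch of the right and left circular column shifts used when a variable
# is freed (right shift) or a bound becomes active (left shift).
# The function names are illustrative, not from LINPACK; 1-based indices.

def right_circular_shift(cols, k, q):
    """Move column q in front of columns k+1, ..., q-1 (variable x_q freed)."""
    i, j = k, q - 1          # 0-based positions of columns k+1 and q
    return cols[:i] + [cols[j]] + cols[i:j] + cols[j + 1:]

def left_circular_shift(cols, q, k):
    """Move column q behind columns q+1, ..., k (bound on x_q becomes active)."""
    i, j = q - 1, k - 1      # 0-based positions of columns q and k
    return cols[:i] + cols[i + 1:j + 1] + [cols[i]] + cols[j + 1:]

n = 7
cols = list(range(1, n + 1))
print(right_circular_shift(cols, 2, 5))  # [1, 2, 5, 3, 4, 6, 7]
print(left_circular_shift(cols, 2, 5))   # [1, 3, 4, 5, 2, 6, 7]
```

In an implementation the same index permutation is applied to the columns of AQ_k, after which the QR factorization is restored by a sequence of Givens rotations.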
For Problem BLS the equation (27.17) for the Lagrange multipliers simplifies to
λ = E_B^T g_{k+1} = -E_B^T A^T r_{k+1}.
Hence, the Lagrange multipliers are simply equal to the corresponding components
of the gradient vector -A^T r_{k+1}.
Below we summarize the algorithm for Problem BLS:

Main loop
begin repeat
  Compute the unconstrained optimum in the free variables:
    q := R^{-1}(c - S E_B^T x);  z := E_F q;
  if l_i < z_i < u_i for all i ∈ F then
    begin comment: check for optimality;
    Compute Lagrange multipliers λ := U^T(d - U E_B^T x);
    if λ = 0 or sign(y_i) λ_i ≤ 0 for all i ∈ B then
      go to finished;
    Find index t such that sign(y_t) λ_t = max_{i∈B} sign(y_i) λ_i;
    Move index t from B to F, i.e. free variable x_t;
    Update decomposition
    end
  else
    begin
    For all i ∈ F compute
      α_i := (x_i - l_i)/(x_i - z_i)  if z_i < l_i,
      α_i := (u_i - x_i)/(z_i - x_i)  if z_i > u_i;
    α := min_{i∈F} α_i;  x := x + α(z - x);
    comment: 0 ≤ α < 1;
    Move from F to B all indices q for which x_q = l_q or x_q = u_q;
    Put y_q := 1 if x_q = l_q and y_q := -1 if x_q = u_q;
    Update decomposition
    end
end repeat
finished.
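As a hedged illustration of the free/fixed partition in the algorithm above, consider the special (and trivial) case where A is diagonal with positive entries. The objective then separates per variable, and the optimum is simply the unconstrained solution clamped to the bounds — no iteration or decomposition updating is needed:

```python
# Illustration (assumption: A = diag(a_1, ..., a_n) with a_i > 0) of the
# free/fixed partition in Problem BLS. With a diagonal A the objective
# separates per variable, so the optimum is the clamped unconstrained solution.

def bls_diagonal(a, b, l, u):
    """min ||diag(a) x - b||_2 subject to l <= x <= u, entrywise."""
    x, free, fixed = [], [], []
    for i, (ai, bi, li, ui) in enumerate(zip(a, b, l, u)):
        zi = bi / ai                        # unconstrained optimum in variable i
        if zi < li:
            x.append(li); fixed.append(i)   # fixed at its lower bound
        elif zi > ui:
            x.append(ui); fixed.append(i)   # fixed at its upper bound
        else:
            x.append(zi); free.append(i)    # free variable
    return x, free, fixed

x, free, fixed = bls_diagonal([1.0, 2.0, 1.0], [0.5, 6.0, -3.0],
                              [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
print(x)      # [0.5, 1.0, 0.0]
print(free)   # [0]
print(fixed)  # [1, 2]
```

For a general A the partition must of course be determined iteratively, as in the main loop above, with QR updates after each change of the working set.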
Several implementations of varying generality of active set methods have been
developed. LAWSON and HANSON [1974] give a FORTRAN implementation of an
algorithm for Problem NNLS, which is similar to Algorithm 27.1. They also give
a FORTRAN subroutine based on this algorithm for Problem LSD with lower bounds
only. HASKELL and HANSON [1981] give an algorithm for the NNLSE problem with
nonnegativity constraints on selected variables and equality constraints, where the
equality constraints are handled by the method of weighting, see Section 25.4. In this
no assumption on the rank of A is made. They describe several methods of
transforming problems of type LSEI, with linear equality and inequality constraints,
into the form NNLSE. In HANSON [1986] further developments of this algorithm is
described.
SECTION 27 Constrained least squares problems 615
The algorithm of STOER [1971] for Problem LSI has been realized by PATZELT
[1973], and an English version of this code is given by EICHHORN and LAWSON
[1975]. SCHITTKOWSKI and STOER [1979] give an implementation of the same
method using Gram-Schmidt decompositions. An advantage of this implementa-
tion is that it is relatively easy to take data changes into account. The implementa-
tion described by CRANE, GARBOW, HILLSTROM and MINKOFF [1980] is based on this
work. There is a restrictive assumption in these realizations that A is of full column
rank. ZIMMERMANN [1977] gives a special implementation of Stoer's method for
Problem BLS also based on Gram-Schmidt decompositions.
A robust and general set of FORTRAN subroutines for Problem LSI and convex
quadratic programming is given by GILL et al. [1986]. The method is a two-phase
active set method. It allows also a linear term in the objective function and handles
a mixture of bounds and general linear constraints.
Often the matrices A and C in Problem LSI are sparse. It is difficult to take
advantage of this, since much fill results from the sequence of transformations
applied to A and C during the iterations. For problems with only simple bounds on
the variables, sparsity can be preserved. OREBORN [1986] describes an algorithm for
Problem NNLS which uses SPARSPAK by GEORGE and NG [1984] for obtaining an
initial sparse QR decomposition of A.
Iterative methods have also been developed for sparse problems, although slow
convergence is often a problem. LÖTSTEDT [1984] gives a method for Problem LSD
using a preconditioned conjugate gradient method.
CHAPTER VI

Nonlinear Least Squares
In this chapter we discuss the solution of nonlinear least squares problems. Methods
for solving such problems are iterative and each iteration step usually requires the
solution of a related linear least squares problem. The nonlinear least squares
problem is closely related to nonlinear systems of equations and is a special case of
the general optimization problem in R^n. We will here mainly emphasize those
aspects of the nonlinear least squares problem which derive from its special
structure. For a general background to the theory of iterative methods for nonlinear
equations we refer to ORTEGA and RHEINBOLDT [1970], FLETCHER [1980], GILL,
MURRAY and WRIGHT [1981] and DENNIS and SCHNABEL [1983]. An excellent survey
of algorithms for the nonlinear least squares problem is given by DENNIS [1977].
The problem considered is
min_x φ(x) = 1/2 ||r(x)||_2^2 = 1/2 Σ_{i=1}^m r_i(x)^2. (28.1)
Here x ∈ R^n, r(x) = (r_1(x), ..., r_m(x))^T ∈ R^m is the residual vector and each r_i(x),
i = 1, ..., m, m ≥ n, is a nonlinear functional defined over R^n. Clearly, if the r_i(x) were
linear in x then (28.1) would be a linear least squares problem. For m = n, (28.1)
includes as a special case the solution of a system of nonlinear equations.
One important area in which nonlinear least squares problems arise is in data
fitting. Here one attempts to fit given data (y_i, t_i), i = 1, ..., m, to a model function
f(x, t). If we let ri(x) represent the error in the model prediction for the ith
observation,
r_i(x) = y_i - f(x, t_i),  i = 1, ..., m, (28.2)
we are led to a problem of the form (28.1). The choice of the least squares measure is
justified here, as for the linear case, by statistical considerations, see BARD [1974].
This assumes that only the y_i are subject to errors and the values t_i of the independent
variable t are exact. The case when there are errors in both y and ti is discussed in
Section 34.
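As a small illustration of (28.1)-(28.2), the sketch below evaluates the residual vector and the sum-of-squares objective. The exponential model f(x, t) = x_1 e^{x_2 t} and the data are made-up assumptions, not taken from the text:

```python
# Residuals r_i(x) = y_i - f(x, t_i) as in (28.2) and the objective
# phi(x) = 1/2 * sum r_i(x)^2 as in (28.1), for an assumed model
# f(x, t) = x1 * exp(x2 * t) and invented data.

import math

def residuals(x, t, y):
    return [yi - x[0] * math.exp(x[1] * ti) for ti, yi in zip(t, y)]

def phi(x, t, y):
    # the nonlinear least squares objective (28.1)
    return 0.5 * sum(ri * ri for ri in residuals(x, t, y))

t = [0.0, 1.0, 2.0]
y = [1.0, 2.7, 7.5]
print(phi([1.0, 1.0], t, y))   # small but nonzero: the data lie near exp(t)
```

Each evaluation of φ requires one pass over the data; the methods of the following sections additionally need the Jacobian of the residual vector.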
618 A. Björck CHAPTER VI
The basic methods for the nonlinear least squares problem require derivative
information about the components r_i(x), and we assume in the following that the r_i(x)
are twice continuously differentiable. The Jacobian of r(x) is J(x) ∈ R^{m×n}, where
J(x)_{ij} = ∂r_i(x)/∂x_j,
and the Hessian matrices of r_i(x) are G_i(x) ∈ R^{n×n}, where
G_i(x)_{jk} = ∂^2 r_i(x)/∂x_j ∂x_k,  i = 1, ..., m.
It is easily shown that the first and second derivatives of φ(x) are given by
∇φ(x) = J(x)^T r(x), (28.3)
and
∇^2 φ(x) = J(x)^T J(x) + Σ_{i=1}^m r_i(x) G_i(x). (28.4)
The special forms of ∇φ(x) and ∇^2 φ(x) can be exploited by methods for the
nonlinear least squares problem.
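The formula ∇φ(x) = J(x)^T r(x) can be checked numerically on a small example. The two-component residual below is an arbitrary illustration, and the gradient is compared against central finite differences:

```python
# Numerical check of (28.3): for phi(x) = 1/2 ||r(x)||^2 the gradient
# equals J(x)^T r(x). The residual below is an arbitrary illustration.

def r(x):
    return [x[0] ** 2 - x[1], x[0] + x[1] - 2.0]

def jacobian(x):
    # J_ij = d r_i / d x_j, computed analytically for this residual
    return [[2.0 * x[0], -1.0], [1.0, 1.0]]

def phi(x):
    return 0.5 * sum(ri * ri for ri in r(x))

def grad_analytic(x):
    J, rv = jacobian(x), r(x)
    return [sum(J[i][j] * rv[i] for i in range(2)) for j in range(2)]

def grad_fd(x, h=1e-6):
    g = []
    for j in range(2):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        g.append((phi(xp) - phi(xm)) / (2.0 * h))
    return g

x = [1.5, 0.5]
print(grad_analytic(x))   # agrees with the finite differences below
print(grad_fd(x))
```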
A necessary condition for x* to be a local minimum of φ(x) is that ∇φ(x*) =
J(x*)^T r(x*) = 0. Any point which satisfies this condition will be called a critical point.
We now establish a condition for a critical point x* to be a local minimum
of φ(x), following a geometric approach by WEDIN [1974]. If we assume that
rank(J(x*)) = n, then it follows that J(x*)†J(x*) = I_n, where J(x*)† is the pseudo-
inverse of J(x*). We now rewrite (28.4) as
∇^2 φ = J^T J - γG_w = J^T (I - γ(J†)^T G_w J†) J, (28.5)
where
γ = ||r(x*)||_2,  w = -r(x*)/γ,  G_w = Σ_{i=1}^m w_i G_i, (28.6)
and all quantities are evaluated at the point x*. The symmetric matrix
K = (J†)^T G_w J† (28.7)
is called the normal curvature matrix of the n-dimensional surface z = r(x) in R^m, with
respect to the normal vector w. Let the eigenvalues of K be
κ_1 ≥ κ_2 ≥ ... ≥ κ_n.
The quantities ρ_i = 1/κ_i, κ_i ≠ 0, are the principal radii of curvature of the surface with
respect to the normal w.
It follows that ∇^2 φ(x*) is positive-definite and x* a local minimum if and only if
I - γK is positive-definite, i.e. when
1 - γκ_1 > 0.
If 1 - γκ_1 < 0, then the least squares problem has a saddle point at x*, or, if also
1 - γκ_n < 0, even a local maximum at x*.
The problem is then to find the point on this surface closest to the observation vector
y ∈ R^m, cf. DRAPER and SMITH ([1981], pp. 500-501). This is illustrated in Fig. 28.1
for the simple case of m = 2 observations and only a single parameter x. Since in the
figure we have γ = ||r(x*)||_2 < ρ_1, it follows that 1 - γκ_1 > 0, which is consistent with the
fact that x* is a local minimum.
FIG. 28.1. [The curve z = f(x, t) in the (z_1, z_2)-plane and the point f(x*, t) closest to y.]
There are basically two different ways to view problem (28.1). One could think of
this problem as arising from an overdetermined system of nonlinear equations
r_i(x) = 0,  i = 1, 2, ..., m. (28.8)
It is then natural to approximate r(x) by a linear model around a given point x_c,
r̃(x) = r(x_c) + J(x_c)(x - x_c). (28.9)
Then r̃(x) = 0 is an overdetermined system of linear equations. One can use the
corresponding linear least squares problem to derive an improved approximate
solution to (28.1). Note that (28.9) only uses first-order derivative information about
r(x). This approach, which leads to the Gauss-Newton-type methods and the
Levenberg-Marquardt method, is discussed in Sections 29 and 30.
In the second approach (28.1) is viewed as a special case of optimization in R^n, and
a quadratic model of φ(x) around a point x_c is used,
φ̃(x) = φ(x_c) + ∇φ(x_c)^T (x - x_c) + 1/2 (x - x_c)^T ∇^2 φ(x_c)(x - x_c), (28.10)
where the derivatives of φ(x) are given by (28.3) and (28.4). The minimizer of φ̃(x) is
given by
x_N = x_c - (J(x_c)^T J(x_c) + Q(x_c))^{-1} J(x_c)^T r(x_c), (28.11)
where Q(x) = Σ_{i=1}^m r_i(x) G_i(x). In the Gauss-Newton method the search direction
p_k is taken as a solution of the linearized problem (29.1),
and the new approximation is x_{k+1} = x_k + p_k. If J(x_k) is not rank-deficient, then the
solution to (29.1) is unique and can be written
p_k = -(J(x_k)^T J(x_k))^{-1} J(x_k)^T r(x_k).
The Gauss-Newton method as described above has the advantage that it solves
linear problems in just one iteration and has fast convergence on mildly nonlinear
problems. However, it may not be locally convergent on problems that are very
nonlinear or have large residuals. This is illustrated by an example due to
Powell, see FLETCHER ([1980], p. 94), in which the undamped Gauss-Newton iterates
fail to converge to the minimizer. A remedy is to introduce a step length α_k and take
x_{k+1} = x_k + α_k p_k,
where p_k is the solution to (29.1) and α_k is a scalar. The resulting method, which uses
p_k as a search direction, is usually called the damped Gauss-Newton method. The
vector p_k is called the Gauss-Newton direction.
The Gauss-Newton direction has the following two important properties:
(i) The vector p_k is invariant under linear transformations of the independent
variable x.
(ii) Provided that J(x_k)^T r(x_k) ≠ 0, p_k is a descent direction, i.e. for sufficiently small
α > 0 we have
φ(x_k + α p_k) < φ(x_k).
The Gauss-Newton direction satisfies
J(x_k) p_k = -P_{J_k} r(x_k), (29.3)
where P_{J_k} is the orthogonal projector onto the range of J(x_k). It can be shown,
using the singular value decomposition of J(x_k), that J(x_k)^T r(x_k) ≠ 0 implies that
P_{J_k} r(x_k) ≠ 0, which together with (29.3) proves that p_k is a descent direction.
In the special case that J(x_k) has full column rank the property follows from
p_k^T ∇φ(x_k) = -r(x_k)^T P_{J_k} r(x_k) = -||P_{J_k} r(x_k)||_2^2 < 0.
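The descent property can be verified numerically on a small example. The residual and Jacobian values below are arbitrary, and the 2×2 linear algebra is written out by hand:

```python
# Check of property (ii): the Gauss-Newton direction p = -(J^T J)^{-1} J^T r
# satisfies p^T grad < 0 for phi = 1/2 ||r||^2. Data are an arbitrary example.

def solve2(M, b):
    # Cramer's rule for a 2x2 system M p = b
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(b[0] * M[1][1] - b[1] * M[0][1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

r = [1.75, 0.0]                              # r(x_k)
J = [[3.0, -1.0], [1.0, 1.0]]                # J(x_k)
g = [J[0][0] * r[0] + J[1][0] * r[1],        # gradient J^T r
     J[0][1] * r[0] + J[1][1] * r[1]]
JtJ = [[sum(J[i][a] * J[i][c] for i in range(2)) for c in range(2)]
       for a in range(2)]
p = solve2(JtJ, [-g[0], -g[1]])              # Gauss-Newton direction
print(sum(pi * gi for pi, gi in zip(p, g)))  # negative: descent direction
```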
To make the damped Gauss-Newton method into a viable algorithm the step
length α_k must be chosen carefully. Two common ways of choosing α_k are:
(i) Take α_k to be the largest number in the sequence 1, 1/2, 1/4, ... for which the
inequality
||r(x_k)||_2^2 - ||r(x_k + α_k p_k)||_2^2 ≥ 1/2 α_k ||J(x_k) p_k||_2^2
holds. This is essentially the Armijo-Goldstein step length principle (note that -J(x_k)p_k =
P_{J_k} r(x_k)), see ORTEGA and RHEINBOLDT ([1970], p. 491) and GILL, MURRAY and
WRIGHT ([1981], p. 100).
(ii) Take α_k as the solution to the one-dimensional minimization problem
min_α ||r(x_k + α p_k)||_2^2, (29.4)
i.e. do an "exact" line search. This is computationally more expensive than (i).
A theoretical analysis of these two step length principles has been given by RUHE
[1979].
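Step length rule (i) can be sketched as a simple halving loop. The one-parameter residual below is a made-up linear example; since p is the exact Gauss-Newton direction, the full step α = 1 is accepted immediately:

```python
# Sketch of step length rule (i): halve alpha until
# ||r(x)||^2 - ||r(x + alpha p)||^2 >= 1/2 * alpha * ||J p||^2.
# One-parameter residual r(x) = (x, 2x - 1); an invented illustration.

def r(x):
    return [x, 2.0 * x - 1.0]

def sq(v):
    return sum(vi * vi for vi in v)

def armijo_goldstein(x, p, Jp):
    alpha = 1.0
    # the floor on alpha is a safeguard, not part of the rule in the text
    while alpha > 1e-10 and sq(r(x)) - sq(r(x + alpha * p)) < 0.5 * alpha * sq(Jp):
        alpha *= 0.5
    return alpha

# At x = 1: r = (1, 1), J = (1, 2)^T, Gauss-Newton step p = -J^T r / J^T J = -3/5
x, p = 1.0, -0.6
Jp = [p, 2.0 * p]
alpha = armijo_goldstein(x, p, Jp)
print(alpha)   # 1.0: the full Gauss-Newton step is accepted
```

For the exact Gauss-Newton direction the test is always satisfiable for small enough α, since the left-hand side behaves like 2α||P_J r||^2 while the right-hand side is only α/2 times ||P_J r||^2.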
Often a step length α_k is chosen to be an approximate solution of (29.4). Such an
algorithm, which takes into account the fact that φ(x) = 1/2 ||r(x)||_2^2 is a sum of squares,
has been given by LINDSTRÖM and WEDIN [1984]. They determine an approximation
p(α) of the curve ρ(α) = r(x_k + α p_k) in R^m, and then minimize ||p(α)||_2 as a function of α.
One alternative is to choose p(α) to be the unique circle (in the degenerate case
a straight line) which satisfies the conditions
EXAMPLE 29.2 (GILL, MURRAY and WRIGHT [1982], p. 136). Let J = J(x_k) and r = r(x_k)
be defined by
J = diag(1, ε),  r = (r_1, r_2)^T,
where ε << 1 and r_1 and r_2 are of order unity. If J is considered to be of rank two, then
the Gauss-Newton direction is
s_2 = -(r_1, r_2/ε)^T,
whereas if the rank is taken to be one, the direction s_1 = -(r_1, 0)^T results.
Clearly the two directions s_1 and s_2 are almost orthogonal and s_2 is almost
orthogonal to the gradient vector J^T r.
The undamped Gauss-Newton method is a fixed point iteration x_{k+1} = F(x_k),
using the notations of (28.6). The asymptotic rate of convergence is bounded by the
spectral radius of the matrix ∇F(x*) at the solution x*. But ∇F(x) has the same
nonzero eigenvalues, and thus the same spectral radius, as the matrix
γ(J(x)†)^T G_w J(x)† = γK,
where K is the normal curvature matrix (28.7). Hence,
ρ = ρ(∇F) = γ max(κ_1, -κ_n). (29.5)
For the damped Gauss-Newton method one can show that ρ < 1, i.e. we always get
convergence close to a local minimum. This is in contrast to
the undamped Gauss-Newton method, which may fail to converge to a local
minimum.
The rate of convergence for the undamped Gauss-Newton method can be
estimated during the iterations from the ratio of the norms of successive corrections p_k.
FIG. 29.1.
There is still a possibility that the damped Gauss-Newton method can have
difficulty getting past an intermediate point where the Jacobian matrix does not
have full column rank. This can be avoided either by taking second derivatives into
account (see Section 31) or by further stabilizing the damped Gauss-Newton
method to overcome this possibility of failure. Methods using the latter approach
were first suggested by LEVENBERG [1944] and MARQUARDT [1963]. Here a search
direction p_k is computed by solving the problem
min_p { ||r(x_k) + J(x_k)p||_2^2 + μ_k ||p||_2^2 }, (30.1)
where the parameter μ_k ≥ 0 controls the iterations and limits the size of p_k. Note that
p_k is well-defined even when J(x_k) is rank-deficient. As μ_k → ∞, ||p_k||_2 → 0 and p_k
becomes parallel to the steepest descent direction. It follows from the discussion of
Problem LSQI in Section 26 that p_k is the solution to the least squares problem with
quadratic constraint
min_p ||r(x_k) + J(x_k)p||_2,  s.t. ||p||_2 ≤ δ_k, (30.2)
for some δ_k depending on μ_k. The set of
feasible vectors p, ||p||_2 ≤ δ_k, can be thought of as a region of trust for the linear
model
r(x) ≈ r(x_k) + J(x_k)p,  p = x - x_k.
Thus, these methods can be thought of as a special case of trust region methods. For
a general description of trust region methods for nonlinear optimization, see MORÉ
[1983].
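The effect of the parameter μ_k on the Levenberg-Marquardt step can be seen on a small example: as μ grows, ||p(μ)|| shrinks monotonically and p(μ) turns toward the steepest descent direction -J^T r. The 2×2 data below are arbitrary:

```python
# The Levenberg-Marquardt step p(mu) = -(J^T J + mu I)^{-1} J^T r from (30.1).
# As mu increases, ||p(mu)|| decreases. Data are an arbitrary illustration.

def lm_step(JtJ, g, mu):
    # solve (JtJ + mu I) p = -g for a 2x2 system by Cramer's rule
    M = [[JtJ[0][0] + mu, JtJ[0][1]], [JtJ[1][0], JtJ[1][1] + mu]]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(-g[0] * M[1][1] + g[1] * M[0][1]) / det,
            (-g[1] * M[0][0] + g[0] * M[1][0]) / det]

def norm(p):
    return (p[0] ** 2 + p[1] ** 2) ** 0.5

JtJ = [[10.0, -2.0], [-2.0, 2.0]]    # J^T J
g = [5.25, -1.75]                    # J^T r
for mu in (0.0, 1.0, 100.0):
    p = lm_step(JtJ, g, mu)
    print(mu, norm(p))               # ||p|| decreases as mu increases
```

In the trust region view, each value of μ corresponds to a trust radius δ = ||p(μ)||_2, which is why the two formulations (30.1) and (30.2) define the same family of steps.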
Many different strategies have been used to choose μ_k in (30.1). A careful
implementation of the Levenberg-Marquardt algorithm as a scaled trust region
algorithm has been described by MORÉ [1978], and has been implemented in
MINPACK by MORÉ et al. [1980]. Moré considers an iteration of the following form:
Let x_0, D_0, δ_0 and β ∈ (0, 1) be given. For k = 0, 1, 2, ...
Step 1. Compute ||r(x_k)||_2.
Step 2. Determine p_k as a solution to the subproblem
The analysis in the previous section has shown that for large residual problems and
strongly nonlinear problems, methods of Gauss-Newton type may converge slowly.
Also, these methods can have problems at points where the Jacobian is rank-deficient.
When second derivatives of φ(x) are available Newton's method can be used to
overcome these problems. This method uses the quadratic model (28.10) of
φ(x) = 1/2 ||r(x)||_2^2 at the current approximation x_c. The critical point x_N given by
(28.11) of this quadratic model is chosen as the next approximation.
It can be shown (see DENNIS and SCHNABEL [1983], p. 229) that Newton's method
is locally quadratically convergent as long as ∇^2 φ is Lipschitz continuous around x*
and ∇^2 φ(x*) is nonsingular.
'These three subroutines are included in the NPL Algorithm Library, which is available from National
Physical Laboratory, Teddington, England.
One case where this approach can be used is when every function r_i(x) only depends on a small subset of the
n variables. Then the Jacobian J(x) and the element Hessian matrices G_i(x) will be
sparse and it may not be infeasible to store approximations to all G_i(x), i = 1, ..., m.
Suppose the variables can be partitioned as x = (y, z)^T, y ∈ R^p, z ∈ R^q, p + q = n,
such that for fixed z the subproblem
min_y ||r(y, z)||_2 (32.1)
is easy to solve. In the following we restrict ourselves to the particular case when r(y, z)
is linear in y, i.e.
r(y, z) = F(z)y - g(z),  F(z) ∈ R^{m×p}. (32.2)
Then the minimum norm solution to (32.1) is
y(z) = F†(z)g(z),
where F†(z) is the pseudoinverse of F(z). The original problem can then be written
min_z ||g(z) - F(z)y(z)||_2 = min_z ||(I - P_F(z))g(z)||_2, (32.3)
where P_F(z) = F(z)F(z)† is the orthogonal projector onto the range of F(z).
Algorithms based on (32.3) are often called variable projection algorithms.
Many practical nonlinear least squares problems are separable in this way.
A particularly simple case is when r(y, z) is linear in both y and z, so that we also have
r(y, z) = G(y)z - h(y),  G(y) ∈ R^{m×q}. (32.4)
A standard example of a separable problem is exponential fitting to the model
f(x, t) = y_1 e^{z_1 t} + y_2 e^{z_2 t}. Here the model is nonlinear only in the parameters z_1 and
z_2. Given values of z_1 and z_2 the subproblem (32.1) is easily solved.
Special purpose algorithms for separable nonlinear least squares problems were
first considered by SCOLNIK [1972]. A variable projection algorithm using a Gauss-
Newton method applied to problem (32.3) was developed by GOLUB and PEREYRA
[1973]. KAUFMAN [1975] proposed a simplification of this algorithm, which is
computationally more efficient.
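A toy variable projection sketch for the one-term exponential model f(t) = y e^{z t} is given below: for each trial z the linear parameter y is eliminated via its closed-form least squares solution, and a crude grid search over z stands in for the Gauss-Newton outer iteration of Golub and Pereyra. The data are generated from y = 2, z = 0.5:

```python
# Variable projection sketch for the separable model f(t) = y * exp(z * t):
# the linear parameter y is eliminated in closed form, and only z is searched.
# Grid search replaces the Gauss-Newton outer iteration for simplicity.

import math

t = [0.0, 0.5, 1.0, 1.5]
data = [2.0 * math.exp(0.5 * ti) for ti in t]    # exact data: y = 2, z = 0.5

def best_y(z):
    # minimize ||F(z) y - g||_2 over the scalar y; here F(z)_i = exp(z t_i)
    f = [math.exp(z * ti) for ti in t]
    return sum(fi * gi for fi, gi in zip(f, data)) / sum(fi * fi for fi in f)

def projected_residual(z):
    y = best_y(z)
    return sum((y * math.exp(z * ti) - gi) ** 2 for ti, gi in zip(t, data))

zs = [0.3 + 0.05 * i for i in range(9)]          # grid 0.3, 0.35, ..., 0.7
z_best = min(zs, key=projected_residual)
print(z_best, best_y(z_best))                    # near z = 0.5, y = 2.0
```

Note that no starting value for y is needed, which is one of the practical advantages of the variable projection approach mentioned below.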
The Kaufman algorithm consists of two steps. Let x_k = (y_k, z_k)^T be
the current approximation.
Step 1. Compute δy_k that solves the linear subproblem
min_{δy} ||F(z_k)δy - (g(z_k) - F(z_k)y_k)||_2, (32.5)
and put y_{k+1/2} = y_k + δy_k.
Step 2. Compute a Gauss-Newton direction p_k at (y_{k+1/2}, z_k), take
x_{k+1} = x_k + α_k p_k and go to Step 1.
In (32.7) we have used that by (32.2) r_y(y_{k+1/2}, z_k) = F(z_k). Further we have
r_z(y_{k+1/2}, z_k) = B(z_k)y_{k+1/2} - g'(z_k),
where
B(z) = (∂F/∂z_1 y | ... | ∂F/∂z_q y) ∈ R^{m×q}.
Note that in case r(y, z) is linear also in z, it follows from (32.4) that C(x_{k+1/2}) =
(F(z_k), G(y_{k+1/2})).
RUHE and WEDIN [1980] have given a general analysis of different algorithms for
separable problems. They show that the Gauss-Newton algorithm applied to (32.3)
and the original problem give the same asymptotic convergence rate. In particular
both converge quadratically for a zero-residual problem. This is in contrast to the
naive algorithm of alternately minimizing ||r(y, z)||_2 over y and z, which always
converges linearly. They also prove that the simplified algorithm of Kaufman has
roughly the same asymptotic convergence rate as the one proposed by Golub and
Pereyra.
To be robust the algorithms for separable problems must employ similar
stabilizing techniques for the Gauss-Newton steps as described in Sections 29 and
30. It is fairly straightforward to implement these for the Kaufman method.
We have mentioned the negative theoretical result that the special algorithms for
separable problems have the same local rate of convergence as ordinary Gauss-
Newton. One advantage is that no starting values for the linear parameters have to
be provided. In the Kaufman algorithm, e.g., we can take y_0 = 0 and determine
y_1 = δy_0 in the first step of (32.5). This seems to make a difference in the first steps of
the iterations. KROGH [1974] reports that the variable projection algorithm solved
several problems which methods not using separability could not solve.
GOLUB and LEVEQUE [1979] have extended the variable projection method for
solving problems in which it is desired to fit more than one data vector with the same
nonlinear parameter vector, though with different linear parameters for each
right-hand side. KAUFMAN and PEREYRA [1978] have extended the Golub-Pereyra
method to problems with separable nonlinear constraints. The Kaufman method
seems even easier to generalize to constrained problems.
In a more general setting the solution to nonlinear least squares problems may be
subject to constraints. In case of nonlinear equality constraints the problem can be
stated as
min ll r(x)112,
(33.1)
s.t. h(x)=O,
where r(x) ∈ R^m, h(x) ∈ R^p and x ∈ R^n.
The Gauss-Newton method can be generalized to problem (33.1) by considering
a linear model at a point Xk. A search direction Pk is computed as a solution to the
linear constrained problem
min IIr(xk)+ J(xk)p 112
P (33.2)
s.t. h(xk) + C(xk)p = 0,
where J and C are the Jacobian matrices for r(x) and h(x) respectively. This problem
can be solved by the methods described in Section 25. The search direction Pk
obtained from (33.2) can be shown to be a descent direction for the merit function
ψ_μ(x) = ||r(x)||_2^2 + μ ||h(x)||_2^2
at the point x_k, provided that μ is large enough. This makes it possible to stabilize the
Gauss-Newton method with a line search strategy or a trust region technique (cf.
Sections 29 and 30). With a suitable active set strategy such an algorithm can be
extended to handle also problems with nonlinear inequality constraints. An
algorithm based on this approach has been developed by LINDSTRÖM [1983].
There are some algorithms specialized to solve the nonlinear least squares
problem subject to linear constraints. In HOLT and FLETCHER [1979] the unknowns
can be constrained by lower and upper bounds. LINDSTRÖM [1984] describes two
easy-to-use routines ENLSIP and ELSUNC for solving the general constrained or
the simple bound case. These algorithms are based on the Gauss-Newton method
with a specialized line search, see LINDSTRÖM and WEDIN [1984]. Far from the
solution the algorithm can be stabilized by a certain subspace minimization. Close
to the solution the algorithm switches to a second-order method (Newton's method
in the unconstrained case) when the Gauss-Newton method converges slowly. The
trust region approach for unconstrained problems is generalized to handle linear
inequality constraints by GAY [1984] and WRIGHT and HOLT [1985]. Popular
general nonlinear optimization algorithms have also been used to solve nonlinear
least squares problems with nonlinear inequality constraints (see SCHITTKOWSKI
[1985] and MAHDAVI-AMIRI [1981]).
We mention that implicit curve fitting problems, where a model h(y, x, t) =0 is to
be fitted to observations (yi, ti), i= 1,..., m, can be formulated as a special least
squares problem with nonlinear constraints:
min_{x, ε, δ} Σ_{i=1}^m (ε_i^2 + δ_i^2),  s.t. h(y_i + ε_i, x, t_i + δ_i) = 0, i = 1, ..., m.
In Section 28 it was mentioned that nonlinear least squares problems often arise
from the fitting of observations (yi, ti), i= 1,..., m, to a mathematical model
y = f(x, t). (34.1)
In the classical regression model the measurements ti of the independent variable are
assumed to be exact and only the observations yi are subject to random errors. In
this section we consider the more general situation, when also the measurements of
the independent variable t contain errors.
Assume that y_i and t_i are subject to errors ε_i and δ_i respectively, so that
y_i + ε_i = f(x, t_i + δ_i),  i = 1, ..., m. (34.2)
If the errors ε_i and δ_i are independent random variables with zero mean and variance
σ^2, then it seems reasonable to choose the parameters x so that the sum of squares of
the orthogonal distances r_i from the observations (y_i, t_i) to the curve (34.1) is
minimized, cf. Fig. 34.1. We have that r_i = (ε_i^2 + δ_i^2)^{1/2}, where ε_i and δ_i solve
min_{ε_i, δ_i} (ε_i^2 + δ_i^2),
s.t. y_i + ε_i = f(x, t_i + δ_i).
Hence the parameters x should be chosen as the solution to
min_{x, ε, δ} Σ_{i=1}^m (ε_i^2 + δ_i^2),  s.t. y_i + ε_i = f(x, t_i + δ_i), i = 1, ..., m. (34.3)
FIG. 34.1. [Orthogonal distances from the observations to the curve y = f(x, t).]
Note that (34.3) is a nonlinear least squares problem even if f(x, t) is linear in x.
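This nonlinearity is visible already for the line y = x·t through the origin: the squared orthogonal distance carries a factor 1/(1 + x^2) that is absent from the vertical distance. The data points below are invented:

```python
# Orthogonal vs vertical distances for the linear model y = x * t.
# Even though the model is linear in x, the orthogonal distance objective
# (34.3) is nonlinear in x because of the 1/(1 + x^2) factor.

def vertical_ss(x, pts):
    return sum((y - x * t) ** 2 for t, y in pts)

def orthogonal_ss(x, pts):
    # squared orthogonal distance from (t, y) to the line y = x * t
    return sum((y - x * t) ** 2 for t, y in pts) / (1.0 + x * x)

pts = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
x = 2.0
print(vertical_ss(x, pts))       # about 0.06
print(orthogonal_ss(x, pts))     # about 0.012
```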
So far we have implicitly assumed that y and t are scalar variables. More generally,
if y ∈ R^{n_y} and t ∈ R^{n_t}, then we have the problem
min_{x, ε, δ} Σ_{i=1}^m (||ε_i||_2^2 + ||δ_i||_2^2),  s.t. y_i + ε_i = f(x, t_i + δ_i), i = 1, ..., m. (34.4)
Finally, if ε_i and δ_i do not have constant covariance matrices, then weighted norms
should be substituted above.
Independently of statistical considerations, the orthogonal distance problem has
natural applications in curve fitting.
EXAMPLE 34.1. Consider the problem of fitting a half circle y = b + (r^2 - (t - a)^2)^{1/2} to
a given set of points (y_i, t_i), i = 1, 2, ..., m. It is obvious (see Fig. 34.2) that minimizing
the sum of squares of either horizontal or vertical distances to the circle will normally
not be satisfactory.
FIG. 34.2.
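For a full circle the orthogonal distance has a simple closed form, which makes Example 34.1 easy to experiment with. The centre (a, b) and radius ρ below are hypothetical parameters:

```python
# Orthogonal distance from a point (t_i, y_i) to the circle with centre
# (a, b) and radius rho: | sqrt((t_i - a)^2 + (y_i - b)^2) - rho |.
# Centre and radius values here are hypothetical.

import math

def circle_distance(t, y, a, b, rho):
    return abs(math.hypot(t - a, y - b) - rho)

# points on and off the unit circle centred at the origin
print(circle_distance(1.0, 0.0, 0.0, 0.0, 1.0))   # 0.0: on the circle
print(circle_distance(0.0, 2.0, 0.0, 0.0, 1.0))   # 1.0
print(circle_distance(3.0, 4.0, 0.0, 0.0, 1.0))   # 4.0
```

Summing the squares of these distances over the data and minimizing over (a, b, ρ) gives the orthogonal distance fit, in contrast to the unsatisfactory horizontal or vertical fits of the example.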
The general orthogonal distance problem has not received the same attention as
the standard nonlinear regression problem, except for the case when f is linear in x.
One reason is that if the errors in the independent variables are small then ignoring
these errors will not seriously degrade the estimates of x. For the special case when
y = x^T t,  y ∈ R,  x, t ∈ R^n,
the orthogonal distance problem is a total least squares problem, and an algorithm
based on the singular value decomposition has been described in Section 24.
However, recently algorithms for the nonlinear case, based on stabilized Gauss-
Newton methods, have been given by SCHWETLICK and TILLER [1986] and BOGGS,
BYRD and SCHNABEL [1987].
Problem (34.3) has (m + n) unknowns x and δ. In applications usually m >> n, and
accounting for the errors in t will considerably increase the size of the problem.
Therefore the use of standard software for nonlinear least squares to solve
orthogonal distance problems is not efficient or feasible. This is even more
accentuated for (34.4), which has m·n_t + n variables. We now show how the special
structure of (34.3) can be taken into account to reduce the work. Similar comments
apply to the general case (34.4).
If we define the residual vector r(δ, x) = (r_1(δ, x)^T, r_2(δ)^T)^T by
r_1(δ, x)_i = f(x, t_i + δ_i) - y_i,  r_2(δ)_i = δ_i,  i = 1, ..., m,
then (34.3) is the standard nonlinear least squares problem min_{δ,x} ||r(δ, x)||_2.
The Jacobian matrix corresponding to this problem can be written in block form
as
J = [ D_1  J_1 ; I_m  0 ] ∈ R^{2m×(m+n)}, (34.5)
where
D_1 = diag(d_1, ..., d_m),  d_i = (∂f/∂t)_{t = t_i + δ_i},  (J_1)_{ij} = ∂f(x, t_i + δ_i)/∂x_j,
where J_1, r_1 and r_2 are evaluated at the current estimates of δ and x. To solve this
problem we need the QR decomposition of J. This can be computed in two steps.
First we apply a sequence of Givens rotations Q_1 = G_m ··· G_2 G_1, where G_i rotates
rows i and m + i,
The reduced matrix here is m×n, so this is a problem of the same size as that which
defines the Gauss-Newton correction in the classical nonlinear least squares problem.
We then have
Δδ_k = D_1^{-1}(s_2 - K Δx_k).
To stabilize this Gauss-Newton method we can use the techniques described in
Sections 29 and 30. SCHWETLICK and TILLER [1985] use a partial Marquardt type
regularization where only the Δx part of J is regularized. The algorithm by BOGGS,
BYRD and SCHNABEL [1987] incorporates a full trust region strategy.
References
ABDELMALEK, N.N. (1971), Roundoff error analysis for Gram-Schmidt method and solution of linear
least squares problems, BIT 11, 345-368.
AL-BAALI, M. and R. FLETCHER (1986), An efficient line search for nonlinear least-squares, J. Optim.
Theory Appl. 48, 359-377.
ANDERSON, N. and I. KARASALO (1975), On computing bounds for the least singular value of a triangular
matrix, BIT 15, 1-4.
ASHKENAZI, V. (1971), Geodetic normal equations, in: J.K. REID, ed., Large Sets of Linear Equations
(Academic Press, New York) 57-74.
AVILA, J.K. and J.A. TOMLIN (1979), Solution of very large least squares problems by nested dissection on
a parallel processor, in: Proceedings Computer Science and Statistics: Twelfth Annual Symposium on the
Interface, Waterloo, Ont.
BARD, Y. (1974), Nonlinear Parameter Estimation (Academic Press, New York).
BAREISS, E.H. (1983), Numerical solution of the weighted linear least squares problems by G-transfor-
mations, SIAM J. Algebraic Discrete Methods.
BARLOW, J.L. (1985), Stability analysis of the G-algorithm and a note on its application to sparse least
squares problems, BIT 25, 507-520.
BARRODALE, I. and C. PHILLIPS (1975), Algorithm 495: Solution of an overdetermined system of linear
equations in the Chebychev norm, ACM Trans. Math. Software 1, 264-270.
BARRODALE, I. and F.D.K. ROBERTS (1973), An improved algorithm for discrete L_1 linear approximation,
SIAM J. Numer. Anal. 10, 839-848.
BARTELS, R.H., A.R. CONN and C. CHARALAMBOUS (1978), On Cline's direct method for solving
overdetermined linear systems in the L_∞ sense, SIAM J. Numer. Anal. 15, 255-270.
BARTELS, R.H., A.R. CONN and J.W. SINCLAIR (1978), Minimization techniques for piecewise differentiable
functions: The L_1 solution to an overdetermined linear system, SIAM J. Numer. Anal. 15, 224-241.
BAUER, F.L. (1965), Elimination with weighted row combinations for solving linear equations and least
squares problems, Numer. Math. 7, 338-352.
BERMAN, A. and R.J. PLEMMONS (1974), Cones and iterative methods for best least squares solutions of
linear systems, SIAM J. Numer. Anal. 11, 145-154.
BJORCK, A. (1967), Solving linear least squares problems by Gram-Schmidt orthogonalization, BIT 7,
1-21.
BJORCK, A. (1967), Iterative refinement of linear least squares solutions. I, BIT 7, 257-278.
BJORCK, A. (1968), Iterative refinement of linear least squares solutions. II, BIT 8, 8-30.
BJORCK, A. (1976), Methods for sparse least squares problems, in: J.R. BUNCH and D.J. ROSE, eds., Sparse
Matrix Computations (Academic Press, New York) 177-199.
BJORCK, A. (1978), Comment on the iterative refinement of least squares solutions, J. Amer. Statist. Assoc.
73, 161-166.
BJORCK, A. (1979), SSOR preconditioning methods for sparse least squares problems, in: Proceedings
Computer Science and Statistics: Twelfth Annual Symposium on the Interface, Waterloo, Ont.
BJORCK, A. (1984), A general updating algorithm for constrained linear least squares problems, SIAM J.
Sci. Statist. Comput. 5, 394-402.
BJORCK, A. (1987), Stability analysis of the method of semi-normal equations for least squares problems,
Linear Algebra Appl. 88/89, 31-48.
BJORCK, A. (1988), A bidiagonalization algorithm for solving ill-posed systems of linear equations,
BIT 28, 659-670.
BJORCK, A. and I.S. DUFF (1980), A direct method for the solution of sparse linear least squares problems,
Linear Algebra Appl. 34, 43-67.
BJORCK, A. and L. ELDEN (1979), Methods in numerical algebra for ill-posed problems, Rept.
LiTH-MAT-R-33-79, Linköping University, Linköping, Sweden.
BJORCK, A. and T. ELFVING (1979), Accelerated projection methods for computing pseudoinverse
solutions of systems of linear equations, BIT 19, 145-163.
BJORCK, A. and G.H. GOLUB (1967), Iterative refinement of linear least square solutions by Householder
transformation, BIT 7, 322-337.
BJORCK, A. and G.H. GOLUB (1973), Numerical methods for computing angles between linear subspaces,
Math. Comp. 27, 579-594.
BJORCK, A., R.J. PLEMMONS and H. SCHNEIDER (1981), Large-Scale Matrix Problems (North-Holland,
New York).
BOGGS, P.T., R.H. BYRD and R.B. SCHNABEL (1987), A stable and efficient algorithm for nonlinear
orthogonal regression, SIAM J. Sci. Statist. Comput. 8, 1052-1078.
BOJANCZYK, A. and R.P. BRENT (1986), Parallel solution of certain Toeplitz least squares problems,
Linear Algebra Appl. 77, 43-60.
BOJANCZYK, A., R.P. BRENT, P. VAN DOOREN and F. DE HOOG (1987), A note on downdating the Cholesky
factorization, SIAM J. Sci. Statist. Comput. 8, 210-221.
BOLSTAD, J.H. et al. (1978), Numerical analysis program library user's guide, User Note 82, SLAC
Computer Services, Stanford Linear Accelerator Center, Menlo Park, CA.
BUNCH, J.R. and L. KAUFMAN (1977), Some stable methods for calculating inertia and solving symmetric
linear systems, Math. Comp. 31, 162-179.
BUNCH, J.R., L. KAUFMAN and B.N. PARLETT (1976), Decomposition of a symmetric matrix, Numer.
Math. 27, 95-109.
BUNCH, J.R. and C.P. NIELSEN (1978), Updating the singular value decomposition, Numer. Math. 31,
111-129.
BUNCH, J.R. and B.N. PARLETT (1971), Direct methods for solving symmetric indefinite systems of linear
equations, SIAM J. Numer. Anal. 8, 639-655.
BUNCH, J.R. and D.J. ROSE, eds. (1976), Sparse Matrix Computations (Academic Press, New
York).
BUSINGER, P. and G.H. GOLUB (1965), Linear least squares solutions by Householder transformations,
Numer. Math. 7, 269-276.
BUSINGER, P.A. and G.H. GOLUB (1969), Algorithm 358: Singular value decomposition of a complex
matrix, Comm. ACM 12, 564-565.
CHAN, T.F. (1982a), An improved algorithm for computing the singular value decomposition, ACM
Trans. Math. Software 8, 72-83.
CHAN, T.F. (1982b), Algorithm 581: An improved algorithm for computing the singular value
decomposition, ACM Trans. Math. Software 8, 84-88.
CHAN, T.F. (1987), Rank revealing QR-factorizations, Linear Algebra Appl. 88/89, 67-82.
CHEN, Y.T. (1975), Iterative methods for linear least squares problems, Rept. CS-75-04, University of
Waterloo, Ont.
CHEN, Y.T. and R.P. TEWARSON (1972), On the fill-in when sparse vectors are orthonormalized,
Computing 9, 53-56.
CLINE, A.K. (1973), An elimination method for the solution of linear least squares problems, SIAM J.
Numer. Anal. 10, 283-289.
CLINE, A.K. (1975), The transformation of a quadratic programming problem into solvable form, ICASE
Rept. 75-14, NASA, Langley Research Center, Hampton, VA.
CLINE, A.K., A.R. CONN and C. VAN LOAN (1982), Generalizing the LINPACK condition estimator, in:
J.P. HENNART, ed., Numerical Analysis, Lecture Notes in Mathematics 909 (Springer, New York).
CLINE, A.K., C.B. MOLER, G.W. STEWART and J.H. WILKINSON (1979), An estimate for the condition
number of a matrix, SIAM J. Numer. Anal. 16, 368-375.
CLINE, R.E. and R.J. PLEMMONS (1976), l2-solutions to underdetermined linear systems, SIAM Rev. 18,
92-106.
COLEMAN, T.F., A. EDENBRANDT and J.R. GILBERT (1986), Predicting fill for sparse orthogonal
factorization, J. ACM 33, 517-532.
COX, M.G. (1981), The least squares solution of overdetermined linear equations having band or
augmented band structure, IMA J. Numer. Anal. 1, 3-22.
CRANE, R.L., B.S. GARBOW, K.E. HILLSTROM and M. MINKOFF (1980), LCLSQ: An implementation of an
algorithm for linearly constrained linear least squares problems, Rept. ANL-80-116, Argonne National
Laboratory, Argonne, IL.
CRAVEN, P. and G. WAHBA (1979), Smoothing noisy data with spline functions, Numer. Math. 31,
377-403.
CRYER, C. (1971), The solution of a quadratic programming problem using systematic overrelaxation,
SIAM J. Control 9, 385-392.
CUTHILL, E. (1972), Several strategies for reducing the bandwidth of matrices, in: D.J. ROSE and R.A.
WILLOUGHBY, eds., Sparse Matrices and Their Applications (Plenum Press, New York).
DANIEL, J., W.B. GRAGG, L. KAUFMAN and G.W. STEWART (1976), Reorthogonalization and stable
algorithms for updating the Gram-Schmidt QR factorization, Math. Comp. 30, 772-795.
DELVES, L.M. and I. BARRODALE (1979), A fast direct method for the least squares solution of slightly
overdetermined sets of linear equations, J. Inst. Math. Appl. 24, 149-156.
DEMMEL, J.W. (1987), The smallest perturbation of a submatrix which lowers the rank and constrained
total least squares, SIAM J. Numer. Anal. 24, 199-206.
DENNIS, J.E. (1977), Nonlinear least squares and equations, in: D.A.H. JACOBS, ed., The State of the Art in
Numerical Analysis (Academic Press, New York) 269-312.
DENNIS Jr, J.E., D.M. GAY and R.E. WELSCH (1981), An adaptive nonlinear least-squares algorithm,
ACM Trans. Math. Software 7, 348-368.
DENNIS Jr, J.E. and R.B. SCHNABEL (1983), Numerical Methods for Unconstrained Optimization and
Nonlinear Equations (Prentice-Hall, Englewood Cliffs, NJ).
DENNIS, J.E. and T. STEIHAUG (1986), On the successive projection approach to least squares problems,
SIAM J. Numer. Anal. 23, 717-733.
DEUFLHARD, P. and V. APOSTOLESCU (1978), An underrelaxed Gauss-Newton method for equality
constrained nonlinear least squares, in: J. STOER, ed., Proceedings 8th IFIP Conference on Optimization
Techniques, Lecture Notes in Control and Information Science 7 (Springer, Berlin) 22-32.
DEUFLHARD, P. and V. APOSTOLESCU (1980), A study of the Gauss-Newton algorithm for the solution of
nonlinear least squares problems, in: J. FREHSE, D. PALLASCHKE and U. TROTTENBERG, eds., Special
Topics of Applied Mathematics (North-Holland, Amsterdam).
DONGARRA, J., J.R. BUNCH, C.B. MOLER and G.W. STEWART (1979), LINPACK Users' Guide (SIAM,
Philadelphia, PA).
DRAPER, N.R. and H. SMITH (1981), Applied Regression Analysis (Wiley, New York, 2nd ed.).
DUFF, I.S. (1974), Pivot selection and row orderings in Givens reduction on sparse matrices, Computing
13, 239-248.
DUFF, I.S. and J.K. REID (1976), A comparison of some methods for the solution of sparse overdetermined
systems of linear equations, J. Inst. Math. Appl. 17, 267-280.
DUFF, I.S. and J.K. REID (1982), MA27-A set of Fortran subroutines for solving sparse symmetric sets of
linear equations, Rept. R. 10533, AERE, Harwell, England.
DUFF, I.S. and J.K. REID (1983), The multifrontal solution of indefinite sparse symmetric linear systems,
ACM Trans. Math. Software 9, 302-325.
DUFF, I.S. and G.W. STEWART, eds. (1979), Sparse Matrix Proceedings, 1978 (SIAM, Philadelphia, PA).
DWYER, P.S. (1945), The square root method and its use in correlation and regression, J. Amer. Statist.
Assoc. 40, 493-503.
ECKART, C. and G. YOUNG (1936), The approximation of one matrix by another of lower rank,
Psychometrika 1, 211-218.
EICHHORN, E.L. and C.L. LAWSON (1975), An ALGOL procedure for solution of constrained least squares
problems, Computing Memorandum No. 374, JPL, Pasadena, CA.
EISENSTAT, S.C., M.H. SCHULTZ and A.H. SHERMAN (1981), Algorithms and data structures for sparse
symmetric Gaussian elimination, SIAM J. Sci. Statist. Comput. 2, 225-237.
EKBLOM, H. (1973), Calculation of linear best Lp-approximations, BIT 13, 292-300.
ELDÉN, L. (1977), Algorithms for the regularization of ill-conditioned least squares problems, BIT 17,
134-145.
ELDÉN, L. (1980), Perturbation theory for the least squares problem with linear equality constraints,
SIAM J. Numer. Anal. 17, 338-350.
ELDÉN, L. (1984a), An efficient algorithm for the regularization of ill-conditioned least squares problems
with a triangular Toeplitz matrix, SIAM J. Sci. Statist. Comput. 5, 229-236.
ELDÉN, L. (1984b), An algorithm for the regularization of ill-conditioned banded least squares problems,
SIAM J. Sci. Statist. Comput. 5, 237-254.
ELDÉN, L. (1984c), A note on the computation of the generalized cross-validation function for
ill-conditioned least squares problems, BIT 24, 467-472.
ELFVING, T. (1978), Some numerical results obtained with two gradient methods for solving the linear
least squares problem, Rept. LiTH-MAT-R-75-5, Department of Mathematics, Linköping University,
Sweden.
ELFVING, T. (1980), Block-iterative methods for consistent and inconsistent linear equations, Numer.
Math. 35, 1-12.
ERISMAN, A.M. and W.F. TINNEY (1975), On computing certain elements of the inverse of a sparse matrix,
Comm. ACM 18, 177-179.
FADDEEV, D.K., V.N. KUBLANOVSKAYA and V.N. FADDEEVA (1968), Solution of linear algebraic systems
with rectangular matrices, Proc. Steklov Inst. Math. 96.
FAREBROTHER, R.W. (1985), The statistical estimation of the standard linear model, 1756-1853, in:
Proceedings First Tampere Seminar on Linear Models (1983), 77-99.
FLETCHER, R. (1980), Practical Methods of Optimization, 1: Unconstrained Optimization (Wiley, New
York).
FLETCHER, R. (1981), Practical Methods of Optimization, 2: Constrained Optimization (Wiley, New
York).
FORSYTHE, G.E., M.A. MALCOLM and C. MOLER (1977), Computer Methods for Mathematical
Computations (Prentice-Hall, Englewood Cliffs, NJ).
FORSYTHE, G.E. and C. MOLER (1967), Computer Solution of Linear Algebraic Systems (Prentice-Hall,
Englewood Cliffs, NJ).
FOSTER, L.V. (1986), Rank and nullspace calculations using matrix decompositions without column
interchanges, Linear Algebra Appl. 74, 47-71.
GANDER, W. (1980), Algorithms for the QR-decomposition, Rept. 80-02, Angewandte Mathematik, ETH,
Zürich.
GANDER, W. (1981), Least squares with a quadratic constraint, Numer. Math. 36, 291-307.
GAUSS, C.F. (1823), Theoria combinationis observationum erroribus minimis obnoxiae, Commentationes
Societatis Regiae Scientiarum Gottingensis Recentiores 5, 33-90.
GAY, D.M. (1983), Algorithm 611, Subroutines for unconstrained minimization using a model/trust-region
approach, ACM Trans. Math. Software 9, 503-524.
GAY, D.M. (1984), A trust-region approach to linearly constrained optimization, in: D.F. GRIFFITHS, ed.,
Proceedings 1983 Dundee Conference on Numerical Analysis, Lecture Notes in Mathematics 1066
(Springer, Berlin) 72-105.
GENTLEMAN, W.M. (1973), Least squares computations by Givens transformations without square roots,
J. Inst. Math. Appl. 12, 329-336.
GENTLEMAN, W.M. (1975), Error analysis of QR decompositions by Givens transformations, Linear
Algebra Appl. 10, 189-197.
GENTLEMAN, W.M. (1976), Row elimination for solving sparse linear systems and least squares problems
in: G.A. WATSON, ed., Proceedings 1975 Dundee Conference on Numerical Analysis, Lecture Notes in
Mathematics 506 (Springer, Berlin) 122-133.
GEORGE, J.A. and M.T. HEATH (1980), Solution of sparse linear least squares problems using Givens
rotations, Linear Algebra Appl. 34, 69-73.
GEORGE, J.A., M.T. HEATH and E.G.Y. NG (1983), A comparison of some methods for solving sparse
linear least squares problems, SIAM J. Sci. Statist. Comput. 4, 177-187.
GEORGE, J.A., M.T. HEATH and R.J. PLEMMONS (1981), Solution of large-scale least squares problems
using auxiliary storage, SIAM J. Sci. Statist. Comput. 2, 416-429.
GEORGE, J.A. and J.W.H. LIU (1981), Computer Solution of Large Sparse Positive Definite Systems
(Prentice-Hall, Englewood Cliffs, NJ).
GEORGE, J.A., J.W.H. LIU and E.G.Y. NG (1984), Row ordering schemes for sparse Givens transformations,
I. Bipartite graph model, Linear Algebra Appl. 81, 55-81.
GEORGE, J.A. and E.G.Y. NG (1983), On row and column orderings for sparse least squares problems,
SIAM J. Numer. Anal. 20, 326-344.
GEORGE, J.A. and E.G.Y. NG (1984), SPARSPAK: Waterloo sparse matrix package user's guide for
SPARSPAK-B, Research Rept. CS-84-37, Department of Computer Science, University of Waterloo,
Ont.
GEORGE, J.A., W.G. POOLE and R.G. VOIGT (1978), Incomplete nested dissection for solving n by n grid
problems, SIAM J. Numer. Anal. 15, 90-112.
GILL, P.E., G.H. GOLUB, W. MURRAY and M.A. SAUNDERS (1974), Methods for modifying matrix
factorizations, Math. Comp. 28, 505-535.
GILL, P.E., S.J. HAMMARLING, W. MURRAY, M.A. SAUNDERS and M.H. WRIGHT (1986), User's guide for
LSSOL (version 1.0): A Fortran package for constrained linear least-squares and convex quadratic
programming, Rept. SOL, Department of Operations Research, Stanford University, CA.
GILL, P.E. and W. MURRAY (1976a), Nonlinear least squares and nonlinearly constrained optimization,
in: G.A. WATSON, ed., Proceedings 1975 Dundee Conference on Numerical Analysis, Lecture Notes in
Mathematics 506 (Springer, Berlin).
GILL, P.E. and W. MURRAY (1976b), The orthogonal factorization of a large sparse matrix, in: J.R. BUNCH
and D.J. ROSE, eds., Sparse Matrix Computations (Academic Press, New York) 201-212.
GILL, P.E. and W. MURRAY (1978), Algorithms for the solution of the nonlinear least squares problem,
SIAM J. Numer. Anal. 15, 977-992.
GILL, P.E., W. MURRAY and M.H. WRIGHT (1981), Practical Optimization (Academic Press, New York).
GIVENS, W. (1958), Computation of plane unitary rotations transforming a general matrix to triangular
form, SIAM J. Appl. Math. 6, 26-50.
GOLUB, G.H. (1965), Numerical methods for solving linear least squares problems, Numer. Math. 7,
206-216.
GOLUB, G.H. (1968), Least squares, singular values and matrix approximations, Apl. Mat. 13, 44-51.
GOLUB, G.H. (1969), Matrix decompositions and statistical computation, in: R.C. MILTON and J.A.
NELDER, eds., Statistical Computation (Academic Press, New York) 365-397.
GOLUB, G.H. (1973), Some modified matrix eigenvalue problems, SIAM Rev. 15, 318-344.
GOLUB, G.H., M.T. HEATH and G. WAHBA (1979), Generalized cross-validation as a method for choosing
a good ridge parameter, Technometrics 21, 215-223.
GOLUB, G.H., A. HOFFMAN and G.W. STEWART (1987), A generalization of the Eckart-Young-Mirsky
matrix approximation theorem, Linear Algebra Appl. 88/89, 317-327.
GOLUB, G.H. and W. KAHAN (1965), Calculating the singular values and pseudo-inverse of a matrix,
SIAM J. Numer. Anal. Ser. B 2, 205-224.
GOLUB, G.H., V. KLEMA and G.W. STEWART (1976), Rank degeneracy and least squares problems, Tech.
Rept. TR-456, Department of Computer Science, University of Maryland, College Park, MD.
GOLUB, G.H. and R. LEVEQUE (1979), Extensions and uses of the variable projection algorithm for
solving nonlinear least squares problems, ARO Rept. 79-3, Proceedings of the 1979 Army Numerical
Analysis and Computers Conference.
GOLUB, G.H., F.T. LUK and M.L. OVERTON (1981), A block Lanczos method for computing the singular
values and corresponding singular vectors of a matrix, ACM Trans. Math. Software 7, 149-169.
GOLUB, G.H., F.T. LUK and M. PAGANO (1979), A sparse least squares problem in photogrammetry, in:
Proceedings Computer Science and Statistics: Twelfth Annual Conference on the Interface, Waterloo,
Ont.
GOLUB, G.H., P. MANNEBACK and P. TOINT (1986), A comparison between some direct and iterative
methods for large scale geodetic least squares problems, SIAM J. Sci. Statist. Comput. 7, 799-
816.
GOLUB, G.H. and V. PEREYRA (1973), The differentiation of pseudo-inverses and nonlinear least squares
problems whose variables separate, SIAM J. Numer. Anal. 10, 413-432.
GOLUB, G.H. and V. PEREYRA (1976), Differentiation of pseudo-inverses, separable nonlinear least
squares problems and other tales, in: M.Z. NASHED, ed., Generalized Inverses and Applications
(Academic Press, New York) 303-324.
GOLUB, G.H. and R.J. PLEMMONS (1980), Large scale geodetic least squares adjustment by dissection and
orthogonal decomposition, Linear Algebra Appl. 34, 3-27.
GOLUB, G.H., R.J. PLEMMONS and A. SAMEH (1986), Parallel block schemes for large scale least squares
computations, to appear.
GOLUB, G.H. and C. REINSCH (1970), Singular value decomposition and least squares solutions, Numer.
Math. 14, 403-420.
GOLUB, G.H. and G.P. STYAN (1973), Numerical computations for univariate linear models, J. Statist.
Comput. Simulation 2, 253-274.
GOLUB, G.H. and C.F. VAN LOAN (1980), An analysis of the total least squares problem, SIAM J. Numer.
Anal. 17, 883-893.
GOLUB, G.H. and C.F. VAN LOAN (1983), Matrix Computations (Johns Hopkins Press, Baltimore, MD).
GOLUB, G.H. and R.S. VARGA (1961), Chebyshev semi-iterative methods, successive overrelaxation
iterative methods and second order Richardson iterative methods, Parts I and II, Numer. Math. 3,
147-168.
GOLUB, G.H. and J.H. WILKINSON (1966), Note on the iterative refinement of least squares solutions,
Numer. Math. 9, 139-148.
GRIMES, R.G. and J.G. LEWIS (1981), Condition number estimation for sparse matrices, SIAM J. Sci.
Statist. Comput. 2, 384-388.
GUSTAVSON, F.G. (1976), Finding the block lower triangular form of a matrix, in: J.R. BUNCH and D.J.
ROSE, eds., Sparse Matrix Computations (Academic Press, New York), 275-289.
HAGEMAN, L.A., F.T. LUK and D.M. YOUNG (1980), On the equivalence of certain iterative acceleration
methods, SIAM J. Numer. Anal. 17, 852-873.
HAMMARLING, S. (1974), A note on modifications to the Givens plane rotation, J. Inst. Math. Appl. 13,
215-218.
HANSON, R.J. (1986), Linear least squares with bounds and linear constraints, SIAM J. Sci. Statist.
Comput. 7, 826-834.
HANSON, R.J. and J.L. PHILLIPS (1975), An adaptive numerical method for solving linear Fredholm
equations of the first kind, Numer. Math. 24, 291-307.
HASKELL, K.H. and R.J. HANSON (1979), Selected algorithms for the linearly constrained least squares
problem: A user's guide, Tech. Rept. SAND78-1290, Sandia National Laboratories, Albuquerque, NM.
HASKELL, K.H. and R.J. HANSON (1981), An algorithm for linear least squares problems with equality and
nonnegativity constraints, Math. Programming 21, 98-118.
HEATH, M.T. (1982), Some extensions of an algorithm for sparse linear least squares problems, SIAM J.
Sci. Statist. Comput. 3, 223-237.
HEATH, M.T. (1984), Numerical methods for large sparse linear least squares problems, SIAM J. Sci.
Statist. Comput. 5, 497-513.
HELMERT, F.R. (1880), Die mathematischen und physikalischen Theorien der höheren Geodäsie, I
(Teubner, Leipzig).
HESTENES, M.R. and E. STIEFEL (1952), Methods of conjugate gradients for solving linear systems, J. Res.
Nat. Bur. Standards 49, 409-436.
HIEBERT, K.L. (1981), An evaluation of mathematical software that solves nonlinear least squares
problems, ACM Trans. Math. Software 7, 1-16.
HOLT, J.N. and R. FLETCHER (1979), An algorithm for constrained non-linear least-squares, J. Inst. Math.
Appl. 23, 449-463.
HOUSEHOLDER, A.S. (1958), Unitary triangularization of a nonsymmetric matrix, J. ACM 5, 339-342.
HOUSEHOLDER, A.S. (1974), The Theory of Matrices in Numerical Analysis (Dover, New York).
HUBER, P.J. (1977), Robust Statistical Procedures, CBMS-NSF Regional Conference Series in Applied
Mathematics (SIAM, Philadelphia, PA).
HUTCHINSON, M.F. and F.R. DE HOOG (1985), Smoothing noisy data with spline functions, Numer. Math.
47, 99-106.
JENNINGS, L.S. and M.R. OSBORNE (1974), A direct error analysis for least squares, Numer. Math. 22,
322-332.
JORDAN, T.L. (1968), Experiments on error growth associated with some linear least-squares procedures,
Math. Comp. 22, 579-588.
KAHAN, W. (1966), Numerical linear algebra, Canad. Math. Bull. 9, 757-801.
KARASALO, I. (1974), A criterion for truncation of the QR decomposition algorithm for the singular linear
least squares problem, BIT 14, 156-166.
KAUFMAN, L. (1975), Variable projection methods for solving separable nonlinear least squares problems,
BIT 15, 49-57.
KAUFMAN, L. (1979), Application of dense Householder transformation to a sparse matrix, ACM Trans.
Math. Software 5, 442-450.
KAUFMAN, L. and V. PEREYRA (1978), A method for separable nonlinear least squares problems with
separable nonlinear equality constraints, SIAM J. Numer. Anal. 15, 12-20.
KELLER, H.B. (1965), On the solution of singular and semidefinite linear systems by iteration, SIAM J.
Numer. Anal. Ser. B 2, 281-290.
KOLATA, G.B. (1978), Geodesy: dealing with an enormous computer task, Science 200, 421-422.
KROGH, F.T. (1974), Efficient implementation of a variable projection algorithm for nonlinear least
squares, Comm. ACM 17, 167-169.
LAWSON, C.L. and R.J. HANSON (1969), Extensions and applications of the Householder algorithm for
solving linear least squares problems, Math. Comp. 23, 787-812.
LAWSON, C.L. and R.J. HANSON (1974), Solving Least Squares Problems (Prentice-Hall, Englewood Cliffs,
NJ).
LAWSON, C.L., R.J. HANSON, F.T. KROGH and D. KINCAID (1979), Basic linear algebra subprograms for
FORTRAN usage, ACM Trans. Math. Software 5, 308-323.
LERINGE, Ö. and P.-A. WEDIN (1970), A comparison between different methods to compute a vector
x which minimizes ||Ax-b|| when Gx=h, Rept., Department of Computer Science, Lund University.
LEVENBERG, K. (1944), A method for the solution of certain problems in least squares, Quart. Appl. Math.
2, 164-168.
LINDSTRÖM, P. (1982), A stabilized Gauss-Newton algorithm for unconstrained nonlinear least squares
problems, Rept. UMINF-102.82, Umeå University, Sweden.
LINDSTRÖM, P. (1983), A general purpose algorithm for nonlinear least squares problems with nonlinear
constraints, Rept. UMINF-103.83, Umeå University, Sweden.
LINDSTRÖM, P. (1984), Two user guides, one (ENLSIP) for constrained and one (ELSUNC) for unconstrained
nonlinear least squares problems, Rept. UMINF-109 and 110, Umeå University, Sweden.
LINDSTRÖM, P. and P.-A. WEDIN (1984), A new linesearch algorithm for unconstrained nonlinear least
squares problems, Math. Programming 29, 268-296.
LINDSTRÖM, P. and P.-A. WEDIN (1986), Methods and software for nonlinear least squares problems,
Rept. UMINF-133.87, Umeå University, Sweden.
LINNIK, YU.V. (1961), Method of Least Squares and Principles of the Theory of Observations (Pergamon
Press, New York).
LIU, J.W.H. (1986), On general row merging schemes for sparse Givens transformations, SIAM J. Sci.
Statist. Comput. 7, 1190-1211.
LUK, F.T. (1980), Computing the singular value decomposition on the ILLIAC IV, ACM Trans. Math.
Software 6, 524-539.
LÄUCHLI, P. (1961), Jordan-Elimination und Ausgleichung nach kleinsten Quadraten, Numer. Math. 3,
226-240.
LÖTSTEDT, P. (1984), Solving the minimal least squares problem subject to bounds on the variables, BIT
24, 206-224.
MAHDAVI-AMIRI, N. (1981), Generally constrained nonlinear least squares and generating test problems:
Algorithmic approach, Ph.D. Thesis, Johns Hopkins University, Baltimore, MD.
MANNEBACK, P. (1985), On some numerical methods for solving large sparse linear least squares
problems, Ph.D. Thesis, Facultés Universitaires Notre-Dame de la Paix, Namur, Belgium.
MANNEBACK, P., C. MURIGANDE and P.L. TOINT (1985), A modification of an algorithm by Golub and
Plemmons for large linear least squares in the context of Doppler positioning, IMA J. Numer. Anal. 5,
221-234.
MANTEUFFEL, T. (1980), An incomplete factorization technique for positive definite linear systems, Math.
Comp. 34, 473-479.
MARKOWITZ, H.M. (1957), The elimination form of the inverse and its application to linear programming,
Management Sci. 3, 255-269.
MARQUARDT, D. (1963), An algorithm for least-squares estimation of nonlinear parameters, SIAM J.
Appl. Math. 11, 431-441.
MIRSKY, L. (1960), Symmetric gauge functions and unitarily invariant norms, Quart. J. Math. Oxford 11,
50-59.
MOLER, C.B. (1980), MATLAB user's guide, Tech. Rept. CS81-1, Department of Computer Science,
University of New Mexico, Albuquerque, NM.
MOLINARI, L. (1977), Gram-Schmidt'sches Orthogonalisierungsverfahren, in: W. GANDER, L. MOLINARI
and H. SVECOVA, eds., Numerische Prozeduren aus Nachlass und Lehre von Prof. Heinz Rutishauser
(Birkhäuser, Stuttgart) 77-93.
MORÉ, J.J. (1978), The Levenberg-Marquardt algorithm: Implementation and theory, in: G.A. WATSON,
ed., Numerical Analysis, Proceedings Biennial Conference Dundee, Lecture Notes in Mathematics 630
(Springer, Berlin) 105-116.
MORÉ, J.J. (1983), Recent developments in algorithms and software for trust region methods, in:
Mathematical Programming: The State of the Art (Springer, Berlin).
MORÉ, J.J., B.S. GARBOW and K.E. HILLSTROM (1980), Users' guide for MINPACK-1, Rept. ANL-80-74,
Applied Mathematics Division, Argonne National Laboratory, Argonne, IL.
MORÉ, J.J. and D.C. SORENSEN (1981), Computing a trust region step, SIAM J. Sci. Statist. Comput. 4,
553-572.
NASH, J.C. (1975), A one-sided transformation method for the singular value decomposition and
algebraic eigenproblem, Comput. J. 18, 74-76.
NAZARETH, L. (1980), Some recent approaches to solving large residual nonlinear least squares problems,
SIAM Rev. 22, 1-11.
O'LEARY, D.P. (1980a), A generalized conjugate gradient algorithm for solving a class of quadratic
programming problems, Linear Algebra Appl. 34, 371-399.
O'LEARY, D.P. (1980b), Estimating matrix condition numbers, SIAM J. Sci. Statist. Comput. 1, 205-209.
O'LEARY, D.P. and J.A. SIMMONS (1981), A bidiagonalization-regularization procedure for large scale
discretizations of ill-posed problems, SIAM J. Sci. Statist. Comput. 2, 474-489.
OREBORN, U. (1986), A direct method for sparse nonnegative least squares problems, Lic. Thesis No. 87,
Linköping Studies in Science and Technology, 1986:27, Linköping University, Sweden.
ORTEGA, J.M. and W.C. RHEINBOLDT (1970), Iterative Solution of Nonlinear Equations in Several
Variables (Academic Press, New York).
OSBORNE, E.E. (1961), On least squares solutions of linear equations, J. ACM 8, 628-636.
PAIGE, C.C. (1973), An error analysis of a method for solving matrix equations, Math. Comp. 27, 355-
359.
PAIGE, C.C. (1979a), Computer solution and perturbation analysis of generalized least squares problems,
Math. Comp. 33, 171-184.
PAIGE, C.C. (1979b), Fast numerically stable computations for generalized linear least squares problems,
SIAM J. Numer. Anal. 16, 165-171.
PAIGE, C.C. (1981), Properties of numerical algorithms related to computing controllability, IEEE Trans.
Automat. Control. 26, 130-138.
PAIGE, C.C. (1985), The general linear model and the generalized singular value decomposition, Linear
Algebra Appl. 70, 269-284.
PAIGE, C.C. (1986), Computing the generalized singular value decomposition, SIAM J. Sci. Statist.
Comput. 7, 1126-1146.
PAIGE, C.C. and M.A. SAUNDERS (1977), Least squares estimation of discrete linear dynamic systems
using orthogonal transformations, SIAM J. Numer. Anal. 14, 180-193.
PAIGE, C.C. and M.A. SAUNDERS (1981), Towards a generalized singular value decomposition, SIAM J.
Numer. Anal. 18, 398-405.
PAIGE, C.C. and M.A. SAUNDERS (1982a), LSQR: An algorithm for sparse linear equations and sparse
least squares, ACM Trans. Math. Software 8, 43-71.
PAIGE, C.C. and M.A. SAUNDERS (1982b), Algorithm 583 LSQR: Sparse linear equations and least squares
problems, ACM Trans. Math. Software 8, 195-209.
PARLETT, B.N. (1971), Analysis of algorithms for reflections in bisectors, SIAM Rev. 13, 197-208.
PARLETT, B.N. (1980), The Symmetric Eigenvalue Problem (Prentice-Hall, Englewood Cliffs, NJ).
PARLETT, B.N. and J.K. REID (1970), On the solution of a system of linear equations whose matrix is
symmetric but not definite, BIT 10, 386-397.
PATZELT, P. (1973), Ein Algorithmus zur Lösung von Ausgleichproblemen mit Ungleichungen als
Nebenbedingungen, Diplomarbeit, University of Würzburg, F.R.G.
PENROSE, R. (1955), A generalized inverse for matrices, Proc. Cambridge Philos. Soc. 51, 506-513.
PETERS, G. and J.H. WILKINSON (1970), The least squares problem and pseudo-inverses, Comput. J. 13,
309-316.
PLEMMONS, R.J. (1972), Monotonicity and iterative approximations involving rectangular matrices,
Math. Comp. 26, 853-858.
PLEMMONS, R.J. (1974), Linear least squares by elimination and MGS, J. ACM 21, 581-585.
POWELL, M.J.D. and J.K. REID (1969), On applying Householder's method to linear least squares
problems, in: A.J.H. MORRELL, ed., Proceedings IFIP Congress 1968 (North-Holland, Amsterdam)
122-126.
RAMSIN, H. and P.-A. WEDIN (1977), A comparison of some algorithms for the nonlinear least squares
problem, BIT 17, 72-90.
REID, J.K. (1967), A note on the least squares solution of a band system of linear equations by
Householder reductions, Comput. J. 10, 188-189.
REID, J.K. (1972), On the use of conjugate gradients for systems of linear equations possessing "Property
A", SIAM J. Numer. Anal. 9, 325-332.
REINSCH, C.H. (1971), Smoothing by spline functions, Numer. Math. 16, 451-454.
RICE, J.R. (1966), Experiments on Gram-Schmidt orthogonalization, Math. Comp. 20, 325-328.
RICE, J.R. (1983), PARVEC workshop on very large least squares problems and supercomputers, Rept.
CSD-TR 464, Purdue University, West Lafayette, IN.
RUHE, A. (1979), Accelerated Gauss-Newton algorithms for nonlinear least squares problems, BIT 19,
356-367.
RUHE, A. (1983), Numerical aspects of Gram-Schmidt orthogonalization of vectors, Linear Algebra Appl.
52/53, 591-601.
RUHE, A. and P.-A. WEDIN (1980), Algorithms for separable nonlinear least squares problems, SIAM Rev.
22, 318-337.
RUTISHAUSER, H. (1967), Description of Algol 60, in: Handbook for Automatic Computation 1a (Springer,
Berlin).
RUTISHAUSER, H. (1976), Vorlesungen über numerische Mathematik I (Birkhäuser, Basel).
SAUNDERS, M.A. (1979), Sparse least squares by conjugate gradients: A comparison of preconditioning
methods, in: Proceedings Computer Science and Statistics: Twelfth Annual Symposium on the Interface,
Waterloo, Ont., 15-20.
SCHITTKOWSKI, K. (1983), The numerical solution of constrained linear least-squares problems, IMA J.
Numer. Anal. 3, 11-36.
SCHITTKOWSKI, K. (1985), Solving constrained nonlinear least squares problems by a general purpose
SQP-method, Rept., Institut für Informatik, Universität Stuttgart, F.R.G.
SCHITTKOWSKI, K. and J. STOER (1979), A factorization method for the solution of constrained linear least
squares problems allowing subsequent data changes, Numer. Math. 31, 431-463.
SCHITTKOWSKI, K. and P. ZIMMERMAN (1977), A factorization method for constrained least squares
problems with data changes, Part 2: Numerical tests, comparisons, and ALGOL codes, Preprint No.
30, Institut für Angewandte Mathematik und Statistik, Universität Würzburg, F.R.G.
SCHWETLICK, H. and V. TILLER (1985), Numerical methods for estimating parameters in nonlinear
models with errors in the variables, Technometrics 27, 17-24.
SCOLNIK, H.D. (1972), On the solution of non-linear least squares problems, in: H. FREEMAN, ed.,
Proceedings IFIP 71 (North-Holland, Amsterdam) 1258-1265.
SEBER, G.A.F. (1977), Linear Regression Analysis (Wiley, New York).
STEWART, G.W. (1973), Introduction to Matrix Computations (Academic Press, New York).
STEWART, G.W. (1976), The economical storage of plane rotations, Numer. Math. 25, 137-138.
STEWART, G.W. (1977a), On the perturbation of pseudo-inverses, projections, and linear least squares
problems, SIAM Rev. 19, 634-662.
STEWART, G.W. (1977b), Perturbation bounds for the QR factorization of a matrix, SIAM J. Numer.
Anal. 14, 509-518.
STEWART, G.W. (1977c), Research, development and LINPACK, in: J.R. RICE, ed., Mathematical Software
II (Academic Press, New York) 1-14.
STEWART, G.W. (1979a), A note on the perturbation of singular values, Linear Algebra Appl. 28, 213-216.
STEWART, G.W. (1979b), The effects of rounding error on an algorithm for downdating a Cholesky
factorization, J. Inst. Math. Appl. 23, 203-213.
STEWART, G.W. (1980), The efficient generation of random orthogonal matrices with an application to
condition estimators, SIAM J. Numer. Anal. 17, 403-409.
STEWART, G.W. (1982), An algorithm for computing the CS decomposition of a partitioned orthonormal
matrix, Numer. Math. 40, 297-306.
STEWART, G.W. (1983), A method for computing the generalized singular value decomposition, in: B.
KÅGSTRÖM and A. RUHE, eds., Matrix Pencils, Proceedings Pite Havsbad, 1982, Lecture Notes in
Mathematics 973 (Springer, Berlin) 207-220.
STEWART, G.W. (1984a), Rank degeneracy, SIAM J. Sci. Statist. Comput. 5, 403-413.
STEWART, G.W. (1984b), On the invariance of perturbed null vectors under column scaling, Numer. Math.
44, 61-65.
STIEFEL, E. (1952/53), Ausgleichung ohne Aufstellung der Gausschen Normalgleichungen, Wiss. Z. Tech.
Hochsch. Dresden 2, 441-442.
STOER, J. (1971), On the numerical solution of constrained least squares problems, SIAM J. Numer. Anal.
8, 382-411.
TIHONOV, A.N. (1963), Regularization of incorrectly posed problems, Dokl. Akad. Nauk SSSR 153,
1035-1038.
TOINT, P.L. (1987), On large scale nonlinear least squares calculations, SIAM J. Sci. Statist. Comput. 8,
416-435.
VAN DER SLUIS, A. (1969), Condition numbers and equilibration of matrices, Numer. Math. 14, 14-23.
VAN DER SLUIS, A. (1975), Stability of the solutions of linear least squares problems, Numer. Math. 23,
241-254.
VAN DER SLUIS, A. and G.W. VELTKAMP (1979), Restoring rank and consistency by orthogonal projection,
Linear Algebra Appl. 28, 257-278.
VAN LOAN, C. (1976), Generalizing the singular value decomposition, SIAM J. Numer. Anal. 13, 76-83.
VAN LOAN, C. (1984a), Analysis of some matrix problems using the CS decomposition, Tech. Rept.
TR84-603, Cornell University, Ithaca, NY.
VAN LOAN, C. (1984b), Computing the CS and the generalized singular value decomposition, Rept. TR
84-614, Cornell University, Ithaca, NY.
VAN LOAN, C. (1985), On the method of weighting for equality-constrained least squares problems,
SIAM J. Numer. Anal. 22, 851-864.
VARAH, J.M. (1973), On the numerical solution of ill-conditioned linear systems with applications to
ill-posed problems, SIAM J. Numer. Anal. 10, 257-267.
VARAH, J.M. (1975), A lower bound for the smallest singular value of a matrix, Linear Algebra Appl. 11,
1-2.
VARAH, J.M. (1979), A practical examination of some numerical methods for linear discrete ill-posed
problems, SIAM Rev. 21, 100-111.
VARGA, R.S. (1962), Matrix Iterative Analysis (Prentice-Hall, Englewood Cliffs, NJ).
WAMPLER, R.H. (1970), A report on the accuracy of some widely used least squares computer programs, J.
Amer. Statist. Assoc. 65, 549-565.
WAMPLER, R.H. (1972), Some recent developments in linear least squares computations, in: M.E.
TARTER, ed., Proceedings Computer Science and Statistics: Sixth Annual Symposium on the Interface,
Berkeley, CA.
WAMPLER, R.H. (1979), Solutions to weighted least squares problems by modified Gram-Schmidt with
iterative refinement, ACM Trans. Math. Software 5, 457-465.
WATSON, G.A. (1984), The numerical solution of total lp approximation problems, in: D.F. GRIFFITHS, ed.,
Proceedings 1983 Dundee Conference on Numerical Analysis, Lecture Notes in Mathematics 1066
(Springer, Berlin).
WEDIN, P.-A. (1972), Perturbation bounds in connection with the singular value decomposition, BIT 12,
99-111.
WEDIN, P.-A. (1973a), On the almost rank-deficient case of the least squares problem, BIT 13, 344-354.
WEDIN, P.-A. (1973b), Perturbation theory for pseudo-inverses, BIT 13, 217-232.
WEDIN, P.-A. (1974), On the Gauss-Newton method for the nonlinear least squares problem, ITM
Working Paper 24, Institutet för Tillämpad Matematik, Stockholm.
WEDIN, P.-A. (1985), Perturbation theory and condition numbers for generalized and constrained linear
least squares problems, Rept. UMINF 125.85, Umeå University, Sweden.
WEIL, R.L. and P.C. KETTLER (1971), Rearranging matrices to block-angular form for decomposition
(and other) algorithms, Management Sci. 18, 98-108.
WILKINSON, J.H. (1963), Rounding Errors in Algebraic Processes (Prentice-Hall, Englewood Cliffs, NJ).
WILKINSON, J.H. (1965), The Algebraic Eigenvalue Problem (Clarendon Press, Oxford).
WILKINSON, J.H. (1968), A priori error analysis of algebraic processes, in: Proceedings International
Congress of Mathematicians (Izdat. Mir, Moscow) 629-639.
WILKINSON, J.H. (1971), Modern error analysis, SIAM Rev. 13, 548-568.
WILKINSON, J.H. (1977), Some recent advances in numerical linear algebra, in: D.A.H. JACOBS, ed., The
State of the Art in Numerical Analysis (Academic Press, New York) 1-53.
WILKINSON, J.H. and C. REINSCH, eds. (1971), Handbook for Automatic Computation 2, Linear Algebra
(Springer, New York).
WRIGHT, S.J. and J.N. HOLT (1985), Algorithms for nonlinear least squares with linear inequality
constraints, SIAM J. Sci. Statist. Comput. 6, 1033-1048.
YOUNG, D.M. (1971), Iterative Solution of Large Systems (Academic Press, New York).
ZIMMERMANN, P. (1977), Ein Algorithmus zur Lösung linearer Least Squares Probleme mit unteren und
oberen Schranken als Nebenbedingungen, Diplomarbeit, Institut für Angewandte Mathematik und
Statistik, Universität Würzburg, F.R.G.
ZLATEV, Z. (1982), Comparison of two pivotal strategies in sparse plane rotations, Comput. Math. Appl.
8, 119-135.
ZLATEV, Z. and H.B. NIELSEN (1979), LLSS01: A Fortran subroutine for solving least squares problems
(User's guide), Rept. No. 79-07, Institute of Numerical Analysis, Technical University of Denmark,
Lyngby, Denmark.
Subject Index