1.0 Preliminaries
The first question we want to answer is: What is “computational mathematics”?
One possible definition is: “The study of algorithms for the solution of computational problems in science and engineering.”
Other names for roughly the same subject are numerical analysis or scientific computing.
What is it we are looking for in these algorithms? We want algorithms that are
• fast,
• accurate.
Note: One could also study hardware issues such as computer architecture and its
effects, or software issues such as efficiency of implementation on a particular hardware
or in a particular programming language. We will not do this.
What sort of problems are typical?
Example The Poisson problem provides the basis for many different algorithms for
the numerical solution of differential equations, which in turn lead to the need for many
algorithms in numerical linear algebra. Consider
−∇²u(x, y) = −[u_xx(x, y) + u_yy(x, y)] = f(x, y)   in Ω = [0, 1]²,
u(x, y) = 0   on ∂Ω.
One possible algorithm for the numerical solution of this problem is based on the
following discretization of the Laplacian:
∇²u(x_j, y_k) ≈ ( u_{j−1,k} + u_{j,k−1} + u_{j+1,k} + u_{j,k+1} − 4u_{j,k} ) / h²,    (1)
where the unit square is discretized by a set of (n + 1)² equally spaced points (x_j, y_k), j, k = 0, . . . , n, and h = 1/n. Also, we use the abbreviation u_{j,k} = u(x_j, y_k).
Formula (1) is a straightforward generalization to two dimensions of the linear approximation
u′(x) ≈ ( u(x + h) − u(x) ) / h.
If we visit all of the (n − 1)2 interior grid points and write down the equation
resulting from the discretization of the PDE, then we obtain the following system of
linear equations
4u_{j,k} − u_{j−1,k} − u_{j,k−1} − u_{j+1,k} − u_{j,k+1} = f_{j,k} / n²,   j, k = 1, . . . , n − 1,
along with the discrete boundary conditions u_{j,k} = 0 whenever j ∈ {0, n} or k ∈ {0, n}.
Method          Operations      Year
Jacobi          10^13           1845
Gauss-Seidel    5 × 10^12       1832
SOR             10^10           1950
FFT             1.5 × 10^8      1965
multigrid       10^8            1979

Computer                                  Speed        Year
Z3                                        1 flops      1941
Intel Paragon                             10 Gflops    1990
NEC Earth Simulator (5120 processors)     40 Tflops    2002
IBM Blue Gene/L (131072 processors)       367 Tflops   2006
While other sources of errors also exist (such as measurement errors in experiments),
we will focus on the above three sources.
2. Vector addition is associative and commutative, i.e., (u + v) + w = u + (v + w)
and u + v = v + u.
i.e., the i-th entry of b is given by the dot product of row i of A with x.
A second (vectorized) interpretation is obtained via
b = Σ_{j=1}^{n} A(:, j) x(j) = Σ_{j=1}^{n} x(j) A(:, j),
Remark Using the first interpretation we need to perform m dot products to calculate b. With the second interpretation we compute n scalar-vector products and n − 1 vector additions.
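A minimal Matlab/Octave sketch of the two interpretations (the variable names and the random example are mine, not from the notes):

m = 4; n = 3;
A = rand(m, n);  x = rand(n, 1);

% First interpretation: m dot products, one per row of A.
b1 = zeros(m, 1);
for i = 1:m
    b1(i) = A(i, :) * x;
end

% Second interpretation: linear combination of the columns of A.
b2 = zeros(m, 1);
for j = 1:n
    b2 = b2 + x(j) * A(:, j);
end

disp(norm(b1 - A*x))   % both agree with the built-in product
disp(norm(b2 - A*x))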
B = AC ∈ C^{ℓ×n}
and
B(i, j) = Σ_{k=1}^{m} A(i, k)C(k, j) = A(i, :)C(:, j),   i = 1, . . . , ℓ, j = 1, . . . , n,
i.e., the ij entry of B is obtained as the dot product of row i of A with column j of C.
An alternative (vectorized) interpretation of the same matrix-matrix product is
given by
B(:, j) = Σ_{k=1}^{m} A(:, k)C(k, j) = Σ_{k=1}^{m} C(k, j)A(:, k),
i.e., an upper triangular matrix of all ones.
Then B = AR leads to
B(:, j) = Σ_{k=1}^{n} R(k, j)A(:, k) = Σ_{k=1}^{j} 1 · A(:, k) = Σ_{k=1}^{j} A(:, k).
Remark The columns of A are a basis for range(A) if they are linearly independent.
The nullspace of A (null(A) or kernel of A) is given by the set of all x such that
Ax = 0.
The column rank of A is given by the dimension of range(A) (i.e., number of linearly
independent columns of A). The row rank is defined analogously.
Remark We always have “row rank = column rank = rank ”. Moreover, for any m × n
matrix we have
dim(null(A)) + rank(A) = n.
The range of A is given by all vectors in C³ with third component zero since
Ax = [0 1 0; 0 0 2; 0 0 0] [x(1); x(2); x(3)] = [x(2); 2x(3); 0].
Consequently rank(A) = dim(range(A)) = 2. Moreover, the second and third columns of A,
[1; 0; 0]  and  [0; 2; 0],
form a basis for range(A).
1.1.5 Inverse
Any square matrix A ∈ Cm×m with rank(A) = m (i.e., of full rank or nonsingular ) has
an inverse A−1 such that
AA−1 = A−1 A = I,
where I = [e1 , e2 , . . . , em ] is the m × m identity matrix.
2. rank(A) = m,
3. range(A) = Cm ,
4. null(A) = {0},
5. 0 is not an eigenvalue of A,
7. det(A) ≠ 0.
Ax = b ⇐⇒ x = A−1 b
Remark The solution x of the linear system Ax = b gives us the vector of coefficients
of b expanded in terms of the columns of A, i.e., we “rotate b back into the column
space of A via A−1 ”.
Remark Note that only square matrices can be Hermitian (or real symmetric).
Applied to vectors this means that x denotes a column vector and x∗ a row vector.
‖x‖ = √(x*x) = ( Σ_{i=1}^{m} |x(i)|² )^{1/2}.
This is the same as the (Euclidean) or 2-norm of the vector. We will say more about
norms later.
Once an inner product is available we can define an angle α between two vectors x
and y by
cos α = x*y / (‖x‖ ‖y‖).
We summarize the following properties of inner products.
1. (x + y)∗ z = x∗ z + y ∗ z,
2. x∗ (y + z) = x∗ y + x∗ z,
Moreover, the definition of the inner product implies that y*x is the complex conjugate of x*y. In the real case, therefore, the inner product is symmetric.
We can use the inner product and its properties to obtain some other properties of
matrix multiplication.
1. (AB)∗ = B ∗ A∗ ,
2. (AB)−1 = B −1 A−1 .
Proof We will prove item 1. Let C = AB so that C(i, j) = A(i, :)B(:, j), the inner product of row i of A with column j of B. Then the (i, j) entry of C* is the conjugate of C(j, i) = A(j, :)B(:, i), which equals B*(i, :)A*(:, j), i.e., C* = B*A*.
Remark We will use the notational convention A−∗ = (A−1 )∗ = (A∗ )−1 .
1.2.3 Orthogonal Vectors
Since we defined angles earlier as
cos α = x*y / (‖x‖ ‖y‖),
two vectors x and y are orthogonal to each other if x*y = 0. Just as we can expand any vector in terms of the standard basis,
v = v(1)e_1 + . . . + v(m)e_m,
we can find the components of an arbitrary vector v with respect to any given orthogonal set.
Assume Q = {q 1 , . . . , q n } is an orthonormal set and v ∈ Cm is an arbitrary vector
with m ≥ n.
Claim: We can decompose v as
v = r + Σ_{i=1}^{n} (q_i* v) q_i.
Remark The vectors (q ∗i v)q i are the projections of v onto the (basis) vectors q i .
Now, since Q is orthonormal we have
q_j* q_i = δ_{ij} = { 0, i ≠ j;  1, i = j },
where δij is referred to as the Kronecker delta. Therefore only one term in the summa-
tion survives (namely if i = j) and we have
q ∗j r = q ∗j v − (q ∗j v)1 = 0.
Since q j was arbitrary we have established that r is orthogonal to the entire set Q.
We now take a closer look at the projection idea. This will be very important later
on. First (q_i* v)q_i = q_i(q_i* v) since (q_i* v) is a scalar. Next, by the associativity of matrix multiplication, this is also equal to (q_i q_i*)v. This latter expression is the product of
a (rank-1) matrix and a column vector. The matrix (q i q ∗i ) is known as a projection
matrix.
Now we can represent the projections occurring in the decomposition of v either via a sum of vector projections
Σ_i (q_i* v) q_i
or via a sum of matrix projections
Σ_i (q_i q_i*) v.
2. In terms of matrix projections we get
v = Σ_{i=1}^{3} (q_i q_i*) v
  = [1/2 0 1/2; 0 0 0; 1/2 0 1/2] v + [1/2 0 −1/2; 0 0 0; −1/2 0 1/2] v + [0 0 0; 0 1 0; 0 0 0] v
  = [1; 0; 1] + [0; 0; 0] + [0; 1; 0].
Recall that earlier we observed that if we multiply a given vector b (the coordinates
of some point with respect to the standard basis) by A−1 then we obtain the coordinates
x with respect to the column space of A. Conversely, if we multiply x by A, then we
transform back to the standard basis.
Now, if we assume that A is a unitary matrix, i.e., A = Q and A−1 = Q∗ ,
then multiplication of b by Q∗ yields the coordinates x with respect to the basis
{Q(:, 1), . . . , Q(:, m)} = {q 1 , . . . , q m }. Conversely, multiplication of x by Q transforms
back to the standard basis {e1 , . . . , em }.
Example Take
b = [1; 1; 1]   and   Q = [1/√2 1/√2 0; 0 0 1; 1/√2 −1/√2 0],
and note that b = v and the columns of Q are given by q_1, q_2, q_3 from the previous example.
Clearly,
Q* = [1/√2 0 1/√2; 1/√2 0 −1/√2; 0 1 0],
and therefore
Q*b = [√2; 0; 1]
are the coordinates with respect to the columns of Q.
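These numbers are easy to reproduce in Matlab/Octave (a small sketch of the change of coordinates just described):

b = [1; 1; 1];
Q = [1/sqrt(2)  1/sqrt(2)  0;
     0          0          1;
     1/sqrt(2) -1/sqrt(2)  0];

x = Q' * b     % coordinates of b w.r.t. the columns of Q: [sqrt(2); 0; 1]
Q * x          % transforming back recovers b in the standard basis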
Some properties of unitary matrices are collected in the following list.
1. (Qx)∗ (Qy) = x∗ y, that is angles are preserved under unitary (orthogonal) trans-
formations.
2. kQxk = kxk, that is lengths are preserved under unitary (orthogonal) transfor-
mations.
3. If Q is real (i.e., orthogonal), then det(Q) = ±1.
Remark The “+” in item 3 corresponds to rotations, and the “−” to reflections.
(Qx)*(Qy) = x* (Q*Q) y = x*y,   since Q*Q = I.
1.3 Norms
1.3.1 Vector Norms
Definition 1.9 Let V be a vector space over C. A norm is a function ‖·‖ : V → R₀⁺ which satisfies
2. ‖x‖₂ = ( Σ_{i=1}^{m} |x(i)|² )^{1/2}, the ℓ₂-norm or Euclidean norm.
4. ‖x‖_p = ( Σ_{i=1}^{m} |x(i)|^p )^{1/p}, the ℓ_p-norm.
It is interesting to consider the corresponding unit “spheres” for these three norms,
i.e., the location of points in Rm whose distance to the origin (in the respective norm)
is equal to 1. Figure 1 illustrates this for the case m = 2.
[Figure 1: unit “circles” in R² for the 1-, 2-, and ∞-norms.]
Sometimes one also wants to work with weighted norms. To this end one takes a
diagonal weight matrix
W = diag( w(1), w(2), . . . , w(m) ).
Remark 1. The notation sup in Definition 1.10 denotes the supremum or least
upper bound.
3. One can show that k · k satisfies items (1)–(3) in Definition 1.9, i.e., it is indeed
a norm.
4. kAk can be interpreted as the maximum factor by which A can “stretch” x.
Example The “stretch” concept can be understood graphically in R2 . Consider the
matrix
A = [1 1; 0 1],
which maps R² to R².
1. By mapping the 1-norm unit circle under A we can see that the point that is
maximally stretched is (0, 1) which gets mapped into (1, 1). Thus, a vector of
1-norm length 1 is mapped to a vector with 1-norm length 2, and kAk1 = 2.
2. By mapping the 2-norm unit circle under A we can see (although this is much
harder and requires use of the singular value decomposition) that the point that is
maximally stretched is (0.5257, 0.8507) which gets mapped into (1.3764, 0.8507).
Thus, a vector of 2-norm length 1 is mapped to a vector with 2-norm length
1.6180, and kAk2 = 1.6180.
3. By mapping the ∞-norm unit circle under A we can see that the point that is
maximally stretched is (1, 1) which gets mapped into (2, 1). Thus, a vector of
∞-norm length 1 is mapped to a vector with ∞-norm length 2, and kAk∞ = 2.
How to Compute the Induced Matrix Norm
We now discuss how to compute the matrix norms induced by the popular p-norm
vector norms. Strictly speaking we now would have to use two different subscripts on
the vector norms (in addition to the index p also the (m) and (n) indicating the length
of the vectors). In order to simplify notation we omit the second subscript which can
be inferred from the context.
Consider an m × n matrix A. The most popular matrix norms can be computed as
follows:
1. ‖A‖₁ = max_{1≤j≤n} ‖A(:, j)‖₁ = max_{1≤j≤n} Σ_{i=1}^{m} |A(i, j)|.
3. ‖A‖_∞ = max_{1≤i≤m} ‖A(i, :)‖₁ = max_{1≤i≤m} Σ_{j=1}^{n} |A(i, j)|.
This gives rise to the name maximum row sum norm.
Example Let
A = [1 1 2; 0 1 1; 1 0 2].
Then kAk1 = 5, kAk2 = 3.4385, kAk∞ = 4.
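These values can be checked directly with Matlab's built-in norm (a quick sketch):

A = [1 1 2; 0 1 1; 1 0 2];
norm(A, 1)      % maximum column sum: 5
norm(A, 2)      % largest singular value: 3.4385...
norm(A, inf)    % maximum row sum: 4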
We now verify that the matrix norm induced by the `∞ -norm is indeed given by
the formula stated in item 3:
Finally, using formula (2) below we obtain the desired result, i.e.,
‖A‖_∞ = max_{1≤i≤m} Σ_{j=1}^{n} |A(i, j)|.
The supremum over all unit vectors (in the maximum norm) is attained if all terms in
the above sum are positive. This can be ensured by picking x(j) = sign(A(i, j)). But
then we have
sup_{‖x‖_∞ = 1} | Σ_{j=1}^{n} A(i, j)x(j) | = |A(i, 1)| + |A(i, 2)| + . . . + |A(i, n)| = Σ_{j=1}^{n} |A(i, j)|.    (2)
|x∗ y| ≤ kxkkyk.
Proof The Cauchy-Schwarz inequality in a real vector space can be proved geometrically by starting out with the projection p = (x*y / ‖y‖²) y of x onto y. Since y is
so that (3) gives us
‖A‖₂ ≤ ‖u‖₂ ‖v‖₂.
However, in the special case x = v we get
‖Ax‖₂ = ‖Av‖₂ = ‖(uv*)v‖₂ = |v*v| ‖u‖₂ = ‖v‖₂² ‖u‖₂
(note that v*v is a scalar), and therefore actually
‖A‖₂ = sup_{x ∈ Cⁿ, x ≠ 0} ‖Ax‖₂ / ‖x‖₂ = ‖u‖₂ ‖v‖₂.
i.e., we interpret the matrix in Cm×n as a vector in Cmn and compute its 2-norm.
Other formulas for the Frobenius norm are
‖A‖_F = √(tr(A*A)) = √(tr(AA*)),
where the trace tr(A) is given by the sum of the diagonal entries of A.
Finally,
Theorem 1.12 Let A ∈ Cm×n and Q ∈ Cm×m be unitary. Then
1. kQAk2 = kAk2 ,
2. kQAkF = kAkF ,
i.e., both the 2-norm and the Frobenius norm are invariant under unitary transforma-
tion.
Proof The invariance of the matrix 2-norm is a direct consequence of the invariance of
lengths of vectors under unitary transformations discussed earlier, i.e., kQxk2 = kxk2 .
In particular, kQAxk2 = kAxk2 , and so
‖QA‖₂ = sup_{x ≠ 0} ‖QAx‖₂ / ‖x‖₂ = sup_{x ≠ 0} ‖Ax‖₂ / ‖x‖₂ = ‖A‖₂.
1.4 Computer Arithmetic
1.4.1 Floating-Point Arithmetic
We will use normalized scientific notation to represent real numbers x ≠ 0, i.e., in decimal representation we write
x = ±r × 10^n,   1/10 ≤ r < 1,
and in binary representation (which of course will matter on the computer) we write
x = ±q × 2^m,   1/2 ≤ q < 1.
Both of these representations consist of the sign “±”, the mantissa (either r or q), the
base (either 10 or 2), and the exponent (either n or m).
In order to study the kinds of errors that we can make when we represent real
numbers as machine numbers we will use a hypothetical binary computer. We will
assume that this computer can represent only positive numbers of the form
(0.d_1 d_2 d_3 d_4)_2 × 2^n
with n ∈ {−3, −2, −1, 0, 1, 2, 3, 4}. Our representation actually stores only a 3-bit mantissa (the leading digit d_1 is always assumed to be 1 and is therefore never stored). The set of choices for the exponent n comes from using a 3-bit exponent, which allows us to generate the values 0, 1, . . . , 7; n is then obtained as 4 minus the stored exponent.
With this configuration we are able to generate 2³ = 8 different mantissas:
(0.1000)2 = (0.5)10
(0.1001)2 = (0.5625)10
(0.1010)2 = (0.625)10
(0.1011)2 = (0.6875)10
(0.1100)2 = (0.75)10
(0.1101)2 = (0.8125)10
(0.1110)2 = (0.875)10
(0.1111)2 = (0.9375)10
for a total of 64 machine numbers (obtained by combining the 8 mantissas with the 8 possible exponents).
n = −3 n = −2 n = −1 n=0 n=1 n=2 n=3 n=4
(0.1000)2 0.0625 0.125 0.25 0.5 1 2 4 8
(0.1001)2 0.0703125 0.140625 0.28125 0.5625 1.125 2.25 4.5 9
(0.1010)2 0.078125 0.15625 0.3125 0.625 1.25 2.5 5 10
(0.1011)2 0.0859375 0.171875 0.34375 0.6875 1.375 2.75 5.5 11
(0.1100)2 0.09375 0.1875 0.375 0.75 1.5 3 6 12
(0.1101)2 0.1015625 0.203125 0.40625 0.8125 1.625 3.25 6.5 13
(0.1110)2 0.109375 0.21875 0.4375 0.875 1.75 3.5 7 14
(0.1111)2 0.1171875 0.234375 0.46875 0.9375 1.875 3.75 7.5 15
Remark Any computer has finite word length (usually longer than that of our hypothetical computer) and can therefore represent only a finite, discrete set of numbers exactly. Moreover, these numbers are distributed unevenly.
Example How does the computation of 1/10 + 1/5 + 1/6 = 7/15 = 0.4666 . . . work out in our hypothetical computer?
First, we notice that we will be committing a number of representation errors. The closest machine number to 1/10 is 0.1015625 = (0.1101)_2 × 2^{−3}. Similarly, the closest machine number to 1/5 is 0.203125 = (0.1101)_2 × 2^{−2}.
We can add these two numbers (by shifting the mantissa of the first number)
and obtain (1.00111)_2 × 2^{−2}. This, however, is not in normalized scientific notation. The “correct” representation of the intermediate result 1/10 + 1/5 is therefore (0.100111)_2 × 2^{−1}.
Now, however, our computer has only a 4-bit mantissa, and therefore we need to
commit a rounding error, i.e., we represent the intermediate result by (0.1010)2 × 2−1 .
Note that this is indeed (fortunately so) the closest machine number to 3/10.
For the final step of the calculation we need to represent 1/6 as a machine number. We again commit a representation error by using 0.171875 = (0.1011)_2 × 2^{−2}. Adding
this to the intermediate result from above (again by shifting the mantissa) we get
(0.11111)2 × 2−1 . Once more we need to round this result to the nearest machine
number, so that the final answer of our calculation is (0.1000)_2 × 2^0 = 0.5, which is a rather poor representation of the true answer 7/15 = 0.4666 . . . .
In fact, the absolute error of our calculation is
|0.5 − 0.4666 . . .| = 0.0333 . . . ,
and the relative error is
|0.5 − 0.4666 . . .| / 0.4666 . . . ≈ 0.0714, or 7.14%.
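The computation above can be simulated with a small helper that rounds a positive number to our hypothetical machine. The function fl4 below is my own construction (not part of the notes); it assumes round-to-nearest and ignores the limited exponent range:

% fl4.m -- round a positive real number to a 4-digit binary mantissa
function y = fl4(x)
    n = ceil(log2(x));              % choose n so that x/2^n lies in (1/2, 1]
    q = round(x / 2^n * 16) / 16;   % keep 4 binary digits (0.d1d2d3d4)_2
    if q == 1                       % renormalize if rounding reached 1.0
        q = 0.5;  n = n + 1;
    end
    y = q * 2^n;
end

With this helper, fl4(fl4(fl4(1/10) + fl4(1/5)) + fl4(1/6)) returns 0.5, reproducing the hand computation step by step.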
Example Another important observation is the fact that the order of operations mat-
ters, i.e., even though addition (or multiplication) is commutative and associative for
real numbers, this may not be true for machine numbers.
Consider the problem of adding 1 + 1/16 + 1/16 = 9/8 = 1.125 on our hypothetical computer. Note that all of these numbers are machine numbers (so we will be committing no representation errors).
a) We first represent 1 by (0.1000)_2 × 2^1 and 1/16 by (0.1000)_2 × 2^{−3}. Addition of these two numbers leads to (0.10001)_2 × 2^1, which now has to be rounded to a machine number, i.e., (0.1000)_2 × 2^1. Clearly, adding another 1/16 will not change the answer. Thus, 1 + 1/16 + 1/16 = 1.
b) If we start by adding the smaller numbers first, then 1/16 + 1/16 is represented by (0.1000)_2 × 2^{−3} + (0.1000)_2 × 2^{−3} = (0.1000)_2 × 2^{−2} (exactly), and adding 1 = (0.1000)_2 × 2^1 to this yields the correct answer (0.1001)_2 × 2^1 = 1.125.
More common word lengths for the representation of floating-point numbers are 32-bit (single precision) and 64-bit (double precision). In the 32-bit representation the first bit is used to represent the sign, the next 8 bits represent the exponent, and the remaining 23 bits are used for the mantissa. This implies that the largest possible exponent is (11111111)_2 = 2^8 − 1 = 255 (or −126, . . . , 127, where the two extreme cases 0 and 255 are reserved for special purposes). This means that we can roughly represent numbers between 2^{−126} ≈ 10^{−38} and 2^{127} ≈ 10^{38}. Since the mantissa has 23 bits we can represent numbers with an accuracy (machine ε) of 2^{−23} ≈ 0.12 × 10^{−6}, i.e., we can expect 6 accurate digits (which is known as single precision).
In the 64-bit system we use 11 bits for the exponent and 52 for the mantissa. This leads to machine numbers between 2^{−1022} ≈ 10^{−308} and 2^{1023} ≈ 10^{308}. The resolution possible with the 52-bit mantissa is 2^{−52} ≈ 0.22 × 10^{−15}. Thus double precision has 15 accurate digits.
Some other examples of disasters due to careless use of computers can be found on the web at http://www.ima.umn.edu/~arnold/disasters/.
In order to have a “good” computer we would like
fl(x ⊛ y) = (x ⊛ y)(1 + δ),  with |δ| on the order of machine accuracy,
for any basic arithmetic operation ⊛ ∈ {+, −, ·, /}. This is usually accomplished by using higher precision internally.
On the other hand, if we choose a better method to solve the problem (without disastrous subtractions), i.e., rewrite the function as
g(x) = x (√(x + 1) − √x) · (√(x + 1) + √x) / (√(x + 1) + √x)
     = x ((x + 1) − x) / (√(x + 1) + √x)
     = x / (√(x + 1) + √x),
then
g(500) = 500 / (√501 + √500) = 500 / (22.3830 + 22.3607) = 500 / 44.7437 = 11.1748,
which is exact up to the precision used.
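The cancellation is easy to observe in single precision. The following sketch uses f for the original form and g for the rewritten one (the names are mine):

x = single(500);
f = x .* (sqrt(x + 1) - sqrt(x));   % subtraction of nearly equal numbers
g = x ./ (sqrt(x + 1) + sqrt(x));   % rewritten form, no cancellation

fprintf('f(500) = %.7f\n', f)       % several digits lost to cancellation
fprintf('g(500) = %.7f\n', g)       % close to 11.1748 to single precision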
We end with a theorem quantifying the loss of significant digits in subtractions.
Theorem 1.13 If x > y are positive normalized floating-point numbers in binary representation and 2^{−q} ≤ 1 − y/x ≤ 2^{−p}, then the number ℓ of significant binary bits lost when computing x − y satisfies p ≤ ℓ ≤ q.
Proof We show the lower bound, i.e., ℓ ≥ p (the upper bound can be shown similarly).
Let x = r × 2^n and y = s × 2^m with 1/2 ≤ r, s < 1. In order to perform the subtraction we rewrite (shift) y such that y = (s × 2^{m−n}) × 2^n. Then
x − y = (r − s × 2^{m−n}) × 2^n,
where
r − s × 2^{m−n} = r ( 1 − (s × 2^m)/(r × 2^n) ) = r (1 − y/x).
Now r < 1 and 1 − y/x ≤ 2^{−p} by assumption, so that the mantissa satisfies r − s × 2^{m−n} < 2^{−p}.
Finally, we need to shift at least p bits to the left in order to normalize the representation (we need 1/2 ≤ mantissa < 1). This introduces at least p (binary) zeros at the right end of the number, and so at least p bits are lost.
Remark If the theorem is formulated in base 10, i.e., if the relative error is between 10^{−q} and 10^{−p}, then between p and q digits are lost.
2 Singular Value Decomposition
The singular value decomposition (SVD) allows us to transform a matrix A ∈ Cm×n to
diagonal form using unitary matrices, i.e.,
A = Û Σ̂V ∗ . (4)
A = U ΣV ∗
Av j = σj uj , j = 1, . . . , n. (5)
[Figure: the unit circle with the right singular vectors v_1, v_2 (left) is mapped by A onto an ellipse with semi-axes σ_1 u_1 and σ_2 u_2 (right).]
In (5) we refer to the σj as singular values of A (the diagonal entries of Σ̂). They
are usually ordered such that σ1 ≥ σ2 ≥ . . . ≥ σn . The orthonormal vectors uj (the
columns of Û ) are called the left singular vectors of A, and the orthonormal vectors v j
(the columns of V ) are called the right singular vectors of A).
Remark For most practical purposes it suffices to compute the reduced SVD (4). We
will give examples of its use, and explain how to compute it later.
[Figure: image compression example. Left panel: QR compression, compression ratio 0.2000, 25 columns used. Right panel: SVD compression, compression ratio 0.2000, relative error 0.0320, 25 singular values used.]
A = U Σ V*                                   (6)
  = [Û  Ũ] [Σ̂; O] V*.
U ∗ AV = Σ,
i.e., unitary transformations (reflections or rotations) are applied from the left and
right to A in order to obtain a diagonal matrix Σ.
Remark Note that the “diagonal” matrix Σ is in many cases rectangular and will
contain extra rows/columns of all zeros.
It is clear that the SVD will simplify the solution of many problems since the
transformed system matrix is diagonal, and thus trivial to work with.
2.0.5 Existence and Uniqueness Theorem
Theorem 2.1 Let A be a complex m × n matrix. A has a singular value decomposition
of the form
A = U ΣV ∗ ,
where Σ is a uniquely determined m × n (real) diagonal matrix, U is an m × m unitary
matrix, and V is an n × n unitary matrix.
Proof We prove only existence. The uniqueness part of the proof follows directly from
the geometric interpretation. A (more rigorous?) algebraic argument can be found,
e.g., in [Trefethen/Bau].
We use induction on the dimensions of A. All of the following arguments assume
m ≥ n (the case m < n can be obtained by transposing the arguments).
For n = 1 (and any m) the matrix A is a column vector. We take V = 1, Σ̂ = ‖A‖₂ and Û = A/‖A‖₂. Then, clearly, we have found a reduced SVD, i.e., A = Û Σ̂ V*. The full SVD is obtained by extending Û to U by the Gram-Schmidt algorithm and adding the necessary zeros to Σ̂.
We now assume an SVD exists for the case (m − 1, n − 1) and show it also exists for (m, n). To this end we pick v_1 ∈ Cⁿ such that ‖v_1‖₂ = 1 and ‖Av_1‖₂ = ‖A‖₂ = σ_1.
Now we take
u_1 = Av_1 / ‖Av_1‖₂.    (7)
Next, we use the Gram-Schmidt algorithm to arbitrarily extend u_1 and v_1 to unitary matrices by adding columns Ũ_1 and Ṽ_1, i.e.,
U_1 = [u_1  Ũ_1],   V_1 = [v_1  Ṽ_1].
This results in
U_1* A V_1 = [u_1*; Ũ_1*] A [v_1  Ṽ_1]
           = [u_1* A v_1   u_1* A Ṽ_1;
              Ũ_1* A v_1   Ũ_1* A Ṽ_1].
• Again, using (7) we get
Ũ_1* A v_1 = Ũ_1* u_1 ‖Av_1‖₂.
This, however, is a zero vector since U_1 has orthonormal columns, i.e., Ũ_1* u_1 = 0.
we could look at the first row of the block matrix U_1* A V_1 and see that
(U_1* A V_1)(1, :) = [σ_1   u_1* A Ṽ_1]
with ‖[σ_1  u_1* A Ṽ_1]‖₂ > σ_1. On the other hand, we know that unitary matrices leave the 2-norm invariant, i.e., ‖U_1* A V_1‖₂ = ‖A‖₂ = σ_1. Since the norm of the first row of the block matrix U_1* A V_1 cannot exceed that of the entire matrix we have reached a contradiction.
U_1* A V_1 = [σ_1  0ᵀ;  0  U_2 Σ_2 V_2*]
           = [1  0ᵀ;  0  U_2] [σ_1  0ᵀ;  0  Σ_2] [1  0ᵀ;  0  V_2]*,
or
A = U_1 [1  0ᵀ;  0  U_2] [σ_1  0ᵀ;  0  Σ_2] ([1  0ᵀ;  0  V_2])* V_1*,
another SVD (since the product of unitary matrices is unitary).
Consider now the solution of the linear system Ax = b with A = UΣV*. Any right-hand side b can be expanded in terms of the columns of U, i.e.,
b = U b′   ⟺   b′ = U* b,
where we have used the columns of U as an orthonormal basis for range(A).
Similarly, any x ∈ Cⁿ (the domain of A) can be written in terms of the columns of V:
x′ = V* x.
Now
Ax = b  ⟺  U* A x = U* b
        ⟺  U* U Σ V* x = U* b
        ⟺  I Σ V* x = U* b
        ⟺  Σ x′ = b′,
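In Matlab/Octave this solution procedure reads as follows (a sketch for a small square nonsingular example of my own choosing):

A = [1 2; 3 4];  b = [5; 6];

[U, S, V] = svd(A);      % A = U*S*V'
bprime = U' * b;         % expand b in the columns of U
xprime = S \ bprime;     % solve the diagonal system Sigma*x' = b'
x = V * xprime;          % transform back: x = V*x'

norm(A*x - b)            % at the level of rounding error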
Theorem 2.2 Assume A ∈ Cm×n , p = min(m, n), and r ≤ p denotes the number of
positive singular values of A. Then
1. rank(A) = r
3. ‖A‖₂ = σ_1 and ‖A‖_F = √( σ_1² + σ_2² + . . . + σ_r² ).
4. The eigenvalues of A∗ A are the σi2 and the v i are the corresponding (orthonor-
malized) eigenvectors. The eigenvalues of AA∗ are the σi2 and possibly m − n
zeros. The corresponding orthonormalized eigenvectors are given by the ui .
5. If A = A∗ (Hermitian or real symmetric), then the eigen-decomposition A =
XΛX ∗ and the SVD A = U ΣV ∗ are almost identical. We have U = X, σi = |λi |,
and v i = sign(λi )ui .
6. If A ∈ C^{m×m} then |det(A)| = Π_{i=1}^{m} σ_i.
In fact,
kA − Aν k2 = σν+1 . (9)
Proof The representation (8) of the SVD follows immediately from the full SVD (6)
by splitting Σ into a sum of diagonal matrices Σj = diag(0, . . . , 0, σj , 0, . . . , 0).
Formula (9) for the approximation error follows from the fact that U ∗ AV = Σ and
the expansion for Aν so that U ∗ (A − Aν )V = diag(0, . . . , 0, σν+1 , . . .) and kA − Aν k2 =
σν+1 by the invariance of the 2-norm under unitary transformations and item 3 of the
previous theorem.
The claim regarding the best approximation property is a little more involved, and
omitted.
Remark There are many possible rank-ν decompositions of A (e.g., by taking partial
sums of the LU or QR factorization). Theorem 2.3, however, says that the ν-th partial
sum of the SVD captures as much of the energy of A (measured in the 2-norm) as pos-
sible. This fact gives rise to many applications in image processing, data compression,
data mining, and other fields. See, e.g., the Matlab scripts svd_compression.m and qr_compression.m.
A geometric interpretation of Theorem 2.3 is given by the best approximation of a
hyperellipsoid by lower-dimensional ellipsoids. For example, the best approximation of
a given hyperellipsoid by a line segment is given by the line segment corresponding to
the hyperellipsoid's longest axis. Similarly, the best approximation by an ellipse is given by that ellipse whose axes are the longest and second-longest axes of the hyperellipsoid.
A*A = V Λ V*.
3. Sort the eigenvalues according to their magnitude, and let σ_j = √λ_j, j = 1, . . . , n.
u_j = σ_j^{-1} A v_j,   j = 1, . . . , r.
3. σ_1 = √17 and σ_2 = 1, so that
Σ = [√17 0; 0 1; 0 0],
and
u_2 = (1/σ_2) A v_2 = (1/1) [1 2; 2 2; 2 1] (1/√2)[1; −1] = (1/√2) [−1; 0; 1].
u∗j u3 = δj3 , j = 1, 2, 3.
so that
A = U Σ V* = [3/√34  −1/√2  2/√17;  4/√34  0  −3/√17;  3/√34  1/√2  2/√17] [√17 0; 0 1; 0 0] [1/√2 1/√2; 1/√2 −1/√2].
The reduced SVD is given by
A = Û Σ̂ V* = [3/√34  −1/√2;  4/√34  0;  3/√34  1/√2] [√17 0; 0 1] [1/√2 1/√2; 1/√2 −1/√2].
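The worked example can be checked numerically. The sketch below follows the eigen-decomposition recipe described before the example, assuming A = [1 2; 2 2; 2 1] (the matrix consistent with the factors displayed above); the singular vectors are only determined up to sign:

A = [1 2; 2 2; 2 1];

[V, Lambda] = eig(A' * A);                  % A'*A = V*Lambda*V'
[lambda, idx] = sort(diag(Lambda), 'descend');
V = V(:, idx);
sigma = sqrt(lambda);                       % sqrt(17) and 1
U_hat = (A * V) * diag(1 ./ sigma);         % u_j = A*v_j / sigma_j

disp(sigma')
disp(norm(A - U_hat * diag(sigma) * V'))    % ~ machine precision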
3 Projectors
If P ∈ Cm×m is a square matrix such that P 2 = P then P is called a projector. A
matrix satisfying this property is also known as an idempotent matrix.
P = [c²  cs;  cs  s²],
where c = cos θ and s = sin θ. This matrix projects perpendicularly onto the line with
inclination angle θ in R2 .
We can check that P is indeed a projector:
P² = [c² cs; cs s²] [c² cs; cs s²]
   = [c⁴ + c²s²   c³s + cs³;   c³s + cs³   c²s² + s⁴]
   = [c²(c² + s²)   cs(c² + s²);   cs(c² + s²)   s²(c² + s²)] = P.
In general, for any projector P, any v ∈ range(P) is projected onto itself: if v = Px for some x, then
P v = P(Px) = P²x = Px = v.
We also have
P (P v − v) = P 2 v − P v = P v − P v = 0,
so that P v − v ∈ null(P ).
Lemma 3.1 If P is a projector then
range(I − P) = null(P),    (10)
null(I − P) = range(P).    (11)
Proof We show (10), then (11) will follow by applying the same arguments for P =
I − (I − P ). Equality of two sets is shown by mutual inclusions, i.e., A = B if A ⊆ B
and B ⊆ A.
First, we show null(P ) ⊆ range(I − P ). Take a vector v such that P v = 0. Then
(I − P )v = v − P v = v. In words, any v in the nullspace of P is also in the range of
I − P.
Now, we show range(I − P ) ⊆ null(P ). We know that any x ∈ range(I − P ) is
characterized by
x = (I − P )v for some v.
Thus
x = v − P v = −(P v − v) ∈ null(P )
since we showed earlier that P (P v−v) = 0. Thus if x ∈ range(I −P ), then x ∈ null(P ).
v = P v + (I − P )v,
with r orthogonal to {q_1, . . . , q_n}. This corresponds to the decomposition
v = (I − P)v + Pv    (12)
with P = Σ_{i=1}^{n} q_i q_i*.
Note that Σ_{i=1}^{n} q_i q_i* = QQ* with Q = [q_1 q_2 . . . q_n]. Thus the orthogonal decomposition (12) can be rewritten as
v = (I − QQ*)v + QQ*v.    (13)
2. (QQ∗ )∗ = QQ∗ .
Remark The orthogonal decomposition (13) will be important for the implementation
of the QR decomposition later on. In particular we will use the rank-1 projector
Pq = qq ∗
P⊥q = I − qq ∗ .
Thus,
v = (I − qq ∗ )v + qq ∗ v,
or, more generally, the orthogonal projection onto an arbitrary direction a is given by
v = (I − aa*/(a*a)) v + (aa*/(a*a)) v,
where we abbreviate P_a = aa*/(a*a) and P_{⊥a} = I − aa*/(a*a).
a∗j (P v − v) = 0, j = 1, . . . , n.
a∗j (Ax − v) = 0, j = 1, . . . , n
A∗ (Ax − v) = 0
or
A∗ Ax = A∗ v.
One can show that (A∗ A)−1 exists provided the columns of A are linearly independent
(our assumption). Then
x = (A∗ A)−1 A∗ v.
Finally,
P v = Ax = A(A*A)^{-1}A* v,   so that   P = A(A*A)^{-1}A*.
Remark Note that this includes the earlier discussion when {a1 , . . . , an } is orthonor-
mal since then A∗ A = I and P = AA∗ as before.
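A brief numerical sketch of this projector (the example data are hypothetical, not from the notes):

A = [1 0; 1 1; 0 2];            % full column rank, so A'*A is invertible
v = [1; 2; 3];

P  = A * ((A' * A) \ A');       % orthogonal projector onto range(A)
Pv = P * v;

norm(P*P - P)                   % idempotent, up to rounding
norm(A' * (Pv - v))             % residual Pv - v is orthogonal to the columns of A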
4 QR Factorization
4.1 Reduced vs. Full QR
Consider A ∈ Cm×n with m ≥ n. The reduced QR factorization of A is of the form
A = Q̂R̂,
where Q̂ ∈ Cm×n with orthonormal columns and R̂ ∈ Cn×n an upper triangular matrix
such that R̂(j, j) ≠ 0, j = 1, . . . , n.
As with the SVD Q̂ provides an orthonormal basis for range(A), i.e., the columns of
A are linear combinations of the columns of Q̂. In fact, we have range(A) = range(Q̂).
This is true since Ax = Q̂R̂x = Q̂y for some y, so that range(A) ⊆ range(Q̂). Moreover, range(Q̂) ⊆ range(A) since we can write Q̂ = AR̂^{-1} because R̂ is upper triangular with nonzero diagonal elements. (Now we have Q̂x = AR̂^{-1}x = Ay for some y.)
Note that the leading columns satisfy the analogous property, i.e.,
span{a_1, . . . , a_j} = span{q_1, . . . , q_j},   j = 1, . . . , n.
In order to obtain the full QR factorization we proceed as with the SVD and extend
Q̂ to a unitary matrix Q. Then A = QR with unitary Q ∈ Cm×m and upper triangular
R ∈ Cm×n . Note that (since m ≥ n) the last m − n rows of R will be zero.
q ∗i q j = δij , (18)
R is upper triangular and A = QR. The latter two conditions are already reflected in
the formulas above.
Using (14) in the orthogonality condition (18) we get
q_1* q_1 = a_1* a_1 / r_11² = 1,
so that
r_11 = √(a_1* a_1) = ‖a_1‖₂.
Note that we arbitrarily chose the positive square root here (so that the factorization
becomes unique).
Next, the orthogonality condition (18) gives us
q_1* q_2 = 0,   q_2* q_2 = 1.
The first of these conditions yields
q_1* q_2 = ( q_1* a_2 − r_12 q_1* q_1 ) / r_22 = 0.
Since we ensured q_1* q_1 = 1 in the previous step, the numerator yields r_12 = q_1* a_2, so that
q_2 = ( a_2 − (q_1* a_2) q_1 ) / r_22.
To find r22 we normalize, i.e., demand that q ∗2 q 2 = 1 or equivalently kq 2 k2 = 1. This
immediately gives
r22 = ka2 − (q ∗1 a2 )q 1 k2 .
To fully understand how the algorithm proceeds we add one more step (for n = 3).
Now we have three orthogonality conditions:
q_1* q_3 = 0,   q_2* q_3 = 0,   q_3* q_3 = 1.
The first orthogonality condition gives
q_1* q_3 = ( q_1* a_3 − r_13 q_1* q_1 − r_23 q_1* q_2 ) / r_33 = 0,
so that r_13 = q_1* a_3 due to the orthonormality of the columns q_1 and q_2.
Similarly, the second orthogonality condition together with (17) for n = 3 yields
q_2* q_3 = ( q_2* a_3 − r_13 q_2* q_1 − r_23 q_2* q_2 ) / r_33 = 0,
so that r_23 = q_2* a_3.
Together this gives us
q_3 = ( a_3 − (q_1* a_3) q_1 − (q_2* a_3) q_2 ) / r_33,
and the last unknown, r_33, is determined by normalization, i.e.,
r_33 = ‖ a_3 − (q_1* a_3) q_1 − (q_2* a_3) q_2 ‖₂.
In general we can formulate the following algorithm:
r_ij = q_i* a_j   (i ≠ j),
v_j = a_j − Σ_{i=1}^{j−1} r_ij q_i,
r_jj = ‖v_j‖₂,
q_j = v_j / r_jj.
We can compute the reduced QR factorization with the following (somewhat more
practical and almost Matlab implementation of the) classical Gram-Schmidt algorithm.
Algorithm (Classical Gram-Schmidt)
for j = 1 : n
v j = aj
for i = 1 : (j − 1)
rij = q ∗i aj
v j = v j − rij q i
end
rjj = kv j k2
q j = v j /rjj
end
Remark The classical Gram-Schmidt algorithm is not ideal for numerical calcula-
tions since it is known to be unstable. Note that, by construction, the Gram-Schmidt
algorithm yields an existence proof for the QR factorization.
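A runnable Matlab/Octave version of this pseudocode might look as follows (a sketch only; it contains no safeguards against rank deficiency):

function [Q, R] = clgs(A)
% Classical Gram-Schmidt: reduced QR factorization A = Q*R.
    [m, n] = size(A);
    Q = zeros(m, n);  R = zeros(n, n);
    for j = 1:n
        v = A(:, j);
        for i = 1:j-1
            R(i, j) = Q(:, i)' * A(:, j);   % r_ij = q_i' * a_j
            v = v - R(i, j) * Q(:, i);      % subtract the projections
        end
        R(j, j) = norm(v);
        Q(:, j) = v / R(j, j);
    end
end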
Next,
v_2 = a_2 − (q_1* a_2) q_1,   with r_12 = q_1* a_2,
    = [2; 1; 0] − (2/√2) (1/√2)[1; 0; 1] = [1; 1; −1].
This calculation required that r_12 = 2/√2 = √2. Moreover, r_22 = ‖v_2‖ = √3 and
q_2 = v_2 / ‖v_2‖ = (1/√3) [1; 1; −1].
v_3 = a_3 − (q_1* a_3) q_1 − (q_2* a_3) q_2,   with r_13 = q_1* a_3 and r_23 = q_2* a_3.
Ax = b ⇐⇒ QRx = b ⇐⇒ Rx = Q∗ b,
where the last equation holds since Q is unitary, we can proceed as follows:
1. Compute the factorization A = QR.
2. Compute y = Q*b.
3. Solve the upper triangular system Rx = y.
We will have more applications for the QR factorization later in the context of least
squares problems.
Remark The QR factorization (if implemented properly) yields a very stable method
for solving Ax = b. However, it is about twice as costly as Gauss elimination (or
A = LU ). In fact, the QR factorization can also be applied to rectangular systems and
it is the basis of Matlab’s backslash matrix division operator. We will discuss Matlab
examples in a later section.
Note that this means we are performing a sequence of vector projections. The starting
point for the modified Gram-Schmidt algorithm is to rewrite one step of the classical
Gram-Schmidt algorithm as a single matrix projection, i.e.,
v_j = a_j − Σ_{i=1}^{j−1} (q_i* a_j) q_i
    = a_j − Σ_{i=1}^{j−1} (q_i q_i*) a_j
    = a_j − Q̂_{j−1} Q̂_{j−1}* a_j
    = ( I − Q̂_{j−1} Q̂_{j−1}* ) a_j =: P_j a_j,
Lemma 4.2 If Pj = I − Q̂j−1 Q̂∗j−1 with Q̂j−1 = [q 1 q 2 . . . q j−1 ] a matrix with orthonor-
mal columns, then
P_j = Π_{i=1}^{j−1} P_{⊥q_i}.
Proof First we remember that
P_j = I − Q̂_{j−1} Q̂_{j−1}* = I − Σ_{i=1}^{j−1} q_i q_i*
and
P_{⊥q_i} = I − q_i q_i*.
This is done by induction. For j = 1 the sum and the product are empty and the statement holds by the convention that an empty sum is zero and an empty product is the identity, i.e., P_1 = I.
Now we step from j − 1 to j. First
Π_{i=1}^{j} (I − q_i q_i*) = [ Π_{i=1}^{j−1} (I − q_i q_i*) ] (I − q_j q_j*)
                          = ( I − Σ_{i=1}^{j−1} q_i q_i* ) (I − q_j q_j*)
Summarizing the discussion thus far, a single step in the Gram-Schmidt algorithm
can be written as
v j = P⊥qj−1 P⊥qj−2 . . . P⊥q1 aj ,
or – more algorithmically:
v j = aj
for i = 1 : (j − 1)
v j = v j − q i q ∗i v j
end
For the final modified Gram-Schmidt algorithm the projections are arranged differ-
ently, i.e., P⊥qi is applied to all v j with j > i. This leads to
Algorithm (Modified Gram-Schmidt)
for i = 1 : n
v i = ai
end
for i = 1 : n
rii = kv i k2
q_i = v_i / r_ii
for j = (i + 1) : n
rij = q ∗i v j
v j = v j − rij q i
end
end
We can compare the operations count, i.e., the number of basic arithmetic operations
(‘+’,‘-’,‘*’,‘/’), of the two algorithms. We give only a rough estimate (exact counts will
be part of the homework). Assuming vectors of length m, for the classical Gram-
Schmidt roughly 4m operations are performed inside the innermost loop (actually m
multiplications and m−1 additions for the inner product, and m multiplications and m
subtractions for the formula in the second line). Thus, the operations count is roughly
Σ_{j=1}^{n} Σ_{i=1}^{j−1} 4m = Σ_{j=1}^{n} (j − 1) 4m ≈ 4m Σ_{j=1}^{n} j = 4m n(n + 1)/2 ≈ 2mn².
The innermost loop of the modified Gram-Schmidt algorithm consists formally of ex-
actly the same operations, i.e., requires also roughly 4m operations. Thus its operation
count is
Σ_{i=1}^{n} Σ_{j=i+1}^{n} 4m = Σ_{i=1}^{n} (n − i) 4m = 4m ( n² − Σ_{i=1}^{n} i ) = 4m ( n² − n(n + 1)/2 ) ≈ 2mn².
Thus, the operations count for the two algorithms is the same. In fact, mathematically,
the two algorithms can be shown to be identical. However, we will learn later that the
modified Gram-Schmidt algorithm is to be preferred due to its better numerical stability
(see Section 4.6).
where R_1, . . . , R_n are upper triangular matrices. For example, R_1 equals the identity matrix except for its first row,
R_1 = [1/r_11  −r_12/r_11  −r_13/r_11  · · ·  −r_1n/r_11;  0  1  0  · · ·  0;  0  0  1  · · ·  0;  . . . ;  0  · · ·  0  1],
and R_2 equals the identity matrix except for its second row,
R_2 = [1  0  0  · · ·  0;  0  1/r_22  −r_23/r_22  · · ·  −r_2n/r_22;  0  0  1  · · ·  0;  . . . ;  0  · · ·  0  1],
and so on.
Thus we are applying triangular transformation matrices to A to obtain a matrix Q̂
with orthonormal columns. We refer to this approach as triangular orthogonalization.
Since the inverse of an upper triangular matrix is again an upper triangular matrix,
and the product of two upper triangular matrices is also upper triangular, we can think
of the product R1 R2 . . . Rn in (19) in terms of a matrix R̂−1 . Thus, the (modified)
Gram-Schmidt algorithm yields a reduced QR factorization
A = Q̂R̂
of A.
Now from the (classical) Gram-Schmidt algorithm we know that
r_11 = ‖a_1‖₂ = ‖ Σ_{i=1}^{80} σ_i v_{1i} u_i ‖₂.
Since the singular values were chosen to decrease exponentially, only the first one really matters, i.e.,
r_11 ≈ ‖σ_1 v_11 u_1‖₂ = σ_1 |v_11| ≈ (1/2) · (1/√80)
(since ‖u_1‖₂ = 1).
Similar arguments result in the general relationship
r_jj ≈ σ_j / √80
(the latter of which we know). The plot produced by GramSchmidt.m shows how accurately the diagonal elements of R are computed. We can observe that the classical Gram-Schmidt algorithm is stable only up to σ_j ≈ √eps (where eps is the machine epsilon), whereas the modified Gram-Schmidt method is stable all the way up to σ_j ≈ eps.
AR1 R2 . . . Rn = Q̂
Qn Qn−1 . . . Q2 Q1 A = R,
−→ Q_2 Q_1 A = [x x x; 0 x x; 0 0 x; 0 0 x]   −→   Q_3 Q_2 Q_1 A = [x x x; 0 x x; 0 0 x; 0 0 0],
where x stands for a generally nonzero entry. From this we note that Qk needs to
operate on rows k : m and not change the first k − 1 rows and columns. Therefore it
will be of the form
Q_k = [I_{k−1}  O;  O  F],
where Ik−1 is a (k − 1) × (k − 1) identity matrix and F has the effect that
F x = kxke1
in order to introduce zeros in the lower part of column k. We will call F a Householder
reflector.
Graphically, we can use either a rotation (Givens rotation) or a reflection about the
bisector of x and e1 to transform x to kxke1 .
Recall from an earlier homework assignment that if P is an orthogonal projector, then I − 2P is its own inverse; in fact, I − 2P is a reflector. Therefore, if we choose v = ‖x‖e_1 − x and define P = vv*/(v*v), then
F = I − 2P = I − 2 vv*/(v*v)
is our desired Householder reflector. Since it is easy to see that F is Hermitian, so is Q_k. Note that Fx can be computed as
Fx = ( I − 2 vv*/(v*v) ) x = x − 2 (vv*/(v*v)) x = x − 2 v (v*x)/(v*v),
where the middle expression still involves the matrix vv*, while the last one requires only the scalar v*x.
In fact, we have two choices for the reflection F x: v + = −x + sign(x(1))kxke1
and v − = −x − sign(x(1))kxke1 . Here x(1) denotes the first component of the vector
x. These choices are illustrated in Figure 4. A numerically more stable algorithm
(that will avoid cancellation of significant digits) will be guaranteed by choosing that
reflection which moves x further. Therefore we pick
v = x + sign(x(1))kxke1 ,
Algorithm (Householder QR)
for k = 1 : n
    x = A(k : m, k)
    v_k = x + sign(x(1)) ‖x‖₂ e_1
    v_k = v_k / ‖v_k‖₂
    A(k : m, k : n) = A(k : m, k : n) − 2 v_k ( v_k* A(k : m, k : n) )
end
Note that the statement in the last line of the algorithm performs the reflection
simultaneously for all remaining columns of the matrix A. On completion of this
algorithm the matrix A contains the matrix R of the QR factorization, and the vectors
v 1 , . . . , v n are the reflection vectors. They will be used to calculate matrix-vector
products of the form Qx and Q∗ b later on. The matrix Q itself is not output. It can
be constructed by computing special matrix-vector products Qx with x = e1 , . . . , en .
v = x + sign(x(1)) ‖x‖₂ e_1
  = [2; 1; 2] + 3 [1; 0; 0] = [5; 1; 2].
Next we form Fx = x − 2 v (v*x)/(v*v). To this end we note that v*x = 15 and v*v = 30. Thus
Fx = [2; 1; 2] − 2 (15/30) [5; 1; 2] = [−3; 0; 0].
This vector contains the desired zero.
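The same reflection in Matlab/Octave (a small sketch reproducing the numbers above):

x = [2; 1; 2];
v = x + sign(x(1)) * norm(x) * [1; 0; 0];    % v = [5; 1; 2]
Fx = x - 2 * v * (v' * x) / (v' * v);        % F*x = [-3; 0; 0]
disp(Fx')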
For many applications only products of the form Q∗ b or Qx are needed. For
example, if we want to solve the linear system Ax = b then we can do this with the
QR factorization by first computing y = Q∗ b and then solving Rx = y. Therefore, we
list the respective algorithms for these two types of matrix vector products.
For the first algorithm we need to remember that
Q_n · · · Q_2 Q_1 A = R,   i.e., Q* = Q_n · · · Q_2 Q_1,
so that we can apply exactly the same steps that were applied to the matrix A in the
Householder QR algorithm:
Algorithm (Compute Q*b)
for k = 1 : n
    b(k : m) = b(k : m) − 2 v_k ( v_k* b(k : m) )
end

Algorithm (Compute Qx)
for k = n : −1 : 1
    x(k : m) = x(k : m) − 2 v_k ( v_k* x(k : m) )
end
The operations counts for the three algorithms listed above are
Q∗ b, Qx: O(mn)
5 Least Squares Problems
Consider the solution of Ax = b, where A ∈ Cm×n with m > n. In general, this system
is overdetermined and no exact solution is possible.
Example Fit a straight line to 10 measurements. If we represent the line by f(x) = mx + c and the 10 pieces of data are {(x_1, y_1), . . . , (x_10, y_10)}, then the constraints can be summarized in the linear system
[x_1 1; x_2 1; . . . ; x_10 1] [m; c] = [y_1; y_2; . . . ; y_10],
with coefficient matrix A, unknown vector x = [m; c], and right-hand side b.
This type of problem is known as linear regression or (linear) least squares fitting.
The basic idea (due to Gauss) is to minimize the 2-norm of the residual vector, i.e.,
kb − Axk2 .
In other words, we want to find x ∈ Cⁿ such that
Σ_{i=1}^{m} [ b_i − (Ax)_i ]²
is minimized. For the notation used in the example above we want to find m and c such that Σ_{i=1}^{10} [ y_i − (m x_i + c) ]² is minimized.
Example We can generalize the previous example to polynomial least squares fitting
of arbitrary degree. To this end we assume that
p(x) = Σ_{i=0}^{n} c_i x^i,
where n is the degree of the polynomial.
We can fit a polynomial of degree n to m > n data points (xi , yi ), i = 1, . . . , m,
using the least squares approach, i.e.,
min Σ_{i=1}^{m} [ y_i − p(x_i) ]²
is used as constraint for the overdetermined linear system Ax = b with
A = [1 x_1 x_1² · · · x_1^n; 1 x_2 x_2² · · · x_2^n; · · · ; 1 x_m x_m² · · · x_m^n],   x = [c_0; c_1; c_2; · · · ; c_n],   b = [y_1; y_2; · · · ; y_m].
Remark The special case n = m − 1 is called interpolation and is known to have a unique solution if the conditions are independent, i.e., if the points x_i are distinct. However, for large degrees n we frequently observe severe oscillations, which is undesirable.
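As an illustration, the following sketch sets up the overdetermined system for a straight-line fit (degree 1) to some made-up data and solves it in the least squares sense with Matlab's backslash operator (which, as noted later, is based on a QR factorization):

xi = (1:10)';                        % hypothetical abscissae
yi = 2*xi + 1 + 0.1*randn(10, 1);    % noisy data around the line y = 2x + 1

A = [xi, ones(10, 1)];               % columns multiply m and c in f(x) = m*x + c
coef = A \ yi;                       % least squares solution [m; c]

fprintf('m = %.3f, c = %.3f\n', coef(1), coef(2))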
5.1 How to Compute the Least Squares Solution
We want to find x such that Ax ∈ range(A) is as close as possible to a given vector b.
It should be clear that we need Ax to be the orthogonal projection of b onto the
range of A, i.e.,
Ax = P b.
Then the residual r = b − Ax will be minimal.
(20) says that r is perpendicular to the range of A. (21) is known as the set of
normal equations.
Proof To see that (20) ⇔ (21) we use the definition of the residual r = b − Ax. Then
A∗ r = 0 ⇐⇒ A∗ (b − Ax) = 0
⇐⇒ A∗ Ax = A∗ b.
To see that (21) ⇔ (22) we use that the orthogonal projector onto the range of A is
given by
P = A(A∗ A)−1 A∗ .
Then
P b = Ax  ⟺  A(A*A)^{-1}A* b = Ax
          ⟺  A*A(A*A)^{-1}A* b = A*Ax
          ⟺  A*Ax = A* b,
where we used (A*A)(A*A)^{-1} = I in the middle step.
Remark If A has full rank then A∗ A is also Hermitian positive definite, i.e., x∗ A∗ Ax >
0 for any nonzero n-vector x.
For full-rank A we can take (21) and obtain the least squares solution as
x = (A*A)^{-1}A* b,
where A⁺ = (A*A)^{-1}A* is the pseudoinverse of A.
5.2 Algorithms for finding the Least Squares Solution
5.2.1 Cholesky Factorization
This can be applied for a full-rank matrix A. As mentioned above A∗ A is Hermitian
positive definite and one can apply a symmetric form of Gaussian elimination resulting
in
A∗ A = R ∗ R
with upper triangular matrix R (more details will be provided in a later section).
This means that we have
A∗ Ax = A∗ b ⇐⇒ R∗ Rx = A∗ b
⇐⇒ R ∗ w = A∗ b
with w = Rx. Since R is upper triangular (and R∗ is lower triangular) this is easy to
solve.
We obtain one of our three-step algorithms:
Algorithm (Cholesky Least Squares)
(1) Form A*A and A*b.
(2) Compute the Cholesky factorization A*A = R*R.
(3) Solve the lower triangular system R*w = A*b for w, and then the upper triangular system Rx = w for x.
Remark The solution of the normal equations is likely to be unstable. Therefore this
method is not recommended in general. For small problems it is usually safe to use.
5.2.2 QR Factorization
This works also for full-rank matrices A. Recall that the reduced QR factorization is
given by A = Q̂R̂ with Q̂ an m × n matrix with orthonormal columns, and R̂ an n × n
upper triangular matrix.
Now the normal equations can be re-written as
R̂*Q̂*Q̂R̂x = R̂*Q̂*b   ⟺   R̂*R̂x = R̂*Q̂*b.
Since A has full rank R̂ will be invertible and we can further simplify to
R̂x = Q̂*b.
(1) Compute the reduced QR factorization A = Q̂R̂.
(2) Compute y = Q̂*b.
(3) Solve the upper triangular system R̂x = y.
Alternatively, recall that the least squares solution satisfies Ax = Pb with the orthogonal projector P = Q̂Q̂* onto range(A), i.e.,
Q̂R̂x = Q̂Q̂*b,
so that (multiplying both sides by Q̂*) we have
R̂x = Q̂*b
as before.
From either interpretation we see that
x = R̂^{-1} Q̂* b,
so that in this setting the pseudoinverse of A is given by A⁺ = R̂^{-1} Q̂*.
Remark This approach is more stable than the Cholesky approach and is considered
the standard method for least squares problems.
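In Matlab/Octave the three steps read as follows (a sketch; qr(A, 0) returns the reduced factorization):

A = rand(20, 3);  b = rand(20, 1);     % a random overdetermined example

[Qhat, Rhat] = qr(A, 0);               % (1) reduced QR factorization
y = Qhat' * b;                         % (2) coordinates of the projection of b
x = Rhat \ y;                          % (3) back substitution for Rhat*x = y

norm(x - A\b)                          % agrees with backslash up to rounding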
5.2.3 SVD
We again assume that A has full rank. Recall that the reduced SVD is given by
A = Û Σ̂V ∗ , where Û ∈ Cm×n , Σ̂ ∈ Rn×n , and V ∈ Cn×n .
We start again with the normal equations
A*Ax = A*b  ⟺  V Σ̂* Û* Û Σ̂ V* x = V Σ̂ Û* b
            ⟺  V Σ̂² V* x = V Σ̂ Û* b,
where we used Û*Û = I and Σ̂* = Σ̂.
Since A has full rank we can invert Σ̂ and multiply the last equation by Σ̂−1 V ∗ .
This results in
Σ̂V ∗ x = Û ∗ b
or
Σ̂w = Û ∗ b with w = V ∗ x.
Therefore (with the SVD notation) the pseudoinverse is given by
A+ = V Σ̂−1 U ∗
and the least squares solution is given by
x = A+ b = V Σ̂−1 U ∗ b.
The algorithm is
Algorithm (SVD Least Squares)
(1) Compute the reduced SVD A = Û Σ̂V ∗ .
(2) Compute Û ∗ b.
(3) Solve the diagonal system Σ̂w = Û ∗ b for w.
(4) Compute x = V w.
This time the operations count is O(2mn² + 11n³), which is comparable to that of the QR factorization provided m ≫ n. Otherwise this algorithm is more expensive, but also more stable.
or
x = V Σ+ U ∗ b,
where now the pseudoinverse is given by
A+ = V Σ + U ∗ .
Here the pseudoinverse of Σ is defined as
Σ⁺ = [Σ_1^{-1}  O;  O  O],
where Σ1 is that part of Σ containing the positive (and therefore invertible) singular
values.
As a final remark we note that there exists also a variant of the QR factorization
that is more stable due to the use of column pivoting. The idea of pivoting will be
discussed later in the context of the LU factorization.
6 Conditioning and Stability
A computing problem is well-posed if
1. a solution exists (e.g., we want to rule out situations that lead to division by
zero),
3. the solution depends continuously on the data, i.e., a small change in the data
should result in a small change in the answer. This phenomenon is referred to as
stability of the problem.
1. x_0 = 1, x_n = (1/3) x_{n−1} for n ≥ 1,
3. z_0 = 1, z_1 = 1/3, z_{n+1} = (10/3) z_n − z_{n−1} for n ≥ 1.
The validity of the latter two approaches can be proved by induction. We illustrate
these algorithms with the Maple worksheet 477_577_stability.mws. Use of slightly
perturbed initial values shows us that the first algorithm yields stable errors throughout.
The second algorithm has stable errors, but unstable relative errors. And the third
algorithm is unstable in either sense.
e = x − x̃.
Since x is not known to us in general we often judge the accuracy of the solution by
looking at the residual
r = b − Ax̃ = Ax − Ax̃ = Ae
and hope that a small residual guarantees a small error.
(a) Let's assume we computed a solution x̃ = [1.01, 1.01]ᵀ. Then the error
e = x − x̃ = [−0.01; −0.01]
is small, and the residual
r = b − Ax̃ = [2; 2] − [2.02; 2.02] = [−0.02; −0.02]
is small as well.
(b) Now, let's assume that we computed a solution x̃ = [2, 0]ᵀ. This “solution” is obviously not a good one; its error
e = x − x̃ = [−1; 1]
is large. However, the residual
r = b − Ax̃ = [2; 2] − [2.02; 1.98] = [−0.02; 0.02]
is still small. This is not good. This shows that the residual is not a reliable indicator of the accuracy of the solution.
(c) If we change the right-hand side of the problem to b = [2, −2]ᵀ so that the exact solution becomes x = [100, −100]ᵀ, then things behave “wrong” again. Let's assume we computed a solution x̃ = [101, −99]ᵀ with a relatively small error of e = [−1, −1]ᵀ. However, the residual now is
r = [2; −2] − [4; 0] = [−2; −2],
which is quite large.
What is the explanation for the phenomenon we’re observing? The answer is, the
matrix A is ill-conditioned.
Let's try to get a better understanding of how the error and the residual are related for the problem Ax = b. Since r = Ae we have e = A^{-1}r, and thus
‖e‖ = ‖A^{-1} r‖ ≤ ‖A^{-1}‖ ‖r‖.
Often, however, it is better to consider the relative error ‖e‖/‖x‖ (and the relative residual ‖r‖/‖b‖):
‖e‖ ≤ ‖A^{-1}‖ ‖r‖ (‖Ax‖/‖b‖)        (since Ax = b, so ‖Ax‖/‖b‖ = 1)
    ≤ ‖A^{-1}‖ ‖A‖ ‖x‖ ‖r‖/‖b‖,
so that
‖e‖/‖x‖ ≤ κ(A) ‖r‖/‖b‖,    (23)
where κ(A) = ‖A‖ ‖A^{-1}‖ is the condition number of A.
Remark The condition number depends on the type of norm used. For the 2-norm of a nonsingular m × m matrix A we know ‖A‖₂ = σ_1 (the largest singular value of A) and ‖A^{-1}‖₂ = 1/σ_m. If A is singular then κ(A) = ∞.
Also note that κ(A) = σ_1/σ_m ≥ 1. In fact, κ(A) ≥ 1 holds for any norm.
How should we interpret the bound (23)? If κ(A) is large (i.e., the matrix is ill-
conditioned), then relatively small perturbations of the right-hand side b (and therefore
the residual) may lead to large errors; an instability.
For well-conditioned problems (i.e., κ(A) ≈ 1) we can also get a useful bound telling us what sort of relative error ‖x − x̃‖/‖x‖ we should at least expect. Consider
‖r‖ ‖x‖ = ‖b − b̃‖ ‖x‖
        = ‖Ax − Ax̃‖ ‖x‖ = ‖A(x − x̃)‖ ‖x‖
        = ‖Ae‖ ‖x‖
        = ‖Ae‖ ‖A^{-1}b‖ ≤ ‖A‖ ‖e‖ ‖A^{-1}‖ ‖b‖,
so that
(1/κ(A)) ‖r‖/‖b‖ ≤ ‖e‖/‖x‖.    (24)
Of course, we can combine (23) and (24) to obtain
(1/κ(A)) ‖r‖/‖b‖ ≤ ‖e‖/‖x‖ ≤ κ(A) ‖r‖/‖b‖.    (25)
These bounds are true for any A, but show that the residual is a good indicator of the error only if A is well-conditioned.
We now return to our earlier example. The singular values of A = [1.01 0.99; 0.99 1.01] are σ_1 = 2 and σ_2 = 0.02, which implies
κ(A) = σ_1/σ_2 = 2/0.02 = 100.
For a 2 × 2 matrix this is an indication that A is fairly ill-conditioned. We see that the
bounds (25) allow for large variations:
(1/100) ‖r‖/‖b‖ ≤ ‖x − x̃‖/‖x‖ ≤ 100 ‖r‖/‖b‖.
Thus the relative residual is not a good error indicator (as we saw in our initial calcu-
lations).
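These quantities are easy to reproduce (a sketch using the matrix of this example, which also appears in the next example):

A = [1.01 0.99; 0.99 1.01];
b = [2; 2];  x = [1; 1];       % exact solution of A*x = b

cond(A)                        % condition number sigma_1/sigma_2 = 100

xt = [2; 0];                   % the poor "solution" from part (b)
r  = b - A*xt;                 % residual is small ...
e  = x - xt;                   % ... although the error is not
fprintf('rel. residual %.3f, rel. error %.3f\n', norm(r)/norm(b), norm(e)/norm(x))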
Ãx̃ = b  ⟺  (A + δA)(x + δx) = b
        ⟺  (Ax − b) + (δA)x + A(δx) + (δA)(δx) = 0,
where Ax − b = 0. If we neglect the term with the product of the deltas then we get
δx ≈ −A^{-1}(δA)x,   so that   ‖δx‖/‖x‖ ≤ ‖A^{-1}‖ ‖δA‖ = κ(A) ‖δA‖/‖A‖.    (26)
Example We consider
A = [1.01 0.99; 0.99 1.01]   with   δA = [−0.01 0.01; 0.01 −0.01].
Now
Ã = A + δA = [1 1; 1 1],
which is even singular, so that Ãx̃ = b with b = [2, −2]ᵀ has no solution at all.
Remark For matrices with condition number κ(A) one can expect to lose log10 κ(A)
digits when solving Ax = b.
6.3 Backward Stability
In light of the estimate (26) we say that an algorithm for solving Ax = b is backward
stable if
‖x − x̃‖ / ‖x‖ = O(κ(A) ε_machine),
i.e., if the significance of the error produced by the algorithm is due only to the condi-
tioning of the matrix.
Remark We can view a backward stable algorithm as one which delivers the “right answer to a perturbed problem”, namely Ãx̃ = b, with a perturbation of the order ‖A − Ã‖/‖A‖ = O(ε_machine).
Without providing any details (for more information see Chapter 18 in [Trefethen/Bau]),
for least-squares problems the estimate (26) becomes
‖x − x̃‖ / ‖x‖ ≤ ( κ(A) + κ²(A) tan θ / η ) ‖A − Ã‖ / ‖A‖,    (27)
where κ(A) = ‖A‖ ‖A^{-1}‖, θ = cos^{-1}( ‖Ax‖/‖b‖ ), and η = ‖A‖ ‖x‖ / ‖Ax‖.
6.5 Stabilization of Modified Gram-Schmidt
In order to be able to obtain Q with better orthogonality properties we apply the QR
factorization directly to the augmented system, i.e., compute
[A  b] = Q_2 R_2.
Then
R_2 = [R̂  Q̂*b;  O  0]
and R̂x = Q̂∗ b can be solved. However, now Q̂∗ b is more accurate than if obtained via
the QR factorization of A alone.
A lot more details for the stabilization and the previous example are provided in
Chapter 19 of [Trefethen/Bau].
7 Gaussian Elimination and LU Factorization
In this final section on matrix factorization methods for solving Ax = b we want to
take a closer look at Gaussian elimination (probably the best known method for solving
systems of linear equations).
The basic idea is to use left-multiplication of A ∈ Cm×m by (elementary) lower
triangular matrices, L_1, L_2, . . . , L_{m−1}, to convert A to upper triangular form, i.e.,
L_{m−1} L_{m−2} · · · L_2 L_1 A = U,
where we denote the product L_{m−1} L_{m−2} · · · L_2 L_1 by L̃.
Note that the product of lower triangular matrices is a lower triangular matrix, and
the inverse of a lower triangular matrix is also lower triangular. Therefore,
L̃A = U   ⟺   A = LU,
where L = L̃^{-1}. This approach can be viewed as triangular triangularization.
L (Ux) = b,   where we abbreviate y = Ux.
Moreover, consider the problem AX = B (i.e., many different right-hand sides that
are associated with the same system matrix). In this case we need to compute the
factorization A = LU only once, and then
AX = B ⇐⇒ LU X = B,
In order to appreciate the usefulness of this approach note that the operations count for the matrix factorization is O((2/3)m³), while that for forward and back substitution is O(m²).
Example We consider the matrix
A = [1 1 1; 2 3 5; 4 6 8]
and compute its LU factorization by applying elementary lower triangular transformation matrices.
We choose L_1 such that left-multiplication corresponds to subtracting multiples of row 1 from the rows below such that the entries in the first column of A are zeroed out (cf. the first homework assignment). Thus
L_1 A = [1 0 0; −2 1 0; −4 0 1] [1 1 1; 2 3 5; 4 6 8] = [1 1 1; 0 1 3; 0 2 4].
Next, we repeat this operation analogously for L2 (in order to zero what is left in
column 2 of the matrix on the right-hand side above):
L_2 (L_1 A) = [1 0 0; 0 1 0; 0 −2 1] [1 1 1; 0 1 3; 0 2 4] = [1 1 1; 0 1 3; 0 0 −2] = U.
Now L = (L_2 L_1)^{-1} = L_1^{-1} L_2^{-1} with
L_1^{-1} = [1 0 0; 2 1 0; 4 0 1]   and   L_2^{-1} = [1 0 0; 0 1 0; 0 2 1],
so that
L = [1 0 0; 2 1 0; 4 2 1].
Remark Note that L always is a unit lower triangular matrix, i.e., it has ones on the
diagonal. Moreover, L is always obtained as above, i.e., the multipliers are accumulated
into the lower triangular part with a change of sign.
The claims made above can be verified as follows. First, we note that the multipliers
in Lk are of the form
ℓ_{jk} = a_{jk} / a_{kk},   j = k + 1, . . . , m,
so that L_k equals the identity matrix except for column k, which below the diagonal contains the entries −ℓ_{k+1,k}, . . . , −ℓ_{m,k}.
Now, let
ℓ_k = [0, . . . , 0, ℓ_{k+1,k}, . . . , ℓ_{m,k}]ᵀ.
Then L_k = I − ℓ_k e_k*, and therefore
(I − ℓ_k e_k*) (I + ℓ_k e_k*) = I − ℓ_k (e_k* ℓ_k) e_k* = I,
so that L_k^{-1} = I + ℓ_k e_k*, since the inner product e_k* ℓ_k = 0 because the only nonzero entry in e_k (the 1 in the k-th position) does not “hit” any nonzero entries in ℓ_k, which start in the (k + 1)-st position.
So, for any k, L_k^{-1} equals the identity matrix except for column k, which below the diagonal contains the entries +ℓ_{k+1,k}, . . . , +ℓ_{m,k}, as claimed.
In addition,
L_k^{-1} L_{k+1}^{-1} = (I + ℓ_k e_k*)(I + ℓ_{k+1} e_{k+1}*) = I + ℓ_k e_k* + ℓ_{k+1} e_{k+1}*,
since e_k* ℓ_{k+1} = 0. Thus the product is simply the identity matrix with the multipliers ℓ_{k+1,k}, . . . , ℓ_{m,k} and ℓ_{k+2,k+1}, . . . , ℓ_{m,k+1} filled in below the diagonal of columns k and k + 1, respectively.
Algorithm (LU Factorization)
U = A, L = I
for k = 1 : m − 1
    for j = k + 1 : m
        L(j, k) = U(j, k)/U(k, k)
        U(j, k : m) = U(j, k : m) − L(j, k) U(k, k : m)
    end
end
Remark 1. In practice one can actually store both L and U in the original matrix
A since it is known that the diagonal of L consists of all ones.
2. The LU factorization is the cheapest factorization algorithm. Its operations count can be verified to be O((2/3)m³).
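The algorithm translates directly into a short Matlab/Octave function (a sketch without pivoting, so it breaks down on a zero pivot as discussed next):

function [L, U] = lu_nopivot(A)
% LU factorization A = L*U by Gaussian elimination without pivoting.
    m = size(A, 1);
    U = A;  L = eye(m);
    for k = 1:m-1
        for j = k+1:m
            L(j, k) = U(j, k) / U(k, k);                    % multiplier
            U(j, k:m) = U(j, k:m) - L(j, k) * U(k, k:m);    % eliminate row j
        end
    end
end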
However, LU factorization cannot be guaranteed to be stable. The following exam-
ples illustrate this fact.
Example A fundamental problem is given if we encounter a zero pivot as in
A = [1 1 1; 2 2 5; 4 6 8]   =⇒   L_1 A = [1 1 1; 0 0 3; 0 2 4].
Now the (2,2) position contains a zero and the algorithm will break down since it will
attempt to divide by zero.
Example A more subtle example is the following backward instability. Take
A = [1 1 1; 2 2+ε 5; 4 6 8]
with small ε. If ε = 1 then we have the initial example in this chapter, and for ε = 0 we get the previous example.
LU factorization will result in
L_1 A = [1 1 1; 0 ε 3; 0 2 4]
and
L_2 L_1 A = [1 1 1; 0 ε 3; 0 0 4 − 6/ε] = U.
The multipliers were collected in
L = [1 0 0; 2 1 0; 4 2/ε 1].
Now we assume that a right-hand side b is given as
b = [1; 0; 0]
and we attempt to solve Ax = b via
1. Solve Ly = b.
2. Solve U x = y.
If ε is on the order of machine accuracy, then the 4 in the entry 4 − 6/ε in U is insignificant. Therefore, we have
Ũ = [1 1 1; 0 ε 3; 0 0 −6/ε]   and   L̃ = L,
which leads to
L̃Ũ = [1 1 1; 2 2+ε 5; 4 6 4] ≠ A.
In fact, the product is significantly different from A. Thus, using L̃ and Ũ we are
not able to solve a “nearby problem”, and thus the LU factorization method is not
backward stable.
If we use the factorization based on L̃ and Ũ with the above right-hand side b, then we obtain
x̃ = [11/3 − 2ε/3; −2; 2ε/3 − 2/3] ≈ [11/3; −2; −2/3].
Whereas if we were to use the exact factorization A = LU, then we get the exact answer
x = [(4ε − 7)/(2ε − 3); 2/(2ε − 3); −2(ε − 1)/(2ε − 3)] ≈ [7/3; −2/3; −2/3].
Remark Even though L̃ and Ũ are close to L and U , the product L̃Ũ is not close to
LU = A and the computed solution x̃ is worthless.
7.2 Pivoting
Example The breakdown of the algorithm in our earlier example with
L_1 A = [1 1 1; 0 0 3; 0 2 4]
can be overcome by interchanging rows 2 and 3 before the second elimination step.
More generally, stability problems can be avoided by swapping rows before applying
Lk , i.e., we perform
Lm−1 Pm−1 . . . L2 P2 L1 P1 A = U.
The strategy we use for swapping rows in step k is to find the largest element in column
k below (and including) the diagonal — the so-called pivot element — and swap its row
with row k. This process is referred to as partial (row) pivoting. Partial column pivoting
and complete (row and column) pivoting are also possible, but not very popular.
If we now check the computed factorization L̃Ũ, then we see
L̃Ũ = [4 6 8; 2 2 5; 1 1 1] = P Ã,
which is just a permuted version of the original matrix A with permutation matrix
P = P_2 P_1 = [0 0 1; 0 1 0; 1 0 0].
If we use the rounded factors L̃ and Ũ instead, then the computed solution is
x̃ = [7/3; −2/3; −2/3],
which is the exact answer to the problem (see also the Maple worksheet 473_LU.mws).
P A = LU,
where P = P_{m−1} P_{m−2} · · · P_2 P_1 and L = (L′_{m−1} L′_{m−2} · · · L′_2 L′_1)^{-1} with
L′_k = P_{m−1} · · · P_{k+1} L_k P_{k+1}^{-1} · · · P_{m−1}^{-1},
i.e., L′_k is the same as L_k except that the entries below the diagonal are appropriately permuted. In particular, L′_k is still lower triangular.
Remark Since the permutation matrices used here involve only a single row swap each
we have Pk−1 = Pk (while in general, of course, P −1 = P T ).
Remark Due to the pivoting strategy the multipliers will always satisfy |ℓ_{ij}| ≤ 1.
Algorithm (LU Factorization with Partial Pivoting)
U = A, L = I, P = I
for k = 1 : m − 1
    Select i ≥ k so that |U(i, k)| is maximal
    Interchange rows k and i of U(:, k : m), of L(:, 1 : k − 1), and of P
    for j = k + 1 : m
        L(j, k) = U(j, k)/U(k, k)
        U(j, k : m) = U(j, k : m) − L(j, k) U(k, k : m)
    end
end
The operations count for this algorithm is also O((2/3)m³). However, while the row searches and swaps for partial pivoting require O(m²) operations, they would require O(m³) operations in the case of complete pivoting.
Remark The algorithm above is not really practical since one would usually not phys-
ically swap rows. Instead one would use pointers to the swapped rows and store the
permutation operations instead.
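For completeness, here is a small Matlab sketch of the algorithm just described (our own illustration; the function name lu_partial_pivot is made up, and Matlab's built-in [L,U,P] = lu(A) provides the same factorization):

function [L, U, P] = lu_partial_pivot(A)
% LU factorization with partial pivoting, returning P*A = L*U.
m = size(A,1);
L = eye(m); U = A; P = eye(m);
for k = 1:m-1
    [~, i] = max(abs(U(k:m,k)));         % pivot row (relative index)
    i = i + k - 1;
    U([k i], k:m)   = U([i k], k:m);     % swap rows of U
    L([k i], 1:k-1) = L([i k], 1:k-1);   % swap already computed multipliers
    P([k i], :)     = P([i k], :);       % record the permutation
    for j = k+1:m
        L(j,k) = U(j,k)/U(k,k);
        U(j,k:m) = U(j,k:m) - L(j,k)*U(k,k:m);
    end
end
end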
7.3 Stability
We saw earlier that Gaussian elimination without pivoting can be unstable. According
to our previous example the algorithm with pivoting seems to be stable. What can
be proven theoretically?
Since the entries of L are at most 1 in absolute value, the LU factorization becomes
unstable if the entries of U are unbounded relative to those of A (we need ‖L‖‖U‖ =
O(‖PA‖)). Therefore we define the growth factor
ρ = max_{i,j} |u_{ij}| / max_{i,j} |a_{ij}| .
Theorem 7.1 Let A ∈ Cm×m . Then LU factorization with partial pivoting guarantees
that ρ ≤ 2m−1 .
This bound is unacceptably high, and indicates that the algorithm (with pivoting)
can be unstable. The following (contrived) example illustrates this.
Example Take
A = [1 0 0 0 1; −1 1 0 0 1; −1 −1 1 0 1; −1 −1 −1 1 1; −1 −1 −1 −1 1].
Then LU factorization (partial pivoting performs no interchanges here since no entry below a pivot is larger in absolute value) produces the following sequence of matrices:
[1 0 0 0 1; 0 1 0 0 2; 0 −1 1 0 2; 0 −1 −1 1 2; 0 −1 −1 −1 2]
→ [1 0 0 0 1; 0 1 0 0 2; 0 0 1 0 4; 0 0 −1 1 4; 0 0 −1 −1 4]
→ [1 0 0 0 1; 0 1 0 0 2; 0 0 1 0 4; 0 0 0 1 8; 0 0 0 −1 8]
→ [1 0 0 0 1; 0 1 0 0 2; 0 0 1 0 4; 0 0 0 1 8; 0 0 0 0 16] = U,
so that the growth factor is ρ = 16 = 2^{m−1} with m = 5.
Remark BUT in practice it turns out that matrices with such large growth factors
almost never arise. For most practical cases a realistic bound on ρ seems to be √m.
This is still an active area of research and some more details can be found in the
[Trefethen/Bau] book.
In summary we can say that for most practical purposes LU factorization with
partial pivoting is considered to be a stable algorithm.
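A quick numerical illustration (our own sketch) of the growth factor for the worst-case matrix above versus a random matrix:

m = 5;
A = eye(m) - tril(ones(m), -1); A(:, m) = 1;   % the worst-case matrix above
[L, U, P] = lu(A);
rho_worst = max(abs(U(:)))/max(abs(A(:)))      % should equal 2^(m-1) = 16 here
B = randn(m);
[L, U, P] = lu(B);
rho_rand = max(abs(U(:)))/max(abs(B(:)))       % typically much smaller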
Remark Symmetric positive definite matrices are defined analogously with ∗ replaced
by T .
2. Any principal submatrix (i.e., the intersection of a set of rows and the corre-
sponding columns) of A is positive definite.
To derive the Cholesky factorization, consider first a Hermitian positive definite matrix whose (1,1) entry equals 1, partitioned as
A = [1 w∗; w K].
One step of Gaussian elimination (zeroing the first column) gives
A = [1 0^T; w I] [1 w∗; 0 K − ww∗].
The main idea of the Cholesky factorization algorithms is to take advantage of the fact
that A is Hermitian and perform operations symmetrically, i.e., also zero the first row.
Thus
A = [1 0^T; w I] [1 0^T; 0 K − ww∗] [1 w∗; 0 I].
Note that the three matrices are lower triangular, block-diagonal, and upper triangular
(in fact the adjoint of the lower triangular factor).
Now we continue iteratively. However, in general A will not have a 1 in the upper
left-hand corner. We will have to be able to deal with an arbitrary (albeit positive)
entry in the (1,1) position. Therefore we reconsider the first step with slightly more
general notation, i.e.,
A = [a11 w∗; w K]
so that
A = [√a11 0^T; w/√a11 I] [1 0^T; 0 K − ww∗/a11] [√a11 w∗/√a11; 0 I] = R1∗ A1 R1.
Now we can iterate on the inner matrix A1 . Note that the (1,1) entry of K − ww∗ /a11
has to be positive since A was positive definite and R1 is nonsingular. This guarantees
that A1 = R1−∗ AR1−1 is positive definite and therefore has positive diagonal entries.
The iteration yields
A1 = R2∗ A2 R2
so that
A = R1∗ R2∗ A2 R2 R1 ,
and eventually
A = R1∗ R2∗ · · · Rm∗ Rm · · · R2 R1 = R∗ R   with R = Rm · · · R2 R1,
the Cholesky factorization of A. Note that R is upper triangular and its diagonal is
positive (because of the square roots). Thus we have proved
Theorem 7.2 Every Hermitian positive definite matrix A has a unique Cholesky fac-
torization A = R∗ R with R an upper triangular matrix with positive diagonal entries.
Example Consider
4 2 4
A = 2 5 6 .
4 6 9
Cholesky factorization as explained above yields the sequence
A = [2 0 0; 1 1 0; 2 0 1] [1 0 0; 0 4 4; 0 4 5] [2 1 2; 0 1 0; 0 0 1]
  = [2 0 0; 1 1 0; 2 0 1] [1 0 0; 0 2 0; 0 2 1] [1 0 0; 0 1 0; 0 0 1] [1 0 0; 0 2 2; 0 0 1] [2 1 2; 0 1 0; 0 0 1],
so that R = [1 0 0; 0 2 2; 0 0 1] [2 1 2; 0 1 0; 0 0 1] = [2 1 2; 0 2 2; 0 0 1] and A = R∗R.
In general the method can be written as the following algorithm, which overwrites (the upper triangle of) R = A:
for k = 1 : m
    for j = k + 1 : m
        R(j, j : m) = R(j, j : m) − R(k, j : m) R(k, j)/R(k, k)
    end
    R(k, k : m) = R(k, k : m)/√(R(k, k))
end
Thus the count is O(m^3/3), which is half the amount needed for the LU factorization.
Another nice feature of Cholesky factorization is that it is always stable — even
without pivoting.
Remark The simplest (and cheapest) way to test whether a matrix is positive definite
is to run Cholesky factorization on it. If the algorithm succeeds, the matrix is positive definite; if not, it is not.
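In Matlab this test is available directly: with a second output argument, chol reports failure instead of raising an error (a small sketch of our own):

A = [4 2 4; 2 5 6; 4 6 9];      % the example matrix from above
[R, p] = chol(A);
if p == 0
    disp('positive definite'); disp(R)   % R is the Cholesky factor, A = R'*R
else
    disp('not positive definite')
end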
8 Eigenvalue Problems
8.1 Motivation and Definition
Matrices can be used to represent linear transformations. Their effects can be: rotation,
reflection, translation, scaling, permutation, etc., and combinations thereof. These
transformations can be rather complicated, and therefore we often want to decompose
a transformation into a few simple actions that we can better understand. Finding
singular values and associated singular vectors is one such approach. In engineering,
one often speaks of principal component analysis.
A more basic approach is to consider eigenvalues and eigenvectors.
Definition 8.1 Let A ∈ Cm×m. If for some pair (λ, x), λ ∈ C, x(≠ 0) ∈ Cm we have
Ax = λx,
then λ is called an eigenvalue of A and x an associated eigenvector.
Remark Eigenvectors specify the directions in which the matrix action is simple: any
vector parallel to an eigenvector is changed only in length and/or orientation by the
matrix A.
The set of all eigenvectors associated with a fixed eigenvalue λ, together with the zero vector, forms a subspace of Cm (the eigenspace of λ). Note that this vector space includes the zero vector — even though 0 is not an eigen-
vector.
The set of all eigenvalues of A is known as the spectrum of A, denoted by Λ(A).
The spectral radius of A is defined as ρ(A) = max{|λ| : λ ∈ Λ(A)}.
Note that λ is an eigenvalue of A if and only if there is a nonzero solution x of
(A − λI) x = 0.
This, in turn, is equivalent to det(A−λI) = 0. Therefore we define the characteristic
polynomial of A as
pA (z) = det(zI − A).
Then we get that λ is an eigenvalue of A if and only if pA(λ) = 0.
Example It is well known that even real matrices can have complex eigenvalues. For
instance,
A = [0 1; −1 0]
has characteristic polynomial
pA(z) = det([z −1; 1 z]) = z^2 + 1,
so that the eigenvalues of A are λ = ±i.
However, if A is symmetric (or Hermitian), then all its eigenvalues are real. More-
over, the eigenvectors to distinct eigenvalues are linearly independent, and eigenvectors
to distinct eigenvalues of a symmetric/Hermitian matrix are orthogonal.
Remark Since the eigenvalues of an m×m matrix are given by the roots of a degree-m
polynomial, it is clear that for problems with m > 4 we will have to use iterative (i.e.,
numerical) methods to find the eigenvalues.
Theorem 8.3 Any A ∈ Cm×m has m eigenvalues provided we count the algebraic mul-
tiplicities. In particular, if the roots of pA are simple, then A has m distinct eigenvalues.
Example Take
A = [1 0 0; 1 1 1; 0 0 1],
so that pA (z) = (z − 1)3 . Thus, λ = 1 is an eigenvalue (in fact, the only one) of A with
algebraic multiplicity 3. To determine its geometric multiplicity we need to find the
associated eigenvectors.
To this end we solve (A − λI)x = 0 for the special case λ = 1. This yields the
augmented matrix
[0 0 0 | 0; 1 0 1 | 0; 0 0 0 | 0]
so that x1 = −x3 or
x = (α, β, −α)^T = α (1, 0, −1)^T + β (0, 1, 0)^T,
and therefore the geometric multiplicity of λ = 1 is only 2.
The matrix B = I (the 3 × 3 identity matrix) has the same characteristic polynomial as before, i.e., pB(z) =
(z − 1)^3, and λ = 1 is again an eigenvalue with algebraic multiplicity 3. To determine its geometric
multiplicity we solve (B − I)x = 0, i.e., look at
[0 0 0 | 0; 0 0 0 | 0; 0 0 0 | 0].
Now there is no restriction on the components of x and we have
x = (α, β, γ)^T = α (1, 0, 0)^T + β (0, 1, 0)^T + γ (0, 0, 1)^T,
so that the geometric multiplicity of λ = 1 is 3 in this case.
At the other extreme, the matrix
C = [1 1 0; 0 1 1; 0 0 1],
also has the characteristic polynomial pC (z) = (z − 1)3 , so that λ = 1 has algebraic
multiplicity 3. However, now the solution of (C − I)x = 0 leads to
[0 1 0 | 0; 0 0 1 | 0; 0 0 0 | 0],
so that x2 = x3 = 0 and x = α (1, 0, 0)^T, i.e., the geometric multiplicity of λ = 1 is only 1.
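A quick Matlab check (our own sketch) of the algebraic versus geometric multiplicity for the matrix C:

C = [1 1 0; 0 1 1; 0 0 1];
lambda = eig(C)              % all three eigenvalues equal 1 (algebraic multiplicity 3)
geo = 3 - rank(C - eye(3))   % dimension of null(C - I): geometric multiplicity 1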
For any matrix A ∈ Cm×m with eigenvalues λ1, . . . , λm (counted according to their algebraic multiplicities) we have
1. det(A) = ∏_{j=1}^{m} λ_j,
2. tr(A) = ∑_{j=1}^{m} λ_j.
Proof Recall the definition of the characteristic polynomial, pA(z) = det(zI − A), so that
pA(0) = det(−A) = (−1)^m det(A).
On the other hand, we also have
pA(z) = ∏_{j=1}^{m} (z − λ_j),
which implies
pA(0) = ∏_{j=1}^{m} (−λ_j) = (−1)^m ∏_{j=1}^{m} λ_j.
Comparing the two expressions for pA(0) establishes the determinant formula; the trace formula follows analogously by comparing the coefficients of z^{m−1}.
8.4 Similarity and Diagonalization
Consider two matrices A, B ∈ Cm×m. A and B are called similar if
B = X^{−1} A X
for some nonsingular matrix X ∈ Cm×m.
Theorem 8.7 Similar matrices have the same characteristic polynomial, eigenvalues,
algebraic and geometric multiplicities. The eigenvectors, however, are in general dif-
ferent.
A fundamental theorem states that A ∈ Cm×m is nondefective (i.e., the algebraic and geometric multiplicities of all its eigenvalues coincide) if and only if it has an eigenvalue decomposition
A = XΛX^{−1},
where the columns of X are m linearly independent eigenvectors of A and Λ = diag(λ1, . . . , λm).
Remark Due to this theorem, nondefective matrices are diagonalizable. Also note
that nondefective matrices have linearly independent eigenvectors.
The eigenvalue decomposition can be used to change basis when solving a linear system:
Ax = b  ⇐⇒  XΛX^{−1}x = b  ⇐⇒  Λ(X^{−1}x) = X^{−1}b,  i.e.,  Λx̂ = b̂  with x̂ = X^{−1}x and b̂ = X^{−1}b.
This shows that x̂ and b̂ correspond to x and b as viewed in the basis of eigenvectors
(i.e., columns of X).
If the eigenvectors of A are not only linearly independent, but also orthogonal, then
we can factor A as
A = QΛQ∗
with a unitary matrix Q of eigenvectors. Thus, A is called unitarily diagonalizable.
Matrices satisfying
AA∗ = A∗A
are called normal, and we have that A is unitarily diagonalizable if and only if it is normal.
8.5 Schur Factorization
The matrix factorization that is most useful for numerical purposes is the Schur factorization: every A ∈ Cm×m can be written as
A = QT Q∗,
with a unitary matrix Q and an upper triangular matrix T such that diag(T) contains the
eigenvalues of A.
Remark Note the similarity of this result to the singular value decomposition. The
Schur factorization is quite general in that it exists for every, albeit only square, matrix.
Also, the matrix T contains the eigenvalues (instead of the singular values), and it is
upper triangular (i.e., not quite as nice as diagonal). On the other hand, only one
unitary matrix is used.
Remark By using nonunitary matrices for the similarity transform one can obtain the
Jordan normal form of a matrix in which T is bidiagonal.
Remark Both the Schur factorization and the Jordan form are considered not appro-
priate for numerical/practical computations because of the possibility of complex terms
occurring in the matrix factors (even for real A). The SVD is preferred since all of its
factors are real.
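As a brief aside (our own illustration, not part of the notes), Matlab's schur function computes this factorization; the 'complex' option requests the upper triangular complex Schur form:

A = randn(5);
[Q, T] = schur(A, 'complex');    % A = Q*T*Q' with Q unitary, T upper triangular
norm(A - Q*T*Q')                 % residual on the order of machine precision
disp([eig(A) diag(T)])           % both columns list the eigenvalues (order may differ)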
To prove the existence of the Schur factorization one proceeds by induction on the dimension m. Let λ be an eigenvalue of A with associated eigenvector x normalized so that ‖x‖2 = 1, and let U = [x Û] be a unitary matrix whose first column is x. Then
x∗Ax = λ x∗x = λ
since x∗x = ‖x‖2^2 = 1.
Similarly,
Û ∗ Ax = Û ∗ (λx) = λÛ ∗ x = 0
since Û ∗ x = 0 because U is unitary.
Therefore, U ∗ AU simplifies to
U∗AU = [λ b∗; 0 C],
where we have used the abbreviations b∗ = x∗ AÛ and C = Û ∗ AÛ ∈ C(m−1)×(m−1) .
Now, by the induction hypothesis, C has a Schur factorization
C = V T̂ V ∗
with a unitary matrix V and an upper triangular matrix T̂. Setting
Q = U [1 0^T; 0 V]
(also unitary) we obtain
Q∗AQ = [1 0^T; 0 V∗] (U∗AU) [1 0^T; 0 V] = [λ b∗V; 0 V∗CV].
This last block matrix, however, is the desired upper triangular matrix T with the
eigenvalues of A on its diagonal since V∗CV = T̂ and the induction hypothesis ensures
T̂ already has the desired properties.
Chapter 9
Applications
We begin by recalling the scattered data interpolation problem: given data values fi = f(xi) at scattered points xi ∈ IR^s, i = 1, . . . , N, one seeks an interpolant of the form
Pf(x) = ∑_{j=1}^{N} cj ϕ(‖x − xj‖)
such that
Pf(xi) = fi,   i = 1, . . . , N.
The solution of this problem leads to a linear system Ac = f with the entries of A
given by
Aij = ϕ(kxi − xj k), i, j = 1, . . . , N. (9.2)
As discussed earlier, the matrix A is non-singular for a large class of radial functions
including (inverse) multiquadrics, Gaussians, and the strictly positive definite com-
pactly supported functions of Wendland, Wu, or Buhmann. In the case of strictly
conditionally positive definite functions such as thin plate splines the problem needs to
be augmented by polynomials.
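As an illustration (our own sketch, not from the text), the interpolation system (9.2) is easily assembled and solved; here in one space dimension with multiquadrics ϕ(r) = sqrt(r^2 + α^2), where α is a shape parameter:

alpha = 1;  phi = @(r) sqrt(r.^2 + alpha^2);   % multiquadric
x = linspace(0, 1, 15)';                        % data sites x_1,...,x_N
f = exp(x).*sin(2*pi*x);                        % data values f_i
A = phi(abs(x - x'));                           % A(i,j) = phi(|x_i - x_j|), implicit expansion
c = A\f;                                        % interpolation coefficients
xe = linspace(0, 1, 200)';
Pf = phi(abs(xe - x'))*c;                       % evaluate the interpolant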
We now switch to the collocation solution of partial differential equations. Assume
we are given a domain Ω ⊂ IRs , and a linear elliptic partial differential equation of the
form
L[u](x) = f (x), x in Ω, (9.3)
with (for simplicity of description) Dirichlet boundary conditions
u(x) = g(x),   x on ∂Ω.   (9.4)
For Kansa’s collocation method we then choose to represent u by a radial basis function
expansion analogous to that used for scattered data interpolation, i.e.,
N
X
u(x) = cj ϕ(kx − ξ j k), (9.5)
j=1
where we now introduce the points ξ 1 , . . . , ξ N as centers for the radial basis func-
tions. They will usually be selected to coincide with the collocation points X =
{x1 , . . . , xN } ⊂ Ω. However, the discussion below is clearer if we formally distin-
guish between centers ξ j and collocation points xi . We assume the simplest possible
setting here, i.e., no polynomial terms are added to the expansion (9.5). The collocation
matrix which arises when matching the differential equation (9.3) and the boundary
conditions (9.4) at the collocation points X will be of the form
A = [ Φ ; L[Φ] ],   (9.6)
where the upper block collects the values ϕ(‖xi − ξj‖) at the boundary collocation points and the lower block the values L[ϕ](‖x − ξj‖) evaluated at x = xi for the interior collocation points.
Here we have identified (as we will do throughout this section) the set of centers with
the set of collocation points. The set X is split into a set I of interior points, and B
of boundary points. The problem is well-posed if the linear system Ac = y, with y
a vector consisting of entries g(xi ), xi ∈ B, followed by f (xi ), xi ∈ I, has a unique
solution.
We note that a change in the boundary conditions (9.4) is as simple as changing a
few rows in the matrix A in (9.6) as well as on the right-hand side y. We also point out
that Kansa only proposed to use multiquadrics in (9.5), and for that method suggested
the use of varying parameters αj , j = 1, . . . , N , which improves the accuracy of the
method when compared to using only one constant value of α (see [340]).
A problem with Kansa’s method is that – for a constant multiquadric shape pa-
rameter α – the matrix A may for certain configurations of the centers ξ j be singular.
Originally, Kansa assumed that the non-singularity results for interpolation matrices
would carry over to the PDE case. However, as the numerical experiments of Hon and
Schaback [304] show, this is not so. This is to be expected since the matrix for the
collocation problem is composed of rows which are built from different functions (which
– depending on the differential operator L – might not even be radial). The results for
the non-singularity of interpolation matrices, however, are based on the fact that A is
generated by a single function ϕ.
An indication of the success of Kansa’s method (which has not yet been shown to be
well-posed) are the early papers [165, 166, 262, 341, 467] and many more since. In his
paper [340] Kansa describes three sets of experiments using his method and comments
on the superior performance of multiquadrics in terms of computational complexity
and accuracy when compared to finite difference methods. Therefore, it remains an
interesting open question whether the well-posedness of Kansa’s method can be estab-
lished at least for certain configurations of centers. Moreover, Kansa’s suggestion to use
variable shape parameters αj in order to improve accuracy and stability of the problem
has very little theoretical support. Except for one paper by Bozzini, Lenarduzzi and
Schaback [68] (which addresses only the interpolation setting) this problem has not
been addressed in the literature.
Before we describe an alternate approach which does ensure well-posedness of the
resulting collocation matrix and which is based on basis functions suitable for scattered
Hermite interpolation we would like to point out that in [467] the authors suggest how
Kansa’s method can be applied to other types of partial differential equation prob-
lems such as non-linear elliptic PDEs, systems of elliptic PDEs, and time-dependent
parabolic or hyperbolic PDEs.
For generalized Hermite interpolation we are given a set of linearly independent linear functionals L1, . . . , LN and corresponding data values Li f, and we seek an interpolant of the form
Pf(x) = ∑_{j=1}^{N} cj Lξ_j ϕ(‖x − ξ‖)   (9.7)
satisfying
Li Pf = Li f,   i = 1, . . . , N.
We have used Lξ to indicate that the functional L acts on ϕ viewed as a function of the
second argument ξ. The linear system Ac = Lf which arises in this case has matrix
entries
Aij = Li Lξj ϕ, i, j = 1, . . . , N. (9.8)
In the references mentioned at the beginning of this subsection it is shown that A is
non-singular for the same classes of ϕ as given for scattered data interpolation in our
earlier chapters.
Remark: It should be pointed out that this formulation of Hermite interpolation is
very general and goes considerably beyond the standard notion of Hermite interpolation
(which refers to interpolation of successive derivative values). Here linear functionals of any kind are allowed as long as the set L is linearly independent.
We illustrate this approach with a simple example using derivative functionals.
Example: Let data {xi, f(xi)}_{i=1}^{n} and {xi, ∂f/∂x(xi)}_{i=n+1}^{N} with x = (x, y) ∈ IR^2 be
given. Then
Pf(x) = ∑_{j=1}^{n} cj ϕ(‖x − xj‖) − ∑_{j=n+1}^{N} cj (∂ϕ/∂x)(‖x − xj‖),
and
A = [ Φ  −Φx ; Φx  −Φxx ],
with blocks Φ, Φx, and Φxx collecting the values of ϕ and of its first and second x-derivatives at the arguments ‖xi − xj‖.
For the symmetric collocation solution of (9.3) the expansion of u (referred to as (9.9) below) uses both kinds of basis functions: plain translates ϕ(‖x − ξj‖) for the #B centers on the boundary, and Lξ[ϕ](‖x − ξj‖) for the interior centers,
where #B denotes the number of nodes on the boundary of Ω, and Lξ is the differential
operator used in (9.3), but acting on ϕ viewed as a function of the second argument,
i.e., L[ϕ] is equal to Lξ[ϕ] up to a possible difference in sign. Note the difference in
notation. In (9.7) L is a linear functional, and in (9.9) a differential operator.
This expansion for u leads to a collocation matrix A which is of the form
A = [ Lξ[Φ]  Φ ; L[Lξ[Φ]]  L[Φ] ],   (9.10)
The matrix (9.10) is of the same type as the scattered Hermite interpolation matri-
ces (9.8), and therefore non-singular as long as ϕ is chosen appropriately. Thus, viewed
using the new expansion (9.9) for u, the collocation approach is certainly well-posed.
grid    α     ρK            ρH            condK(A)       condH(A)
5×3     1.0   5.248447e-02  2.004420e-01  2.599606e+03   1.627432e+03
8×4     1.0   1.126843e-02  1.124710e-02  2.325758e+05   8.167527e+04
10×6    1.0   5.809472e-03  6.481697e-03  4.321740e+07   1.808001e+07
16×8    1.0   1.347863e-03  1.720007e-03  8.685785e+10   1.496772e+10
20×12   1.0   5.053090e-04  5.973294e-04  5.161540e+15   1.234633e+15
Table 9.1: Error progression for increasingly denser data sets (Ex. 1, fixed α).
Another point in favor of the Hermite based approach is that the matrix (9.10) is (anti)-
symmetric as opposed to the completely unstructured matrix (9.6) of the same size.
This property should be of value when trying to devise an efficient implementation of
the collocation method. Also note that although A consists of four blocks now, it still
is of the same size, namely N × N , as the collocation matrix (9.6) obtained for Kansa’s
approach.
Remark: One attempt to obtain an efficient implementation of the Hermite based
collocation method is a version of the greedy algorithm described in Section 8.5.1 by
Hon, Schaback and Zhou [305].
For this test problem we selected various uniform grids as listed in Tables 9.1 and
9.2 on [0, π] × [0, 1]. Tables 9.1 and 9.2 show the values of the multiquadric parameter
α, the relative maximum errors ρ computed on a fine grid of 60 × 60 points, and
the approximate condition numbers of A. The range of u on the evaluation grid is
approximately [−0.021023, 0.0]. The “optimal” value for α was determined by trial
and error. The subscripts K and H refer to Kansa’s and the Hermite based method,
respectively.
Figure 9.1 shows the distribution of the errors |u(x) − s(x)| on the evaluation grid
for the two methods on the 8 × 4 grid used in Table 9.2. The scale used for the shading
is displayed on the right.
Example 2: Consider the Poisson equation
∆u(x, y) = sin x − sin^3 x,   x ∈ (0, π/2), y ∈ (0, 2),
grid    αK    αH    ρK            ρH            condK(A)       condH(A)
5×3     1.18  1.39  1.627193e-02  4.180428e-02  5.592238e+03   5.231279e+03
8×4     1.04  1.11  1.103747e-02  1.062891e-02  3.175078e+05   1.735482e+05
10×6    4.80  3.84  2.739293e-03  3.451799e-03  1.193586e+18   1.414927e+15
16×8    3.12  3.12  2.707006e-04  2.082886e-04  1.209487e+19   6.609375e+18
20×12   2.00  2.30  3.894511e-05  1.273363e-05  3.739554e+19   6.750955e+18
Table 9.2: Error progression for increasingly denser data sets (Ex. 1, "optimal" α).
Figure 9.1: Error for Kansa's (top), Hermite (bottom) solution for Ex. 1 on 8 × 4 grid (color scale from 0.0 to 2.328866e-04).
Table 9.3: Error progression for increasingly denser data sets (Ex.2, “optimal” α).
Note that for the differential operator L used here, the collocation matrices resulting from
the Hermite approach are symmetric. Therefore the amount of computation can be
reduced considerably, which is important for larger problems. Kansa's method has the
advantage of being simpler to implement (since fewer derivatives of the basis functions
are required).
Remarks:
1. Both of the methods described in this section have been implemented for many
different applications. A thorough comparison of the two methods was reported
in [520].
2. Since the methods described above were both originally used with globally sup-
ported basis functions, the same concerns as for interpolation problems about
stability and numerical efficiency apply. Two recent papers by Ling and Kansa
[395, 396] address these issues. In particular, they develop a preconditioner in the
spirit of the one described in Section 8.3.3, and describe their experience with a
domain decomposition algorithm.
3. A convergence analysis for the symmetric method was established by Franke and
Schaback [229, 230]. The error estimates established in [229, 230] require the solu-
tion of the PDE to be very smooth. Therefore, one should be able to use meshfree
radial basis function collocation techniques especially well for (high-dimensional)
PDE problems with smooth solutions on possibly irregular domains. Due to
the known counterexamples [304] for the non-symmetric method, a convergence
analysis is still lacking for that method.
4. Recently, Miranda [462] has shown that Kansa’s method will be well-posed if it
is combined with so-called R-functions. This idea was also used by Höllig and
his co-workers in their development of WEB-splines (see, e.g., [299]).
5. Kansa’s method has the advantage of being easily adapted for nonlinear elliptic
PDEs (see, e.g., [201, 467]).
Some numerical evidence for convergence rates of the symmetric collocation method
is given by the examples above, and in the papers [336, 520]. The example above
shows very high convergence rates (as predicted by the estimate in [230]) when using
multiquadrics on a problem which has a smooth solution. In [336] thin plate splines
as well as Wendland’s C 4 compactly supported RBF ϕ3,2 were tested. The results
for thin plate splines are in good agreement with the theory. However, the numerical
experiments using the Wendland function show O(h3 ) convergence instead of O(h) as
predicted by the lower bounds of [230] combined with the error bound for Wendland
functions. This could suggest that a sharper error estimate may be possible when using
compactly supported RBFs.
Other recent papers investigating various aspects of radial basis function collocation
are, e.g., [135] by Cheng, Golberg, Kansa and Zammito, [215] by Fedoseyev, Friedman
and Kansa, [345] by Kansa and Hon, [360] by Larsson and Fornberg, [365] by Leitão,
and [424] by Mai-Duy and Tran-Cong.
For example, in the paper [215] it is suggested that the collocation points on the
boundary are also used to satisfy the PDE. However, this adds a set of extra equations
to the problem, and therefore one should also use some additional basis functions in
the expansion (9.5). It is suggested in [215] that these centers lie outside the domain Ω.
The motivation for this modification is the well-known fact that both for interpolation
and collocation with radial basis functions the error is largest near the boundary. In
various numerical experiments this strategy is shown to improve the accuracy of Kansa’s
basic non-symmetric method. It should be noted that there is once more no theoretical
foundation for this method.
Larsson and Fornberg [360] compare Kansa’s basic collocation method, the modi-
fication just described, and the Hermite-based symmetric approach mentioned earlier.
Using multiquadric basis functions in a standard implementation they conclude that
the symmetric method is the most accurate, followed by the non-symmetric method
with boundary collocation. The reason for this is the better conditioning of the system
for the symmetric method. Larsson and Fornberg also discuss an implementation of
the three methods using the complex Contour-Padé integration method mentioned in
Section 8.1. With this technique stability problems are overcome, and it turns out that
both the symmetric and the non-symmetric method perform with comparable accu-
racy. Boundary collocation of the PDE yields an improvement only if these conditions
are used as additional equations, i.e., by increasing the problem size. It should also
be noted that often the most accurate results were achieved with values of the multi-
quadric shape parameter α which would lead to severe ill-conditioning using a standard
implementation, and therefore these results could be achieved only using the complex
integration method. Moreover, in [360] radial basis function collocation is deemed to
be far superior in accuracy to standard second-order finite differences or a standard
Fourier-Chebyshev pseudospectral method.
Leitão [365] applies the symmetric collocation method to a fourth-order Kirchhoff
plate bending problem, and emphasizes the simplicity of the implementation of the ra-
dial basis function collocation method. And, finally, Mai-Duy and Tran-Cong [424] sug-
gest a collocation method for which the basis functions are taken to be anti-derivatives
of the usual radial basis functions.
All of the experiments just mentioned were conducted without using a multilevel
approach. In particular, in order to achieve convergence with the Wendland functions
the support had to be chosen so large that only problems with a very modest number of
centers could be handled (see [336]). So, as for scattered data interpolation, a multilevel
approach is needed to obtain computational efficiency.
We would like to end the discussion of the collocation approach by looking at a
multilevel implementation with compactly supported functions.
The most significant difference between the use of compactly supported RBFs for
scattered data interpolation and for the numerical solution of PDEs by collocation
appears when we turn to the multilevel approach. Recall that the use of the multilevel
method is motivated by our desire to obtain a convergent scheme while at the same
time keeping the bandwidth fixed, and thus the computational complexity at O(N ).
Here is an adaptation of the basic multilevel algorithm of Section 8.2 to the case of
a collocation solution of the problem Lu = f :
mesh `2 -error rate
5 3.637579e-04
9 1.892007e-05 4.26
17 3.055339e-06 2.63
33 2.111403e-06 0.53
65 2.062621e-06 0.03
129 2.066411e-06 0.00
257 2.070168e-06 0.00
513 2.072171e-06 0.00
1025 2.073182e-06 0.00
2049 2.073688e-06 0.00
Table 9.4: Multilevel collocation algorithm for symmetric collocation with constant
bandwidth.
u0 = 0.
For k from 1 to K do
    Find uk ∈ SXk such that L[uk](xi) = f(xi) − L[uk−1](xi) at the interior points and uk(xi) = g(xi) − uk−1(xi) at the boundary points of Xk.
    Update uk ← uk−1 + uk.
end
Here SXk is the space of functions used for expansion (9.5) or (9.9) on grid Xk .
Whereas we noted above that there is strong numerical (and limited theoretical) ev-
idence that the basic multilevel interpolation algorithm converges (at least linearly),
the following example shows that we cannot in general expect the multilevel collocation
algorithm to converge at all.
Example: Consider the boundary-value problem
with solution u(x) = sin πx. As computational grids Xk we take 2k+1 + 1 uniformly
spaced points on [0, 1] as indicated in Table 9.4. We use the C 6 compactly supported
Wendland function ϕ3,3; the conjugate gradient method with Jacobi preconditioning
is used to solve the resulting linear systems.
to be so large that the resulting matrix is a dense matrix. During subsequent iterations
the support size is halved (as is the meshsize) in order to maintain a constant bandwidth
of 17 (i.e., work in the stationary setting). Even though the first three iterations seem
to indicate significant rates of convergence, the convergence behavior quickly changes,
and by the fifth iteration there is virtually no improvement of the error (the fact that
the errors actually increase is due to the fact that they are computed on increasingly
finer grids).
We note that the same behavior can be observed if the non-symmetric approach
is used instead. However, then the convergence ceases at a slightly later stage. We
also note that the same phenomenon was observed by Wendland in the context of a
multilevel Galerkin algorithm for compactly supported RBFs (see [631] as well as our
discussion in the next section).
Remarks:
1. It has been suggested that the convergence behavior of the multilevel colloca-
tion algorithm may be linked to the phenomenon of approximate approximation.
However, so far no connection has been established.
As a model problem for the radial basis function Galerkin method consider the Helmholtz-type equation with natural boundary conditions
−∆u + u = f   in Ω,
∂u/∂ν = 0   on ∂Ω,
where ν denotes the outer unit normal vector. The classical Galerkin formulation then
leads to the problem of finding a function u ∈ H1(Ω) such that
a(u, v) = (f, v)_{L2(Ω)}   for all v ∈ H1(Ω),
where (f, v)_{L2(Ω)} is the usual L2 inner product, and for the Helmholtz equation the
bilinear form a is given by
a(u, v) = ∫_Ω (∇u · ∇v + uv) dx.
For the numerical method one chooses the finite-dimensional trial space
SX = span{ϕ(‖ · −xj‖2), xj ∈ X }
spanned by radial basis function translates at a set of centers X ⊂ Ω.
This results in a square system of linear equations for the coefficients of uX ∈ SX
determined by
a(uX , v) = (f, v)L2 (Ω) for all v ∈ SX .
For more on the Galerkin method (in the context of finite elements) see, e.g., [69, 70].
It was shown in [630] that for those RBFs (globally as well as locally supported) whose
Fourier transform decays like (1 + ‖ · ‖2)^{−2β} the following convergence estimate holds:
u0 = 0.
For k from 1 to K do
Find uk ∈ SXk such that a(uk , v) = (f, v) − a(uk−1 , v) for all v ∈ SXk .
Update uk ← uk−1 + uk .
end
While this basic multilevel Galerkin algorithm suffers from the same stagnation as the multilevel collocation algorithm above, the following nested version,
Set v1 = 0.
For j from 1 to J do
    Set u0 = vj.
Apply the k-loop of the previous algorithm and denote the result with û(vj ).
Set vj+1 = û(vj ).
end
does converge. In fact, using this algorithm Wendland proves, and also observes
numerically, convergence which is at least linear (see Theorem 3 and Tab. 2 in [631]).
The important difference between the two multilevel Galerkin algorithms is the added
outer iteration in the nested version which is a well-known idea from linear algebra
introduced in 1937 by Kaczmarz [337]. A proof of the linear convergence for general
Hilbert space projection methods coupled with Kaczmarz iteration can be found in
[585]. This alternate projection idea is also the fundamental ingredient in the conver-
gence proof of the domain decomposition method of Beatson, Light and Billings [42]
described in the previous chapter. We mention here that in the multigrid literature
Kaczmarz’ method is frequently used as a smoother (see e.g. [435]).
Remarks:
1. Aside from difficulties with Dirichlet (or sometimes called essential) boundary
conditions, Wendland reports that the numerical evaluation of the weak-form in-
tegrals presents a major problem for the radial basis function Galerkin approach.
Both of these difficulties are also well-known in many other flavors of meshfree
weak-form methods. An especially promising solution to the issue of Dirichlet
boundary conditions seems to be the use of R-functions as proposed by Höllig
and Reif in the context of WEB-splines (see, e.g., [299] or our earlier discussion
in the context of collocation methods).
2. In a recent paper by Schaback [559] the author presents a framework for the
radial basis function solution of problems both in the strong (collocation) and
weak (Galerkin) form.
Many other meshfree methods for the solution of partial differential equations in
the weak form appear in the (mostly engineering) literature. These methods come
under such names as smoothed particle hydrodynamics (SPH) (e.g., [463]), reproducing
kernel particle method (RKPM) (see, e.g., [380, 399]), point interpolation method
(PIM) (see, [397]), element free Galerkin method (EFG) (see, e.g., [49]), meshless local
Petrov-Galerkin method (MLPG) [14], h-p-cloud method [164], partition of unity finite
element method (PUFEM) [16, 443], or generalized finite element method (GFEM)
[15]. Most of these methods are based on the moving least squares approximation
method discussed in Chapter 7.
There are two recent books by Atluri [12] and Liu [397] summarizing many of
these methods. However, these books focus mostly on a survey of the various meth-
ods and related computational and implementation issues with little emphasis on the
mathematical foundation of these methods. The recent survey paper [15] by Babuška,
Banerjee and Osborn, fills a large part of this void.
10 Gaussian Quadrature
So far we have encountered the Newton-Cotes formulas
∫_a^b f(x) dx ≈ ∑_{i=0}^{n} Ai f(xi),   Ai = ∫_a^b ℓi(x) dx,
which are based on polynomial interpolation at equally spaced nodes. For Gaussian quadrature the nodes x0, . . . , xn are instead chosen as the roots of a polynomial of degree n + 1 that is orthogonal on [a, b] with respect to a weight function w (cf. (95)). Then the rule
∫_a^b f(x)w(x) dx ≈ ∑_{i=0}^{n} Ai f(xi)   (96)
with
Ai = ∫_a^b ℓi(x)w(x) dx,   i = 0, . . . , n,   (97)
is exact for all polynomials of degree at most 2n + 1. Here `i , i = 0, . . . , n, are the usual
Lagrange interpolating polynomials of Chapter 1.
Proof Let f be a polynomial of degree at most 2n + 1 and divide it by the orthogonal polynomial p of degree n + 1 whose roots are the nodes x0, . . . , xn, i.e., write f(x) = q(x)p(x) + r(x) with polynomials q and r of degree at most n. Since p(xi) = 0 we have
f(xi) = r(xi),   i = 0, . . . , n.
Now
∫_a^b f(x)w(x) dx = ∫_a^b [q(x)p(x) + r(x)] w(x) dx
                 = ∫_a^b q(x)p(x)w(x) dx + ∫_a^b r(x)w(x) dx,
where the first integral on the right-hand side is zero by the orthogonality assumption
(95).
We know that (for any set of nodes xi ) (96) is exact for polynomials of degree at
most n. Therefore,
∫_a^b f(x)w(x) dx = ∫_a^b r(x)w(x) dx = ∑_{i=0}^{n} Ai r(xi)   (by (96)).
However, since our special choice of nodes implies f (xi ) = r(xi ) we have
∫_a^b f(x)w(x) dx = ∑_{i=0}^{n} Ai f(xi),
which establishes the claimed exactness for polynomials of degree at most 2n + 1.
Remark Usually, the classical orthogonal polynomials as discussed in the Maple work-
sheet 478578 GaussQuadrature.mws are used to construct Gaussian quadrature rules
with the appropriate weight function suggested by the integrand at hand.
Example If [a, b] = [−1, 1] and w(x) = 1 we use Legendre polynomials (since they
are orthogonal with respect to this interval and weight function). The corresponding
two-point formula (n = 1 — which is exact for cubic polynomials) is
∫_{−1}^{1} f(x) dx ≈ A0 f(x0) + A1 f(x1),
where the nodes are the roots of the degree-2 Legendre polynomial P2(x) = (3x^2 − 1)/2, i.e., x0 = −1/√3 and x1 = 1/√3. Requiring exactness for f(x) = 1 and f(x) = x yields
A0 + A1 = ∫_{−1}^{1} 1 dx = 2   and   A0 x0 + A1 x1 = ∫_{−1}^{1} x dx = 0.
These formulas ensure (for arbitrary nodes) exactness for constants and linear polynomials, respectively. The preceding equations are equivalent to the 2 × 2 linear system
[1 1; x0 x1] [A0; A1] = [2; 0],
which implies A0 = A1 = 1. Alternatively, we could have applied (97) directly to
compute the coefficients A0 and A1. Therefore,
∫_{−1}^{1} f(x) dx ≈ f(−√3/3) + f(√3/3).
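A quick Matlab verification (our own sketch) of the two-point Gauss-Legendre rule derived above:

gauss2 = @(f) f(-sqrt(3)/3) + f(sqrt(3)/3);     % A0 = A1 = 1
f = @(x) x.^3 + 2*x.^2 - x + 1;                 % a cubic
[gauss2(f) integral(f, -1, 1)]                  % identical: the rule is exact for cubics
g = @(x) exp(x);
[gauss2(g) integral(g, -1, 1)]                  % only approximate for general integrands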
Remark 1. There are tables for the values of xi and Ai for various choices of
classical orthogonal polynomials q of modest degree. Many software packages
also have functions implementing this.
2. If the integral is defined over the interval [a, b] instead of [−1, 1], then a simple
transformation
x = (b + a + t(b − a))/2,   −1 ≤ t ≤ 1,
can be used.
3. Note that without the theorem on Gaussian quadrature we would have to solve a
4 × 4 system of nonlinear equations with unknowns x0 , x1 , A0 and A1 (enforcing
exactness for cubic polynomials) to obtain the two-point formula of the example
above (see the Maple worksheet 478578 GaussQuadrature.mws).
11 Pseudospectral Methods for Two-Point BVPs
Another class of very accurate numerical methods for BVPs (as well as many time-
dependent PDEs) are the so-called spectral or pseudospectral methods. The basic idea
is similar to the collocation method described above. However, now we use other
basis functions. The following discussion closely follows the first few chapters of Nick
Trefethen's book "Spectral Methods in MATLAB".
Before we go into any details we present an example.
Example Consider the two-point boundary value problem
y''(t) = e^{4t},   t ∈ (−1, 1),
with boundary conditions y(−1) = y(1) = 0. The analytic solution of this problem is
given by
y(t) = (e^{4t} − t sinh(4) − cosh(4))/16.
As with all the other numerical methods, we require some sort of discretization. For
pseudospectral methods we do the same as for finite difference methods and the RBF
collocation methods, i.e., we introduce a set of grid points t1 , t2 , . . . , tN in the interval
of interest.
The function values y_j = y(t_j) are collected in a vector y, and approximate derivative values at the grid points are obtained by multiplying with a differentiation matrix D, i.e.,
y' = Dy.
What does such a differentiation matrix look like? Let’s assume that the grid points
are uniformly spaced with spacing tj+1 −tj = h for all j, and that the vector of function
values y comes from a periodic function so that we can add the two auxiliary values
y 0 = y N and y N +1 = y 1 .
In order to approximate the derivative y 0 (tj ) we start with another look at the finite
difference approach. We use the symmetric (second-order) finite difference approxima-
tion
y'(t_j) ≈ y'_j = (y_{j+1} − y_{j−1})/(2h),   j = 1, . . . , N.
Note that this formula also holds at both ends (j = 1 and j = N ) since we are assuming
periodicity of the data.
These equations can be collected in matrix-vector form:
y 0 = Dy
with y and y' as above and
D = (1/h) [ 0  1/2  · · ·  −1/2 ;  −1/2  0  1/2  · · · ;  · · · ;  · · ·  −1/2  0  1/2 ;  1/2  · · ·  −1/2  0 ],
i.e., (1/h) times the circulant matrix with zero diagonal, 1/2 on the superdiagonal, −1/2 on the subdiagonal, and corner entries −1/2 (top right) and 1/2 (bottom left) coming from the periodicity.
Remark This matrix has a very special structure. It is both Toeplitz and circulant.
In a Toeplitz matrix the entries in each diagonal are constant, while a circulant matrix
is generated by a single row vector whose entries are shifted by one (in a circulant
manner) each time a new row is generated. As we will see later, the fast Fourier
transform (FFT) can deal with such matrices in a particularly efficient manner.
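Here is a small Matlab sketch (our own) that builds this second-order periodic differentiation matrix with toeplitz and applies it to a smooth periodic function:

N = 24; h = 2*pi/N; t = h*(1:N)';
col = zeros(N,1); col(2) = -1/(2*h); col(N) = 1/(2*h);   % first column of D
row = zeros(1,N); row(2) =  1/(2*h); row(N) = -1/(2*h);  % first row of D
D = toeplitz(col, row);                                  % circulant Toeplitz matrix
y = exp(sin(t));
err = norm(D*y - cos(t).*exp(sin(t)), inf)               % second-order accurate (err ~ h^2)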
A similar construction based on the fourth-order centered difference approximation
y'(t_j) ≈ (−y_{j+2} + 8 y_{j+1} − 8 y_{j−1} + y_{j−2})/(12h)
yields another (more accurate) differentiation matrix. Note that this matrix is again a circulant Toeplitz matrix (since the data is assumed to
be periodic). However, now there are 5 diagonals, instead of the 3 for the second-order
example above.
It should now be clear that — in order to increase the accuracy of the finite-
difference derivative approximation to spectral order — we want to keep on increasing
the polynomial degree so that more and more grid points are being used, and the
differentiation matrix becomes a dense matrix. Thus, we can think of pseudospectral
methods as finite difference methods based on global polynomial interpolants instead
of local ones.
For an infinite interval with infinitely many grid points spaced a distance h apart
one can show that the resulting differentiation matrix is given by the circulant Toeplitz
matrix whose typical column reads (from top to bottom)
D(:, j) = (1/h) (. . . , 1/3, −1/2, 1, 0, −1, 1/2, −1/3, . . .)^T,   (98)
with the zero entry on the diagonal; equivalently, D_{ij} = (−1)^{i−j}/(h(i − j)) for i ≠ j and D_{ii} = 0.
For a finite (even) N and periodic data we will show later that the differentiation
matrix is given by
DN(:, j) = (. . . , (1/2) cot(3h/2), −(1/2) cot(2h/2), (1/2) cot(1h/2), 0, −(1/2) cot(1h/2), (1/2) cot(2h/2), −(1/2) cot(3h/2), . . .)^T,   (99)
again with the zero entry on the diagonal; equivalently, (DN)_{ij} = (1/2)(−1)^{i−j} cot((i − j)h/2) for i ≠ j.
Example If N = 4, then we have
D4 = [ 0                (1/2)cot(1h/2)   (1/2)cot(2h/2)   −(1/2)cot(1h/2) ;
      −(1/2)cot(1h/2)   0                (1/2)cot(1h/2)    (1/2)cot(2h/2) ;
       (1/2)cot(2h/2)  −(1/2)cot(1h/2)   0                 (1/2)cot(1h/2) ;
       (1/2)cot(1h/2)   (1/2)cot(2h/2)  −(1/2)cot(1h/2)    0              ].
The Matlab script PSDemo.m illustrates the spectral convergence obtained with the
matrix DN for various values of N . The output should be compared with that of the
previous example FD4Demo.m.
First we recall the definition of the Fourier transform ŷ of a function y that is
square-integrable on R:
ŷ(ω) = ∫_{−∞}^{∞} e^{−iωt} y(t) dt,   ω ∈ R.   (100)
Conversely, the inverse Fourier transform lets us reconstruct y from its Fourier transform ŷ:
y(t) = (1/(2π)) ∫_{−∞}^{∞} e^{iωt} ŷ(ω) dω,   t ∈ R.   (101)
If we restrict our attention to a discrete (unbounded) physical space, i.e., the func-
tion y is now given by the (infinite) vector y = [. . . , y −1 , y 0 , y 1 , . . .]T of discrete values,
then the formulas change. In fact, the semidiscrete Fourier transform of y is given by
the (continuous) function
ŷ(ω) = h ∑_{j=−∞}^{∞} e^{−iωt_j} y_j,   ω ∈ [−π/h, π/h],   (102)
and the inverse semidiscrete Fourier transform is given by the (discrete infinite) vector
y whose components are of the form
y_j = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt_j} ŷ(ω) dω,   j ∈ Z.   (103)
Remark Note that the notion of a semidiscrete Fourier transform is just a differ-
ent name for a Fourier series based on the complex exponentials e−iωtj with Fourier
coefficients y j .
The interesting difference between the continuous and semidiscrete setting is marked
by the bounded Fourier space in the semidiscrete setting. This can be explained by the
phenomenon of aliasing. Aliasing arises when a continuous function is sampled on a
discrete set. In particular, the two complex exponential functions f (t) = eiω1 t and
g(t) = eiω2 t differ from each other on the real line as long as ω1 6= ω2 . However, if we
sample the two functions on the grid hZ, then we get the vectors f and g with values
f j = eiω1 tj and g j = eiω2 tj . Now, if ω2 = ω1 + 2kπ/h for some integer k, then f j = g j
for all j, and the two (different) continuous functions f and g appear identical in their
discrete representations f and g. Thus, any complex exponential eiωt is matched on
the grid hZ by infinitely many other complex exponentials (its aliases). Therefore we
can limit the representation of the Fourier variable ω to an interval of length 2π/h. For
reasons of symmetry we use [−π/h, π/h].
The band-limited interpolant p of the data y is obtained by letting t_j in (103) vary continuously, i.e.,
p(t) = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} ŷ(ω) dω,   t ∈ R.   (104)
It is obvious (cf. (103)) from this definition that p interpolates the data, i.e., p(t_j) = y_j,
for any j ∈ Z.
Moreover, the Fourier transform of the function p turns out to be
p̂(ω) = ŷ(ω) for ω ∈ [−π/h, π/h],   and   p̂(ω) = 0 otherwise.
This kind of function is known as a band-limited function, and p is called the band-
limited interpolant of y.
The spectral derivative vector y' of y can now be obtained by one of the following
two procedures. First,
1. Sample the function y at the (infinite set of) discrete points t_j ∈ hZ to obtain
the data vector y with components y_j.
2. Form the band-limited interpolant p of the data as in (104).
3. Differentiate p and sample it at the grid points, i.e., set y'_j = p'(t_j), j ∈ Z.
However, from a computational point of view it is better to deal with this problem
in the Fourier domain. We begin by noting that the Fourier transform of the derivative
y' is given by
ŷ'(ω) = ∫_{−∞}^{∞} e^{−iωt} y'(t) dt.
Applying integration by parts we get
ŷ'(ω) = [e^{−iωt} y(t)]_{−∞}^{∞} + iω ∫_{−∞}^{∞} e^{−iωt} y(t) dt.
If y(t) tends to zero for t → ±∞ (which it has to for the Fourier transform of y to
exist) then we see that
yb0 (ω) = iω ŷ(ω). (105)
Therefore, we obtain the spectral derivative y 0 by the following alternate procedure:
1. Sample the function y at the (infinite set of) discrete points tj ∈ hZ to obtain
the data vector y with components y j .
2. Compute the semidiscrete Fourier transform ŷ of the data via (102).
3. Multiply by iω, i.e., form ŷ'(ω) = iω ŷ(ω) (cf. (105)).
4. Find the derivative vector via inverse semidiscrete Fourier transform (see (103)), i.e.,
y'_j = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt_j} ŷ'(ω) dω,   j ∈ Z.
Now we need to find out how we can obtain the entries of the differentiation matrix
D from the preceding discussion. We follow the first procedure above.
In order to be able to compute the semidiscrete Fourier transform of an arbitrary
data vector y we represent its components in terms of shifts of (discrete) delta func-
tions, i.e.,
y_j = ∑_{k=−∞}^{∞} y_k δ_{j−k},   (106)
We use this approach since the semidiscrete Fourier transform of the delta function can
be computed easily. In fact, according to (102)
δ̂(ω) = h ∑_{j=−∞}^{∞} e^{−iωt_j} δ_j = h e^{−iωt_0} = h
for all ω ∈ [−π/h, π/h]. Then the band-limited interpolant of δ is of the form (see
(104))
p(t) = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} δ̂(ω) dω
     = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} h dω
     = (h/π) ∫_{0}^{π/h} cos(ωt) dω
     = (h/π) [sin(ωt)/t]_{ω=0}^{π/h}
     = (h/π) sin(πt/h)/t = sin(πt/h)/(πt/h) = sinc(πt/h).
Thus far we have used the definition of the band-limited interpolant (104), the defi-
nition of the semidiscrete Fourier transform of y (102), and the representation (106).
Interchanging the summation, and then using the definition of the delta function and
the same calculation as for the band-limited interpolant of the delta function above we
obtain the final form of the band-limited interpolant of an arbitrary data vector y as
p(t) = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} h ∑_{j=−∞}^{∞} ( ∑_{k=−∞}^{∞} y_k δ_{j−k} ) e^{−iωt_j} dω
     = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} h ∑_{k=−∞}^{∞} y_k e^{−iωt_k} dω
     = ∑_{k=−∞}^{∞} y_k (1/(2π)) ∫_{−π/h}^{π/h} e^{iω(t−t_k)} h dω
     = ∑_{k=−∞}^{∞} y_k sinc((t − t_k)π/h).
The remaining columns are shifts of this column since the matrix is a Toeplitz matrix.
This is exactly of the form (98). The explicit formula for the derivative of the sinc
function above is obtained using elementary calculations:
d/dt sinc(tπ/h) = (1/t) cos(tπ/h) − (h/(πt^2)) sin(tπ/h),
so that
d/dt sinc(tπ/h)|_{t=t_j=jh} = cos(jπ)/(jh) − sin(jπ)/(j^2 hπ) = (−1)^j/(jh)   for j ≠ 0.
Remark Formulas for odd N also exist, but are slightly different. For the sake of
clarity, we focus only on the even case here.
As in the previous subsection we now look at the Fourier transform of the discrete
and periodic data y = [y 1 , . . . , y N ]T with y j = y(jh) = y(2jπ/N ), j = 1, . . . , N . For
the same reason of aliasing the Fourier domain will again be bounded. Moreover, the
periodicity of the data implies that the Fourier domain is also discrete (since only waves
eikt with integer wavenumber k have period 2π).
Thus, the discrete Fourier transform (DFT) is given by
ŷ_k = h ∑_{j=1}^{N} e^{−ikt_j} y_j,   k = −N/2 + 1, . . . , N/2.   (107)
Note that the (continuous) Fourier domain [−π/h, π/h] used earlier now translates to
the discrete domain noted in (107) since h = 2π/N is equivalent to π/h = N/2.
The formula for the inverse discrete Fourier transform (inverse DFT) is given by
y_j = (1/(2π)) ∑_{k=−N/2+1}^{N/2} e^{ikt_j} ŷ_k,   j = 1, . . . , N.   (108)
We obtain the spectral derivative of the finite vector data by exactly the same
procedure as in the previous subsection. First, we need the band-limited interpolant
of the data. It is given by the formula
p(t) = (1/(2π)) ∑'_{k=−N/2}^{N/2} e^{ikt} ŷ_k,   t ∈ [0, 2π].   (109)
Here we define ŷ −N/2 = ŷ N/2 , and the prime on the sum indicates that we add the
first and last summands only with weight 1/2. This modification is required for the
band-limited interpolant to work properly.
Remark The band-limited interpolant is actually a trigonometric polynomial of degree
N/2, i.e., p(t) can be written as a linear combination of the trigonometric functions
1, sin t, cos t, sin 2t, cos 2t, . . . , sin N t/2, cos N t/2. We will come back to this fact when
we discuss non-periodic data.
Finally, using the same arguments and similar elementary calculations as earlier, we
get, for the periodic analogue S_N of the sinc interpolant of the delta function,
S'_N(t_j) = 0 for j ≡ 0 (mod N),   and   S'_N(t_j) = (1/2)(−1)^j cot(jh/2) for j ≢ 0 (mod N).
These are the entries of the N -th column of the Toeplitz matrix (99).
Example The Matlab script SpectralDiffDemo.m illustrates the use of spectral dif-
ferentiation for the not so smooth hat function and for the infinitely smooth function
y(t) = esin t .
The spectral derivative of periodic data can thus be computed efficiently with the fast Fourier transform (FFT):
1. Sample y at the points t_j = jh, j = 1, . . . , N, to obtain the data vector y.
2. Compute the discrete Fourier transform of the (finite) data vector via (107):
ŷ_k = h ∑_{j=1}^{N} e^{−ikt_j} y_j,   k = −N/2 + 1, . . . , N/2.
3. Multiply by ik, i.e., form ŷ'_k = ik ŷ_k (for the first derivative the coefficient with k = N/2 is set to zero).
4. Find the derivative vector via inverse discrete Fourier transform (see (108)), i.e.,
y'_j = (1/(2π)) ∑_{k=−N/2+1}^{N/2} e^{ikt_j} ŷ'_k,   j = 1, . . . , N.
Remark Cooley and Tukey (1965) are usually given credit for discovering the FFT.
However, the same algorithm was already known to Gauss (even before Fourier com-
pleted his work on what is known today as the Fourier transform). A detailed discus-
sion of this algorithm goes beyond the scope of this course. We simply use the Matlab
implementations fft and ifft. These implementations are based on the current state-
of-the-art FFTW algorithm (the “fastest Fourier transform in the West”) developed at
MIT by Matteo Frigo and Steven G. Johnson.
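A minimal fft-based implementation of this procedure (our own sketch; note that Matlab's fft stores the wavenumbers in the order 0, 1, . . . , N/2−1, −N/2, . . . , −1):

N = 24; h = 2*pi/N; t = h*(1:N)';
y = exp(sin(t));                           % smooth periodic test function
yhat = fft(y);
k = [0:N/2-1  0  -N/2+1:-1]';              % wavenumbers; the k = N/2 term is set to 0
dy = real(ifft(1i*k.*yhat));               % spectral derivative
err = norm(dy - cos(t).*exp(sin(t)), inf)  % close to machine precision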
These functions are arranged according to their (increasing) smoothness. The function
y1 has a third derivative of bounded variation, y2 is infinitely differentiable (but not
analytic), y3 is analytic in the strip |Im(t)| < 2 ln(1 + √2) in the complex plane, and
y4 is band-limited.
Note: A continuous function y is of bounded variation if
sup_{t0 < t1 < ··· < tN} ∑_{j=1}^{N} |y(t_j) − y(t_{j−1})|
is finite, where the supremum is taken over all finite partitions t0 < t1 < · · · < tN.
11.5 Polynomial Interpolation and Clustered Grids
We already saw in the Matlab script BandLimitedDemo.m that a spectral interpolant
performs very poorly for non-smooth functions. Thus, if we just went ahead and treated
a problem on a bounded domain as a periodic problem via periodic extension, then the
resulting jumps that may arise at the endpoints of the original interval would lead to
Gibbs phenomena and a significant degradation of accuracy. Therefore, we do not use
the trigonometric polynomials (discrete Fourier transforms) but algebraic polynomials
instead.
For interpolation with algebraic polynomials we saw at the very beginning of this
course (in the Matlab script PolynomialInterpolationDemo.m) the effect that differ-
ent distributions of the interpolation nodes in a bounded interval have on the accuracy
of the interpolant (the so-called Runge phenomenon). Clearly, the accuracy is much
improved if the points are clustered near the endpoints of the interval. In fact, the
so-called Chebyshev points
tj = cos(jπ/N ), j = 0, 1, . . . , N
yield a set of such clustered interpolation nodes on the standard interval [−1, 1]. These
points can easily be mapped by a linear transformation to any other interval [a, b]
(see Assignment 8). Chebyshev points arise often in numerical analysis. They are the
extremal points of the so-called Chebyshev polynomials (a certain type of orthogonal
polynomial ). In fact, Chebyshev points are equally spaced on the unit circle, and there-
fore one can observe a nice connection between spectral differentiation on bounded
intervals with Chebyshev points and periodic problems on bounded intervals as de-
scribed earlier. It turns out that (contrary to our expectations) the FFT can also be
used for the Chebyshev case. However, we will only consider Chebyshev differentiation
matrices below.
1. Take the Chebyshev points
t_j = cos(jπ/N),   j = 0, 1, . . . , N,
and sample the function y at those points to obtain the data vector y = [y(t0), y(t1), . . . , y(tN)]^T.
2. Find the (algebraic) polynomial p of degree at most N that interpolates the data,
i.e., s.t.
p(t_i) = y_i,   i = 0, 1, . . . , N.
3. Obtain the spectral derivative by differentiating p and sampling it at the grid points, i.e., y'_i = p'(t_i), i = 0, 1, . . . , N.
This procedure (implicitly) defines the differentiation matrix DN that gives us
y 0 = DN y.
Before we look at the general formula for the entries of DN we consider some simple
examples.
Example For N = 1 we have the two points t0 = 1 and t1 = −1, and the interpolant
is given by
p(t) = ((t − t1)/(t0 − t1)) y_0 + ((t0 − t)/(t0 − t1)) y_1 = ((t + 1)/2) y_0 + ((1 − t)/2) y_1.
The derivative of p is (the constant)
p'(t) = (1/2) y_0 − (1/2) y_1,
so that we have
y' = [ (1/2)y_0 − (1/2)y_1 ; (1/2)y_0 − (1/2)y_1 ]
and the differentiation matrix is given by
D1 = [ 1/2  −1/2 ; 1/2  −1/2 ].
We note that the differentiation matrices no longer are Toeplitz or circulant. Instead, the entries satisfy (also in the general case below) the symmetry relation (DN)_{ij} = −(DN)_{N−i,N−j}.
Theorem 11.1 For each N ≥ 1, let the rows and columns of the (N + 1) × (N + 1)
Chebyshev spectral differentiation matrix DN be indexed from 0 to N . The entries of
this matrix are
(DN)_{00} = (2N^2 + 1)/6,   (DN)_{NN} = −(2N^2 + 1)/6,
(DN)_{jj} = −t_j/(2(1 − t_j^2)),   j = 1, . . . , N − 1,
(DN)_{ij} = (c_i/c_j) (−1)^{i+j}/(t_i − t_j),   i ≠ j,   i, j = 0, 1, . . . , N,
where c_i = 2 if i = 0 or N, and c_i = 1 otherwise.
This matrix is implemented in the Matlab script cheb.m that was already used in
the Matlab function PSBVP.m that we used in our motivational example PSBVPDemo.m
at the beginning of this chapter. Note that only the off-diagonal entries are computed
via the formulas given in the theorem. For the diagonal entries the formula
(DN)_{ii} = − ∑_{j=0, j≠i}^{N} (DN)_{ij}
was used.
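For reference, here is a Matlab sketch along the lines of the cheb.m script just mentioned (our own transcription of the standard construction; the actual script may differ in details):

function [D, t] = cheb(N)
% Chebyshev differentiation matrix of Theorem 11.1; diagonal via negative row sums.
if N == 0, D = 0; t = 1; return, end
t = cos(pi*(0:N)'/N);                   % Chebyshev points
c = [2; ones(N-1,1); 2].*(-1).^(0:N)';  % c_i with alternating signs built in
T = repmat(t, 1, N+1);
dT = T - T';                            % t_i - t_j
D = (c*(1./c)')./(dT + eye(N+1));       % off-diagonal entries
D = D - diag(sum(D, 2));                % diagonal entries
end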
y1(t) = |t|^3,
y2(t) = exp(−t^{−2}),
y3(t) = 1/(1 + t^2),
y4(t) = t^{10}.
These functions are again arranged according to their (increasing) smoothness. The
function y1 has a third derivative of bounded variation, y2 is infinitely differentiable
(but not analytic), y3 is analytic in [−1, 1], and y4 is a polynomial (which corresponds
to the band-limited case earlier).
Note that the error for the derivative of the function y2 dips to zero for N = 2 since
the true derivative is given by
y'_2(t) = 2 exp(−t^{−2})/t^3,
and the values at t0 = 1, t1 = 0, and t2 = −1 are 2/e, 0, and −2/e, respectively. These
all lie on a line (the linear derivative of the quadratic interpolating polynomial).
We now return to the motivational example, the boundary value problem y''(t) = e^{4t}, t ∈ (−1, 1), with boundary conditions y(−1) = y(1) = 0. Its analytic solution was given earlier as y(t) = (e^{4t} − t sinh(4) − cosh(4))/16.
How do we solve this problem in the Matlab programs PSBVPDemo.m and PSBVP.m?
First, we note that – for Chebyshev differentiation matrices – we can obtain higher
derivatives by repeated application of the matrix DN , i.e., if
y' = DN y,
then
y'' = DN y' = DN^2 y.
In other words, for Chebyshev differentiation matrices
DN^{(k)} = DN^k,   k = 1, . . . , N,
and DN^{N+1} = 0.
Remark We point out that this fact is true only for the Chebyshev case. For the
Fourier differentiation matrices we established in the periodic case we in general have
DN^k ≠ DN^{(k)} (see Assignment 8).
With the insight about higher-order Chebyshev differentiation matrices we can view
the differential equation above as
DN^2 y = f,
where the right-hand side vector f = exp(4t), with t = [t0, t1, . . . , tN]^T the vector of
Chebyshev points. This linear system, however, cannot be solved uniquely (one can
show that the (N + 1) × (N + 1) matrix DN^2 has an (N + 1)-fold eigenvalue
of zero). Of course, this is not a problem. In fact, it is reassuring, since we have not
yet taken into account the boundary conditions, and the ordinary differential equation
(without appropriate boundary conditions) also does not have a unique solution.
So the final question is, how do we deal with the boundary conditions?
We could follow either of two approaches. First, we can build the boundary condi-
tions into the spectral interpolant, i.e.,
1. Take the interior Chebyshev points t1 , . . . , tN −1 and form the polynomial in-
terpolant of degree at most N that satisfies the boundary conditions p(−1) =
p(1) = 0 and interpolates the data vector at the interior points, i.e., p(tj ) = y j ,
j = 1, . . . , N − 1.
2. Obtain the spectral derivative by differentiating p and evaluating at the interior
points, i.e.,
y 00j = p00 (tj ), j = 1, . . . , N − 1.
3. Identify the (N − 1) × (N − 1) matrix D̃N^2 from the previous relation, and solve
the linear system
D̃N^2 y(1 : N − 1) = exp(4 t(1 : N − 1)),
where we used Matlab-like notation.
The second approach is much simpler to implement, but not as straightforward to
understand/derive. Since we already know the value of the solution at the boundary,
i.e., y 0 = 0 and y N = 0, we do not need to include these values in our computa-
tion. Moreover, the values of the derivative at the endpoints are of no interest to us.
Therefore, we can simply solve the linear system
D̃N^2 y(1 : N − 1) = exp(4 t(1 : N − 1)),
where
D̃N^2 = DN^2 (1 : N − 1, 1 : N − 1).
This is exactly what was done in the Matlab program PSBVP.m.
Remark One can show that the eigenvalues of D̃N^2 are given by λ_n = −(π^2/4) n^2, n =
1, 2, . . . , N − 1. Clearly, these values are all nonzero, and the problem has (as it should
have) a unique solution.
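Putting the pieces together, here is a minimal Matlab sketch (our own illustration) of this second approach; it assumes the cheb sketch given above is available:

N = 16;
[D, t] = cheb(N);
D2 = D^2;
D2 = D2(2:N, 2:N);                 % keep only the interior Chebyshev points
y = zeros(N+1, 1);
y(2:N) = D2 \ exp(4*t(2:N));       % boundary values y(1) = y(N+1) = 0 stay zero
yexact = (exp(4*t) - t*sinh(4) - cosh(4))/16;
err = norm(y - yexact, inf)        % spectrally small error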
We are now ready to deal with more complicated boundary value problems. They
can be nonlinear, have non-homogeneous boundary conditions, or mixed-type boundary
conditions with derivative values specified at the boundary. We give examples for each
of these cases.
Example As for our initial value problems earlier, a nonlinear ODE-BVP will be
solved by iteration (either fixed-point, or Newton).
Consider
y 00 (t) = ey(t) , t ∈ (−1, 1)
with boundary conditions y(−1) = y(1) = 0. In the Matlab program NonlinearPSBVPDemo.m
we use fixed-point iteration to solve this problem.
Example Next, we consider a linear BVP with non-homogeneous boundary condi-
tions:
y 00 (t) = e4t , t ∈ (−1, 1)
with boundary conditions y(−1) = 0, y(1) = 1. In the Matlab program PSBVPNonHomoBCDemo.m
this is simply done by replacing the first and last rows of the differentiation matrix D2
by corresponding rows of the identity matrix and then imposing the boundary values
in the first and last entries of the right-hand side vector f.
Example For a linear BVP with mixed boundary conditions such as
with boundary conditions y 0 (−1) = y(1) = 0 we can follow the same strategy as in the
previous example. Now, however, we need to replace the row of D2 that corresponds to
the derivative boundary condition with a row from the first-order differentiation matrix
D. This leads to the Matlab program PSBVPMixedBCDemo.m.
12 Galerkin and Ritz Methods for Elliptic PDEs
12.1 Galerkin Method
We begin by introducing a generalization of the collocation method we saw earlier for
two-point boundary value problems. Consider the elliptic PDE
Lu(x) = f (x), (110)
where L is a linear elliptic partial differential operator such as the Laplacian
L = ∂^2/∂x^2 + ∂^2/∂y^2 + ∂^2/∂z^2,   x = (x, y, z) ∈ R^3.
At this point we will not worry about the boundary conditions that should be posed
with (110).
As with the collocation method discussed earlier, we will obtain the approximate
solution in the form of a function (instead of as a collection of discrete values). There-
fore, we need an approximation space U = span{u1 , . . . , un }, so that we are able to
represent the approximate solution as
u = ∑_{j=1}^{n} c_j u_j,   u_j ∈ U.   (111)
1. Point evaluation functionals, i.e., Φi (u) = u(xi ), where {x1 , . . . , xn } is a set of
points chosen such that the resulting conditions are linearly independent, and u
is some function with appropriate smoothness. With this choice (112) becomes
∑_{j=1}^{n} c_j Lu_j(x_i) = f(x_i),   i = 1, . . . , n,
−∇2 u = f in Ω, (113)
u=0 on ∂Ω,
In order to be able to complete the derivation of the weak form we now assume that
the space U of test functions is of the form
U = {v : v ∈ C 2 (Ω), v = 0 on ∂Ω},
i.e., besides having the necessary smoothness to be a solution of (113), the functions
also satisfy the boundary conditions.
Now we rewrite the left-hand side of (115):
∫∫_Ω (u_xx + u_yy) v dxdy = ∫∫_Ω [(u_x v)_x + (u_y v)_y − u_x v_x − u_y v_y] dxdy
                         = ∫∫_Ω [(u_x v)_x + (u_y v)_y] dxdy − ∫∫_Ω [u_x v_x + u_y v_y] dxdy.   (116)
By the divergence theorem the first integral on the right-hand side of (116) can be converted into an integral of v ∂u/∂n over the boundary ∂Ω. Now the special choice of U, i.e., the fact that v satisfies the boundary conditions,
ensures that this term vanishes. Therefore, the weak form of (113) is given by
∫∫_Ω [u_x v_x + u_y v_y] dxdy = ∫∫_Ω f v dxdy.   (117)
For a numerical method we now seek an approximate solution of the form
u^h = ∑_{j=1}^{n} c_j u_j,   u_j ∈ U,   (118)
from a finite-dimensional subspace U = span{u_1, . . . , u_n} of the test space. The superscript h indicates that the approximate solution is obtained on some underlying discretization of Ω with mesh size h.
(a) For example, regular (tensor product) grids can be used. Then U can consist
of tensor products of piecewise polynomials or B-spline functions that satisfy
the boundary conditions of the PDE.
(b) It is also possible to use irregular (triangulated) meshes, and again define
piecewise (total degree) polynomials or splines on triangulations satisfying
the boundary conditions.
(c) More recently, meshfree approximation methods have been introduced as
possible choices for U.
We now return to the discussion of the general numerical method. Once we have
chosen a basis for the approximation space U, then it becomes our goal to determine
the coefficients cj in (118). By inserting uh into the weak form (117), and selecting as
trial functions v the basis functions of U we obtain a system of equations
∫∫_Ω ∇u^h · ∇u_i dx dy = ∫∫_Ω f u_i dx dy,   i = 1, . . . , n,
or by linearity
Σ_{j=1}^{n} c_j ∫∫_Ω ∇u_j · ∇u_i dx dy = ∫∫_Ω f u_i dx dy,   i = 1, . . . , n. (119)
This last set of equations is known as the Ritz-Galerkin method and can be written in
matrix form
Ac = b,
where the stiffness matrix A has entries
A_{i,j} = ∫∫_Ω ∇u_j · ∇u_i dx dy.
Remark 1. The stiffness matrix is usually assembled element by element, i.e., the
contribution to the integral over Ω is split into contributions for each element
(e.g., rectangle or triangle) of the underlying mesh.
Example One of the most popular finite element versions is based on the use of
piecewise linear C 0 polynomials (built either on a regular grid, or on a triangular
partition of Ω). The basis functions ui are “hat functions”, i.e., functions that are
piecewise linear, have value one at one of the vertices, and zero at all of its neighbors.
This choice makes it very easy to satisfy the homogeneous Dirichlet boundary conditions
of the model problem exactly (along a polygonal boundary).
Since the gradients of piecewise linear functions are constant, the entries of the
stiffness matrix essentially boil down to the areas of the underlying mesh elements.
Therefore, in this case, the Ritz-Galerkin method is very easily implemented. We
generate some examples with Matlab’s PDE toolbox pdetool.
It is not difficult to verify that the stiffness matrix for our example is symmetric
and positive definite. Since the matrix is also very sparse due to the fact that the “hat”
basis functions have a very localized support, efficient iterative solvers can be applied.
Moreover, it is known that the piecewise linear FEM converges with order O(h2 ).
2. The finite element method is one of the most-thoroughly studied numerical meth-
ods. Many textbooks on the subject exist, e.g., “The Mathematical Theory of
Finite Element Methods” by Brenner and Scott (1994), “An Analysis of the Finite
Element Method” by Strang and Fix (1973), or “The Finite Element Method”
by Zienkiewicz and Taylor (2000).
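To make the assembly of the Ritz-Galerkin system concrete, the following minimal Matlab sketch treats the one-dimensional analogue −u'' = f on (0, 1) with u(0) = u(1) = 0 and piecewise linear hat functions on a uniform mesh; the load vector is approximated by lumped (midpoint) quadrature, and the right-hand side is chosen only for illustration.

    n = 20;  h = 1/n;
    x = (1:n-1)'*h;                       % interior nodes
    f = @(t) pi^2*sin(pi*t);              % example right-hand side (exact solution sin(pi t))
    A = (1/h)*( 2*diag(ones(n-1,1)) ...
              - diag(ones(n-2,1),1) - diag(ones(n-2,1),-1) );   % stiffness matrix for hat functions
    b = h*f(x);                           % load vector, b_i ~ integral of f*u_i
    c = A\b;                              % nodal values of the FEM solution
    plot([0; x; 1], [0; c; 0], 'o-', x, sin(pi*x), '-')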
The bilinear form
[u, v] = ∫∫_Ω ∇u · ∇v dx dy (120)
defines an inner product on the space of functions whose first derivatives are square integrable and that vanish on ∂Ω. This space is a Sobolev space, usually denoted by H_0^1(Ω).
The inner product [·, ·] induces a norm ‖v‖ = [v, v]^{1/2} on H_0^1(Ω). Now, using this norm, the best approximation to u is given by the function u^h that minimizes ‖u − u^h‖. Since we define our numerical method via the finite-dimensional subspace U of H_0^1(Ω), we need to find u^h ∈ U such that
u − u^h ⊥ U
or
Σ_{j=1}^{n} c_j [u_j, u_i] = [u, u_i],   i = 1, . . . , n. (121)
The right-hand side of this formula contains the exact solution u, and therefore is not
useful for a numerical scheme. However, by (120) and the weak form (117) we have
[u, u_i] = ∫∫_Ω ∇u · ∇u_i dx dy = ∫∫_Ω f u_i dx dy.
Since the last expression corresponds to the inner product hf, ui i, (121) can be viewed
as
Σ_{j=1}^{n} c_j [u_j, u_i] = ⟨f, u_i⟩,   i = 1, . . . , n,
which is exactly the Ritz-Galerkin system (119). An alternative point of view is provided by the Ritz method, in which one minimizes the energy functional
E(w) = (1/2) ∫∫_Ω ∇w · ∇w dx dy − ∫∫_Ω f w dx dy
over all smooth functions w that vanish on the boundary of Ω. By considering the energy
of nearby solutions u + λv, with arbitrary real λ we see that
E(u + λv) = (1/2) ∫∫_Ω ∇(u + λv) · ∇(u + λv) dx dy − ∫∫_Ω f (u + λv) dx dy
          = (1/2) ∫∫_Ω ∇u · ∇u dx dy + λ ∫∫_Ω ∇u · ∇v dx dy + (λ²/2) ∫∫_Ω ∇v · ∇v dx dy
            − ∫∫_Ω f u dx dy − λ ∫∫_Ω f v dx dy
          = E(u) + λ ∫∫_Ω [∇u · ∇v − f v] dx dy + (λ²/2) ∫∫_Ω ∇v · ∇v dx dy.
The right-hand side is a quadratic polynomial in λ, so that for a minimum, the term
∫∫_Ω [∇u · ∇v − f v] dx dy
must vanish for all v. This is again the weak formulation (117).
A discrete “energy norm” is then given by the quadratic form
E(u^h) = (1/2) cᵀAc − bᵀc,
where A is the stiffness matrix, and c is such that the Ritz-Galerkin system (119)
Ac = b
is satisfied.
13 Classical Iterative Methods for the Solution of Linear
Systems
13.1 Why Iterative Methods?
Virtually all methods for solving Ax = b or Ax = λx require O(m3 ) operations. In
practical applications A often has a certain structure and/or is sparse, i.e., A contains
many zeros.
A typical problem that arises in practice is the Poisson problem mentioned at the
beginning of the class. We want to find u such that
−∇2 u(x, y) = − [uxx (x, y) + uyy (x, y)] = f (x, y), in Ω = [0, 1]2
u(x, y) = 0, on ∂Ω.
At the (n − 1)2 interior grid points we obtain the following system of linear equations
for the values of u there
4u_{i,j} − u_{i−1,j} − u_{i,j−1} − u_{i+1,j} − u_{i,j+1} = f_{i,j}/n²,   i, j = 1, . . . , n − 1.
The system matrix is of size m × m, where m = (n − 1)². Each row contains at most five nonzero entries, so the matrix is very sparse. Thus, special methods are called for that take advantage of this sparsity when we solve this linear system. Obviously, a
full-blown LU or Cholesky factorization will be much too costly if m is large (typical
values for m are often 106 or even larger).
Classical iterative methods compute a sequence of approximations of the form
x^{(k)} = G x^{(k−1)} + c,   k = 1, 2, . . . . (38)
Here we assume that A ∈ C^{m×m}, x^{(0)} is an initial guess for the solution, and G and c are a constant iteration matrix and vector, respectively, defining the iterative scheme.
Most classical iterative methods are based on a splitting of the matrix A of the form
A=M −N
with a nonsingular matrix M . One then defines
G = M −1 N and c = M −1 b.
Then (38) becomes
x(k) = M −1 N x(k−1) + M −1 b
or
M x(k) = N x(k−1) + b. (39)
In practice we will want to choose the splitting factors so that
1. (39) is easily solved,
2. (39) converges rapidly.
Theorem 13.1 If
kGk = kM −1 N k < 1
then (38) converges to a solution of Ax = b for any initial guess x(0) .
Proof (38) describes a fixed point iteration (i.e., is of the form x = g(x)), and the
fixed point of (38) is a solution of Ax = b as can be seen from
x = Gx + c
⇐⇒ x = M −1 N x + M −1 b
⇐⇒ Mx = Nx + b
⟺ (M − N)x = Ax = b.
Now we let e(k) = x(k) − x, where x is the solution of the fixed point problem, and
show that this quantity goes to zero as k → ∞. First we observe that
e(k) = x(k) − x
= Gx(k−1) − Gx
= G(x^{(k−1)} − x)
= Ge(k−1) .
Taking norms we have
ke(k) k = kGe(k−1) k
≤ kGkke(k−1) k
≤ kGkk ke(0) k,
where the last inequality is obtained by recursion.
If now – as we assume – kGk < 1, then ke(k) k → 0 as k → ∞, and therefore
x(k) → x and the method converges.
13.3 How should we choose M and N ?
13.3.1 The Jacobi Method
We formally decompose A = L + D + U into a lower triangular, diagonal, and upper
triangular part. Then we let
M = D, N = −(L + U ).
Algorithm (Jacobi)
for k = 1, 2, . . .
    for i = 1 : m
        x_i^{(k)} = ( b_i − Σ_{j=1}^{i−1} a_{ij} x_j^{(k−1)} − Σ_{j=i+1}^{m} a_{ij} x_j^{(k−1)} ) / a_{ii}
    end
end
Example If we apply the Jacobi method to the finite difference discretization of the
Poisson problem then we can be more efficient by taking advantage of the matrix
structure. The central part of the algorithm (the loop for i = 1 : m) can then be
replaced by
for i = 1 : n − 1
for j = 1 : n − 1
u_{i,j}^{(k)} = ( u_{i−1,j}^{(k−1)} + u_{i,j−1}^{(k−1)} + u_{i+1,j}^{(k−1)} + u_{i,j+1}^{(k−1)} + f_{i,j}/n² ) / 4
end
end
Note that the unknowns are now uij instead of xi . This algorithm can be implemented
in one line of Matlab (see homework).
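A possible form of that one-line update (shown here only as a hedged sketch, not necessarily the intended homework solution) stores the grid values in an (n+1) × (n+1) array U whose boundary entries remain zero:

    n = 50;
    F = ones(n+1);                        % example right-hand side values f_{i,j}
    U = zeros(n+1);                       % initial guess; boundary rows/columns stay 0
    for k = 1:500                         % fixed number of Jacobi sweeps
        U(2:n,2:n) = ( U(1:n-1,2:n) + U(2:n,1:n-1) ...
                     + U(3:n+1,2:n) + U(2:n,3:n+1) + F(2:n,2:n)/n^2 ) / 4;
    end

Since the entire right-hand side is evaluated before the assignment, all neighbor values come from the previous iterate, which is exactly the Jacobi sweep.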
Remark While the Jacobi method is not used that often in practice on serial comput-
ers it does lend itself to a naturally parallel implementation.
In order to get a convergence result for the Jacobi method we need to recall the
concept of diagonal dominance. We say a matrix A is strictly row diagonally dominant
if
|a_{ii}| > Σ_{j≠i} |a_{ij}|,   i = 1, . . . , m.
Theorem 13.3 If A is strictly row diagonally dominant, then the Jacobi method con-
verges for any initial guess x(0) .
As we will see in some numerical examples, the convergence of the Jacobi method
is usually rather slow. A (usually) faster method is discussed next.
13.3.2 The Gauss-Seidel Method
Example For the linear system
2x1 + x2 = 6
x1 + 2x2 = 6,
or, in matrix form,
[ 2 1 ; 1 2 ] [ x1 ; x2 ] = [ 6 ; 6 ],
the Jacobi method looks like
x_1^{(k)} = ( 6 − x_2^{(k−1)} ) / 2
x_2^{(k)} = ( 6 − x_1^{(k−1)} ) / 2.
In order to obtain an improvement we notice that the value of x_1^{(k−1)} used in the second equation is actually outdated, since we have already computed a newer version, x_1^{(k)}, in the first equation. Therefore, we might consider
x_1^{(k)} = ( 6 − x_2^{(k−1)} ) / 2
x_2^{(k)} = ( 6 − x_1^{(k)} ) / 2
instead. This is known as the Gauss-Seidel method. The general algorithm is of the
form
for k = 1, 2, . . .
for i = 1 : m
x_i^{(k)} = ( b_i − Σ_{j=1}^{i−1} a_{ij} x_j^{(k)} − Σ_{j=i+1}^{m} a_{ij} x_j^{(k−1)} ) / a_{ii}
end
end
Example For one step of the finite difference solution of the Poisson problem we get
for i = 1 : n − 1
for j = 1 : n − 1
u_{i,j}^{(k)} = ( u_{i−1,j}^{(k)} + u_{i,j−1}^{(k)} + u_{i+1,j}^{(k)} + u_{i,j+1}^{(k−1)} + f_{i,j}/n² ) / 4
end
end
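In Matlab the corresponding Gauss-Seidel sweep can be sketched with an explicit double loop (U and F as in the Jacobi sketch above, with array indices 2, . . . , n corresponding to the interior points); because U is overwritten entry by entry, the new values are used as soon as they are available:

    for i = 2:n
        for j = 2:n
            U(i,j) = ( U(i-1,j) + U(i,j-1) + U(i+1,j) + U(i,j+1) + F(i,j)/n^2 ) / 4;
        end
    end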
Remark Note that the implementation of the Gauss-Seidel algorithm for this example
depends on the ordering of the grid points. We used the natural (or typewriter) order-
ing, i.e., we scan the grid points row by row from left to right. Sometimes a red-black
(or chessboard) ordering is used. This is especially useful if the Gauss-Seidel method
is to be parallelized.
In terms of a matrix splitting, the Gauss-Seidel method corresponds to
M = D + L,
N = −U,
so that
M x^{(k)} = N x^{(k−1)} + b  ⟺  (D + L) x^{(k)} = b − U x^{(k−1)},
or
x^{(k)} = (D + L)^{−1} ( b − U x^{(k−1)} ),
or
x^{(k)} = D^{−1} ( b − L x^{(k)} − U x^{(k−1)} ).
Theorem 13.4 The Gauss-Seidel method converges for any initial guess x^{(0)} if A is strictly row diagonally dominant.
Remark For a generic problem the Gauss-Seidel method converges faster than the Jacobi method (see the Maple worksheet 473 IterativeSolvers.mws). However, for particular problems the Jacobi method may occasionally be the faster of the two.
Remark A careful reader may notice that the sufficient conditions given in the con-
vergence theorems for the Jacobi and Gauss-Seidel methods do not cover the matrix
for our finite-difference Poisson problem. However, there are variations of the theorems
that do cover this important example.
13.3.3 Successive Over-Relaxation (SOR)
One can accelerate the convergence of the Gauss-Seidel method by using a weighted
average of the new Gauss-Seidel value with the one obtained during the previous iter-
ation:
x^{(k)} = (1 − ω) x^{(k−1)} + ω x_{GS}^{(k)}.
Here ω is the so-called relaxation parameter. If ω = 1 we simply have the Gauss-Seidel
method. For ω > 1 one speaks of over-relaxation, and for ω < 1 of under-relaxation.
The resulting algorithm is known as successive over-relaxation (SOR) and is (obviously) a variation of the Gauss-Seidel algorithm.
Algorithm (SOR)
for k = 1, 2, . . .
for i = 1 : m
x_i^{(k)} = (1 − ω) x_i^{(k−1)} + ω ( b_i − Σ_{j=1}^{i−1} a_{ij} x_j^{(k)} − Σ_{j=i+1}^{m} a_{ij} x_j^{(k−1)} ) / a_{ii}
(the last factor is just the Gauss-Seidel value x_{i,GS}^{(k)})
end
end
Example For one step of the SOR algorithm applied to the finite difference solution
of the Poisson problem we get
for i = 1 : n − 1
for j = 1 : n − 1
u_{i,j}^{(k)} = (1 − ω) u_{i,j}^{(k−1)} + ω ( u_{i−1,j}^{(k)} + u_{i,j−1}^{(k)} + u_{i+1,j}^{(k−1)} + u_{i,j+1}^{(k−1)} + f_{i,j}/n² ) / 4
end
end
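The corresponding Matlab sweep (a sketch, with U, F, and n as in the Jacobi and Gauss-Seidel sketches above) differs from the Gauss-Seidel sweep only in the weighted average:

    omega = 1.8;                          % some relaxation parameter, 0 < omega < 2
    for i = 2:n
        for j = 2:n
            uGS    = ( U(i-1,j) + U(i,j-1) + U(i+1,j) + U(i,j+1) + F(i,j)/n^2 ) / 4;
            U(i,j) = (1-omega)*U(i,j) + omega*uGS;
        end
    end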
Remark Just as for the Gauss-Seidel algorithm, the implementation of the SOR
method depends on the ordering of the grid points.
To describe the SOR method in terms of splitting matrices we again assume that
A = L + D + U , and take
M = (1/ω) D + L,
N = (1/ω − 1) D − U.
Then
M x^{(k)} = N x^{(k−1)} + b  ⟺  ( (1/ω) D + L ) x^{(k)} = ( (1/ω − 1) D − U ) x^{(k−1)} + b.
The rearrangements that show that this formulation is indeed equivalent to the formula
used in the algorithm above are:
( (1/ω) D + L ) x^{(k)} = ( (1/ω − 1) D − U ) x^{(k−1)} + b
⟺ (1/ω) D x^{(k)} = ( (1/ω − 1) D − U ) x^{(k−1)} + b − L x^{(k)}
⟺ x^{(k)} = ω D^{−1} ( (1/ω − 1) D − U ) x^{(k−1)} + ω D^{−1} b − ω D^{−1} L x^{(k)}
⟺ x^{(k)} = x^{(k−1)} − ω x^{(k−1)} − ω D^{−1} U x^{(k−1)} + ω D^{−1} b − ω D^{−1} L x^{(k)}
⟺ x^{(k)} = (1 − ω) x^{(k−1)} + ω [ D^{−1} ( b − L x^{(k)} − U x^{(k−1)} ) ].
Note that the expression inside the square brackets on the last line is just what we had
for the Gauss-Seidel method earlier.
One can prove a general convergence theorem that is similar to those for the Jacobi
and Gauss-Seidel methods:
Theorem 13.5 If A is symmetric positive definite then the SOR method with 0 < ω <
2 converges for any starting value x(0) .
Remark Note that this theorem says nothing about the speed of convergence. In fact,
finding a good value for the relaxation parameter ω is quite difficult. The value of ω
that yields the fastest convergence of the SOR method is known only in very special
cases. For example, if A is tridiagonal then
ω_opt = 2 / ( 1 + √(1 − ρ(G)) ).
The convergence behavior of all three classical methods is illustrated in the Maple
worksheet 473 IterativeSolvers.mws.
14 Arnoldi Iteration and GMRES
14.1 Arnoldi Iteration
The classical iterative solvers we have discussed up to this point were of the form
x(k) = Gx(k−1) + c
with constant G and c. Such methods are also known as stationary methods. We will
now study a different class of iterative solvers based on optimization.
All methods will require a “black box” implementation of a matrix-vector product.
In most library implementations of such solvers the user can therefore provide a custom
function which computes the matrix-vector product as efficiently as possible for the
specific system matrix at hand.
One of the main ingredients in all of the following methods is the notion of a Krylov subspace. Given A ∈ C^{m×m} and b ∈ C^m one generates the subspaces
K_n = span{b, Ab, A²b, . . . , A^{n−1}b},   n = 1, 2, . . . .
The Arnoldi iteration is based on the reduction of A to upper Hessenberg form by a unitary similarity transformation, i.e.,
AQ = QH.
Now we take n < m and truncate the relation AQ = QH accordingly. With
Q_n = [q_1, q_2, . . . , q_n],   Q_{n+1} = [q_1, q_2, . . . , q_n, q_{n+1}],
and with H̃_n denoting the (n + 1) × n upper-left block of H, i.e., the upper Hessenberg matrix
    [ h_11   h_12   · · ·   h_1n     ]
    [ h_21   h_22   · · ·   h_2n     ]
H̃_n = [  0     h_32   · · ·   h_3n     ]
    [         ⋱      ⋱               ]
    [             h_{n,n−1}  h_{nn}  ]
    [  0     · · ·    0    h_{n+1,n} ],
we take
A Q_n = Q_{n+1} H̃_n.
Note that here A ∈ C^{m×m}, Q_n ∈ C^{m×n}, Q_{n+1} ∈ C^{m×(n+1)}, and H̃_n ∈ C^{(n+1)×n}, so that both sides of the equation result in an m × n matrix.
If we compare the n-th columns on both sides, then we get
A q_n = h_{1n} q_1 + h_{2n} q_2 + · · · + h_{nn} q_n + h_{n+1,n} q_{n+1}, (40)
which constitutes an (n + 1)-term recursion for the vector q_{n+1}. Equation (40) can be re-written as
q_{n+1} = ( A q_n − Σ_{i=1}^{n} h_{in} q_i ) / h_{n+1,n}. (41)
The recursive computation of the columns of the unitary matrix Q in this manner is
known as Arnoldi iteration.
Example The first step of Arnoldi iteration proceeds as follows. We start with the
matrix A and an arbitrary normalized vector q 1 . Then, according to (41),
q_2 = ( A q_1 − h_11 q_1 ) / h_21.
Note that this step involves the matrix-vector product Aq 1 (which has to be computed
efficiently with a problem specific subroutine).
Since we want q ∗1 q 2 = 0 in order to have orthogonality of the columns of Q we get
0 = q_1* A q_1 − h_11 q_1* q_1.
Since q_1* q_1 = 1, this gives
h_11 = q_1* A q_1
– a Rayleigh quotient.
Finally, we let v = A q_1 − h_11 q_1, compute h_21 = ‖v‖, and normalize q_2 = v / h_21.
Algorithm (Arnoldi Iteration)
q_1 = b / ‖b‖_2
for n = 1, 2, 3, . . .
v = Aq n
for j = 1 : n
hjn = q ∗j v
v = v − hjn q j
end
hn+1,n = kvk2
q n+1 = v/hn+1,n
end
Remark The most expensive operation in the algorithm is the matrix-vector product
Aq n . The rest of the operations are on the order of O(mn) (so they get a little more
expensive in each iteration). Therefore, in addition to the basic Arnoldi algorithm we
need an efficient implementation of the matrix-vector product this should be taylored
to the problem. Moreover, the algorithm above treats the matrix-vector product as a
“black box” and the algorithm does not need to know or store the matrix A. The only
quantity of interest is the product Aq n , i.e., the action of A on q n .
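A minimal Matlab sketch of the algorithm, with the matrix-vector product supplied as a function handle Afun (a hedged illustration, not the course's reference implementation):

    function [Q, H] = arnoldi(Afun, b, n)
    % n steps of Arnoldi iteration; Q has orthonormal columns q_1,...,q_{n+1},
    % H is the (n+1) x n upper Hessenberg matrix with A*Q(:,1:n) = Q*H.
    m = length(b);
    Q = zeros(m, n+1);  H = zeros(n+1, n);
    Q(:,1) = b / norm(b);
    for k = 1:n
        v = Afun(Q(:,k));                 % the only use of A: one matrix-vector product
        for j = 1:k
            H(j,k) = Q(:,j)'*v;
            v = v - H(j,k)*Q(:,j);
        end
        H(k+1,k) = norm(v);
        Q(:,k+1) = v / H(k+1,k);
    end
    end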
14.3 Arnoldi Iteration as Projection onto Krylov Subspaces
An alternative derivation of Arnoldi iteration starts with the Krylov matrix
K_n = [b, Ab, A²b, . . . , A^{n−1}b].
Then
A K_n = [Ab, A²b, A³b, . . . , A^n b]. (42)
Since the first n − 1 columns of the matrix on the right-hand side are the last n − 1 columns of K_n we can also write
A K_n = [Ab, A²b, . . . , A^{n−1}b, −K_n c],
where
c = −K_n^{−1} A^n b
and we assume that K_n is invertible. Equivalently,
A K_n = K_n C_n
with
C_n = [e_2, e_3, . . . , e_n, −c].
Thus A and C_n are similar via K_n^{−1} A K_n = C_n. The problem with this formulation is
that Kn is usually ill-conditioned (since all of its columns converge to the dominant
eigenvector of A, cf. our earlier discussion of simultaneous power iteration).
Remark As a side remark we mention that the matrix C_n above is known as a companion matrix. It has characteristic polynomial p(z) = zⁿ + Σ_{i=1}^{n} c_i z^{i−1}, where the c_i are the components of c. In other words, the eigenvalues of C_n are the
roots of p. This goes also in the other direction. Given a monic polynomial p, we can
form its companion matrix Cn and then know that the roots of p are the same as the
eigenvalues of Cn .
Returning to the derivation of Arnoldi iteration, we still need to show how the above
Krylov subspace formulation is related to the earlier one based on the Gram-Schmidt
method.
Denote the QR factorization of the Krylov matrix Kn by
Kn = Qn Rn .
Then
K_n^{−1} A K_n = C_n
⟺ R_n^{−1} Q_n* A Q_n R_n = C_n
⟺ Q_n* A Q_n = R_n C_n R_n^{−1} =: H_n.
It needs to be pointed out that this approach is computationally not a good one.
It is both too expensive and unstable. On the one hand, finding c involves solution of
the (ill-conditioned) linear system Kn c = An b. On the other hand, we would also be
required to provide the inverse of Rn .
However, we can get some theoretical insight from this approach. The formula
Q_n* A Q_n = H_n
shows that H_n represents the orthogonal projection of A onto the Krylov subspace K_n expressed in the basis {q_1, . . . , q_n}.
14.4 GMRES
The method of generalized minimum residuals (or GMRES) was suggested in 1986 by
Saad and Schultz.
While application of the classical iterative solvers was limited to either diagonally
dominant or positive definite matrices, the GMRES method can be used for linear sys-
tems Ax = b with arbitrary (nonsingular) square matrices A. The essential ingredient
in this general iterative solver is Arnoldi iteration.
The main idea of the GMRES method is to solve a least squares problem at each step
of the iteration. More precisely, at step n we approximate the exact solution x∗ = A−1 b
by a vector x_n ∈ K_n (the n-th order Krylov subspace) such that the 2-norm of the residual,
‖r_n‖_2 = ‖A x_n − b‖_2,
is minimized. Since x_n ∈ K_n we can write
x_n = K_n c
for some coefficient vector c ∈ Cⁿ, so that we need to minimize ‖A K_n c − b‖_2.
The obvious way to find the least squares solution to this problem would be to compute
the QR factorization of the matrix AKn . However, this is both unstable and too
expensive.
Instead, we look for an orthonormal basis for the Krylov subspace Kn . We will
denote this by {q 1 , q 2 , . . . , q n }, the columns of the matrix Qn used in Arnoldi iteration.
With this new basis the approximate solution xn ∈ Kn can be written as
xn = Qn y
for some appropriate vector y ∈ Cⁿ. The residual minimization is then
‖A Q_n y − b‖_2 → min. (43)
It is now time to recall the principle behind the Arnoldi iteration. That algorithm
is based on the partial similarity transform
A Q_n = Q_{n+1} H̃_n.
Inserting this into (43) gives
‖Q_{n+1} H̃_n y − b‖_2 → min.
Next we take advantage of the fact that multiplication by a unitary matrix does not change the 2-norm. Thus, we arrive at
‖Q_{n+1}* Q_{n+1} H̃_n y − Q_{n+1}* b‖_2 → min
⟺ ‖H̃_n y − Q_{n+1}* b‖_2 → min.
Note that the system matrix AQn in (43) is an m × n matrix, while the new matrix
H̃_n is an (n + 1) × n matrix, which is smaller and therefore will permit a more efficient
solution.
The final simplification we can make is for the vector Q∗n+1 b. In detail, this vector
is given by
Q_{n+1}* b = [ q_1* b, q_2* b, . . . , q_{n+1}* b ]ᵀ.
Recall that the Krylov subspaces are given by
K_1 = span{b},  K_2 = span{b, Ab},  . . . ,
and that q_1 = b / ‖b‖. Since the columns q_j of Q_{n+1} are orthonormal, we have q_j* b = 0 for any j > 1. Therefore we actually have
Q_{n+1}* b = ‖b‖ e_1.
Combining all of this work we arrive at the final least squares formulation
‖ H̃_n y − ‖b‖_2 e_1 ‖_2 → min
with x_n = Q_n y.
This leads to
Algorithm (GMRES)
Let q_1 = b / ‖b‖
for n = 1, 2, 3, . . .
    Perform step n of the Arnoldi iteration, i.e., compute the new entries of H̃_n and Q_n.
    Find the y that minimizes ‖H̃_n y − ‖b‖ e_1‖_2 (e.g., with a QR factorization).
    Set x_n = Q_n y.
end
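Following the same Arnoldi recursion as in the sketch above, a hedged Matlab illustration of GMRES looks as follows (the small least squares problem is solved with backslash rather than with an updating QR factorization):

    function x = gmres_sketch(Afun, b, nmax, tol)
    normb = norm(b);
    Q = b/normb;  H = [];
    for n = 1:nmax
        v = Afun(Q(:,n));                         % Arnoldi step n
        for j = 1:n
            H(j,n) = Q(:,j)'*v;
            v = v - H(j,n)*Q(:,j);
        end
        H(n+1,n) = norm(v);
        Q(:,n+1) = v / H(n+1,n);
        e1 = [normb; zeros(n,1)];                 % ||b|| e_1
        y = H(1:n+1,1:n) \ e1;                    % small (n+1) x n least squares problem
        if norm(H(1:n+1,1:n)*y - e1)/normb < tol  % relative residual ||r_n|| / ||b||
            break
        end
    end
    x = Q(:,1:n)*y;
    end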
The computational cost for the GMRES algorithm depends on the cost of the
method used for the least squares problem and that of the Arnoldi iteration. Since
the least squares system matrix is of size (n + 1) × n it can be done in O(n2 ) floating
point operations. If an updating QR factorization (which we did not discuss) is used,
then even O(n) is possible. Arnoldi iteration takes O(mn) operations plus the cost
for matrix-vector multiplications. These can usually be accomplished at somewhere
between O(m) and O(m2 ) flops.
In practice the iteration is stopped as soon as the relative residual satisfies
‖r_n‖ / ‖b‖ < tol,
where, e.g., tol = 10⁻⁶ or 10⁻⁸. The Matlab code GMRESDemo.m illustrates this.
The test problems are given by 200 × 200 systems with random matrices and random right-hand side vectors b. The first test matrix is a matrix A whose entries are independent samples from the real normal distribution with mean 2 and standard deviation 0.5/√200. Its eigenvalues are clustered in a disk in the complex plane of radius 1/2 centered at z = 2. The other test matrices have different eigenvalue distributions. They are obtained as
• B = A + D, where the entries of the diagonal matrix D are the complex numbers
  d_k = (−2 + 2 sin θ_k) + i cos θ_k,   θ_k = k/(m − 1),   0 ≤ k ≤ m − 1.
• C, a random matrix whose eigenvalues are loosely clustered in the unit disk.
• D = CᵀC, a random symmetric positive definite matrix (with real positive eigenvalues).
15 Conjugate Gradients
This method for symmetric positive definite matrices is considered to be the “original”
Krylov subspace method. It was proposed by Hestenes and Stiefel in 1952, and is
motivated by the following theorem.
This theorem states that, for symmetric positive definite A, solving Ax = b is equivalent to minimizing the quadratic form ϕ(x) = (1/2) xᵀAx − xᵀb (the equivalence is verified below). Thus, we see that ϕ(x + αp), as a quadratic function in α with positive leading coefficient (1/2) pᵀAp, will have to have a minimum along the ray x + αp.
We now decide what the value of α at this minimum is. A necessary condition (and
also sufficient since the coefficient of α² is positive) is
d/dα ϕ(x + αp) = 0.
To this end we compute
d/dα ϕ(x + αp) = pᵀ(Ax − b) + α pᵀAp,
which has its root at
α̂ = pᵀ(b − Ax) / ( pᵀAp ).
The corresponding minimum value is
ϕ(x + α̂p) = ϕ(x) − ( pᵀ(b − Ax) )² / ( 2 pᵀAp ),
where the subtracted term is nonnegative.
The decrease is strict whenever pᵀ(b − Ax) ≠ 0, i.e., p is not orthogonal to the residual r = b − Ax.
To see the equivalence with the solution of the linear system Ax = b we need to
consider two possibilities:
1. x is such that Ax = b. Then ϕ(x + α̂p) = ϕ(x) and ϕ(x) is the minimum value.
2. x is such that Ax ≠ b. Then ϕ(x + α̂p) < ϕ(x), i.e., there exists a direction p such that pᵀ(b − Ax) ≠ 0 and ϕ(x) is not the minimum.
This suggests the following generic descent iteration:
Take x_0 = 0, r_0 = b, p_0 = r_0
for n = 1, 2, 3, . . .
    α_n = p_{n−1}ᵀ r_{n−1} / ( p_{n−1}ᵀ A p_{n−1} )
    x_n = x_{n−1} + α_n p_{n−1}
    r_n = r_{n−1} − α_n A p_{n−1}
    choose a new search direction p_n
end
Note that at this point we have not specified how to pick the search directions pn .
This will be the crucial ingredient in the algorithm.
The steepest descent method chooses the search direction
p_n = −∇ϕ(x_n),
since we know from calculus that the direction of largest decrease of ϕ is the direction opposite its gradient. Moreover, since ϕ(x) = (1/2) xᵀAx − xᵀb we have
∇ϕ(x) = Ax − b = −r,
so that p_n = r_n, the current residual. This leads to
Algorithm (Steepest Descent)
Take x_0 = 0, r_0 = b, p_0 = r_0
for n = 1, 2, 3, . . .
    α_n = p_{n−1}ᵀ r_{n−1} / ( p_{n−1}ᵀ A p_{n−1} )
    x_n = x_{n−1} + α_n p_{n−1}
    r_n = r_{n−1} − α_n A p_{n−1}
    p_n = r_n
end
Note that for this choice of search direction the step length α can also be written
as
α_n = r_{n−1}ᵀ r_{n−1} / ( p_{n−1}ᵀ A p_{n−1} ).
The conjugate gradient method instead updates the search direction so that successive directions are A-conjugate (cf. Theorem 15.2 below):
Algorithm (Conjugate Gradient)
Take x_0 = 0, r_0 = b, p_0 = r_0
for n = 1, 2, 3, . . .
    α_n = r_{n−1}ᵀ r_{n−1} / ( p_{n−1}ᵀ A p_{n−1} )
    x_n = x_{n−1} + α_n p_{n−1}
    r_n = r_{n−1} − α_n A p_{n−1}
    Compute a gradient correction factor
    β_n = r_nᵀ r_n / ( r_{n−1}ᵀ r_{n−1} )
    p_n = r_n + β_n p_{n−1}
end
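A compact Matlab sketch of the conjugate gradient algorithm, with a simple relative-residual stopping test added for practical use:

    function x = cg_sketch(A, b, nmax, tol)
    x = zeros(size(b));  r = b;  p = r;
    for n = 1:nmax
        Ap    = A*p;                      % the one matrix-vector product per iteration
        alpha = (r'*r) / (p'*Ap);
        x     = x + alpha*p;
        rnew  = r - alpha*Ap;
        if norm(rnew) < tol*norm(b), break, end
        beta  = (rnew'*rnew) / (r'*r);
        p     = rnew + beta*p;
        r     = rnew;
    end
    end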
For both the steepest descent and the conjugate gradient algorithm the main com-
putational cost is hidden in the one matrix-vector multiplication that is required per
iteration. As with the Arnoldi and GMRES methods, this operation is treated as a
“black box” and can be accomplished in O(m) to O(m2 ) operations depending on the
structure of A. In many practical cases the entire (preconditioned) CG algorithm will
require only O(m) operations. This is very fast.
As mentioned at the beginning of this section, one can also establish a connection
to Krylov subspace methods.
Theorem 15.2 Let A be symmetric positive definite. As long as the conjugate gradient
method has not yet converged (i.e., as long as r_{n−1} ≠ 0) we have
K_n = span{x_1, . . . , x_n} = span{p_0, . . . , p_{n−1}} = span{r_0, . . . , r_{n−1}} = span{b, Ab, . . . , A^{n−1}b},
and the residuals and search directions satisfy
r_nᵀ r_j = 0   and   p_nᵀ A p_j = 0,   j < n.
Proof An inductive proof of this theorem can be found in the book [Trefethen/Bau].
Theorem 15.3 Let A be symmetric positive definite. If the conjugate gradient algo-
rithm has not yet converged (i.e., r n−1 6= 0) then xn is the unique vector in Kn such
that ken kA is minimized.
Moreover, ken kA ≤ ken−1 kA and (if we are using exact arithmetic) en = 0 for
some n ≤ m.
Proof We will prove the first part only. From the previous theorem we know that the
approximate solution xn lies in the Krylov subspace Kn . In order to show that xn is
the unique minimizer of ‖e‖_A we consider an arbitrary vector
x = x_n − Δx ∈ K_n
with corresponding error
e = x* − x = x* − x_n + Δx = e_n + Δx.
Therefore,
‖e‖²_A = (e_n + Δx)ᵀ A (e_n + Δx) = e_nᵀ A e_n + 2 e_nᵀ A (Δx) + (Δx)ᵀ A (Δx),
since A is symmetric.
Next we realize that the middle term vanishes:
e_nᵀ A (Δx) = (x* − x_n)ᵀ A (Δx) = (b − A x_n)ᵀ (Δx) = r_nᵀ (Δx) = 0,
because Δx ∈ K_n and, by the previous theorem, r_n is orthogonal to K_n = span{r_0, . . . , r_{n−1}}. Thus
‖e‖²_A = ‖e_n‖²_A + (Δx)ᵀ A (Δx).
Note that the quadratic form (Δx)ᵀ A (Δx) is certainly non-negative since A is positive definite. Moreover, it is zero only if Δx = 0.
Thus, the A-norm of the error is minimized if ∆x = 0, i.e., for the CG approximate
solution xn .
A short computation relates the A-norm of the error to the quadratic form ϕ:
‖e_n‖²_A = (x* − x_n)ᵀ A (x* − x_n) = x_nᵀ A x_n − 2 x_nᵀ b + (x*)ᵀ b = 2ϕ(x_n) + (x*)ᵀ b.
Here ϕ(x_n) is the same quadratic form used earlier. Since (x*)ᵀb is a constant we see that minimizing the A-norm of the error is equivalent to minimizing the quadratic form ϕ(x_n).
For the convergence rate of the CG algorithm one can show that
‖e_n‖_A ≤ ‖e_0‖_A ( (√κ − 1) / (√κ + 1) )ⁿ,
where κ = κ(A) is the condition number of A. From this bound we see that convergence will be very slow if κ is large. This shows that preconditioning efforts for the CG algorithm are aimed at reducing the condition number of A.
For a moderate size of κ it turns out that one can expect convergence of the CG algorithm in O(√κ) iterations. In fact, in practice the CG algorithm often converges faster than predicted by this upper bound.
16 Preconditioning
The general idea underlying any preconditioning procedure for iterative solvers is to
modify the (ill-conditioned) system
Ax = b
in such a way that we obtain an equivalent system Âx̂ = b̂ for which the iterative
method converges faster.
A standard approach is to use a nonsingular matrix M , and rewrite the system as
M −1 Ax = M −1 b.
Even if A and M are symmetric positive definite, the product M⁻¹A is in general not symmetric, so the CG algorithm cannot be applied to this system directly. If, however, we factor M⁻¹ = LLᵀ, then we can symmetrize:
Ax = b ⟺ M⁻¹Ax = M⁻¹b
     ⟺ LᵀAx = Lᵀb
     ⟺ (LᵀAL)(L⁻¹x) = Lᵀb,
i.e., Âx̂ = b̂ with Â = LᵀAL, x̂ = L⁻¹x, and b̂ = Lᵀb. In the following we also use the transformed search directions and preconditioned residuals
p̂_n = L⁻¹p_n,
r̃_n = M⁻¹r_n.
Now we can consider how this transforms the CG algorithm (for the hatted quan-
tities). The initialization becomes x̂0 = L−1 x0 = 0 and
r̂_0 = b̂ ⟺ Lᵀr_0 = Lᵀb ⟺ r_0 = b.
The initial search direction becomes
p̂_0 = r̂_0 ⟺ L⁻¹p_0 = Lᵀr_0
        ⟺ p_0 = LLᵀr_0 = M⁻¹r_0 = r̃_0,
where we have used the definition of the preconditioner M. The step length α̂ transforms as follows:
α̂_n = r̂_{n−1}ᵀ r̂_{n−1} / ( p̂_{n−1}ᵀ Â p̂_{n−1} )
    = (Lᵀr_{n−1})ᵀ(Lᵀr_{n−1}) / ( (L⁻¹p_{n−1})ᵀ (LᵀAL) (L⁻¹p_{n−1}) )
    = r_{n−1}ᵀ L Lᵀ r_{n−1} / ( p_{n−1}ᵀ A p_{n−1} )
    = r_{n−1}ᵀ r̃_{n−1} / ( p_{n−1}ᵀ A p_{n−1} ),
since LLᵀ = M⁻¹. Finally,
x̂_n = x̂_{n−1} + α̂_n p̂_{n−1} ⟺ L⁻¹x_n = L⁻¹x_{n−1} + α̂_n L⁻¹p_{n−1}
                            ⟺ x_n = x_{n−1} + α̂_n p_{n−1},
where we have multiplied by L and used the definition of M in the penultimate step.
The resulting algorithm is given by
Algorithm (Preconditioned Conjugate Gradient)
Take x_0 = 0, r_0 = b
Solve M r̃_0 = r_0 for r̃_0
Let p_0 = r̃_0
for n = 1, 2, 3, . . .
    α_n = r_{n−1}ᵀ r̃_{n−1} / ( p_{n−1}ᵀ A p_{n−1} )
    x_n = x_{n−1} + α_n p_{n−1}
    r_n = r_{n−1} − α_n A p_{n−1}
    Solve M r̃_n = r_n for r̃_n
    β_n = r_nᵀ r̃_n / ( r_{n−1}ᵀ r̃_{n−1} )
    p_n = r̃_n + β_n p_{n−1}
end
Remark This algorithm requires the additional work that is needed to solve the linear
system M r̃ n = r n once per iteration. Therefore we will want to choose M so that this
can be done easily and efficiently.
In choosing M there is a trade-off: the better M approximates A, the faster the iteration converges, but the system M r̃_n = r_n must remain cheap to solve. In addition,
1. M should be symmetric and positive definite.
Popular choices based on the classical splitting A = L + D + U are
M = D: Jacobi preconditioning,
M = L + D: Gauss-Seidel preconditioning,
M = (1/ω)(D + ωL): SOR preconditioning.
Remark The Matlab script PCGDemo.m illustrates the convergence behavior of the preconditioned conjugate gradient algorithm. The matrix A here is a 1000 × 1000 symmetric positive definite matrix with all zeros except a_{ii} = 0.5 + √i on the diagonal, a_{ij} = 1 on the sub- and superdiagonal, and a_{ij} = 1 on the 100th sub- and superdiagonals, i.e., for |i − j| = 100. The right-hand side vector is b = [1, . . . , 1]ᵀ. We observe
that the basic CG algorithm converges very slowly, whereas the Jacobi-preconditioned
method converges much faster.
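A hedged sketch of this experiment using Matlab's built-in pcg (the actual script PCGDemo.m may be organized differently):

    m = 1000;
    e = ones(m,1);
    A = spdiags([e, e, 0.5+sqrt((1:m)'), e, e], [-100, -1, 0, 1, 100], m, m);
    b = ones(m,1);
    [x0, fl0, rr0, it0] = pcg(A, b, 1e-10, 400);        % plain CG
    M = spdiags(diag(A), 0, m, m);                      % Jacobi preconditioner M = D
    [x1, fl1, rr1, it1] = pcg(A, b, 1e-10, 400, M);     % Jacobi-preconditioned CG
    fprintf('CG: %d iterations, PCG: %d iterations\n', it0, it1)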
17 Solution of Nonlinear Systems
We now discuss the solution of systems of nonlinear equations. An important ingredient
will be the multivariate Taylor theorem.
The multivariate Taylor theorem states that, for sufficiently smooth f : Rᵐ → R,
f(x + h) = Σ_{k=0}^{n} (1/k!) (hᵀ∇)ᵏ f(x) + R_n(h),
where
R_n(h) = ( 1/(n + 1)! ) (hᵀ∇)^{n+1} f(x + θh)
with 0 < θ < 1 and ∇ = [∂/∂x_1, ∂/∂x_2, . . . , ∂/∂x_m]ᵀ.
We seek a solution of the system of m nonlinear equations in m unknowns
f_1(x_1, x_2, . . . , x_m) = 0,
f_2(x_1, x_2, . . . , x_m) = 0,
      ⋮                      (46)
f_m(x_1, x_2, . . . , x_m) = 0.
To derive Newton’s method for this problem we assume z = [z1 , z2 , . . . , zm ]T is a
solution (or root) of (46), i.e., z satisfies
fi (z) = 0, i = 1, . . . , m.
If x is an approximation to the solution, we write
x + h = z
and expand each equation about x, keeping only the linear terms of Taylor's theorem:
0 = f_i(z) = f_i(x + h) ≈ f_i(x) + (hᵀ∇) f_i(x),   i = 1, . . . , m. (47)
Recall that h = [h_1, . . . , h_m]ᵀ is the unknown Newton update, and note that (47) is a linear system for h of the form
J(x) h = −f(x),
where J(x) is the Jacobian matrix of f with entries [J(x)]_{ij} = ∂f_i/∂x_j (x).
Input f , J, x(0)
for k = 0, 1, 2, . . . do
Solve J(x(k) )h(k) = −f (x(k) ) for h(k)
Update x(k+1) = x(k) + h(k)
end
Output x(k+1)
Formally the iteration can be written as
x^{(k+1)} = x^{(k)} − [J(x^{(k)})]⁻¹ f(x^{(k)}),
which looks just like the Newton iteration formula for the single equation/single variable case.
Example Solve
x² + y² = 4
xy = 1,
which corresponds to finding the intersection points of a circle and a hyperbola in the
plane. Here
f(x, y) = [ f_1(x, y) ; f_2(x, y) ] = [ x² + y² − 4 ; xy − 1 ]
and
J(x, y) = [ ∂f_1/∂x  ∂f_1/∂y ; ∂f_2/∂x  ∂f_2/∂y ] (x, y) = [ 2x  2y ; y  x ].
This example is illustrated in the Matlab script run newtonmv.m.
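A hedged Matlab sketch of the Newton iteration for this example (the script run newtonmv.m may differ):

    f = @(v) [ v(1)^2 + v(2)^2 - 4 ;  v(1)*v(2) - 1 ];
    J = @(v) [ 2*v(1), 2*v(2) ;  v(2), v(1) ];
    v = [2; 1];                        % initial guess
    for k = 1:10
        h = -J(v)\f(v);                % solve J(x^(k)) h^(k) = -f(x^(k))
        v = v + h;                     % x^(k+1) = x^(k) + h^(k)
        if norm(h) < 1e-12, break, end
    end
    v                                  % converges to one of the four intersection points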
Remark 1. Newton’s method requires the user to input the m×m Jacobian matrix
(which depends on the specific nonlinear system to be solved). This is rather
cumbersome.
17.1 Basic Fixed-point Iteration
We illustrate the use of a general fixed-point algorithm with several examples in the Maple worksheet 577 fixedpointsMV.mws. However, as we well know, rewriting a given system in a form x = g(x) for which the iteration converges may not always be possible, and even if it is, convergence may be very slow. Sometimes we can use a Gauss-Seidel like strategy to accelerate convergence.
A multivariate version of the Contractive Mapping Theorem guarantees convergence: if g maps a closed set D ⊆ Rᵐ into itself and satisfies ‖g(x) − g(y)‖ ≤ L‖x − y‖ for all x, y ∈ D with a constant L < 1, then g has a unique fixed point in D, and the fixed-point iteration converges to it for any starting value in D.
A quasi-Newton alternative that avoids forming the Jacobian is Broyden's method, which is based on the following lemma (the Sherman-Morrison formula).
Lemma 17.4 Let A be a nonsingular m×m matrix and x, y ∈ Rm . Then (A+xy T )−1
exists provided that y T A−1 x 6= −1. Moreover,
(A + xyᵀ)⁻¹ = A⁻¹ − ( A⁻¹ x yᵀ A⁻¹ ) / ( 1 + yᵀ A⁻¹ x ). (49)
The Sherman-Morrison formula (49) can be used to compute the inverse of a matrix
A(k+1) obtained by a rank-1 update xy T from A(k) , i.e.,
[A^{(k+1)}]⁻¹ = [A^{(k)}]⁻¹ − ( [A^{(k)}]⁻¹ x yᵀ [A^{(k)}]⁻¹ ) / ( 1 + yᵀ [A^{(k)}]⁻¹ x ). (50)
Thus, if A(k+1) is a rank-1 modification of A(k) then we need not recompute the inverse
of A(k+1) , but instead can obtain it by updating the inverse of A(k) (available from
previous computations) via (50).
The algorithm for Broyden’s method is
Algorithm
Input f , x(0) , B (0)
for k = 0, 1, 2, . . . do
h(k) = −B (k) f (x(k) )
x(k+1) = x(k) + h(k)
z (k) = f (x(k+1) ) − f (x(k) )
B^{(k+1)} = B^{(k)} − ( ( B^{(k)} z^{(k)} − h^{(k)} ) (h^{(k)})ᵀ B^{(k)} ) / ( (h^{(k)})ᵀ B^{(k)} z^{(k)} )
end
Output x(k+1)
Remark 1. Only m scalar function evaluations are required per iteration along
with O(m2 ) floating point operations for matrix-vector products.
2. One can usually use B (0) = I to start the iteration.
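A hedged Matlab sketch of Broyden's method applied to the same circle/hyperbola test system used for Newton's method above:

    f = @(v) [ v(1)^2 + v(2)^2 - 4 ;  v(1)*v(2) - 1 ];
    v = [2; 1];
    B = eye(2);                                   % B^(0) = I
    for k = 1:50
        h    = -B*f(v);
        vnew = v + h;
        z    = f(vnew) - f(v);
        B    = B - ((B*z - h)*(h'*B)) / (h'*B*z); % rank-1 update of the approximate inverse Jacobian
        v    = vnew;
        if norm(h) < 1e-12, break, end
    end
    v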
In order to see how the formula for B (k+1) in the algorithm is related to (50) we
define
J(x^{(k)}) ≈ A^{(k)} = [B^{(k)}]⁻¹,
x = ( z^{(k)} − [B^{(k)}]⁻¹ h^{(k)} ) / ‖h^{(k)}‖²₂,
y = h^{(k)}.
Then (50) becomes
B^{(k+1)} = B^{(k)} − ( B^{(k)} [ ( z^{(k)} − [B^{(k)}]⁻¹ h^{(k)} ) / ‖h^{(k)}‖²₂ ] (h^{(k)})ᵀ B^{(k)} ) / ( 1 + (h^{(k)})ᵀ B^{(k)} ( z^{(k)} − [B^{(k)}]⁻¹ h^{(k)} ) / ‖h^{(k)}‖²₂ )
          = B^{(k)} − ( ( B^{(k)} z^{(k)} − h^{(k)} ) (h^{(k)})ᵀ B^{(k)} ) / ( ‖h^{(k)}‖²₂ + (h^{(k)})ᵀ B^{(k)} z^{(k)} − (h^{(k)})ᵀ B^{(k)} [B^{(k)}]⁻¹ h^{(k)} )
          = B^{(k)} − ( ( B^{(k)} z^{(k)} − h^{(k)} ) (h^{(k)})ᵀ B^{(k)} ) / ( ‖h^{(k)}‖²₂ + (h^{(k)})ᵀ B^{(k)} z^{(k)} − ‖h^{(k)}‖²₂ ),
which is the same as the formula for B (k+1) used in the algorithm.
To see why Broyden’s method can be interpreted as a variant of the secant method,
we multiply the formula used to update B (k) in the algorithm by z (k) , i.e.,
B^{(k+1)} z^{(k)} = B^{(k)} z^{(k)} − ( ( B^{(k)} z^{(k)} − h^{(k)} ) (h^{(k)})ᵀ B^{(k)} z^{(k)} ) / ( (h^{(k)})ᵀ B^{(k)} z^{(k)} ) = h^{(k)},
i.e., B^{(k+1)} satisfies the secant condition B^{(k+1)} z^{(k)} = h^{(k)} (equivalently, A^{(k+1)} h^{(k)} = z^{(k)}).
17.3 Using the Steepest Descent and Conjugate Gradients with Non-
linear Systems
We now discuss the connection between solving systems of nonlinear equations and quadratic minimization problems. The idea is to minimize the 2-norm of the residual of (46), i.e., to find x = [x_1, . . . , x_m]ᵀ such that
g(x_1, . . . , x_m) = (1/2) Σ_{i=1}^{m} f_i(x_1, . . . , x_m)² = (1/2) f(x)ᵀ f(x)
is minimized.
For the steepest descent method we use
x^{(k+1)} = x^{(k)} + λ_k ( −∇g(x^{(k)}) ).
The stepsize λ_k is computed such that
g(x^{(k+1)}) = g( x^{(k)} − λ_k ∇g(x^{(k)}) ) =: γ(λ_k)
is minimized. This is an easier problem to solve since it involves only one variable, λ_k.
Note that since g(x) = (1/2) f(x)ᵀ f(x) we have ∇g(x) = [J(x)]ᵀ f(x). This shows that
this approach also requires knowledge of the Jacobian. A general line search algorithm
is
Algorithm
Input f, J, x^{(0)}
for k = 0, 1, 2, . . . do
    h^{(k)} = −∇g(x^{(k)}) = −[J(x^{(k)})]ᵀ f(x^{(k)})
    Find λ_k that (approximately) minimizes γ(λ_k) = g(x^{(k)} + λ_k h^{(k)})
    Update x^{(k+1)} = x^{(k)} + λ_k h^{(k)}
end
Output x^{(k+1)}
It is not easy to come up with a good line search strategy. However, one practical way to determine a reasonable value for λ_k is to use a quadratic interpolating polynomial. The idea is to work with three values λ_k^{(1)}, λ_k^{(2)} and λ_k^{(3)} and construct a quadratic polynomial that interpolates the univariate function γ at these points. If we guess these three values reasonably close to the optimum, then the minimum of the parabola on the interval of interest (which is easy to find) tells us how to pick λ_k. For example, one can take λ_k^{(1)} = 0, then find a λ_k^{(3)} such that γ(λ_k^{(3)}) < γ(λ_k^{(1)}), and then take λ_k^{(2)} as the midpoint of the previous two values, i.e., λ_k^{(2)} = λ_k^{(3)}/2.
A similar (but interactive) strategy is implemented in the Matlab example ShowGN VL.m
explained in the next section.
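As a hedged illustration of the three-point strategy, the following Matlab sketch performs a single steepest descent step for the circle/hyperbola test system used earlier (polyfit is used to construct the interpolating parabola):

    f = @(v) [ v(1)^2 + v(2)^2 - 4 ;  v(1)*v(2) - 1 ];
    J = @(v) [ 2*v(1), 2*v(2) ;  v(2), v(1) ];
    g = @(v) 0.5*(f(v)'*f(v));
    x = [2; 1];
    d = -J(x)'*f(x);                       % h^(k) = -grad g(x^(k))
    gamma = @(lam) g(x + lam*d);           % univariate function to be minimized
    l1 = 0;  l3 = 1;
    while gamma(l3) >= gamma(l1)           % shrink until gamma(l3) < gamma(l1)
        l3 = l3/2;
    end
    l2 = l3/2;                             % midpoint of the previous two values
    p = polyfit([l1 l2 l3], [gamma(l1) gamma(l2) gamma(l3)], 2);
    lambda = -p(2)/(2*p(1));               % vertex of the interpolating parabola
    x = x + lambda*d;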
17.4 The Gauss-Newton Method
We now consider a possibly overdetermined system of m nonlinear equations in n unknowns (m ≥ n):
f_1(x_1, x_2, . . . , x_n) = 0,
f_2(x_1, x_2, . . . , x_n) = 0,
      ⋮                      (51)
f_m(x_1, x_2, . . . , x_n) = 0.
A problem like this often arises when the components of x are certain control parame-
ters in an objective function that needs to be optimized. Below we will see an example
where we need to fit the parameters to a nonlinear system describing the orbit of a
planet.
In general such a system cannot be satisfied exactly, so instead we minimize the residual
g(x) = (1/2) Σ_{i=1}^{m} f_i(x)² = (1/2) f(x)ᵀ f(x). (52)
As mentioned at the end of the previous section, we can use Newton's method to solve this problem since a necessary condition for finding a minimal residual (52) is that the gradient of g be zero. Thus, we are attempting to find a solution of ∇g(x) = 0.
Since Newton’s method is given by
x^{(k+1)} = x^{(k)} − [ J_{∇g}(x^{(k)}) ]⁻¹ ∇g(x^{(k)}),
we see that we actually need not only the gradient of g (which leads to the Jacobian of f) but also the second derivative (which leads to the Hessian).
For the particular minimization problem (52) this results in
x^{(k+1)} = x^{(k)} − [ J(x^{(k)})ᵀ J(x^{(k)}) + Σ_{i=1}^{m} f_i(x^{(k)}) ∇² f_i(x^{(k)}) ]⁻¹ J(x^{(k)})ᵀ f(x^{(k)}),
where the matrix in square brackets is the Hessian of g and J now denotes the regular (but rectangular) Jacobian with respect to the functions f_1, . . . , f_m.
Since we decided earlier that it would be a good idea to avoid computation of
the Jacobian, clearly, computation of the Hessian should be avoided if possible. This
motivates the Gauss-Newton method. If we drop the summation term in the expression
for the Hessian above, then all we need to know is the Jacobian. This step is justified
if we are reasonably close to a solution since then fi (x(k) ) will be close to zero.
Thus, the Gauss-Newton algorithm is given by
Algorithm (Gauss-Newton)
Input f, J, x^{(0)}
for k = 0, 1, 2, . . . do
    Solve J(x^{(k)})ᵀ J(x^{(k)}) h^{(k)} = −J(x^{(k)})ᵀ f(x^{(k)}) for h^{(k)}
    Update x^{(k+1)} = x^{(k)} + h^{(k)}
end
Output x^{(k+1)}
Note that the update h^{(k)} here is in fact nothing but the solution to the normal equations for the (linear) least squares problem
‖ J(x^{(k)}) h + f(x^{(k)}) ‖_2 → min.
Therefore, any of our earlier algorithms (modified Gram-Schmidt, QR, SVD) can be used to perform this step.
Example Let’s consider the nonlinear system
x(t) = (P − A)/2 + ((A + P)/2) cos(t),
y(t) = √(AP) sin(t)
that describes the elliptical orbit of a planet around the sun in a two-dimensional
universe. Here A and P denote the maximum and minimum orbit-to-sun distances,
respectively.
In order to estimate the parameters of the orbit of a specific planet we use another
relationship between the parameters and the length of the orbit, r(θ), where θ is the
angle to the positive x-axis. This relationship is given by
r(θ) = 2AP / ( P(1 − cos(θ)) + A(1 + cos(θ)) ).
Given measurements r_i of the orbit at angles θ_i, i = 1, . . . , m, we define
f_i(A, P) = r_i − 2AP / ( P(1 − cos(θ_i)) + A(1 + cos(θ_i)) ),   i = 1, . . . , m.
Then the best choices of A and P will be given by minimizing the function
g(A, P) = (1/2) Σ_{i=1}^{m} f_i(A, P)².
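A hedged Matlab sketch of this fit with synthetic data (the measurement values r_i, the noise level, and the starting guess are made up for the illustration; the partial derivatives in the Jacobian were obtained by differentiating f_i with respect to A and P by hand):

    m = 50;
    theta = linspace(0, 2*pi, m)';
    Atrue = 5;  Ptrue = 1;
    r = 2*Atrue*Ptrue ./ (Ptrue*(1-cos(theta)) + Atrue*(1+cos(theta)));
    r = r + 0.01*randn(m,1);                       % synthetic noisy measurements r_i
    p = [3; 2];                                    % initial guess [A; P]
    for k = 1:30
        A = p(1);  P = p(2);
        D = P*(1-cos(theta)) + A*(1+cos(theta));
        f = r - 2*A*P./D;
        J = [ -2*P^2*(1-cos(theta))./D.^2 ,  -2*A^2*(1+cos(theta))./D.^2 ];
        h = -J\f;                                  % Gauss-Newton step: linear least squares
        p = p + h;
        if norm(h) < 1e-10, break, end
    end
    p                                              % estimated [A; P]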