
1 Fundamentals

1.0 Preliminaries
The first question we want to answer is: What is “computational mathematics”?
One possible definition is: “The study of algorithms for the solution of computa-
tional problems in science and engineering.”
Other names for roughly the same subject are numerical analysis or scientific com-
puting.
What is it we are looking for in these algorithms? We want algorithms that are

• fast,

• stable and reliable,

• accurate.

Note: One could also study hardware issues such as computer architecture and its
effects, or software issues such as efficiency of implementation on a particular hardware
or in a particular programming language. We will not do this.
What sort of problems are typical?

Example The Poisson problem provides the basis for many different algorithms for
the numerical solution of differential equations, which in turn lead to the need for many
algorithms in numerical linear algebra. Consider

−∇²u(x, y) = −[u_xx(x, y) + u_yy(x, y)] = f(x, y)   in Ω = [0, 1]²,
u(x, y) = 0   on ∂Ω.

One possible algorithm for the numerical solution of this problem is based on the
following discretization of the Laplacian:
∇²u(x_j, y_k) ≈ (u_{j−1,k} + u_{j,k−1} + u_{j+1,k} + u_{j,k+1} − 4u_{j,k}) / h²,     (1)

where the unit square is discretized by a set of (n + 1)² equally spaced points (x_j, y_k), j, k = 0, . . . , n, and h = 1/n. Also, we use the abbreviation u_{j,k} = u(x_j, y_k).
Formula (1) is a straightforward generalization to two dimensions of the linear approximation

u′(x) ≈ (u(x + h) − u(x)) / h.
If we visit all of the (n − 1)² interior grid points and write down the equation resulting from the discretization of the PDE, then we obtain the following system of linear equations

4u_{j,k} − u_{j−1,k} − u_{j,k−1} − u_{j+1,k} − u_{j,k+1} = f_{j,k}/n²,   j, k = 1, . . . , n − 1,
along with the discrete boundary conditions

u_{j,0} = u_{j,n} = u_{0,k} = u_{n,k} = 0,   j, k = 0, 1, . . . , n.
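A minimal Matlab sketch of this discretization (not part of the original notes; the grid size n and the right-hand side f are made up for illustration) assembles the sparse 5-point system and solves it with the backslash operator:

n = 50;  h = 1/n;                          % grid parameter, h = 1/n
f = @(x,y) 2*pi^2*sin(pi*x).*sin(pi*y);    % sample right-hand side with known solution
e = ones(n-1,1);
T = spdiags([-e 2*e -e], -1:1, n-1, n-1);  % 1D second-difference matrix
I = speye(n-1);
A = kron(I,T) + kron(T,I);                 % 4 on the diagonal, -1 for the four neighbors
[x,y] = ndgrid(h*(1:n-1));                 % interior grid points (x_j, y_k)
b = h^2 * f(x(:),y(:));                    % right-hand side f_{j,k}/n^2
u = A \ b;                                 % solve the sparse linear system
exact = sin(pi*x).*sin(pi*y);              % exact solution of this model problem
relerr = norm(u - exact(:)) / norm(exact(:))   % relative error, decreases like O(h^2)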

Jacobi          10^13 operations        1845
Gauss-Seidel    5 × 10^12 operations    1832
SOR             10^10 operations        1950
FFT             1.5 × 10^8 operations   1965
multigrid       10^8 operations         1979

Table 1: Improvements of algorithms for Poisson problem.

Z3 1 flops 1941
Intel Paragon 10 Gflops 1990
NEC Earth Simulator (5120 processors) 40 Tflops 2002
IBM Blue Gene/L (131072 processors) 367 Tflops 2006

Table 2: Improvements of hardware for Poisson problem.

This method is known as the finite difference method.


In order to obtain a relative error of 10^-4 using finite differences one needs about n = 1000, i.e., 10^6 points. Therefore, one needs to solve a 10^6 × 10^6 sparse system of linear equations. Note that the system is indeed sparse since each row of the system matrix contains at most 5 nonzero entries.
Using the state-of-the-art algorithms and hardware of 1940 it would have taken one of the first computers about 300,000 years to solve this problem with the desired accuracy. In 1990, on the other hand, it took about 1/100 second. In fact, the Earth Simulator (the fastest computer available in 2002) was able to solve a dense system of 10^6 linear equations in 10^6 unknowns in less than 6 hours using Fortran and MPI code.
This example is typical and shows – in addition to the huge improvements possible
by advances in both software and hardware – that using numerical methods we can
usually expect only an approximate solution.

As just noted, errors are introduced in a variety of ways:

• through discretization, i.e., by converting a continuous problem to a discrete one,

• through floating-point representations and roundoff errors,

• through the nature of certain algorithms (e.g., iterative vs. direct).

While other sources of errors also exist (such as measurement errors in experiments),
we will focus on the above three sources.

1.1 Fundamentals from Linear Algebra


1.1.1 Basic Definitions
Definition 1.1 A vector space (or linear space) V over the field C of complex numbers
consists of a set of elements (or vectors) together with two operations “+”: V ×V → V
(vector addition) and “·”: C × V → V (scalar multiplication) such that

1. For any u, v ∈ V we have u + v ∈ V , i.e., V is closed under vector addition.

2. Vector addition is associative and commutative, i.e., (u + v) + w = u + (v + w)
and u + v = v + u.

3. There is a zero vector 0 such that u + 0 = u for every u ∈ V .

4. For every u ∈ V there is a negative −u such that u + (−u) = 0.

5. For every u ∈ V and every scalar α ∈ C we have αu ∈ V , i.e., V is closed under scalar multiplication.

6. For every α, β ∈ C and every u, v ∈ V we have (α + β)u = αu + βu and α(u + v) = αu + αv (i.e., the distributive laws hold).

Remark Often we consider V as a vector space over R.

Example Standard examples we will be working with are V = R^m or V = C^m with the usual vector addition and scalar multiplication, or V = R^{m×n} or V = C^{m×n} with the usual addition of matrices and scalar multiplication.

If A ∈ C^{m×n} is an m × n matrix then A is a linear map (or linear transformation) since it satisfies

1. A(x + y) = Ax + Ay for every x, y ∈ C^n,

2. A(αx) = αAx for every x ∈ C^n and α ∈ C.


Conversely, any linear map from Cn to Cm can be associated with matrix multiplication
by a matrix in Cm×n .

Example The matrix

A = [ 1  0  0 ; 0  cos θ  −sin θ ; 0  sin θ  cos θ ]

maps R³ into R³ since it represents counterclockwise rotation about the x-axis by an angle θ.
The matrix

A = [ 1  0  0 ; 0  1  0 ]

maps R³ into R² by projecting into the x-y plane.

1.1.2 Matrix-Vector Multiplication


For the following we assume A ∈ Cm×n and x ∈ Cn . In fact, our vectors are always to
be interpreted as column vectors unless noted otherwise.
A first interpretation of a matrix-vector product b = Ax is obtained (using Matlab notation) via the representation

b(i) = Σ_{j=1}^n A(i, j) x(j) = A(i, :) x,   i = 1, . . . , m,

i.e., the i-th entry of b is given by the dot product of row i of A with x.
A second (vectorized) interpretation is obtained via

b(:) = Σ_{j=1}^n A(:, j) x(j) = Σ_{j=1}^n x(j) A(:, j),

i.e., the (entire) vector b is given as a linear combination of the columns of A.

Remark Using the first interpretation we need to perform m dot products to calculate b. With the second interpretation we compute n scalar–vector products and n − 1 vector additions.
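Both interpretations are easily coded in Matlab; the following sketch (not from the notes, with an arbitrary matrix and vector) verifies that they agree with the built-in product:

A = [1 2; 3 4; 5 6];  x = [10; 1];
b_rows = zeros(3,1);
for i = 1:3
    b_rows(i) = A(i,:) * x;              % dot product of row i of A with x
end
b_cols = zeros(3,1);
for j = 1:2
    b_cols = b_cols + x(j) * A(:,j);     % linear combination of the columns of A
end
norm(b_rows - A*x), norm(b_cols - A*x)   % both are zero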

Geometrically (based on the second approach above) we can interpret A as a linear transformation: on the one hand b represents a point in C^m with coordinates b(1), b(2), . . . , b(m) with respect to the standard basis {e1, e2, . . . , em}, whereas on the other hand x(1), x(2), . . . , x(n) are the coordinates of the same point with respect to the columns {A(:, 1), A(:, 2), . . . , A(:, n)} of A.

1.1.3 Matrix-Matrix Multiplication


If we assume that A ∈ C^{ℓ×m} and C ∈ C^{m×n} then

B = AC ∈ C^{ℓ×n}

and

B(i, j) = Σ_{k=1}^m A(i, k) C(k, j) = A(i, :) C(:, j),   i = 1, . . . , ℓ,  j = 1, . . . , n,

i.e., the ij entry of B is obtained as the dot product of row i of A with column j of C.
An alternative (vectorized) interpretation of the same matrix-matrix product is given by

B(:, j) = Σ_{k=1}^m A(:, k) C(k, j) = Σ_{k=1}^m C(k, j) A(:, k),

i.e., the j-th column of B is given as a linear combination of the columns of A.

Example Let’s take C = R ∈ C^{n×n} with

R(i, j) = 1 if i ≤ j,   R(i, j) = 0 if i > j,

i.e., an upper triangular matrix of all ones.
Then B = AR leads to

B(:, j) = Σ_{k=1}^n R(k, j) A(:, k) = Σ_{k=1}^j 1 · A(:, k) = Σ_{k=1}^j A(:, k)

since R(k, j) = 0 if k > j by definition and 1 otherwise.


Thus, column j of B is given by the sum of the first j columns of A (which is similar to the computation of the integral ∫₀ˣ f(t) dt = F(x)).

1.1.4 Range and Nullspace


The range of A (range(A) or column space of A) is given by the set of all Ax.
The following theorem is a direct consequence of the vectorized interpretation of
matrix-vector multiplication:

Theorem 1.2 The range(A) is a vector space spanned by the columns of A.

Remark The columns of A are a basis for range(A) if they are linearly independent.

The nullspace of A (null(A) or kernel of A) is given by the set of all x such that
Ax = 0.

Remark From the vectorized interpretation of the matrix-vector product Ax = 0 we know that

0 = x(1)A(:, 1) + . . . + x(n)A(:, n),

so null(A) characterizes the linear dependence of the columns of A.
In particular, if null(A) contains only the zero vector then the columns of A are linearly independent and form a basis for range(A).

The column rank of A is given by the dimension of range(A) (i.e., number of linearly
independent columns of A). The row rank is defined analogously.

Remark We always have “row rank = column rank = rank ”. Moreover, for any m × n
matrix we have
dim(null(A)) + rank(A) = n.

Example Consider the matrix

A = [ 0  1  0 ; 0  0  2 ; 0  0  0 ].

The range(A) is given by all vectors in C³ with third component zero since

Ax = A [ x(1) ; x(2) ; x(3) ] = [ x(2) ; 2x(3) ; 0 ].

Since A has two linearly independent columns (and rows) we have

rank(A) = dim(range(A)) = 2.

Moreover, { [ 1 ; 0 ; 0 ], [ 0 ; 2 ; 0 ] } is a basis for range(A).


The nullspace of A is given by all vectors with last two components zero since

A [ x ; 0 ; 0 ] = [ 0 ; 0 ; 0 ]

for any x. Alternatively, we can see that

x(1)A(:, 1) + x(2)A(:, 2) + x(3)A(:, 3) = x(1) [ 0 ; 0 ; 0 ] + x(2) [ 1 ; 0 ; 0 ] + x(3) [ 0 ; 2 ; 0 ] = 0

as soon as x(2) = x(3) = 0.


Therefore, dim(null(A)) = 1.
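The same conclusions can be checked numerically with Matlab's built-in tools (a sketch, not part of the notes):

A = [0 1 0; 0 0 2; 0 0 0];
rank(A)      % 2
orth(A)      % an orthonormal basis for range(A) (vectors with third component zero)
null(A)      % an orthonormal basis for null(A) (the first standard basis vector, up to sign)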

1.1.5 Inverse
Any square matrix A ∈ Cm×m with rank(A) = m (i.e., of full rank or nonsingular ) has
an inverse A−1 such that
AA−1 = A−1 A = I,
where I = [e1 , e2 , . . . , em ] is the m × m identity matrix.

Theorem 1.3 For A ∈ Cm×m the following are equivalent:

1. A has an inverse A−1 ,

2. rank(A) = m,

3. range(A) = Cm ,

4. null(A) = {0},

5. 0 is not an eigenvalue of A,

6. 0 is not a singular value of A,

7. det(A) ≠ 0.

Following our earlier geometric interpretations we can interpret multiplication by


A−1 as a change of basis transformation:
On the one hand Ax = b gives us the coordinates x of a point in Cm with respect to
the basis {A(:, 1), . . . , A(:, m)} and on the other hand b can be viewed as the coordinates
of the same point with respect to the standard basis {e1 , . . . , em }. Therefore

Ax = b ⇐⇒ x = A−1 b

represents a change of basis.


In other words, if we multiply b (the coordinates of our point with respect to the
standard basis) by A−1 then we obtain the coordinates x with respect to the column
space of A. Conversely, if we multiply x by A, then we transform back to the standard
basis.

Remark The solution x of the linear system Ax = b gives us the vector of coefficients
of b expanded in terms of the columns of A, i.e., we “rotate b back into the column
space of A via A−1 ”.
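The following Matlab sketch (not from the notes; the matrix and right-hand side are arbitrary) illustrates this reading of a linear solve:

A = [1 1; 0 2];  b = [3; 4];
x = A \ b;                                  % coefficients of b w.r.t. the columns of A
norm(b - (x(1)*A(:,1) + x(2)*A(:,2)))       % 0 up to roundoff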

1.2 Orthogonal Vectors and Matrices


1.2.1 Adjoint
If A ∈ C^{m×n} is an m × n matrix then A∗ ∈ C^{n×m} is called the adjoint (or conjugate transpose) of A, defined by A∗(i, j) = conj(A(j, i)) for all i, j.
If A is real then A∗ is called the transpose of A and usually denoted by A^T (or A' in Matlab).
If A = A∗ then A is called Hermitian (or symmetric if real).

Remark Note that only square matrices can be Hermitian (or real symmetric).

Applied to vectors this means that x denotes a column vector and x∗ a row vector.

1.2.2 Inner Product


If x, y ∈ C^m then

x∗y = Σ_{i=1}^m conj(x(i)) y(i)

is called the inner product (or dot product) of x and y.


The length of a vector x is given by

‖x‖ = √(x∗x) = ( Σ_{i=1}^m |x(i)|² )^{1/2}.

This is the same as the (Euclidean) or 2-norm of the vector. We will say more about
norms later.

Once an inner product is available we can define an angle α between two vectors x and y by

cos α = x∗y / (‖x‖ ‖y‖).
We summarize the following properties of inner products.

Lemma 1.4 Let x, y, z ∈ Cm and α, β ∈ C. Then

1. (x + y)∗ z = x∗ z + y ∗ z,

2. x∗ (y + z) = x∗ y + x∗ z,

3. (αx)∗(βy) = conj(α) β x∗y.

Together, we say that the inner product is bilinear.

Moreover, the definition of the inner product implies that y∗x = conj(x∗y). In the real case, however, the inner product is symmetric, i.e., y∗x = x∗y.
We can use the inner product and its properties to obtain some other properties of
matrix multiplication.

Lemma 1.5 Assume A and B are m × m matrices. Then

1. (AB)∗ = B∗A∗,

2. (AB)⁻¹ = B⁻¹A⁻¹ (provided A and B are nonsingular).

Proof We will prove item 1. Let C = AB so that C(i, j) = A(i, :)B(:, j), the inner
product of row i of A with column j of B. Now

(AB)∗(i, j) = C∗(i, j) = conj(C(j, i)) = conj(A(j, :) B(:, i))
            = Σ_{k=1}^m conj(A(j, k)) conj(B(k, i))
            = Σ_{k=1}^m B∗(i, k) A∗(k, j)
            = B∗(i, :) A∗(:, j) = (B∗A∗)(i, j).

Remark We will use the notational convention A−∗ = (A−1 )∗ = (A∗ )−1 .

1.2.3 Orthogonal Vectors
Since we defined angles earlier as

cos α = x∗y / (‖x‖ ‖y‖),

the vectors x and y are orthogonal if and only if x∗y = 0.

• If X, Y are two sets of vectors then X is orthogonal to Y if x∗y = 0 for every x ∈ X and every y ∈ Y .

• X is an orthogonal set (or simply orthogonal) if x∗y = 0 for every x, y ∈ X with x ≠ y.

• X is orthonormal if X is orthogonal and ‖x‖ = 1 for all x ∈ X.

Theorem 1.6 Orthogonality implies linear independence, i.e., if X is an orthogonal set of nonzero vectors then X is linearly independent.

Corollary 1.7 If X ⊂ C^m is an orthogonal set consisting of m nonzero vectors then X is a basis for C^m.

1.2.4 Orthogonal Decomposition of a Vector


In analogy to the Cartesian decomposition of a vector into its coordinates, i.e.,

v = v(1)e1 + . . . + v(m)em

we can find the components of an arbitrary vector v with respect to any given orthog-
onal set.
Assume Q = {q1, . . . , qn} is an orthonormal set and v ∈ C^m is an arbitrary vector with m ≥ n.
Claim: We can decompose v as

v = r + Σ_{i=1}^n (q_i∗ v) q_i

with {r, q1, . . . , qn} an orthogonal set.

Remark The vectors (q ∗i v)q i are the projections of v onto the (basis) vectors q i .

Proof We need to show that r = v − Σ_{i=1}^n (q_i∗ v) q_i is orthogonal to Q. This can be done by considering the inner product with an arbitrary member q_j of Q:

q_j∗ r = q_j∗ ( v − Σ_{i=1}^n (q_i∗ v) q_i ) = q_j∗ v − Σ_{i=1}^n (q_i∗ v) q_j∗ q_i.

Now, since Q is orthonormal we have

q_j∗ q_i = δ_ij = { 0 if i ≠ j,  1 if i = j },

where δ_ij is referred to as the Kronecker delta. Therefore only one term in the summation survives (namely the one with i = j) and we have

q_j∗ r = q_j∗ v − (q_j∗ v) · 1 = 0.

Since q_j was arbitrary we have established that r is orthogonal to the entire set Q.

Remark If Q is a basis for C^m then n = m and (since v ∈ C^m) r = 0. Therefore we get the coordinates of v with respect to Q:

v = Σ_{i=1}^m (q_i∗ v) q_i.

We now take a closer look at the projection idea. This will be very important later on. First, (q_i∗ v) q_i = q_i (q_i∗ v) since q_i∗ v is a scalar. Next, by the associativity of matrix multiplication, this is also equal to (q_i q_i∗) v. This latter expression is the product of a (rank-1) matrix and a column vector. The matrix q_i q_i∗ is known as a projection matrix.
Now we can represent the projections occurring in the decomposition of v either via a sum of vector projections

Σ_i (q_i∗ v) q_i

or by a sum of matrix projections

Σ_i (q_i q_i∗) v.

Example Compute the orthogonal decomposition of v = [1, 1, 1]∗ with respect to q1 = [1/√2, 0, 1/√2]∗, q2 = [1/√2, 0, −1/√2]∗, and q3 = [0, 1, 0]∗.
Since v ∈ R³ and q1, q2, q3 are orthonormal they form a basis for R³ and we know that r = 0.

1. In terms of vector projections we have

   v = Σ_{i=1}^3 (q_i∗ v) q_i = (2/√2) q1 + 0 q2 + 1 q3 = [ 1 ; 0 ; 1 ] + [ 0 ; 0 ; 0 ] + [ 0 ; 1 ; 0 ].

   Thus, the coordinates of v with respect to q1, q2, q3 are (√2, 0, 1).

2. In terms of matrix projections we get

   v = Σ_{i=1}^3 (q_i q_i∗) v
     = [ 1/2  0  1/2 ; 0  0  0 ; 1/2  0  1/2 ] v + [ 1/2  0  −1/2 ; 0  0  0 ; −1/2  0  1/2 ] v + [ 0  0  0 ; 0  1  0 ; 0  0  0 ] v
     = [ 1 ; 0 ; 1 ] + [ 0 ; 0 ; 0 ] + [ 0 ; 1 ; 0 ].
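A Matlab check of this example (a sketch, not part of the notes) computes both forms of the decomposition:

q1 = [1/sqrt(2); 0; 1/sqrt(2)];
q2 = [1/sqrt(2); 0; -1/sqrt(2)];
q3 = [0; 1; 0];
v  = [1; 1; 1];
Q  = [q1 q2 q3];
coeffs = Q' * v                                  % coordinates (sqrt(2), 0, 1)
v_vec = (q1'*v)*q1 + (q2'*v)*q2 + (q3'*v)*q3;    % sum of vector projections
v_mat = (q1*q1' + q2*q2' + q3*q3') * v;          % sum of matrix projections
norm(v - v_vec), norm(v - v_mat)                 % both reproduce v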

1.2.5 Unitary and Orthogonal Matrices


A square matrix Q ∈ C^{m×m} is called unitary (or orthogonal in the real case) if its columns are orthonormal, i.e., if Q∗Q = I. Equivalently, Q⁻¹ = Q∗.
In particular, this implies

q_i∗ q_j = δ_ij = { 0 if i ≠ j,  1 if i = j }.

Recall that earlier we observed that if we multiply a given vector b (the coordinates
of some point with respect to the standard basis) by A−1 then we obtain the coordinates
x with respect to the column space of A. Conversely, if we multiply x by A, then we
transform back to the standard basis.
Now, if we assume that A is a unitary matrix, i.e., A = Q and A−1 = Q∗ ,
then multiplication of b by Q∗ yields the coordinates x with respect to the basis
{Q(:, 1), . . . , Q(:, m)} = {q 1 , . . . , q m }. Conversely, multiplication of x by Q transforms
back to the standard basis {e1 , . . . , em }.

Example Take

b = [ 1 ; 1 ; 1 ]   and   Q = [ 1/√2  1/√2  0 ; 0  0  1 ; 1/√2  −1/√2  0 ]

and note that b = v and the columns of Q are given by q1, q2, q3 from the previous example.
Clearly,

Q∗ = [ 1/√2  0  1/√2 ; 1/√2  0  −1/√2 ; 0  1  0 ],

and therefore

Q∗b = [ √2 ; 0 ; 1 ]

are the coordinates of b with respect to the columns of Q.

Some properties of unitary matrices are collected in

Lemma 1.8 Assume Q ∈ Cm×m is unitary and x, y ∈ Cm . Then

1. (Qx)∗ (Qy) = x∗ y, that is angles are preserved under unitary (orthogonal) trans-
formations.

2. kQxk = kxk, that is lengths are preserved under unitary (orthogonal) transfor-
mations.

3. All eigenvalues λ of Q satisfy |λ| = 1, and therefore |det Q| = 1 (in the real orthogonal case det Q = ±1).

Remark The “+” in item 3 corresponds to rotations, and the “−” to reflections.

Proof We prove item 1:

(Qx)∗(Qy) = x∗ Q∗Q y = x∗y,

since Q∗Q = I.

1.3 Norms
1.3.1 Vector Norms
Definition 1.9 Let V be a vector space over C. A norm is a function ‖·‖ : V → R₀⁺ (the non-negative real numbers) which satisfies

1. kxk ≥ 0 for every x ∈ V , and kxk = 0 only if x = 0.

2. kαxk = |α|kxk for every x ∈ V , α ∈ C.

3. kx + yk ≤ kxk + kyk for all x, y ∈ V (triangle inequality).


Example

1. ‖x‖1 = Σ_{i=1}^m |x(i)|, the ℓ¹-norm.

2. ‖x‖2 = ( Σ_{i=1}^m |x(i)|² )^{1/2}, the ℓ²-norm or Euclidean norm.

3. ‖x‖∞ = max_{1≤i≤m} |x(i)|, the ℓ∞-norm, maximum norm or Chebyshev norm.

4. ‖x‖_p = ( Σ_{i=1}^m |x(i)|^p )^{1/p}, the ℓ^p-norm.
It is interesting to consider the corresponding unit “spheres” for these three norms,
i.e., the location of points in Rm whose distance to the origin (in the respective norm)
is equal to 1. Figure 1 illustrates this for the case m = 2.

Figure 1: Unit "circles" in R² for the ℓ¹, ℓ² and ℓ∞ norms.

Sometimes one also wants to work with weighted norms. To this end one takes a diagonal weight matrix

W = diag(w(1), w(2), . . . , w(m))

and then defines

‖x‖_W = ‖W x‖.

Example A weighted p-norm is of the form

‖x‖_{W,p} = ( Σ_{i=1}^m |w(i) x(i)|^p )^{1/p}.

1.3.2 Matrix Norms


Definition 1.10 If ‖·‖_(m) and ‖·‖_(n) are vector norms on C^m and C^n, respectively, and A ∈ C^{m×n}, then the induced matrix norm (or associated or subordinate matrix norm) is defined by

‖A‖ = sup_{x ∈ C^n, ‖x‖_(n) = 1} ‖Ax‖_(m) = sup_{x ∈ C^n, x ≠ 0} ‖Ax‖_(m) / ‖x‖_(n).

Remark 1. The notation sup in Definition 1.10 denotes the supremum or least
upper bound.

2. Often we can use the maximum instead of the supremum so that

   ‖A‖ = max_{x ∈ C^n, x ≠ 0} ‖Ax‖_(m) / ‖x‖_(n).

3. One can show that k · k satisfies items (1)–(3) in Definition 1.9, i.e., it is indeed
a norm.
4. kAk can be interpreted as the maximum factor by which A can “stretch” x.
Example The “stretch” concept can be understood graphically in R². Consider the matrix

A = [ 1  1 ; 0  1 ]

which maps R² to R².
1. By mapping the 1-norm unit circle under A we can see that the point that is
maximally stretched is (0, 1) which gets mapped into (1, 1). Thus, a vector of
1-norm length 1 is mapped to a vector with 1-norm length 2, and kAk1 = 2.
2. By mapping the 2-norm unit circle under A we can see (although this is much
harder and requires use of the singular value decomposition) that the point that is
maximally stretched is (0.5257, 0.8507) which gets mapped into (1.3764, 0.8507).
Thus, a vector of 2-norm length 1 is mapped to a vector with 2-norm length
1.6180, and kAk2 = 1.6180.
3. By mapping the ∞-norm unit circle under A we can see that the point that is
maximally stretched is (1, 1) which gets mapped into (2, 1). Thus, a vector of
∞-norm length 1 is mapped to a vector with ∞-norm length 2, and kAk∞ = 2.
How to Compute the Induced Matrix Norm
We now discuss how to compute the matrix norms induced by the popular p-norm
vector norms. Strictly speaking we now would have to use two different subscripts on
the vector norms (in addition to the index p also the (m) and (n) indicating the length
of the vectors). In order to simplify notation we omit the second subscript which can
be inferred from the context.
Consider an m × n matrix A. The most popular matrix norms can be computed as
follows:
1. ‖A‖1 = max_{1≤j≤n} ‖A(:, j)‖1 = max_{1≤j≤n} Σ_{i=1}^m |A(i, j)|.

This gives rise to the name maximum column sum norm.


2. ‖A‖2 = max_{1≤j≤n} σ_j, where σ_j is the j-th singular value of A (more later; fairly difficult to compute).

3. ‖A‖∞ = max_{1≤i≤m} ‖A(i, :)‖1 = max_{1≤i≤m} Σ_{j=1}^n |A(i, j)|.

This gives rise to the name maximum row sum norm.

Example Let

A = [ 1  1  2 ; 0  1  1 ; 1  0  2 ].

Then ‖A‖1 = 5, ‖A‖2 = 3.4385, ‖A‖∞ = 4.
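These values can be reproduced with Matlab's norm function (a sketch, not from the notes):

A = [1 1 2; 0 1 1; 1 0 2];
norm(A,1), norm(A,2), norm(A,inf)   % 5 (max column sum), 3.4385 (largest singular value), 4 (max row sum)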

We now verify that the matrix norm induced by the ℓ∞-norm is indeed given by the formula stated in item 3:

‖A‖∞ = sup_{‖x‖∞=1} ‖Ax‖∞ = sup_{‖x‖∞=1} max_{1≤i≤m} |(Ax)(i)|.

By interchanging the supremum and maximum we obtain

‖A‖∞ = max_{1≤i≤m} sup_{‖x‖∞=1} |(Ax)(i)|.

Next, we rewrite the matrix-vector product to get

‖A‖∞ = max_{1≤i≤m} sup_{‖x‖∞=1} | Σ_{j=1}^n A(i, j) x(j) |.

Finally, using formula (2) below we obtain the desired result, i.e.,

‖A‖∞ = max_{1≤i≤m} Σ_{j=1}^n |A(i, j)|.

We now derive formula (2). Consider

| Σ_{j=1}^n A(i, j) x(j) | = |A(i, 1)x(1) + A(i, 2)x(2) + . . . + A(i, n)x(n)|.

The supremum over all unit vectors (in the maximum norm) is attained if all terms in the above sum are positive. This can be ensured by picking x(j) = sign(A(i, j)). But then we have

sup_{‖x‖∞=1} | Σ_{j=1}^n A(i, j) x(j) | = |A(i, 1)| + |A(i, 2)| + . . . + |A(i, n)| = Σ_{j=1}^n |A(i, j)|.     (2)

1.3.3 Cauchy-Schwarz and Hölder Inequalities


Theorem 1.11 Any two vectors x, y ∈ V equipped with an inner product such that ‖x‖² = x∗x satisfy the Cauchy-Schwarz inequality

|x∗y| ≤ ‖x‖ ‖y‖.

Proof The Cauchy-Schwarz inequality in a real vector space can be proved geometrically by starting out with the projection p = (x∗y/‖y‖²) y of x onto y. Since y is in general not a unit vector we need to normalize by ‖y‖ here.
Now ‖x − p‖² ≥ 0 for any x, y and we can use the definition of the norm and the properties of the inner product (in particular x∗y = y∗x if x, y are real) to compute

‖x − p‖² = ‖ x − (x∗y/‖y‖²) y ‖²
         = ( x − (x∗y/‖y‖²) y )∗ ( x − (x∗y/‖y‖²) y )
         = x∗x − 2 (x∗y/‖y‖²) x∗y + (x∗y/‖y‖²)² y∗y
         = ‖x‖² − 2 (x∗y)²/‖y‖² + (x∗y)²/‖y‖²
         = ( ‖x‖²‖y‖² − 2(x∗y)² + (x∗y)² ) / ‖y‖²
         = ( ‖x‖²‖y‖² − (x∗y)² ) / ‖y‖².

Remembering that this quantity is non-negative we get

(x∗y)² ≤ ‖x‖²‖y‖²

and taking square roots

|x∗y| ≤ ‖x‖ ‖y‖.

As a generalization of the Cauchy-Schwarz inequality we have the Hölder inequality:

|x∗y| ≤ ‖x‖_p ‖y‖_q,

where we allow any 1 ≤ p, q ≤ ∞ such that 1/p + 1/q = 1.

Example We can apply the Cauchy-Schwarz inequality to compute the 2-norm of a rank-1 matrix A = uv∗ (cf. the projection matrices that came up earlier).
First, we note that

‖Ax‖2 = ‖(uv∗)x‖2 = ‖u (v∗x)‖2 = |v∗x| ‖u‖2 ≤ ‖u‖2 ‖v‖2 ‖x‖2     (3)

by the Cauchy-Schwarz inequality (note that v∗x is a scalar).
Now

‖A‖2 = sup_{x ∈ C^n, x ≠ 0} ‖Ax‖2 / ‖x‖2,

so that (3) gives us

‖A‖2 ≤ ‖u‖2 ‖v‖2.

However, in the special case x = v we get

‖Ax‖2 = ‖Av‖2 = ‖u (v∗v)‖2 = |v∗v| ‖u‖2 = ‖v‖2² ‖u‖2,

and therefore actually

‖A‖2 = sup_{x ∈ C^n, x ≠ 0} ‖Ax‖2 / ‖x‖2 = ‖u‖2 ‖v‖2.

1.3.4 Other Matrix Norms


There are also matrix norms that are not induced by vector norms.
Example The Frobenius norm of an m × n matrix A is given by

‖A‖_F = ( Σ_{i=1}^m Σ_{j=1}^n |A(i, j)|² )^{1/2},

i.e., we interpret the matrix in C^{m×n} as a vector in C^{mn} and compute its 2-norm.
Other formulas for the Frobenius norm are

‖A‖_F = √(tr(A∗A)) = √(tr(AA∗)),

where the trace tr(A) is given by the sum of the diagonal entries of A.
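In Matlab the three expressions can be compared directly (a sketch, not from the notes; the matrix is the one from the previous example):

A = [1 1 2; 0 1 1; 1 0 2];
norm(A,'fro'), sqrt(sum(abs(A(:)).^2)), sqrt(trace(A'*A))   % all three are identical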
Finally,
Theorem 1.12 Let A ∈ Cm×n and Q ∈ Cm×m be unitary. Then
1. kQAk2 = kAk2 ,
2. kQAkF = kAkF ,
i.e., both the 2-norm and the Frobenius norm are invariant under unitary transforma-
tion.
Proof The invariance of the matrix 2-norm is a direct consequence of the invariance of
lengths of vectors under unitary transformations discussed earlier, i.e., kQxk2 = kxk2 .
In particular, kQAxk2 = kAxk2 , and so
kQAxk2 kAxk2
kQAk2 = sup = sup = kAk2 .
x∈Cn kxk2 x∈C n kxk2
x6=0 x6=0

For the Frobenius norm we have


p
kQAkF = tr ((QA)∗ (QA))
p
= tr (A∗ Q∗ QA)
p
= tr (A∗ A) = kAkF
since Q∗ Q = I.

1.4 Computer Arithmetic
1.4.1 Floating-Point Arithmetic
We will use normalized scientific notation to represent real numbers x ≠ 0, i.e., in decimal representation we write

x = ±r × 10^n,   1/10 ≤ r < 1,

and in binary representation (which of course will matter on the computer) we write

x = ±q × 2^m,   1/2 ≤ q < 1.
Both of these representations consist of the sign “±”, the mantissa (either r or q), the
base (either 10 or 2), and the exponent (either n or m).

Example Some examples of various floating-point numbers in normalized scientific notation with base 10:

0.0000747 = 0.747 × 10^-4
31.4159265 = 0.314159265 × 10^2
9,700,000,000 = 0.97 × 10^10
1K = 0.1024 × 10^4 for computer stuff.

In order to study the kinds of errors that we can make when we represent real numbers as machine numbers we will use a hypothetical binary computer. We will assume that this computer can represent only positive numbers of the form

(0.d_1 d_2 d_3 d_4)_2 × 2^n

with n ∈ {−3, −2, −1, 0, 1, 2, 3, 4}. Our representation will actually have a 3-bit mantissa (i.e., the leading digit d_1 is always assumed to be 1, and therefore never stored). The set of choices for the exponent n comes from using a 3-bit exponent which allows us to generate 0, 1, . . . , 7, so that n is actually determined as 4 − exponent.
With this configuration we are able to generate 2³ = 8 different mantissas:

(0.1000)2 = (0.5)10
(0.1001)2 = (0.5625)10
(0.1010)2 = (0.625)10
(0.1011)2 = (0.6875)10
(0.1100)2 = (0.75)10
(0.1101)2 = (0.8125)10
(0.1110)2 = (0.875)10
(0.1111)2 = (0.9375)10

for a total of 64 machine numbers (obtained by combining the 8 mantissas with the 8 possible exponents).

n = −3 n = −2 n = −1 n=0 n=1 n=2 n=3 n=4
(0.1000)2 0.0625 0.125 0.25 0.5 1 2 4 8
(0.1001)2 0.0703125 0.140625 0.28125 0.5625 1.125 2.25 4.5 9
(0.1010)2 0.078125 0.15625 0.3125 0.625 1.25 2.5 5 10
(0.1011)2 0.0859375 0.171875 0.34375 0.6875 1.375 2.75 5.5 11
(0.1100)2 0.09375 0.1875 0.375 0.75 1.5 3 6 12
(0.1101)2 0.1015625 0.203125 0.40625 0.8125 1.625 3.25 6.5 13
(0.1110)2 0.109375 0.21875 0.4375 0.875 1.75 3.5 7 14
(0.1111)2 0.1171875 0.234375 0.46875 0.9375 1.875 3.75 7.5 15

Table 3: List of all 64 machine numbers for hypothetical computer.

Remark Any computer has finite word length (usually longer than that of our hypothetical computer) and can therefore represent only a finite, discrete set of numbers exactly. Moreover, these numbers are distributed unevenly.
Example How does the computation of 1/10 + 1/5 + 1/6 = 7/15 = 0.4666 . . . work out in our hypothetical computer?
First, we notice that we will be committing a number of representation errors. The closest machine number to 1/10 is 0.1015625 = (0.1101)_2 × 2^-3. Similarly, the closest machine number to 1/5 is 0.203125 = (0.1101)_2 × 2^-2.
We can add these two numbers (by shifting the mantissa of the first number) and obtain (1.00111)_2 × 2^-2. This, however, is not in normalized scientific notation. The "correct" representation of the intermediate calculation of 1/10 + 1/5 is therefore (0.100111)_2 × 2^-1.
Now, however, our computer has only a 4-bit mantissa, and therefore we need to commit a rounding error, i.e., we represent the intermediate result by (0.1010)_2 × 2^-1. Note that this is indeed (fortunately so) the closest machine number to 3/10.
For the final step of the calculation we need to represent 1/6 as a machine number. We again commit a representation error by using 0.171875 = (0.1011)_2 × 2^-2. Adding this to the intermediate result from above (again by shifting the mantissa) we get (0.11111)_2 × 2^-1. Once more we need to round this result to the nearest machine number, so that the final answer of our calculation is (0.1000)_2 × 2^0 = 0.5, which is a rather poor representation of the true answer 7/15 = 0.4666 . . ..
In fact, the absolute error of our calculation is

|0.5 − 0.4666 . . .| = 0.0333 . . . ,

and the relative error is

|0.5 − 0.4666 . . .| / 0.4666 . . . ≈ 0.0714, or 7.14%.

Example Another important observation is the fact that the order of operations mat-
ters, i.e., even though addition (or multiplication) is commutative and associative for
real numbers, this may not be true for machine numbers.
Consider the problem of adding 1 + 1/16 + 1/16 = 9/8 = 1.125 on our hypothetical computer. Note that all of these numbers are machine numbers (so we will be committing no representation errors).

a) We first represent 1 by (0.1000)_2 × 2^1 and 1/16 by (0.1000)_2 × 2^-3. Addition of these two numbers leads to (0.10001)_2 × 2^1, which now has to be rounded to a machine number, i.e., (0.1000)_2 × 2^1. Clearly, adding another 1/16 will not change the answer. Thus, (1 + 1/16) + 1/16 = 1.

b) If we start by adding the smaller numbers first, then 1/16 + 1/16 is represented by (0.1000)_2 × 2^-3 + (0.1000)_2 × 2^-3 = (0.1000)_2 × 2^-2 (exactly), and adding 1 = (0.1000)_2 × 2^1 to this yields the correct answer of (0.1001)_2 × 2^1 = 1.125.
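The same effect can be observed in IEEE single precision (a sketch, not from the notes; the numbers are chosen so that 2^-24 is just below the resolution of single precision near 1):

x = single(1);  y = single(2^-24);
(x + y) + y      % 1: each small term is rounded away individually
x + (y + y)      % 1.0000001: the small terms first combine to 2^-23, which survives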

More common word lengths for the representation of floating-point numbers are 32-bit (single precision) and 64-bit (double precision). In the 32-bit representation the first bit is used to represent the sign, the next 8 bits represent the exponent, and the remaining 23 bits are used for the mantissa. This implies that the largest possible exponent field is (11111111)_2 = 2^8 − 1 = 255 (corresponding to exponents −126, . . . , 127, where the two extreme cases 0 and 255 are reserved for special purposes). This means that we can roughly represent numbers between 2^-126 ≈ 10^-38 and 2^127 ≈ 10^38. Since the mantissa has 23 bits we can represent numbers with an accuracy (machine ε) of 2^-23 ≈ 0.12 × 10^-6, i.e., we can expect 6 accurate digits (which is known as single precision).
In the 64-bit system we use 11 bits for the exponent and 52 for the mantissa. This leads to machine numbers between 2^-1022 ≈ 10^-308 and 2^1023 ≈ 10^308. The resolution possible with the 52-bit mantissa is 2^-52 ≈ 0.22 × 10^-15. Thus double precision has 15 accurate digits.

1.4.2 Rounding and Chopping


Machine numbers can be obtained by either rounding (up or down to the nearest
machine number), or by simply chopping off any extra digits.
Depending on the representation method used (rounding or chopping) any number
will be represented internally with relative accuracy δ, where
δ = (fl(x) − x) / x,   or equivalently   fl(x) = x(1 + δ).

Here |δ| ≤ (1/2) β^(1−n) for rounding and |δ| ≤ β^(1−n) for chopping, with β representing the base (usually 2 for digital computers) and n the length of the mantissa.

Example A nice (actually quite disastrous) illustration of the difference between chopping and rounding is given by the first few months of operation of the Vancouver Stock Exchange. In 1982 its index was initialized to a value of 1000. After that, the index was updated after every transaction. After 22 months the index had fallen to 520, and everyone was stumped, since "common sense" indicated mild growth.
The explanation was found when it was discovered that the updated values were truncated instead of rounded. A corrected update using rounding yielded an index value of 1098.892.

Some other examples of disasters due to careless use of computers can be found on
the web at http://www.ima.umn.edu/∼arnold/disasters/.
In order to have a "good" computer we would like

fl(x ⊙ y) = [x ⊙ y](1 + δ)

for any basic arithmetic operation ⊙ ∈ {+, −, ·, /}. This is usually accomplished by using higher precision internally.

1.4.3 Loss of Significance


Consider the function f(x) = x(√(x+1) − √x). We want to accurately compute f(500). The true solution is 11.174755300747198 . . .. This problem is studied in the Maple worksheet 477 577 loss of significance.mws.
Using 6 digits (single precision) the detailed computations are

f(500) = 500(√501 − √500) = 500(22.3830 − 22.3607) = 500(0.0223) = 11.1500

with a relative error of

|11.15 − 11.1748| / 11.1748 ≈ 0.22%.

On the other hand, if we choose a better method to solve the problem (without disastrous subtractions), i.e., rewrite the function as

g(x) = x(√(x+1) − √x) · (√(x+1) + √x)/(√(x+1) + √x)
     = x(x + 1 − x)/(√(x+1) + √x)
     = x/(√(x+1) + √x),

then

g(500) = 500/(√501 + √500) = 500/(22.3830 + 22.3607) = 500/44.7437 = 11.1748,

which is exact up to the precision used.
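A Matlab sketch of the same comparison (not from the notes; IEEE single precision is used here instead of 6-digit decimal arithmetic, so the effect is milder but still visible):

x = single(500);
f = x .* (sqrt(x+1) - sqrt(x));        % subtraction of nearly equal numbers
g = x ./ (sqrt(x+1) + sqrt(x));        % rewritten form, no cancellation
exact = 500*(sqrt(501) - sqrt(500));   % double-precision reference value
abs([f g] - exact) / exact             % f loses several digits, g does not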
We end with a theorem quantifying the loss of significant digits in subtractions.

Theorem 1.13 If x > y are positive normalized floating-point numbers in binary representation and 2^(−q) ≤ 1 − y/x ≤ 2^(−p), then the number ℓ of significant binary bits lost when computing x − y satisfies p ≤ ℓ ≤ q.

Proof We show the lower bound, i.e., ℓ ≥ p (the upper bound can be shown similarly).
Let x = r × 2^n and y = s × 2^m with 1/2 ≤ r, s < 1. In order to perform the subtraction we rewrite (shift) y such that y = (s × 2^(m−n)) × 2^n. Then

x − y = (r − s × 2^(m−n)) × 2^n,

where

r − s × 2^(m−n) = r (1 − (s × 2^m)/(r × 2^n)) = r (1 − y/x).

Now r < 1 and 1 − y/x ≤ 2^(−p) by assumption, so that the mantissa satisfies r − s × 2^(m−n) < 2^(−p).
Finally, we need to shift at least p bits to the left in order to normalize the representation (we need 1/2 ≤ mantissa < 1). This introduces at least p (binary) zeros at the right end of the number, and so at least p bits are lost.

Remark If the theorem is formulated in base 10, i.e., if the relative error is between 10^(−q) and 10^(−p), then between p and q digits are lost.

2 Singular Value Decomposition
The singular value decomposition (SVD) allows us to transform a matrix A ∈ Cm×n to
diagonal form using unitary matrices, i.e.,

A = Û Σ̂V ∗ . (4)

Here Û ∈ Cm×n has orthonormal columns, Σ̂ ∈ Cn×n is diagonal, and V ∈ Cn×n is


unitary. This is the practical version of the SVD also known as the reduced SVD. We
will discuss the full SVD later. It is of the form

A = U ΣV ∗

with unitary matrices U and V and Σ ∈ Cm×n .


Before we worry about how to find the matrix factors of A we give a geometric
interpretation. First note that since V is unitary (i.e., V ∗ = V −1 ) we have the equiva-
lence
A = Û Σ̂V ∗ ⇐⇒ AV = Û Σ̂.
Considering each column of V separately the latter is the same as

Av j = σj uj , j = 1, . . . , n. (5)

Thus, the unit vectors of an orthogonal coordinate system {v 1 , . . . , v n } are mapped


under A onto a new “scaled” orthogonal coordinate system {σ1 u1 , . . . , σn un }. In other
words, the unit sphere with respect to the matrix 2-norm (which is a perfectly round
sphere in the v-system) is transformed to an ellipsoid with semi-axes σj uj (see Fig-
ure 2). We will see below that, depending on the rank of A, some of the σj may be zero.
Therefore, yet another geometrical interpretation of the SVD is: Any m × n matrix A
maps the 2-norm unit sphere in Rn to an ellipsoid in Rr (r ≤ min(m, n)).

Figure 2: Geometrical interpretation of the singular value decomposition: the unit vectors v1, v2 are mapped to the semi-axes σ1 u1, σ2 u2 of an ellipse.

In (5) we refer to the σ_j as the singular values of A (the diagonal entries of Σ̂). They are usually ordered such that σ1 ≥ σ2 ≥ . . . ≥ σ_n. The orthonormal vectors u_j (the columns of Û) are called the left singular vectors of A, and the orthonormal vectors v_j (the columns of V) are called the right singular vectors of A.

Remark For most practical purposes it suffices to compute the reduced SVD (4). We
will give examples of its use, and explain how to compute it later.

Figure 3: Image compressed using the QR factorization (left; compression ratio 0.2000, 25 columns used) and the SVD (right; compression ratio 0.2000, relative error 0.0320, 25 singular values used).

Besides applications to inconsistent and underdetermined linear systems and least


squares problems, the SVD has important applications in image and data compression
(see our discussion of low-rank approximation below). Figure 3 shows the difference
between using the SVD and the QR factorization (to be introduced later) for com-
pression of the same image. In both cases the same amount (20%) of information was
retained. Clearly, the SVD does a much better job in picking out what information
is “important”. We will also see below that a number of theoretical facts about the
matrix A can be obtained via the SVD.

2.0.4 Full SVD


The idea is to extend the columns of Û to an orthonormal basis of C^m by adding appropriate orthonormal (but otherwise arbitrary) columns, and to call the resulting m × m matrix U. This will also force Σ̂ to be extended to an m × n matrix Σ. Since we do not want to alter the product of the factors, the additional rows (or columns – depending on whether m > n or m < n) of Σ will be all zero. Thus, in the case of m ≥ n we have

A = UΣV∗ = [ Û  Ũ ] [ Σ̂ ; O ] V∗,     (6)

where Ũ contains the added columns and O is an (m − n) × n block of zeros.

Since U is now also a unitary matrix we have

U ∗ AV = Σ,

i.e., unitary transformations (reflections or rotations) are applied from the left and
right to A in order to obtain a diagonal matrix Σ.

Remark Note that the “diagonal” matrix Σ is in many cases rectangular and will
contain extra rows/columns of all zeros.

It is clear that the SVD will simplify the solution of many problems since the
transformed system matrix is diagonal, and thus trivial to work with.

2.0.5 Existence and Uniqueness Theorem
Theorem 2.1 Let A be a complex m × n matrix. A has a singular value decomposition
of the form
A = U ΣV ∗ ,
where Σ is a uniquely determined m × n (real) diagonal matrix, U is an m × m unitary
matrix, and V is an n × n unitary matrix.

Proof We prove only existence. The uniqueness part of the proof follows directly from
the geometric interpretation. A (more rigorous?) algebraic argument can be found,
e.g., in [Trefethen/Bau].
We use induction on the dimensions of A. All of the following arguments assume
m ≥ n (the case m < n can be obtained by transposing the arguments).
For n = 1 (and any m) the matrix A is a column vector. We take V = 1, Σ̂ = ‖A‖2 and Û = A/‖A‖2. Then, clearly, we have found a reduced SVD, i.e., A = ÛΣ̂V∗. The full SVD is obtained by extending Û to U by the Gram-Schmidt algorithm and adding the necessary zeros to Σ̂.
We now assume an SVD exists for the case (m − 1, n − 1) and show it also exists for (m, n). To this end we pick v1 ∈ C^n with ‖v1‖2 = 1 at which the supremum

‖A‖2 = sup_{v ∈ C^n, ‖v‖2 = 1} ‖Av‖2 > 0

is attained.

Now we take

u1 = Av1 / ‖Av1‖2.     (7)

Next, we use the Gram-Schmidt algorithm to arbitrarily extend u1 and v1 to unitary matrices by adding columns Ũ1 and Ṽ1, i.e.,

U1 = [ u1  Ũ1 ],   V1 = [ v1  Ṽ1 ].

This results in

U1∗ A V1 = [ u1∗ ; Ũ1∗ ] A [ v1  Ṽ1 ] = [ u1∗Av1   u1∗AṼ1 ; Ũ1∗Av1   Ũ1∗AṼ1 ].

We now look at three of these four blocks:


• Using (7) and the specific choice of v1 we have

  u1∗Av1 = ((Av1)∗ / ‖Av1‖2) Av1 = ‖Av1‖2² / ‖Av1‖2 = ‖Av1‖2 = ‖A‖2.

  For this quantity we introduce the abbreviation σ1 = ‖A‖2.

• Again, using (7) we get

  Ũ1∗Av1 = Ũ1∗u1 ‖Av1‖2.

  This, however, is a zero vector since U1 has orthonormal columns, i.e., Ũ1∗u1 = 0.

• We show that u1∗AṼ1 = [ 0 · · · 0 ] by contradiction. If it were nonzero then we could look at the first row of the block matrix U1∗AV1 and see that

  (U1∗AV1)(1, :) = [ σ1   u1∗AṼ1 ]

  with ‖[ σ1   u1∗AṼ1 ]‖2 > σ1. On the other hand, we know that unitary matrices leave the 2-norm invariant, i.e.,

  ‖U1∗AV1‖2 = ‖A‖2 = σ1.

  Since the norm of the first row of the block matrix U1∗AV1 cannot exceed that of the entire matrix we have reached a contradiction.

We now abbreviate the fourth block by Ã = Ũ1∗AṼ1 and can write the block matrix as

U1∗AV1 = [ σ1  0ᵀ ; 0  Ã ].

To complete the proof we apply the induction hypothesis to Ã, i.e., we use the SVD Ã = U2 Σ2 V2∗. Then

U1∗AV1 = [ σ1  0ᵀ ; 0  U2 Σ2 V2∗ ] = [ 1  0ᵀ ; 0  U2 ] [ σ1  0ᵀ ; 0  Σ2 ] [ 1  0ᵀ ; 0  V2 ]∗

or

A = U1 [ 1  0ᵀ ; 0  U2 ] [ σ1  0ᵀ ; 0  Σ2 ] [ 1  0ᵀ ; 0  V2 ]∗ V1∗,

another SVD (since the product of unitary matrices is unitary).

2.1 SVD as a Change of Basis


We now discuss the use of the SVD to diagonalize systems of linear equations. Consider
the linear system
Ax = b
with A ∈ Cm×n . Using the SVD we can write

Ax = UΣV∗x ⇐⇒ b = U b′.

Thus, we can express b ∈ range(A) in terms of range(U):

U b′ = b ⇐⇒ b′ = U∗b,

where we have used the columns of U as an orthonormal basis for range(A).
Similarly, any x ∈ C^n (the domain of A) can be written in terms of range(V):

x′ = V∗x.

Now

Ax = b ⇐⇒ U∗Ax = U∗b ⇐⇒ U∗UΣV∗x = U∗b ⇐⇒ IΣV∗x = U∗b ⇐⇒ Σx′ = b′,

and we have diagonalized the linear system.


In summary, expressing the range space of A in terms of the columns of U and the
domain space of A in terms of the columns of V converts Ax = b to a diagonal system.
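A Matlab sketch of this change of basis (not from the notes; the matrix and right-hand side are made up, and b is chosen to lie in range(A)):

A = [1 2; 2 2; 2 1];  x_true = [1; -1];  b = A*x_true;   % consistent system
[U,S,V] = svd(A,0);           % reduced SVD
bprime = U' * b;              % b' = U^* b, coordinates of b w.r.t. the columns of U
xprime = bprime ./ diag(S);   % solve the diagonal system Sigma x' = b'
x = V * xprime;               % transform back: x = V x'
norm(x - x_true)              % (near) zero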

2.1.1 Connection to Eigenvalues


If A ∈ Cm×m is square with a linearly independent set of eigenvectors (i.e., nondefec-
tive), then
AX = ΛX ⇐⇒ X −1 AX = Λ,
where X contains the eigenvectors of A as its columns and Λ = diag(λ1 , . . . , λm ) is a
diagonal matrix of the eigenvalues of A.
If we compare this eigen-decomposition of A to the SVD we see that the SVD is a
generalization: A need not be square, and the SVD always exists (whereas even a square
matrix need not have an eigen-decomposition). The price we pay is that we require
two unitary matrices U and V instead of only X (which is in general not unitary).

2.1.2 Theoretical Information via SVD


A number of theoretical facts about the matrix A can be obtained via the SVD. They
are summarized in

Theorem 2.2 Assume A ∈ Cm×n , p = min(m, n), and r ≤ p denotes the number of
positive singular values of A. Then

1. rank(A) = r

2. range(A) = range(U(:, 1:r)) and null(A) = range(V(:, r+1:n)).

3. ‖A‖2 = σ1 and ‖A‖_F = √(σ1² + σ2² + . . . + σ_r²).

4. The eigenvalues of A∗A are the σ_i² and the v_i are the corresponding (orthonormalized) eigenvectors. The eigenvalues of AA∗ are the σ_i² and possibly m − n zeros. The corresponding orthonormalized eigenvectors are given by the u_i.

5. If A = A∗ (Hermitian or real symmetric), then the eigen-decomposition A = XΛX∗ and the SVD A = UΣV∗ are almost identical. We have U = X, σ_i = |λ_i|, and v_i = sign(λ_i) u_i.

6. If A ∈ C^{m×m} then |det(A)| = Π_{i=1}^m σ_i.

Proof We discuss items 1–3 and 6.


1. Since U and V are unitary matrices of full rank and rank(Σ) = r the statement
follows from the SVD A = U ΣV ∗ .
2. Both statements follow from the fact that the range of Σ is spanned by e1 , . . . , er
and that U and V are full-rank unitary matrices whose ranges are Cm and Cn ,
respectively.
3. The invariance of the 2-norm and Frobenius norm under unitary transformations implies ‖A‖2 = ‖Σ‖2 and ‖A‖_F = ‖Σ‖_F. Since Σ is diagonal we clearly have ‖Σ‖2 = max_{x ∈ C^n, ‖x‖2 = 1} ‖Σx‖2 = max_{1≤i≤r} σ_i = σ1. The formula for ‖Σ‖_F follows directly from the definition of the Frobenius norm.

6. We know that the determinant of a unitary matrix has absolute value 1, and that of a diagonal matrix is the product of the diagonal entries. Finally, the determinant of a product of matrices is given by the product of their determinants. Thus, the SVD yields the stated result.

2.1.3 Low-rank Approximation


Theorem 2.3 The m × n matrix A can be decomposed into a sum of r rank-one matrices:

A = Σ_{j=1}^r σ_j u_j v_j∗.     (8)

Moreover, the best 2-norm approximation of rank ν (0 ≤ ν ≤ r) to A is given by

A_ν = Σ_{j=1}^ν σ_j u_j v_j∗.

In fact,

‖A − A_ν‖2 = σ_{ν+1}.     (9)

Proof The representation (8) of the SVD follows immediately from the full SVD (6)
by splitting Σ into a sum of diagonal matrices Σj = diag(0, . . . , 0, σj , 0, . . . , 0).
Formula (9) for the approximation error follows from the fact that U ∗ AV = Σ and
the expansion for Aν so that U ∗ (A − Aν )V = diag(0, . . . , 0, σν+1 , . . .) and kA − Aν k2 =
σν+1 by the invariance of the 2-norm under unitary transformations and item 3 of the
previous theorem.
The claim regarding the best approximation property is a little more involved, and
omitted.

Remark There are many possible rank-ν decompositions of A (e.g., by taking partial
sums of the LU or QR factorization). Theorem 2.3, however, says that the ν-th partial
sum of the SVD captures as much of the energy of A (measured in the 2-norm) as pos-
sible. This fact gives rise to many applications in image processing, data compression,
data mining, and other fields. See, e.g., the Matlab scripts svd compression.m and
qr compression.m.
A geometric interpretation of Theorem 2.3 is given by the best approximation of a
hyperellipsoid by lower-dimensional ellipsoids. For example, the best approximation of
a given hyperellipsoid by a line segment is given by the line segment corresponding to
the hyperellipsoids longest axis. Similarly, the best approximation by an ellipse is given
by that ellipse whose axes are the longest and second-longest axis of the hyperellipsoid.

2.1.4 Computing the SVD by Hand


We now list a simplistic algorithm for computing the SVD of a matrix A. It can be
used fairly easily for manual computation of small examples. For a given m × n matrix
A the procedure is as follows:
1. Form A∗A.

2. Find the eigenvalues and orthonormalized eigenvectors of A∗A, i.e., A∗A = VΛV∗.

3. Sort the eigenvalues according to their magnitude, and let σ_j = √λ_j, j = 1, . . . , n.

4. Find the first r columns of U via

   u_j = σ_j⁻¹ A v_j,   j = 1, . . . , r.

   Pick the remaining columns such that U is unitary.

Example Find the SVD for

A = [ 1  2 ; 2  2 ; 2  1 ].

1. A∗A = [ 9  8 ; 8  9 ].

2. The eigenvalues (in order of decreasing magnitude) are λ1 = 17 and λ2 = 1, and the corresponding eigenvectors are

   ṽ1 = [ 1 ; 1 ],   ṽ2 = [ 1 ; −1 ],

   so that (after normalization)

   V = [ 1/√2  1/√2 ; 1/√2  −1/√2 ].


3. σ1 = √17 and σ2 = 1, so that

   Σ = [ √17  0 ; 0  1 ; 0  0 ].

4. The first two columns of U can be computed as

   u1 = (1/√17) A v1 = (1/√17)(1/√2) [ 1  2 ; 2  2 ; 2  1 ] [ 1 ; 1 ] = (1/√34) [ 3 ; 4 ; 3 ],

   and

   u2 = (1/1) A v2 = (1/√2) [ 1  2 ; 2  2 ; 2  1 ] [ 1 ; −1 ] = (1/√2) [ −1 ; 0 ; 1 ].

Thus far we have

U = [ 3/√34  −1/√2  u3(1) ; 4/√34  0  u3(2) ; 3/√34  1/√2  u3(3) ].

In order to determine u3(i), i = 1, 2, 3, we need to satisfy

u_j∗ u3 = δ_j3,   j = 1, 2, 3.

The following choice satisfies this requirement:

u3 = (1/√17) [ 2 ; −3 ; 2 ],

so that

A = UΣV∗ = [ 3/√34  −1/√2  2/√17 ; 4/√34  0  −3/√17 ; 3/√34  1/√2  2/√17 ] [ √17  0 ; 0  1 ; 0  0 ] [ 1/√2  1/√2 ; 1/√2  −1/√2 ]∗.

The reduced SVD is given by

A = ÛΣ̂V∗ = [ 3/√34  −1/√2 ; 4/√34  0 ; 3/√34  1/√2 ] [ √17  0 ; 0  1 ] [ 1/√2  1/√2 ; 1/√2  −1/√2 ]∗.

Remark A practical computer implementation of the SVD will require an algorithm


for finding eigenvalues. We will study this later. The two most popular SVD imple-
mentations use either a method called Golub-Kahan-Bidiagonalization (GKB) (from
1965), or some Divide-and-Conquer strategy (which have been studied since around
1980).

3 Projectors
If P ∈ Cm×m is a square matrix such that P 2 = P then P is called a projector. A
matrix satisfying this property is also known as an idempotent matrix.

Remark It should be emphasized that P need not be an orthogonal projection matrix.


Moreover, P is usually not an orthogonal matrix.

Example Consider the matrix

P = [ c²  cs ; cs  s² ],

where c = cos θ and s = sin θ. This matrix projects perpendicularly onto the line with inclination angle θ in R².
We can check that P is indeed a projector:

P² = [ c²  cs ; cs  s² ] [ c²  cs ; cs  s² ]
   = [ c⁴ + c²s²   c³s + cs³ ; c³s + cs³   c²s² + s⁴ ]
   = [ c²(c² + s²)   cs(c² + s²) ; cs(c² + s²)   s²(c² + s²) ] = P.

Note that P is not an orthogonal matrix, i.e., P∗P = P² = P ≠ I. In fact, rank(P) = 1 since points on the line are projected onto themselves.

Example The matrix

P = [ 1  1 ; 0  0 ]

is clearly a projector. Since the range of P is given by all points on the x-axis, and any point (x, y) is projected to (x + y, 0), this is clearly not an orthogonal projection.
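Both examples are easily verified numerically (a sketch, not from the notes; the angle θ is arbitrary):

theta = pi/6;  c = cos(theta);  s = sin(theta);
P1 = [c^2 c*s; c*s s^2];
norm(P1^2 - P1), norm(P1 - P1')   % both zero: P1 is a projector and Hermitian (orthogonal projector)
P2 = [1 1; 0 0];
norm(P2^2 - P2), norm(P2 - P2')   % P2 is a projector, but P2 is not Hermitian (oblique projection)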

In general, for any projector P, any v ∈ range(P) is projected onto itself, i.e., if v = Px for some x then

P v = P (P x) = P 2 x = P x = v.

We also have
P (P v − v) = P 2 v − P v = P v − P v = 0,
so that P v − v ∈ null(P ).

3.1 Complementary Projectors


In fact, I − P is known as the complementary projector to P. It is indeed a projector since

(I − P)² = (I − P)(I − P) = I − IP − PI + P² = I − P − P + P = I − P.

Lemma 3.1 If P is a projector then

range(I − P ) = null(P ), (10)

null(I − P ) = range(P ). (11)

Proof We show (10), then (11) will follow by applying the same arguments for P =
I − (I − P ). Equality of two sets is shown by mutual inclusions, i.e., A = B if A ⊆ B
and B ⊆ A.
First, we show null(P ) ⊆ range(I − P ). Take a vector v such that P v = 0. Then
(I − P )v = v − P v = v. In words, any v in the nullspace of P is also in the range of
I − P.
Now, we show range(I − P ) ⊆ null(P ). We know that any x ∈ range(I − P ) is
characterized by
x = (I − P )v for some v.
Thus
x = v − P v = −(P v − v) ∈ null(P )
since we showed earlier that P (P v−v) = 0. Thus if x ∈ range(I −P ), then x ∈ null(P ).

3.2 Decomposition of a Given Vector


Using a projector and its complementary projector we can decompose any vector v into

v = P v + (I − P )v,

where P v ∈ range(P ) and (I − P )v ∈ null(P ). This decomposition is unique since


range(P ) ∩ null(P ) = {0}, i.e., the projectors are complementary.

3.3 Orthogonal Projectors


If P ∈ Cm×m is a square matrix such that P 2 = P and P = P ∗ then P is called an
orthogonal projector.

Remark In some books the definition of a projector already includes orthogonality.


However, as before, P is in general not an orthogonal matrix, i.e., P ∗ P = P 2 6= I.

3.4 Connection to Earlier Orthogonal Decomposition


Earlier we considered the orthonormal set {q1, . . . , qn}, and established the decomposition

v = r + Σ_{i=1}^n (q_i∗ v) q_i = r + Σ_{i=1}^n (q_i q_i∗) v     (12)

with r orthogonal to {q1, . . . , qn}. This corresponds to the decomposition

v = (I − P)v + Pv

with P = Σ_{i=1}^n (q_i q_i∗).
Note that Σ_{i=1}^n (q_i q_i∗) = QQ∗ with Q = [ q1 q2 . . . qn ]. Thus the orthogonal decomposition (12) can be rewritten as

v = (I − QQ∗)v + QQ∗v.     (13)

It is easy to verify that QQ∗ is indeed an orthogonal projection:

1. (QQ∗)² = Q(Q∗Q)Q∗ = QQ∗ since Q∗Q = I (Q has orthonormal columns, but not necessarily orthonormal rows).

2. (QQ∗)∗ = QQ∗.

Remark The orthogonal decomposition (13) will be important for the implementation of the QR decomposition later on. In particular we will use the rank-1 projector

P_q = qq∗,

which projects onto the direction q, and its complement

P_{⊥q} = I − qq∗.

Thus,

v = (I − qq∗)v + qq∗v,

or, more generally, the orthogonal projection onto an arbitrary direction a is given by

v = (I − aa∗/(a∗a)) v + (aa∗/(a∗a)) v,

where we abbreviate P_a = aa∗/(a∗a) and P_{⊥a} = I − aa∗/(a∗a).

As a further generalization we can consider orthogonal projection onto the range of


a (full-rank) matrix A. Earlier, for the orthonormal basis {q 1 , . . . , q n } (the columns of
Q) we had P = QQ∗ . Now we require only that {a1 , . . . , an } be linearly independent.
In order to compute the projection P for this case we start with an arbitrary vector v.
We need to ensure that P v − v ⊥ range(A), i.e., if P v ∈ range(A) then

a∗j (P v − v) = 0, j = 1, . . . , n.

Now, since P v ∈ range(A) we know P v = Ax for some x. Thus

a∗j (Ax − v) = 0, j = 1, . . . , n
A∗ (Ax − v) = 0

or

A∗Ax = A∗v.

One can show that (A∗A)⁻¹ exists provided the columns of A are linearly independent (our assumption). Then

x = (A∗A)⁻¹A∗v.

Finally,

Pv = Ax = A(A∗A)⁻¹A∗v,   so that P = A(A∗A)⁻¹A∗.

Remark Note that this includes the earlier discussion when {a1 , . . . , an } is orthonor-
mal since then A∗ A = I and P = AA∗ as before.
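A short Matlab sketch of this construction (not from the notes; the matrix and vector are arbitrary, with linearly independent but non-orthonormal columns):

A = [1 0; 1 1; 0 1];
P = A / (A'*A) * A';            % P = A (A^*A)^{-1} A^*
v = [1; 2; 3];
A' * (P*v - v)                  % zero vector: Pv - v is orthogonal to range(A)
norm(P^2 - P), norm(P - P')     % both (near) zero: P is an orthogonal projector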

4 QR Factorization
4.1 Reduced vs. Full QR
Consider A ∈ Cm×n with m ≥ n. The reduced QR factorization of A is of the form

A = Q̂R̂,

where Q̂ ∈ Cm×n with orthonormal columns and R̂ ∈ Cn×n an upper triangular matrix
such that R̂(j, j) 6= 0, j = 1, . . . , n.
As with the SVD, Q̂ provides an orthonormal basis for range(A), i.e., the columns of A are linear combinations of the columns of Q̂. In fact, we have range(A) = range(Q̂). This is true since Ax = Q̂R̂x = Q̂y for some y, so that range(A) ⊆ range(Q̂). Moreover, range(Q̂) ⊆ range(A) since we can write Q̂ = AR̂⁻¹ because R̂ is upper triangular with nonzero diagonal elements. (Now we have Q̂x = AR̂⁻¹x = Ay for some y.)
Note that any partial set of columns satisfy the same property, i.e.,

span{a1 , . . . , aj } = span{q 1 , . . . , q j }, j = 1, . . . , n.

In order to obtain the full QR factorization we proceed as with the SVD and extend
Q̂ to a unitary matrix Q. Then A = QR with unitary Q ∈ Cm×m and upper triangular
R ∈ Cm×n . Note that (since m ≥ n) the last m − n rows of R will be zero.

4.2 QR Factorization via Gram-Schmidt


We start by formally writing down the QR factorization A = QR as

a1 = q1 r11                                =⇒  q1 = a1 / r11                                   (14)
a2 = q1 r12 + q2 r22                       =⇒  q2 = (a2 − r12 q1) / r22                        (15)
  ⋮                                                                                            (16)
an = q1 r_{1n} + q2 r_{2n} + . . . + qn r_{nn}   =⇒  qn = (an − Σ_{i=1}^{n−1} r_{in} q_i) / r_{nn}   (17)
Note that in these formulas the columns aj of A are given and we want to determine
the columns q j of Q and entries rij of R such that Q is orthonormal, i.e.,

q ∗i q j = δij , (18)

R is upper triangular and A = QR. The latter two conditions are already reflected in
the formulas above.
Using (14) in the orthogonality condition (18) we get

q1∗ q1 = a1∗ a1 / r11² = 1

so that

r11 = √(a1∗ a1) = ‖a1‖2.
Note that we arbitrarily chose the positive square root here (so that the factorization
becomes unique).
Next, the orthogonality condition (18) gives us

q ∗1 q 2 = 0
q ∗2 q 2 = 1.

Now we apply (15) to the first of these two conditions. Then

q1∗ q2 = (q1∗ a2 − r12 q1∗ q1) / r22 = 0.

Since we ensured q1∗ q1 = 1 in the previous step, the numerator yields r12 = q1∗ a2 so that

q2 = (a2 − (q1∗ a2) q1) / r22.

To find r22 we normalize, i.e., demand that q2∗ q2 = 1 or equivalently ‖q2‖2 = 1. This immediately gives

r22 = ‖a2 − (q1∗ a2) q1‖2.
To fully understand how the algorithm proceeds we add one more step (for n = 3).
Now we have three orthogonality conditions:

q ∗1 q 3 = 0
q ∗2 q 3 = 0
q ∗3 q 3 = 1.

The first of these conditions together with (17) for n = 3 yields

q1∗ q3 = (q1∗ a3 − r13 q1∗ q1 − r23 q1∗ q2) / r33 = 0,

so that r13 = q1∗ a3 due to the orthonormality of the columns q1 and q2.
Similarly, the second orthogonality condition together with (17) for n = 3 yields

q2∗ q3 = (q2∗ a3 − r13 q2∗ q1 − r23 q2∗ q2) / r33 = 0,

so that r23 = q2∗ a3.
Together this gives us

q3 = (a3 − (q1∗ a3) q1 − (q2∗ a3) q2) / r33,

and the last unknown, r33, is determined by normalization, i.e.,

r33 = ‖a3 − (q1∗ a3) q1 − (q2∗ a3) q2‖2.

In general we can formulate the following algorithm:

r_{ij} = q_i∗ a_j   (i ≠ j)
v_j = a_j − Σ_{i=1}^{j−1} r_{ij} q_i
r_{jj} = ‖v_j‖2
q_j = v_j / r_{jj}

We can compute the reduced QR factorization with the following (somewhat more
practical and almost Matlab implementation of the) classical Gram-Schmidt algorithm.
Algorithm (Classical Gram-Schmidt)

for j = 1 : n

v j = aj
for i = 1 : (j − 1)
rij = q ∗i aj
v j = v j − rij q i
end
rjj = kv j k2
q j = v j /rjj

end
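A runnable Matlab version of this algorithm might look as follows (a sketch, not from the notes; the function name clgs is made up, and the code should live in its own file clgs.m):

function [Q,R] = clgs(A)
    % classical Gram-Schmidt: reduced QR factorization A = Q*R
    [m,n] = size(A);
    Q = zeros(m,n);  R = zeros(n,n);
    for j = 1:n
        v = A(:,j);
        for i = 1:j-1
            R(i,j) = Q(:,i)' * A(:,j);   % r_ij = q_i^* a_j
            v = v - R(i,j) * Q(:,i);     % subtract the projection onto q_i
        end
        R(j,j) = norm(v);
        Q(:,j) = v / R(j,j);
    end
end

% Usage: A = rand(5,3); [Q,R] = clgs(A); norm(A - Q*R), norm(Q'*Q - eye(3))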

Remark The classical Gram-Schmidt algorithm is not ideal for numerical calcula-
tions since it is known to be unstable. Note that, by construction, the Gram-Schmidt
algorithm yields an existence proof for the QR factorization.

Theorem 4.1 Let A ∈ Cm×n with m ≥ n. Then A has a QR factorization. Moreover,


if A is of full rank (n), then the reduced factorization A = Q̂R̂ with rjj > 0 is unique.

Example We compute the QR factorization for the matrix

A = [ 1  2  0 ; 0  1  1 ; 1  0  1 ].

First v1 = a1 = [ 1 ; 0 ; 1 ] and r11 = ‖v1‖ = √2. This gives us

q1 = v1 / ‖v1‖ = (1/√2) [ 1 ; 0 ; 1 ].
Next,

v2 = a2 − (q1∗ a2) q1 = [ 2 ; 1 ; 0 ] − (2/√2)(1/√2) [ 1 ; 0 ; 1 ] = [ 1 ; 1 ; −1 ].

This calculation required that r12 = q1∗ a2 = 2/√2 = √2. Moreover, r22 = ‖v2‖ = √3 and

q2 = v2 / ‖v2‖ = (1/√3) [ 1 ; 1 ; −1 ].

In the third iteration we have

v3 = a3 − (q1∗ a3) q1 − (q2∗ a3) q2,

where r13 = q1∗ a3 and r23 = q2∗ a3, from which we first compute r13 = 1/√2 and r23 = 0. This gives us

v3 = [ 0 ; 1 ; 1 ] − (1/√2)(1/√2) [ 1 ; 0 ; 1 ] − 0 = (1/2) [ −1 ; 2 ; 1 ].

Finally, r33 = ‖v3‖ = √6/2 and

q3 = v3 / ‖v3‖ = (1/√6) [ −1 ; 2 ; 1 ].

Collecting all of the information we end up with

Q = [ 1/√2   1/√3   −1/√6 ; 0   1/√3   2/√6 ; 1/√2   −1/√3   1/√6 ]

and

R = [ √2   √2   1/√2 ; 0   √3   0 ; 0   0   √6/2 ].

4.3 An Application of the QR Factorization


Consider solution of the linear system Ax = b with A ∈ Cm×m nonsingular. Since

Ax = b ⇐⇒ QRx = b ⇐⇒ Rx = Q∗ b,

where the last equation holds since Q is unitary, we can proceed as follows:

1. Compute A = QR (which is the same as A = Q̂R̂ in this case).

2. Compute y = Q∗ b.

3. Solve the upper triangular system Rx = y for x.

We will have more applications for the QR factorization later in the context of least
squares problems.

Remark The QR factorization (if implemented properly) yields a very stable method
for solving Ax = b. However, it is about twice as costly as Gauss elimination (or
A = LU ). In fact, the QR factorization can also be applied to rectangular systems and
it is the basis of Matlab’s backslash matrix division operator. We will discuss Matlab
examples in a later section.
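As a simple illustration (a sketch, not taken from the course codes), the three steps can be carried out with Matlab's built-in qr:

[Q,R] = qr(A);        % step 1: full QR factorization
y = Q'*b;             % step 2: y = Q^* b (Q is unitary)
x = R\y;              % step 3: back substitution with the upper triangular R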

4.4 Modified Gram-Schmidt


The classical Gram-Schmidt algorithm is based on projections of the form
v_j = a_j − Σ_{i=1}^{j−1} r_{ij} q_i = a_j − Σ_{i=1}^{j−1} (q_i^* a_j) q_i.
i=1

Note that this means we are performing a sequence of vector projections. The starting
point for the modified Gram-Schmidt algorithm is to rewrite one step of the classical
Gram-Schmidt algorithm as a single matrix projection, i.e.,
v_j = a_j − Σ_{i=1}^{j−1} (q_i^* a_j) q_i
    = a_j − Σ_{i=1}^{j−1} (q_i q_i^*) a_j
    = a_j − Q̂_{j−1} Q̂_{j−1}^* a_j
    = ( I − Q̂_{j−1} Q̂_{j−1}^* ) a_j  =:  P_j a_j,

where Q̂j−1 = [q 1 q 2 . . . q j−1 ] is the matrix formed by the column vectors q i , i =


1, . . . , j − 1.
In order to obtain the modified Gram-Schmidt algorithm we require the following
observation that the single projection Pj can also be viewed as a series of complementary
projections onto the individual columns q i , i.e.,

Lemma 4.2 If Pj = I − Q̂j−1 Q̂∗j−1 with Q̂j−1 = [q 1 q 2 . . . q j−1 ] a matrix with orthonor-
mal columns, then
P_j = ∏_{i=1}^{j−1} P_{⊥q_i}.

Proof First we remember that
P_j = I − Q̂_{j−1} Q̂_{j−1}^* = I − Σ_{i=1}^{j−1} q_i q_i^*

and that the complementary projector is defined as

P⊥qi = I − q i q ∗i .

Therefore, we need to show that


I − Σ_{i=1}^{j−1} q_i q_i^* = ∏_{i=1}^{j−1} ( I − q_i q_i^* ).

This is done by induction. For j = 1 the sum and the product are empty and the
statement holds by the convention that an empty sum is zero and an empty product is
the identity, i.e., P1 = I.
Now we step from j − 1 to j. First
∏_{i=1}^{j} ( I − q_i q_i^* ) = [ ∏_{i=1}^{j−1} ( I − q_i q_i^* ) ] ( I − q_j q_j^* )
                              = ( I − Σ_{i=1}^{j−1} q_i q_i^* ) ( I − q_j q_j^* )

by the induction hypothesis. Expanding the right-hand side yields

I − Σ_{i=1}^{j−1} q_i q_i^* − q_j q_j^* + Σ_{i=1}^{j−1} q_i (q_i^* q_j) q_j^* = I − Σ_{i=1}^{j} q_i q_i^*,

since q_i^* q_j = 0 for i < j, so that the claim is proved.

Summarizing the discussion thus far, a single step in the Gram-Schmidt algorithm
can be written as
v j = P⊥qj−1 P⊥qj−2 . . . P⊥q1 aj ,
or – more algorithmically:

v j = aj

for i = 1 : (j − 1)

v j = v j − q i q ∗i v j

end

For the final modified Gram-Schmidt algorithm the projections are arranged differ-
ently, i.e., P⊥qi is applied to all v j with j > i. This leads to

Algorithm (Modified Gram-Schmidt)

for i = 1 : n

v i = ai

end

for i = 1 : n

r_ii = ||v_i||_2
q_i = v_i / r_ii
for j = (i + 1) : n
rij = q ∗i v j
v j = v j − rij q i
end

end
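A runnable Matlab/Octave version of the modified algorithm might look like this (again only a sketch; the name mgs and the interface are our choice, not necessarily identical to the course's mgs.m):

function [Q,R] = mgs(A)
% Modified Gram-Schmidt: reduced QR factorization A = Q*R.
[m,n] = size(A);
V = A; Q = zeros(m,n); R = zeros(n,n);
for i = 1:n
    R(i,i) = norm(V(:,i));
    Q(:,i) = V(:,i)/R(i,i);
    for j = i+1:n
        R(i,j) = Q(:,i)'*V(:,j);   % project against the updated v_j, not a_j
        V(:,j) = V(:,j) - R(i,j)*Q(:,i);
    end
end
end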

We can compare the operations count, i.e., the number of basic arithmetic operations
(‘+’,‘-’,‘*’,‘/’), of the two algorithms. We give only a rough estimate (exact counts will
be part of the homework). Assuming vectors of length m, for the classical Gram-
Schmidt roughly 4m operations are performed inside the innermost loop (actually m
multiplications and m−1 additions for the inner product, and m multiplications and m
subtractions for the formula in the second line). Thus, the operations count is roughly
Σ_{j=1}^{n} Σ_{i=1}^{j−1} 4m = Σ_{j=1}^{n} (j − 1) 4m ≈ 4m Σ_{j=1}^{n} j = 4m n(n + 1)/2 ≈ 2mn^2.

The innermost loop of the modified Gram-Schmidt algorithm consists formally of ex-
actly the same operations, i.e., requires also roughly 4m operations. Thus its operation
count is
Σ_{i=1}^{n} Σ_{j=i+1}^{n} 4m = Σ_{i=1}^{n} (n − i) 4m = 4m ( n^2 − Σ_{i=1}^{n} i ) = 4m ( n^2 − n(n + 1)/2 ) ≈ 2mn^2.

Thus, the operations count for the two algorithms is the same. In fact, mathematically,
the two algorithms can be shown to be identical. However, we will learn later that the
modified Gram-Schmidt algorithm is to be preferred due to its better numerical stability
(see Section 4.6).

4.5 Gram-Schmidt as Triangular Orthogonalization


One can view the modified Gram-Schmidt algorithm (applied to the entire matrix A)
as
AR1 R2 . . . Rn = Q̂, (19)

where R1, . . . , Rn are upper triangular n × n matrices. For example,

R1 = [ 1/r_11  −r_12/r_11  −r_13/r_11  · · ·  −r_1n/r_11;  0  1  0  · · ·  0;  0  0  1  · · ·  0;  . . . ;  0  · · ·  0  1 ]

and

R2 = [ 1  0  · · ·  0;  0  1/r_22  −r_23/r_22  · · ·  −r_2n/r_22;  0  0  1  · · ·  0;  . . . ;  0  · · ·  0  1 ],

and so on.
Thus we are applying triangular transformation matrices to A to obtain a matrix Q̂
with orthonormal columns. We refer to this approach as triangular orthogonalization.
Since the inverse of an upper triangular matrix is again an upper triangular matrix,
and the product of two upper triangular matrices is also upper triangular, we can think
of the product R1 R2 . . . Rn in (19) in terms of a matrix R̂−1 . Thus, the (modified)
Gram-Schmidt algorithm yields a reduced QR factorization

A = Q̂R̂

of A.

4.6 Stability of CGS vs. MGS in Matlab


The following discussion is taken from [Trefethen/Bau] and illustrated by the Mat-
lab code GramSchmidt.m (whose supporting routines clgs.m and mgs.m are part of a
computer assignment).
We create a random matrix A ∈ R^{80×80} by selecting singular values 1/2, 1/4, . . . , 1/2^80
and generating A = U ΣV ∗ with the help of (orthonormal) matrices U and V whose
entries are normally distributed random numbers (using the Matlab command randn).
Then we compute the QR factorization A = QR using both the classical and modified
Gram-Schmidt algorithms. The program then plots the diagonal elements of R together
with the singular values.
First we note that

A = Σ_{i=1}^{80} σ_i u_i v_i^T

so that

a_j = A(:, j) = Σ_{i=1}^{80} σ_i v_{ji} u_i.
Next, V is a normally distributed random unitary matrix, and therefore the entries in
one of its columns satisfy
|v_{ji}| ≈ 1/√80 ≈ 0.1.

Now from the (classical) Gram-Schmidt algorithm we know that

r_11 = ||a_1||_2 = || Σ_{i=1}^{80} σ_i v_{1i} u_i ||_2.

Since the singular values were chosen to decrease exponentially, only the first one really matters, i.e.,

r_11 ≈ ||σ_1 v_{11} u_1||_2 = σ_1 |v_{11}| ≈ (1/2) (1/√80)

(since ||u_1||_2 = 1).
Similar arguments result in the general relationship

r_jj ≈ (1/√80) σ_j
(the latter of which we know). The plot produced by GramSchmidt.m shows how accurately the diagonal elements of R are computed. We can observe that the classical Gram-Schmidt algorithm computes the r_jj accurately only down to about σ_j ≈ √eps (where eps is the machine epsilon), whereas the modified Gram-Schmidt method remains accurate all the way down to σ_j ≈ eps.
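A rough sketch of this experiment in Matlab/Octave (assuming clgs and mgs are implementations such as the ones sketched above; the course's GramSchmidt.m may differ in detail):

[U,~] = qr(randn(80));            % random orthogonal U
[V,~] = qr(randn(80));            % random orthogonal V
S = diag(2.^(-(1:80)));           % singular values 1/2, 1/4, ..., 1/2^80
A = U*S*V';
[QC,RC] = clgs(A);
[QM,RM] = mgs(A);
semilogy(1:80, diag(S), 'o', 1:80, abs(diag(RC)), 'x', 1:80, abs(diag(RM)), '+')
legend('\sigma_j', 'CGS r_{jj}', 'MGS r_{jj}')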

Remark In spite of the superior stability of the modified Gram-Schmidt algorithm it


still may not produce “good” orthogonality. Householder reflections – studied in the
next chapter – work better (see an example in [Trefethen/Bau]).

4.7 Householder Triangularization


Recall that we interpreted the Gram-Schmidt algorithm as triangular orthogonalization

AR1 R2 . . . Rn = Q̂

leading to the reduced QR factorization of an m × n matrix A. Now we will consider


an alternative approach to computing the (full) QR factorization corresponding to
orthogonal triangularization:

Qn Qn−1 . . . Q2 Q1 A = R,

where the matrices Qj are unitary.


The idea here is to design matrices Q1 , . . . , Qn such that A is successively trans-
formed to upper triangular form, i.e.,
   
A = [ x x x;  x x x;  x x x;  x x x ]  −→  Q1 A = [ x x x;  0 x x;  0 x x;  0 x x ]

−→  Q2 Q1 A = [ x x x;  0 x x;  0 0 x;  0 0 x ]  −→  Q3 Q2 Q1 A = [ x x x;  0 x x;  0 0 x;  0 0 0 ],

where x stands for a generally nonzero entry. From this we note that Qk needs to
operate on rows k : m and not change the first k − 1 rows and columns. Therefore it
will be of the form

Qk = [ I_{k−1}  O;  O  F ],
where Ik−1 is a (k − 1) × (k − 1) identity matrix and F has the effect that

F x = kxke1

in order to introduce zeros in the lower part of column k. We will call F a Householder
reflector.
Graphically, we can use either a rotation (Givens rotation) or a reflection about the
bisector of x and e1 to transform x to kxke1 .
Recall from an earlier homework assignment that if P is a projector, then I − 2P satisfies (I − 2P)^2 = I; in fact, I − 2P is a reflector. Therefore, if we choose v = ||x|| e_1 − x and define P = (v v^*)/(v^* v), then

F = I − 2P = I − 2 (v v^*)/(v^* v)
is our desired Householder reflector. Since it is easy to see that F is Hermitian, so is
Qk . Note that F x can be computed as

F x = ( I − 2 (v v^*)/(v^* v) ) x = x − 2 (v v^*)/(v^* v) x = x − 2 v (v^* x)/(v^* v),

where (v v^*)/(v^* v) is a matrix while (v^* x)/(v^* v) is a scalar.
In fact, we have two choices for the reflection F x: v + = −x + sign(x(1))kxke1
and v − = −x − sign(x(1))kxke1 . Here x(1) denotes the first component of the vector
x. These choices are illustrated in Figure 4. A numerically more stable algorithm

Figure 4: Graphical interpretation of Householder reflections.

(that will avoid cancellation of significant digits) will be guaranteed by choosing that
reflection which moves x further. Therefore we pick

v = x + sign(x(1))kxke1 ,

which is the same (except for orientation) as v − .


The resulting algorithm is

Algorithm (Householder QR)

for k = 1 : n (loop over columns)

x = A(k : m, k)
v k = x + sign(x(1))kxk2 e1
v k = v k /kv k k2
A(k : m, k : n) = A(k : m, k : n) − 2v k (v ∗k A(k : m, k : n))

end

Note that the statement in the last line of the algorithm performs the reflection
simultaneously for all remaining columns of the matrix A. On completion of this
algorithm the matrix A contains the matrix R of the QR factorization, and the vectors
v 1 , . . . , v n are the reflection vectors. They will be used to calculate matrix-vector
products of the form Qx and Q∗ b later on. The matrix Q itself is not output. It can
be constructed by computing special matrix-vector products Qx with x = e1 , . . . , en .
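A minimal Matlab/Octave sketch of this algorithm (our own illustration: it returns the reflection vectors as columns of W, overwrites A with R, and guards against sign(0) = 0):

function [W,R] = house(A)
% Householder QR: W holds the reflection vectors v_k, R is upper triangular.
[m,n] = size(A);
W = zeros(m,n);
for k = 1:n
    x = A(k:m,k);
    s = sign(x(1)); if s == 0, s = 1; end
    v = x; v(1) = v(1) + s*norm(x);
    v = v/norm(v);
    W(k:m,k) = v;
    A(k:m,k:n) = A(k:m,k:n) - 2*v*(v'*A(k:m,k:n));
end
R = triu(A);
end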

Example We apply Householder reflection to x = [2, 1, 2]T .


First we compute

v = x + sign(x(1)) ||x||_2 e_1 = [2; 1; 2] + 3 [1; 0; 0] = [5; 1; 2].

Next we form F x = x − 2 v (v^* x)/(v^* v). To this end we note that v^* x = 15 and v^* v = 30. Thus

F x = [2; 1; 2] − 2 · (15/30) · [5; 1; 2] = [−3; 0; 0].
This vector contains the desired zero.

For many applications only products of the form Q∗ b or Qx are needed. For
example, if we want to solve the linear system Ax = b then we can do this with the
QR factorization by first computing y = Q∗ b and then solving Rx = y. Therefore, we
list the respective algorithms for these two types of matrix vector products.
For the first algorithm we need to remember that

Qn . . . Q2 Q1 A = R,   i.e.,   Q^* = Qn . . . Q2 Q1,

so that we can apply exactly the same steps that were applied to the matrix A in the
Householder QR algorithm:
Algorithm (Compute Q∗ b)

for k = 1 : n

b(k : m) = b(k : m) − 2v k (v ∗k b(k : m))

end

For the second algorithm we use Q = Q1 Q2 . . . Qn (since Q∗i = Qi ), so that the


following algorithm simply performs the reflection operations in reverse order:
Algorithm (Compute Qx)

for k = n : −1 : 1

x(k : m) = x(k : m) − 2v k (v ∗k x(k : m))

end

The operations counts for the three algorithms listed above are

Householder QR: O( 2mn^2 − (2/3) n^3 )

Q^* b, Qx: O(mn)

5 Least Squares Problems
Consider the solution of Ax = b, where A ∈ Cm×n with m > n. In general, this system
is overdetermined and no exact solution is possible.
Example Fit a straight line to 10 measurements. If we represent the line by f (x) =
mx + c and the 10 pieces of data are {(x1 , y1 ), . . . , (x10 , y10 )}, then the constraints can
be summarized in the linear system

[ x1 1;  x2 1;  . . . ;  x10 1 ] [ m;  c ] = [ y1;  y2;  . . . ;  y10 ],

i.e., Ax = b with the 10 × 2 matrix A on the left, the unknown vector x = [m; c], and the data vector b on the right.

This type of problem is known as linear regression or (linear) least squares fitting.
The basic idea (due to Gauss) is to minimize the 2-norm of the residual vector, i.e.,
kb − Axk2 .
In other words, we want to find x ∈ Cn such that
Σ_{i=1}^{m} [ b_i − (Ax)_i ]^2

is minimized. For the notation used in the example above we want to find m and c such that Σ_{i=1}^{10} [ y_i − (m x_i + c) ]^2 is minimized.
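In Matlab the least squares solution of such an overdetermined system is exactly what the backslash operator returns (a small sketch with made-up data; the variable names are ours):

xd = (0:9)';                       % hypothetical abscissae
yd = 2*xd + 1 + 0.1*randn(10,1);   % noisy samples of a line
A  = [xd ones(10,1)];
mc = A \ yd;                       % mc(1) ~ slope m, mc(2) ~ intercept c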

Example We can generalize the previous example to polynomial least squares fitting
of arbitrary degree. To this end we assume that
p(x) = Σ_{i=0}^{n} c_i x^i,
where n is the degree of the polynomial.
We can fit a polynomial of degree n to m > n data points (xi , yi ), i = 1, . . . , m,
using the least squares approach, i.e.,
min Σ_{i=1}^{m} [ y_i − p(x_i) ]^2
is used as constraint for the overdetermined linear system Ax = b with
 
A = [ 1 x_1 x_1^2 . . . x_1^n;  1 x_2 x_2^2 . . . x_2^n;  . . . ;  1 x_m x_m^2 . . . x_m^n ],   x = [ c_0;  c_1;  . . . ;  c_n ],   b = [ y_1;  y_2;  . . . ;  y_m ].
Remark The special case n = m − 1 is called interpolation and is known to have a unique solution if the conditions are independent, i.e., if the points x_i are distinct. However, for large degrees n we frequently observe severe oscillations, which is undesirable.

5.1 How to Compute the Least Squares Solution
We want to find x such that Ax ∈ range(A) is as close as possible to a given vector b.
It should be clear that we need Ax to be the orthogonal projection of b onto the
range of A, i.e.,
Ax = P b.
Then the residual r = b − Ax will be minimal.

Theorem 5.1 Let A ∈ C^{m×n} (m ≥ n) with rank(A) = n and b ∈ C^m. The vector x ∈ C^n minimizes the residual norm ||r||_2 = ||b − Ax||_2

if and only if A^* r = 0,           (20)
if and only if A^* A x = A^* b,     (21)
if and only if A x = P b,           (22)

where P ∈ Cm×m is the orthogonal projector onto the range of A. Moreover, A∗ A is


nonsingular and the least squares solution x is unique.

(20) says that r is perpendicular to the range of A. (21) is known as the set of
normal equations.

Proof To see that (20) ⇔ (21) we use the definition of the residual r = b − Ax. Then

A∗ r = 0 ⇐⇒ A∗ (b − Ax) = 0
⇐⇒ A∗ Ax = A∗ b.

To see that (21) ⇔ (22) we use that the orthogonal projector onto the range of A is
given by
P = A(A∗ A)−1 A∗ .
Then

P b = Ax ⇐⇒ A(A∗ A)−1 A∗ b = Ax
⇐⇒ A∗ A(A∗ A)−1 A∗ b = A∗ Ax
| {z }
=I
⇐⇒ A∗ Ax = A∗ b.

Note that A∗ A is nonsingular if and only if A has full rank n.

Remark If A has full rank then A∗ A is also Hermitian positive definite, i.e., x∗ A∗ Ax >
0 for any nonzero n-vector x.

For full-rank A we can take (21) and obtain the least squares solution as

x = (A^* A)^{−1} A^* b = A^+ b.

The matrix A^+ = (A^* A)^{−1} A^* is known as the pseudoinverse of A.

5.2 Algorithms for finding the Least Squares Solution
5.2.1 Cholesky Factorization
This can be applied for a full-rank matrix A. As mentioned above A∗ A is Hermitian
positive definite and one can apply a symmetric form of Gaussian elimination resulting
in
A∗ A = R ∗ R
with upper triangular matrix R (more details will be provided in a later section).
This means that we have

A∗ Ax = A∗ b ⇐⇒ R∗ Rx = A∗ b
⇐⇒ R ∗ w = A∗ b

with w = Rx. Since R is upper triangular (and R∗ is lower triangular) this is easy to
solve.
We obtain one of our three-step algorithms:
Algorithm (Cholesky Least Squares)

(0) Set up the problem by computing A∗ A and A∗ b.

(1) Compute the Cholesky factorization A∗ A = R∗ R.

(2) Solve the lower triangular system R∗ w = A∗ b for w.

(3) Solve the upper triangular system Rx = w for x.

The operations count for this algorithm turns out to be O( mn^2 + (1/3) n^3 ).

Remark The solution of the normal equations is likely to be unstable. Therefore this
method is not recommended in general. For small problems it is usually safe to use.
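A sketch of these steps in Matlab (using the built-in chol, which returns the upper triangular R with A^*A = R^*R):

M = A'*A;  c = A'*b;      % step (0): set up the normal equations
R = chol(M);              % step (1): Cholesky factorization M = R'*R
w = R' \ c;               % step (2): forward substitution
x = R  \ w;               % step (3): back substitution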

5.2.2 QR Factorization
This works also for full-rank matrices A. Recall that the reduced QR factorization is
given by A = Q̂R̂ with Q̂ an m × n matrix with orthonormal columns, and R̂ an n × n
upper triangular matrix.
Now the normal equations can be re-written as

A^* A x = A^* b ⇐⇒ R̂^* Q̂^* Q̂ R̂ x = R̂^* Q̂^* b ⇐⇒ R̂^* R̂ x = R̂^* Q̂^* b,

where we used Q̂^* Q̂ = I.

Since A has full rank R̂ will be invertible and we can further simplify to

R̂x = Q̂∗ b.

This is only one triangular system to solve. The algorithm is


Algorithm (QR Least Squares)

(1) Compute the reduced QR factorization A = Q̂R̂.

(2) Compute Q̂∗ b.

(3) Solve the upper triangular system R̂x = Q̂∗ b for x.
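In Matlab these steps reduce to two lines using the economy-size qr (a sketch):

[Qhat, Rhat] = qr(A, 0);      % reduced QR factorization
x = Rhat \ (Qhat' * b);       % solve Rhat*x = Qhat'*b by back substitution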

An alternative interpretation is based on condition (22) in Theorem 5.1. We take


A = Q̂R̂ and P = Q̂Q̂∗ (since Q̂ is an orthonormal basis for range(A)). Then we have

Q̂R̂x = Q̂Q̂∗ b.

Multiplication by Q̂∗ yields


Q̂^* Q̂ R̂ x = Q̂^* Q̂ Q̂^* b,   and Q̂^* Q̂ = I on both sides,

so that we have
R̂x = Q̂∗ b
as before.
From either interpretation we see that

x = R̂−1 Q̂∗ b

so that (with this notation) the pseudoinverse is given by

A+ = R̂−1 Q̂∗ .

This is well-defined since R̂−1 exists because A has full rank.


The operations count (using Householder reflectors to compute the QR factorization) is O( 2mn^2 − (2/3) n^3 ).

Remark This approach is more stable than the Cholesky approach and is considered
the standard method for least squares problems.

5.2.3 SVD
We again assume that A has full rank. Recall that the reduced SVD is given by
A = Û Σ̂V ∗ , where Û ∈ Cm×n , Σ̂ ∈ Rn×n , and V ∈ Cn×n .
We start again with the normal equations

A^* A x = A^* b ⇐⇒ V Σ̂^* Û^* Û Σ̂ V^* x = V Σ̂^* Û^* b
             ⇐⇒ V Σ̂^2 V^* x = V Σ̂ Û^* b,

where we used Û^* Û = I and Σ̂^* = Σ̂.

Since A has full rank we can invert Σ̂ and multiply the last equation by Σ̂−1 V ∗ .
This results in
Σ̂V ∗ x = Û ∗ b
or
Σ̂w = Û ∗ b with w = V ∗ x.

Therefore (with the SVD notation) the pseudoinverse is given by

A^+ = V Σ̂^{−1} Û^*

and the least squares solution is given by

x = A^+ b = V Σ̂^{−1} Û^* b.
The algorithm is
Algorithm (SVD Least Squares)
(1) Compute the reduced SVD A = Û Σ̂V ∗ .
(2) Compute Û ∗ b.
(3) Solve the diagonal system Σ̂w = Û ∗ b for w.
(4) Compute x = V w.
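A Matlab sketch using the built-in economy-size SVD:

[Uh, Sh, V] = svd(A, 'econ');   % reduced SVD A = Uh*Sh*V'
w = Sh \ (Uh' * b);             % diagonal solve
x = V * w;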
This time the operations count is O( 2mn^2 + 11n^3 ), which is comparable to that of the QR factorization provided m ≫ n. Otherwise this algorithm is more expensive, but also more stable.

5.3 Solution of Rank Deficient Least Squares Problems


If rank(A) < n (which is necessarily the case if m < n, i.e., if we have an underdetermined problem), then infinitely many solutions exist.
A common approach to obtain a well-defined solution in this case is to add an
additional constraint of the form
kxk −→ min,
i.e., we seek the minimum norm solution.
In this case the unique solution is given by

x = Σ_{σ_i ≠ 0} (u_i^* b / σ_i) v_i

or
x = V Σ+ U ∗ b,
where now the pseudoinverse is given by
A+ = V Σ + U ∗ .
Here the pseudoinverse of Σ is defined as

Σ^+ = [ Σ_1^{−1}  O;  O  O ],
where Σ1 is that part of Σ containing the positive (and therefore invertible) singular
values.
As a final remark we note that there exists also a variant of the QR factorization
that is more stable due to the use of column pivoting. The idea of pivoting will be
discussed later in the context of the LU factorization.

6 Conditioning and Stability
A computing problem is well-posed if

1. a solution exists (e.g., we want to rule out situations that lead to division by
zero),

2. the solution is unique,

3. the solution depends continuously on the data, i.e., a small change in the data
should result in a small change in the answer. This phenomenon is referred to as
stability of the problem.

Example Consider the following three different recursion algorithms to compute x_n = (1/3)^n:

1. x_0 = 1, x_n = (1/3) x_{n−1} for n ≥ 1,

2. y_0 = 1, y_1 = 1/3, y_{n+1} = (4/3) y_n − (1/3) y_{n−1} for n ≥ 1,

3. z_0 = 1, z_1 = 1/3, z_{n+1} = (10/3) z_n − z_{n−1} for n ≥ 1.

The validity of the latter two approaches can be proved by induction. We illustrate
these algorithms with the Maple worksheet 477 577 stability.mws. Use of slightly
perturbed initial values shows us that the first algorithm yields stable errors throughout.
The second algorithm has stable errors, but unstable relative errors. And the third
algorithm is unstable in either sense.

6.1 The Condition Number of a Matrix


Consider solution of the linear system Ax = b, with exact answer x and computed
answer x̃. Thus, we expect an error

e = x − x̃.

Since x is not known to us in general we often judge the accuracy of the solution by
looking at the residual
r = b − Ax̃ = Ax − Ax̃ = Ae
and hope that a small residual guarantees a small error.

Example We consider Ax = b with

A = [ 1.01  0.99;  0.99  1.01 ],   b = [ 2;  2 ],

and exact solution x = [1, 1]^T.

(a) Let’s assume we computed a solution x̃ = [1.01, 1.01]^T. Then the error

e = x − x̃ = [ −0.01;  −0.01 ]

is small, and the residual

r = b − Ax̃ = [ 2;  2 ] − [ 2.02;  2.02 ] = [ −0.02;  −0.02 ]

is also small. Everything looks good.

(b) Now, let’s assume that we computed a solution x̃ = [2, 0]^T. This “solution” is obviously not a good one. Its error is

e = [ −1;  1 ],

which is quite large. However, the residual is

r = [ 2;  2 ] − [ 2.02;  1.98 ] = [ −0.02;  0.02 ],

which is still small. This is not good: it shows that the residual is not a reliable indicator of the accuracy of the solution.

(c) If we change the right-hand side of the problem to b = [2, −2]^T, so that the exact solution becomes x = [100, −100]^T, then things behave “wrong” again. Let’s assume we computed a solution x̃ = [101, −99]^T with a relatively small error of e = [−1, −1]^T. However, the residual now is

r = [ 2;  −2 ] − [ 4;  0 ] = [ −2;  −2 ],

which is relatively large. So again, the residual is not an accurate indicator of the error.

What is the explanation for the phenomenon we’re observing? The answer is, the
matrix A is ill-conditioned.

Let’s try to get a better understanding of how the error and the residual are related
for the problem Ax = b. We will use the notation

e = x − x̃, r = b − Ax̃ = b − b̃.

Thus,

kek = kx − x̃k = kA−1 b − A−1 b̃k = kA−1 (b − b̃)k


≤ kA−1 kkb − b̃k = kA−1 kkrk.

Therefore, the absolute error satisfies

kek ≤ kA−1 kkrk.

Often, however, it is better to consider the relative error ||e||/||x|| (and the relative residual ||r||/||b||):

||e|| ≤ ||A^{−1}|| ||r|| · ||Ax||/||b||        (note ||Ax||/||b|| = 1)
     ≤ ||A^{−1}|| ||A|| ||x|| · ||r||/||b||.

This yields the bound

||e||/||x|| ≤ ||A^{−1}|| ||A|| ||r||/||b|| = κ(A) ||r||/||b||,        (23)

where κ(A) = kA−1 kkAk is called the condition number of A.

Remark The condition number depends on the type of norm used. For the 2-norm of a nonsingular m × m matrix A we know ||A||_2 = σ_1 (the largest singular value of A) and ||A^{−1}||_2 = 1/σ_m. If A is singular then κ(A) = ∞.
Also note that κ(A) = σ_1/σ_m ≥ 1. In fact, κ(A) ≥ 1 holds for any norm.

How should we interpret the bound (23)? If κ(A) is large (i.e., the matrix is ill-
conditioned), then relatively small perturbations of the right-hand side b (and therefore
the residual) may lead to large errors; an instability.
For well-conditioned problems (i.e., κ(A) ≈ 1) we can also get a useful bound telling us what sort of relative error ||x − x̃||/||x|| we should at least expect. Consider

krkkxk = kb − b̃kkxk
= kAx − Ax̃kkxk = kA(x − x̃)kkxk
= kAekkxk
= kAekkA−1 bk ≤ kAkkekkA−1 kkbk,

so that
(1/κ(A)) ||r||/||b|| ≤ ||e||/||x||.        (24)
Of course, we can combine (23) and (24) to obtain

(1/κ(A)) ||r||/||b|| ≤ ||x − x̃||/||x|| ≤ κ(A) ||r||/||b||.        (25)

These bounds are true for any A, but show that the residual is a good indicator of the
error only if A is well-conditioned.
We now return to our

Example The SVD of the matrix A reveals

A = [ 1.01  0.99;  0.99  1.01 ] = [ √2/2  √2/2;  √2/2  −√2/2 ] [ 2  0;  0  0.02 ] [ √2/2  √2/2;  √2/2  −√2/2 ],

which implies

κ(A) = σ_1/σ_2 = 2/0.02 = 100.

For a 2 × 2 matrix this is an indication that A is fairly ill-conditioned. We see that the bounds (25) allow for large variations:

(1/100) ||r||/||b|| ≤ ||x − x̃||/||x|| ≤ 100 ||r||/||b||.
Thus the relative residual is not a good error indicator (as we saw in our initial calcu-
lations).

6.2 The Effect of Changes in A on the Relative Error


We again consider the linear system Ax = b. But now A may be perturbed to Ã = A + δA. We denote by x the exact solution of Ax = b, and by x̃ the exact solution of Ãx̃ = b, i.e., x̃ = x + δx.
This implies

Ãx̃ = b ⇐⇒ (A + δA)(x + δx) = b ⇐⇒ (Ax − b) + (δA)x + A(δx) + (δA)(δx) = 0,

where Ax − b = 0.

If we neglect the term with the product of the deltas then we get

(δA)x + A(δx) = 0 or (δx) = −A−1 (δA)x.

Taking norms this yields

||δx|| ≤ ||A^{−1}|| ||δA|| ||x||   ⇐⇒   ||δx|| ≤ ||A^{−1}|| ||A|| (||δA||/||A||) ||x||

or

||x − x̃||/||x|| ≤ κ(A) ||Ã − A||/||A||.        (26)
We can interpret (26) as follows: For ill-conditioned matrices a small perturbation
of the entries can lead to large changes in the solution of the linear system. This is also
evidence of an instability.

Example We consider

A = [ 1.01  0.99;  0.99  1.01 ]   with   δA = [ −0.01  0.01;  0.01  −0.01 ].

Now

Ã = A + δA = [ 1  1;  1  1 ],

which is even singular, so that Ãx̃ = b with b = [2, −2]^T has no solution at all.

Remark For matrices with condition number κ(A) one can expect to lose log10 κ(A)
digits when solving Ax = b.

6.3 Backward Stability
In light of the estimate (26) we say that an algorithm for solving Ax = b is backward
stable if
kx − x̃k
= O(κ(A)εmachine ),
kxk
i.e., if the significance of the error produced by the algorithm is due only to the condi-
tioning of the matrix.

Remark We can view a backward stable algorithm as one which delivers the “right answer to a perturbed problem”, namely Ãx̃ = b, with a perturbation of the order ||Ã − A||/||A|| = O(ε_machine).

Without providing any details (for more information see Chapter 18 in [Trefethen/Bau]),
for least-squares problems the estimate (26) becomes

||x − x̃||/||x|| ≤ ( κ(A) + κ^2(A) tan θ / η ) ||Ã − A||/||A||,        (27)

where κ(A) = ||A|| ||A^{−1}||, θ = cos^{−1}( ||Ax||/||b|| ), and η = ||A|| ||x|| / ||Ax||.

6.4 Stability of Least Squares Algorithms


We perform the following Matlab experiment (see the Matlab codes LSQ Stability.m
and LSQ Stability book.m).

Example Use Householder QR factorization, modified Gram-Schmidt, stabilized mod-


ified Gram-Schmidt, the normal equations, and the SVD to solve the following least
squares problem:
Fit a polynomial of degree 14 to 100 equally spaced samples taken from either
f(x) = 1/(1 + x^2) on [−5, 5],

or from

f(x) = e^{sin 4x} on [0, 1].
Since the least squares conditioning of this problem is given by (27) we can expect to
lose about 10 digits (even for a stable algorithm).
The following observations can be made from the Matlab example: Householder QR,
stabilized modified Gram-Schmidt and the SVD are stable, the normal equations (whose
system matrix A∗ A has condition number κ(A∗ A) = κ2 (A)), and the regular modified
Gram-Schmidt (where we encounter some loss of orthogonality when computing Q) are
both unstable.
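A rough Matlab sketch of such an experiment (our own setup, not the course's LSQ Stability.m; only the QR, normal-equations, and SVD solvers are shown):

m = 100; n = 15;                          % polynomial of degree 14
t = linspace(-5, 5, m)';  b = 1./(1 + t.^2);
A = zeros(m, n);
for k = 1:n, A(:,k) = t.^(k-1); end       % Vandermonde-type matrix
x_qr  = A \ b;                            % Householder QR (backslash)
x_ne  = (A'*A) \ (A'*b);                  % normal equations
[U,S,V] = svd(A, 'econ');
x_svd = V * (S \ (U'*b));                 % SVD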

6.5 Stabilization of Modified Gram-Schmidt
In order to be able to obtain Q with better orthogonality properties we apply the QR
factorization directly to the augmented system, i.e., compute

[ A  b ] = Q2 R2.

Then the last column of R2 contains the product Q̂^* b, i.e.,

R2 = [ R̂  Q̂^* b;  O  0 ]

and R̂x = Q̂∗ b can be solved. However, now Q̂∗ b is more accurate than if obtained via
the QR factorization of A alone.
A lot more details for the stabilization and the previous example are provided in
Chapter 19 of [Trefethen/Bau].

7 Gaussian Elimination and LU Factorization
In this final section on matrix factorization methods for solving Ax = b we want to
take a closer look at Gaussian elimination (probably the best known method for solving
systems of linear equations).
The basic idea is to use left-multiplication of A ∈ Cm×m by (elementary) lower
triangular matrices, L1 , L2 , . . . , Lm−1 to convert A to upper triangular form, i.e.,

L_{m−1} L_{m−2} . . . L_2 L_1 A = U,   and we write L̃ = L_{m−1} L_{m−2} . . . L_2 L_1.

Note that the product of lower triangular matrices is a lower triangular matrix, and
the inverse of a lower triangular matrix is also lower triangular. Therefore,

L̃A = U ⇐⇒ A = LU,

where L = L̃^{−1}. This approach can be viewed as triangular triangularization.

7.1 Why Would We Want to Do This?


Consider the system Ax = b with LU factorization A = LU . Then we have

L (U x) = b,   and we set y = U x.

Therefore we can perform (a now familiar) 2-step solution procedure:

1. Solve the lower triangular system Ly = b for y by forward substitution.

2. Solve the upper triangular system U x = y for x by back substitution.

Moreover, consider the problem AX = B (i.e., many different right-hand sides that
are associated with the same system matrix). In this case we need to compute the
factorization A = LU only once, and then

AX = B ⇐⇒ LU X = B,

and we proceed as before:

1. Solve LY = B by many forward substitutions (in parallel).

2. Solve U X = Y by many back substitutions (in parallel).

In order to appreciate the usefulness of this approach note that the operations count for the matrix factorization is O( (2/3) m^3 ), while that for forward and back substitution is only O(m^2).
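In Matlab the same two-step procedure is available through the built-in lu (a sketch; lu also performs the pivoting discussed in Section 7.2):

[L,U,P] = lu(A);      % P*A = L*U with partial pivoting
y = L \ (P*b);        % forward substitution
x = U \ y;            % back substitution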

Example Take the matrix

A = [ 1 1 1;  2 3 5;  4 6 8 ]

and compute its LU factorization by applying elementary lower triangular transforma-
tion matrices.
We choose L1 such that left-multiplication corresponds to subtracting multiples of
row 1 from the rows below such that the entries in the first column of A are zeroed out
(cf. the first homework assignment). Thus
    
L1 A = [ 1 0 0;  −2 1 0;  −4 0 1 ] [ 1 1 1;  2 3 5;  4 6 8 ] = [ 1 1 1;  0 1 3;  0 2 4 ].

Next, we repeat this operation analogously for L2 (in order to zero what is left in column 2 of the matrix on the right-hand side above):

L2 (L1 A) = [ 1 0 0;  0 1 0;  0 −2 1 ] [ 1 1 1;  0 1 3;  0 2 4 ] = [ 1 1 1;  0 1 3;  0 0 −2 ] = U.

Now L = (L2 L1)^{−1} = L1^{−1} L2^{−1} with

L1^{−1} = [ 1 0 0;  2 1 0;  4 0 1 ]   and   L2^{−1} = [ 1 0 0;  0 1 0;  0 2 1 ],

so that

L = [ 1 0 0;  2 1 0;  4 2 1 ].
Remark Note that L always is a unit lower triangular matrix, i.e., it has ones on the
diagonal. Moreover, L is always obtained as above, i.e., the multipliers are accumulated
into the lower triangular part with a change of sign.
The claims made above can be verified as follows. First, we note that the multipliers
in Lk are of the form
ℓ_{jk} = a_{jk} / a_{kk},   j = k + 1, . . . , m,

so that L_k is the identity matrix with the entries −ℓ_{k+1,k}, . . . , −ℓ_{m,k} inserted below the diagonal in column k. Now, let

ℓ_k = [ 0, . . . , 0, ℓ_{k+1,k}, . . . , ℓ_{m,k} ]^T.

Then L_k = I − ℓ_k e_k^*, and therefore

(I − ℓ_k e_k^*)(I + ℓ_k e_k^*) = I − ℓ_k (e_k^* ℓ_k) e_k^* = I,

i.e., L_k^{−1} = I + ℓ_k e_k^*, since the inner product e_k^* ℓ_k = 0 because the only nonzero entry in e_k (the 1 in the k-th position) does not “hit” any nonzero entries in ℓ_k, which start in the (k + 1)-st position.
So, for any k, L_k^{−1} is the identity matrix with the entries +ℓ_{k+1,k}, . . . , +ℓ_{m,k} below the diagonal in column k, as claimed.
In addition,

L_k^{−1} L_{k+1}^{−1} = (I + ℓ_k e_k^*)(I + ℓ_{k+1} e_{k+1}^*) = I + ℓ_k e_k^* + ℓ_{k+1} e_{k+1}^* + ℓ_k (e_k^* ℓ_{k+1}) e_{k+1}^* = I + ℓ_k e_k^* + ℓ_{k+1} e_{k+1}^*,

since e_k^* ℓ_{k+1} = 0, i.e., the product is the identity with the multipliers of columns k and k + 1 filled in below the diagonal,

and in general we have

L = L_1^{−1} . . . L_{m−1}^{−1} = [ 1;  ℓ_{2,1} 1;  ℓ_{3,1} ℓ_{3,2} 1;  . . . ;  ℓ_{m,1} ℓ_{m,2} . . . ℓ_{m,m−1} 1 ],

i.e., the unit lower triangular matrix that collects all the multipliers ℓ_{j,k} below its diagonal.

We can summarize the factorization in


Algorithm (LU Factorization)
Initialize U = A, L = I

for k = 1 : m − 1

for j = k + 1 : m

L(j, k) = U (j, k)/U (k, k)
U (j, k : m) = U (j, k : m) − L(j, k)U (k, k : m)
end
end
Remark 1. In practice one can actually store both L and U in the original matrix
A since it is known that the diagonal of L consists of all ones.
2. The LU factorization is the cheapest factorization algorithm. Its operations count
can be verified to be O( (2/3) m^3 ).
However, LU factorization cannot be guaranteed to be stable. The following exam-
ples illustrate this fact.
Example A fundamental problem is given if we encounter a zero pivot as in
   
A = [ 1 1 1;  2 2 5;  4 6 8 ]   =⇒   L1 A = [ 1 1 1;  0 0 3;  0 2 4 ].
Now the (2,2) position contains a zero and the algorithm will break down since it will
attempt to divide by zero.
Example A more subtle example is the following backward instability. Take
 
A = [ 1 1 1;  2 2+ε 5;  4 6 8 ]

with small ε. If ε = 1 then we have the initial example in this chapter, and for ε = 0 we get the previous example.
LU factorization will result in

L1 A = [ 1 1 1;  0 ε 3;  0 2 4 ]

and

L2 L1 A = [ 1 1 1;  0 ε 3;  0 0 4 − 6/ε ] = U.

The multipliers were

L = [ 1 0 0;  2 1 0;  4 2/ε 1 ].

Now we assume that a right-hand side b is given as

b = [ 1;  0;  0 ]
and we attempt to solve Ax = b via

1. Solve Ly = b.

2. Solve U x = y.
If ε is on the order of machine accuracy, then the 4 in the entry 4 − 6/ε in U is insignificant. Therefore, we have

Ũ = [ 1 1 1;  0 ε 3;  0 0 −6/ε ]   and   L̃ = L,

which leads to

L̃Ũ = [ 1 1 1;  2 2+ε 5;  4 6 4 ] ≠ A.

In fact, the product is significantly different from A. Thus, using L̃ and Ũ we are not able to solve a “nearby problem”, and thus the LU factorization method is not backward stable.
If we use the factorization based on L̃ and Ũ with the above right-hand side b, then we obtain

x̃ = [ 11/3 − (2/3)ε;  −2;  (2/3)ε − 2/3 ] ≈ [ 11/3;  −2;  −2/3 ].

Whereas if we were to use the exact factorization A = LU, then we get the exact answer

x = [ (4ε − 7)/(2ε − 3);  2/(2ε − 3);  −2(ε − 1)/(2ε − 3) ] ≈ [ 7/3;  −2/3;  −2/3 ].

Remark Even though L̃ and Ũ are close to L and U , the product L̃Ũ is not close to
LU = A and the computed solution x̃ is worthless.

7.2 Pivoting
Example The breakdown of the algorithm in our earlier example with

L1 A = [ 1 1 1;  0 0 3;  0 2 4 ]

can be prevented by simply swapping rows, i.e., instead of trying to apply L2 to L1 A we first create

P L1 A = [ 1 0 0;  0 0 1;  0 1 0 ] L1 A = [ 1 1 1;  0 2 4;  0 0 3 ]

— and are done.

More generally, stability problems can be avoided by swapping rows before applying
Lk , i.e., we perform
Lm−1 Pm−1 . . . L2 P2 L1 P1 A = U.
The strategy we use for swapping rows in step k is to find the largest element in column
k below (and including) the diagonal — the so-called pivot element — and swap its row
with row k. This process is referred to as partial (row) pivoting. Partial column pivoting
and complete (row and column) pivoting are also possible, but not very popular.

Example Consider again the matrix

A = [ 1 1 1;  2 2+ε 5;  4 6 8 ].

The largest element in the first column is the 4 in the (3, 1) position. This is our first pivot, and we swap rows 1 and 3. Therefore

P1 A = [ 0 0 1;  0 1 0;  1 0 0 ] [ 1 1 1;  2 2+ε 5;  4 6 8 ] = [ 4 6 8;  2 2+ε 5;  1 1 1 ],

and then

L1 P1 A = [ 1 0 0;  −1/2 1 0;  −1/4 0 1 ] [ 4 6 8;  2 2+ε 5;  1 1 1 ] = [ 4 6 8;  0 ε−1 1;  0 −1/2 −1 ].

Now we need to pick the second pivot element. For sufficiently small ε (in fact, unless 1/2 < ε < 3/2), we pick ε − 1 as the largest element in the second column below the first row. Therefore, the second permutation matrix is just the identity, and we have

P2 L1 P1 A = [ 1 0 0;  0 1 0;  0 0 1 ] [ 4 6 8;  0 ε−1 1;  0 −1/2 −1 ] = [ 4 6 8;  0 ε−1 1;  0 −1/2 −1 ].

To complete the elimination phase, we need to perform the elimination in the second column:

L2 P2 L1 P1 A = [ 1 0 0;  0 1 0;  0 1/(2(ε−1)) 1 ] [ 4 6 8;  0 ε−1 1;  0 −1/2 −1 ] = [ 4 6 8;  0 ε−1 1;  0 0 (3−2ε)/(2(ε−1)) ] = U.

The lower triangular matrix L is given by

L = L1^{−1} L2^{−1} = [ 1 0 0;  1/2 1 0;  1/4 −1/(2(ε−1)) 1 ],

and assuming that ε − 1 ≈ −1 we get

L̃ = [ 1 0 0;  1/2 1 0;  1/4 1/2 1 ]   and   Ũ = [ 4 6 8;  0 −1 1;  0 0 −3/2 ].

If we now check the computed factorization L̃Ũ, then we see

L̃Ũ = [ 4 6 8;  2 2 5;  1 1 1 ] = P Ã,

which is just a permuted version of the original matrix A with permutation matrix

P = P2 P1 = [ 0 0 1;  0 1 0;  1 0 0 ].

Thus, this approach was backward stable.


Finally, since we have the factorization P A = LU , we can solve the linear system
Ax = b as
P Ax = P b ⇐⇒ LU x = P b,
and apply the usual two-step procedure

1. Solve the lower triangular system Ly = P b for y.

2. Solve the upper triangular system U x = y for x.

This yields

x = [ (−7 + 4ε)/(−3 + 2ε);  2/(2ε − 3);  −2(ε − 1)/(2ε − 3) ] ≈ [ 7/3;  −2/3;  −2/3 ].

If we use the rounded factors L̃ and Ũ instead, then the computed solution is

x̃ = [ 7/3;  −2/3;  −2/3 ],

which is the exact answer to the problem (see also the Maple worksheet 473 LU.mws).

In general, LU factorization with pivoting results in

P A = LU,

where P = P_{m−1} P_{m−2} . . . P_2 P_1, and L = (L′_{m−1} L′_{m−2} . . . L′_2 L′_1)^{−1} with

L′_k = P_{m−1} . . . P_{k+1} L_k P_{k+1}^{−1} . . . P_{m−1}^{−1},

i.e., L′_k is the same as L_k except that the entries below the diagonal are appropriately permuted. In particular, L′_k is still lower triangular.

Remark Since the permutation matrices used here involve only a single row swap each, we have P_k^{−1} = P_k (while in general, of course, P^{−1} = P^T).

In the example above L′_2 = L2, and L′_1 = P2 L1 P2^{−1} = L1 since P2 = I.

Remark Due to the pivoting strategy the multipliers will always satisfy |ℓ_{ij}| ≤ 1.

A possible interpretation of the pivoting strategy is that the matrix P is deter-


mined so that it would yield a permuted matrix A whose standard LU factorization is
backward stable. Of course, we do not know how to do this in advance, and so the P
is determined as the algorithm progresses.
An algorithm for the factorization of an m × m matrix A is given by
Algorithm (LU Factorization with Partial Pivoting)
Initialize U = A, L = I, P = I

for k = 1 : m − 1

find i ≥ k to maximize |U (i, k)|


U (k, k : m) ←→ U (i, k : m)
L(k, 1 : k − 1) ←→ L(i, 1 : k − 1)
P (k, :) ←→ P (i, :)
for j = k + 1 : m
L(j, k) = U (j, k)/U (k, k)
U (j, k : m) = U (j, k : m) − L(j, k)U (k, k : m)
end

end
The operations count for this algorithm is also O( (2/3) m^3 ). However, while the swaps for partial pivoting require O(m^2) operations, they would require O(m^3) operations in the case of complete pivoting.

Remark The algorithm above is not really practical since one would usually not phys-
ically swap rows. Instead one would use pointers to the swapped rows and store the
permutation operations instead.

7.3 Stability
We saw earlier that Gaussian elimination without pivoting can be unstable. Accord-
ing to our previous example the algorithm with pivoting seems to be stable. What can
be proven theoretically?
Since the entries of L are at most 1 in absolute value, the LU factorization becomes unstable if the entries of U are unbounded relative to those of A (we need ||L|| ||U|| = O(||P A||)). Therefore we define a growth factor

ρ = max_{i,j} |U(i, j)| / max_{i,j} |A(i, j)|.
One can show

Theorem 7.1 Let A ∈ C^{m×m}. Then LU factorization with partial pivoting guarantees that ρ ≤ 2^{m−1}.

This bound is unacceptably high, and indicates that the algorithm (with pivoting)
can be unstable. The following (contrived) example illustrates this.

Example Take

A = [ 1 0 0 0 1;  −1 1 0 0 1;  −1 −1 1 0 1;  −1 −1 −1 1 1;  −1 −1 −1 −1 1 ].

Then LU factorization produces the following sequence of matrices:

[ 1 0 0 0 1;  0 1 0 0 2;  0 −1 1 0 2;  0 −1 −1 1 2;  0 −1 −1 −1 2 ]
−→ [ 1 0 0 0 1;  0 1 0 0 2;  0 0 1 0 4;  0 0 −1 1 4;  0 0 −1 −1 4 ]
−→ [ 1 0 0 0 1;  0 1 0 0 2;  0 0 1 0 4;  0 0 0 1 8;  0 0 0 −1 8 ]
−→ [ 1 0 0 0 1;  0 1 0 0 2;  0 0 1 0 4;  0 0 0 1 8;  0 0 0 0 16 ] = U,

and we see that the largest element in U is 16 = 2^{m−1}.

Remark BUT in practice it turns out that matrices with such large growth factors almost never arise. For most practical cases a realistic bound on ρ seems to be m.
This is still an active area of research and some more details can be found in the
[Trefethen/Bau] book.

In summary we can say that for most practical purposes LU factorization with
partial pivoting is considered to be a stable algorithm.

7.4 Cholesky Factorization


In the case when the matrix A is Hermitian (or symmetric) positive definite we can
devise a faster (and more stable) algorithm.
Recall that A ∈ Cm×m is Hermitian if A∗ = A. This implies x∗ Ay = y ∗ Ax for any
x, y ∈ Cm (see homework for details). If we let y = x then we see that x∗ Ax is real.
Now, if x∗ Ax > 0 for any x 6= 0 then A is called Hermitian positive definite.

Remark Symmetric positive definite matrices are defined analogously with ∗ replaced
by T .

7.4.1 Some Properties of Positive Definite Matrices


Assume A ∈ Cm×m is positive definite.

1. If X ∈ Cm×n has full rank, then X ∗ AX ∈ Cn×n is positive definite.

2. Any principal submatrix (i.e., the intersection of a set of rows and the corre-
sponding columns) of A is positive definite.

3. All diagonal entries of A are positive and real.

4. The maximum element of A is on the diagonal.

5. All eigenvalues of A are positive.

7.4.2 How Does Cholesky Factorization Work?


Consider the Hermitian positive definite A ∈ Cm×m in block form

A = [ 1  w^*;  w  K ]

and apply the first step of the LU factorization algorithm. Then

A = [ 1  0^T;  w  I ] [ 1  w^*;  0  K − w w^* ].

The main idea of the Cholesky factorization algorithm is to take advantage of the fact that A is Hermitian and perform operations symmetrically, i.e., also zero the first row. Thus

A = [ 1  0^T;  w  I ] [ 1  0^T;  0  K − w w^* ] [ 1  w^*;  0  I ].

Note that the three matrices are lower triangular, block-diagonal, and upper triangular (in fact the adjoint of the lower triangular factor).
Now we continue iteratively. However, in general A will not have a 1 in the upper left-hand corner. We will have to be able to deal with an arbitrary (albeit positive) entry in the (1,1) position. Therefore we reconsider the first step with slightly more general notation, i.e.,

A = [ a_11  w^*;  w  K ]

so that

A = [ √a_11  0^T;  w/√a_11  I ] [ 1  0^T;  0  K − w w^*/a_11 ] [ √a_11  w^*/√a_11;  0  I ] = R_1^* A_1 R_1.

Now we can iterate on the inner matrix A1 . Note that the (1,1) entry of K − ww∗ /a11
has to be positive since A was positive definite and R1 is nonsingular. This guarantees
that A1 = R1−∗ AR1−1 is positive definite and therefore has positive diagonal entries.
The iteration yields
A1 = R2∗ A2 R2
so that
A = R1∗ R2∗ A2 R2 R1 ,
and eventually
A = (R_1^* R_2^* . . . R_m^*)(R_m . . . R_2 R_1) = R^* R,   with R^* = R_1^* R_2^* . . . R_m^* and R = R_m . . . R_2 R_1,

the Cholesky factorization of A. Note that R is upper triangular and its diagonal is
positive (because of the square roots). Thus we have proved

Theorem 7.2 Every Hermitian positive definite matrix A has a unique Cholesky fac-
torization A = R∗ R with R an upper triangular matrix with positive diagonal entries.

Example Consider

A = [ 4 2 4;  2 5 6;  4 6 9 ].

Cholesky factorization as explained above yields the sequence

A = [ 2 0 0;  1 1 0;  2 0 1 ] [ 1 0 0;  0 4 4;  0 4 5 ] [ 2 1 2;  0 1 0;  0 0 1 ]
  = [ 2 0 0;  1 1 0;  2 0 1 ] [ 1 0 0;  0 2 0;  0 2 1 ] [ 1 0 0;  0 1 0;  0 0 1 ] [ 1 0 0;  0 2 2;  0 0 1 ] [ 2 1 2;  0 1 0;  0 0 1 ].

Distributing the 2 copies of the identity matrix in the middle we arrive at

R^* = [ 2 0 0;  1 2 0;  2 2 1 ]   (and hence R = [ 2 1 2;  0 2 2;  0 0 1 ]).

Algorithm (Cholesky Factorization)

initialize R = upper triangular part of A (including the diagonal)

for k = 1 : m

for j = k + 1 : m
R(j, j : m) = R(j, j : m) − R(k, j : m) R(k, j)/R(k, k)
end
R(k, k : m) = R(k, k : m)/sqrt(R(k, k))

end
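A runnable Matlab/Octave version of this pseudocode (a sketch for the real symmetric positive definite case; Matlab's built-in chol does the same job):

function R = chol_sketch(A)
% Cholesky factorization A = R'*R with R upper triangular.
m = size(A,1);
R = triu(A);
for k = 1:m
    for j = k+1:m
        R(j,j:m) = R(j,j:m) - R(k,j:m)*R(k,j)/R(k,k);
    end
    R(k,k:m) = R(k,k:m)/sqrt(R(k,k));
end
end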

The operations count for the Cholesky algorithm can be computed as

Σ_{k=1}^{m} [ (m − k + 1) + Σ_{j=k+1}^{m} ( 2(m − j + 1) + 1 ) ] = (1/3) m^3 + m^2 − (1/3) m.

Thus the count is O( (1/3) m^3 ), which is half the amount needed for the LU factorization.
Another nice feature of Cholesky factorization is that it is always stable — even
without pivoting.

Remark The simplest (and cheapest) way to test whether a matrix is positive definite is to run Cholesky factorization on it. If the algorithm succeeds the matrix is positive definite; if it breaks down, it is not.

The solution of Ax = b with Hermitian positive definite A is given by:

1. Compute Cholesky factorization A = R∗ R.

2. Solve the lower triangular system R∗ y = b for y.

3. Solve the upper triangular system Rx = y for x.
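These steps look as follows in Matlab (a sketch using the built-in chol):

R = chol(A);       % A = R'*R with R upper triangular
y = R' \ b;        % forward substitution
x = R  \ y;        % back substitution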

8 Eigenvalue Problems
8.1 Motivation and Definition
Matrices can be used to represent linear transformations. Their effects can be: rotation,
reflection, translation, scaling, permutation, etc., and combinations thereof. These
transformations can be rather complicated, and therefore we often want to decompose
a transformation into a few simple actions that we can better understand. Finding
singular values and associated singular vectors is one such approach. In engineering,
one often speaks of principal component analysis.
A more basic approach is to consider eigenvalues and eigenvectors.

Definition 8.1 Let A ∈ C^{m×m}. If for some pair (λ, x), λ ∈ C, x (≠ 0) ∈ C^m, we have

Ax = λx,

then λ is called an eigenvalue and x the associated eigenvector of A.

Remark Eigenvectors specify the directions in which the matrix action is simple: any
vector parallel to an eigenvector is changed only in length and/or orientation by the
matrix A.

In practical applications, eigenvalues and eigenvectors are used to find modes of


vibrations (e.g., in acoustics or mechanics), i.e., instabilities of structures can be inves-
tigated via an eigenanalysis.
In theoretical applications, eigenvalues often play an important role in the analysis
of convergence of iterative algorithms (for solving linear systems), long-term behavior
of dynamical systems, or stability of numerical solvers for differential equations.

8.2 Other Basic Facts


Some other terminology that will be used includes the eigenspace Eλ , i.e., the vector
space of all eigenvectors corresponding to λ:

Eλ = span{x : Ax = λx, λ ∈ C}.

Note that this vector space includes the zero vector — even though 0 is not an eigen-
vector.
The set of all eigenvalues of A is known as the spectrum of A, denoted by Λ(A).
The spectral radius of A is defined as

ρ(A) = max{|λ| : λ ∈ Λ(A)}.

8.2.1 The Characteristic Polynomial


The definition of eigenpairs Ax = λx is equivalent to

(A − λI) x = 0.

Thus, λ is an eigenvalue of A if and only if the linear system (A − λI)x = 0 has a


nontrivial (i.e., x ≠ 0) solution.

This, in turn, is equivalent to det(A−λI) = 0. Therefore we define the characteristic
polynomial of A as
pA (z) = det(zI − A).
Then we get

Theorem 8.2 λ is an eigenvalue of A if and only if pA (λ) = 0.

Proof See above.

Remark This definition of pA ensures that the coefficient of z m is +1, i.e., pA is a


monic polynomial.

Example It is well known that even real matrices can have complex eigenvalues. For instance,

A = [ 0 1;  −1 0 ]

has characteristic polynomial

p_A(z) = det( [ z  −1;  1  z ] ) = z^2 + 1,

so that its eigenvalues are λ_{1,2} = ±i with associated eigenvectors

x_1 = [ 1;  i ]   and   x_2 = [ i;  1 ].

However, if A is symmetric (or Hermitian), then all its eigenvalues are real. More-
over, the eigenvectors to distinct eigenvalues are linearly independent, and eigenvectors
to distinct eigenvalues of a symmetric/Hermitian matrix are orthogonal.

Remark Since the eigenvalues of an m×m matrix are given by the roots of a degree-m
polynomial, it is clear that for problems with m > 4 we will have to use iterative (i.e.,
numerical) methods to find the eigenvalues.
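For small examples one can nevertheless look at the characteristic polynomial directly in Matlab (a toy illustration; in practice one always calls eig, which does not form p_A at all):

A = [0 1; -1 0];
p = poly(A)       % coefficients of det(zI - A), here [1 0 1], i.e., z^2 + 1
roots(p)          % the eigenvalues ±i
eig(A)            % the recommended way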

8.2.2 Geometric and Algebraic Multiplicities


The number of linearly independent eigenvectors associated with a given eigenvalue λ,
i.e., the dimension of Eλ is called the geometric multiplicity of λ.
The power of the factor (z − λ) in the characteristic polynomial pA is called the
algebraic multiplicity of λ.

Theorem 8.3 Any A ∈ Cm×m has m eigenvalues provided we count the algebraic mul-
tiplicities. In particular, if the roots of pA are simple, the A has m distinct eigenvalues.

Example Take

A = [ 1 0 0;  1 1 1;  0 0 1 ],

so that pA (z) = (z − 1)3 . Thus, λ = 1 is an eigenvalue (in fact, the only one) of A with
algebraic multiplicity 3. To determine its geometric multiplicity we need to find the
associated eigenvectors.
To this end we solve (A − λI)x = 0 for the special case λ = 1. This yields the
augmented matrix

[ 0 0 0 | 0;  1 0 1 | 0;  0 0 0 | 0 ]

so that x_1 = −x_3, i.e.,

x = [ α;  β;  −α ] = α [ 1;  0;  −1 ] + β [ 0;  1;  0 ],

and therefore the geometric multiplicity of λ = 1 is only 2.

In general one can prove the following


Theorem 8.4 The algebraic multiplicity of λ is always greater than or equal to its geometric multiplicity.
This prompts
Definition 8.5 If the geometric multiplicity of λ is less than its algebraic multiplicity,
then λ is called defective.

Example As a continuation of the previous example we see that the matrix

B = [ 1 0 0;  0 1 0;  0 0 1 ]

has the same characteristic polynomial as before, i.e., p_B(z) = (z − 1)^3, and λ = 1 is again an eigenvalue with algebraic multiplicity 3. To determine its geometric multiplicity we solve (B − I)x = 0, i.e., look at

[ 0 0 0 | 0;  0 0 0 | 0;  0 0 0 | 0 ].

Now there is no restriction on the components of x and we have

x = [ α;  β;  γ ] = α [ 1;  0;  0 ] + β [ 0;  1;  0 ] + γ [ 0;  0;  1 ],

so that the geometric multiplicity of λ = 1 is 3 in this case.
At the other extreme, the matrix

C = [ 1 1 0;  0 1 1;  0 0 1 ]

also has the characteristic polynomial pC (z) = (z − 1)3 , so that λ = 1 has algebraic
multiplicity 3. However, now the solution of (C − I)x = 0, leads to
 
[ 0 1 0 | 0;  0 0 1 | 0;  0 0 0 | 0 ],

so that x_2 = x_3 = 0 and we get

x = α [ 1;  0;  0 ].
This means that the geometric multiplicity of λ = 1 now is 1.

8.3 Determinant and Trace


The trace of a matrix A, tr(A), is given by the sum of its diagonal elements, i.e.,
tr(A) = Σ_{j=1}^{m} a_{jj}.

Theorem 8.6 If A ∈ Cm×m with eigenvalues λ1 , . . . , λm , then


1. det(A) = ∏_{j=1}^{m} λ_j,

2. tr(A) = Σ_{j=1}^{m} λ_j.

Proof Recall the definition of the characteristic polynomial, p_A(z) = det(zI − A), so that

p_A(0) = det(−A) = (−1)^m det(A).

On the other hand, we also have p_A(z) = ∏_{j=1}^{m} (z − λ_j), which implies

p_A(0) = ∏_{j=1}^{m} (−λ_j) = (−1)^m ∏_{j=1}^{m} λ_j.

Comparing the two representations of p_A(0) yields the first formula.
For the second one can show that the coefficient of z^{m−1} in the representation det(zI − A) of the characteristic polynomial is −tr(A). On the other hand, the coefficient of z^{m−1} in the representation ∏_{j=1}^{m} (z − λ_j) is −Σ_{j=1}^{m} λ_j. Together we get the desired formula.

8.4 Similarity and Diagonalization
Consider two matrices A, B ∈ Cm×m . A and B are called similar if

B = X −1 AX

for some nonsingular X ∈ Cm×m .

Theorem 8.7 Similar matrices have the same characteristic polynomial, eigenvalues,
algebraic and geometric multiplicities. The eigenvectors, however, are in general dif-
ferent.

Theorem 8.8 A matrix A ∈ Cm×m is nondefective, i.e., has no defective eigenvalues,


if and only if A is similar to a diagonal matrix, i.e.,

A = XΛX −1 ,

where X = [x1 , x2 , . . . , xm ] is the matrix formed with the eigenvectors of A as its


columns, and Λ = diag(λ1 , λ2 , . . . , λm ) contains the eigenvalues.

Remark Due to this theorem, nondefective matrices are diagonalizable. Also note
that nondefective matrices have linearly independent eigenvectors.

Remark The factorization


A = XΛX −1
is called the eigenvalue (or eigen-) decomposition of A. We can interpret this decom-
position as a change of basis by which a coupled linear system is transformed to a
decoupled diagonal system. This means

Ax = b ⇐⇒ XΛX^{−1} x = b ⇐⇒ Λ (X^{−1} x) = X^{−1} b,   i.e.,   Λ x̂ = b̂ with x̂ = X^{−1} x and b̂ = X^{−1} b.

This shows that x̂ and b̂ correspond to x and b as viewed in the basis of eigenvectors
(i.e., columns of X).

If the eigenvectors of A are not only linearly independent, but also orthogonal, then
we can factor A as
A = QΛQ∗
with a unitary matrix Q of eigenvectors. Thus, A is called unitarily diagonalizable.

Theorem 8.9 If A is Hermitian, then A is unitarily diagonalizable. Moreover, Λ is


real.

More generally, A is called normal if

AA∗ = A∗ A,

and we have

Theorem 8.10 A is unitarily diagonalizable if and only if A is normal.

8.5 Schur Factorization
The most useful linear algebra fact summarized here for numerical analysis purposes is

Theorem 8.11 Every square matrix A ∈ Cm×m has a Schur factorization

A = QT Q∗ ,

with unitary matrix Q and upper triangular matrix T such that diag(T ) contains the
eigenvalues of A.

Remark Note the similarity of this result to the singular value decomposition. The
Schur factorization is quite general in that it exists for every, albeit only square, matrix.
Also, the matrix T contains the eigenvalues (instead of the singular values), and it is
upper triangular (i.e., not quite as nice as diagonal). On the other hand, only one
unitary matrix is used.

Remark By using nonunitary matrices for the similarity transform one can obtain the
Jordan normal form of a matrix in which T is bidiagonal.

Remark Both the Schur factorization and the Jordan form are considered not appro-
priate for numerical/practical computations because of the possibility of complex terms
occurring in the matrix factors (even for real A). The SVD is preferred since all of its
factors are real.
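For completeness, the Schur factorization is directly available in Matlab (a small illustration; the 'complex' flag requests the triangular T of the theorem rather than the real Schur form):

A = [0 1; -1 0];
[Q,T] = schur(A, 'complex');
diag(T)                      % the eigenvalues ±i on the diagonal of T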

Proof We use induction on m. For m = 1 we have

A = (a11 ), Q = (1), T = (a11 ),

and the claim is clearly true.


For m ≥ 2 we assume x is a normalized eigenvector of A, i.e., ||x||_2 = 1. Then we form

U = [ x  Û ] ∈ C^{m×m}

to be unitary by augmenting the first column, x, by appropriate columns in Û. This gives us

U^* A U = [ x^*;  Û^* ] A [ x  Û ] = [ x^* A x   x^* A Û;   Û^* A x   Û^* A Û ].

Since x is an eigenvector of A we have Ax = λx, and after multiplication by x^*

x^* A x = λ x^* x = λ   (since x^* x = ||x||_2^2 = 1).

Similarly,
Û ∗ Ax = Û ∗ (λx) = λÛ ∗ x = 0
since Û ∗ x = 0 because U is unitary.
Therefore, U ∗ AU simplifies to

U^* A U = [ λ  b^*;  0  C ],

where we have used the abbreviations b∗ = x∗ AÛ and C = Û ∗ AÛ ∈ C(m−1)×(m−1) .
Now, by the induction hypothesis, C has a Schur factorization

C = V T̂ V ∗

with unitary V and triangular T̂ .


To finish the proof we can define

Q = U [ 1  0^T;  0  V ]

and observe that

Q^* A Q = [ 1  0^T;  0  V^* ] (U^* A U) [ 1  0^T;  0  V ] = [ 1  0^T;  0  V^* ] [ λ  b^*;  0  C ] [ 1  0^T;  0  V ] = [ λ  b^* V;  0  V^* C V ].

This last block matrix, however, is the desired upper triangular matrix T with the
eigenvalues of A on its diagonal since V ∗ CV = T̂ and the induction hypothesis ensures
T̂ already has the desired properties.

To summarize this section we can say

1. A ∈ Cm×m is diagonalizable, i.e., A = XΛX −1 , if and only if A is nondefective.

2. A ∈ Cm×m is unitarily diagonalizable, i.e., A = QΛQ∗ , if and only if A is normal.

3. Every A ∈ C^{m×m} is unitarily triangularizable, i.e., A = QT Q^*.

Chapter 9

Applications

9.1 Solving Partial Differential Equations via Collocation


In this section we discuss the numerical solution of elliptic partial differential equations
using a collocation approach based on radial basis functions. To make the discus-
sion transparent we will focus on the case of a time independent linear elliptic partial
differential equation in IR2 .

9.1.1 Kansa’s Approach


In [340] Kansa suggested a now very popular non-symmetric method for the solution
of elliptic PDEs with radial basis functions. In order to be able to clearly point out
the differences between Kansa’s method and a symmetric approach proposed in [194]
we recall some of the basics of scattered data interpolation with radial basis functions
in IRs .
In this context we are given data {xi , fi }, i = 1, . . . , N , xi ∈ IRs , where we can
think of the values fi being sampled from a function f : IRs → IR. The goal is to find
an interpolant of the form
P_f(x) = Σ_{j=1}^{N} c_j ϕ(||x − x_j||),   x ∈ IR^s,        (9.1)

such that
Pf (xi ) = fi , i = 1, . . . , N.
The solution of this problem leads to a linear system Ac = f with the entries of A
given by
Aij = ϕ(kxi − xj k), i, j = 1, . . . , N. (9.2)
As discussed earlier, the matrix A is non-singular for a large class of radial functions
including (inverse) multiquadrics, Gaussians, and the strictly positive definite com-
pactly supported functions of Wendland, Wu, or Buhmann. In the case of strictly
conditionally positive definite functions such as thin plate splines the problem needs to
be augmented by polynomials.

We now switch to the collocation solution of partial differential equations. Assume
we are given a domain Ω ⊂ IRs , and a linear elliptic partial differential equation of the
form
L[u](x) = f (x), x in Ω, (9.3)
with (for simplicity of description) Dirichlet boundary conditions

u(x) = g(x), x on ∂Ω. (9.4)

For Kansa’s collocation method we then choose to represent u by a radial basis function
expansion analogous to that used for scattered data interpolation, i.e.,
u(x) = Σ_{j=1}^{N} c_j ϕ(||x − ξ_j||),        (9.5)

where we now introduce the points ξ 1 , . . . , ξ N as centers for the radial basis func-
tions. They will usually be selected to coincide with the collocation points X =
{x1 , . . . , xN } ⊂ Ω. However, the discussion below is clearer if we formally distin-
guish between centers ξ j and collocation points xi . We assume the simplest possible
setting here, i.e., no polynomial terms are added to the expansion (9.5). The collocation
matrix which arises when matching the differential equation (9.3) and the boundary
conditions (9.4) at the collocation points X will be of the form
 
A = [ Φ;  L[Φ] ],        (9.6)

where the two blocks are generated as follows:

Φij = ϕ(kxi − ξ j k), xi ∈ B, ξ j ∈ X ,


L[Φ]ij = L[ϕ](kxi − ξ j k), xi ∈ I, ξ j ∈ X .

Here we have identified (as we will do throughout this section) the set of centers with
the set of collocation points. The set X is split into a set I of interior points, and B
of boundary points. The problem is well-posed if the linear system Ac = y, with y
a vector consisting of entries g(xi ), xi ∈ B, followed by f (xi ), xi ∈ I, has a unique
solution.
We note that a change in the boundary conditions (9.4) is as simple as changing a
few rows in the matrix A in (9.6) as well as on the right-hand side y. We also point out
that Kansa only proposed to use multiquadrics in (9.5), and for that method suggested
the use of varying parameters αj , j = 1, . . . , N , which improves the accuracy of the
method when compared to using only one constant value of α (see [340]).
A problem with Kansa’s method is that – for a constant multiquadric shape pa-
rameter α – the matrix A may for certain configurations of the centers ξ j be singular.
Originally, Kansa assumed that the non-singularity results for interpolation matrices
would carry over to the PDE case. However, as the numerical experiments of Hon and
Schaback [304] show, this is not so. This is to be expected since the matrix for the
collocation problem is composed of rows which are built from different functions (which
– depending on the differential operator L – might not even be radial). The results for

the non-singularity of interpolation matrices, however, are based on the fact that A is
generated by a single function ϕ.
An indication of the success of Kansa's method (which has not yet been shown to be well-posed) is given by the early papers [165, 166, 262, 341, 467] and many more since. In his
paper [340] Kansa describes three sets of experiments using his method and comments
on the superior performance of multiquadrics in terms of computational complexity
and accuracy when compared to finite difference methods. Therefore, it remains an
interesting open question whether the well-posedness of Kansa’s method can be estab-
lished at least for certain configurations of centers. Moreover, Kansa’s suggestion to use
variable shape parameters αj in order to improve accuracy and stability of the problem
has very little theoretical support. Except for one paper by Bozzini, Lenarduzzi and
Schaback [68] (which addresses only the interpolation setting) this problem has not
been addressed in the literature.
Before we describe an alternate approach which does ensure well-posedness of the
resulting collocation matrix and which is based on basis functions suitable for scattered
Hermite interpolation we would like to point out that in [467] the authors suggest how
Kansa’s method can be applied to other types of partial differential equation prob-
lems such as non-linear elliptic PDEs, systems of elliptic PDEs, and time-dependent
parabolic or hyperbolic PDEs.
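To make the structure of the system (9.6) concrete, here is a Matlab sketch of Kansa's method for a manufactured Poisson test problem ∆u = f on the unit square with Dirichlet data g (the grid, the shape parameter α, and the test solution u(x, y) = sin(πx) sin(πy) are our own choices; the formula for the Laplacian of the multiquadric follows from elementary calculus):

  % Sketch: Kansa's non-symmetric collocation for Delta u = f, u = g on the boundary
  alpha = 0.3;  m = 10;                              % shape parameter, points per direction
  [X,Y] = meshgrid(linspace(0,1,m));  P = [X(:) Y(:)];
  onbdy = P(:,1)==0 | P(:,1)==1 | P(:,2)==0 | P(:,2)==1;
  xb = P(onbdy,:);  xi = P(~onbdy,:);  ctrs = [xb; xi];    % centers = collocation points
  f = @(p) -2*pi^2*sin(pi*p(:,1)).*sin(pi*p(:,2));   % manufactured so that u = sin(pi x)sin(pi y)
  g = @(p) zeros(size(p,1),1);                       % homogeneous Dirichlet data
  r2 = @(P,Q) (P(:,1)-Q(:,1)').^2 + (P(:,2)-Q(:,2)').^2;   % squared distances
  phi  = @(s) sqrt(s + alpha^2);                     % multiquadric, s = r^2
  Lphi = @(s) (s + 2*alpha^2)./(s + alpha^2).^(3/2); % its Laplacian in 2D
  A = [phi(r2(xb,ctrs)); Lphi(r2(xi,ctrs))];         % the two blocks of (9.6)
  c = A\[g(xb); f(xi)];                              % solve Ac = y
  u = @(p) phi(r2(p,ctrs))*c;                        % approximate solution (9.5)
  uex = @(p) sin(pi*p(:,1)).*sin(pi*p(:,2));         % exact solution of the test problem
  pe = rand(200,2);  max(abs(u(pe) - uex(pe)))       % maximum error at random points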

9.1.2 An Hermite-based Approach


The following symmetric approach is based on scattered Hermite interpolation (see,
e.g., [315, 484, 598, 651]), which we now also quickly review. In this context we are
given data {xi , Li f }, i = 1, . . . , N , xi ∈ IRs where L = {L1 , . . . , LN } is a linearly
independent set of continuous linear functionals. We try to find an interpolant of the
form
    Pf (x) = Σ_{j=1}^{N} cj Lξj ϕ(‖x − ξ‖),     x ∈ IRs,                      (9.7)

satisfying
Li Pf = Li f, i = 1, . . . , N.
We have used Lξ to indicate that the functional L acts on ϕ viewed as a function of the
second argument ξ. The linear system Ac = Lf which arises in this case has matrix
entries
Aij = Li Lξj ϕ, i, j = 1, . . . , N. (9.8)
In the references mentioned at the beginning of this subsection it is shown that A is
non-singular for the same classes of ϕ as given for scattered data interpolation in our
earlier chapters.
Remark: It should be pointed out that this formulation of Hermite interpolation is
very general and goes considerably beyond the standard notion of Hermite interpolation
(which refers to interpolation of successive derivative values). Here any kind of linear
functionals is allowed as long as the set L is linearly independent.
We illustrate this approach with a simple example using derivative functionals.

Example: Let data {xi, f(xi)}, i = 1, . . . , n, and {xi, ∂f/∂x(xi)}, i = n + 1, . . . , N, with x = (x, y) ∈ IR2 be given. Then

    Pf (x) = Σ_{j=1}^{n} cj ϕ(‖x − xj‖) − Σ_{j=n+1}^{N} cj ∂ϕ/∂x (‖x − xj‖),

and

            [  Φ     −Φx  ]
    A =     [             ] ,
            [  Φx   −Φxx  ]

with

    Φij = ϕ(‖xi − xj‖),                    i, j = 1, . . . , n,
    −Φx,ij = −∂ϕ/∂x (‖xi − xj‖),           i = 1, . . . , n,  j = n + 1, . . . , N,
    Φx,ij = ∂ϕ/∂x (‖xi − xj‖),             i = n + 1, . . . , N,  j = 1, . . . , n,
    Φxx,ij = ∂²ϕ/∂x² (‖xi − xj‖),          i, j = n + 1, . . . , N.

Now we describe an alternative collocation method based on the generalized interpolation theory just reviewed. Assume we are given the same PDE (9.3) with boundary
conditions (9.4) as in the section on Kansa’s method. In order to be able to apply
the results from scattered Hermite interpolation to ensure the non-singularity of the
collocation matrix we propose the following expansion for the unknown function u:
    u(x) = Σ_{j=1}^{#B} cj ϕ(‖x − ξj‖) + Σ_{j=#B+1}^{N} cj Lξ[ϕ](‖x − ξj‖),   (9.9)

where #B denotes the number of nodes on the boundary of Ω, and Lξ is the differential
operator used in (9.3), but acting on ϕ viewed as a function of the second argument,
i.e., L[ϕ] is equal to Lξ [ϕ] up to a possible difference in sign. Note the difference in
notation. In (9.7) L is a linear functional, and in (9.9) a differential operator.
This expansion for u leads to a collocation matrix A which is of the form

            [   Φ        Lξ[Φ]     ]
    A =     [                      ] ,                                        (9.10)
            [  L[Φ]    L[Lξ[Φ]]    ]

where the four blocks are generated as follows:

    Φij = ϕ(‖xi − ξj‖),                       xi, ξj ∈ B,
    Lξ[Φ]ij = Lξ[ϕ](‖xi − ξj‖),               xi ∈ B,  ξj ∈ I,
    L[Φ]ij = L[ϕ](‖xi − ξj‖),                 xi ∈ I,  ξj ∈ B,
    L[Lξ[Φ]]ij = L[Lξ[ϕ]](‖xi − ξj‖),         xi, ξj ∈ I.

The matrix (9.10) is of the same type as the scattered Hermite interpolation matri-
ces (9.8), and therefore non-singular as long as ϕ is chosen appropriately. Thus, viewed
using the new expansion (9.9) for u, the collocation approach is certainly well-posed.

    grid      α     ρK             ρH             condK (A)       condH (A)
    5 × 3    1.0    5.248447e-02   2.004420e-01   2.599606e+03    1.627432e+03
    8 × 4    1.0    1.126843e-02   1.124710e-02   2.325758e+05    8.167527e+04
    10 × 6   1.0    5.809472e-03   6.481697e-03   4.321740e+07    1.808001e+07
    16 × 8   1.0    1.347863e-03   1.720007e-03   8.685785e+10    1.496772e+10
    20 × 12  1.0    5.053090e-04   5.973294e-04   5.161540e+15    1.234633e+15

Table 9.1: Error progression for increasingly denser data sets (Ex.1, fixed α).

Another point in favor of the Hermite based approach is that the matrix (9.10) is (anti)-
symmetric as opposed to the completely unstructured matrix (9.6) of the same size.
This property should be of value when trying to devise an efficient implementation of
the collocation method. Also note that although A consists of four blocks now, it still
is of the same size, namely N × N , as the collocation matrix (9.6) obtained for Kansa’s
approach.
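Reusing the point sets, the manufactured test problem, and the multiquadric phi from the Kansa sketch in Section 9.1.1, the four blocks of (9.10) for L = ∆ (for which Lξ[ϕ] = L[ϕ], since ϕ is radial) can be assembled as in the following Matlab fragment (our own sketch; the bi-Laplacian formula for the multiquadric is derived by hand and should be double-checked before serious use):

  % Sketch: symmetric (Hermite-based) collocation blocks (9.10) for L = Laplacian;
  % alpha, xb, xi, f, g, r2, phi, uex, pe as in the Kansa sketch of Section 9.1.1
  Lphi  = @(s) (s + 2*alpha^2)./(s + alpha^2).^(3/2);                   % Delta phi (2D), s = r^2
  LLphi = @(s) (s.^2 + 8*alpha^2*s - 8*alpha^4)./(s + alpha^2).^(7/2);  % Delta^2 phi (2D), hand-derived
  A = [ phi(r2(xb,xb))   Lphi(r2(xb,xi));                               % note: A is symmetric
        Lphi(r2(xi,xb))  LLphi(r2(xi,xi)) ];
  c = A\[g(xb); f(xi)];
  u = @(p) [phi(r2(p,xb)) Lphi(r2(p,xi))]*c;                            % expansion (9.9)
  max(abs(u(pe) - uex(pe)))                                             % compare with the exact test solution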
Remark: One attempt to obtain an efficient implementation of the Hermite based
collocation method is a version of the greedy algorithm described in Section 8.5.1 by
Hon, Schaback and Zhou [305].

9.1.3 Numerical Examples


The following test examples are taken from [194]. We restrict ourselves to two-dimensional Poisson problems whose analytic solutions are readily available and can therefore easily be verified. We will refer to a point in IR2 as (x, y). In all of the following tests we
used multiquadrics in the expansions (9.5) and (9.9) of the unknown function u.
Example 1: Consider the Poisson equation

    ∆u(x, y) = y(1 − y) sin³x,     x ∈ (0, π),  y ∈ (0, 1),

with Dirichlet boundary conditions

u(x, 0) = u(x, 1) = u(0, y) = u(π, y) = 0.

For this test problem we selected various uniform grids as listed in Tables 9.1 and
9.2 on [0, π] × [0, 1]. Tables 9.1 and 9.2 show the values of the multiquadric parameter
α, the relative maximum errors ρ computed on a fine grid of 60 × 60 points, and
the approximate condition numbers of A. The range of u on the evaluation grid is
approximately [−0.021023, 0.0]. The “optimal” value for α was determined by trial
and error. The subscripts K and H refer to Kansa’s and the Hermite based method,
respectively.
Figure 9.1 shows the distribution of the errors |u(x) − s(x)| on the evaluation grid
for the two methods on the 8 × 4 grid used in Table 9.2. The scale used for the shading
is displayed on the right.
Example 2: Consider the Poisson equation
    ∆u(x, y) = sin x − sin³x,     x ∈ (0, π/2),  y ∈ (0, 2),

    grid      αK     αH     ρK             ρH             condK (A)       condH (A)
    5 × 3    1.18   1.39   1.627193e-02   4.180428e-02   5.592238e+03    5.231279e+03
    8 × 4    1.04   1.11   1.103747e-02   1.062891e-02   3.175078e+05    1.735482e+05
    10 × 6   4.80   3.84   2.739293e-03   3.451799e-03   1.193586e+18    1.414927e+15
    16 × 8   3.12   3.12   2.707006e-04   2.082886e-04   1.209487e+19    6.609375e+18
    20 × 12  2.00   2.30   3.894511e-05   1.273363e-05   3.739554e+19    6.750955e+18

Table 9.2: Error progression for increasingly denser data sets (Ex.1, “optimal” α).

Figure 9.1: Error for Kansa's (top), Hermite (bottom) solution for Ex. 1 on 8 × 4 grid (shading scale from 0.0 to 2.328866e-04).

with mixed Dirichlet and Neumann boundary conditions


    u(0, y) = ux(π/2, y) = uy(x, 0) = uy(x, 2) = 0.
For this example we selected uniform grids on [0, π/2] × [0, 2] as listed in Table 9.3.
This time we only list the results for the “optimal” choice of α. The values listed are
analogous to those in Ex. 1.
All in all, the Hermite method seems to perform slightly better than Kansa's method, especially for the cases in which we used relatively many interior points (which is where the methods differ). Also, the matrices for the Hermite method generally have smaller
condition numbers. An advantage of the Hermite approach over Kansa’s method is

    grid       αK      αH     ρK             ρH             condK (A)       condH (A)
    3 × 3    109.0   2.19   9.628085e-01   1.141043e-01   1.592286e+16    5.560886e+02
    5 × 5     1.80   1.73   2.181029e-02   4.327029e-02   2.395293e+06    1.271196e+05
    7 × 7     1.58   3.56   6.910084e-03   1.871798e-04   5.762316e+08    1.854850e+12
    10 × 10   2.80   3.29   9.265197e-05   5.126676e-05   2.842111e+18    7.070804e+17
    14 × 14   2.28   2.62   1.138751e-05   1.725526e-06   6.573143e+19    5.891454e+18
    20 × 20   1.53   1.91   5.501057e-06   6.217559e-07   5.889491e+19    7.576112e+19

Table 9.3: Error progression for increasingly denser data sets (Ex.2, “optimal” α).

that for the differential operator L used here, the collocation matrices resulting from
the Hermite approach are symmetric. Therefore the amount of computation can be
reduced considerably, which is important for larger problems. Kansa's method has the advantage of being simpler to implement (since fewer derivatives of the basis functions are required).
Remarks:
1. Both of the methods described in this section have been implemented for many
different applications. A thorough comparison of the two methods was reported
in [520].
2. Since the methods described above were both originally used with globally sup-
ported basis functions, the same concerns as for interpolation problems about
stability and numerical efficiency apply. Two recent papers by Ling and Kansa
[395, 396] address these issues. In particular, they develop a preconditioner in the
spirit of the one described in Section 8.3.3, and describe their experience with a
domain decomposition algorithm.
3. A convergence analysis for the symmetric method was established by Franke and
Schaback [229, 230]. The error estimates established in [229, 230] require the solu-
tion of the PDE to be very smooth. Therefore, one should be able to use meshfree
radial basis function collocation techniques especially well for (high-dimensional)
PDE problems with smooth solutions on possibly irregular domains. Due to
the known counterexamples [304] for the non-symmetric method, a convergence
analysis is still lacking for that method.
4. Recently, Miranda [462] has shown that Kansa’s method will be well-posed if it
is combined with so-called R-functions. This idea was also used by Höllig and
his co-workers in their development of WEB-splines (see, e.g., [299]).
5. Kansa’s method has the advantage of being easily adapted for nonlinear elliptic
PDEs (see, e.g., [201, 467]).
Some numerical evidence for convergence rates of the symmetric collocation method
is given by the examples above, and in the papers [336, 520]. The example above
shows very high convergence rates (as predicted by the estimate in [230]) when using
multiquadrics on a problem which has a smooth solution. In [336] thin plate splines
as well as Wendland’s C 4 compactly supported RBF ϕ3,2 were tested. The results
for thin plate splines are in good agreement with the theory. However, the numerical
experiments using the Wendland function show O(h3 ) convergence instead of O(h) as
predicted by the lower bounds of [230] combined with the error bound for Wendland
functions. This could suggest that a sharper error estimate may be possible when using
compactly supported RBFs.
Other recent papers investigating various aspects of radial basis function collocation
are, e.g., [135] by Cheng, Golberg, Kansa and Zammito, [215] by Fedoseyev, Friedman
and Kansa, [345] by Kansa and Hon, [360] by Larsson and Fornberg, [365] by Leitão,
and [424] by Mai-Duy and Tran-Cong.
For example, in the paper [215] it is suggested that the collocation points on the
boundary are also used to satisfy the PDE. However, this adds a set of extra equations

to the problem, and therefore one should also use some additional basis functions in
the expansion (9.5). It is suggested in [215] that these centers lie outside the domain Ω.
The motivation for this modification is the well-known fact that both for interpolation
and collocation with radial basis functions the error is largest near the boundary. In
various numerical experiments this strategy is shown to improve the accuracy of Kansa’s
basic non-symmetric method. It should be noted that there is once more no theoretical
foundation for this method.
Larsson and Fornberg [360] compare Kansa’s basic collocation method, the modi-
fication just described, and the Hermite-based symmetric approach mentioned earlier.
Using multiquadric basis functions in a standard implementation they conclude that
the symmetric method is the most accurate, followed by the non-symmetric method
with boundary collocation. The reason for this is the better conditioning of the system
for the symmetric method. Larsson and Fornberg also discuss an implementation of
the three methods using the complex Contour-Padé integration method mentioned in
Section 8.1. With this technique stability problems are overcome, and it turns out that
both the symmetric and the non-symmetric method perform with comparable accu-
racy. Boundary collocation of the PDE yields an improvement only if these conditions
are used as additional equations, i.e., by increasing the problem size. It should also
be noted that often the most accurate results were achieved with values of the multi-
quadric shape parameter α which would lead to severe ill-conditioning using a standard
implementation, and therefore these results could be achieved only using the complex
integration method. Moreover, in [360] radial basis function collocation is deemed to
be far superior in accuracy to standard second-order finite differences or a standard
Fourier-Chebyshev pseudospectral method.
Leitão [365] applies the symmetric collocation method to a fourth-order Kirchhoff
plate bending problem, and emphasizes the simplicity of the implementation of the ra-
dial basis function collocation method. And, finally, Mai-Duy and Tran-Cong [424] sug-
gest a collocation method for which the basis functions are taken to be anti-derivatives
of the usual radial basis functions.
All of the experiments just mentioned were conducted without using a multilevel
approach. In particular, in order to achieve convergence with the Wendland functions
the support had to be chosen so large that only problems with a very modest number of
centers could be handled (see [336]). So, as for scattered data interpolation, a multilevel
approach is needed to obtain computational efficiency.
We would like to end the discussion of the collocation approach by looking at a
multilevel implementation with compactly supported functions.
The most significant difference between the use of compactly supported RBFs for
scattered data interpolation and for the numerical solution of PDEs by collocation
appears when we turn to the multilevel approach. Recall that the use of the multilevel
method is motivated by our desire to obtain a convergent scheme while at the same
time keeping the bandwidth fixed, and thus the computational complexity at O(N ).
Here is an adaptation of the basic multilevel algorithm of Section 8.2 to the case of
a collocation solution of the problem Lu = f :

    mesh      ℓ2-error        rate
5 3.637579e-04
9 1.892007e-05 4.26
17 3.055339e-06 2.63
33 2.111403e-06 0.53
65 2.062621e-06 0.03
129 2.066411e-06 0.00
257 2.070168e-06 0.00
513 2.072171e-06 0.00
1025 2.073182e-06 0.00
2049 2.073688e-06 0.00

Table 9.4: Multilevel collocation algorithm for symmetric collocation with constant
bandwidth.

Algorithm (Multilevel Collocation)

u0 = 0.

For k from 1 to K do

Find uk ∈ SXk such that Luk = (f − Luk−1 ) on grid Xk .


Update uk ← uk−1 + uk .

end

Here SXk is the space of functions used for expansion (9.5) or (9.9) on grid Xk .
Whereas we noted above that there is strong numerical (and limited theoretical) ev-
idence that the basic multilevel interpolation algorithm converges (at least linearly),
the following example shows that we cannot in general expect the multilevel collocation
algorithm to converge at all.
Example: Consider the boundary-value problem

    −u″(x) + π²u(x) = 2π² sin πx,     x ∈ (0, 1),


u(0) = u(1) = 0,

with solution u(x) = sin πx. As computational grids Xk we take 2^{k+1} + 1 uniformly spaced points on [0, 1] as indicated in Table 9.4. We use the C⁶ compactly supported Wendland function ϕ3,3, and the conjugate gradient method with Jacobi preconditioning is used to solve the resulting linear systems. We take the support size on the first grid
to be so large that the resulting matrix is a dense matrix. During subsequent iterations
the support size is halved (as is the meshsize) in order to maintain a constant bandwidth
of 17 (i.e., work in the stationary setting). Even though the first three iterations seem
to indicate significant rates of convergence, the convergence behavior quickly changes,
and by the fifth iteration there is virtually no improvement of the error (the fact that
the errors actually increase is due to the fact that they are computed on increasingly
finer grids).

We note that the same behavior can be observed if the non-symmetric approach
is used instead. However, then the convergence ceases at a slightly later stage. We
also note that the same phenomenon was observed by Wendland in the context of a
multilevel Galerkin algorithm for compactly supported RBFs (see [631] as well as our
discussion in the next section).
Remarks:

1. It has been suggested that the convergence behavior of the multilevel colloca-
tion algorithm may be linked to the phenomenon of approximate approximation.
However, so far no connection has been established.

2. As was shown in [198] a possible remedy for the non-convergence problem is smoothing. One might also expect that a slightly different scaling of the support
sizes of the basis functions (such that the bandwidth of the matrix is allowed to
increase slowly from one iteration to the next, i.e., moving to the non-stationary
setting) will lead to better results. In [198] it was shown that this is in fact true.
However, smoothing further improved the convergence. A discussion of the idea
of post-conditioning via smoothing is beyond the scope of this text. We refer the
reader to the paper [209].

9.2 Galerkin Methods


A variational approach to the solution of PDEs with RBFs has so far only been consid-
ered by Wendland [630, 631]. In [631] he studies the Helmholtz equation with natural
boundary conditions, i.e.,

    −∆u + u = f     in Ω,
    ∂u/∂ν = 0       on ∂Ω,
where ν denotes the outer unit normal vector. The classical Galerkin formulation then
leads to the problem of finding a function u ∈ H 1 (Ω) such that

a(u, v) = (f, v)L2 (Ω) for all v ∈ H 1 (Ω),

where (f, v)L2 (Ω) is the usual L2 inner product, and for the Helmholtz equation the
bilinear form a is given by
    a(u, v) = ∫_Ω (∇u · ∇v + uv) dx.

In order to obtain a numerical scheme the infinite-dimensional space H 1 (Ω) is replaced


by some finite-dimensional subspace SX ⊆ H 1 (Ω), where X is some computational grid
to be used for the solution. In the context of RBFs SX is taken as

    SX = span{φ(‖ · −xj‖2), xj ∈ X }.

This results in a square system of linear equations for the coefficients of uX ∈ SX
determined by
a(uX , v) = (f, v)L2 (Ω) for all v ∈ SX .
For more on the Galerkin method (in the context of finite elements) see, e.g., [69, 70].
It was shown in [630] that for those RBFs (globally as well as locally supported) whose Fourier transform decays like (1 + ‖ · ‖2)^{−2β} the following convergence estimate holds:

    ‖u − uX‖_{H¹(Ω)} ≤ C h^{σ−1} ‖u‖_{H^σ(Ω)},                                (9.11)

where h is the meshsize of X, the solution satisfies the regularity requirements u ∈ H^σ(Ω), and where the convergence rate is determined by β ≥ σ > s/2 + 1. For Wendland's compactly supported RBFs this implies that functions which are in C^{2κ} and strictly positive definite on IRs satisfying κ ≥ σ − (s+1)/2 will have O(h^{κ+(s−1)/2}) convergence order, i.e., the C⁰ function ϕ3,0 = (1 − r)+² yields O(h) and the C² function ϕ3,1 = (1 − r)+⁴(4r + 1) delivers O(h²) convergence in IR3. As with the convergence
estimate for symmetric collocation there is a link between the regularity requirements
on the solution and the space dimension s. Also, so far, the theory is only established
for PDEs with natural boundary conditions.
The convergence estimate (9.11) holds for the non-stationary setting, i.e., if we
are using compactly supported basis functions, for fixed support radii. By the same
argumentation as used in Section 8, one will want to switch to the stationary setting
and employ a multilevel algorithm in which the solution at each step is updated by
a fit to the most recent residual. This should ensure both convergence and numerical
efficiency.
Here is the variant of the stationary multilevel collocation algorithm listed above
for the weak formulation (see [631]):
Algorithm (Multilevel Galerkin)

u0 = 0.

For k from 1 to K do

Find uk ∈ SXk such that a(uk , v) = (f, v) − a(uk−1 , v) for all v ∈ SXk .
Update uk ← uk−1 + uk .

end

This algorithm does not converge in general (see Tab. 1 in [631]).


Since the weak formulation can be interpreted as a Hilbert space projection method,
Wendland was able to show that a modified version of the multilevel Galerkin algorithm,
namely
Algorithm (Nested Multilevel Galerkin)

Fix K and M ∈ IN, and set v0 = 0.

For j from 0 to M while residual > tolerance do

Set u0 = vj .

Apply the k-loop of the previous algorithm and denote the result with û(vj ).
Set vj+1 = û(vj ).

end

does converge. In fact, using this algorithm Wendland proves, and also observes
numerically, convergence which is at least linear (see Theorem 3 and Tab. 2 in [631]).
The important difference between the two multilevel Galerkin algorithms is the added
outer iteration in the nested version which is a well-known idea from linear algebra
introduced in 1937 by Kaczmarz [337]. A proof of the linear convergence for general
Hilbert space projection methods coupled with Kaczmarz iteration can be found in
[585]. This alternate projection idea is also the fundamental ingredient in the conver-
gence proof of the domain decomposition method of Beatson, Light and Billings [42]
described in the previous chapter. We mention here that in the multigrid literature
Kaczmarz’ method is frequently used as a smoother (see e.g. [435]).
Remarks:

1. Aside from difficulties with Dirichlet (or sometimes called essential) boundary
conditions, Wendland reports that the numerical evaluation of the weak-form in-
tegrals presents a major problem for the radial basis function Galerkin approach.
Both of these difficulties are also well-known in many other flavors of meshfree
weak-form methods. An especially promising solution to the issue of Dirichlet
boundary conditions seems to be the use of R-functions as proposed by Höllig
and Reif in the context of WEB-splines (see, e.g., [299] or our earlier discussion
in the context of collocation methods).

2. In a recent paper by Schaback [559] the author presents a framework for the
radial basis function solution of problems both in the strong (collocation) and
weak (Galerkin) form.

Many other meshfree methods for the solution of partial differential equations in
the weak form appear in the (mostly engineering) literature. These methods come
under such names as smoothed particle hydrodynamics (SPH) (e.g., [463]), reproducing
kernel particle method (RKPM) (see, e.g., [380, 399]), point interpolation method
(PIM) (see, [397]), element free Galerkin method (EFG) (see, e.g., [49]), meshless local
Petrov-Galerkin method (MLPG) [14], h-p-cloud method [164], partition of unity finite
element method (PUFEM) [16, 443], or generalized finite element method (GFEM)
[15]. Most of these methods are based on the moving least squares approximation
method discussed in Chapter 7.
There are two recent books by Atluri [12] and Liu [397] summarizing many of
these methods. However, these books focus mostly on a survey of the various meth-
ods and related computational and implementation issues with little emphasis on the
mathematical foundation of these methods. The recent survey paper [15] by Babuška,
Banerjee and Osborn, fills a large part of this void.

10 Gaussian Quadrature
So far we have encountered the Newton-Cotes formulas
    ∫_a^b f(x) dx ≈ Σ_{i=0}^{n} Ai f(xi),     Ai = ∫_a^b ℓi(x) dx,

which are exact if f is a polynomial of degree at most n.


It is important to note that in the derivation of the Newton-Cotes formulas we
assumed that the nodes xi were equally spaced and fixed. The main idea for obtaining
more accurate quadrature rules is to treat the nodes as additional degrees of freedom,
and then hope to find “good” locations that ensure higher accuracy. Therefore, we now
have n + 1 nodes xi in addition to n + 1 polynomial coefficients for a total of 2n + 2
degrees of freedom. This should be enough to derive a quadrature rule that is exact for
polynomials of degree up to 2n + 1. Gaussian quadrature indeed accomplishes this:

Theorem 10.1 Let q be a nonzero polynomial of degree n + 1 and w a positive weight function such that

    ∫_a^b x^k q(x) w(x) dx = 0,     k = 0, . . . , n.                         (95)

If the nodes xi, i = 0, . . . , n, are the zeros of q, then

    ∫_a^b f(x) w(x) dx ≈ Σ_{i=0}^{n} Ai f(xi)                                 (96)

with

    Ai = ∫_a^b ℓi(x) w(x) dx,     i = 0, . . . , n,                           (97)

is exact for all polynomials of degree at most 2n + 1. Here ℓi, i = 0, . . . , n, are the usual Lagrange interpolating polynomials of Chapter 1.

Proof Assume f is a polynomial of degree at most 2n + 1, and show


    Σ_{i=0}^{n} Ai f(xi) = ∫_a^b f(x) w(x) dx.

Using long division we have

    f(x) = q(x) p(x) + r(x),

where f has degree at most 2n + 1 and q has degree n + 1, and
where p and r are both polynomials of degree at most n.
By taking xi as the zeros of q we have

f (xi ) = r(xi ), i = 0, . . . , n.

Now
    ∫_a^b f(x) w(x) dx = ∫_a^b [q(x) p(x) + r(x)] w(x) dx
                       = ∫_a^b q(x) p(x) w(x) dx + ∫_a^b r(x) w(x) dx,

where the first integral on the right-hand side is zero by the orthogonality assumption
(95).
We know that (for any set of nodes xi ) (96) is exact for polynomials of degree at
most n. Therefore,
    ∫_a^b f(x) w(x) dx = ∫_a^b r(x) w(x) dx
                       = Σ_{i=0}^{n} Ai r(xi)     (by (96)).

However, since our special choice of nodes implies f (xi ) = r(xi ) we have
    ∫_a^b f(x) w(x) dx = Σ_{i=0}^{n} Ai f(xi)

for any polynomial f of degree at most 2n + 1.

Remark Usually, the classical orthogonal polynomials as discussed in the Maple work-
sheet 478578 GaussQuadrature.mws are used to construct Gaussian quadrature rules
with the appropriate weight function suggested by the integrand at hand.

Example If [a, b] = [−1, 1] and w(x) = 1 we use Legendre polynomials (since they
are orthogonal with respect to this interval and weight function). The corresponding
two-point formula (n = 1 — which is exact for cubic polynomials) is
    ∫_{−1}^{1} f(x) dx ≈ A0 f(x0) + A1 f(x1)

with x0 and x1 as the roots of q2(x) = x² − 1/3, i.e.,

    x0 = √3/3,     x1 = −√3/3.
A0 and A1 are then found by enforcing exactness for polynomials of degree at most
n = 1:
    ∫_{−1}^{1} 1 dx = A0 + A1
    ∫_{−1}^{1} x dx = A0 x0 + A1 x1.

These formulas ensure (for arbitrary nodes) exactness for constants, and linear polyno-
mials, respectively. The preceding equations are equivalent to the 2 × 2 linear system

    [ 1    1  ] [ A0 ]   [ 2 ]
    [ x0   x1 ] [ A1 ] = [ 0 ] ,

which implies A0 = A1 = 1. Alternatively, we could have applied (97) directly to
compute the coefficients A0 and A1 . Therefore,
    ∫_{−1}^{1} f(x) dx ≈ f(−√3/3) + f(√3/3).
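A quick Matlab check (our own snippet; the cubic test integrand is arbitrary) confirms that this two-point rule integrates cubics exactly, while the two-point Newton-Cotes (trapezoidal) rule does not:

  % Two-point Gauss-Legendre rule on [-1,1] versus the trapezoidal rule
  x = [-sqrt(3)/3, sqrt(3)/3];  A = [1, 1];    % nodes and weights derived above
  f = @(t) t.^3 + 2*t.^2 - t + 1;              % arbitrary cubic; exact integral is 10/3
  gauss = A*f(x)';                             % A0 f(x0) + A1 f(x1)
  trap  = f(-1) + f(1);                        % trapezoidal rule on [-1,1]
  [gauss, trap, integral(f,-1,1)]              % Gauss matches the exact value, the trapezoid does not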

Remark 1. There are tables for the values of xi and Ai for various choices of
classical orthogonal polynomials q of modest degree. Many software packages
also have functions implementing this.

2. If the integral is defined over the interval [a, b] instead of [−1, 1], then a simple
transformation
    x = (b + a + t(b − a))/2,     −1 ≤ t ≤ 1
can be used.

3. Note that without the theorem on Gaussian quadrature we would have to solve a
4 × 4 system of nonlinear equations with unknowns x0 , x1 , A0 and A1 (enforcing
exactness for cubic polynomials) to obtain the two-point formula of the example
above (see the Maple worksheet 478578 GaussQuadrature.mws).

11 Pseudospectral Methods for Two-Point BVPs
Another class of very accurate numerical methods for BVPs (as well as many time-
dependent PDEs) are the so-called spectral or pseudospectral methods. The basic idea
is similar to the collocation method described above. However, now we use other
basis functions. The following discussion closely follows the first few chapters of Nick
Trefethen's book “Spectral Methods in MATLAB”.
Before we go into any details we present an example.

Example Consider the simple linear 2-pt BVP

    y″(t) = e^{4t},     t ∈ (−1, 1)

with boundary conditions y(−1) = y(1) = 0. The analytic solution of this problem is
given by
    y(t) = [e^{4t} − t sinh(4) − cosh(4)]/16.

In the Matlab program PSBVPDemo.m we compare the new pseudospectral approach


with the finite difference approach.
The high accuracy of the pseudospectral method is impressive, and we use this as
our motivation to take a closer look at this method.

As with all the other numerical methods, we require some sort of discretization. For
pseudospectral methods we do the same as for finite difference methods and the RBF
collocation methods, i.e., we introduce a set of grid points t1 , t2 , . . . , tN in the interval
of interest.

11.1 Differentiation Matrices


The main ingredient for pseudospectral methods is the concept of a differentiation matrix D. This matrix will map a vector of function values y = [y(t1), . . . , y(tN)]^T = [y1, . . . , yN]^T at the grid points to a vector y′ of derivative values, i.e.,

    y′ = Dy.

What does such a differentiation matrix look like? Let’s assume that the grid points
are uniformly spaced with spacing tj+1 −tj = h for all j, and that the vector of function
values y comes from a periodic function so that we can add the two auxiliary values
y 0 = y N and y N +1 = y 1 .
In order to approximate the derivative y 0 (tj ) we start with another look at the finite
difference approach. We use the symmetric (second-order) finite difference approxima-
tion
    y′(tj) ≈ y′j = (yj+1 − yj−1)/(2h),     j = 1, . . . , N.
Note that this formula also holds at both ends (j = 1 and j = N ) since we are assuming
periodicity of the data.
These equations can be collected in matrix-vector form:

    y′ = Dy

with y and y′ as above and

              [   0     1/2                 −1/2 ]
              [ −1/2     0     1/2               ]
    D = 1/h · [          .      .      .         ]
              [               −1/2     0    1/2  ]
              [  1/2                 −1/2     0  ]

Remark This matrix has a very special structure. It is both Toeplitz and circulant.
In a Toeplitz matrix the entries in each diagonal are constant, while a circulant matrix
is generated by a single row vector whose entries are shifted by one (in a circulant
manner) each time a new row is generated. As we will see later, the fast Fourier
transform (FFT) can deal with such matrices in a particularly efficient manner.
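A small Matlab sketch (ours; N and the test function are arbitrary) builds this circulant second-order differentiation matrix explicitly and applies it to samples of a smooth periodic function:

  % Second-order periodic differentiation matrix on a uniform grid of [0,2*pi)
  N = 24;  h = 2*pi/N;  t = h*(1:N)';
  e = ones(N-1,1);
  D = (diag(e,1) - diag(e,-1))/(2*h);          % +-1/(2h) on the first off-diagonals
  D(1,N) = -1/(2*h);  D(N,1) = 1/(2*h);        % circulant corner entries
  y  = exp(sin(t));                            % smooth periodic test function
  dy = D*y;                                    % approximate derivative y' = D y
  max(abs(dy - cos(t).*exp(sin(t))))           % error behaves like O(h^2)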

As we saw earlier, there is a close connection between finite difference approx-


imations of derivatives and polynomial interpolation. For example, the symmetric
2nd-order approximation used above can also be obtained by differentiating the inter-
polating polynomial p of degree 2 to the data {(tj−1 , y j−1 ), (tj , y j ), (tj+1 , y j+1 )}, and
then evaluating at t = tj .
We can also use a degree 4 polynomial to interpolate the 5 (symmetric) pieces of
data {(tj−2 , y j−2 ), (tj−1 , y j−1 ), (tj , y j ), (tj+1 , y j+1 ), (tj+2 , y j+2 )}. This leads to (e.g.,
modifying the code in the Maple worksheet 478578 DerivativeEstimates.mws)
    y′(tj) ≈ y′j = −(yj+2 − 8yj+1 + 8yj−1 − yj−2)/(12h),     j = 1, . . . , N,
so that we get the differentiation matrix
              [   0     2/3   −1/12                  1/12   −2/3  ]
              [ −2/3     0     2/3   −1/12                   1/12 ]
    D = 1/h · [    .      .      .      .      .                  ]
              [ −1/12                 1/12   −2/3     0      2/3  ]
              [  2/3   −1/12                  1/12   −2/3     0   ]

Note that this matrix is again a circulant Toeplitz matrix (since the data is assumed to
be periodic). However, now there are 5 diagonals, instead of the 3 for the second-order
example above.

Example The fourth-order convergence of the finite-difference approximation above


is illustrated in the Matlab script FD4Demo.m.

It should now be clear that — in order to increase the accuracy of the finite-
difference derivative approximation to spectral order — we want to keep on increasing
the polynomial degree so that more and more grid points are being used, and the
differentiation matrix becomes a dense matrix. Thus, we can think of pseudospectral

methods as finite difference methods based on global polynomial interpolants instead
of local ones.
For an infinite interval with infinitely many grid points spaced a distance h apart
one can show that the resulting differentiation matrix is given by the circulant Toeplitz
matrix

              [  . . .            ]
              [          1/3      ]
              [         −1/2      ]
              [           1       ]
    D = 1/h · [           0       ]                                           (98)
              [          −1       ]
              [           1/2     ]
              [          −1/3     ]
              [              . . .]
For a finite (even) N and periodic data we will show later that the differentiation
matrix is given by

           [  . . .                        ]
           [          (1/2) cot(3h/2)      ]
           [         −(1/2) cot(2h/2)      ]
           [          (1/2) cot(1h/2)      ]
    DN =   [                0              ]                                  (99)
           [         −(1/2) cot(1h/2)      ]
           [          (1/2) cot(2h/2)      ]
           [         −(1/2) cot(3h/2)      ]
           [                         . . . ]
Example If N = 4, then we have
         [       0            (1/2) cot(1h/2)   (1/2) cot(2h/2)  −(1/2) cot(1h/2) ]
    D4 = [ −(1/2) cot(1h/2)        0            (1/2) cot(1h/2)   (1/2) cot(2h/2) ]
         [  (1/2) cot(2h/2)  −(1/2) cot(1h/2)        0            (1/2) cot(1h/2) ]
         [  (1/2) cot(1h/2)   (1/2) cot(2h/2)  −(1/2) cot(1h/2)        0          ] .
The Matlab script PSDemo.m illustrates the spectral convergence obtained with the
matrix DN for various values of N . The output should be compared with that of the
previous example FD4Demo.m.

11.2 Unbounded Grids and the Semi-Discrete Fourier Transform


We now consider an infinite uniform grid hZ with grid points tj = jh for all integers
j. While this case is not useful for practical computation, it is important for our
understanding of problems on bounded intervals.

First we recall the definition of the Fourier transform ŷ of a function y that is
square-integrable on R:
    ŷ(ω) = ∫_{−∞}^{∞} e^{−iωt} y(t) dt,     ω ∈ R.                           (100)

Conversely, the inverse Fourier transform lets us reconstruct y from its Fourier transform ŷ:

    y(t) = (1/(2π)) ∫_{−∞}^{∞} e^{iωt} ŷ(ω) dω,     t ∈ R.                   (101)

Example Consider the function


(
1, if − 1/2 ≤ t ≤ 1/2
y(t) =
0, otherwise,

and compute its Fourier transform.


By the definition of the Fourier transform, the definition of y and Euler’s formula
we have
    ŷ(ω) = ∫_{−∞}^{∞} e^{−iωt} y(t) dt
         = ∫_{−1/2}^{1/2} e^{−iωt} dt
         = ∫_{−1/2}^{1/2} [cos(ωt) − i sin(ωt)] dt
         = 2 ∫_0^{1/2} cos(ωt) dt
         = 2 [sin(ωt)/ω]_{ω t from 0 to 1/2} = sin(ω/2)/(ω/2).
These functions play an important role in many applications (e.g., signal process-
ing). The function y is known as a square pulse or characteristic function of the interval
[−1/2, 1/2], and its Fourier transform ŷ is known as the sinc function.

If we restrict our attention to a discrete (unbounded) physical space, i.e., the func-
tion y is now given by the (infinite) vector y = [. . . , y −1 , y 0 , y 1 , . . .]T of discrete values,
then the formulas change. In fact, the semidiscrete Fourier transform of y is given by
the (continuous) function

    ŷ(ω) = h Σ_{j=−∞}^{∞} e^{−iωtj} yj,     ω ∈ [−π/h, π/h],                 (102)

and the inverse semidiscrete Fourier transform is given by the (discrete infinite) vector y whose components are of the form

    yj = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωtj} ŷ(ω) dω,     j ∈ Z.                (103)

Remark Note that the notion of a semidiscrete Fourier transform is just a differ-
ent name for a Fourier series based on the complex exponentials e−iωtj with Fourier
coefficients y j .

The interesting difference between the continuous and semidiscrete setting is marked
by the bounded Fourier space in the semidiscrete setting. This can be explained by the
phenomenon of aliasing. Aliasing arises when a continuous function is sampled on a
discrete set. In particular, the two complex exponential functions f (t) = eiω1 t and
g(t) = eiω2 t differ from each other on the real line as long as ω1 6= ω2 . However, if we
sample the two functions on the grid hZ, then we get the vectors f and g with values
f j = eiω1 tj and g j = eiω2 tj . Now, if ω2 = ω1 + 2kπ/h for some integer k, then f j = g j
for all j, and the two (different) continuous functions f and g appear identical in their
discrete representations f and g. Thus, any complex exponential eiωt is matched on
the grid hZ by infinitely many other complex exponentials (its aliases). Therefore we
can limit the representation of the Fourier variable ω to an interval of length 2π/h. For
reasons of symmetry we use [−π/h, π/h].
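A two-line Matlab check (ours; the grid spacing and wavenumbers are arbitrary) makes the aliasing statement concrete:

  % Two different complex exponentials that coincide on the grid hZ
  h = 0.5;  j = (-5:5)';  t = j*h;
  w1 = 1;  w2 = w1 + 2*pi/h;                   % w2 = w1 + 2k*pi/h with k = 1
  max(abs(exp(1i*w1*t) - exp(1i*w2*t)))        % zero up to roundoff: the two functions alias each other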

11.2.1 Spectral Differentiation


To get the interpolant of the y j values we can now use an extension of the inverse
semidiscrete Fourier transform, i.e., we define the interpolant to be the function
    p(t) = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} ŷ(ω) dω,     t ∈ R.               (104)

It is obvious (cf. (103)) from this definition that p interpolates the data, i.e., p(tj ) = y j ,
for any j ∈ Z.
Moreover, the Fourier transform of the function p turns out to be
    p̂(ω) = { ŷ(ω),  ω ∈ [−π/h, π/h],
            { 0,      otherwise.

This kind of function is known as a band-limited function, and p is called the band-
limited interpolant of y.
The spectral derivative vector y 0 of y can now be obtained by one of the following
two procedures we are about to present. First,

1. Sample the function y at the (infinite set of) discrete points tj ∈ hZ to obtain
the data vector y with components y j .

2. Compute the semidiscrete Fourier transform of the data via (102):



    ŷ(ω) = h Σ_{j=−∞}^{∞} e^{−iωtj} yj,     ω ∈ [−π/h, π/h].

3. Find the band-limited interpolant p of the data y j via (104).

4. Differentiate p and evaluate at the tj .

However, from a computational point of view it is better to deal with this problem
in the Fourier domain. We begin by noting that the Fourier transform of the derivative
y′ is given by

    ŷ′(ω) = ∫_{−∞}^{∞} e^{−iωt} y′(t) dt.

Applying integration by parts we get

    ŷ′(ω) = [e^{−iωt} y(t)]_{t=−∞}^{∞} + iω ∫_{−∞}^{∞} e^{−iωt} y(t) dt.

If y(t) tends to zero for t → ±∞ (which it has to for the Fourier transform of y to exist) then we see that

    ŷ′(ω) = iω ŷ(ω).                                                         (105)
Therefore, we obtain the spectral derivative y 0 by the following alternate procedure:

1. Sample the function y at the (infinite set of) discrete points tj ∈ hZ to obtain
the data vector y with components y j .

2. Compute the semidiscrete Fourier transform of the data via (102):



X
ŷ(ω) = h e−iωtj y j , ω ∈ [−π/h, π/h].
j=−∞

3. Compute the Fourier transform of the derivative via (105):

    ŷ′(ω) = iω ŷ(ω).

4. Find the derivative vector via inverse semidiscrete Fourier transform (see (103)),
i.e.,
    y′j = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωtj} ŷ′(ω) dω,     j ∈ Z.

Now we need to find out how we can obtain the entries of the differentiation matrix
D from the preceding discussion. We follow the first procedure above.
In order to be able to compute the semidiscrete Fourier transform of an arbitrary
data vector y we represent its components in terms of shifts of (discrete) delta func-
tions, i.e.,

    yj = Σ_{k=−∞}^{∞} yk δ_{j−k},                                            (106)

where the Kronecker delta function is defined by

    δj = { 1,  j = 0,
         { 0,  otherwise.

We use this approach since the semidiscrete Fourier transform of the delta function can
be computed easily. In fact, according to (102)

    δ̂(ω) = h Σ_{j=−∞}^{∞} e^{−iωtj} δj = h e^{−iωt0} = h

for all ω ∈ [−π/h, π/h]. Then the band-limited interpolant of δ is of the form (see
(104))
    p(t) = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} δ̂(ω) dω
         = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} h dω
         = (h/π) ∫_0^{π/h} cos(ωt) dω
         = (h/π) [sin(ωt)/t]_{ω=0}^{π/h}
         = (h/π) sin(πt/h)/t = sin(πt/h)/(πt/h) = sinc(πt/h).

Therefore, the band-limited interpolant of an arbitrary data vector y is given by


    p(t) = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} ŷ(ω) dω
         = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} ( h Σ_{j=−∞}^{∞} e^{−iωtj} yj ) dω
         = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} ( h Σ_{j=−∞}^{∞} e^{−iωtj} Σ_{k=−∞}^{∞} yk δ_{j−k} ) dω.

Thus far we have used the definition of the band-limited interpolant (104), the defi-
nition of the semidiscrete Fourier transform of y (102), and the representation (106).
Interchanging the summation, and then using the definition of the delta function and
the same calculation as for the band-limited interpolant of the delta function above we

obtain the final form of the band-limited interpolant of an arbitrary data vector y as

    p(t) = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} ( h Σ_{k=−∞}^{∞} yk Σ_{j=−∞}^{∞} e^{−iωtj} δ_{j−k} ) dω
         = (1/(2π)) ∫_{−π/h}^{π/h} e^{iωt} h Σ_{k=−∞}^{∞} yk e^{−iωtk} dω
         = Σ_{k=−∞}^{∞} yk (1/(2π)) ∫_{−π/h}^{π/h} e^{iω(t−tk)} h dω
         = Σ_{k=−∞}^{∞} yk sinc( (t − tk)π / h ).
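The sinc series can be evaluated directly; the following Matlab sketch (ours; the data, the grid, and the truncation of the infinite sum to a finite grid are arbitrary choices) interpolates a square-wave-like data vector in this way:

  % Band-limited (sinc) interpolation of data on a uniform grid of spacing h
  h = 1;  t = (-10:h:10)';  y = double(abs(t) <= 3);   % data sampled from a square pulse
  tt = (-10:0.01:10)';                                 % fine evaluation grid
  p = zeros(size(tt));
  for k = 1:length(t)
      z = pi*(tt - t(k))/h;
      s = sin(z)./z;  s(z == 0) = 1;                   % sinc((t - t_k)*pi/h)
      p = p + y(k)*s;
  end
  plot(tt, p, t, y, 'o')                               % note the oscillations near the jumps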

Example Band-limited interpolation for the functions


    y1(t) = { 1,  t = 0,
            { 0,  otherwise,

    y2(t) = { 1,  |t| ≤ 3,
            { 0,  otherwise,

and

    y3(t) = (1 − |t|/3)+
is illustrated in the Matlab script BandLimitedDemo.m. Note that the accuracy of the
reproduction is not very high. Note, in particular, the Gibbs phenomenon that arises
for h → 0. This is due to the low smoothness of the data functions.

In order to get the components of the derivative vector y′ we need to differentiate the band-limited interpolant and evaluate at the grid points. By linearity this leads to

    y′j = p′(tj) = Σ_{k=−∞}^{∞} yk [ d/dt sinc((t − tk)π/h) ]_{t=tj} ,

or in (infinite) matrix form

    y′ = Dy

with the entries of D given by

    Djk = [ d/dt sinc((t − tk)π/h) ]_{t=tj} ,     j, k = −∞, . . . , ∞.

The entries in the k = 0 column of D are of the form

    Dj0 = [ d/dt sinc(tπ/h) ]_{t=tj=jh} = { 0,             j = 0,
                                          { (−1)^j/(jh),   otherwise.

The remaining columns are shifts of this column since the matrix is a Toeplitz matrix.
This is exactly of the form (98). The explicit formula for the derivative of the sinc
function above is obtained using elementary calculations:

    d/dt sinc(tπ/h) = (1/t) cos(tπ/h) − (h/(πt²)) sin(tπ/h),

so that

    [ d/dt sinc(tπ/h) ]_{t=tj=jh} = (1/(jh)) cos(jπ) − (1/(j²hπ)) sin(jπ).

11.3 Periodic Grids: The DFT and FFT


We now consider the case of a bounded grid with periodic data, i.e., we will now explain
how to find the entries in the matrix DN of (99).
To keep the discussion simple we will consider the interval [0, 2π] only, and assume
that we are given N (with N even) uniformly spaced grid points tj = jh, j = 1, . . . , N ,
with h = 2π/N .

Remark Formulas for odd N also exist, but are slightly different. For the sake of
clarity, we focus only on the even case here.

As in the previous subsection we now look at the Fourier transform of the discrete
and periodic data y = [y 1 , . . . , y N ]T with y j = y(jh) = y(2jπ/N ), j = 1, . . . , N . For
the same reason of aliasing the Fourier domain will again be bounded. Moreover, the
periodicity of the data implies that the Fourier domain is also discrete (since only waves
eikt with integer wavenumber k have period 2π).
Thus, the discrete Fourier transform (DFT) is given by
    ŷk = h Σ_{j=1}^{N} e^{−iktj} yj,     k = −N/2 + 1, . . . , N/2.           (107)

Note that the (continuous) Fourier domain [−π/h, π/h] used earlier now translates to the discrete domain noted in (107) since h = 2π/N is equivalent to π/h = N/2.
The formula for the inverse discrete Fourier transform (inverse DFT) is given by
    yj = (1/(2π)) Σ_{k=−N/2+1}^{N/2} e^{iktj} ŷk,     j = 1, . . . , N.       (108)

We obtain the spectral derivative of the finite vector data by exactly the same
procedure as in the previous subsection. First, we need the band-limited interpolant
of the data. It is given by the formula
    p(t) = (1/(2π)) Σ′_{k=−N/2}^{N/2} e^{ikt} ŷk,     t ∈ [0, 2π].            (109)

Here we define ŷ −N/2 = ŷ N/2 , and the prime on the sum indicates that we add the
first and last summands only with weight 1/2. This modification is required for the
band-limited interpolant to work properly.

Remark The band-limited interpolant is actually a trigonometric polynomial of degree
N/2, i.e., p(t) can be written as a linear combination of the trigonometric functions
1, sin t, cos t, sin 2t, cos 2t, . . . , sin N t/2, cos N t/2. We will come back to this fact when
we discuss non-periodic data.

Next, we want to represent an arbitrary periodic data vector y as a linear combina-


tion of shifts of periodic delta functions. We omit the details here (they can be found
in the Trefethen book) and give only the formula for the band-limited interpolant of
the periodic delta function:
    p(t) = SN(t) = sin(πt/h) / ((2π/h) tan(t/2)),
which is known as the periodic sinc function SN .
Now, just as in the previous subsection, the band-limited interpolant for an arbitrary
data function can be written as
    p(t) = Σ_{k=1}^{N} yk SN(t − tk).

Finally, using the same arguments and similar elementary calculations as earlier, we
get

    S′N(tj) = { 0,                        j ≡ 0 (mod N),
              { (1/2)(−1)^j cot(jh/2),    j ≢ 0 (mod N).

These are the entries of the N -th column of the Toeplitz matrix (99).

Example The Matlab script SpectralDiffDemo.m illustrates the use of spectral dif-
ferentiation for the not so smooth hat function and for the infinitely smooth function
y(t) = esin t .

11.3.1 Implementation via FFT


The most efficient computational approach is to view spectral differentiation in the
Fourier domain (the alternate approach earlier) and then implement the DFT via the
fast Fourier transform (FFT). The general outline is as follows:
1. Sample the function y at the (finite set of) discrete points tj , j = 1, . . . , N to
obtain the data vector y with components y j .

2. Compute the discrete Fourier transform of the (finite) data vector via (107):
    ŷk = h Σ_{j=1}^{N} e^{−iktj} yj,     k = −N/2 + 1, . . . , N/2.

3. Compute the Fourier transform of the derivative based on (105), i.e.,


    ŷ′k = { 0,        k = N/2,
           { ik ŷk,   otherwise.

4. Find the derivative vector via inverse discrete Fourier transform (see (108)), i.e.,
    y′j = (1/(2π)) Σ_{k=−N/2+1}^{N/2} e^{iktj} ŷ′k,     j = 1, . . . , N.
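In Matlab these four steps collapse to a few lines built around fft and ifft (a sketch of ours in the same spirit as SpectralDiffFFTDemo.m; note the wavenumber ordering used by fft and the zeroed N/2 mode):

  % Spectral differentiation of a periodic function via the FFT
  N = 24;  h = 2*pi/N;  t = h*(1:N)';
  y = exp(sin(t));                             % smooth periodic test function
  yhat = fft(y);
  k = [0:N/2-1  0  -N/2+1:-1]';                % wavenumbers in fft ordering, N/2 mode set to 0
  dy = real(ifft(1i*k.*yhat));                 % derivative values at the grid points
  max(abs(dy - cos(t).*exp(sin(t))))           % error is at the level of roundoff for smooth y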

Remark Cooley and Tukey (1965) are usually given credit for discovering the FFT.
However, the same algorithm was already known to Gauss (even before Fourier com-
pleted his work on what is known today as the Fourier transform). A detailed discus-
sion of this algorithm goes beyond the scope of this course. We simply use the Matlab
implementations fft and ifft. These implementations are based on the current state-
of-the-art FFTW algorithm (the “fastest Fourier transform in the West”) developed at
MIT by Matteo Frigo and Steven G. Johnson.

Example The Matlab script SpectralDiffFFTDemo.m is an FFT version of the earlier


script SpectralDiffDemo.m. The FFT implementation is considerably faster than the
implementation based on differentiation matrices (see Computer Assignment 5).

11.4 Smoothness and Spectral Accuracy


Without getting into any details (see Chapter 4 of Trefethen’s book) we will simply
illustrate with a few examples the basic behavior of spectral differentiation:
The smoother the data, the more accurate the spectral derivative.

Example In the Matlab script SpectralAccuracyDemo.m we expand on the earlier


script SpectralDiffDemo.m and illustrate the dependence of the convergence rate of
spectral differentiation on the smoothness of the data more clearly for the four periodic
functions on [0, 2π]

    y1(t) = |sin t|³,
    y2(t) = exp(−sin^{−2}(t/2)),
    y3(t) = 1/(1 + sin²(t/2)),
    y4(t) = sin(10t).

These functions are arranged according to their (increasing) smoothness. The function y1 has a third derivative of bounded variation, y2 is infinitely differentiable (but not analytic), y3 is analytic in the strip |Im(t)| < 2 ln(1 + √2) in the complex plane, and y4 is band-limited.
Note: A continuous function y is of bounded variation if
    sup_{t0<t1<···<tN} Σ_{j=1}^{N} |y(tj) − y(tj−1)|

is bounded for all choices of t0 , t1 , . . . , tN . Plainly said, a function of bounded variation


cannot “wiggle around too much”. For example, on the interval [0, 1/2] the function
y(t) = t2 sin(1/t) is of bounded variation while y(t) = t sin(1/t) is not.

11.5 Polynomial Interpolation and Clustered Grids
We already saw in the Matlab script BandLimitedDemo.m that a spectral interpolant
performs very poorly for non-smooth functions. Thus, if we just went ahead and treated
a problem on a bounded domain as a periodic problem via periodic extension, then the
resulting jumps that may arise at the endpoints of the original interval would lead to
Gibbs phenomena and a significant degradation of accuracy. Therefore, we do not use
the trigonometric polynomials (discrete Fourier transforms) but algebraic polynomials
instead.
For interpolation with algebraic polynomials we saw at the very beginning of this
course (in the Matlab script PolynomialInterpolationDemo.m) the effect that differ-
ent distributions of the interpolation nodes in a bounded interval have on the accuracy
of the interpolant (the so-called Runge phenomenon). Clearly, the accuracy is much
improved if the points are clustered near the endpoints of the interval. In fact, the
so-called Chebyshev points

tj = cos(jπ/N ), j = 0, 1, . . . , N

yield a set of such clustered interpolation nodes on the standard interval [−1, 1]. These
points can easily be mapped by a linear transformation to any other interval [a, b]
(see Assignment 8). Chebyshev points arise often in numerical analysis. They are the
extremal points of the so-called Chebyshev polynomials (a certain type of orthogonal
polynomial ). In fact, Chebyshev points are equally spaced on the unit circle, and there-
fore one can observe a nice connection between spectral differentiation on bounded
intervals with Chebyshev points and periodic problems on bounded intervals as de-
scribed earlier. It turns out that (contrary to our expectations) the FFT can also be
used for the Chebyshev case. However, we will only consider Chebyshev differentiation
matrices below.

11.6 Chebyshev Differentiation Matrices


Our last step in our preparation for the solution of general boundary value problems
is to determine the entries of the differentiation matrices to be used for problems on
bounded intervals (with non-periodic data).
As before, we follow our well-established approach for spectral differentiation:

1. Discretize the interval [−1, 1] using the Chebyshev points

tj = cos(jπ/N ), j = 0, 1, . . . , N,

and sample the function y at those points to obtain the data vector y = [y(t0 ), y(t1 ), . . . , y(tN )]T .

2. Find the (algebraic) polynomial p of degree at most N that interpolates the data,
i.e., s.t.
p(ti ) = y i , i = 0, 1, . . . , N.

3. Obtain the spectral derivative vector y 0 by differentiating p and evaluating at the


grid points:
y 0i = p0 (ti ), i = 0, 1, . . . , N.

This procedure (implicitly) defines the differentiation matrix DN that gives us

y 0 = DN y.

Before we look at the general formula for the entries of DN we consider some simple
examples.

Example For N = 1 we have the two points t0 = 1 and t1 = −1, and the interpolant
is given by
    p(t) = y0 (t − t1)/(t0 − t1) + y1 (t0 − t)/(t0 − t1)
         = y0 (t + 1)/2 + y1 (1 − t)/2.

The derivative of p is (the constant)

    p′(t) = (1/2) y0 − (1/2) y1,

so that we have

    y′ = [ (1/2) y0 − (1/2) y1 ]
         [ (1/2) y0 − (1/2) y1 ]

and the differentiation matrix is given by

    D1 = [ 1/2   −1/2 ]
         [ 1/2   −1/2 ] .

Example For N = 2 we start with the three Chebyshev points t0 = 1, t1 = 0, and t2 = −1. The quadratic interpolating polynomial (in Lagrange form) is given by

    p(t) = y0 (t − t1)(t − t2)/((t0 − t1)(t0 − t2)) + y1 (t − t0)(t − t2)/((t1 − t0)(t1 − t2)) + y2 (t − t0)(t − t1)/((t2 − t0)(t2 − t1))
         = y0 t(t + 1)/2 − y1 (t − 1)(t + 1) + y2 (t − 1)t/2.

Now the derivative of p is a linear polynomial

    p′(t) = (t + 1/2) y0 − 2t y1 + (t − 1/2) y2,

so that – evaluating at the nodes – we have

    y′ = [  (3/2) y0 − 2 y1 + (1/2) y2 ]
         [  (1/2) y0 − (1/2) y2        ]
         [ −(1/2) y0 + 2 y1 − (3/2) y2 ]

and the differentiation matrix is given by

    D2 = [  3/2   −2    1/2 ]
         [  1/2    0   −1/2 ]
         [ −1/2    2   −3/2 ] .

We note that the differentiation matrices no longer are Toeplitz or circulant. In-
stead, the entries satisfy (also in the general case below)

(DN )ij = −(DN )N −i,N −j .

For general N one can prove

Theorem 11.1 For each N ≥ 1, let the rows and columns of the (N + 1) × (N + 1)
Chebyshev spectral differentiation matrix DN be indexed from 0 to N . The entries of
this matrix are
    (DN)00 = (2N² + 1)/6,     (DN)NN = −(2N² + 1)/6,

    (DN)jj = −tj / (2(1 − tj²)),     j = 1, . . . , N − 1,

    (DN)ij = (ci/cj) (−1)^{i+j}/(ti − tj),     i ≠ j,  i, j = 0, 1, . . . , N,

where

    ci = { 2,  i = 0 or N,
         { 1,  otherwise.

This matrix is implemented in the Matlab script cheb.m that was already used in
the Matlab function PSBVP.m that we used in our motivational example PSBVPDemo.m
at the beginning of this chapter. Note that only the off-diagonal entries are computed
via the formulas given in the theorem. For the diagonal entries the formula
    (DN)ii = − Σ_{j=0, j≠i}^{N} (DN)ij

was used.
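For reference, here is a compact Matlab construction of DN along these lines (essentially the well-known cheb.m from Trefethen's book, reproduced as a sketch; save it as cheb.m):

  function [D,t] = cheb(N)
  % Chebyshev differentiation matrix D_N and Chebyshev points (cf. Theorem 11.1)
  if N == 0, D = 0; t = 1; return, end
  t = cos(pi*(0:N)/N)';                        % Chebyshev points t_j = cos(j*pi/N)
  c = [2; ones(N-1,1); 2].*(-1).^(0:N)';       % c_i, with the signs (-1)^i built in
  T = repmat(t,1,N+1);
  dT = T - T';                                 % t_i - t_j
  D = (c*(1./c)')./(dT + eye(N+1));            % off-diagonal entries of Theorem 11.1
  D = D - diag(sum(D,2));                      % diagonal entries via negative row sums
  end

A quick sanity check: [D,t] = cheb(8); max(abs(D*t.^3 - 3*t.^2)) is at roundoff level, since the interpolant of a cubic sampled at nine points is that cubic itself.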

Example The spectral accuracy of Chebyshev differentiation matrices is illustrated in


the Matlab script ChebyshevAccuracyDemo.m. One should compare this to the earlier
script SpectralAccuracyDemo.m in the periodic case.
The functions used for the Chebyshev example are

    y1(t) = |t|³,
    y2(t) = exp(−t^{−2}),
    y3(t) = 1/(1 + t²),
    y4(t) = t^{10}.

These functions are again arranged according to their (increasing) smoothness. The
function y1 has a third derivative of bounded variation, y2 is infinitely differentiable
(but not analytic), y3 is analytic in [−1, 1], and y4 is a polynomial (which corresponds
to the band-limited case earlier).

Note that the error for the derivative of the function y2 dips to zero for N = 2 since
the true derivative is given by

    y2′(t) = 2 exp(−t^{−2}) / t³,
and the values at t0 = 1, t1 = 0, and t2 = −1 are 2/e, 0, and −2/e, respectively. These
all lie on a line (the linear derivative of the quadratic interpolating polynomial).

11.7 Boundary Value Problems


We can now return to our introductory example, the 2-pt boundary value problem

    y″(t) = e^{4t},     t ∈ (−1, 1)

with boundary conditions y(−1) = y(1) = 0. Its analytic solution was given earlier as

    y(t) = [e^{4t} − t sinh(4) − cosh(4)]/16.

How do we solve this problem in the Matlab programs PSBVPDemo.m and PSBVP.m?
First, we note that – for Chebyshev differentiation matrices – we can obtain higher
derivatives by repeated application of the matrix DN , i.e., if

    y′ = DN y,

then

    y″ = DN y′ = DN² y.

In other words, for Chebyshev differentiation matrices

    DN^{(k)} = DN^k,     k = 1, . . . , N,

and DN^{N+1} = 0.

Remark We point out that this fact is true only for the Chebyshev case. For the
Fourier differentiation matrices we established in the periodic case we in general have
DN^k ≠ DN^{(k)} (see Assignment 8).

With the insight about higher-order Chebyshev differentiation matrices we can view
the differential equation above as
    DN² y = f,

where the right-hand side vector f = exp(4t), with t = [t0, t1, . . . , tN]^T the vector of Chebyshev points. This linear system, however, cannot be solved uniquely (one can show that the (N + 1) × (N + 1) matrix DN² has an (N + 1)-fold eigenvalue of zero). Of course, this is not a problem. In fact, it is reassuring, since we have not
of zero). Of course, this is not a problem. In fact, it is reassuring, since we have not
yet taken into account the boundary conditions, and the ordinary differential equation
(without appropriate boundary conditions) also does not have a unique solution.
So the final question is, how do we deal with the boundary conditions?
We could follow either of two approaches. First, we can build the boundary condi-
tions into the spectral interpolant, i.e.,

1. Take the interior Chebyshev points t1 , . . . , tN −1 and form the polynomial in-
terpolant of degree at most N that satisfies the boundary conditions p(−1) =
p(1) = 0 and interpolates the data vector at the interior points, i.e., p(tj ) = y j ,
j = 1, . . . , N − 1.
2. Obtain the spectral derivative by differentiating p and evaluating at the interior points, i.e.,

    y″j = p″(tj),     j = 1, . . . , N − 1.

3. Identify the (N − 1) × (N − 1) matrix D̃N² from the previous relation, and solve the linear system

    D̃N² y(1 : N − 1) = exp(4 t(1 : N − 1)),

where we used Matlab-like notation.
The second approach is much simpler to implement, but not as straightforward to
understand/derive. Since we already know the value of the solution at the boundary,
i.e., y 0 = 0 and y N = 0, we do not need to include these values in our computa-
tion. Moreover, the values of the derivative at the endpoints are of no interest to us.
Therefore, we can simply solve the linear system
D̃_N^2 y(1 : N − 1) = exp(4t(1 : N − 1)),

where

D̃_N^2 = D_N^2 (1 : N − 1, 1 : N − 1).
This is exactly what was done in the Matlab program PSBVP.m.
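For readers who want to see the second approach in code, the following is a minimal Matlab sketch (not the actual PSBVP.m, which is not reproduced here). It assumes a helper cheb(N), as in Trefethen's "Spectral Methods in Matlab", returning the (N+1)×(N+1) Chebyshev differentiation matrix D and the column vector t of Chebyshev points ordered from t = 1 down to t = −1.

    % Hedged sketch of the second approach; cheb(N) is an assumed helper.
    N = 16;
    [D, t] = cheb(N);             % differentiation matrix and Chebyshev points
    D2 = D^2;                     % second-order differentiation matrix
    D2 = D2(2:N, 2:N);            % keep only rows/columns of interior points
    f  = exp(4*t(2:N));           % right-hand side at the interior points
    y  = D2 \ f;                  % solve the (N-1)x(N-1) linear system
    y  = [0; y; 0];               % re-attach the homogeneous boundary values
    yexact = (exp(4*t) - t*sinh(4) - cosh(4))/16;
    max(abs(y - yexact))          % spectrally small already for modest N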
Remark One can show that the eigenvalues of D̃_N^2 are given by λ_n = −π^2 n^2 / 4,
n = 1, 2, . . . , N − 1. Clearly, these values are all nonzero, and the problem has (as it should
have) a unique solution.
We are now ready to deal with more complicated boundary value problems. They
can be nonlinear, have non-homogeneous boundary conditions, or mixed-type boundary
conditions with derivative values specified at the boundary. We give examples for each
of these cases.
Example As for our initial value problems earlier, a nonlinear ODE-BVP will be
solved by iteration (either fixed-point, or Newton).
Consider
y''(t) = e^{y(t)},    t ∈ (−1, 1)
with boundary conditions y(−1) = y(1) = 0. In the Matlab program NonlinearPSBVPDemo.m
we use fixed-point iteration to solve this problem.
Example Next, we consider a linear BVP with non-homogeneous boundary condi-
tions:
y''(t) = e^{4t},    t ∈ (−1, 1)
with boundary conditions y(−1) = 0, y(1) = 1. In the Matlab program PSBVPNonHomoBCDemo.m
this is simply done by replacing the first and last rows of the differentiation matrix D2
by corresponding rows of the identity matrix and then imposing the boundary values
in the first and last entries of the right-hand side vector f.

Example For a linear BVP with mixed boundary conditions such as

y''(t) = e^{4t},    t ∈ (−1, 1)

with boundary conditions y'(−1) = y(1) = 0 we can follow the same strategy as in the
previous example. Now, however, we need to replace the row of D2 that corresponds to
the derivative boundary condition with a row from the first-order differentiation matrix
D. This leads to the Matlab program PSBVPMixedBCDemo.m.
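A minimal sketch of this row-replacement strategy for the mixed problem above (again assuming the cheb(N) helper from the earlier sketch; this is an illustration, not the actual demo program) could look as follows.

    % Hedged sketch for y'' = exp(4t), y'(-1) = y(1) = 0 via row replacement.
    N = 16;
    [D, t] = cheb(N);             % t(1) = 1, t(N+1) = -1
    A = D^2;                      % discrete second derivative
    f = exp(4*t);                 % right-hand side at all Chebyshev points
    A(1,:)   = [1, zeros(1,N)];   % row for t = 1: impose y(1) = 0
    f(1)     = 0;
    A(N+1,:) = D(N+1,:);          % row for t = -1: impose y'(-1) = 0
    f(N+1)   = 0;
    y = A \ f;                    % values of the solution at all points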

12 Galerkin and Ritz Methods for Elliptic PDEs
12.1 Galerkin Method
We begin by introducing a generalization of the collocation method we saw earlier for
two-point boundary value problems. Consider the elliptic PDE
Lu(x) = f (x), (110)
where L is a linear elliptic partial differential operator such as the Laplacian
L = ∂^2/∂x^2 + ∂^2/∂y^2 + ∂^2/∂z^2,    x = (x, y, z) ∈ R^3.
At this point we will not worry about the boundary conditions that should be posed
with (110).
As with the collocation method discussed earlier, we will obtain the approximate
solution in the form of a function (instead of as a collection of discrete values). There-
fore, we need an approximation space U = span{u1 , . . . , un }, so that we are able to
represent the approximate solution as
u = Σ_{j=1}^{n} c_j u_j,    u_j ∈ U.    (111)

Using the linearity of L we have


Lu = Σ_{j=1}^{n} c_j L u_j.

We now need to come up with n (linearly independent) conditions to determine the n


unknown coefficients cj in (111). If {Φ1 , . . . , Φn } is a linearly independent set of linear
functionals, then

Φ_i ( Σ_{j=1}^{n} c_j L u_j − f ) = 0,    i = 1, . . . , n,    (112)

is an appropriate set of conditions. In fact, this leads to a system of linear equations


Ac = b
with matrix

        [ Φ_1 Lu_1   Φ_1 Lu_2   . . .   Φ_1 Lu_n ]
        [ Φ_2 Lu_1   Φ_2 Lu_2   . . .   Φ_2 Lu_n ]
    A = [     .           .       .         .    ]
        [ Φ_n Lu_1   Φ_n Lu_2   . . .   Φ_n Lu_n ],

coefficient vector c = [c_1, . . . , c_n]^T, and right-hand side vector

    b = [Φ_1 f, Φ_2 f, . . . , Φ_n f]^T.
Two popular choices are

1. Point evaluation functionals, i.e., Φi (u) = u(xi ), where {x1 , . . . , xn } is a set of
points chosen such that the resulting conditions are linearly independent, and u
is some function with appropriate smoothness. With this choice (112) becomes
Σ_{j=1}^{n} c_j Lu_j(x_i) = f(x_i),    i = 1, . . . , n,

and we now have an extension of the collocation method discussed in Chapter 9 to
elliptic PDEs in the multi-dimensional setting.

2. If we let Φi (u) = hu, vi i, an inner product of the function u with an appropriate


test function vi , then (112) becomes
Σ_{j=1}^{n} c_j ⟨Lu_j, v_i⟩ = ⟨f, v_i⟩,    i = 1, . . . , n.

If vi ∈ U then this is the classical Galerkin method, otherwise it is known as the


Petrov-Galerkin method.

12.2 Ritz-Galerkin Method


For the following discussion we pick as a model problem a multi-dimensional Poisson
equation with homogeneous boundary conditions, i.e.,

−∇2 u = f in Ω, (113)
u=0 on ∂Ω,

with domain Ω ⊂ Rd . This problem describes, e.g., the steady-state solution of a


vibrating membrane (in the case d = 2 with shape Ω) fixed at the boundary, and
subjected to a vertical force f .
The first step for the Ritz-Galerkin method is to obtain the weak form of (113).
This is accomplished by choosing a function v from a space U of smooth functions, and
then forming the inner product of both sides of (113) with v, i.e.,

−h∇2 u, vi = hf, vi. (114)

To be more specific, we let d = 2 and take the inner product


⟨u, v⟩ = ∫∫_Ω u(x, y) v(x, y) dx dy.

Then (114) becomes


− ∫∫_Ω (u_xx(x, y) + u_yy(x, y)) v(x, y) dx dy = ∫∫_Ω f(x, y) v(x, y) dx dy.    (115)

In order to be able to complete the derivation of the weak form we now assume that
the space U of test functions is of the form

U = {v : v ∈ C 2 (Ω), v = 0 on ∂Ω},

i.e., besides having the necessary smoothness to be a solution of (113), the functions
also satisfy the boundary conditions.
Now we rewrite the left-hand side of (115):
∫∫_Ω (u_xx + u_yy) v dx dy = ∫∫_Ω [(u_x v)_x + (u_y v)_y − u_x v_x − u_y v_y] dx dy
                           = ∫∫_Ω [(u_x v)_x + (u_y v)_y] dx dy − ∫∫_Ω [u_x v_x + u_y v_y] dx dy.    (116)

By using Green’s Theorem (integration by parts)


∫∫_Ω (P_x + Q_y) dx dy = ∫_{∂Ω} (P dy − Q dx)

the first integral on the right-hand side of (116) turns into


∫∫_Ω [(u_x v)_x + (u_y v)_y] dx dy = ∫_{∂Ω} (u_x v dy − u_y v dx).

Now the special choice of U, i.e., the fact that v satisfies the boundary conditions,
ensures that this term vanishes. Therefore, the weak form of (113) is given by
∫∫_Ω [u_x v_x + u_y v_y] dx dy = ∫∫_Ω f v dx dy.

Another way of writing the previous formula is of course


∫∫_Ω ∇u · ∇v dx dy = ∫∫_Ω f v dx dy.    (117)

To obtain a numerical method we now need to require U to be finite-dimensional


with basis {u1 , . . . , un }. Then we can represent the approximate solution uh of (113)
as
u^h = Σ_{j=1}^{n} c_j u_j.    (118)

The superscript h indicates that the approximate solution is obtained on some under-
lying discretization of Ω with mesh size h.

Remark 1. In practice there are many ways of discretizing Ω and selecting U.

(a) For example, regular (tensor product) grids can be used. Then U can consist
of tensor products of piecewise polynomials or B-spline functions that satisfy
the boundary conditions of the PDE.
(b) It is also possible to use irregular (triangulated) meshes, and again define
piecewise (total degree) polynomials or splines on triangulations satisfying
the boundary conditions.

(c) More recently, meshfree approximation methods have been introduced as
possible choices for U.

2. In the literature the piecewise polynomial approach is usually referred to as the


finite element method.

3. The discretization of Ω will almost always result in a computational domain that


has piecewise linear (Lipschitz-continuous) boundary.

We now return to the discussion of the general numerical method. Once we have
chosen a basis for the approximation space U, then it becomes our goal to determine
the coefficients cj in (118). By inserting uh into the weak form (117), and selecting as
trial functions v the basis functions of U we obtain a system of equations
∫∫_Ω ∇u^h · ∇u_i dx dy = ∫∫_Ω f u_i dx dy,    i = 1, . . . , n.

Using the representation (118) of uh we get


∫∫_Ω ∇( Σ_{j=1}^{n} c_j u_j ) · ∇u_i dx dy = ∫∫_Ω f u_i dx dy,    i = 1, . . . , n,

or by linearity
Σ_{j=1}^{n} c_j ∫∫_Ω ∇u_j · ∇u_i dx dy = ∫∫_Ω f u_i dx dy,    i = 1, . . . , n.    (119)

This last set of equations is known as the Ritz-Galerkin method and can be written in
matrix form
Ac = b,
where the stiffness matrix A has entries
A_{i,j} = ∫∫_Ω ∇u_j · ∇u_i dx dy.

Remark 1. The stiffness matrix is usually assembled element by element, i.e., the
contribution to the integral over Ω is split into contributions for each element
(e.g., rectangle or triangle) of the underlying mesh.

2. Depending on the choice of the (finite-dimensional) approximation space U and


underlying discretization, the matrix will have a well-defined structure. This is
one of the most important applications driving the design of efficient linear system
solvers.

Example One of the most popular finite element versions is based on the use of
piecewise linear C 0 polynomials (built either on a regular grid, or on a triangular
partition of Ω). The basis functions ui are “hat functions”, i.e., functions that are

piecewise linear, have value one at one of the vertices, and zero at all of its neighbors.
This choice makes it very easy to satisfy the homogeneous Dirichlet boundary conditions
of the model problem exactly (along a polygonal boundary).
Since the gradients of piecewise linear functions are constant, the entries of the
stiffness matrix essentially boil down to the areas of the underlying mesh elements.
Therefore, in this case, the Ritz-Galerkin method is very easily implemented. We
generate some examples with Matlab’s PDE toolbox pdetool.
It is not difficult to verify that the stiffness matrix for our example is symmetric
and positive definite. Since the matrix is also very sparse due to the fact that the “hat”
basis functions have a very localized support, efficient iterative solvers can be applied.
Moreover, it is known that the piecewise linear FEM converges with order O(h2 ).
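As a concrete (and deliberately simple) illustration of the element-by-element assembly idea, here is a hedged Matlab sketch for the one-dimensional analogue −u'' = f on (0, 1) with u(0) = u(1) = 0 and piecewise linear hat functions on a uniform mesh; the 2D examples in these notes are generated with pdetool instead.

    % Hedged 1D sketch: assemble the stiffness matrix element by element.
    n = 50; h = 1/(n+1);
    x = (1:n)'*h;                        % interior nodes
    A = sparse(n, n);
    f = @(x) pi^2*sin(pi*x);             % sample load; exact solution sin(pi*x)
    for e = 1:n+1                        % element e spans nodes e-1 and e
        Ke  = (1/h)*[1 -1; -1 1];        % element stiffness (gradients are +-1/h)
        idx = [e-1, e];                  % global node numbers (0 and n+1 are boundary)
        for i = 1:2
            for j = 1:2
                if idx(i) >= 1 && idx(i) <= n && idx(j) >= 1 && idx(j) <= n
                    A(idx(i), idx(j)) = A(idx(i), idx(j)) + Ke(i, j);
                end
            end
        end
    end
    b = h*f(x);                          % lumped approximation of the load integrals
    u = A \ b;                           % sparse SPD system
    max(abs(u - sin(pi*x)))              % error of order h^2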

Remark 1. The Ritz-Galerkin method was independently introduced by Walther


Ritz (1908) and Boris Galerkin (1915).

2. The finite element method is one of the most-thoroughly studied numerical meth-
ods. Many textbooks on the subject exist, e.g., “The Mathematical Theory of
Finite Element Methods” by Brenner and Scott (1994), “An Analysis of the Finite
Element Method” by Strang and Fix (1973), or “The Finite Element Method”
by Zienkiewicz and Taylor (2000).

12.3 Optimality of the Ritz-Galerkin Method


How does solving the Ritz-Galerkin equations (119) relate to the solution of the strong
form (113) of the PDE? First, we remark that the left-hand side of (117) can be
interpreted as a new inner product
[u, v] = ∫∫_Ω ∇u · ∇v dx dy    (120)

on the space of functions whose first derivatives are square integrable and that vanish
on ∂Ω. This space is a Sobolev space, usually denoted by H01 (Ω).
The inner product [·, ·] induces a norm ‖v‖ = [v, v]^{1/2} on H_0^1(Ω). Now, using
this norm, the best approximation to u from H_0^1(Ω) is given by the function u^h that
minimizes ‖u − u^h‖. Since we define our numerical method via the finite-dimensional
subspace U of H_0^1(Ω), we need to find u^h such that

u − uh ⊥ U

or, using the basis of U,


[u − u^h, u_i] = 0,    i = 1, . . . , n.

Replacing uh with its expansion in terms of the basis of U we have


 
[u − Σ_{j=1}^{n} c_j u_j, u_i] = 0,    i = 1, . . . , n,

or
Σ_{j=1}^{n} c_j [u_j, u_i] = [u, u_i],    i = 1, . . . , n.    (121)

The right-hand side of this formula contains the exact solution u, and therefore is not
useful for a numerical scheme. However, by (120) and the weak form (117) we have
[u, u_i] = ∫∫_Ω ∇u · ∇u_i dx dy = ∫∫_Ω f u_i dx dy.

Since the last expression corresponds to the inner product hf, ui i, (121) can be viewed
as
Σ_{j=1}^{n} c_j [u_j, u_i] = ⟨f, u_i⟩,    i = 1, . . . , n,

which is nothing but the Ritz-Galerkin method (119).


The best approximation property in the Sobolev space H01 (Ω) can also be inter-
preted as an energy minimization principle. In fact, a smooth solution of the Poisson
problem (113) minimizes the energy functional
E(u) = (1/2) ∫∫_Ω ∇u · ∇u dx dy − ∫∫_Ω f u dx dy

over all smooth functions that vanish on the boundary of Ω. By considering the energy
of nearby solutions u + λv, with arbitrary real λ we see that
E(u + λv) = (1/2) ∫∫_Ω ∇(u + λv) · ∇(u + λv) dx dy − ∫∫_Ω f (u + λv) dx dy
          = (1/2) ∫∫_Ω ∇u · ∇u dx dy + λ ∫∫_Ω ∇u · ∇v dx dy + (λ^2/2) ∫∫_Ω ∇v · ∇v dx dy
            − ∫∫_Ω f u dx dy − λ ∫∫_Ω f v dx dy
          = E(u) + λ ∫∫_Ω [∇u · ∇v − f v] dx dy + (λ^2/2) ∫∫_Ω ∇v · ∇v dx dy.

The right-hand side is a quadratic polynomial in λ, so that for a minimum, the term
∫∫_Ω [∇u · ∇v − f v] dx dy

must vanish for all v. This is again the weak formulation (117).
A discrete “energy norm” is then given by the quadratic form
E(u^h) = (1/2) c^T A c − b^T c

where A is the stiffness matrix, and c is such that the Ritz-Galerkin system (119)

Ac = b

is satisfied.

13 Classical Iterative Methods for the Solution of Linear
Systems
13.1 Why Iterative Methods?
Virtually all methods for solving Ax = b or Ax = λx require O(m3 ) operations. In
practical applications A often has a certain structure and/or is sparse, i.e., A contains
many zeros.
A typical problem that arises in practice is the Poisson problem mentioned at the
beginning of the class. We want to find u such that

−∇2 u(x, y) = − [uxx (x, y) + uyy (x, y)] = f (x, y), in Ω = [0, 1]2
u(x, y) = 0, on ∂Ω.

One of the standard numerical algorithms is a finite difference approach. The


Laplacian is discretized on a grid of (n + 1)2 equally spaced points (xi , yj ) = (ih, jh),
i, j = 0, . . . , n with h = 1/n. This results in the discrete Laplacian

∇^2 u(x_i, y_j) ≈ (u_{i−1,j} + u_{i,j−1} + u_{i+1,j} + u_{i,j+1} − 4u_{i,j}) / h^2,
where ui,j = u(xi , yj ).
The boundary conditions of the PDE allow us to set the solution at the points along
the boundary as

ui,0 = ui,n = u0,j = un,j = 0, i, j = 0, 1, . . . , n.

At the (n − 1)2 interior grid points we obtain the following system of linear equations
for the values of u there
4u_{i,j} − u_{i−1,j} − u_{i,j−1} − u_{i+1,j} − u_{i,j+1} = f_{i,j}/n^2,    i, j = 1, . . . , n − 1.
The system matrix is of size m × m, where m = (n − 1)2 . Each row contains at most
five nonzero entries, and therefore is very sparse. Thus, special methods are called
for to take advantage of this sparsity when we solve this linear system. Obviously, a
full-blown LU or Cholesky factorization will be much too costly if m is large (typical
values for m are often 106 or even larger).

13.2 The Splitting Approach


The basic iterative scheme to solve Ax = b will be of the form

x(k) = Gx(k−1) + c, k = 1, 2, 3, . . . . (38)

Here we assume that A ∈ Cm×m , x(0) is an initial guess for the solution, and G and c
are a constant iteration matrix and vector, respectively, defining the iterative scheme.
Most classical iterative methods are based on a splitting of the matrix A of the form

A=M −N

with a nonsingular matrix M . One then defines
G = M −1 N and c = M −1 b.
Then (38) becomes
x(k) = M −1 N x(k−1) + M −1 b
or
M x(k) = N x(k−1) + b. (39)
In practice we will want to choose the splitting factors so that
1. (39) is easily solved,
2. (39) converges rapidly.
Theorem 13.1 If
kGk = kM −1 N k < 1
then (38) converges to a solution of Ax = b for any initial guess x(0) .
Proof (38) describes a fixed point iteration (i.e., is of the form x = g(x)), and the
fixed point of (38) is a solution of Ax = b as can be seen from
x = Gx + c
⇐⇒ x = M −1 N x + M −1 b
⇐⇒ Mx = Nx + b
⇐⇒ (M − N) x = b,    i.e.,    Ax = b.

Now we let e(k) = x(k) − x, where x is the solution of the fixed point problem, and
show that this quantity goes to zero as k → ∞. First we observe that
e(k) = x(k) − x
= Gx(k−1) − Gx
= G x(k−1) − x
= Ge(k−1) .
Taking norms we have
ke(k) k = kGe(k−1) k
≤ kGkke(k−1) k
≤ kGkk ke(0) k,
where the last inequality is obtained by recursion.
If now – as we assume – kGk < 1, then ke(k) k → 0 as k → ∞, and therefore
x(k) → x and the method converges.

With some more effort one can show


Theorem 13.2 The iteration (38) converges to the solution of Ax = b for any initial
guess x(0) if and only if ρ(G) = ρ(M −1 N ) < 1.
Here ρ(G) is the spectral radius of G, i.e., the largest eigenvalue of G (in modulus).

13.3 How should we choose M and N ?
13.3.1 The Jacobi Method
We formally decompose A = L + D + U into a lower triangular, diagonal, and upper
triangular part. Then we let

M = D, N = −(L + U ).

Thus (39) becomes


Dx(k) = −(L + U )x(k−1) + b
or

x^(k) = D^{−1} [ b − (L + U) x^(k−1) ].

By writing this formula componentwise we have

x_i^(k) = ( b_i − Σ_{j≠i} a_{ij} x_j^(k−1) ) / a_{ii},    i = 1, . . . , m.
This means we have the following algorithm.
Algorithm (Jacobi method)

Let x(0) be an arbitrary initial guess

for k = 1, 2, . . .

for i = 1 : m
 
x_i^(k) = ( b_i − Σ_{j=1}^{i−1} a_{ij} x_j^(k−1) − Σ_{j=i+1}^{m} a_{ij} x_j^(k−1) ) / a_{ii}

end

end
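In matrix form the entire loop can be expressed through the splitting. The following is a hedged Matlab sketch of that matrix form (it is not the homework one-liner for the Poisson grid mentioned below):

    % Hedged sketch of the Jacobi method in matrix form (A = L + D + U).
    function x = jacobi(A, b, x, maxit, tol)
      D = diag(diag(A));              % diagonal part of A
      R = A - D;                      % L + U
      for k = 1:maxit
          xnew = D \ (b - R*x);       % x^(k) = D^{-1} [ b - (L+U) x^(k-1) ]
          if norm(xnew - x) <= tol*norm(xnew)
              x = xnew; return
          end
          x = xnew;
      end
    end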

Example If we apply the Jacobi method to the finite difference discretization of the
Poisson problem then we can be more efficient by taking advantage of the matrix
structure. The central part of the algorithm (the loop for i = 1 : m) can then be
replaced by

for i = 1 : n − 1

for j = 1 : n − 1
 
u_{i,j}^(k) = ( u_{i−1,j}^(k−1) + u_{i,j−1}^(k−1) + u_{i+1,j}^(k−1) + u_{i,j+1}^(k−1) + f_{i,j}/n^2 ) / 4
end

end

Note that the unknowns are now uij instead of xi . This algorithm can be implemented
in one line of Matlab (see homework).

Remark While the Jacobi method is not used that often in practice on serial comput-
ers it does lend itself to a naturally parallel implementation.

In order to get a convergence result for the Jacobi method we need to recall the
concept of diagonal dominance. We say a matrix A is strictly row diagonally dominant
if

|a_{ii}| > Σ_{j≠i} |a_{ij}|    for all i = 1, . . . , m.

Theorem 13.3 If A is strictly row diagonally dominant, then the Jacobi method con-
verges for any initial guess x(0) .

Proof By the definition of diagonal dominance above we have


|a_{ii}| > Σ_{j≠i} |a_{ij}|    ⇐⇒    Σ_{j≠i} |a_{ij}| / |a_{ii}| < 1.

We are done if we can show that ‖G‖ < 1. Here

‖G‖ = ‖M^{−1} N‖ = ‖D^{−1}(L + U)‖.

Since we can take any norm, we pick ‖ · ‖_∞. Then

‖G‖_∞ = ‖D^{−1}(L + U)‖_∞ = max_{1≤i≤m} Σ_{j≠i} |a_{ij}| / |a_{ii}| < 1

by the diagonal dominance.

Remark An analogous result holds if A is strictly column diagonally dominant (de-


fined analogously).

As we will see in some numerical examples, the convergence of the Jacobi method
is usually rather slow. A (usually) faster method is discussed next.

13.3.2 Gauss-Seidel Method


To see how the Jacobi method can be improved we consider an example.

Example For the system

2x1 + x2 = 6
x1 + 2x2 = 6

or

    [ 2  1 ] [ x_1 ]   [ 6 ]
    [ 1  2 ] [ x_2 ] = [ 6 ]

the Jacobi method looks like

x_1^(k) = ( 6 − x_2^(k−1) ) / 2
x_2^(k) = ( 6 − x_1^(k−1) ) / 2.

In order to obtain an improvement we notice that the value of x_1^(k−1) used in the
second equation is actually outdated since we already computed a newer version, x_1^(k),
in the first equation. Therefore, we might consider

x_1^(k) = ( 6 − x_2^(k−1) ) / 2
x_2^(k) = ( 6 − x_1^(k) ) / 2

instead. This is known as the Gauss-Seidel method. The general algorithm is of the
form

Algorithm (Gauss-Seidel method)

Let x(0) be an arbitrary initial guess

for k = 1, 2, . . .

for i = 1 : m
 
x_i^(k) = ( b_i − Σ_{j=1}^{i−1} a_{ij} x_j^(k) − Σ_{j=i+1}^{m} a_{ij} x_j^(k−1) ) / a_{ii}

end

end

Example For one step of the finite difference solution of the Poisson problem we get

for i = 1 : n − 1

for j = 1 : n − 1
 
u_{i,j}^(k) = ( u_{i−1,j}^(k) + u_{i,j−1}^(k) + u_{i+1,j}^(k−1) + u_{i,j+1}^(k−1) + f_{i,j}/n^2 ) / 4
end

end

Remark Note that the implementation of the Gauss-Seidel algorithm for this example
depends on the ordering of the grid points. We used the natural (or typewriter) order-
ing, i.e., we scan the grid points row by row from left to right. Sometimes a red-black
(or chessboard) ordering is used. This is especially useful if the Gauss-Seidel method
is to be parallelized.

In order to understand the matrix formulation of the Gauss-Seidel method in the


spirit of (39) we again assume that A = L + D + U .
Now the splitting matrices are chosen as

M = D+L
N = −U,

and (39) becomes

M x(k) = N x(k−1) + b
⇐⇒ (D + L)x(k) = b − U x(k−1)

or

x^(k) = (D + L)^{−1} [ b − U x^(k−1) ].

Note that this is also equivalent to

Dx(k) = b − Lx(k) − U x(k−1)

or

x^(k) = D^{−1} [ b − L x^(k) − U x^(k−1) ].

This latter formula corresponds nicely to the algorithm above.
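A hedged matrix-form sketch, analogous to the Jacobi sketch above; the forward solve with D + L is cheap since that matrix is lower triangular:

    % Hedged sketch of the Gauss-Seidel method in matrix form.
    function x = gauss_seidel(A, b, x, maxit, tol)
      DL = tril(A);                   % D + L (lower triangle incl. diagonal)
      U  = triu(A, 1);                % strictly upper triangular part
      for k = 1:maxit
          xnew = DL \ (b - U*x);      % x^(k) = (D+L)^{-1} [ b - U x^(k-1) ]
          if norm(xnew - x) <= tol*norm(xnew)
              x = xnew; return
          end
          x = xnew;
      end
    end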


The convergence criteria for Gauss-Seidel iteration are a little more general than
those for the Jacobi method. One can show

Theorem 13.4 The Gauss-Seidel method converges for any initial guess x(0) if

1. A is strictly diagonally dominant, or

2. A is symmetric positive definite.

Remark For a generic problem the Gauss-Seidel method converges faster than the
Jacobi method (see the Maple worksheet 473 IterativeSolvers.mws). However, this
does not mean that sometimes the Jacobi method may not be faster.

Remark A careful reader may notice that the sufficient conditions given in the con-
vergence theorems for the Jacobi and Gauss-Seidel methods do not cover the matrix
for our finite-difference Poisson problem. However, there are variations of the theorems
that do cover this important example.

13.3.3 Successive Over-Relaxation (SOR)
One can accelerate the convergence of the Gauss-Seidel method by using a weighted
average of the new Gauss-Seidel value with the one obtained during the previous iter-
ation:
x^(k) = (1 − ω) x^(k−1) + ω x_GS^(k).
Here ω is the so-called relaxation parameter. If ω = 1 we simply have the Gauss-Seidel
method. For ω > 1 one speaks of over-relaxation, and for ω < 1 of under-relaxation.
The resulting algorithm is known as successive over-relaxation (SOR) and is (obvi-
ously) a variation of the Gauss-Seidel algorithm.
Algorithm (SOR)

Let x(0) be an arbitrary initial guess

for k = 1, 2, . . .

for i = 1 : m

 
x_i^(k) = (1 − ω) x_i^(k−1) + ω ( b_i − Σ_{j=1}^{i−1} a_{ij} x_j^(k) − Σ_{j=i+1}^{m} a_{ij} x_j^(k−1) ) / a_{ii},

where the second term is ω x_{i,GS}^(k).

end

end

Example For one step of the SOR algorithm applied to the finite difference solution
of the Poisson problem we get

for i = 1 : n − 1

for j = 1 : n − 1

 
u_{i,j}^(k) = (1 − ω) u_{i,j}^(k−1) + ω ( u_{i−1,j}^(k) + u_{i,j−1}^(k) + u_{i+1,j}^(k−1) + u_{i,j+1}^(k−1) + f_{i,j}/n^2 ) / 4

end

end

Remark Just as for the Gauss-Seidel algorithm, the implementation of the SOR
method depends on the ordering of the grid points.

To describe the SOR method in terms of splitting matrices we again assume that
A = L + D + U , and take
M = (1/ω) D + L,
N = ((1/ω) − 1) D − U.

With these choices (39) becomes

M x^(k) = N x^(k−1) + b
⇐⇒ ( (1/ω) D + L ) x^(k) = ( ((1/ω) − 1) D − U ) x^(k−1) + b.

The rearrangements that show that this formulation is indeed equivalent to the formula
used in the algorithm above are:
    
( (1/ω) D + L ) x^(k) = ( ((1/ω) − 1) D − U ) x^(k−1) + b
⇐⇒ (1/ω) D x^(k) = ( ((1/ω) − 1) D − U ) x^(k−1) + b − L x^(k)
⇐⇒ x^(k) = ω D^{−1} ( ((1/ω) − 1) D − U ) x^(k−1) + ω D^{−1} b − ω D^{−1} L x^(k)
⇐⇒ x^(k) = x^(k−1) − ω x^(k−1) − ω D^{−1} U x^(k−1) + ω D^{−1} b − ω D^{−1} L x^(k)
⇐⇒ x^(k) = (1 − ω) x^(k−1) + ω D^{−1} [ b − L x^(k) − U x^(k−1) ].

Note that the expression inside the square brackets on the last line is just what we had
for the Gauss-Seidel method earlier.
One can prove a general convergence theorem that is similar to those for the Jacobi
and Gauss-Seidel methods:

Theorem 13.5 If A is symmetric positive definite then the SOR method with 0 < ω <
2 converges for any starting value x(0) .

Remark Note that this theorem says nothing about the speed of convergence. In fact,
finding a good value for the relaxation parameter ω is quite difficult. The value of ω
that yields the fastest convergence of the SOR method is known only in very special
cases. For example, if A is tridiagonal then
ω_opt = 2 / ( 1 + sqrt(1 − ρ(G)) ),

where G = M −1 N = (D + L)−1 (−U ) is the iteration matrix for the Gauss-Seidel


method.
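As a hedged illustration (not part of the original worksheet), one can estimate ω_opt numerically for a small tridiagonal model problem; in practice ρ(G) is rarely known exactly, which is precisely why choosing ω is difficult.

    % Hedged sketch: optimal SOR parameter for a 1D Poisson-type tridiagonal matrix.
    n = 50;
    A = full(gallery('tridiag', n, -1, 2, -1));
    D = diag(diag(A)); L = tril(A, -1); U = triu(A, 1);
    G = -(D + L) \ U;                     % Gauss-Seidel iteration matrix
    rho = max(abs(eig(G)));               % spectral radius of G
    omega_opt = 2/(1 + sqrt(1 - rho))     % about 1.88 for this example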

The convergence behavior of all three classical methods is illustrated in the Maple
worksheet 473 IterativeSolvers.mws.

14 Arnoldi Iteration and GMRES
14.1 Arnoldi Iteration
The classical iterative solvers we have discussed up to this point were of the form

x(k) = Gx(k−1) + c

with constant G and c. Such methods are also known as stationary methods. We will
now study a different class of iterative solvers based on optimization.
All methods will require a “black box” implementation of a matrix-vector product.
In most library implementations of such solvers the user can therefore provide a custom
function which computes the matrix-vector product as efficiently as possible for the
specific system matrix at hand.
One of the main ingredients in all of the following methods are Krylov subspaces.
Given A ∈ Cm×m and b ∈ Cm one generates

{b, Ab, A2 b, A3 b, . . .},

which is referred to as a Krylov sequence. Clearly, (fast) matrix-vector products play


a crucial role in generating this sequence since each subsequent vector in the sequence
is obtained from the previous one by multiplication by A.
With the Krylov sequence at hand one defines

Kn = span{b, Ab, A2 b, . . . , An−1 b}

as the n-th order Krylov subspace.


The Arnoldi iteration method to be derived will be applicable to both linear systems
and eigenvalue problems, and therefore we are interested in re-examining similarity
transformations of the form
A = QHQ∗ ,
where H is an upper Hessenberg matrix.
In our earlier work we used Householder reflectors to transform A to upper Hessen-
berg form. This had its advantages since the resulting algorithm is a stable one. When
studying the QR factorization we also looked at the modified Gram-Schmidt algorithm.
That algorithm was less stable. However, it has the advantage that it produces the
columns of the unitary matrix Q one at a time, i.e., the modified Gram-Schmidt
algorithm can be stopped at any time and yields a partial set of orthonormal column
vectors. On the other hand, with Householder reflectors we always have to perform the
entire QR factorization before we get (all) orthonormal vectors.
Thus, Arnoldi iteration can be seen as the use of the modified Gram-Schmidt algo-
rithm in the context of Hessenberg reduction.

14.2 Derivation of Arnoldi Iteration


We start with the similarity transformation A = QHQ∗ with m × m matrices A, Q,
and H. Clearly, this is equivalent to

AQ = QH.

Now we take n < m, so that the eigenvalue equations above can be written as

A [q_1, q_2, . . . , q_n, q_{n+1}, . . . , q_m] = [q_1, q_2, . . . , q_n, q_{n+1}, . . . , q_m] H,

where H is the m × m upper Hessenberg matrix

        [ h_11   h_12   . . .    h_1n      . . .    h_1m  ]
        [ h_21   h_22   . . .    h_2n      . . .    h_2m  ]
        [  0     h_32   h_33     h_3n      . . .    h_3m  ]
        [         0     h_43       .                      ]
    H = [                 .         .        .            ]
        [                       h_n,n−1    h_nn           ]
        [  0                               h_n+1,n        ]
        [                                     .       .   ]
        [  0     . . .             0      h_m,m−1   h_mm  ]

Next, we consider only part of this system. Namely, we let

Q_n = [q_1, q_2, . . . , q_n],
Q_{n+1} = [q_1, q_2, . . . , q_n, q_{n+1}],

and the (n + 1) × n upper Hessenberg matrix

          [ h_11   h_12   . . .    h_1n    ]
          [ h_21   h_22   . . .    h_2n    ]
          [  0     h_32   h_33     h_3n    ]
    H̃_n = [         0     h_43       .     ]
          [                  .        .    ]
          [               h_n,n−1   h_nn   ]
          [  0                    h_n+1,n  ]

and then take

    A Q_n = Q_{n+1} H̃_n.

Note that here A ∈ C^{m×m}, Q_n ∈ C^{m×n}, Q_{n+1} ∈ C^{m×(n+1)}, and H̃_n ∈ C^{(n+1)×n} so that
both sides of the equation result in an m × n matrix.
If we compare the n-th columns on both sides, then we get

A q_n = h_{1n} q_1 + h_{2n} q_2 + . . . + h_{nn} q_n + h_{n+1,n} q_{n+1}    (40)

which constitutes an (n + 1) term recursion for the vector q n+1 . Equation (40) can be
re-written as
q_{n+1} = ( A q_n − Σ_{i=1}^{n} h_{i,n} q_i ) / h_{n+1,n}.    (41)
The recursive computation of the columns of the unitary matrix Q in this manner is
known as Arnoldi iteration.

Example The first step of Arnoldi iteration proceeds as follows. We start with the
matrix A and an arbitrary normalized vector q 1 . Then, according to (41),
q_2 = ( A q_1 − h_11 q_1 ) / h_21.

Note that this step involves the matrix-vector product Aq 1 (which has to be computed
efficiently with a problem specific subroutine).
Since we want q ∗1 q 2 = 0 in order to have orthogonality of the columns of Q we get

0 = q ∗1 Aq 1 − h11 q ∗1 q 1 .

Taking advantage of the normalization of q 1 this is equivalent to

h11 = q ∗1 Aq 1

– a Rayleigh quotient.
Finally, we let v = Aq 1 − h11 q 1 , compute h21 = kvk, and normalize
q_2 = v / h_21.

If we compare (15) with the formula


q_n = ( a_n − Σ_{i=1}^{n−1} r_{i,n} q_i ) / r_{nn}
that we used earlier for the Gram-Schmidt method (cf. (17)), then it is clear that an
algorithm for Arnoldi iteration will be similar to that for the modified Gram-Schmidt
algorithm:
Algorithm (Arnoldi Iteration)

Let b be an arbitrary initial vector

q 1 = b/kbk2

for n = 1, 2, 3, . . .

v = Aq n
for j = 1 : n
hjn = q ∗j v
v = v − hjn q j
end
hn+1,n = kvk2
q n+1 = v/hn+1,n

end

Remark The most expensive operation in the algorithm is the matrix-vector product
Aq n . The rest of the operations are on the order of O(mn) (so they get a little more
expensive in each iteration). Therefore, in addition to the basic Arnoldi algorithm we
need an efficient implementation of the matrix-vector product; this should be tailored
to the problem. Moreover, the algorithm above treats the matrix-vector product as a
“black box” and the algorithm does not need to know or store the matrix A. The only
quantity of interest is the product Aq n , i.e., the action of A on q n .
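A hedged Matlab sketch of the algorithm, with the matrix-vector product supplied as a function handle Afun (so that A itself is never formed or stored):

    % Hedged sketch of n steps of Arnoldi iteration (modified Gram-Schmidt form).
    function [Q, H] = arnoldi(Afun, b, n)
      m = length(b);
      Q = zeros(m, n+1); H = zeros(n+1, n);
      Q(:,1) = b / norm(b);
      for k = 1:n
          v = Afun(Q(:,k));               % the only place A enters
          for j = 1:k                     % orthogonalize against previous columns
              H(j,k) = Q(:,j)' * v;
              v = v - H(j,k) * Q(:,j);
          end
          H(k+1,k) = norm(v);             % breakdown if this is zero
          Q(:,k+1) = v / H(k+1,k);
      end
    end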

14.3 Arnoldi Iteration as Projection onto Krylov Subspaces
An alternative derivation of Arnoldi iteration starts with the Krylov matrix

Kn = [b, Ab, A2 b, . . . , An−1 b].

Then
AKn = [Ab, A2 b, A3 b . . . , An b]. (42)
Since the first n − 1 columns of the matrix on the right-hand side are the last n − 1
columns of Kn we can also write

AKn = Kn [e2 , e3 , . . . , en , −c],

where
c = −Kn−1 An b
and we assume that Kn is invertible. Equivalently,

AKn = Kn Cn

with the upper Hessenberg matrix

Cn = [e2 , e3 , . . . , en , −c].

Thus A and C_n are similar via K_n^{−1} A K_n = C_n. The problem with this formulation is
that Kn is usually ill-conditioned (since all of its columns converge to the dominant
eigenvector of A, cf. our earlier discussion of simultaneous power iteration).

Remark As a side remark we mention that the matrix C_n above is known as a com-
panion matrix. The matrix has characteristic polynomial p(z) = z^n + Σ_{i=1}^{n} c_i z^{i−1},

where the ci are the components of c. In other words, the eigenvalues of Cn are the
roots of p. This goes also in the other direction. Given a monic polynomial p, we can
form its companion matrix Cn and then know that the roots of p are the same as the
eigenvalues of Cn .

Returning to the derivation of Arnoldi iteration, we still need to show how the above
Krylov subspace formulation is related to the earlier one based on the Gram-Schmidt
method.
Denote the QR factorization of the Krylov matrix Kn by

Kn = Qn Rn .

Then

K_n^{−1} A K_n = C_n
⇐⇒ R_n^{−1} Q_n^* A Q_n R_n = C_n
⇐⇒ Q_n^* A Q_n = R_n C_n R_n^{−1} =: H_n.

It needs to be pointed out that this approach is computationally not a good one.
It is both too expensive and unstable. On the one hand, finding c involves solution of
the (ill-conditioned) linear system Kn c = An b. On the other hand, we would also be
required to provide the inverse of Rn .
However, we can get some theoretical insight from this approach. The formula

Q∗n AQn = Hn

can be interpreted as an orthogonal projection of A onto K_n with the columns of Q_n as
basis, i.e., Arnoldi iteration is nothing but orthogonal projection onto Krylov subspaces.

Remark If A is Hermitian, then everything above simplifies (e.g., Hessenberg matrices
turn into tridiagonal matrices), and we get what is known in the literature as Lanczos
iteration.

14.4 GMRES
The method of generalized minimum residuals (or GMRES) was suggested in 1986 by
Saad and Schultz.
While application of the classical iterative solvers was limited to either diagonally
dominant or positive definite matrices, the GMRES method can be used for linear sys-
tems Ax = b with arbitrary (nonsingular) square matrices A. The essential ingredient
in this general iterative solver is Arnoldi iteration.
The main idea of the GMRES method is to solve a least squares problem at each step
of the iteration. More precisely, at step n we approximate the exact solution x∗ = A−1 b
by a vector xn ∈ Kn (the n-th order Krylov subspace) such that the residual

kr n k2 = kAxn − bk2

is minimized. We now describe how to solve this least squares problem.


We start with the Krylov matrix

K_n = [b, Ab, A^2 b, . . . , A^{n−1} b] ∈ C^{m×n}.

Thus the column space of A K_n is A K_n.


Now the desired vector xn ∈ Kn can be written as

xn = Kn c

for some appropriate vector c ∈ Cn . Therefore the residual minimization becomes

kr n k2 = kAxn − bk2 = kAKn c − bk2 → min.

The obvious way to find the least squares solution to this problem would be to compute
the QR factorization of the matrix AKn . However, this is both unstable and too
expensive.
Instead, we look for an orthonormal basis for the Krylov subspace Kn . We will
denote this by {q 1 , q 2 , . . . , q n }, the columns of the matrix Qn used in Arnoldi iteration.
With this new basis the approximate solution xn ∈ Kn can be written as

xn = Qn y

for some appropriate vector y ∈ Cn . The residual minimization is then

kAxn − bk2 = kAQn y − bk2 → min. (43)

It is now time to recall the principle behind the Arnoldi iteration. That algorithm
is based on the partial similarity transform

A Q_n = Q_{n+1} H̃_n.

Thus (43) turns into

‖Q_{n+1} H̃_n y − b‖_2 → min.

Next we take advantage of the fact that multiplication by a unitary matrix does
not change the 2-norm. Thus, we arrive at

‖Q_{n+1}^* Q_{n+1} H̃_n y − Q_{n+1}^* b‖_2 → min
⇐⇒ ‖H̃_n y − Q_{n+1}^* b‖_2 → min.

Note that the system matrix A Q_n in (43) is an m × n matrix, while the new matrix
H̃_n is an (n + 1) × n matrix which is smaller, and therefore will permit a more efficient
solution.
The final simplification we can make is for the vector Q∗n+1 b. In detail, this vector
is given by
Q_{n+1}^* b = [q_1^* b, q_2^* b, . . . , q_{n+1}^* b]^T.
Recall that the Krylov subspaces are given by

K1 = span{b},
K2 = span{b, Ab},
..
. ,

and that the columns q j of Qn form an orthonormal basis for Kn . Thus

q_1 = b / ‖b‖

and q_j^* b = 0 for any j > 1. Therefore we actually have

Q_{n+1}^* b = ‖b‖ e_1.

Combining all of this work we arrive at the final least squares formulation

‖H̃_n y − ‖b‖_2 e_1‖_2 → min

with x_n = Q_n y.

This leads to

Algorithm (GMRES)

Let q 1 = b/kbk

for n = 1, 2, 3, . . .

Perform step n of Arnoldi iteration, i.e., compute new entries for H̃_n and Q_n.

Find y that minimizes ‖H̃_n y − ‖b‖ e_1‖_2 (e.g., with QR factorization)
Set xn = Qn y

end

The computational cost for the GMRES algorithm depends on the cost of the
method used for the least squares problem and that of the Arnoldi iteration. Since
the least squares system matrix is of size (n + 1) × n it can be done in O(n2 ) floating
point operations. If an updating QR factorization (which we did not discuss) is used,
then even O(n) is possible. Arnoldi iteration takes O(mn) operations plus the cost
for matrix-vector multiplications. These can usually be accomplished at somewhere
between O(m) and O(m2 ) flops.
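Putting the pieces together, here is a hedged (and deliberately unoptimized) Matlab sketch that reuses the arnoldi routine sketched earlier; a production code would extend the Arnoldi factorization incrementally and update the QR factorization of H̃_n instead of rebuilding everything at each step.

    % Hedged GMRES sketch; Afun is the matrix-vector product, tol a relative tolerance.
    function x = gmres_sketch(Afun, b, nmax, tol)
      normb = norm(b);
      for n = 1:nmax
          [Q, H] = arnoldi(Afun, b, n);   % H is (n+1) x n, Q has n+1 columns
          e1 = [normb; zeros(n, 1)];      % right-hand side ||b|| e_1
          y  = H \ e1;                    % small least squares problem
          x  = Q(:, 1:n) * y;             % x_n = Q_n y
          if norm(Afun(x) - b) / normb < tol
              return
          end
      end
    end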

14.4.1 Convergence of GMRES


In exact arithmetic (which we of course do not have in a numerical computing environ-
ment) GMRES iteration will always converge in at most m steps (since Km = Cm =
range(A)). Moreover, convergence is monotonic since kr n+1 k ≤ kr n k. This latter fact
is true since r n is minimized over Kn , and we have Kn+1 ⊃ Kn . Thus, minimization
over a larger subspace will allow us to achieve a smaller residual norm.
For practical computations the only case of interest is when the algorithm converges
(to within a specified tolerance) in n iterations with n  m. Usually the relative
residual is used to control the convergence, i.e., we check whether

kr n k
< tol,
kbk

where tol = 10−6 or 10−8 . The MATLAB code GMRESDemo.m illustrates this.
The test problems are given by 200 × 200 systems with random matrices and ran-
dom right-hand side vectors b. The first test matrix is a matrix A whose entries are
independent samples from the real normal distribution of mean 2 and standard deviation
0.5/√200. Its eigenvalues are clustered in a disk in the complex plane of radius
1/2 centered at z = 2. The other test matrices have different eigenvalue distributions.
They are obtained as

• B = A + D, where the entries of the diagonal matrix D are the complex numbers
d_k = (−2 + 2 sin θ_k) + i cos θ_k,    θ_k = k / (m − 1),    0 ≤ k ≤ m − 1.

• C, a random matrix whose eigenvalues are loosely clustered in the unit disk.

• D = C T C, a random symmetric positive definite matrix (with real positive eigen-
values).

• E = (1/2) C, another random matrix whose eigenvalues are more densely clustered
than those of C.

The rate of convergence of the GMRES method depends on the distribution of


the eigenvalues of A in the complex plane. For fast convergence the eigenvalues need
to be clustered away from the origin. Note that the eigenvalue distribution is much
more important than the condition number of A, which is the main criterion for rapid
convergence of the conjugate gradient method (see next section).
If the GMRES method is applied to a matrix with a “good” eigenvalue distribution
(or is appropriately preconditioned ) then it may solve even a system with an unstruc-
tured dense matrix faster than standard LU factorization.

15 Conjugate Gradients
This method for symmetric positive definite matrices is considered to be the “original”
Krylov subspace method. It was proposed by Hestenes and Stiefel in 1952, and is
motivated by the following theorem.

Theorem 15.1 If A is symmetric positive definite, then solving Ax = b is equivalent


to minimizing the quadratic form
ϕ(x) = (1/2) x^T A x − x^T b.
Proof We will consider changes of ϕ along a certain ray x + αp with α ∈ R, and fixed
direction vector p 6= 0.
First we expand ϕ(x + αp), using the fact that A is symmetric positive definite.
ϕ(x + αp) = (1/2)(x + αp)^T A (x + αp) − (x + αp)^T b
          = (1/2) x^T A x + (1/2) x^T A (αp) + (1/2)(αp)^T A x + (1/2)(αp)^T A (αp) − x^T b − α p^T b
          = (1/2) x^T A x − x^T b + α p^T A x − α p^T b + (1/2) α^2 p^T A p      (using A^T = A)
          = ϕ(x) + α p^T (A x − b) + (1/2) α^2 p^T A p,    where p^T A p > 0.

Thus, we see that ϕ (as a quadratic function in α with positive leading coefficient) will
have to have a minimum along the ray x + αp.
We now decide what the value of α at this minimum is. A necessary condition (and
also sufficient since the coefficient of α2 is positive) is
(d/dα) ϕ(x + αp) = 0.

To this end we compute
(d/dα) ϕ(x + αp) = p^T (A x − b) + α p^T A p,

which has its root at

α̂ = p^T (b − A x) / (p^T A p).

The corresponding minimum value is

ϕ(x + α̂ p) = ϕ(x) − ( p^T (b − A x) )^2 / (2 p^T A p),

where the subtracted term is ≥ 0.

The last equation shows that

ϕ(x + α̂p) < ϕ(x)    if and only if    p^T (b − A x) ≠ 0,

i.e., p is not orthogonal to the residual r = b − Ax.
To see the equivalence with the solution of the linear system Ax = b we need to
consider two possibilities:
1. x is such that Ax = b. Then ϕ(x + α̂p) = ϕ(x) and ϕ(x) is the minimum value.

2. x is such that Ax ≠ b. Then ϕ(x + α̂p) < ϕ(x), i.e., there exists a direction p
such that p^T (b − Ax) ≠ 0 and ϕ(x) is not the minimum.

The preceding proof actually suggests a rough iterative algorithm:

Take x0 = 0, r 0 = b, p0 = r 0

for n = 1, 2, 3, . . .

Compute a step length

α_n = ( p_{n−1}^T r_{n−1} ) / ( p_{n−1}^T A p_{n−1} )

Update the approximate solution

xn = xn−1 + αn pn−1

Update the residual


r n = r n−1 − αn Apn−1
Find a new search direction pn

end

Note that at this point we have not specified how to pick the search directions pn .
This will be the crucial ingredient in the algorithm.
The formula above for the residual update follows from

r n = b − Axn = b − A(xn−1 + αn pn−1 )


= b − Axn−1 − αn Apn−1
= r n−1 − αn Apn−1 .

15.1 The Steepest Descent Algorithm


An obvious choice for the selection of the search direction is

pn = −∇ϕ(xn )

since we know from calculus that the direction of largest decrease of ϕ is in the direction
opposite its gradient. Moreover, since ϕ(x) = (1/2) x^T A x − x^T b we have

∇ϕ(x_n) = A x_n − b = −r_n.

This leads to

Algorithm (Steepest Descent)

Take x0 = 0, r 0 = b, p0 = r 0

for n = 1, 2, 3, . . .

Compute a step length

α_n = ( p_{n−1}^T r_{n−1} ) / ( p_{n−1}^T A p_{n−1} )

Update the approximate solution

xn = xn−1 + αn pn−1

Update the residual


r n = r n−1 − αn Apn−1

Set the new search direction


pn = r n

end

Note that for this choice of search direction the step length α can also be written
as
α_n = ( r_{n−1}^T r_{n−1} ) / ( p_{n−1}^T A p_{n−1} ).

15.2 The Conjugate Gradient Algorithm


It turns out that the “obvious” search directions are not ideal (since they are applied in
an iterative fashion). Convergence of the steepest descent algorithm is usually rather
slow. It is better to employ so-called conjugate search directions. The idea is to
somehow remove from the gradient at each step those components parallel to previously
used search directions. The resulting algorithm is
Algorithm (Conjugate Gradient)

Take x0 = 0, r 0 = b, p0 = r 0

for n = 1, 2, 3, . . .

Compute a step length

α_n = ( r_{n−1}^T r_{n−1} ) / ( p_{n−1}^T A p_{n−1} )

Update the approximate solution

xn = xn−1 + αn pn−1

Update the residual


r n = r n−1 − αn Apn−1

Compute a gradient correction factor

β_n = ( r_n^T r_n ) / ( r_{n−1}^T r_{n−1} )

Set the new search direction

pn = r n + βn pn−1

end
For both the steepest descent and the conjugate gradient algorithm the main com-
putational cost is hidden in the one matrix-vector multiplication that is required per
iteration. As with the Arnoldi and GMRES methods, this operation is treated as a
“black box” and can be accomplished in O(m) to O(m2 ) operations depending on the
structure of A. In many practical cases the entire (preconditioned) CG algorithm will
require only O(m) operations. This is very fast.
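For completeness, here is a hedged Matlab sketch of the conjugate gradient algorithm exactly as listed above, again with the matrix-vector product passed in as a function handle:

    % Hedged CG sketch for symmetric positive definite A (given through Afun).
    function x = cg_sketch(Afun, b, maxit, tol)
      x = zeros(size(b)); r = b; p = r;
      rho = r' * r;
      for n = 1:maxit
          Ap    = Afun(p);
          alpha = rho / (p' * Ap);        % step length
          x     = x + alpha * p;          % update approximate solution
          r     = r - alpha * Ap;         % update residual
          rhonew = r' * r;
          if sqrt(rhonew) <= tol * norm(b)
              return
          end
          beta = rhonew / rho;            % gradient correction factor
          p    = r + beta * p;            % new (A-conjugate) search direction
          rho  = rhonew;
      end
    end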
As mentioned at the beginning of this section, one can also establish a connection
to Krylov subspace methods.

Theorem 15.2 Let A be symmetric positive definite. As long as the conjugate gradient
method has not yet converged (i.e., as long as r n−1 6= 0) we have

span{x1 , x2 , . . . , xn } = span{p0 , p1 , . . . , pn−1 }


= span{r 0 , r 1 , . . . , r n−1 }
= span{b, Ab, . . . , An−1 b} = Kn .

Moreover, the residuals are orthogonal in the usual sense, i.e.,

r_n^T r_j = 0,    j < n,

and the search directions are A-orthogonal (or A-conjugate), i.e.,

p_n^T A p_j = 0,    j < n.

Proof An inductive proof of this theorem can be found in the book [Trefethen/Bau].

15.3 Convergence of the CG Algorithm


Recall that the GMRES algorithm minimizes the 2-norm of the residual, kr n k2 → min.
We will now show that the CG algorithm satisfies a different optimality criterion. It
minimizes the A-norm of the error, i.e., if e_n = x* − x_n is the error between the exact
solution x* = A^{−1} b and the n-th approximation x_n, then CG minimizes

‖e_n‖_A = sqrt( e_n^T A e_n ).

Theorem 15.3 Let A be symmetric positive definite. If the conjugate gradient algo-
rithm has not yet converged (i.e., r n−1 6= 0) then xn is the unique vector in Kn such
that ken kA is minimized.
Moreover, ken kA ≤ ken−1 kA and (if we are using exact arithmetic) en = 0 for
some n ≤ m.

Proof We will prove the first part only. From the previous theorem we know that the
approximate solution xn lies in the Krylov subspace Kn . In order to show that xn is
the unique minimizer of kekA we consider an arbitrary vector

x = xn − ∆x ∈ Kn

and show that in order to minimize kekA we necessarily have ∆x = 0.


If x∗ is the exact solution of Ax = b, then

e = x∗ − x = x∗ − xn + ∆x = en + ∆x.

Therefore,

‖e‖_A^2 = ‖e_n + ∆x‖_A^2
        = (e_n + ∆x)^T A (e_n + ∆x)
        = e_n^T A e_n + 2 e_n^T A (∆x) + (∆x)^T A (∆x)

since A is symmetric.
Next we realize that

e_n^T A = (x* − x_n)^T A = (A^{−1} b − x_n)^T A = b^T − x_n^T A = r_n^T,




and observe that


r Tn ∆x = 0
since ∆x ∈ Kn = span{r 0 , r 1 , . . . , r n−1 } and r Tn r j = 0 for j < n by the previous
theorem.
This leaves us with

‖e‖_A^2 = e_n^T A e_n + 2 e_n^T A (∆x) + (∆x)^T A (∆x)
        = ‖e_n‖_A^2 + (∆x)^T A (∆x),

since e_n^T A (∆x) = r_n^T (∆x) = 0.

Note that the quadratic form (∆x)T A (∆x) is certainly non-negative since A is positive
definite. Moreover, it is zero only if ∆x = 0.
Thus, the A-norm of the error is minimized if ∆x = 0, i.e., for the CG approximate
solution xn .

We can come to the same conclusion with the following argument:

‖e_n‖_A^2 = e_n^T A e_n = (x* − x_n)^T A (x* − x_n)
          = (x*)^T A x* − 2 x_n^T A x* + x_n^T A x_n      (using A x* = b)
          = (x*)^T b + x_n^T A x_n − 2 x_n^T b
          = (x*)^T b + 2 ϕ(x_n).

Here ϕ(xn ) is the same quadratic form used earlier. Since (x∗ )T b is a constant we see
that minimizing the A-norm of the error is equivalent to minimizing the quadratic form
ϕ(xn ).
For the convergence rate of the CG algorithm one can show that
‖e_n‖_A ≤ ‖e_0‖_A ( (√κ − 1) / (√κ + 1) )^n,

where κ = κ_2(A) is the 2-norm condition number of A. Since

(√κ − 1) / (√κ + 1) = 1 − 2 / (√κ + 1)

we see that convergence will be very slow if κ is large. This shows that preconditioning
efforts for the CG algorithm are aimed at reducing the condition number of A.
For a moderate size κ it turns out that one can expect convergence of the CG
algorithm in O(√κ) iterations. In fact, in practice the CG algorithm often converges
faster than predicted by this upper bound.

Remark It is possible to interpret the conjugate gradient method as an analogue of


Lanczos iteration for linear systems. Since we claimed earlier that Lanczos iteration is
a special case of Arnoldi iteration for symmetric matrices, it turns out that the (n + 1)-
term recursion we derived earlier for Arnoldi iteration turns into a 3-term recursion.
One can indeed show that this 3-term recursion is hidden inside the CG algorithm.

Convergence of the CG algorithm is illustrated in the MATLAB code CGDemo.m.


The symmetric test matrix is constructed as follows. Initially it contains ones on the
main diagonal and random numbers uniformly distributed in [−1, 1] in the off-diagonal
positions. Then any off-diagonal entry with |aij | > τ is set to zero, where τ is a
parameter. For small values of τ the matrix is positive definite and very sparse, and
the CG algorithm converges rapidly. For larger values, such as τ = 0.2 the matrix is
no longer positive definite, and the CG algorithm does not converge. We also note
that for these test matrices, preconditioning does not improve convergence of the CG
algorithm.

16 Preconditioning
The general idea underlying any preconditioning procedure for iterative solvers is to
modify the (ill-conditioned) system

Ax = b

in such a way that we obtain an equivalent system Âx̂ = b̂ for which the iterative
method converges faster.
A standard approach is to use a nonsingular matrix M , and rewrite the system as

M −1 Ax = M −1 b.

The preconditioner M needs to be chosen such that the matrix  = M −1 A is better


conditioned for the conjugate gradient method, or has better clustered eigenvalues for
the GMRES method.

16.1 Preconditioned Conjugate Gradients


We mentioned earlier that the number of iterations required for the conjugate gradient
algorithm to converge is proportional to √κ(A). Thus, for poorly conditioned matrices,
convergence will be very slow. Thus, clearly we will want to choose M such that
κ(Â) < κ(A). This should result in faster convergence.
How do we find Â, x̂, and b̂? In order to ensure symmetry and positive definiteness
of  we let
M −1 = LLT (44)
with a nonsingular m × m matrix L. Then we can rewrite

Ax = b ⇐⇒ M^{−1} A x = M^{−1} b
       ⇐⇒ L^T A x = L^T b
       ⇐⇒ (L^T A L)(L^{−1} x) = L^T b,

i.e., Â x̂ = b̂ with Â = L^T A L, x̂ = L^{−1} x, and b̂ = L^T b.

The symmetric positive definite matrix M is called splitting matrix or preconditioner,


and it can easily be verified that  is symmetric positive definite, also.
One could now formally write down the standard CG algorithm with the new “hat-
ted” quantities. However, the algorithm is more efficient if the preconditioning is
incorporated directly into the iteration. To see what this means we need to examine
every single line in the CG algorithm.
We start by looking at the new residual:

r̂ n = b̂ − Âx̂n = LT b − (LT AL)(L−1 xn ) = LT b − LT Axn = LT r n ,

and also define the following abbreviations

p̂n = L−1 pn ,
r̃ n = M −1 r n .

Now we can consider how this transforms the CG algorithm (for the hatted quan-
tities). The initialization becomes x̂0 = L−1 x0 = 0 and

r̂ 0 = b̂ ⇐⇒ LT r 0 = LT b ⇐⇒ r 0 = b.

The initial search direction turns out to be

p̂0 = r̂ 0 ⇐⇒ L−1 p0 = LT r 0
⇐⇒ p0 = M −1 r 0 = r̃ 0 ,

where we have used the definition of the preconditioner M . The step length α̂ trans-
forms as follows:
   
α̂_n = ( r̂_{n−1}^T r̂_{n−1} ) / ( p̂_{n−1}^T Â p̂_{n−1} )
    = ( (L^T r_{n−1})^T (L^T r_{n−1}) ) / ( (L^{−1} p_{n−1})^T (L^T A L)(L^{−1} p_{n−1}) )
    = ( r_{n−1}^T L L^T r_{n−1} ) / ( p_{n−1}^T L^{−T} L^T A L L^{−1} p_{n−1} )
    = ( r_{n−1}^T M^{−1} r_{n−1} ) / ( p_{n−1}^T A p_{n−1} )
    = ( r_{n−1}^T r̃_{n−1} ) / ( p_{n−1}^T A p_{n−1} ).

The approximate solution is updated according to

x̂n = x̂n−1 + α̂n p̂n−1 ⇐⇒ L−1 xn = L−1 xn−1 + α̂n L−1 pn−1
⇐⇒ xn = xn−1 + α̂n pn−1 .

The residuals are updated as

r̂ n = r̂ n−1 − α̂n Âp̂n−1 ⇐⇒ LT r n = LT r n−1 − α̂n (LT AL)L−1 pn−1


⇐⇒ r n = r n−1 − α̂n Apn−1 .

The gradient correction factor β transforms as follows:


   
β̂_n = ( r̂_n^T r̂_n ) / ( r̂_{n−1}^T r̂_{n−1} )
    = ( (L^T r_n)^T (L^T r_n) ) / ( (L^T r_{n−1})^T (L^T r_{n−1}) )
    = ( r_n^T L L^T r_n ) / ( r_{n−1}^T L L^T r_{n−1} )
    = ( r_n^T M^{−1} r_n ) / ( r_{n−1}^T M^{−1} r_{n−1} )
    = ( r_n^T r̃_n ) / ( r_{n−1}^T r̃_{n−1} ).

Finally, for the new search direction we have

p̂n = r̂ n + β̂n p̂n−1 ⇐⇒ L−1 pn = LT r n + β̂n L−1 pn−1


⇐⇒ pn = M −1 r n + β̂n pn−1
⇐⇒ pn = r̃ n + β̂n pn−1 ,

where we have multiplied by L and used the definition of M in the penultimate step.
The resulting algorithm is given by

Algorithm (Preconditioned Conjugate Gradient)
Take x0 = 0, r 0 = b

Solve M r̃ 0 = r 0 for r̃ 0

Let p0 = r̃ 0

for n = 1, 2, 3, . . .

Compute a step length

α_n = ( r_{n−1}^T r̃_{n−1} ) / ( p_{n−1}^T A p_{n−1} )

(note that r̃_{n−1} = M^{−1} r_{n−1})


Update the approximate solution

xn = xn−1 + αn pn−1

Update the residual


r n = r n−1 − αn Apn−1
Solve M r̃ n = r n for r̃ n
Compute a gradient correction factor

β_n = ( r_n^T r̃_n ) / ( r_{n−1}^T r̃_{n−1} )

(note that r̃_{n−1} = M^{−1} r_{n−1} and r̃_n = M^{−1} r_n)


Set the new search direction

pn = r̃ n + βn pn−1

(where r̃ n = M −1 r n )

end

Remark This algorithm requires the additional work that is needed to solve the linear
system M r̃ n = r n once per iteration. Therefore we will want to choose M so that this
can be done easily and efficiently.

The two extreme cases M = I and M = A are of no interest. M = I gives us


the ordinary CG algorithm, whereas M = A (with L = A−1/2 ) leads to the trivial
preconditioned system Â x̂ = b̂ ⇐⇒ x̂ = b̂ since (using the symmetry of A)

Â = L^T A L = A^{−T/2} A A^{−1/2} = A^{−T/2} A^{T/2} A^{1/2} A^{−1/2} = I.

This may seem useful at first, but to get the solution x we need

x = L x̂ = A^{−1/2} x̂ = A^{−1/2} b̂ = A^{−1/2} L^T b = A^{−1/2} A^{−T/2} b = A^{−1} b,

which is just as complicated as the original problem.


Therefore, M should be chosen somewhere “in between”. Moreover, we want

1. M should be symmetric and positive definite.

2. M should be such that M r̃ n = r n can be solved efficiently.

3. M should approximate A−1 in the sense that kI − M −1 Ak < 1.

If we use the decomposition A = L + D + LT of the symmetric positive definite


matrix A then some possible choices for M are given by

M = D: Jacobi preconditioning,

M = L + D: Gauss-Seidel preconditioning,

M = (1/ω)(D + ωL): SOR preconditioning.

Another popular preconditioner is M = HH T , where H is “close” to L. This method is


referred to as incomplete Cholesky factorization (see the book by Golub and van Loan
for more details).
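As a hedged illustration of how cheap the extra preconditioner solve can be, the following sketch implements the preconditioned CG algorithm above with the Jacobi preconditioner M = D, for which M r̃ = r is just a diagonal scaling:

    % Hedged sketch: Jacobi-preconditioned conjugate gradient method.
    function x = pcg_jacobi(A, b, maxit, tol)
      d  = diag(A);                      % M = D, so M \ r is r ./ d
      x  = zeros(size(b)); r = b;
      rt = r ./ d;                       % r-tilde = M^{-1} r
      p  = rt;
      rho = r' * rt;
      for n = 1:maxit
          Ap    = A * p;
          alpha = rho / (p' * Ap);
          x     = x + alpha * p;
          r     = r - alpha * Ap;
          if norm(r) <= tol * norm(b)
              return
          end
          rt     = r ./ d;               % solve M r-tilde = r
          rhonew = r' * rt;
          beta   = rhonew / rho;
          p      = rt + beta * p;
          rho    = rhonew;
      end
    end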

Remark The Matlab script PCGDemo.m illustrates the convergence behavior of the
preconditioned conjugate gradient algorithm. The matrix A here is a 1000 × 1000 sym-
metric positive definite matrix with all zeros except a_ii = 0.5 + √i on the diagonal,
aij = 1 on the sub- and superdiagonal, and aij = 1 on the 100th sub- and superdiag-
onals, i.e., for |i − j| = 100. The right-hand side vector is b = [1, . . . , 1]T . We observe
that the basic CG algorithm converges very slowly, whereas the Jacobi-preconditioned
method converges much faster.

17 Solution of Nonlinear Systems
We now discuss the solution of systems of nonlinear equations. An important ingredient
will be the multivariate Taylor theorem.

Theorem 17.1 Let D = {[x1 , x2 , . . . , xm ]T ∈ Rm : ai ≤ xi ≤ bi , i = 1, . . . , m}


for some a1 , a2 , . . . , am , b1 , b2 , . . . , bm ∈ R. If f ∈ C n+1 (D), then for x + h ∈ D
(h = [h1 , h2 , . . . , hm ]T )
f(x + h) = Σ_{k=0}^{n} (1/k!) (h^T ∇)^k f(x) + R_n(h),    (45)

where

R_n(h) = (1/(n + 1)!) (h^T ∇)^{n+1} f(x + θh)

with 0 < θ < 1 and ∇ = [∂/∂x_1, ∂/∂x_2, . . . , ∂/∂x_m]^T.

Example We are particularly interested in the linearization of a given function, i.e.,


n = 1. In this case we have
f(x + h) = f(x) + (h^T ∇) f(x) + (1/2)(h^T ∇)^2 f(x + θh),

where the last term is R_1(h).

And, for m = 2, this becomes


 
f(x_1 + h_1, x_2 + h_2) = f(x_1, x_2) + ( h_1 ∂/∂x_1 + h_2 ∂/∂x_2 ) f(x_1, x_2)
                          + (1/2) ( h_1 ∂/∂x_1 + h_2 ∂/∂x_2 )^2 f(x_1 + θh_1, x_2 + θh_2)
                        = f(x_1, x_2) + h_1 ∂f/∂x_1 (x_1, x_2) + h_2 ∂f/∂x_2 (x_1, x_2)
                          + (1/2) ( h_1^2 ∂^2/∂x_1^2 + 2 h_1 h_2 ∂^2/∂x_1∂x_2 + h_2^2 ∂^2/∂x_2^2 ) f(x_1 + θh_1, x_2 + θh_2).
Therefore, the linearization of f is given by
 
f(x_1 + h_1, x_2 + h_2) ≈ f(x_1, x_2) + ( h_1 ∂/∂x_1 + h_2 ∂/∂x_2 ) f(x_1, x_2),

or

f(x + h) ≈ f(x) + (h^T ∇) f(x).

We now want to solve the following (square) system of nonlinear equations:

f1 (x1 , x2 , . . . , xm ) = 0,
f2 (x1 , x2 , . . . , xm ) = 0,
..
. (46)
fm (x1 , x2 , . . . , xm ) = 0.

To derive Newton’s method for this problem we assume z = [z1 , z2 , . . . , zm ]T is a
solution (or root) of (46), i.e., z satisfies

fi (z) = 0, i = 1, . . . , m.

Moreover, we consider x to be an approximate root, i.e.,

x + h = z,

with a small correction h = [h1 , h2 , . . . , hm ]T . Then, by linearizing fi , i = 1, . . . , m,

fi (z) = fi (x + h) ≈ fi (x) + (hT ∇)fi (x).

Since fi (z) = 0 we get

−f_i(x) ≈ (h^T ∇) f_i(x) = ( h_1 ∂/∂x_1 + h_2 ∂/∂x_2 + . . . + h_m ∂/∂x_m ) f_i(x).

Therefore, we have a linearized version of system (46) as


 
−f_1(x_1, . . . , x_m) = ( h_1 ∂/∂x_1 + . . . + h_m ∂/∂x_m ) f_1(x_1, . . . , x_m),
−f_2(x_1, . . . , x_m) = ( h_1 ∂/∂x_1 + . . . + h_m ∂/∂x_m ) f_2(x_1, . . . , x_m),
    .
    .                                                                        (47)
    .
−f_m(x_1, . . . , x_m) = ( h_1 ∂/∂x_1 + . . . + h_m ∂/∂x_m ) f_m(x_1, . . . , x_m).

Recall that h = [h1 , . . . , hm ]T is the unknown Newton update, and note that (47)
is a linear system for h of the form

J(x)h = −f (x),

where f = [f1 , . . . , fm ]T and


        [ ∂f_1/∂x_1   ∂f_1/∂x_2   . . .   ∂f_1/∂x_m ]
        [ ∂f_2/∂x_1   ∂f_2/∂x_2   . . .   ∂f_2/∂x_m ]
    J = [     .            .        .          .    ]
        [ ∂f_m/∂x_1   ∂f_m/∂x_2   . . .   ∂f_m/∂x_m ]

is called the Jacobian of f .


The algorithm for Newton’s method for square nonlinear systems is now
Algorithm

Input f , J, x(0)

for k = 0, 1, 2, . . . do

Solve J(x(k) )h(k) = −f (x(k) ) for h(k)
Update x(k+1) = x(k) + h(k)

end

Output x(k+1)

Remark If we symbolically write f 0 instead of J, then the Newton iteration becomes


x^(k+1) = x^(k) − [ f'(x^(k)) ]^{−1} f(x^(k)),

where [f'(x^(k))]^{−1} is a matrix,

which looks just like the Newton iteration formula for the single equation/single variable
case.

Example Solve

x2 + y 2 = 4
xy = 1,

which corresponds to finding the intersection points of a circle and a hyperbola in the
plane. Here

    f(x, y) = [ f_1(x, y) ]   [ x^2 + y^2 − 4 ]
              [ f_2(x, y) ] = [    xy − 1     ]

and

    J(x, y) = [ ∂f_1/∂x   ∂f_1/∂y ]          [ 2x   2y ]
              [ ∂f_2/∂x   ∂f_2/∂y ] (x, y) = [  y    x ].
This example is illustrated in the Matlab script run newtonmv.m.
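A hedged sketch of what such a Newton iteration looks like for this example (the actual run_newtonmv.m is not reproduced here):

    % Hedged Newton sketch for x^2 + y^2 = 4, xy = 1.
    f = @(v) [v(1)^2 + v(2)^2 - 4; v(1)*v(2) - 1];
    J = @(v) [2*v(1), 2*v(2); v(2), v(1)];
    v = [2; 0.5];                        % starting guess near one intersection point
    for k = 1:10
        h = -J(v) \ f(v);                % solve the linear Newton system
        v = v + h;                       % Newton update
        if norm(h) < 1e-14, break, end
    end
    v                                    % approx. [1.9319; 0.5176]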

Remark 1. Newton’s method requires the user to input the m×m Jacobian matrix
(which depends on the specific nonlinear system to be solved). This is rather
cumbersome.

2. In each iteration an m × m (dense) linear system has to be solved. This makes


Newton’s method very expensive and slow.

3. For “good” starting values Newton’s method converges quadratically to simple


zeros, i.e., solutions for which J −1 (z) exists.

4. An improvement which removes the strong dependence on the choice of starting


values is the so-called line search

x(k+1) = x(k) + λk h(k) ,

where λk ∈ R is chosen so that f T f (x(k) ) is strictly monotone decreasing with k.


In this case x(k) converges to the minimum of f T f . This method stems from an
interpretation of the solution of nonlinear systems as the minimizer of a nonlinear
function (more later).

17.1 Basic Fixed-point Iteration
We illustrate the use of a general fixed-point algorithm with several examples in the
Maple worksheet 577 fixedpointsMV.mws. However, as we well know, this may not
always be possible, and if it is, convergence may be very slow. Sometimes we can use
a Gauss-Seidel like strategy to accelerate convergence.
A multivariate version of the Contractive Mapping Theorem is

Theorem 17.2 Let C be a closed subset of Rm and F a contractive mapping of C into


itself. Then F has a unique fixed point z. Moreover, z = lim x(k) , where x(k+1) =
k→∞
F (x(k) ) and x(0) is any starting point in C.

Here a contractive map is defined as

Definition 17.3 A function (mapping) F is called contractive if there exists a λ < 1


such that
‖F(x) − F(y)‖ ≤ λ ‖x − y‖    (48)
for all x, y in the domain of F .

The property (48) is also referred to as Lipschitz continuity of F .
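A hedged sketch of the basic fixed-point iteration x^(k+1) = F(x^(k)) for a made-up contractive map on R^2; the examples in the Maple worksheet are different, this only shows the iteration loop itself:

    % Hedged sketch of multivariate fixed-point iteration; F is an illustrative
    % contractive map (its Jacobian has norm at most 0.25 < 1 everywhere).
    F = @(x) [0.25*cos(x(2)); 0.25*sin(x(1)) + 0.5];
    x = [0; 0];
    for k = 1:100
        xnew = F(x);
        if norm(xnew - x) < 1e-12
            x = xnew; break
        end
        x = xnew;
    end
    x                                    % approximate fixed point z = F(z)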

17.2 Quasi-Newton Methods


In the multivariate (systems) setting the extra burden associated with using the deriva-
tive (Jacobian) of f becomes much more obvious than in the single equation/single
variable case. In the algorithm listed above we need to perform m2 evaluations of
derivatives (for the Jacobian) and m evaluations of f (for the right-hand side) in each
iteration. Moreover, solving the linear system J(x)h = −f (x) usually requires O(m3 )
floating point operations per iteration.
In order to reduce the computational complexity we need to apply a strategy anal-
ogous to the secant method that is commonly used for single nonlinear equations.
This will eliminate evaluations of derivatives, and reduce the number of floating point
operations required to compute the Newton update to O(m2 ) operations per iteration.
The idea is to provide an initial approximation B (0) to [J(x(0) )]−1 , and then update


this approximation from one iteration to the next, i.e.,

B (k+1) = B (k) + U (k) ,

where U (k) is an appropriately chosen update.


This replaces the solution of the linear system J(x(k) )h = −f (x(k) ) by the matrix-vector
product h(k) = −B (k) f (x(k) ).
One way of updating was suggested by Broyden and is based on the Sherman-
Morrison formula for matrix inversion.

Lemma 17.4 Let A be a nonsingular m×m matrix and x, y ∈ Rm . Then (A+xy T )−1
exists provided that y T A−1 x 6= −1. Moreover,

(A + xy T )−1 = A−1 − ( A−1 x y T A−1 ) / ( 1 + y T A−1 x ).                (49)

The Sherman-Morrison formula (49) can be used to compute the inverse of a matrix
A(k+1) obtained by a rank-1 update xy T from A(k) , i.e.,
[A(k+1) ]−1 = [A(k) ]−1 − ( [A(k) ]−1 x y T [A(k) ]−1 ) / ( 1 + y T [A(k) ]−1 x ).     (50)
Thus, if A(k+1) is a rank-1 modification of A(k) then we need not recompute the inverse
of A(k+1) , but instead can obtain it by updating the inverse of A(k) (available from
previous computations) via (50).
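A quick numerical sanity check of (49) can be done in a few lines of Python/NumPy; this is only a sketch with randomly chosen data, not part of the original notes:

import numpy as np

rng = np.random.default_rng(0)
m = 5
A = rng.standard_normal((m, m)) + m*np.eye(m)   # a (very likely) nonsingular test matrix
x = rng.standard_normal(m)
y = rng.standard_normal(m)

Ainv = np.linalg.inv(A)
# right-hand side of the Sherman-Morrison formula (49)
sm = Ainv - np.outer(Ainv @ x, y @ Ainv) / (1.0 + y @ Ainv @ x)
# direct inversion of the rank-1 modified matrix
direct = np.linalg.inv(A + np.outer(x, y))
print(np.max(np.abs(sm - direct)))              # difference at roundoff level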
The algorithm for Broyden’s method is
Algorithm
Input f , x(0) , B (0)
for k = 0, 1, 2, . . . do
h(k) = −B (k) f (x(k) )
x(k+1) = x(k) + h(k)
z (k) = f (x(k+1) ) − f (x(k) )
B (k+1) = B (k) − ( (B (k) z (k) − h(k) ) [h(k) ]T B (k) ) / ( [h(k) ]T B (k) z (k) )
end
Output x(k+1)
Remark 1. Only m scalar function evaluations are required per iteration along
with O(m2 ) floating point operations for matrix-vector products.
2. One can usually use B (0) = I to start the iteration.
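The following Python/NumPy fragment is a sketch of this algorithm (it is not the Matlab script run broyden.m referenced below; the names, tolerance and the choice B (0) = I are illustrative):

import numpy as np

def broyden(f, x0, B0=None, tol=1e-10, maxit=100):
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size) if B0 is None else np.asarray(B0, dtype=float)  # approximates J(x)^(-1)
    fx = f(x)
    for k in range(maxit):
        h = -B @ fx                      # h^(k) = -B^(k) f(x^(k))
        x_new = x + h
        f_new = f(x_new)
        z = f_new - fx                   # z^(k) = f(x^(k+1)) - f(x^(k))
        Bz = B @ z
        # rank-1 update of the approximate inverse Jacobian
        B = B - np.outer(Bz - h, h @ B) / (h @ Bz)
        x, fx = x_new, f_new
        if np.linalg.norm(fx) < tol:
            break
    return x

# same circle/hyperbola example as for Newton's method above
fC = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[0]*v[1] - 1.0])
print(broyden(fC, [2.0, 0.5]))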
In order to see how the formula for B (k+1) in the algorithm is related to (50) we
define
[J(x(k) )]−1 ≈ [A(k) ]−1 = B (k) ,

x = ( z (k) − [B (k) ]−1 h(k) ) / ‖h(k) ‖₂² ,

y = h(k) .
Then (50) becomes
B (k+1) = B (k) − [ B (k) ( z (k) − [B (k) ]−1 h(k) ) / ‖h(k) ‖₂² ] [h(k) ]T B (k)
                  / [ 1 + [h(k) ]T B (k) ( z (k) − [B (k) ]−1 h(k) ) / ‖h(k) ‖₂² ]

        = B (k) − ( B (k) z (k) − h(k) ) [h(k) ]T B (k)
                  / [ ‖h(k) ‖₂² + [h(k) ]T B (k) z (k) − [h(k) ]T B (k) [B (k) ]−1 h(k) ]

        = B (k) − ( B (k) z (k) − h(k) ) [h(k) ]T B (k)
                  / [ ‖h(k) ‖₂² + [h(k) ]T B (k) z (k) − ‖h(k) ‖₂² ],


which is the same as the formula for B (k+1) used in the algorithm.
To see why Broyden’s method can be interpreted as a variant of the secant method,
we multiply the formula used to update B (k) in the algorithm by z (k) , i.e.,
B (k+1) z (k) = B (k) z (k) − [ ( B (k) z (k) − h(k) ) [h(k) ]T B (k) / ( [h(k) ]T B (k) z (k) ) ] z (k)

⇐⇒  B (k+1) z (k) = h(k)

⇐⇒  B (k+1) ( f (x(k+1) ) − f (x(k) ) ) = x(k+1) − x(k) ,




which is reminiscent of the secant equation


f ′ (x(k+1) ) = ( f (x(k+1) ) − f (x(k) ) ) / ( x(k+1) − x(k) )
since B (k+1) is an approximation to the inverse of the Jacobian.
We illustrate this algorithm in the Matlab script file run broyden.m with the same
example as used earlier for the multivariate Newton method.
Remark 1. Broyden’s method can also be improved by a line search, i.e., x(k+1) =
x(k) + λk h(k) (see below).

2. Broyden’s method converges only superlinearly.


3. Both Newton’s and Broyden’s methods require good starting values. These can
be provided by the steepest descent or conjugate gradient algorithms.

17.3 Using the Steepest Descent and Conjugate Gradients with Nonlinear Systems
We now discuss the connection between solving systems of nonlinear equations and
quadratic minimization problems. The idea (already used above to choose the stepsize
λk in the line search) is to minimize the 2-norm of the residual of (46), i.e., to find
x = [x1 , . . . , xm ]T such that

g(x1 , . . . , xm ) = (1/2) Σ_{i=1}^{m} fi (x1 , . . . , xm )² = (1/2) f T f (x)

is minimized.
For the steepest descent method we use
x(k+1) = x(k) + λk (−∇g(x(k) )).
The stepsize λk is computed such that
 
g(x(k+1) ) = g( x(k) − λk ∇g(x(k) ) ) =: γ(λk )

is minimized. This is an easier problem to solve since it involves only one variable, λk .
Note that since g(x) = (1/2) f T f (x) we have ∇g(x) = [J(x)]T f (x). This shows that
this approach also requires knowledge of the Jacobian. A general line search algorithm
is

Algorithm
Input f , J, x(0)
for k = 0, 1, 2, . . . do
h(k) = −∇g(x(k) ) = −[J(x(k) )]T f (x(k) )


Find λk as a minimizer of γ(λk ) = g(x(k) + λk h(k) ) = (1/2) f T f (x(k) + λk h(k) )


Update x(k+1) = x(k) + λk h(k)
end
Output x(k+1)
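A Python/NumPy sketch of this procedure is given below; it is not one of the Matlab scripts from these notes, and the crude step-halving rule merely stands in for an actual minimizer of γ(λk ):

import numpy as np

def steepest_descent(f, J, x0, tol=1e-8, maxit=500):
    x = np.asarray(x0, dtype=float)
    g = lambda v: 0.5 * f(v) @ f(v)           # g(x) = (1/2) f^T f(x)
    for k in range(maxit):
        h = -J(x).T @ f(x)                    # h^(k) = -grad g(x^(k)) = -J(x^(k))^T f(x^(k))
        if np.linalg.norm(h) < tol:
            break
        lam = 1.0
        # halve the step until g decreases; a stand-in for minimizing gamma(lambda)
        while g(x + lam*h) >= g(x) and lam > 1e-14:
            lam *= 0.5
        x = x + lam*h
    return x

# circle/hyperbola example from above
fC = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[0]*v[1] - 1.0])
JC = lambda v: np.array([[2.0*v[0], 2.0*v[1]], [v[1], v[0]]])
print(steepest_descent(fC, JC, [2.0, 0.5]))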

Remark 1. The steepest descent method converges linearly.


2. One can replace the steepest descent method by conjugate gradient iteration.
This is illustrated in the Matlab script run mincg.m.
3. If only minimization of the quadratic function g is our goal, then we can try
solving ∇g(x) = 0 using Newton’s method (which will give us a critical point for
the problem). This is explained in the next section.

It is not easy to come up with a good line search strategy. However, one prac-
tical way to determine a reasonable value for λk is to use a quadratic interpolating
polynomial. The idea is to work with three values λk(1) , λk(2) , and λk(3) and construct a
quadratic polynomial that interpolates the univariate function γ at these points. If we
guess these three values reasonably close to the optimum, then the minimizer of the
parabola on the interval of interest (which is easy to find) tells us how to pick λk . For
example, one can take λk(1) = 0, then find a λk(3) such that γ(λk(3) ) < γ(λk(1) ), and then
take λk(2) as the midpoint of the previous two values, i.e., λk(2) = λk(3) /2.
A similar (but interactive) strategy is implemented in the Matlab example ShowGN VL.m
explained in the next section.
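The following Python/NumPy fragment sketches a non-interactive version of this idea; γ is assumed to be the univariate function γ(λ) = g(x(k) + λh(k) ) formed elsewhere, and the clamping of the result to [0, λk(3) ] is an extra safeguard, not something prescribed above:

import numpy as np

def quadratic_line_search(gamma, lam3):
    # interpolate gamma at 0, lam3/2, lam3 by a parabola and return its minimizer
    lams = np.array([0.0, 0.5*lam3, lam3])
    vals = np.array([gamma(l) for l in lams])
    a, b, c = np.polyfit(lams, vals, 2)       # gamma(l) ~ a*l^2 + b*l + c
    if a <= 0:                                # parabola not convex: fall back to best sample
        return lams[np.argmin(vals)]
    lam = -b / (2.0*a)                        # vertex of the parabola
    return min(max(lam, 0.0), lam3)           # keep the step inside [0, lam3]

# toy usage: the minimizer of (l - 0.7)^2 + 0.1 is recovered exactly
print(quadratic_line_search(lambda l: (l - 0.7)**2 + 0.1, lam3=2.0))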

17.4 Nonlinear Least Squares


To combine the techniques used for systems of linear equations with those for nonlinear
systems we consider the problem of finding values of x = [x1 , . . . , xn ]T that yield a least
squares solution of the (rectangular) nonlinear system

f1 (x1 , x2 , . . . , xn ) = 0,
f2 (x1 , x2 , . . . , xn ) = 0,
..
. (51)
fm (x1 , x2 , . . . , xn ) = 0.

In other words, we want to minimize


g(x) = (1/2) Σ_{i=1}^{m} fi (x)².                                           (52)

A problem like this often arises when the components of x are certain control parame-
ters in an objective function that needs to be optimized. Below we will see an example
where we need to fit the parameters to a nonlinear system describing the orbit of a
planet.
As mentioned at the end of the previous section, we can use Newton’s method to
solve this problem since a necessary condition for finding a minimal residual (52) is that
the gradient of g be zero. Thus, we are attempting to find a solution of ∇g(x) = 0.
Since Newton’s method is given by
x(k+1) = x(k) − [J∇g (x(k) )]−1 ∇g(x(k) ),

we see that we actually need not only the gradient of g (which involves the Jacobian of
f ) but also its second derivatives (which lead to the Hessian of g).
For the particular minimization problem (52) this would result in
x(k+1) = x(k) − [ J(x(k) )T J(x(k) ) + Σ_{i=1}^{m} fi (x(k) ) ∇2 fi (x(k) ) ]−1 J(x(k) )T f (x(k) )
                 (the matrix in square brackets is the Hessian of g),

where now J is the regular (but rectangular) Jacobian with respect to the functions
f1 , . . . , fm .
Since we decided earlier that it would be a good idea to avoid computation of
the Jacobian, clearly, computation of the Hessian should be avoided if possible. This
motivates the Gauss-Newton method. If we drop the summation term in the expression
for the Hessian above, then all we need to know is the Jacobian. This step is justified
if we are reasonably close to a solution since then fi (x(k) ) will be close to zero.
Thus, the Gauss-Newton algorithm is given by
Algorithm

Input f , J, x(0)

for k = 0, 1, 2, . . . do

Solve J(x(k) )T J(x(k) )h(k) = −J(x(k) )T f (x(k) ) for h(k)


Update x(k+1) = x(k) + h(k)

end

Output x(k+1)

Note that the update h(k) here is in fact nothing but the solution to the normal
equations for the (linear) least squares problem

min_{h ∈ Rn} ‖J(x)h + f (x)‖2 .

Therefore, any of our earlier algorithms (modified Gram-Schmidt, QR, SVD) can be
used to perform this step.
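As a sketch, the Gauss-Newton step can be coded in Python/NumPy as follows; here the linear least squares subproblem is solved via NumPy's SVD-based lstsq rather than by forming the normal equations explicitly, and all names and tolerances are illustrative:

import numpy as np

def gauss_newton(f, J, x0, tol=1e-10, maxit=100):
    x = np.asarray(x0, dtype=float)
    for k in range(maxit):
        # least squares solution of J(x^(k)) h = -f(x^(k))
        h, *_ = np.linalg.lstsq(J(x), -f(x), rcond=None)
        x = x + h                             # x^(k+1) = x^(k) + h^(k)
        if np.linalg.norm(h) < tol:
            break
    return x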

Example Let’s consider the nonlinear system
x(t) = (P − A)/2 + ((A + P )/2) cos(t)
y(t) = √(AP) sin(t)

that describes the elliptical orbit of a planet around the sun in a two-dimensional
universe. Here A and P denote the maximum and minimum orbit-to-sun distances,
respectively.
In order to estimate the parameters of the orbit of a specific planet we use another
relationship between the parameters and the orbit radius r(θ), i.e., the distance from
the planet to the sun when the angle to the positive x-axis is θ. This relationship is
given by
r(θ) = 2AP / ( P (1 − cos(θ)) + A(1 + cos(θ)) ).

Let’s assume we have measurements (θi , ri ) so that we can define functions

fi (A, P ) = ri − 2AP / ( P (1 − cos(θi )) + A(1 + cos(θi )) ),    i = 1, . . . , m.

Then the best choices of A and P will be given by minimizing the function
g(A, P ) = (1/2) Σ_{i=1}^{m} fi (A, P )².

This corresponds exactly to our general discussion above.


The Matlab code ShowGN VL.m written by Charles Van Loan for his book “Intro-
duction to Scientific Computing” provides an implementation of this example. The
“measurements” are created by starting with two exact values of A and P , A0 = 5
and P0 = 1, and then sampling from an orbit based on these parameters that has
been perturbed by noise. A line search is also included for which the step length λk is
entered interactively.
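A non-interactive sketch of the same experiment (not Van Loan's ShowGN VL.m; the synthetic data, noise level and starting values below are chosen here only for illustration) can be built on the gauss_newton function sketched above:

import numpy as np

rng = np.random.default_rng(1)
A0, P0 = 5.0, 1.0                               # "true" parameters, as in the example
theta = np.linspace(0.1, 2*np.pi, 40)           # measurement angles theta_i
r_true = 2*A0*P0 / (P0*(1 - np.cos(theta)) + A0*(1 + np.cos(theta)))
r_meas = r_true + 0.01*rng.standard_normal(theta.size)   # noisy "measurements" r_i

def f(v):                                       # residuals f_i(A, P)
    A, P = v
    return r_meas - 2*A*P / (P*(1 - np.cos(theta)) + A*(1 + np.cos(theta)))

def J(v):                                       # m x 2 Jacobian: columns d f_i/dA, d f_i/dP
    A, P = v
    d = P*(1 - np.cos(theta)) + A*(1 + np.cos(theta))
    dA = -(2*P*d - 2*A*P*(1 + np.cos(theta))) / d**2
    dP = -(2*A*d - 2*A*P*(1 - np.cos(theta))) / d**2
    return np.column_stack([dA, dP])

print(gauss_newton(f, J, [4.0, 2.0]))           # should end up near (5, 1)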
