
LINEAR SYSTEMS.

CSE/MATH 555, SPRING 2010


Contents
1. An introduction to the Conjugate Gradient Method
1.1. Preliminaries and notation
1.2. Line search methods
2. Properties of line search method with conjugate directions
3. The Conjugate Gradient algorithm
4. Convergence rate of the Conjugate Gradient method
4.1. Krylov subspaces and error reduction
4.2. Chebyshev polynomials and a convergence rate estimate
References
1. An introduction to the Conjugate Gradient Method
This section is on the derivation and convergence of a popular algorithm for minimization of quadratic functionals (or solving linear systems), known as the method of Conjugate Gradients (CG). To the best of the author's knowledge, the CG algorithm was first introduced in 1952 by M. R. Hestenes and E. Stiefel in [2]. The derivation of the CG algorithm given here follows lecture notes by D. N. Arnold [1].
1.1. Preliminaries and notation. The conjugate gradient method is a method for minimizing the following quadratic functional:

(1.1)    $x^* = \arg\min_{x \in \mathbb{R}^n} \varphi(x), \qquad \varphi(x) = \frac{1}{2} x^T A x - b^T x,$

where $A \in \mathbb{R}^{n \times n}$ is a symmetric positive definite (SPD) matrix and $b \in \mathbb{R}^n$ is a given vector. Clearly, we have

(1.2)    $\nabla \varphi(x) = A x - b, \qquad \nabla^2 \varphi = A$  (the Hessian is independent of $x$).

Since the Hessian $A$ is SPD, from the well-known conditions for a minimum of a function we may conclude that there is a unique minimizer $x^*$ of $\varphi(\cdot)$. Moreover, (1.2) implies that $x^*$ is also the solution to the system of linear equations

$A x = b.$

This is why the CG method is oftentimes thought of as a method for the solution of linear systems.

In what follows we will need the following preliminary settings:

(a) Since $A$ is SPD, it defines an inner product $x^T A y$ between two vectors $x$ and $y$ in $\mathbb{R}^n$, which we will refer to as the $A$-inner product. The corresponding vector norm is defined by $\|x\|_A^2 = x^T A x$.

(b) From Taylor's theorem for $g(t) = \varphi(y + t z)$ we obtain the following identity for all $t \in \mathbb{R}$, and all $y \in \mathbb{R}^n$ and $z \in \mathbb{R}^n$:

(1.3)    $\varphi(y + t z) = \varphi(y) + t [\nabla \varphi(y)]^T z + \frac{t^2}{2} z^T A z.$
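The identities (1.2) and (1.3) are easy to verify numerically. The following is a minimal sketch (mine, not part of the notes) that checks the gradient formula by finite differences and the exact Taylor identity for a randomly generated SPD matrix; all names are ad hoc.

```python
import numpy as np

# Check (1.2) and (1.3) for a random SPD matrix.
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # SPD by construction
b = rng.standard_normal(n)

phi = lambda x: 0.5 * x @ A @ x - b @ x   # the quadratic functional (1.1)
grad = lambda x: A @ x - b                # its gradient (1.2)

x, y, z = rng.standard_normal((3, n))
t = 0.3

# Finite-difference check of the gradient (1.2).
h = 1e-6
fd = np.array([(phi(x + h * e) - phi(x - h * e)) / (2 * h) for e in np.eye(n)])
print(np.allclose(fd, grad(x), atol=1e-5))                     # True

# The Taylor identity (1.3) is exact for a quadratic functional.
lhs = phi(y + t * z)
rhs = phi(y) + t * grad(y) @ z + 0.5 * t**2 * (z @ A @ z)
print(np.isclose(lhs, rhs))                                    # True
```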
1.2. Line search methods. The CG method is nothing else but a line search method with a special choice of directions. Given a current approximation $x_j$ to the minimizer $x^*$ and a direction vector $p_j$, a line search method determines the next approximation $x_{j+1}$ via the following two steps:

(a) Find $\alpha_j = \arg\min_{\alpha} \varphi(x_j + \alpha p_j)$,
(b) Set $x_{j+1} = x_j + \alpha_j p_j$.

In what follows, we will assume that $x_0$ is a given vector (initial guess). Then, applying $k$ steps of the line search method given above results in the $(k+1)$ iterates $\{x_j\}_{j=0}^{k}$.

From the relations (1.3) and (1.2) we immediately find that

(1.4)    $\alpha_j = -\frac{p_j^T r_j}{p_j^T A p_j}, \qquad \text{where } r_j = \nabla\varphi(x_j) = A x_j - b.$

We note here that $r_j$ is oftentimes referred to as the residual vector.
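As a sanity check on (1.4), one can compare the exact step length against a brute-force scan of $\varphi$ along the line. The sketch below (my own illustration, with arbitrary data) does exactly that for a single step in a random direction.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                 # SPD test matrix
b = rng.standard_normal(n)
phi = lambda x: 0.5 * x @ A @ x - b @ x

x_j = rng.standard_normal(n)                # current iterate
p_j = rng.standard_normal(n)                # an arbitrary search direction
r_j = A @ x_j - b                           # gradient/residual at x_j

alpha_j = -(p_j @ r_j) / (p_j @ A @ p_j)    # exact minimizer along the line, formula (1.4)
x_next = x_j + alpha_j * p_j

# phi restricted to the line is a parabola; alpha_j beats any sampled alpha
alphas = alpha_j + np.linspace(-1.0, 1.0, 201)
print(phi(x_next) <= min(phi(x_j + a * p_j) for a in alphas))   # True
```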
We will need the following definition and notation.

Definition 1.1. We say that the set of directions $\{p_j\}_{j=0}^{k-1}$ is a conjugate set of directions, iff $p_j^T A p_i = 0$ for all $i, j = 0, \ldots, (k-1)$ with $i \neq j$.

By symmetry this definition can also be stated as: the set of directions $\{p_j\}_{j=0}^{k-1}$ is a conjugate set of directions, iff $p_j^T A p_i = 0$ for all $i$ and $j$ satisfying $0 \le i < j \le (k-1)$.
We introduce now the following vector spaces and affine spaces (for $k = 1, 2, \ldots$):

(1.5)    $W_k := \operatorname{span}\{p_0, \ldots, p_{k-1}\}, \qquad U_k := x_0 + W_k = \{z \in \mathbb{R}^n \;:\; z = x_0 + w_k, \; w_k \in W_k\}.$

For convenience we set $W_0 := \{0\}$ and $U_0 := \{x_0\}$. We now prove a simple result which we use later in the proof of Theorem 2.1.
Lemma 1.2. Assume that $p_i^T A p_j = 0$ for all $0 \le j < i$, where $i$ is a fixed integer, and that $\{x_j\}_{j=0}^{i}$ are obtained via the line search algorithm. Then the following identity holds:

(1.6)    $p_i^T r_i = p_i^T [\nabla\varphi(y)], \quad \text{for all } y \in U_i.$

Proof. We first note that since $\{x_j\}_{j=0}^{i}$ are obtained via the line search algorithm we have that $x_i \in U_i$. If we take $y \in U_i$, from the definition of $U_i$ it follows that $x_i - y \in W_i$ and hence $p_i^T A(x_i - y) = 0$ (because $p_i^T A w = 0$ for all $w \in W_i = \operatorname{span}\{p_0, \ldots, p_{i-1}\}$). The proof of the identity (1.6) then is as follows:

$p_i^T (r_i - [\nabla\varphi(y)]) = p_i^T (A x_i - b - A y + b) = p_i^T A (x_i - y) = 0. \qquad \square$


2. Properties of line search method with conjugate directions
Clearly, on every step the line search algorithm minimizes $\varphi(x)$ in a fixed direction only. However, if the directions are conjugate (see Definition 1.1), then a much stronger result can be proved, as Theorem 2.1 below states: a choice of conjugate directions in the line search method results in obtaining a minimizer $x_k$ over the whole affine space $U_k$. In some sense, one may say that the next theorem is the basis for constructing the conjugate gradient method.
Theorem 2.1. If the directions in the line search algorithm are conjugate, and $\{x_j\}_{j=0}^{k}$ are the iterates obtained after $k$ steps of the line search algorithm, then

(2.1)    $x_j = \arg\min_{x \in U_j} \varphi(x), \quad \text{for all } 1 \le j \le k.$

Proof. The proof is by induction. For $k = 1$, the result follows from the definition of $x_1$ as a minimizer on $U_1$. Assume that for $k = i$,

$x_j = \arg\min_{y \in U_j} \varphi(y), \quad \text{for all } 1 \le j \le i.$

To prove the statement of the theorem then, we need to show that if $x_{i+1} = x_i + \alpha_i p_i$, then $x_{i+1} = \arg\min_{x \in U_{i+1}} \varphi(x)$.

By the definition of $U_{i+1}$, any $x \in U_{i+1}$ can be written as $x = y + \alpha p_i$, where $\alpha \in \mathbb{R}$ and $y \in U_i$. Applying (1.3) and then Lemma 1.2 leads to

$\varphi(x) = \varphi(y + \alpha p_i) = \varphi(y) + \alpha\, p_i^T [\nabla\varphi(y)] + \frac{\alpha^2}{2}\, p_i^T A p_i = \varphi(y) + \Big( \alpha\, p_i^T [\nabla\varphi(x_i)] + \frac{\alpha^2}{2}\, p_i^T A p_i \Big).$

Note that we have arrived at a decoupled functional, since the first term does not depend on $\alpha$ and the second term does not depend on $y$. Thus,

$\min_{x \in U_{i+1}} \varphi(x) = \min_{y \in U_i} \varphi(y) + \min_{\alpha \in \mathbb{R}} \Big( \alpha\, p_i^T r_i + \frac{\alpha^2}{2}\, p_i^T A p_i \Big).$

The right side is minimized when $y = x_i$ and $\alpha = \alpha_i = -\dfrac{p_i^T r_i}{p_i^T A p_i}$, and hence the left side is minimized exactly for $x_{i+1} = x_i + \alpha_i p_i$, which concludes the proof. $\square$
3. The Conjugate Gradient algorithm
The conjugate gradient method is an algorithm that exploits the result in Theorem 2.1 and constructs conjugate directions. Here is an outline of what we plan to do in this section:

- We first give a general recurrence relation that generates a set of conjugate directions (Lemma 3.1).
- We then show that this recurrence relation can be reduced to a much simpler expression (see Lemma 3.2(iv)).
- As a result, we will get a line search method which uses a conjugate set of directions and is known as the CG method (see Algorithm 3.3).
We begin with a result which could be obtained by Gram-Schmidt orthogonalization, with respect to the $A$-inner product, of the residual vectors $\{r_j\}_{j=0}^{k}$.

Lemma 3.1. Let $p_0 = -r_0$ and, for $k = 1, 2, \ldots$, let

(3.1)    $p_k = -r_k + \sum_{j=0}^{k-1} \frac{p_j^T A r_k}{p_j^T A p_j}\, p_j.$

Then $p_j^T A p_m = 0$ for all $0 \le m < j \le k$.
Proof. We show by induction that the relation (3.1) gives conjugate directions. For $k = 1$ one directly checks that $p_1^T A p_0 = 0$. Assume that for $k = i$ the vectors $\{p_j\}_{j=0}^{i}$ are pairwise conjugate. We then need to show that $p_{i+1}^T A p_m = 0$ for all $m \le i$. Let $m \le i$. Then we have (only the term $j = m$ in the sum survives, by the induction hypothesis)

$p_{i+1}^T A p_m = -r_{i+1}^T A p_m + \sum_{j=0}^{i} \frac{p_j^T A r_{i+1}}{p_j^T A p_j}\, p_j^T A p_m = -r_{i+1}^T A p_m + \frac{p_m^T A r_{i+1}}{p_m^T A p_m}\, p_m^T A p_m = 0. \qquad \square$
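The recurrence (3.1) is easy to test numerically. The sketch below (my own illustration, not from the notes) runs the line search of Section 1.2 with directions generated by the full recurrence (3.1) and checks that the resulting directions are pairwise $A$-orthogonal; the data and names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # SPD test matrix
b = rng.standard_normal(n)

x = np.zeros(n)                    # x_0
r = A @ x - b                      # r_0 = grad phi(x_0)
P = [-r]                           # p_0 = -r_0

for k in range(1, n):
    p = P[-1]
    alpha = -(p @ r) / (p @ A @ p)          # exact line search, formula (1.4)
    x = x + alpha * p
    r = A @ x - b
    # full Gram-Schmidt-type recurrence (3.1) against all previous directions
    p_new = -r + sum(((q @ A @ r) / (q @ A @ q)) * q for q in P)
    P.append(p_new)

# pairwise A-orthogonality, as claimed in Lemma 3.1
G = np.array([[pi @ A @ pj for pj in P] for pi in P])
print(np.allclose(G - np.diag(np.diag(G)), 0, atol=1e-8))       # True
```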
The next lemma, among other things, shows that the sum in (3.1) contains only one term.

Lemma 3.2. Let $\{p_j\}_{j=0}^{k}$ be the directions obtained via (3.1). Then

(i) $W_k = \operatorname{span}\{r_0, \ldots, r_{k-1}\}$;
(ii) $r_m^T r_j = 0$, for all $0 \le j < m \le k$;
(iii) $p_k^T r_j = -r_k^T r_k$, for all $0 \le j \le k$;
(iv) the direction vector $p_k$ satisfies

$p_k = -r_k + \beta_{k-1} p_{k-1}, \qquad \text{where } \beta_{k-1} = \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}}.$
Proof. The first item follows directly from (3.1) and a simple induction argument, since $p_0 = -r_0$.

To prove (ii), we first use (i) to conclude that for $0 \le j < m \le k$ and any $t \in \mathbb{R}$ we have that

$r_j \in W_{j+1} \subseteq W_m$ and hence $(x_m + t r_j) \in U_m$.

Further, from Theorem 2.1, since $x_m$ is the unique minimizer of $\varphi(\cdot)$ over $U_m$, it follows that $t = 0$ is the unique minimizer of $g(t) = \varphi(x_m + t r_j)$. Hence we have

$0 = \frac{d\varphi(x_m + t r_j)}{dt}\Big|_{t=0} = [\nabla\varphi(x_m)]^T r_j = r_m^T r_j,$

and this proves (ii).

To show that (iii) holds, we first show the identity in (iii) for $j = k$. Indeed, from (i) and (ii) it follows that $r_k$ is orthogonal to each $p_l$ for $l < k$. Hence, if we take the inner product of (3.1) with $r_k$, the second term on the right side of (3.1) vanishes, and this is exactly the identity in (iii) for $j = k$. If $j < k$, then we have that $(x_k - x_j) \in W_k$, and hence $p_k^T A(x_k - x_j) = 0$. Therefore,

$p_k^T (r_k - r_j) = p_k^T A (x_k - x_j) = 0,$

that is, $p_k^T r_j = p_k^T r_k = -r_k^T r_k$.

To show (iv) we write $p_k \in W_{k+1}$ as a linear combination of $\{r_j\}_{j=0}^{k}$ (which form an orthogonal basis), and then apply (iii). This leads to

$p_k = \sum_{j=0}^{k} \frac{p_k^T r_j}{r_j^T r_j}\, r_j = -\sum_{j=0}^{k} \frac{r_k^T r_k}{r_j^T r_j}\, r_j = -r_k - \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}} \sum_{j=0}^{k-1} \frac{r_{k-1}^T r_{k-1}}{r_j^T r_j}\, r_j = -r_k + \beta_{k-1} \sum_{j=0}^{k-1} \frac{p_{k-1}^T r_j}{r_j^T r_j}\, r_j = -r_k + \beta_{k-1} p_{k-1}. \qquad \square$
We can now write the conjugate gradient algorithm, using the much shorter recurrence relation for the direction vectors $p_k$ which is provided by Lemma 3.2(iv). We denote below $\|y\|^2 = y^T y$ and $\|y\|_A^2 = y^T A y$ for a vector $y \in \mathbb{R}^n$.

Algorithm 3.3 (Conjugate Gradient). Let $x_0$ be a given initial guess.
Set $r_0 = A x_0 - b$, $p_0 = -r_0$, $k = 0$.
While $r_k \neq 0$ do
    $\alpha_k = \dfrac{\|r_k\|^2}{\|p_k\|_A^2}$    [from (1.4) and Lemma 3.2(iii)]
    $x_{k+1} = x_k + \alpha_k p_k$
    $r_{k+1} = r_k + \alpha_k A p_k$    [because $A x_{k+1} - b = A x_k - b + \alpha_k A p_k$]
    $\beta_k = \dfrac{\|r_{k+1}\|^2}{\|r_k\|^2}$
    $p_{k+1} = -r_{k+1} + \beta_k p_k$    [from Lemma 3.2(iv)]
    Set $k = k + 1$
endWhile
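Algorithm 3.3 translates almost line by line into code. The following is a minimal numpy transcription (mine, not from the notes), keeping the notes' convention r = A x - b and replacing the exact test $r_k \neq 0$ by a tolerance; the function name and test problem are arbitrary.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, maxiter=None):
    """CG for SPD A, following Algorithm 3.3 with r = A x - b."""
    x = x0.copy()
    r = A @ x - b                      # r_0
    p = -r                             # p_0 = -r_0
    rr = r @ r
    if maxiter is None:
        maxiter = len(b)
    for _ in range(maxiter):
        if np.sqrt(rr) <= tol:         # practical replacement for "while r_k != 0"
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)          # alpha_k = ||r_k||^2 / ||p_k||_A^2
        x = x + alpha * p
        r = r + alpha * Ap             # r_{k+1} = r_k + alpha_k A p_k
        rr_new = r @ r
        beta = rr_new / rr             # beta_k = ||r_{k+1}||^2 / ||r_k||^2
        p = -r + beta * p              # p_{k+1} = -r_{k+1} + beta_k p_k
        rr = rr_new
    return x

# quick check on a random SPD system
rng = np.random.default_rng(2)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
x = conjugate_gradient(A, b, np.zeros(n))
print(np.linalg.norm(A @ x - b))       # small (close to the tolerance)
```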
4. Convergence rate of the Conjugate Gradient method
In this section we will present an estimate for the convergence rate of the CG algorithm. The convergence rate estimate given here is rather general and does not take into account knowledge of the distribution of the eigenvalues of $A$. There are estimates that are more refined in this regard. We refer to Luenberger [3] for further reading.
4.1. Krylov subspaces and error reduction. To analyze the error we first prove the following result.

Lemma 4.1. The following relation holds:

(4.1)    $W_l = \operatorname{span}\{r_0, A r_0, \ldots, A^{l-1} r_0\}.$

Proof. The case $l = 1$ being clear, we assume that the relation holds for $l = i$, and we would like to show that the same relation holds for $l = (i+1)$. From Lemma 3.2(i), this would be equivalent to showing that $r_i \in \operatorname{span}\{r_0, \ldots, A^i r_0\}$. By the induction assumption, we can write

$W_i \ni r_{i-1} = R_{i-1}(A) r_0 \quad \text{and} \quad W_i \ni p_{i-1} = P_{i-1}(A) r_0,$

where $R_{i-1}(\cdot)$ and $P_{i-1}(\cdot)$ are polynomials of degree less than or equal to $(i-1)$. We then have

$r_i = r_{i-1} + \alpha_{i-1} A p_{i-1} = R_{i-1}(A) r_0 + \alpha_{i-1} A P_{i-1}(A) r_0 \in \operatorname{span}\{r_0, \ldots, A^i r_0\}.$

This shows the inclusion $W_{i+1} \subseteq \operatorname{span}\{r_0, \ldots, A^i r_0\}$; since $\dim W_{i+1} = i+1$ (the directions $p_0, \ldots, p_i$ are nonzero and pairwise $A$-orthogonal, hence linearly independent, as long as the iteration has not terminated), the two spaces coincide, which concludes the proof. $\square$
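The identification of $W_l$ with a Krylov space can also be observed numerically. The sketch below (my own illustration) runs a few CG steps, builds both the direction basis and the Krylov basis, and checks that each spans the other up to roundoff; the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# run l steps of Algorithm 3.3 from x_0 = 0 and collect the directions p_0, ..., p_{l-1}
l = 4
r0 = -b                                # r_0 = A x_0 - b with x_0 = 0
r, p, P = r0, -r0, []
for _ in range(l):
    P.append(p)
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    r_new = r + alpha * Ap             # only the residual recursion is needed here
    p = -r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new

Pmat = np.column_stack(P)
K = np.column_stack([np.linalg.matrix_power(A, j) @ r0 for j in range(l)])

# mutual containment of the two spans, cf. (4.1)
Qk, _ = np.linalg.qr(K)
Qp, _ = np.linalg.qr(Pmat)
print(np.linalg.norm(Pmat - Qk @ (Qk.T @ Pmat)))   # ~0: span{p_j} lies in the Krylov space
print(np.linalg.norm(K - Qp @ (Qp.T @ K)))         # ~0: Krylov space lies in span{p_j}
```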
We now present a general error estimate relating $\|x^* - x_l\|_A$ and $\|x^* - x_0\|_A$.
Lemma 4.2. The following estimate holds:

$\|x^* - x_l\|_A = \min_{P \in \mathcal{P}_l,\; P(0) = 1} \|P(A)(x^* - x_0)\|_A,$

where $\mathcal{P}_l$ denotes the set of polynomials of degree at most $l$.

Proof. Since $r_l$ is orthogonal to $W_l$, we have

$(x^* - x_l)^T A y = -r_l^T y = 0, \quad \text{for all } y \in W_l.$

Denoting for a moment $w_l = (x_l - x_0) \in W_l$ and $e_0 = x^* - x_0$, the relation above implies that

$0 = (x^* - x_l)^T A y = (e_0 - w_l)^T A y \quad \text{for all } y \in W_l.$

Therefore, $w_l = (x_l - x_0)$ is the $A$-orthogonal projection of $e_0 = (x^* - x_0)$ onto $W_l$. Thus,

$\|e_0 - w_l\|_A = \min_{w \in W_l} \|e_0 - w\|_A.$

But from Lemma 4.1 we know that any $w \in W_l$ is of the form $w = Q_{l-1}(A) r_0$ for a polynomial $Q_{l-1} \in \mathcal{P}_{l-1}$. Also, $A e_0 = b - A x_0 = -r_0$, and hence $e_0 - w = e_0 + Q_{l-1}(A) A e_0 = (I + Q_{l-1}(A) A) e_0$. As $Q_{l-1}$ runs over $\mathcal{P}_{l-1}$, the polynomial $P_l(\lambda) = 1 + \lambda Q_{l-1}(\lambda)$ runs over all polynomials in $\mathcal{P}_l$ with $P_l(0) = 1$, and hence

(4.2)    $\|x^* - x_l\|_A = \|e_0 - w_l\|_A = \min_{P_l \in \mathcal{P}_l,\; P_l(0) = 1} \|P_l(A) e_0\|_A.$

This completes the proof. $\square$
To obtain a qualitative estimate on the right-hand side of (4.2), we observe that for any polynomial $P_l(\cdot)$ we have

$\|x^* - x_l\|_A = \min_{P_l \in \mathcal{P}_l,\; P_l(0)=1} \|P_l(A) e_0\|_A \le \min_{P_l \in \mathcal{P}_l,\; P_l(0)=1} \rho(P_l(A))\, \|e_0\|_A,$

where $\rho(P_l(A))$ is the spectral radius of $P_l(A)$. Since both $A$ and $P_l(A)$ have the same eigenvectors, we may conclude that

$\|x^* - x_l\|_A \le \min_{P_l \in \mathcal{P}_l,\; P_l(0)=1} \max_{1 \le j \le n} |P_l(\lambda_j)|\, \|e_0\|_A = c_l(\lambda_1, \ldots, \lambda_n)\, \|e_0\|_A,$

where $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$ are the eigenvalues of $A$.

In the next section, we will derive a somewhat pessimistic upper bound on $c_l$ by first estimating

$c_l(\lambda_1, \ldots, \lambda_n) \le \min_{P_l \in \mathcal{P}_l,\; P_l(0)=1} \|P_l\|_{\infty, [\lambda_1, \lambda_n]},$

and then, with the help of a construction based on the Chebyshev polynomials, we will find the value of the right side of the above inequality in terms of $\lambda_1$ and $\lambda_n$.
4.2. Chebyshev polynomials and a convergence rate estimate. The Chebyshev polynomials of the first kind on $[-1, 1]$ are defined as

$T_l(\xi) = \cos(l \arccos \xi), \quad l = 0, 1, \ldots$

Using a simple trigonometric identity (with $\theta = \arccos \xi$) shows that

$T_{l+1}(\xi) + T_{l-1}(\xi) = \cos((l+1)\theta) + \cos((l-1)\theta) = 2(\cos\theta)\cos(l\theta) = 2\xi\, T_l(\xi).$

Hence,

(4.3)    $T_{l+1}(\xi) = 2\xi\, T_l(\xi) - T_{l-1}(\xi).$

This proves that the $T_l$ are indeed polynomials, because $T_0(\xi) = 1$ and $T_1(\xi) = \xi$. The recurrence (4.3) defines $T_l(\xi)$ for all $\xi \in \mathbb{R}$. Another form of the Chebyshev polynomials, which will be useful in the convergence rate estimate given below in Theorem 4.4, is derived as follows. From the relation (4.3), for fixed $\xi$ we observe that

$T_l(\xi) = c_1 [\mu_1(\xi)]^l + c_2 [\mu_2(\xi)]^l, \quad l = 0, 1, \ldots,$

where $\mu_1(\xi)$ and $\mu_2(\xi)$ are the roots of the characteristic equation

$\mu^2 - 2\xi\mu + 1 = 0.$

The constants $c_1$ and $c_2$ are easily computed from the initial conditions $T_0(\xi) = 1$ and $T_1(\xi) = \xi$, and hence

(4.4)    $T_l(\xi) = \frac{1}{2}\Big[ \big(\xi + \sqrt{\xi^2 - 1}\big)^l + \big(\xi - \sqrt{\xi^2 - 1}\big)^l \Big].$
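The equivalence of the recurrence (4.3) and the closed form (4.4), as well as the extreme values in (4.5) below, can be checked directly; a small sketch of mine, with arbitrary parameters:

```python
import numpy as np

def cheb_recurrence(l, xi):
    """T_l(xi) via the three-term recurrence (4.3)."""
    t_prev, t = 1.0, xi                      # T_0, T_1
    if l == 0:
        return t_prev
    for _ in range(l - 1):
        t_prev, t = t, 2 * xi * t - t_prev   # T_{k+1} = 2 xi T_k - T_{k-1}
    return t

def cheb_closed_form(l, xi):
    """T_l(xi) via (4.4); real-valued for |xi| >= 1."""
    s = np.sqrt(xi**2 - 1)
    return 0.5 * ((xi + s)**l + (xi - s)**l)

l, xi = 7, 3.2
print(cheb_recurrence(l, xi), cheb_closed_form(l, xi))   # the two forms agree

# extreme points: T_l(cos(m*pi/l)) = (-1)^m, cf. (4.5)
m = np.arange(l + 1)
vals = [cheb_recurrence(l, c) for c in np.cos(m * np.pi / l)]
print(np.allclose(vals, (-1.0) ** m))                    # True
```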
We further have that $|T_l(\xi)| \le 1$ for all $\xi \in [-1, 1]$ and that

(4.5)    if $\xi_m = \cos\big(\tfrac{m\pi}{l}\big)$, then $T_l(\xi_m) = (-1)^m$, $m = 0, 1, \ldots, l$.
Define now

$S_l(\lambda) := \Big[ T_l\Big( \frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1} \Big) \Big]^{-1} T_l\Big( \frac{\lambda_n + \lambda_1 - 2\lambda}{\lambda_n - \lambda_1} \Big).$

Note that the argument of the second factor maps $[\lambda_1, \lambda_n]$ onto $[-1, 1]$, so that

$\|S_l\|_{\infty, [\lambda_1, \lambda_n]} = \Big[ T_l\Big( \frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1} \Big) \Big]^{-1},$

and that $S_l(0) = 1$, so $S_l$ is admissible in the minimization below. The next lemma shows that $S_l$ is a polynomial with minimum max-norm, that is,

$\|S_l\|_{\infty, [\lambda_1, \lambda_n]} = \min_{P_l \in \mathcal{P}_l,\; P_l(0)=1} \|P_l\|_{\infty, [\lambda_1, \lambda_n]}.$
Lemma 4.3. For any $P_l \in \mathcal{P}_l$ with $P_l(0) = 1$,

$\|S_l\|_{\infty, [\lambda_1, \lambda_n]} \le \|P_l\|_{\infty, [\lambda_1, \lambda_n]}.$

Proof. Denote

$t^* := \Big[ T_l\Big( \frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1} \Big) \Big]^{-1}.$

Let $\lambda_m = \frac{\lambda_1 - \lambda_n}{2}\, \xi_m + \frac{\lambda_n + \lambda_1}{2}$, where the $\xi_m$ are defined in (4.5). Note that

$S_l(\lambda_m) = (-1)^m t^*, \quad m = 0, \ldots, l,$

and also that $\lambda_m \in [\lambda_1, \lambda_n]$. Assume that there exists $P_l \in \mathcal{P}_l$ with $P_l(0) = 1$, such that

$|P_l(\lambda)| < |t^*|, \quad \text{for all } \lambda \in [\lambda_1, \lambda_n].$

This in particular implies that

$-|t^*| < P_l(\lambda_m) < |t^*|, \quad m = 0, 1, \ldots, l.$

If $\operatorname{sign}(t^*) > 0$ then

$P_l(\lambda_m) - S_l(\lambda_m) < 0$ for $m$ even, and $P_l(\lambda_m) - S_l(\lambda_m) > 0$ for $m$ odd.

On the other hand, the case $\operatorname{sign}(t^*) < 0$ just switches "odd" with "even" and "even" with "odd" in the above inequalities. Hence, regardless of the sign of $t^*$, the difference $P_l - S_l$ has a zero in every interval $(\lambda_m, \lambda_{m+1})$. There are $l$ such intervals. But we also have that $P_l(0) - S_l(0) = 0$. Since $P_l - S_l$ is a polynomial of degree at most $l$ with at least $(l+1)$ distinct zeros, it follows that $P_l \equiv S_l$, which is a contradiction. $\square$
Clearly, from this lemma it follows that

(4.6)    $\|x^* - x_l\|_A \le \|S_l\|_{\infty, [\lambda_1, \lambda_n]}\, \|x^* - x_0\|_A.$

In the next Theorem 4.4 we obtain this estimate in terms of the condition number of $A$, by calculating $\|S_l\|_{\infty, [\lambda_1, \lambda_n]}$.
Theorem 4.4. The error after $l$ iterations of the CG algorithm can be bounded as follows:

(4.7)    $\|x^* - x_l\|_A \le \frac{2}{\Big( \frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1} \Big)^l + \Big( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \Big)^l}\, \|x^* - x_0\|_A \le 2 \Big( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \Big)^l \|x^* - x_0\|_A,$

where $\kappa = \kappa(A) = \lambda_n / \lambda_1$ is the condition number of $A$.
Proof. We aim to calculate $\|S_l\|_{\infty, [\lambda_1, \lambda_n]} = \Big[ T_l\Big( \frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1} \Big) \Big]^{-1}$. From (4.4), for

$\xi = \frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1} = \frac{\kappa + 1}{\kappa - 1}, \qquad \text{so that } \sqrt{\xi^2 - 1} = \frac{\sqrt{(\kappa+1)^2 - (\kappa-1)^2}}{\kappa - 1} = \frac{2\sqrt{\kappa}}{\kappa - 1},$

we obtain

$\xi - \sqrt{\xi^2 - 1} = \frac{\kappa + 1}{\kappa - 1} - \frac{2\sqrt{\kappa}}{\kappa - 1} = \frac{\kappa + 1 - 2\sqrt{\kappa}}{\kappa - 1} = \frac{(\sqrt{\kappa} - 1)^2}{(\sqrt{\kappa} - 1)(\sqrt{\kappa} + 1)} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1},$

and, similarly, $\xi + \sqrt{\xi^2 - 1} = \frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1}$. Thus,

$T_l\Big( \frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1} \Big) = \frac{1}{2}\Big[ \Big( \frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1} \Big)^l + \Big( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \Big)^l \Big].$

Finally,

$\|S_l\|_{\infty, [\lambda_1, \lambda_n]} = \Big[ T_l\Big( \frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1} \Big) \Big]^{-1} = \frac{2}{\Big( \frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1} \Big)^l + \Big( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \Big)^l} \le 2 \Big( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \Big)^l.$

The proof is completed by substituting the above expression in (4.6). $\square$
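The bound (4.7) is straightforward to compare against the actual $A$-norm error of the CG iterates. The sketch below (my own test, on a matrix with a prescribed spectrum so that $\kappa$ is known exactly) prints the error and the bound every few iterations; the actual error stays below the bound, typically by a wide margin.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
lam = np.linspace(1.0, 50.0, n)             # prescribed eigenvalues, kappa = 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(lam) @ Q.T                  # SPD with known spectrum
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
kappa = lam[-1] / lam[0]

def err_A(x):                               # ||x* - x||_A
    e = x_star - x
    return np.sqrt(e @ A @ e)

x = np.zeros(n)
r = A @ x - b
p = -r
e0 = err_A(x)
rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

for l in range(1, 31):                      # Algorithm 3.3
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r + alpha * Ap
    p = -r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    if l % 10 == 0:
        print(l, err_A(x), 2 * rho**l * e0)  # error vs. the bound 2*rho^l*||x*-x_0||_A
```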
References

[1] Douglas N. Arnold. A concise introduction to numerical analysis. Lecture notes, Penn State, MATH 597I (Numerical Analysis), Fall 2001.
[2] Magnus R. Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. J. Research Nat. Bur. Standards, 49:409–436 (1953), 1952.
[3] David G. Luenberger. Linear and nonlinear programming. Kluwer Academic Publishers, Boston, MA, second edition, 2003.
