
Numerical Optimization

Lecture 10: Newton's Method with Modified Hessian / Quasi-Newton Methods

Sangkyun Lee

The content is from Nocedal and Wright (2006). Topics marked with ** are
optional.

1 Newton's Method with Hessian Modification

The Newton direction $p_k^N$ is obtained by solving
\[
\nabla^2 f(x_k)\, p_k^N = -\nabla f(x_k).
\]
However, when $\nabla^2 f(x_k)$ is not positive definite, the direction $p_k^N$ may not be a descent direction. We describe Newton's method with a modified Hessian to overcome this issue:
Algorithm 1: Newton's Method with Linesearch and Hessian Modification

Input: $x_0$;
for $k = 0, 1, 2, \ldots$ do
    Choose $B_k = \nabla^2 f(x_k) + E_k$, where $E_k = 0$ if $\nabla^2 f(x_k)$ is sufficiently positive definite; otherwise $E_k$ is chosen to ensure that $B_k$ is sufficiently positive definite;
    Solve $B_k p_k = -\nabla f(x_k)$;
    Choose $\alpha_k$ from a Wolfe or Armijo backtracking linesearch;
    $x_{k+1} \leftarrow x_k + \alpha_k p_k$;
end
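The following is a minimal NumPy sketch of Algorithm 1, written here for illustration (it is not the reference implementation from Nocedal and Wright). It realizes the modification $E_k = \tau_k I$ by adding a multiple of the identity until a Cholesky factorization succeeds, which is one of the modification strategies discussed in Nocedal and Wright (2006), and it uses Armijo backtracking for the step size; the names `newton_modified`, `beta`, and `shrink` are choices made for this sketch.

```python
import numpy as np

def newton_modified(f, grad, hess, x0, tol=1e-8, max_iter=100,
                    beta=1e-3, c1=1e-4, shrink=0.5):
    """Sketch of Algorithm 1: Newton's method with Hessian modification
    (E_k = tau_k * I) and Armijo backtracking linesearch."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        H = hess(x)
        # Increase tau_k until B_k = H + tau_k * I admits a Cholesky
        # factorization, i.e. is numerically positive definite (E_k = tau_k I).
        tau_k = 0.0 if np.min(np.diag(H)) > 0 else beta - np.min(np.diag(H))
        while True:
            try:
                L = np.linalg.cholesky(H + tau_k * np.eye(n))
                break
            except np.linalg.LinAlgError:
                tau_k = max(2.0 * tau_k, beta)
        # Solve B_k p_k = -grad f(x_k) using the Cholesky factor: (L L^T) p = -g.
        p = np.linalg.solve(L.T, np.linalg.solve(L, -g))
        # Armijo backtracking: accept the first alpha with sufficient decrease.
        alpha = 1.0
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= shrink
        x = x + alpha * p
    return x
```

When $\nabla^2 f(x_k)$ is already sufficiently positive definite, the Cholesky factorization succeeds with $\tau_k = 0$ and the iteration takes a pure Newton step.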

1.1 Global Convergence

The global convergence of this algorithm is given in the following theorem:


Theorem 10.1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable on an open set $D \subset \mathbb{R}^n$, and assume that the starting point $x_0$ of the algorithm is such that the level set $\mathcal{L} = \{x \in D : f(x) \le f(x_0)\}$ is compact. If for some $C > 0$,
\[
\kappa(B_k) = \|B_k\|_2\, \|B_k^{-1}\|_2 \le C, \qquad k = 0, 1, 2, \ldots,
\]
then
\[
\lim_{k \to \infty} \nabla f(x_k) = 0.
\]

If $\nabla^2 f(x^*)$ is sufficiently positive definite, so that $E_k = 0$ is chosen for all sufficiently large $k$, then the algorithm reduces to the pure Newton method and exhibits a local quadratic rate of convergence.

2 Quasi-Newton Methods

The main disadvantage of Newton's method is that it requires computing the Hessian $\nabla^2 f(x_k) \in \mathbb{R}^{n \times n}$ explicitly, which is computationally challenging when $n$ is large.
In quasi-Newton methods, we use an approximation matrix $B_k$ of the true Hessian, and compute search directions by
\[
p_k = -B_k^{-1} \nabla f(x_k).
\]
This particular choice is the minimizer of the quadratic model function
\[
m_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k p,
\]
since setting $\nabla m_k(p) = \nabla f(x_k) + B_k p = 0$ gives exactly $p = -B_k^{-1} \nabla f(x_k)$.
We would like to build the next model function,
\[
m_{k+1}(p) = f(x_{k+1}) + \nabla f(x_{k+1})^T p + \frac{1}{2} p^T B_{k+1} p,
\]
so that it reflects the information we already have, namely $\nabla f(x_{k+1})$ and $\nabla f(x_k)$ (we assume that $x_{k+1} = x_k + \alpha_k p_k$ and $\alpha_k$ is from the Wolfe line search):
\[
\nabla m_{k+1}(0) = \nabla f(x_{k+1}),
\]
\[
\nabla m_{k+1}(-\alpha_k p_k) = \nabla f(x_{k+1}) - \alpha_k B_{k+1} p_k = \nabla f(x_k).
\]
Rearranging the second equation, we have
\[
B_{k+1} s_k = y_k \qquad \text{(Secant Equation)},
\]
where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
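As a quick sanity check (an illustration added here, not part of the lecture), for a quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$ the gradient is affine, so the true Hessian $A$ satisfies the secant equation exactly between any two points:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)            # symmetric positive definite "Hessian"
b = rng.standard_normal(n)
grad = lambda x: A @ x - b             # gradient of f(x) = 0.5 x^T A x - b^T x

x_k, x_k1 = rng.standard_normal(n), rng.standard_normal(n)   # two arbitrary iterates
s_k = x_k1 - x_k
y_k = grad(x_k1) - grad(x_k)

# B_{k+1} = A satisfies the secant equation B_{k+1} s_k = y_k exactly.
print(np.allclose(A @ s_k, y_k))       # True
```

For a general nonlinear $f$, the secant equation asks the new approximation $B_{k+1}$ to mimic this behavior of the true Hessian along the most recent step.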


Given that $B_{k+1}$ is symmetric, it can be positive definite only if the following condition is satisfied:
\[
s_k^T y_k > 0 \qquad \text{(Curvature Condition)}.
\]

This condition is guaranteed to hold if $\alpha_k$ is chosen by the Wolfe (or strong Wolfe) line search. From the second Wolfe condition (the curvature condition), we have, for $c_2 \in (0, 1)$,
\[
\nabla f(x_{k+1})^T s_k \ge c_2\, \nabla f(x_k)^T s_k,
\]
and therefore
\[
y_k^T s_k = [\nabla f(x_{k+1}) - \nabla f(x_k)]^T s_k \ge (c_2 - 1)\, \alpha_k \nabla f(x_k)^T p_k.
\]
Since $c_2 < 1$ and $p_k$ is a descent direction, the right-hand side is positive, and hence $y_k^T s_k > 0$.
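The following small sketch (added here for illustration) shows this on a strictly convex quadratic, where an exact line search makes the new directional derivative zero, so the curvature condition holds for any $c_2 \in (0,1)$ and $y_k^T s_k > 0$ follows:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)              # symmetric positive definite
grad = lambda x: A @ x                   # gradient of f(x) = 0.5 x^T A x

x = rng.standard_normal(n)
p = -grad(x)                             # a descent direction
alpha = -(grad(x) @ p) / (p @ A @ p)     # exact line search along p (alpha > 0)

x_new = x + alpha * p
s, y = x_new - x, grad(x_new) - grad(x)

c2 = 0.9
print(grad(x_new) @ s >= c2 * (grad(x) @ s))   # second (curvature) Wolfe condition
print(y @ s > 0)                               # hence s_k^T y_k > 0
```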

2.1 Davidon-Fletcher-Powell (DFP) Update

The secant equation has possibly many solutions. We obtain a unique $B_{k+1}$ by considering the following optimization problem:
\[
\min_B \|B - B_k\|, \qquad \text{s.t. } B = B^T,\ B s_k = y_k.
\]
Choosing a weighted Frobenius norm in the objective, we obtain the following expression:
\[
B_{k+1} = (I - \rho_k y_k s_k^T)\, B_k\, (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T, \qquad \text{(DFP)} \quad (10.1)
\]
where
\[
\rho_k = \frac{1}{y_k^T s_k}.
\]
This update was first proposed by Davidon in 1959 and further studied by Fletcher and Powell.

Algorithm 2: BFGS

Input: $x_0$, convergence tolerance $\epsilon > 0$, and $H_0$;
for $k = 0, 1, 2, \ldots$ do
    if $\|\nabla f(x_k)\|_2 \le \epsilon$ then
        break;
    end
    Compute $p_k = -H_k \nabla f(x_k)$;
    Choose $\alpha_k$ from the Wolfe linesearch;
    $x_{k+1} \leftarrow x_k + \alpha_k p_k$;
    Compute $H_{k+1}$ by the BFGS update rule (10.3);
end
The inverse of $B_k$, denoted by $H_k = B_k^{-1}$, is in fact more useful for implementation; its update can be derived using the Sherman-Morrison-Woodbury formula:
\[
H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}. \qquad \text{(DFP)} \quad (10.2)
\]
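For reference, here is a one-function NumPy sketch of the inverse DFP update (10.2) (an illustration added here; the name `dfp_update` is a choice of this sketch):

```python
import numpy as np

def dfp_update(H, s, y):
    """Inverse DFP update (10.2); assumes y^T s > 0 and H symmetric positive definite."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)
```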

2.2 Broyden-Fletcher-Goldfarb-Shanno (BFGS) Update

We can instead write the secant equation in terms of the inverse approximation $H_k = B_k^{-1}$,
\[
H_{k+1} y_k = s_k,
\]
and find a unique solution by solving the following problem:
\[
\min_H \|H - H_k\|, \qquad \text{s.t. } H = H^T,\ H y_k = s_k.
\]
The unique solution is given by
\[
H_{k+1} = (I - \rho_k s_k y_k^T)\, H_k\, (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \qquad \text{(BFGS)} \quad (10.3)
\]

for the same definition of $\rho_k = \frac{1}{y_k^T s_k}$. This is the most popular update rule for quasi-Newton methods.
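Putting the pieces together, here is a minimal sketch of Algorithm 2 with the inverse update (10.3) (an illustration written for these notes, not the book's implementation; it substitutes Armijo backtracking plus an explicit $y_k^T s_k > 0$ check for the full Wolfe line search that the theory assumes, and the names `bfgs`, `H0`, `c1` are choices of this sketch):

```python
import numpy as np

def bfgs(f, grad, x0, H0=None, eps=1e-8, max_iter=200, c1=1e-4):
    """Sketch of Algorithm 2 (BFGS) using the inverse update (10.3)."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n) if H0 is None else H0.copy()
    I = np.eye(n)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        p = -H @ g                            # p_k = -H_k grad f(x_k)
        # Backtracking (Armijo) step; the convergence theory assumes a Wolfe search.
        alpha = 1.0
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        s, y = x_new - x, grad(x_new) - grad(x)
        if y @ s > 1e-12:                     # curvature condition (guaranteed under Wolfe)
            rho = 1.0 / (y @ s)
            # H_{k+1} = (I - rho s y^T) H_k (I - rho y s^T) + rho s s^T     (10.3)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x = x_new
    return x
```

It can be exercised, for example, with SciPy's `scipy.optimize.rosen` and `rosen_der` supplied as `f` and `grad`.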
The formulation in terms of $B_k$ can be derived using the Sherman-Morrison-Woodbury formula:
\[
B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. \qquad \text{(BFGS)} \quad (10.4)
\]

Using this form would be less favorable, since computing the search direction then involves a matrix inversion (although there are some work-arounds).
A nice property of BFGS is that if $H_k$ is positive definite and $y_k^T s_k > 0$ (so that $\rho_k > 0$), then $H_{k+1}$ is also positive definite. For any nonzero vector $z$,
\[
z^T H_{k+1} z = w^T H_k w + \rho_k (s_k^T z)^2 \ge 0,
\]
where $w := z - \rho_k y_k (s_k^T z)$. The right-hand side can be zero only if $s_k^T z = 0$, but then $w = z \ne 0$, implying that the first term is positive.
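A quick numerical check of this property (added here for illustration): starting from a random positive definite $H_k$ and any pair $(s_k, y_k)$ with $y_k^T s_k > 0$, the updated matrix has only positive eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)                  # random positive definite H_k
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if y @ s <= 0:
    y = -y                               # enforce the curvature condition y^T s > 0

rho = 1.0 / (y @ s)
I = np.eye(n)
H_new = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

print(np.all(np.linalg.eigvalsh(H_new) > 0))   # True: H_{k+1} is positive definite
```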
Compared to DFP (and other update rules), BFGS tends to correct inaccurate approximations automatically (given that the step sizes satisfy the Wolfe conditions).

3 Convergence of Newton-Type Methods

We consider extended forms of Newton's method, where the search direction is chosen by solving the following system of linear equations,
\[
B_k p_k = -\nabla f(x_k),
\]
for a symmetric and positive definite matrix $B_k$. Note that $p_k = -B_k^{-1} \nabla f(x_k)$ is a descent direction, since $\nabla f(x_k)^T p_k = -\nabla f(x_k)^T B_k^{-1} \nabla f(x_k) < 0$ whenever $\nabla f(x_k) \ne 0$.
We first describe a global convergence result for this setting.
Theorem 10.2 (Superlinear Convergence). For $f : \mathbb{R}^n \to \mathbb{R}$ twice continuously differentiable, consider a sequence $\{x_k\}$ generated by $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions with $c_1 \le 1/2$. If $\{x_k\}$ converges to a point $x^*$ such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, and if the search direction satisfies
\[
\lim_{k \to \infty} \frac{\|\nabla f(x_k) + \nabla^2 f(x_k)\, p_k\|_2}{\|p_k\|_2} = 0, \qquad (10.5)
\]
then
- the step size $\alpha_k = 1$ is admissible for all $k > k_0$, for a certain index $k_0$, and
- if $\alpha_k = 1$ for all $k > k_0$, then $\{x_k\}$ converges to $x^*$ superlinearly.
We make a few observations.
- The Newton direction $p_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)$ satisfies the condition (10.5).
- It is easy to check that for $p_k = -B_k^{-1} \nabla f(x_k)$, the search direction condition becomes
\[
\lim_{k \to \infty} \frac{\|(B_k - \nabla^2 f(x_k))\, p_k\|_2}{\|p_k\|_2} = 0, \qquad (10.6)
\]
  so it suffices that $B_k$ becomes an increasingly accurate approximation of $\nabla^2 f(x_k)$ along the search directions $p_k$; it is not necessary that $B_k$ converge to $\nabla^2 f(x^*)$.

3.1 Convergence of BFGS

Theorem 10.3. Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable, $H_0$ be any symmetric positive definite matrix, and $x_0$ be a starting point for which the level set $\mathcal{L} = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$ is convex, and suppose there exist positive constants $m$ and $M$ such that
\[
\frac{y_k^T s_k}{s_k^T s_k} \ge m, \qquad \frac{y_k^T y_k}{y_k^T s_k} \le M,
\]
for all $k$. Then the sequence $\{x_k\}$ generated by Algorithm 2 (with $\epsilon = 0$) converges to the minimizer $x^*$ of $f$.
This theorem generalizes to other quasi-Newton methods (e.g., the restricted Broyden class), but not to the DFP method.

3.1.1 Convergence Rate of BFGS

In general, the rate of convergence of the iterates generated by the BFGS algorithm is linear. In particular, when the sequence $\{\|x_k - x^*\|_2\}$ satisfies the extra condition
\[
\sum_{k=1}^{\infty} \|x_k - x^*\|_2 < \infty,
\]
and the Hessian of $f$ is Lipschitz continuous near $x^*$, then we can show that the BFGS algorithm satisfies the Dennis and Moré characterization (10.5) and therefore achieves superlinear convergence.

References
Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, second
edition.
