6. Quasi-Newton methods
• variable metric methods
• quasi-Newton methods
• BFGS update
• limited-memory quasi-Newton methods
6-1
Newton method

x+ = x − t ∇²f(x)^{-1} ∇f(x)

advantages: fast convergence, affine invariance

disadvantages: requires second derivatives; the solution of a linear equation can be too expensive for large-scale applications
Variable metric methods

x+ = x − t H^{-1} ∇f(x)

the positive definite matrix H is an approximation of the Hessian, chosen to avoid the computation of second derivatives

'variable metric' interpretation: Δx = −H^{-1}∇f(x) is the steepest descent direction at x for the quadratic norm

‖z‖_H = (z^T H z)^{1/2}
Quasi-Newton methods
given starting point x^(0) ∈ dom f, H_0 ≻ 0

for k = 1, 2, . . ., until a stopping criterion is satisfied

1. compute quasi-Newton direction Δx = −H_{k-1}^{-1} ∇f(x^(k-1))
2. determine step size t (e.g., by backtracking line search)
3. compute x^(k) = x^(k-1) + t Δx
4. compute H_k
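Steps 1–4 can be sketched as a generic Python loop. The update rule for H_k (step 4) is passed in as a callback, and the initial H_0 = I, the tolerance, and the backtracking parameters alpha, beta are illustrative assumptions, not values prescribed by the slides:

```python
import numpy as np

def quasi_newton(f, grad, x0, update_H_inv, tol=1e-8, max_iter=100,
                 alpha=0.25, beta=0.5):
    """Generic quasi-Newton iteration (steps 1-4 above).

    update_H_inv(H_inv, s, y) returns the next inverse Hessian
    approximation; alpha, beta are assumed backtracking parameters.
    """
    x = np.asarray(x0, dtype=float)
    H_inv = np.eye(x.size)               # H_0 = I (a common, assumed choice)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # stopping criterion
            break
        dx = -H_inv @ g                  # 1. quasi-Newton direction
        t = 1.0                          # 2. backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta
        x_new = x + t * dx               # 3. take the step
        s, y = x_new - x, grad(x_new) - g
        H_inv = update_H_inv(H_inv, s, y)  # 4. update the approximation
        x = x_new
    return x
```

With the trivial update that keeps H_inv = I, the loop reduces to gradient descent with backtracking, which is a convenient way to test the scaffolding before plugging in a BFGS-style update.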
BFGS update

H_k = H_{k-1} + (y y^T)/(y^T s) − (H_{k-1} s s^T H_{k-1})/(s^T H_{k-1} s)

where s = x^(k) − x^(k-1), y = ∇f(x^(k)) − ∇f(x^(k-1))

inverse update:

H_k^{-1} = (I − (s y^T)/(y^T s)) H_{k-1}^{-1} (I − (y s^T)/(y^T s)) + (s s^T)/(y^T s)

• note that y^T s > 0 for strictly convex f; see page 1-10
• cost of the update or inverse update is O(n²) operations
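A minimal transcription of the inverse update, together with a check of the inverse secant condition H_k^{-1} y = s (the particular vectors s, y are illustrative):

```python
import numpy as np

def bfgs_inverse_update(H_inv, s, y):
    """Inverse BFGS update:
    H_k^{-1} = (I - s y^T/(y^T s)) H_{k-1}^{-1} (I - y s^T/(y^T s))
               + s s^T/(y^T s).
    Requires the curvature condition y^T s > 0.
    """
    rho = 1.0 / (y @ s)
    V = np.eye(s.size) - rho * np.outer(s, y)   # I - s y^T / (y^T s)
    return V @ H_inv @ V.T + rho * np.outer(s, s)

# inverse secant condition: H_k^{-1} y = s
s = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])                        # y^T s = 5 > 0
assert np.allclose(bfgs_inverse_update(np.eye(2), s, y) @ y, s)
```

Since the update is a few outer products and matrix multiplies, its O(n²) cost is visible directly in the code.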
Positive definiteness

if y^T s > 0, the BFGS update preserves positive definiteness of H_k

proof: from the inverse update, for any v,

v^T H_k^{-1} v = (v − ((s^T v)/(y^T s)) y)^T H_{k-1}^{-1} (v − ((s^T v)/(y^T s)) y) + (s^T v)²/(y^T s)

• if H_{k-1} ≻ 0, both terms are nonnegative for all v
• the second term is zero only if s^T v = 0; then the first term is zero only if v = 0
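The preservation of positive definiteness can be illustrated numerically: apply a sequence of inverse BFGS updates with random pairs satisfying y^T s > 0 and check that the smallest eigenvalue stays positive. The helper, the dimension, and the curvature threshold are assumptions of this sketch:

```python
import numpy as np

# Numerical illustration of the argument above: repeated inverse BFGS
# updates with y^T s > 0 keep H_k^{-1} (hence H_k) positive definite.
def bfgs_inverse_update(H_inv, s, y):
    rho = 1.0 / (y @ s)
    V = np.eye(s.size) - rho * np.outer(s, y)
    return V @ H_inv @ V.T + rho * np.outer(s, s)

rng = np.random.default_rng(0)
H_inv = np.eye(4)
for _ in range(20):
    s = rng.standard_normal(4)
    y = rng.standard_normal(4)
    if y @ s < 0.1:          # skip pairs that violate (or barely satisfy)
        continue             # the curvature condition y^T s > 0
    H_inv = bfgs_inverse_update(H_inv, s, y)

min_eig = np.linalg.eigvalsh((H_inv + H_inv.T) / 2).min()
assert min_eig > 0           # positive definiteness is preserved
```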
Secant condition
the BFGS update satisfies the secant condition H_k s = y, i.e.,

H_k (x^(k) − x^(k-1)) = ∇f(x^(k)) − ∇f(x^(k-1))

interpretation: define a quadratic approximation of f at x^(k)

f_quad(z) = f(x^(k)) + ∇f(x^(k))^T (z − x^(k)) + (1/2)(z − x^(k))^T H_k (z − x^(k))

the secant condition implies that the gradient of f_quad agrees with ∇f at x^(k-1):

∇f_quad(x^(k-1)) = ∇f(x^(k)) + H_k (x^(k-1) − x^(k)) = ∇f(x^(k-1))
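The secant condition can be checked directly on the direct-form BFGS update; the particular H_{k-1}, s, y below are illustrative:

```python
import numpy as np

# Check the secant condition H_k s = y for the direct BFGS update
# H_k = H + y y^T/(y^T s) - H s s^T H / (s^T H s).
def bfgs_update(H, s, y):
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

H = np.diag([1.0, 2.0, 3.0])          # any H_{k-1} > 0 (illustrative)
s = np.array([1.0, -1.0, 2.0])
y = np.array([2.0, 0.5, 1.0])         # y^T s = 3.5 > 0
H_new = bfgs_update(H, s, y)
assert np.allclose(H_new @ s, y)      # secant condition holds
```

The cancellation is exact by construction: H_k s = Hs + y − Hs = y, independent of H_{k-1}.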
Secant method

for f : R → R, BFGS with unit step size gives the secant method

x^(k+1) = x^(k) − f'(x^(k))/H_k,   H_k = (f'(x^(k)) − f'(x^(k-1)))/(x^(k) − x^(k-1))

[figure: one step of the secant method, showing the iterates x^(k-1), x^(k), x^(k+1) and the curves f_quad(z), f(z)]
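The one-dimensional iteration above can be sketched in a few lines; the test function f(x) = x⁴/4 − x (so f'(x) = x³ − 1, with minimizer x* = 1) and the two starting points are assumptions of this sketch:

```python
# Secant method: BFGS with unit steps in one dimension, as derived above.
# fp is the derivative f'; x0, x1 are two starting points (assumptions).
def secant(fp, x0, x1, tol=1e-12, max_iter=50):
    for _ in range(max_iter):
        Hk = (fp(x1) - fp(x0)) / (x1 - x0)   # difference quotient ~ f''
        x0, x1 = x1, x1 - fp(x1) / Hk        # x^(k+1) = x^(k) - f'(x^(k))/H_k
        if abs(x1 - x0) < tol:
            break
    return x1

# minimize f(x) = x**4/4 - x, i.e. solve f'(x) = x**3 - 1 = 0
x_star = secant(lambda x: x**3 - 1, 0.5, 1.5)
assert abs(x_star - 1.0) < 1e-8
```

Note that H_k here is exactly the scalar version of the secant condition: H_k (x^(k) − x^(k-1)) = f'(x^(k)) − f'(x^(k-1)).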
Convergence
global result: if f is strongly convex, BFGS with backtracking line search (EE236B, lecture 10-6) converges from any x^(0), H_0 ≻ 0

local convergence: if f is strongly convex and ∇²f(x) is Lipschitz continuous, local convergence is superlinear: for sufficiently large k,

‖x^(k+1) − x*‖₂ ≤ c_k ‖x^(k) − x*‖₂

where c_k → 0
Example
minimize   c^T x − Σ_{i=1}^m log(b_i − a_i^T x)

with n = 100, m = 500
[figure: convergence of f(x^(k)) − f* versus iteration k; Newton reaches ~10^{-12} in about 9 iterations, BFGS in about 140]
cost per Newton iteration: O(n³) plus computing ∇²f(x)
cost per BFGS iteration: O(n²)
Square root BFGS update

to improve numerical stability, H_k can be propagated in factored form H_k = L_k L_k^T: if H_{k-1} = L L^T, then H_k = L̃ L̃^T with

L̃ = L (I + ((α ỹ − s̃) s̃^T)/(s̃^T s̃)),   s̃ = L^T s,   ỹ = L^{-1} y,   α = ((s̃^T s̃)/(s̃^T ỹ))^{1/2}
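As a sanity check on the factored (square-root) form of the BFGS update, the snippet below verifies numerically that L̃ = L(I + (αỹ − s̃)s̃^T/(s̃^T s̃)), with s̃ = L^T s, ỹ = L^{-1} y, α = (s̃^T s̃ / s̃^T ỹ)^{1/2}, reproduces the direct update H_k = H + y y^T/(y^T s) − H s s^T H/(s^T H s); the test matrices and vectors are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
H = A @ A.T + 5 * np.eye(5)            # an arbitrary H_{k-1} > 0
L = np.linalg.cholesky(H)
s = rng.standard_normal(5)
y = s + 0.3 * rng.standard_normal(5)   # keeps the curvature y^T s > 0
assert y @ s > 0

s_t = L.T @ s                          # s~ = L^T s
y_t = np.linalg.solve(L, y)            # y~ = L^{-1} y
alpha = np.sqrt((s_t @ s_t) / (s_t @ y_t))
L_new = L @ (np.eye(5) + np.outer(alpha * y_t - s_t, s_t) / (s_t @ s_t))

Hs = H @ s                             # direct BFGS update for comparison
H_direct = H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
assert np.allclose(L_new @ L_new.T, H_direct)
```

In this sketch L̃ is not triangular; in practice it would be re-triangularized (e.g., by Givens rotations) at O(n²) cost.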
Optimality of BFGS update

X = H_k solves the convex optimization problem

minimize   tr(H_{k-1}^{-1} X) − log det(H_{k-1}^{-1} X) − n
subject to  X s = y

• the cost function is nonnegative, and equal to zero only if X = H_{k-1}
• it is also known as the relative entropy between the densities N(0, X), N(0, H_{k-1})
• the optimality result follows from the KKT conditions: X = H_k satisfies

X^{-1} = H_{k-1}^{-1} − (1/2)(s ν^T + ν s^T),   X s = y,   X ≻ 0

with

ν = (1/(s^T y)) (2 H_{k-1}^{-1} y − (1 + (y^T H_{k-1}^{-1} y)/(y^T s)) s)
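The KKT characterization can be verified numerically: with ν = (1/(s^T y))(2 H^{-1} y − (1 + y^T H^{-1} y/(y^T s)) s), the matrix H^{-1} − (1/2)(sν^T + νs^T) should equal the inverse of the BFGS update of H. The expression as written here is a reconstruction from the slide, and the test data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4 * np.eye(4)            # H_{k-1} > 0 (illustrative)
s = rng.standard_normal(4)
y = s + 0.3 * rng.standard_normal(4)   # keeps y^T s > 0
assert y @ s > 0

Hs = H @ s                             # direct BFGS update H_k
H_new = H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

Hi = np.linalg.inv(H)                  # KKT expression for X^{-1} = H_k^{-1}
nu = (2 * Hi @ y - (1 + (y @ Hi @ y) / (y @ s)) * s) / (s @ y)
X_inv = Hi - 0.5 * (np.outer(s, nu) + np.outer(nu, s))
assert np.allclose(X_inv, np.linalg.inv(H_new))
```

Expanding (1/2)(sν^T + νs^T) with this ν gives exactly the rank-two correction in the inverse BFGS formula, which is why the check holds to machine precision.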
a similar construction, with the roles of the two matrices interchanged, gives the update

H_k = (I − (y s^T)/(y^T s)) H_{k-1} (I − (s y^T)/(y^T s)) + (y y^T)/(y^T s)

(known as the DFP update); it pre-dates the BFGS update, but is less often used
Limited-memory quasi-Newton methods

limited-memory BFGS (L-BFGS): do not store H_k^{-1} explicitly; instead we store the m (e.g., m = 30) most recent values of

s_j = x^(j) − x^(j-1),   y_j = ∇f(x^(j)) − ∇f(x^(j-1))

and evaluate the quasi-Newton direction recursively, using the inverse update

H_j^{-1} = (I − (s_j y_j^T)/(y_j^T s_j)) H_{j-1}^{-1} (I − (y_j s_j^T)/(y_j^T s_j)) + (s_j s_j^T)/(y_j^T s_j)

starting from H_{k-m}^{-1} = I, so that the cost and storage per iteration are O(mn)
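In practice the recursion is evaluated matrix-free with the standard two-loop recursion; a sketch, where s_list and y_list hold the stored pairs from oldest to newest and H_{k-m}^{-1} = I as above:

```python
import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """Compute the L-BFGS direction -H_k^{-1} grad_k via the standard
    two-loop recursion, equivalent to expanding the recursion above
    with H_{k-m}^{-1} = I. Cost is O(mn): only dot products and axpys.
    """
    q = grad_k.copy()
    alphas = []
    for s, y in reversed(list(zip(s_list, y_list))):   # newest to oldest
        a = (s @ q) / (y @ s)
        q -= a * y
        alphas.append(a)
    r = q                                   # apply H_{k-m}^{-1} = I
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        b = (y @ r) / (y @ s)
        r += (a - b) * s                    # oldest to newest
    return -r
```

With a single stored pair the result matches the explicit inverse update applied to the gradient, which is a convenient correctness check.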
References

• J. Nocedal and S. J. Wright, Numerical Optimization (2006), chapters 6 and 7
• J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1983)