Outline
- Newton and quasi-Newton methods
- gradient and conjugate gradient method
- subgradient method
- complexity bounds
Newton method
iteration for minimizing closed, convex, twice differentiable f:

    x^{(k+1)} = x^{(k)} - t_k \nabla^2 f(x^{(k)})^{-1} \nabla f(x^{(k)})

step size t_k is fixed or determined by line search; we often suppress the iteration number:

    x^+ = x - t \nabla^2 f(x)^{-1} \nabla f(x)   or   x := x - t \nabla^2 f(x)^{-1} \nabla f(x)
advantages of Newton's method: fast convergence, affine invariance
disadvantages: requires second derivatives, and the solution of a linear equation can be too expensive for large-scale applications
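The iteration above can be sketched in a few lines of numpy; this is a minimal illustration, not from the slides (the function names, tolerances, and the quadratic test objective are illustrative assumptions):

```python
import numpy as np

def newton(grad, hess, x0, t=1.0, tol=1e-10, max_iter=50):
    """Damped Newton iteration x+ = x - t * (hess f)^{-1} grad f.
    t = 1 gives the pure Newton method."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # solve the Newton system instead of forming the inverse Hessian
        dx = np.linalg.solve(hess(x), -g)
        x = x + t * dx
    return x

# illustrative example: f(x) = x1^2 + 10*x2^2; Newton solves it in one step
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
x_star = newton(grad, hess, [5.0, -3.0])
```

The O(n^3) linear solve inside the loop is the per-iteration cost that makes Newton's method expensive at large scale.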
Unconstrained optimization methods 2-2
[figure: backtracking line search, showing f(x + t\Delta x) and the affine function f(x) + t \nabla f(x)^T \Delta x versus the step size t]
Quasi-Newton methods
given starting point x^{(0)} \in dom f, H_0 \succ 0
for k = 1, 2, \ldots, until a stopping criterion is satisfied:
1. compute quasi-Newton direction \Delta x = -H_{k-1}^{-1} \nabla f(x^{(k-1)})
2. determine step size t (e.g., by backtracking line search)
3. compute x^{(k)} = x^{(k-1)} + t \Delta x
4. compute H_k
BFGS update

    H_k = H_{k-1} + \frac{y y^T}{y^T s} - \frac{H_{k-1} s s^T H_{k-1}}{s^T H_{k-1} s}

inverse update

    H_k^{-1} = \Big(I - \frac{s y^T}{y^T s}\Big) H_{k-1}^{-1} \Big(I - \frac{y s^T}{y^T s}\Big) + \frac{s s^T}{y^T s}

where s = x^{(k)} - x^{(k-1)}, y = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)})
note that y^T s > 0 for strictly convex f; this follows from adding the two inequalities f(v) > f(u) + \nabla f(u)^T (v - u), evaluated at u = x^{(k-1)}, v = x^{(k)} and at u = x^{(k)}, v = x^{(k-1)}
cost of update or inverse update is O(n^2) operations
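The full method with the inverse update can be sketched as follows; a minimal numpy illustration (the function names, the Armijo constant 0.25, the curvature guard, and the test objective are illustrative assumptions, not from the slides):

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """BFGS with the inverse update and backtracking line search.
    Stores B = H_k^{-1}, so each update costs O(n^2)."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    B = np.eye(n)                      # inverse Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -B @ g                     # quasi-Newton direction
        t = 1.0
        while f(x + t * p) > f(x) + 0.25 * t * (g @ p):   # backtracking
            t *= 0.5
        x_new = x + t * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:              # curvature condition y^T s > 0
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            B = V @ B @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# illustrative example: minimize f(x) = (x1^2 + 10 x2^2)/2
f = lambda x: 0.5 * (x[0]**2 + 10.0 * x[1]**2)
grad_f = lambda x: np.array([x[0], 10.0 * x[1]])
x_star = bfgs(f, grad_f, [3.0, 1.0])
```

Skipping the update when y^T s is not safely positive is a common safeguard that keeps B positive definite.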
Unconstrained optimization methods 2-6
Positive definiteness

if y^T s > 0, the BFGS update preserves positive definiteness of H_k
proof: from the inverse update,

    v^T H_k^{-1} v = \Big(v - \frac{s^T v}{y^T s} y\Big)^T H_{k-1}^{-1} \Big(v - \frac{s^T v}{y^T s} y\Big) + \frac{(s^T v)^2}{y^T s}

if H_{k-1} \succ 0, both terms are nonnegative for all v; the second term is zero only if s^T v = 0, and then the first term is zero only if v = 0
Secant condition
BFGS update satisfies the secant condition H_k s = y, i.e.,

    H_k (x^{(k)} - x^{(k-1)}) = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)})

interpretation: define the quadratic approximation of f at x^{(k)}

    f_{quad}(z) = f(x^{(k)}) + \nabla f(x^{(k)})^T (z - x^{(k)}) + \frac{1}{2} (z - x^{(k)})^T H_k (z - x^{(k)})

the secant condition implies that the gradient of f_{quad} agrees with f at x^{(k-1)}:

    \nabla f_{quad}(x^{(k-1)}) = \nabla f(x^{(k)}) + H_k (x^{(k-1)} - x^{(k)}) = \nabla f(x^{(k-1)})
secant method: for f : R \to R, BFGS with unit step size gives the secant method

    x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{H_k}, \qquad H_k = \frac{f'(x^{(k)}) - f'(x^{(k-1)})}{x^{(k)} - x^{(k-1)}}

[figure: the secant iteration, with x^{(k-1)}, x^{(k)}, x^{(k+1)} marked on the graphs of f_{quad}(z) and f(z)]
Convergence
global result: if f is strongly convex (\nabla^2 f(x) \succeq mI for some m > 0), BFGS with backtracking line search converges from any x^{(0)}, H_0 \succ 0

local convergence: if f is strongly convex and \nabla^2 f(x) is Lipschitz continuous, local convergence is superlinear: for sufficiently large k,

    \|x^{(k+1)} - x^\star\|_2 \le c_k \|x^{(k)} - x^\star\|_2

where c_k \to 0
Example
    f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x)
[figure: f(x^{(k)}) - f^\star versus iteration k, for Newton's method (left, about 9 iterations) and BFGS (right, about 140 iterations); both reach accuracy about 10^{-12}]
cost per Newton iteration: O(n^3), plus the cost of computing \nabla^2 f(x)
cost per BFGS iteration: O(n^2)
Optimality interpretation

X = H_k solves the convex optimization problem

    minimize   tr(H_{k-1}^{-1} X) - \log\det(H_{k-1}^{-1} X) - n
    subject to Xs = y

the cost function is nonnegative, and equal to zero only if X = H_{k-1}; it is also known as the relative entropy between the densities N(0, X), N(0, H_{k-1})

the optimality result follows from the KKT conditions

    X^{-1} = H_{k-1}^{-1} - \frac{1}{2}(\nu s^T + s \nu^T), \qquad Xs = y, \qquad X \succ 0

which X = H_k satisfies with

    \nu = \frac{1}{s^T y}\Big(2 H_{k-1}^{-1} y - \Big(1 + \frac{y^T H_{k-1}^{-1} y}{y^T s}\Big) s\Big)
the analogous update obtained by switching the roles of H_k and H_k^{-1} (known as the DFP update) pre-dates the BFGS update, but is less often used
Limited memory quasi-Newton methods

instead of storing H_k, we store the m (e.g., m = 30) most recent values of

    s_j = x^{(j)} - x^{(j-1)}, \qquad y_j = \nabla f(x^{(j)}) - \nabla f(x^{(j-1)})

the quasi-Newton step -H_k^{-1} \nabla f(x^{(k)}) is evaluated recursively from the inverse update

    H_j^{-1} = \Big(I - \frac{s_j y_j^T}{y_j^T s_j}\Big) H_{j-1}^{-1} \Big(I - \frac{y_j s_j^T}{y_j^T s_j}\Big) + \frac{s_j s_j^T}{y_j^T s_j}

starting from an easily applied H_{k-m}^{-1} (e.g., a multiple of the identity)
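In implementations this recursion is usually organized as the standard two-loop recursion, which applies H_k^{-1} to a vector in O(mn) operations without forming any matrices; a sketch (the initial scaling H^{-1} = (s^T y / y^T y) I for the oldest level is a common heuristic choice, an assumption here):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Compute the L-BFGS direction -H_k^{-1} g from the stored
    (s_j, y_j) pairs (oldest first) via the two-loop recursion."""
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):  # newest to oldest
        a = (s @ q) / (y @ s)
        q -= a * y
        alphas.append(a)
    if s_list:                          # initial scaling H_0^{-1} = gamma * I
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return -q

# example with one stored pair, to compare against the dense inverse update
s = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])
g = np.array([1.0, -1.0])
d = lbfgs_direction(g, [s], [y])
```

With a single pair this reproduces the dense inverse BFGS update applied to a scaled identity, which gives a quick correctness check.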
Outline
- Newton and quasi-Newton methods
- gradient and conjugate gradient method
- subgradient method
- complexity bounds
Gradient method

simple quadratic example

    f(x_1, x_2) = \frac{1}{2}(x_1^2 + M x_2^2)

with M = 10 and fixed step size t = 0.18

[figure: contour lines of f with the zig-zagging gradient iterates]
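The behavior is easy to reproduce: per coordinate the iteration contracts by the factors |1 - t| = 0.82 and |1 - tM| = 0.8, and the negative sign of 1 - tM produces the zig-zag. A short numpy sketch (the starting point is an illustrative assumption):

```python
import numpy as np

M, t = 10.0, 0.18

def grad_f(x):
    """Gradient of f(x1, x2) = (x1^2 + M*x2^2)/2."""
    return np.array([x[0], M * x[1]])

x = np.array([10.0, 1.0])          # illustrative starting point
for k in range(100):
    x = x - t * grad_f(x)          # fixed-step gradient iteration
```

The x2 component flips sign every step (factor 1 - tM = -0.8), while x1 shrinks slowly (factor 0.82), which is the oscillation seen in the contour plot.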
2-16
Modifications

multistep methods
- heavy ball method: x^{(k)} = x^{(k-1)} - t \nabla f(x^{(k-1)}) + s (x^{(k-1)} - x^{(k-2)})
- conjugate gradient
- Nesterov-type methods (next lecture)

spectral gradient method (Barzilai-Borwein): use step size

    t_k = \frac{s_k^T s_k}{s_k^T y_k}, \qquad s_k = x^{(k-1)} - x^{(k-2)}, \quad y_k = \nabla f(x^{(k-1)}) - \nabla f(x^{(k-2)})
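The Barzilai-Borwein step size bolts directly onto the plain gradient iteration; a sketch on the same kind of quadratic (the small first-step size, stopping rule, and starting point are illustrative assumptions; note the method is not monotone in f):

```python
import numpy as np

def bb_gradient(grad, x0, t0=1e-3, tol=1e-8, max_iter=500):
    """Gradient method with Barzilai-Borwein step t_k = (s^T s)/(s^T y)."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev - t0 * grad(x_prev)        # first step with a small fixed size
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s = x - x_prev                    # s_k = x^(k-1) - x^(k-2)
        y = g - grad(x_prev)              # y_k = grad difference
        t = (s @ s) / (s @ y)
        x_prev, x = x, x - t * g
    return x

# illustrative example: f(x) = (x1^2 + 10 x2^2)/2
grad_f = lambda x: np.array([x[0], 10.0 * x[1]])
x_star = bb_gradient(grad_f, [10.0, 1.0])
```

For a strongly convex quadratic, s^T y = s^T A s > 0, so the step size is always well defined.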
Krylov sequence
the CG algorithm is a recursive method for computing the Krylov sequence for f(x) = \frac{1}{2} x^T A x - b^T x:

    x^{(k)} = \arg\min_{x \in K_k} f(x), \qquad K_k = span\{b, Ab, \ldots, A^{k-1}b\}, \quad K_0 = \{0\}

properties
- A^{-1}b \in K_n, therefore x^{(n)} = A^{-1}b
- there is a simple two-term recurrence

    x^{(k+1)} = x^{(k)} + a_k r_k + b_k (x^{(k)} - x^{(k-1)})

where r_k = b - A x^{(k)} is the residual
in practice, due to rounding errors, CG can take more than n steps (or fail to converge); CG is therefore used as an iterative method: with luck (a favorable spectrum of A), it gives a good approximation in far fewer than n steps
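The two-term recurrence above is usually written in terms of residuals and conjugate directions; a standard sketch in numpy (the 2x2 test system is an illustrative assumption):

```python
import numpy as np

def cg(A, b, x0=None, tol=1e-10, max_iter=None):
    """Conjugate gradient method for minimizing f(x) = (1/2) x^T A x - b^T x,
    i.e., for solving Ax = b with A symmetric positive definite."""
    n = b.size
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                     # residual r = -grad f(x)
    p = r.copy()                      # first direction: steepest descent
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)    # exact line search along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p   # conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b)
```

In exact arithmetic the loop terminates after at most n iterations, matching the Krylov-sequence property x^(n) = A^{-1} b.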
Applications in optimization
nonlinear conjugate gradient methods: extend the linear CG method to non-quadratic functions; local convergence similar to linear CG; limited global convergence theory

inexact and truncated Newton methods: use the conjugate gradient method to compute an (approximate) Newton step; less reliable than exact Newton methods, but can handle very large problems
Fletcher-Reeves CG algorithm
the CG algorithm modified to minimize a non-quadratic convex f

given x^{(0)}; for k = 1, 2, \ldots:

1. compute the gradient \nabla f(x^{(k-1)})
2. if k = 1, p_1 = -\nabla f(x^{(0)}); else p_k = -\nabla f(x^{(k-1)}) + \beta_k p_{k-1}, where

    \beta_k = \frac{\|\nabla f(x^{(k-1)})\|_2^2}{\|\nabla f(x^{(k-2)})\|_2^2}

3. determine a step size t_k by line search and update x^{(k)} = x^{(k-1)} + t_k p_k
some observations
- the first iteration is a gradient step; practical implementations restart the algorithm by taking a gradient step, for example, every n iterations
- the update is a gradient step with a momentum term:

    x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}) + \frac{t_k \beta_k}{t_{k-1}} (x^{(k-1)} - x^{(k-2)})

- with exact line search, this reduces to linear CG for quadratic f

line search
- exact line search in step 3 implies \nabla f(x^{(k)})^T p_k = 0
- therefore p_k is a descent direction at x^{(k-1)}: from step 2,

    \nabla f(x^{(k-1)})^T p_k = -\|\nabla f(x^{(k-1)})\|_2^2 + \beta_k \nabla f(x^{(k-1)})^T p_{k-1} = -\|\nabla f(x^{(k-1)})\|_2^2 < 0

since \nabla f(x^{(k-1)})^T p_{k-1} = 0 by the exact line search in the previous iteration
Variations
Polak-Ribière: in step 2, compute \beta_k from

    \beta_k = \frac{\nabla f(x^{(k-1)})^T (\nabla f(x^{(k-1)}) - \nabla f(x^{(k-2)}))}{\|\nabla f(x^{(k-2)})\|_2^2}
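A sketch covering both choices of beta, with backtracking line search, a periodic gradient restart, and a steepest-descent safeguard for when the inexact line search leaves p_k a non-descent direction (the constants and the quadratic test function are illustrative assumptions):

```python
import numpy as np

def nonlinear_cg(f, grad, x0, variant="FR", tol=1e-8, max_iter=200):
    """Nonlinear CG with Fletcher-Reeves ("FR") or Polak-Ribiere ("PR")
    coefficients, backtracking line search, and restarts every n steps."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    g = grad(x)
    p = -g                              # first iteration is a gradient step
    for k in range(1, max_iter + 1):
        if np.linalg.norm(g) < tol:
            break
        if g @ p >= 0:                  # safeguard: not a descent direction
            p = -g
        t = 1.0                         # backtracking (inexact) line search
        while f(x + t * p) > f(x) + 1e-4 * t * (g @ p):
            t *= 0.5
        x = x + t * p
        g_new = grad(x)
        if k % n == 0:
            beta = 0.0                  # periodic restart: gradient step
        elif variant == "FR":
            beta = (g_new @ g_new) / (g @ g)
        else:                           # "PR"
            beta = g_new @ (g_new - g) / (g @ g)
        p = -g_new + beta * p
        g = g_new
    return x

f = lambda x: 0.5 * (x[0]**2 + 10.0 * x[1]**2)   # illustrative quadratic
grad_f = lambda x: np.array([x[0], 10.0 * x[1]])
x_fr = nonlinear_cg(f, grad_f, [10.0, 1.0], variant="FR")
```

With exact line search on a quadratic this would reproduce linear CG; with backtracking it behaves like gradient descent with momentum, as the observations above describe.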
interpretation as a memoryless BFGS method: apply the BFGS inverse update to H_{k-1} = I:

    H_k^{-1} = \Big(I - \frac{s y^T}{y^T s}\Big)\Big(I - \frac{y s^T}{y^T s}\Big) + \frac{s s^T}{y^T s}
             = I - \frac{y s^T + s y^T}{y^T s} + \Big(1 + \frac{y^T y}{y^T s}\Big) \frac{s s^T}{y^T s}

where y = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)}), s = x^{(k)} - x^{(k-1)}

\nabla f(x^{(k)})^T s = 0 if x^{(k)} is determined by exact line search, so the quasi-Newton step in iteration k is

    -H_k^{-1} \nabla f(x^{(k)}) = -\nabla f(x^{(k)}) + \frac{y^T \nabla f(x^{(k)})}{y^T s}\, s
Outline
- Newton and quasi-Newton methods
- gradient and conjugate gradient method
- subgradient method
- complexity bounds
Subgradient method
to minimize a nondifferentiable convex function f: choose x^{(0)} and repeat

    x^{(k)} = x^{(k-1)} - t_k g^{(k-1)}, \qquad k = 1, 2, \ldots

where g^{(k-1)} is any subgradient of f at x^{(k-1)}

step size rules
- fixed step: t_k constant
- fixed length: t_k \|g^{(k-1)}\|_2 constant (i.e., \|x^{(k)} - x^{(k-1)}\|_2 constant)
- diminishing: t_k \to 0, \sum_{k=1}^\infty t_k = \infty
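A sketch on a simple nondifferentiable objective, using a diminishing step size and tracking the best iterate (the objective, starting point, and the 0.1/sqrt(k) rule are illustrative assumptions; the method is not a descent method, so f(x^{(k)}) need not decrease monotonically):

```python
import numpy as np

def subgradient_method(subgrad, f, x0, n_iter=3000):
    """Subgradient method with diminishing step size t_k = 0.1/sqrt(k),
    returning the best point found so far."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, n_iter + 1):
        x = x - 0.1 / np.sqrt(k) * subgrad(x)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# example: f(x) = ||x - c||_1, a nondifferentiable convex function
c = np.array([1.0, -2.0])
f = lambda x: np.abs(x - c).sum()
subgrad = lambda x: np.sign(x - c)     # a valid subgradient of f at x
x_best, f_best = subgradient_method(subgrad, f, np.zeros(2))
```

Late in the run the iterates oscillate around the minimizer with amplitude on the order of the current step size, which is why only the best iterate is reported.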
Convergence results
assumptions
- f has finite optimal value f^\star and a minimizer x^\star, with \|x^{(0)} - x^\star\|_2 \le R
- f is convex and Lipschitz continuous with constant G > 0:

    |f(x) - f(y)| \le G \|x - y\|_2 \quad \text{for all } x, y

results (f_{best}^{(k)} = \min_{i \le k} f(x^{(i)}) is the best value up to iteration k)
- convergence requires a diminishing step size: t_k \to 0, \sum_{k=1}^\infty t_k = \infty
- with a properly chosen step size, one can derive the bound

    f_{best}^{(k)} - f^\star \le \frac{GR}{\sqrt{k}}

- the number of iterations needed to reach f_{best}^{(k)} - f^\star \le \epsilon is O(1/\epsilon^2)
example (A \in R^{500 \times 100}, b \in R^{500})

[figure: f(x^{(k)}) - f^\star versus iteration k, over 100 iterations]
[figure: f_{best}^{(k)} - f^\star versus iteration k for the diminishing step sizes t_k = 0.01/\sqrt{k} and t_k = 0.01/k, over 5000 iterations]
Finding a point in the intersection of convex sets

suppose C = C_1 \cap \cdots \cap C_m is nonempty, and consider

    minimize f(x) = \max\{dist(x, C_1), \ldots, dist(x, C_m)\}

- dist(x, C_j) is a convex function if C_j is convex
- f^\star = 0 if the intersection is nonempty
- to find a subgradient of f, we need a subgradient of the distance to the farthest set C_j
subgradient of dist(x, S): if x \notin S, a subgradient is

    g = \frac{x - P_S(x)}{\|x - P_S(x)\|_2}

where P_S(x) is the Euclidean projection of x on S
proof: the hyperplane H = \{y \mid (x - P_S(x))^T (y - P_S(x)) = 0\} separates x from S, so for all y

    dist(y, S) \ge \frac{(x - P_S(x))^T (y - P_S(x))}{\|x - P_S(x)\|_2}

(for y on the same side of H as x, the r.h.s. is the distance of y to H; for y on the other side, the r.h.s. is nonpositive)
hence

    dist(y, S) \ge \|x - P_S(x)\|_2 + \frac{(x - P_S(x))^T (y - x)}{\|x - P_S(x)\|_2} = dist(x, S) + g^T (y - x)
subgradient method with optimal step size for

    minimize f(x) = \max\{dist(x, C_1), \ldots, dist(x, C_m)\}

if C_j is the farthest set at iteration k (i.e., dist(x^{(k-1)}, C_j) = f(x^{(k-1)})), take the unit-norm subgradient g^{(k-1)} = (x^{(k-1)} - P_{C_j}(x^{(k-1)})) / dist(x^{(k-1)}, C_j) and step size t_k = f(x^{(k-1)}); then

    x^{(k)} = x^{(k-1)} - f(x^{(k-1)}) g^{(k-1)} = P_{C_j}(x^{(k-1)})
this is a version of the famous alternating projections algorithm
- at each step, project the current point onto the farthest set
- for m = 2 sets, the projections alternate onto one set, then the other
- convergence: dist(x^{(k)}, C) \to 0 as k \to \infty
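For m = 2 sets with easy projections the scheme is a few lines; a sketch with a unit ball and a halfspace (both sets and the starting point are illustrative assumptions; projecting onto each set in turn matches the m = 2 case above):

```python
import numpy as np

def proj_ball(x, center, radius):
    """Euclidean projection onto the ball ||x - center||_2 <= radius."""
    d = x - center
    nd = np.linalg.norm(d)
    return x if nd <= radius else center + radius * d / nd

def proj_halfspace(x, a, b):
    """Euclidean projection onto the halfspace a^T x <= b."""
    viol = a @ x - b
    return x if viol <= 0 else x - viol / (a @ a) * a

# C1 = unit ball at the origin, C2 = {x : x1 <= -0.5}; they intersect
a, b = np.array([1.0, 0.0]), -0.5
x = np.array([5.0, 4.0])
for _ in range(100):
    x = proj_ball(x, np.zeros(2), 1.0)      # project onto C1
    x = proj_halfspace(x, a, b)             # then onto C2
```

After convergence the iterate lies (approximately) in both sets.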
Alternating projections

[figure: first few iterations of alternating projections between two convex sets]

example: positive semidefinite matrix completion
- C_1 = S_+^n; projection of X onto C_1 via the eigenvalue decomposition:

    P_{C_1}(X) = \sum_{i=1}^n \max\{\lambda_i, 0\} q_i q_i^T \qquad \text{if } X = \sum_{i=1}^n \lambda_i q_i q_i^T

- C_2 is the (affine) set of matrices in S^n with the specified fixed entries; projection of X onto C_2 by re-setting the specified entries to their fixed values
example: a 100 \times 100 matrix missing about 71% of its entries; initialize X^{(0)} with the unknown entries set to 0
[figure: \|X^{(k+1)} - X^{(k)}\|_F versus iteration k, decreasing from about 10^1 to 10^{-9} over 50 iterations]
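The two projections can be sketched directly in numpy; a small illustration of the same scheme (the 3x3 example data are illustrative assumptions; the slides' example is 100x100):

```python
import numpy as np

def proj_psd(X):
    """Projection onto the PSD cone: clip negative eigenvalues to zero."""
    lam, Q = np.linalg.eigh(X)
    return (Q * np.maximum(lam, 0.0)) @ Q.T

def complete_psd(X_fixed, mask, n_iter=500):
    """Alternating projections for PSD completion: mask marks the known
    entries of X_fixed; unknown entries are initialized to zero."""
    X = np.where(mask, X_fixed, 0.0)
    for _ in range(n_iter):
        X = proj_psd(X)                    # project onto C1 = PSD cone
        X = np.where(mask, X_fixed, X)     # project onto C2 = fixed entries
    return X

# 3x3 example with unknown (1,3)/(3,1) entries; a PSD completion exists
X_fixed = np.array([[1.0, 2.0, 0.0],
                    [2.0, 5.0, 3.0],
                    [0.0, 3.0, 4.0]])
mask = np.array([[True,  True,  False],
                 [True,  True,  True],
                 [False, True,  True]])
X = complete_psd(X_fixed, mask)
```

The returned matrix matches the fixed entries exactly (the last projection is onto C2) and is positive semidefinite up to the convergence tolerance of the iteration.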
observations: the subgradient method is often very slow; there is no good stopping criterion; theoretical complexity: O(1/\epsilon^2) iterations to find an \epsilon-suboptimal point
Outline
- Newton and quasi-Newton methods
- gradient and conjugate gradient method
- subgradient method
- complexity bounds
Complexity bounds

first-order methods for minimizing a convex function f with Lipschitz continuous gradient:

    \|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2 \quad \text{for all } x, y

problem: find x with f(x) - f^\star \le \epsilon
algorithm class: any iterative method that selects x^{(k)} in

    x^{(0)} + span\{\nabla f(x^{(0)}), \nabla f(x^{(1)}), \ldots, \nabla f(x^{(k-1)})\}

bound: no first-order method can have a worst-case complexity better than O(1/\sqrt{\epsilon}) iterations
the gradient method is not optimal (its complexity is O(1/\epsilon); see next lecture)
References
algorithms for unconstrained minimization
- B.T. Polyak, Introduction to Optimization (1987)
- J. Nocedal, S.J. Wright, Numerical Optimization (2006)
- D. Bertsekas, Nonlinear Programming (1999)

subgradient method
- S. Boyd, lecture notes and slides for EE364b, Convex Optimization II
- N.Z. Shor, Nondifferentiable Optimization and Polynomial Problems (1998)
fundamental complexity bounds