Introduction
http://gol.dsi.unifi.it/users/schoen
NLP problems

min f(x),    x ∈ S ⊆ Rⁿ

Standard form:

min f(x)
s.t.  h_i(x) = 0,   i = 1, …, m
      g_j(x) ≤ 0,   j = 1, …, k

A point x* ∈ S is a local optimum if there exists ε > 0 such that f(x*) ≤ f(x) for all x ∈ S ∩ B(x*, ε), where B(x̄, ε) = {x ∈ Rⁿ : ‖x − x̄‖ ≤ ε} is a ball in Rⁿ. Any global optimum is also a local optimum, but the opposite is generally false.
Convex Functions

A set S ⊆ Rⁿ is convex if

x, y ∈ S  ⇒  λx + (1 − λ)y ∈ S

for all choices of λ ∈ [0, 1]. Let Ω ⊆ Rⁿ be a non-empty convex set. A function f : Ω → R is convex iff

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

for all x, y ∈ Ω and all λ ∈ [0, 1]. If f is twice continuously differentiable, it is convex iff its Hessian matrix is positive semi-definite:

∇²f(x) := [∂²f/∂x_i∂x_j],    ∇²f(x) ⪰ 0  iff  vᵀ∇²f(x)v ≥ 0  ∀v ∈ Rⁿ

Example: an affine function is convex (and concave). For a quadratic function (Q: symmetric matrix)

f(x) = ½ xᵀQx + bᵀx + c

we have

∇f(x) = Qx + b,    ∇²f(x) = Q

so f is convex iff Q ⪰ 0.
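The Hessian test is easy to apply numerically: a quadratic is convex iff the eigenvalues of its (symmetric) Q matrix are all non-negative. A minimal sketch (the function name is mine):

```python
import numpy as np

def is_convex_quadratic(Q):
    """f(x) = 0.5 x^T Q x + b^T x + c is convex iff Q is positive
    semi-definite, i.e. all eigenvalues of symmetric Q are >= 0."""
    eigvals = np.linalg.eigvalsh(Q)          # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals >= -1e-12))   # small tolerance for round-off

Q1 = np.array([[2.0, 0.0], [0.0, 1.0]])      # positive definite -> convex
Q2 = np.array([[1.0, 2.0], [2.0, 1.0]])      # eigenvalues 3 and -1 -> not convex
print(is_convex_quadratic(Q1), is_convex_quadratic(Q2))
```

Note that `eigvalsh` exploits symmetry; for a non-symmetric Q one would first replace it by its symmetric part (Q + Qᵀ)/2, which defines the same quadratic form.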
Maximization

With a slight abuse of notation, a problem

max f(x),    x ∈ S

is called a convex optimization problem iff S is a convex set and f is a concave function on S (not to be confused with minimization of a concave function, or maximization of a convex function, which are NOT convex optimization problems). For a problem in standard form

min f(x)
s.t.  h_i(x) = 0,  i = 1, …, m
      g_j(x) ≤ 0,  j = 1, …, k

if f is convex, the h_i(x) are affine functions and the g_j(x) are convex functions, then the problem is convex.

More examples:
- max_i {a_iᵀx + b_i} is convex
- f, g convex ⇒ max{f(x), g(x)} is convex
- f_a convex for every a ∈ A (a possibly uncountable set) ⇒ sup_{a∈A} f_a(x) is convex
- f convex ⇒ f(Ax + b) is convex
- Trace(AᵀX) = Σ_{i,j} A_ij X_ij is linear (hence convex)
Data Approximation

Table of contents:
- norm approximation
- maximum likelihood
- robust estimation

Norm approximation

Problem:

min_x ‖Ax − b‖

where A, b are parameters. Usually the system is over-determined, i.e. b ∉ Range(A). For example, this happens when A ∈ R^{m×n} with m > n and A has full rank. r := Ax − b is the residual.

Examples:
- ‖r‖₂² = rᵀr: least squares (or regression)
- rᵀPr with P ⪰ 0: weighted least squares
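The least-squares case has a closed-form solution via the normal equations; a quick numerical sketch (random data, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))   # over-determined: m > n
b = rng.standard_normal(100)         # generic b, not in Range(A)

# least-squares solution: x minimizes ||Ax - b||_2
x, *_ = np.linalg.lstsq(A, b, rcond=None)
r = A @ x - b                        # residual

# optimality check: the residual is orthogonal to Range(A), i.e. A^T r = 0
print(np.linalg.norm(A.T @ r))
```

The printed value is numerically zero, confirming the normal equations AᵀAx = Aᵀb at the optimum.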
Example: ℓ₁ norm

Matrix A ∈ R^{100×30}. [Figure: histogram of the ℓ₁-norm residuals]

Nonlinear Programming Models p. 19

[Figure: histogram of the ℓ∞-norm residuals]

[Figure: histogram of the ℓ₂-norm residuals]

Possible (convex) additional constraints:
- maximum deviation from an initial estimate: ‖x − x_est‖ ≤ δ
- simple bounds: ℓ_i ≤ x_i ≤ u_i
- ordering: x₁ ≤ x₂ ≤ ⋯ ≤ x_n
Variants

min Σ_i h(r_i)

where h is a scalar penalty function, e.g. a dead-zone linear penalty or a linear–quadratic (Huber-type) penalty, which is quadratic for small residuals and linear for large ones. [Figure: comparison of penalty functions]
Maximum likelihood

Given a sample X₁, X₂, …, X_k and a parametric family of probability density functions L(·; θ), the maximum likelihood estimate of θ given the sample is

θ̂ = arg max_θ L(X₁, …, X_k; θ)

Example: linear measurements with additive i.i.d. (independent, identically distributed) noise:

X_i = a_iᵀθ + ε_i    (1)

L(X₁, …, X_k; θ) = ∏_{i=1}^k p(X_i − a_iᵀθ)

so the log-likelihood is Σ_i log p(X_i − a_iᵀθ).
- ε ∼ N(0, σ²), i.e. p(z) = (2πσ²)^{−1/2} exp(−z²/2σ²): the MLE is the ℓ₂ estimate θ̂ = arg min ‖Aθ − X‖₂
- p(z) = (1/(2a)) exp(−|z|/a): the MLE is the ℓ₁ estimate θ̂ = arg min ‖Aθ − X‖₁
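The ℓ₁ estimate can be computed as a linear program by introducing one auxiliary variable per residual. A sketch with synthetic Laplace-distributed noise (data and sizes are mine):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, k = 3, 60
theta_true = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((k, n))
X = A @ theta_true + rng.laplace(scale=0.1, size=k)  # Laplace noise: l1 is the MLE

# l1 estimate as an LP:  min 1^T t  s.t.  -t <= A theta - X <= t
# decision variables z = [theta (n entries), t (k entries)]
c = np.concatenate([np.zeros(n), np.ones(k)])
A_ub = np.block([[A, -np.eye(k)], [-A, -np.eye(k)]])
b_ub = np.concatenate([X, -X])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + k))
theta_l1 = res.x[:n]
print(theta_l1)   # close to theta_true
```

Each t_i is forced to be at least |a_iᵀθ − X_i|, so minimizing 1ᵀt minimizes the ℓ₁ residual norm.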
- p(z) = (1/a) exp(−z/a) 1{z ≥ 0} (negative exponential): the estimate can be found by solving the LP problem min 1ᵀ(X − Aθ) s.t. Aθ ≤ X
- p uniform on [−a, a]: the MLE is any θ such that ‖Aθ − X‖_∞ ≤ a

Ellipsoids

E = {x ∈ Rⁿ : (x − x₀)ᵀP^{−1}(x − x₀) ≤ 1}

where x₀ ∈ Rⁿ is the center of the ellipsoid and P is a symmetric positive-definite matrix. Alternative representations:

E = {x ∈ Rⁿ : ‖Ax − b‖₂ ≤ 1}

where A ≻ 0, or

E = {x ∈ Rⁿ : x = x₀ + Au, ‖u‖₂ ≤ 1}

where A is square and non-singular (affine transformation of the unit ball).
RLS (Robust Least Squares)

Assume each row of A is only known to belong to an ellipsoid:

a_i ∈ E_i = {ā_i + P_i u : ‖u‖ ≤ 1},    P_i = P_iᵀ ⪰ 0

It holds that, for a scalar α and a vector δ, max_{‖y‖≤1} |α + δᵀy| = |α| + ‖δ‖: choosing y = sign(α) δ/‖δ‖ (any unit y if δ = 0),

|α + δᵀy| = |α + sign(α) δᵀδ/‖δ‖| = |α| + ‖δ‖

Then:

max_{a_i∈E_i} |a_iᵀx − b_i| = max_{‖u‖≤1} |ā_iᵀx − b_i + uᵀP_i x| = |ā_iᵀx − b_i| + ‖P_i x‖

Thus the Robust Least Squares problem

min_x Σ_i (|ā_iᵀx − b_i| + ‖P_i x‖)²

reduces to

min_{x,t} Σ_i t_i²
s.t.  ā_iᵀx − b_i + ‖P_i x‖ ≤ t_i
      −(ā_iᵀx − b_i) + ‖P_i x‖ ≤ t_i

since the pair of constraints is equivalent to |ā_iᵀx − b_i| + ‖P_i x‖ ≤ t_i.
Geometrical Problems
- projections and distances
- polyhedral intersection
- extremal volume ellipsoids
- classification problems

Projection on a set

Given a set C, the projection of x on C is defined as:

P_C(x) = arg min_{z∈C} ‖z − x‖

If C is convex, this is a convex problem. Similarly, the distance between two convex sets C₁, C₂ is found by solving the convex problem min ‖x − y‖, x ∈ C₁, y ∈ C₂.
Polyhedral intersection

1: polyhedra described by means of linear inequalities:

P₁ = {x : Ax ≤ b},    P₂ = {x : Cx ≤ d}

Is P₁ ∩ P₂ = ∅? This is a linear feasibility problem: Ax ≤ b, Cx ≤ d. Is P₁ ⊆ P₂? Just check, for every row c_k of C,

sup {c_kᵀx : Ax ≤ b} ≤ d_k

2: polyhedra described by means of their vertices:

P₁ = conv{v₁, …, v_k},    P₂ = conv{w₁, …, w_h}

[Figure: two polyhedra given by their vertices]

Is P₁ ⊆ P₂? For each i = 1, …, k, check whether there exist λ_j ≥ 0 with Σ_j λ_j = 1 such that

Σ_j λ_j w_j = v_i

(a linear feasibility problem).
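The inequality-description containment test is a sequence of small LPs, one per row of C. A sketch on two concrete squares (the data are mine):

```python
import numpy as np
from scipy.optimize import linprog

# P1 = {x : Ax <= b}: the unit square [0,1]^2
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
# P2 = {x : Cx <= d}: the larger square [-1,2]^2, which contains P1
C = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
d = np.array([2.0, 1.0, 2.0, 1.0])

def contained(A, b, C, d):
    """P1 subset of P2 iff for every row k: sup{c_k^T x : Ax <= b} <= d_k."""
    for ck, dk in zip(C, d):
        # maximize c_k^T x over P1 = minimize -c_k^T x
        res = linprog(-ck, A_ub=A, b_ub=b, bounds=[(None, None)] * A.shape[1])
        if -res.fun > dk + 1e-9:
            return False
    return True

print(contained(A, b, C, d))   # True:  [0,1]^2 is inside [-1,2]^2
print(contained(C, d, A, b))   # False: [-1,2]^2 is not inside [0,1]^2
```

Each LP is bounded because P₁ is a polytope; for unbounded polyhedra the LP status would have to be checked as well.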
Maximal volume inscribed ellipsoid

Given P = {x : Ax ≤ b}, find an ellipsoid of maximal volume contained in P:

E = {By + d : ‖y‖ ≤ 1},    B = Bᵀ ≻ 0

E ⊆ P iff, for every row a_i of A,

sup_{‖y‖≤1} {a_iᵀBy + a_iᵀd} = ‖Ba_i‖ + a_iᵀd ≤ b_i,   i = 1, …, k

a convex constraint in (B, d); since the volume of E is proportional to det B, maximizing log det B gives a convex problem.

Difficult variants

These problems are hard:
- find a maximal volume ellipsoid contained in a polyhedron given by its vertices [Figure: polyhedron given by its vertices]
- it is already a difficult problem to decide whether a given ellipsoid E contains a polyhedron P = {x : Ax ≤ b}. This problem is still difficult even when the ellipsoid is a sphere: it is equivalent to norm maximization over a polyhedron, an NP-hard concave optimization problem.

Classification

Given points X_i, i = 1, …, k and Y_j, j = 1, …, h, a separating hyperplane can be found via a linear feasibility problem, e.g. aᵀX_i ≥ 1, i = 1, …, k and aᵀY_j ≤ −1, j = 1, …, h.

Robust separation

Find a maximal separation:

max_{a : ‖a‖ ≤ 1} [ min_i aᵀX_i − max_j aᵀY_j ]
Consider min f(x), x ∈ S, where f : S → R. Let x₁, x₂ ∈ S and d = x₂ − x₁: d is a feasible direction. If there exists ᾱ > 0 such that f(x₁ + αd) < f(x₁) for all α ∈ (0, ᾱ), d is called a descent direction at x₁. Elementary necessary optimality condition: if x* is a local optimum, no feasible descent direction may exist at x*.
Optimality Conditions p. 1
proof

Taylor expansion:

f(x* + αd) = f(x*) + α dᵀ∇f(x*) + o(α)

d cannot be a descent direction, so, if α is sufficiently small, f(x* + αd) ≥ f(x*). Thus

α dᵀ∇f(x*) + o(α) ≥ 0

and, dividing by α and letting α ↓ 0,

dᵀ∇f(x*) ≥ 0

For constrained sets the right notion of admissible directions is the tangent cone T(x̄): the set of directions d such that (up to positive scaling)

d = lim_{k→∞} (x_k − x̄)/‖x_k − x̄‖

for some sequence {x_k} → x̄, where x_k ∈ S.
Some examples

- S = Rⁿ ⇒ T(x) = Rⁿ
- S = {x : Ax = b} ⇒ T(x) = {d : Ad = 0}
- S = {x : Ax ≤ b}: let I be the set of active constraints at x̄, i.e. a_iᵀx̄ = b_i for i ∈ I and a_iᵀx̄ < b_i for i ∉ I. Then

T(x̄) = {d : a_iᵀd ≤ 0  ∀i ∈ I}

In fact, if d ∈ T(x̄) then a_iᵀd ≤ 0 for i ∈ I. Vice versa, let x_k = x̄ + α_k d with α_k ↓ 0. If a_iᵀd ≤ 0 for i ∈ I, then for i ∈ I

a_iᵀx_k = a_iᵀ(x̄ + α_k d) = b_i + α_k a_iᵀd ≤ b_i

while for i ∉ I, since a_iᵀx̄ < b_i,

a_iᵀx_k = a_iᵀx̄ + α_k a_iᵀd < b_i

for α_k small enough; thus x_k ∈ S and d ∈ T(x̄).

Example

Let S = {(x, y) ∈ R² : x² − y = 0} (parabola). Tangent cone at (0, 0)? Let {(x_k, y_k)} → (0, 0), i.e. x_k → 0, y_k = x_k². Then

‖(x_k, y_k) − (0, 0)‖ = √(x_k² + x_k⁴) = |x_k| √(1 + x_k²)

so (x_k, y_k)/‖(x_k, y_k)‖ → (±1, 0): the tangent cone is the x-axis.
Descent directions

d ∈ Rⁿ is a feasible direction at x̄ ∈ S if there exists ᾱ > 0 such that x̄ + αd ∈ S for all α ∈ [0, ᾱ]. d feasible ⇒ d ∈ T(x̄), but in general the converse is false. If f(x̄ + αd) < f(x̄) for all α ∈ (0, ᾱ), d is a descent direction.

Necessary optimality condition: if x̄ is a local optimum, then

dᵀ∇f(x̄) ≥ 0   ∀d ∈ T(x̄)

Sketch of proof: take a sequence x_k → x̄ realizing d. If k is large enough, x_k ∈ U(x̄), a neighborhood where x̄ is minimal, so

f(x_k) − f(x̄) ≥ 0

and a first-order expansion of the left-hand side, divided by ‖x_k − x̄‖, yields dᵀ∇f(x̄) ≥ 0 in the limit.
Examples

Unconstrained problems: every d ∈ Rⁿ belongs to the tangent cone at a local optimum, so

∇ᵀf(x̄)d ≥ 0   ∀d ∈ Rⁿ

Choosing d = e_i and d = −e_i we get

∇f(x̄) = 0

NB: the same is true if x̄ is a local minimum in the relative interior of the feasible region.

Linear equality constraints: S = {x : Ax = b}. The condition becomes ∇ᵀf(x̄)d ≥ 0 for all d : Ad = 0; an equivalent statement:

min_d {∇ᵀf(x̄)d : Ad = 0} = 0

(a linear program).
Linear inequalities

min f(x)
s.t.  Ax ≤ b

Tangent cone at a local minimum x̄: {d ∈ Rⁿ : a_iᵀd ≤ 0, i ∈ I(x̄)}. Let A_I be the rows of A associated to the active constraints at x̄. Then

min_d {∇ᵀf(x̄)d : A_I d ≤ 0} = 0

From LP duality, the dual

max 0ᵀλ
s.t.  A_Iᵀλ = −∇f(x̄),  λ ≥ 0

is feasible. Thus, at a local optimum, the gradient is a non-positive linear combination of the coefficients of the active constraints:

∇f(x̄) = −Σ_{i∈I} λ_i a_i,    λ_i ≥ 0
Farkas Lemma

Let A be a matrix in R^{m×n} and b ∈ R^m. One and only one of the following systems has a solution:

Aᵀy ≤ 0,  bᵀy > 0

or

Ax = b,  x ≥ 0

Geometrical interpretation: either b belongs to the cone {z : z = Ax, x ≥ 0} generated by the columns of A, or there is a vector y in {y : Aᵀy ≤ 0} making an acute angle with b. [Figure: the cone generated by a₁, a₂ and the vector b]

Proof

1) If there exists x ≥ 0 with Ax = b, then bᵀy = xᵀAᵀy. Thus, if Aᵀy ≤ 0, then bᵀy ≤ 0, and the first system has no solution.
2) Premise — separating hyperplane theorem: let C and D be two disjoint convex non-empty sets. Then there exist γ ≠ 0 and β such that

γᵀx ≤ β  ∀x ∈ C,    γᵀx ≥ β  ∀x ∈ D

If the second system has no solution, then b ∉ S := {z : z = Ax, x ≥ 0}, a closed convex cone; separating {b} from S gives γ, β with γᵀAx ≤ β for all x ≥ 0 and γᵀb > β. Since 0 ∈ S, β ≥ 0, hence bᵀγ > 0; and γᵀAx ≤ β for all x ≥ 0 is possible iff γᵀA ≤ 0 (otherwise scale x). Letting y = γ we obtain a solution of Aᵀy ≤ 0, bᵀy > 0.
Nonlinear constraints: S = {x : g_i(x) ≤ 0, i ∈ I ∪ …}. Let d ∈ T(x̄) be realized by a sequence x_k → x̄:

d = lim_k (x_k − x̄)/‖x_k − x̄‖

Let α_k = ‖x_k − x̄‖, so that α_k ↓ 0 and x_k = x̄ + α_k d + o(α_k). Since g(x_k) ≤ 0, expanding:

g_i(x̄ + α_k d) = g_i(x̄) + α_k ∇ᵀg_i(x̄)d + o(α_k)

If the i-th constraint is active (g_i(x̄) = 0), then

g_i(x̄ + α_k d) = α_k ∇ᵀg_i(x̄)d + o(α_k) ≤ 0

so, dividing by α_k and passing to the limit, ∇ᵀg_i(x̄)d ≤ 0. Thus

T(x̄) ⊆ G(x̄) := {d : ∇ᵀg_i(x̄)d ≤ 0, i ∈ I}

and the inclusion may be strict.
example

G(x̄) = T(x̄) may fail: e.g. with the constraints x³ + y ≤ 0, −y ≤ 0 at x̄ = (0, 0), the linearized cone G(x̄) is strictly larger than the tangent cone. Conditions ensuring G(x̄) = T(x̄) are called constraint qualifications. Examples:
- (Slater-type) X open set, g_i(x), i ∈ I, convex differentiable functions at x̄, g_i(x), i ∉ I, continuous at x̄, and there exists x̂ ∈ X strictly feasible: g_i(x̂) < 0 ∀i ∈ I
- (linear independence) X open set, g_i(x), i ∉ I, continuous at x̄ and {∇g_i(x̄)}, i ∈ I, linearly independent

KKT conditions (inequality constraints): if x̄ is a local optimum and a constraint qualification holds, there exist λ_i ≥ 0 such that

Σ_{i∈I} λ_i ∇ᵀg_i(x̄) = −∇ᵀf(x̄),    λ_i ≥ 0

Proof: x̄ local optimum ⇒ dᵀ∇f(x̄) ≥ 0 for every d ∈ T(x̄) = G(x̄); that is, dᵀ∇g_i(x̄) ≤ 0 ∀i ∈ I implies dᵀ∇f(x̄) ≥ 0. By Farkas' lemma the multipliers exist. Setting λ_i = 0 for the inactive constraints gives the complementarity form Σ_i λ_i g_i(x̄) = 0.
Convex problems

An optimization problem

min f(x),  x ∈ S

is a convex problem if
- S is a convex set, i.e. x, y ∈ S ⇒ λx + (1 − λ)y ∈ S for all λ ∈ [0, 1]
- f is a convex function on S, i.e. f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all λ ∈ [0, 1] and x, y ∈ S

A problem in standard form is convex if f is convex, the g_i are convex and the h_j are affine (i.e. of the form αᵀx + β).

Every local optimum is a global one. Proof: let x̄ be a local optimum for min_S f(x) and x* a global optimum. S convex ⇒ λx* + (1 − λ)x̄ ∈ S. Thus, for λ ↓ 0, the point λx* + (1 − λ)x̄ lies in the neighborhood where x̄ is minimal, and

f(x̄) ≤ f(λx* + (1 − λ)x̄) ≤ λf(x*) + (1 − λ)f(x̄)

hence f(x̄) ≤ f(x*) = f*, where f* is the global minimum value. Thus equality holds and the proof is complete.

Moreover, for convex differentiable f, if dᵀ∇f(x̄) ≥ 0 for every d = y − x̄ with y ∈ S, then

f(y) ≥ f(x̄) + (y − x̄)ᵀ∇f(x̄) ≥ f(x̄)   ∀y ∈ S

so first-order stationarity is also sufficient for global optimality.
KKT conditions (standard form)

min f(x)
s.t.  g_i(x) ≤ 0,  i = 1, …, m
      h_j(x) = 0,  j = 1, …, h

Let I be the set of active inequalities at x̄. If f(x), g_i(x), i ∈ I, and the h_j(x) are C¹ and constraint qualifications hold at x̄, then there exist λ_i ≥ 0, i ∈ I, and μ_j ∈ R, j = 1, …, h, such that

∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1}^h μ_j ∇h_j(x̄) = 0
Complementarity

KKT equivalent formulation:

∇f(x̄) + Σ_{i=1}^m λ_i ∇g_i(x̄) + Σ_{j=1}^h μ_j ∇h_j(x̄) = 0
λ_i g_i(x̄) = 0,  i = 1, …, m
λ ≥ 0

Second-order necessary conditions: in addition to

∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1}^h μ_j ∇h_j(x̄) = 0

it must hold that

dᵀ∇²L(x̄)d ≥ 0

for all d in the tangent space of the active constraints, where

∇²L(x) := ∇²f(x) + Σ_{i∈I} λ_i ∇²g_i(x) + Σ_{j=1}^h μ_j ∇²h_j(x)
Sufficient conditions

Let f, g_i, h_j be twice continuously differentiable, and let x*, λ, μ satisfy:

∇f(x*) + Σ_{i∈I} λ_i ∇g_i(x*) + Σ_{j=1}^h μ_j ∇h_j(x*) = 0
λ_i g_i(x*) = 0,   λ ≥ 0
dᵀ∇²L(x*)d > 0   ∀d ≠ 0 : dᵀ∇h_j(x*) = 0 ∀j,  dᵀ∇g_i(x*) = 0, i ∈ I

Then x* is a strict local minimum.

Lagrange Duality

Problem:

f* = min f(x)
s.t.  g_i(x) ≤ 0
      x ∈ X

Lagrangian function:

L(x; λ) = f(x) + Σ_i λ_i g_i(x),    λ ≥ 0, x ∈ X
Relaxation

Given an optimization problem

min f(x),  x ∈ S

a relaxation is a problem

min g(x),  x ∈ Q

where

S ⊆ Q,    g(x) ≤ f(x)   ∀x ∈ S

Weak Duality: the optimal value of a relaxation is a lower bound on the optimal value of the problem.

Example (point packing): place N points in the unit square maximizing the minimum pairwise distance 2r:

min −r
s.t.  4r² − (x_i − x_j)² − (y_i − y_j)² ≤ 0,   1 ≤ i < j ≤ N
      0 ≤ x_i ≤ 1,  0 ≤ y_i ≤ 1,   i = 1, …, N

For every choice of λ ≥ 0, the dual function value θ(λ) is the optimal value of a relaxation and is therefore a lower bound on the global minimum value of the problem.
solution

When N = 2, relaxing the first constraint:

θ(λ) = min_{x,y,r} −r + λ(4r² − (x₁ − x₂)² − (y₁ − y₂)²)
s.t.  x₁, x₂, y₁, y₂ ∈ [0, 1]

Minimizing over r gives r = 1/(8λ); the squared differences are pushed to their maximum value 1, so

θ(λ) = −2λ − 1/(16λ)

This is a lower bound on the optimal value. Best possible lower bound:

θ* = max_{λ≥0} θ(λ),   attained at λ* = 1/(4√2),   θ* = −√2/2

Choosing (x₁, y₁) = (0, 0) and (x₂, y₂) = (1, 1), a feasible solution with r = √2/2 (objective value −√2/2) is obtained. The Lagrange dual gives a lower bound equal to −√2/2: same as the objective function at a feasible solution ⇒ optimal solution! (an exception, not the rule!)
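The dual function derived above is one-dimensional, so the best bound can be checked numerically by a simple grid search:

```python
import numpy as np

def theta(lam):
    """Dual function of the two-point packing problem:
    theta(lam) = min -r + lam*(4r^2 - (x1-x2)^2 - (y1-y2)^2) over the box.
    Minimizing over r gives r = 1/(8 lam); the squared coordinate
    differences are pushed to their maximum value 1."""
    return -2.0 * lam - 1.0 / (16.0 * lam)

lams = np.linspace(0.01, 1.0, 100000)
vals = theta(lams)
best = lams[np.argmax(vals)]
print(best, vals.max())   # ~ 1/(4*sqrt(2)) ~ 0.1768 and ~ -sqrt(2)/2 ~ -0.7071
```

The maximizer matches λ* = 1/(4√2) and the bound matches the feasible value −√2/2, confirming the zero duality gap in this (exceptional) instance.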
The Lagrange dual

θ* = max_{λ ≥ 0} θ(λ)

might:
1. be unbounded
2. have a finite sup but no max
3. have a unique maximum attained at a single solution λ*
4. have many different maxima, each connected with a different solution λ*
Equality constraints

f* = min f(x)
s.t.  g_i(x) ≤ 0,  i = 1, …, m
      h_j(x) = 0,  j = 1, …, k
      x ∈ X

Lagrange function:

L(x; λ, μ) = f(x) + λᵀg(x) + μᵀh(x)

Linear Programming

min cᵀx
s.t.  Ax ≥ b

L(x; λ) = cᵀx + λᵀ(b − Ax) = λᵀb + (cᵀ − λᵀA)x

Lagrange dual function:

θ(λ) = λᵀb + min_x (cᵀ − λᵀA)x = { λᵀb  if cᵀ − λᵀA = 0;  −∞ otherwise }

Lagrange dual:

max λᵀb
s.t.  −λᵀA + cᵀ = 0
      λ ≥ 0

which is equivalent to:

max λᵀb
s.t.  Aᵀλ = c
      λ ≥ 0
QP Case 1

min ½xᵀQx + cᵀx s.t. Ax ≥ b, with L(x; λ) = λᵀb + ½xᵀQx + (cᵀ − λᵀA)x. If Q has at least one negative eigenvalue,

min_x ½xᵀQx + (cᵀ − λᵀA)x = −∞

so θ(λ) = −∞ for every λ.

QP Case 2

Q positive definite: the minimum point of the inner problem satisfies

Qx + c − Aᵀλ = 0,   i.e.   x̄ = Q^{−1}(Aᵀλ − c)

Lagrange dual function value:

θ(λ) = λᵀb + ½x̄ᵀQx̄ + (cᵀ − λᵀA)x̄
     = λᵀb − ½(Aᵀλ − c)ᵀQ^{−1}(Aᵀλ − c)

Lagrange dual (seen as a min problem):

min −λᵀb + ½(Aᵀλ − c)ᵀQ^{−1}(Aᵀλ − c),   λ ≥ 0

Optimality condition (when the sign constraint is inactive):

−b + AQ^{−1}(Aᵀλ − c) = 0

If we find optimal multipliers (a linear system) we get the optimal solution x̄ (thanks to feasibility and weak duality)!
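The "linear system gives the optimal solution" remark can be checked on a tiny instance (data are mine; the single constraint is active at the optimum, so the λ ≥ 0 condition takes care of itself):

```python
import numpy as np

# Primal: min 0.5 x^T Q x + c^T x  s.t.  A x >= b, with Q positive definite.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -2.0])
A = np.array([[-1.0, -1.0]])       # -x1 - x2 >= -1, i.e. x1 + x2 <= 1
b = np.array([-1.0])

def dual(lam):
    """theta(lam) = lam^T b - 0.5 (A^T lam - c)^T Q^{-1} (A^T lam - c)."""
    v = A.T @ lam - c
    return lam @ b - 0.5 * v @ np.linalg.solve(Q, v)

# Optimal multiplier from  -b + A Q^{-1} (A^T lam - c) = 0:
lam_star = np.linalg.solve(A @ np.linalg.solve(Q, A.T),
                           b + A @ np.linalg.solve(Q, c))
x_star = np.linalg.solve(Q, A.T @ lam_star - c)
print(lam_star, x_star, dual(lam_star))   # lam=[1], x=[0.5,0.5], theta=-1.5
```

Here the primal optimum is x* = (0.5, 0.5) with value −1.5, equal to θ(λ*): strong duality. For any other λ ≥ 0, θ(λ) is a strictly smaller lower bound (weak duality).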
Concavity of the dual

θ(λ) = min_{x∈X} f(x) + λᵀg(x)

where X is non-empty and compact; if f and the g_i are continuous, the minimum is attained (Weierstrass' theorem) and the Lagrange dual function is concave: for λ = αa + (1 − α)b with α ∈ [0, 1],

θ(λ) = min_x [α(f(x) + aᵀg(x)) + (1 − α)(f(x) + bᵀg(x))]
     ≥ α min_x (f(x) + aᵀg(x)) + (1 − α) min_x (f(x) + bᵀg(x))
     = αθ(a) + (1 − α)θ(b).
...
be the optimal solution of the restricted dual. Is it an Let T g (x)? Check: optimal dual solution? Is it true that z f (x) + we look for x , optimal solution of T g (x) min f (x) +
xX
is equivalent to
max z 0 z f (x) + T g (x) x X
otherwise the pair x , f ( x) is added to the restricted dual and a new solution is computed.
Optimality Conditions p. 62
Geometric programming

Unconstrained geometric program:

min_{x>0} Σ_{k=1}^m c_k ∏_{j=1}^n x_j^{α_kj},    α_kj ∈ R, c_k > 0

Transformed problem (y_j = log x_j):

min_y Σ_{k=1}^m c_k exp(Σ_j α_kj y_j) = Σ_{k=1}^m exp(α_kᵀy + β_k)

where β_k = log c_k. This is a convex problem, often stated through the (monotone) logarithm of the objective:

min_y log Σ_{k=1}^m exp(α_kᵀy + β_k)
Duality example

Dual of min log Σ_k exp y_k: with no constraints the Lagrange dual function is identical to f — strong duality holds, but is useless. Simple transformation: introduce y explicitly,

min log Σ_{k=1}^m exp y_k
s.t.  y_k = a_kᵀx + β_k,  k = 1, …, m

with Lagrangian

L(x, y; ν) = log Σ_k exp y_k + Σ_k ν_k (a_kᵀx + β_k − y_k)

Minimizing over x forces Aᵀν = 0 (otherwise the infimum is −∞). Minimizing over y, stationarity gives

ν_j = exp y_j / Σ_k exp y_k

so ν ≥ 0 and Σ_j ν_j = 1. Substituting ν_j = exp y_j / Σ_k exp y_k back into L, after some algebra the y-dependent terms reduce to the negative entropy of ν, and the dual function becomes

θ(ν) = βᵀν − Σ_j ν_j log ν_j   if ν ≥ 0, 1ᵀν = 1, Aᵀν = 0;   −∞ otherwise

The dual of an (unconstrained) geometric program is an entropy maximization problem.
Lagrange Dual

For min f(x) s.t. Ax ≥ b, with Lagrange function

L(x, λ) = f(x) + λᵀ(b − Ax)

the optimality conditions at a primal/dual pair (x*, λ*) are

∇f(x*) − Aᵀλ* = 0,   Ax* ≥ b,   λ* ≥ 0,   λ*ᵀ(b − Ax*) = 0

(complementarity). In the LP case f(x) = cᵀx this yields again the dual max λᵀb s.t. Aᵀλ = c, λ ≥ 0.

Non-negativity constraints: for min f(x) s.t. x ≥ 0, the KKT conditions with multipliers λ* ≥ 0 give ∇f(x*) = λ* and complementarity (λ*)ᵀx* = 0, from which

∂f(x*)/∂x_j = 0   for j : x_j* > 0
∂f(x*)/∂x_j ≥ 0   otherwise
Box constraints

min f(x)
s.t.  ℓ ≤ x ≤ u,   ℓ_i < u_i ∀i

KKT conditions: there exist λ, μ such that

∇f(x*) − λ + μ = 0
(ℓ − x*)ᵀλ = 0,   (x* − u)ᵀμ = 0
(λ, μ) ≥ 0

with feasibility ℓ ≤ x* ≤ u.
Simplex constraints

min f(x) s.t. Σ_j x_j = 1, x ≥ 0. KKT: there exists μ such that

∂f(x*)/∂x_j = μ   for all j : x_j* > 0
∂f(x*)/∂x_j ≥ μ   otherwise

(from complementarity, if x_j* > 0 the multiplier of the constraint x_j ≥ 0 is zero, so the partial derivatives over the support are all equal).

Optimal portfolio

The variance of a portfolio P(x) with covariance matrix Q is

Var = E(P(x) − E(P(x)))² = Σ_{i,j} Q_ij x_i x_j = xᵀQx

KKT for min xᵀQx s.t. Σ_j x_j = 1, x ≥ 0: for all j, k with x_j*, x_k* > 0,

Σ_i Q_ji x_i* = Σ_i Q_ki x_i*

The vector Qx might be thought of as the vector of marginal contributions to the total risk (which is a weighted sum of elements of Qx). Thus in the optimal portfolio, all assets held at positive level give equal (and minimal) contribution to the total risk.
Most common form for optimization algorithms — line search-based methods: given a starting point x₀, a sequence is generated:

x_{k+1} = x_k + α_k d_k

where d_k ∈ Rⁿ is the search direction and α_k > 0 the step. Usually d_k is chosen first and then the step is obtained, often from a 1-dimensional optimization.

Trust-region algorithms

A model m(x) and a confidence region U(x_k) containing x_k are defined. The new iterate is chosen as the solution of the constrained optimization problem

min m(x),   x ∈ U(x_k)

The model and the confidence region are possibly updated at each iteration.

Speed measures

Let x* be a local optimum. The error at x_k might be measured e.g. as

e(x_k) = ‖x_k − x*‖

Convergence rates:
- linear: e(x_k) ≤ qγᵏ for some q and some γ ∈ (0, 1)
- superlinear: for every γ ∈ (0, 1) there exists q such that e(x_k) ≤ qγᵏ
- order at least p: lim sup_k e(x_{k+1})/e(x_k)ᵖ < ∞; if p = 2, quadratic convergence
Examples: e(x_k) = 1/k and e(x_k) = 1/k² converge sublinearly (the error ratio tends to 1); e(x_k) = γᵏ, γ ∈ (0, 1), converges linearly; e(x_k) = γ^(2ᵏ) quadratically.
Descent directions. If dᵀ∇f(x_k) < 0, then, since

f(x_k + αd) = f(x_k) + α dᵀ∇f(x_k) + o(α),

if α is small enough f(x_k + αd) − f(x_k) < 0. NB: d might be a descent direction even if dᵀ∇f(x_k) = 0.

Algorithms for unconstrained local optimization p. 7

Global convergence requires the directions not to become asymptotically orthogonal to the gradient: if d_k ≠ 0,

|d_kᵀ∇f(x_k)| ≥ ε ‖∇f(x_k)‖ ‖d_k‖   (angle condition)

Recalling that

cos θ_k = −d_kᵀ∇f(x_k)/(‖d_k‖ ‖∇f(x_k)‖)

this means that the angle between d_k and −∇f(x_k) is bounded away from orthogonality. If ∇f(x_k) ≠ 0 for all k and the line search is accurate enough, then

lim_k d_kᵀ∇f(x_k)/‖d_k‖ = 0

which, together with the angle condition, forces ∇f(x_k) → 0.

Gradient Algorithms

General scheme:

x_{k+1} = x_k − α_k D_k ∇f(x_k),   D_k ≻ 0

Steepest Descent

or gradient method: D_k := I, i.e. x_{k+1} = x_k − α_k ∇f(x_k). If ∇f(x_k) ≠ 0, then d_k = −∇f(x_k) is a descent direction. Moreover, it is the steepest (w.r.t. the Euclidean norm): the normalized negative gradient solves

min_{d∈Rⁿ, ‖d‖≤1} ∇ᵀf(x_k)d
Newton's method

D_k := (∇²f(x_k))^{−1}. The Newton direction is the steepest descent direction in the norm induced by the Hessian: it solves

min_{d∈Rⁿ} ∇ᵀf(x_k)d   s.t.  dᵀ∇²f(x_k)d ≤ 1
Step choice

Given d_k, how to choose α_k in x_{k+1} = x_k + α_k d_k? Optimal choice (one-dimensional optimization):

α_k = arg min_{α≥0} f(x_k + αd_k).

An analytical expression of the optimal step is available only in a few cases, e.g. if f(x) = ½xᵀQx + cᵀx with Q ≻ 0. Then

f(x_k + αd_k) = ½(x_k + αd_k)ᵀQ(x_k + αd_k) + cᵀ(x_k + αd_k)
             = ½α² d_kᵀQd_k + α(Qx_k + c)ᵀd_k + const

Minimizing w.r.t. α:

α d_kᵀQd_k + (Qx_k + c)ᵀd_k = 0   ⇒   α_k = −∇f(x_k)ᵀd_k / (d_kᵀQd_k)

In general it is important to ensure a sufficient reduction of f and a sufficiently large step ‖x_{k+1} − x_k‖.
Armijo's rule

Input: δ ∈ (0, 1), γ ∈ (0, 1/2), s_k > 0
α := s_k
while f(x_k + αd_k) > f(x_k) + γα d_kᵀ∇f(x_k) do
    α := δα
end
return α

Typical values: δ ∈ [0.1, 0.5], γ ∈ [10⁻⁴, 10⁻³]. On exit the returned step is such that

f(x_k + αd_k) ≤ f(x_k) + γα d_kᵀ∇f(x_k)
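The rule above translates almost line-by-line into code. A minimal sketch (function and parameter names are mine):

```python
import numpy as np

def armijo(f, grad_f, x, d, s=1.0, delta=0.5, gamma=1e-4):
    """Backtracking line search: shrink the step by factor delta until the
    sufficient-decrease condition f(x+a d) <= f(x) + gamma a grad^T d holds."""
    slope = grad_f(x) @ d
    assert slope < 0, "d must be a descent direction"
    a = s
    while f(x + a * d) > f(x) + gamma * a * slope:
        a *= delta
    return a

# Example on f(x) = 0.5||x||^2 with the steepest descent direction d = -grad f
f = lambda x: 0.5 * x @ x
g = lambda x: x
x0 = np.array([3.0, -4.0])
a = armijo(f, g, x0, -g(x0))
print(a, f(x0 + a * (-g(x0))))   # the full step a = 1 is accepted, landing at 0
```

Termination is guaranteed for a descent direction because the condition must hold for α small enough; in practice one also caps the number of halvings.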
Acceptable steps

A step can also be chosen by interpolation: fit a quadratic model q(α) = c₀ + c₁α + c₂α² along the direction, using f(x_k), the directional derivative d_kᵀ∇f(x_k) and a third condition (a trial value, or an available estimate f̂ of the minimum of f along d_k, which gives c₂ = c₁²/(2(f̂ − c₀))), and take the minimizer of q.

If a sufficiently accurate step size is used, the conditions of the theorem on global convergence are satisfied and the steepest descent algorithm globally converges to a stationary point. "Sufficiently accurate" means exact line search or, e.g., Armijo's rule.
Analysis

Steepest descent with a fixed step α on f(x) = ½xᵀQx, Q ≻ 0 (minimum x* = 0):

x_{k+1} = (I − αQ)x_k,    ‖x_{k+1}‖² = x_kᵀ(I − αQ)² x_k

Let A be symmetric with eigenvalues λ₁ ≤ ⋯ ≤ λ_n. Then:
- λ is an eigenvalue of A iff −λ is an eigenvalue of −A, iff 1 + λ is an eigenvalue of I + A
- λ₁‖v‖² ≤ vᵀAv ≤ λ_n‖v‖² for all v ∈ Rⁿ

thus

x_kᵀ(I − αQ)² x_k ≤ max{(1 − αλ₁)², (1 − αλ_n)²} ‖x_k‖²

and

‖x_{k+1}‖ ≤ max{|1 − αλ₁|, |1 − αλ_n|} ‖x_k‖
Eliminating the dependency on k, the best fixed step minimizes the contraction factor:

min_α max{|1 − αλ₁|, |1 − αλ_n|}

Since 0 ≤ λ₁ ≤ λ_n, the two piecewise-linear functions |1 − αλ₁| and |1 − αλ_n| cross at the minimum point, where

1 − αλ₁ = −(1 − αλ_n)

[Figure: the functions |1 − αλ₁| and |1 − αλ_n| and their upper envelope]

i.e.

α* = 2/(λ₁ + λ_n)
Analysis

In the best possible case,

‖x_{k+1}‖/‖x_k‖ ≤ |1 − α*λ₁| = 1 − 2λ₁/(λ₁ + λ_n) = (λ_n − λ₁)/(λ_n + λ₁) = (κ − 1)/(κ + 1)

where κ = λ_n/λ₁ is the condition number of Q: κ ≫ 1 (ill-conditioned problem) ⇒ very slow convergence; κ ≈ 1 ⇒ very fast convergence.

Zigzagging

min ½(x² + My²)

where M > 0. Optimum: x* = 0, y* = 0. Starting point: (M, 1). With exact line search the iterates satisfy

(x_{k+1}, y_{k+1}) = ((M − 1)/(M + 1)) (x_k, −y_k)

Convergence is rapid if M ≈ 1; very slow, with zigzagging, if M ≫ 1 or M ≪ 1. [Figure: zigzagging iterates on elongated elliptical level sets] Slow convergence and zigzagging are general phenomena (especially when the starting point is near the longest axis of the ellipsoidal level sets).
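The exact contraction factor (M − 1)/(M + 1) can be observed directly by running steepest descent with the exact line-search step derived earlier:

```python
import numpy as np

# Steepest descent with exact line search on f(x,y) = 0.5*(x^2 + M*y^2).
M = 10.0
Q = np.diag([1.0, M])
z = np.array([M, 1.0])                 # starting point (M, 1)
ratios = []
for _ in range(20):
    g = Q @ z                          # gradient
    a = (g @ g) / (g @ Q @ g)          # exact step for a quadratic
    z_new = z - a * g
    ratios.append(np.linalg.norm(z_new) / np.linalg.norm(z))
    z = z_new
print(ratios[-1], (M - 1) / (M + 1))   # both ~ 0.8182 for M = 10
```

Every iteration contracts the error by exactly (M − 1)/(M + 1), so with M = 10 roughly 12 iterations are needed per decimal digit of accuracy.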
Newton's method: local convergence

Expand the gradient around x_k at a stationary point x* (∇f(x*) = 0):

0 = ∇f(x*) = ∇f(x_k) + ∇²f(x_k)(x* − x_k) + o(‖x* − x_k‖)

If ∇²f is invertible near x* with ‖(∇²f(x))^{−1}‖ ≤ M, multiplying by (∇²f(x_k))^{−1}:

x* − x_k + (∇²f(x_k))^{−1}∇f(x_k) = o(‖x* − x_k‖)

and thus, for the Newton iterate x_{k+1} = x_k − (∇²f(x_k))^{−1}∇f(x_k),

‖x* − x_{k+1}‖ = o(‖x* − x_k‖)

i.e. locally superlinear convergence (quadratic when the Hessian is Lipschitz).

Difficulties

Many things might go wrong:
- at some iteration, ∇²f(x_k) might be singular; for example, if x_k belongs to a flat region where f(x) is constant
- even if non-singular, inverting ∇²f(x_k) or, in any case, solving a linear system with coefficient matrix ∇²f(x_k) may be numerically unstable and is computationally demanding
- there is no guarantee that ∇²f(x_k) ≻ 0: the Newton direction might not be a descent direction
- Newton's method just tries to solve the system ∇f(x) = 0 and thus might very well be attracted towards a maximum; the method lacks global convergence: it converges only if started near a local optimum

Newton-type methods
- line search variant: x_{k+1} = x_k − α_k (∇²f(x_k))^{−1} ∇f(x_k)
- modified Newton method: replace ∇²f(x_k) by ∇²f(x_k) + D_k, where D_k is chosen so that ∇²f(x_k) + D_k is positive definite
Quasi-Newton methods

Consider solving the nonlinear system ∇f(x) = 0. Taylor expansion of the gradient:

∇f(x_k) ≈ ∇f(x_{k+1}) + ∇²f(x_{k+1})(x_k − x_{k+1})

Quasi-Newton equation

Let:

s_k := x_{k+1} − x_k
y_k := ∇f(x_{k+1}) − ∇f(x_k)

Quasi-Newton (secant) equation: B_{k+1}s_k = y_k. If B_k was the previous approximate Hessian, we ask that
1. the variation between B_k and B_{k+1} is small
2. nothing changes along directions which are normal to the step s_k:

B_k z = B_{k+1} z   ∀z : zᵀs_k = 0

Choosing n − 1 vectors z orthogonal to s_k, this condition plus the secant equation gives n² linearly independent equations in the n² unknowns ⇒ a unique solution.

Broyden updating

It can be shown that the unique solution is given by:

B_{k+1} = B_k + (y_k − B_k s_k)s_kᵀ / (s_kᵀs_k)

Proof sketch: B_{k+1} solves min ‖B − B_k‖_F s.t. Bs_k = y_k, where ‖X‖_F² = Tr XᵀX is the Frobenius norm. Indeed, for any B with Bs_k = y_k,

‖B_{k+1} − B_k‖_F = ‖(y_k − B_k s_k)s_kᵀ/(s_kᵀs_k)‖_F = ‖(B − B_k) s_k s_kᵀ/(s_kᵀs_k)‖_F ≤ ‖B − B_k‖_F

since s_k s_kᵀ/(s_kᵀs_k) is an orthogonal projection. Unicity is a consequence of the strict convexity of the norm and the convexity of the feasible region.
Symmetry

The Broyden update does not preserve symmetry. Remedy: let

C₁ = B_k + (y_k − B_k s_k)s_kᵀ/(s_kᵀs_k)

and symmetrize:

C₂ = ½(C₁ + C₁ᵀ)

C₂ in turn violates the secant equation; alternating correction and symmetrization, in the limit one obtains

B_{k+1} = B_k + [(y_k − B_k s_k)s_kᵀ + s_k(y_k − B_k s_k)ᵀ]/(s_kᵀs_k) − [s_kᵀ(y_k − B_k s_k)] s_k s_kᵀ/(s_kᵀs_k)²

(PSB, Powell-Broyden-Symmetric update). Imposing also hereditary positive definiteness, DFP (Davidon-Fletcher-Powell) is obtained:

B_{k+1} = B_k + [(y_k − B_k s_k)y_kᵀ + y_k(y_k − B_k s_k)ᵀ]/(y_kᵀs_k) − [s_kᵀ(y_k − B_k s_k)] y_k y_kᵀ/(y_kᵀs_k)²

BFGS

Same ideas, but applied to the approximate inverse Hessian. Inverse quasi-Newton equation:

s_k = H_{k+1} y_k

BFGS update:

H_{k+1} = (I − s_k y_kᵀ/(y_kᵀs_k)) H_k (I − y_k s_kᵀ/(y_kᵀs_k)) + s_k s_kᵀ/(y_kᵀs_k)

BFGS method:

x_{k+1} = x_k − α_k H_k ∇f(x_k)
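The BFGS update is a rank-two modification, cheap to apply. A sketch on a small quadratic with exact line search (on a quadratic, BFGS then terminates in at most n steps):

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation:
    H+ = (I - s y^T/(y^T s)) H (I - y s^T/(y^T s)) + s s^T/(y^T s)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Minimize f(x) = 0.5 x^T Q x - b^T x (gradient Qx - b) with BFGS
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x, H = np.zeros(2), np.eye(2)
for _ in range(10):
    g = Q @ x - b
    if np.linalg.norm(g) < 1e-12:
        break
    d = -H @ g
    a = -(g @ d) / (d @ Q @ d)        # exact line search for a quadratic
    x_new = x + a * d
    s, y = x_new - x, (Q @ x_new - b) - g
    if y @ s > 1e-12:                  # curvature condition keeps H positive definite
        H = bfgs_update(H, s, y)
    x = x_new
print(x, np.linalg.solve(Q, b))        # both equal the minimizer Q^{-1} b
```

The guard y ᵀs > 0 is what preserves hereditary positive definiteness; with an Armijo or Wolfe line search it must be checked explicitly.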
Trust region methods

min m_k(x)
s.t.  ‖x − x_k‖ ≤ Δ_k

where Δ_k > 0 is a parameter. First advantage (over pure Newton): the step is always defined (thanks to Weierstrass' theorem). How to choose and update the trust region radius Δ_k? Given a step s_k, let

ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k(0) − m_k(s_k))

the ratio between the actual reduction and the predicted reduction. The predicted reduction is always non-negative; if ρ_k is small (surely if it is negative), the model and the function strongly disagree: the step must be rejected and the trust region reduced. If ρ_k ≈ 1 it is safe to expand the trust region.

Algorithm (one standard version)

Data: Δ̄ > 0, Δ₀ ∈ (0, Δ̄), η ∈ [0, 1/4)
for k = 0, 1, … do
    find the step s_k minimizing the model in the trust region and compute ρ_k
    if ρ_k < 1/4 then Δ_{k+1} = Δ_k/4
    else if ρ_k > 3/4 and ‖s_k‖ = Δ_k then Δ_{k+1} = min(2Δ_k, Δ̄)
    else Δ_{k+1} = Δ_k
    accept the step (x_{k+1} = x_k + s_k) iff ρ_k > η
end

Characterization of the model minimizer: s solves the subproblem iff there exists μ ≥ 0 with

Bs + μs = −∇f(x_k),   μ(Δ − ‖s‖) = 0

Thus either s is in the interior of the ball of radius Δ, in which case μ = 0 and we have the (quasi-)Newton step p = −B_k^{−1}∇f(x_k), or ‖s‖ = Δ and, if μ > 0, μs = −∇f(x_k) − Bs = −∇m_k(s): s is parallel to the negative gradient of the model and normal to its contour lines.

Cauchy point: minimize the model along −∇f(x_k) within the ball. For the step size:
- if ∇f(x_k)ᵀB_k∇f(x_k) ≤ 0 (negative curvature direction): largest possible step, τ_k = 1 (to the boundary)
- otherwise τ_k = min{‖∇f(x_k)‖³/(Δ_k ∇f(x_k)ᵀB_k∇f(x_k)), 1}

Choosing the Cauchy point gives global but extremely slow convergence (similar to steepest descent). Usually an improved point is searched starting from the Cauchy one.
Pattern Search

For smooth optimization, but without knowledge of derivatives. Elementary idea: if x ∈ R² is not a local minimum of f, then at least one of the directions e₁, e₂, −e₁, −e₂ (moving towards E, N, W, S) forms an acute angle with −∇f(x) and is a descent direction. Direct search: explore all the directions in search of one which gives a descent.

Coordinate search

Let D = {±e_i} be the set of coordinate directions and their opposites.

Data: k = 0, α₀ an initial step length, x₀ a starting point
while α_k is large enough do
    if f(x_k + α_k d) < f(x_k) for some d ∈ D then
        x_{k+1} = x_k + α_k d   (step accepted)
    else
        α_{k+1} = 0.5 α_k
    end
    k = k + 1
end

Pattern search

It is not necessary to explore 2n directions. It is sufficient that the set of directions forms a positive span, i.e. every v ∈ Rⁿ should be expressible as a non-negative linear combination of the vectors in the set. Formally, G is a generating set iff

∀v ≠ 0 ∈ Rⁿ  ∃g ∈ G : vᵀg > 0

The quality of a generating set can be measured by its cosine measure, min_{v≠0} max_{g∈G} vᵀg/(‖v‖‖g‖). [Figure: examples of generating sets with different cosine measures]

Step choice

x_{k+1} = x_k + α_k d_k   if f(x_k + α_k d_k) < f(x_k) − ρ(α_k)   (success)
x_{k+1} = x_k             otherwise (failure)

where ρ(t) = o(t) is a forcing function, and

α_{k+1} = θ_k α_k

where θ_k ≥ 1 for successful iterations, θ_k < 1 otherwise. Direct methods possess good convergence properties.
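The coordinate-search loop above can be sketched in a few lines (names and tolerances are mine):

```python
import numpy as np

def coordinate_search(f, x, step=1.0, tol=1e-6, max_iter=10000):
    """Derivative-free coordinate search: try +/- each coordinate direction;
    halve the step when no direction gives a (strict) decrease."""
    n = len(x)
    D = np.vstack([np.eye(n), -np.eye(n)])    # the 2n coordinate directions
    for _ in range(max_iter):
        if step < tol:
            break
        for d in D:
            if f(x + step * d) < f(x):
                x = x + step * d              # success: accept the move
                break
        else:
            step *= 0.5                       # failure: shrink the step
    return x

f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2
x = coordinate_search(f, np.array([5.0, 5.0]))
print(x)   # ~ (1, -0.5)
```

Only function values are used; the price is up to 2n evaluations per iteration, which is why generating sets with fewer directions (e.g. n + 1 vectors) are attractive in higher dimension.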
Nelder-Mead Simplex

Given a simplex S = {v₁, …, v_{n+1}} in Rⁿ, let v_r be the worst point: r = arg max_i {f(v_i)}. Let C be the centroid of S ∖ {v_r}:

C = (1/n) Σ_{i≠r} v_i

The algorithm performs a sort of line search along the direction C − v_r. Let

R = C + (C − v_r)

be the reflection of the worst point along that direction, and let f_best be the best function value in the current simplex. Three cases might occur:

1: Reflection. Check f(R): if it is intermediate, i.e. better than the worst and worse than the best, then accept the reflection: discard the worst point in the simplex and replace it with R. [Figure: reflection step]

2: Expansion. If the trial step is an improvement, f(R) < f_best, then attempt an expansion: try to move R to R′ = R + (R − C). If successful (f(R′) < f(R)), accept the expansion and discard the worst point. If unsuccessful, accept R as the new point and discard the worst one. [Figure: expansion step]

3: Contraction. If the reflected point R is worse than all points in the simplex (possibly except the worst v_r), then a contraction step is performed: if f(R) > f(v_r), add 0.5(v_r + C). [Figure: contraction step]

Remarks. Nelder-Mead is not a direct search method (only a single direction at a time is explored). It is widely used by practitioners; however, it may fail to converge to a local minimum: there are examples of strictly convex functions in R² on which the method converges to a non-stationary point. The bad convergence properties are connected to the event that the n-dimensional simplex degenerates into a lower-dimensional space. Moreover, the method has a strong tendency to generate directions which are almost normal to that of the gradient! Convergent variants of the Nelder-Mead method do exist.
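Despite its theoretical weaknesses, Nelder-Mead is readily available in standard libraries and works well on many small smooth problems. A quick sketch on the classic Rosenbrock function:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: minimum at (1, 1); Nelder-Mead uses function values only
f = lambda x: (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2
res = minimize(f, x0=np.array([-1.2, 1.0]), method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-8, 'maxiter': 5000})
print(res.x)   # ~ (1, 1)
```

No gradient is supplied or estimated; the solver only compares function values at the simplex vertices.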
Implicit filtering

Let

f(x) = h(x) + w(x)

where h(x) is a smooth function, while w(x) can be considered as an additive, typically random, noise. The method performs a rough estimate of the gradient (finite differences with a large step) and proceeds with an Armijo line search. If unsuccessful, the step used for the finite differences is reduced.

Data: x₀, a scale sequence {δ_k} ↓ 0
repeat
    OuterIteration = false
    repeat
        compute f(x_k) and a central finite-difference estimate of the gradient:
        ∇_{δ_k} f(x_k) = [(f(x_k + δ_k e_i) − f(x_k − δ_k e_i))/(2δ_k)]_i
        if ‖∇_{δ_k} f(x_k)‖ ≤ δ_k then OuterIteration = true
        else run Armijo along −∇_{δ_k} f(x_k): if successful accept the Armijo step; otherwise OuterIteration = true
        k = k + 1
    until OuterIteration
until the scale is exhausted

Convergence properties

If
- ∇²h(x) is Lipschitz continuous
- the noise vanishes fast enough relative to the scale, where φ(x; δ) = sup_{z : ‖z − x‖ ≤ δ} |w(z)|
- unsuccessful Armijo steps occur at most a finite number of times

then all limit points of {x_k} are stationary (for the smooth part h).
Frank-Wolfe method

Let X be a convex set. Consider the problem:

min f(x),  x ∈ X

Let x_k ∈ X: choosing a feasible direction d_k corresponds to choosing a point x̂ ∈ X: d_k = x̂ − x_k. Steepest descent choice:

x̂_k ∈ arg min_{x∈X} ∇ᵀf(x_k)(x − x_k)

(a linear objective with convex constraints, usually easy to solve). If ∇ᵀf(x_k)(x̂_k − x_k) = 0, then

∇ᵀf(x_k)d ≥ 0

for every feasible direction d: first-order necessary conditions hold. Otherwise, d_k = x̂_k − x_k is a descent direction along which a step α_k ∈ (0, 1] might be chosen according to Armijo's rule.

Gradient projection

x_{k+1} = [x_k − s_k ∇f(x_k)]⁺ (projection onto X). The method is slightly faster than Frank-Wolfe, with a linear convergence rate similar to that of (unconstrained) steepest descent. It might be applied if projection is relatively cheap, e.g. when the feasible set is a box. A point x_k satisfies the first-order necessary conditions dᵀ∇f(x_k) ≥ 0 iff

x_k = [x_k − s_k ∇f(x_k)]⁺
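When X is the unit simplex, the Frank-Wolfe linearized subproblem has a trivial solution: put all the weight on the coordinate with the smallest gradient entry. A sketch with the classical diminishing step 2/(k + 2) (problem data are mine):

```python
import numpy as np

# Frank-Wolfe on min 0.5 x^T Q x - b^T x over the unit simplex.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])
x = np.array([0.5, 0.5])                # feasible starting point
for k in range(2000):
    g = Q @ x - b                       # gradient
    v = np.zeros_like(x)
    v[np.argmin(g)] = 1.0               # vertex minimizing the linear model
    d = v - x                           # feasible direction
    if g @ d >= -1e-10:                 # Frank-Wolfe gap ~ 0: stationary
        break
    x = x + 2.0 / (k + 2.0) * d         # classical diminishing step
print(x)   # ~ (0.25, 0.75) for this data
```

The iterates remain feasible by construction (convex combinations of vertices), and the quantity −gᵀd is the Frank-Wolfe gap, a computable optimality certificate.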
Barrier Methods

min f(x)
s.t.  g_j(x) ≤ 0,  j = 1, …, r

A barrier is a continuous function which tends to +∞ whenever x approaches the boundary of the feasible region. Examples of barrier functions:

B(x) = −Σ_j log(−g_j(x))   (logarithmic barrier)
B(x) = −Σ_j 1/g_j(x)       (inverse barrier)

Let ε_k ↓ 0 and x₀ be strictly feasible, i.e. g_j(x₀) < 0 ∀j. Then let

x_k = arg min_{x∈Rⁿ} f(x) + ε_k B(x)

Proposition: every limit point of {x_k} is a global minimum of the constrained optimization problem.

Multiplier estimates: if B(x) = Σ_j φ(g_j(x)), each x_k satisfies

∇f(x_k) + ε_k Σ_j φ′(g_j(x_k)) ∇g_j(x_k) = 0

- if lim_k g_j(x_k) < 0, then φ′(g_j(x_k)) ∇g_j(x_k) remains bounded and the corresponding coefficient ε_k φ′ → 0
- if lim_k g_j(x_k) = 0, then (thanks to the unicity of Lagrange multipliers)

λ_j = lim_k ε_k φ′(g_j(x_k))

yields the KKT multiplier of the j-th constraint.
Example

min (x − 1)² + (y − 1)²
s.t.  x + y ≤ 1

Logarithmic barrier problem:

min (x − 1)² + (y − 1)² − ε_k log(1 − x − y),   x + y < 1

Setting the gradient to zero (by symmetry x = y):

2(x − 1) + ε_k/(1 − x − y) = 0

Stationary points:

x = y = (3 − √(1 + 4ε_k))/4  →  (1/2, 1/2)   as ε_k → 0

Linear programming: with the logarithmic barrier on the constraint x ≥ 0,

min cᵀx − μ Σ_j log x_j
s.t.  Ax = b,  x > 0

The trajectory x(μ) of solutions to the barrier problem is called the central path and leads to an optimal solution of the LP.
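The closed-form barrier minimizer derived above can be verified numerically, both as a stationary point of the barrier function and in its limit as ε → 0:

```python
import numpy as np

# Log-barrier for min (x-1)^2 + (y-1)^2  s.t.  x + y <= 1.
# By symmetry x = y, and stationarity gives x = (3 - sqrt(1 + 4*eps)) / 4.
def barrier_min(eps):
    return (3.0 - np.sqrt(1.0 + 4.0 * eps)) / 4.0

for eps in [1.0, 0.1, 0.01, 1e-4]:
    xb = barrier_min(eps)
    # stationarity of phi(x) = 2(x-1)^2 - eps*log(1-2x) along the diagonal:
    dphi = 4.0 * (xb - 1.0) + 2.0 * eps / (1.0 - 2.0 * xb)
    print(eps, xb, dphi)   # dphi ~ 0 for every eps
# as eps -> 0 the barrier minimizers trace a path toward the solution (1/2, 1/2)
```

Each row shows a point of the central path; the derivative check confirms the formula, and the points approach the constrained optimum as ε shrinks.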
Penalty Methods

Penalized problem:

min f(x) + ε P(x)

where, e.g., for equality constraints P(x) = Σ_i h_i(x)². Solve the penalized problem with ε = ε_k (with an iterative method initialized at x_k); then let ε_{k+1} > ε_k, k := k + 1. If x_{k+1} is a global minimizer of the penalized problem and ε_k → ∞, then every limit point of {x_k} is a global optimum of the constrained problem.

Exact penalties

Exact penalties: there exists a finite penalty parameter value such that the optimal solution of the penalized problem is the optimal solution of the original one. ℓ₁ penalty function for the problem

min f(x)
s.t.  h_i(x) = 0
      g_j(x) ≤ 0

is

P₁(x; ε) = f(x) + ε (Σ_i |h_i(x)| + Σ_j max(0, g_j(x)))
Motivation

Augmented Lagrangian for min f(x) s.t. h(x) = 0:

L_c(x, λ) = f(x) + λ^T h(x) + (c/2) ‖h(x)‖²

Gradient:

∇_x L_c(x, λ) = ∇f(x) + ∑_i λ_i ∇h_i(x) + c ∇h(x) h(x) = ∇_x L(x, λ) + c ∇h(x) h(x)

where ∇h(x) is the matrix whose columns are the ∇h_i(x).
Let (x*, λ*) be an optimal (primal and dual) pair. Necessarily ∇_x L(x*, λ*) = 0; moreover h(x*) = 0, thus

∇_x L_c(x*, λ*) = ∇_x L(x*, λ*) + c ∇h(x*) h(x*) = 0

i.e. (x*, λ*) is a stationary point of the augmented Lagrangian for every c ≥ 0.
Observe that:

∇²_xx L_c(x, λ) = ∇²_xx L(x, λ) + c ∇h(x) ∇h(x)^T + c ∑_i h_i(x) ∇²h_i(x)

so that, wherever h(x) = 0,

∇²_xx L_c(x, λ) = ∇²_xx L(x, λ) + c ∇h(x) ∇h(x)^T
For every v ≠ 0 such that v^T ∇h(x*) = 0 (second-order sufficient conditions give v^T ∇²_xx L(x*, λ*) v > 0 for such v):

v^T ∇²_xx L_c(x*, λ*) v = v^T ∇²_xx L(x*, λ*) v + c ‖∇h(x*)^T v‖² = v^T ∇²_xx L(x*, λ*) v > 0
Let v ≠ 0 with v^T ∇h(x*) ≠ 0. Then

v^T ∇²_xx L_c(x*, λ*) v = v^T ∇²_xx L(x*, λ*) v + c ‖∇h(x*)^T v‖²

where the first term might be negative; however, the second is strictly positive and proportional to c. Thus, if c is large enough, the Hessian of the augmented Lagrangian is positive definite and x* is a (strict) local minimum of L_c(·, λ*).
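The method of multipliers built on this idea can be sketched as below; the inner gradient-descent solver, step sizes and test problem are illustrative assumptions of mine:

```python
import numpy as np

def method_of_multipliers(grad_f, h, grad_h, x0, c=10.0, outer=20):
    """Method-of-multipliers sketch for min f(x) s.t. h(x) = 0 (one constraint).

    Inner step: minimize L_c(x, lam) = f(x) + lam*h(x) + (c/2)*h(x)^2 by
    gradient descent; outer step: first-order multiplier update lam += c*h(x).
    """
    x = np.asarray(x0, dtype=float)
    lam = 0.0
    for _ in range(outer):
        for _ in range(5000):
            g = grad_f(x) + (lam + c * h(x)) * grad_h(x)
            if np.linalg.norm(g) < 1e-12:
                break
            x = x - 0.01 * g
        lam += c * h(x)
    return x, lam

# Example: min x1^2 + x2^2 s.t. x1 + x2 = 1; optimum (1/2, 1/2), multiplier -1.
x, lam = method_of_multipliers(lambda x: 2 * x,
                               lambda x: x[0] + x[1] - 1,
                               lambda x: np.array([1.0, 1.0]),
                               x0=[0.0, 0.0])
```

Unlike the pure penalty method, c stays fixed here: the multiplier update does the work, avoiding the ill-conditioning of μ → ∞.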
Inequality constraints
Given the problem

min f(x)  s.t.  h_i(x) = 0 (i = 1, …, m),  g_j(x) ≤ 0 (j = 1, …, p)

each inequality can be turned into an equality by adding a squared slack variable z_j:

g_j(x) + z_j² = 0
Minimizing the augmented Lagrangian with respect to the slack variables can be done in closed form. With u_j = z_j² ≥ 0, for each j:

min_{u_j ≥ 0} [ μ_j (g_j(x) + u_j) + (c/2)(g_j(x) + u_j)² ]

whose solution is

u_j = max{ 0, −g_j(x) − μ_j/c }

Substituting back:

L_c(x; λ, μ) = f(x) + λ^T h(x) + (c/2)‖h(x)‖² + (1/2c) ∑_j ( max{0, μ_j + c g_j(x)}² − μ_j² )
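The slack-eliminated form above leads directly to the multiplier update μ_j ← max{0, μ_j + c g_j(x)}. A minimal sketch for a single inequality (solver and test problem are my own assumptions):

```python
import numpy as np

def al_inequality(grad_f, g, grad_g, x0, c=10.0, outer=25):
    """Augmented-Lagrangian sketch for min f(x) s.t. g(x) <= 0 (one inequality).

    The x-gradient of the slack-eliminated term
    (1/2c)(max(0, mu + c*g(x))^2 - mu^2) is max(0, mu + c*g(x)) * grad_g(x).
    """
    x = np.asarray(x0, dtype=float)
    mu = 0.0
    for _ in range(outer):
        for _ in range(5000):
            grad = grad_f(x) + max(0.0, mu + c * g(x)) * grad_g(x)
            if np.linalg.norm(grad) < 1e-12:
                break
            x = x - 0.01 * grad
        mu = max(0.0, mu + c * g(x))   # multiplier update
    return x, mu

# Example: min (x-2)^2 s.t. x <= 1 (g(x) = x - 1); optimum x* = 1, mu* = 2.
x, mu = al_inequality(lambda x: 2 * (x - 2), lambda x: float(x[0] - 1),
                      lambda x: np.array([1.0]), x0=[0.0])
```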
Idea: apply Newton's method to solve the KKT equations. Lagrangian function:

L(x; λ) = f(x) + ∑_i λ_i h_i(x)

Newton step:

(x_{k+1}; λ_{k+1}) = (x_k; λ_k) + (d_k; δ_k)

where (d_k; δ_k) solves the linear system

[ ∇²_xx L(x_k; λ_k)   H^T(x_k) ] [ d_k ]     [ ∇f(x_k) + H^T(x_k) λ_k ]
[ H(x_k)                    0  ] [ δ_k ] = − [ h(x_k)                 ]

and H(x_k) denotes the Jacobian of the constraints at x_k.
existence

The Newton step exists if:
- the Jacobian of the constraints H(x_k) has full row rank;
- the Hessian ∇²_xx L(x_k; λ_k) is positive definite (on the null space of H(x_k)).

In this case the Newton step is the unique solution of the quadratic programming problem

min_d ∇f(x_k)^T d + (1/2) d^T ∇²_xx L(x_k; λ_k) d   s.t.  h(x_k) + H(x_k) d = 0

which minimizes a quadratic approximation of the Lagrangian subject to a first-order approximation of the constraints. KKT conditions of the QP:

∇²_xx L(x_k; λ_k) d + ∇f(x_k) + H^T(x_k) λ = 0,   h(x_k) + H(x_k) d = 0

Under the same conditions as before this QP has a unique solution d_k, with Lagrange multipliers λ_{k+1}.
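The KKT linear system above can be assembled and solved directly; a sketch on an equality-constrained QP (for which a single Newton step is exact — the example problem is my own choice):

```python
import numpy as np

def newton_kkt_step(hess_L, grad_f, H, h_val, lam):
    """One Newton step on the KKT system for min f(x) s.t. h(x) = 0.

    Solves [W  H^T; H  0][d; dlam] = -[grad_f + H^T lam; h]
    and returns (d, lam_next) with lam_next = lam + dlam.
    """
    m, n = H.shape
    K = np.block([[hess_L, H.T], [H, np.zeros((m, m))]])
    rhs = -np.concatenate([grad_f + H.T @ lam, h_val])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], lam + sol[n:]

# Example: min (1/2) x^T x  s.t.  x1 + x2 = 1 (solution (1/2, 1/2), lambda = -1/2).
x = np.zeros(2)
lam = np.zeros(1)
d, lam = newton_kkt_step(hess_L=np.eye(2), grad_f=x,
                         H=np.array([[1.0, 1.0]]), h_val=np.array([-1.0]), lam=lam)
x = x + d
```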
Inequalities

If the original problem is

min f(x)  s.t.  h_i(x) = 0, g_j(x) ≤ 0

the QP subproblem also includes the linearized inequalities g_j(x_k) + ∇g_j(x_k)^T d ≤ 0 (Sequential Quadratic Programming).
Filter Methods
Basic idea: the problem

min f(x)  s.t.  g(x) ≤ 0

can be considered as a problem with two objectives: minimize f(x) and minimize the constraint violation h(x) (the second objective has priority over the first).
Filter
Given the problem

min f(x)  s.t.  g_j(x) ≤ 0, j = 1, …, k

let {(f_k, h_k), k = 1, 2, …} be the observed values of f and h at the points x_1, x_2, …, where the constraint violation is measured by

h(x) = ‖ max{g(x), 0} ‖

(the norm is used here in order to keep the subproblem a QP). A pair (f_k, h_k) dominates a pair (f_ℓ, h_ℓ) iff

f_k ≤ f_ℓ  and  h_k ≤ h_ℓ

The filter is the list of currently non-dominated pairs; a trial point is acceptable iff it is not dominated by any pair in the filter. Compare with traditional (unconstrained) trust region methods: if the current step is a failure, the trust region is reduced; eventually the step becomes a pure gradient step ⇒ convergence!
Filter methods
Data: x_0 (starting point), initial trust-region radius ρ, k = 0.

repeat
    Solve the QP subproblem (with trust region ρ), in which the constraints are linearized:
        g_j(x_k) + ∇g_j(x_k)^T p ≤ 0
    and get a step d_k; try setting x_{k+1} = x_k + d_k;
    if (f_{k+1}, h_{k+1}) is acceptable to the filter then
        accept x_{k+1} and add (f_{k+1}, h_{k+1}) to the filter;
        remove dominated points from the filter;
        possibly increase ρ;
    else
        reject the step and reduce ρ;
    set k := k + 1;
end

Algorithms for constrained local optimization p. 43
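The filter bookkeeping (dominance test, acceptability, insertion with pruning) can be sketched as plain pair comparisons; the sample pairs below are illustrative:

```python
def dominates(a, b):
    """Pair a = (f_a, h_a) dominates b iff it is no worse in both entries."""
    return a[0] <= b[0] and a[1] <= b[1]

def acceptable(pair, filt):
    """A trial pair is acceptable iff no filter entry dominates it."""
    return not any(dominates(entry, pair) for entry in filt)

def add_to_filter(pair, filt):
    """Insert an acceptable pair and drop the entries it dominates."""
    return [entry for entry in filt if not dominates(pair, entry)] + [pair]

filt = []
for pair in [(5.0, 2.0), (4.0, 3.0), (3.0, 1.0), (6.0, 0.5)]:
    if acceptable(pair, filt):
        filt = add_to_filter(pair, filt)
```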
min f(x)  s.t.  x ∈ S

and

x* = arg min f(x):  f(x*) ≤ f(x) ∀x ∈ S

This definition is unsatisfactory: the problem is ill posed in x* (two objective functions which differ only slightly might have global optima which are arbitrarily far apart); it is however well posed in the optimal values:

‖f − g‖_∞ ≤ ε  ⇒  |f* − g*| ≤ ε

Quite often we are satisfied with looking for f* and searching for one or more feasible solutions x̃ such that

f(x̃) ≤ f(x*) + ε
Complexity
A similar situation holds in Bertsekas, Nonlinear Programming (1999): 777 pages, but only the definition of global minima and maxima is given! Nocedal & Wright, Numerical Optimization, 2nd edition, 2006: "Global solutions are needed in some applications, but for many problems they are difficult to recognize and even more difficult to locate … many successful global optimization algorithms require the solution of many local optimization problems, to which the algorithms described in this book can be applied." Global optimization is hopeless: without global information no algorithm will find a certifiable global optimum unless it generates a dense sample. There exists a rigorous definition of global information; some examples:
- the number of local optima;
- the global optimum value;
- for global optimization problems over a box, (an upper bound on) the Lipschitz constant L:

|f(y) − f(x)| ≤ L ‖x − y‖  ∀x, y

- an explicit representation of the objective function as the difference between two convex functions (+ convexity of the feasible region).
Complexity
Global optimization is computationally intractable also according to classical complexity theory. Special cases: Quadratic Programming

min (1/2) x^T Q x + c^T x  s.t.  l ≤ Ax ≤ u

is NP-hard [Sahni, 1974] and, when considered as a decision problem, NP-complete [Vavasis, 1990]. Hard special cases include: quadratic optimization on a hyper-rectangle (A = I), even when only one eigenvalue of Q is negative; quadratic minimization over a simplex:

min (1/2) x^T Q x + c^T x  s.t.  x ≥ 0, ∑_j x_j = 1
or binary linear programming:

min c^T x  s.t.  Ax = b, x ∈ {0, 1}^n
Introduction to Global Optimization p. 11
Minimization of cost functions which are neither convex nor concave, e.g. finding the minimum-energy conformation of complex molecules: Lennard-Jones micro-clusters, protein folding, protein-ligand docking. Example (Lennard-Jones): the pair potential due to two atoms at X_1, X_2 ∈ R³ is

v(r) = 1/r¹² − 2/r⁶

where r = ‖X_1 − X_2‖. The total energy of a cluster of N atoms located at X_1, …, X_N ∈ R³ is defined as:

LJ(X) = ∑_{i=1}^N ∑_{j<i} v(‖X_i − X_j‖)

This function has a number of local (non-global) minima which grows like exp(N).
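The cluster energy is straightforward to evaluate (the atom coordinates below are illustrative):

```python
import math
from itertools import combinations

def v(r):
    """Lennard-Jones pair potential, normalized so that the minimum is v(1) = -1."""
    return 1.0 / r**12 - 2.0 / r**6

def lj_energy(atoms):
    """Total cluster energy: sum of pair potentials over all atom pairs."""
    return sum(v(math.dist(a, b)) for a, b in combinations(atoms, 2))

# Two atoms at the optimal pair distance r = 1 give energy -1;
# an equilateral triangle with unit sides gives three such pairs.
pair = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, math.sqrt(3) / 2, 0.0)]
```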
Lennard-Jones potential

[figure omitted: the repulsive term 1/r¹², the attractive term −2/r⁶ and the resulting pair potential v(r), with minimum v(1) = −1]

In more general molecular force fields the total energy also contains bonded terms, e.g.

(1/2) ∑_i K_i^b (r_i − r_i⁰)²   (bond lengths)
(1/2) ∑_i K_i^θ (θ_i − θ_i⁰)²   (bond angles)

plus periodic terms in the dihedral angles.
Docking

Given two macro-molecules M1, M2, find their minimal energy coupling. If no bonds are changed, to find the optimal docking it is sufficient to minimize E_v + E_e, where

E_e = (1/2) ∑_{i∈M1, j∈M2} q_i q_j / R_ij   (Coulomb interaction)

and E_v is a Lennard-Jones-like van der Waals term over the pairs (i, j) ∈ C of atoms of the two molecules:

E_v = ∑_{(i,j)∈C} ( α_1/‖X_i − X_j‖¹² − α_2/‖X_i − X_j‖⁶ )

This is a highly structured problem. But is it easy/convenient to use its structure? And how?
LJ is d.c.

The map

F1 : R^{3N} → R_+^{N(N−1)/2},   F1(X_1, …, X_N) = ( ‖X_1 − X_2‖², …, ‖X_{N−1} − X_N‖² )

maps atom positions into the squared pair distances; composing it with a suitable map F2(r_12, …, r_{N−1,N}) which is the difference between two convex functions, LJ(X) itself can be seen as the difference between two convex functions (a d.c. programming problem). NB: every C² function is d.c., but often its d.c. decomposition is not known. D.C. optimization is very elegant and there exists a nice duality theory, but algorithms are typically very inefficient. A d.c. problem min g(x) − h(x) is equivalent to

min z − w  s.t.  g(x) ≤ z, h(x) ≥ w
Canonical d.c. problem:

min c^T x  s.t.  g(x) ≤ 0, h(x) ≥ 0

with g, h convex; let C = {x : h(x) ≤ 0}. Hypotheses: 0 ∈ int C and c^T x > 0 for every feasible x outside int C. Fundamental property: if a d.c. problem admits an optimum, at least one optimum belongs to ∂C, the boundary of C.
Assume g(0) < 0, h(0) < 0 and c^T x > 0 for every feasible x. Let x̄ be a solution to the convex problem

min c^T x  s.t.  g(x) ≤ 0

If h(x̄) ≥ 0 then x̄ solves the d.c. problem. Otherwise c^T x > c^T x̄ for all feasible x. Coordinate transformation y = x − x̄:

min c^T y  s.t.  ĥ(y) ≥ 0, ĝ(y) ≤ 0

where ĝ(y) = g(y + x̄) and ĥ(y) = h(y + x̄). Then c^T y > 0 for all feasible solutions and ĥ(0) < 0; by continuity it is possible to choose the new origin so that also ĝ(0) < 0.
Let x̄ be the best known solution and

D(x̄) = {x : c^T x ≤ c^T x̄}

If D(x̄) ⊆ C then x̄ is optimal. Check: a polytope P (with known vertices) is built which contains D(x̄). If all vertices of P are in C ⇒ optimal solution. Otherwise let V be the vertex most violating C; the intersection of the segment [0, V] with ∂C, if feasible, is an improving point x̂; otherwise a cut tangent to ∂C at x̂ is introduced in P. [figure omitted]
Initialization

Choose a polytope P (with known vertices) containing D(x̄), i.e. such that

y ∈ P  for every y with c^T y ≤ c^T x̄

[figure omitted: initial polytope P with vertex V]
Step 1
Let V be the vertex of P with the largest h(·) value. Surely h(V) > 0 (otherwise we stop with an optimal solution). Moreover h(0) < 0 (0 is in the interior of C), thus the line from V to 0 must intersect the boundary of C; let

x_k = ∂C ∩ [V, 0]

be the intersection point. It might be feasible (improving) or not. [figure omitted]
If x_k is feasible, set x̄ := x_k (an improved incumbent); otherwise a cut tangent to ∂C at x_k is added to P and the procedure is repeated. [figures omitted]
The problem

min_y h*(y) − g*(y)

(where * denotes the convex conjugate) is the Fenchel–Rockafellar dual of min_x g(x) − h(x). If min g(x) − h(x) admits an optimum, then the Fenchel dual is a strong dual (the two optimal values coincide).
A primal/dual algorithm

If x* ∈ arg min_x g(x) − h(x) then every u* ∈ ∂h(x*) is dual optimal (∂ denotes the subdifferential), and if u* ∈ arg min_u h*(u) − g*(u) then every x* ∈ ∂g*(u*) is primal optimal. This suggests alternating between the linearized subproblems

P_k: min_x g(x) − ( h(x_k) + (x − x_k)^T y_k )

D_k: min_y h*(y) − ( g*(y_{k−1}) + x_k^T (y − y_{k−1}) )
GlobOpt - relaxations
Consider the global optimization problem (P):

min_{x∈X} f(x)

and assume the minimum exists and is finite, and that we can use a relaxation (R):

min_{y∈Y} g(y)

Usually both X and Y are subsets of the same space R^n. Recall: (R) is a relaxation of (P) iff

X ⊆ Y  and  g(y) ≤ f(y) ∀y ∈ X
Tools
- good relaxations: easy to solve, yet accurate;
- good upper bounding, i.e., good heuristics for (P).

Good relaxations can be obtained, e.g., through convex relaxations and domain reduction.
Convex relaxations
Assume X is convex and Y = X. If g is the convex envelope of f on X, then solving the convex relaxation (R) gives, in one step, the certified global optimum of (P). g(x) is a convex under-estimator of f on X if:
- g(x) is convex;
- g(x) ≤ f(x) for all x ∈ X.

g is the convex envelope of f on X if:
- g is a convex under-estimator of f;
- g(x) ≥ h(x) for every convex under-estimator h of f.
A 1-D example: [figure omitted: a nonconvex function on an interval X, its convex under-estimator, and the branching of X into subintervals]

Bounding
Let

min_{x∈S} f(x)

be a GlobOpt problem where f is convex while S is non-convex. A relaxation (outer approximation) is obtained by replacing S with a larger set Q ⊇ S; if Q is convex ⇒ a convex optimization problem. If the optimal solution of

min_{x∈Q} f(x)

is feasible for the original problem, the node is fathomed; otherwise its optimal value is a lower bound, to be compared with upper bounds (values of feasible points) in a branch-and-bound scheme.
Example

min −x − 2y  s.t.  xy ≤ 3,  x ∈ [0, 5], y ∈ [0, 3]

Relaxation: we know that

(x + y)² = x² + y² + 2xy,  thus  xy = ((x + y)² − x² − y²)/2

On the box x² ≤ 5x and y² ≤ 3y, so the constraint xy ≤ 3 implies the convex relaxed constraint

(x + y)² − 5x − 3y ≤ 6

[figure omitted: feasible region and relaxation]
Stronger Relaxation

Since x ≤ 5 and y ≤ 3,

(5 − x)(3 − y) ≥ 0  ⟹  15 − 3x − 5y + xy ≥ 0  ⟹  xy ≥ 3x + 5y − 15

Combined with xy ≤ 3 this gives the valid linear cut

3x + 5y ≤ 18

The optimal solution of the resulting convex (linear) relaxation is (1, 3), which is feasible ⇒ optimal for the original problem.
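A brute-force check of the example (the grid resolution is an arbitrary choice of mine): the cut excludes no feasible point, and maximizing x + 2y (i.e. minimizing −x − 2y) over the cut plus box picks (1, 3).

```python
import itertools

def feasible(x, y):
    """Original nonconvex constraint on the box: xy <= 3."""
    return x * y <= 3.0

def in_cut(x, y):
    """Linear cut obtained from (5-x)(3-y) >= 0 combined with xy <= 3."""
    return 3 * x + 5 * y <= 18.0 + 1e-12

xs = [i / 20 for i in range(101)]   # x grid on [0, 5]
ys = [j / 20 for j in range(61)]    # y grid on [0, 3]

# Validity: no feasible grid point violates the cut.
valid = all(in_cut(x, y) for x, y in itertools.product(xs, ys) if feasible(x, y))

# The relaxation's maximizer of x + 2y is (1, 3), which satisfies xy <= 3.
best = max((p for p in itertools.product(xs, ys) if in_cut(*p)),
           key=lambda p: p[0] + 2 * p[1])
```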
Convex envelopes
Definition: a function is polyhedral if it is the pointwise maximum of a finite number of affine functions. (NB: in general, the convex envelope is the pointwise supremum of affine minorants.) The generating set X(f) of a function f over a convex set P is defined as follows: build the convex envelope of f on P and consider its epigraph {(x, y) : x ∈ P, y ≥ Conv f(x)}; this is a convex set, whose extreme points are denoted by V; X(f) is the set of x-coordinates of the points in V.
Characterization
Let f(x) be continuously differentiable on a polytope P. The convex envelope of f on P is polyhedral if and only if

X(f) = Vert(P)

(the generating set is the vertex set of P). Corollary: let f_1, …, f_m ∈ C¹(P), each possessing a polyhedral convex envelope on P. Then

Conv( ∑_i f_i )(x) = ∑_i Conv f_i(x)

iff the generating set of ∑_i Conv f_i(x) is Vert(P).
Characterization
If f(x) is such that Conv f(x) is polyhedral, then there exists an affine function h(x) such that:
1. h(x) ≤ f(x) for all x ∈ Vert(P);
2. there exist n + 1 affinely independent vertices of P, V_1, …, V_{n+1}, such that

f(V_i) = h(V_i),  i = 1, …, n + 1
Characterization
The condition may be reversed: given m affine functions h_1, …, h_m such that, for each of them:
1. h_j(x) ≤ f(x) for all x ∈ Vert(P);
2. there exist n + 1 affinely independent vertices of P, V_1, …, V_{n+1}, such that

f(V_i) = h_j(V_i),  i = 1, …, n + 1

then the function Φ(x) = max_j h_j(x) is the convex envelope of f iff the generating set of Φ is Vert(P) and, for every vertex V_i, Φ(V_i) = f(V_i).
Sufficient condition
If f(x) is lower semi-continuous on P and, for every x̄ ∈ Vert(P), there exists a line ℓ_x̄ through x̄ intersecting the interior of P such that f is concave in a neighborhood of x̄ along ℓ_x̄, then Conv f(x) is polyhedral. Application: let

f(x) = ∑_{i,j} α_ij x_i x_j

The sufficient condition holds for f on [0,1]^n ⇒ bilinear forms are polyhedral on a hypercube. For a single term xy on the box [ℓ_x, u_x] × [ℓ_y, u_y], an affine under-estimator is

xy ≥ ℓ_y x + ℓ_x y − ℓ_x ℓ_y
Bilinear terms

Conv(xy)(x, y) = max{ ℓ_y x + ℓ_x y − ℓ_x ℓ_y ;  u_y x + u_x y − u_x u_y }

No other (polyhedral) function underestimating xy is tighter. In fact ℓ_y x + ℓ_x y − ℓ_x ℓ_y belongs to the convex envelope: it underestimates xy and coincides with it at 3 vertices, (ℓ_x, ℓ_y), (ℓ_x, u_y), (u_x, ℓ_y); analogously for the other affine function. All vertices are interpolated by these two underestimating planes ⇒ they form the convex envelope of xy. A general bilinear form ∑ α_ij x_i x_j is polyhedral on a hypercube (easy to see), but we cannot guarantee in general that the generating set of the envelope of the sum is the vertex set of the hypercube (in particular, if the α's have opposite signs); moreover, if the set is not a hypercube, even a single bilinear term might be non-polyhedral: e.g. xy on the triangle {0 ≤ x ≤ y ≤ 1}. Finding the (polyhedral) convex envelope of a bilinear form on a generic polytope P is NP-hard!
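The two-plane envelope (often called the McCormick under-estimator) is a one-liner; the box [0,5] × [0,3] below is just an illustrative choice:

```python
def mccormick_under(x, y, lx, ux, ly, uy):
    """Convex envelope (tightest convex under-estimator) of x*y on a box."""
    return max(ly * x + lx * y - lx * ly,
               uy * x + ux * y - ux * uy)

# On [0,5] x [0,3] the envelope touches x*y at every box vertex ...
corners = [(0, 0), (0, 3), (5, 0), (5, 3)]
touch = all(mccormick_under(x, y, 0, 5, 0, 3) == x * y for x, y in corners)

# ... and strictly underestimates it in the interior (e.g. at the centre).
gap = 2.5 * 1.5 - mccormick_under(2.5, 1.5, 0, 5, 0, 3)
```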
Fractional terms
A convex under-estimator of a fractional term x/y over a box [ℓ_x, u_x] × [ℓ_y, u_y] (with ℓ_y > 0) can be obtained by introducing an auxiliary variable w and bounds such as

w ≥ x/u_y + ℓ_x/y − ℓ_x/u_y   if ℓ_x ≥ 0

with analogous expressions in the remaining sign cases for x.
αBB under-estimator:

Φ(x) = f(x) − ∑_{i=1}^n α_i (x_i − ℓ_i)(u_i − x_i)

Key properties:
- Φ(x) ≤ f(x) on [ℓ, u];
- Φ is convex if the α_i are large enough;
- Φ interpolates f at all vertices of [ℓ, u].

How to choose the α_i's? One possibility is the uniform choice α_i = α; in this case convexity of Φ is obtained iff

α ≥ max{ 0, −(1/2) min_{x∈[ℓ,u]} λ_min(∇²f(x)) }

Maximum separation:

max_x ( f(x) − Φ(x) ) = (1/4) ∑_i α_i (u_i − ℓ_i)²

Estimation of α: compute an interval Hessian [H], with [H(x)]_ij = [h^L_ij, h^U_ij] on [ℓ, u], and find α such that [H] + 2 diag(α) ⪰ 0. By the Gerschgorin theorem, extended to interval matrices, it suffices to take

α_i ≥ max{ 0, −(1/2) ( h^L_ii − ∑_{j≠i} max(|h^L_ij|, |h^U_ij|) (u_j − ℓ_j)/(u_i − ℓ_i) ) }
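A minimal sketch of the αBB construction, assuming a pre-computed lower bound on the smallest Hessian eigenvalue over the box (the concave test function is my own choice):

```python
import numpy as np

def alpha_bb_under(f, hess_eig_min, lo, hi):
    """Build the alphaBB under-estimator Phi(x) = f(x) - alpha * sum (x-lo)*(hi-x).

    hess_eig_min: a lower bound on the smallest eigenvalue of the Hessian of f
    over the box; the uniform choice is alpha = max(0, -hess_eig_min / 2).
    """
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    alpha = max(0.0, -0.5 * hess_eig_min)
    def phi(x):
        x = np.asarray(x, float)
        return f(x) - alpha * np.sum((x - lo) * (hi - x))
    return phi, alpha

# Example: f(x) = -x1^2 - x2^2 (concave) on [0,1]^2; Hessian eigenvalues are -2,
# so alpha = 1; Phi interpolates f at the vertices and is convex (here, affine).
f = lambda x: -float(x @ x)
phi, alpha = alpha_bb_under(f, hess_eig_min=-2.0, lo=[0, 0], hi=[1, 1])
```

Note that the maximum separation formula predicts a gap of (1/4)(1 + 1)·α = 1/2 at the box centre.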
Improvements
New relaxation functions (other than quadratic). Example:

Φ(x; γ) = f(x) − ∑_{i=1}^n (1 − e^{γ_i (x_i − ℓ_i)}) (1 − e^{γ_i (u_i − x_i)})

gives a tighter under-estimate than the quadratic function. Partitioning: partition the domain into a small number of regions (hyper-rectangles); evaluate a convex under-estimator in each region; join the under-estimators to form a single convex function on the whole domain. The resulting bounds are computed for all i = 1, …, n, and the constraints x ∈ [ℓ, u] are then added to the problem (or to the sub-problems generated during Branch & Bound).
Feasibility Based RR
If S is a polyhedron, RR (range reduction) requires the solution of LPs:

[ℓ'_j, u'_j] = min / max { x_j : Ax ≤ b, x ∈ [L, U] }
Optimality Based RR
Given an incumbent solution x̄ ∈ S, ranges are updated by solving the sequence:

ℓ_i = min { x_i : f̂(x) ≤ f(x̄), x ∈ S },   u_i = max { x_i : f̂(x) ≤ f(x̄), x ∈ S }

where f̂(x) is a convex under-estimator of f on the current domain. RR can be applied iteratively (i.e., at the end of a complete RR sequence, we might start a new one using the new bounds).

Poor man's LP-based RR: from every constraint ∑_j a_ij x_j ≤ b_i in which a_ij > 0 we obtain

x_j ≤ (1/a_ij) ( b_i − ∑_{k≠j} min{ a_ik L_k, a_ik U_k } )
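The poor man's rule needs only one pass over the constraint matrix; a sketch (the tiny example system is illustrative):

```python
def poor_mans_rr(A, b, L, U):
    """Tighten the upper bounds U_j using each row a^T x <= b_i with a_ij > 0."""
    n = len(L)
    U = list(U)
    for a, bi in zip(A, b):
        for j in range(n):
            if a[j] > 0:
                # lower-bound the contribution of the other variables
                rest = sum(min(a[k] * L[k], a[k] * U[k]) for k in range(n) if k != j)
                U[j] = min(U[j], (bi - rest) / a[j])
    return U

# Example: the single constraint x1 + x2 <= 1 on [0,10]^2
# tightens both upper bounds from 10 to 1.
U = poor_mans_rr(A=[[1.0, 1.0]], b=[1.0], L=[0.0, 0.0], U=[10.0, 10.0])
```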
generalization

Let

(P):  min f(x)  s.t.  g(x) ≤ 0, x ∈ X

and let

(R):  min f̂(x)  s.t.  ĝ(x) ≤ 0, x ∈ X̂

be a convex relaxation of (P):

{x ∈ X : g(x) ≤ 0} ⊆ {x ∈ X̂ : ĝ(x) ≤ 0},   f̂(x) ≤ f(x) ∀x ∈ X : g(x) ≤ 0

R.H.S. perturbation: define the perturbed relaxation

(R_y):  φ̂(y) = min f̂(x)  s.t.  ĝ(x) ≤ y, x ∈ X̂

(R) convex ⇒ (R_y) convex for any y. Let x̂ be an optimal solution of (R) and assume that the i-th constraint is active:

ĝ_i(x̂) = 0
Duality

Assume (R) has a finite optimum at x̂ with value φ̂(0) and Lagrange multipliers λ. Then the hyperplane

H(y) = φ̂(0) − λ^T y

supports φ̂ from below: φ̂(y) ≥ φ̂(0) − λ^T y for all y.

Main result

If (R) is convex with optimum value φ̂(0) = L, constraint i is active at the optimum with Lagrange multiplier λ_i > 0, and U is an upper bound for the original problem (P), then the constraint

ĝ_i(x) ≥ −(U − L)/λ_i

is valid for the original problem (P), i.e. it does not exclude any feasible solution with value better than U.
proof

Problem (R_y) can be seen as a convex relaxation of the perturbed non-convex problem

φ(y) = min f(x)  s.t.  g(x) ≤ y, x ∈ X

and thus φ̂(y) ≤ φ(y): under-estimating (R_y) produces an under-estimate of φ(y). Let y := e_i y_i. From duality, φ̂(e_i y_i) ≥ L − λ_i y_i. If y_i < 0 then U is an upper bound also for φ(e_i y_i), thus L − λ_i y_i ≤ U. For any feasible x we may choose y_i = ĝ_i(x), so that the i-th perturbed constraint is active; substituting y_i with ĝ_i(x) we deduce

L − λ_i ĝ_i(x) ≤ U,   i.e.   ĝ_i(x) ≥ −(U − L)/λ_i

Applications

Range reduction: let x ∈ [ℓ, u] in the convex relaxed problem. If variable x_i is at its upper bound in the optimal solution, then we can deduce

x_i ≥ max{ ℓ_i, u_i − (U − L)/λ_i }

where λ_i is the optimal multiplier associated with the i-th upper bound. Analogously, for active lower bounds:

x_i ≤ min{ u_i, ℓ_i + (U − L)/λ_i }

More generally, let a linear constraint a_i^T x ≤ b_i be active in an optimal solution of the convex relaxation (R). Then we can deduce the valid inequality

a_i^T x ≥ b_i − (U − L)/λ_i
In the Bayesian framework the objective is modeled as a realization of a stochastic process F(x; ω), and the next point to sample is placed in order to minimize the expected loss (or risk):

x_{n+1} = arg min E( L(x_1, …, x_n, x_{n+1}) | x_1, …, x_n ) = arg min E( min( F(x_{n+1}; ω), min_{i≤n} F(x_i; ω) ) | x_1, …, x_n )
Bumpiness

Let f_k* be an estimate of the value of the global optimum after k observations. Let s_k^y be the (unique) radial-basis interpolant of the data points

(x_i, f_i), i = 1, …, k,  together with the trial pair (y, f_k*)

where

s(x) = ∑_{i=1}^k λ_i φ(‖x − x_i‖) + p(x)

with p a polynomial of (prefixed) small degree m and φ a radial function, e.g. φ(r) = r or φ(r) = r³. Idea: the most likely location of y is the one for which the resulting interpolant has minimum bumpiness. Bumpiness measure:

σ(s_k^y) = (−1)^{m+1} ∑_i λ_i s_k^y(x_i)

The polynomial p is necessary to guarantee existence of a unique interpolant (i.e. when the matrix {φ_ij = φ(‖x_i − x_j‖)} is singular).
Stochastic methods
Pure Random Search: random uniform sampling over the feasible region.
Best Start: like Pure Random Search, but a local search is started from the best observation.
Multistart: local searches are started from randomly generated starting points.
Clustering methods
Given a uniform sample, evaluate the objective function. Sample transformation (or concentration): either a fraction of the worst points is discarded, or a few steps of a gradient method are performed from each point. The remaining points are then clustered, and from the best point of each cluster a single local search is started.
[figures omitted: uniform sample → sample concentration → clustering → local optimization]
Clustering: MLSL
Sampling proceeds in batches of N points. Given sample points X_1, …, X_k ∈ [0,1]^n, label X_j as clustered iff ∃ Y ∈ {X_1, …, X_k}:

‖X_j − Y‖ ≤ r_k  and  f(Y) ≤ f(X_j)

where the critical distance is

r_k := (1/√π) ( Γ(1 + n/2) · σ · (log k)/k )^{1/n}

with σ > 0 a parameter. A local search is started from every non-clustered point.
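The critical distance and the clustering test can be sketched directly from the formula above (the sample points below are illustrative):

```python
import math

def mlsl_radius(k, n, sigma=4.0):
    """MLSL critical distance r_k over the unit hypercube [0,1]^n."""
    return ((math.gamma(1 + n / 2) * sigma * math.log(k) / k) ** (1.0 / n)
            / math.sqrt(math.pi))

def clustered(points, values, radius):
    """Flag point j as clustered iff a distinct better point lies within radius."""
    return [any(i != j and math.dist(points[i], points[j]) <= radius
                and values[i] <= values[j] for i in range(len(points)))
            for j in range(len(points))]

# The critical distance shrinks as the sample grows, so fewer and fewer
# local searches are started.
r10, r1000 = mlsl_radius(10, 2), mlsl_radius(1000, 2)
flags = clustered([(0.0, 0.0), (0.05, 0.0)], [1.0, 2.0], radius=0.1)
```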
Simple Linkage
A sequential sample is generated (batches consist of a single observation). A local search is started from the last sampled point (i.e. there is no recall), unless there exists a sufficiently near sampled point with a better function value.
Smoothing methods
Given f : R^n → R, the Gaussian transform is defined as:

⟨f⟩_λ(x) = 1/(π^{n/2} λ^n) ∫_{R^n} f(y) exp( −‖y − x‖²/λ² ) dy

When λ is sufficiently large, ⟨f⟩_λ is convex (for a wide class of functions f). Idea: starting with a large enough λ, minimize the smoothed function and slowly decrease λ towards 0, tracking the minimizer.
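A 1-D numerical sketch of the transform by simple midpoint quadrature (the grid and truncation width are my own choices). For f(y) = y² the transform can be checked against the closed form ⟨y²⟩_λ(x) = x² + λ²/2, i.e. smoothing a quadratic only shifts it by a constant.

```python
import math

def gaussian_transform_1d(f, x, lam, grid=2000, half_width=10.0):
    """Midpoint-rule approximation of
    <f>_lam(x) = 1/(sqrt(pi)*lam) * int f(y) exp(-(y-x)^2/lam^2) dy."""
    total = 0.0
    step = 2 * half_width / grid
    for i in range(grid):
        y = x - half_width + (i + 0.5) * step
        total += f(y) * math.exp(-((y - x) / lam) ** 2) * step
    return total / (math.sqrt(math.pi) * lam)

# <y^2>_2(1) should be 1 + 2^2/2 = 3: minima are preserved while
# oscillations on a scale smaller than lam are averaged out.
val = gaussian_transform_1d(lambda y: y * y, x=1.0, lam=2.0)
```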
Smoothing methods

[figure omitted: a multimodal objective and its Gaussian smoothings for increasing values of λ]
Monotonic Basin-Hopping
k := 0; f* := +∞
while k < MaxIter do
    X_k := random initial solution
    X_k := arg min f(x; X_k)    (local minimization started at X_k)
    f_k := f(X_k)
    if f_k < f* then f* := f_k
    NoImprove := 0
    while NoImprove < MaxImprove do
        X := random perturbation of X_k
        Y := arg min f(x; X)    (local minimization started at X)
        if f(Y) < f* then
            X_k := Y; NoImprove := 0; f* := f(Y)
        else
            NoImprove := NoImprove + 1
        end if
    end while
    k := k + 1
end while
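A runnable sketch of Monotonic Basin-Hopping, with a deliberately crude coordinate-descent routine standing in for a real local solver (the test function, perturbation size and iteration limits are all illustrative assumptions):

```python
import math
import random

def local_search(f, x, step=0.1, iters=200):
    """Crude coordinate-descent local minimizer (stand-in for a real local solver)."""
    for _ in range(iters):
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                y = list(x)
                y[i] += d
                if f(y) < f(x):
                    x, improved = y, True
        if not improved:
            step /= 2
    return x

def mbh(f, dim, lo, hi, max_iter=4, max_no_improve=25, pert=0.7, seed=0):
    """Monotonic Basin-Hopping sketch: restart, perturb, accept only improvements."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(max_iter):
        x = local_search(f, [rng.uniform(lo, hi) for _ in range(dim)])
        if f(x) < best_f:
            best_x, best_f = x, f(x)
        no_improve = 0
        while no_improve < max_no_improve:
            y = local_search(f, [xi + rng.uniform(-pert, pert) for xi in x])
            if f(y) < best_f:
                x, best_x, best_f = y, y, f(y)
                no_improve = 0
            else:
                no_improve += 1
    return best_x, best_f

# Multimodal test: f(x) = x^2 + 2*sin(5x)^2 has many local minima;
# the global minimum is at x = 0 with value 0.
x, fx = mbh(lambda v: v[0] ** 2 + 2 * math.sin(5 * v[0]) ** 2,
            dim=1, lo=-3, hi=3)
```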
References
In this year's course the global optimization part has been expanded, so some nonlinear optimization material may be skipped. An essential reference list for the material covered during the course:

- Mokhtar S. Bazaraa, John J. Jarvis and Hanif D. Sherali, Linear Programming and Network Flows, John Wiley & Sons, 1990.
- Dimitri P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.
- Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer, 2006.
- Mohit Tawarmalani and Nikolaos V. Sahinidis, A Polyhedral Branch-and-Cut Approach to Global Optimization, Mathematical Programming, 103, 225–249, 2005.
- I.P. Androulakis, C.D. Maranas and C.A. Floudas, αBB: A Global Optimization Method for General Constrained Nonconvex Problems, Journal of Global Optimization, 7(4), 337–363, 1995.
- A. Rikun, A Convex Envelope Formula for Multilinear Functions, Journal of Global Optimization, 10, 425–437, 1997.
- Andrea Grosso, Marco Locatelli and Fabio Schoen, A Population Based Approach for Hard Global Optimization Problems Based on Dissimilarity Measures, Mathematical Programming, 110(2), 373–404, 2007.