Introduction
http://gol.dsi.unifi.it/users/schoen
NLP problems

min f(x),    x ∈ S ⊆ Rⁿ

Standard form:

min f(x)
s.t.  h_i(x) = 0,   i = 1, …, m
      g_j(x) ≤ 0,   j = 1, …, k

A point x* ∈ S is a local optimum if there exists ε > 0 such that f(x*) ≤ f(x) for all x ∈ S ∩ B(x*, ε), where B(x̄, ε) = {x ∈ Rⁿ : ‖x − x̄‖ ≤ ε} is a ball in Rⁿ. Any global optimum is also a local optimum, but the opposite is generally false.
Convex Functions

A set S ⊆ Rⁿ is convex if

x, y ∈ S  ⇒  λx + (1 − λ)y ∈ S

for all choices of λ ∈ [0, 1]. Let Ω ⊆ Rⁿ be a non-empty convex set. A function f : Ω → R is convex iff

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

for all x, y ∈ Ω and all λ ∈ [0, 1]. If f is twice continuously differentiable, it is convex iff its Hessian matrix is positive semi-definite:

∇²f(x) := [∂²f/∂x_i∂x_j],    ∇²f(x) ⪰ 0  iff  vᵀ∇²f(x)v ≥ 0  ∀v ∈ Rⁿ

Example: an affine function is convex (and concave). For a quadratic function (Q: symmetric matrix)

f(x) = ½ xᵀQx + bᵀx + c

we have

∇f(x) = Qx + b,    ∇²f(x) = Q

so f is convex iff Q ⪰ 0.
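The Hessian test is easy to apply numerically: a quadratic is convex iff the eigenvalues of its (symmetric) Q matrix are all non-negative. A minimal sketch (the function name is mine):

```python
import numpy as np

def is_convex_quadratic(Q):
    """f(x) = 0.5 x^T Q x + b^T x + c is convex iff Q is positive
    semi-definite, i.e. all eigenvalues of symmetric Q are >= 0."""
    eigvals = np.linalg.eigvalsh(Q)          # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals >= -1e-12))   # small tolerance for round-off

Q1 = np.array([[2.0, 0.0], [0.0, 1.0]])      # positive definite -> convex
Q2 = np.array([[1.0, 2.0], [2.0, 1.0]])      # eigenvalues 3 and -1 -> not convex
print(is_convex_quadratic(Q1), is_convex_quadratic(Q2))
```

Note that `eigvalsh` exploits symmetry; for a non-symmetric Q one would first replace it by its symmetric part (Q + Qᵀ)/2, which defines the same quadratic form.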
Maximization

With a slight abuse of notation, a problem

max f(x),    x ∈ S

is called a convex optimization problem iff S is a convex set and f is a concave function on S (not to be confused with minimization of a concave function, or maximization of a convex function, which are NOT convex optimization problems). For a problem in standard form

min f(x)
s.t.  h_i(x) = 0,  i = 1, …, m
      g_j(x) ≤ 0,  j = 1, …, k

if f is convex, the h_i(x) are affine functions and the g_j(x) are convex functions, then the problem is convex.

More examples:
- max_i {a_iᵀx + b_i} is convex
- f, g convex ⇒ max{f(x), g(x)} is convex
- f_a convex for every a ∈ A (a possibly uncountable set) ⇒ sup_{a∈A} f_a(x) is convex
- f convex ⇒ f(Ax + b) is convex
- Trace(AᵀX) = Σ_{i,j} A_ij X_ij is linear (hence convex)
Data Approximation

Table of contents:
- norm approximation
- maximum likelihood
- robust estimation

Norm approximation

Problem:

min_x ‖Ax − b‖

where A, b are parameters. Usually the system is over-determined, i.e. b ∉ Range(A). For example, this happens when A ∈ R^{m×n} with m > n and A has full rank. r := Ax − b is the residual.

Examples:
- ‖r‖₂² = rᵀr: least squares (or regression)
- rᵀPr with P ⪰ 0: weighted least squares
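The least-squares case has a closed-form solution via the normal equations; a quick numerical sketch (random data, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))   # over-determined: m > n
b = rng.standard_normal(100)         # generic b, not in Range(A)

# least-squares solution: x minimizes ||Ax - b||_2
x, *_ = np.linalg.lstsq(A, b, rcond=None)
r = A @ x - b                        # residual

# optimality check: the residual is orthogonal to Range(A), i.e. A^T r = 0
print(np.linalg.norm(A.T @ r))
```

The printed value is numerically zero, confirming the normal equations AᵀAx = Aᵀb at the optimum.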
Example: ℓ₁ norm

Matrix A ∈ R^{100×30}. [Figure: histogram of the ℓ₁-norm residuals]

Nonlinear Programming Models p. 19

[Figure: histogram of the ℓ∞-norm residuals]

[Figure: histogram of the ℓ₂-norm residuals]

Possible (convex) additional constraints:
- maximum deviation from an initial estimate: ‖x − x_est‖ ≤ δ
- simple bounds: ℓ_i ≤ x_i ≤ u_i
- ordering: x₁ ≤ x₂ ≤ ⋯ ≤ x_n
Variants

min Σ_i h(r_i)

where h is a scalar penalty function, e.g. a dead-zone linear penalty or a linear–quadratic (Huber-type) penalty, which is quadratic for small residuals and linear for large ones. [Figure: comparison of penalty functions]
Maximum likelihood

Given a sample X₁, X₂, …, X_k and a parametric family of probability density functions L(·; θ), the maximum likelihood estimate of θ given the sample is

θ̂ = arg max_θ L(X₁, …, X_k; θ)

Example: linear measurements with additive i.i.d. (independent, identically distributed) noise:

X_i = a_iᵀθ + ε_i    (1)

L(X₁, …, X_k; θ) = ∏_{i=1}^k p(X_i − a_iᵀθ)

so the log-likelihood is Σ_i log p(X_i − a_iᵀθ).
- ε ∼ N(0, σ²), i.e. p(z) = (2πσ²)^{−1/2} exp(−z²/2σ²): the MLE is the ℓ₂ estimate θ̂ = arg min ‖Aθ − X‖₂
- p(z) = (1/(2a)) exp(−|z|/a): the MLE is the ℓ₁ estimate θ̂ = arg min ‖Aθ − X‖₁
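The ℓ₁ estimate can be computed as a linear program by introducing one auxiliary variable per residual. A sketch with synthetic Laplace-distributed noise (data and sizes are mine):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, k = 3, 60
theta_true = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((k, n))
X = A @ theta_true + rng.laplace(scale=0.1, size=k)  # Laplace noise: l1 is the MLE

# l1 estimate as an LP:  min 1^T t  s.t.  -t <= A theta - X <= t
# decision variables z = [theta (n entries), t (k entries)]
c = np.concatenate([np.zeros(n), np.ones(k)])
A_ub = np.block([[A, -np.eye(k)], [-A, -np.eye(k)]])
b_ub = np.concatenate([X, -X])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + k))
theta_l1 = res.x[:n]
print(theta_l1)   # close to theta_true
```

Each t_i is forced to be at least |a_iᵀθ − X_i|, so minimizing 1ᵀt minimizes the ℓ₁ residual norm.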
- p(z) = (1/a) exp(−z/a) 1{z ≥ 0} (negative exponential): the estimate can be found by solving the LP problem min 1ᵀ(X − Aθ) s.t. Aθ ≤ X
- p uniform on [−a, a]: the MLE is any θ such that ‖Aθ − X‖_∞ ≤ a

Ellipsoids

E = {x ∈ Rⁿ : (x − x₀)ᵀP^{−1}(x − x₀) ≤ 1}

where x₀ ∈ Rⁿ is the center of the ellipsoid and P is a symmetric positive-definite matrix. Alternative representations:

E = {x ∈ Rⁿ : ‖Ax − b‖₂ ≤ 1}

where A ≻ 0, or

E = {x ∈ Rⁿ : x = x₀ + Au, ‖u‖₂ ≤ 1}

where A is square and non-singular (affine transformation of the unit ball).
RLS (Robust Least Squares)

Assume each row of A is only known to belong to an ellipsoid:

a_i ∈ E_i = {ā_i + P_i u : ‖u‖ ≤ 1},    P_i = P_iᵀ ⪰ 0

It holds that, for a scalar α and a vector δ, max_{‖y‖≤1} |α + δᵀy| = |α| + ‖δ‖: choosing y = sign(α) δ/‖δ‖ (any unit y if δ = 0),

|α + δᵀy| = |α + sign(α) δᵀδ/‖δ‖| = |α| + ‖δ‖

Then:

max_{a_i∈E_i} |a_iᵀx − b_i| = max_{‖u‖≤1} |ā_iᵀx − b_i + uᵀP_i x| = |ā_iᵀx − b_i| + ‖P_i x‖

Thus the Robust Least Squares problem

min_x Σ_i (|ā_iᵀx − b_i| + ‖P_i x‖)²

reduces to

min_{x,t} Σ_i t_i²
s.t.  ā_iᵀx − b_i + ‖P_i x‖ ≤ t_i
      −(ā_iᵀx − b_i) + ‖P_i x‖ ≤ t_i

since the pair of constraints is equivalent to |ā_iᵀx − b_i| + ‖P_i x‖ ≤ t_i.
Geometrical Problems
- projections and distances
- polyhedral intersection
- extremal volume ellipsoids
- classification problems

Projection on a set

Given a set C, the projection of x on C is defined as:

P_C(x) = arg min_{z∈C} ‖z − x‖

If C is convex, this is a convex problem. Similarly, the distance between two convex sets C₁, C₂ is found by solving the convex problem min ‖x − y‖, x ∈ C₁, y ∈ C₂.
Polyhedral intersection

1: polyhedra described by means of linear inequalities:

P₁ = {x : Ax ≤ b},    P₂ = {x : Cx ≤ d}

Is P₁ ∩ P₂ = ∅? This is a linear feasibility problem: Ax ≤ b, Cx ≤ d. Is P₁ ⊆ P₂? Just check, for every row c_k of C,

sup {c_kᵀx : Ax ≤ b} ≤ d_k

2: polyhedra described by means of their vertices:

P₁ = conv{v₁, …, v_k},    P₂ = conv{w₁, …, w_h}

[Figure: two polyhedra given by their vertices]

Is P₁ ⊆ P₂? For each i = 1, …, k, check whether there exist λ_j ≥ 0 with Σ_j λ_j = 1 such that

Σ_j λ_j w_j = v_i

(a linear feasibility problem).
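The inequality-description containment test is a sequence of small LPs, one per row of C. A sketch on two concrete squares (the data are mine):

```python
import numpy as np
from scipy.optimize import linprog

# P1 = {x : Ax <= b}: the unit square [0,1]^2
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
# P2 = {x : Cx <= d}: the larger square [-1,2]^2, which contains P1
C = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
d = np.array([2.0, 1.0, 2.0, 1.0])

def contained(A, b, C, d):
    """P1 subset of P2 iff for every row k: sup{c_k^T x : Ax <= b} <= d_k."""
    for ck, dk in zip(C, d):
        # maximize c_k^T x over P1 = minimize -c_k^T x
        res = linprog(-ck, A_ub=A, b_ub=b, bounds=[(None, None)] * A.shape[1])
        if -res.fun > dk + 1e-9:
            return False
    return True

print(contained(A, b, C, d))   # True:  [0,1]^2 is inside [-1,2]^2
print(contained(C, d, A, b))   # False: [-1,2]^2 is not inside [0,1]^2
```

Each LP is bounded because P₁ is a polytope; for unbounded polyhedra the LP status would have to be checked as well.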
Maximal volume inscribed ellipsoid

Given P = {x : Ax ≤ b}, find an ellipsoid of maximal volume contained in P:

E = {By + d : ‖y‖ ≤ 1},    B = Bᵀ ≻ 0

E ⊆ P iff, for every row a_i of A,

sup_{‖y‖≤1} {a_iᵀBy + a_iᵀd} = ‖Ba_i‖ + a_iᵀd ≤ b_i,   i = 1, …, k

a convex constraint in (B, d); since the volume of E is proportional to det B, maximizing log det B gives a convex problem.

Difficult variants

These problems are hard:
- find a maximal volume ellipsoid contained in a polyhedron given by its vertices [Figure: polyhedron given by its vertices]
- it is already a difficult problem to decide whether a given ellipsoid E contains a polyhedron P = {x : Ax ≤ b}. This problem is still difficult even when the ellipsoid is a sphere: it is equivalent to norm maximization over a polyhedron, an NP-hard concave optimization problem.

Classification

Given points X_i, i = 1, …, k and Y_j, j = 1, …, h, a separating hyperplane can be found via a linear feasibility problem, e.g. aᵀX_i ≥ 1, i = 1, …, k and aᵀY_j ≤ −1, j = 1, …, h.

Robust separation

Find a maximal separation:

max_{a : ‖a‖ ≤ 1} [ min_i aᵀX_i − max_j aᵀY_j ]
Consider min f(x), x ∈ S, where f : S → R. Let x₁, x₂ ∈ S and d = x₂ − x₁: d is a feasible direction. If there exists ᾱ > 0 such that f(x₁ + αd) < f(x₁) for all α ∈ (0, ᾱ), d is called a descent direction at x₁. Elementary necessary optimality condition: if x* is a local optimum, no feasible descent direction may exist at x*.
Optimality Conditions p. 1
proof

Taylor expansion:

f(x* + αd) = f(x*) + α dᵀ∇f(x*) + o(α)

d cannot be a descent direction, so, if α is sufficiently small, f(x* + αd) ≥ f(x*). Thus

α dᵀ∇f(x*) + o(α) ≥ 0

and, dividing by α and letting α ↓ 0,

dᵀ∇f(x*) ≥ 0

For constrained sets the right notion of admissible directions is the tangent cone T(x̄): the set of directions d such that (up to positive scaling)

d = lim_{k→∞} (x_k − x̄)/‖x_k − x̄‖

for some sequence {x_k} → x̄, where x_k ∈ S.
Some examples

- S = Rⁿ ⇒ T(x) = Rⁿ
- S = {x : Ax = b} ⇒ T(x) = {d : Ad = 0}
- S = {x : Ax ≤ b}: let I be the set of active constraints at x̄, i.e. a_iᵀx̄ = b_i for i ∈ I and a_iᵀx̄ < b_i for i ∉ I. Then

T(x̄) = {d : a_iᵀd ≤ 0  ∀i ∈ I}

In fact, if d ∈ T(x̄) then a_iᵀd ≤ 0 for i ∈ I. Vice versa, let x_k = x̄ + α_k d with α_k ↓ 0. If a_iᵀd ≤ 0 for i ∈ I, then for i ∈ I

a_iᵀx_k = a_iᵀ(x̄ + α_k d) = b_i + α_k a_iᵀd ≤ b_i

while for i ∉ I, since a_iᵀx̄ < b_i,

a_iᵀx_k = a_iᵀx̄ + α_k a_iᵀd < b_i

for α_k small enough; thus x_k ∈ S and d ∈ T(x̄).

Example

Let S = {(x, y) ∈ R² : x² − y = 0} (parabola). Tangent cone at (0, 0)? Let {(x_k, y_k)} → (0, 0), i.e. x_k → 0, y_k = x_k². Then

‖(x_k, y_k) − (0, 0)‖ = √(x_k² + x_k⁴) = |x_k| √(1 + x_k²)

so (x_k, y_k)/‖(x_k, y_k)‖ → (±1, 0): the tangent cone is the x-axis.
Descent directions

d ∈ Rⁿ is a feasible direction at x̄ ∈ S if there exists ᾱ > 0 such that x̄ + αd ∈ S for all α ∈ [0, ᾱ]. d feasible ⇒ d ∈ T(x̄), but in general the converse is false. If f(x̄ + αd) < f(x̄) for all α ∈ (0, ᾱ), d is a descent direction.

Necessary optimality condition: if x̄ is a local optimum, then

dᵀ∇f(x̄) ≥ 0   ∀d ∈ T(x̄)

Sketch of proof: take a sequence x_k → x̄ realizing d. If k is large enough, x_k ∈ U(x̄), a neighborhood where x̄ is minimal, so

f(x_k) − f(x̄) ≥ 0

and a first-order expansion of the left-hand side, divided by ‖x_k − x̄‖, yields dᵀ∇f(x̄) ≥ 0 in the limit.
Examples

Unconstrained problems: every d ∈ Rⁿ belongs to the tangent cone at a local optimum, so

∇ᵀf(x̄)d ≥ 0   ∀d ∈ Rⁿ

Choosing d = e_i and d = −e_i we get

∇f(x̄) = 0

NB: the same is true if x̄ is a local minimum in the relative interior of the feasible region.

Linear equality constraints: S = {x : Ax = b}. The condition becomes ∇ᵀf(x̄)d ≥ 0 for all d : Ad = 0; an equivalent statement:

min_d {∇ᵀf(x̄)d : Ad = 0} = 0

(a linear program).
Linear inequalities

min f(x)
s.t.  Ax ≤ b

Tangent cone at a local minimum x̄: {d ∈ Rⁿ : a_iᵀd ≤ 0, i ∈ I(x̄)}. Let A_I be the rows of A associated to the active constraints at x̄. Then

min_d {∇ᵀf(x̄)d : A_I d ≤ 0} = 0

From LP duality, the dual

max 0ᵀλ
s.t.  A_Iᵀλ = −∇f(x̄),  λ ≥ 0

is feasible. Thus, at a local optimum, the gradient is a non-positive linear combination of the coefficients of the active constraints:

∇f(x̄) = −Σ_{i∈I} λ_i a_i,    λ_i ≥ 0
Farkas Lemma

Let A be a matrix in R^{m×n} and b ∈ R^m. One and only one of the following systems has a solution:

Aᵀy ≤ 0,  bᵀy > 0

or

Ax = b,  x ≥ 0

Geometrical interpretation: either b belongs to the cone {z : z = Ax, x ≥ 0} generated by the columns of A, or there is a vector y in {y : Aᵀy ≤ 0} making an acute angle with b. [Figure: the cone generated by a₁, a₂ and the vector b]

Proof

1) If there exists x ≥ 0 with Ax = b, then bᵀy = xᵀAᵀy. Thus, if Aᵀy ≤ 0, then bᵀy ≤ 0, and the first system has no solution.
2) Premise — separating hyperplane theorem: let C and D be two disjoint convex non-empty sets. Then there exist γ ≠ 0 and β such that

γᵀx ≤ β  ∀x ∈ C,    γᵀx ≥ β  ∀x ∈ D

If the second system has no solution, then b ∉ S := {z : z = Ax, x ≥ 0}, a closed convex cone; separating {b} from S gives γ, β with γᵀAx ≤ β for all x ≥ 0 and γᵀb > β. Since 0 ∈ S, β ≥ 0, hence bᵀγ > 0; and γᵀAx ≤ β for all x ≥ 0 is possible iff γᵀA ≤ 0 (otherwise scale x). Letting y = γ we obtain a solution of Aᵀy ≤ 0, bᵀy > 0.
Nonlinear constraints: S = {x : g_i(x) ≤ 0, i ∈ I ∪ …}. Let d ∈ T(x̄) be realized by a sequence x_k → x̄:

d = lim_k (x_k − x̄)/‖x_k − x̄‖

Let α_k = ‖x_k − x̄‖, so that α_k ↓ 0 and x_k = x̄ + α_k d + o(α_k). Since g(x_k) ≤ 0, expanding:

g_i(x̄ + α_k d) = g_i(x̄) + α_k ∇ᵀg_i(x̄)d + o(α_k)

If the i-th constraint is active (g_i(x̄) = 0), then

g_i(x̄ + α_k d) = α_k ∇ᵀg_i(x̄)d + o(α_k) ≤ 0

so, dividing by α_k and passing to the limit, ∇ᵀg_i(x̄)d ≤ 0. Thus

T(x̄) ⊆ G(x̄) := {d : ∇ᵀg_i(x̄)d ≤ 0, i ∈ I}

and the inclusion may be strict.
example

G(x̄) = T(x̄) may fail: e.g. with the constraints x³ + y ≤ 0, −y ≤ 0 at x̄ = (0, 0), the linearized cone G(x̄) is strictly larger than the tangent cone. Conditions ensuring G(x̄) = T(x̄) are called constraint qualifications. Examples:
- (Slater-type) X open set, g_i(x), i ∈ I, convex differentiable functions at x̄, g_i(x), i ∉ I, continuous at x̄, and there exists x̂ ∈ X strictly feasible: g_i(x̂) < 0 ∀i ∈ I
- (linear independence) X open set, g_i(x), i ∉ I, continuous at x̄ and {∇g_i(x̄)}, i ∈ I, linearly independent

KKT conditions (inequality constraints): if x̄ is a local optimum and a constraint qualification holds, there exist λ_i ≥ 0 such that

Σ_{i∈I} λ_i ∇ᵀg_i(x̄) = −∇ᵀf(x̄),    λ_i ≥ 0

Proof: x̄ local optimum ⇒ dᵀ∇f(x̄) ≥ 0 for every d ∈ T(x̄) = G(x̄); that is, dᵀ∇g_i(x̄) ≤ 0 ∀i ∈ I implies dᵀ∇f(x̄) ≥ 0. By Farkas' lemma the multipliers exist. Setting λ_i = 0 for the inactive constraints gives the complementarity form Σ_i λ_i g_i(x̄) = 0.
Convex problems

An optimization problem

min f(x),  x ∈ S

is a convex problem if
- S is a convex set, i.e. x, y ∈ S ⇒ λx + (1 − λ)y ∈ S for all λ ∈ [0, 1]
- f is a convex function on S, i.e. f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all λ ∈ [0, 1] and x, y ∈ S

A problem in standard form is convex if f is convex, the g_i are convex and the h_j are affine (i.e. of the form αᵀx + β).

Every local optimum is a global one. Proof: let x̄ be a local optimum for min_S f(x) and x* a global optimum. S convex ⇒ λx* + (1 − λ)x̄ ∈ S. Thus, for λ ↓ 0, the point λx* + (1 − λ)x̄ lies in the neighborhood where x̄ is minimal, and

f(x̄) ≤ f(λx* + (1 − λ)x̄) ≤ λf(x*) + (1 − λ)f(x̄)

hence f(x̄) ≤ f(x*) = f*, where f* is the global minimum value. Thus equality holds and the proof is complete.

Moreover, for convex differentiable f, if dᵀ∇f(x̄) ≥ 0 for every d = y − x̄ with y ∈ S, then

f(y) ≥ f(x̄) + (y − x̄)ᵀ∇f(x̄) ≥ f(x̄)   ∀y ∈ S

so first-order stationarity is also sufficient for global optimality.
KKT conditions (standard form)

min f(x)
s.t.  g_i(x) ≤ 0,  i = 1, …, m
      h_j(x) = 0,  j = 1, …, h

Let I be the set of active inequalities at x̄. If f(x), g_i(x), i ∈ I, and the h_j(x) are C¹ and constraint qualifications hold at x̄, then there exist λ_i ≥ 0, i ∈ I, and μ_j ∈ R, j = 1, …, h, such that

∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1}^h μ_j ∇h_j(x̄) = 0
Complementarity

KKT equivalent formulation:

∇f(x̄) + Σ_{i=1}^m λ_i ∇g_i(x̄) + Σ_{j=1}^h μ_j ∇h_j(x̄) = 0
λ_i g_i(x̄) = 0,  i = 1, …, m
λ ≥ 0

Second-order necessary conditions: in addition to

∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1}^h μ_j ∇h_j(x̄) = 0

it must hold that

dᵀ∇²L(x̄)d ≥ 0

for all d in the tangent space of the active constraints, where

∇²L(x) := ∇²f(x) + Σ_{i∈I} λ_i ∇²g_i(x) + Σ_{j=1}^h μ_j ∇²h_j(x)
Sufficient conditions

Let f, g_i, h_j be twice continuously differentiable, and let x*, λ, μ satisfy:

∇f(x*) + Σ_{i∈I} λ_i ∇g_i(x*) + Σ_{j=1}^h μ_j ∇h_j(x*) = 0
λ_i g_i(x*) = 0,   λ ≥ 0
dᵀ∇²L(x*)d > 0   ∀d ≠ 0 : dᵀ∇h_j(x*) = 0 ∀j,  dᵀ∇g_i(x*) = 0, i ∈ I

Then x* is a strict local minimum.

Lagrange Duality

Problem:

f* = min f(x)
s.t.  g_i(x) ≤ 0
      x ∈ X

Lagrangian function:

L(x; λ) = f(x) + Σ_i λ_i g_i(x),    λ ≥ 0, x ∈ X
Relaxation

Given an optimization problem

min f(x),  x ∈ S

a relaxation is a problem

min g(x),  x ∈ Q

where

S ⊆ Q,    g(x) ≤ f(x)   ∀x ∈ S

Weak Duality: the optimal value of a relaxation is a lower bound on the optimal value of the problem.

Example (point packing): place N points in the unit square maximizing the minimum pairwise distance 2r:

min −r
s.t.  4r² − (x_i − x_j)² − (y_i − y_j)² ≤ 0,   1 ≤ i < j ≤ N
      0 ≤ x_i ≤ 1,  0 ≤ y_i ≤ 1,   i = 1, …, N

For every choice of λ ≥ 0, the dual function value θ(λ) is the optimal value of a relaxation and is therefore a lower bound on the global minimum value of the problem.
solution

When N = 2, relaxing the first constraint:

θ(λ) = min_{x,y,r} −r + λ(4r² − (x₁ − x₂)² − (y₁ − y₂)²)
s.t.  x₁, x₂, y₁, y₂ ∈ [0, 1]

Minimizing over r gives r = 1/(8λ); the squared differences are pushed to their maximum value 1, so

θ(λ) = −2λ − 1/(16λ)

This is a lower bound on the optimal value. Best possible lower bound:

θ* = max_{λ≥0} θ(λ),   attained at λ* = 1/(4√2),   θ* = −√2/2

Choosing (x₁, y₁) = (0, 0) and (x₂, y₂) = (1, 1), a feasible solution with r = √2/2 (objective value −√2/2) is obtained. The Lagrange dual gives a lower bound equal to −√2/2: same as the objective function at a feasible solution ⇒ optimal solution! (an exception, not the rule!)
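The dual function derived above is one-dimensional, so the best bound can be checked numerically by a simple grid search:

```python
import numpy as np

def theta(lam):
    """Dual function of the two-point packing problem:
    theta(lam) = min -r + lam*(4r^2 - (x1-x2)^2 - (y1-y2)^2) over the box.
    Minimizing over r gives r = 1/(8 lam); the squared coordinate
    differences are pushed to their maximum value 1."""
    return -2.0 * lam - 1.0 / (16.0 * lam)

lams = np.linspace(0.01, 1.0, 100000)
vals = theta(lams)
best = lams[np.argmax(vals)]
print(best, vals.max())   # ~ 1/(4*sqrt(2)) ~ 0.1768 and ~ -sqrt(2)/2 ~ -0.7071
```

The maximizer matches λ* = 1/(4√2) and the bound matches the feasible value −√2/2, confirming the zero duality gap in this (exceptional) instance.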
The Lagrange dual

θ* = max_{λ ≥ 0} θ(λ)

might:
1. be unbounded
2. have a finite sup but no max
3. have a unique maximum attained at a single solution λ*
4. have many different maxima, each connected with a different solution λ*
Equality constraints

f* = min f(x)
s.t.  g_i(x) ≤ 0,  i = 1, …, m
      h_j(x) = 0,  j = 1, …, k
      x ∈ X

Lagrange function:

L(x; λ, μ) = f(x) + λᵀg(x) + μᵀh(x)

Linear Programming

min cᵀx
s.t.  Ax ≥ b

L(x; λ) = cᵀx + λᵀ(b − Ax) = λᵀb + (cᵀ − λᵀA)x

Lagrange dual function:

θ(λ) = λᵀb + min_x (cᵀ − λᵀA)x = { λᵀb  if cᵀ − λᵀA = 0;  −∞ otherwise }

Lagrange dual:

max λᵀb
s.t.  −λᵀA + cᵀ = 0
      λ ≥ 0

which is equivalent to:

max λᵀb
s.t.  Aᵀλ = c
      λ ≥ 0
QP Case 1

min ½xᵀQx + cᵀx s.t. Ax ≥ b, with L(x; λ) = λᵀb + ½xᵀQx + (cᵀ − λᵀA)x. If Q has at least one negative eigenvalue,

min_x ½xᵀQx + (cᵀ − λᵀA)x = −∞

so θ(λ) = −∞ for every λ.

QP Case 2

Q positive definite: the minimum point of the inner problem satisfies

Qx + c − Aᵀλ = 0,   i.e.   x̄ = Q^{−1}(Aᵀλ − c)

Lagrange dual function value:

θ(λ) = λᵀb + ½x̄ᵀQx̄ + (cᵀ − λᵀA)x̄
     = λᵀb − ½(Aᵀλ − c)ᵀQ^{−1}(Aᵀλ − c)

Lagrange dual (seen as a min problem):

min −λᵀb + ½(Aᵀλ − c)ᵀQ^{−1}(Aᵀλ − c),   λ ≥ 0

Optimality condition (when the sign constraint is inactive):

−b + AQ^{−1}(Aᵀλ − c) = 0

If we find optimal multipliers (a linear system) we get the optimal solution x̄ (thanks to feasibility and weak duality)!
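The "linear system gives the optimal solution" remark can be checked on a tiny instance (data are mine; the single constraint is active at the optimum, so the λ ≥ 0 condition takes care of itself):

```python
import numpy as np

# Primal: min 0.5 x^T Q x + c^T x  s.t.  A x >= b, with Q positive definite.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -2.0])
A = np.array([[-1.0, -1.0]])       # -x1 - x2 >= -1, i.e. x1 + x2 <= 1
b = np.array([-1.0])

def dual(lam):
    """theta(lam) = lam^T b - 0.5 (A^T lam - c)^T Q^{-1} (A^T lam - c)."""
    v = A.T @ lam - c
    return lam @ b - 0.5 * v @ np.linalg.solve(Q, v)

# Optimal multiplier from  -b + A Q^{-1} (A^T lam - c) = 0:
lam_star = np.linalg.solve(A @ np.linalg.solve(Q, A.T),
                           b + A @ np.linalg.solve(Q, c))
x_star = np.linalg.solve(Q, A.T @ lam_star - c)
print(lam_star, x_star, dual(lam_star))   # lam=[1], x=[0.5,0.5], theta=-1.5
```

Here the primal optimum is x* = (0.5, 0.5) with value −1.5, equal to θ(λ*): strong duality. For any other λ ≥ 0, θ(λ) is a strictly smaller lower bound (weak duality).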
Concavity of the dual

θ(λ) = min_{x∈X} f(x) + λᵀg(x)

where X is non-empty and compact; if f and the g_i are continuous, the minimum is attained (Weierstrass' theorem) and the Lagrange dual function is concave: for λ = αa + (1 − α)b with α ∈ [0, 1],

θ(λ) = min_x [α(f(x) + aᵀg(x)) + (1 − α)(f(x) + bᵀg(x))]
     ≥ α min_x (f(x) + aᵀg(x)) + (1 − α) min_x (f(x) + bᵀg(x))
     = αθ(a) + (1 − α)θ(b).
...
be the optimal solution of the restricted dual. Is it an Let T g (x)? Check: optimal dual solution? Is it true that z f (x) + we look for x , optimal solution of T g (x) min f (x) +
xX
is equivalent to
max z 0 z f (x) + T g (x) x X
otherwise the pair x , f ( x) is added to the restricted dual and a new solution is computed.
Optimality Conditions p. 62
Geometric programming

Unconstrained geometric program:

min_{x>0} Σ_{k=1}^m c_k ∏_{j=1}^n x_j^{α_kj},    α_kj ∈ R, c_k > 0

Transformed problem (y_j = log x_j):

min_y Σ_{k=1}^m c_k exp(Σ_j α_kj y_j) = Σ_{k=1}^m exp(α_kᵀy + β_k)

where β_k = log c_k. This is a convex problem, often stated through the (monotone) logarithm of the objective:

min_y log Σ_{k=1}^m exp(α_kᵀy + β_k)
Duality example

Dual of min log Σ_k exp y_k: with no constraints the Lagrange dual function is identical to f — strong duality holds, but is useless. Simple transformation: introduce y explicitly,

min log Σ_{k=1}^m exp y_k
s.t.  y_k = a_kᵀx + β_k,  k = 1, …, m

with Lagrangian

L(x, y; ν) = log Σ_k exp y_k + Σ_k ν_k (a_kᵀx + β_k − y_k)

Minimizing over x forces Aᵀν = 0 (otherwise the infimum is −∞). Minimizing over y, stationarity gives

ν_j = exp y_j / Σ_k exp y_k

so ν ≥ 0 and Σ_j ν_j = 1. Substituting ν_j = exp y_j / Σ_k exp y_k back into L, after some algebra the y-dependent terms reduce to the negative entropy of ν, and the dual function becomes

θ(ν) = βᵀν − Σ_j ν_j log ν_j   if ν ≥ 0, 1ᵀν = 1, Aᵀν = 0;   −∞ otherwise

The dual of an (unconstrained) geometric program is an entropy maximization problem.
Lagrange Dual

For min f(x) s.t. Ax ≥ b, with Lagrange function

L(x, λ) = f(x) + λᵀ(b − Ax)

the optimality conditions at a primal/dual pair (x*, λ*) are

∇f(x*) − Aᵀλ* = 0,   Ax* ≥ b,   λ* ≥ 0,   λ*ᵀ(b − Ax*) = 0

(complementarity). In the LP case f(x) = cᵀx this yields again the dual max λᵀb s.t. Aᵀλ = c, λ ≥ 0.

Non-negativity constraints: for min f(x) s.t. x ≥ 0, the KKT conditions with multipliers λ* ≥ 0 give ∇f(x*) = λ* and complementarity (λ*)ᵀx* = 0, from which

∂f(x*)/∂x_j = 0   for j : x_j* > 0
∂f(x*)/∂x_j ≥ 0   otherwise
Box constraints

min f(x)
s.t.  ℓ ≤ x ≤ u,   ℓ_i < u_i ∀i

KKT conditions: there exist λ, μ such that

∇f(x*) − λ + μ = 0
(ℓ − x*)ᵀλ = 0,   (x* − u)ᵀμ = 0
(λ, μ) ≥ 0

with feasibility ℓ ≤ x* ≤ u.
Simplex constraints

min f(x) s.t. Σ_j x_j = 1, x ≥ 0. KKT: there exists μ such that

∂f(x*)/∂x_j = μ   for all j : x_j* > 0
∂f(x*)/∂x_j ≥ μ   otherwise

(from complementarity, if x_j* > 0 the multiplier of the constraint x_j ≥ 0 is zero, so the partial derivatives over the support are all equal).

Optimal portfolio

The variance of a portfolio P(x) with covariance matrix Q is

Var = E(P(x) − E(P(x)))² = Σ_{i,j} Q_ij x_i x_j = xᵀQx

KKT for min xᵀQx s.t. Σ_j x_j = 1, x ≥ 0: for all j, k with x_j*, x_k* > 0,

Σ_i Q_ji x_i* = Σ_i Q_ki x_i*

The vector Qx might be thought of as the vector of marginal contributions to the total risk (which is a weighted sum of elements of Qx). Thus in the optimal portfolio, all assets held at positive level give equal (and minimal) contribution to the total risk.
Most common form for optimization algorithms — line search-based methods: given a starting point x₀, a sequence is generated:

x_{k+1} = x_k + α_k d_k

where d_k ∈ Rⁿ is the search direction and α_k > 0 the step. Usually d_k is chosen first and then the step is obtained, often from a 1-dimensional optimization.

Trust-region algorithms

A model m(x) and a confidence region U(x_k) containing x_k are defined. The new iterate is chosen as the solution of the constrained optimization problem

min m(x),   x ∈ U(x_k)

The model and the confidence region are possibly updated at each iteration.

Speed measures

Let x* be a local optimum. The error at x_k might be measured e.g. as

e(x_k) = ‖x_k − x*‖

Convergence rates:
- linear: e(x_k) ≤ qγᵏ for some q and some γ ∈ (0, 1)
- superlinear: for every γ ∈ (0, 1) there exists q such that e(x_k) ≤ qγᵏ
- order at least p: lim sup_k e(x_{k+1})/e(x_k)ᵖ < ∞; if p = 2, quadratic convergence
Examples: e(x_k) = 1/k and e(x_k) = 1/k² converge sublinearly (the error ratio tends to 1); e(x_k) = γᵏ, γ ∈ (0, 1), converges linearly; e(x_k) = γ^(2ᵏ) quadratically.
Descent directions. If dᵀ∇f(x_k) < 0, then, since

f(x_k + αd) = f(x_k) + α dᵀ∇f(x_k) + o(α),

if α is small enough f(x_k + αd) − f(x_k) < 0. NB: d might be a descent direction even if dᵀ∇f(x_k) = 0.

Algorithms for unconstrained local optimization p. 7

Global convergence requires the directions not to become asymptotically orthogonal to the gradient: if d_k ≠ 0,

|d_kᵀ∇f(x_k)| ≥ ε ‖∇f(x_k)‖ ‖d_k‖   (angle condition)

Recalling that

cos θ_k = −d_kᵀ∇f(x_k)/(‖d_k‖ ‖∇f(x_k)‖)

this means that the angle between d_k and −∇f(x_k) is bounded away from orthogonality. If ∇f(x_k) ≠ 0 for all k and the line search is accurate enough, then

lim_k d_kᵀ∇f(x_k)/‖d_k‖ = 0

which, together with the angle condition, forces ∇f(x_k) → 0.

Gradient Algorithms

General scheme:

x_{k+1} = x_k − α_k D_k ∇f(x_k),   D_k ≻ 0

Steepest Descent

or gradient method: D_k := I, i.e. x_{k+1} = x_k − α_k ∇f(x_k). If ∇f(x_k) ≠ 0, then d_k = −∇f(x_k) is a descent direction. Moreover, it is the steepest (w.r.t. the Euclidean norm): the normalized negative gradient solves

min_{d∈Rⁿ, ‖d‖≤1} ∇ᵀf(x_k)d
Newton's method

D_k := (∇²f(x_k))^{−1}. The Newton direction is the steepest descent direction in the norm induced by the Hessian: it solves

min_{d∈Rⁿ} ∇ᵀf(x_k)d   s.t.  dᵀ∇²f(x_k)d ≤ 1
Step choice

Given d_k, how to choose α_k in x_{k+1} = x_k + α_k d_k? Optimal choice (one-dimensional optimization):

α_k = arg min_{α≥0} f(x_k + αd_k).

An analytical expression of the optimal step is available only in a few cases, e.g. if f(x) = ½xᵀQx + cᵀx with Q ≻ 0. Then

f(x_k + αd_k) = ½(x_k + αd_k)ᵀQ(x_k + αd_k) + cᵀ(x_k + αd_k)
             = ½α² d_kᵀQd_k + α(Qx_k + c)ᵀd_k + const

Minimizing w.r.t. α:

α d_kᵀQd_k + (Qx_k + c)ᵀd_k = 0   ⇒   α_k = −∇f(x_k)ᵀd_k / (d_kᵀQd_k)

In general it is important to ensure a sufficient reduction of f and a sufficiently large step ‖x_{k+1} − x_k‖.
Armijo's rule

Input: δ ∈ (0, 1), γ ∈ (0, 1/2), s_k > 0
α := s_k
while f(x_k + αd_k) > f(x_k) + γα d_kᵀ∇f(x_k) do
    α := δα
end
return α

Typical values: δ ∈ [0.1, 0.5], γ ∈ [10⁻⁴, 10⁻³]. On exit the returned step is such that

f(x_k + αd_k) ≤ f(x_k) + γα d_kᵀ∇f(x_k)
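The rule above translates almost line-by-line into code. A minimal sketch (function and parameter names are mine):

```python
import numpy as np

def armijo(f, grad_f, x, d, s=1.0, delta=0.5, gamma=1e-4):
    """Backtracking line search: shrink the step by factor delta until the
    sufficient-decrease condition f(x+a d) <= f(x) + gamma a grad^T d holds."""
    slope = grad_f(x) @ d
    assert slope < 0, "d must be a descent direction"
    a = s
    while f(x + a * d) > f(x) + gamma * a * slope:
        a *= delta
    return a

# Example on f(x) = 0.5||x||^2 with the steepest descent direction d = -grad f
f = lambda x: 0.5 * x @ x
g = lambda x: x
x0 = np.array([3.0, -4.0])
a = armijo(f, g, x0, -g(x0))
print(a, f(x0 + a * (-g(x0))))   # the full step a = 1 is accepted, landing at 0
```

Termination is guaranteed for a descent direction because the condition must hold for α small enough; in practice one also caps the number of halvings.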
Acceptable steps

A step can also be chosen by interpolation: fit a quadratic model q(α) = c₀ + c₁α + c₂α² along the direction, using f(x_k), the directional derivative d_kᵀ∇f(x_k) and a third condition (a trial value, or an available estimate f̂ of the minimum of f along d_k, which gives c₂ = c₁²/(2(f̂ − c₀))), and take the minimizer of q.

If a sufficiently accurate step size is used, the conditions of the theorem on global convergence are satisfied and the steepest descent algorithm globally converges to a stationary point. "Sufficiently accurate" means exact line search or, e.g., Armijo's rule.
Analysis

Steepest descent with a fixed step α on f(x) = ½xᵀQx, Q ≻ 0 (minimum x* = 0):

x_{k+1} = (I − αQ)x_k,    ‖x_{k+1}‖² = x_kᵀ(I − αQ)² x_k

Let A be symmetric with eigenvalues λ₁ ≤ ⋯ ≤ λ_n. Then:
- λ is an eigenvalue of A iff −λ is an eigenvalue of −A, iff 1 + λ is an eigenvalue of I + A
- λ₁‖v‖² ≤ vᵀAv ≤ λ_n‖v‖² for all v ∈ Rⁿ

thus

x_kᵀ(I − αQ)² x_k ≤ max{(1 − αλ₁)², (1 − αλ_n)²} ‖x_k‖²

and

‖x_{k+1}‖ ≤ max{|1 − αλ₁|, |1 − αλ_n|} ‖x_k‖
Eliminating the dependency on k, the best fixed step minimizes the contraction factor:

min_α max{|1 − αλ₁|, |1 − αλ_n|}

Since 0 ≤ λ₁ ≤ λ_n, the two piecewise-linear functions |1 − αλ₁| and |1 − αλ_n| cross at the minimum point, where

1 − αλ₁ = −(1 − αλ_n)

[Figure: the functions |1 − αλ₁| and |1 − αλ_n| and their upper envelope]

i.e.

α* = 2/(λ₁ + λ_n)
Analysis

In the best possible case,

‖x_{k+1}‖/‖x_k‖ ≤ |1 − α*λ₁| = 1 − 2λ₁/(λ₁ + λ_n) = (λ_n − λ₁)/(λ_n + λ₁) = (κ − 1)/(κ + 1)

where κ = λ_n/λ₁ is the condition number of Q: κ ≫ 1 (ill-conditioned problem) ⇒ very slow convergence; κ ≈ 1 ⇒ very fast convergence.

Zigzagging

min ½(x² + My²)

where M > 0. Optimum: x* = 0, y* = 0. Starting point: (M, 1). With exact line search the iterates satisfy

(x_{k+1}, y_{k+1}) = ((M − 1)/(M + 1)) (x_k, −y_k)

Convergence is rapid if M ≈ 1; very slow, with zigzagging, if M ≫ 1 or M ≪ 1. [Figure: zigzagging iterates on elongated elliptical level sets] Slow convergence and zigzagging are general phenomena (especially when the starting point is near the longest axis of the ellipsoidal level sets).
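The exact contraction factor (M − 1)/(M + 1) can be observed directly by running steepest descent with the exact line-search step derived earlier:

```python
import numpy as np

# Steepest descent with exact line search on f(x,y) = 0.5*(x^2 + M*y^2).
M = 10.0
Q = np.diag([1.0, M])
z = np.array([M, 1.0])                 # starting point (M, 1)
ratios = []
for _ in range(20):
    g = Q @ z                          # gradient
    a = (g @ g) / (g @ Q @ g)          # exact step for a quadratic
    z_new = z - a * g
    ratios.append(np.linalg.norm(z_new) / np.linalg.norm(z))
    z = z_new
print(ratios[-1], (M - 1) / (M + 1))   # both ~ 0.8182 for M = 10
```

Every iteration contracts the error by exactly (M − 1)/(M + 1), so with M = 10 roughly 12 iterations are needed per decimal digit of accuracy.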
Newton's method: local convergence

Expand the gradient around x_k at a stationary point x* (∇f(x*) = 0):

0 = ∇f(x*) = ∇f(x_k) + ∇²f(x_k)(x* − x_k) + o(‖x* − x_k‖)

If ∇²f is invertible near x* with ‖(∇²f(x))^{−1}‖ ≤ M, multiplying by (∇²f(x_k))^{−1}:

x* − x_k + (∇²f(x_k))^{−1}∇f(x_k) = o(‖x* − x_k‖)

and thus, for the Newton iterate x_{k+1} = x_k − (∇²f(x_k))^{−1}∇f(x_k),

‖x* − x_{k+1}‖ = o(‖x* − x_k‖)

i.e. locally superlinear convergence (quadratic when the Hessian is Lipschitz).

Difficulties

Many things might go wrong:
- at some iteration, ∇²f(x_k) might be singular; for example, if x_k belongs to a flat region where f(x) is constant
- even if non-singular, inverting ∇²f(x_k) or, in any case, solving a linear system with coefficient matrix ∇²f(x_k) may be numerically unstable and is computationally demanding
- there is no guarantee that ∇²f(x_k) ≻ 0: the Newton direction might not be a descent direction
- Newton's method just tries to solve the system ∇f(x) = 0 and thus might very well be attracted towards a maximum; the method lacks global convergence: it converges only if started near a local optimum

Newton-type methods
- line search variant: x_{k+1} = x_k − α_k (∇²f(x_k))^{−1} ∇f(x_k)
- modified Newton method: replace ∇²f(x_k) by ∇²f(x_k) + D_k, where D_k is chosen so that ∇²f(x_k) + D_k is positive definite
Quasi-Newton methods

Consider solving the nonlinear system ∇f(x) = 0. Taylor expansion of the gradient:

∇f(x_k) ≈ ∇f(x_{k+1}) + ∇²f(x_{k+1})(x_k − x_{k+1})

Quasi-Newton equation

Let:

s_k := x_{k+1} − x_k
y_k := ∇f(x_{k+1}) − ∇f(x_k)

Quasi-Newton (secant) equation: B_{k+1}s_k = y_k. If B_k was the previous approximate Hessian, we ask that
1. the variation between B_k and B_{k+1} is small
2. nothing changes along directions which are normal to the step s_k:

B_k z = B_{k+1} z   ∀z : zᵀs_k = 0

Choosing n − 1 vectors z orthogonal to s_k, this condition plus the secant equation gives n² linearly independent equations in the n² unknowns ⇒ a unique solution.

Broyden updating

It can be shown that the unique solution is given by:

B_{k+1} = B_k + (y_k − B_k s_k)s_kᵀ / (s_kᵀs_k)

Proof sketch: B_{k+1} solves min ‖B − B_k‖_F s.t. Bs_k = y_k, where ‖X‖_F² = Tr XᵀX is the Frobenius norm. Indeed, for any B with Bs_k = y_k,

‖B_{k+1} − B_k‖_F = ‖(y_k − B_k s_k)s_kᵀ/(s_kᵀs_k)‖_F = ‖(B − B_k) s_k s_kᵀ/(s_kᵀs_k)‖_F ≤ ‖B − B_k‖_F

since s_k s_kᵀ/(s_kᵀs_k) is an orthogonal projection. Unicity is a consequence of the strict convexity of the norm and the convexity of the feasible region.
Symmetry

The Broyden update does not preserve symmetry. Remedy: let

C₁ = B_k + (y_k − B_k s_k)s_kᵀ/(s_kᵀs_k)

and symmetrize:

C₂ = ½(C₁ + C₁ᵀ)

C₂ in turn violates the secant equation; alternating correction and symmetrization, in the limit one obtains

B_{k+1} = B_k + [(y_k − B_k s_k)s_kᵀ + s_k(y_k − B_k s_k)ᵀ]/(s_kᵀs_k) − [s_kᵀ(y_k − B_k s_k)] s_k s_kᵀ/(s_kᵀs_k)²

(PSB, Powell-Broyden-Symmetric update). Imposing also hereditary positive definiteness, DFP (Davidon-Fletcher-Powell) is obtained:

B_{k+1} = B_k + [(y_k − B_k s_k)y_kᵀ + y_k(y_k − B_k s_k)ᵀ]/(y_kᵀs_k) − [s_kᵀ(y_k − B_k s_k)] y_k y_kᵀ/(y_kᵀs_k)²

BFGS

Same ideas, but applied to the approximate inverse Hessian. Inverse quasi-Newton equation:

s_k = H_{k+1} y_k

BFGS update:

H_{k+1} = (I − s_k y_kᵀ/(y_kᵀs_k)) H_k (I − y_k s_kᵀ/(y_kᵀs_k)) + s_k s_kᵀ/(y_kᵀs_k)

BFGS method:

x_{k+1} = x_k − α_k H_k ∇f(x_k)
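The BFGS update is a rank-two modification, cheap to apply. A sketch on a small quadratic with exact line search (on a quadratic, BFGS then terminates in at most n steps):

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation:
    H+ = (I - s y^T/(y^T s)) H (I - y s^T/(y^T s)) + s s^T/(y^T s)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Minimize f(x) = 0.5 x^T Q x - b^T x (gradient Qx - b) with BFGS
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x, H = np.zeros(2), np.eye(2)
for _ in range(10):
    g = Q @ x - b
    if np.linalg.norm(g) < 1e-12:
        break
    d = -H @ g
    a = -(g @ d) / (d @ Q @ d)        # exact line search for a quadratic
    x_new = x + a * d
    s, y = x_new - x, (Q @ x_new - b) - g
    if y @ s > 1e-12:                  # curvature condition keeps H positive definite
        H = bfgs_update(H, s, y)
    x = x_new
print(x, np.linalg.solve(Q, b))        # both equal the minimizer Q^{-1} b
```

The guard y ᵀs > 0 is what preserves hereditary positive definiteness; with an Armijo or Wolfe line search it must be checked explicitly.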
Trust region methods

min m_k(x)
s.t.  ‖x − x_k‖ ≤ Δ_k

where Δ_k > 0 is a parameter. First advantage (over pure Newton): the step is always defined (thanks to Weierstrass' theorem). How to choose and update the trust region radius Δ_k? Given a step s_k, let

ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k(0) − m_k(s_k))

the ratio between the actual reduction and the predicted reduction. The predicted reduction is always non-negative; if ρ_k is small (surely if it is negative), the model and the function strongly disagree: the step must be rejected and the trust region reduced. If ρ_k ≈ 1 it is safe to expand the trust region.

Algorithm (one standard version)

Data: Δ̄ > 0, Δ₀ ∈ (0, Δ̄), η ∈ [0, 1/4)
for k = 0, 1, … do
    find the step s_k minimizing the model in the trust region and compute ρ_k
    if ρ_k < 1/4 then Δ_{k+1} = Δ_k/4
    else if ρ_k > 3/4 and ‖s_k‖ = Δ_k then Δ_{k+1} = min(2Δ_k, Δ̄)
    else Δ_{k+1} = Δ_k
    accept the step (x_{k+1} = x_k + s_k) iff ρ_k > η
end

Characterization of the model minimizer: s solves the subproblem iff there exists μ ≥ 0 with

Bs + μs = −∇f(x_k),   μ(Δ − ‖s‖) = 0

Thus either s is in the interior of the ball of radius Δ, in which case μ = 0 and we have the (quasi-)Newton step p = −B_k^{−1}∇f(x_k), or ‖s‖ = Δ and, if μ > 0, μs = −∇f(x_k) − Bs = −∇m_k(s): s is parallel to the negative gradient of the model and normal to its contour lines.

Cauchy point: minimize the model along −∇f(x_k) within the ball. For the step size:
- if ∇f(x_k)ᵀB_k∇f(x_k) ≤ 0 (negative curvature direction): largest possible step, τ_k = 1 (to the boundary)
- otherwise τ_k = min{‖∇f(x_k)‖³/(Δ_k ∇f(x_k)ᵀB_k∇f(x_k)), 1}

Choosing the Cauchy point gives global but extremely slow convergence (similar to steepest descent). Usually an improved point is searched starting from the Cauchy one.
Pattern Search

For smooth optimization, but without knowledge of derivatives. Elementary idea: if x ∈ R² is not a local minimum of f, then at least one of the directions e₁, e₂, −e₁, −e₂ (moving towards E, N, W, S) forms an acute angle with −∇f(x) and is a descent direction. Direct search: explore all the directions in search of one which gives a descent.

Coordinate search

Let D = {±e_i} be the set of coordinate directions and their opposites.

Data: k = 0, α₀ an initial step length, x₀ a starting point
while α_k is large enough do
    if f(x_k + α_k d) < f(x_k) for some d ∈ D then
        x_{k+1} = x_k + α_k d   (step accepted)
    else
        α_{k+1} = 0.5 α_k
    end
    k = k + 1
end

Pattern search

It is not necessary to explore 2n directions. It is sufficient that the set of directions forms a positive span, i.e. every v ∈ Rⁿ should be expressible as a non-negative linear combination of the vectors in the set. Formally, G is a generating set iff

∀v ≠ 0 ∈ Rⁿ  ∃g ∈ G : vᵀg > 0

The quality of a generating set can be measured by its cosine measure, min_{v≠0} max_{g∈G} vᵀg/(‖v‖‖g‖). [Figure: examples of generating sets with different cosine measures]

Step choice

x_{k+1} = x_k + α_k d_k   if f(x_k + α_k d_k) < f(x_k) − ρ(α_k)   (success)
x_{k+1} = x_k             otherwise (failure)

where ρ(t) = o(t) is a forcing function, and

α_{k+1} = θ_k α_k

where θ_k ≥ 1 for successful iterations, θ_k < 1 otherwise. Direct methods possess good convergence properties.
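The coordinate-search loop above can be sketched in a few lines (names and tolerances are mine):

```python
import numpy as np

def coordinate_search(f, x, step=1.0, tol=1e-6, max_iter=10000):
    """Derivative-free coordinate search: try +/- each coordinate direction;
    halve the step when no direction gives a (strict) decrease."""
    n = len(x)
    D = np.vstack([np.eye(n), -np.eye(n)])    # the 2n coordinate directions
    for _ in range(max_iter):
        if step < tol:
            break
        for d in D:
            if f(x + step * d) < f(x):
                x = x + step * d              # success: accept the move
                break
        else:
            step *= 0.5                       # failure: shrink the step
    return x

f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2
x = coordinate_search(f, np.array([5.0, 5.0]))
print(x)   # ~ (1, -0.5)
```

Only function values are used; the price is up to 2n evaluations per iteration, which is why generating sets with fewer directions (e.g. n + 1 vectors) are attractive in higher dimension.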
Nelder-Mead Simplex

Given a simplex S = {v₁, …, v_{n+1}} in Rⁿ, let v_r be the worst point: r = arg max_i {f(v_i)}. Let C be the centroid of S ∖ {v_r}:

C = (1/n) Σ_{i≠r} v_i

The algorithm performs a sort of line search along the direction C − v_r. Let

R = C + (C − v_r)

be the reflection of the worst point along that direction, and let f_best be the best function value in the current simplex. Three cases might occur:

1: Reflection. Check f(R): if it is intermediate, i.e. better than the worst and worse than the best, then accept the reflection: discard the worst point in the simplex and replace it with R. [Figure: reflection step]

2: Expansion. If the trial step is an improvement, f(R) < f_best, then attempt an expansion: try to move R to R′ = R + (R − C). If successful (f(R′) < f(R)), accept the expansion and discard the worst point. If unsuccessful, accept R as the new point and discard the worst one. [Figure: expansion step]

3: Contraction. If the reflected point R is worse than all points in the simplex (possibly except the worst v_r), then a contraction step is performed: if f(R) > f(v_r), add 0.5(v_r + C). [Figure: contraction step]

Remarks. Nelder-Mead is not a direct search method (only a single direction at a time is explored). It is widely used by practitioners; however, it may fail to converge to a local minimum: there are examples of strictly convex functions in R² on which the method converges to a non-stationary point. The bad convergence properties are connected to the event that the n-dimensional simplex degenerates into a lower-dimensional space. Moreover, the method has a strong tendency to generate directions which are almost normal to that of the gradient! Convergent variants of the Nelder-Mead method do exist.
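Despite its theoretical weaknesses, Nelder-Mead is readily available in standard libraries and works well on many small smooth problems. A quick sketch on the classic Rosenbrock function:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: minimum at (1, 1); Nelder-Mead uses function values only
f = lambda x: (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2
res = minimize(f, x0=np.array([-1.2, 1.0]), method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-8, 'maxiter': 5000})
print(res.x)   # ~ (1, 1)
```

No gradient is supplied or estimated; the solver only compares function values at the simplex vertices.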
Implicit filtering

Let

f(x) = h(x) + w(x)

where h(x) is a smooth function, while w(x) can be considered as an additive, typically random, noise. The method performs a rough estimate of the gradient (finite differences with a large step) and proceeds with an Armijo line search. If unsuccessful, the step used for the finite differences is reduced.

Data: x₀, a scale sequence {δ_k} ↓ 0
repeat
    OuterIteration = false
    repeat
        compute f(x_k) and a central finite-difference estimate of the gradient:
        ∇_{δ_k} f(x_k) = [(f(x_k + δ_k e_i) − f(x_k − δ_k e_i))/(2δ_k)]_i
        if ‖∇_{δ_k} f(x_k)‖ ≤ δ_k then OuterIteration = true
        else run Armijo along −∇_{δ_k} f(x_k): if successful accept the Armijo step; otherwise OuterIteration = true
        k = k + 1
    until OuterIteration
until the scale is exhausted

Convergence properties

If
- ∇²h(x) is Lipschitz continuous
- the noise vanishes fast enough relative to the scale, where φ(x; δ) = sup_{z : ‖z − x‖ ≤ δ} |w(z)|
- unsuccessful Armijo steps occur at most a finite number of times

then all limit points of {x_k} are stationary (for the smooth part h).
Frank-Wolfe method

Let X be a convex set. Consider the problem:

min f(x),  x ∈ X

Let x_k ∈ X: choosing a feasible direction d_k corresponds to choosing a point x̂ ∈ X: d_k = x̂ − x_k. Steepest descent choice:

x̂_k ∈ arg min_{x∈X} ∇ᵀf(x_k)(x − x_k)

(a linear objective with convex constraints, usually easy to solve). If ∇ᵀf(x_k)(x̂_k − x_k) = 0, then

∇ᵀf(x_k)d ≥ 0

for every feasible direction d: first-order necessary conditions hold. Otherwise, d_k = x̂_k − x_k is a descent direction along which a step α_k ∈ (0, 1] might be chosen according to Armijo's rule.

Gradient projection

x_{k+1} = [x_k − s_k ∇f(x_k)]⁺ (projection onto X). The method is slightly faster than Frank-Wolfe, with a linear convergence rate similar to that of (unconstrained) steepest descent. It might be applied if projection is relatively cheap, e.g. when the feasible set is a box. A point x_k satisfies the first-order necessary conditions dᵀ∇f(x_k) ≥ 0 iff

x_k = [x_k − s_k ∇f(x_k)]⁺
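When X is the unit simplex, the Frank-Wolfe linearized subproblem has a trivial solution: put all the weight on the coordinate with the smallest gradient entry. A sketch with the classical diminishing step 2/(k + 2) (problem data are mine):

```python
import numpy as np

# Frank-Wolfe on min 0.5 x^T Q x - b^T x over the unit simplex.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])
x = np.array([0.5, 0.5])                # feasible starting point
for k in range(2000):
    g = Q @ x - b                       # gradient
    v = np.zeros_like(x)
    v[np.argmin(g)] = 1.0               # vertex minimizing the linear model
    d = v - x                           # feasible direction
    if g @ d >= -1e-10:                 # Frank-Wolfe gap ~ 0: stationary
        break
    x = x + 2.0 / (k + 2.0) * d         # classical diminishing step
print(x)   # ~ (0.25, 0.75) for this data
```

The iterates remain feasible by construction (convex combinations of vertices), and the quantity −gᵀd is the Frank-Wolfe gap, a computable optimality certificate.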
Barrier Methods

min f(x)
s.t.  g_j(x) ≤ 0,  j = 1, …, r

A barrier is a continuous function which tends to +∞ whenever x approaches the boundary of the feasible region. Examples of barrier functions:

B(x) = −Σ_j log(−g_j(x))   (logarithmic barrier)
B(x) = −Σ_j 1/g_j(x)       (inverse barrier)

Let ε_k ↓ 0 and x₀ be strictly feasible, i.e. g_j(x₀) < 0 ∀j. Then let

x_k = arg min_{x∈Rⁿ} f(x) + ε_k B(x)

Proposition: every limit point of {x_k} is a global minimum of the constrained optimization problem.

Multiplier estimates: if B(x) = Σ_j φ(g_j(x)), each x_k satisfies

∇f(x_k) + ε_k Σ_j φ′(g_j(x_k)) ∇g_j(x_k) = 0

- if lim_k g_j(x_k) < 0, then φ′(g_j(x_k)) ∇g_j(x_k) remains bounded and the corresponding coefficient ε_k φ′ → 0
- if lim_k g_j(x_k) = 0, then (thanks to the unicity of Lagrange multipliers)

λ_j = lim_k ε_k φ′(g_j(x_k))

yields the KKT multiplier of the j-th constraint.
Example

min (x − 1)² + (y − 1)²
s.t.  x + y ≤ 1

Logarithmic barrier problem:

min (x − 1)² + (y − 1)² − ε_k log(1 − x − y),   x + y < 1

Setting the gradient to zero (by symmetry x = y):

2(x − 1) + ε_k/(1 − x − y) = 0

Stationary points:

x = y = (3 − √(1 + 4ε_k))/4  →  (1/2, 1/2)   as ε_k → 0

Linear programming: with the logarithmic barrier on the constraint x ≥ 0,

min cᵀx − μ Σ_j log x_j
s.t.  Ax = b,  x > 0

The trajectory x(μ) of solutions to the barrier problem is called the central path and leads to an optimal solution of the LP.
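The closed-form barrier minimizer derived above can be verified numerically, both as a stationary point of the barrier function and in its limit as ε → 0:

```python
import numpy as np

# Log-barrier for min (x-1)^2 + (y-1)^2  s.t.  x + y <= 1.
# By symmetry x = y, and stationarity gives x = (3 - sqrt(1 + 4*eps)) / 4.
def barrier_min(eps):
    return (3.0 - np.sqrt(1.0 + 4.0 * eps)) / 4.0

for eps in [1.0, 0.1, 0.01, 1e-4]:
    xb = barrier_min(eps)
    # stationarity of phi(x) = 2(x-1)^2 - eps*log(1-2x) along the diagonal:
    dphi = 4.0 * (xb - 1.0) + 2.0 * eps / (1.0 - 2.0 * xb)
    print(eps, xb, dphi)   # dphi ~ 0 for every eps
# as eps -> 0 the barrier minimizers trace a path toward the solution (1/2, 1/2)
```

Each row shows a point of the central path; the derivative check confirms the formula, and the points approach the constrained optimum as ε shrinks.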
Penalty Methods

Penalized problem:

min f(x) + ε P(x)

where, e.g., for equality constraints P(x) = Σ_i h_i(x)². Solve the penalized problem with ε = ε_k (with an iterative method initialized at x_k); then let ε_{k+1} > ε_k, k := k + 1. If x_{k+1} is a global minimizer of the penalized problem and ε_k → ∞, then every limit point of {x_k} is a global optimum of the constrained problem.

Exact penalties

Exact penalties: there exists a finite penalty parameter value such that the optimal solution of the penalized problem is the optimal solution of the original one. ℓ₁ penalty function for the problem

min f(x)
s.t.  h_i(x) = 0
      g_j(x) ≤ 0

is

P₁(x; ε) = f(x) + ε (Σ_i |h_i(x)| + Σ_j max(0, g_j(x)))
Motivation

Augmented Lagrangian for min f(x) s.t. h(x) = 0:

L_c(x, λ) = f(x) + λ^T h(x) + (c/2) ‖h(x)‖²

Gradient:

∇_x L_c(x, λ) = ∇f(x) + ∑_i λ_i ∇h_i(x) + c ∇h(x) h(x) = ∇_x L(x, λ) + c ∇h(x) h(x)

where ∇h(x) is the matrix whose columns are the ∇h_i(x).
Let (x*, λ*) be an optimal (primal and dual) pair. Necessarily ∇_x L(x*, λ*) = 0; moreover h(x*) = 0, thus

∇_x L_c(x*, λ*) = ∇_x L(x*, λ*) + c ∇h(x*) h(x*) = 0

i.e. (x*, λ*) is a stationary point of the augmented Lagrangian for every c ≥ 0.
Observe that:

∇²_xx L_c(x, λ) = ∇²_xx L(x, λ) + c ∇h(x) ∇h(x)^T + c ∑_i h_i(x) ∇²h_i(x)

so that, wherever h(x) = 0,

∇²_xx L_c(x, λ) = ∇²_xx L(x, λ) + c ∇h(x) ∇h(x)^T
For every v ≠ 0 such that v^T ∇h(x*) = 0 (second-order sufficient conditions give v^T ∇²_xx L(x*, λ*) v > 0 for such v):

v^T ∇²_xx L_c(x*, λ*) v = v^T ∇²_xx L(x*, λ*) v + c ‖∇h(x*)^T v‖² = v^T ∇²_xx L(x*, λ*) v > 0
Let v ≠ 0 with v^T ∇h(x*) ≠ 0. Then

v^T ∇²_xx L_c(x*, λ*) v = v^T ∇²_xx L(x*, λ*) v + c ‖∇h(x*)^T v‖²

where the first term might be negative; however, the second is strictly positive and proportional to c. Thus, if c is large enough, the Hessian of the augmented Lagrangian is positive definite and x* is a (strict) local minimum of L_c(·, λ*).
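The method of multipliers built on this idea can be sketched as below; the inner gradient-descent solver, step sizes and test problem are illustrative assumptions of mine:

```python
import numpy as np

def method_of_multipliers(grad_f, h, grad_h, x0, c=10.0, outer=20):
    """Method-of-multipliers sketch for min f(x) s.t. h(x) = 0 (one constraint).

    Inner step: minimize L_c(x, lam) = f(x) + lam*h(x) + (c/2)*h(x)^2 by
    gradient descent; outer step: first-order multiplier update lam += c*h(x).
    """
    x = np.asarray(x0, dtype=float)
    lam = 0.0
    for _ in range(outer):
        for _ in range(5000):
            g = grad_f(x) + (lam + c * h(x)) * grad_h(x)
            if np.linalg.norm(g) < 1e-12:
                break
            x = x - 0.01 * g
        lam += c * h(x)
    return x, lam

# Example: min x1^2 + x2^2 s.t. x1 + x2 = 1; optimum (1/2, 1/2), multiplier -1.
x, lam = method_of_multipliers(lambda x: 2 * x,
                               lambda x: x[0] + x[1] - 1,
                               lambda x: np.array([1.0, 1.0]),
                               x0=[0.0, 0.0])
```

Unlike the pure penalty method, c stays fixed here: the multiplier update does the work, avoiding the ill-conditioning of μ → ∞.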
Inequality constraints
Given the problem

min f(x)  s.t.  h_i(x) = 0 (i = 1, …, m),  g_j(x) ≤ 0 (j = 1, …, p)

each inequality can be turned into an equality by adding a squared slack variable z_j:

g_j(x) + z_j² = 0
Minimizing the augmented Lagrangian with respect to the slack variables can be done in closed form. With u_j = z_j² ≥ 0, for each j:

min_{u_j ≥ 0} [ μ_j (g_j(x) + u_j) + (c/2)(g_j(x) + u_j)² ]

whose solution is

u_j = max{ 0, −g_j(x) − μ_j/c }

Substituting back:

L_c(x; λ, μ) = f(x) + λ^T h(x) + (c/2)‖h(x)‖² + (1/2c) ∑_j ( max{0, μ_j + c g_j(x)}² − μ_j² )
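The slack-eliminated form above leads directly to the multiplier update μ_j ← max{0, μ_j + c g_j(x)}. A minimal sketch for a single inequality (solver and test problem are my own assumptions):

```python
import numpy as np

def al_inequality(grad_f, g, grad_g, x0, c=10.0, outer=25):
    """Augmented-Lagrangian sketch for min f(x) s.t. g(x) <= 0 (one inequality).

    The x-gradient of the slack-eliminated term
    (1/2c)(max(0, mu + c*g(x))^2 - mu^2) is max(0, mu + c*g(x)) * grad_g(x).
    """
    x = np.asarray(x0, dtype=float)
    mu = 0.0
    for _ in range(outer):
        for _ in range(5000):
            grad = grad_f(x) + max(0.0, mu + c * g(x)) * grad_g(x)
            if np.linalg.norm(grad) < 1e-12:
                break
            x = x - 0.01 * grad
        mu = max(0.0, mu + c * g(x))   # multiplier update
    return x, mu

# Example: min (x-2)^2 s.t. x <= 1 (g(x) = x - 1); optimum x* = 1, mu* = 2.
x, mu = al_inequality(lambda x: 2 * (x - 2), lambda x: float(x[0] - 1),
                      lambda x: np.array([1.0]), x0=[0.0])
```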
Idea: apply Newton's method to solve the KKT equations. Lagrangian function:

L(x; λ) = f(x) + ∑_i λ_i h_i(x)

Newton step:

(x_{k+1}; λ_{k+1}) = (x_k; λ_k) + (d_k; δ_k)

where (d_k; δ_k) solves the linear system

[ ∇²_xx L(x_k; λ_k)   H^T(x_k) ] [ d_k ]     [ ∇f(x_k) + H^T(x_k) λ_k ]
[ H(x_k)                    0  ] [ δ_k ] = − [ h(x_k)                 ]

and H(x_k) denotes the Jacobian of the constraints at x_k.
existence

The Newton step exists if:
- the Jacobian of the constraints H(x_k) has full row rank;
- the Hessian ∇²_xx L(x_k; λ_k) is positive definite (on the null space of H(x_k)).

In this case the Newton step is the unique solution of the quadratic programming problem

min_d ∇f(x_k)^T d + (1/2) d^T ∇²_xx L(x_k; λ_k) d   s.t.  h(x_k) + H(x_k) d = 0

which minimizes a quadratic approximation of the Lagrangian subject to a first-order approximation of the constraints. KKT conditions of the QP:

∇²_xx L(x_k; λ_k) d + ∇f(x_k) + H^T(x_k) λ = 0,   h(x_k) + H(x_k) d = 0

Under the same conditions as before this QP has a unique solution d_k, with Lagrange multipliers λ_{k+1}.
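The KKT linear system above can be assembled and solved directly; a sketch on an equality-constrained QP (for which a single Newton step is exact — the example problem is my own choice):

```python
import numpy as np

def newton_kkt_step(hess_L, grad_f, H, h_val, lam):
    """One Newton step on the KKT system for min f(x) s.t. h(x) = 0.

    Solves [W  H^T; H  0][d; dlam] = -[grad_f + H^T lam; h]
    and returns (d, lam_next) with lam_next = lam + dlam.
    """
    m, n = H.shape
    K = np.block([[hess_L, H.T], [H, np.zeros((m, m))]])
    rhs = -np.concatenate([grad_f + H.T @ lam, h_val])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], lam + sol[n:]

# Example: min (1/2) x^T x  s.t.  x1 + x2 = 1 (solution (1/2, 1/2), lambda = -1/2).
x = np.zeros(2)
lam = np.zeros(1)
d, lam = newton_kkt_step(hess_L=np.eye(2), grad_f=x,
                         H=np.array([[1.0, 1.0]]), h_val=np.array([-1.0]), lam=lam)
x = x + d
```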
Inequalities

If the original problem is

min f(x)  s.t.  h_i(x) = 0, g_j(x) ≤ 0

the QP subproblem also includes the linearized inequalities g_j(x_k) + ∇g_j(x_k)^T d ≤ 0 (Sequential Quadratic Programming).
Filter Methods
Basic idea: the problem

min f(x)  s.t.  g(x) ≤ 0

can be considered as a problem with two objectives: minimize f(x) and minimize the constraint violation h(x) (the second objective has priority over the first).
Filter
Given the problem

min f(x)  s.t.  g_j(x) ≤ 0, j = 1, …, k

let {(f_k, h_k), k = 1, 2, …} be the observed values of f and h at the points x_1, x_2, …, where the constraint violation is measured by

h(x) = ‖ max{g(x), 0} ‖

(the norm is used here in order to keep the subproblem a QP). A pair (f_k, h_k) dominates a pair (f_ℓ, h_ℓ) iff

f_k ≤ f_ℓ  and  h_k ≤ h_ℓ

The filter is the list of currently non-dominated pairs; a trial point is acceptable iff it is not dominated by any pair in the filter. Compare with traditional (unconstrained) trust region methods: if the current step is a failure, the trust region is reduced; eventually the step becomes a pure gradient step ⇒ convergence!
Filter methods
Data: x_0 (starting point), initial trust-region radius ρ, k = 0.

repeat
    Solve the QP subproblem (with trust region ρ), in which the constraints are linearized:
        g_j(x_k) + ∇g_j(x_k)^T p ≤ 0
    and get a step d_k; try setting x_{k+1} = x_k + d_k;
    if (f_{k+1}, h_{k+1}) is acceptable to the filter then
        accept x_{k+1} and add (f_{k+1}, h_{k+1}) to the filter;
        remove dominated points from the filter;
        possibly increase ρ;
    else
        reject the step and reduce ρ;
    set k := k + 1;
end

Algorithms for constrained local optimization p. 43
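The filter bookkeeping (dominance test, acceptability, insertion with pruning) can be sketched as plain pair comparisons; the sample pairs below are illustrative:

```python
def dominates(a, b):
    """Pair a = (f_a, h_a) dominates b iff it is no worse in both entries."""
    return a[0] <= b[0] and a[1] <= b[1]

def acceptable(pair, filt):
    """A trial pair is acceptable iff no filter entry dominates it."""
    return not any(dominates(entry, pair) for entry in filt)

def add_to_filter(pair, filt):
    """Insert an acceptable pair and drop the entries it dominates."""
    return [entry for entry in filt if not dominates(pair, entry)] + [pair]

filt = []
for pair in [(5.0, 2.0), (4.0, 3.0), (3.0, 1.0), (6.0, 0.5)]:
    if acceptable(pair, filt):
        filt = add_to_filter(pair, filt)
```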
min f(x)  s.t.  x ∈ S

and

x* = arg min f(x):  f(x*) ≤ f(x) ∀x ∈ S

This definition is unsatisfactory: the problem is ill posed in x* (two objective functions which differ only slightly might have global optima which are arbitrarily far apart); it is however well posed in the optimal values:

‖f − g‖_∞ ≤ ε  ⇒  |f* − g*| ≤ ε

Quite often we are satisfied with looking for f* and searching for one or more feasible solutions x̃ such that

f(x̃) ≤ f(x*) + ε
Complexity
A similar situation holds in Bertsekas, Nonlinear Programming (1999): 777 pages, but only the definition of global minima and maxima is given! Nocedal & Wright, Numerical Optimization, 2nd edition, 2006: "Global solutions are needed in some applications, but for many problems they are difficult to recognize and even more difficult to locate … many successful global optimization algorithms require the solution of many local optimization problems, to which the algorithms described in this book can be applied." Global optimization is hopeless: without global information no algorithm will find a certifiable global optimum unless it generates a dense sample. There exists a rigorous definition of global information; some examples:
- the number of local optima;
- the global optimum value;
- for global optimization problems over a box, (an upper bound on) the Lipschitz constant L:

|f(y) − f(x)| ≤ L ‖x − y‖  ∀x, y

- an explicit representation of the objective function as the difference between two convex functions (+ convexity of the feasible region).
Complexity
Global optimization is computationally intractable also according to classical complexity theory. Special cases: Quadratic Programming

min (1/2) x^T Q x + c^T x  s.t.  l ≤ Ax ≤ u

is NP-hard [Sahni, 1974] and, when considered as a decision problem, NP-complete [Vavasis, 1990]. Hard special cases include: quadratic optimization on a hyper-rectangle (A = I), even when only one eigenvalue of Q is negative; quadratic minimization over a simplex:

min (1/2) x^T Q x + c^T x  s.t.  x ≥ 0, ∑_j x_j = 1
or binary linear programming:

min c^T x  s.t.  Ax = b, x ∈ {0, 1}^n
Introduction to Global Optimization p. 11
Minimization of cost functions which are neither convex nor concave, e.g. finding the minimum-energy conformation of complex molecules: Lennard-Jones micro-clusters, protein folding, protein-ligand docking. Example (Lennard-Jones): the pair potential due to two atoms at X_1, X_2 ∈ R³ is

v(r) = 1/r¹² − 2/r⁶

where r = ‖X_1 − X_2‖. The total energy of a cluster of N atoms located at X_1, …, X_N ∈ R³ is defined as:

LJ(X) = ∑_{i=1}^N ∑_{j<i} v(‖X_i − X_j‖)

This function has a number of local (non-global) minima which grows like exp(N).
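The cluster energy is straightforward to evaluate (the atom coordinates below are illustrative):

```python
import math
from itertools import combinations

def v(r):
    """Lennard-Jones pair potential, normalized so that the minimum is v(1) = -1."""
    return 1.0 / r**12 - 2.0 / r**6

def lj_energy(atoms):
    """Total cluster energy: sum of pair potentials over all atom pairs."""
    return sum(v(math.dist(a, b)) for a, b in combinations(atoms, 2))

# Two atoms at the optimal pair distance r = 1 give energy -1;
# an equilateral triangle with unit sides gives three such pairs.
pair = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, math.sqrt(3) / 2, 0.0)]
```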
Lennard-Jones potential

[figure omitted: the repulsive term 1/r¹², the attractive term −2/r⁶ and the resulting pair potential v(r), with minimum v(1) = −1]

In more general molecular force fields the total energy also contains bonded terms, e.g.

(1/2) ∑_i K_i^b (r_i − r_i⁰)²   (bond lengths)
(1/2) ∑_i K_i^θ (θ_i − θ_i⁰)²   (bond angles)

plus periodic terms in the dihedral angles.
Docking

Given two macro-molecules M1, M2, find their minimal energy coupling. If no bonds are changed, to find the optimal docking it is sufficient to minimize E_v + E_e, where

E_e = (1/2) ∑_{i∈M1, j∈M2} q_i q_j / R_ij   (Coulomb interaction)

and E_v is a Lennard-Jones-like van der Waals term over the pairs (i, j) ∈ C of atoms of the two molecules:

E_v = ∑_{(i,j)∈C} ( α_1/‖X_i − X_j‖¹² − α_2/‖X_i − X_j‖⁶ )

This is a highly structured problem. But is it easy/convenient to use its structure? And how?
LJ is d.c.

The map

F1 : R^{3N} → R_+^{N(N−1)/2},   F1(X_1, …, X_N) = ( ‖X_1 − X_2‖², …, ‖X_{N−1} − X_N‖² )

maps atom positions into the squared pair distances; composing it with a suitable map F2(r_12, …, r_{N−1,N}) which is the difference between two convex functions, LJ(X) itself can be seen as the difference between two convex functions (a d.c. programming problem). NB: every C² function is d.c., but often its d.c. decomposition is not known. D.C. optimization is very elegant and there exists a nice duality theory, but algorithms are typically very inefficient. A d.c. problem min g(x) − h(x) is equivalent to

min z − w  s.t.  g(x) ≤ z, h(x) ≥ w
Canonical d.c. problem:

min c^T x  s.t.  g(x) ≤ 0, h(x) ≥ 0

with g, h convex; let C = {x : h(x) ≤ 0}. Hypotheses: 0 ∈ int C and c^T x > 0 for every feasible x outside int C. Fundamental property: if a d.c. problem admits an optimum, at least one optimum belongs to ∂C, the boundary of C.
Assume g(0) < 0, h(0) < 0 and c^T x > 0 for every feasible x. Let x̄ be a solution to the convex problem

min c^T x  s.t.  g(x) ≤ 0

If h(x̄) ≥ 0 then x̄ solves the d.c. problem. Otherwise c^T x > c^T x̄ for all feasible x. Coordinate transformation y = x − x̄:

min c^T y  s.t.  ĥ(y) ≥ 0, ĝ(y) ≤ 0

where ĝ(y) = g(y + x̄) and ĥ(y) = h(y + x̄). Then c^T y > 0 for all feasible solutions and ĥ(0) < 0; by continuity it is possible to choose the new origin so that also ĝ(0) < 0.
Let x̄ be the best known solution and

D(x̄) = {x : c^T x ≤ c^T x̄}

If D(x̄) ⊆ C then x̄ is optimal. Check: a polytope P (with known vertices) is built which contains D(x̄). If all vertices of P are in C ⇒ optimal solution. Otherwise let V be the vertex most violating C; the intersection of the segment [0, V] with ∂C, if feasible, is an improving point x̂; otherwise a cut tangent to ∂C at x̂ is introduced in P. [figure omitted]
Initialization

Choose a polytope P (with known vertices) containing D(x̄), i.e. such that

y ∈ P  for every y with c^T y ≤ c^T x̄

[figure omitted: initial polytope P with vertex V]
Step 1
Let V be the vertex of P with the largest h(·) value. Surely h(V) > 0 (otherwise we stop with an optimal solution). Moreover h(0) < 0 (0 is in the interior of C), thus the line from V to 0 must intersect the boundary of C; let

x_k = ∂C ∩ [V, 0]

be the intersection point. It might be feasible (improving) or not. [figure omitted]
If x_k is feasible, set x̄ := x_k (an improved incumbent); otherwise a cut tangent to ∂C at x_k is added to P and the procedure is repeated. [figures omitted]
The problem

min_y h*(y) − g*(y)

(where * denotes the convex conjugate) is the Fenchel–Rockafellar dual of min_x g(x) − h(x). If min g(x) − h(x) admits an optimum, then the Fenchel dual is a strong dual (the two optimal values coincide).
A primal/dual algorithm

If x* ∈ arg min_x g(x) − h(x) then every u* ∈ ∂h(x*) is dual optimal (∂ denotes the subdifferential), and if u* ∈ arg min_u h*(u) − g*(u) then every x* ∈ ∂g*(u*) is primal optimal. This suggests alternating between the linearized subproblems

P_k: min_x g(x) − ( h(x_k) + (x − x_k)^T y_k )

D_k: min_y h*(y) − ( g*(y_{k−1}) + x_k^T (y − y_{k−1}) )
GlobOpt - relaxations
Consider the global optimization problem (P):

min_{x∈X} f(x)

and assume the minimum exists and is finite, and that we can use a relaxation (R):

min_{y∈Y} g(y)

Usually both X and Y are subsets of the same space R^n. Recall: (R) is a relaxation of (P) iff

X ⊆ Y  and  g(y) ≤ f(y) ∀y ∈ X
Tools
- good relaxations: easy to solve, yet accurate;
- good upper bounding, i.e., good heuristics for (P).

Good relaxations can be obtained, e.g., through convex relaxations and domain reduction.
Convex relaxations
Assume X is convex and Y = X. If g is the convex envelope of f on X, then solving the convex relaxation (R) gives, in one step, the certified global optimum of (P). g(x) is a convex under-estimator of f on X if:
- g(x) is convex;
- g(x) ≤ f(x) for all x ∈ X.

g is the convex envelope of f on X if:
- g is a convex under-estimator of f;
- g(x) ≥ h(x) for every convex under-estimator h of f.
A 1-D example: [figure omitted: a nonconvex function on an interval X, its convex under-estimator, and the branching of X into subintervals]

Bounding
Let

min_{x∈S} f(x)

be a GlobOpt problem where f is convex while S is non-convex. A relaxation (outer approximation) is obtained by replacing S with a larger set Q ⊇ S; if Q is convex ⇒ a convex optimization problem. If the optimal solution of

min_{x∈Q} f(x)

is feasible for the original problem, the node is fathomed; otherwise its optimal value is a lower bound, to be compared with upper bounds (values of feasible points) in a branch-and-bound scheme.
Example

min −x − 2y  s.t.  xy ≤ 3,  x ∈ [0, 5], y ∈ [0, 3]

Relaxation: we know that

(x + y)² = x² + y² + 2xy,  thus  xy = ((x + y)² − x² − y²)/2

On the box x² ≤ 5x and y² ≤ 3y, so the constraint xy ≤ 3 implies the convex relaxed constraint

(x + y)² − 5x − 3y ≤ 6

[figure omitted: feasible region and relaxation]
Stronger Relaxation

Since x ≤ 5 and y ≤ 3,

(5 − x)(3 − y) ≥ 0  ⟹  15 − 3x − 5y + xy ≥ 0  ⟹  xy ≥ 3x + 5y − 15

Combined with xy ≤ 3 this gives the valid linear cut

3x + 5y ≤ 18

The optimal solution of the resulting convex (linear) relaxation is (1, 3), which is feasible ⇒ optimal for the original problem.
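A brute-force check of the example (the grid resolution is an arbitrary choice of mine): the cut excludes no feasible point, and maximizing x + 2y (i.e. minimizing −x − 2y) over the cut plus box picks (1, 3).

```python
import itertools

def feasible(x, y):
    """Original nonconvex constraint on the box: xy <= 3."""
    return x * y <= 3.0

def in_cut(x, y):
    """Linear cut obtained from (5-x)(3-y) >= 0 combined with xy <= 3."""
    return 3 * x + 5 * y <= 18.0 + 1e-12

xs = [i / 20 for i in range(101)]   # x grid on [0, 5]
ys = [j / 20 for j in range(61)]    # y grid on [0, 3]

# Validity: no feasible grid point violates the cut.
valid = all(in_cut(x, y) for x, y in itertools.product(xs, ys) if feasible(x, y))

# The relaxation's maximizer of x + 2y is (1, 3), which satisfies xy <= 3.
best = max((p for p in itertools.product(xs, ys) if in_cut(*p)),
           key=lambda p: p[0] + 2 * p[1])
```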
Convex envelopes
Definition: a function is polyhedral if it is the pointwise maximum of a finite number of affine functions. (NB: in general, the convex envelope is the pointwise supremum of affine minorants.) The generating set X(f) of a function f over a convex set P is defined as follows: build the convex envelope of f on P and consider its epigraph {(x, y) : x ∈ P, y ≥ Conv f(x)}; this is a convex set, whose extreme points are denoted by V; X(f) is the set of x-coordinates of the points in V.
Characterization
Let f(x) be continuously differentiable on a polytope P. The convex envelope of f on P is polyhedral if and only if

X(f) = Vert(P)

(the generating set is the vertex set of P). Corollary: let f_1, …, f_m ∈ C¹(P), each possessing a polyhedral convex envelope on P. Then

Conv( ∑_i f_i )(x) = ∑_i Conv f_i(x)

iff the generating set of ∑_i Conv f_i(x) is Vert(P).
Characterization
If f(x) is such that Conv f(x) is polyhedral, then there exists an affine function h(x) such that:
1. h(x) ≤ f(x) for all x ∈ Vert(P);
2. there exist n + 1 affinely independent vertices of P, V_1, …, V_{n+1}, such that

f(V_i) = h(V_i),  i = 1, …, n + 1
Characterization
The condition may be reversed: given m affine functions h_1, …, h_m such that, for each of them:
1. h_j(x) ≤ f(x) for all x ∈ Vert(P);
2. there exist n + 1 affinely independent vertices of P, V_1, …, V_{n+1}, such that

f(V_i) = h_j(V_i),  i = 1, …, n + 1

then the function Φ(x) = max_j h_j(x) is the convex envelope of f iff the generating set of Φ is Vert(P) and, for every vertex V_i, Φ(V_i) = f(V_i).
Sufficient condition
If f(x) is lower semi-continuous on P and, for every x̄ ∈ Vert(P), there exists a line ℓ_x̄ through x̄ intersecting the interior of P such that f is concave in a neighborhood of x̄ along ℓ_x̄, then Conv f(x) is polyhedral. Application: let

f(x) = ∑_{i,j} α_ij x_i x_j

The sufficient condition holds for f on [0,1]^n ⇒ bilinear forms are polyhedral on a hypercube. For a single term xy on the box [ℓ_x, u_x] × [ℓ_y, u_y], an affine under-estimator is

xy ≥ ℓ_y x + ℓ_x y − ℓ_x ℓ_y
Bilinear terms

Conv(xy)(x, y) = max{ ℓ_y x + ℓ_x y − ℓ_x ℓ_y ;  u_y x + u_x y − u_x u_y }

No other (polyhedral) function underestimating xy is tighter. In fact ℓ_y x + ℓ_x y − ℓ_x ℓ_y belongs to the convex envelope: it underestimates xy and coincides with it at 3 vertices, (ℓ_x, ℓ_y), (ℓ_x, u_y), (u_x, ℓ_y); analogously for the other affine function. All vertices are interpolated by these two underestimating planes ⇒ they form the convex envelope of xy. A general bilinear form ∑ α_ij x_i x_j is polyhedral on a hypercube (easy to see), but we cannot guarantee in general that the generating set of the envelope of the sum is the vertex set of the hypercube (in particular, if the α's have opposite signs); moreover, if the set is not a hypercube, even a single bilinear term might be non-polyhedral: e.g. xy on the triangle {0 ≤ x ≤ y ≤ 1}. Finding the (polyhedral) convex envelope of a bilinear form on a generic polytope P is NP-hard!
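The two-plane envelope (often called the McCormick under-estimator) is a one-liner; the box [0,5] × [0,3] below is just an illustrative choice:

```python
def mccormick_under(x, y, lx, ux, ly, uy):
    """Convex envelope (tightest convex under-estimator) of x*y on a box."""
    return max(ly * x + lx * y - lx * ly,
               uy * x + ux * y - ux * uy)

# On [0,5] x [0,3] the envelope touches x*y at every box vertex ...
corners = [(0, 0), (0, 3), (5, 0), (5, 3)]
touch = all(mccormick_under(x, y, 0, 5, 0, 3) == x * y for x, y in corners)

# ... and strictly underestimates it in the interior (e.g. at the centre).
gap = 2.5 * 1.5 - mccormick_under(2.5, 1.5, 0, 5, 0, 3)
```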
Fractional terms
A convex under-estimator of a fractional term x/y over a box [ℓ_x, u_x] × [ℓ_y, u_y] (with ℓ_y > 0) can be obtained by introducing an auxiliary variable w and bounds such as

w ≥ x/u_y + ℓ_x/y − ℓ_x/u_y   if ℓ_x ≥ 0

with analogous expressions in the remaining sign cases for x.
αBB under-estimator:

Φ(x) = f(x) − ∑_{i=1}^n α_i (x_i − ℓ_i)(u_i − x_i)

Key properties:
- Φ(x) ≤ f(x) on [ℓ, u];
- Φ is convex if the α_i are large enough;
- Φ interpolates f at all vertices of [ℓ, u].

How to choose the α_i's? One possibility is the uniform choice α_i = α; in this case convexity of Φ is obtained iff

α ≥ max{ 0, −(1/2) min_{x∈[ℓ,u]} λ_min(∇²f(x)) }

Maximum separation:

max_x ( f(x) − Φ(x) ) = (1/4) ∑_i α_i (u_i − ℓ_i)²

Estimation of α: compute an interval Hessian [H], with [H(x)]_ij = [h^L_ij, h^U_ij] on [ℓ, u], and find α such that [H] + 2 diag(α) ⪰ 0. By the Gerschgorin theorem, extended to interval matrices, it suffices to take

α_i ≥ max{ 0, −(1/2) ( h^L_ii − ∑_{j≠i} max(|h^L_ij|, |h^U_ij|) (u_j − ℓ_j)/(u_i − ℓ_i) ) }
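A minimal sketch of the αBB construction, assuming a pre-computed lower bound on the smallest Hessian eigenvalue over the box (the concave test function is my own choice):

```python
import numpy as np

def alpha_bb_under(f, hess_eig_min, lo, hi):
    """Build the alphaBB under-estimator Phi(x) = f(x) - alpha * sum (x-lo)*(hi-x).

    hess_eig_min: a lower bound on the smallest eigenvalue of the Hessian of f
    over the box; the uniform choice is alpha = max(0, -hess_eig_min / 2).
    """
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    alpha = max(0.0, -0.5 * hess_eig_min)
    def phi(x):
        x = np.asarray(x, float)
        return f(x) - alpha * np.sum((x - lo) * (hi - x))
    return phi, alpha

# Example: f(x) = -x1^2 - x2^2 (concave) on [0,1]^2; Hessian eigenvalues are -2,
# so alpha = 1; Phi interpolates f at the vertices and is convex (here, affine).
f = lambda x: -float(x @ x)
phi, alpha = alpha_bb_under(f, hess_eig_min=-2.0, lo=[0, 0], hi=[1, 1])
```

Note that the maximum separation formula predicts a gap of (1/4)(1 + 1)·α = 1/2 at the box centre.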
Improvements
New relaxation functions (other than quadratic). Example:

Φ(x; γ) = f(x) − ∑_{i=1}^n (1 − e^{γ_i (x_i − ℓ_i)}) (1 − e^{γ_i (u_i − x_i)})

gives a tighter under-estimate than the quadratic function. Partitioning: partition the domain into a small number of regions (hyper-rectangles); evaluate a convex under-estimator in each region; join the under-estimators to form a single convex function on the whole domain. The resulting bounds are computed for all i = 1, …, n, and the constraints x ∈ [ℓ, u] are then added to the problem (or to the sub-problems generated during Branch & Bound).
Feasibility Based RR
If S is a polyhedron, RR (range reduction) requires the solution of LPs:

[ℓ'_j, u'_j] = min / max { x_j : Ax ≤ b, x ∈ [L, U] }
Optimality Based RR
Given an incumbent solution x̄ ∈ S, ranges are updated by solving the sequence:

ℓ_i = min { x_i : f̂(x) ≤ f(x̄), x ∈ S },   u_i = max { x_i : f̂(x) ≤ f(x̄), x ∈ S }

where f̂(x) is a convex under-estimator of f on the current domain. RR can be applied iteratively (i.e., at the end of a complete RR sequence, we might start a new one using the new bounds).

Poor man's LP-based RR: from every constraint ∑_j a_ij x_j ≤ b_i in which a_ij > 0 we obtain

x_j ≤ (1/a_ij) ( b_i − ∑_{k≠j} min{ a_ik L_k, a_ik U_k } )
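The poor man's rule needs only one pass over the constraint matrix; a sketch (the tiny example system is illustrative):

```python
def poor_mans_rr(A, b, L, U):
    """Tighten the upper bounds U_j using each row a^T x <= b_i with a_ij > 0."""
    n = len(L)
    U = list(U)
    for a, bi in zip(A, b):
        for j in range(n):
            if a[j] > 0:
                # lower-bound the contribution of the other variables
                rest = sum(min(a[k] * L[k], a[k] * U[k]) for k in range(n) if k != j)
                U[j] = min(U[j], (bi - rest) / a[j])
    return U

# Example: the single constraint x1 + x2 <= 1 on [0,10]^2
# tightens both upper bounds from 10 to 1.
U = poor_mans_rr(A=[[1.0, 1.0]], b=[1.0], L=[0.0, 0.0], U=[10.0, 10.0])
```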
generalization

Let

(P):  min f(x)  s.t.  g(x) ≤ 0, x ∈ X

and let

(R):  min f̂(x)  s.t.  ĝ(x) ≤ 0, x ∈ X̂

be a convex relaxation of (P):

{x ∈ X : g(x) ≤ 0} ⊆ {x ∈ X̂ : ĝ(x) ≤ 0},   f̂(x) ≤ f(x) ∀x ∈ X : g(x) ≤ 0

R.H.S. perturbation: define the perturbed relaxation

(R_y):  φ̂(y) = min f̂(x)  s.t.  ĝ(x) ≤ y, x ∈ X̂

(R) convex ⇒ (R_y) convex for any y. Let x̂ be an optimal solution of (R) and assume that the i-th constraint is active:

ĝ_i(x̂) = 0
Duality

Assume (R) has a finite optimum at x̂ with value φ̂(0) and Lagrange multipliers λ. Then the hyperplane

H(y) = φ̂(0) − λ^T y

supports φ̂ from below: φ̂(y) ≥ φ̂(0) − λ^T y for all y.

Main result

If (R) is convex with optimum value φ̂(0) = L, constraint i is active at the optimum with Lagrange multiplier λ_i > 0, and U is an upper bound for the original problem (P), then the constraint

ĝ_i(x) ≥ −(U − L)/λ_i

is valid for the original problem (P), i.e. it does not exclude any feasible solution with value better than U.
proof

Problem (R_y) can be seen as a convex relaxation of the perturbed non-convex problem

φ(y) = min f(x)  s.t.  g(x) ≤ y, x ∈ X

and thus φ̂(y) ≤ φ(y): under-estimating (R_y) produces an under-estimate of φ(y). Let y := e_i y_i. From duality, φ̂(e_i y_i) ≥ L − λ_i y_i. If y_i < 0 then U is an upper bound also for φ(e_i y_i), thus L − λ_i y_i ≤ U. For any feasible x we may choose y_i = ĝ_i(x), so that the i-th perturbed constraint is active; substituting y_i with ĝ_i(x) we deduce

L − λ_i ĝ_i(x) ≤ U,   i.e.   ĝ_i(x) ≥ −(U − L)/λ_i

Applications

Range reduction: let x ∈ [ℓ, u] in the convex relaxed problem. If variable x_i is at its upper bound in the optimal solution, then we can deduce

x_i ≥ max{ ℓ_i, u_i − (U − L)/λ_i }

where λ_i is the optimal multiplier associated with the i-th upper bound. Analogously, for active lower bounds:

x_i ≤ min{ u_i, ℓ_i + (U − L)/λ_i }

More generally, let a linear constraint a_i^T x ≤ b_i be active in an optimal solution of the convex relaxation (R). Then we can deduce the valid inequality

a_i^T x ≥ b_i − (U − L)/λ_i
In the Bayesian framework the objective is modeled as a realization of a stochastic process F(x; ω), and the next point to sample is placed in order to minimize the expected loss (or risk):

x_{n+1} = arg min E( L(x_1, …, x_n, x_{n+1}) | x_1, …, x_n ) = arg min E( min( F(x_{n+1}; ω), min_{i≤n} F(x_i; ω) ) | x_1, …, x_n )
Bumpiness

Let f_k* be an estimate of the value of the global optimum after k observations. Let s_k^y be the (unique) radial-basis interpolant of the data points

(x_i, f_i), i = 1, …, k,  together with the trial pair (y, f_k*)

where

s(x) = ∑_{i=1}^k λ_i φ(‖x − x_i‖) + p(x)

with p a polynomial of (prefixed) small degree m and φ a radial function, e.g. φ(r) = r or φ(r) = r³. Idea: the most likely location of y is the one for which the resulting interpolant has minimum bumpiness. Bumpiness measure:

σ(s_k^y) = (−1)^{m+1} ∑_i λ_i s_k^y(x_i)

The polynomial p is necessary to guarantee existence of a unique interpolant (i.e. when the matrix {φ_ij = φ(‖x_i − x_j‖)} is singular).
Stochastic methods
Pure Random Search: random uniform sampling over the feasible region.
Best Start: like Pure Random Search, but a local search is started from the best observation.
Multistart: local searches are started from randomly generated starting points.
Clustering methods
Given a uniform sample, evaluate the objective function. Sample transformation (or concentration): either a fraction of the worst points is discarded, or a few steps of a gradient method are performed from each point. The remaining points are then clustered, and from the best point of each cluster a single local search is started.
[figures omitted: uniform sample → sample concentration → clustering → local optimization]
Clustering: MLSL
Sampling proceeds in batches of N points. Given sample points X_1, …, X_k ∈ [0,1]^n, label X_j as clustered iff ∃ Y ∈ {X_1, …, X_k}:

‖X_j − Y‖ ≤ r_k  and  f(Y) ≤ f(X_j)

where the critical distance is

r_k := (1/√π) ( Γ(1 + n/2) · σ · (log k)/k )^{1/n}

with σ > 0 a parameter. A local search is started from every non-clustered point.
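The critical distance and the clustering test can be sketched directly from the formula above (the sample points below are illustrative):

```python
import math

def mlsl_radius(k, n, sigma=4.0):
    """MLSL critical distance r_k over the unit hypercube [0,1]^n."""
    return ((math.gamma(1 + n / 2) * sigma * math.log(k) / k) ** (1.0 / n)
            / math.sqrt(math.pi))

def clustered(points, values, radius):
    """Flag point j as clustered iff a distinct better point lies within radius."""
    return [any(i != j and math.dist(points[i], points[j]) <= radius
                and values[i] <= values[j] for i in range(len(points)))
            for j in range(len(points))]

# The critical distance shrinks as the sample grows, so fewer and fewer
# local searches are started.
r10, r1000 = mlsl_radius(10, 2), mlsl_radius(1000, 2)
flags = clustered([(0.0, 0.0), (0.05, 0.0)], [1.0, 2.0], radius=0.1)
```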
Simple Linkage
A sequential sample is generated (batches consist of a single observation). A local search is started from the last sampled point (i.e. there is no recall), unless there exists a sufficiently near sampled point with a better function value.
Smoothing methods
Given f : R^n → R, the Gaussian transform is defined as:

⟨f⟩_λ(x) = 1/(π^{n/2} λ^n) ∫_{R^n} f(y) exp( −‖y − x‖²/λ² ) dy

When λ is sufficiently large, ⟨f⟩_λ is convex (for a wide class of functions f). Idea: starting with a large enough λ, minimize the smoothed function and slowly decrease λ towards 0, tracking the minimizer.
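A 1-D numerical sketch of the transform by simple midpoint quadrature (the grid and truncation width are my own choices). For f(y) = y² the transform can be checked against the closed form ⟨y²⟩_λ(x) = x² + λ²/2, i.e. smoothing a quadratic only shifts it by a constant.

```python
import math

def gaussian_transform_1d(f, x, lam, grid=2000, half_width=10.0):
    """Midpoint-rule approximation of
    <f>_lam(x) = 1/(sqrt(pi)*lam) * int f(y) exp(-(y-x)^2/lam^2) dy."""
    total = 0.0
    step = 2 * half_width / grid
    for i in range(grid):
        y = x - half_width + (i + 0.5) * step
        total += f(y) * math.exp(-((y - x) / lam) ** 2) * step
    return total / (math.sqrt(math.pi) * lam)

# <y^2>_2(1) should be 1 + 2^2/2 = 3: minima are preserved while
# oscillations on a scale smaller than lam are averaged out.
val = gaussian_transform_1d(lambda y: y * y, x=1.0, lam=2.0)
```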
Smoothing methods

[figure omitted: a multimodal objective and its Gaussian smoothings for increasing values of λ]
Monotonic Basin-Hopping
k := 0; f* := +∞
while k < MaxIter do
    X_k := random initial solution
    X_k := arg min f(x; X_k)    (local minimization started at X_k)
    f_k := f(X_k)
    if f_k < f* then f* := f_k
    NoImprove := 0
    while NoImprove < MaxImprove do
        X := random perturbation of X_k
        Y := arg min f(x; X)    (local minimization started at X)
        if f(Y) < f* then
            X_k := Y; NoImprove := 0; f* := f(Y)
        else
            NoImprove := NoImprove + 1
        end if
    end while
    k := k + 1
end while
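A runnable sketch of Monotonic Basin-Hopping, with a deliberately crude coordinate-descent routine standing in for a real local solver (the test function, perturbation size and iteration limits are all illustrative assumptions):

```python
import math
import random

def local_search(f, x, step=0.1, iters=200):
    """Crude coordinate-descent local minimizer (stand-in for a real local solver)."""
    for _ in range(iters):
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                y = list(x)
                y[i] += d
                if f(y) < f(x):
                    x, improved = y, True
        if not improved:
            step /= 2
    return x

def mbh(f, dim, lo, hi, max_iter=4, max_no_improve=25, pert=0.7, seed=0):
    """Monotonic Basin-Hopping sketch: restart, perturb, accept only improvements."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(max_iter):
        x = local_search(f, [rng.uniform(lo, hi) for _ in range(dim)])
        if f(x) < best_f:
            best_x, best_f = x, f(x)
        no_improve = 0
        while no_improve < max_no_improve:
            y = local_search(f, [xi + rng.uniform(-pert, pert) for xi in x])
            if f(y) < best_f:
                x, best_x, best_f = y, y, f(y)
                no_improve = 0
            else:
                no_improve += 1
    return best_x, best_f

# Multimodal test: f(x) = x^2 + 2*sin(5x)^2 has many local minima;
# the global minimum is at x = 0 with value 0.
x, fx = mbh(lambda v: v[0] ** 2 + 2 * math.sin(5 * v[0]) ** 2,
            dim=1, lo=-3, hi=3)
```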
References
In this year's course the global optimization part has been expanded, so some nonlinear optimization material may be skipped. An essential reference list for the material covered during the course:

- Mokhtar S. Bazaraa, John J. Jarvis and Hanif D. Sherali, Linear Programming and Network Flows, John Wiley & Sons, 1990.
- Dimitri P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.
- Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer, 2006.
- Mohit Tawarmalani and Nikolaos V. Sahinidis, A Polyhedral Branch-and-Cut Approach to Global Optimization, Mathematical Programming, 103, 225–249, 2005.
- I.P. Androulakis, C.D. Maranas and C.A. Floudas, αBB: A Global Optimization Method for General Constrained Nonconvex Problems, Journal of Global Optimization, 7(4), 337–363, 1995.
- A. Rikun, A Convex Envelope Formula for Multilinear Functions, Journal of Global Optimization, 10, 425–437, 1997.
- Andrea Grosso, Marco Locatelli and Fabio Schoen, A Population Based Approach for Hard Global Optimization Problems Based on Dissimilarity Measures, Mathematical Programming, 110(2), 373–404, 2007.