
Lectures on Optimization

A. Banerji
July 26, 2015

Chapter 1
Introduction
1.1 Some Examples

We briefly introduce our framework for optimization, and then discuss some preliminary concepts and results that we'll need to analyze specific problems.
Our optimization examples can all be couched in the following general framework:
Suppose $V$ is a vector space and $S \subseteq V$. Suppose $F : V \to \mathbb{R}$. We wish to find $x^* \in S$ s.t. $F(x^*) \geq F(x)$, $\forall x \in S$, or $x_* \in S$ s.t. $F(x_*) \leq F(x)$, $\forall x \in S$. $x^*, x_*$ are respectively called a maximum and a minimum of $F$ on $S$.
In different applications, $V$ can be finite- or infinite-dimensional. The latter needs more sophisticated optimization tools such as optimal control; we will keep that sort of stuff in abeyance for now. In our applications, $F$ will be continuous, and pretty much always differentiable; often twice continuously differentiable. $S$ will be specified most often using constraints.
Example 1 Let $U : \mathbb{R}^k_+ \to \mathbb{R}$ be a utility function, and $p_1, \ldots, p_k, I$ be positive prices and wealth. Maximize $U$ s.t. $x_i \geq 0$, $i = 1, \ldots, k$, and $\sum_{i=1}^k p_i x_i \equiv p \cdot x \leq I$.
Here, the objective function is $U$, and
$$S = \{x \in \mathbb{R}^k : x_i \geq 0 \ \forall i = 1, \ldots, k, \text{ and } 0 \leq p \cdot x \leq I\}.$$
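As a quick computational aside (my addition, not part of the original notes), a problem like Example 1 can be handed to a numerical solver. The specific utility function, prices and wealth below are illustrative assumptions; this is a minimal sketch, not a general-purpose routine.

```python
import numpy as np
from scipy.optimize import minimize

p = np.array([1.0, 2.0])   # assumed prices
I = 10.0                   # assumed wealth

def neg_utility(x):
    # Cobb-Douglas U(x) = x1 * x2; we minimize -U to maximize U
    return -(x[0] * x[1])

budget = {'type': 'ineq', 'fun': lambda x: I - p @ x}  # p.x <= I
bounds = [(0, None), (0, None)]                        # x_i >= 0

res = minimize(neg_utility, x0=[1.0, 1.0], bounds=bounds, constraints=[budget])
print(res.x, -res.fun)  # should be near (5, 2.5) with U = 12.5 for these numbers
```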

Example 2 Expenditure minimization. Same setting as above. Minimize $p \cdot x$ s.t. $x_i \geq 0$, $i = 1, \ldots, k$, and $U(x) \geq \bar{U}$, where $\bar{U}$ is a non-negative real number.
Here the objective function $F : \mathbb{R}^k \to \mathbb{R}$ is $F(x) = p \cdot x$ and
$$S = \{x \in \mathbb{R}^k : x_i \geq 0 \ \forall i = 1, \ldots, k, \text{ and } U(x) \geq \bar{U}\}$$

Example 3 Profit Maximization. Given positive output prices $p_1, \ldots, p_s$ and input prices $w_1, \ldots, w_k$, and a production function $f : \mathbb{R}^k_+ \to \mathbb{R}^s$ (transforming $k$ inputs into $s$ products),
Maximize $\sum_{j=1}^s p_j f_j(x) - \sum_{i=1}^k w_i x_i$ s.t. $x_i \geq 0$, $i = 1, \ldots, k$. Here $f_j(x)$ is the output of product $j$ as a function of a vector $x$ of the $k$ inputs.
Here, the objective function is profits $\pi : \mathbb{R}^k_+ \to \mathbb{R}$ defined by $\pi(x) = \sum_{j=1}^s p_j f_j(x) - \sum_{i=1}^k w_i x_i$, and
$$S = \{x \in \mathbb{R}^k : x_i \geq 0,\ i = 1, \ldots, k\}$$

Example 4 Intertemporal utility maximization. (Continuous time, finite horizon):
A worker with a known life span $T$, earning a constant wage $w$, and receiving interest at rate $r$ on accumulated savings, or paying the same rate on accumulated debts, wishes to decide an optimal consumption path $c(t)$, $t \in [0, T]$.
Let accumulated assets/debts at time $t$ be denoted by $k(t)$. His instantaneous utility from consumption is $u(c(t))$, $u' > 0$, $u'' < 0$, and future consumption is discounted at rate $\rho$. So the problem is to choose functions $c(t), k(t)$, $t \in [0, T]$, to
Maximize $F(c) = \int_0^T e^{-\rho t} u(c(t))\,dt$ s.t.
(i) $c(t) \geq 0$, $t \in [0, T]$
(ii) $k'(t) = w + rk(t) - c(t)$
(iii) $k(0) = k(T) = 0$
(iii) assumes that the individual has no inheritance and wishes to leave no bequest.
Here, the objective function $F$ is a function of an infinite dimensional vector, the consumption function $c$. The constraint set $S$ admits only those functions $c(t)$ that satisfy conditions (i) to (iii).
Example 5 Intertemporal utility or welfare maximization, discrete time, infinite horizon:
A representative, infinitely lived agent must choose $(c_t, k_{t+1})_{t=0}^\infty$ to
Maximize $\sum_{t=0}^\infty \beta^t u(c_t)$
subject to $c_t + k_{t+1} = f(k_t) + (1 - \delta)k_t$,
$c_t \geq 0$, $k_{t+1} \geq 0$, $k_0 = \bar{k}_0$.
Here, $\delta$ is the rate of depreciation of capital, $\beta$ the discount factor, and $f(k_t)$ is output in period $t$.

Example 6 Game in Strategic Form. $G = \langle N, (S_i), (u_i) \rangle$, where $N = \{1, \ldots, n\}$ is the set of players, and for each player $i$, $S_i$ is her set of strategies, and $u_i : \times_{i=1}^n S_i \to \mathbb{R}$ is her payoff function.
In a game, $i$'s payoff can depend on the choices/strategies $(s_1, \ldots, s_n)$ of everyone.
A Nash equilibrium $(s_1^*, \ldots, s_n^*)$ is a strategy profile such that for each player $i$, $s_i^*$ solves the following maximization problem:
Maximize $u_i(s_1^*, \ldots, s_{i-1}^*, s_i, s_{i+1}^*, \ldots, s_n^*)$ s.t. $s_i \in S_i$.
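For a finite game, the Nash condition can be checked by brute force. The sketch below (my own illustration, not from the notes) uses made-up payoff matrices for a Prisoner's Dilemma; each profile is tested against unilateral deviations.

```python
import itertools
import numpy as np

# Illustrative 2x2 game: u1[a1, a2] is player 1's payoff, u2[a1, a2] player 2's.
u1 = np.array([[3, 0], [5, 1]])
u2 = np.array([[3, 5], [0, 1]])

def is_nash(a1, a2):
    # s_i* must maximize u_i holding the other player's strategy fixed
    return (u1[a1, a2] >= u1[:, a2].max()) and (u2[a1, a2] >= u2[a1, :].max())

print([s for s in itertools.product(range(2), range(2)) if is_nash(*s)])
# -> [(1, 1)]: mutual defection is the unique pure-strategy Nash equilibrium here
```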

1.1.1 Optimization Problems in Parametric Form

Parameters are held constant in the optimization exercise. For instance, in Example 1 (utility maximization), $(p, I) \in \mathbb{R}^{k+1}_{++}$ is the parameter vector held constant. The budget set in fact depends on this parameter, and we may write $S(p, I)$ for the budget set to show this dependence. The maximum value that the utility function takes on this set (if the maximum exists), i.e.
$$V(p, I) = \max\{U(x) \mid x \in S(p, I)\},$$
therefore typically depends on the parameter $(p, I)$, and we denote this dependence of the maximum by the value function $V(p, I)$. In consumer theory, we call this the indirect utility function. This is a function because to each point $(p, I)$ in the admissible parameter space, $V(p, I)$ assigns a single value, equal to the maximum of $U(x)$ over all $x \in S(p, I)$.
Note that for a given $(p, I)$, the bundle that maximizes utility may not be unique: we can denote this relationship by $x(p, I)$, the set of all bundles that maximize utility given $(p, I)$. If the optimal bundle is unique for every $(p, I)$, then $x(p, I)$ is a function (the Walrasian or Marshallian demand function), and therefore $V(p, I) = U(x(p, I))$.
In other problems, not just the feasible set $S$ but also the objective function depends on the parameter. For instance, in Example 2 (expenditure minimization), the parameter is $(p, \bar{U})$, and the objective function $p \cdot x$ depends on the parameter via the price vector $p$, while the constraint $U(x) \geq \bar{U}$ depends on it via the utility level $\bar{U}$.

In general, we may write down the optimization problem in parametric form as follows. A parameter $\theta$ is part of some admissible set of parameters $\Theta$, where $\Theta$ is a subset of some vector space $W$ (finite or infinite-dimensional; e.g. in Example 1, $(p, I) \in \mathbb{R}^n_{++} \times \mathbb{R}_+$; this latter set is thus $\Theta$). The feasible set is $S(\theta)$, which depends on $\theta$ and is a subset of some (other) vector space $V$ (e.g. in Example 1, $S(p, I) \subseteq \mathbb{R}^n_+$). The objective function $F$ maps from some subset of $V \times W$ to the real line: we write $F(x, \theta)$. The problem is to maximize or minimize $F(x, \theta)$ s.t. $x \in S(\theta)$.
Please read the other examples in Sundaram. We give one final one here.
Example 7 Identifying Pareto Optima
There are 2 individuals, with utility functions $u_1(x_1), u_2(x_2)$ respectively, that map from $\mathbb{R}^n_+$, the $n$-good space, to $\mathbb{R}$. There is an endowment $\omega \in \mathbb{R}^n_+$ to be allocated between them. An allocation $(x_1, x_2)$ (vectors of the goods given to the two agents) is feasible if $x_1, x_2 \geq 0$ and $x_1 + x_2 \leq \omega$. We will call
$$F(\omega) = \{(x_1, x_2) \mid x_1, x_2 \geq 0,\ x_1 + x_2 \leq \omega\}$$
the feasible set or set of feasible allocations.
An allocation $(y_1, y_2)$ Pareto dominates $(x_1, x_2)$ if
$$u_i(y_i) \geq u_i(x_i),\ i = 1, 2, \text{ with strict inequality for some } i.$$
An allocation $(x_1, x_2)$ is Pareto optimal if there is no feasible allocation that Pareto dominates it.
Let $a \in (0, 1)$ and consider the social welfare function $U(x_1, x_2, a) \equiv a u_1(x_1) + (1 - a) u_2(x_2)$. Then if $(z_1, z_2)$ is any allocation that solves
Maximize $U(x_1, x_2, a)$ s.t. $(x_1, x_2) \in F(\omega)$
it is a Pareto optimal allocation. For, if $(z_1, z_2)$ is in this set of solutions but is not Pareto optimal, then there is a feasible allocation $(y_1, y_2)$ s.t. $u_1(y_1) \geq u_1(z_1)$, $u_2(y_2) \geq u_2(z_2)$, with strict inequality for at least one of these. Multiplying the first inequality by $a$, the second by $(1 - a)$, and adding, we get $U(y_1, y_2, a) > U(z_1, z_2, a)$, contradicting that $(z_1, z_2)$ is a maximizer.
If we assume that the utility functions $u_i(x_i)$ are concave, then the converse holds: every Pareto optimal allocation is a solution to
Maximize $U(x_1, x_2, a)$ s.t. $(x_1, x_2) \in F(\omega)$
for some choice of $a \in [0, 1]$.

1.2 Some Concepts and Results from Real Analysis
We will now discuss some concepts that we will need, such as the compactness of the set $S$ above, and the continuity and differentiability of the objective function $F$. We will work in normed linear spaces. In the absence of any other specification, the space we will be in is $\mathbb{R}^n$ with the Euclidean norm $\|x\| = \left(\sum_{i=1}^n x_i^2\right)^{1/2}$. (There's a bunch of other norms that would work equally well. Recall that a norm in $\mathbb{R}^n$ is defined to be a function assigning to each vector $x$ a non-negative real number $\|x\|$, s.t. (i) for all $x$, $\|x\| \geq 0$, with equality iff $x = \theta$ ($\theta$ being the zero vector); (ii) if $c \in \mathbb{R}$, $\|cx\| = |c| \|x\|$; (iii) $\|x + y\| \leq \|x\| + \|y\|$. The last requirement, the triangle inequality, follows for the Euclidean norm from the Cauchy-Schwarz inequality.)
One example in the previous section used another normed linear space, namely the space of bounded continuous functions defined on an interval of real numbers, with the sup norm. But in further work in this part of the course, we will stick to using finite dimensional spaces. Some of the concepts below apply to both finite and infinite dimensional spaces, so we will sometimes call the underlying space $V$. But mostly, it will help to think of $V$ as simply $\mathbb{R}^n$, and to visualize stuff in $\mathbb{R}^2$.
We will measure distance between vectors using $\|x - y\| = \left(\sum_{i=1}^n (x_i - y_i)^2\right)^{1/2}$. This is our intuitive notion of distance using Pythagoras' theorem. Furthermore, it satisfies the three properties of a metric, viz., (i) $\|x - y\| \geq 0$, with equality iff $x = y$; (ii) $\|x - y\| = \|y - x\|$; (iii) $\|x - z\| \leq \|x - y\| + \|y - z\|$.
Note that property (iii) for the metric follows from the triangle inequality for the norm, since $\|x - z\| = \|(x - y) + (y - z)\| \leq \|x - y\| + \|y - z\|$.
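As a sanity check on these three properties (my own illustration, not in the notes), one can test them numerically on random vectors; the sample size and dimension are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = lambda x, y: np.linalg.norm(x - y)   # Euclidean metric ||x - y||

for _ in range(1000):
    x, y, z = rng.normal(size=(3, 4))    # three random points in R^4
    assert d(x, y) >= 0 and np.isclose(d(x, x), 0)   # (i) nonnegativity, d(x,x)=0
    assert np.isclose(d(x, y), d(y, x))              # (ii) symmetry
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12      # (iii) triangle inequality
print("all three metric properties hold on the sample")
```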
Open and Closed Sets
Let $\epsilon > 0$ and $x \in V$. The open ball centered at $x$ with radius $\epsilon$ is defined as
$$B(x, \epsilon) = \{y : \|x - y\| < \epsilon\}$$
We see that if $V = \mathbb{R}$, $B(x, \epsilon)$ is the open interval $(x - \epsilon, x + \epsilon)$. If $V = \mathbb{R}^2$, it is an open disk centered at $x$. The boundary of the disk is traced by Pythagoras' theorem.
Exercise 1 Show that $\|x - y\|$ defined by $\max\{|x_1 - y_1|, \ldots, |x_n - y_n|\}$, for all $x, y \in \mathbb{R}^n$, is a metric (i.e. satisfies the three requirements of a metric). In the space $\mathbb{R}^2$, sketch $B(\theta, 1)$, the open ball centered at $\theta$, the origin, of radius 1, in this metric.
Let $S \subseteq V$. $x$ is an interior point of $S$ if $B(x, \epsilon) \subseteq S$ for some $\epsilon > 0$. $S$ is an open set if all points of $S$ are interior points. On the other hand, $S$ is a closed set iff $S^c$ is an open set.
Example. Open in $\mathbb{R}$ vs. open in $\mathbb{R}^2$.
There is an alternative, equivalent, convenient way to define closed sets. $x$ is an adherent point of $S$, or adheres to $S$, if every $B(x, \epsilon)$ contains a point belonging to $S$. Note that this does not necessarily mean that $x$ is in $S$. (However, if $x \in S$ then $x$ adheres to $S$, of course.)
Examples. Singleton and finite sets are closed; countable sets need not be closed.
Lemma 1 A set $S$ is closed iff it contains all its adherent points.
Proof
Suppose $S$ is closed, so $S^c$ is open. Let $x$ adhere to $S$. We want to show that $x \in S$. Suppose not. Then $x \in S^c$, and since $S^c$ is open, $x$ is an interior point of $S^c$. So there is some $\epsilon > 0$ s.t. $B(x, \epsilon) \subseteq S^c$; this ball does not contain any points from $S$. So $x$ cannot be an adherent point of $S$. Contradiction.
Conversely, suppose $S$ contains all its adherent points. To show $S$ is closed, we show $S^c$ is open, i.e. that all the points in $S^c$ are interior points. Let $x \in S^c$. Since $x$ does not adhere to $S$, it must be the case that for some $\epsilon > 0$, $B(x, \epsilon) \subseteq S^c$. $\square$
More examples of closed (and open) sets.
Now we will relate closedness to convergence of sequences. Recall that, formally, a sequence in $V$ is a function $x : \mathbb{N} \to V$. But instead of writing $\{x(1), x(2), \ldots\}$ as the members of the sequence, we write either $\{x_1, x_2, \ldots\}$ or $\{x^1, x^2, \ldots\}$.
Definition 1 Convergence:
A sequence $(x^k)_{k=1}^\infty$ of points in $V$ converges to $x$ if for every $\epsilon > 0$ there exists a positive integer $N$ s.t. $k \geq N$ implies $\|x^k - x\| < \epsilon$.
Note that this is the same as saying that for every open ball $B(x, \epsilon)$, we can find $N$ s.t. all points $x^k$ following $x^N$ lie in $B(x, \epsilon)$. This implies that when $x^k$ converges to $x$ (notation: $x^k \to x$), all but a finite number of points of $(x^k)$ lie arbitrarily close to $x$.
Examples. $x^k = 1/k$, $k = 1, 2, \ldots$ is a sequence of real numbers converging to zero. $x^k = (1/k, 1/k)$, $k = 1, 2, \ldots$ is a sequence of vectors in $\mathbb{R}^2$ converging to the origin. More generally, a sequence converges in $\mathbb{R}^n$ if and only if all the coordinate sequences converge, as can be visualized in the example here using hypotenuses and legs of triangles.
Theorem 2 $(x^k) \to x$ in $\mathbb{R}^n$ iff for every $i \in \{1, \ldots, n\}$, the coordinate sequence $(x_i^k) \to x_i$.
Proof. Since
$$(x_i^k - x_i)^2 \leq \sum_{j=1}^n (x_j^k - x_j)^2,$$
taking square roots implies $|x_i^k - x_i| \leq \|x^k - x\|$; so for every $k \geq N$ s.t. $\|x^k - x\| < \epsilon$, we have $|x_i^k - x_i| < \epsilon$.
Conversely, if all the coordinate sequences converge to the coordinates of the point $x$, then there exists a positive integer $N$ s.t. $k \geq N$ implies $|x_i^k - x_i| < \epsilon/\sqrt{n}$ for every coordinate $i$. Squaring, adding across all $i$, and taking square roots, we have $\|x^k - x\| < \epsilon$. $\square$
Several convergence results that appear to be true are in fact so. For instance, $(x^k) \to x$, $(y^k) \to y$ implies $(x^k + y^k) \to (x + y)$. Indeed, there exists $N$ s.t. $k \geq N$ implies $\|x^k - x\| < \epsilon/2$ and $\|y^k - y\| < \epsilon/2$. So $\|(x^k + y^k) - (x + y)\| = \|(x^k - x) + (y^k - y)\| \leq \|x^k - x\| + \|y^k - y\|$ (by the triangle inequality), and this is less than $\epsilon/2 + \epsilon/2 = \epsilon$.
Exercise 3 Let $(a_k)$ and $(b_k)$ be sequences of real numbers that converge to $a$ and $b$ respectively. Then the product sequence $(a_k b_k)$ converges to the product $ab$.
Closed sets can be characterized in terms of convergent sequences as follows.
Lemma 2 A set $S$ is closed if and only if for every sequence $(x^k)$ lying in $S$, $x^k \to x$ implies $x \in S$.
Proof. Suppose $S$ is closed. Take any sequence $(x^k)$ in $S$ that converges to a point $x$. Then for every $B(x, \epsilon)$, we can find a member $x^k$ of the sequence lying in this open ball. So $x$ adheres to $S$. Since $S$ is closed, it must contain this adherent point $x$.
Conversely, suppose the set $S$ has the property that whenever $(x^k) \subseteq S$ converges to $x$, $x \in S$. Take a point $y$ that adheres to $S$. Take the successively smaller open balls $B(y, 1/k)$, $k = 1, 2, 3, \ldots$. We can find, in each such open ball, a point $y^k$ from the set $S$ (since $y$ adheres to $S$). These points need not all be distinct, but since the open balls have radii converging to 0, $y^k \to y$. Thus by the convergence property of $S$, $y \in S$. So any adherent point $y$ of $S$ actually belongs to $S$. $\square$
Related Results
1. If $(a_k)$ is a sequence of real numbers all greater than or equal to 0, and $a_k \to a$, then $a \geq 0$. The reason is that for all $k$, $a_k \in [0, \infty)$, which is a closed set and hence must contain the limit $a$.
2. Sup and Inf.
Let $S \subseteq \mathbb{R}$. $u$ is an upper bound of $S$ if $u \geq a$ for every $a \in S$. $s$ is the supremum or least upper bound of $S$ (called $\sup S$) if $s$ is an upper bound of $S$, and $s \leq u$ for every upper bound $u$ of $S$.
We say that a set $S$ of real numbers is bounded above if there exists an upper bound, i.e. a real number $M$ s.t. $a \leq M$, $\forall a \in S$. The most important property of a supremum, which we'll by and large take here as given, is the following:
Completeness Property of Real Numbers: Every nonempty set $S$ of real numbers that is bounded above has a supremum.
For a short discussion of this property, see the Appendix.
Note that $\sup S$ may or may not belong to $S$.
Examples. $S = (0, 1)$, $D = [0, 1]$, $K$ = the set of all numbers in the sequence $1 - \frac{1}{2^n}$, $n = 1, 2, 3, \ldots$. The supremum of all these sets is 1, and it does not belong to $S$ or to $K$.
When $\sup S$ belongs to $S$, it is called the maximum of $S$, for obvious reasons. Another important property of suprema is the following.
Lemma 3 For every $\epsilon > 0$, there exists a number $a \in S$ s.t. $a > \sup S - \epsilon$.
Note that this means that $\sup S$ is an adherent point of $S$.
Proof. Suppose that for some $\epsilon > 0$, there is no number $a \in S$ s.t. $a > \sup S - \epsilon$. So every $a \in S$ must then satisfy $a \leq \sup S - \epsilon$. But then $\sup S - \epsilon$ is an upper bound of $S$ that is less than $\sup S$. This implies that $\sup S$ is not in fact the supremum of $S$. Contradiction. $\square$

Lemma 4 If a set $S$ of real numbers is bounded above and closed, then it has a maximum.
Proof. Since it is bounded above, it has a supremum, $\sup S$. $\sup S$ is an adherent point of $S$ (by the above lemma). $S$ is closed, so it contains all its adherent points, including $\sup S$. Hence $\sup S$ is the max of $S$. $\square$
Corresponding to the notion of supremum or least upper bound of a set $S$ of real numbers is the notion of infimum or greatest lower bound of $S$. A number $l$ is a lower bound of $S$ if $l \leq a$, $\forall a \in S$. The infimum of $S$ is a number $s$ s.t. $s$ is a lower bound of $S$, and $s \geq l$ for all lower bounds $l$ of $S$. We call the infimum of $S$ $\inf S$.
Let $-S$ be the set of numbers of the form $-a$, for all $a \in S$.
Fact. $\sup S = -\inf(-S)$.
So, sup and inf are intimately related.
By the completeness property of real numbers, if $S \subseteq \mathbb{R}$ is bounded below (i.e., there exists $m$ s.t. $m \leq a$, $\forall a \in S$), it has an infimum. If $S$ is closed and bounded below, it has a minimum.
A set $S \subseteq \mathbb{R}$ is said to be bounded if it is bounded above and bounded below. We can extend the lemma above along obvious lines as follows:
Theorem 4 If $S \subseteq \mathbb{R}$ is closed and bounded, then it has a maximum and a minimum.
For a more general normed linear space $V$, we define boundedness as follows. A set $S \subseteq V$ is bounded if there exists an open ball $B(\theta, M)$ s.t. $S \subseteq B(\theta, M)$.

Digression.
The next few results, up to and including Cantor's intersection theorem, explore another bunch of ideas that begins with the idea of a supremum. This digression is meant for those interested. Otherwise, please skip to the discussion about Compact Sets.
Theorem 5 Every bounded and increasing sequence of real numbers converges (to its supremum).
Proof. Let $a \equiv \sup(x_n)$. Take any $\epsilon > 0$. By the above discussion, there exists some $x_N \in (a - \epsilon, a]$. And since $(x_n)$ is an increasing sequence, we have that for all $k \geq N$, $x_k \in (a - \epsilon, a]$. So $(x_n) \to a$. $\square$
A similar conclusion holds for decreasing bounded sequences. And:
Theorem 6 Every sequence of real numbers has a monotone subsequence.
Proof. For a sequence $(x_k)$, let $A_n = \{x_k \mid k \geq n\}$, $n = 1, 2, \ldots$. If any one of these sets $A_n$ does not have a maximum, we can pull out an increasing subsequence. For instance, suppose $A_1$ does not have a max. Then let $x_{k_1} = x_1$. Let $x_{k_2}$ be the first member of the sequence $(x_k)$ that is greater than $x_{k_1}$, and so on.
On the other hand, if all the $A_n$ have maxima, then we can pull out a decreasing subsequence. Let $x_{k_1} = \max A_1$, $x_{k_2} = \max A_{k_1 + 1}$, $x_{k_3} = \max A_{k_2 + 1}$, and so on. $\square$
It follows from the above two theorems that we have
Theorem 7 Bolzano-Weierstrass Theorem.
Every bounded sequence of real numbers has a convergent subsequence.
Finally, as an application of the ideas of monotone sequences, we have
Theorem 8 Cantor's Nested Intervals Theorem.
If $[a_1, b_1] \supseteq [a_2, b_2] \supseteq \ldots$ is a nested sequence of closed intervals, then $\bigcap_{m=1}^\infty [a_m, b_m]$ is nonempty. Moreover, if $b_m - a_m \to 0$, then this intersection is a single point.
Proof. Because of the nesting, $a_1 \leq a_2 \leq \ldots \leq b_2 \leq b_1$. So $(a_k)$ is bounded and increasing and so has a supremum, say $a$; $(b_k)$ is bounded and decreasing and has an infimum, say $b$; and $a \leq b$. So $[a, b] \subseteq [a_m, b_m]$, $m = 1, 2, \ldots$, and therefore $[a, b]$ lies in the intersection $\bigcap_{m=1}^\infty [a_m, b_m]$, which is therefore nonempty. Moreover, if $b_m - a_m \to 0$, then by sandwiching, $a = b$ and the intersection is a single point. $\square$

Compact Sets.
Suppose $(x_n)$ is a sequence in $V$. (Note the change in notation, from superscript to subscript. This is just by the way; most places have this subscript notation, but Rangarajan Sundaram at times has the superscript notation in order to leave subscripts to denote coordinates of a vector.)
Let $m(k)$ be an increasing function from the natural numbers to the natural numbers. So $l > n$ implies $m(l) > m(n)$. A subsequence $(x_{m(k)})$ of $(x_n)$ is an infinite sequence whose $k$th member is the $m(k)$th member of the original sequence.
Give an example. The idea is that to get a subsequence from $(x_n)$, you strike out some members, keeping the remaining members' positions the same.
Fact. If a sequence $(x_n)$ converges to $x$, then all its subsequences converge to $x$.
Proof. Take an arbitrary $\epsilon > 0$. There exists $N$ s.t. $n \geq N$ implies $\|x_n - x\| < \epsilon$. This implies, for any subsequence $(x_{m(k)})$, that $k \geq N$ implies $\|x_{m(k)} - x\| < \epsilon$ (since $m(k) \geq k$). $\square$
However, if a sequence does not converge anywhere, it can still have (lots of) subsequences that converge. For example, let $(x_n) \equiv ((-1)^n)$, $n = 1, 2, \ldots$. Then $(x_n)$ does not converge; but the subsequences $(y_m) = -1, -1, -1, \ldots$ and $(z_m) = 1, 1, 1, \ldots$ both converge, to different limits. (Such points are called limit points of the mother sequence $(x_n)$.)
Compact sets have a property related to this fact.
Definition 2 A set $S \subseteq V$ is compact if every sequence $(x_n)$ in $S$ has a subsequence that converges to a point in $S$.

Theorem 9 Suppose $S \subseteq \mathbb{R}^n$. Then $S$ is compact if and only if it is closed and bounded.
Proof (Sketch).
Suppose $S$ is closed and bounded. We can show it's compact using a pigeonhole-like argument; let's sketch it here. Since $S$ is bounded, we can cover it with a closed rectangle $R_0 = I_1 \times \ldots \times I_n$, where $I_i$, $i = 1, \ldots, n$, are closed intervals. Take a sequence $(x_n)$ in $S$. Divide the rectangle in two: $I_1^1 \times \ldots \times I_n$ and $I_1^2 \times \ldots \times I_n$, where $I_1^1 \cup I_1^2 = I_1$ is the union of 2 intervals. Then there's an infinity of members of $(x_n)$ in at least one of these smaller rectangles; call this $R_1$. Divide $R_1$ into 2 smaller rectangles, say by dividing $I_2$ into 2 smaller intervals; we'll find an infinity of members of $(x_n)$ in at least one of these rectangles; call it $R_2$. This process goes on ad infinitum, and we find an infinity of members of $(x_n)$ in the rectangles $R_0 \supseteq R_1 \supseteq R_2 \supseteq \ldots$. By the Cantor Intersection Theorem, $\bigcap_{i=0}^\infty R_i$ is a single point; call this point $x$. Now we can choose points $y_i \in R_i$, $i = 1, 2, \ldots$, s.t. each $y_i$ is some member of $(x_n)$; because the $R_i$'s collapse to $x$, it is easy to show that $(y_m)$ is a subsequence that converges to $x$. Moreover, the $y_i$'s lie in $S$, and $S$ is closed; so $x \in S$.
Conversely, suppose $S$ is compact.
(i) Then it is bounded. For suppose not. Then we can construct a sequence $(x_n)$ in $S$ s.t. for every $n = 1, 2, \ldots$, $\|x_n\| > n$. But then no subsequence of $(x_n)$ can converge to a point in $S$. Indeed, take any point $x \in S$ and any subsequence $(x_{m(n)})$ of $(x_n)$. Then
$$\|x_{m(n)}\| = \|x_{m(n)} - x + x\| \leq \|x_{m(n)} - x\| + \|x\|$$
(the inequality above is due to the triangle inequality). So
$$\|x_{m(n)} - x\| \geq \|x_{m(n)}\| - \|x\| > n - \|x\|$$
and the RHS becomes larger with $n$. So $(x_{m(n)})$ does not converge to $x$.
(ii) $S$ is also closed. Take any sequence $(x_n)$ in $S$ that converges to $x$. Then all subsequences of $(x_n)$ converge to $x$, and since $S$ is compact, $(x_n)$ has a subsequence converging to a point in $S$. So this limit point is $x$, and $x \in S$. So $S$ is closed. $\square$
Continuity of Functions
Definition 3 A function $F : \mathbb{R}^n \to \mathbb{R}^m$ is continuous at $x \in \mathbb{R}^n$ if for every sequence $(x^k)$ that converges to $x$ in $\mathbb{R}^n$, the image sequence $(F(x^k))$ converges to $F(x)$ in $\mathbb{R}^m$.
Example of point discontinuity.
Example of continuous function on discrete space.
$F$ is continuous on $S \subseteq \mathbb{R}^n$ if it is continuous at every point $x \in S$.
Examples. The real-valued function $F(x) = x$ is continuous using this definition, almost trivially, since $(x^k)$ and $x$ are identical to $(F(x^k))$ and $F(x)$ respectively.
$F(x) = x^2$ is continuous. We want to show that if $(x_k)$ converges to $x$, then $(F(x_k)) = (x_k^2)$ converges to $F(x) = x^2$. This follows from the exercise above on limits: $x_k \to x$, $x_k \to x$ implies $x_k \cdot x_k \to x \cdot x = x^2$.
By extension, polynomials are continuous functions.
May talk a little about the coordinate functions of $F : \mathbb{R}^n \to \mathbb{R}^m$: $(F_1(x_1, \ldots, x_n), \ldots, F_m(x_1, \ldots, x_n))$.
Example: $F(x_1, x_2) = (x_1 + x_2,\ x_1^2 + x_2^2)$. This is continuous because (i) $F_1$ and $F_2$ are continuous; e.g. let $x^k \to x$. Then the coordinates $x_1^k \to x_1$ and $x_2^k \to x_2$. So $F_1(x^k) = x_1^k + x_2^k \to x_1 + x_2 = F_1(x)$.
(ii) Since the coordinate sequences $F_1(x^k) \to F_1(x)$ and $F_2(x^k) \to F_2(x)$, $F(x^k) \equiv (F_1(x^k), F_2(x^k)) \to F(x) = (F_1(x), F_2(x))$.
There is an equivalent, $(\epsilon, \delta)$ definition of continuity.
Definition 4 A function $F : \mathbb{R}^n \to \mathbb{R}^m$ is continuous at $x \in \mathbb{R}^n$ if for every $\epsilon > 0$ there exists $\delta > 0$ s.t. if for any $y \in \mathbb{R}^n$ we have $\|x - y\| < \delta$, then $\|F(x) - F(y)\| < \epsilon$.
So if there is a hurdle of size $\epsilon$ around $F(x)$, then, if the point $y$ is close enough to $x$, $F(y)$ cannot overcome the hurdle.
Theorem 10 The two definitions above are equivalent.
Proof. Suppose there exists an $\epsilon > 0$ s.t. for every $\delta > 0$ there exists a $y$ with $\|x - y\| < \delta$ and $\|F(x) - F(y)\| \geq \epsilon$. Then for this particular $\epsilon$, we can choose a sequence of $\delta_k = 1/k$ and $x^k$ with $\|x - x^k\| < 1/k$. So $(x^k) \to x$ but $(F(x^k))$ does not converge to $F(x)$, staying always outside the $\epsilon$-band of $F(x)$.
Conversely, suppose there exists a sequence $(x^k)$ that converges to $x$, but $(F(x^k))$ does not converge to $F(x)$. So there exists $\epsilon > 0$ s.t. for every positive integer $N$, there exists $k \geq N$ for which $\|F(x^k) - F(x)\| \geq \epsilon$. Then, for this specific $\epsilon$, there does not exist any $\delta > 0$ s.t. for all $y$ with $\|x - y\| < \delta$ we have $\|F(x) - F(y)\| < \epsilon$; for we can find, for any such $\delta$, one of the $x^k$'s s.t. $\|x^k - x\| < \delta$ but $\|F(x^k) - F(x)\| \geq \epsilon$. $\square$
Here is an immediate upshot of the latter definition. Suppose $F : \mathbb{R} \to \mathbb{R}$ is continuous at $x$. If $F(x) > 0$, then there is an open interval $(x - \delta, x + \delta)$ s.t. if $y$ is in this interval, then $F(y) > 0$. The idea is that we can take $\epsilon = F(x)/2$, say, and use the $(\epsilon, \delta)$ definition. A similar statement holds if $F(x) < 0$.
We use this fact now in the following result.
Theorem 11 Intermediate Value Theorem
Suppose $F : \mathbb{R} \to \mathbb{R}$ is continuous on an interval $[a, b]$ and $F(a)$ and $F(b)$ are of opposite signs. Then there exists $c \in (a, b)$ s.t. $F(c) = 0$.
Proof. Suppose WLOG that $F(a) > 0$, $F(b) < 0$ (for the other case, just consider the function $-F$). Then the set
$$S = \{x \in [a, b] \mid F(x) \geq 0\}$$
is nonempty (it contains $a$) and bounded above. Indeed, $b$ is an upper bound of $S$ since $S \subseteq [a, b]$ (and $b \notin S$, as $F(b) < 0$). By the completeness property of real numbers, $S$ has a supremum, $\sup S = c$, say.
It can't be that $F(c) > 0$, for then by continuity there is an $h \in S$, $h > c$, s.t. $F(h) > 0$, so $c$ is not an upper bound of $S$. It can't be that $F(c) < 0$. For, if $c$ is an upper bound of $S$ with $F(c) < 0$, then we have, for every $x \in [a, b]$ with $F(x) \geq 0$, $x \leq c$. However, by continuity, there is an interval $(c - \delta, c]$ s.t. every $y$ in this interval satisfies $F(y) < 0$. But then every $x \in S$ must be to the left of this interval. But then again, $c$ is not the least upper bound of $S$.
So it must be that $F(c) = 0$. $\square$
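The proof above also suggests a practical algorithm: repeatedly halving an interval whose endpoints have opposite signs. A minimal bisection sketch (my addition, not part of the notes):

```python
def bisect(F, a, b, tol=1e-10):
    """Find c in (a, b) with F(c) ~ 0, given F(a), F(b) of opposite signs."""
    fa = F(a)
    assert fa * F(b) < 0, "F(a) and F(b) must have opposite signs"
    while b - a > tol:
        m = (a + b) / 2
        if fa * F(m) <= 0:      # the sign change is in [a, m]
            b = m
        else:                   # the sign change is in [m, b]
            a, fa = m, F(m)
    return (a + b) / 2

print(bisect(lambda x: x**2 - 2, 0.0, 2.0))  # ~1.41421356, i.e. sqrt(2)
```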
As an application, you may want to prove the following corollary, a simple fixed point theorem.
Exercise 12 Suppose $f : [a, b] \to [a, b]$ is a continuous function. Then there exists $x^* \in [a, b]$ s.t. $x^* = f(x^*)$.


Chapter 2
Existence of Optima

2.1 Weierstrass Theorem

This theorem of Weierstrass gives a sufficient condition for a maximum and a minimum to exist, for an optimization problem.
Theorem 13 (Weierstrass). Let $S \subseteq \mathbb{R}^n$ be compact and let $F : S \to \mathbb{R}$ be continuous. Then $F$ has a maximum and a minimum on $S$; i.e., there exist $z_1, z_2 \in S$ s.t. $F(z_2) \leq F(x) \leq F(z_1)$, $\forall x \in S$.
The idea is that continuity of $F$ preserves compactness; i.e. since $S$ is compact and $F$ is continuous, the image set $F(S)$ is compact. That holds irrespective of the space $F(S)$ is in; but since $F$ is real-valued, $F(S)$ is a compact set of real numbers, and therefore must have a max and a min, by a result in Chapter 1.
Proof.
Let $(y_k)$ be a sequence in $F(S)$. So, for every $k$, there is an $x_k \in S$ s.t. $y_k = F(x_k)$. Since $(x_k)$, $k = 1, 2, \ldots$, is a sequence in the compact set $S$, it has a subsequence $(x_{m(k)})$ that converges to a point $x$ in $S$. Since $F$ is continuous, the image sequence $(F(x_{m(k)}))$ converges to $F(x)$, which is obviously in $F(S)$. So we've found a convergent subsequence $(y_{m(k)}) = (F(x_{m(k)}))$ of $(y_k)$ whose limit lies in $F(S)$; hence $F(S)$ is compact. This means the set $F(S)$ of real numbers is closed and bounded; so it has a maximum and a minimum. $\square$

Example 8 $p_1 = p_2 = 1$, $I = 10$. Maximize $U(x_1, x_2) = x_1 x_2$ s.t. the budget constraint. Here, the budget set is compact, since the prices are positive. We can see that the image of the budget set $S$ under the function $U$ (or the range of $U$) is $U(S) = [0, 25]$. This is compact, and so $U$ attains a max (25) and a min (0) on $S$.
The fact that $U(S)$ is in fact an interval has to do with another property of continuity of the objective: such functions preserve connectedness in addition to preserving compactness of the set $S$, and here, the budget set is a connected set.
Do applications of Weierstrass' theorem to utility maximization and cost minimization.
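A crude numerical confirmation of Example 8 (my own illustration): evaluate $U$ on a fine grid over the budget set and inspect the attained range.

```python
import numpy as np

# Budget set S = {x >= 0 : x1 + x2 <= 10} (p1 = p2 = 1, I = 10), U(x) = x1*x2
x1, x2 = np.meshgrid(np.linspace(0, 10, 401), np.linspace(0, 10, 401))
mask = x1 + x2 <= 10          # grid points inside the budget set
U = (x1 * x2)[mask]
print(U.min(), U.max())       # -> 0.0 and 25.0: both extremes are attained
```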


Chapter 3
Unconstrained Optima

3.1 Preliminaries

A function $f : \mathbb{R} \to \mathbb{R}$ is defined to be differentiable at $x$ if there exists $a \in \mathbb{R}$ s.t.
$$\lim_{y \to x} \left( \frac{f(y) - f(x)}{y - x} - a \right) = 0 \qquad (1)$$
By the limit being equal to 0 as $y \to x$, we require that the limit be 0 w.r.t. all sequences $(y_n)$ s.t. $y_n \to x$. $a$ turns out to be the unique number equal to the slope of the tangent to the graph of $f$ at the point $x$. We denote $a$ by the notation $f'(x)$. We can rewrite Equation (1) as follows:
$$\lim_{y \to x} \frac{f(y) - f(x) - a(y - x)}{y - x} = 0 \qquad (2)$$
Note that this means the numerator tends to zero faster than does the denominator.
We can use this way of defining differentiability for more general functions.

Definition 5 Let $f : \mathbb{R}^n \to \mathbb{R}^m$. $f$ is differentiable at $x$ if there is an $m \times n$ matrix $A$ s.t.
$$\lim_{y \to x} \frac{\|f(y) - f(x) - A(y - x)\|}{\|y - x\|} = 0$$
In the one variable case, the existence of $a$ gives the existence of a tangent; in the more general case, the existence of the matrix $A$ gives the existence of tangents to the graphs of the $m$ component functions $f = (f_1, \ldots, f_m)$, each of those functions being from $\mathbb{R}^n$ to $\mathbb{R}$. In other words, this definition has to do with the best affine approximation to $f$ at the point $x$. To see this in a way equivalent to the above definition, put $h = y - x$ in the above definition, so $y = x + h$. Then in the 1-variable case, from the numerator, $f(x + h)$ is approximated by the affine function $f(x) + ah = f(x) + f'(x)h$. In the general case, $f(x + h)$ is approximated by the affine function $f(x) + Ah$.
It can be shown that (w.r.t. the standard bases in $\mathbb{R}^n$ and $\mathbb{R}^m$) the matrix $A$ equals $Df(x)$, the $m \times n$ matrix of partial derivatives of $f$ evaluated at the point $x$. To see this, take the slightly less general case of a function $f : \mathbb{R}^n \to \mathbb{R}$. If $f$ is differentiable at $x$, there exists a $1 \times n$ matrix $A = (a_{11}, \ldots, a_{1n})$ satisfying the definition above: i.e.
$$\lim_{h \to 0} \frac{\|f(x + h) - f(x) - Ah\|}{\|h\|} = 0$$
In particular, the above must hold if we choose $h = (0, \ldots, 0, t, 0, \ldots, 0)$ with $h_j = t \to 0$. That is,
$$\lim_{t \to 0} \frac{\|f(x_1, \ldots, x_j + t, \ldots, x_n) - f(x_1, \ldots, x_j, \ldots, x_n) - a_{1j} t\|}{|t|} = 0$$
But from the limit on the LHS, we know that $a_{1j}$ must equal the partial derivative $\partial f(x)/\partial x_j$.
We refer to $Df(x)$ as the derivative of $f$ at $x$; $Df$ itself is a function from $\mathbb{R}^n$ to the space of $m \times n$ matrices.


$$Df(x) = \begin{pmatrix} \partial f_1(x)/\partial x_1 & \ldots & \partial f_1(x)/\partial x_n \\ \vdots & \ddots & \vdots \\ \partial f_m(x)/\partial x_1 & \ldots & \partial f_m(x)/\partial x_n \end{pmatrix}$$
Here,
$$\frac{\partial f_i(x)}{\partial x_j} = \lim_{t \to 0} \frac{f_i(x_1, \ldots, x_j + t, \ldots, x_n) - f_i(x_1, \ldots, x_j, \ldots, x_n)}{t}$$
We want to also represent the partial derivative in different notation. Let $e_j = (0, \ldots, 0, 1, 0, \ldots, 0)$ be the unit vector in $\mathbb{R}^n$ on the $j$th axis. Then,
$$\frac{\partial f_i(x)}{\partial x_j} = \lim_{t \to 0} \frac{f_i(x + t e_j) - f_i(x)}{t}$$
That is, the partial of $f_i$ w.r.t. $x_j$, evaluated at the point $x$, is looking at essentially a function of one variable: we take the graph (surface) of the function $f_i$, and slice it parallel to the $j$th axis, s.t. the point $x$ is contained on this slice/plane; we'll get a function pasted on this plane; its derivative is the relevant partial derivative.
To be more precise about this one-variable function pasted on the slice/plane, note that the single variable $t \in \mathbb{R}$ is first mapped to a vector $x + t e_j \in \mathbb{R}^n$, and then that vector is mapped to a real number $f_i(x + t e_j)$. So let $\phi : \mathbb{R} \to \mathbb{R}^n$ be defined by $\phi(t) = x + t e_j$, for all $t \in \mathbb{R}$. Then the one-variable function we're looking for is $g : \mathbb{R} \to \mathbb{R}$ defined by $g(t) = f_i(\phi(t))$, for all $t \in \mathbb{R}$; it's the composition of $f_i$ and $\phi$.
In addition to slicing the surface of functions that map from $\mathbb{R}^n$ to $\mathbb{R}$ in the directions of the axes, we can slice them in any direction and get a function pasted on the slicing plane. This is the notion of a directional derivative.
Recall that if $x \in \mathbb{R}^n$ and $h \in \mathbb{R}^n$, then the set of all points that can be written as $x + th$, for some $t \in \mathbb{R}$, comprises the line through $x$ in the direction of $h$.
See figure (drawn in class).
Definition 6 The directional derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ at $x \in \mathbb{R}^n$, in the direction $h \in \mathbb{R}^n$, denoted $Df(x; h)$, is
$$\lim_{t \to 0^+} \frac{f(x + th) - f(x)}{t}$$
If $t \to 0^+$ is replaced by $t \to 0$, we get the 2-sided directional derivative.
A function is differentiable on a set $S$ if it is differentiable at all points in $S$. $f$ is continuously differentiable if it is differentiable and $Df$ is continuous.

3.2 Interior Optima

Definition 7 Let $f : \mathbb{R}^n \to \mathbb{R}$. A point $z$ is a local maximum (resp. local minimum) of $f$ on a set $S \subseteq \mathbb{R}^n$ if $f(z) \geq f(x)$ (resp. $f(z) \leq f(x)$) for all $x \in B(z, \epsilon) \cap S$, for some $\epsilon > 0$.
$B(z, \epsilon)$ is intersected with $S$ since it may not, by itself, lie entirely in $S$. However, if $z$ is in the interior of $S$, we can discard that. $z$ is said to be an interior local maximum, or minimum, of $f$ on $S$ if $f(z) \geq f(x)$ (resp. $f(z) \leq f(x)$), $\forall x \in B(z, \epsilon)$, for some $\epsilon > 0$.
We now give a necessary condition for a point to be an interior local max or min; namely, that its derivative should be zero. For if not, then we can increase or decrease the function value by moving away slightly from the point.
First Order Necessary Condition.
Theorem 14 Let $f : \mathbb{R}^n \to \mathbb{R}$, $S \subseteq \mathbb{R}^n$, and let $x^*$ be a local max or local min of $f$ on $S$, lying in the interior of $S$. If $f$ is differentiable at $x^*$, then $Df(x^*) = \theta$.
Here, $\theta = (0, \ldots, 0)$ is the origin, and $Df(x^*) = (\partial f(x^*)/\partial x_1, \ldots, \partial f(x^*)/\partial x_n)$.


Proof. Let $x^*$ be an interior local max (the min proof is done along similar lines).
Step 1: Suppose $n = 1$. Take any sequences $(y_k)$, $y_k < x^*$, $y_k \to x^*$, and $(z_k)$, $z_k > x^*$, $z_k \to x^*$. Since $x^*$ is a local max, we have, for $k \geq K$ and $K$ large enough,
$$\frac{f(z_k) - f(x^*)}{z_k - x^*} \leq 0 \leq \frac{f(y_k) - f(x^*)}{y_k - x^*}$$
Taking limits preserves these inequalities, since $(-\infty, 0]$ and $[0, \infty)$ are closed sets and the ratio sequences lie in these closed sets. So
$$f'(x^*) \leq 0 \leq f'(x^*),$$
so $f'(x^*) = 0$.
Step 2. Suppose $n > 1$. Take any $j$th axis direction, and let $g : \mathbb{R} \to \mathbb{R}$ be defined by $g(t) = f(x^* + t e_j)$. Note that $g(0) = f(x^*)$. Now, since $x^*$ is a local max of $f$, $f(x^*) \geq f(x^* + t e_j)$ for $|t|$ smaller than some cutoff value: i.e., $g(0) \geq g(t)$ for $|t|$ smaller than this cutoff value, i.e., $g$ has an interior local maximum at 0 (since $t < 0$ and $t > 0$ are both allowed). $g$ is differentiable at 0, since $g(t) = f(\phi(t))$ with $\phi(t) = x^* + t e_j$, $f$ is differentiable at $x^*$, and $\phi$ is differentiable at $t = 0$ (with $D\phi(t) = e_j$, $\forall t$). So, by Step 1, $g'(0) = 0$, and by the Chain Rule,
$$g'(0) = Df(\phi(0)) D\phi(0) = Df(x^*) e_j = \frac{\partial f(x^*)}{\partial x_j}$$
So each partial derivative vanishes at $x^*$. $\square$
Note that this is necessary but not sufficient for a local max or min; e.g. $f(x) = x^3$ has a vanishing first derivative at $x = 0$, which is not a local optimum.
Second Order Conditions
Definition. $x$ is a strict local maximum of $f$ on $S$ if $f(x) > f(y)$ for all $y \in B(x, \epsilon) \cap S$, $y \neq x$, for some $\epsilon > 0$.
We will represent the Hessian or second derivative (matrix) of $f$ by $D^2 f$.
Theorem 15 Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is $C^2$ on $S \subseteq \mathbb{R}^n$, and $x$ is an interior point of $S$.
1. (necessary) If $f$ has a local max (resp. local min) at $x$, then $D^2 f(x)$ is n.s.d. (resp. p.s.d.).
2. (sufficient) If $Df(x) = \theta$ and $D^2 f(x)$ is n.d. (resp. p.d.) at $x$, then $x$ is a strict local max (resp. min) of $f$ on $S$.
The results in the above theorem follow from taking a Taylor series approximation of order 2 around the local max or local min. For example,
$$f(x) = f(x^*) + Df(x^*)(x - x^*) + \frac{1}{2}(x - x^*)^T D^2 f(x^*)(x - x^*) + R_2(x - x^*)$$
where $R_2(\cdot)$ is a remainder of order smaller than two. If $x^*$ is an interior local max or min, then $Df(x^*) = \theta$ (a vector of zeros), so the quadratic form in the second-order term will share the sign of $(f(x) - f(x^*))$.
Examples to illustrate: (i) The SONC are not sufficient: $f(x) = x^3$. (ii) Semidefiniteness cannot be replaced by definiteness: $f(x) = x^4$. (iii) These are conditions for local, not global optima: $f(x) = 2x^3 - 3x^2$. (iv) Strategy for using the conditions to identify global optima: $f(x) = 4x^3 - 5x^2 + 2x$ on $S = [0, 1]$.


Chapter 4
Optimization with Equality Constraints

4.1 Introduction

We are given an objective function $f : \mathbb{R}^n \to \mathbb{R}$ to maximize or minimize, subject to $k$ constraints. That is, there are $k$ functions, $g_1 : \mathbb{R}^n \to \mathbb{R}$, $g_2 : \mathbb{R}^n \to \mathbb{R}$, ..., $g_k : \mathbb{R}^n \to \mathbb{R}$, and we wish to
Maximize $f(x)$ over all $x \in \mathbb{R}^n$ such that $g_1(x) = 0, \ldots, g_k(x) = 0$.
More compactly, collect the constraint functions (looking at them as component functions) into one function $g : \mathbb{R}^n \to \mathbb{R}^k$, where $g(x) = (g_1(x), \ldots, g_k(x))$. Then what we want is to
Maximize $f(x)$ over all $x \in \mathbb{R}^n$ such that $g(x)_{1 \times k} = \theta_{1 \times k}$.
The Theorem of Lagrange provides necessary conditions for a local optimum $x^*$. By local we mean that the value $f(x^*)$ is a max or min compared to other values $f(x)$ for all $x$ contained in some open set $U$ containing $x^*$ such that $x$ satisfies the $k$ constraints. Thus the problem it considers is to provide necessary conditions for a max or a min of
$f(x)$ over all $x \in S$, where $S = U \cap \{x \in \mathbb{R}^n \mid g(x) = \theta\}$, for some open set $U$.
The following example illustrates the principle of no arbitrage underlying a maximum. A more general illustration, with more than 1 constraint, requires a little bit of the machinery of linear inequalities, which we'll not cover. The idea here is that the Lagrange multiplier captures how the constraint is distributed across the variables.
Example 1. Suppose $x^*$ solves Max $U(x_1, x_2)$ s.t. $I - p_1 x_1 - p_2 x_2 = 0$, and suppose $x^* \gg \theta$.
Then reallocating a small amount of income from one good to the other does not increase utility. Say income $dI > 0$ is shifted from good 2 to good 1. So $dx_1 = (dI/p_1) > 0$ and $dx_2 = -(dI/p_2) < 0$. Note that this reallocation satisfies the budget constraint, since
$$p_1(x_1 + dx_1) + p_2(x_2 + dx_2) = I$$
The change in utility is $dU = U_1 dx_1 + U_2 dx_2 = [(U_1/p_1) - (U_2/p_2)]dI \leq 0$, since the change in utility cannot be positive at a maximum. Therefore,
$$(U_1/p_1) - (U_2/p_2) \leq 0 \qquad (1)$$
Similarly, $dI > 0$ shifted from good 1 to good 2 does not increase utility, so that $[-(U_1/p_1) + (U_2/p_2)]dI \leq 0$, or
$$-(U_1/p_1) + (U_2/p_2) \leq 0 \qquad (2)$$
Eqs. (1) and (2) imply
$$(U_1(x^*)/p_1) = (U_2(x^*)/p_2) = \lambda \qquad (3)$$
That is, the marginal utility of the last bit of income, $\lambda$, equals $(U_1(x^*)/p_1) = (U_2(x^*)/p_2)$ at the optimum. Also, (3) implies $U_1(x^*) = \lambda p_1$, $U_2(x^*) = \lambda p_2$. Along with $p_1 x_1 + p_2 x_2 = I$, these are the FONC of the Lagrangean function
$$\text{Max } L(x, \lambda) = U(x_1, x_2) + \lambda[I - p_1 x_1 - p_2 x_2]$$
More generally, suppose $F : \mathbb{R}^n \to \mathbb{R}$ and $G : \mathbb{R}^n \to \mathbb{R}$, and suppose $x^*$ solves Max $F(x)$ s.t. $c - G(x) = 0$. (This part is skippable.)
Contemplate a change $dx$ in $x^*$ that respects the constraint $G(x) = c$. That is,
$dG = G_1 dx_1 + G_2 dx_2 = 0$. Therefore,
$G_1 dx_1 = -G_2 dx_2 = dc$, say. So $dx_1 = (dc/G_1)$, $dx_2 = -(dc/G_2)$. If $dc > 0$, then our change $dx$ implies $dx_1 > 0$, $dx_2 < 0$. $F$ does not increase at the maximum $x^*$. So
$dF = F_1 dx_1 + F_2 dx_2 \leq 0$, or $[(F_1/G_1) - (F_2/G_2)]dc \leq 0$. The reverse inequality, $\geq 0$, can be shown similarly (shifting in the other direction).
Therefore,
$$(F_1(x^*)/G_1(x^*)) = (F_2(x^*)/G_2(x^*)) = \lambda \qquad (4)$$
Caveat: We have assumed that $G_1(x^*)$ and $G_2(x^*)$ are not both zero at $x^*$. This is called the constraint qualification.
Again, note that (4) can be got as the FONC of the problem
$$\text{Max } L(x, \lambda) = F(x) + \lambda[c - G(x)].$$
On $\lambda$.
Let's go back to the utility example. At the optimum $(x^*, \lambda^*)$, suppose you increase income by $\Delta I$. Buying more $x_1$ implies utility increases by $(U_1(x^*)/p_1)\Delta I$, approximately. Buying more $x_2$ implies utility increases by $(U_2(x^*)/p_2)\Delta I$. At the optimum, $(U_1(x^*)/p_1) = (U_2(x^*)/p_2) = \lambda^*$. So in either case, utility increases by $\lambda^* \Delta I$. So $\lambda^*$ gives the change in the objective (here, the objective is utility), at an optimum, that results from relaxing the constraint a little bit.
The interpretation is the same in the more general case: if $G(x) = c$, and $c$ is increased by $\Delta c$, suppose $x_1$ alone is then increased. So $\Delta G = G_1 dx_1 = \Delta c$, or $dx_1 = (\Delta c/G_1)$. So at $x^*$, $F$ increases by $dF = F_1 dx_1 = (F_1(x^*)/G_1(x^*))\Delta c = \lambda \Delta c$. If instead $x_2$ is changed, $F$ increases by $dF = F_2 dx_2 = (F_2(x^*)/G_2(x^*))\Delta c = \lambda \Delta c$.

4.2 The Theorem of Lagrange

The setup is the following. $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function; $g_i : \mathbb{R}^n \to \mathbb{R}$, $i = 1, \ldots, k$, $k < n$, are the constraint functions.
Let $g : \mathbb{R}^n \to \mathbb{R}^k$ be the function given by $g(x) = (g_1(x), \ldots, g_k(x))$.
$Df(x) = (\partial f(x)/\partial x_1, \ldots, \partial f(x)/\partial x_n)$.

$$Dg(x) = \begin{pmatrix} Dg_1(x) \\ \vdots \\ Dg_k(x) \end{pmatrix} = \begin{pmatrix} \partial g_1(x)/\partial x_1 & \ldots & \partial g_1(x)/\partial x_n \\ \vdots & \ddots & \vdots \\ \partial g_k(x)/\partial x_1 & \ldots & \partial g_k(x)/\partial x_n \end{pmatrix}$$
So $Dg(x)$ is a $k \times n$ matrix.
The theorem below provides a necessary condition for a local max or local min. Note that $x^*$ is a local max (resp. min) of $f$ on the constraint set $\{x \in \mathbb{R}^n \mid g_i(x) = 0, i = 1, \ldots, k\}$ if $f(x^*) \geq f(x)$ (resp. $\leq f(x)$) for all $x \in U$, for some open set $U$ containing $x^*$, s.t. $g_i(x) = 0$, $i = 1, \ldots, k$. Thus $x^*$ is a max on the set $S = U \cap \{x \in \mathbb{R}^n \mid g_i(x) = 0, i = 1, \ldots, k\}$.

Theorem 16 (Theorem of Lagrange). Let $f : \mathbb{R}^n \to \mathbb{R}$ and $g_i : \mathbb{R}^n \to \mathbb{R}$, $i = 1, \ldots, k$, $k < n$, be $C^1$ functions. Suppose $x^*$ is a max or a min of $f$ on the set $S = U \cap \{x \in \mathbb{R}^n \mid g_i(x) = 0, i = 1, \ldots, k\}$, for some open set $U \subseteq \mathbb{R}^n$. Then there exist real numbers $\mu, \lambda_1, \ldots, \lambda_k$, not all zero, such that
$$\mu Df(x^*) + \sum_{i=1}^k \lambda_i Dg_i(x^*) = \theta_{1 \times n}.$$
Moreover, if $\text{rank}(Dg(x^*)) = k$, then we may put $\mu = 1$.
Notes:
(1) $k < n$ so that the constraint set is not trivial. Just like a linear constraint $a \cdot x = 0$ (or $a \cdot x = c$) marks out a set of points that is an $(n-1)$-dimensional subspace, as the $1 \times n$ matrix $a$ has rank 1 and nullity $(n-1)$, so each of the constraints $g_i(x) = 0$, $i = 1, \ldots, k$, marks out an $(n-1)$-dimensional manifold. The constraint set is their intersection and hence an $(n-k)$-dimensional space. For this to be a nonempty set with more than one point, we need $k < n$.
(2) The condition $\text{rank}(Dg(x^*)) = k$ is called the constraint qualification. The first part of the theorem says that at a local max or min, under the assumption of continuous differentiability of $f$ and $g_i$, $i = 1, \ldots, k$, the vectors $Df(x^*), Dg_1(x^*), \ldots, Dg_k(x^*)$ are linearly dependent. The constraint qualification (CQ) basically assumes that the vectors $Dg_1(x^*), \ldots, Dg_k(x^*)$ are linearly independent. In that case,
$\mu Df(x^*) + \sum_{i=1}^k \lambda_i Dg_i(x^*) = \theta$ implies that $\mu$ cannot equal 0, for then $\sum_{i=1}^k \lambda_i Dg_i(x^*) = \theta$, which along with linear independence implies $\lambda_i = 0$, $i = 1, \ldots, k$. This cannot be. So if the CQ holds, then $\mu \neq 0$, and we can divide through by $\mu$.

(3) In most applications the CQ holds. We usually check first whether it holds, and then proceed. Suppose it does hold. Note that
$\mu Df(x^*) + \sum_{i=1}^k \lambda_i Dg_i(x^*) = \theta$ subsumes the following $n$ equations (with $\mu = 1$):
$$(\partial f(x^*)/\partial x_j) + \sum_{i=1}^k \lambda_i (\partial g_i(x^*)/\partial x_j) = 0, \quad j = 1, \ldots, n$$
Note also that this leads to the usual procedure for finding an equality constrained max or min, by setting up a Lagrangean function
$$L(x, \lambda) = f(x) + \sum_{i=1}^k \lambda_i g_i(x)$$
and solving the FONC
$$(\partial L(x, \lambda)/\partial x_j) = (\partial f(x)/\partial x_j) + \sum_{i=1}^k \lambda_i (\partial g_i(x)/\partial x_j) = 0, \quad j = 1, \ldots, n$$
$$(\partial L(x, \lambda)/\partial \lambda_i) = g_i(x) = 0, \quad i = 1, \ldots, k$$
which is $(n + k)$ equations in the $(n + k)$ variables $x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_k$.
Why does the above procedure usually work to isolate global optima?
The FONC that come out of the Lagrangean function are, as seen in the Theorem of Lagrange, necessary conditions for local optima. However, when we do equality constrained optimization, (i) usually a global max (or min) $x^*$ is known to exist; and (ii) for most problems the CQ is met at all $x \in S$, and therefore it is met at the optimum as well. (Note that otherwise, not knowing the optimum when we start out on a problem, it is not possible to check whether the CQ holds at that point!)
When (i) and (ii) are met, the solutions to the FONC of the Lagrangean function will include all local optima, and hence will include the global optimum that we want. By comparing the values $f(x)$ for all $x$ that solve the FONC, we get the point at which $f(x)$ is a max or a min. With this method, we don't need second order conditions at all, if we just want to find a global max or a min. A computational sketch of this procedure appears below.
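Here is that sketch (my own illustration, not part of the notes): solving the $(n+k)$ FONC equations symbolically for the Cobb-Douglas utility problem, with prices and income left as parameters.

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', positive=True)
p1, p2, I = sp.symbols('p1 p2 I', positive=True)

L = x1 * x2 + lam * (I - p1 * x1 - p2 * x2)        # Lagrangean
foc = [sp.diff(L, v) for v in (x1, x2, lam)]        # n + k = 3 equations
print(sp.solve(foc, [x1, x2, lam], dict=True))
# expected: [{x1: I/(2*p1), x2: I/(2*p2), lam: I/(2*p1*p2)}], the unique critical point
```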
Pathologies
The above procedure may not always work.
Pathology 1. A global optimum may not exist. Then none of the critical points (solutions to the FONC of the Lagrangean function) is a global optimum. Critical points may then be only local optima, or they may not even be local optima. Indeed, the Theorem of Lagrange gives a necessary condition; so there could be points $x'$ that meet the condition and yet are not even a local max or min.
Example. Max $f(x, y) = x^3 + y^3$ s.t. $g(x, y) = x - y = 0$. Here the contour set $C_g(0)$ is the 45-degree line in the $x$-$y$ plane. By taking larger and larger positive values of $x$ and $y$ on this contour set, we get higher and higher $f(x, y)$. So $f$ does not have a global max on the constraint set. But suppose we mechanically crank out the Lagrangean FONCs as follows:
Max $x^3 + y^3 + \lambda(x - y)$
FONC: $3x^2 + \lambda = 0$
$3y^2 - \lambda = 0$
$x - y = 0$. So $x = y = \lambda = 0$ is a solution. But $(x^*, y^*) = (0, 0)$ is neither a local max nor a local min. Indeed, $f(0, 0) = 0$, whereas for $(x, y) = (\epsilon, \epsilon)$, $\epsilon > 0$, $f(\epsilon, \epsilon) = 2\epsilon^3 > 0$, and for $(x, y) = (\epsilon, \epsilon)$, $\epsilon < 0$, $f(\epsilon, \epsilon) = 2\epsilon^3 < 0$.
Pathology 2. The CQ is violated at the optimum.
In this case, the FONCs need not be satisfied at the global optimum.
Example. Max $f(x, y) = -y$ s.t. $g(x, y) = y^3 - x^2 = 0$.
Let us first find the solution using native intelligence. Then we'll show that the CQ fails at the optimum, and that the usual Lagrangean method is a disaster. Finally, we'll show that the general form of the condition in the Theorem of Lagrange, which does NOT assume that the CQ holds at the optimum, works.
The constraint is $y^3 = x^2$, and since $x^2$ is nonnegative, so must $y^3$ be. Therefore, $y \geq 0$. Maximizing $-y$ s.t. $y \geq 0$ implies $y = 0$ at the max. So $y^3 = x^2 = 0$, so $x = 0$. So $f$ attains its global max at $(x, y) = (0, 0)$.
$Dg(x, y) = (-2x, 3y^2) = (0, 0)$ at $(x, y) = (0, 0)$. So $\text{rank}(Dg(x, y)) = 0 < k = 1$ at the optimum; the CQ fails at this point. Using the Lagrangean method, we get the following FONC:
$(\partial f/\partial x) + \lambda(\partial g/\partial x) = 0$, that is, $-2\lambda x = 0 \qquad (1)$
$(\partial f/\partial y) + \lambda(\partial g/\partial y) = 0$, that is, $-1 + 3\lambda y^2 = 0 \qquad (2)$
$(\partial L/\partial \lambda) = 0$, that is, $y^3 - x^2 = 0 \qquad (3)$
Eq. (1) implies either $\lambda = 0$ or $x = 0$. $x = 0$ implies, from Eq. (3), that $y = 0$, but then (2) becomes $-1 = 0$, which is not possible. Similarly, $\lambda = 0$ again violates (2). So the FONCs have no solution, and in particular fail to pick up the global max.
But the general form of the condition in the Theorem of Lagrange does not rely on the CQ and works. In this problem, the only equation out of the above three that changes is Eq. (2), as we see below:
$\mu Df(x, y) + \lambda Dg(x, y) = (0, 0)$ and $y^3 - x^2 = 0$, with $Df(x, y) = (0, -1)$, $Dg(x, y) = (-2x, 3y^2)$, yield
$-2\lambda x = 0 \qquad (1)$
$-\mu + 3\lambda y^2 = 0 \qquad (2)$
$y^3 - x^2 = 0 \qquad (3)$
Now, Eq. (1) implies $\lambda = 0$ or $x = 0$. If $\lambda = 0$, then Eq. (2) implies $\mu = 0$. But $\mu = \lambda = 0$ is ruled out by the Theorem of Lagrange. Therefore, here $\lambda \neq 0$. Hence $x = 0$. From Eq. (3), we then have $y = 0$, and so from Eq. (2), $\mu = 0$. So we get $x = y = 0$ as a solution.
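One can watch the usual (normalized) Lagrangean fail symbolically; the sketch below is my own illustration of Pathology 2, not part of the notes.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
L = -y + lam * (y**3 - x**2)                   # usual Lagrangean, mu normalized to 1
foc = [sp.diff(L, v) for v in (x, y, lam)]     # [-2*lam*x, 3*lam*y**2 - 1, y**3 - x**2]
print(sp.solve(foc, [x, y, lam], dict=True))
# expected: [] -- the FONCs have no solution, even though (0, 0) is the global max,
# because the CQ fails there.
```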
Second-Order Conditions
These conditions are characterized by definiteness or semi-definiteness of the Hessian of the Lagrangean function, which is the appropriate function to look at in this constrained optimization problem. Also, we don't have to check the appropriate inequality for the quadratic form at all $x$: only those $x$ are relevant that satisfy the constraints. Second order conditions in general say something about the curvature of the objective function around the local max or min, i.e., how the graph curves as we move from $x^*$ to a nearby $x$. In constrained optimization, we cannot move from $x^*$ to any arbitrary nearby $x$; the move must be to an $x$ which satisfies the constraints. That is, such a move must leave all $g_i(x)$ at 0. In other words, $dg_i(x) = Dg_i(x) \cdot dx = 0$, where $dx$ is a vector $x' - x$ that must be orthogonal to $Dg_i(x)$. Thus it suffices to evaluate the appropriate quadratic form at all vectors $x$ that are orthogonal to all the gradients of the constraint functions.
Notice that if we parameterize the curve describing a constraint by setting $g_i(x(t)) = 0$, then by the Chain Rule we have
$$Dg_i(x(t)) Dx(t) = 0.$$
$Dx(t)$ is in the direction of the tangent to the curve $x(t)$, so the equation above implies that $Dg_i(x(t))$ is orthogonal to it. (Seen as a vector rather than a matrix, we write this as the gradient $\nabla g_i(x(t))$.) (As an application, notice how this geometry implies the first order condition $MRS_{xy} = p_x/p_y$ in a two-good utility maximization in which both goods are consumed at the U-max.)
In the second-order conditions, we check the definiteness or semi-definiteness of the second derivative or Hessian $D^2 L(x^*, \lambda^*)$ w.r.t. all vectors $x$ that are orthogonal to the gradient of each constraint. This approximates vectors close to $x^*$ that satisfy each $g_i(x) = 0$.
Since $L(x, \lambda) = f(x) + \sum_{i=1}^k \lambda_i g_i(x)$,
$$D^2 L(x, \lambda)_{n \times n} = D^2 f(x)_{n \times n} + \sum_{i=1}^k \lambda_i D^2 g_i(x)_{n \times n},$$
where
$$D^2 f(x) = \begin{pmatrix} f_{11}(x) & \ldots & f_{1n}(x) \\ \vdots & \ddots & \vdots \\ f_{n1}(x) & \ldots & f_{nn}(x) \end{pmatrix} \quad \text{and} \quad D^2 g_i(x) = \begin{pmatrix} g_{i11}(x) & \ldots & g_{i1n}(x) \\ \vdots & \ddots & \vdots \\ g_{in1}(x) & \ldots & g_{inn}(x) \end{pmatrix}$$
So
$$D^2 L(x, \lambda)_{n \times n} = \begin{pmatrix} f_{11}(x) + \sum_{i=1}^k \lambda_i g_{i11}(x) & \ldots & f_{1n}(x) + \sum_{i=1}^k \lambda_i g_{i1n}(x) \\ \vdots & \ddots & \vdots \\ f_{n1}(x) + \sum_{i=1}^k \lambda_i g_{in1}(x) & \ldots & f_{nn}(x) + \sum_{i=1}^k \lambda_i g_{inn}(x) \end{pmatrix}$$
is the second derivative of $L$ w.r.t. the $x$ variables. Note that $D^2 L(x, \lambda)$ is symmetric, so we may work with its quadratic form.
At a given $x^* \in \mathbb{R}^n$,
$$Dg(x^*)_{k \times n} = \begin{pmatrix} Dg_1(x^*) \\ \vdots \\ Dg_k(x^*) \end{pmatrix}$$
So the set of all vectors $x$ that are orthogonal to all the gradient vectors of the constraint functions at $x^*$ is the null space of $Dg(x^*)$:
$$N(Dg(x^*)) = \{x \in \mathbb{R}^n \mid Dg(x^*) x = \theta_{k \times 1}\}.$$
Theorem 17 Suppose there exists $(x^*_{n \times 1}, \lambda^*_{k \times 1})$ such that $\text{rank}(Dg(x^*)) = k$ and $Df(x^*) + \sum_{i=1}^k \lambda_i^* Dg_i(x^*) = \theta$.
(i) (a necessary condition) If $f$ has a local max (resp. local min) on $S$ at the point $x^*$, then $x^T D^2 L(x^*, \lambda^*) x \leq 0$ (resp. $\geq 0$) for all $x \in N(Dg(x^*))$.
(ii) (a sufficient condition) If $x^T D^2 L(x^*, \lambda^*) x < 0$ (resp. $> 0$) for all $x \in N(Dg(x^*))$, $x \neq \theta$, then $x^*$ is a strict local max (resp. strict local min) of $f$ on $S$.
Checking these directly involves checking inequalities for every vector $x$ in the null space of $Dg(x^*)$, which is an $(n - k)$-dimensional subspace. Alternatively, we could check the signs of $n - k$ determinants instead, and the relevant tests are given by the theorem below, which states tests equivalent to those of the above theorem. These are the Bordered Hessian conditions. This stuff is tedious indeed, and it would be a hard taskmaster who would ask anyone to waste hard disk space by memorizing these.
$$BH(L^*) = \begin{pmatrix} \theta_{k \times k} & Dg(x^*)_{k \times n} \\ [Dg(x^*)]^T_{n \times k} & D^2 L(x^*, \lambda^*)_{n \times n} \end{pmatrix}_{(n+k) \times (n+k)}$$
$BH(L^*; k + n - r)$ is the matrix obtained by deleting the last $r$ rows and columns of $BH(L^*)$.
$BH^\pi(L^*)$ will denote a variant in which the permutation $\pi$ has been applied to (i) both rows and columns of $D^2 L(x^*, \lambda^*)$ and (ii) only the columns of $Dg(x^*)$ and only the rows of $[Dg(x^*)]^T$, which is the transpose of $Dg(x^*)$.
Theorem 18 (1a) $x^T D^2 L(x^*, \lambda^*) x \leq 0$ for all $x \in N(Dg(x^*))$ iff for all permutations $\pi$ of $\{1, \ldots, n\}$ we have:
$$(-1)^{n-r} \det(BH^\pi(L^*; n + k - r)) \geq 0, \quad r = 0, 1, \ldots, n - k - 1.$$
(1b) $x^T D^2 L(x^*, \lambda^*) x \geq 0$ for all $x \in N(Dg(x^*))$ iff for all permutations $\pi$ of $\{1, \ldots, n\}$ we have:
$$(-1)^k \det(BH^\pi(L^*; k + n - r)) \geq 0, \quad r = 0, 1, \ldots, n - k - 1.$$
(2a) $x^T D^2 L(x^*, \lambda^*) x < 0$ for all nonzero $x \in N(Dg(x^*))$ iff
$$(-1)^{n-r} \det(BH(L^*; n + k - r)) > 0, \quad r = 0, 1, \ldots, n - k - 1.$$
(2b) $x^T D^2 L(x^*, \lambda^*) x > 0$ for all nonzero $x \in N(Dg(x^*))$ iff
$$(-1)^k \det(BH(L^*; n + k - r)) > 0, \quad r = 0, 1, \ldots, n - k - 1.$$

Note. (1) For the negative definiteness or semidefiniteness subject to constraints cases, the determinant of the bordered Hessian with the last $r$ rows and columns deleted must be of the same sign as $(-1)^{n-r}$. The sign of $(-1)^{n-r}$ switches with each successive increase in $r$ from $r = 0$ to $r = n - k - 1$. So the corresponding bordered Hessians switch signs. In the usual textbook case of 2 variables and one constraint, $n - k - 1 = 0$, so we just need to check the sign for $r = 0$, that is, the sign of the determinant of the big bordered Hessian. You should be clear about what this sign should be if it is to be a sufficient condition for a strict local max or min. For the necessary condition, we need to check signs ($\geq 0$ or $\leq 0$) for one permuted matrix as well, in this case. What is this permuted matrix?
(2) As in the unconstrained case, the sufficiency conditions do not require checking weak inequalities for permuted matrices.
(3) In the p.s.d. and p.d. cases, the signs of the principal minors must be all positive if the number $k$ of constraints is even, and all negative if $k$ is odd.
(4) If we know that a global max or min exists, where the CQ is satisfied, and we get a unique solution $x^* \in \mathbb{R}^n$ that solves the FONC, then we may use a second order condition to check whether it is a max or a min. However, weak inequalities demonstrating n.s.d. or p.s.d. (subject to constraints) of $D^2 L^*$ do not imply a max or min; these are necessary conditions. Strict inequalities are useful; they imply a (strict) max or min. If, however, a global max or min exists, the CQ is satisfied everywhere, and there is more than one solution of the FONC, then the one giving the highest value of $f(x)$ is the max. In this case, we don't need second order conditions to conclude that it is the global max.
4.3 Two Examples
Example 1.
A consumer with income $I > 0$ faces prices $p_1 > 0$, $p_2 > 0$, and wishes to maximize $U(x_1, x_2) = x_1 x_2$. So the problem is: Max $x_1 x_2$ s.t. $x_1 \geq 0$, $x_2 \geq 0$, and $p_1 x_1 + p_2 x_2 \leq I$.
To be able to use the Theorem of Lagrange, we need equality constraints.

Now, it is easy to see that if $(x_1^*, x_2^*)$ solves the above problem, then (i) $(x_1^*, x_2^*) > (0, 0)$: if $x_i^* = 0$ for some $i$, then utility equals zero, and clearly we can do better by allocating some income to the purchase of each good; and (ii) the budget constraint binds at $(x_1^*, x_2^*)$: for if $p_1 x_1^* + p_2 x_2^* < I$, then we can allocate some of the remaining income to both goods, and increase utility further.
We conclude from this that a solution $(x_1^*, x_2^*)$ will also be a solution to the problem
Max $x_1 x_2$ s.t. $x_1 > 0$, $x_2 > 0$, and $p_1 x_1 + p_2 x_2 = I$.
That is, maximize $U(x_1, x_2) = x_1 x_2$ over the set $S = \mathbb{R}^2_{++} \cap \{(x_1, x_2) \mid I - p_1 x_1 - p_2 x_2 = 0\}$. Since the budget set in this problem is compact and the utility function is continuous, $U$ attains a maximum on the budget set (by Weierstrass' theorem). Moreover, we argued above that at such a maximum $x^*$, $x_i^* > 0$, $i = 1, 2$, and the budget constraint binds. So $x^* \in S$.
Furthermore, $Dg(x) = (-p_1, -p_2)$, so $\text{rank}(Dg(x)) = 1$ at all points in the budget set. So the CQ is met. Therefore, the global max will be among the critical points of $L(x_1, x_2, \lambda) = x_1 x_2 + \lambda(I - p_1 x_1 - p_2 x_2)$.
FONC:
$(\partial L/\partial x_1) = x_2 - \lambda p_1 = 0 \qquad (1)$
$(\partial L/\partial x_2) = x_1 - \lambda p_2 = 0 \qquad (2)$
$(\partial L/\partial \lambda) = I - p_1 x_1 - p_2 x_2 = 0 \qquad (3)$
$\lambda \neq 0$ (otherwise (1) and (2) imply that $x_1 = x_2 = 0$, which violates (3)). Therefore, from (1) and (2), $\lambda = (x_2/p_1) = (x_1/p_2)$, or $p_1 x_1 = p_2 x_2$. So (3) implies $I - 2p_1 x_1 = 0$, or $p_1 x_1 = (I/2)$, which is the standard Cobb-Douglas utility result that the budget share of a good is proportional to the exponent w.r.t. it in the utility function. So we get
$$x_i^* = (I/2p_i),\ i = 1, 2, \quad \text{and} \quad \lambda^* = (I/2p_1 p_2).$$
We argued that the global max would be one of the critical points of $L(x, \lambda)$ in this example (note, however, that the global min, which occurs at $(x_1, x_2) = (0, 0)$, is not a critical point). Since we have only one critical point, it follows that this must be the global max! (We know that $x_1 = x_2 = 0$ is the global min, and not the point that we have located.)
Take a time-out to consider the alternative to checking the constraint qualification. We can start by setting up the Lagrangean
$$L(x, \mu, \lambda) = \mu U(x) + \lambda(I - p \cdot x)$$
and cranking out the FOCs. The fact that the CQ holds shows up as $\mu \neq 0$ in the FOC. (Convince yourself that in any problem, the CQ failing at a point manifests as $\mu = 0$ in the FOC.)
The FOCs are now:
$(\partial L/\partial x_1) = \mu x_2 - \lambda p_1 = 0 \qquad (1)$
$(\partial L/\partial x_2) = \mu x_1 - \lambda p_2 = 0 \qquad (2)$
$(\partial L/\partial \lambda) = I - p_1 x_1 - p_2 x_2 = 0 \qquad (3)$
If $\mu = 0$, then (1) and (2) imply $\lambda = 0$. All multipliers equal to zero is not a permitted solution in the Theorem of Lagrange. So $\mu \neq 0$; we can normalize (divide through by $\mu$) and set it equal to 1, and proceed as before.
If there were lots of solutions to the FOCs, the sufficient SOCs may help in narrowing down the set of points one needs to evaluate for the global max.
Let's check SOCs for the above example, although this is not necessary.
$Dg(x^*) = (-p_1, -p_2)$
$$D^2 L(x^*, \lambda^*) = D^2 U(x^*) + \lambda^* D^2 g(x^*) = \begin{pmatrix} U_{11}(x^*) & U_{12}(x^*) \\ U_{21}(x^*) & U_{22}(x^*) \end{pmatrix} + \lambda^* \begin{pmatrix} g_{11}(x^*) & g_{12}(x^*) \\ g_{21}(x^*) & g_{22}(x^*) \end{pmatrix}$$
$$= \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} + \lambda^* \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$
Now evaluate the quadratic form $z^T D^2 L(x^*, \lambda^*) z = 2 z_1 z_2$ at any $(z_1, z_2)$ that is orthogonal to $Dg(x^*) = (-p_1, -p_2)$. So $-p_1 z_1 - p_2 z_2 = 0$, or $z_1 = -(p_2/p_1) z_2$. For such $(z_1, z_2)$, $z^T D^2 L(x^*, \lambda^*) z = -(2p_2/p_1) z_2^2 < 0$ for $z \neq \theta$, so $D^2 L(x^*, \lambda^*)$ is negative definite relative to vectors orthogonal to the gradient of the constraint, and $x^*$ is therefore a strict local max.
You've probably seen the computation below. I provide it here anyway, even though it is unnecessary, and we've done the second-order exercise above using the quadratic form.

$$BH(L^*) = \begin{pmatrix} 0 & Dg(x^*) \\ [Dg(x^*)]^T & D^2 L(x^*, \lambda^*) \end{pmatrix} = \begin{pmatrix} 0 & -p_1 & -p_2 \\ -p_1 & 0 & 1 \\ -p_2 & 1 & 0 \end{pmatrix}$$
$\det(BH(L^*)) = 2 p_1 p_2 > 0$. This is the sign of $(-1)^n = (-1)^2$. Therefore, there is a strict local max at $x^*$.
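The same determinant can be checked numerically; the sketch below (my addition) uses arbitrary positive prices.

```python
import numpy as np

p1, p2 = 1.0, 2.0                            # illustrative prices
Dg  = np.array([[-p1, -p2]])                 # 1 x 2 Jacobian of the constraint
D2L = np.array([[0.0, 1.0], [1.0, 0.0]])     # Hessian of L w.r.t. x at (x*, lam*)

BH = np.block([[np.zeros((1, 1)), Dg],
               [Dg.T,             D2L]])
print(np.linalg.det(BH))  # 2*p1*p2 = 4 > 0, the sign of (-1)^n with n = 2: strict local max
```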


Example 2.
Find global maxima and minima of f (x, y) = x2 y 2 on the unit circle
in <2 , i.e., on the set {(x, y) <2 |g(x, y) 1 x2 y 2 = 0}.
Constrained maxima and minima exist, by Weirerstrass theorem, as f
is continuous and the unit circle is closed and bounded. Bounded, as it is
entirely contained in, say, B(0, 2). Closed as well, visually, we can see that
the constraint set contains all its adherent points. More formally, suppose
(xk , yk )
k=1 is a sequence of points on the unit circle converging to the limit
(x, y). Since g is continuous, and (xk , yk ) (x, y), we have g(xk , yk )
g(x, y). Since g(xk , yk ) = 0, k, their limit is 0, i.e. g(x, y) = 0 or (x, y) is on
the unit circle, and so the unit circle is closed.
Constraint Qualification: Dg(x, y) = (−2x, −2y). The rank of this row
matrix is zero only at (x, y) = (0, 0). But the origin does not satisfy the
constraint. Everywhere on the constraint, at least one of x or y is not zero,
and the rank of Dg(x, y) is 1.
So the max and min will be solutions to the FOCs of the usual Lagrangean
L(x, y, λ) = x² − y² + λ(1 − x² − y²)
FOC:
2x − 2λx = 0   (1)
−2y − 2λy = 0   (2)
x² + y² = 1   (3)
(1) and (2) imply 2x(1 − λ) = 0 and −2y(1 + λ) = 0 respectively. Suppose
λ ≠ 1 and λ ≠ −1. Then (x, y) = (0, 0), violating (3). If λ = 1, then y = 0, so x² = 1,
and so on. So the four solutions (x, y, λ) to the FOCs form the solution set
{(1, 0, 1), (−1, 0, 1), (0, 1, −1), (0, −1, −1)}. Evaluating the function values at

these points, we have that f has a constrained max at (1, 0) and (−1, 0) and
a constrained min at (0, 1) and (0, −1).
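The same four critical points can be recovered mechanically. The following SymPy sketch (mine, for illustration only) solves the FOC system and evaluates f at each solution:

    import sympy as sp

    x, y, lam = sp.symbols('x y lam', real=True)
    L = x**2 - y**2 + lam*(1 - x**2 - y**2)
    foc = [sp.diff(L, v) for v in (x, y, lam)]
    f = x**2 - y**2
    for s in sp.solve(foc, [x, y, lam], dict=True):
        print((s[x], s[y], s[lam]), f.subs(s))
    # maxima at (1, 0) and (-1, 0) with f = 1;
    # minima at (0, 1) and (0, -1) with f = -1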
Although unnecessary, let's practice second-order conditions for this example. Df(x, y) = (2x, −2y), Dg(x, y) = (−2x, −2y).

D²f(x, y) = [ 2   0 ]
            [ 0  −2 ]

D²g(x, y) = [ −2   0 ]
            [  0  −2 ]

With λ* = 1, for instance, D²L(x*, y*, λ*) evaluates to

D²f(x*, y*) + λ*D²g(x*, y*) = [ 0   0 ]
                              [ 0  −4 ]

(x, y) orthogonal to Dg(x*, y*) = (−2, 0) (at (x*, y*) = (1, 0)) implies that
(x, y) must satisfy (−2, 0)·(x, y) = 0, or −2x + 0y = 0. So x = 0, and y is
free to take any value. So consider the quadratic form with (x, y) = (0, y):

(0  y) [ 0   0 ] (0)  =  −4y² < 0 for all (0, y) ≠ (0, 0).
       [ 0  −4 ] (y)

So negative definiteness holds on the relevant subspace, and we are at a
strict local max.

Some Derivatives
(1). Let I : R^n → R^n be defined by I(x) = x, x ∈ R^n. In component
function notation, we have I(x) = (I1(x), . . . , In(x)) = (x1, . . . , xn). So
DIi(x) = ei, i.e. the row vector with 1 in the ith place and zeros elsewhere.
So DI(x) = I_{n×n}, the identity matrix.
By similar work, we can show that if f(x) = Ax, where A is an m × n
matrix, then Df(x) = A. Indeed, the jth component function is fj(x) =
a_{j1}x1 + . . . + a_{jn}xn, so its matrix of partial derivatives with respect to
x1, . . . , xn is Dfj(x) = (a_{j1}, . . . , a_{jn}).
(2). Let f : R^n → R^m and g : R^n → R^m. By way of convention, consider
f(x) and g(x) to be column vectors, and consider the function h : R^n → R
defined by h(x) = f(x)^T g(x). Then

Dh(x) = g(x)^T Df(x) + f(x)^T Dg(x)
Indeed,

D(f(x)^T g(x)) = D(Σ_{i=1}^m fi(x)gi(x)) = Σ_{i=1}^m D(fi(x)gi(x)) = Σ_{i=1}^m [gi(x)Dfi(x) + fi(x)Dgi(x)]

The second equality above holds because the differential operator D is linear; the third is the product rule for scalar-valued functions.


Note that

Σ_i gi(x)Dfi(x) = (g1(x), . . . , gm(x)) [ Df1(x) ]
                                        [ Df2(x) ]
                                        [  ...   ]
                                        [ Dfm(x) ]

= g(x)^T Df(x), and so on.
We take a step back and derive this in a more expanded fashion. Since
h(x) = Σ_{i=1}^m fi(x)gi(x), its partial derivative with respect to xj is:

∂h(x)/∂xj = Σ_{i=1}^m [gi(x)(∂fi(x)/∂xj) + fi(x)(∂gi(x)/∂xj)]

          = g(x)^T Df(x)[·, j] + f(x)^T Dg(x)[·, j]

where for any matrix A, we write A[·, j] to represent its jth column. So
Dh(x) = (∂h(x)/∂x1, . . . , ∂h(x)/∂xn) equals

(g(x)^T Df(x)[·, 1] + f(x)^T Dg(x)[·, 1], . . . , g(x)^T Df(x)[·, n] + f(x)^T Dg(x)[·, n])

= g(x)^T Df(x) + f(x)^T Dg(x).
As an application, let h(x) = x^T x. Then Dh(x) = x^T D(x) + x^T D(x) =
x^T I + x^T I = 2x^T.
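A quick numerical sanity check of Dh(x) = 2x^T is easy with finite differences. This little Python sketch (my own, not part of the notes) compares a forward-difference gradient of h(x) = x^T x with 2x:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(4)
    h = lambda z: z @ z                  # h(x) = x^T x
    eps = 1e-6
    grad_fd = np.array([(h(x + eps*e) - h(x)) / eps for e in np.eye(4)])
    print(np.allclose(grad_fd, 2*x, atol=1e-4))  # True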
On the Chain Rule
We saw an example (in the proof of the 1st-order condition in unconstrained optimization) of the Chain Rule at work; you've seen this before.
Namely, if h : R → R^n and f : R^n → R are differentiable at the relevant
points, then the composition g(t) = f(h(t)) is differentiable at t and

g'(t) = Df(h(t))Dh(t) = Σ_{j=1}^n (∂f(h(t))/∂xj) h'_j(t)
You may have encountered this before in the notation f(h1(t), . . . , hn(t)),
with some use of total differentiation. Similarly, suppose h : R^p → R^n and
f : R^n → R^m are differentiable at the relevant points; then the composition
g(x) = f(h(x)), g : R^p → R^m, is differentiable at x, and

Dg(x) = Df(h(x))Dh(x).

Here, on the RHS an m × n matrix multiplies an n × p matrix, to result
in the m × p matrix on the LHS.
The intuition for the Chain Rule is perhaps this. Let z = h(x). If x
changes by dx, the first-order change in z is dz = Dh(x)dx. The first-order
change in f (z) is then Df (z)dz. Substituting for dz, the first-order change
in f (h(x)) equals [Df (h(x))Dh(x)] dx.
In the formula, things are actually quite similar to the familiar case.
The (i, j)th element of the matrix Dg(x) is ∂gi(x)/∂xj, where gi is the ith
component function of g and xj is the jth variable. Since this is equal to the
dot product of the ith row of Df(h(x)) and the jth column of Dh(x), we have

∂gi(x)/∂xj = Σ_{k=1}^n (∂fi(h(x))/∂hk)(∂hk(x)/∂xj)
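The matrix form of the Chain Rule can also be verified numerically. The sketch below (mine; the particular f and h are arbitrary choices) compares a finite-difference Jacobian of the composition with the product Df(h(x))Dh(x):

    import numpy as np

    def h(x):   # h : R^2 -> R^3
        return np.array([x[0]*x[1], np.sin(x[0]), x[1]**2])

    def f(z):   # f : R^3 -> R^2
        return np.array([z[0] + z[1]*z[2], z[0]*z[2]])

    def jac(F, x, eps=1e-6):
        # forward-difference Jacobian, one column per coordinate of x
        Fx = F(x)
        return np.column_stack([(F(x + eps*e) - Fx) / eps
                                for e in np.eye(len(x))])

    x = np.array([0.7, -1.2])
    lhs = jac(lambda t: f(h(t)), x)          # Dg(x), a 2 x 2 matrix
    rhs = jac(f, h(x)) @ jac(h, x)           # Df(h(x)) Dh(x): (2x3)(3x2)
    print(np.allclose(lhs, rhs, atol=1e-4))  # True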

On the Implicit Function Theorem

Theorem 19 Suppose F : R^{n+m} → R^m is C¹, and suppose F(x*, y*) = 0
for some y* ∈ R^m and some x* ∈ R^n. Suppose also that DFy(x*, y*) has
rank m. Then there are open sets U containing x* and V containing y*, and
a C¹ function f : U → V s.t.

F(x, f(x)) = 0 for all x ∈ U.

Moreover,

Df(x*) = −[DFy(x*, y*)]^{-1} DFx(x*, y*)
Note that we could alternatively look at the equation F(x, y) = c, for
some given c ∈ R^m, without changing anything. The proof of this theorem
starts going deep, so it will not be part of this course. (The proof for the
n = m = 1 case, however, is provided at the end of this chapter.) But notice
that applying the Chain Rule to differentiate

F(x*, f(x*)) = 0

yields

DFx(x*, y*) + DFy(x*, y*)Df(x*) = 0   (*)

whence the expression for Df(x*).


More tediously, in terms of compositions: if h(x) = (x, f(x)), then

Dh(x) = [ I     ]
        [ Df(x) ]

whereas DF(·) = (DFx(·) | DFy(·)), so matrix multiplication using partitions yields Eq. (*).
Interpretations
(1) Indifference Curve.
By way of motivation, think of a utility function F defined over 2 goods,
evaluated at some utility value c. So F(x, y) = c. Let (x*, y*) be a solution
of this equation, i.e. F(x*, y*) = c. Under the assumptions of the Implicit
Function Theorem on F, there exists a function f s.t. if x is a point close
to x*, then F(x, f(x)) = c. That is, as we vary x close to x*, there exists
a unique y s.t. F(x, y) = c. Because of the uniqueness, we have y = f(x),
i.e. a functional relationship. Draw an indifference curve corresponding to
F(x, y) = c to see this visually. Moreover, the theorem asserts that the
derivative of the implicit function is

f'(x*) = −Fx/Fy

where Fx, Fy are the partial derivatives of F, evaluated at (x*, y*). The
marginal rate of substitution between the two goods (LHS) equals the ratio of
the marginal utilities (RHS). In fact, when we say "under some assumptions
on F", one of the assumptions is that Fy evaluated at (x*, y*) is not zero.
The mnemonic for getting the derivative: from F(x, y) = c, we totally
differentiate to get Fx dx + Fy dy = 0, and rearrange to get dy/dx = −Fx/Fy.
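As a concrete instance (my example, with U = xy), SymPy gives back the familiar slope:

    import sympy as sp

    x, y = sp.symbols('x y', positive=True)
    F = x*y                                  # indifference curve x*y = c
    slope = -sp.diff(F, x) / sp.diff(F, y)   # dy/dx = -Fx/Fy
    print(slope)                             # -y/x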
(2). Comparative Statics.
We then move to the vector case by analogy. Suppose
F (x, y) = c
where x is an n-vector, y an m-vector, and c a given m-vector. Let (x*, y*)
solve F(x*, y*) = c. Think of x as exogenous, so this is a set of m
equations in the m endogenous variables y = (y1, ..., ym). You can stack
these equations vertically; for brevity, I write them as

F1(x1, ..., xn, y1, ..., ym) = c1, ..., Fm(x1, ..., xn, y1, ..., ym) = cm
So the vector function F has m component functions F1, ..., Fm.
Now totally differentiate:

DFx dx + DFy dy = 0

as before; only now dx = (dx1, ..., dxn), dy = (dy1, ..., dym), DFx is the
m × n matrix of partial derivatives whose ith row is (∂Fi/∂x1, ..., ∂Fi/∂xn),
and DFy is the m × m matrix whose ith row is (∂Fi/∂y1, ..., ∂Fi/∂ym).
So DFy dy = −DFx dx. From here, we can work out the effect of changing
any xj on the endogenous variables y1, ..., ym. Suppose we set all dxi = 0
for i ≠ j. Then DFx dx becomes dxj times the jth column of DFx. We
divide both sides by dxj, getting

DFy (∂y1/∂xj, ..., ∂ym/∂xj)^T = −(∂F1/∂xj, ..., ∂Fm/∂xj)^T

(the superscript T is for transpose, to get column vectors). So,

(∂y1/∂xj, ..., ∂ym/∂xj)^T = −DFy^{-1} (∂F1/∂xj, ..., ∂Fm/∂xj)^T
Where in the scalar case we divided by the number Fy, here we multiply
by the inverse of the matrix DFy. Now, if we stack horizontally the partial
derivatives of y1, ..., ym w.r.t. x1, ..., xn, on the LHS we have Df(x); and on
the RHS, the appropriate columns give DFx, so we have −(DFy(x, y))^{-1} DFx,
which is just the implicit function derivative formula.
(3) Application: Cournot Duopoly.
Firms 1 and 2 have constant unit costs c1 and c2 , and face the twice
continuously differentiable inverse demand function P (Q), where Q = q1 + q2
is industry output. So profits are given by
π1 = P(q1 + q2)q1 − c1q1

and

π2 = P(q1 + q2)q2 − c2q2

If profits are concave in own output, then the first-order conditions below
characterize a Cournot-Nash equilibrium (q1*, q2*):

∂π1/∂q1 = P'(q1 + q2)q1 + P(q1 + q2) − c1 = 0
∂π2/∂q2 = P'(q1 + q2)q2 + P(q1 + q2) − c2 = 0


The concavity of profit w.r.t. own output follows from the condition below:
for all q1, q2,

∂²πi/∂qi² = P''(q1 + q2)qi + 2P'(q1 + q2) ≤ 0, i = 1, 2

The two first-order conditions can be written as the vector equation

F(q1, q2, c1, c2) = 0.
We want to know: How do the Cournot outputs change as a result of a
change in unit costs? If c1 decreases, for instance, does q1 increase and q2
decrease? The implicit function theorem says that if DF_{q1,q2}(q1*, q2*, c1, c2) is
of full rank (rank = 2), then, locally around this solution, q* = (q1*, q2*) is an
implicit function of c = (c1, c2), with F(f(c), c) = 0. And

Df(c) = −[DFq(q*, c)]^{-1} DFc(q*, c)


Note that

DFq(·) = [ ∂F1/∂q1  ∂F1/∂q2 ]
         [ ∂F2/∂q1  ∂F2/∂q2 ]

For brevity, let P' and P'' be the first and second derivatives of P(·)
evaluated at the equilibrium. Then

DFq(·) = [ P''q1 + 2P'   P''q1 + P'  ]
         [ P''q2 + P'    P''q2 + 2P' ]

The determinant of this matrix works out to be
(P')² + P'(P''(q1 + q2) + 2P') > 0, since P' < 0 and the concavity-in-own-output
condition is assumed to be met. So the implicit function theorem can be applied.
Notice also that

DFc(·) = [ −1   0 ]
         [  0  −1 ]
Thus we can work out Df(c), the changes in equilibrium outputs as a
result of changes in unit costs. It would be a good exercise for you to work
these out and sign them; a sketch for a special case follows.
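As a hedged sketch of that exercise, here is the computation for the special case of linear demand P(Q) = a − bQ; the functional form is my assumption, chosen only because it makes the algebra transparent:

    import sympy as sp

    q1, q2, c1, c2, a, b = sp.symbols('q1 q2 c1 c2 a b', positive=True)
    P = a - b*(q1 + q2)
    F = sp.Matrix([sp.diff(P*q1 - c1*q1, q1),      # FOC of firm 1
                   sp.diff(P*q2 - c2*q2, q2)])     # FOC of firm 2
    DFq = F.jacobian([q1, q2])
    DFc = F.jacobian([c1, c2])
    Df = sp.simplify(-DFq.inv() * DFc)
    print(Df)  # Matrix([[-2/(3*b), 1/(3*b)], [1/(3*b), -2/(3*b)]])

So, at least for linear demand, a fall in c1 raises q1* (at rate 2/(3b)) and lowers q2* (at rate 1/(3b)), as intuition suggests.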
Proof of the Theorem of Lagrange
Before the formal proof, note that we'll use the tangency of the contour
sets of the objective and the constraint, an approach which in other words uses
the implicit function theorem. For example, consider maximizing F(x1, x2)
s.t. G(x1, x2) = 0. If G1 ≠ 0 (this is the constraint qualification in this
case), then at a tangency point of contour sets we have G1 f'(x2) + G2 = 0
(where x1 = f(x2) is the implicit function that keeps the points (x1, x2) on the
constraint); so f'(x2) = −G2/G1.
On the other hand, if we vary x2 and adjust x1 to stay on the constraint,
the function value F(x1, x2) = F(f(x2), x2) does not increase; therefore,
locally around the optimum, F1 f'(x2) + F2 = 0. Substituting, F1(−G2/G1) +
F2 = 0. If we now put

λ = −F1/G1,

we have both F1 + λG1 = 0 by definition, and F2 + λG2 = 0, the two FONC.
The Proof:
Without loss of generality, let the leading k × k submatrix of Dg(x*) be
nonsingular. We write x = (w, z), with w being the first k coordinates of x
and z the last (n − k) coordinates. So showing the existence of a 1 × k
vector λ that solves

Df(x*) + λDg(x*) = 0

is the same as showing that the two equations below hold for this λ; the
equations are of dimension 1 × k and 1 × (n − k) respectively:

Dfw(w*, z*) + λDgw(w*, z*) = 0   (*)
Dfz(w*, z*) + λDgz(w*, z*) = 0   (**)

Since Dgw(w*, z*) is square and of full rank, Eq. (*) yields

λ = −Dfw(w*, z*)[Dgw(w*, z*)]^{-1}   (***)

We show this λ solves (**) as well. This needs two steps.


First, g(h(z), z) = 0 for some implicit function h, so

Dh(z*) = −[Dgw(w*, z*)]^{-1} Dgz(w*, z*)

Second, define F(z) = f(h(z), z). Since there is a constrained optimum
at (h(z*), z*), varying z while keeping w = h(z) will not increase the value
of F at z*. So

DF(z*) = Dfw(w*, z*)Dh(z*) + Dfz(w*, z*) = 0

Substituting for Dh(z*),

−Dfw(·)[Dgw(·)]^{-1} Dgz(·) + Dfz(·) = 0

That is,

λDgz(·) + Dfz(·) = 0

Simple Implicit Function Theorem with Proof

Theorem 20 Suppose F : R² → R, F(a, b) = 0, F is C¹, and DF2(a, b) ≠ 0.
Then there exist an open interval U containing a and an open interval V
containing b, and a C¹ function f : U → V, s.t. F(x, f(x)) = 0 for all x ∈ U.
Proof. We'll avoid the proof of continuous differentiability. Suppose
WLOG DF2(a, b) < 0. Since DF2(·) is continuous, there exists h > 0 s.t.
DF2(a, y) < 0 for all y ∈ (b − h, b + h). So F(a, b − h) > 0 > F(a, b + h).
Now, since F is continuous, there exists an interval I containing a s.t.
F(x, b − h) > 0 ∀x ∈ I, and an interval I' containing a s.t. F(x, b + h) <
0 ∀x ∈ I'. So, for all x ∈ U ≡ I ∩ I', we have

F(x, b − h) > 0 > F(x, b + h)

Therefore, by the Intermediate Value Theorem, there exists a y s.t. F(x, y) =
0. This y is unique because DF2(·) < 0 in this interval. So we can pull out
a unique function f(x) s.t. F(x, f(x)) = 0 ∀x ∈ U.
Digression: Envelope Theorems
Suppose we have an objective function f : R^{n+1} → R that is a function
of the vector variable x ∈ R^n and also of a parameter a ∈ R that is held
constant when maximizing f on some feasible set S ⊆ R^n. Suppose that
for every admissible value of a there is a unique interior maximizer, so we
can say that x*(a) is the function that represents this relationship between
parameter and maximizer. Suppose f is smooth and x*(a) is differentiable.
Let V(a) be the value function for this problem, which gives the maximum
value that f(x, a) obtains when the parameter is at the level a. That is,
V(a) ≡ f(x*(a), a).
We wish to know how V(a) changes with a change in a. As a changes,
x*(a) changes as well, but this change has no first-order effect on V(a):
the first-order change in V(a) is solely through the direct effect of a on f.
This is the implication of the envelope theorem.
Indeed, using the Chain Rule on V(a) ≡ f(x*(a), a), we have V'(a) =
Dfx(x*, a)Dx*(a) + ∂f(x*, a)/∂a. But because x* is an interior max,
Dfx(x*, a) = 0_{1×n}. So V'(a) = ∂f/∂a.
Now suppose we want to maximize an objective function f(x), which does
not depend on a, but subject to a constraint g(x, a) = a − G(x) = 0 that
does depend on a. Under nice conditions, at the max,

Df(x*) + λDg(x*, a) = 0   (i)

Also note that if a changes, the value of g(x*(a), a) must continue to be
zero, so

Dgx(x*(a), a)Dx*(a) + ∂g/∂a = 0   (ii)

Now V(a) ≡ f(x*(a)), so V'(a) = Df(x*)Dx*(a). Using (i) to substitute
for Df(x*), this equals −λDgx(x*, a)Dx*(a), which equals, using (ii),
λ∂g/∂a = λ. So here V'(a) = λ: the value of the multiplier at the optimum
is the rate of change of the objective with respect to the parameter a being
relaxed.
Suppose now that we have an objective function f(x, a) to maximize
subject to g(x, a) = 0. Along similar lines, we can show that V'(a) = ∂f/∂a +
λ∂g/∂a, i.e. the direct effect of a on the Lagrangian function. As an exercise,
please derive Roy's Identity using the indirect utility function V(p, I).
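Before that, a toy SymPy check of V'(a) = ∂f/∂a may be reassuring; the quadratic objective is my own illustrative choice:

    import sympy as sp

    x, a = sp.symbols('x a', real=True)
    f = -x**2 + a*x
    xstar = sp.solve(sp.diff(f, x), x)[0]   # interior maximizer x*(a) = a/2
    V = f.subs(x, xstar)                    # value function V(a) = a**2/4
    print(sp.diff(V, a), sp.diff(f, a).subs(x, xstar))  # a/2 and a/2

Both derivatives agree, as the envelope theorem predicts.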

Chapter 5
Optimization with Inequality Constraints

5.1 Introduction

The problem is to find the maximum or the minimum of f : R^n → R on the
set {x ∈ R^n | gi(x) ≥ 0, i = 1, . . . , k}, where gi : R^n → R are the k constraint
functions. At the optimum, the constraints are now allowed to be binding
(or tight, or effective), i.e. gi(x) = 0, as before, or slack (or non-binding), i.e.
gi(x) > 0.
Example: Max U(x1, x2) s.t. x1 ≥ 0, x2 ≥ 0, I − p1x1 − p2x2 ≥ 0. If we
do not know whether xi = 0 for some i at the utility maximum, or whether
xi > 0, then clearly we cannot use the Theorem of Lagrange. Similarly,
if there is a bliss point, then we do not know in advance whether the budget
constraint is binding or slack at the budget-constrained optimum. Again, we
cannot then use the Theorem of Lagrange, to use which we need to be assured
that the constraint is binding.
Note the general nature of a constraint of the form gi(x) ≥ 0. If we have a
constraint h(x) ≤ 0, this is equivalent to g(x) ≡ −h(x) ≥ 0. And something
like h(x) ≤ c is equivalent to g(x) ≡ c − h(x) ≥ 0.

We use Kuhn-Tucker theory to address optimization problems with inequality constraints. The main result is a first-order necessary condition
that is somewhat different from that of the Theorem of Lagrange; one main
difference is that the conditions gi(x) = 0, i = 1, . . . , k in the Theorem of
Lagrange are replaced by the conditions λigi(x) = 0, i = 1, . . . , k in
Kuhn-Tucker theory.
In order to motivate this difference, let us discuss a simple setting. Consider an objective function f : R² → R. We want to maximize f(x) =
f(x1, x2) over all x ∈ R² that satisfy G(x) ≤ a, where G : R² → R. We will
alternatively write g(x) ≡ a − G(x) ≥ 0. For this example, let us assume
that G(x) is strictly increasing. We can view a as the total resource available,
such as the total income available for spending on goods. Draw a picture.
A maximum x* can occur either in the interior (i.e. G(x*) < a, or g(x*) >
0), or at the boundary (G(x*) = a, or g(x*) = 0). If it happens in the
interior, it implies Df(x*) = 0. If it happens on the boundary, it must
be that reducing the parameter value a does not increase f(x*): whatever
vector x you would choose as maximizer after the reduction of a was available
before, at the higher value of a, and was not chosen as the maximizer.
Consider then setting up the Lagrangian

L(x, λ) = f(x) + λg(x)

and consider the first-order condition

Df(x*) + λDg(x*) = 0.
When would this first-order condition make sense?
(i) First, for this to coincide with Df(x*) = 0 when x* is in the interior,
we must have that g(x*) > 0 (or G(x*) < a) implies λ = 0.
(ii) Now let V(a) be the value function for this problem: the maximum
value of f(x) when the parameter in the constraint equals a. Consider the
interpretation that λ = V'(a), the change in the value of the objective as we
change a. If the maximizer x* is on the boundary (G(x*) = a) and we were
to reduce a, the maximum value of f would get reduced (or not increase);
so λ ≥ 0.
(iii) Finally, suppose that the maximizer x*, along with λ, solves the
first-order condition Df(x*) + λDg(x*) = 0, and suppose Dg(x*) ≠ 0. Suppose
λ > 0. Then from the first-order condition, we conclude Df(x*) ≠ 0. But that
means x* is not in the interior of the feasible set; it is on the boundary, so
g(x*) = 0. Alternatively, interpreting λ > 0 as the decrease in the maximized
value of f(x) if a is decreased a little, it must mean g(x*) = 0 (or G(x*) =
a). For if x* were not on the boundary, decreasing a would not affect this
maximum value.
So in addition to Df(x*) + λDg(x*) = 0, it is implied that λ ≥ 0, g(x*) ≥
0, and λg(x*) = 0. Note that this implies that if g(x*) > 0, then λ = 0, and
if λ > 0, then g(x*) = 0.

5.2

Kuhn-Tucker Theory

Recall that the problem is to find the Maximum or the Minimum of f :


Rn R on the set {x Rn |gi (x) 0, i = 1, . . . , k}, where gi : Rn R are
the k constraint functions. The main theorem deals with local maxima and
minima, though.
Suppose that l of the k constraints bind at the optimum x*. Denote the
corresponding constraint functions as (gi)_{i∈E}, where E is the set of indexes
of the binding constraints. Let g* : R^n → R^l be the function whose l
components are the constraint functions of the binding constraints. That is,
g*(x) = (gi(x))_{i∈E}.

Dg*(x) = [ Dg_{i1}(x) ]
         [    ...     ]
         [ Dg_{il}(x) ]

where i1, . . . , il are the indexes of the binding constraints. So Dg*(x) is an
l × n matrix.

We now state the FONC for the problem. The theorem below is a consolidation of the Fritz John and Kuhn-Tucker Theorems.
Theorem 21 (The Kuhn-Tucker (KT) Theorem). Let f : R^n → R and
gi : R^n → R, i = 1, . . . , k, be C¹ functions. Suppose x* is a maximum of f
on the set S = U ∩ {x ∈ R^n | gi(x) ≥ 0, i = 1, . . . , k}, for some open set
U ⊆ R^n. Then there exist real numbers ν, λ1, . . . , λk, not all zero, such that

νDf(x*) + Σ_{i=1}^k λiDgi(x*) = 0_{1×n}.

Moreover, if gi(x*) > 0 for some i, then λi = 0.
If, in addition, Rank Dg*(x*) = l, then we may take ν to be equal to 1.
Furthermore, λi ≥ 0, i = 1, . . . , k, and λi > 0 for some i implies gi(x*) = 0.
Suppose the constraint qualification, Rank Dg*(x*) = l, is met at the
optimum. Then the KT equations are the following (n + k) equations in the
n + k variables x1, . . . , xn, λ1, . . . , λk:

λigi(x*) = 0, i = 1, . . . , k, with λi ≥ 0 and gi(x*) ≥ 0 (complementary
slackness)   (1)

Df(x*) + Σ_{i=1}^k λiDgi(x*) = 0   (2)

If x* is a local minimum of f on S, then −f attains a local maximum
value at x*. Thus for minimization, while Eq. (1) stays the same, Eq. (2)
changes to

−Df(x*) + Σ_{i=1}^k λiDgi(x*) = 0   (2')

Equations (1) and (2) (or (2')) are known as the Kuhn-Tucker conditions.
Note finally that the conditions of the Kuhn-Tucker Theorem are not
sufficient conditions for local optima; there may be points that satisfy
Equations (1) and (2) (or (2')) without being local optima. For example, you
may check that for the problem
Max f(x) = x³ s.t. g(x) = x ≥ 0, the values x = λ = 0 satisfy the KT
FONC (1) and (2) for a local maximum but do not yield a maximum.

5.3 Using the Kuhn-Tucker Theorem

We want to maximize f(x) over the set {x ∈ R^n | g(x) ≥ 0}, where g(x) =
(g1(x), . . . , gk(x)).

Set up L(x, λ) = f(x) + Σ_{i=1}^k λigi(x).

(If we want to minimize f(x), set up
L(x, λ) = −f(x) + Σ_{i=1}^k λigi(x).)
To ensure that the KT FONC will hold at the global max, verify that
(1) a global max exists, and (2) the constraint qualification is met at the
maximum.
The second is not possible to do if we don't know where the maximum
is. What we do instead is to check whether the CQ holds everywhere in the
domain, and if not, we note the points where it fails. The CQ in the theorem
depends on which constraints are binding at the maximum. Again, since we
don't know the maximum, we don't know which constraints bind at it.
With k constraint functions, there are 2^k profiles of binding and non-binding
constraints possible, each of these profiles implying a different CQ.
We either check all of them, or we rule out some profiles using clever arguments.
If both checks are fine, then we find all solutions (x⁰, λ⁰) to the set of
equations:

λi(∂L(x, λ)/∂λi) = 0, λi ≥ 0, (∂L(x, λ)/∂λi) ≥ 0, i = 1, . . . , k, with CS.
(∂L(x, λ)/∂xj) = 0, j = 1, . . . , n.

From the set of all solutions (x⁰, λ⁰), pick the solution (x*, λ*) for which
f(x*) is maximum. Note that this method does not require checking for
concavity of objective functions and constraints, and does not require checking
any second-order condition.
The method may fail if a global max does not exist or if the CQ fails at
the maximum. The example Max f(x) = x³ s.t. g(x) = x ≥ 0 is one where
no global max exists, and we saw earlier that the method fails.
An example in which the CQ fails: Max f(x) = 2x³ − 3x², s.t. g(x) =
(3 − x)³ ≥ 0.


Suppose the constraint does not bind at the maximum; then we don't have
to check a CQ. But suppose it does; that is, suppose the optimum occurs
at x = 3. Dg(x) = −3(3 − x)² = 0 at x = 3. The CQ fails here. You could
check that the KT FONC will not isolate the maximum. In fact, in this baby
example it is easy to see that x = 3 is the max, as (3 − x)³ ≥ 0 iff 3 − x ≥ 0,
so we may work with the latter constraint function, with which the CQ does
not fail. It is a good exercise to visualize f(x) and see that x = 3 is the
maximum, rather than merely cranking out the algebra.
Alternatively, we may use the more general FONCs stated in the theorem:

νDf(x) + λDg(x) = 0, with ν, λ not both zero. That is,

ν(6x² − 6x) + λ(−3(3 − x)²) = 0, and   (1)
(3 − x)³ ≥ 0, with strict inequality implying λ = 0.   (2)

If (3 − x)³ > 0, then λ = 0, which from Eq. (1) implies either ν = 0,
which violates the FONC, or 6x² − 6x = 0, i.e. x = 0 or x = 1; there
f(0) = 0 and f(1) = −1.
On the other hand, if (3 − x)³ = 0, that is x = 3, then Eq. (1) implies
ν = 0, so it must be that λ > 0. At x = 3, f(x) = 27, so x = 3 is the
maximum.
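A brute-force numerical check (mine, purely illustrative) confirms that the corner x = 3 is indeed the maximum on this part of the feasible set:

    import numpy as np

    xs = np.linspace(-5, 3, 100001)    # the constraint (3 - x)**3 >= 0 is x <= 3
    f = 2*xs**3 - 3*xs**2
    print(xs[np.argmax(f)], f.max())   # 3.0 27.0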
Two Simple Utility Maximization Problems
Example 1. This is a baby example meant purely for illustration.
No one expects you to use the heavy Kuhn-Tucker machinery for such simple
problems; one expects instead that you would use reasoning about the
marginal utilities per rupee, (U1/p1) and (U2/p2), to solve the problem.
Max U(x1, x2) = x1 + x2, over the set {x = (x1, x2) ∈ R² | x1 ≥ 0, x2 ≥
0, I − p1x1 − p2x2 ≥ 0}, where I > 0, p1 > 0 and p2 > 0 are given.
So there are 3 inequality constraints:
g1(x1, x2) = x1 ≥ 0, g2(x1, x2) = x2 ≥ 0, and
g3(x1, x2) = I − p1x1 − p2x2 ≥ 0.
At the maximum x*, any combination of these three could bind, so there
are 8 possibilities. However, since U is strictly increasing, the budget constraint
binds at the maximum (g3(x*) = 0). Moreover, g1(x*) = g2(x*) = 0 is
not possible, since consuming 0 of both goods gives utility equal to 0, which
is clearly not a maximum.
So we have to check just three possibilities out of the eight.
Case (1): g1(x*) > 0, g2(x*) > 0, g3(x*) = 0
Case (2): g1(x*) = 0, g2(x*) > 0, g3(x*) = 0
Case (3): g1(x*) > 0, g2(x*) = 0, g3(x*) = 0
Before using the KT conditions, we verify that (i) a global max exists
(here, because the utility function is continuous and the budget set is
compact), and that (ii) the CQ holds at all 3 relevant combinations of binding
constraints described above.
Indeed, for Case (1), Dg*(x) = Dg3(x) = (−p1, −p2), so Rank[Dg*(x)] =
1, and the CQ holds.




For Case (2), Dg*(x) = [ Dg1(x) ] = [  1    0  ]
                       [ Dg3(x) ]   [ −p1  −p2 ]
so Rank[Dg*(x)] = 2.

For Case (3), Dg*(x) = [ Dg2(x) ] = [  0    1  ]
                       [ Dg3(x) ]   [ −p1  −p2 ]
so Rank[Dg*(x)] = 2.
Thus for the maximum x*, there exists a λ* such that (x*, λ*) is a
solution to the KT FONCs. Of course, there could be other (x, λ)'s that are
solutions as well, but a simple comparison of U(x) across all candidate
solutions will isolate the maximum for us.

L(x, λ) = x1 + x2 + λ1x1 + λ2x2 + λ3(I − p1x1 − p2x2)
The KT conditions are:
λ1(∂L/∂λ1) = λ1x1 = 0, λ1 ≥ 0, x1 ≥ 0, with CS   (1)
λ2(∂L/∂λ2) = λ2x2 = 0, λ2 ≥ 0, x2 ≥ 0, with CS   (2)
λ3(∂L/∂λ3) = λ3(I − p1x1 − p2x2) = 0, λ3 ≥ 0, I − p1x1 − p2x2 ≥ 0, with CS   (3)
∂L/∂x1 = 1 + λ1 − λ3p1 = 0   (4)
∂L/∂x2 = 1 + λ2 − λ3p2 = 0   (5)
Since we don't know which of the three cases selects the constraints that
bind at the maximum, we must try all three.
Case (1). Since x1 > 0, x2 > 0, (1) and (2) imply λ1 = λ2 = 0. Plugging
these into Eqs. (4) and (5), we have 1 = λ3p1 = λ3p2. This implies λ3 > 0. (Also
note that this is consistent with the fact that since utility is strictly increasing,
relaxing the budget constraint will increase utility; so the marginal utility
of income, λ3, is positive.) Thus λ3p1 = λ3p2 implies p1 = p2.
So if at a local max both x1 and x2 are strictly positive, then it must be
that their prices are equal. All (x1, x2) that solve Eq. (3) are solutions. The
utility in any such case equals
x1 + (I − p1x1)/p2 = I/p, where p = p1 = p2. Note that in this case,
(U1/p1) = (U2/p2) = 1/p.
Case (2). x1 = 0 implies, from Eq. (3), that x2 = I/p2. Since this is
greater than 0, Eq. (2) implies λ2 = 0. Hence from Eq. (5), λ3p2 = 1.
Since λ1 ≥ 0, Eqs. (4) and (5) imply λ3p1 = 1 + λ1 ≥ 1 = λ3p2. Moreover,
since λ3 > 0, this implies p1 ≥ p2.
That is, if at the maximum x1 = 0 and x2 > 0, then it must
be that p1 ≥ p2. Note that in this case, (U2/p2) = 1/p2 ≥ (U1/p1) = 1/p1.
For completeness' sake, Eq. (5) implies λ3 = 1/p2. So from Eq. (4),
λ1 = (p1/p2) − 1. So the unique critical point of L(x, λ) is
(x*, λ*) = (x1, x2, λ1, λ2, λ3) = (0, I/p2, (p1/p2) − 1, 0, 1/p2).
Case (3). This case is similar, and we get that x2 = 0, x1 > 0 occurs only
if p1 ≤ p2. We have
(x*, λ*) = (I/p1, 0, 0, (p2/p1) − 1, 1/p1).
We see that which of the cases applies depends upon the price ratio p1/p2.
If p1 = p2, then all three cases are relevant, and all (x1, x2) ∈ R^2_+ such that
the budget constraint binds are utility maxima. But if p1 > p2, then only
Case (2) applies, because if Case (1) had applied we would have had p1 = p2,
and if Case (3) had applied, that would have implied p1 ≤ p2. The solution to
the KT conditions in that case is the utility maximum. Similarly, if p1 < p2,
only Case (3) applies. A numerical cross-check follows.
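Since this problem is a linear program, SciPy's linprog can verify the corner prediction for specific numbers (the prices and income below are my arbitrary choices, not part of the notes):

    from scipy.optimize import linprog

    p1, p2, I = 3.0, 1.0, 12.0            # p1 > p2, so Case (2) should apply
    # Max x1 + x2  <=>  Min -(x1 + x2) s.t. p.x <= I, x >= 0
    res = linprog(c=[-1.0, -1.0], A_ub=[[p1, p2]], b_ub=[I],
                  bounds=[(0, None), (0, None)])
    print(res.x)  # [ 0. 12.] = (0, I/p2), the Case (2) corner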
Example 2. Max U(x1, x2) = x1/(1 + x1) + x2/(1 + x2), s.t. x1 ≥ 0,
x2 ≥ 0, p1x1 + p2x2 ≤ I.
Check that the indifference curves are downward sloping and convex, and
that they cut the axes (show all this). This last is due to the additive form of the
utility function, and may result in 0 consumption of one of the goods at the
utility maximum.
Exactly as in Example 1, we are assured that a global max exists, that
the CQ is met at the optimum, and that there are only 3 relevant cases of
binding constraints to check.
The Kuhn-Tucker conditions are:
λ1(∂L/∂λ1) = λ1x1 = 0, λ1 ≥ 0, x1 ≥ 0, with CS   (1)
λ2(∂L/∂λ2) = λ2x2 = 0, λ2 ≥ 0, x2 ≥ 0, with CS   (2)
λ3(∂L/∂λ3) = λ3(I − p1x1 − p2x2) = 0, λ3 ≥ 0, I − p1x1 − p2x2 ≥ 0, with CS   (3)
∂L/∂x1 = 1/(1 + x1)² + λ1 − λ3p1 = 0   (4)
∂L/∂x2 = 1/(1 + x2)² + λ2 − λ3p2 = 0   (5)
Case (1). x1 > 0, x2 > 0 implies λ1 = λ2 = 0. Eq. (4) implies λ3 > 0, so
that Eqs. (4) and (5) give (1 + x2)/(1 + x1) = (p1/p2)^{1/2}.
Using Eq. (3), which gives x2 = (I − p1x1)/p2, above, we get
(p2 + I − p1x1)/(p2(1 + x1)) = (p1/p2)^{1/2}, so simple computations yield

x1* = (I + p2 − (p1p2)^{1/2})/(p1 + (p1p2)^{1/2}),
x2* = (I + p1 − (p1p2)^{1/2})/(p2 + (p1p2)^{1/2}),
λ3* = 1/(p1(1 + x1*)²).

x1* > 0, x2* > 0 require I > (p1p2)^{1/2} − p1 and I > (p1p2)^{1/2} − p2. If
either of these fails, then we are not in the regime of Case (1).
Case (2). x1 = 0 with Eq. (3) implies x2 = I/p2. Since this is positive,
λ2 = 0, so Eq. (5) implies λ3 = 1/((1 + I/p2)²p2) = p2/(p2 + I)².
λ1 = λ3p1 − 1 (from x1 = 0 and Eq. (4)), so
λ1 = p1p2/(p2 + I)² − 1. For this to be ≥ 0, it is required that
p1p2/(p2 + I)² ≥ 1, that is, I ≤ (p1p2)^{1/2} − p2.
Utility equals x2/(1 + x2) = I/(p2 + I).
(x1, x2, λ1, λ2, λ3) = (0, I/p2, −1 + p1p2/(p2 + I)², 0, p2/(p2 + I)²).
Case (3). By symmetry, the solution is
(x1, x2, λ1, λ2, λ3) = (I/p1, 0, 0, −1 + p1p2/(p1 + I)², p1/(p1 + I)²),
and for this case to hold it is necessary that p1p2/(p1 + I)² ≥ 1, or
I ≤ (p1p2)^{1/2} − p1.

To summarize: suppose p1 = p2 = p. Then (p1p2)^{1/2} − p1 = (p1p2)^{1/2} − p2
= 0, so since I > 0, we are in the regime of Case (1), and x1 = x2 = I/2p
at the maximum.
Suppose on the other hand that p1 < p2 (the contrary case can be worked
out similarly); then p2 > (p1p2)^{1/2} > p1, so that
(p1p2)^{1/2} − p1 > 0 > (p1p2)^{1/2} − p2. Thus either
I > (p1p2)^{1/2} − p1, in which case we use Case (1), or
I ≤ (p1p2)^{1/2} − p1, in which case we use Case (3). Case (2), in which a
positive amount of good 2 and zero of good 1 is consumed at the maximum,
does not apply. A numerical sanity check of Case (1) appears below.
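The promised sanity check: for prices and income in the Case (1) regime (my numbers, chosen to satisfy the inequalities above), a SciPy solver should land on the closed-form solution derived earlier:

    import numpy as np
    from scipy.optimize import minimize

    p1, p2, I = 1.0, 2.0, 10.0
    negU = lambda x: -(x[0]/(1 + x[0]) + x[1]/(1 + x[1]))
    cons = [{'type': 'ineq', 'fun': lambda x: I - p1*x[0] - p2*x[1]}]
    res = minimize(negU, x0=[1.0, 1.0],
                   bounds=[(0, None), (0, None)], constraints=cons)
    s = np.sqrt(p1*p2)
    x1 = (I + p2 - s) / (p1 + s)          # Case (1) closed form
    x2 = (I + p1 - s) / (p2 + s)
    print(res.x, (x1, x2))                # both approx (4.385, 2.808)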

5.4 Miscellaneous

(1) For problems where some constraints are of the form gi(x) = 0 and
others of the form gj(x) ≥ 0, only the latter give rise to Kuhn-Tucker-type
complementary slackness conditions (λj ≥ 0, gj(x) ≥ 0, λjgj(x) = 0).
(2) If the objective to be maximized, f, and the constraints gi, i = 1, . . . , k
(where constraints are of the form gi(x) ≥ 0) are all concave functions, and
if Slater's constraint qualification holds (i.e., there exists some x ∈ R^n s.t.
gi(x) > 0, i = 1, . . . , k), then the Kuhn-Tucker conditions become both
necessary and sufficient for a global max.
(3) Suppose f and all the gi's are quasiconcave. Then the Kuhn-Tucker
conditions are almost sufficient for a global max: an x* and λ* that satisfy
the Kuhn-Tucker conditions indicate that x* is a global max provided that,
in addition to the above, either Df(x*) ≠ 0, or f is concave.


Appendix
Completeness Property of Real Numbers
