Prices, Markets and Efficiency
Parikshit Ghosh
Delhi School of Economics

Outline: Introduction (Seating Puzzle); Equilibrium; Efficient Allocations; Market Failures
Introduction
Seating Puzzle: Many Theories
Equilibrium

Consumers: Quasilinear Utility
- n consumers, i = 1, 2, ..., n.
- y_i = consumer i's income.
- p = price of the good.
Consumers: Utility Maximization
Consumer i chooses q_i to maximize quasilinear utility u_i(q_i) + y_i − p q_i.
First-order condition: u_i'(q_i) = p, which implicitly defines the demand function q_i(p).
Differentiating: q_i'(p) = 1/u_i''(q_i) < 0, so individual demand curves slope down.
Consumer's Surplus
CS_i(p) = u_i(q_i(p)) − p q_i(p) = ∫_0^{q_i(p)} [u_i'(q_i) − p] dq_i
CS is the area under the demand curve and above the price line.
Market Demand
Q(p) = Σ_{i=1}^n q_i(p), so Q'(p) = Σ_{i=1}^n q_i'(p) < 0.
Elasticity of demand: e = −(dQ/dp)(p/Q).
Firms
- m firms, j = 1, 2, ..., m.
- x_j = firm j's output.
- Cost function: c_j(x_j) = F_j + φ_j(x_j), with fixed cost F_j and variable cost φ_j(x_j).

Profit Maximization
Firm j chooses x_j to maximize p x_j − c(x_j). The first-order condition c_j'(x_j) = p implicitly defines the supply function x_j(p).
Supply Function
Differentiating the first-order condition: x_j'(p) = 1/c''(x_j) > 0, so supply curves slope up.
Market supply: X(p) = Σ_{j=1}^m x_j(p).
Elasticity of supply: (dX/dp)(p/X).

Producer's Surplus
PS_j(p) = p x_j(p) − c(x_j(p)) = ∫_0^{x_j(p)} [p − c'(x_j)] dx_j
Area between the price line and the marginal cost curve.
Market Clearance
Equilibrium: Definition
A price p* clears the market if aggregate demand equals aggregate supply:
Σ_{i=1}^n q_i(p*) = Σ_{j=1}^m x_j(p*), or Q(p*) = X(p*)
Comparative Statics
Suppose demand shifts with a parameter θ. Differentiating the equilibrium condition Q(p, θ) = X(p) with respect to θ:
Q_p (dp/dθ) + Q_θ = X_p (dp/dθ)
so
dp/dθ = Q_θ/(X_p − Q_p)
Since X_p > 0 > Q_p, an increase in demand (Q_θ > 0) raises the equilibrium price.
If instead supply shifts with θ, differentiating Q(p) = X(p, θ):
Q_p (dp/dθ) = X_p (dp/dθ) + X_θ
or
dp/dθ = X_θ/(Q_p − X_p)
An increase in supply (X_θ > 0) lowers the equilibrium price.
Application: Blood Ivory
Seized stockpiles and the ivory trade.
Ivory Trade
Arguments over allowing sale of stockpiles:
- It is inherently immoral.
- May boost demand and open up new trading.
- Sends the wrong message?
Tax Incidence
Social Security
Demand: Q_d = a − bP_b. Supply: Q_s = c + dP_s.
Suppose a fraction α of the tax t is nominally levied on buyers. Then
P_b = P + αt, P_s = P − (1 − α)t
where P is the market price.
How does the overall size of the tax (t) affect welfare?
In equilibrium, Q_d = Q_s, i.e.
a − b(P + αt) = c + d[P − (1 − α)t]
Solving for the market price and quantity:
P* = (a − c)/(b + d) + [(1 − α)d − αb] t/(b + d)
Q* = (ad + bc)/(b + d) − bd t/(b + d)
The net price for buyers and sellers:
P_b = P* + αt = (a − c)/(b + d) + d t/(b + d)
P_s = P* − (1 − α)t = (a − c)/(b + d) − b t/(b + d)
Neither P_b nor P_s depends on α: the statutory split of the tax is irrelevant.
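The incidence algebra above can be checked numerically. The sketch below assumes the linear demand and supply of the slides; the parameter values are illustrative.

```python
# Numeric check of the tax-incidence formulas, assuming linear demand
# Qd = a - b*Pb and linear supply Qs = c + d*Ps, with a fraction
# alpha of the tax statutorily levied on buyers.

def equilibrium(a, b, c, d, t, alpha):
    # Market price P solves a - b*(P + alpha*t) = c + d*(P - (1 - alpha)*t)
    P = (a - c + ((1 - alpha) * d - alpha * b) * t) / (b + d)
    Pb = P + alpha * t          # price paid by buyers
    Ps = P - (1 - alpha) * t    # price received by sellers
    Q = a - b * Pb              # equilibrium quantity
    return P, Pb, Ps, Q

a, b, c, d, t = 100.0, 2.0, 10.0, 3.0, 5.0

# Buyer and seller prices are the same whatever the statutory split:
_, Pb0, Ps0, _ = equilibrium(a, b, c, d, t, alpha=0.0)
_, Pb1, Ps1, _ = equilibrium(a, b, c, d, t, alpha=1.0)
assert abs(Pb0 - Pb1) < 1e-12 and abs(Ps0 - Ps1) < 1e-12

# Incidence shares: buyers bear d/(b+d) of the tax, sellers b/(b+d).
_, Pb_free, Ps_free, _ = equilibrium(a, b, c, d, 0.0, alpha=0.5)
assert abs((Pb0 - Pb_free) - d / (b + d) * t) < 1e-12
assert abs((Ps_free - Ps0) - b / (b + d) * t) < 1e-12
print("incidence shares:", d / (b + d), b / (b + d))
```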
Thus dP_b/dt = d/(b + d) and dP_s/dt = −b/(b + d): buyers bear the share d/(b + d) of the tax and sellers the share b/(b + d).
More generally, differentiating the equilibrium condition and rearranging terms:
P'(t) = Q_s'/(Q_s' − Q_d')
Converting slopes into elasticities e_s and e_d (using P(t)/Q(t)):
dP_b/dt = e_s/(e_s + e_d), dP_s/dt = −e_d/(e_s + e_d)
The side of the market that is less elastic bears more of the tax.
Efficient Allocations: Feasible Allocations
An allocation specifies consumption q_i and transfer t_i for each consumer, and output x_j and revenue s_j for each firm. Feasibility requires
Σ_{i=1}^n q_i = Σ_{j=1}^m x_j and Σ_{i=1}^n t_i = Σ_{j=1}^m s_j
Consumer i's welfare is v_i = u_i(q_i) − t_i; firm j's profit is π_j = s_j − c_j(x_j).
Utilitarian Allocation
Maximize the sum of welfare:
max_{q,x,t,s} Σ_{i=1}^n [u_i(q_i) − t_i] + Σ_{j=1}^m [s_j − c_j(x_j)]
subject to Σ_{i=1}^n q_i = Σ_{j=1}^m x_j and Σ_{i=1}^n t_i = Σ_{j=1}^m s_j
Since transfers and revenues cancel in the objective, the problem reduces to
max_{q,x} Σ_{i=1}^n u_i(q_i) − Σ_{j=1}^m c_j(x_j) subject to Σ_{i=1}^n q_i = Σ_{j=1}^m x_j
This amounts to
max_{q,x} L(q, x) = Σ_{i=1}^n u_i(q_i) − Σ_{j=1}^m c_j(x_j) + λ [Σ_{j=1}^m x_j − Σ_{i=1}^n q_i]
First-order conditions: u_i'(q_i) = λ for every i and c_j'(x_j) = λ for every j.
Every consumer's marginal utility equals every firm's marginal cost, which is exactly what the competitive equilibrium delivers with λ = p*.
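A small numeric illustration: under assumed quadratic utilities u_i(q) = a_i q − q²/2 and cost c(x) = x²/2 (functional forms chosen for tractability, not taken from the slides), the shadow price λ has a closed form and plays the role of the competitive price.

```python
# Utilitarian allocation with quadratic utilities u_i(q) = a_i*q - q**2/2
# and costs c_j(x) = x**2/2 (illustrative functional forms).
# FOCs: u_i'(q_i) = a_i - q_i = lam and c_j'(x_j) = x_j = lam,
# so the feasibility condition sum(q_i) = sum(x_j) pins down lam.

def utilitarian(a, m):
    lam = sum(a) / (len(a) + m)   # shadow price = competitive price p*
    q = [ai - lam for ai in a]    # consumer demands at price lam
    x = [lam] * m                 # firm supplies at price lam
    return lam, q, x

a = [10.0, 8.0, 6.0]
lam, q, x = utilitarian(a, m=2)
assert abs(sum(q) - sum(x)) < 1e-12                               # market clears
assert all(abs((ai - qi) - lam) < 1e-12 for ai, qi in zip(a, q))  # u' = lam
print("shadow price:", lam)
```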
Market Failures: Market Interventions
Quotas
Caveat 1: Externalities

"The reason that the invisible hand often seems invisible is that it is often not there. Whenever there are 'externalities', where the actions of an individual have impacts on others for which they do not pay, or for which they are not compensated, markets will not work well... Markets, by themselves, produce too much pollution. Markets, by themselves, also produce too little basic research... The real debate today is about finding the right balance between the market and government... Both are needed. They can complement each other."
Joseph Stiglitz.
Choice and Demand
Parikshit Ghosh

Outline: Demand Functions; Applications
Binary Relations
- x_i = quantity of good i; a bundle is x = (x_1, ..., x_n).
- x^1 ≻ x^2 (strict preference) if x^1 ≿ x^2 but not x^2 ≿ x^1.
- x^1 ∼ x^2 (indifference) if x^1 ≿ x^2 and x^2 ≿ x^1.
The Axioms

Preference Representation
Preference Representation
Proof (contd.)
To show that z̄ = ẑ (the indifferent scalar bundle is unique):
- Suppose ẑ < z̄. Then for any z with ẑ < z < z̄, completeness is violated.
- Suppose ẑ > z̄. Then for any z with z̄ < z < ẑ, (z, ..., z) ∼ x, and strict monotonicity is violated.
Preference Representation
Example: u(x) = x_1 x_2.

Definition
A function f(x) is (strictly) quasiconcave if, for every x^1, x^2 and λ ∈ (0, 1),
f(λx^1 + (1 − λ)x^2) ≥ (>) min{f(x^1), f(x^2)}
Indifference Curves
Along an indifference curve, total utility is constant:
(∂u/∂x_1) dx_1 + (∂u/∂x_2) dx_2 = 0
so the slope is
dx_2/dx_1 = −u_1/u_2 < 0
The absolute slope |u_1/u_2| is the marginal rate of substitution (MRS).
No jumps (continuity).
Optimization
The budget set is
B = {x | p·x = Σ_{i=1}^n p_i x_i ≤ y, x_i ≥ 0}
With increasing utility, the budget constraint binds at the optimum: y − p·x = 0.
Lagrange's Method
For the general problem max_x f(x) subject to constraints g_j(x) ≥ 0, form the Lagrangian
L(x, λ) = f(x) + Σ_{j=1}^m λ_j g_j(x)
For the consumer's problem:
L(x, λ) = u(x) + λ [y − Σ_{i=1}^n p_i x_i]
First-order conditions:
∂u/∂x_i − λ p_i = 0, i = 1, ..., n
y − Σ_{i=1}^n p_i x_i = 0
Dividing the first-order conditions for goods i and j:
(∂u/∂x_i)/(∂u/∂x_j) = p_i/p_j
|MRS_ij| = price ratio
At an interior optimum the indifference curve is tangent to the budget line.
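The tangency condition can be verified on a worked example. The Cobb-Douglas utility below is an assumed illustration, not a form used in the slides.

```python
# Checking |MRS| = price ratio at the optimum for the assumed utility
# u(x1, x2) = x1**a * x2**(1 - a) (Cobb-Douglas).

def demand(a, p1, p2, y):
    # Closed-form Marshallian demands for Cobb-Douglas utility.
    return a * y / p1, (1 - a) * y / p2

a, p1, p2, y = 0.3, 2.0, 5.0, 100.0
x1, x2 = demand(a, p1, p2, y)

# Marginal utilities and the MRS at the optimum.
u1 = a * x1 ** (a - 1) * x2 ** (1 - a)
u2 = (1 - a) * x1 ** a * x2 ** (-a)
assert abs(u1 / u2 - p1 / p2) < 1e-12      # |MRS| = price ratio
assert abs(p1 * x1 + p2 * x2 - y) < 1e-12  # budget binds
print(x1, x2)
```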
Optimization
Second-order conditions involve the bordered Hessian of u and are tedious to check directly. A more useful sufficient condition:

Theorem
Suppose x* ≥ 0 solves the first-order conditions obtained by the Lagrange method. If u(.) is quasiconcave, then x* is a constrained maximum.
Constrained Optimization: Comparative Statics
- A solution x(a) depends on the parameter vector a; the value function is f(x(a); a).
- The first-order conditions form a system of equations f^i(x; a) = 0, i = 1, ..., n.
- The Jacobian of the system is the n × n matrix J = [∂f^i/∂x_j].
Differentiating the system with respect to the parameter a_k gives
J · (dx_1/da_k, ..., dx_n/da_k) = −(∂f^1/∂a_k, ..., ∂f^n/∂a_k)
Applying Cramer's rule, we get
dx_i/da_k = |J_k^i|/|J|
where J_k^i is J with its i-th column replaced by −(∂f/∂a_k).

For envelope results, write the Lagrangian with the parameter explicit:
L(x, λ; a) = f(x; a) + Σ_{j=1}^m λ_j g_j(x; a)
For an unconstrained problem, the first-order condition is f_x(x(a); a) = 0. Totally differentiating:
f_xx x'(a) + f_xa = 0, so x'(a) = −f_xx^{-1} f_xa
Functional Properties: Demand Functions
The consumer's problem max u(x) subject to p·x ≤ y, x_i ≥ 0 yields the demand functions x(p, y) and the indirect utility function
v(p, y) = u(x(p, y))
Roy's identity:
x_i(p, y) = −[∂v(p, y)/∂p_i] / [∂v(p, y)/∂y]
From the Lagrangian L(x, λ) = u(x) + λ(y − p·x), the envelope theorem gives
∂v(p, y)/∂y = ∂L/∂y = λ
The multiplier λ is the marginal utility of income. Similarly ∂v/∂p_i = −λ x_i, which yields Roy's identity.
Duality Theory
The expenditure minimization problem min p·x subject to u(x) ≥ u yields the Hicksian demands x^h(p, u) and the expenditure function e(p, u).

Theorem
Suppose f(x) and g(x) are increasing functions. Then f* = max_x f(x) subject to g(x) ≤ g* if and only if g* = min_x g(x) subject to f(x) ≥ f*.

Consequently:
e(p, v(p, y)) = y, v(p, e(p, u)) = u
Properties of the Expenditure Function
- e(p, u(0)) = 0.
- For all p, e(p, u) is increasing in u.
- Concave in p.

Proof of concavity: let p = λp^1 + (1 − λ)p^2 and let x̄ attain e(p, u). Since x̄ is feasible but not necessarily optimal at p^1 or p^2,
p^1·x̄ ≥ e(p^1, u) and p^2·x̄ ≥ e(p^2, u)
Hence
e(λp^1 + (1 − λ)p^2, u) = p·x̄ = λ p^1·x̄ + (1 − λ) p^2·x̄ ≥ λ e(p^1, u) + (1 − λ) e(p^2, u)
Functional Properties: The Slutsky Equation
∂x_i(p, y)/∂p_j = ∂x_i^h(p, u)/∂p_j − x_j(p, y) ∂x_i(p, y)/∂y
The first term is the substitution effect, the second the income effect.

Derivation: x_i^h(p, u) = x_i(p, e(p, u)). Differentiating w.r.t. p_j:
∂x_i^h(p, u)/∂p_j = ∂x_i(p, e(p, u))/∂p_j + [∂x_i(p, e(p, u))/∂y]·[∂e(p, u)/∂p_j]
Using Shephard's lemma, ∂e/∂p_j = x_j^h(p, u) = x_j(p, y) at u = v(p, y). Rearranging:
∂x_i^h(p, u)/∂p_j = ∂x_i(p, y)/∂p_j + x_j(p, y) ∂x_i(p, y)/∂y
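The Slutsky decomposition can be checked by finite differences. The Cobb-Douglas specification below (share parameter A) is an assumed example, not taken from the slides.

```python
# Finite-difference check of the Slutsky equation for an assumed
# Cobb-Douglas utility u = x1**A * x2**(1-A): the Marshallian price
# derivative equals the Hicksian one minus x_j times the income derivative.

A = 0.4  # Cobb-Douglas share parameter (illustrative)

def marshallian_x1(p1, p2, y):
    return A * y / p1

def hicksian_x1(p1, p2, u):
    # x1^h = d e/d p1 with e(p, u) = u * (p1/A)**A * (p2/(1-A))**(1-A)
    e = u * (p1 / A) ** A * (p2 / (1 - A)) ** (1 - A)
    return A * e / p1

p1, p2, y, h = 2.0, 3.0, 120.0, 1e-6
x1 = marshallian_x1(p1, p2, y)
x2 = (1 - A) * y / p2
u = x1 ** A * x2 ** (1 - A)          # utility attained at the optimum

dx1_dp1 = (marshallian_x1(p1 + h, p2, y) - marshallian_x1(p1 - h, p2, y)) / (2 * h)
dx1h_dp1 = (hicksian_x1(p1 + h, p2, u) - hicksian_x1(p1 - h, p2, u)) / (2 * h)
dx1_dy = (marshallian_x1(p1, p2, y + h) - marshallian_x1(p1, p2, y - h)) / (2 * h)

# Slutsky (own-price case, j = i): dx1/dp1 = dx1^h/dp1 - x1 * dx1/dy
assert abs(dx1_dp1 - (dx1h_dp1 - x1 * dx1_dy)) < 1e-4
```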
The matrix of Hicksian price derivatives is the Hessian of the expenditure function:
H = [∂x_i^h/∂p_j] = [∂²e(p, u)/∂p_i ∂p_j]
Since e is concave in p, H is symmetric and negative semidefinite; in particular ∂x_i^h/∂p_i ≤ 0.
- For a normal good (∂x_i/∂y > 0), the law of demand holds (∂x_i/∂p_i < 0).
- For an inferior good (∂x_i/∂y < 0), it may or may not hold.
- Giffen goods are those which have positively sloped demand curves (∂x_i/∂p_i > 0). A Giffen good must be (a) inferior and (b) an important item of consumption (x_i large).
Applications: Anomalies
Framing Effect
Most subjects choose B (72%) in one frame and C (78%) in the other, even though the two frames describe the same prospects.
The receiver can either accept the proposed split or reject it.

Suppose you choose D over C. But a year later, you will want to reverse your choice!
Exponential Discounting
The standard model: choose a consumption plan to solve
max_{c_t} Σ_{t=0}^∞ δ^t u(c_t) subject to Σ_{t=0}^∞ c_t = 1
At any date t, the continuation problem is
max Σ_{τ=t}^∞ δ^{τ−t} u(c_τ) subject to Σ_{τ=t}^∞ c_τ = c̄
where c̄ is remaining wealth. The Lagrangian is
L(c, λ) = Σ_{τ=t}^∞ δ^{τ−t} u(c_τ) + λ [c̄ − Σ_{τ=t}^∞ c_τ]
First-order condition: δ^{τ−t} u'(c_τ) = λ for every τ ≥ t. Eliminating λ:
u'(c_τ)/u'(c_{τ+1}) = δ
intertemporal MRS = discount factor
The same condition holds when re-optimizing at any later date, so the plan is time-consistent.
Logarithmic Utility
Suppose u(c) = log c. Then the optimal plan is c_t = (1 − δ)δ^t: the consumer spends a constant fraction (1 − δ) of remaining wealth each period.
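A quick numeric check of the log-utility plan, truncating the infinite horizon:

```python
# Checking the log-utility consumption plan c_t = (1 - delta)*delta**t:
# it exhausts total wealth (= 1) and satisfies the Euler condition
# u'(c_t)/u'(c_{t+1}) = delta for u(c) = log(c), i.e. c_{t+1}/c_t = delta.

delta = 0.9
T = 2000  # truncation horizon for the infinite sum
c = [(1 - delta) * delta ** t for t in range(T)]

assert abs(sum(c) - 1.0) < 1e-6       # budget: the plan sums to 1
for t in range(50):
    # u'(c) = 1/c, so u'(c_t)/u'(c_{t+1}) = c_{t+1}/c_t
    assert abs(c[t + 1] / c[t] - delta) < 1e-12
```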
Quasi-Hyperbolic Discounting
Suppose instead that preferences at date t are
U_t = u(c_t) + β Σ_{τ=t+1}^∞ δ^{τ−t} u(c_τ), with β < 1
The date-0 Lagrangian is
L(c, λ) = u(c_0) + β Σ_{τ=1}^∞ δ^τ u(c_τ) + λ [1 − Σ_{τ=0}^∞ c_τ]
First-order conditions:
u'(c_0) = λ, β δ^τ u'(c_τ) = λ for τ > 0
Eliminating λ:
MRS_{0,1} = u'(c_0)/u'(c_1) = βδ
u'(c_t)/u'(c_{t+1}) = δ for all t > 0
However, when date t arrives, the consumer will want to change the plan and reallocate consumption such that
MRS_{t,t+1} = βδ, not δ
- Realizing that she may change her own optimal plan later, the self-aware consumer will adjust her plan at date 0 itself.
- Alternatively, the consumer may try to commit and restrict her own future options (e.g. Christmas savings accounts).
Production, Costs and the Firm
Parikshit Ghosh

The Firm

Technology
Returns to scale: decreasing, constant or increasing, according to whether scaling all inputs by a factor λ > 1 raises output by less than, exactly, or more than the factor λ.
Profit Maximization
The firm solves
max_{x,y} py − wx subject to y ≤ f(x)
This can be broken into two steps: first find the cheapest way to produce each output level y, giving the cost function c(w, y); then choose y to maximize py − c(w, y).
Costs with Homogeneous Technology
Suppose f is homogeneous of degree k. Substituting x = y^{1/k} x̃:
c(w, y) = min_x {w·x : f(x) ≥ y}
= y^{1/k} min_{x̃} {w·x̃ : f(y^{1/k} x̃) ≥ y}
= y^{1/k} min_{x̃} {w·x̃ : f(x̃) ≥ 1}
= c(w, 1) y^{1/k}
Costs are convex in output if k < 1 (decreasing returns) and concave if k > 1 (increasing returns).
Choosing output: max_y py − c(w, y).
First-order condition:
p = ∂c(w, y)/∂y (price = marginal cost)
Second-order condition:
∂²c(w, y)/∂y² ≥ 0
Equivalently, choosing inputs directly: max_x p f(x) − wx, with first-order condition
p ∂f(x)/∂x_i = w_i
Marginal revenue product = price of input.
Functional Properties
The solutions define the supply and input demand functions y = y(p, w), x_i = x_i(p, w), and the profit function π(p, w) = p y(p, w) − w·x(p, w).
π is convex in (p, w). Let the firm choose (y^1, x^1) at prices (p^1, w^1), (y^2, x^2) at (p^2, w^2), and (y, x) at (p, w) = λ(p^1, w^1) + (1 − λ)(p^2, w^2). Since (y, x) was feasible but not chosen at the other price vectors:
p^1 y − w^1·x ≤ p^1 y^1 − w^1·x^1 and p^2 y − w^2·x ≤ p^2 y^2 − w^2·x^2
Multiplying by λ and (1 − λ) and adding:
π(p, w) ≤ λ π(p^1, w^1) + (1 − λ) π(p^2, w^2)
By Hotelling's lemma, y = ∂π/∂p and x_i = −∂π/∂w_i. The matrix of derivatives of (y, −x) with respect to (p, w) is therefore the Hessian of π: symmetric and, by convexity, positive semidefinite. In particular, ∂y/∂p ≥ 0 and ∂x_i/∂w_i ≤ 0.
Monopoly
Parikshit Ghosh

Outline: Monopoly; Price Discrimination; Double Marginalization

Non-Discriminating Monopolist
A single seller faces inverse demand p(q) and cost c(q).
The monopolist solves max_q p(q)q − c(q). First-order condition:
p(q_m) + q_m p'(q_m) = c'(q_m)
Dividing by p_m = p(q_m) and writing the elasticity e(q_m) = −(dq/dp)(p_m/q_m), so that (q_m/p_m)(dp/dq) = −1/e(q_m):
[p_m − c'(q_m)]/p_m = 1/e(q_m)
The Lerner markup equals the inverse elasticity: a monopolist always prices on the elastic part of demand.
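The inverse-elasticity rule can be confirmed on an assumed linear example:

```python
# Verifying the Lerner markup rule on an assumed linear example:
# inverse demand p(q) = a - b*q and constant marginal cost c.

a, b, c = 20.0, 1.0, 4.0

qm = (a - c) / (2 * b)          # monopoly output from p + q*p' = c'
pm = a - b * qm                 # monopoly price
elasticity = pm / (b * qm)      # e = -(dq/dp)*(p/q) = p/(b*q)

lerner = (pm - c) / pm
assert abs(lerner - 1 / elasticity) < 1e-12
print(qm, pm, lerner)
```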
The monopolist undersupplies relative to the competitive benchmark q* (where p(q*) = c'(q*)). Suppose q_m ≥ q*. Then p(q_m) ≤ p(q*) and c'(q_m) ≥ c'(q*), so in
p(q_m) − p(q*) + q_m p'(q_m) = c'(q_m) − c'(q*)
the left-hand side is negative while the right-hand side is nonnegative, a contradiction. Hence q_m < q*.
Monopoly in Pictures
First-Degree Price Discrimination
With a continuum of buyers whose valuations v are distributed F, serving all buyers above a cutoff v̄ yields profit
∫_{v̄} v f(v) dv − c(1 − F(v̄))
First-order condition: v̄ f(v̄) = c'(1 − F(v̄)) f(v̄), i.e.
v̄ = c'(1 − F(v̄))
price to marginal customer = marginal cost
Perfect discrimination is efficient, though all surplus goes to the monopolist.
Two-Part Tariff
Let π(p) = pq(p) − c(q(p)) be variable profit, and let the monopolist also charge a fixed fee equal to the consumer's surplus at price p. Total profit then equals total surplus, which is maximized where price equals marginal cost: the chosen quantity q̂ satisfies u'(q̂) = c'(q̂), and the fee extracts the surplus.
Monopolistic Screening

Rationing
Two types of buyers with valuations v_H > v_L; let μ be the fraction of high types. Cost of production is 0. With a single price, the monopolist will:
- charge p = v_H if μ v_H ≥ v_L;
- charge p = v_L if μ v_H < v_L.
Screening with Two Qualities
The monopolist offers a low-quality version (extra cost c_L) at price p_L and a high-quality version (extra cost c_H > c_L) at price p_H, subject to incentive compatibility for each type (IC-H, IC-L) and participation for each type (PC-H, PC-L). The binding constraints are IC-H and PC-L; using IC-H, serving the high type with the high quality requires
v_H − v_L ≥ c_H − c_L
Optimal solution (assume v_H > v_L + c_H − c_L):
p_L = v_L − c_L
p_H = v_L + c_H − c_L
Expected revenue: μ(v_L + c_H − c_L) + (1 − μ)(v_L − c_L)
Vertical Mergers (Double Marginalization)
The downstream firm buys the input at wholesale price q and resells, solving max_x R(x) − (c_x + q)x, so its input demand satisfies R'(x) = c_x + q.

Upstream Problem
Anticipating this, the upstream firm (unit cost c_y) solves
max_y y R'(y) − c_x y − c_y y
An integrated firm would instead solve
max_x R(x) − (c_x + c_y)x
with first-order condition R'(x*) = c_x + c_y, so x* > x̂: integration removes the second markup and raises output.
Tax Harmonization
Let q = a − b(p̂ + t_c + t_s), where p̂ is the producer price and t_c, t_s are the Central and State taxes. The Centre first chooses a tax t_c. Then the State chooses its own tax t_s. This can be inefficient: cumulative tax rates are too high.

State's Problem
max_{t_s} t_s [a − b(p̂ + t_c + t_s)]
First-order condition: a − b(p̂ + t_c) = 2b t_s, so
t_s'(t_c) = −1/2
If Central taxes are higher, the State will lower its own taxes to some degree but not completely.

Centre's Problem
max_{t_c} t_c [a − b(p̂ + t_c + t_s(t_c))] = arg max_{t_c} (1/2) t_c (a − bp̂ − bt_c)
giving
t_c = (a − bp̂)/(2b)
Plugging back:
t_s = (a − bp̂)/(4b)
Tax Harmonization
A single authority would instead solve max_t t [a − b(p̂ + t)], giving t = (a − bp̂)/(2b), which is less than the uncoordinated cumulative rate t_c + t_s = 3(a − bp̂)/(4b). Each government ignores the erosion of the other's tax base, so cumulative tax rates are too high.
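The two-stage tax game can be replayed numerically; parameter values are illustrative.

```python
# Numeric check of the two-stage tax game, assuming linear demand
# q = a - b*(phat + tc + ts) as in the slides.

a, b, phat = 10.0, 1.0, 2.0
base = a - b * phat

def ts_best_response(tc):
    # State FOC: a - b*(phat + tc) = 2*b*ts
    return (base - b * tc) / (2 * b)

# The Centre internalizes the State's response ts(tc): its objective
# tc*(base - b*tc - b*ts(tc)) = tc*(base - b*tc)/2 peaks at tc = base/(2b).
tc = base / (2 * b)
ts = ts_best_response(tc)
assert abs(ts - base / (4 * b)) < 1e-12

# A single harmonized tax t maximizing t*(base - b*t) is base/(2b):
t_single = base / (2 * b)
assert tc + ts > t_single   # uncoordinated cumulative tax is too high
print(tc, ts, t_single)
```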
Microeconomic Theory I: Choice Under Uncertainty
Parikshit Ghosh
Delhi School of Economics
September 8, 2014

Outline: Critique; Applications
Lotteries
A lottery assigns probabilities to a finite set of outcomes a_1, ..., a_n: g = (p_1 ∘ a_1, ..., p_n ∘ a_n).
- Sure outcomes: (0 ∘ a_1, ..., 1 ∘ a_i, ..., 0 ∘ a_n) = a_i.
- Two-outcome lotteries over the best and worst outcomes: (α ∘ a_1, (1 − α) ∘ a_n).
Monotonicity: (α ∘ a_1, (1 − α) ∘ a_n) ≿ (β ∘ a_1, (1 − β) ∘ a_n) iff α ≥ β.
Axiom 5 (Substitution/Independence): If g = (p_1 ∘ g_1, ..., p_k ∘ g_k), h = (p_1 ∘ h_1, ..., p_k ∘ h_k) and g_i ∼ h_i for all i = 1, 2, ..., k, then g ∼ h.
Representation Theorems
Under the axioms, ≿ is represented by a function of the expected utility form:
u(g) = Σ_{i=1}^n p_i u(a_i)
so that g ≿ g' iff u(g) ≥ u(g').
Representation Theorems
Proof: Representation
- For each gamble g, continuity gives a number u(g) ∈ [0, 1] with
g ∼ (u(g) ∘ a_1, (1 − u(g)) ∘ a_n) (continuity)
- Then g ≿ g' iff
(u(g) ∘ a_1, (1 − u(g)) ∘ a_n) ≿ (u(g') ∘ a_1, (1 − u(g')) ∘ a_n) (transitivity)
iff u(g) ≥ u(g') (monotonicity)
- For each outcome a_i, let q_i = (u(a_i) ∘ a_1, (1 − u(a_i)) ∘ a_n), so a_i ∼ q_i. Then for g = (p_1 ∘ a_1, ..., p_n ∘ a_n):
g ∼ (p_1 ∘ q_1, p_2 ∘ q_2, ..., p_n ∘ q_n) (substitution)
∼ ((Σ_{i=1}^n p_i u(a_i)) ∘ a_1, (1 − Σ_{i=1}^n p_i u(a_i)) ∘ a_n) (axiom 6)
By monotonicity,
u(g) = Σ_{i=1}^n p_i u(a_i)
Representation Theorems: Uniqueness
Suppose u and v both represent ≿, and a_i ∼ (α_i ∘ a_1, (1 − α_i) ∘ a_n). Then
u(a_i) = α_i u(a_1) + (1 − α_i) u(a_n), v(a_i) = α_i v(a_1) + (1 − α_i) v(a_n)
Solving for α_i:
α_i = [u(a_i) − u(a_n)]/[u(a_1) − u(a_n)] = [v(a_i) − v(a_n)]/[v(a_1) − v(a_n)]

Proof (contd.)
Rearranging:
v(a_i) = [u(a_1)v(a_n) − u(a_n)v(a_1)]/[u(a_1) − u(a_n)] + u(a_i)·[v(a_1) − v(a_n)]/[u(a_1) − u(a_n)]
The first term is a constant and the coefficient on u(a_i) is positive: v is a positive affine transformation of u.
Anomalies: The Allais Paradox
Common choice pattern: A over B, and D over C.

What is Wrong?
A ≻ B implies
u(1) > .1u(5) + .89u(1) + .01u(0), or .1u(5) + .01u(0) < .11u(1)
D ≻ C implies
.1u(5) + .9u(0) > .11u(1) + .89u(0), or .1u(5) + .01u(0) > .11u(1)
The two inequalities contradict each other: no expected utility function rationalizes the pattern.
Anomalies: The Ellsberg Paradox
Subjects typically bet on the known rather than the ambiguous urn in both choices.

What is Wrong?
Writing p for the subjective probability of the ambiguous event:
A ≻ B implies p < 1/3
D ≻ C implies 1 − p < 2/3, i.e. p > 1/3
These preferences cannot be represented by any expected utility function (ambiguity aversion).
Anomalies: Preference for Randomization
Preference: b ∼ g, but (0.5 ∘ b, 0.5 ∘ g) ≻ b, g. Strict preference for the coin flip over either sure option violates the independence axiom.
Perceptual Biases: Bayes' Rule
A disease infects 1% of the population; a test detects it with probability .9 in the infected and gives a false positive with probability .1 in the uninfected. Then
Pr(inf | positive) = Pr(inf) Pr(positive|inf) / [Pr(inf) Pr(positive|inf) + Pr(uninf) Pr(positive|uninf)]
= (.01)(.9) / [(.01)(.9) + (.99)(.1)] = 1/12
The small prior nullifies the effect of the large test accuracy.
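The posterior can be computed exactly with rational arithmetic:

```python
# Bayes' rule for the diagnostic-test example: prior 1%,
# sensitivity 0.9, false-positive rate 0.1.
from fractions import Fraction

prior = Fraction(1, 100)
sens = Fraction(9, 10)      # Pr(positive | infected)
fpr = Fraction(1, 10)       # Pr(positive | uninfected)

posterior = prior * sens / (prior * sens + (1 - prior) * fpr)
assert posterior == Fraction(1, 12)
print(posterior)  # 1/12
```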
Framing Effect
Most subjects choose B (72%) in one frame and C (78%) in the other, though the frames describe the same prospects.
Monetary Payoffs
With monetary outcomes, risk attitudes are captured by the curvature of u and the certainty equivalent C(g).

Insurance
A consumer with wealth w faces a loss x with probability p and can buy coverage x̄ at premium rate γ per unit, choosing x̄ to maximize
p u(w − x + x̄ − γx̄) + (1 − p) u(w − γx̄)
First-order condition:
p(1 − γ) u'(w − x + x̄ − γx̄) = γ(1 − p) u'(w − γx̄)
With full insurance (x̄ = x) marginal utilities are equal across states, so the FOC reduces to p(1 − γ) − γ(1 − p) = 0, i.e. γ = p: full insurance is optimal exactly when the premium is actuarially fair.
Acceptance Sets
Consider gambles (x_1, x_2) with win probability p. The boundary of the acceptance set consists of gambles that leave the agent indifferent to her current wealth:
p u(w + x_1) + (1 − p) u(w + x_2(x_1)) = u(w)
Differentiating once and twice and evaluating at (0, 0):
x_2'(0) = −p/(1 − p)
p u''(w) + (1 − p) u''(w) [x_2'(0)]² + (1 − p) u'(w) x_2''(0) = 0
so
x_2''(0) = −[p/(1 − p)²]·u''(w)/u'(w)
The more curved the boundary at (0, 0), the smaller is the acceptance set; the curvature is proportional to the coefficient of absolute risk aversion −u''(w)/u'(w).
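A numeric sketch of the boundary slopes, assuming log utility (an illustrative choice):

```python
# Numeric check of the acceptance-set boundary, assuming u(w) = log(w)
# and solving for x2(x1) on the indifference boundary
# p*u(w + x1) + (1 - p)*u(w + x2) = u(w).
import math

w, p = 100.0, 0.4

def x2(x1):
    # Solve (1-p)*log(w + x2) = log(w) - p*log(w + x1) exactly.
    return math.exp((math.log(w) - p * math.log(w + x1)) / (1 - p)) - w

h = 1e-3
slope = (x2(h) - x2(-h)) / (2 * h)
curv = (x2(h) - 2 * x2(0.0) + x2(-h)) / h ** 2

assert abs(slope - (-p / (1 - p))) < 1e-5
# Predicted curvature: -(p/(1-p)**2) * u''(w)/u'(w), with u''/u' = -1/w
assert abs(curv - (p / (1 - p) ** 2) / w) < 1e-3
```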
Lectures on Optimization
A. Banerji
September 2, 2014
Chapter 1
Introduction
1.1 Some Examples
We briefly introduce our framework for optimization, and then discuss some preliminary concepts and results that we'll need to analyze specific problems.
Our optimization examples can all be couched in the following general framework:
Suppose V is a vector space and S ⊆ V. Suppose F : V → R. We wish to find x* ∈ S s.t. F(x*) ≥ F(x) for all x ∈ S, or x_* ∈ S s.t. F(x_*) ≤ F(x) for all x ∈ S. Here x* and x_* are respectively called a maximum and a minimum of F on S.
In different applications, V can be finite- or infinite-dimensional. The
latter need more sophisticated optimization tools such as optimal control; we
will keep that sort of stuff in abeyance for now. In our applications, F will
be continuous, and pretty much also differentiable; often twice continuously
differentiable. S will be specified most often using constraints.
Example 1 Let U : R^k_+ → R be a utility function, and p_1, ..., p_k, I positive prices and wealth. Maximize U s.t. x_i ≥ 0, i = 1, ..., k, and Σ_{i=1}^k p_i x_i = p.x ≤ I.
Here, the objective function is U, and the feasible set is the budget set. The set of bundles satisfying the constraints depends on (p, I), so we write S(p, I) for the budget set to show this dependence. The maximum
value that the utility function takes on this set (if the maximum exists), i.e.
V (p, I) = Max{U (x)|x S(p, I)}
therefore typically depends on the parameter (p, I), and we denote this
dependence of the maximum by the value function V (p, I). In consumer
theory, we call this the indirect utility function. This is a function because
to each point (p, I) in the admissible parameter space, V (p, I) assigns a single
value, equal to the maximum of U (x) over all x S(p, I).
Note that for a given (p, I), the set of bundles that maximize utility may
not be unique: we can denote this relationship by x(p, I), the set of all
bundles that maximize utility given (p, I). If the optimal bundle were unique
for every (p, I), then x(p, I) is a function (the Walrasian or Marshallian
demand function), and therefore V (p, I) = U (x(p, I)).
In other problems, not just the feasible set S but also the objective function depends on the parameter. For instance, in Example 2 (Expenditure Minimization), the parameter is (p, U), and the objective function p.x depends on the parameter via the price vector p, while the constraint U(x) ≥ U depends on it via the utility level U.
In general, we may write down the optimization problem in parametric form as follows: A parameter θ is part of some admissible set Θ of parameters, where Θ is a subset of some vector space W (finite or infinite-dimensional; e.g. in Example 1, θ = (p, I) ∈ R^n_{++} × R_+, and this latter set is thus Θ). The feasible set is S(θ), which depends on θ and is a subset of some (other) vector space V (e.g. in Example 1, S(p, I) ⊆ R^n_+). The objective function F maps from some subset of V × W to the real line: we write F(x, θ). The problem is to maximize or minimize F(x, θ) s.t. x ∈ S(θ).
Please read the other examples in Sundaram. We give one final one here.
Example 6 Identifying Pareto Optima
Two agents share an endowment ω of a single good. Let
F(ω) = {(x_1, x_2) | x_1, x_2 ≥ 0, x_1 + x_2 ≤ ω}
be the feasible set or set of feasible allocations.
An allocation (y_1, y_2) Pareto dominates (x_1, x_2) if
u_i(y_i) ≥ u_i(x_i), i = 1, 2, with > for some i
An allocation (x_1, x_2) is Pareto optimal if there is no feasible allocation that Pareto dominates it.
Let a ∈ (0, 1) and consider the social welfare function U(x_1, x_2, a) ≡ a u_1(x_1) + (1 − a) u_2(x_2). Then if (z_1, z_2) is any allocation that solves Max{U(x_1, x_2, a) | (x_1, x_2) ∈ F(ω)}, it is Pareto optimal.
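A brute-force sketch of the claim, assuming logarithmic utilities (an illustrative choice): the welfare maximizer splits the endowment in proportion to the weights, and a random search finds no feasible Pareto improvement.

```python
# Sketch of Example 6: maximizing a*u1(x1) + (1-a)*u2(x2) over
# x1 + x2 <= w with u_i = log (an illustrative choice) gives
# x1 = a*w, x2 = (1-a)*w. A random search finds no feasible
# allocation that Pareto dominates it.
import random

w, a = 1.0, 0.3
x1, x2 = a * w, (1 - a) * w   # maximizer of a*log(x1) + (1-a)*log(x2)

def dominated():
    random.seed(0)
    for _ in range(100000):
        y1 = random.uniform(0, w)
        y2 = random.uniform(0, w - y1)
        # Pareto domination: weakly better for both, strictly for one.
        # Since u_i is increasing, compare quantities directly.
        if y1 >= x1 and y2 >= x2 and (y1 > x1 or y2 > x2):
            return True
    return False

assert not dominated()
print(x1, x2)
```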
1.2

We will now discuss some concepts that we will need, such as the compactness of the set S above, and the continuity and differentiability of the objective function F. We will work in normed linear spaces. In the absence of any other specification, the space we will be in is R^n with the Euclidean norm ||x|| = (Σ_{i=1}^n x_i²)^{1/2}. (There's a bunch of other norms that would work equally well. Recall that a norm in R^n is defined to be a function assigning to each vector x a non-negative real number ||x||, s.t. (i) for all x, ||x|| ≥ 0, with = 0 iff x = 0 (0 being the zero vector); (ii) if c ∈ R, ||cx|| = |c| ||x||; (iii) ||x + y|| ≤ ||x|| + ||y||. The last requirement, the triangle inequality, follows for the Euclidean norm from the Cauchy-Schwarz inequality.)
One example in the previous section used another normed linear space,
namely the space of bounded continuous functions defined on an interval
of real numbers, with the sup norm. But in further work in this part of
the course, we will stick to using finite dimensional spaces. Some of the
concepts below apply to both finite and infinite dimensional spaces, so we
will sometimes call the underlying space V. But mostly, it will help to think of V as simply R^n, and to visualize stuff in R^2.
We will measure distance between vectors using ||x − y|| = (Σ_{i=1}^n (x_i − y_i)²)^{1/2}. This is our intuitive notion of distance using Pythagoras' theorem. Furthermore, it satisfies the three properties of a metric, viz., (i) ||x − y|| ≥ 0, with = 0 iff x = y; (ii) ||x − y|| = ||y − x||; (iii) ||x − z|| ≤ ||x − y|| + ||y − z||. Note that property (iii) for the metric follows from the triangle inequality for the norm, since ||x − z|| = ||(x − y) + (y − z)|| ≤ ||x − y|| + ||y − z||.
Open and Closed Sets
Let ε > 0 and x ∈ V. The open ball centered at x with radius ε is defined as
B(x, ε) = {y : ||x − y|| < ε}
We see that if V = R, B(x, ε) is the open interval (x − ε, x + ε).
Definition 1 Convergence:
A sequence (x^k)_{k=1}^∞ of points in V converges to x if for every ε > 0 there exists a positive integer N s.t. k ≥ N implies ||x^k − x|| < ε.
Note that this is the same as saying that for every open ball B(x, ε), we can find N s.t. for all points x^k following x^N, x^k lies in B(x, ε). This implies that when x^k converges to x (notation: x^k → x), all but a finite number of points in (x^k) lie arbitrarily close to x.
Examples. x^k = 1/k, k = 1, 2, ... is a sequence of real numbers converging
to zero. x^k = (1/k, 1/k), k = 1, 2, ... is a sequence of vectors in R^2 converging
to the origin. More generally, a sequence converges in R^n if and only if all
the coordinate sequences converge, as can be visualized in the example here
using hypotenuses and legs of triangles.
Theorem 2 (x^k) → x in R^n iff for every i ∈ {1, . . . , n}, the coordinate
sequence (x_i^k) → x_i.
Proof. Since

(x_i^k - x_i)^2 ≤ Σ_{j=1}^n (x_j^k - x_j)^2,

taking square roots implies |x_i^k - x_i| ≤ ||x^k - x||, so for every k ≥ N s.t.
||x^k - x|| < ε, |x_i^k - x_i| < ε.
Conversely, if all the coordinate sequences converge to the coordinates
of the point x, then there exists a positive integer N s.t. k ≥ N implies
|x_i^k - x_i| < ε/√n for every coordinate i. Squaring, adding across all i and
taking square roots, we have ||x^k - x|| < ε.
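The coordinatewise bound |x_i^k - x_i| ≤ ||x^k - x|| used in the proof is easy to see numerically. A small sketch in Python, using the example sequence above (the function names are my own, purely illustrative):

```python
import math

# The sequence x^k = (1/k, 1/k) in R^2 from the example above.
def x(k):
    return (1.0 / k, 1.0 / k)

def norm(v):
    # Euclidean norm ||v|| = (sum of squares)^(1/2)
    return math.sqrt(sum(c * c for c in v))

# Find the first N with ||x^k - 0|| < eps; since ||x^k|| = sqrt(2)/k,
# this is the first integer k exceeding sqrt(2)/eps.
eps = 1e-3
N = next(k for k in range(1, 10 ** 6) if norm(x(k)) < eps)

# Each coordinate is then within eps of its limit too,
# because |x_i^k - x_i| <= ||x^k - x||.
assert all(abs(c) < eps for c in x(N))
```

Here N works out to 1415, the first integer above √2/0.001.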
Several convergence results that appear to be true are in fact so. For
instance, (x^k) → x, (y^k) → y implies (x^k + y^k) → (x + y). Indeed, there
exists N s.t. k ≥ N implies ||x^k - x|| < ε/2 and ||y^k - y|| < ε/2. So
||(x^k + y^k) - (x + y)|| = ||(x^k - x) + (y^k - y)|| ≤ ||x^k - x|| + ||y^k - y|| (by the
triangle inequality), and this is less than ε/2 + ε/2 = ε.
Exercise 3 Let (ak ) and (bk ) be sequences of real numbers that converge to a
and b respectively. Then the product sequence (ak bk ) converges to the product
ab.
Closed sets can be characterized in terms of convergent sequences as follows.
Lemma 2 A set S is closed if and only if for every sequence (x^k) lying in
S, x^k → x implies x ∈ S.
Proof. Suppose S is closed. Take any sequence (x^k) in S that converges to a
point x. Then for every B(x, ε), we can find a member x^k of the sequence
lying in this open ball. So, x adheres to S. Since S is closed, it must contain
this adherent point x.
Conversely, suppose the set S has the property that whenever (x^k) ⊂ S
converges to x, x ∈ S. Take a point y that adheres to S. Take the successively
smaller open balls B(y, 1/k), k = 1, 2, 3, .... We can find, in each such open
ball, a point y^k from the set S (since y adheres to S). These points need not
be all distinct, but since the open balls have radii converging to 0, y^k → y.
Thus by the convergence property of S, y ∈ S. So, any adherent point y of
S actually belongs to S.
Related Results
1. If (a^k) is a sequence of real numbers all greater than or equal to 0,
and a^k → a, then a ≥ 0. The reason is that for all k, a^k ∈ [0, ∞), which is a
closed set and hence must contain the limit a.
2. Sup and Inf.
Let S ⊂ R. u is an upper bound of S if u ≥ a for every a ∈ S. s is the
supremum or least upper bound of S (called sup S) if s is an upper bound
of S, and s ≤ u for every upper bound u of S.
We say that a set S of real numbers is bounded above if there exists an
upper bound, i.e. a real number M s.t. a ≤ M for all a ∈ S. The most important
property of a supremum, which we'll by and large take here as given, is the
following:
Completeness Property of Real Numbers: Every set S of real numbers that is bounded above has a supremum.
For a short discussion of this property, see the Appendix.
Note that sup S may or may not belong to S.
Examples. S = (0, 1), D = [0, 1], K = the set of all numbers in the sequence
1 - 1/2^n, n = 1, 2, 3, .... The supremum of all these sets is 1, and this does not
belong to S or to K.
When sup S belongs to S, it is called the maximum of S, for obvious
reasons. Another important property of suprema is the following.
Lemma 3 For every ε > 0, there exists a number a ∈ S s.t. a > sup S - ε.
Note that this means that sup S is an adherent point of S.
Proof. Suppose that for some ε > 0, there is no number a ∈ S s.t. a >
sup S - ε. Every a ∈ S must then satisfy a ≤ sup S - ε. But then,
sup S - ε is an upper bound of S that is less than sup S. This implies that
sup S is not in fact the supremum of S. Contradiction.
Then, (x^n) does not converge; but the subsequences (y^m) = 1, 1, 1, ....
and (z^m) = -1, -1, -1, ... both converge, to different limits. (Such points are
called limit points of the mother sequence (x^n).)
Compact sets have a property related to this fact.
Definition 2 A set S ⊂ V is compact if every sequence (x^n) in S has a
subsequence that converges to a point in S.
Theorem 5 Suppose S ⊂ R^n. Then S is compact if and only if it is closed
and bounded.
Proof (Sketch).
Suppose S is closed and bounded. We can show it's compact using a
pigeonhole-like argument; let's sketch it here. Since S is bounded, we can
cover it by a closed rectangle R_0 = I_1 × . . . × I_n, where I_i, i = 1, ..., n are
closed intervals. Take a sequence (x^n) in S. Divide the rectangle in two:
I_1^1 × I_2 × . . . × I_n and I_1^2 × I_2 × . . . × I_n, where I_1^1 ∪ I_1^2 = I_1 is a division
of I_1 into 2 intervals. Then, there's an infinity of members of (x^n) in at least
one of these smaller rectangles; call this R_1. Divide R_1 into 2 smaller rectangles,
say by dividing I_2 into 2 smaller intervals; we'll find an infinity of members of
(x^n) in at least one of these rectangles; call it R_2. This process goes on ad
infinitum, and we find an infinity of members of (x^n) in the rectangles
R_0 ⊃ R_1 ⊃ R_2 ⊃ .... By the Cantor Intersection Theorem, ∩_{i=0}^∞ R_i is a
single point; call this point x. Now we can choose points y^i ∈ R_i, i = 1, 2, ...
s.t. each y^i is some member of (x^n); because the R_i's collapse to x, it is easy
to show that (y^i) is a subsequence that converges to x. Moreover, the y^i's lie
in S, and S is closed; so x ∈ S.
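The subsequence idea can be seen in miniature with a concrete bounded sequence; the toy example below is my own (no halving argument needed, since here a convergent subsequence can be picked out directly):

```python
def x(n):
    # A bounded sequence in [-2, 2] with no limit.
    return (-1) ** n + 1.0 / n

# The even-indexed subsequence y_m = x(2m) converges to 1:
y = [x(2 * m) for m in range(1, 2001)]
assert abs(y[-1] - 1.0) < 1e-3

# The full sequence keeps jumping between neighborhoods of -1 and 1,
# so it cannot converge itself.
assert abs(x(1001) - x(1002)) > 1.5
```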
Conversely, suppose S is compact.
(i) Then it is bounded. For suppose not. Then we can construct a sequence (x^n) in S s.t. for every n = 1, 2, ..., ||x^n|| > n. But then, no subsequence of (x^n) can converge to a point in S. Indeed, take any point x ∈ S
and any subsequence (x^{m(n)}) of (x^n). Then
Chapter 2
Existence of Optima
2.1
Weierstrass Theorem
Chapter 3
Unconstrained Optima
3.1
Preliminaries
lim_{y→x} [ (f(y) - f(x)) / (y - x) - a ] = 0        (1)

By limit equal to 0 as y → x, we require that the limit be 0 w.r.t. all
sequences (y^n) s.t. y^n → x. a turns out to be the unique number equal to
the slope of the tangent to the graph of f at the point x. We denote a by
the notation f'(x). We can rewrite Equation (1) as follows:

lim_{y→x} |f(y) - f(x) - f'(x)(y - x)| / |y - x| = 0        (2)

Note that this means the numerator tends to zero faster than does the
denominator.
We can use this way of defining differentiability for more general functions.
lim_{y→x} ||f(y) - f(x) - A(y - x)|| / ||y - x|| = 0
In the one variable case, the existence of a gives the existence of a tangent;
in the more general case, the existence of the matrix A gives the existence of
tangents to the graphs of the m component functions f = (f_1, ..., f_m), each
of those functions being from R^n → R. In other words this definition has
to do with the best linear affine approximation to f at the point x. To see
this in a way equivalent to the above definition, put h = y - x in the above
definition, so y = x + h. Then in the 1-variable case, from the numerator,
f(x + h) is approximated by the affine function f(x) + ah = f(x) + f'(x)h. In
the general case, f(x + h) is approximated by the affine function f(x) + Ah.
It can be shown that (w.r.t. the standard bases in R^n and R^m), the matrix
A equals Df(x), the m × n matrix of partial derivatives of f evaluated at the
point x. To see this, take the slightly less general case of a function f : R^n → R.
If f is differentiable at x, there exists a 1 × n matrix A = (a_11, . . . , a_1n)
satisfying the definition above: i.e.

lim_{h→0} |f(x + h) - f(x) - Ah| / ||h|| = 0

In particular, the above must hold if we choose h = (0, .., 0, t, 0, .., 0) with
h_j = t → 0. That is,

lim_{t→0} |f(x_1, .., x_j + t, .., x_n) - f(x_1, .., x_j, .., x_n) - a_1j t| / |t| = 0

But from the limit on the LHS, we know that a_1j must equal the partial
derivative ∂f(x)/∂x_j.
We refer to Df(x) as the derivative of f at x; Df itself is a function from R^n
to the space of m × n matrices.
Df(x) =
[ ∂f_1(x)/∂x_1 . . . ∂f_1(x)/∂x_n ]
[     ...       ...      ...      ]
[ ∂f_m(x)/∂x_1 . . . ∂f_m(x)/∂x_n ]
Here,

∂f_i(x)/∂x_j = lim_{t→0} [f_i(x_1, .., x_j + t, ..., x_n) - f_i(x_1, .., x_j, ..., x_n)] / t
We want to also represent the partial derivative in different notation: Let
e_j = (0, .., 0, 1, 0, ..., 0) be the unit vector in R^n on the j-th axis. Then,

∂f_i(x)/∂x_j = lim_{t→0} [f_i(x + t e_j) - f_i(x)] / t
That is, the partial of f_i w.r.t. x_j, evaluated at the point x, is looking at
essentially a function of 1 variable: we take the surface (graph) of the
function f_i, and slice it parallel to the j-th axis, s.t. point x is contained
on this slice/plane; we'll get a function pasted on this plane; its derivative
is the relevant partial derivative.
To be more precise about this one-variable function pasted on the slice/plane,
note that the single variable t ∈ R is first mapped to a vector x + t e_j ∈ R^n, and
then that vector is mapped to a real number f_i(x + t e_j). So, let φ : R → R^n
be defined by φ(t) = x + t e_j, for all t ∈ R. Then the one-variable function
we're looking for is g : R → R defined by g(t) = f_i(φ(t)), for all t ∈ R; it's
the composition of f_i and φ.
In addition to slicing the surface of functions that map from R^n to R
in the directions of the axes, we can slice them in any direction and get
a function pasted on the slicing plane. This is the notion of a directional
derivative.
Recall that if x ∈ R^n, and h ∈ R^n, then the set of all points that can
be written as x + th, for some t ∈ R, comprises the line through x in the
direction of h.
See figure (drawn in class).
Definition 6 The directional derivative of a function f : R^n → R at x ∈ R^n,
in the direction h ∈ R^n, denoted Df(x; h), is

lim_{t→0+} [f(x + th) - f(x)] / t
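Definition 6 can be checked against a one-sided difference quotient; the function below is an arbitrary smooth example of my own choosing, not from the notes:

```python
def f(x):
    # illustrative choice: f(x1, x2) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3 * x[0] * x[1]

def directional_derivative(f, x, h, t=1e-6):
    # one-sided difference quotient straight from the definition
    xt = [xi + t * hi for xi, hi in zip(x, h)]
    return (f(xt) - f(x)) / t

x, h = [1.0, 2.0], [1.0, 1.0]
# gradient of f at (1, 2) is (2*1 + 3*2, 3*1) = (8, 3),
# so Df(x; h) should be 8 + 3 = 11
approx = directional_derivative(f, x, h)
assert abs(approx - 11.0) < 1e-4
```

As the exact expansion f(x + th) - f(x) = 11t + 4t^2 shows, the quotient is 11 + 4t, which tends to 11 as t → 0+.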
3.2
Interior Optima
z^k → x*, y^k → x*.
Taking limits preserves these inequalities, since (-∞, 0] and [0, ∞) are
closed sets and the ratio sequences lie in these closed sets. So,

0 ≤ f'(x*) ≤ 0,

so f'(x*) = 0.
Step 2. Suppose n > 1. Take any j-th axis direction, and let g : R → R
be defined by g(t) = f(x* + t e_j). Note that g(0) = f(x*). Now, since x* is a
local max of f, f(x*) ≥ f(x* + t e_j) for |t| smaller than some cutoff value: i.e.,
g(0) ≥ g(t) for |t| smaller than this cutoff value, i.e., g(0) is a local interior
maximum (since t < 0 and t > 0 are both allowed). g is differentiable
at 0 since g(0) = f(φ(0)) = f(x*), f is differentiable at x*, and φ is
differentiable at t = 0 (here, φ(t) = x* + t e_j, so Dφ(t) = e_j for all t). So, g is
differentiable at 0, g'(0) = 0 by Step 1, and by the Chain Rule,

∂f(x*)/∂x_j = 0.
Note that this is necessary but not sufficient for a local max or min, e.g.
f (x) = x3 has a vanishing first derivative at x = 0, which is not a local
optimum.
Second Order Conditions
Chapter 4
Optimization with Equality
Constraints
4.1
Introduction
The following example illustrates the principle of no arbitrage underlying a maximum. A more general illustration, with more than 1 constraint,
requires a little bit of the machinery of linear inequalities, which we'll not
cover. The idea here is that the Lagrange multiplier captures how the constraint is distributed across the variables.
Example 1. Suppose x* solves Max U(x_1, x_2) s.t. I - p_1 x_1 - p_2 x_2 = 0,
and suppose x* >> 0.
Then reallocating a small amount of income from one good to the other
does not increase utility. Say income dI > 0 is shifted from good 1 to good 2.
So dx_1 = -(dI/p_1) < 0 and dx_2 = (dI/p_2) > 0. Note that this reallocation
satisfies the budget constraint, since

p_1(x_1 + dx_1) + p_2(x_2 + dx_2) = I

The change in utility is dU = U_1 dx_1 + U_2 dx_2
= [(U_2/p_2) - (U_1/p_1)]dI ≤ 0, since the change in utility cannot be positive
at a maximum. Therefore,

(U_2/p_2) - (U_1/p_1) ≤ 0        (1)

Similarly, dI > 0 shifted from good 2 to good 1 does not increase utility,
so that [(U_1/p_1) - (U_2/p_2)]dI ≤ 0, or

(U_1/p_1) - (U_2/p_2) ≤ 0        (2)

Together, (1) and (2) give (U_1/p_1) = (U_2/p_2): at an interior maximum,
the marginal utility per rupee is equalized across the two goods.
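As a numerical sanity check, take the Cobb-Douglas utility U = x_1 x_2 (my own choice, purely for illustration). Its demand x* = (I/(2p_1), I/(2p_2)) indeed equalizes the marginal utility per rupee, as conditions (1) and (2) require:

```python
# Illustrative check with U(x1, x2) = x1*x2, so U1 = x2 and U2 = x1.
I, p1, p2 = 100.0, 2.0, 5.0
x1, x2 = I / (2 * p1), I / (2 * p2)   # Cobb-Douglas demands
U1, U2 = x2, x1                       # marginal utilities at x*

# marginal utility per rupee equalized, and the budget holds exactly
assert abs(U1 / p1 - U2 / p2) < 1e-12
assert abs(p1 * x1 + p2 * x2 - I) < 1e-12
```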
4.2
Dg(x) =
[ Dg_1(x) ]
[   ...   ]
[ Dg_k(x) ]

So Dg(x) is a k × n matrix.
The theorem below provides a necessary condition for a local max or
local min. Note that x* is a local max (resp. min) of f on the constraint set
{x ∈ R^n | g_i(x) = 0, i = 1, . . . , k} if f(x*) ≥ f(x) (resp. ≤ f(x)) for all x ∈ U,
for some open set U containing x*, s.t. g_i(x) = 0, i = 1, . . . , k. Thus x* is
a Max on the set S = U ∩ {x ∈ R^n | g_i(x) = 0, i = 1, . . . , k}.
The theorem gives only a necessary
condition; so there could be points x_0 that meet the condition and yet are
not even local max or min.
Example. Max f(x, y) = x^3 + y^3, s.t. g(x, y) = x - y = 0. Here the
contour set C_g(0) is the 45-degree line in the x-y plane. By taking larger
and larger positive values of x and y on this contour set, we get higher and
higher f(x, y). So f does not have a global max on the constraint set. But
if we mechanically crank out the Lagrangean FONCs as follows

Max x^3 + y^3 + λ(x - y)
FONC: 3x^2 + λ = 0
3y^2 - λ = 0
x - y = 0.

So x = y = λ = 0 is a solution. But (x*, y*) = (0, 0)
is neither a local max nor a local min. Indeed, f(0, 0) = 0, whereas for
(x, y) = (ε, ε), ε > 0, f(ε, ε) = 2ε^3 > 0, and for (x, y) = (ε, ε), ε < 0,
f(ε, ε) = 2ε^3 < 0.
Pathology 2. The CQ is violated at the optimum.
In this case, the FONCs need not be satisfied at the global optimum.
Example. Max f(x, y) = y s.t. g(x, y) = x^2 + y^3 = 0.
Let us first find the solution using native intelligence. Then we'll show
that the CQ fails at the optimum, and that the usual Lagrangean method
is a disaster. Finally, we'll show that the general form of the equation in the
Theorem of Lagrange, which does NOT assume that the CQ holds at the
optimum, works.
The constraint is y^3 = -x^2, and since x^2 is nonnegative, y^3 must be
nonpositive. Therefore, y ≤ 0. The maximum of y s.t. y ≤ 0 implies y = 0 at the max.
So y^3 = -x^2 = 0, so x = 0. So f attains its global max at (x, y) = (0, 0).
Dg(x, y) = (2x, 3y^2) = (0, 0) at (x, y) = (0, 0). So rank(Dg(x, y)) =
0 < k = 1 at the optimum; the CQ fails at this point. Using the Lagrangean
method, we get the following FONC:

(∂f/∂x) + λ(∂g/∂x) = 0, that is 2λx = 0        (1)
(∂f/∂y) + λ(∂g/∂y) = 0, that is 1 + 3λy^2 = 0        (2)
(∂L/∂λ) = 0, that is x^2 + y^3 = 0        (3)

Eq.(1) implies either λ = 0 or x = 0. x = 0 implies, from Eq.(3), that
Dx(t) is in the direction of the tangent to the curve x(t), so the equation
above implies that Dg_i(x(t)) is orthogonal to it. (Seen as a vector rather
than a matrix, we write this as the gradient ∇g_i(x(t)).) (As an application,
notice how this geometry implies the first order condition MRS_xy = p_x/p_y in
a two-good utility maximization in which both goods are consumed at the U
max.)
In the second-order conditions, we check the definiteness or semi-definiteness
of the second-derivative or Hessian D^2 L(x*, λ*) w.r.t. all vectors x that are
orthogonal to the gradient of each constraint. This approximates vectors
close to x* that satisfy each g_i(x) = 0.
Since L(x, λ) = f(x) + Σ_{i=1}^k λ_i g_i(x),

D^2 L(x, λ)_{n×n} = D^2 f(x)_{n×n} + Σ_{i=1}^k λ_i D^2 g_i(x)_{n×n},

so D^2 L(x, λ)_{n×n} =
[ f_11(x) + Σ_{i=1}^k λ_i g_i11(x) . . . f_1n(x) + Σ_{i=1}^k λ_i g_i1n(x) ]
[            ...                   ...             ...                  ]
[ f_n1(x) + Σ_{i=1}^k λ_i g_in1(x) . . . f_nn(x) + Σ_{i=1}^k λ_i g_inn(x) ]

is the second derivative of L w.r.t. the x variables. Note that D^2 L(x, λ)
is symmetric, so we may work with its quadratic form.
Dg(x*) =
[ Dg_1(x*) ]
[   ...    ]
[ Dg_k(x*) ]

So the set of all vectors x that are orthogonal to all the gradient vectors
of the constraint functions at x* is the Null Space of Dg(x*), N(Dg(x*)) =
{x ∈ R^n | Dg(x*)x = 0_{k×1}}.

Theorem 13 Suppose there exists (x*_{n×1}, λ*_{k×1}) such that Rank(Dg(x*)) = k
and Df(x*) + Σ_{i=1}^k λ_i* Dg_i(x*) = 0.
BH(L*) =
[ 0_{k×k}            Dg(x*)_{k×n}         ]
[ [Dg(x*)]^T_{n×k}   D^2 L(x*, λ*)_{n×n}  ]

which is (n+k) × (n+k). BH(L*; k + n - r) is the matrix obtained by deleting the last r rows and
columns of BH(L*).
BH_π(L*) will denote a variant in which the permutation π has been
applied to (i) both rows and columns of D^2 L(x*, λ*) and (ii) only the columns
of Dg(x*) and only the rows of [Dg(x*)]^T, which is the transpose of Dg(x*).

Theorem 14 (1a) x^T D^2 L(x*, λ*)x ≤ 0 for all x ∈ N(Dg(x*)), iff for all
permutations π of {1, . . . , n}, we have:
(-1)^{n-r} det(BH_π(L*; n + k - r)) ≥ 0, r = 0, 1, . . . , k - 1.
(1b) x^T D^2 L(x*, λ*)x ≥ 0 for all x ∈ N(Dg(x*)), iff for all permutations π
of {1, . . . , n}, we have:
(-1)^k det(BH_π(L*; k + n - r)) ≥ 0, r = 0, 1, . . . , k - 1.
(2a) x^T D^2 L(x*, λ*)x < 0 for all nonzero x ∈ N(Dg(x*)), iff (-1)^{n-r} det(BH(L*; n + k - r)) > 0, r = 0, 1, . . . , k - 1.
(2b) x^T D^2 L(x*, λ*)x > 0 for all nonzero x ∈ N(Dg(x*)), iff (-1)^k det(BH(L*; n + k - r)) > 0, r = 0, 1, . . . , k - 1.
Note. (1) For the negative definite or semidefiniteness subject to constraints cases, the determinant of the bordered Hessian with the last r rows and
columns deleted must be of the same sign as (-1)^{n-r}. The sign of (-1)^{n-r}
switches with each successive increase in r from r = 0 to r = k - 1. So the
corresponding bordered Hessians switch signs. In the usual textbook case of
2 variables and one constraint, k = 1, k - 1 = 0, so we just need to check
the sign for r = 0, that is, the sign of the determinant of the big bordered
Hessian. You should be clear about what this sign should be if it is to be a
sufficient condition for a strict local max or min. For the necessary condition,
we need to check signs (≥ 0 or ≤ 0) for one permuted matrix as well, in this
case. What is this permuted matrix?
(2) As in the unconstrained case, the sufficiency conditions do not require
checking weak inequalities for permuted matrices.
(3) In the p.s.d. and p.d. cases, the signs of the principal minors must
be all positive, if the number k of constraints is even, and all negative, if k
is odd.
(4) If we know that a global max or min exists, where the CQ is satisfied,
and we get a unique solution x* ∈ R^n that solves the FONC, then we may
use a second order condition to check whether it is a max or a min. However,
weak inequalities demonstrating n.s.d. or p.s.d. (subject to constraints) of
D^2(L*) do not imply a max or min; these are necessary conditions. Strict
inequalities are useful; they imply a (strict) max or min. If, however, a global
max or min exists, the CQ is satisfied everywhere, and there is more than
one solution of the FONC, then the one giving the highest value of f(x) is
the max. In this case, we don't need second order conditions to conclude
that it is the global max.
VII.4. Two Examples
Example 1.
A consumer with income I > 0 faces prices p_1 > 0, p_2 > 0, and wishes
to maximize U(x_1, x_2) = x_1 x_2. So the problem is: Max x_1 x_2 s.t. x_1 ≥ 0,
x_2 ≥ 0, and p_1 x_1 + p_2 x_2 ≤ I.
To be able to use the Theorem of Lagrange, we need equality constraints.
Now, it is easy to see that if (x1 , x2 ) solves the above problem, then (i)
(x1 , x2 ) > (0, 0). If xi = 0 for some i, then utility equals zero; clearly, we can
do better by allocating some income to the purchase of each good; and (ii)
the budget constraint binds at (x1 , x2 ). For if p1 x1 + p2 x2 < I, then we can
allocate some of the remaining income to both goods, and increase utility
further.
We conclude from this that a solution (x1 , x2 ) will also be a solution to
the problem
Max x_1 x_2 s.t. x_1 > 0, x_2 > 0, and p_1 x_1 + p_2 x_2 = I.
That is, Maximize U(x_1, x_2) = x_1 x_2 over the set S = R^2_{++} ∩ {(x_1, x_2) |
I - p_1 x_1 - p_2 x_2 = 0}. Since the budget set in this problem is compact and the
utility function is continuous, U attains a maximum on the budget set (by
Weierstrass' Theorem). Moreover, we argued above that at such a maximum
x*, x_i* > 0, i = 1, 2 and the budget constraint binds. So, x* ∈ S.
Furthermore, Dg(x) = (-p_1, -p_2), so Rank(Dg(x)) = 1 at all points in
the budget set. So the CQ is met. Therefore, the global max will be among
the critical points of L(x_1, x_2, λ) = x_1 x_2 + λ(I - p_1 x_1 - p_2 x_2).
FONC:
∂L/∂x_1 = x_2 - λp_1 = 0        (1)
∂L/∂x_2 = x_1 - λp_2 = 0        (2)
∂L/∂λ = I - p_1 x_1 - p_2 x_2 = 0        (3)
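Solving (1)-(3) by hand gives x_1* = I/(2p_1), x_2* = I/(2p_2) and λ* = I/(2p_1 p_2); a quick check with concrete parameter values (chosen arbitrarily):

```python
# Verify that the closed-form candidate solves FONC (1)-(3).
I, p1, p2 = 120.0, 3.0, 4.0
x1, x2 = I / (2 * p1), I / (2 * p2)
lam = I / (2 * p1 * p2)

eq1 = x2 - lam * p1            # dL/dx1
eq2 = x1 - lam * p2            # dL/dx2
eq3 = I - p1 * x1 - p2 * x2    # dL/dlam

assert max(abs(eq1), abs(eq2), abs(eq3)) < 1e-12
```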
D^2 L(x*, λ*) = D^2 U(x*) + λ* D^2 g(x*)

= [ U_11(x*)  U_12(x*) ] + λ* [ g_11(x*)  g_12(x*) ]
  [ U_21(x*)  U_22(x*) ]      [ g_21(x*)  g_22(x*) ]

= [ 0  1 ] + λ* [ 0  0 ] = [ 0  1 ]
  [ 1  0 ]      [ 0  0 ]   [ 1  0 ]
Now evaluate the quadratic form z^T D^2 L(x*, λ*)z = 2z_1 z_2 at any (z_1, z_2)
that is orthogonal to Dg(x*) = (-p_1, -p_2). So, -p_1 z_1 - p_2 z_2 = 0, or
z_1 = -(p_2/p_1)z_2. For such (z_1, z_2), z^T D^2 L(x*, λ*)z = -(2p_2/p_1)z_2^2 < 0, so
D^2 L(x*, λ*) is negative definite relative to vectors orthogonal to the gradient
of the constraint, and x* is therefore a strict local max.
You've probably seen the computation below. I provide it here anyway,
even though it is unnecessary, and we've done the second-order exercise above
using the quadratic form.

BH(L*) = [ 0           Dg(x*)         ] = [  0    -p_1  -p_2 ]
         [ [Dg(x*)]^T  D^2 L(x*, λ*)  ]   [ -p_1   0     1   ]
                                          [ -p_2   1     0   ]

det(BH(L*)) = 2p_1 p_2 > 0. This is the sign of (-1)^n = (-1)^2. Therefore,
there is a strict local max at x*.
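The determinant claim is easy to verify numerically; the hand-rolled 3×3 determinant below keeps the check dependency-free:

```python
def det3(M):
    # cofactor expansion along the first row of a 3x3 matrix
    a, b, c = M[0]
    d, e, f = M[1]
    g, h, i = M[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

p1, p2 = 3.0, 4.0   # arbitrary positive prices
BH = [[0.0, -p1, -p2],
      [-p1, 0.0, 1.0],
      [-p2, 1.0, 0.0]]

# det(BH) = 2*p1*p2 > 0, matching the sign of (-1)^2
assert abs(det3(BH) - 2 * p1 * p2) < 1e-12
```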
Example 2.
Find global maxima and minima of f(x, y) = x^2 - y^2 on the unit circle
in R^2, i.e., on the set {(x, y) ∈ R^2 | g(x, y) ≡ 1 - x^2 - y^2 = 0}.
Constrained maxima and minima exist, by Weierstrass' theorem, as f
is continuous and the unit circle is closed and bounded. Bounded, as it is
entirely contained in, say, B(0, 2). Closed as well: visually, we can see that
the constraint set contains all its adherent points. More formally, suppose
(x_k, y_k)_{k=1}^∞ is a sequence of points on the unit circle converging to the limit
(x, y). Since g is continuous, and (x_k, y_k) → (x, y), we have g(x_k, y_k) →
g(x, y). Since g(x_k, y_k) = 0 for all k, their limit is 0, i.e. g(x, y) = 0, or (x, y) is on
the unit circle, and so the unit circle is closed.
Constraint Qualification: Dg(x, y) = (-2x, -2y). The rank of this row
matrix is zero only at (x, y) = (0, 0). But the origin does not satisfy the
constraint. Everywhere on the constraint, at least one of x or y is not zero,
and the rank of Dg(x, y) is 1.
So, the max and min will be solutions to the FOCs of the usual Lagrangean.

L(x, y, λ) = x^2 - y^2 + λ(1 - x^2 - y^2)

FOC.
2x - 2λx = 0        (1)
-2y - 2λy = 0        (2)
x^2 + y^2 = 1        (3)

(1) and (2) imply 2x(1 - λ) = 0 and -2y(1 + λ) = 0 respectively. Suppose
λ ≠ 1 or -1. Then (x, y) = (0, 0), violating (3). If λ = 1, y = 0, so x^2 = 1,
and so on. So the four solutions (x, y, λ) to the FOCs form the solution set
{(1, 0, 1), (-1, 0, 1), (0, 1, -1), (0, -1, -1)}. Evaluating the function values at
these points, we have that f has a constrained max at (1, 0) and (-1, 0) and a
constrained min at (0, 1) and (0, -1).
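A minimal check of the four critical points found above:

```python
f = lambda x, y: x ** 2 - y ** 2

# the (x, y) parts of the four FOC solutions, with their f-values
crit = {(1, 0): 1, (-1, 0): 1, (0, 1): -1, (0, -1): -1}
for (x, y), val in crit.items():
    assert x ** 2 + y ** 2 == 1    # each lies on the unit circle
    assert f(x, y) == val

# max value 1 at (+-1, 0); min value -1 at (0, +-1)
assert max(crit.values()) == 1 and min(crit.values()) == -1
```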
Although unnecessary, let's practice second-order conditions for this example. Df(x, y) = (2x, -2y), Dg(x, y) = (-2x, -2y).

D^2 f(x, y) = [ 2   0 ]        D^2 g(x, y) = [ -2   0 ]
              [ 0  -2 ]                      [  0  -2 ]

At (x*, y*) = (±1, 0) we have λ* = 1, so

D^2 L(x*, λ*) = D^2 f + λ* D^2 g = [ 0   0 ]
                                   [ 0  -4 ]

A vector z = (0, y) orthogonal to Dg(±1, 0) = (∓2, 0) gives the quadratic form
z^T D^2 L(x*, λ*)z = -4y^2 < 0 for all such z ≠ (0, 0). So negative definiteness holds
on the relevant subspace, and we are at a strict local max.
Some Derivatives
(1). Let I : R^n → R^n be defined by I(x) = x for all x ∈ R^n. In component
function notation, we have I(x) = (I_1(x), . . . , I_n(x)) = (x_1, . . . , x_n). So,
DI_i(x) = e_i, i.e. the vector with 1 in the i-th place and zeros elsewhere. So,
DI(x) = I_{n×n}, the identity matrix.
By similar work, we can show that if f(x) = Ax, where A is an m × n matrix, then Df(x) = A. Indeed, the j-th component function f_j(x) = a_j1 x_1 +
. . . + a_jn x_n, so its matrix of partial derivatives with respect to x_1, . . . , x_n is
Df_j(x) = (a_j1 . . . a_jn).
(2). Let f : R^n → R^m and g : R^n → R^m. By way of convention, consider
f(x) and g(x) to be column vectors, and consider the function h : R^n → R
defined by h(x) = f(x) · g(x) = Σ_{i=1}^m f_i(x)g_i(x). Then

Dh(x) = Σ_{i=1}^m g_i(x)Df_i(x) + Σ_{i=1}^m f_i(x)Dg_i(x),

where, for instance,

Σ_{i=1}^m g_i(x)Df_i(x) = (g_1(x), . . . , g_m(x)) [ Df_1(x) ]
                                                  [   ...   ]
                                                  [ Df_m(x) ]

= g(x)^T Df(x), and so on.
We take a step back and derive this in a more expanded fashion. Since
h(x) = Σ_{i=1}^m f_i(x)g_i(x), its partial derivative with respect to x_j is:

∂h(x)/∂x_j = Σ_{i=1}^m [ g_i(x) ∂f_i(x)/∂x_j + f_i(x) ∂g_i(x)/∂x_j ].
As an application, let h(x) = x^T x. Then Dh(x) equals x^T DI(x) + x^T DI(x) =
x^T I + x^T I = 2x^T.
On the Chain Rule
We saw an example (in the proof of the 1st order condition in unconstrained optimization) of the Chain Rule at work; you've seen this before.
Namely, if h : R → R^n and f : R^n → R are differentiable at the relevant
points, then the composition g(t) = f(h(t)) is differentiable at t and

g'(t) = Df(h(t))Dh(t) = Σ_{j=1}^n [∂f(h(t))/∂x_j] h_j'(t)
You may have encountered this before in notation f(h_1(t), . . . , h_n(t)),
with some use of total differentiation or something. Similarly, suppose h :
R^p → R^n and f : R^n → R^m are differentiable at the relevant points; then the
composition g(x) = f(h(x)), g : R^p → R^m, is differentiable at x, and

Dg(x) = Df(h(x))Dh(x).

Here, on the RHS an m × n matrix multiplies an n × p matrix, to result
in the m × p matrix on the LHS.
The intuition for the Chain Rule is perhaps this. Let z = h(x). If x
changes by dx, the first-order change in z is dz = Dh(x)dx. The first-order
change in f(z) is then Df(z)dz. Substituting for dz, the first-order change
in f(h(x)) equals [Df(h(x))Dh(x)] dx.
In the formula, things are actually quite similar to the familiar case.
The (i, j)-th element of the matrix Dg(x) is ∂g_i(x)/∂x_j, where g_i is the i-th
component function of g and x_j is the j-th variable. Since this is equal to the
dot product of the i-th row of Df(h(x)) and the j-th column of Dh(x), we have

∂g_i(x)/∂x_j = Σ_{k=1}^n [∂f_i(h(x))/∂h_k] [∂h_k(x)/∂x_j]
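A finite-difference check of the one-variable chain rule, with h and f chosen arbitrarily for illustration:

```python
import math

# h : R -> R^2 and f : R^2 -> R, illustrative choices
h = lambda t: (t ** 2, math.sin(t))
f = lambda z: z[0] * z[1]

def g(t):
    # the composition g = f o h, i.e. g(t) = t^2 * sin(t)
    return f(h(t))

t = 0.7
# chain rule: g'(t) = f_1(h(t)) h1'(t) + f_2(h(t)) h2'(t)
#                   = sin(t)*2t + t^2*cos(t)
chain = math.sin(t) * 2 * t + t ** 2 * math.cos(t)

# central difference approximation of g'(t)
eps = 1e-6
numeric = (g(t + eps) - g(t - eps)) / (2 * eps)
assert abs(chain - numeric) < 1e-6
```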
f'(x*) = -F_x/F_y

where F_x, F_y are the partial derivatives of F, evaluated at (x*, y*). The
marginal rate of substitution between the two goods (LHS) equals the ratio of
the marginal utilities (RHS). In fact, when we say "under some assumptions
on F", one of the assumptions is that F_y evaluated at (x*, y*) is not zero.
The mnemonic for getting the derivative: From F(x, y) = c, we totally
differentiate to get F_x dx + F_y dy = 0, and rearrange to get dy/dx = -F_x/F_y.
(2). Comparative Statics.
We then move to the vector case by analogy. Suppose

F(x, y) = c

where x is an n-vector, y an m-vector, c a given m-vector. Let (x*, y*)
solve F(x*, y*) = c. Think of x as being exogenous, so this is a set of m
π_1 = P(q_1 + q_2)q_1 - c_1 q_1
and
π_2 = P(q_1 + q_2)q_2 - c_2 q_2
If profits are concave in own output, then the first-order conditions below
characterize the Cournot-Nash equilibrium (q_1*, q_2*).
∂π_1/∂q_1 = P'(q_1 + q_2)q_1 + P(q_1 + q_2) - c_1 = 0
DF_q(.) = [ ∂F_1/∂q_1  ∂F_1/∂q_2 ]
          [ ∂F_2/∂q_1  ∂F_2/∂q_2 ]

For brevity, let P' and P'' be the derivative and second derivative of P(.)
evaluated at the equilibrium. Then

DF_q(.) = [ P''q_1 + 2P'   P''q_1 + P'  ]
          [ P''q_2 + P'    P''q_2 + 2P' ]

The determinant of this matrix works out to be
(P')^2 + P'(P''(q_1 + q_2) + 2P') > 0, since P' < 0 and the concavity-in-own-output
condition is assumed to be met. So the implicit function theorem can
be applied. Notice also that

DF_c(.) = [ -1   0 ]
          [  0  -1 ]
Thus we can work out Df (c), the changes in equilibrium outputs as a
result of changes in unit costs. It would be a good exercise for you to work
these out, and sign these.
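For a concrete sign check, take linear demand P(Q) = a - bQ (my own specialization, in which P'' = 0); the equilibrium is then q_i* = (a - 2c_i + c_j)/(3b), and the comparative statics have the expected signs:

```python
# Cournot with linear demand P(Q) = a - b*Q (illustrative assumption).
a, b = 10.0, 1.0

def eq_outputs(c1, c2):
    # closed-form Cournot-Nash outputs under linear demand
    return (a - 2 * c1 + c2) / (3 * b), (a - 2 * c2 + c1) / (3 * b)

q1, q2 = eq_outputs(1.0, 1.0)
Q = q1 + q2
# verify the first-order condition P'(Q)*q1 + P(Q) - c1 = 0
assert abs(-b * q1 + (a - b * Q) - 1.0) < 1e-12

# raising firm 1's unit cost lowers q1* and raises q2*
q1h, q2h = eq_outputs(1.1, 1.0)
assert q1h < q1 and q2h > q2
```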
Proof of the Theorem of Lagrange
Before the formal proof, note that we'll use the tangency of the contour
sets of the objective and the constraint approach, which in other words uses
the implicit function theorem. For example, consider maximizing F(x_1, x_2)
s.t. G(x_1, x_2) = 0. If G_1 ≠ 0 (this is the constraint qualification in this
case), we have at a tangency point of contour sets, G_1 f'(x_2) + G_2 = 0 (where
x_1 = f(x_2) is the implicit function that keeps the points (x_1, x_2) on the
constraint); so f'(x_2) = -G_2/G_1.
On the other hand, if we vary x_2 and adjust x_1 to stay on the constraint,
the function value F(x_1, x_2) = F(f(x_2), x_2) does not increase; therefore locally around the optimum, F_1 f'(x_2) + F_2 = 0. Substituting, F_1(-G_2/G_1) +
F_2 = 0. If we now put

λ = -F_1/G_1,

we have both F_1 + λG_1 = 0 by definition, and F_2 + λG_2 = 0, the two
FONC.
The Proof:
Without loss of generality, let the leading k × k submatrix of Dg(x*) be
nonsingular (i.e., its first k columns are linearly independent). We write x = (w, z) with w being the first
k coordinates of x and z being the last (n - k) coordinates. So showing the
existence of λ (a 1 × k vector) that solves

Df(x*) + λDg(x*) = 0

is the same as showing that the 2 equations below hold for this λ; the
equations are of dimension 1 × k and 1 × (n - k) respectively:

Df_w(w*, z*) + λDg_w(w*, z*) = 0        (*)
Df_z(w*, z*) + λDg_z(w*, z*) = 0        (**)

Since Dg_w(w*, z*) is square and of full rank, Eq.(*) yields

λ = -Df_w(w*, z*)[Dg_w(w*, z*)]^{-1}        (***)
Indeed, using the Chain Rule on V(a) ≡ f(x*(a), a), we have V'(a) =
Df_x(x*, a)Dx*(a) + ∂f(x*, a)/∂a. But because x* is an interior Max, Df_x(x*, a) =
0_{1×n}. So, V'(a) = ∂f/∂a.
Now suppose we want to maximize an objective function f(x), which does
not depend on a, but subject to a constraint g(x, a) = a - G(x) = 0 that
does depend on a. Under nice conditions, at the Max,

Df(x*) + λDg(x*, a) = 0        (i)

Also note that if a changes, the value of g(x*(a), a) must continue to be
zero, so

Dg(x*, a)Dx*(a) + ∂g/∂a = 0        (ii)

Now, V(a) ≡ f(x*(a)), so V'(a) = Df(x*)Dx*(a). Using (i) to substitute for Df(x*), this equals -λDg(x*, a)Dx*(a), which equals, using (ii),
λ ∂g/∂a = λ. So here, V'(a) = λ: the value of the multiplier at the optimum
is the rate of change of the objective with respect to the parameter a being
relaxed.
Suppose now that we have an objective function f(x, a) to maximize
subject to g(x, a) = 0. Along similar lines, we can show that V'(a) = ∂f/∂a +
λ ∂g/∂a, i.e. the direct effect of a on the Lagrangian function. As an exercise,
please derive Roy's Identity using the indirect utility function V(p, I).
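The claim V'(a) = λ can be checked numerically; the log-utility budget problem below is an illustrative choice of mine, not from the notes. Here max ln x_1 + ln x_2 s.t. a - p_1 x_1 - p_2 x_2 = 0 has solution x_i* = a/(2p_i) and multiplier λ* = 2/a:

```python
import math

p1, p2, A = 2.0, 3.0, 12.0

def V(a):
    # value function: utility at the optimum x_i*(a) = a/(2*p_i)
    return math.log(a / (2 * p1)) + math.log(a / (2 * p2))

lam = 2.0 / A                 # multiplier at a = A

# central-difference estimate of V'(A)
eps = 1e-6
V_prime = (V(A + eps) - V(A - eps)) / (2 * eps)
assert abs(V_prime - lam) < 1e-9
```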
Chapter 5
Optimization with Inequality
Constraints
5.1
Introduction
We use Kuhn-Tucker theory to address optimization problems with inequality constraints. The main result is a first order necessary condition
that is somewhat different from that of the Theorem of Lagrange; one main
difference is that the conditions g_i(x) = 0, i = 1, . . . , k in the Theorem of Lagrange are replaced by the conditions λ_i g_i(x) = 0, i = 1, . . . , k in
Kuhn-Tucker theory.
In order to motivate this difference, let us discuss a simple setting. Consider an objective function f : R^2 → R. We want to maximize f(x) or
f(x_1, x_2) over all x ∈ R^2 that satisfy G(x) ≤ a, where G : R^2 → R. We will
alternatively write g(x) = a - G(x) ≥ 0. For this example, let us assume
that G(x) is strictly increasing. We can view a as the total resource available,
such as the total income available for spending on goods. Draw a picture.
A maximum x* can occur either in the interior (i.e. G(x*) < a or g(x*) >
0), or at the boundary (G(x*) = a or g(x*) = 0). If it happens in the
interior, it implies Df(x*) = 0. If it happens on the boundary, it must
be that reducing the parameter value a does not increase f(x*); for whatever
vector x you choose as maximizer after the reduction of a was available before,
at the higher value of a, and was not chosen as the maximizer. Consider then
setting up the Lagrangian
5.2
Kuhn-Tucker Theory
Dĝ(x) =
[ Dg_{i_1}(x) ]
[     ...     ]
[ Dg_{i_l}(x) ]

where i_1, . . . , i_l are the indexes of the binding
constraints. So Dĝ(x) is an l × n matrix.
We now state FONC for the problem. The Theorem below is a consolidation of the Fritz John and the Kuhn-Tucker Theorems.
Theorem 17 (The Kuhn-Tucker (KT) Theorem). Let f : R^n → R, and
g_i : R^n → R, i = 1, . . . , k be C^1 functions. Suppose x* is a Maximum of f
on the set S = U ∩ {x ∈ R^n | g_i(x) ≥ 0, i = 1, . . . , k}, where U ⊂ R^n is open.
Then there exist real numbers μ, λ_1, . . . , λ_k, not all zero, such that

μDf(x*) + Σ_{i=1}^k λ_i Dg_i(x*) = 0_{1×n}.

Moreover, if g_i(x*) > 0 for some i, then λ_i = 0.
If, in addition, Rank Dĝ(x*) = l, then we may take μ to be equal to 1.
Furthermore, λ_i ≥ 0, i = 1, . . . , k, and λ_i > 0 for some i implies g_i(x*) = 0.
Suppose the constraint qualification, Rank Dĝ(x*) = l, is met at the
optimum. Then the KT equations are the following (n + k) equations in the
n + k variables x_1, . . . , x_n, λ_1, . . . , λ_k:

λ_i g_i(x*) = 0, i = 1, . . . , k, λ_i ≥ 0, g_i(x*) ≥ 0 with complementary
slackness.        (1)
Df(x*) + Σ_{i=1}^k λ_i Dg_i(x*) = 0        (2)

If x* is a local minimum of f on S, then -f attains a local maximum
value at x*. Thus for minimization, while Eq.(1) stays the same, Eq.(2) changes
to

-Df(x*) + Σ_{i=1}^k λ_i Dg_i(x*) = 0        (2')

Equations (1) and (2) (or (2')) are known as the Kuhn-Tucker conditions.
Note finally that the conditions of the Kuhn-Tucker Theorem are
not sufficient conditions for local optima; there may be points that satisfy
Equations (1) and (2) (or (2')) without being local optima. For example, you
may check that for the problem
Max f(x) = x^3 s.t. g(x) = x ≥ 0, the values x* = λ* = 0 satisfy the KT
FONC (1) and (2) for a local maximum but do not yield a maximum.
5.3
Suppose the constraint does not bind at the maximum; then we don't have
to check a CQ. But suppose it does. That is, suppose the optimum occurs
at x = 3. Dg(x) = -3(3 - x)^2 = 0 at x = 3. The CQ fails here. You could
check that the KT FONC will not isolate the maximum. In fact, in this baby
example, it is easy to see that x = 3 is the max, as (3 - x)^3 ≥ 0 iff (3 - x) ≥ 0,
so we may work with the latter constraint function, with which the CQ does not
fail. It is a good exercise to visualize f(x) and see that x = 3 is the maximum,
rather than merely cranking out the algebra now.
Alternatively, we may use the more general FONCs stated in the theorem:
λ Df(x) + μ Dg(x) = 0, with λ, μ not both zero, i.e.
λ(6x^2 − 6x) + μ(−3(3 − x)^2) = 0,   (1)
and μ ≥ 0, (3 − x)^3 ≥ 0, with strict inequality implying μ = 0.   (2)
If (3 − x)^3 > 0, then μ = 0, which from Eq. (1) implies either λ = 0, which violates the FONC, or 6x^2 − 6x = 0, i.e. x = 0 or x = 1. At x = 0, f(x) = 0; at x = 1, f(x) = −1.
On the other hand, if (3 − x)^3 = 0, that is x = 3, then Eq. (1) implies λ = 0, so it must be that μ > 0. At x = 3, f(x) = 27, so x = 3 is the maximum.
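The derivative 6x^2 − 6x and the values f(3) = 27, f(0) = 0 pin down the objective as f(x) = 2x^3 − 3x^2; assuming that reconstruction, a quick numerical sanity check confirms that among the candidate points produced by the FONCs, x = 3 is indeed the constrained maximum:

```python
# Check of the CQ-failure example: maximize f(x) = 2x^3 - 3x^2
# subject to g(x) = (3 - x)^3 >= 0, i.e. x <= 3.
def f(x):
    return 2 * x**3 - 3 * x**2

# Candidate points from the FONCs: x = 0 and x = 1 (where f'(x) = 0)
# and x = 3 (where the constraint binds).
candidates = [0.0, 1.0, 3.0]
values = {x: f(x) for x in candidates}

# Coarse grid search over part of the feasible set as a sanity check.
grid_max = max(f(-10 + i / 1000) for i in range(13001))
```

Comparing `values` shows f(0) = 0, f(1) = -1 and f(3) = 27, matching the algebra above.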
Two Simple Utility Maximization Problems
Example 1. This is a real baby example meant purely for illustration. No one expects you to use the heavy Kuhn-Tucker machinery for such simple problems; one expects instead that you would reason directly about the marginal utility per rupee ratios (U1/p1) and (U2/p2).
Max U(x1, x2) = x1 + x2 over the set {x = (x1, x2) ∈ R^2 | x1 ≥ 0, x2 ≥ 0, I − p1 x1 − p2 x2 ≥ 0}, where I > 0, p1 > 0 and p2 > 0 are given.
So there are 3 inequality constraints:
g1(x1, x2) = x1 ≥ 0, g2(x1, x2) = x2 ≥ 0, and
g3(x1, x2) = I − p1 x1 − p2 x2 ≥ 0.
At the maximum x*, any combination of these three could bind, so there are 8 possibilities. However, since U is strictly increasing, the budget constraint binds at the maximum (g3(x*) = 0). Moreover, g1(x*) = g2(x*) = 0 is not possible, since consuming 0 of both goods gives utility equal to 0, which is clearly not a maximum.
So we have to check just three possibilities out of the eight.
Case(1) g1 (x ) > 0, g2 (x ) > 0, g3 (x ) = 0
Case(2) g1 (x ) = 0, g2 (x ) > 0, g3 (x ) = 0
Case(3) g1 (x ) > 0, g2 (x ) = 0, g3 (x ) = 0
Before using the KT conditions, we verify that (i) a global max exists (here, because the utility function is continuous and the budget set is compact), and that (ii) the CQ holds at all 3 relevant combinations of binding constraints described above.
Indeed, for Case (1), Dg(x) = Dg3(x) = (−p1, −p2), so Rank[Dg(x)] = 1, and the CQ holds.
For Case (2), Dg(x) = [Dg1(x); Dg3(x)] = [(1, 0); (−p1, −p2)], so Rank[Dg(x)] = 2.
For Case (3), Dg(x) = [Dg2(x); Dg3(x)] = [(0, 1); (−p1, −p2)], so Rank[Dg(x)] = 2.
Thus for the maximum x*, there exists a λ such that (x*, λ) is a solution to the KT FONCs. Of course, there could be other (x, λ)'s that are solutions as well, but a simple comparison of U(x) across all candidate solutions will isolate the maximum for us.
L(x, λ) = x1 + x2 + λ1 x1 + λ2 x2 + λ3(I − p1 x1 − p2 x2)
The KT conditions are
λ1(∂L/∂λ1) = λ1 x1 = 0, λ1 ≥ 0, x1 ≥ 0, with CS   (1)
λ2(∂L/∂λ2) = λ2 x2 = 0, λ2 ≥ 0, x2 ≥ 0, with CS   (2)
λ3(∂L/∂λ3) = λ3(I − p1 x1 − p2 x2) = 0, λ3 ≥ 0, I − p1 x1 − p2 x2 ≥ 0, with CS   (3)
(∂L/∂x1) = 1 + λ1 − λ3 p1 = 0   (4)
(∂L/∂x2) = 1 + λ2 − λ3 p2 = 0   (5)
Since we don't know which of the three cases selects the constraints that bind at the maximum, we must try all three.
Case (1). Since x1 > 0 and x2 > 0, (1) and (2) imply λ1 = λ2 = 0. Plugging these into Eqs. (4) and (5), we have 1 = λ3 p1 = λ3 p2. This implies λ3 > 0. (Also note that this is consistent with the fact that, since utility is strictly increasing, relaxing the budget constraint will increase utility; so the marginal utility of income, λ3, is positive.) Thus λ3 p1 = λ3 p2 implies p1 = p2.
So if at a local max both x1 and x2 are strictly positive, then it must be that their prices are equal. All (x1, x2) that solve Eq. (3) are solutions. The utility in any such case equals
x1 + (I − p1 x1)/p2 = I/p, where p = p1 = p2. Note that in this case, (U1/p1) = (U2/p2) = 1/p.
Case (2). x1 = 0 implies, from Eq. (3), that x2 = I/p2. Since this is greater than 0, Eq. (2) implies λ2 = 0. Hence from Eq. (5), λ3 p2 = 1.
Since λ1 ≥ 0, Eqs. (4) and (5) imply λ3 p1 = 1 + λ1 ≥ 1 = λ3 p2. Moreover, since λ3 > 0, this implies p1 ≥ p2.
That is, if at the maximum x1 = 0 and x2 > 0, then it must be that p1 ≥ p2. Note that in this case, (U2/p2) = (1/p2) ≥ (U1/p1) = (1/p1).
For completeness' sake, Eq. (5) implies λ3 = 1/p2. So from Eq. (4), λ1 = (p1/p2) − 1. So the unique critical point of L(x, λ) is
(x*, λ*) = (x1, x2, λ1, λ2, λ3) = (0, I/p2, (p1/p2) − 1, 0, 1/p2).
Case (3). This case is similar, and we get that x2 = 0, x1 > 0 occurs only if p1 ≤ p2. We have
(x*, λ*) = (I/p1, 0, 0, (p2/p1) − 1, 1/p1).
We see that which of the cases applies depends upon the price ratio p1/p2. If p1 = p2, then all three cases are relevant, and all (x1, x2) ∈ R^2_+ such that the budget constraint binds are utility maxima. But if p1 > p2, then only Case (2) applies, because if Case (1) had applied we would have had p1 = p2, and if Case (3) had applied, that would have implied p1 ≤ p2. The solution to the KT conditions in that case is the utility maximum. Similarly, if p1 < p2, only Case (3) applies.
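The case analysis above says that with linear utility the maximum sits at a corner of the budget line whenever p1 ≠ p2; a minimal sketch checking this for one hypothetical set of prices (p1 = 5 > p2 = 4, so Case (2) should win):

```python
# Example 1 (U = x1 + x2, linear utility): with p1 != p2 only the two
# corner bundles of the budget line can be optimal; with p1 > p2 the
# consumer spends everything on good 2 (Case 2).
def best_bundle(I, p1, p2):
    # Corner candidates: spend all income on good 1, or all on good 2.
    # (When p1 = p2, every budget-exhausting bundle ties with these.)
    corners = [(I / p1, 0.0), (0.0, I / p2)]
    return max(corners, key=lambda b: b[0] + b[1])

bundle = best_bundle(I=100, p1=5, p2=4)   # p1 > p2 -> only good 2
```

With these numbers the chosen bundle is (0, 25), i.e. x2 = I/p2, exactly the Case (2) critical point.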
Example 2. Max U(x1, x2) = x1/(1 + x1) + x2/(1 + x2), s.t. x1 ≥ 0, x2 ≥ 0, p1 x1 + p2 x2 ≤ I.
Check that the indifference curves are downward sloping, convex, and that they cut the axes (show all this). This last property is due to the additive form of the utility function, and may result in 0 consumption of one of the goods at the utility maximum.
Exactly as in Example 1, we are assured that a global max exists, that
the CQ is met at the optimum, and that there are only 3 relevant cases of
binding constraints to check.
The Kuhn-Tucker conditions are:
λ1(∂L/∂λ1) = λ1 x1 = 0, λ1 ≥ 0, x1 ≥ 0, with CS   (1)
λ2(∂L/∂λ2) = λ2 x2 = 0, λ2 ≥ 0, x2 ≥ 0, with CS   (2)
λ3(∂L/∂λ3) = λ3(I − p1 x1 − p2 x2) = 0, λ3 ≥ 0, I − p1 x1 − p2 x2 ≥ 0, with CS   (3)
(∂L/∂x1) = 1/(1 + x1)^2 + λ1 − λ3 p1 = 0   (4)
(∂L/∂x2) = 1/(1 + x2)^2 + λ2 − λ3 p2 = 0   (5)
Case (1). x1 > 0, x2 > 0 implies λ1 = λ2 = 0. Eq. (4) implies λ3 > 0, so that Eqs. (4) and (5) give (1 + x2)/(1 + x1) = (p1/p2)^{1/2}.
Using Eq. (3), which gives x2 = (I − p1 x1)/p2, in the above, we get
(p2 + I − p1 x1)/(p2(1 + x1)) = (p1/p2)^{1/2}, so simple computations yield
x1 = (I + p2 − (p1 p2)^{1/2})/(p1 + (p1 p2)^{1/2}),
x2 = (I + p1 − (p1 p2)^{1/2})/(p2 + (p1 p2)^{1/2}),
λ3 = 1/(p1(1 + x1)^2).
x1 > 0 and x2 > 0 require I > (p1 p2)^{1/2} − p2 and I > (p1 p2)^{1/2} − p1. If either of these fails, then we are not in the regime of Case (1).
Case (2). x1 = 0 with Eq. (3) implies x2 = I/p2. Since this is positive, λ2 = 0, so Eq. (5) implies λ3 = 1/((1 + I/p2)^2 p2) = p2/(p2 + I)^2.
λ1 = λ3 p1 − 1 (from x1 = 0 and Eq. (4)), so λ1 = p1 p2/(p2 + I)^2 − 1. For this to be ≥ 0, it is required that p1 p2/(p2 + I)^2 ≥ 1, that is, I ≤ (p1 p2)^{1/2} − p2.
Utility equals x2/(1 + x2) = I/(p2 + I).
(x1, x2, λ1, λ2, λ3) = (0, I/p2, p1 p2/(p2 + I)^2 − 1, 0, p2/(p2 + I)^2).
Case (3). By symmetry, the solution is
(x1, x2, λ1, λ2, λ3) = (I/p1, 0, 0, p1 p2/(p1 + I)^2 − 1, p1/(p1 + I)^2),
and for this case to hold it is necessary that p1 p2/(p1 + I)^2 ≥ 1, or I ≤ (p1 p2)^{1/2} − p1.
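The Case (1) closed forms can be sanity-checked numerically; a sketch with hypothetical values I = 10, p1 = 1, p2 = 4 (which satisfy both interior-regime inequalities), comparing against a brute-force search along the budget line:

```python
# Interior (Case 1) closed forms for Example 2, U = x1/(1+x1) + x2/(1+x2),
# checked against a grid search over the (binding) budget line.
import math

def case1_solution(I, p1, p2):
    s = math.sqrt(p1 * p2)
    x1 = (I + p2 - s) / (p1 + s)
    x2 = (I + p1 - s) / (p2 + s)
    return x1, x2

def U(x1, x2):
    return x1 / (1 + x1) + x2 / (1 + x2)

I, p1, p2 = 10.0, 1.0, 4.0
x1, x2 = case1_solution(I, p1, p2)       # (4.0, 1.5) for these values

# Brute force along p1*x1 + p2*x2 = I.
best = max(((t, (I - p1 * t) / p2)
            for t in [i * (I / p1) / 10000 for i in range(10001)]),
           key=lambda b: U(*b))
```

Note that at (4, 1.5) the equalized ratios hold: U1/p1 = 1/25 = U2/p2 = (1/6.25)/4.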
5.4
Miscellaneous
(1) For problems where some constraints are of the form gi(x) = 0 and others of the form gj(x) ≥ 0, only the latter give rise to Kuhn-Tucker-like complementary slackness conditions (λj ≥ 0, gj(x) ≥ 0, λj gj(x) = 0).
(2) If the objective to be maximized, f, and the constraints gi, i = 1, . . . , k (where constraints are of the form gi(x) ≥ 0) are all concave functions, and if Slater's constraint qualification holds (i.e., there exists some x ∈ R^n s.t. gi(x) > 0, i = 1, . . . , k), then the Kuhn-Tucker conditions become both necessary and sufficient for a global max.
(3) Suppose f and all the gi's are quasiconcave. Then the Kuhn-Tucker conditions are almost sufficient for a global max: an x* and λ that satisfy the Kuhn-Tucker conditions indicate that x* is a global max provided that, in addition to the above, either Df(x*) ≠ 0 or f is concave.
Appendix
Completeness Property of Real Numbers
Rohini Somanathan
Administrative Information
Internal Assessment: 25% for Part 1
1. Midterm: 20%
2. Lab assignments, Tutorial attendance and class participation: 5%
Problem Sets: Do as many problems from the book as you can. All odd-numbered exercises have solutions, so focus on these.
Tutorials: Check the notice board in front of the lecture theatre for lists.
Punctuality is critical: coming in late disturbs the rest of the class and me.
[Figure: probabilities from the binomial(10, 0.5) distribution, P(k) = C(10, k)/2^10: P(0) ≈ .001, P(1) ≈ .010, P(2) ≈ .044, P(3) ≈ .117, P(4) ≈ .205, P(5) ≈ .246, . . . (Stata: display binomial(10, k, .5))]
When should we conclude that there is gender bias? Can we get an estimate of this bias?
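The probabilities behind the figure can be recomputed directly from the binomial formula (a sketch; `binom_pmf` is an illustrative helper, not from the notes):

```python
# P(k heads in 10 fair tosses) = C(10, k) * p^k * (1-p)^(10-k) with p = 1/2.
from math import comb

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

probs = {k: binom_pmf(10, k, 0.5) for k in range(11)}
# probs[2] is about .044 and probs[5] about .246, matching the figure.
```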
Definitions
An experiment is any process whose outcome is not known in advance with certainty. These
outcomes may be random or non-random, but we should be able to specify all of them and
attach probabilities to them.
Experiment                    Event
10 coin tosses                4 heads
select 10 LS MPs              one is female
go to your bus-stop at 8
Each of the eight outcomes s1, . . . , s8 of tossing three coins has probability 1/8.
Let us define the event A as "at least one head". Then A = {s1, . . . , s7} and Ac = {s8}. A and Ac are exhaustive events.
The events "exactly one head" and "exactly two heads" are mutually exclusive events.
Notice that there are lots of different ways in which we can define a sample space, and the most useful choice depends on the event we are interested in (# heads; or, when picking from a deck of cards, we may be interested in the suit, the number, or both).
P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) for any sequence of disjoint events A1, A2, . . .
Note:
We will typically use P(A) or Pr(A) for the probability of the event A.
For finite sample spaces, the class of events S is straightforward to define. For any S which is a subset of the real line (and therefore infinite), let S be the set of all intervals in S.
Result 3: If A1 and A2 are subsets of S such that A1 ⊆ A2, then P(A1) ≤ P(A2).
Proof: Let's write A2 as A2 = A1 ∪ (A1c ∩ A2). Since these are disjoint, we can use property 3 to get P(A2) = P(A1) + P(A1c ∩ A2). The second term on the RHS is non-negative (by axiom 1), so P(A2) ≥ P(A1).
Result 4: For each A ⊆ S, 0 ≤ P(A) ≤ 1.
Proof: Since ∅ ⊆ A ⊆ S, we can directly apply the previous result to obtain
P(∅) ≤ P(A) ≤ P(S), or 0 ≤ P(A) ≤ 1.
[Exercise: compute the probabilities of events such as (a) 1/2 < x + y < 3/2, (c) y < 1 − x^2, (d) x = y.
answers: (1) 1/2, 1/6, 3/8 (2) .1, .4 (3) 1 − π/4, 3/4, 2/3, 0]
The probability of any event A can now be found as the sum of pi for all outcomes si that
belong to A.
A sample space containing n outcomes is called a simple sample space if the probability assigned to each of the outcomes s1, . . . , sn is 1/n. Probability measures are easy to define in such spaces. If the event A contains exactly m outcomes, then P(A) = m/n.
Notice that for the same experiment, we can define the sample space in multiple ways depending on the events of interest. For example, suppose we're interested in obtaining a given number of heads in the tossing of 3 coins; our sample space can either comprise all 8 possible outcomes (a simple space) or just four outcomes (0, 1, 2 and 3 heads).
We can arrive at the total number of elements in a sample space by listing all possible outcomes. A simple sample space for a coin-tossing experiment with 3 fair coins has eight possible outcomes, a roll of two dice has 36, etc. We then just count the number of elements contained in our event A and divide this by the total number of outcomes to get our probability (P(2 heads) = 3/8 and P(sum of 7) = 1/6).
Listing outcomes can take a long time, and we can use a number of counting methods to make things easier and avoid mistakes.
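The two probabilities just quoted can be verified by enumerating the simple sample spaces directly:

```python
# Counting in simple sample spaces: P(A) = m/n.
from itertools import product

coins = list(product("HT", repeat=3))          # 8 equally likely outcomes
p_two_heads = sum(s.count("H") == 2 for s in coins) / len(coins)   # 3/8

dice = list(product(range(1, 7), repeat=2))    # 36 equally likely outcomes
p_sum_seven = sum(a + b == 7 for a, b in dice) / len(dice)         # 1/6
```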
Permutations
Suppose we are sampling k objects from a total of n distinct objects without replacement.
We are interested in the total number of different arrangements of these objects we can
obtain.
We first pick one object; this can happen in n different ways. Since we are now left with n − 1 objects, the second one can be picked in (n − 1) different ways, and so on.
The total number of permutations of n objects taken k at a time is given by
Pn,k = n(n − 1) · · · (n − k + 1),
and Pn,n = n!.
Pn,k can alternatively be written as:
Pn,k = n(n − 1) · · · (n − k + 1) = n(n − 1) · · · (n − k + 1) · (n − k)!/(n − k)! = n!/(n − k)!
In the case with replacement, we can apply the multiplication rule derived above. In this case there are n outcomes possible for each of the k selections, so the number of elements in S is n^k.
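The two counting formulas can be sketched as follows (`perm` is an illustrative helper name):

```python
# Pn,k = n!/(n-k)! counts ordered samples without replacement;
# n^k counts ordered samples with replacement.
from math import factorial

def perm(n, k):
    return factorial(n) // factorial(n - k)

p_5_2 = perm(5, 2)          # 5 * 4 = 20
p_5_5 = perm(5, 5)          # 5! = 120
with_replacement = 5 ** 2   # n^k for n = 5, k = 2
```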
It turns out that for k = 23 this number is .507, so you should take the bet (if you are not
risk-averse)
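The .507 figure for k = 23 matches the classic birthday problem (the probability that at least two of k people share a birthday, with 365 equally likely days); assuming that is the setting referenced here, a quick computation:

```python
# P(at least one shared birthday among k people)
# = 1 - (365/365)(364/365)...((365-k+1)/365).
def p_shared_birthday(k):
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct

p23 = p_shared_birthday(23)   # about .507
```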
The number of distinct ways of dividing n objects into k groups of sizes n1, . . . , nk (with n1 + · · · + nk = n) is the multinomial coefficient
n!/(n1! n2! . . . nk!)
Examples:
A student organization of 1000 people is picking 4 office-bearers and 8 members for its managing council. The total number of ways of picking this group is given by 1000!/(4! 8! 988!).
105 students have to be organized into 4 tutorial groups, 3 with 25 students each and
one with the remaining 30 students. How many ways can students be assigned to
groups?
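Both examples are direct multinomial-coefficient computations (`multinomial` is an illustrative helper):

```python
# n!/(n1! n2! ... nk!) for a partition of n objects into groups.
from math import factorial

def multinomial(n, groups):
    assert sum(groups) == n
    out = factorial(n)
    for g in groups:
        out //= factorial(g)
    return out

council = multinomial(1000, [4, 8, 988])        # 1000!/(4! 8! 988!)
tutorials = multinomial(105, [25, 25, 25, 30])  # the 105-student question
```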
P(∪_{i=1}^n Ai) = Σ_{i=1}^n P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ A2 ∩ · · · ∩ An)
Independent Events
Definition: Let A and B be two events in a sample space S. Then A and B are independent iff P(A ∩ B) = P(A)P(B). If A and B are not independent, A and B are said to be dependent.
Events may be independent because they are physically unrelated -tossing a coin and rolling
a die, two different people falling sick with some non-infectious disease, etc.
This need not be the case however, it may just be that one event provides no relevant
information on the likelihood of occurrence of the other.
Example:
The event A is getting an even number on a roll of a die.
The event B is getting one of the first four numbers.
The intersection of these two events is the event of rolling the number 2 or 4, which we know has probability 1/3.
Are A and B independent? Yes, because P(A)P(B) = (1/2)(2/3) = 1/3.
This is because the occurrence of A does not affect the likelihood that B will occur, or vice-versa. Why?
If A and B are independent, then A and Bc are also independent, as are Ac and Bc. (We require P(A ∩ Bc) = P(A)P(Bc). But A = (A ∩ B) ∪ (A ∩ Bc), so with A and B independent, P(A ∩ Bc) = P(A) − P(A)P(B) = P(A)[1 − P(B)] = P(A)P(Bc). Starting now with the independent events Bc and A, the same argument shows that Ac and Bc are independent.)
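The die example can be verified exactly by enumeration over the simple sample space {1, . . . , 6}:

```python
# A = even roll, B = one of the first four numbers; check P(A n B) = P(A)P(B).
S = set(range(1, 7))
A = {2, 4, 6}
B = {1, 2, 3, 4}

def P(E):
    return len(E & S) / len(S)

lhs = P(A & B)       # P({2, 4}) = 1/3
rhs = P(A) * P(B)    # (1/2)(2/3) = 1/3
```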
3. Suppose A and B are disjoint sets in S. Does this tell us anything about the independence of events A and B?
4. Remember that disjointness is a property of sets, whereas independence is a property of the associated probability measure; whether events are dependent will depend on the probability measure being used.
In this case, P(A1 ∩ A2 ∩ A3) = P({(3, 6)}) = 1/36 = (1/2)(1/2)(1/9) = P(A1)P(A2)P(A3), but
P(A1 ∩ A3) = P({(3, 6)}) = 1/36 ≠ P(A1)P(A3) = 1/18, so the events are not independent, nor even pairwise independent.
Conditional probability
When we conduct an experiment, we are absolutely sure that the event S will occur.
Suppose now we have some additional information about the outcome, say that it is an
element of B S.
What effect does this have on the probabilities of events in S? How exactly can we use such
additional information to compute conditional probabilities?
Example: The experiment involves tossing two fair coins in succession. What is the
probability of two tails? Suppose you know the first one is a head? What if it is a tail?
We denote the conditional probability of event A, given B by P(A|B)
B is now the conditional sample space and since B is certain to occur, P(B|B) = 1
Event A will now occur iff A ∩ B occurs.
Definition: Let A and B be two events in a sample space S. If P(B) ≠ 0, then the conditional probability of event A given event B is given by
P(A|B) = P(A ∩ B)/P(B)
Notice that P(.|B) is now a probability set function (probability measure) defined for subsets of B.
For independent events A and B, the conditional and unconditional probabilities are equal:
P(A|B) = P(A)P(B)/P(B) = P(A)
Starting from P(A|B) = P(A ∩ B)/P(B) and multiplying both sides by P(B), we have the multiplication rule for probabilities:
P(A ∩ B) = P(A|B)P(B)
This is especially useful in cases where an experiment can be interpreted as being
conducted in two stages. In such cases, P(A|B) and P(B) can often be very easily assigned.
Examples:
Two cards are drawn successively, without replacement from an ordinary deck of
playing cards. What is the probability of drawing two aces?
Here the event B is that the first card drawn is an ace and the event A is that the second card is an ace. P(B) is clearly 4/52 = 1/13 and P(A|B) = 3/51 = 1/17. The required probability P(A ∩ B) is therefore (1/13)(1/17) = 1/221.
There are two types of candidates, competent and incompetent (C and I). The share of I-type candidates seeking admission is 0.3. All candidates are interviewed by a committee, and the committee rejects incompetent candidates with probability 0.9. What is the probability that a candidate is both incompetent and admitted?
Here we're interested in P(A ∩ I), where P(I) = .3 and P(A|I) = .1, so the required probability is .03.
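Both worked examples are one-line applications of the multiplication rule:

```python
# P(A n B) = P(A|B) P(B): two aces from a deck, and the admissions example.
from fractions import Fraction

p_two_aces = Fraction(4, 52) * Fraction(3, 51)    # = 1/221
p_incompetent_admitted = 0.3 * (1 - 0.9)          # = .03
```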
If P(Ai) > 0 for all i, then using the multiplication rule derived above, this can be written as:
P(B) = Σ_{i=1}^k P(Ai)P(B|Ai)
P(Y = 50) = Σ_{x=1}^{50} (1/(51 − x)) · (1/50) = (1/50)(1 + 1/2 + 1/3 + · · · + 1/50) ≈ .09
Bayes Theorem
Bayes' Theorem (or Bayes' Rule): Let the events A1, A2, . . . , Ak form a partition of S such that P(Aj) > 0 for all j = 1, 2, . . . , k, and let B be any event such that P(B) > 0. Then for i = 1, . . . , k,
P(Ai|B) = P(B|Ai)P(Ai) / Σ_{j=1}^k P(Aj)P(B|Aj)
Proof: By the definition of conditional probability,
P(Ai|B) = P(Ai ∩ B)/P(B)
The denominators in these expressions are the same by the law of total probability and the
numerators are the same using the multiplication rule.
In the case where the partition of S consists of only two events,
P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ac)P(Ac)]
Bayes Rule...remarks
Bayes' rule provides us with a method of updating the probabilities of events in the partition based on the new information provided by the occurrence of the event B.
Since P(Aj ) is the probability of event Aj prior to the occurrence of event B, it is referred
to as the prior probability of event Aj .
P(Aj |B) is the updated probability of the same event after the occurrence of B and is called
the posterior probability of event Aj .
Bayes rule is very commonly used in game-theoretic models. For example, in political
economy models a Bayes-Nash equilibrium is a standard equilibrium concept: Players (say
voters) start with beliefs about politicians and update these beliefs when politicians take
actions. Beliefs are constrained to be updated based on Bayes conditional probability
formula.
In Bayesian estimation, prior distributions on population parameters are updated given
information contained in a sample. This is in contrast to more standard procedures where
only the sample information is used. The sample would now lead to different estimates,
depending on the prior distribution of the parameter that is used.
A word about Bayes: He was a non-conformist clergyman (1702-1761), with no formal
mathematics degree. He studied logic and theology at the University of Edinburgh.
P(Disease|Positive) = P(Positive|Disease)P(Disease)/P(Positive) = (.98)(.001) / [(.98)(.001) + (.01)(.999)] ≈ .089
So in spite of the test being very effective in catching the disease, we have a large number of false positives.
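The screening computation is the two-event form of Bayes' rule with sensitivity .98, false-positive rate .01 and prevalence .001:

```python
# Two-event Bayes' rule: P(D|+) = P(+|D)P(D) / [P(+|D)P(D) + P(+|Dc)P(Dc)].
def posterior(p_pos_given_d, p_d, p_pos_given_not_d):
    num = p_pos_given_d * p_d
    return num / (num + p_pos_given_not_d * (1 - p_d))

p_disease_given_positive = posterior(0.98, 0.001, 0.01)   # about .089
```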
Suppose P(bad roads|honest) = 1/3, P(bad roads|dishonest) = 2/3, and the prior is P(honest) = 3/4. Then
P(honest|bad roads) = P(bad roads|honest)P(honest)/P(bad roads) = (1/3)(3/4) / [(1/3)(3/4) + (2/3)(1/4)] = 3/5
What would the posterior be if the prior is equal to 1? What if the prior is zero? What if the probability of bad roads was equal to 1/2 for both types of politicians? When are differences between priors and posteriors going to be large?
The contestant can therefore double his probability of being correct by switching: the posterior probability of A2 is 2/3 while that of A1 remains 1/3.
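This is the Monty Hall result, and it can be checked by enumerating where the prize is (assuming the standard setup: the contestant picks door 1, the host then opens an empty, unpicked door):

```python
# The prize is behind door 1, 2 or 3 with probability 1/3 each.
# Staying wins only when the prize is behind the picked door 1;
# switching wins in the other two cases.
stay_wins = sum(prize == 1 for prize in (1, 2, 3)) / 3     # 1/3
switch_wins = sum(prize != 1 for prize in (1, 2, 3)) / 3   # 2/3
```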
The witness drew on published studies to obtain a figure for the frequency of sudden infant death syndrome (SIDS, or cot death) in families having some of the characteristics of the defendant's family. He went on to square this figure to obtain a value of 1 in 73 million for the frequency of two cases of SIDS in such a family. . . . This approach is, in general, statistically invalid. It would only be valid if SIDS cases arose independently within families, . . . there are very strong a priori reasons for supposing that the assumption will be false. There may well be unknown genetic or environmental factors that predispose families to SIDS, so that a second case within the family becomes much more likely. The true frequency of families with two cases of SIDS may be very much less incriminating than the figure presented to the jury at trial.
Random variables
Definition: Let (S, S, P) be a probability space. If X : S → R is a real-valued function having as its domain the elements of S, then X is called a random variable.
A random variable is therefore a real-valued function defined on the space S. Typically x is used to denote an image value, i.e. x = X(s).
If the outcomes of an experiment are inherently real numbers, they are directly
interpretable as values of a random variable, and we can think of X as the identity function,
so X(s) = s.
We choose random variables based on what we are interested in getting out of the
experiment. For example, we may be interested in the number of students passing an exam,
and not the identities of those who pass. A random variable would assign each element in
the sample space a number corresponding to the number of passes associated with that
outcome.
We therefore begin with a probability space (S, S, P) and arrive at an induced probability space (R(X), B, PX).
How exactly do we arrive at the function PX(.)? As long as every set A ⊆ R(X) is associated with an event in our original sample space S, PX(A) is just the probability assigned to that event by P.
Random variables..examples
1. Tossing a coin ten times.
The sample space consists of the 210 possible sequences of heads and tails.
There are many different random variables that could be associated with this
experiment: X1 could be the number of heads, X2 the longest run of heads divided by
the longest run of tails, X3 the number of times we get two heads immediately before a
tail, etc...
For s = HT T HHHHT T H, what are the values of these random variables?
2. Choosing a point in a rectangle within a plane.
An experiment involves choosing a point s = (x, y) at random from the rectangle S = {(x, y) : 0 ≤ x ≤ 2, 0 ≤ y ≤ 1/2}.
The random variable X could be the x-coordinate of the point, and an event is X taking values in [1, 2].
Another random variable Z could be the distance of the point from the origin, Z(s) = (x^2 + y^2)^{1/2}.
3. Heights, weights, distances, temperature, scores, incomes... In these cases, we can have
X(s) = s since these are already expressed as real numbers.
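The question about s = HTTHHHHTTH in example 1 can be answered mechanically; evaluating the three random variables at that outcome:

```python
# X1 = number of heads, X2 = longest run of heads / longest run of tails,
# X3 = number of times two heads come immediately before a tail ("HHT").
import re

s = "HTTHHHHTTH"
x1 = s.count("H")                                          # 6
longest = lambda c: max(len(r) for r in re.findall(f"{c}+", s))
x2 = longest("H") / longest("T")                           # 4 / 2 = 2.0
x3 = sum(s[i:i + 3] == "HHT" for i in range(len(s) - 2))   # 1
```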
f(x) = (1/2)^{x−1} (1/2) = (1/2)^x, x = 1, 2, 3 . . .
F(x) = Σ_{w ≤ x} f(w)
In this case, the distribution function will be a step function, jumping at all points x in R(X) which are assigned positive probability.
Consider the experiment of tossing two fair coins. Describe the probability space induced
by the random variable X, the number of heads, and derive the distribution function of X.
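A sketch of the requested derivation for the two-coin experiment: X, the number of heads, takes the values 0, 1, 2 with probabilities 1/4, 1/2, 1/4, and F is the corresponding step function.

```python
# Induced probability space for X = number of heads in two fair tosses,
# and its (step) distribution function F(x) = sum of f(w) over w <= x.
from itertools import product

outcomes = list(product("HT", repeat=2))                  # 4 outcomes, 1/4 each
f = {k: sum(s.count("H") == k for s in outcomes) / 4 for k in (0, 1, 2)}

def F(x):
    return sum(p for k, p in f.items() if k <= x)
```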
Discrete distributions
Definition: A random variable X has a discrete distribution if X can take only a finite number k of different values x1, x2, . . . , xk, or an infinite sequence of different values x1, x2, . . . .
The function f(x) = P(X = x) is the probability function of X. We define it to be f(x) for all values x in our sample space R(X) and zero elsewhere.
If X has a discrete distribution, the probability of any subset A of the real line is given by
P(X ∈ A) = Σ_{xi ∈ A} f(xi).
Examples:
1. The discrete uniform distribution: picking one of the first k positive integers at random,
f(x) = 1/k for x = 1, 2, . . . , k, and 0 otherwise.
2. The binomial distribution: the probability of x successes in n trials,
f(x) = C(n, x) p^x q^{n−x} for x = 0, 1, 2, . . . , n, and 0 otherwise.
Derive the distribution functions for each of these.
Continuous distributions
The sample space associated with our random variable often has an infinite number of points.
Example: A point is randomly selected inside a circle of unit radius with origin (0, 0), where the probability assigned to being in a set A ⊆ S is P(A) = (area of A)/π, and X is the distance of the selected point from the origin. In this case F(x) = Pr(X ≤ x) = (area of circle with radius x)/π, so the distribution function of X is given by
F(x) = 0 for x < 0; x^2 for 0 ≤ x < 1; and 1 for x ≥ 1.
The function f is called the probability density function or p.d.f. of X and must satisfy the conditions below:
1. f(x) ≥ 0
2. ∫_{−∞}^{∞} f(x) dx = 1
What is f(x) for the above example? How can you use this to compute P(1/4 < X ≤ 1/2)? How would you use F(x) instead?
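For this example f(x) = F'(x) = 2x on (0, 1), and P(1/4 < X ≤ 1/2) = F(1/2) − F(1/4) = 3/16; both routes can be checked numerically:

```python
# Circle example: F(x) = x^2 on [0, 1); probability of (1/4, 1/2] two ways.
def F(x):
    if x < 0:
        return 0.0
    return min(x * x, 1.0)

p_from_F = F(0.5) - F(0.25)           # 1/4 - 1/16 = 3/16

# Same probability from the density f(x) = 2x, by midpoint-rule integration.
n = 100000
h = (0.5 - 0.25) / n
p_from_f = sum(2 * (0.25 + (i + 0.5) * h) * h for i in range(n))
```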
Continuous distributions..examples
1. The uniform distribution on an interval: Suppose a and b are two real numbers with a < b. A point x is selected from the interval S = {x : a ≤ x ≤ b}, and the probability that it belongs to any subinterval of S is proportional to the length of that subinterval. It follows that the p.d.f. must be constant on S and zero outside it:
f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.
Notice that the value of the p.d.f. is the reciprocal of the length of the interval, that these values can be greater than one, and that the assignment of probabilities does not depend on whether the distribution is defined over the closed interval [a, b] or the open interval (a, b).
2. Unbounded random variables: It is sometimes convenient to define a p.d.f. over unbounded sets, because such functions may be easier to work with and may approximate the actual distribution of a random variable quite well. An example is:
f(x) = 0 for x ≤ 0, and 1/(1 + x)^2 for x > 0.
3. Unbounded densities: The following function is unbounded around zero but still represents a valid density:
f(x) = (2/3) x^{−1/3} for 0 < x < 1, and 0 otherwise.
Mixed distributions
Often the process of collecting or recording data leads to censoring, and instead of obtaining
a sample from a continuous distribution, we obtain one from a mixed distribution.
Examples:
The weight of an object is a continuous random variable, but our weighing scale only
records weights up to a certain level.
Households with very high incomes often underreport their income; for incomes above a certain level (say $250,000), surveys often club all households together, so this variable is top-censored.
In each of these examples, we can derive the probability distribution for the new random variable, given the distribution for the continuous variable. For the density we've just considered,
f(x) = 0 for x ≤ 0, and 1/(1 + x)^2 for x > 0,
suppose we record X = 3 for all values of X ≥ 3. The p.f. for our new random variable Y is given by the same density for values less than 3 and by P(Y = 3) = ∫_3^∞ (1 + x)^{−2} dx = 1/4.
Some variables, such as the number of hours worked per week have a mixed distribution in
the population, with mass points at 0 and 40.
3. F(x) is right-continuous, i.e. F(x) = F(x+ ) at every point x, where F(x+ ) is the right hand
limit of F(x).
( for discrete random variables, there will be a jump at values that are taken with positive probability)
Where F is differentiable, F′(x) = f(x).
For discrete and mixed discrete-continuous random variables, F(x) will exhibit a countable number of discontinuities at jump points, reflecting the assignment of positive probabilities to a countable number of events.
[Figure 3.6: An example of a c.d.f. F(x), rising from 0 to 1 through the values z0 < z1 < z2 < z3 at the points x1, x2, x3, x4.]
Similarly, the fact that Pr(X ≤ x) approaches 1 as x → ∞ follows from Exercise 12 in Sec. 1.10.
The limiting values specified in Property 3.3.2 are indicated in Fig. 3.6. In this figure, the value of F(x) actually becomes 1 at x = x4 and then remains 1 for x > x4. Hence, it may be concluded that Pr(X ≤ x4) = 1 and Pr(X > x4) = 0. On the other hand, according to the sketch in Fig. 3.6, the value of F(x) approaches 0 as x → −∞, but does not actually become 0 at any finite point x. Therefore, for every finite value of x, no matter how small, Pr(X ≤ x) > 0.
A c.d.f. need not be continuous. In fact, the value of F(x) may jump at any finite or countable number of points. In Fig. 3.6, for instance, such jumps or points of discontinuity occur where x = x1 and x = x3. For each fixed value x, we shall let F(x−) denote the limit of the values of F(y) as y approaches x from the left, that is, as y approaches x through values smaller than x. In symbols,
F(x−) = lim_{y→x, y<x} F(y).
Similarly, we shall define F(x+) as the limit of the values of F(y) as y approaches x from the right. Thus,
F(x+) = lim_{y→x, y>x} F(y).
Continuity from the Right. A c.d.f. is always continuous from the right; that is, F(x) = F(x+) at every point x.
Proof: Let y1 > y2 > . . . be a decreasing sequence of numbers such that lim_{n→∞} yn = x. Then the event {X ≤ x} is the intersection of all the events {X ≤ yn} for n = 1, 2, . . . . Hence, by Exercise 13 of Sec. 1.10,
F(x) = Pr(X ≤ x) = lim_{n→∞} Pr(X ≤ yn) = F(x+).
It follows from Property 3.3.3 that at every point x at which a jump occurs, F(x+) = F(x) and F(x−) < F(x).

Course 003: Basic Econometrics, 2012-2013

The distribution function of X gives us the probability that X ≤ x for all real numbers x.
Suppose we are given a probability p and want to know the value of x corresponding to this value of the distribution function. If F is a one-to-one function, then it has an inverse, and the value we are looking for is given by F^{−1}(p).
Definition: When the distribution function of a random variable X is continuous and one-to-one over the whole set of possible values of X, we call the function F^{−1} the quantile function of X. The value of F^{−1}(p) is called the pth quantile of X, or the 100p-th percentile of X, for each 0 < p < 1.
Example: If X has a uniform distribution over the interval [a, b], F(x) = (x − a)/(b − a) over this interval, 0 for x ≤ a and 1 for x > b. Given a value p, we simply solve for the pth quantile:
x = pb + (1 − p)a. Compute this for p = .5, .25, .9, . . .
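The uniform quantile formula x = pb + (1 − p)a can be computed for the suggested values of p (using illustrative endpoints a = 0, b = 10):

```python
# p-th quantile of the uniform distribution on [a, b]:
# solve p = (x - a)/(b - a) for x, giving x = p*b + (1 - p)*a.
def uniform_quantile(p, a, b):
    return p * b + (1 - p) * a

a, b = 0.0, 10.0
q = {p: uniform_quantile(p, a, b) for p in (0.5, 0.25, 0.9)}
```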
Exercises:
1. f(x) = x/8 for 0 ≤ x ≤ 4, and 0 otherwise.
2. f(x) = c x^2 for 1 ≤ x ≤ 2, and 0 otherwise (find c).
Bivariate distributions
Social scientists are typically interested in how multiple attributes of people and the societies they live in are related. The object of interest is then a multivariate probability distribution (examples: education and earnings, days ill per month and age, sex-ratios and areas under rice cultivation).
This involves dealing with the joint distribution of two or more random variables. Bivariate distributions attach probabilities to events that are defined by values taken by two random variables (say X and Y).
Values taken by these random variables are now ordered pairs (xi, yi), and an event A is a set of such values.
If both X and Y are discrete random variables, the probability function is
f(x, y) = P(X = x and Y = y), and P((X, Y) ∈ A) = Σ_{(xi,yi)∈A} f(xi, yi).
education            male    female
none                 .05     .20
primary              .25     .10
middle               .15     .04
high                 .10     .03
senior secondary     .03     .02
graduate             .02     .01
What are some features of a table like this one? In particular, how would we obtain
probabilities associated with the following events:
receiving no education
becoming a female graduate
completing primary school
What else do you learn from the table about the population of interest?
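The questions above can be answered by summing entries of the joint table (a sketch; the label of the last row, "graduate", is inferred from the question about female graduates):

```python
# The education-by-gender table as a joint probability function.
joint = {
    "none":             {"male": .05, "female": .20},
    "primary":          {"male": .25, "female": .10},
    "middle":           {"male": .15, "female": .04},
    "high":             {"male": .10, "female": .03},
    "senior secondary": {"male": .03, "female": .02},
    "graduate":         {"male": .02, "female": .01},   # inferred label
}

p_no_education = sum(joint["none"].values())                     # .25
p_female_graduate = joint["graduate"]["female"]                  # .01
total = sum(p for row in joint.values() for p in row.values())   # 1.0
```

Note also the marginal probabilities of gender: the male column sums to .60 and the female column to .40.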
f is now called the joint probability density function and must satisfy
1. f(x, y) ≥ 0 for −∞ < x < ∞ and −∞ < y < ∞
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1
Example 1: Given the following joint density on X and Y, we'll calculate P(X ≥ Y):
f(x, y) = c x^2 y for x^2 ≤ y ≤ 1, and 0 otherwise.
First find c to make this a valid joint density (notice the limits of integration here); it will turn out to be 21/4. Then, for P(X ≥ Y), integrate the density over y ∈ (x^2, x) and x ∈ (0, 1). Using this density, P(X ≥ Y) = 3/20.
Example 2: A point (X, Y) is selected at random from inside the circle x2 + y2 9. Determine the joint density
function, f(x, y).
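Example 1 can be checked numerically; the sketch below recovers c = 21/4 and P(X ≥ Y) = 3/20 with midpoint Riemann sums (the inner integrals over y are done analytically).

```python
# Check c and P(X >= Y) for f(x, y) = c*x^2*y on the region x^2 <= y <= 1.
n = 4000
dx = 2.0 / n
mass = 0.0  # integral of x^2*y over the whole support (without c)
part = 0.0  # integral of x^2*y over the piece of the support where y <= x
for i in range(n):
    x = -1 + (i + 0.5) * dx
    mass += x**2 * (1 - x**4) / 2 * dx         # inner: integral of y from x^2 to 1 is (1 - x^4)/2
    if 0 < x < 1:
        part += x**2 * (x**2 - x**4) / 2 * dx  # inner: integral of y from x^2 to x is (x^2 - x^4)/2
c = 1 / mass  # -> 21/4 = 5.25
p = c * part  # -> 3/20 = 0.15
```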
If F(x, y) is continuously differentiable in both its arguments, the joint density is derived as:

f(x, y) = ∂²F(x, y)/∂x∂y
and given the density, we can integrate w.r.t x and y over the appropriate limits to get the
distribution function.
Example:
X and Y and their joint density. Notice the (x, y) range over which F(x, y) is strictly increasing.
Marginal distributions

A distribution of X derived from the joint distribution of X and Y is known as the marginal
distribution of X. For a discrete random variable:

f1(x) = P(X = x) = Σ_y P(X = x and Y = y) = Σ_y f(x, y)

and analogously

f2(y) = P(Y = y) = Σ_x P(X = x and Y = y) = Σ_x f(x, y)

For a continuous joint density f(x, y), the marginal densities for X and Y are given by:

f1(x) = ∫ f(x, y) dy and f2(y) = ∫ f(x, y) dx
Go back to our table representing the joint distribution of gender and education and find
the marginal distribution of education.
Can one construct the joint distribution from one of the marginal distributions?
Example: X and Y are independent, each with marginal density

g(x) = 2x for 0 ≤ x ≤ 1, and 0 otherwise

Find the probability that X + Y ≤ 1.

The joint density 4xy is got by multiplying the marginal densities because these variables
are independent. The required probability of 1/6 is then obtained by integrating over
y ∈ (0, 1 − x) and x ∈ (0, 1).
How might we use a table of probabilities to determine whether two random variables are
independent?
Given the following density, can we tell whether the variables X and Y are independent?
f(x, y) = ke^{−(x+2y)} for x ≥ 0 and y ≥ 0, and 0 otherwise

Notice that we can factorize the joint density as the product of k1e^{−x} and k2e^{−2y}, where
k1k2 = k. To obtain the marginal densities of X and Y, we multiply these functions by
appropriate constants which make them integrate to unity. This gives us

f1(x) = e^{−x} for x ≥ 0 and f2(y) = 2e^{−2y} for y ≥ 0
Example:

f(x, y) = x + y for 0 < x < 1 and 0 < y < 1, and 0 otherwise

Notice that we cannot factorize the joint density as the product of a non-negative function
of x and another non-negative function of y. Computing the marginals gives us

f1(x) = x + 1/2 for 0 < x < 1 and f2(y) = y + 1/2 for 0 < y < 1

Example:

f(x, y) = kx²y² for x² + y² ≤ 1, and 0 otherwise
In this case the possible values X can take depend on Y and therefore, even though the joint
density can be factorized, the same factorization cannot work for all values of (x, y).
More generally, whenever the space of positive probability density of X and Y is bounded by a
curve, rather than a rectangle, the two random variables are dependent.
If the support is a rectangle a ≤ x ≤ b, c ≤ y ≤ d and f(x, y) = g(x)h(y) on it, then dividing
g(x) by ∫_a^b g(x)dx and h(y) by ∫_c^d h(y)dy turns each factor into a marginal density, so X and Y
are independent whenever the joint density factorizes over a rectangular support.
Now to show that if the support is not a rectangle, the variables are dependent: Start with a point (x, y) outside
the domain where f(x, y) > 0. If x and y are independent, we have f(x, y) = f1 (x)f2 (y), so one of these must be zero.
Now as we move due south and enter the set where f(x, y) > 0, our value of x has not changed, so it could not be
that f1 (x) was zero at the original point. Similarly, if we move west, y is unchanged so it could not be that f2 (y)
was zero at the original point. So we have a contradiction.
Conditional distributions

Definition: Consider two discrete random variables X and Y with a joint probability function
f(x, y) and marginal probability functions f1(x) and f2(y). After the value Y = y has been
observed, we can write the probability that X = x using our definition of conditional
probability:

P(X = x | Y = y) = P(X = x and Y = y)/P(Y = y) = f(x, y)/f2(y)

We denote this conditional probability function of X given Y = y by

g1(x|y) = f(x, y)/f2(y)

Two observations:
1. For each fixed value of y, g1(x|y) is a probability function over all possible values of X,
because it is non-negative and

Σ_x g1(x|y) = (1/f2(y)) Σ_x f(x, y) = f2(y)/f2(y) = 1

2. Conditional probabilities are proportional to joint probabilities because they just divide
these by a constant.
We cannot use the definition of conditional probability to derive the conditional density for
continuous random variables because the probability that Y takes any particular value y is zero.
We simply define the conditional probability density function of X given Y = y as

g1(x|y) = f(x, y)/f2(y) for −∞ < x < ∞ and −∞ < y < ∞
The numerator in g1(x|y) = f(x, y)/f2(y) is a section of the surface representing the joint density, and
the denominator is the constant by which we need to divide the numerator to get a valid density
(which integrates to unity).
Back to the gender-education table:

education          male    female    f(education | gender = male)
none               .05     .2        .08
primary            .25     .1        .42
middle             .15     .04       .25
high               .1      .03       .17
senior secondary   .03     .02       .05
graduate           .02     .01       .03

f(gender | graduate): male .67, female .33
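The conditional columns can be reproduced from the joint table: divide each joint probability by the relevant marginal. A sketch:

```python
# Conditional distributions from the joint table of gender and education.
joint = {
    ("none", "male"): .05,             ("none", "female"): .20,
    ("primary", "male"): .25,          ("primary", "female"): .10,
    ("middle", "male"): .15,           ("middle", "female"): .04,
    ("high", "male"): .10,             ("high", "female"): .03,
    ("senior secondary", "male"): .03, ("senior secondary", "female"): .02,
    ("graduate", "male"): .02,         ("graduate", "female"): .01,
}
p_male = sum(p for (edu, g), p in joint.items() if g == "male")  # marginal: .60
f_primary_given_male = joint[("primary", "male")] / p_male       # .25/.60, about .42

p_graduate = sum(p for (edu, g), p in joint.items() if edu == "graduate")  # .03
f_male_given_graduate = joint[("graduate", "male")] / p_graduate           # .02/.03, about .67
```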
Example: for the joint density

f(x, y) = cx²y for x² ≤ y ≤ 1, and 0 otherwise

with c = 21/4, the marginal distribution of X is given by

f1(x) = ∫_{x²}^{1} (21/4)x²y dy = (21/8)x²(1 − x⁴) for −1 ≤ x ≤ 1

so the conditional density of Y given X = x is

g2(y|x) = f(x, y)/f1(x) = 2y/(1 − x⁴) for x² ≤ y ≤ 1, and 0 otherwise

For instance, P(Y ≥ 3/4 | X = 1/2) = ∫_{3/4}^{1} g2(y | 1/2) dy = 7/15.
Notice that the conditional distribution is not defined for a value y0 at which f2(y0) = 0, but this is irrelevant
because at any such value f(x, y0) = 0.

Example: X is first chosen from a uniform distribution on (0, 1) and then Y is chosen from a uniform distribution
on (x, 1). The marginal distribution of X is straightforward:

f1(x) = 1 for 0 < x < 1, and 0 otherwise

and the conditional density of Y given X = x is

g2(y|x) = 1/(1 − x) for x < y < 1, and 0 otherwise

The joint density is therefore

f(x, y) = f1(x)g2(y|x) = 1/(1 − x) for 0 < x < y < 1, and 0 otherwise

and the marginal density of Y is

f2(y) = ∫ f(x, y)dx = ∫_0^y 1/(1 − x) dx = −log(1 − y) for 0 < y < 1
Multivariate distributions

Our definitions of joint, conditional and marginal distributions can be easily extended to an
arbitrary finite number of random variables. Such a distribution is now called a multivariate
distribution.

The joint distribution function is defined as the function F whose value at any point
(x1, x2, . . . , xn) ∈ Rⁿ is given by:

F(x1, . . . , xn) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn)

For a discrete joint distribution, the probability function at any point (x1, x2, . . . , xn) ∈ Rⁿ is given
by:

f(x1, . . . , xn) = P(X1 = x1, X2 = x2, . . . , Xn = xn)   (2)

and the random variables X1, . . . , Xn have a continuous joint distribution if there is a nonnegative
function f defined on Rⁿ such that for any subset A ⊆ Rⁿ,

P[(X1, . . . , Xn) ∈ A] = ∫· · ·∫_A f(x1, . . . , xn) dx1 . . . dxn   (3)

The marginal distribution of any single random variable Xi can now be derived by integrating
over the other variables:

f1(x1) = ∫· · ·∫ f(x1, . . . , xn) dx2 . . . dxn   (4)

and the conditional probability density function of X1 given values of the other variables is:

g1(x1 | x2, . . . , xn) = f(x1, . . . , xn)/f0(x2, . . . , xn)   (5)

where f0 denotes the marginal joint density of (X2, . . . , Xn).
Multivariate distributions..example

Suppose we start with the following density function for a variable X1:

f1(x) = e^{−x} for x > 0, and 0 otherwise

and are told that for any given value of X1 = x1, two other random variables X2 and X3 are
independently and identically distributed with a given conditional p.d.f.
Example: for a roll of a fair die, f(x) = (1/6) I{1,2,...,6}(x) and

E(X) = Σ_{x=1}^{6} x(1/6) = 3.5
If X can take only a finite number of different values, this expectation always exists.
If there is an infinite sequence of possible values of X, then this expectation exists if and
only if Σ_{x∈R(X)} |x|f(x) < ∞ (the series defining the expectation is absolutely convergent).
We can think of the expectation as a point of balance: if there were various weights placed
on a weightless rod, where should a fulcrum be placed so that the distribution of weights
balances?
The expectation of X is also called the expected value or the mean of X.
For a continuous random variable,

E(X) = ∫ x f(x) dx, which exists iff ∫ |x| f(x) dx < ∞

Suppose we have a distribution f which is symmetric with respect to a given point x0 on the
x axis, so that f(x0 + δ) = f(x0 − δ) for all δ. If the expectation exists, it will be equal to x0.

The expectation will always exist if the set of values taken by X is bounded.

When does it not exist? We need sufficiently small weights attached to large values of X
when the set of possible values is not bounded. The tails of a distribution may fall off fast
enough for the area under it to integrate to 1, but the function xf(x) may not have this
property if the tails are thick.

The Cauchy distribution, f(x) = 1/(π(1 + x²)), is symmetric (it belongs to the family of
t distributions), but the expectation of the Cauchy distribution does not exist.
[Figure: the p.d.f. of the standard normal distribution and the p.d.f. of the Cauchy distribution; the
Cauchy curve has much thicker tails.]
Expectation of a function of X: if Y = g(X) has density h(y), we could compute its
expectation as E(Y) = ∫ y h(y)dy (if continuous). But we don't need this:

RESULT: Let X be a random variable having density function f(x). Then the expectation of
Y = g(X) (in the discrete and continuous case respectively) is given by:

E g(X) = Σ_{x∈R(X)} g(x)f(x) and E g(X) = ∫ g(x)f(x)dx
Examples:

f(x) = 2x for 0 < x < 1, and 0 otherwise. Then E(√X) = ∫_0^1 x^{1/2}(2x)dx = 4/5.

A point (X, Y) is chosen at random from the unit square: 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. The joint
density over all points (x, y) in the square is 1 and E(X² + Y²) = ∫_0^1 ∫_0^1 (x² + y²)dxdy = 2/3.

(X1, X2) forms a random sample of size 2 from a uniform distribution on (0, 1) and
Y = min(X1, X2). We'll show that E(Y) = 2 ∫_0^1 ∫_0^{x2} x1 dx1 dx2 = ∫_0^1 x2² dx2 = 1/3.

Suppose we are interested in the expectation of a random variable Y = g(X), defined over a set Ω. This
would be given by ∫_Ω y f(x)dx. If Ω1 and Ω2 form a partition of Ω, we can write this integral as

∫_Ω y f(x)dx = ∫_{Ω1} y f(x)dx + ∫_{Ω2} y f(x)dx
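The second and third examples are easy to confirm by simulation (a Monte Carlo sketch; the tolerances are loose since the estimates are random):

```python
import random

random.seed(0)
n = 200_000
s_min, s_sq = 0.0, 0.0
for _ in range(n):
    x, y = random.random(), random.random()  # independent uniform draws on (0, 1)
    s_min += min(x, y)                       # for E[min(X1, X2)] = 1/3
    s_sq += x * x + y * y                    # for E[X^2 + Y^2] = 2/3
e_min = s_min / n
e_sq = s_sq / n
```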
Expectation properties

RESULT 1: If Y = aX + b, then E(Y) = aE(X) + b.

Proof (for a continuous random variable X):

E(aX + b) = ∫ (ax + b)f(x)dx = a ∫ xf(x)dx + b ∫ f(x)dx = aE(X) + b

RESULT 2: The expectation of a sum is the sum of the expectations: E Σ_{i=1}^k ui(X) = Σ_{i=1}^k E ui(X).

Proof: E Σ_{i=1}^k ui(X) = ∫ [Σ_{i=1}^k ui(x)] f(x)dx = Σ_{i=1}^k ∫ ui(x)f(x)dx = Σ_{i=1}^k E ui(X)

RESULT 3: For a random sample, the expectation of a product is the product of the
expectations: If X1, . . . , Xn are n independent random variables such that each expectation
E(Xi) exists, then E(∏_{i=1}^n Xi) = ∏_{i=1}^n E(Xi).

Proof (for continuous random variables): Since the random variables are independent,
f(x1, . . . , xn) = ∏_{i=1}^n fi(xi), and

E(∏_{i=1}^n Xi) = ∫· · ·∫ (∏_{i=1}^n xi)(∏_{i=1}^n fi(xi)) dx1 . . . dxn = ∏_{i=1}^n ∫ xi fi(xi)dxi = ∏_{i=1}^n E(Xi)

(Notice that this third property applies only to independent random variables, whereas the
second property holds for dependent variables as well.)
Expectation properties...examples

Expected number of successes: n balls are selected from a box containing a fraction p of red
balls. The random variable Xi takes the value 1 if the ith ball picked is red and zero
otherwise. We're interested in the expected value of the number of red balls picked.
This is simply X = X1 + X2 + · · · + Xn. The expectation of X (using our theorem) is equal to
E(X1) + E(X2) + · · · + E(Xn), where E(Xi) = p·1 + (1 − p)·0 = p. We therefore have E(X) = np.

Expected number of matches: If n letters are randomly placed in n envelopes, how many
matches would we expect? Let Xi = 1 if the ith letter is placed in the correct envelope, and
zero otherwise.

P(Xi = 1) = 1/n and P(Xi = 0) = 1 − 1/n

It is therefore the case that E(Xi) = 1/n for each i, and E(X) = 1/n + · · · + 1/n = 1.

Suppose the random variables X1, . . . , Xn form a random sample of size n from a given
continuous distribution on the real line for which the p.d.f. is f. Find the expectation of the
number of observations in the sample that fall in a specified interval [a, b]. This is just like
the first problem, except the probability of success is now ∫_a^b f(x)dx, so the answer is
n ∫_a^b f(x)dx.
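The matching result is striking because the answer does not depend on n; a quick simulation (the letter count and trial count below are arbitrary choices):

```python
import random

random.seed(1)
n_letters, trials = 10, 50_000
total_matches = 0
for _ in range(trials):
    envelopes = list(range(n_letters))
    random.shuffle(envelopes)  # a random assignment of letters to envelopes
    total_matches += sum(1 for i, e in enumerate(envelopes) if i == e)
avg_matches = total_matches / trials  # close to 1, whatever the value of n_letters
```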
More examples

The density function for X is given by f(x) = 2(1 − x)I(0,1)(x).

E(X) = ∫_0^1 xf(x)dx = 2 ∫_0^1 (x − x²)dx = 2[x²/2 − x³/3]_0^1 = 2(1/6) = 1/3, and
E(X²) = 2 ∫_0^1 (x² − x³)dx = 2[x³/3 − x⁴/4]_0^1 = 2(1/3 − 1/4) = 1/6. We
can use these to compute E(6X + 3X²) = 6(1/3) + 3(1/6) = 5/2. We could have also computed this directly using the
formula for the expectation of a function r(X).

A horizontal line segment of length 5 is divided at a randomly selected point and X is the
length of the left-hand part. Let us find the expectation of the product of the lengths.

We are picking a point from a uniform distribution on [0, 5], so the density is f(x) = (1/5)I(0,5)(x). E(X) = 5/2 and
E(5 − X) = 5/2 (why?). The expected value of the product of the lengths is given by

E[X(5 − X)] = (1/5) ∫_0^5 x(5 − x)dx = 25/6 ≠ (5/2)² = E(X)E(5 − X)

A bowl contains 5 chips, 3 marked $1 and 2 marked $4. A player draws 2 chips at random
and is paid the sum of the values of the chips. If it costs $4.75 to play, is his expected gain
positive?

Let the random variable X be the number of $1 chips. The probability function is
f(x) = C(3, x)C(2, 2 − x)/C(5, 2) for x = 0, 1, 2 (a hypergeometric distribution). Compute f(0) = 1/10, f(1) = 6/10
and f(2) = 3/10. In this case u(x) = x + 4(2 − x) = 8 − 3x,
so E[u(X)] = (1/10)8 + (6/10)5 + (3/10)2 = 4.4. Alternatively, compute E(X) = 0f(0) + 1f(1) + 2f(2) = 12/10 and find
the desired expectation as 8 − 3E(X). Since 4.4 < 4.75, the expected gain is negative.
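The chips example can be transcribed directly (the hypergeometric probabilities and the payoff function are exactly those in the text):

```python
from math import comb

# Hypergeometric pmf for the number of $1 chips in a draw of 2 from {3 x $1, 2 x $4}.
f = {x: comb(3, x) * comb(2, 2 - x) / comb(5, 2) for x in (0, 1, 2)}  # 1/10, 6/10, 3/10
payoff = {x: 8 - 3 * x for x in (0, 1, 2)}  # x $1-chips and (2 - x) $4-chips pay 8 - 3x
expected_payoff = sum(f[x] * payoff[x] for x in f)  # 4.4
expected_gain = expected_payoff - 4.75              # negative: the game is unfavourable
```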
Variance properties

1. Var(X) = 0 if and only if there exists a constant c such that P(X = c) = 1.

2. For any constants a and b, Var(aX + b) = a²Var(X). It follows that Var(−X) = Var(X).

Proof: Var(aX + b) = E[(aX + b − aμ − b)²] = E[(a(X − μ))²] = a²E[(X − μ)²] = a²Var(X)
The mean from the MGF:

μ = (d/dt)[E(e^{tX})]_{t=0} = E[((d/dt)e^{tX})]_{t=0} = E[(Xe^{tX})]_{t=0} = E(X)

Proof. The function e^x can be expressed as the sum of the series 1 + x + x²/2! + . . ., so e^{tx} can be expressed as
1 + tx + t²x²/2! + . . . and the expectation is E(e^{tX}) = Σ_x (1 + tx + t²x²/2! + . . .)f(x). If we differentiate this w.r.t. t and then set t = 0,
we are left with Σ_x xf(x) = E(X); differentiating twice and setting t = 0 gives Σ_x x²f(x), which is the second moment. For continuous
distributions, we replace the sum with ∫ (. . .)dx.
Example: for f(x) = e^{−x} on (0, ∞), the MGF is

ψ(t) = ∫_0^∞ e^{tx} e^{−x} dx = [e^{x(t−1)}/(t − 1)]_0^∞ = 0 − 1/(t − 1) = 1/(1 − t) for t < 1

Then ψ′(t) = 1/(1 − t)² and ψ″(t) = 2/(1 − t)³, so μ = ψ′(0) = 1/(1 − 0)² = 1.
RESULT 2: If X1, . . . , Xn are independent and Y = X1 + · · · + Xn, the MGF of Y is the product
of the individual MGFs:

ψ(t) = ∏_{i=1}^n ψi(t)

RESULT 3: If the MGFs of two random variables X1 and X2 are identical for all values of t
in an interval around the point t = 0, then the probability distributions of X1 and X2 must
be identical.

Examples: If f(x) = e^{−x} I(0,∞) as in the above example, the MGF of the random variable
Y = X + 1 is e^t/(1 − t) for t < 1 (using the first result above, setting a = 1 and b = 1), and if Y = 3 − 2X,
the MGF is e^{3t}/(1 + 2t) for t > −1/2.
Binomial mean and variance: for X = X1 + · · · + Xn, a sum of n independent Bernoulli variables,

Var(X) = Σ_{i=1}^n Var(Xi)

E(Xi) = 1·p + 0·(1 − p) = p and E(Xi²) = 1²p + 0²(1 − p) = p, so Var(Xi) = p − p² and Var(X) = np(1 − p).

We can get the same expression using the MGF for the binomial:
The MGF for each of the Xi variables is given by

e^t P(Xi = 1) + (1)P(Xi = 0) = pe^t + q

Using the additive property of MGFs for independent random variables, we get the
MGF for X as

ψ(t) = (pe^t + q)^n

For two binomial random variables with parameters (n1, p) and (n2, p), the MGF of their sum
is given by the product of the MGFs, (pe^t + q)^{n1+n2}.
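Both routes to Var(X) = np(1 − p) can be checked against the exact binomial pmf (a small sketch with arbitrary n and p):

```python
from math import comb

n, p = 10, 0.3
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]
mean = sum(x * f for x, f in enumerate(pmf))               # n*p = 3.0
var = sum(x * x * f for x, f in enumerate(pmf)) - mean**2  # n*p*(1-p) = 2.1
```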
Exercises (find the mean and variance):

f(x) = 4x³ for 0 ≤ x ≤ 1, and 0 otherwise

f(x) = 2 for 2.5 ≤ x ≤ 3, and 0 otherwise
The correlation coefficient is ρ(X, Y) = Cov(X, Y)/(σX σY).

Result: For any two random variables U and V, it is always the case that (E UV)² ≤ E U² E V².
This is known as the Cauchy-Schwarz Inequality.

This provides us with bounds on the value of the covariance, |σXY| ≤ σX σY (let U = (X − EX) and
V = (Y − EY) in the statement of the Cauchy-Schwarz inequality above) and in turn implies a
correlation bound

−1 ≤ ρXY ≤ 1

Example: Let f(x, y) = (x + y)I(0,1)(x)I(0,1)(y). In this case E(X) = ∫_0^1 ∫_0^1 x(x + y)dxdy = 7/12 = E(Y) and
E(XY) = ∫_0^1 ∫_0^1 xy(x + y)dxdy = 1/3, so Cov(X, Y) = 1/3 − (7/12)² = −1/144, Var(X) = Var(Y) = 11/144, and

ρ = (−1/144)/√((11/144)(11/144)) = −1/11
Result 2: If X and Y are independent random variables, each with finite variance, then
Cov(X, Y) = ρ(X, Y) = 0.

Proof: If X and Y are independent, E(XY) = E(X)E(Y). Now apply the expression for covariance in the above
result.

Note: zero correlation does not imply independence. Example: if X takes the values −1, 0, 1 with equal
probability and Y = X², then E(XY) = E(X³) = 0. Since E(XY) = E(X)E(Y), ρ = 0 and the variables are uncorrelated but
clearly dependent. A zero correlation only tells us that the variables are not linearly dependent.

Result 3: Suppose X is a random variable with finite variance and Y = aX + b for some
constants a and b. If a > 0, then ρ(X, Y) = 1. If a < 0, then ρ(X, Y) = −1.

Proof: Y − μY = a(X − μX), so Cov(X, Y) = aE[(X − μX)²] = aσX² and σY = |a|σX; plug these values into the
expression for ρ to get the result.
Conditional Expectations
The conditional expectation of random variables is defined using conditional probability
density functions rather than their unconditional counterparts.
Suppose that X and Y are random variables with a joint density function f(x, y), with the
marginal p.d.f of X denoted by f1 (x).
For any value of x such that f1 (x) > 0, let g(y|x) denote the conditional p.d.f of Y given that
X=x
The conditional expectation of Y given X = x is E(Y|x) = ∫ y g(y|x)dy for continuous X and Y,
and E(Y|x) = Σ_y y g(y|x) if X and Y have a discrete distribution.
The discrete uniform distribution

Probability function: f(x) = (1/N) I{1,2,...,N}(x)

Moments:

μ = Σ xf(x) = (1/N)(N(N + 1)/2) = (N + 1)/2

σ² = Σ x²f(x) − μ² = (1/N)(N(N + 1)(2N + 1)/6) − ((N + 1)/2)² = (N² − 1)/12

MGF: Σ_{j=1}^N e^{jt}/N
The Bernoulli distribution: μ = p and σ² = Σ x²f(x) − μ² = p(1 − p)

MGF: e^t p + e^0(1 − p) = pe^t + (1 − p)

Applications: experiments or situations in which there are two possible outcomes: success
or failure, defective or not defective, male or female, etc.
The binomial distribution:

f(x; n, p) = C(n, x) p^x (1 − p)^{n−x} for x = 0, 1, 2, . . . , n, and 0 otherwise

Notice that since Σ_{x=0}^n C(n, x) a^x b^{n−x} = (a + b)^n, we have Σ_{x=0}^n f(x; n, p) = (p + (1 − p))^n = 1, so this is a valid
density function.

MGF: The MGF is given by:

Σ_{x=0}^n e^{tx} f(x) = Σ_{x=0}^n e^{tx} C(n, x) p^x (1 − p)^{n−x} = Σ_{x=0}^n C(n, x)(pe^t)^x (1 − p)^{n−x} = [(1 − p) + pe^t]^n
Multinomial Distributions

Suppose there are a small number of different outcomes (methods of public transport, water
purification, etc.). The multinomial distribution gives us the probability associated with a
particular vector of these outcomes.

Parameters: (n, p1, . . . , pm), with 0 ≤ pi ≤ 1, Σ_i pi = 1, and n a positive integer.

Probability function:

f(x1, . . . , xm; n, p1, . . . , pm) = (n!/∏_{i=1}^m xi!) ∏_{i=1}^m pi^{xi} for xi = 0, 1, 2, . . . , n with Σ_{i=1}^m xi = n,

and 0 otherwise.
The negative binomial distribution:

f(x) = C(r + x − 1, x) p^r q^x for x = 0, 1, 2, 3, . . .

For the geometric distribution (r = 1), μ = q/p and σ² = q/p².

The negative binomial is just a sum of r geometric variables, and the MGF is therefore
(p/(1 − qe^t))^r; the corresponding mean and variance are μ = rq/p and σ² = rq/p².

The geometric distribution is memoryless, so the conditional probability of k + t
failures given at least k failures is the unconditional probability of t failures:

P(X = k + t | X ≥ k) = P(X = t)
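The memoryless property is immediate from the pmf, since P(X ≥ k) = q^k; a sketch with arbitrary p, k and t:

```python
p = 0.3
q = 1 - p

def pmf(x):  # P(X = x): probability of x failures before the first success
    return p * q**x

def sf(k):   # P(X >= k) = q^k
    return q**k

k, t = 4, 2
lhs = pmf(k + t) / sf(k)  # P(X = k + t | X >= k)
rhs = pmf(t)              # P(X = t)
```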
The Poisson distribution:

f(x; λ) = e^{−λ} λ^x/x! for x = 0, 1, 2, . . . , with λ > 0, and 0 otherwise

Σ_{x=0}^∞ f(x) = e^{−λ} Σ_{x=0}^∞ λ^x/x! = e^{−λ} e^{λ} = 1, so we have a valid density.

Moments: μ = σ² = λ

MGF: E(e^{tX}) = Σ_{x=0}^∞ e^{tx} e^{−λ} λ^x/x! = e^{−λ} Σ_{x=0}^∞ (λe^t)^x/x! = e^{λ(e^t − 1)}

The MGF can be used to get the first and second moments about the origin, λ and λ² + λ,
so the mean and the variance are both λ.

We can also use the product of k MGFs to show that the sum of k independently
distributed Poisson variables has a Poisson distribution with mean λ1 + . . . + λk.
A Poisson process
Suppose that the number of type A outcomes that occur over a fixed interval of time, [0, t]
follows a process in which
1. The probability that precisely one type A outcome will occur in a small interval of time t
is approximately proportional to the length of the interval:

g(1, t) = λt + o(t)

where o(t) denotes a function of t having the property that lim_{t→0} o(t)/t = 0.

2. The probability that two or more type A outcomes will occur in a small interval of time t
is negligible:

Σ_{x=2}^∞ g(x, t) = o(t)

3. The numbers of type A outcomes that occur in nonoverlapping time intervals are
independent events.

These conditions imply a process which is stationary over the period of observation, i.e. the
probability of an occurrence must be the same over the entire period, with neither busy nor quiet
intervals.
Example: suppose defects occur independently with probability 1/1000 per item and the
expected number of defects in a batch is λ = 3. The probability of exactly 5 defects is

f(5; 3) = 3⁵e^{−3}/5!

You can plug this into a computer, or alternatively use tables, to compute f(5; 3) = .101.
Poisson approximation to the binomial: fix λ = np and let n → ∞. Writing C(n, x) = ∏_{i=1}^x (n − i + 1)/x!,

f(x; n, p) = [∏_{i=1}^x (n − i + 1)/x!] (λ/n)^x (1 − λ/n)^{n−x}
= (λ^x/x!) [n(n − 1) · · · (n − x + 1)/n^x] (1 − λ/n)^n (1 − λ/n)^{−x}

As n → ∞, n(n − 1) · · · (n − x + 1)/n^x → 1, (1 − λ/n)^n → e^{−λ} and (1 − λ/n)^{−x} → 1, so

lim_{n→∞} f(x; n, p) = e^{−λ} λ^x/x!

(using the above result and the property that the limit of a product is the product of the
limits).
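The limit can be seen numerically: with n large and p small, the binomial pmf is already very close to the Poisson pmf with λ = np (a sketch; the particular n and p are arbitrary):

```python
from math import comb, exp, factorial

n, p = 1000, 0.003
lam = n * p  # λ = 3
binom = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(8)]
poisson = [exp(-lam) * lam**x / factorial(x) for x in range(8)]
max_gap = max(abs(b - q) for b, q in zip(binom, poisson))  # small
```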
Example: for λ = 4.5,

Σ_{x=0}^{10} e^{−4.5}(4.5)^x/x! = .9933

Rules of Thumb: the Poisson probabilities are close to the binomial probabilities when n ≥ 20
and p ≤ .05, and the approximation is excellent when n ≥ 100 and np ≤ 10.
The multivariate hypergeometric distribution (with Σ_{i=1}^n xi = n and Σ_{i=1}^m Ki = M):

f(x1 . . . xm; K1 . . . Km, n) = [∏_{j=1}^m C(Kj, xj)] / C(M, n)
The uniform distribution on [a, b]:

Density: f(x) = 1/(b − a) on [a, b]

Moments: μ = (a + b)/2, σ² = (b − a)²/12

MGF: MX(t) = (e^{bt} − e^{at}) / ((b − a)t)

Applications:

- to construct the probability space of an experiment in which any outcome in the
interval [a, b] is equally likely.
- to generate random samples from other distributions (based on the probability integral
transformation). This is part of your first lab assignment.
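The probability integral transformation mentioned in the second application can be sketched: if U ~ Uniform(0, 1) and F is an invertible CDF, then F⁻¹(U) has CDF F. Below it is used to draw from an exponential distribution with mean β (the value of β is an arbitrary choice for the sketch):

```python
import random
from math import log

random.seed(2)
beta = 2.0  # target exponential mean
# Invert F(x) = 1 - exp(-x/beta): x = -beta * log(1 - u).
draws = [-beta * log(1 - random.random()) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)  # close to beta
```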
The gamma function:

Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy   (1)

If α = 1, Γ(1) = ∫_0^∞ e^{−y} dy = [−e^{−y}]_0^∞ = 1.

If α > 1, we can integrate (1) by parts, setting u = y^{α−1} and dv = e^{−y}dy and using the formula
∫ u dv = uv − ∫ v du, to get: [−y^{α−1}e^{−y}]_0^∞ + (α − 1) ∫_0^∞ y^{α−2} e^{−y} dy.

The first term in the above expression is zero because the exponential function goes to zero
faster than any polynomial grows, and we obtain

Γ(α) = (α − 1)Γ(α − 1)

and for any integer α > 1, we have

Γ(α) = (α − 1)(α − 2)(α − 3) . . . (3)(2)(1)Γ(1) = (α − 1)!
Substituting y = x/β in (1),

Γ(α) = ∫_0^∞ (x/β)^{α−1} e^{−x/β} (1/β)dx

or, rearranging,

1 = ∫_0^∞ [1/(β^α Γ(α))] x^{α−1} e^{−x/β} dx

so

f(x) = [1/(β^α Γ(α))] x^{α−1} e^{−x/β} I(0,∞)(x)

is a valid density: the gamma density with parameters α and β.
[Figure: graphs of the gamma p.d.f. for several different parameter pairs, (0.1, 0.1), (1, 1), (2, 2) and
(3, 3), all with a mean of 1.]
Course 003: Basic Econometrics, 2012-2013

Moments of the gamma distribution

Parameters: (α, β), α > 0, β > 0

Let X have the gamma distribution with parameters α and β. For k = 1, 2, . . . ,

E(X^k) = ∫_0^∞ x^k f(x | α, β)dx = [1/(Γ(α)β^α)] ∫_0^∞ x^{α+k−1} e^{−x/β} dx = β^k Γ(α + k)/Γ(α) = β^k α(α + 1) . . . (α + k − 1)

In particular, μ = αβ and σ² = Var(X) = αβ².

The MGF is derived the same way:

MX(t) = [1/(Γ(α)β^α)] ∫_0^∞ x^{α−1} e^{−(1/β − t)x} dx = (1 − βt)^{−α} for t < 1/β

Example (service times in a queue): with a gamma prior on the service rate Z, the conditional mean
service rate given the observations X1 = x1, . . . , Xn = xn is

E(Z | x1, . . . , xn) = (n + 1)/(2 + Σ_{i=1}^n xi)

For large n, the conditional mean is approximately 1 over the sample average of
the service times. This makes sense, since 1 over the average service time is what we
generally mean by service rate.
Gamma applications

Survival analysis: We can use it to model the waiting time till the rth event/success. If X is the time that
passes until the first success, then X has a gamma distribution with α = 1 and β = 1/λ.
This is known as an exponential distribution. If, instead, we are interested in the time
taken for the rth success, this has a gamma density with α = r and β = 1/λ.

Related to the Poisson distribution: If the variable Y is the number of successes
(deaths, for example) in a given time period t and has a Poisson density with parameter
μ, the rate of success is given by λ = μ/t.

Example: A bottling plant breaks down, on average, twice every four weeks. We want the
probability that the number of breakdowns X ≤ 3 in the next four weeks. We have μ = 2
and the breakdown rate λ = 1/2 per week.

P(X ≤ 3) = Σ_{i=0}^3 e^{−2} 2^i/i! = .135 + .271 + .271 + .18 = .857

Suppose we wanted the probability that the machine does not break down in the next four
weeks. The time taken until the first break-down, x, must therefore be more than four
weeks. This follows a gamma (exponential) distribution with α = 1 and β = 1/λ = 2:

P(X ≥ 4) = ∫_4^∞ (1/2) e^{−x/2} dx = [−e^{−x/2}]_4^∞ = e^{−2} = .135
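Both numbers in the breakdown example are easy to reproduce (the Poisson count probability and the matching exponential tail):

```python
from math import exp, factorial

mu = 2.0  # expected breakdowns in four weeks
p_at_most_3 = sum(exp(-mu) * mu**i / factorial(i) for i in range(4))  # about .857
p_no_breakdown = exp(-mu)  # P(first breakdown takes more than four weeks), about .135
```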
Additive properties: if X1, . . . , Xn are independent with Xi ~ Gamma(αi, β), then

Σ_{i=1}^n Xi ~ Gamma(Σ_{i=1}^n αi, β)

and if X ~ Gamma(α, β) and c > 0, then

Y = cX ~ Gamma(α, cβ)

Both these can be easily proved using the gamma MGF and applying the MGF uniqueness
theorem: in the first case the MGF of Y is the product of the individual MGFs, i.e.

MY(t) = ∏_{i=1}^n MXi(t) = ∏_{i=1}^n (1 − βt)^{−αi} = (1 − βt)^{−Σ_{i=1}^n αi} for t < 1/β

For the second result, MY(t) = McX(t) = MX(ct) = (1 − cβt)^{−α} for t < 1/(cβ).
The exponential distribution is the gamma with α = 1:

f(x) = (1/β) e^{−x/β} I(0,∞)(x)

Moments: μ = β, σ² = β²

MGF: MX(t) = (1 − βt)^{−1} for t < 1/β
The chi-square distribution with v degrees of freedom is the gamma with α = v/2 and β = 2:

f(x) = [1/(2^{v/2} Γ(v/2))] x^{v/2 − 1} e^{−x/2} I(0,∞)(x)

Moments: μ = v, σ² = 2v

Applications:

Notice that for v = 2, the chi-square density is equivalent to the exponential density with
β = 2. It is therefore decreasing for this value of v and hump-shaped for higher values.

The χ²_v is especially useful in problems of statistical inference because if we have v
independent random variables Xi ~ N(0, 1), their sum of squares Σ_{i=1}^v Xi² ~ χ²_v. Many of the estimators we
use in our models fit this case (i.e. they can be expressed as sums of squares of independent normal
variables).
The normal distribution:

f(x; μ, σ²) = [1/(σ√(2π))] e^{−(1/2)((x−μ)/σ)²} I(−∞,+∞)(x)

MGF: MX(t) = e^{μt + σ²t²/2}

The MGF can be used to derive the moments: E(X) = μ and the variance is σ².

As can be seen from the p.d.f., the distribution is symmetric around μ, where it achieves its
maximum value. This is therefore also the median and the mode of the distribution.

The normal distribution with zero mean and unit variance is known as the standard normal
distribution and is of the form: f(x; 0, 1) = (1/√(2π)) e^{−x²/2} I(−∞,+∞)(x)

The tails of the distribution are thin: 68% of the total probability lies within one σ of the
mean, 95.4% within 2σ and 99.7% within 3σ.
Deriving the normal MGF:

M(t) = E(e^{tX}) = ∫ e^{tx} [1/(σ√(2π))] e^{−(x−μ)²/(2σ²)} dx

Completing the square in the exponent,

tx − (x − μ)²/(2σ²) = μt + σ²t²/2 − [x − (μ + σ²t)]²/(2σ²)

so

MX(t) = C e^{μt + σ²t²/2}, where C = ∫ [1/(σ√(2π))] e^{−[x−(μ+σ²t)]²/(2σ²)} dx = 1

since the integrand is a normal density with μ replaced by (μ + σ²t).

Differentiating M(t) = e^{μt + σ²t²/2} gives

M′(t) = M(t)(μ + σ²t) and M″(t) = M(t)(μ + σ²t)² + σ²M(t)

so that M′(0) = μ and M″(0) = μ² + σ².
RESULT 1: If X ~ N(μ, σ²), then Z = (X − μ)/σ ~ N(0, 1); this follows from Result 2 below with
a = 1/σ and b = −μ/σ. Therefore the MGF of Z is e^{t²/2}.

An important implication of the above result is that if we are interested in any distribution in
this class of normal distributions, we only need to be able to compute integrals for the standard
normal; these are the tables you'll see at the back of most textbooks.

Example: The kilometres per litre of fuel achieved by a new Maruti model, X ~ N(17, .25). What
is the probability that a new car will achieve between 16 and 18 kilometres per litre?

Answer: P(16 ≤ X ≤ 18) = P((16 − 17)/.5 ≤ z ≤ (18 − 17)/.5) = P(−2 ≤ z ≤ 2) = 1 − 2(.0228) = .9544
Transformations of Normals...2

RESULT 2: Let X ~ N(μ, σ²) and Y = aX + b, where a and b are given constants and a ≠ 0;
then Y has a normal distribution with mean aμ + b and variance a²σ².

Proof: The MGF of Y can be expressed as MY(t) = e^{bt} MX(at) = e^{bt} e^{aμt + (1/2)σ²a²t²} = e^{(aμ+b)t + (1/2)(aσ)²t²}.
This is simply the MGF for a normal distribution with mean aμ + b and variance a²σ².

RESULT 3: If X1, . . . , Xk are independent and Xi has a normal distribution with mean μi
and variance σi², then Y = X1 + · · · + Xk has a normal distribution with mean μ1 + · · · + μk
and variance σ1² + · · · + σk².

Proof: Write the MGF of Y as the product of the MGFs of the Xi's and gather linear and
squared terms separately to get the desired result.

We can combine these two results to derive the distribution of the sample mean:

RESULT 4: Suppose that the random variables X1, . . . , Xn form a random sample from a
normal distribution with mean μ and variance σ², and let X̄n denote the sample mean.
Then X̄n ~ N(μ, σ²/n).
The square of a standard normal: let X ~ N(0, 1) and Y = X². Then

MY(t) = E(e^{tX²}) = ∫ e^{x²t} (1/√(2π)) e^{−x²/2} dx = ∫ (1/√(2π)) e^{−(1/2)x²(1−2t)} dx = 1/√(1 − 2t) for t < 1/2

(substituting u = x√(1 − 2t) in the integral, or noting that the integrand is proportional to a
normal density with variance 1/(1 − 2t)).

The MGF obtained is that of a χ² random variable with v = 1, since the χ² MGF is given by
(1 − 2t)^{−v/2}.
More generally, if X1, . . . , Xn are independent standard normal variables, the MGF of Σ_{i=1}^n Xi² is

∏_{i=1}^n MXi²(t) = ∏_{i=1}^n (1 − 2t)^{−1/2} = (1 − 2t)^{−n/2} for t < 1/2

which is the MGF of a χ² random variable with v = n. This is the reason that the parameter v is
called the degrees of freedom: there are n freely varying random variables whose sum of squares
represents a χ²_v-distributed random variable.
'
q2
1
p
e 2
2
21 2 1
where
q=
x y y 2 i
1 h x 1 2
1
2
2
2
+
2
1
1
1
2
2
&
Page 30
%
Rohini Somanathan
Page 0
'
Rohini Somanathan
&
Page 1
%
Rohini Somanathan
'
Defining a Statistic

Definition: Any real-valued function T = r(X1, . . . , Xn) is called a statistic.

Notice that:

- a statistic is itself a random variable
- we've considered several functions of random variables whose distributions are well defined,
such as Y = Σ_{i=1}^n Xi, where each Xi has a Bernoulli distribution with parameter p, which was
shown to have a binomial distribution, and Y = Σ_{i=1}^n Xi², where each Xi has a standard normal
distribution, which was shown to have a χ²_n distribution, etc...
- Only some of these functions of random variables are statistics (why?). This distinction is
important because statistics have sample counterparts.
- In a problem of estimating an unknown parameter θ, our estimator will be a statistic
whose value can be regarded as an estimate of θ.
- It turns out that for large samples, the distributions of some statistics, such as the sample
mean, are well-known.
Markov's Inequality

We begin with some useful inequalities which provide us with distribution-free bounds on the
probability of certain events; they are useful in proving the law of large numbers, one of our two
main large-sample theorems.

Markov's Inequality: Let X be a random variable with density function f(x) such that
P(X ≥ 0) = 1. Then for any given number t > 0,

P(X ≥ t) ≤ E(X)/t

This inequality obviously holds for t ≤ E(X) (why?). Its main interest is in bounding the
probability in the tails. For example, if the mean of X is 1, the probability of X taking values
bigger than 100 is less than .01. This is true irrespective of the distribution of X; this is what
makes the result powerful.
Chebyshev's Inequality

This is a special case of Markov's inequality and relates the variance of a distribution to the
probability associated with deviations from the mean.

Chebyshev's Inequality: Let X be a random variable such that the distribution of X has a finite
variance σ² and mean μ. Then, for every t > 0,

P(|X − μ| ≥ t) ≤ σ²/t²

or equivalently,

P(|X − μ| < t) ≥ 1 − σ²/t²

Proof. Use Markov's inequality with Y = (X − μ)² and use t² in place of the constant t. Then Y takes only
non-negative values and E(Y) = Var(X) = σ².

In particular, this tells us that for any random variable, the probability that values taken by the
variable will be more than 3 standard deviations away from the mean cannot exceed 1/9:

P(|X − μ| ≥ 3σ) ≤ 1/9

For most distributions, this upper bound is considerably higher than the actual probability of
this event.
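For a concrete comparison, the Chebyshev bound of 1/9 for a 3σ deviation versus the exact probability when the variable happens to be normal:

```python
from math import erf, sqrt

def norm_cdf(z):  # standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

chebyshev_bound = 1 / 9                  # holds for every distribution with finite variance
normal_actual = 2 * (1 - norm_cdf(3.0))  # about .0027 for a normal random variable
```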
Example: Let X be uniform on (−√3, √3), so f(x) = (1/(2√3)) I_{(−√3,√3)}(x), μ = 0 and
σ² = (b − a)²/12 = (2√3)²/12 = 1. If t = 3/2, then

Pr(|X − μ| ≥ 3/2) = Pr(|X| ≥ 3/2) = 1 − ∫_{−3/2}^{3/2} (1/(2√3))dx = 1 − √3/2 ≈ .13

while the Chebyshev bound is σ²/t² = 1/(3/2)² = 4/9 ≈ .44, considerably larger than the actual
probability.
The sample mean: let X̄n = (1/n)(X1 + · · · + Xn), where the Xi are a random sample from a
distribution with mean μ and variance σ². Then

E(X̄n) = (1/n) Σ_{i=1}^n E(Xi) = μ

Var(X̄n) = (1/n²) Var(Σ_{i=1}^n Xi) = (1/n²) Σ_{i=1}^n Var(Xi) = (1/n²)(nσ²) = σ²/n

We've therefore learned something about the distribution of the sample mean, irrespective of the
distribution from which the sample is drawn:

- Its expectation is equal to that of the population.
- It is more concentrated around its mean value than was the original distribution.
- The larger the sample, the lower the variance of X̄n.
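The σ²/n result can be seen in simulation with uniform draws, for which σ² = 1/12 (a sketch; the sample sizes are arbitrary):

```python
import random

random.seed(3)

def var_of_sample_mean(n, trials=20_000):
    # Simulate many sample means of n uniform(0, 1) draws and return their variance.
    means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]
    m = sum(means) / trials
    return sum((x - m) ** 2 for x in means) / trials

v5 = var_of_sample_mean(5)    # close to (1/12)/5
v20 = var_of_sample_mean(20)  # close to (1/12)/20
```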
Example: Suppose Xi has variance σ² = 4 and we would like the variance of the sample mean to be .01. Since Var(X̄n) = 4/n, setting 4/n = .01 we take n = 400.
Example: Suppose each Xi follows a Bernoulli distribution with p = 1/2, so σ² = 1/4, and we want Pr(|X̄n − p| < .1) ≥ 0.7. By Chebyshev's inequality,
Pr(|X̄n − p| < .1) ≥ 1 − σ²/(nt²) = 1 − (1/4)/(n(.1)²) = 1 − 25/n
We therefore need 1 − 25/n ≥ 0.7. This gives us n = 84.
If we compute these probabilities directly from the distribution of the total number of successes T = Σ_{i=1}^n Xi, which follows a binomial distribution, we get Pr(6 ≤ T ≤ 9) ≈ .7 when n = 15, so if we knew that Xi followed a Bernoulli distribution we would take this sample size for the desired level of precision in our estimate of X̄n.
This illustrates the trade-off between more efficient parametric procedures and more robust
non-parametric ones.
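Both sample-size calculations can be reproduced directly; this sketch assumes the Bernoulli(1/2) setup of the example:

```python
import math

# Chebyshev: P(|Xbar_n - 1/2| < .1) >= 1 - 25/n >= 0.7  =>  n >= 25/0.3
n_chebyshev = math.ceil(25 / 0.3)   # 84

def binom_pmf(n, k, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Exact calculation: with n = 15, P(6 <= T <= 9) for T ~ Bin(15, 1/2)
prob = sum(binom_pmf(15, k, 0.5) for k in range(6, 10))

print(n_chebyshev, round(prob, 3))   # 84 0.698
```

The robust (Chebyshev) answer asks for more than five times as many observations as the parametric one.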
We now need to modify our notion of convergence, since the sequence {Yn } no longer defines
a given sequence of real numbers, but rather, many different real number sequences,
depending on the realizations of X1 , . . . , Xn .
Convergence questions can no longer be verified unequivocally since we are not referring to
a given real sequence, but they can be assigned a probability of occurrence based on the
probability space for random variables involved.
There are several types of random variable convergence discussed in the literature. We'll focus on two of these:
Convergence in Distribution
Convergence in Probability
Convergence in Distribution
Definition: Let {Yn} be a sequence of random variables, and let {Fn} be the associated sequence of cumulative distribution functions. If there exists a cumulative distribution function F such that Fn(y) → F(y) at every y at which F is continuous, then F is called the limiting CDF of {Yn}. Letting Y have the distribution function F, we say that Yn converges in distribution to the random variable Y and write Yn →d Y.
Result: Let Xn →d X, and let the random variable g(X) be defined by a function g that is continuous; then g(Xn) →d g(X).
Convergence in Probability
This concept formalizes the idea that we can bring the outcomes of the random variable Yn
arbitrarily close to the outcomes of the random variable Y for large enough n.
Definition: The sequence of random variables {Yn} converges in probability to the random variable Y iff, for every ε > 0,
lim_{n→∞} P(|Yn − Y| < ε) = 1
We write plim Yn = Y.
Example: Suppose Yn = (1 + 1/n)X + 3 and X ~ N(1, 2). Using the properties of the plim operator, we have
plim Yn = plim(1 + 1/n) · X + plim(3) = X + 3 ~ N(4, 2)
Define the sample mean X̄n = (1/n) Σ_{i=1}^n Xi.
WLLN: Let {Xn} be a sequence of i.i.d. random variables with finite mean μ and variance σ². Then X̄n →p μ.
Proof: Using Chebyshev's Inequality,
P(|X̄n − μ| < ε) ≥ 1 − σ²/(nε²)
Hence lim_{n→∞} P(|X̄n − μ| < ε) = 1, or plim X̄n = μ.
The WLLN will allow us to use the sample mean as an estimate of the population mean, under
very general conditions.
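A quick simulation of the WLLN; the Exponential(1) population, ε = .1 and the replication counts are illustrative choices, not from the notes:

```python
import random
import statistics

random.seed(1)

# Fraction of sample means within eps of mu = 1 (Xi ~ Exponential(1))
def coverage(n, eps=0.1, reps=2000):
    hits = sum(abs(statistics.fmean(random.expovariate(1.0) for _ in range(n)) - 1) < eps
               for _ in range(reps))
    return hits / reps

c10, c1000 = coverage(10), coverage(1000)
print(c10, c1000)   # the second is much closer to 1
```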
Lindeberg-Levy CLT: Let {Xn} be a sequence of i.i.d. random variables with mean μ and finite variance σ². Then
√n(X̄n − μ)/σ →d N(0, 1)
Lindeberg-Levy CLT: applications
Approximating binomial probabilities via the normal distribution: Let {Xn} be a sequence of i.i.d. Bernoulli random variables with parameter p. By the LLCLT,
Zn = (Σ_{i=1}^n Xi − np)/√(np(1 − p)) →d N(0, 1)
In this case the standardizing constants are a = √(np(1 − p)) and b = np, and since Zn is asymptotically standard normal,
Σ_{i=1}^n Xi ~a N(np, np(1 − p))
Approximating χ² probabilities: Suppose each Xi ~ χ²₂, so that Σ_{i=1}^n Xi ~ χ²_{2n}, with μ = 2 and σ = 2. Then √n(X̄n − 2)/2 →d N(0, 1).
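The binomial approximation can be checked numerically; n = 100, p = 1/2 and the cutoff 55 are my own illustrative choices:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 100, 0.5
mu, sd = n * p, math.sqrt(n * p * (1 - p))    # 50 and 5

# Exact P(T <= 55) for T ~ Bin(100, 1/2) vs the CLT approximation
exact = sum(math.comb(n, k) for k in range(56)) * 0.5 ** n
approx = norm_cdf((55.5 - mu) / sd)           # with continuity correction

print(round(exact, 4), round(approx, 4))
```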
Topic 6: Estimation
Rohini Somanathan
Course 003, 2014-2015
Random Samples
We cannot usually look at the population as a whole because
it may be too big and therefore expensive and time-consuming to survey
analyzing the sample may destroy the product/organism (you need brain cells to figure out how the brain works, or to crash cars to know how sturdy they are)
We would like to choose a sample which is representative of the population or process that interests us. A common procedure with many desirable properties is random sampling: all objects in the population have an equal chance of being selected.
Haphazard sampling procedures often result in non-random samples.
Example: We have a bag of sweets and chocolates of different types (eclairs, five-stars, gems...) and want to estimate the average weight of items in the bag. If we pass the bag around and each student puts their hand in and picks 5 items, how do you think these sample averages would compare with the true average?
Definition: Let f(x) be the density function of a continuous random variable X. Consider a sample of size n from this distribution. We can think of the first value drawn as a realization of the random variable X1, similarly for X2, ..., Xn. (x1, ..., xn) is a random sample if
f(x1, ..., xn) = f(x1)f(x2)...f(xn) = Π_{i=1}^n f(xi)
Statistical Models
Definition: A statistical model for a random sample consists of a parametric functional form f(x; θ), together with a parameter space Ω which defines the potential candidates for θ.
Examples: We may specify that our sample comes from
a Bernoulli distribution, with Ω = {p : p ∈ [0, 1]}
a Normal distribution, with Ω = {(μ, σ²) : μ ∈ (−∞, ∞), σ > 0}
Note that Ω could be much more restrictive in each of these examples. What matters for our purposes is that
Ω contains the true value of the parameter
the parameters are identifiable, meaning that the probability of generating the given sample is different under two distinct parameter values. If, for a sample x and parameter values θ1 ≠ θ2, f(x; θ1) = f(x; θ2), we'll never be able to use the sample to reach a conclusion on which of these values is the true parameter.
MSE(θ̂) = E[(θ̂ − θ)²] = E[((θ̂ − E(θ̂)) + (E(θ̂) − θ))²]
= E[(θ̂ − E(θ̂))²] + (E(θ̂) − θ)² + 2(E(θ̂) − θ)E[θ̂ − E(θ̂)]
= Var(θ̂) + (Bias(θ̂))² + 0
= Var(θ̂) + Bias(θ̂)²
A Minimum Variance Unbiased Estimator (MVUE) is an estimator which has the smallest
variance among the class of unbiased estimators.
A Best Linear Unbiased Estimator (BLUE) is an estimator which has the smallest variance among the class of linear unbiased estimators (the estimates must be linear functions of sample values).
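The decomposition MSE = Var + Bias² can be verified by simulation; the shrunk estimator 0.9·X̄n and the N(5, 4) population are hypothetical choices for illustration:

```python
import random
import statistics

random.seed(2)

# Shrunk (biased) estimator 0.9 * Xbar_n of the mean of N(5, 4)
theta, n, reps = 5.0, 10, 50000
estimates = [0.9 * statistics.fmean(random.gauss(5.0, 2.0) for _ in range(n))
             for _ in range(reps)]

mse = statistics.fmean((e - theta) ** 2 for e in estimates)
var = statistics.pvariance(estimates)
bias = statistics.fmean(estimates) - theta

print(round(mse, 3), round(var + bias ** 2, 3))   # the two quantities agree
```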
Example: Bernoulli trials, where each observation has p.f. f(x|θ) = θˣ(1 − θ)^{1−x} for x = 0, 1.
For any observed values x1, ..., xn, where each xi is either 0 or 1, the likelihood function is given by:
fn(x|θ) = Π_{i=1}^n θ^{xi}(1 − θ)^{1−xi} = θ^{Σ xi}(1 − θ)^{n − Σ xi}
The value of θ that will maximize this will be the same as that which maximizes the log of the likelihood function, L(θ), which is given by:
L(θ) = (Σ_{i=1}^n xi) ln θ + (n − Σ_{i=1}^n xi) ln(1 − θ)
The first order condition for an extreme point is given by (Σ xi)/θ − (n − Σ xi)/(1 − θ) = 0, and solving this, we get
θ̂_MLE = (1/n) Σ_{i=1}^n xi
Confirm that the second derivative of L(θ) is in fact negative, so we do have a maximum.
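A numerical check that the Bernoulli log-likelihood peaks at the sample mean; the 0/1 sample here is made up:

```python
import math

xs = [1, 0, 1, 1, 0, 1, 0, 1]          # hypothetical sample of 0/1 draws
n, s = len(xs), sum(xs)                # n = 8, sum = 5

def loglik(theta):
    return s * math.log(theta) + (n - s) * math.log(1 - theta)

# Grid search: the log-likelihood is maximized at the sample mean s/n
grid = [k / 1000 for k in range(1, 1000)]
theta_hat = max(grid, key=loglik)

print(theta_hat, s / n)   # 0.625 0.625
```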
Example: Normal distribution with unknown mean μ and known variance. Each observation has density
f(x|μ) = (1/(√(2π)σ)) e^{−(1/2)((x−μ)/σ)²} I₍₋∞,₊∞₎(x)
so the likelihood is
fn(x|μ) = (1/(2πσ²)^{n/2}) exp(−(1/(2σ²)) Σ_{i=1}^n (xi − μ)²)
fn(x|μ) will be maximized by the value of μ which minimizes the following expression in μ:
Q(μ) = Σ_{i=1}^n (xi − μ)² = Σ_{i=1}^n xi² − 2μ Σ_{i=1}^n xi + nμ²
The first order condition is now: −2 Σ_{i=1}^n xi + 2nμ = 0, so μ̂_MLE is the sample mean (1/n) Σ_{i=1}^n Xi.
Example: Normal distribution with both parameters unknown. The likelihood is
fn(x|μ, σ²) = (1/(2πσ²)^{n/2}) exp(−(1/(2σ²)) Σ_{i=1}^n (xi − μ)²)
To find the M.L.E.s of both μ and σ², it is easiest to maximize the log of the likelihood function:
L(μ, σ²) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{i=1}^n (xi − μ)²
We now have two first-order conditions obtained by setting each of the following partial derivatives equal to zero:
∂L/∂μ = (1/σ²)(Σ_{i=1}^n xi − nμ)   (1)
∂L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (xi − μ)²   (2)
The first equation can be solved to obtain μ̂ = x̄n and substituting this into the second equation, we obtain σ̂² = (1/n) Σ_{i=1}^n (xi − x̄n)².
The maximum likelihood estimators are therefore μ̂ = X̄n and σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄n)².
Example: Uniform distribution on [0, θ], where each observation has density f(x|θ) = (1/θ) I_{[0,θ]}(x). The likelihood is fn(x|θ) = 1/θⁿ for θ ≥ max(x1, ..., xn), and zero otherwise.
This function is decreasing in θ and is therefore maximized at the smallest admissible value of θ, which is given by θ̂ = max(X1, ..., Xn).
Notice that if we modify the domain of the density to be (0, θ) instead of [0, θ], then no M.L.E. exists since the maximum sample value is no longer an admissible candidate for θ.
Now suppose the random sample is from a uniform distribution on [θ, θ + 1]. Now θ could lie anywhere in the interval [max(x1, ..., xn) − 1, min(x1, ..., xn)] and the method of maximum likelihood does not provide us with a unique estimate.
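A short simulation of the MLE for Uniform[0, θ]; the true θ = 2.5 and n = 200 are illustrative. Since max(Xi) ≤ θ always, the estimate sits just below the truth (the MLE here is biased downwards):

```python
import random

random.seed(3)

theta = 2.5                                   # true parameter (unknown in practice)
sample = [random.uniform(0, theta) for _ in range(200)]
theta_hat = max(sample)                       # the MLE for Uniform[0, theta]

print(round(theta_hat, 3))                    # slightly below 2.5
```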
Computation of MLEs
The form of a likelihood function is often complicated enough to make numerical computation necessary.
Consider, for example, a sample of size n from the following Gamma distribution and suppose we would like an MLE of α:
f(x; α) = (1/Γ(α)) x^{α−1} e^{−x},  x > 0
The likelihood is
fn(x|α) = (1/Γ(α)ⁿ) (Π_{i=1}^n xi)^{α−1} e^{−Σ_{i=1}^n xi}
and the first order condition for maximizing its log is
Γ′(α)/Γ(α) = (1/n) Σ_{i=1}^n ln xi
The LHS is the digamma function which is tabulated and now embedded in software packages.
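As a sketch of the numerical computation (Python; the digamma function is approximated by a central difference of `math.lgamma`, and the Gamma(3, 1) sample is an illustrative choice):

```python
import math
import random

random.seed(4)

def digamma(a, h=1e-5):
    """Gamma'(a)/Gamma(a), approximated by a central difference of lgamma."""
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

alpha_true = 3.0
xs = [random.gammavariate(alpha_true, 1.0) for _ in range(5000)]
mean_log = sum(math.log(x) for x in xs) / len(xs)

# Solve digamma(alpha) = (1/n) sum(log x_i) by bisection (digamma is increasing)
lo, hi = 0.1, 20.0
for _ in range(60):
    mid = (lo + hi) / 2
    if digamma(mid) < mean_log:
        lo = mid
    else:
        hi = mid
alpha_hat = (lo + hi) / 2

print(round(alpha_hat, 2))   # close to 3
```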
By the invariance property of MLEs:
the M.L.E. of the standard deviation is the square root of the sample variance
the M.L.E. of E(X²) is equal to the sample variance plus the square of the sample mean, i.e. since E(X²) = σ² + μ², the M.L.E. of E(X²) = σ̂² + μ̂²
MLEs are also consistent: plim θ̂n = θ.
Note: MLEs are not, in general, unbiased.
Example: The MLE of the variance of a normally distributed variable is given by
σ̂²n = (1/n) Σ_{i=1}^n (Xi − X̄n)²
Its expectation is
E(σ̂²n) = E[(1/n) Σ(Xi² − 2XiX̄n + X̄n²)] = E[(1/n)(Σ Xi² − 2X̄n Σ Xi + nX̄n²)] = E[(1/n)(Σ Xi² − 2nX̄n² + nX̄n²)] = E[(1/n)(Σ Xi² − nX̄n²)]
= (1/n)[Σ E(Xi²) − nE(X̄n²)] = (1/n)[n(μ² + σ²) − n(σ²/n + μ²)] = σ²(n − 1)/n
The MLE is therefore biased downwards; an unbiased estimator is
(1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)²
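The downward bias is easy to see in a simulation; σ² = 4 and n = 5 are illustrative choices:

```python
import random
import statistics

random.seed(5)

sigma2, n, reps = 4.0, 5, 40000

def mle_var(xs):
    m = statistics.fmean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

avg = statistics.fmean(mle_var([random.gauss(0.0, 2.0) for _ in range(n)])
                       for _ in range(reps))

print(round(avg, 2), sigma2 * (n - 1) / n)   # both should be near 3.2, not 4
```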
Sufficient Statistics
We have seen that M.L.E.s may not exist, or may not be unique. Where should our search for other estimators start? We'll see that a natural starting point is the set of sufficient statistics for the sample.
Suppose that in a specific estimation problem, two statisticians A and B would like to estimate θ; A observes the realized values of X1, ..., Xn, while B only knows the value of a certain statistic T = r(X1, ..., Xn).
A can now choose any function of the observations (X1, ..., Xn) whereas B can choose only functions of T. In general, A will be able to choose a better estimator than B. Suppose however that B does just as well as A because the single function T summarizes all the relevant information in the sample for choosing a suitable estimator of θ. A statistic T with this property is called a sufficient statistic.
In this case, given T = t, we can generate an alternative sample X′1, ..., X′n in accordance with this conditional joint distribution (auxiliary randomization). Suppose A uses δ(X1, ..., Xn) as an estimator. Well, B could just use δ(X′1, ..., X′n), which has the same probability distribution as A's estimator.
1. Sampling from a Poisson distribution: If X1, ..., Xn are a random sample from a Poisson distribution with mean θ, the joint p.f. is
fn(x|θ) = Π_{i=1}^n (e^{−θ} θ^{xi}/xi!) = (Π_{i=1}^n 1/xi!) e^{−nθ} θ^y
where y = Σ_{i=1}^n xi. We've expressed fn(x|θ) as the product of a function that does not depend on θ and a function that depends on θ but depends on the observed vector x only through the value of y. It follows that T = Σ_{i=1}^n Xi is a sufficient statistic for θ.
2. Sampling from a normal distribution with known variance and unknown mean: Let (X1, ..., Xn) form a random sample from a normal distribution for which the value of the mean μ is unknown and the variance σ² is known. The joint p.f. fn(x|μ) of X1, ..., Xn has already been derived as:
fn(x|μ) = (1/(2πσ²)^{n/2}) exp(−(1/(2σ²)) Σ_{i=1}^n xi²) · exp((μ/σ²) Σ_{i=1}^n xi − nμ²/(2σ²))
The above expression is a product of a function that does not depend on μ and a function that depends on μ and on x only through the value of Σ xi. It follows that T = Σ_{i=1}^n Xi is a sufficient statistic for μ.
3. When both μ and σ² are unknown, fn(x|μ, σ²) = (1/(2πσ²)^{n/2}) exp(−(1/(2σ²)) Σ(xi − μ)²) depends on x only through Σ xi and Σ xi², so T1 = Σ_{i=1}^n Xi and T2 = Σ_{i=1}^n Xi² are jointly sufficient statistics for (μ, σ²).
If T1, ..., Tk are jointly sufficient for some parameter vector θ and the statistics T′1, ..., T′k are obtained from these by a one-to-one transformation, then T′1, ..., T′k are also jointly sufficient. So the sample mean and sample variance are also jointly sufficient in the above example, since T1 = nT′1 and T2 = n(T′2 + T′1²).
Since the order of the terms in this product is irrelevant (we need to know only the values obtained and not which one was X1, ...), we could as well write this expression as
fn(x|θ) = Π_{i=1}^n f(yi|θ)
where y1 ≤ y2 ≤ ... ≤ yn are the ordered sample values, so the order statistics are always jointly sufficient. For some distributions this may be the simplest set of jointly sufficient statistics, in which case they are minimally jointly sufficient.
Notice that if a sufficient statistic r(x) exists, the MLE must be a function of it (this follows from the factorization criterion). It turns out that if the MLE is a sufficient statistic, it is minimally sufficient.
Implications
Suppose we are picking a sample from a normal distribution. We may be tempted to use Y_{(n+1)/2} as an estimate of the median m and Yn − Y1 as an estimate of the variance. Yet we know that we would do better using the sample mean for m, and the sample variance must be a function of Σ Xi and Σ Xi².
A statistic is always sufficient with respect to a particular probability distribution f(x|θ) and may not be sufficient w.r.t. another, say g(x|θ). Instead of choosing functions of the sufficient statistic we obtain in one case, we may want to find a robust estimator that does well for many possible distributions.
In non-parametric inference, we do not know the likelihood function, and so our estimators
are based on functions of the order statistics.
Sampling distributions of estimators depend on sample size, and we want to know exactly
how the distribution changes as we change this size so that we can make the right trade-offs
between cost and accuracy.
We begin with a set of results which help us derive the sampling distributions of the
estimators we are interested in.
If X1, ..., Xn form a random sample from a normal distribution, the sample mean is itself normally distributed with mean μ and variance σ²/n, being a linear combination of independent normal variables.
Theorem: If X1, ..., Xn form a random sample from a normal distribution with mean μ and variance σ², then the sample mean X̄n and the sample variance (1/n) Σ_{i=1}^n (Xi − X̄n)² are independent random variables, and
X̄n ~ N(μ, σ²/n)
Σ_{i=1}^n (Xi − X̄n)²/σ² ~ χ²_{n−1}
The t-distribution
Let Z ~ N(0, 1), let Y ~ χ²ᵥ, and let Z and Y be independent random variables. Then
X = Z/√(Y/v) ~ tᵥ
with density
f(x) = [Γ((v + 1)/2)/(Γ(v/2)√(vπ))] (1 + x²/v)^{−(v+1)/2}
Theorem: Let Sn² = Σ_{i=1}^n (Xi − X̄n)². Then
U = √n(X̄n − μ)/√(Sn²/(n − 1)) ~ t_{n−1}
Proof: We know that √n(X̄n − μ)/σ ~ N(0, 1) and that Sn²/σ² ~ χ²_{n−1}. Dividing the first random variable by the square root of the second, divided by its degrees of freedom, the σ in the numerator and denominator cancels to obtain U.
Implication: We may not be able to make statements about the difference between the population mean μ and the sample mean X̄n using the normal distribution, because even though √n(X̄n − μ)/σ ~ N(0, 1), σ may be unknown. This result allows us to use its estimate σ̂² = Σ_{i=1}^n (Xi − X̄n)²/n, since (X̄n − μ)/(σ̂/√(n − 1)) ~ t_{n−1}.
As the sample size gets larger, the t-density looks more and more like that of a standard normal Z ~ N(0, 1). For instance, the value of x for which the distribution function is equal to .55 is .129 for t₁₀, .127 for t₂₀ and .126 for the standard normal distribution. The differences between these values increase for higher values of their distribution functions (why?)
To see why this might happen, consider the variable we just derived: √((n − 1)/n) · √n(X̄n − μ)/σ̂ ~ t_{n−1}. As n grows, √((n − 1)/n) is close to 1 and σ̂ is close to σ.
Since √n(X̄n − μ)/σ ~ N(0, 1),
Pr(−2 < √n(X̄n − μ)/σ < 2) = .955
But this event is equivalent to the events
μ − 2σ/√n < X̄n < μ + 2σ/√n  and  X̄n − 2σ/√n < μ < X̄n + 2σ/√n
With known σ, each of the random variables X̄n − 2σ/√n and X̄n + 2σ/√n are statistics. Therefore, we have derived a random interval within which the population parameter lies with probability .955, i.e.
Pr(X̄n − 2σ/√n < μ < X̄n + 2σ/√n) = .955
Notice that there are many intervals with the same coverage probability; this is the shortest one.
Now, given our sample, our statistics take particular values and the resulting interval either contains or does not contain μ. We can therefore no longer talk about the probability that it contains μ because the experiment has already been performed.
We say that (x̄n − 2σ/√n, x̄n + 2σ/√n) is a 95.5% confidence interval for μ. Alternatively, we may say that μ lies in the above interval with confidence .955, or that the above interval is a confidence interval for μ with confidence coefficient .955.
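The .955 coverage claim can be checked by simulating many samples and counting how often the random interval covers μ; the N(10, 9) population and n = 25 are my own choices:

```python
import math
import random

random.seed(6)

mu, sigma, n, reps = 10.0, 3.0, 25, 10000
half = 2 * sigma / math.sqrt(n)     # the 2*sigma/sqrt(n) half-width

covered = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    covered += xbar - half < mu < xbar + half
coverage = covered / reps

print(coverage)   # close to .9545
```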
Example 1: the resulting interval is (7.164 − 1.282(10/√40), 7.164 + 1.282(10/√40)).
Example 2: Let X̄ denote the sample mean of a random sample of size 25 from a distribution with variance 100 and mean μ. In this case σ/√n = 10/5 = 2 and, making use of the central limit theorem, the following statement is approximately true:
Pr(−1.96 < (X̄n − μ)/2 < 1.96) = .95  or  Pr(X̄n − 3.92 < μ < X̄n + 3.92) = .95
If the sample mean is given by x̄n = 67.53, an approximate 95% confidence interval for the population mean μ is given by (63.61, 71.45).
Example 3: Suppose we are interested in a confidence interval for the mean of a normal distribution but do not know σ². We know that
(X̄n − μ)/(σ̂/√(n − 1)) ~ t_{n−1}
so we use quantiles of the t_{n−1} distribution in place of the standard normal quantiles.
Comparing two means: Suppose X1, ..., Xn and Y1, ..., Ym are independent random samples from normal distributions with means μ1 and μ2 and common variance σ². Then X̄n − Ȳm is normal with mean μ1 − μ2 and variance σ²/n + σ²/m, and its standardized version will form the numerator of the T random variable that we are going to use.
With S1² = Σ(Xi − X̄n)² and S2² = Σ(Yi − Ȳm)², we have S1²/σ² ~ χ²_{n−1} and S2²/σ² ~ χ²_{m−1}, so (S1² + S2²)/σ² ~ χ²_{n+m−2}. Therefore
T = [(X̄n − Ȳm) − (μ1 − μ2)] / [√((S1² + S2²)/(n + m − 2)) · √(1/n + 1/m)] ~ t_{n+m−2}
The F-distribution
RESULT: Let Y ~ χ²ₘ, Z ~ χ²ₙ, and let Y and Z be independent random variables. Then
F = (Y/m)/(Z/n) = nY/(mZ)
has an F-distribution with m and n degrees of freedom. The F-density is given by:
f(x) = [Γ((m + n)/2) m^{m/2} n^{n/2} / (Γ(m/2)Γ(n/2))] x^{m/2−1} (mx + n)^{−(m+n)/2} I₍₀,∞₎(x)
m and n are sometimes referred to as the numerator and denominator degrees of freedom
respectively.
It turns out that the square of a random variable with a T distribution with n degrees of
freedom has an F distribution with (1, n) degrees of freedom.
The F test turns out to be useful in testing for differences in variances between two distributions.
Many important specification tests rely on the F-distribution (example: testing for a set of
coefficients in a linear model being equal to zero).
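The t²-to-F relation can be illustrated by simulation, building both variables from independent normals; the degrees of freedom and cutoffs are illustrative choices:

```python
import math
import random

random.seed(7)

def chi2(df):
    """Draw one chi-squared variate as a sum of df squared standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

n, reps = 5, 20000
t_sq = [(random.gauss(0, 1) / math.sqrt(chi2(n) / n)) ** 2 for _ in range(reps)]
f_1n = [chi2(1) / (chi2(n) / n) for _ in range(reps)]

# The two empirical upper-tail probabilities should agree at any cutoff
cuts = (1.0, 3.0)
tail_t = {c: sum(v > c for v in t_sq) / reps for c in cuts}
tail_f = {c: sum(v > c for v in f_1n) / reps for c in cuts}

print(tail_t, tail_f)
```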
If X has an F-distribution with m and n degrees of freedom, then 1/X has an F-distribution with n and m degrees of freedom.
This allows us to easily construct lower-tail events having probability α from upper-tail events having probability α. For the normal and t-distributions we use the symmetry of those distributions to do this. We rarely do this by hand anymore since most statistical packages give us the required output.
Statistical tests
Before deciding whether or not to accept H0, we observe a random sample. Denote by S the set of all possible outcomes X of the random sample.
A test procedure partitions the set S into two subsets, one containing the values that will lead to the acceptance of H0 and the other containing values which lead to its rejection.
A statistical test is defined by the critical region R, which is the subset of S for which H0 is rejected. The complement of this region must therefore contain all outcomes for which H0 is not rejected.
Most tests are based on values taken by a test statistic (the sample mean, the sample variance or functions of these). In this case, the critical region R is a subset of values of the test statistic for which H0 is rejected. The critical values of a test statistic are the bounds of R.
When arriving at a decision based on a sample and a test, we may make two types of errors:
H0 may be rejected when it is true (a Type I error)
H0 may be accepted when it is false (a Type II error)
A test is any rule which we use to make a decision, so we have to think of rules that help us make better decisions (in terms of these errors). We will discuss how to evaluate a test and, for some problems, we will characterize optimal tests.
display 1-binomial(200,7,.02).
Let T be a test statistic, and suppose that our test will reject the null hypothesis if T ≥ c, for some constant c. Suppose also that we desire our test to have the level of significance α0. The power function of our test is π(θ|δ) = Pr(T ≥ c|θ), and we want π(θ|δ) ≤ α0 for every θ in the null set.
Example: If the test rejects H0 when X̄ ≥ 83, its power function is
π(μ|δ) = Pr(X̄ ≥ 83 | μ) = 1 − Φ((83 − μ)/(σ/√n))
Minimizing a weighted sum of the two error probabilities, aα(δ) + bβ(δ):
aα(δ) + bβ(δ) = a Σ_{x∈R} f0(x) + b Σ_{x∈Rᶜ} f1(x) = a Σ_{x∈R} f0(x) + b[1 − Σ_{x∈R} f1(x)] = b + Σ_{x∈R} [a f0(x) − b f1(x)]
The desired function aα(δ) + bβ(δ) will be minimized when the above summation is minimized. This will happen if the critical region includes only those points for which af0(x) − bf1(x) < 0. We therefore reject H0 when the likelihood ratio f1(x)/f0(x) exceeds a/b; with a = b, we reject when f1(x)/f0(x) > 1.
Example: Based on a sample from N(μ, 1), test H0: μ = 0 against H1: μ = 1. Here
f0(x) = (2π)^{−n/2} e^{−(1/2) Σ xi²}  and  f1(x) = (2π)^{−n/2} e^{−(1/2) Σ (xi − 1)²}
so the likelihood ratio is
f1(x)/f0(x) = e^{n(x̄n − 1/2)}
The lemma tells us to use a procedure which rejects H0 when the likelihood ratio is greater than a constant k. This condition can be re-written in terms of our sample mean: x̄n > 1/2 + (1/n) log k = k′.
We want to find a value of k′ for which Pr(X̄n > k′ | μ = 0) = .05 or, alternatively, Pr(Z > k′√n) = .05 (why?), so k′ = 1.645/√n.
β(δ) = Pr(X̄n < 1.645/√n | μ = 1) = Pr(Z < 1.645 − √n)
For n = 9, we have β(δ) = 0.0877.
If instead we are interested in choosing δ to minimize 2α(δ) + β(δ), we choose k′ = 1/2 + (1/n) log 2, so with n = 9 our optimal procedure rejects H0 when X̄n > 0.577. In this case, α(δ′) = 0.0417 and β(δ′) = 0.1022 (display normal((.577-1)*3)), and the minimized value of 2α(δ) + β(δ) is 0.186.
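The error probabilities in this example can be reproduced with the standard normal CDF (written here via `math.erf`):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n = 9

# Size-.05 likelihood-ratio test: reject when Xbar_n > 1.645 / sqrt(n)
beta = norm_cdf(1.645 - math.sqrt(n))            # Pr(Z < 1.645 - sqrt(n))

# Minimizing 2*alpha + beta instead: reject when Xbar_n > 1/2 + (1/n) log 2
k = 0.5 + math.log(2) / n                        # ~ 0.577
alpha2 = 1 - norm_cdf(k * math.sqrt(n))          # Pr(Xbar_n > k | mu = 0)
beta2 = norm_cdf((k - 1) * math.sqrt(n))         # Pr(Xbar_n < k | mu = 1)

print(round(beta, 4), round(alpha2, 4), round(beta2, 4), round(2 * alpha2 + beta2, 3))
```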
Example: Testing hypotheses about a Bernoulli parameter p. Let Y = Σ_{i=1}^n Xi and y its observed value; the likelihood-ratio test rejects H0 when y exceeds a constant k′.
Now we would like to find k′ such that Pr(Y > k′|p = 0.2) = .05. We may not however be able to do this given that Y is discrete. If n = 10, we find that Pr(Y > 3|p = 0.2) = .121 and Pr(Y > 4|p = 0.2) = .033 (display 1-binomial(10,4,.2)), so we can decide to set these probabilities as the values of α(δ) and use the corresponding values of k′ for our test.
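These binomial tail probabilities can be recomputed exactly; note that Stata's 1-binomial(10,4,.2) corresponds to Pr(Y > 4):

```python
import math

def binom_tail(n, k, p):
    """P(Y > k) for Y ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k + 1, n + 1))

p3 = binom_tail(10, 3, 0.2)   # Pr(Y > 3 | p = .2)
p4 = binom_tail(10, 4, 0.2)   # Pr(Y > 4 | p = .2)
print(round(p3, 3), round(p4, 3))   # 0.121 0.033
```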
Example 2: Consider a sample of size n from a normal distribution for which the mean μ is unknown and the variance σ² is known. The joint p.d.f. is:
fn(x|μ) = (1/((2π)^{n/2} σⁿ)) exp(−(1/(2σ²)) Σ_{i=1}^n (xi − μ)²)
For μ2 > μ1, the ratio
fn(x|μ2)/fn(x|μ1) = exp{(n(μ2 − μ1)/σ²)[x̄n − (1/2)(μ2 + μ1)]}
is increasing in x̄n; therefore fn(x|μ) has a MLR in the statistic X̄n.
Course 003: Basic Econometrics, 2012-2013
Tests of one-sided alternatives
Suppose that θ0 is an element of the parameter space and consider the following hypotheses:
H0: θ ≤ θ0   H1: θ > θ0
We have the following result:
Theorem: Suppose that fn(x|θ) has a monotone likelihood ratio in the statistic T = r(X), and let c be a constant such that
Pr(T ≥ c | θ = θ0) = α0
Then the test procedure which rejects H0 if T ≥ c is a UMP test of the above hypotheses at the level of significance α0.
For hypotheses in the other direction, H0: θ ≥ θ0, H1: θ < θ0, our UMP test will now reject H0 if T ≤ c. In the first case, the power function will be monotonically increasing in θ while in the second case it will be decreasing.
Example: For a sample from a normal distribution with unknown mean μ and known variance σ², the joint p.d.f. of X1, ..., Xn has an increasing monotone likelihood ratio in the statistic X̄n. Therefore a test procedure δ1 that rejects H0 when X̄n ≥ c is a UMP test of H0: μ ≤ μ0 against H1: μ > μ0, with size α0 = Pr(X̄n ≥ c | μ = μ0).
Since X̄n has a continuous distribution, c is the 1 − α0 quantile of the distribution of X̄n given μ = μ0, that is, of the normal distribution with mean μ0 and variance σ²/n:
c = μ0 + Φ⁻¹(1 − α0) σ n^{−1/2}
where Φ⁻¹ is the quantile function of the standard normal distribution. For simplicity, write z_{α0} = Φ⁻¹(1 − α0).
The power function of this UMP test is, by definition,
π(μ|δ1) = Pr(Rejecting H0 | μ) = Pr(X̄n ≥ μ0 + z_{α0} σ n^{−1/2} | μ)
For every value of μ, the random variable Z = n^{1/2}(X̄n − μ)/σ has the standard normal distribution. Therefore
π(μ|δ1) = 1 − Φ(z_{α0} + n^{1/2}(μ0 − μ)/σ) = Φ(n^{1/2}(μ − μ0)/σ − z_{α0})
The power function π(μ|δ1) is sketched in Fig. 9.6.
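The power function just derived can be evaluated directly; μ0 = 0, σ = 1, n = 25 and α0 = .05 are illustrative values:

```python
import math

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power(mu, mu0=0.0, sigma=1.0, n=25, z_alpha=1.645):
    """pi(mu | delta_1) = Phi(sqrt(n)(mu - mu0)/sigma - z_alpha)."""
    return norm_cdf(math.sqrt(n) * (mu - mu0) / sigma - z_alpha)

size = power(0.0)   # at mu = mu0 the power equals the size, ~.05
print(round(size, 3), round(power(0.5), 3))
```

Evaluating at increasing values of μ confirms that the power function is monotonically increasing.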
In each of these pairs of hypotheses, the alternative hypothesis H1 is called a one-sided alternative because the set of possible values of the parameter under H1 lies entirely on one side of the set of possible values under the null hypothesis H0: every possible value of the parameter under H1 is larger than every possible value under H0, or vice versa.
Power functions of UMP tests
The following figures show power functions for the one-sided alternative hypotheses discussed above for the case of a sample from a normal distribution with unknown mean:
[Figure: power function rising from α0 at μ0 for H0: μ ≤ μ0 vs H1: μ > μ0, and falling to α0 at μ0 for H0: μ ≥ μ0 vs H1: μ < μ0]
Two-sided alternatives
Suppose now that instead of testing a one-sided alternative we test H0: μ = μ0 against the two-sided alternative H1: μ ≠ μ0. No UMP test exists in these cases: a test which does very well for an alternative μ2 > μ0 may do very badly for μ1 < μ0.
[Figure: power functions π(μ|δ1), π(μ|δ2), π(μ|δ3), π(μ|δ4) of several tests plotted against μ around μ0, previously sketched in Figs. 9.6 and 9.7.]
Consider tests that reject H0 when X̄n ≤ c1 or X̄n ≥ c2. As the values of c1 and c2 are decreased, the power function π(μ|δ) will become smaller for μ < μ0 and larger for μ > μ0. For α0 = 0.05, the limiting case is obtained by choosing c1 = −∞ and c2 = μ0 + 1.645σn^{−1/2}; the test procedure defined by these values is just the one-sided test δ1. Similarly, as the values of c1 and c2 are increased, the power function π(μ|δ) will become larger for μ < μ0 and smaller for μ > μ0. For α0 = 0.05, the limiting case is obtained by choosing c2 = ∞ and c1 = μ0 − 1.645σn^{−1/2}; the test procedure defined by these values is just δ2. Something between these two extreme limiting cases seems appropriate for two-sided hypotheses.
Example (Egyptian Skulls): Suppose that, in Example 9.4.1, it is equally important to reject the null hypothesis that the mean breadth equals 140 when μ < 140 as when μ > 140.