
CS 7015 - Deep Learning - Assignment 1 Shubham Patel, CS17M051

Instructions:

• This assignment is meant to help you grok certain concepts we will use in the course.
Please don’t copy solutions from any sources.
• Avoid verbosity.
• Questions marked with * are relatively difficult. Don’t be discouraged if you cannot
solve them right away!
• The assignment needs to be written in LaTeX using the attached tex file. The solution
for each question should be written in the solution block in the space already provided
in the tex file. Handwritten assignments will not be accepted.

1. Partial Derivatives

(a) Consider the following computation,

x → f(x)

where f(x) = (1 + tanh((wx + b)/2)) / 2 and, by definition, tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).

The value L is given by

L = (1/2) (y − f(x))^2

Here, x and y are constants and w and b are parameters that can be modified. In
other words, L is a function of w and b.
Derive the partial derivatives ∂L/∂w and ∂L/∂b.

Solution: We know f(x) = (1 + tanh((wx + b)/2)) / 2, tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}),
and L = (1/2)(y − f(x))^2.

a. Solving for ∂L/∂w:

∂L/∂w = (1/2) ∂/∂w (y − f(x))^2
      = (1/2) · 2 · (y − f(x)) · (0 − ∂f(x)/∂w)

Substituting f(x) from the definition and taking the minus sign out,

      = −(y − f(x)) ∂/∂w [ (1 + tanh((wx + b)/2)) / 2 ]
      = −((y − f(x))/2) ∂/∂w [ tanh((wx + b)/2) ]        (1)

Simplifying the tanh definition (multiply numerator and denominator by e^z):

tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}) = (e^{2z} − 1) / (e^{2z} + 1)

Replacing z by (wx + b)/2,

tanh((wx + b)/2) = (e^{wx+b} − 1) / (e^{wx+b} + 1)        (2)

Using (2) in (1) and applying the quotient rule (∂/∂w e^{wx+b} = x e^{wx+b}):

∂L/∂w = −((y − f(x))/2) ∂/∂w [ (e^{wx+b} − 1) / (e^{wx+b} + 1) ]
      = −((y − f(x))/2) [ (x e^{wx+b})(e^{wx+b} + 1) − (x e^{wx+b})(e^{wx+b} − 1) ] / (e^{wx+b} + 1)^2
      = −((y − f(x))/2) (x e^{wx+b}) [ (e^{wx+b} + 1) − (e^{wx+b} − 1) ] / (e^{wx+b} + 1)^2
      = −((y − f(x))/2) · 2 · (x e^{wx+b}) / (e^{wx+b} + 1)^2

∂L/∂w = −(y − f(x)) · x e^{wx+b} / (e^{wx+b} + 1)^2

b. Solving for ∂L/∂b:

∂L/∂b = (1/2) ∂/∂b (y − f(x))^2
      = −((y − f(x))/2) ∂/∂b [ tanh((wx + b)/2) ]        (3)

Using (2) in (3); the only change from part a is that ∂(wx + b)/∂b = 1 instead of x:

      = −((y − f(x))/2) [ (e^{wx+b})(e^{wx+b} + 1) − (e^{wx+b})(e^{wx+b} − 1) ] / (e^{wx+b} + 1)^2
      = −((y − f(x))/2) · 2 · e^{wx+b} / (e^{wx+b} + 1)^2

∂L/∂b = −(y − f(x)) · e^{wx+b} / (e^{wx+b} + 1)^2
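As a sanity check, the closed-form ∂L/∂w can be compared against a numerical finite-difference derivative; a minimal sketch in Python, with arbitrary illustrative values for x, y, w, b:

```python
import math

def f(x, w, b):
    # f(x) = (1 + tanh((w*x + b)/2)) / 2
    return (1 + math.tanh((w * x + b) / 2)) / 2

def loss(x, y, w, b):
    return 0.5 * (y - f(x, w, b)) ** 2

def dL_dw(x, y, w, b):
    # closed form derived above: -(y - f(x)) * x * e^{wx+b} / (e^{wx+b} + 1)^2
    e = math.exp(w * x + b)
    return -(y - f(x, w, b)) * x * e / (e + 1) ** 2

x, y, w, b = 1.5, 0.3, 0.7, -0.2
h = 1e-6
numeric = (loss(x, y, w + h, b) - loss(x, y, w - h, b)) / (2 * h)
print(abs(dL_dw(x, y, w, b) - numeric) < 1e-8)  # True
```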

(b) Consider the evaluation of E as given below,

E = g(x, y, z) = σ(c(ax + by) + dz)

Here x, y, z are inputs (constants) and a, b, c, d are parameters (variables). σ is the
logistic sigmoid function defined as

σ(x) = 1 / (1 + e^{−x})

Note that here E is a function of a, b, c, d.
Compute the partial derivatives of E with respect to the parameters a, b and d, i.e.
∂E/∂a, ∂E/∂b and ∂E/∂d.

Solution:
E = g(x, y, z) = σ[c(ax + by) + dz]        (1)
σ(x) = 1 / (1 + e^{−x})        (2)

Write u = c(ax + by) + dz = acx + bcy + dz, so that E = 1 / (1 + e^{−u}) = e^u / (e^u + 1).

a. ∂E/∂a. Here ∂u/∂a = cx, so by the quotient rule

∂E/∂a = ∂/∂a [ e^u / (e^u + 1) ]
      = [ (cx e^u)(e^u + 1) − (cx e^u)(e^u) ] / (e^u + 1)^2
      = (cx) e^u / (e^u + 1)^2,   with u = acx + bcy + dz

b. ∂E/∂b. Now ∂u/∂b = cy, and the same computation gives

∂E/∂b = (cy) e^u / (e^u + 1)^2

c. ∂E/∂d. Now ∂u/∂d = z (u is linear in d with coefficient z), so

∂E/∂d = z e^u / (e^u + 1)^2
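A quick finite-difference check of ∂E/∂d (note ∂u/∂d = z, with u = c(ax + by) + dz); the parameter values below are arbitrary illustrations:

```python
import math

def sigma(t):
    return 1 / (1 + math.exp(-t))

def E(a, b, c, d, x, y, z):
    return sigma(c * (a * x + b * y) + d * z)

def dE_dd(a, b, c, d, x, y, z):
    # closed form: z * e^u / (e^u + 1)^2 with u = c(ax + by) + dz
    u = c * (a * x + b * y) + d * z
    return z * math.exp(u) / (math.exp(u) + 1) ** 2

a, b, c, d = 0.5, -0.3, 0.8, 0.2
x, y, z = 1.0, 2.0, -1.5
h = 1e-6
numeric = (E(a, b, c, d + h, x, y, z) - E(a, b, c, d - h, x, y, z)) / (2 * h)
print(abs(dE_dd(a, b, c, d, x, y, z) - numeric) < 1e-8)  # True
```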

2. Erroneous Estimates
The first order derivative of a real valued function f is defined by the following limit (if
it exists),

df(x)/dx = lim_{h→0} [ f(x + h) − f(x) ] / h        (3)

On observing the above definition we see that the derivative of a function is the ratio of
change in the function value to the change in the function input, when we change the
input by a small quantity (infinitesimally small).
Consider the function f(x) = x^2 − 2x + 1.
(a) Using the limit definition of the derivative, show that the derivative of f(x) is
df(x)/dx = 2x − 2.

Solution:
f(x) = x^2 − 2x + 1        (1)
df(x)/dx = lim_{h→0} [ f(x + h) − f(x) ] / h        (2)

From equations (1) and (2),

df(x)/dx = lim_{h→0} [ ((x + h)^2 − 2(x + h) + 1) − (x^2 − 2x + 1) ] / h
         = lim_{h→0} [ x^2 + 2hx + h^2 − 2x − 2h + 1 − x^2 + 2x − 1 ] / h
         = lim_{h→0} [ h^2 + 2hx − 2h ] / h
         = lim_{h→0} (h + 2x − 2)

Applying the limit h → 0,

df(x)/dx = 2x − 2

(b) The function evaluates to 0 at 1, i.e. f(1) = 0.

Say we wanted to estimate the value of f(1.01) and f(1.5) without using the defi-
nition of f(x). We could think of using the definition of derivative to “extrapolate”
the value of f(1) to obtain f(1.01) and f(1.5).
A first degree approximation based on (3) would be the following.

f(x + h) ≈ f(x) + h · df(x)/dx        (4)

Estimate f(1.01) and f(1.5) using the above formula.

Solution:

f(x + h) ≈ f(x) + h · df(x)/dx

f(1 + 0.01) = f(1) + 0.01 · (2(1) − 2) = 0 + 0.01 · 0 = 0

For f(1 + 0.5),

f(1 + 0.5) = f(1) + 0.5 · (2(1) − 2) = 0 + 0.5 · 0 = 0

(c) Compare it to the actual values, f(1.01) = 0.0001 and f(1.5) = 0.25.

Solution:
The error in the estimate of f(1.01) is |0.0001 − 0| = 0.0001.
The error in the estimate of f(1.5) is |0.25 − 0| = 0.25.

(d) Explain the discrepancy from the actual value. Why does it increase/decrease when
we move further away from 1?

Solution: The error of the approximation increases as the distance from 1 increases.
This is because the first-order approximation is only meant to be applied for very small
steps h: the derivative is defined in the limit h → 0, so the linear extrapolation is exact
only in that limit. For f(x) = x^2 − 2x + 1 the neglected term is of size h^2, which is
negligible for h = 0.01 but substantial for h = 0.5.

(e) Can we get a better estimate of f (1.01) and f (1.5) by “correcting” our estimate
from part (a)? Can you suggest a way of doing this?

Solution: A better estimate can be obtained by using higher-order derivatives in the
approximation equation, i.e. by taking more terms of the Taylor series:

f(x + h) ≈ f(x) + h f'(x) + (h^2 / 2!) f''(x) + ...

This can be continued until the derivative becomes constant; for f(x) = x^2 − 2x + 1 we
have f''(x) = 2 and all higher derivatives vanish, so the second-order expansion is exact.

f(1.01) = f(1) + (0.01)(2(1) − 2) + ((0.01)^2 / 2!) · 2 = 0 + 0 + 0.0001 = 0.0001

In the second case,

f(1.5) = f(1) + (0.5)(2(1) − 2) + ((0.5)^2 / 2!) · 2 = 0 + 0 + 0.25 = 0.25

Our approximation matches the true values exactly here.
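The two estimates can be reproduced in a few lines; a short sketch comparing first- and second-order Taylor estimates against the true values:

```python
def f(x):
    return x**2 - 2*x + 1

def f1(x):
    # first derivative
    return 2*x - 2

def f2(x):
    # second derivative (constant for a quadratic)
    return 2

def taylor1(x0, h):
    # first-order estimate of f(x0 + h)
    return f(x0) + h * f1(x0)

def taylor2(x0, h):
    # second-order estimate; exact for a quadratic
    return taylor1(x0, h) + (h**2 / 2) * f2(x0)

for h in (0.01, 0.5):
    print(h, taylor1(1, h), taylor2(1, h), f(1 + h))
```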

3. Differentiation w.r.t. Vectors and Matrices

Consider vectors u, x ∈ R^n, and a matrix A ∈ R^{n×n}.
The gradient of a scalar function f w.r.t. a vector x is a vector by itself, given by

∇_x f = ( ∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n )

The gradient of a scalar function w.r.t. a matrix is a matrix:

∇_A f = [ ∂f/∂A_11  ∂f/∂A_12  ...  ∂f/∂A_1n ]
        [ ∂f/∂A_21  ∂f/∂A_22  ...  ∂f/∂A_2n ]
        [    ...        ...    ...     ...   ]
        [ ∂f/∂A_n1  ∂f/∂A_n2  ...  ∂f/∂A_nn ]

The gradient of the gradient of a function w.r.t. a vector is a matrix. It is referred to as
the Hessian.

H_x f = ∇²_x f = [ ∂²f/∂x_1²      ∂²f/∂x_1∂x_2  ...  ∂²f/∂x_1∂x_n ]
                 [ ∂²f/∂x_2∂x_1   ∂²f/∂x_2²     ...  ∂²f/∂x_2∂x_n ]
                 [     ...             ...       ...      ...      ]
                 [ ∂²f/∂x_n∂x_1   ∂²f/∂x_n∂x_2  ...  ∂²f/∂x_n²    ]

(a) Derive the expressions for the following gradients.

1. ∇_x u^T x
2. ∇_x x^T x
3. ∇_x x^T A x
4. ∇_A x^T A x
5. ∇²_x x^T A x

(Aside: Compare your results with the derivatives of the scalar equivalents of the above
expressions, ax and x^2.)
(The gradient of a scalar f w.r.t. a matrix X is a matrix whose (i, j) component is
∂f/∂X_ij, where X_ij is the (i, j) component of the matrix X.)

Solution: 1. ∇_x u^T x

u^T x = u_1 x_1 + u_2 x_2 + ... + u_n x_n

∇_x [u_1 x_1 + ... + u_n x_n]
= ( ∂/∂x_1 (u_1 x_1 + ... + u_n x_n), ∂/∂x_2 (u_1 x_1 + ... + u_n x_n), ..., ∂/∂x_n (u_1 x_1 + ... + u_n x_n) )
= (u_1, u_2, ..., u_n)
= u^T

2. ∇_x x^T x

x^T x = x_1^2 + x_2^2 + ... + x_n^2

∇_x [x_1^2 + ... + x_n^2]
= ( ∂/∂x_1 (x_1^2 + ... + x_n^2), ∂/∂x_2 (x_1^2 + ... + x_n^2), ..., ∂/∂x_n (x_1^2 + ... + x_n^2) )
= (2x_1, 2x_2, ..., 2x_n)
= 2x^T
3. ∇_x x^T A x

f = x^T A x = (x_1 a_11 + x_2 a_21 + ... + x_n a_n1) x_1 + ... + (x_1 a_1n + x_2 a_2n + ... + x_n a_nn) x_n

∂f/∂x_1 = (x_1 a_11 + x_2 a_21 + ... + x_n a_n1) + (x_1 a_11 + x_2 a_12 + x_3 a_13 + ... + x_n a_1n)

similarly,

∂f/∂x_2 = (x_1 a_12 + x_2 a_22 + ... + x_n a_n2) + (x_1 a_21 + x_2 a_22 + x_3 a_23 + ... + x_n a_2n)
∂f/∂x_3 = (x_1 a_13 + x_2 a_23 + ... + x_n a_n3) + (x_1 a_31 + x_2 a_32 + x_3 a_33 + ... + x_n a_3n)
...
∂f/∂x_n = (x_1 a_1n + x_2 a_2n + ... + x_n a_nn) + (x_1 a_n1 + x_2 a_n2 + x_3 a_n3 + ... + x_n a_nn)

Collecting these into the row vector ( ∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n ): the first group
of terms in component k is Σ_i x_i a_ik = (A^T x)_k, and the second group is
Σ_j a_kj x_j = (Ax)_k. Hence

∇_x x^T A x = (A^T x)^T + (Ax)^T = x^T A + x^T A^T = x^T (A + A^T)
4. ∇_A x^T A x
Operating on the f obtained in the previous part, ∂f/∂A_ij = x_i x_j, so

∇_A f = [ x_1²     x_1 x_2  ...  x_1 x_n ]
        [ x_2 x_1  x_2²     ...  x_2 x_n ]
        [   ...      ...    ...    ...   ]
        [ x_n x_1  x_n x_2  ...  x_n²    ]

= x x^T
5. ∇²_x x^T A x
Using f derived in part 3, ∂f/∂x_i = Σ_j (a_ij + a_ji) x_j, so ∂²f/∂x_i ∂x_j = a_ij + a_ji:

H_x f = ∇²_x f = [ 2a_11        a_12 + a_21  ...  a_1n + a_n1 ]
                 [ a_21 + a_12  2a_22        ...  a_2n + a_n2 ]
                 [     ...          ...      ...      ...     ]
                 [ a_n1 + a_1n  a_n2 + a_2n  ...  2a_nn       ]

= A + A^T
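These identities can be verified numerically; a sketch using numpy (assumed available) with random u, x, A:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
u = rng.normal(size=n)
x = rng.normal(size=n)
A = rng.normal(size=(n, n))

def num_grad(f, x, h=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# gradient identities derived above (row vs column orientation holds the same numbers)
assert np.allclose(num_grad(lambda v: u @ v, x), u)
assert np.allclose(num_grad(lambda v: v @ v, x), 2 * x)
assert np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x)
print("all identities verified")
```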

(b) Use the equations obtained in the previous part to derive the linear regression solution
that you studied in ML or PR. Take X as the input example-feature matrix, Y as the
given outputs and W as the weight vector.

Solution:

Y = (y_1, y_2, ..., y_n)^T

X = [ 1  x_11  x_12  ...  x_1m ]
    [ 1  x_21  x_22  ...  x_2m ]
    [ ...                      ]
    [ 1  x_n1  x_n2  ...  x_nm ]

W = (w_0, w_1, ..., w_m)^T

error_j = y_j − Σ_{i=0}^{m} w_i x_ji   (with x_j0 = 1), for j = 1, ..., n

Error = (1/2) ( error_1² + error_2² + ... + error_n² )
      = (1/2) Σ_{j=1}^{n} error_j²
      = (1/2) Σ_{j=1}^{n} ( y_j − X_j W )²,   where X_j is the j-th row of X
      = (1/2) (Y − XW)^T (Y − XW)

∂Error/∂W = ∂/∂W [ (1/2) (Y − XW)^T (Y − XW) ]
          = (1/2) · 2 · (Y − XW)^T · ∂/∂W (Y − XW)

Setting the derivative to zero at the minimum,

0 = (Y − XW)^T (−X)
0 = (Y^T − W^T X^T) X
W^T X^T X = Y^T X
(X^T X) W = X^T Y

W = (X^T X)^{−1} X^T Y
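A quick numerical sketch of the normal-equation solution (assuming numpy), compared against numpy's built-in least-squares solver; the data below is synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])  # leading column of 1s for the bias w0
true_w = np.array([2.0, -1.0, 0.5, 3.0])
Y = X @ true_w + 0.01 * rng.normal(size=n)                 # slightly noisy targets

# closed-form solution W = (X^T X)^{-1} X^T Y;
# np.linalg.solve is preferred over forming an explicit inverse
W = np.linalg.solve(X.T @ X, X.T @ Y)

print(W)  # close to true_w
```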

(c) By now you must have the intuition. Gradient w.r.t. a 1 dimensional array was
1 dimensional. Gradient w.r.t. a 2-dimensional array was 2 dimensional. Higher
order arrays are referred to as tensors. Let T be a 3 dimensional tensor. Write
the expression of ∇T f . You can use gradients w.r.t. a vector or a matrix in the
expression.

Solution: Let T ∈ R^{p×q×r}. The gradient ∇_T f is a tensor of the same shape,
whose (i, j, k) component is the partial derivative w.r.t. the corresponding component
of T:

(∇_T f)_ijk = ∂f/∂T_ijk,   1 ≤ i ≤ p, 1 ≤ j ≤ q, 1 ≤ k ≤ r

Equivalently, writing T_i for the i-th q×r slice of T, the gradient can be expressed as
a stack of p matrix gradients:

∇_T f = ( ∇_{T_1} f, ∇_{T_2} f, ..., ∇_{T_p} f )

where each ∇_{T_i} f is the q×r matrix with entries ∂f/∂T_ijk.

4. Ordered Derivatives
An ordered network is a network where the state variables can be computed one at a
time in a specified order.
Answer the following questions regarding such a network.
(a) Given the ordered network below, give a formula for calculating the ordered deriva-
tive ∂y_4/∂y_1 in terms of partial derivatives w.r.t. y_1 and y_2, where y_1, y_2 and y_3 are the
outputs of nodes 1, 2 and 3 respectively.

[Figure: node 1 outputs y_1 and feeds nodes 2 and 3; node 2 outputs y_2 and feeds
nodes 3 and 4; node 3 outputs y_3 and feeds node 4, which outputs y_4.]

Solution:

∂y_4/∂y_1 = ∂y_4/∂y_3 ( ∂y_3/∂y_1 + ∂y_3/∂y_2 · ∂y_2/∂y_1 ) + ∂y_4/∂y_2 · ∂y_2/∂y_1
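The ordered-derivative formula can be checked numerically on a concrete choice of node functions; the functions below are hypothetical illustrations, not taken from the assignment:

```python
import math

# hypothetical node functions: y2 = g2(y1), y3 = g3(y1, y2), y4 = g4(y2, y3)
g2 = lambda y1: math.sin(y1)
g3 = lambda y1, y2: y1 * y2 + y1**2
g4 = lambda y2, y3: math.exp(0.1 * y2) + y3**2

def y4_of_y1(y1):
    y2 = g2(y1)
    y3 = g3(y1, y2)
    return g4(y2, y3)

y1 = 0.7
y2 = g2(y1)
y3 = g3(y1, y2)
# partials of each node w.r.t. its direct inputs
dy2_dy1 = math.cos(y1)
dy3_dy1, dy3_dy2 = y2 + 2 * y1, y1
dy4_dy2, dy4_dy3 = 0.1 * math.exp(0.1 * y2), 2 * y3

# ordered derivative from the formula above
ordered = dy4_dy3 * (dy3_dy1 + dy3_dy2 * dy2_dy1) + dy4_dy2 * dy2_dy1

h = 1e-6
numeric = (y4_of_y1(y1 + h) - y4_of_y1(y1 - h)) / (2 * h)
print(abs(ordered - numeric) < 1e-6)  # True
```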

(b) The figure above can be viewed as a dependency graph as it tells us which variables
in the system depend on which other variables. For example, we see that y_3 depends
on y_1 and y_2, which in turn also depends on y_1. Now consider the network given
below,

[Figure: a recurrent chain of states s_{−1}, s_0, s_1, s_2, s_3; each s_i receives s_{i−1}
through weight W, s_{i−2} through weight Y, and input x_i through weight U.]

Here, s_i = σ(W s_{i−1} + Y s_{i−2} + U x_i + b) (∀i ≥ 1).

Can you draw a dependency graph involving the variables s_3, s_2, s_1, W, Y?

Solution: From the recurrence, s_3 depends directly on s_2, s_1, W and Y; s_2 depends
directly on s_1, W and Y; and s_1 depends directly on W and Y (through the constants
s_0 and s_{−1}).
(c) Give a formula for computing ∂s_3/∂W, ∂s_3/∂Y and ∂s_3/∂U for the network shown in part (b)

Solution: We know

s_3 = σ(U x_3 + W s_2 + Y s_1)        (1)
s_2 = σ(U x_2 + W s_1 + Y s_0)        (2)
s_1 = σ(U x_1 + W s_0 + Y s_{−1})     (3)

Write u_i for the argument of σ, so s_i = σ(u_i), and recall that

σ'(u) = e^{−u} / (1 + e^{−u})^2 = σ(u)(1 − σ(u)).

Since s_0 and s_{−1} are constants, the chain rule gives:

a. ∂s_3/∂W:

∂s_3/∂W = σ'(u_3) ( s_2 + W ∂s_2/∂W + Y ∂s_1/∂W )        (4)
∂s_2/∂W = σ'(u_2) ( s_1 + W ∂s_1/∂W )                    (5)
∂s_1/∂W = σ'(u_1) s_0                                    (6)

Placing (6) back in (5), and then (5), (6) back in (4):

∂s_3/∂W = σ'(u_3) [ s_2 + W σ'(u_2) ( s_1 + W σ'(u_1) s_0 ) + Y σ'(u_1) s_0 ]

b. ∂s_3/∂Y:

∂s_3/∂Y = σ'(u_3) ( s_1 + W ∂s_2/∂Y + Y ∂s_1/∂Y )
∂s_2/∂Y = σ'(u_2) ( s_0 + W ∂s_1/∂Y )
∂s_1/∂Y = σ'(u_1) s_{−1}

so

∂s_3/∂Y = σ'(u_3) [ s_1 + W σ'(u_2) ( s_0 + W σ'(u_1) s_{−1} ) + Y σ'(u_1) s_{−1} ]

c. ∂s_3/∂U:

∂s_3/∂U = σ'(u_3) ( x_3 + W ∂s_2/∂U + Y ∂s_1/∂U )
∂s_2/∂U = σ'(u_2) ( x_2 + W ∂s_1/∂U )
∂s_1/∂U = σ'(u_1) x_1

so

∂s_3/∂U = σ'(u_3) [ x_3 + W σ'(u_2) ( x_2 + W σ'(u_1) x_1 ) + Y σ'(u_1) x_1 ]
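The recursion for ∂s_3/∂W can be checked by finite differences on a concrete instance; all constants below are illustrative assumptions:

```python
import math

def sigma(t):
    return 1 / (1 + math.exp(-t))

def dsigma(t):
    # sigma'(u) = sigma(u) * (1 - sigma(u))
    return sigma(t) * (1 - sigma(t))

# illustrative constants (not from the assignment)
U, W, Y = 0.4, 0.3, -0.2
x1, x2, x3 = 1.0, 0.5, -1.0
s0, s_minus1 = 0.1, 0.2

def forward(W):
    s1 = sigma(U * x1 + W * s0 + Y * s_minus1)
    s2 = sigma(U * x2 + W * s1 + Y * s0)
    s3 = sigma(U * x3 + W * s2 + Y * s1)
    return s1, s2, s3

s1, s2, s3 = forward(W)
u1 = U * x1 + W * s0 + Y * s_minus1
u2 = U * x2 + W * s1 + Y * s0
u3 = U * x3 + W * s2 + Y * s1

# analytic ordered derivative ds3/dW from the solution above
ds1 = dsigma(u1) * s0
ds2 = dsigma(u2) * (s1 + W * ds1)
ds3 = dsigma(u3) * (s2 + W * ds2 + Y * ds1)

h = 1e-6
numeric = (forward(W + h)[2] - forward(W - h)[2]) / (2 * h)
print(abs(ds3 - numeric) < 1e-8)  # True
```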

5. Baby Steps
From basic calculus, we know that we can find the minima (local and global) of a function
by finding the first and second order derivatives. We set the first derivative to zero and
verify if the second derivative at the same point is positive. The reasoning behind the
following procedure is based on the interpretation of the derivative of a function as the
slope of the function at any given point.
The above procedure, even though correct, can be intractable in practice while trying to
minimize functions. And this is not just a problem for the multivariable case, but even
for single variable functions. Consider minimizing the function f(x) = x^5 + 5 sin(x) +
10 tan(x). Although the function f is a contrived example, the point is that the standard
derivative approach might not always be a feasible way to find minima of functions.
In this course, we will be routinely dealing with minimizing functions of multiple variables
(in fact millions of variables). Of course we will not be solving them by hand, but we
need a more efficient way of minimizing functions. For the sake of this problem, consider
we are trying to minimize a convex function of one variable f(x),¹ which is guaranteed
to have a single minimum. We will now build an iterative approach to finding the minima
of functions.
The high level idea is the following:
Start at a (random) point x_0. Verify if we are at the minimum. If not, change the value
so that we are moving closer to the minimum. Keep repeating until we hit the minimum.
(a) Use the intuition built from Q.3 to find a way to change the current value of x while
still ensuring that we are improving (i.e. minimizing) the function.

Solution: The slope can be checked at the current point at every step until convergence.
If the slope is positive, the value of x_0 should be decreased; if the slope is negative, the
value should be increased. Both cases are captured by the update

while not converged:
    x_0 = x_0 − α · df(x_0)/dx

where α > 0 is a small step size (learning rate). After a number of steps the value of x_0
saturates (stops changing appreciably); then we know that it has converged.
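A minimal sketch of this update rule in Python, for an assumed convex function (x − 3)^2 with its known minimum at x = 3:

```python
def f(x):
    # a simple convex function with its minimum at x = 3
    return (x - 3) ** 2

def df(x):
    return 2 * (x - 3)

x, alpha = 10.0, 0.1       # arbitrary start point, small learning rate
for _ in range(200):
    step = alpha * df(x)
    x = x - step
    if abs(step) < 1e-10:  # saturation: updates have become negligible
        break

print(x)  # close to 3
```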

(b) How would you use the same idea, if you had to minimize a function of multiple
variables ?

Solution: The same idea can also be used in the multivariable case; we update every
variable simultaneously using its partial derivative:

while not converged:
    x_t = x_0 − α · ∂f(x_0, y_0)/∂x
    y_t = y_0 − α · ∂f(x_0, y_0)/∂y
    x_0 = x_t
    y_0 = y_t

i.e. we take a step along the negative gradient.

(c) Does your method always lead to the global minima (smallest value) for non convex
functions (which may have multiple local minima)? If yes, can you explain (prove
or argue) why? If not, can you give a concrete example of a case where it fails?

Solution: The above method may fail in the case of a non-convex function. It may
return a local minimum as the solution instead of the global minimum. Moreover, it
can end up in different local minima depending on its initialization x_0.

From the figure we can see that if the random starting point is chosen to be x_1, then
the method converges to the minimum M_1, which is not the global minimum. Similarly,
starting from x_2 it converges to M_2, and starting from x_3 it converges to M_3.

¹ https://en.wikipedia.org/wiki/Convex_function

(d) Do you think this procedure always works for convex functions ? (i.e., are we always
guaranteed to reach the minima)

Solution: Yes, the above procedure always works for convex functions: for a convex
function any local minimum is also the global minimum, so gradient descent is guaranteed
to reach the minimum provided the learning rate is selected properly.

(e) (Extra) Can you think of the number of steps needed to reach the minima ?

Solution: It depends on the learning rate. If the learning rate is low, it will take many
steps to converge. If the learning rate is set too large, it is possible that the iterates
overshoot the minimum and never converge.

(f) (Extra) Can you think of ways to improve the number of steps needed to reach the
minima ?

Solution: Choosing the learning rate well is the key to reducing the number of steps:
for example, start with a larger learning rate and decay it over time, or adapt it to the
local shape of the function.

6. Constrained Optimization
Let f(x, y) and g(x, y) be smooth (continuous, differentiable etc.) real valued functions
of two variables. We want to minimize f, which is a convex function of x and y.
(a) Argue that at the minima of f, the partial derivatives ∂f/∂x and ∂f/∂y will be zero.
Thus setting the partial derivatives to zero is a possible method for finding the minima.

Solution: At a minimum, the slope changes its course from negative to positive, and
this holds along every variable's axis. During this transition the slope must pass through
zero, with respect to every variable. So each partial derivative (varying one variable at
a time while holding the others constant) must vanish, which is a necessary condition
indicating that a minimum may be present.

(b) Suppose we are only interested in minimizing f in the region where g(x, y) = c,
where c is some constant. Suppose this region is a curve in the x-y plane. Call this
the feasible curve. Will our previous technique still work in this case? Why or why
not?

Solution: No, the previous technique will not work in general. The scenario has changed:
earlier we were looking for the minimum of f over the whole plane, but now we are
looking for the minimum of f restricted to the constraint curve g(x, y) = c. The
unconstrained stationary point (where both partials vanish) need not lie on this curve,
so both f and the constraint have to be taken under consideration together.

(c) What is the component of ∇g along the feasible curve, when computed at points
lying on the curve?

Solution: ∇g is perpendicular to the feasible curve at every point of the curve (since g
is constant along it). So the component of ∇g along the curve is zero.

(d) * At the point on the feasible curve, which achieves minimum value of f , what will
be the component of ∇f along the curve?

Solution: The component of ∇f along the curve will be zero: otherwise we could move
slightly along the feasible curve in the direction of decreasing f, contradicting minimality.
Hence ∇f is also perpendicular to the curve at that point.

(e) Using the previous answers, show that at the point on the feasible curve achieving
the minimum value of f, ∇f = λ∇g for some real number λ. Thus, this equation,
combined with the constraint g(x, y) = c, should enable us to find the minima.

Solution: At the constrained minimum, both ∇f and ∇g are perpendicular to the
feasible curve, so they point along the same line. Therefore

∇f ∝ ∇g

Replacing the proportionality by a constant λ,

∇f = λ∇g

where λ is the Lagrange multiplier.

(f) * Using the insights from discussion so far, solve the the following optimization
problem:
max xa y b z c
x,y,z

where
x+y+z =1
and given a, b, c > 0.

Solution:
f(x, y, z) = x^a y^b z^c
g(x, y, z) = x + y + z − 1

∇ x^a y^b z^c = ( ∂/∂x x^a y^b z^c, ∂/∂y x^a y^b z^c, ∂/∂z x^a y^b z^c )
             = ( a x^{a−1} y^b z^c, b x^a y^{b−1} z^c, c x^a y^b z^{c−1} )        (1)

∇(x + y + z − 1) = (1, 1, 1)        (2)

Setting ∇f = λ∇g, from (1) and (2):

a x^{a−1} y^b z^c = λ        (3)
b x^a y^{b−1} z^c = λ        (4)
c x^a y^b z^{c−1} = λ        (5)

From (3) and (4): y = (b/a) x        (6)
From (3) and (5): z = (c/a) x        (7)

From (6), (7) and x + y + z = 1:

x = a / (a + b + c)
y = b / (a + b + c)
z = c / (a + b + c)

f( a/(a+b+c), b/(a+b+c), c/(a+b+c) ) = a^a b^b c^c / (a + b + c)^{a+b+c}
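A quick numerical sanity check of the closed-form maximizer, using illustrative values a = 1, b = 2, c = 3:

```python
a, b, c = 1, 2, 3
s = a + b + c
x, y, z = a / s, b / s, c / s  # closed-form maximizer derived above

def f(x, y, z):
    return x**a * y**b * z**c

best = f(x, y, z)
# compare against feasible perturbations that keep x + y + z = 1
for eps in (0.05, -0.05, 0.1):
    assert f(x + eps, y - eps, z) <= best
    assert f(x, y + eps, z - eps) <= best

print(best, a**a * b**b * c**c / s**s)  # both 1/432
```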

7. Billions of Balloons
Consider a large playground filled with 1 billion balloons. Of these there are k1 blue, k2
green and k3 red balloons. The values of k1 , k2 and k3 are not known to you but you are
interested in estimating them. Of course, you cannot go over all the 1 billion balloons
and count the number of blue, green and red balloons. So you decide to randomly
sample 1000 balloons and note down the number of blue, green and red balloons. Let
these counts be k̂1 , k̂2 and k̂3 respectively. You then estimate the total number of blue,
green and red balloons as 1000000 ∗ k̂1 , 1000000 ∗ k̂2 and 1000000 ∗ k̂3 .
(a) Your friend knows the values of k1 , k2 and k3 and wants to see how bad your
estimates are compared to the true values. Can you suggest some ways of calculating
this difference? [Hint: Think about probability!]

Solution: One simple measure is the total count error:

Error = ( |1000000·k̂_1 − k_1| + |1000000·k̂_2 − k_2| + |1000000·k̂_3 − k_3| ) / 10^9

Accuracy = ( 10^9 − ( |1000000·k̂_1 − k_1| + |1000000·k̂_2 − k_2| + |1000000·k̂_3 − k_3| ) ) / 10^9

Alternatively, compare the two probability distributions over colors:

P(k̂_i) = k̂_i / (k̂_1 + k̂_2 + k̂_3),   P(k_i) = k_i / (k_1 + k_2 + k_3),   i = 1, 2, 3

Probability difference = |P(k̂_1) − P(k_1)| + |P(k̂_2) − P(k_2)| + |P(k̂_3) − P(k_3)|

(b) * Consider two ways of converting k̂1 , k̂2 and k̂3 to a probability distribution:

k̂i
pi = P
i k̂i

ek̂i
qi = P
k̂i
ie

Would you prefer the distribution q = [q1 , q2 , ..., qn ] over p = [p1 , p2 , ..., pn ] for the
above task? Give reasons and provide an example to support your choice.

Solution: No, I will not prefer q over p. The exponential form q is a biased probability
estimate: it gives almost all the mass to the largest observed count and diminishes the
others. For example, counts (500, 300, 200) give p = (0.5, 0.3, 0.2), which matches the
observed frequencies, whereas q puts essentially all the probability on the first color,
because e^{500} dwarfs e^{300} and e^{200}.
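The bias of q can be seen concretely; a sketch with assumed counts (500, 300, 200), using a shifted (overflow-safe) softmax for q:

```python
import math

counts = [500, 300, 200]  # illustrative sample counts

p = [k / sum(counts) for k in counts]

# numerically stable softmax: shift by the max before exponentiating
m = max(counts)
exps = [math.exp(k - m) for k in counts]
q = [e / sum(exps) for e in exps]

print(p)  # [0.5, 0.3, 0.2]
print(q)  # q[0] is ~1.0; the other entries are vanishingly small
```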

8. ** Let X be a real-valued random variable with p as its probability density function
(PDF). We define the cumulative distribution function (CDF) of X as

F(x) = Pr(X ≤ x) = ∫_{−∞}^{x} p(y) dy

What is the value of E_X[F(X)] (the expected value of the CDF of X)? The answer is
a real number. (Hint: The expectation can be formulated as a double integral. Try to
plot the area over which you need to integrate in the x-y plane. Now look at the area
over which you are not integrating. Do you notice any symmetries?)

Solution:

F(x) = Pr(X ≤ x) = ∫_{−∞}^{x} p(y) dy

By the fundamental theorem of calculus, differentiating the CDF gives back the PDF:

dF(x)/dx = d/dx ∫_{−∞}^{x} p(y) dy = p(x)        (1)

Also, by the definition of expectation,

E[F(X)] = ∫_{−∞}^{∞} F(x) p(x) dx        (2)

Using (1), p(x) dx = dF(x); as x runs from −∞ to ∞, F(x) runs from 0 to 1, so
substituting u = F(x) in (2):

E[F(X)] = ∫_0^1 u du = [ u²/2 ]_0^1 = 1/2

9. * Intuitive Urns
An urn initially contains 3 red balls and 3 blue balls. One of the balls is removed without
being observed. To find out the color of the removed ball, Alice and Bob independently
perform the same experiment: they randomly draw a ball, record the color, and put it
back. This is repeated several times and the number of red and blue balls observed by
each of them is recorded.
Alice draws 6 times and observes 6 red balls and 0 blue balls.
Bob draws 600 times and observes 303 red balls and 297 blue balls.
Obviously, both of them will predict that the removed ball was blue.
(a) Intuitively, who do you think has stronger evidence for claiming that the removed
ball was blue, and why? (Don’t cheat by computing the answer. This
subquestion has no marks, but is compulsory!)

Solution: Bob, since he has many more samples. From the central limit theorem I can
assume his observed frequencies will be closer to the true draw probabilities.

(b) What is the exact probability that the removed ball was blue, given Alice’s obser-
vations? (Hint: Think Bayesian Probability)

Solution: Consider three events:

1. A : Alice's observation (6 red in 6 draws).

2. B : Bob's observation (303 red and 297 blue in 600 draws).

3. C : the removed ball was blue.

We need to find P(C|A). Using Bayes' theorem we can write

P(C|A) = P(A|C) P(C) / ( P(A|C) P(C) + P(A|C') P(C') )

P(C) = 3/6 = 1/2

After removing a blue ball, the probability of drawing red is 3/5 and blue is 2/5.
After removing a red ball, the probability of drawing red is 2/5 and blue is 3/5.
Getting 6 red draws (with replacement) when a blue ball was removed: P(A|C) = (3/5)^6.
Getting 6 red draws (with replacement) when a red ball was removed: P(A|C') = (2/5)^6.

P(C|A) = (3/5)^6 (1/2) / ( (3/5)^6 (1/2) + (2/5)^6 (1/2) )
       = 3^6 / (3^6 + 2^6) = 729/793 ≈ 0.9193

(c) What is the exact probability that the removed ball was blue, given Bob’s observa-
tions? (Hint: Think Bayesian Probability)

Solution: We need to find P(C|B). Using Bayes' theorem we can write

P(C|B) = P(B|C) P(C) / ( P(B|C) P(C) + P(B|C') P(C') )

P(C) = 3/6 = 1/2

After removing a blue ball, the probability of drawing red is 3/5 and blue is 2/5.
After removing a red ball, the probability of drawing red is 2/5 and blue is 3/5.
Getting 303 red and 297 blue draws (with replacement) when a blue ball was removed
(the binomial coefficient is the same in both likelihoods and cancels):

P(B|C) = (3/5)^{303} (2/5)^{297}

Getting 303 red and 297 blue draws when a red ball was removed:

P(B|C') = (2/5)^{303} (3/5)^{297}

P(C|B) = (3/5)^{303} (2/5)^{297} (1/2) / ( (3/5)^{303} (2/5)^{297} (1/2) + (2/5)^{303} (3/5)^{297} (1/2) )
       = 3^{303} 2^{297} / ( 3^{303} 2^{297} + 2^{303} 3^{297} )
       = 3^6 / (3^6 + 2^6) = 729/793 ≈ 0.9193

(dividing numerator and denominator by 2^{297} 3^{297})

(d) Computationally, who do you think has stronger evidence for claiming that the
removed ball was blue?

Solution: Computationally, both have exactly the same evidence for claiming that the
removed ball was blue: the two posteriors are equal, 3^6 / (3^6 + 2^6).
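Both posteriors can be computed exactly with rational arithmetic; a short sketch:

```python
from fractions import Fraction

half = Fraction(1, 2)
pr_red_if_blue_removed = Fraction(3, 5)  # blue drawn with prob 2/5
pr_red_if_red_removed = Fraction(2, 5)   # blue drawn with prob 3/5

def posterior_blue(red, blue):
    # P(removed ball was blue | `red` reds and `blue` blues observed);
    # binomial coefficients cancel between numerator and denominator
    like_blue = pr_red_if_blue_removed**red * (1 - pr_red_if_blue_removed)**blue
    like_red = pr_red_if_red_removed**red * (1 - pr_red_if_red_removed)**blue
    return like_blue * half / (like_blue * half + like_red * half)

alice = posterior_blue(6, 0)
bob = posterior_blue(303, 297)
print(alice, bob, alice == bob)  # both equal 729/793
```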

Did your intuition match up with the computations? If yes, awesome! If not,
remember that probability can often seem deceptively straightforward. Try to
avoid intuition when dealing with probability by grounding it in formalism.

10. Plotting Functions for Great Good

(a) Consider the variable x and functions h11(x), h12(x) and h21(x) such that

h11(x) = 1 / (1 + e^{−(400x + 24)})
h12(x) = 1 / (1 + e^{−(400x − 24)})
h21(x) = h11(x) − h12(x)

The above set of functions are summarized in the graph below.

[Figure: computation graph with x feeding h11 and h12, which feed h21.]

Plot the following functions: h11(x), h12(x) and h21(x) for x ∈ (−1, 1).

Solution: (plots) h11 and h12 are very steep sigmoids that switch from 0 to 1 around
x = −0.06 and x = +0.06 respectively; their difference h21 is a narrow rectangular
pulse, approximately 1 for −0.06 < x < 0.06 and approximately 0 elsewhere.
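The answer itself is a plot; as a sketch, the three functions can be evaluated on a few points (and passed to any plotting library) to confirm the pulse shape:

```python
import math

def h11(x):
    return 1 / (1 + math.exp(-(400 * x + 24)))

def h12(x):
    return 1 / (1 + math.exp(-(400 * x - 24)))

def h21(x):
    return h11(x) - h12(x)

# h21 is ~1 only in the narrow band |x| < 0.06
for x in (-0.5, -0.06, 0.0, 0.06, 0.5):
    print(x, round(h21(x), 4))
```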
(b) Now consider the variables x1, x2 and the functions h11(x1, x2), h12(x1, x2), h13(x1, x2), h14(x1, x2),
h21(x1, x2), h22(x1, x2), h31(x1, x2) and f(x1, x2) such that

h11(x1, x2) = 1 / (1 + e^{−(x1 + 100x2 + 200)})
h12(x1, x2) = 1 / (1 + e^{−(x1 + 100x2 − 200)})
h13(x1, x2) = 1 / (1 + e^{−(100x1 + x2 + 200)})
h14(x1, x2) = 1 / (1 + e^{−(100x1 + x2 − 200)})
h21(x1, x2) = h11(x1, x2) − h12(x1, x2)
h22(x1, x2) = h13(x1, x2) − h14(x1, x2)
h31(x1, x2) = h21(x1, x2) + h22(x1, x2)
f(x1, x2) = 1 / (1 + e^{−(50 h31(x1, x2) − 100)})

The above set of functions are summarized in the graph below.

[Figure: computation graph with x1, x2 feeding h11–h14, which feed h21 and h22,
which feed h31 and finally f.]

Plot the following functions: h11(x1, x2), h12(x1, x2), h13(x1, x2), h14(x1, x2), h21(x1, x2),
h22(x1, x2), h31(x1, x2) and f(x1, x2) for x1 ∈ (−5, 5) and x2 ∈ (−5, 5).

Solution: (plots) h21 is a band of height ≈ 1 where |x1 + 100x2| < 200 (roughly
−2 < x2 < 2), and h22 is the analogous band where |100x1 + x2| < 200 (roughly
−2 < x1 < 2). h31 is their sum, so f ≈ σ(0) = 0.5 on the central square where both
bands overlap (approximately |x1| < 2 and |x2| < 2) and f ≈ 0 outside it.
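Similarly for part (b), f(x1, x2) can be evaluated at a few representative points; a sketch, with points chosen to probe inside and outside the central square:

```python
import math

def sigma(t):
    return 1 / (1 + math.exp(-t))

def f(x1, x2):
    h21 = sigma(x1 + 100 * x2 + 200) - sigma(x1 + 100 * x2 - 200)
    h22 = sigma(100 * x1 + x2 + 200) - sigma(100 * x1 + x2 - 200)
    h31 = h21 + h22
    return sigma(50 * h31 - 100)

print(f(0, 0))  # ~0.5 inside the central square
print(f(4, 0))  # ~0 when only one band is active
print(f(4, 4))  # ~0 outside both bands
```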