1. Note that in the 2D plot above, the horizontal axis corresponds to the parameter w and
the vertical axis corresponds to the parameter b. Which of the 3D plots below corresponds
to the 2D plot shown in the figure above? (Please check carefully which axis corresponds
to w and which to b in each 3D plot.)
A.
B.
C.
2. Consider the loss function L(w) shown in the figure below. You are interested in
finding the minima of this function, i.e., the value(s) of w for which the function takes
its lowest value. To do so, you run gradient descent starting with a random value w_0 (the
leftmost red dot in the figure). After running three steps of gradient descent, you have
the updated value of w as w_3. The red dots in the figure show the value of w at each
step and the blue dots show the corresponding value of the loss function L(w). Now,
what will happen if you run the 4th step of gradient descent, i.e., if you try to update
the value of w using the gradient descent update rule? Assume that the learning rate is
1.
    w_{t+1} = w_t − η ∇w_t
Solution: After running 3 steps of gradient descent, we have reached a region where the
function is absolutely flat, i.e., there is no change in the value of the function in a small
neighborhood around w_3. Hence, the derivative ∇w_t at this point will be zero and
the value of w will not get updated. Hence, Option B is the correct answer.
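This can be seen directly from the update rule. A minimal sketch (the value w_3 = 2.5 below is assumed for illustration, not read off the figure):

```python
def gd_step(w, grad, eta=1.0):
    # Vanilla gradient descent update: w_{t+1} = w_t - eta * grad
    return w - eta * grad

w3 = 2.5          # hypothetical value of w after 3 steps
grad_at_w3 = 0.0  # the function is flat around w3, so the derivative is 0
w4 = gd_step(w3, grad_at_w3)
print(w4 == w3)   # True: with a zero gradient, w does not move
```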
3. Continuing the previous question and referring to the same figure again, suppose that
instead of gradient descent you ran 3 iterations of momentum-based gradient descent,
resulting in the value w_3 as shown in the figure. Note that the update rule of
momentum-based gradient descent is:

    update_t = γ · update_{t−1} + η ∇w_t
    w_{t+1} = w_t − update_t

Assume that the learning rate is 1 and the momentum parameter γ > 0. Now, what will
happen if you run the 4th step, i.e., if you try to update the value
of w using the update rule of momentum-based gradient descent?
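The behaviour of this update rule at a flat point can be sketched numerically. The state values below are illustrative assumptions, not values taken from the figure:

```python
def momentum_step(w, prev_update, grad, eta=1.0, gamma=0.9):
    # Momentum update: update_t = gamma * update_{t-1} + eta * grad_t
    update = gamma * prev_update + eta * grad
    return w - update, update

w3, update3 = 2.5, 0.4   # hypothetical state after 3 momentum steps
grad_at_w3 = 0.0         # flat region: the derivative at w3 is 0
w4, update4 = momentum_step(w3, update3, grad_at_w3)
# Unlike vanilla gradient descent, the history term gamma * update3 is
# nonzero, so w still changes even though the gradient is zero.
print(w4 != w3)
```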
4. Suppose we choose a model f(x) = σ(wx + b) which has two parameters, w and b. Further,
assume that we are trying to learn the parameters of this model using 200 training
points. If we use mini-batch gradient descent with a batch size of 10, then how many
times will each parameter get updated in one epoch?
A. 10
B. 20
C. 100
D. 200
Solution: Option B is the correct answer. The parameters will get updated once
for every mini-batch, and the data is divided into 200/10 = 20 mini-batches. Hence, the
parameters will get updated 20 times in one epoch.
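The count of updates per epoch is just the number of mini-batches; a small sketch (the helper name is ours):

```python
import math

def updates_per_epoch(n_points, batch_size):
    # One parameter update per mini-batch; a partial final batch still
    # triggers an update, hence the ceiling.
    return math.ceil(n_points / batch_size)

print(updates_per_epoch(200, 10))  # 20 updates per epoch
```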
5. Note that the update rule for momentum-based gradient descent is given by

    update_t = γ · update_{t−1} + η ∇w_t
    w_{t+1} = w_t − update_t

Let η = 1 and γ = 0.9, and let ∇w_1 be the derivative computed at the first time step. If
you run momentum-based gradient descent for 10 iterations, then what fraction of ∇w_1
will be a part of update_10?
A. 0.9 ∇w_1
B. (1/0.9) ∇w_1
C. (0.9/(10−1)) ∇w_1
D. (0.9)^(10−1) ∇w_1
Solution: Unrolling the update rule with update_0 = 0 gives
update_t = η ∇w_t + γ η ∇w_{t−1} + · · · + γ^{t−1} η ∇w_1.
Hence, the fraction of ∇w_1 that will be a part of update_t is γ^{t−1} · η ∇w_1. Substituting
γ = 0.9, η = 1 and t = 10, we get (0.9)^(10−1) ∇w_1. Hence, Option D is the correct
answer.
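The closed form can be checked numerically by feeding a gradient sequence in which only ∇w_1 is nonzero, so the final update isolates its coefficient (the helper name is ours):

```python
def first_grad_coefficient(steps, eta=1.0, gamma=0.9):
    # Run the momentum recurrence with grad_1 = 1 and grad_t = 0 for t > 1,
    # so the final update equals the coefficient multiplying grad_1.
    update = 0.0
    for t in range(1, steps + 1):
        grad = 1.0 if t == 1 else 0.0
        update = gamma * update + eta * grad
    return update

# Mathematically this equals gamma**(steps - 1), i.e. 0.9**9 for 10 steps.
print(first_grad_coefficient(10))
```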
6. We saw the following update rule for Adam:

    m_t = β1 · m_{t−1} + (1 − β1) · ∇w_t
    v_t = β2 · v_{t−1} + (1 − β2) · (∇w_t)^2
    m̂_t = m_t / (1 − β1^t)
    v̂_t = v_t / (1 − β2^t)
    w_{t+1} = w_t − (η / (√v̂_t + ε)) · m̂_t
m̂_t, v̂_t are the bias-corrected values of m_t, v_t. Suppose that instead of using the above
equation for m_t we use the following equation, where 0 ≤ α1 ≤ 1 and 0 ≤ β1 ≤ 1:

    m_t = (α1/β1) · m_{t−1} + ((β1 − α1)/β1) · ∇w_t

Solution: The original equation is

    m_t = β1 · m_{t−1} + (1 − β1) · ∇w_t

whereas the new equation can be rewritten as

    m_t = (α1/β1) · m_{t−1} + (1 − α1/β1) · ∇w_t
In other words, β1 has been replaced by α1/β1. Hence, the bias-corrected value can also
be obtained by replacing β1 by α1/β1, i.e.,

    m̂_t = m_t / (1 − (α1/β1)^t)
        = β1^t · m_t / (β1^t − α1^t)
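The equivalence of the two forms can be sanity-checked numerically; the values below are arbitrary illustrative numbers, and the helper names are ours:

```python
def hat_direct(m_t, alpha1, beta1, t):
    # Bias correction with beta1 replaced by alpha1/beta1
    return m_t / (1 - (alpha1 / beta1) ** t)

def hat_rewritten(m_t, alpha1, beta1, t):
    # Equivalent form: beta1^t * m_t / (beta1^t - alpha1^t)
    return (beta1 ** t) * m_t / (beta1 ** t - alpha1 ** t)

lhs = hat_direct(0.37, 0.8, 0.9, 5)
rhs = hat_rewritten(0.37, 0.8, 0.9, 5)
print(abs(lhs - rhs) < 1e-12)  # True: the two forms agree
```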
7. In this question you will implement the Adam algorithm on a toy 2-D dataset which
consists of 40 data points, i.e., 40 (x, y) pairs. You can download the dataset using the
URL: https://drive.google.com/file/d/1w6aXg7K7nIv4DBUyWAM-abXvKh2frYRM/view?usp=sharing
For this question you have to use the squared error loss function, which is given as

    loss = (1/2) · (ŷ − y)^2

where ŷ is the output of your model, given by:

    ŷ = 1 / (1 + e^(−(wx + b)))
Solution: Option A is correct. You can find the value of the loss at the end of 100
iterations by using the following code (the one given in the lecture video).

# Importing libraries
import pandas as pd
import numpy as np
import math

def f(w, b, x):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def error(w, b):  # calculate loss/error
    err = 0.0
    for x, y in zip(X, Y):
        fx = f(w, b, x)
        err += 0.5 * (fx - y) ** 2
    return err

def grad_b(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)

def grad_w(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def run_adam(X, Y, init_w, init_b, eta, max_epochs):  # Adam algorithm
    # Initializations
    w, b = init_w, init_b
    m_w, m_b, v_w, v_b, eps, beta1, beta2 = 0, 0, 0, 0, 1e-8, 0.9, 0.999

    for i in range(max_epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)
        # Compute history
        m_w = beta1 * m_w + (1 - beta1) * dw
        m_b = beta1 * m_b + (1 - beta1) * db

        v_w = beta2 * v_w + (1 - beta2) * dw ** 2
        v_b = beta2 * v_b + (1 - beta2) * db ** 2

        # Apply bias correction
        m_w = m_w / (1 - math.pow(beta1, i + 1))
        m_b = m_b / (1 - math.pow(beta1, i + 1))

        v_w = v_w / (1 - math.pow(beta2, i + 1))
        v_b = v_b / (1 - math.pow(beta2, i + 1))

        # Apply Adam's update rule
        w = w - (eta / np.sqrt(v_w + eps)) * m_w
        b = b - (eta / np.sqrt(v_b + eps)) * m_b

    return w, b

if __name__ == "__main__":
    filename = 'A4_Q7_data.csv'
    df = pd.read_csv(filename)  # loading data
    X = df['X']
    Y = df['Y']
    initial_w = 1
    initial_b = 1
    eta = 0.01
    max_epochs = 100
    w, b = run_adam(X, Y, initial_w, initial_b, eta, max_epochs)
    err = error(w, b)
    print("error = {}".format(err))
The code below is as per the corrected Adam update equations. (We have updated the
lecture slides, please refer to them.)

Solution:

import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
import math

def f(w, b, x):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def error(w, b):  # calculate loss/error
    err = 0.0
    for x, y in zip(X, Y):
        fx = f(w, b, x)
        err += 0.5 * (fx - y) ** 2
    return err

def grad_b(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)

def grad_w(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def do_adam(X, Y, init_w, init_b, eta, max_epochs):
    w, b = init_w, init_b
    m_w, m_b, v_w, v_b = 0, 0, 0, 0
    m_w_hat, m_b_hat, v_w_hat, v_b_hat = 0, 0, 0, 0
    eps, beta1, beta2 = 1e-8, 0.9, 0.99
    for i in range(max_epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)
        m_w = beta1 * m_w + (1 - beta1) * dw
        m_b = beta1 * m_b + (1 - beta1) * db

        v_w = beta2 * v_w + (1 - beta2) * dw ** 2
        v_b = beta2 * v_b + (1 - beta2) * db ** 2

        # Bias correction kept in separate _hat variables so the running
        # history m_w, m_b, v_w, v_b is not corrupted
        m_w_hat = m_w / (1 - math.pow(beta1, i + 1))
        m_b_hat = m_b / (1 - math.pow(beta1, i + 1))

        v_w_hat = v_w / (1 - math.pow(beta2, i + 1))
        v_b_hat = v_b / (1 - math.pow(beta2, i + 1))

        # Corrected update rule: eta / (sqrt(v_hat) + eps)
        w = w - (eta / (np.sqrt(v_w_hat) + eps)) * m_w_hat
        b = b - (eta / (np.sqrt(v_b_hat) + eps)) * m_b_hat
    return w, b
On executing this code, the error value you should get is 0.0036.