Siddharth, Sudeep & Steven | COMS 4995 Course Project | December 3rd, 2019
Contents
3 Conclusions
Motivation for SGD Variants
Proposed SGD Variants
SGD without Replacement (SGDo)
Case 1.1: For smooth, strongly convex functions, when K is large the suboptimality rate is O(1/K²), better than the O(1/K) rate for SGD.
Case 1.2: For smooth, strongly convex functions with constant stepsize η_{k,i} = min{2/β, 4 log(nK)/(αnK)} and K arbitrary, the suboptimality rate is O(1/K), the same as for SGD (see the sketch below).
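A minimal sketch of the without-replacement scheme in the Case 1.2 setting, assuming a gradient oracle grad_f(w, i) for the i-th component function (the oracle name and the NumPy usage are our assumptions, not from the slides):

    import numpy as np

    def sgd_without_replacement(w0, grad_f, n, K, alpha, beta, seed=0):
        """K epochs of SGD over n components, each epoch a fresh random
        permutation of the data (sampling without replacement)."""
        rng = np.random.default_rng(seed)
        # Case 1.2 constant stepsize: min{2/beta, 4 log(nK) / (alpha n K)}
        eta = min(2.0 / beta, 4.0 * np.log(n * K) / (alpha * n * K))
        w = np.array(w0, dtype=float)
        for _ in range(K):
            for i in rng.permutation(n):      # one full pass, no replacement
                w = w - eta * grad_f(w, i)    # step on the i-th component gradient
        return w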
SGD Last-Iterate Optimal (liSGD)
Experimental Setup
Optimization Functions: Sum-of-squares loss with linear predictors, plus a mix of ℓ2 and ℓ1 regularization:
F(w) = (1/n) Σ_{i=1}^{n} (wᵀxᵢ − yᵢ)² + (λ/2)‖w‖₂² + γ‖w‖₁
Consideration Set:
We constrain the optimization to the region {w : ‖w‖₂² ≤ 10}.
Whenever the function is smooth, we compute the Hessian to estimate α and β, and estimate L and σ² in our code, as sketched below.
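A sketch of how the objective and the Hessian-based constants could be computed (the array names X, y and the helper names are our assumptions; in the smooth case γ = 0 the Hessian is constant):

    import numpy as np

    def objective(w, X, y, lam, gamma):
        """F(w) = (1/n) * sum_i (w^T x_i - y_i)^2 + (lam/2) * ||w||_2^2 + gamma * ||w||_1."""
        residuals = X @ w - y
        return np.mean(residuals ** 2) + 0.5 * lam * w @ w + gamma * np.abs(w).sum()

    def curvature_constants(X, lam):
        """With gamma = 0 the Hessian is H = (2/n) X^T X + lam * I;
        alpha and beta are its smallest and largest eigenvalues."""
        n, d = X.shape
        H = (2.0 / n) * X.T @ X + lam * np.eye(d)
        eigvals = np.linalg.eigvalsh(H)       # ascending order
        return eigvals[0], eigvals[-1]        # (alpha, beta)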
SGDo vs. Vanilla SGD (Case 1.2)
Scenario: Minimizing a β-smooth & L-Lipschitz convex function (λ = 10, γ = 0)
Details:
η_{k,i} = 4 log(nK)/(αnK) for SGDo, while η_{k,i} = log(nK)/(αnK) for vanilla SGD
Tail average: x̂ = (1/(K − ⌈K/2⌉ + 1)) Σ_{k=⌈K/2⌉}^{K} x₀ᵏ, where x₀ᵏ denotes the iterate at the start of epoch k (see the sketch below)
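A small sketch of the tail average, assuming the start-of-epoch iterates x₀ᵏ have been collected into a Python list (the variable name epoch_starts is ours):

    import numpy as np

    def tail_average(epoch_starts):
        """x_hat = (1 / (K - ceil(K/2) + 1)) * sum_{k = ceil(K/2)}^{K} x_0^k,
        where epoch_starts[k-1] holds x_0^k for k = 1..K."""
        K = len(epoch_starts)
        first = int(np.ceil(K / 2))          # first epoch included in the tail
        tail = epoch_starts[first - 1:]      # x_0^{ceil(K/2)}, ..., x_0^K
        return sum(tail) / len(tail)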
Comments:
The loss of the last iterate of SGDo stays high, while the loss of the tail average converges to the optimal loss
The SGDo tail average and the vanilla SGD last iterate both reach the optimum even with a small number of epochs
SGDo vs. Vanilla SGD (Case 1.1)
Comments:
SGDo converges at a rate of 1/K², whereas vanilla SGD converges at 1/K. This difference turned out to be dramatic in our experiments.
SGDo vs. Vanilla SGD (Case 1.3)
Scenario: Minimizing a β-smooth & L-Lipschitz convex function (λ = 0.01 and γ = 0)
Details:
η_{k,i} = min{2/β, D/(L√(Kn))} for SGDo and η_t = 1/(β + c√(nK)) for vanilla SGD
Simple averaging of iterates
Comments:
SGDo matches the performance of vanilla SGD eventually
The variance of SGDo seems to be lower than that of SGD
liSGD vs. Vanilla SGD (Case 2.1)
Scenario: Minimizing an L-Lipschitz convex function (λ = 10 and γ = 0)
Details:
Stage-wise choose η_t = C·2^{-i}/√T when T_i < t < T_{i+1}, 0 ≤ i ≤ k, for liSGD (see the sketch below)
η = D/(√(σ² + L²)·√T) for vanilla SGD
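One way the stage-wise schedule could be coded (a sketch; the phase boundaries T_i = T·(1 − 2^{-i}) are our assumption about how the stages halve the remaining horizon, and C is a tuning constant):

    import numpy as np

    def lisgd_stepsize(t, T, C, num_stages):
        """eta_t = C * 2^{-i} / sqrt(T) for T_i < t <= T_{i+1},
        assuming T_i = T * (1 - 2^{-i})."""
        for i in range(num_stages):
            lo = T * (1.0 - 2.0 ** (-i))         # T_i
            hi = T * (1.0 - 2.0 ** (-(i + 1)))   # T_{i+1}
            if lo < t <= hi:
                return C * 2.0 ** (-i) / np.sqrt(T)
        # any remaining iterations keep the final stage's rate
        return C * 2.0 ** (-(num_stages - 1)) / np.sqrt(T)

With this choice the stepsize halves each time the remaining horizon halves, which is what produces the kinks mentioned in the comments below.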
Comments:
The last iterate of vanilla SGD does not converge
The last iterate of liSGD eventually converges to the SGD average iterate
Kinks in the liSGD curve correspond to the learning-rate changes
liSGD vs. Vanilla SGD (Case 2.2)
Scenario: Minimizing an L-Lipschitz & α-strongly convex function (λ = 1, γ = 10).
Details:
Stage-wise choose η_t = 2^{-i}/(αt) when T_i < t < T_{i+1}, 0 ≤ i ≤ k, for liSGD
η_t = 1/(α(t+1)) for vanilla SGD
Comments:
LiSGD does better than the average iterate of SGD
Conclusion
SGDo:
Using a novel step-size sequence, it provides theoretical guarantees on convergence when the iterates are sampled without replacement (smoothness required).
SGDo loses the last-iterate guarantee that vanilla SGD enjoys for strongly convex and smooth functions, and hence may not see heavy use in practice.
liSGD:
Using a novel step-size sequence, it provides theoretical guarantees on the suboptimality rate when outputting the last iterate.
The most interesting results are in the scenarios that are not both strongly convex and smooth.
Experimental results suggest that it could even be practical.
THANK YOU!