
Uploaded by Lê Hoàng

Machine learning


Learning rate & model performance: the learning rate significantly affects model performance

-> It's one of the most difficult hyperparameters to set

Learning rate & cost function: the cost is often highly sensitive to some directions in parameter space & insensitive to others

Method of momentum: can mitigate these issues somewhat, but introduces another hyperparameter (the momentum coefficient)

-> It's natural to ask if there's another way
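As a point of comparison, momentum on an ill-conditioned quadratic can be sketched in a few lines of NumPy (the quadratic, learning rate & momentum coefficient below are illustrative choices, not from the notes):

```python
import numpy as np

def sgd_momentum(grad, theta0, lr=0.05, beta=0.9, n_steps=500):
    """Gradient descent with classical (heavy-ball) momentum."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)              # velocity: decaying sum of past gradients
    for _ in range(n_steps):
        g = grad(theta)
        v = beta * v - lr * g             # beta is the extra hyperparameter momentum adds
        theta = theta + v
    return theta

# Ill-conditioned quadratic: cost is sensitive along x, insensitive along y.
grad = lambda t: np.array([10.0 * t[0], 0.1 * t[1]])
theta = sgd_momentum(grad, [1.0, 1.0])
```

The coefficient beta is exactly the extra hyperparameter the note mentions: too high & the iterates overshoot, too low & momentum barely helps.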

Separate learning rates: if we believe that the directions of sensitivity are somewhat axis-aligned

-> It makes sense to use a separate learning rate for each parameter & automatically adapt these learning rates throughout the learning process

Delta-bar-delta algorithm: an early heuristic approach to adapting individual learning rates for model parameters during training

Basic idea:

1. If the partial derivative of the loss, with respect to a given model parameter, keeps the same sign

-> The learning rate for that parameter should increase

2. If the partial derivative changes sign

-> The learning rate should decrease

#NOTE: this rule can only be applied to full-batch optimization

Explanation: the gradient computed in full-batch optimization is the exact gradient, so its sign changes are meaningful rather than minibatch noise
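The two rules can be sketched in NumPy as follows. This is a simplified version that compares against the previous gradient; the classical algorithm compares against an exponentially smoothed "bar" gradient, and the increment kappa & decay factor phi here are illustrative values:

```python
import numpy as np

def delta_bar_delta(grad, theta0, lr0=0.1, kappa=0.01, phi=0.5, n_steps=200):
    """Per-parameter adaptive rates: grow a rate additively while its
    partial derivative keeps the same sign, shrink it multiplicatively
    on a sign flip. Full-batch gradients only."""
    theta = np.asarray(theta0, dtype=float)
    lr = np.full_like(theta, lr0)         # one learning rate per parameter
    prev_g = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad(theta)
        lr = np.where(g * prev_g > 0, lr + kappa, lr)   # same sign -> increase
        lr = np.where(g * prev_g < 0, lr * phi, lr)     # sign flip -> decrease
        theta = theta - lr * g
        prev_g = g
    return theta

# Full-batch gradient of f(theta) = ||theta||^2 / 2
grad = lambda t: t
theta = delta_bar_delta(grad, [1.0, -3.0])
```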

ADAGRAD

Idea: individually adapt the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient

Pseudo-code:

Require: global learning rate eps (initially applied to all parameters)

Require: initial parameter theta

Require: a small constant delta (maybe 10^(-7)) for numerical stability

initialize gradient accumulation variable r = 0

while stopping criterion not met:

    sample a minibatch of m examples {x(1), ..., x(m)} from the training set, with corresponding targets y(i)

    compute gradient: g = 1/m * grad(sum(L(f(x(i); theta), y(i)) over all i), theta)

    accumulate squared gradient: r = r + g .* g (element-wise square)

    compute update: Dtheta = - eps / (delta + sqrt(r)) .* g (division & sqrt applied element-wise)

    apply update: theta = theta + Dtheta
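A runnable NumPy sketch of AdaGrad, squaring the gradient element-wise; the quadratic test function & hyperparameter values are illustrative assumptions:

```python
import numpy as np

def adagrad(grad, theta0, eps=0.5, delta=1e-7, n_steps=500):
    """AdaGrad: divide each parameter's step by the square root of its
    accumulated squared partial derivatives."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)              # per-parameter accumulator
    for _ in range(n_steps):
        g = grad(theta)
        r = r + g * g                     # element-wise square, not a norm
        theta = theta - eps / (delta + np.sqrt(r)) * g
    return theta

# Anisotropic quadratic: gradients differ by 100x across the two axes,
# yet AdaGrad's per-parameter scaling makes progress along both.
grad = lambda t: np.array([10.0 * t[0], 0.1 * t[1]])
theta = adagrad(grad, [1.0, 1.0])
```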

Effect of the parameter update:

The parameters with the largest partial derivatives of the loss: have a correspondingly rapid decrease in their learning rate

The parameters with small partial derivatives of the loss: have a relatively small decrease in their learning rate

-> The net effect is greater progress in the more gently sloped directions of parameter space

#NOTE: for training deep models, the accumulation of squared gradients from the beginning of training can result in a premature & excessive decrease in the effective learning rate

-> AdaGrad performs well for some but not all deep models

RMSPROP

RMSProp: a modified version of AdaGrad that performs well in the non-convex setting

Idea: change the gradient accumulation into an exponentially weighted moving average

Recall AdaGrad:

Convex function: AdaGrad is designed to converge rapidly when applied to a convex function

Non-convex function:

1. The learning trajectory may pass through many different structures

-> It may eventually arrive at a region that is a locally convex bowl

2. AdaGrad shrinks the learning rate according to the entire history of the squared gradient

-> The learning rate may be too small before arriving at such a convex structure

RMSProp vs AdaGrad: RMSProp uses an exponentially decaying average to discard history from the extreme past, so that it can converge rapidly after finding a convex bowl

-> RMSProp can be seen as an instance of the AdaGrad algorithm initialized within that bowl

Pseudo-code:

Require: global learning rate eps, decay rate rho

Require: initial parameter theta

Require: small constant delta (maybe 10^(-6)) used to stabilize division by small numbers

initialize accumulation variable r = 0

while stopping criterion not met:

    sample a minibatch of m examples {x(1), ..., x(m)} from the training set, with corresponding targets y(i)

    compute gradient: g = 1/m * grad(sum(L(f(x(i); theta), y(i)) over all i), theta)

    accumulate squared gradient: r = rho*r + (1 - rho) * g .* g (element-wise square)

    compute parameter update: Dtheta = - eps / sqrt(delta + r) .* g

    apply update: theta = theta + Dtheta
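A matching NumPy sketch, again with an illustrative quadratic & hyperparameter values:

```python
import numpy as np

def rmsprop(grad, theta0, eps=0.01, rho=0.9, delta=1e-6, n_steps=500):
    """RMSProp: AdaGrad with the accumulator replaced by an exponentially
    weighted moving average, so distant history is forgotten."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad(theta)
        r = rho * r + (1.0 - rho) * g * g     # EWMA of element-wise squared gradients
        theta = theta - eps / np.sqrt(delta + r) * g
    return theta

grad = lambda t: np.array([10.0 * t[0], 0.1 * t[1]])
theta = rmsprop(grad, [1.0, 1.0])
```

Note that with a fixed global learning rate, RMSProp tends to hover in a small band around the minimum rather than converging exactly; in practice the rate is often decayed or RMSProp is combined with momentum.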

#NOTE:

1. Empirically, RMSProp has been shown to be an effective & practical optimization algorithm for deep neural nets

2. It's currently one of the go-to optimization methods employed routinely by deep learning practitioners

NEW WORD

Accumulation (n): sự tích lũy
