Super Kai (Kazuya Ito)

Optimizers in PyTorch


*Memos:

An optimizer is a gradient descent algorithm which can find the minimum(or maximum) of a function by following its gradient(slope), updating(adjusting) a model's parameters(weight and bias) to minimize the mean(average) of the losses(differences) between the model's predictions and the true values(train data) during training, as shown in the sketch after the list below.

  • CGD(Classic Gradient Descent)(1847) explained at (1).
  • Momentum(1964) explained at (2).
  • Nesterov's Momentum(1983) explained at (3).
  • AdaGrad(2011).
  • RMSprop(2012) explained at (4).
  • AdaDelta(2012).
  • Adam(2014) explained at (5).
  • AdaMax(2015).
  • Nadam(2016).
  • AMSGrad(2018).
  • AdaBound(2019) explained at (6).
  • AMSBound(2019).
  • AdamW(2019).
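
As a quick reference for the sections below, here is a minimal sketch of how an optimizer is used in a PyTorch training loop; the toy data, model and hyperparameters are made up purely for illustration.

```python
import torch
from torch import nn

# Toy data and model, made up only to illustrate the optimizer workflow.
x = torch.randn(100, 3)            # 100 samples, 3 features
y = torch.randn(100, 1)            # 100 true values (train data)
model = nn.Linear(3, 1)            # its weight and bias are the parameters to update
loss_fn = nn.MSELoss()             # mean of the losses between predictions and true values

# Any optimizer below plugs in here, e.g. SGD(), RMSprop(), Adam().
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    optimizer.zero_grad()          # clear the gradients of the previous step
    loss = loss_fn(model(x), y)    # compute the mean loss
    loss.backward()                # compute the gradients (slopes)
    optimizer.step()               # update (adjust) the parameters
```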

(1) CGD(Classic Gradient Descent)(1847):

  • is the optimizer that does basic gradient descent with no special features.
  • is SGD() in PyTorch. *SGD() in PyTorch is Classic Gradient Descent(CGD), not Stochastic Gradient Descent(SGD).
  • can also be called Vanilla Gradient Descent(VGD).
  • 's pros:
    • It's simple.
    • Other optimizers are based on it.
  • 's cons:
    • It has no special features.
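
A minimal sketch of CGD with SGD() in PyTorch, assuming a toy model made up for illustration; with its default arguments SGD() does plain gradient descent.

```python
import torch
from torch import nn

model = nn.Linear(3, 1)  # toy model, made up for illustration

# With its defaults (momentum=0, nesterov=False), SGD() is plain (classic) gradient descent.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```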

(2) Momentum(1964) (Add-on):

  • is the add-on to other optimizers which can accelerate(speed up) convergence by mitigating fluctuation, considering the past and current gradients and giving more importance to newer gradients with EWA. *Memos:
    • EWA(Exponentially Weighted Average) is the algorithm to smooth a trend(to mitigate the fluctuation of a trend), considering the past and the current values, giving more importance to newer values.
    • EWA is also called EWMA(Exponentially Weighted Moving Average).
  • is added to SGD(), RMSprop() and Adam() in PyTorch.
  • 's pros:
    • It uses EWA.
    • It escapes local minima and saddle points.
    • It creates an accurate model.
    • It mitigates fluctuation.
    • It mitigates overshooting.
    • It accelerates the convergence.
  • 's cons:
    • ...
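
A minimal sketch of Momentum as an add-on in PyTorch, assuming a toy model made up for illustration; the momentum value 0.9 is just a common choice, not a recommendation.

```python
import torch
from torch import nn

model = nn.Linear(3, 1)  # toy model, made up for illustration

# Momentum as an add-on: the `momentum` argument enables the EWA of past gradients.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# RMSprop() also accepts `momentum` on top of its adaptive learning rate.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, momentum=0.9)

# In Adam(), momentum is built in: beta1 (the first value of betas) is the EWA
# coefficient for the gradients.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
```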

(3) Nesterov's Momentum(1983) (Add-on):

  • is Momentum(1964) with an additional feature which calculates the gradient at a slightly look-ahead position to accelerate the convergence more than Momentum(1964).
  • is also called Nesterov Accelerated Gradient(NAG).
  • is added to SGD() and NAdam() in PyTorch.
  • 's pros:
    • It uses EWA.
    • It more easily escapes local minima and saddle points than Momentum(1964).
    • It creates a more accurate model than Momentum(1964).
    • It mitigates fluctuation more than Momentum(1964).
    • It mitigates overshooting more than Momentum(1964).
    • It accelerates the convergence more than Momentum(1964).
  • 's cons:
    • ...
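
A minimal sketch of Nesterov's Momentum in PyTorch, assuming a toy model made up for illustration.

```python
import torch
from torch import nn

model = nn.Linear(3, 1)  # toy model, made up for illustration

# Nesterov's Momentum in SGD(): nesterov=True requires momentum > 0 (and dampening=0).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# NAdam() is Adam with Nesterov momentum built in.
optimizer = torch.optim.NAdam(model.parameters(), lr=0.002)
```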

(4) RMSProp(2012):

  • is the optimizer which can do gradient descent by automatically adapting the learning rate to the parameters, considering the past and current gradients with EWA and giving much more importance to newer gradients than Momentum(1964), to accelerate convergence by mitigating fluctuation. *The learning rate is not fixed.
  • 's learning rate decreases as it approaches a global minimum to find the optimal solution precisely.
  • 's EWA is a little bit different from Momentum(1964)'s, giving much more importance to newer gradients than Momentum(1964).
  • is the improved version of AdaGrad(2011), which can do gradient descent by adapting the learning rate to the parameters, considering the past and current gradients, to accelerate convergence by mitigating fluctuation. *The learning rate is not fixed.
  • 's pros:
    • It automatically adapts the learning rate to the parameters.
    • It uses EWA.
    • It escapes local minima and saddle points.
    • It creates an accurate model.
    • It mitigates fluctuation.
    • It mitigates overshooting.
    • It accelerates the convergence.
  • 's cons:
    • ...
  • is RMSprop() in PyTorch.
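
A minimal sketch of RMSProp in PyTorch, assuming a toy model made up for illustration; the values shown are RMSprop()'s defaults.

```python
import torch
from torch import nn

model = nn.Linear(3, 1)  # toy model, made up for illustration

# RMSprop(): alpha is the EWA coefficient for the squared gradients, which drives
# the per-parameter adaptive learning rate; eps avoids division by zero.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, eps=1e-08)
```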

(5) Adam(Adaptive Moment Estimation)(2014):

  • is the combination of Momentum(1964) and RMSProp(2012).
  • uses Momentum(1964)'s EWA instead of RMSProp(2012)'s.
  • 's pros:
    • It automatically adapts the learning rate to the parameters.
    • It uses EWA.
    • It escapes local minima and saddle points.
    • It creates an accurate model.
    • It mitigates fluctuation.
    • It mitigates overshooting.
    • It accelerates the convergence.
  • 's cons:
    • ...
  • is Adam() in PyTorch.
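
A minimal sketch of Adam in PyTorch, assuming a toy model made up for illustration; the values shown are Adam()'s defaults.

```python
import torch
from torch import nn

model = nn.Linear(3, 1)  # toy model, made up for illustration

# Adam(): beta1 is the EWA coefficient for the gradients (the Momentum part) and
# beta2 is the EWA coefficient for the squared gradients (the RMSProp part).
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08)
```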

(6) AdaBound(2019):

  • is Adam(2014) with dynamic bounds(a dynamic upper and lower limit on the learning rate) which can stabilize the convergence to accelerate it more than Adam(2014).
  • 's pros:
    • It automatically adapts the learning rate to the parameters.
    • It uses EWA.
    • It uses the dynamic bounds(the dynamic upper and lower limit).
    • It more easily escapes local minima and saddle points than Adam(2014).
    • It creates a more accurate model than Adam(2014).
    • It mitigates fluctuation more than Adam(2014).
    • It mitigates overshooting more than Adam(2014).
    • It accelerates the convergence more than Adam(2014).
  • 's cons:
    • ...
  • isn't in PyTorch, so you can use AdaBound() from the external adabound package.
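
A minimal sketch of AdaBound, assuming the third-party adabound package (it isn't part of torch.optim) and a toy model made up for illustration; the exact API may differ by version.

```python
import adabound  # third-party package, e.g. `pip install adabound`; not part of torch.optim
from torch import nn

model = nn.Linear(3, 1)  # toy model, made up for illustration

# final_lr is the SGD-like learning rate that the dynamic bounds converge to.
optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
```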
