Hello,
The following topics are covered in this blog:-
Introduction
Optimization Algorithms:
Gradient Descent (GD)
Stochastic Gradient Descent (SGD)
Mini-batch SGD
SGD with Momentum
AdaGrad
AdaDelta & RMSProp
Adam
Conclusion
Download the whole blog from the following link:-
https://github.com/ruthvikraja/Optimization-Algorithms.git
Introduction
Neural networks are a subset of Machine Learning in which a network adapts and learns from vast amounts of data.
The Neuron is the building block of a Neural network: it takes some input, performs a mathematical computation by multiplying the input values with their corresponding (initially random) weights, and finally produces an output.
Each node in the Hidden and Output layers of a Neural network is composed of two functions, namely a Linear function and an Activation function. In forward propagation, the Linear function is computed as the sum of each previously connected node's output multiplied by its corresponding weight, plus a bias, as shown in the Figure.
After applying the Linear function, an Activation function like Sigmoid, ReLU, Leaky ReLU, Parametric ReLU, Swish, Softplus, etc. is applied based on the problem type and requirement.
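For illustration, here is a minimal sketch of a single node's forward pass, assuming three inputs, randomly initialised weights and a Sigmoid activation (all example values are arbitrary):-

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation squashes the linear output into (0, 1)
    return 1 / (1 + np.exp(-z))

def neuron_forward(x, w, b):
    # Linear function: weighted sum of the inputs plus a bias
    z = np.dot(w, x) + b
    # Activation function applied on top of the linear output
    return sigmoid(z)

x = np.array([0.5, 0.1, 0.4])   # outputs of the previously connected nodes
w = np.random.randn(3)          # randomly initialised weights
b = 0.0                         # bias
print(neuron_forward(x, w, b))
```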
Role of an Optimizer
After computing the output at the Output layer, the predicted value is compared with the actual value by computing the Loss.
The Loss function determines the error between the actual and predicted values. The Optimization algorithm then determines the new weight values, using the gradient of the Loss w.r.t. the weights, to bring the output of the next iteration closer to the actual output.
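For illustration, a minimal sketch of these two pieces, assuming a Mean Squared Error Loss and a plain gradient step with a learning rate of 0.01 (the gradient dL_dw would come from backpropagation):-

```python
import numpy as np

def mse_loss(y_actual, y_pred):
    # Loss: Mean Squared Error between the actual and predicted values
    return np.mean((y_actual - y_pred) ** 2)

def optimizer_step(w, dL_dw, learning_rate=0.01):
    # Generic optimizer step: move the weights against the gradient of the
    # Loss w.r.t. the weights so the next prediction gets closer to the target
    return w - learning_rate * dL_dw
```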
Gradient Descent
- The formula to compute new weights using Gradient Descent is as follows:-
- The formula to compute Loss using Gradient Descent is as follows:-
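In plain terms, the standard Gradient Descent update is w_new = w_old − η · ∂L/∂w, where η is the learning rate and the Loss (and hence its gradient) is computed over the entire training set in every iteration. A minimal sketch, assuming an MSE Loss on a linear model and a learning rate of 0.01:-

```python
import numpy as np

# Plain Gradient Descent: the gradient is computed over the ENTIRE dataset
# before every weight update. X has shape (n_samples, n_features).
def gradient_descent(X, y, w, learning_rate=0.01, epochs=100):
    n = len(y)
    for _ in range(epochs):
        y_pred = X @ w                        # predictions for all samples
        grad = (2 / n) * X.T @ (y_pred - y)   # gradient of the MSE Loss w.r.t. w
        w = w - learning_rate * grad          # new weights = old weights - lr * gradient
    return w
```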
Stochastic Gradient Descent
- The formula to compute new weights using Stochastic Gradient Descent is as follows:-
- The formula to compute Loss using Stochastic Gradient Descent is as follows:-
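SGD applies the same update w_new = w_old − η · ∂L/∂w, but the Loss is computed on a single randomly picked sample at a time, which makes each update cheap but noisy. A minimal sketch, again assuming an MSE Loss on a linear model:-

```python
import numpy as np

# Stochastic Gradient Descent: the weights are updated using the gradient
# of the Loss computed on ONE randomly chosen sample at a time.
def sgd(X, y, w, learning_rate=0.01, epochs=100):
    n = len(y)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            error = X[i] @ w - y[i]
            grad = 2 * error * X[i]        # gradient from a single sample
            w = w - learning_rate * grad
    return w
```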
Mini-Batch Stochastic Gradient Descent
- The formula to compute new weights using Mini-Batch Stochastic Gradient Descent is as follows:-
- The formula to compute Loss using Mini-Batch Stochastic Gradient Descent is as follows:-
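Mini-batch SGD again applies the same update, but the Loss is computed on a small random batch of samples per step. A minimal sketch, assuming an MSE Loss on a linear model and a batch size of 32:-

```python
import numpy as np

# Mini-batch SGD: each update uses a small random batch of samples,
# a compromise between GD (whole dataset) and SGD (one sample).
def minibatch_sgd(X, y, w, learning_rate=0.01, epochs=100, batch_size=32):
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            y_pred = X[batch] @ w
            grad = (2 / len(batch)) * X[batch].T @ (y_pred - y[batch])
            w = w - learning_rate * grad
    return w
```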
Overall Comparison (GD vs SGD vs Mini-Batch SGD)
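In short, GD uses the whole dataset for every update (stable convergence but slow and memory-hungry), SGD uses one sample per update (fast and cheap but very noisy), and Mini-batch SGD uses a small batch per update, which is the usual compromise in practice.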
Stochastic Gradient Descent with Momentum
- The formula to compute new weights using Stochastic Gradient Descent with Momentum is as follows:-
- The formula to compute Loss using Stochastic Gradient Descent with Momentum is as follows:-
For better illustration, consider the following scenario to calculate the Exponentially Weighted Average:-
- Therefore, the final updated formulae to calculate new weights & bias are as follows:-
where,
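A minimal sketch of one update step, assuming the Exponentially-Weighted-Average form of momentum with β = 0.9 (the bias is updated in exactly the same way with its own velocity term):-

```python
# SGD with Momentum: keep an exponentially weighted average of past
# gradients (v) and update the weights with that smoothed gradient.
# w, grad and v can be floats or NumPy arrays.
def sgd_momentum_step(w, grad, v, learning_rate=0.01, beta=0.9):
    v = beta * v + (1 - beta) * grad   # exponentially weighted average of gradients
    w = w - learning_rate * v          # weight update with the smoothed gradient
    return w, v
```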
Adaptive Gradient Descent
- The formula to compute new weights using Adaptive Gradient Descent is as follows:-
- The formula to compute Loss using Adaptive Gradient Descent is as follows:-
where,
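The key idea of AdaGrad is that each weight gets its own effective learning rate, η / √(αₜ + ε), where αₜ is the running sum of that weight's squared gradients. A minimal sketch of one update step, assuming ε = 1e-8:-

```python
import numpy as np

# AdaGrad: the learning rate of each weight is scaled down by the square
# root of the running SUM of its squared gradients (alpha).
def adagrad_step(w, grad, alpha, learning_rate=0.01, eps=1e-8):
    alpha = alpha + grad ** 2                               # accumulate squared gradients
    w = w - (learning_rate / np.sqrt(alpha + eps)) * grad   # per-weight effective learning rate
    return w, alpha
```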
Adaptive Learning Rate Method (AdaDelta) & Root Mean Square Propagation (RMSProp)
- The formula to compute new weights using AdaDelta & RMSProp is as follows:-
- The formula to compute Loss using AdaDelta & RMSProp is as follows:-
where,
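RMSProp (and the restricted-window idea behind AdaDelta) replaces AdaGrad's ever-growing sum with an Exponentially Weighted Average of the squared gradients, so the effective learning rate does not shrink towards zero. A minimal sketch of one RMSProp-style step, assuming β = 0.9 and ε = 1e-8:-

```python
import numpy as np

# RMSProp-style step: keep an exponentially weighted average (s) of the
# squared gradients instead of AdaGrad's full sum.
def rmsprop_step(w, grad, s, learning_rate=0.001, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad ** 2               # EWA of squared gradients
    w = w - (learning_rate / np.sqrt(s + eps)) * grad   # adaptive per-weight update
    return w, s
```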
Adaptive Moment Estimation
- The formulae to compute new weights & bias using Adam are as follows:-
- The formulae to compute Loss for Regression & Classification problems using Adam are as follows:-
where,
- When utilising Exponentially Weighted Averages, there is a process known as Bias correction, which was introduced to get better estimates at the initial time steps. The formulae for Bias correction are as follows:-
- The updated Weight & Bias formulae are as follows:-
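Putting it together, Adam keeps an Exponentially Weighted Average of the gradients (as in Momentum) and of the squared gradients (as in RMSProp), applies Bias correction to both, and then updates the weights. A minimal sketch of one step, assuming the commonly used defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8 (the network's bias is updated the same way with its own m and v):-

```python
import numpy as np

# Adam: combines Momentum (first moment m) and RMSProp (second moment v),
# plus bias correction for the early time steps. t is the current time
# step, starting from 1.
def adam_step(w, grad, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # EWA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # EWA of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```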
Conclusion
In this blog, different Optimization algorithms available in the field of Artificial Intelligence for reducing the Loss function of a Neural Network were discussed in detail.
Overall, the Adam Optimizer is comparatively better than the other algorithms because it combines the ideas of Momentum and RMSProp along with Bias correction.
However, there is no guarantee that the Adam optimizer will outperform the others on every dataset, because the choice depends on several other factors like the type of problem, the size of the input data, the number of features, etc.
Gradient Descent and SGD work well for small datasets, whereas Mini-batch SGD, SGD with Momentum & RMSProp can be tried on large datasets.