How Gradient Descent Powers Machine Learning Models

Introduction

Building accurate machine learning models relies heavily on optimisation techniques, and gradient descent is one of the most widely used. Gradient descent helps models adjust their parameters, minimise errors and improve performance over time. In this article, I will dive into gradient descent as a concept and explain why it is important in the machine learning process.

What is Gradient Descent?

Gradient descent is a process that reduces a model's errors by adjusting its parameters iteratively until it finds the values that minimise the loss function.

The loss function measures the difference between the predicted value and the actual value. To get the predicted value, the model runs a calculation that involves parameters. These parameters determine how the model processes input data to generate its predictions, and they are adjusted during training to minimise the loss function and improve accuracy. Gradient descent handles the adjustment of these parameters.

How Gradient Descent Works

This is the equation behind the simple linear regression model. It is similar to the equation of a line.

$$\hat{y} = wx + b$$

The parameters in this equation are w and b, the weight and the bias. When the linear regression model makes a prediction using this equation, the predicted output is compared with the actual output using the following equation:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

where m is the number of training examples, $\hat{y}^{(i)}$ is the prediction for the i-th example and $y^{(i)}$ is its actual value.

When you substitute the equation for the predicted value into the loss function, you get this:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left(wx^{(i)} + b - y^{(i)}\right)^2$$

This loss function is called the Mean Squared Error (MSE). The smaller the difference between the predicted value and the actual value, the more accurate the model's predictions and this accuracy depends on the values of the weight (w) and the bias (b).
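
As a quick illustration, here is a minimal NumPy sketch of the MSE calculation (the sample values below are made up for demonstration):

```python
import numpy as np

def mse(y_pred, y_actual):
    # Mean Squared Error: the average of the squared differences
    return np.mean((y_pred - y_actual) ** 2)

y_actual = np.array([50, 80, 100])  # actual target values
y_pred = np.array([45, 85, 100])    # hypothetical model predictions

print(mse(y_pred, y_actual))  # ~16.67 -> small differences, small loss
```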

House pricing dataset:

| Size (X) | Price (y) |
| --- | --- |
| 500 | 50 |
| 800 | 80 |
| 1000 | 100 |
| 1500 | 150 |
| 2000 | 200 |

If you build a linear regression model that predicts house prices based on size only, the feature (X) will be the size of the house and the target (y) will be the price of the house.

```python
import numpy as np

x_train = np.array([500, 800, 1000, 1500, 2000])  # house sizes (features)
y_train = np.array([50, 80, 100, 150, 200])       # house prices (targets)
```

If w is 0 and b is 0, the model predicts the target value to be 0:

```python
w = 0
b = 0

pred_y = w * x_train[0] + b  # x_train[0] = 500
# pred_y = 0
```

The actual target value when the size of the house is 500 is 50, so this prediction is off by 50 and the model is far from accurate. However, if w and b were different, the model would predict a different target value.

To see how the choice of w affects the loss function, look at the graph below:

*Figure: MSE loss function vs. w, with b set to 0*

In the above graph, you can see that the loss reduces until it reaches a minimum (at w = 0.1). This is called the global minimum, and the value of w at this point gives the smallest possible value of the loss function.
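
You can reproduce the shape of that curve yourself: the sketch below tries a range of w values (with b fixed at 0) and computes the MSE for each. Because the prices in this dataset are exactly 0.1 times the sizes, the loss drops all the way to zero at w = 0.1:

```python
import numpy as np

x_train = np.array([500, 800, 1000, 1500, 2000])
y_train = np.array([50, 80, 100, 150, 200])

w_values = np.linspace(0, 0.2, 21)  # candidate weights, b fixed at 0
losses = [np.mean((w * x_train - y_train) ** 2) for w in w_values]

best_w = w_values[np.argmin(losses)]
print(best_w)       # 0.1
print(min(losses))  # ~0.0 -> the global minimum
```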

Instead of guessing the values of w and b, the gradient descent algorithm starts from initial values and repeatedly moves w and b in the direction that reduces the loss, until it converges on the values that give the smallest possible loss. Gradient descent is defined as:

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$$

$$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$$

The gradient descent algorithm updates the w and b parameters simultaneously after each iteration using the equations above, until it reaches the values that minimise the loss function and produce accurate predictions.
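
Here is a minimal sketch of that update loop for the house-price example, using the MSE defined above (a 1/(2m) factor, as in some course formulations, would only rescale the learning rate). Note the very small learning rate: the raw feature values are large, so bigger rates would diverge; feature scaling is the usual remedy, but it is skipped here to keep the sketch short:

```python
import numpy as np

x_train = np.array([500, 800, 1000, 1500, 2000], dtype=float)
y_train = np.array([50, 80, 100, 150, 200], dtype=float)

w, b = 0.0, 0.0
alpha = 1e-7  # learning rate, kept tiny because the features are unscaled
m = len(x_train)

for _ in range(10_000):
    error = w * x_train + b - y_train          # prediction error
    dj_dw = (2 / m) * np.sum(error * x_train)  # partial derivative w.r.t. w
    dj_db = (2 / m) * np.sum(error)            # partial derivative w.r.t. b
    w = w - alpha * dj_dw                      # simultaneous update:
    b = b - alpha * dj_db                      # both use the old w and b

print(w, b)  # w converges towards 0.1, b stays near 0
```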

The Role of the Learning Rate

The α (alpha) symbol in the gradient descent equations is known as the learning rate. The learning rate is a value that determines how much or how little the parameters get updated after each iteration.

If the learning rate is too small, gradient descent will take too long to reach the global minimum. However, if the learning rate is too big, gradient descent might overshoot the global minimum, and the loss function will increase instead of decrease, which you don't want.

You provide the gradient descent algorithm with a good learning rate value. Values between 0.001 and 1 are common starting points, but the right choice depends on the problem (and on the scale of your features), so some experimentation is usually needed.
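
To build intuition, you can run the same update loop with a few different learning rates and watch what happens to the loss. The specific rates below are illustrative choices for this unscaled dataset:

```python
import numpy as np

x = np.array([500, 800, 1000, 1500, 2000], dtype=float)
y = np.array([50, 80, 100, 150, 200], dtype=float)
m = len(x)

def final_loss(alpha, steps=100):
    # run gradient descent for `steps` iterations, return the final MSE
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = w * x + b - y
        w -= alpha * (2 / m) * np.sum(error * x)
        b -= alpha * (2 / m) * np.sum(error)
    return np.mean((w * x + b - y) ** 2)

for alpha in (1e-9, 1e-7, 1e-6):
    print(alpha, final_loss(alpha))
# 1e-9: too small   -> the loss decreases very slowly
# 1e-7: about right -> the loss drops close to 0
# 1e-6: too large   -> the updates overshoot and the loss explodes
```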

To learn more about gradient descent and the learning rate, including graphs, check out this notebook on Gradient Descent.

Types of Gradient Descent

Batch Gradient Descent
This type of gradient descent computes the gradient of the entire dataset to update parameters. Each iteration uses all training examples to calculate the gradient. It is best for small datasets and when computing power is sufficient.

Stochastic Gradient Descent (SGD)
The Stochastic Gradient Descent (SGD) algorithm updates parameters for each individual training example. It iterates through examples one at a time. It is best for very large datasets or when computational efficiency is a priority.

Mini-batch Gradient Descent
This type of gradient descent combines the strengths of batch and stochastic gradient descent. It computes the gradient for small random subsets (mini-batches) of the dataset and updates parameters. It is commonly used when training deep learning models.
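
The three variants differ only in how many examples each parameter update sees. Here is a minimal sketch of mini-batch gradient descent for the same linear model; setting batch_size to the full dataset size gives batch gradient descent, and batch_size = 1 gives SGD (the batch size and epoch count below are arbitrary choices for illustration):

```python
import numpy as np

def minibatch_gd(x, y, alpha=1e-7, batch_size=2, epochs=200, seed=0):
    # batch_size = len(x) -> batch gradient descent
    # batch_size = 1      -> stochastic gradient descent (SGD)
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)  # shuffle the examples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = w * x[idx] + b - y[idx]
            w -= alpha * (2 / len(idx)) * np.sum(error * x[idx])
            b -= alpha * (2 / len(idx)) * np.sum(error)
    return w, b

x = np.array([500, 800, 1000, 1500, 2000], dtype=float)
y = np.array([50, 80, 100, 150, 200], dtype=float)
print(minibatch_gd(x, y))  # w near 0.1, b near 0
```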

Conclusion

Gradient descent is a very important optimisation technique for machine learning models. Its ability to minimise the loss function iteratively allows models to improve with each step, resulting in more accurate predictions. By understanding the differences between the various gradient descent methods, you can adjust your approach to fit each problem, making model training faster and more accurate.

Resources:

Machine Learning Specialization course on Coursera
