The cost function is a crucial concept in machine learning, helping us understand how well our models are performing. It's the tool that tells us how close our model's predictions are to the actual results and guides us in improving accuracy. In this post, we'll break down the cost function in simple terms, with a focus on linear regression.
Introduction to Cost Functions
The cost function tells us how well our model's predictions match the actual target values. Essentially, it measures the error between the predicted values and the true values. By minimizing this error, we can improve our model's accuracy.
Consider a training set with input features $x$ and output targets $y$.
Size in feet² (x) | Price $1000s (y) |
---|---|
2104 | 460 |
1416 | 232 |
1534 | 315 |
852 | 178 |
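The model you use to fit this training set can be represented by a linear function:
$$f_{w,b}(x) = wx + b$$
For a training example $(x^{(i)}, y^{(i)})$, the function $f_{w,b}$ predicts $y^{(i)}$ as $\hat{y}^{(i)}$. Thus:
$$\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b$$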
The challenge is to find $w$ and $b$ that make the prediction $\hat{y}^{(i)}$ close to the target $y^{(i)}$ for all training examples.
Here, $w$ and $b$ are the parameters of the model. These parameters are adjusted during training to enhance the model's performance.
Depending on the values chosen for $w$ and $b$, we will get different functions $f_{w,b}(x)$, which generate different lines on a graph. Writing $f(x)$ as shorthand for $f_{w,b}(x)$, we can look at some plots to understand how $w$ and $b$ influence $f(x)$.
- When $w = 0$ and $b = 1.5$
The function is a horizontal line, predicting a constant value of 1.5.
- When $w = 0.5$ and $b = 0$
The slope is 0.5, creating a line that passes through the origin and increases steadily.
- When $w = 0.5$ and $b = 1$
This line has a slope of 0.5 and intersects the vertical axis at 𝑏=1.
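The effect of the two parameters can also be checked numerically. The sketch below assumes the three settings are (w=0, b=1.5), (w=0.5, b=0), and (w=0.5, b=1); the function name `predict` and the sample inputs are made up for illustration.

```python
def predict(x, w, b):
    """Linear model f_{w,b}(x) = w * x + b."""
    return w * x + b

# Assumed (w, b) settings matching the three lines described above.
settings = [(0.0, 1.5), (0.5, 0.0), (0.5, 1.0)]
inputs = [0, 1, 2]  # arbitrary sample inputs

for w, b in settings:
    outputs = [predict(x, w, b) for x in inputs]
    print(f"w={w}, b={b}: {outputs}")
# w=0.0, b=1.5: [1.5, 1.5, 1.5]
# w=0.5, b=0.0: [0.0, 0.5, 1.0]
# w=0.5, b=1.0: [1.0, 1.5, 2.0]
```

Changing `w` changes how fast the outputs grow with `x`, while changing `b` shifts every output by the same amount.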
Understanding Errors
The cost function calculates the error between the predicted prices $\hat{y}$ and the actual prices $y$. For the $i$-th training example, this error is given by:
$$\hat{y}^{(i)} - y^{(i)}$$
Then, we square this error to avoid negative values. Squaring ensures that no error is negative and emphasizes larger errors more than smaller ones. Summed over all $m$ training examples, it will be:
$$\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
Here, if we have more training examples, the sum of the squared errors will naturally be larger. To normalize this, we use the average squared error instead of the total squared error. This way, the cost function doesn't automatically get bigger just because we have more training examples, so the comparison stays fair no matter how big our dataset is. Dividing by the number of examples $m$:
$$\frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
To simplify the derivative calculations during optimization (like gradient descent), we add a factor of $\frac{1}{2}$ to the cost function. Thus, the final cost function formula will be:
$$J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
The extra division by 2 is a bit of a mathematical trick to make later calculations easier, especially when we use calculus to minimize the cost function.
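As a rough sketch, the squared error cost can be computed on the training set from the table. The function name `compute_cost` and the two parameter guesses are hypothetical and chosen only for illustration; they are not fitted values.

```python
def compute_cost(xs, ys, w, b):
    """Squared error cost J(w, b) = 1/(2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (w * x + b) - y  # prediction minus target
        total += error ** 2      # squared so that errors cannot cancel
    return total / (2 * m)

# Training set from the table: size in square feet, price in $1000s.
sizes = [2104, 1416, 1534, 852]
prices = [460, 232, 315, 178]

# Two arbitrary parameter guesses (assumptions, not fitted values).
cost_guess = compute_cost(sizes, prices, w=0.2, b=0.0)
cost_zero = compute_cost(sizes, prices, w=0.0, b=0.0)
print(cost_guess, cost_zero)
```

The guess with `w=0.2` yields a far smaller cost than `w=0.0`, which predicts a price of 0 for every house.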
Why Do We Square the Error?
Imagine we're trying to predict something, like the price of a house. Our model makes a prediction $\hat{y}$ and we compare it to the actual price $y$. The error is the difference between these two:
$$\text{error} = \hat{y} - y$$
But here's the thing: this error can be positive or negative.
- If your prediction is higher than the actual value, the error is positive.
- If your prediction is lower than the actual value, the error is negative.
Example:
- Predicted price ŷ: $300,000
- Actual price y: $280,000
The error is 300,000 − 280,000 = 20,000 (positive error)
But:
- Predicted price ŷ: $250,000
- Actual price y: $280,000
The error is 250,000 − 280,000 = −30,000 (negative error)
If we simply add up these errors, positive and negative values can cancel each other out, which wouldn't give us the real picture of how well our model is performing.
So, adding them as 20,000 + (-30,000) = -10,000 doesn’t accurately show the total error. Instead, by considering the total magnitude of errors (20,000 + 30,000), we get 50,000. This approach provides a more accurate representation of the errors, helping us better understand our model's performance and work on making better predictions.
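The cancellation is easy to reproduce with the two predictions from the example. This is only a small sketch; the variable names are made up for illustration.

```python
predicted = [300_000, 250_000]
actual = [280_000, 280_000]

errors = [p - a for p, a in zip(predicted, actual)]

signed_sum = sum(errors)                     # positives and negatives cancel
magnitude_sum = sum(abs(e) for e in errors)  # total size of the errors
squared_sum = sum(e ** 2 for e in errors)    # what a squared error cost accumulates

print(errors)         # [20000, -30000]
print(signed_sum)     # -10000
print(magnitude_sum)  # 50000
print(squared_sum)    # 1300000000
```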
Why Not Use the Absolute Value to Avoid Negative Values?
When we square the errors, larger errors have a bigger impact. For example, an error of 10 becomes 100 when squared, while an error of 1 becomes 1. This helps the model focus on reducing larger mistakes more aggressively. The squared error function is also smooth and differentiable everywhere. This smoothness is important for optimization algorithms like gradient descent because it allows for more efficient and predictable convergence to the minimum error.
Meanwhile, when we use absolute error, each error contributes linearly. An error of 10 remains 10, and an error of 1 remains 1. Both are treated equally without any extra emphasis on the larger ones. The absolute value function has a kink at zero, meaning it's not differentiable at that point. This can complicate the optimization process, making it harder to find the minimum error.
In summary, squaring the error puts more emphasis on larger mistakes, which helps in creating a better model overall by addressing those big errors more effectively. This is why squared errors are often preferred in many machine learning applications.
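To make the comparison concrete, the sketch below evaluates both error functions and their slopes for a few error values. The function names are hypothetical, and returning `None` at zero is just one way to mark the point where the absolute value has no derivative.

```python
def squared_loss(error):
    return error ** 2

def absolute_loss(error):
    return abs(error)

def squared_slope(error):
    """Derivative of error**2: shrinks smoothly toward 0 as the error shrinks."""
    return 2 * error

def absolute_slope(error):
    """Derivative of abs(error) away from 0: always -1 or +1.
    At exactly 0 the derivative is undefined (the kink), marked here with None."""
    if error == 0:
        return None
    return 1 if error > 0 else -1

for e in [10, 1, 0.1, -0.1, 0]:
    print(e, squared_loss(e), absolute_loss(e), squared_slope(e), absolute_slope(e))
```

The squared slope scales with the size of the error, so large mistakes produce large corrections and small mistakes produce small ones. The absolute slope has the same magnitude for every nonzero error and flips sign abruptly at zero.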
Why Divide by $2m$ and Not $m$?
You may wonder: if our goal is to keep the cost function from growing as our data set gets bigger, why should we divide by $2m$? Why 2? Why not 100 or some other number? As I mentioned before, the extra division by 2 is a bit of a mathematical trick to make later calculations easier. Specifically, differentiating the squared term produces a factor of 2, which cancels the $\frac{1}{2}$ and simplifies our calculations.
When training the model, we often use optimization algorithms like gradient descent to minimize the cost function. Gradient descent involves taking the derivative (gradient) of the cost function with respect to the parameters $w$ and $b$. The gradient tells us how to change the parameters to reduce the cost.
Consider a simple function:
$$J(w) = (wx - y)^2$$
When we take the derivative of this function with respect to $w$:
$$\frac{dJ}{dw} = 2(wx - y)x$$
The derivative produces a 2 from the squared term. This 2 can make the gradient calculations a bit cumbersome. By including a factor of 1/2 in the cost function, we simplify the gradient calculations. This adjustment doesn't change the ultimate goal (minimizing the cost), but it makes the math cleaner:
$$J(w) = \frac{1}{2}(wx - y)^2$$
Now, when we take the derivative, the factor of 2 cancels out:
$$\frac{dJ}{dw} = \frac{1}{2} \cdot 2(wx - y)x = (wx - y)x$$
Why Not Another Number?
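If we used a different number, like 100, in the cost function, the math wouldn't simplify as neatly:
$$J(w) = \frac{1}{100}(wx - y)^2$$
Taking the derivative:
$$\frac{dJ}{dw} = \frac{2}{100}(wx - y)x = \frac{1}{50}(wx - y)x$$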
Here, the factor 1/50 doesn't simplify as nicely, and we end up carrying an extra constant through every expression. It also makes the gradient 50 times smaller than with the factor of 1/2, so for the same learning rate 𝛼 each update step is 50 times smaller and the model learns very slowly. You would have to compensate by increasing 𝛼, but this requires careful tuning to avoid making the model unstable.
Using $\frac{1}{2m}$ is a balanced choice. It simplifies the gradient calculations without making the steps too small or too large. It’s also a widely-accepted convention, which makes it easier to follow standard practices and compare results across different studies and implementations.
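The effect of the constant on the gradient can be checked numerically. The sketch below assumes a model without the intercept and three made-up data points; `gradient_w` is a hypothetical helper that differentiates `scale * sum((w*x - y)^2)` with respect to `w`.

```python
def gradient_w(xs, ys, w, scale):
    """Derivative with respect to w of scale * sum((w*x - y)^2),
    which is 2 * scale * sum((w*x - y) * x)."""
    return 2 * scale * sum((w * x - y) * x for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
m = len(xs)
w = 0.0

# With 1/(2m), the 2 cancels and the gradient is (1/m) * sum((w*x - y) * x).
grad_half = gradient_w(xs, ys, w, scale=1 / (2 * m))
# With 1/(100m), a leftover factor of 1/50 remains.
grad_hundredth = gradient_w(xs, ys, w, scale=1 / (100 * m))

print(grad_half, grad_hundredth, grad_half / grad_hundredth)
```

Both gradients point in the same direction, so the minimizer is unchanged. Only the step size for a given learning rate differs, here by a factor of about 50.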
Visualizing the Cost Function
In linear regression, the objective is to find the optimal values for the parameters $w$ and $b$ that minimize the cost function $J(w,b)$. This is typically achieved through an optimization algorithm, such as gradient descent, which iteratively adjusts $w$ and $b$ to reduce the difference between the predicted outputs and the actual target values.
To illustrate this concept, let's work with a simplified version of the linear regression model:
$$f_w(x) = wx$$
In this model, we've eliminated the parameter $b$. Now, the cost function looks like this:
$$J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left(wx^{(i)} - y^{(i)}\right)^2$$
The goal is to find the value of $w$ that minimizes $J(w)$.
Let's visualize how the cost function changes with different values of $w$. Consider that there are 3 data points. The graphs below show both the function $f_w(x)$ (left) and the corresponding cost function $J(w)$ (right) for four different values of $w$:
- When $w = 1$: The function is a line with a slope of 1. It fits the data points perfectly, so the cost is 0, the lowest possible value.
  - Function: $f_w(x) = x$
  - Cost: $J(1) = 0$
- When $w = 0.5$: The function has a slope of 0.5, leading to a higher cost due to the error between predicted and actual values. The line does not fit the data well.
  - Function: $f_w(x) = 0.5x$
- When $w = 0$: The function is a horizontal line, which does not fit the data points at all, resulting in a significant error and a high cost.
  - Function: $f_w(x) = 0$
- When $w$ is negative: The function is a line that slopes downwards, showing an inverse relationship with the data points, resulting in the highest cost of the four.
[Figure: plots of the function $f_w(x)$ (left) and the cost $J(w)$ (right) for each value of $w$]
These graphs help visualize how different values of $w$ affect the line that fits the data points and the corresponding value of the cost function $J(w)$. The goal is to find the value of $w$ that results in the lowest cost, indicating the best fit for the data.
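The same comparison can be sketched numerically. The three data points (1, 1), (2, 2), (3, 3) and the negative slope of -0.5 are assumptions made for this example, since the post does not list the actual values; `cost` is a hypothetical helper for the simplified model without the intercept.

```python
def cost(xs, ys, w):
    """J(w) = 1/(2m) * sum((w*x - y)^2) for the model f_w(x) = w*x."""
    m = len(xs)
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Assumed data points lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

for w in [1.0, 0.5, 0.0, -0.5]:
    print(f"w={w}: J(w)={cost(xs, ys, w):.3f}")
# w=1.0: J(w)=0.000
# w=0.5: J(w)=0.583
# w=0.0: J(w)=2.333
# w=-0.5: J(w)=5.250
```

Under these assumptions the cost grows as $w$ moves away from 1, which traces out the bowl shape of $J(w)$.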
Please note that whether a single pair of $w$ and $b$ reaches the minimum cost $J(w,b)$ depends on the training data.
More specifically:
The squared error cost for linear regression is convex, so its surface is bowl-shaped and has no local minima other than the global one. As long as the training inputs take at least two distinct values, there is a single unique combination of 𝑤 and 𝑏 that minimizes the cost.
However, if every training example has the same input value, the bowl flattens into a valley. Many combinations of 𝑤 and 𝑏 then produce the same prediction at that input, so all of them reach the global minimum of the cost function.
So, the distribution of the training data affects the shape of the cost function. The spread of the inputs determines how steep the bowl is and whether the minimum is unique, while how closely the targets follow a line determines how low the minimum cost is.
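A quick numeric check of when the minimum is unique, under stated assumptions: both datasets below are made up, `cost` is a hypothetical helper, and (0.75, 0.5) is the least squares fit for the first dataset, computed by hand from the usual closed-form formulas.

```python
def cost(xs, ys, w, b):
    """J(w, b) = 1/(2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Case 1: inputs take distinct values, targets are noisy (hypothetical data).
xs, ys = [1.0, 2.0, 3.0], [1.0, 2.5, 2.5]
best = cost(xs, ys, 0.75, 0.5)  # least squares fit for this data
shifts = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.1, -0.2)]
nearby = [cost(xs, ys, 0.75 + dw, 0.5 + db) for dw, db in shifts]
print(best, min(nearby))  # every nearby pair costs more than the fit

# Case 2: every input is identical, so only the value 2*w + b matters.
same_x, targets = [2.0, 2.0, 2.0], [1.0, 2.0, 3.0]
tied = [cost(same_x, targets, w, b) for w, b in [(0.0, 2.0), (1.0, 0.0), (0.5, 1.0)]]
print(tied)  # three different (w, b) pairs, one shared cost
```

In the first case any move away from the fitted pair raises the cost. In the second, every pair with `2*w + b == 2` predicts the mean target and ties for the minimum.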
Choosing the Optimal Parameters
The goal is to choose the value of $w$ that results in the smallest possible value of the cost function $J(w)$. For instance, in our example, if choosing $w = 1$ results in the smallest $J(w)$, then $w = 1$ is the optimal parameter for our model.
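One simple way to illustrate this selection is a brute-force search over candidate values of $w$. This is a sketch only, with assumed data points and an arbitrary candidate grid; in practice an algorithm such as gradient descent is typically used instead of trying every value.

```python
def cost(xs, ys, w):
    """J(w) = 1/(2m) * sum((w*x - y)^2) for the model f_w(x) = w*x."""
    m = len(xs)
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Assumed data points, for illustration only.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

# Candidate values -1.0, -0.9, ..., 2.0 (an arbitrary grid).
candidates = [i / 10 for i in range(-10, 21)]

# Pick the candidate with the smallest cost.
best_w = min(candidates, key=lambda w: cost(xs, ys, w))
print(best_w, cost(xs, ys, best_w))  # 1.0 0.0
```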
Different Cost Functions for Different Applications
While the mean squared error cost function is the most commonly used for linear regression, different applications may require different cost functions. The mean squared error is popular because it generally provides good results for many regression problems.
There’s no one-size-fits-all rule for defining the cost function. It truly depends on our unique case and what we’re trying to achieve with our model. For example, if we’re dealing with outliers in our data, a cost function that reduces their impact might be beneficial. On the other hand, if we’re interested in minimizing all errors equally, we might opt for a different approach.
In simple terms, the cost function we choose depends on the specifics of our application. Different problems have different requirements, so the best cost function is the one that aligns with our goal.
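To see how the choice of cost function changes the picture when an outlier is present, the sketch below compares three common options on the same set of errors. Mean absolute error and the Huber loss are used here as examples of functions that reduce the impact of outliers; they are my choice for illustration, not a recommendation from this post, and the error values are made up.

```python
def mean_squared_error(errors):
    return sum(e ** 2 for e in errors) / len(errors)

def mean_absolute_error(errors):
    return sum(abs(e) for e in errors) / len(errors)

def mean_huber_loss(errors, delta=1.0):
    """Huber loss: quadratic for |e| <= delta, linear beyond it."""
    total = 0.0
    for e in errors:
        if abs(e) <= delta:
            total += 0.5 * e ** 2
        else:
            total += delta * (abs(e) - 0.5 * delta)
    return total / len(errors)

# Hypothetical prediction errors; the last one is an outlier.
errors = [1.0, -1.0, 0.5, 20.0]

print(mean_squared_error(errors))   # 100.5625
print(mean_absolute_error(errors))  # 5.625
print(mean_huber_loss(errors))      # 5.15625
```

The single outlier accounts for almost all of the mean squared error, while the other two measures grow only linearly with it.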
Additionally, we can follow the latest research to gain insights into which cost function might best suit our needs. Researchers often introduce new approaches and compare different cost functions in various scenarios, providing valuable guidance for our own application.