A journey of a thousand miles begins with a single step. My ML journey began with linear regression, and I believe yours is going to start the same way today, so let's begin... :)
Two Takeaways
- Mathematical Explanation behind linear regression
- Python Implementation Using Scikit-Learn
What is Regression, and why is this algorithm called Linear Regression?
Regression is a statistical method that helps us find the relationship between independent and dependent variables.
To make it simple: suppose you have two columns, "EXPERIENCE" and "KNOWLEDGE", and you want to know how your knowledge increases with experience. Modelling this relationship is regression.
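As a quick illustration (the numbers below are made up for demonstration), you could put such data in a pandas DataFrame and eyeball the relationship with a scatter plot:
# Hypothetical experience vs. knowledge data
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'EXPERIENCE': [1, 2, 3, 4, 5, 6],
                   'KNOWLEDGE': [10, 19, 31, 42, 48, 61]})

# A scatter plot makes the roughly linear relationship visible
df.plot(kind='scatter', x='EXPERIENCE', y='KNOWLEDGE')
plt.show()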
Linear Regression
The linearity assumption in linear regression means the model is linear in its parameters (i.e. the coefficients of the variables) and may or may not be linear in the variables themselves. For example, y = theta1 + theta2*x + theta3*x^2 is quadratic in x but still linear in the parameters, so it still counts as a linear model.
(Figure: a perfectly linear graph of y against x.)
Mathematical Understanding of Linear Regression
- Hypothesis Function
- Loss and cost function
- Optimisation
Hypothesis Function
Any machine learning algorithm is, at its core, a hypothesis function: a function you use to get an output from a set of inputs. This hypothesis function is optimised to work well on our dataset, and that process of optimisation is what we call model training.
The hypothesis function for linear regression is:

y = theta1 + theta2 * x

And yes, you have probably seen this somewhere before: it is just the equation of a straight line, y = m*x + c.
The above equation is a single-variable function, i.e. the dataset for it contains just two columns in total: the input column and the output column.
x ==> Input/Independent variable
y ==> Output/Dependent variable
theta1, theta2 ==> Constants
We basically try to predict y by giving x as the input and assigning optimised values to theta1 and theta2. (NOTE: these constants are also called weights, and from here on theta1 and theta2 will be referred to as weights.)
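As a minimal sketch (the weight values below are arbitrary placeholders, not optimised), the single-variable hypothesis function looks like this in NumPy:
import numpy as np

def hypothesis(x, theta1, theta2):
    # h(x) = theta1 + theta2 * x, the straight-line hypothesis
    return theta1 + theta2 * x

x = np.array([1.0, 2.0, 3.0])
print(hypothesis(x, theta1=0.5, theta2=2.0))  # -> [2.5 4.5 6.5]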
Multi-Variable Hypothesis Function
When we have many input columns (say x1, x2, ..., xn), the hypothesis function generalises to:

y = theta1 + theta2*x1 + theta3*x2 + ... + theta(n+1)*xn

i.e. one weight per input variable, plus the constant term.
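Here is a vectorised sketch of the multi-variable hypothesis (I assume the weights are stored with the constant term first; this is an illustration, not library code):
import numpy as np

def hypothesis(X, theta):
    # X: (n_samples, n_features) matrix; theta: (n_features + 1,) weights
    # theta[0] is the constant term, theta[1:] are the feature weights
    return theta[0] + X @ theta[1:]

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
theta = np.array([0.5, 2.0, -1.0])
print(hypothesis(X, theta))  # -> [0.5 2.5]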
Loss and Cost Functions
Loss and cost functions are not exactly the same, there is a subtle difference between them.
When we check the deviation between the actual and predicted value of a single data point, it is called loss; when we cumulatively measure the deviation over the entire dataset, it is called cost. The corresponding functions are called the loss function and the cost function.
Some common loss and cost functions for linear regression are Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
Mean Squared Error (MSE)
The cost function we are going to use is Mean Squared Error (MSE), which can be used to optimise the hypothesis function of linear regression:

MSE = (1/n) * sum((y_actual - y_predicted)^2)

In words: for each data point, take the difference between the predicted value and the actual value and square it; then average these squared differences over all n data points.
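MSE is simple enough to write by hand; here is a minimal NumPy sketch (the function name is my own, not a library call):
import numpy as np

def mse(y_true, y_pred):
    # Average of the squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])
print(mse(y_true, y_pred))  # -> 0.4166...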
So far we have been predicting with random weights (theta1, theta2); now a process called training is used to optimise those weights.
Training the model
Training could be defined as the process of optimising the hypothesis function by optimising its weights. This optimisation is achieved using gradient descent.
(Figure: the error curve, i.e. the mean squared error (MSE) plotted against the weight values.)
Gradient descent is the process of repeatedly updating the weights of the hypothesis function, by either small or large steps depending on the learning rate, in the direction that reduces the error. On reaching a particular threshold error value, gradient descent stops, and we are left with optimised weights that best suit our input dataset.
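To make the idea concrete, here is a small self-contained gradient descent sketch for the single-variable case (the learning rate, iteration count, and toy data are arbitrary choices for illustration):
import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=1000):
    theta1, theta2 = 0.0, 0.0  # start with unoptimised weights
    n = len(x)
    for _ in range(n_iters):
        y_pred = theta1 + theta2 * x         # current predictions
        error = y_pred - y
        grad1 = (2 / n) * np.sum(error)      # dMSE/dtheta1
        grad2 = (2 / n) * np.sum(error * x)  # dMSE/dtheta2
        theta1 -= lr * grad1                 # step against the gradient,
        theta2 -= lr * grad2                 # scaled by the learning rate
    return theta1, theta2

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(gradient_descent(x, y))  # weights close to (1.0, 2.0)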
(Figure: the red line is the best-fit line; it represents the final output after the linear regression model has completed training.)
Python Implementation - Linear Regression Using Scikit-Learn
Importing the required python libraries
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
We are going to use the diabetes dataset to understand linear regression; this dataset comes along with the scikit-learn library.
# We are taking an inbuilt dataset present in sklearn
diabetes = datasets.load_diabetes()
Dataset description
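The full description that ships with the dataset can be printed via its DESCR attribute:
# Print the dataset's built-in description
print(diabetes.DESCR)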
Columns present in the dataset
diabetes.feature_names
Storing the feature data and the target data separately in two different variables
# Extract the feature matrix and the target vector
X = diabetes.data
Y = diabetes.target
print(X.shape, Y.shape)
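The training and prediction steps below use train_x/train_y and test_x/test_y, so we first need to split the data; the 80/20 ratio and random_state below are my own choices for illustration:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=42)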
Importing the linear regression model from the scikit-learn library; the model is available as a class.
from sklearn.linear_model import LinearRegression
le = LinearRegression()
Training process
le.fit(train_x, train_y)
Yes, the whole training process is done in one single line of code, thanks to scikit-learn.
Making Predictions
y_pred = le.predict(test_x)
The predictions on the test dataset are stored in y_pred.
Let's print the results by converting them into a pandas DataFrame
result = pd.DataFrame({'Actual': test_y, 'Model Prediction' : y_pred})
print(result.head(20))
Visualisation
Let's take a small subset, i.e. 20 data points, of our predictions and compare them with the actual output using the matplotlib library
# Plot the first 20 actual values against the model's predictions
sample_result = result.head(20)
sample_result.plot(y=["Actual", "Model Prediction"],
                   kind="line", figsize=(10, 7))
plt.show()
Variance Score
The score method of LinearRegression returns the R^2 score (coefficient of determination), which measures how much of the variance in the target the model explains; 1.0 is a perfect score.
print('Variance score: {}'.format(le.score(test_x, test_y)))
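Equivalently, you can compute the same R^2 value with sklearn.metrics, shown here just as a sanity check:
from sklearn.metrics import r2_score
print('Variance score: {}'.format(r2_score(test_y, y_pred)))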
That's all about linear regression; thank you for your patience :))