What is Linear Regression?
Linear regression is a statistical method for modeling the relationship between a dependent variable (also known as the outcome or response variable) and one or more independent variables (also known as predictors or explanatory variables). The goal is to find the best-fitting line through a set of data points. That line is defined by an equation of the form y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope, and b is the y-intercept. Linear regression covers both simple linear regression (one independent variable) and multiple linear regression (more than one independent variable).
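Before using any library, it helps to see that the "best-fitting" slope and intercept have a simple closed form. Here is a minimal sketch on hypothetical toy data (the values below are illustrative, not from the tutorial's dataset) that computes m and b from the least-squares formulas and checks them against np.polyfit:

```python
import numpy as np

# Hypothetical toy data roughly following y = 2x + 1 (illustrative values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Closed-form least-squares estimates:
#   m = cov(x, y) / var(x)
#   b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

# np.polyfit with degree 1 fits the same line and returns [slope, intercept].
m_fit, b_fit = np.polyfit(x, y, 1)
print(m, b)        # matches m_fit, b_fit
```

The same formulas are what np.polyfit and scikit-learn's LinearRegression solve under the hood for the one-variable case.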
Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Loading the train and test datasets into pandas data frames
train_df = pd.read_csv("/kaggle/input/random-linear-regression/train.csv")
#Drop null values
train_df = train_df.dropna()
train_df.head()
|   | x    | y         |
|---|------|-----------|
| 0 | 24.0 | 21.549452 |
| 1 | 50.0 | 47.464463 |
| 2 | 15.0 | 17.218656 |
| 3 | 38.0 | 36.586398 |
| 4 | 87.0 | 87.288984 |
test_df = pd.read_csv("/kaggle/input/random-linear-regression/test.csv")
# Drop null values
test_df = test_df.dropna()
test_df.head()
|   | x  | y         |
|---|----|-----------|
| 0 | 77 | 79.775152 |
| 1 | 21 | 23.177279 |
| 2 | 22 | 25.609262 |
| 3 | 20 | 17.857388 |
| 4 | 36 | 41.849864 |
Selecting the independent and dependent variables
We select the columns of each data frame to use for the x and y axes. Since the column 'x' represents the independent variable and the column 'y' represents the dependent variable, we can select them like this:
train_x = train_df['x']
train_y = train_df['y']
test_x = test_df['x']
test_y = test_df['y']
Visualizing the training data
To draw a linear graph from the data frame, we use Matplotlib, the popular Python data visualization library we imported above. We use the plt.scatter() function to plot the data points and the plt.plot() function to plot the line of best fit. We also use the numpy.polyfit() function to fit a first-degree polynomial (a straight line) to the data points, which returns the slope and y-intercept of the line of best fit.
coefficients = np.polyfit(train_x, train_y, 1)
m, b = coefficients
plt.scatter(train_x, train_y)
plt.plot(train_x, m*train_x + b)
plt.xlabel('train_x')
plt.ylabel('train_y')
plt.show()
Visualizing test data
coefficients = np.polyfit(test_x, test_y, 1)
m, b = coefficients
plt.scatter(test_x, test_y)
plt.plot(test_x, m*test_x + b)
plt.xlabel('test_x')
plt.ylabel('test_y')
plt.show()
Model Creation, training, and testing
To create a linear regression model and train and test it on our data frames, we can use the scikit-learn library in Python. The first step is to import the specific model we want to use: the LinearRegression class from the sklearn.linear_model module.
from sklearn.linear_model import LinearRegression
Create an instance of the model.
model = LinearRegression()
Now we use the fit() method to train the model on the training data. scikit-learn expects the feature input to be two-dimensional with shape (n_samples, n_features), so we first reshape the 1-D series into single-column arrays:
train_x = train_x.values.reshape(-1, 1)
test_x = test_x.values.reshape(-1, 1)
model.fit(train_x, train_y)
Check the coefficients and intercept of the model using the following commands:
print("Coefficients: ",model.coef_)
print("Intercept: ",model.intercept_)
Our model is trained; now we can use the predict() method to make predictions on the test data:
y_pred = model.predict(test_x)
Evaluating model performance
We can evaluate the performance of the model by comparing the predicted values with the actual values. scikit-learn provides many evaluation metrics, such as mean_absolute_error, mean_squared_error, and r2_score.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print("Mean Absolute Error: ",mean_absolute_error(test_y, y_pred))
print("Mean Squared Error: ",mean_squared_error(test_y, y_pred))
print("R2 Score: ",r2_score(test_y, y_pred))
Visualizing model performance
We can also visualize the results by plotting the test data points and the predicted line using the same approach as before.
plt.scatter(test_x, test_y)
plt.plot(test_x, y_pred, color='r')
plt.xlabel('x')
plt.ylabel('y')
plt.show()