This article was written by Najeeb Ul Hassan, a member of Educative's technical content team.
Marvel Comics introduced the fictional character Destiny in the 1980s, a mutant with the ability to foresee future events. The exciting news is that predicting the future is no longer just a fantasy! With the progress made in machine learning, a machine can help forecast future events by learning from the past.
Exciting, right? Let's start this journey with a simple prediction model.
A regression is a mathematical function that defines the relationship between a dependent variable and one or more independent variables. Rather than delving into theory, the focus will be on creating different regression models.
Understanding the input data
Before building a regression model, one should examine the data. For instance, suppose an individual owns a fish farm and needs to predict a fish's weight based on its dimensions. The dataset below (`Fish.txt`) lists each fish's species, weight, and dimensions:
Species Weight V-Length D-Length X-Length Height Width
Bream 290 24 26.3 31.2 12.48 4.3056
Bream 340 23.9 26.5 31.1 12.3778 4.6961
Bream 363 26.3 29 33.5 12.73 4.4555
Bream 430 26.5 29 34 12.444 5.134
Bream 450 26.8 29.7 34.7 13.6024 4.9274
Bream 500 26.8 29.7 34.5 14.1795 5.2785
Bream 390 27.6 30 35 12.67 4.69
Bream 450 27.6 30 35.1 14.0049 4.8438
Bream 500 28.5 30.7 36.2 14.2266 4.9594
Bream 475 28.4 31 36.2 14.2628 5.1042
Bream 500 28.7 31 36.2 14.3714 4.8146
Bream 500 29.1 31.5 36.4 13.7592 4.368
Bream 340 29.5 32 37.3 13.9129 5.0728
Bream 600 29.4 32 37.2 14.9544 5.1708
Bream 600 29.4 32 37.2 15.438 5.58
Bream 700 30.4 33 38.3 14.8604 5.2854
Bream 700 30.4 33 38.5 14.938 5.1975
Bream 610 30.9 33.5 38.6 15.633 5.1338
Bream 650 31 33.5 38.7 14.4738 5.7276
Bream 575 31.3 34 39.5 15.1285 5.5695
Bream 685 31.4 34 39.2 15.9936 5.3704
Bream 620 31.5 34.5 39.7 15.5227 5.2801
Bream 680 31.8 35 40.6 15.4686 6.1306
Bream 700 31.9 35 40.5 16.2405 5.589
Bream 725 31.8 35 40.9 16.36 6.0532
Bream 720 32 35 40.6 16.3618 6.09
Bream 714 32.7 36 41.5 16.517 5.8515
Bream 850 32.8 36 41.6 16.8896 6.1984
Bream 1000 33.5 37 42.6 18.957 6.603
Bream 920 35 38.5 44.1 18.0369 6.3063
Bream 955 35 38.5 44 18.084 6.292
Bream 925 36.2 39.5 45.3 18.7542 6.7497
Bream 975 37.4 41 45.9 18.6354 6.7473
Bream 950 38 41 46.5 17.6235 6.3705
Roach 40 12.9 14.1 16.2 4.1472 2.268
Roach 69 16.5 18.2 20.3 5.2983 2.8217
Roach 78 17.5 18.8 21.2 5.5756 2.9044
Roach 87 18.2 19.8 22.2 5.6166 3.1746
Roach 120 18.6 20 22.2 6.216 3.5742
Roach 0 19 20.5 22.8 6.4752 3.3516
Roach 110 19.1 20.8 23.1 6.1677 3.3957
Roach 120 19.4 21 23.7 6.1146 3.2943
Roach 150 20.4 22 24.7 5.8045 3.7544
Roach 145 20.5 22 24.3 6.6339 3.5478
Roach 160 20.5 22.5 25.3 7.0334 3.8203
Roach 140 21 22.5 25 6.55 3.325
Roach 160 21.1 22.5 25 6.4 3.8
Roach 169 22 24 27.2 7.5344 3.8352
Roach 161 22 23.4 26.7 6.9153 3.6312
Roach 200 22.1 23.5 26.8 7.3968 4.1272
Roach 180 23.6 25.2 27.9 7.0866 3.906
Roach 290 24 26 29.2 8.8768 4.4968
Roach 272 25 27 30.6 8.568 4.7736
Roach 390 29.5 31.7 35 9.485 5.355
Whitefish 270 23.6 26 28.7 8.3804 4.2476
Whitefish 270 24.1 26.5 29.3 8.1454 4.2485
Whitefish 306 25.6 28 30.8 8.778 4.6816
Whitefish 540 28.5 31 34 10.744 6.562
Whitefish 800 33.7 36.4 39.6 11.7612 6.5736
Whitefish 1000 37.3 40 43.5 12.354 6.525
Parkki 55 13.5 14.7 16.5 6.8475 2.3265
Parkki 60 14.3 15.5 17.4 6.5772 2.3142
Parkki 90 16.3 17.7 19.8 7.4052 2.673
Parkki 120 17.5 19 21.3 8.3922 2.9181
Parkki 150 18.4 20 22.4 8.8928 3.2928
Parkki 140 19 20.7 23.2 8.5376 3.2944
Parkki 170 19 20.7 23.2 9.396 3.4104
Parkki 145 19.8 21.5 24.1 9.7364 3.1571
Parkki 200 21.2 23 25.8 10.3458 3.6636
Parkki 273 23 25 28 11.088 4.144
Parkki 300 24 26 29 11.368 4.234
Perch 5.9 7.5 8.4 8.8 2.112 1.408
Perch 32 12.5 13.7 14.7 3.528 1.9992
Perch 40 13.8 15 16 3.824 2.432
Perch 51.5 15 16.2 17.2 4.5924 2.6316
Perch 70 15.7 17.4 18.5 4.588 2.9415
Perch 100 16.2 18 19.2 5.2224 3.3216
Perch 78 16.8 18.7 19.4 5.1992 3.1234
Perch 80 17.2 19 20.2 5.6358 3.0502
Perch 85 17.8 19.6 20.8 5.1376 3.0368
Perch 85 18.2 20 21 5.082 2.772
Perch 110 19 21 22.5 5.6925 3.555
Perch 115 19 21 22.5 5.9175 3.3075
Perch 125 19 21 22.5 5.6925 3.6675
Perch 130 19.3 21.3 22.8 6.384 3.534
Perch 120 20 22 23.5 6.11 3.4075
Perch 120 20 22 23.5 5.64 3.525
Perch 130 20 22 23.5 6.11 3.525
Perch 135 20 22 23.5 5.875 3.525
Perch 110 20 22 23.5 5.5225 3.995
Perch 130 20.5 22.5 24 5.856 3.624
Perch 150 20.5 22.5 24 6.792 3.624
Perch 145 20.7 22.7 24.2 5.9532 3.63
Perch 150 21 23 24.5 5.2185 3.626
Perch 170 21.5 23.5 25 6.275 3.725
Perch 225 22 24 25.5 7.293 3.723
Perch 145 22 24 25.5 6.375 3.825
Perch 188 22.6 24.6 26.2 6.7334 4.1658
Perch 180 23 25 26.5 6.4395 3.6835
Perch 197 23.5 25.6 27 6.561 4.239
Perch 218 25 26.5 28 7.168 4.144
Perch 300 25.2 27.3 28.7 8.323 5.1373
Perch 260 25.4 27.5 28.9 7.1672 4.335
Perch 265 25.4 27.5 28.9 7.0516 4.335
Perch 250 25.4 27.5 28.9 7.2828 4.5662
Perch 250 25.9 28 29.4 7.8204 4.2042
Perch 300 26.9 28.7 30.1 7.5852 4.6354
Perch 320 27.8 30 31.6 7.6156 4.7716
Perch 514 30.5 32.8 34 10.03 6.018
Perch 556 32 34.5 36.5 10.2565 6.3875
Perch 840 32.5 35 37.3 11.4884 7.7957
Perch 685 34 36.5 39 10.881 6.864
Perch 700 34 36 38.3 10.6091 6.7408
Perch 700 34.5 37 39.4 10.835 6.2646
Perch 690 34.6 37 39.3 10.5717 6.3666
Perch 900 36.5 39 41.4 11.1366 7.4934
Perch 650 36.5 39 41.4 11.1366 6.003
Perch 820 36.6 39 41.3 12.4313 7.3514
Perch 850 36.9 40 42.3 11.9286 7.1064
Perch 900 37 40 42.5 11.73 7.225
Perch 1015 37 40 42.4 12.3808 7.4624
Perch 820 37.1 40 42.5 11.135 6.63
Perch 1100 39 42 44.6 12.8002 6.8684
Perch 1000 39.8 43 45.2 11.9328 7.2772
Perch 1100 40.1 43 45.5 12.5125 7.4165
Perch 1000 40.2 43.5 46 12.604 8.142
Perch 1000 41.1 44 46.6 12.4888 7.5958
Pike 200 30 32.3 34.8 5.568 3.3756
Pike 300 31.7 34 37.8 5.7078 4.158
Pike 300 32.7 35 38.8 5.9364 4.3844
Pike 300 34.8 37.3 39.8 6.2884 4.0198
Pike 430 35.5 38 40.5 7.29 4.5765
Pike 345 36 38.5 41 6.396 3.977
Pike 456 40 42.5 45.5 7.28 4.3225
Pike 510 40 42.5 45.5 6.825 4.459
Pike 540 40.1 43 45.8 7.786 5.1296
Pike 500 42 45 48 6.96 4.896
Pike 567 43.2 46 48.7 7.792 4.87
Pike 770 44.8 48 51.2 7.68 5.376
Pike 950 48.3 51.7 55.1 8.9262 6.1712
Pike 1250 52 56 59.7 10.6863 6.9849
Pike 1600 56 60 64 9.6 6.144
Pike 1550 56 60 64 9.6 6.144
Pike 1650 59 63.4 68 10.812 7.48
Smelt 6.7 9.3 9.8 10.8 1.7388 1.0476
Smelt 7.5 10 10.5 11.6 1.972 1.16
Smelt 7 10.1 10.6 11.6 1.7284 1.1484
Smelt 9.7 10.4 11 12 2.196 1.38
Smelt 9.8 10.7 11.2 12.4 2.0832 1.2772
Smelt 8.7 10.8 11.3 12.6 1.9782 1.2852
Smelt 10 11.3 11.8 13.1 2.2139 1.2838
Smelt 9.9 11.3 11.8 13.1 2.2139 1.1659
Smelt 9.8 11.4 12 13.2 2.2044 1.1484
Smelt 12.2 11.5 12.2 13.4 2.0904 1.3936
Smelt 13.4 11.7 12.4 13.5 2.43 1.269
Smelt 12.2 12.1 13 13.8 2.277 1.2558
Smelt 19.7 13.2 14.3 15.2 2.8728 2.0672
Smelt 19.9 13.8 15 16.2 2.9322 1.8792
Executable code:
# Step 1: Importing libraries
import pandas as pd
# Step 2: Defining the columns of and reading our DataFrame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Printing the head of our DataFrame
print(Fish.head())
Output:
Species Weight V-Length D-Length X-Length Height Width
0 Bream 290.0 24.0 26.3 31.2 12.4800 4.3056
1 Bream 340.0 23.9 26.5 31.1 12.3778 4.6961
2 Bream 363.0 26.3 29.0 33.5 12.7300 4.4555
3 Bream 430.0 26.5 29.0 34.0 12.4440 5.1340
4 Bream 450.0 26.8 29.7 34.7 13.6024 4.9274
- Line 2: The `pandas` library is imported to read the DataFrame.
- Line 6: Reads the data from the `Fish.txt` file with the columns defined in line 5.
- Line 9: Prints the top five rows of the DataFrame. The three lengths are the vertical, diagonal, and cross lengths in cm.
Here, the fish's length, height, and width are independent variables, with weight serving as the dependent variable. In machine learning, independent variables are often referred to as features and dependent variables as labels, and these terms will be used interchangeably throughout this blog.
Linear regression
Linear regression models are widely used in statistics and machine learning. These models use a straight line to describe the relationship between an independent variable and a dependent variable. For example, when analyzing the weight of fish, a linear regression model describes the relationship between the weight y of the fish and one of the independent variables X as follows:

y = mX + c

Where m is the slope of the line, which defines its steepness, and c is the y-intercept, the point where the line crosses the y-axis.
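The slope-intercept form can be sketched directly. Here is a minimal example with hypothetical values for m and c, chosen purely for illustration and not fitted to the fish data:

```python
# Hypothetical slope and intercept, NOT fitted values from the dataset.
m, c = 25.0, -300.0  # slope (grams per cm) and y-intercept


def predict_weight(length_cm):
    """Straight-line model: weight = m * length + c."""
    return m * length_cm + c


print(predict_weight(31.2))  # 25.0 * 31.2 - 300.0 = 480.0
```

Fitting a regression model is essentially the process of finding the values of m and c that best match the observed data.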
Selecting feature
The dataset contains five independent variables. A simple linear regression model with only one feature can be built by selecting the feature most strongly related to the fish's `Weight`. One approach is to calculate the cross-correlation between `Weight` and the features.
Hidden code: (From the previous code block)
# Step 1: Importing libraries
import pandas as pd
# Step 2: Defining the columns of and reading our data frame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
Executable code:
# Finding the cross-correlation matrix (numeric columns only)
print(Fish.corr(numeric_only=True))
Output:
Weight V-Length D-Length X-Length Height Width
Weight 1.000000 0.915691 0.918625 0.923343 0.727260 0.886546
V-Length 0.915691 1.000000 0.999519 0.992155 0.627425 0.867002
D-Length 0.918625 0.999519 1.000000 0.994199 0.642392 0.873499
X-Length 0.923343 0.992155 0.994199 1.000000 0.704628 0.878548
Height 0.727260 0.627425 0.642392 0.704628 1.000000 0.794810
Width 0.886546 0.867002 0.873499 0.878548 0.794810 1.000000
After examining the first column of the correlation matrix, the following is observed:
- There is a strong correlation between `Weight` and the feature `X-Length`.
- `Weight` has the weakest correlation with `Height`.
Given this information, it is clear that if the individual is limited to a single independent variable for predicting the dependent variable, they should choose `X-Length` and not `Height`.
# Step 3: Separating the data into features and labels
X = Fish[['X-Length']]
y = Fish['Weight']
Splitting data
With the features and labels in place, DataFrame can now be divided into training and test sets.
The training dataset trains the model, while the test dataset evaluates its performance.
The train_test_split
function is imported from the sklearn
library to split the data.
from sklearn.model_selection import train_test_split
# Step 4: Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=10,
    shuffle=True
)
The arguments of the `train_test_split` function are as follows:

- Line 6: Passes the feature and the label.
- Line 7: `test_size=0.3` reserves 30% of the data for testing, leaving the remaining 70% for training.
- Lines 8–9: `random_state=10` fixes the random seed so the split is reproducible, and `shuffle=True` shuffles the rows before splitting so that the model does not train on only one portion of the data (here, the rows are grouped by species).
As a result, the training data is obtained in `X_train` and `y_train`, and the test data in `X_test` and `y_test`.
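The split proportions can be verified on a toy dataset. This sketch uses ten dummy samples, so a `test_size` of 0.3 should yield seven training rows and three test rows (scikit-learn rounds the test count up):

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, so test_size=0.3 yields 7 train / 3 test rows.
X = [[i] for i in range(10)]
y = list(range(10))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10, shuffle=True
)
print(len(X_train), len(X_test))  # 7 3
```

Together, the train and test labels still cover every original sample; shuffling only changes which rows land in which set.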
Applying model
At this point, the linear regression model can be created.
Hidden code:
# Step 1: Importing libraries
import pandas as pd
# 1.2
from sklearn.model_selection import train_test_split
# Step 2: Defining the columns of and reading our data frame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Step 3: Separating the data into features and labels
X = Fish[['X-Length']]
y = Fish['Weight']
# Step 4: Dividing the data into test and train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)
Executable code:
from sklearn.linear_model import LinearRegression
# Step 5: Selecting the linear regression method from scikit-learn library
model = LinearRegression().fit(X_train, y_train)
- Line 1: Imports the `LinearRegression` class from the `sklearn` library.
- Line 4: Creates and trains the model using the training data `X_train` and `y_train`.
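After fitting, the learned slope and intercept are available as `model.coef_` and `model.intercept_`, corresponding to the m and c of the line equation. The following is a minimal sketch on synthetic points that lie exactly on a known line, so the recovered values can be checked in advance (the numbers are illustrative, not from the fish data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Points generated from y = 25x - 300, so the fitted slope and
# intercept are known ahead of time.
X = np.array([[30.0], [32.0], [34.0], [36.0]])
y = 25.0 * X.ravel() - 300.0

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # ≈ 25.0 and ≈ -300.0
```

Inspecting these attributes on the fish model is a quick sanity check that the fitted line is physically plausible (e.g., a positive slope, since longer fish weigh more).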
Model validation
Remember, 30% of the data was set aside for testing. The Mean Absolute Error (MAE) can be calculated using this data as an indicator of the average absolute difference between the predicted and actual values, with a lower MAE value indicating more accurate predictions. Other measures for model validation exist, but they won't be explored in this context.
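MAE itself is simple to compute by hand: it is the mean of the absolute differences between predictions and actual values. This sketch, using toy numbers, confirms that a manual computation matches scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn import metrics

# Toy actual and predicted weights (illustrative values only).
y_true = np.array([290.0, 340.0, 363.0])
y_pred = np.array([300.0, 330.0, 360.0])

# MAE is the mean of the absolute prediction errors: (10 + 10 + 3) / 3.
manual_mae = np.mean(np.abs(y_true - y_pred))
print(manual_mae)  # 7.666...
assert np.isclose(manual_mae, metrics.mean_absolute_error(y_true, y_pred))
```

Because MAE is expressed in the same units as the label (grams here), it is easy to interpret: on average, predictions are off by that many grams.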
Here's a complete running example, including all of the steps mentioned above, to perform a linear regression.
# Step 1: Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Step 2: Defining the columns of and reading the DataFrame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Step 3: Separating the data into features and labels
X = Fish[['X-Length']]
y = Fish['Weight']
# Step 4: Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)
# Step 5: Selecting the linear regression method from the scikit-learn library
model = LinearRegression().fit(X_train, y_train)
# Step 6: Validation
# Evaluating the trained model on training data
y_prediction = model.predict(X_train)
print("MAE on train data= " , metrics.mean_absolute_error(y_train, y_prediction))
# Evaluating the trained model on test data
y_prediction = model.predict(X_test)
print("MAE on test data = " , metrics.mean_absolute_error(y_test, y_prediction))
Output:
MAE on train data=  105.08242420291623
MAE on test data =  108.7817508976745
In this instance, the `model.predict()` function is applied to the training data on line 23 and to the test data on line 26. But what does this show?
Essentially, it compares the model's performance on a dataset it has already seen against an unfamiliar test dataset.
The two MAE values suggest that the predictions on the train and test data are similarly accurate.
Note: Recall that `X-Length` was chosen as the feature because of its high correlation with the label. To verify this choice, replace it with `Height` on line 12, rerun the linear regression, and compare the two MAE values.
Multiple linear regression
So far, only one feature, `X-Length`, has been used to train the model. However, additional features are available that can improve the predictions: the vertical length, diagonal length, height, and width of the fish. These can be used to re-evaluate the linear regression model.
# Step 3: Separating the data into features and labels
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']
Mathematically, the multiple linear regression model can be written as follows:

y = m_1 X_1 + m_2 X_2 + … + m_n X_n + c

where m_i represents the weightage of feature X_i in predicting y, and n denotes the number of features.
Following the same steps as earlier, the performance of the model using all the features can be evaluated.
# Step 1: Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Step 2: Defining the columns and reading the DataFrame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Step 3: Separating the data into features and labels
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']
# Step 4: Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)
# Step 5: Selecting the linear regression method from the scikit-learn library
model = LinearRegression().fit(X_train, y_train)
# Step 6: Validation
# Evaluating the trained model on training data
y_prediction = model.predict(X_train)
print("MAE on train data= " , metrics.mean_absolute_error(y_train, y_prediction))
# Evaluating the trained model on test data
y_prediction = model.predict(X_test)
print("MAE on test data = " , metrics.mean_absolute_error(y_test, y_prediction))
Output:
MAE on train data=  88.6176233769433
MAE on test data =  104.71922684746642
The training MAE improves somewhat, while the test MAE remains similar to the single-feature result.
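With multiple features, each entry of `model.coef_` is the weightage m_i for the corresponding feature. The following is a minimal sketch on noiseless toy data generated from known weightages, so the recovered coefficients are predictable (the names `x1`, `x2` and the generating equation are illustrative, not from the fish data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy two-feature data generated from y = 2*x1 + 5*x2 + 1, so the
# learned weightages m_i and intercept c are known in advance.
rng = np.random.default_rng(0)
X = rng.random((20, 2))
y = 2.0 * X[:, 0] + 5.0 * X[:, 1] + 1.0

model = LinearRegression().fit(X, y)
for name, coef in zip(['x1', 'x2'], model.coef_):
    print(name, round(coef, 3))
print('intercept', round(model.intercept_, 3))
```

Pairing `model.coef_` with the feature column names in the same way reveals how much each fish dimension contributes to the predicted weight.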
Polynomial regression
Polynomial regression is used when the assumption of a linear relationship between the features and the label is not accurate. By allowing a more flexible fit, it can capture more complex relationships. For example, if the relationship between the dependent variable and the independent variables is not a straight line, a polynomial regression model can fit the data more closely and lead to more accurate predictions.
Mathematically, the relationship between the dependent and independent variables is described using the following equation:

y = m_1 Z_1 + m_2 Z_2 + … + m_k Z_k + c

This equation looks very similar to the one used earlier for multiple linear regression. However, it uses transformed features Z_i, which are polynomial versions of the X_i's used in multiple linear regression.

For example, two features X_1 and X_2 can be used to create new features such as:

Z_1 = X_1,  Z_2 = X_2,  Z_3 = X_1^2,  Z_4 = X_1 X_2,  Z_5 = X_2^2
The polynomial features to include can be chosen by trial and error or with techniques like cross-validation. The degree of the polynomial should reflect the complexity of the relationship between the variables.
The following example performs a polynomial regression and validates the model's performance.
# Step 1: Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.preprocessing import PolynomialFeatures
# Step 2: Defining the columns and reading the DataFrame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Step 3: Separating the data into features and labels
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']
# Step 4: Generating polynomial features
Z = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
# Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(Z, y, test_size=0.3, random_state=10)
# Step 5: Selecting the linear regression method from the scikit-learn library
model = LinearRegression().fit(X_train, y_train)
# Step 6: Validation
# Evaluating the trained model on training data
y_prediction = model.predict(X_train)
print("MAE on train data= " , metrics.mean_absolute_error(y_train, y_prediction))
# Evaluating our trained model on test data
y_prediction = model.predict(X_test)
print("MAE on test data = " , metrics.mean_absolute_error(y_test, y_prediction))
Output:
MAE on train data=  30.44121990999409
MAE on test data =  32.558434580499224
On line 18, the features are transformed using the `PolynomialFeatures` function, which was imported from the `sklearn` library on line 7.
Notice that the MAE values in this case are markedly lower than those of the linear regression models, implying that the linear assumption was not entirely accurate.
This blog has provided a quick introduction to machine learning regression models with Python. Don't stop here! Explore and practice different techniques and libraries to build more accurate and robust models. You can also check out related courses and skill paths on Educative.
Good luck, and happy learning!