Introduction
Data exploration is a crucial step in the process of building a machine learning solution. Through data exploration, you gain a deeper understanding of your dataset and may uncover missing values. Missing values often occur during the data collection process, especially in scenarios like surveys, where incomplete data entry is common.
Furthermore, the quality of a machine learning model heavily depends on the quality of its training data. If the training data contains missing values that are not properly handled, it can introduce bias into the final model. Therefore, addressing missing values is essential to ensuring the model's accuracy and reliability.
Prerequisites
- You should be familiar with Python machine learning libraries like Pandas, NumPy, Matplotlib, and Scikit-learn.
- Import the necessary libraries by running this code:
import pandas as pd # for data manipulation
import matplotlib.pyplot as plt # for plot titles and displaying figures
import missingno as msno # for missing data visualisation
from sklearn.linear_model import LogisticRegression # for building classification models
from sklearn.model_selection import train_test_split # for splitting dataset into train and test set
from sklearn.tree import DecisionTreeClassifier # decision tree algorithm
from sklearn.metrics import accuracy_score # to compute model performance
from sklearn.impute import SimpleImputer # for imputing missing values
- Download the data.
What are missing values?
Missing values, often referred to as null values, are values that should be present in the rows of a dataset but are absent due to errors or omissions in the data collection process. These null values can pose challenges in a machine learning project. They make data analysis more complex, and certain machine learning algorithms may struggle to learn from rows with missing data, potentially causing the algorithm to overlook important features.
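To make this concrete, here is a minimal sketch of how pandas represents missing values. The toy DataFrame and its column names are made up for illustration and are not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: np.nan and None both mark a missing entry
toy = pd.DataFrame({
    "age": [25, np.nan, 31],
    "city": ["Lagos", None, "Abuja"],
})

# isna() returns True wherever a value is missing
missing_mask = toy.isna()
print(missing_mask)

# Keep only the rows that contain at least one missing value
rows_with_missing = toy[missing_mask.any(axis=1)]
print(rows_with_missing)
```

Both np.nan and None are reported as missing, so the same checks work for numeric and text columns.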
Solution to missing values
In this tutorial, you will learn four ways to handle missing values in a dataset. You will experiment with each one and see for yourself which produces the best result.
Here is the story: you have a dataset that contains missing values, and you want to build a machine learning model with it. You don't know the best way to handle the missing data, so you decide to try different methods and see which one produces the best model.
Let's explore and visualise the data to see how the missing values appear in it.
Load and display the data
# Read the data
df = pd.read_csv("employee_attrition.csv")
df.head()
Result:
Get the number of missing values
df.isnull().sum()
Result:
Visualise missing values
# Use missingno to visualise missing data
msno.bar(df, figsize=(8, 6))
plt.title('Missing Data Overview')
plt.show()
Result:
The exploration and visualisation show which columns contain missing values and how many values are missing in each.
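Raw counts are easier to judge next to percentages. The sketch below shows one way to summarise both per column. The helper name missing_summary and the toy data are assumptions made for illustration; they do not come from the tutorial's dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = (frame.isna().mean() * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Show the columns with the most missing values first
    return summary.sort_values("missing", ascending=False)

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "TotalWorkingYears": [10, np.nan, 3, np.nan],
    "MonthlyIncome": [5000, 4200, 3900, 6100],
    "DailyRate": [800, 650, np.nan, 720],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, you would call missing_summary(df) instead of using the toy DataFrame.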
Next, create a Python function that you will use to train your model. This function will accept a dataset as input, train a model with the data, and return the model's accuracy in making predictions. Therefore, with this function, you can easily experiment with different datasets to see which performs better on your model.
Model train function
def build_model(dataset, test_size=0.3, random_state=17):
    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        dataset.drop("Label", axis=1),
        dataset["Label"],
        test_size=test_size,
        random_state=random_state,
    )
    # Fit a decision tree classifier
    clf = DecisionTreeClassifier(random_state=random_state)
    clf.fit(X_train, y_train)
    # Predict and compute accuracy, rounded to three decimal places
    y_pred = clf.predict(X_test)
    return round(accuracy_score(y_test, y_pred), 3)
Let's pass the data to the function without handling the missing values first.
# Build a model with the dataset that contains missing values
build_model(df)
Result:
The model could not handle the missing values in the data, which shows you have to find a solution for the missing values before you can train your model.
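Whether training fails depends on the estimator and on your scikit-learn version: releases from 1.3 onwards let decision trees accept NaN directly, while many other estimators still reject it. As a minimal sketch of the failure, the example below uses LogisticRegression (imported earlier) on made-up toy arrays, not on the tutorial's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy features with one missing entry
X = np.array([[1.0, 2.0], [np.nan, 1.0], [3.0, 0.5], [4.0, 1.5]])
y = np.array([0, 0, 1, 1])

try:
    LogisticRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates the input and rejects NaN for this estimator
    fit_failed = True
    print(f"Training failed: {err}")
```

Handling the missing values before training keeps the workflow independent of which estimator you choose.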
Let's experiment with different methods to see which performs best on the model.
Method 1: Drop rows with missing values
Dropping rows with missing values works best when only a few values are missing and you have plenty of data points. If many rows contain missing values, this method shrinks your dataset significantly and leaves your model with few data points to learn from.
Here is how to perform this method:
#drop missing rows
df_drop_rows=df.dropna()
#build model with the new data
build_model(df_drop_rows)
Result:
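Before committing to this method, it can help to measure how many rows it would cost you. The sketch below does that on a made-up toy DataFrame; the column names mirror the tutorial's, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "TotalWorkingYears": [10, np.nan, 3, 8, np.nan],
    "DailyRate": [800, 650, np.nan, 720, 500],
    "Label": [0, 1, 0, 1, 0],
})

# Drop every row that has at least one missing value
dropped = toy.dropna()
rows_lost = len(toy) - len(dropped)
fraction_lost = rows_lost / len(toy)
print(f"Rows lost: {rows_lost} of {len(toy)} ({fraction_lost:.0%})")

# subset= restricts the check to the columns you name,
# which keeps rows whose gaps are in other columns
dropped_subset = toy.dropna(subset=["TotalWorkingYears"])
```

If the fraction lost is large, one of the other methods is likely a better fit.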
Method 2: Drop columns with missing values
This method is only valid when the columns with missing data do not contain important features for your model. For instance, if your data has a column labelled 'X', and 'X' is not a strong feature for your model to learn from, then dropping 'X' isn't a bad idea.
Here is how to perform this method:
#Use only columns without missing values
df_drop_col=df[["MonthlyIncome","Overtime","Label"]]
#build model
build_model(df_drop_col)
Result:
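Instead of listing the complete columns by hand, you can let pandas select them. This is a sketch on made-up toy data; the 0.4 threshold is an arbitrary value chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "TotalWorkingYears": [10, np.nan, np.nan, 8],
    "MonthlyIncome": [5000, 4200, 3900, 6100],
    "DailyRate": [800, 650, np.nan, 720],
    "Label": [0, 1, 0, 1],
})

# Drop every column that has at least one missing value
no_missing_cols = toy.dropna(axis=1)

# Or drop only the columns whose missing fraction exceeds a threshold
max_missing_fraction = 0.4  # hypothetical cut-off
keep = toy.columns[toy.isna().mean() <= max_missing_fraction]
mostly_complete = toy[keep]

print(list(no_missing_cols.columns))
print(list(mostly_complete.columns))
```

The threshold version keeps columns with only a few gaps, which you can then handle with one of the other methods.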
Method 3: Fill missing values with 0, -1, or other "indicator" values
Sometimes, it's necessary to distinguish between missing values and actual data values. In such cases, you can fill in missing values with a specific indicator value like -1, 0, or even a unique string.
For example, if you have a dataset with a column representing the number of items purchased and some rows have missing values, filling those missing values with 0 treats those entries as purchases of zero items. Choose an indicator that cannot be confused with a real value unless that is the meaning you intend.
Here is how to perform this method:
df_sentinel = df.fillna(value=-1) # the dataset contains only non-negative values, so -1 clearly marks a missing value
build_model(df_sentinel)
Result:
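A common variation is to record where the gaps were before filling them, so the model can tell a filled value from a real one. The sketch below adds a flag column for each column that has missing values. The "_missing" suffix and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "TotalWorkingYears": [10.0, np.nan, 3.0],
    "DailyRate": [800.0, 650.0, np.nan],
})

flagged = toy.copy()
for column in toy.columns:
    if toy[column].isna().any():
        # 1 where the original value was missing, 0 otherwise
        flagged[f"{column}_missing"] = toy[column].isna().astype(int)

# Fill the gaps with the sentinel after the flags are recorded
flagged = flagged.fillna(-1)
print(flagged)
```

The flags preserve the information that a value was missing even after the sentinel replaces it.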
Method 4: Impute missing values
Imputing missing values involves replacing the missing data points with estimated or predicted values based on the available information. In this method, you replace the missing values in each column with the mean, median, or mode of the non-missing values in that column.
Scikit-learn provides an easy way to do this with the SimpleImputer class.
Here is how to perform this method:
# Initialise the SimpleImputer with the "mean" strategy
imp = SimpleImputer(strategy="mean")
# Apply the imputer to the data and create a new DataFrame 'df_imputed'
# The imputer replaces the missing values in each column with that column's mean
df_imputed = pd.DataFrame(
    imp.fit_transform(df),
    columns=["TotalWorkingYears", "MonthlyIncome", "Overtime", "DailyRate", "Label"],
)
# Build a model with the imputed data
build_model(df_imputed)
Result:
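One design choice worth considering: fitting the imputer on the whole dataset lets statistics from the test rows influence training. A pipeline avoids this by learning the fill values from the training split only. The sketch below uses synthetic data and the "median" strategy purely for illustration; the column names a, b, and c are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset (hypothetical columns)
rng = np.random.default_rng(17)
features = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
labels = (features["a"] + features["b"] > 0).astype(int)
# Remove roughly 10% of the values in column "a"
features.loc[rng.random(200) < 0.1, "a"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=17
)

# The imputer is fitted on X_train only, then reused to transform X_test
model = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=17),
)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.3f}")
```

The same pattern works with strategy="mean" or strategy="most_frequent".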
Conclusion
Handling missing data properly is important in any machine learning or data science workflow, because the quality of a solution depends on the quality of the data behind it.
In this article, you learned four methods for handling missing values in a dataset. By experimenting with each one, you saw for yourself which method produced the best model on this dataset (the imputation method).
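To repeat the experiment on your own data, the four methods can be compared in one loop. This is a sketch under stated assumptions: the data is synthetic, the columns f1, f2, and f3 are hypothetical, and build_model is restated so the example runs on its own. Which method scores best will vary from dataset to dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def build_model(dataset, test_size=0.3, random_state=17):
    """Train a decision tree and return its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        dataset.drop("Label", axis=1), dataset["Label"],
        test_size=test_size, random_state=random_state,
    )
    clf = DecisionTreeClassifier(random_state=random_state)
    clf.fit(X_train, y_train)
    return round(accuracy_score(y_test, clf.predict(X_test)), 3)

# Synthetic stand-in for the tutorial's dataset (hypothetical columns)
rng = np.random.default_rng(17)
toy = pd.DataFrame(rng.uniform(0, 10, size=(300, 3)), columns=["f1", "f2", "f3"])
toy["Label"] = (toy["f1"] + toy["f2"] > 10).astype(int)
toy.loc[rng.random(300) < 0.15, "f1"] = np.nan  # inject missing values

# Mean-impute the features only, then reattach the label
features = toy.drop("Label", axis=1)
imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(features),
    columns=features.columns,
)
imputed["Label"] = toy["Label"].to_numpy()

candidates = {
    "drop rows": toy.dropna(),
    "drop columns": toy.dropna(axis=1),
    "fill with -1": toy.fillna(-1),
    "mean imputation": imputed,
}
results = {name: build_model(data) for name, data in candidates.items()}
for name, score in results.items():
    print(f"{name}: {score}")
```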
If you have any questions or spot any errors, please reach out to me at victorkingoshimua@gmail.com or on LinkedIn.
Happy coding!