Optimizing Data Analysis: A Guide to Handling Missing Data Effectively

#machinelearning #python #dataengineering #skikitlearn

Managing missing data is a crucial stage in the data preparation process.
Real-world data is rarely complete: it is extremely uncommon to obtain a dataset with no noise, no missing values, and no similar problems.
For example, users filling out a feedback form often skip a field they do not want to answer and submit the form anyway, which leaves missing data in the database.

Why is it important to handle missing data?

Our machine learning model gains its knowledge from data; if a sizeable portion of that data is missing, its accuracy can decline to the point where the model is no longer useful.

How to handle missing data?

If only a small share of our data (say 1%) consists of null values, we can eliminate it by removing the affected rows or columns. With a larger share, however, this method is inefficient, since it results in the loss of data that may be crucial to the effectiveness of our machine learning model.
A common alternative is to replace each null value with the mean or median of its column for numerical data, and with the mode (the most frequent value) for categorical data.
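As a rough illustration of the two options described above, here is a minimal pandas sketch. The DataFrame and its column names (age, city) are made up for this example and are not part of the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the article's Data.csv)
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 60.0],
    "city": ["Paris", "Rome", None, "Paris"],
})

# Option 1: drop every row that contains at least one missing value
dropped = df.dropna()

# Option 2: keep all rows and fill the gaps instead
filled = df.copy()
# Numerical column: use the median (the median of 25, 35, 60 is 35)
filled["age"] = filled["age"].fillna(filled["age"].median())
# Categorical column: use the mode, i.e. the most frequent value
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(dropped)
print(filled)
```

Dropping keeps only the fully populated rows, while filling keeps every row at the cost of introducing estimated values.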

Handling missing data using scikit-learn:

To demonstrate this, let's use a small sample of data; in practice, datasets are usually much larger.

Step 1: Importing the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2: Importing the dataset
You can download the dataset from here.

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

Here, I separated the independent variables (X) from the dependent variable (y).
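Before imputing anything, it can help to check how many values are actually missing in each column. The sketch below uses a small hand-made DataFrame as a stand-in for Data.csv; the column names (Country, Age, Salary, Purchased) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset; column names are assumed
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan],
    "Purchased": ["No", "Yes", "No"],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Share of missing values in each column (between 0 and 1)
missing_share = sample.isna().mean()

print(missing_per_column)
print(missing_share)
```

With the real file, the same two calls can be run on the dataset variable to decide whether dropping or imputing is the better fit.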

Step 3: Taking care of Missing Data

from sklearn.impute import SimpleImputer: This line imports the SimpleImputer class from scikit-learn's sklearn.impute module. SimpleImputer fills in the missing values in a dataset using a predefined strategy.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean'): This line creates an instance of the SimpleImputer class. missing_values=np.nan tells it that missing or undefined entries are represented by np.nan, which stands for "Not a Number". strategy='mean' means that each missing value is replaced by the mean (average) of the non-missing values in the same column. We can use 'median' and 'most_frequent' (the mode) as well.
imputer.fit(X[:,1:3]): This line fits (trains) the imputer on a subset of the dataset, X[:,1:3], which selects the columns at index 1 and 2 (the upper bound 3 is excluded). Fitting computes the mean of each of these columns.
X[:,1:3] = imputer.transform(X[:,1:3]): After fitting, this line fills in the missing values in the chosen columns (1 and 2) with the column means computed during the fitting stage, and writes the result back into the original dataset X.

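from sklearn.impute import SimpleImputer
# Replace each np.nan with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Learn the column means from the numerical columns (index 1 and 2)
imputer.fit(X[:, 1:3])
# Fill the missing values and write the result back into X
X[:, 1:3] = imputer.transform(X[:, 1:3])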

Here is the output:

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
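The median and mode strategies mentioned earlier work the same way. The following is a minimal sketch on small made-up arrays (not the article's dataset), using strategy='median' for numerical columns and strategy='most_frequent' for a categorical column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numerical data with one gap in each column
numeric = np.array([
    [44.0, 72000.0],
    [np.nan, 48000.0],
    [30.0, np.nan],
    [38.0, 61000.0],
])

# Median imputation: less sensitive to outliers than the mean
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
numeric_filled = median_imputer.fit_transform(numeric)

# Hypothetical categorical column with one missing entry
categories = np.array([["France"], ["Spain"], [np.nan], ["France"]], dtype=object)

# Mode imputation: replace the gap with the most frequent value
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
categories_filled = mode_imputer.fit_transform(categories)

print(numeric_filled)
print(categories_filled)
```

Here fit_transform combines the fit and transform calls into one step, which is convenient when the same data is used for both.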

🎉 Tadaa! 🎉 Finally, we've mastered handling missing data! 📊💻🔤💡

👉 You can access the full code from this GitHub repository: Link to Repository

Feel free to explore the code and learn more about other data preprocessing steps! 🔍💻📂📝😊
