Introduction 🤖
As data analysts, our job is to extract insights that support data-driven decisions. But how can we get our insights across if our datasets are full of missing values?
"In the real world, data is imperfect, and missing values are the norm. Learn to work with what you have." - Anonymous
The problem ❓
To start with, imagine you want to know the mean age and height of a group of 50 people. Even though you have told them the information will be kept confidential, some people refuse to give you their data (around 65% of the group). As a result, you’ll have a considerable amount of missing values that will affect your analysis. 😔
Certainly, if you calculate the mean with just the available values, you won’t get means that properly describe the entire group. For instance, if we only have 4 ages and 4 heights but our group is around 50 people, our mean will be wrong for the whole group. 😕
To achieve our goal, we’ve come up with these ideas:
Drop out the analysis. 😎
Delete the missing values.
Fill the missing values with default data. For instance, fill the missing ages with 0.
Fill the missing values with one of the measures of central tendency (mode, median, mean) of the available data.🤯
Think of a way to fill the values using predictions based on the available data. 🤔
Drop out the analysis 😎
Delete the missing values
Another way to treat our missing values is to drop them from our dataset. Nonetheless, this isn’t a good idea if any of our columns has tons of missing values.
To illustrate, let’s imagine two scenarios for a dataset with 50 rows and the columns name, age and height:
We have 65% missing values. If we perform the analysis mentioned above (the mean), the result will be heavily biased and we won’t be able to put across correct facts.
We have only 2 missing values. Dropping them out will not have any significant impact on our analysis.
To illustrate the impact, let’s see an example using small dataframes.
Original DataFrame:
Name Age Height
0 Tom NaN 184.0
1 John 10.0 NaN
2 Daniel 24.0 137.0
3 John NaN NaN
4 Anna 9.0 162.0
. . .
DataFrame after dropping missing values:
Name Age Height
2 Daniel 24.0 137.0
4 Anna 9.0 162.0
. . .
Mean of ages before dropping missing values: 14.34
Mean of heights before dropping missing values: 161.0
. . .
Mean of ages after dropping missing values: 16.5
Mean of heights after dropping missing values: 149.5
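If you want to reproduce a comparison like the one above, here is a minimal pandas sketch (the values are just sample data, so the exact means may differ from the numbers shown):

```python
import pandas as pd
import numpy as np

# Small sample DataFrame with missing ages and heights
df = pd.DataFrame({
    "Name": ["Tom", "John", "Daniel", "John", "Anna"],
    "Age": [np.nan, 10, 24, np.nan, 9],
    "Height": [184, np.nan, 137, np.nan, 162],
})

# Drop every row that contains at least one missing value
df_dropped = df.dropna()

# pandas skips NaN by default, so these are means over the available values only
print("Mean of ages before dropping:", round(df["Age"].mean(), 2))
print("Mean of heights before dropping:", round(df["Height"].mean(), 2))
print("Mean of ages after dropping:", round(df_dropped["Age"].mean(), 2))
print("Mean of heights after dropping:", round(df_dropped["Height"].mean(), 2))
```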
Summing up, dropping missing values should be considered if and only if its impact on the analysis is insignificant.
Fill with defaults 📝
Apart from just dropping missing values, there is a technique that allows us to fill them with default values. For example, let’s suppose we have missing values in the city column. One way to deal with them is to fill them with some default text like “No city”.
Original dataframe
name age city
0 John 25 NaN
1 Emma 32 London
2 Peter 40 NaN
3 Mary 28 Sydney
4 Jack 35 NaN
Dataframe after filling with default values
name age city
0 John 25 No city
1 Emma 32 London
2 Peter 40 No city
3 Mary 28 Sydney
4 Jack 35 No city
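A minimal pandas sketch of this fill, assuming sample data like the table above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["John", "Emma", "Peter", "Mary", "Jack"],
    "age": [25, 32, 40, 28, 35],
    "city": [np.nan, "London", np.nan, "Sydney", np.nan],
})

# Replace the missing cities with a default label
df["city"] = df["city"].fillna("No city")
print(df)
```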
This is a great way to handle missing categorical data. However, what happens if we use the same technique to fill numeric data? While filling categorical data this way has little side effect, filling numerical data does. 🔢
Without a doubt, our analysis will be affected the moment we add those new values. For instance, let’s return to the mean exercise.
- We want to know the mean age of a group of 50 teenagers. Our dataset has around 65% missing values; we fill them with the number ‘2’ as a default value and then compute the mean. Will the mean be as reliable as it is supposed to be? 😕
Original dataframe
name age city
0 John 25.0 New York
1 Emma NaN London
2 Peter 40.0 Paris
3 Mary 28.0 Sydney
4 Jack 35.0 Berlin
5 Sarah NaN Tokyo
6 Adam 32.0 Dubai
7 ... ... ...
DataFrame with NaN values filled with 2:
name age city
0 John 25.0 New York
1 Emma 2.0 London
2 Peter 40.0 Paris
3 Mary 28.0 Sydney
4 Jack 35.0 Berlin
5 Sarah 2.0 Tokyo
6 Adam 32.0 Dubai
7 ... ... ...
Mean of the 'age' column (considering 50 rows):
Mean of original DataFrame: 31.0
Mean of DataFrame with NaN values filled with 2: 21.0
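The effect is easy to reproduce with a small sketch (illustrative values only, not the 50-row dataset from the example):

```python
import pandas as pd
import numpy as np

ages = pd.Series([25, np.nan, 40, 28, 35, np.nan, 32], name="age")

# pandas ignores NaN by default, so this is the mean of the known ages only
print("Mean of the available ages:", ages.mean())

# Filling with an arbitrary default such as 2 drags the mean down
print("Mean after filling NaN with 2:", ages.fillna(2).mean())
```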
Therefore, can we say this is a bad technique for numerical values?
In short, no. Even though the analysis will suffer a side effect, our job is the same as with the first technique: 😌
analyze how big the side effect of going this way is.
Fill with measures of central tendency 🤯
One of the easiest ways to fill numerical empty values is to use the famous measures of central tendency.
What are measures of central tendency? 🧐
Numbers. They summarize the distribution of our data in a single value located at the centre. 📈
The most used measures are:
Mean: Average value of a dataset. 🧮
Median: Middle value of a dataset when arranged in order, separating the data into two equal halves. 🗂️
Mode: Most frequently occurring value in a dataset. 🔄
How can we use them to fill NaN values? 🤔
This technique is really a special case of the fill-with-defaults technique. Nevertheless, with this approach the filler is more closely related to the values we actually have. 📝🤔
Although this approach seems excellent, studying the whole context of our data is paramount. For instance, if the dataset comes from a group of college students, the ages might be between 18 and 25 years. Using the mean, median or mode to fill the empty values seems to be the right way in this case. 😄
Original Dataframe
name age city
0 John 20.0 New York
1 Emma NaN London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack NaN Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using mean (mean = 23):
name age city
0 John 20.0 New York
1 Emma 23.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 23.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using median (median = 25):
name age city
0 John 20.0 New York
1 Emma 25.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 25.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using mode (mode = 18):
name age city
0 John 20.0 New York
1 Emma 18.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 18.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
. . .
Mean of the 'age' column after filling NaN values:
1. General mean after filling with mean = 23: 20.857142857142858
2. Mean after filling with median = 25: 21.285714285714285
3. Mean after filling with mode = 18: 19.571428571428573
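A minimal pandas sketch of the three fills, assuming an age column like the one above:

```python
import pandas as pd
import numpy as np

ages = pd.Series([20, np.nan, 22, 19, np.nan, 21, 18], name="age")

# Fill copies of the column with each measure of central tendency
filled_mean   = ages.fillna(ages.mean())      # average of the known ages
filled_median = ages.fillna(ages.median())    # middle value of the known ages
filled_mode   = ages.fillna(ages.mode()[0])   # most frequent known age (first one if tied)

print("Mean after filling with the mean:  ", filled_mean.mean())
print("Mean after filling with the median:", filled_median.mean())
print("Mean after filling with the mode:  ", filled_mode.mean())
```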
Nonetheless, if the dataset comes from a whole city, the ages might range from 0 to 75 years. Using measures of central tendency in this case might not be a good idea because of the bias we could introduce (if we use more than one variable to choose the fill value, the result might be more appropriate). 🤔
Original DataFrame:
name age city
0 John 20.0 New York
1 Emma NaN London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack NaN Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using mean (mean = 30):
name age city
0 John 20.0 New York
1 Emma 30.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 30.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using median (median = 50):
name age city
0 John 20.0 New York
1 Emma 50.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 50.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using mode (mode = 28):
name age city
0 John 20.0 New York
1 Emma 28.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 28.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
. . .
Mean of the 'age' column after filling NaN values:
1. Mean after filling with mean (mean = 30): 26.857142857142858
2. Mean after filling with median (median = 50): 32.142857142857146
3. Mean after filling with mode (mode = 28): 25.428571428571427
Therefore, while measures of central tendency can serve as a quick and convenient solution, a comprehensive analysis of the data and consideration of alternative imputation techniques should be performed to ensure the validity of the imputed values. 📊🧐
The cool imputation techniques
Last but not least, let me introduce the coolest imputation techniques. 😎
"Prediction-based imputation algorithms" 💡
If you are familiar with the field of machine learning, you have probably already heard concepts such as Linear Regression or the KNN algorithm. These algorithms are not only used for predictions or classifications in machine learning but also play a vital role in the data science workflow.
The power of these techniques lies in their ability to identify relationships among data points. By finding patterns, correlations, and dependencies, these algorithms can effectively estimate missing values. 📊
Let’s take a brief look at the KNN algorithm for imputation. 🤓
KNN algorithm (K-nearest neighbors) 🧩
KNN is typically used for classification tasks. For example, an AI should classify whether the photo of an animal shows a dog or a cat. For this scenario, let’s assume the computer has been trained on a large dataset of pictures of these animals, so it can classify them up to a certain level.
The computer does this (see the sketch after these steps):
Plot the classified data and the data we want to classify.
Find and rank the nearest neighboring data points.
If the majority of the k-nearest neighbors are cats, it is likely that the new photo is also a cat. 🐱
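Here is a tiny sketch of those steps using scikit-learn’s KNeighborsClassifier (the features and labels are made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: [weight_kg, ear_length_cm] labelled as cat (0) or dog (1)
X_train = [[4.0, 6.5], [3.5, 6.0], [4.2, 7.0], [20.0, 10.0], [18.5, 9.5], [22.0, 11.0]]
y_train = [0, 0, 0, 1, 1, 1]

# Classify a new animal by looking at its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[3.8, 6.2]]))  # [0] -> most likely a cat
```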
How KNN imputation works 💡
As I’ve already mentioned, these algorithms consider not just one variable but multiple variables. For example, let’s imagine that in our scenario of ages we also have heights.
The KNN algorithm will find relations among ages and heights and will fill the empty values considering both variables.
For instance, let’s assume we have a DataFrame with ‘age’ and ‘salary’ columns, where both contain NaN values. 📐
Original DataFrame:
name age salary
0 John 20.0 50000.0
1 Emma NaN 60000.0
2 Peter 22.0 55000.0
3 Mary 19.0 NaN
4 Jack NaN 48000.0
5 Sarah 21.0 52000.0
6 Adam 18.0 45000.0
7 ... ... ...
DataFrame with NaN values filled using K-NN imputation:
name age salary
0 John 20.0 50000.0
1 Emma 21.5 60000.0
2 Peter 22.0 55000.0
3 Mary 19.0 50000.0
4 Jack 20.5 48000.0
5 Sarah 21.0 52000.0
6 Adam 18.0 45000.0
7 ... ... ...
In this example, we have applied K-NN imputation to fill the NaN values using the values from the nearest neighbors. The K-NN algorithm identifies the nearest neighbors based on the available features and uses their values to impute the missing ones.
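A minimal sketch with scikit-learn’s KNNImputer, assuming data like the table above (the choice of n_neighbors=2 is arbitrary, and the imputed numbers depend on your data; in practice you may also want to scale the columns first so salary does not dominate the distance):

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "name": ["John", "Emma", "Peter", "Mary", "Jack", "Sarah", "Adam"],
    "age": [20, np.nan, 22, 19, np.nan, 21, 18],
    "salary": [50000, 60000, 55000, np.nan, 48000, 52000, 45000],
})

# Find each row's nearest neighbours using both columns and fill NaN with their average
imputer = KNNImputer(n_neighbors=2)
df[["age", "salary"]] = imputer.fit_transform(df[["age", "salary"]])
print(df)
```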
AMAZING!!!!!!!!!!!!!! 🎉
As well as the KNN algorithm, there are several other algorithms of this type for imputing data (filling empty values):
Linear Regression
Decision trees
Random forest
Support Vector Machines
Neural Networks
Bayesian Networks
Gradient Boosting
And so on... 🚀
Conclusion 📝
Summing up, the technique you use depends entirely on the context of your data. Therefore, the most exhausting job is analyzing the future impact of our imputation techniques. But once you have chosen your technique, the feeling of filling empty values is so good… 💪
Now it's your turn to look for more information about imputation techniques and create a world without missing values. 🌍
Ismael Porto ☻
References
https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php
Gutiérrez-García, J.O. [Código Máquina]. (2022, June 13). Descubre cómo manejar Datos ó Valores Faltantes Imputando con K-Vecinos más cercanos (KNN) y Python [Video]. YouTube. https://www.youtube.com/watch?v=dToVCgCPW1o&t=170s
Gutiérrez-García, J.O. [Código Máquina]. (2021, August 21). Imputación (o Manejo de Datos Faltantes) con Python [Video]. YouTube. https://www.youtube.com/watch?v=XiKYdHUsgyM&t=438s