Introduction 🤖
As data analysts, our job is to extract insights that support data-driven decisions. But how can we get our insights across if our datasets are full of missing values?
"In the real world, data is imperfect, and missing values are the norm. Learn to work with what you have." - Anonymous
The problem ❓
To start with, imagine you want to know the mean age and height of a group of 50 people. Even though you have told them the information will be kept confidential, some people refuse to give you their data (around 65% of the group). As a result, you’ll have a considerable amount of missing values that will affect your analysis. 😔
Certainly, if you calculate the mean with just the available values, you won’t get means that properly describe the entire group. For instance, if we only have 4 ages and 4 heights but our group is around 50 people, our mean will be wrong for the whole group. 😕
To achieve our goal, we’ve come up with these ideas:
Drop out the analysis. 😎
Delete the missing values.
Fill the missing values with default data. For instance, fill the missing ages with 0.
Fill the missing values with one of the measures of central tendency (mode, median, mean) of the available data.🤯
Think of a way to fill the values using predictions based on the available data. 🤔
Drop out the analysis 😎
Delete the missing values
Another way to treat our missing values is to drop them from our dataset. Nonetheless, this isn’t a good idea if any of our columns has tons of missing values.
To illustrate, let’s imagine two scenarios for a dataset with 50 rows and the columns name, age and height:
We have 65% missing values. If we perform the analysis mentioned above (the mean), the result will be heavily biased and we won’t be able to put across correct facts.
We have only 2 missing values. Dropping them out will not have any significant impact on our analysis.
To illustrate the impact, let’s see an example using small dataframes.
Original DataFrame:
Name Age Height
0 Tom NaN 184.0
1 John 10.0 NaN
2 Daniel 24.0 137.0
3 John NaN NaN
4 Anna 9.0 162.0
. . .
DataFrame after dropping missing values:
Name Age Height
2 Daniel 24.0 137.0
4 Anna 9.0 162.0
. . .
Mean of ages before dropping missing values: 14.34
Mean of heights before dropping missing values: 161.0
. . .
Mean of ages after dropping missing values: 16.5
Mean of heights after dropping missing values: 149.5
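If you want to reproduce a comparison like the one above, here is a minimal pandas sketch (the values are just sample data, so the exact means may differ from the numbers shown):

```python
import pandas as pd
import numpy as np

# Small sample DataFrame with missing ages and heights
df = pd.DataFrame({
    "Name": ["Tom", "John", "Daniel", "John", "Anna"],
    "Age": [np.nan, 10, 24, np.nan, 9],
    "Height": [184, np.nan, 137, np.nan, 162],
})

# Drop every row that contains at least one missing value
df_dropped = df.dropna()

# pandas skips NaN by default, so these are means over the available values only
print("Mean of ages before dropping:", round(df["Age"].mean(), 2))
print("Mean of heights before dropping:", round(df["Height"].mean(), 2))
print("Mean of ages after dropping:", round(df_dropped["Age"].mean(), 2))
print("Mean of heights after dropping:", round(df_dropped["Height"].mean(), 2))
```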
Summing up, dropping missing values should be considered if and only if its impact on the analysis is insignificant.
Fill with defaults 📝
Apart from just dropping missing values, there is a technique that allows us to fill them with default values. For example, let’s suppose we have missing values in the city column. One way to deal with them is to fill them with some default text like “No city”.
Original dataframe
name age city
0 John 25 NaN
1 Emma 32 London
2 Peter 40 NaN
3 Mary 28 Sydney
4 Jack 35 NaN
Dataframe after filling with default values
name age city
0 John 25 No city
1 Emma 32 London
2 Peter 40 No city
3 Mary 28 Sydney
4 Jack 35 No city
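A minimal pandas sketch of this fill, assuming sample data like the table above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["John", "Emma", "Peter", "Mary", "Jack"],
    "age": [25, 32, 40, 28, 35],
    "city": [np.nan, "London", np.nan, "Sydney", np.nan],
})

# Replace the missing cities with a default label
df["city"] = df["city"].fillna("No city")
print(df)
```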
This is a great way to handle missing categorical data. However, what happens if we use the same technique to fill numeric data? While filling categorical data this way has little side effect, filling numerical data does. 🔢
Without a doubt, our analysis will be affected the moment we add those new values. For instance, let’s return to the mean exercise.
- We want to know the mean age of a group of 50 teenagers. Our dataset has around 65% missing values; we fill them with the number ‘2’ as a default value and then compute the mean. Will the mean be as reliable as it is supposed to be? 😕
Original dataframe
name age city
0 John 25.0 New York
1 Emma NaN London
2 Peter 40.0 Paris
3 Mary 28.0 Sydney
4 Jack 35.0 Berlin
5 Sarah NaN Tokyo
6 Adam 32.0 Dubai
7 ... ... ...
DataFrame with NaN values filled with 2:
name age city
0 John 25.0 New York
1 Emma 2.0 London
2 Peter 40.0 Paris
3 Mary 28.0 Sydney
4 Jack 35.0 Berlin
5 Sarah 2.0 Tokyo
6 Adam 32.0 Dubai
7 ... ... ...
Mean of the 'age' column (considering 50 rows):
Mean of original DataFrame: 31.0
Mean of DataFrame with NaN values filled with 2: 21.0
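The effect is easy to reproduce with a small sketch (illustrative values only, not the 50-row dataset from the example):

```python
import pandas as pd
import numpy as np

ages = pd.Series([25, np.nan, 40, 28, 35, np.nan, 32], name="age")

# pandas ignores NaN by default, so this is the mean of the known ages only
print("Mean of the available ages:", ages.mean())

# Filling with an arbitrary default such as 2 drags the mean down
print("Mean after filling NaN with 2:", ages.fillna(2).mean())
```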
Therefore, can we say this is a bad technique for numerical values?
In short, no. Even though the analysis will suffer a side effect, our job is the same as with the first technique: 😌
analyze how big the side effect of going this way is.
Fill with measures of central tendency 🤯
One of the easiest ways to fill numerical empty values is to use the famous measures of central tendency.
What are measures of central tendency? 🧐
Numbers. They summarize the distribution of our data in a single value located at the centre. 📈
The most used measures are:
Mean: Average value of a dataset. 🧮
Median: Middle value of a dataset when arranged in order, separating the data into two equal halves. 🗂️
Mode: Most frequently occurring value in a dataset. 🔄
How can we use them to fill NaN values? 🤔
This technique is really a special case of the fill-with-defaults technique. Nevertheless, with this approach the filler is more closely related to the values we actually have. 📝🤔
Although this approach seems excellent, studying the whole context of our data is paramount. For instance, if the dataset comes from a group of college students, the ages might be between 18 and 25 years. Using the mean, median or mode to fill the empty values seems to be the right way in this case. 😄
Original Dataframe
name age city
0 John 20.0 New York
1 Emma NaN London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack NaN Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using mean (mean = 23):
name age city
0 John 20.0 New York
1 Emma 23.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 23.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using median (median = 25):
name age city
0 John 20.0 New York
1 Emma 25.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 25.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using mode (mode = 18):
name age city
0 John 20.0 New York
1 Emma 18.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 18.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
. . .
Mean of the 'age' column after filling NaN values:
1. General mean after filling with mean = 23: 20.857142857142858
2. Mean after filling with median = 25: 21.285714285714285
3. Mean after filling with mode = 18: 19.571428571428573
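A minimal pandas sketch of the three fills, assuming an age column like the one above:

```python
import pandas as pd
import numpy as np

ages = pd.Series([20, np.nan, 22, 19, np.nan, 21, 18], name="age")

# Fill copies of the column with each measure of central tendency
filled_mean   = ages.fillna(ages.mean())      # average of the known ages
filled_median = ages.fillna(ages.median())    # middle value of the known ages
filled_mode   = ages.fillna(ages.mode()[0])   # most frequent known age (first one if tied)

print("Mean after filling with the mean:  ", filled_mean.mean())
print("Mean after filling with the median:", filled_median.mean())
print("Mean after filling with the mode:  ", filled_mode.mean())
```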
Nonetheless, if the dataset comes from a whole city, the ages might range from 0 to 75 years. Using measures of central tendency in this case might not be a good idea because of the bias we could introduce (if we use more than one variable to choose the fill value, the result might be more appropriate). 🤔
Original DataFrame:
name age city
0 John 20.0 New York
1 Emma NaN London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack NaN Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using mean (mean = 30):
name age city
0 John 20.0 New York
1 Emma 30.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 30.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using median (median = 50):
name age city
0 John 20.0 New York
1 Emma 50.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 50.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
DataFrame with NaN values filled using mode (mode = 28):
name age city
0 John 20.0 New York
1 Emma 28.0 London
2 Peter 22.0 Paris
3 Mary 19.0 Sydney
4 Jack 28.0 Berlin
5 Sarah 21.0 Tokyo
6 Adam 18.0 Dubai
7 ... ... ...
. . .
Mean of the 'age' column after filling NaN values:
1. Mean after filling with mean (mean = 30): 26.857142857142858
2. Mean after filling with median (median = 50): 32.142857142857146
3. Mean after filling with mode (mode = 28): 25.428571428571427
Therefore, while measures of central tendency can serve as a quick and convenient solution, a comprehensive analysis of the data and consideration of alternative imputation techniques should be performed to ensure the validity of the imputed values. 📊🧐
The cool imputation techniques
Last but not least, let me introduce the coolest imputation techniques. 😎
"Prediction-based imputation algorithms" 💡
If you are familiar with the field of machine learning, you have probably already heard concepts such as Linear Regression or the KNN algorithm. These algorithms are not only used for predictions or classifications in machine learning but also play a vital role in the data science workflow.
The power of these techniques lies in their ability to identify relationships among data points. By finding patterns, correlations, and dependencies, these algorithms can effectively estimate missing values. 📊
Let’s take a brief look at the KNN algorithm for imputation. 🤓
KNN algorithm (K-nearest neighbors) 🧩
KNN is typically used for classification tasks. For example, an AI should classify whether the photo of an animal shows a dog or a cat. For this scenario, let’s assume the computer has been trained on a large dataset of pictures of these animals, so it can classify them up to a certain level.
The computer does this (see the sketch after these steps):
Plot the classified data and the data we want to classify.
Find and rank the nearest neighboring data points.
If the majority of the k-nearest neighbors are cats, it is likely that the new photo is also a cat. 🐱
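Here is a tiny sketch of those steps using scikit-learn’s KNeighborsClassifier (the features and labels are made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: [weight_kg, ear_length_cm] labelled as cat (0) or dog (1)
X_train = [[4.0, 6.5], [3.5, 6.0], [4.2, 7.0], [20.0, 10.0], [18.5, 9.5], [22.0, 11.0]]
y_train = [0, 0, 0, 1, 1, 1]

# Classify a new animal by looking at its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[3.8, 6.2]]))  # [0] -> most likely a cat
```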
How KNN imputation works 💡
As I’ve already mentioned, these algorithms consider not just one variable but multiple variables. For example, let’s imagine that in our scenario of ages we also have heights.
The KNN algorithm will find relations among ages and heights and will fill the empty values considering both variables.
For instance, let’s assume we have a DataFrame with ‘age’ and ‘salary’ columns, where both contain NaN values. 📐
Original DataFrame:
name age salary
0 John 20.0 50000.0
1 Emma NaN 60000.0
2 Peter 22.0 55000.0
3 Mary 19.0 NaN
4 Jack NaN 48000.0
5 Sarah 21.0 52000.0
6 Adam 18.0 45000.0
7 ... ... ...
DataFrame with NaN values filled using K-NN imputation:
name age salary
0 John 20.0 50000.0
1 Emma 21.5 60000.0
2 Peter 22.0 55000.0
3 Mary 19.0 50000.0
4 Jack 20.5 48000.0
5 Sarah 21.0 52000.0
6 Adam 18.0 45000.0
7 ... ... ...
In this example, we have applied K-NN imputation to fill the NaN values using the values from the nearest neighbors. The K-NN algorithm identifies the nearest neighbors based on the available features and uses their values to impute the missing ones.
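A minimal sketch with scikit-learn’s KNNImputer, assuming data like the table above (the choice of n_neighbors=2 is arbitrary, and the imputed numbers depend on your data; in practice you may also want to scale the columns first so salary does not dominate the distance):

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "name": ["John", "Emma", "Peter", "Mary", "Jack", "Sarah", "Adam"],
    "age": [20, np.nan, 22, 19, np.nan, 21, 18],
    "salary": [50000, 60000, 55000, np.nan, 48000, 52000, 45000],
})

# Find each row's nearest neighbours using both columns and fill NaN with their average
imputer = KNNImputer(n_neighbors=2)
df[["age", "salary"]] = imputer.fit_transform(df[["age", "salary"]])
print(df)
```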
AMAZING!!!!!!!!!!!!!! 🎉
As well as the KNN algorithm, there are several other algorithms of this type for imputing data (filling empty values):
Linear Regression
Decision trees
Random forest
Support Vector Machines
Neural Networks
Bayesian Networks
Gradient Boosting
And so on... 🚀
Conclusion 📝
Summing up, the technique you use depends entirely on the context of your data. Therefore, the most exhausting job is analyzing the future impact of our imputation techniques. But once you have chosen your technique, the feeling of filling empty values is so good… 💪
Now it's your turn to look for more information about imputation techniques and create a world without missing values. 🌍
Ismael Porto ☻
References
https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php
Gutiérrez-García, J.O. [Código Máquina]. (2022, June 13). Descubre cómo manejar Datos ó Valores Faltantes Imputando con K-Vecinos más cercanos (KNN) y Python [Video]. YouTube. https://www.youtube.com/watch?v=dToVCgCPW1o&t=170s
Gutiérrez-García, J.O. [Código Máquina]. (2021, August 21). Imputación (o Manejo de Datos Faltantes) con Python [Video]. YouTube. https://www.youtube.com/watch?v=XiKYdHUsgyM&t=438s