Why is Data Cleaning Important?
Data from the real world is often messy, and cleaning it up is an integral step in any machine learning project. Machine learning algorithms are sensitive to the quality of the data we feed them, so careful cleaning is essential for good model performance.
Dealing with missing data, experimenting with suitable imputation strategies, and making sure the data is ready for the rest of the pipeline are therefore crucial.
In this blog post, we shall look at features of scikit-learn that help us handle missing data more gracefully! In general, how do we handle missing values in input data? The usual approaches are the following (a short sketch of each appears after the list).
- By dropping columns containing NaNs.
- By dropping rows containing NaNs.
- By imputing the missing values suitably.
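For reference, here is a minimal sketch of these three usual approaches, using a tiny made-up DataFrame (the column names and values are purely illustrative):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# a tiny illustrative DataFrame with one missing value
df = pd.DataFrame({'Age': [20, 30, np.nan], 'Fare': [7.0, 8.0, 9.0]})

# 1. Drop columns containing NaNs (loses the whole 'Age' column)
print(df.dropna(axis=1))

# 2. Drop rows containing NaNs (loses the row with the missing Age)
print(df.dropna(axis=0))

# 3. Impute the missing values suitably (mean imputation here)
print(SimpleImputer(strategy='mean').fit_transform(df))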
Wouldn’t it be cool if we could do the following instead?
- Encode ‘missingness’ as a feature.
- Use HistGradientBoostingClassifier, which natively handles missing values.
Encoding ‘Missingness’ as a Feature
When imputing missing values, if we would like to preserve the information about which values were missing and use that as a feature, we can do so by setting the add_indicator parameter of scikit-learn's SimpleImputer to True. Here's an example. Let's import numpy and pandas using their usual aliases np and pd.
# Necessary imports
import numpy as np
import pandas as pd
Let’s create a pandas DataFrame X with one missing value.
X = pd.DataFrame({'Age':[20, 30, 10, np.nan, 10]})
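If we print X, we can see the single NaN at index 3; the output should look roughly like this:
print(X)
#     Age
# 0  20.0
# 1  30.0
# 2  10.0
# 3   NaN
# 4  10.0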
Now, we shall import the SimpleImputer from scikit-learn.
from sklearn.impute import SimpleImputer
We shall now instantiate a SimpleImputer, which by default performs mean imputation, replacing every missing value with the average of the values that are present. Here, the missing value is filled with (20+30+10+10)/4 = 17.5. Let's verify the output.
# Mean Imputation
imputer = SimpleImputer()
imputer.fit_transform(X)
# After Imputation
array([[20. ],
[30. ],
[10. ],
[17.5],
[10. ]])
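Mean imputation is only the default. The strategy parameter of SimpleImputer also accepts 'median', 'most_frequent' and 'constant', which can be handy when the mean is not a good fill value:
# Median imputation: fills the NaN with 15.0 here
SimpleImputer(strategy='median').fit_transform(X)

# Most-frequent (mode) imputation: fills the NaN with 10.0 here
SimpleImputer(strategy='most_frequent').fit_transform(X)

# Constant imputation with a user-chosen fill value
SimpleImputer(strategy='constant', fill_value=0).fit_transform(X)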
In order to encode the missingness of values as a feature, we can set the add_indicator argument to True and observe the output.
# impute the mean and add an indicator matrix (new in scikit-learn 0.21)
imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(X)
# After adding missingness indicator
array([[20. , 0. ],
[30. , 0. ],
[10. , 0. ],
[17.5, 1. ],
[10. , 0. ]])
In the output, we observe that an indicator value of 1 is inserted at index 3, where the original value was missing. The add_indicator option is available in scikit-learn version 0.21 and above. In the next section, we shall see how to use the HistGradientBoostingClassifier, which natively handles missing values.
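Before moving on, here is a minimal sketch of how SimpleImputer with add_indicator=True could slot into a full modelling pipeline. The LogisticRegression estimator and the made-up labels below are purely illustrative and not part of the original example:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# step 1 fills the NaN and appends the missingness indicator column,
# step 2 trains a classifier on the imputed features
pipe = make_pipeline(SimpleImputer(add_indicator=True), LogisticRegression())

y = [0, 1, 0, 1, 0]  # made-up labels, one per row of X
pipe.fit(X, y)
pipe.predict(X)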
Using HistGradientBoostingClassifier
To use this feature, available in scikit-learn version 0.22 and above, let's download the very popular Titanic: Machine Learning from Disaster dataset from Kaggle.
import pandas as pd
train = pd.read_csv('http://bit.ly/kaggletrain')
test = pd.read_csv('http://bit.ly/kaggletest', nrows=175)
Now that we’ve imported the dataset, let’s go ahead and create the datasets for training and testing.
train = train[['Survived', 'Age', 'Fare', 'Pclass']]
test = test[['Age', 'Fare', 'Pclass']]
To better understand the missing values, let’s compute the number of missing values in each column of the training and test sets.
# count the number of NaNs in each column
print(train.isna().sum())
Survived 0
Age 177
Fare 0
Pclass 0
dtype: int64
print(test.isna().sum())
Age 36
Fare 1
Pclass 0
dtype: int64
We see that both the train and test subsets contain missing values. The output label for the classifier is Survived, which is 1 if the passenger survived and 0 if they did not.
label = train.pop('Survived')
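As a quick sanity check (not part of the original walkthrough), value_counts shows how many passengers fall into each class of the label:
# counts of non-survivors (0) and survivors (1) in the training labels
label.value_counts()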
Let’s import HistGradientBoostingClassifier
from scikit-learn.
# explicitly enable the (then-experimental) histogram-based gradient boosting estimators
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
As always, let us instantiate the classifier, fit it on the training set train, and predict on the test set test. Note that we did not impute the missing values; ordinarily, most scikit-learn estimators raise an error when the input contains NaNs. Let us check what happens here.
clf = HistGradientBoostingClassifier()
# no errors, despite NaNs in train and test sets!
clf.fit(train, label)
clf.predict(test)
# Output
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1,
0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
Surprisingly, there are no errors and we get predictions for all records in the test set even though there were missing values. Isn’t this cool? Be sure to try out these features in your next project. Happy Learning!