Why is Data Cleaning Important?
Data from the real world is often messy, and cleaning it up is an integral step in any machine learning project. Machine learning algorithms are sensitive to the quality of the data we feed them, so careful cleaning is essential for good model performance.
Dealing with missing data, experimenting with suitable imputation strategies, and making sure the data is ready for the rest of the pipeline are therefore crucial.
In this blog post, we shall look at features of scikit-learn that help us handle missing data more gracefully! In general, how do we handle missing values in input data? The usual approaches are the following (a short sketch of each appears after the list).
- By dropping columns containing NaNs.
- By dropping rows containing NaNs.
- By imputing the missing values suitably.
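For reference, here is a minimal sketch of these three usual approaches, using a tiny made-up DataFrame (the column names and values are purely illustrative):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# a tiny illustrative DataFrame with one missing value
df = pd.DataFrame({'Age': [20, 30, np.nan], 'Fare': [7.0, 8.0, 9.0]})

# 1. Drop columns containing NaNs (loses the whole 'Age' column)
print(df.dropna(axis=1))

# 2. Drop rows containing NaNs (loses the row with the missing Age)
print(df.dropna(axis=0))

# 3. Impute the missing values suitably (mean imputation here)
print(SimpleImputer(strategy='mean').fit_transform(df))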
Wouldn’t it be cool if we could do the following instead?
- Encode ‘missingness’ as a feature.
- Use HistGradientBoostingClassifier, which natively handles missing values.
Encoding ‘Missingness’ as a Feature
When imputing missing values, if we would like to preserve the information about which values were missing and use that as a feature, we can do so by setting the add_indicator parameter of scikit-learn's SimpleImputer to True. Here's an example. Let's import numpy and pandas using their usual aliases np and pd.
# Necessary imports
import numpy as np
import pandas as pd
Let’s create a pandas DataFrame X with one missing value.
X = pd.DataFrame({'Age':[20, 30, 10, np.nan, 10]})
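If we print X, we can see the single NaN at index 3; the output should look roughly like this:
print(X)
#     Age
# 0  20.0
# 1  30.0
# 2  10.0
# 3   NaN
# 4  10.0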
Now, we shall import the SimpleImputer from scikit-learn.
from sklearn.impute import SimpleImputer
We shall now instantiate a SimpleImputer, which by default performs mean imputation, replacing every missing value with the average of the values that are present. Here, the missing value is filled with (20+30+10+10)/4 = 17.5. Let's verify the output.
# Mean Imputation
imputer = SimpleImputer()
imputer.fit_transform(X)
# After Imputation
array([[20. ],
[30. ],
[10. ],
[17.5],
[10. ]])
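Mean imputation is only the default. The strategy parameter of SimpleImputer also accepts 'median', 'most_frequent' and 'constant', which can be handy when the mean is not a good fill value:
# Median imputation: fills the NaN with 15.0 here
SimpleImputer(strategy='median').fit_transform(X)

# Most-frequent (mode) imputation: fills the NaN with 10.0 here
SimpleImputer(strategy='most_frequent').fit_transform(X)

# Constant imputation with a user-chosen fill value
SimpleImputer(strategy='constant', fill_value=0).fit_transform(X)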
In order to encode the missingness of values as a feature, we can set the add_indicator argument to True and observe the output.
# impute the mean and add an indicator matrix (new in scikit-learn 0.21)
imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(X)
# After adding missingness indicator
array([[20. , 0. ],
[30. , 0. ],
[10. , 0. ],
[17.5, 1. ],
[10. , 0. ]])
In the output, we observe that an indicator value of 1 is inserted at index 3, where the original value was missing. The add_indicator option is available in scikit-learn version 0.21 and above. In the next section, we shall see how to use the HistGradientBoostingClassifier, which natively handles missing values.
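Before moving on, here is a minimal sketch of how SimpleImputer with add_indicator=True could slot into a full modelling pipeline. The LogisticRegression estimator and the made-up labels below are purely illustrative and not part of the original example:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# step 1 fills the NaN and appends the missingness indicator column,
# step 2 trains a classifier on the imputed features
pipe = make_pipeline(SimpleImputer(add_indicator=True), LogisticRegression())

y = [0, 1, 0, 1, 0]  # made-up labels, one per row of X
pipe.fit(X, y)
pipe.predict(X)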
Using HistGradientBoostingClassifier
To use this feature, available in scikit-learn version 0.22 and above, let's download the very popular Titanic: Machine Learning from Disaster dataset from Kaggle.
import pandas as pd
train = pd.read_csv('http://bit.ly/kaggletrain')
test = pd.read_csv('http://bit.ly/kaggletest', nrows=175)
Now that we’ve imported the dataset, let’s go ahead and create the datasets for training and testing.
train = train[['Survived', 'Age', 'Fare', 'Pclass']]
test = test[['Age', 'Fare', 'Pclass']]
To better understand the missing values, let’s compute the number of missing values in each column of the training and test sets.
# count the number of NaNs in each column
print(train.isna().sum())
Survived 0
Age 177
Fare 0
Pclass 0
dtype: int64
print(test.isna().sum())
Age 36
Fare 1
Pclass 0
dtype: int64
We see that both the train and test subsets contain missing values. The output label for the classifier is Survived, which is 1 if the passenger survived and 0 if they did not.
label = train.pop('Survived')
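As a quick sanity check (not part of the original walkthrough), value_counts shows how many passengers fall into each class of the label:
# counts of non-survivors (0) and survivors (1) in the training labels
label.value_counts()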
Let’s import HistGradientBoostingClassifier
from scikit-learn.
# explicitly enable the (then-experimental) histogram-based gradient boosting estimators
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
As always, let us instantiate the classifier, fit it on the training set train, and predict on the test set test. Note that we did not impute the missing values; ordinarily, most scikit-learn estimators raise an error when the input contains NaNs. Let us check what happens here.
clf = HistGradientBoostingClassifier()
# no errors, despite NaNs in train and test sets!
clf.fit(train, label)
clf.predict(test)
# Output
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1,
0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
Surprisingly, there are no errors and we get predictions for all records in the test set even though there were missing values. Isn’t this cool? Be sure to try out these features in your next project. Happy Learning!