
Edward Amor

Originally published at edwardamor.xyz

Model Selection, Validation, and Hyperparameter Tuning

In practice, the majority of the time dedicated to any data science project (unless you’re lucky) is consumed by data cleaning and wrangling. However, once you’ve completed your data mining, cleaning, exploration, and feature engineering, the next step is generally to do some machine learning. The ML process is fairly standard regardless of the algorithm you choose: it will always involve some model selection, model validation, and hyperparameter tuning. One of the easiest ways to wrap your mind around the process is through trial and error.

Logistic Regression

Logistic regression is a very well known supervised binary classification algorithm. Unlike linear regression, where you predict a continuous value, logistic regression outputs either a hard prediction of 0 or 1 or a probability (softmax regression extends logistic regression to more than two classes). Although logistic regression isn’t as fancy as neural networks or natural language processing, it is still a significantly useful tool in any data scientist’s toolbelt. Just like other classification algorithms, it can be used for the tasks below; a minimal sketch of the prediction step follows the list:

  • Predicting Bank Loan Worthiness
  • Detecting Credit Card Fraud
  • Detecting Email Spam
  • And Many More …
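
Under the hood, logistic regression passes a linear combination of the features through the sigmoid function to get a probability, then thresholds that probability to produce a class label. Here is a minimal sketch of that prediction step; the weights, bias, and feature vector are made up purely for illustration.

# minimal sketch of the logistic regression prediction step (illustrative values)
import numpy as np

def sigmoid(z):
    # squash a real-valued score into a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -1.2, 0.4])  # hypothetical learned weights
b = 0.1                         # hypothetical intercept
x = np.array([1.5, 0.3, 2.0])   # hypothetical feature vector

p = sigmoid(w @ x + b)  # predicted probability of class 1
label = int(p >= 0.5)   # threshold at 0.5 for a hard 0/1 prediction
print(p, label)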

To get a grasp on how model selection, model validation, and hyperparameter tuning work we’ll run through an example with a simple dataset. For the purposes of demonstration we’ll be using a dataset from the UCI Machine Learning Repository.

Blood Transfusion Service Center Data Set

The dataset we’ll be using can be found in the UCI Machine Learning Repository. Below is some information about the data for those who would like to know; it was taken directly from the repository.

Summary:

To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center passes their blood transfusion service bus to one university in Hsin-Chu City to gather blood donated about every three months. To build a FRMTC model, we selected 748 donors at random from the donor database. These 748 donor data, each one included R (Recency - months since last donation), F (Frequency - total number of donation), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stand for donating blood; 0 stands for not donating blood).

Attribute Information:

  • R (Recency - months since last donation)
  • F (Frequency - total number of donation)
  • M (Monetary - total blood donated in c.c.)
  • T (Time - months since first donation)
  • a binary variable representing whether he/she donated blood in March 2007 (1 stand for donating blood; 0 stands for not donating blood)

Source:

Original Owner and Donor

Prof. I-Cheng Yeh

Department of Information Management

Chung-Hua University,

Hsin Chu, Taiwan 30067, R.O.C.

e-mail: icyeh ‘at’ chu.edu.tw

TEL:886-3-5186511

Citation Request:

Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, “Knowledge discovery on RFM model using Bernoulli sequence,” Expert Systems with Applications, 2008 (doi:10.1016/j.eswa.2008.07.018).

Exploratory Data Analysis

It’s always a good habit after extracting and cleaning your data to perform some EDA to get a grasp of the main characteristics of the data you’ll be working with. It provides you with visuals which immensely assist in the preprocessing steps later on.

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report, plot_roc_curve, plot_confusion_matrix, plot_precision_recall_curve
from sklearn.preprocessing import StandardScaler

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

%matplotlib inline
sns.set(font_scale=1.1)

SEED = 890432


# load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data"
df = pd.read_csv(url)
df.columns = ["r", "f", "m", "t", "y"] # rename the columns for brevity
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   r       748 non-null    int64
 1   f       748 non-null    int64
 2   m       748 non-null    int64
 3   t       748 non-null    int64
 4   y       748 non-null    int64
memory usage: 29.3 KB

In our case, the dataset is typed correctly and free of any null values, since most of the processing has already been done, leaving just EDA and modeling to us. Since this is a classification challenge, one of the best things to look at immediately is the class distribution of your output. This will give us insight into whether we should implement some downsampling/upsampling/hybrid-sampling to adjust for imbalance.

# plot class distribution
plt.figure(figsize=(16, 9))
sns.countplot(x="y", data=df)

# adjust figure 
ticks, _ = plt.xticks()
plt.xticks(ticks, ["Didn't donate in 2007 (0)", "Did donate in 2007 (1)"])
plt.xlabel("")
plt.ylabel("# of individuals")
plt.title("Class Distribution of Dependent Variable")

plt.show()

[Figure: class distribution of the dependent variable]

There is for sure a class imbalance, which we’ll have to account for later in our modeling, but this highlights why it’s always important to visually explore your data. Next we should also determine whether there is any correlation amongst our independent variables; if there is, we could implement some dimensionality reduction or remove some redundant variables. We’ll inspect this by plotting the pairwise relationships between the variables, and also viewing a heatmap of pairwise Pearson correlation values.

# view the correlation between variables
sns.pairplot(df, vars=["r", "f", "m", "t"], aspect=1.3, corner=True)
plt.show()

[Figure: pairplot of r, f, m, and t]

Simply looking at the histogram of each variable, we see they’re all positively skewed, which is another thing we’ll have to adjust for before modeling by scaling our data. Along with that, the variables “f” and “m” appear to be positively correlated, forming basically a straight line. This makes sense, as from our attribute information we know that frequency is “the total number of donations” and monetary is “the total blood donated in c.c.”. We should therefore expect that as frequency increases, monetary will also increase.
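
As a quick, illustrative numeric check of that skew (it just calls pandas’ built-in skew() on the df loaded above), positive values confirm what the histograms show:

# positive skewness values confirm the right skew seen in the histograms
print(df[["r", "f", "m", "t"]].skew())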

# obtain the pairwise pearson correlation
correlation = df[["r", "f", "m", "t"]].corr()
mask = np.triu(np.ones_like(correlation)) # for removing upper diagonal of information

# plot heatmap of pearson correlation
plt.figure(figsize=(16, 9))
labels = ["Recency", "Frequency", "Monetary", "Time"]
sns.heatmap(correlation, annot=True, square=True, xticklabels=labels, yticklabels=labels, mask=mask)

plt.title("Pairwise Correlation Heatmap")
plt.show()

[Figure: pairwise correlation heatmap]

Our suspicions are confirmed, and by the looks of it our Frequency and Monetary variables are perfectly positively correlated with each other. We’ll remove one of these columns, as it’ll interfere with our modeling efforts down the line. It also appears that Frequency/Monetary are somewhat positively correlated with Time; my heuristic is to leave a variable in if the Pearson correlation satisfies |x| < .75, so in this case we’ll simply stick to removing either Frequency or Monetary.

# drop the monetary column
df.drop(columns=["m"], inplace=True, errors="ignore")

Since this is a classification task, another good thing to do is look at the distribution of values in each class. To get a quick summary of this information, we can use a box and whisker plot.

# plot categorical distribution of values
plt.figure(figsize=(16, 9))

plt.subplot(131)
sns.boxplot(data=df, x="y", y="r", showmeans=True)
plt.title("Recency")

plt.subplot(132)
sns.boxplot(data=df, x="y", y="f", showmeans=True)
plt.title("Frequency")

plt.subplot(133)
sns.boxplot(data=df, x="y", y="t", showmeans=True)
plt.title("Time")

plt.show()

[Figure: box plots of Recency, Frequency, and Time by class]

As noted before, we will be scaling our data to within the same range; it is quite noticeable that the ranges in the above 3 plots differ substantially. Other than that, what stands out to me is the somewhat similar characteristics between the classes, particularly for the Time variable. For the Recency and Frequency variables, we see a decent number of outliers which could hamper our logit model’s ability to detect the difference between the two classes. After generating a vanilla model we’ll assess its performance and decide whether we want to drop our outlier observations.

Last but certainly not least, we’ll look at the descriptive statistics of our variables. This is typically also helpful at the beginning of any EDA, as you should notice any suspicious facts about your data fairly quickly.

# output descriptive statistics
df.describe()
                r           f           t           y
count  748.000000  748.000000  748.000000  748.000000
mean     9.506684    5.514706   34.282086    0.237968
std      8.095396    5.839307   24.376714    0.426124
min      0.000000    1.000000    2.000000    0.000000
25%      2.750000    2.000000   16.000000    0.000000
50%      7.000000    4.000000   28.000000    0.000000
75%     14.000000    7.000000   50.000000    0.000000
max     74.000000   50.000000   98.000000    1.000000

Nothing unusual here, as we’ve uncovered the majority of our information from our previous visualizations.

Preprocessing

This step is short and just involves setting our data up for modeling by splitting it into training and testing sets and doing any additional scaling/manipulation. Typically it’s best to use a pipeline for scaling/manipulation of your data, as it’ll reduce the headache down the line, and provide you with a simple interface for modeling.

# create our data pipeline
imbalanced_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=SEED)
)

skfold = StratifiedKFold(shuffle=True, random_state=SEED) 

Our pipeline is quite simple, but given a different dataset with different characteristics, we’d have to use something else. One thing to note is that we are doing nothing to account for class imbalance in this vanilla pipeline; based on our assessment of it, we will see whether we need to.

# separate our data into training and testing sets
X = df[["r", "f", "t"]]
y = df["y"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=SEED, stratify=y)

We use the stratify keyword argument to make sure we separate a proportionate amount of each class into the training and testing sets. So roughly 70% of class 0 and class 1 respectively will land in the training set, and 30% of each in the test set.
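
As an illustrative sanity check (reusing the y, y_train, and y_test defined above), the class proportions in each split should closely mirror the full dataset:

# class proportions should be nearly identical across the full set and both splits
print(y.value_counts(normalize=True))        # full dataset
print(y_train.value_counts(normalize=True))  # training split
print(y_test.value_counts(normalize=True))   # testing split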

Model Validation

To validate the results of our training, we’ll be using stratified k-fold cross validation to get a general sense of our model’s performance.

# perform CV on our data and output f1 results
f1_score = cross_val_score(imbalanced_pipeline, X_train, y_train, scoring="f1", cv=skfold, n_jobs=-1).mean()
"%.4f" % f1_score
'0.2470'

Our current model performs horribly. This may be due to a number of reasons; it could even be that a logistic regression just isn’t suited to this task. However, we should take this baseline score and try to improve on it.

# fit our model and output precision recall curve
imbalanced_pipeline.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(16, 9))
plot_precision_recall_curve(imbalanced_pipeline, X_train, y_train, ax=ax)
plt.title("Precision-Recall Curve Baseline Logit")
plt.show()

[Figure: precision-recall curve for the baseline logit]

The precision-recall curve for our classifier really tells us that this model performs poorly: it misses the positive class far more often than it catches it.

# output classification report
print(classification_report(y_train, imbalanced_pipeline.predict(X_train)))
              precision    recall  f1-score   support

           0       0.79      0.98      0.88       399
           1       0.72      0.17      0.27       124

    accuracy                           0.79       523
   macro avg       0.76      0.57      0.58       523
weighted avg       0.78      0.79      0.73       523

It appears we are being severely hurt by the imbalance in our classes, so next we’ll use synthetic minority oversampling to improve our model. Although this isn’t hyperparameter tuning, it is important to tune your data just as much as you’d tune your model, as what you get out is only as good as what you put in.

# make balanced pipeline
balanced_pipeline = make_pipeline(
    StandardScaler(),
    SMOTE(random_state=SEED), # completely balance the two classes
    LogisticRegression(solver="liblinear", random_state=SEED)
)

Our new pipeline integrates the imbalanced-learn library’s Synthetic Minority Oversampling Technique (SMOTE) to upsample our minority class. It does this by generating synthetic data points from our existing minority class data. The result is an updated training dataset with a balanced class distribution. Since we’ll have a balanced dataset for training, we’ll be able to use an ROC curve and its AUC to assess performance.
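
As an aside, you can see what SMOTE does to the class counts by resampling the training set directly; this snippet is illustrative only (it reuses the X_train, y_train, and SEED defined above) and isn’t part of the pipeline itself:

# compare class counts before and after SMOTE oversampling
from collections import Counter

X_resampled, y_resampled = SMOTE(random_state=SEED).fit_resample(X_train, y_train)
print("before:", Counter(y_train))      # imbalanced: 399 vs 124
print("after: ", Counter(y_resampled))  # both classes equal in size after SMOTE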

# perform CV on our data and output f1 results
f1_score = cross_val_score(balanced_pipeline, X_train, y_train, scoring="f1", cv=skfold, n_jobs=-1).mean()
"%.4f" % f1_score
'0.5153'

We’ve already doubled our baseline f1 score, which is an excellent sign; we desperately needed that oversampling.

# fit our model and output ROC curve, now that our pipeline is balancing our dataset
balanced_pipeline.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(16, 9))
plot_roc_curve(balanced_pipeline, X_train, y_train, ax=ax)
plt.title("Precision-Recall Curve Balanced Logit")
plt.plot(np.linspace(0, 1), np.linspace(0, 1), "--")
plt.show()

[Figure: ROC curve for the balanced logit]

Here we can see that our model performs alright. It isn’t anything special, but it does have some ability to distinguish between the two classes.

# output classification report
print(classification_report(y_train, balanced_pipeline.predict(X_train)))
              precision    recall  f1-score   support

           0       0.90      0.63      0.74       399
           1       0.39      0.77      0.52       124

    accuracy                           0.66       523
   macro avg       0.64      0.70      0.63       523
weighted avg       0.78      0.66      0.69       523

Likewise, our classification report shows that the recall score for our minority class has dramatically gone up, by about .60. This has come at the cost of some of the other numbers, though.

Next we will be looking into some more model validation, and finally hyperparameter tuning to optimize our model some more.

Tuning

Now that we have our data pipeline working how we want it to, we’ll look into tuning our logit classifier by altering some of its parameters, called hyperparameter tuning in the biz. The easiest way to run through a finite set of possible parameters is to use GridSearchCV; if you have a large set of parameters to search through, it is a lot better to use RandomizedSearchCV, which will only evaluate the number of parameter combinations you specify.

# create parameter grid and grid search
param_grid = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": np.linspace(1e-10, 1),
    "logisticregression__fit_intercept": [True, False],

}

gs = GridSearchCV(balanced_pipeline, param_grid, scoring="f1", n_jobs=-1, cv=skfold)
gs.fit(X_train, y_train)
"%.4f" % gs.best_score_
'0.5153'

It looks like our grid search gives us no improvement. However, this is a good example of what not to do if you have a very large parameter space to search through; instead you should check out RandomizedSearchCV, which randomly samples a parameter combination to try in each iteration.
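
For reference, here is a hedged sketch of what a randomized search over the same pipeline might look like; it isn’t run in this post, and the n_iter value and the uniform distribution for C are illustrative choices:

# randomized search samples n_iter parameter combinations instead of trying them all
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": uniform(loc=1e-10, scale=1),  # sample C uniformly from (0, 1]
    "logisticregression__fit_intercept": [True, False],
}

rs = RandomizedSearchCV(
    balanced_pipeline,
    param_distributions,
    n_iter=20,
    scoring="f1",
    cv=skfold,
    n_jobs=-1,
    random_state=SEED,
)
rs.fit(X_train, y_train)
"%.4f" % rs.best_score_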

# perform CV on our data and output f1 results
f1_score = cross_val_score(gs.best_estimator_, X_train, y_train, scoring="f1", cv=skfold, n_jobs=-1).mean()
"%.4f" % f1_score
'0.5153'

# fit our model and output ROC curve, now that our pipeline is balancing our dataset
gs.best_estimator_.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(16, 9))
plot_roc_curve(gs.best_estimator_, X_train, y_train, ax=ax)
plt.title("Precision-Recall Curve Grid Search Balanced Logit")
plt.plot(np.linspace(0, 1), np.linspace(0, 1), "--")
plt.show()

[Figure: ROC curve for the grid-searched balanced logit]

# output classification report
print(classification_report(y_train, gs.best_estimator_.predict(X_train)))
              precision    recall  f1-score   support

           0       0.90      0.63      0.74       399
           1       0.39      0.77      0.52       124

    accuracy                           0.66       523
   macro avg       0.64      0.70      0.63       523
weighted avg       0.78      0.66      0.69       523
# test data predictions
test_pred = gs.best_estimator_.predict(X_test)


# output classification report
print(classification_report(y_test, test_pred))
              precision    recall  f1-score   support

           0       0.90      0.57      0.70       171
           1       0.37      0.80      0.50        54

    accuracy                           0.62       225
   macro avg       0.63      0.68      0.60       225
weighted avg       0.77      0.62      0.65       225
# plot confusion matrix
fig, ax = plt.subplots(figsize=(16, 9))
plot_confusion_matrix(gs.best_estimator_, X_test, y_test, ax=ax)
plt.show()

[Figure: confusion matrix on the test set]

By the looks of it, we’ve optimized a model that performed very poorly from the beginning. However, the lesson still stands, and the process of validation and tuning remains very much the same.

Conclusion

When working on any machine learning problem your data is invaluable; sometimes you have a lot and sometimes you have very little. Regardless of the volume you have, you must always ensure there is no data leakage, which at best leads to false confidence, and at worst leads to being unaware that your models are terrible. Ensuring there is no data leakage is often simple: just split your data into two groups, one for training and another for testing. However, do make sure there is no order dependence, or any other dependence, in your data.
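
To make the leakage point concrete, here is a small illustrative sketch of one common pitfall, fitting a scaler on all of the data before splitting; it reuses the X, y, and SEED from above and isn’t part of the modeling in this post:

# leaky: the scaler's mean/std are computed using rows that end up in the test set
X_scaled_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled_leaky, y, random_state=SEED)

# leak-free: split first, then fit the scaler on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=SEED)
scaler = StandardScaler().fit(X_tr)
X_tr_scaled, X_te_scaled = scaler.transform(X_tr), scaler.transform(X_te)

Keeping the scaler inside a pipeline, as we did above, gives you the leak-free behavior automatically, even inside cross validation.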

After splitting your data, it’s always good to use cross validation; it helps you verify, using only your training data, that you’re going in the right direction. You really only have to use cross validation on your training data, and you should never let your model touch your testing/hold-out data (don’t even let yourself see it) until you are reasonably sure you want to use your final model. The problem that usually arises is that developers will use their testing data to verify whether their model is good, and then optimize their model on the testing data, which isn’t any good. You want your model to generalize well, and that means it can’t see your testing data at all.

Lastly, hyperparameter tuning: not everyone remembers exactly what each parameter does for each function call. That’s why documentation exists; make sure to go and check out the docs for whichever algorithm you are using, and determine how to manipulate its hyperparameters. Typically you can identify some good values for your hyperparameters during EDA, but if you can’t, you can use something like GridSearchCV or RandomizedSearchCV to help you optimize and search through a large parameter space.
