Pitting My Neural Network Against a Corporate Benchmark
Introduction
In case you missed it, I built a neural network to predict loan risk using a public dataset from LendingClub. Then I built a public API to serve the model's predictions. That's nice and all, but… how good is my model?
Today I'm going to put it to the test, pitting it against the risk models of the very institution that issued those loans. That's right, LendingClub included their own calculated loan grades (and sub-grades) in the dataset, so all the pieces are in place for the most thrilling risk modeling smackdown of the ~~century~~ week. May the best algorithm win!
import joblib
prev_notebook_folder = "../input/building-a-neural-network-to-predict-loan-risk/"
loans = joblib.load(prev_notebook_folder + "loans_for_eval.joblib")
loans.shape
(1110171, 70)
loans.head()
|  | loan_amnt | term | emp_length | home_ownership | annual_inc | purpose | dti | delinq_2yrs | cr_hist_age_mths | fico_range_low | ... | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit | fraction_recovered | issue_d | grade | sub_grade | expected_return |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3600.0 | 36 months | 10+ years | MORTGAGE | 55000.0 | debt_consolidation | 5.91 | 0.0 | 148.0 | 675.0 | ... | 0.0 | 178050.0 | 7746.0 | 2400.0 | 13734.0 | 1.0 | Dec-2015 | C | C4 | 4429.08 |
| 1 | 24700.0 | 36 months | 10+ years | MORTGAGE | 65000.0 | small_business | 16.06 | 1.0 | 192.0 | 715.0 | ... | 0.0 | 314017.0 | 39475.0 | 79300.0 | 24667.0 | 1.0 | Dec-2015 | C | C1 | 29530.08 |
| 2 | 20000.0 | 60 months | 10+ years | MORTGAGE | 63000.0 | home_improvement | 10.78 | 0.0 | 184.0 | 695.0 | ... | 0.0 | 218418.0 | 18696.0 | 6200.0 | 14877.0 | 1.0 | Dec-2015 | B | B4 | 25959.60 |
| 4 | 10400.0 | 60 months | 3 years | MORTGAGE | 104433.0 | major_purchase | 25.37 | 1.0 | 210.0 | 695.0 | ... | 0.0 | 439570.0 | 95768.0 | 20300.0 | 88097.0 | 1.0 | Dec-2015 | F | F1 | 17394.60 |
| 5 | 11950.0 | 36 months | 4 years | RENT | 34000.0 | debt_consolidation | 10.20 | 0.0 | 338.0 | 690.0 | ... | 0.0 | 16900.0 | 12798.0 | 9400.0 | 4000.0 | 1.0 | Dec-2015 | C | C3 | 14586.48 |
5 rows × 70 columns
This post was adapted from a Jupyter Notebook, by the way, so if you'd like to follow along in your own notebook, go ahead and fork mine on Kaggle or GitHub!
Ground rules
This is going to be a clean fight—my model won't use any data LendingClub wouldn't have access to at the point they calculate a loan's grade (including the grade itself).
I'm going to sort the dataset chronologically (using the `issue_d` column, the month and year the loan was issued) and split it into two parts. The first 80% I'll use for training my competition model, and I'll compare performance on the last 20%.
from sklearn.model_selection import train_test_split
loans["date"] = loans["issue_d"].astype("datetime64[ns]")
loans.sort_values("date", axis="index", inplace=True, kind="mergesort")
train, test = train_test_split(loans, test_size=0.2, shuffle=False)
train, test = train.copy(), test.copy()
print(f"The test set contains {len(test):,} loans.")
The test set contains 222,035 loans.
At the earlier end of the test set my model may have a slight informational advantage, since it was trained on a few loans that hadn't yet closed when LendingClub graded the loans in that part of the test set. On the other hand, LendingClub may have a slight informational advantage at the later end of the test set, since by then they would have known the outcomes of some loans from the earlier end of the test set.
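(If you want to see exactly where that boundary falls in your own copy of the notebook, a quick sanity check like the sketch below would do it; this wasn't part of my original analysis.)

```python
# Sanity check (sketch, not in the original notebook):
# where does the chronological train/test boundary fall?
print(f"Training set runs from {train['date'].min():%b %Y} to {train['date'].max():%b %Y}.")
print(f"Test set runs from {test['date'].min():%b %Y} to {test['date'].max():%b %Y}.")
```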
I have to give credit to Michael Wurm, by the way, for the idea of comparing my model's performance to LendingClub's loan grades, but my approach is pretty different. I'm not trying to simulate the performance of an investment portfolio; I'm just evaluating how well my predictions of simple risk compare.
Test metric
The test: who can pick the best set of grade A loans, judged on the basis of the target variable from my last notebook, the fraction of an expected loan return that a prospective borrower will pay back (which I engineered as `fraction_recovered`).
LendingClub will take the plate first. I'll gather all their grade A loans from the test set, count them, and calculate their average `fraction_recovered`. That average will be the metric my model has to beat.
Then I'll train my model on the training set using the same pipeline and parameters I settled on in my last notebook. Once it's trained, I'll use it to make predictions on the test set, then gather a number of top predictions equal to the number of LendingClub's grade A loans. Finally, I'll calculate the average `fraction_recovered` on that subset, and we'll have ourselves a winner!
LendingClub's turn
from statistics import mean
lc_grade_a = test[test["grade"] == "A"]
print(f"LendingClub gave {len(lc_grade_a):,} loans in the test set an A grade.")
print("\nAverage `fraction_recovered` on LendingClub's grade A loans:")
print(round(mean(lc_grade_a["fraction_recovered"]), 5))
LendingClub gave 38,779 loans in the test set an A grade.
Average `fraction_recovered` on LendingClub's grade A loans:
0.96021
That's a pretty high percentage. I'm a bit nervous.
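For a little context (not part of the official test), a one-liner like this sketch would show how that average varies across all of LendingClub's grades in the test set:

```python
# Context only (sketch): average observed recovery by LendingClub grade
print(test.groupby("grade")["fraction_recovered"].agg(["count", "mean"]).round(5))
```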
My turn
First, I'll copy over my `run_pipeline` function from my previous notebook:
from sklearn.model_selection import train_test_split
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense, Dropout
def run_pipeline(
    data,
    onehot_cols,
    ordinal_cols,
    batch_size,
    validate=True,
):
    # Separate the features from the target, and optionally hold out a validation set
    X = data.drop(columns=["fraction_recovered"])
    y = data["fraction_recovered"]
    X_train, X_valid, y_train, y_valid = (
        train_test_split(X, y, test_size=0.2, random_state=0)
        if validate
        else (X, None, y, None)
    )

    # One-hot encode the categoricals, ordinal-encode the ordered categoricals,
    # and standardize everything else
    transformer = DataFrameMapper(
        [
            (onehot_cols, OneHotEncoder(drop="if_binary")),
            (
                list(ordinal_cols.keys()),
                OrdinalEncoder(categories=list(ordinal_cols.values())),
            ),
        ],
        default=StandardScaler(),
    )
    X_train = transformer.fit_transform(X_train)
    X_valid = transformer.transform(X_valid) if validate else None

    # Build the network
    input_nodes = X_train.shape[1]
    output_nodes = 1
    model = Sequential()
    model.add(Input((input_nodes,)))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.3, seed=0))
    model.add(Dense(32, activation="relu"))
    model.add(Dropout(0.3, seed=1))
    model.add(Dense(16, activation="relu"))
    model.add(Dropout(0.3, seed=2))
    model.add(Dense(output_nodes))
    model.compile(optimizer="adam", loss="mean_squared_logarithmic_error")

    # Train
    history = model.fit(
        X_train,
        y_train,
        batch_size=batch_size,
        epochs=100,
        validation_data=(X_valid, y_valid) if validate else None,
        verbose=2,
    )

    return history.history, model, transformer
onehot_cols = ["term", "application_type", "home_ownership", "purpose"]
ordinal_cols = {
    "emp_length": [
        "< 1 year",
        "1 year",
        "2 years",
        "3 years",
        "4 years",
        "5 years",
        "6 years",
        "7 years",
        "8 years",
        "9 years",
        "10+ years",
    ]
}
Now for the moment of truth:
# Train the model
_, model, transformer = run_pipeline(
    train.drop(columns=["issue_d", "date", "grade", "sub_grade", "expected_return"]),
    onehot_cols,
    ordinal_cols,
    batch_size=128,
    validate=False,
)

# Make predictions
X_test = transformer.transform(
    test.drop(
        columns=[
            "fraction_recovered",
            "issue_d",
            "date",
            "grade",
            "sub_grade",
            "expected_return",
        ]
    )
)
test["model_predictions"] = model.predict(X_test)

# Gather top predictions
test_sorted = test.sort_values("model_predictions", axis="index", ascending=False)
ty_grade_a = test_sorted.iloc[0:len(lc_grade_a)]

# Display results
print("\nAverage `fraction_recovered` on Ty's grade A loans:")
print(format(mean(ty_grade_a["fraction_recovered"]), ".5f"))
Epoch 1/100
6939/6939 - 13s - loss: 0.0249
Epoch 2/100
6939/6939 - 13s - loss: 0.0204
Epoch 3/100
6939/6939 - 13s - loss: 0.0202
Epoch 4/100
6939/6939 - 13s - loss: 0.0202
Epoch 5/100
6939/6939 - 13s - loss: 0.0202
Epoch 6/100
6939/6939 - 14s - loss: 0.0201
Epoch 7/100
6939/6939 - 14s - loss: 0.0201
Epoch 8/100
6939/6939 - 14s - loss: 0.0201
Epoch 9/100
6939/6939 - 13s - loss: 0.0201
Epoch 10/100
6939/6939 - 12s - loss: 0.0201
Epoch 11/100
6939/6939 - 13s - loss: 0.0201
Epoch 12/100
6939/6939 - 13s - loss: 0.0201
Epoch 13/100
6939/6939 - 13s - loss: 0.0201
Epoch 14/100
6939/6939 - 13s - loss: 0.0201
Epoch 15/100
6939/6939 - 12s - loss: 0.0201
Epoch 16/100
6939/6939 - 12s - loss: 0.0201
Epoch 17/100
6939/6939 - 13s - loss: 0.0200
Epoch 18/100
6939/6939 - 13s - loss: 0.0200
Epoch 19/100
6939/6939 - 13s - loss: 0.0200
Epoch 20/100
6939/6939 - 14s - loss: 0.0200
Epoch 21/100
6939/6939 - 13s - loss: 0.0200
Epoch 22/100
6939/6939 - 13s - loss: 0.0200
Epoch 23/100
6939/6939 - 12s - loss: 0.0200
Epoch 24/100
6939/6939 - 12s - loss: 0.0200
Epoch 25/100
6939/6939 - 12s - loss: 0.0200
Epoch 26/100
6939/6939 - 13s - loss: 0.0200
Epoch 27/100
6939/6939 - 13s - loss: 0.0200
Epoch 28/100
6939/6939 - 13s - loss: 0.0200
Epoch 29/100
6939/6939 - 13s - loss: 0.0200
Epoch 30/100
6939/6939 - 13s - loss: 0.0200
Epoch 31/100
6939/6939 - 15s - loss: 0.0200
Epoch 32/100
6939/6939 - 13s - loss: 0.0200
Epoch 33/100
6939/6939 - 12s - loss: 0.0200
Epoch 34/100
6939/6939 - 13s - loss: 0.0200
Epoch 35/100
6939/6939 - 13s - loss: 0.0200
Epoch 36/100
6939/6939 - 13s - loss: 0.0200
Epoch 37/100
6939/6939 - 13s - loss: 0.0200
Epoch 38/100
6939/6939 - 13s - loss: 0.0200
Epoch 39/100
6939/6939 - 13s - loss: 0.0200
Epoch 40/100
6939/6939 - 13s - loss: 0.0200
Epoch 41/100
6939/6939 - 13s - loss: 0.0200
Epoch 42/100
6939/6939 - 13s - loss: 0.0200
Epoch 43/100
6939/6939 - 14s - loss: 0.0200
Epoch 44/100
6939/6939 - 13s - loss: 0.0200
Epoch 45/100
6939/6939 - 13s - loss: 0.0200
Epoch 46/100
6939/6939 - 13s - loss: 0.0200
Epoch 47/100
6939/6939 - 13s - loss: 0.0200
Epoch 48/100
6939/6939 - 13s - loss: 0.0200
Epoch 49/100
6939/6939 - 13s - loss: 0.0200
Epoch 50/100
6939/6939 - 13s - loss: 0.0200
Epoch 51/100
6939/6939 - 13s - loss: 0.0200
Epoch 52/100
6939/6939 - 13s - loss: 0.0200
Epoch 53/100
6939/6939 - 13s - loss: 0.0200
Epoch 54/100
6939/6939 - 14s - loss: 0.0200
Epoch 55/100
6939/6939 - 14s - loss: 0.0200
Epoch 56/100
6939/6939 - 13s - loss: 0.0200
Epoch 57/100
6939/6939 - 13s - loss: 0.0200
Epoch 58/100
6939/6939 - 13s - loss: 0.0200
Epoch 59/100
6939/6939 - 13s - loss: 0.0200
Epoch 60/100
6939/6939 - 13s - loss: 0.0200
Epoch 61/100
6939/6939 - 13s - loss: 0.0200
Epoch 62/100
6939/6939 - 13s - loss: 0.0200
Epoch 63/100
6939/6939 - 13s - loss: 0.0200
Epoch 64/100
6939/6939 - 13s - loss: 0.0200
Epoch 65/100
6939/6939 - 12s - loss: 0.0200
Epoch 66/100
6939/6939 - 13s - loss: 0.0200
Epoch 67/100
6939/6939 - 14s - loss: 0.0200
Epoch 68/100
6939/6939 - 13s - loss: 0.0200
Epoch 69/100
6939/6939 - 13s - loss: 0.0200
Epoch 70/100
6939/6939 - 13s - loss: 0.0200
Epoch 71/100
6939/6939 - 13s - loss: 0.0200
Epoch 72/100
6939/6939 - 13s - loss: 0.0200
Epoch 73/100
6939/6939 - 13s - loss: 0.0200
Epoch 74/100
6939/6939 - 13s - loss: 0.0200
Epoch 75/100
6939/6939 - 13s - loss: 0.0200
Epoch 76/100
6939/6939 - 13s - loss: 0.0200
Epoch 77/100
6939/6939 - 13s - loss: 0.0200
Epoch 78/100
6939/6939 - 13s - loss: 0.0200
Epoch 79/100
6939/6939 - 14s - loss: 0.0200
Epoch 80/100
6939/6939 - 13s - loss: 0.0200
Epoch 81/100
6939/6939 - 13s - loss: 0.0200
Epoch 82/100
6939/6939 - 13s - loss: 0.0200
Epoch 83/100
6939/6939 - 13s - loss: 0.0200
Epoch 84/100
6939/6939 - 12s - loss: 0.0200
Epoch 85/100
6939/6939 - 13s - loss: 0.0200
Epoch 86/100
6939/6939 - 13s - loss: 0.0200
Epoch 87/100
6939/6939 - 13s - loss: 0.0200
Epoch 88/100
6939/6939 - 13s - loss: 0.0200
Epoch 89/100
6939/6939 - 13s - loss: 0.0200
Epoch 90/100
6939/6939 - 13s - loss: 0.0200
Epoch 91/100
6939/6939 - 14s - loss: 0.0200
Epoch 92/100
6939/6939 - 13s - loss: 0.0200
Epoch 93/100
6939/6939 - 13s - loss: 0.0200
Epoch 94/100
6939/6939 - 13s - loss: 0.0200
Epoch 95/100
6939/6939 - 13s - loss: 0.0200
Epoch 96/100
6939/6939 - 13s - loss: 0.0200
Epoch 97/100
6939/6939 - 13s - loss: 0.0200
Epoch 98/100
6939/6939 - 13s - loss: 0.0200
Epoch 99/100
6939/6939 - 13s - loss: 0.0200
Epoch 100/100
6939/6939 - 13s - loss: 0.0200
Average `fraction_recovered` on Ty's grade A loans:
0.96166
Victory!
Phew, that was a close one! My win might be too small to be statistically significant, but hey, it's cool seeing that I can keep up with LendingClub's best and brightest.
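If I wanted to put a rough number on that, one option (not something I did in the notebook) would be a Welch's t-test on the two grade A selections' `fraction_recovered` values, along these lines:

```python
# Sketch: rough check of whether the gap in means is statistically significant.
# Caveat: the two selections overlap heavily, so the independence assumption
# behind the t-test is shaky; a paired or bootstrap comparison would be sounder.
from scipy.stats import ttest_ind

t_stat, p_value = ttest_ind(
    ty_grade_a["fraction_recovered"],
    lc_grade_a["fraction_recovered"],
    equal_var=False,  # Welch's t-test
)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```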
What I'd really like to know now is what quantitative range of estimated risk each LendingClub grade and sub-grade corresponds to, but it looks like that's proprietary. Does anyone know if loan grades generally correspond to certain percentage ranges, like letter grades in academic classes? If not, have any ideas for better benchmarks I could use to evaluate my model's performance? Go ahead and chime in in the discussion below.
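In the meantime, one rough way to explore that mapping with the data I already have (purely a sketch, not a definitive answer) is to look at how my model's predicted `fraction_recovered` is distributed within each LendingClub grade:

```python
# Exploratory sketch: spread of my model's predictions within each LendingClub grade
print(
    test.groupby("grade")["model_predictions"]
    .describe()[["count", "mean", "min", "max"]]
    .round(4)
)
```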