Lately I've been publishing screencasts demonstrating how to use the tidymodels framework, starting from just getting started. Today's screencast explores a more advanced topic: how to tune an XGBoost classification model, using this week's #TidyTuesday
dataset on beach volleyball. 🏐
Here is the code I used in the video, for those who prefer reading instead of or in addition to video.
Explore the data
Our modeling goal is to predict whether a beach volleyball team of two won their match based on game play stats like errors, blocks, attacks, etc. from this week's #TidyTuesday dataset. This dataset is quite extensive, so it's a great opportunity to try a more powerful machine learning algorithm like XGBoost. This model has lots of tuning parameters!
library(tidyverse)

vb_matches <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-19/vb_matches.csv', guess_max = 76000)
vb_matches
## # A tibble: 76,756 x 65
## circuit tournament country year date gender match_num w_player1
## <chr> <chr> <chr> <dbl> <date> <chr> <dbl> <chr>
## 1 AVP Huntingto… United… 2002 2002-05-24 M 1 Kevin Wo…
## 2 AVP Huntingto… United… 2002 2002-05-24 M 2 Brad Tor…
## 3 AVP Huntingto… United… 2002 2002-05-24 M 3 Eduardo …
## 4 AVP Huntingto… United… 2002 2002-05-24 M 4 Brent Do…
## 5 AVP Huntingto… United… 2002 2002-05-24 M 5 Albert H…
## 6 AVP Huntingto… United… 2002 2002-05-24 M 6 Jason Ri…
## 7 AVP Huntingto… United… 2002 2002-05-24 M 7 Aaron Bo…
## 8 AVP Huntingto… United… 2002 2002-05-24 M 8 Canyon C…
## 9 AVP Huntingto… United… 2002 2002-05-24 M 9 Dax Hold…
## 10 AVP Huntingto… United… 2002 2002-05-24 M 10 Mark Wil…
## # … with 76,746 more rows, and 57 more variables: w_p1_birthdate <date>,
## # w_p1_age <dbl>, w_p1_hgt <dbl>, w_p1_country <chr>, w_player2 <chr>,
## # w_p2_birthdate <date>, w_p2_age <dbl>, w_p2_hgt <dbl>, w_p2_country <chr>,
## # w_rank <chr>, l_player1 <chr>, l_p1_birthdate <date>, l_p1_age <dbl>,
## # l_p1_hgt <dbl>, l_p1_country <chr>, l_player2 <chr>, l_p2_birthdate <date>,
## # l_p2_age <dbl>, l_p2_hgt <dbl>, l_p2_country <chr>, l_rank <chr>,
## # score <chr>, duration <time>, bracket <chr>, round <chr>,
## # w_p1_tot_attacks <dbl>, w_p1_tot_kills <dbl>, w_p1_tot_errors <dbl>,
## # w_p1_tot_hitpct <dbl>, w_p1_tot_aces <dbl>, w_p1_tot_serve_errors <dbl>,
## # w_p1_tot_blocks <dbl>, w_p1_tot_digs <dbl>, w_p2_tot_attacks <dbl>,
## # w_p2_tot_kills <dbl>, w_p2_tot_errors <dbl>, w_p2_tot_hitpct <dbl>,
## # w_p2_tot_aces <dbl>, w_p2_tot_serve_errors <dbl>, w_p2_tot_blocks <dbl>,
## # w_p2_tot_digs <dbl>, l_p1_tot_attacks <dbl>, l_p1_tot_kills <dbl>,
## # l_p1_tot_errors <dbl>, l_p1_tot_hitpct <dbl>, l_p1_tot_aces <dbl>,
## # l_p1_tot_serve_errors <dbl>, l_p1_tot_blocks <dbl>, l_p1_tot_digs <dbl>,
## # l_p2_tot_attacks <dbl>, l_p2_tot_kills <dbl>, l_p2_tot_errors <dbl>,
## # l_p2_tot_hitpct <dbl>, l_p2_tot_aces <dbl>, l_p2_tot_serve_errors <dbl>,
## # l_p2_tot_blocks <dbl>, l_p2_tot_digs <dbl>
This dataset has the match stats like serve errors, kills, and so forth divided out by the two players for each team, but we want those combined together because we are going to make a prediction per team (i.e. what makes a team more likely to win). Let's include predictors like gender, circuit, and year in our model along with the per-match statistics. Let's also omit matches with NA values, because we don't have all the statistics measured for every match.
vb_parsed <- vb_matches %>%
transmute(
circuit,
gender,
year,
w_attacks = w_p1_tot_attacks + w_p2_tot_attacks,
w_kills = w_p1_tot_kills + w_p2_tot_kills,
w_errors = w_p1_tot_errors + w_p2_tot_errors,
w_aces = w_p1_tot_aces + w_p2_tot_aces,
w_serve_errors = w_p1_tot_serve_errors + w_p2_tot_serve_errors,
w_blocks = w_p1_tot_blocks + w_p2_tot_blocks,
w_digs = w_p1_tot_digs + w_p2_tot_digs,
l_attacks = l_p1_tot_attacks + l_p2_tot_attacks,
l_kills = l_p1_tot_kills + l_p2_tot_kills,
l_errors = l_p1_tot_errors + l_p2_tot_errors,
l_aces = l_p1_tot_aces + l_p2_tot_aces,
l_serve_errors = l_p1_tot_serve_errors + l_p2_tot_serve_errors,
l_blocks = l_p1_tot_blocks + l_p2_tot_blocks,
l_digs = l_p1_tot_digs + l_p2_tot_digs
) %>%
na.omit()
Still plenty of data! Next, let's create separate dataframes for the winners and losers of each match, and then bind them together. I am using functions like rename_with() from the upcoming dplyr 1.0 release here.
winners <- vb_parsed %>%
select(circuit, gender, year,
w_attacks:w_digs) %>%
rename_with(~ str_remove_all(., "w_"), w_attacks:w_digs) %>%
mutate(win = "win")
losers <- vb_parsed %>%
select(circuit, gender, year,
l_attacks:l_digs) %>%
rename_with(~ str_remove_all(., "l_"), l_attacks:l_digs) %>%
mutate(win = "lose")
vb_df <- bind_rows(winners, losers) %>%
mutate_if(is.character, factor)
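As a quick sanity check (this is an optional sketch, not part of the original analysis), we can count rows by outcome; by construction, every match contributes one winning team and one losing team, so the two classes should be exactly balanced.
vb_df %>%
  count(win) ## expect the same number of "win" and "lose" rows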
This data prep approach is similar to the one Joshua Cook used.
Exploratory data analysis is always important before modeling. Let's make one plot to explore the relationships in this data.
vb_df %>%
pivot_longer(attacks:digs, names_to = "stat", values_to = "value") %>%
ggplot(aes(gender, value, fill = win, color = win)) +
geom_boxplot(alpha = 0.4) +
facet_wrap(~stat, scales = "free_y", nrow = 2) +
labs(y = NULL, color = NULL, fill = NULL)
We can see differences in errors and blocks especially. There are lots more great examples of #TidyTuesday EDA out there to explore on Twitter!
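Before we move on to modeling, if you want to put numbers behind that impression from the plot, a quick summary of the median of each stat by match outcome works well. This is just an optional sketch, not part of the original analysis.
vb_df %>%
  group_by(win) %>%
  summarise(across(attacks:digs, median)) ## median per-team stats for winners vs. losers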
Build a model
We can start by loading the tidymodels metapackage, and splitting our data into training and testing sets.
library(tidymodels)
set.seed(123)
vb_split <- initial_split(vb_df, strata = win)
vb_train <- training(vb_split)
vb_test <- testing(vb_split)
An XGBoost model is based on trees, so we don't need to do much preprocessing for our data; we don't need to worry about factors, centering, or scaling. Let's just go straight to setting up our model specification. Sounds great, right? On the other hand, we are going to tune a lot of model hyperparameters.
xgb_spec <- boost_tree(
trees = 1000,
tree_depth = tune(), min_n = tune(),
loss_reduction = tune(), ## first three: model complexity
sample_size = tune(), mtry = tune(), ## randomness
learn_rate = tune() ## step size
) %>%
set_engine("xgboost") %>%
set_mode("classification")
xgb_spec
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## loss_reduction = tune()
## sample_size = tune()
##
## Computational engine: xgboost
YIKES. 😩 Well, let's set up possible values for these hyperparameters to try. Let's use a space-filling design so we can cover the hyperparameter space as well as possible.
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), vb_train),
learn_rate(),
size = 30
)
xgb_grid
## # A tibble: 30 x 6
## tree_depth min_n loss_reduction sample_size mtry learn_rate
## <int> <int> <dbl> <dbl> <int> <dbl>
## 1 13 9 0.000000191 0.488 6 0.000147
## 2 4 17 0.0000121 0.661 10 0.00000000287
## 3 7 18 0.0000432 0.151 2 0.0713
## 4 12 22 0.00000259 0.298 8 0.0000759
## 5 10 35 16.1 0.676 6 0.00000000111
## 6 4 21 0.673 0.957 7 0.00000000786
## 7 7 25 0.244 0.384 9 0.0000000469
## 8 7 3 8.48 0.775 6 0.000000555
## 9 6 8 0.0000915 0.522 6 0.00000106
## 10 11 37 0.00000109 0.886 9 0.0000000136
## # … with 20 more rows
Notice that we had to treat mtry()
differently because it depends on the actual number of predictors in the data.
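To see what finalize() is doing there, you can print the mtry() parameter object before and after it has seen the training data. This is just a quick illustrative check, not part of the original analysis.
mtry()
## before finalizing, the upper end of the range prints as "?" because it is data-dependent

finalize(mtry(), vb_train)
## after finalizing, the upper bound is filled in based on the columns of vb_train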
Let's put the model specification into a workflow for convenience. Since we don't have any complicated data preprocessing, we can use add_formula() as our data preprocessor.
xgb_wf <- workflow() %>%
add_formula(win ~ .) %>%
add_model(xgb_spec)
xgb_wf
## ── Workflow ───────────────────────────────────────────────────────────
## Preprocessor: Formula
## Model: boost_tree()
##
## ── Preprocessor ───────────────────────────────────────────────────────
## win ~ .
##
## ── Model ──────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## loss_reduction = tune()
## sample_size = tune()
##
## Computational engine: xgboost
Next, letβs create cross-validation resamples for tuning our model.
set.seed(123)
vb_folds <- vfold_cv(vb_train, strata = win)
vb_folds
## # 10-fold cross-validation using stratification
## # A tibble: 10 x 2
## splits id
## <named list> <chr>
## 1 <split [19.3K/2.1K]> Fold01
## 2 <split [19.3K/2.1K]> Fold02
## 3 <split [19.3K/2.1K]> Fold03
## 4 <split [19.3K/2.1K]> Fold04
## 5 <split [19.3K/2.1K]> Fold05
## 6 <split [19.3K/2.1K]> Fold06
## 7 <split [19.3K/2.1K]> Fold07
## 8 <split [19.3K/2.1K]> Fold08
## 9 <split [19.3K/2.1K]> Fold09
## 10 <split [19.4K/2.1K]> Fold10
IT'S TIME TO TUNE. We use tune_grid() with our tuneable workflow, our resamples, and our grid of parameters to try. Let's use control_grid(save_pred = TRUE) so we can explore the predictions afterwards.
doParallel::registerDoParallel()
set.seed(234)
xgb_res <- tune_grid(
xgb_wf,
resamples = vb_folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
xgb_res
## # 10-fold cross-validation using stratification
## # A tibble: 10 x 5
## splits id .metrics .notes .predictions
## <named list> <chr> <list> <list> <list>
## 1 <split [19.3K/2.1… Fold01 <tibble [60 × … <tibble [0 × … <tibble [64,500 × 1…
## 2 <split [19.3K/2.1… Fold02 <tibble [60 × … <tibble [0 × … <tibble [64,500 × 1…
## 3 <split [19.3K/2.1… Fold03 <tibble [60 × … <tibble [0 × … <tibble [64,500 × 1…
## 4 <split [19.3K/2.1… Fold04 <tibble [60 × … <tibble [0 × … <tibble [64,500 × 1…
## 5 <split [19.3K/2.1… Fold05 <tibble [60 × … <tibble [0 × … <tibble [64,500 × 1…
## 6 <split [19.3K/2.1… Fold06 <tibble [60 × … <tibble [0 × … <tibble [64,500 × 1…
## 7 <split [19.3K/2.1… Fold07 <tibble [60 × … <tibble [0 × … <tibble [64,500 × 1…
## 8 <split [19.3K/2.1… Fold08 <tibble [60 × … <tibble [0 × … <tibble [64,500 × 1…
## 9 <split [19.3K/2.1… Fold09 <tibble [60 × … <tibble [0 × … <tibble [64,500 × 1…
## 10 <split [19.4K/2.1… Fold10 <tibble [60 × … <tibble [0 × … <tibble [64,440 × 1…
This takes a while to finish on my computer (and makes my fans run!) but we did it. 💪
Explore results
We can explore the metrics for all these models.
collect_metrics(xgb_res)
## # A tibble: 60 x 11
## mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
## <int> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 1 23 9 1.64e-3 0.00000854 0.117 accura…
## 2 1 23 9 1.64e-3 0.00000854 0.117 roc_auc
## 3 2 18 7 7.13e-2 0.0000432 0.151 accura…
## 4 2 18 7 7.13e-2 0.0000432 0.151 roc_auc
## 5 2 32 3 1.30e-7 1.16 0.497 accura…
## 6 2 32 3 1.30e-7 1.16 0.497 roc_auc
## 7 2 40 10 3.31e-4 0.0000000486 0.429 accura…
## 8 2 40 10 3.31e-4 0.0000000486 0.429 roc_auc
## 9 3 5 14 3.56e-3 0.122 0.701 accura…
## 10 3 5 14 3.56e-3 0.122 0.701 roc_auc
## # … with 50 more rows, and 4 more variables: .estimator <chr>, mean <dbl>,
## # n <int>, std_err <dbl>
We can also use visualization to understand our results.
xgb_res %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
select(mean, mtry:sample_size) %>%
pivot_longer(mtry:sample_size,
values_to = "value",
names_to = "parameter"
) %>%
ggplot(aes(value, mean, color = parameter)) +
geom_point(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
labs(x = NULL, y = "AUC")
Remember that we used a space-filling design for the parameters to try. It looks like higher values for tree depth were better, but other than that, the main thing I take away from this plot is that there are several combinations of parameters that perform well.
What are the best performing sets of parameters?
show_best(xgb_res, "roc_auc")
## # A tibble: 5 x 11
## mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
## <int> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 5 25 4 0.0195 0.00112 0.977 roc_auc
## 2 5 33 14 0.0332 0.0000000159 0.864 roc_auc
## 3 2 18 7 0.0713 0.0000432 0.151 roc_auc
## 4 6 9 13 0.000147 0.000000191 0.488 roc_auc
## 5 8 11 14 0.0000135 0.000570 0.453 roc_auc
## # … with 4 more variables: .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>
Even though we had a lot of hyperparameters, we were able to get good performance with several different combinations. Let's choose the best one.
best_auc <- select_best(xgb_res, "roc_auc")
best_auc
## # A tibble: 1 x 6
## mtry min_n tree_depth learn_rate loss_reduction sample_size
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 5 25 4 0.0195 0.00112 0.977
Now let's finalize our tuneable workflow with these parameter values.
final_xgb <- finalize_workflow(
xgb_wf,
best_auc
)
final_xgb
## ── Workflow ───────────────────────────────────────────────────────────
## Preprocessor: Formula
## Model: boost_tree()
##
## ── Preprocessor ───────────────────────────────────────────────────────
## win ~ .
##
## ── Model ──────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## mtry = 5
## trees = 1000
## min_n = 25
## tree_depth = 4
## learn_rate = 0.019501844932014
## loss_reduction = 0.00112048286512169
## sample_size = 0.977300804650877
##
## Computational engine: xgboost
Instead of tune()
placeholders, we now have real values for all the model hyperparameters.
Which predictors are most important in this model? We can use the vip package to explore variable importance.
library(vip)
final_xgb %>%
fit(data = vb_train) %>%
pull_workflow_fit() %>%
vip(geom = "point")
The predictors that are most important in a team winning vs. losing their match are the number of kills, errors, and attacks. There is almost no difference between the two circuits, and very little difference by gender.
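If you prefer the importance scores as a table rather than a plot, you can pull the underlying booster out of the fitted workflow and use xgboost's own importance function. This is an optional sketch beyond the original analysis; xgb_fit is a new name introduced here, and this refits the finalized workflow on the training data just like the vip() example above.
xgb_fit <- final_xgb %>%
  fit(data = vb_train) %>%
  pull_workflow_fit()

## gain-based importance from the raw xgboost booster stored in the parsnip fit
xgboost::xgb.importance(model = xgb_fit$fit)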
It's time to go back to the testing set! Let's use last_fit() to fit our model one final time on the training data and evaluate it on the testing data. Notice that this is the first time we have used the testing data during this whole modeling analysis.
final_res <- last_fit(final_xgb, vb_split)
collect_metrics(final_res)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.843
## 2 roc_auc binary 0.928
Our results here indicate that we did not overfit during the tuning process. We can also create a ROC curve for the testing set.
final_res %>%
collect_predictions() %>%
roc_curve(win, .pred_win) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_line(size = 1.5, color = "midnightblue") +
geom_abline(
lty = 2, alpha = 0.5,
color = "gray50",
size = 1.2
)
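To dig a little further into the test set results, the same collected predictions can also feed a confusion matrix via yardstick's conf_mat(). This is an optional sketch beyond what is in the video.
final_res %>%
  collect_predictions() %>%
  conf_mat(win, .pred_class) ## cross-tabulate true vs. predicted match outcomes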