João Afonso Poester-Carvalho - Predicting number of points of NBA players in the regular season

Hello! This time I will try to develop a regression model with Tidymodels, a framework I’ve been studying recently. We will analyse NBA data from Kaggle. I developed much of this code by adapting blog posts from Julia Silge, which have helped me understand the first steps of tidy modelling.

library(tidyverse)
library(tidymodels)

## NBA ----

## Predict the number of points of a player in the regular season 


nba <- read_csv("nba.csv")

nba_f <- nba %>%
  filter(Season_type == "Regular%20Season") %>%
  select(
    PTS, year, PLAYER_ID, TEAM_ID, GP, MIN, FG_PCT, FG3_PCT, FT_PCT, OREB, DREB, AST, STL, BLK, TOV, PF
  ) %>%
  mutate(
    PLAYER_ID = as.character(PLAYER_ID),
    TEAM_ID = as.character(TEAM_ID),
    year = as.numeric(substr(year, 1, 4))
  ) 

glimpse(nba_f)

Rows: 6,259
Columns: 16
$ PTS       <dbl> 2280, 2133, 2036, 2023, 1920, 1903, 1786, 1577, 1562, 1560, …
$ year      <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, …
$ PLAYER_ID <chr> "201142", "977", "2544", "201935", "2546", "201566", "201939…
$ TEAM_ID   <chr> "1610612760", "1610612747", "1610612748", "1610612745", "161…
$ GP        <dbl> 81, 78, 76, 78, 67, 82, 78, 82, 82, 74, 82, 78, 69, 79, 82, …
$ MIN       <dbl> 3119, 3013, 2877, 2985, 2482, 2861, 2983, 3076, 3167, 2790, …
$ FG_PCT    <dbl> 0.510, 0.463, 0.565, 0.438, 0.449, 0.438, 0.451, 0.416, 0.42…
$ FG3_PCT   <dbl> 0.416, 0.324, 0.406, 0.368, 0.379, 0.323, 0.453, 0.287, 0.36…
$ FT_PCT    <dbl> 0.905, 0.839, 0.753, 0.851, 0.830, 0.800, 0.900, 0.773, 0.84…
$ OREB      <dbl> 46, 66, 97, 62, 134, 111, 59, 45, 42, 175, 48, 29, 86, 218, …
$ DREB      <dbl> 594, 367, 513, 317, 326, 317, 255, 271, 215, 495, 272, 203, …
$ AST       <dbl> 374, 469, 551, 455, 171, 607, 539, 496, 531, 192, 204, 604, …
$ STL       <dbl> 116, 106, 129, 142, 52, 145, 126, 169, 74, 62, 76, 75, 128, …
$ BLK       <dbl> 105, 25, 67, 38, 32, 24, 12, 36, 19, 91, 24, 30, 56, 22, 31,…
$ TOV       <dbl> 280, 287, 226, 295, 175, 273, 240, 254, 243, 143, 151, 218, …
$ PF        <dbl> 143, 173, 110, 178, 205, 189, 198, 164, 172, 187, 173, 194, …

First step: use tidymodels functions to split the data into training and test sets and to create a v-fold cross-validation object from the training set.

## Data Split ----

set.seed(502)
nba_split <- initial_split(nba_f, prop = 0.80, strata = PTS)
nba_train <- training(nba_split)
nba_test  <-  testing(nba_split)

nba_folds <- nba_train %>%
  vfold_cv(v = 5, repeats = 1, strata = PTS)
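
To make the idea behind strata = PTS more concrete, here is a minimal base-R sketch of a stratified 80/20 split. It assumes the numeric outcome is binned into quartiles before sampling, which is my understanding of the rsample default for numeric strata. The data frame toy and every object name in it are made up for illustration; this is not the code rsample runs internally.

```r
# Base-R sketch of a stratified 80/20 split on a numeric outcome.
# `toy` is a hypothetical stand-in for nba_f; its values are simulated.
set.seed(502)
toy <- data.frame(PTS = round(rexp(1000, rate = 1 / 500)))

# Bin the outcome into quartiles (assumed analogue of strata = PTS)
breaks <- quantile(toy$PTS, probs = seq(0, 1, by = 0.25))
toy$bin <- cut(toy$PTS, breaks = unique(breaks), include.lowest = TRUE)

# Sample 80% of the row indices within each bin
rows_by_bin <- split(seq_len(nrow(toy)), toy$bin)
train_idx <- unlist(lapply(rows_by_bin, function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The share of each bin should be similar in both sets
round(prop.table(table(toy_train$bin)), 2)
round(prop.table(table(toy_test$bin)), 2)
```

Sampling within bins keeps low, medium and high scorers represented in similar proportions on both sides of the split, which matters here because PTS is strongly right-skewed.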

Next, we build a recipe, beginning with the formula. The variable PLAYER_ID is given the role “id”, so it stays in the data but is not used as a predictor. Finally, we normalize the numeric predictors and encode the categorical predictors as dummy variables.

## Model formula ----
form <- PTS ~ .

## Model recipe ----

mod_recipe <- recipe(formula = form, data = nba_train) %>%
  update_role(PLAYER_ID, new_role = "id") %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

mod_recipe_prep <- prep(mod_recipe, retain = TRUE)
mod_recipe_prep

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 14
id:         1

── Training information

Training data contained 5006 data points and no incomplete rows.

── Operations

• Centering and scaling for: year, GP, MIN, FG_PCT, FG3_PCT, ... | Trained

• Dummy variables from: TEAM_ID | Trained
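
As a rough illustration of what these two steps compute, the base-R sketch below centres and scales one numeric column and builds treatment-coded dummy variables for one categorical column. The data frame toy and its team labels are hypothetical. Note also that the recipe learns the means and standard deviations from the training set and reuses them on new data, whereas this sketch works on a single table.

```r
# Hypothetical toy data; column names mirror the post, team labels are made up
toy <- data.frame(
  MIN     = c(3119, 3013, 2877, 2985),
  TEAM_ID = c("A", "B", "C", "A"),
  stringsAsFactors = TRUE
)

# Centre and scale: subtract the mean, divide by the standard deviation
min_scaled <- (toy$MIN - mean(toy$MIN)) / sd(toy$MIN)

# Treatment-coded dummies: the first level ("A") becomes the reference,
# so 3 levels yield 2 indicator columns (the analogue of one_hot = FALSE)
dummies <- model.matrix(~ TEAM_ID, data = toy)[, -1, drop = FALSE]

cbind(MIN = min_scaled, dummies)
```

Dropping the reference level is why a categorical variable with k levels produces k - 1 indicator columns rather than k.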

Let’s look at how the data is transformed when the recipe is applied:

mod_recipe_prep %>% bake(new_data = NULL) %>% glimpse()

Rows: 5,006
Columns: 44
$ year                <dbl> -1.671321, -1.671321, -1.671321, -1.671321, -1.671…
$ PLAYER_ID           <fct> 101179, 203093, 2562, 2554, 203104, 2248, 1894, 20…
$ GP                  <dbl> -0.39049839, -1.18655440, -0.39049839, -1.18655440…
$ MIN                 <dbl> -0.7267063, -1.0084401, -0.7050344, -0.8760011, -0…
$ FG_PCT              <dbl> -1.0738170, 0.1319478, -0.8180487, -0.8728562, -0.…
$ FG3_PCT             <dbl> -0.15512765, -0.11022703, 0.12710481, 0.21049167, …
$ FT_PCT              <dbl> 0.02876769, 0.29362197, -2.48502473, -0.16174154, …
$ OREB                <dbl> -0.632882005, -0.362763041, -0.740929591, -0.77694…
$ DREB                <dbl> -0.8697689, -0.8271258, -0.7418397, -0.8697689, -0…
$ AST                 <dbl> -0.2847410, -0.7418568, -0.6078746, -0.7891447, -0…
$ STL                 <dbl> -0.2947645, -0.7415845, -0.3585959, -0.7415845, -0…
$ BLK                 <dbl> -0.6878534, -0.1227314, -0.6172132, -0.5818930, -0…
$ TOV                 <dbl> -0.4810592, -0.8367129, -0.8197770, -0.7859052, -0…
$ PF                  <dbl> -0.47434503, -1.07983655, -0.50461961, -0.89818910…
$ PTS                 <dbl> 105, 104, 103, 100, 99, 96, 95, 95, 93, 91, 88, 87…
$ TEAM_ID_X1610612738 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612739 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612740 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,…
$ TEAM_ID_X1610612741 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612742 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612743 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612744 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612745 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612746 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612747 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612748 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612749 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612750 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612751 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
$ TEAM_ID_X1610612752 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612753 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612754 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612755 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612756 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ TEAM_ID_X1610612757 <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612758 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612759 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612760 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612761 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612762 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612763 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612764 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_ID_X1610612765 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ TEAM_ID_X1610612766 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Then, we define the LASSO regression model with the “glmnet” engine and add it to a workflow object, along with the recipe. We also create a tuning grid for “penalty”, the parameter we want to tune when we fit the first model.
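
For reference, the objective that glmnet minimises for a Gaussian response, as I understand it from the package documentation, can be written as below. The symbols are notation introduced here and do not appear in the code: n observations, predictors x_i, intercept β_0, coefficients β, penalty λ and mixture α.

$$
\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2+\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\right]
$$

With mixture = 1 (α = 1) the ridge term vanishes and only the L1 penalty remains, which is the LASSO. The argument penalty corresponds to λ.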

## Define model ----

reg_model <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

## Grid ----
lambda_grid <- grid_regular(penalty(), levels = 50)

## Start workflow ----

reg_wf <- 
  workflow() %>%
  add_model(reg_model)  %>%
  add_recipe(mod_recipe)
reg_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 1

Computational engine: glmnet
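
To see what kind of values the tuning grid contains, the sketch below rebuilds a comparable grid in base R. It assumes the default range of penalty() is 1e-10 to 1 on a log10 scale, so that a regular grid with 50 levels is evenly spaced in log10(penalty). The name lambda_sketch is hypothetical, and inspecting lambda_grid directly is the authoritative check.

```r
# 50 values evenly spaced on the log10 scale between 1e-10 and 1
# (assumed to match the default range of penalty())
lambda_sketch <- 10^seq(-10, 0, length.out = 50)

length(lambda_sketch)
range(lambda_sketch)

# Constant step in log10 units between neighbouring grid points
unique(round(diff(log10(lambda_sketch)), 10))
```

A log-spaced grid is a common choice for regularization strengths because useful values can span many orders of magnitude.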

Now, we fit the model with the k-fold cross-validation scheme defined earlier. The tune_grid() function receives the workflow, the resampling scheme and the grid for the lambda (penalty) parameter.

## Fit with Tune Grid ----
lasso_grid <- tune_grid(
  reg_wf, 
  resamples = nba_folds,
  grid = lambda_grid,  
  metrics = metric_set(rmse, mae, rsq)
)
lasso_grid

# Tuning results
# 5-fold cross-validation using stratification 
# A tibble: 5 × 4
  splits              id    .metrics           .notes          
  <list>              <chr> <list>             <list>          
1 <split [4003/1003]> Fold1 <tibble [150 × 5]> <tibble [0 × 3]>
2 <split [4004/1002]> Fold2 <tibble [150 × 5]> <tibble [0 × 3]>
3 <split [4005/1001]> Fold3 <tibble [150 × 5]> <tibble [0 × 3]>
4 <split [4006/1000]> Fold4 <tibble [150 × 5]> <tibble [0 × 3]>
5 <split [4006/1000]> Fold5 <tibble [150 × 5]> <tibble [0 × 3]>
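
Each .metrics tibble has 150 rows because 50 penalty values are each scored on 3 metrics. As a small sketch of what those metrics measure, the base-R code below computes them by hand on made-up numbers. It assumes rsq follows the yardstick definition, the squared correlation between observed and predicted values. The vectors truth and estimate are hypothetical.

```r
# Hypothetical observed and predicted point totals
truth    <- c(105, 250, 980, 1560, 2280)
estimate <- c(130, 220, 1010, 1500, 2200)

# Root mean squared error: penalises large misses more heavily
rmse_val <- sqrt(mean((truth - estimate)^2))

# Mean absolute error: average size of the miss, in points
mae_val <- mean(abs(truth - estimate))

# R squared as the squared correlation of truth and estimate
rsq_val <- cor(truth, estimate)^2

c(rmse = rmse_val, mae = mae_val, rsq = rsq_val)
```

RMSE and MAE are both expressed in points, so they can be read directly against the scale of PTS, while rsq is unitless.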

For each fold, the tuning results hold the metrics obtained, and we can plot those metrics against the penalty variable.

## Metrics ----

lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err), alpha = 0.5) +
  geom_line(linewidth = 1.5) +
  facet_wrap(~.metric, scales = "free") +
  scale_x_log10() +
  # theme_bw() is a complete theme, so it must come before the legend tweak
  theme_bw() +
  theme(legend.position = "none")

The performance of the model seems to change only very slightly across the LASSO penalty values. So, we pull the best penalty from the tuning results, based on the RMSE. The finalize_workflow() function then replaces the tune() placeholder in the original workflow with that value.

lowest_rmse <- lasso_grid %>%
  select_best(metric = "rmse")

final_lasso <- finalize_workflow(
  reg_wf,
  lowest_rmse
)

final_lasso

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 1e-10
  mixture = 1

Computational engine: glmnet

We then use the last_fit() function to fit the finalized workflow one last time on the training data and evaluate it on the test data.

last_fit_lasso <- last_fit(
  final_lasso,
  nba_split
) %>%
  collect_metrics()

last_fit_lasso

# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard     126.    Preprocessor1_Model1
2 rsq     standard       0.925 Preprocessor1_Model1