Packages we will need:
library(tidymodels)
library(tidyverse)
In this blog post, we are going to run boosted decision trees with xgboost in tidymodels.
Boosted decision trees are a type of ensemble learning technique.
Ensemble learning methods combine the predictions from multiple models to create a final prediction that is often more accurate than any single model’s prediction.
The ensemble consists of a series of decision trees added sequentially, where each tree attempts to correct the errors of the preceding ones.
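To make "each tree corrects the errors of the preceding ones" concrete, here is a toy sketch of the boosting loop using shallow rpart trees on simulated data. This is only the intuition, not how xgboost is implemented internally, and the rpart package is needed only for this illustration.
library(rpart)  # used only for this toy illustration

set.seed(1)
toy <- data.frame(x = runif(200))
toy$y <- sin(2 * pi * toy$x) + rnorm(200, sd = 0.2)

pred <- rep(0, nrow(toy))  # running prediction of the ensemble
shrink <- 0.1              # learning rate for the toy example

for (i in 1:50) {
  toy$resid <- toy$y - pred                              # errors of the current ensemble
  stump <- rpart(resid ~ x, data = toy,
                 control = rpart.control(maxdepth = 2))  # shallow tree fit to the errors
  pred <- pred + shrink * predict(stump, toy)            # the new tree corrects those errors
}

sqrt(mean((toy$y - pred)^2))  # training RMSE of the toy ensemble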

As in the previous blog post, we will use Varieties of Democracy (V-Dem) data to examine the relationship between judicial corruption and public sector theft.
Click here to read Part 1 of the tidymodels series
vdem <- read.csv(file.choose())
vdem %>%
  select(judic_corruption = v2jucorrdc,
         ps_theft = v2exthftps,
         country_name,
         year) -> vdem_vars
vdem_vars <- na.omit(vdem_vars)
We will divide the dataset into one from years 1990 to 2019 and a separate 2020 dataset.
vdem_vars %>%
  filter(year > 1989 & year < 2020) -> vdem_1990_2019

vdem_vars %>%
  filter(year == 2020) -> vdem_2020
Now we will create our recipe with the model formula and any steps to transform the variables.
recipe_spec <- recipe(judic_corruption ~ ps_theft,
                      data = vdem_1990_2019) %>%
  step_normalize(all_predictors(), -all_outcomes())
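As an optional sanity check (a quick sketch, not a required step), we can prep() and bake() the recipe to confirm that the predictor has been centred and scaled:
recipe_spec %>%
  prep(training = vdem_1990_2019) %>%  # estimate the normalization from the training data
  bake(new_data = NULL) %>%            # return the processed training set
  summary()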
Next we will set up our boosted tree model specification.
boost_tree_spec <- boost_tree(
  mode = "regression",
  trees = 1000,
  tree_depth = 3,
  min_n = 10,
  loss_reduction = 0.01,
  sample_size = 0.5,
  mtry = 2,
  learn_rate = 0.01,
  engine = "xgboost"
)
Let’s take a look at each part of this big step.
mode specifies the type of predictive modeling that we are running.
Common modes are "regression" for predicting numeric outcomes, "classification" for predicting categorical outcomes, and "censored regression" for time-to-event (survival) models.
The mode we choose is regression.
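As a side note, the mode and engine can equivalently be declared with set_mode() and set_engine(); the snippet below is just another way of writing the same kind of specification, not a separate model we will use.
boost_tree(trees = 1000) %>%
  set_mode("regression") %>%
  set_engine("xgboost")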
Next we add the number of trees to include in the ensemble.
More trees can improve model accuracy but also increase computational cost and risk of overfitting. The choice of how many trees to use depends on the complexity of the dataset and the diminishing returns of adding more trees.
We will choose 1000 trees.
tree_depth indicates the maximum depth of each tree. The depth of a tree is the length of the longest path from a root to a leaf, and it controls the complexity of the model.
Deeper trees can model more complex relationships but also increase the risk of overfitting. A smaller tree depth helps keep the model simpler and more generalizable.
Our model is quite simple, so we can choose 3.
When your model is very simple, for instance, having only one independent variable, the need for deep trees diminishes. This is because there are fewer interactions between variables to consider (in fact, no interactions in the case of a single variable), and the complexity that a model can or should capture is naturally limited.
For a model with a single predictor, starting with a lower tree_depth value (e.g., 3 to 5) is sensible. This setting can provide a balance between model simplicity and the ability to capture non-linear relationships in the data.
The best way to determine the optimal tree_depth (xgboost itself calls this parameter max_depth) is with cross-validation. This involves training models with different values of tree_depth and evaluating their performance on held-out folds. The value that results in the best cross-validated metric (e.g., RMSE for regression, accuracy for classification) is the best choice.
We will look at RMSE at the end of the blog.
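For reference, a cross-validation search over tree_depth could look like the sketch below. The fold count and candidate depths are arbitrary illustrative choices, and we do not use the tuned result later in this post.
tune_spec <- boost_tree(mode = "regression",
                        trees = 1000,
                        tree_depth = tune(),  # mark tree_depth for tuning
                        learn_rate = 0.01) %>%
  set_engine("xgboost")

folds <- vfold_cv(vdem_1990_2019, v = 5)      # 5-fold cross-validation

tune_results <- workflow() %>%
  add_recipe(recipe_spec) %>%
  add_model(tune_spec) %>%
  tune_grid(resamples = folds,
            grid = tibble(tree_depth = c(3, 4, 5, 7)),
            metrics = metric_set(rmse))

select_best(tune_results, metric = "rmse")    # depth with the lowest cross-validated RMSE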
Next we look at the min_n, which is the minimum number of data points allowed in a node to attempt a new split. This parameter controls overfitting by preventing the model from learning too much from the noise in the training data. Higher values result in simpler models.
We choose min_n of 10.
loss_reduction is the minimum loss reduction required to make a further partition on a leaf node of the tree. It’s a way to control the complexity of the model; larger values result in simpler models by making the algorithm more conservative about making additional splits.
We input a loss_reduction of 0.01.
A low value (like 0.01) means that the model will be more inclined to make splits as long as they provide even a slight improvement in loss reduction.
This can be advantageous in capturing subtle nuances in the data but might not be as critical in a simple model where the potential for overfitting is already lower due to the limited number of predictors.
sample_size determines the fraction of data to sample for each tree. This parameter is used for stochastic boosting, where each tree is trained on a random subset of the full dataset. It introduces more variation among the trees, can reduce overfitting, and can improve model robustness.
Our sample_size is 0.5.
While setting sample_size to 0.5 is common practice in boosting to help with overfitting and improve generalization, it may not be optimal for a model with a single independent variable.
We can test different values through cross-validation, monitoring the impact on both training and validation metrics, as in the sketch below.
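Extending the tuning sketch above (again purely illustrative), we could mark sample_size for tuning as well and supply candidate proportions in the grid:
tune_spec_samp <- boost_tree(mode = "regression",
                             trees = 1000,
                             tree_depth = tune(),
                             sample_size = tune(),  # tune the subsampling proportion too
                             learn_rate = 0.01) %>%
  set_engine("xgboost")

grid_samp <- crossing(tree_depth = c(3, 5),
                      sample_size = c(0.5, 0.75, 1))

tune_results_samp <- workflow() %>%
  add_recipe(recipe_spec) %>%
  add_model(tune_spec_samp) %>%
  tune_grid(resamples = folds,  # the folds created in the sketch above
            grid = grid_samp,
            metrics = metric_set(rmse))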
mtry indicates the number of variables randomly sampled as candidates at each split. For regression problems, the default is to use all variables, while for classification, a commonly used default is the square root of the number of variables. Adjusting this parameter can help in controlling model complexity and overfitting.
For this regression, our mtry will equal 2 variables at each split.
learn_rate is also known as the learning rate or shrinkage. This parameter scales the contribution of each tree.
We will use a learn_rate = 0.01.
A smaller learning rate requires more trees to model all the relationships but can lead to a more robust model by reducing overfitting.
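To illustrate the trade-off (these are not tuned values, just a sketch), a smaller learn_rate is typically paired with more trees, and a larger learn_rate with fewer:
# slow learner, many trees
boost_tree(mode = "regression", trees = 3000, learn_rate = 0.005) %>%
  set_engine("xgboost")

# fast learner, fewer trees
boost_tree(mode = "regression", trees = 300, learn_rate = 0.1) %>%
  set_engine("xgboost")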
Finally, engine specifies the computational engine to use for training the model. In this case, the "xgboost" package is used, which stands for eXtreme Gradient Boosting.
When to use each argument:
mode: Always specify this based on the type of prediction task at hand (e.g., regression, classification).
trees, tree_depth, min_n, and loss_reduction: Adjust these to manage model complexity and prevent overfitting. Start with default or moderate values and use cross-validation to find the best settings.
sample_size and mtry: Use these to introduce randomness into the model training process, which can help improve model robustness and prevent overfitting. They are especially useful in datasets with a large number of observations or features.
learn_rate: Start with a low to moderate value (e.g., 0.01 to 0.1) and adjust based on model performance. Smaller values generally require more trees but can lead to more accurate models if tuned properly.
engine: Choose based on the specific requirements of the dataset, computational efficiency, and available features of the engine.
Now we add the two steps together in the oven, i.e., the workflow.
workflow_spec <- workflow() %>%
  add_recipe(recipe_spec) %>%
  add_model(boost_tree_spec)
And we fit the model to the data.
fit <- workflow_spec %>%
  fit(data = vdem_1990_2019)
══ Workflow [trained]
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor
1 Recipe Step
• step_normalize()
── Model
##### xgb.Booster
raw: 2.2 Mb
call:
xgboost::xgb.train(params = list(eta = 0.01,
max_depth = 6,
gamma = 0.01,
colsample_bytree = 1,
colsample_bynode = 1,
min_child_weight = 10,
subsample = 0.5),
data = x$data,
nrounds = 1000,
watchlist = x$watchlist,
verbose = 0,
nthread = 1,
objective = "reg:squarederror")
params (as set within xgb.train):
eta = "0.01",
max_depth = "6",
gamma = "0.01",
colsample_bytree = "1",
colsample_bynode = "1",
min_child_weight = "10",
subsample = "0.5",
nthread = "1",
objective = "reg:squarederror",
validate_parameters = "TRUE"
xgb.attributes: niter
callbacks: cb.evaluation.log()
# of features: 1
niter: 1000
nfeatures : 1
evaluation_log:
iter training_rmse
<num> <num>
1 1.5329271
2 1.5206092
---
999 0.4724800
1000 0.4724075
We can now see how well our model can predict year 2020 data.
predictions <- predict(fit, new_data = vdem_2020)
predicted_values <- predictions$.pred

# Note: class probabilities (type = "prob") only exist for classification
# models, so for this regression we work with the numeric .pred column.
Next, we evaluate the regression model's performance against the true 2020 values of judicial corruption.
actual_vs_predicted <- vdem_2020 %>%
  select(judic_corruption) %>%
  bind_cols(predictions)
Finally, we calculate metrics such as RMSE and MAE using `yardstick`.
metrics <- actual_vs_predicted %>%
  metrics(truth = judic_corruption, estimate = .pred)
rmse_val <- actual_vs_predicted %>%
  rmse(truth = judic_corruption, estimate = .pred)
mae_val <- actual_vs_predicted %>%
  mae(truth = judic_corruption, estimate = .pred)
print(metrics)
print(rmse_val)
print(mae_val)
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.661
2 rsq standard 0.796
3 mae standard 0.525
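As an optional final check (a sketch, not part of the metrics above), we could plot the 2020 actual values against the predictions; points near the dashed 45-degree line are countries the model predicted well.
actual_vs_predicted %>%
  ggplot(aes(x = judic_corruption, y = .pred)) +
  geom_point(alpha = 0.6) +
  geom_abline(linetype = "dashed") +
  labs(x = "Actual judicial corruption (2020)",
       y = "Predicted judicial corruption (2020)")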

