How to run multiple t-tests in a function with the broom package in R

Packages we will need:

library(tidyverse)
library(broom)

We will use the Varieties of Democracy dataset again.


We will run t-tests comparing democracies (e_boix_regime == 1) and non-democracies (e_boix_regime == 0) in the years 2000 to 2020.

We need to remove the instances where e_boix_regime is NA.

I chose three t-tests to run simultaneously, comparing democracies and non-democracies on:

  1. the extent to which the country consults religious groups (relig_consult)
  2. the extent to which the country consults Civil Society Organization (CSO) groups (cso_consult)
  3. level of freedom from judicial corruption (judic_corruption) – higher scores mean that there are FEWER instances of judicial corruption
vdem %>% 
  filter(year %in% 2000:2020) %>%
  filter(!is.na(e_boix_regime)) %>%
  group_by(e_boix_regime) %>%
  select(relig_consult = v2csrlgcon,
         cso_consult = v2cscnsult,
         judic_corruption = v2jucorrdc) -> vdem

Next we need to pivot the data from wide to long format.

vdem <- vdem %>%   # (%<>% would also work, but needs library(magrittr))
  pivot_longer(!e_boix_regime, 
               names_to = "variable",
               values_to = "value")

Now we get to iterating over the three variables, conducting a t-test on each one across democracies versus non-democracies.

vdem %>%
  group_by(variable) %>%
  nest() %>%
  mutate(t_test = map(data, ~ t.test(value ~ e_boix_regime, data = .x)),
         tidy = map(t_test, broom::tidy)) %>%
  select(variable, tidy) %>%
  unnest(tidy) -> ttest_results

If we look closely at this line:

mutate(t_test = map(data, ~ t.test(value ~ e_boix_regime, data = .x)))

Here, mutate() adds a new column named t_test to the grouped and nested data.

map() is used to apply a function to each element of the list-column (here, each nested data frame).

The function applied is t.test(value ~ e_boix_regime, data = .x).

This function performs a t-test comparing the means of value across two groups defined by e_boix_regime within each nested data frame.

.x represents each nested data frame in turn.
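
If the nest-and-map pattern is new to you, here is a minimal sketch of the same idea using the built-in mtcars data (nothing VDEM-specific; any grouped data frame works):

mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                        # one row per cyl group
  mutate(avg_mpg = map_dbl(data, ~ mean(.x$mpg)))   # .x is each nested tibble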

And here are the tidy results:
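
The exact numbers depend on your VDEM download, but the tidied tibble has one row per variable, with broom's standard t-test columns:

ttest_results
# A tibble: 3 × 11
# columns: variable, estimate (the difference in group means), estimate1,
# estimate2, statistic, p.value, parameter, conf.low, conf.high,
# method, alternative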

Since we saved the results in the ttest_results data frame, we can graph the following:

ttest_results %>%
  ggplot(aes(x = variable, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_point() +
  geom_errorbar(width = 0.2) +
  coord_flip() +
  labs(title = "T-Test Estimates with Confidence Intervals",
       x = "Variable",
       y = "Estimate Difference") +
  theme_minimal()

Add some colors to highlight the magnitude of the difference:

ttest_results %>%
  mutate(color_value = estimate) %>%   # color by the size of the difference
  ggplot(aes(x = variable, y = estimate, ymin = conf.low, ymax = conf.high, 
             color = color_value)) +
  geom_point() +
  geom_errorbar(width = 0.7, size = 3) +
  scale_color_gradientn(colors = c("#0571b0", "#92c5de", "#f7f7f7", "#f4a582", "#ca0020")) +
  coord_flip() +
  labs(title = "T-Test Estimates with Confidence Intervals",
       x = "Variable",
       y = "Estimate Difference") +
  theme_minimal() +
  guides(color = guide_colorbar(title = ""))

Sometimes VDEM variables are reverse scored or on different scales. If so, we can slot steps like these into the pipeline above before running the t-tests:

  # mutate(across(judic_corruption, ~ 1 - .x)) %>%        # reverse the scoring
  # mutate(across(everything(),
  #               ~ (.x - mean(.x, na.rm = TRUE)) / sd(.x, na.rm = TRUE),
  #               .names = "z_{.col}")) %>%               # z-score standardize
  # select(contains("z_"))

How to run cross-validation of decision-tree models with xgboost in R (PART 4 Tidymodels series)

In this blog post, we will cross-validate different boosted tree models and find the one with the best root mean square error (RMSE).

Click here to read part 1, part 2 or part 3 of this series on the tidymodels package. Part 2, in particular, goes into more detail about RMSE as a way to choose the best model.

Packages we will need:

library(tidymodels)
library(tidyverse)

Our resampling method will be 10-fold cross-validation.

Click here to watch a Youtube explainer by StatQuest on the fundamentals of cross validation. StatQuest is the cat’s pyjamas.

We can use the vfold_cv() function to create the set of “V-fold” cross-validation splits; here, V = 10.

My favorite number is 11 so I’ll set that as the seed.

set.seed(11) 
cv_folds <- vfold_cv(vdem_1990_2019, v = 10)

We put our formula into the recipe function and run the pre-processing steps: in this instance, we are just normalizing the variables.

my_recipe <- recipe(judic_corruption ~ freedom_religion + polarization, 
                    data = vdem_1990_2019) %>%
  step_normalize(all_predictors(), -all_outcomes())

Here, we will define a boost_tree() model, marking the hyperparameters we want to tune with tune() placeholders rather than fixing them at specific values.

With the folds, we will then try out different values for these parameters and choose the best combination.

For us, the mode is “regression” (not categorical “classification”).

my_tree <- boost_tree(
  mode = "regression",
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()) %>%
  set_engine("xgboost")

Next we will set up a grid to explore a range of hyperparameters.

my_grid <- grid_latin_hypercube(
  trees(range = c(500, 1500)),
  tree_depth(range = c(3, 10)),
  learn_rate(range = c(0.01, 0.1)),
  size = 20)

We use the grid_latin_hypercube() function from the dials package to generate a sampling grid for tuning hyperparameters via Latin hypercube sampling.

Latin hypercube sampling (LHS) is a way to generate a sample of plausible, semi-random collections of parameter values from a distribution.

This method is used to ensure that each parameter is uniformly sampled across its range of values. LHS is systematic and stratified, but within each stratum, it employs randomness.

Source: https://www.youtube.com/watch?app=desktop&v=Evua529dAgc
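
To see what stratified-but-random means in practice, we can compare a regular grid with a Latin hypercube of the same size (a quick sketch with two of the parameters; both functions come from dials):

# a regular grid enumerates every combination of a few fixed levels...
grid_regular(trees(range = c(500, 1500)),
             tree_depth(range = c(3, 10)),
             levels = 3)    # 3 x 3 = 9 fixed combinations

# ...while a Latin hypercube spreads the same budget across the full ranges
grid_latin_hypercube(trees(range = c(500, 1500)),
                     tree_depth(range = c(3, 10)),
                     size = 9)    # 9 stratified, semi-random combinations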

Inside the grid_latin_hypercube() function, we can set the ranges for the model parameters:

trees(range = c(500, 1500))

This parameter specifies the number of trees in the model.

We can set a sampling range from 500 to 1500 trees.

tree_depth(range = c(3, 10))

This defines the maximum depth of each tree.

We set values ranging from 3 to 10.

learn_rate(range = c(0.01, 0.1))

This parameter controls the learning rate, or the step size at each iteration while moving toward a minimum of a loss function.

It’s specified to vary between 0.01 and 0.1.

size = 20

We want the Latin Hypercube Sampling to generate 20 unique combinations of the specified parameters. Each of these combinations will be used to train a model, allowing for a systematic exploration of how different parameter settings impact model performance.

> my_grid

# A tibble: 20 × 3
   trees tree_depth learn_rate
   <int>      <int>      <dbl>
 1   803          7       1.03
 2   981          7       1.18
 3   862          6       1.09
 4  1185          9       1.06
 5   763          8       1.13
 6   593          4       1.11
 7   524          7       1.22
 8   743          3       1.17
 9  1347          5       1.07
10  1010          5       1.15
11   677          3       1.25
12  1482          5       1.05
13  1446          8       1.12
14   917          4       1.23
15  1296          6       1.04
16  1391          8       1.23
17  1106          9       1.18
18  1203          5       1.14
19   606         10       1.20
20  1088          9       1.10

So next, we will combine our recipe, model specification, and resampling method in a workflow, and use tune_grid() to find the best hyperparameters based on RMSE.

my_workflow <- workflow() %>%
  add_recipe(my_recipe) %>%
  add_model(my_tree)

The tune_grid() function does the hyperparameter tuning: it evaluates every combination of hyperparameters in the grid using cross-validation.

tuning_results <- my_workflow %>%
  tune_grid(
    resamples = cv_folds,
    grid = my_grid,
    metrics = metric_set(rmse))

> tuning_results
# Tuning results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits             id     .metrics         .notes          
   <list>             <chr>  <list>           <list>          
 1 <split [3200/356]> Fold01 <tibble [1 × 4]> <tibble [0 × 3]>
 2 <split [3200/356]> Fold02 <tibble [1 × 4]> <tibble [0 × 3]>
 3 <split [3200/356]> Fold03 <tibble [1 × 4]> <tibble [0 × 3]>
 4 <split [3200/356]> Fold04 <tibble [1 × 4]> <tibble [0 × 3]>
 5 <split [3200/356]> Fold05 <tibble [1 × 4]> <tibble [0 × 3]>
 6 <split [3200/356]> Fold06 <tibble [1 × 4]> <tibble [0 × 3]>
 7 <split [3201/355]> Fold07 <tibble [1 × 4]> <tibble [0 × 3]>
 8 <split [3201/355]> Fold08 <tibble [1 × 4]> <tibble [0 × 3]>
 9 <split [3201/355]> Fold09 <tibble [1 × 4]> <tibble [0 × 3]>
10 <split [3201/355]> Fold10 <tibble [1 × 4]> <tibble [0 × 3]>

After tuning, we can extract and examine the best models.

show_best(tuning_results, metric = "rmse")
  trees tree_depth learn_rate .metric .estimator  mean     n std_err .config              
  <int>      <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1   593          4       1.11 rmse    standard   0.496    10  0.0189 Preprocessor1_Model03
2   677          3       1.25 rmse    standard   0.500    10  0.0216 Preprocessor1_Model02
3  1296          6       1.04 rmse    standard   0.501    10  0.0238 Preprocessor1_Model09
4  1010          5       1.15 rmse    standard   0.501    10  0.0282 Preprocessor1_Model08
5  1482          5       1.05 rmse    standard   0.502    10  0.0210 Preprocessor1_Model05

The best model is Model03, with a mean RMSE of roughly 0.496!
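
From here, a natural next step (not shown in the original output) is to pull out the winning hyperparameters and refit the workflow on the full training data; a minimal sketch with the tune package:

# select the best hyperparameter combination by RMSE
best_params <- select_best(tuning_results, metric = "rmse")

# plug the winners into the workflow and fit on all the training data
final_workflow <- finalize_workflow(my_workflow, best_params)
final_fit <- fit(final_workflow, data = vdem_1990_2019)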


Finally, we can plot it out.

We use collect_metrics() to pull out the RMSE and other metrics from our samples.

It automatically aggregates the results across all resampling iterations for each unique combination of model hyperparameters, providing mean performance metrics (e.g., mean accuracy, mean RMSE) and their standard errors.

rmse_results <- tuning_results %>%
  collect_metrics() %>%
  filter(.metric == "rmse")

rmse_results %>%
  mutate(.config = str_replace(.config, "^Preprocessor1_", "")) %>%
  ggplot(aes(x = .config, y = mean)) +
  geom_line(aes(group = 1), color = "#023047", size = 2, alpha = 0.6) +
  geom_point(size = 3) +
  bbplot::bbc_style() +
  labs(title = "Mean RMSE for Different Models") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

How to run decision tree analysis with xgboost in R (Tidymodels Series PART 3)

Packages we will need:

library(tidymodels)
library(tidyverse)

In this blog post, we are going to run boosted decision trees with xgboost in tidymodels.

Boosted decision trees are a type of ensemble learning technique.

Ensemble learning methods combine the predictions from multiple models to create a final prediction that is often more accurate than any single model’s prediction.

The ensemble consists of a series of decision trees added sequentially, where each tree attempts to correct the errors of the preceding ones.

Source: https://towardsdatascience.com/10-decision-trees-are-better-than-1-719406680564
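
To make “each tree attempts to correct the errors of the preceding ones” concrete, here is a toy hand-rolled boosting loop on simulated data (purely illustrative; it assumes the rpart package and is not how we will fit the real model below):

library(rpart)

set.seed(1)
toy <- data.frame(x = runif(200))
toy$y <- sin(4 * toy$x) + rnorm(200, sd = 0.2)

pred <- rep(mean(toy$y), nrow(toy))   # start from the mean prediction
eta <- 0.1                            # learning rate

for (i in 1:100) {
  toy$resid <- toy$y - pred                            # errors so far
  stump <- rpart(resid ~ x, data = toy, maxdepth = 2)  # small tree on the errors
  pred <- pred + eta * predict(stump, toy)             # add a shrunken correction
}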

Similar to the previous blog, we will use Varieties of Democracy data and we will examine the relationship between judicial corruption and public sector theft.

Click here to read Part 1 of the tidymodels series

vdem <- read.csv(file.choose())

vdem %>%
  select(judic_corruption = v2jucorrdc,
         ps_theft = v2exthftps,
         country_name,
         year) -> vdem_vars

vdem_vars <- na.omit(vdem_vars)

We will divide the dataset into one from years 1990 to 2019 and a separate 2020 dataset.

vdem_vars %>% 
  filter(year > 1989 & year < 2020) -> vdem_1990_2019

vdem_vars %>%
  filter(year == 2020) -> vdem_2020

Now we will create our recipe with the model formula and any steps to mutate the variables.

recipe_spec <- recipe(judic_corruption ~ ps_theft, 
                      data = vdem_1990_2019) %>%
  step_normalize(all_predictors(), -all_outcomes())

Next we will set up our boosted decision tree specification.

boost_tree_spec <- boost_tree(
  mode = "regression",
  trees = 1000,
  tree_depth = 3,
  min_n = 10,
  loss_reduction = 0.01,
  sample_size = 0.5,
  mtry = 2,
  learn_rate = 0.01,
  engine = "xgboost"
)

Let’s take a look at each part of this big step.

mode specifies the type of predictive modeling that we are running.

Common modes are:

  • "regression" for predicting numeric outcomes,
  • "classification" for predicting categorical outcomes,
  • "censored" for time-to-event (survival) models.

The mode we choose is regression.

Next we add the number of trees to include in the ensemble.

More trees can improve model accuracy but also increase computational cost and risk of overfitting. The choice of how many trees to use depends on the complexity of the dataset and the diminishing returns of adding more trees.

We will choose 1000 trees.

tree_depth indicates the maximum depth of each tree. The depth of a tree is the length of the longest path from the root to a leaf, and it controls the complexity of the model.

Deeper trees can model more complex relationships but also increase the risk of overfitting. A smaller tree depth helps keep the model simpler and more generalizable.

Our model is quite simple, so we can choose 3.

When your model is very simple, for instance, having only one independent variable, the need for deep trees diminishes. This is because there are fewer interactions between variables to consider (in fact, no interactions in the case of a single variable), and the complexity that a model can or should capture is naturally limited.

For a model with a single predictor, starting with a lower max_depth value (e.g., 3 to 5) is sensible. This setting can provide a balance between model simplicity and the ability to capture non-linear relationships in the data.

The best way to determine the optimal max_depth is with cross-validation. This involves training models with different values of max_depth and evaluating their performance on a validation set. The value that results in the best cross-validated metric (e.g., RMSE for regression, accuracy for classification) is the best choice.

We will look at RMSE at the end of the blog.

Next we look at the min_n, which is the minimum number of data points allowed in a node to attempt a new split. This parameter controls overfitting by preventing the model from learning too much from the noise in the training data. Higher values result in simpler models.

We choose min_n of 10.

loss_reduction is the minimum loss reduction required to make a further partition on a leaf node of the tree. It’s a way to control the complexity of the model; larger values result in simpler models by making the algorithm more conservative about making additional splits.

We input a loss_reduction of 0.01.

A low value (like 0.01) means that the model will be more inclined to make splits as long as they provide even a slight improvement in loss reduction.

This can be advantageous in capturing subtle nuances in the data but might not be as critical in a simple model where the potential for overfitting is already lower due to the limited number of predictors.

sample_size determines the fraction of data to sample for each tree. This parameter is used for stochastic boosting, where each tree is trained on a random subset of the full dataset. It introduces more variation among the trees, can reduce overfitting, and can improve model robustness.

Our sample_size is 0.5.

While setting sample_size to 0.5 is a common practice in boosting to help with overfitting and improve model generalization, it may not be optimal for a model with a single independent variable.

We can test different values through cross-validation, monitoring the impact on both training and validation metrics.

mtry indicates the number of variables randomly sampled as candidates at each split. For regression problems, the default is to use all variables, while for classification, a commonly used default is the square root of the number of variables. Adjusting this parameter can help in controlling model complexity and overfitting.

For this regression we set mtry to 2, although with only one predictor in the formula it effectively has no effect here: the model just uses that one predictor at every split.

learn_rate is also known as the learning rate or shrinkage. This parameter scales the contribution of each tree.

We will use a learn_rate = 0.01.

A smaller learning rate requires more trees to model all the relationships but can lead to a more robust model by reducing overfitting.
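
As a rough sketch of that tradeoff (hypothetical, untuned values chosen only for illustration): a specification with a small learning rate is usually paired with many more trees than one with a large learning rate.

slow_and_steady <- boost_tree(mode = "regression", trees = 3000,
                              learn_rate = 0.005, engine = "xgboost")

fast_and_loose <- boost_tree(mode = "regression", trees = 300,
                             learn_rate = 0.1, engine = "xgboost")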

Finally, engine specifies the computational engine to use for training the model. In this case, "xgboost" package is used, which stands for eXtreme Gradient Boosting.


When to use each argument:

mode: Always specify this based on the type of prediction task at hand (e.g., regression, classification).

trees, tree_depth, min_n, and loss_reduction: Adjust these to manage model complexity and prevent overfitting. Start with default or moderate values and use cross-validation to find the best settings.

sample_size and mtry: Use these to introduce randomness into the model training process, which can help improve model robustness and prevent overfitting. They are especially useful in datasets with a large number of observations or features.

learn_rate: Start with a low to moderate value (e.g., 0.01 to 0.1) and adjust based on model performance. Smaller values generally require more trees but can lead to more accurate models if tuned properly.

engine: Choose based on the specific requirements of the dataset, computational efficiency, and available features of the engine.

NOW, we add the two steps together in the oven.

workflow_spec <- workflow() %>%
  add_recipe(recipe_spec) %>%
  add_model(boost_tree_spec)

And we fit the model with the data:

fit <- workflow_spec %>%
  fit(data = vdem_1990_2019)

══ Workflow [trained] 
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor 
1 Recipe Step

• step_normalize()

── Model 
##### xgb.Booster
raw: 2.2 Mb 
call:
  xgboost::xgb.train(params = list(eta = 0.01, max_depth = 6, gamma = 0.01, 
    colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 10, 
    subsample = 0.5), data = x$data, nrounds = 1000, 
    watchlist = x$watchlist, verbose = 0, nthread = 1, 
    objective = "reg:squarederror")

params (as set within xgb.train):
  eta = "0.01", max_depth = "6", gamma = "0.01", colsample_bytree = "1", 
  colsample_bynode = "1", min_child_weight = "10", subsample = "0.5", 
  nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
xgb.attributes: niter
callbacks: cb.evaluation.log()
# of features: 1 
niter: 1000
nfeatures: 1 
evaluation_log:
     iter training_rmse
    <num>         <num>
        1     1.5329271
        2     1.5206092
---                    
      999     0.4724800
     1000     0.4724075

We can now see how well our model can predict year 2020 data.

predictions <- predict(fit, new_data = vdem_2020)
predicted_values <- predictions$.pred

# For a classification model we could also request class probabilities with
# predict(fit, new_data = vdem_2020, type = "prob"), but that does not apply
# to a regression model like this one.

Now we evaluate the regression model’s performance against the true 2020 values of judicial corruption.

actual_vs_predicted <- vdem_2020 %>%
  select(judic_corruption) %>%
  bind_cols(predictions)

Finally, we calculate metrics like RMSE and MAE using the yardstick package.

metrics <- actual_vs_predicted %>%
  metrics(truth = judic_corruption, estimate = .pred)

rmse_val <- actual_vs_predicted %>%
  rmse(truth = judic_corruption, estimate = .pred)

mae_val <- actual_vs_predicted %>%
  mae(truth = judic_corruption, estimate = .pred)

print(metrics)
print(rmse_val)
print(mae_val)
.metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.661
2 rsq     standard       0.796
3 mae     standard       0.525

How to run regressions with the tidymodels package in R: PART 1


The tidymodels framework in R is a collection of packages for modeling.

Within tidymodels, the parsnip package is primarily responsible for specifying models in a way that is independent of the underlying modeling engines. The set_engine() function in parsnip allows users to specify which computational engine to use for modeling, enabling the same model specification to be used across different packages and implementations.
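
For example, the same linear regression specification can be handed to different engines without changing anything else about the code (a minimal sketch using the built-in mtcars data):

library(tidymodels)

# one model type, two engines
lm_spec <- linear_reg() %>% set_engine("lm")                # base R stats::lm()
glmnet_spec <- linear_reg(penalty = 0.01, mixture = 1) %>%  # regularized fit
  set_engine("glmnet")

fit(lm_spec, mpg ~ wt + hp, data = mtcars)
# fit(glmnet_spec, mpg ~ wt + hp, data = mtcars)  # requires the glmnet package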


In this blog series, we will look at some commonly used models and engines within the tidymodels framework:

  1. Linear Regression (linear_reg): The classic linear regression model, with the default engine "lm" from the base R stats package.
  2. Logistic Regression (logistic_reg): Used for binary classification problems, with engines like "glm" for the base R implementation and glmnet for regularized regression.
  3. Random Forest (rand_forest): A popular ensemble method for classification and regression tasks, with engines like ranger and randomForest.
  4. Boosted Trees (boost_tree): Used for boosting tasks, with engines such as xgboost and, via the bonsai package, lightgbm.
  5. Decision Trees (decision_tree): A base model for classification and regression, with engines like rpart and C5.0.
  6. K-Nearest Neighbors (nearest_neighbor): A simple yet effective non-parametric method, with the kknn engine.
  7. Principal Component Analysis: For dimensionality reduction; in tidymodels this lives in the recipes package as the preprocessing step step_pca() rather than as a parsnip model.
  8. Lasso and Ridge Regression (linear_reg): For regression with regularization, specifying the penalty and mixture parameters and using the glmnet engine.
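
The pattern is the same across this whole list: pick the model function, set the mode, set the engine. A few quick sketches (each engine package has to be installed separately):

rf_spec <- rand_forest(mode = "classification", trees = 500) %>%
  set_engine("ranger")

knn_spec <- nearest_neighbor(mode = "regression", neighbors = 5) %>%
  set_engine("kknn")

lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%  # mixture = 1 is the lasso
  set_engine("glmnet")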

Click here for some resources I found:

  1. https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels
  2. https://rpubs.com/chenx/tidymodels_tutorial
  3. https://bookdown.org/paul/ai_ml_for_social_scientists/06_01_ml_with_tidymodels.html