How to use parallel processing to speed up running large / complex models in R

Packages we will need:

library(tidyverse)
library(plm)
library(parallel)
library(foreach)
library(doParallel)
library(vdemdata)
library(modelsummary)

In this blog post, we look at a few lines of code that help us run multiple panel-data models more quickly with parallel processing.

We can use the Varieties of Democracy (V-DEM) package to download our data!

Click here to read more about downloading V-DEM data from the vdemdata package

vdemdata::vdem %>% 
  distinct(COWcode, year, .keep_all = TRUE) %>% 
  filter(year %in% c(1945:2024)) -> vdem_df

First, we set up our panel regression model and choose our independent variables.

In ols_dependent_vars, I choose three ways to measure Civil Society Organisations (how they are created, whether they are repressed, and whether they are consulted by the state).

run_plm <- function(ols_dependent_var) {
  # Build the model formula for the chosen dependent variable
  formula <- as.formula(paste(ols_dependent_var, "~
                               v2x_polyarchy + 
                               log(e_pop + 1) + 
                               log(e_gdppc + 1)"))
  # Fit a panel regression indexed by country (COWcode) and year
  plm::plm(formula, data = vdem_df, index = c("COWcode", "year"))
}

ols_dependent_vars <- c("v2cseeorgs",  "v2csreprss", "v2cscnsult")
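Before we parallelise anything, it is worth sanity-checking the function on a single dependent variable (this quick sequential test is my addition, not part of the original workflow):

# Fit one model sequentially to confirm run_plm() works
test_model <- run_plm(ols_dependent_vars[1])
summary(test_model)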

ols_model_names <- c("CSO entry / exit", "CSO repression", "CSO consultation")

We will use ols_model_names at the end to label the models in the pretty modelsummary regression table.

Now that we have our function, we can set up the parallel processing for quicker model running.

We first need to choose the number of cores to run concurrently.

num_cores <- detectCores() - 1  

This sets up the number of CPU cores that we will use for parallel processing.

detectCores() finds out how many CPU cores your computer has.

When we subtract 1 from this number, we leave one core free.

This one free core avoids overloading the system, so the computer can do other things while we are parallel processing.

Like playing Tetris.

With 15 cores, the CPU can manage loads of processes at the same time.

We can check how many cores our computer has with detectCores():

detectCores() 
[1] 16

My computer has 16 cores.
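As a small defensive tweak (my suggestion, not from the original post), detectCores() can return NA on some platforms, so we can guard against that:

# Fall back to a single core if detectCores() returns NA
num_cores <- max(1, detectCores() - 1, na.rm = TRUE)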

Next, we will set up the parallel processing for our model.

We first need to set up a cluster of worker nodes that can execute multiple tasks at the same time.

cl <- makeCluster(num_cores)

makeCluster(num_cores) initialises a cluster of our 15 cores.

The cluster consists of the specified number of worker nodes.

cl stores our cluster object, which will be used to manage and distribute tasks among the worker nodes.
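We can take a quick peek at the cluster object (my addition):

length(cl)   # 15 – one element per worker node
cl           # socket cluster with 15 nodes on host 'localhost'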

Next, we register the cl object:

registerDoParallel(cl)

By registering the parallel backend, we enable the distribution of tasks across the 15 cores.

This speeds up the computation process and improves efficiency.

Why bother registering the parallel backend?

When you use the foreach package for parallel processing, it needs to know which parallel backend to use. By registering the cluster with registerDoParallel(cl), you tell the foreach() call below to use the specified cluster (cl) to execute tasks in parallel.

This step is essential for enabling parallel execution of the code inside the foreach() loop.
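To confirm that the registration worked, the foreach package offers two helper functions (this optional check is my addition):

foreach::getDoParWorkers()   # number of registered workers, e.g. 15
foreach::getDoParName()      # name of the registered backend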

Before we can perform parallel processing, we need to ensure that all the worker nodes in our cluster have access to the necessary data and functions.

This is achieved using the clusterExport() function.

clusterExport(cl, varlist = c("vdem_df", "run_plm", "ols_dependent_vars"))

clusterExport() exports variables from the master R session to each worker node in the cluster.

It ensures that all worker nodes have the required data and functions to perform the computations.

In our example, the varlist argument specifies:

  • dataframe “vdem_df”
  • function “run_plm”
  • vector of dependent variables “ols_dependent_vars”

Note that clusterExport() sends R objects, not packages – we load the plm package on each worker separately below.

Why export variables?

In a parallel processing setup, each worker node runs as an independent and separate R session.

By default, these sessions do not share the environment of the master session.

Boo.

Therefore, any data or functions needed for the computation must be explicitly exported to the workers.
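We can verify that the export worked by asking each worker whether it can see the objects – a quick check of my own, using the clusterEvalQ() function that the next step introduces properly:

clusterEvalQ(cl, exists("run_plm") && exists("vdem_df"))

Each of the 15 workers should return TRUE.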

To make sure that these workers can perform the necessary tasks, we need to load the required packages in each worker’s environment. We can do this by using the clusterEvalQ() function:

clusterEvalQ(cl, library(plm, logical.return = TRUE))

By using clusterEvalQ(cl, library(plm)), we can see that each of the 15 nodes is set up with the panel regression package – which we will use in our run_plm function defined above – as it returns TRUE fifteen times.

[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

[[4]]
[1] TRUE

[[5]]
[1] TRUE

[[6]]
[1] TRUE

[[7]]
[1] TRUE

[[8]]
[1] TRUE

[[9]]
[1] TRUE

[[10]]
[1] TRUE

[[11]]
[1] TRUE

[[12]]
[1] TRUE

[[13]]
[1] TRUE

[[14]]
[1] TRUE

[[15]]
[1] TRUE

Now we will use the foreach() function to run the three CSO models in parallel and combine the results.

Click here to read more about the foreach package as a method of iteration

plm_results <- foreach(dependent_var = ols_dependent_vars, 
                        .combine = list, 
                        .multicombine = TRUE, 
                        .packages = 'plm') %dopar% {
  run_plm(dependent_var)
}

foreach() iterates over a list in parallel rather than sequentially.

Woo ~

In this call, we set up everything that the loop needs.

dependent_var = ols_dependent_vars: This part initialises the loop by setting dependent_var to each element in the ols_dependent_vars vector.

Next, we combine the results.

.combine = list means that the results of each iteration will be stored as elements in a list.

.multicombine = TRUE allows foreach to pass more than two results at a time to the combine function – this can improve efficiency with a huuuuuge dataset or complex model.

.packages = 'plm' makes sure that the plm package is loaded in each worker’s environment. We will need this when we are running the run_plm function!

The strange-looking %dopar% indicates that the loop is executed in parallel.

Quick! Quick! Quick!

{ run_plm(dependent_var) } is the actual operation performed in each iteration of the loop. The run_plm function is called with dependent_var as its argument. This function fits a regression model for the current dependent variable and returns the result.

This code snippet runs the run_plm function for each dependent variable in the ols_dependent_vars vector in parallel.
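For intuition (this comparison is my own sketch, not part of the original post), the sequential equivalent of the foreach loop is a plain lapply() call. Wrapping both versions in system.time() lets you measure the speed-up – with only three models the gain may be modest, but it grows with more models:

# Sequential version for comparison
seq_time <- system.time(
  plm_results_seq <- lapply(ols_dependent_vars, run_plm)
)
seq_time["elapsed"]   # compare with the elapsed time of the %dopar% loop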

Stopping the cluster with stopCluster(cl) is an essential step in parallel processing:

stopCluster(cl)

Stopping the cluster ensures that these resources are properly cleaned up and we get our 15 cores back!
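A defensive pattern worth knowing (my sketch, not the blog's code): bundling the whole pipeline into a function with on.exit() guarantees the workers are released even if a model errors part-way through.

fit_all_models <- function() {
  cl <- makeCluster(num_cores)
  on.exit(stopCluster(cl), add = TRUE)   # cleanup runs even on error
  registerDoParallel(cl)
  clusterExport(cl, varlist = c("vdem_df", "run_plm", "ols_dependent_vars"))
  foreach(dependent_var = ols_dependent_vars,
          .combine = list,
          .multicombine = TRUE,
          .packages = "plm") %dopar% {
    run_plm(dependent_var)
  }
}

# plm_results <- fit_all_models()

The step-by-step version above works the same way; this just adds automatic cleanup.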

We assign names to the three models before we print them out.

names(plm_results) <- ols_model_names
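A quick check (my addition) that the names were assigned:

names(plm_results)
# [1] "CSO entry / exit"  "CSO repression"    "CSO consultation"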

And we print out the models with the modelsummary() function

modelsummary::modelsummary(plm_results, stars = TRUE)
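Optionally (my addition, using modelsummary's coef_map argument), we can also relabel the coefficients for a prettier table:

modelsummary::modelsummary(plm_results, 
                           stars = TRUE,
                           coef_map = c("v2x_polyarchy" = "Polyarchy",
                                        "log(e_pop + 1)" = "Log population",
                                        "log(e_gdppc + 1)" = "Log GDP per capita"))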

Yay!
