How to automate panel data modelling with dynamic formulas in R

Packages we will need:

library(plm)
library(modelsummary)
library(dplyr)  # for %>%, group_by() and mutate() used further down

When I am running a bunch of regressions, I can get bogged down with lines and lines of code.

That makes it annoying to change just one part of the formula.

I have to go to EACH model and change the variable (or year range or lags or regions or model type) again and again.

We can make this much easier!

If we separately make a formula string, we can feed that string into the models.

Now, if we want to change the models, we only have to change it in one string.
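To see the idea in miniature before we get to panel models, here is a small sketch. It uses base R's lm() and the built-in mtcars data purely as stand-ins (they are not the models or data from this post), and the object names are just for illustration.

```r
# Minimal sketch: one shared right-hand side, reused across models.
# lm() and mtcars are stand-ins for the panel models and data in this post.
rhs <- "~ wt + hp"  # edit the predictors here, in one place

# Paste a different dependent variable in front of the shared string
mpg_formula  <- as.formula(paste("mpg", rhs))
qsec_formula <- as.formula(paste("qsec", rhs))

mpg_model  <- lm(mpg_formula, data = mtcars)
qsec_model <- lm(qsec_formula, data = mtcars)

# Both models pick up the same predictors from the single string
names(coef(mpg_model))
names(coef(qsec_model))
```

If we later swap hp for another predictor, we only touch the rhs string and re-run.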

So let’s make a function to run a panel linear regression model.

A plm takes in many arguments including:

  • the formula,
  • the dataset
  • the panel data index
  • the model type (i.e. “within”, “random”, “between” or “pooling”)

In the function arguments below, we can set defaults for the index and the model type:

  • the index for the dataset is “country_cown” and “year”
  • the model type (I default it to “within”)

Then, we create the handy run_panel_model function:

run_panel_model <- function(formula_string, 
                            data,
                            index = c("country_cown", "year"), 
                            model_type = "within") {
  formula <- as.formula(formula_string)  
  model <- plm(formula, data = data, index = index, model = model_type)
  return(model)
}
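As a quick check that the defaults can be overridden, here is a sketch that runs the helper on the Grunfeld dataset that ships with plm. The dataset and its firm/year index are my choice for illustration only; they are not the data used in this post.

```r
library(plm)

# Same helper as above, repeated so this snippet runs on its own
run_panel_model <- function(formula_string, data,
                            index = c("country_cown", "year"),
                            model_type = "within") {
  plm(as.formula(formula_string), data = data, index = index, model = model_type)
}

# Grunfeld ships with plm; its panel index is firm and year,
# so we override the default index here
data("Grunfeld", package = "plm")
grunfeld_formula <- "inv ~ value + capital"

# Fixed effects (the default model type)
fe_model <- run_panel_model(grunfeld_formula, data = Grunfeld,
                            index = c("firm", "year"))

# Random effects: same formula string, only the model type changes
re_model <- run_panel_model(grunfeld_formula, data = Grunfeld,
                            index = c("firm", "year"),
                            model_type = "random")

summary(fe_model)
```

Because the index and model type are ordinary arguments, switching estimators does not require touching the formula string at all.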

With the following base_formula, we set the independent variables we want to put into our models.

Now, whenever we want, we can change the independent variables that we plug into the models, all in one place:

base_formula <- " ~
  democracy + 
  log(gdp_per_capita) + 
  log(pop)"

If we want to change the dependent variable, we add in the variable name – as a string in inverted commas – into the specific model formula using paste().

Since we are looking at civil society levels as the main dependent variable, we can name it civil_society_formula:

civil_society_formula <- paste("civil_society", base_formula, sep = "")
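If you would rather not glue strings together, base R's reformulate() builds the same kind of formula from a character vector of terms. This is an alternative sketch, not what the rest of this post uses, and the object names (predictors, civil_society_formula_alt) are made up for illustration.

```r
# Alternative sketch: build the formula from a vector of terms with
# stats::reformulate() instead of paste()
predictors <- c("democracy", "log(gdp_per_capita)", "log(pop)")

civil_society_formula_alt <- reformulate(predictors, response = "civil_society")

# The paste() route from above, for comparison
pasted_formula <- as.formula(
  paste("civil_society", "~ democracy + log(gdp_per_capita) + log(pop)")
)

# Both formulas refer to the same variables
all.vars(civil_society_formula_alt)
all.vars(pasted_formula)

# Dropping a control is a vector operation rather than string editing
smaller_formula <- reformulate(setdiff(predictors, "log(pop)"),
                               response = "civil_society")
```

One thing to keep in mind: reformulate() returns a formula object rather than a string, so it would go straight into plm() instead of through as.formula().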

We feed the arguments into the function:

civil_society_model <- run_panel_model(civil_society_formula, data = fp_mv3)

And we can feed the model into a modelsummary function for a nice table:

models_list <- list("Civil Society" = civil_society_model, ....)

modelsummary(models_list, stars = TRUE)

Adding year lags and regional average variables

fit_panel_model <- function(data,
                            variable_name,
                            lag_period) {

  regional_avg_name <- paste("regional_avg", variable_name, sep = "_")
  
  # Computing the regional average
  data <- data %>%
    group_by(e_regionpol_6C, year) %>%
    mutate(!!regional_avg_name := mean(!!as.symbol(variable_name), na.rm = TRUE)) %>%
    ungroup()
  
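  # stats::lag() is spelled out so that plm's panel-aware lag method should be
  # used even when dplyr is attached (dplyr::lag() would otherwise mask it)
  formula <- as.formula(
    paste0(variable_name, " ~ stats::lag(", variable_name, ", ", lag_period,
           ") + stats::lag(", regional_avg_name, ", ", lag_period, ")")
  )
  model <- plm(formula, data = data, index = c("country_cown", "year"))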
  
  list(model = model, data = data)
}

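lag_number <- 5

# fit_panel_model() returns a list: the fitted model is in $model,
# the data with the added regional average is in $data
democracy_model <- fit_panel_model(my_df, "democracy_var", lag_number)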

Let us examine this specific line of code:

mutate(!!regional_avg_name := mean(!!as.symbol(variable_name), na.rm = TRUE)) 

This dynamically creates a regional average for each of the six regions in the dataset, separately for every year (because we group by both e_regionpol_6C and year).

!! and as.symbol()

These are used to programmatically refer to variable names that are provided as strings.

This method is part of tidy evaluation, a system used in the tidyverse to make functions that work with dplyr more programmable.

  • as.symbol(): This function converts a string into a symbol, which is necessary because dplyr verbs work with expressions (column names written as code) rather than with strings.
  • !! (bang-bang): This operator unquotes the symbol, i.e. it injects it into the expression before mutate() evaluates it. It effectively tells R, “Don’t look for a column called variable_name; use the column whose name is stored in it.”

This ensures that the column the symbol refers to is the one used in the computation.

!!regional_avg_name :=

The := operator within mutate() allows you to assign values to dynamically named variables.

This is particularly useful when you want to create new variables whose names are stored in another variable.
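Here is a small sketch of the same pattern that you can run on its own. It uses the built-in mtcars data and groups by cyl as a stand-in for the region and year grouping above; the names avg_name, result and result_alt are just for illustration. It also shows an alternative spelling with the .data pronoun and a glue-style name, which newer versions of dplyr support.

```r
library(dplyr)

variable_name <- "mpg"
avg_name <- paste("group_avg", variable_name, sep = "_")

# Same pattern as in fit_panel_model(): !! + as.symbol() on the right,
# !! + := on the left
result <- mtcars %>%
  group_by(cyl) %>%
  mutate(!!avg_name := mean(!!as.symbol(variable_name), na.rm = TRUE)) %>%
  ungroup()

# Alternative spelling: .data[[...]] looks the column up by its string name,
# and "{avg_name}" builds the new column name from the stored string
result_alt <- mtcars %>%
  group_by(cyl) %>%
  mutate("{avg_name}" := mean(.data[[variable_name]], na.rm = TRUE)) %>%
  ungroup()

# Both versions add a column called group_avg_mpg
names(result)
```

Which spelling to use is mostly a matter of taste; the second one avoids converting strings to symbols by hand.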

The code below applies the same approach to logistic regression with panel data!

# I want to only keep the df data.frame in my environment
all_objects <- ls()
objects_to_remove <- setdiff(all_objects, "df")
rm(list = objects_to_remove, envir = .GlobalEnv)  

# Function
fit_model <- function(dependent_variable) {
  # Build the formula from the dependent variable name and a fixed right-hand side
  formula <- as.formula(paste(dependent_variable, "~ polity + log(gdp) + log(pop)"))
  pglm::pglm(formula,
             data = df,
             index = c("COWcode", "year"),
             family = binomial(link = "logit"))
}

# DVs
dependent_variables <- c("dummy_1", "dummy_2", "dummy_3")

# Model
all_my_models <- lapply(dependent_variables, fit_model)

# Labels
names(all_my_models) <- c("First Model", "Second Model", "Third Model")

# Summary
modelsummary::modelsummary(all_my_models, stars = TRUE)