programming – R Functions and Packages for Political Science Analysis

We can use the assign function to create new variables.

Most often I want to assign variables that I create to the Global Environment.

assign() is particularly useful in loops, simulations, and other situations where variable names are built conditionally or on the fly.

The basic syntax of the assign function is

assign(x, value, pos = -1, envir = as.environment(pos), inherits = FALSE)

envir: The environment in which to place the new variable. If not specified, it defaults to the current environment. .GlobalEnv is often used to assign variables in the global environment.
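As a rough illustration of the envir argument, the sketch below assigns into a scratch environment instead of the global one. The names scratch and x_in_scratch are made up for this example.

```r
# Hypothetical scratch environment, used only for this illustration
scratch <- new.env()

# Create a variable called "x_in_scratch" inside scratch
assign("x_in_scratch", 10, envir = scratch)

# Check that the variable lives in scratch (without searching parent environments)
in_scratch <- exists("x_in_scratch", envir = scratch, inherits = FALSE)

# Read the value back by name with get()
value <- get("x_in_scratch", envir = scratch)

print(in_scratch)
print(value)
```

Swapping scratch for .GlobalEnv puts the variable in the global environment instead.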

Generate variables with dynamic names in a loop.

for (i in 1:3) {
  assign(paste("var", i, sep = "_"), i^2)
}

var_3

Next, we will make a for loop that iterates over each element in the years vector.

The paste0() function concatenates its arguments into a single string without any separator.

Here, it is used to dynamically create variable names by combining the string "sales_" with the current year. For example, if year is 2020, the result would be "sales_2020".

data.frame(month = 1:12, sales = sample(100:200, 12, replace = TRUE)) creates a new data frame for each iteration of the loop. The data frame has two columns: month, which runs from 1 to 12, and sales, a random sample of 12 numbers (drawn with replacement) from the integers between 100 and 200. This simulates monthly sales data.

The assign() function assigns a value to a variable in the R environment. The first argument is the name of the variable (as a string), and the second argument is the value to assign. In this snippet, assign() is used to create a new variable with the name generated by paste0() and assign the newly created data frame to it. This means that after each iteration, a new variable (e.g., sales_2020) will be created in the global environment, containing the corresponding data frame.

years <- 2018:2022
for (year in years) {
  assign(paste0("sales_", year), data.frame(month = 1:12, sales = sample(100:200, 12, replace = TRUE)))
}

sales_2022

   month sales
1      1   118
2      2   157
3      3   163
4      4   177
5      5   185
6      6   171
7      7   151
8      8   142
9      9   141
10    10   157
11    11   137
12    12   152
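Once variables have been created this way, they can be collected again by name. The sketch below rebuilds the yearly data frames and gathers them with mget(). The names sales_list and mean_sales are made up for this example, and the numbers will differ from the output above because no seed is set.

```r
years <- 2018:2022
for (year in years) {
  assign(paste0("sales_", year),
         data.frame(month = 1:12, sales = sample(100:200, 12, replace = TRUE)))
}

# mget() looks up several objects by name and returns them as a named list
sales_list <- mget(paste0("sales_", years))
names(sales_list)

# Average monthly sales for each year
mean_sales <- sapply(sales_list, function(d) mean(d$sales))
mean_sales
```

For a single object, get("sales_2022") does the same job.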

set.seed(1111)
years <- 2018:2022
countries <- c("Austria", "Bahamas", "Canada", "Denmark")
data <- expand.grid(year = years, country = countries)
data$value <- runif(n = nrow(data), min = 100, max = 200)

data

   year country    value
1  2018 Austria 146.5503
2  2019 Austria 141.2925
3  2020 Austria 190.7003
4  2021 Austria 113.7105
5  2022 Austria 173.8817
6  2018 Bahamas 197.6327
7  2019 Bahamas 187.9960
8  2020 Bahamas 111.6784
9  2021 Bahamas 154.6289
10 2022 Bahamas 114.0116
11 2018  Canada 100.1690
12 2019  Canada 174.8958
13 2020  Canada 175.0958
14 2021  Canada 163.3406
15 2022  Canada 186.8168
16 2018 Denmark 115.9363
17 2019 Denmark 191.6828
18 2020 Denmark 155.7007
19 2021 Denmark 190.0419
20 2022 Denmark 176.5887

data_list <- split(data, data$year)

data_list
$`2018`
   year country    value
1  2018 Austria 146.5503
6  2018 Bahamas 197.6327
11 2018  Canada 100.1690
16 2018 Denmark 115.9363

$`2019`
   year country    value
2  2019 Austria 141.2925
7  2019 Bahamas 187.9960
12 2019  Canada 174.8958
17 2019 Denmark 191.6828

$`2020`
   year country    value
3  2020 Austria 190.7003
8  2020 Bahamas 111.6784
13 2020  Canada 175.0958
18 2020 Denmark 155.7007

$`2021`
   year country    value
4  2021 Austria 113.7105
9  2021 Bahamas 154.6289
14 2021  Canada 163.3406
19 2021 Denmark 190.0419

$`2022`
   year country    value
5  2022 Austria 173.8817
10 2022 Bahamas 114.0116
15 2022  Canada 186.8168
20 2022 Denmark 176.5887

env <- .GlobalEnv

Now we can dynamically create variables within the environment

assign_year_country_dataframes <- function(data, year_col, country_col, env) {
  # Get unique combinations of year and country
  combinations <- unique(data[, c(year_col, country_col)])
  
  # Iterate over each combination
  for (i in seq_len(nrow(combinations))) {
    combination <- combinations[i, ]
    year <- combination[[year_col]]
    country <- combination[[country_col]]
    
    # Subset the data for the current combination
    data_subset <- data[data[[year_col]] == year & data[[country_col]] == country, ]
    
    # Create a dynamic variable name based on year and country
    variable_name <- paste0(gsub(" ", "_", country), year)
    
    # Assign the subset data to a dynamically named variable in the specified environment
    assign(x = variable_name, value = data_subset, envir = env)
  }
}

Now we can run the function and put all the country-year pairs into the global environment

assign_year_country_dataframes(data = data, year_col = "year", country_col = "country", env = env)
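A shorter route to a similar result, offered only as a sketch and not as a replacement for the function above, is to split() the data on the combined name and pass the resulting list to list2env(). Here pieces and target_env are made-up names, and the toy data is rebuilt so the block runs by itself.

```r
set.seed(1111)
data <- expand.grid(year = 2018:2022,
                    country = c("Austria", "Bahamas", "Canada", "Denmark"))
data$value <- runif(n = nrow(data), min = 100, max = 200)

# One list element per country-year pair, named e.g. "Austria2018"
pieces <- split(data, paste0(gsub(" ", "_", data$country), data$year))

# Hypothetical target environment; use .GlobalEnv to mirror the function above
target_env <- new.env()
list2env(pieces, envir = target_env)

# Names of the objects that were created
created <- ls(target_env)
created
```

Keeping the pieces in a named list (pieces[["Austria2018"]]) is often easier to work with than many separate objects.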

Packages we will need:

library(rnaturalearth)
library(tidyverse)
library(skimr)
library(lubridate)
library(magrittr)

I use this post to keep code bits all in one place so I can check back here when I inevitably forget them.

For most of the snippets, we can use a map data.frame that we can download from the rnaturalearth package. So the code below downloads a map of the world.

my_map <- ne_countries(scale = "medium", returnclass = "sf")

my_map %>% View

my_map %<>% select(sovereignt,
                   iso_a2,
                   pop_est,
                   gdp_md_est,
                   economy,
                   income_grp,
                   region_un:region_wb)

# The geometry column of an sf object is "sticky", so drop it with st_drop_geometry()
my_map %<>% sf::st_drop_geometry() %>% as_tibble()

1. How to keep only one data.frame in the R environment

all_objects <- ls()
objects_to_remove <- setdiff(all_objects, "my_map")
rm(list = objects_to_remove, envir = .GlobalEnv)

2. Add an ID variable based on row number

my_map %<>% mutate(id = row_number())

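3. Replace NAs in all numeric columns with 0 (also using the assignment pipe %<>% from the magrittr package)

# replace_na() needs a replacement of the same type as the column,
# so 0 is only applied to the numeric columns
my_map %<>% 
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))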

4. Plot missing variables in a data.frame

library(DataExplorer)
plot_missing(my_map)

5. Summarise all variables with skimr package

library(skimr)
skim(df)

6. Reverse score a variable

df %>%
  mutate(reversed_score_var = max(score_var) + min(score_var) - score_var)
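A small worked example of the reverse-scoring formula, using a made-up tibble called survey with a 1 to 5 scale:

```r
library(dplyr)

# Hypothetical survey item scored from 1 to 5
survey <- tibble(score_var = c(1, 2, 5, 4, 3))

reversed <- survey %>%
  mutate(reversed_score_var = max(score_var) + min(score_var) - score_var)

reversed$reversed_score_var
# 5 4 1 2 3
```

The formula relies on the observed minimum and maximum. If nobody in the data picked the lowest or highest option, it is safer to plug in the scale's theoretical bounds, for example 5 + 1 - score_var.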

Lots and lots of stringr package stuff (and a bit of regex)

7. Remove footnote brackets, like [4] and [11], from a string

df <- df %>%
  mutate(column = str_replace_all(column, "\\[[0-9]+\\]", ""))

\\[ : Matches the opening square bracket

[0-9]+ : Matches one or more digits

\\]: Matches the closing square bracket
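To see the pattern at work outside a data.frame, here is a quick sketch on a made-up character vector:

```r
library(stringr)

# Hypothetical strings with Wikipedia-style footnote markers
x <- c("Ireland[4] joined in 1973[11]", "No footnotes here")

cleaned <- str_replace_all(x, "\\[[0-9]+\\]", "")
cleaned
# "Ireland joined in 1973" "No footnotes here"
```

Brackets that contain anything other than digits, such as [note 1], are left untouched by this pattern.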

8. Remove a string pattern from all variable names in a data.frame

my_map %<>%
rename_all(~str_remove(., "_map"))

9. How to extract a substring based on a pattern

my_map %<>%
  mutate(my_pattern_substring = str_extract(my_string_variable, "my_pattern"))

10. To concatenate (link together) strings

str_c("a", "b", "c")

[1] "abc"

11. And how to compute the length of strings

str_length("abcdefme")

[1] 8

12. Extract substrings from a character variable

my_map %<>% 
  mutate(income_substring = substr(income_grp, 
                                   start = 1, stop = 3))

13. Split a string into pieces

str_split("Merry, Christmas, to, you", ",")

[[1]]
[1] "Merry"      " Christmas" " to"        " you"

14. Replace matched patterns in a string

my_map %<>% 
  mutate(earning_grp = str_replace(income_grp, "income", "earning"))

my_map %>% count(earning_grp, sort = TRUE)

  earning_grp                  n
1 4. Lower middle earning     60
2 3. Upper middle earning     58
3 2. High earning: nonOECD    46
4 5. Low earning              42
5 1. High earning: OECD       35

15. Detect the presence or absence of a pattern

my_map %<>% 
  mutate(asia = str_detect(region_wb, "Asia"))

my_map %>% count(asia)

  asia      n
1 FALSE   127
2 TRUE    114

16. Count the number of occurrences of a pattern
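No code accompanies this item, so what follows is only a guess at it. A minimal sketch, assuming the counts below came from counting the pattern "OECD" in income_grp (the two OECD-related groups in the earlier table, 46 and 35, add up to the 81 shown below), would use str_count(). The toy vector here stands in for the real column.

```r
library(stringr)

# Toy stand-in for my_map$income_grp
income_grp <- c("1. High income: OECD", "2. High income: nonOECD", "5. Low income")

# Number of times "OECD" appears in each string
oecd <- str_count(income_grp, "OECD")
oecd
# 1 1 0

# Tabulate the counts
table(oecd)
```

On the full data the equivalent would be something like my_map %>% mutate(oecd = str_count(income_grp, "OECD")) %>% count(oecd).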

   oecd     n

1     0   160
2     1    81

17. Trim leading and trailing whitespace

str_trim("   abc   ")

Leaving stringr, back to other random code bits

18. Calculates the sum of values across all columns for each row in a data.frame

df %>% rowwise() %>%
  mutate(sum = sum(c_across(everything())))

19. Using reduce() function from purrr package to iteratively combine elements in a vector

character_vector <- c("Good", "Will ", "Hunting")

reduce(character_vector, paste0)

20. Finding the maximum value in the disp column

reduce(mtcars$disp, pmax)
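To make the two reduce() calls concrete, here is a sketch with their results stored in made-up names, plus accumulate(), which keeps every intermediate step:

```r
library(purrr)

# paste0() is applied pairwise from left to right
title <- reduce(c("Good", "Will ", "Hunting"), paste0)
title
# "GoodWill Hunting"

# accumulate() returns the running results instead of only the last one
running_total <- accumulate(1:4, `+`)
running_total
# 1 3 6 10

# Reducing with pmax gives the same answer as max()
max_disp <- reduce(mtcars$disp, pmax)
max_disp == max(mtcars$disp)
```

For a plain maximum, max(mtcars$disp) is simpler. The reduce() version is mainly useful as a pattern for functions that only take two arguments at a time.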

purrr map() code bits

21. Applying a summary function across variables in a data.frame

# var and grouping_var are column names passed as strings
summary_stats_fun <- function(df, var, grouping_var) {
  result <- df %>%
    group_by(across(all_of(grouping_var))) %>%
    summarise(
      count = n(),
      sum_var = sum(.data[[var]], na.rm = TRUE)
    ) %>%
    arrange(desc(count))
  return(result)
}

# Apply the summary function to every data.frame in a list
map_summary_stats <- function(list_of_data, var, grouping_var) {
  result <- map(list_of_data, ~ summary_stats_fun(.x, var = var, grouping_var = grouping_var))
  return(result)
}

list_of_data <- list(
  data.frame(country = c("A", "B", "A", "C", "B", "C"), value = c(10, 15, 20, 5, 8, 12)),
  data.frame(country = c("A", "A", "B", "B", "C", "C"), value = c(8, 12, 15, 10, 5, 20))
)

result_summary_stats <- map_summary_stats(list_of_data, var = "value", grouping_var = "country")

22. How to remove rows from a data.frame that match a string pattern

df <- df %>%
  filter(!grepl("pattern", column))

23. How to remove non-numeric characters

df <- df %>%
  mutate(column = str_replace_all(column, "[^0-9]", ""))

24. Removing parentheses and contents within

df <- df %>%
  mutate(column = str_replace_all(column, "\\(.*?\\)", ""))

25. How to split a string into two new variables

df <- df %>%
  separate(column, into = c("new_col_1", "new_col_2"), sep = ", ")

26. Extracting alphabetic characters

df <- df %>%
  mutate(alpha_chars = str_extract_all(column, "[A-Za-z]"))

27. Remove scientific notation

old_scipen <- options("scipen") # Save the current scipen value
options(scipen = 999) # Disable scientific notation
options(old_scipen) # Reset
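A quick way to see the effect is to format a large number before and after changing the option. This is a sketch with made-up variable names. It also uses the fact that options() returns the previous values, so the old setting can be saved and changed in one call.

```r
# Change scipen and keep the previous setting in one step
old_opts <- options(scipen = 999)

fixed <- format(1e10)
fixed
# "10000000000"

# Restore whatever was set before
options(old_opts)

# With R's default scipen of 0 this prints in scientific notation
format(1e10)
```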

Some functions from the forcats package for factor variables

28. Count the occurrences of each level in a factor

fct_count(factor_variable)

29. Order levels by their frequency.

fct_infreq(factor_variable)

30. Lump levels into a specified number of top or bottom levels.

fct_lump(factor_variable, n = 5)

31. Collapse factor levels into broader categories.

fct_collapse(factor_variable, BroadCategory = c("Category1", "Category2"))  # new level name = old levels to merge

32. Relabel factor levels

fct_relabel(factor_variable, toupper)  # takes a function that is applied to every level label

33. Reverse the order of factor levels.

fct_rev(factor_variable)

34. Make NAs explicit by adding a level for missing values.

fct_na_value_to_level(factor_variable, level = "(Missing)")  # replaces fct_explicit_na(), which is deprecated since forcats 1.0.0

35. Keep chosen levels and group all the others into an “Other” category.

fct_other(factor_variable, keep = c("Level1", "Level2"))

36. Combine two factors into one factor whose levels are the combinations of the originals.

fct_cross(factor1, factor2)

37. Recode factor levels.

fct_recode(factor_variable, NewLevel1 = "OldLevel1", NewLevel2 = "OldLevel2")

38. Count the occurrences of each level in a factor, sorted by frequency.

fct_count(factor_variable, sort = TRUE)
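Since factor_variable above is only a placeholder, here is a sketch of a few of these functions on a made-up factor called party:

```r
library(forcats)

# Hypothetical factor of party labels
party <- factor(c("Labour", "Green", "Labour", "Liberal", "Labour", "Green"))

# Counts per level, most frequent first
party_counts <- fct_count(party, sort = TRUE)
party_counts

# Reorder the levels by frequency
infreq_levels <- levels(fct_infreq(party))
infreq_levels

# Keep the single most common level and lump the rest into "Other"
lumped_levels <- levels(fct_lump_n(party, n = 1))
lumped_levels

# Rename one level: new name on the left, old name on the right
recoded_levels <- levels(fct_recode(party, Lab = "Labour"))
recoded_levels
```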

And next we will look at lubridate functions I always need to look up. Dates are a pain.

39. Parse date character in the “Year-Month-Day” format.

ymd("2023-12-16")

40. Get the current date and time.

now()

41. Get the current date only

today()

42. How to extract a year from a date class.

year(ymd("2023-12-16"))

43. And to extract the hour from the time now

hour(now())

44. How to extract the day of the week from a date class

wday(ymd("2023-12-16"))

45. Rounding up or rounding down to the nearest time unit

floor_date(now(), "months")

ceiling_date(now(), "hours")

46. Create an interval object.

my_interval <- interval(start = ymd("2023-01-01"), end = ymd("2023-12-31"))

47. Get the timezone of date and time now

tz(now())
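Because now() and today() change every time they run, a fixed date makes the behaviour easier to check. A sketch using the same date as above and a made-up variable d, assuming lubridate's default of weeks starting on Sunday:

```r
library(lubridate)

d <- ymd("2023-12-16")

year(d)                   # 2023
wday(d)                   # 7, a Saturday when weeks start on Sunday
wday(d, label = TRUE)     # weekday name; the language depends on your locale

floor_date(d, "month")    # "2023-12-01"
ceiling_date(d, "month")  # "2024-01-01"

# Check whether the date falls inside an interval
my_interval <- interval(start = ymd("2023-01-01"), end = ymd("2023-12-31"))
inside <- d %within% my_interval
inside                    # TRUE
```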

And last the janitor package for cleaning variables

48. Clean names with lowercase letters and _underscores_

library(janitor)
cleaned_data <- clean_names(original_data)
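As a small illustration, the sketch below runs clean_names() on a made-up data.frame called messy whose column names contain spaces, capitals and brackets:

```r
library(janitor)

# Hypothetical data with awkward column names
messy <- data.frame(`Country Name` = c("A", "B"),
                    `GDP (USD)` = c(1, 2),
                    check.names = FALSE)

cleaned <- clean_names(messy)
names(cleaned)
```

The result should be lowercase snake_case names that are safe to type without backticks.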

49. Remove empty rows and columns from a data.frame

cleaned_data <- remove_empty(original_data, which = c("rows", "cols"))

50. How to remove columns with constant values

cleaned_data <- remove_constant(original_data)

51. How to find duplicate rows in a data.frame

duplicate_rows <- get_dupes(original_data, col1, col2)

52. How to show row percentages with a percentage sign (%) in a contingency table

table_with_percentages <- tabyl(original_data, col1, col2) %>%
                         adorn_percentages("row") %>%
                         adorn_pct_formatting()

53. Add counts alongside the percentages in a contingency table

table_with_counts <- tabyl(original_data, col1, col2) %>%
                    adorn_percentages("row") %>%
                    adorn_pct_formatting() %>%
                    adorn_ns()

54. Changing Data Types

df %>%
  mutate(across(starts_with("num"), as.character))

55. Scaling Numeric Variables

df %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))  # scale() returns a one-column matrix, so convert back to a vector

56. Applying a Custom Function

df %>%
  mutate(across(contains("price"), ~ .x * 1.1))

57. Filter based on a condition across several columns

df %>%
  filter(if_all(ends_with("score"), ~ .x > 80))
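For filtering on several columns at once, dplyr provides if_all() (every selected column must pass) and if_any() (at least one must pass). A sketch on a made-up tibble called exams:

```r
library(dplyr)

# Hypothetical exam results
exams <- tibble(id = 1:3,
                math_score = c(90, 85, 70),
                essay_score = c(95, 60, 75))

# Rows where every score column is above 80
all_high <- exams %>% filter(if_all(ends_with("score"), ~ .x > 80))
all_high$id
# 1

# Rows where at least one score column is above 80
any_high <- exams %>% filter(if_any(ends_with("score"), ~ .x > 80))
any_high$id
# 1 2
```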

58. Impute missing values with the column mean

df %>%
  mutate(across(starts_with("var"), ~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
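A worked example of the mean imputation on a made-up tibble called toy:

```r
library(dplyr)

# Hypothetical data with one missing value per column
toy <- tibble(var_a = c(1, NA, 3),
              var_b = c(NA, 10, 20))

imputed <- toy %>%
  mutate(across(starts_with("var"), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

imputed$var_a
# 1 2 3
imputed$var_b
# 15 10 20
```

Each NA is replaced by the mean of the non-missing values in its own column.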

59. Selecting Columns with Specific Data Types

df %>%
  select(where(is.character))

60. Rename columns with paste0()

df %>%
  rename_with(~ paste0("new_", .x), contains("old"))
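rename_with() is the dplyr verb for renaming columns with a function: the function comes first and the column selection second. A sketch on a made-up tibble called df_old:

```r
library(dplyr)

# Hypothetical data with two "old" columns
df_old <- tibble(old_gdp = 1, old_pop = 2, year = 2020)

renamed <- df_old %>%
  rename_with(~ paste0("new_", .x), contains("old"))

names(renamed)
# "new_old_gdp" "new_old_pop" "year"
```

To swap the prefix rather than add one, a function such as ~ sub("^old_", "new_", .x) could be used instead.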

61. How to calculate row sums

df %>%
  mutate(total = rowSums(across(starts_with("quantity"))))

62. Group-wise scaling

df %>%
  group_by(category) %>%
  mutate(across(starts_with("value"), ~ as.numeric(scale(.x))))  # scale() returns a one-column matrix, so convert back to a vector

63. How to conditionally mutate

df %>% mutate(across(starts_with("sales"), ~ ifelse(.x > 100, "High", "Low")))

64. Output tidy regression model without scientific notation

library(plm)

plm(lead(dep_var, 2) ~ ind_var, index = c("country", "year"), 
  data = df) %>%
  broom::tidy() %>% 
  arrange(desc(estimate)) %>% 
  # Format the numbers as fixed-point strings with 10 decimal places
  mutate(across(c(estimate, std.error, statistic, p.value), ~ sprintf("%.10f", .x))) %>%
  print(n = 100)

65. How to choose the reference factor for a regression

df %<>%
  mutate(factor_var = relevel(as.factor(factor_var), ref = "ref_level"))
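To check which level a model will treat as the baseline, look at the first element of levels(). A base R sketch on a made-up factor called regime:

```r
# Hypothetical regime-type variable
regime <- factor(c("democracy", "autocracy", "hybrid", "democracy"))

# By default levels are alphabetical, so "autocracy" would be the reference
levels(regime)
# "autocracy" "democracy" "hybrid"

# Move "hybrid" to the front so it becomes the reference category
regime_releveled <- relevel(regime, ref = "hybrid")
levels(regime_releveled)
# "hybrid" "autocracy" "democracy"
```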

66. Remove variables that end with a character string.

vdem %<>%
  select(-ends_with("_sd"), 
         -ends_with("_codelow"), 
         -ends_with("_codehigh"), 
         -ends_with("_3C"), 
         -ends_with("_4C"), 
         -ends_with("_5C"))

Can you add more code snippets in the comments?