How to rowwise sum the variables that contain the same variable string pattern in R

This is another blog post so that I can keep a snippet of code for myself! And if you find it helpful too, all the better.

Archie Madekwe Wow GIF by Saltburn - Find & Share on GIPHY

We will be completing rowwise computations, which is not the default in R. Therefore, we need to explicitly state that is what we are hoping to do

Source: https://cmdlinetips.com/2021/06/row-wise-operations-in-r/

In this instance, we will be using c_across() to specify we want to sum across particular columns.

Specifically… all columns that contain a string pattern of “totals_”

df <- df %>% 
  rowwise() %>%
  mutate(totals_sum = sum(c_across(contains("totals_")), na.rm = TRUE)) %>%
  ungroup()

rowwise(): This function is used to indicate that operations following it should be applied row by row instead of column by column (which is the default behavior in dplyr).

mutate(totals_sum = sum(c_across(contains("totals_")), na.rm = TRUE)):

Within the mutate() function, sum(c_across(contains("totals_"))) computes the sum of all columns for each row that contain the pattern “totals_”.

The na.rm = TRUE argument is used to ignore NA values in the sum. c_across() is used to select columns within rowwise() context.

ungroup(): This function is used to remove the rowwise grouping imposed by rowwise(), returning the dataframe to a standard tbl_df.

Usually I forget to ungroup. Oops. But this is important for performance reasons and because most dplyr functions expect data not to be in a rowwise format.

Oh Yeah Hot Ones GIF by First We Feast - Find & Share on GIPHY

Create a rowwise binary variable

data <- data %>%
  rowwise() %>%
  mutate(has_ruler = as.integer(any(c_across(starts_with("broad_cat_")) == "ruler"))) %>%
  ungroup()

How to analyse Afrobarometer survey data with R. PART 3: Cronbach’s Alpha, Exploratory Factor Analysis and Correlation Matrices

Packages we will need:

library(tidyverse)
library(psych)
library(ggcorrplot)
library(lavaan)
library(gt)
library(gtExtras)
library(skimr)

How do people view the state of the economy in the past, present and future for the country and how they view their own economic situation?

Are they highly related concepts? In fact, are all these questions essentially asking about one thing: how optimistic or pessimistic a person is about the economy?

In this blog we will look at different ways to examine whether the questions answered in a survey are similiar to each other and whether they are capturing an underlying construct or operationalising a broader concept.

In our case, the underlying concept relates to levels of optimism about the economy.

We will use Afrobarometer survey responses in this blog post.

Click here to follow links to download Afrobarometer data and to recode the country variables.

How to analyse Afrobarometer survey data with R. PART 1: Exploratory analysis

First, we can make a mini data.frame to look at only the variables that ask the respondents to describe how they think:

the state of the economy is now (state_econ_now),
their own economic condition is now (my_econ_now),
the economy was a year ago (state_econ_past),
the economy will be in a year (state_econ_future)

In the data, variables range from:

1 = Very bad, 2 = Fairly bad, 3 = Neither good nor bad, 4 = Fairly good, 5 = Very good.

or in the variables comparing the past and the future:

1 = Much worse, 2 = Worse, 3 = Same, 4 = Better, 5 = Much better

The variables we want to remove:

8 = Refused, 9 = Don’t know, -1 = Missing

ab %>% 
  select(country_name,
         state_econ_now = Q4A,
         my_econ_now = Q4B,
         state_econ_past = Q6A,
         state_econ_future = Q6B)  %>% 
  filter(across(where(is.numeric), ~ . <= 7)) -> ab_econ

With the skim() package and gt() package we can look at some descriptive statistics.

Click here to read more about the gtExtras() theme options:

ab_econ %>%  
  select(!country_name) %>% 
  skimr::skim(.) %>%
  mutate(across(where(is.numeric), ~ round(., 2))) %>% 
  rename_with(~ sub("numeric\\.|skim_", "", .), everything()) %>% 
  gt::gt() %>% 
  gtExtras::gt_theme_nytimes() %>% gt::as_raw_html()

type	variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
numeric	state_econ_now	1	2.41	1.32	1	1	2	4	5	▇▆▃▅▂
numeric	my_econ_now	1	2.68	1.28	1	2	3	4	5	▇▇▅▇▂
numeric	state_econ_past	1	2.51	1.15	1	2	2	3	5	▆▇▆▅▁
numeric	state_econ_future	1	3.18	1.29	1	2	4	4	5	▃▃▃▇▃

First off, we can run a Chronbach’s alpha to examine whether these variables are capturing an underlying construct.

Do survey respondents have an overall positive or overall negative view of the economy (past, present and future) and is it related to respondents’ views of their own economic condition?

The Cronbach’s alpha statistic is a measure of internal consistency reliability for a number of questions in a survey.

Chronbach’s alpha assesses how well the questions in the survey are correlated with each other.

We can interpret the test output and examine the extent to which they measure the same underlying construct – namely the view that people are more optimistic about the economy or not.

ab_econ %>% 
  select(!country_name) %>% 
  psych::alpha() -> cronbach_results

raw_alpha	std.alpha	G6(smc)	Avg_r	S/N	ase	mean	sd	median_r
0.66	0.66	0.62	0.33	2	0.0026	2.7	0.89	0.32

When we interpret the output, the main one is the Raw Alpha score.

raw_alpha:

The raw Cronbach’s alpha coefficient ranges from 0 to 1.

For us, the Chronbach’s Alpha is 0.66. Higher values indicate greater internal consistency among the items. So, our score is a bit crappy.

Season 1 Netflix GIF by Gilmore Girls - Find & Share on GIPHY

std.alpha:

Standardized Cronbach’s alpha, which adjusts the raw alpha based on the number of items and their intercorrelations. It is also 0.66 because we only have a handful of variables

G6(smc):

The Guttman’s Lambda 6 is alternative estimate of internal consistency that we can consider. For our economic construct, it is 0.62.

Click this link if you want to go into more detail to discuss the differences between the alpha and the Guttman’s Lambda.

For example, they argue that Guttman’s Lambda is sensitive to the factor structure of a test. It is influenced by the degree of “lumpiness” or correlation among the test items.

For tests with a high degree of intercorrelation among items, G6 can be greater than Cronbach’s Alpha (α), which is a more common measure of reliability. In contrast, for tests with items that have very low intercorrelation, G6 can be lower than α

average_r:

The average inter-item correlation, which shows the average correlation between each item and all other items in the scale.

For us, the correation is 0.33. Again, not great. We can examine the individual correlations later.

S/N:

The Signal-to-Noise ratio, which is a measure of the signal (true score variance) relative to the noise (error variance). A higher value indicates better reliability.

For us, it is 2. This is helpful when we are comparing to different permutations of variables.

ase:

The standard error of measurement, which provides an estimate of the error associated with the test scores. A lower score indicates better reliability.

In our case, it is 0.0026. Again, we can compare with different sets of variables if we add or take away questions from the survey.

mean:

The mean score on the scale is 2.7. This means that out of a possible score of 5 across all the questions on the economy, a respondent usually answers on average in the middle (near to the answer that the economy stays the same)

sd:

The standard deviation is 0.89.

median_r:

The median inter-item correlation between the median item and all other items in the economic optimism scale is 0.32.

Looking at the above eight scores, the most important is the Cronbach’s alpha of 0.66 . This only suggests a moderate level of internal consistency reliability for our four questions.

But there is still room for improvement in terms of internal consistency.

	lower	alpha	upper
Feldt	0.66	0.66	0.67
Duhachek	0.66	0.66	0.67

Next we will look to see if we can improve the score and increase the Chronbach’s alpha.

Reliability if an item is dropped:

	raw_alpha	std.alpha	G6(smc)	Avg_r	S/N	alpha se	var.r	med.r
Econ now	0.52	0.53	0.43	0.27	1.1	0.004	0.004	0.26
My econ	0.59	0.59	0.49	0.33	1.5	0.003	0.001	0.34
Econ past	0.61	0.61	0.53	0.34	1.5	0.003	0.025	0.29
Future	0.65	0.64	0.57	0.38	1.8	0.003	0.017	0.35

Again, we can focus on the raw Chronbach’s alpha score in the first column if that given variable is removed.

We see that if we cut out any one of the the questions, the score goes down.

We don’t want that, because that would decrease the internal consistency of our underlying “optimism about the economy” type construct.

Item	n	raw.r	std.r	r.cor	r.drop	mean	sd
Now	43,702	0.77	0.76	0.67	0.54	2.4	1.3
My econ	43,702	0.71	0.71	0.57	0.45	2.7	1.3
Past	43,702	0.67	0.69	0.52	0.43	2.5	1.1
Future	43,702	0.66	0.65	0.45	0.36	3.2	1.3

These item statistics provide insights into the characteristics of each individual variable

We will look at the first variable in more detail.

state_econ_now

raw.r: The raw correlation between this item and the total score is 0.77, indicating a strong positive relationship with the overall score.
std.r: The standardized correlation is 0.76, showing that this item contributes significantly to the total score’s variance.
r.cor: This is the corrected item-total correlation and is 0.67, suggesting that the item correlates well with the overall construct even after removing it from the total score.
r.drop: The corrected item-total correlation when the item is dropped is 0.54, indicating that the item still has a reasonable correlation even when not included in the total score.
mean: The average response for this item is 2.4.
sd: The standard deviation of responses for this item is 1.3.

Item	1	2	3	4	5
state_econ_now	0.33	0.27	0.12	0.22	0.07
my_econ_now	0.23	0.26	0.17	0.27	0.06
state_econ_past	0.22	0.32	0.21	0.21	0.03
state_econ_future	0.15	0.17	0.16	0.39	0.13

Next we will lok at factor analysis.

Factor analysis can be divided into two main types:

exploratory
confirmatory

Exploratory factor analysis (EFA) is good when we want to check out initial psychometric properties of an unknown scale.

Confirmatory factor analysis borrows many of the same concepts from exploratory factor analysis.

However, instead of letting the data tell us the factor structure, we choose the factor structure beforehand and verify the psychometric structure of a previously developed scale.

For us, we are just exploring whether there is an underlying “optimism” about the economy or not.

For the EFA, we will run as Structural Equation Model with the sem() function from the lavaan package

efa_model <- sem(model, data = ab_econ, fixed.x = FALSE)

When we set fixed.x = FALSE, as in your example, it means we are estimating the factor loadings as part of the EFA model.

With fixed.x = FALSE, the factor loadings are allowed to vary freely and are estimated based on the data

This is typical in an exploratory factor analysis, where we are trying to understand the underlying structure of the data, and we let the factor loadings be determined by the analysis.

Next let’s look at a summary of the EFA model:

summary(efa_model, standardized = TRUE)

table { font-family: Arial, sans-serif; border-collapse: collapse; width: 100%; } th { background-color: #f2f2f2; } th, td { border: 1px solid #dddddd; text-align: left; padding: 8px; } tr:nth-child(even) { background-color: #f2f2f2; }

lavaan 0.6.16 ended normally after 33 iterations
Estimator: ML
Optimization method: NLMINB
Number of model parameters: 9
Number of observations: 43,702
Model Test User Model:
Test statistic: 0.015
Degrees of freedom: 1
P-value (Chi-square): 0.903
Parameter Estimates:
Standard errors: Standard
Information: Expected
Information saturated (h1) model: Structured

Time for interpreting the output.

Model Test (Chi-Square Test):

When we look at this, we evaluate the goodness of fit of the model.
In this case, the test statistic is 0.015, the degrees of freedom is 1, and the p-value is 0.903.
The high p-value (close to 1) suggests that the model fits the data well. Yay! (A non-significant p-value is generally a good sign for model fit).

Parameter Estimates:

The output provides parameter estimates for the latent variables and covariances between them.
The standardized factor loadings (Std.lv) and standardized factor loadings (Std.all) indicate how strongly each observed variable is associated with the latent factors.
For example, “state_econ_now” has a strong loading on “f1” with Std.lv = 1.098 and Std.all = 0.833.
Similarly, “state_econ_pst” and “state_econ_ftr” load on “f2” with different factor loadings.

Covariances:

The covariance between the two latent factors, “f1” and “f2,” is estimated as 0.529. This implies a relationship between the two factors.

Variances:

The estimated variances for the observed variables and latent factors. These variances help explain the amount of variability in each variable or factor.
For example, “state_econ_now” has a variance estimate of 0.530.

Factor Loadings:

High factor loadings (close to 1) suggest that the variables are capturing the same construct.

For our output, we can say that factor loadings of “state_econ_now” and “my_econ_now” on “f1” are relatively high, which indicates that these variables share a common underlying construct. This captures how the respondent thinks about the current economy

Similarly, “state_econ_past” and “state_econ_future” load highly on “f2.”

This means that comparing to different times is a different variable of interest.

Finally, we can run correlations to visualise the different variables:

ab %>% 
  select(country_name,
         state_econ_now = Q4A,
         my_econ_now = Q4B,
         state_econ_past = Q6A,
         state_econ_future = Q6B)  %>% 
  filter(across(where(is.numeric), ~ . <= 7)) -> ab_econ

cor_matrix <- ab_econ %>% 
  select(!country_name) %>% 
  cor(., method = "spearman") %>% 
  ggcorrplot(, type = "lower", lab = TRUE) +
  theme_minimal() +
  labs(x = " ", y = " ")

We can breakdown the correlations in each country to see if they differ.

We will pull out a vector of country that we can use to iterate over in our for loop below.

country_vector <- ab_econ %>% 
  distinct(country_name) %>% pull()

Adter, we create an empty list to store our correlation matrices

 correlation_matrices <- list()

Then we iterate over the countries in the country_vector and create correlation matrices

for (country in country_vector) {
    country_data <- ab_mini %>% 
    filter(country_name == country)
    cor_matrix <- country_data %>%
    select(-country_name) %>%
    cor(method = "spearman")
    round(., 2)
    correlation_matrices[[country]] <- cor_matrix
}

Last we print out the correlation between the “state of the economy now” and “my economic condition now” variables for each of the 34 countries

 correlation_matrix_list <- list()
  for (country in country_vector) {
   correlation_matrix_list[[country]] <- correlation_matrices[[country]][2]}

 correlation_matrix_df %>% 
   t %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "country") %>% 
  select(country, corr = V1) %>% 
  arrange(desc(corr)) -> state_econ_my_econ_corr

Let’s look at the different levels of correlation between the respondents’ answers to how the COUNTRY’S economic situation is doing and how the respondent thinks THEIR OWN economic situation.

state_econ_my_econ_corr %>%  
   gt() %>% 
  gtExtras::gt_theme_nytimes() %>% gt::as_raw_html()

country	corr
Nigeria	0.7316334
Ghana	0.6650589
Morocco	0.6089823
Togo	0.6079971
Guinea	0.6047750
Ethiopia	0.5851686
Lesotho	0.5787015
Sierra Leone	0.5684559
Zimbabwe	0.5666740
Malawi	0.5610069
Angola	0.5551344
Niger	0.5234742
Uganda	0.5182954
Mauritius	0.4950726
Gambia	0.4904879
Liberia	0.4878153
Benin	0.4871915
Kenya	0.4821280
Côte d’Ivoire	0.4800034
Tanzania	0.4787999
Burkina Faso	0.4749373
Zambia	0.4460212
Mali	0.4264943
Cameroon	0.4239547
Sudan	0.4005615
Namibia	0.3981573
Senegal	0.3902534
South Africa	0.3342683
Botswana	0.3146411
Tunisia	0.3109102
Cabo Verde	0.2881491
Mozambique	0.2793023
Gabon	0.2771796
Eswatini	0.2724921

Thank you for reading!

Season 3 Netflix GIF by Gilmore Girls - Find & Share on GIPHY

How to create a Regional Economic Communities dataset. PART TWO: consolidating string variables to dummy variables

Click here to read PART ONE of the blog series on creating the Regional Economic Communities dataset

How to create a Regional Economic Communities dataset. PART ONE: Scraping data with rvest in R

Packages we will need:

library(tidyverse)
library(countrycode)
library(WDI)

There are eight RECs in Africa. Some countries are only in one of the RECs, some are in many. Kenya is the winner with membership in four RECs: CEN-SAD, COMESA, EAC and IGAD.

In this blog, we will create a consolidated dataset for all 54 countries in Africa that are in a REC (or TWO or THREE or FOUR groups). Instead of a string variable for each group, we will create eight dummy group variables for each country.

To do this, we first make a vector of all the eight RECs.

patterns <- c("amu", "cen-sad", "comesa", "eac", "eccas", "ecowas", "igad", "sadc")

We put the vector of patterns in a for-loop to create a new binary variable column for each REC group.

We use the str_detect(rec_abbrev, pattern)) to see if the rec_abbrev column MATCHES the one of the above strings in the patterns vector.

The new variable will equal 1 if the variable string matches the pattern in the vector. Otherwise it will be equal to 0.

The double exclamation marks (!!) are used for unquoting, allowing the value of var_name to be treated as a variable name rather than a character string.

Then, we are able to create a variable name that were fed in from the vector dynamically into the for-loop. We can automatically do this for each REC group.

In this case, the iterated !!var_name will be replaced with the value stored in the var_name (AMU, CEN-SAD etc).

We can use the := to assign a new variable to the data frame.

The symbol := is called the “walrus operator” and we use it make or change variables without using quotation marks.

for (pattern in patterns) {
  var_name <- paste0(pattern, "_binary")
  rec <- rec %>%
    mutate(!!var_name := as.integer(str_detect(rec_abbrev, pattern)))
}

This is the dataset now with a binary variables indicating whether or not a country is in any one of the REC groups.

However, we quickly see the headache.

We do not want four rows for Kenya in the dataset. Rather, we only want one entry for each country and a 1 or a 0 for each REC.

We use the following summarise() function to consolidate one row per country.

rec %>%
group_by(country) %>%
  summarise(
    geo = first(geo),
    rec_abbrev = paste(rec_abbrev, collapse = ", "),
    across(ends_with("_binary"), ~ as.integer(any(. == 1)))) ->  rec_consolidated

The first() function extracts the first value in the geo variable for each country. This first() function is typically used with group_by() and summarise() to get the value from the first row of each group.

We use the the across() function to select all columns in the dataset that end with "_binary".

The ~ as.integer(any(. == 1)) checks if there’s any value equal to 1 within the binary variables. If they have a value of 1, the summarised data for each country will be 1; otherwise, it will be 0.

The following code can summarise each filtered group and add them to a new dataset that we can graph:

summ_group_pop <- function(my_df, filter_var, rec_name) {
  my_df %>%
    filter({{ filter_var }} == 1) %>% 
    summarize(total_population = sum(pop, na.rm = TRUE)) %>% 
    mutate(group = rec_name)
}

filter_vars <- c("amu_binary", "cen.sad_binary", "comesa_binary", 
                 "eac_binary", "eccas_binary", "ecowas_binary",
                 "igad_binary", "sadc_binary")
group_names <- c("AMU", "CEN-SAD", "COMESA", "EAC", "ECCAS", "ECOWAS",
                 "IGAD", "SADC")

rec_pop_summary <- data.frame()

for (i in seq_along(filter_vars)) {
  summary_df <- summ_group_pop(rec_wdi, !!as.name(filter_vars[i]), group_names[i])
  rec_pop_summary <- bind_rows(rec_pop_summary, summary_df)
}

And we graph it with geom_bar()

rec_pop_summary %>% 
  ggplot(aes(x = reorder(group, total_population),
             y = total_population)) + 
  geom_bar(stat = "identity", 
           width = 0.7, 
           color = "#0a85e5", 
           fill = "#0a85e5") +
  bbplot::bbc_style() + 
  scale_y_continuous(labels = scales::comma_format()) +
  coord_flip() + 
  labs(x = "RECs", 
       y = "Population", 
       title = "Population of REC groups in Africa", 
       subtitle = "Source: World Bank, 2022")

Cline Center Coup d’État Project Dataset

tailwind_palette <- c("001219","005f73","0a9396","94d2bd","e9d8a6","ee9b00","ca6702","bb3e03","ae2012","9b2226")

add_hashtag <- function(my_vec){
    hash_vec <-  paste0('#', my_vec)
    return(hash_vec)
  
  tailwind_hash <- add_hashtag(tailwind_palette)

Coups per million people in each REC

create_stateyears(system = "cow") %>% 
    add_democracy() -> demo

rec_wdi %>% 
  select(!c(year, country)) %>% 
  left_join(demo, by = c("cown" = "ccode")) %>% 
  filter(year > 1989) %>% 
  filter(amu_binary == 1) %>% 
  group_by(year) %>% 
  summarize(mean_demo = mean(v2x_polyarchy, na.rm = TRUE)) %>% 
  mutate(rec = "AMU") -> demo_amu

We can use a function to repeat the above code with all eight REC groups:

 create_demo_summary <- function(my_df, filter_var, group_name) {
  my_df %>%
    select(!c(year, country)) %>% 
    left_join(demo, by = c("cown" = "ccode")) %>% 
    filter(year > 1989) %>% 
    filter({{ filter_var }} == 1) %>% 
    group_by(year) %>% 
    summarize(mean_demo = mean(v2x_polyarchy, na.rm = TRUE)) %>% 
    mutate(rec = group_name)
}

rec_democracy_df <- data.frame()

for (i in seq_along(filter_vars)) {
  summary_df <- create_demo_summary(rec_wdi, !!as.name(filter_vars[i]), group_names[i])
  rec_democracy_df <- bind_rows(rec_democracy_df, summary_df)
}

And we graph out the average democracy scores cross the years

rec_democracy_df %>% 
  ggplot(aes(x = year, y = mean_demo, group = rec)) + 
  geom_line(aes(fill = rec, color = rec), size = 2, alpha = 0.8) + 
  # geom_point(aes(fill = rec, color = rec), size = 4) + 
  geom_label(data = . %>% group_by(rec) %>% filter(year == 2019), 
             aes(label = rec, 
                 fill = rec, 
                 x = 2019), color = "white",
             legend = FALSE, size = 3) +  
  scale_color_manual(values = tailwind_hash) + 
  scale_fill_manual(values = tailwind_hash) + 
  theme(legend.position = "none")

How to download OECD datasets in R

Packages we will need:

library(OECD)
library(tidyverse)
library(magrittr)
library(janitor)
library(devtools)
library(readxl)
library(countrycode)
library(scales)
library(ggflags)
library(bbplot)

In this blog post, we are going to look at downloading data from the OECD statsitics and data website.

The Organisation for Economic Co-operation and Development (OECD) provides analysis, and policy recommendations for 38 industrialised countries.

Angry Work GIF by Jess - Find & Share on GIPHY

The 38 countries in the OECD are:

Australia
Austria
Belgium
Canada
Chile
Colombia
Czech Republic
Denmark
Estonia
Finland
France
Germany

Hungary
Iceland
Ireland
Israel
Italy
Japan
South Korea
Latvia
Lithuania
Luxembourg
Mexico
Netherland

New Zealand
Norway
Poland
Portugal
Slovakia
Slovenia
Spain
Sweden
Switzerland
Turkey
United Kingdom
United States
European Union

We can download the OCED data package directly from the github repository with install_github()

install_github("expersso/OECD")
library(OECD)

The most comprehensive tutorial for the package comes from this github page. Mostly, it gives a fair bit more information about filtering data

We can look at the all the datasets that we can download from the website via the package with the following get_datasets() function:

titles <- OECD::get_datasets()

This gives us a data.frame with the ID and title for all the OECD datasets we can download into the R console, as we can see below.

In total there are 1662 datasets that we can download.

These datasets all have different variable types, countries, year spans and measurement values. So it is important to check each dataset carefully when we download them.

We can filter key phrases to subset datasets:

 titles %>%  
         filter(grepl("oda", title, ignore.case = TRUE)) %>% View

In this blog, we will graph out the Official Development Financing (ODF) for each country.

Official Development Financing measures the sum of RECEIVED (NOT DONATED) aid such as:

bilateral ODA aid
concessional and non-concessional resources from multilateral sources
bilateral other official flows made available for reasons unrelated to trade

Before we can charge into downloading any dataset, it is best to check out the variables it has. We can do that with the get_data_structure() function:

get_data_structure("REF_TOTAL_ODF") %>% 
       str(., max.level = 2)

 $ VAR_DESC       :'data.frame':	10 obs. of  2 variables:
  ..$ id         : chr [1:10] "RECIPIENT" "PART" "AMOUNTTYPE" "TIME" ...
  ..$ description: chr [1:10] "Recipient" "Part" "Amount type" "Year" ...

 $ RECIPIENT      :'data.frame':	301 obs. of  2 variables:
  ..$ id   : chr [1:301] "10200" "10100" "10010" "71" ...
  ..$ label: chr [1:301] "All Recipients, Total" "Developing Countries, Total" "Europe, Total" "Albania" ...

 $ PART           :'data.frame':	2 obs. of  2 variables:
  ..$ id   : chr [1:2] "1" "2"
  ..$ label: chr [1:2] "1 : Part I - Developing Countries" "2 : Part II - Countries in Transition"

 $ AMOUNTTYPE     :'data.frame':	2 obs. of  2 variables:
  ..$ id   : chr [1:2] "A" "D"
  ..$ label: chr [1:2] "Current Prices" "Constant Prices"

 $ TIME           :'data.frame':	62 obs. of  2 variables:
  ..$ id   : chr [1:62] "1960" "1961" "1962" "1963" ...
  ..$ label: chr [1:62] "1960" "1961" "1962" "1963" ...

We will clean up the ODF dataset with the clean_names() function from janitor package.

aid <- get_dataset("REF_TOTAL_ODF")  %>% 
  janitor::clean_names()  %>%
  select(recipient, aid = obs_value, time)

One problem with this dataset is that we only have the DAC country codes in this dataset.

We will need to read in and merge the country code variables into the aid dataset.

dac_code Download

dac_code <- readxl::read_excel(file.choose())

We can then clean up the DAC codes to merge with the aid data.

dac_code %<>%
    janitor::clean_names()  %>% 
    mutate(cown = countrycode(recipient_name_e, "country.name", "cown")) %>% 
    select(recipient_code,
         year, 
         cown,
         country = recipient_name_e,
         group_id, 
         dev_group = group_name_e,
         p_group = group_name_f,
         wb_group)

And merge with left_join()

aid %<>% 
  mutate(recipient_code = parse_number(recipient)) %>%  
  left_join(dac_code, by = c("recipient_code" = "recipient_code"))

Next we can sum up the aid that each country received since 2000.

aid %>% 
  filter(year > 1999) %>%  
  filter(!is.na(country)) %>% 
  mutate(aid = parse_number(aid)) %>% 
  mutate(country = case_when(country == "Syrian Arab Republic" ~ "Syria", 
                             country == "T?rkiye" ~ "Turkey",
                             country == "China (People's Republic of)" ~ "China",
                             country == "Democratic Republic of the Congo" ~ "DR Congo",
                             TRUE ~ as.character(country))) %>% 
  group_by(country) %>% 
  summarise(total_aid = sum(aid, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(iso2 = tolower(countrycode(country, "country.name", "iso2c"))) %>% 
  filter(total_aid > 150000) %>% 
  ggplot(aes(x = reorder(country, total_aid),
             y = total_aid)) + 
  geom_bar(stat = "identity", 
           width = 0.7, 
           color = "#0a85e5", 
           fill = "#0a85e5") +
  ggflags::geom_flag(aes(x = country, y = -1, country = iso2), size = 8) +
  bbplot::bbc_style() + 
  scale_y_continuous(labels = scales::comma_format()) +
  coord_flip() + 
  labs(x = "ODA received", y = "", title = "Official Development Financing (ODF)", subtitle = "OECD DAC (2000 - 2021)")

The TIME_FORMAT can be any of the following types:

‘P1Y’ for annual
‘P6M’ for bi-annual
‘P3M’ for quarterly
‘P1M’ for monthly data.

To access each countries in the datasets, we can use the following codes

oecd_ios3 <- c("AUS", "AUT", "BEL", "CAN", "CHL", "COL", "CZE",
               "DNK", "EST", "FIN", "FRA", "DEU", "GRC", "HUN",
               "ISL", "IRL", "ISR", "ITA", "JPN", "KOR", "LVA", 
               "LTU", "LUX", "MEX", "NLD", "NZL", "NOR", "POL",
               "PRT", "SVK", "SVN", "ESP", "SWE", "CHE", "TUR",
               "GBR", "USA")

Alternatively, we can use only the EU countries that are in the OECD.

eu_oecd_iso3 <- c("AUT", "BEL", "CZE", "DNK", "EST", "FIN", 
                  "FRA", "DEU", "GRC", "HUN", "IRL", "ITA",
                  "LVA", "LTU", "LUX", "NLD", "POL", "PRT",
                  "SVK", "SVN", "ESP", "SWE")

sal_raw %>% 
  janitor::clean_names() %>% 
  filter(age == "Y25T64") %>% 
  filter(grade == "TE") %>% 
  filter(indicator == "NAT_ACTL_YR") %>% 
  filter(isc11 == "L1") %>% 
  filter(sex == "T") %>% 
  select(country, year = time, obs_value) -> sal

Create a dataset of Irish parliament members

library(rvest)
library(tidyverse)
library(toOrdinal)
library(magrittr)
library(genderizeR)
library(stringi)

This blogpost will walk through how to scrape and clean up data for all the members of parliament in Ireland.

Or we call them in Irish, TDs (or Teachtaí Dála) of the Dáil.

We will start by scraping the Wikipedia pages with all the tables. These tables have information about the name, party and constituency of each TD.

On Wikipedia, these datasets are on different webpages.

This is a pain.

However, we can get around this by creating a list of strings for each number in ordinal form – from1st to 33rd. (because there have been 33 Dáil sessions as of January 2023)

We don’t need to write them all out manually: “1st”, “2nd”, “3rd” … etc.

Instead, we can do this with the toOrdinal() function from the package of the same name.

dail_sessions <- sapply(1:33,toOrdinal)

Next we can feed this vector of strings with the beginning of the HTML web address for Wikipedia as a string.

We paste the HTML string and the ordinal number strings together with the stri_paste() function from the stringi package.

This iterates over the length of the dail_sessions vector (in this case a length of 33) and creates a vector of each Wikipedia page URL.

dail_wikipages <- stri_paste("https://en.wikipedia.org/wiki/Members_of_the_",
           dail_sessions, "_D%C3%A1il")

Now, we can take the most recent Dáil session Wikipedia page and take the fifth table on the webpage using `[[`(5)

We rename the column names with select().

And the last two mutate() lines reomve the footnote numbers in ( ) [ ] brackets from the party and name variables.

dail_wikipages[33] %>%  
  read_html() %>%
  html_table(header = TRUE, fill = TRUE) %>% 
  `[[`(5) %>% 
  rename("ble" = 1, "party" = 2, "name" = 3, "constituency" = 4) %>% 
  select(-ble) %>% 
  mutate(party = gsub(r"{\s*\([^\)]+\)}","",as.character(party))) %>% 
  mutate(name = sub("\\[.*", "", name)) -> dail_33

Last we delete the first row. That just contais a duplicate of the variable names.

dail_33 <- dail_33[-1,]

We want to delete the fadas (long accents on Irish words). We can do this across all the character variables with the across() function.

The stri_trans_general() converts all strings to LATIN ASCII, which turns string to contain only the letters in the English language alphabet.

dail_33 %<>% 
  mutate(across(where(is.character), ~ stri_trans_general(., id = "Latin-ASCII")))

We can also separate the first name from the second names of all the TDs and create two variables with mutate() and separate()

dail_33 %<>% 
  mutate(name = str_replace(name, "\\s", "|")) %>% 
  separate(name, into = c("first_name", "last_name"), sep = "\\|")

With the first_name variable, we can use the new pacakge by Kalimu. This guesses the gender of the name. Later, we can track the number of women have been voted into the Dail over the years.

Of course, this will not be CLOSE to 100% correct … so later we will have to check each person manually and make sure they are accurate.

devtools::install_github("kalimu/genderizeR")

gender = findGivenNames(dail_33$name, progress = TRUE)

gender %>% 
  select(probability, gender)  -> gen_variable

gen_variable %<>% 
  select(name, gender) %>% 
  mutate(name = str_to_sentence(name))

dail_33 %<>% 
  left_join(gen_variable, by = "name")

Create date variables and decade variables that we can play around with.

dail_df$date_2 <- as.Date(dail_df$date, "%Y-%m-%d")

dail_df$year <- format(dail_df$date_2, "%Y")

dail_df$month <- format(dail_df$date_2, "%b")

dail_df %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s"))

In the next blog, we will graph out the various images to explore these data in more depth. For example, we can make a circle plot with the composition of the current Dail with the ggparliament package.

We can go into more depth with it in the next blog… Stay tuned.

Convert event-level data to panel-level data with tidyr in R

Packages we will need:

library(tidyverse)
library(magrittr)
library(lubridate)
library(tidyr)
library(rvest)
library(janitor)

In this post, we are going to scrape NATO accession data from Wikipedia and turn it into panel data. This means turning a list of every NATO country and their accession date into a time-series, cross-sectional dataset with information about whether or not a country is a member of NATO in any given year.

This is helpful for political science analysis because simply a dummy variable indicating whether or not a country is in NATO would lose information about the date they joined. The UK joined NATO in 1948 but North Macedonia only joined in 2020. A simple binary variable would not tell us this if we added it to our panel data.

Consoling 30 Rock GIF - Find & Share on GIPHY

We will first scrape a table from the Wikipedia page on NATO member states with a few functions form the rvest pacakage.

Click here to read more about the rvest package:

Scrape NATO defense expenditure data from Wikipedia with the rvest package in R

nato_members <- read_html("https://en.wikipedia.org/wiki/Member_states_of_NATO")

nato_tables <- nato_members %>% html_table(header = TRUE, fill = TRUE)

nato_member_joined <- nato_tables[[1]]

We have information about each country and the date they joined. In total there are 30 rows, one for each member of NATO.

Next we are going to clean up the data, remove the numbers in the [square brackets], and select the columns that we want.

A very handy function from the janitor package cleans the variable names. They are lower_case_with_underscores rather than how they are on Wikipedia.

Next we remove the square brackets and their contents with sub("\\[.*", "", insert_variable_name)

And the accession date variable is a bit tricky because we want to convert it to date format, extract the year and convert back to an integer.

nato_member_joined %<>% 
  clean_names() %>% 
  select(country = member_state, 
         accession = accession_3) %>% 
  mutate(member_2020 = 2020,
         country = sub("\\[.*", "", country),
         accession = sub("\\[.*", "", accession),
         accession = parse_date_time(accession, "dmy"),
         accession = format(as.Date(accession, format = "%d/%m/%Y"),"%Y"),
         accession = as.numeric(as.character(accession)))

When we have our clean data, we will pivot the data to longer form. This will create one event column that has a value of accession or member in 2020.

This gives us the start and end year of our time variable for each country.

nato_member_joined %<>% 
  pivot_longer(!country, names_to = "event", values_to = "year")

Our dataset now has 60 observations. We see Albania joined in 2009 and is still a member in 2020, for example.

Next we will use the complete() function from the tidyr package to fill all the dates in between 1948 until 2020 in the dataset. This will increase our dataset to 2,160 observations and a row for each country each year.

Nect we will group the dataset by country and fill the nato_member status variable down until the most recent year.

nato_member_joined %<>% 
  mutate(year = as.Date(as.character(year), format = "%Y")) %>% 
  mutate(year = ymd(year)) %>% 
  complete(country, year = seq.Date(min(year), max(year), by = "year")) %>% 
  mutate(nato_member = ifelse(event == "accession", 1, 
                              ifelse(event == "member_2020", 1, 0))) %>% 
  group_by(country) %>% 
  fill(nato_member, .direction = "down") %>%
  ungroup()

Last, we will use the ifelse() function to mutate the event variable into one of three categories: 'accession‘, 'member‘ or ‘not member’.

nato_member_joined %>%
  mutate(nato_member = replace_na(nato_member, 0),
         year = parse_number(as.character(year)),
         event = ifelse(nato_member == 0, "not member", event),
         event = ifelse(nato_member == 1 & is.na(event), "member", event),
         event = ifelse(event == "member_2020", "member", event))  %>% 
  distinct(country, year, .keep_all = TRUE) -> nato_panel

High Five 30 Rock GIF - Find & Share on GIPHY

Ethnicity Dataset

epr_indo <- read_csv('/mnt/data/epr_indo.csv')

# Expand the data to have a row for each year for each group
epr_indo_expanded <- epr_indo %>%
  rowwise() %>%
  mutate(year = list(seq(from, to))) %>% 
  unnest(year) %>%
  select(-from, -to)

# Pivot the data to wide format with separate ethnicity, share, and broad_cat columns
epr_indo_wide <- epr_indo_expanded %>%
  group_by(statename, year) %>%
  mutate(index = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = c(statename, year),
    names_from = index,
    values_from = c(group, size, broad_cat),
    names_sep = "_"
  ) %>%
  # Renaming the columns for ethnicity, share, and broad category
  rename_with(~ str_replace(., "group_", "ethnicity_"), starts_with("group_")) %>%
  rename_with(~ str_replace(., "size_", "share_"), starts_with("size_")) %>%
  rename_with(~ str_replace(., "broad_cat_", "broad_cat_"), starts_with("broad_cat_"))

Exploratory Data Analysis and Descriptive Statistics for Political Science Research in R

Packages we will use:

library(tidyverse)      # of course
library(ggridges)       # density plots
library(GGally)         # correlation matrics
library(stargazer)      # tables
library(knitr)          # more tables stuff
library(kableExtra)     # more and more tables
library(ggrepel)        # spread out labels
library(ggstream)       # streamplots
library(bbplot)         # pretty themes
library(ggthemes)       # more pretty themes
library(ggside)         # stack plots side by side
library(forcats)        # reorder factor levels

Before jumping into any inferentional statistical analysis, it is helpful for us to get to know our data.

That always means plotting and visualising the data and looking at the spread, the mean, distribution and outliers in the dataset.

Before we plot anything, a simple package that creates tables in the stargazer package. We can examine descriptive statistics of the variables in one table.

Click here to read this practically exhaustive cheat sheet for the stargazer package by Jake Russ. I refer to it at least once a week. Thank you, Jack.

https://www.jakeruss.com/cheatsheets/stargazer/

I want to summarise a few of the stats, so I write into the summary.stat() argument the number of observations, the mean, median and standard deviation.

The kbl() and kable_classic() will change the look of the table in R (or if you want to copy and paste the code into latex with the type = "latex" argument).

In HTML, they do not appear.

Seth Meyers Ok GIF by Late Night with Seth Meyers - Find & Share on GIPHY

To find out more about the knitr kable tables, click here to read the cheatsheet by Hao Zhu.

Choose the variables you want, put them into a data.frame and feed them into the stargazer() function

stargazer(my_df_summary, 
          covariate.labels = c("Corruption index",
                               "Civil society strength", 
                               'Rule of Law score',
                               "Physical Integerity Score",
                               "GDP growth"),
          summary.stat = c("n", "mean", "median", "sd"), 
          type = "html") %>% 
  kbl() %>% 
  kable_classic(full_width = F, html_font = "Times", font_size = 25)


Statistic	N	Mean	Median	St. Dev.

Corruption index	179	0.477	0.519	0.304
Civil society strength	179	0.670	0.805	0.287
Rule of Law score	173	7.451	7.000	4.745
Physical Integerity Score	179	0.696	0.807	0.284
GDP growth	163	0.019	0.020	0.032

Next, we can create a barchart to look at the different levels of variables across categories. We can look at the different regime types (from complete autocracy to liberal democracy) across the six geographical regions in 2018 with the geom_bar().

my_df %>% 
  filter(year == 2018) %>%
  ggplot() +
  geom_bar(aes(as.factor(region),
               fill = as.factor(regime)),
           color = "white", size = 2.5) -> my_barplot

And we can add more theme changes

my_barplot + bbplot::bbc_style() + 
  theme(legend.key.size = unit(2.5, 'cm'),
        legend.text = element_text(size = 15),
        text = element_text(size = 15)) +
  scale_fill_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c")) + 
  scale_color_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c"))

This type of graph also tells us that Sub-Saharan Africa has the highest number of countries and the Middle East and North African (MENA) has the fewest countries.

However, if we want to look at each group and their absolute percentages, we change one line: we add geom_bar(position = "fill"). For example we can see more clearly that over 50% of Post-Soviet countries are democracies ( orange = electoral and blue = liberal democracy) as of 2018.

We can also check out the density plot of democracy levels (as a numeric level) across the six regions in 2018.

With these types of graphs, we can examine characteristics of the variables, such as whether there is a large spread or normal distribution of democracy across each region.

my_df %>% 
  filter(year == 2018) %>%
  ggplot(aes(x = democracy_score, y = region, fill = regime)) +
  geom_density_ridges(color = "white", size = 2, alpha = 0.9, scale = 2) -> my_density_plot

And change the graph theme:

my_density_plot + bbplot::bbc_style() + 
  theme(legend.key.size = unit(2.5, 'cm')) +
  scale_fill_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c")) + 
  scale_color_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c"))

Click here to read more about the ggridges package and click here to read their CRAN PDF.

Next, we can also check out Pearson’s correlations of some of the variables in our dataset. We can make these plots with the GGally package.

The ggpairs() argument shows a scatterplot, a density plot and correlation matrix.

my_df %>%
  filter(year == 2018) %>%
  select(regime, 
         corruption, 
         civ_soc, 
         rule_law, 
         physical, 
         gdp_growth) %>% 
  ggpairs(columns = 2:5, 
          ggplot2::aes(colour = as.factor(regime), 
          alpha = 0.9)) + 
  bbplot::bbc_style() +
  scale_fill_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c")) + 
  scale_color_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c"))

Click here to read more about the GGally package and click here to read their CRAN PDF.

We can use the ggside package to stack graphs together into one plot.

There are a few arguments to add when we choose where we want to place each graph.

For example, geom_xsideboxplot(aes(y = freedom_house), orientation = "y") places a boxplot for the three Freedom House democracy levels on the top of the graph, running across the x axis. If we wanted the boxplot along the y axis we would write geom_ysideboxplot(). We add orientation = "y" to indicate the direction of the boxplots.

Next we indiciate how big we want each graph to be in the panel with theme(ggside.panel.scale = .5) argument. This makes the scatterplot take up half and the boxplot the other half. If we write .3, the scatterplot takes up 70% and the boxplot takes up the remainning 30%. Last we indicade scale_xsidey_discrete() so the graph doesn’t think it is a continuous variable.

We add Darjeeling Limited color palette from the Wes Anderson movie.

Click here to learn about adding Wes Anderson theme colour palettes to graphs and plots.

my_df %>%
 filter(year == 2018) %>% 
 filter(!is.na(fh_number)) %>% 
  mutate(freedom_house = ifelse(fh_number == 1, "Free", 
         ifelse(fh_number == 2, "Partly Free", "Not Free"))) %>%
  mutate(freedom_house = forcats::fct_relevel(freedom_house, "Not Free", "Partly Free", "Free")) %>% 
ggplot(aes(x = freedom_from_torture, y = corruption_level, colour = as.factor(freedom_house))) + 
  geom_point(size = 4.5, alpha = 0.9) +
  geom_smooth(method = "lm", color ="#1d3557", alpha = 0.4) +  
  geom_xsideboxplot(aes(y = freedom_house), orientation = "y", size = 2) +
  theme(ggside.panel.scale = .3) +
  scale_xsidey_discrete() +
  bbplot::bbc_style() + 
  facet_wrap(~region) + 
  scale_color_manual(values= wes_palette("Darjeeling1", n = 3))

The next plot will look how variables change over time.

We can check out if there are changes in the volume and proportion of a variable across time with the geom_stream(type = "ridge") from the ggstream package.

In this instance, we will compare urban populations across regions from 1800s to today.

my_df %>% 
  group_by(region, year) %>% 
  summarise(mean_urbanization = mean(urban_population_percentage, na.rm = TRUE)) %>% 
  ggplot(aes(x = year, y = mean_urbanization, fill = region)) +
  geom_stream(type = "ridge") -> my_streamplot

And add the theme changes

  my_streamplot + ggthemes::theme_pander() + 
  theme(
legend.title = element_blank(),
        legend.position = "bottom",
        legend.text = element_text(size = 25),
        axis.text.x = element_text(size = 25),
        axis.title.y = element_blank(),
        axis.title.x = element_blank()) +
  scale_fill_manual(values = c("#001219",
                               "#0a9396",
                               "#e9d8a6",
                               "#ee9b00", 
                               "#ca6702",
                               "#ae2012"))

Click here to read more about the ggstream package and click here to read their CRAN PDF.

We can also look at interquartile ranges and spread across variables.

We will look at the urbanization rate across the different regions. The variable is calculated as the ratio of urban population to total country population.

Before, we will create a hex color vector so we are not copying and pasting the colours too many times.

my_palette <- c("#1d3557",
                "#0a9396",
                "#e9d8a6",
                "#ee9b00", 
                "#ca6702",
                "#ae2012")

We use the facet_wrap(~year) so we can separate the three years and compare them.

my_df %>% 
  filter(year == 1980 | year == 1990 | year == 2000)  %>% 
  ggplot(mapping = aes(x = region, 
                       y = urban_population_percentage, 
                       fill = region)) +
  geom_jitter(aes(color = region),
              size = 3, alpha = 0.5, width = 0.15) +
  geom_boxplot(alpha = 0.5) + facet_wrap(~year) + 
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette) + 
  coord_flip() + 
  bbplot::bbc_style()

If we want to look more closely at one year and print out the country names for the countries that are outliers in the graph, we can run the following function and find the outliers int he dataset for the year 1990:

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

We can then choose one year and create a binary variable with the function

my_df_90 <- my_df %>% 
  filter(year == 1990) %>% 
  filter(!is.na(urban_population_percentage))

my_df_90$my_outliers <- is_outlier(my_df_90$urban_population_percentage)

And we plot the graph:

my_df_90 %>% 
  ggplot(mapping = aes(x = region, y = urban_population_percentage, fill = region)) +
  geom_jitter(aes(color = region), size = 3, alpha = 0.5, width = 0.15) +
  geom_boxplot(alpha = 0.5) +
  geom_text_repel(data = my_df_90[which(my_df_90$my_outliers == TRUE),],
            aes(label = country_name), size = 5) + 
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette) + 
  coord_flip() + 
  bbplot::bbc_style()

In the next blog post, we will look at t-tests, ANOVAs (and their non-parametric alternatives) to see if the difference in means / medians is statistically significant and meaningful for the underlying population.

Bo Burnham What GIF - Find & Share on GIPHY

Comparing mean values across OECD countries with ggplot

Packages we will need:

library(tidyverse)
library(magrittr) # for pipes
library(ggrepel) # to stop overlapping labels
library(ggflags)
library(countrycode) # if you want create the ISO2C variable

I came across code for this graph by Tanya Shapiro on her github for #TidyTuesday.

Her graph compares Dr. Who actors and their average audience rating across their run as the Doctor on the show. So I have very liberally copied her code for my plot on OECD countries.

That is the beauty of TidyTuesday and the ability to be inspired and taught by other people’s code.

I originally was going to write a blog about how to download data from the OECD R package. However, my attempts to download the data leads to an unpleasant looking error and ends the donwload request.

I will try to work again on that blog in the future when the package is more established.

So, instead, I went to the OECD data website and just directly downloaded data on level of trust that citizens in each of the OECD countries feel about their governments.

Then I cleaned up the data in excel and used countrycode() to add ISO2 and country name data.

Click here to read more about the countrycode() package.

First I will only look at EU countries. I tried with all the countries from the OECD but it was quite crowded and hard to read.

I add region data from another dataset I have. This step is not necessary but I like to colour my graphs according to categories. This time I am choosing geographic regions.

my_df %<>%
  mutate(region = case_when(
    e_regiongeo == 1 ~ "Western",
    e_regiongeo == 2  ~ "Northern",
    e_regiongeo == 3  ~ "Southern", 
    e_regiongeo == 4  ~ "Eastern"))

To make the graph, we need two averages:

The overall average trust level for all countries (avg_trust) and
The average for each country across the years (country_avg_trust),

my_df %<>% 
  mutate(avg_trust = mean(trust, na.rm = TRUE)) %>% 
  group_by(country_name) %>% 
  mutate(country_avg_trust = mean(trust, na.rm = TRUE)) %>%
  ungroup()

When we plot the graph, we need a few geom arguments.

Along the x axis we have all the countries, and reorder them from most trusting of their goverments to least trusting.

We will color the points with one of the four geographic regions.

We use geom_jitter() rather than geom_point() for the different yearly trust values to make the graph a little more interesting.

I also make the sizes scaled to the year in the aes() argument. Again, I did this more to look interesting, rather than to convey too much information about the different values for trust across each country. But smaller circles are earlier years and grow larger for each susequent year.

The geom_hline() plots a vertical line to indicate the average trust level for all countries.

We then use the geom_segment() to horizontally connect the country’s individual average (the yend argument) to the total average (the y arguement). We can then easily see which countries are above or below the total average. The x and xend argument, we supply the country_name variable twice.

Next we use the geom_flag(), which comes from the ggflags package. In order to use this package, we need the ISO 2 character code for each country in lower case!

Click here to read more about the ggflags package.

my_df %>%
ggplot(aes(x = reorder(country_name, trust_score), y = trust_score, color = as.factor(region))) +
geom_jitter(aes(color = as.factor(region), size = year), alpha = 0.7, width = 0.15) +
geom_hline(aes(yintercept = avg_trust), color = "white", size = 2)+
geom_segment(aes(x = country_name, xend = country_name, y = country_avg_trust, yend = avg_trust), size = 2, color = "white") +
ggflags::geom_flag(aes(x = country_name, y = country_avg_trust, country = iso2), size = 10) + 
  coord_flip() + 
  scale_color_manual(values = c("#9a031e","#fb8b24","#5f0f40","#0f4c5c")) -> my_plot

Last we change the aesthetics of the graph with all the theme arguments!

my_plot +
 theme(panel.border = element_blank(),
        legend.position = "right",
        legend.title = element_blank(),
        legend.text = element_text(size = 20),
        legend.background = element_rect(fill = "#5e6472"),
        axis.title = element_blank(),
        axis.text = element_text(color = "white", size = 20),
        text= element_text(size = 15, color = "white"),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        legend.key = element_rect(fill = "#5e6472"),
        plot.background = element_rect(fill = "#5e6472"),
        panel.background = element_rect(fill = "#5e6472")) +
  guides(colour = guide_legend(override.aes = list(size=10)))

And that is the graph.

We can see that countries in southern Europe are less trusting of their governments than in other regions. Western countries seem to occupy the higher parts of the graph, with France being the least trusting of their government in the West.

There is a large variation in Northern countries. However, if we look at the countries, we can see that the Scandinavian countries are more trusting and the Baltic countries are among the least trusting. This shows they are more similar in their trust levels to other Post-Soviet countries.

Next we can look into see if there is a relationship between democracy scores and level of trust in the goverment with a geom_point() scatterplot

The geom_smooth() argument plots a linear regression OLS line, with a standard error bar around.

We want the labels for the country to not overlap so we use the geom_label_repel() from the ggrepel package. We don’t want an a in the legend, so we add show.legend = FALSE to the arguments


my_df %>% 
  filter(!is.na(trust_score)) %>% 
  ggplot(aes(x = democracy_score, y = trust_score)) +
  geom_smooth(method = "lm", color = "#a0001c", size = 3) +
  geom_point(aes(color = as.factor(region)), size = 20, alpha = 0.6) +
 geom_label_repel(aes(label = country_name, color = as.factor(region)), show.legend = FALSE, size = 5) + 
scale_color_manual(values = c("#9a031e","#fb8b24","#5f0f40","#0f4c5c")) -> scatter_plot

And we change the theme and add labels to the plot.

scatter_plot + theme(panel.border = element_blank(),
        legend.position = "bottom",
        legend.title = element_blank(),
        legend.text = element_text(size = 20),
        legend.background = element_rect(fill = "#5e6472"),
        text= element_text(size = 15, color = "white"),

        legend.key = element_rect(fill = "#5e6472"),
        plot.background = element_rect(fill = "#5e6472"),
        panel.background = element_rect(fill = "#5e6472")) +
  guides(colour = guide_legend(override.aes = list(size=10)))  +
  labs(title = "Democracy and trust levels", 
       y = "Democracy score",
       x = "Trust level of respondents",
       caption="Data from OECD")

We can filter out the two countries with low democracy and examining the consolidated democracies.

Thank you for reading!

Pop Tv Comedy GIF by Schitt's Creek - Find & Share on GIPHY

Building a dataset for political science analysis in R, PART 2

Packages we will need

library(tidyverse)
library(peacesciencer)
library(countrycode)
library(bbplot)

The main workhorse of this blog is the peacesciencer package by Stephen Miller!

The package will create both dyad datasets and state datasets with all sovereign countries.

Thank you Mr Miller!

There are heaps of options and variables to add.

Go to the page to read about them all in detail.

Here is a short list from the package description of all the key variables that can be quickly added:

We create the dyad dataset with the create_dyadyears() function. A dyad-year dataset focuses on information about the relationship between two countries (such as whether the two countries are at war, how much they trade together, whether they are geographically contiguous et cetera).

In the literature, the study of interstate conflict has adopted a heavy focus on dyads as a unit of analysis.

Alternatively, if we want just state-year data like in the previous blog post, we use the function create_stateyears()

We can add the variables with type D to the create_dyadyears() function and we can add the variables with type S to the create_stateyears() !

Focusing on the create_dyadyears() function, the arguments we can include are directed and mry.

The directed argument indicates whether we want directed or non-directed dyad relationship.

In a directed analysis, data include two observations (i.e. two rows) per dyad per year (such as one for USA – Russia and another row for Russia – USA), but in a nondirected analysis, we include only one observation (one row) per dyad per year.

The mry argument indicates whether they want to extend the data to the most recently concluded calendar year – i.e. 2020 – or not (i.e. until the data was last available).

dyad_df <- create_dyadyears(directed = FALSE, mry = TRUE) %>%
  add_atop_alliance() %>%  
  add_nmc() %>%
  add_cow_trade() %>% 
  add_creg_fractionalization()

I added dyadic variables for the

ATOP alliances,
national military capabilities (NMC) ,
Correlates of War trade data and
Composition of Religious and Ethnic Groups (CREG) ethnicity fractionalization data.

You can follow these links to check out the codebooks if you want more information about descriptions about each variable and how the data were collected!

The code comes with the COW code but I like adding the actual names also!

dyad_df$country_1 <- countrycode(dyad_df$ccode1, "cown", "country.name")

With this dataframe, we can plot the CINC data of the top three superpowers, just looking at any variable that has a 1 at the end and only looking at the corresponding country_1!

According to our pals over at le Wikipedia, the Composite Index of National Capability (CINC) is a statistical measure of national power created by J. David Singer for the Correlates of War project in 1963. It uses an average of percentages of world totals in six different components (such as coal consumption, military expenditure and population). The components represent demographic, economic, and military strength

First, let’s choose some nice hex colors

pal <- c("China" = "#DE2910",
         "United States" = "#3C3B6E", 
         "Russia" = "#FFD900")

And then create the plot

dyad_df %>% 
 filter(country_1 == "Russia" | 
          country_1 == "United States" | 
          country_1 == "China") %>% 
  ggplot(aes(x = year, y = cinc1, group = as.factor(country_1))) +
  geom_line(aes(color = country_1)) +
  geom_line(aes(color = country_1), size = 2, alpha = 0.8) + 
  scale_color_manual(values =  pal) +
  bbplot::bbc_style()

In PART 3, we will merge together our data with our variables from PART 1, look at some descriptive statistics and run some panel data regression analysis with our different variables!

Create a correlation matrix with GGally package in R

We can create very informative correlation matrix graphs with one function.

Packages we will need:

library(GGally)
library(bbplot) #for pretty themes

First, choose some nice hex colors.

my_palette <- c("#005D8F", "#F2A202")

Happy Friends GIF by netflixlat - Find & Share on GIPHY

Next, we can go create a dichotomous factor variable and divide the continuous “freedom from torture scale” variable into either above the median or below the median score. It’s a crude measurement but it serves to highlight trends.

Blue means the country enjoys high freedom from torture. Yellow means the county suffers from low freedom from torture and people are more likely to be tortured by their government.

Then we feed our variables into the ggpairs() function from the GGally package.

I use the columnLabels to label the graphs with their full names and the mapping argument to choose my own color palette.

I add the bbc_style() format to the corr_matrix object because I like the font and size of this theme. And voila, we have our basic correlation matrix (Figure 1).

corr_matrix <- vdem90 %>% 
  dplyr::mutate(
    freedom_torture = ifelse(torture >= 0.65, "High", "Low"),
    freedom_torture = as.factor(freedom_t))
  dplyr::select(freedom_torture, civil_lib, class_eq) %>% 
  ggpairs(columnLabels = c('Freedom from Torture', 'Civil Liberties', 'Class Equality'), 
    mapping = ggplot2::aes(colour = freedom_torture)) +
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette)

corr_matrix + bbplot::bbc_style()

Excited Season 4 GIF by Friends - Find & Share on GIPHY

First off, in Figure 2 we can see the centre plots in the diagonal are the distribution plots of each variable in the matrix

In Figure 3, we can look at the box plot for the ‘civil liberties index’ score for both high (blue) and low (yellow) ‘freedom from torture’ categories.

The median civil liberties score for countries in the high ‘freedom from torture’ countries is far higher than in countries with low ‘freedom from torture’ (i.e. citizens in these countries are more likely to suffer from state torture). The spread / variance is also far great in states with more torture.

In Figur 4, we can focus below the diagonal and see the scatterplot between the two continuous variables – civil liberties index score and class equality index scores.

We see that there is a positive relationship between civil liberties and class equality. It looks like a slightly U shaped, quadratic relationship but a clear relationship trend is not very clear with the countries with higher torture prevalence (yellow) showing more randomness than the countries with high freedom from torture scores (blue).

Saying that, however, there are a few errant blue points as outliers to the trend in the plot.

The correlation score is also provided between the two categorical variables and the correlation score between civil liberties and class equality scores is 0.52.

Examining at the scatterplot, if we looked only at countries with high freedom from torture, this correlation score could be higher!

Add weights to survey data with survey package in R: Part 2

Click here to read why need to add pspwght and pweight to the ESS data in Part 1.

Packages we will need:

library(survey)
library(srvy)
library(stargazer)
library(gtsummary)
library(tidyverse)

Click here to learn how to access and download ESS round data for the thirty-ish European countries (depending on the year).

So with the essurvey package, I have downloaded and cleaned up the most recent round of the ESS survey, conducted in 2018.

We will examine the different demographic variables that relate to levels of trust in politicians across 29 European countries (education level, gender, age et cetera).

Before we create the survey weight objects, we can first make a bar chart to look at the different levels of trust in the different countries.

We can use the cut() function to divide the 10-point scale into three groups of “low”, “mid” and “high” levels of trust in politicians.

I also choose traffic light hex colors in color_palette vector and add full country names with countrycode() so it’s easier to read the graph

color_palette <- c("1" = "#f94144", "2" = "#f8961e", "3" = "#43aa8b")

round9$country_name <- countrycode(round9$country, "iso2c", "country.name")

trust_graph <- round9 %>% 
  dplyr::filter(!is.na(trust_pol)) %>% 
  dplyr::mutate(trust_category = cut(trust_pol, 
                                     breaks=c(-Inf, 3, 7, Inf), 
                                     labels=c(1,2,3))) %>% 
  mutate(trust_category = as.numeric(trust_category)) %>% 
  mutate(trust_pol_fac = as.factor(trust_category)) %>%
  ggplot(aes(x = reorder(country_name, trust_category))) +
  geom_bar(aes(fill = trust_pol_fac), 
               position = "fill") +
  bbplot::bbc_style() +
  coord_flip() 

trust_graph <- trust_graph + scale_fill_manual(values= color_palette, 
                                      name="Trust level",
                                      breaks=c(1,2,3),
                                      labels=c("Low", "Mid", "High"))

The graph lists countries in descending order according to the percentage of sampled participants that indicated they had low trust levels in politicians.

The respondents in Croatia, Bulgaria and Spain have the most distrust towards politicians.

Croatians when they see politicians

For this example, I want to compare different analyses to see what impact different weights have on the coefficient estimates and standard errors in the regression analyses:

with no weights (dEfIniTelYy not recommended by ESS)
with post-stratification weights only (not recommended by ESS) and
with the combined post-strat AND population weight (the recommended weighting strategy according to ESS)

First we create two special svydesign objects, with the survey package. To create this, we need to add a squiggly ~ symbol in front of the variables (Google tells me it is called a tilde).

The ids argument takes the cluster ID for each participant.

psu is a numeric variable that indicates the primary sampling unit within which the respondent was selected to take part in the survey. For example in Ireland, this refers to the particular electoral division of each participant.

The strata argument takes the numeric variable that codes which stratum each individual is in, according to the type of sample design each country used.

The first svydesign object uses only post-stratification weights: pspwght

Finally we need to specify the nest argument as TRUE. I don’t know why but it throws an error message if we don’t …

post_design <- svydesign(ids = ~psu, 
                         strata = ~stratum, 
                         weights = ~pspwght
                         data = round9, 
                         nest = TRUE)

To combine the two weights, we can multiply them together and store them as full_weight. We can then use that in the svydesign function

r2$full_weight <- r2$pweight*r2$pspwght
 

full_design <- svydesign(ids = ~psu, 
                         strata = ~stratum, 
                         weights = ~full_weight,
                         data = round9, 
                         nest = TRUE)
class(full_design)

With the srvyr package, we can convert a “survey.design” class object into a “tbl_svy” class object, which we can then use with tidyverse functions.

full_tidy_design <- as_survey(full_design)
class(full_tidy_design)

Click here to read the CRAN PDF for the srvyr package.

We can first look at descriptive statistics and see if the values change because of the inclusion of the weighted survey data.

First, we can compare the means of the survey data with and without the weights.

We can use the gtsummary package, which creates tables with tidyverse commands. It also can take a survey object

library(gtsummary)
round9 %>% select(trust_pol, trust_pol, age, edu_years, gender, religious, left_right, rural_urban) %>% 
  tbl_summary(include = c(trust_pol, age, edu_years, gender, religious, left_right, rural_urban),
                 statistic = list(all_continuous() ~"{mean} ({sd})"))

And we look at the descriptive statistics with the full_design weights:

full_design %>% 
  tbl_svysummary(include = c(trust_pol, age, edu_years, gender, religious, left_right),
                 statistic = list(all_continuous() ~"{mean} ({sd})"))

**WITHOUT** weights AND **WITH** weights (post-stratification and population weights)

We can see that gender variable is more equally balanced between males (1) and females (2) in the data with weights

Additionally, average trust in politicians is lower in the sample with full weights.

Participants are more left-leaning on average in the sample with full weights than in the sample with no weights.

Next, we can look at a general linear model without survey weights and then with the two survey weights we just created.

Do we see any effect of the weighting design on the standard errors and significance values?

So, we first run a simple general linear model. In this model, R assumes that the data are independent of each other and based on that assumption, calculates coefficients and standard errors.

simple_glm <- glm(trust_pol ~ left_right + edu_years + rural_urban + age, data = round9)

Next, we will look at only post-stratification weights. We use the svyglm function and instead of using the data = r2, we use design = post_design .

post_strat_glm <- svyglm(trust_pol ~ left_right + edu_years + rural_urban  + age, design = post_design)

And finally, we will run the regression with the combined post-stratification AND population weight with the design = full_design argument.

full_weight_glm <- svyglm(trust_pol ~ left_right + edu_years + rural_urban + age, design = full_design))

With the stargazer package, we can compare the models side-by-side:

library(stargazer)
stargazer(simple_glm, post_strat_glm, full_weight_glm, type = "text")

We can see that the standard errors in brackets were increased for most of the variables in model (3) with both weights when compared to the first model with no weights.

The biggest change is the rural-urban scale variable. With no weights, it is positive correlated with trust in politicians. That is to say, the more urban a location the respondent lives, the more likely the are to trust politicians. However, after we apply both weights, it becomes negative correlated with trust. It is in fact the more rural the location in which the respondent lives, the more trusting they are of politicians.

Additionally, age becomes statistically significant, after we apply weights.

Of course, this model is probably incorrect as I have assumed that all these variables have a simple linear relationship with trust levels. If I really wanted to build a robust demographic model, I would have to consult the existing academic literature and test to see if any of these variables are related to trust levels in a non-linear way. For example, it could be that there is a polynomial relationship between age and trust levels, for example. This model is purely for illustrative purposes only!

Plus, when I examine the R2 score for my models, it is very low; this model of demographic variables accounts for around 6% of variance in level of trust in politicians. Again, I would have to consult the body of research to find other explanatory variables that can account for more variance in my dependent variable of interest!

We can look at the R2 and VIF score of GLM with the summ() function from the jtools package. The summ() function can take a svyglm object. Click here to read more about various functions in the jtools package.

Sarcastic Nancy Pelosi GIF by MOODMAN - Find & Share on GIPHY

BBC style graphs with bbplot package in R

Packages we will need:

devtools::install_github('bbc/bbplot')
library(bbplot)

Click here to check out the vignette to read about all the different graphs with which you can use bbplot !

Monty Python Hello GIF - Find & Share on GIPHY

We will look at the Soft Power rankings from Portland Communications. According to Wikipedia, In politics (and particularly in international politics), soft power is the ability to attract and co-opt, rather than coerce or bribe other countries to view your country’s policies and actions favourably. In other words, soft power involves shaping the preferences of others through appeal and attraction.

A defining feature of soft power is that it is non-coercive; the currency of soft power includes culture, political values, and foreign policies.

Joseph Nye’s primary definition, soft power is in fact:

French Baguette GIF - Find & Share on GIPHY

“the ability to get what you want through attraction rather than coercion or payments. When you can get others to want what you want, you do not have to spend as much on sticks and carrots to move them in your direction. Hard power, the ability to coerce, grows out of a country’s military and economic might. Soft power arises from the attractiveness of a country’s culture, political ideals and policies. When our policies are seen as legitimate in the eyes of others, our soft power is enhanced”

(Nye, 2004: 256).

Every year, Portland Communication ranks the top countries in the world regarding their soft power. In 2019, the winner was la France!

Click here to read the most recent report by Portland on the soft power rankings.

We will also add circular flags to the graphs with the ggflags package. The geom_flag() requires the ISO two letter code as input to the argument … but it will only accept them in lower case. So first we need to make the country code variable suitable:

library(ggflags)
sp$iso2_lower <- tolower(sp$iso2)

Click here to read more about ggflags()

And we create a ggplot line graph with geom_flag() as a replacement to the geom_point() function

sp_graph <- sp %>% 
  ggplot(aes(x = year, y = value, group = country)) +
  geom_line(aes(color = country, alpha = 1.8), size = 1.8) +
  ggflags::geom_flag(aes(country = iso2_lower), size = 8) + 
  scale_color_manual(values = my_pal) +
  labs(title = "Soft Power Ranking ",
       subtitle = "Portland Communications, 2015 - 2019")

And finally call our sp_graph object with the bbc_style() function

sp_graph + bbc_style() + theme(legend.position = "none")

Here I run a simple scatterplot and compare Post-Soviet states and see whether there has been a major change in class equality between 1991 after the fall of the Soviet Empire and today. Is there a relationship between class equality and demolcratisation? Is there a difference in the countries that are now in EU compared to the Post-Soviet states that are not?

library(ggrepel)  # to stop text labels overlapping
library(gridExtra)  # to place two plots side-by-side
library(ggbubr)  # to modify the gridExtra titles

region_liberties_91 <- vdem %>%
  dplyr::filter(year == 1991) %>% 
  dplyr::filter(regions == 'Post-Soviet') %>% 
  dplyr::filter(!is.na(EU_member)) %>% 
  ggplot(aes(x = democracy, y = class_equality, color = EU_member)) +
  geom_point(aes(size = population)) + 
  scale_alpha_continuous(range = c(0.1, 1)) 

plot_91 <- region_liberties_91 + 
  bbplot::bbc_style() + 
  labs(subtitle = "1991") +
  ylim(-2.5, 3.5) +
  xlim(0, 1) +
  geom_text_repel(aes(label = country_name), show.legend = FALSE, size = 7) +
  scale_size(guide="none") 

region_liberties_18 <- vdem %>%
  dplyr::filter(year == 2018) %>% 
  dplyr::filter(regions == 'Post-Soviet') %>% 
  dplyr::filter(!is.na(EU_member)) %>% 
  ggplot(aes(x = democracy_score, y = class_equality, color = EU_member)) +
  geom_point(aes(size = population)) + 
  scale_alpha_continuous(range = c(0.1, 1)) 

plot_18 <- region_liberties_15 + 
  bbplot::bbc_style() + 
  labs(subtitle = "2015") +
  ylim(-2.5, 3.5) +
  xlim(0, 1) +
  geom_text_repel(aes(label = country_name), show.legend = FALSE, size = 7) +
  scale_size(guide = "none") 

my_title = text_grob("Relationship between democracy and class equality in Post-Soviet states", size = 22, face = "bold") 
my_y = text_grob("Democracy Score", size = 20, face = "bold")
my_x = text_grob("Class Equality Score", size = 20, face = "bold", rot = 90)

grid.arrange(plot_1, plot_2, ncol=2,  top = my_title, bottom = my_y, left = my_x)

The BBC cookbook vignette offers the full function. So we can tweak it any way we want.

For example, if I want to change the default axis labels, I can make my own slightly adapted my_bbplot() function

my_bbplot <- function ()
  function ()
  {
    font <- "Helvetica"
    ggplot2::theme(plot.title = ggplot2::element_text(family = font, size = 28, face = "bold", color = "#222222"), 
    plot.subtitle = ggplot2::element_text(family = font,size = 22, margin = ggplot2::margin(9, 0, 9, 0)), 
    plot.caption = ggplot2::element_blank(),
    legend.position = "top", 
    legend.text.align = 0, 
    legend.background = ggplot2::element_blank(),
    legend.title = ggplot2::element_blank(), 
    legend.key = ggplot2::element_blank(),
    legend.text = ggplot2::element_text(family = font, size = 18, color = "#222222"), 
    axis.title = ggplot2::element_blank(),
    axis.text = ggplot2::element_text(family = font, size = 18, color = "#222222"), 
    axis.text.x = ggplot2::element_text(margin = ggplot2::margin(5, b = 10)),
    axis.line = ggplot2::element_blank(), 
    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major.y = ggplot2::element_line(color = "#cbcbcb"),
    panel.grid.major.x = ggplot2::element_line(color = "#cbcbcb"), 
    panel.background = ggplot2::element_blank(),
    strip.background = ggplot2::element_rect(fill = "white"),
    strip.text = ggplot2::element_text(size = 22, hjust = 0))
  }

The British Broadcasting Corporation, the home of upstanding journalism and subtle weathermen:

Middle Finger GIF - Find & Share on GIPHY

Add weights to survey data with survey package in R: Part 1

With the European Social Survey (ESS), we will examine the different variables that are related to levels of trust in politicians across Europe in the latest round 9 (conducted in 2018).

Click here for Part 2.

Click here to learn about downloading ESS data into R with the essurvey package.

Packages we will need:

library(survey)
library(srvyr)

The survey package was created by Thomas Lumley, a professor from Auckland. The srvyr package is a wrapper packages that allows us to use survey functions with tidyverse.

Why do we need to add weights to the data when we analyse surveys?

When we import our survey data file, R will assume the data are independent of each other and will analyse this survey data as if it were collected using simple random sampling.

However, the reality is that almost no surveys use a simple random sample to collect data (the one exception being Iceland in ESS!)

Excited Rachel Mcadams GIF by NETFLIX - Find & Share on GIPHY

Rather, survey institutions choose complex sampling designs to reduce the time and costs of ultimately getting responses from the public.

Their choice of sampling design can lead to different estimates and the standard errors of the sample they collect.

For example, the sampling weight may affect the sample estimate, and choice of stratification and/or clustering may mean (most likely underestimated) standard errors.

As a result, our analysis of the survey responses will be wrong and not representative to the population we want to understand. The most problematic result is that we would arrive at statistical significance, when in reality there is no significant relationship between our variables of interest.

Therefore it is essential we don’t skip this step of correcting to account for weighting / stratification / clustering and we can make our sample estimates and confidence intervals more reliable.

This table comes from round 8 of the ESS, carried out in 2016. Each of the 23 countries has an institution in charge of carrying out their own survey, but they must do so in a way that meets the ESS standard for scientifically sound survey design (See Table 1).

Sampling weights aim to capture and correct for the differing probabilities that a given individual will be selected and complete the ESS interview.

For example, the population of Lithuania is far smaller than the UK. So the probability of being selected to participate is higher for a random Lithuanian person than it is for a random British person.

Additionally, within each country, if the survey institution chooses households as a sampling element, rather than persons, this will mean that individuals living alone will have a higher probability of being chosen than people in households with many people.

Click here to read in detail the sampling process in each country from round 1 in 2002. For example, if we take my country – Ireland – we can see the many steps involved in the country’s three-stage probability sampling design.

St Patricks Day Snl GIF by Saturday Night Live - Find & Share on GIPHY

The Primary Sampling Unit (PSU) is electoral districts. The institute then takes addresses from the Irish Electoral Register. From each electoral district, around 20 addresses are chosen (based on how spread out they are from each other). This is the second stage of clustering. Finally, one person is randomly chosen in each house to answer the survey, chosen as the person who will have the next birthday (third cluster stage).

Click here for more information about Design Effects (DEFF) and click here to read how ESS calculates design effects.

DEFF p refers to the design effect due to unequal selection probabilities (e.g. a person is more likely to be chosen to participate if they live alone)

DEFF c refers to the design effect due to clustering

According to Gabler et al. (1999), if we multiply these together, we get the overall design effect. The Irish design that was chosen means that the data’s variance is 1.6 times as large as you would expect with simple random sampling design. This 1.6 design effects figure can then help to decide the optimal sample size for the number of survey participants needed to ensure more accurate standard errors.

So, we can use the functions from the survey package to account for these different probabilities of selection and correct for the biases they can cause to our analysis.

In this example, we will look at demographic variables that are related to levels of trust in politicians. But there are hundreds of variables to choose from in the ESS data.

Click here for a list of all the variables in the European Social Survey and in which rounds they were asked. Not all questions are asked every year and there are a bunch of country-specific questions.

We can look at the last few columns in the data.frame for some of Ireland respondents (since we’ve already looked at the sampling design method above).

The dweight is the design weight and it is essentially the inverse of the probability that person would be included in the survey.

The pspwght is the post-stratification weight and it takes into account the probability of an individual being sampled to answer the survey AND ALSO other factors such as non-response error and sampling error. This post-stratificiation weight can be considered a more sophisticated weight as it contains more additional information about the realities survey design.

The pweight is the population size weight and it is the same for everyone in the Irish population.

When we are considering the appropriate weights, we must know the type of analysis we are carrying out. Different types of analyses require different combinations of weights. According to the ESS weighting documentation:

when analysing data for one country alone – we only need the design weight or the poststratification weight.
when comparing data from two or more countries but without reference to statistics that combine data from more than one country – we only need the design weight or the poststratification weight
when comparing data of two or more countries and with reference to the average (or combined total) of those countries – we need BOTH design or post-stratification weight AND population size weights together.
when combining different countries to describe a group of countries or a region, such as “EU accession countries” or “EU member states” = we need BOTH design or post-stratification weights AND population size weights.

ESS warn that their survey design was not created to make statistically accurate region-level analysis, so they say to carry out this type of analysis with an abundance of caution about the results.

ESS has a table in their documentation that summarises the types of weights that are suitable for different types of analysis:

Since we are comparing the countries, the optimal weight is a combination of post-stratification weights AND population weights together.

Click here to read Part 2 and run the regression on the ESS data with the survey package weighting design

Below is the code I use to graph the differences in mean level of trust in politicians across the different countries.

library(ggimage) # to add flags
library(countrycode) # to add ISO country codes

# r_agg is the aggregated mean of political trust for each countries' respondents.

r_agg %>% 
  dplyr::mutate(country, EU_member = ifelse(country == "BE" | country == "BG" | country == "CZ" | country == "DK" | country == "DE" | country == "EE" | country == "IE" | country == "EL" | country == "ES" | country == "FR" | country == "HR" | country == "IT" | country == "CY" | country == "LV" | country == "LT" | country == "LU" | country == "HU" | country == "MT" | country == "NL" | country == "AT" | country == "AT" | country == "PL" | country == "PT" | country == "RO" | country == "SI" | country == "SK" | country == "FI" | country == "SE","EU member", "Non EU member")) -> r_agg


r_agg %>% 
  filter(EU_member == "EU member") %>% 
  dplyr::summarize(eu_average = mean(mean_trust_pol)) 


r_agg$country_name <- countrycode(r_agg$country, "iso2c", "country.name")


#eu_average <- r_agg %>%
 # summarise_if(is.numeric, mean, na.rm = TRUE)


eu_avg <- data.frame(country = "EU average",
                     mean_trust_pol = 3.55,
                     EU_member =  "EU average",
                     country_name = "EU average")

r_agg <- rbind(r_agg, eu_avg)

 
my_palette <- c("EU average" = "#ef476f", 
                "Non EU member" = "#06d6a0", 
                "EU member" = "#118ab2")

r_agg <- r_agg %>%          
  dplyr::mutate(ordered_country = fct_reorder(country, mean_trust_pol))


r_graph <- r_agg %>% 
  ggplot(aes(x = ordered_country, y = mean_trust_pol, group = country, fill = EU_member)) +
  geom_col() +
  ggimage::geom_flag(aes(y = -0.4, image = country), size = 0.04) +
  geom_text(aes(y = -0.15 , label = mean_trust_pol)) +
  scale_fill_manual(values = my_palette) + coord_flip()

r_graph

Add rectangular flags to maps in R

We will make a graph to map the different colonial histories of countries in South-East Asia!

Click here to add circular flags.

Packages we will need:

library(ggimage)
library(rnaturalearth)
library(countrycode)
library(ggthemes)
library(reshape2)

I use the COLDAT Colonial Dates Dataset by Bastien Becker (2020). We will only need the first nine columns in the dataset:

col_df <- data.frame(col_df[1:9])

Next we will need to turn the dataset from wide to long with the reshape2 package:

long_col <- melt(col_df, id.vars=c("country"), 
                 measure.vars = c("col.belgium","col.britain", "col.france", "col.germany", 
"col.italy", "col.netherlands",  "col.portugal", "col.spain"),
                 variable.name = "colony", 
                 value.name = "value")

We drop all the 0 values from the dataset:

long_col <- long_col[which(long_col$value == 1),]

Next we use ne_countries() function from the rnaturalearth package to create the map!

map <- ne_countries(scale = "medium", returnclass = "sf")

Click here to read more about the rnaturalearth package.

Next we merge the two datasets together:

col_map <- merge(map, long_col, by.x = "iso_a3", by.y = "iso3", all.x = TRUE)

We can change the class and factors of the colony variable:

library(plyr)
col_map$colony_factor <- as.factor(col_map$colony)
col_map$colony_factor <- revalue(col_map$colony_factor, c("col.belgium"="Belgium", "col.britain" = "Britain",
 "col.france" = "France",
"col.germany" = "Germany",
 "col.italy" = "Italy",
 "col.netherlands" = "Netherlands", "col.portugal" = "Portugal",
 "col.spain" = "Spain",
 "No colony" = "No colony"))

Nearly there.

Next we will need to add the longitude and latitude of the countries. The data comes from the web and I can scrape the table with the rvest package

library(rvest)

coord <- read_html("https://developers.google.com/public-data/docs/canonical/countries_csv")

coord_tables <- coord %>% html_table(header = TRUE, fill = TRUE)

coord <- coord_tables[[1]]

col_map <- merge(col_map, coord, by.x= "iso_a2", by.y = "country", all.y = TRUE)

Click here to read more about the rvest package.

And we can make a vector with some hex colors for each of the European colonial countries.

my_palette <- c("#0d3b66","#e75a7c","#f4d35e","#ee964b","#f95738","#1b998b","#5d22aa","#85f5ff", "#19381F")

Next, to graph a map to look at colonialism in Asia, we can extract countries according to the subregion variable from the rnaturalearth package and graph.

asia_map <- col_map[which(col_map$subregion == "South-Eastern Asia" | col_map$subregion == "Southern Asia"),]

Click here to read more about the geom_flag function.

colony_asia_graph <- asia_map %>%
  ggplot() + geom_sf(aes(fill = colony_factor), 
                     position = "identity") +
  ggimage::geom_flag(aes(longitude-2, latitude-1, image = col_iso), size = 0.04) +
  geom_label(aes(longitude+1, latitude+1, label = factor(sovereignt))) +
  scale_fill_manual(values = my_palette)

And finally call the graph with the theme_map() from ggthemes package

colony_asia_graph + theme_map()

References

Becker, B. (2020). Introducing COLDAT: The Colonial Dates Dataset.

Graph countries on the political left right spectrum

In this post, we can compare countries on the left – right political spectrum and graph the trends.

In the European Social Survey, they ask respondents to indicate where they place themselves on the political spectrum with this question: “In politics people sometimes talk of ‘left’ and ‘right’. Where would you place yourself on this scale, where 0 means the left and 10 means the right?”

Click here to read how to download data from the European Social survey.

round <- import_all_rounds()

Extract all the lists. I just want three of the variables for my graph.

r1 <- round[[1]]

r1 <- data.frame(country = r1$cntry, round= r1$essround, lrscale = r1$lrscale)

Do this for all the data.frames and rbind() them all together.

round_df <- rbind(r1, r2, r3, r4, r5, r6, r7, r8, r9)

Convert all the variables to suitable types:

round_df$country <- as.factor(round_df$country)
round_df$round <- as.numeric(round_df$round)
round_df$lrscale <- as.numeric(round_df$lrscale)

Next we find the mean score for all respondents in each of the countries for each year.

round_df %>% 
  dplyr::filter(!is.na(lrscale)) %>% 
  dplyr::group_by(country, round) %>% 
  dplyr::mutate(mean_lr = mean(lrscale)) -> round_df

We keep only one of the values for each country at each survey year.

round_df <- round_df[!duplicated(round_df$mean_lr),]

Create a vector of hex colors that correspond to the countries I want to look at: Ireland, France, the UK and Germany.

my_palette <- c( "DE" = "#FFCE00", "FR" = "#001489", "GB" = "#CF142B", "IE" = "#169B62")

And graph the plot:

library(ggthemes, ggimage)

lrscale_graph <- round_df %>% 
  dplyr::filter(country == "IE" | country == "GB" | country == "FR" | country == "DE") %>% 
  ggplot(aes(x= round, y = mean_lr, group = country)) +
  geom_line(aes(color = factor(country)), size = 1.5, alpha = 0.5) +
  ggimage::geom_flag(aes(image = country), size = 0.04) + 
  scale_color_manual(values = my_palette) +
  scale_x_discrete(name = "Year", limits=c("2002","2004","2006","2008","2010","2012","2014","2016","2018")) +
  labs(title = "Where would you place yourself on this scale,\n where 0 means the left and 10 means the right?",
       subtitle = "Source: European Social Survey, 2002 - 2018",
       fill="Country",
       x = "Year",
       y = "Left - Right Spectrum")

lrscale_graph + guides(color=guide_legend(title="Country")) + theme_economist()

Download European Social Survey data with essurvey package in R

The European Social Survey (ESS) measure attitudes in thirty-ish countries (depending on the year) across the European continent. It has been conducted every two years since 2001.

The survey consists of a core module and two or more ‘rotating’ modules, on social and public trust; political interest and participation; socio-political orientations; media use; moral, political and social values; social exclusion, national, ethnic and religious allegiances; well-being, health and security; demographics and socio-economics.

So lots of fun data for political scientists to look at.

install.packages("essurvey")
library(essurvey)

The very first thing you need to do before you can download any of the data is set your email address.

set_email("rforpoliticalscience@gmail.com")

Don’t forget the email address goes in as a string in “quotations marks”.

Show what countries are in the survey with the show_countries() function.

show_countries()

[1] "Albania"     "Austria"    "Belgium"           
[4] "Bulgaria"    "Croatia"     "Cyprus"            
[7] "Czechia"     "Denmark"     "Estonia"           
[10] "Finland"    "France"      "Germany"           
[13] "Greece"     "Hungary"     "Iceland"           
[16] "Ireland"    "Israel"      "Italy"             
[19] "Kosovo"     "Latvia"      "Lithuania"         
[22] "Luxembourg" "Montenegro"  "Netherlands"       
[25] "Norway"     "Poland"      "Portugal"          
[28] "Romania" "Russian Federation" "Serbia"            
[31] "Slovakia"   "Slovenia"     "Spain"             
[34] "Sweden"     "Switzerland"  "Turkey"            
[37] "Ukraine"    "United Kingdom"

It’s important to know that country names are case sensitive and you can only use the name printed out by show_countries(). For example, you need to write “Russian Federation” to access Russian survey data; if you write “Russia”…

Kamala Harris Reaction GIF by Markpain - Find & Share on GIPHY

Using these country names, we can download specific rounds or waves (i.e survey years) with import_country. We have the option to choose the two most recent rounds, 8th (from 2016) and 9th round (from 2018).

ire_data <- import_all_cntrounds("Ireland")

The resulting data comes in the form of nine lists, one for each round

These rounds correspond to the following years:

ESS Round 9 – 2018
ESS Round 8 – 2016
ESS Round 7 – 2014
ESS Round 6 – 2012
ESS Round 5 – 2010
ESS Round 4 – 2008
ESS Round 3 – 2006
ESS Round 2 – 2004
ESS Round 1 – 2002

I want to compare the first round and most recent round to see if Irish people’s views have changed since 2002. In 2002, Ireland was in the middle of an economic boom that we called the “Celtic Tiger”. People did mad things like buy panini presses and second house in Bulgaria to resell. Then the 2008 financial crash hit the country very hard.

Irish people during the Celtic Tiger:

Irish people after the Celtic Tiger crash:

Big Cats GIF by NETFLIX - Find & Share on GIPHY

Ireland in 2018 was a very different place. So it will be interesting to see if these social changes translated into attitude changes.

First, we use the import_country() function to download data from ESS. Specify the country and rounds you want to download.

ire <-import_country(country = "Ireland", rounds = c(1, 9))

The resulting ire object is a list, so we’ll need to extract the two data.frames from the list:

ire_1 <- ire[[1]]

ire_9 <- ire[[2]]

The exact same questions are not asked every year in ESS; there are rotating modules, sometimes questions are added or dropped. So to merge round 1 and round 9, first we find the common columns with the intersect() function.

common_cols <- intersect(colnames(ire_1), colnames(ire_9))

And then bind subsets of the two data.frames together that have the same columns with rbind() function.

ire_df <- rbind(subset(ire_1, select = common_cols),
                subset(ire_9, select = common_cols))

Now with my merged data.frame, I only want to look at a few of the variables and clean up the dataset for the analysis.

Click here to look at all the variables in the different rounds of the survey.

att9 <- data.frame(country = data9$cntry,
                   round = data9$essround,
                   imm_same_eth = data9$imsmetn,
                   imm_diff_eth = data9$imdfetn,
                   imm_poor = data9$impcntr,
                   imm_econ = data9$imbgeco,
                   imm_culture = data9$imueclt,
                   imm_qual_life = data9$imwbcnt,
                   left_right = data9$lrscale)

class(att9$imm_same_eth)

All the variables in the dataset are a special class called “haven_labelled“. So we must convert them to numeric variables with a quick function. We exclude the first variable because we want to keep country name as a string character variable.

att_df[2:15] <- lapply(att_df[2:15], function(x) as.numeric(as.character(x)))

We can look at the distribution of our variables and count how many missing values there are with the skim() function from the skimr package

library(skimr)

skim(att_df)

We can run a quick t-test to compare the mean attitudes to immigrants on the statement: “Immigrants make country worse or better place to live” across the two survey rounds.

Lower scores indicate an attitude that immigrants undermine Ireland’ quality of life and higher scores indicate agreement that they enrich it!

t.test(att_df$imm_qual_life ~ att_df$round)

In future blog, I will look at converting the raw output of R into publishable tables.

The results of the independent-sample t-test show that if we compare Ireland in 2002 and Ireland in 2018, there has been a statistically significant increase in positive attitudes towards immigrants and belief that Ireland’s quality of life is more enriched by their presence in the country.

As I am currently an immigrant in a foreign country myself, I am glad to come from a country that sees the benefits of immigrants!

Donald Glover Yes GIF - Find & Share on GIPHY

If we load the ggpubr package, we can graphically look at the difference in mean attitude scores.

library(ggpubr)

box1 <- ggpubr::ggboxplot(att_df, x = "round", y = "imm_qual_life", color = "round", palette = c("#d11141", "#00aedb"),
 ylab = "Attitude", xlab = "Round")

box1 + stat_compare_means(method = "t.test")

It’s not the most glamorous graph but it conveys the shift in Ireland to more positive attitudes to immigration!

I suspect that a country’s economic growth correlates with attitudes to immigration.

So let’s take the mean annual score values

ire_agg <- ireland[!duplicated(ireland$mean_imm_qual_life),]
ire_agg <- ire_agg %>% 
select(year, everything())

Next we can take data from Quandl website on annual Irish GDP growth (click here to learn how to access economic data via a Quandl API on R.)

gdp <- Quandl('ODA/IRL_LE', start_date='2002-01-01', end_date='2020-01-01',type="raw")

Create a year variable from the date variable

gdp$year <- substr(gdp$Date, start = 1, stop = 4)

Add year variable to the ire_agg data.frame that correspond to the ESS survey rounds.

year =c("2002","2004","2006","2008","2010","2012","2014","2016","2018")
year <- data.frame(year)
ire_agg <- cbind(ire_agg, year)

Merge the GDP and ESS datasets

ire_agg <- merge(ire_agg, gdp, by.x = "year", by.y = "year", all.x = TRUE)

Scale the GDP and immigrant attitudes variables so we can put them on the same plot.

ire_agg$scaled_gdp <- scale(ire_agg$Value)

ire_agg$scaled_imm_attitude <- scale(ire_agg$mean_imm_qual_life)

In order to graph both variables on the same graph, we turn the two scaled variables into two factors of a single variable.

ire_agg <- ire_agg %>%
  select(year, scaled_imm_attitude, scaled_gdp) %>%
  gather(key = "variable", value = "value", -year)

Next, we can change the names of the factors

ire_agg$variable <- revalue(ire_agg$variable, c("scaled_gdp"="GDP (scaled)", "scaled_imm_attitude" = "Attitudes (scaled)"))

And finally, we can graph the plot.

The geom_rect() function graphs the coloured rectangles on the plot. I take colours from this color-hex website; the green rectangle for times of economic growth and red for times of recession. Makes sure the geom-rect() comes before the geom_line().

library(ggpthemes)

ggplot(ire_agg, aes(x = year, y = value, group = variable)) + geom_rect(aes(xmin= "2008",xmax= "2012",ymin=-Inf, ymax=Inf),fill="#d11141",colour=NA, alpha=0.01) +
  geom_rect(aes(xmin= "2002" ,xmax= "2008",ymin=-Inf, ymax=Inf),fill="#00b159",colour=NA, alpha=0.01) +
  geom_rect(aes(xmin= "2012" ,xmax= "2020",ymin=-Inf, ymax=Inf),fill="#00b159",colour=NA, alpha=0.01) +
  geom_line(aes(color = as.factor(variable), linetype = as.factor(variable)), size = 1.3) + 
  scale_color_manual(values = c("#00aedb", "#f37735")) + 
  geom_point() +
  geom_text(data=. %>%
              arrange(desc(year)) %>%
              group_by(variable) %>%
              slice(1), aes(label=variable), position= position_jitter(height = 0.3), vjust =0.3, hjust = 0.1, 
              size = 4, angle= 0) + ggtitle("Relationship between Immigration Attitudes and GDP Growth") + labs(value = " ") + xlab("Year") + ylab("scaled") + theme_hc()

And we can see that there is a relationship between attitudes to immigrants in Ireland and Irish GDP growth. When GDP is growing, Irish people see that immigrants improve quality of life in Ireland and vice versa. The red section of the graph corresponds to the financial crisis.

Add rectangular flags to graphs with ggimage package in R

This quick function can add rectangular flags to graphs.

Click here to add circular flags with the ggflags package.

Latina GIF by Latinx Heritage Month - Find & Share on GIPHY

The data comes from a Wikipedia table on a recent report by OECD’s Overseas Development Aid (ODA) from donor countries in 2019.

Click here to read about scraping tables from Wikipedia with the rvest package in R.

library(countrycode)
library(ggimage)

In order to use the geom_flag() function, we need a country’s two-digit ISO code (For example, Ireland is IE!)

To add the ISO code, we can use the countrycode() function. Click here to read about a quick blog about the countrycode() function.

In one function we can quickly add a new variable that converts the country name in our dataset into to ISO codes.

oda$iso2 <- countrycode(oda$donor, "country.name", "iso2c")

Also we can use the countrycode() function to add a continent variable. We will use that to fill the colors of our bars in the graph.

oda$continent <- countrycode(oda$iso2, "iso2c", "continent")

We can now add the the geom_flag() function to the graph. The y = -50 prevents the flags overlapping with the bars and places them beside their name label. The image argument takes the iso2 variable.

Quick tip: with the reorder argument, if we wanted descending order (rather than ascending order of ODA amounts, we would put a minus sign in front of the oda_per_capita in the reorder() function for the x axis value.

oda_bar <- oda %>% 
  ggplot(aes(x = reorder(donor, oda_per_capita), y = oda_per_capita, fill = continent)) + 
  geom_flag(y = -50, aes(image = iso2))  +
       geom_bar(stat = "identity") + 
       labs(title = "ODA donor spending ",
                   subtitle = "Source: OECD's Development Assistance Committee, 2019 ",
                   x = "Donor Country",
                   y = "ODA per capita")

The fill argument categorises the continents of the ODA donors. Sometimes I take my hex colors from https://www.color-hex.com/ website.

my_palette <- c("Americas" = "#0084ff", "Asia" = "#44bec7", "Europe" = "#ffc300", "Oceania" = "#fa3c4c")

Last we print out the bar graph. The expand_limits() function moves the graph to fit the flags to the left of the y-axis.

Seth Meyers Omg GIF by Late Night with Seth Meyers - Find & Share on GIPHY

oda_bar +
  coord_flip() +
  expand_limits(y = -50) + scale_fill_manual(values = my_palette)

Scrape NATO defense expenditure data from Wikipedia with the rvest package in R

We can all agree that Wikipedia is often our go-to site when we want to get information quick. When we’re doing IR or Poli Sci reesarch, Wikipedia will most likely have the most up-to-date data compared to other databases on the web that can quickly become out of date.

Jennifers Body Truth GIF - Find & Share on GIPHY

So in R, we can scrape a table from Wikipedia and turn into a database with the rvest package .

First, we copy and paste the Wikipedia page we want to scrape into the read_html() function as a string:

nato_members <- read_html("https://en.wikipedia.org/wiki/Member_states_of_NATO")

Next we save all the tables on the Wikipedia page as a list. Turn the header = TRUE.

nato_tables <- nato_members %>% html_table(header = TRUE, fill = TRUE)

The table that I want is the third table on the page, so use [[two brackets]] to access the third list.

nato_exp <- nato_tables[[3]]

The dataset is not perfect, but it is handy to have access to data this up-to-date. It comes from the most recent NATO report, published in 2019.

Some problems we will have to fix.

The first row is a messy replication of the header / more information across two cells in Wikipedia.
The headers are long and convoluted.
There are a few values in as N/A in the dataset, which R thinks is a string.
All the numbers have commas, so R thinks all the numeric values are all strings.

There are a few NA values that I would not want to impute because they are probably zero. Iceland has no armed forces and manages only a small coast guard. North Macedonia joined NATO in March 2020, so it doesn’t have all the data completely.

So first, let’s do some quick data cleaning:

Clean the variable names to remove symbols and adds underscores with a function from the janitor package

library(janitor)
nato_exp  <- nato_exp %>% clean_names()

Delete the first row. which contains some extra header text:

nato_exp <- nato_exp[-c(1),]

Rename the headers to better reflect the original Wikipedia table headings In this rename() function,

the first string in the variable name we want and
the second string is the original heading as it was cleaned from the above clean_names() function:

nato_exp <- nato_exp %>%
 rename("def_exp_millions" = "defence_expenditure_us_f",
 "def_exp_gdp" = "defence_expenditure_us_f_2",
 "def_exp_per_capita" = "defence_expenditure_us_f_3",
 "population" = "population_a",
 "gdp" = "gdp_nominal_e",
 "personnel" = "personnel_f")

Next turn all the N/A value strings to NULL. The na_strings object we create can be used with other instances of pesky missing data varieties, other than just N/A string.

na_strings <- c("N A", "N / A", "N/A", "N/ A", "Not Available", "Not available")

nato_exp <- nato_exp %>% replace_with_na_all(condition = ~.x %in% na_strings)

Remove all the commas from the number columns and convert the character strings to numeric values with a quick function we apply to all numeric columns in the data.frame.

remove_comma <- function(x) {as.numeric(gsub(",", "", x, fixed = TRUE))}

nato_exp[2:7] <- sapply(nato_exp[2:7], remove_comma)

Next, we can calculate the average NATO score of all the countries (excluding the member_state variable, which is a character string).

We’ll exclude the NATO total column (as it is not a member_state but an aggregate of them all) and the data about Iceland and North Macedonia, which have missing values.

nato_average <- nato_exp %>%
filter(member_state != 'NATO' & member_state != 'Iceland' & member_state != 'North Macedonia') %>%
summarise_if(is.numeric, mean, na.rm = TRUE)

Re-arrange the columns so the two data.frames match:

nato_average$member_state = "NATO average"
nato_average <- nato_average %>% select(member_state, everything())

Bind the two data.frames together

nato_exp <- rbind(nato_exp, nato_average)

Create a new factor variable that categorises countries into either above or below the NATO average defense spending.

Also we can specify a category to distinguish those countries that have reached the NATO target of their defense spending equal to 2% of their GDP.

nato_exp <- nato_exp %>% 
filter(member_state != 'NATO' & member_state!= "North Macedonia" & member_state!= "Iceland") %>% 
dplyr::mutate(difference = case_when(def_exp_gdp >= 2 ~ "Above NATO 2% GDP quota", between(def_exp_gdp, 1.6143, 2) ~ "Above NATO average", between(def_exp_gdp, 1.61427, 1.61429) ~ "NATO average", def_exp_gdp <= 1.613 ~ "Below NATO average"))

Create a vector of hex colours to correspond to the different categories. I choose traffic light colors to indicate the

green countries (those who have reached the NATO 2% quota),
orange countries (above the NATO average but below the spending target) and
red countries (below the NATO spending average).

The blue colour is for the NATO average bar,

my_palette <- c( "Below NATO average" = "#E60000", "NATO average" = "#012169", "Above NATO average" = "#FF7800", "Above NATO 2% GDP quota" = "#4CBB17")

Finally, we create a graph with ggplot, and use the reorder() function to arrange the bars in ascending order.

NATO allies are encouraged to hit the target of 2% of gross domestic product. So, we add a geom_vline() to demarcate the NATO 2% quota.

nato_bar <- nato_exp %>% 
  filter(member_state != 'NATO' & member_state!= "North Macedonia" & member_state!= "Iceland") %>%
  ggplot(aes(x= reorder(member_state, def_exp_gdp), y = def_exp_gdp, 
fill=factor(difference))) + 
  geom_bar(stat = "identity") +
  geom_vline(xintercept = 22.55, colour="firebrick", linetype = "longdash", size = 1) +
  geom_text(aes(x=22, label="NATO 2% quota", y=3), colour="firebrick", text=element_text(size=20)) +
  labs(title = "NATO members Defense Expenditure as a percentage GDP ",
       subtitle = "Source: NATO, 2019",
       x = "NATO Member States",
       y = "Defense Expenditure (as % GDP) ")

Click here to read about adding flags to graphs with the ggimage package.

library(countrycode)
library(ggimage)

nato_exp$iso2 <- countrycode(nato_exp$member_state, "country.name", "iso2c")

Finally, we can print out the nato_bar graph!

nato_bar + 
geom_flag(y = -0.2, aes(image = nato_exp$iso2)) +
coord_flip() +
expand_limits(y = -0.2) +
theme(legend.title = element_blank(), axis.text.x=element_text(angle=45, hjust=1)) + scale_fill_manual(values = my_palette)

Pushing Donald Trump GIF - Find & Share on GIPHY

Interpret multicollinearity tests from the mctest package in R

Packages we will need :

library(mctest)

The mctest package’s functions have many multicollinearity diagnostic tests for overall and individual multicollinearity. Additionally, the package can show which regressors may be the reason of for the collinearity problem in your model.

Click here to read the CRAN PDF for all the function arguments available.

So – as always – we first fit a model.

Given the amount of news we have had about elections in the news recently, let’s look at variables that capture different aspects of elections and see how they relate to scores of democracy. These different election components will probably overlap.

In fact, I suspect multicollinearity will be problematic with the variables I am looking at.

Click here for a previous blog post on Variance Inflation Factor (VIF) score, the easiest and fastest way to test for multicollinearity in R.

The variables in my model are:

emb_autonomy – the extent to which the election management body of the country has autonomy from the government to apply election laws and administrative rules impartially in national elections.
election_multiparty – the extent to which the elections involved real multiparty competition.
election_votebuy – the extent to which there was evidence of vote and/or turnout buying.
election_intimidate – the extent to which opposition candidates/parties/campaign workers subjected to repression, intimidation, violence, or harassment by the government, the ruling party, or their agents.
election_free – the extent to which the election was judged free and fair.

In this model the dependent variable is democracy score for each of the 178 countries in this dataset. The score measures the extent to which a country ensures responsiveness and accountability between leaders and citizens. This is when suffrage is extensive; political and civil society organizations can operate freely; governmental positions are clean and not marred by fraud, corruption or irregularities; and the chief executive of a country is selected directly or indirectly through elections.

election_model <- lm(democracy ~ ., data = election_df)
stargazer(election_model, type = "text")

However, I suspect these variables suffer from high multicollinearity. Usually your knowledge of the variables – and how they were operationalised – will give you a hunch. But it is good practice to check everytime, regardless.

The eigprop() function can be used to detect the existence of multicollinearity among regressors. The function computes eigenvalues, condition indices and variance decomposition proportions for each of the regression coefficients in my election model.

To check the linear dependencies associated with the corresponding eigenvalue, the eigprop compares variance proportion with threshold value (default is 0.5) and displays the proportions greater than given threshold from each row and column, if any.

So first, let’s run the overall multicollinearity test with the eigprop() function :

mctest::eigprop(election_model)

If many of the Eigenvalues are near to 0, this indicates that there is multicollinearity.

Unfortunately, the phrase “near to” is not a clear numerical threshold. So we can look next door to the Condition Index score in the next column.

This takes the Eigenvalue index and takes a square root of the ratio of the largest eigenvalue (dimension 1) over the eigenvalue of the dimension.

Condition Index values over 10 risk multicollinearity problems.

In our model, we see the last variable – the extent to which an election is free and fair – suffers from high multicollinearity with other regressors in the model. The Eigenvalue is close to zero and the Condition Index (CI) is near 10. Maybe we can consider dropping this variable, if our research theory allows its.

Another battery of tests that the mctest package offers is the imcdiag( ) function. This looks at individual multicollinearity. That is, when we add or subtract individual variables from the model.

mctest::imcdiag(election_model)

A value of 1 means that the predictor is not correlated with other variables. As in a previous blog post on Variance Inflation Factor (VIF) score, we want low scores. Scores over 5 are moderately multicollinear. Scores over 10 are very problematic.

And, once again, we see the last variable is HIGHLY problematic, with a score of 14.7. However, all of the VIF scores are not very good.

The Tolerance (TOL) score is related to the VIF score; it is the reciprocal of VIF.

The Wi score is calculated by the Farrar Wi, which an F-test for locating the regressors which are collinear with others and it makes use of multiple correlation coefficients among regressors. Higher scores indicate more problematic multicollinearity.

The Leamer score is measured by Leamer’s Method : calculating the square root of the ratio of variances of estimated coefficients when estimated without and with the other regressors. Lower scores indicate more problematic multicollinearity.

The CVIF score is calculated by evaluating the impact of the correlation among regressors in the variance of the OLSEs. Higher scores indicate more problematic multicollinearity.

The Klein score is calculated by Klein’s Rule, which argues that if Rj from any one of the models minus one regressor is greater than the overall R2 (obtained from the regression of y on all the regressors) then multicollinearity may be troublesome. All scores are 0, which means that the R2 score of any model minus one regression is not greater than the R2 with full model.

Click here to read the mctest paper by its authors – Imdadullah et al. (2016) – that discusses all of the mathematics behind all of the tests in the package.

In conclusion, my model suffers from multicollinearity so I will need to drop some variables or rethink what I am trying to measure.

Click here to run Stepwise regression analysis and see which variables we can drop and come up with a more parsimonious model (the first suspect I would drop would be the free and fair elections variable)

Perhaps, I am capturing the same concept in many variables. Therefore I can run Principal Component Analysis (PCA) and create a new index that covers all of these electoral features.

Next blog will look at running PCA in R and examining the components we can extract.

References

Imdadullah, M., Aslam, M., & Altaf, S. (2016). mctest: An R Package for Detection of Collinearity among Regressors. R J., 8(2), 495.

Check linear regression assumptions with gvlma package in R

Packages we will need:

library(gvlma)

gvlma stands for Global Validation of Linear Models Assumptions. See Peña and Slate’s (2006) paper on the package if you want to check out the math!

Linear regression analysis rests on many MANY assumptions. If we ignore them, and these assumptions are not met, we will not be able to trust that the regression results are true.

Luckily, R has many packages that can do a lot of the heavy lifting for us. We can check assumptions of our linear regression with a simple function.

So first, fit a simple regression model:

 data(mtcars)
 summary(car_model <- lm(mpg ~ wt, data = mtcars))

We then feed our car_model into the gvlma() function:

gvlma_object <- gvlma(car_model)

Global Stat checks whether the relationship between the dependent and independent relationship roughly linear. We can see that the assumption is met.

Skewness and kurtosis assumptions show that the distribution of the residuals are normal.

Link function checks to see if the dependent variable is continuous or categorical. Our variable is continuous.

Heteroskedasticity assumption means the error variance is equally random and we have homoskedasticity!

Often the best way to check these assumptions is to plot them out and look at them in graph form.

Next we can plot out the model assumptions:

plot.gvlma(glvma_object)

The relationship is a negative linear relationship between the two variables.

This scatterplot of residuals on the y axis and fitted values (estimated responses) on the x axis. The plot is used to detect non-linearity, unequal error variances, and outliers.

As explained in this Penn State webpage on interpreting residuals versus fitted plots:

The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.

In this histograpm of standardised residuals, we see they are relatively normal-ish (not too skewed, and there is a single peak).

Next, the normal probability standardized residuals plot, Q-Q plot of sample (y axis) versus theoretical quantiles (x axis). The points do not deviate too far from the line, and so we can visually see how the residuals are normally distributed.

Click here to check out the CRAN pdf for the gvlma package.

References

Peña, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions. Journal of the American Statistical Association, 101(473), 341-354.

Visualise panel data regression with ExPanDaR package in R

The ExPand package is an example of a shiny app.

What is a shiny app, you ask? Click to look at a quick Youtube explainer. It’s basically a handy GUI for R.

When we feed a panel data.frame into the ExPanD() function, a new screen pops up from R IDE (in my case, RStudio) and we can interactively toggle with various options and settings to run a bunch of statistical and visualisation analyses.

Click here to see how to convert your data.frame to pdata.frame object with the plm package.

Be careful your pdata.frame is not too large with too many variables in the mix. This will make ExPanD upset enough to crash. Which, of course, I learned the hard way.

Also I don’t know why there are random capitalizations in the PaCkaGe name. Whenever I read it, I think of that Sponge Bob meme.

If anyone knows why they capitalised the package this way. please let me know!

So to open up the new window, we just need to feed the pdata.frame into the function:

ExPanD(mil_pdf)

For my computer, I got error messages for the graphing sections, because I had an old version of Cairo package. So to rectify this, I had to first install a source version of Cairo and restart my R session. Then, the error message gods were placated and they went away.

install.packages("Cairo", type="source")

Then press command + shift + F10 to restart R session

library(Cairo)

You may not have this problem, so just ignore if you have an up-to-date version of the necessary packages.

When the new window opens up, the first section allows you to filter subsections of the panel data.frame. Similar to the filter() argument in the dplyr package.

For example, I can look at just the year 1989:

But let’s look at the full sample

We can toggle with variables to look at mean scores for certain variables across different groups. For example, I look at physical integrity scores across regime types.

Purple plot: closed autocracy
Turquoise plot: electoral autocracy
Khaki plot: electoral democracy:
Peach plot: liberal democracy

The plots show that there is a high mean score for physical integrity scores for liberal democracies and less variance. However with the closed and electoral autocracies, the variance is greater.

We can look at a visualisation of the correlation matrix between the variables in the dataset.

Next we can look at a scatter plot, with option for loess smoother line, to graph the relationship between democracy score and physical integrity scores. Bigger dots indicate larger GDP level.

Last we can run regression analysis, and add different independent variables to the model.

We can add fixed effects.

And we can subset the model by groups.

The first column, the full sample is for all regions in the dataset.

The second column, column 1 is

Column 2 Post Soviet countries

Column 3: Latin America

Column 4: AFRICA

Column 5: Europe, North America

Column 6: Asia

Check linear regression assumptions with the olsrr package in R

Packages we will need:

library(olsrr)
library(countrycode)
library(WDI)
library(stargazer)
library(peacesciencer)
library(plm)

One core assumption of linear regression analysis is that the residuals of the regression are normally distributed.

When the normality assumption is violated, interpretation and inferences may not be reliable. In worst case, the interpretations are not at all valid.

So it is important we check this assumption is not violated.

As well residuals being normal distributed, we must also check that the residuals have the same variance (i.e. homoskedasticity).

Click here to find out how to check for homoskedasticity.

Then, if there is a problem with the variance, click here to find out how to fix heteroskedasticity (which means the residuals have a non-random pattern in their variance) with the sandwich package in R.

There are three ways to check that the error in our linear regression has a normal distribution:

plots or graphs such histograms, boxplots or Q-Q-plots, to get a visual approximation

examining skewness and kurtosis indices

formal normality tests, to check for a p-value

In this blog, we will see what factors affect military spending by a government.

We will take some variables from the World Bank via the WDI package.

Click here to read more about downloading World Bank Data with the WDI package.

Download WorldBank data with WDI package in R

mil_spend_gdp = WDI(indicator = "MS.MIL.XPND.ZS")
gdp_percap = WDI(indicator = "NY.GDP.PCAP.KD")

mil_spend_gdp %>% 
  inner_join(gdp_percap) %>% 
  select(country, year,
         mil_spend_gdp = MS.MIL.XPND.ZS,
         gdp_percap = NY.GDP.PCAP.KD) %>%  
  mutate(cown = countrycode(country, "country.name", "cown")) %>% 
  filter(!is.na(cown)) -> wdi_data

And also download some variables via the peacesciencer package.

Click here to read more about the peacesciencer package in the blog posts about building datasets

Building a dataset for political science analysis in R, PART 2

peacesciencer::create_stateyears(system = "gw") %>% 
  add_ucdp_acd() %>% 
  add_democracy() %>% 
  mutate(cown = countrycode(statename, "country.name", "cown")) -> peace_data

We can code a new binary variable that indicates if there was a UCDP conflict in the previous 10 years or not.

We could imagine a country that experienced war is more likely to keep investing in their military (as a larger percentage of their GDP) than countries that have only experienced relative peace in their recent past.

peace_data %<>%
  select(statename, cown, year, ucdpongoing, maxintensity, conflict_ids, v2x_polyarchy, polity2) %>% 
  mutate(ucdpongoing_no_na = ifelse(is.na(ucdpongoing), 0, ucdpongoing)) %>% 
  group_by(statename) %>%
  arrange(year) %>%
  mutate(war_past_10_years = ifelse(ucdpongoing == 1 & lag(ucdpongoing, order_by = year, default = 0, n = 10) == 1, 1, 0)) %>% 
  mutate(war_past_10_years_no_na = ifelse(ucdpongoing_no_na == 1 & lag(ucdpongoing_no_na, order_by = year, default = 0, n = 10) == 1, 1, 0))

We merge the data by the Correlates of War codes.

Click here to read more about using COW codes with the countrycode package.

Add Correlates of War codes with countrycode package in R

wdi_data %>% 
  inner_join(peace_data, by = c("cown", "year")) -> wdi_peace

With these data, we can build our linear regression model.

Our dependent variable is military spending as a percentage of GDP (logged)

Our independent varibles are:

GDP per capita (logged) from the World Bank
Demoracy (as measured by the V-DEM polyarchy score)
Binary variable that is 1 if a country had a UCDP conflict in the previous 10 years and 0 if none.

We will also add an interaction term with the GDP and democracy variable.

Given we have cross-sectional longitudinal data, the best option would be panel data analysis with the plm package

plm(log(mil_spend_gdp) ~ log(gdp_percap)*v2x_polyarchy + as.factor(war_past_10_years_no_na), data = wdi_peace, 
  index = c("cown", "year"), model = "within") %>% 
  stargazer(., type = "text")


	Dependent variable:

	Military spending (GDP %) (ln)

GDP pc (ln)	-0.288^***
	(0.029)

Democracy	1.004^***
	(0.353)

War 10 year dummy	0.146^***
	(0.021)

GDP pc (ln) x Democracy	-0.199^***
	(0.046)


Observations	3,686
R²	0.135
Adjusted R²	0.097
F Statistic	137.989^*** (df = 4; 3530)

Note:	^p<0.1; ^p<0.05; ^**p<0.01

However, the olsrr package cannot handle plm.

In future blog posts, we will lok more closely at plm() panel regressions and the diagnostic tests we have to run with these types of models.

So we will just look at one year, 2010.

lm(log(mil_spend_gdp) ~ log(gdp_percap)*v2x_polyarchy + as.factor(war_past_10_years_no_na), data = subset(wdi_peace, year == 2010)) -> war_model


	Dependent variable:

	Military spending (GDP %) (ln)

GDP pc (ln)	0.261^**
	(0.101)

Democracy	1.450
	(1.565)

War 10 year dummy	0.592^***
	(0.159)

GDP pc (ln) x Democracy	-0.343^**
	(0.168)

Constant	0.183
	(0.885)


Observations	136
R²	0.381
Adjusted R²	0.362
Residual Std. Error	0.615 (df = 131)
F Statistic	20.137^*** (df = 4; 131)

Note:	^p<0.1; ^p<0.05; ^**p<0.01

So now we have our OLS model, we can run a heap of linear model diagnostic functions with the olsrr package.

Built by Aravind Hebbali, the description of the package mentions that olsrr has tools designed to make it easier for users, particularly beginner/intermediate R users to build ordinary least squares regression models. Thank you Aravind!

It includes regression output, heteroskedasticity tests, collinearity diagnostics, residual diagnostics, measures of influence, model fit assessment and variable selection procedures. Look through the CRAN PDF below or look at rsquaredacademy website to get a comprehensive overview of the package

olsrr: Tools for Building OLS Regression Models Download

We will now check if the residuals in our model (the difference between what our model predicted and what the values actually are) are normally distributed

ols_test_normality(war_model)

Test	Statistic	p-value
Shapiro-Wilk	0.9817	0.0653
Kolmogorov-Smirnov	0.0524	0.8494
Cramer-von Mises	14.1123	0.0000
Anderson-Darling	0.469	0.2447

Let’s look at each test result in turn

Shapiro-Wilk:

The test statistic is 0.9817, and the p-value is 0.0653.

The null hypothesis is that the residuals are normally distributed.

In this case, the p-value is greater than the predefined significance level (typically 0.05), so you cannot reject the null hypothesis.

This suggests that the residuals may follow a normal distribution.

Woo!

Kolmogorov-Smirnov:

The test statistic is 0.0524, and the p-value is 0.8494.

Similar to the Shapiro-Wilk test, the p-value is greater than 0.05, so you cannot reject the null hypothesis of normality.

This suggests that the residuals may follow a normal distribution.

Yay.

Cramer-von Mises:

This test statistic is 14.1123, and the p-value is 0.0000.

The null hypothesis is that the residuals are normally distributed. The very low p-value indicates that you can reject the null hypothesis, suggesting that the residuals are not from a normal distribution.

Oh no.

Anderson-Darling: This is another test of normality. The test statistic is 0.469, and the p-value is 0.2447. Similar to the Shapiro-Wilk and Kolmogorov-Smirnov tests, the p-value is greater than 0.05, so you cannot reject the null hypothesis of normality.

Phew.

Which of the normality tests is the best?

And what is up with Cramer-von Mises?

A paper by Razali and Wah (2011) tested all these formal normality tests with 10,000 Monte Carlo simulation of sample data generated from alternative distributions that follow symmetric and asymmetric distributions.

Their results showed that the Shapiro-Wilk test is the most powerful normality test, followed by Anderson-Darling test, and Kolmogorov-Smirnov test. Their study did not look at the Cramer-Von Mises test.

The results of Razali and Wah’s study echo the previous findings of Mendes and Pala (2003) and Keskin (2006) in support of Shapiro-Wilk test as the most powerful normality test.

According Ahad and colleagues (2011: 641), they find that

“the performances of the normality tests, namely, the Kolmogorov-Smirnov test, Anderson-Darling test, Cramervon Mises test, and Shapiro-Wilk test, were evaluated under various spectrums of non-normal distributions and different sample sizes. The results showed that the ShapiroWilk test is the most sensitive normality test because this test rejects the null hypothesis of normality at the smallest sample sizes compared to the other tests, at all levels of skewness and kurtosis. Thus, when the four normality tests are available in a statistical package, we would recommend practitioners to use the Shapiro-Wilk normality to test the normality of data”
Ahad et al (2001: 641)

We can plot out the residuals distribution in a histogram with olsrr

olsrr::ols_plot_resid_hist(war_model)

Visually, we can confirm that the residuals have a lovely bell curve and are broadly normally distributed! With a few outliers at -2.

Nest we can check the residuals with a QQ plot.

A QQ plot is used to compare residuals to the normal distribution in linear regression. We can use a normal QQ plot to visually check if our residuals follow a theoretical normal distribution.

In addition to being good at identifying outliers and heavy tails, QQ plots can reveal characteristics such as skewness and bimodality, and can be effective even
for small samples (Marden, 2004).

olsrr::ols_plot_resid_qq(war_model)

Again we can see some outliers at -2

Next we can look at a scatter plot of residuals on the y axis and fitted values on the x axis to detect non-linearity, unequal error variances, and outliers.

Each point in the plot is a residual value (i.e. the difference between what the model predicted and what the value actually

When interpreting this plot, there are a few things we want to look out for.

The points does not deviate too far from 0. This indicates the variance is homogeneous (i.e. homescediasticity, one of my favourite words)
The points are random (i.e. show no distinct pattern) around the horizontal red line at 0

ols_plot_resid_fit(war_model)

There are some residual values at the -2, so they might be outliers. We we look at an outlier diagnostic plot in a bit.
There are no discernible pattern in the scatterplot, so there does not seem to be any heteroscedasticity in the variance.

We can run an ols_plot_resid_lev() to graph for detecting outliers and/or observations with high leverage.

ols_plot_resid_lev(war_model)

There are a few outliers with leverage that we need to look more closely and examine how they prove / challenge our given theory / hypotheses.

FINALLLY. for a more complete diagnostics check, we can insert the model into the ols_coll_diag() function to calculate the Variance Inflation Factor (VIF) and Eigenvalues of the variables in the model.

VIF scores highlight if there is multicollinearity between the independent variables. If they are too highly correlated, our model is in trouble.

If the value of VIF is 1< VIF < 5, it specifies that the variables are moderately correlated to each other.

The challenging value of VIF is between 5 to 10 as it specifies the highly correlated variables.

If VIF ≥ 5 to 10, there will be multicollinearity among the predictors in the regression model.

VIF > 10 indicate the regression coefficients are feebly estimated with the presence of multicollinearity

Read more about the issues with multicollinearity in Shrestha (2020)

ols_coll_diag(war_model)

Variables	Tolerance	VIF
GDP per capita (ln)	0.13	7.6
Democracy	0.02	56.2
War 10 years	0.91	1.1
GDP pc (ln) X Democracy	0.01	79.9

Next we look at the Eigenvalues

Eigenvalue	Condition Index	intercept	GDP pc (ln)	Democracy	War 10 years	GDP pc (ln) X Democracy
3.93	1.00	0.00	0.00	0.00	0.01	0.00
0.90	2.09	0.00	0.00	0.00	0.82	0.00
0.16	5.01	0.01	0.00	0.00	0.12	0.01
0.02	15.14	0.04	0.08	0.05	0.02	0.02
0.00	67.74	0.95	0.92	0.94	0.03	0.97

This is not good.

But it is probably due to the interaction term.

If we run the regression agaon without any interaction term , the VIF scores are all around 1!

lm(log(mil_spend_gdp) ~ log(gdp_percap) + v2x_polyarchy + as.factor(war_past_10_years_no_na), data = subset(wdi_peace, year == 2010)) -> war_model_no_interaction
ols_coll_diag(war_model_no_interaction)

There are plenty of other helpful functions in the olsrr package that we can look at with our model.

For example, we can run AIC stepwise regression to see if we need to drop any variables

aic_step <- ols_step_both_p(war_model)

Step	Variable	Added/Removed	R-Square	Adj. R-Square	C(p)	AIC	RMSE
1	v2x_polyarchy	addition	0.298	0.293	16.4970	271.6880	0.6474
2	as.factor(war_past_10_years_no_na)	addition	0.347	0.338	8.0720	263.7888	0.6266
3	log(gdp_percap)	addition	0.361	0.347	7.1600	262.8901	0.6223
4	log(gdp_percap):v2x_polyarchy	addition	0.381	0.362	5.0000	260.6381	0.6150

Lower AIC scores are better, and AIC penalizes models that use more parameters.

That means if two models explain the same amount of variation, the model with a smaller number of variable parameters will have a lower AIC score.

Many would argue that this would be the better-fit model.

We can see in step 4, the AIC is the lowest. So that is good news!

Variables doe not live in a vacuum in the model. When we run a model, we want to have a better understanding of the relationship between military spending and the independent variables conditional on the other independent variables

According to the package, the added variable plot provides information about the marginal importance of a GIVEN independent variable, given the other variables already in the model.

It shows the marginal importance of the variable in reducing the residual variability.

olsrr::ols_plot_added_variable(war_model)

The military spending dependent variable is on the y axis and we look at adding the named variable (given that the other variables are already in the model).

The democracy variable GDP variable interaction slope appears to decrease while all other variables increase.

What do the Y and X residuals represent? The Y residuals represent the part of Y not explained by all the variables other than X. The X residuals represent the part of X not explained by other variables. The slope of the line fitted to the points in the added variable plot is equal to the regression coefficient when Y is regressed on all variables including X.

A strong linear relationship in the added variable plot indicates the increased importance of the contribution of X to the model already containing the other predictors.

We can see, for example, that (with all other variables held constant) higher GDP per capita correlates with higher proportion of military spending. Richer countries seem to dedicate more of this money to building a military.

Thank you for reading !

References

Ahad, N. A., Yin, T. S., Othman, A. R., & Yaacob, C. R. (2011). Sensitivity of normality tests to non-normal data. Sains Malaysiana, 40(6), 637-641.

Marden, J. I. (2004). Positions and QQ plots. Statistical Science, 606-614.

Razali, N. M., & Wah, Y. B. (2011). Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of statistical modeling and analytics, 2(1), 21-33.

Shrestha, N. (2020). Detecting multicollinearity in regression analysis. American Journal of Applied Mathematics and Statistics, 8(2), 39-42.

Summarise data with skimr package in R

A nice way to summarise all the variables in a dataset.

install.packages("skimr")
library(skimr)

The data we’ll look at is from the Correlates of War . It provides dyadic records of militarized interstate disputes (MIDs) over the period of 1816-2010.

skim(mid)

n_missing : tells which variables have missing values

complete_rate : the percentage of the variables which are missing

Column 4 – 7 gives the mean, standard deviation, min, 25th percentile, median, 75th percentile and max values.

The last column is a histogram of each variables, so you can easily scan and see if variables are normally distributed, skewed or binary.

Compare clusters with dendextend package in R

Packages we need

install.packages("dendextend")
library(dendextend)

This blog will create dendogram to examine whether Asian countries cluster together when it comes to extent of judicial compliance. I’m examining Asian countries with populations over 1 million and data comes from the year 2019.

Judicial compliance measure how often a government complies with important decisions by courts with which it disagrees.

Higher scores indicate that the government often or always complies, even when they are unhappy with the decision. Lower scores indicate the government rarely or never complies with decisions that it doesn’t like.

It is important to make sure there are no NA values. So I will impute any missing variables.

Click here to read how to impute missing values in your dataset.

library(mice)
imputed_data <- mice(asia_df, method="cart")
asia_df <- complete(imputed_data)

Next we can scale the dataset. This step is for when you are clustering on more than one variable and the variable units are not necessarily equivalent. The distance value is related to the scale on which the different variables are made.

Therefore, it’s good to scale all to a common unit of analysis before measuring any inter-observation dissimilarities.

asia_scale <- scale(asia_df)

Next we calculate the distance between the countries (i.e. different rows) on the variables of interest and create a dist object.

There are many different methods you can use to calculate the distances. Click here for a description of the main formulae you can use to calculate distances. In the linked article, they provide a helpful table to summarise all the common methods such as “euclidean“, “manhattan” or “canberra” formulae.

I will go with the “euclidean” method. but make sure your method suits the data type (binary, continuous, categorical etc.)

asia_judicial_dist <- dist(asia_scale, method = "euclidean")
class(asia_judicial_dist)

We now have a dist object we can feed into the hclust() function.

With this function, we will need to make another decision regarding the method we will use.

The possible methods we can use are "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).

Click here for a more indepth discussion of the different algorithms that you can use

Again I will choose a common "ward.D2" method, which chooses the best clusters based on calculating: at each stage, which two clusters merge that provide the smallest increase in the combined error sum of squares.

asia_judicial_hclust <- hclust(asia_judicial_dist, method = "ward.D2")
class(asia_judicial_hclust)

We next convert our hclust object into a dendrogram object so we can plot it and visualise the different clusters of judicial compliance.

asia_judicial_dend <- as.dendrogram(asia_judicial_hclust)
class(asia_judicial_dend)

When we plot the different clusters, there are many options to change the color, size and dimensions of the dendrogram. To do this we use the set() function.

Click here to see a very comprehensive list of all the set() attributes you can use to modify your dendrogram from the dendextend package.

asia_judicial_dend %>%
set("branches_k_color", k=5) %>%    # five clustered groups of different colors
set("branches_lwd", 2) %>%          # size of the lines (thick or thin)
set("labels_colors", k=5) %>%       # color the country labels, also five groups
plot(horiz = TRUE)                  # plot the dendrogram horizontally

I choose to divide the countries into five clusters by color:

And if I zoom in on the ends of the branches, we can examine the groups.

The top branches appear to be less democratic countries. We can see that North Korea is its own cluster with no other countries sharing similar judicial compliance scores.

The bottom branches appear to be more democratic with more judicial independence. However, when we have our final dendrogram, it is our job now to research and investigate the characteristics that each countries shares regarding the role of the judiciary and its relationship with executive compliance.

Singapore, even though it is not a democratic country in the way that Japan is, shows a highly similar level of respect by the executive for judicial decisions.

Also South Korean executive compliance with the judiciary appears to be more similar to India and Sri Lanka than it does to Japan and Singapore.

So we can see that dendrograms are helpful for exploratory research and show us a starting place to begin grouping different countries together regarding a concept.

A really quick way to complete all steps in one go, is the following code. However, you must use the default methods for the dist and hclust functions. So if you want to fine tune your methods to suit your data, this quicker option may be too brute.

asia_df %>%
scale %>%
dist %>%
hclust %>%
as.dendrogram %>%
set("branches_k_color", k=5) %>%
set("branches_lwd", 2) %>%
set("labels_colors", k=5) %>%
plot(horiz = TRUE)

Recode variables with car package in R

There is one caveat with this function that we are using from the car package:

recode is also in the dplyr package so R gets confused if you just type in recode on its own; it doesn’t know which package you’re using.

Sometimes R needs some direction.

So, you must write car::recode(). This placates the R gods and they are clear which package to use.

It is useful for all other times you want to explicitly tell R which package you want it to use to avoid any confusion. Just type the package name followed by two :: colons and a list of all the functions in the package drops down. So really, it can also be useful for exploring new packages you’ve installed and loaded!

install.packages("car")
library(car)

First, subset the dataframe, so we are only looking at countries in the year 1990.

data_90 <- data[which(data$year==1990),]

Next look at a frequency of each way that regimes around the world ended.

plyr::count(data_90$regime_end)

To understand these numbers, we look at the codebook.

We want to make a new binary variable to indicate whether a coup occurred in a country in 1990 or not.

To do this we use the car::recode() function.

First we can make a numeric variable. So in the brackets, we indicate our dataframe at the start.

Next bit is important, we put all the original and new variables in ” ” inverted commas.

Also important that we separate each level of the new variable with a ; semicolon.

The punctuation marks in this function are a bit fussy and difficult but it is important.

data_90$coup_numeric <- car::recode(data_90$regime_end, "0:2 = 1; 3:13=0; NA=0")

Alternatively, we can recode the variable as a string output when we choose to make the new variable values in ‘ apostrophe marks’.

data_90$coup_string <- car::recode(data_90$regime_end, "0:2 = 'coup'; 3:13= 'no coup'; NA='no coup'")

If you want to convert a continuous variable to discrete factors, we can go to our trusty mutate() function in the dplyr package. And within mutate() we use another function: cut()

So instead of recoding binary variables or factor variables . . . we can turn a numeric variable into a discrete variable with cut()

We specify with the breaks argument to indicate where we want to divide the variable and then we can label the factors with the labels argument:

data_90  <- data_90 %>% 
dplyr::mutate(instability_discrete = cut(instability_continuous, breaks=c(-Inf, 0.3, 0.7, Inf), labels=c("low_instability", "mid_instability", "high_instability")))

Move year variable to first column in dataframe with dplyr package in R

A quick hack to create a year variable from a string variable and place it as column number one in your dataframe.

First problem with my initial dataset is that the date is a string of numbers and I want the first four characters in the string.

data$year <- substr(data$date, 0, 4)
data$year <- as.numeric(data$year)

Now I want to place it at the beginning to keep things more organised:

data = data %>% 
select(year, everything())

And we are done!

Much better.

Make word clouds with tidytext and gutenbergr in R

This blog will run through how to make a word cloud with Mill’s “On Liberty”, a treatise which argues that the state should never restrict people’s individual pursuits or choices (unless such choices harm others in society).

First, we install and load the gutenbergr package to access the catalogue of books from Project Gutenburg . This gutenberg_metadata function provides access to the website and its collection of around 60,000 digitised books in the public domain, for which their U.S. copyright has expired. This website is an amazing resource in its own right.

install.packages("gutenbergr")
library(gutenbergr)

Next we choose a book we want to download. We can search through the Gutenberg Project catalogue (with the help of the dplyr package). In the filter( ) function, we can search for a book in the library by supplying a string search term in “quotations”. Click here to see the CRAN package PDF. For example, we can look for all the books written by John Stuart Mill (search second name, first name) on the website:

mill_all <- gutenberg_metadata %>%
  filter(author = "Mill, John Stuart")

Or we can search for the title of the book:

mill_liberty <- gutenberg_metadata %>%
  filter(title = "On Liberty")

We now have a tibble of all the sentences in the book!

View(mill_liberty)

We see there are two variables in this new datafram and 4,703 string rows.

To extract every word as a unit, we need the unnest_tokens( ) function from the tidytext package:

install.packages("tidytext")
library(tidytext)

We take our mill_liberty object from above and indicate we want the unit to be words from the text. And we create a new mill_liberty_words object to hold the book in this format.

mill_liberty_words <- mill_liberty %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words)

We now have a row for each word, totalling to 17,576 words! This excludes words such as “the”, “of”, “to” and all those small sentence builder words.

Now we have every word from “On Liberty”, we can see what words appear most frequently! We can either create a list with the count( ) function:

count_word <- mill_liberty_words %>%
   count(word, sort = TRUE)

The default for a tibble object is printing off the first ten observations. If we want to see more, we can increase the n in our print argument.

print(liberty_words, n=30)

An alternative to this is making a word cloud to visualise the relative frequencies of these terms in the text.

For this, we need to install the wordcloud package.

install.packages("wordcloud")
library(wordcloud)

To get some nice colour palettes, we can also install the RColorBrewer package also:

install.packages("RColorBrewer")
library(RColorBrewer)

Check out the CRAN PDF on the wordcloud package to tailor your specifications.

For example, the rot.per argument indicates proportion words we want with 90 degree rotation. In my example, I have 30% of the words being vertical. I reran the code until the main one was horizontal, just so it pops out more.

With the scale option, we can indicate the range of the size of the words (for example from size 4 to size 0.5) in the example below

We can choose how many words we want to include in the wordcloud with the max.words argument

color_number <- 20
color_palette <- colorRampPalette(brewer.pal(8, "Paired"))(color_number)

wordcloud(words = mill_liberty_words$word, min.freq = 2,
 scale = c(4, 0.5)
          max.words=200, random.order=FALSE, rot.per=0.3, 
          colors=color_palette)

We can see straightaway the most frequent word in the book is opinion. Given that this book forms one of the most rigorous defenses of the idea of freedom of speech, a free press and therefore against the a priori censorship of dissent in society, these words check out.

If we run the code with random.order=TRUE option, the cloud would look like this:

And you can play with proportions, colours, sizes and word placement until you find one you like!

This word cloud highlights the most frequently used words in John Stuart Mill’s “Utilitarianism”:

How to graph Google search trends with gtrendsR package in R

Google Trends is a search trends feature. It shows how frequently a given search term is entered into Google’s search engine, relative to the site’s total search volume over a given period of time.

( So note: because the results are all relative to the other search terms in the time period, the dates you provide to the gtrendsR function will change the shape of your graph and the relative percentage frequencies on the y axis of your plot).

To scrape data from Google Trends, we use the gtrends() function from the gtrendsR package and the get_interest() function from the trendyy package (a handy wrapper package for gtrendsR).

If necessary, also load the tidyverse and ggplot packages.

library(tidyverse)
library(gtrendsR)
library(trendyy)

To scrape the Google trend data, call the trendy() function and write in the search terms.

We can look at relative search hits for Yevgeny Prigozhin, leader of Russian mercenary army, Wagner Group.

prig <- trendy("Prigozhin", "2022-08-25", "2023-08-26") %>% 
  get_interest()

ggplot(data = prig, aes(x = as.Date(date), 
                          y = hits)) +
  geom_line(colour = "#005f73", 
            size = 6.5, alpha = 0.1)  +
  geom_line(colour = "#005f73", 
            size = 4, alpha = 0.3) +
  geom_line(colour = "#005f73", 
            size = 3, alpha = 0.5) +
  geom_line(colour = "#005f73", 
            size = 2) +
  ylab(label = 'Relative Hits %') +
  bbplot::bbc_style() +
  xlab(label = "Search Dates") + 
  ylab(label = 'Relative Hits %') +
  labs(title = "Relative hits for `Prigozhin` on Google",
       subtitle = "Data from Google search hits") +
  geom_curve(aes(x = as.Date("2021-10-01"), 
                   y = 45, 
                   xend = as.Date("2022-02-01"), 
                   yend = 30),
             size = 2, alpha = 0.8,
               color = "#001219",
               arrow = arrow(type = "closed"))

For the next example, here we search for the term “Kamala Harris” during the period from 1st of January 2019 until today.

If you want to check out more specifications, for the package, you can check out the package PDF here. For example, we can change the geographical region (US state or country for example) with the geo specification.

We can also change the parameters of the time argument, we can specify the time span of the query with any one of the following strings:

“now 1-H” (previous hour)
“now 4-H” (previous four hours)
“today+5-y” last five years (default)
“all” (since the beginning of Google Trends (2004))

If don’t supply a string, the default is five year search data.

kamala <- trendy("Kamala Harris", "2019-01-01", "2020-08-13") %>% get_interest()

We call the get_interest() function to save this data from Google Trends into a data.frame version of the kamala object. If we didn’t execute this last step, the data would be in a form that we cannot use with ggplot().

View(kamala)

In this data.frame, there is a date variable for each week and a hits variable that shows the interest during that week. Remember, this hits figure shows how frequently a given search term is entered into Google’s search engine relative to the site’s total search volume over a given period of time.

We will use these two variables to plot the y and x axis.

To look at the search trends in relation to the events during the Kamala Presidential campaign over 2019, we can add vertical lines along the date axis, with a data.frame, we can call kamala_events.

kamala_events = data.frame(date=as.Date(c("2019-01-21", "2019-06-25", "2019-12-03", "2020-08-12")), 
event=c("Launch Presidential Campaign", "First Primary Debate", "Drops Out Presidential Race", "Chosen as Biden's VP"))

Note the very specific order the as.Date() function requires.

Next, we can graph the trends, using the above date and hits variables:

ggplot(kamala, aes(x = as.Date(date), y = hits)) +
  geom_line(colour = "steelblue", size = 2.5) +
  geom_vline(data=kamala_events, mapping=aes(xintercept=date), color="red") +
    geom_text(data=kamala_events, mapping=aes(x=date, y=0, label=event), size=4, angle=40, vjust=-0.5, hjust=0) + 
    xlab(label = "Search Dates") + 
    ylab(label = 'Relative Hits %')

Which produces:

I can update the graph

Cairo::CairoWin()

ggplot(data = kamala, aes(x = as.Date(date), 
                          y = hits)) +
  geom_vline(data = kamala_events, 
             mapping = aes(xintercept = date),
             size = 11, 
             alpha = 0.1,
             color = "#9b2226") +
  geom_vline(data = kamala_events, 
             mapping = aes(xintercept = date),
             size = 8, 
             alpha = 0.2,
             color = "#9b2226") +
  geom_vline(data = kamala_events, 
             mapping = aes(xintercept = date),
             size = 7, 
             alpha = 0.3,
             color = "#9b2226") +
  geom_vline(data = kamala_events, 
             mapping = aes(xintercept = date),
             size = 4, 
             alpha = 0.4,
             color = "#9b2226") +
  geom_line(colour = "#005f73", 
            size = 5.5, alpha = 0.4)  +
  geom_line(colour = "#005f73", 
            size = 5, alpha = 0.5) +
  geom_line(colour = "#005f73", 
            size = 4, alpha = 0.6) +
  geom_line(colour = "#005f73", 
            size = 3) +
  geom_text(data = kamala_events, 
            mapping = aes(x = date, 
                          y = 0,
                          label = event), 
                    size = 9,
                    angle = 10, 
                    vjust = -4.6, 
                    hjust = 0.7) + 
  bbplot::bbc_style() +
  xlab(label = "Search Dates") + 
  ylab(label = 'Relative Hits %') +
  labs(title = "Relative hits for `Kamala Harris` on Google",
       subtitle = "Data from Google search hits") +
  theme(plot.title = element_text(size = 40), 
        plot.subtitle = element_text(size = 20)) +
  geom_curve(aes(x = as.Date("2018-12-01"), 
                  y = 88, 
                  xend = as.Date("2019-01-15"), 
                  yend = 99),
             size = 2, alpha = 0.8,
             color = "#001219",
             curvature = -0.4,
             angle = 90,
             arrow = arrow(type = "closed")) +
  geom_curve(aes(x = as.Date("2019-05-01"), 
                 y = 69, 
                 xend = as.Date("2019-06-15"), 
                 yend = 70),
             size = 2, 
             alpha = 0.8,
             color = "#001219",
             curvature = 0.7,
             angle = 90,
             arrow = arrow(type = "closed")) +
  geom_curve(aes(x = as.Date("2019-10-15"), 
                 y = 69, 
                 xend = as.Date("2019-11-20"), 
                 yend = 46),
             size = 2, 
             alpha = 0.8,
             color = "#001219",
             curvature = 0.3,
             angle = 90,
             arrow = arrow(type = "closed"))

Super easy and a quick way to visualise the ups and downs of Kamala Harris’ political career over the past few months, operationalised as the relative frequency with which people Googled her name.

If I had chosen different dates, the relative hits as shown on the y axis would be different! So play around with it and see how the trends change when you increase or decrease the time period.

Check for multicollinearity with the car package in R

Packages we will need:

install.packages("car")
library(car)

When one independent variable is highly correlated with another independent variable (or with a combination of independent variables), the marginal contribution of that independent variable is influenced by other predictor variables in the model.

And so, as a result:

Estimates for regression coefficients of the independent variables can be unreliable.
Tests of significance for regression coefficients can be misleading.

To check for multicollinearity problem in our model, we need the vif() function from the car package in R. VIF stands for variance inflation factor. It measures how much the variance of any one of the coefficients is inflated due to multicollinearity in the overall model.

As a rule of thumb, a vif score over 5 is a problem. A score over 10 should be remedied and you should consider dropping the problematic variable from the regression model or creating an index of all the closely related variables.

This blog post will look only at the VIF score. Click here to look at how to interpret various other multicollinearity tests in the mctest package in addition to the the VIF score.

Back to our model, I want to know whether countries with high levels of clientelism, high levels of vote buying and low democracy scores lead to executive embezzlement?

So I fit a simple linear regression model (and look at the output with the stargazer package)

summary(embezzlement_model_1 <- lm(executive_embezzlement ~ clientelism_index + vote_buying_score + democracy_score, data = data_2010))

stargazer(embezzlement_model_1, type = "text")

I suspect that clientelism and vote buying variables will be highly correlated. So let’s run a test of multicollinearity to see if there is any problems.

car::vif(embezzlement_model_1)

The VIF score for the three independent variables are :

Both clientelism index and vote buying variables are both very high and the best remedy is to remove one of them from the regression. Since vote buying is considered one aspect of clientelist regime so it is probably overlapping with some of the variance in the embezzlement score that the clientelism index is already explaining in the model

So re-run the regression without the vote buying variable.

summary(embezzlement_model_2 <- lm(v2exembez ~ v2xnp_client  + v2x_api, data = vdem2010))
stargazer(embezzlement_model_2, embezzlement_model_2, type = "text")
car::vif(embezzlement_mode2)

Comparing the two regressions:

And running a VIF test on the second model without the vote buying variable:

car::vif(embezzlement_model_2)

These scores are far below 5 so there is no longer any big problem of multicollinearity in the second model.

Click here to quickly add VIF scores to our regression output table in R with jtools package.

Plus, looking at the adjusted R2, which compares two models, we see that the difference is very small, so we did not lose much predictive power in dropping a variable. Rather we have minimised the issue of highly correlated independent variables and thus an inability to tease out the real relationships with our dependent variable of interest.

tl;dr: As a rule of thumb, a vif score over 5 is a problem. A score over 10 should be remedied (and you should consider dropping the problematic variable from the regression model or creating an index of all the closely related variables).

Click here to run stepwise regression analysis to help decide which problematic variables we can drop from our model (based on AIC scores)

Impute missing values with MICE package in R

Political scientists are beginning to appreciate that multiple imputation represents a better strategy for analysing missing data to the widely used method of listwise deletion.

A very clear demonstration of this was a 2016 article by Ranjit Lall, an political economy professor in LSE. He essentially went back and examined the empirical results of multiple imputation in comparison to the commonplace listwise deletion in political science.

He did this by re-running comparative political economy studies published over a five-year period in International Organization and World Politics.

Shockingly, in almost half of the studies he re-ran, Lall found that most key results “disappeared” (by conventional statistical standards) when reanalyzed with multiple imputations rather than listwise deletion.

This is probably due to the fact that it is erroneous to assume that missing data is random and equally distributed among the overall data.

Listwise deletion involves omitting observations with missing values on any variable. This ultimately produces inefficient inferences as it is difficult to believe the assumption that the pattern of missing data is actually completely random.

This blog post will demonstrate a package for imputing missing data in a few lines of code.

Unlike what I initially thought, the name has nothing to do with the tiny rodent, MICE stands for Multivariate Imputation via Chained Equations.

Rather than abruptly deleting missing values, imputation uses information given from the non-missing predictors to provide an estimate of the missing values.

The mice package imputes in two steps. First, using mice() to build the model and subsequently call complete() to generate the final dataset.

The mice() function produces many complete copies of a dataset, each with different imputations of the missing data. Then the complete() function returns these data sets, with the default being the first.

So first install and load the package:

install.packages("mice")
library(mice)

You can check whether any variables in your potential model have an NAs (i.e. missing values) with anyNA() function.

anyNA(data$clientelism)

If there are missing values, then you can go on ahead with imputing them. First create a new object to store the multiple imputed versions of your dataset.

This iteration process takes a while, depending on how many variables you have in your data.frame. My data data.frame had about six variables so this stage took about three or four minutes to complete. I was distracted by Youtube for a bit, so I am not exactly sure. I imagine a very large dataset with hundreds of variables would make my computer freak out.

All the variables with missing values in my data.frame were continuous numerical values. I chose the method = "cart", which stands for classification and regression trees which appears quite versatile.

imputed_data <-  mice(data, method="cart")

A CART is a predictive algorithm that determines how a given variable’s values can be predicted based on other values.

It is composed of decision trees where each fork is a split in a predictor variable and each node at the end has a prediction for the target variable.

After this iterative process is complete and the command has finished running, we then use the complete() function and assign the resulting data.frame to a new object. I call it full_data

full_data <- complete(imputed_data)

I ran a quick regression to see what effect the new fully imputed data.frame had on the relationship. I could have taken a bit longer and found a result that changed as a result of the data imputation step ( as was shown in the above mentioned Lall (2016) paper) but I decided to just stick with my first shot.

We can see that the model with the imputed values have increased the total number of values by about 3,000 or so.

Given that I already have a very large n size, it is not expected that many of thecoefficients would change drastically by adding a small percentage of imputed values. However, we see that the standard error (yay) and the coefficient value decreased (meh). Additionally the R2 (by a tiny amount) decreased (weh).

I chose the cart method but there are many of method options, depending on the characteristics of the data with missing values.

Built-in univariate imputation methods are:

`pmm`	any	Predictive mean matching
`midastouch`	any	Weighted predictive mean matching
`sample`	any	Random sample from observed values
`cart`	any	Classification and regression trees
`rf`	any	Random forest imputations
`mean`	numeric	Unconditional mean imputation
`norm`	numeric	Bayesian linear regression
`norm.nob`	numeric	Linear regression ignoring model error
`norm.boot`	numeric	Linear regression using bootstrap
`norm.predict`	numeric	Linear regression, predicted values
`quadratic`	numeric	Imputation of quadratic terms
`ri`	numeric	Random indicator for nonignorable data
`logreg`	binary	Logistic regression
`logreg.boot`	binary	Logistic regression with bootstrap
`polr`	ordered	Proportional odds model
`polyreg`	unordered	Polytomous logistic regression
`lda`	unordered	Linear discriminant analysis
`2l.norm`	numeric	Level-1 normal heteroscedastic
`2l.lmer`	numeric	Level-1 normal homoscedastic, lmer
`2l.pan`	numeric	Level-1 normal homoscedastic, pan
`2l.bin`	binary	Level-1 logistic, glmer
`2lonly.mean`	numeric	Level-2 class mean
`2lonly.norm`	numeric	Level-2 class normal