How to interpret linear models with the broom package in R

Packages you will need:

library(tidyverse)
library(magrittr)     # for pipes

library(broom)        # add model variables
library(easystats)    # diagnostic graphs

library(WDI)           # World Bank data
library(democracyData) # Freedom House data

library(countrycode)   # add ISO codes
library(bbplot)        # pretty themes
library(ggthemes)      # pretty colours
library(knitr)         # pretty tables
library(kableExtra)    # make pretty tables prettier

This blog will look at the augment() function from the broom package.

After we run a linear model, the augment() function gives us more information about how accurately our model can predict the dependent variable.

It also gives us lots of information about how each observation impacts the model. With the augment() function, we can easily find observations with high leverage on the model, as well as outlier observations.

For our model, we are going to use the “women in business and law” index as the dependent variable.

According to the World Bank, this index measures how laws and regulations affect women’s economic opportunity.

Overall scores are calculated by taking the average score of each index (Mobility, Workplace, Pay, Marriage, Parenthood, Entrepreneurship, Assets and Pension), with 100 representing the highest possible score.
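
As a toy example of that arithmetic, a country's overall score is just the mean of its eight sub-index scores (the values below are made up):

sub_indices <- c(mobility = 100, workplace = 75, pay = 75, marriage = 80,
                 parenthood = 60, entrepreneurship = 100, assets = 80, pension = 75)
mean(sub_indices)  # the overall index score for this hypothetical country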

On the right-hand side of the model, our independent variables will be child mortality, government military spending as a percentage of GDP, and Freedom House (democracy) scores.

First we download the World Bank data and summarise the variables across the years.

Click here to read more about the WDI package and downloading variables from the World Bank website.

women_business = WDI(indicator = "SG.LAW.INDX")
mortality = WDI(indicator = "SP.DYN.IMRT.IN")
military_spend_gdp <- WDI(indicator = "MS.MIL.XPND.ZS")

We get the average across the years (2000 onwards) for the three variables. I don't want to run a panel data regression, so I get a single score for each country. In total, there are 160 countries that have all observations. I use the countrycode() function to add Correlates of War (COW) codes. This helps us to filter out the non-countries and regions that the World Bank provides. And later, we will use the COW codes to merge in the Freedom House scores.

women_business %>%
  filter(year > 1999) %>% 
  inner_join(mortality) %>% 
  inner_join(military_spend_gdp) %>% 
  select(country, year, iso2c, 
         fem_bus = SG.LAW.INDX, 
         mortality = SP.DYN.IMRT.IN,
         mil_gdp = MS.MIL.XPND.ZS)  %>% 
  mutate_all(~ifelse(is.nan(.), NA, .)) %>% 
  select(-year) %>% 
  group_by(country, iso2c) %>% 
  summarize(across(where(is.numeric), 
   ~ mean(.x, na.rm = TRUE), .names = "mean_{col}")) %>% 
  ungroup() %>% 
  mutate(cown = countrycode::countrycode(iso2c, "iso2c", "cown")) %>% 
  filter(!is.na(cown)) -> wdi_summary

Next we download the Freedom House data with the democracyData package.

Click here to read more about this package.

fh <- download_fh()

fh %>% 
  group_by(fh_country) %>% 
  filter(year > 1999) %>% 
  summarise(mean_fh = mean(fh_total, na.rm = TRUE)) %>% 
  mutate(cown = countrycode::countrycode(fh_country, "country.name", "cown")) %>% 
  mutate_all(~ifelse(is.nan(.), NA, .)) %>% 
  filter(!is.na(cown))  -> fh_summary

We join both the datasets together with the inner_join() function:

fh_summary %>%
  inner_join(wdi_summary, by = "cown") %>% 
  select (-c(cown, iso2c, fh_country)) -> wdi_fh

Before we model the data, we can look at the correlation matrix with the corrplot package:

library(corrplot)

wdi_fh %>% 
  drop_na() %>% 
  select(-country)  %>% 
  select(`Females in business` = mean_fem_bus,
        `Mortality rate` = mean_mortality,
        `Freedom House` = mean_fh,
        `Military spending GDP` = mean_mil_gdp)  %>% 
  cor() %>% 
  corrplot(method = 'number',
           type = 'lower',
           number.cex = 2, 
           tl.col = 'black',
           tl.srt = 30,
           diag = FALSE)

Next, we run a simple OLS linear regression. We don't want the country variable, so we omit it from the list of independent variables.

fem_bus_lm <- lm(mean_fem_bus ~ . - country, data = wdi_fh)
                     Dependent variable: mean_fem_bus
-----------------------------------------------------
mean_fh              -2.807***  (0.362)
mean_mortality       -0.078*    (0.044)
mean_mil_gdp         -0.416**   (0.205)
Constant             94.684***  (2.024)
-----------------------------------------------------
Observations         160
R2                   0.557
Adjusted R2          0.549
Residual Std. Error  11.964 (df = 156)
F Statistic          65.408*** (df = 3; 156)
Note: *p<0.1; **p<0.05; ***p<0.01

We can look at some preliminary diagnostic plots.

Click here to read more about the easystats package. I found it a bit tricky to download the first time.

performance::check_model(fem_bus_lm)

The line is not flat at the beginning, so that is not ideal.

We will look more into this later with the variables we create with augment() a bit further down this blog post.

None of our variables have a VIF score above 5, so that is always nice to see!
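
If we want the VIF scores themselves, the performance package (part of easystats) can print them on their own:

performance::check_collinearity(fem_bus_lm)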

From the broom package, we can use the augment() function to create a whole heap of new columns about the variables in the model.

fem_bus_pred <- broom::augment(fem_bus_lm)

  • .fitted = this is the model prediction value for each country’s dependent variable score. Ideally we want them to be as close to the actual scores as possible. If they are totally different, this means that our independent variables do not do a good job explaining the variance in our “women in business” index.

  • .resid = this is the actual dependent variable value minus the .fitted value.
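
As a quick sanity check, we can confirm that relationship ourselves:

fem_bus_pred %>% 
  mutate(check_resid = mean_fem_bus - .fitted) %>% 
  select(mean_fem_bus, .fitted, .resid, check_resid)  # .resid and check_resid should match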

We can look at the fitted values that the model uses to predict the dependent variable – level of women in business – and compare them to the actual values.

The fourth column in the table is the difference between the predicted and actual values, i.e. the residual.

fem_bus_pred %>% 
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  arrange(mean_fem_bus) %>% 
  select(Country = country,
    `Fem in bus (Actual)` = mean_fem_bus,
    `Fem in bus (Predicted)` = .fitted,
    `Fem in bus (Difference)` = .resid,
                  `Mortality rate` = mean_mortality,
                  `Freedom House` = mean_fh,
                  `Military spending GDP` = mean_mil_gdp)  %>% 
  kbl(full_width = F) 
Country       Leverage of country   Fem in bus (Actual)   Fem in bus (Predicted)
Austria       0.02                  88.92                 88.13
Belgium       0.02                  92.13                 87.65
Costa Rica    0.02                  79.80                 87.84
Denmark       0.02                  96.36                 87.74
Finland       0.02                  94.23                 87.74
Iceland       0.02                  96.36                 88.90
Ireland       0.02                  95.80                 88.18
Luxembourg    0.02                  94.32                 88.33
Sweden        0.02                  96.45                 87.81
Switzerland   0.02                  83.81                 87.78

And we can graph them out:

fem_bus_pred %>%
  mutate(fh_category = cut(mean_fh, breaks = 5,
         labels = c("full demo ", "high", "middle", "low", "no demo"))) %>% 
  ggplot(aes(x = .fitted, y = mean_fem_bus)) + 
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) + 
  geom_smooth(method = "loess", alpha = 0.2, color = "#20948b") + 
  bbplot::bbc_style() + 
  labs(x = '', y = '', title = "Fitted values versus actual values")

In addition to the predicted values generated by the model, other new columns that the augment function adds include:

  • .hat = this is a measure of the leverage of each variable.

  • .cooksd = this is the Cook’s Distance. It shows how much actual influence the observation had on the model. It combines information from .resid and .hat.

  • .sigma = this is the estimate of the residual standard deviation if that observation is dropped from the model

  • .std.resid = standardised residuals
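
For example, to flag observations with large standardised residuals (a common rule of thumb is an absolute value above 2), we can filter on the new column:

fem_bus_pred %>% 
  filter(abs(.std.resid) > 2) %>% 
  select(mean_fem_bus, .fitted, .std.resid)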

If we look at the .hat observations, we can examine the amount of leverage that each country has on the model.

fem_bus_pred %>% 
  mutate(dplyr::across(where(is.numeric), ~round(., 2))) %>%
  arrange(desc(.hat)) %>% 
  select(Country = country,
         `Leverage of country` = .hat,
         `Fem in bus (Actual)` = mean_fem_bus,
         `Fem in bus (Predicted)` = .fitted)  %>% 
  kbl(full_width = F) %>%
  kable_material_dark()

Next, we can look at Cook’s Distance. This is an estimate of the influence of a data point. According to the Statistics How To website, Cook’s D is a combination of each observation’s leverage and residual values; the higher the leverage and residuals, the higher the Cook’s distance.

  1. If a data point has a Cook’s distance of more than three times the mean, it is a possible outlier
  2. Any point over 4/n, where n is the number of observations, should be examined
  3. To find the potential outlier’s percentile value, use the F-distribution; a percentile of over 50 indicates a highly influential point

fem_bus_pred %>% 
  mutate(fh_category = cut(mean_fh, 
breaks =  5,
  labels = c("full demo ", "high", "middle", "low", "no demo"))) %>%  
  mutate(outlier = ifelse(.cooksd > 4/nrow(fem_bus_pred), 1, 0)) %>% 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) + 
  ggrepel::geom_text_repel(aes(label = ifelse(outlier == 1, country, NA))) + 
  labs(x ='', y = '', title = 'Influential Outliers') + 
  bbplot::bbc_style() 
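
We could also apply the first rule of thumb from the list above and flag any country whose Cook's distance is more than three times the mean:

fem_bus_pred %>% 
  filter(.cooksd > 3 * mean(.cooksd, na.rm = TRUE)) %>% 
  select(mean_fem_bus, .fitted, .cooksd)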

We can decrease the cutoff from 4/n to 0.5/n to flag more outliers that are not as influential.

We can also add a horizontal line at zero to see the spread of the residuals.

fem_bus_pred %>% 
  mutate(fh_category = cut(mean_fh, breaks =  5,
labels = c("full demo ", "high", "middle", "low", "no demo"))) %>%  
  mutate(outlier = ifelse(.cooksd > 0.5/nrow(fem_bus_pred), 1, 0)) %>% 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) + 
  geom_hline(yintercept = 0, color = "#20948b", size = 2, alpha = 0.5) + 
  ggrepel::geom_text_repel(aes(label = ifelse(outlier == 1, country, NA)), size = 6) + 
  labs(x ='', y = '', title = 'Influential Outliers') + 
  bbplot::bbc_style() 

To look at the model-level data, we can use the tidy() function.

fem_bus_tidy <- broom::tidy(fem_bus_lm)

And glance() to examine things such as the R-squared value, the overall residual standard deviation of the model (sigma) and the AIC scores.

broom::glance(fem_bus_lm)

An R-squared of 0.55 is not that hot, so this model needs a fair bit more work.

We can also use the broom package to graph out the assumptions of the linear model. First, we can check that the residuals are normally distributed!

fem_bus_pred %>% 
  ggplot(aes(x = .resid)) + 
  geom_histogram(bins = 15, fill = "#20948b") + 
  labs(x = '', y = '', title = 'Distribution of Residuals') +
  bbplot::bbc_style()

Next we can plot the predicted versus actual values from the model with and without the outliers.

First, all countries, like we did above:

fem_bus_pred %>%
  mutate(fh_category = cut(mean_fh, breaks = 5,
         labels = c("full demo ", "high", "middle", "low", "no demo"))) %>% 
  ggplot(aes(x = .fitted, y = mean_fem_bus)) + 
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) + 
  geom_smooth(method = "loess", alpha = 0.2, color = "#20948b") + 
  bbplot::bbc_style() + 
  labs(x = '', y = '', title = "Fitted values versus actual values")

And here is how the plot looks if we drop the outliers that we spotted earlier:

fem_bus_pred %>%
  filter(country != "Eritrea") %>% 
   filter(country != "Belarus") %>% 
  mutate(fh_category = cut(mean_fh, breaks = 5,
                           labels = c("full demo ", "high", "middle", "low", "no demo"))) %>% 
  ggplot(aes(x = .fitted, y = mean_fem_bus)) + 
  geom_point(aes(color = fh_category), size = 4, alpha = 0.6) + 
  geom_smooth(method = "loess", alpha = 0.2, color = "#20948b") + 
  bbplot::bbc_style() + 
  labs(x = '', y = '', title = "Fitted values versus actual values")

How to recreate Pew opinion graphs with ggplot2 in R

Packages we will need

library(HH)
library(tidyverse)
library(bbplot)
library(haven)

In this blog post, we are going to recreate Pew Opinion poll graphs.

This is the plot we will try to recreate on gun control opinions of Americans:

To do this, we will download the data from the Pew website by following the link below:

atp <- read.csv(file.choose())

We then select the variables related to gun control opinions

atp %>% 
  select(GUNPRIORITY1_b_W87:GUNPRIORITY2_j_W87) -> gun_df

I want to rename the variables so I don’t forget what they are.

Then, we convert them all to factor variables because haven labelled class variables are sometimes difficult to wrangle…

gun_df %<>%
  select(mental_ill = GUNPRIORITY1_b_W87,
         assault_rifle = GUNPRIORITY1_c_W87, 
         gun_database = GUNPRIORITY1_d_W87,
         high_cap_mag = GUNPRIORITY1_e_W87,
         gunshow_bkgd_check = GUNPRIORITY1_f_W87,
         conceal_gun =GUNPRIORITY2_g_W87,
         conceal_gun_no_permit = GUNPRIORITY2_h_W87,
         teacher_gun = GUNPRIORITY2_i_W87,
         shorter_waiting = GUNPRIORITY2_j_W87) %>% 
  mutate(across(everything(), haven::as_factor))

We can also convert the “Refused” answers to NA, so they are easier to filter out.

gun_df %<>% 
  mutate(across(where(is.factor), ~na_if(., "Refused")))

Next we will pivot the variables to long format. The new names variable will be survey_question and the responses (Strongly agree, Somewhat agree etc) will go to the new response variable!

gun_df %>% 
  pivot_longer(everything(), names_to = "survey_question", values_to = "response") -> gun_long

And next we calculate counts and frequencies for each variable

gun_long %<>% 
  group_by(survey_question, response) %>% 
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>% 
  ungroup() 

Then we want to reorder the levels of the factors so that they are in the same order as the original Pew graph.

gun_long %>% 
  mutate(survey_question = as.factor(survey_question))   %>% 
   mutate(survey_question_reorder = factor(survey_question, 
          levels =  c( 
           "conceal_gun_no_permit",
           "shorter_waiting",
           "teacher_gun",
           "conceal_gun",
           "assault_rifle",
           "high_cap_mag",
           "gun_database",
           "gunshow_bkgd_check",
           "mental_ill"
           ))) -> gun_reordered

And we use the hex colours from the original graph … very brown… I used this hex color picker website to find the right hex numbers: https://imagecolorpicker.com/en

brown_palette <- c("Strongly oppose" = "#8c834b",
                   "Somewhat oppose" = "#beb88f",
                   "Somewhat favor" = "#dfc86c",
                   "Strongly favor" = "#caa31e")

And last, we use the geom_bar() – with position = "stack" and stat = "identity" arguments – to create the bar chart.

To add the numbers, we add a geom_text() function with label = frequency within aes(), and then position = position_stack() with hjust and vjust until you are happy with where the numbers are.

gun_reordered %>% 
  filter(!is.na(response)) %>% 
  mutate(frequency = round(freq * 100, 0)) %>%
  ggplot(aes(x = survey_question_reorder, 
             y = frequency, fill = response)) +
  geom_bar(position = "stack",
           stat = "identity") + 
  coord_flip() + 
  scale_fill_manual(values = brown_palette) +
  geom_text(aes(label = frequency), size = 10, 
            color = "black", 
            position = position_stack(vjust = 0.5)) +
  bbplot::bbc_style() + 
  labs(title = "Broad support for barring people with mental illnesses 
       \n from obtaining guns, expanded background checks",
       subtitle = "% who", 
       caption = "Note: No answer responses not shown.\n Source: Survey of U.S. adults conducted April 5-11 2021.") + 
  scale_x_discrete(labels = c(
    "Allowing people to carry concealed \n guns without a permit",
    "Shortening waiting periods for people \n who want to buy guns legally",
    "Allowing teachers and school officials \n to carry guns in K-12 schools",
    "Allowing people to carry \n concealed guns in more places",
    "Banning assault-style weapons",
    "Banning high capacity ammunition \n magazines that hold more than 10 rounds",
    "Creating a federal government \n database to track all gun sales",
    "Making private gun sales \n subject to background check",
    "Preventing people with mental \n illnesses from purchasing guns"
    ))

Unfortunately, this does not have diverging stacks from the middle of the graph.

We can make a diverging stacked bar chart using function likert() from the HH package.

For this we want to turn the dataset back to wider with a column for each of the responses (strongly agree, somewhat agree etc) and find the frequency of each response for each of the questions on different gun control measures.

Then with the likert() function, we take the survey question variable and, with the tilde (~), model it as a product of each response. Because the responses are every other variable in the dataset, we can use the shorthand of the period / full stop.

We use positive.order = TRUE because we want the questions in descending order of response, not in alphabetical order.

gun_reordered %<>%
    filter(!is.na(response)) %>%  
  select(survey_question, response, freq) %>%  
  pivot_wider(names_from = response, values_from = freq ) %>%
  ungroup() %>% 
  HH::likert(survey_question ~., positive.order = TRUE,
            main =  "Broad support for barring people with mental illnesses
            \n from obtaining guns, expanded background checks")

With this function, it is difficult to customise … but it is very quick to make a diverging stacked bar chart.

Angry Stephen Colbert GIF by The Late Show With Stephen Colbert - Find & Share on GIPHY

If we return to ggplot2, which is easier to customise, I found a solution on Stack Overflow! Thanks to this answer! The solution is to put two categories on one side of the centre point and two categories on the other!

gun_reordered %>% 
filter(!is.na(response)) %>% 
  mutate(frequency = round(freq * 100, 0)) -> gun_final

And graph out

ggplot(data = gun_final, aes(x = survey_question_reorder, 
            fill = response)) +
  geom_bar(data = subset(gun_final, response %in% c("Strongly favor",
           "Somewhat favor")),
           aes(y = -frequency), position="stack", stat="identity") +
  geom_bar(data = subset(gun_final, !response %in% c("Strongly favor",
            "Somewhat favor")), 
           aes(y = frequency), position="stack", stat="identity") +
  coord_flip() + 
  scale_fill_manual(values = brown_palette) +
  geom_text(data = gun_final, aes(y = frequency, label = frequency), size = 10, color = "black", position = position_stack(vjust = 0.5)) +
  bbplot::bbc_style() + 
  labs(title = "Broad support for barring people with mental illnesses 
       \n from obtaining guns, expanded background checks",
       subtitle = "% who", 
       caption = "Note: No answer responses not shown.\n Source: Survey of U.S. adults conducted April 5-11 2021.") + 
  scale_x_discrete(labels = c(
    "Allowing people to carry concealed \n guns without a permit",
    "Shortening waiting periods for people \n who want to buy guns legally",
    "Allowing teachers and school officials \n to carry guns in K-12 schools",
    "Allowing people to carry \n concealed guns in more places",
    "Banning assault-style weapons",
    "Banning high capacity ammunition \n magazines that hold more than 10 rounds",
    "Creating a federal government \n database to track all gun sales",
    "Making private gun sales \n subject to background check",
    "Preventing people with mental \n illnesses from purchasing guns"
  ))

To complete PART 2 of this graph, I need to figure out how to add lines to the graph and place the frequency labels in the correct spots.

Examining Ireland’s foreign policy in pictures with R

Packages we will need:

library(peacesciencer)  
library(forcats)
library(ggflags)
library(tidyverse)
library(magrittr)
library(waffle)
library(bbplot)
library(rvest)
library(janitor)        # clean_names()
library(countrycode)    # add COW codes
library(democracyData)  # PACL regime data

In January 2015, the Irish government published a review of Ireland’s foreign policy. The document, The Global Island: Ireland’s Foreign Policy for a Changing World offers a perspective on Ireland’s place in the world.

In this blog, we will graph out some of the key features of Ireland’s foreign policy so we can have a quick overview of the key relationships and trends.


First, we will look at the aid that Ireland gives to foreign countries. The read.csv(file.choose()) call will open up a file window so you can navigate to the data you downloaded from the OECD DAC website: https://data.oecd.org/oda/net-oda.htm

dac <- read.csv(file.choose())

We will filter only Ireland and clean the names with the clean_names() function from the janitor package:

dac %<>% 
  filter(Donor == "Ireland") %>% 
  clean_names()

And change the variables, adding the Correlates of War codes and cleaning up some of the countries’ names.

dac %<>% 
  mutate(cown = countrycode(recipient_2, "country.name", "cown"),
         aid_amount = value*1000000) %>%  
  select(country = recipient_2, cown,
         year, time, aid_type, value, aid_amount) %>%
  mutate(cown = ifelse(country == "West Bank and Gaza Strip", 6666,
         ifelse(country == "Serbia", 345, 
         ifelse(country == "Micronesia", 987,cown))))%>%
  filter(!is.na(cown)) 

Next we can convert the dataframe to wider format so we have a value column for each aid type:

dac %>% 
  distinct(country, cown, year, time, aid_type, value, .keep_all = TRUE)  %>%  
  pivot_wider(names_from = "aid_type", values_from = "aid_amount") %>% 
  mutate(across(where(is.numeric), ~ replace_na(., 0))) %>% 
  clean_names() -> dac_wider

And we graph out the three main types of aid:

dac_wider %>%
  group_by(year) %>% 
  summarise(total_humanitarian = sum(humanitarian_aid, na.rm = TRUE),
  total_technical = sum(technical_cooperation, na.rm = TRUE),
  total_development_food_aid = sum(development_food_aid)) %>% 
  ungroup() %>% 
  pivot_longer(!year, names_to = "aid_type", values_to = "aid_value") %>% 
  ggplot(aes(x = year, y = aid_value, groups = aid_type)) + 
  geom_line(aes(color = aid_type), size = 2, show.legend = FALSE) +
  geom_point(aes(color = aid_type), fill = "white", shape = 21, size = 3, stroke = 2) +
  bbplot::bbc_style()  +
  scale_y_continuous(labels = scales::comma) + 
  scale_x_discrete(limits = c(2010:2018)) +
  labs(title = "Irish foreign aid by aid type (2010 - 2018)",
       subtitle = ("Source: OECD DAC")) +
  scale_color_discrete(name = "Aid type", 
        labels = c("Development and Food", "Humanitarian", "Technical"))

We will look at total ODA aid:

dac %>% 
  count(aid_type) %>% 
  arrange(desc(n)) %>% 
  knitr::kable(format = "html")
aid_type                               n
Imputed Multilateral ODA               2298
Memo: ODA Total, excl. Debt            1292
Memo: ODA Total, Gross disbursements   1254
ODA: Total Net                         1249
Grants, Total                          1203
Technical Cooperation                  541
ODA per Capita                         532
Humanitarian Aid                       518
ODA as % GNI (Recipient)               504
Development Food Aid                   9

And get some pretty hex colours:

pal_10 <- c("#001219","#005f73","#0a9396","#94d2bd","#e9d8a6","#ee9b00","#ca6702","#bb3e03","#ae2012","#9b2226")

And download some regime, democracy, region and continent data from the PACL dataset with the democracyData package:

pacl <- redownload_pacl() 

pacl %<>% 
  mutate(regime_name = ifelse(regime == 0, "Parliamentary democracies",
         ifelse(regime == 1, "Mixed democracies",
         ifelse(regime == 2, "Presidential democracies",
         ifelse(regime == 3, "Civilian autocracies",
         ifelse(regime == 4, "Military dictatorships",
         ifelse(regime ==  5,"Royal dictatorships", regime))))))) %>%
  mutate(regime = as.factor(regime)) 

pacl %<>% 
  select(year, country = pacl_country, 
         democracy, regime_name,
         region_name = un_region_name, 
         continent_name = un_continent_name)

pacl %<>% 
  mutate(cown = countrycode(country, "country.name", "cown")) %>% 
  select(!country)

Summarise the total aid for each country across the years and choose the top 20 countries

dac %>% 
  filter(aid_type == "Memo: ODA Total, Gross disbursements") %>% 
  group_by(country) %>% 
  summarise(total_country_aid = sum(aid_amount, na.rm = TRUE)) %>% 
  ungroup() %>% 
  top_n(n = 20) %>% 
  mutate(cown = countrycode::countrycode(country, "country.name", "cown")) %>% 
  inner_join(pacl, by = "cown") %>%  
  mutate(region_name = ifelse(country == "West Bank and Gaza Strip", "Western Asia", region_name)) %>% 
  mutate(region_name = ifelse(region_name == "Western Asia", "Middle East", region_name)) %>% 
  mutate(country = ifelse(country == "West Bank and Gaza Strip", "Palestine",
  ifelse(country == "Democratic Republic of the Congo", "DR Congo",
  ifelse(country == "Syrian Arab Republic", "Syria", country)))) %>% 
  mutate(iso2 = tolower(countrycode::countrycode(country, "country.name", "iso2c"))) %>% 
  ggplot(aes(x = forcats::fct_reorder(country, total_country_aid), y = total_country_aid)) + 
  geom_bar(aes(fill = region_name), stat = "identity", width = 0.7) + 
  coord_flip() + bbplot::bbc_style() + 
  geom_flag(aes(x = country, y = -100, country = iso2), size = 12) +
  scale_fill_manual(values = pal_10) +
  labs(title = "Ireland's largest ODA foreign aid recipients, 2010 - 2018",
       subtitle = ("Source: OECD DAC")) + 
  xlab("") + ylab("") + 
  scale_y_continuous(labels = scales::comma)

We can make a waffle plot to look at the different types of regimes to which the Irish government gave aid over the decades

 dac %>% 
  inner_join(pacl, by = c("cown", "year")) %>%  # join in the PACL regime data first
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s")) %>% 
  group_by(decade) %>% 
  count(regime_name) %>% 
  ggplot(aes(fill = regime_name, values = n)) +
  geom_waffle(color = "white", size = 0.3, n_rows = 10, flip = TRUE) +
  facet_wrap(~decade, nrow = 1, strip.position = "bottom") + 
  bbplot::bbc_style()  +
  scale_fill_manual(values = pal_10) +
   scale_x_discrete(breaks = round(seq(0, 1, by = 0.2),3)) +
  labs(title = "Ireland's ODA foreign aid recipient regime types since 1945",
       subtitle = ("Source: OECD DAC"))  

Next, we will download dyadic foreign policy similarity measures from peacesciencer.

The peacesciencer package has tools and data sets for the study of quantitative peace science. 

Click here to read more about the peacesciencer package by Steven Miller

fp_similar_df <- peacesciencer::create_dyadyears() %>% 
  add_gwcode_to_cow() %>% 
  add_fpsim()	

I am only looking at dyadic foreign policy similarity with Ireland, so filter by Ireland’s Correlates of War code, 205.

Click here to find out all countries’ COW code

fp_similar_df %<>% 
  filter(ccode1 == 205)

Data on alliance portfolios comes from the Correlates of War project and is used to calculate the similarity of foreign policy positions (see Altfeld & Bueno de Mesquita, 1979).

The assumption is that similar alliance portfolios are the result of similar foreign policy positions.

In increasing order of commitment, the strength of alliance ties can be:

  1. no commitment
  2. entente
  3. neutrality or nonaggression pact
  4. defense pact

We will map out alliance similarity. This will use the measurement calculated with Cohen’s Kappa. Check out Häge’s (2011) article to read more about the different ways to measure alliance similarity.
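
As a rough sketch of that (assuming the binary alliance-data kappa column that add_fpsim() returns is called kappaba; check the codebook, since the exact column names depend on the package version), we can compute each country's average alliance similarity with Ireland and plot the top ten:

fp_similar_df %>% 
  mutate(country = countrycode::countrycode(ccode2, "cown", "country.name")) %>% 
  filter(!is.na(country)) %>% 
  group_by(country) %>% 
  summarise(avg_alliance = mean(kappaba, na.rm = TRUE)) %>%  # kappaba: assumed column name
  slice_max(avg_alliance, n = 10) %>% 
  ggplot(aes(x = forcats::fct_reorder(country, avg_alliance), y = avg_alliance)) + 
  geom_bar(stat = "identity", fill = "#20948b") + 
  coord_flip() + 
  bbplot::bbc_style() + 
  ggtitle("Alliance portfolio similarity with Ireland (Cohen's kappa)")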

Next we can look at UN similarity.

The UN voting variable takes three values:

1 = Yes

2 = Abstain

3 = No

Based on these data, if two countries vote in a similar way on the same UN resolutions, this is a measure of the degree to which the dyad members’ foreign policy positions are similar.

fp_similar_df %>% 
  mutate(country = countrycode::countrycode(ccode2, "cown", "country.name")) %>% 
  select(country, ccode2, year,
         un_similar = kappavv) %>% 
  filter(year > 1989) %>% 
  filter(!is.na(country)) %>%
  mutate(iso2 = tolower(countrycode::countrycode(country, "country.name", "iso2c"))) %>% 
  group_by(country) %>% 
  mutate(avg_un = mean(un_similar, na.rm = TRUE)) %>%
  distinct(country, avg_un, iso2, .keep_all = FALSE) %>% 
  ungroup() %>% 
  top_n(n = 10)  -> top_un_similar

And graph out the top ten

  top_un_similar %>%
  ggplot(aes(x = forcats::fct_reorder(country, avg_un), 
             y = avg_un)) + 
  geom_bar(stat = "identity",
           width = 0.7, 
           color = "#0a85e5", 
           fill = "#0a85e5") +
  ggflags::geom_flag(aes(x = country, y = 0, country = iso2), size = 15) +
  coord_flip() + bbplot::bbc_style()  +
  ggtitle("UN voting similarity with Ireland since 1990")

If we change the top_n() to negative, we can get the bottom 10

top_n(n = -10)

We can quickly scrape data about the EU countries with the rvest package


eu_members_html <- read_html("https://en.wikipedia.org/wiki/European_Union")
eu_members_tables <- eu_members_html %>% html_table(header = TRUE, fill = TRUE)

eu_member <- eu_members_tables[[6]]

eu_member %<>% 
  janitor::clean_names()

eu_member %>% distinct(state) %>%  pull(state) -> eu_state

Last we are going to look at globalization scores. The data comes from the KOF Globalisation Index. This measures the economic, social and political dimensions of globalisation. Globalisation in the economic, social and political fields has been on the rise since the 1970s, receiving a particular boost after the end of the Cold War.

Click here to download the data from the KOF website.

kof <- read.csv(file.choose())  # read in the KOF data you downloaded (adjust if not CSV)

kof %>%
  filter(country %in% eu_state) -> kof_eu

And compare Ireland to other EU countries on financial KOF index scores. We will put Ireland in green and the rest of the countries as grey to make it pop.

Ireland appears to follow the general EU trends and is not an outlier for financial globalisation scores.

kof_eu %>% 
  ggplot(aes(x = year,  y = finance, groups = country)) + 
  geom_line(color = ifelse(kof_eu$country == "Ireland",     "#2a9d8f", "#8d99ae"),
  size = ifelse(kof_eu$country == "Ireland", 3, 2), 
  alpha = ifelse(kof_eu$country == "Ireland", 0.9, 0.3)) +
  bbplot::bbc_style() + 
  ggtitle("Financial Globalization in Ireland, 1970 to 2020", 
          subtitle = "Source: KOF")

References

Häge, F. M. (2011). Choice or circumstance? Adjusting measures of foreign policy similarity for chance agreement. Political Analysis, 19(3), 287-305.

Dreher, A. (2006). Does globalization affect growth? Evidence from a new index of globalization. Applied Economics, 38(10), 1091-1110.

How to tidy up messy Wikipedia data with dplyr in R

Packages we will need:

library(rvest)
library(magrittr)
library(tidyverse)
library(waffle)
library(wesanderson)
library(ggthemes)
library(countrycode)
library(forcats)
library(stringr)
library(tidyr)
library(janitor)
library(knitr)

To see another blog post that focuses on cleaning messy strings and dates, click here to read

We are going to look at Irish embassies and missions around the world. Where are the embassies, and which country has the most missions (including embassies, consulates and representational offices)?

Let’s first scrape the embassy data from the Wikipedia page. Here is how it looks on the webpage.

It is a bit confusing because Ireland does not have a mission in every country. Argentina, for example, is the embassy for Bolivia, Paraguay and Uruguay.

Also, there are some consulates-general and other mission types.

Some countries have more than one mission, such as UK, Canada, US etc. So we are going to try and clean up this data.

Click here to read more about scraping data with the rvest package

embassies_html <- read_html("https://en.wikipedia.org/wiki/List_of_diplomatic_missions_of_Ireland")

embassies_tables <- embassies_html %>% html_table(header = TRUE, fill = TRUE)

We will extract the data from the different continent tables and then bind them all together at the end.

africa_emb <- embassies_tables[[1]]

africa_emb %<>% 
  mutate(continent = "Africa")

americas_emb <- embassies_tables[[2]]

americas_emb %<>% 
  mutate(continent = "Americas")

asia_emb <- embassies_tables[[3]]

asia_emb %<>% 
  mutate(continent = "Asia")

europe_emb <- embassies_tables[[4]]

europe_emb %<>% 
  mutate(continent = "Europe")

oceania_emb <- embassies_tables[[5]]

oceania_emb %<>% 
  mutate(continent = "Oceania")

Last, we bind all the tables together by rows, with rbind()

ire_emb <- rbind(africa_emb, 
                 americas_emb,
                 asia_emb,
                 europe_emb,
                 oceania_emb)

And clean up the names with the janitor package

ire_emb %<>% 
  janitor::clean_names() 

There is a small typo with a hyphen, so there are separate “Consulate General” and “Consulate-General” levels… we will clean that up to make one single factor level.

ire_emb %<>% 
  mutate(mission = ifelse(mission == "Consulate General", "Consulate-General", mission))

We can count out how many of each type of mission there are

ire_emb %>% 
  group_by(mission) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  knitr::kable(format = "html")
mission                 n
Embassy                 69
Consulate-General       17
Liaison office          1
Representative office   1

A quick waffle plot

ire_emb %>% 
  group_by(mission) %>%
  count() %>% 
  arrange(desc(n)) %>% 
  ungroup() %>% 
  ggplot(aes(fill = mission, values = n)) +
  geom_waffle(color = "white", size = 1.5, 
              n_rows = 20, flip = TRUE) + 
  bbplot::bbc_style() +
  scale_fill_manual(values= wes_palette("Darjeeling1", n = 4))

We can remove the notes in brackets with the sub() function.

Square brackets require the regex pattern \\[.*

ire_emb %<>% 
  select(!ref) %>%
  mutate(host_country = sub("\\[.*", "", host_country))

We delete the subheadings from the concurrent_accreditation column with the str_remove() function from the stringr package

ire_emb %<>%
  mutate(concurrent_accreditation = stringr::str_remove(concurrent_accreditation, "International Organizations:\n")) %>% 
  mutate(concurrent_accreditation = stringr::str_remove(concurrent_accreditation, "Countries:\n"))

After that, we will tackle the columns with many countries. Having many variables in one cell violates the principles of tidy data.


For example, we saw above that Argentina is the embassy for three other countries.

We will use the separate() function from the tidyr package to make a column for each country that shares an embassy with the host country.

This separate() function has six arguments:

First we indicate the column we will separate out with the col argument.

Next, with into, we write the new names of the columns we will create. Nigeria has the most countries for which it is accredited to be the designated embassy, with nine. So I create nine accredited-country columns to accommodate this maximum number.

The point where I want to cut up the original column is at the \n, which is the regex for a new line.

I don’t want to remove the original column so I set remove to FALSE

ire_emb %<>%
  separate(
    col = "concurrent_accreditation",
    into = c("acc_1", "acc_2", "acc_3", "acc_4", "acc_5", "acc_6", "acc_7", "acc_8", "acc_9"),
    sep = "\n",
    remove = FALSE,
    extra = "warn",
    fill = "warn") %>% 
  mutate(across(where(is.character), str_trim)) 

Some countries have more than one type of mission, so I want to count each type of mission for each country and create a new variable with the distinct() and pivot_wider() functions

Click here to read more about turning long to wide format data

With the across() function, we can replace all the NA values in numeric variables with zeros.

Click here to read more about the across() function

ire_emb %>% 
  group_by(host_country, mission) %>% 
  mutate(number_missions = n())  %>% 
  distinct(host_country, mission, .keep_all = TRUE) %>% 
  ungroup() %>% 
  pivot_wider(!c(host_city, concurrent_accreditation:count_accreditation), 
              names_from = mission, 
              values_from = number_missions) %>% 
  janitor::clean_names() %>% 
  mutate(across(where(is.numeric), ~ replace_na(., 0))) %>% 
  select(!host_country) -> ire_wide

Before we bind the two datasets together, we need to only have one row for each country.

ire_emb %>% 
  distinct(host_country, .keep_all = TRUE) -> ire_dist

And bind them together:

ire_full <- cbind(ire_dist, ire_wide) 

We can graph out where the embassies are with the geom_polygon() in ggplot

First we download the map data with the map_data() function from ggplot2, and add Correlates of War codes so we can easily join the datasets together with right_join().

Click here to read more about the countrycode package

ire_full %<>%
    mutate(cown = countrycode(host_country, "country.name", "cown")) 
world_map <- map_data("world")

world_map %<>% 
  mutate(cown = countrycode::countrycode(region, "country.name", "cown"))

I reorder the variables with the fct_relevel() function from the forcats package. This is just so they can better match the color palette from wesanderson package. Green means embassy, red for no mission and orange for representative office.

ire_full %>%
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>% 
  mutate(mission = replace_na(mission, "No Mission")) %>% 
  mutate(mission = forcats::fct_relevel(mission,c("No Mission", "Embassy","Representative office"))) %>%
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = mission), color = "white", size = 0.5)  -> ire_map

And we can change how the map looks with the ggthemes package and colors from wesanderson package

  ire_map + ggthemes::theme_map() +
  theme(legend.key.size = unit(3, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
  scale_fill_manual(values = wes_palette("Darjeeling1", n = 4))

And we can count how many missions there are in each country

The US has the highest number with 8 offices, followed by the UK with 4 and China with 3.

ire_full %>%
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>% 
  mutate(sum_missions = rowSums(across(embassy:representative_office))) %>% 
  mutate(sum_missions = replace_na(sum_missions, 0)) %>%  
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = as.factor(sum_missions)), color = "white", size = 0.5)  +
  ggthemes::theme_map() +
  theme(legend.key.size = unit(3, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
scale_fill_brewer(palette = "RdBu") + 
  ggtitle("Number of Irish missions in each country",
          subtitle = "Source: Wikipedia")

Last we can count the number of accredited countries that each embassy has. Nigeria has the most, in charge of 10 other countries across northern and central Africa.

ire_full %>% 
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>%
  mutate(count_accreditation = str_count(concurrent_accreditation, pattern = "\n")) %>% 
  mutate(count_accreditation = replace_na(count_accreditation, -1)) %>%  
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = as.factor(count_accreditation)), color = "white", size = 0.5)  +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.key.size = unit(1, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
  ggtitle("Number of Irish missions in extra accreditations",
          subtitle = "Source: Wikipedia")

Running tidy t-tests with the infer package in R

Packages we will need:

library(tidyverse)
library(tidyr)
library(infer)
library(bbplot)
library(ggthemes)

For this t-test, we will compare US millennials and non-millennials and their views of the UK’s influence in the world.

The data will come from the Chicago Council Survey of American Public Opinion on U.S. Foreign Policy.

Click here to download 2017 policy survey data

The survey investigates American public opinion on foreign policy. It focuses on respondents’ opinions of the United States’ leadership role in the world and the challenges the country faces domestically and internationally.

The question on the UK’s influence asks how much influence you think the UK has in the world, on a 0 to 10 scale, with 0 meaning not at all influential and 10 meaning extremely influential.

First we select and recreate the variables

fp <- read.csv(file.choose())  # read in the downloaded survey data (adjust if not CSV)

fp %>%
  select(
    milennial = XMILLENIALSSAMPLEFLAG,
    uk_influence = Q50_10) %>%
  separate(
    col = milennial,
    into = c("milennial_num", "milennial_char"),
    sep = '[)]',
    remove = TRUE) %>% 
  mutate(
     uk_influence = as.character(uk_influence),
     uk_influence = parse_number(uk_influence)) %>% 
  filter(uk_influence != -1) %>% 
  tidyr::drop_na(milennial_char) -> mil_fp

With the infer package, we can run a t-test:

mil_fp %>% 
  t_test(formula = uk_influence ~ milennial_char,
         alternative = "less")%>% 
  kable(format = "html")
statistic   t_df       p_value     alternative   estimate     lower_ci   upper_ci
-3.048249   1329.469   0.0011736   less          -0.3274509   -Inf       -0.1506332

There is a statistically significant difference between millennials and non-millennials.
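
As a quick sanity check, the equivalent base R t-test should return a matching statistic and p-value (check that the group ordering matches infer's):

t.test(uk_influence ~ milennial_char, 
       data = mil_fp, 
       alternative = "less")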

We can graph a box plot.

my_palette <- c("#2a9d8f", "#8d99ae")  # an example palette; any two colours work

mil_fp %>% 
  ggplot(mapping = aes(x = milennial_char,
                       y = uk_influence,
                       fill = milennial_char)) +
  geom_jitter(aes(color = milennial_char),
              size = 2, alpha = 0.5, width = 0.3) +
  geom_boxplot(alpha = 0.4) +
  coord_flip() + bbplot::bbc_style() +
  scale_fill_manual(values = my_palette) + 
  scale_color_manual(values = my_palette)

And a quick graph to compare the UK with two other countries, Germany and South Korea (this assumes the sk_influence and ger_influence questions were also kept when selecting variables from the survey above):

mil_fp %>% 
  select(milennial_char, uk_influence, sk_influence, ger_influence) %>% 
  pivot_longer(!milennial_char, names_to = "survey_question", values_to = "response")  %>% 
  group_by(survey_question, response) %>% 
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>% 
  ungroup() %>% 
  filter(!is.na(response)) %>% 
  mutate(survey_question = case_when(survey_question == "uk_influence" ~ "UK",
survey_question == "ger_influence" ~ "Germany",
survey_question == "sk_influence" ~ "South Korea",
TRUE ~ as.character(survey_question))) %>% 
  ggplot() +
  geom_bar(aes(x = forcats::fct_reorder(survey_question, freq), 
               y = freq, fill = as.factor(response)), 
           color = "#e5e5e5", 
           size = 2, 
           position = "stack",
           stat = "identity") + 
  coord_flip() + 
  scale_fill_brewer(palette = "RdBu") + 
  ggthemes::theme_fivethirtyeight() + 
  ggtitle("View of Influence in the world?") +
  theme(legend.title = element_blank(),
        legend.position = "top",
        legend.key.size = unit(0.78, "cm"),
        text = element_text(size = 25),
        legend.text = element_text(size = 20))

Check model assumptions with easystats package in R

Packages we will need:

install.packages("easystats", repos = "https://easystats.r-universe.dev")
library(easystats)
easystats::install_suggested()

Easystats is a collection of R packages, which aims to provide a unifying and consistent framework to tame, discipline, and harness the scary R statistics and their pesky models, according to their github repo.

Click here to browse the github and here to go to the specific performance package CRAN PDF

First, run your regression. I will try to explain the variance in Civil Society Organization (CSO) participation with the independent variables in my model, using Varieties of Democracy data from 1990.

cso_model <- lm(cso_part ~ education_level + mortality_rate + democracy, data = vdem_90)
                     Dependent variable: cso_part
-------------------------------------------------
education_level      -0.017**   (0.007)
mortality_rate       -0.00001   (0.00004)
democracy            0.913***   (0.064)
Constant             0.288***   (0.054)
-------------------------------------------------
Observations         134
R2                   0.690
Adjusted R2          0.682
Residual Std. Error  0.154 (df = 130)
F Statistic          96.243*** (df = 3; 130)
Note: *p<0.1; **p<0.05; ***p<0.01

Then we check the assumptions:

performance::check_model(cso_model)
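
check_model() prints all the diagnostic panels at once; the performance package also lets us run the individual checks one at a time:

performance::check_normality(cso_model)          # normality of residuals
performance::check_heteroscedasticity(cso_model) # constant error variance
performance::check_collinearity(cso_model)       # VIF scores
performance::check_outliers(cso_model)           # influential observations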

Comparing North and South Korean UN votes at the General Assembly with unvotes package

Packages we will use

library(unvotes)
library(lubridate)
library(tidyverse)
library(magrittr)
library(bbplot)
library(waffle)
library(stringr)
library(wordcloud)
library(tidytext)      # unnest_tokens()
library(wesanderson)

Last September 17th 2021 marked the 30th anniversary of the entry of North Korea and South Korea into full membership in the United Nations. Prior to this, they were only afforded observer status.


Let’s look at the types of voting that both countries have done in the General Assembly since 1991.

First we can download the different types of UN votes from the unvotes package

un_votes <- unvotes::un_roll_calls

un_votes_issues <- unvotes::un_roll_call_issues

unvotes::un_votes -> country_votes 

Join them all together and filter out any country that does not have the word “Korea” in its name.

un_votes %>% 
  inner_join(un_votes_issues, by = "rcid") %>% 
  inner_join(country_votes, by = "rcid") %>% 
  mutate(year = format(date, format = "%Y")) %>%
  filter(grepl("Korea", country)) -> korea_un

First we can make a wordcloud of all the different votes for which they voted YES. Is there a discernible difference in the types of votes that each country supported?

First, download the stop words that we can remove (such as the, and, if)

data("stop_words") 

Then I will make a North Korean dataframe of all the votes for which this country voted YES. I remove some of the messy formatting with gsub() and count the occurrence of each word. I get rid of a few of the procedural words that are more related to the technical wording of the resolutions than to the topic of the vote.

nk_yes_votes <- korea_un %>% 
  filter(country == "North Korea") %>% 
  filter(vote == "yes") %>%  
  select(descr, year) %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s")) %>% 
  # group_by(decade) %>% 
  unnest_tokens(word, descr) %>% 
  mutate(word = gsub(" ", "", word)) %>% 
  mutate(word = gsub('_', '', word)) %>% 
  count(word, sort = TRUE) %>% 
  ungroup() %>% 
  anti_join(stop_words)  %>% 
  mutate(word = case_when(grepl("palestin", word) ~ "Palestine", 
                          grepl("nucl", word) ~ "nuclear",
                          TRUE ~ as.character(word)))  %>%
  filter(word != "resolution") %>% 
  filter(word != "assembly") %>% 
  filter(word != "draft") %>% 
  filter(word != "committee") %>% 
  filter(word != "requested") %>% 
  filter(word != "report") %>% 
  filter(word != "practices") %>% 
  filter(word != "affecting") %>% 
  filter(word != "follow") %>% 
  filter(word != "acting") %>% 
  filter(word != "adopted") 

We already counted each word above (with count() and sort = TRUE), so recounting would overwrite n; we just arrange the words in descending order:

nk_yes_votes %<>% 
  arrange(desc(n))

We want to also remove the numbers

nums <- nk_yes_votes %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()

And remove the stop words

nk_yes_votes %<>%
  anti_join(nums, by = "word")

Choose some nice colours

my_colors <- c("#0450b4", "#046dc8", "#1184a7","#15a2a2", "#6fb1a0", 
               "#b4418e", "#d94a8c", "#ea515f", "#fe7434", "#fea802")

And lastly, plot the wordcloud with the top 50 words

wordcloud(nk_yes_votes$word, 
   nk_yes_votes$n, 
   random.order = FALSE, 
   max.words = 50, 
   colors = my_colors)

If we repeat the above code with South Korea:
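
That repeat is just the same pipeline with the filter switched to South Korea, abbreviated here to the core steps (it drops the manual procedural-word filters from the North Korea version):

sk_yes_votes <- korea_un %>% 
  filter(country == "South Korea") %>% 
  filter(vote == "yes") %>%  
  select(descr, year) %>% 
  unnest_tokens(word, descr) %>% 
  anti_join(stop_words) %>% 
  filter(!str_detect(word, "^[0-9]")) %>%  # drop number tokens directly
  count(word, sort = TRUE)

wordcloud(sk_yes_votes$word, sk_yes_votes$n, 
          random.order = FALSE, max.words = 50, colors = my_colors)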

There doesn’t seem to be a huge difference. But this is not a very scientific approach; I just like the look of them!

Next we will compare how many times the two countries voted yes, voted no, or abstained…

korea_un %>% 
  group_by(country, vote) %>% 
  count() %>% 
  mutate(count_ten = n /25) %>% 
  ungroup() %>% 
  ggplot(aes(fill = vote, values = count_ten)) +
  geom_waffle(color = "white",
              size = 2.5,
              n_rows = 10,
              flip = TRUE) +
  facet_wrap(~country) + bbplot::bbc_style() +
  scale_fill_manual(values = wesanderson::wes_palette("Darjeeling1"))

Next we can look in more detail at the votes from which the two countries abstained.

We can use the reorder_within() function from the tidytext package, which reorders the geom_bar() bars within each country facet. You can read Julia Silge’s blog to learn more about the function; it is a bit tricky, but it fixes the problem of randomly ordered bars across facets.

https://juliasilge.com/blog/reorder-within/

korea_un %>%
  filter(vote == "abstain") %>% 
  mutate(issue = case_when(issue == "Nuclear weapons and nuclear material" ~ "Nukes",
issue == "Arms control and disarmament" ~ "Arms",
issue == "Palestinian conflict" ~ "Palestine",
TRUE ~ as.character(issue))) %>% 
  select(country, issue, year) %>% 
  group_by(issue, country) %>% 
  count() %>% 
  ungroup() %>% 
  group_by(country) %>% 
  mutate(country = as.factor(country),
         issue = reorder_within(issue, n, country)) %>%
  ggplot(aes(x = reorder(issue, n), y = n)) + 
  geom_bar(stat = "identity", width = 0.7, aes(fill = country)) + 
  labs(title = "Abstaining UN General Assembly Votes by issues",
       subtitle = ("Since 1950s"),
       caption = "         Source: unvotes ") +
  xlab("") + 
  ylab("") +
  facet_wrap(~country, scales = "free_y") +
  scale_x_reordered() +
  coord_flip() + 
  expand_limits(y = 65) + 
  ggthemes::theme_pander() + 
  scale_fill_manual(values = sample(my_colors)) + 
 theme(plot.background = element_rect(color = "#f5f9fc"),
        panel.grid = element_line(colour = "#f5f9fc"),
        # axis.title.x = element_blank(),
        # axis.text.x = element_blank(),
        axis.text.y = element_text(color = "#000500", size = 16),
       legend.position = "none",
        # axis.title.y = element_blank(),
        axis.ticks.x = element_blank(),
        text = element_text(family = "Gadugi"),
        plot.title = element_text(size = 28, color = "#000500"),
        plot.subtitle = element_text(size = 20, color = "#484e4c"),
        plot.caption = element_text(size = 20, color = "#484e4c"))

South Korea was far more likely to abstain from votes than North Korea on all issues.

Next we can simply plot out the Human Rights votes that each country voted to support. Even though South Korea has far higher human rights scores, North Korea votes in support of more votes on this topic.

korea_un %>% 
  filter(year < 2019) %>% 
  filter(issue == "Human rights") %>% 
  filter(vote == "yes") %>% 
  group_by(country, year) %>% 
  count() %>% 
  ggplot(aes(x = year, y = n, group = country, color = country)) + 
  geom_line(size = 2) + 
  geom_point(aes(color = country), fill = "white", shape = 21, size = 3, stroke = 2.5) +
  scale_x_discrete(breaks = round(seq(min(korea_un$year), max(korea_un$year), by = 10),1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 22)) + 
  bbplot::bbc_style() + facet_wrap(~country) + 
  theme(legend.position = "none") + 
  scale_color_manual(values = sample(my_colors)) + 
  labs(title = "Human Rights UN General Assembly Yes Votes ",
       subtitle = ("Since 1990s"),
       caption = "         Source: unvotes ")

All together:

Download and graph UN votes data with the unvotes package in R

Packages we will need:

library(unvotes)
library(lubridate)
library(tidyverse)
library(magrittr)
library(bbplot)
library(waffle)

How to download UN votes to R.

This package was created by David Robinson. Click here to read the CRAN PDF.

Pop Tv Congrats GIF by Schitt's Creek - Find & Share on GIPHY

We will download both the vote roll calls and the issues. Then we can use the inner_join() function to add them together by the ID.

un_votes <- unvotes::un_roll_calls

un_votes_issues <- unvotes::un_roll_call_issues

un_votes %<>% 
  inner_join(un_votes_issues, by = "rcid")

We can create a year variable with the format() function and extract the year with “%Y”

un_votes %<>% 
  mutate(year = format(date, format = "%Y")) 

And graph out the count of each type of UN vote issue

un_votes %>% 
  group_by(year) %>% 
  count(issue) %>% 
  ggplot(aes(x = year, y = n, group = issue, color = issue)) + 
  geom_line(size = 2) + 
  geom_point(aes(color = issue), fill = "white", 
             shape = 21, size = 2, stroke = 1) +
  scale_x_discrete(breaks = round(seq(min(un_votes$year), max(un_votes$year), by = 10),1)) +
  bbplot::bbc_style() + facet_wrap(~issue)

Next we can look at which decade had the most votes across the issues with the waffle package

Click here to read more about the waffle package

un_votes %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s")) %>% 
  
  group_by(decade) %>% 
  count(issue) %>% 
  
  ggplot(aes(fill = issue, values = n)) +
  geom_waffle(color = "white",
              size = 0.3,
              n_rows = 10, 
              flip = TRUE) +
  facet_wrap(~decade, nrow = 1, strip.position = "bottom") + 
  bbplot::bbc_style()  +
  scale_x_discrete(breaks = round(seq(0, 1, by = 0.2),3)) 

The 1980s were a prolific time for UNGA voting, with arms control being the largest share of votes. And it has stabilised in the decades since.


Next we can look at votes in total

un_votes %>% 
  mutate(issue = case_when(issue == "Nuclear weapons and nuclear material" ~ "Nukes",
                           issue == "Arms control and disarmament" ~ "Arms",
                           issue == "Palestinian conflict" ~ "Palestine",
                           TRUE ~ as.character(issue))) %>% 
  count(issue) %>%  
  ggplot(aes(x = reorder(issue, n), y = n, fill = as.factor(issue))) + 
  geom_bar(stat = "identity") + 
  coord_polar("x", start = 0, direction = -1)  + 
  ggthemes::theme_pander()  +
  bbplot::bbc_style() + 
    theme(axis.text = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          axis.ticks = element_blank(),
          text = element_text(size = 25),
          panel.grid = element_blank()) + 
    ggtitle(label = "UN Votes by issue ", 
            subtitle = "Source: unvotes package")

Grouping, counting words and making wordclouds

library(tidytext)
library(wordcloud)
library(knitr)
library(kableExtra)

How to make wordclouds in R!

First, download stop words (such as and, the, of) to filter out of the dataset

data("stop_words")

Then we will unnest tokens and count the occurrences of each word in each decade.

tokens <- democracy_aid %>%
  select(description, year) %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s")) %>% 
  group_by(decade) %>% 
  unnest_tokens(word, description) %>% 
  count(word, sort = TRUE) %>% 
  ungroup() %>% 
  anti_join(stop_words) 

nums <- tokens %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()

tokens %<>%
  anti_join(nums, by = "word") 

And with the kable() function we can make an HTML table that I copy and paste into this blog.

tokens %>% 
    group_by(decade) %>% 
    top_n(n = 10,
          wt = n)  %>%
    arrange(decade, desc(n)) %>%
    arrange(desc(n)) %>%
    knitr::kable("html")
decade   word           n
2010s    rights         4541
2010s    local          3981
2010s    youth          3778
2010s    promote        3679
2010s    democratic     3618
2010s    public         3444
2010s    national       3060
2010s    political      3020
2010s    human          3009
2010s    organization   2711
2000s    rights         2548
2000s    human          1745
2000s    local          1544
2000s    conduct        1381
2000s    political      1257
2000s    training       1217
2000s    promote        1142
2000s    public         1121
2000s    democratic     1071
2000s    national       988

Create a vector of colors:

my_colors <- c("#0450b4", "#046dc8", "#1184a7","#15a2a2", "#6fb1a0", 
               "#b4418e", "#d94a8c", "#ea515f", "#fe7434", "#fea802")
tokens %<>% 
  mutate(word = ifelse(grepl("democr", word), "democracy", 
                ifelse(grepl("politi", word), "politics", 
                ifelse(grepl("institut", word), "institution", 
                ifelse(grepl("govern", word), "government", 
                ifelse(grepl("organiz", word), "organization", 
                ifelse(grepl("elect", word), "election", word))))))) 

wordcloud(tokens$word, tokens$n, random.order = FALSE, max.words = 50, colors = my_colors)

And if we compare civic versus politically-oriented aid, we can see that more money goes towards projects that have political or electoral aims rather than civic or civil society education goals. (Note that the tokens data frame above was counted by decade; for this year-by-year plot we first need a version counted by year, as sketched below.)
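
A minimal sketch, assuming the same democracy_aid columns as before:

tokens <- democracy_aid %>%
  select(description, year) %>% 
  group_by(year) %>% 
  unnest_tokens(word, description) %>% 
  count(word, sort = TRUE) %>% 
  ungroup() %>% 
  anti_join(stop_words) %>% 
  anti_join(nums, by = "word")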

One catch: the tokens counts above are per decade, not per year, so we first rebuild the counts by year before plotting:

tokens_year <- democracy_aid %>%
  select(description, year) %>% 
  group_by(year) %>% 
  unnest_tokens(word, description) %>% 
  anti_join(stop_words, by = "word") %>% 
  count(word, sort = TRUE) %>% 
  ungroup()

tokens_year %>% 
  group_by(year) %>% 
  top_n(n = 20, wt = n) %>% 
  mutate(word = case_when(word %in% c("party", "parties", 
                                      "election", "electoral") ~ "political",
                          word %in% c("civil", "civic", "social", 
                                      "education", "society") ~ "civic",
                          TRUE ~ word)) %>% 
  filter(word %in% c("political", "civic")) %>% 
  group_by(year, word) %>%                      # sum the collapsed stems so we
  summarise(n = sum(n), .groups = "drop") %>%   #   get one point per year and type
  ggplot(aes(x = year, y = n, group = word)) + 
  geom_line(aes(color = word), size = 2.5, alpha = 0.6) +
  geom_point(aes(color = word), fill = "white", 
             shape = 21, size = 3, stroke = 2) +
  bbplot::bbc_style() + 
  scale_x_continuous(breaks = 2001:2019) +   # year is numeric, not discrete
  theme(axis.text.x = element_text(size = 15, angle = 45)) +
  scale_color_discrete(name = "Aid type", 
                       labels = c("Civic grants", "Political grants"))

Wrangling and graphing UN Secretaries-General data with R

Packages we will need:

library(tidyverse)
library(janitor)
library(rvest)
library(countrycode)
library(magrittr)
library(lubridate)
library(ggflags)
library(scales)

According to Urquhart (1995) in his article, “Selecting the World’s CEO”,

From the outset, the U.N. secretary-general has been an important part of the institution, not only as its chief executive, but as both symbol and guardian of the original vision of the organization. There, however, specific agreement has ended. The United Nations, like any important organization, needs strong and independent leadership, but it is an intergovernmental organization, and governments have no intention of giving up control of it. While the secretary-general can be extraordinarily useful in times of crisis, the office inevitably embodies something more than international cooperation, sometimes even an unwelcome hint of supranationalism. Thus, the attitude of governments toward the United Nations’ chief and only elected official is and has been necessarily ambivalent.

(Urquhart, 1995: 21)

So who are these world CEOs? Let’s examine them with this dataset.

First, we will scrape the data from Wikipedia:

sg_html <- read_html("https://en.wikipedia.org/wiki/Secretary-General_of_the_United_Nations")
sg_tables <- sg_html %>% html_table(header = TRUE, fill = TRUE)
sg <- sg_tables[[2]]

The table we scrape is a bit of a hot mess in this state, but we can fix it.


We can first use the clean_names() function from the janitor package.

A quick way to clean up the table and keep only the rows with the names of the Secretaries-General is the distinct() function. Last, we filter out the rows and drop the columns we don’t want.

sg %>% 
  clean_names() %>% 
  distinct(no, .keep_all = TRUE) %>% 
  filter(no != "–") %>% 
  select(!c(portrait, ref)) -> sg_clean

Already we can see a much cleaner table. However, the next problem is that each Secretary-General’s name and years of birth and death are crammed into one cell. The dates in office are also combined in a single column.

So we can use the separate() function from tidyr to make new variables for each piece of information.

First we will separate the name of the Secretary-General from their date of birth and death.

We supply the two new variable names to the into = argument.

We then pass the regex character class [()] to the sep = argument; it matches either round bracket, so the string splits where the brackets appear and the name is separated from the years of birth and death.

I want to remove the original cluttered variable, so I set remove = TRUE.

sg_clean %<>% 
  separate(
    col = secretary_general_born_died,
    into = c("sec_gen", "born_died"),
    sep = '[()]',
    remove = TRUE) 
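To see what separate() is doing, here is the same call on a toy one-row table (the name and dates are just illustrative):

tibble(secretary_general_born_died = "Trygve Lie (1896–1968)") %>% 
  separate(col = secretary_general_born_died,
           into = c("sec_gen", "born_died"),
           sep = "[()]")
# sec_gen = "Trygve Lie ", born_died = "1896–1968"
# (a warning notes the empty third piece after ")" is discarded)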

We can repeat this step to create separate born and died variables. This time the separator is an en dash (–), so we do not need a regex pattern; we can supply the character directly.

sg_clean %<>% 
  separate(
    col = born_died,
    into = c("born", "died"),
    sep = '–',
    remove = TRUE)  

The born column is still a messy character string, so I extract just the numbers with the parse_number() function, converting it from character to numeric.

sg_clean %<>% 
  mutate(born = parse_number(born))

Last, we repeat with the dates in office. These are also separated by an en dash.

sg_clean %<>% 
  separate(
    col = dates_in_office,
    into = c("start_office", "end_office"),
    sep = '–',
    remove = TRUE)  

We convert the word “present” to an actual date (the date this post was written):

sg_clean %<>% 
  mutate(end_office = ifelse(end_office == "present", "5 May 2022", end_office))
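If you would rather not hard-code the date, you could substitute today’s date instead (assuming an English-language locale so dmy() can parse the month name later):

sg_clean %<>% 
  mutate(end_office = ifelse(end_office == "present", 
                             format(Sys.Date(), "%d %B %Y"),  # e.g. "05 May 2022"
                             end_office))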

We use the lubridate dmy() function to convert the character strings to date class variables.

sg_clean %<>% 
  mutate(start_office = dmy(start_office),
         end_office = dmy(end_office))
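dmy() is forgiving about the exact day-month-year formatting:

dmy("2 February 1946")   # "1946-02-02"
dmy("01-01-2017")        # "2017-01-01"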

We can calculate the length of time that each Secretary-General was in office with the difftime() function.

sg_clean %<>% 
  mutate(duration_days = difftime(end_office, start_office, units = "days"),
         duration_years = round(as.numeric(duration_days) / 365, 2))  # drop the difftime class before dividing

Next we can compare the durations and see which Secretaries-General spent the longest and shortest time in office.

sg_clean %>% 
  mutate(duration_days = difftime(end_office, start_office)) %>%  
  mutate(iso2 = tolower(countrycode::countrycode(country_of_origin, "country.name", "iso2c"))) %>% 
  ggplot(aes(x = forcats::fct_reorder(sec_gen, duration_days), y = duration_days)) + 
  geom_bar(aes(fill = un_regional_group), stat = "identity", width = 0.7) + 
  coord_flip() + bbplot::bbc_style() + 
  ggflags::geom_flag(aes(x = sec_gen, y = -100, country = iso2), size = 12) +
  scale_fill_manual(values = le_palette) +
  labs(title = "Longest serving UN Secretaries General",
       subtitle = ("Source: Wikipedia")) + 
  xlab("") + ylab("") 

We can make a quick pie chart to compare regions. We can see that Secretaries-General from the West have had the most time in office. To build it we need the total days in office per regional group (sg_count) and label positions (sg_text).
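The sg_count step was skipped above; a minimal sketch, assuming we simply sum duration_days within each regional group:

sg_count <- sg_clean %>% 
  group_by(un_regional_group) %>% 
  summarise(sum_days = sum(as.numeric(duration_days)))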

sg_text <- sg_count %>% 
  arrange(desc(un_regional_group)) %>%
  mutate(prop = sum_days / sum(sum_days) * 100,
         ypos = cumsum(prop) - 0.5 * prop)   # mid-point of each slice, for the label

sg_text %>% 
  count(un_regional_group)

sg_text %>%
  mutate(region = case_when(un_regional_group == "Western European & others" ~ "Europe",
         un_regional_group == "Latin American& Caribbean" ~ "Latin America",
         un_regional_group == "Asia & Pacific" ~ "Asia", 
         TRUE ~ as.character(un_regional_group))) %>% 
  ggplot(aes(x = "", y = prop, fill = region)) +
  geom_bar(stat = "identity", width = 1) +
  geom_text(aes(y = ypos + 1, label = round(prop, 0)), color = "white", size = 15) +
  coord_polar("y", start = 0) +
  theme_void() +
  ggtitle("Length of Secretaries General in office across regions") + 
  scale_fill_manual(values = le_palette) + 
  theme(legend.title = element_blank(),
        legend.text = element_text(size = 20), 
        plot.title = element_text(size = 30))

We can create a Gantt-style chart to track the timeline for the different officeholders (all men!).

Click here to read more about timelines in R

sg_clean %>% 
  mutate(region = case_when(un_regional_group == "Western European & others" ~ "Europe",
                            un_regional_group == "Latin American& Caribbean" ~ "Latin America",
                            un_regional_group == "Asia & Pacific" ~ "Asia", 
                            TRUE ~ as.character(un_regional_group))) %>%
  ggplot(aes(x = as.Date(start_office), 
             y = no, 
             color = region)) +
  geom_segment(aes(xend = as.Date(end_office), 
                   yend = no, 
                   color = region), 
               alpha = 0.9, size = 9) +
  geom_text(aes(label = sec_gen), 
            color = "black", 
            alpha = 0.7,
            size = 8, show.legend = FALSE) +
  bbplot::bbc_style() +
  scale_color_manual(values = le_palette) + 
  scale_x_date(breaks = scales::breaks_pretty(15))

References

Urquhart, B. (1995). Selecting the world’s CEO: Remembering the Secretaries-General. Foreign Affairs, 74(3), 21–26.
