Graphing female politicians in Irish parliament with R PART 2: Trends and Maps

Packages we will need

library(tidyverse)
library(magrittr)
library(waffle)
library(geojsonio)
library(sf)

In PART 1, we looked at the gender package to help count the number of women in the 33rd Irish Parliament.

I repeated that for every session since 1921. The first and second Dail are special in Ireland as they are technically pre-partition.

Cleaned up the data aaaand now we have a full dataset with constituencies data.

If anyone wants a copy of the dataset, I can upload it here for those who are curious ~

So first… a simple pie chart!

First we calculate proportion of seats held by women

dail %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s")) %>%
  group_by(decade) %>% 
  ungroup() %>% 
  group_by(decade, gender) %>% 
  count() %>% 
  group_by(decade) %>% 
  mutate(proportion = n / sum(n)) -> dail_pie

# A tibble: 22 × 4
# Groups:   decade [11]
   decade gender     n proportion
   <chr>  <chr>  <int>      <dbl>
 1 1920s  female    20     0.0261
 2 1920s  male     747     0.974 
 3 1930s  female    10     0.0172
 4 1930s  male     572     0.983 
 5 1940s  female    12     0.0284
 6 1940s  male     411     0.972 
 7 1950s  female    16     0.0363
 8 1950s  male     425     0.964 
 9 1960s  female    11     0.0255
10 1960s  male     421     0.975 
# 12 more rows

We will be looking at how proportions changed over the decades.

When using facet_wrap() with coord_polar(), it’s a pain in the arse.

This is because coord_polar() does not automatically allow each facet to have a different scale. Instead, coord_polar() treats all facets as having the same axis limits.

This will mess everything up.

If we don’t change the coord_polar(), we will just distort pie charts when the facet groups have different total values. There will be weird gaps and make some phantom pacman non-charts.

function() TRUE is an anonymous function that always returns TRUE.

my_coord_polar$is_free <- function() TRUE forces coord_polar() to allow different scales for each facet.

In our case, we call my_coord_polar$is_free, which means that whenever ggplot2 checks whether the coordinate system allows free scales across facets, it will now always return TRUE!!!

Overriding is_free() to always return TRUE signals to ggplot2 that coord_polar() means that our pie charts NOOWW will respect the "free" scaling specified in facet_wrap(scales = "free").

my_coord_polar <- coord_polar(theta = "y")
my_coord_polar$is_free <- function() TRUE

If you want to look more at this, check out this blog:

How to graph proportions with the waffle and treemapify packages in R

And we can go and create the ggplot:

dail_pie %>%
  ggplot(aes(x = "", 
         y = proportion, 
         fill = as.factor(gender))) +

  geom_bar(stat="identity", width = 1) +
  
  geom_text(
    data = . %>% filter(gender == "female"), aes(label = scales::percent(proportion, 
    accuracy = 0.1)), 
    color = "white",
    size = 8) +
  
  my_coord_polar +

  facet_wrap(~decade, scales = "free") + 
  scale_fill_manual(values =c("#bc4749", "#003049")) +
  # my_style() +
  theme(axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid = element_blank(), 
        panel.background = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank())

And with Canva, I add the arrows and titles~

Sorry I couldn’t figure it out in R. I just hate all the times I need to re-run graphics to move a text or number by a nano-centimeter. Websites like Canva are just far better for my sanity and short attention span.

Next, we can make a facetted waffle plot!

dail %>% 
  group_by(decade) %>% 
  ungroup() %>% 
  group_by(decade, gender) %>% 
  count() %>% 
  ggplot(aes(fill = as.factor(gender), values = n)) +
  waffle::geom_waffle(color = "white", 
                      size = 0.5, 
                      n_rows = 10, 
                      flip = TRUE) +
  facet_wrap(~decade, nrow = 1, strip.position = "bottom") +
# my_style  +
  scale_fill_manual(values =c("#003049", "#bc4749")) +
  theme(axis.text.x.bottom = element_blank(),
        text = element_text(size = 40))

And mea culpa, I finished the annotation and titles are with Canva.

Once again, life is too short to be messing with annotation in ggplot.

Next, we can make a simple trend line of the top Irish parties and see how they have fared with women TDs.

Let’s get a dataframe with average number of TDs elected to each party over the decades

dail %>% 
  filter(constituency != "National University") %>% 
  filter(party %in% c("Fianna Fáil", "Fine Gael", "Labour", "Sinn Féin")) %>% 
  group_by(party, decade) %>% 
  summarise(avg_female = mean(gender == "female")) -> dail_avg

# A tibble: 39 × 3
# Groups:   party [4]
   party       decade avg_female
   <chr>       <chr>       <dbl>
 1 Fianna Fáil 1920s     0.0198 
 2 Fianna Fáil 1930s     0.00685
 3 Fianna Fáil 1940s     0.0284 
 4 Fianna Fáil 1950s     0.0425 
 5 Fianna Fáil 1960s     0.0230 
 6 Fianna Fáil 1970s     0.0327 
 7 Fianna Fáil 1980s     0.0510 
 8 Fianna Fáil 1990s     0.0897 
 9 Fianna Fáil 2000s     0.0943 
10 Fianna Fáil 2010s     0.0938 
# 29 more rows

We create a new mini data.frame of four values so that we can have the geom_text() only at the end of the year (so similar to the final position of the graph).

final_positions <- dail_avg %>%
  group_by(party) %>%
  filter(decade == "2020s")  %>% 
  mutate(color = ifelse(party == "Sinn Féin", "#2fb66a",
         ifelse(party == "Fine Gael","#6699ff",
         ifelse(party == "Fianna Fáil","#ee9f27", 
         ifelse(party == "Labour", "#780000", "#495051")))))

# A tibble: 4 × 4
# Groups:   party [4]
  party       decade avg_female color  
  <chr>       <chr>       <dbl> <chr>  
1 Fianna Fáil 2020s       0.140 #ee9f27
2 Fine Gael   2020s       0.219 #6699ff
3 Labour      2020s       0.118 #780000
4 Sinn Féin   2020s       0.368 #2fb66a

A hex colour for each major party

party_pal <- c("Sinn Féin" = "#2fb66a",
                "Fine Gael" = "#6699ff",
                "Fianna Fáil" = "#ee9f27", 
                "Labour" = "#780000")

And a geom_bump() layer in the plot using the ggbump() package for more wavy lines.

dail_avg %>% 
  ggplot(aes(x = decade,
             y = avg_female, 
             group = party, 
             color = party)) + 

  ggbump::geom_bump(aes(color = party),
            smooth = 5,
            alpha = 0.5,
            size = 4)  +

  geom_point(color = "white", 
             size = 7, 
             stroke = 4) + 

  geom_point(size = 6) +

  ggrepel::geom_text_repel(data = final_positions,
            aes(color = party,
                y = avg_female,
                x = decade,
            label = party),
            family = "Arial Rounded MT Bold",
            vjust = -2,
            hjust = -1,
            size = 15) +
# my_style() 
  scale_color_manual(values = party_pal) +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_x_discrete(expand = expansion(add = c(0.2, 2))) +
  theme(legend.position = "none")

This graph looks at major Irish political parties from the 1920s to the 2020s.

For most of Irish history, female representation remained under 10%.

The Labour Party surged ahead like crazy in the 1990s; it got over 30% female TDs!

Now in the 2020s, Sinn Féin has the largest proportion of female TDs and goes way above and beyond the other major parties.

Now, onto constituency maps.

We can go to the Irish government’s website with heaps of data! Yay free data.

This page brings us to the election constituencies GeoJSON map data.

For more information about making GeoJSON and SF maps click here to read about how to create maps in R ~

How to download and graph interactive country maps in R

So we read in the data and convert to SF dataframe.

constituency_map <- geojson_read(file.choose(), what = "sp")

constituency_sf <- st_as_sf(constituency_map)

This constituency_sf has 64 variables but most of them are meta-data info like the dates that each variable was updated. The vaaast majority, we don’t need so we can just pull out the consituency var for our use:

constituency_sf %>% 
  select(constituency = ENG_NAME_VALUE, 
         geometry) -> mini_constituency_sf

Simple feature collection with 1072 features and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 417437.9 ymin: 516356.4 xmax: 734489.6 ymax: 966899.7
Projected CRS: IRENET95 / Irish Transverse Mercator
First 10 features:
          constituency                       geometry
1  Cork South-West (3) POLYGON ((501759.8 527442.6...
2            Kerry (5) POLYGON ((451686.2 558529.2...
3            Kerry (5) POLYGON ((426695 561869.8, ...
4            Kerry (5) POLYGON ((451103.9 555882.8...
5            Kerry (5) POLYGON ((434925.3 572926.2...
6          Donegal (5) POLYGON ((564480.8 917991.7...
7          Donegal (5) POLYGON ((571201.9 892870.7...
8          Donegal (5) POLYGON ((615249.9 944590.2...
9          Donegal (5) POLYGON ((563593.8 897601, ...
10         Donegal (5) POLYGON ((647306 966899.4, ...

As we see, the number of seats in each constituency is in brackets behind the name of the county. So we can separate them and create a seat variable:

  mini_constituency_sf %<>% 
   separate(constituency, 
            into = c("constituency", "seats"), 
            sep = " \\(", fill = "right") %>%
   mutate(seats = as.numeric(gsub("\\)", "", seats)))

One problem I realised along the way when I was trying to merge the constituency map with the TD politicians data is that one data.frame uses a hyphen and one uses a dash in the constituency variable.

So we can make a quick function to replace en dash (–) with hyphen (-).

 replace_dash <- function(x) {
   if (is.character(x)) {
     gsub("–", "-", x)  
   } else {x}
}

 mini_constituency_sf %<>%
   mutate(across(where(is.character), replace_dash))

And now we can merge!

 dail %<>%
   right_join(mini_constituency_sf, by = "constituency")

Now a quick map ~

 dail %<>%
  mutate(n = ifelse(is.na(percentage_women), 0, percentage_women)) %>%
   ggplot(aes(geometry = geometry)) +
   geom_sf(aes(fill = percentage_women),
           color = "black") +  s
   labs(title = "Map of Irish Constituencies") +
   # my_style() +
   scale_fill_viridis_c(option = "plasma")  +

    scale_fill_gradient2(low = "#57cc99",
                         mid = "#38a3a5",
                         high = "#22577a") +

   theme(axis.text = element_blank(),
     axis.text.x.bottom = element_blank(),
     legend.key.width = unit(1.5, "cm"), 
     legend.key.height = unit(0.4, "cm"), 
     legend.position = "bottom")

We can see that some constituencies have 3 seats, some 5~

So we cannot directly compare who has more female TDs.

A way to deal with this is scaling the data.

In PART 3, we will look at scaling data and analysing trends across the years!

Yay!

How to web scrape and graph 2024 Irish election data with R

Packages we will use:

library(tidyverse)
library(rvest)
library(janitor)
library(magrittr)
library(ggparliament)
library(ggbump)
library(bbplot)

I am an Irish person living abroad. I did NOT follow the elections last year. So, as penance (as I just mentioned, I am Irish and therefore full of phantom Catholic guilt for neglecting political news back home), we will be graphing some of the election data and familiarise ourselves with the new contours of Irish politics in this blog.

Click here to visit the wikipedia page we will be scraping with the rvest package.

Click here to read more about the rvest package for webscraping.

Scrape and graph election polling data from Wikipedia

The data we want is in the 11th table on the page:

The columns that we will want are the Party and the Elected 2024 columns.

So using the read_html() function, we can feed in the URL, save all the tables with html_table() and then only keep the eleventh table with `[[`(11)

read_html("https://en.wikipedia.org/wiki/2024_Irish_general_election") %>% 
  html_table(header = TRUE, fill = TRUE) %>% 
  `[[`(11) -> dail_2024

It’s a bit of a hot mess at this stage.

Right now, all the variable names are empty.

We can use the row_to_names() function from the janitor package. This moves a row up to became the variable names. Also we can use clean_names() (also a janitor package staple) to make every variable lowercase snake_case with underscores.

dail_2024 %<>% 
  row_to_names(row_number = 2) %>% 
  clean_names() %>%

As you can see in the table above, the PBP cell is very crowded. This is due to the fact that many similar left-wing parties formed a loose coaltion when campaigning.

Because they are all in one cell, every number was shoved together without spaces. So instead of each party in the loose grouping, it was all added together. It makes the table wholly incorrect; the PBP coalition did not win trillions of votes.

Things like this highlights the importance of always checking the raw data after web scraping.

So I just brute recode the value according to what is actually on the Wiki page.

dail_2024 %<>% 
  mutate(elected2024 = if_else(party_2 == "PBP–Solidarity[c]•People Before Profit•Solidarity", "3", elected2024))

Next we need to remove the annoying [footnotes in square brackets] on the page with some regex nonsense.

dail_2024 %<>% 
  mutate(across(everything(), ~ str_replace(., "\\[.*$", "")))

And finally, we just need to select, rename and change the seat numbers from character to numeric

dail_2024 %<>%  
  select(party = party_2, seats = elected2024)  %>% 
  mutate(seats= parse_number(seats))

Next, we just need to graph it out with the geom_parliament_seats() layer of the ggplot graph with ggparliament package.

Click here to read more about the ggparliament package:

How to create semi-circle parliament graphs with the ggparliament package in R

First, we generate the circle coordinates

dail_2024_coord <- parliament_data(election_data = dail_2024,
                   type = "semicircle", 
                   parl_rows = 6,  
                   party_seats = dail_2024$seats_2024)

x: the horizontal position of a point in the semi-circle graph.

y: the vertical position of a point in the semi-circle graph.

row: The row or layer of the semi-circle in which the point (seat) is positioned. Rows are arranged from the base (row 1) to the top of the semi-circle.

theta: The angle (in radians) used to calculate the position of each seat in the semi-circle. It determines the angular placement of each point, starting at 0 radians (rightmost point of the semi-circle) and increasing counterclockwise to π\piπ radians (leftmost point of the semi-circle).

We want to have the biggest parties first and the smallest parties at the right of the graph

dail_elected %<>% 
  mutate(party = fct_reorder(party, table(party)[party], .desc = TRUE))

and we can add some hex colors that represent the parties’ representative colours.

dail_elected_coord %<>% 
  mutate(party_colour = case_when(party == "Fianna Fáil" ~ "#66bb66",
                       party == "Fine Gael" ~ "#6699ff",
                       party == "Green" ~ "#2fb66a",
                       party == "Labour" ~ "#e71c38",
                       party == "Sinn Féin" ~ "#326760",
                       party == "PBP–Solidarity" ~ "#e91d50",
                       party ==  "Social Democrats" ~ "#742a8b",
                       party == "Independent Ireland" ~ "#ee9f27",
                       party == "Aontú" ~ "#4f4e31",
                       party == "100% Redress" ~ "#8e2420"))

And we graph out the ggplot with the simple bbc_style() from the bbplot package

dail_elected_coord %>% 
  ggplot(aes(x = x, y = y,
             colour = party)) +
  geom_parliament_seats(size = 13) +
  bbplot::bbc_style()  +
  ggtitle("34th Irish Parliament") +
  theme(text = element_text(size = 50),
        legend.title = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank())  +
  scale_colour_manual(values = dail_elected_coord$party_colour,
                      limits = dail_elected_coord$party)

34th Irish Parliamentary Election Download

HONESTY TIME… I will admit, I replaced the title as well as the annotated text and arrows with Canva dot comm

Hell is … trying to incrementally make annotations to go to place we want via code. Why would I torment myself when drag-and-drop options are available for free.

Next, let’s compare this year with previous years

I was also hoping to try replicate this blog post about bump plots with highlighted labels from the r-graph-gallery website.

We can use this kind of graph to highlight a particular trend.

For example, the rise of Sinn Fein as a heavy-hitter in Irish politics.

We will need to go to many of the Wikipedia pages on the elections and scrape seat data for the top parties for each year.

Annoyingly, across the different election pages, the format is different so we have to just go by trial-and-error to find the right table for each election year and to find out what the table labels are for each given year.

Since going to many different pages ends up with repeating lots of code snippets, we can write a process_election_data() function to try cut down on replication.

process_election_data <- function(url, table_index, header_row, party_col, seats_col, top_parties, extra_mutate = NULL) {
  read_html(url) %>%
    html_table(header = TRUE, fill = TRUE) %>%
    `[[`(table_index) %>%
    row_to_names(row_number = header_row) %>%
    clean_names() %>%
    mutate(across(everything(), ~ str_replace(., "\\[.*$", ""))) %>%
    select(party = !!sym(party_col), seats = !!sym(seats_col)) %>%
    mutate(seats = parse_number(seats)) %>%
    filter(party %in% top_parties)
}

In this function, mutate(across(everything(), ~ str_replace(., "\\[.*$", ""))) removes all those annoying footnotes in square brackets from the Wiki table with regex code.

Annoyingly, the table for the 2024 election is labelled differently to the table with the 2016 results on le Wikipedia. So when we are scraping from each webpage, we will need to pop in a sliiiightly different string.

We can use the sym() and the !! to accomodate that.

When we type on !! (which the coder folks call bang-bang), this unquotes the string we feed in. We don’t want the function to treat our string as a string.

After this !! step, we can now add them as variables within the select() function.

We will only look at the biggest parties that have been on the scene since 1980s

top_parties <- c("Fianna Fáil", "Fine Gael", "Sinn Féin", "Labour Party", "Green Party")

Now, we feed in the unique features that are unique for scraping each web page:

dail_2024 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/2024_Irish_general_election
  table_index = 11,
  header_row = 2,
  party_col = "party_2",
  seats_col = "elected2024",
  top_parties = top_parties)

dail_2020 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/2020_Irish_general_election",
  table_index = 10,
  header_row = 2,
  party_col = "party_2",
  seats_col = "elected2020",
  top_parties = top_parties)

dail_2016 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/2016_Irish_general_election",
  table_index = 10,
  header_row = 3,
  party_col = "party_2",
  seats_col = "elected2016_90",
  top_parties = top_parties)

dail_2011 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/2011_Irish_general_election",
  table_index = 14,
  header_row = 2,
  party_col = "party_2",
  seats_col = "t_ds",
  top_parties = top_parties)

dail_2007 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/2007_Irish_general_election",
  table_index = 8,
  header_row = 2,
  party_col = "party_2",
  seats_col = "seats",
  top_parties = top_parties)

dail_2002 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/2002_Irish_general_election",
  table_index = 8,
  header_row = 2,
  party_col = "party_2",
  seats_col = "seats",
  top_parties = top_parties)

dail_1997 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/1997_Irish_general_election",
  table_index = 9,
  header_row = 2,
  party_col = "party_2",
  seats_col = "seats",
  top_parties = top_parties)

dail_1992 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/1992_Irish_general_election",
  table_index = 6,
  header_row = 2,
  party_col = "party_2",
  seats_col = "seats",
  top_parties = top_parties)

dail_1989 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/1989_Irish_general_election",
  table_index = 5,
  header_row = 2,
  party_col = "party_2",
  seats_col = "seats",
  top_parties = top_parties)

dail_1987 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/1987_Irish_general_election",
  table_index = 5,
  header_row = 2,
  party_col = "party_2",
  seats_col = "seats",
  top_parties = top_parties)

dail_1982_11 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/November_1982_Irish_general_election",
  table_index = 5,
  header_row = 2,
  party_col = "party_2",
  seats_col = "seats",
  top_parties = top_parties)

dail_1982_2 <- process_election_data(
  url = "https://en.wikipedia.org/wiki/February_1982_Irish_general_election",
  table_index = 5,
  header_row = 2,
  party_col = "party_2",
  seats_col = "seats",
  top_parties = top_parties)

After we scraped every election, we can join them together

dail_years <- dail_2024 %>% 
  left_join(dail_2020, by = c("party")) %>% 
  left_join(dail_2016, by = c("party")) %>% 
  left_join(dail_2011, by = c("party")) %>% 
  left_join(dail_2007, by = c("party")) %>% 
  left_join(dail_2002, by = c("party")) %>%   
  left_join(dail_1997, by = c("party")) %>% 
  left_join(dail_1992, by = c("party")) %>% 
  left_join(dail_1989, by = c("party")) %>% 
  left_join(dail_1987, by = c("party")) %>% 
  left_join(dail_1982_11, by = c("party")) %>% 
  left_join(dail_1982_2, by = c("party"))

Or I can use a list and iterative left joins.

dail_list <- list(
  dail_2024,
  dail_2020,
  dail_2016,
  dail_2011,
  dail_2007,
  dail_2002,
  dail_1997,
  dail_1992,
  dail_1989,
  dail_1987,
  dail_1982_11,
  dail_1982_2)

dail_years <- reduce(dail_list, left_join, by = "party")

For the x axis ticks, we can quickly make a vector of all the election years we want to highlight on the graph.

election_years <- c(2024, 2020, 2016, 2011, 2007, 2002, 1997, 1992, 1989, 1987, 1982)

Next we pivot the data to long format:

dail_years %>% pivot_longer(
  cols = starts_with("seats_"),
  names_to = "year",
  names_prefix = "seats_",
  values_to = "seats") -> dail_longer

Then we can add specific hex colours for the main parties.

dail_longer %<>%
  mutate(color = ifelse(party == "Sinn Féin", "#2fb66a",
         ifelse(party == "Fine Gael","#6699ff",
         ifelse(party == "Fianna Fáil","#ee9f27","#495051"))))

Next, we can create a final_positions data.frame so that can put the names of the political parties at the end of the trend line instead of having a legend floating at top of the graph.

final_positions <_ dail_longer %>%
  group_by(party) %>%
  filter(year == max(year))  %>% 
  mutate(color = ifelse(party == "Sinn Féin", "#2fb66a",
         ifelse(party == "Fine Gael","#6699ff",
         ifelse(party == "Fianna Fáil","#ee9f27", "#495051")))

Click here to read more about the ggbump package

Bump charts for ranking with ggbump package in R

dail_longer %>% 
  ggplot(aes(x = year, y = seats, group = party)) +

  geom_bump(aes(color = color,
           alpha = ifelse(party == "Sinn Féin", 0.5, 0.2),
           linewidth = ifelse(party == "Sinn Féin", 0.8, 0.7)),
            smooth = 5) +

  geom_text(data = final_positions,
            aes(color = color,
            y = ifelse(party == "Fine Gael", seats - 3, seats),
            label = party,
            family = "Georgia"),
            x = x_position + 1.5,  
            hjust = 0, 
            size = 10) +

  geom_point(color = "white", 
             size = 6, 
             stroke = 3) +
  
  geom_point(aes(color = color,
             alpha = ifelse(party == "Sinn Féin", 0.5, 0.1)),
             size = 4) +

  scale_linewidth_continuous(range = c(2, 5)) +

  scale_alpha_continuous(range = c(0.2, 1)) +

  bbplot::bbc_style()  +

  theme(legend.position = "none",
        plot.title = element_text(size = 48)) +

  scale_color_identity() + 

  scale_x_continuous(limits = c(1980, 2030), breaks = election_years) +

  labs(title = "Sinn Féin has seen a steady increase in Dáil vote\n share after years hovering around zero seats")

Dail Seat Share from 1982 to 2024 Download

How to improve graphs with themes and palettes: Top packages in R

In this blog, we can look at ways to make our plots and graphs more appealing to the eye.

https://giphy.com/gifs/movie-pretty-mean-girls-wOU9mPo60N7Us

Before we go about working on the aesthetics, let’s build and save a typical political science graph.

We will examine the inverted U shape between democracy and level of mass mobilization across six different regions.

The data will come from the V-DEM package.

Click here to read more about downloading and animating Varieties of Democracy (V-DEM) variables with the vdemdata package in R.

How to download and animate the Varieties of Democracy (V-DEM) dataset in R

Packages we will be using to create our initial graph.

library(tidyverse)
library(magrittr) # for the %<>% pipe
library(devtools)
library(vdemdata)

So first, we make a basic plot with all the ggplot defaults:

vdem %>% 
  filter(year == 2010) %>% 
  ggplot(aes(x = v2x_polyarchy, 
             y = v2cademmob)) + 
  geom_point() +
  geom_smooth(method = "gam") +
  labs(title = "Democracy and Mass Mobilization scatterplot", 
       subtitle = "Source: V-DEM",
       x = "Democracy", 
       y = "Mass Mobilization") +
  facet_wrap(~e_regionpol_6C)

https://giphy.com/gifs/thenextstep-the-next-step-tns-tns8-pwnpqOmGHfBdQPWU3e

Next, we can add some elements to add color and labels:

vdem %>% 
  mutate(
    e_regionpol_6C = case_when(
      e_regionpol_6C == 1 ~ "Post-Soviet",
      e_regionpol_6C == 2 ~ "Latin America",
      e_regionpol_6C == 3 ~ "MENA",
      e_regionpol_6C == 4 ~ "Africa",
      e_regionpol_6C == 5 ~ "West",
      e_regionpol_6C == 6 ~ "Asia",
      TRUE ~ NA)) %>% 
  filter(year == 2010) %>% 
  ggplot(aes(x = v2x_polyarchy, 
             y = v2cademmob)) +
  geom_smooth(aes(color = as.factor(e_regionpol_6C)), 
              method = "loess", 
              span = 2,
              se = FALSE,
              size = 2,
              alpha = 0.3) + 
  geom_point(aes(color = as.factor(e_regionpol_6C)),
                 size = 4, alpha = 0.5) +
  labs(title = "Democracy and Mass Mobilization scatterplot", 
       subtitle = "Source: V-DEM",
       x = "Democracy", 
       y = "Mass Mobilization") +
  facet_wrap(~e_regionpol_6C) + 
  theme(text = element_text(size = 20),
        legend.position = "none") + 
  guides(fill = guide_legend(keywidth = 3, keyheight = 3)) -> my_plot

Adding Studio Ghibli palette and ggthemes themes

remotes::install_github("ewenme/ghibli")
library(ghibli)

This package comes from ewenme’s github.

ghibli_palettes -> ghibli_palettes_list

This shows the 27 ghibli colour palettes available in the package! We can print off and browse through the colours to choose what to add to our plot:

To add the colours, we just need to add scale_colour_ghibli_d("MononokeMedium") to the plot object. I choose the pretty Mononoke Medium palette. The d at the end means we are using discrete data.

Additionally, I will add a plot theme from the ggthemes package.

devtools::install_github("jrnold/ggthemes")
library(ggthemes)

This package comes from Jeffrey Arnold’s github.

FiveThirtyEight is a polling website with buckets of graphics, such as:

To add this theme, we just need to add theme_fivethirtyeight()

my_plot +
  scale_colour_ghibli_d("MononokeMedium") + 
  ggthemes::theme_fivethirtyeight() + 
  theme(text = element_text(size = 25),
        legend.position = "none")

We can mix and match with different palettes and themes.

The following uses a template that resembles the Wall Street Journal graphs and MarnieMedium1 colours!

my_plot +
  scale_colour_ghibli_d("MarnieLight1") + 
  ggthemes::theme_wsj() + 
  theme(text = element_text(size = 25),
        legend.position = "none")

And another mix and match:

Studio Ghibli PonyoMedium palette with the graph style from the Economist magazine

my_plot +
  scale_colour_ghibli_d("PonyoMedium", direction = -1) + 
  ggthemes::theme_economist() +
  theme(text = element_text(size = 25),
        legend.position = "none")

For continous variables, we can use scale_fill_ghibli_c()

We can look at average democracy scores around the world for all countries between 1945 to 2023.

We need to download a map object from the rnaturalearth package and merge it with the V-DEM dataset.

Click here to learn more about making maps in R

Plot variables on a map with rnaturalearth package in R

my_map <- ne_countries(scale = "medium", returnclass = "sf")

my_map %<>% 
  mutate(COWcode = countrycode::countrycode(admin, "country.name", "cown"))

vdem_map <- left_join(vdem, my_map, by = c("COWcode"))

vdem_map %>% 
  filter(year %in% c(1945:2023)) %>% 
  filter(sovereignt != "Antarctica") %>% 
  group_by(admin, geometry) %>% 
  summarise(avg_polyarchy = mean(v2x_polyarchy, na.rm = TRUE)) %>% 
  ungroup() %>% 
  ggplot() +
  geom_sf(aes(geometry = geometry, fill = avg_polyarchy),  
          position = "identity", color = "#212529", linewidth = 0.2, alpha = 0.85) +
  ghibli::scale_fill_ghibli_c("PonyoLight") + 
  ggdark::dark_theme_void() +
  theme(legend.title = element_blank(),
        legend.position = "left") + 
  guides(fill = guide_legend(keywidth = 5, keyheight = 5))

Adding Dutch painter palettes and ggdark themes

The next palette we will look at comes from Edward Theon and takes the colors from Dutch master painters such as Vermeer and Rembrandt.

devtools::install_github("EdwinTh/dutchmasters")
library(dutchmasters)

The themes for these plots come from Neal Grantham. They offer dark or black backgrounds; I always think this makes plots and charts look more professional, I don’t know why.

devtools::install_github("nsgrantham/ggdark")
library(ggdark)

So adding this palette and theme, we get:

my_plot + 
  dark_theme_gray() +
  scale_color_dutchmasters(palette = "pearl_earring") +
  theme(text = element_text(size = 25),
        legend.position = "none")

Adding LaCroix palettes and `ggtech` themes

Next we will look at LaCroixColoRpalettes. I’ll be honest, I have never lived in a country that sells this drink in the store so I’ve never tried it. But it looks pretty.

devtools::install_github("johannesbjork/LaCroixColoR")

LaCroixColoR::lacroix_palettes -> lacroix_palettes_list

This package comes from the brain of Johannes Bjork

This list contains 21 palettes to choose from such as the following:

For the next few graphs, we will also look at the ggtech package.

They do random tech companies such as Google, AirBnB, Facebook and Etsy.

devtools::install_github("ricardo-bion/ggtech")

https://giphy.com/gifs/internet-college-google-y47oj4ptjPm5W

If we want to change the font, I have always found it tricky like Run DMC, no matter HOW MANY TIMES I do it.

library(extrafont)

Registering fonts with R

For Google fonts, click the following link:
http://social-fonts.com/assets/fonts/product-sans/product-sans.ttf

And install the font onto your computer.

font_import(pattern = 'product-sans.ttf', prompt = FALSE)

loadfonts(device = "win")

Now, we can make the plot:

my_plot +
  scale_color_manual(values = LaCroixColoR::lacroix_palette("CranRaspberry")) +
  ggtech::theme_tech(theme = "google") +
  theme(text = element_text(size = 25, 
                            family = "Product Sans"),
        legend.position = "none",
        plot.title = element_text(color="#172869"),
        plot.subtitle = element_text(color="#172869"),
        axis.title.x = element_text(color="#088BBE"),
        axis.title.y = element_text(color="#088BBE"),
        strip.text = element_text(color="#172869"))

And finally, two of my favorite packages, bbplot and wesanderson for making pretty plots.

https://giphy.com/gifs/filmeditor-mean-girls-movie-l2YWqLVkNBNjiBaXS

Click here to read more about the bbplot package and here to read more about the ggstream package

BBC style graphs with bbplot package in R

Exploratory Data Analysis and Descriptive Statistics for Political Science Research in R

vdem %>% 
  filter(year %in% c(1800:2020)) %>% 
  group_by(year, e_regionpol_6C) %>% 
  count() -> country_count

country_count %>% 
  mutate(e_regionpol_6C = case_when(
    e_regionpol_6C == 1 ~ "Post-Soviet",
    e_regionpol_6C == 2 ~ "Latin America",
    e_regionpol_6C == 3 ~ "MENA",
    e_regionpol_6C == 4 ~ "Africa",
    e_regionpol_6C == 5 ~ "West",
    e_regionpol_6C == 6 ~ "Asia",
    TRUE ~ NA)) %>% 
  ggplot(aes(x = year, y = n, fill = as.factor(e_regionpol_6C))) +
  ggstream::geom_stream() +
  bbplot::bbc_style() + 
  scale_fill_manual(values = LaCroixColoR::lacroix_palette("Pamplemousse")) +
  scale_x_continuous(breaks = seq(min(country_count$year, na.rm = TRUE), 
                                  max(country_count$year, na.rm = TRUE), 
                                  by = 20)) +
  theme(legend.title = element_blank(),
        axis.text.y = element_blank()) +
  labs(title = "Number of Sovererign Countries 1800 - 2020", 
       subtitle = "Source: V-DEM")

Click here to read more about the wesanderson package and click here to read more about the waffle package in R

Make Wes Anderson themed graphs with wesanderson package in R

Alternatives to pie charts: coxcomb and waffle charts

country_count %>% 
  ggplot(aes(fill = e_regionpol_6C, values = n)) + 
  waffle::geom_waffle(n_rows = 15, size = 0.5, colour = "white",
                      flip = TRUE, make_proportional = FALSE) + 
  bbplot::bbc_style() +
  scale_fill_manual(values = sample(wesanderson::wes_palette("Zissou1Continuous"))) +
  theme(legend.title = element_blank(),
        axis.text.y = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Number of Sovereign Countries 1800 - 2020", 
       subtitle = "Source: V-DEM")

A sample of the 24 wesanderson package options

How to download and animate the Varieties of Democracy (V-DEM) dataset in R

In this blog post, we will download the V-DEM datasets with their vdemdata package. It is still in development, so we will use the install_github() function from the devtools package

devtools::install_github("vdeminstitute/vdemdata")

library(vdemdata)

And really quickly we can download the dataset with one line of code

vdemdata::vdem -> vdem

We can use the find_var function to get information on variables based on keywords.

For example, we can look up variables that are concerned with protest mobilization.

vdemdata::find_var("mobilization") -> mob

 mob %>% names
 [1] "question_id"              "question_number"          "metasection"             
 [4] "name"                     "vartype"                  "cb_section"              
 [7] "tag"                      "projectmanager"           "question"                
[10] "clarification"            "responses"                "ordering"                
[13] "scale"                    "answertype"               "sources"                 
[16] "notes"                    "datarelease"              "citation"                
[19] "coverage"                 "subsetof"                 "crosscoder_aggregation"  
[22] "aggregation"              "ccp_tag"                  "clean_tag"               
[25] "survey_id"                "vignettes_used"           "old_tag"                 
[28] "compiler"                 "clarification_historical" "codebook_id"             
[31] "conthistmerge"            "histmerged"               "years"                   
[34] "hist_outside_coding"      "additional_versions"      "cleaning"                
[37] "date_specific"            "available_versions"       "cont_outside_coding"     
[40] "overlap_use_hist"         "is_party"                 "cb_section_type"         
[43] "defaultdate"              "convergence"              "cy_aggregation"          
[46] "no_update"

Or download the entire codebook:

vdemdata::codebook -> vdem_codebook

And we can look at information for a specific variable

vdemdata::var_info("e_regionpol_6C") -> region_info 
region_info$responses

1: Eastern Europe and Central Asia (including Mongolia and German Democratic Republic)
2: Latin America and the Caribbean
3: The Middle East and North Africa (including Israel and Türkiye, excluding Cyprus)
4: Sub-Saharan Africa
5: Western Europe and North America (including Cyprus, Australia and New Zealand, but excluding German Democratic Republic)
6: Asia and Pacific (excluding Australia and New Zealand; see 5)"

For our analysis, we can focus on the years 1900 to 2022.

vdem %<>% 
  filter(year %in% c(1900:2022))

And we will create a ggplot() object that also uses the Five Thirty Eight theme from the ggthemes package.

Click here to read more about the ggthemes options.

Source: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/

In the V-DEM package, we will look at a scatterplot of CSO consultation (v2cscnsult) and democracy score (v2x_polyarchy).

v2cscnsult asks are major civil society organizations (CSOs) routinely consulted by policymakers on policies relevant to their members?

v2x_polyarchy examines to what extent is the ideal of electoral democracy in its fullest sense achieved?

First, find below the packages we will need to install and load

install.packages("gganimate")
install.packages("transformr")  # sometimes needed as a dependency

library(gganimate)

And we plot our graph:

my_graph <- ggplot(vdem, aes(x = v2cscnsult, 
                      y = v2x_polyarchy, 
                      group = year)) +
  geom_point()  +
  ggthemes::theme_fivethirtyeight() +
  theme(text = element_text(size = 12),  # Default text size for all text elements
        plot.title = element_text(size = 20, face="bold"),  
        axis.title = element_text(size = 16), 
        axis.text = element_text(size = 14), 
        legend.title = element_text(size = 14),  
        legend.text = element_text(size = 12))

In the themes argument, we can change the size of the text for the various parts of the ggplot (legends, axes etc.)

To make the ggplot object animated, we use the transition_time(year) function from the gganimate package.

Also we can add a subtitle the displays the year and time frame in the graph.

animated_graph <- my_graph +
  transition_time(year) +
  labs(title = "CSO consultation and Polyarchy Democracy",
       subtitle = "Time: {frame_time}",
       caption = "Source: VDEM 1900 to 2022",
       x = "CSO Consultation",
       y = "Polyarchy")

And we can change how we render the graph with the animate() function.

We choose duration = 15 so that the gif lasts 15 seconds

We set frames per second to 20 fps (the higher the number, the smoother the gif changes, but the longer it takes to load)

And finally we can choose a special renderer that makes the gif more smooth too.

Finally we can save the gif to our computer (so I can upload it here on this blog)

animate(animated_plot, duration = 15, fps = 20, renderer = gifski_renderer()) -> CSO_poly_gif

anim_save("animated_plot.gif", animation = CSO_poly_gif)

We can make a few changes so that it is divided by region and adds colors:

Notice the change to subtitle = "Year: {as.integer(frame_time)}" so it only uses the year, not the year and frame rate.

ggplot(vdem, aes(x = v2cscnsult, 
                      y = v2x_polyarchy, 
                      group = year,
                 size = e_pop, 
                 colour = as.factor(e_regionpol_6C))) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  ggthemes::theme_fivethirtyeight() +
  theme(text = element_text(size = 12),  
        plot.title = element_text(size = 20, face="bold"), 
        axis.title = element_text(size = 16),  
        axis.text = element_text(size = 14),  
        legend.title = element_text(size = 14),  
        legend.text = element_text(size = 12)) +
  facet_wrap(~e_regionpol_6C) +
  transition_time(year) +
  labs(title = "CSO consultation and Polyarchy Democracy",
       subtitle = "Year: {as.integer(frame_time)}",
       caption = "Source: VDEM 1900 to 2022",
       x = "CSO Consultation",
       y = "Polyarchy")

Next, we can animate a map

library(sf)
library(rnaturalearth)

First we download a world object with the longitude and latitude data we need.

Click here to read more about the rnaturalearth package

Plot variables on a map with rnaturalearth package in R

my_map <- ne_countries(scale = "medium", returnclass = "sf")

And we merge the two data.frames together

my_map %<>% 
  mutate(COWcode = countrycode::countrycode(sovereignt, "country.name", "cown"))

vdem_map <- left_join(vdem, my_map, by = c("COWcode"))

Set up some colors for the map

colors <- c("#001427", "#708d81", "#f4d58d", "#bf0603", "#8d0801")

Next we draw the map:

vdem_map %>% 
  filter(year %in% c(1945:2023)) %>% 
  filter(sovereignt != "Antarctica") %>% 
  group_by(country_name, geometry) %>% 
  summarise(avg_polyarchy = mean(v2x_polyarchy, na.rm = TRUE)) %>% 
  ungroup() %>% 
  ggplot() +
  geom_sf(aes(geometry = geometry, fill = avg_polyarchy),  
          position = "identity", color = "#212529", linewidth = 0.2, alpha = 0.85) +
  geom_tile(data = data.frame(value = seq(0, 1, length.out = length(colors))), 
            aes(x = 1, y = value, fill = value), 
            show.legend = FALSE) +
  scale_fill_gradientn(colors = colors, 
                       breaks = scales::pretty_breaks(n = length(colors)),
                       labels = scales::number_format(accuracy = 1)) +
  theme_minimal()

When I made the first attempt to animate the polyarchy democracy data on the map, it was a bit of an affront to the senses:

So attempt number two involved a bit more wrangling!

vdem_map %>% 
  filter(year %in% c(1945:2023)) %>% 
  filter(sovereignt != "Antarctica") %>% 
  ggplot() +
  geom_sf(aes(geometry = geometry, fill = v2x_polyarchy, group = year),  
          position = "identity", color = "#212529", linewidth = 0.2, alpha = 0.85) +
  theme_minimal() +
  scale_fill_gradientn(colors = colors,
                       breaks = scales::pretty_breaks(n = length(colors)),
                       labels = scales::number_format(accuracy = 3)) +
  transition_time(year) + 
  labs(title = "Polyarchy Democracy annual global changes",
       subtitle = "Year: {as.integer(frame_time)}",
       caption = "Source: VDEM 1900 to 2023",
       x = " ",
       y = " ",
       fill = "Democracy Score")  -> my_plot

animate(my_plot, duration = 40, fps = 40,
        renderer = gifski_renderer()) -> map_gif

anim_save("animated_plot_3.gif", animation = map_gif)

How to graph model variables with the tidy package in R

Packages we will need:

library(tidyverse)
library(broom)
library(stargazer)
library(janitor)
library(democracyData)
library(WDI)

We will make a linear regression model and graph the coefficients to show which variables are statistically significant in the regression with ggplot.

First we will download some variables from the World Bank Indicators package.

Click here to read more about the WDI package.

Download WorldBank data with WDI package in R

We will use Women Business and the Law Index Score as our dependent variable.

The index measures how laws and regulations affect women’s economic opportunity. Overall scores are calculated by taking the average score of each index (Mobility, Workplace, Pay, Marriage, Parenthood, Entrepreneurship, Assets and Pension), with 100 representing the highest possible score.

And then a few independent variables for the model

women_business = WDI(indicator = "SG.LAW.INDX")

gdp_percap = WDI(indicator = "NY.GDP.PCAP.KD")
pop <- WDI(indicator = "SP.POP.TOTL") 
mil_spend_gdp = WDI(indicator = "MS.MIL.XPND.ZS")
mortality = WDI(indicator = "SP.DYN.IMRT.IN")

We will merge them all together and rename the columns with inner_join()

women_business %>%
  filter(year > 1999) %>% 
  inner_join(mortality) %>% 
  inner_join(mil_spend_gdp) %>% 
  inner_join(gdp_percap) %>% 
  inner_join(pop) %>% 
  inner_join(mortality) %>% 
  select(country, year, iso2c,
         pop = SP.POP.TOTL,
         fem_bus = SG.LAW.INDX, 
         mortality = SP.DYN.IMRT.IN,
         gdp_percap = NY.GDP.PCAP.KD,
         mil_gdp = MS.MIL.XPND.ZS) -> wdi

We will remove all NA values and take a summary of all the variables.

Finally, we will filter out al the variables that are not countries according to the Correlates of War project.

  wdi %>%  mutate_all(~ifelse(is.nan(.), NA, .)) %>% 
  select(-year) %>% 
  group_by(country, iso2c) %>% 
  summarize(across(where(is.numeric), mean,  
                   na.rm = TRUE, .names = "mean_{col}")) %>% 
  ungroup() %>% 
  mutate(cown = countrycode::countrycode(iso2c, "iso2c", "cown")) %>% 
  filter(!is.na(cown)) -> wdi_summary

Next we will also download Freedom House values from the democracyData package.

Visualise DemocracyData with graphs and maps

fh <- download_fh()

fh %>% 
  group_by(fh_country) %>% 
  filter(year > 1999) %>% 
  summarise(mean_fh = mean(fh_total, na.rm = TRUE)) %>% 
  mutate(cown = countrycode::countrycode(fh_country, "country.name", "cown")) %>% 
  mutate_all(~ifelse(is.nan(.), NA, .)) %>% 
  filter(!is.na(cown))  -> fh_summary

If you want to find resources for more data packages, click here.

Top R packages for downloading political science and economics datasets

We merge the Freedom House and World Bank data

fh_summary %>%
  inner_join(wdi_summary, by = "cown") %>% 
  select (-c(cown, iso2c, fh_country)) -> wdi_fh

We can look at the summarise of all the variables with the skimr package.

Summarise data with skimr package in R

wdi_fh %>% 
  skim()

And we will take the log of some of the following variables:

wdi_fh %<>% 
  mutate(log_mean_pop = log10(mean_pop),
         log_mean_gdp_percap = log10(mean_gdp_percap),
         log_mean_mil_gdp = log10(mean_mil_gdp)) %>% 
  select(!c(country, mean_pop, mean_gdp_percap, mean_mil_gdp))

Next, we run a quick linear regression model

wdi_fh %>% 
  lm(mean_fem_bus ~ ., data = .) -> my_model

We can print the model output with the stargazer package

my_model %>% 
  stargazer(type = "html")


	Dependent variable:

	mean_fem_bus

mean_fh	2.762^***
	(0.381)

mean_mortality	-0.131^**
	(0.065)

log_mean_pop	1.065
	(1.434)

log_mean_gdp_percap	-3.201
	(2.735)

log_mean_mil_gdp	-10.038^**
	(3.870)

Constant	61.030^***
	(16.263)


Observations	154
R²	0.566
Adjusted R²	0.551
Residual Std. Error	11.912 (df = 148)
F Statistic	38.544^*** (df = 5; 148)

Note:	^p<0.1; ^p<0.05; ^**p<0.01

And we will use the tidy() function from the broom package to extract the estimates and confidence intervals from the model

my_model %>%
  tidy(., conf.int = TRUE) %>% 
  filter(term != "(Intercept)") %>% 
  janitor::clean_names() %>% 
  mutate(term = ifelse(term == "log_mean_mil_gdp", "Military spending (ln)",
    ifelse(term == "log_mean_gdp_percap", "GDP per capita (ln)",
    ifelse(term == "log_mean_pop", "Population (ln)",
    ifelse(term == "mean_mortality", "Mortality rate",
    ifelse(term == "mean_fh", "Freedom House", "Other")))))) -> tidy_model

And we can plot the tidy values

tidy_model %>% 
  ggplot(aes(x = reorder(term, estimate), y = estimate)) +
  geom_hline(yintercept = 0, color = "#bc4749", size = 4, alpha = 0.4) +
  geom_errorbar(aes(ymin = conf_low, ymax = conf_high,
  color = ifelse(conf_low * conf_high > 0, "#023047", 
  ifelse(term == "Mortality rate", "#ca6702", "#ca6702"))), width = 0.1, size = 2) +
  geom_point(aes(color = ifelse(conf_low * conf_high > 0, "#023047", 
  ifelse(term == "Mortality rate", "#ca6702", "#ca6702"))), size = 4) + 
coord_flip() +
  scale_color_manual(labels = c("Significant", "Insignificant"), values = c("#023047", "#ca6702")) +
labs(title = "Model Variables", x = "", y = "", caption = "Source: Your Data Source") + bbplot::bbc_style()

How to graph proportions with the waffle and treemapify packages in R

Packages we will need:

library(tidyverse)
library(magrittr)
library(waffle)
library(treemapify)

In this blog, we will look at visualising proportions in a few lines.

I have some aid data and I want to see what proportion of the aid does not have a theme category.

This can be useful to visualise incomplete data across years or across categories.

First, we can make a waffle chart with the waffle package.

30 Rock Pizza GIF - Find & Share on GIPHY

First, we will create a binary variable that has 1 if the theme is “Other Theme” and 0 if it has a theme value. We will do this for every year.

aid_data %>% 
  group_by(start_year) %>% 
  mutate(binary_variable = as.numeric(theme_1 == "Other Theme")) %>% 
  ungroup() %>% count()

# Groups:   start_year [10]
   start_year     n
        <int> <int>
 1       2012     1
 2       2013     3
 3       2014    17
 4       2015    91
 5       2016   100
 6       2017    94
 7       2018   198
 8       2019   144
 9       2020   199
10       2021   119

Then we will count the number of 0 and 1s for each year with group_by(start_year, binary_variable)

aid_data %>% 
  group_by(start_year) %>% 
  mutate(binary_variable = as.numeric(theme_1 == "Other Theme")) %>%
  ungroup() %>% 
  group_by(start_year, binary_variable) %>% 
  count() %>%

# A tibble: 14 × 3
# Groups:   start_year, binary_variable [14]
   start_year binary_variable     n
        <int>           <dbl> <int>
 1       2012               0     1
 2       2013               0     3
 3       2014               0    17
 4       2015               0    90
 5       2015               1     1
 6       2016               0   100
 7       2017               0    94
 8       2018               0   124
 9       2018               1    74
10       2019               0    18
11       2019               1   126
12       2020               1   199
13       2021               0     1
14       2021               1   118

We can do the two steps above together in one step and then create the ggplot object with the geom_waffle() layer.

For the ggplot layers:

We use the binary_variable in the fill argument.

We use the n variable in the values argument.

We will facet_wrap() with the start_year argument.

aid_date %>%
 group_by(start_year) %>% 
  mutate(binary_variable = as.numeric(theme_1 == "Other Theme")) %>%
  ungroup() %>% 
  group_by(start_year, binary_variable) %>% 
  count() %>% 
  ggplot(aes(fill = as.factor(binary_variable), values = n)) +
  geom_waffle(color = "white", size = 0.3, n_rows = 10, flip = TRUE) +
  facet_wrap(~start_year, nrow = 1, strip.position = "bottom") + 
  bbplot::bbc_style() +
  scale_fill_manual(values =c("#003049", "#bc4749"),
                    name = "No theme?",
                    labels = c("Theme", "No Theme")) +
  theme(axis.text.x.bottom = element_blank(),
        text = element_text(size = 40))

We can see that all the years up to 2018 have most of the row categorised. After 2019, it all goes awry; most of the aid rows are not categorised at all. Messy.

Although, I prefer the waffle charts, because it also shows a quick distribution of aid rows across years (only 1 in 2012 and many in later years), we can also look at pie charts

We can facet_wrap() with pie charts…

… however, there are a few steps to take so that the pie charts do not look like this:

We cannot use the standard coord_polar argument.

Rather, we set a special my_coord_polar to use as a layer in the ggplot.

my_coord_polar <- coord_polar(theta = "y")
my_coord_polar$is_free <- function() TRUE

Then we use the same count variables as above.

We also must change the facet_wrap() to include scales = "free"

aid_data %>%
  group_by(start_year) %>% 
  mutate(binary_variable = as.numeric(theme_1 == "Other Theme")) %>% 
  ungroup() %>% 
  group_by(start_year, binary_variable) %>% 
  count() %>% 
  ungroup() %>% 
  ggplot(aes(x = "", y = n, fill = as.factor(binary_variable))) +
  geom_bar(stat="identity", width=1) +
  my_coord_polar +
  theme_void() + 
  facet_wrap(~start_year, scales = "free")+ 
  scale_fill_manual(values =c("#003049", "#bc4749"),
                    name = "No theme?",
                    labels = c("Theme", "No Theme"))

And we can create a treemap to see the relative proportion of regions that receieve an allocation of aid:

First some nice hex colors.

pal <- c("#005f73", "#006f57", "#94d2bd", "#ee9b00", "#ca6702", "#8f2d56", "#ae2012")

Then we create characters strings for the numeric region variable and use it for the fill argument in the ggplot.

aid_data %>% 
  mutate(region = case_when(
    pol_region_6 == 1 ~ "Post-Soviet",
    pol_region_6 == 2 ~ "Latin America",
    pol_region_6 == 3 ~ "MENA",
    pol_region_6 == 4 ~ "Africa",
    pol_region_6 == 5 ~ "West",
    pol_region_6 == 6 ~ "Asia",
    TRUE ~ "Other"))  %>% 
  group_by(region) %>% 
  count() %>% 
  ggplot(aes(area = n, fill = region, 
             label = paste(region, n, sep = "\n"))) +
  geom_treemap(color = "white", size = 3) +
  geom_treemap_text(
    place = "centre",
    size = 20) +
  theme(legend.position = "none")  +
  scale_fill_manual(values = sample(pal))

How to graph bubble charts and treemap charts in R

Packages we will need:

library(tidyverse)
library(bubbles)
library(treemapify)
library(democracyData)
library(magrittr)

In this blog, we will look at different types of charts that we can run in R.

Approve Always Sunny GIF by It's Always Sunny in Philadelphia - Find & Share on GIPHY

Both the bubble and treemap charts are simple to run.

Before we begin, we will choose some hex colors for the palette. I always use the coolors palettes website to find nice colours.

pal <- c("#bc4749", "#005f73", 
         "#0a9396", "#94d2bd", 
         "#bb3e03","#003049",
         "#fca311", "#99d98c",
         "#9a5059", "#ee9b00")

First, we can download the data we will graph.

In this case, we will use the DD (democracies and dictatorships) regime data from PACL dataset on different government regimes.

The DD dataset encompasses annual data points for 199 countries spanning from 1946 to 2008. The visual representations on the left illustrate the outcomes in 1988 and 2008.

Cheibub, Gandhi, and Vreeland devised a six-fold regime classification scheme, giving rise to what they termed the DD datasets. The DD index categorises regimes into two types: democracies and dictatorships.

Democracies are divided into three types: parliamentary, semi-presidential, and presidential democracies.

Dictatorships are subcategorized into monarchic, military, and civilian dictatorship.

democracyData::pacl -> pacl

First, we create a new variable with the regime names, not just the number. The values come from the codebook.

Tv Show Drinking GIF - Find & Share on GIPHY

pacl %<>% 
  mutate(regime_name = ifelse(regime == 0, "Parliamentary democracies",
                 ifelse(regime == 1, "Mixed democracies",
                 ifelse(regime == 2, "Presidential democracies",
                 ifelse(regime == 3, "Civilian autocracies",
                 ifelse(regime == 4, "Military dictatorships",
                 ifelse(regime ==  5,"Royal dictatorships", regime))))))) %>%
  mutate(regime = as.factor(regime))

  order pacl_country  year regime_name        
  <dbl> <chr>        <dbl> <chr>              
1     1 Afghanistan   1946 Royal dictatorships
2     2 Afghanistan   1947 Royal dictatorships
3     3 Afghanistan   1948 Royal dictatorships
4     4 Afghanistan   1949 Royal dictatorships
5     5 Afghanistan   1950 Royal dictatorships
6     6 Afghanistan   1951 Royal dictatorships

We want to count the number of different regime types.

pacl %>% 
  group_by(regime_name) %>% 
  count() -> pacl_count

We can graph out the geom_treemap() layer in the ggplot() object

pacl_count %>% 
  ggplot(aes(area = n, fill = regime_name, 
           label = paste(regime_name, n, sep = "\n"))) +
  geom_treemap(color = "white", size = 3) +
  geom_treemap_text(
    place = "centre",
    size = 20) +
  theme(legend.position = "none") + 
  scale_fill_manual(values = sample(pal))

And we can use the bubbles() function from the bubbles package.

Unfortunately, it is not pipable into ggplot, so it is hard to edit factors such as the font size.

bubbles::bubbles(label = pacl_count$regime_name,
                 value = pacl_count$n, 
                 color = sample(pal, size = length(pacl_count$regime_name)))

Thank you for reading!

How to graph Locally Weighted Scatterplot Smoothing (LOESS) in R

The loess method in ggplot2 fits a smoothing line to our data.

We can do this with the method = "loess" in the geom_smooth() layer.

LOESS stands “Locally Weighted Scatterplot Smoothing.” (I am not sure why it is not called LOWESS … ?)

The loess line can help show non-linear relationships in the scatterplot data, while taking care of stopping the over-influence of outliers.

Loess gives more weight to nearby data points and less weight to distant ones. This means that nearby points have a greater influence on the squiggly-ness of the line.

The degree of smoothing is controlled by the span parameter in the geom_smooth() layer.

When we set the span, we can choose how many nearby data points are considered when estimating the local regression line.

A smaller span (e.g. span = 0.5) results in more local (flexible) smoothing, while a larger span (e.g. span = 1.5) produces more global (smooth) smoothing.

We will take the variables from the Varieties of Democracy dataset and plot the relationship between oil produciton and media freedoms across different regions.

df %>% 
  ggplot(aes(x = log_avg_oil,
             y = avg_media)) +
  geom_point(size = 6, alpha = 0.5) + 
  geom_smooth(aes(color = region), 
              method = "loess", 
              span = 2,
              se = FALSE,
              size = 3,
              alpha = 0.6) + 
  facet_wrap(~region) + 
  labs(title = "Oil and Media Corruption", subtitle = "VDEM",
       x = "Average Oil logged",
       y = "Average Media Freedom") +
  scale_color_manual(values = my_pal) + 
  my_theme()

If we change the span to 0.5, we get the following graph:

              span = 0.5

George Costanza Dancing GIF by Crave - Find & Share on GIPHY

When examining the connection between oil production and media freedoms across various regions, there are many ways to draw the line.

If we think the relationship is linear, it is no problem to add method = "lm" to the graph.

However, if outliers might overly distort the linear relationship, method = "rlm" (robust linear model” can help to take away the power from these outliers.

Linear and robust linear models (lm and rlm) can also accommodate parametric non-linear relationships, such as quadratic or cubic, when used with a proper formula specification.

For example, “geom_smooth(method=’lm’, formula = y ~ x + I(x^2))” can be used for estimating a quadratic relationship using lm.

If the outcome variable is binary (such as “is democracy” versus “is not democracy” or “is oil producing” versus “is not oil producing”) we can use method = “glm” (which is generalised linear model). It models the log odds of a oil producing as a linear function of a predictor variable, like age.

If the relationship between age and log odds is non-linear, the gam method is preferred over glm. Both glm and gam can handle outcome variables with more than two categories, count variables, and other complexities.

How to graph different distributions for political science analysis in R. PART 1: Binomial, Bernoulli and Geometric Distributions.

Packages we will need:

library(tidyverse)

In this blog, we will look at three distributions.

Distributions are fundamental to statistical inference and probability.

Cbc No GIF by Kim's Convenience - Find & Share on GIPHY

The data we will be using is on Irish legislative elections from 1919.

irish_leg Download

Binomial Distribution

First we can look at the binomial distribution.

We can model the number of successful elections (e.g., a party winning) out of a fixed number of elections (trials).

In R, we use rbinom() to create the distribution.

rbinom(n, size, prob)

We need to feed in three pieces of information into this function

Parameters

n: The number of random samples we want.

size: The number of trials.

prob: The probability of success for each trial.

We can use rbinom() to simulate 100 elections and see how likely there will be a change in the party in power.

ire_leg %>% 
  filter(leg_election_change != "No election") %>% 
  summarise(avg_change = mean(change_binary, na.rm = TRUE))

When we print this, we learn that in 20% of the elections in Ireland, there has been a change in winning party.

So we will use the Binomial distribution to simulate 10 years of elections.

We will do this 100 times and create a graph of change probabilities.

Essentially we can visualise how likely there will we see a change in the party in power.

First, we choose how many times we want to estimate the probability

num_simulations <- 100

Next, we choose the number of years that we want to look at

years <- 10

Then, we set the probability that an election ends with new party in power :

probability_of_change <- 0.2

And we throw them all together into the rbinom() function

simulations <- rbinom(n = num_simulations, 
size = years, 
prob = probability_of_change)

proportion_of_changes <- mean(simulations > 0)

We can see that there is an 87% chance that the party in power will change in the next 10 years, according to 100 simulations.

We can use geom_histogram() to examine the distributions

ggplot(data.frame(simulations), aes(x = simulations)) +
  geom_histogram(binwidth = 1, fill = "#023047",
                 color = "black", alpha = 0.7) +
  labs(title = "Distribution of Party Changes",
       x = "Number of Changes",
       y = "Probability") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  scale_x_continuous(breaks = seq(0, 5, by = 1)) +
  bbplot::bbc_style()

And if we think the probability is high, we can graph that too.

So we can set the probability that the party in power wll change in one year to 0.8

probability_of_change <- 0.8

Geometric Distribution

While we use the binomial distribution to simulate the number of sucesses in a fixed number of trials, we use the geometric distribution to simulate number of trials needed until the first success (e.g. first instance that a new party comes into power after an election).

It can answer questions like, “On average, how many elections did a party need to contest before winning its first election?”

# Set the probability of that a party will change power in one year
prob_success <- 0.2   

# Generate values for the number of years until the first change in power
trials_values <- 1:20  

# Calculate the PMF values for the geometric distribution
pmf_values <- dgeom(trials_values - 1, prob = prob_success)

# Create a data frame
df <- data.frame(k = trials_values, pmf = pmf_values)

The dgeom() function in R is used to calculate the probability mass function (PMF) for the geometric distribution.

It returns the probability of obtaining a specific number of trials (k) until the first success occurs in a sequence of independent Bernoulli trials.

Each trial has a constant probability of success (p).

In this instance, the dgeom() function calculates the PMF for the number of trials until the first success (from 0 to 10 years).

This is estimated with a success probability of 0.2.

prob_success <- 0.2  

# Generate the number of trials until the first success
trials_values <- 1:20  

# Calculate the PMF values 
pmf_values <- dgeom(trials_values - 1, prob = prob_success)

# Create a data frame
my_dist <- data.frame(k = trials_values, pmf = pmf_values)

And we will graph the geometric distribution

my_dist %>%
  ggplot(aes(x = k,  y = pmf)) +
  geom_bar(stat = "identity", 
           fill = "#023047",
           alpha = 0.7) +
  labs(title = "Geometric Distribution",
    x = "Number of Years Until New Party",
    y = "Probability") +
  my_theme()

To interpret this graph, there is a 20% chance that there will be a new party next year and 10% chance that it will take 3 yaers until we see a new party in power.

Bernoulli Distribution

Nature of Trials

The Bernoulli distribution is the most simple case where each election is considered as an independent Bernoulli trial, resulting in either success (1) or failure (0) based on whether a party wins or loses.

The binomial distribution focuses on the number of successful elections out of a fixed number of trials (years).

The geometric distribution focuses on the number of trials (year) required until the first success (change of party in power) occurs.

The Bernoulli distribution is the simplest case, treating each change as an independent success/failure trial.

Thank you for readdhing. Next we will look at F and T distributiosn in police science resaerch.

How to create a Regional Economic Communities dataset. PART TWO: consolidating string variables to dummy variables

Click here to read PART ONE of the blog series on creating the Regional Economic Communities dataset

How to create a Regional Economic Communities dataset. PART ONE: Scraping data with rvest in R

Packages we will need:

library(tidyverse)
library(countrycode)
library(WDI)

There are eight RECs in Africa. Some countries are only in one of the RECs, some are in many. Kenya is the winner with membership in four RECs: CEN-SAD, COMESA, EAC and IGAD.

In this blog, we will create a consolidated dataset for all 54 countries in Africa that are in a REC (or TWO or THREE or FOUR groups). Instead of a string variable for each group, we will create eight dummy group variables for each country.

To do this, we first make a vector of all the eight RECs.

patterns <- c("amu", "cen-sad", "comesa", "eac", "eccas", "ecowas", "igad", "sadc")

We put the vector of patterns in a for-loop to create a new binary variable column for each REC group.

We use the str_detect(rec_abbrev, pattern)) to see if the rec_abbrev column MATCHES the one of the above strings in the patterns vector.

The new variable will equal 1 if the variable string matches the pattern in the vector. Otherwise it will be equal to 0.

The double exclamation marks (!!) are used for unquoting, allowing the value of var_name to be treated as a variable name rather than a character string.

Then, we are able to create a variable name that were fed in from the vector dynamically into the for-loop. We can automatically do this for each REC group.

In this case, the iterated !!var_name will be replaced with the value stored in the var_name (AMU, CEN-SAD etc).

We can use the := to assign a new variable to the data frame.

The symbol := is called the “walrus operator” and we use it make or change variables without using quotation marks.

for (pattern in patterns) {
  var_name <- paste0(pattern, "_binary")
  rec <- rec %>%
    mutate(!!var_name := as.integer(str_detect(rec_abbrev, pattern)))
}

This is the dataset now with a binary variables indicating whether or not a country is in any one of the REC groups.

However, we quickly see the headache.

We do not want four rows for Kenya in the dataset. Rather, we only want one entry for each country and a 1 or a 0 for each REC.

We use the following summarise() function to consolidate one row per country.

rec %>%
group_by(country) %>%
  summarise(
    geo = first(geo),
    rec_abbrev = paste(rec_abbrev, collapse = ", "),
    across(ends_with("_binary"), ~ as.integer(any(. == 1)))) ->  rec_consolidated

The first() function extracts the first value in the geo variable for each country. This first() function is typically used with group_by() and summarise() to get the value from the first row of each group.

We use the the across() function to select all columns in the dataset that end with "_binary".

The ~ as.integer(any(. == 1)) checks if there’s any value equal to 1 within the binary variables. If they have a value of 1, the summarised data for each country will be 1; otherwise, it will be 0.

The following code can summarise each filtered group and add them to a new dataset that we can graph:

summ_group_pop <- function(my_df, filter_var, rec_name) {
  my_df %>%
    filter({{ filter_var }} == 1) %>% 
    summarize(total_population = sum(pop, na.rm = TRUE)) %>% 
    mutate(group = rec_name)
}

filter_vars <- c("amu_binary", "cen.sad_binary", "comesa_binary", 
                 "eac_binary", "eccas_binary", "ecowas_binary",
                 "igad_binary", "sadc_binary")
group_names <- c("AMU", "CEN-SAD", "COMESA", "EAC", "ECCAS", "ECOWAS",
                 "IGAD", "SADC")

rec_pop_summary <- data.frame()

for (i in seq_along(filter_vars)) {
  summary_df <- summ_group_pop(rec_wdi, !!as.name(filter_vars[i]), group_names[i])
  rec_pop_summary <- bind_rows(rec_pop_summary, summary_df)
}

And we graph it with geom_bar()

rec_pop_summary %>% 
  ggplot(aes(x = reorder(group, total_population),
             y = total_population)) + 
  geom_bar(stat = "identity", 
           width = 0.7, 
           color = "#0a85e5", 
           fill = "#0a85e5") +
  bbplot::bbc_style() + 
  scale_y_continuous(labels = scales::comma_format()) +
  coord_flip() + 
  labs(x = "RECs", 
       y = "Population", 
       title = "Population of REC groups in Africa", 
       subtitle = "Source: World Bank, 2022")

Cline Center Coup d’État Project Dataset

tailwind_palette <- c("001219","005f73","0a9396","94d2bd","e9d8a6","ee9b00","ca6702","bb3e03","ae2012","9b2226")

add_hashtag <- function(my_vec){
    hash_vec <-  paste0('#', my_vec)
    return(hash_vec)
  
  tailwind_hash <- add_hashtag(tailwind_palette)

Coups per million people in each REC

create_stateyears(system = "cow") %>% 
    add_democracy() -> demo

rec_wdi %>% 
  select(!c(year, country)) %>% 
  left_join(demo, by = c("cown" = "ccode")) %>% 
  filter(year > 1989) %>% 
  filter(amu_binary == 1) %>% 
  group_by(year) %>% 
  summarize(mean_demo = mean(v2x_polyarchy, na.rm = TRUE)) %>% 
  mutate(rec = "AMU") -> demo_amu

We can use a function to repeat the above code with all eight REC groups:

 create_demo_summary <- function(my_df, filter_var, group_name) {
  my_df %>%
    select(!c(year, country)) %>% 
    left_join(demo, by = c("cown" = "ccode")) %>% 
    filter(year > 1989) %>% 
    filter({{ filter_var }} == 1) %>% 
    group_by(year) %>% 
    summarize(mean_demo = mean(v2x_polyarchy, na.rm = TRUE)) %>% 
    mutate(rec = group_name)
}

rec_democracy_df <- data.frame()

for (i in seq_along(filter_vars)) {
  summary_df <- create_demo_summary(rec_wdi, !!as.name(filter_vars[i]), group_names[i])
  rec_democracy_df <- bind_rows(rec_democracy_df, summary_df)
}

And we graph out the average democracy scores cross the years

rec_democracy_df %>% 
  ggplot(aes(x = year, y = mean_demo, group = rec)) + 
  geom_line(aes(fill = rec, color = rec), size = 2, alpha = 0.8) + 
  # geom_point(aes(fill = rec, color = rec), size = 4) + 
  geom_label(data = . %>% group_by(rec) %>% filter(year == 2019), 
             aes(label = rec, 
                 fill = rec, 
                 x = 2019), color = "white",
             legend = FALSE, size = 3) +  
  scale_color_manual(values = tailwind_hash) + 
  scale_fill_manual(values = tailwind_hash) + 
  theme(legend.position = "none")

How to create a Regional Economic Communities dataset. PART ONE: Scraping data with rvest in R

Click here for part two on the REC dataset.

Packages we will need:

library(tidyverse)
library(rvest)
library(janitor)
library(ggflags)
library(countrycode)

The Economic Community of West African States (ECOWAS) has been in the news recently. The regional bloc is openly discussing military options in response to the coup unfolding in the capital of Niger.

ECOWAS Defense Chiefs Meet to Discuss Niger Intervention

This made me realise that I know VERY VERY LITTLE about the regional economic communities (REC) in Africa.

Ernest Aniche (2015: 41) argues, “the ghost of [the 1884] Berlin Conference” led to a quasi-balkanisation of African economies into spheres of colonial influence. He argues that this ghost “continues to haunt Africa […] via “neo-colonial ties” today.

To combat this balkanisation and forge a new pan-Africanist approach to the continent’s development, the African Union (AU) has focused on regional integration.

This integration was more concretely codified with 1991’s Abuja Treaty.

One core pillar on this agreement highlights a need for increasing flows of intra-African trade and decreasing the reliance on commodity exports to foreign markets (Songwe, 2019: 97)

Broadly they aim mirror the integration steps of the EU. That translates into a roadmap towards the development of:

Free Trade Areas:

AU : The African Continental Free Trade Area (AfCFTA) aims to create a single market for goods and services across the African continent, with the goal of boosting intra-African trade.
EU : The European Free Trade Association (EFTA) is a free trade area consisting of four European countries (Iceland, Liechtenstein, Norway, and Switzerland) that have agreed to remove barriers to trade among themselves.

Customs Union:

AU: The East African Community (EAC) is an REC a customs union where member states (Burundi, Kenya, Rwanda, South Sudan, Tanzania, and Uganda) have eliminated customs duties and adopted a common external tariff for trade with non-member countries.
EU: The EU’s Single Market is a customs union where goods can move freely without customs duties or other barriers across member states.

Common Market:

AU: Many RECs such the Southern African Development Community (SADC) is working towards establishing a common market that allows for the free movement of goods, services, capital, and labor among its member states.
EU: The EU is a prime example of a common market, where not only goods and services, but also people and capital, can move freely across member countries.

Economic Union:

AU: The West African Economic and Monetary Union (WAEMU) is moving towards an economic union with shared economic policies, a common currency (West African CFA franc), and coordination of monetary and fiscal policies.
EU: Has coordinated economic policies and a single currency (Euro) used by several member states.

The achievement of a political union on the continent is seen as the ultimate objective in many African countries (Hartzenberg, 2011: 2), such as the EU with its EU Parliament, Council, Commission and common foreign policy.

According to a 2012 UNCTAD report, “progress towards regional integration has, to date, been uneven, with some countries integrating better at the regional and/or subregional level and others less so”.

So over this blog, series, we will look at the RECs and see how they are contributing to African integration.

We can use the rvest package to scrape the countries and information from each Wiki page. Click here to read more about the rvest package and web scraping:

Scrape NATO defense expenditure data from Wikipedia with the rvest package in R

First we will look at the Arab Maghreb Union (AMU). We feed the AMU wikipedia page into the read_html() function.

With `[[`(3) we can choose the third table on the Wikipedia page.

With the janitor package, we can can clean the names (such as remove capital letters and awkward spaces) and ensure more uniform variable names.

We pull() the country variable and it becomes a vector, then turn it into a data frame with as.data.frame() . Alternatively we can just select this variable with the select() function.

We create some more variables for each REC group when we merge them all together later.

We will use the str_detect() function from the stringr package to filter out the total AMU column row as it is non-country.

read_html("https://en.wikipedia.org/wiki/Arab_Maghreb_Union") %>% 
  html_table(header = TRUE, fill = TRUE) %>% 
  `[[`(3) %>%  
  clean_names()  %>% 
  pull(country) %>% 
  as.data.frame() %>% 
  mutate(rec = "Arab Meghreb Union",
         geo = "Maghreb", 
         rec_abbrev = "AMU") %>% 
  select(country = '.', everything()) %>% 
  filter(!str_detect(country, fixed("Arab Maghreb", ignore_case = TRUE))) -> amu

We can see in the table on the Wikipedia page, that it contains a row for all the Arab Maghreb Union countries at the end. But we do not want this.

We use the str_detect() function to check if the “country” variable contains the string pattern we feed in. The ignore_case = TRUE makes the pattern matching case-insensitive.

The ! before str_detect() means that we remove this row that matches our string pattern.

We can use the fixed() function to make sure that we match a string as a literal pattern rather than a regular expression pattern / special regex symbols. This is not necessary in this situation, but it is always good to know.

For this example, I will paste the code to make a map of the countries with country flags.

Click here to read more about adding flags to maps in R

Add circular flags to maps and graphs with ggflags package in R

First we download a map object from the rnaturalearth package

world_map <- ne_countries(scale = "medium", returnclass = "sf")

To add the flags on the map, we need longitude and latitude coordinates to feed into the x and y arguments in the geom_flag(). We can scrape these from the web too.

We add iso2 character codes in all lower case (very important for the geom_flag() step)) with the countrycode() function.

Click here to read more about the countrycode package.

Add Correlates of War codes with countrycode package in R

read_html("https://developers.google.com/public-data/docs/canonical/countries_csv") %>%
  html_table(header = TRUE, fill = TRUE) %>% 
  `[[`(1) %>% 
  select(latitude, 
         longitude, 
         iso_a2 = country) %>% 
  right_join(world_map, by = c("iso_a2" = "iso_a2")) %>% 
  left_join(amu, by = c("admin" = "country")) %>% 
  select(latitude, longitude, iso_a2, geometry, admin, region_un, rec, rec_abbrev) %>% 
  mutate(amu_map = ifelse(!is.na(rec), 1, 0),
         iso_a2 = tolower(iso_a2)) %>%
  filter(region_un == "Africa") -> amu_map

Set a consistent theme for the maps.

theme_set(bbplot::bbc_style() +   theme(legend.position = "none",
                                        axis.text.x = element_blank(),
                                        axis.text.y = element_blank(),
                                        axis.title.x = element_blank(),
                                        axis.title.y = element_blank()))

And we can create the map of AMU countries with the following code:

amu_map %>% 
  ggplot() + 
  geom_sf(aes(geometry = geometry,
              fill = as.factor(amu_map), 
              alpha = 0.9),
          position = "identity",
          color = "black") + 
  ggflags::geom_flag(data = . %>% filter(rec_abbrev == "AMU"), 
                     aes(x = longitude,
                         y = latitude + 0.5,
                         country = iso_a2), 
                     size = 6) +
  scale_fill_manual(values = c("#FFFFFF", "#b766b4"))

Next we will look at the Common Market for Eastern and Southern Africa.

If we don’t want to scrape the data from the Wikipedia article, we can feed in the vector of countries – separated by commas – into a data.frame() function.

Then we can separate the vector of countries into rows, and a cell for each country.

data.frame(country = "Djibouti, 
           Eritrea, 
           Ethiopia,
           Somalia,
           Egypt,
           Libya,
           Sudan,
           Tunisia,
           Comoros,
           Madagascar,
           Mauritius,
           Seychelles,
           Burundi,
           Kenya,
           Malawi,
           Rwanda,
           Uganda,
           Eswatini,
           Zambia") %>% 
  separate_rows(country, sep = ",") %>%  
  mutate(country = trimws(country)) %>%
  mutate(rec = "Common Market for Eastern and Southern Africa", 
         geo = "Eastern and Southern Africa",
         rec_abbrev = "COMESA") -> comesa

The separate_rows() function comes from the tidyr package.

We use this to split a single column with comma-separated values into multiple rows, creating a “long” format.

With the mutate(country = trimws(country)) we can remove any spaces with whitespace trimming.

Onto the third REC – the Community of Sahel-Sahara States.

read_html("https://en.wikipedia.org/wiki/Community_of_Sahel%E2%80%93Saharan_States") %>% 
  html_table(header = TRUE, fill = TRUE) %>% 
  `[[`(5) %>%  
  janitor::clean_names()  %>% 
  pull() %>% as.data.frame() %>%
  select(country = '.', everything()) %>% 
  mutate(country = str_replace_all(country, "\\\n", ",")) %>% 
  separate_rows(country, sep = ",") %>% # create a column of words from the one cell vector of words!
  mutate(country = trimws(country)) %>%  # remove the white space
  mutate(rec = "Community of Sahel–Saharan States",
         geo = "Sahel Saharan States", 
         rec_abbrev = "CEN-SAD") -> censad

Fourth, onto the East African Community REC.

read_html("https://en.wikipedia.org/wiki/East_African_Community") %>% 
    html_table(header = TRUE, fill = TRUE) %>% 
    `[[`(2) %>%  
    janitor::clean_names() %>%
    pull(country) %>% 
    as.data.frame() %>% 
    mutate(rec = "East African Community",
           geo = "East Africa", 
           rec_abbrev = "EAC") %>% 
    select(country = '.', everything()) %>% 
  filter(str_trim(country) != "") -> eac

Fifth, we look at ECOWAS

read_html("https://en.wikipedia.org/wiki/Economic_Community_of_West_African_States") %>% 
  html_table(header = TRUE, fill = TRUE) %>% 
  `[[`(2) %>%  
  janitor::clean_names() %>%
  pull(country) %>% 
  as.data.frame() %>% 
  mutate(rec = "Economic Community of West African States",
         geo = "West Africa", 
         rec_abbrev = "ECOWAS") %>%
  select(country = '.', everything()) %>% 
  filter(str_trim(country) != "")  %>% 
  filter(str_trim(country) != "Total")  -> ecowas

Next the Economic Community of Central African States

read_html("https://en.wikipedia.org/wiki/Economic_Community_of_Central_African_States") %>% 
  html_table(header = TRUE, fill = TRUE) %>% 
  `[[`(4) %>%  
  janitor::clean_names() %>% 
  pull(country) %>% 
  as.data.frame() %>% 
    mutate(rec = "Economic Community of Central African States",
           geo = "Central Africa", 
           rec_abbrev = "ECCAS") %>% 
  select(country = '.', everything()) -> eccas

Seventh, we look at the Intergovernmental Authority on Development (IGAD)

data.frame(country = "Djibouti, Ethiopia, Somalia, Eritrea, Sudan, South Sudan, Kenya, Uganda") %>%  
separate_rows(country, sep = ",") %>%  
mutate(rec = "Intergovernmental Authority on Development", 
           geo = "Horn, Nile, Great Lakes",
           rec_abbrev = "IGAD") -> igad

Last, we will scrape Southern African Development Community (SADC)

read_html("https://en.wikipedia.org/wiki/Southern_African_Development_Community") %>% 
  html_table(header = TRUE, fill = TRUE) %>% 
  `[[`(3) %>%  
  janitor::clean_names() %>% 
  pull(country) %>% 
  as.data.frame() %>% 
  select(country = '.', everything()) %>% 
  mutate(country = sub("\\[.*", "", country),
    rec = "Southern African Development Community",
         geo = "Southern Africa", 
         rec_abbrev = "SADC") %>% 
  filter(!str_detect(country, fixed("Country", ignore_case = FALSE))) -> sadc

We can combine them all together with rbind to make a full dataset of all the countries.

  rbind(amu, censad, comesa, eac, eccas, ecowas, igad, sadc) %>% 
  mutate(country = trimws(country)) %>% 
  mutate(rec_abbrev = tolower(rec_abbrev)) -> rec

In the next blog post, we will complete the dataset (most importantly, clean up the country duplicates and make data visualisations / some data analysis with political and economic data!

How to create a Regional Economic Communities dataset. PART TWO: consolidating string variables to dummy variables

References

Hartzenberg, T. (2011). Regional integration in Africa. World Trade Organization Publications: Economic Research and Statistics Division Staff Working Paper (ERSD-2011-14). PDF available

PDF available

UNCTAD (United Nations Conference on Trade and Development) (2021). Economic Development in Africa Report 2021: Reaping the Potential Benefits of the African Continental Free Trade Area for Inclusive Growth. PDF available

Songwe, V. (2019). Intra-African trade: A path to economic diversification and inclusion. Coulibaly, Brahima S, Foresight Africa: Top Priorities for the Continent in, 97-116. PDF available

How to download OECD datasets in R

Packages we will need:

library(OECD)
library(tidyverse)
library(magrittr)
library(janitor)
library(devtools)
library(readxl)
library(countrycode)
library(scales)
library(ggflags)
library(bbplot)

In this blog post, we are going to look at downloading data from the OECD statsitics and data website.

The Organisation for Economic Co-operation and Development (OECD) provides analysis, and policy recommendations for 38 industrialised countries.

Angry Work GIF by Jess - Find & Share on GIPHY

The 38 countries in the OECD are:

Australia
Austria
Belgium
Canada
Chile
Colombia
Czech Republic
Denmark
Estonia
Finland
France
Germany

Hungary
Iceland
Ireland
Israel
Italy
Japan
South Korea
Latvia
Lithuania
Luxembourg
Mexico
Netherland

New Zealand
Norway
Poland
Portugal
Slovakia
Slovenia
Spain
Sweden
Switzerland
Turkey
United Kingdom
United States
European Union

We can download the OCED data package directly from the github repository with install_github()

install_github("expersso/OECD")
library(OECD)

The most comprehensive tutorial for the package comes from this github page. Mostly, it gives a fair bit more information about filtering data

We can look at the all the datasets that we can download from the website via the package with the following get_datasets() function:

titles <- OECD::get_datasets()

This gives us a data.frame with the ID and title for all the OECD datasets we can download into the R console, as we can see below.

In total there are 1662 datasets that we can download.

These datasets all have different variable types, countries, year spans and measurement values. So it is important to check each dataset carefully when we download them.

We can filter key phrases to subset datasets:

 titles %>%  
         filter(grepl("oda", title, ignore.case = TRUE)) %>% View

In this blog, we will graph out the Official Development Financing (ODF) for each country.

Official Development Financing measures the sum of RECEIVED (NOT DONATED) aid such as:

bilateral ODA aid
concessional and non-concessional resources from multilateral sources
bilateral other official flows made available for reasons unrelated to trade

Before we can charge into downloading any dataset, it is best to check out the variables it has. We can do that with the get_data_structure() function:

get_data_structure("REF_TOTAL_ODF") %>% 
       str(., max.level = 2)

 $ VAR_DESC       :'data.frame':	10 obs. of  2 variables:
  ..$ id         : chr [1:10] "RECIPIENT" "PART" "AMOUNTTYPE" "TIME" ...
  ..$ description: chr [1:10] "Recipient" "Part" "Amount type" "Year" ...

 $ RECIPIENT      :'data.frame':	301 obs. of  2 variables:
  ..$ id   : chr [1:301] "10200" "10100" "10010" "71" ...
  ..$ label: chr [1:301] "All Recipients, Total" "Developing Countries, Total" "Europe, Total" "Albania" ...

 $ PART           :'data.frame':	2 obs. of  2 variables:
  ..$ id   : chr [1:2] "1" "2"
  ..$ label: chr [1:2] "1 : Part I - Developing Countries" "2 : Part II - Countries in Transition"

 $ AMOUNTTYPE     :'data.frame':	2 obs. of  2 variables:
  ..$ id   : chr [1:2] "A" "D"
  ..$ label: chr [1:2] "Current Prices" "Constant Prices"

 $ TIME           :'data.frame':	62 obs. of  2 variables:
  ..$ id   : chr [1:62] "1960" "1961" "1962" "1963" ...
  ..$ label: chr [1:62] "1960" "1961" "1962" "1963" ...

We will clean up the ODF dataset with the clean_names() function from janitor package.

aid <- get_dataset("REF_TOTAL_ODF")  %>% 
  janitor::clean_names()  %>%
  select(recipient, aid = obs_value, time)

One problem with this dataset is that we only have the DAC country codes in this dataset.

We will need to read in and merge the country code variables into the aid dataset.

dac_code Download

dac_code <- readxl::read_excel(file.choose())

We can then clean up the DAC codes to merge with the aid data.

dac_code %<>%
    janitor::clean_names()  %>% 
    mutate(cown = countrycode(recipient_name_e, "country.name", "cown")) %>% 
    select(recipient_code,
         year, 
         cown,
         country = recipient_name_e,
         group_id, 
         dev_group = group_name_e,
         p_group = group_name_f,
         wb_group)

And merge with left_join()

aid %<>% 
  mutate(recipient_code = parse_number(recipient)) %>%  
  left_join(dac_code, by = c("recipient_code" = "recipient_code"))

Next we can sum up the aid that each country received since 2000.

aid %>% 
  filter(year > 1999) %>%  
  filter(!is.na(country)) %>% 
  mutate(aid = parse_number(aid)) %>% 
  mutate(country = case_when(country == "Syrian Arab Republic" ~ "Syria", 
                             country == "T?rkiye" ~ "Turkey",
                             country == "China (People's Republic of)" ~ "China",
                             country == "Democratic Republic of the Congo" ~ "DR Congo",
                             TRUE ~ as.character(country))) %>% 
  group_by(country) %>% 
  summarise(total_aid = sum(aid, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(iso2 = tolower(countrycode(country, "country.name", "iso2c"))) %>% 
  filter(total_aid > 150000) %>% 
  ggplot(aes(x = reorder(country, total_aid),
             y = total_aid)) + 
  geom_bar(stat = "identity", 
           width = 0.7, 
           color = "#0a85e5", 
           fill = "#0a85e5") +
  ggflags::geom_flag(aes(x = country, y = -1, country = iso2), size = 8) +
  bbplot::bbc_style() + 
  scale_y_continuous(labels = scales::comma_format()) +
  coord_flip() + 
  labs(x = "ODA received", y = "", title = "Official Development Financing (ODF)", subtitle = "OECD DAC (2000 - 2021)")

The TIME_FORMAT can be any of the following types:

‘P1Y’ for annual
‘P6M’ for bi-annual
‘P3M’ for quarterly
‘P1M’ for monthly data.

To access each countries in the datasets, we can use the following codes

oecd_ios3 <- c("AUS", "AUT", "BEL", "CAN", "CHL", "COL", "CZE",
               "DNK", "EST", "FIN", "FRA", "DEU", "GRC", "HUN",
               "ISL", "IRL", "ISR", "ITA", "JPN", "KOR", "LVA", 
               "LTU", "LUX", "MEX", "NLD", "NZL", "NOR", "POL",
               "PRT", "SVK", "SVN", "ESP", "SWE", "CHE", "TUR",
               "GBR", "USA")

Alternatively, we can use only the EU countries that are in the OECD.

eu_oecd_iso3 <- c("AUT", "BEL", "CZE", "DNK", "EST", "FIN", 
                  "FRA", "DEU", "GRC", "HUN", "IRL", "ITA",
                  "LVA", "LTU", "LUX", "NLD", "POL", "PRT",
                  "SVK", "SVN", "ESP", "SWE")

sal_raw %>% 
  janitor::clean_names() %>% 
  filter(age == "Y25T64") %>% 
  filter(grade == "TE") %>% 
  filter(indicator == "NAT_ACTL_YR") %>% 
  filter(isc11 == "L1") %>% 
  filter(sex == "T") %>% 
  select(country, year = time, obs_value) -> sal

Top R packages for downloading political science and economics datasets

1. WDI

The World Development Indicators (WDI) package by Vincent Arel-Bundock provides access to a database of hundreds of economic development indicators from the World Bank.

Examples of variables include population, GDP, education, health, and poverty, school attendance rates.

Reference: Arel-Bundock, V. (2017). WDI: World Development Indicators (R Package Version 2.7.1).

Download WorldBank data with WDI package in R

2. peacesciencer

This package by Steve Miller helps you download data related to peace and conflict studies, including the Correlates of War project.

Examples of variables include Alliance Treaty Obligations and Provisions (ATOP), Thompson and Dreyer’s (2012) strategic rivalry data, fractionalization/polarization estimates from the Composition of Religious and Ethnic Groups (CREG) Project, and Uppsala Conflict Data Program (UCDP) data on civil and inter-state conflicts.

Data can come in either country-year, event-level or dyadic-level.

Reference: Steve Miller (2020). peacesciencer: Tools for Peace Scientists (R Package Version 0.2.2). Website retrieved at http://svmiller.com/peacesciencer/ms.html

Building a dataset for political science analysis in R, PART 2

3. eurostat

eurostat provides access to a wide range of statistics and data on the European Union and its member states, covering topics such as population, economics, society, and the environment.

Examples of variables include employment, inflation, education, crime, and air pollution. The package was authored by Leo Lahti.

Reference: eurostat (2018). eurostat: Eurostat Open Data (R Package Version 3.6.0), CRAN PDF retrieved at https://cran.r-project.org/web/packages/eurostat/eurostat.pdf

How to download EU data with Eurostat package in R: Part 1 (with pyramid graphs)

4. vdemdata

The Varieties of Democracy package by Staffan I. Lindberg et al. provides data on a range of indicators related to democracy and governance in countries around the world, including measures of electoral democracy, civil liberties, and human rights.

Click here to read more about downloading the package

How to download and animate the Varieties of Democracy (V-DEM) dataset in R

Examples of variables include freedom of speech, rule of law, corruption, government transparency, and voter turnout.

Reference: Lindberg, S. I., & Stepanova, N. (2020). vdem: Varieties of Democracy Project (R Package Version 1.6).

5. democracyData

This package by Xavier Marquez: provides data on a range of variables related to democracy, including elections, political parties, and civil liberties.

Examples of variables include regime type, democracy scores (Freedom House, PolityIV etc, Geddes, Wright, and Frantz’ autocratic regimes dataset, the Lexical Index of Electoral Democracy, the DD/ACLP/PACL/CGV dataset), axxording to the Github page

Download democracy data with democracyData package in R

6. icpsrdata

This package by Frederick Solt provides a simple way to download and import data from the Inter-university Consortium for Political and Social Research (ICPSR) archive into R. This is for easy replication and sharing of data. The package includes datasets from different fields of study, including sociology, political science, and economics.

Reference: Solt, F. (2020). icpsrdata: Reproducible Data Retrieval from the ICPSR Archive (R Package Version 0.5.0).

7. Quandl

This R package by Quandl provides an interface to access financial and economic data from over 20 different sources. Examples of variables include stock prices, futures, options, and macroeconomic indicators. The package includes functions to easily download data directly into R and perform tasks such as plotting, transforming, and aggregating data. Additional functions for managing and exploring data, such as search tools and data caching features, are also available.

Here are five examples of variables in the Quandl package:

“AAPL” (Apple Inc. stock price)
“CHRIS/CME_CL1” (Crude Oil Futures)
“FRED/GDP” (US GDP)
“BCHAIN/MKPRU” (Bitcoin Market Price)
“USTREASURY/YIELD” (US Treasury Yield Curve Rates)

Reference: Quandl. (2021). Quandl: A library of economic and financial data. Retrieved from https://www.quandl.com/tools/r.

Download economic and financial time series data with Quandl package in R

8. essurvey

The essurvey package is an R package that provides access to data from the European Social Survey (ESS), which is a large-scale survey that collects data on attitudes, values, and behavior across Europe. The package includes functions to easily download, read, and analyze data from the ESS, and also includes documentation and sample code to help users get started.

Examples of variables in the ESS dataset include political interest, trust in political institutions, social class, education level, and income. The package was authored by David Winter and includes a variety of useful functions for working with ESS data.

Reference: Winter, D. (2021). essurvey: Download Data from the European Social Survey on the Fly. R package version 3.4.4. Retrieved from https://cran.r-project.org/package=essurvey.

Download European Social Survey data with essurvey package in R

9. manifestoR

manifestoR is an R package that provides access to data from the Comparative Manifesto Project (CMP), which is a cross-national research project that analyzes political party manifestos. The package allows users to easily download and analyze data from the CMP, including party positions on various policy issues and the salience of those issues across time and space.

Examples of variables in the CMP dataset include party positions on taxation, immigration, the environment, healthcare, and education. The package was authored by Jörg Matthes, Marcelo Jenny, and Carsten Schwemmer.

Reference: Matthes, J., Jenny, M., & Schwemmer, C. (2018). manifestoR: Access and Process Data and Documents of the Manifesto Project. R package version 1.2.1. Retrieved from https://cran.r-project.org/package=manifestoR.

Graph political party manifestos on ideological spectrum with manifestoR package in R

10. unvotes

The unvotes data package provides historical voting data of the United Nations General Assembly, including votes for each country in each roll call, as well as descriptions and topic classifications for each vote.

The classifications included in the dataset cover a wide range of issues, including human rights, disarmament, decolonization, and Middle East-related issues.

Reference: The package was created by David Robinson and Nicholas Goguen-Compagnoni and is available on the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/unvotes/unvotes.pdf.

Download and graph UN votes data with the unvotes package in R

11. gravity

The gravity package in R, created by Anna-Lena Woelwer, provides a set of functions for estimating gravity models, which are used to analyze bilateral trade flows between countries. The package includes the gravity_data dataset, which contains information on trade flows between pairs of countries.

Examples of variables that may affect trade in the dataset are GDP, distance, and the presence of regional trade agreements, contiguity, common official language, and common currency.

iso_o: ISO-Code of country of origin
iso_d: ISO-Code of country of destination
distw: weighted distance
gdp_o: GDP of country of origin
gdp_d: GDP of country of destination
rta: regional trade agreement
flow: trade flow
contig: contiguity
comlang_off: common official language
comcur: common currency

The package PDF CRAN is available at http://cran.nexr.com/web/packages/gravity/gravity.pdf

Create infographics with the Irish leader dataset in R and Canva

Click here to download the Irish leader datatset. This file details information on all Taoisigh since 1922.

full_taois Download Taoiseach dataset here

Source: Wikipedia

Tentative Codebook

Variable Name	Variable Description
no	Taoiseach number
name	Name
party	Political party
constituency	Electoral constituency
born	Date of birth
died	Date of death
first_elected	Date first entered the Dail
entered_office	Date entering office of Taoiseach
left_office	Date leaving office of Taoiseach
left_dail	Date left the Dail
cum_days	Total number of days in Dail
cum_years	Total number of years in Dail
second_level	Secondary school Taoiseach attended
third_level	University Taoiseach attended
period	Number of times the person was Taoiseach
before_after_taoiseach	Title of cabinet positions held by the Taoiseach when he was not holding office of Taoiseach
while_taoiseach	Title of cabinet positions held by the Taoiseach when he was in office as Taoiseach
no_pos_before_after	Number of cabinet positions the man held when he was not holding office of Taoiseach
no_pos_dur	Number of cabinet positions the man held when he was Taoiseach
county_born	The county the Taoiseach was born in
age	Age of Taoiseach
age_enter	Age the man entered office of Taoiseach
gender	Gender

Packages we will need:

library(tidyverse)
library(ggthemes)
library(readr)
library(sf)
library(tmap)

With the dataset, we can add map data and plot the 26 counties of Ireland.

If you follow this link below, you can download county map data from the following website by Chris Brundson

https://rpubs.com/chrisbrunsdon/part1

Thank you to Chris for the tutorial and data access!

Read in the simple features data with the st_read() from the sf package.

setwd("C:/Users/my_name/Desktop")

county_geom <- sf::st_read("counties.json") %>% 
   clean_names() %>% 
   mutate(county = stringr::str_to_title(county))

Next we count the number of counties that have given Ireland a Taoiseach with the group_by() and count() functions.

One Taoiseach, Eamon DeValera, was born in New York City, so he will not be counted in the graph.

Sorry Dev.

Sad Friends Tv GIF - Find & Share on GIPHY

We can join the Taoisech dataset to the county_geom dataframe by the county variable. The geometric data has the counties in capital letters, so we convert tolower() letters.

Add the geometry variable in the main ggplot() function.

We can play around with the themes arguments and add the theme_map() from the ggthemes package to get the look you want.

I added a few hex colors to indicate the different number of countries.

If you want a transparent background, we save it with the ggsave() function and set the bg argument to “transparent”

full_taois %>% 
  select(county = county_born, everything()) %>% 
  distinct(name, .keep_all = TRUE) %>% 
  group_by(county) %>% 
  count() %>% 
  ungroup() %>% 
  right_join(county_geom, by = c("county" = "county")) %>%
  replace(is.na(.), 0) %>% 
  ggplot(aes(geometry = geometry, fill = factor(n))) +  
  geom_sf(linewidth = 1, color = "white") +
  ggthemes::theme_map() + 
  theme(panel.background = element_rect(fill = 'transparent'),  
    legend.title = element_blank(),
    legend.text = element_text(size = 20) )  + 
scale_fill_manual(values = c("#8d99ae", "#a8dadc", "#457b9d", "#e63946", "#1d3557")) 

ggsave('county_map.png', county_map, bg = 'transparent')

Counties that have given us Taoisigh

Source: Wikipedia

Next we can graph the ages of the Taoiseach when they first entered office. With the reorder() function, we can compare how old they were.

full_taois %>%
  mutate(party = case_when(party == "Cumann na nGaedheal" ~ "CnG",
                           TRUE ~ as.character(party))) %>% 
  distinct(name, .keep_all = TRUE) %>% 
  mutate(age_enter = round(age_enter, digits = 0)) %>% 
  ggplot(aes(x = reorder(name, age_enter),
             y = age_enter,
             fill = party)) + 
  geom_bar(stat = "identity") +
  coord_flip() + 
  scale_fill_manual(values = c( "#8e2420","#66bb66","#6699ff")) + 
  theme(text = element_text(size = 40),
    axis.title.x = element_blank(), 
    axis.title.y = element_blank(), 
    panel.background = element_rect(fill = 'transparent'),  
    plot.background = element_rect(fill = 'transparent', color = NA), 
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    legend.background = element_rect(fill = 'transparent'), #transparent legend bg
    legend.key.size = unit(2, 'cm'),
    legend.key.height = unit(2, 'cm'),
    legend.key.width = unit(2, 'cm'), 
    legend.title = element_blank(),
    legend.text = element_text(size = 20) ) 

ggsave('age_chart.png', age_chart, bg = 'transparent')

Ages of the Taoiseach entering office for the first time

Source: Wikipedia

We can calculate to see which party has held the office of Taoiseach the longest with a special, but slightly mad-looking pie chart

Click here to learn more about creating these plots.

Alternatives to pie charts: coxcomb and waffle charts

full_taois %>% 
  distinct(name, .keep_all = TRUE) %>% 
  group_by(party) %>% 
  summarise(total_cum = sum(cum_days)) %>% 
    ggplot(aes(reorder(total_cum, party), total_cum, fill = as.factor(party))) + 
  geom_bar(stat = "identity") + 
  coord_polar("x", start = 0, direction = - 1)  + 
  scale_fill_manual(values = c( "#8e2420","#66bb66","#6699ff")) + 
  ggthemes::theme_map()

Number of years each party held the office of Taoiseach

Source: Wikipedia

Fianna Fail has held the office over twice as long as Fine Fail and much more than the one term of W Cosgrove (the only CnG Taoiseach)

Last we can create an icon waffle plots. We can use little man icons to create a waffle plot of all the men (only men) in the office, colored by political party.

I got the code and tutorial for making these waffle plots from the following website:

https://www.listendata.com/2019/06/create-infographics-with-r.html

It was very helpful in walking step by step through how to download the FontAwesome icons into the correct font folder on the PC. I had a heap of issues with the wrong versions of the htmltools.

remotes::install_github("JohnCoene/echarts4r")

remotes::install_github("hrbrmstr/waffle")

devtools::install_github("JohnCoene/echarts4r.assets")

remotes::install_github("hrbrmstr/waffle")

library(echarts4r)
library(extrafont)
library(showtext)
library(magrittr)
library(echarts4r.assets)
library(htmltools)
library(waffle)

extrafont::font_import(path = "C:/Users/my_name/Desktop",  pattern = "fa-", prompt =  FALSE)

extrafont::loadfonts(device="win")

font_add(family = "FontAwesome5Free-Solid", regular = "C:/Users/my_name/Desktop/fa-solid-900.ttf")
font_add(family = "FontAwesome5Free-Regular", regular = "C:/Users/my_name/Desktop/fa-regular-400.ttf")
font_add(family = "FontAwesome5Brands-Regular", regular = "C:/Users/my_name/Desktop/fa-brands-400.ttf")

showtext_auto()

Next we will find out the number of Taoisigh from each party:

And we fill a vector of values into the waffle() function. We can play around with the number of rows. Three seems like a nice fit for the number of icons (glyphs).

Also, we choose the type of glyph image we want with the the use_glyph() argument.

The options are the glyphs that come with the Font Awesome package we downloaded with extrafonts.

waffle(
  c( Cumann na nGaedheal = 1      ` = 1,
      `Fianna Fail = 8    ` = 8, 
      `Fine Gael = 6    ` = 6), 
  rows = 3, 
  colors = c("#8e2420", "#66bb66",  "#6699ff"),
  use_glyph = "male", 
  glyph_size = 25, 
  legend_pos = "bottom")

Click below to download the infographic that was edited and altered with Canva.com.

Jimmy Fallon Dancing GIF by The Tonight Show Starring Jimmy Fallon - Find & Share on GIPHY

taoiseach_infographic Download infographic here

Create a dataset of Irish parliament members

library(rvest)
library(tidyverse)
library(toOrdinal)
library(magrittr)
library(genderizeR)
library(stringi)

This blogpost will walk through how to scrape and clean up data for all the members of parliament in Ireland.

Or we call them in Irish, TDs (or Teachtaí Dála) of the Dáil.

We will start by scraping the Wikipedia pages with all the tables. These tables have information about the name, party and constituency of each TD.

On Wikipedia, these datasets are on different webpages.

This is a pain.

However, we can get around this by creating a list of strings for each number in ordinal form – from1st to 33rd. (because there have been 33 Dáil sessions as of January 2023)

We don’t need to write them all out manually: “1st”, “2nd”, “3rd” … etc.

Instead, we can do this with the toOrdinal() function from the package of the same name.

dail_sessions <- sapply(1:33,toOrdinal)

Next we can feed this vector of strings with the beginning of the HTML web address for Wikipedia as a string.

We paste the HTML string and the ordinal number strings together with the stri_paste() function from the stringi package.

This iterates over the length of the dail_sessions vector (in this case a length of 33) and creates a vector of each Wikipedia page URL.

dail_wikipages <- stri_paste("https://en.wikipedia.org/wiki/Members_of_the_",
           dail_sessions, "_D%C3%A1il")

Now, we can take the most recent Dáil session Wikipedia page and take the fifth table on the webpage using `[[`(5)

We rename the column names with select().

And the last two mutate() lines reomve the footnote numbers in ( ) [ ] brackets from the party and name variables.

dail_wikipages[33] %>%  
  read_html() %>%
  html_table(header = TRUE, fill = TRUE) %>% 
  `[[`(5) %>% 
  rename("ble" = 1, "party" = 2, "name" = 3, "constituency" = 4) %>% 
  select(-ble) %>% 
  mutate(party = gsub(r"{\s*\([^\)]+\)}","",as.character(party))) %>% 
  mutate(name = sub("\\[.*", "", name)) -> dail_33

Last we delete the first row. That just contais a duplicate of the variable names.

dail_33 <- dail_33[-1,]

We want to delete the fadas (long accents on Irish words). We can do this across all the character variables with the across() function.

The stri_trans_general() converts all strings to LATIN ASCII, which turns string to contain only the letters in the English language alphabet.

dail_33 %<>% 
  mutate(across(where(is.character), ~ stri_trans_general(., id = "Latin-ASCII")))

We can also separate the first name from the second names of all the TDs and create two variables with mutate() and separate()

dail_33 %<>% 
  mutate(name = str_replace(name, "\\s", "|")) %>% 
  separate(name, into = c("first_name", "last_name"), sep = "\\|")

With the first_name variable, we can use the new pacakge by Kalimu. This guesses the gender of the name. Later, we can track the number of women have been voted into the Dail over the years.

Of course, this will not be CLOSE to 100% correct … so later we will have to check each person manually and make sure they are accurate.

devtools::install_github("kalimu/genderizeR")

gender = findGivenNames(dail_33$name, progress = TRUE)

gender %>% 
  select(probability, gender)  -> gen_variable

gen_variable %<>% 
  select(name, gender) %>% 
  mutate(name = str_to_sentence(name))

dail_33 %<>% 
  left_join(gen_variable, by = "name")

Create date variables and decade variables that we can play around with.

dail_df$date_2 <- as.Date(dail_df$date, "%Y-%m-%d")

dail_df$year <- format(dail_df$date_2, "%Y")

dail_df$month <- format(dail_df$date_2, "%b")

dail_df %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s"))

In the next blog, we will graph out the various images to explore these data in more depth. For example, we can make a circle plot with the composition of the current Dail with the ggparliament package.

We can go into more depth with it in the next blog… Stay tuned.

How to recreate Pew opinion graphs with ggplot2 in R

Packages we will need

library(HH)
library(tidyverse)
library(bbplot)
library(haven)

In this blog post, we are going to recreate Pew Opinion poll graphs.

This is the plot we will try to recreate on gun control opinions of Americans:

To do this, we will download the data from the Pew website by following the link below:

Amid a Series of Mass Shootings in the U.S., Gun Policy Remains Deeply Divisive

atp <- read.csv(file.choose())

We then select the variables related to gun control opinions

atp %>% 
  select(GUNPRIORITY1_b_W87:GUNPRIORITY2_j_W87) -> gun_df

I want to rename the variables so I don’t forget what they are.

Then, we convert them all to factor variables because haven labelled class variables are sometimes difficult to wrangle…

gun_df %<>%
  select(mental_ill = GUNPRIORITY1_b_W87,
         assault_rifle = GUNPRIORITY1_c_W87, 
         gun_database = GUNPRIORITY1_d_W87,
         high_cap_mag = GUNPRIORITY1_e_W87,
         gunshow_bkgd_check = GUNPRIORITY1_f_W87,
         conceal_gun =GUNPRIORITY2_g_W87,
         conceal_gun_no_permit = GUNPRIORITY2_h_W87,
         teacher_gun = GUNPRIORITY2_i_W87,
         shorter_waiting = GUNPRIORITY2_j_W87) %>% 
  mutate(across(everything()), haven::as_factor(.))

Also we can convert the “Refused” to answer variables to NA if we want, so it’s easier to filter out.

gun_df %<>% 
  mutate(across(where(is.factor), ~na_if(., "Refused")))

Next we will pivot the variables to long format. The new names variable will be survey_question and the responses (Strongly agree, Somewhat agree etc) will go to the new response variable!

gun_df %>% 
  pivot_longer(everything(), names_to = "survey_question", values_to = "response") -> gun_long

And next we calculate counts and frequencies for each variable

gun_long %<>% 
  group_by(survey_question, response) %>% 
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>% 
  ungroup()

Then we want to reorder the levels of the factors so that they are in the same order as the original Pew graph.

gun_long %>% 
  mutate(survey_question = as.factor(survey_question))   %>% 
   mutate(survey_question_reorder = factor(survey_question, 
          levels =  c( 
           "conceal_gun_no_permit",
           "shorter_waiting",
           "teacher_gun",
           "conceal_gun",
           "assault_rifle",
           "high_cap_mag",
           "gun_database",
           "gunshow_bkgd_check",
           "mental_ill"
           ))) -> gun_reordered

And we use the hex colours from the original graph … very brown… I used this hex color picker website to find the right hex numbers: https://imagecolorpicker.com/en

brown_palette <- c("Strongly oppose" = "#8c834b",
                   "Somewhat oppose" = "#beb88f",
                   "Somewhat favor" = "#dfc86c",
                   "Strongly favor" = "#caa31e")

And last, we use the geom_bar() – with position = "stack" and stat = "identity" arguments – to create the bar chart.

To add the numbers, write geom_text() function with label = frequency within aes() and then position = position_stack() with hjust and vjust to make sure you’re happy with where the numbers are

gun_reordered %>% 
  filter(!is.na(response)) %>% 
  mutate(frequency = round(freq * 100), 0) %>% 
  ggplot(aes(x = survey_question_reorder, 
             y = frequency, fill = response)) +
  geom_bar(position = "stack",
           stat = "identity") + 
  coord_flip() + 
  scale_fill_manual(values = brown_palette) +
  geom_text(aes(label = frequency), size = 10, 
            color = "black", 
            position = position_stack(vjust = 0.5)) +
  bbplot::bbc_style() + 
  labs(title = "Broad support for barring people with mental illnesses 
       \n from obtaining guns, expanded background checks",
       subtitle = "% who", 
       caption = "Note: No answer resposes not shown.\n Source: Survey of U.S. adults conducted April 5-11 2021.") + 
  scale_x_discrete(labels = c(
    "Allowing people to carry conealed \n guns without a person",
    "Shortening waiting periods for people \n who want to buy guns leagally",
    "Allowing reachers and school officials \n to carry guns in K-12 school",
    "Allowing people to carry \n concealed guns in more places",
    "Banning assault-style weapons",
    "Banning high capacity ammunition \n magazines that hold more than 10 rounds",
    "Creating a federal government \n database to track all gun sales",
    "Making private gun sales \n subject to background check",
    "Preventing people with mental \n illnesses from purchasing guns"
    ))

Stephen Colbert Waiting GIF - Find & Share on GIPHY

Unfortunately this does not have diverving stacks from the middle of the graph

We can make a diverging stacked bar chart using function likert() from the HH package.

For this we want to turn the dataset back to wider with a column for each of the responses (strongly agree, somewhat agree etc) and find the frequency of each response for each of the questions on different gun control measures.

Then with the likert() function, we take the survey question variable and with the ~tilda~ make it the product of each response. Because they are the every other variable in the dataset we can use the shorthand of the period / fullstop.

We use positive.order = TRUE because we want them in a nice descending order to response, not in alphabetical order or something like that

gun_reordered %<>%
    filter(!is.na(response)) %>%  
  select(survey_question, response, freq) %>%  
  pivot_wider(names_from = response, values_from = freq ) %>%
  ungroup() %>% 
  HH::likert(survey_question ~., positive.order = TRUE,
            main =  "Broad support for barring people with mental illnesses
            \n from obtaining guns, expanded background checks")

With this function, it is difficult to customise … but it is very quick to make a diverging stacked bar chart.

Angry Stephen Colbert GIF by The Late Show With Stephen Colbert - Find & Share on GIPHY

If we return to ggplot2, which is more easy to customise … I found a solution on Stack Overflow! Thanks to this answer! The solution is to put two categories on one side of the centre point and two categories on the other!

gun_reordered %>% 
filter(!is.na(response)) %>% 
  mutate(frequency = round(freq * 100), 0) -> gun_final

And graph out

ggplot(data = gun_final, aes(x = survey_question_reorder, 
            fill = response)) +
  geom_bar(data = subset(gun_final, response %in% c("Strongly favor",
           "Somewhat favor")),
           aes(y = -frequency), position="stack", stat="identity") +
  geom_bar(data = subset(gun_final, !response %in% c("Strongly favor",
            "Somewhat favor")), 
           aes(y = frequency), position="stack", stat="identity") +
  coord_flip() + 
  scale_fill_manual(values = brown_palette) +
  geom_text(data = gun_final, aes(y = frequency, label = frequency), size = 10, color = "black", position = position_stack(vjust = 0.5)) +
  bbplot::bbc_style() + 
  labs(title = "Broad support for barring people with mental illnesses 
       \n from obtaining guns, expanded background checks",
       subtitle = "% who", 
       caption = "Note: No answer resposes not shown.\n Source: Survey of U.S. adults conducted April 5-11 2021.") + 
  scale_x_discrete(labels = c(
    "Allowing people to carry conealed \n guns without a person",
    "Shortening waiting periods for people \n who want to buy guns leagally",
    "Allowing reachers and school officials \n to carry guns in K-12 school",
    "Allowing people to carry \n concealed guns in more places",
    "Banning assault-style weapons",
    "Banning high capacity ammunition \n magazines that hold more than 10 rounds",
    "Creating a federal government \n database to track all gun sales",
    "Making private gun sales \n subject to background check",
    "Preventing people with mental \n illnesses from purchasing guns"
  ))

High Five Stephen Colbert GIF - Find & Share on GIPHY

Next to complete in PART 2 of this graph, I need to figure out how to add lines to graphs and add the frequency in the correct place

How to tidy up messy Wikipedia data with dplyr in R

Packages we will need:

library(rvest)
library(magrittr)
library(tidyverse)
library(waffle)
library(wesanderson)
library(ggthemes)
library(countrycode)
library(forcats)
library(stringr)
library(tidyr)
library(janitor)
library(knitr)

To see another blog post that focuses on cleaning messy strings and dates, click here to read

Wrangling and graphing UN Secretaries-General data with R

We are going to look at Irish embassies and missions around the world. Where are the embassies, and which country has the most missions (including embassies, consulates and representational offices)?

Let’s first scrape the embassy data from the Wikipedia page. Here is how it looks on the webpage.

It is a bit confusing because Ireland does not have a mission in every country. Argentina, for example, is the embassy for Bolivia, Paraguay and Uruguay.

Also, there are some consulates-general and other mission types.

Some countries have more than one mission, such as UK, Canada, US etc. So we are going to try and clean up this data.

Click here to read more about scraping data with the rvest package

Scrape NATO defense expenditure data from Wikipedia with the rvest package in R

embassies_html <- read_html("https://en.wikipedia.org/wiki/List_of_diplomatic_missions_of_Ireland")

embassies_tables <- embassies_html %>% html_table(header = TRUE, fill = TRUE)

We will extract the data from the different continent tables and then bind them all together at the end.

africa_emb <- embassies_tables[[1]]

africa_emb %<>% 
  mutate(continent = "Africa")

americas_emb <- embassies_tables[[2]]

americas_emb %<>% 
  mutate(continent = "Americas")

asia_emb <- embassies_tables[[3]]

asia_emb %<>% 
  mutate(continent = "Asia")

europe_emb <- embassies_tables[[4]]

europe_emb %<>% 
  mutate(continent = "Europe")

oceania_emb <- embassies_tables[[5]]

oceania_emb %<>% 
  mutate(continent = "Oceania")

Last, we bind all the tables together by rows, with rbind()

ire_emb <- rbind(africa_emb, 
                 americas_emb,
                 asia_emb,
                 europe_emb,
                 oceania_emb)

And clean up the names with the janitor package

ire_emb %<>% 
  janitor::clean_names()

There is a small typo with a hypen and so there are separate Consulate General and Consulate-General… so we will clean that up to make one single factor level.

ire_emb %<>% 
  mutate(mission = ifelse(mission == "Consulate General", "Consulate-General", mission))

We can count out how many of each type of mission there are

ire_emb %>% 
  group_by(mission) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  knitr::kable(format = "html")

mission	n
Embassy	69
Consulate-General	17
Liaison office	1
Representative office	1

A quick waffle plot

ire_emb %>% 
  group_by(mission) %>%
  count() %>% 
  arrange(desc(n)) %>% 
  ungroup() %>% 
  ggplot(aes(fill = mission, values = n)) +
  geom_waffle(color = "white", size = 1.5, 
              n_rows = 20, flip = TRUE) + 
  bbplot::bbc_style() +
  scale_fill_manual(values= wes_palette("Darjeeling1", n = 4))

We can remove the notes in brackets with the sub() function.

Square brackets equire a regex code \\[.*

ire_emb %<>% 
  select(!ref) %>%
  mutate(host_country = sub("\\[.*", "", host_country))

We delete the subheadings from the concurrent_accreditation column with the str_remove() function from the stringr package

ire_emb %<>%
  mutate(concurrent_accreditation = stringr::str_remove(concurrent_accreditation, "International Organizations:\n")) %>% 
  mutate(concurrent_accreditation = stringr::str_remove(concurrent_accreditation, "Countries:\n"))

After that, we will tackle the columns with many countries. The many variables in one cell violates the principles of tidy data.

Tonight Show Help GIF by The Tonight Show Starring Jimmy Fallon - Find & Share on GIPHY

For example, we saw above that Argentina is the embassy for three other countries.

We will use the separate() function from the tidyr package to make a column for each country that shares an embassy with the host country.

This separate() function has six arguments:

First we indicate the column with will separate out with the col argument

Next with into, we write the new names of the columns we will create. Nigeria has the most countries for which it is accredited to be the designated embassy with nine. So I create nine accredited countries columns to accommodate this max number.

The point I want to cut up the original column is at the \n which is regex for a large space

I don’t want to remove the original column so I set remove to FALSE

ire_emb %<>%
  separate(
    col = "concurrent_accreditation",
    into = c("acc_1", "acc_2", "acc_3", "acc_4", "acc_5", "acc_6", "acc_7", "acc_8", "acc_9"),
    sep = "\n",
    remove = FALSE,
    extra = "warn",
    fill = "warn") %>% 
  mutate(across(where(is.character), str_trim))

Some countries have more than one type of mission, so I want to count each type of mission for each country and create a new variable with the distinct() and pivot_wider() functions

Click here to read more about turning long to wide format data

With the across() function we can replace all numeric variables with NA to zeros

Click here to read more about the across() function

across() function appreciation

ire_emb %>% 
  group_by(host_country, mission) %>% 
  mutate(number_missions = n())  %>% 
  distinct(host_country, mission, .keep_all = TRUE) %>% 
  ungroup() %>% 
  pivot_wider(!c(host_city, concurrent_accreditation:count_accreditation), 
              names_from = mission, 
              values_from = number_missions) %>% 
  janitor::clean_names() %>% 
  mutate(across(where(is.numeric), ~ replace_na(., 0))) %>% 
  select(!host_country) -> ire_wide

Before we bind the two datasets together, we need to only have one row for each country.

ire_emb %>% 
  distinct(host_country, .keep_all = TRUE) -> ire_dist

And bind them together:

ire_full <- cbind(ire_dist, ire_wide)

Excited Aubrey Plaza GIF by Film Independent Spirit Awards - Find & Share on GIPHY

We can graph out where the embassies are with the geom_polygon() in ggplot

First we download the map data from dplyr and add correlates of war codes so we can easily join the datasets together with right_join()

First, we add correlates of war codes

Click here to read more about the countrycode package

Add Correlates of War codes with countrycode package in R

ire_full %<>%
    mutate(cown = countrycode(host_country, "country.name", "cown"))

world_map <- map_data("world")

world_map %<>% 
  mutate(cown = countrycode::countrycode(region, "country.name", "cown"))

I reorder the variables with the fct_relevel() function from the forcats package. This is just so they can better match the color palette from wesanderson package. Green means embassy, red for no mission and orange for representative office.

Make Wes Anderson themed graphs with wesanderson package in R

ire_full %>%
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>% 
  mutate(mission = ifelse(is.na(mission), replace_na("No Mission"), mission)) %>% 
  mutate(mission = forcats::fct_relevel(mission,c("No Mission", "Embassy","Representative office"))) %>%
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = mission), color = "white", size = 0.5)  -> ire_map

And we can change how the map looks with the ggthemes package and colors from wesanderson package

  ire_map + ggthemes::theme_map() +
  theme(legend.key.size = unit(3, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
  scale_fill_manual(values = wes_palette("Darjeeling1", n = 4))

Irish missions.png Download

And we can count how many missions there are in each country

US has the hightest number with 8 offices, followed by UK with 4 and China with 3

ire_full %>%
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>% 
  mutate(sum_missions = rowSums(across(embassy:representative_office))) %>% 
  mutate(sum_missions = replace_na(sum_missions, 0)) %>%  
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = as.factor(sum_missions)), color = "white", size = 0.5)  +
  ggthemes::theme_map() +
  theme(legend.key.size = unit(3, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
scale_fill_brewer(palette = "RdBu") + 
  ggtitle("Number of Irish missions in each country",
          subtitle = "Source: Wikipedia")

Last we can count the number of accredited countries that each embassy has. Nigeria has the most, in charge of 10 other countries across northern and central Africa.

ire_full %>% 
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>%
  mutate(count_accreditation = str_count(concurrent_accreditation, pattern = "\n")) %>% 
  mutate(count_accreditation = replace_na(count_accreditation, -1)) %>%  
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = as.factor(count_accreditation)), color = "white", size = 0.5)  +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.key.size = unit(1, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
  ggtitle("Number of Irish missions in extra accreditations",
          subtitle = "Source: Wikipedia")

Happy Maya Rudolph GIF - Find & Share on GIPHY

Running tidy t-tests with the infer package in R

Packages we will need:

library(tidyverse)
library(tidyr)
library(infer)
library(bbplot)
library(ggthemes)

For this t-test, we will compare US millenials and non-millenials and their views of the UK’s influence in the world.

The data will come from Chicago Council Survey of American Public Opinion on U.S. Foreign Policy

Click here to download 2017 policy survey data

The survey investigates American public opinion on foreign policy. It focuses on respondents’ opinions of the United States’ leadership role in the world and the challenges the country faces domestically and internationally.

The question on the UK’s influence asks how much influence you think the UK has in the world. Please answer on a 0 to 10 scale; with 0 meaning they are not at all influential and 10 meaning they are extremely influential.

First we select and recreate the variables

fp %>%
  select(
    milennial = XMILLENIALSSAMPLEFLAG,
    uk_influence = Q50_10) %>%
  separate(
    col = milennial,
    into = c("milennial_num", "milennial_char"),
    sep = '[)]',
    remove = TRUE) %>% 
  mutate(
     uk_influence = as.character(uk_influence),
     uk_influence = parse_number(uk_influence)) %>% 
  filter(uk_influence != -1) %>% 
  tidyr::drop_na(milennial_char) -> mil_fp

With the infer package, we can run a t-test:

mil_fp %>% 
  t_test(formula = uk_influence ~ milennial_char,
         alternative = "less")%>% 
  kable(format = "html")

statistic	t_df	p_value	alternative	estimate	lower_ci	upper_ci
-3.048249	1329.469	0.0011736	less	-0.3274509	-Inf	-0.1506332

There is a statistically significant difference between milennials and non-milennials.

We can graph a box plot.

mil_fp %>% 
  ggplot(mapping = aes(x = milennial_char,
                       y = uk_influence,
                       fill = milennial_char)) +
  geom_jitter(aes(color = milennial_char),
              size = 2, alpha = 0.5, width = 0.3) +
  geom_boxplot(alpha = 0.4) +
  coord_flip() + bbplot::bbc_style() +
  scale_fill_manual(values = my_palette) + 
  scale_color_manual(values = my_palette)

And a quick graph to compare UK with other countries: Germany and South Korea

mil_fp %>% 
  select(milennial_char, uk_influence, sk_influence, ger_influence) %>% 
  pivot_longer(!milennial_char, names_to = "survey_question", values_to = "response")  %>% 
  group_by(survey_question, response) %>% 
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>% 
  ungroup() %>% 
  filter(!is.na(response)) %>% 
  mutate(survey_question = case_when(survey_question == "uk_influence" ~ "UK",
survey_question == "ger_influence" ~ "Germany",
survey_question == "sk_influence" ~ "South Korea",
TRUE ~ as.character(survey_question))) %>% 
  ggplot() +
  geom_bar(aes(x = forcats::fct_reorder(survey_question, freq), 
               y = freq, fill = as.factor(response)), 
           color = "#e5e5e5", 
           size = 2, 
           position = "stack",
           stat = "identity") + 
  coord_flip() + 
  scale_fill_brewer(palette = "RdBu") + 
  ggthemes::theme_fivethirtyeight() + 
  ggtitle("View of Influence in the world?") +
  theme(legend.title = element_blank(),
        legend.position = "top",
        legend.key.size = unit(0.78, "cm"),
        text = element_text(size = 25),
        legend.text = element_text(size = 20))

Check model assumptions with easystats package in R

Packages we will need:

install.packages("easystats", repos = "https://easystats.r-universe.dev")
library(easystats)
easystats::install_suggested()

Easystats is a collection of R packages, which aims to provide a framework to tame the scary R statistics and their pesky models, according to their github repo.

Click here to browse the github and here to go to the specific perfomance package CRAN PDF

performance Download

First run your regression. I will try to explain variance is Civil Society Organization participation (CSOs) with the independent variables in my model with Varieties of Democracy data in 1990.

cso_model <- lm(cso_part ~ education_level + mortality_rate + democracy,data = vdem_90)


	Dependent variable:

	cso_part

education_level	-0.017^**
	(0.007)

mortality_rate	-0.00001
	(0.00004)

democracy	0.913^***
	(0.064)

Constant	0.288^***
	(0.054)


Observations	134
R²	0.690
Adjusted R²	0.682
Residual Std. Error	0.154 (df = 130)
F Statistic	96.243^*** (df = 3; 130)

Note:	^p<0.1; ^p<0.05; ^**p<0.01

Excited Season 2 GIF by The Office - Find & Share on GIPHY

Then we check the assumptions:

performance::check_model(cso_model)

Comparing North and South Korean UN votes at the General Assembly with unvotes package

Packages we will use

Llibrary(unvotes)
library(lubridate)
library(tidyverse)
library(magrittr)
library(bbplot)
library(waffle)
library(stringr)
library(wordcloud)
library(waffle)
library(wesanderson)

Last September 17^th 2021 marked the 30^th anniversary of the entry of North Korea and South Korea into full membership in the United Nations. Prior to this, they were only afforded observer status.

keia.org
The Two Koreas Mark 30 Years of UN Membership: The Road to Membership

Let’s look at the types of voting that both countries have done in the General Assembly since 1991.

First we can download the different types of UN votes from the unvotes package

un_votes <- unvotes::un_roll_calls

un_votes_issues <- unvotes::un_roll_call_issues

unvotes::un_votes -> country_votes

Join them all together and filter out any country that does not have the word “Korea” in its name.

un_votes %>% 
  inner_join(un_votes_issues, by = "rcid") %>% 
  inner_join(country_votes, by = "rcid") %>% 
  mutate(year = format(date, format = "%Y")) %>%
  filter(grepl("Korea", country)) -> korea_un

First we can make a wordcloud of all the different votes for which they voted YES. Is there a discernable difference in the types of votes that each country supported?

First, download the stop words that we can remove (such as the, and, if)

data("stop_words")

Then I will make a North Korean dataframe of all the votes for which this country voted YES. I remove some of the messy formatting with the gsub argument and count the occurence of each word. I get rid of a few of the procedural words that are more related to the technical wording of the resolutions, rather than related to the tpoic of the vote.

nk_yes_votes <- korea_un %>% 
  filter(country == "North Korea") %>% 
  filter(vote == "yes") %>%  
  select(descr, year) %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s")) %>% 
  # group_by(decade) %>% 
  unnest_tokens(word, descr) %>% 
  mutate(word = gsub(" ", "", word)) %>% 
  mutate(word = gsub('_', '', word)) %>% 
  count(word, sort = TRUE) %>% 
  ungroup() %>% 
  anti_join(stop_words)  %>% 
  mutate(word = case_when(grepl("palestin", word) ~ "Palestine", 
                          grepl("nucl", word) ~ "nuclear",
                          TRUE ~ as.character(word)))  %>%
  filter(word != "resolution") %>% 
  filter(word != "assembly") %>% 
  filter(word != "draft") %>% 
  filter(word != "committee") %>% 
  filter(word != "requested") %>% 
  filter(word != "report") %>% 
  filter(word != "practices") %>% 
  filter(word != "affecting") %>% 
  filter(word != "follow") %>% 
  filter(word != "acting") %>% 
  filter(word != "adopted")

Next, we count the number of each word


nk_yes_votes %<>% 
  count(word) %>% 
  arrange(desc(n))

We want to also remove the numbers

nums <- nk_yes_votes %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()

And remove the stop words

nk_yes_votes %<>%
  anti_join(nums, by = "word")

Choose some nice colours

my_colors <- c("#0450b4", "#046dc8", "#1184a7","#15a2a2", "#6fb1a0", 
               "#b4418e", "#d94a8c", "#ea515f", "#fe7434", "#fea802")

And lastly, plot the wordcloud with the top 50 words

wordcloud(nk_yes_votes$word, 
   nk_yes_votes$n, 
   random.order = FALSE, 
   max.words = 50, 
   colors = my_colors)

If we repeat the above code with South Korea:

There doesn’t seem to be a huge difference. But this is not a very scientfic approach; I just like the look of them!

Next we will compare the two countries how many votes they voted yes, no or abstained from…

korea_un %>% 
  group_by(country, vote) %>% 
  count() %>% 
  mutate(count_ten = n /25) %>% 
  ungroup() %>% 
  ggplot(aes(fill = vote, values = count_ten)) +
  geom_waffle(color = "white",
              size = 2.5,
              n_rows = 10,
              flip = TRUE) +
  facet_wrap(~country) + bbplot::bbc_style() +
  scale_fill_manual(values = wesanderson::wes_palette("Darjeeling1"))

AND some tweaking with Canva

Next we can look more in detail at the votes that they countries abstained from voting in.

We can use the tidytext function that reorders the geom_bar in each country. You can read the blog of Julie Silge to learn more about the functions, it is a bit tricky but it fixes the problem of randomly ordered bars across facets.

https://juliasilge.com/blog/reorder-within/

korea_un %>%
  filter(vote == "abstain") %>% 
  mutate(issue = case_when(issue == "Nuclear weapons and nuclear material" ~ "Nukes",
issue == "Arms control and disarmament" ~ "Arms",
issue == "Palestinian conflict" ~ "Palestine",
TRUE ~ as.character(issue))) %>% 
  select(country, issue, year) %>% 
  group_by(issue, country) %>% 
  count() %>% 
  ungroup() %>% 
  group_by(country) %>% 
  mutate(country = as.factor(country),
         issue = reorder_within(issue, n, country)) %>%
  ggplot(aes(x = reorder(issue, n), y = n)) + 
  geom_bar(stat = "identity", width = 0.7, aes(fill = country)) + 
  labs(title = "Abstaining UN General Assembly Votes by issues",
       subtitle = ("Since 1950s"),
       caption = "         Source: unvotes ") +
  xlab("") + 
  ylab("") +
  facet_wrap(~country, scales = "free_y") +
  scale_x_reordered() +
  coord_flip() + 
  expand_limits(y = 65) + 
  ggthemes::theme_pander() + 
  scale_fill_manual(values = sample(my_colors)) + 
 theme(plot.background = element_rect(color = "#f5f9fc"),
        panel.grid = element_line(colour = "#f5f9fc"),
        # axis.title.x = element_blank(),
        # axis.text.x = element_blank(),
        axis.text.y = element_text(color = "#000500", size = 16),
       legend.position = "none",
        # axis.title.y = element_blank(),
        axis.ticks.x = element_blank(),
        text = element_text(family = "Gadugi"),
        plot.title = element_text(size = 28, color = "#000500"),
        plot.subtitle = element_text(size = 20, color = "#484e4c"),
        plot.caption = element_text(size = 20, color = "#484e4c"))

South Korea was far more likely to abstain from votes that North Korea on all issues

Next we can simply plot out the Human Rights votes that each country voted to support. Even though South Korea has far higher human rights scores, North Korea votes in support of more votes on this topic.

korea_un %>% 
  filter(year < 2019) %>% 
  filter(issue == "Human rights") %>% 
  filter(vote == "yes") %>% 
  group_by(country, year) %>% 
  count() %>% 
  ggplot(aes(x = year, y = n, group = country, color = country)) + 
  geom_line(size = 2) + 
  geom_point(aes(color = country), fill = "white", shape = 21, size = 3, stroke = 2.5) +
  scale_x_discrete(breaks = round(seq(min(korea_un$year), max(korea_un$year), by = 10),1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 22)) + 
  bbplot::bbc_style() + facet_wrap(~country) + 
  theme(legend.position = "none") + 
  scale_color_manual(values = sample(my_colors)) + 
  labs(title = "Human Rights UN General Assembly Yes Votes ",
       subtitle = ("Since 1990s"),
       caption = "         Source: unvotes ")

All together:

Wrangling and graphing UN Secretaries-General data with R

Packages we will need:

library(tidyverse)
library(janitor)
library(rvest)
library(countrycode)
library(magrittr)
library(lubridate)
library(ggflags)
library(scales)

According to Urquhart (1995) in his article, “Selecting the World’s CEO”,

From the outset, the U.N. secretary
general has been an important part of the
institution, not only as its chief executive,
but as both symbol and guardian of the
original vision of the organization.
There, however, specific agreement has
ended. The United Nations, like any
important organization, needs strong and
independent leadership, but it is an inter-
governmental organization, and govern
ments have no intention of giving up
control of it. While the secretary-general
can be extraordinarily useful in times of
crisis, the office inevitably embodies
something more than international coop
eration, sometimes even an unwelcome
hint of supranationalism. Thus, the atti-
tude of governments toward the United
Nations’ chief and only elected official is
and has been necessarily ambivalent.

(Urquhart, 1995: 21)

So who are these World CEOs? We’ll examine more in this dataset.

First, we will scrape the data from the Wikipedia

sg_html <- read_html("https://en.wikipedia.org/wiki/Secretary-General_of_the_United_Nations")
sg_tables <- sg_html %>% html_table(header = TRUE, fill = TRUE)
sg <- sg_tables[[2]]

The table we scrape is a bit of a hot mess in this state …. but we can fix it

Donald Glover Pizza GIF - Find & Share on GIPHY

We can first use the clean_names() function from the janitor package

A quick way to clean up the table and keep only the rows with the names of the Secretaries-General is to use the distinct() function. Last we filter out the rows and select out the columns we don’t want.

sg %>% 
  clean_names() %>% 
  distinct(no, .keep_all = TRUE) %>% 
  filter(no != "–") %>% 
  select(!c(portrait, ref))-> sg_clean

Already we can see a much cleaner table. However, the next problem is that the names and their years of birth / death are in one cell.

Also the dates in office are combined together.

So we can use the separate() function from tidyr to make new variables for each piece of information.

First we will separate the name of the Secretary-General from their date of birth and death.

We supply the two new variable names to the into = argument.

We then use the regex code pattern [()] to indicate where we want to separate the character string into two separate columns. In this instance the regex pattern is for what is after the round brackets (

I want to remove the original cluttered varaible so remove = TRUE

sg_clean %<>% 
  separate(
    col = secretary_general_born_died,
    into = c("sec_gen", "born_died"),
    sep = '[()]',
    remove = TRUE)

We can repeat this step to create a separate born and died variable. This time the separator symbol is a hyphen And so we do not need regex pattern; we can just indicate a hyphen.

sg_clean %<>% 
  separate(
    col = born_died,
    into = c("born", "died"),
    sep = '–',
    remove = TRUE)

And I want to ignore the “present” variable, so I extract the numbers with the parse_number() function, converting things from characters to numbers

sg_clean %<>% 
  mutate(born = parse_number(born))

Last, we repeat with the dates in office. This is also easily seperated by indicating the hyphen.

sg_clean %<>% 
  separate(
    col = dates_in_office,
    into = c("start_office", "end_office"),
    sep = '–',
    remove = TRUE)

We convert the word “present” to the actual present date

sg_clean %<>% 
  mutate(end_office = ifelse(end_office == "present", "5 May 2022", end_office))

We use the lubridate dmy() function to convert the character strings to date class variables.

sg_clean %<>% 
  mutate(start_office = dmy(start_office),
         end_office = dmy(end_office))

We can calculate the length of time that each Secretary-General was in office with the difftime() function.

sg_clean %<>% 
  mutate(duration_days = difftime(end_office, start_office, units = "days"),
         duration_years = round(duration_days / 365, 2),
         duration_years = as.integer(duration_years))

Next we can compare the different durations and see which Secretary-General was longest or shortest in office.

sg_clean %>% 
  mutate(duration_days = difftime(end_office, start_office)) %>%  
  mutate(iso2 = tolower(countrycode::countrycode(country_of_origin, "country.name", "iso2c"))) %>% 
  ggplot(aes(x = forcats::fct_reorder(sec_gen, duration_days), y = duration_days)) + 
  geom_bar(aes(fill = un_regional_group), stat = "identity", width = 0.7) + 
  coord_flip() + bbplot::bbc_style() + 
  ggflags::geom_flag(aes(x = sec_gen, y = -100, country = iso2), size = 12) +
  scale_fill_manual(values = le_palette) +
  labs(title = "Longest serving UN Secretaries General",
       subtitle = ("Source: Wikipedia")) + 
  xlab("") + ylab("")

We can make a quick pie-chart to compare regions. We can see that Secretaries-General from the West have had the most time in office

sg_text <- sg_count %>% 
  arrange(desc(un_regional_group)) %>%
  mutate(prop = sum_days / sum(sg_count$sum_days) *100) %>%
  mutate(ypos = cumsum(prop)- 0.5*prop )

sg_text %>% 
  count(un_regional_group)

sg_text %>%
  mutate(region = case_when(un_regional_group == "Western European & others" ~ "Europe",
         un_regional_group == "Latin American& Caribbean" ~ "Latin America",
         un_regional_group == "Asia & Pacific" ~ "Asia", 
         TRUE ~ as.character(un_regional_group))) %>% 
  ggplot(aes(x = "", y = prop, fill = region)) +
  geom_bar(stat = "identity", width = 1) +
  geom_text(aes(y = ypos + 1, label = round(prop, 0)), color = "white", size = 15) +
  coord_polar("y", start = 0) +
  theme_void() +
  ggtitle("Length of Secretaries General in office across regions") + 
  scale_fill_manual(values = le_palette) + 
  theme(legend.title = element_blank(),
        legend.text = element_text(size = 20), 
        plot.title = element_text(size = 30))

We can create a Gantt-like chart to track the timeline for the different men (all men!)

Click here to read more about timelines in R

Make a timeline graph with dates in ggplot2

sg_clean %>% 
  mutate(region = case_when(un_regional_group == "Western European & others" ~ "Europe",un_regional_group == "Latin American& Caribbean" ~ "Latin America",un_regional_group == "Asia & Pacific" ~ "Asia", TRUE ~ as.character(un_regional_group))) %>%
  ggplot(aes(x = as.Date(start_office), 
             y = no, 
             color = region)) +
  geom_segment(aes(xend = as.Date(end_office), 
                   yend = no, alpha = 0.9,
                   color = region), size = 9)  +
  geom_text(aes(label = sec_gen), 
            color = "black", 
            alpha = 0.7,
            size = 8, show.legend = FALSE) +
  bbplot::bbc_style() +
  scale_color_manual(values = le_palette) + 
  scale_x_date(breaks = scales::breaks_pretty(15))

Confused Donald Glover GIF - Find & Share on GIPHY

References

Urquhart, B. (1995). Selecting the world’s CEO: Remembering the Secretaries-General. Foreign Affairs, 21-26.

Donald Glover Community GIF - Find & Share on GIPHY

Scraping and wrangling UN peacekeeping data with tidyr package in R

Packages we will need:

library(tidyverse)
library(rvest)
library(magrittr)
library(tidyr)
library(countrycode)
library(democracyData)
library(janitor)
library(waffle)

For this blog post, we will look at UN peacekeeping missions and compare across regions.

Despite the criticisms about some operations, the empirical record for UN peacekeeping records has been robust in the academic literature

“In short, peacekeeping intervenes in the most difficult
cases, dramatically increases the chances that peace will
last, and does so by altering the incentives of the peacekept,
by alleviating their fear and mistrust of each other, by
preventing and controlling accidents and misbehavior by
hard-line factions, and by encouraging political inclusion”
(Goldstone, 2008: 178).

The data on the current and previous PKOs (peacekeeping operations) will come from the Wikipedia page. But the variables do not really lend themselves to analysis as they are.

Amy Coney Barrett Snl GIF by Saturday Night Live - Find & Share on GIPHY

Once we have the url, we scrape all the tables on the Wikipedia page in a few lines

pko_members <- read_html("https://en.wikipedia.org/wiki/List_of_United_Nations_peacekeeping_missions")
pko_tables <- pko_members %>% html_table(header = TRUE, fill = TRUE)

Click here to read more about the rvest package for scraping data from websites.

Scrape NATO defense expenditure data from Wikipedia with the rvest package in R

pko_complete_africa <- pko_tables[[1]]
pko_complete_americas <- pko_tables[[2]]
pko_complete_asia <- pko_tables[[3]]
pko_complete_europe <- pko_tables[[4]]
pko_complete_mena <- pko_tables[[5]]

And then we bind them together! It’s very handy that they all have the same variable names in each table.

rbind(pko_complete_africa, pko_complete_americas, pko_complete_asia, pko_complete_europe, pko_complete_mena) -> pko_complete

Next, we will add a variable to indicate that all the tables of these missions are completed.

pko_complete %<>% 
  mutate(complete = ifelse(!is.na(pko_complete$Location), "Complete", "Current"))

We do the same with the current missions that are ongoing:

pko_current_africa <- pko_tables[[6]]
pko_current_asia <- pko_tables[[7]]
pko_current_europe <- pko_tables[[8]]
pko_current_mena <- pko_tables[[9]]

rbind(pko_current_europe, pko_current_mena, pko_current_asia, pko_current_africa) -> pko_current

pko_current %<>% 
  mutate(complete = ifelse(!is.na(pko_current$Location), "Current", "Complete"))

We then bind the completed and current mission data.frames

rbind(pko_complete, pko_current) -> pko

Then we clean the variable names with the function from the janitor package.

pko_df <-  pko %>% 
  janitor::clean_names()

Next we’ll want to create some new variables.

We can make a new row for each country that is receiving a peacekeeping mission. We can paste all the countries together and then use the separate function from the tidyr package to create new variables.

 pko_df %>%
  group_by(conflict) %>%
  mutate(location = paste(location, collapse = ', ')) %>% 
  separate(location,  into = c("country_1", "country_2", "country_3", "country_4", "country_5"), sep = ", ")  %>% 
  ungroup() %>% 
  distinct(conflict, .keep_all = TRUE) %>%

Next we can create a new variable that only keeps the acroynm for the operation name. I took these regex codes from the following stack overflow link

pko_df %<>% 
  mutate(acronym = str_extract_all(name_of_operation, "\\([^()]+\\)")) %>% 
  mutate(acronym = substring(acronym, 2, nchar(acronym)-1)) %>% 
  separate(dates_of_operation, c("start_date", "end_date"), "–")

I will fill in the end data for the current missions that are still ongoing in 2022

pko_df %<>% 
  mutate(end_date = ifelse(complete == "Current", 2022, end_date))

And next we can calculate the duration for each operation

pko_df %<>% 
  mutate(end_date = as.integer(end_date)) %>% 
  mutate(start_date = as.integer(start_date)) %>% 
  mutate(duration = ifelse(!is.na(end_date), end_date - start_date, 1))

I want to compare regions and graph out the different operations around the world.

We can download region data with democracyData package (best package ever!)

Snl Season 47 GIF by Saturday Night Live - Find & Share on GIPHY

pacl <- redownload_pacl()

pacl %>% 
  select(cown = pacl_cowcode,
        un_region_name, un_continent_name) %>% 
  distinct(cown, .keep_all = TRUE) -> pacl_region

We join the datasets together with the inner_join() and add Correlates of War country codes.

pko_df %<>% 
  mutate(cown = countrycode(country_1, "country.name", "cown")) %>%   mutate(cown = ifelse(country_1 == "Western Sahara", 605, 
                       ifelse(country_1 == "Serbia", 345, cown))) %>% 
  inner_join(pacl_region, by = "cown")

Now we can start graphing our duration data:

pko_df %>% 
  ggplot(mapping = aes(x = forcats::fct_reorder(un_region_name, duration), 
                       y = duration, 
                       fill = un_region_name)) +
  geom_boxplot(alpha = 0.4) +
  geom_jitter(aes(color = un_region_name),
              size = 6, alpha = 0.8, width = 0.15) +
  coord_flip() + 
  bbplot::bbc_style() + ggtitle("Duration of Peacekeeping Missions")

We can see that Asian and “Western Asian” – i.e. Middle East – countries have the longest peacekeeping missions in terns of years.

pko_countries %>% 
  filter(un_continent_name == "Asia") %>%
  unite("country_names", country_1:country_5, remove = TRUE,  na.rm = TRUE, sep = ", ") %>% 
  arrange(desc(duration)) %>% 
  knitr::kable("html")

Start	End	Duration	Region	Country
1949	2022	73	Southern Asia	India, Pakistan
1964	2022	58	Western Asia	Cyprus, Northern Cyprus
1974	2022	48	Western Asia	Israel, Syria, Lebanon
1978	2022	44	Western Asia	Lebanon
1993	2009	16	Western Asia	Georgia
1991	2003	12	Western Asia	Iraq, Kuwait
1994	2000	6	Central Asia	Tajikistan
2006	2012	6	South-Eastern Asia	East Timor
1988	1991	3	Southern Asia	Iran, Iraq
1988	1990	2	Southern Asia	Afghanistan, Pakistan
1965	1966	1	Southern Asia	Pakistan, India
1991	1992	1	South-Eastern Asia	Cambodia, Cambodia
1999	NA	1	South-Eastern Asia	East Timor, Indonesia, East Timor, Indonesia, East Timor
1958	NA	1	Western Asia	Lebanon
1963	1964	1	Western Asia	North Yemen
2012	NA	1	Western Asia	Syria

Next we can compare the decades

pko_countries %<>% 
  mutate(decade = substr(start_date, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s"))

And graph it out:

pko_countries %>% 
  ggplot(mapping = aes(x = decade, 
                       y = duration, 
                       fill = decade)) +
  geom_boxplot(alpha = 0.4) +
  geom_jitter(aes(color = decade),
              size = 6, alpha = 0.8, width = 0.15) +
   coord_flip() + 
  geom_curve(aes(x = "1950s", y = 60, xend = "1940s", yend = 72),
  arrow = arrow(length = unit(0.1, "inch")), size = 0.8, color = "black",
   curvature = -0.4) +
  annotate("text", label = "First Mission to Kashmir",
           x = "1950s", y = 49, size = 8, color = "black") +
  geom_curve(aes(x = "1990s", y = 46, xend = "1990s", yend = 32),
             arrow = arrow(length = unit(0.1, "inch")), size = 0.8, color = "black",curvature = 0.3) +
  annotate("text", label = "Most Missions after the Cold War",
           x = "1990s", y = 60, size = 8, color = "black") +

  bbplot::bbc_style() + ggtitle("Duration of Peacekeeping Missions")

Following the end of the Cold War, there were renewed calls for the UN to become the agency for achieving world peace, and the agency’s peacekeeping dramatically increased, authorizing more missions between 1991 and 1994 than in the previous 45 years combined.

We can use a waffle plot to see which decade had the most operation missions. Waffle plots are often seen as more clear than pie charts.

Click here to read more about waffle charts in R

Alternatives to pie charts: coxcomb and waffle charts

To get the data ready for a waffle chart, we just need to count the number of peacekeeping missions (i.e. the number of rows) in each decade. Then we fill the groups (i.e. decade) and enter the n variable we created as the value.

pko_countries %>% 
  group_by(decade) %>% 
  count() %>%  
  ggplot(aes(fill = decade, values = n)) + 
  waffle::geom_waffle(color = "white", size= 3, n_rows = 8) +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  coord_equal() +
  labs(title = "Number of Peacekeeper Missions") + bbplot::bbc_style()

Cecily Strong Snl GIF by Saturday Night Live - Find & Share on GIPHY

If we want to add more information, we can go to the UN Peacekeeping website and download more data on peacekeeping troops and operations.

We can graph the number of peacekeepers per country

Click here to learn more about adding flags to graphs!

Add circular flags to maps and graphs with ggflags package in R

le_palette <- c("#5f0f40", "#9a031e", "#94d2bd", "#e36414", "#0f4c5c")

pkt %>%
  mutate(contributing_country = ifelse(contributing_country == "United Republic of Tanzania", "Tanzania",ifelse(contributing_country == "CÃ´te dâ€™Ivoire", "Cote d'Ivoire", contributing_country))) %>% 
  mutate(iso2 = tolower(countrycode::countrycode(contributing_country, "country.name", "iso2c"))) %>% 
  mutate(cown = countrycode::countrycode(contributing_country, "country.name", "cown")) %>% 
  inner_join(pacl_region, by = "cown") %>% 
  mutate(un_region_name = case_when(grepl("Africa", un_region_name) ~ "Africa",grepl("Eastern Asia", un_region_name) ~ "South-East Asia",
 un_region_name == "Western Africa" ~ "Middle East",TRUE ~ as.character(un_region_name))) %>% 
  filter(total_uniformed_personnel > 700) %>% 
  ggplot(aes(x = reorder(contributing_country, total_uniformed_personnel),
             y = total_uniformed_personnel)) + 
  geom_bar(stat = "identity", width = 0.7, aes(fill = un_region_name), color = "white") +
  coord_flip() +
  ggflags::geom_flag(aes(x = contributing_country, y = -1, country = iso2), size = 8) +
  # geom_text(aes(label= values), position = position_dodge(width = 0.9), hjust = -0.5, size = 5, color = "#000500") + 
  scale_fill_manual(values = le_palette) +
  labs(title = "Total troops serving as peacekeepers",
       subtitle = ("Across countries"),
       caption = "         Source: UN ") +
  xlab("") + 
  ylab("") + bbplot::bbc_style()

We can see that Bangladesh, Nepal and India have the most peacekeeper troops!

Convert event-level data to panel-level data with tidyr in R

Packages we will need:

library(tidyverse)
library(magrittr)
library(lubridate)
library(tidyr)
library(rvest)
library(janitor)

In this post, we are going to scrape NATO accession data from Wikipedia and turn it into panel data. This means turning a list of every NATO country and their accession date into a time-series, cross-sectional dataset with information about whether or not a country is a member of NATO in any given year.

This is helpful for political science analysis because simply a dummy variable indicating whether or not a country is in NATO would lose information about the date they joined. The UK joined NATO in 1948 but North Macedonia only joined in 2020. A simple binary variable would not tell us this if we added it to our panel data.

Consoling 30 Rock GIF - Find & Share on GIPHY

We will first scrape a table from the Wikipedia page on NATO member states with a few functions form the rvest pacakage.

Click here to read more about the rvest package:

Scrape NATO defense expenditure data from Wikipedia with the rvest package in R

nato_members <- read_html("https://en.wikipedia.org/wiki/Member_states_of_NATO")

nato_tables <- nato_members %>% html_table(header = TRUE, fill = TRUE)

nato_member_joined <- nato_tables[[1]]

We have information about each country and the date they joined. In total there are 30 rows, one for each member of NATO.

Next we are going to clean up the data, remove the numbers in the [square brackets], and select the columns that we want.

A very handy function from the janitor package cleans the variable names. They are lower_case_with_underscores rather than how they are on Wikipedia.

Next we remove the square brackets and their contents with sub("\\[.*", "", insert_variable_name)

And the accession date variable is a bit tricky because we want to convert it to date format, extract the year and convert back to an integer.

nato_member_joined %<>% 
  clean_names() %>% 
  select(country = member_state, 
         accession = accession_3) %>% 
  mutate(member_2020 = 2020,
         country = sub("\\[.*", "", country),
         accession = sub("\\[.*", "", accession),
         accession = parse_date_time(accession, "dmy"),
         accession = format(as.Date(accession, format = "%d/%m/%Y"),"%Y"),
         accession = as.numeric(as.character(accession)))

When we have our clean data, we will pivot the data to longer form. This will create one event column that has a value of accession or member in 2020.

This gives us the start and end year of our time variable for each country.

nato_member_joined %<>% 
  pivot_longer(!country, names_to = "event", values_to = "year")

Our dataset now has 60 observations. We see Albania joined in 2009 and is still a member in 2020, for example.

Next we will use the complete() function from the tidyr package to fill all the dates in between 1948 until 2020 in the dataset. This will increase our dataset to 2,160 observations and a row for each country each year.

Nect we will group the dataset by country and fill the nato_member status variable down until the most recent year.

nato_member_joined %<>% 
  mutate(year = as.Date(as.character(year), format = "%Y")) %>% 
  mutate(year = ymd(year)) %>% 
  complete(country, year = seq.Date(min(year), max(year), by = "year")) %>% 
  mutate(nato_member = ifelse(event == "accession", 1, 
                              ifelse(event == "member_2020", 1, 0))) %>% 
  group_by(country) %>% 
  fill(nato_member, .direction = "down") %>%
  ungroup()

Last, we will use the ifelse() function to mutate the event variable into one of three categories: 'accession‘, 'member‘ or ‘not member’.

nato_member_joined %>%
  mutate(nato_member = replace_na(nato_member, 0),
         year = parse_number(as.character(year)),
         event = ifelse(nato_member == 0, "not member", event),
         event = ifelse(nato_member == 1 & is.na(event), "member", event),
         event = ifelse(event == "member_2020", "member", event))  %>% 
  distinct(country, year, .keep_all = TRUE) -> nato_panel

High Five 30 Rock GIF - Find & Share on GIPHY

Ethnicity Dataset

epr_indo <- read_csv('/mnt/data/epr_indo.csv')

# Expand the data to have a row for each year for each group
epr_indo_expanded <- epr_indo %>%
  rowwise() %>%
  mutate(year = list(seq(from, to))) %>% 
  unnest(year) %>%
  select(-from, -to)

# Pivot the data to wide format with separate ethnicity, share, and broad_cat columns
epr_indo_wide <- epr_indo_expanded %>%
  group_by(statename, year) %>%
  mutate(index = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = c(statename, year),
    names_from = index,
    values_from = c(group, size, broad_cat),
    names_sep = "_"
  ) %>%
  # Renaming the columns for ethnicity, share, and broad category
  rename_with(~ str_replace(., "group_", "ethnicity_"), starts_with("group_")) %>%
  rename_with(~ str_replace(., "size_", "share_"), starts_with("size_")) %>%
  rename_with(~ str_replace(., "broad_cat_", "broad_cat_"), starts_with("broad_cat_"))

Lump groups together and create “other” category with forcats package

Packages we will need:

library(tidyverse)
library(forcats)
library(tidytext)
library(ggthemes)
library(democracyData)
library(magrittr)

For this blog, we are going to look at the titles of all countries’ heads of state, such as Kings, Presidents, Emirs, Chairman … understandably, there are many many many ways to title the leader of a country.

First, we will download the PACL dataset from the democracyData package.

Click here to read more about this super handy package:

Download democracy data with democracyData package in R

If you want to read more about the variables in this dataset, click the link below to download the codebook by Cheibub et al.

Democracy and Dictatorship codebook (Cheibub et al., 2010)Download

pacl <- redownload_pacl()

We are going to look at the npost variable; this captures the political title of the nominal head of stage. This can be King, President, Sultan et cetera!

pacl %>% 
  count(npost) %>% 
  arrange(desc(n))

If we count the occurence of each title, we can see there are many ways to be called the head of a country!

"president"                         3693
"prime minister"                    2914
"king"                               470
"Chairman of Council of Ministers"   229
"premier"                            169
"chancellor"                         123
"emir"                               117
"chair of Council of Ministers"      111
"head of state"                       90
"sultan"                              67
"chief of government"                 63
"president of the confederation"      63
""                                    44
"chairman of Council of Ministers"    44
"shah"                                33

# ... with 145 more rows

155 groups is a bit difficult to meaningfully compare.

So we can collapse some of the groups together and lump all the titles that occur relatively seldomly – sometimes only once or twice – into an “other” category.

Clueless Movie Tai GIF - Find & Share on GIPHY

First, we use grepl() function to take the word president and chair (chairman, chairwoman, chairperson et cetera) and add them into broader categories.

Also, we use the tolower() function to make all lower case words and there is no confusion over the random capitalisation.

 pacl %<>% 
  mutate(npost = tolower(npost)) %>% 
  mutate(npost = ifelse(grepl("president", npost), "president", npost)) %>% 
  mutate(npost = ifelse(grepl("chair", npost), "chairperson", npost))

Next, we create an "other leader type" with the fct_lump_prop() function.

We specifiy a threshold and if the group appears fewer times in the dataset than this level we set, it is added into the “other” group.

pacl %<>% 
  mutate(regime_prop = fct_lump_prop(npost,
                                   prop = 0.005,
                                   other_level = "Other leader type")) %>% 
  mutate(regime_prop = str_to_title(regime_prop))

Now, instead of 155 types of leader titles, we have 10 types and the rest are all bundled into the Other Leader Type category

President            4370
Prime Minister       2945
Chairperson           520
King                  470
Other Leader Type     225
Premier               169
Chancellor            123
Emir                  117
Head Of State          90
Sultan                 67
Chief Of Government    63

The Office Smile GIF - Find & Share on GIPHY

The forcast package has three other ways to lump the variables together.

First, we can quickly look at fct_lump_min().

We can set the min argument to 100 and look at how it condenses the groups together:

pacl %>% 
  mutate(npost = tolower(npost)) %>% 
 
  mutate(post_min = fct_lump_min(npost,
                                   min = 100,
                                   other_level = "Other type")) %>% 
  mutate(post_min = str_to_title(post_min)) %>% 
  count(post_min) %>% 
  arrange(desc(n))

President       4370
Prime Minister  2945
Chairperson      520
King             470
Other Type       445
Premier          169
Chancellor       123
Emir             117

We can see that if the post appears fewer than 100 times, it is now in the Other Type category. In the previous example, Head Of State only appeared 90 times so it didn’t make it.

Next we look at fct_lump_lowfreq().

This function lumps together the least frequent levels. This one makes sure that “other” category remains as the smallest group. We don’t add another numeric argument.

pacl %>% 
  mutate(npost = tolower(npost)) %>% 
  mutate(post_lowfreq  = fct_lump_lowfreq(npost,
                                   other_level = "Other type")) %>% 
  mutate(post_lowfreq = str_to_title(post_lowfreq)) %>% 
  count(post_lowfreq) %>% 
  arrange(desc(n))

President       4370
Prime Minister  2945
Other Type      1844

This one only has three categories and all but president and prime minister are chucked into the Other type category.

Last, we can look at the fct_lump_n() to make sure we have a certain number of groups. We add n = 5 and we create five groups and the rest go to the Other type category.

pacl %>% 
  mutate(npost = tolower(npost)) %>% 
  mutate(post_n  = fct_lump_n(npost,
                                n = 5,
                                other_level = "Other type")) %>% 
  mutate(post_n = str_to_title(post_n)) %>% 
  count(post_n) %>% 
  arrange(desc(n))

President       4370
Prime Minister  2945
Other Type       685
Chairperson      520
King             470
Premier          169

Sums It Up The Office GIF - Find & Share on GIPHY

Next we can make a simple graph counting the different leader titles in free, partly free and not free Freedom House countries. We will use the download_fh() from DemocracyData package again

fh <- download_fh()

We will use the reorder_within() function from tidytext package.

Click here to read the full blog post explaining the function from Julia Silge’s blog.

First we add Freedom House data with the inner_join() function

Then we use the fct_lump_n() and choose the top five categories (plus the Other Type category we make)

pacl %<>% 
  inner_join(fh, by = c("cown", "year")) %>% 
  mutate(npost  = fct_lump_n(npost,
                  n = 5,
                  other_level = "Other type")) %>%
  mutate(npost = str_to_title(npost))

Then we group_by the three Freedom House status levels and count the number of each title:

pacl %<>% 
  group_by(status) %>% 
  count(npost) %>% 
  ungroup() %>%

Using reorder_within(), we order the titles from most to fewest occurences WITHIN each status group:

pacl %<>%
  mutate(npost = reorder_within(npost, n, status))

To plot the columns, we use geom_col() and separate them into each Freedom House group, using facet_wrap(). We add scales = "free y" so that we don’t add every title to each group. Without this we would have empty spaces in the Free group for Emir and King. So this step removes a lot of clutter.

pacl_colplot <- pacl %>%
  ggplot(aes(fct_reorder(npost, n), n)) +
  geom_col(aes(fill = npost), show.legend = FALSE) +
  facet_wrap(~status, scales = "free_y")

Last, I manually added the colors to each group (which now have longer names to reorder them) so that they are consistent across each group. I am sure there is an easier and less messy way to do this but sometimes finding the easier way takes more effort!

We add the scale_x_reordered() function to clean up the names and remove everything from the underscore in the title label.

pacl_colplot + scale_fill_manual(values = c("Prime Minister___F" = "#005f73",
                                "Prime Minister___NF" = "#005f73",
                                "Prime Minister___PF" = "#005f73",
                                
                               "President___F" = "#94d2bd",
                               "President___NF" = "#94d2bd",
                               "President___PF" = "#94d2bd",
                               
                               "Other Type___F" = "#ee9b00",
                               "Other Type___NF" = "#ee9b00",
                               "Other Type___PF" = "#ee9b00",
                               
                               "Chairperson___F" = "#bb3e03",
                               "Chairperson___NF" = "#bb3e03",
                               "Chairperson___PF" = "#bb3e03",
                               
                               "King___F" = "#9b2226",
                               "King___NF" = "#9b2226",
                               "King___PF" = "#9b2226",
                               
                               "Emir___F" = "#001219", 
                               "Emir___NF" = "#001219",
                               "Emir___PF" = "#001219")) +
  scale_x_reordered() +
  coord_flip() + 
  ggthemes::theme_fivethirtyeight() + 
  themes(text = element_size(size = 30))

In case you were curious about the free country that had a chairperson, Nigeria had one for two years.

pacl %>%
  filter(status == "F") %>% 
  filter(npost == "Chairperson") %>% 
  select(Country = pacl_country) %>% 
  knitr::kable("latex") %>%
  kableExtra::kable_classic(font_size = 30)

References

Cheibub, J. A., Gandhi, J., & Vreeland, J. R. (2010). Democracy and dictatorship revisited. Public choice, 143(1), 67-101.

Visualise DemocracyData with graphs and maps

Packages we will need:

library(tidyverse)
library(democracyData)
library(magrittr)
library(ggrepel)
library(ggthemes)
library(countrycode)

In this post, we will look at easy ways to graph data from the democracyData package.

The two datasets we will look at are the Anckar-Fredriksson dataset of political regimes and Freedom House Scores.

Regarding democracies, Anckar and Fredriksson (2018) distinguish between republics and monarchies. Republics can be presidential, semi-presidential, or parliamentary systems.

Within the category of monarchies, almost all systems are parliamentary, but a few countries are conferred to the category semi-monarchies.

Bill Murray King GIF - Find & Share on GIPHY

Autocratic countries can be in the following main categories: absolute monarchy, military rule, party-based rule, personalist rule, and oligarchy.

anckar <- democracyData::redownload_anckar()
fh <- download_fh()

We will see which regime types have been free or not since 1970.

We join the fh dataset to the anckar dataset with inner_join(). Luckily, both the datasets have the cown and year variables with which we can merge.

Then we sumamrise the mean Freedom House level for each regime type.

anckar %>% 
  inner_join(fh, by = c("cown", "year")) %>% 
  filter(!is.na(regimebroadcat)) %>%
  group_by(regimebroadcat, year) %>% 
  summarise(mean_fh = mean(fh_total_reversed, na.rm = TRUE)) -> anckar_sum

We want to place a label for each regime line in the graph, so create a small dataframe with regime score information only about the first year.

anckar_start <- anckar_sum %>%
  group_by(regimebroadcat) %>% 
  filter(year == 1972) %>% 
  ungroup()

And we pick some more jewel toned colours for the graph and put them in a vector.

my_palette <- c("#ca6702", "#bb3e03", "#ae2012", "#9b2226", "#001219", "#005f73", "#0a9396", "#94d2bd", "#ee9b00")

And we graph it out

anckar_sum %>%
  ggplot(aes(x = year, y = mean_fh, groups = as.factor(regimebroadcat))) + 
  geom_point(aes(color = regimebroadcat), alpha = 0.7, size = 2) + 
  geom_line(aes(color = regimebroadcat), alpha = 0.7, size = 2) +
  ggrepel::geom_label_repel(data = anckar_start, hjust = 1.5,
            aes(x = year,
                y = mean_fh,
                color = regimebroadcat,
                label = regimebroadcat),
            alpha = 0.7,
            show.legend = FALSE, 
            size = 9) + 
  scale_color_manual(values = my_palette) +
  expand_limits(x = 1965) +  
  ggthemes::theme_pander() + 
  theme(legend.position = "none",
        axis.text = element_text(size = 30, colour ="grey40"))

Democracy_Regimes_1970 Download

We can also use map data that comes with the tidyverse() package.

To merge the countries easily, I add a cown variable to this data.frame

world_map <- map_data("world")

world_map %<>% 
  mutate(cown = countrycode::countrycode(region, "country.name", "cown"))

I want to only look at regimes types in the final year in the dataset – which is 2018 – so we filter only one year before we merge with the map data.frame.

The geom_polygon() part is where we indiciate the variable we want to plot. In our case it is the regime category

anckar %>% 
 filter(year == max(year)) %>%
  inner_join(world_map, by = c("cown")) %>%
  mutate(regimebroadcat = ifelse(region == "Libya", 'Military rule', regimebroadcat)) %>% 
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = regimebroadcat), color = "white", size = 1)

2018-Regimes Download

Bill Murray Laughing GIF - Find & Share on GIPHY

We can next look at the PIPE dataset and see which countries have been uninterrupted republics over time.

pipe <- democracyData::redownload_pipe()

We graph out the max_republic_age variable with geom_bar()


pipe %>% 
  mutate(iso_lower = tolower(countrycode::countrycode(PIPE_cowcodes, "cown", "iso2c"))) %>% 
  mutate(country_name = countrycode::countrycode(PIPE_cowcodes, "cown", "country.name")) %>% 
  filter(year == max(year)) %>% 
  filter(max_republic_age > 100) %>% 
  ggplot(aes(x = reorder(country_name, max_republic_age), y = max_republic_age)) + 
  geom_bar(stat = "identity", width = 0.7, aes(fill = as.factor(europe))) +
  ggflags::geom_flag(aes(y = max_republic_age, x = country_name, 
                         country = iso_lower), size = 15) + 
  coord_flip() +  ggthemes::theme_pander() -> pipe_plot

And fix up some aesthetics:

pipe_plot + 
  theme(axis.text = element_text(size = 30),
        legend.text = element_text(size = 30),
        legend.title = element_blank(),
        axis.title = element_blank(),
        legend.position = "bottom") + 
  labs(y= "", x = "") + 
scale_fill_manual(values =  c("#d62828", "#457b9d"),
 labels = c("Former British Settler Colony", "European Country"))

I added the header and footer in Canva.com

Bill Murray Ok GIF - Find & Share on GIPHY

Download democracy data with democracyData package in R

Packages we will need:

library(democracyData)
library(tidyverse)
library(magrittr)       # for pipes
library(ggstream)       # proportion plots
library(ggthemes)       # nice ggplot themes
library(forcats)        # reorder factor variables
library(ggflags)        # add flags
library(peacesciencer)  # more great polisci data
library(countrycode)    # add ISO codes to countries

This blog will highlight some quick datasets that we can download with this nifty package.

To install the democracyData package, it is best to do this via the github of Xavier Marquez:

remotes::install_github("xmarquez/democracyData", force = TRUE)
library(democracyData)

We can download the dataset from the Democracy and Dictatorship Revisited paper by Cheibub Gandhi and Vreeland (2010) with the redownload_pacl() function. It’s all very simple!

pacl <- redownload_pacl()

Happy Maya Rudolph GIF by PeacockTV - Find & Share on GIPHY

This gives us over 80 variables, with information on things such as regime type, geographical data, the name and age of the leaders, and various democracy variables.

We are going to focus on the different regimes across the years.

The six-fold regime classification Cheibub et al (2010) present is rooted in the dichotomous classification of regimes as democracy and dictatorship introduced in Przeworski et al. (2000). They classify according to various metrics, primarily by examining the way in which governments are removed from power and what constitutes the “inner sanctum” of power for a given regime. Dictatorships can be distinguished according to the characteristics of these inner sanctums. Monarchs rely on family and kin networks along with consultative councils; military rulers confine key potential rivals from the armed forces within juntas; and, civilian dictators usually create a smaller body within a regime party—a political bureau—to coopt potential rivals. Democracies highlight their category, depending on how the power of a given leadership ends

We can change the regime variable from numbers to a factor variables, describing the type of regime that the codebook indicates:

pacl %<>% 
  mutate(regime_name = ifelse(regime == 0, "Parliamentary democracies",
       ifelse(regime == 1, "Mixed democracies",
       ifelse(regime == 2, "Presidential democracies",
       ifelse(regime == 3, "Civilian autocracies",
       ifelse(regime == 4, "Military dictatorships",
       ifelse(regime ==  5,"Royal dictatorships", regime))))))) %>%
  mutate(regime = as.factor(regime))

Before we make the graph, we can give traffic light hex colours to the types of democracy. This goes from green (full democracy) to more oranges / reds (autocracies):

regime_palette <- c("Military dictatorships" = "#f94144", 
                    "Civilian autocracies" = "#f3722c", 
                    "Royal dictatorships" =  "#f8961e", 
                    "Mixed democracies" = "#f9c74f", 
                    "Presidential democracies" = "#90be6d", 
                    "Parliamentary democracies" = "#43aa8b")

We will use count() to count the number of countries in each regime type and create a variable n

pacl %>% 
  mutate(regime_name = as.factor(regime_name)) %>% 
  mutate(regime_name = fct_relevel(regime_name, 
 levels = c("Parliamentary democracies", 
           "Presidential democracies",
           "Mixed democracies",
           "Royal dictatorships",
           "Civilian autocracies",
           "Military dictatorships"))) %>% 
  group_by(year, un_continent_name) %>% 
  filter(!is.na(regime_name)) %>% 
  count(regime_name) %>% 
  ungroup() %>%  
  filter(un_continent_name != "") %>%
  filter(un_continent_name != "Oceania") -> pacl_count

Cant Handle It Kristen Wiig GIF by Saturday Night Live - Find & Share on GIPHY

We have all the variables we need.

We can now graph the count variables across different regions.

pacl_count %>% 
  ggplot(aes(x = year, y = n, 
             groups = regime_name, 
             fill = regime_name)) +
  ggstream::geom_stream(type = "proportion") + 
  facet_wrap(~un_continent_name) + 
  scale_fill_manual(values = regime_palette) + 
  ggthemes::theme_fivethirtyeight() + 
  theme(legend.title = element_blank(),
        text = element_text(size = 30))

I added the title and source header / footer section on canva.com to finish the graph.

Of course, the Cheibub et al (2010) dataset is not the only one that covers types of regimes.

Curtis Bell in 2016 developed the Rulers, Elections, and Irregular Governance Dataset (REIGN) dataset.

This describes political conditions in every country (including tenures and personal characteristics of world leaders, the types of political institutions and political regimes in effect, election outcomes and election announcements, and irregular events like coups)

Again, to download this dataset with the democracyData package, it is very simple:

reign <- download_reign()

Saturday Night Live Happy Dance GIF - Find & Share on GIPHY

I want to compare North and South Korea since their independence from Japan and see the changes in regimes and democracy scores over the years.

Next, we can easily download Freedom House or Polity 5 scores.

The Freedom House Scores default dataset ranges from 1972 to 2020, covering around 195 countries (depending on the year)

fh <- download_fh()

Alternatively, we can look at Polity Scores. This default dataset countains around 190 ish countries (again depending on the year and the number of countries in existance at that time) and covers a far longer range of years; from 1880 to 2018.

polityiv <- redownload_polityIV()

Alternatively, to download democracy scores, we can also use the peacesciencer dataset. Click here to read more about this package:

Building a dataset for political science analysis in R, PART 2

democracy_scores <- peacesciencer::create_stateyears() %>% 
  add_gwcode_to_cow() %>%
  add_democracy()

With inner_join() we can merge these two datasets together:

reign %>% 
  select(ccode = cown, everything()) %>% 
  inner_join(democracy_scores, by = c("year", "ccode")) -> reign_demo

We next choose the years and countries for our plot.

Also, for the geom_flag() we will need the country name to be lower case ISO code. Click here to read more about the ggflags package.

reign_demo %>% 
    filter(year > 1945) %>% 
    mutate(gwf_regimetype = str_to_title(gwf_regimetype)) %>% 
    mutate(iso2c_lower = tolower(countrycode::countrycode(reign_country, "country.name", "iso2c"))) %>% 
filter(reign_country == "Korea North" | reign_country == "Korea South") -> korea_reign

Add circular flags to maps and graphs with ggflags package in R

We may to use specific hex colours for our graphs. I always prefer these deeper colours, rather than the pastel defaults that ggplot uses. I take them from coolors.co website!

korea_palette <- c("Military" = "#5f0f40",
                   "Party-Personal" = "#9a031e",
                   "Personal" = "#fb8b24",
                   "Presidential" = "#2a9d8f",
                   "Parliamentary" = "#1e6091")

We will add a flag to the start of the graph, so we create a mini dataset that only has the democracy scores for the first year in the dataset.

  korea_start <- korea_reign %>%
    group_by(reign_country) %>% 
    slice(which.min(year)) %>% 
    ungroup()

Next we plot the graph

korea_reign %>% 
 ggplot(aes(x = year, y = v2x_polyarchy, groups = reign_country))  +
    geom_line(aes(color = gwf_regimetype), 
         size = 7, alpha = 0.7, show.legend = FALSE) +
    geom_point(aes(color = gwf_regimetype), size = 7, alpha = 0.7) +
    ggflags::geom_flag(data = korea_start, 
       aes(y = v2x_polyarchy, x = 1945, country = iso2c_lower), 
           size = 20) -> korea_plot

And then work on the aesthetics of the plot:

korea_plot + ggthemes::theme_fivethirtyeight() + 
    ggtitle("Electoral democracy on Korean Peninsula") +
    labs(subtitle = "Sources: Teorell et al. (2019) and Curtis (2016)") +
    xlab("Year") + 
    ylab("Democracy Scores") + 
    theme(plot.title = element_text(face = "bold"),
      axis.ticks = element_blank(),
      legend.box.background = element_blank(),
      legend.title = element_blank(),
      legend.text = element_text(size = 40),
      text = element_text(size = 30)) +
    scale_color_manual(values = korea_palette) + 
    scale_x_continuous(breaks = round(seq(min(korea_reign$year), max(korea_reign$year), by = 5),1))

While North Korea has been consistently ruled by the Kim dynasty, South Korea has gone through various types of government and varying levels of democracy!

References

Cheibub, J. A., Gandhi, J., & Vreeland, J. R. (2010). . Public choice, 143(1), 67-101.

Przeworski, A., Alvarez, R. M., Alvarez, M. E., Cheibub, J. A., Limongi, F., & Neto, F. P. L. (2000). Democracy and development: Political institutions and well-being in the world, 1950-1990 (No. 3). Cambridge University Press.

Scrape and graph election polling data from Wikipedia

Packages we will need:

library(rvest)
library(tidyverse)
library(magrittr)
library(forcats)
library(janitor)

With the Korean Presidential elections coming up, I wanted to graph the polling data since the beginning of this year.

Happy Paul Rudd GIF by Saturday Night Live - Find & Share on GIPHY

The data we can use is all collated together on Wikipedia.

Click here to read more about using the rvest package for scraping data from websites and click here to read the CRAN PDF for the package.

poll_html <- read_html("https://en.wikipedia.org/wiki/2022_South_Korean_presidential_election")

poll_tables <- poll_html %>% html_table(header = TRUE, fill = TRUE)

There are 22 tables on the page in total.

I count on the page that the polling data is the 16th table on the page, so extract index [[16]] from the list

feb_poll <- poll_tables[[16]]
View(feb_poll)

It is a bit messy, so we will need to do a bit of data cleaning before we can graph.

John Mulaney Snl GIF by Saturday Night Live - Find & Share on GIPHY

First the names of many variables are missing or on row 2 / 3 of the table, due to pictures and split cells in Wikipedia.

 [1] "Polling firm / Client" "Polling firm / Client" "Fieldwork  date"       "Sample  size" "Margin of  error"     
 [6] ""       ""      ""     ""      ""                     
[11] ""  "Others/Undecided"   "Lead"

The clean_names() function from the janitor package does a lot of the brute force variable name cleaning!

feb_poll %<>% clean_names()

We now have variable names rather than empty column names, at least.

 [1] "polling_firm_client" "polling_firm_client_2" "fieldwork_date"        "sample_size"  "margin_of_error"      
 [6] "x"  "x_2"  "x_3"  "x_4"  "x_5"                  
[11] "x_6"  "others_undecided"   "lead"

We can choose the variables we want and rename the x variables with the names of each candidate, according to Wikipedia.

feb_poll %<>% 
  select(fieldwork_date, 
         Lee = x, 
         Yoon = x_2,
         Shim = x_3,
         Ahn = x_4, 
         Kim = x_5, 
         Heo = x_6,
         others_undecided)

We then delete the rows that contain text not related to the poll number values.

feb_poll = feb_poll[-25,]
feb_poll = feb_poll[-81,]
feb_poll = feb_poll[-1,]

I want to clean up the fieldwork_date variable and convert it from character to Date class.

First I found that very handy function on Stack Overflow that extracts the last n characters from a string variable.

substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))
}

If we look at the table, some of the surveys started in Feb but ended in March. We want to extract the final section (i.e. the March section) and use that.

So we use grepl() to find rows that have both Feb AND March, and just extract the March section. If it only has one of those months, we leave it as it is.

feb_poll %<>% 
  mutate(clean_date = ifelse(grepl("Feb", fieldwork_date) & grepl("Mar", fieldwork_date), substrRight(fieldwork_date, 5), fieldwork_date))

Next want to extract the three letter date from this variables and create a new month variable

feb_poll %<>%
  mutate(month = substrRight(clean_date, 3))

Following that, we use the parse_number() function from tidyr package to extract the first number found in the string and create a day_number varible (with integer class now)

 feb_poll %<>%
   mutate(day_number = parse_number(clean_date))

We want to take these two variables we created and combine them together with the unite() function from tidyr again! We want to delete the variables after we unite them. But often I want to keep the original variables, so usually I change the argument remove to FALSE.

We indicate we want to have nothing separating the vales with the sep = "" argument

 feb_poll %<>%
     unite("date", day_number:month, sep = "", remove = TRUE)

And we convert this new date into Date class with as.Date() function.

Here is a handy cheat sheet to help choose the appropriate % key so the format recognises the dates. I will never memorise these values, so I always need to refer to this site.

We have days as numbers (1, 2, 3) and abbreviated 3 character month (Jan, Feb, Mar), so we choose %d and %b

feb_poll %<>%
  mutate(dates_format = as.Date(date, "%d%b")) %>% 
  select(dates_format, Lee:others_undecided)

Next, we will use the pivot_longer() function to combine all the poll number values into one column. This will make it far easier to plot later.

feb_poll %<>%
  pivot_longer(!dates_format, names_to = "candidate", values_to = "favour")

After than, we need to clean the actual numbers, remove the percentage signs and convert from character to number class. We use the str_extract() and the regex code to extract the number and not keep the percentage sign.

feb_poll %<>%
    mutate(candidate = as.factor(candidate),
 favour_percent = str_extract(favour, "\\d+\\.*\\d*")) %>% 
   mutate(favour_percent = as.integer(favour_percent))

Some of the different polls took place on the same day. So we will take the average poll favourability value for each candidate on each day with the group_by() function

feb_poll %<>%
  group_by(dates_format, candidate) %>% 
  mutate(favour_percent_mean = mean(favour_percent, na.rm = TRUE)) %>% 
  ungroup() %>% 
  select(candidate, dates_format, favour_percent_mean)

And this is how the cleaned up data should look!

We repeat for the 17th and 16th tables, which contain data going back to the beginning of January 2022

early_feb_poll <- poll_tables[[17]]
early_feb_poll = early_feb_poll[-37,]
early_feb_poll = early_feb_poll[-1,]

We repeat the steps from above with early Feb in one chunk

early_feb_poll %<>%
  clean_names() %>% 
  mutate(month = substrRight(fieldwork_date, 3))  %>% 
  mutate(day_number = parse_number(fieldwork_date)) %>%
  unite("date", day_number:month, sep = "", remove = FALSE) %>% 
  mutate(dates_format = as.Date(date, "%d%b")) %>% 
  select(dates_format, 
         Lee = lee_jae_myung, 
         Yoon = yoon_seok_youl,
         Shim = sim_sang_jung,
         Ahn = ahn_cheol_soo, 
         Kim = kim_dong_yeon, 
         Heo = huh_kyung_young,
         others_undecided) %>% 
  pivot_longer(!dates_format, names_to = "candidate", values_to = "favour") %>% 
  mutate(candidate = as.factor(candidate),
         favour_percent = str_extract(favour, "\\d+\\.*\\d*")) %>% 
  mutate(favour_percent = as.integer(favour_percent)) %>% 
  group_by(dates_format, candidate) %>% 
  mutate(favour_percent_mean = mean(favour_percent, na.rm = TRUE)) %>% 
  ungroup() %>% 
  select(candidate, dates_format, favour_percent_mean)

And we use rbind() to combine the two data.frames

polls <- rbind(feb_poll, early_feb_poll)

Next we repeat with January data:

jan_poll <- poll_tables[[18]]

jan_poll = jan_poll[-34,]
jan_poll = jan_poll[-1,]

jan_poll %<>% 
  clean_names() %>% 
  mutate(month = substrRight(fieldwork_date, 3))  %>% 
  mutate(day_number = parse_number(fieldwork_date)) %>%   # drops any non-numeric characters before or after the first number. 
  unite("date", day_number:month, sep = "", remove = FALSE) %>% 
  mutate(dates_format = as.Date(date, "%d%b")) %>% 
  select(dates_format, 
         Lee = lee_jae_myung, 
         Yoon = yoon_seok_youl,
         Shim = sim_sang_jung,
         Ahn = ahn_cheol_soo, 
         Kim = kim_dong_yeon, 
         Heo = huh_kyung_young,
         others_undecided) %>% 
  pivot_longer(!dates_format, names_to = "candidate", values_to = "favour") %>% 
  mutate(candidate = as.factor(candidate),
         favour_percent = str_extract(favour, "\\d+\\.*\\d*")) %>% 
  mutate(favour_percent = as.integer(favour_percent)) %>% 
  group_by(dates_format, candidate) %>% 
  mutate(favour_percent_mean = mean(favour_percent, na.rm = TRUE)) %>% 
  ungroup() %>% 
  select(candidate, dates_format, favour_percent_mean)

And bind to our combined data.frame:

polls <- rbind(polls, jan_poll)

We can create variables to help us filter different groups of candidates. If we want to only look at the largest candidates, we can makes an important variable and then filter

We can lump the candidates that do not have data from every poll (i.e. the smaller candidate) and add them into the “other_undecided” category with the fct_lump_min() function from the forcats package

polls %>% 
  mutate(important = ifelse(candidate %in% c("Ahn", "Yoon", "Lee", "Shim"), 1, 0)) %>% 
  mutate(few_candidate = fct_lump_min(candidate, min = 110, other_level = "others_undecided")) %>% 
  group_by(few_candidate, dates_format) %>% 
  filter(important == 1) -> poll_data

I want to only look at the main two candidates from the main parties that have been polling in the 40% range – Lee and Yoon – as well as the data for Ahn (who recently dropped out and endorsed Yoon).

poll_data %>% 
  filter(candidate %in% c("Lee", "Yoon", "Ahn")) -> lee_yoon_data

We take the official party hex colors for the graph and create a vector to use later with the scale_color_manual() function below:

party_palette <- c(
  "Ahn" = "#df550a",
  "Lee" = "#00a0e2",
  "Yoon" = "#e7001f")

And we plot the variables.

lee_yoon_data %>% 
  ggplot(aes(x = dates_format, y = favour_percent_mean,
             groups = candidate, color = candidate)) + 
  geom_line( size = 2, alpha = 0.8) +
  geom_point(fill = "#5e6472", shape = 21, size = 4, stroke = 3) + 
  labs(title = "Polling data for Korean Presidential Election", subtitle = "Source: various polling companies, via Wikipedia") -> poll_graph

The bulk of aesthetics for changing the graph appearance in the theme()

poll_graph + theme(panel.border = element_blank(),
        legend.position = "bottom",        
        text = element_text(size = 15, color = "white"),
        plot.title = element_text(size = 40),
        legend.title = element_blank(),
        legend.text = element_text(size = 50, color = "white"),
        axis.text.y = element_text(size = 20), 
        axis.text.x = element_text(size = 20),
        legend.background = element_rect(fill = "#5e6472"),
        axis.title = element_blank(),
        axis.text = element_text(color = "white", size = 20),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        legend.key = element_rect(fill = "#5e6472"),
        plot.background = element_rect(fill = "#5e6472"),
        panel.background = element_rect(fill = "#5e6472")) +
  scale_color_manual(values = party_palette)

Last, with the annotate() functions, we can also add an annotation arrow and text to add some more information about Ahn Cheol-su the candidate dropping out.

  annotate("text", x = as.Date("2022-02-11"), y = 13, label = "Ahn dropped out just as the polling blackout began", size = 10, color = "white") +
  annotate(geom = "curve", x = as.Date("2022-02-25"), y = 13, xend = as.Date("2022-03-01"), yend = 10, 
    curvature = -.3, arrow = arrow(length = unit(2, "mm")), size = 1, color = "white")

We will just have to wait until next Wednesday / Thursday to see who is the winner ~

Over It Reaction GIF by Saturday Night Live - Find & Share on GIPHY

Exploratory Data Analysis and Descriptive Statistics for Political Science Research in R

Packages we will use:

library(tidyverse)      # of course
library(ggridges)       # density plots
library(GGally)         # correlation matrics
library(stargazer)      # tables
library(knitr)          # more tables stuff
library(kableExtra)     # more and more tables
library(ggrepel)        # spread out labels
library(ggstream)       # streamplots
library(bbplot)         # pretty themes
library(ggthemes)       # more pretty themes
library(ggside)         # stack plots side by side
library(forcats)        # reorder factor levels

Before jumping into any inferentional statistical analysis, it is helpful for us to get to know our data.

That always means plotting and visualising the data and looking at the spread, the mean, distribution and outliers in the dataset.

Before we plot anything, a simple package that creates tables in the stargazer package. We can examine descriptive statistics of the variables in one table.

Click here to read this practically exhaustive cheat sheet for the stargazer package by Jake Russ. I refer to it at least once a week. Thank you, Jack.

https://www.jakeruss.com/cheatsheets/stargazer/

I want to summarise a few of the stats, so I write into the summary.stat() argument the number of observations, the mean, median and standard deviation.

The kbl() and kable_classic() will change the look of the table in R (or if you want to copy and paste the code into latex with the type = "latex" argument).

In HTML, they do not appear.

Seth Meyers Ok GIF by Late Night with Seth Meyers - Find & Share on GIPHY

To find out more about the knitr kable tables, click here to read the cheatsheet by Hao Zhu.

Choose the variables you want, put them into a data.frame and feed them into the stargazer() function

stargazer(my_df_summary, 
          covariate.labels = c("Corruption index",
                               "Civil society strength", 
                               'Rule of Law score',
                               "Physical Integerity Score",
                               "GDP growth"),
          summary.stat = c("n", "mean", "median", "sd"), 
          type = "html") %>% 
  kbl() %>% 
  kable_classic(full_width = F, html_font = "Times", font_size = 25)


Statistic	N	Mean	Median	St. Dev.

Corruption index	179	0.477	0.519	0.304
Civil society strength	179	0.670	0.805	0.287
Rule of Law score	173	7.451	7.000	4.745
Physical Integerity Score	179	0.696	0.807	0.284
GDP growth	163	0.019	0.020	0.032

Next, we can create a barchart to look at the different levels of variables across categories. We can look at the different regime types (from complete autocracy to liberal democracy) across the six geographical regions in 2018 with the geom_bar().

my_df %>% 
  filter(year == 2018) %>%
  ggplot() +
  geom_bar(aes(as.factor(region),
               fill = as.factor(regime)),
           color = "white", size = 2.5) -> my_barplot

And we can add more theme changes

my_barplot + bbplot::bbc_style() + 
  theme(legend.key.size = unit(2.5, 'cm'),
        legend.text = element_text(size = 15),
        text = element_text(size = 15)) +
  scale_fill_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c")) + 
  scale_color_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c"))

This type of graph also tells us that Sub-Saharan Africa has the highest number of countries and the Middle East and North African (MENA) has the fewest countries.

However, if we want to look at each group and their absolute percentages, we change one line: we add geom_bar(position = "fill"). For example we can see more clearly that over 50% of Post-Soviet countries are democracies ( orange = electoral and blue = liberal democracy) as of 2018.

We can also check out the density plot of democracy levels (as a numeric level) across the six regions in 2018.

With these types of graphs, we can examine characteristics of the variables, such as whether there is a large spread or normal distribution of democracy across each region.

my_df %>% 
  filter(year == 2018) %>%
  ggplot(aes(x = democracy_score, y = region, fill = regime)) +
  geom_density_ridges(color = "white", size = 2, alpha = 0.9, scale = 2) -> my_density_plot

And change the graph theme:

my_density_plot + bbplot::bbc_style() + 
  theme(legend.key.size = unit(2.5, 'cm')) +
  scale_fill_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c")) + 
  scale_color_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c"))

Click here to read more about the ggridges package and click here to read their CRAN PDF.

Next, we can also check out Pearson’s correlations of some of the variables in our dataset. We can make these plots with the GGally package.

The ggpairs() argument shows a scatterplot, a density plot and correlation matrix.

my_df %>%
  filter(year == 2018) %>%
  select(regime, 
         corruption, 
         civ_soc, 
         rule_law, 
         physical, 
         gdp_growth) %>% 
  ggpairs(columns = 2:5, 
          ggplot2::aes(colour = as.factor(regime), 
          alpha = 0.9)) + 
  bbplot::bbc_style() +
  scale_fill_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c")) + 
  scale_color_manual(values = c("#9a031e","#00a896","#e36414","#0f4c5c"))

Click here to read more about the GGally package and click here to read their CRAN PDF.

We can use the ggside package to stack graphs together into one plot.

There are a few arguments to add when we choose where we want to place each graph.

For example, geom_xsideboxplot(aes(y = freedom_house), orientation = "y") places a boxplot for the three Freedom House democracy levels on the top of the graph, running across the x axis. If we wanted the boxplot along the y axis we would write geom_ysideboxplot(). We add orientation = "y" to indicate the direction of the boxplots.

Next we indiciate how big we want each graph to be in the panel with theme(ggside.panel.scale = .5) argument. This makes the scatterplot take up half and the boxplot the other half. If we write .3, the scatterplot takes up 70% and the boxplot takes up the remainning 30%. Last we indicade scale_xsidey_discrete() so the graph doesn’t think it is a continuous variable.

We add Darjeeling Limited color palette from the Wes Anderson movie.

Click here to learn about adding Wes Anderson theme colour palettes to graphs and plots.

my_df %>%
 filter(year == 2018) %>% 
 filter(!is.na(fh_number)) %>% 
  mutate(freedom_house = ifelse(fh_number == 1, "Free", 
         ifelse(fh_number == 2, "Partly Free", "Not Free"))) %>%
  mutate(freedom_house = forcats::fct_relevel(freedom_house, "Not Free", "Partly Free", "Free")) %>% 
ggplot(aes(x = freedom_from_torture, y = corruption_level, colour = as.factor(freedom_house))) + 
  geom_point(size = 4.5, alpha = 0.9) +
  geom_smooth(method = "lm", color ="#1d3557", alpha = 0.4) +  
  geom_xsideboxplot(aes(y = freedom_house), orientation = "y", size = 2) +
  theme(ggside.panel.scale = .3) +
  scale_xsidey_discrete() +
  bbplot::bbc_style() + 
  facet_wrap(~region) + 
  scale_color_manual(values= wes_palette("Darjeeling1", n = 3))

The next plot will look how variables change over time.

We can check out if there are changes in the volume and proportion of a variable across time with the geom_stream(type = "ridge") from the ggstream package.

In this instance, we will compare urban populations across regions from 1800s to today.

my_df %>% 
  group_by(region, year) %>% 
  summarise(mean_urbanization = mean(urban_population_percentage, na.rm = TRUE)) %>% 
  ggplot(aes(x = year, y = mean_urbanization, fill = region)) +
  geom_stream(type = "ridge") -> my_streamplot

And add the theme changes

  my_streamplot + ggthemes::theme_pander() + 
  theme(
legend.title = element_blank(),
        legend.position = "bottom",
        legend.text = element_text(size = 25),
        axis.text.x = element_text(size = 25),
        axis.title.y = element_blank(),
        axis.title.x = element_blank()) +
  scale_fill_manual(values = c("#001219",
                               "#0a9396",
                               "#e9d8a6",
                               "#ee9b00", 
                               "#ca6702",
                               "#ae2012"))

Click here to read more about the ggstream package and click here to read their CRAN PDF.

We can also look at interquartile ranges and spread across variables.

We will look at the urbanization rate across the different regions. The variable is calculated as the ratio of urban population to total country population.

Before, we will create a hex color vector so we are not copying and pasting the colours too many times.

my_palette <- c("#1d3557",
                "#0a9396",
                "#e9d8a6",
                "#ee9b00", 
                "#ca6702",
                "#ae2012")

We use the facet_wrap(~year) so we can separate the three years and compare them.

my_df %>% 
  filter(year == 1980 | year == 1990 | year == 2000)  %>% 
  ggplot(mapping = aes(x = region, 
                       y = urban_population_percentage, 
                       fill = region)) +
  geom_jitter(aes(color = region),
              size = 3, alpha = 0.5, width = 0.15) +
  geom_boxplot(alpha = 0.5) + facet_wrap(~year) + 
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette) + 
  coord_flip() + 
  bbplot::bbc_style()

If we want to look more closely at one year and print out the country names for the countries that are outliers in the graph, we can run the following function and find the outliers int he dataset for the year 1990:

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

We can then choose one year and create a binary variable with the function

my_df_90 <- my_df %>% 
  filter(year == 1990) %>% 
  filter(!is.na(urban_population_percentage))

my_df_90$my_outliers <- is_outlier(my_df_90$urban_population_percentage)

And we plot the graph:

my_df_90 %>% 
  ggplot(mapping = aes(x = region, y = urban_population_percentage, fill = region)) +
  geom_jitter(aes(color = region), size = 3, alpha = 0.5, width = 0.15) +
  geom_boxplot(alpha = 0.5) +
  geom_text_repel(data = my_df_90[which(my_df_90$my_outliers == TRUE),],
            aes(label = country_name), size = 5) + 
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette) + 
  coord_flip() + 
  bbplot::bbc_style()

In the next blog post, we will look at t-tests, ANOVAs (and their non-parametric alternatives) to see if the difference in means / medians is statistically significant and meaningful for the underlying population.

Bo Burnham What GIF - Find & Share on GIPHY

Comparing proportions across time with ggstream in R

Packages we need:

library(tidyverse)
library(ggstream)
library(magrittr)
library(bbplot)
library(janitor)

We can look at proportions of energy sources across 10 years in Ireland. Data source comes from: https://www.seai.ie/data-and-insights/seai-statistics/monthly-energy-data/electricity/

Before we graph the energy sources, we can tidy up the variable names with the janitor package. We next select column 2 to 12 which looks at the sources for electricity generation. Other rows are aggregates and not the energy-related categories we want to look at.

Next we pivot the dataset longer to make it more suitable for graphing.

We can extract the last two digits from the month dataset to add the year variable.

elec %<>% 
  janitor::clean_names()

elec[2:12,] -> elec

el <- elec %>% 
  pivot_longer(!electricity_generation_g_wh, 
               names_to = "month", values_to = "value") %>% 

substrRight <- function(x, n){
  substr(x, nchar(x) - n + 1, nchar(x))}

el$year <- substrRight(el$month, 2)

el %<>% select(year, month, elec_type = electricity_generation_g_wh, elec_value = value)

First we can use the geom_stream from the ggstream package. There are three types of plots: mirror, ridge and proportion.

First we will plot the proportion graph.

Select the different types of energy we want to compare, we can take the annual values, rather than monthly with the tried and trusted group_by() and summarise().

Optionally, we can add the bbc_style() theme for the plot and different hex colors with scale_fill_manual() and feed a vector of hex values into the values argument.

el %>% 
  filter(elec_type == "Oil" | 
           elec_type == "Coal" |
           elec_type == "Peat" | 
           elec_type == "Hydro" |
           elec_type == "Wind" |
           elec_type == "Natural Gas") %>% 
  group_by(year, elec_type) %>%
  summarise(annual_value = sum(elec_value, na.rm = TRUE)) %>% 
  ggplot(aes(x = year, 
             y = annual_value,
             group = elec_type,
             fill = elec_type)) +
  ggstream::geom_stream(type = "proportion") + 
  bbplot::bbc_style() +
  labs(title = "Comparing energy proportions in Ireland") +
  scale_fill_manual(values = c("#f94144",
                               "#277da1",
                               "#f9c74f",
                               "#f9844a",
                               "#90be6d",
                               "#577590"))

With ggstream::geom_stream(type = "mirror")

With ggstream::geom_stream(type = "ridge")

Without the ggstream package, we can recreate the proportion graph with slightly different looks

https://giphy.com/gifs/filmeditor-clueless-movie-l0ErMA0xAS1Urd4e4

el %>% 
  filter(elec_type == "Oil" | 
           elec_type == "Coal" |
           elec_type == "Peat" | 
           elec_type == "Hydro" |
           elec_type == "Wind" |
           elec_type == "Natural Gas") %>% 
  group_by(year, elec_type) %>%
  summarise(annual_value = sum(elec_value, na.rm = TRUE)) %>% 
  ggplot(aes(x = year, 
             y = annual_value,
             group = elec_type,
             fill = elec_type)) +
  geom_area(alpha=0.8 , size=1.5, colour="white") +
  bbplot::bbc_style() +
  labs(title = "Comparing energy proportions in Ireland") +
  theme(legend.key.size = unit(2, "cm")) + 
  scale_fill_manual(values = c("#f94144",
                               "#277da1",
                               "#f9c74f",
                               "#f9844a",
                               "#90be6d",
                               "#577590"))

Love You Hug GIF by Filmin - Find & Share on GIPHY

Create density plots with ggridges package in R

Packages we will need:

library(tidyverse)
library(ggridges)
library(ggimage)  # to add png images
library(bbplot)   # for pretty graph themes

We will plot out the favourability opinion polls for the three main political parties in Ireland from 2016 to 2020. Data comes from Louwerse and Müller (2020)

Happy Danny Devito GIF by It's Always Sunny in Philadelphia - Find & Share on GIPHY

Before we dive into the ggridges plotting, we have a little data cleaning to do. First, we extract the last four “characters” from the date string to create a year variable.

I took this quick function from a StackOverflow response:

substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))}

polls_csv$year <- substrRight(polls_csv$Date, 4)

Next, pivot the data from wide to long format.

More information of pivoting data with dplyr can be found here. I tend to check it at least once a month as the arguments refuse to stay in my head.

I only want to take the main parties in Ireland to compare in the plot.

polls <- polls_csv %>%
  select(year, FG:SF) %>% 
  pivot_longer(!year, names_to = "party", values_to = "opinion_poll")

I went online and found the logos for the three main parties (sorry, Labour) and saved them in the working directory I have for my RStudio. That way I can call the file with the prefix “~/**.png” rather than find the exact location they are saved on the computer.

polls %>% 
  filter(party == "FF" | party == "FG" | party == "SF" ) %>% 
  mutate(image = ifelse(party=="FF","~/ff.png",
 ifelse(party=="FG","~/fg.png", "~/sf.png"))) -> polls_three

Now we are ready to plot out the density plots for each party with the geom_density_ridges() function from the ggridges package.

We will add a few arguments into this function.

We add an alpha = 0.8 to make each density plot a little transparent and we can see the plots behind.

The scale = 2 argument pushes all three plots togheter so they are slightly overlapping. If scale =1, they would be totally separate and 3 would have them overlapping far more.

The rel_min_height = 0.01 argument removes the trailing tails from the plots that are under 0.01 density. This is again for aesthetics and just makes the plot look slightly less busy for relatively normally distributed densities

The geom_image takes the images and we place them at the beginning of the x axis beside the labels for each party.

Last, we use the bbplot package BBC style ggplot theme, which I really like as it makes the overall graph look streamlined with large font defaults.

polls_three %>% 
  ggplot(aes(x = opinion_poll, y = as.factor(party))) +  
  geom_density_ridges(aes(fill = party), 
                      alpha = 0.8, 
                      scale = 2,
                      rel_min_height = 0.01) + 
  ggimage::geom_image(aes(y = party, x= 1, image = image), asp = 0.9, size = 0.12) + 
  facet_wrap(~year) + 
  bbplot::bbc_style() +
  scale_fill_manual(values = c("#f2542d", "#edf6f9", "#0e9594")) +
  theme(legend.position = "none") + 
  labs(title = "Favourability Polls for the Three Main Parties in Ireland", subtitle = "Data from Irish Polling Indicator (Louwerse & Müller, 2020)")