Graphing female politicians in Irish parliament with R PART 2: Trends and Maps

Packages we will need

library(tidyverse)
library(magrittr)
library(waffle)
library(geojsonio)
library(sf)

In PART 1, we looked at the gender package to help count the number of women in the 33rd Irish Parliament.

I repeated that for every session since 1921. The first and second Dail are special in Ireland as they are technically pre-partition.

Cleaned up the data aaaand now we have a full dataset with constituencies data.

If anyone wants a copy of the dataset, I can upload it here for those who are curious ~

So first… a simple pie chart!

First we calculate proportion of seats held by women

dail %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s")) %>%
  group_by(decade) %>% 
  ungroup() %>% 
  group_by(decade, gender) %>% 
  count() %>% 
  group_by(decade) %>% 
  mutate(proportion = n / sum(n)) -> dail_pie

# A tibble: 22 × 4
# Groups:   decade [11]
   decade gender     n proportion
   <chr>  <chr>  <int>      <dbl>
 1 1920s  female    20     0.0261
 2 1920s  male     747     0.974 
 3 1930s  female    10     0.0172
 4 1930s  male     572     0.983 
 5 1940s  female    12     0.0284
 6 1940s  male     411     0.972 
 7 1950s  female    16     0.0363
 8 1950s  male     425     0.964 
 9 1960s  female    11     0.0255
10 1960s  male     421     0.975 
# 12 more rows

We will be looking at how proportions changed over the decades.

When using facet_wrap() with coord_polar(), it’s a pain in the arse.

This is because coord_polar() does not automatically allow each facet to have a different scale. Instead, coord_polar() treats all facets as having the same axis limits.

This will mess everything up.

If we don’t change the coord_polar(), we will just distort pie charts when the facet groups have different total values. There will be weird gaps and make some phantom pacman non-charts.

function() TRUE is an anonymous function that always returns TRUE.

my_coord_polar$is_free <- function() TRUE forces coord_polar() to allow different scales for each facet.

In our case, we call my_coord_polar$is_free, which means that whenever ggplot2 checks whether the coordinate system allows free scales across facets, it will now always return TRUE!!!

Overriding is_free() to always return TRUE signals to ggplot2 that coord_polar() means that our pie charts NOOWW will respect the "free" scaling specified in facet_wrap(scales = "free").

my_coord_polar <- coord_polar(theta = "y")
my_coord_polar$is_free <- function() TRUE

If you want to look more at this, check out this blog:

How to graph proportions with the waffle and treemapify packages in R

And we can go and create the ggplot:

dail_pie %>%
  ggplot(aes(x = "", 
         y = proportion, 
         fill = as.factor(gender))) +

  geom_bar(stat="identity", width = 1) +
  
  geom_text(
    data = . %>% filter(gender == "female"), aes(label = scales::percent(proportion, 
    accuracy = 0.1)), 
    color = "white",
    size = 8) +
  
  my_coord_polar +

  facet_wrap(~decade, scales = "free") + 
  scale_fill_manual(values =c("#bc4749", "#003049")) +
  # my_style() +
  theme(axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid = element_blank(), 
        panel.background = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank())

And with Canva, I add the arrows and titles~

Sorry I couldn’t figure it out in R. I just hate all the times I need to re-run graphics to move a text or number by a nano-centimeter. Websites like Canva are just far better for my sanity and short attention span.

Next, we can make a facetted waffle plot!

dail %>% 
  group_by(decade) %>% 
  ungroup() %>% 
  group_by(decade, gender) %>% 
  count() %>% 
  ggplot(aes(fill = as.factor(gender), values = n)) +
  waffle::geom_waffle(color = "white", 
                      size = 0.5, 
                      n_rows = 10, 
                      flip = TRUE) +
  facet_wrap(~decade, nrow = 1, strip.position = "bottom") +
# my_style  +
  scale_fill_manual(values =c("#003049", "#bc4749")) +
  theme(axis.text.x.bottom = element_blank(),
        text = element_text(size = 40))

And mea culpa, I finished the annotation and titles are with Canva.

Once again, life is too short to be messing with annotation in ggplot.

Next, we can make a simple trend line of the top Irish parties and see how they have fared with women TDs.

Let’s get a dataframe with average number of TDs elected to each party over the decades

dail %>% 
  filter(constituency != "National University") %>% 
  filter(party %in% c("Fianna Fáil", "Fine Gael", "Labour", "Sinn Féin")) %>% 
  group_by(party, decade) %>% 
  summarise(avg_female = mean(gender == "female")) -> dail_avg

# A tibble: 39 × 3
# Groups:   party [4]
   party       decade avg_female
   <chr>       <chr>       <dbl>
 1 Fianna Fáil 1920s     0.0198 
 2 Fianna Fáil 1930s     0.00685
 3 Fianna Fáil 1940s     0.0284 
 4 Fianna Fáil 1950s     0.0425 
 5 Fianna Fáil 1960s     0.0230 
 6 Fianna Fáil 1970s     0.0327 
 7 Fianna Fáil 1980s     0.0510 
 8 Fianna Fáil 1990s     0.0897 
 9 Fianna Fáil 2000s     0.0943 
10 Fianna Fáil 2010s     0.0938 
# 29 more rows

We create a new mini data.frame of four values so that we can have the geom_text() only at the end of the year (so similar to the final position of the graph).

final_positions <- dail_avg %>%
  group_by(party) %>%
  filter(decade == "2020s")  %>% 
  mutate(color = ifelse(party == "Sinn Féin", "#2fb66a",
         ifelse(party == "Fine Gael","#6699ff",
         ifelse(party == "Fianna Fáil","#ee9f27", 
         ifelse(party == "Labour", "#780000", "#495051")))))

# A tibble: 4 × 4
# Groups:   party [4]
  party       decade avg_female color  
  <chr>       <chr>       <dbl> <chr>  
1 Fianna Fáil 2020s       0.140 #ee9f27
2 Fine Gael   2020s       0.219 #6699ff
3 Labour      2020s       0.118 #780000
4 Sinn Féin   2020s       0.368 #2fb66a

A hex colour for each major party

party_pal <- c("Sinn Féin" = "#2fb66a",
                "Fine Gael" = "#6699ff",
                "Fianna Fáil" = "#ee9f27", 
                "Labour" = "#780000")

And a geom_bump() layer in the plot using the ggbump() package for more wavy lines.

dail_avg %>% 
  ggplot(aes(x = decade,
             y = avg_female, 
             group = party, 
             color = party)) + 

  ggbump::geom_bump(aes(color = party),
            smooth = 5,
            alpha = 0.5,
            size = 4)  +

  geom_point(color = "white", 
             size = 7, 
             stroke = 4) + 

  geom_point(size = 6) +

  ggrepel::geom_text_repel(data = final_positions,
            aes(color = party,
                y = avg_female,
                x = decade,
            label = party),
            family = "Arial Rounded MT Bold",
            vjust = -2,
            hjust = -1,
            size = 15) +
# my_style() 
  scale_color_manual(values = party_pal) +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_x_discrete(expand = expansion(add = c(0.2, 2))) +
  theme(legend.position = "none")

This graph looks at major Irish political parties from the 1920s to the 2020s.

For most of Irish history, female representation remained under 10%.

The Labour Party surged ahead like crazy in the 1990s; it got over 30% female TDs!

Now in the 2020s, Sinn Féin has the largest proportion of female TDs and goes way above and beyond the other major parties.

Now, onto constituency maps.

We can go to the Irish government’s website with heaps of data! Yay free data.

This page brings us to the election constituencies GeoJSON map data.

For more information about making GeoJSON and SF maps click here to read about how to create maps in R ~

How to download and graph interactive country maps in R

So we read in the data and convert to SF dataframe.

constituency_map <- geojson_read(file.choose(), what = "sp")

constituency_sf <- st_as_sf(constituency_map)

This constituency_sf has 64 variables but most of them are meta-data info like the dates that each variable was updated. The vaaast majority, we don’t need so we can just pull out the consituency var for our use:

constituency_sf %>% 
  select(constituency = ENG_NAME_VALUE, 
         geometry) -> mini_constituency_sf

Simple feature collection with 1072 features and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 417437.9 ymin: 516356.4 xmax: 734489.6 ymax: 966899.7
Projected CRS: IRENET95 / Irish Transverse Mercator
First 10 features:
          constituency                       geometry
1  Cork South-West (3) POLYGON ((501759.8 527442.6...
2            Kerry (5) POLYGON ((451686.2 558529.2...
3            Kerry (5) POLYGON ((426695 561869.8, ...
4            Kerry (5) POLYGON ((451103.9 555882.8...
5            Kerry (5) POLYGON ((434925.3 572926.2...
6          Donegal (5) POLYGON ((564480.8 917991.7...
7          Donegal (5) POLYGON ((571201.9 892870.7...
8          Donegal (5) POLYGON ((615249.9 944590.2...
9          Donegal (5) POLYGON ((563593.8 897601, ...
10         Donegal (5) POLYGON ((647306 966899.4, ...

As we see, the number of seats in each constituency is in brackets behind the name of the county. So we can separate them and create a seat variable:

  mini_constituency_sf %<>% 
   separate(constituency, 
            into = c("constituency", "seats"), 
            sep = " \\(", fill = "right") %>%
   mutate(seats = as.numeric(gsub("\\)", "", seats)))

One problem I realised along the way when I was trying to merge the constituency map with the TD politicians data is that one data.frame uses a hyphen and one uses a dash in the constituency variable.

So we can make a quick function to replace en dash (–) with hyphen (-).

 replace_dash <- function(x) {
   if (is.character(x)) {
     gsub("–", "-", x)  
   } else {x}
}

 mini_constituency_sf %<>%
   mutate(across(where(is.character), replace_dash))

And now we can merge!

 dail %<>%
   right_join(mini_constituency_sf, by = "constituency")

Now a quick map ~

 dail %<>%
  mutate(n = ifelse(is.na(percentage_women), 0, percentage_women)) %>%
   ggplot(aes(geometry = geometry)) +
   geom_sf(aes(fill = percentage_women),
           color = "black") +  s
   labs(title = "Map of Irish Constituencies") +
   # my_style() +
   scale_fill_viridis_c(option = "plasma")  +

    scale_fill_gradient2(low = "#57cc99",
                         mid = "#38a3a5",
                         high = "#22577a") +

   theme(axis.text = element_blank(),
     axis.text.x.bottom = element_blank(),
     legend.key.width = unit(1.5, "cm"), 
     legend.key.height = unit(0.4, "cm"), 
     legend.position = "bottom")

We can see that some constituencies have 3 seats, some 5~

So we cannot directly compare who has more female TDs.

A way to deal with this is scaling the data.

In PART 3, we will look at scaling data and analysing trends across the years!

Yay!

How to graph different distributions for political science analysis in R. PART 1: Binomial, Bernoulli and Geometric Distributions.

Packages we will need:

library(tidyverse)

In this blog, we will look at three distributions.

Distributions are fundamental to statistical inference and probability.

Cbc No GIF by Kim's Convenience - Find & Share on GIPHY

The data we will be using is on Irish legislative elections from 1919.

irish_leg Download

Binomial Distribution

First we can look at the binomial distribution.

We can model the number of successful elections (e.g., a party winning) out of a fixed number of elections (trials).

In R, we use rbinom() to create the distribution.

rbinom(n, size, prob)

We need to feed in three pieces of information into this function

Parameters

n: The number of random samples we want.

size: The number of trials.

prob: The probability of success for each trial.

We can use rbinom() to simulate 100 elections and see how likely there will be a change in the party in power.

ire_leg %>% 
  filter(leg_election_change != "No election") %>% 
  summarise(avg_change = mean(change_binary, na.rm = TRUE))

When we print this, we learn that in 20% of the elections in Ireland, there has been a change in winning party.

So we will use the Binomial distribution to simulate 10 years of elections.

We will do this 100 times and create a graph of change probabilities.

Essentially we can visualise how likely there will we see a change in the party in power.

First, we choose how many times we want to estimate the probability

num_simulations <- 100

Next, we choose the number of years that we want to look at

years <- 10

Then, we set the probability that an election ends with new party in power :

probability_of_change <- 0.2

And we throw them all together into the rbinom() function

simulations <- rbinom(n = num_simulations, 
size = years, 
prob = probability_of_change)

proportion_of_changes <- mean(simulations > 0)

We can see that there is an 87% chance that the party in power will change in the next 10 years, according to 100 simulations.

We can use geom_histogram() to examine the distributions

ggplot(data.frame(simulations), aes(x = simulations)) +
  geom_histogram(binwidth = 1, fill = "#023047",
                 color = "black", alpha = 0.7) +
  labs(title = "Distribution of Party Changes",
       x = "Number of Changes",
       y = "Probability") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  scale_x_continuous(breaks = seq(0, 5, by = 1)) +
  bbplot::bbc_style()

And if we think the probability is high, we can graph that too.

So we can set the probability that the party in power wll change in one year to 0.8

probability_of_change <- 0.8

Geometric Distribution

While we use the binomial distribution to simulate the number of sucesses in a fixed number of trials, we use the geometric distribution to simulate number of trials needed until the first success (e.g. first instance that a new party comes into power after an election).

It can answer questions like, “On average, how many elections did a party need to contest before winning its first election?”

# Set the probability of that a party will change power in one year
prob_success <- 0.2   

# Generate values for the number of years until the first change in power
trials_values <- 1:20  

# Calculate the PMF values for the geometric distribution
pmf_values <- dgeom(trials_values - 1, prob = prob_success)

# Create a data frame
df <- data.frame(k = trials_values, pmf = pmf_values)

The dgeom() function in R is used to calculate the probability mass function (PMF) for the geometric distribution.

It returns the probability of obtaining a specific number of trials (k) until the first success occurs in a sequence of independent Bernoulli trials.

Each trial has a constant probability of success (p).

In this instance, the dgeom() function calculates the PMF for the number of trials until the first success (from 0 to 10 years).

This is estimated with a success probability of 0.2.

prob_success <- 0.2  

# Generate the number of trials until the first success
trials_values <- 1:20  

# Calculate the PMF values 
pmf_values <- dgeom(trials_values - 1, prob = prob_success)

# Create a data frame
my_dist <- data.frame(k = trials_values, pmf = pmf_values)

And we will graph the geometric distribution

my_dist %>%
  ggplot(aes(x = k,  y = pmf)) +
  geom_bar(stat = "identity", 
           fill = "#023047",
           alpha = 0.7) +
  labs(title = "Geometric Distribution",
    x = "Number of Years Until New Party",
    y = "Probability") +
  my_theme()

To interpret this graph, there is a 20% chance that there will be a new party next year and 10% chance that it will take 3 yaers until we see a new party in power.

Bernoulli Distribution

Nature of Trials

The Bernoulli distribution is the most simple case where each election is considered as an independent Bernoulli trial, resulting in either success (1) or failure (0) based on whether a party wins or loses.

The binomial distribution focuses on the number of successful elections out of a fixed number of trials (years).

The geometric distribution focuses on the number of trials (year) required until the first success (change of party in power) occurs.

The Bernoulli distribution is the simplest case, treating each change as an independent success/failure trial.

Thank you for readdhing. Next we will look at F and T distributiosn in police science resaerch.

Create infographics with the Irish leader dataset in R and Canva

Click here to download the Irish leader datatset. This file details information on all Taoisigh since 1922.

full_taois Download Taoiseach dataset here

Source: Wikipedia

Tentative Codebook

Variable Name	Variable Description
no	Taoiseach number
name	Name
party	Political party
constituency	Electoral constituency
born	Date of birth
died	Date of death
first_elected	Date first entered the Dail
entered_office	Date entering office of Taoiseach
left_office	Date leaving office of Taoiseach
left_dail	Date left the Dail
cum_days	Total number of days in Dail
cum_years	Total number of years in Dail
second_level	Secondary school Taoiseach attended
third_level	University Taoiseach attended
period	Number of times the person was Taoiseach
before_after_taoiseach	Title of cabinet positions held by the Taoiseach when he was not holding office of Taoiseach
while_taoiseach	Title of cabinet positions held by the Taoiseach when he was in office as Taoiseach
no_pos_before_after	Number of cabinet positions the man held when he was not holding office of Taoiseach
no_pos_dur	Number of cabinet positions the man held when he was Taoiseach
county_born	The county the Taoiseach was born in
age	Age of Taoiseach
age_enter	Age the man entered office of Taoiseach
gender	Gender

Packages we will need:

library(tidyverse)
library(ggthemes)
library(readr)
library(sf)
library(tmap)

With the dataset, we can add map data and plot the 26 counties of Ireland.

If you follow this link below, you can download county map data from the following website by Chris Brundson

https://rpubs.com/chrisbrunsdon/part1

Thank you to Chris for the tutorial and data access!

Read in the simple features data with the st_read() from the sf package.

setwd("C:/Users/my_name/Desktop")

county_geom <- sf::st_read("counties.json") %>% 
   clean_names() %>% 
   mutate(county = stringr::str_to_title(county))

Next we count the number of counties that have given Ireland a Taoiseach with the group_by() and count() functions.

One Taoiseach, Eamon DeValera, was born in New York City, so he will not be counted in the graph.

Sorry Dev.

Sad Friends Tv GIF - Find & Share on GIPHY

We can join the Taoisech dataset to the county_geom dataframe by the county variable. The geometric data has the counties in capital letters, so we convert tolower() letters.

Add the geometry variable in the main ggplot() function.

We can play around with the themes arguments and add the theme_map() from the ggthemes package to get the look you want.

I added a few hex colors to indicate the different number of countries.

If you want a transparent background, we save it with the ggsave() function and set the bg argument to “transparent”

full_taois %>% 
  select(county = county_born, everything()) %>% 
  distinct(name, .keep_all = TRUE) %>% 
  group_by(county) %>% 
  count() %>% 
  ungroup() %>% 
  right_join(county_geom, by = c("county" = "county")) %>%
  replace(is.na(.), 0) %>% 
  ggplot(aes(geometry = geometry, fill = factor(n))) +  
  geom_sf(linewidth = 1, color = "white") +
  ggthemes::theme_map() + 
  theme(panel.background = element_rect(fill = 'transparent'),  
    legend.title = element_blank(),
    legend.text = element_text(size = 20) )  + 
scale_fill_manual(values = c("#8d99ae", "#a8dadc", "#457b9d", "#e63946", "#1d3557")) 

ggsave('county_map.png', county_map, bg = 'transparent')

Counties that have given us Taoisigh

Source: Wikipedia

Next we can graph the ages of the Taoiseach when they first entered office. With the reorder() function, we can compare how old they were.

full_taois %>%
  mutate(party = case_when(party == "Cumann na nGaedheal" ~ "CnG",
                           TRUE ~ as.character(party))) %>% 
  distinct(name, .keep_all = TRUE) %>% 
  mutate(age_enter = round(age_enter, digits = 0)) %>% 
  ggplot(aes(x = reorder(name, age_enter),
             y = age_enter,
             fill = party)) + 
  geom_bar(stat = "identity") +
  coord_flip() + 
  scale_fill_manual(values = c( "#8e2420","#66bb66","#6699ff")) + 
  theme(text = element_text(size = 40),
    axis.title.x = element_blank(), 
    axis.title.y = element_blank(), 
    panel.background = element_rect(fill = 'transparent'),  
    plot.background = element_rect(fill = 'transparent', color = NA), 
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    legend.background = element_rect(fill = 'transparent'), #transparent legend bg
    legend.key.size = unit(2, 'cm'),
    legend.key.height = unit(2, 'cm'),
    legend.key.width = unit(2, 'cm'), 
    legend.title = element_blank(),
    legend.text = element_text(size = 20) ) 

ggsave('age_chart.png', age_chart, bg = 'transparent')

Ages of the Taoiseach entering office for the first time

Source: Wikipedia

We can calculate to see which party has held the office of Taoiseach the longest with a special, but slightly mad-looking pie chart

Click here to learn more about creating these plots.

Alternatives to pie charts: coxcomb and waffle charts

full_taois %>% 
  distinct(name, .keep_all = TRUE) %>% 
  group_by(party) %>% 
  summarise(total_cum = sum(cum_days)) %>% 
    ggplot(aes(reorder(total_cum, party), total_cum, fill = as.factor(party))) + 
  geom_bar(stat = "identity") + 
  coord_polar("x", start = 0, direction = - 1)  + 
  scale_fill_manual(values = c( "#8e2420","#66bb66","#6699ff")) + 
  ggthemes::theme_map()

Number of years each party held the office of Taoiseach

Source: Wikipedia

Fianna Fail has held the office over twice as long as Fine Fail and much more than the one term of W Cosgrove (the only CnG Taoiseach)

Last we can create an icon waffle plots. We can use little man icons to create a waffle plot of all the men (only men) in the office, colored by political party.

I got the code and tutorial for making these waffle plots from the following website:

https://www.listendata.com/2019/06/create-infographics-with-r.html

It was very helpful in walking step by step through how to download the FontAwesome icons into the correct font folder on the PC. I had a heap of issues with the wrong versions of the htmltools.

remotes::install_github("JohnCoene/echarts4r")

remotes::install_github("hrbrmstr/waffle")

devtools::install_github("JohnCoene/echarts4r.assets")

remotes::install_github("hrbrmstr/waffle")

library(echarts4r)
library(extrafont)
library(showtext)
library(magrittr)
library(echarts4r.assets)
library(htmltools)
library(waffle)

extrafont::font_import(path = "C:/Users/my_name/Desktop",  pattern = "fa-", prompt =  FALSE)

extrafont::loadfonts(device="win")

font_add(family = "FontAwesome5Free-Solid", regular = "C:/Users/my_name/Desktop/fa-solid-900.ttf")
font_add(family = "FontAwesome5Free-Regular", regular = "C:/Users/my_name/Desktop/fa-regular-400.ttf")
font_add(family = "FontAwesome5Brands-Regular", regular = "C:/Users/my_name/Desktop/fa-brands-400.ttf")

showtext_auto()

Next we will find out the number of Taoisigh from each party:

And we fill a vector of values into the waffle() function. We can play around with the number of rows. Three seems like a nice fit for the number of icons (glyphs).

Also, we choose the type of glyph image we want with the the use_glyph() argument.

The options are the glyphs that come with the Font Awesome package we downloaded with extrafonts.

waffle(
  c( Cumann na nGaedheal = 1      ` = 1,
      `Fianna Fail = 8    ` = 8, 
      `Fine Gael = 6    ` = 6), 
  rows = 3, 
  colors = c("#8e2420", "#66bb66",  "#6699ff"),
  use_glyph = "male", 
  glyph_size = 25, 
  legend_pos = "bottom")

Click below to download the infographic that was edited and altered with Canva.com.

Jimmy Fallon Dancing GIF by The Tonight Show Starring Jimmy Fallon - Find & Share on GIPHY

taoiseach_infographic Download infographic here

Create a dataset of Irish parliament members

library(rvest)
library(tidyverse)
library(toOrdinal)
library(magrittr)
library(genderizeR)
library(stringi)

This blogpost will walk through how to scrape and clean up data for all the members of parliament in Ireland.

Or we call them in Irish, TDs (or Teachtaí Dála) of the Dáil.

We will start by scraping the Wikipedia pages with all the tables. These tables have information about the name, party and constituency of each TD.

On Wikipedia, these datasets are on different webpages.

This is a pain.

However, we can get around this by creating a list of strings for each number in ordinal form – from1st to 33rd. (because there have been 33 Dáil sessions as of January 2023)

We don’t need to write them all out manually: “1st”, “2nd”, “3rd” … etc.

Instead, we can do this with the toOrdinal() function from the package of the same name.

dail_sessions <- sapply(1:33,toOrdinal)

Next we can feed this vector of strings with the beginning of the HTML web address for Wikipedia as a string.

We paste the HTML string and the ordinal number strings together with the stri_paste() function from the stringi package.

This iterates over the length of the dail_sessions vector (in this case a length of 33) and creates a vector of each Wikipedia page URL.

dail_wikipages <- stri_paste("https://en.wikipedia.org/wiki/Members_of_the_",
           dail_sessions, "_D%C3%A1il")

Now, we can take the most recent Dáil session Wikipedia page and take the fifth table on the webpage using `[[`(5)

We rename the column names with select().

And the last two mutate() lines reomve the footnote numbers in ( ) [ ] brackets from the party and name variables.

dail_wikipages[33] %>%  
  read_html() %>%
  html_table(header = TRUE, fill = TRUE) %>% 
  `[[`(5) %>% 
  rename("ble" = 1, "party" = 2, "name" = 3, "constituency" = 4) %>% 
  select(-ble) %>% 
  mutate(party = gsub(r"{\s*\([^\)]+\)}","",as.character(party))) %>% 
  mutate(name = sub("\\[.*", "", name)) -> dail_33

Last we delete the first row. That just contais a duplicate of the variable names.

dail_33 <- dail_33[-1,]

We want to delete the fadas (long accents on Irish words). We can do this across all the character variables with the across() function.

The stri_trans_general() converts all strings to LATIN ASCII, which turns string to contain only the letters in the English language alphabet.

dail_33 %<>% 
  mutate(across(where(is.character), ~ stri_trans_general(., id = "Latin-ASCII")))

We can also separate the first name from the second names of all the TDs and create two variables with mutate() and separate()

dail_33 %<>% 
  mutate(name = str_replace(name, "\\s", "|")) %>% 
  separate(name, into = c("first_name", "last_name"), sep = "\\|")

With the first_name variable, we can use the new pacakge by Kalimu. This guesses the gender of the name. Later, we can track the number of women have been voted into the Dail over the years.

Of course, this will not be CLOSE to 100% correct … so later we will have to check each person manually and make sure they are accurate.

devtools::install_github("kalimu/genderizeR")

gender = findGivenNames(dail_33$name, progress = TRUE)

gender %>% 
  select(probability, gender)  -> gen_variable

gen_variable %<>% 
  select(name, gender) %>% 
  mutate(name = str_to_sentence(name))

dail_33 %<>% 
  left_join(gen_variable, by = "name")

Create date variables and decade variables that we can play around with.

dail_df$date_2 <- as.Date(dail_df$date, "%Y-%m-%d")

dail_df$year <- format(dail_df$date_2, "%Y")

dail_df$month <- format(dail_df$date_2, "%b")

dail_df %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s"))

In the next blog, we will graph out the various images to explore these data in more depth. For example, we can make a circle plot with the composition of the current Dail with the ggparliament package.

We can go into more depth with it in the next blog… Stay tuned.

Examining Ireland’s foreign policy in pictures with R

Packages we will need:

library(peacesciencer)  
library(forcats)
library(ggflags)
library(tidyverse)
library(magrittr)
library(waffle)
library(bbplot)
library(rvest)

In January 2015, the Irish government published a review of Ireland’s foreign policy. The document, The Global Island: Ireland’s Foreign Policy for a Changing World offers a perspective on Ireland’s place in the world.

In this blog, we will graph out some of the key features of Ireland’ foreign policy and so we can have a quick overview of the key relationships and trends.

Excited Season 4 GIF by The Office - Find & Share on GIPHY

First, we will look at the aid that Ireland gives to foreign countries. This read.csv(file.choose()) will open up the file window and you can navigate to the file and data that you can download from DAC OECD website: https://data.oecd.org/oda/net-oda.htm

dac <- read.csv(file.choose())

We will filter only Ireland and clean the names with the clean_names() function from the janitor package:

dac %<>% 
  filter(Donor == "Ireland") %>% 
  clean_names()

And change the variables, adding the Correlates of War codes and cleaning up some of the countries’ names.

dac %<>% 
  mutate(cown = countrycode(recipient_2, "country.name", "cown"),
         aid_amount = value*1000000) %>%  
  select(country = recipient_2, cown,
         year, time, aid_type, value, aid_amount) %>%
  mutate(cown = ifelse(country == "West Bank and Gaza Strip", 6666,
         ifelse(country == "Serbia", 345, 
         ifelse(country == "Micronesia", 987,cown))))%>%
  filter(!is.na(cown))

Next we can convert dataframe to wider format so we have a value column for each aid type

dac %>% 
  distinct(country, cown, year, time, aid_type, value, .keep_all = TRUE)  %>%  
  pivot_wider(names_from = "aid_type", values_from = "aid_amount") %>% 
  mutate(across(where(is.numeric), ~ replace_na(., 0))) %>% 
  clean_names() -> dac_wider

And we graph out the three main types of aid:

dac_wider %>%
  group_by(year) %>% 
  summarise(total_humanitarian = sum(humanitarian_aid, na.rm = TRUE),
  total_technical = sum(technical_cooperation, na.rm = TRUE),
  total_development_food_aid = sum(development_food_aid)) %>% 
  ungroup() %>% 
  pivot_longer(!year, names_to = "aid_type", values_to = "aid_value") %>% 
  ggplot(aes(x = year, y = aid_value, groups = aid_type)) + 
  geom_line(aes(color = aid_type), size = 2, show_guide  = FALSE) +
  geom_point(aes(color = aid_type), fill = "white", shape = 21, size = 3, stroke = 2) +
  bbplot::bbc_style()  +
  scale_y_continuous(labels = scales::comma) + 
  scale_x_discrete(limits = c(2010:2018)) +
  labs(title = "Irish foreign aid by aid type (2010 - 2018)",
       subtitle = ("Source: OECD DAC")) +
  scale_color_discrete(name = "Aid type", 
        labels = c("Development and Food", "Humanitarian", "Technical"))

We will look at total ODA aid:

dac %>% 
  count(aid_type) %>% 
  arrange(desc(n)) %>% 
  knitr::kable(format = "html")

aid_type	n
Imputed Multilateral ODA	2298
Memo: ODA Total, excl. Debt	1292
Memo: ODA Total, Gross disbursements	1254
ODA: Total Net	1249
Grants, Total	1203
Technical Cooperation	541
ODA per Capita	532
Humanitarian Aid	518
ODA as % GNI (Recipient)	504
Development Food Aid	9

And get some pretty hex colours:

pal_10 <- c("#001219","#005f73","#0a9396","#94d2bd","#e9d8a6","#ee9b00","#ca6702","#bb3e03","#ae2012","#9b2226")

And download some regime, democracy, region and continent data from the PACL datase with the democracyData() package

pacl <- redownload_pacl() 

pacl %<>% 
  mutate(regime_name = ifelse(regime == 0, "Parliamentary democracies",
         ifelse(regime == 1, "Mixed democracies",
         ifelse(regime == 2, "Presidential democracies",
         ifelse(regime == 3, "Civilian autocracies",
         ifelse(regime == 4, "Military dictatorships",
         ifelse(regime ==  5,"Royal dictatorships", regime))))))) %>%
  mutate(regime = as.factor(regime)) 

pacl %<>% 
  select(year, country = pacl_country, 
         democracy, regime_name,
         region_name = un_region_name, 
         continent_name = un_continent_name)

pacl %<>% 
  mutate(cown = countrycode(country, "country.name", "cown")) %>% 
  select(!country)

Summarise the total aid for each country across the years and choose the top 20 countries

dac %>% 
  filter(aid_type == "Memo: ODA Total, Gross disbursements") %>% 
  group_by(country) %>% 
  summarise(total_country_aid = sum(aid_amount, na.rm = TRUE)) %>% 
  ungroup() %>% 
  top_n(n = 20) %>% 
  mutate(cown = countrycode::countrycode(country, "country.name", "cown")) %>% 
  inner_join(pacl, by = "cown") %>%  
  mutate(region_name = ifelse(country == "West Bank and Gaza Strip", "Western Asia", region_name)) %>% 
  mutate(region_name = ifelse(region_name == "Western Asia", "Middle East", region_name)) %>% 
  mutate(country = ifelse(country == "West Bank and Gaza Strip", "Palestine",
  ifelse(country == "Democratic Republic of the Congo", "DR Congo",
  ifelse(country == "Syrian Arab Republic", "Syria", country)))) %>% 
  mutate(iso2 = tolower(countrycode::countrycode(country, "country.name", "iso2c"))) %>% 
  ggplot(aes(x = forcats::fct_reorder(country, total_country_aid), y = total_country_aid)) + 
  geom_bar(aes(fill = region_name), stat = "identity", width = 0.7) + 
  coord_flip() + bbplot::bbc_style() + 
  geom_flag(aes(x = country, y = -100, country = iso2), size = 12) +
  scale_fill_manual(values = pal_10) +
  labs(title = "Ireland's largest ODA foreign aid recipients, 2010 - 2018",
       subtitle = ("Source: OECD DAC")) + 
  xlab("") + ylab("") + 
  scale_x_continuous(labels = scales::comma)

We can make a waffle plot to look at the different types of regimes to which the Irish government gave aid over the decades

 dac %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s")) %>% 
  group_by(decade) %>% 
  count(regime_name) %>% 
  ggplot(aes(fill = regime_name, values = n)) +
  geom_waffle(color = "white", size = 0.3, n_rows = 10, flip = TRUE) +
  facet_wrap(~decade, nrow = 1, strip.position = "bottom") + 
  bbplot::bbc_style()  +
  scale_fill_manual(values = pal_10) +
   scale_x_discrete(breaks = round(seq(0, 1, by = 0.2),3)) +
  labs(title = "Ireland's ODA foreign aid recipient regime types since 1945",
       subtitle = ("Source: OECD DAC"))

Next, we will download dyadic foreign policy similarity measures from peacesciencer.

Peacesciencer package has tools and data sets for the study of quantitative peace science.

Click here to read more about the peacesciencer package by Steven Miller

Building a dataset for political science analysis in R, PART 2

fp_similar_df <- peacesciencer::create_dyadyears() %>% 
  add_gwcode_to_cow() %>% 
  add_fpsim()

I am only looking at dyadic foreign policy similarity with Ireland, so filter by Ireland’s Correlates of War code, 205.

Click here to find out all countries’ COW code

Correlates of War codes

fp_similar_df %<>% 
  filter(ccode1 == 205)

Data on alliance portfolios comes from the Correlates of War and is used to calculate similarity of foreign policy positions (see Altfeld & Mesquita, 1979).

The assumption is that similar alliance portfolios are the result of similar foreign policy positions.

With increasing in level of commitment, the strength of alliance commitments can be:

no commitment
entente
neutrality or nonaggression pact
defense pact

We will map out alliance similarity. This will use the measurement calculated with Cohen’s Kappa. Check out Hage’s (2011) article to read more about the different ways to measure alliance similarity.

Next we can look at UN similarity.

The UN voting variable calculates three values:

1 = Yes

2 = Abstain

3 = No

Based on these data, if two countries in a similar way on the same UN resolutions, this is a measure of the degree to which dyad members’ foreign policy positions are similar.

fp_similarity_df %>% 
  mutate(country = countrycode::countrycode(ccode2, "cown", "country.name")) %>% 
  select(country, ccode2, year,
         un_similar = kappavv) %>% 
  filter(year > 1989) %>% 
  filter(!is.na(country)) %>%
  mutate(iso2 = tolower(countrycode::countrycode(country, "country.name", "iso2c"))) %>% 
  group_by(country) %>% 
  mutate(avg_un = mean(un_similar, na.rm = TRUE)) %>%
  distinct(country, avg_un, iso2, .keep_all = FALSE) %>% 
  ungroup() %>% 
  top_n(n = 10)  -> top_un_similar

And graph out the top ten

  top_un_similar %>%
  ggplot(aes(x = forcats::fct_reorder(country, avg_un), 
             y = avg_un)) + 
  geom_bar(stat = "identity",
           width = 0.7, 
           color = "#0a85e5", 
           fill = "#0a85e5") +
  ggflags::geom_flag(aes(x = country, y = 0, country = iso2), size = 15) +
  coord_flip() + bbplot::bbc_style()  +
  ggtitle("UN voting similarity with Ireland since 1990")

If we change the top_n() to negative, we can get the bottom 10

top_n(n = -10)

We can quickly scrape data about the EU countries with the rvest package


eu_members_html <- read_html("https://en.wikipedia.org/wiki/European_Union")
eu_members_tables <- eu_members_html %>% html_table(header = TRUE, fill = TRUE)

eu_member <- eu_members_tables[[6]]

eu_member %<>% 
  janitor::clean_names()

eu_member %>% distinct(state) %>%  pull(state) -> eu_state

Last we are going to look at globalization scores. The data comes from the the KOF Globalisation Index. This measures the economic, social and political dimensions of globalisation. Globalisation in the economic, social and political fields has been on the rise since the 1970s, receiving a particular boost after the end of the Cold War.

Click here for data that you can download comes from the KOF website

kof %>%
  filter(country %in% eu_state) -> kof_eu

And compare Ireland to other EU countries on financial KOF index scores. We will put Ireland in green and the rest of the countries as grey to make it pop.

Ireland appears to follow the general EU trends and is not an outlier for financial globalisation scores.

kof_eu %>% 
  ggplot(aes(x = year,  y = finance, groups = country)) + 
  geom_line(color = ifelse(kof_eu$country == "Ireland",     "#2a9d8f", "#8d99ae"),
  size = ifelse(kof_eu$country == "Ireland", 3, 2), 
  alpha = ifelse(kof_eu$country == "Ireland", 0.9, 0.3)) +
  bbplot::bbc_style() + 
  ggtitle("Financial Globalization in Ireland, 1970 to 2020", 
          subtitle = "Source: KOF")

References

Häge, F. M. (2011). Choice or circumstance? Adjusting measures of foreign policy similarity for chance agreement. Political Analysis, 19(3), 287-305.

Dreher, Axel (2006): Does Globalization Affect Growth? Evidence from a new Index of Globalizationcall_made, Applied Economics 38, 10: 1091-1110.

How to tidy up messy Wikipedia data with dplyr in R

Packages we will need:

library(rvest)
library(magrittr)
library(tidyverse)
library(waffle)
library(wesanderson)
library(ggthemes)
library(countrycode)
library(forcats)
library(stringr)
library(tidyr)
library(janitor)
library(knitr)

To see another blog post that focuses on cleaning messy strings and dates, click here to read

Wrangling and graphing UN Secretaries-General data with R

We are going to look at Irish embassies and missions around the world. Where are the embassies, and which country has the most missions (including embassies, consulates and representational offices)?

Let’s first scrape the embassy data from the Wikipedia page. Here is how it looks on the webpage.

It is a bit confusing because Ireland does not have a mission in every country. Argentina, for example, is the embassy for Bolivia, Paraguay and Uruguay.

Also, there are some consulates-general and other mission types.

Some countries have more than one mission, such as UK, Canada, US etc. So we are going to try and clean up this data.

Click here to read more about scraping data with the rvest package

Scrape NATO defense expenditure data from Wikipedia with the rvest package in R

embassies_html <- read_html("https://en.wikipedia.org/wiki/List_of_diplomatic_missions_of_Ireland")

embassies_tables <- embassies_html %>% html_table(header = TRUE, fill = TRUE)

We will extract the data from the different continent tables and then bind them all together at the end.

africa_emb <- embassies_tables[[1]]

africa_emb %<>% 
  mutate(continent = "Africa")

americas_emb <- embassies_tables[[2]]

americas_emb %<>% 
  mutate(continent = "Americas")

asia_emb <- embassies_tables[[3]]

asia_emb %<>% 
  mutate(continent = "Asia")

europe_emb <- embassies_tables[[4]]

europe_emb %<>% 
  mutate(continent = "Europe")

oceania_emb <- embassies_tables[[5]]

oceania_emb %<>% 
  mutate(continent = "Oceania")

Last, we bind all the tables together by rows, with rbind()

ire_emb <- rbind(africa_emb, 
                 americas_emb,
                 asia_emb,
                 europe_emb,
                 oceania_emb)

And clean up the names with the janitor package

ire_emb %<>% 
  janitor::clean_names()

There is a small typo with a hypen and so there are separate Consulate General and Consulate-General… so we will clean that up to make one single factor level.

ire_emb %<>% 
  mutate(mission = ifelse(mission == "Consulate General", "Consulate-General", mission))

We can count out how many of each type of mission there are

ire_emb %>% 
  group_by(mission) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  knitr::kable(format = "html")

mission	n
Embassy	69
Consulate-General	17
Liaison office	1
Representative office	1

A quick waffle plot

ire_emb %>% 
  group_by(mission) %>%
  count() %>% 
  arrange(desc(n)) %>% 
  ungroup() %>% 
  ggplot(aes(fill = mission, values = n)) +
  geom_waffle(color = "white", size = 1.5, 
              n_rows = 20, flip = TRUE) + 
  bbplot::bbc_style() +
  scale_fill_manual(values= wes_palette("Darjeeling1", n = 4))

We can remove the notes in brackets with the sub() function.

Square brackets equire a regex code \\[.*

ire_emb %<>% 
  select(!ref) %>%
  mutate(host_country = sub("\\[.*", "", host_country))

We delete the subheadings from the concurrent_accreditation column with the str_remove() function from the stringr package

ire_emb %<>%
  mutate(concurrent_accreditation = stringr::str_remove(concurrent_accreditation, "International Organizations:\n")) %>% 
  mutate(concurrent_accreditation = stringr::str_remove(concurrent_accreditation, "Countries:\n"))

After that, we will tackle the columns with many countries. The many variables in one cell violates the principles of tidy data.

Tonight Show Help GIF by The Tonight Show Starring Jimmy Fallon - Find & Share on GIPHY

For example, we saw above that Argentina is the embassy for three other countries.

We will use the separate() function from the tidyr package to make a column for each country that shares an embassy with the host country.

This separate() function has six arguments:

First we indicate the column with will separate out with the col argument

Next with into, we write the new names of the columns we will create. Nigeria has the most countries for which it is accredited to be the designated embassy with nine. So I create nine accredited countries columns to accommodate this max number.

The point I want to cut up the original column is at the \n which is regex for a large space

I don’t want to remove the original column so I set remove to FALSE

ire_emb %<>%
  separate(
    col = "concurrent_accreditation",
    into = c("acc_1", "acc_2", "acc_3", "acc_4", "acc_5", "acc_6", "acc_7", "acc_8", "acc_9"),
    sep = "\n",
    remove = FALSE,
    extra = "warn",
    fill = "warn") %>% 
  mutate(across(where(is.character), str_trim))

Some countries have more than one type of mission, so I want to count each type of mission for each country and create a new variable with the distinct() and pivot_wider() functions

Click here to read more about turning long to wide format data

With the across() function we can replace all numeric variables with NA to zeros

Click here to read more about the across() function

across() function appreciation

ire_emb %>% 
  group_by(host_country, mission) %>% 
  mutate(number_missions = n())  %>% 
  distinct(host_country, mission, .keep_all = TRUE) %>% 
  ungroup() %>% 
  pivot_wider(!c(host_city, concurrent_accreditation:count_accreditation), 
              names_from = mission, 
              values_from = number_missions) %>% 
  janitor::clean_names() %>% 
  mutate(across(where(is.numeric), ~ replace_na(., 0))) %>% 
  select(!host_country) -> ire_wide

Before we bind the two datasets together, we need to only have one row for each country.

ire_emb %>% 
  distinct(host_country, .keep_all = TRUE) -> ire_dist

And bind them together:

ire_full <- cbind(ire_dist, ire_wide)

Excited Aubrey Plaza GIF by Film Independent Spirit Awards - Find & Share on GIPHY

We can graph out where the embassies are with the geom_polygon() in ggplot

First we download the map data from dplyr and add correlates of war codes so we can easily join the datasets together with right_join()

First, we add correlates of war codes

Click here to read more about the countrycode package

Add Correlates of War codes with countrycode package in R

ire_full %<>%
    mutate(cown = countrycode(host_country, "country.name", "cown"))

world_map <- map_data("world")

world_map %<>% 
  mutate(cown = countrycode::countrycode(region, "country.name", "cown"))

I reorder the variables with the fct_relevel() function from the forcats package. This is just so they can better match the color palette from wesanderson package. Green means embassy, red for no mission and orange for representative office.

Make Wes Anderson themed graphs with wesanderson package in R

ire_full %>%
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>% 
  mutate(mission = ifelse(is.na(mission), replace_na("No Mission"), mission)) %>% 
  mutate(mission = forcats::fct_relevel(mission,c("No Mission", "Embassy","Representative office"))) %>%
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = mission), color = "white", size = 0.5)  -> ire_map

And we can change how the map looks with the ggthemes package and colors from wesanderson package

  ire_map + ggthemes::theme_map() +
  theme(legend.key.size = unit(3, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
  scale_fill_manual(values = wes_palette("Darjeeling1", n = 4))

Irish missions.png Download

And we can count how many missions there are in each country

US has the hightest number with 8 offices, followed by UK with 4 and China with 3

ire_full %>%
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>% 
  mutate(sum_missions = rowSums(across(embassy:representative_office))) %>% 
  mutate(sum_missions = replace_na(sum_missions, 0)) %>%  
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = as.factor(sum_missions)), color = "white", size = 0.5)  +
  ggthemes::theme_map() +
  theme(legend.key.size = unit(3, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
scale_fill_brewer(palette = "RdBu") + 
  ggtitle("Number of Irish missions in each country",
          subtitle = "Source: Wikipedia")

Last we can count the number of accredited countries that each embassy has. Nigeria has the most, in charge of 10 other countries across northern and central Africa.

ire_full %>% 
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>%
  mutate(count_accreditation = str_count(concurrent_accreditation, pattern = "\n")) %>% 
  mutate(count_accreditation = replace_na(count_accreditation, -1)) %>%  
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = as.factor(count_accreditation)), color = "white", size = 0.5)  +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.key.size = unit(1, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
  ggtitle("Number of Irish missions in extra accreditations",
          subtitle = "Source: Wikipedia")

Happy Maya Rudolph GIF - Find & Share on GIPHY

Comparing proportions across time with ggstream in R

Packages we need:

library(tidyverse)
library(ggstream)
library(magrittr)
library(bbplot)
library(janitor)

We can look at proportions of energy sources across 10 years in Ireland. Data source comes from: https://www.seai.ie/data-and-insights/seai-statistics/monthly-energy-data/electricity/

Before we graph the energy sources, we can tidy up the variable names with the janitor package. We next select column 2 to 12 which looks at the sources for electricity generation. Other rows are aggregates and not the energy-related categories we want to look at.

Next we pivot the dataset longer to make it more suitable for graphing.

We can extract the last two digits from the month dataset to add the year variable.

elec %<>% 
  janitor::clean_names()

elec[2:12,] -> elec

el <- elec %>% 
  pivot_longer(!electricity_generation_g_wh, 
               names_to = "month", values_to = "value") %>% 

substrRight <- function(x, n){
  substr(x, nchar(x) - n + 1, nchar(x))}

el$year <- substrRight(el$month, 2)

el %<>% select(year, month, elec_type = electricity_generation_g_wh, elec_value = value)

First we can use the geom_stream from the ggstream package. There are three types of plots: mirror, ridge and proportion.

First we will plot the proportion graph.

Select the different types of energy we want to compare, we can take the annual values, rather than monthly with the tried and trusted group_by() and summarise().

Optionally, we can add the bbc_style() theme for the plot and different hex colors with scale_fill_manual() and feed a vector of hex values into the values argument.

el %>% 
  filter(elec_type == "Oil" | 
           elec_type == "Coal" |
           elec_type == "Peat" | 
           elec_type == "Hydro" |
           elec_type == "Wind" |
           elec_type == "Natural Gas") %>% 
  group_by(year, elec_type) %>%
  summarise(annual_value = sum(elec_value, na.rm = TRUE)) %>% 
  ggplot(aes(x = year, 
             y = annual_value,
             group = elec_type,
             fill = elec_type)) +
  ggstream::geom_stream(type = "proportion") + 
  bbplot::bbc_style() +
  labs(title = "Comparing energy proportions in Ireland") +
  scale_fill_manual(values = c("#f94144",
                               "#277da1",
                               "#f9c74f",
                               "#f9844a",
                               "#90be6d",
                               "#577590"))

With ggstream::geom_stream(type = "mirror")

With ggstream::geom_stream(type = "ridge")

Without the ggstream package, we can recreate the proportion graph with slightly different looks

https://giphy.com/gifs/filmeditor-clueless-movie-l0ErMA0xAS1Urd4e4

el %>% 
  filter(elec_type == "Oil" | 
           elec_type == "Coal" |
           elec_type == "Peat" | 
           elec_type == "Hydro" |
           elec_type == "Wind" |
           elec_type == "Natural Gas") %>% 
  group_by(year, elec_type) %>%
  summarise(annual_value = sum(elec_value, na.rm = TRUE)) %>% 
  ggplot(aes(x = year, 
             y = annual_value,
             group = elec_type,
             fill = elec_type)) +
  geom_area(alpha=0.8 , size=1.5, colour="white") +
  bbplot::bbc_style() +
  labs(title = "Comparing energy proportions in Ireland") +
  theme(legend.key.size = unit(2, "cm")) + 
  scale_fill_manual(values = c("#f94144",
                               "#277da1",
                               "#f9c74f",
                               "#f9844a",
                               "#90be6d",
                               "#577590"))

Love You Hug GIF by Filmin - Find & Share on GIPHY

Create density plots with ggridges package in R

Packages we will need:

library(tidyverse)
library(ggridges)
library(ggimage)  # to add png images
library(bbplot)   # for pretty graph themes

We will plot out the favourability opinion polls for the three main political parties in Ireland from 2016 to 2020. Data comes from Louwerse and Müller (2020)

Happy Danny Devito GIF by It's Always Sunny in Philadelphia - Find & Share on GIPHY

Before we dive into the ggridges plotting, we have a little data cleaning to do. First, we extract the last four “characters” from the date string to create a year variable.

I took this quick function from a StackOverflow response:

substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))}

polls_csv$year <- substrRight(polls_csv$Date, 4)

Next, pivot the data from wide to long format.

More information of pivoting data with dplyr can be found here. I tend to check it at least once a month as the arguments refuse to stay in my head.

I only want to take the main parties in Ireland to compare in the plot.

polls <- polls_csv %>%
  select(year, FG:SF) %>% 
  pivot_longer(!year, names_to = "party", values_to = "opinion_poll")

I went online and found the logos for the three main parties (sorry, Labour) and saved them in the working directory I have for my RStudio. That way I can call the file with the prefix “~/**.png” rather than find the exact location they are saved on the computer.

polls %>% 
  filter(party == "FF" | party == "FG" | party == "SF" ) %>% 
  mutate(image = ifelse(party=="FF","~/ff.png",
 ifelse(party=="FG","~/fg.png", "~/sf.png"))) -> polls_three

Now we are ready to plot out the density plots for each party with the geom_density_ridges() function from the ggridges package.

We will add a few arguments into this function.

We add an alpha = 0.8 to make each density plot a little transparent and we can see the plots behind.

The scale = 2 argument pushes all three plots togheter so they are slightly overlapping. If scale =1, they would be totally separate and 3 would have them overlapping far more.

The rel_min_height = 0.01 argument removes the trailing tails from the plots that are under 0.01 density. This is again for aesthetics and just makes the plot look slightly less busy for relatively normally distributed densities

The geom_image takes the images and we place them at the beginning of the x axis beside the labels for each party.

Last, we use the bbplot package BBC style ggplot theme, which I really like as it makes the overall graph look streamlined with large font defaults.

polls_three %>% 
  ggplot(aes(x = opinion_poll, y = as.factor(party))) +  
  geom_density_ridges(aes(fill = party), 
                      alpha = 0.8, 
                      scale = 2,
                      rel_min_height = 0.01) + 
  ggimage::geom_image(aes(y = party, x= 1, image = image), asp = 0.9, size = 0.12) + 
  facet_wrap(~year) + 
  bbplot::bbc_style() +
  scale_fill_manual(values = c("#f2542d", "#edf6f9", "#0e9594")) +
  theme(legend.position = "none") + 
  labs(title = "Favourability Polls for the Three Main Parties in Ireland", subtitle = "Data from Irish Polling Indicator (Louwerse & Müller, 2020)")

Its Always Sunny In Philadelphia Thumbs Up GIF by HULU - Find & Share on GIPHY

Compare Irish census years with compareBars and csodata package in R

Packages we will need:

library(csodata)
library(janitor)
library(ggcharts)
library(compareBars)
library(tidyverse)

First, let’s download population data from the Irish census with the Central Statistics Office (CSO) API package, developed by Conor Crowley.

You can search for the data you want to analyse via R or you can go to the CSO website and browse around the site.

I prefer looking through the site because sometimes I stumble across a dataset I didn’t even think to look for!

Keep note of the code beside the red dot star symbol if you’re looking around for datasets.

Click here to check out the CRAN PDF for the CSO package.

You can search for keywords with cso_search_toc(). I want total population counts for the whole country.

cso_search_toc("total population")

We can download the variables we want by entering the code into the cso_get_data() function

irish_pop <- cso_get_data("EY007")
View(irish_pop)

The EY007 code downloads population census data in both 2011 and 2016 at every age.

It needs a little bit of tidying to get it ready for graphing.

irish_pop %<>%  
  clean_names()

First, we can be lazy and use the clean_names() function from the janitor package.

GIF by The Good Place - Find & Share on GIPHY

Next we can get rid of the rows that we don’t want with select().

Then we use the pivot_longer() function to turn the data.frame from wide to long and to turn the x2011 and x2016 variables into one year variable.

irish_pop %>% 
  filter(at_each_year_of_age == "Population") %>% 
  filter(sex == 'Both sexes') %>% 
  filter(age_last_birthday != "All ages") %>% 
  select(!statistic) %>% 
  select(!sex) %>% 
  select(!at_each_year_of_age) -> irish_wide

irish_wide %>% 
  pivot_longer(!age_last_birthday,
    names_to = "year", 
    values_to = "pop_count",
    values_drop_na = TRUE) %>% 
    mutate(year = as.factor(year)) -> irish_long

No we can create our pyramid chart with the pyramid_chart() from the ggcharts package. The first argument is the age category for both the 2011 and 2016 data. The second is the actual population counts for each year. Last, enter the group variable that indicates the year.

irish_long %>%   
  pyramid_chart(age_last_birthday, pop_count, year)

One problem with the pyramid chart is that it is difficult to discern any differences between the two years without really really examining each year.

One way to more easily see the differences with the compareBars function

The compareBars package created by David Ranzolin can help to simplify comparative bar charts! It’s a super simple function to use that does a lot of visualisation leg work under the hood!

First we need to pivot the data.frame back to wide format and then input the age, and then the two groups – x2011 and x2016 – in the compareBars() function.

We can add more labels and colors to customise the graph also!

irish_long %>% 
  pivot_wider(names_from = year, values_from = pop_count) %>% 
  compareBars(age_last_birthday, x2011, x2016, orientation = "horizontal",
              xLabel = "Population",
              yLabel = "Year",
              titleLabel = "Irish Populations",
              subtitleLabel = "Comparing 2011 and 2016",
              fontFamily = "Arial",
              compareVarFill1 = "#FE6D73",
              compareVarFill2 = "#17C3B2")

We can see that under the age of four-ish, 2011 had more at the time. And again, there were people in their twenties in 2011 compared to 2016.

However, there are more older people in 2016 than in 2011.

Similar to above it is a bit busy! So we can create groups for every five age years categories and examine the broader trends with fewer horizontal bars.

First we want to remove the word “years” from the age variable and convert it to a numeric class variable. We can easily do this with the parse_number() function from the readr package

irish_wide %<>% 
mutate(age_num = readr::parse_number(as.character(age_last_birthday)))

Next we can group the age years together into five year categories, zero to 5 years, 6 to 10 years et cetera.

We use the cut() function to divide the numeric age_num variable into equal groups. We use the seq() function and input age 0 to 100, in increments of 5.

irish_wide$age_group = cut(irish_wide$age_num, seq(0, 100, 5))

Next, we can use group_by() to calculate the sum of each population number in each five year category.

And finally, we use the distinct() function to remove the duplicated rows (i.e. we only want to keep the first row that gives us the five year category’s population count for each category.

irish_wide %<>% 
  group_by(age_group) %>% 
  mutate(five_year_2011 = sum(x2011)) %>% 
  mutate(five_year_2016 = sum(x2016)) %>% 
  distinct(five_year_2011, five_year_2016, .keep_all = TRUE)

Next plot the bar chart with the five year categories

compareBars(irish_wide, age_group, five_year_2011, five_year_2016, orientation = "horizontal",
              xLabel = "Population",
              yLabel = "Year",
              titleLabel = "Irish Populations",
              subtitleLabel = "Comparing 2011 and 2016",
              fontFamily = "Arial",
              compareVarFill1 = "#FE6D73",
              compareVarFill2 = "#17C3B2")

irish_wide2 %>% 
  select(age_group, five_year_2011, five_year_2016) %>% 
  pivot_longer(!age_group,
             names_to = "year", 
             values_to = "pop_count",
             values_drop_na = TRUE) %>% 
  mutate(year = as.factor(year)) -> irishlong2

irishlong2 %>%   
  pyramid_chart(age_group, pop_count, year)