How to tidy up messy Wikipedia data with dplyr in R

Packages we will need:


To see another blog post that focuses on cleaning messy strings and dates, click here to read

We are going to look at Irish embassies and missions around the world. Where are the embassies, and which country has the most missions (including embassies, consulates and representational offices)?

Let’s first scrape the embassy data from the Wikipedia page. Here is how it looks on the webpage.

It is a bit confusing because Ireland does not have a mission in every country. Argentina, for example, is the embassy for Bolivia, Paraguay and Uruguay.

Also, there are some consulates-general and other mission types.

Some countries have more than one mission, such as UK, Canada, US etc. So we are going to try and clean up this data.

Click here to read more about scraping data with the rvest package

embassies_html <- read_html("")

embassies_tables <- embassies_html %>% html_table(header = TRUE, fill = TRUE)

We will extract the data from the different continent tables and then bind them all together at the end.

africa_emb <- embassies_tables[[1]]

africa_emb %<>% 
  mutate(continent = "Africa")

americas_emb <- embassies_tables[[2]]

americas_emb %<>% 
  mutate(continent = "Americas")

asia_emb <- embassies_tables[[3]]

asia_emb %<>% 
  mutate(continent = "Asia")

europe_emb <- embassies_tables[[4]]

europe_emb %<>% 
  mutate(continent = "Europe")

oceania_emb <- embassies_tables[[5]]

oceania_emb %<>% 
  mutate(continent = "Oceania")

Last, we bind all the tables together by rows, with rbind()

ire_emb <- rbind(africa_emb, 

And clean up the names with the janitor package

ire_emb %<>% 

There is a small typo with a hypen and so there are separate Consulate General and Consulate-General… so we will clean that up to make one single factor level.

ire_emb %<>% 
  mutate(mission = ifelse(mission == "Consulate General", "Consulate-General", mission))

We can count out how many of each type of mission there are

ire_emb %>% 
  group_by(mission) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  knitr::kable(format = "html")
mission n
Embassy 69
Consulate-General 17
Liaison office 1
Representative office 1

A quick waffle plot

ire_emb %>% 
  group_by(mission) %>%
  count() %>% 
  arrange(desc(n)) %>% 
  ungroup() %>% 
  ggplot(aes(fill = mission, values = n)) +
  geom_waffle(color = "white", size = 1.5, 
              n_rows = 20, flip = TRUE) + 
  bbplot::bbc_style() +
  scale_fill_manual(values= wes_palette("Darjeeling1", n = 4))

We can remove the notes in brackets with the sub() function.

Square brackets equire a regex code \\[.*

ire_emb %<>% 
  select(!ref) %>%
  mutate(host_country = sub("\\[.*", "", host_country))

We delete the subheadings from the concurrent_accreditation column with the str_remove() function from the stringr package

ire_emb %<>%
  mutate(concurrent_accreditation = stringr::str_remove(concurrent_accreditation, "International Organizations:\n")) %>% 
  mutate(concurrent_accreditation = stringr::str_remove(concurrent_accreditation, "Countries:\n"))

After that, we will tackle the columns with many countries. The many variables in one cell violates the principles of tidy data.

Tonight Show Help GIF by The Tonight Show Starring Jimmy Fallon - Find & Share on GIPHY

For example, we saw above that Argentina is the embassy for three other countries.

We will use the separate() function from the tidyr package to make a column for each country that shares an embassy with the host country.

This separate() function has six arguments:

First we indicate the column with will separate out with the col argument

Next with into, we write the new names of the columns we will create. Nigeria has the most countries for which it is accredited to be the designated embassy with nine. So I create nine accredited countries columns to accommodate this max number.

The point I want to cut up the original column is at the \n which is regex for a large space

I don’t want to remove the original column so I set remove to FALSE

ire_emb %<>%
    col = "concurrent_accreditation",
    into = c("acc_1", "acc_2", "acc_3", "acc_4", "acc_5", "acc_6", "acc_7", "acc_8", "acc_9"),
    sep = "\n",
    remove = FALSE,
    extra = "warn",
    fill = "warn") %>% 
  mutate(across(where(is.character), str_trim)) 

Some countries have more than one type of mission, so I want to count each type of mission for each country and create a new variable with the distinct() and pivot_wider() functions

Click here to read more about turning long to wide format data

With the across() function we can replace all numeric variables with NA to zeros

Click here to read more about the across() function

ire_emb %>% 
  group_by(host_country, mission) %>% 
  mutate(number_missions = n())  %>% 
  distinct(host_country, mission, .keep_all = TRUE) %>% 
  ungroup() %>% 
  pivot_wider(!c(host_city, concurrent_accreditation:count_accreditation), 
              names_from = mission, 
              values_from = number_missions) %>% 
  janitor::clean_names() %>% 
  mutate(across(where(is.numeric), ~ replace_na(., 0))) %>% 
  select(!host_country) -> ire_wide

Before we bind the two datasets together, we need to only have one row for each country.

ire_emb %>% 
  distinct(host_country, .keep_all = TRUE) -> ire_dist

And bind them together:

ire_full <- cbind(ire_dist, ire_wide) 
Excited Aubrey Plaza GIF by Film Independent Spirit Awards - Find & Share on GIPHY

We can graph out where the embassies are with the geom_polygon() in ggplot

First we download the map data from dplyr and add correlates of war codes so we can easily join the datasets together with right_join()

First, we add correlates of war codes

Click here to read more about the countrycode package

ire_full %<>%
    mutate(cown = countrycode(host_country, "", "cown")) 
world_map <- map_data("world")

world_map %<>% 
  mutate(cown = countrycode::countrycode(region, "", "cown"))

I reorder the variables with the fct_relevel() function from the forcats package. This is just so they can better match the color palette from wesanderson package. Green means embassy, red for no mission and orange for representative office.

ire_full %>%
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>% 
  mutate(mission = ifelse(, replace_na("No Mission"), mission)) %>% 
  mutate(mission = forcats::fct_relevel(mission,c("No Mission", "Embassy","Representative office"))) %>%
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = mission), color = "white", size = 0.5)  -> ire_map

And we can change how the map looks with the ggthemes package and colors from wesanderson package

  ire_map + ggthemes::theme_map() +
  theme(legend.key.size = unit(3, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
  scale_fill_manual(values = wes_palette("Darjeeling1", n = 4))

And we can count how many missions there are in each country

US has the hightest number with 8 offices, followed by UK with 4 and China with 3

ire_full %>%
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>% 
  mutate(sum_missions = rowSums(across(embassy:representative_office))) %>% 
  mutate(sum_missions = replace_na(sum_missions, 0)) %>%  
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = as.factor(sum_missions)), color = "white", size = 0.5)  +
  ggthemes::theme_map() +
  theme(legend.key.size = unit(3, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
scale_fill_brewer(palette = "RdBu") + 
  ggtitle("Number of Irish missions in each country",
          subtitle = "Source: Wikipedia")

Last we can count the number of accredited countries that each embassy has. Nigeria has the most, in charge of 10 other countries across northern and central Africa.

ire_full %>% 
  right_join(world_map, by = "cown") %>% 
  filter(region != "Antarctica") %>%
  mutate(count_accreditation = str_count(concurrent_accreditation, pattern = "\n")) %>% 
  mutate(count_accreditation = replace_na(count_accreditation, -1)) %>%  
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = as.factor(count_accreditation)), color = "white", size = 0.5)  +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.key.size = unit(1, "cm"),
        text = element_text(size = 30),
        legend.title = element_blank()) + 
  ggtitle("Number of Irish missions in extra accreditations",
          subtitle = "Source: Wikipedia")
Happy Maya Rudolph GIF - Find & Share on GIPHY

Comparing North and South Korean UN votes at the General Assembly with unvotes package

Packages we will use


Last September 17th 2021 marked the 30th anniversary of the entry of North Korea and South Korea into full membership in the United Nations. Prior to this, they were only afforded observer status.

The Two Koreas Mark 30 Years of UN Membership: The Road to Membership

Let’s look at the types of voting that both countries have done in the General Assembly since 1991.

First we can download the different types of UN votes from the unvotes package

un_votes <- unvotes::un_roll_calls

un_votes_issues <- unvotes::un_roll_call_issues

unvotes::un_votes -> country_votes 

Join them all together and filter out any country that does not have the word “Korea” in its name.

un_votes %>% 
  inner_join(un_votes_issues, by = "rcid") %>% 
  inner_join(country_votes, by = "rcid") %>% 
  mutate(year = format(date, format = "%Y")) %>%
  filter(grepl("Korea", country)) -> korea_un

First we can make a wordcloud of all the different votes for which they voted YES. Is there a discernable difference in the types of votes that each country supported?

First, download the stop words that we can remove (such as the, and, if)


Then I will make a North Korean dataframe of all the votes for which this country voted YES. I remove some of the messy formatting with the gsub argument and count the occurence of each word. I get rid of a few of the procedural words that are more related to the technical wording of the resolutions, rather than related to the tpoic of the vote.

nk_yes_votes <- korea_un %>% 
  filter(country == "North Korea") %>% 
  filter(vote == "yes") %>%  
  select(descr, year) %>% 
  mutate(decade = substr(year, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s")) %>% 
  # group_by(decade) %>% 
  unnest_tokens(word, descr) %>% 
  mutate(word = gsub(" ", "", word)) %>% 
  mutate(word = gsub('_', '', word)) %>% 
  count(word, sort = TRUE) %>% 
  ungroup() %>% 
  anti_join(stop_words)  %>% 
  mutate(word = case_when(grepl("palestin", word) ~ "Palestine", 
                          grepl("nucl", word) ~ "nuclear",
                          TRUE ~ as.character(word)))  %>%
  filter(word != "resolution") %>% 
  filter(word != "assembly") %>% 
  filter(word != "draft") %>% 
  filter(word != "committee") %>% 
  filter(word != "requested") %>% 
  filter(word != "report") %>% 
  filter(word != "practices") %>% 
  filter(word != "affecting") %>% 
  filter(word != "follow") %>% 
  filter(word != "acting") %>% 
  filter(word != "adopted") 

Next, we count the number of each word

nk_yes_votes %<>% 
  count(word) %>% 

We want to also remove the numbers

nums <- nk_yes_votes %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()

And remove the stop words

nk_yes_votes %<>%
  anti_join(nums, by = "word")

Choose some nice colours

my_colors <- c("#0450b4", "#046dc8", "#1184a7","#15a2a2", "#6fb1a0", 
               "#b4418e", "#d94a8c", "#ea515f", "#fe7434", "#fea802")

And lastly, plot the wordcloud with the top 50 words

   random.order = FALSE, 
   max.words = 50, 
   colors = my_colors)

If we repeat the above code with South Korea:

There doesn’t seem to be a huge difference. But this is not a very scientfic approach; I just like the look of them!

Next we will compare the two countries how many votes they voted yes, no or abstained from…

korea_un %>% 
  group_by(country, vote) %>% 
  count() %>% 
  mutate(count_ten = n /25) %>% 
  ungroup() %>% 
  ggplot(aes(fill = vote, values = count_ten)) +
  geom_waffle(color = "white",
              size = 2.5,
              n_rows = 10,
              flip = TRUE) +
  facet_wrap(~country) + bbplot::bbc_style() +
  scale_fill_manual(values = wesanderson::wes_palette("Darjeeling1"))

Next we can look more in detail at the votes that they countries abstained from voting in.

We can use the tidytext function that reorders the geom_bar in each country. You can read the blog of Julie Silge to learn more about the functions, it is a bit tricky but it fixes the problem of randomly ordered bars across facets.

korea_un %>%
  filter(vote == "abstain") %>% 
  mutate(issue = case_when(issue == "Nuclear weapons and nuclear material" ~ "Nukes",
issue == "Arms control and disarmament" ~ "Arms",
issue == "Palestinian conflict" ~ "Palestine",
TRUE ~ as.character(issue))) %>% 
  select(country, issue, year) %>% 
  group_by(issue, country) %>% 
  count() %>% 
  ungroup() %>% 
  group_by(country) %>% 
  mutate(country = as.factor(country),
         issue = reorder_within(issue, n, country)) %>%
  ggplot(aes(x = reorder(issue, n), y = n)) + 
  geom_bar(stat = "identity", width = 0.7, aes(fill = country)) + 
  labs(title = "Abstaining UN General Assembly Votes by issues",
       subtitle = ("Since 1950s"),
       caption = "         Source: unvotes ") +
  xlab("") + 
  ylab("") +
  facet_wrap(~country, scales = "free_y") +
  scale_x_reordered() +
  coord_flip() + 
  expand_limits(y = 65) + 
  ggthemes::theme_pander() + 
  scale_fill_manual(values = sample(my_colors)) + 
 theme(plot.background = element_rect(color = "#f5f9fc"),
        panel.grid = element_line(colour = "#f5f9fc"),
        # axis.title.x = element_blank(),
        # axis.text.x = element_blank(),
        axis.text.y = element_text(color = "#000500", size = 16),
       legend.position = "none",
        # axis.title.y = element_blank(),
        axis.ticks.x = element_blank(),
        text = element_text(family = "Gadugi"),
        plot.title = element_text(size = 28, color = "#000500"),
        plot.subtitle = element_text(size = 20, color = "#484e4c"),
        plot.caption = element_text(size = 20, color = "#484e4c"))

South Korea was far more likely to abstain from votes that North Korea on all issues

Next we can simply plot out the Human Rights votes that each country voted to support. Even though South Korea has far higher human rights scores, North Korea votes in support of more votes on this topic.

korea_un %>% 
  filter(year < 2019) %>% 
  filter(issue == "Human rights") %>% 
  filter(vote == "yes") %>% 
  group_by(country, year) %>% 
  count() %>% 
  ggplot(aes(x = year, y = n, group = country, color = country)) + 
  geom_line(size = 2) + 
  geom_point(aes(color = country), fill = "white", shape = 21, size = 3, stroke = 2.5) +
  scale_x_discrete(breaks = round(seq(min(korea_un$year), max(korea_un$year), by = 10),1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 22)) + 
  bbplot::bbc_style() + facet_wrap(~country) + 
  theme(legend.position = "none") + 
  scale_color_manual(values = sample(my_colors)) + 
  labs(title = "Human Rights UN General Assembly Yes Votes ",
       subtitle = ("Since 1990s"),
       caption = "         Source: unvotes ")

All together:

Scraping and wrangling UN peacekeeping data with tidyr package in R

Packages we will need:


For this blog post, we will look at UN peacekeeping missions and compare across regions.

Despite the criticisms about some operations, the empirical record for UN peacekeeping records has been robust in the academic literature

“In short, peacekeeping intervenes in the most difficult
cases, dramatically increases the chances that peace will
last, and does so by altering the incentives of the peacekept,
by alleviating their fear and mistrust of each other, by
preventing and controlling accidents and misbehavior by
hard-line factions, and by encouraging political inclusion”
(Goldstone, 2008: 178).

The data on the current and previous PKOs (peacekeeping operations) will come from the Wikipedia page. But the variables do not really lend themselves to analysis as they are.

Amy Coney Barrett Snl GIF by Saturday Night Live - Find & Share on GIPHY

Once we have the url, we scrape all the tables on the Wikipedia page in a few lines

pko_members <- read_html("")
pko_tables <- pko_members %>% html_table(header = TRUE, fill = TRUE)

Click here to read more about the rvest package for scraping data from websites.

pko_complete_africa <- pko_tables[[1]]
pko_complete_americas <- pko_tables[[2]]
pko_complete_asia <- pko_tables[[3]]
pko_complete_europe <- pko_tables[[4]]
pko_complete_mena <- pko_tables[[5]]

And then we bind them together! It’s very handy that they all have the same variable names in each table.

rbind(pko_complete_africa, pko_complete_americas, pko_complete_asia, pko_complete_europe, pko_complete_mena) -> pko_complete

Next, we will add a variable to indicate that all the tables of these missions are completed.

pko_complete %<>% 
  mutate(complete = ifelse(!$Location), "Complete", "Current"))

We do the same with the current missions that are ongoing:

pko_current_africa <- pko_tables[[6]]
pko_current_asia <- pko_tables[[7]]
pko_current_europe <- pko_tables[[8]]
pko_current_mena <- pko_tables[[9]]

rbind(pko_current_europe, pko_current_mena, pko_current_asia, pko_current_africa) -> pko_current

pko_current %<>% 
  mutate(complete = ifelse(!$Location), "Current", "Complete"))

We then bind the completed and current mission data.frames

rbind(pko_complete, pko_current) -> pko

Then we clean the variable names with the function from the janitor package.

pko_df <-  pko %>% 

Next we’ll want to create some new variables.

We can make a new row for each country that is receiving a peacekeeping mission. We can paste all the countries together and then use the separate function from the tidyr package to create new variables.

 pko_df %>%
  group_by(conflict) %>%
  mutate(location = paste(location, collapse = ', ')) %>% 
  separate(location,  into = c("country_1", "country_2", "country_3", "country_4", "country_5"), sep = ", ")  %>% 
  ungroup() %>% 
  distinct(conflict, .keep_all = TRUE) %>% 

Next we can create a new variable that only keeps the acroynm for the operation name. I took these regex codes from the following stack overflow link

pko_df %<>% 
  mutate(acronym = str_extract_all(name_of_operation, "\\([^()]+\\)")) %>% 
  mutate(acronym = substring(acronym, 2, nchar(acronym)-1)) %>% 
  separate(dates_of_operation, c("start_date", "end_date"), "–")

I will fill in the end data for the current missions that are still ongoing in 2022

pko_df %<>% 
  mutate(end_date = ifelse(complete == "Current", 2022, end_date)) 

And next we can calculate the duration for each operation

pko_df %<>% 
  mutate(end_date = as.integer(end_date)) %>% 
  mutate(start_date = as.integer(start_date)) %>% 
  mutate(duration = ifelse(!, end_date - start_date, 1)) 

I want to compare regions and graph out the different operations around the world.

We can download region data with democracyData package (best package ever!)

Snl Season 47 GIF by Saturday Night Live - Find & Share on GIPHY
pacl <- redownload_pacl()

pacl %>% 
  select(cown = pacl_cowcode,
        un_region_name, un_continent_name) %>% 
  distinct(cown, .keep_all = TRUE) -> pacl_region

We join the datasets together with the inner_join() and add Correlates of War country codes.

pko_df %<>% 
  mutate(cown = countrycode(country_1, "", "cown")) %>%   mutate(cown = ifelse(country_1 == "Western Sahara", 605, 
                       ifelse(country_1 == "Serbia", 345, cown))) %>% 
  inner_join(pacl_region, by = "cown")

Now we can start graphing our duration data:

pko_df %>% 
  ggplot(mapping = aes(x = forcats::fct_reorder(un_region_name, duration), 
                       y = duration, 
                       fill = un_region_name)) +
  geom_boxplot(alpha = 0.4) +
  geom_jitter(aes(color = un_region_name),
              size = 6, alpha = 0.8, width = 0.15) +
  coord_flip() + 
  bbplot::bbc_style() + ggtitle("Duration of Peacekeeping Missions")

We can see that Asian and “Western Asian” – i.e. Middle East – countries have the longest peacekeeping missions in terns of years.

pko_countries %>% 
  filter(un_continent_name == "Asia") %>%
  unite("country_names", country_1:country_5, remove = TRUE,  na.rm = TRUE, sep = ", ") %>% 
  arrange(desc(duration)) %>% 
Start End Duration Region Country
1949 2022 73 Southern Asia India, Pakistan
1964 2022 58 Western Asia Cyprus, Northern Cyprus
1974 2022 48 Western Asia Israel, Syria, Lebanon
1978 2022 44 Western Asia Lebanon
1993 2009 16 Western Asia Georgia
1991 2003 12 Western Asia Iraq, Kuwait
1994 2000 6 Central Asia Tajikistan
2006 2012 6 South-Eastern Asia East Timor
1988 1991 3 Southern Asia Iran, Iraq
1988 1990 2 Southern Asia Afghanistan, Pakistan
1965 1966 1 Southern Asia Pakistan, India
1991 1992 1 South-Eastern Asia Cambodia, Cambodia
1999 NA 1 South-Eastern Asia East Timor, Indonesia, East Timor, Indonesia, East Timor
1958 NA 1 Western Asia Lebanon
1963 1964 1 Western Asia North Yemen
2012 NA 1 Western Asia Syria

Next we can compare the decades

pko_countries %<>% 
  mutate(decade = substr(start_date, 1, 3)) %>% 
  mutate(decade = paste0(decade, "0s")) 

And graph it out:

pko_countries %>% 
  ggplot(mapping = aes(x = decade, 
                       y = duration, 
                       fill = decade)) +
  geom_boxplot(alpha = 0.4) +
  geom_jitter(aes(color = decade),
              size = 6, alpha = 0.8, width = 0.15) +
   coord_flip() + 
  geom_curve(aes(x = "1950s", y = 60, xend = "1940s", yend = 72),
  arrow = arrow(length = unit(0.1, "inch")), size = 0.8, color = "black",
   curvature = -0.4) +
  annotate("text", label = "First Mission to Kashmir",
           x = "1950s", y = 49, size = 8, color = "black") +
  geom_curve(aes(x = "1990s", y = 46, xend = "1990s", yend = 32),
             arrow = arrow(length = unit(0.1, "inch")), size = 0.8, color = "black",curvature = 0.3) +
  annotate("text", label = "Most Missions after the Cold War",
           x = "1990s", y = 60, size = 8, color = "black") +

  bbplot::bbc_style() + ggtitle("Duration of Peacekeeping Missions")

Following the end of the Cold War, there were renewed calls for the UN to become the agency for achieving world peace, and the agency’s peacekeeping dramatically increased, authorizing more missions between 1991 and 1994 than in the previous 45 years combined.

We can use a waffle plot to see which decade had the most operation missions. Waffle plots are often seen as more clear than pie charts.

Click here to read more about waffle charts in R

To get the data ready for a waffle chart, we just need to count the number of peacekeeping missions (i.e. the number of rows) in each decade. Then we fill the groups (i.e. decade) and enter the n variable we created as the value.

pko_countries %>% 
  group_by(decade) %>% 
  count() %>%  
  ggplot(aes(fill = decade, values = n)) + 
  waffle::geom_waffle(color = "white", size= 3, n_rows = 8) +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  coord_equal() +
  labs(title = "Number of Peacekeeper Missions") + bbplot::bbc_style() 
Cecily Strong Snl GIF by Saturday Night Live - Find & Share on GIPHY

If we want to add more information, we can go to the UN Peacekeeping website and download more data on peacekeeping troops and operations.

We can graph the number of peacekeepers per country

Click here to learn more about adding flags to graphs!

le_palette <- c("#5f0f40", "#9a031e", "#94d2bd", "#e36414", "#0f4c5c")

pkt %>%
  mutate(contributing_country = ifelse(contributing_country == "United Republic of Tanzania", "Tanzania",ifelse(contributing_country == "Côte d’Ivoire", "Cote d'Ivoire", contributing_country))) %>% 
  mutate(iso2 = tolower(countrycode::countrycode(contributing_country, "", "iso2c"))) %>% 
  mutate(cown = countrycode::countrycode(contributing_country, "", "cown")) %>% 
  inner_join(pacl_region, by = "cown") %>% 
  mutate(un_region_name = case_when(grepl("Africa", un_region_name) ~ "Africa",grepl("Eastern Asia", un_region_name) ~ "South-East Asia",
 un_region_name == "Western Africa" ~ "Middle East",TRUE ~ as.character(un_region_name))) %>% 
  filter(total_uniformed_personnel > 700) %>% 
  ggplot(aes(x = reorder(contributing_country, total_uniformed_personnel),
             y = total_uniformed_personnel)) + 
  geom_bar(stat = "identity", width = 0.7, aes(fill = un_region_name), color = "white") +
  coord_flip() +
  ggflags::geom_flag(aes(x = contributing_country, y = -1, country = iso2), size = 8) +
  # geom_text(aes(label= values), position = position_dodge(width = 0.9), hjust = -0.5, size = 5, color = "#000500") + 
  scale_fill_manual(values = le_palette) +
  labs(title = "Total troops serving as peacekeepers",
       subtitle = ("Across countries"),
       caption = "         Source: UN ") +
  xlab("") + 
  ylab("") + bbplot::bbc_style()

We can see that Bangladesh, Nepal and India have the most peacekeeper troops!

Visualise DemocracyData with graphs and maps

Packages we will need:


In this post, we will look at easy ways to graph data from the democracyData package.

The two datasets we will look at are the Anckar-Fredriksson dataset of political regimes and Freedom House Scores.

Regarding democracies, Anckar and Fredriksson (2018) distinguish between republics and monarchies. Republics can be presidential, semi-presidential, or parliamentary systems.

Within the category of monarchies, almost all systems are parliamentary, but a few countries are conferred to the category semi-monarchies.

Bill Murray King GIF - Find & Share on GIPHY

Autocratic countries can be in the following main categories: absolute monarchy, military rule, party-based rule, personalist rule, and oligarchy.

anckar <- democracyData::redownload_anckar()
fh <- download_fh()

We will see which regime types have been free or not since 1970.

We join the fh dataset to the anckar dataset with inner_join(). Luckily, both the datasets have the cown and year variables with which we can merge.

Then we sumamrise the mean Freedom House level for each regime type.

anckar %>% 
  inner_join(fh, by = c("cown", "year")) %>% 
  filter(! %>%
  group_by(regimebroadcat, year) %>% 
  summarise(mean_fh = mean(fh_total_reversed, na.rm = TRUE)) -> anckar_sum

We want to place a label for each regime line in the graph, so create a small dataframe with regime score information only about the first year.

anckar_start <- anckar_sum %>%
  group_by(regimebroadcat) %>% 
  filter(year == 1972) %>% 

And we pick some more jewel toned colours for the graph and put them in a vector.

my_palette <- c("#ca6702", "#bb3e03", "#ae2012", "#9b2226", "#001219", "#005f73", "#0a9396", "#94d2bd", "#ee9b00")

And we graph it out

anckar_sum %>%
  ggplot(aes(x = year, y = mean_fh, groups = as.factor(regimebroadcat))) + 
  geom_point(aes(color = regimebroadcat), alpha = 0.7, size = 2) + 
  geom_line(aes(color = regimebroadcat), alpha = 0.7, size = 2) +
  ggrepel::geom_label_repel(data = anckar_start, hjust = 1.5,
            aes(x = year,
                y = mean_fh,
                color = regimebroadcat,
                label = regimebroadcat),
            alpha = 0.7,
            show.legend = FALSE, 
            size = 9) + 
  scale_color_manual(values = my_palette) +
  expand_limits(x = 1965) +  
  ggthemes::theme_pander() + 
  theme(legend.position = "none",
        axis.text = element_text(size = 30, colour ="grey40")) 

We can also use map data that comes with the tidyverse() package.

To merge the countries easily, I add a cown variable to this data.frame

world_map <- map_data("world")

world_map %<>% 
  mutate(cown = countrycode::countrycode(region, "", "cown"))

I want to only look at regimes types in the final year in the dataset – which is 2018 – so we filter only one year before we merge with the map data.frame.

The geom_polygon() part is where we indiciate the variable we want to plot. In our case it is the regime category

anckar %>% 
 filter(year == max(year)) %>%
  inner_join(world_map, by = c("cown")) %>%
  mutate(regimebroadcat = ifelse(region == "Libya", 'Military rule', regimebroadcat)) %>% 
  ggplot(aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = regimebroadcat), color = "white", size = 1) 
Bill Murray Laughing GIF - Find & Share on GIPHY

We can next look at the PIPE dataset and see which countries have been uninterrupted republics over time.

pipe <- democracyData::redownload_pipe()

We graph out the max_republic_age variable with geom_bar()

pipe %>% 
  mutate(iso_lower = tolower(countrycode::countrycode(PIPE_cowcodes, "cown", "iso2c"))) %>% 
  mutate(country_name = countrycode::countrycode(PIPE_cowcodes, "cown", "")) %>% 
  filter(year == max(year)) %>% 
  filter(max_republic_age > 100) %>% 
  ggplot(aes(x = reorder(country_name, max_republic_age), y = max_republic_age)) + 
  geom_bar(stat = "identity", width = 0.7, aes(fill = as.factor(europe))) +
  ggflags::geom_flag(aes(y = max_republic_age, x = country_name, 
                         country = iso_lower), size = 15) + 
  coord_flip() +  ggthemes::theme_pander() -> pipe_plot

And fix up some aesthetics:

pipe_plot + 
  theme(axis.text = element_text(size = 30),
        legend.text = element_text(size = 30),
        legend.title = element_blank(),
        axis.title = element_blank(),
        legend.position = "bottom") + 
  labs(y= "", x = "") + 
scale_fill_manual(values =  c("#d62828", "#457b9d"),
 labels = c("Former British Settler Colony", "European Country")) 

I added the header and footer in

Bill Murray Ok GIF - Find & Share on GIPHY

Download democracy data with democracyData package in R

Packages we will need:

library(magrittr)       # for pipes
library(ggstream)       # proportion plots
library(ggthemes)       # nice ggplot themes
library(forcats)        # reorder factor variables
library(ggflags)        # add flags
library(peacesciencer)  # more great polisci data
library(countrycode)    # add ISO codes to countries

This blog will highlight some quick datasets that we can download with this nifty package.

To install the democracyData package, it is best to do this via the github of Xavier Marquez:

remotes::install_github("xmarquez/democracyData", force = TRUE)

We can download the dataset from the Democracy and Dictatorship Revisited paper by Cheibub Gandhi and Vreeland (2010) with the redownload_pacl() function. It’s all very simple!

pacl <- redownload_pacl()
Happy Maya Rudolph GIF by PeacockTV - Find & Share on GIPHY

This gives us over 80 variables, with information on things such as regime type, geographical data, the name and age of the leaders, and various democracy variables.

We are going to focus on the different regimes across the years.

The six-fold regime classification Cheibub et al (2010) present is rooted in the dichotomous classification of regimes as democracy and dictatorship introduced in Przeworski et al. (2000). They classify according to various metrics, primarily by examining the way in which governments are removed from power and what constitutes the “inner sanctum” of power for a given regime. Dictatorships can be distinguished according to the characteristics of these inner sanctums. Monarchs rely on family and kin networks along with consultative councils; military rulers confine key potential rivals from the armed forces within juntas; and, civilian dictators usually create a smaller body within a regime party—a political bureau—to coopt potential rivals. Democracies highlight their category, depending on how the power of a given leadership ends

We can change the regime variable from numbers to a factor variables, describing the type of regime that the codebook indicates:

pacl %<>% 
  mutate(regime_name = ifelse(regime == 0, "Parliamentary democracies",
       ifelse(regime == 1, "Mixed democracies",
       ifelse(regime == 2, "Presidential democracies",
       ifelse(regime == 3, "Civilian autocracies",
       ifelse(regime == 4, "Military dictatorships",
       ifelse(regime ==  5,"Royal dictatorships", regime))))))) %>%
  mutate(regime = as.factor(regime)) 

Before we make the graph, we can give traffic light hex colours to the types of democracy. This goes from green (full democracy) to more oranges / reds (autocracies):

regime_palette <- c("Military dictatorships" = "#f94144", 
                    "Civilian autocracies" = "#f3722c", 
                    "Royal dictatorships" =  "#f8961e", 
                    "Mixed democracies" = "#f9c74f", 
                    "Presidential democracies" = "#90be6d", 
                    "Parliamentary democracies" = "#43aa8b")

We will use count() to count the number of countries in each regime type and create a variable n

pacl %>% 
  mutate(regime_name = as.factor(regime_name)) %>% 
  mutate(regime_name = fct_relevel(regime_name, 
 levels = c("Parliamentary democracies", 
           "Presidential democracies",
           "Mixed democracies",
           "Royal dictatorships",
           "Civilian autocracies",
           "Military dictatorships"))) %>% 
  group_by(year, un_continent_name) %>% 
  filter(! %>% 
  count(regime_name) %>% 
  ungroup() %>%  
  filter(un_continent_name != "") %>%
  filter(un_continent_name != "Oceania") -> pacl_count
Cant Handle It Kristen Wiig GIF by Saturday Night Live - Find & Share on GIPHY

We have all the variables we need.

We can now graph the count variables across different regions.

pacl_count %>% 
  ggplot(aes(x = year, y = n, 
             groups = regime_name, 
             fill = regime_name)) +
  ggstream::geom_stream(type = "proportion") + 
  facet_wrap(~un_continent_name) + 
  scale_fill_manual(values = regime_palette) + 
  ggthemes::theme_fivethirtyeight() + 
  theme(legend.title = element_blank(),
        text = element_text(size = 30)) 

I added the title and source header / footer section on to finish the graph.

Of course, the Cheibub et al (2010) dataset is not the only one that covers types of regimes.

Curtis Bell in 2016 developed the Rulers, Elections, and Irregular Governance Dataset (REIGN) dataset.

This describes political conditions in every country (including tenures and personal characteristics of world leaders, the types of political institutions and political regimes in effect, election outcomes and election announcements, and irregular events like coups)

Again, to download this dataset with the democracyData package, it is very simple:

reign <- download_reign()
Saturday Night Live Happy Dance GIF - Find & Share on GIPHY

I want to compare North and South Korea since their independence from Japan and see the changes in regimes and democracy scores over the years.

Next, we can easily download Freedom House or Polity 5 scores.

The Freedom House Scores default dataset ranges from 1972 to 2020, covering around 195 countries (depending on the year)

fh <- download_fh()

Alternatively, we can look at Polity Scores. This default dataset countains around 190 ish countries (again depending on the year and the number of countries in existance at that time) and covers a far longer range of years; from 1880 to 2018.

polityiv <- redownload_polityIV()

Alternatively, to download democracy scores, we can also use the peacesciencer dataset. Click here to read more about this package:

democracy_scores <- peacesciencer::create_stateyears() %>% 
  add_gwcode_to_cow() %>%

With inner_join() we can merge these two datasets together:

reign %>% 
  select(ccode = cown, everything()) %>% 
  inner_join(democracy_scores, by = c("year", "ccode")) -> reign_demo

We next choose the years and countries for our plot.

Also, for the geom_flag() we will need the country name to be lower case ISO code. Click here to read more about the ggflags package.

reign_demo %>% 
    filter(year > 1945) %>% 
    mutate(gwf_regimetype = str_to_title(gwf_regimetype)) %>% 
    mutate(iso2c_lower = tolower(countrycode::countrycode(reign_country, "", "iso2c"))) %>% 
filter(reign_country == "Korea North" | reign_country == "Korea South") -> korea_reign

We may to use specific hex colours for our graphs. I always prefer these deeper colours, rather than the pastel defaults that ggplot uses. I take them from website!

korea_palette <- c("Military" = "#5f0f40",
                   "Party-Personal" = "#9a031e",
                   "Personal" = "#fb8b24",
                   "Presidential" = "#2a9d8f",
                   "Parliamentary" = "#1e6091")

We will add a flag to the start of the graph, so we create a mini dataset that only has the democracy scores for the first year in the dataset.

  korea_start <- korea_reign %>%
    group_by(reign_country) %>% 
    slice(which.min(year)) %>% 

Next we plot the graph

korea_reign %>% 
 ggplot(aes(x = year, y = v2x_polyarchy, groups = reign_country))  +
    geom_line(aes(color = gwf_regimetype), 
         size = 7, alpha = 0.7, show.legend = FALSE) +
    geom_point(aes(color = gwf_regimetype), size = 7, alpha = 0.7) +
    ggflags::geom_flag(data = korea_start, 
       aes(y = v2x_polyarchy, x = 1945, country = iso2c_lower), 
           size = 20) -> korea_plot

And then work on the aesthetics of the plot:

korea_plot + ggthemes::theme_fivethirtyeight() + 
    ggtitle("Electoral democracy on Korean Peninsula") +
    labs(subtitle = "Sources: Teorell et al. (2019) and Curtis (2016)") +
    xlab("Year") + 
    ylab("Democracy Scores") + 
    theme(plot.title = element_text(face = "bold"),
      axis.ticks = element_blank(), = element_blank(),
      legend.title = element_blank(),
      legend.text = element_text(size = 40),
      text = element_text(size = 30)) +
    scale_color_manual(values = korea_palette) + 
    scale_x_continuous(breaks = round(seq(min(korea_reign$year), max(korea_reign$year), by = 5),1))

While North Korea has been consistently ruled by the Kim dynasty, South Korea has gone through various types of government and varying levels of democracy!


Cheibub, J. A., Gandhi, J., & Vreeland, J. R. (2010). . Public choice143(1), 67-101.

Przeworski, A., Alvarez, R. M., Alvarez, M. E., Cheibub, J. A., Limongi, F., & Neto, F. P. L. (2000). Democracy and development: Political institutions and well-being in the world, 1950-1990 (No. 3). Cambridge University Press.

Alternatives to pie charts: coxcomb and waffle charts

Packages we will need


If we want to convey nuance in the data, sometimes that information is lost if we display many groups in a pie chart.

According to Bernard Marr, our brains are used to equal slices when we think of fractions of a whole. When the slices aren’t equal, as often is the case with real-world data, it’s difficult to envision the parts of a whole pie chart accurately.

Below are some slight alternatives that we can turn to and visualise different values across groups.

I’m going to compare regions around the world on their total energy consumption levels since the 1900s.

First, we can download the region data with information about the geography and income levels for each group, using the ne_countries() function from the rnaturalearth package.

map <- ne_countries(scale = "medium", returnclass = "sf")

Click here to learn more about downloading map data from the rnaturalearth package.

Next we will select the variables that we are interested in, namely the income group variable and geographic region variable:

map %>% 
  select(name_long, subregion, income_gr) %>% as_data_frame() -> region_var

And add a variable of un_code that it will be easier to merge datasets in a bit. Click here to learn more about countrycode() function.

region_var$un_code <- countrycode(region_var$name_long, "", "un") 

Next, we will download national military capabilities (NMC) dataset. These variables – which attempt to operationalize a country’s power – are military expenditure, military personnel, energy consumption, iron and steel production, urban population, and total population. It serves as the basis for the most widely used indicator of national capability, CINC (Composite Indicator of National Capability) and covers the period 1816-2016.

To download them in one line of code, we use the create_stateyears() function from the peacesciencer package.

What, Like It'S Hard? Reese Witherspoon GIF - Find & Share on GIPHY
states <- create_stateyears(mry = FALSE) %>% add_nmc() 

Click here to read more about downloading Correlates of War and other IR variables from the peacesciencer package

Next we add a UN location code so we can easily merge both datasets we downloaded!

states$un_code <- countrycode(states$statenme, "", "un")
states_df <- merge(states, region_var, by ="un_code", all.x = TRUE)

Next, let’s make the coxcomb graph.

First, we will create one high income group. The map dataset has a separate column for OECD and non-OECD countries. But it will be easier to group them together into one category. We do with with the ifelse() function within mutate().

Next we filter out any country that is NA in the dataset, just to keep it cleaner.

We then group the dataset according to income group and sum all the primary energy consumption in each region since 1900.

When we get to the ggplotting, we want order the income groups from biggest to smallest. To do this, we use the reorder() function with income_grp as the second argument.

To create the coxcomb chart, we need the geom_bar() and coord_polar() lines.

With the coord_polar() function, it takes the following arguments :

  • theta – the variable we map the angle to (either x or y)
  • start – indicates the starting point from 12 o’clock in radians
  • direction – whether we plot the data clockwise (1) or anticlockwise (-1)

We feed in a theta of “x” (this is important!), then a starting point of 0 and direction of -1.

Next we add nicer colours with hex values and label the legend in the scale_fill_manual() function.

I like using the fonts and size stylings in the bbc_style() theme.

Last we can delete some of the ticks and text from the plot to make it cleaner.

Last we add our title and source!

states_df %>% 
  mutate(income_grp = ifelse(income_grp == "1. High income: OECD", "1. High income", ifelse(income_grp == "2. High income: nonOECD", "1. High income", income_grp))) %>% 
  filter(! %>% 
  filter(year > 1899) %>% 
  group_by(income_grp) %>% 
  summarise(sum_pec = sum(pec, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(sum_pec, income_grp), y = sum_pec, fill = as.factor(income_grp))) + 
  geom_bar(stat = "identity") + 
  coord_polar("x", start = 0, direction = -1)  + 
  ggthemes::theme_pander() + 
    values = c("#f94144", "#f9c74f","#43aa8b","#277da1"), 
    labels = c("High Income", "Upper Middle Income", "Lower Middle Income", "Low Income"), name = "Income Level") +
  bbplot::bbc_style() + 
  theme(axis.text = element_blank(),
            axis.title.x = element_blank(),
            axis.title.y = element_blank(),
            axis.ticks = element_blank(),
            panel.grid = element_blank()) + 
  ggtitle(label = "Primary Energy Consumption across income levels since 1900", subtitle = "Source: Correlates of War CINC")

Happy Legally Blonde GIF - Find & Share on GIPHY

We can compare to the number of countries in each region :

states_df %>% 
  mutate(income_grp = ifelse(income_grp == "1. High income: OECD", "1. High income",
 ifelse(income_grp == "2. High income: nonOECD", "1. High income", income_grp))) %>% 
  filter(! %>% 
  filter(year == 2016) %>% 
  count(income_grp) %>% 
  ggplot(aes(reorder(n, income_grp), n, fill = as.factor(income_grp))) + 
  geom_bar(stat = "identity") + 
  coord_polar("x", start = 0, direction = - 1)  + 
  ggthemes::theme_pander() + 
    values = c("#f94144", "#f9c74f","#43aa8b","#277da1"), 
    labels = c("High Income", "Upper Middle Income", "Lower Middle Income", "Low Income"), 
    name = "Income Level") +
  bbplot::bbc_style() + 
  theme(axis.text = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) + 
  ggtitle(label = "Number of countries per region")

Another variation is the waffle plot!

It is important we do not install the CRAN version, but rather the version in development. I made the mistake of installing the non-github version and nothing worked.

Legally Blonde Liar GIF - Find & Share on GIPHY

It was an ocean of error messages.

So, instead, install the following version:


When we add the waffle::geom_plot() there are some arguments we can customise.

  • n_rows – rhe default is 10 but this is something you can play around with to see how long or wide you want the chart
  • size – again we can play around with this number to see what looks best
  • color – I will set to white for the lines in the graph, the default is black but I think that can look a bit too busy.
  • flip – set to TRUE or FALSE for whether you want the coordinates horizontal or vertically stacked
  • make_proportional – if we set to TRUE, compute proportions from the raw values? (i.e. each value n will be replaced with n/sum(n)); default is FALSE

We can also add theme_enhance_waffle() to make the graph cleaner and less cluttered.

states_df %>% 
  filter(year == 2016) %>% 
  filter(! %>% 
  mutate(income_grp = ifelse(income_grp == "1. High income: OECD",
 "1. High income", ifelse(income_grp == "2. High income: nonOECD", "1. High income", income_grp))) %>% 
  count(income_grp) %>% 
  ggplot(aes(fill = income_grp, values = n)) +
values = c("#f94144", "#f9c74f","#43aa8b","#277da1"), 
labels = c("High Income", "Upper Middle Income", 
"Lower Middle Income", "Low Income"), 
name = "Income Level") +
  waffle::geom_waffle(n_rows = 10, size = 0.5, colour = "white",
              flip = TRUE, make_proportional = TRUE) + bbplot::bbc_style() +  
  theme_enhance_waffle() + 
  ggtitle(label = "Number of countries per region")

We can also look at the sum of military expenditure across each region

states_df %>% 
  filter(! %>%
  filter(year > 1899) %>% 
  mutate(income_grp = ifelse(income_grp == "1. High income: OECD",
 "1. High income", ifelse(income_grp == "2. High income: nonOECD", 
"1. High income", income_grp))) %>% 
group_by(income_grp) %>%
  summarise(sum_military = sum(milex, na.rm = TRUE)) %>% 
  ggplot(aes(fill = income_grp, values = sum_military)) +
    values = c("#f94144", "#f9c74f","#43aa8b","#277da1"), 
    labels = c("High Income", "Upper Middle Income", 
               "Lower Middle Income", "Low Income"), 
    name = "Income Level") +
  geom_waffle(n_rows = 10, size = 0.3, colour = "white",
              flip = TRUE, make_proportional = TRUE) + bbplot::bbc_style() +  
  theme_enhance_waffle() + 
  ggtitle(label = "Sum of military expenditure per region")
Sexy Girls Rule GIF - Find & Share on GIPHY

Building a dataset for political science analysis in R, PART 2

Packages we will need


The main workhorse of this blog is the peacesciencer package by Stephen Miller!

The package will create both dyad datasets and state datasets with all sovereign countries.

Thank you Mr Miller!

There are heaps of options and variables to add.

Go to the page to read about them all in detail.

Here is a short list from the package description of all the key variables that can be quickly added:

We create the dyad dataset with the create_dyadyears() function. A dyad-year dataset focuses on information about the relationship between two countries (such as whether the two countries are at war, how much they trade together, whether they are geographically contiguous et cetera).

In the literature, the study of interstate conflict has adopted a heavy focus on dyads as a unit of analysis.

Alternatively, if we want just state-year data like in the previous blog post, we use the function create_stateyears()

We can add the variables with type D to the create_dyadyears() function and we can add the variables with type S to the create_stateyears() !

Focusing on the create_dyadyears() function, the arguments we can include are directed and mry.

The directed argument indicates whether we want directed or non-directed dyad relationship.

In a directed analysis, data include two observations (i.e. two rows) per dyad per year (such as one for USA – Russia and another row for Russia – USA), but in a nondirected analysis, we include only one observation (one row) per dyad per year.

The mry argument indicates whether they want to extend the data to the most recently concluded calendar year – i.e. 2020 – or not (i.e. until the data was last available).

dyad_df <- create_dyadyears(directed = FALSE, mry = TRUE) %>%
  add_atop_alliance() %>%  
  add_nmc() %>%
  add_cow_trade() %>% 

I added dyadic variables for the

You can follow these links to check out the codebooks if you want more information about descriptions about each variable and how the data were collected!

The code comes with the COW code but I like adding the actual names also!

dyad_df$country_1 <- countrycode(dyad_df$ccode1, "cown", "")

With this dataframe, we can plot the CINC data of the top three superpowers, just looking at any variable that has a 1 at the end and only looking at the corresponding country_1!

According to our pals over at le Wikipedia, the Composite Index of National Capability (CINC) is a statistical measure of national power created by J. David Singer for the Correlates of War project in 1963. It uses an average of percentages of world totals in six different components (such as coal consumption, military expenditure and population). The components represent demographic, economic, and military strength

First, let’s choose some nice hex colors

pal <- c("China" = "#DE2910",
         "United States" = "#3C3B6E", 
         "Russia" = "#FFD900")

And then create the plot

dyad_df %>% 
 filter(country_1 == "Russia" | 
          country_1 == "United States" | 
          country_1 == "China") %>% 
  ggplot(aes(x = year, y = cinc1, group = as.factor(country_1))) +
  geom_line(aes(color = country_1)) +
  geom_line(aes(color = country_1), size = 2, alpha = 0.8) + 
  scale_color_manual(values =  pal) +

In PART 3, we will merge together our data with our variables from PART 1, look at some descriptive statistics and run some panel data regression analysis with our different variables!